Can you do all of the following for your dinner conversation?
1) Provide everyone with a decent close-talking directional microphone.
2) Require each person to take turns speaking, so there is very little overlap.
3) Have no pre-adolescents speaking.
4) Eliminate noticeable background noises.
5) Have no one with a strong non-native accent speaking.
6) Require everyone to speak in full, grammatical sentences.
The more of these you have to answer no to, the worse the output will be. They are listed roughly in order of importance, with 1 the most important. If you can say yes to all of them, you can probably get somewhere around 90% accuracy, which might be usable depending on your ultimate purpose. If you also trained acoustic and language models for each speaker and told the software who was speaking (i.e. switched the user on the fly during the conversation), you could probably get 95% accuracy, which would be quite usable.
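To make those percentages concrete, here is a minimal sketch of one common way to score a transcript, assuming "accuracy" means word accuracy, i.e. 1 minus the word error rate (WER). That definition is my assumption, and the function names and the sample sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from output
            insertion = curr[j - 1] + 1   # extra word in output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Assumed definition: accuracy = 1 - WER."""
    return 1.0 - word_error_rate(reference, hypothesis)


if __name__ == "__main__":
    said = "could you please pass the salt and pepper to me"
    recognized = "could you please pass the salt and paper to me"
    print(f"accuracy: {word_accuracy(said, recognized):.0%}")
```

Under that definition, 90% accuracy means roughly one wrong word in every ten, so most sentences of ordinary length would contain at least one error.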
So, in other words