It tends to be overcome with a simple r:
'niner' and 'ten' now involve completely different lip movements. (It also breaks up the tonal similarity between 9 and 5 for people who are listening to me.)
This tends to be even easier to disambiguate in context.
I use 'niner' on a regular basis in my line of work, in which I give and receive a lot of numbers over the phone, as well as names and locations.
I'd think that this technology would use a similar method to disambiguate between 5, 9, and 10.
(Of asides and theory: I run under the assumption that most software that listens to voices for recognition is listening not for the lead or tail sounds of a number or letter, but for the shape your mouth makes and the resulting sound that comes with it. If you try saying the numbers 0-10 without moving your jaw or tongue, you'll notice that some of them sound the same (eeo, uhn, oo, wee, or, i, ih, ehun, eeh, i, ehn, for those who skipped the practice level): 5 and 9 both come out as 'i'. Because of this, 'niner' becomes important, simply to change that sound to 'ihr'. I think this theory holds a little weight, since I can log in to the voice system of my financial institution using that same method, over the phone, where nobody is reading my lips.)
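
To make the collision concrete, here is a toy sketch in Python. It only groups the sounds listed above by exact spelling; it is an illustration of my theory, not how any real recognizer works, and the names `SKELETONS` and `collisions` are made up for this example.

```python
from collections import defaultdict

# Sounds for 0-10 exactly as listed above (index = the number spoken).
SKELETONS = ["eeo", "uhn", "oo", "wee", "or", "i", "ih", "ehun", "eeh", "i", "ehn"]


def collisions(sounds):
    """Return {sound: [numbers]} for every sound shared by more than one number."""
    groups = defaultdict(list)
    for number, sound in sounds.items():
        groups[sound].append(number)
    return {sound: nums for sound, nums in groups.items() if len(nums) > 1}


plain = dict(enumerate(SKELETONS))
# Assumption from the post: saying 'niner' turns the sound for 9 into 'ihr'.
with_niner = {**plain, 9: "ihr"}

print(collisions(plain))       # {'i': [5, 9]}
print(collisions(with_niner))  # {}
```

With plain 'nine', 5 and 9 land in the same bucket; swapping in 'niner' empties the collision list, which is the whole point of the extra r.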