Sorry, but this is bull. Your statement that "voice recognition is at its limits phonetically" is just wrong. I work in the voice recognition industry, and in the past five years I've seen the recognition error rate go down markedly and measurably, and that trend is continuing.
There are actually two kinds of models involved in voice recognition:
1) the acoustic model (which looks at a sequence of time slices of the acoustic signal and works out which sequence of phonemes most likely gave rise to it). You say that voice recognition is at its limits phonetically, but these models keep getting better as training sets grow, and the improved models measurably lower the word error rate.
2) the language model (which specifies which words exist, and in what order they are most likely to occur). These language models can be very simple, as in the case of a yes/no question in a phone-based app (your model might accept "yes" and "yes ma'am", but not any arbitrary English utterance), or very large, as in the case of a general-purpose dictation application. A toy sketch of how the two models combine follows below.
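To make the division of labor concrete, here's a minimal sketch. Everything in it is made up for illustration (real engines score phonemes or HMM states per frame, not whole words, and this isn't any product's API): the recognizer searches for the word sequence that jointly maximizes the acoustic score and the language-model score.

```python
import math

# Hypothetical acoustic-model output: for each time slice, log-likelihoods
# that the slice came from each candidate word. Real acoustic models work
# at the phoneme/state level; whole words keep the toy readable.
acoustic_scores = [
    {"yes": -0.2, "no": -2.5, "ma'am": -3.0},
    {"yes": -2.8, "no": -2.6, "ma'am": -0.3},
]

# Hypothetical bigram language model: log P(word | previous word).
# "<s>" marks the start of the utterance; unseen pairs get a small floor.
bigram = {
    ("<s>", "yes"): math.log(0.6),
    ("<s>", "no"): math.log(0.4),
    ("yes", "ma'am"): math.log(0.3),
    ("yes", "yes"): math.log(0.01),
}
FLOOR = math.log(1e-4)

def decode(frames):
    """Viterbi search: the word sequence that maximizes both models' scores."""
    # paths: last word -> (total log score, word sequence so far)
    paths = {"<s>": (0.0, [])}
    for frame in frames:
        new_paths = {}
        for word, acoustic in frame.items():
            # Best way to reach `word` from any previous hypothesis:
            # previous score + language-model score + acoustic score.
            score, seq = max(
                (prev_score + bigram.get((prev, word), FLOOR) + acoustic, prev_seq)
                for prev, (prev_score, prev_seq) in paths.items()
            )
            new_paths[word] = (score, seq + [word])
        paths = new_paths
    return max(paths.values())

score, words = decode(acoustic_scores)
print(words, score)  # ['yes', "ma'am"] with its combined log score
```

The point of the sketch: a better acoustic model (sharper per-frame scores) and a better language model (more realistic word-sequence probabilities) each lower the word error rate on their own, which is why the two are trained and improved separately.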
In conjunction with the recognizer, what these two models give you is a raw string of recognized words. What you then do with that string is a separate question: the parsing and processing techniques are getting more sophisticated, and are being integrated with other systems in interesting ways. But that is largely independent of the accuracy of the string itself, which is the output of the recognizer (I say "largely" because your application might activate a different language model based on the current context, and that does affect recognition accuracy).
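To illustrate that parenthetical, here's a hedged sketch with made-up names (again, not any real engine's API) of how activating a tight grammar at a yes/no prompt changes what can be recognized at all:

```python
# Per-context language models: a tight grammar for a confirmation prompt,
# and a stand-in for a huge general-purpose dictation model elsewhere.
GRAMMARS = {
    "confirmation": {"yes", "yes ma'am", "no", "no sir"},  # made-up entries
    "dictation": None,  # None = unconstrained model in this sketch
}

def recognize(audio_hypotheses, dialog_state):
    """Pick the best hypothesis that the active language model allows.

    audio_hypotheses: hypothetical (text, acoustic_score) pairs from the
    acoustic model; higher score means a better acoustic match.
    """
    grammar = GRAMMARS[dialog_state]
    candidates = [
        (score, text)
        for text, score in audio_hypotheses
        if grammar is None or text in grammar
    ]
    if not candidates:
        return None  # nothing legal matched; the app can re-prompt
    return max(candidates)[1]

# At the confirmation prompt, "yes ma'am" wins over an acoustically
# better-scoring but illegal utterance, because the grammar rules it out.
hyps = [("yeah man", -1.0), ("yes ma'am", -1.2)]
print(recognize(hyps, "confirmation"))  # -> "yes ma'am"
```

Ruling out hypotheses the active grammar can't produce is exactly why a tight per-context model recognizes more accurately at that prompt than a general-purpose one would.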