Speech recognition will continue to hit this wall because it disregards internal representation. What you hear may externally be a waveform with certain characteristics, but the phonetic structure also depends on phonemic, morphophonemic, syntactic, and semantic interactions. To actually understand a word, your exposure to phonetic information must trigger those interactions.
The best example of speech recognition learning comes from babies. Babies are born with the ability to distinguish a practically unlimited range of phones. Continued exposure to their native language then narrows this to the contrasts that matter for their use, i.e. their language. In English, things like aspirated p's get ignored for the purposes of meaning, so that I hear "stoph" as not distinct from "stop". Built on top of this we discover morphemes and morphophonemic rules, so that I can tell that "stop" becomes "stopt" in the past tense. Similarly, upon this we build syntactic and semantic relationships. This is context-based understanding. I need a context of "past" to start applying past-tense morphemes, but I also need the correct phonemic context to perform the correct allophonic substitutions. Likewise, if someone with a thick Scottish or Novocastrian accent comes up to me on the street, I need to combine my semantic context with my own abstract internal representations of my language to try and understand them.
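The "stop" → "stopt" rule above can be sketched as a toy rule applier. This is a simplification on several counts: real morphophonemic rules operate on sounds rather than spelling, and the voiceless set below is a rough stand-in, not a complete phonology.

```python
# Toy sketch of English past-tense allomorphy: the "-ed" suffix
# surfaces roughly as [t] after voiceless finals, [d] after voiced
# ones, and [id] after t/d. Final letters stand in for phones here,
# which is an assumption for illustration only.

VOICELESS = set("pkfsx")  # rough stand-ins for voiceless final sounds

def past_tense_phones(stem: str) -> str:
    """Return an approximate phonetic rendering of stem + -ed."""
    last = stem[-1]
    if last in "td":
        return stem + "id"   # "want" -> "wantid"
    if last in VOICELESS:
        return stem + "t"    # "stop" -> "stopt"
    return stem + "d"        # "play" -> "playd"

print(past_tense_phones("stop"))
```

The point of the sketch is that the rule needs phonemic context (the stem's final sound) before it can pick the right allophone, which is exactly the kind of context a purely acoustic system never sees.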
This provides a form of natural error correction that allows me to understand something I have never heard before, even when it contains deviations or ambiguity (either inherent in the language or introduced by the speaker). My internal representation of English should prevent me from ascribing the wrong phone clusters or the wrong morphemes (runn-ingk) to the processed sound.
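One crude way to picture this error correction is matching a garbled percept against a stored lexicon. The mini-lexicon below is hypothetical, and string similarity is only a stand-in for the phonological and semantic knowledge a human actually uses; it is a sketch of the idea, not a model of it.

```python
import difflib

# Hypothetical mini-lexicon standing in for an internal
# representation of known English word forms.
LEXICON = ["running", "runner", "stopped", "stopping", "sunning"]

def correct(heard, lexicon=LEXICON):
    """Map a possibly garbled percept onto the closest known form,
    or None if nothing in the lexicon is close enough."""
    matches = difflib.get_close_matches(heard, lexicon, n=1, cutoff=0.7)
    return matches[0] if matches else None

print(correct("runningk"))
```

A percept like "runningk" snaps to "running" because the internal representation leaves no slot for a word ending in -ingk, which is the kind of repair the paragraph above describes.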
It's stimulus plus rule matching plus context plus error correction that should ultimately help me decide whether something can be understood.
All this ignores the complexity of graphemic translation, which builds in another set of rules.
The article said (somewhat in jest) that throwing out linguists helped improve the accuracy of the system. Sure, methods not representative of human language capability might yield better results in the short term, and there is no definitive model of how language is represented in the mind. You can probably build a great system that ignores much linguistic information and functions in a limited context (i.e. one language, rigid domains such as yes/no answers or numbers). However, if the goal is ultimately to produce a speech system that functions like a human -- that is, performs error correction when appropriate, uses various types of linguistic information, and in certain circumstances asks for clarification -- then linguistic models are important.