Vista Speech Recognition Goes Awry

An anonymous reader writes "It seems even MSNBC is willing to take a jab on those rare occasions when Microsoft products don't work. During a demo of Vista's speech recognition technology, Vista couldn't differentiate between mom and aunt, and all attempts to rectify the problem just made it worse. Wait until you see what it spat out, I think we have a new 'All your base.' Don't you just love Microsoft's live demonstrations?"
This discussion has been archived. No new comments can be posted.

  • Roald Dahl (Score:5, Funny)

    by Ithika ( 703697 ) on Saturday July 29, 2006 @09:28AM (#15805400) Homepage

    Reminds me of the Roald Dahl short story about the ant-eater who ate someone's aunt because their accent rendered the two words the same.

    I can't remember what the story was called.

  • by JoeLinux ( 20366 ) <joelinux AT gmail DOT com> on Saturday July 29, 2006 @09:31AM (#15805411) Homepage
    It's just a one-time thing.

    I mean, it's not like they have a reputation for releasing half-assed code that's been hyped up through marketing to the point that it will never perform as advertised.

    And it's not like this is a company that is having image problems due to its monopolistic nature.

    Or headed by an infamous rageaholic with a history of intolerance towards free standards.

    Nope, I'm sure that this is just an accident by a company that spends its off hours petting little baby chickens and bunnies.
    • by kripkenstein ( 913150 ) on Saturday July 29, 2006 @09:37AM (#15805439) Homepage
      Nothing to worry about, I'm sure they'll get all the kinks out by the time Vista is released - sometime in 2008 or so, it seems, based on this video.

      This was really a dreadful presentation. There was no ambient noise (as the commentators say later, and despite what Microsoft says), and there was no echo as the demonstrator claims during the actual test. It seems to have been done under really good test conditions, but still it failed miserably.
      • by tomstdenis ( 446163 ) <tomstdenis@gm a i> on Saturday July 29, 2006 @09:41AM (#15805458) Homepage
        Most likely the system was trained by an engineer and handed off to the ass in marketing. He was probably supposed to train it to his voice too but decided to hit the bar instead.

        Voice recognition requires some training regardless of who provides it. This isn't Star Trek. Prep work and rehearsal, people. If Mr. Sales Guy had tried the demo before the presentation, he would have noticed it wasn't working and avoided the embarrassment.

        This is why sales people are asshats. They're unprofessional non-technical people who sap back the high life while the rest of us have to put up with the mess they create through their daily barrage of verbal diarrhea.

        • He writes []: "The software I'm using is Dragon NaturallySpeaking 9.0, the latest version of the best-selling speech-recognition software for Windows. This software, which made its debut Tuesday, is remarkable for two reasons.

          Reason 1: You don't have to train this software. That's when you have to read aloud a canned piece of prose that it displays on the screen -- a standard ritual that has begun the speech-recognition adventure for thousands of people.

          I can remember, in the early days, having to read 45 minu

        • by Illserve ( 56215 ) on Saturday July 29, 2006 @10:46AM (#15805713)
          Let's give this guy some credit. He clearly has some degree of competence if he's selected to showboat the app at a major presentation, at least enough to know that you need to train, or at least test, a voice recognition demo.

          A far more likely scenario, in my mind, is that he trained and tested it 100 times and got it working nearly flawlessly, but in a different room and with a different setup. In fact he may have overtrained it. Programs like this can behave very badly when they end up overfitting the data.

          On the day in question he may have had a different mic and the acoustics were certainly different and the program went whacko.
          • by tomstdenis ( 446163 ) <tomstdenis@gm a i> on Saturday July 29, 2006 @10:55AM (#15805770) Homepage
            You clearly don't work for a large corporation. Sales people (who are not all bad) are typically the sort that don't really understand technology and are all juiced up to make sales. Around technology they often leap before actually finding out the facts, which lands them in a world of trouble.

            I seriously doubt this presentation was rehearsed. At the very least, they should have tested it in that room with that mic, etc. But in all honesty, this is going to be used by millions of people in all sorts of rooms with all sorts of mics. That shouldn't matter anyways.

            Anyways, I doubt he prepared at all, that is, other than snorting cocaine off a mirror in the back room before the show.

        • That's so last century. NPR did a bit on the new Dragon Dictate 9. The NPR reporter got 100% accuracy out of the box, no training.

          Dictation Software Improves Usability, Accuracy []
        • Since when did Mum sound anything like Aunt?
  • Hee hee (Score:5, Funny)

    by kefoo ( 254567 ) on Saturday July 29, 2006 @09:32AM (#15805413)
    Reminds me of the time when I worked at a computer store and we played with the voice recognition card in a PowerMac floor model. Somebody programmed it so that if someone said "Computer, bite me", it would respond with "Can't bite what's not there". Over time the accuracy of the recognition fell. One day as a salesman was talking to a customer about the computer it misinterpreted something he said and said "Can't bite what's not there". Needless to say that system was wiped and we weren't allowed to play with it anymore.
    • Re:Hee hee (Score:3, Interesting)

      In fact voice recognition would be a great playground for non-profit open source software projects.

      Voice recognition means permanent beta. Voice recognition has only slightly improved during the last ten years. One reason is that the VR market is a trivial patent minefield. The rest is just performance.

      Sure, we will get proper voice recognition some day. I would source it out to open source and integrate it back into my products once it will be ready.
      • Re:Hee hee (Score:3, Insightful)

        by marcello_dl ( 667940 )
        Yes, we'll get good voice recognition one day. It'll be right after 99% of the world population have mastered mouse and keyboard interfaces.
  • Well (Score:4, Funny)

    by antifoidulus ( 807088 ) on Saturday July 29, 2006 @09:33AM (#15805420) Homepage Journal
    it could lead to surprising porn....
  • by dacap ( 177314 ) on Saturday July 29, 2006 @09:36AM (#15805434) Homepage
    Yes, once again Microsoft S/W Engineers learn that the more public the demo or the more important the audience, the more likely something will go wrong. It's one of Murphy's laws. Been there. Did that. Barely survived.

    Experience is the human quality that enables you to recognize a mistake immediately when you make it again.

  • So? (Score:5, Informative)

    by Klaidas ( 981300 ) on Saturday July 29, 2006 @09:39AM (#15805452)
    This isn't the first presentation that went wrong, is it?
    Win98 gone wild: []
    Media Center Edition gone wild []
    We can add this one to the list too ;)
    • The XBox also crashed during its first public demo. Rumour has it that the developers were told that they had to make sure it never blue-screened in public again - and that's why when the XBox crashes you get a green screen of death.
    • Those are painful to watch. As much as I hate MSFT I still feel some empathy for them... Damn, poor Gates playing with the remote...

  • Dear aunt (Score:5, Informative)

    by linvir ( 970218 ) * on Saturday July 29, 2006 @09:40AM (#15805453)
    For the flashless. Here's the format:
    Microsoftie says this
    Speech recogniser hears this

    Dear mom
    Dear aunt,
    Fix aunt
    Let's set
    Delete that
    Delete that
    Delete that
    I think it's picking up a little bit of echo here
    Delete... select all
    double the killer delete select all

    Final text:
    Dear aunt, let's set so double the killer delete select all
  • Not quite as embarrassing as the Windows 98 BSOD, but more entertaining than the Ballmer developer's video. []
  • by Seiruu ( 808321 ) on Saturday July 29, 2006 @09:47AM (#15805477)
    Steve Ballmer accidentally sent an e-mail while diligently testing the software. The e-mail says:

    "Sir put down the chair, then we'll talk"
    "No Steve wait up, don't do that"
  • Having seen some of the research that goes on, voice recognition is still far from good.
    Yes it works in some contexts, especially if it's been trained with the person speaking, and the language is limited, such as in a professional environment.

    But for home computers, it's not only overkill, it's also inadequate and non-functional.
    I say COOL feature, but a hopeless waste of time and money, which in the end will be paid for by you-know-who (not MS).

    on another topic can someone please ask ms to stop the increasing and
  • It's hard (Score:4, Funny)

    by wootest ( 694923 ) on Saturday July 29, 2006 @10:00AM (#15805527)
    It's hard to wreck a nice beach. :)
  • by Danga ( 307709 )
    My guess is that the marketer "showing off" the voice recognition didn't properly train the software before the demonstration. If he did do that, then he obviously did not pick and test something that was at least known to work, which is not a bad idea when you are doing product demos. The software obviously has much work left, since it interpreted the three-syllable phrase "select all" as the 11 syllables of "so double the killer delete select all" (while it did finally get "select all" where the hell did the res
    • Yes, sales and marketing guys are asshats, but they're also professionals who usually spend a lot of time with the software rehearsing and choreographing their demos, because they're the ones who have to deal with the immediate fallout, after all.

      This demo didn't just drop a couple of words, or misinterpret an ambiguous sounding phrase, this was a complete melt down. A more plausible explanation is that the guy's voice was also amplified through the PA system in the room, and the computer's microphone was
  • by jcraveiro ( 848243 ) on Saturday July 29, 2006 @10:06AM (#15805547) Homepage
    Maybe they're twin sisters... ;)
  • "Dear mom comma"

    Dear aunt,

    "Fix aunt"

    Dear aunt, let's set

    "Delete that"

    Dear aunt, let's set

    "Delete that"

    Dear aunt, let's set

    "Delete that"

    Dear aunt, let's set so

    "I think it's picking up a little bit of echo here...delete - select all"

    Dear aunt, let's set so double the killer delete select all

    *Manually selects all and deletes*

    "Okay, I'm glad you're enjoying this"

  • Does Microsoft have to copy EVERYTHING??? I used OS/2 Warp for the second half of the 90s but my experience with _its_ built-in speech recognition was pretty much identical to that demo.
  • by Greyfox ( 87712 ) on Saturday July 29, 2006 @10:15AM (#15805581) Homepage Journal
    C'mon! IBM put on a great speech reco demo at the '95 Atlanta COMDEX. Their product worked flawlessly! Well... Except the guy fired it up and started talking and the little text editor was picking up the words when someone in the back of the audience yells "FORMAT C!" The crowd went wild and the guy doing the demo cracked up too, which caused the speech engine to freak out a bit. He had to delete a bunch of junk out of his text editor once things settled down.

    Speech recognition is still just a gimmick anyway. We still have a LONG way to go before it gets to the point that Joe Average User imagines it should be. Joe average user wants his computer to respond like the one in Star Trek. I still want to set up my Asterisk server with speech recognition, though, so that people can either dial or say the extension they want. It'd also be neat to pick up the phone, say "Call Mom" to the dial tone and have it call my aunt for me.

  • by sh0rtie ( 455432 ) on Saturday July 29, 2006 @10:23AM (#15805611)

    Why not just use two mics: one to record the ambient noise (positioned away from the voice mic) and the other to record the voice (headset)? Then, since you have two signals, just subtract the ambient noise signal from the headset signal. Voila, clean headset mic audio.

    Works for music too: you could control your music player by voice even when it's playing loud (at a party) by removing the music signal from the mic signal.


    • by Anonymous Coward on Saturday July 29, 2006 @11:53AM (#15806069)
      For those interested, merely subtracting the two signals doesn't work. The signal at the microphone is not just the music signal (called far-end signal) plus the mic signal (near-end signal). The music signal has travelled across the room before it reaches the microphone, giving it some reverberations (echo). If you simply subtract the two signals, you will still hear the music signal quite loudly.

      What is done in practice, and works extremely well, is modelling that "echo" as a filter (a FIR transversal filter, which is essentially a tapped delay line). You estimate the coefficients of the filter, apply this "room filter" to the music signal, and subtract the result from the microphone signal. You then have the voice-only signal left.

      This setup is called AEC, or Acoustic Echo Cancellation. It is used in every telephone and mobile phone there is and is crucial to ADSL. If an ADSL modem did not cancel out its own sent signal at its receiver, the attainable speed would be several times less. AEC is also the reason why talking immediately when you pick up a mobile phone leaves an audible echo of your own voice: estimating the coefficients of the filter is still taking place at that point.

      See [] for a diagram of the AEC or read Haykin's Adaptive Filter Theory if you're looking for a decent book on the subject.
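
      A minimal sketch of the difference between plain subtraction and an adaptive "room filter", assuming a normalised LMS (NLMS) update, which is one common way to estimate FIR coefficients but not necessarily what any particular phone or modem uses. The room impulse response, the signals, the step size `mu` and all function names below are made up for illustration.

```python
import math
import random


def simulate_room(far_end, room_taps):
    """Convolve the far-end (music) signal with an assumed room impulse response."""
    out = []
    for n in range(len(far_end)):
        acc = 0.0
        for k, h in enumerate(room_taps):
            if n - k >= 0:
                acc += h * far_end[n - k]
        out.append(acc)
    return out


def nlms_echo_cancel(far_end, mic, num_taps=8, mu=0.5, eps=1e-8):
    """Adapt an FIR filter with NLMS and return (residual signal, final weights)."""
    weights = [0.0] * num_taps
    history = [0.0] * num_taps  # far-end samples, newest first (the delay line)
    residual = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        e = d - echo_estimate  # what is left after removing the estimated echo
        norm = sum(h * h for h in history) + eps
        weights = [w + mu * e * h / norm for w, h in zip(weights, history)]
        residual.append(e)
    return residual, weights


def power(signal):
    """Mean squared value of a signal."""
    return sum(s * s for s in signal) / len(signal)


random.seed(0)
n_samples = 4000
music = [random.uniform(-1.0, 1.0) for _ in range(n_samples)]
room = [0.0, 0.0, 0.6, 0.3, -0.2, 0.1]  # hypothetical: a delay plus a few reflections
voice = [0.05 * math.sin(2 * math.pi * 0.01 * n) for n in range(n_samples)]
echo = simulate_room(music, room)
mic = [e + v for e, v in zip(echo, voice)]

# Naive approach: subtract the raw music signal from the microphone signal.
naive = [m - x for m, x in zip(mic, music)]
# Adaptive approach: subtract the music signal after the estimated room filter.
cleaned, weights = nlms_echo_cancel(music, mic)

# Compare how far each result is from the true voice, after the filter has settled.
tail = slice(n_samples // 2, None)
naive_err = power([a - b for a, b in zip(naive[tail], voice[tail])])
nlms_err = power([a - b for a, b in zip(cleaned[tail], voice[tail])])
print("naive residual power:", naive_err)
print("NLMS residual power:", nlms_err)
```

      In this toy setup the naive difference still contains most of the music, while the adaptive filter's residual shrinks once the coefficients have converged. The early samples are noisier, which matches the point above about the echo heard right after picking up.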
  • rare occasions?
  • by SCHecklerX ( 229973 ) on Saturday July 29, 2006 @10:40AM (#15805687) Homepage
    OS/2 Warp had speech recognition in 1994. Better yet, the OS/2 version of Netscape at the time was speech enabled (browse simply by speaking the link). Even cooler was that the Netscape developers actually listened to the OS/2 community with that version (I remember them implementing something that I had asked for...very cool). Keep in mind that the average system of that time was a Pentium 133 with 100MB of RAM. And here we are in 2006, with GHz processors and GBytes of RAM dirt cheap, and M$ is just now starting to experiment with this? By now this technology should be damned near perfectly integrated across the board! Thanks for abusing your monopoly power to destroy all of the competition and REAL innovation, Microsoft!
  • It was funny, but the end of the video was funnier. Apparently Microsoft sent the heavies round and made it clear they weren't happy about the video being shown, and that the problem was down to background noise. However, CNN obviously found it funny, and the newsreader pointed out that it was a very quiet room until it started going wrong and people started laughing.

    "Live television is rough. Welcome to our world." she said. Ooooooh. Nice kick below the belt. Sounds like they're not keen on Microsoft at
  • by wowbagger ( 69688 ) on Saturday July 29, 2006 @11:05AM (#15805819) Homepage Journal
    A friend of mine called me at work (since he knew that to access MSNBC's videos requires Internet Explorer, Windows Media 9 or better, and Flash, and I have neither IE nor WMP at home) and told me about this.

    I went to - and there it was, third on the list of videos on the main page.

    I called this to the attention of two of my coworkers, and we viewed the video - total elapsed time, maybe twenty minutes.

    Then I went to call it to the attention of a third coworker - and the video was no longer on the front page of MSNBC. OK, so maybe they've moved it off the front page, but it should still be on the Technology subsection, right?


    Nor was it under Videos, nor anywhere else I could find it easily.

    Perhaps this was just a normal rotation of a video. Perhaps not. But no matter what the real cause, there is the appearance that it was removed from the page because it was too embarrassing. Not good for Microsoft.

    However, I will give MSNBC this - they didn't give Microsoft a free ride on this, they ribbed them pretty hard.

    However, I knew that this would be appearing on other sources as a video that could be viewed outside of Windows. Actually, I am rather surprised that it took this long.

    Now, as to the demonstration itself - it looks to me (a person who does signal processing and analysis for a living) like the presenter had the mike gain too high - every time he spoke he maxed out the bar graph on the display. *IF* he had the gain too high, and the audio was clipping significantly, that could make "mom" have enough of a pop to maybe sound like AUNT - especially if the software is using context to try to reduce the search-space for the words. Of course, that's why I would have a monitoring routine in the system, and if any of the samples are at 100% full scale, or if many of the samples are over 90% full scale, or the signal power is too high, I'd have my software adjust the mike gain down *and* flag an alert to the user. I'd also try to look for the mike element itself being overloaded.
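
    The monitoring routine described above could be sketched roughly as follows. This is only an illustration under stated assumptions: samples are taken to be normalised to the range -1.0..1.0, and every threshold, the gain step, and the function names are hypothetical, not taken from Vista or any real audio stack.

```python
import math


def check_input_level(samples, full_scale=1.0, near_clip_ratio=0.90,
                      max_near_clip_fraction=0.05, max_power=0.25):
    """Return (ok, reasons) for one block of normalised samples.

    All thresholds are illustrative assumptions.
    """
    reasons = []
    if not samples:
        return True, reasons
    magnitudes = [abs(s) for s in samples]
    # Rule 1: any sample at 100% of full scale means the input clipped.
    if any(m >= full_scale for m in magnitudes):
        reasons.append("clipping: sample at full scale")
    # Rule 2: many samples above 90% of full scale.
    near = sum(1 for m in magnitudes if m > near_clip_ratio * full_scale)
    if near / len(samples) > max_near_clip_fraction:
        reasons.append("too many samples above 90% of full scale")
    # Rule 3: overall signal power too high.
    mean_power = sum(m * m for m in magnitudes) / len(samples)
    if mean_power > max_power:
        reasons.append("signal power too high")
    return (not reasons), reasons


def adjust_gain(current_gain, samples, step=0.8, min_gain=0.05):
    """Lower the gain when the block looks too hot; return (gain, alerts for the user)."""
    ok, reasons = check_input_level(samples)
    if ok:
        return current_gain, []
    return max(min_gain, current_gain * step), reasons


# A quiet sine block and a heavily overdriven (hard-clipped) one.
quiet = [0.2 * math.sin(2 * math.pi * n / 50) for n in range(500)]
hot = [max(-1.0, min(1.0, 3.0 * math.sin(2 * math.pi * n / 50))) for n in range(500)]

quiet_gain, quiet_alerts = adjust_gain(1.0, quiet)
hot_gain, hot_alerts = adjust_gain(1.0, hot)
print(quiet_gain, quiet_alerts)
print(hot_gain, hot_alerts)
```

    Detecting an overloaded mike element is left out here, since that distortion happens before the samples ever reach full scale and would need a different test.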
  • My Cell Phone's voice recognition hasn't given me a problem, EVER. Perhaps Microsoft could learn a thing or two from the mobile industry?
  • by ThinkFr33ly ( 902481 ) on Saturday July 29, 2006 @11:26AM (#15805947)
    As much as many of you would like to believe that the reason this demo failed was because Microsoft code is horribly designed and implemented, and that they are completely incompetent, there just might be a slightly more realistic explanation for the demo's abject failure.

    According to Rob Chambers [], a developer on the Vista speech recognition team, the failures during the demo were caused by audio gain issues.

    From his blog:

    If you watch the video clip on MSN Video you can see in the speech user interface that the microphone "volume" is very high. It pushes up into the red frequently while Shanen is speaking to the computer. That's caused by the fact that the audio sub-system wasn't respecting the audio gain settings we've asked it to use.

    This is a known bug in current builds, and has already been fixed by the audio team in their private builds in preparation for RTM.

    Read the entire blog post for a more complete explanation of what happened... one that's just slightly more plausible than most of the explanations proffered by your fellow Slashdotters.
  • Probably a bad Mic. (Score:3, Informative)

    by jcr ( 53032 ) on Saturday July 29, 2006 @12:34PM (#15806233) Journal
    When I was last involved in adding speech control to an app, I attended a developer workshop at Apple, and found out much to my surprise that my mic wasn't any good. It sounded fine when I used it for voice recording, but for recognition the gain curve was all wrong. When I tried one of the mics that the speech team from Apple provided, the hit rate went from under 20% to well over 90%.

    When Kim Silverman demos Apple's speech recognition, he uses a high-quality noise cancelling mic. It makes all the difference.

  • by Lonath ( 249354 ) * on Saturday July 29, 2006 @01:57PM (#15806607)
    Every time I say the word "Linux" it gets typed out as "Windows". Go figure.