Slashdot
IBM Strives For 'Superhuman' Speech Tech
Posted by
ScuttleMonkey
on Wed Jan 25, 2006 04:34 AM
from the fansubbing-in-jeopardy dept.
robyn217 writes "IBM unveiled new speech recognition technology today that can comprehend the nuances of spoken English, translate it on the fly, and even create on-the-fly subtitles for foreign-language television programs. One of the projects perpetually monitors Arabic television stations, dynamically transcribing and translating any words spoken into English subtitles. Videos can then be viewed via a web browser, with all transcriptions indexed and searchable."
This discussion has been archived.
No new comments can be posted.
289 comments
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
Which ... (Score:4, Interesting)
(http://www.atari.st/ | Last Journal: Thursday April 27 2006, @05:27AM)
Re:Which ... (Score:5, Interesting)
Sometimes you need rather a large context to disambiguate: is this sentence part of a discussion on shore-front management, or spoken language understanding?
Coherency? (Score:5, Insightful)
(http://www.euvsus.blogspot.com/)
Still, even at 80 percent, how good is this translation? If the missing 20% covers the important parts of speech, you could still be left clueless. Even the best machine translations of text I have seen always leave the text a bit garbled and confusticated.
I don't know how much delay is implied in the phrase "on the fly", but I personally don't think there could ever be real-time translation, for the following reason. Sentences in different languages have different structures. While in English the verb is usually the second element, in other languages, such as German, the verb often comes last. For the translator to produce the second word of the English sentence, it would have to wait until the end of what could be a long sentence. This necessarily adds delay.
Re:Coherency? (Score:4, Interesting)
Since even "live" broadcasts are usually delayed several minutes for technical and legal reasons anyway, if this technology can get to the state where you're just one or two sentences behind real life, it will be effectively real-time for almost all practical purposes.
And German is an easy one (Score:5, Informative)
first? (Score:5, Funny)
A great advance in technology! (Score:1)
Seriously though, this is a great advance in technology, but will it still be as funny to listen to? It's always fun typing words into speech recognition programs and listening to the unexpected results!
Nuances (Score:4, Funny)
Subtitle: "All your base are belong to us"
NSA Babelfish (Score:2, Funny)
(http://www.webdevelopers.cz/)
(I'm sure that this eBabelfish is already installed - not in my ear - but on the telecommunication centers...)
Opensource? (Score:1, Interesting)
Foreign languages are complex... (Score:5, Insightful)
It's not until you learn another foreign language that you realise how complex languages are, and how subtle. Learning another language can literally change the way you think about things.
This type of technology will make people think they completely understand a foreign language, but they won't. Their understanding will be crude, without the subtleties and cultural understanding.
I can speak English and Spanish fluently, and if I watch an English film with Spanish subtitles I'm always thinking - damn, they missed a good joke there, they got that wrong, etc. (Equally so with a Spanish film with English subtitles). And film subtitles are done by professional translators. God only knows what a terrible job a computer would make of film translation.
Re:Foreign languages are complex... (Score:5, Funny)
(http://xinagnet.xs4all.nl/browser_info)
Re:Foreign languages are complex... (Score:5, Funny)
Rocco: Fucking... What the fuck. Who the fuck fucked this fucking... How did you two fucking fucks...
[shouts]
Rocco: fuck!
Connor: Well, that certainly illustrates the diversity of the word.
Think that just about covers it...
Ghee... (Score:4, Insightful)
If they REALLY want to test it properly... (Score:5, Funny)
have closed.
"Ye loooiii ahhh me jimmeh??! *belch* C'mere ya wee electrahnich bastid, I'll
shoo ye!"
Available with old version of Mandrake Linux (Score:1)
Anyone know where I can get this from?
It isn't worth it (Score:5, Funny)
(http://www.vidaartificial.com/)
On-The-Fly (Score:5, Informative)
(Last Journal: Wednesday February 26 2003, @06:32AM)
I have read a lot of auto-translated documents and they are always good for a laugh, a real "crapslation cabaret". So far, there is no technology that could auto-translate a text document successfully. The "80% success" is a myth - they just count how many words were found in the vocabulary, not how many of them were put into a good context. A "fly" translated as an insect would be counted as a success!
Even if you are not a bot but a human being with some knowledge of the other language and culture, it's very easy to involuntarily offend someone or just to make a ridiculous faux pas. Polish and Czech languages, for example, are very much alike and use common roots for many words, but because of the way both languages evolved, some neutral terms on one side of the border have become offensive on the other side. Czechs evolved a euphemism for sexual intercourse based on the verb "to look for". Poles still use this word when they look for something, which leads to constant crapslation cabaret gags when a Polish tourist appears in a Czech town "looking for a parking lot". Now, auto-translate this...
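To make the difference between those two ways of counting concrete, here is a toy sketch. Everything in it is made up for illustration (the vocabulary, the sentences, and both function names are hypothetical), and it says nothing about how IBM or anyone else actually scores their systems.

```python
# Toy illustration only: the sentences, vocabulary and "reference" are invented.

def vocabulary_coverage(output_words, vocabulary):
    """Fraction of output words that exist in the target vocabulary."""
    hits = sum(1 for word in output_words if word in vocabulary)
    return hits / len(output_words)


def reference_match(output_words, reference_words):
    """Fraction of positions where the output agrees with a human reference."""
    hits = sum(1 for out, ref in zip(output_words, reference_words) if out == ref)
    return hits / max(len(output_words), len(reference_words))


vocabulary = {"translated", "on", "the", "insect", "fly"}
machine_output = ["translated", "on", "the", "insect"]   # "fly" rendered as a bug
human_reference = ["translated", "on", "the", "fly"]

coverage = vocabulary_coverage(machine_output, vocabulary)
agreement = reference_match(machine_output, human_reference)
print(f"coverage={coverage:.2f} agreement={agreement:.2f}")
```

Every output word is a real word, so the coverage number looks perfect, while the comparison against the human reference catches the wrong "fly".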
IBM and Google cooperation to come? (Score:3, Interesting)
This won't make speech recognition mainstream (Score:4, Interesting)
(http://highc.org/)
Ben Shneiderman is the person who, in my opinion, best articulates the limits of speech recognition [umd.edu].
One of my favorite phrases to explain this issue is: "You don't want to speak to a computer, because you can't speak and think at the same time". More precisely, speech utterance makes use of some modules in our brain which are required for planning too. Hence, you can't plan what to do next as well while you speak, which is a big hurdle in the type of intellectual activities one carries out with a computer.
Awful default TTS (Score:4, Insightful)
(http://xkcd.com/)
What really bothers me is the state of Windows text-to-speech. The TTS that ships with the most popular operating system on Earth is easily trumped in understandability by a small third-party program I downloaded literally TWELVE YEARS AGO. I really wonder if M$ made some pact to give out crappy TTS so as not to stifle sales of some business partner's application.
This seems pretty ridiculous, but I'm at a loss as to why their text-to-speech programs are of 12-year-old quality.
I'm glad people are doing good speech research, (I know I've seen a demo of good IBM TTS somewhere) but I hope it finds its way into Windows someday.
What about SubHuman Speech? (Score:2)
(Last Journal: Thursday November 09 2006, @10:31AM)
ViaVoice (Score:1)
(http://trap.me.uk/damion/)
So I'm surprised to see an announcement like this one.
American or English? (Score:3, Interesting)
(http://www.crazysquirrel.com/index.jspx)
I realize that Americans and British (English at least ;o)) speak essentially the same language, but I have yet to find any speech recognition software that can get more than roughly 85% of what I say correct. I have a fairly soft, neutral English accent with pretty good enunciation, so I would have expected to be getting a recognition rate in the high 90%s. I'm wondering if, as most of this software is developed in the US, it is tuned specifically to pick up on English with a US accent? I realize that you train the software for your voice, but AIUI all you are doing is tuning a basic speech model. Has anyone else had this problem or is it just me?
Oh oh oh. (Score:3, Funny)
So I'm sitting here thinking of how funny it was to the juvenile me back then, and how unfunny it seems right now. Oh well.
Not _that_ amazing (Score:2, Interesting)
The translation, on the other hand, sounds damned impressive. For unrestricted content, especially with an untrained voice (I imagine that IBM isn't individually training to each Al Jazeera talking head), 70% recognition sounds quite good. 70% accuracy post-translation ought to be quite a bit better than what's currently out there. The description of MASTOR, however, is useless -- it could easily describe anything that isn't word-for-word translation.
Buyer beware (Score:5, Insightful)
Speech recognition is riddled with problems. From a computing side it's enormously processor intensive and memory hungry. From a computer side it's very complex code and the 'learning' process is fraught with problems - surnames, company names and locations are all very poorly recognised.
So don't rush to buy. Let the labs check it out first.
Trusted Computing (Score:1)
(Last Journal: Tuesday April 12 2005, @07:06AM)
I'll just be happy if (Score:2)
(http://marshonsmacs.blogspot.com/)
Nintendogs: I've stopped trying to train my dog, it's never going to happen.
Apple Speech: Only works if I use a terrible Californian accent. Not worth the embarrassment.
Nokia: Even with just one voice command, my girlfriend's name, it still can't match my voice.
If this can translate foreign languages into American (sic), then it definitely sounds like it could stand a chance at translating English into text and commands.
funny this subject should come up... (Score:2, Interesting)
the training process definitely has its ups and downs. The more you work with it however, the more it becomes attenuated to your own speech patterns and moreover, the quirky words we use every day. If you can get past the first two or three hours, you'll see that it is totally worth the effort, especially if this IBM tech isn't available to end-users for some time. There is also an aspect of the software training you, while you train the software. At the present time, I can dictate to slightly slower than I can probably type.
In the end, I can see where this would make a writing e-mails and other such time-consuming tasks, which involve spellchecking, grammar, and other proof reading significantly quicker. When you really hit your stride, it's easy to write at the speed of thought, which is really appealing. There are caveats, however. it's very easy to dictate several sentences worth of tax and taken for granted that it to everything down the way you attendedselect tax select select tax undo
I am surprised (Score:1)
Real-time eavesdropping (Score:2, Interesting)
(http://hiranyaloka.com/)
Monitor all conversation.
Apply real-time text filters.
Assign live agents to priority eavesdropping.
Profit!
If you could apply a filter to listen in to any call what would it be?
Finally! (Score:2)
Translating Arab TV (Score:3, Informative)
I was in Kuwait and watched Arab TV with English subtitles; it was enlightening, to say the least. One long tribute to racism paid for by the Amir of Qatar. Only on Arab TV will you see such trash as "the jews are descended from pigs".
Big deal, I can do that on my Apple ][ (Score:2)
10 PRINT "DEATH TO AMERICA";
20 GOTO 10
RUN
Why "superhuman" tech? (Score:1)
(Last Journal: Monday September 04 2006, @10:07PM)
Speech Synthesis. (Score:2)
(http://www.leperkhanz.com/ | Last Journal: Wednesday October 01 2003, @05:17AM)
Excellent Product, Confused Reviewers (Score:2, Informative)
Dictation, however, is a completely different problem. There are far fewer constraints on what can be said, and the system makes errors as it picks through the possible choices. As a result, most dictation software requires training: the system will use your voice to train its recognition models to improve its word selection. Dictation systems also ask for samples of your documents to train their language models on how you put words together; that also helps determine the probability of proper word choice. (Example of how you put words together: "Peanut butter sandwich" is a much more likely choice than "peanut butter sand," and will get a higher score.)
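For anyone curious what "a higher score" can look like in practice, here is a minimal sketch using a bigram model with add-one smoothing. This is a textbook technique chosen for illustration, not a claim about what any commercial dictation product does; the tiny corpus and the `log_score` helper are invented for the example.

```python
import math
from collections import Counter

# Made-up training text standing in for "samples of your documents".
corpus = (
    "i ate a peanut butter sandwich . "
    "she made a peanut butter sandwich . "
    "we walked on the sand . "
    "he wants a butter sandwich ."
).split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab_size = len(unigrams)


def log_score(words):
    """Add-one smoothed bigram log-probability of a word sequence."""
    total = 0.0
    for prev, cur in zip(words, words[1:]):
        # Counter returns 0 for unseen pairs, so smoothing keeps the log finite.
        total += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return total


sandwich = log_score("peanut butter sandwich".split())
sand = log_score("peanut butter sand".split())
print("sandwich wins:", sandwich > sand)
```

Because "butter sandwich" shows up in the training text and "butter sand" never does, the first phrase ends up with the higher log-probability, which is the sense in which the recognizer would prefer it when the audio is ambiguous.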
The IBM announcement is about embedded, task-oriented speech recognition. It's not "superhuman," according to the article's text and ignoring its headline. I'll have an opportunity to see it in action next week at SpeechTek West [speechtek.com]. Expect to see other product announcements about speech technology in the next few days as the conference approaches.
As for the TV translation software, it's still in the research stage according to the article. I've seen BBN's version of this software, and frankly it's amazing how good real-time translation can be.
Bell Canada deployed Emily [speechtechmag.com] a few years back, and the results to date have been excellent. A top-level question of "How can I help you?" replaces several layers of DTMF auto-attendant complexity.
If you're interested in trying speech recognition and text-to-speech out for yourself, you can use Voxeo's servers, program in VoiceXML, and my Voice Conference Manager [sourceforge.net] app as a starting point (yeah, VCM needs a new release, and it's getting one soon).
We also showed off... (Score:1)
(http://www.yapme.com/)
Not only do they allow you to navigate by voice, but using X+V (a blend of XHTML and VoiceXML), you could have fully speech-enabled Web apps. Example: "show me nearby sushi restaurants" or "movie schedules in my area".
We also released our Multimodal Tools Project for Eclipse a couple weeks ago: http://alphaworks.ibm.com/tech/mmtp [ibm.com]
Go ahead and play.
Let's see it translate poems (Score:3, Interesting)
(http://booktextmark.mozdev.org/)
Re:Let's see it translate poems (Score:4, Interesting)
(http://lavincolindo.net/ | Last Journal: Friday January 20 2006, @05:50PM)
Anime fansubs! (Score:2)
(http://www.nerdwatch.com/)
What a boon this will be to those anime fansub groups who can't find decent translators, or at least translators who aren't overworked.
Thanks for the laugh! (Score:2)
I've been hearing this every 6 months for about the last, oh, thirty years.
Given that the state of the art in something much simpler, like automatic language translation, is pitifully inadequate, how likely is it IBM has conquered speech recognition AND translation?
Har har har.
S-to-T in hospitals (Score:2, Interesting)
This of course worries secretaries, since they might eventually lose their job/"career". On the other hand, it would improve efficiency *a lot*.
Live experiment with Dragon 8 (Score:4, Funny)
(http://www.bdwoolman.net/)
I can wreck a nice beach. I can recognize speech.
Well, Dragon Systems eight passed the beach test first try. Knowing the program, however, I did use pretty clear diction.
I use Dragon Systems and find it absolutely great. There are a few persistent errors. For example, It frequently fails to get "there" and " there" right on the first try. But the fly down menu system enables me to quickly correct the problem on the run. Certainly I pick it up on an edit. If IBM has something better than this -- and it sounds like they do -- then it must be pretty darn good. Of course, you have to insert the punctuation verbally. But that comes with a little practice -- provided that you know what to do in the first place.
It does take a little bit of investment in time. But not nearly as much as learning to type at seventy words a minute, which I can now do in dictation. I have added very little by way of customized commands etc. The program has done a lot of learning on its own.
Let's try once again: I can't recognize beach. I can recognize speech. Oops. Okay, it failed that time. Let's try one more time: I can wreck a nice beach. I can recognize speech. Well, the phrases have to be enunciated pretty clearly or the program has trouble.
Which which blew the blue candle. Failed on the second "which" the b*tch.
Okay, okay. I'll put the laundry in the dryer. No I am not just screwing around on Slashdot again I'm getting some work done down here. Just a minute. Just a MINUTE.
One trouble. You do have to put the mike to sleep during family discussions.
WoT (Score:2)
(http://news.google.com/)
Sounds like the results of a DOD/DARPA/NSA funded research grant. They'd love to be able to translate on the fly, instead of having to train and pay actual humans to manually translate several hours -- or even days and weeks -- after the original transmission.
Now that IBM has something kinda working and the grant money is running out they are trying to market it to the public. Kinda like Tang for the War on Terror-age.
'Twas Brillig (Score:1)
On a more serious note, however, my wife was involved in an ill-fated-due-to-ancient-technology project back in grad school in the early 70's which involved:
1. Speech recognition.
2. Machine translation into a universal grammar
3. Translation of the universal grammar into various target languages.
4. Speech synthesis in the various target languages, using the same vocal qualities as the original speaker.
Pretty lofty goals considering they were probably using computers with discrete components in them.
Curiously, my wife (a native Japanese speaker) was teamed with the Suomi (Finnish) team because of the similarities in the two languages' structures.
what about... (Score:1)
vs,
"Does your radio suck? boy I sure hope my stupid radio doesn't. Uh, play 92.3"
breakdown of the article (Score:1, Interesting)
1. IBM has updated their ViaVoice large vocabulary continuous speech recognition (LVCSR) engine.
2. IBM has paired ViaVoice with some clever apps to use the ViaVoice output in interesting ways (e.g. "on the fly" recognition, translation).
Things that are not obvious from the article:
1. ViaVoice has been around for ages and has always been pretty darn good at LVCSR. Without seeing numbers and knowing exactly how they were measured, it's impossible to know how much of an improvement 4.4 is over previous versions.
2. Speaker-dependent speech recognition can always achieve much higher accuracy rates than speaker-independent systems like ViaVoice. Dragon NaturallySpeaking is an example of speaker-dependent speech recognition.
3. Limited grammatical contexts (i.e. language models with low perplexity) always give better recognition than when you don't know what to expect next. For example, when your phone only has to tell "home" and "wife" apart, it's a lot less likely to make a mistake than if it has to figure out which word out of a list of 50,000 you just said. The more context, the better. The most interesting tech in the article seems to be the algorithms "that can determine this context on the fly."
4. No improvements in translation technology were noted in the article; it sounds like they might as well have fed ViaVoice through BabelFish, made it happen in real time, and slapped a UI on it. The app might be new, but the tech is not.
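Point 3 above can be put in numbers with a small sketch. Perplexity is roughly "how many equally likely words the recognizer is choosing between"; the three distributions below are invented for illustration, and the 0.9 figure for the context case is an arbitrary assumption.

```python
import math


def perplexity(probabilities):
    """Perplexity of a next-word distribution: 2 ** entropy (entropy in bits)."""
    entropy_bits = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return 2 ** entropy_bits


# Phone voice dialing: only "home" and "wife", equally likely.
phone_dialer = [0.5, 0.5]

# Open dictation with no context: any of 50,000 words, equally likely.
open_dictation = [1 / 50_000] * 50_000

# Same 50,000 words, but context makes one of them very likely (assumed 0.9).
with_context = [0.9] + [0.1 / 49_999] * 49_999

print("phone dialer:  ", round(perplexity(phone_dialer)))
print("open dictation:", round(perplexity(open_dictation)))
print("with context:  ", round(perplexity(with_context), 1))
```

The vocabulary is the same size in the last two cases; only the amount of context differs, which is why algorithms that work out the context on the fly matter so much.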
Capitalization????? (Score:2)
(Last Journal: Saturday November 10, @03:30PM)
"I had to help my uncle jack off a horse."
Will it ever catch that one?
I helped apple... (Score:2)
(http://www.partow.net/)
Another scam (Score:2)
First we have software that cannot be reverse engineered and guarantees the free speech rights of Americans.
It comes attached to the Brooklyn Bridge and some Florida swamp land.
Now we have this crap: "By limiting the domain, the system can make assumptions or inferences about what the user would like to accomplish, he said."
This is not exactly "superhuman" speech recognition.
None of this is feasible absent conceptual processing technology. Period.
I don't know why I don't clean up at the public trough by simply announcing I have "true artificial intelligence" and wait for the checks to roll in before leaving for Brazil.
Unlikely (Score:2, Insightful)
Recent studies of the efficacy of machine translation found that modern engines have made only marginal progress over those of the *70s* (in fact, one of them, SYSTRAN, is the most used translation engine online), and there was *no* discernible difference between engines of the eighties and current engines. I hope that they're not trying to claim that they suddenly overcame the vast problems of translation wholly independent of the linguistic community. That's just ludicrous.
I'd love to see this engine handle a parasitic sentence like this between two largely different languages and catch the nuance in the parens: "Which report did she file (that report) without (her) reading (that same report)?" Sure, some engines will hit by chance, but only because of similar structure; the engine is lucky, not actually parsing the "meaning."
Speech recognition would be easy.... (Score:1)
Real great example ... (Score:1)
(http://slashdot.org/ | Last Journal: Wednesday March 03 2004, @05:45PM)
I rather like "Open the pod bay door, Hal" myself.
--
1. http://www-03.ibm.com/press/us/en/pressrelease/19
Re:Just what we need... (Score:5, Insightful)
I think you've got it the wrong way round haven't you? Did you mean to say "More opportunities for English speaking people to misinterpret Arabic media."?
Re:Just what we need... (Score:4, Insightful)
(Last Journal: Saturday December 24 2005, @03:18PM)
Re:Just what we need... (Score:3, Insightful)
Re:learn the langage ? (Score:1)