Improving Open Source Speech Recognition
kmaclean writes, "VoxForge collects free GPL Transcribed Speech Audio that can be used in the creation of Acoustic Models for use with Open Source Speech Recognition Engines. We are essentially creating a user-submitted repository of the 'source' speech audio for the creation of Acoustic Models to be used by Speech Recognition Engines. The Speech Audio files will then be 'compiled' into Acoustic Models for use with Open Source Speech Recognition engines such as Sphinx, HTK, CAVS and Julius." Read on for why we need free GPL speech audio.
Why free GPL Speech Audio?
Speech Recognition Engines require two types of files to recognize speech. The first is an Acoustic Model, which is created by taking a very large number of audio recordings of speech and their transcriptions (called Speech Corpus or Corpora) and 'compiling' them into statistical representations of the sounds that make up each word. The second is a Language Model or Grammar file. A Language Model is a very large file containing the probabilities of certain sequences of words. A Grammar is a much smaller file containing sets of predefined combinations of words.
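To make the Language Model versus Grammar distinction concrete, here is a minimal sketch in Python. It is only an illustration: the toy transcripts, the bigram (word-pair) counting, and the names `bigram_probabilities` and `grammar_accepts` are assumptions made for this example, not anything taken from VoxForge or from a particular recognition engine.

```python
from collections import Counter

# Toy transcriptions (assumed for illustration); a real corpus is far larger.
transcripts = [
    "call the operator",
    "call customer service",
    "call the office",
]

def bigram_probabilities(sentences):
    """Estimate P(next word | previous word) by counting adjacent word pairs."""
    pair_counts = Counter()
    prev_counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            pair_counts[(prev, nxt)] += 1
            prev_counts[prev] += 1
    return {pair: n / prev_counts[pair[0]] for pair, n in pair_counts.items()}

# A (very small) language model: probabilities of word sequences.
language_model = bigram_probabilities(transcripts)

# A grammar, by contrast, simply enumerates the allowed word combinations.
grammar = {"call the operator", "call customer service"}

def grammar_accepts(utterance):
    """Return True only for utterances the grammar explicitly lists."""
    return utterance in grammar
```

The language model assigns a probability to any word pair it has seen, while the grammar gives a plain yes/no answer for a fixed list of phrases, which is why the grammar file can be so much smaller.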
Most Acoustic Models used by 'Open Source' Speech Recognition engines are 'closed source'. They do not give you access to the speech audio (the 'source') used to create the acoustic model, or if they do, there are licensing restrictions on the distribution of the 'source' (i.e. you can only use it for personal or research purposes). The reason is that there are no free Speech Corpora in a form that can readily be used to create Acoustic Models for Speech Recognition Engines. Open Source projects are required to purchase Speech Corpora, which have restrictive licensing: they are not permitted to distribute the 'source' speech audio, but they are permitted to distribute the 'compiled' Acoustic Model.
Why GPL?
A GPL-style license will ensure that user contributions will always benefit the open source community, since it requires any distribution of derivative Acoustic Models to include access to the 'source' speech audio.
Re: (Score:2, Interesting)
but agreeing with you, the voice system in my cell phone sucks.
What is your choice?..."Operator"...I'm sorry. (Score:2, Insightful)
What is your choice?..."Operator"...I'm sorry. Please say another option...."CUS-TO-MER SER-VICE REP-RE-SENT-A-TIVE!!!"...I'm sorry...
That's usually the gist of my conversation with those automated systems.
If I'm calling, it's not something that can be solved with an automated prompt. If it was, I would have looked it up on your website already... I'm calling specifically because there's something WRONG with my account!
Re: (Score:2)
If you do that on Bell Canada's system (well, I haven't tried in about a year, but it did then) it will drop you directly to an operator.
Re: (Score:1)
Re: (Score:2)
Powergen in the UK had a system where, when paying a bill by credit card, it would ask for the name on the card. I have an unusual name, and it would get it fine every time. Although, being an AI graduate, I'm used to speaking in a manner that typical analysis algorithms can process well.
Re:A sound affair. (Score:4, Insightful)
Muffin for Jew to Ski here? (Score:2)
Are there people out there who use voice as their main method of inputting text? For older people who type incredibly slowly, would this software be worthwhile using for composing emails?
Re: (Score:1, Interesting)
Re: (Score:3, Interesting)
I knew a few old people who asked about it and tried it, but I think the real holy grail for voice recognition is not a replacement for typing text, but rather understanding the context of what you want it to do.
You know... "Computer go to Red Alert!" like Star Trek.
But in our case it would be...
"Computer. Go to email an
Re: (Score:2)
Well, it wouldn't take much [artificial] intelligence for that one.
Computer. Go to Slashdot ... (Score:2)
Error: Insufficient computing power.
Re: (Score:2)
Re: (Score:2)
how about make world?
Re: (Score:2)
OG.
Re: (Score:2)
I look at the subject "Muffin for Jew to Ski here?" and use both my knowledge of Slashdot and of similar-sounding words to infer what the writer is getting at. The knowledge of Slashdot is an important factor in my accuracy in d
Re: (Score:3, Interesting)
Depends on the approach. I recall circa 1980 or so a prof at Concordia U. had a speech recognizer on a VAX 11/780 (with an A/D adapter). It didn't have to be trained on the speaker, and recognized my "Mary had a little lamb, its fleece was white as snow"(*) in a mere 10 or 15 minutes.
Okay, hardly real time, but that was on a 1 MIPS machine. It was also logging all the steps it took to analyze
Re: (Score:1)
Re: (Score:2)
Doing speech-to-text from a speeded-up recording, or simultaneously doing multiple transcripts from different audio inputs. Or doing
Re: (Score:3, Informative)
The article is about speech recognition as is your post. Speech recognition is about recognizing what was said. Voice recognition is about recognizing who said it. The distinction is important since the coding and the problems associated with them are very different.
Just what we need... (Score:2, Funny)
Re: (Score:1)
Re: (Score:2)
Anythings gotta be better than (Score:5, Funny)
Re: (Score:2)
The reality of Star Trek-like voice interaction with a computer is still a ways off, decades perhaps.
Re: (Score:2)
Re: (Score:2)
GPL? (Score:3, Interesting)
Re: (Score:1, Interesting)
Well no, not *that* one. (Score:3, Insightful)
Creative Commons ShareAlike is GFDL compatible, at least according to WikiMedia. Or heck, why not just use the GFDL itself?
The reason not to use the GPL on something like this is because there's not a clear separation between "source" and "binary" like there would be for a programming project; there's
Re: (Score:2, Informative)
There is the 'source' data which is 'compiled' into something useful.
Sounds familiar?
Re: (Score:2, Interesting)
On a related note, how much data are they planning to collect and how will it be distributed? The web site says that they will make available 8kHz and 16kHz
Re: (Score:1)
You can't get the original audio from the compiled version.
What rock have you been hiding under? Don't you know what an mp3 is?
I highly doubt they are even considering distributing raw PCM data. It will be compressed in one form or another.
1000 hours of CD quality mp3 is only roughly 60 gig (your numbers are wrong I think) and voice doesn't need CD quality.
Anyway they don't *need* to distribute the audio to eve
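For anyone who wants to check the storage numbers in this subthread, here is a rough back-of-the-envelope sketch. The assumptions are mine, not the project's: 16-bit mono PCM for the 8kHz and 16kHz cases, 16-bit stereo at 44.1 kHz for raw CD audio, and a 128 kbit/s bitrate for the mp3 case. The function names are made up for this example.

```python
def pcm_gigabytes(hours, sample_rate_hz, bits_per_sample=16, channels=1):
    """Size of uncompressed PCM audio in gigabytes (1 GB = 1e9 bytes)."""
    bytes_total = hours * 3600 * sample_rate_hz * (bits_per_sample // 8) * channels
    return bytes_total / 1e9

def compressed_gigabytes(hours, kbit_per_s):
    """Size of constant-bitrate compressed audio in gigabytes."""
    return hours * 3600 * kbit_per_s * 1000 / 8 / 1e9

hours = 1000
print(f"8 kHz mono PCM:   {pcm_gigabytes(hours, 8000):.1f} GB")
print(f"16 kHz mono PCM:  {pcm_gigabytes(hours, 16000):.1f} GB")
print(f"CD stereo PCM:    {pcm_gigabytes(hours, 44100, channels=2):.1f} GB")
print(f"128 kbit/s mp3:   {compressed_gigabytes(hours, 128):.1f} GB")
```

Under those assumptions, 1000 hours at 128 kbit/s comes to about 57.6 GB, which is in line with the "roughly 60 gig" figure above, and raw 8kHz 16-bit mono PCM happens to land at the same size.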
Re: (Score:1)
Yes, I realize that you can't recover the audio from the acoustic models. But my point was that using the GPL in this context seems wrong because it would require that I (as the builder of the "derivative" work, aka the acoustic models) make the audio available (unless I misunderstand the GPL.) So, given that I haven't changed the original audio in t
Re: (Score:1)
Flac is a good candidate for a format. It's open source and lossless.
Re: (Score:1)
IFA Dutch Corpus (Score:4, Informative)
From TFA: What? The IFA Dutch "Open-Source" Corpus [hum.uva.nl] is a phonemically-annotated speech corpus released under the GNU GPL (read more - pdf). [let.uva.nl] They even have an SQL interface. Did you mean English speech corpora?
Re: (Score:1)
The more important problem is that current speech recognisers do not generalise well. If you train only on read speech, the performance on spontaneous speech will most likely be horrible. Transcribing spontaneous speech, however, takes enormous amounts of time. And it is not the kind of job you wan
Re: (Score:1)
Aha, that's what undergraduate RAs (+lots of funding) are for. But seriously, this is really what I was getting at in my post.
An IFA Corpus trained system won't be state-of-the-art, admittedly. The key word here is "free" - beggars can't be
It's about time (Score:5, Informative)
Having a viable open source alternative will ensure that everyone has access to this technology and there will be many new innovations that will just continue to make technology cooler.
Please people, take the time out of your schedule to record prompts. It will do everyone a lot of good.
Re:It's about time (Score:5, Interesting)
We've gotta do something to get this beast moving forward.
Re: (Score:1)
Re: (Score:2)
Re: (Score:2)
For the longest time I had a speech recognition system sketched out on a whiteboard in my office. Maybe once I get done with all my current projects (ww.com, daz.com and a bunch of smaller stuff) I'll restart it. It's one of the things I really don't like about computers: the fact that our whole 'navigation' experience and knowledge seems to revolve around large surfaced displays. If we could somehow get rid of that I think computers would be *far* more useful.
best regards, & congratula
two modes of speech-to-text, also (Score:5, Informative)
It's helpful to understand that there are two very different modes of speech recognition.
Continuous speech to text is useful for flat-out dictation, where the speaker should be allowed to speak in a clear voice at a normal or almost normal speed, saying whatever the speaker wants to say, and the result should be a sequence of recognized words.
Prompted speech to text is useful if your program is trying to exchange a dialogue with a human, such as a voice prompt or simply a set of useful user-voice commands. In this mode, the listening routine has a set of "expected" responses and should try only to recognize one of those responses.
The latter form usually requires a lot less training from an individual human, and is more robust in noisy environments, since the range of recognized expression is very tightly controlled to a few possibilities. The former mode, continuous speech, is much harder to accurately recognize without personal training for each human speaker, or significant statistical work in background processing.
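A rough sketch of why the prompted mode is more forgiving: if the program only has to pick among a handful of expected responses, even a sloppy guess can be snapped to the nearest one. The example below works on text, not audio. It assumes some recognizer has already produced a hypothesis string, and uses Python's standard `difflib` to choose the closest expected response. The response list, the `match_prompt` name, and the 0.6 cutoff are all assumptions for illustration, not how any particular engine does it.

```python
import difflib

# Hypothetical set of responses the dialogue is prepared to handle.
EXPECTED_RESPONSES = ["operator", "billing", "account balance", "technical support"]

def match_prompt(hypothesis, expected=EXPECTED_RESPONSES, cutoff=0.6):
    """Return the closest expected response, or None if nothing is close enough."""
    cleaned = hypothesis.lower().strip()
    matches = difflib.get_close_matches(cleaned, expected, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(match_prompt("operater"))                # a near miss still maps to a known response
print(match_prompt("mary had a little lamb"))  # free dictation matches nothing in the list
```

Continuous dictation has no such list to fall back on, so every word has to be recognized on its own merits, which is where the extra training or statistical work comes in.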
Here's how to help out (Score:5, Informative)
Donate your speech for a GPL speech data collection so they can do better recognition.
Includes separate instructions for Windows and Linux users. (Wonder if there will be any significant differences in the quality of the data based on OS...)
Re:Here's how to help out (Score:4, Funny)
So my bet is yes, there will be a difference based on OS.
Wreck A Nice Beach... (Score:1)
Re: (Score:3, Informative)
recognize speech
Entered with Dragon Systems 9.
Not trying to be snotty. Just informative. Dragon Systems has been pretty good since version 7. Eight was a real improvement. Nine is totally awesome. Almost magic. There is a user learning curve, however. One does have to dictate the punctuation for example. Nine works with Firefox very nicely.
Wordos do happen from time to time when you 'wreck a nice beach' (sic), but then so do typos. Everything needs to be edited no matter h
Re: (Score:2)
I remember eagerly anticipating not having to type anymore when I bought IBM's Via Voice. This was about 10 years ago, back when "powerful computer" meant a P90 with 8 Meg Ram. After training the software for about an hour, I could, by. talking. like. William. Shatner. on. Ritalin. produce text that was maybe 60% - 80% accurate. It was definitely oversold
speech recognition (Score:1)
I think you will be pleasantly surprised if you try Dragon Systems. Dragon Systems is special among speech engines. It is the long-term pet project of a couple of gifted scientists who decided to solve the problem of speech recognition a generation ago. They filed hundreds of patents over a couple of decades and solved many engineering problems one at a time. IBM took a long term interest in speech recogn
But you didn't leave your number (Score:1)
It ain't perfect, but training is easy these days and accuracies over 95% are arrived at fairly quickly.
Data conditioning (GIGO) (Score:5, Insightful)
This project seems to be gathering a "Wild Type" sampling of submitted data. What if the data is not representative . . . for example, a bunch of people in China decide to submit English-language files with the best of intentions, but the data is heavily accented (or, to be fair, a bunch of native English speakers submit heavily accented recordings of Mandarin speech)?
Without controlling the data source or making sure that the data is valid, one could become a victim of GIGO (Garbage In, Garbage Out). In all fairness, this may not be a problem if the sample size is large enough to overwhelm any outlying data, but I'm not sure that this project has sufficiently addressed this concern . . .
Re: (Score:2)
They had to invent an acronym for this too, didn't they!? Jesus what is going on with this world!
Wait... who are they?...
GIGO is older than you (Score:2)
So your question should be: "Jesus what was going on in this world way back when before I was born!"
Re: (Score:3, Interesting)
On the same line of thought, I hope I can use this tool with my heavy (ok not so bad) English accent...
I have no clue how those programs work so I might be off-base, but it seems to me that
Re: (Score:2)
But the same applies to recording modalities. Depending on whether you're b
Re: (Score:2)
B.
Re: (Score:2)
Re: (Score:2)
Then it would work great for Chinese users of the software. I don't see a problem here, except that the data needs to be categorized properly.
Re: (Score:2)
Part of the submission process is for you to classify your dialect. After your recordings are posted, people can rate your recordings & comment on them. I thi
Re: (Score:1)
What I feel is that for English speech, this is exactly what they should do. The language is so common across the globe, that all the accented variants should be included in the speech database. I remember using a voice recognition software long
slashdot met voxforge (Score:3, Funny)
Re: (Score:3, Funny)
But thanks to those millions of samples, we can now transcribe "AHHHHHHHHHHHHHHHH!" very accurately.
Re: (Score:3, Insightful)
The developers who work overtime to bring such advances should damn near be nominated for saint-hood. Or maybe you could learn to enunciate.
GPL versus public domain? (Score:5, Insightful)
Re: (Score:1)
1 PD is like a public park with the problem that somebody could buy a certain section (say grease a few palms and..) and lock YOU out of it
2 GPL is like a park owned by some old looney that leaves the gate open (or in some cases owned by a group of folks that HATE Each other)
PD is free now but could be nonfree later
GPL is free FOREVER
(for some projects its like getting a jew a muslim a catholic and several subtypes of protestants to agree on a "winter holida
Re: (Score:2)
Public Domain is forever, but people are free to copy it and make the copy (plus all improvements) their own.
GPL is forever, but the only people free to distribute it are those who provide all the original source IP plus their modifications under the GPL.
Then, of course, there's also BSDL, where the only restriction is that, unlike the public domain, you are required to credit the original authors of any work you use.
Re: (Score:2)
But what the heck: BSD vs. GPL, let me just get my flameproof stuff.
Please (Score:3)
Speech to text overlooked (Score:3, Informative)
What's frustrating is that there *was* something halfway decent - IBM's ViaVoice - but that's gone. A few of the Linux apps I see out there are layers to run on top of ViaVoice. With that option gone for Linux, those tools are useless. It's like the rug was pulled out from underneath any progress in this arena for the foreseeable future.
I found voxforge a few days ago, and while it seems admirable, it's a small part of the larger problem which I don't see getting any better any time soon.
Re: (Score:3, Informative)
Re: (Score:2)
Re: (Score:2)
It's about time (Score:3, Interesting)
Re: (Score:2)
Nor, in this case I suppose, especially desirable.
I would expect that you would want a wide variety of voices to train the thing.
Listen to the general public sometime, do many of them sound like they have "professionally trained" voices?
Probably not.
If you train a speech rec. engine with "golden" voices, how can it be expected to figure out the average Joe/Jane on the street?
I routinely hear from customers whose accent (or manner of enunciation) makes it nearly
Re: (Score:2)
(thwacks self on forehead, while chanting RTFA)
Now where did I put that microphone....
I'll help (Score:2)
I would also love to see open source dictation software. [shameless plug](See my journal for why)[/shameless plug]
Re: (Score:1)
Speech recognition performance on low-noise, read "proper" speech is actually impressively good. The forefront of speech recognition research is on noisy, spontaneous and conversational speech - i.e. real world speech. Any speech data is helpful, but the state-of-the-art would actually be better served by contributions of sub-optimal speech fro
How about artificial speech next? (Score:2)
I've seen some decent ones, and the OS ones aren't better than the
Re: (Score:1)
"Steve Gibson" speech DSP words
Did turn up this reference though "world gay escort dating free of charge", heh. A result of keyword stuffing in the linked site rather than a legitimate hit. Steve must be proud that his name is deemed such a valuable search keyword.
huh? (Score:1)
Why not use the NIST database? (Score:4, Interesting)
Since it was done by NIST, I imagine the database is available. Note that it will be limited by the telephone call quality (4 kHz).
Re: (Score:1)
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC97S62 [upenn.edu]
This kind of speech, um, yeah, is a - a world away, you know what I mean, from how most users speak to dictation software, command-and-control, etc.
The Linguistic Data Consortium (LDC) http://www.ldc.upenn.edu/ [upenn.edu] is the main source of speech corpora that I know about. You have to pay and possibly be a member (depending on the corpus you want I think). The catalog covers all kinds of speech. Another sourc
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
How will it be distributed? (Score:2, Interesting)
Re: (Score:2)
BitTorrent! (Score:1)
Nothing to hear here. Move along.
Waist uptime (Score:1)
How many times ... (Score:2)
Sorry, someone was excited about "Acoustic Models to be used by Speech Recognition Engines". [giggle]
Isn't this reinventing Librivox's wheel? (Score:4, Informative)
I'm probably missing something in regards to why this stuff can't be used...
Re: (Score:1)
I would assume the phrasing patterns would be quite different.
Re: (Score:1)
Again with the GPL (Score:2)