Slashdot
Speech Recognition, Voice Verification -- Free
Posted by
timothy
on Wed Jul 19, 2000 05:04 PM
from the can-I-say-"lee-andre-lee"? dept.
ten thirty writes: "TECHNOCRAT.NET recently featured a great article regarding the dawning (well, it's only a few years old anyway) of speech recognition software within the open source community. In particular, the Sphinx project of Carnegie Mellon University is discussed, as well as some other systems such as Festival and a public domain project at the University of Missouri. The notion here is that the GUI, which has come so far over the past two decades, will eventually be supplanted, at least for some applications, by the VUI. The question is, will the open-source community allow the integration of this technology into our society to be spearheaded by closed-source vendors?"
This discussion has been archived.
No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
Re:How practical is use of this technology? (Score:4)
It'd also be nice in a wearable computer system, though I'm sure someone already has a patent on using voice to control a wearable computer.
Patent issues and Microsoft (Score:3)
It is further no secret that Microsoft has been hiring machine learning and speech recognition experts from anywhere they can find them, and paying them pretty well.
You can bet that the best voice recognition sequences will be patented and protected in the US.
Re:Did anyone else notice ... (Score:4)
Computer: Unable to toast lorry
User: No, Post, P
Computer: Command 'host tea'. Tea is scheduled for 16:00
User: Post the damn story
Computer: Command 'roast ham'. Oven is preheating. Would you like to serve the Ham with tea?
User: Cancel, I do not want ham, I do not want spam, I do not like it in a car, I do not like it at the bar. Just post the story.
...
drawbacks to VUI (Score:3)
Even if VUIs work perfectly, there are two major drawbacks that will make many people prefer GUIs:
1. Privacy. Do you really want to be saying things like "browse to pervert site dot com" or "send bankruptcy memo" out loud? Typing and clicking are more discreet.
2. Annoying others. I don't want to be in an office full of people babbling at their computers. I also don't want to be on a plane or in a restaurant near somebody babbling commands at his laptop. It's bad enough already with cell phones.
That being said, there will be a place for VUIs in critical hands-free situations such as in cars.
VUI == Workplace Stress (Score:3)
-----
Re:How practical is use of this technology? (Score:3)
"Play U2"
instead of:
find MP3 player icon, deiconize, click load, click U2 playlist, click OK, click play, iconize, put mouse back in editor window, recommence hacking
I think the quick verbal shortcut causes a much smaller disruption of concentration and saves a ton of screen real estate. For those of us insane people who have 6-7 emacs windows, 2-3 netscape windows and 3-4 xterms going on 4 virtual desks, this would be a HUGE benefit.
I can't tell you how much mental energy I have saved since I got a box with external volume control instead of a GUI volume tool. I think a voice interface would help in similar ways.
So, I think voice-assisted GUIs would be great, accelerating the experience just like keyboard shortcuts help keep experienced users sane today.
Re:How practical is use of this technology? (Score:3)
In text editors, I would like to say "oops" and automatically have the last word deleted. That would definitely speed me up, but my cubicle neighbors might get tired of hearing a constant stream of "oops... oops... oops" over the wall. I bet it wouldn't be hard to patch that into emacs...
Bruce's description of a voice-controlled car stereo is also good. This is especially interesting to me, because I am thinking of building an MP3 player for my car that will be a full X86 computer. How do you do a user interface that allows you to scroll through hundreds of albums and thousands of songs? While driving?
Voice command seems like the best solution. Say "Play... U2... Zooropa... Lemon", or "Play... Beethoven... Sixth Symphony". (imagine a little chime from the computer during each "..." to indicate it "got" it and is ready for more input.)
I should be able to operate that while driving without driving off the road. And, a well written voice command program could be pretty accurate for that application, since the set of valid inputs is reasonably small at each step.
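To make the "small set of valid inputs at each step" point concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: the LIBRARY contents, the match and resolve helper names, and the use of difflib string similarity as a stand-in for whatever scoring a real recognizer would do on audio. It only shows why restricting each step to the current artist, album, or track list makes a sloppy recognition result recoverable.

```python
import difflib

# Hypothetical music library: artist -> album -> list of tracks.
LIBRARY = {
    "U2": {
        "Zooropa": ["Lemon", "Numb", "Stay"],
        "Achtung Baby": ["One", "Mysterious Ways"],
    },
    "Beethoven": {
        "Sixth Symphony": ["Allegro ma non troppo"],
        "Ninth Symphony": ["Ode to Joy"],
    },
}


def match(heard, choices):
    """Return the valid choice closest to the recognized text, or None."""
    lowered = {choice.lower(): choice for choice in choices}
    hits = difflib.get_close_matches(heard.lower(), list(lowered), n=1, cutoff=0.6)
    return lowered[hits[0]] if hits else None


def resolve(utterances, library=LIBRARY):
    """Walk artist, album, track, one spoken chunk per level.

    Returns (path, ok). ok is False as soon as a chunk matches nothing
    that is valid at the current level.
    """
    node = library
    path = []
    for heard in utterances:
        # Only the names valid at this level are candidates.
        picked = match(heard, list(node))
        if picked is None:
            return path, False
        path.append(picked)
        # Descend into the chosen entry; a track is a leaf with no children.
        node = node[picked] if isinstance(node, dict) else []
    return path, True


print(resolve(["u2", "zooropa", "lemmon"]))
```

The chime between words would go where each level is matched: a hit means "got it, next", a miss means "say that again" without throwing away the part of the path already chosen.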
I'm enthusiastic about the possibilities. I predict that once people have this, they will wonder how they ever survived without it. Just like wheels on mice!
Torrey Hoffman (Azog)
Open Source vs. Design vs. Basic Research (Score:3)
- How to do speech recognition at all
- How natural languages express meaning using words and sentences
- How to integrate sophisticated speech recognition into user interfaces that will be useful/meaningful/interesting for users.
Research tends to happen either at universities or at commercial research labs like Bell Labs, Xerox PARC, and IBM, where people can spend a long time looking at hard problems. While that can happen in an open academic-type environment or a closed intellectual-property-hoarding secret laboratory, research is a much different environment from design or implementation, which are closer to what open-source development processes are good at: things that amateurs can do using their own resources, or that professionals (including advanced college students) can do by piggybacking off their own work, like hacking operating systems or compilers. We're fortunate that enough of the development of speech recognition has been open so it's accessible for use - learning how people make phonemes with their mouths, words out of phonemes and sentences out of words is an immense job if you have to reinvent it.
Early user interfaces were simple - if your recognizer can only do 10-20 words, it doesn't take deep design research to design an interface - telephone companies do obvious things with 0-9/yes/no/help, and computer interfaces pick a dozen Mostly Harmless commands so that a misrecognized command or somebody walking down the hall talking doesn't trigger "rm -rf /", it just triggers ls or "play cd" or something.
But now that voice recognition can handle vocabularies of hundreds or thousands of words, depending on your taste in accuracy and user-specific training, figuring out good designs for voice interfaces that make sense in the environments where you expect people to use them is a large set of research problems. Open source is ok for doing implementations of specific proposals for what that interface should look like, and pretty good for tweaking existing designs to do more things, and really excellent for connecting the voice interface up to other things that are already written. But overall, it's a design problem, not a hacking problem.
As far as things I'd see that are useful that voice recognition interfaces can do, some are pretty obvious, like cellphone dialers and dictation tools - you'd like to tell your handsfree phone "call Alice" while you're driving, and have it look up Alice in a database, rather than typing or saying "+1-987-655-3210, er, umm that was 654-3210". (Some cellphone companies provide this - it's not based in your handset, but at the cellphone company's end, using a database lookup on your phone number to retrieve your voice settings and your list of names and phone numbers.) If you're the canonical carpal-tunnel-abusing hacker, you'd like to dictate some of that business plan by voice using a voice editor that can stitch together words you've recycled from previous documents instead of having to mouse it in.
Beyond that there's a lot of open territory - it'd be nice to be able to walk down the street with a headset on or sit at a desk with a speakerphone or headset and tell your computers what you want them to do, who you want to communicate with, have them tell you stuff you want to know, etc. It's not a direct substitute for reading off a screen and pointing with a mouse; it'll change your workstyle just like adding GUIs and getting cellphones did.
How practical is use of this technology? (Score:3)
I can maybe see controlling a speaker-phone or a TV with this, but button-based interfaces are pretty efficient for this as it is. I can maybe see using this for quick shortcuts on a computer, but again, current interfaces are pretty efficient.
For massive data entry or for extended interactive editing, this probably isn't practical (try giving a multi-hour lecture - not too comfortable, is it?).
So, I'm wondering where a verbal interface _is_ practical.
Re:Xvoice (Score:3)
What happened to open source Via Voice? (Score:3)
Well, at least there is some choice!
jcc
A need for an "open source" speech database (Score:5)
One project which addresses the problem is the Open Mind Initiative [openmind.org], and more specifically the Open Mind Speech Recognition [sourceforge.net] project, for which I am the coordinator. Our goal is to collect data from people on the internet and make that data available to people working on speech recognition with a GPL-like license. I think this is the key to having OSS speech recognition engines perform as well as the proprietary ones. The project is not very advanced yet, but any help would be really welcomed.
Re:How practical is use of this technology? (Score:3)
Recent discussion on the via voice mailing list (Score:4)
I replied with the following-
I would suspect, that the primary reason [there are so few developers of via-voice] is the desire of (free software) programmers to not make their code dependent on non-free (as in speech) software. For better or worse, many Linux programmers will reject, out of hand, any library or software that is not based upon one of the standard free licenses (GPL, LGPL, BSD, NPL, Artistic, etc.).
Given that IBM is unlikely to change its licensing terms in the near future, and that (free) programmers are unlikely to change their moral stance on using 'non-free' software, development with viavoice will likely be limited to commercial programmers, or those situations where STT/VTS are a necessity such as applications for the blind.
Tom M.
TomM@pentstar.com
In a later post he asked our opinion on the IBM Public License. My reply was thus...
"I did a search on the web for discussions on the IBM Public License (IPL).
According to Bruce Perens, (and the general consensus...)- the IPL is OSD
(Open Source Definition) compliant, but not GPL compatible. Being OSD
compliant will certainly encourage more developers, however, how many is the
big question. Of the free software developers out there, my guess would be
that 80% (likely more?) will only develop (in their free time) with software
that is GPL compatible (i.e. GPL, LGPL, BSD, and a few others). However,
for 'work' stuff, the IPL is less problematic, and thus would lead to more
commercial development (not as much as the GPL, BSD, LGPL - but mostly for
'religious' reasons).
Personally, I would recommend going with the GPL, which would result in full
and quick integration with all of the Linux distributions, and allow source
from many useful GPL and LGPL projects to be integrated/merged with it. I'm
guessing that the developer good will from such an action would be
phenomenal. The suggestion of another poster that viavoice should be viewed
as infrastructure is very valid. However, I'm a realist. There is almost
zero chance of IBM doing that unless they come out with their own Linux
distribution, and tout complete voice integration as the big selling point,
or, the dollar value of developer good will is high enough to justify
whatever future lost revenue would be. (I'd bet that it certainly would be-
having a 'truly free' voice software solution would be rather impressive.
The fact that viavoice isn't considered a drowning/dying product (I.e.
Netscape) or (in the case of Apple) one that was previously free - would be
all the more impressive.)
So, given the above, I would say that changing to the IPL might well give vv
a strong pull for more developers, certainly enough to justify the change.
Of course, as suggested above, an even stronger case can be made for the
GPL.
Tom M.
TomM@pentstar.com
"
If you would care to contribute to the conversation, you can join by sending email to
join-viavoice@laser.sparklist.com
Thanks,
LetterRip
Tom M.
Re:drawbacks to VUI (Score:3)
"Lockscreen" as you walk away from your cube
"Mute" to silence your music when a colleague stops by to talk
"Raise" to bring a window to the front without moving your hands from the keyboard
"Print" when you're too lazy to type CTRL-P
All of these are low-mental-energy ways of doing things you can already do with a normal GUI. Just like the mouse simplified some aspects of the pure-CLI interface (think copy-and-paste), even sparse voice input can improve the current state of GUIs.
My experience with voice systems is pure hobby and very rudimentary, but I think I've read that simple keyword-driven voice systems are MUCH simpler than the free-dictation systems needed for, say, word processing via spoken word, so the examples above should be feasible now.
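As a rough illustration of why the keyword-driven case is the easy one: once a recognizer hands back text, a sparse command set is just a table lookup. The Python sketch below is only that half of the problem. The COMMANDS table, the dispatch name, and the action strings are all made up for the example; a real version would call into the window manager or mixer instead of returning a description, and the recognizer itself is assumed to exist elsewhere.

```python
# Hypothetical keyword table. The strings stand in for real desktop actions.
COMMANDS = {
    "lockscreen": "lock the screen",
    "mute": "silence audio",
    "raise": "bring focused window to front",
    "print": "send document to printer",
}


def dispatch(transcript, commands=COMMANDS):
    """Return the action for the first known keyword heard, else None.

    Anything that is not a known keyword is ignored, so hallway chatter
    picked up by the microphone does nothing.
    """
    for word in transcript.lower().split():
        word = word.strip(".,!?")
        if word in commands:
            return commands[word]
    return None


print(dispatch("Mute!"))
print(dispatch("so anyway, about that meeting"))
```

Free dictation has no such table to fall back on, which is where the difficulty comes from.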
Re:How practical is use of this technology? (Score:4)
Just as GUIs weren't practical in 1980. Or pick an earlier year if you would dispute that. The point is that this idea is more than current technology can handle.
GUIs allow users to do more with less knowledge and less work if properly designed. For instance, it is easier to drag select several folders then drop them into the trash, than it is to explicitly name those directories in a CLI.
But the GUI didn't replace the CLI, it augmented it, and relegated it to a secondary function, or one for power users only. The Next Big Thing, will do the same.
I am one click away from reading new mail after it comes in, and I don't think it would be a great improvement to have to say out loud, "Read new mail." But for less experienced users, being able to say, "New message to Bob Jones, copy marketing team, blind copy Jon Bones. Dear Bob, I love you like the brother...." That's valuable, and would be quicker than CLI or GUI if it worked.
The challenges are myriad. How do you ensure privacy? How do you achieve accuracy? (Though accuracy never stopped the CLI or GUI).
Re:The possibilities.. (Score:5)
No worries; your computer will dutifully add to the command line:
bash$ Our imps pace the chef cap a dull ours pace lashing turn.
which may give the grammar checker fits but which won't erase your hard drive.
Re:How practical is use of this technology? (Score:5)
There are three problems with voice apps right now.
First is the lack of off-the-shelf recognition. Dragon gets better than 90%, IBM ViaVoice MIGHT get about 60%, others score well below that. For someone with no hands and a non-technical nurse for day-to-day assistance, Dragon ends up being the choice for now. Mind you, an ideal system should be able to be installed with one or two clicks, and then be on Voice Recognition through the rest of the process, or it won't work for most of the physically impaired. As things stand, Dragon is all he can consider using, being that the other packages he has demo'd have all required AT LEAST 45 min of voice recognition training to be done at a given time prior to getting functionality. Given that the amount of time that most quads get with someone who knows a delete key from a return key is limited, most of these apps are pretty useless. Dragon is the only one that will let you do this at your leisure.
Second is impact on resources. Most disabled people don't have them. My friend's box is built out of donated parts. The software, Dragon, costs more than $400 and was donated as well. Now, Dragon gets that 90% and stability from running on at least 256M of RAM, on a 500 MHz processor. Did I mention that these closed source software houses completely revamp their software every so often, requiring you to buy a completely new version just about whenever you upgrade your hardware? Additionally, my friend is one of the very lucky few to know anyone in the computer biz. There are three of us that spare time for him whenever we can, but most people are stuck buying their time. Think of what this means when it comes to upgrading every so often. Remember, you can't even hit a return key, much less open up your box. For that matter, neither can your nurse, really.
Third is actual usability. Most of these voice systems are designed for and by sighted people who can use their hands. 'Nuff said.
Ideally, it would take the efforts of several physically impaired people working with some coders to come up with a working Voice Recognition package that was open-sourced and designed with the impaired user in mind. It is nice that some of the framework apps useful for that type of project are now open-sourced.
Telephony. (Score:3)
Think e-commerce.
It's far easier for a consumer to pick up a phone and talk to a computer to place their order for X widgets than it is for them to log on to the Internet, type in a URL, etc. *Far* easier.
This will be the 'tractor app' for voice recognition, and in many cases it already is... Called AT&T customer support lately? Probably half of that call was handled by a computer listening to what you were saying...
Other posters are correct in saying that it may not seem appropriate right now, just like the WIMP interface didn't seem appropriate in the early 80's, but there *will* be uses for it.
I've already built a Telephony-based interface to my Linux web server. From anywhere in the country, I can call it up, get an uptime reading, ask for a running total of web orders, restart the web process, even shut the machine down, all over the telephone.
Telephones are an ideal interface to a computing system. Okay, so you're not gonna want to play Quake with it (though I'm sure some fool hacker will add it, heh heh), or play with the Gimp over the phone (hey, whatever turns you on), but there are plenty of interfaces that could be replaced with the telephone and be a *hell* of a lot easier for people to use - web forms, for example, could really easily be replaced with a voice recognition software-running dialup #...