Slashdot
Text to Speech Software Copies Any Human Voice
Posted by
CmdrTaco
on Tue Jul 31, 2001 11:33 AM
from the now-thats-something-clever dept.
mindpixel writes "A New York Times report (registration required) states that AT&T Labs will start selling speech software that it says is so good at reproducing the sounds, inflections and intonations of a human voice that it can recreate voices and even bring the voices of long-dead celebrities back to life. The software, which turns printed text into synthesized speech, makes it possible for a company to use recordings of a person's voice to utter things that the person never actually said."
This discussion has been archived.
No new comments can be posted.
299 comments
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
another thing (Score:3)
--
Geoff Harrison (http://mandrake.net)
open source speech synthesis (Score:4)
Actually there is even an example of Hemos himself, doing a talking clock, at http://www.festvox.org/ldom/ldom_time.html
--
Geoff Harrison (http://mandrake.net)
Re:Try it out! - It's not that great (Score:3)
Voice over IP compression; useful for the deaf (Score:3)
This could also be very useful for deaf telephone users. Currently, a deaf person relies on a human relay to talk to a non-TDD equipped person. With good speech-to-text and text-to-speech technology the human middle-man could be removed, saving a ton of money.
Human rights? (Score:3)
Let's say someone wanted to make me say something in direct contradiction to my normal views, then publish that. Now, I don't consider myself famous enough for this to be a problem.
The flipside for law enforcement is perhaps even more scary. What if I published a recording, generated in this way, of (for example) Gary Condit (sp?) confessing to having killed Chandra Levy (again, sp?)? For a parallel (and I never thought I'd cite Lois & Clark... Promise I'm not a fan, my sister used to watch it over meals so we all had to, I have a weird memory, honest really...), consider the episode where a photographer produces a pre-wedding image of them in bed which could have been taken properly but was actually faked due to a lost film.
This has been coming for years, I know, but it's still a nasty big can of worms.
On Yahoo w/o registration here (Score:3)
Read it on Yahoo without registration here [yahoo.com].
On the other hand... (Score:4)
------------------------
Re:One more step... (Score:3)
GPL'd Klatt Synth Source [bham.ac.uk]
RSynth Speech Synthesizer - Klatt based synth - go to
KPE80 - A Klatt Synthesiser and Parameter Editor [ucl.ac.uk]
Worldcom [worldcom.com] - Generation Duh!
One more step... (Score:4)
Another good speech synthesizer, no doubt an early version of the AT&T one (possibly?), is by Lucent [bell-labs.com].
Still, I am amazed at the quality of the AT&T system - it sounds almost perfectly natural. To the naysayers who say "No, it isn't natural": what you have to realize is that this simple demo doesn't let you tweak the variables that would allow the inflections or type of voice (whispering, etc.) to really come through. It is too bad they don't give an advanced interface with a FAQ or some other form of documentation to allow this, but I imagine that if they did, it would probably take quite a while to compose even a simple sentence. (I remember the hell you had to go through with an old Radio Shack speech synth for the Color Computer, specifying individual phonemes just to get proper speech to come out - it could pronounce many words, but on others it just fell flat on its face.)
Finally - something I want everyone to ponder. Take a look at this old article [slashdot.org] (it was about Square redubbing FFTM) - once it loads, search for "cr0sh" and "I dare say" - you will come across a series of comments about what I think may happen in the future - what is funny is that the comments in reply to my take on things sound like your typical naysayers. How many computers were we supposed to only need back in the 60's? How much memory would people "only" need again Mr. Gates?
What I predict will come about - probably sooner than we can all imagine. It may not be cheap enough to do it now, at a quality that people would watch, fast enough to be done quicker than what can be done with live actors - but it is all software and hardware - this stuff will get faster and cheaper. Anybody who has been in this business long enough knows that it will happen. There might still be a need for actors, and voice artists, and such - but they probably won't have the "god" status society seems to confer on them now (with the exception, perhaps, of stage acting - which will probably enjoy a huge comeback).
Worldcom [worldcom.com] - Generation Duh!
Code words and access lists (Score:3)
A general can't just call up the guard post and order the person on duty to let unknown people in. I once was on duty in a radio room and we had a Very Important Senior Officer come by to see what we were doing. He wasn't on the access list, so we wouldn't let him in, even though we recognized him. He had to go get the Colonel, who was on the list, to get in. We got attaboys from him, the Colonel, and our NCOs for that. If we'd let him in, we'd have been in deep doo doo.
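The rule in that story can be sketched in a few lines of Python. This is only an illustration: the names on the list and the `may_enter` helper are made up, and no real guard procedure is implied. The point it shows is that recognizing someone (by face or by voice) never enters into the decision.

```python
# Hypothetical access list; the names are invented for illustration.
ACCESS_LIST = {"Col. Smith", "Sgt. Jones"}


def may_enter(name, recognized_by_guard=False, escort=None):
    """Admit only people on the list, or visitors escorted by someone on it.

    recognized_by_guard is accepted but deliberately ignored: knowing who
    someone is (or whose voice it sounds like) does not grant access.
    """
    if name in ACCESS_LIST:
        return True
    # An escort only counts if the escort is on the list.
    return escort in ACCESS_LIST


# The senior officer is recognized but not listed, so he is refused
# until the Colonel (who is listed) comes along.
print(may_enter("Very Important Senior Officer", recognized_by_guard=True))
print(may_enter("Very Important Senior Officer", escort="Col. Smith"))
```

Under a rule like this, a perfectly faked voice on the phone gets an attacker no further than the real officer got in person.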
Re:Cool... and disturbing. (Score:3)
On the radio this morning, CBS ran a short blurb about this system, including hypothetical news and sports reports. It sounded pretty good, too...if you've done anything with TTS before, the speech quality of this system was considerably ahead of what's been done before. (Light years ahead of Speak & Spell, but that's almost a given at this point. Compared to more modern systems such as Festival, it still comes out ahead quite a bit.)
The announcer posited that, one day, his job could be in danger from this kind of technology. With some broadcasters' penchants for cutting costs any way possible (somebody either here or on K5 posted a link about Clear Channel and its shenanigans a while back, but I can't find it), DJs could end up going the way of the dodo as well.
Re:Doubtful. (Score:3)
Of course, it would need a corpus of recorded and (possibly automatically) tagged speech from the person they wish to imitate, but that's not impossible. Ever notice how the generated speech on some speech-recognizing phone systems (such as American Airlines) is getting better and better, with more and more human-like pronunciation and intonation? And these are the production systems -- not the research systems. I'm not saying they're perfect (and, of course, they're dealing with multiple intonations of fully recorded words, not subwords), but the problem is a far cry from "true AI", and the work on it is getting better all the time.
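To make "multiple intonations of fully recorded words" concrete, here is a toy sketch in Python. Everything in it is an assumption for illustration: the word inventory, the three intonation tags, the file names, and the rule for picking a tag. It is not how the AT&T software or any airline's phone system actually works.

```python
# Hypothetical inventory: every word was recorded once per intonation tag.
TAGS = ("level", "rising", "falling")
WORDS = ("gate", "three", "flight", "one")
RECORDINGS = {w: {t: f"{w}_{t}.wav" for t in TAGS} for w in WORDS}


def choose_tag(index, count, is_question):
    """Pick an intonation from position in the sentence and punctuation only."""
    if index < count - 1:
        return "level"
    return "rising" if is_question else "falling"


def plan_playback(sentence):
    """Return the list of recordings to play back for a sentence."""
    is_question = sentence.strip().endswith("?")
    words = [w.strip(",.?!").lower() for w in sentence.split()]
    return [
        RECORDINGS[w][choose_tag(i, len(words), is_question)]
        for i, w in enumerate(words)
    ]


print(plan_playback("Gate three?"))
print(plan_playback("Gate three."))
```

The same word comes out of a different recording depending on where it falls, which is the trick that makes canned prompts sound less robotic. A rule this crude only looks at position and punctuation, so it says nothing about meaning.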
Check out http://www.sls.lcs.mit.edu/sls/publications/1998/
-Puk
p.s. If this gets modded up, I could cap my karma on this.
So much for voice print security systems. (Score:3)
TomatoMan
Re:Movie dubbing today... (Score:3)
One solution would be to get demo reels of the actors saying various sounds in the target language. The downside is that they will come across speaking the foreign language with a terrible accent...a Japanese actor might be fairly unintelligible speaking English since they are missing so many sounds (la=da=ra, no th-, etc).
It's definitely a neat idea though.
The AT&T "Rich" Voice (Score:3)
If you haven't already, listen to the AT&T Customized Voice Product Demo (U.S. English, Male: "Rich") [att.com], truly amazing.
With online news feeds coming in to the local radio station and the quality of the "Rich" custom voice, I have a feeling a lot of announcers may be going bye bye. In these samples he's way better than our local guy. Plus, since Shoutcast and such already have all the song info, think of the cool DJ announcing you could have.
My roommate and I used the older online AT&T TTS to do our answering machine message for the dorm... It did pretty well with "This is mack daddy JD and phat daddy John's room." That's the only message we've ever had that people would call back just to hear. With the old AT&T system you could adjust the pitch and various other settings to get it to sound good; I can't imagine what their new system will do!
If you don't think too good, don't think too much.
KingoftheBongo.com [kingofthebongo.com]
So? (Score:3)
Try it out! (Score:5)
They also have recorded demos you can listen to, but I thought the interactive demo was pretty nifty.
--BEGIN SIG BLOCK--
I'd rather be trolling for goatse.cx [slashdot.org].
Re:Grrrreat (Score:3)
His-story.. I hate that term. Who are you? Michael Jackson?
Phone Sex With Anyone!! Call Now 1-800-ANJOLIE (Score:4)
Re:Entropy-licious (Score:4)
No, wait. We already have laws that cover this. I think they're called perjury...
Fakes (Score:4)
Absolutely not. And for the same reason that second-printings, plastic surgery, and fake breasts all suck - they're not the real deal.
And as a die-hard Cubs fan since the age of 4, might I also add that the World Series drought for the last half century has taken on a sort of religious significance, not unlike the 40 years the Hebrews spent wandering in the desert. And Harry Caray was our Moses - resurrecting his voice without the man behind it is tantamount to sacrilege (not to mention unbelievably morbid!).
-------
There's an evil use for this too: (Score:5)
You hear that? There is to be no telemarketing use of this technology!
This could be useful in games. (Score:5)
Try it out! (Score:3)
I'm sure it's not the same thing as the one mentioned in the article, but I'm pretty sure the one in the article is at least based on this one.
Try it out!
Other Online Demos (Score:5)
http://www.elantts.com/indemo.htm [elantts.com]
http://www.cstr.ed.ac.uk/projects/festival/userin
http://www.flexvoice.com/demo.html [flexvoice.com]
http://www.acuvoice.com/downloads/ttsdemo.html [acuvoice.com]
I searched for good TTS software to give voice to some of the 3d animations I did in max
Re:Job cuts in Hollywood... (Score:4)
And who says Tom Hanks ever has to fade away? It could be a brave new world where your future kids and mine grow up watching the same stars we have today and some from yesterday. I can imagine my grandchildren raving about that new Humphrey Bogart action film. Not so far fetched really.
And for those that wonder about the legal aspects
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~ the real world is much simpler ~~
Movie dubbing today... (Score:5)
They could start by fixing all those old Chinese and Japanese action/monster flicks dubbed by the same guy talking in false baritone and falsetto.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~ the real world is much simpler ~~
I can see it now (Score:3)
Saddam Hussein: Somebody set us up the bomb!
God help us all!
I don't know about it.... (Score:3)
But this isn't what they're saying:
"The software [...] turns printed text into synthesized speech"
Which begs the question "How does the software know what inflection to associate with the printed text?"
I know that the same words can sound radically different. Take the phrase "one, two, or three" in each of the following contexts (note that none begins or ends a sentence):
- "I can't imagine why ANYONE would want four subnets in their own house. I mean one, two, or three I can imagine, but four??"
- "Please press one, two, or three at the tone."
- "Okay, so it was in the early morning before 4. But can you be at all more specific? Do you have any idea whether it was at around one, two, or three AM?"
- "Settings of four or five are considered dangerous, while settings of one, two, or three are considered to be within acceptable parameters."
I think that if you record yourself saying the above phrases, then crop out just the highlighted phrase, you'll find a different inflection in each one. Without understanding what a sentence says, or, more precisely, what the person means who is saying the sentence, the fact that you can produce any inflection won't help you determine which one is right.
I found Liz and Ike playing Scrabble while very drunk, and putting on all sorts of nonsensical words. I even saw "Zisis's", using a piece of rice for an apostrophe! (Zisis is a Greek convenience store near us).
I told Liz and Ike that I thought they were crazy. "Heheh, yeah we're crazy", Ike says, "but each of us only put one word down that broke the rules in a major way."
"Which words were those?"
" 'Zisis's' and 'Windology' "
Since Liz was the crazier of the two, I ventured a guess, "Liz's is Zisis's, isn't it?"
"Nope. Liz's is 'windology'. 'Zisis's' is mine." Ike replied proudly.
Anyway, the point of this exercise is to show that a human reader reading this can make the phrase "Liz's is Zisis's, isn't it" sound natural, but I bet any speech-synthesizing software that just follows rules will make it sound incomprehensible. That's because speech is more than reading things by set rules -- it is reading things to reflect your internal parsing of the sentence.
Not to mention the fact that actors can read the same line in a thousand different ways to show a thousand different "interpretations" (states of the character who speaks it, or parsings of the sentence). How will this software produce them, if it only has the same text to parse?
Either someone manually will give it an inflection, or it needs (or would need before truly being able to make good its claim) a human oral reading to "mimic", where it can use the synthesized voice to sound the same inflection in a different voice. Now that would, as the old mis-translated Coke slogan goes, "bring your dead relatives back alive."
Mere dancing with power brooms? Ha, now celebrities will be telling you about how easy to use AOL is. So easy to use, no wonder it's number 1 -- even among the dead!
Gee, I can hardly wait.
(It was intended to sound like "coca cola" when its Chinese characters were pronounced).
--
John F. Headroom (Score:3)
what
what
what
what
what
you can do for for your country.
Re:Entropy-licious (Score:4)
Good for them... Better for us! Who wants dumpy Sandra Bullock, bug-eyed Steve Buscemi, or smarmy Ben Affleck when we can have perfect, artist produced, fan-boy (and fan-girl) material like Aki from FF?
Cool... and disturbing. (Score:3)
What happens when you get a sample of some General's voice and then use a synthesiser to call up the poor kid on guard duty and get him to let a bunch of terrorists enter the base?
Most excellent (Score:3)
Re:Cool... and disturbing. (Score:5)
Its main use is for telephony (surprise!) but I suppose it'll be turning up in new and exciting places.
Re:Cool... and disturbing. (Score:5)
Obviously if this does happen, then all their bases...aww, forget it.
--
Doubtful. (Score:5)
"Yeah, right!"
"Officer, it is clear to me that you are in fact the one who is inebriated."
"I found it that way. Honest."
"Now, nothing has really changed since the last contract, we just cleaned up a few details; Please sign and return ASAP."
"But Billy got one...why can't I? Please?"
"Would you like to move to the sofa?"
I don't buy it for a minute. To do what they claim would require real AI(tm).
-- MarkusQ
Entropy-licious (Score:5)
Of course, I don't think this kind of technology should be "outlawed" or "restricted", that will only make it easier to be used maliciously, as with any technological advancement.
Another interesting point: with the new Final Fantasy: The Spirits Within movie, actors are beginning to consider copyrighting their likenesses, since they can be reproduced on a computer with frightening quality and clarity. Perhaps this applies to voice reproduction as well.
This sounds like a very beneficial technology, especially for games, where a high-quality voice synth could replace volumes of digitally recorded and compressed audio files... but it opens the door for some really frightening possibilities of fraud, social engineering, and copyright side-stepping.