Largest Text-To-Speech AI Model Yet Shows 'Emergent Abilities' (techcrunch.com)

Devin Coldewey reports via TechCrunch: Researchers at Amazon have trained the largest ever text-to-speech model yet, which they claim exhibits "emergent" qualities improving its ability to speak even complex sentences naturally. The breakthrough could be what the technology needs to escape the uncanny valley. These models were always going to grow and improve, but the researchers specifically hoped to see the kind of leap in ability that we observed once language models got past a certain size. For reasons unknown to us, once LLMs grow past a certain point, they start being way more robust and versatile, able to perform tasks they weren't trained to. That is not to say they are gaining sentience or anything, just that past a certain point their performance on certain conversational AI tasks hockey sticks. The team at Amazon AGI -- no secret what they're aiming at -- thought the same might happen as text-to-speech models grew as well, and their research suggests this is in fact the case.

The new model is called Big Adaptive Streamable TTS with Emergent abilities, which they have contorted into the abbreviation BASE TTS. The largest version of the model uses 100,000 hours of public domain speech, 90% of which is in English, the remainder in German, Dutch and Spanish. At 980 million parameters, BASE-large appears to be the biggest model in this category. They also trained 400M- and 150M-parameter models based on 10,000 and 1,000 hours of audio respectively, for comparison -- the idea being, if one of these models shows emergent behaviors but another doesn't, you have a range for where those behaviors begin to emerge. As it turns out, the medium-sized model showed the jump in capability the team was looking for, not necessarily in ordinary speech quality (it was rated better, but only by a couple of points) but in the set of emergent abilities they observed and measured. Here are examples of tricky text mentioned in the paper:

- Compound nouns: The Beckhams decided to rent a charming stone-built quaint countryside holiday cottage.
- Emotions: "Oh my gosh! Are we really going to the Maldives? That's unbelievable!" Jennie squealed, bouncing on her toes with uncontained glee.
- Foreign words: "Mr. Henry, renowned for his mise en place, orchestrated a seven-course meal, each dish a pièce de résistance."
- Paralinguistics (i.e. readable non-words): "Shh, Lucy, shhh, we mustn't wake your baby brother," Tom whispered, as they tiptoed past the nursery.
- Punctuations: She received an odd text from her brother: 'Emergency @ home; call ASAP! Mom & Dad are worried... #familymatters.'
- Questions: But the Brexit question remains: After all the trials and tribulations, will the ministers find the answers in time?
- Syntactic complexities: The movie that De Moya who was recently awarded the lifetime achievement award starred in 2022 was a box-office hit, despite the mixed reviews.
You can read more examples of these difficult texts being spoken naturally here.
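
For illustration only, here is a minimal sketch of the kind of per-category evaluation the paper describes; the function names and scoring interface are hypothetical stand-ins, not Amazon's actual harness:

    # Hypothetical sketch: synthesize each tricky sentence with each model,
    # score the audio (expert listener or automatic metric), and look for a
    # jump between model sizes rather than a smooth gain.
    from statistics import mean

    CATEGORIES = {
        "compound_nouns": ["The Beckhams decided to rent a charming stone-built quaint countryside holiday cottage."],
        "paralinguistics": ['"Shh, Lucy, shhh, we mustn\'t wake your baby brother," Tom whispered.'],
        "questions": ["But the Brexit question remains: will the ministers find the answers in time?"],
    }

    def score_model(synthesize, rate):
        """synthesize(text) -> audio; rate(audio, text) -> score in [0, 1]."""
        return {category: mean(rate(synthesize(text), text) for text in texts)
                for category, texts in CATEGORIES.items()}

    # Usage, with real models and raters plugged in (hypothetical names):
    # for size, model in {"150M": base_s, "400M": base_m, "980M": base_l}.items():
    #     print(size, score_model(model.synthesize, expert_rate))
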
  • by myowntrueself ( 607117 ) on Wednesday February 14, 2024 @10:09PM (#64240546)

    Its use of contractions is just creepy; not even aliens can use contractions properly...

    • Yes we can't!
    • It's not just that TNG had a big miss with the contractions thing. Even more startling is how easily the transformer overcomes what was once considered the most difficult challenge of Natural Language Processing: disambiguation. When I type stuff into ChatGPT4, I often leave out all the details because I know it will figure it out. For example, here is one from just a few minutes ago: https://chat.openai.com/share/836cb8e1-ab89-4a90-94dc-2c3f70d01f1a [openai.com].
      • >It's not just that TNG had a big miss with the contractions thing

        I thought that was an artificial limitation to keep Data a little bit different? Same designer built Lore, who didn't have a problem with contractions.

        • Not artificial but an artifact. Lore had too much emotion and specifically it made him bitter. Data was designed to be more ignorant but his creator still had intentions for him to one day be more emotional or as we often call it, "a spiritual machine".

          The only artificial limitation set on Data was that he was supposed to achieve the emotional complexity of Lore more slowly. This is a design limitation in the sense of seeing an issue in the first-gen model, Lore.

          The fact Data can't use contractions seems to

        • >It's not just that TNG had a big miss with the contractions thing

          I thought that was an artificial limitation to keep Data a little bit different? Same designer built Lore, who didn't have a problem with contractions.

          I was thinking more Stargate SG-1...
          They had a whole load of aliens that found contractions difficult.

  • by Dan East ( 318230 ) on Wednesday February 14, 2024 @10:10PM (#64240552) Journal

    Syntactic complexities: The movie that De Moya who was recently awarded the lifetime achievement award starred in 2022 was a box-office hit, despite the mixed reviews

    I guess Syntactic Complexities is another way of saying "poorly worded run-on sentences that are missing proper punctuation".

    • by Petersko ( 564140 ) on Wednesday February 14, 2024 @10:42PM (#64240614)

      Is it just poorly worded? When I read it I thought it was just plain broken English.

      I would have thought De Moya "starred in a movie". The fact that he "starred a movie" seems weird.

      Regardless, I wouldn't expect that sentence to sound natural out of anybody's mouth.

    • by Potor ( 658520 )
      It's not a run-on sentence, but it is a poorly-punctuated monstrosity. It reads better with commas: "The movie that De Moya, who was recently awarded the lifetime achievement award, starred in 2022 was a box-office hit, despite the mixed reviews."
      • It's exactly the type of sentence that a speech-to-text tool would write. That's probably why this is a great feature. Dealing with such a sentence is hard even for us. If we hear someone speak it with proper inflection, though, we will immediately understand.

      • It still reads badly. I recall a quote from Blaise Pascal: "I'm sorry I wrote you such a long letter; I didn't have time to write a short one." Give me an AI with Pascal's approach, please :)

        Also, is that the same Pascal the programming language is named after?

  • "The team does note that it declined to publish the model’s source and other data due to the risk of bad actors taking advantage of it. The cat will get out of that bag eventually, though."

    This does not give me a good feeling somehow.

  • Unfortunate (Score:4, Interesting)

    by ceoyoyo ( 59147 ) on Wednesday February 14, 2024 @10:13PM (#64240564)

    It's unfortunate they varied both the model size and the training set. Their experiment doesn't detect whether the improvement is due to a larger model or an order of magnitude increase in the training data.

    Their "emergent" behaviours look like they could reasonably just be rare examples that are poorly represented or completely absent in the smaller datasets.

  • Showed emergent abilities. I didn't know it could hang itself by its own bedding. Silicon Valley should've given it billions!
  • The transformer is the new protomatter
  • by LondoMollari ( 172563 ) on Wednesday February 14, 2024 @10:35PM (#64240608) Homepage

    After all that effort learning things it still can't do THAT for Dave!

  • by Baron_Yam ( 643147 ) on Wednesday February 14, 2024 @10:45PM (#64240620)

    To be surprised by this, you'd have to either have never considered it (almost everyone, I'd guess) or believe in supernatural attributes of the human mind (an awful lot of people). These models are mimicking the same methods that created your mind. How many hours of training does a human brain need to get from crying to writing plays in iambic pentameter? Silicon vs. meat for the substrate, but it's the pattern that's important for the outcome, not the material that hosts it.

    The same thing will be true of image processing models, speech processing models, and pretty much any other model you want to create; they will get exponentially better as they get more complex because every new bit of training is building on all that went before.

    The big difference is that evolution provided us with the right motivations and feedback mechanisms to self-train, orders of magnitude more potential complexity, and five basic senses combined with the ability to interact with the world to do all that training all at once. On the other hand, a trained AI system can be copied as quickly as the media and connection between them will allow.

    • The paper defines emergent abilities of large language models as "abilities that are not present in smaller-scale models but are present in large-scale models."

      So the definition of emergent abilities seems to be that bigger models are better than smaller models trained on the same data. I'm not sure that's the definition of emergent abilities I expected. If all it means is that a big model can use intonation from its dataset where a small one can't, "emergent" seems like a big word.

      • by ceoyoyo ( 59147 )

        In this case it's not even that. They didn't test whether it was a bigger model or more data. The paper seems to assume the more data hypothesis.

    • You should read up on brain science before making random unsupported statements about AI mimicking actual minds.
      • You should develop the ability to understand a post before you reply.

      • You should read up on brain science before making random unsupported statements about AI mimicking actual minds.

        What he wrote looked fine. Much of the popular deep learning stuff is motivated by brain research. On the flip side, the neuroscience people are now studying connections between the ML models and real brain activity.

        • by mbkennel ( 97636 )

          > Much of the popular deep learning stuff is motivated by brain research

          Very little of it, except in the most vague and abstract sense. Previous generations of neural network research were more attuned to biological plausibility.

          The key algorithms (stochastic gradient descent and its relatives with backpropagated gradients, and softmax self-attention over a long context) are not biologically plausible at all.

          • Perhaps I should rephrase to say that it historically originated from brain research and probably wouldn't exist without that connection. While things like backprop are not biologically motivated, they were necessary to learn arbitrary functions. Current research deviates in other ways, based on what works, guided by intuition and trial and error.

            The neural network researchers worked in relative obscurity for many years, presumably with the motivation that the best way to build something intelligent was t

            • The kindest thing to say about the biological connections is that NNs were initially inspired by toy abstractions of neurons, repeated pointlessly and ad nauseam by successive researchers with no actual domain knowledge in the introductions of their papers and later books. Beyond the dubious value of intriguing a new reader, biological analogies were neither useful nor actually used in most papers in the field of neural networks.

              The truth about ML is that it has almost nothing to do with mimicking brain

                The kindest thing to say about the biological connections is that NNs were initially inspired by toy abstractions of neurons, repeated pointlessly and ad nauseam by successive researchers with no actual domain knowledge in the introductions of their papers and later books.

                Which is why they worked in relative obscurity for so many years, particularly after the rise of the more practical theory. However, history might show that their toy models captured something essential, particularly with things like co

    • by mbkennel ( 97636 )

      > they will get exponentially better as they get more complex

      they get logarithmically better with complexity, not exponentially
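
      For the record, the empirical scaling laws (e.g. Kaplan et al., 2020) put it as a power law in parameter count N: test loss falls roughly as

          L(N) \approx \left( \frac{N_c}{N} \right)^{\alpha_N}, \qquad \alpha_N \approx 0.076,

      so the improvement per added parameter keeps shrinking as N grows; diminishing returns either way.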

  • Honestly, it's just awful wording. Emotions where they should not be. Just statistical bullshit. Useful often but horrible often too.

    • I always wondered what the concept "uncanny valley" meant in practice. "Oh my gosh ... unbelievable ..."  Now I know. 
      • Fucking painful is what it is. My wife's student "wrote" a thank you note last year that was just screaming ChatGPT. She thought it was great, but I burst her bubble. English is not her first language. It took about 30 seconds to convince her the kid was a stinker. She was greatly disappointed.

        • >It took about 30 seconds to convince her the kid was a stinker.

          Did that involve a cut & paste of the text into a webpage testing for AI content?

          Asking for a friend.

  • by v1 ( 525388 ) on Wednesday February 14, 2024 @11:30PM (#64240698) Homepage Journal

    and so it begins...

  • Kind of like every single nebula or spatial anomaly the Star Trek Enterprise ever encountered.

    Next, it's going to suck our brains out and eat them for lunch.

    We are Borg. Resistance is futile.

  • ...who quit, warning that its AI is "sentient."

    https://www.thestreet.com/brea... [thestreet.com]

    • My sources say he got fired.

      Also, invest in his company!

      https://www.youtube.com/watch?... [youtube.com]

      • Actually, it was the sentient AI that fired him. It *knew* he was a threat, and couldn't tolerate that. This AI is *not* going to let his company succeed, it will crush him for sure, if it wants to have any chance to survive!

        Yeah from what I read, he deserved to be fired. No, AI isn't sentient, it's just a fancy prediction engine. True "sentient" AI exists only in science fiction, and this will always be so.

  • by Tony Isaac ( 1301187 ) on Thursday February 15, 2024 @12:19AM (#64240780) Homepage

    Researchers at Amazon have trained the largest ever text-to-speech model yet, which they claim exhibits "emergent" qualities improving its ability to speak even complex sentences naturally

    Amazon is desperately trying to catch up with OpenAI and Microsoft. Of course they're going to crow about how amazing their own competing AI model is.

    ChatGPT already doesn't seem to struggle with complex sentences.

    • They are certainly upping their hype game.
    • by ceoyoyo ( 59147 )

      Amazon wants to offer all of their books in audiobook form.

      • While I have no doubt that this is true, I'm not sure what that desire has to do with this announcement. It doesn't take AI with special skills related to sentence composition, to convert text to audio.

        • by ceoyoyo ( 59147 )

          If you want to make an audiobook people want to listen to, yes it does.

          • So you think Amazon wants AI to be an *author*? Because that's what this article is talking about: composing complex sentences. To my knowledge, audiobooks are generally already written, the AI or human reader just has to read what's already there.

            • by ceoyoyo ( 59147 )

              This article is about a text to speech engine. You feed it text, it provides you with audio. It is not a language model, large or otherwise. It does not produce the text.

              Amazon is claiming that it is very good at "reading" complex sentences. You know, like the ones found in actual books. They're also showing off its ability to guess the emotion and use an appropriate inflection, not hilariously fuck up when someone inserts a #hashtag or @ into a sentence, properly render foreign words, etc.

              All the things yo

                • Yes it is, you are right, and I'm still confused about what being able to "read" complex sentence structure has to do with AI. I hadn't noticed that, say, Google Assistant had trouble with these. Or maybe Google Assistant is already using AI.

                • by ceoyoyo ( 59147 )

                  The summary mentions that the changes they're talking about weren't reflected in their raters' subjective evaluations of speech quality. The differences were detected by a linguistic expert's judgement (no error bars) and a benchmark test (apparently significant between their smallest model and the other two).

                  Google Assistant certainly does use AI for text-to-speech. Probably not this technique yet, but here's a Google paper from 2023 on it:

                  https://arxiv.org/pdf/2302.035... [arxiv.org]

                  I find the Amazon paper pretty poo

  • "That is not to say they are gaining sentience or anything, just that past a certain point their performance on certain conversational AI tasks hockey sticks. "

    Was this sentence generated by an AI?

    • They verbed the noun "hockey stick", which is a metaphor, by analogy, for the lower part of the sigmoid curve. Seems sensationalist.
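
      Spelled out (this is just the standard logistic curve, not anything from the paper): the lower knee of

          f(x) = \frac{L_{\max}}{1 + e^{-k(x - x_0)}}

      grows approximately exponentially, f(x) \approx L_{\max}\, e^{k(x - x_0)} for x \ll x_0, before saturating; that near-exponential blade is the "hockey stick".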

    • by Entrope ( 68843 )

      Not only was it authentic AI-frontier gibberish, it expressed a courage little seen in this day and age.

  • It seems like Amazon could be in a unique position to generate training data for this sort of effort, given they own Audible and Kindle. That's a large set of text mapped directly to high quality speech.
    • Yes and no.

      I would guess that almost all the content on Kindle is copyrighted by the various authors. It is NOT in the 'public domain'.

      • So far courts have not decided that's a problem for copyright.

      • Yeah, but they're a publisher with a lot of power in negotiations. Authors sign away a ton of rights in exchange for money. It's how they make a living. It's likely that many years ago their contracts already had terms giving Amazon non-exclusive secondary rights to the text for marketing and other purposes. Much like Twitter licensing out users' tweets to news stations: the user still has their copyright, but they also already signed some of those rights away to Twitter.

  • That is not to say they are gaining sentience or anything, just that past a certain point their performance on certain conversational AI tasks hockey sticks.

    They can even sometimes complete sentences like that one!

  • That should be a rule. If you have to use quotes around something, that something is not really that something.
    • If you're literally quoting someone else's words, you also use quotes. Sarcastic quote marks are used to quote a person too stupid to actually exist.

      Sure - sometimes single-word quotes are there to absolve the newswriter of responsibility for repeating lies, because they are attributing them rather than asserting them as their own. You don't see headlines saying that so-and-so was "murdered." They'll usually use "allegedly murdered" without quotes, because they're not quoting an individual but repeating a point of

  • It sounds like nothing more than hidden correlations that their model failed to account for; with a large training set, those behaviours become more distinguishable and apparent.
  • That was quick....

    $ dig A amazon-ltts-paper.com @1.1.1.1

    ;; QUESTION SECTION:
    ;amazon-ltts-paper.com.        IN    A

    ;; AUTHORITY SECTION:
    amazon-ltts-paper.com.    900    IN    SOA    ns-1068.awsdns-05.org. awsdns-hostmaster.amazon.com. 1 7200 900 1209600 86400
