AI

Researchers Warn of 'Model Collapse' As AI Trains On AI-Generated Content (venturebeat.com) 159

schwit1 shares a report from VentureBeat: [A]s those following the burgeoning industry and its underlying research know, the data used to train the large language models (LLMs) and other transformer models underpinning products such as ChatGPT, Stable Diffusion and Midjourney comes initially from human sources -- books, articles, photographs and so on -- that were created without the help of artificial intelligence. Now, as more people use AI to produce and publish content, an obvious question arises: What happens as AI-generated content proliferates around the internet, and AI models begin to train on it, instead of on primarily human-generated content?

A group of researchers from the UK and Canada have looked into this very problem and recently published a paper on their work in the open access journal arXiv. What they found is worrisome for current generative AI technology and its future: "We find that use of model-generated content in training causes irreversible defects in the resulting models." Specifically looking at probability distributions for text-to-text and image-to-image AI generative models, the researchers concluded that "learning from data produced by other models causes model collapse -- a degenerative process whereby, over time, models forget the true underlying data distribution ... this process is inevitable, even for cases with almost ideal conditions for long-term learning."

"Over time, mistakes in generated data compound and ultimately force models that learn from generated data to misperceive reality even further," wrote one of the paper's leading authors, Ilia Shumailov, in an email to VentureBeat. "We were surprised to observe how quickly model collapse happens: Models can rapidly forget most of the original data from which they initially learned." In other words: as an AI training model is exposed to more AI-generated data, it performs worse over time, producing more errors in the responses and content it generates, and producing far less non-erroneous variety in its responses. As another of the paper's authors, Ross Anderson, professor of security engineering at Cambridge University and the University of Edinburgh, wrote in a blog post discussing the paper: "Just as we've strewn the oceans with plastic trash and filled the atmosphere with carbon dioxide, so we're about to fill the Internet with blah. This will make it harder to train newer models by scraping the web, giving an advantage to firms which already did that, or which control access to human interfaces at scale. Indeed, we already see AI startups hammering the Internet Archive for training data."
schwit1 writes: "Garbage in, garbage out -- and if this paper is correct, generative AI is turning into the self-licking ice cream cone of garbage generation."

  • by rgmoore ( 133276 ) <glandauer@charter.net> on Tuesday June 13, 2023 @05:49PM (#63600098) Homepage

    This is an obvious enough problem that people have been predicting it for a long time. It's still critical to have what was just a hunch confirmed by a serious study. The worrisome part is that AI generated stuff is rarely marked as such, so it will be very difficult for corpus curators to filter it out. It will be crucial to maintain an old, uncontaminated corpus.

    • It will be crucial to maintain an old, uncontaminated corpus.

      Perhaps, but it will become dated quickly. In 2030, resetting to 2023 even to remove the discussed artifacts will be of limited utility.

      • by DarkOx ( 621550 )

        but would it? Are we not already seeing this with ordinary human intelligence?

        Sure, there are areas in the hard sciences where progress is being made; however, in the space of general knowledge and culture, are we actually better off than we were before the commercial internet?

        I think we have a ton of people whose intelligence has been trained on total crap they repeat on reddit, that other people feed on and repeat... In the span of three decades we have arrived at a point where 1/3 of the general public can't

        • by spitzak ( 4019 )

          "1/3 of the general public can't or won't say which sex in their species gives birth"

          Care to back that up with an actual random poll or other research, or is this just an excellent example of somebody repeating crap they saw on the internet?

    • If multiple independently trained LLMs learn from each others' output, wouldn't that initially maybe strengthen the knowledge bases?

      Presumably, each individual LLM instance could be designed to recognize and filter out its OWN output from its training data. But still would imbibe the output of different LLMs.

      Then the premise that the information will get degenerate seems to assume that the percentage of hallucinated misinformation in the AI outputs is higher than the percentage of misinformation in a genera
      • If multiple independently trained LLMs learn from each others' output, wouldn't that initially maybe strengthen the knowledge bases?

        No, for now. Right now there is nothing substantial to tell an LLM that what it's saying is wrong. All it does is collect information and spit it out when asked. It doesn't associate ideas and concepts like you or I do.

        As someone further up said, copy a photograph, then copy the copy, and so on. The system will degrade the more it is used. The same here. Since, as mentioned abov

        • Well I would quibble that they don't associate ideas and concepts. By learning the statistical relative strength of association between word pairs, triples etc as used in many humans' discourse, the LLM, I would argue, is in fact learning the equivalent of a semantic network (of old). Things like abstraction (specialization/generalization relations) will come out in the structure of the partially ordered representation of statistical association strengths.

          With statistics from enough examples, the semantic s
      • Imagine they all have the same code, then being fed your output as an input would yield a perfect fit, meaning you wouldn't alter the model a single bit. So I would imagine that as time grows, the model just converges to something. With some noise, to account for new information.

        Now they're not all the same. But they're probably similar enough that it doesn't matter.

        • The same code doesn't matter. The code of transformer neural nets is relatively small and simple.

          It's the input (ie. training) data set (and the order it is encountered, to some extent) which gives each LLM model instance its unique character.

          By different LLMs, I'm not even talking about whether the bog-simple neural net training and traversal algorithms are the same or slightly different. I'm talking about LLMs that have been trained on different (non identical corpuses) input data and/or different sequenc
      • by narcc ( 412956 )

        If multiple independently trained LLMs learn from each others' output, wouldn't that initially maybe strengthen the knowledge bases?

        The content produced is going to reflect the information encoded. That's true. That's the whole point. The problem is that content will also contain error. Remember that what is encoded by any model is just some of the information from the training data. (The model is necessarily imperfect.) In the absolute best case, which is extremely unlikely, a model trained exclusively on the output from another would be an equivalent model, error and all.

        Then the premise that the information will get degenerate seems to assume that the percentage of hallucinated misinformation in the AI outputs is higher than the percentage of misinformation in a general corpus of human discourse. Which I would strongly guess is false.

        You've misunderstood the problem. LLMs do not encode facts

        • So what about the case of multiple models? Absolutely nothing changes. Imagine training three models on different data sets and one model on all three.

          Indeed: multiple models are in fact equivalent to one, larger model.

        • I think when you say that the model encodes only some of the information in the input (e.g music), you may be missing that it generally encodes (more heavily weights) the repeated patterns (i.e. arguably the important semantics) in the input (in the music case, the weighting distribution in the neural net comes to represent the "common musicality" of it, as agreed by many different composers who produced the large music corpus).

          Neural net learning turns large numbers of examples which contain similarities (
          • by narcc ( 412956 )

            you may be missing that it generally encodes (more heavily weights) the repeated patterns (i.e. arguably the important semantics) in the input (in the music case, the weighting distribution in the neural net comes to represent the "common musicality" of it

            What you're calling "common" will drift. When you train a model, you only capture some of the information in the training data. What your model generates will also not be perfectly representative. You will lose information and you will introduce error with each incestuous generation. The model is guaranteed to degrade.

            You don't have to take my word for it. Do the experiment I've suggested and see for yourself. It really won't take long. You could also try reading the paper [arxiv.org]. The results of both, as i
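A minimal sketch of the kind of experiment this sub-thread is arguing about. It is not necessarily the experiment the commenter had in mind (that part of the comment is truncated). It is a stand-in that assumes the "model" is nothing more than a Gaussian fitted to a small sample, and that each generation trains only on data generated by the previous one. The function name and parameters are hypothetical.

```python
import random
import statistics


def simulate_collapse(generations=500, sample_size=10, seed=0):
    """Repeatedly fit a Gaussian to samples drawn from the previous fit.

    Returns the fitted standard deviation at every generation, with the
    "true" distribution N(0, 1) stored at index 0.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0: the real data distribution
    history = [sigma]
    for _ in range(generations):
        # "Generate content" from the current model ...
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # ... and "train" the next model on nothing but that content.
        mu = statistics.fmean(samples)
        sigma = statistics.stdev(samples)
        history.append(sigma)
    return history


if __name__ == "__main__":
    stds = simulate_collapse()
    for g in (0, 10, 100, 500):
        print(f"generation {g:>3}: fitted std = {stds[g]:.3g}")
```

Under these assumptions the fitted spread performs a random walk with a downward drift, so over enough generations it shrinks towards zero: the chain of models gradually forgets the tails of the original distribution. Larger samples per generation slow the drift but do not remove it in this toy setup.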

    • "confirmed by a ~~serious~~ noise study"

      Can I fix that for you?

    • The resultant effect will be much like training humans on facebook.

  • by Baron_Yam ( 643147 ) on Tuesday June 13, 2023 @05:54PM (#63600114)

    Unless the original is perfect, every generation introduces new flaws on top of old ones.

    Nature takes care of this with feedback mechanisms - random variation is curated by natural selection. Novel AI models need a selection mechanism outside of their base training model, to prune the bad results.

    Of course, this is a problem because the whole point is to get the AI to do something so a human doesn't have to... but having a human tuning an AI model as it is formed results in an AI you can copy and use forever, so it's not really that big a barrier.
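One way to picture the generational loss described above, as a hedged sketch rather than anything from the paper: treat each generation as a copy made by resampling the previous copy. Nothing outside the loop ever corrects it, so an item that drops out can never come back. The function name and the toy corpus are assumptions for illustration only.

```python
import random


def resample_generations(corpus, generations=50, seed=0):
    """Copy a corpus repeatedly by resampling it with replacement.

    Returns the number of distinct items that survive at each generation,
    with the original corpus at index 0.
    """
    rng = random.Random(seed)
    current = list(corpus)
    support_sizes = [len(set(current))]
    for _ in range(generations):
        # Each new copy is drawn only from the previous copy, so it can
        # lose rare items but can never regain them.
        current = rng.choices(current, k=len(current))
        support_sizes.append(len(set(current)))
    return support_sizes


if __name__ == "__main__":
    # Hypothetical corpus: one very common word plus ten rare ones.
    corpus = ["the"] * 90 + [f"rare_{i}" for i in range(10)]
    sizes = resample_generations(corpus)
    print("distinct items per generation:", sizes)
```

The count of distinct items can only stay flat or fall from one generation to the next. An outside selection or refresh step, such as mixing original data back in, is what would be needed to stop the slide.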

  • Too funny (Score:4, Funny)

    by NFN_NLN ( 633283 ) on Tuesday June 13, 2023 @05:55PM (#63600116)

    I'm not sure what this is called in computer terms, but in human terms this is what's known as a circle jerk.

  • Linguistic incest (Score:5, Insightful)

    by jenningsthecat ( 1525947 ) on Tuesday June 13, 2023 @05:58PM (#63600124)

    I'm seeing a parallel here between biological inbreeding and its AI equivalent. Lower IQ and mental deficiencies often result from inbreeding, and it seems something analogous may happen with LLM's.

    It would be cool if at some point this "incestuousness" also compromised the hardware that the LLM's are running on - kind of an "art imitates life" phenomenon. It won't happen, but it's fun to speculate.

    • by dfghjk ( 711126 )

      There could be other parallels, like prion disease, but they only sound similar, like your incest analogy. Probably says more about you ;)

      I'd say it's more like a steady diet of Fox News, quite literally. It produces SuperKendall models, forcing out everything that doesn't support the world view.

    • by swillden ( 191260 ) <shawn-ds@willden.org> on Tuesday June 13, 2023 @07:21PM (#63600380) Journal

      I'm seeing a parallel here between biological inbreeding and its AI equivalent.

      I see an even more direct parallel between human dialogue and learning and AI model training. We increasingly see people trapping themselves in media bubbles and social media echo chambers where there is little or no correction from objective reality, and thinking based on distorted input produces even more distorted mental models which generate more distorted output... repeat ad absurdum.

      The proliferation of media options and the ability of individuals to isolate themselves online in a group (albeit a group often containing millions of co-believers) has made this possible in many ways that it wasn't previously -- though obviously it did happen before in walled-off communities with little outside interaction, and has always and probably will always happen in some ways, even in the best of circumstances.

      "Model collapse" seems like a good description of what happens to the brains of, say, cult members. Or Q-anoners, or flat earthers, or 9/11 truthers, or Twitter users (kidding... sort of).

      • Human dialogue is a great example. "I could care less" is a shining example of a stupid statement that should have been corrected, yet through lack of correction became part of education itself. Now when you are surrounded by people who say "I could care less" you start to think it's correct and start using it yourself.

      • by DarkOx ( 621550 )

        or Democrat party members

      • by Evtim ( 1022085 )

        "Or Q-anoners, or flat earthers, or 9/11 truthers, or Twitter users (kidding... sort of)."

        Oh, I see... you are not at all a part of "a group (albeit a group often containing millions of co-believers)".

        Much better to have one narrative, controlled by the right people (my side) for all. Right, comrade?

        • I'm a part of many groups, but I work hard to remain skeptical of all of them... and especially of any ideas that confirm my pre-existing beliefs or trigger supportive emotional responses in me. Objectivity is always aspirational, never truly achievable, but it is possible to stay close to it with constant effort.
  • who looked at the headline and thought it was to do with AI controlling Model Trains?

    (Like HO scale)

  • Don't these things all have a random number generator in series with their output?

    Anyone old enough to remember the days before room cancellation filters when the microphone got a little too close to the speaker? Same idea.

    • by rossdee ( 243626 )

      "Don't these things all have a random number generator in series with their output?"

      I seem to remember that RAH thought that random numbers were important to AI
      (Mike, Gay Deceiver, Minerva/Athena)

    • by gweihir ( 88907 )

      Not really, but often they amplify input noise. Feedback loops are a bitch.

  • by hdyoung ( 5182939 ) on Tuesday June 13, 2023 @06:12PM (#63600168)
    Extremely sophisticated interpolation engines. 8 million people have written short text about otters, so now we have a computer program that can generate a mostly-passable essay about otters that isn’t exactly the same as any of the others.

    Useful? Yes. But, fundamentally, it’s basically an automated way of treading old ground. For the time being, humans still need to expand the boundaries of knowledge. It’s gonna decimate the jobs that involve writing short amounts of text about already-done-stuff. In other words, a LOT of jobs.
    • by gweihir ( 88907 )

      That is pretty accurate. These things cannot create anything original. Sometimes averages can _look_ original (see stable diffusion), but then most art is already derivative, so it is not that visible what the machine actually does.

      This means that these systems cannot generate new ideas. They can help with data collection in some cases and that is very useful. They will likely be able to do simple, no-insight white-collar jobs in the near future with good accuracy and that is indeed threatening a lot of job

  • An approach to training that provides immunity to such degradation would be both conceivable and incredibly valuable. A nice patent opportunity.

  • No surprise (Score:5, Insightful)

    by gweihir ( 88907 ) on Tuesday June 13, 2023 @06:18PM (#63600204)

    ChatAI already messes it up frequently when trained on real data. Hence training it on data from ChatAI just amplifies the nonsense and reduces the depth of "understanding" it has even further.

    • How accurate is real data? Was misinformation a thing long before large language models?

      • How accurate is real data? Was misinformation a thing long before large language models?

        Hello computer. Please tell me which came first LLM or Fox News.

      • by gweihir ( 88907 )

        This is about a different thing. Overall, ChatAI is always _less_ accurate than its training data as it combines things without understanding or cross-checks. It also creates additional inaccuracies by combining things that cannot be combined. These effects add inaccuracies. Hence iterating the process corrupts more and more of the training data produced and used in every step.

         

  • by thragnet ( 5502618 ) on Tuesday June 13, 2023 @06:22PM (#63600210)

    blockchain in the air ?

    • by Dwedit ( 232252 )

      If by "Blockchain" you mean the use of signed timestamps on the data to prove its existence before a particular date...
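A rough sketch of what "signed timestamps" could look like, under loud assumptions: a single trusted authority holds a secret key, and an HMAC stands in for the public-key signature that a real timestamping service (for example one following RFC 3161) would use. All function names, the key, and the token layout are hypothetical.

```python
import hashlib
import hmac
import json


def _payload(digest: str, issued_at: str) -> bytes:
    # Canonical byte string that gets signed: document hash plus claimed time.
    return json.dumps({"sha256": digest, "issued_at": issued_at},
                      sort_keys=True).encode("utf-8")


def issue_timestamp_token(document: bytes, issued_at: str,
                          authority_key: bytes) -> dict:
    """Authority side: bind a document hash to a timestamp and sign both."""
    digest = hashlib.sha256(document).hexdigest()
    signature = hmac.new(authority_key, _payload(digest, issued_at),
                         hashlib.sha256).hexdigest()
    return {"sha256": digest, "issued_at": issued_at, "signature": signature}


def verify_timestamp_token(document: bytes, token: dict,
                           authority_key: bytes) -> bool:
    """Check that the token matches this document and was not altered."""
    digest = hashlib.sha256(document).hexdigest()
    if digest != token["sha256"]:
        return False
    expected = hmac.new(authority_key, _payload(digest, token["issued_at"]),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, token["signature"])


if __name__ == "__main__":
    key = b"hypothetical-authority-secret"
    text = b"An essay written before the flood of generated content."
    token = issue_timestamp_token(text, "2023-06-13T18:22:00Z", key)
    print(verify_timestamp_token(text, token, key))
```

A corpus curator could then keep only documents whose tokens verify and whose timestamps predate a chosen cutoff. The scheme proves nothing about who or what wrote the text, only that it existed when the authority signed it.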

  • Just an idle thought, but since this is largely a statistical model, and the desire is to have 'correctness', could the errors not be used intentionally as an anti-seed to prevent future regressions along a similar path - similar to error correction encoding, of sorts?

    • by StormReaver ( 59959 ) on Tuesday June 13, 2023 @06:31PM (#63600256)

      ...could the errors not be used intentionally as an anti-seed to prevent future regressions along a similar path...

      Potentially, but it would still require human labor to determine what is true and what is not. The LLM's are incapable of making that determination, and always will be.

      LLM's cannot ever exceed their basic programming and purpose, no matter how much OpenAI wishes it. The current LLM craze is snake oil and Blockchain mixed together: total bullshit just waiting to collapse.

      • Potentially, but it would still require human labor to determine what is true and what is not. The LLM's are incapable of making that determination, and always will be.

        "Always will be" is too strong. Well, I suppose maybe it's okay if you restrict your comment to LLMs, but there's no reason to believe that AI will always be less capable than humans at doing research to separate fact from fiction. Even with LLMs, it's hard to be certain just how capable they might become.

        In any case, current LLMs are clearly incapable of doing it.

      • LLM's cannot ever exceed their basic programming and purpose, no matter how much OpenAI wishes it.

        The limiting factor is computation itself. Simple machinery like GAs can accomplish anything with sufficient computational resources. It for example created people from dirt.

        Neural networks can accomplish tasks with far fewer resources by exploiting learned experience to achieve results with fewer trials. Phase transitions have already occurred in large models where capabilities far in excess of linear expectation have emerged. Things nobody on earth had any a-priori clue would happen.

        Likewise it has been

        • by vyvepe ( 809573 )

          Likewise it has been demonstrated trained models can exceed the capabilities of the model by applying simple reflective techniques.

          Can you elaborate? Maybe some references for a layman?

          • Can you elaborate? Maybe some references for a layman?

            Here are two..

            "For example, Reflexion achieves a 91% pass@1 accuracy on the HumanEval coding benchmark, surpassing the previous state-of-the-art GPT-4 that achieves 80%"

            https://arxiv.org/pdf/2303.113... [arxiv.org]

            "Experiments on three large language models show that chain-of-thought prompting
            improves performance on a range of arithmetic, commonsense, and symbolic
            reasoning tasks. The empirical gains can be striking. For instance, prompting a
            PaLM 540B with just eight chain-of-thought exemplars achieves state-of-the-art
            a

        • Likewise it has been demonstrated trained models....

          That training is on data from human creativity. It makes perfect sense that neural networks trained on human data will increasingly reach a point where use of that data will become highly efficient. However, the training data represent the limits of what a neural network can do.

          No neural network will ever transcend its training, as the NN is limited by the very nature of computation. You cannot compute inspiration. You can iterate over all possible combinations of a dataset, and neural nets may be able to m

      • The current LLM craze is snake oil and Blockchain mixed together: total bullshit just waiting to collapse.

        Not being capable of becoming a true AI does not make LLMs snake oil. Even in their current forms they have very real, practical and useful applications. Unlike blockchain, which after more than a decade is still looking for a problem to solve.

        LLMs are not capable of creativity; they only summarise and repackage the information they already have. However, that kind of pointless busy work consumes a scary amount of our human brains, and LLMs are already potentially capable of taking over many human jobs.

  • ...eat it up, bots!

  • PAY ATTENTION, MIDDLE MANAGEMENT: AI that can read and write can't solve society's literacy problems. At some point, near the top of the knowledge chain, there has to be someone who actually knew what they were doing, or nothing will work and nobody will even be smart enough to notice, let alone fix it. There would be an inevitable downward spiral of deterioration of competence and understanding across the board, no matter how slightly you angle the spiral. Deal with it.

  • by JoshuaZ ( 1134087 ) on Tuesday June 13, 2023 @07:08PM (#63600362) Homepage
    The arXiv is a preprint server, not an open-access journal. That means no peer review has occurred.
  • This is the classic signal to noise ratio. When AI starts generating noise and feeding itself back as the signal you end up with more noise.

  • In grade 2 our class were taken outside and lined up. A teacher whispered a phrase into the ear of the person at one end of the line, said to pass the phrase on to the next person in line then remember what one heard, and repeat it later when asked. The phrase was "rubber baby buggy bumpers". It survived recognizably through maybe 6 tellings.

    In grades 3 and 4 arithmetic, and in all later grades, we were told "show your work". A fuzzy correlator cannot do this: Bender's "stochastic parrot". Neither can

  • by istartedi ( 132515 ) on Tuesday June 13, 2023 @08:00PM (#63600504) Journal

    Human: Is it OK to marry my cousin?

    AI: Boy howdy! Go to town.

    Human: Is there anything else you can tell me?

    AI: That depends. Would you like to make moonshine or become the king of Spain?

  • "Just as we've strewn the oceans with plastic trash and filled the atmosphere with carbon dioxide, so we're about to fill the Internet with blah.

    Yeah... "about to"... What, we're going to litter on our landfill?

  • Expectation of cascading generational loss is not reasonable as humans are not likely to accept resulting auto-generated garbage.

    In the real world those using tools like SD to produce imagery are likely to spend time developing prompts, iteration and spot editing so the resulting imagery reflects what they expect.

    Likewise for automated document writing a human is likely to read/review material to make sure it has the required quality and meaning.

    Whatever amount of laziness is induced by better tools enabli

    • We've already seen a real world case where AI delusions were submitted unchecked to a court in a legal filing.

      AI output is not going to be double checked consistently. At best the degeneration will just take a little longer.

      • As Theodore Sturgeon would say, 90% of everything is crap.
        AI generated outputs will likely be no different. From what I have seen, a lot (and I do mean a lot) of it is generic porn. Actual decent quality outputs still do take some level of time that is noticeably higher than the bargain bin trash that fills the subreddits, ergo they will likely never outnumber them in the long-run.
        Boy if we thought the internet was filled with garbage, it is going to get much worse.

      • We've already seen a real world case where AI delusions were submitted unchecked to a court in a legal filing.

        We've seen a real world case of it not being acceptable to humans.

        AI output is not going to be double checked consistently. At best the degeneration will just take a little longer.

        My argument is there is a floor to what people are willing to accept. This applies concurrently to both the automation tools themselves as well as work products.

  • by haruchai ( 17472 ) on Tuesday June 13, 2023 @09:52PM (#63600730)

    Garbage in, Garbage out

    • by twms2h ( 473383 )

      Garbage in, Garbage out

      Actually we already have Data in Garbage out, so it will become Garbage in even more Garbage out.

  • So this is what AI sentience looks like?

  • These destructive phenomena will be repeated in the abstract by AI models in a highly accelerated way, and to the extent people are dumb enough to rely on such things, the consequences will be more destructive than in the human prototypes.

    How do people not already know that Frank Herbert was right? It's beyond obvious.
  • The unavoidable laws of thermodynamics affect AI too. Increasing entropy will cause AI Alzheimer's if the trainers don’t know what they are doing. This effect is well known to everyone using machine learning systems.
  • Of course training AI with AI leads to defects. That is pure inbreeding.
    I give you the story of how Belgian drivers became some of the worst in western Europe. Young drivers were allowed to learn driving from their parents, who had learned to drive from their own parents and forgotten everything that's useful for safe driving. And those parents also learned from their parents, who never really learned how to drive: they just got in a car and drove (because there was no official license).
    And now you have to do resear
  • My take is that recursion is already polluting publicly available data & so datasets that are not "contaminated" with generative AI output will become harder to find. As a result, the media corporations that currently hold "pure, unadulterated" human generated IP will see the value of their content maintained or even increase rather than the doomsday scenarios initially predicted for them.

    So, what's likely to become valuable now is detecting how "pure", i.e. completely human generated, content is
  • Scientists right now train on faked scientific papers written by other 'scientists' with fake statistics, beautified results and other crap.

  • Joking aside, this isn't a problem. Gains come from better design, not having everything up to date.

    Yes models can access data after 2021 for processing, you just wouldn't use it for training.
  • Jup, soon we'll need to keep the AI entertained to prevent it from going haywire.
  • Sounds to me like the old "garbage in, garbage out" problem. Old is what's new!

  • AI is NOT IMMUNE to Garbage IN = Garbage OUT
