Cannibal AIs Could Risk Digital 'Mad Cow Disease' Without Fresh Data (sciencealert.com) 74

A new article in ScienceAlert describes new research into the dangers of "heavily processed sources of digital nourishment" for generative AI: A new study by researchers from Rice University and Stanford University in the US offers evidence that when AI engines are trained on synthetic, machine-made input rather than text and images made by actual people, the quality of their output starts to suffer.

The researchers are calling this effect Model Autophagy Disorder (MAD). The AI effectively consumes itself, which invites parallels with mad cow disease, a neurological disorder in cows that are fed the infected remains of other cattle. Without fresh, real-world data, content produced by AI declines in quality, in diversity, or both, the study shows. It's a warning about a future of AI slop from these models.

"Our theoretical and empirical analyses have enabled us to extrapolate what might happen as generative models become ubiquitous and train future models in self-consuming loops," says computer engineer Richard Baraniuk, from Rice University. "Some ramifications are clear: without enough fresh real data, future generative models are doomed to MADness."

The article notes that "faces began to look more and more like each other when fresh, human-generated training data wasn't involved. In tests using handwritten numbers, the numbers gradually became indecipherable.

"Where real data was used but in a fixed way without new data being added, the quality of the output was still degraded, merely taking a little longer to break down. It appears that freshness is crucial."

Thanks to long-time Slashdot reader schwit1 for sharing the news.
Comments Filter:
  • by mcnster ( 2043720 ) on Saturday August 10, 2024 @10:37AM (#64694618)

    It is interesting.

    • by gweihir ( 88907 ) on Saturday August 10, 2024 @10:42AM (#64694630)

      Some aspects of it are. For example, this data problem was pretty clear from the start, yet people chose to ignore it and the fact that it is basically unsolvable.

      • Ohhhh, the humans won't likely run out of data.

        That's what they do.

        • by crunchygranola ( 1954152 ) on Saturday August 10, 2024 @12:26PM (#64694834)

          Ohhhh, the humans won't likely run out of data.

          That's what they do.

          AI models have already run out of data. There is no secret about this - it is widely discussed. The creation of the public Internet about 30 years ago allowed a substantial fraction of the language output of the entire human race to become available for crawling: not just the new material generated since its creation, but also texts representing a large share of all published communication across human history.

          And then, in the span of only a year or two, these LLMs snarfed up all that text at once to create vast parrot bots that still fail when you wander into low-probability discussion topics -- the long tail of human communication.

          This was a one-time-only trick that cannot be repeated. Additional training data will only trickle in by comparison, and it is now subject to contamination by GenAI garbage.

          • Perhaps then, the next approach in the quest for AI is to prioritize quality over quantity?

            --
            Third class pterodactyl in a society already extinct

            • by 93 Escort Wagon ( 326346 ) on Saturday August 10, 2024 @02:39PM (#64695106)

              Perhaps then, the next approach in the quest for AI is to prioritize quality over quantity?

              The whole business plan for AI companies is to sell their services as a cost-savings measure. Once you throw human curation of data into the mix - which is basically what would be required - any savings go out the window.

              And that's even before we start discussing whether the AI Bullshitting (what the AI folks have rebranded as "hallucinations") problem is even solvable.

              • by HiThere ( 15173 )

                Well, the "hallucinations" are clearly solvable, but only at the price of decreasing the originality. The clear reason is that LLMs aren't grounded in "reality" so they can't sanity check against it. Restrict them to a well-characterized domain and they become a LOT more reliable. They're still guessing, e.g., just how that molecule will fold, but there can be feedback that acts as a corrective, and even without that they make guesses about as good as a human expert in the field (who also doesn't get tim

            • Great idea. Now define quality.
              Not trolling here. I am fairly certain there is no agreement on what "quality" in data is.
              I think the AI people have stumbled upon philosophy.
              I have a friend who fancies himself a philosopher... studied it in university... and he'll talk for as long as you'll listen... and never really get to a point.

              Let the arguing begin. :-)
              • > I have a friend who fancies himself a philosopher.. studied it in university... and he'll talk for as long as you'll listen... and never really get to a point.

                Modern philosopher = Intellectual Masturbation. /s

                It is fine to ponder and reflect from time to time, but at the end of the day stuff still needs to be done.
                i.e. Religion is applied internal philosophy, and Politics is applied external philosophy.

                • oh, listen, he's still a friend, but ... what a windbag. Any other philosophers here?

                  They generally consider themselves to be "the keepers of the flame"... the flame here is usually "language" or "meaning",
                  because only philosophers can discern meaning: all arguments come down to the definition of words, and to axioms, that which is self-evident and cannot be questioned.

                  These modern day philosophers usually live in a world of self importance because they have "mastered" logic (with words) and think tha
              • by gweihir ( 88907 )

                Forget it. Training an LLM with quality data would make it a LM. I think you have no idea about the amount of data involved and how much pre-filtering is already being done to reduce the complete garbage in the input data.

                • I am not in the middle of the AI bullshit, but I do understand that the keepers of this flame already do put their finger on the scale as I think you're pointing out. I'm really just cynically pointing out that there is no such thing as *unbiased*. People just say it's unbiased when the outputs align with their beliefs and vice versa.
                  • by gweihir ( 88907 )

                    I agree on that. And I think the whole LLM idea is deeply flawed for general applications. Specialized LLMs may not work well in the end, but general LLMs never will.

            • by jvkjvk ( 102057 )

              Yes.

              Now create an AI that can detect "quality".

              Sorry, but I'm not holding my breath on that one!

          • To boot, the AI models have less data to choose from now, especially now that art sites have shut their doors both technically and legally to AI feeding site scrapers. The data available is getting less and less, and of course one can have a scraper ignore copyrights (and then use a system of third parties for plausible deniability), but this only makes things worse, especially when sites start using image-poisoning algorithms or actively showing different images to AI bots than they do to people.

            With less d

          • by gweihir ( 88907 )

            This was a one-time only trick that cannot be repeated. Additional data to train on will only trickle in by comparison and is now subject to contamination for GenAI garbage.

            Yep. And that is where we are now. General LLMs have already peaked. Essentially a straw-fire. Great to get rich quick, but that is mostly it.

              • The logic for this is baffling. We have humans that can 'train' on a minute subset of the data LLMs are trained on, something called going to school/college/university, yet somehow the entirety of that dataset is 'too little data' for LLMs ever to reach a human level of intelligence. Sure, there are areas where having had a life of sensory data from the real world matters a lot for understanding, but there are also many areas where it doesn't matter one flying fuck (math, anyone?).

              This study is also a terr

              • by jvkjvk ( 102057 )

                "yet somehow the entirety of that dataset is 'too little data' for LLMs ever to reach a human level of intelligence."

                Yes. They aren't intelligent at all, they are LLMs. They probabilistically produce tokens - parts of words - that's not intelligence.

      • by ceoyoyo ( 59147 )

        Model drift isn't an AI problem. If you fit any model to its own output it will drift.

        An age-old favourite way to do that is to do something like linear regression, use it to interpolate missing data, and then calculate test statistics based on the "complete" data set.
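
        A minimal sketch of that imputation trap, with made-up numbers purely for illustration (nothing here comes from an actual study): the imputed values sit exactly on the fitted line, so statistics computed on the "completed" data set overstate how clean the relationship is.

        ```python
        import numpy as np

        rng = np.random.default_rng(1)

        # Made-up data: a noisy linear relationship between x and y.
        n = 1000
        x = rng.uniform(0, 10, size=n)
        y = 2.0 * x + rng.normal(scale=4.0, size=n)

        # Suppose half of the y values are missing.
        missing = rng.random(n) < 0.5
        x_obs, y_obs = x[~missing], y[~missing]

        # Fit linear regression on the observed pairs, then "impute" the
        # missing y values with the fitted line.
        slope, intercept = np.polyfit(x_obs, y_obs, deg=1)
        y_completed = y.copy()
        y_completed[missing] = slope * x[missing] + intercept

        # Correlation on truly observed data vs. on the "completed" data.
        r_observed = np.corrcoef(x_obs, y_obs)[0, 1]
        r_completed = np.corrcoef(x, y_completed)[0, 1]
        print(f"correlation, observed pairs only: {r_observed:.3f}")
        print(f"correlation, with imputed values: {r_completed:.3f}")

        # The imputed points have zero residual, so the "completed" data look
        # less noisy than reality and the test statistic is inflated.
        ```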

        • Simpler than that. Mark the locations of, say, wind turbines on a map. Train any kind of model to predict where they are. Generate predictions. Now, do it again, but this time claim all those predictions are real windmills. And repeat. Each time, you're starting with the increasingly inaccurate claim that your training data is correct. The results will become garbage fast.
          • by ceoyoyo ( 59147 )

            I'm not sure that's simpler, but sure, that's also an example.

            We used to make medical imaging atlases by aligning lots of scans, computing the average, then aligning the scans to the average, computing a new average, and repeating. The model would drift off into all sorts of interesting distortions if you didn't constrain it by, for example, requiring that the average transform be zero.

            AlphaGo Zero exploits it by having iterations of their models play each other and selecting ones that drift in the directi

      • Seems like the gotcha interview of a politician, business leader, general, etc.: you can always find the ones whose response to a question cannot be cut down to a 10-second sound bite.

        Expect people to interject significant facts (earthworms trap CO2 to regulate climate, for example) into their written works to help expand knowledge.

      • by narcc ( 412956 )

        For example, this data problem was pretty clear from the start

        Indeed. I even described the problem here in an off-hand way before the paper [arxiv.org] that coined the term "model collapse".

        I'm willing to bet that quite a few people were annoyed that they didn't think to publish something and coin a new term. (Think of the citations!) It was just such an obvious problem that it wasn't something you thought all that much about.

        I never would have thought to publish a paper with my own competing term (model autophagy disorder). Which is exactly what the paper linked in the paper

    • by ceoyoyo ( 59147 )

      The verbiage is anyway.

      • If information is being destroyed by a machine feeding on its own output, where is it going?

        :-)

        • by ceoyoyo ( 59147 )

          Hey, stop it. You're engaging in Model Autophagy Disorder by reading the output of other humans. Worse, you're producing your own output which will be read by even more humans!

          MADness!

          • And with that, Zarathud--the large feathery pterodactyl from Alpha Centauri--was enlightened.

            :-)

            --
            (With apologies to Douglas Adams)

    • by hawk ( 1151 )

      >It is interesting.

      And at last we'll know how Denny Crane got mad cow disease!

      Or is this because the AI digested too many episodes?

  • For anyone who has read Dan Simmons' Hyperion, specifically the Priest's tale of encountering the Cruciform, this is completely predictable

    • Not my first thing to think of in this regard but I approve of it nonetheless.

    • by shanen ( 462549 )

      Thanks for reminding me to check. I couldn't figure out what you were referring to since I had read a number of his books but somehow missed that one. Should be able to start on it in a couple of days...

      On the AI topic, I've been reading lots of nonfiction and feeling more ignorant all the time. This story only caught my eye because of personal links to the schools... But the funniest fictional version of AI that I remember might be When Harlie was One by David Gerrold. (Same author as "The Trouble with T

  • “when AI engines are trained on synthetic, machine-made input rather than text and images made by actual people, the quality of their output starts to suffer.”

    This is not Artificial Intelligence but an LLM using the Web as its source. As such it cannot make real inferences or create new knowledge. “The Shadow that bred them can only mock, it cannot make: not real things of its own”.
    --

    Government’s secret counter-disinformation policies revealed [archive.is]
  • The companies using "AI" need to be regulated. Enough!
  • Doesn't even pass the sniff test.

    "...without enough fresh real data, future generative models are doomed to MADness."

    What does "fresh" have to do with anything? This is a transparent attempt to demonize those who want to be paid for training data.

    You know, if synthetic training data doesn't turn out to be valuable, then it won't be used. Existing "real data" does not get stale.

    • You should see an otolaryngologist; your sense of smell is badly defective. "Fresh" means new data that has not been ingested by the model yet, and since the early iterations of GenAI have already hoovered up all the data they could access through the Internet, getting more is a serious problem right now. And "fresh" also means data that has not been debased by incorporating GenAI output, which is increasingly hard to filter out as it is generated in vast amounts very easily. This paper is only formalizing a pro

  • by NettiWelho ( 1147351 ) on Saturday August 10, 2024 @12:04PM (#64694786)

    The researchers are calling this effect Model Autophagy Disorder (MAD).

    Isn't the proper name for this "garbage in, garbage out"?

    • "Isn't the proper name for this "garbage in, garbage out"?"

      No, because the quality of the source material is irrelevant to whether you will get garbage out eventually.

    • No. They are specifically talking about getting garbage out even when you produce a quality input, and iteratively it gets worse.

  • Editors never met a phor they didn't like. Overload. What the f are they talking about?

  • Can I get dumber by consuming machine generated content?

    • You have to include counting humans as machines, but yes.

      • Clarification: "Can" is a very imprecise word. It may happen, but it does not necessarily happen. Even though it seems most humans get dumber with age.

  • Models that work by consuming human input and then mimicking and mixing it to produce new output start to produce random garbage when they're fed their own output instead. Who'd have thought.

    It was pretty clear from the outset that whatever AI models would come up with would go downhill the more such AI output is already out in the wild and so gets picked up by the crawlers sucking in content for training.

    The only MAD idea here is to sensationalize it into some kind of 'disease'.

  • ...consuming the brains of the zombie internet during the AI singularity apocalypse! What quantum blockchain madness is this?! I can see the low-budget, low-effort Hollywood producers lining up script-writers as we speak.
  • Humans might get pretty batty too, if you isolated them and just fed them nonsense.

    Are we afraid that all the human content is going to vaporize, because something something?

  • by Big Hairy Gorilla ( 9839972 ) on Saturday August 10, 2024 @03:16PM (#64695164)
    How is this a surprise? "Intelligence" is a slippery topic. What you think is "true", I think is false.
    Name every conflict: Jews/Arabs, Russians/Ukrainians, Beatles/Stones, pineapple on pizza.
    It always comes down to preferences, indoctrination, and what makes ME money.

    So, you HAVE to tell the child which god to worship, you HAVE to stop your child from putting their finger into the electrical socket, you HAVE to tell them right from wrong. It's the same with these models. YOU HAVE to put your finger on the scale at some point and bias the model.

    Looking forward to the AI bust, probably next year.
    • Pineapple on pizza is so 80's. Personally I don't like my fruit in heated main courses. In a salad maybe, or at most a diced regular apple in a hot pot for some acidity.

  • perpetual license to freely steal
