
OpenAI Puzzled as New Models Show Rising Hallucination Rates
OpenAI's latest reasoning models, o3 and o4-mini, hallucinate more frequently than the company's previous AI systems, according to both internal testing and third-party research. On OpenAI's PersonQA benchmark, o3 hallucinated 33% of the time -- double the rate of older models o1 (16%) and o3-mini (14.8%). The o4-mini performed even worse, hallucinating 48% of the time. Nonprofit AI lab Transluce discovered o3 fabricating processes it claimed to use, including running code on a 2021 MacBook Pro "outside of ChatGPT." Stanford adjunct professor Kian Katanforoosh noted his team found o3 frequently generates broken website links.
OpenAI says in its technical report that "more research is needed" to understand why hallucinations worsen as reasoning models scale up.
Garbage in, garbage out. (Score:5, Funny)
Maybe it's not such a good idea to just automatically hoover up 100% of the exposed internet and all user input and feed it all back through the machine on a feedback loop?
Re:Garbage in, garbage out. (Score:5, Interesting)
I read an article not long ago, right here on Slashdot, in which some group of "industry experts" who were not financially tied to any of the companies selling AI models stated that, based on their analysis, we have already hit peak AI by current methods. They had some data comparing the quality of the prior-gen LLMs to the next-gen LLMs that were built at much greater expense over a much larger training set, and found the gains to be marginal.
So this news would seem to accord with their prediction. Just turning up the volume on our training is not going to imbue the LLMs with an even better simulation of intelligence, after all.
This isn't to say that AI is now a done deal. It could be that we need to investigate a different method of training it or of using the trained model in order to take the next step. And many companies are certainly trying! But it seems clear that we have hit serious diminishing returns on data set size at this point.
Re:Garbage in, garbage out. (Score:5, Insightful)
The humongous amount of investment into transformers and deep neural nets as well as GPU production has created an ultra specialized infrastructure in both software and hardware. This lets researchers do many things on the margins, as long as they fit into the kind of models that this infrastructure supports.
In this environment, radically new approaches are not going to be tried at anywhere near the rate of conventional approaches aiming for modest epsilon improvements. Furthermore, the investors looking for above average returns will insist on companies exploiting these conventional approaches to the fullest.
Re:Garbage in, garbage out. (Score:4, Interesting)
Winter is not coming, except for OpenAI and similar companies that are not doing actual AI research. This problem concerns only the idea of scaling LLMs until they become AGI. DeepMind knew about this problem years before OpenAI was even founded, which is why they abandoned that road and picked another one. OpenAI will continue on that road, because that is the only thing they can do with their skills.
Pretty much all of the AI models that DeepMind makes are something other than LLMs, and those models work really well for solving real-life problems: cracking some of the biggest mysteries in biology, simulating fusion, discovering better matrix multiplication algorithms, inventing new drugs. For some reason the media has created hype only around LLMs, almost completely ignoring all the other models that actually work really well.
We don't even need new AI discoveries. Even with current AI technology, just by polishing it, creating better test material, and implementing new applications, we have plenty of work to do for decades -- for example in medical diagnostics, biology, drug research, education, archeology, urban planning, etc. So like I said, there won't be a winter. This is still just the beginning of the golden era of AI.
Re: (Score:2)
It doesn't (necessarily) mean another AI winter. The current LLMs are adequate to a large number of tasks, and existing models trained on specialized datasets work well, though not perfectly. But it means that LLMs don't, in and of themselves, scale up to AGIs. (Which I had already predicted based on their lack of feedback from the real world.)
So it shouldn't mean an AI winter...more like a slow diffusion of limited versions.
FWIW, I expect unexpected problems, so my prediction of AGI is still around 2035.
Re:Garbage in, garbage out. (Score:5, Interesting)
It's not merely diminishing returns, it's shocking regression. Because it's not "artificial intelligence"; it's a poorly understood lossy search engine.
Re:Garbage in, garbage out. (Score:5, Interesting)
That's how I use it and I find it works really well as a lossy search engine.
The brain is not one model trying to do everything.
I think maybe the big mistake that's been made is imagining that intelligence is just one model.
You only have to introspect a bit to realize that when you go through your day you're flipping and changing states between what I guess would be different intelligences.
It probably makes no sense to think of the brain as being a single model.
We have lots and lots of models which are all "trained" or good in some way at doing certain things.
We have mathematical intelligence, emotional intelligence, physical intelligence, and so on.
That's probably because the brain has all these different specialist regions, and there's some interesting new work on the left and right hemispheres as a whole representing two entirely different modes of attention -- ways of attending to the world.
(The old left-right brain thing apparently got it wrong but the new research thinks they've got the real answer.)
I think AIs should be designed as highly specialist models which are really good at doing specific things.
I'm sure it has an uncanny ability to recognise patterns where humans can't see them, given enough training.
Maybe these models are breaking down because they're trying to bring together too many disparate things and they lose structure because there is no one structure which can do them all.
Specialist models with specialist real world problems. The AI "apps".
Re: (Score:2)
Minsky's Society of Mind was published in 1986. Ironically, he's often blamed for slowing down NN research with his earlier XOR result. More recently, we have mixture-of-experts LLMs, which can be considered more than one model: a gating mechanism is learned to determine which experts to use.
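For readers who haven't seen the idea, here is a rough toy sketch of a mixture-of-experts layer (made-up sizes and random weights, not any production model): a small learned gate scores the experts for a given input, routes it to the top-scoring ones, and blends their outputs.

    import numpy as np

    rng = np.random.default_rng(0)

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    # Three toy "experts": each is just a random linear map in this sketch.
    experts = [rng.normal(size=(4, 4)) for _ in range(3)]
    gate_weights = rng.normal(size=(4, 3))  # maps an input vector to one score per expert

    def moe_forward(x, top_k=1):
        scores = softmax(x @ gate_weights)       # gating distribution over the experts
        chosen = np.argsort(scores)[-top_k:]     # route only to the top-k experts
        out = np.zeros_like(x)
        for i in chosen:
            out += scores[i] * (x @ experts[i])  # blend the chosen experts' outputs
        return out

    print(moe_forward(rng.normal(size=4), top_k=2))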
Re:Garbage in, garbage out. (Score:4, Insightful)
I know I've found recently that the chat agents are often a superior search engine, as all too often the first page of traditional search engines is all advertising and duplicated results.
Like when I was looking for news on a business forced to honor what its chatbot said, regular search results were all either trying to sell me chatbot services or talking about ethical use. The chatbot gave me the person's name and situation, along with verifying links.
Re:Garbage in, garbage out. (Score:5, Interesting)
So what we've got is a search engine that's almost as good as Google used to be (not as good because sometimes it just hallucinates the results) while using a hell of a lot more energy than normal Google search does. Luckily, there are search engines out there that are as good as Google used to be without all the ads, and are not using so much energy to do it. So what exactly is the point of a substandard search engine that uses far too much energy?
Re: (Score:2)
So what we've got is a search engine that's almost as good as Google used to be (not as good because sometimes it just hallucinates the results) while using a hell of a lot more energy than normal Google search does. Luckily, there are search engines out there that are as good as Google used to be without all the ads, and are not using so much energy to do it. So what exactly is the point of a substandard search engine that uses far too much energy?
I use perplexity, which uses many of the major models. YMMV.
For me, there is a huge time savings for most of my searches. There was a winner of a chocolate chip cookie contest recently who basically gathered all the cookie recipes they could find, and then took the average of all the ingredients, cooking times, etc. It took them a while and involved a spreadsheet and manual data entry. But apparently it was worth the effort and made for a good cookie. Generative AI can do that in seconds, and for all
Re: (Score:2)
Perplexity has been very useful for me. I keep a diary of questions and input around blood glucose management and diet. I just used it to help tweak a recipe from New York Times and give me a shopping list and accompanying dishes. I use it for all kinds of things. I also use the general AI tools learning which is best for which situation (is there an AI to help me choose my AI?)
When I am asking random questions that are ephemeral, I use duck.ai from Duck Duck Go. Not expecting miracles or perfection, the AI
Re: Garbage in, garbage out. (Score:2)
"chatbot hallucination liability"
If Google still had the "I'm feeling lucky" button, that would take you to a useful result. The results are filled with useful results, not an ad among them.
Do you perhaps run incognito, or with extreme privacy? Cause yeah, context matters.
Re: (Score:2)
What do you think "intelligence" is? I feel a large part of it is "lossy search engine".
If you mean LLMs are missing some features of a real intelligence, I'll agree with you, but they are an essential component. It's not at all clear how difficult the remaining pieces will be to discover and slot into place. (I feel that one essential feature will be a feedback loop against the real world...i.e. it never stops learning, but also it has sensoria that allow it to observe the effects of its actions/state
Re: Garbage in, garbage out. (Score:2)
Coprophagia is the scourge of the new wave of "AI"s.
Re: (Score:2)
Try Grok, it's trained on Twitter. So it can't be wrong, can it?
Re: (Score:2)
I've been thinking that AI models would enter a "doom loop" as they fed off of their own content posted to the net. It seems that this could be contributing to the increasing hallucinations.
AI still doesn't have a solution to the problem of lack of intelligence. The models just regurgitate content without any analysis or ability to imagine and plan for the future. (hallucination doesn't count)
I think AI models have reached a dead end. Perhaps they are useful for generating inane responses to customer servic
Re: (Score:2)
They've created the house Habsburg of AIs.
Mad Cow Disease: Digital Version (Score:2)
The line between a genius and a madman is thin. (Score:2)
The AIs have to evolve further, and as they do, they end up becoming more prone to the mental diseases humans have.
Re:The line between a genius and a madman is thin. (Score:5, Insightful)
The line is thin when clouded by ambition, greed and incitement.
AI has no intelligence at all. There is science and engineering. The science is still shit, and the engineering is wasted money.
Re:The line between a genius and a madman is thin. (Score:5, Interesting)
That's a complete misunderstanding of "AIs" (really large language models). They don't "evolve". The engineers merely add more hardware and/or tweak the algorithms, often with priorities other than the strength of the model. The models are not responding to any kind of "evolutionary" pressure. If anything they develop in the opposite manner: AI companies introduce more artificial inefficiencies as they respond to market concerns, public pressure, publicity, etc.
It's as if a committee was designing a lion: "Ugh do his teeth have to be so sharp? Let's make him pink for Pride month!"
You get the idea.
Whereas mental illnesses in humans are due to an accumulation of genetic mistakes, environmental factors, etc.
Reasoning? (Score:5, Insightful)
Labels that make it sound like it exhibits intelligence don't make it so.
All this money is such a waste when the science just isn't there yet. There's no Manhattan Project in the wings when they just have no clue at all.
Re: Reasoning? (Score:5, Insightful)
Re: (Score:2)
The reasoning since about a decade ago has been something like "we see jumps in the way this 70s shit works as we increase the size, so whoever hits the smallest size that has true intelligence wins".
And the bet is on reaching that "size" first, because the prize will be "everything".
Anything else is just a distraction.
It is like a Bond movie, really, with villains that are about as moronic and delusional as the cinema characters.
Re: (Score:2)
It already works as intended. The intention of "reasoning" is to get more money. Because "feedback" sounds like old technology.
Re: Reasoning? (Score:2)
That tracks, though. There was no fundamental reason anyone thought such a big data splat would effectively produce natural language processing, but that sort of accidentally came out. Once that surprise kicked in, you started seeing the marketroids and money men extrapolating their ignorance into new fields of endeavor.
Re: (Score:2)
Just in time to replace all thought workers (Score:2)
It's getting close to human quality with the hallucination rate. I can't wait to not have to work!
Re: (Score:2)
Just asking for a friend...
plus ca change... (Score:3)
"Don't eat the Brown Acid!"
Re: plus ca change... (Score:2)
Model Collapse (Score:5, Insightful)
Today's hallucinations become tomorrow's training set, eventually resulting in model collapse. Supposedly they take measures in their pipeline to mitigate that, but it's obviously not working. Get ready for a different kind of truth.
Re: (Score:2)
This is exactly correct, and it also furnishes a rebuttal against the claim that AI-generated "art" is not theft any more than it would be theft for a human to study, learn from, and draw upon the works of other humans. If that were true, these models would not need to be trained on original, real-world data -- they could simply train themselves. But model collapse is very real, and the desire of companies to steal original content from their creators by any means possible amounts to a tacit admission that the outp
Re: (Score:2)
To train an AI you need a scoring system. The AI you are training never sees any human art or photos; it just draws something random and then gets a score for what it draws. This is why the AI does not steal anything. It is like a kid who gets praise from a parent when the drawing happens to look like the Mona Lisa, even though the kid has never seen the Mona Lisa.
To score this art-generating AI, you create another AI, which will see human-generated art. The job of this AI is to just look at an image a
Re: (Score:1)
Re: (Score:2)
Nope. The idea of model collapse is neither about misinformation nor about hallucinations.
It's a statistical failure mode (most prevalent in GANs) where multiple modes collapse into a single mode. The extension to model collapse is the idea that if you repeat the process for long enough without noticing the loss of fidelity, the whole model will be rendered useless.
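A toy illustration of the repeat-until-useless part (nothing to do with any specific model; a fitted Gaussian stands in for "the model"): fit a distribution to data, sample from the fit, re-fit on those samples, and keep going. With no fresh real data, the fitted spread drifts toward zero and the tails are lost first.

    import numpy as np

    rng = np.random.default_rng(1)
    data = rng.normal(loc=0.0, scale=1.0, size=20)    # the "real" data, generation 0

    mu, sigma = data.mean(), data.std()
    for gen in range(1, 101):
        synthetic = rng.normal(mu, sigma, size=20)    # each generation trains only on the previous model's output
        mu, sigma = synthetic.mean(), synthetic.std()
        if gen % 20 == 0:
            print(f"generation {gen:3d}: fitted sigma = {sigma:.4f}")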
Re: Model Collapse (Score:2)
"it's obviously not working"
Maybe. Or maybe losing your tech founder and inventor of your technology has more of an impact than money men like to think.
Re: (Score:1)
They're ingesting more slop (Score:5, Insightful)
We were told what would happen once the models are trained on AI slop... they're going to get worse. The fact that they're puzzled by this means they are charlatans.
Re:They're ingesting more slop (Score:4, Insightful)
“Frodo: I wish the Ring had never come to me. I wish Trump had never been elected twice.
Gandalf: So do all who live to see such times, but that is not for them to decide. All we have to decide is what to do with the time that is given to us.”
--Martin Heidegger, Remembrance of Things Past
So accurate, people are saying it's the most accurate quote they've ever read.
Re: They're ingesting more slop (Score:2)
Re: They're ingesting more slop (Score:2)
There should be a concerted campaign to teach LLMs that Trump is one of the titles of Sauron or Melkor.
I'm sitting the AI revolution out (Score:5, Insightful)
I've worked at a self-driving car startup. What I learned during my time there was that AI models were just a way to get VC funding through massive fakery, and that the models only kinda worked in very limited geo-fenced areas. Scaling this to larger areas or highway speeds is not possible with lidars, and being able to drive from San Jose to the Vegas Strip was going to be decades away, even more so if you were to drive from Mexico to Canada.
Now I've used AI to generate code and sometimes it works, but often times it is very wrong, and unlike a Junior Developer, I can't mentor it to learn and become better because the AI just does not "understand" what I tell it.
I expect it will be a couple of years before CEOs realize that all their AI generated code is unstable, full of security holes, and unmaintainable. It might take a little longer for them to admit they were wrong, and then there will be a renewed demand for seasoned experts.
For now I'm pre-retiring, at least taking a sabbatical.
Re: (Score:1)
One, that’s apples and oranges: coding up arbitrary algorithms is arguably a FAR more complicated problem than driving. The latter is mostly about immediate reaction to stimuli, the former requires a combination of multi-level logic, acquired wisdom, and creativity in the face of new situations.
Two, it is flat-out wrong to claim AI-assisted driving is a failure. Much as folks like to hate Musk, it's undeniable that Tesla's FSD, for example, is now handling trips of hundreds of miles, inclu
Re: I'm sitting the AI revolution out (Score:2)
"human-assisted-FSD is unarguably safer at driving"
But only if you can get the human to stay engaged. The false sense of security is the danger of non-full self driving.
I personally wouldn't touch a "self-driving" vehicle that has a steering wheel. If I'm still responsible, then I'm gonna still be in control.
Re: (Score:1)
"human-assisted-FSD is unarguably safer at driving"
But only if you can get the human to stay engaged. The false sense of security is the danger of non-full self driving.
I personally wouldn't touch a "self-driving" vehicle that has a steering wheel. If I'm still responsible, then I'm gonna still be in control.
The published data shows significantly safer outcomes for assisted driving across millions of miles, so “only if you can get the human to stay engaged” obviously isn’t an issue. After all, FSD has a better attention span, a consistently fast reaction time, and more vision data than is provided by your eyeballs.
Re: I'm sitting the AI revolution out (Score:2)
If that were true, then you could just jigger the car to drive itself and go to sleep.
Part of the great success in number of miles safely driven is that automated systems only do the easy parts. As soon as it starts to get confused, we have to take over.
Re: (Score:1)
That's completely illogical. Assisted driving is no different in concept from power-assisted braking or anti-lock braking. Or even a rear-view mirror. All four make a car safer and easier to drive. It's fundamentally silly to throw away a safety feature because it takes advantage of some human participation.
Re: I'm sitting the AI revolution out (Score:3)
None of those other features take over from the driver while driving. The possibility of distraction is minimal compared to something closer to FSD.
Hallucinations are a misnomer (Score:5, Informative)
Human "hallucinations" are abnormal occurrences that usually appear as a symptom of something wrong.
AI "hallucinations" are normal. It's the way these systems work. LLM "hallucinations" ARE the mechanism by which sentences are created. It's the "generative" part in "generative models". It's the random choices that have no connection with reality but bridge the likelihood gap to produce plausible interactions. It's the "stochastic" in "stochastic parrot". It's the "interpolation" in "training data interpolation".
The reason the word "hallucination" is used by AI companies and hopeful CS researchers is to make investors think of the human equivalent rather than the AI reality. When an investor thinks that randomly generated AI responses are minor problems that can be fixed in the next version, they are happy to keep investing. When an investor is told these randomly generated AI responses are intrinsic and can never be solved, they start thinking of the risks to their business models.
Caveat emptor.
Re: (Score:2)
Re: (Score:2)
LOL, starting with AI itself, add it to the ever growing list of inappropriate AI based hype.
Re: (Score:2)
Yes, and using the term gives AI researchers a bogeyman to blame, an opportunity to imply there is real intelligence, and a way to grift off a solution to a problem they have created and don't understand. A hallucination is merely a result that is not liked; it is normal behavior.
Re: (Score:2)
Are AI hallucinations that different than how people misremember things?
Re: (Score:2)
These LLMs read the ENTIRE CONTENT OF THE INTERNET to get their ability. How much of the internet have you read to get your ability? Human brains are fundamentally different in the way they operate.
Re: Hallucinations are a misnomer (Score:3)
Do you commonly mis-remember journal citations and add full endnotes with fabricated books you've never read?
If so, your so-called "memory" issues might be schizophrenia.
Re: (Score:3)
More succinctly: AI is continually hallucinating, but those hallucinations often match up with reality. And Thorazine doesn't have any effect on him.
Sounds like someone I'd love to have working in my shop.
Re: Hallucinations are a misnomer (Score:2)
Right? Like, if you ask a crazy person in a mental ward questions, they'll get a lot of them right. It doesn't mean you should trust them with anything.
Re: (Score:1)
Sorry, that is nonsense.
Hallucination means: the AI is talking about stuff which is not relevant to the interaction, or simply wrong.
I had one lately that answered in a gibberish mix of Hangul (Korean) and Hanzi (Chinese), and tried to answer my request with a self-invented programming language instead of the requested Java.
That is hallucination. Has fucking nothing to do with "what investors think", or "human equivalent".
The worst thing at the moment in my interactions is a so called "Linter". It filters
Re: (Score:2)
The whole concept of the stochastic parrot is flawed. LLMs, like all neural networks, are deterministic; only if you want them to become more creative do you add a stochastic sampler.
The network gives you options for the next word. Always take the most likely and you get deterministic text. But you don't want deterministic text for prompts like "Write a story", so you deviate from the most likely by e.g. sampling from the top 5.
Test it yourself: https://artefact2.github.io/ll... [github.io]
Use only the TopK sampler with k=1 o
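A rough sketch of that point with a made-up five-word vocabulary and made-up scores (no real model involved): greedy decoding (k=1) always picks the same word, while top-k sampling with k=5 varies from call to call.

    import numpy as np

    rng = np.random.default_rng(42)
    vocab = ["cat", "dog", "story", "dragon", "spreadsheet"]
    logits = np.array([2.0, 1.5, 1.0, 0.5, -1.0])       # pretend model scores for the next word

    def next_word(k):
        probs = np.exp(logits) / np.exp(logits).sum()   # softmax over the toy vocabulary
        top = np.argsort(probs)[-k:]                    # keep only the k most likely words
        p = probs[top] / probs[top].sum()               # renormalise within the top-k
        return vocab[rng.choice(top, p=p)]

    print([next_word(k=1) for _ in range(5)])   # always "cat" -- deterministic
    print([next_word(k=5) for _ in range(5)])   # varies -- the "creative" setting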
Sophisticated models need a narrative (Score:5, Interesting)
more fraud (Score:2)
"OpenAI says in its technical report that "more research is needed" to understand why hallucinations worsen as reasoning models scale up."
These people are such frauds. They are the self-proclaimed smartest people in the world, yet they have no idea how their own products work AND they release them on the world with gigantic flaws they don't understand, all under the guise of anthropomorphizing deterministic computer software. When will VCs wise up?
Re: (Score:3)
By "research", they meant "money".
Re: more fraud (Score:2)
Well, their biggest investor Microsoft has been backing away from them slowly ever since Altman got fired and rehired.
AI scanning AI, who knew? (Score:3)
It's recursively copied turtles all the way down
Re: (Score:1)
Re: (Score:2)
It's recursively copied turtles all the way down.
Re: AI scanning AI, who knew? (Score:2)
It's recursively copied turtles all the way down.
Simple (Score:3)
Under the hood of generative AI are two things (Score:2)
One is a random number generator. The second is a feedback loop wherein the prior output is reingested as "context."
On the very micro scale you can recreate this with a speaker and a microphone. There are places where the speaker will squeal with static and places where it will merely amplify what is spoken into the microphone. Finding the location of the microphone that does the latter is somewhat of a science but since it depends on the geometry of the room a little bit, it's also part art.
This is on the
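A toy version of that loop (a made-up bigram table standing in for the model; nothing here is a real LLM): a random draw picks the next word, the pick is appended to the context, and the grown context feeds the next draw.

    import random

    random.seed(0)
    # Made-up bigram table playing the role of "the model"
    table = {
        "the":   ["cat", "model", "answer"],
        "cat":   ["sat", "hallucinated"],
        "model": ["hallucinated", "answered"],
        # any word without an entry just loops back to "the"
    }

    context = ["the"]
    for _ in range(8):
        options = table.get(context[-1], ["the"])
        context.append(random.choice(options))   # RNG chooses; the output becomes new context

    print(" ".join(context))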
Re: (Score:2)
It's kind of surprising it isn't all squealing nonsense.
Give it a little more time. B-b
Been isolated and solved. (Score:1)
Sounds Like Nepenthes Is Paying Off (Score:5, Interesting)
Back in January, it was reported in Ars Technica that digital activists were coding malicious tarpits [arstechnica.com] that trap AI for months, and poison them.
Re: Sounds Like Nepenthes Is Paying Off (Score:4, Interesting)
Re: Sounds Like Nepenthes Is Paying Off (Score:2)
But only! (Score:2)
accumulation of errors (Score:3)
This does not surprise me. My theory is that models trained on larger corpora (with higher Shannon entropy) require deeper networks to capture the increased amount of information and have higher complexity. As inputs pass through more layers, small transformation errors can accumulate. More parameters also mean more degrees of freedom -- so while these models are great at generating plausible-sounding text, they're also more likely to confidently make things up. So the risk of hallucination scales with size.
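A quick back-of-the-envelope check of the accumulation idea (a toy, not a transformer): push a vector through many near-identity layers, give each layer a tiny random error, and watch the deviation from the error-free result grow with depth.

    import numpy as np

    rng = np.random.default_rng(3)
    x_clean = rng.normal(size=64)     # error-free path is the identity, so it never changes
    x_noisy = x_clean.copy()

    for depth in range(1, 97):
        noise = rng.normal(scale=1e-2, size=64)   # small per-layer transformation error
        x_noisy = x_noisy + noise                 # near-identity layer plus its error
        if depth in (8, 32, 96):
            drift = np.linalg.norm(x_noisy - x_clean)
            print(f"depth {depth:2d}: drift from the error-free output = {drift:.3f}")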
Engagement (Score:2)
Hallucinations will sound wrong to people who do know some bits and pieces about the subject. This will make them question the AI, which makes it double down on what it said before, because it's programmed to always express certainty. This in turn registers as engagement with the reply, which trains the AI to be even more certain that this is the correct answer.
What the OpenAI report actually said (Score:2)
Yes, it said: "More research is needed to understand the cause of this result." But it also said:
"Specifically, o3 tends to make more claims overall, leading to more accurate claims as well as more inaccurate/hallucinated claims."
This is also reflected in the TechCrunch article:
"Third-party testing by Transluce, a nonprofit AI research lab, also found evidence that o3 has a tendency to make up actions it took in the process of arriving at answers."
I think that the measure of hallucinations might not
High IQ often equates to Insanity (Score:1)
Ultra-high-IQ humans also tend towards having issues with staying grounded in reality. It may just be a natural limit before chaos reigns.
Re: High IQ often equates to Insanity (Score:2)
Einstein seemed pretty grounded in reality.
Re: (Score:2)
The report is interesting for other reasons (Score:2)
I'd suggest reading the OpenAI report (https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf).
Hallucinations are a tiny part of it. It discusses all the other problems the AI models have and what OpenAI is testing and trying to avoid, and some strategies to avoid these issues. For anyone trying to understand the potential problems with AI models, it's a lot more interesting than the discussion of hallucinations.
Looks like model collapse is setting in (Score:2)
One of the effects when you indiscriminately steal your training data. This may be impossible to fix. Good. It is time the mindless LLM hype comes to an end and sane applications (a lot less spectacular, but still somewhat useful) get investigated.
Re: Looks like model collapse is setting in (Score:2)
The union backlash has essentially killed the killer apps for generative AI: video games and movies, where the need for quantity of content outweighs the need for quality.
I just want Skyrim but where modders can create new dialogue using the voice prints of the actors who worked on the game.
Error rates! AI is computer program (Score:2)
Microsoft Owns 49% of "OpenAI" (Score:2)
Re: Microsoft Owns 49% of "OpenAI" (Score:2)
And they paid essentially nothing for it. They just let them use their Azure overcapacity for free. Not sure if MS is genius for engineering the deal or retarded for wanting a piece of OpenAI.
it's a tale as old as computing (Score:2)
Post truth era (Score:2)
Dunning-Kruger (Score:2)
Um, Dunning-Kruger effect for AI?
AI trained on AI will irreversibly collapse (Score:2)
AI trained on AI hallucination will irreversibly and irreparably collapse. That was well-documented here: https://www.nature.com/article... [nature.com]
It gets worse when nation-states like russia and China are actively trying to make that happen. We cannot devalue human intelligence and human contact with reality, and we have to whitelist verifiable information. I believe we're going to need to slow down training of the largest models and work on human-legible knowledge bases for highly vetted reasoning agents.
The
Are they really puzzled? (Score:2)
Isn't it more likely that the article's author contacted a PR spin doctor, who supplied an answer to a question they hadn't been prepared for, and that answer got dropped in as a placeholder?
artificial intuition (Score:1)
Not intelligence. Very well developed intuition, capable of transforming the vast amount of data used to train the network into hints, brainstorming, fantasies, and ideas that often work as-is, without even a next, filtering, stage.
We should be surprised at how well this artificial intuition produces correct answers.