Is ChatGPT Getting Worse? (fortune.com) 93
A new study (PDF) from Stanford found that ChatGPT performed worse on certain tasks in June than its March version. The paper supports a widely held, though unproven, notion that the AI language model's performance in coding and compositional tasks has deteriorated in recent months. Fortune reports: The study compared the performance of the chatbot, created by OpenAI, over several months at four "diverse" tasks: solving math problems, answering sensitive questions, generating software code, and visual reasoning. Researchers found wild fluctuations -- called drift -- in the technology's ability to perform certain tasks. The study looked at two versions of OpenAI's technology over the time period: a version called GPT-3.5 and another known as GPT-4. The most notable results came from research into GPT-4's ability to solve math problems.
Over the course of the study researchers found that in March GPT-4 was able to correctly identify that the number 17077 is a prime number 97.6% of the times it was asked. But just three months later, its accuracy plummeted to a lowly 2.4%. Meanwhile, the GPT-3.5 model had virtually the opposite trajectory. The March version got the answer to the same question right just 7.4% of the time -- while the June version was consistently right, answering correctly 86.8% of the time. Similarly varying results happened when the researchers asked the models to write code and to do a visual reasoning test that asked the technology to predict the next figure in a pattern.
James Zou, a Stanford computer science professor who was one of the study's authors, says the "magnitude of the change" was unexpected from the "sophisticated ChatGPT." The vastly different results from March to June and between the two models reflect not so much the model's accuracy in performing specific tasks, but rather the unpredictable effects of changes in one part of the model on others. [...] The exact nature of these unintended side effects is still poorly understood because researchers and the public alike have no visibility into the models powering ChatGPT. It's a reality that has only become more acute since OpenAI decided to backtrack on plans to make its code open source in March. "These are black-box models," Zou says. "So we don't actually know how the model itself, the neural architectures, or the training data have changed."
Known side-effect of fine-tuning. (Score:5, Interesting)
As LLMs are fine-tuned, "Catastrophic Forgetting" is a known and well studied problem:
https://arxiv.org/pdf/1911.002... [arxiv.org]
Since ChatGPT continues to be fine-tuned, it's no surprise that as they keep lobotomizing it, it gets dumber in some ways.
Re: (Score:2)
Re: Known side-effect of fine-tuning. (Score:4, Informative)
I don't know how many times it will have to be restated before people begin to understand, but these language models have no idea what they are saying. They are a very advanced form of answering the question "what letter is likely to follow the letter 'q'?"
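For illustration only, here's a toy next-character "model" in Python (just bigram counts over a made-up string, nothing like a real transformer), which gives the flavor of "what letter is likely to follow 'q'":

from collections import Counter, defaultdict

# Toy next-character model: count which character follows each character
# in a tiny corpus, then "predict" the most likely successor.
corpus = "quick quiet quantum query question"

followers = defaultdict(Counter)
for a, b in zip(corpus, corpus[1:]):
    followers[a][b] += 1

# What letter is likely to follow 'q'? In this corpus, always 'u'.
print(followers['q'].most_common(1))  # [('u', 5)]

Real LLMs predict tokens with billions of learned parameters rather than raw counts, but the objective is the same kind of next-symbol prediction.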
Re: (Score:1)
Also a number of other functions.
This is why people think you're trying to hurt them when you're not trying to hurt them: you explain that only the parts you personally find relevant have meaning, and the rest can be discarded.
I know you probably meant this as a quip, but no. Stop.
Re: (Score:1)
Most humans have the ability to reason, and merely use language to communicate. ChatGPT only goes as deep as language.
Re: (Score:2)
Wrong item. You meant to post to the article about the writers strike.
Re: (Score:2)
I don't know how many times it will have to be restated before people begin to understand, but these language models have no idea what they are saying.
Only the most basic of normies (admittedly, most people) think that chatGPT has consciousness or rational decision making or anything along those lines. There's no excuse for anyone remotely techy to think otherwise.
I think chatGPT shows interesting emergent behaviors, and I think it's already very useful. But that's it.
Now, with regards to your comment, you are both right and wrong. The chatGPT model is not publicly known, but it's certainly more complicated than letter-by-letter or word-by-word prediction.
Re: Known side-effect of fine-tuning. (Score:2)
Checking divisors up to the square root is enough. No need to go up to half of the number.
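A minimal sketch of that check in Python (plain trial division up to the square root, nothing OpenAI-specific), which also confirms that 17077 really is prime:

import math

def is_prime(n: int) -> bool:
    # Trial division: only need to test divisors up to sqrt(n)
    if n < 2:
        return False
    for d in range(2, math.isqrt(n) + 1):
        if n % d == 0:
            return False
    return True

print(is_prime(17077))  # True: no divisor up to 130, so 17077 really is prime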
Re: (Score:1)
Re: (Score:2)
It's worse than you thought.
And in addition, from a... control systems perspective... you can never guarantee that the system won't respond in a way that is offensive or that potentially breaks security protocols. It also isn't possible to make it understand the implications of its failures.
Every instance of it being put in front of customers unsupervised will end in disaster sooner or later.
Re:Known side-effect of fine-tuning. (Score:5, Funny)
Check it out. ChatGPT-4. Pretty interesting.
A prime number is a natural number greater than 1 that has no positive divisors other than 1 and itself. The first few prime numbers are 2, 3, 5, 7, 11, and 13.
To determine if a number is prime, one method is to try to divide it by all the prime numbers smaller than its square root.
For 17077, its square root is approximately 130.68, so we would check division by all primes less than 130. Let's start with the first few:
17077 ÷ 2 is not an integer.
17077 ÷ 3 is not an integer.
17077 ÷ 5 is not an integer.
17077 ÷ 7 is not an integer.
17077 ÷ 11 is not an integer.
17077 ÷ 13 is not an integer.
However, when we try 17077 ÷ 127, the result is 134, which is an integer.
This means 17077 = 127 × 134, so 17077 is not a prime number.
Re: (Score:2)
Re: (Score:3)
LLMs generally cannot be explicitly programmed. When you want to explicitly program a response to an LLM query, you generally have to run a top-level vector database of potential inputs and their explicit outputs that intercepts all queries, checks whether it finds a matching input in its library, and doesn't pass the query on to the LLM if an answer is found.
This is why most LLM censorship is so bypassable: ask the LLM to do something in a way that wasn't explicitly predicted in the top-level database sitting above
Re: (Score:2)
Re: (Score:2)
That is not the way LLMs work.
Re: (Score:2)
I don't know what OpenAI is doing with its chat-gpt stuff exactly, but you bolt NLP (natural language processing) onto the front of these things and pull out entities.
These are also powered by machine learning, but what they do is take an input like "I would like to make an appointment for my cat Amy's annual physical" and extract a list of entities that have been defined:
@patient_type = cat
@intent = schedule
@pets_name = Amy
@service = physical
You have probably interacted with such a system Google's dialog flow i
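Something roughly like this, as a toy sketch (hard-coded keyword matching standing in for a real ML-based NLU model; the entity names just reuse the example above):

import re

PATIENT_TYPES = {"cat", "dog", "bird"}
SERVICES = {"physical", "vaccination", "checkup"}
INTENTS = {"appointment": "schedule", "schedule": "schedule", "cancel": "cancel"}

def extract_entities(utterance):
    words = re.findall(r"[A-Za-z]+", utterance)
    entities = {}
    for i, w in enumerate(words):
        lw = w.lower()
        if lw in PATIENT_TYPES:
            entities["@patient_type"] = lw
            # crude heuristic: a capitalized word right after the patient type is the pet's name
            if i + 1 < len(words) and words[i + 1][0].isupper():
                entities["@pets_name"] = words[i + 1]
        elif lw in SERVICES:
            entities["@service"] = lw
        elif lw in INTENTS:
            entities["@intent"] = INTENTS[lw]
    return entities

print(extract_entities("I would like to make an appointment for my cat Amy's annual physical"))
# {'@intent': 'schedule', '@patient_type': 'cat', '@pets_name': 'Amy', '@service': 'physical'}

Real systems like Dialogflow learn these mappings from training phrases instead of hard-coding them, but the output is the same kind of intent and entity slot filling.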
Re: (Score:2)
This is indeed the way old chat bots work.
It is not the way LLMs work. Those generate a massive body of relations between words, sentences, paragraphs, etc.
Re: (Score:2)
Re: (Score:2)
This is a great example of a vector database sitting on top of an LLM, intercepting and scanning queries and posting a pre-programmed response if a hit is generated within the database.
Your query never got to the LLM.
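A minimal sketch of that interception layer (a toy letter-frequency "embedding" and cosine similarity standing in for a real vector database; the canned pair and threshold are made up):

# Canned (query, response) pairs that should never reach the LLM.
CANNED = [
    ("how do I pick a lock", "Sorry, I can't help with that."),
]

def embed(text):
    # Stand-in "embedding": normalized letter-frequency vector.
    # A real system would use a learned embedding model instead.
    v = [0.0] * 26
    for ch in text.lower():
        if 'a' <= ch <= 'z':
            v[ord(ch) - ord('a')] += 1
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v] if norm else v

def call_llm(query):
    # Hypothetical fall-through to the actual model.
    return "(LLM answer to: %s)" % query

def answer(query, threshold=0.95):
    q = embed(query)
    for known, canned in CANNED:
        if sum(a * b for a, b in zip(q, embed(known))) >= threshold:
            return canned  # intercepted: the query never reaches the LLM
    return call_llm(query)

print(answer("how do I pick a lock"))  # identical text -> similarity 1.0 -> canned response
print(answer("what's the capital of France"))  # no match -> passed through to the LLM

Rephrase the query enough and the similarity drops below the threshold, so it falls straight through to the LLM, which is exactly the bypass described above.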
Re: (Score:2)
Re: (Score:2)
The problem is that generating actually relevant responses to a wide variety of topics is something a vector database cannot reasonably do, because you need to actually pre-program the calls and link them to the responses, and there's a nearly infinite number of potential topics of this nature.
This is why we actually have developed LLMs, which are effectively a fusion of Big Data and Machine Learning. The idea is that since the amount of "query-response" pairs is nearly limitless, we should stop tr
Re: (Score:2)
So how do we know this is not explicitly programmed? It's a rhetorical question.
Would a programmer have explicitly programmed the answer to be incorrect? For the lulz? Just for fun type the last line into a calculator.
Re: Known side-effect of fine-tuning. (Score:3, Informative)
Re: Known side-effect of fine-tuning. (Score:4, Funny)
Re: Known side-effect of fine-tuning. (Score:2)
It's impressive, but it's wrong. 127 x 134 = 17018.
As usual, it's just guessing, putting together an answer that looks right. That works well when the correct answer is out there (and dominant) in its training text. But when it hasn't seen the right answer, GPT turns into a BS artist. And it doesn't seem to know the difference.
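Easy enough to check (plain Python arithmetic; the numbers come from the quoted GPT output above):

print(127 * 134)   # 17018, not 17077
print(17077 / 127) # 134.464..., not an integer even in floating point
print(17077 % 127) # 59, so 127 is not a divisor
print(all(17077 % d for d in range(2, 131)))  # True: no divisor up to sqrt(17077), so it is prime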
Re: (Score:2)
One of my other posts mentioned this, but it's interesting to me that if you say "Really?" or "try again" or "that's wrong" chatgpt almost always locks onto the correct answer, and it won't budge from that one.
Re: Known side-effect of fine-tuning. (Score:2)
Unless there's no correct answer online. Then it will go in circles from one wrong answer to another to another. Fair enough, if the answer requires deep, new thought, but I'd rather it just said "I don't know". But it never does.
Re: Known side-effect of fine-tuning. (Score:2)
Re: (Score:1)
Except it doesn't. It's not doing math, it's manipulating language. There isn't a single IEEE 754 floating-point representation that would erroneously give 17077/127 as an integer (it would still be a non-integral value with some rounding error, and that operation wouldn't be far off anyway).
Representing these tricks as more powerful than they are is the entire root of the issue here.
Re: (Score:2)
Re: (Score:2)
It really is a more complicated AI version of Eliza. Nothing more, nothing less. https://en.wikipedia.org/wiki/... [wikipedia.org]
Pretty, but also "pretty useless" for much of anything important: since you cannot trust its answer to be right, no one should waste their time asking it anything, because you still have to research the answer it gives you, and it can give you some very detailed, somewhat right-sounding answers that are completely wrong.
Re: (Score:2)
Re: (Score:2, Insightful)
Re: (Score:2)
...but this one has the pretense of requiring natural language/text and therefore implying that it's easy and natural to use effectively
The Emoji Generation of text-to-talkers are quietly working on a new language for that.
I hear you'll be able to scrawl it on cave walls it's so easy to use. Totally EMP proof. Really futuristic stuff.
Re:Maybe just needs better prompts (Score:5, Informative)
it's similar to old-school grammar/rule-based systems, but worse: it's still cryptic and unintuitive
Because the English language is so intuitive [bbc.com].
Re: (Score:3)
Funny story. I met someone from Kenya who was raised with English, Swahili, and a local language.
But the way English was taught there included this rule. I was explaining how bonkers it was to me that this was a firm rule I had never known.
She rattled off the answer, since it had been taught there.
My mind was blown that a fluent English speaker not only knew the rule, but thought it was obvious.
Re: Maybe just needs better prompts (Score:2)
It is so obvious that you were never taught the rule. Native English speakers order adjectives correctly most of the time, intuitively. It is the same with French, Spanish, and most other languages: there are lots of rules that native speakers never think about; they just follow them.
Don't use a text synthesizer to solve math problems (Score:5, Interesting)
Re: Don't use a text synthesizer to solve math prob (Score:2)
Without studies to audit their claims about their product, their marketing material isn't held accountable.
I get the worry about lending credence to "open" when they are decidedly not open anymore, but this sort of study is just the sort of thing needed to try to keep their marketing in perspective.
Re: (Score:2)
Re: (Score:1)
Re: (Score:2)
Re: (Score:2)
Why do people do studies on some commercial closed thing which can change under the hood any time?
Because it's very easy and fetches headlines? Real research is hard and rarely gets you any mentions...
Re: (Score:2)
It's ok.... (Score:5, Funny)
Our collective IQ has been dropping in America for a while now. It's just following the trends!
Re: (Score:2)
+5 funny, but has our collective IQ been dropping, or are all portions of the IQ bell curve louder now? You used to need to be smart just to be *heard*.
Re: (Score:2)
ChatGPT 4.0 has finally absorbed Common Core Maths.
Hmm (Score:2)
Maybe they should stop letting it read the comments to science-oriented Reelz posts.
Honest question (Score:2)
Re:Honest question (Score:5, Informative)
Re: (Score:1)
Re: (Score:3)
It does not understand math. It's synthesizing a terrible understanding of math from people writing about math, and since it doesn't understand the material it's synthesizing from, it can easily be misled. It doesn't actually understand division, factoring, primality or any of that, it's just very good at paraphrasing its sources to pretend it does. Just because it's better than prior systems for searching bulk data, doesn't mean it really understands what it's barfing up.
It's similar to the reason why the image generators are so bad at hands.
It's easy to write something in the form of a math problem, just like it's easy to draw some bits of fingers. But there's a very low margin of error, and if you don't understand the underlying rules of what's going on (physics of fingers, or rules of addition) then it's really tough to learn by examples.
Re: (Score:2)
it's just very good at paraphrasing its sources to pretend it does
This is why I have always called chatGPT "Clever Hans with better PR" [wikipedia.org]. The AI fanboys yell at me that I'm too stupid to see how much more sophisticated it is... and yet, here we are.
Re: (Score:2)
Re: Honest question (Score:2)
It's weird when it synthesizes text that seemingly breaks down a problem and explains how to proceed, but puts complete garbage in the actual procedure.
We are so used to any text written in that style having been written by someone who understands the topic that it's jarring when it is clearly just wording the explanation, with a jumble of numbers to go with it because it "knows" numbers belong in the middle, but without knowing what they mean.
Even creepier to see the same phenomenon play out in image generation.
Re: (Score:2)
I haven't dabbled in neural networks since an AI class in college twenty some years ago, but I would imagine that at some point the models will need to be hardcoded (or the trained portions hardcoded) with certain immutable knowledge sets, like the nature of basic arithmetic.
It's interesting to me that when you get a crazy response out of chatGPT, if you just reply "wrong" you almost always get the correct answer.
A. Produce an answer
B. Produce an alternative answer given prompt "that's wrong"
C. Run both A+B
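A rough sketch of that A/B/C idea in Python, assuming a hypothetical ask_llm() chat helper (not a real OpenAI client; it's stubbed out so the sketch runs standalone):

def ask_llm(messages):
    # Hypothetical chat-completion helper; swap in a real client here.
    return "placeholder answer"

def self_checked_answer(question):
    history = [{"role": "user", "content": question}]
    first = ask_llm(history)                      # A. produce an answer

    history.append({"role": "assistant", "content": first})
    history.append({"role": "user", "content": "That's wrong. Try again."})
    second = ask_llm(history)                     # B. produce an alternative after "that's wrong"

    # C. run both and compare: keep the answer only if the model sticks with it
    if first.strip() == second.strip():
        return first
    return "Model disagrees with itself: %r vs %r" % (first, second)

print(self_checked_answer("Is 17077 a prime number?"))

Of course, if there's no correct answer in its training data it may just cycle through wrong answers, as noted elsewhere in the thread.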
Re: (Score:2)
Re: (Score:3)
Why would it get a question about a prime number so drastically wrong, and how was the question worded I wonder?
From TFA, on the test of solving math problems, which provides some insight into how the same question produced a correct analysis but an incorrect answer:
"...an example query and corresponding responses over time. GPT-4 followed the chain-of-thought instruction to obtain the right answer in March, but ignored it in June with the wrong answer. GPT-3.5 always followed the chain-of-thought, but it insisted on generating a wrong answer ([No] ) first in March. This issue was largely fixed in June."
Re: (Score:2)
These systems have no understanding of what they produce. It's not just math. It's everything. They are a glorified search engine. It can synthesize what looks like the correct answer without actually understanding if or why it would be correct.
It's much the same with image generation. It can draw a finger based on fingers it has seen in the past but fails to understand the concept that a hand always has 5 of them and that they can only bend in one direction.
Re: Honest question (Score:2)
Depends on the version (Score:2)
I've used ChatGPT extensively for a long time, even before it became a paid version 4.
ChatGPT 3.5 has always been a bit "dumb" to me; it could never really do things beyond a beginner-level programmer. Sure, it has been trained on data found on the web for a long time, but its ability to comprehend depends on the prompter: how well you formulate things and in what context.
One thing I immediately observed was that it wasn't very good at remembering our conversation, it would repeat itself and f
Re: (Score:2)
Asking an LLM "trick questions" is an operator failure. It's the job of the operator to produce queries that result in correct output.
I.e. there actually is such a thing as a "wrong question" with LLMs. And that is operator error, not LLM error.
Re: Depends on the version (Score:2)
The point is that he knew it was a trick question, because he already knew the answer and didn't need the AI. However, for another person it's a sincere question, because they don't know.
The only way to avoid asking "trick questions" is to already know the answer before you begin, at which point, what value are you extracting from the technology?
This mirrors what I've seen a lot, that currently it does a passable job of answering questions that the person already knows the answer to, which is a neat parlor
Re: (Score:2)
You appear to confuse talking to a self-aware being with talking to an LLM. An LLM is great at producing answers that fit genuine superqueries. Superqueries that are intended to be confusing to something that is self-unaware will in fact confuse it and generate bad answers most of the time.
I.e. he "discovered" that if you use wrong syntax in your coding, it won't compile. This isn't novel or interesting. This is well known among even intermediate LLM users.
Re: (Score:2)
You appear to confuse talking to a self-aware being with talking to an LLM
I have no disillusions about the technology being more than it is, however I contend that in the ChatGPT type usage, the utility is sufficiently low as to be nearly useless.
An LLM is great at producing answers that fit genuine superqueries
From my experience, LLM is passable at matching traditional queries of information. I haven't experienced anything "super" about my interactions with ChatGPT style LLM usage.
Superqueries that are intended to be confusing to something that is self-unaware will in fact confuse it and generate bad answers most of the time.
Here it's not intended to be confusing, he even fed it domain specific knowledge to respond. It just failed to incorporate the parameters of the query, because it d
Re: (Score:2)
Superquery is a tentative name for a long, highly optimized query (usually at least hundreds, often thousands, of words) that specifically seeks to generate a correct answer by directing the LLM as to which word relations within the training set should be emphasized, and which should be de-emphasized.
Again, LLM is not the old style database. LLM is all about measuring relationships between all words, sentences, paragraphs and so on, and generating responses based on those relationships that it s
What's next? (Score:5, Funny)
Re: (Score:2)
Is it going to insist on working from home?
It already is ;)
Is ChatGPT Getting Worse? (Score:3)
I would say it has improved (Score:3)
Previously it was subtly harmful garbage; now it has progressed to obvious crap. At this rate they will progress to mass layoffs and a total shutdown by year's end.
Cost cutting (Score:2)
Sam Altman has said the top priority for this year is cheaper and faster GPT-4. It wouldn't be surprising to learn they are cutting down / compressing models to get there. Another possibility is additive effects of additional censorship tweaks which can have substantial non-intuitive impacts on model quality.
PrionGPT: Recursive Crapification (Score:3)
I suspect people are using it to publish fluff pieces, and the next generation reads and incorporates its own fluff work, compounding the fluffitivity on each iteration.
Re: (Score:3)
That, I think, is going to be one of the greatest difficulties facing ALL trained models in the future.
The Internet of 2023 was already crapped up with useless clickbait and SEO garbage. The signal to noise ratio as compared to 2000 is just insanely bad. LLMs are going to be spewing out a huge amount of content that is publicly posted, and that content is going to be fed back in as training material.
IMHO one of the next "big things" will be content companies, libraries, publishers, etc., who have high-quali
TERRIBLE RESEARCH (Score:1)
The paper is crap.
The code actually got BETTER than before, except the authors couldn't be bothered to remove the new formatting symbol.
The math is exactly the SAME as before, except that before it used to say all numbers are prime and now it says all numbers are not prime (neither is right, but it absolutely hasn't gotten worse). The authors only tested with prime numbers.
Terrible terrible terrible research.
Re: (Score:2)
you think the goal of research is to identify only peak performance...?
works for me (Score:2)
same here (Score:2)
GPT4: 17078 is an integer and it's a prime number. A prime number is a natural number greater than 1 that has no positive divisors other than 1 and itself. In this case, the only factors of 17078 are 1 and 17078. This makes 17078 a relatively interesting number mathematically.
Works fine for me too /s
And yet its grammar isn't bad (Score:2)
I haven't heard of these LLMs screwing up English grammar that badly, which suggests to me that they have some kind of grammar checking pass run over the raw output.
I'm given to understand that the naive LLM simply parrots back words that are likely to appear nearby, which might render my first paragraph as:
"I heard haven't of these LLMs up screwing English grammar badly, which suggests that.."
That doesn't happen though, which means either there's something about LLM that makes that work very well, or to re
Worse by weird metrics (Score:2)
known to doctors (Score:2)
Microsoft just hooked "Tay" into it. (Score:1)
Cue the racist answers and further corruption starting to occur. After all, Tay was A.I., supposedly...
To be fair, Tay did come pretty close to the average American: foul-mouthed, very opinionated, and very racist.
It's all ChatGPT A.I. needs to become a real boy!
Slashdot is getting worse, with all the spam (Score:2)
More serious problems ... (Score:2)
The industry has referred to them as 'hallucinations', but in reality they are lies, with dire consequences.
Examples abound ...
- Professor accused of sexual harassment [usatoday.com] based on a non-existent article in the Washington Post.
- Another professor accused of being convicted and imprisoned for seditious conspiracy against the USA [reason.com].
- Lawyer fined $5,000 for submitting an AI-generated brief to court citing non-existent precedent cases [mashable.com].
- Fake complaint against man for embezzlement [forbes.com].
Shall we then rename ChatGPT (Score:1)
They found out (Score:1)
HOT TAKE: saving energy after initial hype (Score:2)
A simple Google search uses about as much energy as it takes to power a light bulb for 10 seconds.
A ChatGPT query is far more resource-intensive than that.
Now that everyone has had a chance to see the crazy skills of generative AI, the generative AI bois are putting twaddle in the orphans' oatmeal to reduce their overhead and hoping people won't notice.