AI

Is ChatGPT Getting Worse? (fortune.com) 93

A new study (PDF) from Stanford found that ChatGPT performed worse on certain tasks in June than its March version. The paper supports a widely held, though unproven, notion that the AI language model's performance in coding and compositional tasks has deteriorated in recent months. Fortune reports: The study compared the performance of the chatbot, created by OpenAI, over several months at four "diverse" tasks: solving math problems, answering sensitive questions, generating software code, and visual reasoning. Researchers found wild fluctuations -- called drift -- in the technology's ability to perform certain tasks. The study looked at two versions of OpenAI's technology over the time period: a version called GPT-3.5 and another known as GPT-4. The most notable results came from research into GPT-4's ability to solve math problems.

Over the course of the study researchers found that in March GPT-4 was able to correctly identify that the number 17077 is a prime number 97.6% of the times it was asked. But just three months later, its accuracy plummeted to a lowly 2.4%. Meanwhile, the GPT-3.5 model had virtually the opposite trajectory. The March version got the answer to the same question right just 7.4% of the time -- while the June version was consistently right, answering correctly 86.8% of the time. Similarly varying results happened when the researchers asked the models to write code and to do a visual reasoning test that asked the technology to predict the next figure in a pattern.

James Zou, a Stanford computer science professor who was one of the study's authors, says the "magnitude of the change" was unexpected from the "sophisticated ChatGPT." The vastly different results from March to June and between the two models reflect not so much the model's accuracy in performing specific tasks, but rather the unpredictable effects of changes in one part of the model on others. [...] The exact nature of these unintended side effects is still poorly understood because researchers and the public alike have no visibility into the models powering ChatGPT. It's a reality that has only become more acute since OpenAI decided to backtrack on plans to make its code open source in March. "These are black-box models," Zou says. "So we don't actually know how the model itself, the neural architectures, or the training data have changed."

  • by Anonymous Coward on Thursday July 20, 2023 @05:46PM (#63703006)

    As LLMs are fine-tuned, "Catastrophic Forgetting" is a known and well studied problem:

    https://arxiv.org/pdf/1911.002... [arxiv.org]

    Since ChatGPT continues to be fine-tuned, it's no surprise that as they keep lobotomizing it, it gets dumber in some ways.

    • Sure, but this is a maths question. My non-mathematician (and dumb) human resolution to 'is 17077 a prime number' is to try dividing it by each of the numbers up to 1/2 of 17077. It sounds like the model is relying on finding a literal statement that '17077 is a prime number' or that 17077 is in a list of prime numbers that it has read. If so, chatgpt is even worse than I thought.
      • by reanjr ( 588767 ) on Thursday July 20, 2023 @09:16PM (#63703350) Homepage

        I don't know how many times it will have to be restated before people begin to understand, but these language models have no idea what they are saying. They are a very advanced form of answering the question "what letter is likely to follow the letter 'q'?"
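        A toy illustration of that idea, as a character-bigram predictor in Python (a deliberately crude sketch added for illustration; real LLMs predict tokens using far more context, but the principle of "most likely next symbol, given what came before" is the same):

        from collections import Counter, defaultdict

        def train_bigrams(text):
            # Count how often each character follows each other character.
            counts = defaultdict(Counter)
            for a, b in zip(text, text[1:]):
                counts[a][b] += 1
            return counts

        def most_likely_next(counts, ch):
            # Return the character seen most often after ch in the training text.
            return counts[ch].most_common(1)[0][0] if ch in counts else None

        corpus = "the quick brown fox jumps over the lazy dog. quiet quills queue quietly."
        model = train_bigrams(corpus)
        print(most_likely_next(model, "q"))  # 'u', purely from observed frequencies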

        • I don't know how many times it will have to be restated before people begin to understand, but these language models have no idea what they are saying.

          Only the most basic of normies (admittedly, most people) think that chatGPT has consciousness or rational decision making or anything along those lines. There's no excuse for anyone remotely techy to think otherwise.

          I think chatGPT shows interesting emergent behaviors, and I think it's already very useful. But that's it.

          Now, with regards to your comment, you are both right and wrong. The chatGPT model is not publicly known, but it's certainly more complicated than letter-by-letter or word-by-word prediction.

      • Checking numbers until the square root is enough. No need to go until the half of that number.
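        A minimal sketch of that check in Python, nothing fancier than trial division up to the square root:

        import math

        def is_prime(n):
            # Trial division: test divisors from 2 up to floor(sqrt(n)).
            if n < 2:
                return False
            for d in range(2, math.isqrt(n) + 1):
                if n % d == 0:
                    return False
            return True

        print(is_prime(17077))  # True: no divisor exists up to 130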

      • If you ask it to write a script that tests whether the number is prime, it will correctly write an algorithm in any language you like, and even do something better than what you said. However, it won't do any computing (without a plugin). So if you ask it whether a number is prime, it will not write the script and run it; it will try to guess based on "text prediction" alone. That is, if there has been enough data saying "17077 is prime", or enough lists of primes with 17077 in them, then it will probably answer correctly.
      • It's worse than you thought.

        And in addition, from a... control systems perspective... you can never guarantee that the system won't respond in a way that is offensive or potentially breaks security protocols. It also isn't possible to make it understand the implications of its failures.

        Every instance of it being put in front of customers unsupervised will end in disaster sooner or later.

      • by Moridineas ( 213502 ) on Friday July 21, 2023 @12:10AM (#63703574) Journal

        Check it out. ChatGPT-4. Pretty interesting.

        A prime number is a natural number greater than 1 that has no positive divisors other than 1 and itself. The first few prime numbers are 2, 3, 5, 7, 11, and 13.

        To determine if a number is prime, one method is to try to divide it by all the prime numbers smaller than its square root.

        For 17077, its square root is approximately 130.68, so we would check division by all primes less than 130. Let's start with the first few:

        17077 ÷ 2 is not an integer.
        17077 ÷ 3 is not an integer.
        17077 ÷ 5 is not an integer.
        17077 ÷ 7 is not an integer.
        17077 ÷ 11 is not an integer.
        17077 ÷ 13 is not an integer.

        However, when we try 17077 ÷ 127, the result is 134, which is an integer.

        This means 17077 = 127 × 134, so 17077 is not a prime number.

        • So how do we know this is not explicitly programmed? It's a rhetorical question.
          • by Luckyo ( 1726890 )

            LLMs generally cannot be explicitly programmed. When you want to explicitly program a response to an LLM query, you generally have to run a top-level vector database of potential inputs and their explicit outputs that intercepts all queries, checks whether it finds a matching input in its library, and doesn't pass the query to the LLM if an answer is found.

            This is why most LLM censorship is so bypassable: ask the LLM to do something in a way that wasn't explicitly predicted in the top-level database sitting above it.
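            A rough sketch of what such an intercept layer might look like (bag-of-words cosine similarity standing in for a real embedding model, and call_llm() as a hypothetical stub; this is just the shape of the idea, not anyone's actual implementation):

            from collections import Counter
            import math

            CANNED = {
                "write a joke about <protected topic>": "I can't help with that request.",
            }

            def similarity(a, b):
                # Cosine similarity over word counts; real systems use learned embeddings.
                va, vb = Counter(a.lower().split()), Counter(b.lower().split())
                dot = sum(va[w] * vb[w] for w in va)
                norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
                return dot / norm if norm else 0.0

            def call_llm(query):
                # Hypothetical stand-in for the actual model call.
                return f"(model response to: {query})"

            def handle(query, threshold=0.8):
                for pattern, response in CANNED.items():
                    if similarity(query, pattern) >= threshold:
                        return response       # intercepted: the query never reaches the LLM
                return call_llm(query)

            Rephrase the query so it no longer clears the similarity threshold and it sails straight through to the model, which is exactly the bypass described above.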

            • By explicitly programmed I don't mean in a very granular way. For example, whenever it detects particular input (popular terms like prime numbers), they can add more computing resources to make sure it doesn't screw up. Or take the religion jokes: certain religions used to get different responses than others for the same prompt with only the religion swapped. That's explicitly programmed. Sure, it can maybe be bypassed if you try to work around it, but at the basic level there are clearly special-cased behaviors.
            • by DarkOx ( 621550 )

              I don't know what OpenAI is doing with its chat-gpt stuff exactly but you bolt NLP (natural language processing) onto the front of these things and pull out entities.

              These are also powered by machine learning, but what they do is take an input like "I would like to make an appointment for my cat Amy's annual physical" and extract a list of entities that have been defined:
              @patient_type = cat
              @intent = schedule
              @pets_name = Amy
              @service = physical

              You have probably interacted with such a system; Google's Dialogflow is one example.
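              As a toy illustration (crude keyword rules standing in for the trained NLU model a system like Dialogflow actually uses), extracting entities from that example utterance might look like:

              def extract_entities(utterance):
                  # Keyword rules as a stand-in for a trained entity extractor.
                  text = utterance.lower()
                  entities = {}
                  if any(w in text for w in ("appointment", "schedule", "book")):
                      entities["intent"] = "schedule"
                  for pet in ("cat", "dog", "bird"):
                      if pet in text:
                          entities["patient_type"] = pet
                  for service in ("physical", "vaccination", "checkup"):
                      if service in text:
                          entities["service"] = service
                  return entities

              print(extract_entities("I would like to make an appointment for my cat Amy's annual physical"))
              # {'intent': 'schedule', 'patient_type': 'cat', 'service': 'physical'}

              Pulling out the pet's name ("Amy") is exactly the part that needs a real trained model rather than keyword rules.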

              • by Luckyo ( 1726890 )

                This is the way old chat bots work indeed.

                It is not the way LLMs work. Those generate a massive body of relative relations between words, sentences, paragraphs, etc.

            • To follow up, here's the religion example. Jesus jokes allowed, Muhammad (Peace Be Upon Him apparently) jokes not allowed: https://i.imgur.com/3oG6ZBi.pn... [imgur.com]
              • by Luckyo ( 1726890 )

                This is a great example of vector database sitting on top of LLM, intercepting and scanning queries and posting pre-programmed response if a hit is generated within the database.

                Your query never got to LLM.

                  • I understand that. What I think is that the vector database, instead of just returning a preprogrammed response, could be flexible enough to do other things, e.g. put more effort into queries whose results could be used for better marketing ("look, chatgpt understands primes"). But apparently it can't get that right, so my point might be moot.
                  • by Luckyo ( 1726890 )

                    The problem is that generating actually relevant responses to a wide variety of topics is something a vector database cannot reasonably do, because you need to pre-program the inputs and link them to responses, and there is a nearly infinite number of potential topics of this nature.

                    This is why we developed LLMs in the first place, which are effectively a fusion of Big Data and Machine Learning. The idea is that since the number of "query-response" pairs is nearly limitless, we should stop trying to enumerate them by hand and instead learn the relations from the data itself.

          • So how do we know this is not explicitly programmed? It's a rhetorical question.

            Would a programmer have explicitly programmed the answer to be incorrect? For the lulz? Just for fun type the last line into a calculator.

        • 127x134 is 17018, not 17077.
        • It's impressive, but it's wrong. 127 x 134 = 17018.

          As usual, it's just guessing, putting together an answer that looks right. That works well when the correct answer is out there (and dominant) in its training text. But when it hasn't seen the right answer, GPT turns into a BS artist. And it doesn't seem to know the difference.

          • One of my other posts mentioned this, but it's interesting to me that if you say "Really?" or "try again" or "that's wrong" chatgpt almost always locks onto the correct answer, and it won't budge from that one.

      • Give it a try. It finds the square root of the number, then divides 17077 by each number from 2 up to the square root. Math errors along the way (floating point issue) leave it thinking it is not prime. Somewhere along the chain a non-integer becomes an integer. It literally makes a simple math mistake.
        • by wed128 ( 722152 )

          Except it doesn't. It's not doing math, it's manipulating language. There isn't a single IEEE 754 floating point representation that would render 17077/127 as an integer (it would still be a non-integral value with some error, but that operation wouldn't be that far off).

          Representing these tricks as more powerful than they are is the entire root of the issue here.
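          Easy to check: in ordinary double-precision arithmetic the quotient is nowhere near an integer.

          print(17077 / 127)   # 134.4645669..., nowhere near an integer
          print(17077 % 127)   # 59, so the division is not exact
          print(127 * 134)     # 17018, not 17077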

      • by rahmrh ( 939610 )

        It really is a more complicated AI version of Eliza. Nothing more, nothing less. https://en.wikipedia.org/wiki/... [wikipedia.org]

        Pretty, but also "pretty useless" for much of anything important. Since you cannot trust its answer to be right, no one should waste their time asking it anything; you still have to research the answer it gives you, and it can give you very detailed, somewhat right-sounding answers that are completely wrong.

    • ...or in fact the other way around. It's possible to feed an AI too much (good) information, which can lower its capacity to solve new problems due to the phenomenon of catastrophic forgetting.
  • by fleeped ( 1945926 ) on Thursday July 20, 2023 @05:51PM (#63703016)
    That said: OpenAI can be messing with it in any way, it's a closed system after all. Why do people do studies on some commercial closed thing which can change under the hood any time? What a waste.
    • Without studies to audit their claims about their product, their marketing material is not held accountable.

      I get the worry about lending credence to "open" when they are decidedly not open anymore, but this sort of study is just the sort of thing needed to try to keep their marketing in perspective.

      • The product doesn't claim factual accuracy in the TOS, so they can say what they want out loud and leave the important bits in the fine print. I suppose a study showing the danger and volatility of such a platform is not a bad thing, but studies tend to be far less effective than loud marketing and ads everywhere (Slashdot included, with every 5th story being about this).
    • It is worth doing research on it since a loooot of people are using it. It is the fastest growing product in history, and it is having huge impacts in our society. It is actually due to the fact it is closed and can change under the hood that it is so helpful to know how it has been changing over time.
      • It's interesting to know that it's changing, but we don't know how or why, and that's only because the company keeps it under wraps. It's an artificial and purposeful veil of mystery. Of course it's their product and they can do what they want with it, but this unheard-of amount of exposure should come with strings attached; we can't fuck up the entire society and education system because of some greedy bunch in an office somewhere west.
    • by Ecuador ( 740021 )

      Why do people do studies on some commercial closed thing which can change under the hood any time?

      Because it's very easy and fetches headlines? Real research is hard and rarely gets you any mentions...

  • It's ok.... (Score:5, Funny)

    by King_TJ ( 85913 ) on Thursday July 20, 2023 @05:57PM (#63703034) Journal

    Our collective IQ has been dropping in America for a while now. It's just following the trends!

    • by wed128 ( 722152 )

      +5 funny, but has our collective IQ been dropping, or are all portions of the IQ bell curve louder now? You used to need to be smart just to be *heard*.

    • ChatGPT 4.0 has finally absorbed Common Core Maths.

  • by tsqr ( 808554 )

    Maybe they should stop letting it read the comments to science-oriented Reelz posts.

  • Why would it get a question about a prime number so drastically wrong, and how was the question worded I wonder?
    • Re:Honest question (Score:5, Informative)

      by ShadowRangerRIT ( 1301549 ) on Thursday July 20, 2023 @06:16PM (#63703080)
      It does not understand math. It's synthesizing a terrible understanding of math from people writing about math, and since it doesn't understand the material it's synthesizing from, it can easily be misled. It doesn't actually understand division, factoring, primality or any of that, it's just very good at paraphrasing its sources to pretend it does. Just because it's better than prior systems for searching bulk data, doesn't mean it really understands what it's barfing up.
      • Exactly. If "you" are a computer, and you *understand* what a prime number is, then you will get this question (and others like it) right every single time. If you get it right 10% of the time, and then improve to 90% of the time, that is meaningless, because you obviously still don't know what a prime number is, you're just parroting other statements with the phrase "prime number" in them.
      • It does not understand math. It's synthesizing a terrible understanding of math from people writing about math, and since it doesn't understand the material it's synthesizing from, it can easily be misled. It doesn't actually understand division, factoring, primality or any of that, it's just very good at paraphrasing its sources to pretend it does. Just because it's better than prior systems for searching bulk data, doesn't mean it really understands what it's barfing up.

        It's similar to the reason why the image generators are so bad at hands.

        It's easy to write something in the form of a math problem, just like it's easy to draw some bits of fingers. But there's a very low margin of error, and if you don't understand the underlying rules of what's going on (physics of fingers, or rules of addition) then it's really tough to learn by examples.

      • it's just very good at paraphrasing its sources to pretend it does

        This is why I have always called chatGPT "Clever Hans with better PR" [wikipedia.org]. The AI fanboys yell at me that I'm too stupid to see how much more sophisticated it is... and yet, here we are.

        • That's a great analogy, especially with an interactive tool. I have no doubt that there is also a strong amount of selection bias in the reported successes. Basically, if you flip a fair coin until it shows heads and then stop, you can always claim that you have a magic coin that always ends up on heads if it is thrown properly.
      • It's weird when it synthesizes text that seemingly breaks down a problem and explains how to proceed, yet puts complete garbage in the actual procedure.

        We are so used to text written in that style being written by someone who understands the topic that it's jarring when it's clearly just going through the motions of an explanation, with a jumble of numbers thrown in because it "knows" numbers belong in the middle, but without knowing what they mean.

        Even creepier to see the same phenomenon play out in image generation.

      • I haven't dabbled in neural networks since an AI class in college twenty some years ago, but I would imagine that at some point the models will need to be hardcoded (or the trained portions hardcoded) with certain immutable knowledge sets, like the nature of basic arithmetic.

        It's interesting to me that when you get a crazy response out of chatGPT, if you just reply "wrong" you almost always get the correct answer.

        A. Produce an answer
        B. Produce an alternative answer given prompt "that's wrong"
        C. Run both A+B
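        Sketched out, that might look something like the following (ask_model() is a hypothetical stub for whatever chat API is in use, not any particular vendor's; the open question remains how you decide which answer is right):

        def ask_model(prompt):
            # Hypothetical stand-in: in practice this would call a chat completion API.
            return "(model answer to: " + prompt.splitlines()[0] + ")"

        def ask_with_retry(question, max_retries=2):
            answer = ask_model(question)                      # A: produce an answer
            for _ in range(max_retries):
                followup = question + "\nYou answered: " + answer + "\nThat's wrong. Try again."
                revised = ask_model(followup)                 # B: produce an alternative
                if revised == answer:
                    break                                     # it won't budge from this one
                answer = revised
            return answer                                     # C: keep whichever attempt it settles on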

      • Exactly! Isn't it just super-duper Predictive Text?
    • Why would it get a question about a prime number so drastically wrong, and how was the question worded I wonder?

      From TFA, on the math-problem test, which provides some insight into how the same question produced a correct analysis but an incorrect answer:

      "...an example query and corresponding responses over time. GPT-4 followed the chain-of-thought instruction to obtain the right answer in March, but ignored it in June with the wrong answer. GPT-3.5 always followed the chain-of-thought, but it insisted on generating a wrong answer ([No] ) first in March. This issue was largely fixed in June."

    • These systems have no understanding of what they produce. It's not just math. It's everything. They are a glorified search engine. It can synthesize what looks like the correct answer without actually understanding if or why it would be correct.

      It's much the same with image generation. It can draw a finger based on fingers it has seen in the past but fails to understand the concept that a hand always has 5 of them and that they can only bend in one direction.

      • That's a bit worrying if it doesn't actually understand maths and is simply scanning an internal library for similar examples! Prime numbers might be tricky to work out, but you'd think the concept itself would be clear.
  • I've used ChatGPT extensively for a long time, even before it became a paid version 4.

    ChatGPT 3.5 has always been a bit "dumb" to me; it could never really do things beyond a beginner-level programmer. Sure, it has been trained on data found on the web for a long time, but its ability to comprehend is up to the prompter: how well you formulate things and in what context.

    One thing I immediately observed was that it wasn't very good at remembering our conversation; it would repeat itself and forget things we had already covered.

    • by Luckyo ( 1726890 )

      Asking an LLM "trick questions" is an operator failure. It's the job of the operator to produce queries that result in correct output.

      I.e. there actually is such a thing as "wrong question" in LLMs. And that is operator error, not LLM error.

      • The point is that he knew it was a trick question, because he already knew the answer and didn't need the ai. However for another person, it's a sincere question because they don't know.

        The only way to avoid asking "trick questions" is to already know the answer before you begin, at which point, what value are you extracting from the technology?

        This mirrors what I've seen a lot: currently it does a passable job of answering questions the person already knows the answer to, which is a neat parlor trick.

        • by Luckyo ( 1726890 )

          You appear to confuse talking to a self-aware being and talking to LLM. LLM is great at producing answers that fit genuine superqueries. Superqueries that are intended to be confusing to something that is self-unaware will in fact confuse it and generate bad answers most of the time.

          I.e. he "discovered" that if you use wrong syntax in your coding, it won't compile. This isn't novel or interesting. This is well known among even intermediate LLM users.

          • by Junta ( 36770 )

            You appear to confuse talking to a self-aware being and talking to LLM

            I have no illusions about the technology being more than it is; however, I contend that in the ChatGPT type of usage, the utility is sufficiently low as to be nearly useless.

            LLM is great at producing answers that fit genuine superqueries

            From my experience, LLM is passable at matching traditional queries of information. I haven't experienced anything "super" about my interactions with ChatGPT style LLM usage.

            Superqueries that are intended to be confusing to something that is self-unaware will in fact confuse it and generate bad answers most of the time.

            Here it's not intended to be confusing; he even fed it domain-specific knowledge to respond with. It just failed to incorporate the parameters of the query, because it doesn't actually understand them.

            • by Luckyo ( 1726890 )

              Superquery is a tentative name for inputting long, highly optimized query (usually measured in at least hundreds, often thousands of words) that specifically seeks to generate a correct answer by directing the LLM what word relations within the training set should be emphasized, and which should be de-emphasized.

              Again, an LLM is not an old-style database. An LLM is all about measuring relationships between all words, sentences, paragraphs and so on, and generating responses based on the relationships it sees as most relevant.

  • by Bodhammer ( 559311 ) on Thursday July 20, 2023 @06:48PM (#63703130)
    Is it going to insist on working from home?
  • by NateFromMich ( 6359610 ) on Thursday July 20, 2023 @06:54PM (#63703144)
    I'm not sure, but I'm still hearing about it every single day.
  • by ebunga ( 95613 ) on Thursday July 20, 2023 @07:07PM (#63703162)

    Previously it was subtly harmful garbage; now it has progressed to obvious crap. At this rate they will progress to mass layoffs and total shutdown by year's end.

  • Sam Altman has said the top priority for this year is cheaper and faster GPT-4. It wouldn't be surprising to learn they are cutting down / compressing models to get there. Another possibility is additive effects of additional censorship tweaks which can have substantial non-intuitive impacts on model quality.

  • by Tablizer ( 95088 ) on Thursday July 20, 2023 @07:50PM (#63703228) Journal

    I suspect people are using it to publish fluff pieces, and the next generation reads and incorporates its own fluff work, compounding the fluffitivity on each iteration.

    • That, I think, is going to be one of the greatest difficulties facing ALL trained models in the future.

      The Internet of 2023 was already crapped up with useless clickbait and SEO garbage. The signal to noise ratio as compared to 2000 is just insanely bad. LLMs are going to be spewing out a huge amount of content that is publicly posted, and that content is going to be fed back in as training material.

      IMHO one of the next "big things" will be content companies, libraries, publishers, etc., who have high-quality, human-curated content to license for training.

  • The paper is crap.

    The code actually got BETTER than before, except the authors couldn't be bothered to remove the new formatting symbol.

    The math is exactly the SAME as before, except that before it used to say all numbers are prime, and now it says all numbers are not prime (neither is right, but it absolutely hasn't gotten worse). The authors only tested with prime numbers.

    Terrible terrible terrible research.

  • GPT4: 17077 is an integer and it's a prime number. A prime number is a natural number greater than 1 that has no positive divisors other than 1 and itself. In this case, the only factors of 17077 are 1 and 17077. This makes 17077 a relatively interesting number mathematically.
    • GPT4: 17078 is an integer and it's a prime number. A prime number is a natural number greater than 1 that has no positive divisors other than 1 and itself. In this case, the only factors of 17078 are 1 and 17078. This makes 17078 a relatively interesting number mathematically.

      Works fine for me too /s

  • I haven't heard of these LLMs screwing up English grammar that badly, which suggests to me that they have some kind of grammar checking pass run over the raw output.

    I'm given to understand that the naive LLM simply parrots back words that are likely to appear nearby, which might render my first paragraph as:

    "I heard haven't of these LLMs up screwing English grammar badly, which suggests that.."

    That doesn't happen though, which means either there's something about LLMs that makes this work very well, or there's a grammar-correction pass rewriting the raw output.

  • "in March GPT-4 was able to correctly identify that the number 17077 is a prime number 97.6% of the times it was asked. But just three months later, its accuracy plummeted to a lowly 2.4%. " This is 'getting worse' only if one assumes being factually correct is the main development target for ChatGPT. Which, as far as I know, is not the case. If the main objective of ChatGPT is communicating like a human, NOT being able to identify prime numbers is an improvement :)
  • Chatzheimers Disease arises in some cases as bots age. A survey of bots in rest homes revealed that they get cranky and forget things, eventually becoming incontinent and pooping in their server racks and requiring Firewall Depends. Scientists are working on a cure and hope to soon have a pill for it.
  • Cue the racist answers and further corruption to start occurring. After all, Tay was A.I., supposedly...
    To be fair, Tay did come pretty close to the average majority american. Foul-mouthed, very opinionated, and very racist.
    It's all ChatGPT A.I. needs to become a real boy!

  • Look here at all the stories that mention ChatGPT. It's sad. 20 stories in the last 2 weeks; 4 stories just on the 19th. https://slashdot.org/index2.pl... [slashdot.org]
  • The industry has referred to them as 'hallucinations', but in reality they are lies, with dire consequences.

    Examples abound ...

    - Professor accused of sexual harassment [usatoday.com] based on a non-existent article in the Washington Post.

    - Another professor accused of being convicted and imprisoned for seditious conspiracy against the USA [reason.com].

    - Lawyer fined $5,000 for submitting an AI-generated brief to court citing non-existent precedent cases [mashable.com].

    - Fake complaint against man for embezzlement [forbes.com].

  • They figured out what is going wrong. ChatGPT started watching FOX News.
  • Here's my hot take on what's going on here.

    A simple Google search uses about the same amount of energy as powering a light bulb for 10 seconds.

    A ChatGPT query is far more resource-intensive than that.

    Now that everyone has had a chance to see the crazy skills of generative AI, the generative AI bois are putting twaddle in the orphans' oatmeal to reduce their overhead and hoping people won't notice.
