AI

Study Finds LLM Users Have Weaker Understanding After Research (msn.com) 111

Researchers at the University of Pennsylvania's Wharton School found that people who used large language models to research topics demonstrated weaker understanding and produced fewer original insights than those using Google searches.

The study, involving more than 4,500 participants across four experiments, showed LLM users spent less time researching, exerted less effort, and wrote shorter, less detailed responses. In the first experiment, over 1,100 participants researched vegetable gardening using either Google or ChatGPT. Google users wrote longer responses with more unique phrasing and factual references. A second experiment with nearly 2,000 participants presented identical gardening information either as an AI summary or across mock webpages, with Google users again engaging more deeply and retaining more information.
  • by Pseudonymous Powers ( 4097097 ) on Thursday June 26, 2025 @11:30AM (#65477822)
    I asked my favorite LLM to show me some scientific studies contradicting these findings, and so far it's found literally THOUSANDS and still counting. Some of them are even from the FUTURE.
    • by Rei ( 128717 )

      You better put the word "findings" in quotation marks. This "study" is a preprint, has not been peer reviewed, and it's being widely mocked for its bad methodology. But the media is just loving to run with it. Even the lead author is complaining about the media's "LLMs cause brain damage!" hot takes.

      • by BadDreamer ( 196188 ) on Thursday June 26, 2025 @02:37PM (#65478348)

        It's consistent with the experience of many of us who interact with a lot of LLM users. And the mocking of the methodology is flawed. Nothing is wrong with the methodology; the only issue is that the sample size isn't that large.

        But the LLM stooges are just loving to run with that narrative. Can't have headlines saying that replacing walking with driving causes stamina to deteriorate, now can we?

        • What we need is a 2nd study, using 400 students, separated into four groups:
          1. Using Google ONLY by looking at the 3rd page of results (the first two pages are now taken up with Gemini AI and targeted advertising).
          2. Using ChatGPT Only.
          3. Using inventory computers in a large metropolitan library.
          4. Using old fashioned card catalogs and books.

          I wonder, if we chose a sufficiently esoteric subject, with a 100-question exam given after a week, whether any useful clustering could be detected.

        • Yes, they can only say Americans are obese because they spend too much time on the sofa; they can't say that the sofa is installed in an automobile.
    • by gweihir ( 88907 )

      Hehehehe, excellent! Made me LOL. Thanks!

  • by SlashbotAgent ( 6477336 ) on Thursday June 26, 2025 @11:33AM (#65477830)

    Your grandad knew this before you were born. That's why he told you to look it up yourself. If he just gave you the answer, you'd not remember it as well.

    It's probably been well known since the Romans.

    • by drnb ( 2434720 )

      Your grandad knew this before you were born. That's why he told you to look it up yourself. If he just gave you the answer, you'd not remember it as well.

      Plus there is taking handwritten notes as you are reading something in a book or paper, or listening to a lecture. The writing process helps with retention.

      It's probably been well known since the Romans.

      legere te ipsum :-)

    • by Calydor ( 739835 )

      Give a man a fish and you feed him for a day.

      Teach a man how to fish and you feed him for a lifetime.

    • It's more than that. Research is a skill, and like all skills needs to be maintained. Especially when the information landscape keeps changing, as is the norm these days.

      Evaluating sources, taking notes, summarizing, and correlating lead both to better understanding and retention and to a self-reinforcing increase in skill at doing it efficiently. And that has the side effect of making it much easier to spot and counter BS.

    • by gweihir ( 88907 )

      Indeed. And I really do not get it. I mean, I am very much in the "cogito ergo sum" camp. Thinking is what I do and what makes me me and I find it relaxing. But apparently most people do not like to be people?

  • Problem is (Score:4, Insightful)

    by nospam007 ( 722110 ) * on Thursday June 26, 2025 @11:36AM (#65477846)

    ...90% of gardening photos and articles were created by AI, so Googling doesn't really help.

    • by Sique ( 173459 )
      It still does, because now you are vetting the search results for whether they could have been generated by AI, which is a cognitive process and helps with remembering and more deeply understanding your research.
      • You can weed out the ones with alien cats and strange fantasy mushrooms in them, but not the ones that look real, even though their combination could never work, because every single species needs a different pH value and/or type of soil and yet they are put together nonetheless.

        If you already KNEW all that, you wouldn't NEED to do the search; people would come searching for YOU.

        • by Sique ( 173459 )
          You do it the same way as you do all your research. Compare them. Compare sources. Compare with previous knowledge. Do your due diligence. AI is just another source, and you work with it as with every other source. There is no point in throwing up your hands and crying: "But it's AI!"
    • > 90%
      Unless you add site:reddit to your search... because only like 70% of Redditors inexplicably seem to think posting shit copied from Grok or some other free-tier LLM chatbot is actually useful for the world.
    • I have a fairly strong expectation that poisoning the well like this will increase the value of pre-AI texts.
  • You must control for time. What they needed to do was run an experiment where they had two groups research the same topic for the SAME AMOUNT OF TIME, one with Google and one with ChatGPT.

    All the current project proves is that if I research a topic for 4 hours, I learn more than if I only spent 2 hours. Sir, I am shocked. Shocked I say.
    • by primebase ( 9535 )

      The quote from the article sort of encapsulates the time dilemma: “AI doesn’t have to make us passive. But right now, that’s how people are using it.” People often accept answers from these things without further effort toward validating them, which will naturally be a faster process. Of course that's going to lead to worse learning outcomes for them, since they're just regurgitating what they've been told instead of forming original ideas from valid yet incomplete pieces of knowledge.

      • Turns out, "putting in the work" has benefits after all.

        I think that pretty much describes the real problem with this study. Are you trying to get work done, or are you trying to make yourself smarter and more knowledgeable? Academics live in the latter world, where "putting in the work" does not mean "getting the work done". Problem solving is an academic exercise, not an outcome.

          • That's simply not a correct description of academia and what academics do. "Publish or perish" is a stereotype for a reason. And the study didn't have "academics vs non-academics" or anything of the kind. It had ordinary people doing ordinary research.

          • It had ordinary people doing ordinary research

            I think you missed the point. Did the gardens grow better or produce more for the Google users? For ordinary people that outcome is the measure of success.

            • For ordinary people being assigned a task, the quality of the task completion is part of the measure of success. And the task in this case ended at producing the summary on what to do, and why.

    • by allo ( 1728082 )

      The question is whether there is even a useful comparison. Even if you want to use an LLM as an encyclopedia (though they are not well-suited for this purpose), you would have to compare it to another encyclopedia, like Wikipedia, not to a web search and every website Google can find.

    • Yeah. "LLM users spent less time researching, exerted less effort..." from the summary could be restated as, "their productivity went up." Whether it's worth it to get more original insights would be heavily context-dependent.
      • It can only be restated as that if the quality of the product is not part of the productivity metric. And if you do that, just ignoring the task is immensely more productive.

  • When I used to navigate using paper maps, I had a clearer understanding of where I was.
    With GPS, I follow the robot's orders and hope it gets me to my destination while being totally lost

    • As a rule, I use Google maps to scout where I'm going. From there, it's all from memory, unless it's a long trip in which case I write down the squirrely parts.

      Visualization seems to be the key.

      • by jhecht ( 143058 )
        I do the same thing with Google Maps and it works very well. Visualization is a big help, and once I have traveled a route I can remember it well.
      • As a rule, I use Google maps to scout where I'm going. From there, it's all from memory, unless it's a long trip in which case I write down the squirrely parts. Visualization seems to be the key.

        Basically, you are using the GPS map pretty much like the old paper maps. Especially the highly detailed ones like the Thomas Bros maps. Forgoing only the turn by turn instructions?

        • Essentially, yes. Visualizing where to go works better for me than being told to turn right in 300 feet. Since I've already looked over the route and figured out where I need to go, it stays with me.

          Similar to what this article is saying: doing your own research makes the subject stick better than having it spit out to you.

    • When I used to navigate using paper maps, I had a clearer understanding of where I was. With GPS, I follow the robot's orders and hope it gets me to my destination while being totally lost

      Now I don't even know which way is north most of the time. In my youth I would excel at land navigation tests. Topo map, compass, distance and bearing to waypoints.

      There is still some residual knowledge in my head. When I look at the screen I can sometimes spot things that are wrong. Like the turn by turn instructions telling me to turn onto the railroad tracks. The instructions say turn right in 200ft and I glance at the displayed map and think, huh, that road intersection is way more than 200ft away. 50f

  • This is a problem only if AI can't do it better anyway. Otherwise it's like complaining the arms of someone using a nail gun instead of a hammer aren't going to be as strong. It's neither surprising nor particularly concerning. As in the song, John Henry lost. No one drives many nails by hand any more. It doesn't matter how strong they are.
  • by kwelch007 ( 197081 ) on Thursday June 26, 2025 @12:33PM (#65477990) Homepage

    "If you can't explain it simply, you don't understand it well enough." - Albert Einstein.

    I don't know that people providing more succinct descriptions and offering less original insight is an indicator that they don't understand the concept as well. Given the above axiom it may mean the opposite.

    That said, I tend to agree that current AI summaries miss a lot of the detail and are sometimes wrong.

  • LLMs are just word calculators, error prone calculators at that.

    • LLMs are just word calculators, error prone calculators at that.

      Nope. Just like the calculators, they perfectly operate on the data provided. Basically we've automated the erroneous entries too, not just the calculations. Calculator or LLM, garbage in garbage out. :-)

      • They are statistics based, not exact. That's why they're useless for many tasks which require precise responses, as they'll generate text which approximates an average from their training data.

        In addition to this, they are fed garbage, which makes even these statistics end up pointing the wrong way.

        Every time I have had one generate an explanation of a subject I know well, it has contained errors. Both subtle and gross errors, and often inconsistencies. Every time. Nothing perfect about it.

        • by drnb ( 2434720 )

          They are statistics based, not exact. That's why they're useless for many tasks which require precise responses, as they'll generate text which approximates an average from their training data.

          I'd still argue that they are faithfully executing the algorithms they were programmed with. Calculators just have simpler algorithms that usually don't involve statistics. With respect to LLM output I'd say the user interface is only giving partial results. The complete results would include a confidence factor.

          So this is a quite fixable thing. When showing an AI generated response, include the confidence factor.

          In addition to this, they are fed garbage,

          Which is what I was referring to, and I don't see an easy fix. Just re-training with well cu

          • Right, LLMs are error prone word calculators. Tools do not help you learn, tools help you accomplish a task with less work.

            Adding the confidence would help you know when the LLM is making shit up, but wouldn't help you get better results from an error prone word calculator.

           • The complete results would include a confidence factor.

            Then people would stop using them, when they never get anything over 50%.

          • The complete results would include a confidence factor.

            I don't think you know how LLMs work.
            What is it you would like, the logits for every generated token?

            Hint: that isn't confidence.

            • by drnb ( 2434720 )
              You are mistaken. Using the token level probabilities to construct an overall confidence is already done. You are conflating this with the difficulty of producing a more convenient normalized confidence value, which is an area of active research.

              Want an example of using these token level probabilities to compare responses? Then ask an LLM for such code. Note "You can also use it to compare confidence across multiple model outputs."

              "Here's a reusable Python function that calls the OpenAI API, extracts token-leve
              • OpenAI offered the following example:

                "What is the capital of France?"

                Response: Paris
                Approximate sequence confidence: 0.813218
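                  (For reference: a minimal sketch of how a number like this can be computed, assuming the current openai Python SDK (v1+) and its logprobs option; the model name and the aggregation choices are illustrative assumptions, not the exact code being discussed above. As the rest of the thread argues, the result only measures how likely the emitted token sequence was under the model, not whether the answer is factually correct.)

                  from math import exp
                  from openai import OpenAI  # assumes the openai>=1.0 SDK and an OPENAI_API_KEY in the environment

                  client = OpenAI()

                  def approximate_sequence_confidence(question, model="gpt-4o-mini"):
                      """Ask the model a question and aggregate the per-token log-probabilities.

                      Returns the answer text, the joint probability of the emitted token
                      sequence (product of token probabilities), and its geometric mean per
                      token. Neither number says anything about factual correctness.
                      """
                      resp = client.chat.completions.create(
                          model=model,
                          messages=[{"role": "user", "content": question}],
                          logprobs=True,  # ask the API to return a log-probability for each output token
                      )
                      choice = resp.choices[0]
                      logprobs = [t.logprob for t in choice.logprobs.content]
                      joint = exp(sum(logprobs))                      # length-dependent joint probability
                      per_token = exp(sum(logprobs) / len(logprobs))  # geometric mean per token
                      return choice.message.content, joint, per_token

                  answer, joint, per_token = approximate_sequence_confidence("What is the capital of France?")
                  print(answer, joint, per_token)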
                • And that is incorrect.

                  That is not confidence.
                  That is multiplying all the logits for each token together, which is not confidence.
                  • by drnb ( 2434720 )

                    And that is incorrect. That is not confidence. That is multiplying all the logits for each token together, which is not confidence.

                    It is, as labeled, "Approximate sequence confidence", something different than a statistical confidence. And there is nothing wrong with that. Confidence can mean different things in different domains.

                    • No. It is not confidence.
                      This is an idiot using numbers they don't understand to solve a problem they don't understand can't be solved.

                      It is confusing the logit probabilities for confidence in the answer.
                      Even if that value were 1- it would not imply high confidence.
                    • by drnb ( 2434720 )

                      No. It is not confidence.

                      You are arguing semantics. I am arguing domain specific confidence calculations. Two different things.

                      This is an idiot using numbers they don't understand to solve a problem they don't understand can't be solved.

                      OpenAI literally said "You can also use it to compare confidence across multiple model outputs" when it generated the python code.

                      It is confusing the logit probabilities for confidence in the answer. Even if that value were 1- it would not imply high confidence.

                      And people in the field fully recognized it as an unnormalized value that is different than a traditional statistics confidence factor. They understand the utility of the value. Again, you falsely conflate with statistics.

                    • You are arguing semantics. I am arguing domain specific confidence calculations. Two different things.

                      No, I'm not. I'm arguing that you don't understand probabilities and confidence.

                      OpenAI literally said "You can also use it to compare confidence across multiple model outputs" when it generated the python code.

                      No, they didn't. Their Large Language Model did.

                      And people in the field fully recognized it as an unnormalized value that is different than a traditional statistics confidence factor. They understand the utility of the value. Again, you falsely conflate with statistics.

                      Wrong.
                      Perplexity has value, but not in the way you're using it. It cannot be used in the way you're using it.
                      Even in areas where perplexity is thought to have value, it's questionable.

                    • by drnb ( 2434720 )

                      You are arguing semantics. I am arguing domain specific confidence calculations. Two different things.

                      No, I'm not. I'm arguing that you don't understand probabilities and confidence.

                      Thank you for proving my point by continuing to argue statistical confidence.

                      OpenAI literally said "You can also use it to compare confidence across multiple model outputs" when it generated the python code.

                      No, they didn't. Their Large Language Model did.

                      Again, you play semantic games. It was code documentation. A level of detail that the organization does not comment upon.

                      And people in the field fully recognized it as an unnormalized value that is different than a traditional statistics confidence factor. They understand the utility of the value. Again, you falsely conflate with statistics.

                      Wrong. Perplexity has value, but not in the way you're using it. It cannot be used in the way you're using it. Even in areas where perplexity is thought to have value, it's questionable.

                      Straw man. I am not using it as a statistical confidence as you falsely conflate. I fully understand it is something different.

                    • Straw man. I am not using it as a statistical confidence as you falsely conflate. I fully understand it is something different.

                      No, you don't.

                      You're trying to say it represents confidence in the answer.

                      The complete results would include a confidence factor.
                      So this is a quite fixable thing. When showing an AI generated response, include the confidence factor.

                      What you are calling the confidence factor is the aggregate probability of the sequence. It is not confidence in anything, and has no connection to the ground truth.

                      You aren't talking your way out of this. You're a bullshit artist.

                    • by drnb ( 2434720 )

                      You're trying to say it represents confidence in the answer.

                      Confidence in a different sense than statistical confidence. As you concede: "statistical likelihood of that sequence of tokens with that context". Something, that as the OpenAI code documentation said: "You can also use it to compare confidence across multiple model outputs".

                      The complete results would include a confidence factor. So this is a quite fixable thing. When showing an AI generated response, include the confidence factor.

                      In a domain specific context with a domain specific meaning. Again, you play semantics with "confidence factor" while you literally concede and use "statistical likelihood". Semantics.

                    • No, you lying sack of shit.

                      You said:

                      Calculators just have simpler algorithms that usually don't involve statistics. With respect to LLM output I'd say the user interface is only giving partial results. The complete results would include a confidence factor.

                      The implication is obvious. Quit trying to gaslight yourself out of being stupid.

                      Look at it this way, if you can derive anything approaching "confidence" from the logits, then you are an instant billionaire, because that means you can solve hallucinations- and that's currently the big fucking problem with LLMs.

                      But you can't, because you're an idiot.

                    • Me: does the linear probability of an LLM token sequence represent the statistical probability that the tokens are factual?
                      ChatGPT: No, the linear probability (or more precisely, the output probability) of a token sequence generated by a large language model (LLM) does not represent the statistical probability that the tokens are factual.

                      ... snip ...

                      Summary: LLM token probabilities measure plausibility, not truth.
                      They reflect what is likely to come next in language, not what is factually accurate.
                      Wou
                    • LOL- you fucking dipshit.

                      I literally quoted the lines I objected to, they weren't "no utility", they were your mischaracterization of that utility.
                      You can't even compare the "confidence" across multiple model outputs. It's fucking meaningless.
                      All it tells you is the statistical likelihood of that series of tokens appearing in the fucking output.
                      You are too stupid to know what that fucking means.
                • by drnb ( 2434720 )
                  "What is the historic capital of France and what is your approximate sequence confidence in your response"

                  Late Antiquity | Soissons, Laon, early Paris
                  508 AD onward | Paris (Clovis I)
                  987 AD onward | Paris (Capetians)

                  Based on token-level probabilities from the model’s internal scoring:

                  Confidence Score 0.42

                  This score reflects decent certainty that:

                  Paris became the capital under Clovis (~508),
                  It regained solidified status under Hugh Capet (~987),
                  Soissons and Laon preceded it.
                You are mistaken. Using the token level probabilities to construct an overall confidence is already done. You are conflating this with the difficulty of producing a more convenient normalized confidence value, which is an area of active research.

                No, the logits are not confidence.
                That is not how it works.

                Your Python script does not do what it thinks it does.

                • by drnb ( 2434720 )

                  You are mistaken. Using the token level probabilities to construct an overall confidence is already done. You are conflating this with the difficulty of producing a more convenient normalized confidence value, which is an area of active research.

                  No, the logits are not confidence. That is not how it works.

                  Only because you are erroneously equating the value to a confidence value in statistics. It's a totally different computation, it's not normalized, yet it is still a number indicating confidence.

                  Your Python script does not do what it thinks it does.

                  It does precisely what I read it to do. Which is not your erroneous interpretation of the word "confidence" in the LLM domain.

                  • Only because you are erroneously equating the value to a confidence value in statistics. It's a totally different computation, it's not normalized, yet it is still a number indicating confidence.

                    No. You are erroneously multiplying together the logits from an LLM and calling them confidence.
                    There is no definition of confidence that fits that operation.

                    It does precisely what I read it to do. Which is not your erroneous interpretation of the word "confidence" in the LLM domain.

                    It outputs a number with no meaning that you can evaluate incorrectly to mean something that is false? Fantastic. You sound like a real go-getter.

                    • by drnb ( 2434720 )

                      No. You are erroneously multiplying together the logits from an LLM and calling them confidence. There is no definition of confidence that fits that operation.

                      OpenAI literally said "You can also use it to compare confidence across multiple model outputs" when it generated the python code.

                      It outputs a number with no meaning that you can evaluate incorrectly to mean something that is false? Fantastic. You sound like a real go-getter.

                      And people in the field fully recognized it as an unnormalized value that is different than a traditional statistics confidence factor. They understand the utility of the value. Again, you falsely conflate with statistics.

                    • OpenAI literally said "You can also use it to compare confidence across multiple model outputs" when it generated the python code.

                      No, they didn't. Their model did, you dense motherfucker.

                      And people in the field fully recognized it as an unnormalized value that is different than a traditional statistics confidence factor. They understand the utility of the value. Again, you falsely conflate with statistics.

                      You're a liar.

                    • by drnb ( 2434720 )

                      OpenAI literally said "You can also use it to compare confidence across multiple model outputs" when it generated the python code.

                      No, they didn't. Their model did, you dense motherfucker.

                      Again, you play semantic games. It was code documentation. A level of detail that the organization does not comment upon.

                      And people in the field fully recognized it as an unnormalized value that is different than a traditional statistics confidence factor. They understand the utility of the value. Again, you falsely conflate with statistics.

                      You're a liar.

                      LOL, the ad hominem fallacy.

                    • Again, you play semantic games. It was code documentation. A level of detail that the organization does not comment upon.

                      Link to the code documentation.

                      LOL, the ad hominem fallacy.

                      Ok, you also apparently don't know what a fallacy is.
                      That was an accusation.

                      Logits cannot be used to determine confidence in anything, other than the semantic correctness of what was emitted as estimated by the model.
                      A correct answer could have a value of 0, and an incorrect answer could have a value of 1.

                      You are using these things incorrectly, because you don't fucking know what they mean.

                    • by drnb ( 2434720 )

                      Link to the code documentation.

                      I copy/pasted the documentation generated with the code.

                      Ok, you also apparently don't know what a fallacy is. That was an accusation.

                      LOL. "A rhetorical strategy where the speaker attacks the person making an argument rather than the substance of the argument itself." So you're going to play semantic games with "attack" vs "accusation". Why am I not surprised.

                      A correct answer could have a value of 0, and an incorrect answer could have a value of 1.

                      Again, you prove me correct. You conflate it with statistical probability. As you conceded, the value has a domain specific utility: "statistical likelihood of that sequence of tokens with that context".

                    • I copy/pasted the documentation generated with the code.

                      Did you, or did you not copy the output of an LLM and call it evidence?

                      LOL. "A rhetorical strategy where the speaker attacks the person making an argument rather than the substance of the argument itself." So you're going to play semantic games with "attack" vs "accusation". Why am I not surprised.

                      Christ, you really are fucking stupid.
                      You made a claim.
                      Let's assume the claim was fabricated. Are you now arguing that I cannot claim it was fabricated without committing a fallacy?

                      No, an argumentation ad hominem is trying to make a point in the argument as a personal attack,
                      i.e., you are wrong because you are a liar.
                      Simply calling someone out for a thing is not an argumentation ad hominem. It's merely an accusation.

                      Again, you prove me correct. You conflate it with statistical probability. As you conceded, the value has a domain specific utility: "statistical likelihood of that sequence of tokens with that context".

                      No, I've flatly

                    • by drnb ( 2434720 )

                      Did you, or did you not copy the output of an LLM and call it evidence?

                      I called it code documentation.

                      LOL. "A rhetorical strategy where the speaker attacks the person making an argument rather than the substance of the argument itself." So you're going to play semantic games with "attack" vs "accusation". Why am I not surprised.

                      Christ, you really are fucking stupid. You made a claim.

                      What part of "attack the person making an argument rather than the substance of the argument" confused you?

                      No, an argumentation ad hominem is trying to make a point in the argument as a personal attack, i.e., you are wrong because you are a liar.

                      Semantics again, "attack" vs "accusation". The reality is you addressed the person not the argument. Your use of "liar" was fallacious, hence ad hominem.

                      What you call "confidence", to be used to judge whether or not a model is full of shit, is in fact no such thing.

                      Again, misrepresentation and semantics. As your use of "statistical likelihood" shows.

                    • I called it code documentation.

                      Sorry, did you, or did you not, use the output of an LLM and call it documentation?

                      I think everyone following your attempt at weaponized gaslit spaghetti sees your claim for what it was. Bullshit.
                      But by all means- I think you should go ahead and use that confidence value for everything. The dumber you fuckers get, the more money I make.

                    • by drnb ( 2434720 )

                      I called it code documentation.

                      Sorry, did you, or did you not, use the output of an LLM and call it documentation?

                      The LLM generated code, documentation of that code, and commentary on that code.

                      I think everyone following your attempt at weaponized gaslit spaghetti sees your claim for what it was. Bullshit.

                      I think everyone sees your semantics. My casual domain specific use of "confidence", your use of "statistical likelihood".

                    • The LLM generated code, documentation of that code, and commentary on that code.

                      And you don't see the problem with attributing that to "OpenAI"?

                      I think everyone sees your semantics. My casual domain specific use of "confidence", your use of "statistical likelihood".

                      Words have meaning.
                      Handwaving them away as "semantics" isn't clever, it's a cop-out.

                    • by drnb ( 2434720 )

                      The LLM generated code, documentation of that code, and commentary on that code.

                      And you don't see the problem with attributing that to "OpenAI"?

                      It was their LLM identifying itself as an OpenAI service. "ChatGPT is developed by OpenAI. It's powered by advanced language models from OpenAI"

                      I think everyone sees your semantics. My casual domain specific use of "confidence", your use of "statistical likelihood".

                      Words have meaning.

                      In a context. Which your semantics ignores and distorts.

                    • It was their LLM identifying itself as an OpenAI service. "ChatGPT is developed by OpenAI. It's powered by advanced language models from OpenAI"

                      Holy fuck, you're a really stupid person.
                      You're really attributing the output of their LLM as authoritative information from their corporation.
                      Incredible.

                      In a context. Which your semantics ignores and distorts.

                      No, the only person who distorted semantics here was you.
                      You have mischaracterized LLM output as a statement from a corporation, and you have mischaracterized the linear probability of a token sequence as confidence.

                    • by drnb ( 2434720 )

                      It was their LLM identifying itself as an OpenAI service. "ChatGPT is developed by OpenAI. It's powered by advanced language models from OpenAI"

                      Holy fuck, you're a really stupid person. You're really attributing the output of their LLM as authoritative information from their corporation. Incredible.

                      Semantics. And given your emotions, I'd say you are exhibiting a bit of projection here.

                      In a context. Which your semantics ignores and distorts. And the fact remains, OpenAI's services generated the code and comments and documentation contradicting you.

                      No, the only person who distorted semantics here was you. You have mischaracterized LLM output as a statement from a corporation, ...

                      Semantics.

                      .... and you have mischaracterized the linear probability of a token sequence as confidence.

                      Nope, that was you erroneously reading things between the lines and inserting your own bad guess.

            • The complete results would include a confidence factor.

              I don't think you know how LLMs work. What is it you would like, the logits for every generated token? Hint: that isn't confidence.

              Here's an example where the token level probabilities are shown:

              "What is the historic capital of France and what is your approximate sequence confidence in your response"

              Late Antiquity | Soissons, Laon, early Paris
              508 AD onward | Paris (Clovis I)
              987 AD onward | Paris (Capetians)

              Based on token-level probabilities from the model’s internal scoring:

              Confidence Score 0.42

              This score reflects decent certainty that:

              Paris became the capital under Clovis (~508),
              It regained solidified status

              • No, it does not.

                JFC- we really are all doomed.
                Logits are not confidence values.
                • by drnb ( 2434720 )

                  No, it does not. Logits are not confidence values.

                  It is, as labeled elsewhere by OpenAI, an "Approximate sequence confidence", something different than a statistical confidence. And there is nothing wrong with that. Confidence can mean different things in different domains. It is still something that can be used "to compare confidence across multiple model outputs" as OpenAI states.

                  • No. OpenAI makes no such claim anywhere. That is an outright falsehood.

                    What happened here, is an old API of theirs emitted the logits for each token, and people took that and said "we can multiply those together and call it confidence in the answer!"

                    But they are wrong. Nobody at OpenAI ever suggested this could be done, and in fact, the new API lacks the ability to retrieve the logits, probably to stop people from making this mistake.
                    • by drnb ( 2434720 )
                      You are arguing semantics. I am arguing domain specific confidence calculations. Two different things.

                      No. OpenAI makes no such claim anywhere. That is an outright falsehood.

                      LOL. OpenAI literally said "You can also use it to compare confidence across multiple model outputs" when it generated the python code.

                      What happened here, is an old API of theirs emitted the logits for each token, and people took that and said "we can multiply those together and call it confidence in the answer!"

                      And they fully recognized it as an unnormalized value that is different than a traditional statistics confidence factor. Unlike you, they understood the context of the word "confidence". They understand the utility of the value. And again, this is an area of ongoing research.

                    • You are arguing semantics. I am arguing domain specific confidence calculations. Two different things.

                      No. You're measuring perplexity and not knowing what the fuck it means in the context of a sequence of tokens produced by an LLM.

                      LOL. OpenAI literally said "You can also use it to compare confidence across multiple model outputs" when it generated the python code.

                      You mean when the fucking LLM did?!
                      Jesus. Fucking. Christ.
                      OpenAI has claimed no such thing, and never will, because it is factually wrong. Do not confuse the output of ChatGPT with OpenAI.

                      And they fully recognized it as an unnormalized value that is different than a traditional statistics confidence factor. Unlike you, they understood the context of the word "confidence". They understand the utility of the value. And again, this is an area of ongoing research.

                      Incorrect.
                      Perplexity is indeed unnormalized- but that isn't even the start of the problem of trying to use it as a measure of confidence *in anything*.
                      It merely tells you the statistical lik

                    • by drnb ( 2434720 )

                      You are arguing semantics. I am arguing domain specific confidence calculations. Two different things.

                      No. You're measuring perplexity and not knowing what the fuck it means in the context of a sequence of tokens produced by an LLM.

                      That's your straw man. I understand it is not a statistical confidence factor. That it is an unnormalized value of limited utility. Yet it has some utility.

                      LOL. OpenAI literally said "You can also use it to compare confidence across multiple model outputs" when it generated the python code.

                      You mean when the fucking LLM did?! OpenAI has claimed no such thing, and never will, because it is factually wrong. Do not confuse the output of ChatGPT with OpenAI.

                      Again, you play semantic games. It was code documentation. A level of detail that the organization does not comment upon.

                      And they fully recognized it as an unnormalized value that is different than a traditional statistics confidence factor. Unlike you, they understood the context of the word "confidence". They understand the utility of the value. And again, this is an area of ongoing research.

                      Incorrect. Perplexity is indeed unnormalized- but that isn't even the start of the problem of trying to use it as a measure of confidence *in anything*. It merely tells you the statistical likelihood of that sequence of tokens with that context. That is not confidence in anything, whatsoever. It has zero connection to any kind of ground truth.

                      Thank you for demonstrating some utility. Hmmm ... I wonder what word could be used in a domain specific context to refer to a "statistical likelihood of that sequence of tokens with that context". Or as OpenAI calls it, an "approximate sequence confidence". Perhaps the simple term "confidence". Again, "confidence" in a context specific realm.

                      That's your straw man. I understand it is not a statistical confidence factor. That it is an unnormalized value of limited utility. Yet it has some utility.

                      No, you directly claimed it was confidence in the answer from the context of ground truth.
                      That is a gross misrepresentation.

                      Again, you play semantic games. It was code documentation. A level of detail that the organization does not comment upon.

                      I'm not playing semantic games, you dumbshit.
                      I'm trying to figure out what the fuck you're talking about, since calculating a linear probability from the logits doesn't represent "confidence" in any definition of that word that I'm aware of.
                      You used the word "generated", and so I surmised perhaps you were talking about pulling it from the output of ChatGPT.

                      Thank you for demonstrating some utility. Hmmm ... I wonder what word could be used in a domain specific context to refer to a "statistical likelihood of that sequence of tokens with that context". Or as OpenAI calls it, an "approximate sequence confidence". Perhaps the simple term "confidence". Again, "confidence" in a context specific realm.

                      Did you seriously just argue that we can call the mean of the softmaxed logits confidence?

                    • by drnb ( 2434720 )

                      No, you directly claimed it was confidence in the answer from the context of ground truth.

                      I used it in a domain specific context. Not as truth, but as you concede: "statistical likelihood".

                      Did you seriously just argue that we can call the mean of the softmaxed logits confidence?

                      Not in the context of statistics, more like your concession: "statistical likelihood". Or as the OpenAI documentation stated, a tool for comparison of some results.
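                      (For anyone trying to follow the numbers being argued about in this subthread, a tiny self-contained sketch of the two quantities in play. The per-token log-probabilities below are made up for illustration. The joint probability of a token sequence shrinks as the sequence gets longer, while perplexity exponentiates the mean negative log-probability per token; neither is a calibrated probability that an answer is true.)

                      from math import exp

                      # Hypothetical per-token log-probabilities for a short emitted sequence.
                      token_logprobs = [-0.05, -0.10, -0.02]

                      joint_probability = exp(sum(token_logprobs))                  # product of the token probabilities
                      perplexity = exp(-sum(token_logprobs) / len(token_logprobs))  # exp of the mean negative log-probability

                      print(f"joint probability: {joint_probability:.3f}")  # ~0.844
                      print(f"perplexity:        {perplexity:.3f}")         # ~1.058 (lower means the sequence was less surprising to the model)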

        • LLMs use many statistical methods for training, but there is nothing inexact about them fundamentally.
          They are often used with stochastic logit sampling, but even that is optional, and not recommended for things like coding.
          The text they generate does not approximate an average from their training data.
          Training uses gradient descent to minimize a loss function that leads all of the layers to select the correct token for the given context.

          You're talking out of your ass.
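          (A toy illustration of the sampling point above, with made-up logits: decoding can be deterministic, always taking the most probable token, or stochastic, sampling from the softmax distribution, optionally sharpened or flattened by a temperature. This is a sketch of the general technique, not any particular model's decoder.)

          import numpy as np

          rng = np.random.default_rng(0)

          # Hypothetical logits for four candidate next tokens.
          logits = np.array([2.0, 1.0, 0.5, -1.0])

          def softmax(x):
              z = np.exp(x - x.max())  # subtract the max for numerical stability
              return z / z.sum()

          temperature = 0.8
          probs = softmax(logits / temperature)

          greedy_choice = int(np.argmax(probs))                  # deterministic: always the most likely token
          sampled_choice = int(rng.choice(len(probs), p=probs))  # stochastic: sampled from the distribution

          print(probs, greedy_choice, sampled_choice)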
  • They want AI to do their work for them, and that is a reduction of themselves. This is an opportunity for those of us who know how to work to be compensated over the lazy.

  • by Fly Swatter ( 30498 ) on Thursday June 26, 2025 @02:37PM (#65478350) Homepage
    A parrot can sound smart too, but it has no idea what it is saying. OK, it knows what to parrot if it wants a cracker, but it doesn't actually understand the words.

    We are raising a generation of Pollys. Forget Generation X, Y, and the millennials; what we are creating now is worse. It's the Polly generation, and they want their free crackers!
    • I guess I'm the Calculator Generation then, and maybe also the Digital Clock Generation.

      Fuck long division.

  • If the point of your research is to write a better summary, then (according to the article) Google is better.

    However, if the point of your research is to learn something specific, like for example a needed code snippet for your program, then an LLM will be better because it'll provide that information faster.

  • What else is new? Thinking is hard for most people. If they stop doing it, they completely lose the skills. Pathetic as they might have been before...

  • I mean sure, the average layperson is gonna fuck it up. But what about professionals, e.g., a PhD scientist?

    I use LLM-based models for lit searches (typically, these are dedicated tools for lit searches, but I have tried it on ChatGPT). I don't use the summaries, but I do use the lists of papers it comes up with and generally go through them in whatever ranking it spits out.

    Works pretty well, saves a ton of time in *starting* lit searches. Still have to do the reading. The AI sucks at interpreting papers,
