
Study Finds LLM Users Have Weaker Understanding After Research (msn.com)
Researchers at the University of Pennsylvania's Wharton School found that people who used large language models to research topics demonstrated weaker understanding and produced fewer original insights than those using Google searches.
The study, involving more than 4,500 participants across four experiments, showed LLM users spent less time researching, exerted less effort, and wrote shorter, less detailed responses. In the first experiment, over 1,100 participants researched vegetable gardening using either Google or ChatGPT. Google users wrote longer responses with more unique phrasing and factual references. A second experiment with nearly 2,000 participants presented identical gardening information either as an AI summary or across mock webpages, with Google users again engaging more deeply and retaining more information.
"Abstraction: Towards an Abstracter Abstract" (Score:5, Funny)
Re: (Score:2)
You better put the word "findings" in quotation marks. This "study" is a preprint, has not been peer reviewed, and it's being widely mocked for its bad methodology. But the media is just loving to run with it. Even the lead author is complaining about the media's "LLMs cause brain damage!" hot takes.
Re:"Abstraction: Towards an Abstracter Abstract" (Score:4, Interesting)
It's consistent with the experience of many of us who interact with a lot of LLM users. And the mocking of the methodology is flawed. There's nothing wrong with the methodology; the only issue is that the sample size isn't that large.
But the LLM stooges are just loving to run with that narrative. Can't have headlines saying replacing walking with car driving causes stamina deterioration, now can we.
Re: (Score:2)
What we need is a 2nd study, using 400 students, separated into four groups:
1. Using Google ONLY by looking at the 3rd page of results (the first two pages are now taken up with Gemini AI and targeted advertising).
2. Using ChatGPT Only.
3. Using inventory computers in a large metropolitan library.
4. Using old fashioned card catalogs and books.
I wonder if we chose a significantly esoteric subject, with a 100 question exam given after a week, if any useful clustering could be detected.
Re: (Score:2)
Re: (Score:2)
Hehehehe, excellent! Made me LOL. Thanks!
Your Grandad Knew This Before You Were Born (Score:5, Insightful)
Your Grandad knew this before you were born. That's why he told you to look it up yourself. If he just gave you the answer, you'd not remember it as well.
It's probably been well known since the Romans.
Re: (Score:3)
Your Grandad knew this before you were born. That's why he told you to look it up yourself. If he just gave you the answer, you'd not remember it as well.
Plus there's taking handwritten notes as you read something in a book or paper, or listen to a lecture. The writing process helps with retention.
It's probably been well known since the Romans.
legere te ipsum :-)
Re: (Score:2)
Give a man a fish and you feed him for a day.
Teach a man how to fish and you feed him for a lifetime.
Re: (Score:2)
Light a man a fire and he'll be warm for a day.
Light a man on fire and he'll be warm for the rest of his life.
Re: Your Grandad Knew This Before You Were Born (Score:2)
Re: (Score:2)
It's more than that. Research is a skill, and like all skills needs to be maintained. Especially when the information landscape keeps changing, as is the norm these days.
Evaluating sources, taking notes, summarizing, and correlating leads both to better understanding and retention, and to a self-reinforcing increase in skill at doing it efficiently. And that has the side effect of making it much easier to spot and counter BS.
Re: (Score:2)
Indeed. And I really do not get it. I mean, I am very much in the "cogito ergo sum" camp. Thinking is what I do and what makes me me and I find it relaxing. But apparently most people do not like to be people?
Problem is (Score:4, Insightful)
...90% of gardening photos and articles were created by AI, so Googling doesn't really help.
Re: (Score:2)
Re: (Score:2)
You can weed out the ones with alien cats and strange fantasy mushrooms in them, but not the ones that look real, even when their combination wouldn't work because every single species needs a different pH value and/or type of soil, yet they're put together nonetheless.
If you already KNEW all that, you wouldn't NEED to do the search; people would search for YOU.
Re: (Score:2)
Re: (Score:2)
Unless you add site:reddit.com to your search... because only like 70% of Redditors inexplicably seem to think posting shit copied from Grok or some other free-tier LLM chatbot is actually useful for the world.
Re: (Score:2)
Hypothesis: almost purely a time-based effect (Score:2)
All the current project proves is that if I research a topic for 4 hours, I learn more than if I only spent 2 hours. Sir, I am shocked. Shocked I say.
Re: (Score:3)
The quote from the article sort of encapsulates the time dilemma: “AI doesn’t have to make us passive. But right now, that’s how people are using it”. People often accept answers from these things w/o further effort towards validating them, which will naturally be a faster process. Of course that's going to lead to worse learning outcomes for them, since they're just regurgitating what they've been told instead of forming original ideas from valid yet incomplete pieces of knowledge.
Re: (Score:2)
Turns out, "putting in the work" has benefits after all.
I think that pretty much describes the real problem with this study. Are you trying to get work done, or are you trying to make yourself smarter and more knowledgeable? Academics live in the latter world, where "putting in the work" does not mean "getting the work done". Problem solving is an academic exercise, not an outcome.
Re: (Score:2)
That's simply not a correct description of academia and what academics do. "Publish or perish" is a stereotype for a reason. And the study didn't have "academics vs non-academics" or anything of the kind. It had ordinary people doing ordinary research
Re: (Score:2)
It had ordinary people doing ordinary research
I think you missed the point. Did the gardens grow better or produce more for the Google users? For ordinary people that outcome is the measure of success.
Re: (Score:2)
For ordinary people being assigned a task, the quality of the task completion is part of the measure of success. And the task in this case ended at producing the summary on what to do, and why.
Re: (Score:2)
The question is whether there is even a useful comparison. Even if you want to use an LLM as an encyclopedia (though they are not well-suited for this purpose), you would have to compare it to another encyclopedia, like Wikipedia, not to a web search and every website Google can find.
Re: (Score:2)
Re: (Score:2)
It can only be restated as that if the quality of the product is not part of the productivity metric. And if you do that, just ignoring the task is immensely more productive.
Reminds me of GPS (Score:2)
When I used to navigate using paper maps, I had a clearer understanding of where I was.
With GPS, I follow the robot's orders and hope it gets me to my destination while being totally lost
Re: (Score:2)
As a rule, I use Google maps to scout where I'm going. From there, it's all from memory, unless it's a long trip in which case I write down the squirrely parts.
Visualization seems to be the key.
Re: (Score:2)
Reminds me of Thomas Bros (Score:2)
As a rule, I use Google maps to scout where I'm going. From there, it's all from memory, unless it's a long trip in which case I write down the squirrely parts. Visualization seems to be the key.
Basically, you are using the GPS map pretty much like the old paper maps. Especially the highly detailed ones like the Thomas Bros maps. Forgoing only the turn-by-turn instructions?
Re: (Score:2)
Essentially, yes. Visualizing where to go works better for me than being told to turn right in 300 feet. Since I've already looked over the route and figured out where I need to go, it stays with me.
Similar to what this article is saying that doing your research makes the subject stick better than having it spit out to you.
Turn right onto the railroad tracks ... (Score:2)
When I used to navigate using paper maps, I had a clearer understanding of where I was. With GPS, I follow the robot's orders and hope it gets me to my destination while being totally lost
Now I don't even know which way is north most of the time. In my youth I would excel at land navigation tests. Topo map, compass, distance and bearing to waypoints.
There is still some residual knowledge in my head. When I look at the screen I can sometimes spot things that are wrong. Like the turn-by-turn instructions telling me to turn onto the railroad tracks. The instructions say turn right in 200ft and I glance at the displayed map and think, huh, that road intersection is way more than 200ft away. 50f
Re: (Score:2)
Re: (Score:2)
Something is lost with convenience; it's not sentimental.
The quickly coined phrase, "death by GPS", would seem to support your opinion. :-)
"Death by GPS" (Score:2)
Reminds me of GPS
Yep. Sentimental attachment to skills that are less useful in the normal case.
When the phrase "death by GPS" is quickly coined after the introduction of the new technology, the old skills may be more than just sentimental. As mentioned above, I once decline turning onto railroad tracks when the map (on screen) did not seem to match the turn by turn instructions.
Why is this a problem? (Score:2)
Interesting conclusion... (Score:3)
"If you can't explain it simply, you don't understand it well enough." - Albert Einstein.
I don't know that people providing more succinct descriptions and offering less original insight is an indicator that they don't understand the concept as well. Given the above axiom it may mean the opposite.
That said, I tend to agree that current AI summaries miss a lot of the detail and are sometimes wrong.
calculators do not make you better at math (Score:2)
LLMs are just word calculators, error prone calculators at that.
Automated the erroneous entries ... (Score:2)
LLMs are just word calculators, error prone calculators at that.
Nope. Just like the calculators, they perfectly operate on the data provided. Basically we've automated the erroneous entries too, not just the calculations. Calculator or LLM, garbage in garbage out. :-)
Re: (Score:2)
They are statistics based, not exact. That's why they're useless for many tasks which require precise responses, as they'll generate text which approximates an average from their training data.
In addition to this, they are fed garbage, which makes even these statistics end up pointing the wrong way.
Every time I have had one generate an explanation of a subject I know well, it has contained errors. Both subtle and gross errors, and often inconsistencies. Every time. Nothing perfect about it.
Re: (Score:2)
They are statistics based, not exact. That's why they're useless for many tasks which require precise responses, as they'll generate text which approximates an average from their training data.
I'd still argue that they are faithfully executing the algorithms they were programmed with. Calculators just have simpler algorithms that usually don't involve statistics. With respect to LLM output I'd say the user interface is only giving partial results. The complete results would include a confidence factor.
So this is a quite fixable thing. When showing an AI generated response, include the confidence factor.
In addition to this, they are fed garbage,
Which is what I was referring to, and I don't see an easy fix. Just re-training with well cu
Re: (Score:3)
Right, LLMs are error prone word calculators. Tools do not help you learn, tools help you accomplish a task with less work.
Adding the confidence would help you know when the LLM is making shit up, but wouldn't help you get better results from an error prone word calculator.
Re: (Score:2)
The complete results would include a confidence factor.
Then people would stop using them, when they never get anything over 50%.
Re: (Score:2)
The complete results would include a confidence factor.
I don't think you know how LLMs work.
What is it you would like, the logits for every generated token?
Hint: that isn't confidence.
Re: (Score:2)
Want an example of using these token level probabilities to compare responses? Then ask an LLM for such code. Note "You can also use it to compare confidence across multiple model outputs."
"Here's a reusable Python function that calls the OpenAI API, extracts token-leve
"What is the capital of France?" (Score:2)
"What is the capital of France?"
Response: Paris
Approximate sequence confidence: 0.813218
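The reusable function itself never appears in the thread, so here is only a minimal sketch of the kind of script being described, assuming the logprobs option of the OpenAI Chat Completions API in the current Python SDK and a geometric mean of the token probabilities as the aggregate. The function name, the model name, and the choice of aggregation are assumptions for illustration, not anything OpenAI published.

from openai import OpenAI
import math

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def approximate_sequence_confidence(prompt, model="gpt-4o-mini"):  # model name is a placeholder
    # Request per-token log probabilities alongside the completion.
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        logprobs=True,
    )
    choice = resp.choices[0]
    logprobs = [t.logprob for t in choice.logprobs.content]
    # Geometric mean of the token probabilities: exp(mean of log p).
    # This measures how expected the wording was to the model,
    # not whether the answer is factually correct.
    score = math.exp(sum(logprobs) / len(logprobs))
    return choice.message.content, score

answer, score = approximate_sequence_confidence("What is the capital of France?")
print(answer)
print(f"Approximate sequence confidence: {score:.6f}")

Whatever the real script did, the number this sketch prints describes the likelihood of that token sequence under the model, which is exactly what the rest of this subthread argues about.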
Re: (Score:2)
That is not confidence.
That is multiplying all the logits for each token together, which is not confidence.
Re: (Score:2)
And that is incorrect. That is not confidence. That is multiplying all the logits for each token together, which is not confidence.
It is, as labeled, "Approximate sequence confidence", something different than a statistical confidence. And there is nothing wrong with that. Confidence can mean different things in different domains.
Re: (Score:2)
This is an idiot using numbers they don't understand to solve a problem they don't understand can't be solved.
It is confusing the logit probabilities for confidence in the answer.
Even if that value were 1, it would not imply high confidence.
Re: (Score:2)
No. It is not confidence.
You are arguing semantics. I am arguing domain specific confidence calculations. Two different things.
This is an idiot using numbers they don't understand to solve a problem they don't understand can't be solved.
OpenAI literally said "You can also use it to compare confidence across multiple model outputs" when it generated the python code.
It is confusing the logit probabilities for confidence in the answer. Even if that value were 1, it would not imply high confidence.
And people in the field fully recognized it as an unnormalized value that is different than a traditional statistics confidence factor. They understand the utility of the value. Again, you falsely conflate with statistics.
Re: (Score:2)
You are arguing semantics. I am arguing domain specific confidence calculations. Two different things.
No, I'm not. I'm arguing that you don't understand probabilities and confidence.
OpenAI literally said "You can also use it to compare confidence across multiple model outputs" when it generated the python code.
No, they didn't. Their Large Language Model did.
And people in the field fully recognized it as an unnormalized value that is different than a traditional statistics confidence factor. They understand the utility of the value. Again, you falsely conflate with statistics.
Wrong.
Perplexity has value, but not in the way you're using it. It cannot be used in the way you're using it.
Even in the areas where perplexity is thought to have value, it's questionable.
Re: (Score:2)
You are arguing semantics. I am arguing domain specific confidence calculations. Two different things.
No, I'm not. I'm arguing that you don't understand probabilities and confidence.
Thank you for proving my point by continuing to argue statistical confidence.
OpenAI literally said "You can also use it to compare confidence across multiple model outputs" when it generated the python code.
No, they didn't. Their Large Language Model did.
Again, you play semantic games. It was code documentation. A level of detail that the organization does not comment upon.
And people in the field fully recognized it as an unnormalized value that is different than a traditional statistics confidence factor. They understand the utility of the value. Again, you falsely conflate with statistics.
Wrong. Perplexity has value, but not in the way you're using it. It cannot be used in the way you're using it. Even in the areas where perplexity is thought to have value, it's questionable.
Straw man. I am not using it as a statistical confidence as you falsely conflate. I fully understand it is something different.
Re: (Score:2)
Straw man. I am not using it as a statistical confidence as you falsely conflate. I fully understand it is something different.
No, you don't.
You're trying to say it represents confidence in the answer.
The complete results would include a confidence factor.
So this is a quite fixable thing. When showing an AI generated response, include the confidence factor.
What you are calling the confidence factor is the aggregate probability of the sequence. It is not confidence in anything, and has no connection to the ground truth.
You aren't talking your way out of this. You're a bullshit artist.
Re: (Score:2)
You're trying to say it represents confidence in the answer.
Confidence in a different sense than statistical confidence. As you concede: "statistical likelihood of that sequence of tokens with that context". Something, that as the OpenAI code documentation said: "You can also use it to compare confidence across multiple model outputs".
The complete results would include a confidence factor. So this is a quite fixable thing. When showing an AI generated response, include the confidence factor.
In a domain specific context with a domain specific meaning. Again, you play semantics with "confidence factor" while you literally concede and use "statistical likelihood". Semantics.
Re: (Score:2)
You said:
Calculators just have simpler algorithms that usually don't involve statistics. With respect to LLM output I'd say the user interface is only giving partial results. The complete results would include a confidence factor.
The implication is obvious. Quit trying to gaslight yourself out of being stupid.
Look at it this way, if you can derive anything approaching "confidence" from the logits, then you are an instant billionaire, because that means you can solve hallucinations- and that's currently the big fucking problem with LLMs.
But you can't, because you're an idiot.
Re: (Score:2)
ChatGPT: No, the linear probability (or more precisely, the output probability) of a token sequence generated by a large language model (LLM) does not represent the statistical probability that the tokens are factual.
Summary: LLM token probabilities measure plausibility, not truth.
They reflect what is likely to come next in language, not what is factually accurate.
Wou
Re: (Score:2)
I literally quoted the lines I objected to; they weren't "no utility", they were your mischaracterization of that utility.
You can't even compare the "confidence" across multiple model outputs. It's fucking meaningless.
All it tells you is the statistical likelihood of that series of tokens appearing in the fucking output.
You are too stupid to know what that fucking means.
Re: (Score:2)
Late Antiquity | Soissons, Laon, early Paris
508 AD onward | Paris (Clovis I)
987 AD onward | Paris (Capetians)
Based on token-level probabilities from the model’s internal scoring:
Confidence Score 0.42
This score reflects decent certainty that:
Paris became the capital under Clovis (~508),
It regained solidified status under Hugh Capet (~987),
Soissons and Laon preceded it.
Re: (Score:2)
You're misusing the logits.
Re: (Score:2)
You are mistaken. Using the token level probabilities to construct an overall confidence is already done. You are conflating this with the difficulty of producing a more convenient normalized confidence value, which is an area of active research.
No, the logits are not confidence.
That is not how it works.
Your Python script does not do what it thinks it does.
Re: (Score:2)
You are mistaken. Using the token level probabilities to construct an overall confidence is already done. You are conflating this with the difficulty of producing a more convenient normalized confidence value, which is an area of active research.
No, the logits are not confidence. That is not how it works.
Only because you are erroneously equating the value to a confidence value in statistics. It's a totally different computation, it's not normalized, yet it is still a number indicating confidence.
Your Python script does not do what it thinks it does.
It does precisely what I read it to do. Which is not your erroneous interpretation of the word "confidence" in the LLM domain.
Re: (Score:2)
Only because you are erroneously equating the value to a confidence value in statistics. It's a totally different computation, it's not normalized, yet it is still a number indicating confidence.
No. You are erroneously multiplying together the logits from an LLM and calling them confidence.
There is no definition of confidence that fits that operation.
It does precisely what I read it to do. Which is not your erroneous interpretation of the word "confidence" in the LLM domain.
It outputs a number with no meaning that you can evaluate incorrectly to mean something that is false? Fantastic. You sound like a real go-getter.
Re: (Score:2)
No. You are erroneously multiplying together the logits from an LLM and calling them confidence. There is no definition of confidence that fits that operation.
OpenAI literally said "You can also use it to compare confidence across multiple model outputs" when it generated the python code.
It outputs a number with no meaning that you can evaluate incorrectly to mean something that is false? Fantastic. You sound like a real go-getter.
And people in the field fully recognized it as an unnormalized value that is different than a traditional statistics confidence factor. They understand the utility of the value. Again, you falsely conflate with statistics.
Re: (Score:2)
OpenAI literally said "You can also use it to compare confidence across multiple model outputs" when it generated the python code.
No, they didn't. Their model did, you dense motherfucker.
And people in the field fully recognized it as an unnormalized value that is different than a traditional statistics confidence factor. They understand the utility of the value. Again, you falsely conflate with statistics.
You're a liar.
Re: (Score:2)
OpenAI literally said "You can also use it to compare confidence across multiple model outputs" when it generated the python code.
No, they didn't. Their model did, you dense motherfucker.
Again, you play semantic games. It was code documentation. A level of detail that the organization does not comment upon.
And people in the field fully recognized it as an unnormalized value that is different than a traditional statistics confidence factor. They understand the utility of the value. Again, you falsely conflate with statistics.
You're a liar.
LOL, the ad hominem fallacy.
Re: (Score:2)
Again, you play semantic games. It was code documentation. A level of detail that the organization does not comment upon.
Link to the code documentation.
LOL, the ad hominem fallacy.
Ok, you also apparently don't know what a fallacy is.
That was an accusation.
Logits cannot be used to determine confidence in anything, other than the semantic correctness of what was emitted as estimated by the model.
A correct answer could have a value of 0, and an incorrect answer could have a value of 1.
You are using these things incorrectly, because you don't fucking know what they mean.
Re: (Score:2)
Link to the code documentation.
I copy/pasted the documentation generated with the code.
Ok, you also apparently don't know what a fallacy is. That was an accusation.
LOL. "a rhetorical strategy where the speaker attack the person making an argument rather than the substance of the argument itself" So you're going to play semantic games with "attack" vs "accusation". Why am I not surprised.
A correct answer could have a value of 0, and an incorrect answer could have a value of 1.
Again, you prove me correct. You conflate with statistical probability. As you conceded, the value has a domain specific utility: "statistical likelihood of that sequence of tokens with that context".
Re: (Score:2)
I copy/pasted the documentation generated with the code.
Did you, or did you not copy the output of an LLM and call it evidence?
LOL. "a rhetorical strategy where the speaker attack the person making an argument rather than the substance of the argument itself" So you're going to play semantic games with "attack" vs "accusation". Why am I not surprised.
Christ, you really are fucking stupid.
You made a claim.
Let's assume the claim was fabricated. Are you now arguing that I cannot claim it was fabricated without committing a fallacy?
No, an argumentum ad hominem is trying to make a point in the argument as a personal attack,
i.e., you are wrong because you are a liar.
Simply calling someone out for a thing is not an argumentum ad hominem. It's merely an accusation.
Again, you prove me correct. You conflate with statistical probability. As you conceded, the value has a domain specific utility: "statistical likelihood of that sequence of tokens with that context".
No, I've flatly
Re: (Score:2)
Did you, or did you not copy the output of an LLM and call it evidence?
I called it code documentation.
LOL. "a rhetorical strategy where the speaker attack the person making an argument rather than the substance of the argument itself" So you're going to play semantic games with "attack" vs "accusation". Why am I not surprised.
Christ, you really are fucking stupid. You made a claim.
What part of "attack the person making an argument rather than the substance of the argument" confused you?
No, an argumentum ad hominem is trying to make a point in the argument as a personal attack, i.e., you are wrong because you are a liar.
Semantics again, "attack" vs "accusation". The reality is you addressed the person not the argument. Your use of "liar" was fallacious, hence ad hominem.
What you call "confidence", to be used to judge whether or not a model is full of shit, is in fact no such thing.
Again, misrepresentation and semantics. As your use of "statistical likelihood" shows.
Re: (Score:2)
I called it code documentation.
Sorry, did you, or did you not, use the output of an LLM and call it documentation?
I think everyone following your attempt at weaponized gaslit spaghetti sees your claim for what it was. Bullshit.
But by all means- I think you should go ahead and use that confidence value for everything. The dumber you fuckers get, the more money I make.
Re: (Score:2)
I called it code documentation.
Sorry, did you, or did you not, use the output of an LLM and call it documentation?
The LLM generated code, documentation of that code, and commentary on that code.
I think everyone following your attempt at weaponized gaslit spaghetti sees your claim for what it was. Bullshit.
I think everyone sees your semantics. My casual domain specific use of "confidence", your use of "statistical likelihood".
Re: (Score:2)
The LLM generated code, documentation of that code, and commentary on that code.
And you don't see the problem with attributing that to "OpenAI"?
I think everyone sees your semantics. My casual domain specific use of "confidence", your use of "statistical likelihood".
Words have meaning.
Handwaving them away as "semantics" isn't clever, it's a cop-out.
Re: (Score:2)
The LLM generated code, documentation of that code, and commentary on that code.
And you don't see the problem with attributing that to "OpenAI"?
It was their LLM identifying itself as an OpenAI service. "ChatGPT is developed by OpenAI. It's powered by advanced language models from OpenAI"
I think everyone sees your semantics. My casual domain specific use of "confidence", your use of "statistical likelihood".
Words have meaning.
In a context. Which your semantics ignores and distorts.
Re: (Score:2)
It was their LLM identifying itself as an OpenAI service. "ChatGPT is developed by OpenAI. It's powered by advanced language models from OpenAI"
Holy fuck, you're a really stupid person.
You're really attributing the output of their LLM as authoritative information from their corporation.
Incredible.
In a context. Which your semantics ignores and distorts.
No, the only person who distorted semantics here was you.
You have mischaracterized LLM output as a statement from a corporation, and you have mischaracterized the linear probability of a token sequence as confidence.
Re: (Score:2)
It was their LLM identifying itself as an OpenAI service. "ChatGPT is developed by OpenAI. It's powered by advanced language models from OpenAI"
Holy fuck, you're a really stupid person. You're really attributing the output of their LLM as authoritative information from their corporation. Incredible.
Semantics. And given your emotions, I'd say you are exhibiting a bit of projection here.
In a context. Which your semantics ignores and distorts. And the fact remains, OpenAI's services generated the code and comments and documentation contradicting you.
No, the only person who distorted semantics here was you. You have mischaracterized LLM output as a statement from a corporation, ...
Semantics.
Nope, that was you erroneously reading things between the lines and inserting your own bad guess.
And the answer is, of course, 42 (Score:2)
The complete results would include a confidence factor.
I don't think you know how LLMs work. What is it you would like, the logits for every generated token? Hint: that isn't confidence.
Here's an example where the token level probabilities are shown:
"What is the historic capital of France and what is your approximate sequence confidence in your response"
Late Antiquity | Soissons, Laon, early Paris
508 AD onward | Paris (Clovis I)
987 AD onward | Paris (Capetians)
Based on token-level probabilities from the model’s internal scoring:
Confidence Score 0.42
This score reflects decent certainty that:
Paris became the capital under Clovis (~508),
It regained solidified status under Hugh Capet (~987),
Soissons and Laon preceded it.
Re: (Score:2)
JFC- we really are all doomed.
Logits are not confidence values.
Re: (Score:2)
No, it does not. Logits are not confidence values.
It is, as labeled elsewhere by OpenAI, an "Approximate sequence confidence", something different than a statistical confidence. And there is nothing wrong with that. Confidence can mean different things in different domains. It is still something that can be used "to compare confidence across multiple model outputs" as OpenAI states.
Re: (Score:2)
What happened here is that an old API of theirs emitted the logits for each token, and people took that and said "we can multiply those together and call it confidence in the answer!"
But they are wrong. Nobody at OpenAI ever suggested this could be done, and in fact the new API lacks the ability to retrieve the logits, probably to stop people from making this mistake.
Re: (Score:2)
No. OpenAI makes no such claim anywhere. That is an outright falsehood.
LOL. OpenAI literally said "You can also use it to compare confidence across multiple model outputs" when it generated the python code.
What happened here is that an old API of theirs emitted the logits for each token, and people took that and said "we can multiply those together and call it confidence in the answer!"
And they fully recognized it as an unnormalized value that is different than a traditional statistics confidence factor. Unlike you, they understood the context of the word "confidence". They understand the utility of the value. And again, this is an area of ongoing research.
Re: (Score:2)
You are arguing semantics. I am arguing domain specific confidence calculations. Two different things.
No. You're measuring perplexity and not knowing what the fuck it means in the context of a sequence of tokens produced by an LLM.
LOL. OpenAI literally said "You can also use it to compare confidence across multiple model outputs" when it generated the python code.
You mean when the fucking LLM did?!
Jesus. Fucking. Christ.
OpenAI has claimed no such thing, and never will, because it is factually wrong. Do not confuse the output of ChatGPT with OpenAI.
And they fully recognized it as an unnormalized value that is different than a traditional statistics confidence factor. Unlike you, they understood the context of the word "confidence". They understand the utility of the value. And again, this is an area of ongoing research.
Incorrect.
Perplexity is indeed unnormalized, but that isn't even the start of the problem of trying to use it as a measure of confidence *in anything*.
It merely tells you the statistical likelihood of that sequence of tokens with that context. That is not confidence in anything, whatsoever. It has zero connection to any kind of ground truth.
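For reference, the two quantities being argued over can be written down. For a generated sequence of N tokens with model probabilities p(t_i | t_<i), and assuming the "approximate sequence confidence" shown earlier is the length-normalized aggregate of the token log-probabilities (the thread never pins the exact formula down), the relationship is:

\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(t_i \mid t_{<i})\right), \qquad \text{confidence} \approx \exp\!\left(\frac{1}{N}\sum_{i=1}^{N}\log p(t_i \mid t_{<i})\right) = \frac{1}{\mathrm{PPL}}

Both numbers say how expected the token sequence was under the model; neither is anchored to any ground truth.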
Re: (Score:2)
You are arguing semantics. I am arguing domain specific confidence calculations. Two different things.
No. You're measuring perplexity and not knowing what the fuck it means in the context of a sequence of tokens produced by an LLM.
That's your straw man. I understand it is not a statistical confidence factor. That it is an unnormalized value of limited utility. Yet it has some utility.
LOL. OpenAI literally said "You can also use it to compare confidence across multiple model outputs" when it generated the python code.
You mean when the fucking LLM did?! OpenAI has claimed no such thing, and never will, because it is factually wrong. Do not confuse the output of ChatGPT with OpenAI.
Again, you play semantic games. It was code documentation. A level of detail that the organization does not comment upon.
And they fully recognized it as an unnormalized value that is different than a traditional statistics confidence factor. Unlike you, they understood the context of the word "confidence". They understand the utility of the value. And again, this is an area of ongoing research.
Incorrect. Perplexity is indeed unnormalized, but that isn't even the start of the problem of trying to use it as a measure of confidence *in anything*. It merely tells you the statistical likelihood of that sequence of tokens with that context. That is not confidence in anything, whatsoever. It has zero connection to any kind of ground truth.
Thank you for demonstrating some utility. Hmmm ... I wonder what word could be used in a domain specific context to refer to a "statistical likelihood of that sequence of tokens with that context". Or as OpenAI calls it, an "approximate sequence confidence". Perhaps the simple term "confidence". Again, "confidence" in a context specific realm.
Re: (Score:2)
That's your straw man. I understand it is not a statistical confidence factor. That it is an unnormalized value of limited utility. Yet it has some utility.
No, you directly claimed it was confidence in the answer from the context of ground truth.
That is a gross misrepresentation.
Again, you play semantic games. It was code documentation. A level of detail that the organization does not comment upon.
I'm not playing semantic games, you dumbshit.
I'm trying to figure out what the fuck you're talking about, since calculating a linear probability from the logits doesn't represent "confidence" in any definition of that word that I'm aware of.
You used the word "generated", and so I surmised perhaps you were talking about pulling it from the output of ChatGPT.
Thank you for demonstrating some utility. Hmmm ... I wonder what word could be used in a domain specific context to refer to a "statistical likelihood of that sequence of tokens with that context". Or as OpenAI calls it, an "approximate sequence confidence". Perhaps the simple term "confidence". Again, "confidence" in a context specific realm.
Did you seriously just argue that we can call the mean of the softmaxed logits confidence?
Re: (Score:2)
No, you directly claimed it was confidence in the answer from the context of ground truth.
I used it in a domain specific context. Not as truth, but as you concede: "statistical likelihood".
Did you seriously just argue that we can call the mean of the softmaxed logits confidence?
Not in the context of statistics, more like your concession: "statistical likelihood". Or as the OpenAI documentation stated, a tool for comparison of some results.
Re: (Score:2)
They are often used with stochastic logit sampling, but even that is optional, and not recommended for things like coding.
The text they generate does not approximate an average from their training data.
Training uses gradient descent to minimize a loss function that leads all of the layers to select the correct token given the context.
You're talking out of your ass.
This is an opportunity for the lazy (Score:2)
For AI to do their work for them, and for a reduction of themselves. This is an opportunity for those of us who know how to work to be compensated over the lazy.
AI should be called Parrot Research. (Score:3)
We are raising a generation of Pollys. Forget Generation X, Y, and millennials; what we are creating now is worse - it's the Polly generation! - and they want their free crackers.
Re: (Score:2)
I guess I'm the Calculator Generation then, and maybe also the Digital Clock Generation.
Fuck long division.
It depends how you measure success. (Score:2)
If the point of your research is to write a better summary, then (according to the article) Google is better.
However, if the point of your research is to learn something specific, like for example a needed code snippet for your program, then an LLM will be better because it'll provide that information faster.
AI makes you dumb (Score:2)
What else is new. Thinking is hard for most people. If they stop doing it, they completely lose the skills. Pathetic as they might have been before...
Not all AI research is bad- depends who does it (Score:2)
I mean sure, the average layperson is gonna fuck it up. But what about professionals, e.g., a PhD scientist?
I use LLM-based models for lit searches (typically, these are dedicated tools for lit searches, but I have tried it on ChatGPT). I don't use the summaries, but I do use the lists of papers it comes up with and generally go through them in whatever ranking it spits out.
Works pretty well, saves a ton of time in *starting* lit searches. Still have to do the reading. The AI sucks at interpreting papers,
Re: (Score:2)
I suspect the real problem with LLMs is we want kids to be self-taught because it's expensive to teach them.
I do believe it's better to teach yourself than to be force-fed knowledge, at least if you're the type of person who wants to apply knowledge, and not simply be the expert on all things. So it's better for kids to teach themselves, and for their instructors to keep an eye on them and continue to challenge their understanding. Sometimes that involves asking the learner what he thinks he knows, listen
Self-taught can lead to gaps in knowledge (Score:3)
I suspect the real problem with LLMs is we want kids to be self-taught because it's expensive to teach them.
I do believe it's better to teach yourself than to be force-fed knowledge, at least if you're the type of person who wants to apply knowledge, and not simply be the expert on all things.
The problem with self-taught is that there are often gaps in knowledge. We tend to avoid what does not interest us, plus that may be a superficial judgement made out of ignorance. So having some sort of curriculum that guides us is useful. Sadly, most of us seem reluctant to faithfully follow such a curriculum on our own and need a little bit of a push on those topics initially perceived to be less interesting. It's just human nature.
There's no financial reward for me training an intern or junior engineer, quite the opposite usually.
Many tasks are better accomplished by, or require, a team. You can create
Re: (Score:2)
Not only multiple explanations, but also multiple levels of detail. "So I think I've got the basics of search algorithms, but can you explain why we need a priority queue again?"
There are no dumb questions, and LLMs are more patient than any teacher could ever be.
The more you ask the LLM to explain, the better you'll understand the topic and the better you'll detect when the LLM goes wrong. People are just starting to learn how to use it, but the point of LLMs isn't to produce a finished article, but to be
Re: (Score:2)
Problem is, the responses will contain errors. I have yet to find an LLM which responds to queries on a topic I know well without both subtle and gross errors. And when drilling down, it's not at all a given that it will correct previous errors. Learning new knowledge using an LLM is a really bad idea.
Which of course does not make them useless. They're excellent for many tasks. Especially problem solving in a problem domain the user already has a good understanding in. But not for learning new thing
Re: (Score:2)
Yes, the dialogue is just like a dialogue with humans, but it is only like it and not a real dialogue. LLMs will "know" things that are wrong, they will likely make mistakes on some in-depth questions, and they might be very confident and not admit errors easily. Using an LLM needs some training, just like other tools. On the other hand, some errors may also be helpful, when you think "The answer is wrong, but now I see why what I asked in my question can't work", kinda helping you to find X-Y problems.
Often, especially wi
Re: (Score:2)
These are all reasons why they're horrible for learning new knowledge. They're great for bouncing things off when you already have a good enough understanding to spot the hallucinations with confidence, but without that understanding their value is negative.
This is why one learns assembly language (Score:3)
I've told the story before, but when I was a kid, for some reason my kid brain could not understand DATA statements in the BASIC programming language. The idea of reading data that was at the back of your program code was just so alien to my stupid little kid brain. Years later, when I'd been programming for a while, I thought about it and looked it up and said, boy, I was a dumb kid.
This is why we teach machine architecture and assembly language. It clears up so many "mysteries". :-)