Anthropic Releases New Version of Claude That Beats GPT-4 and Gemini Ultra in Some Benchmark Tests (venturebeat.com)

Anthropic, a leading artificial intelligence startup, unveiled its Claude 3 series of AI models today, designed to meet the diverse needs of enterprise customers with a balance of intelligence, speed, and cost efficiency. The lineup includes three models: Opus, Sonnet, and the upcoming Haiku. From a report: The star of the lineup is Opus, which Anthropic claims is more capable than any other openly available AI system on the market, even outperforming leading models from rivals OpenAI and Google. "Opus is capable of the widest range of tasks and performs them exceptionally well," said Anthropic cofounder and CEO Dario Amodei in an interview with VentureBeat. Amodei explained that Opus outperforms top AI models like GPT-4, GPT-3.5 and Gemini Ultra on a wide range of benchmarks. This includes topping the leaderboard on academic benchmarks like GSM-8k for mathematical reasoning and MMLU for expert-level knowledge.

"It seems to outperform everyone and get scores that we haven't seen before on some tasks," Amodei said. While companies like Anthropic and Google have not disclosed the full parameters of their leading models, the reported benchmark results from both companies imply Opus either matches or surpasses major alternatives like GPT-4 and Gemini in core capabilities. This, at least on paper, establishes a new high watermark for commercially available conversational AI. Engineered for complex tasks requiring advanced reasoning, Opus stands out in Anthropic's lineup for its superior performance. Sonnet, the mid-range model, offers businesses a more cost-effective solution for routine data analysis and knowledge work, maintaining high performance without the premium price tag of the flagship model. Meanwhile, Haiku is designed to be swift and economical, suited for applications such as consumer-facing chatbots, where responsiveness and cost are crucial factors. Amodei told VentureBeat he expects Haiku to launch publicly in a matter of "weeks, not months."

This discussion has been archived. No new comments can be posted.

  • "Some" benchmark tests. We're not disclosing what kind of a test, what criteria were benchmarked, etc. ...

    Really?

    • by Calydor ( 739835 )

      The test could, for all we know, just be the average length of the output, or the LIX score, or anything else that is technically measurable but says nothing about the actual value of the output.
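      For the curious, LIX is a readability formula: words per sentence plus the percentage of words longer than six letters. It is trivially computable, which is the parent's point that "measurable" is not the same as "meaningful." A throwaway sketch with deliberately naive tokenization:

          import re

          def lix(text: str) -> float:
              """LIX readability: words/sentences + 100 * long_words / words."""
              words = re.findall(r"[A-Za-z']+", text)
              if not words:
                  return 0.0
              sentences = max(1, len(re.findall(r"[.!?]+", text)))
              long_words = sum(1 for w in words if len(w) > 6)
              return len(words) / sentences + 100.0 * long_words / len(words)

          print(lix("The cat sat. Benchmarks occasionally mislead unsuspecting readers."))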

    • The summary explicitly gives examples. From the summary, the tests included "GSM-8k for mathematical reasoning and MMLU for expert-level knowledge." Not reading the linked article is one thing, but please at least read the summary.
      • Also if one clicks through to the article, they list a whole bunch of other explicit benchmarks including some where they performed less well than the other AIs.
        • by Luckyo ( 1726890 )

          I wrote the initial post for people just like you. Who read "oh, they have a nice chart with names for each benchmark" and conclude that those are meaningful numbers.

          When in reality, they didn't actually tell you anything about the benchmarks. They merely told you what they called them. They didn't tell you what test was actually run, under what parameters, with what inputs, on what hardware. They literally told you nothing about what they did. But they did inform you that "some" of the tests (which are not

          • So 1) That's not what you said. 2) In fact, even if you had said that, it would still be wrong. The vast majority of these benchmarks are made by independent organizations, and you can find the details of how they work without much effort. For example, GSM-8k is available here: https://paperswithcode.com/dataset/gsm8k [paperswithcode.com].
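            Since GSM-8k is public, the general shape of such an eval is easy to reproduce. A rough sketch against the Hugging Face mirror of the dataset; ask_model is a placeholder you would wire to whatever API you are testing, and the answer extraction here is deliberately simple:

                # pip install datasets
                import re
                from datasets import load_dataset

                def ask_model(question: str) -> str:
                    """Placeholder: call the model under test, return its raw reply."""
                    raise NotImplementedError

                gsm8k = load_dataset("gsm8k", "main", split="test")

                sample = gsm8k.select(range(100))  # score a small sample
                correct = 0
                for example in sample:
                    # Reference answers in this dataset end with "#### <number>".
                    reference = example["answer"].split("####")[-1].strip().replace(",", "")
                    numbers = re.findall(r"-?\d[\d,]*\.?\d*", ask_model(example["question"]))
                    if numbers and numbers[-1].replace(",", "") == reference:
                        correct += 1

                print(f"exact-match accuracy on sample: {correct / len(sample):.2%}")

            Even with a public dataset, reported scores still depend on prompting, sampling settings, and answer extraction, which is the defensible kernel of the complaint upthread.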
            • by ceoyoyo ( 59147 )

              Luckyo has a shovel and he's going to use it to keep on digging, damn it.

            • by Luckyo ( 1726890 )

              Let's try one last time.

              What the story linked in the OP says, right after the wonderful-looking charts that you are citing as explaining in detail how these models are superior:

              >While companies like Anthropic and Google have not disclosed the full parameters of their leading models, the reported benchmark results from both companies imply Opus either matches or surpasses major alternatives like GPT-4 and Gemini in core capabilities.

              Note that none of these are my words. These are the words of writers

              • You are now doing a weird thing where you are focusing on the word "imply" rather than "show" as if that is some major difference in meaning. The use of "imply" here is simply due to the inherent limitations of any benchmark. But please note that this is now the third claim you have made. Your first claim was that they were "not disclosing what kind of a test, what criteria were benchmarked." That was shown to be wrong. Then you claimed in your second comment that "When in reality, they didn't actually tell
                • by Luckyo ( 1726890 )

                  >word "imply" rather than "show" as if that is some major difference in meaning.

                  I am honestly flabbergasted, but considering your fight for supremacy of marketing wank over reality underneath it, I really shouldn't be.

          • You're wrong and you fired your mouth off carelessly. Just accept the correction like a man.

            • by Luckyo ( 1726890 )

              From the linked story, coming AFTER the wonderful charts:

              >While companies like Anthropic and Google have not disclosed the full parameters of their leading models, the reported benchmark results from both companies imply Opus either matches or surpasses major alternatives like GPT-4 and Gemini in core capabilities.

              >This, at least on paper, establishes a new high watermark for commercially available conversational AI.

              Did you notice the word "imply" and "on paper" being used? Why did they do that if I'm

              • This is no different from other commercial releases of any product. Car companies make representations about their vehicles' performance. Intel makes claims about what its CPUs can do. There's a list of the benchmarks used; the names are listed. It's like Intel citing its Cinebench results for its CPUs. It's enough to establish a baseline for the product's relevance in the market.

                As always, YMMV, and the best benchmark is your own use.

                Now stop being obtuse and learn the lesson. Nothing is more pathetic than a grow

          • by Rei ( 128717 )

            When in reality, they didn't actually tell you anything about the benchmarks. They merely told you what they called them.

            Whether or not you're familiar with them, these are industry-standard benchmarks.

        • So in other words we'll never know what the benchmarks were.
        • by Rei ( 128717 )

          Also if one clicks through to the article, they list a whole bunch of other explicit benchmarks including some where they performed less well than the other AIs.

          Claude 3 Opus outperformed GPT-4 in every benchmark on that list, so I'm not sure what you're looking at.

          • I'm... not sure. I don't know how I misread it. You are correct.
            • by Rei ( 128717 )

              Probably got mixed up by the fact that there are several different-weight variants of Claude 3. :) Opus is the biggest.

      • by Luckyo ( 1726890 )

        And if you continue reading rather than stopping there, you'll find the "but" that nullifies that statement completely:

        >"It seems to outperform everyone and get scores that we haven't seen before on some tasks," Amodei said. While companies like Anthropic and Google have not disclosed the full parameters of their leading models, the reported benchmark results from both companies imply Opus either matches or surpasses major alternatives like GPT-4 and Gemini in core capabilities.

    • by Rei ( 128717 )

      It did well pretty much across the board. The really impressive thing was how well it performed on the "diamond" question set, which is a set of questions designed (A) specifically to be un-googleable, and (B) to cover things across a range of fields that are challenging for experts in their respective fields, and which experts in unrelated fields will generally get wrong.

      Seems to be an extremely capable set of models. AI Explained fed it all seven Harry Potter books, with a secret sentence inserted in the middle to see whether it could find it.
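      What is being described is a "needle in a haystack" recall test: bury one out-of-place sentence in a very long text and ask the model to quote it back. A minimal sketch against the Anthropic API; the needle sentence, the corpus file, and the model ID are stand-ins for illustration:

          import anthropic

          NEEDLE = "The best butterbeer is brewed with a pinch of paprika."  # invented

          # Assumes corpus.txt is a long text that still fits the context window.
          haystack = open("corpus.txt", encoding="utf-8").read()
          mid = len(haystack) // 2
          document = haystack[:mid] + " " + NEEDLE + " " + haystack[mid:]

          client = anthropic.Anthropic()
          response = client.messages.create(
              model="claude-3-opus-20240229",  # launch-day Opus ID (assumed)
              max_tokens=200,
              messages=[{
                  "role": "user",
                  "content": document + "\n\nOne sentence above does not belong. Quote it exactly.",
              }],
          )
          print(NEEDLE in response.content[0].text)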

  • I'm just waiting for a new benchmark called 'Tabs'. It would be a measure of the amount of LSD required by a human to replicate the severity of an AI's hallucinations.

    • Right. I'd settle for accuracy over speed any day

      • by ceoyoyo ( 59147 )

        Then you don't want AI.

        The point of AI is to make computers behave more like human intelligence. People are great at giving approximate solutions quickly, imagining possibilities, lying, etc. Using it for accurate information recall is pretty much the opposite. Using a generative model is even dumber. Most generative models are literally designed to produce results starting with random numbers; the technique was originally used to produce art.
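        Concretely, "starting with random numbers" means something like this: a GAN-style generator is just a function from random noise to an output, and different draws give different results by design. A toy sketch with untrained, arbitrary weights:

            import numpy as np

            rng = np.random.default_rng(0)

            # Untrained toy "generator": an affine map from 16 noise dims to an 8x8 image.
            W = rng.normal(size=(64, 16))
            b = rng.normal(size=64)

            def generate() -> np.ndarray:
                z = rng.normal(size=16)          # the random numbers it all starts from
                return np.tanh(W @ z + b).reshape(8, 8)

            print(generate()[0])  # two calls, two different outputs:
            print(generate()[0])

        LLM sampling is the same idea one level up: each token is drawn at random from the model's predicted distribution, which is why identical prompts can yield different answers.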

      • Right. I'd settle for accuracy over speed any day

        I'm still trying to figure out if you were being serious and referring to rapidity, or being humorous and talking about amphetamines... ;-)

  • "Performance" doesn't mean anything if it's not useful. What parameters are they measuring? How does a human do on the test(s)? Etc...

    • by Rei ( 128717 )

      You know that you can look up all of the tests listed in the article, right?

  • >a balance of intelligence, speed, and cost efficiency.

    lolno. You can't balance those things. With the current generation, if you skew all the way toward "intelligence" you'll still get AIs that sell cars for a dollar, fail basic arithmetic, make shit up, and at best know whether the customer request is an item in a pre-made database of responses or whether a human agent needs to get involved. This is a fig leaf for CIOs to wear when they need the money to get into AI but have to pretend it's not an ear
