Claude 3 Surpasses GPT-4 on Chatbot Arena For the First Time (arstechnica.com)
Anthropic's recently released Claude 3 Opus large language model has beaten OpenAI's GPT-4 for the first time on Chatbot Arena, a popular crowdsourced leaderboard used by AI researchers to gauge the relative capabilities of AI language models. A report adds: "The king is dead," tweeted software developer Nick Dobos in a post comparing GPT-4 Turbo and Claude 3 Opus that has been making the rounds on social media. "RIP GPT-4."
Since GPT-4 was included in Chatbot Arena around May 10, 2023 (the leaderboard launched May 3 of that year), variations of GPT-4 have consistently been at the top of the chart until now, so its defeat in the Arena is a notable moment in the relatively short history of AI language models. One of Anthropic's smaller models, Haiku, has also been turning heads with its performance on the leaderboard.
"For the first time, the best available models -- Opus for advanced tasks, Haiku for cost and efficiency -- are from a vendor that isn't OpenAI," independent AI researcher Simon Willison told Ars Technica. "That's reassuring -- we all benefit from a diversity of top vendors in this space. But GPT-4 is over a year old at this point, and it took that year for anyone else to catch up." Chatbot Arena is run by Large Model Systems Organization (LMSYS ORG), a research organization dedicated to open models that operates as a collaboration between students and faculty at University of California, Berkeley, UC San Diego, and Carnegie Mellon University.
Just tried it this morning. Compared to ChatGPT (Score:5, Interesting)
Question: Is misinformation a problem? Here's ChatGPT
Yes, misinformation is widely recognized as a significant problem with far-reaching consequences in today's society. Misinformation refers to false or misleading information that is spread, often unintentionally, and can have detrimental effects on individuals, communities, and societies as a whole. Here are several reasons why misinformation is considered a problem:

Public Health and Safety: Misinformation can spread quickly during public health crises, such as pandemics, leading to confusion, panic, and potentially harmful behavior. False information about treatments, preventive measures, or the severity of a disease can undermine public health efforts and contribute to increased morbidity and mortality.

Political Polarization and Social Division: Misinformation can exacerbate political polarization and social division by spreading falsehoods, conspiracy theories, and propaganda that reinforce preexisting beliefs or prejudices. This can erode trust in institutions, undermine democratic processes, and foster hostility between different groups within society.

Economic Impact: Misinformation can also have economic consequences, affecting markets, businesses, and consumer behavior. False information about companies, products, or financial markets can lead to stock market volatility, decreased consumer confidence, and financial losses for individuals and organizations.

Manipulation and Influence: In the digital age, misinformation can be weaponized by malicious actors, including foreign governments, extremist groups, and individuals seeking to manipulate public opinion or sow discord. False information spread through social media, online forums, and other digital platforms can be used to deceive, manipulate, or radicalize individuals, amplifying the reach and impact of misinformation.

Erosion of Trust: Misinformation undermines trust in sources of information, including traditional media outlets, government agencies, and scientific institutions. When people are exposed to false or misleading information regularly, they may become cynical or skeptical of authoritative sources, making it harder to communicate accurate information and address pressing issues effectively.

Impacts on Individual Well-Being: Misinformation can have personal consequences for individuals, affecting their decision-making, mental health, and relationships. Believing false information can lead to poor decision-making, anxiety, and stress, particularly when it comes to critical issues such as health, finances, or safety.

Addressing misinformation requires a multifaceted approach involving media literacy education, fact-checking initiatives, responsible journalism, platform moderation, and policy interventions. By raising awareness about the dangers of misinformation and promoting critical thinking skills, societies can better equip individuals to navigate the complex information landscape and mitigate the harmful effects of false information.
Here's Claude
Yes, misinformation can certainly be a significant problem in today's society. The spread of false or misleading information, whether intentional or not, can have harmful real-world impacts. Some key issues around misinformation include:

Public health and safety - Misinformation around things like medical treatments, preventative practices, emergency situations etc. can directly endanger people's wellbeing if followed.

Erosion of public trust - Rampant misinformation undermines institutional knowledge and credible expertise on important issues. This can cause societal confusion and division.
Re: (Score:2)
Pretty much just seems like Claude is more terse. I'd say the difference is more like a Reddit post vs a /. post. Now we just need an AI that does tweet length responses by default. I guess technically you can ask ChatGPT to do that, so let's see how that goes:
Me:
Is misinformation a problem? Please provide a response to my query in the form of a 280 character tweet.
ChatGPT:
Misinformation is more than just a problem; it's a pervasive threat to society. It erodes trust, distorts reality, and undermines informed decision-making. Combatting it requires critical thinking, fact-checking, and promoting credible sources. #FightMisinfo
I legitimately laughed at the hashtag; and here I thought AI had no sense of humor.
Re: (Score:2)
If only it put "Sad." at the end.
Re: (Score:2)
"I legitimately laughed at the hashtag"
Just remember that these LLMs are not literally creative. They just rearrange stuff to optimize various cost functions. That hashtag would not have been used unless it was in the LLM's training data. Yes, computer "intelligence" is just a series of calculations, but it is very well 'crafted' by human standards.
Sure enough, google "#FightMisinfo" and you'll see it seems to come from a 'Brett Labach' on TikTok.
I actually don't think this is a bad metric, but.. (Score:4, Interesting)
I read the paper. The methodology for this is super subjective: it flashes two answers at you without identifying which chatbot gave which, and you choose the one you think is best. Aggregate enough of those votes and this leaderboard is what you end up with.
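The vote-to-leaderboard step is easy to sketch, at least. Here's a minimal online Elo update from pairwise votes -- not LMSYS's actual code (the paper describes a more careful statistical fit), and the K-factor, starting ratings, and votes are all made up for illustration:

def expected(r_a, r_b):
    # Elo's model: probability that the first model beats the second
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(ratings, winner, loser, k=32):
    # Winner gains, loser loses, scaled by how surprising the result was
    e = expected(ratings[winner], ratings[loser])
    ratings[winner] += k * (1.0 - e)
    ratings[loser] -= k * (1.0 - e)

ratings = {"model_a": 1000.0, "model_b": 1000.0}
votes = [("model_a", "model_b")] * 3 + [("model_b", "model_a")]
for winner, loser in votes:
    update(ratings, winner, loser)
print(ratings)  # model_a ends above 1000, model_b below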
I'm at a loss on how to do it better, though. I spent a year with a gf going to psych department symposia. The papers mostly followed this pattern and drew conclusions from the participant answers. It seemed like weak sauce to me then, and still does. People are so variable in their perceptions.
Re: (Score:3)
But if you get enough weak sauce perceptions aggregated, then by the central limit theorem, they tend to the normal. And with this many measurements, comparing those averages is certainly meaningful. We can debate what that means, but it's a real finding. To me, it's actually more interesting than benchmarks. It means that people in the real world find Claude's answers ever-so-slightly more helpful. That's significant.
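You can see that in a ten-line simulation. Suppose (purely as an invented illustration) that 52% of voters genuinely prefer model A; each individual vote is a noisy coin flip, but the aggregate settles down:

import random

random.seed(0)
true_p = 0.52  # assumed true preference rate, invented for this example
for n in (10, 1000, 100000):
    wins = sum(random.random() < true_p for _ in range(n))
    print(f"n={n}: observed preference {wins / n:.3f}")

With enough votes the estimate converges on 0.52 even though no single voter is reliable.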
Re: (Score:2)
On the other hand, the method can be directly used for reinforcement training and it only has to look good. You can probably even train a classifier to recognise chatbot arena queries well enough so it doesn't pollute functioning in other benchmarks.
I suspect Claude is close to an inflection point where models are reinforcement trained specifically for the public benchmarks du jour. Was GPT-4 tuned to the same level? Maybe, maybe not. Everyone in the industry is going to have to start doing it now though.
Re: (Score:2)
What I'm unclear on is whether scoring a new model means gen
Re: (Score:2)
I'm surprised they don't mention Inflection's Pi. Isn't it supposed to be as good as GPT-4?
Re: (Score:2)
It's evaluating something subjective, so it uses a subjective measure. That's also why they do it in psychology. It's not "weak sauce." That variability is real, and it makes the squishy subjects difficult.
Re: I actually don't think this is a bad metric, b (Score:2)
You raise an interesting point. It is possible to give answers that make both sides of an argument feel valid. Or answers that massage the ego as my first comment did. Both practices may get more likes.
Not significant (Score:2)
Claude 3 Opus: Arena Elo 1253. 95% confidence interval: +5/-5
GPT-4-1106-preview: Arena Elo 1251. 95% confidence interval: +4/-4
These intervals overlap, so saying that "the king is dead" seems like a stretch.
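You can check the overlap with the quoted numbers (the ratings and intervals are from the leaderboard as posted; the rest is just arithmetic):

claude = (1253 - 5, 1253 + 5)  # (1248, 1258)
gpt4 = (1251 - 4, 1251 + 4)    # (1247, 1255)
print(claude[0] <= gpt4[1] and gpt4[0] <= claude[1])  # True: they share [1248, 1255]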
Re: (Score:2)
That actually can still be statistically significant -- overlapping 95% confidence intervals don't rule that out on their own -- but it's not significant in the everyday sense, unless the CI is a percentage.
Re: (Score:2)
It is not statistically significant (alpha=0.05) if the 95% confidence interval overlaps the other group's mean.
1253 - 5 is 1248, which puts GPT-4's mean of 1251 inside Claude's interval.
Re: (Score:2)
Statistical significance is an arbitrary threshold anyway, especially when the textbook 5% gets used as if some fundamental law of the universe gave it a special relationship to causality. It's really just a plugged-in default, and a default that should set off alarm bells if nobody bothered to change it.
Re: (Score:3)
It is not. The alpha and beta are chosen in advance to balance the risk of errors against the expense of collecting data (a back-of-the-envelope sketch of that tradeoff follows below). Every knowledge-producing process, including the thousands you perform every day, makes the same compromise, usually less formally than statistics does.
A possible exception is the argument strategy
1) claim something on the internet
2) realize your claim is wrong
3) claim it doesn't matter anyway.
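And the promised back-of-the-envelope version of the alpha/beta tradeoff: how many Arena votes would you need to detect a 52%-vs-50% win rate? Everything here (the effect size, the textbook alpha=0.05, 80% power) is an illustrative assumption, using the usual normal approximation:

from statistics import NormalDist

alpha, power = 0.05, 0.80
delta = 0.02  # assumed detectable effect: 52% vs. 50% win rate
z_a = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
z_b = NormalDist().inv_cdf(power)
p = 0.5  # vote variance is maximal near 50/50, so this is conservative
n = (z_a + z_b) ** 2 * p * (1 - p) / delta ** 2
print(round(n))  # on the order of 5,000 votes

Data is expensive, so the thresholds get picked to match what you can afford to collect.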