
How Do Olympiad Medalists Judge LLMs in Competitive Programming?

A new benchmark assembled by a team of International Olympiad medalists suggests the hype about large language models beating elite human coders is premature. LiveCodeBench Pro, unveiled in a 584-problem study [PDF] drawn from Codeforces, ICPC and IOI contests, shows the best frontier model clears just 53% of medium-difficulty tasks on its first attempt and none of the hard ones, while grandmaster-level humans routinely solve at least some of those highest-tier problems.
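In pass@1 terms, that just means the share of problems solved on the very first submission, bucketed by difficulty. A minimal sketch of the tally, using invented records rather than the paper's actual data format:

```python
# Minimal sketch of a per-difficulty pass@1 tally. The (difficulty,
# solved_on_first_attempt) records are invented for illustration; the
# benchmark's real schema will differ.
from collections import defaultdict

results = [
    ("easy", True), ("easy", True),
    ("medium", True), ("medium", False),
    ("hard", False), ("hard", False),
]

totals = defaultdict(int)
solved = defaultdict(int)
for tier, first_try_ok in results:
    totals[tier] += 1
    solved[tier] += first_try_ok  # bool counts as 0/1

for tier in ("easy", "medium", "hard"):
    print(f"{tier}: pass@1 = {solved[tier] / totals[tier]:.0%}")
```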

The researchers measured models and humans on the same Elo scale used by Codeforces and found that OpenAI's o4-mini-high, when stripped of terminal tools and limited to one try per task, lands at an Elo rating of 2,116 -- hundreds of points below the grandmaster cutoff and roughly the 1.5 percentile among human contestants. A granular tag-by-tag autopsy identified implementation-friendly, knowledge-heavy problems -- segment trees, graph templates, classic dynamic programming -- as the models' comfort zone; observation-driven puzzles such as game-theory endgames and trick-greedy constructs remain stubborn roadblocks.
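For reference, Codeforces-style ratings rest on the standard Elo expected-score formula. A minimal sketch, pitting the model's reported 2,116 against the 2,400 grandmaster cutoff used by Codeforces (the head-to-head matchup itself is purely illustrative):

```python
# Standard Elo expected-score formula underlying Codeforces-style ratings.
# 2116 is the model's reported rating; 2400 is the Codeforces grandmaster
# threshold, used here only to illustrate the gap.
def expected_score(r_a: float, r_b: float) -> float:
    """Expected score of a rating r_a player against a rating r_b player."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

print(f"{expected_score(2116, 2400):.2f}")  # ~0.16: a clear underdog
```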

Because the dataset is harvested in real time as contests conclude, the authors argue it minimizes training-data leakage and offers a moving target for future systems. The broader takeaway is that impressive leaderboard jumps often reflect tool use, multiple retries or easier benchmarks rather than genuine algorithmic reasoning, leaving a conspicuous gap between today's models and top human problem-solvers.

  • by TWX ( 665546 ) on Tuesday June 17, 2025 @10:19AM (#65455565)

    ...but I guess that you have to be very, very into this particular niche for this to make any sense.

    When I read, "International Olympiad," I do not think programming. I think track-and-field and other competitions where physical fitness and physical skill define the event.

    As for, "LLM", does anyone else see that and think, "MLM"? As in, scam?

  • by bill_mcgonigle ( 4333 ) * on Tuesday June 17, 2025 @10:33AM (#65455603) Homepage Journal

    Some neural nets have been good at solving sticky programming problems, whether finding game cheats, doing voice recognition, modeling proteins, or tackling other tasks humans haven't done well at.

    But an LLM is more of an information retrieval tool, so tasking it with clever algorithm design is asking the wrong tool the wrong question.

    Then there are the people who compete in programming challenges. In high school I would sometimes stay after to do the ACSL competition tests - no big deal, the school was a five-minute walk, and it helped my buddies who wanted a high team score.

    Then they implored me to go to DC on a trip for a national competition our score qualified us for. This seemed so bizarre to me as a fifteen-year-old kid - I could stay in a run-down motel and take tests this weekend or go camping in a state forest with friends. I let them down, in a way, but the ask was totally alien to me.

    I have nothing at all against people who enjoy such things, but it's a subset of the algorithmically minded.

    So we now have the results of some competitive coders vs. the wrong tool for the job.

    OK, mildly interesting, but does it tell us much?

    • But an LLM is more of an information retrieval tool,

      And not even really that. At its core an LLM is a "plausible-sounding sentence generator".
      It merely puts tokens together, given a context (the prompt, etc.) and given a statistical model (the distribution of tokens found in the corpus that the LLM was trained on).
      It's like an insanely advanced super-duper autocomplete on steroids (pun intended given the context).

      If the model is rich enough, the plausible-sounding sentences have a higher chance of being close to the truth.
      (Just like on a smartphone the autocomplete do
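To make the parent's token-sampling picture concrete, here is a toy sketch. The vocabulary and probabilities are invented; a real LLM derives the distribution from the context with a trained network over an enormous vocabulary:

```python
# Toy illustration of next-token generation as sampling from a
# context-conditioned distribution. The distribution here is hard-coded;
# a real model computes it from the context with a neural network.
import random

def next_token(context: str) -> str:
    distribution = {"mat": 0.6, "sofa": 0.3, "moon": 0.1}  # P(token | context)
    tokens, weights = zip(*distribution.items())
    return random.choices(tokens, weights=weights)[0]

prompt = "The cat sat on the"
print(prompt, next_token(prompt))
```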

      • It's like an insanely advanced super-duper autocomplete on steroids (pun intended given the context).

        This is, and remains, a very misleading characterization.

        Is it true at its deepest level? Yes.
        But what is it autocompleting? That's the rub in the characterization.
        It's autocompleting what "a person with all of human knowledge would write".

        LLMs do not reason. LLMs cannot really reason. They can put plausible-sounding words together; that's about it.

        Sure they do. It's literally demonstrable. Saying an LLM cannot really reason is like saying a calculator can't really do math.
        The only way you can back up the assertion is by relying on an anthropocentric definition of the word that precludes non-humans from doing it.

    • But an LLM is more of an information retrieval tool, so tasking it with clever algorithm design is asking the wrong tool the wrong question.

      I'm going to fix some omissions and errors in the summary for you, and we can recompute.

      It's not 1.5th percentile, it's the 98th percentile.
      The "hundreds of points between grandmaster" (Top 0.33%) places it in the "Master" category.

      Really, the picture is more complicated.
      In some tests it scored at the level of "International Grandmaster" (seriously, what the fuck are these names?), or the top 0.12%, while in others it was merely a "Specialist", or top 23%.

      But the real insight here- that I think any high

      • by ceoyoyo ( 59147 )

        It is 1.5 percentile. I.e. 1.5% of humans who took the test scored better than it. I.e. it scored better than 98.5% of humans who took the test. Thus the sneaky summary "suggests the hype about large language models beating elite human coders is premature."

        https://en.wikipedia.org/wiki/... [wikipedia.org]


        • That's just ass-backwards.

          To quote your link,

          The score for a specified percentage (e.g., 90th) indicates a score below which (exclusive definition) or at or below which (inclusive definition) other scores in the distribution fall.

          1.5th percentile would imply that only 1.5% of people scored below it, when in fact it is the exact opposite.
          If 1.5% of scores are higher than yours, then you are in the 98th percentile, not the 1.5th. Top quartile, not the bottom.
          The summary isn't sneaky; it's just wrong.
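The arithmetic being argued over is easy to check with a minimal percentile-rank sketch (inclusive definition, invented scores):

```python
# Percentile-rank arithmetic (inclusive definition): if ~1.5% of
# contestants score above you, your percentile rank is ~98.5, not 1.5.
# The scores are invented for illustration.
def percentile_rank(score: float, scores: list[float]) -> float:
    """Percent of scores at or below `score`."""
    return 100.0 * sum(s <= score for s in scores) / len(scores)

scores = list(range(1000))           # 1000 hypothetical contestants
print(percentile_rank(984, scores))  # 98.5: only 1.5% scored higher
```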

          • by ceoyoyo ( 59147 )

            There's a meme going around where a home-schooling mom is bragging about her kid's results on an IQ test. Look, my kid is in the 98th percentile with an 80 IQ, see how wonderful home schooling is, and what a great teacher I am!

            It's funny for a couple of reasons.

    • by alanw ( 1822 )

      Bruce Schneier posted this today:
      Where AI Provides Value [schneier.com]
      If you’ve worried that AI might take your job, deprive you of your livelihood, or maybe even replace your role in society, it probably feels good to see the latest AI tools fail spectacularly. If AI recommends glue as a pizza topping, then you’re safe for another day.

      But the fact remains that AI already has definite advantages over even the most skilled humans, and knowing where these advantages arise—and where they don’t—

  • It sure sounds like it.
  • The nuance and complexity of reality is very often a bit of a parade-rainer when it comes to the media's need to promote sensationalist headlines like "AI beats best humans at...!"

  • If the bot can do the judge's job for them well, it gets an "A"

  • It took some work to decode what this story was really about and what the paper was actually trying to accomplish. This looks like a thinly veiled fanboi attempt to heap even more hype on LLMs. The goal seems to be to create a standard for judging how 'useful' LLMs have become, and to monitor progress and identify specific shortfalls and areas that still need improvement. All the social circles and web sites that seem interested in this appear to be very pro-LLM
  • Legitimately fuck the poster for taking a shit in my eyeballs with this trash
  • TFS's stated opinion presents a false dichotomy. AI doesn't have to be better than elite programmers. It only has to be better than MOST programmers or, alternatively, average programmers at approximately the same cost. Once it can do that, if it can't already, AI has the better ROI, if for no other reason than that AI won't quit its job.

    • by AvitarX ( 172628 )

      That assumes that either:
      1) you can reliably hire elite programmers and use AI for the rest
      Or
      2) a team with diverse levels of talent is worse than a team with all average talent

      I have no opinion on the answer to those two questions, but it doesn't seem obvious to me that pure average talent would be better than a mixed talent team.

"Engineering without management is art." -- Jeff Johnson
