
How Do Olympiad Medalists Judge LLMs in Competitive Programming?
A new benchmark assembled by a team of International Olympiad medalists suggests the hype about large language models beating elite human coders is premature. LiveCodeBench Pro, unveiled in a 584-problem study [PDF] drawn from Codeforces, ICPC and IOI contests, shows the best frontier model clears just 53% of medium-difficulty tasks on its first attempt and none of the hard ones, while grandmaster-level humans routinely solve at least some of those highest-tier problems.
The researchers measured models and humans on the same Elo scale used by Codeforces and found that OpenAI's o4-mini-high, when stripped of terminal tools and limited to one try per task, lands at an Elo rating of 2,116 -- hundreds of points below the grandmaster cutoff and roughly the 1.5 percentile among human contestants. A granular tag-by-tag autopsy identified implementation-friendly, knowledge-heavy problems -- segment trees, graph templates, classic dynamic programming -- as the models' comfort zone; observation-driven puzzles such as game-theory endgames and trick-greedy constructs remain stubborn roadblocks.
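For readers unfamiliar with the scale: Codeforces-style Elo turns a rating gap into an expected head-to-head score via the standard logistic formula. A minimal sketch in Python (the 2,400 grandmaster cutoff used below is the usual Codeforces threshold; treat the exact numbers as illustrative assumptions, not figures from the paper):

    # Standard Elo expected score: the share of points a player rated
    # r_a is expected to take from a player rated r_b.
    def elo_expected_score(r_a: float, r_b: float) -> float:
        return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

    # Assumed grandmaster cutoff of 2,400 vs. the model's reported 2,116:
    print(round(elo_expected_score(2116, 2400), 2))  # ~0.16: heavily outmatched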
Because the dataset is harvested in real time as contests conclude, the authors argue it minimizes training-data leakage and offers a moving target for future systems. The broader takeaway is that impressive leaderboard jumps often reflect tool use, multiple retries or easier benchmarks rather than genuine algorithmic reasoning, leaving a conspicuous gap between today's models and top human problem-solvers.
Most of the words are English... (Score:3)
...but I guess that you have to be very, very into this particular niche for this to make any sense.
When I read "International Olympiad," I do not think programming. I think track-and-field and other competitions where physical fitness and physical skill define the event.
As for, "LLM", does anyone else see that and think, "MLM"? As in, scam?
Olympiad medalists are not allowed to use bionics (Score:2)
Olympiad medalists are not allowed to use bionics or drugs that can't be traced, right?
Re: (Score:2)
As for, "LLM", does anyone else see that and think, "MLM"? As in, scam?
Perhaps it stands for "Loser Level Marketing"?
Category Problems (Score:3)
Some neural nets have been good at solving sticky programming problems, whether finding game cheats, doing voice recognition, modeling proteins, or tackling other tasks humans haven't done well at.
But an LLM is more of an information retrieval tool, so tasking it with clever algorithm design is asking the wrong tool the wrong question.
Then there are the people who compete in programming challenges. In high school I would sometimes stay after to do the ACSL competition tests - no big deal, the school was a five-minute walk, and it helped my buddies who wanted a high team score.
Then they implored me to go to DC on a trip for a national competition our score qualified us for. This seemed so bizarre to me as a fifteen-year-old kid - I could stay in a run-down motel and take tests this weekend or go camping in a state forest with friends. I let them down, in a way, but the ask was totally alien to me.
I have nothing at all against people who enjoy such things, but it's a subset of the algorithmically minded.
So we now have the results of some competitive coders vs. the wrong tool for the job.
OK, mildly interesting, but does it tell us much?
Not even retrieval. (Score:2)
But an LLM is more of an information retrieval tool,
And not even really that. At its core an LLM is a "plausible-sounding sentence generator".
It merely puts tokens together, given a context (the prompt, etc.) and a statistical model (the distribution of tokens found in the corpus that the LLM was trained on).
It's like an insanely advanced super-duper autocomplete on steroids (pun intended given the context).
If the model is rich enough, the plausible-sounding sentences have a higher chance of being close to the truth.
(Just like on a smartphone the autocomplete do
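To make the "token generator" picture concrete, here is a toy sketch: a bigram model that samples the next token from frequencies in a tiny made-up corpus. Real LLMs use learned neural distributions over subword tokens, but the generation loop has the same shape:

    import random
    from collections import defaultdict

    # Toy "statistical model": how often each token follows another
    # in a (tiny, made-up) training corpus.
    corpus = "the cat sat on the mat and the cat sat".split()
    counts = defaultdict(lambda: defaultdict(int))
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1

    def sample_next(token):
        # Pick a successor in proportion to how often it followed `token`.
        options = counts[token]
        return random.choices(list(options), weights=list(options.values()))[0]

    # "Autocomplete on steroids": extend the prompt one token at a time.
    out = ["the"]
    for _ in range(6):
        out.append(sample_next(out[-1]))
    print(" ".join(out))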
Re: (Score:2)
It's like an insanely advanced super-duper autocomplete on steroids (pun intended given the context).
This is, and remains, a very misleading characterization.
Is it true at its deepest level? Yes.
But what is it autocompleting? That's the rub in the characterization.
It's autocompleting what "a person with all of human knowledge would write".
LLMs do not reason. LLMs cannot really reason. They can put plausible-sounding words together; that's about it.
Sure they do. It's literally demonstrable. Saying an LLM cannot really reason is like saying a calculator can't really do math.
The only way you can back up the assertion is by relying on an anthropocentric definition of the word that precludes non-humans from doing it.
Re: (Score:2)
It is 1.5 percentile, i.e. 1.5% of humans who took the test scored better than it; in other words, it scored better than 98.5% of humans who took the test. Thus the sneaky summary "suggests the hype about large language models beating elite human coders is premature."
https://en.wikipedia.org/wiki/... [wikipedia.org]
Re: (Score:2)
To quote your link,
The score for a specified percentage (e.g., 90th) indicates a score below which (exclusive definition) or at or below which (inclusive definition) other scores in the distribution fall.
1.5th percentile would imply that only 1.5% of people scored worse than it, when in fact it is the exact opposite.
Re: (Score:2)
Not really. It's autocompleting a statistically probable sentence based on the corpus of all trained input.
Statistically probable based upon 1000-dimensional embeddings of portions of words, which means it is not, in any way, shape, or form, as simple as "this word commonly comes after that word", but rather is comparing 1000 separate dimensions of semantics.
So yes, really.
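A toy sketch of what "comparing dimensions of semantics" looks like, with made-up 4-dimensional vectors standing in for the ~1000-dimensional learned embeddings:

    import math

    # Hypothetical toy embeddings; real models learn these vectors.
    emb = {
        "king":  [0.9, 0.8, 0.1, 0.3],
        "queen": [0.9, 0.7, 0.2, 0.9],
        "pizza": [0.1, 0.0, 0.9, 0.5],
    }

    def cosine(a, b):
        # Similarity of direction across all dimensions at once.
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    print(cosine(emb["king"], emb["queen"]))  # high: related meanings
    print(cosine(emb["king"], emb["pizza"]))  # low: unrelated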
But not even that. Instead, it's autocompleting a statistically probable sentence based on the corpus of all trained input filtered through acceptable output lenses, because we all remember how Tay turned out after getting some training from the Internet for just a few hours.
That's fair -- we should *definitely* emphasize what goes into fine-tuning and alignment training.
Re: (Score:2)
But an LLM is more of an information retrieval tool, so tasking it with clever algorithm design is asking the wrong tool the wrong question.
I'm going to fix some omissions and errors in the summary for you, and we can recompute.
It's not 1.5th percentile, it's the 98th percentile.
The "hundreds of points between grandmaster" (Top 0.33%) places it in the "Master" category.
Really, the picture is more complicated.
In some tests, it scored at the level of "International Grandmaster" (seriously, what the fuck are these names?), or the top 0.12%
While in some tests, merely "Specialist", or top 23%.
But the real insight here- that I think any high
Re: (Score:2)
It is 1.5 percentile, i.e. 1.5% of humans who took the test scored better than it; in other words, it scored better than 98.5% of humans who took the test. Thus the sneaky summary "suggests the hype about large language models beating elite human coders is premature."
https://en.wikipedia.org/wiki/... [wikipedia.org]
(sorry, other reply was to the wrong post)
Re: (Score:2)
To quote your link,
The score for a specified percentage (e.g., 90th) indicates a score below which (exclusive definition) or at or below which (inclusive definition) other scores in the distribution fall.
1.5th percentile would imply that only 1.5% of people scored below it, when in fact it is the exact opposite.
If 1.5% of scores are higher than yours, then you are in the 98th percentile, not the 1.5th. Top of the distribution, not the bottom.
The summary isn't sneaky -- it's just wrong.
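A quick sanity check in code, using the usual "at or below" definition of percentile rank (the scores here are made up):

    # Percentile rank: percentage of scores at or below a given score.
    def percentile_rank(scores, x):
        return 100.0 * sum(s <= x for s in scores) / len(scores)

    # Hypothetical field of 1,000 contestants, 15 of whom outscore you:
    scores = list(range(1000))           # distinct scores 0..999
    print(percentile_rank(scores, 984))  # 98.5 -> 98.5th percentile, near the top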
Re: (Score:2)
There's a meme going around where a home-schooling mom is bragging about her kid's results on an IQ test. Look, my kid is in the 98th percentile with an 80 IQ, see how wonderful home schooling is, and what a great teacher I am!
It's funny for a couple of reasons.
Re: (Score:2)
Bruce Schneier posted this today:
Where AI Provides Value [schneier.com]
If you’ve worried that AI might take your job, deprive you of your livelihood, or maybe even replace your role in society, it probably feels good to see the latest AI tools fail spectacularly. If AI recommends glue as a pizza topping, then you’re safe for another day.
But the fact remains that AI already has definite advantages over even the most skilled humans, and knowing where these advantages arise—and where they don’t—
Was this written by an LLM? (Score:2)
Unfortunately (Score:2)
The nuance and complexity of reality are very often a bit of a parade-rainer when it comes to the media's need to promote sensationalist headlines like "AI beats best humans at...!"
The Turing Sloth Test (Score:2)
If the bot can do the judge's job for them well, it gets an "A".
What a messy story and attached article (Score:2)
Re: What a messy story and attached article (Score:2)
What the fuck (Score:2)
AI doesn't have to be better (Score:2)
TFS's stated opinion presents a false dichotomy. AI doesn't have to be better than elite programmers. It only has to be better than MOST programmers or, alternatively, average programmers at approximately the same cost. Once it can do that, if it can't already, AI has the better ROI, if for no other reason than that AI won't quit its job.
Re: (Score:2)
That assumes that either:
1) you can reliably hire elite programmers and use AI for the rest
Or
2) a team with diverse levels of talent is worse than a team with all average talent
I have no opinion on the answer to those two questions, but it doesn't seem obvious to me that pure average talent would be better than a mixed-talent team.