Gemini AI Solves Coding Problem That Stumped 139 Human Teams At ICPC World Finals (arstechnica.com)

An anonymous reader quotes a report from Ars Technica: Like the rest of its Big Tech cadre, Google has spent lavishly on developing generative AI models. Google's AI can clean up your text messages and summarize the web, but the company is constantly looking to prove that its generative AI has true intelligence. The International Collegiate Programming Contest (ICPC) helps make the point. Google says Gemini 2.5 participated in the 2025 ICPC World Finals, turning in a gold medal performance. According to Google, this marks "a significant step on our path toward artificial general intelligence."

Every year, thousands of college-level coders participate in the ICPC event, facing a dozen deviously complex coding and algorithmic puzzles over five grueling hours. This is the largest and longest-running competition of its type. To compete in the ICPC, Google connected Gemini 2.5 Deep Think to a remote online environment approved by the ICPC. The human competitors were given a head start of 10 minutes before Gemini began "thinking."

According to Google, it did not create a freshly trained model for the ICPC like it did for the similar International Mathematical Olympiad (IMO) earlier this year. The Gemini 2.5 AI that participated in the ICPC is the same general model that we see in other Gemini applications. However, it was "enhanced" to churn through thinking tokens for the five-hour duration of the competition in search of solutions. At the end of the time limit, Gemini managed to get correct answers for 10 of the 12 problems, which earned it a gold medal. Only four of 139 human teams managed the same feat. "The ICPC has always been about setting the highest standards in problem-solving," said ICPC director Bill Poucher. "Gemini successfully joining this arena, and achieving gold-level results, marks a key moment in defining the AI tools and academic standards needed for the next generation."
Gemini's solutions are available on GitHub.


Comments Filter:
  • It's terrible at creating big ugly applications. Again, I'd assert this guy is right. Where's the Shovelware? Why AI Coding Claims Don't Add Up [substack.com].
    • My conclusion from reading the headline is that most young coders don't know bits from bytes, and could not add 0110 to 0010 in their heads if their life depended on it.
        • :-) 420? *giggles*
        • by cusco ( 717999 )

          If he's referring to bits and bytes, he may be using binary (thus the leading 0s). Then the answer is 8.

          Used to have to do that to address readers in security systems and set serial addresses on daisy chained PTZ cameras.

          • The correct answer shouldn't ever be "8"... if you assume that the numbers are expressed as binary, you would express the answer as binary. My joke was based on the fact that the base wasn't specified (I know it was implied to be binary and hence my smiley). Leading 0s don't tell me what base it is in. Hexadecimal is often used with sets of 2/4/8 digits with leading 0s. If hex (or any base >2), the correct answer would be 0120. If base 2, of course it is 1000.

            • by cusco ( 717999 )

              If you tell a security device installer to use address 0010 they're going to set DIP switch 1 off, 2 on, 3 off, and 4 on, so at least in the security profession you always use decimal numbers in the instructions/documentation. If I need them to set switches to off, on, off and off I would have to tell them to set it to address 2. That's just the convention in my profession though, no idea what other professions might do.

              And you're right about the possibility of it being other bases, but I've never seen an

              • Octal was once common and probably still is used in some contexts. Binary seems to be rare except when directly interacting with hardware (the DIP switch example) or algorithms that twiddle bits. Hex is obviously used a lot (e.g. HTML colors, printing pointers, ...).

                There are still traces of base 20 which was once widely used until the Romans came along. "Score" in English (Lincoln used it with "four score"). French still has "quatre-vingt" and "trois-vingt" (this is much rarer, but still exists in some reg
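
                The base ambiguity in the thread above is easy to demonstrate in a quick sketch (Python purely for illustration; the helper name is made up):

                ```python
                # The sum of the digit strings "0110" and "0010" depends entirely
                # on the base you assume; leading zeros don't disambiguate it.
                def add_in_base(a: str, b: str, base: int) -> str:
                    total = int(a, base) + int(b, base)
                    # Convert the sum back to a digit string in the same base.
                    digits = "0123456789abcdef"
                    out = ""
                    while total:
                        out = digits[total % base] + out
                        total //= base
                    return out or "0"

                print(add_in_base("0110", "0010", 2))   # binary: 1000 (decimal 8)
                print(add_in_base("0110", "0010", 10))  # decimal: 120
                print(add_in_base("0110", "0010", 16))  # hex: 120
                ```

                As the thread notes, the answer is 1000 only if both operands and the result are read as binary; in any base above 2 the same digit strings sum to 120.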

      • So um, I'm an old fart, but one of the reasons I like owning a computer is that I don't need to add 0110 to 0010 in my head. I did that shit in college and a little bit in high school, but it's not something I ever used again, except once in a while to sanity check something.

        And it wasn't long until I didn't need to sanity check basic shit.

        One of the problems with work in general is when you go to get a job everyone wants you to do all these things that you never do in the job. It's
        • I hear you, I am semi-retired myself. Reflecting on my work, a lot of it was to move a pile of rocks from there, to here. I still can not make sense of it. There are houses for us to live in, and food for us to eat. Why the drama? I am currently watching "Severance" on AppleTV, it is hitting me hard, with my experiences. The show turned out not to be just entertainment, but it is "crunching" my mind at the moment with philosophical questions and making me yearn for answers.
      • by gweihir ( 88907 )

        My experience from teaching CS matches that conclusion.

        • At the ICPC there are usually several programmers who are quite good: quick, efficient, and effective. Harvey Mudd College in particular had a system that produced very good programmers, even though it's a school most people haven't heard of.

          There are books to teach about how to gain the skills [amazon.com], but I think an important part is to practice programming a LOT. If I were designing a CS curriculum today, I would have students write a program every day (every weekday).
    • by allo ( 1728082 )

      You know how complicated it is for a modern LLM to write a Tetris clone? The author could have tried. It's literally one of the problems used as a benchmark for new code LLMs. Not a very useful benchmark, since Tetris, Snake, and Flappy Bird are well known enough that writing "Write a Tetris clone" suffices without even explaining the game mechanics, but on the other hand you quickly see if a model sucks and can't even manage that.

      If the author wants to fill Steam with shovelware, he would need like three prompt

      • The point is that if it was so easy, why isn't it being done?
        • by allo ( 1728082 )

          The author can think about it, or maybe you have a good theory?

          I tested it with Qwen3-Coder-30B-A3B some time ago and it was able to create a fully functional Tetris clone with a single short prompt, "Create a tetris clone as SPA", including score, game over screen, preview of the next piece, pause function, and a short help section explaining how Tetris works.
          And now think about what the weird people, who put 4 3090 GPUs into their LLM PC and run something like DeepSeek, the large GLM model or other huge LLM

          • Did you read Mike's substack article I linked to earlier? It sounds like you disagree with his premise. If so, I'd say you're better off answering his question: Where is all the shovelware?
        • by allo ( 1728082 )

          Here is the generated example: https://pastebin.com/PM6Rb6VN [pastebin.com]
          The only line by me is 744, which is needed to prevent the page from scrolling when you press the down arrow, if the page content is too large for your screen. All other code is AI generated.

          After fixing the scroll bug myself, I tested just telling the model "The page scrolls when I press down," and it suggested the same fix. Still, in the first run it overlooked that problem. So you could have built this without knowing how HTML, CSS, or JavaScript

          • I've done similar tests with Tetris games and building a reminder app. The LLMs can sometimes generate code that compiles, but then fail later to figure out what it's doing and make major modifications. I've had better luck with things like Neovim because it keeps the code context a bit more isolated. As I said earlier, with small data analysis or algorithms it works pretty well; for maintaining 200 .C/.CPP files, Makefiles, build flags, and everything besides documentation, it's been pretty disappointing.
  • Yeah right (Score:4, Funny)

    by backslashdot ( 95548 ) on Wednesday September 17, 2025 @04:28PM (#65666800)

    Give it UI problems.

  • by SpinyNorman ( 33776 ) on Wednesday September 17, 2025 @04:34PM (#65666816)

    > However, it was "enhanced" to churn through thinking tokens for the five-hour duration of the competition in search of solutions.

    If you read the comments on the linked story, one is from a competitor from a prior year's competition who notes that this competition always has a "time sink" problem that smart humans will steer clear of unless they have solved everything else.

    Apparently it took Gemini 30 minutes to solve this one time-sink problem, "C". The article doesn't say what hardware Gemini was running on, but apparently the dollar cost of this 30 min run was high enough that they'd rather not say. Impressive perhaps, but I'm not sure that the correct takeaway is what a great programmer Gemini is (if so, why did it take 30 min?!), but rather that with brute-force search lots of time-consuming things can be achieved.

    • "...the dollar cost of this 30 min run was high enough that they'd rather not say."

      It used Bothans. Many died to bring it the information. They'd rather not say how many.

    • by gweihir ( 88907 )

      This is probably just a meaningless stunt, like so many done before to keep the hype running. A hard coding problem (but not a research problem) is one where somebody capable, experienced and well-educated takes a few days to think before coming up with a design. Obviously, easy coding problems can either be solved much faster by computers or not solved by them at all. For harder coding problems, the second case applies.

    • by allo ( 1728082 )

      Hardware gets better over time. But do neural networks have a limit? The question "Can it solve it AT ALL" is the first relevant question; the question "Can it solve it fast" comes afterward.

  • by TheMiddleRoad ( 1153113 ) on Wednesday September 17, 2025 @04:36PM (#65666818)

    The general model has been thoroughly trained on these types of problems. Then they tweaked it for the specific challenge. Then they ran it with tons of processing power, more than any normal person gets. And all of this was for very, very, very specific types of coding problems.

    https://worldfinals.icpc.globa... [worldfinals.icpc.global]

    It's not intelligence. It's processing.

    • > It's not intelligence. It's processing.

      It's like a souped-up search engine:

      It's very good at not only finding the answers to a query, but recognizing the question even if it is worded differently than it has been in the past, finding the existing answers, and presenting those answers even if it has to tweak, assemble, or rearrange them.

      What it cannot do is actually solve novel problems missing from its training set, any more than a search engine can find a match for a document that does not exist.

      • What it cannot do is actually solve novel problems missing from its training set, any more than a search engine can find a match for a document that does not exist.

        I'm not so sure about this. LLMs are a linguistic approach, and novel problems are described using existing words, phrases and concepts. The solution to a novel problem may be contained in the structure and patterns of the existing billions of sentences and lines of code humanity has produced, without needing formal reasoning.

        • by allo ( 1728082 )

          Most of programming is recombining your existing knowledge of programming patterns and algorithms to the solution you need. That's also why the AIs excel at writing everyday stuff and struggle a lot more with research level algorithms. If it is something where the human also doesn't know yet if it is possible, the LLM often isn't as helpful as for the things that "just need to be done". Writing an interactive website is something where one has to estimate time, but there is no doubt that it can be done and

      • by gweihir ( 88907 )

        Its like a souped up search engine;

        That is still my conclusion for LLMs in general and I have seen nothing that would strongly indicate otherwise. The real question is whether that is less than or on par with an average person. Keep in mind that only about 15% of all humans can fact-check competently ("independent thinkers") and that is clearly completely out of reach of LLMs. But much of what the rest of humanity can do might not be.

      • It can, which is the genius part (for those who created these algorithms, not for the software itself), piece together complex solutions from various bits it has been trained on. It cannot, however, deal with anything actually new.

  • The coding part of this isn't of any particular interest; what is interesting is that it solved a complex logic optimization problem: take a look [worldfinals.icpc.global]

    That said, this seems more like the kind of problem you would throw at mathematicians. While most real-world applications are unlikely to have neat and tidy solutions, optimization problems like this really do exist. Being able to get quick solutions to this kind of complex optimization problem would radically reduce the number of people needed to solve such a pro

    • Looks like all you need to do to solve it is build a simple model, run all 600 million combinations of input through it, and the answer is the one with the highest percentage.

      Perfect job for a computer with massive compute resources.

      The other contestants probably don't have the computing power to do that, so would have to solve it by figuring out a mathematical proof. Or skip it and solve the other problems instead.

      • The question there is then, how did the computer know which answer is correct? Did a human create a test dataset? Because the sample data in the problem description [worldfinals.icpc.global] is not enough to get the right answer. That is, you can write a program that perfectly solves the sample data but doesn't actually solve the problem.
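
        The brute-force approach described above can be sketched roughly like this (Python for illustration; the objective function here is a made-up placeholder, not the actual "flubber" model from Problem C, and the grid is tiny compared to the 600 million combinations mentioned):

        ```python
        # Exhaustive search over a discretized parameter space: evaluate a
        # scoring function at every grid point and keep the best candidate.
        from itertools import product

        def objective(x: float, y: float) -> float:
            # Placeholder objective to maximize; a real problem would model
            # the actual system (e.g. storage and drainage rates).
            return -(x - 0.3) ** 2 - (y - 0.7) ** 2

        steps = 100  # 101 x 101 grid; real brute force might be millions of points
        grid = [i / steps for i in range(steps + 1)]
        best = max(product(grid, grid), key=lambda p: objective(*p))
        print(best)  # → (0.3, 0.7)
        ```

        Whether this counts as "solving" the problem is exactly the question raised above: the search only finds the best answer relative to the model you evaluate, so a wrong model yields a confidently wrong answer.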
    • by gweihir ( 88907 )

      This is on the level of a typical advanced exercise question. I expect something close or identical can be found in the training data if you just steal enough from the Internet and the published literature.

    • I've looked, and I can't find that they've published a paper explaining their methods. We don't know how much input humans gave to help them solve the problem.

      If they solved it with nothing more than the problem description, that does seem impressive.
  • Computers have been able to beat 99% of the population at chess for quite a while.

    Beating a bunch of college-level coders at coding isn't any more a sign of "general intelligence" than was my Amiga 500 being able to checkmate me 35 years ago.

    • Sure it does: the coding challenge requires more general intelligence than chess, which has very simple rules. Is it general enough to replace all human workers? Nope.
    • by gweihir ( 88907 )

      Indeed. And computers beating humans at chess is not a sign of general intelligence either and no serious researcher claims that. The thing is that we find out that some relatively complex problems have other working approaches besides general intelligence. Incidentally, general intelligence alone does not get you far in chess either.

      But the other thing we are finding is that while supposedly typical humans have general intelligence, we also know that most people are not actually using it or are not using i

  • Headline: "Gemini AI Solves Coding Problem That Stumped 139 Human Teams"

    Story: "After 677 minutes, Gemini 2.5 Deep Think had 10 correct answers, securing a second-place finish among the university teams."

    Corrected headline: "Gemini AI Comes Second To a Human Team In Coding Problems"
    • by evanh ( 627108 )

      You should've led with the last sentence.

      • I don't make money from clicks.

        Well, apart from the hundred or so dollars per month all us regulars get, obviously.
    • by jythie ( 914043 )
      So it had 10 right answers... how many wrong, and could it tell the difference?
    • More than that: "Gemini managed to get correct answers for 10 of the 12 problems, which earned it a gold medal. Only four of 139 human teams managed the same feat." So it didn't beat 139 teams, it beat 135 of 139 teams.
      • No, read the story.
        The misleading headline is referring to one problem that Gemini solved but no human team did.

        So, *on that one problem* it beat 139 human teams.

        "[...] Google points to Problem C as especially impressive. This question, a multi-dimensional optimization problem revolving around fictitious "flubber" storage and drainage rates, stumped every human team. But not Gemini."
  • I wonder if the person driving it had a PhD to steer and guide and fix up code hallucinations as it "burned through tokens" for five hours to achieve this college level programming task.

    I also wonder just how many tokens were burned? That is a cost thing after all.

    And if not actively vibe-steered by an experienced coder supervising as it burnt tokens for 5 hours at an enhanced rate, I would be impressed only if I saw the complete preprompting, and probably how many kWh of power/litres of water it also c

  • An enhanced LLM that churned through tokens for five hours, versus a human brain that works on the same problems.

    Anyone here have an idea how the energy consumption of this LLM processor farm compares to the energy consumption of the next-place human contestant during the same time period? If Andy Weir re-wrote "The Martian" with an LLM-powered drone as the main character, how would the calculation of potatoes-versus-poop-versus-water change?

    To me, the obvious plot hole in "The Matrix" series was the absurdity of the notion that bio-batteries - human brains and bodies - could magically violate physics and provide more energy/processing power output to The Matrix than the inputs that would be required to keep the nutrient/stasis apparatus running. Billions of people kept alive in a coma, apparently without significant muscular atrophy or any damage to brain development, because when people take the red pill and escape The Matrix they are walking and talking a short while later. The infrastructure required to feed, chemically stimulate, neurologically stimulate, and maintain homeostasis for billions of meatsacks would be prohibitively more expensive than just burning those energy inputs to directly power the Matrix.

    Yes, if we put a billion monkeys on a billion typewriters (the training set and subsequent LLM token-producing functions) for a billion years (processor cycles in an AI farm), then they can produce Ye Compleat Works Of Shakespeare. Or, well... just give one monkey some porridge and water for a few decades and he will also produce Ye Compleat Works Of Shakespeare because he's, like, you know... Shakespeare.

    • *nerd alert*

      The original script had The Matrix running in parallel on all the human brains.

      Studio execs said that was too confusing and that they should be batteries.

      Also Neo is seen on the Nebuchadnezzar with hundreds of acupuncture-looking needles with wires to get his muscles working while he's in a coma.

      Writers should have been left alone (a story old as time).

    • by allo ( 1728082 )

      Including the pretraining? Given the average age of the human participants, the humans lose on that by a large margin.

  • "Google's AI can clean up your text messages and summarize the web" No, Google's AI fails miserably at summarizing the web, and their entire search system has gone to shit. I'm an American, but Yandex still provides better results than Bing or Google for search, with a couple of exceptions.
  • I had hoped the article would mention which of the solved problems were that hard and were solved nevertheless.

  • No, it's I-C-U-P... come on guys, get it right!

  • Why does the summary think it needs to give the entire background of a whole field? Just the news, not a full education.
