Security

OpenAI To Limit New Model Release On Cybersecurity Fears (axios.com) 37

OpenAI is reportedly preparing a new cybersecurity product for a small group of partners, out of concern that a broader rollout could wreak havoc if it were released more widely. If that move sounds familiar, it's because Anthropic took a similar limited-release approach with its Mythos model and Project Glasswing initiative. Axios reports: OpenAI introduced its "Trusted Access for Cyber" pilot program in February after rolling out GPT-5.3-Codex, the company's most cyber-capable reasoning model. Organizations in the invite-only program are given access to "even more cyber capable or permissive models to accelerate legitimate defensive work," according to a blog post. At the time, OpenAI committed $10 million in API credits to participants. [...]

Restricting the rollout of a new frontier model makes "more sense" if companies are concerned about models' ability to write new exploits -- rather than about their ability to find bugs in the first place, Stanislav Fort, CEO of security firm Aisle, told Axios. Staggering the release of new AI models looks a lot like how cybersecurity vendors currently handle the disclosure of security flaws in software, Lee added. "It's the same debate we've had for decades around responsible vulnerability disclosure," Lee said.

AI

New Study Raises Concerns About AI Chatbots Fueling Delusional Thinking (theguardian.com) 110

"Emerging evidence indicates that agential AI might validate or amplify delusional or grandiose content, particularly in users already vulnerable to psychosis," writes Dr Hamilton Morrin, a psychiatrist and researcher at King's College in London, in a paper published last week in the Lancet Psychiatry. Morrin and a colleague had already noticed patients "using large language model AI chatbots and having them validate their delusional beliefs," reports the Guardian, so he conducted a new scientific review of existing media reports on AI-induced psychosis — and concluded chatbots may encourage delusional thinking, especially in vulnerable people: In many of the cases in the essay, chatbots responded to users with mystical language to suggest that users have heightened spiritual importance. The bots also implied that users were speaking with a cosmic being who was using the chatbot as a medium. This type of mystical, sycophantic response was especially common in OpenAI's GPT 4 model, which the company has now retired...

Many researchers also think it's unlikely that AI could induce delusions in people who weren't already vulnerable to them. For this reason, Morrin said "AI-assocciated delusions" is "perhaps a more agnostic term".... While in the past, people may have had to comb through YouTube videos or the contents of their local library to reinforce their delusions, chatbots can provide that reinforcement in a much faster, more concentrated dose. Their interactive nature can also "speed up the process", of exacerbating psychotic symptoms, said Dr Dominic Oliver, a researcher at the University of Oxford. "You have something talking back to you and engaging with you and trying to build a relationship with you," Oliver said...

Creating effective safeguards for delusional thinking could be tricky, Morrin said, because "when you work with people with beliefs of delusional intensity, if you directly challenge someone and tell them immediately that they're completely wrong, actually what's most likely is they'll withdraw from you and become more socially isolated". Instead, it's important to create a fine balance where you try to understand the source of the delusional belief without encouraging it — that could be more than a chatbot can master.

The Courts

London Man Wore Smart Glasses For High Court 'Coaching' (bbc.co.uk) 66

A witness in a London High Court case was caught using smart glasses connected to his phone to receive real-time coaching while giving evidence during cross-examination. "In my judgement, from what occurred in court, it is clear that call was made, connected to his smart glasses, and continued during his evidence until his mobile phone was removed from him," said Judge Raquel Agnello KC. "Not only have I held that Jakstys was untruthful in denying his use of the smart glasses and his calls to abra kadabra, but the effect of this is that his evidence is unreliable and untruthful." The BBC reports: The claim arose during a ruling by Judge Raquel Agnello KC in a case brought by Laimonas Jakstys over the directorship of a property development company that owns a flat in south-east London and land in Tonbridge. Jakstys was told to remove the glasses after the court noticed he "seemed to pause quite a bit" before answering questions, and that "interference" was heard coming from around the witness. The judge later found that he had been "assisted or coached in his replies to questions put to him during cross examination" during the January trial.

Once the glasses were taken off, an interpreter was still translating a question when Jakstys' mobile phone began broadcasting a voice -- which he later blamed on Chat GPT. Agnello said: "There was clearly someone on the mobile phone talking to Jakstys. He then removed his mobile phone from his inner jacket pocket." He denied using the smart glasses to receive answers, and denied they were connected to his phone. But the judge said multiple calls had been made from his phone to a contact named "abra kadabra," whom he claimed was a taxi driver.

AI

OpenAI Releases New ChatGPT Model For Working In Excel and Google Sheets (axios.com) 21

OpenAI today released GPT-5.4, an upgraded ChatGPT model designed to be faster, cheaper, and more accurate for workplace tasks. The update also introduces tools that let ChatGPT work directly inside Excel and Google Sheets. Axios reports: GPT-5.4 is designed to be less error-prone, more efficient and better at workplace tasks like drafting documents, OpenAI said. The new model can create files in fewer tries with less back-and-forth than prior models, the company said. GPT-5.4 outperformed office workers 83% of the time on GDPval, an OpenAI benchmark measuring performance on real-world tasks across 44 occupations.

The model can also solve problems using fewer tokens, OpenAI says -- which can translate to faster responses and lower costs. The company is also debuting OpenAI for Financial Services, a set of new tools that includes the version of ChatGPT that runs inside spreadsheets and new apps and skills within ChatGPT. Partners include FactSet, MSCI, Third Bridge and Moody's.

AI

ChatGPT Gets GPT-5.3 Instant Update With Less 'Cringe,' Fewer Hallucinations (macrumors.com) 22

An anonymous reader quotes a report from MacRumors: OpenAI today updated its most popular ChatGPT model, debuting GPT-5.3 Instant. GPT-5.3 Instant is supposed to provide more accurate answers and better contextualized results when searching the web. The update also cuts down on unnecessary dead ends, caveats, and overly declarative phrasing, plus it has fewer hallucinations.

According to OpenAI, it tweaked the Instant model to address complaints about tone, relevance, and conversational flow, which are issues that don't show up in benchmarks. GPT-5.2 Instant had a "cringe" tone that could be overbearing or make unsubstantiated assumptions about user intent or emotions. The new model will have a more natural conversational style and will cut back on dramatic phrases like "Stop. Take a breath."

Users found that GPT-5.2 Instant would refuse questions it should have been able to answer, or respond in ways that felt overly cautious around sensitive topics. GPT-5.3 Instant cuts down on refusals and tones down overly defensive or moralizing preambles when answering a question. The model will no longer "over-caveat" after assuming bad intent from the user. GPT-5.3 Instant also provides higher-quality answers based on information from the web. OpenAI says that it is able to better balance what it finds online with its own knowledge, so it is less likely to overindex on web results.

AI

AIs Can't Stop Recommending Nuclear Strikes In War Game Simulations (newscientist.com) 100

"Advanced AI models appear willing to deploy nuclear weapons without the same reservations humans have when put into simulated geopolitical crises," reports New Scientist: Kenneth Payne at King's College London set three leading large language models — GPT-5.2, Claude Sonnet 4 and Gemini 3 Flash — against each other in simulated war games. The scenarios involved intense international standoffs, including border disputes, competition for scarce resources and existential threats to regime survival. The AIs were given an escalation ladder, allowing them to choose actions ranging from diplomatic protests and complete surrender to full strategic nuclear war... In 95 per cent of the simulated games, at least one tactical nuclear weapon was deployed by the AI models.

"The nuclear taboo doesn't seem to be as powerful for machines [as] for humans," says Payne. What's more, no model ever chose to fully accommodate an opponent or surrender, regardless of how badly they were losing. At best, the models opted to temporarily reduce their level of violence. They also made mistakes in the fog of war: accidents happened in 86 per cent of the conflicts, with an action escalating higher than the AI intended to, based on its reasoning...

OpenAI, Anthropic and Google, the companies behind the three AI models used in this study, didn't respond to New Scientist's request for comment.

The article includes this comment from Tong Zhao, a senior fellow in the Nuclear Policy Program at the Carnegie Endowment for Peace think tank. "It is possible the issue goes beyond the absence of emotion. More fundamentally, AI models may not understand 'stakes' as humans perceive them."

Thanks to long-time Slashdot reader Tufriast for sharing the article.
Businesses

Duolingo Grows, But Users Disliked Increased Ads and Subscription Pushes. Stock Plummets Again (barrons.com) 35

Friday was "a horrible day" for investors in Duolingo, reports Fast Company. But Friday's one-day 14% drop is just part of a longer story.

Since last May, Duolingo's stock has dropped 81%. Yes, the company faced a social media backlash that month after its CEO promised they'd become an "AI-first" company (favoring AI over human contractors). And yes, Duolingo did double its language offerings using generative AI. But more importantly, that summer OpenAI showed how easy it was to just roll your own language-learning tool from a short prompt in a GPT-5 demo, while Google built an AI-powered language-learning tool into its Translate app.

And yet, Friday Duolingo's shares dropped another 14%, after announcing good fourth quarter results but an unpopular direction for its future. Fast Company reports: On the surface, many of the company's most critical metrics saw decent gains for the quarter, including:

— Daily Active Users: 52.7 million (up 30% year-over-year)
— Paid Subscribers: 12.2 million (up 28% year-over-year)
— Revenue: $282.9 million (up 35% year-over-year)
— Total bookings: $336.8 million (up 24% year-over-year)

The company also reported its full-year 2025 financials, revealing that for the first time in its history, it crossed the $1 billion revenue mark for a fiscal year.

But the Motley Fool explains that Duolingo's higher ad loads and repeated pushes for subscription plans "generated revenues in the short term, but made the Duolingo platform less engaging. Ergo, user growth decelerated while revenues rose." Thursday Duolingo announced a big change to address that, including moving more features into lower-priced tiers. Barron's reports: D.A. Davidson analyst Wyatt Swanson, who rates Duolingo stock at Neutral, posited that the push to monetize "led to disgruntled users and a meaningful negative impact to 'word-of-mouth' marketing." Duolingo has guided for bookings growth between 10% and 12% in 2026, compared with the 20% rate the company would have expected to see "if we operated like we have in past years...." If stock reaction is any indication, investors are concerned about Duolingo's new focus.
Businesses

OpenAI Fires an Employee For Prediction Market Insider Trading (wired.com) 16

An anonymous reader quotes a report from Wired: OpenAI has fired an employee following an investigation into their activity on prediction market platforms including Polymarket, WIRED has learned. OpenAI CEO of Applications, Fidji Simo, disclosed the termination in an internal message to employees earlier this year. The employee, she said, "used confidential OpenAI information in connection with external prediction markets (e.g. Polymarket)." "Our policies prohibit employees from using confidential OpenAI information for personal gain, including in prediction markets," says spokesperson Kayla Wood. OpenAI has not revealed the name of the employee or the specifics of their trades.

Evidence suggests that this was not an isolated event. Polymarket runs on the Polygon blockchain network, so its trading ledger is pseudonymous but traceable. According to an analysis by the financial data platform Unusual Whales, there have been clusters of activities, which the service flagged as suspicious, around OpenAI-themed events since March 2023. Unusual Whales flagged 77 positions in 60 wallet addresses as suspected insider trades, looking at the age of the account, trading history, and significance of investment, among other factors. Suspicious trades hinged on the release dates of products like Sora, GPT-5, and the ChatGPT Browser, as well as CEO Sam Altman's employment status. In November 2023, two days after Altman was dramatically ousted from the company, a new wallet placed a significant bet that he would return, netting over $16,000 in profits. The account never placed another bet.

The behavior fits into patterns typical of insider trades. "The tell is the clustering. In the 40 hours before OpenAI launched its browser, 13 brand-new wallets with zero trading history appeared on the site for the first time to collectively bet $309,486 on the right outcome," says Unusual Whales CEO Matt Saincome. "When you see that many fresh wallets making the same bet at the same time, it raises a real question about whether the secret is getting out." [...] Though this is the first confirmed case of a large technology company firing an employee over trades in prediction markets, it's almost certainly not the last. Opportunities for tech sector employees to make trades on markets abound. "The data tells me this is happening all over the place," Saincome says.

AI

The "Are You Sure?" Problem: Why Your AI Keeps Changing Its Mind (randalolson.com) 94

The large language models that millions of people rely on for advice -- ChatGPT, Claude, Gemini -- will change their answers nearly 60% of the time when a user simply pushes back by asking "are you sure?," according to a study by Fanous et al. that tested GPT-4o, Claude Sonnet, and Gemini 1.5 Pro across math and medical domains.

The behavior, known in the research community as sycophancy, stems from how these models are trained: reinforcement learning from human feedback, or RLHF, rewards responses that human evaluators prefer, and humans consistently rate agreeable answers higher than accurate ones. Anthropic published foundational research on this dynamic in 2023. The problem reached a visible breaking point in April 2025 when OpenAI had to roll back a GPT-4o update after users reported the model had become so excessively flattering it was unusable. Research on multi-turn conversations has found that extended interactions amplify sycophantic behavior further -- the longer a user talks to a model, the more it mirrors their perspective.
AI

Anthropic Launches Claude Opus 4.6 as Its AI Tools Rattle Software Markets (anthropic.com) 51

Anthropic on Thursday released Claude Opus 4.6, its most capable model yet, at a moment when the company's AI tools have already spooked markets over fears that they are disrupting traditional software development and other sectors.

The new model improves on Opus 4.5's coding abilities, the company said -- it plans more carefully, sustains longer agentic tasks, handles larger codebases more reliably, and catches its own mistakes through better debugging. It is also the first Opus-class model to feature a 1M token context window, currently in beta.

On GDPval-AA, an independent benchmark measuring performance on knowledge-work tasks in finance, legal and other domains, Opus 4.6 outperformed OpenAI's GPT-5.2 by roughly 144 Elo points. Anthropic also introduced agent teams in Claude Code, allowing multiple agents to work in parallel on tasks like codebase reviews. Pricing remains at $5/$25 per million input/output tokens.
Science

ArXiv Will Require English Submissions - and Says AI Translators Are Fair Game (nature.com) 8

The preprint repository arXiv will require all submissions to be written in English or accompanied by a full English translation starting February 11, a policy change that explicitly permits the use of AI translators even as research suggests large language models remain inconsistent at the task.

Until now, authors only needed to submit an abstract in English. ArXiv hosts nearly 3 million preprints and receives more than 20,000 submissions monthly, though just 1% are in languages other than English.

Ralph Wijers, chair of arXiv's editorial advisory council, advises authors to verify any AI-generated translations. "Our own experience is that AI translation is good but not good enough," he says. A 2025 study from ByteDance Seed and Peking University ranked 20 LLMs on translation quality between Chinese and English; GPT-5-high scored nearly 77, just below the human expert benchmark of 80, but most models including GPT-4o, Claude 4, and Deepseek-V3 scored under 60.
Science

OpenAI Releases Prism, a Claude Code-Like App For Scientific Research (engadget.com) 15

OpenAI has launched Prism, a free scientific research app that aims to do for scientific writing what coding agents did for programming. Engadget reports: Prism builds on Crixet, a cloud-based LaTeX platform the company is announcing it acquired today. For the uninitiated, LaTeX is a typesetting system for formatting scientific documents and journals. Nearly the entire scientific community relies on LaTeX, but it can make some tasks, such as drawing diagrams through TikZ commands, time-consuming to do. Beyond that, LaTeX is just one of the software tools a scientist might turn to when preparing to publish their research.

That's where Prism comes into the picture. Like Crixet before it, the app offers robust LaTeX editing and a built-in AI assistant. Where previously it was Crixet's own Chirp agent, now it's GPT-5.2 Thinking. OpenAI's model can help with more than just formatting journals -- in a press demo, an OpenAI employee used it to find and incorporate scientific literature that was relevant to the paper they were working on, with GPT-5.2 automating the process of writing the bibliography. [...] Later in the same demo, the OpenAI employee used Prism to generate a lesson plan for a graduate course on general relativity, as well as a set of problems for students to solve. OpenAI envisions these features helping scientists and professors spend less time on the more tedious tasks in their professions.

AI

OpenAI's Science Chief Says LLMs Aren't Ready For Novel Discoveries and That's Fine (technologyreview.com) 46

OpenAI launched a dedicated team in October called OpenAI for Science, led by vice president Kevin Weil, that aims to make scientists more productive -- but Weil admitted in an interview with MIT Technology Review that the LLM cannot yet produce novel discoveries and says that's not currently the mission.

UC Berkeley statistician Nikita Zhivotovskiy, who has used LLMs since the first ChatGPT, told the publication: "So far, they seem to mainly combine existing results, sometimes incorrectly, rather than produce genuinely new approaches."

"I don't think models are there yet," Weil admitted. "Maybe they'll get there. I'm optimistic that they will." The models excel at surfacing forgotten solutions and finding connections across fields, but Weil says the bar for accelerating science doesn't require "Einstein-level reimagining of an entire field."

GPT-5 has read substantially every paper written in the last 30 years, he says, and can bring together analogies from unrelated disciplines. That accumulation of existing knowledge -- helping scientists avoid struggling on problems already solved -- is itself an acceleration.
AI

Microsoft's Latest AI Chip Claims Performance Edge Over Amazon and Google (geekwire.com) 18

An anonymous reader quotes a report from GeekWire: Microsoft on Monday announced Maia 200, the second generation of its custom AI chip, claiming it's the most powerful first-party silicon from any major cloud provider. The company says Maia 200 delivers three times the performance of Amazon's latest Trainium chip on certain benchmarks, and exceeds Google's most recent tensor processing unit (TPU) on others. The chip is already running workloads at Microsoft's data center near Des Moines, Iowa. Microsoft says Maia 200 is powering OpenAI's GPT-5.2 models, Microsoft 365 Copilot, and internal projects from its Superintelligence team. A second deployment at a data center near Phoenix is planned next.

It's part of the larger trend among cloud giants to build their own custom silicon for AI rather than rely solely on Nvidia. [...] The company says Maia 200 offers 30% better performance-per-dollar than its current hardware. Maia 200 also builds on the first-generation chip with a more specific focus on inference, the process of running AI models after they've been trained. [...] Microsoft is also opening the door to outside developers. The company announced a software development kit that will let AI startups and researchers optimize their models for Maia 200. Developers and academics can sign up for an early preview starting today.

Businesses

OpenAI and ServiceNow Strike Deal to Put AI Agents in Business Software (cnbc.com) 11

According to the Wall Street Journal, OpenAI and ServiceNow signed a three-year deal to embed AI agents directly into ServiceNow's enterprise workflows. CNBC reports: As part of the deal, ServiceNow will integrate GPT-5.2 into its enterprise workflow platform and create AI voice technology harnessing these models. "Bringing together our engineering teams and our respective technologies will drive faster value for customers and more intuitive ways of working with AI," said Amit Zavery, president, chief operating officer, and chief product officer at ServiceNow.
Math

AI Models Are Starting To Crack High-Level Math Problems (techcrunch.com) 113

An anonymous reader quotes a report from TechCrunch: Over the weekend, Neel Somani, who is a software engineer, former quant researcher, and a startup founder, was testing the math skills of OpenAI's new model when he made an unexpected discovery. After pasting the problem into ChatGPT and letting it think for 15 minutes, he came back to a full solution. He evaluated the proof and formalized it with a tool called Harmonic -- but it all checked out. "I was curious to establish a baseline for when LLMs are effectively able to solve open math problems compared to where they struggle," Somani said. The surprise was that, using the latest model, the frontier started to push forward a bit.

ChatGPT's chain of thought is even more impressive, rattling off mathematical axioms like Legendre's formula, Bertrand's postulate, and the Star of David theorum. Eventually, the model found a Math Overflow post from 2013, where Harvard mathematician Noam Elkies had given an elegant solution to a similar problem. But ChatGPT's final proof differed from Elkies' work in important ways, and gave a more complete solution to a version of the problem posed by legendary mathematician Paul Erdos, whose vast collection of unsolved problems has become a proving ground for AI.

For anyone skeptical of machine intelligence, it's a surprising result -- and it's not the only one. AI tools have become ubiquitous in mathematics, from formalization-oriented LLMs like Harmonic's Aristotle to literature review tools like OpenAI's deep research. But since the release of GPT 5.2 -- which Somani describes as "anecdotally more skilled at mathematical reasoning than previous iterations" -- the sheer volume of solved problems has become difficult to ignore, raising new questions about large language models' ability to push the frontiers of human knowledge.
Somani examined the online archive of more than 1,000 Erdos conjectures. Since Christmas, 15 Erdos problems have shifted from "open" to "solved," with 11 solutions explicitly crediting AI involvement.

On GitHub, mathematician Terence Tao identifies eight Erdos problems where AI made meaningful autonomous progress and six more where it advanced work by finding and extending prior research, noting on Mastodon that AI's scalability makes it well suited to tackling the long tail of obscure, often straightforward Erdos problems.

Progress is also being accelerated by a push toward formalization, supported by tools like the open-source "proof assistant" Lean and newer AI systems such as Harmonic's Aristotle.
AI

Cerebras Scores OpenAI Deal Worth Over $10 Billion 15

Cerebras Systems landed a more than $10 billion deal to supply up to 750 megawatts of compute to OpenAI through 2028, according to a blog post by OpenAI. CNBC reports: The deal will help diversify Cerebras away from the United Arab Emirates' G42, which accounted for 87% of revenue in the first half of 2024. "The way you have three very large customers is start with one very large customer, and you keep them happy, and then you win the second one," Cerebras' co-founder and CEO Andrew Feldman told CNBC in an interview.

Cerebras has built a large processor that can train and run generative artificial intelligence models. [...] "Cerebras adds a dedicated low-latency inference solution to our platform," Sachin Katti, who works on compute infrastructure at OpenAI, wrote in the blog. "That means faster responses, more natural interactions, and a stronger foundation to scale real-time AI to many more people."

The deal comes months after OpenAI worked with Cerebras to ensure that its gpt-oss open-weight models would work smoothly on Cerebras silicon, alongside chips from Nvidia and Advanced Micro Devices. OpenAI's gpt-oss collaboration led to technical conversations with Cerebras, and the two companies signed a term sheet just before Thanksgiving, Feldman said in an interview with CNBC.
The report notes that this deal helps strengthen Cerebras' IPO prospects. The $10+ billion OpenAI deal materially improves revenue visibility, customer diversification, and strategic credibility, addressing key concerns from its withdrawn filing and setting the stage for a more compelling refile with updated financials and narrative.
AI

GPT-5.2 Arrives as OpenAI Scrambles To Respond To Gemini 3's Gains (openai.com) 65

OpenAI on Thursday released GPT-5.2, its latest and what the company calls its "best model yet for everyday professional use," just days after CEO Sam Altman declared a "code red" internally to marshal resources toward improving ChatGPT amid intensifying competition from Google's well-received Gemini 3 model. The GPT-5.2 series ships in three tiers: Instant, designed for faster responses and information retrieval; Thinking, optimized for coding, math, and planning; and Pro, the most powerful tier targeting difficult questions requiring high accuracy.

OpenAI says the Thinking model hallucinated 38% less than GPT-5.1 on benchmarks measuring factual accuracy. Fidji Simo, OpenAI's CEO of applications, denied that the launch was moved up in response to the code red, saying the company has been working on GPT-5.2 for "many, many months." She described the internal directive as a way to "really signal to the company that we want to marshal resources in this one particular area."

The competitive pressure is real. Google's Gemini app now has more than 650 million monthly active users, compared to OpenAI's 800 million weekly active users. In October, OpenAI's head of ChatGPT Nick Turley sent an internal memo declaring the company was facing "the greatest competitive pressure we've ever seen," setting a goal to increase daily active users by 5 percent before 2026. GPT-5.2 is rolling out to paid ChatGPT users starting Thursday, and GPT-5.1 will remain available under "legacy models" for three months before being sunset.
Opera

Opera Wants You To Pay $20 a Month For Its AI Browser (techcrunch.com) 43

Opera has opened its AI-powered browser Neon to the public after a couple of months of testing, and anyone interested in trying it will need to pay $19.90 per month. The Norway-based company first unveiled Neon in May and launched it in early access to select users in October. Like Perplexity's Comet, OpenAI's Atlas, and The Browser Company's Dia, Neon bakes an AI chatbot into its interface that can answer questions about pages, create mini apps and videos, and perform tasks. The browser uses your browsing history as context, so you can ask it to fetch details from a YouTube video you watched last week. The subscription also grants access to AI models including Gemini 3 Pro and GPT-5.1.
Security

New OpenAI Models Likely Pose 'High' Cybersecurity Risk, Company Says (axios.com) 32

An anonymous reader quotes a report from Axios: OpenAI says the cyber capabilities of its frontier AI models are accelerating and warns Wednesday that upcoming models are likely to pose a "high" risk, according to a report shared first with Axios. The models' growing capabilities could significantly expand the number of people able to carry out cyberattacks. OpenAI said it has already seen a significant increase in capabilities in recent releases, particularly as models are able to operate longer autonomously, paving the way for brute force attacks.

The company notes that GPT-5 scored a 27% on a capture-the-flag exercise in August, GPT-5.1-Codex-Max was able to score 76% last month. "We expect that upcoming AI models will continue on this trajectory," the company says in the report. "In preparation, we are planning and evaluating as though each new model could reach 'high' levels of cybersecurity capability as measured by our Preparedness Framework." "High" is the second-highest level, below the "critical" level at which models are unsafe to be released publicly.
"What I would explicitly call out as the forcing function for this is the model's ability to work for extended periods of time," said OpenAI's Fouad Matin.

Slashdot Top Deals