Model milestones this week

Several frontier models posted big, specific results: GPT‑Rosalind ranked above the 95th percentile of human-expert scores on an RNA prediction task, Anthropic’s Claude Opus 4.7 posted large gains on a coding benchmark, and GPT‑5.4 reportedly produced a three‑page proof solving a 60‑year‑old math problem. The GPT‑Rosalind and Claude Opus 4.7 results were highlighted in recent social posts, and a separate post credited GPT‑5.4 with solving a longstanding Erdős problem. (x.com)

A week of model releases turned abstract claims about “reasoning” into narrower tests: one model was graded on RNA design, another on hard coding benchmarks, and a third was credited in public posts with a new math proof. (openai.com)

RNA is the cell’s working copy of genetic instructions, and sequence prediction asks whether a string of letters will fold or behave the way a researcher wants. OpenAI said on April 16 that GPT‑Rosalind’s best 10 submissions on an unpublished RNA sequence-to-function task ranked above the 95th percentile of 57 historical human-expert scores in an evaluation with Dyno Therapeutics. (openai.com)

OpenAI said GPT‑Rosalind is a life-sciences model aimed at biology, drug discovery, and translational medicine, and that access is limited to a trusted-access research preview rather than an open public release. The company said the model is optimized for workflows in chemistry, protein engineering, and genomics, where early research decisions can shape a drug program for years. (openai.com)

Anthropic’s April 16 release was narrower and easier to compare: Claude Opus 4.7 cleared 70% on CursorBench, up from 58% for Opus 4.6, according to Anthropic’s product page. Anthropic said the gains were concentrated on the hardest software-engineering tasks and that the model is now generally available through its own platform plus Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry. (anthropic.com)

That matters because many recent model launches have leaned on broad labels like “smarter” or “more capable.” These announcements instead pointed to specific domains (RNA prediction, benchmarked coding work, and a named Erdős problem) where outside researchers can try to check the claims. (openai.com) (anthropic.com)

The math claim is the least settled of the three. OpenAI’s March 5 launch post for GPT‑5.4 described it as the company’s most capable model for professional work, but it did not mention solving Erdős problem #1196; that claim appears in social posts and discussion threads, not in an OpenAI research paper or product note. (openai.com) (erdosproblems.com)

A proof in mathematics is different from a benchmark score or a product demo because other mathematicians have to check every step. The Erdős Problems forum now has an active thread on problem #1196 discussing a claimed proof and related follow-up work, which is closer to verification than a launch post but still short of a journal publication. (erdosproblems.com)

OpenAI and Anthropic also framed the releases differently. Anthropic sold Opus 4.7 as a generally available coding model with posted API pricing, while OpenAI positioned GPT‑Rosalind as a restricted research model and GPT‑5.4 as a broader professional system with up to 1 million tokens of context. (anthropic.com) (openai.com)

The common thread is that frontier labs are now attaching model launches to narrower, auditable tasks instead of relying only on leaderboard language. Whether those claims hold up will depend on replication in labs, on benchmarks, and, in the case of the Erdős result, on mathematicians reading the proof line by line. (openai.com) (anthropic.com) (erdosproblems.com)
