Coding models are getting specialized
Recent Android-focused benchmarks show different AI models winning on different coding tasks, with Google’s Gemini and OpenAI’s newest models trading top spots depending on the evaluation — a sign that model choice is becoming task-specific rather than one-size-fits-all. (9to5google.com) Observers note that some models now include very large context windows, which matters for lengthy codebases or design notes. (glenrhodes.com)
A coding model is a prediction engine that turns plain-English instructions into software, and developers have spent the last two years looking for one model that wins everywhere. Google’s April 9 Android ranking says that search is starting to break down, because OpenAI’s GPT‑5.4 and Google’s Gemini 3.1 Pro Preview now tie at 72.4% on Android Bench instead of one model running away with the field. (9to5google.com)

Android Bench is Google’s test for a very specific job: building Android apps with the tools Android developers actually use. Google says it checks work in Jetpack Compose for screen layouts, Coroutines and Flows for background tasks, Room for local data storage, and Hilt for wiring app components together. (9to5google.com) That matters because a model that looks brilliant on a general coding test can still stumble on a niche stack, the same way a great mechanic might be slower on an electric car than on a gasoline engine. In Google’s latest Android-specific table, GPT‑5.4 tied for first, GPT‑5.3 Codex landed at 67.7%, and Claude Opus 4.6 scored 66.6%. (9to5google.com)

OpenAI is also pushing a different angle: folding coding skill into a broader work model instead of keeping it in a separate coding-only box. In its March 5 launch post, OpenAI said GPT‑5.4 combines reasoning, coding, and agent workflows, inheriting the coding strengths of GPT‑5.3 Codex while adding native computer use and a 1-million-token context window. (openai.com)

A context window is the amount of text a model can keep in view at once, like the size of a desk you can spread papers across before you start stacking them on the floor. Google’s Gemini documentation says many Gemini models handle 1 million tokens or more, and says that is enough room for about 50,000 lines of code in one prompt. (ai.google.dev) That changes the kind of coding a model can do well.
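
The two figures quoted above (1 million tokens, about 50,000 lines of code) imply an average of roughly 20 tokens per line. As a rough sketch, that ratio can be used to guess whether a set of files would fit in one prompt. The ratio is an assumption derived from those two numbers rather than a tokenizer measurement, and the function names and line counts below are hypothetical.

```python
# Rough context-budget arithmetic from the figures quoted in the article:
# 1,000,000 tokens ~ 50,000 lines of code, i.e. about 20 tokens per line.
# ASSUMPTION: this average is only implied by those two numbers; real
# tokenizers vary by language, formatting, and line length.

CONTEXT_WINDOW_TOKENS = 1_000_000
LINES_PER_WINDOW = 50_000
TOKENS_PER_LINE = CONTEXT_WINDOW_TOKENS / LINES_PER_WINDOW  # 20.0


def estimate_tokens(line_count: int, tokens_per_line: float = TOKENS_PER_LINE) -> int:
    """Estimate the token cost of `line_count` lines of text."""
    if line_count < 0:
        raise ValueError("line_count must be non-negative")
    return round(line_count * tokens_per_line)


def fits_in_window(
    line_counts: dict[str, int], window: int = CONTEXT_WINDOW_TOKENS
) -> tuple[bool, int]:
    """Return (fits, total_estimated_tokens) for a set of named inputs."""
    total = sum(estimate_tokens(n) for n in line_counts.values())
    return total <= window, total


# Hypothetical session: these line counts are made up for illustration.
session = {
    "app_code": 30_000,
    "bug_report": 200,
    "design_notes": 1_500,
    "api_docs": 8_000,
}

fits, total = fits_in_window(session)
print(f"estimated tokens: {total:,} (fits: {fits})")
```

Under these assumptions the hypothetical session comes to about 794,000 estimated tokens, which leaves some headroom for the model's reply. A real check would count tokens with the provider's own tokenizer instead of a per-line average.
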
If a developer can paste the app code, the bug report, the design notes, and the API docs into one session, the model has fewer chances to miss a dependency hiding 200 files away. (ai.google.dev)

The split shows up even more clearly when you leave Android and look at broader code-editing tests. The Aider Polyglot leaderboard, which covers 225 coding exercises across six languages, currently shows OpenAI’s GPT‑5 at 88.0%, o3‑pro at 84.9%, and Gemini 2.5 Pro Preview at 83.1%, which is a different order from Google’s Android-only chart. (aider.chat)

So “best coding model” now means something narrower than it did a year ago. One model may be strongest at Android app plumbing, another at editing mixed-language repositories, and another at long sessions where 1 million tokens lets it keep an entire codebase in working memory. (9to5google.com) (aider.chat) (openai.com) (ai.google.dev)

That is why the leaderboard fight now looks less like one company pulling ahead and more like the software world splitting into weight classes. On April 9, Google’s own benchmark showed a tie at the top, and the more useful question for developers became “best for which stack” instead of “best overall.” (9to5google.com)
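
The “best for which stack” question can be sketched as a lookup keyed by benchmark instead of a single global ranking. The scores below are the ones quoted in this article and are a snapshot, not live leaderboard data. The function names are hypothetical, and model names are written with plain ASCII hyphens.

```python
# Scores as quoted in the article; a snapshot, not live leaderboard data.
SCORES = {
    "Android Bench": {
        "GPT-5.4": 72.4,
        "Gemini 3.1 Pro Preview": 72.4,
        "GPT-5.3 Codex": 67.7,
        "Claude Opus 4.6": 66.6,
    },
    "Aider Polyglot": {
        "GPT-5": 88.0,
        "o3-pro": 84.9,
        "Gemini 2.5 Pro Preview": 83.1,
    },
}


def leaders(results: dict[str, float]) -> list[str]:
    """Return every model tied for the top score, sorted by name."""
    top = max(results.values())
    return sorted(model for model, score in results.items() if score == top)


def best_for(task: str, scores: dict[str, dict[str, float]] = SCORES) -> list[str]:
    """Look up the leader(s) for one benchmark rather than one overall winner."""
    if task not in scores:
        raise KeyError(f"no benchmark recorded for {task!r}")
    return leaders(scores[task])


for benchmark in SCORES:
    print(f"{benchmark}: {', '.join(best_for(benchmark))}")
```

Returning a list instead of a single name keeps the Android Bench tie visible, and the two benchmarks produce different leaders from the same lookup.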