Benchmarks put GPT‑5.4 in a tie for first

An Android benchmarking roundup ranks OpenAI’s GPT‑5.4 level with Google’s Gemini 3.1 Pro Preview on Android app‑development tasks, with GPT‑5.3 Codex trailing. The test is narrow, but it measures model performance in applied developer workflows. For mobile‑first teams, applied benchmarks like this one are a more direct signal than abstract leaderboards because the tasks come from real development work. (letsdatascience.com)

Google’s own Android coding leaderboard just put OpenAI’s GPT‑5.4 in a tie with Google’s Gemini 3.1 Pro Preview, with both models scoring 72.4% on Android Bench in an April 9 update. GPT‑5.3 Codex came in lower at 67.7%, with Claude Opus 4.6 at 66.6% and GPT‑5.2 Codex at 62.5%. (developer.android.com) (letsdatascience.com)

Android Bench is not a general trivia test. Google built it to check whether a model can fix real Android development problems pulled from open-source projects, the kind of work that lands in pull requests instead of demo videos. (developer.android.com 1) (developer.android.com 2)

The benchmark uses 100 tasks selected from a pool of 38,989 pull requests, and each model is run 10 times before Google averages the results. Google also publishes a confidence interval, which is its way of saying the score comes with an error bar instead of pretending the number is perfectly exact. (developer.android.com 1) (developer.android.com 2)

The tasks are Android-specific in a very literal way. Google says the benchmark checks code generation with Jetpack Compose for user interfaces, Coroutines and Flows for background work, Room for local databases, and Hilt for dependency injection. (developer.android.com) (letsdatascience.com)

That is why this result looks different from broad leaderboards. A model can look brilliant on math, reasoning, or generic coding and still stumble when asked to write Android code that fits the platform’s preferred libraries and patterns. (developer.android.com 1) (developer.android.com 2)

The timing matters too. Google says the April refresh is the first Android Bench update to include GPT‑5.4 and GPT‑5.3 Codex, while older models on the table kept scores from a late-February run and the new OpenAI models were tested in mid-March. This is not a live ladder that updates every hour; it is a snapshot from specific test windows. (letsdatascience.com)

There is also a quiet twist in the ranking. GPT‑5.3 Codex has posted stronger results on other software benchmarks, including Terminal-Bench 2.0 and OSWorld-Verified, but it did not win here, which suggests Android Bench rewards platform fit as much as raw coding power. (llm-stats.com) (developer.android.com)

Google is not treating this as a one-off chart. The company said Android Bench is meant to help model builders improve Android performance, and it has already said future updates will add Gemma 4 and other open models. (developer.android.com)

For teams shipping Android apps, the practical read is simple: if your developers spend their day inside Jetpack Compose, Room, and Hilt, a benchmark tuned to those tools is a better buying signal than a giant all-purpose leaderboard. On this one, as of April 9, 2026, GPT‑5.4 and Gemini 3.1 Pro Preview are effectively neck and neck. (developer.android.com) (letsdatascience.com)
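For readers who want to see what "run 10 times, averaged, with a confidence interval" can look like in practice, here is a minimal Python sketch. It is an illustration under stated assumptions, not Google's published method: the per-run scores are made up, the function name `summarize_runs` is hypothetical, and the interval uses a standard two-sided 95% Student-t formula, which may differ from whatever Android Bench actually uses.

```python
import math
import statistics


def summarize_runs(pass_rates, t_critical=2.262):
    """Return (mean, half_width) for per-run pass rates given in percent.

    t_critical defaults to the two-sided 95% Student-t value for
    9 degrees of freedom, i.e. 10 runs. This choice of interval is an
    assumption; the article does not say how Google computes its own.
    """
    n = len(pass_rates)
    mean = statistics.mean(pass_rates)
    # Standard error of the mean, from the sample standard deviation.
    sem = statistics.stdev(pass_rates) / math.sqrt(n)
    return mean, t_critical * sem


# Hypothetical scores for one model across 10 runs of a 100-task benchmark.
# These are NOT real Android Bench results.
runs = [71.0, 73.0, 72.0, 74.0, 70.0, 72.0, 73.0, 71.0, 72.0, 72.0]

mean, half_width = summarize_runs(runs)
print(f"{mean:.1f}% +/- {half_width:.1f}")
```

The point of the error bar is that two models whose intervals overlap heavily, as an exact tie on the average would imply, should not be read as one being clearly better than the other.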

