GPT-5.4 Outperforms Humans in Reasoning
OpenAI's GPT-5.4 surpasses human benchmarks in desktop navigation and reasoning, triggering a “Model Benchmark Shift” for enterprise AI.
GPT-5.4's "Thinking" variant introduces a planning stage before generating its final output, allowing users to redirect or adjust the model's reasoning mid-response. This is a shift from previous models where users had to wait for the complete output before prompting again. It's now live on chatgpt.com and Android, with iOS coming soon.

GPT-5.4 can now operate computers, using screenshots and mouse/keyboard commands to navigate software and websites. It achieved a 75% success rate on the OSWorld-Verified benchmark for GUI navigation, surpassing both GPT-5.2 (47.3%) and the human baseline (72.4%). This capability allows the model to perform tasks across multiple applications.

The model combines the coding capabilities of GPT-5.3 Codex with improved reasoning and is more token-efficient. A new "Tool Search" mechanism reduces token usage by nearly half in large tool ecosystems. GPT-5.4 is also reported to have lower hallucination rates than GPT-5.2, with fewer false claims per answer.

GPT-5.4 matches or beats professionals across 44 occupations on the GDPval benchmark. It attained an 83% score, a significant jump from GPT-5.2's 70.9%. The model shows stronger performance on tasks involving spreadsheets, presentations, and long documents.
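
For readers who want the margins spelled out, the benchmark figures quoted above can be turned into percentage-point gaps. The short Python sketch below only restates the numbers from this article; the dictionary names and the `point_gaps` helper are illustrative assumptions, not part of any OpenAI tooling.

```python
# Scores exactly as quoted in the article, in percent.
OSWORLD_VERIFIED = {"GPT-5.4": 75.0, "GPT-5.2": 47.3, "human baseline": 72.4}
GDPVAL = {"GPT-5.4": 83.0, "GPT-5.2": 70.9}


def point_gaps(scores, reference="GPT-5.4"):
    """Return the percentage-point gap between `reference` and every other entry.

    `point_gaps` is a hypothetical helper written for this illustration.
    """
    ref = scores[reference]
    return {
        name: round(ref - value, 1)  # round to one decimal, matching the quoted precision
        for name, value in scores.items()
        if name != reference
    }


if __name__ == "__main__":
    for benchmark, scores in (("OSWorld-Verified", OSWORLD_VERIFIED), ("GDPval", GDPVAL)):
        for name, gap in point_gaps(scores).items():
            print(f"{benchmark}: GPT-5.4 leads {name} by {gap} points")
```

Read this way, the lead over the human baseline on OSWorld-Verified (2.6 points) is far narrower than the gain over GPT-5.2 on the same benchmark (27.7 points), while the GDPval gain over GPT-5.2 is 12.1 points.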