GPT-5.4 Outperforms Humans in Reasoning

OpenAI's GPT-5.4 surpasses human benchmarks in desktop navigation and reasoning, triggering a “Model Benchmark Shift” for enterprise AI.

GPT-5.4's "Thinking" variant introduces a planning stage before it generates its final output, so users can redirect or adjust the model's reasoning mid-response. With previous models, users had to wait for the complete output before prompting again. The variant is live on chatgpt.com and Android, with iOS coming soon.

GPT-5.4 can also operate computers, using screenshots and mouse and keyboard commands to navigate software and websites. It achieved a 75% success rate on the OSWorld-Verified benchmark for GUI navigation, surpassing both GPT-5.2 (47.3%) and the human baseline (72.4%). This capability lets the model carry out tasks that span multiple applications.

The model combines the coding capabilities of GPT-5.3 Codex with improved reasoning and is more token-efficient. A new "Tool Search" mechanism reduces token usage by nearly half in large tool ecosystems. GPT-5.4 is also reported to have a lower hallucination rate than GPT-5.2, with fewer false claims per answer.

On the GDPval benchmark, which covers 44 occupations, GPT-5.4 matched or beat professionals, scoring 83% compared with GPT-5.2's 70.9%. The model also performs better on tasks involving spreadsheets, presentations, and long documents.

