Grok-4 Allegedly Beats GPT-5 in Tests

xAI's Grok-4 is reportedly topping GPT-5 and DeepSeek in math, science, and coding benchmarks while handling both text and images. The model offers five key advantages: real-time X/web access, tool use, truth-seeking, multimodal capabilities, and optimized reasoning. One analyst launched a Blackswan integration with Grok for analyzing intelligence feeds.

The performance claims for Grok-4 rest on specific, challenging benchmarks. On ARC-AGI-2, which assesses general reasoning, Grok-4 scored approximately 16%, surpassing GPT-5's 9.9%. It also led on the less difficult ARC-AGI-1, scoring about 68% to GPT-5's 65.7%. A more powerful version, Grok-4 Heavy, is the first model to reach 50.7% on the text-only section of "Humanity's Last Exam," a benchmark composed of PhD-level questions across various disciplines. This version also excels in competitive math, scoring 96.7% on HMMT 2025 and 61.9% on USAMO'25.

The model was trained using scaled reinforcement learning on xAI's "Colossus" supercomputer, a 200,000-GPU cluster. The process was designed to build the model's reasoning abilities from the ground up, rather than adding reinforcement learning as a later fine-tuning step. This approach enables Grok-4 to natively use tools such as a code interpreter and a web browser to tackle complex questions.

Grok-4 was released on July 10, 2025, and is accessible to SuperGrok and Premium+ subscribers on the X platform, as well as through an API. The API offers developers a 256,000-token context window and multimodal understanding of both text and images. An enhanced, more realistic voice mode has also been introduced, which can analyze scenes through a user's camera in real time.

OpenAI's GPT-5, anticipated for a summer 2025 release, is positioned as a unified system integrating memory, reasoning, and vision, rather than a single model. Leaked information suggests it could support a context window of up to one million tokens, a significant increase intended for analyzing large documents and datasets in one session.

The competitive landscape also includes DeepSeek's models. A September 2025 technical evaluation by the U.S. Center for AI Standards and Innovation (CAISI) found that the best U.S. models, including GPT-5, outperformed DeepSeek's V3.1 across most benchmarks, with the largest gaps in software engineering and cyber tasks. However, DeepSeek's models have shown strong performance in specific math benchmarks, occasionally surpassing OpenAI's models.

The Blackswan integration leverages its ELEMENT™ platform, a data fabric for enterprise AI applications. The system is designed to connect and analyze data from various internal and external sources, using knowledge graphs and machine learning to uncover patterns for financial compliance, risk management, and competitive intelligence.

