Grok 4.3 tops agent leaderboards for tool-calling and instruction-following

- xAI has put Grok 4.3 into its public API, positioning it as the default flagship for developers building agents, tool use, and long-context workflows.
- The headline numbers are a 1,000,000-token window and $1.25/$2.50 per million input/output tokens, with higher pricing kicking in past 200K context.
- It matters because Grok’s agent benchmarks jumped sharply, but tool calls and long prompts can still make real production costs swell fast.

xAI’s new Grok 4.3 release is really an API story. This is not about a chatbot gimmick or a flashy demo. It is about whether developers should trust Grok as the model that plans, follows instructions, and actually uses tools without going off the rails. That is the gap xAI is trying to close, and Grok 4.3 is now the company’s default recommendation for “everything else” in its own docs. (docs.x.ai)

### What actually shipped?

Grok 4.3 is now listed in xAI’s developer docs as a live model with function calling, structured outputs, reasoning, and a 1,000,000-token context window. xAI’s model page says developers should use Grok 4.3 for chat and coding, and calls it the most intelligent and fastest model it has built. The public docs also show aliases like `grok-4.3-latest` and `grok-latest`, which matters b[…]t a side branch. (docs.x.ai)

### Why are people talking about “agent” performance?

Because Grok 4.3’s best story is not raw IQ points. It is task execution. Artificial Analysis says the biggest jump versus Grok 4.20 was on GDPval-AA, a benchmark for real-world agentic tasks, where Grok 4.3 rose to 1500 Elo from 1179, a 321-point jump. It also hit 98% on τ²-Bench Telecom, which is basically a test of agentic customer-support behavior, and held 81% on IFBench for instruction following. (artificialanalysis.ai)

### So did it really “top leaderboards”?

Yes, but with a catch. The “tops leaderboards” line is true for specific categories like agentic tool calling and instruction-following style benchmarks. It is not the same as saying Grok 4.3 is the overall best model at everything. Artificial Analysis puts it at 53 on its broader I[…]ly, Grok 4.3 looks strongest where reliable action-taking matters more than pure benchmark prestige. (artificialanalysis.ai)

### Why does the 1M-token window matter?

A million tokens is the part developers instantly notice. It means you can stuff in giant codebases, long contracts, support logs, or research bundles without chopping everything into tiny pieces first. But xAI’s own docs add the important fine print: requests beyond a 200K context window move into higher-context pricing. So the big window is real, but the cheap version of that window is not infinite. (docs.x.ai)

### What does it cost?

The base published rate is $1.25 per million input tokens and $2.50 per million output tokens, with cached input at $0.20 per million. That is a lot cheaper than Grok 4.20’s earlier pricing, and Artificial Analysis says the full benchmark suite cost fell about 20% even though Grok 4.3 used more output tokens. But xAI also bills tool-enabled requests on token usage plus tool invo[…]in English: the model may be cheaper per token, but agents can still get expensive by thinking longer, calling more tools, and dragging around huge contexts. (docs.x.ai)

### Why is xAI emphasizing tools so hard?

Because Grok is being pushed toward “do the work” behavior. The consumer-side release notes for the April 17, 2026 Grok 4.3 beta said Grok now has access to a computer-like environment where it can write code, run it, install what it needs, and produce files. That is a different product posture from simple chat. It is xAI betting that the next buying decision […]hich model can finish the task?” (grok.com)

### What’s the real takeaway?

Grok 4.3 looks like xAI’s most credible developer release yet. The upgrade is real, especially for tool use and instruction following. But the catch is simple: long context and autonomous tools are exactly the features that can quietly blow up your bill. If you are building agents, Grok 4.3 now belongs on the shortlist. If you are paying for production traffic, the benchmark win is only half the story. (docs.x.ai)
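
To make the billing point concrete, here is a minimal sketch of a cost estimator built only from the base rates quoted above ($1.25 input, $0.20 cached input, $2.50 output per million tokens). The function names are hypothetical, not part of any xAI SDK. Two assumptions are baked in: cached tokens are counted as a subset of input tokens, and the 200K threshold is compared against the prompt size. Because the higher-context rates and per-tool invocation fees are not quoted in this article, the sketch refuses to price long prompts and leaves tool fees out entirely.

```python
# Base rates quoted in this article, in USD per million tokens.
INPUT_USD_PER_M = 1.25
CACHED_INPUT_USD_PER_M = 0.20
OUTPUT_USD_PER_M = 2.50

# Assumption: the 200K threshold applies to the prompt (input) size.
BASE_TIER_MAX_INPUT_TOKENS = 200_000


def estimate_base_cost_usd(input_tokens, output_tokens, cached_input_tokens=0):
    """Token-only cost of one request at base rates (tool fees excluded)."""
    if min(input_tokens, output_tokens, cached_input_tokens) < 0:
        raise ValueError("token counts must be non-negative")
    if cached_input_tokens > input_tokens:
        raise ValueError("cached tokens cannot exceed input tokens")
    if input_tokens > BASE_TIER_MAX_INPUT_TOKENS:
        # Higher-context rates are not quoted here, so do not guess them.
        raise ValueError("prompt exceeds 200K; higher-context pricing applies")

    uncached_tokens = input_tokens - cached_input_tokens
    total_micro_usd = (
        uncached_tokens * INPUT_USD_PER_M
        + cached_input_tokens * CACHED_INPUT_USD_PER_M
        + output_tokens * OUTPUT_USD_PER_M
    )
    return total_micro_usd / 1_000_000


def estimate_run_cost_usd(steps):
    """Sum the cost of an agent run given (input, output, cached) per step."""
    return sum(estimate_base_cost_usd(i, o, c) for i, o, c in steps)
```

Under these assumptions, a single step with a 150,000-token prompt (100,000 of it cached) and 4,000 output tokens comes to about $0.09, while a 20-step agent run that re-sends the same context each time comes to about $1.85 before any tool fees. That is the compounding effect the article warns about.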


Shared from Scout - Be the smartest in the room.