Anthropic posts 64.37% finance benchmark
- Anthropic said on May 5 it launched 10 finance-specific Claude agent templates and tied the push to a leading 64.37% score on Vals AI.
- That 64.37% came from Vals AI’s Finance Agent v1.1 benchmark, where Claude Opus 4.7 ranked above Claude Sonnet 4.6 and GPT 5.5.
- The bigger shift is product, not bragging rights: Anthropic is packaging models, data access, and workflows for regulated finance teams.

Finance AI is moving out of demo mode and into job-shaped software. That is the real story here. Anthropic did not just post a benchmark score; it used that score to launch 10 ready-made agents for banks, insurers, and finance teams on May 5. The pitch is simple: stop selling a raw model and start selling workflows that already know how to build a pitchbook, screen KYC files, or close the books. (anthropic.com)

### What actually launched?

Anthropic released 10 finance agent templates that run in Claude Cowork, Claude Code, or as cookbooks for Claude Managed Agents. The list spans front-office and back-office work: pitch building, meeting prep, earnings review, model building, market research, valuation review, general-ledger reconciliation, month-end [...] much closer to “buy a worker for this task” than “buy a model and figure it out yourself.” (anthropic.com)

### Where does the 64.37% come from?

The number comes from Vals AI’s Finance Agent v1.1 benchmark, updated May 4, 2026. On that leaderboard, Claude Opus 4.7 sits at 64.37% accuracy, ahead of Claude Sonnet 4.6 at 63.33%, Muse Spark at 60.59%, DeepSeek V4 at 60.39%, and GPT 5.5 at 59.96%. So yes, Anthropic is leading this specific test, but by a few points, not by some absurd gap. (vals.ai)

### What does that benchmark measure?

Basically, it is trying to simulate entry-level financial analyst work. Vals says the benchmark was built with Stanford researchers, a global systemically important bank, and industry experts, and uses 537 questions across tasks like retrieval, market research, and projections. That makes the score more useful than a generic [...] do finance-shaped work, not just answer trivia. (vals.ai)

### Why is 64.37% both good and not enough?

Because the benchmark is hard, and because finance is unforgiving. A top score in the mid-60s says these systems are getting meaningfully better at analyst tasks.
But it also says they are nowhere near “just let the bot run the bank.” Anthropic’s own product framing gives the game away: firms are supposed to adapt these [...] which means humans are still very much in the loop. (anthropic.com)

### Why are the Microsoft and data integrations a big deal?

This is the part people miss. Anthropic also shipped Microsoft 365 add-ins for Excel, PowerPoint, and Word, with Outlook coming soon, plus connectors and an MCP app setup so Claude can work with finance data where teams already live. In other words, the company is trying to remove the [...] in the spreadsheet and deck.” That is how software actually gets bought. (anthropic.com)

### So is this about models or about packaging?

Mostly packaging. The benchmark helps Anthropic win the argument that Claude is credible for finance. But the commercial move is the bundle: model, agent template, data access, office-suite integration, and governed workflow. That is a much stronger enterprise pitch than “our LLM is smart.” It also puts pressure on rivals that still sell mostly horizontal copilots. (anthropic.com)

### Why finance first?

Because finance has expensive repetitive work, structured documents, and clear ROI. Vals even frames the category as one of the most lucrative applications for agents. If a bank can shave hours off analyst prep, reconciliation, or compliance packaging, the economics show up fast. The catch is that regulated work needs audi[...] chatbots. (vals.ai)

### Bottom line?

The 64.37% headline matters, but mainly as proof that Anthropic can claim the top spot on a finance-specific leaderboard today. The bigger change is that Anthropic is turning benchmark bragging rights into a product stack for Wall Street, and that is the kind of move procurement teams can actually act on. (anthropic.com)
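
For readers who want to check the "few points, not an absurd gap" claim, here is a minimal Python sketch that computes each model's distance from the leader. The scores are the ones quoted in this article from the Vals AI leaderboard. The function name `margins_vs_leader` is just an illustrative helper and is not part of any Vals AI or Anthropic tooling.

```python
# Accuracy scores (percent) as quoted in this article from the
# Vals AI Finance Agent v1.1 leaderboard.
scores = {
    "Claude Opus 4.7": 64.37,
    "Claude Sonnet 4.6": 63.33,
    "Muse Spark": 60.59,
    "DeepSeek V4": 60.39,
    "GPT 5.5": 59.96,
}


def margins_vs_leader(scores):
    """Return the top model and each other model's gap to it, in percentage points.

    Hypothetical helper for illustration only.
    """
    leader = max(scores, key=scores.get)
    top = scores[leader]
    gaps = {
        model: round(top - score, 2)
        for model, score in scores.items()
        if model != leader
    }
    return leader, gaps


leader, gaps = margins_vs_leader(scores)
for model, gap in gaps.items():
    print(f"{leader} leads {model} by {gap:.2f} points")
```

On these numbers the lead over Claude Sonnet 4.6 is about one point and the lead over GPT 5.5 is under four and a half points, which matches the article's reading that the top spot is real but narrow.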