FinToolBench benchmarks LLM agents

FinToolBench is a newly released benchmark that tests LLM agents against 760 financial APIs using 295 queries, and it is pitched as a new standard for testing agent behavior on real finance tooling. The dataset can be used to evaluate agent safety, correctness, and API orchestration in quant workflows. (x.com)

The preprint was posted to arXiv on March 9, 2026 and lists Jiaxuan Lu and an 11-person team with affiliations at Shanghai AI Laboratory, Hunan University, Xiamen University, Tencent, UCAS, and Tongji University. (arxiv.org) The authors say the benchmark’s tool manifest, execution environment, and evaluation code will be open-sourced, and a companion GitHub repository is already published to host runnable components and evaluation scripts. (arxiv.org)

The runnable tool library is built from third-party APIs and public finance interfaces, notably RapidAPI marketplace endpoints and the AkShare Python data library, and those sources are normalized into a unified schema for consistent agent access. (github.com) The question bank was assembled by adapting existing finance evaluation resources (including FinanceBench and OpenFinData) into tasks that explicitly require either single-tool calls or multi-tool orchestration to complete. (github.com)

Evaluation goes beyond binary success: the framework scores timeliness, intent type, and regulatory-domain alignment, and it reports trace-level compliance mismatch metrics, identified in the paper as TMR, IMR, and DMR, to quantify specific failure modes. (arxiv.org) To offer a conservative production-style baseline, the study introduces FATR (Finance-Aware Tool Retrieval), which performs Top-K retrieval (K defaulting to 20), generates compact “Tool Cards” with attribute metadata, and runs a ReAct-style planner that injects per-query attribute constraints. (paperium.net / arxiv.org) The authors position FinToolBench as the first execution-grounded, auditable testbed for agentic financial workflows, explicitly designed to expose operational and compliance risks by running real API executions rather than toy simulations. (arxiv.org)
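To make the FATR description more concrete, here is a minimal Python sketch of Top-K retrieval over compact tool cards. It is an illustration under stated assumptions, not the authors' implementation: the `ToolCard` fields, the bag-of-words cosine scorer, the sample tools, and the choice to apply attribute constraints as a hard filter before ranking are all hypothetical. In the paper the constraints are injected into a ReAct-style planner, which this sketch does not model.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCard:
    """Compact tool summary; the attribute names here are assumptions."""
    name: str
    description: str
    timeliness: str  # e.g. "realtime" or "historical"
    intent: str      # e.g. "informational" or "transactional"
    domain: str      # e.g. "equities" or "fx"


def _tokens(text):
    # Lowercased whitespace tokens with counts (a stand-in for a real embedder).
    return Counter(text.lower().split())


def _cosine(a, b):
    # Cosine similarity between two token-count vectors.
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_top_k(query, cards, k=20, constraints=None):
    """Return up to k cards ranked by similarity to the query.

    constraints maps a ToolCard attribute name to a required value; cards
    that do not match every constraint are dropped before ranking.
    """
    constraints = constraints or {}
    eligible = [
        card for card in cards
        if all(getattr(card, key) == value for key, value in constraints.items())
    ]
    query_vec = _tokens(query)
    ranked = sorted(
        eligible,
        key=lambda card: _cosine(query_vec, _tokens(card.description)),
        reverse=True,
    )
    return ranked[:k]


# Made-up sample tools, for illustration only.
SAMPLE_CARDS = [
    ToolCard("stock_quote", "latest stock price quote for a ticker",
             "realtime", "informational", "equities"),
    ToolCard("fx_history", "historical foreign exchange rates",
             "historical", "informational", "fx"),
    ToolCard("stock_history", "historical stock price series for a ticker",
             "historical", "informational", "equities"),
]

if __name__ == "__main__":
    hits = retrieve_top_k("latest stock price", SAMPLE_CARDS, k=2,
                          constraints={"timeliness": "historical"})
    print([card.name for card in hits])
```

With K at its default of 20, a small tool set like the sample above is returned in full, so the ranking matters mainly when the library is large, as with the benchmark's 760 tools.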
