UniToolCall normalizes tool-use datasets
A new paper introduces UniToolCall, a unified framework that standardizes a pool of more than 22,000 tools to address fragmentation in AI tool-use datasets and supports both serial and parallel tool execution. The authors report that it lifted a fine-tuned 8B model to 93.0% single-turn strict precision, outperforming several commercial APIs in their tests. (x.com)
Large language models are getting better at using software tools, but the training data behind that skill is still fragmented across incompatible formats. UniToolCall is a new paper that tries to put those records into one schema and one benchmark. (arxiv.org)

In plain terms, tool use means a model does not just answer with text; it chooses a function, fills in arguments, runs it, reads the result, and may call another function after that. UniToolCall represents that loop in a single Query–Action–Observation–Answer format, or QAOA, across datasets and benchmarks. (arxiv.org)

The authors said existing tool-use work mixes different interaction styles, from one-shot calls to multi-turn exchanges, and often evaluates models on benchmarks that do not line up with the training data. Their framework standardizes toolset construction, data generation, and evaluation in one pipeline. (arxiv.org; github.com)

The dataset is large by tool-learning standards: the paper says it curates a pool of more than 22,000 tools and builds a hybrid corpus with more than 390,000 instances. That corpus combines 10 public datasets with synthetic trajectories designed to control structure, including serial chains and parallel calls. (arxiv.org; github.com)

Serial and parallel matter because real assistants often need both. A travel agent may need to check flights and hotels at the same time, then book only after comparing both results; UniToolCall explicitly models those branching and step-by-step patterns. (arxiv.org)

The paper also adds what it calls “Anchor Linkage,” a way to preserve references across turns in a conversation. In practice, that means a model can carry forward earlier details such as an order number, a date, or a name instead of treating each turn like a fresh request. (arxiv.org; github.com)

For evaluation, the authors converted seven public benchmarks into the same QAOA structure and scored models at the function-call, turn, and conversation levels. That setup is meant to reduce the common problem where one benchmark rewards argument accuracy while another focuses on end-to-end completion. (arxiv.org)

On results, the paper says fine-tuning Qwen3-8B on UniToolCall reached 93.0% single-turn strict precision in the “Hybrid-20” setting. The GitHub page says that beat Qwen3-32B by 20.3 points in that test and outperformed several commercial application programming interfaces in the authors’ comparisons. (github.com; arxiv.org)

That claim lands in a crowded measurement landscape. The Berkeley Function Calling Leaderboard, one of the public scoreboards for function-calling models, also tests real-world tool use but uses its own data and methodology, which is part of the fragmentation UniToolCall is trying to address. (gorilla.cs.berkeley.edu; arxiv.org)

The project is open-sourced on GitHub, and the paper appeared on arXiv on April 13, 2026. Whether its schema becomes a common standard will depend less on one headline score than on whether other labs adopt the same format for training and evaluation. (arxiv.org; github.com)
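For readers who want a concrete picture, here is a minimal Python sketch of what a Query, Action, Observation, Answer record with serial and parallel steps might look like, along with one plausible way to score function calls strictly. It is an illustration only, not the paper's schema or metric: the class names (`Action`, `Step`, `Record`), the tool names (`search_flights`, `search_hotels`, `book_trip`), and the definition of "strict" as an exact match on tool name and arguments are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Action:
    # One function call: the chosen tool and its filled-in arguments.
    tool: str
    arguments: dict[str, Any]


@dataclass
class Step:
    # Actions inside one step are independent and could run in parallel;
    # the steps of a record run one after another (serially).
    actions: list[Action]
    observations: list[Any] = field(default_factory=list)


@dataclass
class Record:
    # Hypothetical Query-Action-Observation-Answer container.
    query: str
    steps: list[Step]
    answer: str = ""


def strict_match(predicted: Action, expected: Action) -> bool:
    # Assumed meaning of "strict": same tool name and identical arguments.
    return predicted.tool == expected.tool and predicted.arguments == expected.arguments


def strict_precision(predicted: list[Action], expected: list[Action]) -> float:
    # Share of predicted calls that exactly match a not-yet-used expected call.
    if not predicted:
        return 0.0
    remaining = list(expected)
    hits = 0
    for call in predicted:
        for i, ref in enumerate(remaining):
            if strict_match(call, ref):
                hits += 1
                del remaining[i]
                break
    return hits / len(predicted)


# The travel example: two parallel searches, then one serial booking step.
record = Record(
    query="Find a flight and a hotel for Lisbon, then book the trip.",
    steps=[
        Step(actions=[
            Action("search_flights", {"destination": "LIS"}),
            Action("search_hotels", {"city": "Lisbon"}),
        ]),
        Step(actions=[Action("book_trip", {"flight_id": "F1", "hotel_id": "H1"})]),
    ],
)

expected_calls = [a for step in record.steps for a in step.actions]
predicted_calls = [
    Action("search_flights", {"destination": "LIS"}),
    Action("search_hotels", {"city": "Porto"}),  # wrong argument value
    Action("book_trip", {"flight_id": "F1", "hotel_id": "H1"}),
]
score = strict_precision(predicted_calls, expected_calls)
```

In this toy case two of the three predicted calls match exactly, so `score` is two thirds. Grouping actions into steps is just one simple way to express "parallel within a step, serial across steps"; the paper's actual format may encode those relationships differently.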