Ramp Labs’ token cut
Ramp Labs launched “Latent Briefing,” a way for AI agents to share a key-value cache directly, so that multi-agent systems use 31% fewer tokens with no reported loss in accuracy. The technique is being pitched as a cost-and-efficiency lever for scalable finance tech stacks (x.com).
Ramp Labs said on April 10 that its new “Latent Briefing” method let artificial intelligence agents share memory directly and cut token use by 31% without reducing accuracy. (threadreaderapp.com)
In multi-agent systems, one model often acts as an orchestrator that breaks work into steps and hands context to worker models. Ramp Labs said that context handoff is usually done with tokens, which raises cost and can bury the useful signal in extra text. (threadreaderapp.com)
A key-value cache is the model’s working memory: the internal record of what it has already processed while reading a prompt. Ramp Labs said Latent Briefing passes that memory in cache form instead of rewriting it as text for the next agent. (threadreaderapp.com)
Ramp Labs said standard alternatives each have tradeoffs. Its April 10 thread said large language model summaries can take 20 to 60 seconds and lose detail, retrieval-augmented generation can split related facts across chunks, and passing the full context can be expensive and hurt accuracy. (threadreaderapp.com)
The company said its method uses the worker model’s attention pattern (the parts of the prompt the model focuses on most) to keep the relevant pieces of the orchestrator’s memory and discard the rest. Ramp Labs said it adapted an “Attention Matching” cache-compaction framework for inference, the stage when a model is actually serving requests. (threadreaderapp.com)
Ramp Labs said those changes turned “320 sequential solves” into “2-3 batched ops” and cut compression time to a median 1.7 seconds, about 20 times faster than the original algorithm. In the same thread, the company said tests on LongBench v2 showed a 30% median token reduction and a 3 percentage point accuracy gain across document lengths and task difficulty levels. (threadreaderapp.com)
A third-party market brief published April 10 described the same research with a higher top-line number: up to 65% lower worker-model token use and a 49% median reduction for medium-length documents of 32,000 to 100,000 tokens. That brief said Ramp Labs used Claude Sonnet 4 as the orchestrator and Qwen3-14B as the worker model. (kucoin.com)
Ramp is pitching the work through Ramp Labs, the company’s research arm. The Ramp Labs site describes itself as “the home for AI research at Ramp,” tying the release to Ramp’s broader push to build artificial intelligence tools around finance workflows and software infrastructure. (labs.ramp.com)
The immediate test for Latent Briefing is whether Ramp publishes a fuller paper, code, or benchmark details beyond the April 10 thread. For now, the company’s public claim is simple: cheaper agent-to-agent context sharing without giving up accuracy. (threadreaderapp.com)
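For readers who want a concrete picture of what "keeping the parts of the cache a worker attends to" could mean, the toy sketch below may help. It is not Ramp Labs' implementation, which has not been published in code form: it swaps in a much simpler stand-in technique, plain top-k selection by average attention weight, and every name in it (`attention_scores`, `compact_cache`, the sample vectors and labels) is hypothetical. Real key-value caches hold per-layer, per-head tensors, not short Python lists.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_scores(queries, keys):
    """Average attention weight each cached key receives from the worker's queries."""
    dim = len(keys[0])
    totals = [0.0] * len(keys)
    for q in queries:
        # Scaled dot-product attention logits for one query against all keys.
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        for i, weight in enumerate(softmax(logits)):
            totals[i] += weight
    return [t / len(queries) for t in totals]


def compact_cache(keys, values, queries, keep):
    """Keep the `keep` cache entries with the highest attention, in original order."""
    scores = attention_scores(queries, keys)
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:keep]
    kept = sorted(top)  # restore positional order of the surviving entries
    return [keys[i] for i in kept], [values[i] for i in kept], kept


# Hypothetical orchestrator cache: four entries with 2-dimensional keys.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
values = ["invoice total", "weather note", "vendor name", "unrelated memo"]

# Hypothetical worker query vector standing in for the worker's task prompt.
queries = [[4.0, 1.0]]

kept_keys, kept_values, kept_idx = compact_cache(keys, values, queries, keep=2)
print(kept_idx, kept_values)
```

In this toy setup the worker's query lines up best with the first and third cache entries, so those two survive and the other two are dropped. Halving the cache here is an arbitrary choice for illustration and says nothing about the 31% figure in Ramp Labs' thread.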