Three‑layer token budgets

A practitioner recommended a three‑layer token budgeting approach — per request, per agent, and per pipeline — to prevent surprise bills when running fleets of agents. (Ranjan Kumar) (x.com)

Tokens are the units model providers bill for, and one user request can fan out into many model calls inside an agent workflow. Teams running multiple agents are starting to cap usage at three levels instead of one: the request, the agent, and the full pipeline. (ranjankumar.in)

Ranjan Kumar laid out that three-layer model in a recent post, arguing that a per-request limit alone misses where spending actually accumulates in production. His framework sets one cap for a single call, another for an agent type over time, and a third for the end-to-end workflow that chains agents together. (ranjankumar.in)

The request layer is the simplest guardrail: stop one prompt from turning into a giant bill. OpenAI’s API docs say customers are billed on input and output tokens, and its current pricing page lists GPT-5.4 at $2.50 per 1 million input tokens and $15.00 per 1 million output tokens. (developers.openai.com, openai.com)

The agent layer covers repeated behavior that looks harmless one call at a time but expensive in aggregate. Microsoft said last week that teams want per-request audits, per-agent summaries, and cost trending because built-in dashboards do not always show which agent is consuming what, down to the token. (techcommunity.microsoft.com)

The pipeline layer is for the hidden multiplication effect inside agent systems. Tencent Cloud’s developer blog says a user question that looks like a 200-token prompt can expand into 3,000 to 8,000 tokens once system prompts, conversation history, retrieval results, and model output are added. (adp.tencentcloud.com)

That is the gap Kumar is trying to close: engineers often budget for a single model call, while agent systems re-plan, call tools, and pass context between steps. In his example, tightening only the request cap can still leave a team exposed if one agent type spikes or a multi-step pipeline runs too often. (ranjankumar.in)

Vendors are building around the same problem from different angles. OpenAI now documents separate token charges for tools and sessions in parts of its platform, while Google’s Gemini pricing page splits free and paid tiers and charges separately for features such as grounding with Google Search after monthly allowances are used. (developers.openai.com, ai.google.dev)

Other practitioners describe the same failure mode in plainer terms: one runaway agent can burn through a monthly budget in hours. Athenic’s guide recommends token-based quotas across multiple levels, and Waxell argues that “engineers think in requests” while agents run in loops. (getathenic.com, waxell.ai)

The practical takeaway is not to guess at one average cost per chat and multiply by traffic. Teams that want predictable bills are starting to meter every turn, total every agent, and cut off the whole workflow before the invoice does it for them. (ranjankumar.in, techcommunity.microsoft.com)
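
To make the three layers concrete, here is a minimal Python sketch of a budget check that runs before each model call. It is an illustration under stated assumptions, not Kumar's implementation: the names ThreeLayerBudget, BudgetExceeded, charge, and estimate_cost_usd are hypothetical, the limits are placeholders, and the per-agent cap is a simple running total rather than a time-windowed quota. The default prices are the GPT-5.4 figures quoted above and should be checked against the provider's current pricing page.

```python
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when a call would push usage past one of the three caps."""


@dataclass
class ThreeLayerBudget:
    # Hypothetical limits, in tokens; pick real values from your own usage data.
    per_request: int   # cap on a single model call
    per_agent: int     # cap on cumulative tokens for one agent type
    per_pipeline: int  # cap on cumulative tokens for the whole workflow
    agent_used: dict = field(default_factory=dict)
    pipeline_used: int = 0

    def charge(self, agent: str, tokens: int) -> None:
        """Record `tokens` for `agent`; raise before recording if any cap breaks."""
        if tokens > self.per_request:
            raise BudgetExceeded(f"request cap: {tokens} > {self.per_request}")

        agent_total = self.agent_used.get(agent, 0) + tokens
        if agent_total > self.per_agent:
            raise BudgetExceeded(
                f"agent cap for {agent}: {agent_total} > {self.per_agent}"
            )

        pipeline_total = self.pipeline_used + tokens
        if pipeline_total > self.per_pipeline:
            raise BudgetExceeded(
                f"pipeline cap: {pipeline_total} > {self.per_pipeline}"
            )

        # All three checks passed, so commit the usage.
        self.agent_used[agent] = agent_total
        self.pipeline_used = pipeline_total


def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float = 2.50,    # quoted GPT-5.4 input price
    output_price_per_million: float = 15.00,  # quoted GPT-5.4 output price
) -> float:
    """Convert token counts to dollars using per-million-token prices."""
    return (
        input_tokens / 1_000_000 * input_price_per_million
        + output_tokens / 1_000_000 * output_price_per_million
    )
```

In this sketch a caller would invoke charge("planner", tokens) with an estimated or reported token count for each call and treat BudgetExceeded as the signal to stop the workflow. Because the checks run before any usage is recorded, a rejected call leaves the running totals unchanged.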
