Build an LLM routing gateway
- Reuters’ May 13 report on Microsoft’s push for optionality beyond OpenAI points developers toward a practical hedge: build an LLM gateway that can switch providers.
- OpenAI, Anthropic and Google all publish production API docs and pricing, making latency, timeout and cost routing a concrete engineering task.
- Next, developers can wire one endpoint to OpenAI Responses, Anthropic Messages and Gemini generateContent, then add dashboards and failover tests.

Microsoft’s search for AI startup deals has sharpened a question many software teams already face: what happens when one model provider slows down, raises prices or changes terms. Reuters reported on May 13 that Microsoft is preparing for a future less dependent on OpenAI, including by weighing startup deals that could add AI talent and model capabilities.

That corporate backdrop points to a practical systems project for developers: build an LLM routing gateway that sits between an application and several model APIs. The gateway accepts one internal request format, sends traffic to OpenAI, Anthropic or Google, and switches providers when a timeout, rate limit or outage hits.

The appeal is operational, not theoretical. OpenAI’s API reference describes request IDs and processing-time headers that can be logged in production, Anthropic’s developer materials focus on production API use and batch processing, and Google publishes Gemini billing and pricing pages for production deployments. (money.usnews.com)

### What does an LLM routing gateway actually do?

A routing gateway gives an engineering team one stable endpoint even when the underlying model vendor changes. (developers.openai.com) An app can call `/generate` or `/chat`, while the gateway translates that request into OpenAI’s Responses API, Anthropic’s Messages API or Google’s Gemini generateContent format.

The core logic is simple. A request comes in with a task type, a latency budget and a cost target; the gateway chooses a provider, records the result, and retries on another provider if the first call times out or returns a retryable error. (developers.openai.com) OpenAI documents rate-limit and request-tracing headers, which makes that decision process measurable rather than guesswork. (developers.openai.com)

### Which features make the project more than a thin proxy?

Latency measurement is the first feature that turns a proxy into infrastructure.
OpenAI exposes `openai-processing-ms` and request IDs, while provider responses generally carry enough metadata to record model, status code, token usage and elapsed time for each call.

Cost measurement is the second. Anthropic’s current model pages list prices such as $5 per million input tokens and $25 per million output tokens for Claude Opus 4.7, and Google publishes free and paid Gemini API tiers alongside billing documentation. (developers.openai.com) Those published schedules let a gateway estimate request cost per provider and route cheaper jobs away from premium models.

Output logging is the third. A production gateway can store prompts, redacted inputs, outputs, latency, token counts, provider name and error type, then feed those records into a dashboard for uptime, p95 latency, spend and fallback rate by model. (developers.openai.com) OpenAI’s docs explicitly recommend logging request IDs for troubleshooting.

### How should failover work when one provider stalls?

Timeout-based failover is the clearest starting rule. (anthropic.com) A gateway can send a request to a primary provider, wait a fixed number of seconds, then retry on a secondary provider if the first call has not completed. The retry policy should be narrow enough to avoid duplicate side effects, especially when a model call triggers tools or downstream actions. (developers.openai.com)

Rate limits are another trigger. OpenAI publishes rate-limit headers, and Anthropic said last week it raised API rate limits for Claude Opus models, underscoring that provider capacity and quotas change over time. A router that reads those signals can shift traffic before an application fails outright.

### Why does this project fit the Microsoft story?

Reuters’ May 13 report said Microsoft is seeking optionality beyond OpenAI. (developers.openai.com) For developers, a routing gateway applies the same idea at application level: do not let one vendor’s latency, pricing or availability define the whole product.
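
To make the pricing point concrete, the per-request cost estimate described under cost measurement can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the `Price` type, the `PRICES` table, the model keys `premium` and `budget`, and both function names are hypothetical. The `premium` row reuses the $5 and $25 per-million-token figures quoted above; the `budget` row is an invented placeholder and should be replaced with numbers from a vendor's current pricing page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Price:
    """USD per one million tokens, split by direction."""
    input_per_mtok: float
    output_per_mtok: float


# Hypothetical price table keyed by an internal model label.
# "premium" uses the $5 / $25 figures quoted in the article;
# "budget" is a made-up placeholder, not a real vendor price.
PRICES = {
    "premium": Price(input_per_mtok=5.00, output_per_mtok=25.00),
    "budget": Price(input_per_mtok=0.50, output_per_mtok=2.00),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request from token counts."""
    price = PRICES[model]
    total = (input_tokens * price.input_per_mtok
             + output_tokens * price.output_per_mtok)
    return total / 1_000_000


def cheapest_within_budget(input_tokens: int,
                           expected_output_tokens: int,
                           max_cost_usd: float) -> Optional[str]:
    """Return the cheapest model whose estimate fits the budget, or None."""
    estimates = [
        (estimate_cost(model, input_tokens, expected_output_tokens), model)
        for model in PRICES
    ]
    affordable = [pair for pair in estimates if pair[0] <= max_cost_usd]
    return min(affordable)[1] if affordable else None
```

With these placeholder prices, a request of 2,000 input tokens and 500 output tokens is estimated at about $0.0225 on `premium` and $0.002 on `budget`, so a one-cent cost target would route it to `budget`. A real gateway would likely reconcile the estimate against the token usage reported in each provider response.
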
The project also shows skills that employers can inspect. (developers.openai.com) A working gateway demonstrates API normalization, retries, observability, cost controls and dashboarding with live provider data instead of a toy prompt wrapper. Official docs from OpenAI, Anthropic and Google give enough surface area to build and test that system now.

### What is the next concrete build step?

A first version can start with three adapters: OpenAI Responses, Anthropic Messages and Gemini generateContent. (money.usnews.com) The next layer is a small policy engine that picks a primary model by task and sends retries to a backup after a timeout or 429 response.

A second milestone is a dashboard. Teams can chart daily spend, median latency, timeout rate and fallback frequency by provider, then run scheduled failover tests against those adapters as model pricing and limits change on vendor documentation pages. (developers.openai.com) (anthropic.com)
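
One possible shape for that policy engine is sketched below. It is an illustration under stated assumptions rather than a definitive design: `Gateway`, `Attempt`, `ProviderTimeout`, `ProviderHTTPError` and the three stub adapters are all hypothetical names, and the stubs simulate provider behavior instead of calling any real API. In a real build, each adapter would wrap one vendor SDK or HTTP call and raise these exceptions when that call times out or returns an error status.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


class ProviderTimeout(Exception):
    """Raised by an adapter when the provider misses the deadline."""


class ProviderHTTPError(Exception):
    """Raised by an adapter when the provider returns an error status."""

    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


# An adapter takes (prompt, timeout_s) and returns the generated text.
Adapter = Callable[[str, float], str]

# Assumption: only 429 is treated as retryable, matching the rule above.
RETRYABLE_STATUS = {429}


@dataclass
class Attempt:
    provider: str
    outcome: str  # "ok", "timeout", or "http_<status>"
    elapsed_ms: float


@dataclass
class Gateway:
    adapters: Dict[str, Adapter]
    policy: Dict[str, List[str]]  # task type -> ordered provider names
    timeout_s: float = 10.0
    log: List[Attempt] = field(default_factory=list)

    def generate(self, task: str, prompt: str) -> Tuple[str, str]:
        """Try each provider for the task in order; return (provider, text)."""
        last_error = None
        for name in self.policy[task]:
            start = time.perf_counter()
            try:
                text = self.adapters[name](prompt, self.timeout_s)
                outcome = "ok"
            except ProviderTimeout as exc:
                outcome, last_error = "timeout", exc
            except ProviderHTTPError as exc:
                if exc.status not in RETRYABLE_STATUS:
                    raise  # non-retryable errors surface to the caller
                outcome, last_error = f"http_{exc.status}", exc
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.log.append(Attempt(name, outcome, elapsed_ms))
            if outcome == "ok":
                return name, text
        raise RuntimeError(f"all providers failed for {task!r}") from last_error


# Stub adapters that simulate a stall, a rate limit, and a healthy provider.
def stalled(prompt: str, timeout_s: float) -> str:
    raise ProviderTimeout("simulated stall")


def rate_limited(prompt: str, timeout_s: float) -> str:
    raise ProviderHTTPError(429)


def healthy(prompt: str, timeout_s: float) -> str:
    return f"echo: {prompt}"


gateway = Gateway(
    adapters={"primary": stalled, "secondary": rate_limited, "tertiary": healthy},
    policy={"summarize": ["primary", "secondary", "tertiary"]},
)
provider, text = gateway.generate("summarize", "hello")
outcomes = [attempt.outcome for attempt in gateway.log]
print(provider, text, outcomes)
```

Running the stubs, the request falls through the simulated timeout and the simulated 429 and is served by `tertiary`, while `gateway.log` holds one record per attempt. Those records are the raw material for the timeout-rate and fallback-frequency charts described above. Because a retry repeats the whole call, this sketch suits read-only generation; requests that trigger tools or downstream actions would need a narrower retry rule.
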