Infrastructure moves down‑stack
Recent industry commentary has emphasised that AI’s bottlenecks are increasingly about infrastructure (low‑latency caching, multi‑model routing and cost‑aware scheduling) rather than model architecture alone. The commentary calls out operational stacks such as vLLM for serving, Kafka for eventing and pgvector for vector storage as practical building blocks for production model systems. (x.com)
A year ago, most artificial intelligence talk was still about which model was smartest. In 2026, the harder question inside production teams is why a good model still feels slow, expensive, or unreliable once millions of requests hit it. (docs.vllm.ai)
A model server is the software layer that keeps a model loaded, accepts requests, and feeds work to graphics processors without wasting memory. The open-source project vLLM became a reference point here because it uses “PagedAttention” and continuous batching to raise throughput on the same hardware. (docs.vllm.ai, github.com)
Continuous batching means the server does not wait for a neat batch to fill up before starting work. It keeps slipping new requests into open slots, like an elevator taking passengers as space appears instead of waiting for one full crowd in the lobby. (docs.vllm.ai)
The expensive part of many chat requests is not always the new answer. It is often the repeated prompt prefix (the system prompt, the policy text, the tool instructions, and the first turns of a conversation), which forces the model to recompute the same internal state unless that state is cached. (docs.vllm.ai, arxiv.org)
That internal state is called the key-value cache, and it works like keeping a half-finished worksheet on your desk instead of redoing the first ten problems every time. Recent infrastructure writing has focused on cache-aware routing because a cache only helps if the next related request lands on the same machine that already holds it. (arxiv.org, llm-d.ai)
Once a company runs more than one model, another problem appears: routing. A cheap small model may handle a billing question, while a larger and slower model may be needed for code generation or legal drafting, so the system has to choose the right model for each request instead of sending everything to the most expensive one. (arxiv.org, gmicloud.ai) That is what people mean by cost-aware scheduling.
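The two routing ideas above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any vendor's router: the model names, prices, capability tiers, and the `pick_model` and `pick_replica` helpers are all hypothetical. Hashing a fixed-length prompt prefix is only a stand-in for real cache-aware routing.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str                 # hypothetical model identifier
    tier: int                 # hypothetical capability level; higher is more capable
    usd_per_1k_tokens: float  # hypothetical price, invented for illustration


# Hypothetical catalogue; none of these names or prices refer to real products.
CATALOGUE = [
    Model("small-chat", tier=1, usd_per_1k_tokens=0.0002),
    Model("mid-general", tier=2, usd_per_1k_tokens=0.002),
    Model("large-reasoning", tier=3, usd_per_1k_tokens=0.02),
]

# Hypothetical mapping from task type to the minimum tier that can handle it.
REQUIRED_TIER = {"billing": 1, "code_generation": 3, "legal_drafting": 3}


def pick_model(task: str) -> Model:
    """Return the cheapest model whose tier meets the task's requirement."""
    # Unknown task types fall back to the most capable tier.
    needed = REQUIRED_TIER.get(task, max(m.tier for m in CATALOGUE))
    eligible = [m for m in CATALOGUE if m.tier >= needed]
    return min(eligible, key=lambda m: m.usd_per_1k_tokens)


def pick_replica(prompt: str, replicas: int, prefix_chars: int = 256) -> int:
    """Send prompts that share their first `prefix_chars` characters to the same replica."""
    prefix = prompt[:prefix_chars].encode("utf-8")
    digest = hashlib.sha256(prefix).digest()
    return int.from_bytes(digest[:8], "big") % replicas


if __name__ == "__main__":
    system_prompt = "You are a billing assistant. " * 20  # long shared prefix
    print(pick_model("billing").name)
    print(pick_replica(system_prompt + "Where is my invoice?", replicas=4))
    print(pick_replica(system_prompt + "Cancel my plan.", replicas=4))
```

Two requests that share the long system prompt map to the same replica index, so the second one has a chance of reusing cached state. A real router would also need to account for load on each replica and for what each replica actually has cached.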
The scheduler is the traffic cop that decides which request runs where, on which hardware, with which model, and in what order, so a graphics processor does not sit idle while users wait in line. (gmicloud.ai, docs.vllm.ai)
The same shift is showing up in the data layer. Apache Kafka is popular because it turns clicks, messages, tool calls, and model outputs into event streams, ordered within each partition, that many services can read at once without tightly coupling every component to every other component. (kafka.apache.org)
Vector storage solves a different problem. The PostgreSQL extension pgvector lets teams store embeddings, lists of numbers that represent meaning, inside ordinary PostgreSQL databases, so semantic search can live next to user accounts, permissions, and transaction records instead of in a separate specialty system. (github.com, postgresql.org)
Put those pieces together and the center of gravity moves down the stack. The winning system is often not the one with the flashiest model, but the one that serves tokens efficiently, routes requests intelligently, streams events cleanly, and keeps retrieval close to the rest of the application data. (docs.vllm.ai, kafka.apache.org, github.com)
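As a footnote to the vector storage point above, here is a minimal pure-Python sketch of nearest-neighbour search by cosine distance, the comparison pgvector exposes through its `<=>` operator. The three-dimensional vectors, the document labels, and the `nearest` helper are invented for illustration. Real embeddings come from a model and have hundreds or thousands of dimensions, and a database would normally use an index rather than scanning every row.

```python
import math


def cosine_distance(a, b):
    """Return 1 minus the cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy "embeddings", invented for illustration only.
DOCS = {
    "refund policy": [0.9, 0.1, 0.0],
    "gpu pricing": [0.1, 0.9, 0.2],
    "password reset": [0.0, 0.2, 0.9],
}


def nearest(query, docs=DOCS, k=1):
    """Return the k document labels closest to the query, nearest first."""
    ranked = sorted(docs, key=lambda label: cosine_distance(query, docs[label]))
    return ranked[:k]


if __name__ == "__main__":
    print(nearest([0.8, 0.2, 0.1], k=2))
```

Keeping this lookup inside PostgreSQL means the same query can also filter on ordinary columns such as account or permission fields, which is the convenience the article describes.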