A 'winning' distributed-AI stack
A popular developer thread outlined a production-focused stack for distributed AI that prioritizes latency, caching and cost-aware routing. The stack lists vLLM/TensorRT-LLM for inference, Temporal or Ray Serve for orchestration, Kafka/Redpanda for messaging, Postgres with pgvector for vectors, Redis for caching, and FastAPI or Go for services. (x.com)
A distributed artificial intelligence stack is a set of parts that splits one model request across serving, routing, storage and caching systems, instead of running everything in one app. A developer thread that circulated this week argued the “winning” production setup is the one that cuts latency first and routes work by cost. (x.com)

At the front of that stack sits the inference engine, the software that turns prompts into tokens on graphics processors. vLLM describes itself as a fast serving engine for large language model inference, and NVIDIA says TensorRT-LLM builds optimized engines for efficient inference on NVIDIA graphics processors. (github.com) (docs.nvidia.com)

The caching piece matters because large language models repeatedly recompute the same opening text unless a system saves that work. vLLM’s prefix caching stores key-value cache blocks from earlier requests so later requests with the same prefix can skip repeated prompt computation. (docs.vllm.ai)

The orchestration layer is the traffic controller that decides what runs, in what order, and what gets retried after failures. Temporal says its workflows provide durable execution for long-running application logic, while Ray Serve says it is built for scalable production serving of large language models. (docs.temporal.io) (docs.ray.io)

The messaging layer moves events between services without forcing every component to talk directly to every other component. Apache Kafka calls this event streaming, and Redpanda says it is compatible with the Kafka application programming interface while positioning itself as a lighter, faster streaming platform. (kafka.apache.org) (docs.redpanda.com)

The storage choice in the thread pairs ordinary application data with vector search, the technique that lets systems find semantically similar text instead of exact keyword matches. pgvector adds vector similarity search to PostgreSQL, letting teams store embeddings alongside the rest of their relational data. (github.com) (pgxn.org)

Redis fills the short-term memory role in this pattern. Redis says its in-memory store is used as a cache, vector database, streaming engine and message broker, which is why teams often use it to hold hot prompts, session state and rate-limit counters close to the model. (redis.io)

The service layer is the part users and internal apps actually call. FastAPI says it is a high-performance Python framework for application programming interfaces, while Go’s documentation highlights concurrency and efficient networked services, two reasons teams often put either one in front of model backends. (fastapi.tiangolo.com) (go.dev)

The thread’s emphasis on latency and cost tracks the direction of current serving work. Red Hat’s guide to the llm-d distributed inference stack says modern inference now depends on adaptive computation and intelligent caching, and llm-d’s scheduling guide points to load-aware and prefix-cache-aware routing to reduce tail latency and raise throughput. (developers.redhat.com) (llm-d.ai)

That is why the stack resonated: it did not pitch one database or one framework as a silver bullet. It mapped a production pattern already visible in the official documentation of the tools it named: fast inference, durable orchestration, event streaming, vector search and aggressive caching, all tuned around response time and spend. (docs.nvidia.com) (docs.temporal.io) (kafka.apache.org) (github.com) (redis.io)
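
The idea of checking a cache first and then routing by cost can be illustrated with a minimal Python sketch. It is not taken from the thread or from any of the tools named above. The `Backend` and `CostAwareRouter` names, the prices and the latency figures are all hypothetical, and a plain dictionary stands in for a cache such as Redis.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Backend:
    """Hypothetical description of one model backend (illustrative only)."""
    name: str
    cost_per_1k_tokens: float        # assumed price, arbitrary units
    p95_latency_ms: float            # assumed measured tail latency
    generate: Callable[[str], str]   # stand-in for a real inference call


class CostAwareRouter:
    """Serve from cache when possible, otherwise pick the cheapest
    backend whose tail latency fits the caller's budget."""

    def __init__(self, backends: List[Backend]) -> None:
        if not backends:
            raise ValueError("at least one backend is required")
        self.backends = backends
        self.cache: Dict[str, str] = {}  # in-process stand-in for a shared cache
        self.cache_hits = 0

    def pick(self, latency_budget_ms: float) -> Backend:
        eligible = [b for b in self.backends
                    if b.p95_latency_ms <= latency_budget_ms]
        if eligible:
            # Cost-aware step: cheapest backend that still meets the budget.
            return min(eligible, key=lambda b: b.cost_per_1k_tokens)
        # Nothing meets the budget, so fall back to the fastest backend.
        return min(self.backends, key=lambda b: b.p95_latency_ms)

    def complete(self, prompt: str, latency_budget_ms: float) -> str:
        key = prompt.strip()
        if key in self.cache:
            # Latency-first step: a cache hit skips inference entirely.
            self.cache_hits += 1
            return self.cache[key]
        answer = self.pick(latency_budget_ms).generate(prompt)
        self.cache[key] = answer
        return answer


# Illustrative backends with made-up numbers.
cheap_batch = Backend("cheap-batch", 0.2, 900.0, lambda p: f"[cheap-batch] {p}")
fast_gpu = Backend("fast-gpu", 2.0, 120.0, lambda p: f"[fast-gpu] {p}")
router = CostAwareRouter([cheap_batch, fast_gpu])

print(router.pick(1000.0).name)  # relaxed budget
print(router.pick(200.0).name)   # tight budget
print(router.complete("Summarize the report", 1000.0))
print(router.complete("Summarize the report", 200.0))  # served from cache
```

With a relaxed budget the sketch prefers the cheaper backend, and with a tight budget it pays for the faster one. A production router would likely also weigh live load and prefix-cache locality, as the llm-d scheduling guide describes, and would need cache expiry that this sketch omits.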