Serving and index speed gains

Two recent releases report major performance wins: Z.ai says GLM 5.1 pushed a vector‑DB search benchmark to about 6× the previous best throughput and reached a 3.6× geometric-mean speedup on GPU kernel optimization, while a vLLM recipe shows how to serve very large Mixture‑of‑Experts models like Qwen3.5‑397B on 8-GPU systems such as the DGX H200. These optimizations change the inference cost and latency calculus for large models and retrieval-heavy apps. (x.com) (x.com)

A large language model spends most of its time doing two chores: finding the right context and then generating the next token. The new work here speeds up both chores at once, which is why the gains look bigger than a normal model update. (z.ai) (docs.vllm.ai)

The first chore is retrieval. A vector database is a search engine for meaning, where a user’s question and millions of documents are turned into coordinates and the system looks for nearby points instead of exact keyword matches. (z.ai) That search has a built-in tradeoff. If you want higher recall, which means finding almost all of the truly relevant items, you usually pay with lower queries per second, which is the number of searches the system can answer each second. (z.ai)

Z.ai tested GLM 5.1 on VectorDBBench, a coding benchmark where the model edits Rust code, compiles it, profiles it, and keeps resubmitting faster versions. On the SIFT-1M dataset, the score is queries per second while keeping recall at or above 95 percent. (z.ai) The old best single-session result on that setup was 3,547 queries per second from Claude Opus 4.6. GLM 5.1 was allowed to keep iterating for more than 600 rounds and more than 6,000 tool calls, and it reached 21.5 thousand queries per second, about 6 times higher than the best 50-turn run. (z.ai)

The second chore is the model’s own math. A graphics processing unit (GPU) kernel is a tiny work routine on the chip, and improving it is like rearranging a kitchen so the cook takes fewer steps for the same meal. (z.ai) (lushbinary.com) On KernelBench Level 3, which covers 50 full-model optimization problems, GLM 5.1 posted a 3.6 times geometric-mean speedup. Z.ai says earlier models often found quick wins and then stalled, while GLM 5.1 kept improving across hundreds of rounds. (z.ai) (lushbinary.com)

The serving side of the story is about model shape. A Mixture-of-Experts model is like a company with hundreds of specialists where only a small team is called into each meeting, so the model can be huge without using every parameter on every token. (docs.vllm.ai) (github.com) Qwen 3.5’s largest release uses that design with 397 billion total parameters but only 17 billion active parameters per token. The vLLM recipe says it can be served with 8-way setups on NVIDIA H200 systems or AMD MI300X and MI355X systems, and it recommends the FP8 (8-bit floating-point) checkpoint for better efficiency. (docs.vllm.ai) (github.com)

vLLM is the software layer that keeps those giant models busy instead of leaving memory and compute idle between requests. Its Qwen 3.5 guide leans on expert parallelism, which spreads specialists across devices, and prefix caching, which reuses work when many requests start with the same prompt. (docs.vllm.ai 1) (docs.vllm.ai 2)

NVIDIA’s H200 matters here because each chip has 141 gigabytes of high-bandwidth memory, and a DGX H200 system packs 8 of those chips into one box. That much memory is what makes “serve a 397 billion parameter model” sound like a deployment recipe instead of a research stunt. (nvidia.com) (docs.nvidia.com)

Put together, the change is simple: retrieval-heavy apps get faster context lookup, and giant expert models get more practical serving paths. When both halves improve at once, the price of a useful answer drops not because the model got smaller, but because the wasted work around it got cut out. (z.ai) (docs.vllm.ai)
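
The headline numbers above can be sanity-checked with a few lines of arithmetic. The sketch below is a back-of-envelope estimate, not a deployment calculator: the function names are made up for illustration, and it assumes 1 byte per parameter for FP8 and 2 bytes for BF16, ignoring quantization scales, the KV cache, and runtime overhead. It also assumes every expert stays resident in GPU memory, which is the usual case for a Mixture-of-Experts model even though only a fraction of parameters is active per token.

```python
# Back-of-envelope checks on figures quoted in the article.
# Assumptions (not from the source): 1 byte/param for FP8, 2 bytes/param for
# BF16, decimal gigabytes, and all expert weights resident in GPU memory.

GB = 1e9  # decimal gigabytes, as used on GPU spec sheets


def speedup(new_qps: float, old_qps: float) -> float:
    """Ratio between two throughput figures."""
    return new_qps / old_qps


def weight_memory_gb(total_params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the model weights, in GB."""
    return total_params * bytes_per_param / GB


def headroom_gb(total_params: float, bytes_per_param: float,
                gpus: int = 8, hbm_per_gpu_gb: float = 141) -> float:
    """Memory left over after loading weights, before KV cache and overhead."""
    return gpus * hbm_per_gpu_gb - weight_memory_gb(total_params, bytes_per_param)


if __name__ == "__main__":
    # 21.5 thousand QPS vs. the earlier 3,547 QPS result.
    print(f"Vector search speedup: {speedup(21_500, 3_547):.2f}x")

    # 17B active out of 397B total parameters per token.
    print(f"Active share per token: {17 / 397:.1%}")

    # Weights and leftover memory on an 8 x 141 GB system.
    for name, nbytes in [("FP8", 1), ("BF16", 2)]:
        weights = weight_memory_gb(397e9, nbytes)
        spare = headroom_gb(397e9, nbytes)
        print(f"{name}: weights ~{weights:.0f} GB, headroom ~{spare:.0f} GB of 1128 GB")
```

Under these assumptions the FP8 weights take roughly 397 GB of the 1,128 GB available, while a 2-byte format would take roughly 794 GB and leave far less room for caching concurrent requests, which is consistent with the recipe's preference for the FP8 checkpoint.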
