Serving: vLLM, kernels, Mac server

Two inference optimizations tied to vLLM drew attention this week: DDTree, a speculative decoding method that verifies a tree of drafted tokens (claimed multi-x speedups), and PagedAttention, which reduces KV-cache fragmentation and improves memory efficiency. (x.com) (x.com) Hugging Face published a hub of precompiled GPU kernels to speed up models on common hardware, and oMLX, a Mac-optimized LLM server, surfaced in social discussion as a self-hosting alternative. (x.com) (x.com)

Large language model serving got a fresh round of speed work this week, with updates aimed at the two usual bottlenecks: token generation and memory. (docs.vllm.ai) Serving is the software layer that keeps a model loaded, batches requests together, and reuses past context so replies arrive faster and cheaper. vLLM, one of the main open-source serving stacks, says its design centers on high throughput, continuous batching, and tighter control of the attention key-value cache, the running memory a model builds as it writes. (docs.vllm.ai)

That key-value cache is the expensive part to keep around during long chats. In the 2023 PagedAttention paper behind vLLM, the authors said existing systems wasted memory through fragmentation and over-reservation, while their block-based approach targeted “near-zero waste” and reported 2x to 4x higher throughput than FasterTransformer and Orca in tests. (cs.princeton.edu)

The other bottleneck is decoding, the step where a model normally predicts one token at a time. vLLM’s documentation says speculative decoding speeds that up by letting a smaller drafter guess ahead while the larger target model verifies those guesses in parallel, with output intended to match standard decoding apart from hardware precision limits. (docs.vllm.ai)

A new paper posted April 14 by Liran Ringel and Yaniv Romano describes DDTree, short for Diffusion Draft Tree, as a way to verify multiple likely continuations from a diffusion-based drafter in one pass. The paper says DDTree builds a draft tree with a fixed node budget and checks it with a single target-model forward pass using an ancestor-only attention mask. (arxiv.org) That paper places DDTree on top of DFlash, a block-diffusion drafting method, and says the baseline problem is that vanilla DFlash verifies only one drafted path per round. The authors say DDTree expands that into a tree of candidate continuations so the verifier can accept longer stretches when the drafter is right. (arxiv.org)

At the same time, Hugging Face has been pushing a different piece of the stack: the low-level math code that actually runs on graphics processors. Its Kernels documentation says the Kernel Hub lets Python libraries and applications load compute kernels directly from the Hub, with packages designed to work across recent Python versions, multiple PyTorch build configurations, and different CUDA and C++ application binary interface combinations. (huggingface.co) That changes the usual workflow for performance tuning. Instead of asking each model library to compile custom operators from source on each machine, Hugging Face is packaging those operators so they can be discovered and loaded more like model artifacts. (huggingface.co)

A third strand of the week’s discussion came from Apple Silicon users looking for local alternatives to Linux-and-NVIDIA setups. The Mac app and GitHub repository for oMLX describe it as a macOS-native server for Apple Silicon with continuous batching, a menu-bar controller, and a two-tier cache that keeps hot key-value blocks in memory and colder ones on solid-state storage. (omlx.ai) (github.com) oMLX says it supports Apple Silicon machines running macOS 15 or later, recommends 64 gigabytes of memory or more for larger models, and exposes OpenAI-compatible and Anthropic-compatible endpoints for tools such as Cursor and Claude Code. Its site also says previously seen prefixes can be restored after restarts instead of being recomputed from scratch. (omlx.ai)

Put together, the week’s updates point to the same race inside open-source artificial intelligence infrastructure: faster draft-and-verify decoding, less wasted cache memory, and fewer install-time headaches between a model and the hardware running it. (docs.vllm.ai) (huggingface.co) (omlx.ai)
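
For readers who want a concrete picture of the ancestor-only attention mask mentioned in the DDTree coverage above, here is a minimal sketch in plain Python. It is an illustration of the general idea, not the paper's code: the function name `ancestor_only_mask`, the `parents` list layout, and the example tree are all assumptions made up for this sketch. Each drafted token is a tree node, and a node may attend only to itself and to the nodes on its path back to the root, so sibling branches never see each other during the single verification pass.

```python
from typing import List, Optional


def ancestor_only_mask(parents: List[Optional[int]]) -> List[List[bool]]:
    """Build mask[i][j] = True when draft node i may attend to draft node j.

    Hypothetical layout (an assumption of this sketch):
    parents[i] is the index of node i's parent in the draft tree, or None
    when node i hangs directly off the already-accepted prefix. Parents
    must be listed before their children.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i, parent in enumerate(parents):
        mask[i][i] = True  # every node attends to itself
        if parent is None:
            continue
        if not 0 <= parent < i:
            raise ValueError("each parent must appear before its child")
        # A node sees everything its parent sees: the parent and the
        # parent's own ancestors. Other branches stay masked out.
        for j in range(n):
            if mask[parent][j]:
                mask[i][j] = True
    return mask


# Example tree (hypothetical): node 0 is the first drafted token, nodes 1
# and 2 are two alternatives for the next position, node 3 continues
# branch 1, and node 4 continues branch 2.
EXAMPLE_PARENTS: List[Optional[int]] = [None, 0, 0, 1, 2]
EXAMPLE_MASK = ancestor_only_mask(EXAMPLE_PARENTS)
```

In this example, node 3 can attend to nodes 0, 1, and 3 but not to nodes 2 or 4, which belong to the competing branch. A real verifier would also let every draft node attend to the full accepted prefix; that part is omitted here to keep the sketch small.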

Shared from Scout - Be the smartest in the room.