vLLM‑Omni drops

The vLLM team released vLLM‑Omni, an open‑source omni‑modality inference framework aimed at optimizing multi‑modal model deployment, which makes it a clear trigger for benchmarking multi‑modal inference performance. It also creates a direct opportunity to run GB300/DGX + CUDA comparisons on real‑world omni‑modal workloads. (aitoolly.com)

The vLLM team publicly announced vLLM‑Omni on November 30, 2025, framing the project as an extension of vLLM to “true omni‑modality” across text, image, video and audio. (vllm.ai) The recent v0.16.0 release rebased Omni onto upstream vLLM v0.16.0 and explicitly expanded coverage for Qwen3‑Omni, Qwen3‑TTS, Bagel, MiMo‑Audio, GLM‑Image and multiple diffusion (DiT) families, while adding platform support labeled CUDA / ROCm / NPU / XPU. (newreleases.io) The v0.16.0 release notes list concrete performance gains: TTFP reductions of about 90% and real‑time factors (RTF) in the 0.22–0.45 range for Qwen3 variants, plus a MiMo‑Audio RTF of about 0.2, reported as roughly 11× faster than the baseline. (newreleases.io)

The repository includes a dedicated benchmarks folder with repeatable latency/throughput runners, per‑model READMEs and baseline comparisons against Hugging Face Transformers, so omni‑modal measurements can be compared on a like‑for‑like basis. (github.com) Packaging and deployability are clear priorities: vLLM‑Omni provides an OpenAI‑compatible API server, a Helm chart for Kubernetes deployment, pip‑installable packages on PyPI, and official Docker images noted in the project README. (github.com)

Architecturally, vLLM‑Omni introduces a fully disaggregated “OmniStage” pipeline abstraction designed to decompose complex any‑to‑any model graphs and dynamically allocate resources across stages, as described in the project paper and documentation. (arxiv.org) That design, together with the v0.16.0 removal of CUDA hardcoding and the explicit platform expansions, makes direct GB300/DGX + CUDA vs ROCm (or other accelerator) comparisons practical, and NVIDIA DGX Spark/DGX GB300 documentation already includes vLLM deployment playbooks for running inference on DGX systems. (newreleases.io)
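
For readers comparing runs across accelerators, the RTF and TTFP figures quoted from the release notes can be sanity‑checked with a little arithmetic. The sketch below is illustrative only: it assumes the usual definition of real‑time factor (processing time divided by the duration of the audio produced) and reads TTFP as time to first packet, which is an assumption rather than something the release notes quoted here spell out. The function names and all timings are hypothetical and are not taken from vLLM‑Omni or its benchmark runners.

```python
"""Sketch: interpreting RTF and TTFP figures. All timings are hypothetical."""


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of audio generated.

    Values below 1.0 mean the system runs faster than real time.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def implied_baseline_rtf(rtf: float, speedup: float) -> float:
    """Baseline RTF implied by a measured RTF and a claimed speedup factor."""
    return rtf * speedup


def percent_reduction(before: float, after: float) -> float:
    """Percentage drop from `before` to `after` (e.g. for TTFP, assumed
    here to mean time to first packet)."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    # Hypothetical run: 10 s of audio generated in 2 s of wall-clock time.
    rtf = real_time_factor(processing_seconds=2.0, audio_seconds=10.0)
    print(f"RTF: {rtf:.2f}")

    # An RTF near 0.2 that is ~11x faster implies a baseline RTF near 2.2,
    # i.e. a baseline slower than real time.
    print(f"Implied baseline RTF: {implied_baseline_rtf(rtf, 11):.2f}")

    # Hypothetical latencies: 1000 ms before, 100 ms after.
    print(f"TTFP reduction: {percent_reduction(1000.0, 100.0):.0f}%")
```

The same helpers can be applied to timings collected on each platform, so that a GB300/DGX + CUDA run and a ROCm run are compared on RTF and TTFP rather than on raw wall‑clock time alone.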
