Enterprises chasing sub‑second p95 on 70B models

Social media reports this week say teams are moving to their own GPU clusters to hit sub‑second p95 latencies on 70B models, using techniques such as speculative decoding, which is one reason customers want dedicated infrastructure. The trend suggests that latency SLAs are pushing buyers toward on‑prem or colocated racks rather than generic cloud APIs. (x.com)

NVIDIA’s TensorRT‑LLM added speculative decoding support, and NVIDIA's technical blog reported up to a 3.6× increase in token throughput on NVIDIA GPUs (resources.nvidia.com). AMD’s ROCm docs and vLLM examples show speculative decoding yielding roughly 2–2.3× speedups for Llama‑3‑class 70B models when paired with a ~1B draft model on MI300X hardware (rocm.docs.amd.com). AWS published a tutorial demonstrating deployment of Llama‑3.3 70B with speculative‑decoding configurations on Trn2 instances, a sign that cloud providers now ship prescriptive guides for the pattern (awsdocs-neuron.readthedocs-hosted.com).

Together announced General Availability of Instant Clusters, which let teams spin up dedicated, tightly‑networked GPU clusters for inference without rearchitecting their stacks, positioning managed private clusters as an SLA‑oriented alternative to public APIs (together.ai). Vendors of dedicated infrastructure explicitly market “private AI factories” and bespoke colocated racks as a way to control latency variance and pricing exposure; Swarm Systems, for example, sells on‑prem and colocation GPU clusters for enterprise workloads (swarmsystems.ai).

Community projects are pushing the same optimizations into practice. The prima.cpp paper documents techniques and performance targets for 30–70B models on heterogeneous clusters, including results of ~2 tokens/sec for some 70B configurations (arxiv.org), while experimental repos like llama_duo prototype asynchronous and speculative setups across devices (github.com). Independent benchmarks comparing multi‑GPU systems (8×H100, 8×H200, 8×B200) show measurable latency and throughput gaps between accelerator stacks, which helps explain why engineering teams validate colocated racks and specific GPUs when negotiating tight latency SLAs (artificialanalysis.ai).
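For a rough sense of where speculative-decoding speedups like those cited above come from, the sketch below uses the standard idealized model of the technique: a small draft model proposes k tokens, and the large target model verifies them in a single pass. The function names, the acceptance rate, and the cost ratio are illustrative assumptions, not figures from any of the vendor reports, and the estimate ignores batching, memory, and scheduling overheads.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model pass.

    alpha: assumed probability that a drafted token is accepted (0..1),
           treated as independent per token (a simplification).
    k:     number of tokens the draft model proposes per step.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        # Every drafted token is accepted, plus one token from the target.
        return float(k + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, cost_ratio: float) -> float:
    """Idealized wall-clock speedup over plain autoregressive decoding.

    cost_ratio: assumed cost of one draft-model pass relative to one
                target-model pass (for example, roughly 1/70 if cost
                scaled with parameter count for a ~1B draft and a 70B
                target; real ratios are hardware-dependent).
    """
    if cost_ratio < 0.0:
        raise ValueError("cost_ratio must be non-negative")
    # Each step costs k draft passes plus one target pass.
    return expected_tokens_per_step(alpha, k) / (k * cost_ratio + 1.0)


if __name__ == "__main__":
    # Hypothetical inputs chosen only for illustration.
    for alpha in (0.5, 0.6, 0.8):
        print(alpha, round(estimated_speedup(alpha, k=4, cost_ratio=1 / 70), 2))
```

Under these assumptions the estimate is most sensitive to the acceptance rate, which depends on how well the draft model matches the target on the actual workload. That is one reason measured speedups vary between deployments and should be validated on the traffic an SLA will cover.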
