TorchAO hits 1M tokens/s
- TorchAO introduced quantization tooling that runs dense matrix multiplications in float8 (FP8) and reports throughput of up to 1,000,000 tokens per second. (x.com)
- The demo uses layer-level quantization to convert the large attention and FFN operations to FP8 while keeping correctness checks in higher precision. (x.com)
- This shows that software quantization can raise inference and training throughput by an order of magnitude for some workloads, without bespoke ASIC changes. (x.com)
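
To make the idea concrete, here is a minimal pure-Python sketch of a scaled FP8 matrix multiplication with a higher-precision correctness check. It is not TorchAO's API or the demo's code. The helper names (`round_to_e4m3`, `quantize`, `fp8_matmul`, `max_rel_error`) are hypothetical, the rounding is a simplified model of the E4M3 format (4 exponent bits, 3 mantissa bits, largest finite value 448), and per-tensor scaling is an assumption chosen for illustration.

```python
import math

E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format


def round_to_e4m3(x):
    """Simplified model: round x to the nearest E4M3-representable value."""
    if x == 0.0:
        return 0.0
    x = max(-E4M3_MAX, min(E4M3_MAX, x))  # saturate instead of overflowing
    _, e = math.frexp(x)                  # x = m * 2**e with 0.5 <= |m| < 1
    e = max(e, -5)                        # below 2**-6 the grid is fixed (subnormals)
    step = 2.0 ** (e - 4)                 # 1 implicit + 3 stored mantissa bits
    return round(x / step) * step


def quantize(matrix):
    """Per-tensor scaling (an assumption): map the largest magnitude to E4M3_MAX."""
    amax = max(abs(v) for row in matrix for v in row)
    scale = E4M3_MAX / amax if amax > 0 else 1.0
    return [[round_to_e4m3(v * scale) for v in row] for row in matrix], scale


def matmul(a, b):
    """Plain matrix product; Python floats stand in for the high-precision accumulator."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def fp8_matmul(a, b):
    """Quantize both operands, multiply, then undo the two scales."""
    qa, sa = quantize(a)
    qb, sb = quantize(b)
    return [[v / (sa * sb) for v in row] for row in matmul(qa, qb)]


def max_rel_error(approx, ref):
    """Largest absolute error, relative to the largest reference magnitude."""
    ref_max = max(abs(v) for row in ref for v in row)
    err = max(abs(x - y) for ra, rr in zip(approx, ref) for x, y in zip(ra, rr))
    return err / ref_max


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[0.5, -1.0], [1.5, 2.0]]
error = max_rel_error(fp8_matmul(a, b), matmul(a, b))  # the higher-precision check
print(f"max relative error: {error:.4f}")
```

The sketch only illustrates the numerics: operands are scaled into the FP8 range, the product is accumulated in higher precision, and the result is compared against an unquantized reference. The throughput gains reported above come from hardware FP8 matrix units, which this pure-Python model does not exercise.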