Cohere shows MXFP4 training

Cohere Labs publicly highlighted training LLMs in the MXFP4 low‑precision format, signaling an engineering push toward quantized training to cut cost and shorten training cycles. The post creates an opening to benchmark MXFP4 on H100/GB200 and to test CUDA kernel tuning for cost/performance trade-offs. (x.com)

Cohere Labs hosted recorded talks and demos under its Open Science community, including speaker sessions that walk through practical MXFP4 training workflows. (youtube.com)

The peer-reviewed "Training LLMs with MXFP4" paper reports a near‑lossless MXFP4 training recipe that combines stochastic rounding with a random Hadamard transform. It claims MXFP4 GEMMs are ≈2× faster than FP8 on supported hardware and that the recipe shifts >50% of training FLOPs into MXFP4. (proceedings.mlr.press)

An accompanying public implementation (amazon‑science/mxfp4-llm) provides code and integration notes for Megatron‑LM and NVIDIA TransformerEngine, including the stochastic‑rounding gradient estimator used in the paper. (github.com)

Hardware‑side guidance and community notes indicate that MXFP4 requires newer tensor‑core features (compute capability ≳9.0), and list Hopper‑ and Blackwell‑family GPUs such as H100 and GB200 as supported targets for MXFP4 acceleration. (michaelbommarito.com)

Ecosystem tooling and benchmarks are already emerging: Hugging Face and vLLM have MXFP4 docs and backends, LLM Compressor added experimental MXFP4 support in Jan 2026, and independent benchmarks (e.g., Millstone’s gpt‑oss‑20B on 1×H100) report inference throughput numbers under MXFP4. (huggingface.co)

Community kernel work and experimental repos show the CUDA/Triton kernel tuning people are using to trade cost against performance, and published guidance calls for recent Triton versions and framework patches to build stable MXFP4 operators. (github.com)
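To make the recipe mentioned above more concrete, here is a minimal pure-Python sketch of its two named ingredients, stochastic rounding and a random Hadamard transform, applied to a single 32-element MXFP4 block (FP4 E2M1 values sharing one power-of-two scale). It is an illustration only, not the paper's code or the amazon‑science/mxfp4-llm implementation: every function name is hypothetical, the scale is chosen so that no value clips (a simplification, not necessarily the rule any particular library uses), and real training does this inside fused GPU kernels rather than Python loops.

```python
import math
import random

# Magnitudes representable by FP4 E2M1 (the sign is stored separately).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 32  # MX block size: 32 elements share one power-of-two scale


def fwht(v):
    """Unnormalized fast Walsh-Hadamard transform; len(v) must be a power of two."""
    v = list(v)
    h = 1
    while h < len(v):
        for i in range(0, len(v), 2 * h):
            for j in range(i, i + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    return v


def random_hadamard(block, signs):
    """Flip signs at random, then apply the orthonormal Hadamard transform."""
    n = len(block)
    return [x / math.sqrt(n) for x in fwht([s * x for s, x in zip(signs, block)])]


def inverse_random_hadamard(block, signs):
    """Undo random_hadamard: the normalized transform is its own inverse."""
    n = len(block)
    return [s * x / math.sqrt(n) for s, x in zip(signs, fwht(block))]


def stochastic_round_fp4(x, rng):
    """Round x (|x| <= 6) to a neighbouring FP4 value, unbiased in expectation."""
    mag = min(abs(x), FP4_GRID[-1])
    hi_idx = next(i for i, g in enumerate(FP4_GRID) if g >= mag)
    if FP4_GRID[hi_idx] == mag:
        q = mag  # already representable
    else:
        lo, hi = FP4_GRID[hi_idx - 1], FP4_GRID[hi_idx]
        # Round up with probability proportional to the distance from lo.
        q = hi if rng.random() < (mag - lo) / (hi - lo) else lo
    return math.copysign(q, x)


def quantize_block(block, rng):
    """Return (scale, fp4_values) for one block; scale is a power of two."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Assumption of this sketch: pick the smallest power of two that avoids clipping.
    scale = 2.0 ** math.ceil(math.log2(amax / FP4_GRID[-1]))
    while amax / scale > FP4_GRID[-1]:  # guard against log2 rounding error
        scale *= 2.0
    return scale, [stochastic_round_fp4(x / scale, rng) for x in block]


def dequantize_block(scale, values):
    return [scale * q for q in values]


# Usage: transform, quantize, dequantize, and transform back one random block.
rng = random.Random(0)
block = [rng.gauss(0.0, 1.0) for _ in range(BLOCK)]
signs = [rng.choice((-1.0, 1.0)) for _ in range(BLOCK)]
scale, q = quantize_block(random_hadamard(block, signs), rng)
approx = inverse_random_hadamard(dequantize_block(scale, q), signs)
print("max abs error:", max(abs(a - b) for a, b in zip(block, approx)))
```

The intuition behind the design, as far as this sketch goes: the Hadamard transform is orthogonal, so it can spread a large outlier across the whole block before quantization and be undone exactly afterwards, while stochastic rounding keeps each quantized value equal to the original in expectation.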
