Nous Research reports 2.5x training cost cut

- Nous Research said on May 13 it had cut large-language-model pre-training time by up to 2.5 times with a method called Token-Superposition Training.
- The paper’s headline result was 4,768 B200-GPU-hours versus 12,311 for the baseline at 10B-A1B scale.
- The method is described in arXiv paper 2605.06546, submitted May 7 by Bowen Peng, Théo Gigant and Jeffrey Quesnelle.

Nous Research said this week it had reduced large-language-model pre-training time by as much as 2.5 times with a method it calls Token-Superposition Training, according to a paper posted on arXiv on May 7. The paper says the technique changes the training loop, not the model used at inference, and was tested from 270 million parameters up to a 10B-A1B mixture-of-experts model. The headline claim is that, at the largest tested scale, the TST run reached a lower final training loss than a matched-FLOPs baseline while using 4,768 B200-GPU-hours, compared with 12,311 for the baseline. The authors are Bowen Peng, Théo Gigant and Jeffrey Quesnelle.

### How does Token-Superposition Training work in practice?

The arXiv paper says Token-Superposition Training, or TST, runs in two phases. In the first phase, the model groups contiguous tokens into bags, averages their embeddings into a single latent representation, and predicts the next bag with a multi-hot cross-entropy objective; in the second phase, training returns to standard next-token prediction. (arxiv.org)

The paper says that setup lets the model process more text per unit of compute during the early part of training because sequence length is effectively compressed while keeping each step at equal FLOPs. The authors wrote that the method does not require changes to the optimizer, tokenizer, training data, parallelism strategy or model architecture. (arxiv.org)

### Where did the 2.5-times figure come from?

The 2.5-times figure comes from the paper’s 10B-A1B mixture-of-experts experiment. The abstract says TST yielded “up to a 2.5x reduction in total pre-training time” under equal-loss settings at that scale. MarkTechPost, citing the paper, reported that the TST run used 4,768 B200-GPU-hours versus 12,311 for the matched-FLOPs baseline while reaching a lower final training loss. (arxiv.org) That comparison is the clearest concrete number attached to the announcement.

### How broad was the testing?

The paper says the method was evaluated at 270 million and 600 million parameters and validated on 3 billion and 10B-A1B models. The abstract describes the results as robust across those settings and says TST “consistently outperforms baseline loss and downstream evaluations.” The May 13 coverage by MarkTechPost said the tested range ran from 270M to 10B and described the technique as leaving inference-time architecture untouched. (marktechpost.com) That matters because the claim is about a training-time efficiency change rather than a new serving stack. (arxiv.org)

### What is missing from the announcement so far?

The public materials currently visible are the arXiv paper and secondary write-ups. The paper abstract lays out the method and top-line results, but the materials reviewed here did not include a linked code repository or a standalone benchmark page from Nous Research with reproduction instructions. The available sources also do not show an independent replication of the 10B-A1B result. (marktechpost.com) That leaves the paper itself as the primary public record for the claim at this stage.

### What should readers watch next?

ArXiv lists the paper as 2605.06546 and shows it was submitted on May 7, 2026 by Bowen Peng, Théo Gigant and Jeffrey Quesnelle. The next concrete step for outside researchers is whether the authors or Nous Research publish code, training logs or additional benchmark details tied to that paper. (arxiv.org)
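
For readers who want a concrete picture of the first training phase described earlier (grouping contiguous tokens into bags, averaging their embeddings, and scoring a prediction of the next bag with a multi-hot cross-entropy), the following is a minimal sketch only. It is not the authors' code, which the materials reviewed here did not include. The function names, the bag size of 2, the toy embedding table, and the choice to spread the target evenly over the tokens in a bag are all illustrative assumptions; the paper's exact formulation may differ.

```python
import math

# Illustrative sketch only: names, bag size and target weighting are assumptions,
# not taken from the paper.

def average_bags(embeddings, bag_size):
    """Average each run of `bag_size` contiguous token embeddings into one vector."""
    bags = []
    for start in range(0, len(embeddings) - bag_size + 1, bag_size):
        group = embeddings[start:start + bag_size]
        dim = len(group[0])
        bags.append([sum(vec[d] for vec in group) / bag_size for d in range(dim)])
    return bags

def multi_hot_target(token_ids, vocab_size):
    """Spread probability mass evenly over the tokens of one bag (assumed weighting)."""
    target = [0.0] * vocab_size
    for tok in token_ids:
        target[tok] += 1.0 / len(token_ids)
    return target

def multi_hot_cross_entropy(logits, target):
    """Cross-entropy between softmax(logits) and a multi-hot target distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return -sum(t * (x - log_norm) for x, t in zip(logits, target) if t > 0.0)

# Toy example: four tokens, two-dimensional embeddings, bags of two tokens.
token_ids = [3, 1, 4, 1]
embedding_table = {1: [1.0, 0.0], 3: [0.0, 2.0], 4: [2.0, 2.0]}  # hypothetical values
embeddings = [embedding_table[t] for t in token_ids]

bags = average_bags(embeddings, bag_size=2)              # 4 positions become 2
target = multi_hot_target(token_ids[2:4], vocab_size=5)  # next bag holds tokens 4 and 1
loss = multi_hot_cross_entropy([0.0] * 5, target)        # untrained, uniform logits
```

In this toy run the four-token sequence is shortened to two bag positions, which illustrates the sense in which the paper describes sequence length as effectively compressed during the early part of training. In a real training loop the logits would come from the model's output at the previous bag position instead of being fixed at zero.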
