Synthetic data needs human anchoring
People working on synthetic-data workflows warn of ‘model collapse’ when synthetic examples are used without strong human grounding. Some suggest keeping at least 20% human-grounded samples to avoid bias amplification and edge-case blindness. Labs are scaling synthetic trace generation for agents, and that expansion makes human validation the scarce resource that prevents contamination and gaming. In practice, startups that offer human validation for synthetic outputs can sell trust rather than just label volume. (x.com) (marktechpost.com)
Synthetic data is machine-made training material, and researchers say models degrade when too much of it replaces human examples. (nature.com) A July 2024 Nature paper called that failure “model collapse” and found that repeated training on model-generated content makes rare but important patterns disappear from the data. An Oxford summary of the work said access to original human-created data remains necessary to limit that drift. (nature.com) (ox.ac.uk)

A separate April 2024 study found collapse tends to happen when synthetic data replaces the original data, but not when new synthetic generations are added alongside the original human set. That result turned the practical question from “whether to use synthetic data” to “how much human data must stay in the mix.” (arxiv.org)

That question is getting sharper because labs are generating more synthetic traces for agents: step-by-step records of how a model uses tools, writes code, and completes tasks. MiniMax said on March 18 that its M2.7 model was built around “self-evolution,” and NVIDIA said on April 11 that the release targets agentic harnesses and complex workflows. (minimax.io) (developer.nvidia.com) On April 12, MarkTechPost summarized MiniMax M2.7 as scoring 56.22% on SWE-Pro and 57.0% on Terminal Bench 2, benchmarks for software and terminal-based agent tasks. Those systems need large volumes of traces, rankings, and corrections, which makes synthetic generation attractive and human review expensive. (marktechpost.com) (developer.nvidia.com)

Researchers have not settled on a universal safe ratio for human versus synthetic data. The published papers above support keeping original human data in the training pool, but they do not establish a single threshold such as 20% across all model types and tasks. (arxiv.org) (nature.com) What they do agree on is the failure mode: models lose the tails of the distribution, meaning uncommon edge cases, minority patterns, and unusual combinations get washed out first.
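
The replace-versus-accumulate finding and the loss of tails can be illustrated with a toy simulation. The sketch below is not the method of either cited paper. It assumes a one-dimensional Gaussian stands in for the "model", with 20 samples per generation and 200 generations, and the names `fit_gaussian` and `simulate` are made up for illustration.

```python
import math
import random


def fit_gaussian(samples):
    """Maximum-likelihood mean and standard deviation of a sample."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    return mean, math.sqrt(var)


def simulate(mode, generations=200, n=20, seed=0):
    """Refit a toy Gaussian 'model' on its own output and return the final std.

    mode="replace":    each generation trains only on the newest synthetic data.
    mode="accumulate": synthetic data is added to a pool that keeps the
                       original human samples.
    All defaults are illustrative assumptions, not values from the papers.
    """
    rng = random.Random(seed)
    # Stand-in for the original human-created data.
    pool = [rng.gauss(0.0, 1.0) for _ in range(n)]
    mean, std = fit_gaussian(pool)
    for _ in range(generations):
        # The current model generates the next batch of synthetic data.
        synthetic = [rng.gauss(mean, std) for _ in range(n)]
        if mode == "replace":
            pool = synthetic
        elif mode == "accumulate":
            pool = pool + synthetic
        else:
            raise ValueError(f"unknown mode: {mode!r}")
        mean, std = fit_gaussian(pool)
    return std


if __name__ == "__main__":
    for mode in ("replace", "accumulate"):
        print(mode, simulate(mode))
```

Under these assumptions, the fitted spread in the replace regime typically shrinks toward zero, which is the toy version of the tails disappearing, while the accumulate regime stays much closer to the spread of the original sample. A single Gaussian is far simpler than a language model, so the sketch shows the mechanism rather than any realistic rate of collapse.
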
The Nature paper described the loss of those tails as irreversible once recursive training compounds across generations. (nature.com) That shifts value toward firms that can prove a sample was checked against reality by a person, not just produced at scale by another model. Oxford’s summary of the 2024 paper pointed to data attribution and provenance as part of the fix, because future model builders need to know what is human and what is synthetic. (ox.ac.uk)

As agent builders race to manufacture more traces, the bottleneck is no longer generation alone. It is the human grounding that keeps synthetic data from turning into a closed loop. (developer.nvidia.com) (nature.com)
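
One way to make provenance actionable is to tag every sample with its origin and enforce a floor on the human-grounded share before training. The sketch below is a hypothetical illustration, not a published recipe. The `Sample` fields, the helper names, and the rule that a human-verified synthetic sample counts as grounded are all assumptions, and the floor is a parameter because the cited papers do not establish a universal threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    text: str
    source: str                   # "human" or "synthetic" (assumed labels)
    human_verified: bool = False  # a person checked this sample (assumed field)


def is_grounded(sample):
    """Assumption: human-written or human-verified samples count as grounded."""
    return sample.source == "human" or sample.human_verified


def human_fraction(samples):
    """Share of grounded samples in a mix; 0.0 for an empty mix."""
    if not samples:
        return 0.0
    return sum(is_grounded(s) for s in samples) / len(samples)


def cap_synthetic(samples, min_human_fraction):
    """Keep all grounded samples and only as many ungrounded synthetic
    samples as the floor allows."""
    if not 0 < min_human_fraction <= 1:
        raise ValueError("min_human_fraction must be in (0, 1]")
    grounded = [s for s in samples if is_grounded(s)]
    ungrounded = [s for s in samples if not is_grounded(s)]
    kept = []
    for s in ungrounded:
        total_if_added = len(grounded) + len(kept) + 1
        # Stop once one more synthetic sample would break the floor.
        if len(grounded) / total_if_added < min_human_fraction:
            break
        kept.append(s)
    return grounded + kept
```

With 2 human samples, 10 unverified synthetic samples, and a floor of 0.25, `cap_synthetic` keeps the 2 human samples and 6 synthetic ones. Recording the origin per sample, rather than per dataset, is what lets a later builder recompute the mix under a different floor.
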