New synthetic‑data methods
Researchers discussed a new paper introducing 'Dataset Policy Gradient' (DPG), a method for optimising synthetic datasets against target metrics, arguing that synthetic data can be tuned to produce specific behaviours in models. Practitioners also recommended Kimi K2.5 for generating synthetic supervised fine‑tuning data and advised modeling reasoning‑length distributions to improve outputs. (x.com/smsampark/status/2043723640521597339, x.com/DJLougen/status/2043759562344345742)
Synthetic data is moving from filler training text to a control knob: a new Stanford-led paper says researchers can optimize generated examples for a chosen model metric. (arxiv.org)

The paper, “Synthetic Data for any Differentiable Target,” was posted to arXiv on April 9, 2026 by Tristan Thrush, Sung Min Park, Herman Brunborg, Luke Bailey, Marcel Roed, Neil Band, Christopher Potts, and Tatsunori Hashimoto. It introduces “Dataset Policy Gradient,” or DPG, a reinforcement-learning method for tuning a synthetic-data generator. (arxiv.org)

In plain terms, the method treats a training dataset like a recipe that can be adjusted ingredient by ingredient. The authors say DPG assigns rewards to individual synthetic texts, instead of scoring only a full training run after the fact. (arxiv.org)

The paper says those optimized examples can steer a target model toward specific measurable outcomes during supervised fine-tuning, the standard process of further training a model on labeled examples. The authors report experiments in which the model’s output layer was made to encode a QR code, to encode the pattern “67,” and to take on a lower weight norm. (arxiv.org) They also report two text-generation behaviors: rephrasing inputs into a new language and producing a specific universally unique identifier, or UUID, even when that target was not written into the generator’s prompt. The paper argues that synthetic examples can carry more control signal than the prompt alone. (arxiv.org)

That lands in a field already wrestling with how synthetic training data changes model behavior in unexpected ways. The paper’s introduction points to recent work on emergent misalignment, subliminal learning, data poisoning from harmless inputs, and model provenance as evidence that examples themselves can shape models in hard-to-see ways. (arxiv.org)

The discussion around the paper also folded in a practical question: which open model is worth using to make synthetic fine-tuning data. Moonshot AI says Kimi K2.5 is its most capable model to date, with text, image, and video input, a 256,000-token context window, and “thinking” and non-thinking modes. (platform.kimi.ai) Moonshot’s open Kimi K2 page says the K2 family includes a base model meant for fine-tuning and custom solutions, plus an instruct model for general chat and agentic use. That makes it a plausible candidate for practitioners generating large volumes of synthetic supervised fine-tuning examples, though the recommendation itself came from researchers on X, not from a controlled benchmark in the paper. (moonshotai.github.io, x.com)

Another thread in the discussion was output length. A 2025 paper on chain-of-thought length found an inverted U-shaped pattern, with accuracy improving and then falling as reasoning gets longer, and said the optimal length rises with task difficulty but falls with model capability. (arxiv.org) A separate 2025 study on supervised fine-tuning length found that long-context fine-tuning can improve short-context performance, while a 2025 “Long-Short Chain-of-Thought Mixture” method reported a 2.3% average accuracy gain and a 47.61% cut in response length versus direct fine-tuning. Those results help explain why practitioners are now talking about matching the distribution of reasoning lengths in synthetic data instead of always generating the longest possible answers. (aclanthology.org, zgca-ai4edu.github.io)

The immediate claim is not that synthetic data can do everything, but that it can be optimized much more precisely than many labs assumed a year ago. If that holds up beyond preprint results, the next contest in model training may be less about collecting raw data and more about designing the exact examples a model sees. (arxiv.org)
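
For readers who want to see what matching a distribution of reasoning lengths could look like in practice, here is a minimal sketch. It is not taken from the paper or from the X threads. It assumes a simple stratified subsampling approach: bucket a reference set of reasoning lengths into bins, then draw synthetic examples so each bin gets roughly the same share. The function names, the bin edges, and the toy data are all hypothetical, and lengths are assumed to be token counts.

```python
import bisect
import random
from collections import defaultdict


def bin_index(length, edges):
    """Return the index of the length bin that `length` falls into."""
    return bisect.bisect_right(edges, length)


def match_length_distribution(candidates, reference_lengths, edges, n_samples, seed=0):
    """Subsample candidates so their length histogram tracks the reference.

    candidates: list of (text, length) pairs (hypothetical synthetic examples)
    reference_lengths: reasoning lengths from the distribution to imitate
    edges: sorted bin boundaries, e.g. [100, 400] yields three bins
    """
    rng = random.Random(seed)

    # Count how much of the reference data falls in each length bin.
    ref_counts = defaultdict(int)
    for length in reference_lengths:
        ref_counts[bin_index(length, edges)] += 1
    total = len(reference_lengths)

    # Group the candidate examples using the same bins.
    by_bin = defaultdict(list)
    for text, length in candidates:
        by_bin[bin_index(length, edges)].append((text, length))

    selected = []
    for b, count in sorted(ref_counts.items()):
        quota = round(n_samples * count / total)
        pool = by_bin.get(b, [])
        # A pool smaller than its quota simply leaves that bin underfilled.
        selected.extend(rng.sample(pool, min(quota, len(pool))))
    return selected


# Toy data (assumed): a reference that is 30% short, 60% medium, 10% long.
edges = [100, 400]
reference = [50, 80, 90, 120, 150, 200, 250, 300, 350, 500]
candidates = (
    [(f"short-{i}", 40 + i) for i in range(10)]
    + [(f"medium-{i}", 200 + i) for i in range(10)]
    + [(f"long-{i}", 600 + i) for i in range(10)]
)
selected = match_length_distribution(candidates, reference, edges, n_samples=10)
```

Because each bin's quota is rounded independently, the quotas may not sum exactly to `n_samples` on other inputs. A production pipeline would also likely measure length with the target model's tokenizer; neither detail comes from the sources cited above.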