When to use synthetic data — a framework
Fixstars published a framework for deciding when synthetic data is appropriate versus when human labeling is required, stressing ROI and edge‑case risk. (x.com)
Synthetic data is artificially generated training data built to match the statistical patterns of real data, and Fixstars says teams should use it only when it beats collecting and labeling real examples on cost, speed, or access. (blog.us.fixstars.com)
Fixstars published the framework on April 8, 2026, in a post by Changgyu Choi aimed at machine-learning engineers and technical leads. The piece lays out a go-or-no-go checklist covering generative adversarial networks, variational autoencoders, diffusion models, large language models, model collapse, and a three-year total-cost-of-ownership calculation. (blog.us.fixstars.com)
The core test is simple: use synthetic data when real data is scarce, legally restricted, dangerous to collect, or too expensive to label at scale. Keep humans in the loop when labels depend on expert judgment, when mistakes carry high safety costs, or when the model will face rare edge cases that generated data may miss. (blog.us.fixstars.com)
Synthetic data works by learning patterns from real examples and then generating new records that look statistically similar without copying any one person, image, or event. That makes it useful in fields like finance and health care, where privacy rules and limited access can slow model training. (jpmorganchase.com)
Fixstars anchors the argument in two production examples: Waymo for autonomous-driving simulation and JPMorgan Chase for financial data under regulatory constraints. JPMorgan says its artificial intelligence research team builds realistic synthetic financial datasets for research and development in financial services. (blog.us.fixstars.com) (jpmorganchase.com)
The catch is that synthetic data can smooth away the exact failures engineers most need to catch. Fixstars flags edge-case coverage and data quality as central risks, and recent research on “model collapse” found that repeatedly training on generated data can degrade output unless real data remains in the mix. (blog.us.fixstars.com) (arxiv.org)
That warning lines up with broader guidance from standards and policy groups. The National Institute of Standards and Technology’s generative artificial intelligence risk profile says organizations should measure and manage risks around data provenance, validity, privacy, and downstream harm rather than treating generated content as automatically safe. (nvlpubs.nist.gov)
Governments are also treating synthetic data as a controlled tool, not a blanket substitute for real records. The United Kingdom government’s January 29, 2025, guidance says synthetic data can support research and sharing, but it still raises questions about safety, transparency, and whether the generated data is fit for the intended use. (gov.uk)
The practical takeaway in Fixstars’ framework is narrower than the hype around generated data: buy or build a synthetic pipeline when it solves a specific bottleneck, and keep paying for human labels when judgment, accountability, and weird real-world failures decide whether the model works. (blog.us.fixstars.com)
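The core test described above can be written down as a small checklist. The Python sketch below is an illustration only, not code from the Fixstars post: the `DataContext` fields, the `recommend` function, and the three outcome labels are hypothetical names chosen here, and the rule that a mix of both kinds of conditions yields synthetic data with human review is an assumption drawn from the article's advice to keep humans in the loop.

```python
from dataclasses import dataclass


@dataclass
class DataContext:
    """Hypothetical yes/no answers to the questions in the core test."""
    # Conditions the article lists as favoring synthetic data.
    real_data_scarce: bool = False
    legally_restricted: bool = False
    dangerous_to_collect: bool = False
    labeling_too_expensive: bool = False
    # Conditions the article lists as calling for human involvement.
    needs_expert_judgment: bool = False
    high_safety_cost: bool = False
    rare_edge_cases_expected: bool = False


def recommend(ctx: DataContext) -> str:
    """Return a coarse recommendation; the labels are illustrative only."""
    favors_synthetic = any([
        ctx.real_data_scarce,
        ctx.legally_restricted,
        ctx.dangerous_to_collect,
        ctx.labeling_too_expensive,
    ])
    needs_humans = any([
        ctx.needs_expert_judgment,
        ctx.high_safety_cost,
        ctx.rare_edge_cases_expected,
    ])
    if favors_synthetic and needs_humans:
        # Assumption: when both apply, generate data but keep human review.
        return "synthetic with human review"
    if favors_synthetic:
        return "synthetic"
    # Default: no bottleneck that synthetic data would solve.
    return "human-labeled real data"


if __name__ == "__main__":
    # Example: regulated financial data feeding a safety-critical model.
    print(recommend(DataContext(legally_restricted=True, high_safety_cost=True)))
```

A real go-or-no-go decision would weigh these factors rather than treat them as simple flags, and would add the cost comparison the article mentions. The sketch only shows how the two lists of conditions interact.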