OpenAI's Codex Agent Can Now Fork Itself

A new release of OpenAI's open-source Codex coding agent adds the ability to "fork a thread into sub-agents." The update allows more complex, branching workflows inside a terminal environment, and it is a clear signal that labs are building more modular, multi-agent architectures for sophisticated tasks.
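
To make the fork-and-join idea concrete, here is a minimal Python sketch. It is not Codex's implementation or API. `Thread`, `fork`, `run_subagent`, and `fork_and_join` are hypothetical names, and the model call is stubbed out. The sketch assumes that forking copies the parent thread's context into each sub-agent, that the sub-agents run concurrently, and that only their results are merged back into the parent.

```python
import asyncio
import copy
from dataclasses import dataclass, field


@dataclass
class Thread:
    """Hypothetical conversation thread: an ordered list of (role, text) messages."""
    messages: list[tuple[str, str]] = field(default_factory=list)

    def fork(self, subtask: str) -> "Thread":
        # Each sub-agent starts from a deep copy of the parent's context,
        # so anything it appends never leaks back into the parent thread.
        child = Thread(copy.deepcopy(self.messages))
        child.messages.append(("user", subtask))
        return child


async def run_subagent(thread: Thread) -> str:
    """Stub for a model call; a real agent would plan and call tools here."""
    await asyncio.sleep(0)  # yield control, standing in for I/O-bound work
    subtask = thread.messages[-1][1]
    result = f"done: {subtask}"
    thread.messages.append(("assistant", result))
    return result


async def fork_and_join(parent: Thread, subtasks: list[str]) -> list[str]:
    # Fork: one child thread per subtask, all run concurrently.
    children = [parent.fork(task) for task in subtasks]
    results = await asyncio.gather(*(run_subagent(c) for c in children))
    # Join: only the results (not the children's full transcripts)
    # are written back to the parent thread.
    parent.messages.append(("tool", "\n".join(results)))
    return list(results)


if __name__ == "__main__":
    parent = Thread([("user", "Refactor the billing module")])
    results = asyncio.run(fork_and_join(parent, ["update tests", "fix type errors"]))
    print(results)  # ['done: update tests', 'done: fix type errors']
```

Copying context at fork time keeps sub-agents isolated from one another, and writing back only their results keeps the parent thread's context short. A real agent would need its own policy for how much of each child's transcript to merge.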

Letting an agent fork itself points toward multi-agent systems, in which specialized AI agents collaborate on complex problems. Rather than having a single model handle every step, the work is distributed across agents, which can improve efficiency and adaptability in dynamic environments. The modular design is also easier to maintain and extend, because individual agents can be updated or added without overhauling the entire system.

This architectural shift creates new challenges for model alignment and evaluation. Instead of assessing the quality of a single text output, labs must now measure how well agents execute multi-step tasks. Benchmarks such as AgentBench, WebArena, and GAIA have emerged to test agent capabilities in areas like web navigation, tool use, and multi-turn decision-making. For businesses selling to these labs, the key performance indicators now include task success rate, token cost, and action accuracy.

Reinforcement Learning from Human Feedback (RLHF) remains a critical workflow for refining these agentic systems, but the nature of the data is changing. Rather than supplying simple preference pairs ("is A better than B?"), data labelers are increasingly asked for complex, domain-specific feedback on tasks such as medical diagnoses or financial analysis. The result is a flight to quality, with labs orchestrating supply chains of human experts (lawyers, doctors, coders) to provide nuanced annotations.

This reliance on expert human data is a direct response to the limitations of synthetic data. Synthetic data is scalable and cost-effective for general topics, but it often fails to capture the real-world noise, cultural nuance, and edge cases that frontier models need. A hybrid approach, using synthetic data for volume and human validation for refinement, has become a common strategy; one analysis reported that it can improve model performance by 23% while cutting annotation costs by 64%.
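
The agent KPIs named above (task success rate, token cost, action accuracy) can be computed from ordinary run logs. The Python sketch below assumes a hypothetical per-episode log record (`Episode`) and common working definitions: success rate as the share of tasks completed, token cost as average tokens per task times a flat price you supply, and action accuracy as correct actions over all actions. These definitions are illustrative assumptions, not the scoring rules of AgentBench, WebArena, or GAIA.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One agent run on one task (hypothetical log format)."""
    succeeded: bool       # did the agent complete the task end to end?
    tokens_used: int      # prompt + completion tokens across all steps
    actions_taken: int    # total tool calls / actions issued
    actions_correct: int  # actions a grader judged correct


def agent_kpis(episodes: list[Episode], usd_per_million_tokens: float) -> dict[str, float]:
    """Aggregate assumed KPI definitions over a batch of episodes."""
    if not episodes:
        raise ValueError("need at least one episode")
    n = len(episodes)
    total_tokens = sum(e.tokens_used for e in episodes)
    total_actions = sum(e.actions_taken for e in episodes)
    total_correct = sum(e.actions_correct for e in episodes)
    return {
        # Fraction of tasks completed (bools sum as 0/1).
        "task_success_rate": sum(e.succeeded for e in episodes) / n,
        # Average spend per task at an assumed flat per-token price.
        "avg_token_cost_usd": total_tokens / n * usd_per_million_tokens / 1_000_000,
        # Micro-averaged: correct actions over all actions in the batch.
        "action_accuracy": total_correct / total_actions if total_actions else 0.0,
    }


if __name__ == "__main__":
    runs = [
        Episode(True, 12_000, 10, 9),
        Episode(False, 8_000, 6, 3),
        Episode(True, 10_000, 4, 4),
    ]
    # Success rate 2/3, $0.02 per task, action accuracy 0.8
    print(agent_kpis(runs, usd_per_million_tokens=2.0))
```

Action accuracy here is micro-averaged, so long episodes weigh more; averaging per-episode accuracy instead would weight every task equally. Which one a buyer cares about depends on whether they price by task or by action.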

For startups entering this space, the go-to-market (GTM) strategy must focus on outcomes rather than technology. Technical buyers at AI labs care less about the underlying architecture than about whether a solution can, for example, "cut debugging time by 40%." Aligning product and GTM from day one is crucial, which means building tight feedback loops with early customers to iterate on both the product and the messaging.

The fundraising climate for AI infrastructure is robust but concentrating around a few key players. Funding for AI infrastructure startups grew tenfold between 2022 and 2025, and the average deal size jumped to $242 million in 2025. However, a broader liquidity crunch means that even with investor interest high, capital itself is becoming a competitive moat, favoring established funds that can finance billion-dollar rounds for foundational companies.

This evolution is also reshaping the data labeling workforce. Low-skill, repetitive annotation tasks are increasingly automated, and the future of the work lies in high-value, specialized roles that require deep domain expertise and the ability to validate complex AI reasoning. That creates an opportunity to build a more skilled, better-compensated workforce, moving away from the "digital sweatshop" model toward one that treats human expertise as a core component of the AI value chain.
