New Framework Trains AI Agents Without Manual Rewards

A new open-source framework called ART (Agent Reinforcement Trainer) has been released for training AI agents. It uses a novel combination of techniques (GRPO + RULER) to enable automatic reward generation, eliminating the need for manual reward crafting. This could significantly simplify the process of training specialized agents for side projects.

ART is engineered to overcome the limitations of existing reinforcement learning tools, which struggle with the multi-turn interactions typical of complex agentic workflows. It uses a client-server architecture: the ART client is a lightweight component inside an application, while a backend server handles the heavy lifting of the RL training loop. This design improves GPU utilization and lets developers integrate reinforcement learning into existing, complex codebases with minimal refactoring.

At the core of ART is Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm that is more memory- and compute-efficient than predecessors such as Proximal Policy Optimization (PPO). GRPO eliminates the need for a separate critic model by estimating each output's advantage from relative comparisons within a "group" of outputs generated for the same prompt. Because the signal comes from relative judgments rather than absolute scores, a reward only has to separate better attempts from worse ones within a group.

The framework's RULER (Relative Universal LLM-Elicited Rewards) component automates reward generation, a significant bottleneck in traditional reinforcement learning. RULER uses a configurable large language model as a "judge" to rank multiple agent outputs (trajectories) against each other based on a provided rubric. This method has been shown to match or even outperform hand-crafted reward functions in several benchmarks, potentially reducing development time by 2-3x.

OpenPipe, the company behind ART, was co-founded by Kyle Corbitt, formerly of Google and Y Combinator, and focuses on making custom AI model creation more accessible to developers. Its work is built on the insight that fine-tuned models can significantly outperform larger, general-purpose models like GPT-4 on specific tasks, at a fraction of the cost and with lower latency. OpenPipe's broader platform helps developers capture prompt-completion pairs to train smaller, more efficient custom models.

ART is designed for scenarios where an open-source model can already complete a task at least 30% of the time, which ensures the model is capable enough to benefit from this training method. The framework integrates with tools like Weights & Biases for experiment tracking and is built to be flexible, running on a local GPU or in a cloud environment. For developers, this means they can start training agents on a laptop and scale as needed.

One practical application showcased by OpenPipe is an email research agent trained with ART that can answer deep-research questions from an inbox. This example demonstrated a small, 14-billion-parameter model achieving state-of-the-art performance, highlighting ART's ability to create highly accurate and efficient agents for real-world applications. That project and others, such as an agent that learns to play 2048, are available as open-source examples.
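
To make the group-relative idea behind GRPO and RULER more concrete, here is a minimal sketch in plain Python. It is not ART's API: the function names, the linear rank-to-reward mapping, and the example ranking are all assumptions made for illustration. It only shows how a judge's ranking of one group of trajectories could be turned into rewards, and how those rewards become critic-free advantages by normalizing within the group.

```python
from statistics import mean, pstdev


def ranks_to_rewards(ranking):
    """Map a judge's ranking (trajectory indices, best first) to rewards in [0, 1].

    Hypothetical helper: the linear spacing is an assumption for this sketch,
    not how any particular judge assigns scores.
    """
    n = len(ranking)
    rewards = [0.0] * n
    for position, idx in enumerate(ranking):
        rewards[idx] = 1.0 if n == 1 else 1.0 - position / (n - 1)
    return rewards


def group_relative_advantages(rewards, eps=1e-8):
    """Compute group-relative advantages: (reward - group mean) / (group std + eps).

    No critic model is involved; the group's own mean serves as the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical judge output for four trajectories sampled from the same prompt:
# trajectory 2 was ranked best, then 0, then 3, and trajectory 1 was ranked worst.
ranking = [2, 0, 3, 1]
rewards = ranks_to_rewards(ranking)
advantages = group_relative_advantages(rewards)

for idx, (r, a) in enumerate(zip(rewards, advantages)):
    print(f"trajectory {idx}: reward={r:.2f} advantage={a:+.2f}")
```

In this sketch, trajectories ranked above the group average receive positive advantages and those below receive negative ones, so a training step would push the policy toward the better attempts without ever needing an absolute score.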
