ByteDance AI Agent Generates Faster CUDA Kernels

ByteDance has published a paper on an AI agent that generates CUDA kernels that run 2.11x faster, by geometric mean, than the output of standard compilers like `torch.compile`. The agent is trained with reinforcement learning on GPU profiling data, suggesting a software-based path to boosting performance on existing NVIDIA hardware.

The "CUDA Agent" is a joint research project between ByteDance's Seed research team and Tsinghua University's Institute for AI Industry Research. Their paper, published in February 2026, details a system that moves beyond one-shot code generation to iterative, profile-guided optimization of CUDA kernels. This work is part of a broader push by ByteDance into fundamental AI research, with labs in the US, Singapore, and China covering areas from large language models to AI infrastructure.

The agent's key innovation lies in its training methodology. It uses reinforcement learning in which the reward signal is derived directly from actual GPU profiling data, not just from whether the code compiles or passes correctness checks. This hardware-aware approach lets the agent learn the details of GPU architecture that produce genuine performance gains, such as avoiding shared-memory bank conflicts and ensuring coalesced memory access. The system runs a "reason-act-observe" loop, autonomously profiling a kernel, identifying bottlenecks, and rewriting it for up to 200 cycles per task.

On the KernelBench benchmark, CUDA Agent outperformed `torch.compile` in 96.8% of cases and achieved a 2.11x geometric mean speedup overall. On the most complex Level-3 tasks, which involve real-world neural network building blocks, the agent was faster 92% of the time. That is well ahead of proprietary models such as Google's Gemini 3 Pro and Anthropic's Claude Opus 4.5, which were faster than `torch.compile` on 52% and 50% of the same tasks, respectively.

The training process is a multi-stage pipeline designed for stability. It starts with a warm-up phase using Proximal Policy Optimization (PPO), followed by rejection fine-tuning and critic pretraining, before the full agentic reinforcement learning begins. To support this, the researchers built a synthetic dataset of 6,000 training operations and a skill-augmented development environment that provides reliable verification and profiling.
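The benchmark figures quoted above rest on two different statistics: a "faster rate" (the share of tasks where the generated kernel beats the baseline) and a geometric mean of per-task speedups. The sketch below shows how the two could be computed. The speedup values are made up for illustration, and the function names are hypothetical; this is not KernelBench's own scoring code.

```python
import math

def faster_rate(speedups):
    """Fraction of tasks where the candidate beats the baseline (speedup > 1)."""
    return sum(1 for s in speedups if s > 1.0) / len(speedups)

def geometric_mean(speedups):
    """Geometric mean of speedups, computed in log space."""
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))

# Illustrative per-task speedups (baseline_time / candidate_time).
# These numbers are invented for the example, not taken from the paper.
speedups = [4.0, 2.0, 1.0, 0.5]

print("faster rate:", faster_rate(speedups))
print("geometric mean speedup:", geometric_mean(speedups))
print("arithmetic mean speedup:", sum(speedups) / len(speedups))
```

The geometric mean is the usual choice for averaging ratios because a 2x speedup and a 2x slowdown cancel out, whereas an arithmetic mean would let a few very large wins dominate the summary.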
The full training loop required a dedicated cluster of 128 NVIDIA H20 GPUs.

The agent has demonstrated sophisticated, human-like optimization strategies without explicit instruction. For instance, it independently discovered the performance benefit of enabling TF32 precision to use Tensor Cores, and it learned to fuse operations such as convolution, batch normalization, and ReLU activation into a single kernel. In one case, it achieved a 73x speedup by recognizing that multiplication by a diagonal matrix can be replaced with a much cheaper row-scaling operation.

This software-driven approach to performance is notable as the industry confronts the physical limits and rising costs of hardware-based scaling. Although the research does not compare the agent against more complex compiler frameworks such as Apache TVM, the results point toward a future where AI agents automate the highly specialized and valuable skill of GPU kernel optimization. The dataset, CUDA-Agent-Ops-6K, and the agent's work directory have been publicly released, which could accelerate adoption in custom AI training pipelines.
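The diagonal-matrix case mentioned above comes down to a simple algebraic identity: multiplying `diag(d)` by a matrix `B` just scales row `i` of `B` by `d[i]`. The plain-Python sketch below illustrates that identity only. It is not the agent's CUDA kernel, which the article does not show, and the function names are hypothetical.

```python
def diag_matmul_naive(d, b):
    """Compute diag(d) @ b by materializing the full n x n diagonal matrix."""
    n = len(d)
    diag = [[d[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
    cols = len(b[0])
    # Full matrix product: n * n * cols multiply-adds, mostly by zero.
    return [
        [sum(diag[i][k] * b[k][j] for k in range(n)) for j in range(cols)]
        for i in range(n)
    ]

def diag_matmul_row_scale(d, b):
    """Same result with n * cols multiplies: scale row i of b by d[i]."""
    return [[d_i * x for x in row] for d_i, row in zip(d, b)]

d = [2.0, -1.0, 0.5]
b = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Both paths agree on this small example.
assert diag_matmul_naive(d, b) == diag_matmul_row_scale(d, b)
print(diag_matmul_row_scale(d, b))
```

The naive path does work proportional to n² times the number of columns, while row scaling is proportional to n times the number of columns, which is why recognizing the structure can yield a large speedup on big matrices.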
