Jeff Dean on the Future of Chip Design
Google's Jeff Dean recently discussed the future of TPU design, emphasizing the physics of energy consumption and the need for hardware/software co-design for chips planned for 2028-2032. He also highlighted significant gaps in using reinforcement learning for chip verification, a key challenge for the industry.
The push for hardware/software co-design stems from the need to close the efficiency gap in AI workloads. Hardware and algorithms have traditionally been developed separately; under co-design, "hardware-aware" algorithms and "algorithm-aware" hardware are built to work together for higher performance and energy efficiency. This synergy enables custom accelerators like TPUs, which are tuned for specific deep learning operations.
Google's first TPU, an application-specific integrated circuit (ASIC), was deployed internally in 2015 to handle the massive computational demands of features like voice search. This initial chip focused on inference and was a response to projections that rapidly growing AI compute needs would require doubling Google's data centers. The first-generation TPU delivered 92 TOPS at 40W (roughly 2.3 TOPS per watt), powering services like Search and the historic AlphaGo victory.
Subsequent TPU generations expanded from inference to training. The v2 TPU (2017) introduced training support and high-bandwidth memory, while the v3 (2018) doubled performance and added liquid cooling. More recent generations, like the sixth-generation Trillium, continue to raise performance and efficiency for training cutting-edge AI models.
The emphasis on energy efficiency is a strategic necessity driven by the rising operational costs and carbon intensity of hyperscale data centers. Innovations now focus on reducing watts per computation through advanced power management, new materials like Gallium Nitride (GaN) and Silicon Carbide (SiC), and even neuromorphic architectures.
Reinforcement learning (RL) in chip verification is promising but challenging. RL can learn strategies for finding bugs and has been shown to reduce test cases by 37% in some scenarios, but its effectiveness depends heavily on the quality and quantity of training data. A key difficulty is defining the right cost function to optimize when the target, an unknown bug, is not clearly defined.
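To make the cost-function problem concrete, here is a minimal sketch of one common workaround: because the bug itself cannot be specified in advance, the agent is rewarded for a measurable proxy, namely coverage bins it has not hit before. Nothing here comes from Dean's remarks. The `toy_dut` function, the bin names, and the epsilon-greedy bandit (the simplest stand-in for an RL agent) are all hypothetical and chosen only for illustration.

```python
import random

def novelty_reward(seen_bins, hit_bins):
    """Proxy reward: how many coverage bins this stimulus hit for the first time.

    Mutates seen_bins so later calls do not reward the same bins again.
    """
    new_bins = set(hit_bins) - seen_bins
    seen_bins |= new_bins
    return len(new_bins)

def toy_dut(opcode, operand):
    """Hypothetical design under test: returns the coverage bins a stimulus hits."""
    return {
        ("op", opcode),
        ("range", operand // 64),           # four operand ranges for 0..255
        ("cross", opcode, operand % 4),     # opcode crossed with low operand bits
    }

def run_agent(steps=200, epsilon=0.1, seed=0):
    """Epsilon-greedy bandit that picks opcodes to maximise novel coverage."""
    rng = random.Random(seed)
    actions = list(range(4))                # four hypothetical opcodes
    value = {a: 3.0 for a in actions}       # optimistic start encourages trying each opcode
    count = {a: 0 for a in actions}
    seen = set()
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.choice(actions)            # explore
        else:
            action = max(actions, key=value.get)    # exploit current estimate
        operand = rng.randrange(256)
        reward = novelty_reward(seen, toy_dut(action, operand))
        count[action] += 1
        value[action] += (reward - value[action]) / count[action]  # running mean
    return seen, value

if __name__ == "__main__":
    seen, value = run_agent()
    print(f"coverage bins hit: {len(seen)} of 24")
```

The sketch also shows why the proxy is imperfect: the reward falls to zero once every bin has been seen, whether or not any bug was exposed, so the agent optimizes coverage rather than bug discovery itself.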