Custom ASICs Show Major LLM Speed Gains

The performance of custom AI ASICs highlights the long-term advantage of hardware-software co-design. One new design, the Taalas HC1, runs the Llama 3.1 8B model at 16,960 tokens per second. The result suggests that specialized silicon, like Apple's, can deliver order-of-magnitude rather than incremental performance improvements for on-device machine learning.

- The Toronto-based startup Taalas was founded by Ljubisa Bajic, who previously founded the AI chip company Tenstorrent, along with former AMD and Tenstorrent engineers Drago Ignjatovic and Lejla Bajic.
- The HC1's performance is achieved by "hard-coding" the entire Llama 3.1 8B model, including its weights, directly onto a single 815 mm² chip built on TSMC's 6-nanometer process.
- This extreme specialization means a new chip must be created for each new model, but Taalas claims its "foundry optimal workflow" with TSMC allows for a two-month turnaround from model weights to deployable hardware.
- For comparison on the same Llama 8B model, Groq's LPU runs at approximately 877 tokens per second and Cerebras achieves around 2,000 tokens per second.
- Each Taalas HC1 card consumes about 200-250 watts, allowing a server rack with ten cards to be air-cooled at just 2.5 kW. This contrasts sharply with GPU racks, which can require 120-600 kW and liquid cooling.
- The company states the cost of inference on the HC1 is 0.75 cents per million tokens for the Llama 3.1 8B model.
- The core trade-off is sacrificing the programmability of GPUs for an order-of-magnitude gain in speed and efficiency on a single, static model.
- Taalas's next architecture, the HC2, plans to run frontier-class models by the end of the year using pipeline parallelism to spread the workload across multiple HC cards.
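
The arithmetic behind these comparisons can be checked with a short sketch. It uses only the vendor-reported figures quoted above. The function names are illustrative, and the energy-per-token estimate assumes that a single card sustains the quoted throughput at the quoted power draw, which the briefing does not state explicitly.

```python
# Figures quoted in the briefing (vendor-reported, not independently verified).
TOKENS_PER_SEC = {
    "Taalas HC1": 16_960,
    "Cerebras": 2_000,
    "Groq LPU": 877,
}
CARD_WATTS = (200, 250)  # quoted per-card power range, in watts
CARDS_PER_RACK = 10


def speedup_over(baseline, target="Taalas HC1"):
    """Throughput of `target` divided by throughput of `baseline`."""
    return TOKENS_PER_SEC[target] / TOKENS_PER_SEC[baseline]


def rack_power_kw():
    """(low, high) power for a ten-card rack, in kilowatts."""
    low, high = CARD_WATTS
    return (low * CARDS_PER_RACK / 1000, high * CARDS_PER_RACK / 1000)


def joules_per_token():
    """(low, high) energy per token for one HC1 card.

    Assumption: the card sustains the quoted throughput while drawing
    the quoted power, so watts / (tokens per second) = joules per token.
    """
    low, high = CARD_WATTS
    rate = TOKENS_PER_SEC["Taalas HC1"]
    return (low / rate, high / rate)


if __name__ == "__main__":
    for name in ("Cerebras", "Groq LPU"):
        print(f"HC1 vs {name}: {speedup_over(name):.1f}x")
    print("Rack power (kW):", rack_power_kw())
    print("Joules per token:", joules_per_token())
```

On these numbers the HC1 is roughly 8.5 times faster than Cerebras and about 19 times faster than Groq's LPU, and the 2.5 kW rack figure corresponds to ten cards at the upper end of the 200-250 watt range.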
