Taalas Claims 17k Tokens/Sec on Llama

Taalas has demonstrated inference speeds of 17,000 tokens per second on a Llama 3.1 8B model. The company achieves this by etching model weights directly into silicon, which eliminates memory-to-compute bottlenecks. The performance reportedly surpasses competitors like Groq and NVIDIA's B200, though the approach involves tradeoffs such as fixed models and quantization challenges.
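The memory-to-compute bottleneck can be made concrete with a back-of-the-envelope estimate. The sketch below assumes that a conventional accelerator reads every weight from external memory once per generated token in single-stream decoding, so speed is capped at memory bandwidth divided by model size. The 8-bit weights and the 8 TB/s bandwidth are illustrative assumptions, not figures from Taalas or from any measured system, and the function name is hypothetical.

```python
def max_decode_tokens_per_sec(num_params, bytes_per_param, mem_bandwidth_bytes_per_sec):
    """Upper bound on single-stream decode speed when every weight
    is read from memory once per generated token."""
    bytes_per_token = num_params * bytes_per_param
    return mem_bandwidth_bytes_per_sec / bytes_per_token


# Illustrative assumptions (not vendor or measured figures):
NUM_PARAMS = 8_000_000_000           # roughly an 8B-parameter model
BYTES_PER_PARAM = 1                  # assumes 8-bit weights
MEM_BANDWIDTH = 8_000_000_000_000    # assumes 8 TB/s of memory bandwidth

bound = max_decode_tokens_per_sec(NUM_PARAMS, BYTES_PER_PARAM, MEM_BANDWIDTH)
print(f"bandwidth-bound ceiling: {bound:.0f} tokens/sec")
# prints "bandwidth-bound ceiling: 1000 tokens/sec"
```

Under these assumptions the per-stream ceiling is about 1,000 tokens/sec, so a figure of 17,000 tokens/sec would only be reachable if weights were not streamed from external memory for each token. Batching many requests raises aggregate throughput on conventional hardware, which this simple bound does not capture.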

- The founding team, including CEO Ljubisa Bajic, previously founded the AI chip company Tenstorrent, with other key engineers hailing from AMD, Apple, and Google. Taalas has raised over $200 million in funding to support its hardware development.
- Taalas's approach creates a model-specific ASIC, which contrasts with the more flexible, software-driven approach of GPUs. While the core model is fixed in silicon, the architecture still supports fine-tuning through Low-Rank Adaptation (LoRA).
- The company claims it can go from a finalized set of model weights to a deployable custom PCIe card in approximately two months by partnering with TSMC.
- The core technical advantage comes from eliminating the "memory wall" bottleneck, where processors wait for model weights to be transferred from high-bandwidth memory (HBM) to on-chip compute units. By etching weights into the logic fabric of the chip, memory and compute are effectively merged.
- For comparison on similar models, benchmark aggregators show top inference speeds for Llama 3.1 8B from other specialized hardware providers like Cerebras at ~1,946 tokens/sec and Groq at ~588-877 tokens/sec.
- The process of preparing a model to be etched into silicon involves quantization, which reduces the precision of the model's parameters (e.g., from 32-bit floating-point to 8-bit integers). A key challenge is minimizing the accuracy loss that can occur during this conversion, often requiring techniques like quantization-aware training.
- Taalas plans to offer its technology both as a cloud-based Inference-as-a-Service and by selling the physical hardware directly to customers. The initial Llama 3.1 8B chip serves as a proof-of-concept for their architecture.
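The quantization step described above can be illustrated with a minimal sketch. It uses symmetric per-tensor 8-bit quantization, a common textbook scheme chosen here for illustration; the source does not say which scheme Taalas uses, and the function names are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the quantized integers."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.003, 1.0]   # toy example values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Per-weight rounding error is at most half a quantization step.
errors = [abs(w - r) for w, r in zip(weights, restored)]
print(q)             # [50, -127, 0, 100]
print(max(errors))   # small, bounded by scale / 2
```

Note how the small weight 0.003 collapses to 0: that loss of detail is the kind of accuracy cost that techniques such as quantization-aware training aim to reduce.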
