Taalas Claims 17k Tokens/Sec on Llama

Published by The Daily Scout

What happened

Taalas has demonstrated inference speeds of 17,000 tokens per second on a Llama 3.1 8B model. The company achieves this by etching model weights directly into silicon, which removes the need to move weights from memory to compute during inference. The performance reportedly surpasses Groq's hardware and NVIDIA's B200, though the approach involves tradeoffs: the model is fixed in silicon, and quantization can cost accuracy.
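
To see why moving weights matters, here is a rough back-of-envelope sketch. It assumes a conventional accelerator must read every weight from off-chip memory once per generated token, and it ignores KV-cache traffic, compute time, and batching. The 8 TB/s bandwidth figure and the function name are illustrative assumptions, not numbers from Taalas or any vendor benchmark.

```python
# Back-of-envelope: single-stream decode speed when every weight must be
# read from off-chip memory once per generated token.
# All hardware figures below are illustrative assumptions, not measurements.

def bandwidth_bound_tokens_per_sec(n_params: float,
                                   bytes_per_param: float,
                                   mem_bandwidth_bytes_per_sec: float) -> float:
    """Upper bound on tokens/sec for one sequence, ignoring KV-cache
    traffic, compute time, and batching."""
    bytes_per_token = n_params * bytes_per_param
    return mem_bandwidth_bytes_per_sec / bytes_per_token

N_PARAMS = 8e9        # approximate parameter count of an 8B model
HBM_BANDWIDTH = 8e12  # assumed 8 TB/s of memory bandwidth (hypothetical)

for label, width in [("16-bit", 2.0), ("8-bit", 1.0)]:
    bound = bandwidth_bound_tokens_per_sec(N_PARAMS, width, HBM_BANDWIDTH)
    print(f"{label} weights: ~{bound:,.0f} tokens/sec per stream")
```

Under these assumptions the bound works out to roughly 500 tokens/sec per stream with 16-bit weights and 1,000 with 8-bit weights. Batching raises aggregate throughput on conventional hardware, so this is only a per-stream ceiling, but it suggests why a design that never fetches weights from external memory could land in a different range.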

Why it matters

  • The founding team, including CEO Ljubisa Bajic, previously founded the AI chip company Tenstorrent, with other key engineers hailing from AMD, Apple, and Google. Taalas has raised over $200 million in funding to support its hardware development.
  • Taalas's approach creates a model-specific ASIC, which contrasts with the more flexible, software-driven approach of GPUs. While the core model is fixed in silicon, the architecture still supports fine-tuning through Low-Rank Adaptation (LoRA).
  • The company claims it can go from a finalized set of model weights to a deployable custom PCIe card in approximately two months by partnering with TSMC.
  • The core technical advantage comes from eliminating the "memory wall" bottleneck, where processors wait for model weights to be transferred from high-bandwidth memory (HBM) to on-chip compute units. By etching weights into the logic fabric of the chip, memory and compute are effectively merged.
  • For comparison on similar models, benchmark aggregators show top inference speeds for Llama 3.1 8B from other specialized hardware providers like Cerebras at ~1,946 tokens/sec and Groq at ~588-877 tokens/sec.
  • The process of preparing a model to be etched into silicon involves quantization, which reduces the precision of the model's parameters (e.g., from 32-bit floating-point to 8-bit integers). A key challenge is minimizing the accuracy loss that can occur during this conversion, often requiring techniques like quantization-aware training.
  • Taalas plans to offer its technology both as a cloud-based Inference-as-a-Service and by selling the physical hardware directly to customers. The initial Llama 3.1 8B chip serves as a proof-of-concept for their architecture.
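
The quantization tradeoff mentioned above can be illustrated with a minimal sketch of symmetric 8-bit quantization. This is a generic textbook scheme, not Taalas's actual method, which the company has not detailed here; the weight values and function names are made up for illustration.

```python
# Symmetric per-tensor int8 quantization of a small weight vector.
# Generic illustration only; the weights below are made-up values.

def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]

weights = [0.42, -1.27, 0.003, 0.88, -0.15]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# The round-trip error is what quantization-aware training tries to
# make the model robust to.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print("int8 codes:", codes)
print(f"scale: {scale:.4f}, worst round-trip error: {max_err:.4f}")
```

In this toy case the smallest weight rounds to zero, which is the kind of information loss that has to be managed before weights are committed to hardware that cannot be patched afterward.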

Key numbers

  • 17,000 tokens/sec: Taalas's demonstrated inference speed on a Llama 3.1 8B model.
  • Reportedly faster than Groq's hardware and NVIDIA's B200, with tradeoffs: a fixed model and quantization challenges.
  • Over $200 million: funding Taalas has raised to support its hardware development.
  • ~1,946 tokens/sec (Cerebras) and ~588-877 tokens/sec (Groq): top Llama 3.1 8B inference speeds from other specialized hardware providers, according to benchmark aggregators.
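
For a sense of scale, the short sketch below divides the claimed Taalas figure by the aggregator figures quoted above. It uses only the numbers in this article; the variable and function names are illustrative, and the comparison assumes the figures were measured under comparable conditions, which the article does not confirm.

```python
# Ratio of the claimed Taalas speed to the aggregator figures quoted above.
# Assumes the numbers are comparable, which is not confirmed by the source.

TAALAS_TPS = 17_000
OTHERS = {
    "Cerebras": (1_946, 1_946),  # single reported figure
    "Groq": (588, 877),          # reported range (low, high)
}

def speedup_range(claimed, low, high):
    """Return (min, max) speedup of `claimed` over a reported range."""
    return claimed / high, claimed / low

for name, (low, high) in OTHERS.items():
    smallest, largest = speedup_range(TAALAS_TPS, low, high)
    if low == high:
        print(f"{name}: about {smallest:.1f}x")
    else:
        print(f"{name}: about {smallest:.1f}x to {largest:.1f}x")
```

On these figures the claimed speed is about 8.7 times the Cerebras number and roughly 19 to 29 times the Groq range.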

What happens next

  • Taalas plans to offer its technology both as a cloud-based Inference-as-a-Service and by selling the physical hardware directly to customers.

