Taalas Claims 17k Tokens/Sec on Llama

Published February 26, 2026 by The Daily Scout

What happened

Taalas has demonstrated inference speeds of 17,000 tokens per second on a Llama 3.1 8B model. The company achieves this by etching model weights directly into silicon, which eliminates memory-to-compute bottlenecks. The performance reportedly surpasses competitors like Groq and NVIDIA's B200, though the approach involves tradeoffs such as fixed models and quantization challenges.
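
A rough back-of-envelope calculation shows why weight movement can dominate. In single-stream decoding on a conventional accelerator, each generated token requires reading roughly all model weights from memory, so throughput is capped near memory bandwidth divided by model size. The sketch below uses illustrative inputs (8 billion parameters, one byte per weight, and a hypothetical 3 TB/s of memory bandwidth) that are assumptions for the arithmetic, not figures from Taalas or any vendor.

```python
def max_decode_tokens_per_sec(params: float,
                              bytes_per_param: float,
                              bandwidth_bytes_per_sec: float) -> float:
    """Upper bound on batch-1 decode speed when every token streams all weights once."""
    model_bytes = params * bytes_per_param
    return bandwidth_bytes_per_sec / model_bytes


# Assumed, illustrative inputs (not vendor specifications).
PARAMS = 8e9            # roughly the parameter count of an 8B model
BYTES_PER_PARAM = 1.0   # 8-bit weights
BANDWIDTH = 3e12        # hypothetical 3 TB/s of external memory bandwidth

ceiling = max_decode_tokens_per_sec(PARAMS, BYTES_PER_PARAM, BANDWIDTH)
print(f"{ceiling:.0f} tokens/sec")  # prints "375 tokens/sec"
```

Under these assumptions the per-stream ceiling is 375 tokens/sec. The estimate ignores KV-cache traffic, batching, and speculative decoding, so it is only a first-order picture of the bottleneck that on-chip weights are meant to remove.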

Why it matters

- The founding team, including CEO Ljubisa Bajic, previously founded the AI chip company Tenstorrent, with other key engineers hailing from AMD, Apple, and Google. Taalas has raised over $200 million in funding to support its hardware development.
- Taalas's approach creates a model-specific ASIC, which contrasts with the more flexible, software-driven approach of GPUs. While the core model is fixed in silicon, the architecture still supports fine-tuning through Low-Rank Adaptation (LoRA).
- The company claims it can go from a finalized set of model weights to a deployable custom PCIe card in approximately two months by partnering with TSMC.
- The core technical advantage comes from eliminating the "memory wall" bottleneck, where processors wait for model weights to be transferred from high-bandwidth memory (HBM) to on-chip compute units. By etching weights into the logic fabric of the chip, memory and compute are effectively merged.
- For comparison on similar models, benchmark aggregators show top inference speeds for Llama 3.1 8B from other specialized hardware providers like Cerebras at ~1,946 tokens/sec and Groq at ~588-877 tokens/sec.
- The process of preparing a model to be etched into silicon involves quantization, which reduces the precision of the model's parameters (e.g., from 32-bit floating-point to 8-bit integers). A key challenge is minimizing the accuracy loss that can occur during this conversion, often requiring techniques like quantization-aware training.
- Taalas plans to offer its technology both as a cloud-based Inference-as-a-Service and by selling the physical hardware directly to customers. The initial Llama 3.1 8B chip serves as a proof-of-concept for their architecture.
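
To make the quantization tradeoff concrete, here is a minimal sketch of symmetric per-tensor 8-bit quantization in plain Python. It is a generic textbook scheme chosen for illustration; the source does not describe the scheme Taalas uses, and the function names and sample weights are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range used here.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integer codes."""
    return [v * scale for v in q]


# Hypothetical sample weights, for illustration only.
weights = [0.82, -1.27, 0.003, 0.45, -0.09]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

print(q)        # integer codes: [82, -127, 0, 45, -9]
print(max_err)  # worst-case rounding error, at most scale / 2
```

Note that the small weight 0.003 rounds to zero. Losses of this kind, accumulated over billions of parameters, are what techniques such as quantization-aware training try to compensate for.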

Key numbers

- 17,000 tokens/sec: the inference speed Taalas has demonstrated on a Llama 3.1 8B model, reportedly ahead of competitors like Groq and NVIDIA's B200.
- Over $200 million: funding Taalas has raised to support its hardware development.
- ~1,946 tokens/sec (Cerebras) and ~588-877 tokens/sec (Groq): top Llama 3.1 8B inference speeds that benchmark aggregators show for other specialized hardware providers.
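
The gap between these figures can be expressed as a speedup ratio. The short sketch below does only that arithmetic on the numbers quoted above; the figures come from different sources and may reflect different measurement conditions, so the ratios are indicative rather than a controlled comparison. The helper name `speedup_range` is hypothetical.

```python
TAALAS_TPS = 17_000  # tokens/sec claimed by Taalas

# (low, high) tokens/sec as quoted from benchmark aggregators.
others = {
    "Cerebras": (1_946, 1_946),
    "Groq": (588, 877),
}


def speedup_range(claimed, low, high):
    """Return (min, max) speedup of `claimed` over a competitor's reported range."""
    return claimed / high, claimed / low


for name, (low, high) in others.items():
    lo, hi = speedup_range(TAALAS_TPS, low, high)
    print(f"{name}: {lo:.1f}x to {hi:.1f}x")
# Cerebras: 8.7x to 8.7x
# Groq: 19.4x to 28.9x
```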

What happens next

Taalas plans to offer its technology both as a cloud-based Inference-as-a-Service and by selling the physical hardware directly to customers.
