Taalas Embeds LLM Weights Directly Into Silicon
Toronto-based startup Taalas has developed a chip that embeds the model weights of Llama 3.1 8B directly into its transistors. The purpose-built inference chip reportedly achieves 17,000 tokens per second, 74 times the throughput of an Nvidia H200, while drawing only 200W. The reported result is a 1,000-fold gain in performance per watt over traditional GPU-based inference.
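To put the headline figures side by side, here is a minimal sketch of the performance-per-watt arithmetic. The 17,000 tokens per second, 200W, and 74x figures come from the article; the 700W power draw for the H200 baseline is an assumption added for illustration, and the helper name is hypothetical.

```python
# Figures quoted in the article.
TAALAS_TOKENS_PER_S = 17_000
TAALAS_POWER_W = 200
SPEEDUP_VS_H200 = 74

# Assumption (not from the article): the H200 baseline draws about 700 W.
ASSUMED_H200_POWER_W = 700


def tokens_per_joule(tokens_per_s: float, power_w: float) -> float:
    """Tokens generated per joule of energy (tokens/s divided by watts)."""
    return tokens_per_s / power_w


# H200 throughput implied by the quoted 74x speedup.
implied_h200_tokens_per_s = TAALAS_TOKENS_PER_S / SPEEDUP_VS_H200

taalas_efficiency = tokens_per_joule(TAALAS_TOKENS_PER_S, TAALAS_POWER_W)
h200_efficiency = tokens_per_joule(implied_h200_tokens_per_s, ASSUMED_H200_POWER_W)
efficiency_ratio = taalas_efficiency / h200_efficiency

print(f"Implied H200 throughput: {implied_h200_tokens_per_s:.0f} tokens/s")
print(f"Taalas efficiency: {taalas_efficiency:.1f} tokens/J")
print(f"Efficiency ratio under these assumptions: {efficiency_ratio:.0f}x")
```

Under these assumptions the ratio works out to about 259x (74 times 700/200), so the 1,000-fold figure presumably rests on a different baseline or measurement setup than the one assumed here.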
Taalas is the brainchild of Ljubisa Bajic, the founder of AI chip company Tenstorrent, who serves as CEO; he is joined by Drago Ignjatovic and Lejla Bajic, both early engineers at Tenstorrent. The company has raised a total of $219 million, of which $169 million came in a recent round backed by investors including Quiet Capital and Fidelity. This positions Taalas as a serious contender in the specialized AI hardware space, taking aim at Nvidia's dominance.
The core innovation is a "direct-to-silicon" foundry model that hard-wires a specific LLM's weights into the chip's physical structure, manufactured on TSMC's 6nm process. This Application-Specific Integrated Circuit (ASIC) approach eliminates the memory bandwidth bottleneck that constrains traditional GPU setups: a GPU must fetch the weights from memory as it generates tokens, whereas here the model *is* the chip rather than being loaded into it. Taalas claims it can produce a custom chip for a new model in just two months.
The specialized architecture underpins the performance claims but comes with a critical trade-off: inflexibility. The chip is permanently configured for a single model, so moving to a new model such as Llama 4 would require a new silicon spin. Taalas plans to mitigate this by initially focusing on popular open-source models, and its roadmap includes a 20-billion-parameter model and a next-generation "HC2" platform by the end of 2026.
The competitive landscape for custom AI silicon is intensifying as the limitations of general-purpose hardware for AI inference become more apparent. Hyperscalers such as Google (TPU), Amazon (Inferentia), and Meta (MTIA) are investing heavily in their own custom chips to optimize performance and reduce the total cost of ownership of their AI services. Meanwhile, startups such as Groq, Cerebras, and SambaNova are targeting the inference market with their own architectures, validating the demand for specialized solutions beyond GPUs.
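The memory bandwidth bottleneck mentioned above can be illustrated with a back-of-the-envelope bound. When a GPU decodes a single stream, it has to read every weight from memory for each generated token, so throughput cannot exceed memory bandwidth divided by model size. The sketch below assumes 16-bit weights and roughly 4.8 TB/s of memory bandwidth for the H200; both values are assumptions added for illustration, not figures from the article, and the function name is hypothetical.

```python
def max_decode_tokens_per_s(
    num_params: float,
    bytes_per_param: float,
    bandwidth_bytes_per_s: float,
) -> float:
    """Upper bound on single-stream decode speed when every weight
    must be read from memory once per generated token."""
    model_bytes = num_params * bytes_per_param
    return bandwidth_bytes_per_s / model_bytes


LLAMA_PARAMS = 8e9                   # Llama 3.1 8B, rounded
ASSUMED_BYTES_PER_PARAM = 2          # assumption: 16-bit weights
ASSUMED_H200_BANDWIDTH = 4.8e12      # assumption: ~4.8 TB/s memory bandwidth

bandwidth_bound = max_decode_tokens_per_s(
    LLAMA_PARAMS, ASSUMED_BYTES_PER_PARAM, ASSUMED_H200_BANDWIDTH
)
print(f"Bandwidth-bound ceiling: {bandwidth_bound:.0f} tokens/s per stream")
```

With these inputs the ceiling is roughly 300 tokens per second for one stream, the same order of magnitude as the H200 baseline implied by the article's 74x figure (17,000 / 74, or about 230). Batching many requests lets a GPU amortize each weight read across users, so aggregate throughput can be much higher; the bound applies to per-stream speed. A chip with the weights wired into the logic has no such fetch to perform, which is the sense in which the bottleneck goes away.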