Runs transformer inference on FPGA
- GitHub project TALOS‑V2 surfaced this week with a hand-written RTL inference core that runs Karpathy-style microGPT directly on a DE1‑SoC Cyclone V FPGA.
- The notable claim is “50k+ tkps,” with one shared streamed matvec tile reused across Q, K, V, attention output, MLP, and LM-head stages.
- It matters because this is full transformer decode logic in plain RTL, not just one kernel and not a GPU-dependent demo. (github.com)

A transformer running on an FPGA is not the weird part anymore. The weird part is how complete this one is. TALOS‑V2 is a public GitHub project that implements a Karpathy-style microGPT inference path in hand-written SystemVerilog for the DE1‑SoC, which uses an Intel Cyclone V FPGA. The headline claim is big for this class of hardware: “50k+ tkps,” or more than 50,000 tokens per second.

It is not just a matmul block or a benchmark kernel. The repository packages a full inference stack: synthesizable RTL, fixed-point model ROMs, simulation files, Python host tools, TCL scripts, and a DE1‑SoC top level that can be driven over JTAG/MMIO. The active core is `microgpt_exact_core.sv`, and the whole thing is meant to build, program, and generate tokens on the board.

A lot of FPGA transformer work stops at one expensive subproblem, usually attention or tiled matrix multiplication. TALOS‑V2 goes further and wires the decode path together as one hardware-oriented state machine. The repo explicitly describes it as “one-token-at-a-time microGPT inference,” which means the project is about end-to-end generation, not just proving that one GEMM primitive works.

### What math does the FPGA really handle?

The interesting trick is reuse. Search snippets from the repo show the same streamed matvec tile being reused for the transformer’s learned projection stages: Q, K, V, attention output, the two feed-forward layers, and the LM head. There is also dedicated support logic for RMS scaling and a saturated divide engine for attention math, so the design is not only doing dense projections; it also covers the pieces that usually make “full transformer in hardware” annoying.

### Is it really a 16-lane engine?

Mostly yes, with one naming wrinkle. Search results describe an “active microgpt core” using a streamed 16-lane systolic MAC tile and mention a reduction across that 16-lane tile.
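
To make the lane-and-reduction idea concrete, here is a minimal behavioral sketch in Python. It is not the project's RTL and is not cycle-accurate. The lane count and the Q4.12 format come from the descriptions of the core; the helper names (`to_q`, `from_q`, `saturate`, `matvec_tile`, `project_qkv`), the wide accumulator, and the truncating right shift are assumptions made only for illustration.

```python
LANES = 16       # lane count as described for the active core
FRAC_BITS = 12   # Q4.12: 16-bit signed value with 12 fractional bits
Q_MIN, Q_MAX = -(1 << 15), (1 << 15) - 1


def saturate(value: int) -> int:
    """Clamp an integer to the signed 16-bit range instead of wrapping."""
    return max(Q_MIN, min(Q_MAX, value))


def to_q(x: float) -> int:
    """Quantize a float to Q4.12 with saturation (hypothetical helper)."""
    return saturate(round(x * (1 << FRAC_BITS)))


def from_q(q: int) -> float:
    """Convert a Q4.12 integer back to a float (hypothetical helper)."""
    return q / (1 << FRAC_BITS)


def matvec_tile(weights, x):
    """Compute y = W @ x on Q4.12 integers, one output row at a time.

    Each row is streamed in beats of LANES elements. Lane i accumulates
    element i of every beat, and the lanes are summed once per row.
    """
    out = []
    for row in weights:
        lane_acc = [0] * LANES  # wide accumulators (assumed; Python ints never overflow)
        for base in range(0, len(row), LANES):
            for lane in range(LANES):
                idx = base + lane
                if idx < len(row):  # a partial last beat leaves some lanes idle
                    lane_acc[lane] += row[idx] * x[idx]
        total = sum(lane_acc)  # reduction across the lanes
        # A Q4.12 * Q4.12 product carries 24 fractional bits; shift back to 12.
        out.append(saturate(total >> FRAC_BITS))
    return out


def project_qkv(wq, wk, wv, x):
    """Reuse the same tile for three projections by swapping the weights."""
    return matvec_tile(wq, x), matvec_tile(wk, x), matvec_tile(wv, x)
```

The point of the sketch is the shape of the computation: one routine, fed different weight matrices, serves every projection stage, and only the final reduced sum is scaled back and saturated. A real hardware tile would presumably pipeline the beats and fix the accumulator width, which this model does not capture.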
But the repo’s `info.md` still calls `systolic_matvec16_tile.sv` a “shared 4-lane streamed matvec tile,” which looks like stale documentation rather than a design contradiction. The newer RTL tree and search snippets point to the 16-lane version being the current implementation.

### What’s the catch?

Precision and scale. The core is not bit-exact to Karpathy’s floating-point Python reference. It uses Q4.12 fixed-point arithmetic, LUT-based exponential weights, saturation, and an xorshift-based sampler. In short, this is a hardware-shaped transformer, not a faithful floating-point clone. That tradeoff is probably part of how it gets the speed claim onto a modest FPGA board.

### Why put it on an FPGA at all?

Because the point is determinism and edge deployment, not chasing datacenter throughput. The DE1‑SoC is an educational Cyclone V platform, so getting a complete token generator onto it says something important: narrow transformer inference can be collapsed into a small, inspectable hardware design with board-level controls, LEDs, HEX displays, and JTAG host control. That is a very different philosophy from “just run a tiny model on a GPU.”

### Does 50k tokens per second mean useful LLM serving?

Not by itself. The repo is built around microGPT, a very small Karpathy-style model with fixed-point ROM weights, so the number should be read as a proof about architecture, not a claim that Cyclone V suddenly competes with modern GPU inference. The real signal is that a complete decode stack can be specialized hard enough that FPGA inference becomes fast, predictable, and self-contained for tiny models.

### Bottom line?

The news here is not “FPGA beats GPU.” It is that TALOS‑V2 appears to have crossed