FPGAs beating GPUs in places
A technical analysis reports FPGAs are now outperforming GPUs on certain LLM inference workloads where latency and power matter, because FPGAs allow custom pipelines and tailored memory access patterns. That makes them an interesting alternative for edge or cost‑sensitive enterprise search deployments. (eejournal.com)
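A back-of-envelope sketch of why memory access dominates this comparison: during single-stream decoding, each generated token has to read roughly every weight once, so memory bandwidth divided by model size gives a ceiling on tokens per second. This is a standard approximation rather than a figure from the analysis above. The function name, the 7B parameter count, and the 100 GB/s bandwidth are illustrative assumptions, not measurements of any device.

```python
def max_decode_tokens_per_s(n_params: float, bits_per_weight: float,
                            bandwidth_gb_s: float) -> float:
    """Upper bound on batch-1 decode rate, assuming every weight is
    read from memory exactly once per generated token."""
    weight_bytes = n_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / weight_bytes


if __name__ == "__main__":
    # Hypothetical inputs for illustration only: a 7B-parameter model
    # and 100 GB/s of memory bandwidth.
    for bits in (16, 4, 1.6):
        rate = max_decode_tokens_per_s(7e9, bits, 100.0)
        print(f"{bits:>4} bits/weight -> at most {rate:.1f} tokens/s")
```

Under these assumptions the ceiling rises in direct proportion to how aggressively the weights are compressed, which is why quantization and on-chip storage feature so heavily in FPGA designs.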
FlightLLM reports up to 6.0× higher energy efficiency and 1.8× better cost efficiency for FPGA mappings compared with “top GPUs” in its experiments. (arxiv.org)
TerEffic demonstrates a 1.6‑bit weight compression scheme and ternary compute units on FPGA to enable fully on‑chip LLM inference for edge deployments. (arxiv.org)
The open-source llama‑fpga project shows LLaMA2‑7B running in AWQ 4‑bit on commodity FPGA boards and provides reference designs and binaries for Xilinx/Alveo targets. (github.com)
A recent IEEE literature review found FPGA LLM implementations can reach comparable inference speeds while using up to ~80% less energy than GPUs, but memory bandwidth and on‑chip capacity routinely limit them to smaller or heavily quantized models. (ieee-dataport.org; ieeexplore.ieee.org)
Major FPGA vendors are positioning silicon for LLM workloads: Xilinx’s Versal AI Core series highlights low‑latency, low‑power deployment, and Achronix has publicly posted Speedster7t benchmark claims on Llama2 workloads. (xilinx.com; achronix.com)
Ongoing benchmarking efforts such as InferenceMAX run nightly cross‑chip inference suites, and multiple FPGA‑targeted LLM papers and toolchains have appeared on arXiv and in ICCAD/ICCAD‑adjacent venues during 2024–2025. (inferencex.semianalysis.com; arXiv.org)
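To make the 1.6‑bit figure concrete: a ternary weight takes one of three values, and five of them fit in one byte because 3^5 = 243 ≤ 256, which works out to 8 / 5 = 1.6 bits per weight. The sketch below shows one such base‑3 packing. It is an illustration of the arithmetic, not necessarily the encoding TerEffic uses, and the function names `pack_ternary` and `unpack_ternary` are hypothetical.

```python
from typing import List, Sequence

WEIGHTS_PER_BYTE = 5  # 3**5 = 243 fits in one byte, so 8 / 5 = 1.6 bits per weight


def pack_ternary(weights: Sequence[int]) -> bytes:
    """Pack weights from {-1, 0, 1} five per byte as base-3 digits."""
    if any(w not in (-1, 0, 1) for w in weights):
        raise ValueError("weights must be -1, 0, or 1")
    out = bytearray()
    for i in range(0, len(weights), WEIGHTS_PER_BYTE):
        group = weights[i:i + WEIGHTS_PER_BYTE]
        value = 0
        # The first weight in the group becomes the least significant digit.
        for w in reversed(group):
            value = value * 3 + (w + 1)
        out.append(value)
    return bytes(out)


def unpack_ternary(data: bytes, count: int) -> List[int]:
    """Recover the first `count` ternary weights from packed bytes."""
    weights: List[int] = []
    for byte in data:
        for _ in range(WEIGHTS_PER_BYTE):
            weights.append(byte % 3 - 1)
            byte //= 3
    # Drop the padding digits from a final, partially filled byte.
    return weights[:count]


if __name__ == "__main__":
    original = [1, 0, -1, -1, 1, 0, 1]
    packed = pack_ternary(original)
    assert unpack_ternary(packed, len(original)) == original
    print(f"{len(original)} weights stored in {len(packed)} bytes")
```

On hardware the unpacking would typically be a small lookup table or fixed logic rather than a loop; the Python loop is only meant to show that the mapping is lossless.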