TrilinearCIM FeFET for in‑memory attention

Researchers have posted a paper on TrilinearCIM, a ferroelectric field-effect transistor (FeFET) architecture that supports a three-operand multiply-accumulate (MAC) for in-memory attention without reprogramming cells at runtime. The design is part of a wider wave of compute-in-memory ideas aimed at moving attention work closer to where the data is stored (TrilinearCIM FeFET architecture).

Modern artificial intelligence chips still spend much of their time moving data between memory and logic, and a new April 8 paper proposes doing more of attention directly inside memory cells. (arxiv.org)

The paper, posted by researchers from Pennsylvania State University and the University of Notre Dame, describes TrilinearCIM, a double-gate ferroelectric field-effect transistor design for transformer attention. The authors are Md Zesun Ahmed Mia, Jiahui Duan, Kai Ni and Abhronil Sengupta. (arxiv.org)

Attention is the part of a transformer that compares one token with many others through repeated dot products, and those comparisons grow with sequence length. A 2024 survey on compute-in-memory for large language model inference said memory access can become more expensive than the arithmetic itself, a problem often called the memory wall. (arxiv.org)

Compute-in-memory tries to cut that traffic by storing values and performing math in the same array. In TrilinearCIM, the authors say back-gate modulation in a ferroelectric field-effect transistor lets one cell participate in a three-operand multiply-accumulate operation, instead of the usual two-operand pattern used in many memory arrays. (arxiv.org)

The immediate target is a practical bottleneck in attention hardware: some operands change at runtime, so conventional non-volatile memory designs often need fresh programming cycles. The TrilinearCIM paper says its design performs in-memory attention “without dynamic ferroelectric reprogramming,” which the authors present as a way to avoid throughput loss and endurance stress. (arxiv.org)

On the reported results, the paper says the design beat conventional ferroelectric field-effect transistor compute-in-memory on seven of nine General Language Understanding Evaluation tasks with BERT-base. It also reports up to 46.6% lower energy use and 20.4% lower latency, with a 37.3% area overhead, and includes Vision Transformer base results on ImageNet and CIFAR benchmarks. (arxiv.org)

The authors also make a broader claim: they call this the first architecture to perform complete transformer attention exclusively in non-volatile memory cores without runtime reprogramming. That places the work inside a fast-moving effort to relocate transformer work closer to stored weights and activations. (arxiv.org)

That effort is not limited to ferroelectric devices. A 2025 Nature Computational Science paper described an in-memory attention system built on gain-cell memories for generative transformers, reporting up to two orders of magnitude lower latency and four orders of magnitude lower energy use than graphics processing units in its setup. (nature.com)

Ferroelectric field-effect transistors have been drawing interest because they can store state without power and remain compatible with mainstream chip processes. A 2023 Nature Communications paper reported a 28-nanometer multi-level ferroelectric field-effect transistor crossbar for in-memory multi-bit multiply-accumulate operations, with 885.4 tera-operations per second per watt in its experiments. (nature.com)

TrilinearCIM is still a research paper, not a shipping chip, and its headline numbers come from the authors’ evaluation rather than a commercial product test. But the paper adds one more concrete design to a growing list of attempts to make attention happen where the data already sits. (arxiv.org; arxiv.org)
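
For readers who want the arithmetic behind the two-operand versus three-operand distinction, here is a minimal numerical sketch in Python. It is not the paper's circuit or data mapping: the class names, the `modulation` operand standing in for a back-gate signal, and the `program_cycles` counter are all hypothetical, chosen only to illustrate why a second runtime operand can remove the need to rewrite stored values.

```python
# Illustrative numerical model only. It does NOT reproduce the paper's
# device physics or array mapping; all names here are hypothetical.


def two_operand_mac(inputs, stored):
    """Conventional pattern: each cell multiplies one applied input by its stored value."""
    return sum(x * w for x, w in zip(inputs, stored))


def three_operand_mac(inputs, stored, modulation):
    """Trilinear pattern: a second runtime operand scales each cell's contribution."""
    return sum(x * w * g for x, w, g in zip(inputs, stored, modulation))


class TwoOperandColumn:
    """Assumed baseline: a runtime operand must be folded into the stored values."""

    def __init__(self, weights):
        self.weights = list(weights)
        self.cells = list(weights)
        self.program_cycles = 0

    def mac(self, inputs, modulation):
        # Rewrite every cell with weight * runtime operand, then do a 2-operand MAC.
        self.cells = [w * g for w, g in zip(self.weights, modulation)]
        self.program_cycles += 1
        return two_operand_mac(inputs, self.cells)


class ThreeOperandColumn:
    """Assumed trilinear column: stored values stay fixed after initial programming."""

    def __init__(self, weights):
        self.cells = list(weights)
        self.program_cycles = 0  # never incremented in this model

    def mac(self, inputs, modulation):
        return three_operand_mac(inputs, self.cells, modulation)


if __name__ == "__main__":
    weights = [2, -1, 3]
    baseline = TwoOperandColumn(weights)
    trilinear = ThreeOperandColumn(weights)
    for runtime_operand in ([1, 0, 2], [3, 1, 1]):
        a = baseline.mac([1, 2, 3], runtime_operand)
        b = trilinear.mac([1, 2, 3], runtime_operand)
        print(a, b, baseline.program_cycles, trilinear.program_cycles)
```

In this toy model both columns return the same sums, but the baseline rewrites its cells once per new runtime operand while the trilinear column never does. That mirrors, in very simplified form, the throughput and endurance argument the authors make.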
