EdgeCIM paper accelerates small LLMs

A new arXiv paper, EdgeCIM, describes a hardware‑software co‑design that uses Compute‑in‑Memory to speed up small language models for edge deployment. The work — which includes researchers from UC Irvine — aims to improve edge AI by reducing memory movement and energy for on‑device inference. (x.com)

A new arXiv paper says a Compute-in-Memory design called EdgeCIM can run small language models faster and with less energy on edge devices. (arxiv.org) Compute-in-Memory means doing math where the model’s weights are stored instead of shuttling data back and forth between memory and a processor. The EdgeCIM authors say that cuts the memory-traffic bottleneck that slows token-by-token text generation on phones, laptops, and embedded systems. (arxiv.org)

The paper was submitted on April 13, 2026 by Jinane Bazzi, Mariam Rakka, Fadi Kurdahi, Mohammed E. Fouda, and Ahmed Eltawil. It lists Kurdahi at the University of California, Irvine, and Bazzi and Eltawil at King Abdullah University of Science and Technology. (arxiv.org, arxiv.org)

The hardware target is not giant cloud models. The paper focuses on “small language models” up to 4 billion parameters, a size range the authors say is better matched to on-device use than data-center-scale systems. (arxiv.org)

The technical problem is decoding, the stage where a model generates one token at a time after it has read the prompt. The authors write that this phase is dominated by matrix-vector operations that are memory-bound, so general-purpose graphics processors spend too much time waiting on data. (arxiv.org)

EdgeCIM pairs a 65-nanometer Compute-in-Memory macro with a tile-based mapping scheme that spreads model work across pipeline stages. The paper says that layout is meant to raise parallelism while easing dynamic random-access memory bandwidth pressure. (arxiv.org, arxiv.org)

In the authors’ simulator, the design reached up to 7.3 times the throughput and 49.59 times the energy efficiency of an NVIDIA Orin Nano on Llama 3.2 1B. The paper also reports up to 9.95 times higher throughput than Qualcomm’s SA8255P on Llama 3.2 3B. (arxiv.org)

Across the benchmark set, which included TinyLLaMA, Llama 3.2, Phi-3.5-mini, Qwen2.5, SmolLM2, SmolLM3, and Qwen3, the authors report an average of 336.42 tokens per second and 173.02 tokens per joule at 4-bit integer precision. Those are simulation results, not shipping device measurements. (arxiv.org, arxiv.org)

That pitch fits a broader hardware trend. A 2024 survey of Compute-in-Memory architectures described the same “memory wall” problem: modern artificial intelligence chips burn time and power moving data between memory and compute units, and Compute-in-Memory tries to collapse those steps into one place. (arxiv.org, ieeexplore.ieee.org)

The next test is outside the simulator. If the reported gains hold in real silicon and full software stacks, EdgeCIM would give small language models a more plausible path onto battery-powered devices that cannot rely on cloud inference for every prompt. (arxiv.org)
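
For readers who want to see why token-by-token decoding is described as memory-bound, a back-of-envelope calculation may help. The Python sketch below is an illustration only, not the paper's simulator or method. It assumes every weight is read from memory once per decoding step, counts one multiply and one add per weight per token, and ignores activation and cache traffic. The function names, the 512-token prompt, and the 50 GB/s bandwidth figure are hypothetical and do not describe any real device.

```python
# Back-of-envelope sketch (assumptions, not figures from the EdgeCIM paper):
# - each weight is fetched from memory once per step
# - each weight costs one multiply and one add per token processed
# - activation and cache traffic are ignored


def weight_bytes(num_params: int, weight_bits: int = 4) -> float:
    """Bytes needed to store the weights at the given precision."""
    return num_params * weight_bits / 8


def arithmetic_intensity(tokens_in_flight: int, weight_bits: int = 4) -> float:
    """Operations performed per byte of weight traffic.

    tokens_in_flight is 1 for decoding (matrix-vector work) and the prompt
    length for prompt processing (matrix-matrix work).
    """
    ops_per_weight = 2 * tokens_in_flight   # multiply + add per token
    bytes_per_weight = weight_bits / 8
    return ops_per_weight / bytes_per_weight


def bandwidth_bound_tokens_per_second(
    num_params: int, bandwidth_bytes_per_s: float, weight_bits: int = 4
) -> float:
    """Upper bound on decode speed if weight traffic is the only limit."""
    return bandwidth_bytes_per_s / weight_bytes(num_params, weight_bits)


if __name__ == "__main__":
    # Decoding: one token at a time, so very little work per byte moved.
    print(arithmetic_intensity(1))      # 4.0 operations per byte
    # Reading a hypothetical 512-token prompt reuses each weight 512 times.
    print(arithmetic_intensity(512))    # 2048.0 operations per byte
    # Hypothetical 1-billion-parameter model on a hypothetical 50 GB/s link.
    print(bandwidth_bound_tokens_per_second(1_000_000_000, 50e9))  # 100.0
```

Under these simplified assumptions, decoding performs only a few operations per byte fetched, so memory bandwidth rather than arithmetic sets the ceiling on tokens per second. That is the gap Compute-in-Memory designs aim to close by keeping the math next to the stored weights.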
