arXiv paper shows pre-optimized kernels speed up CGRA programs
- Researchers from EPFL and Universidad Complutense de Madrid posted an arXiv paper describing Kernel-CGRA, a compiler flow that swaps detected matrix-multiply regions for precompiled kernels.
- The paper reports runtime speedups of up to 9.1x by using polyhedral loop reordering and splitting to uncover hidden matrix multiplication patterns.
- CGRA toolchains still miss or underuse complex loops, making specialized kernels attractive. (arxiv.org)

Coarse-grained reconfigurable arrays are chips built from many small processing tiles, like a grid of reusable calculator blocks wired together for one workload. They sit between fixed-function accelerators and general-purpose processors. (arxiv.org) (aha.stanford.edu)

The hard part is compilation. A compiler has to decide which operation lands on which tile and in which clock cycle, and that gets difficult when programs contain deep, multidimensional loop nests. (arxiv.org 1) (arxiv.org 2)

A new arXiv paper from Yuxuan Wang, María José Belda, Fernando Castro, Katzalin Olcoz, David Atienza, and Giovanni Ansaloni argues that one pattern deserves special treatment: matrix multiplication. The authors posted “Exploiting pre-optimized kernels with polyhedral transformations for CGRA compilation” in April 2026. (arxiv.org)

Matrix multiplication is the repeated row-by-column arithmetic behind attention, fully connected layers, convolutions, and other edge workloads. The paper says current CGRA compilation often maps those kernels poorly because the parallel structure is spread across several loop dimensions. (arxiv.org)

The authors’ system, called Kernel-CGRA, uses a library-style shortcut. Instead of compiling every operation from scratch, it detects matrix-multiply regions and replaces them with a handcrafted but parameterized kernel schedule tuned for different CGRA sizes. (arxiv.org)

To find those regions, the compiler uses polyhedral analysis, a mathematical way to rewrite loop nests while preserving the result. In plain terms, it reorders and splits loops until a buried matrix-multiply pattern becomes visible enough to swap in the precompiled kernel. (arxiv.org 1) (arxiv.org 2)

The rest of the program does not have to match that pattern. The paper says non-matrix-multiply regions still go through a standard control-data-flow-graph mapping flow, and the final CGRA configuration combines both paths. (arxiv.org)

That hybrid approach targets a known weakness in CGRA software stacks. A February 2025 arXiv evaluation of four CGRA toolchains found that some mappers struggled with more complex loops and often underutilized the available processing elements. (arxiv.org)

The new paper reports runtime speedups of up to 9.1x on benchmarks containing what it calls hidden matrix multiplications, across different CGRA sizes. It also adds a live-value store and retrieval mechanism so the specialized kernel can exchange data with the rest of the compiled program. (arxiv.org 1) (arxiv.org 2)

The larger claim is not that CGRAs stop needing general compilers. It is that compiler writers may get better results by mixing generic mapping with a small set of pre-optimized kernels for patterns that appear again and again in real code. (arxiv.org)

For teams weighing CGRA deployment, the paper offers a specific recipe: detect the hot loop shape, rewrite the loops, drop in a tuned kernel, and leave the leftovers to the normal toolchain. That is a narrower promise than full automation, but it is concrete. (arxiv.org)
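
The loop rewriting described above can be illustrated with a toy example. The Python sketch below is not the paper's implementation and does not use a polyhedral library; it only shows, by hand, how reordering and splitting a loop nest can leave the result unchanged while exposing a block that a tuned kernel could take over. The function names (`hidden_matmul`, `tile_kernel`, `rewritten_matmul`) and the `tile` parameter are hypothetical, chosen for illustration.

```python
def hidden_matmul(A, B, n, m, p):
    """Matrix multiply written with the reduction loop (k) outermost.

    The arithmetic is a plain C = A x B, but the loop order hides the
    familiar row-by-column shape.
    """
    C = [[0] * p for _ in range(n)]
    for k in range(m):
        for j in range(p):
            for i in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C


def tile_kernel(A, B, C, i0, j0, tile, m):
    """Hypothetical stand-in for a pre-optimized kernel.

    Computes one tile x tile block of C, starting at row i0, column j0.
    """
    for i in range(i0, i0 + tile):
        for j in range(j0, j0 + tile):
            acc = 0
            for k in range(m):
                acc += A[i][k] * B[k][j]
            C[i][j] = acc


def rewritten_matmul(A, B, n, m, p, tile):
    """Same computation after loop reordering and splitting.

    The i and j loops are moved outward and split into tiles, so each
    inner block matches the shape that tile_kernel handles.
    """
    # Assumption for this sketch: the tile size divides both dimensions.
    assert n % tile == 0 and p % tile == 0
    C = [[0] * p for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, p, tile):
            tile_kernel(A, B, C, i0, j0, tile, m)
    return C


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    # Both versions produce the same matrix.
    print(hidden_matmul(A, B, 2, 2, 2) == rewritten_matmul(A, B, 2, 2, 2, tile=2))
```

In a real flow, the body of `tile_kernel` would be replaced by a kernel schedule tuned for the target array size, and anything outside the matched block would go through the ordinary mapper.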