FPGA inference on the edge
Researchers posted a paper showing a heterogeneous inference framework that leverages FPGA acceleration and SqueezeNet on a DE10-Nano board to run deep neural network workloads at the edge. (x.com)
Artificial intelligence models usually run either on a general-purpose processor, which is easy to program, or on custom hardware, which is faster but harder to build. A paper published on January 14, 2024 describes a middle path on a low-cost DE10-Nano board that splits the work between an Arm processor and field-programmable gate array logic. (mdpi.com)

The paper, by Rafael Gadea-Gironés, José Luís Rocabado-Rocha, Jorge Fe, and Jose M. Monzó at the Universitat Politècnica de València, implements SqueezeNet v1.1 in a “heterogeneous” setup, meaning some neural-network operations stay in software while others move into reconfigurable hardware. The authors used PyTorch 1.13.1 as the software reference and OpenCL-written systolic-array accelerators for the hardware side. (mdpi.com)

A field-programmable gate array is a chip that can be rewired after manufacturing, closer to a configurable factory floor than a fixed central processor. The DE10-Nano board used here pairs that programmable logic with a dual-core Arm Cortex-A9 hard processor system, letting one device handle both ordinary software and custom acceleration. (terasic.com.tw, intel.com)

The neural network in the paper is SqueezeNet, a compact image model introduced in 2016 for machines with tight memory limits. Its authors reported AlexNet-level ImageNet accuracy with 50 times fewer parameters, and said model compression could shrink it to less than 0.5 megabytes. (arxiv.org)

That smaller footprint is the point for edge computing, where inference runs on the device instead of sending data to a cloud server. The Valencia paper says cloud-heavy designs are a poor fit for many electronics systems because latency, power consumption, and physical size all become constraints outside the data center. (mdpi.com)

The researchers frame their contribution less as a new neural network than as a workflow for deciding which layers belong in hardware and which should remain in software. They say that approach combines the flexibility of high-level synthesis flows with the tighter architectural control of hardware-description methods. (mdpi.com)

The hardware target also matters because the DE10-Nano is not a large server accelerator. Terasic describes the board as a development kit built around a Cyclone V system-on-chip, and Intel’s university-program materials describe the hard processor system as a dual-core Arm Cortex-A9 with DDR3 memory, the kind of platform often used in teaching, prototyping, and embedded-system work. (terasic.com.tw, intel.com)

Researchers have been trying to squeeze computer-vision models onto small field-programmable gate arrays for years, including earlier SqueezeNet-like designs on DE10-Nano-class hardware. The 2024 paper adds to that line of work by emphasizing a mixed software-hardware deployment flow instead of a hardware-only implementation. (ieeexplore.ieee.org, mdpi.com)

The paper does not change the basic tradeoff that edge systems still face: smaller models and custom accelerators can cut bandwidth and keep data local, but they must fit inside limited memory and chip resources. The appeal of this result is that it shows one way to run a known convolutional network on a sub-$250 educational board without moving the whole job back to the cloud. (arxiv.org, terasic.com.tw, mdpi.com)
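
For readers who want a feel for the memory constraint, the sketch below counts the weights and biases of SqueezeNet v1.1 layer by layer and converts the total into a storage size at a few bit widths. It is an illustration, not code from the paper: the layer shapes are an assumption taken from the commonly used torchvision `squeezenet1_1` definition with 1,000 ImageNet classes, and the helper names are hypothetical. It models only raw parameter storage, not activations, buffers, or the model compression the SqueezeNet authors describe.

```python
# Parameter count for SqueezeNet v1.1, ASSUMING the torchvision layer shapes.
# Helper names (conv_params, fire_params, ...) are illustrative only.

def conv_params(c_in, c_out, k):
    """Weights plus biases of a k x k convolution."""
    return c_in * c_out * k * k + c_out


def fire_params(c_in, squeeze, expand1x1, expand3x3):
    """A Fire module: 1x1 squeeze, then parallel 1x1 and 3x3 expand layers."""
    return (conv_params(c_in, squeeze, 1)
            + conv_params(squeeze, expand1x1, 1)
            + conv_params(squeeze, expand3x3, 3))


# (input channels, squeeze, expand 1x1, expand 3x3) for the eight Fire modules
FIRE_CONFIGS = [
    (64, 16, 64, 64), (128, 16, 64, 64),
    (128, 32, 128, 128), (256, 32, 128, 128),
    (256, 48, 192, 192), (384, 48, 192, 192),
    (384, 64, 256, 256), (512, 64, 256, 256),
]


def squeezenet11_params(num_classes=1000):
    """Total learnable parameters: first conv, Fire modules, classifier conv."""
    total = conv_params(3, 64, 3)
    total += sum(fire_params(*cfg) for cfg in FIRE_CONFIGS)
    total += conv_params(512, num_classes, 1)
    return total


def footprint_megabytes(n_params, bits_per_param):
    """Raw storage in megabytes (10**6 bytes), with no compression metadata."""
    return n_params * bits_per_param / 8 / 1e6


if __name__ == "__main__":
    n = squeezenet11_params()
    print(f"parameters: {n:,}")
    for bits in (32, 16, 8):
        print(f"{bits:>2}-bit storage: {footprint_megabytes(n, bits):.2f} MB")
```

Under these assumptions the count comes to 1,235,496 parameters, or about 4.9 megabytes at 32-bit precision. Lowering the bit width alone does not reach the sub-0.5-megabyte figure, which the SqueezeNet authors attributed to model compression.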