NVIDIA: GH200 hits single-digit microsecond inference, matching FPGAs
NVIDIA published results saying capital-markets models can reach single-digit microsecond inference when running neural networks on its GH200 Grace Hopper Superchip, a latency range long associated with FPGAs. The write-up presents a general-purpose accelerator as an alternative to FPGA determinism for squeezing deep-model inference into tight execution windows. (x.com)
A neural network can act like a fast pattern detector for market data, and NVIDIA said on April 2 that one of its systems ran that detector in single-digit microseconds on a finance benchmark. (developer.nvidia.com) The result came from an NVIDIA GH200 Grace Hopper Superchip in a Supermicro ARS-111GL-NHR server on the STAC-ML Markets Inference Tacana suite, which STAC describes as a benchmark for running inference on real-time market data. (developer.nvidia.com) (docs.stacresearch.com)
STAC-ML measures the time between receiving new input and producing an output from long short-term memory (LSTM) models, a type of neural network used for time-series forecasting; its Tacana suite uses a sliding window of recent data, while its Sumaco suite uses fresh data each run. (developer.nvidia.com) (stacresearch.com) NVIDIA said the GH200 system posted single-digit microsecond latency at the 99th percentile across multiple LSTM model sizes and said those results matched or exceeded specialized hardware such as field-programmable gate arrays and application-specific integrated circuits. (developer.nvidia.com)
Field-programmable gate arrays are chips that can be wired for one job after manufacturing, which is why trading firms have long used them when a few microseconds can decide whether an order reaches a market first. NVIDIA’s post says those firms have also been adding deeper neural networks as electronic markets have become more efficient. (developer.nvidia.com)
The benchmark spans three model sizes: LSTM_B is about six times larger than LSTM_A, and LSTM_C is roughly 200 times larger than LSTM_A. That spread matters because very small models punish software overhead, while larger ones stress raw compute. (developer.nvidia.com) (myrtle.ai)
NVIDIA said it reached the new numbers with persistent CUDA kernels, green context partitioning, and precomputation, which together keep work resident on the chip and cut setup delays between inputs.
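The quantity the benchmark reports, the time from receiving new input to producing an output, summarized at the 99th percentile, can be illustrated with a small Python sketch. This is not STAC's harness or NVIDIA's code. `toy_model` is a hypothetical stand-in for an LSTM forward pass, reading "fresh data each run" as non-overlapping blocks is an assumption, and the nearest-rank percentile is only one of several common definitions.

```python
import math
import random
import time
from collections import deque


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def sliding_windows(ticks, window):
    """Sliding-window input: each new tick shifts the window forward by one step."""
    buf = deque(maxlen=window)
    for tick in ticks:
        buf.append(tick)
        if len(buf) == window:
            yield list(buf)


def fresh_windows(ticks, window):
    """Fresh-data input (assumed here to mean non-overlapping blocks of new ticks)."""
    for start in range(0, len(ticks) - window + 1, window):
        yield ticks[start:start + window]


def toy_model(window):
    """Hypothetical stand-in for a model forward pass: the mean of the window."""
    return sum(window) / len(window)


def measure_latencies_ns(windows, model):
    """Per-inference time from input available to output produced, in nanoseconds."""
    latencies = []
    for w in windows:
        start = time.perf_counter_ns()
        model(w)
        latencies.append(time.perf_counter_ns() - start)
    return latencies


if __name__ == "__main__":
    rng = random.Random(0)
    ticks = [rng.random() for _ in range(10_000)]
    lat = measure_latencies_ns(sliding_windows(ticks, 32), toy_model)
    print(f"inferences={len(lat)} p50={percentile(lat, 50)} ns p99={percentile(lat, 99)} ns")
```

The absolute numbers from a Python loop say nothing about GPU or field-programmable gate array performance. The point is that a 99th-percentile figure is set by the slowest 1 percent of inferences, so occasional stalls matter as much as the typical case.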
The company also said performance stayed consistent when scaling from one to eight model instances. (developer.nvidia.com)
NVIDIA's comparison with specialized hardware lands in a field where field-programmable gate arrays already had published low-latency results. STAC’s vault includes an audited 2023 Tacana submission from Myrtle.ai on Intel Agilex field-programmable gate arrays, and Intel later highlighted that stack as strong on latency, throughput, and rack-space efficiency. (docs.stacresearch.com) (intel.com)
NVIDIA framed the tradeoff as cost and flexibility: building complex deep models directly on low-level hardware can require heavy engineering, while graphics processing units let firms train and deploy on a more general platform. The post also points to an open-source low-latency inference implementation for firms that want to reproduce the approach. (developer.nvidia.com)
The immediate takeaway is narrower than “artificial intelligence for trading.” A benchmark built around capital-markets forecasting models now has new published evidence that general-purpose accelerators can operate in the same single-digit-microsecond window that made specialized chips attractive in the first place. (developer.nvidia.com) (docs.stacresearch.com)