Microsecond inference benchmarks

Published by The Daily Scout

What happened

NVIDIA published benchmarks showing that single‑digit microsecond inference is achievable for capital‑markets workloads using tightly integrated hardware accelerators and user‑space networking stacks. The writeup highlights combining GPUs, FPGAs and network offloads with kernel‑bypass techniques to minimize context switches and memory copies in tick‑to‑trade use cases. The results suggest that the latency advantage now hinges on orchestration across accelerators and networking, not just on raw compute. (developer.nvidia.com)

Why it matters

NVIDIA published an audited benchmark run showing that its latest Grace Hopper “GH200” superchip can push neural‑network inference for trading into the single‑digit microsecond range, and it released the underlying reference code as open source. (developer.nvidia.com) (github.com) The company’s report and the audit by the STAC Benchmark Council position general‑purpose graphics processors as viable alternatives to the field‑programmable gate arrays and custom chips that have traditionally dominated the lowest‑latency trading stacks; Supermicro also published a partner report describing the audited system configuration used in the tests. (stacresearch.com) (learn-more.supermicro.com)

The tests used the STAC‑ML Markets “Tacana” pack, a benchmark that measures inference latency for time‑series models by running a sliding‑window update every tick. They exercised three long short‑term memory (LSTM) models of increasing size, called LSTM_A, LSTM_B and LSTM_C; the medium model is roughly six times larger than the smallest and the largest is about 200× larger. (developer.nvidia.com) (docs.stacresearch.com)

NVIDIA attributes the latency drop to three specific software/hardware techniques: persistent compute kernels (kernels that stay resident on the processor so they avoid repeated launch overhead), GPU partitioning via “green contexts” (a way to isolate multiple inference instances on the same chip so they don’t interfere with each other), and precomputation phases (moving fixed parts of the model calculation out of the per‑tick path). The same writeup also describes pairing those application changes with kernel‑bypass networking and network offloads: packet handling is moved out of the operating system’s network stack and onto dedicated NIC or DPU paths, so market data reaches the inference code with fewer context switches and memory copies. (developer.nvidia.com) (github.com) (nvidia.com)

The audited GH200 submission compares favorably with prior FPGA submissions to the same benchmark. STAC’s public note lists up to ~20% lower latency for the smallest model, ~8% for the medium model, and ~49% for the largest model versus a previous FPGA entry, and independent press coverage of the audited run quoted a 99th‑percentile latency in the low single‑digit microseconds on the smallest model. (stacresearch.com) (blockchain.news)

NVIDIA published the dl‑lowlat‑infer repository and documentation so engineering teams can reproduce the sliding‑window LSTM test, examine the custom CUDA kernels and timing techniques, and experiment with the same kernel‑bypass and offload patterns used in the audited stack. Together, the repository and the STAC audit give teams evaluating a GPU‑centric low‑latency path both a concrete implementation and an independent check on its results. (github.com) (stacresearch.com)
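To make the "sliding‑window update every tick" idea concrete, here is a minimal Python sketch. It is not the benchmark harness or NVIDIA's code: `SlidingWindow`, `toy_model` and `run` are hypothetical names, and the "model" is just a window mean standing in for an LSTM.

```python
from collections import deque


class SlidingWindow:
    """Fixed-length window over the most recent ticks (hypothetical helper)."""

    def __init__(self, length):
        # A deque with maxlen evicts the oldest tick automatically.
        self.buf = deque(maxlen=length)

    def push(self, tick):
        self.buf.append(tick)
        # True once enough ticks have arrived to fill the window.
        return len(self.buf) == self.buf.maxlen


def toy_model(window):
    # Stand-in for the LSTM: the mean of the window. Not the benchmark model.
    return sum(window) / len(window)


def run(ticks, length):
    win = SlidingWindow(length)
    outputs = []
    for tick in ticks:
        if win.push(tick):
            # One inference per tick, over the latest `length` ticks.
            outputs.append(toy_model(win.buf))
    return outputs


print(run([1.0, 2.0, 3.0, 4.0, 5.0], length=3))  # prints [2.0, 3.0, 4.0]
```

Each new tick shifts the window by one position and triggers a fresh inference, which is why per‑tick overheads such as kernel launches and memory copies matter so much in this setting.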
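The precomputation idea can be illustrated with a toy recurrent cell. The sketch below assumes, purely for illustration, that the part being moved out of the hot loop is the input projection of each tick; the source does not say which terms NVIDIA precomputes. The scalar weights, `naive` and `CachedCell` are all hypothetical, and a tanh cell stands in for a real LSTM.

```python
import math
from collections import deque

# Toy scalar weights chosen for illustration only.
W_X, W_H, B = 0.5, 0.8, 0.1


def naive(window):
    """Recompute the input projection for every tick in the window."""
    h = 0.0
    for x in window:
        h = math.tanh(W_X * x + B + W_H * h)
    return h


class CachedCell:
    """Computes W_X * x + B once, when a tick arrives, and reuses it."""

    def __init__(self, length):
        self.proj = deque(maxlen=length)

    def on_tick(self, x):
        # State-independent part: computed once per tick and cached.
        self.proj.append(W_X * x + B)
        # State-dependent part: still re-run over the whole window.
        h = 0.0
        for p in self.proj:
            h = math.tanh(p + W_H * h)
        return h


ticks = [0.2, -0.1, 0.4, 0.3, -0.5]
cell = CachedCell(length=3)
for i, x in enumerate(ticks):
    cached = cell.on_tick(x)
    reference = naive(ticks[max(0, i - 2): i + 1])
    assert math.isclose(cached, reference)
```

The cached and naive paths give the same result; the difference is that the projection of an old tick is never recomputed when the window slides. The recurrent term depends on the previous hidden state, so it cannot be cached across windows in the same way.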
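The benefit of a persistent kernel is easiest to see by analogy. The following Python sketch is a host‑side analogy only, not CUDA and not NVIDIA's implementation: a worker thread that is started once and loops over incoming work is contrasted with starting a new thread for every item. All function names are hypothetical.

```python
import queue
import threading


def persistent_worker(inbox, outbox, fn):
    # Started once; keeps looping until it receives the None sentinel.
    while True:
        item = inbox.get()
        if item is None:
            break
        outbox.put(fn(item))


def run_persistent(items, fn):
    inbox, outbox = queue.Queue(), queue.Queue()
    worker = threading.Thread(target=persistent_worker, args=(inbox, outbox, fn))
    worker.start()  # launch cost is paid once
    results = []
    for item in items:
        inbox.put(item)
        results.append(outbox.get())
    inbox.put(None)  # ask the worker to exit
    worker.join()
    return results


def _call(fn, item, box):
    box.append(fn(item))


def run_launch_per_item(items, fn):
    results = []
    for item in items:
        box = []
        # Launch cost is paid again for every item.
        t = threading.Thread(target=_call, args=(fn, item, box))
        t.start()
        t.join()
        results.append(box[0])
    return results


print(run_persistent([1, 2, 3], lambda v: v * v))       # prints [1, 4, 9]
print(run_launch_per_item([1, 2, 3], lambda v: v * v))  # prints [1, 4, 9]
```

Both functions return the same results; only the first avoids a start‑up step on each item. A real persistent GPU kernel typically polls device‑visible memory for new work rather than blocking on a queue, so this shows the shape of the idea, not its performance.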
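For readers unfamiliar with tail‑latency figures such as the 99th percentile quoted above, this short Python sketch shows one common way to compute them, the nearest‑rank method. It is not the STAC harness: `percentile` and `time_calls` are hypothetical helpers, and timings taken in Python this way are far too coarse for real microsecond‑level work.

```python
import math
import time


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def time_calls(fn, n):
    """Wall-clock latency of n calls to fn, in nanoseconds."""
    out = []
    for _ in range(n):
        start = time.perf_counter_ns()
        fn()
        out.append(time.perf_counter_ns() - start)
    return out


latencies = time_calls(lambda: sum(range(100)), 1000)
print("p50 ns:", percentile(latencies, 50))
print("p99 ns:", percentile(latencies, 99))
```

A 99th‑percentile figure means that 99 out of 100 inferences finished at or below that latency, so it describes the slow tail that trading systems care about, not the average case.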

Key numbers

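  • Inference latency: single‑digit microseconds on NVIDIA’s Grace Hopper “GH200” superchip in an audited run. (developer.nvidia.com)
  • Model sizes: the medium LSTM is roughly six times larger than the smallest, and the largest is about 200× larger. (developer.nvidia.com) (docs.stacresearch.com)
  • Latency versus a previous FPGA entry, per STAC’s public note: up to ~20% lower for the smallest model, ~8% for the medium model and ~49% for the largest. (stacresearch.com)
  • 99th‑percentile latency on the smallest model: low single‑digit microseconds, per independent press coverage. (blockchain.news)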
