Traders reveal microarchitecture tweaks enabling microsecond HFT on high‑frequency Intel CPUs
- Recent discussion among low-latency trading engineers centered on how Intel server features such as Data Direct I/O and cache partitioning can cut network-handling delays by keeping packets out of main memory.
- The most concrete detail is the cache tradeoff: Intel Data Direct I/O routes network traffic into the last-level cache instead of dynamic random-access memory, improving latency but risking cache eviction.
- The setup mirrors older research and vendor guidance on cache isolation, NUMA locality, and NIC placement rather than a newly announced Intel product or exchange rollout. (intel.com)

High-frequency trading tries to react to market data in microseconds, so the path from a network card to a CPU core matters as much as the trading code itself. (intel.com)

One of the key ideas in the recent discussion was Intel Data Direct I/O, or DDIO. Intel says DDIO lets Ethernet adapters send data straight into the processor cache instead of taking a detour through dynamic random-access memory, or DRAM. (intel.com)

That shortcut can save time, but it also turns the last-level cache into contested space. Research papers on DDIO and cache management say inbound network traffic can evict useful data from the shared cache if operators do not control where that traffic lands. (par.nsf.gov) (dl.acm.org)

That is where cache partitioning comes in. Intel Resource Director Technology, including Cache Allocation Technology, lets software reserve portions of the shared last-level cache for selected classes of work instead of letting every core and device compete for all of it. (eci.intel.com) (github.com) In plain terms, operators can try to give a trading thread its own protected shelf in the shared cache. Ubuntu’s real-time documentation describes the same mechanism as a way to improve temporal isolation between latency-sensitive and best-effort workloads. (documentation.ubuntu.com)

The discussion also referred to tight CPU affinity, which means pinning a process and its interrupts to specific cores. AMD’s affinity documentation says topology tools such as `lscpu` are used to map cores and NUMA domains so workloads stay close to the memory and devices they use. (gpuopen.com)

NUMA, short for non-uniform memory access, is the other half of the tuning story. Guidance from Dell and AMD on Nodes per Socket, or NPS, a BIOS setting on AMD EPYC servers, says splitting a socket into smaller NUMA domains can improve locality by keeping cores closer to their memory controllers and peripheral devices.
(dl.dell.com) (docs.amd.com)

Broadcom’s Ethernet tuning guide makes the same tradeoff explicit for network-heavy systems: it recommends NPS4 for speeds up to 100 gigabits per second because it provides better CPU and memory locality, while NPS1 is suggested for 200 gigabits per second and above. (techdocs.broadcom.com)

What traders are describing, then, is not a single secret switch but a stack of hardware controls. The network card is placed near the target cores, interrupts are pinned, cache ways are carved up, and packet traffic is steered to avoid blowing useful data out of shared cache. (eci.intel.com) (par.nsf.gov)

The caution in the discussion is also familiar from the documentation and research: every isolation feature costs something. Reserving cache reduces the pool available to everything else, and topology choices that help one workload can hurt throughput or flexibility for another. (dl.acm.org) (documentation.ubuntu.com)

So the practical lesson is narrower than the hype. If a trading shop wants steadier microsecond behavior on high-frequency Intel servers, the published evidence points to cache control, locality, and device placement as the knobs worth benchmarking first. (intel.com) (github.com)
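
For readers who want to see what "placing work near the network card" can look like on Linux, the following is a minimal sketch, not a method taken from the discussion. It assumes a Linux host that exposes the standard sysfs files for a PCI network interface and for NUMA nodes. The function names (`parse_cpulist`, `nic_numa_node`, `node_cpus`, `pin_near_nic`) and the `reserve` parameter are hypothetical and chosen for illustration. It only pins the calling process and leaves interrupt pinning out.

```python
import os
from pathlib import Path


def parse_cpulist(text: str) -> set[int]:
    """Parse a Linux cpulist string such as "0-3,8-11" into a set of CPU ids.

    Only plain ids and "lo-hi" ranges are handled, which is an assumption
    about what the sysfs files below contain.
    """
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def nic_numa_node(ifname: str) -> int:
    """Return the NUMA node sysfs reports for a PCI NIC (-1 means unknown)."""
    return int(Path(f"/sys/class/net/{ifname}/device/numa_node").read_text())


def node_cpus(node: int) -> set[int]:
    """Return the CPUs that belong to one NUMA node."""
    path = Path(f"/sys/devices/system/node/node{node}/cpulist")
    return parse_cpulist(path.read_text())


def pin_near_nic(ifname: str, reserve: int = 1) -> set[int]:
    """Pin the calling process to `reserve` CPUs on the NIC's NUMA node."""
    node = nic_numa_node(ifname)
    if node < 0:
        raise RuntimeError(f"{ifname}: kernel reports no NUMA node")
    chosen = set(sorted(node_cpus(node))[:reserve])
    os.sched_setaffinity(0, chosen)  # Linux-only; 0 means "this process"
    return chosen
```

Calling `pin_near_nic("eth0", reserve=2)` would restrict the process to the two lowest-numbered CPUs on the same node as a hypothetical interface named `eth0`. Whether the lowest-numbered CPUs are the right ones is a benchmarking question, not something the sketch decides.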
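
The idea of carving up cache ways can also be shown in miniature. On Linux, Cache Allocation Technology is exposed through the kernel's resctrl filesystem, where each control group has a `schemata` file with lines such as `L3:0=fffff;1=fffff`, one hexadecimal bitmask per cache. The sketch below only builds such a line and does not write it anywhere. The helper names and the 12-way cache in the usage note are assumptions for illustration, not figures from the discussion or from Intel.

```python
def way_mask(first_way: int, n_ways: int, total_ways: int) -> int:
    """Build a contiguous capacity bitmask covering `n_ways` cache ways.

    Intel CAT expects the set bits to be contiguous, so the mask is a single
    run of ones starting at bit `first_way`.
    """
    if n_ways < 1 or first_way < 0 or first_way + n_ways > total_ways:
        raise ValueError("requested ways do not fit in the cache")
    return ((1 << n_ways) - 1) << first_way


def masks_overlap(a: int, b: int) -> bool:
    """Return True if two capacity bitmasks share at least one cache way."""
    return (a & b) != 0


def schemata_line(masks: dict[int, int], total_ways: int) -> str:
    """Format per-cache masks as an L3 line in resctrl schemata syntax.

    `masks` maps a cache id (usually one per socket) to its bitmask.
    """
    width = (total_ways + 3) // 4  # hex digits needed for total_ways bits
    body = ";".join(
        f"{cache_id}={mask:0{width}x}" for cache_id, mask in sorted(masks.items())
    )
    return f"L3:{body}"
```

With an assumed 12-way cache, `way_mask(0, 4, 12)` gives a latency-sensitive group the four lowest ways and `way_mask(4, 8, 12)` leaves the remaining eight for everything else, with no overlap between them. That also shows the cost the article describes: every way reserved for one group is a way the rest of the system cannot use.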