Microsecond kernel-bypass wins
User-space kernel-bypass with zero-copy mmap and polling hit 3.2 μs logic execution and 6.2 μs round-trip latency in public benchmarks—eliminating context switches and PCIe overhead. The same thread contrasts London colocation at ~0.56 ms round-trip with 120 ms on a U.S. VPS, underscoring how colocated infrastructure still dominates microsecond-level edges. ( )
Benchmarks of this class typically stitch together AF_PACKET TPACKET_V3 mmap rings or AF_XDP UMEM with io_uring’s zero-copy receive and busy‑polling so packets never traverse the kernel copy path. (github.com)) User‑space poll‑mode drivers remove interrupt-driven context switches by continuously polling NIC descriptor rings, an approach documented in DPDK’s PMD model and demonstrated by hybrid stacks that retain kernel TCP services while giving an application fast data paths. (doc.dpdk.org)) The measurable latency floor often moves off the OS and into NIC firmware and the I/O fabric: Arista/Solarflare lab tests and vendor reports show sub‑microsecond half‑round application latencies and single‑digit microsecond server‑to‑server results on accelerated NICs with Onload. (arista.com)) PCIe and host DMA remain a dominant host‑side tax for programmable NICs and FPGAs, with academic PCIe characterization and practical measurements showing PCIe transaction and DMA completion can introduce additional microsecond‑scale delays. (cl.cam.ac.uk)) Exchange hosting and colocating client racks inside the same London data‑centre are explicit vendor strategies to shave tens to hundreds of microseconds off venue access time, while public internet paths to U.S. endpoints routinely measure RTTs in the tens to low‑hundreds of milliseconds depending on city pairs. (londonstockexchange.com)) Live deployments marry kernel‑bypass with NIC acceleration or FPGA offload for the lowest order‑to‑wire latency; industry test reports and vendor/firm claims cite roughly 700–800 ns event‑to‑order figures on Onload+FPGA setups used by HFT shops. (leaprate.com))