C++ low‑latency design thread
A social thread has been pushing C++ design patterns that favour upfront, deterministic architecture over runtime micro‑optimisations for sub‑millisecond trading systems. The discussion points to an arXiv paper and a CppCon talk and stresses design‑time choices for HFT feed handling, not just late‑stage performance tweaks (x.com) (x.com).
The argument that lit up social media was not really about C++. It was about where speed actually comes from. In late June and early July, a thread circulating among C++ and trading developers pushed a blunt claim: if you want sub‑millisecond behavior in a market data system, you do not start with compiler flags and heroic micro‑benchmarks. You start with architecture. The thread pointed readers to a 2023 arXiv paper by Paul Bilokon and Burak Gunduz, and to David Gross’s CppCon 2024 keynote on “Ultrafast Trading Systems in C++,” both of which make the same case in different ways (arxiv.org, cppcon.org, youtube.com).

That idea sounds obvious until you look at how performance work is usually discussed. Most programmers meet “optimization” as a late phase. A profiler finds a hot loop. Someone swaps in a faster container. Another person sprinkles `constexpr`, unrolls a loop, or trims an allocation. Bilokon and Gunduz do catalog exactly those kinds of tactics, including cache warming, loop unrolling, lock‑free structures, and compile‑time work. But even their paper frames them as part of a larger design discipline for latency‑critical systems, not as isolated tricks (arxiv.org, github.com).

Gross says the quiet part out loud. In the official CppCon description and the published video notes, he argues that low latency “cannot be an afterthought” and that most of the work needed to achieve it happens “upfront, at the design phase.” His talk is not pitched as a bag of C++ idioms. It is pitched as a blueprint for building a trading stack from scratch, with choices about cores, queues, memory movement, and concurrency made before the code is polished (cppcon.org, youtube.com).

That matters because feed handling in high‑frequency trading is hostile to wishful thinking. A market data handler does not get to process one message in peace.
It must absorb bursts, decode wire formats, update in‑memory state, and hand off work to downstream components, all while contending with the memory hierarchy and the scheduler. Every avoidable handoff adds jitter. Every unpredictable allocation risks a pause. Every shared queue can become a fight over cache lines. The whole point of “deterministic” design is to remove those surprises before they happen (github.com, aeron.io, github.com).

This is why the thread kept circling back to patterns like the Disruptor. The original LMAX Disruptor was built as a ring‑buffer‑based alternative to conventional queueing for inter‑thread communication. Its design goal was not just raw throughput. It was low and predictable latency under load. The official user guide describes it as a pre‑allocated concurrent ring buffer for asynchronous event processing, and the older LMAX architecture write‑up explains how the pattern grew out of a trading platform that tried to avoid the usual costs of contended concurrent code (lmax-exchange.github.io, lmax-exchange.github.io, martinfowler.com).

Bilokon and Gunduz made that pattern one of the three pillars of their paper. Alongside a low‑latency programming repository and an optimized statistical arbitrage example, they implemented the Disruptor pattern in C++ and reported that it outperformed traditional queuing approaches on measures including speed and cache use. Their GitHub repository is explicit about the same agenda. It is a collection of design patterns and benchmarked techniques for HFT, not a generic C++ performance cookbook (arxiv.org, github.com).

The striking part is how little of this is really language magic. C++ helps because it lets developers control layout, allocation, and timing with unusual precision. But the thread’s core message survives translation into other ecosystems.
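
To make the pre‑allocated ring buffer idea concrete, here is a minimal C++ sketch. It is not the LMAX implementation, nor the code from the Bilokon and Gunduz repository. It handles only one producer and one consumer, and the names `Tick`, `SpscRing`, and `run_demo`, along with the 64‑byte cache‑line figure, are assumptions made for illustration.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

// Hypothetical message type; a real feed handler would carry decoded market data.
struct Tick {
    std::uint64_t seq;
    std::int64_t price;
};

// Single-producer, single-consumer ring buffer. All storage is part of the
// object, so nothing is allocated once the ring has been constructed.
// Capacity must be a power of two so that index wrapping is a bit mask.
template <typename T, std::size_t Capacity>
class SpscRing {
    static_assert(Capacity > 0 && (Capacity & (Capacity - 1)) == 0,
                  "Capacity must be a power of two");

public:
    // Call from the producer thread only. Returns false when the ring is full.
    bool try_push(const T& value) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t tail = tail_.load(std::memory_order_acquire);
        if (head - tail == Capacity) return false;
        slots_[head & (Capacity - 1)] = value;
        // Release: the slot write above becomes visible before the new head.
        head_.store(head + 1, std::memory_order_release);
        return true;
    }

    // Call from the consumer thread only. Returns an empty optional when
    // no message is ready.
    std::optional<T> try_pop() {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t head = head_.load(std::memory_order_acquire);
        if (head == tail) return std::nullopt;
        T value = slots_[tail & (Capacity - 1)];
        // Release: the slot read above completes before the slot is freed.
        tail_.store(tail + 1, std::memory_order_release);
        return value;
    }

private:
    // 64 bytes is an assumed cache-line size. Keeping the two counters on
    // separate lines stops producer and consumer from invalidating each other.
    alignas(64) std::atomic<std::size_t> head_{0};
    alignas(64) std::atomic<std::size_t> tail_{0};
    alignas(64) std::array<T, Capacity> slots_{};
};

// Single-threaded walk-through of the interface: fill the ring, confirm that
// a further push is refused, then drain it and return the sum of the prices.
std::int64_t run_demo() {
    SpscRing<Tick, 4> ring;
    for (std::uint64_t i = 0; i < 4; ++i) {
        ring.try_push(Tick{i, 100 + static_cast<std::int64_t>(i)});
    }
    const bool accepted_when_full = ring.try_push(Tick{4, 104});
    std::int64_t sum = accepted_when_full ? -1 : 0;
    while (auto tick = ring.try_pop()) {
        sum += tick->price;
    }
    return sum;
}
```

The demo drives both ends from one thread only to show the full and empty behavior. In a feed handler, `try_push` and `try_pop` would run on separate threads, and every decision that shapes latency here (fixed capacity, one writer per counter, no locks, no allocation after construction) is made when the type is designed, not when it is profiled.
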
The same logic shows up in Aeron, a messaging stack used in electronic trading that advertises microsecond latency and predictable behavior by leaning hard on transport design, memory discipline, and message encoding rather than on decorative abstractions (aeron.io, github.com).

So the social thread landed because it cut against a familiar fantasy. There is no final sprint where a slow trading system becomes fast because someone finds the cleverest `std::` trick. The architecture has already decided most of the outcome. By the time a feed handler is allocating on the hot path, bouncing work across threads, and fighting queues it never needed, the nanoseconds are already gone. Gross’s talk title makes that sound dramatic. The more concrete detail is in the CppCon materials repository, where his slide deck sits under a plain filename: `When_Nanoseconds_Matter.pdf` (github.com, github.com).