Barclays Prioritizes Resilience in Trading Infrastructure

Barclays is taking a resilience-first approach to modernizing its trading infrastructure. The firm is focused on robust disaster recovery and real-time failover, suggesting that for some market participants, operational stability and risk control are as critical as raw execution speed.

The industry standard for high-frequency trading disaster recovery now involves hot-standby or active-active configurations, in which backup systems mirror primary servers in real time. This approach relies on automated failover to keep operations running without interruption, with cloud-based geographic redundancy often used to achieve uptimes exceeding 99.99%.

Barclays is navigating the on-premises versus cloud dilemma with a hybrid strategy, expanding its partnership with Hewlett Packard Enterprise for its private cloud. The bank has migrated over 50,000 workloads to its HPE GreenLake private cloud since 2021 and plans to double this figure, prioritizing the control and security of private infrastructure for sensitive financial data.

To achieve ultra-low latency, trading firms increasingly use kernel bypass technologies such as DPDK (Data Plane Development Kit). This technique gives user-space applications direct access to network interface cards, cutting out the Linux kernel's networking stack to reduce processing delays from microseconds to hundreds of nanoseconds. Field-Programmable Gate Arrays (FPGAs) represent the next frontier, moving trading logic from software to hardware. This allows for deterministic, nanosecond-level latency by executing tasks like market data parsing, pre-trade risk checks, and order generation in parallel directly on the chip, bypassing the CPU entirely. FPGA-based systems have demonstrated tick-to-trade latencies of 700 to 800 nanoseconds.

This performance comes with significant architectural trade-offs. Kernel bypass, for instance, requires dedicating entire CPU cores to polling in a loop at 100% utilization, which can double or quadruple server core counts compared to traditional setups. It also increases operational complexity, making applications harder to debug and deploy.

Ultimately, the goal is fully automated recovery orchestration that eliminates human error during high-pressure failure events. An automated failover system that can instantly transfer operations to backup infrastructure without manual intervention is considered the ultimate risk management strategy, ensuring that data processing and transactions continue without interruption.
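
As a rough illustration of what automated failover means in practice, the sketch below shows a heartbeat monitor that promotes a standby node when the primary goes silent. It is a minimal toy model, not a description of Barclays' systems: the class name `FailoverMonitor`, the node labels, and the 0.5-second timeout are all hypothetical, and a production design would also need state replication and protection against both nodes believing they are active at once.

```python
from dataclasses import dataclass, field


@dataclass
class FailoverMonitor:
    """Toy heartbeat monitor: promotes the standby when the active node goes silent."""

    timeout_s: float = 0.5      # hypothetical heartbeat deadline, in seconds
    active: str = "primary"     # node currently serving traffic
    last_heartbeat: dict = field(default_factory=dict)  # node name -> last timestamp
    events: list = field(default_factory=list)          # (time, old_active, new_active)

    def heartbeat(self, node: str, now: float) -> None:
        """Record that `node` reported healthy at time `now`."""
        self.last_heartbeat[node] = now

    def check(self, now: float) -> str:
        """Fail over if the active node is stale and the other node is fresh."""
        other = "standby" if self.active == "primary" else "primary"
        last_active = self.last_heartbeat.get(self.active)
        last_other = self.last_heartbeat.get(other)

        active_stale = last_active is None or now - last_active > self.timeout_s
        other_fresh = last_other is not None and now - last_other <= self.timeout_s

        # Only switch when there is a healthy node to switch to.
        if active_stale and other_fresh:
            self.events.append((now, self.active, other))
            self.active = other
        return self.active


if __name__ == "__main__":
    monitor = FailoverMonitor(timeout_s=0.5)
    monitor.heartbeat("primary", now=0.0)
    monitor.heartbeat("standby", now=0.0)
    print(monitor.check(now=0.2))   # both nodes healthy, no switch

    # The primary stops reporting; the standby keeps sending heartbeats.
    monitor.heartbeat("standby", now=0.9)
    print(monitor.check(now=1.0))   # primary silent past the deadline
    print(monitor.events)
```

Timestamps are passed in explicitly rather than read from the system clock, which keeps the decision logic deterministic and easy to test. The check makes no manual step part of the switch, which is the property the article describes as removing human error from failure events.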

Shared from Scout - Be the smartest in the room.