Telemetry for AI clusters

Marvell introduced the RELIANT Interconnect Telemetry Platform to give real‑time visibility across racks, optical/electrical modules and PCIe retimers so AI clusters can spot and avoid disruptions. The product targets the reliability gap that shows up as clusters scale and become more heterogeneous, where hidden interconnect issues can slow or interrupt inference and training jobs. For operators, finer‑grained telemetry is becoming a necessary tool as GPU farms and disaggregated fabrics proliferate. (x.com)

A giant artificial intelligence cluster can look healthy at the server level and still lose work because one cable, one optical module, or one signal-cleanup chip is quietly going bad between the machines. Marvell’s new RELIANT platform is built to watch those links in real time instead of waiting for a job failure to reveal the problem. (marvell.com) Those links are the roads between graphics processing units, and modern training systems use thousands of them at once. Marvell says connectivity has become the primary bottleneck in hyperscale artificial intelligence data centers as clusters scale up, scale out, and spread across more racks. (marvell.com)

Telemetry is the basic idea here. It means pulling live health data out of the hardware itself, the way a car dashboard turns engine sensors into warning lights before the car stops on the highway. (marvell.com) In an artificial intelligence cluster, the fragile parts are often the interconnects. Those are the optical and electrical links that move data between chips, servers, and racks, and Marvell says hidden link flaps and signal errors can lead to downtime if operators only see the problem after the fact. (marvell.com)

One of the components RELIANT watches is a Peripheral Component Interconnect Express retimer. A retimer is a signal booster and cleanup station for very fast chip-to-chip traffic, and Marvell’s Alaska P retimers are designed for graphics processing unit servers, memory disaggregation, and cable links inside data centers. (marvell.com) Marvell says RELIANT pulls data from optical digital signal processors, active electrical cables, active copper cable equalizers, co-packaged optics, switch silicon, network interface cards, modules, and Peripheral Component Interconnect Express retimers into one view. That matters because a single training run can cross all of those layers, so a fault in one box can look like a software slowdown somewhere else. (marvell.com)

The platform is not just a red-or-green dashboard. Marvell says operators can filter by rack, row, connection type, health rating, and link parameter, including forward error correction, bit error rate, and signal-to-noise ratio, with history going back up to 90 days and refresh intervals from 5 seconds to 5 minutes. (marvell.com) Marvell’s example screen shows why that granularity exists. In one live view, 70 racks contained 22,656 modules, and 70 of those modules, or 0.3%, were flagged as critical, which is the kind of tiny failure rate that can still waste expensive graphics processing unit time if nobody can find it fast. (marvell.com)

The timing fits a bigger buildout in artificial intelligence networking. Marvell cites LightCounting research forecasting that annual shipments of high-speed cables at 100 gigabits per second and above will grow 42% through 2030, with revenue growing 39% and cumulative shipments reaching 1.4 billion units. (marvell.com) That is why a telemetry product is showing up next to chips and optics in 2026. When clusters become more disaggregated and heterogeneous, the winning operator is not just the one with more graphics processing units, but the one that can see a sick link early enough to reroute, tune, or replace it before a training job falls over. (marvell.com)
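
To make the filtering idea concrete, here is a minimal Python sketch of how link samples might be rated and filtered by rack and health rating, plus the arithmetic behind the 0.3% figure. It is not Marvell's software or API: the LinkSample fields, the threshold values, and the function names are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkSample:
    """One telemetry reading for one module (hypothetical schema)."""
    rack: str
    module_id: str
    pre_fec_ber: float  # bit error rate before forward error correction
    snr_db: float       # signal-to-noise ratio in decibels


# Illustrative thresholds only; real limits depend on the link type.
CRITICAL_BER = 1e-4
WARNING_BER = 1e-6
MIN_SNR_DB = 15.0


def health(sample: LinkSample) -> str:
    """Map raw link parameters to a coarse health rating."""
    if sample.pre_fec_ber >= CRITICAL_BER or sample.snr_db < MIN_SNR_DB:
        return "critical"
    if sample.pre_fec_ber >= WARNING_BER:
        return "warning"
    return "healthy"


def filter_links(samples, rack=None, rating=None):
    """Keep samples matching the given rack and/or health rating."""
    return [
        s for s in samples
        if (rack is None or s.rack == rack)
        and (rating is None or health(s) == rating)
    ]


def critical_share(critical: int, total: int) -> float:
    """Percentage of modules flagged critical."""
    return 100.0 * critical / total


samples = [
    LinkSample("rack-01", "m1", 1e-9, 22.0),
    LinkSample("rack-01", "m2", 5e-4, 21.0),
    LinkSample("rack-02", "m3", 1e-5, 19.0),
]

bad = filter_links(samples, rack="rack-01", rating="critical")
print([s.module_id for s in bad])
print(round(critical_share(70, 22_656), 1))
```

With the figures from Marvell's example screen, 70 critical modules out of 22,656 works out to roughly 0.3%, matching the number quoted above. In a real deployment the ratings would come from the telemetry platform itself rather than from fixed thresholds like these.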
