40–70% GPU idle claim

Cedana’s co‑founder says most teams leave 40–70% of their existing GPU capacity unused and that failed jobs force costly restarts. The company pitches an OS‑layer checkpoint/restore tool that it says is deployed at AI companies and supercomputing labs. Vendors and infrastructure teams are citing that utilization figure as evidence of a major source of waste. (x.com)

Cedana lists Neel Master and Niranjan Ravichandra as founders, and the company took part in Y Combinator’s S23 batch. (ycombinator.com) The company says its system places checkpoint/restore and live migration between the Linux kernel and workloads to “save, migrate and resume” containerized CPU and GPU jobs. (docs.cedana.ai) Cedana’s docs and repo show that the product builds on CRIU and adds a proprietary GPU plugin which, per the documentation, currently supports NVIDIA GPUs and is installed as a Cedana plugin. (docs.cedana.ai)

Independent research and vendor surveys document widespread low GPU utilization: Microsoft’s empirical study examined 400 real deep‑learning jobs whose average GPU utilization was 50% or less. (microsoft.com) Industry reports and vendor analyses repeatedly estimate large idle fractions in practice, with multiple 2024–25 surveys finding that most organizations report GPU utilization below 70% at peak. (ai-infrastructure.org)

Academic and open projects show that GPU checkpointing is viable. The CRIUgpu paper demonstrates transparent GPU checkpoint/restore across workloads and reports large recovery‑time improvements, while NVIDIA documents a cuda‑checkpoint utility that currently has driver and feature limitations. (arxiv.org) (developer.nvidia.com)

Cedana’s marketing and third‑party case studies position the tool for hyperscalers, supercomputing labs and on‑prem clusters facing GPU failures or preemption, and a recent deployment note highlights use cases tied to high on‑prem GPU failure rates. (cedana.com) (northflank.com) Checkpoint/restore can avoid full restarts, but Microsoft’s analysis found that most low‑utilization problems stem from code, data movement, batching and configuration issues, which are often fixable with software or scheduling changes rather than checkpointing alone. (microsoft.com)
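To make the two cost lines above concrete, here is a rough back‑of‑the‑envelope sketch in Python. It is not Cedana’s method or any vendor’s formula: the function names, the cluster size and the failure time are hypothetical, and the model assumes that restores are instantaneous and that taking a checkpoint costs nothing.

```python
def idle_gpu_hours(num_gpus, hours, idle_fraction):
    """GPU-hours paid for but left unused over a period.

    idle_fraction is the share of capacity sitting idle (0.0 to 1.0),
    e.g. 0.4 to 0.7 for the range quoted in the briefing.
    """
    return num_gpus * hours * idle_fraction


def lost_gpu_hours(failure_at_h, num_gpus, checkpoint_interval_h=None):
    """GPU-hours of work that must be redone after a single failure.

    Without checkpoints the job restarts from zero, so everything up to
    the failure is lost. With periodic checkpoints only the work since
    the last checkpoint is lost. Simplifying assumptions (hypothetical):
    restore time and checkpoint overhead are both ignored.
    """
    if checkpoint_interval_h is None:
        lost_wall_clock_h = failure_at_h
    else:
        lost_wall_clock_h = failure_at_h % checkpoint_interval_h
    return lost_wall_clock_h * num_gpus


if __name__ == "__main__":
    # Hypothetical 100-GPU cluster over one day, at both ends of the claim.
    for fraction in (0.4, 0.7):
        print(fraction, idle_gpu_hours(100, 24, fraction))

    # Hypothetical 8-GPU job that fails 10.5 hours in.
    print("full restart:", lost_gpu_hours(10.5, 8))
    print("hourly checkpoints:", lost_gpu_hours(10.5, 8, checkpoint_interval_h=1.0))
```

The sketch treats idle capacity and restart losses as separate quantities, which mirrors the caveat above: checkpointing shrinks the second number but does nothing for idle time caused by code, data movement or configuration.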
