Cut GPU data‑loading idle time by ~40%
Engineering threads argue that "pre-sharded and shuffled" Parquet streaming can reduce GPU data-loading idle time by up to 40%, and that teams should track Model FLOPs Utilization (MFU) to spot bottlenecks during inference. That's a concrete optimization for video+AI pipelines, where I/O often throttles throughput.
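To make the "pre-sharded and shuffled" idea concrete, here is a minimal, standard-library-only sketch of the two pieces such a loader usually needs: a deterministic per-epoch shard assignment, and a background thread that keeps a small queue of batches ready while the consumer is busy. The names `assign_shards`, `stream`, `prefetch`, and `read_shard` are hypothetical and are not taken from any of the projects mentioned here. In a real pipeline `read_shard` would probably wrap something like `pyarrow.parquet.ParquetFile(path).iter_batches()`, which is an assumption about the setup rather than something the threads specify.

```python
import queue
import random
import threading
from typing import Callable, Iterable, Iterator, List

_DONE = object()  # sentinel marking the end of the stream


def assign_shards(shards: List[str], rank: int, world_size: int,
                  seed: int, epoch: int) -> List[str]:
    """Shuffle shard order deterministically per epoch, then give each
    worker a disjoint strided slice so no file is read twice."""
    order = sorted(shards)                       # same starting order on every worker
    random.Random(seed + epoch).shuffle(order)   # same shuffle on every worker
    return order[rank::world_size]


def stream(shards: Iterable[str],
           read_shard: Callable[[str], Iterable]) -> Iterator:
    """Yield batches shard by shard. `read_shard` is a hypothetical
    callable that returns the batches stored in one file."""
    for path in shards:
        yield from read_shard(path)


def prefetch(batches: Iterable, depth: int = 4) -> Iterator:
    """Read ahead on a background thread so I/O overlaps with compute.
    At most `depth` batches are buffered at any time."""
    buf: queue.Queue = queue.Queue(maxsize=depth)

    def producer() -> None:
        try:
            for batch in batches:
                buf.put(batch)        # blocks while the buffer is full
        finally:
            buf.put(_DONE)            # always unblock the consumer
            # A production loader would also forward exceptions.

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()
        if item is _DONE:
            return
        yield item


if __name__ == "__main__":
    files = [f"part-{i:05d}.parquet" for i in range(8)]   # made-up file names
    mine = assign_shards(files, rank=0, world_size=2, seed=0, epoch=0)
    fake_reader = lambda path: [f"{path}#batch{j}" for j in range(2)]
    for batch in prefetch(stream(mine, fake_reader), depth=2):
        print(batch)
```

Shuffling at shard granularity keeps reads sequential within each file, and the bounded queue caps host memory while still hiding read latency behind the consumer's work.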
Alimama's MUSE project documents a ParquetIterableDataset that streams pre-sharded Parquet files into distributed training jobs to minimize host-side preprocessing and keep worker pipelines saturated (deepwiki.com). NVIDIA's RAPIDS work on Parquet scans documents GPU-accelerated decoding and pipeline changes, and IBM Storage Scale benchmarks showed up to 1.96x higher throughput and a 49% latency reduction when combining GPU Direct Storage with GPU-side Parquet processing (developer.nvidia.com).
Spark performance guidance explicitly recommends partitioning and right-sizing file shards to avoid per-file open/seek overhead and enable parallel readers across executors, a point that underpins why pre-sharding improves multi-worker loaders (spark.apache.org). Forum and community posts report production PyTorch pipelines where file-based Parquet readers created a CPU bottleneck and left GPUs operating at roughly 7% utilization until teams switched to streaming or asynchronous loaders (discuss.pytorch.org).
Model FLOPs Utilization (MFU) is increasingly used to diagnose inference vs. I/O limits; PyTorch engineering published MFU improvements (from ~57% to ~68%) using FSDP and torch.compile on A100-class GPUs as an example of a measurable uplift (pytorch.org). Public implementations and tooling that overlap I/O with compute include the ParquetLoader async preloading pattern on GitHub, RAPIDS/cuDF for in-GPU Parquet decoding, and Ray Data for scaling out preprocessing and streaming, all cited as effective building blocks to raise MFU (github.com).
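Since MFU is the metric the sources lean on, a rough way to estimate it may help. The sketch below uses the common rule of thumb that a dense transformer costs about 6 FLOPs per parameter per token in training (about 2 for a forward-only pass), ignoring attention-specific terms, and divides achieved FLOPs per second by the hardware peak. The function names, the model size, and the throughput figure are all hypothetical; the 312 TFLOPS peak is NVIDIA's published dense BF16 number for the A100 and should be swapped for whatever hardware and precision are actually in use.

```python
A100_BF16_PEAK_FLOPS = 312e12  # published dense BF16 peak per A100; adjust per GPU


def flops_per_token(n_params: float, training: bool = True) -> float:
    """Rule-of-thumb cost per token for a dense transformer:
    ~6 * params for forward + backward, ~2 * params for forward only.
    Attention FLOPs are ignored, so this slightly underestimates."""
    return (6.0 if training else 2.0) * n_params


def mfu(tokens_per_second: float, n_params: float, n_gpus: int,
        peak_flops_per_gpu: float = A100_BF16_PEAK_FLOPS,
        training: bool = True) -> float:
    """Model FLOPs Utilization: achieved model FLOPs/s over peak FLOPs/s."""
    achieved = tokens_per_second * flops_per_token(n_params, training)
    return achieved / (n_gpus * peak_flops_per_gpu)


# Hypothetical run: 7B-parameter model, 8 GPUs, 30,000 tokens/s measured.
example = mfu(tokens_per_second=30_000, n_params=7e9, n_gpus=8)
print(f"estimated MFU: {example:.1%}")  # roughly 50%
```

Because the estimate scales linearly with measured throughput, a loader stall that drops tokens per second shows up as a proportional drop in MFU, which is what makes the metric useful for separating I/O limits from compute limits.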