Google ships advanced TPU optimization update for XProf

Google published an advanced TPU optimization update to XProf that adds continuous profiling and memory/throughput insights to help developers squeeze performance from TPUs. (opensource.googleblog.com)

Google published the post on the Google Open Source Blog on March 23, 2026, credited to Yogesh SY of Google’s AI Infra team. (opensource.googleblog.com)

XProf’s Continuous Profiling Snapshots run with roughly 7 µs of CPU overhead per packet and use a host-side circular buffer of about 2 GB to retain roughly the last 90 seconds of trace data. (opensource.googleblog.com) The snapshot system adds out-of-band state tracking: it polls P-state (voltage/frequency) and trace-drop counters and reconstructs that context separately from the trace stream, so a snapshot taken at an arbitrary moment still retains the hardware “ground truth.” (opensource.googleblog.com)

The new Utilization Viewer exposes chip-level views (execution units and DMA paths) with “achieved” vs. “peak” tooltips and is currently available in nightly XProf builds. (openxla.org) The LLO bundle and LLO-utilization tooling surface hardware resource usage for XLA custom calls and are explicitly framed as useful for diagnosing custom kernels (examples cited include Pallas and Mosaic). (openxla.org)

XProf is hosted under OpenXLA with documented low profiling overhead (typically <1% on TPUs and <5% on GPUs). The xprof package is published on PyPI: xprof 2.22.0 was released on March 1, 2026, and an xprof-nightly build dated 2026-03-13 is available. (openxla.org) Google’s Cloud documentation and the Cloud Diagnostics XProf library continue to recommend TensorBoard integration and programmatic or on-demand capture workflows for TPU VM profiling with XProf. (docs.cloud.google.com)
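
To make the circular-buffer figures above concrete, here is a minimal Python sketch of a byte-budgeted ring buffer that evicts the oldest packets first. It is an illustration only, not XProf's implementation: the class name `SnapshotRing`, its methods, and the helper `implied_rate_bytes_per_s` are hypothetical, and the rate calculation assumes "2 GB" means 2e9 bytes and that the buffer stays full over the whole 90-second window.

```python
from collections import deque


class SnapshotRing:
    """Hypothetical byte-budgeted ring buffer; oldest packets are evicted first."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._packets = deque()  # entries are (timestamp_s, payload)

    def append(self, timestamp_s, payload):
        """Store a packet, evicting old packets until the byte budget is met."""
        if len(payload) > self.capacity_bytes:
            raise ValueError("packet larger than the whole buffer")
        self._packets.append((timestamp_s, payload))
        self.used_bytes += len(payload)
        while self.used_bytes > self.capacity_bytes:
            _, old_payload = self._packets.popleft()
            self.used_bytes -= len(old_payload)

    def snapshot(self):
        """Return a copy of the retained packets, oldest first."""
        return list(self._packets)

    def window_seconds(self):
        """Time span covered by the retained packets."""
        if len(self._packets) < 2:
            return 0.0
        return self._packets[-1][0] - self._packets[0][0]


def implied_rate_bytes_per_s(capacity_bytes, window_s):
    """Average trace rate at which a full buffer covers exactly window_s seconds."""
    return capacity_bytes / window_s


if __name__ == "__main__":
    # Assumed reading of the quoted figures: 2e9 bytes retained over 90 s.
    rate = implied_rate_bytes_per_s(2e9, 90)
    print(f"implied average trace rate: {rate / 1e6:.1f} MB/s")
```

Under those assumptions the quoted figures imply an average trace rate on the order of 22 MB/s. A higher rate would shorten the retained window, and a lower one would lengthen it.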
