Lamini weighing TPU vs CUDA
Google's TPU tooling improvements (XProf) and broader cloud ASIC momentum are forcing small‑model vendors like Lamini to weigh TPU/ASIC options against CUDA portability for large inference runs. (opensource.googleblog.com)
Google published the XProf update on March 23, 2026, introducing Continuous Profiling Snapshots with a ~7µs per-packet CPU overhead, a ~2GB host-side circular buffer and roughly 90 seconds of retained trace context. (opensource.googleblog.com)
XProf’s new Utilization Viewer and LLO Bundle Visualization expose instruction‑level and memory‑usage hotspots for TPU workloads, enabling finer-grained bottleneck analysis than previous sampling profiles. (opensource.googleblog.com)
Cloud TPUs continue to push price/performance: Google cites a 2.3× price‑performance gain for TPU v5e versus TPU v4 on MLPerf Training 3.1, and earlier TPU inference gains of up to ~2.7× in MLPerf Inference. (cloud.google.com)
Google’s TPU lineup (including Ironwood and v5e) has been rolled into Google Cloud products and frameworks such as Vertex AI, GKE and PyTorch/JAX integrations, making TPU-backed training and serving broadly accessible to cloud customers. (cloud.google.com)
Lamini previously reported running “secretly” on more than 100 AMD Instinct MI200‑series GPUs, and in September 2023 said AMD’s ROCm had reached software parity with CUDA for LLMs. (crn.com)
On June 11, 2025, AMD announced it had hired Lamini founder Sharon Zhou and several employees, a move AMD described as a hiring (not an acquisition) that folded Lamini expertise into AMD’s AI group. (crn.com)
Lamini publicly prepared its stack for NVIDIA hardware in mid‑2024 while stressing that its AMD partnership was “deep but not exclusive,” signaling cross‑vendor optimization rather than single‑vendor lock‑in. (tomshardware.com)
Lamini’s documentation shows flexible deployment modes (Reserved, Self‑Managed, On‑Demand), model preloading workflows and a 20–30 minute load time for non‑preloaded models, and company case materials note Lamini runs on Supermicro GPU servers after a $25M Series A in May 2024. (docs.lamini.ai)