Ollama upgrades to NVIDIA B300

Ollama updated its cloud infrastructure to run on NVIDIA B300 hardware to boost throughput and reduce latency for Kimi/GLM models, a sign that cloud providers are iterating on specialized inference hardware for real-time workloads. That matters for firms evaluating cloud-hosted model inference for risk analytics or signal generation, where workloads are sensitive to tail latency.

Ollama's model registry lists Kimi K2.5 as a multimodal, agentic Mixture-of-Experts (MoE) variant (ollama.com), and documents the preceding Kimi K2 as an MoE with ~32 billion activated parameters and ~1 trillion total parameters. (ollama.com) Ollama's GLM-5 entry identifies it as a Z.ai MoE foundation model with ~744 billion total parameters, ~40 billion active parameters during inference, and a roughly 198k-token context window on the cloud listing. (ollama.com)

NVIDIA presents the DGX B300 (Blackwell Ultra) as its new data-center inference platform, and its product page advertises ~1.5× dense FP4 throughput and ~2× attention performance versus the DGX B200. (nvidia.com) Independent spec summaries and vendor listings show B300 chips shipping with up to ~288 GB of HBM3e per GPU, and third-party cloud trackers report per-GPU rental listings in the approximate range of $4.95-$18/hour depending on provider and configuration. (spheron.network) Third-party system vendors are shipping HGX B300 rack systems for hyperscale deployments; Supermicro's product release announced 4U and 2-OU HGX B300 solutions supporting up to 64 GPUs per rack. (ir.supermicro.com) Taken together with NVIDIA's stated 2× attention speedup over the B200, published concurrency studies using vLLM-style setups indicate that higher attention throughput translates into measurably lower end-to-end latency as request concurrency rises. (nvidia.com)

Ollama launched its cloud-models preview on Sept 19, 2025 and documents a Free/Pro/Max plan structure for cloud access, while multiple market trackers and product summaries list Pro at ~$20/month and Max at ~$100/month for higher usage tiers. (ollama.com)

Academic and industry analyses of datacenter tail latency identify queueing, multitenancy and TCP retransmits as the dominant P99 drivers in cloud networks. Engineering write-ups on trading stacks continue to recommend colocated FPGA/NIC kernel-bypass approaches (DPDK/OpenOnload) when firms require deterministic sub-millisecond or microsecond execution. (conferences.sigcomm.org)
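
To put the quoted figures side by side, the following is a rough back-of-envelope sketch in Python, not a sizing guide. It uses the parameter counts, the ~288 GB per-GPU memory figure and the $4.95-$18/hour rental range cited above. The assumption that weights occupy 4 bits (0.5 bytes) per parameter is ours, not the source's, and the estimate ignores KV cache, activations and runtime overhead, so real deployments would need more memory than shown. The function names are illustrative only.

```python
import math

# Approximate figures quoted in the brief above.
MODELS = {
    "Kimi K2": {"total_params": 1.0e12, "active_params": 32e9},
    "GLM-5": {"total_params": 744e9, "active_params": 40e9},
}
HBM_PER_GPU_GB = 288                      # "up to ~288 GB HBM3e per GPU"
RENTAL_USD_PER_GPU_HOUR = (4.95, 18.0)    # quoted third-party listing range

# ASSUMPTION (not from the source): weights stored at 4 bits = 0.5 bytes/param.
BYTES_PER_PARAM_FP4 = 0.5


def active_fraction(model):
    """Share of total parameters activated per token in the MoE model."""
    return model["active_params"] / model["total_params"]


def weight_footprint_gb(total_params, bytes_per_param=BYTES_PER_PARAM_FP4):
    """Memory for weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params * bytes_per_param / 1e9


def min_gpus_for_weights(total_params, hbm_gb=HBM_PER_GPU_GB):
    """Lower bound on GPU count: weights only, no KV cache or overhead."""
    return math.ceil(weight_footprint_gb(total_params) / hbm_gb)


def hourly_cost_range(gpu_count, price_range=RENTAL_USD_PER_GPU_HOUR):
    """(low, high) rental cost in USD per hour for the given GPU count."""
    low, high = price_range
    return gpu_count * low, gpu_count * high


if __name__ == "__main__":
    for name, model in MODELS.items():
        gpus = min_gpus_for_weights(model["total_params"])
        low, high = hourly_cost_range(gpus)
        print(
            f"{name}: {active_fraction(model):.1%} of parameters active, "
            f"{weight_footprint_gb(model['total_params']):.0f} GB of weights, "
            f">= {gpus} GPUs, ${low:.2f}-${high:.2f}/hour"
        )
```

Under these assumptions both models' weights would not fit on a single B300, and only a few percent of parameters are active per token, which is one reason per-GPU memory capacity and attention throughput both matter for serving MoE models of this size.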

