Local LLMs hit VRAM limit

Running the newest large local language models is bumping into GPU memory limits: DeepSeek-V3 (671B total parameters, ~37B active) needs roughly 512GB of VRAM to run efficiently, far above the 96GB on NVIDIA's RTX Pro 6000 (x.com). Some users point out that the Mac Studio's 512GB of unified memory meets the memory target but still lacks the raw GPU compute to match dedicated accelerators (x.com).

A local language model is software that runs on your own machine, and the bottleneck is often memory: the model’s weights have to fit in fast graphics memory before it can answer at usable speed. DeepSeek-V3 is a mixture-of-experts model with 671 billion total parameters and about 37 billion active per token, which is why hobbyists have been running into hardware ceilings. (github.com)

DeepSeek says the model uses a “Mixture-of-Experts” design, which means the system stores a huge pool of specialists and activates only part of it for each token. That cuts compute per token, but the full model still has to be loaded into memory for inference. (github.com)

That memory math is colliding with workstation hardware. NVIDIA’s RTX Pro 6000 Blackwell Workstation Edition ships with 96 gigabytes of GDDR7 memory, and the Max-Q version also tops out at 96 gigabytes, far below the several-hundred-gigabyte footprints developers cite for full DeepSeek-V3 deployments. (nvidia.com)

Independent hardware guides that estimate quantized deployments put DeepSeek-V3-class 671 billion parameter models at roughly 400 gigabytes of graphics memory at 4-bit precision, before extra overhead for cache and runtime. That is why users talking about “512 gigabytes” are usually describing a practical target for fitting the model plus working memory, not the raw parameter count alone. (gpuforllm.com)

Apple has been part of the discussion because its March 5, 2025 Mac Studio announcement said the M3 Ultra configuration supports up to 512 gigabytes of unified memory and can run language models with “over 600 billion parameters entirely in memory.” Unified memory means the central processor and graphics processor share one pool, so the machine can hold models that would not fit inside a single discrete graphics card. (apple.com)

Apple’s current Mac Studio specifications page lists the M3 Ultra model at up to 256 gigabytes of unified memory, while the March 2025 press release described configurations with up to 512 gigabytes. Apple support pages also list M3 Ultra Mac Studio memory as configurable to 256 gigabytes, suggesting the 512-gigabyte figure is tied to Apple’s launch materials rather than the current retail configuration shown on the specs pages. (apple.com)

Memory is only half the problem. NVIDIA markets the RTX Pro 6000 Blackwell at up to 110 teraflops of single-precision performance and 3,511 trillion operations per second of artificial intelligence performance, while Apple’s Mac Studio page emphasizes memory bandwidth and integrated graphics rather than accelerator-class inference throughput. (nvidia.com)

That leaves local users with a trade-off that has become more visible in 2026: a machine can have enough memory to hold a frontier-scale open model, or enough dedicated graphics compute to run one quickly, but getting both in one desktop is still expensive and uncommon. For many developers, the practical answer remains smaller distilled models, multi-GPU setups, or remote inference instead of a single-box DeepSeek-V3 build. (apple.com)
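
The memory arithmetic behind these figures can be sketched in a few lines of Python. This is a back-of-the-envelope estimate, not a sizing tool: it counts raw weight storage only, assumes decimal gigabytes, and uses hypothetical helper names (`weight_footprint_gb`, `cards_needed`) that come from no library. The 671 billion parameter count and the 96-gigabyte card size are the figures cited in this article; the 16-, 8-, and 4-bit precisions are illustrative choices.

```python
import math

# Figures cited in the article.
TOTAL_PARAMS = 671e9   # DeepSeek-V3 total parameters
ACTIVE_PARAMS = 37e9   # parameters active per token (approximate)
CARD_GB = 96           # memory on one RTX Pro 6000 Blackwell


def weight_footprint_gb(total_params: float, bits_per_param: float) -> float:
    """Raw weight storage in decimal gigabytes (1 GB = 1e9 bytes).

    Assumption: every parameter is stored at exactly `bits_per_param` bits.
    Cache, activations, and runtime overhead are not modeled.
    """
    return total_params * bits_per_param / 8 / 1e9


def cards_needed(footprint_gb: float, card_gb: float) -> int:
    """Lower bound on the number of cards needed to hold the weights alone."""
    return math.ceil(footprint_gb / card_gb)


if __name__ == "__main__":
    for bits in (16, 8, 4):  # illustrative precisions
        gb = weight_footprint_gb(TOTAL_PARAMS, bits)
        print(f"{bits:>2}-bit weights: {gb:8.1f} GB, "
              f"at least {cards_needed(gb, CARD_GB)} x {CARD_GB} GB cards")

    # Mixture-of-experts reduces compute per token, not the memory footprint.
    share = ACTIVE_PARAMS / TOTAL_PARAMS
    print(f"Active share of parameters per token: {share:.1%}")
```

Because the sketch counts only nominal bits per weight, its 4-bit result comes in below the roughly 400 gigabytes cited by the hardware guides, and it leaves out cache and runtime overhead entirely. Splitting a model across cards adds further overhead, so the card counts should be read as lower bounds.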
