DeepSeek hits 88.6% GPQA Diamond

- On May 19, 2026, home-lab researcher ShinkaIoT posted an 88.6% GPQA Diamond result using DeepSeek-V4-Flash across two NVIDIA DGX Spark systems. (api-docs.deepseek.com)
- The reported setup used a $32 cable and a brass heatsink for distributed inference; GPQA’s authors said PhD-level experts reached 65% accuracy. (arxiv.org)
- DeepSeek published V4 Preview on April 24, 2026, and NVIDIA lists DGX Spark through its marketplace and documentation pages. (api-docs.deepseek.com)

A home-lab benchmark claim is getting attention because it compresses a frontier-style result into a desktop-scale setup. On May 19, researcher ShinkaIoT posted that DeepSeek-V4-Flash scored 88.6% on GPQA Diamond using two NVIDIA DGX Spark machines, a $32 cable and a brass heatsink for distributed inference. (api-docs.deepseek.com)

GPQA Diamond is not a casual benchmark. (arxiv.org) The dataset’s authors described GPQA as a graduate-level, “Google-proof” multiple-choice benchmark in biology, physics and chemistry, and said domain experts with or pursuing PhDs reached 65% accuracy, or 74% after discounting mistakes they later identified. (api-docs.deepseek.com)

What makes the post notable is not just the number. It is the combination of an open-weight model, commodity-ish interconnect improvisation and a machine class NVIDIA markets as a personal AI supercomputer rather than a data-center server. NVIDIA says DGX Spark delivers up to one petaFLOP of FP4 AI performance and 128 GB of memory in a compact desktop system. (api-docs.deepseek.com)

### Why does 88.6% on GPQA Diamond matter?

GPQA Diamond is used as a hard science-reasoning test, and scores in the high 80s have typically been associated with top-end reasoning systems. (arxiv.org) OpenAI said earlier that o1 was the first model to surpass PhD-level experts on GPQA Diamond, while a later OpenAI post said GPT-5.2 Pro reached 93.2% and GPT-5.2 Thinking reached 92.4%. An 88.6% result therefore places the reported DeepSeek-V4-Flash run close to recent frontier numbers, even if it remains below the newest published OpenAI figures. (nvidia.com)

The X post’s claim that this narrows the gap to GPT-5.1 by about six months is an inference by the poster, not an official benchmark comparison published by DeepSeek or OpenAI.

### What exactly is DeepSeek-V4-Flash?

DeepSeek said on April 24 that DeepSeek-V4 Preview had gone live and been open-sourced, with DeepSeek-V4-Flash listed at 284 billion total parameters and 13 billion active parameters. (openai.com) The company described Flash as the faster, lower-cost member of the V4 family, while V4-Pro was positioned as the higher-end variant.

That architecture detail matters because it suggests the reported result did not require running all parameters densely at once. Mixture-of-experts designs activate only a subset of parameters per token, which is one reason models with large headline parameter counts can still be served on smaller systems. (openai.com) That is an inference from DeepSeek’s published model description.

### How unusual is the hardware setup?

NVIDIA says DGX Spark is designed for developers and researchers who want to prototype, deploy and fine-tune large AI models on a desktop. Its marketplace page says a single unit can work with models of up to 200 billion parameters locally, and lists a bundle price of $9,449. (api-docs.deepseek.com)

Using two units linked for distributed inference pushes beyond the single-box marketing story. The post’s mention of a $32 cable and a brass heatsink suggests the bottleneck was not only raw compute but also interconnect and thermal management, the same practical constraints that often determine whether a benchmark run is feasible outside a data center. (api-docs.deepseek.com) That reading is based on the hardware details in the post and NVIDIA’s published DGX Spark specifications.

### Does this prove frontier AI is moving into home labs?

One benchmark post does not establish a broad industry shift. (docs.nvidia.com) The result has not been independently published in a paper or replicated in an official leaderboard entry that surfaced in this reporting.

What it does show is narrower. DeepSeek released V4 Preview less than a month ago, GPQA remains a demanding science benchmark, and NVIDIA is already selling compact systems with enough memory and software support to make multi-box local inference plausible for well-funded individual researchers and small labs. (nvidia.com)

The next useful check will be replication. If ShinkaIoT or other researchers publish prompts, serving details, or a fuller evaluation log, those materials would show whether the 88.6% run can be reproduced on the same two-DGX-Spark setup with DeepSeek-V4-Flash. (api-docs.deepseek.com)
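
For readers who want to sanity-check the hardware story, the memory arithmetic behind a two-unit setup can be sketched in a few lines. The parameter counts and the 128 GB per-unit figure come from the DeepSeek and NVIDIA numbers cited above. Everything else is an assumption for illustration: the post does not state which weight precision was used, and the estimate covers weights only, ignoring KV cache, activations and runtime overhead.

```python
# Rough weight-memory estimate. Figures marked "reported" come from the
# article; bit widths and the weights-only simplification are assumptions.

TOTAL_PARAMS = 284e9       # reported total parameters for DeepSeek-V4-Flash
ACTIVE_PARAMS = 13e9       # reported active parameters per token
MEMORY_PER_UNIT_GB = 128   # reported DGX Spark memory per unit


def weight_footprint_gb(params: float, bits_per_weight: int) -> float:
    """Memory needed for the weights alone, in GB (10^9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


def fits_in_pool(params: float, bits_per_weight: int, units: int) -> bool:
    """True if the weights alone fit in the combined memory of `units` boxes."""
    return weight_footprint_gb(params, bits_per_weight) <= units * MEMORY_PER_UNIT_GB


# Share of parameters touched per token in the mixture-of-experts design.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

if __name__ == "__main__":
    print(f"active share per token: {active_fraction:.1%}")
    for bits in (4, 8, 16):  # hypothetical precisions, not confirmed by the post
        size = weight_footprint_gb(TOTAL_PARAMS, bits)
        print(
            f"{bits}-bit weights: {size:.0f} GB, "
            f"one unit: {fits_in_pool(TOTAL_PARAMS, bits, 1)}, "
            f"two units: {fits_in_pool(TOTAL_PARAMS, bits, 2)}"
        )
```

Under these assumptions, 4-bit weights would exceed one unit's memory but fit in the pooled memory of two, which is consistent with the post's two-box setup. It does not confirm how the run was actually configured.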
