DeepSeek hits 88.6% GPQA Diamond

Published by The Daily Scout

What happened

- On May 19, 2026, home-lab researcher ShinkaIoT posted an 88.6% GPQA Diamond result using DeepSeek-V4-Flash across two NVIDIA DGX Spark systems. (api-docs.deepseek.com)
- The reported setup used a $32 cable and a brass heatsink for distributed inference; GPQA’s authors said PhD-level experts reached 65% accuracy. (arxiv.org)
- DeepSeek published V4 Preview on April 24, 2026, and NVIDIA lists DGX Spark through its marketplace and documentation pages. (api-docs.deepseek.com)

Why it matters

A home-lab benchmark claim is getting attention because it compresses a frontier-style result into a desktop-scale setup. On May 19, researcher ShinkaIoT posted that DeepSeek-V4-Flash scored 88.6% on GPQA Diamond using two NVIDIA DGX Spark machines, a $32 cable and a brass heatsink for distributed inference. (api-docs.deepseek.com)

GPQA Diamond is not a casual benchmark. (arxiv.org) The dataset’s authors described GPQA as a graduate-level, “Google-proof” multiple-choice benchmark in biology, physics and chemistry, and said domain experts with or pursuing PhDs reached 65% accuracy, or 74% after discounting mistakes they later identified. (api-docs.deepseek.com)

What makes the post notable is not just the number. It is the combination of an open-weight model, an improvised, near-commodity interconnect and a machine class NVIDIA markets as a personal AI supercomputer rather than a data-center server. NVIDIA says DGX Spark delivers up to one petaFLOP of FP4 AI performance and 128 GB of memory in a compact desktop system. (api-docs.deepseek.com)

### Why does 88.6% on GPQA Diamond matter?

GPQA Diamond is used as a hard science-reasoning test, and scores in the high 80s have typically been associated with top-end reasoning systems. (arxiv.org) OpenAI said earlier that o1 was the first model to surpass PhD-level experts on GPQA Diamond, while a later OpenAI post said GPT-5.2 Pro reached 93.2% and GPT-5.2 Thinking reached 92.4%. An 88.6% result therefore places the reported DeepSeek-V4-Flash run close to recent frontier numbers, even if it remains below the newest published OpenAI figures. (nvidia.com)

The X post’s claim that this narrows the gap to GPT-5.1 by about six months is an inference by the poster, not an official benchmark comparison published by DeepSeek or OpenAI.

### What exactly is DeepSeek-V4-Flash?

DeepSeek said on April 24 that DeepSeek-V4 Preview had gone live and been open-sourced, with DeepSeek-V4-Flash listed at 284 billion total parameters and 13 billion active parameters. (openai.com) The company described Flash as the faster, lower-cost member of the V4 family, while V4-Pro was positioned as the higher-end variant.

That architecture detail matters because it suggests the reported result did not require running all parameters densely at once. Mixture-of-experts designs activate only a subset of parameters per token, which is one reason models with large headline parameter counts can still be served on smaller systems. (openai.com) That is an inference from DeepSeek’s published model description.

### How unusual is the hardware setup?

NVIDIA says DGX Spark is designed for developers and researchers who want to prototype, deploy and fine-tune large AI models on a desktop. Its marketplace page says a single unit can work with models of up to 200 billion parameters locally, and lists a bundle price of $9,449. (api-docs.deepseek.com)

Using two units linked for distributed inference pushes beyond the single-box marketing story. The post’s mention of a $32 cable and a brass heatsink suggests the bottleneck was not only raw compute but also interconnect and thermal management, the same practical constraints that often determine whether a benchmark run is feasible outside a data center. (api-docs.deepseek.com) That reading is based on the hardware details in the post and NVIDIA’s published DGX Spark specifications.

### Does this prove frontier AI is moving into home labs?

One benchmark post does not establish a broad industry shift. (docs.nvidia.com) The result has not been independently published in a paper or replicated in an official leaderboard entry that surfaced in this reporting.

What it does show is narrower. DeepSeek released V4 Preview less than a month before the post, GPQA remains a demanding science benchmark, and NVIDIA is already selling compact systems with enough memory and software support to make multi-box local inference plausible for well-funded individual researchers and small labs. (nvidia.com)

The next useful check will be replication. If ShinkaIoT or other researchers publish prompts, serving details, or a fuller evaluation log, those materials would show whether the 88.6% run can be reproduced on the same two-DGX-Spark setup with DeepSeek-V4-Flash. (api-docs.deepseek.com)
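The hardware figures quoted above (284 billion total parameters, 13 billion active, 128 GB per DGX Spark) allow a rough check of why two boxes might be needed. The sketch below is a back-of-envelope estimate, not a description of ShinkaIoT's actual configuration: the bytes-per-parameter values are assumptions, the post's serving precision is not stated in this reporting, and the estimate counts weights only, ignoring KV cache, activations and system overhead. It also assumes all expert weights stay resident in memory, so the 13 billion active parameters reduce per-token compute rather than the storage footprint.

```python
import math

# Figures quoted in the article.
TOTAL_PARAMS = 284_000_000_000   # DeepSeek-V4-Flash total parameters
ACTIVE_PARAMS = 13_000_000_000   # parameters active per token
UNIT_MEMORY_GB = 128             # memory per DGX Spark unit

# ASSUMPTION (not from the article): storage cost per parameter at each precision.
BYTES_PER_PARAM = {"fp4": 0.5, "fp8": 1.0, "bf16": 2.0}


def weight_gb(params: int, bytes_per_param: float) -> float:
    """Approximate weight footprint in decimal gigabytes (weights only)."""
    return params * bytes_per_param / 1e9


def units_needed(params: int, bytes_per_param: float,
                 unit_gb: float = UNIT_MEMORY_GB) -> int:
    """Smallest number of units whose combined memory holds the weights."""
    return math.ceil(weight_gb(params, bytes_per_param) / unit_gb)


if __name__ == "__main__":
    for name, width in BYTES_PER_PARAM.items():
        total = weight_gb(TOTAL_PARAMS, width)
        active = weight_gb(ACTIVE_PARAMS, width)
        print(f"{name}: all weights ~{total:.0f} GB, "
              f"active per token ~{active:.1f} GB, "
              f"units needed: {units_needed(TOTAL_PARAMS, width)}")
```

Under these assumptions, 4-bit weights alone come to about 142 GB, which exceeds one 128 GB unit but fits within two. That is consistent with the two-box setup described in the post, though it does not confirm which precision was used.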

Key numbers

  • On May 19, 2026, home-lab researcher ShinkaIoT posted an 88.6% GPQA Diamond result using DeepSeek-V4-Flash across two NVIDIA DGX Spark systems. (api-docs.deepseek.com)
  • The reported setup used a $32 cable and a brass heatsink for distributed inference; GPQA’s authors said PhD-level experts reached 65% accuracy. (arxiv.org)
  • DeepSeek published V4 Preview on April 24, 2026, and NVIDIA lists DGX Spark through its marketplace and documentation pages. (api-docs.deepseek.com)

What happens next

  • The next useful check will be replication.
  • If ShinkaIoT or other researchers publish prompts, serving details, or a fuller evaluation log, those materials would show whether the 88.6% run can be reproduced on the same two-DGX-Spark setup with DeepSeek-V4-Flash. (api-docs.deepseek.com)
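One way to judge a replication attempt is to ask how much a score could move from sampling noise alone. The sketch below computes a simple normal-approximation (Wald) interval around the reported accuracy. It assumes the GPQA Diamond subset has 198 questions, a figure taken from the GPQA paper rather than from this article, and it treats the 88.6% as a single pass over independent questions, which may not match how the run was actually scored.

```python
import math


def accuracy_interval(accuracy: float, n_questions: int, z: float = 1.96):
    """Normal-approximation (Wald) interval for a benchmark accuracy.

    z = 1.96 corresponds to roughly 95% coverage.
    """
    std_err = math.sqrt(accuracy * (1.0 - accuracy) / n_questions)
    return accuracy - z * std_err, accuracy + z * std_err


# ASSUMPTION: 198 questions in GPQA Diamond (from the GPQA paper, not the article).
N_DIAMOND = 198
REPORTED = 0.886  # score reported in the post

if __name__ == "__main__":
    low, high = accuracy_interval(REPORTED, N_DIAMOND)
    print(f"approx. 95% interval: {low:.1%} to {high:.1%}")
```

Under these assumptions the interval spans roughly 84% to 93%, so a rerun that lands a few points above or below 88.6% would not by itself contradict the original claim.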

