LLM one‑shot codes UI

Alibaba’s Qwen 122B (10B active) reportedly hit 61.5 tok/s on 2x H200 NVL with 280+ GB of VRAM in BF16 and one‑shot coded a 1,054‑line GPU marketplace UI with particle effects, versus 769 lines from NVIDIA’s Nemotron Super 120B in the same test. Earlier community tests also show Qwen 27B building a full game on a single RTX 3090, underscoring how smaller models paired with clever toolchains are being used for real game dev tasks. (x.com)

Qwen3.5-122B is a sparse Mixture‑of‑Experts model that activates roughly 10 billion parameters per token (the “A10B” designation), while NVIDIA’s Nemotron‑3 Super documentation describes a 120B model with about 12B active parameters per forward pass. (qwen.ai; huggingface.co)

The hardware cited in the community benchmark, the H200 NVL, uses HBM3e memory with ~141 GB per card, so a dual‑card setup provides the reported 280+ GB of GPU memory used for large MoE deployments. (nvidia.com; techpowerup.com) Qwen’s own deployment guidance and third‑party infra notes show the 122B MoE variant is designed for long‑context, agentic workflows and is routinely benchmarked across SGLang/vLLM toolchains and BF16/FP16 modes for throughput and memory tradeoffs. (lambda.ai; qwen.readthedocs.io)

Community reproducibility threads and how‑to guides demonstrate that Qwen 27B can be quantized and run on a single RTX 3090/4090 with 24 GB VRAM using Q4/AWQ/GGUF toolchains, with multiple users publishing step‑by‑step deployment scripts and benchmark notes. (docs.bswen.com; github.com) Separately, bench posts and a Hugging Face discussion document an agentic workflow in which smaller open models (including community builds of Nemotron variants) used tooling and quantization to one‑shot assemble a functioning GPU marketplace UI on consumer hardware, illustrating the same “small model + clever toolchain” pattern. (huggingface.co discussions)

Performance and output size in these experiments are highly sensitive to quantization format, backend (vLLM/SGLang/llama.cpp/ollama), and attention/KV‑cache strategies, and several issues and bench threads warn that different quantizers or framework versions can shift latency and memory numbers dramatically. (sgl-project GitHub issue; qwen.readthedocs.io)
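
The memory figures above can be sanity‑checked with back‑of‑the‑envelope arithmetic: weight footprint is roughly parameter count times bytes per parameter. The sketch below is only an illustration under stated assumptions. The helper names (`weight_gb`, `headroom_gb`) are hypothetical, the 4‑bit figure is an idealized 0.5 bytes per parameter (real Q4/AWQ/GGUF formats carry extra scale metadata), and KV cache, activations, and framework overhead are ignored entirely. It also assumes all MoE expert weights stay resident in GPU memory even though only ~10B parameters are active per token.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Assumption: ignores KV cache, activations, and runtime overhead,
# so real deployments need more memory than these numbers suggest.

BYTES_PER_PARAM = {
    "bf16": 2.0,  # 16-bit formats store 2 bytes per weight
    "fp16": 2.0,
    "q4": 0.5,    # idealized 4-bit quantization (assumption: no scale overhead)
}


def weight_gb(params_billion: float, fmt: str) -> float:
    """Approximate weight footprint in decimal GB."""
    return params_billion * 1e9 * BYTES_PER_PARAM[fmt] / 1e9


def headroom_gb(params_billion: float, fmt: str,
                vram_gb_per_card: float, cards: int = 1) -> float:
    """VRAM left after loading weights; negative means they do not fit."""
    return vram_gb_per_card * cards - weight_gb(params_billion, fmt)


if __name__ == "__main__":
    # 122B total parameters in BF16 on 2x H200 NVL (~141 GB each).
    print("122B bf16 weights (GB):", weight_gb(122, "bf16"))
    print("headroom on 2x141 GB:", headroom_gb(122, "bf16", 141, cards=2))

    # 27B at idealized 4-bit on a single 24 GB RTX 3090/4090.
    print("27B q4 weights (GB):", weight_gb(27, "q4"))
    print("headroom on 1x24 GB:", headroom_gb(27, "q4", 24))
```

Under these assumptions the 122B model needs about 244 GB for BF16 weights alone, which is why it exceeds a single 141 GB card but fits across two, while a 27B model at roughly 4 bits lands near 13.5 GB and leaves room for KV cache on a 24 GB consumer GPU. The remaining headroom in each case is what long contexts and batching draw from, so actual limits depend on the backend and cache strategy.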

Shared from Scout - Be the smartest in the room.