supergemma-autotune ships Bayesian tuner

- streetquant published supergemma-autotune on May 25, offering Bayesian runtime tuning for local Supergemma GGUF deployments instead of changing model weights. (github.com)
- The project targets llama.cpp-style settings including context, batching, KV cache, flash attention, GPU layers and MTP, then exports copy-paste configs. (github.com)
- GitHub shows no packaged release yet; users currently install and run it from the repository with `uv` commands. (github.com)

streetquant has published supergemma-autotune, a new open-source tool that searches for better local runtime settings for Supergemma GGUF models on a user’s own hardware. The project describes itself as “runtime autotuning, not weight fine-tuning,” meaning it does not retrain, merge or quantize model weights. (github.com) Instead, it runs time-budgeted tests, scores speed and reliability probes, and emits a configuration a user can copy into a local stack. (github.com)

That matters because local users often inherit llama.cpp, Ollama or LM Studio settings from forum posts or sample configs that were built for different GPUs, different RAM limits and different model quants. (github.com) The repository says the tool is meant to make those trade-offs measurable on the machine that will actually run the model.

### What is this tool actually tuning?

The repository says supergemma-autotune can search runner and server parameters such as context size, batching, KV cache, flash attention, GPU layers and MTP when it is used in `llamacpp` mode. In OpenAI-compatible endpoint mode, it is more limited: it can only tune request-level sampling parameters that are actually sent to the endpoint. (github.com)

The project is framed around local Supergemma deployments, with documentation naming `supergemma4-26b-uncensored-fast-v2-Q4_K_M.gguf` as its initial first-class target. The docs say that model was chosen because it is a single GGUF file and fits common llama.cpp and LM Studio workflows. (github.com)

### Why use Bayesian search instead of hand-tuning flags?

The tool’s pitch is that local inference failures are often blamed on the model when the problem is the runtime setup. The documentation lists several examples: context windows that exceed available VRAM, KV cache choices that hurt structured output, and MTP depth settings that may improve speed while reducing acceptance or stability. (github.com)

Bayesian search is meant to cut down the number of trial runs needed to find a workable configuration. Rather than exhaustively testing every combination, the tool keeps a ledger of prior trials and continues the same study if the user reruns it with the same output path, according to the README. (github.com)

### How does it check reliability, not just throughput?

The docs say the scoring includes reliability probes for JSON, tool calls, coding edits, context pressure, crashes and memory safety. That is a notable design choice for local users running coding agents or structured-output workflows, where a faster setup can still be unusable if it breaks tool calls or degrades under longer prompts. (github.com)

The README also says the tool applies a conservative hardware-aware safety filter by default using available VRAM, RAM and model file size. Users can override that with an `--unsafe` flag if they want the runner to try every candidate. (github.com)

### Where does it fit in a local workflow?

The project includes commands to export the best result for `llamacpp`, `ollama`, `lmstudio` and `codex`, and it offers a minimal web UI plus “SuperGemma-first helpers” for model listing and quickstart flows. The managed `llama-server` path runs one process per candidate, which lets the search test server-side flags directly instead of only benchmarking a static endpoint. (github.com)

That gives the tool a practical role in setups where users want a one-command path from model file to tuned config without manually sweeping dozens of flags. The repository’s own wording is that it aims to help users “stop guessing whether the model or their runner flags are the problem.” (github.com)

### What should users watch next?

GitHub’s releases page shows no packaged release as of May 25, so the current path is to pull the repository and run it with `uv`. The documentation also asks for feedback from Supergemma maintainers and users on recommended defaults, excluded flags and better structured-output probes, suggesting the search space and scoring rules may still evolve. (github.com 1) (github.com 2) (github.com 3)
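
To make two of the ideas described in this article concrete, a hardware-aware safety filter and a score that weighs reliability alongside speed, here is a minimal Python sketch. It is not the project's code: the names `Candidate`, `fits_hardware` and `score`, the memory estimate, the 90% headroom and the score formula are all assumptions chosen for illustration, and the real tool may work quite differently.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class Candidate:
    """One hypothetical runtime configuration to try."""
    ctx_size: int    # context window, in tokens
    gpu_layers: int  # number of layers offloaded to the GPU


def fits_hardware(c, model_bytes, total_layers, kv_bytes_per_token,
                  vram_bytes, ram_bytes, headroom=0.9):
    """Conservative pre-check before a candidate is ever launched.

    Assumption: weights and KV cache are split between GPU and CPU in
    proportion to the offloaded layers. Real memory use is more complicated.
    """
    gpu_fraction = min(c.gpu_layers, total_layers) / total_layers
    total_need = model_bytes + c.ctx_size * kv_bytes_per_token
    gpu_need = total_need * gpu_fraction
    cpu_need = total_need - gpu_need
    return (gpu_need <= vram_bytes * headroom
            and cpu_need <= ram_bytes * headroom)


def score(tokens_per_sec, probes_passed, probes_total, crashed):
    """Throughput scaled by the reliability-probe pass rate (assumed formula).

    A crashed run scores zero no matter how fast it was.
    """
    if crashed or probes_total == 0:
        return 0.0
    return tokens_per_sec * (probes_passed / probes_total)


# Illustrative numbers only: a 16 GiB model file, 30 layers, 256 KiB of
# KV cache per token, on a machine with 24 GiB VRAM and 64 GiB RAM.
hardware = dict(model_bytes=16 * GIB, total_layers=30,
                kv_bytes_per_token=256 * 1024,
                vram_bytes=24 * GIB, ram_bytes=64 * GIB)

small = Candidate(ctx_size=8192, gpu_layers=30)
huge = Candidate(ctx_size=131072, gpu_layers=30)

safe_candidates = [c for c in (small, huge) if fits_hardware(c, **hardware)]

# A faster run that fails half its probes loses to a slower, clean run.
fast_but_flaky = score(60.0, probes_passed=5, probes_total=10, crashed=False)
slow_but_clean = score(40.0, probes_passed=10, probes_total=10, crashed=False)
```

In this sketch the oversized context window is dropped before any trial runs, and the flaky configuration scores lower than the clean one despite its higher throughput. A Bayesian optimizer would then propose new candidates based on scores like these rather than sweeping every combination.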

