Evolutionary fine-tuning skips gradients
- Xin Qiu, Yinggan Xu, and collaborators showed evolution strategies can fine-tune billion-parameter LLMs without backprop, then extended the idea to quantized models.
- Their follow-up QES method updates directly in discrete 3-bit and 4-bit weight space, using error feedback and seed replay to keep memory near inference cost.
- That matters because it reframes tuning as search plus evaluation, not gradient plumbing, but only for teams that can run fast reward loops.

Large-language-model fine-tuning usually means gradients, backprop, and a lot of training hardware. This work goes after that assumption directly. Xin Qiu, Yinggan Xu, and collaborators showed that evolution strategies (basically mutation, scoring, and selection) can fine-tune billion-parameter LLMs without backprop, then pushed the same idea into quantized models where gradients are awkward or impossible anyway. The interesting part is not just “biology metaphor in AI.” It’s that they’re treating post-training as black-box search over whole-model behavior, not token-level credit assignment. (arxiv.org)

### What is the actual trick?

Evolution strategies work by making many slightly perturbed versions of a model, scoring the outputs, and nudging the base model toward the perturbations that did better. No backward pass through the full network. No need to compute exact gradients for every weight. In the LLM paper, the authors say this is the first full-parameter ES fine-tuning at the billion-parameter scale without dimensionality reduction. (arxiv.org)

### Why is that different from RL fine-tuning?

RL-based post-training still leans on gradient estimators, and that gets messy when rewards are delayed, sparse, or only available after a whole response is finished. The paper’s pitch is that ES sidesteps the token-by-token credit-assignment problem. You score the completed behavior and select the better perturbations. That can be more stable when the reward is outcome-only, where the question is whether the finished response succeeded rather than which token deserved credit. (arxiv.org)

### Why does quantization matter here?

Because quantized models are where the usual recipe really starts to break. Post-training quantization makes LLMs cheap enough to run on constrained hardware, but it also turns weights into low-bit, discrete values. Backprop wants smooth, high-precision parameter space. Quantized weights are neither.
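
To make the discreteness problem concrete, here is a minimal sketch of what happens when a small update meets a low-bit grid. It assumes a simplified uniform, symmetric 4-bit grid; the `quantize` helper, the `step` size of 0.25, and the update sizes are hypothetical illustrations, and real quantization schemes (per-group scales, zero points) are more involved.

```python
def quantize(w: float, step: float = 0.25, bits: int = 4) -> float:
    """Snap w to the nearest level of a signed uniform grid with 2**bits levels."""
    q_min, q_max = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    level = max(q_min, min(q_max, round(w / step)))
    return level * step


w = quantize(0.5)                 # already sits on a grid point
after = quantize(w + 0.01)        # a small nudge, then re-quantize
bigger = quantize(w + 0.2)        # a nudge larger than half a grid step

print("start:", w)
print("after small update:", after)
print("after larger update:", bigger)
```

In this toy setting the small nudge rounds straight back to the starting value, while the larger one moves the weight to the next grid level. An update only takes effect once it is big enough to cross a rounding boundary.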
QES is the follow-up idea: do the search directly in that discrete space instead of pretending the model is still a float-heavy training artifact. (arxiv.org)

### So how does QES avoid falling apart?

The paper adds two pieces. One is accumulated error feedback: small update signals are stored until they’re large enough to flip an actual low-bit weight. The other is stateless seed replay, which recreates perturbations from random seeds instead of storing huge high-precision copies. That keeps memory usage near low-precision inference levels while still approximating a high-precision optimization path over time. (arxiv.org)

### Is this just a toy result?

Not really. The ES-at-scale paper says it beats established RL implementations on several axes, including long-horizon rewards, training stability, robustness across base models, and lower susceptibility to reward hacking. The quantized follow-up says QES beats the state-of-the-art zeroth-order baseline on arithmetic reasoning tasks. Those are targeted results, not a blanket “ES beats everything,” but they’re enough to make the field pay attention. (arxiv.org)

### Does this make fine-tuning cheap for everyone?

Not automatically. It cuts out backprop through the whole model, but it replaces that with lots of model evaluations. So the bottleneck shifts. You need fast inference, fast scoring, and good orchestration for populations of candidates. The GitHub release for the ES paper even highlights an accelerated vLLM-based version with 10x-plus speedup, which tells you where the practical battle really is. (github.com)

### What changes if this line of work holds up?

The mental model changes.
Fine-tuning stops looking like “open the model and differentiate through everything” and starts looking more like “search over behaviors with compressed, hardware-friendly updates.” That is especially important for quantized models, because QLoRA still uses backprop into adapters, while QES is trying to adapt the quantized model itself. (arxiv.org)

This is a real shift in where people are looking for leverage. Gradients are still dominant. But these papers make a credible case that for some post-training jobs, especially low-bit ones, selection plus perturbation can be a feature, not a fallback. (arxiv.org)
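
The two QES ingredients described above, accumulated error feedback and stateless seed replay, can be sketched on a toy problem. This is an illustration of the ideas as summarized here, not the paper's algorithm. The function names, the perturbations drawn from {-1, 0, +1} grid levels, the mean-centered reward weighting, the learning rate, and the squared-distance reward standing in for a task score are all assumptions made for the example.

```python
import random

LOW, HIGH = -8, 7  # signed 4-bit range, measured in grid levels


def perturbation(seed: int, n: int) -> list[int]:
    """Rebuild a {-1, 0, +1} perturbation from its seed (assumed noise model)."""
    rng = random.Random(seed)
    return [rng.choice((-1, 0, 1)) for _ in range(n)]


def clip(level: int) -> int:
    return max(LOW, min(HIGH, level))


def reward(levels: list[int], target: list[int]) -> float:
    """Toy outcome-only score: 0 is a perfect match, lower is worse."""
    return -float(sum((a - b) ** 2 for a, b in zip(levels, target)))


def qes_like_step(q, residual, target, seeds, lr=0.5):
    """Update integer weights q in place; residual carries unapplied error."""
    n = len(q)
    # Score each perturbed candidate, keeping only (seed, reward) pairs.
    scored = []
    for seed in seeds:
        eps = perturbation(seed, n)
        candidate = [clip(w + e) for w, e in zip(q, eps)]
        scored.append((seed, reward(candidate, target)))
    mean_reward = sum(r for _, r in scored) / len(scored)

    # Seed replay: regenerate each perturbation instead of having stored it.
    direction = [0.0] * n
    for seed, r in scored:
        eps = perturbation(seed, n)
        for i in range(n):
            direction[i] += (r - mean_reward) * eps[i]
    scale = max(abs(d) for d in direction) or 1.0

    # Error feedback: accumulate fractional steps, move only by whole levels.
    for i in range(n):
        residual[i] += lr * direction[i] / scale
        flip = int(residual[i])      # truncates toward zero
        residual[i] -= flip          # keep the remainder for later steps
        q[i] = clip(q[i] + flip)


def run(target, iterations=200, population=16):
    q = [0] * len(target)
    residual = [0.0] * len(target)
    for t in range(iterations):
        seeds = range(t * population, (t + 1) * population)
        qes_like_step(q, residual, target, seeds)
    return q, residual


target = [5, -6, 3, -2, 6, -7, 1, 4]
weights, leftover = run(target)
print(reward([0] * len(target), target), "->", reward(weights, target))
```

Between scoring and updating, the only per-candidate state is a seed and a scalar reward, which is the memory argument in miniature. The `residual` list plays the role of the error-feedback buffer: each step adds at most half a grid level, so a weight changes only after several consistent nudges have accumulated.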