15 fine-tuning techniques listed
An ML practitioner published a list of 15 LLM fine-tuning techniques, from LoRA and QLoRA to DPO and GRPO, the method the thread favors for reasoning models, and linked to a hands-on tutorial in the replies (x.com/i/status/2045125478391099858). The thread drew strong community interest and serves as a compact checklist for small-budget fine-tuning experiments (x.com/i/status/2045125478391099858).
Large language model fine-tuning is the step where a general chatbot gets retrained for one job, one domain, or one style. A new X thread turned that sprawling toolbox into a 15-item checklist that many smaller labs and solo builders can actually use. (x.com)

Fine-tuning starts with a base model that already knows language, then changes its behavior with new examples or rewards. The lowest-cost path is often parameter-efficient fine-tuning, where developers train a small add-on instead of rewriting billions of original weights. (arxiv.org)

Low-Rank Adaptation, or LoRA, is the best-known version of that shortcut. The 2021 paper said LoRA freezes the original model, trains small rank-decomposition matrices, and can cut trainable parameters by 10,000 times versus full fine-tuning on GPT-3-scale models. (arxiv.org)

Quantized LoRA, or QLoRA, pushed the cost lower by storing the frozen base model in 4-bit form while still training adapters. The 2023 paper said that made it possible to fine-tune a 65-billion-parameter model on a single 48 GB GPU while preserving full 16-bit fine-tuning performance. (arxiv.org)

That is why lists like this travel fast: the bottleneck is no longer only model size, but method choice. A practitioner deciding between supervised fine-tuning, LoRA, QLoRA, Direct Preference Optimization, Odds Ratio Preference Optimization, or reinforcement learning can waste days on the wrong setup before a single run finishes. (unsloth.ai)

Supervised fine-tuning is the simplest branch. It shows the model labeled examples of the behavior you want, while preference methods use pairs of answers and teach the model which one humans or a reward function prefer. (unsloth.ai)

Direct Preference Optimization, or DPO, became popular because it skips a separate reward-model training stage. Its 2023 paper described it as a stable, lightweight way to learn from preferences with a simple classification loss instead of a heavier reinforcement-learning pipeline. (arxiv.org)

Odds Ratio Preference Optimization, or ORPO, trims that stack further. The 2024 paper presented ORPO as a reference-model-free method that folds preference alignment into one monolithic step, rather than running supervised fine-tuning and preference optimization as two separate phases. (arxiv.org)

For reasoning models, the newer branch is reinforcement learning with verifiable rewards, where math, code, or tool-use answers can be checked automatically. Group Relative Policy Optimization, or GRPO, was introduced in the DeepSeekMath paper as a memory-saving variant of Proximal Policy Optimization that compares multiple responses for the same prompt. (arxiv.org)

That approach has now been packaged into smaller-budget tutorials. Hugging Face’s TRL notebook says GRPO with LoRA or QLoRA can run in a free Google Colab notebook on a T4 GPU, and its example reports about a sevenfold memory reduction compared with naive FP16 training for a 7-billion-parameter model. (colab.research.google.com)

The practical message in the 15-technique thread is not that every team needs 15 experiments. It is that fine-tuning in 2026 is less one recipe than a menu, and the cheapest useful run often starts with LoRA or QLoRA, moves to DPO or ORPO for preference shaping, and reaches GRPO only when the task has clear rewards to optimize. (unsloth.ai)
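
To make the "compares multiple responses for the same prompt" idea behind GRPO concrete, here is a minimal Python sketch. It is not the DeepSeekMath or TRL implementation. It assumes an exact-match checker as the verifiable reward, and it scores each response by subtracting the group's mean reward and dividing by the group's population standard deviation. The function names, the epsilon value, and the sample completions are all hypothetical, chosen only for illustration.

```python
from statistics import mean, pstdev


def exact_match_reward(completion: str, expected: str) -> float:
    """Hypothetical verifiable reward: 1.0 if the answer matches exactly, else 0.0."""
    return 1.0 if completion.strip() == expected.strip() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Score each response relative to the other responses for the same prompt.

    Assumption: normalize by the group mean and population standard deviation;
    eps only guards against division by zero when all rewards are equal.
    """
    baseline = mean(rewards)
    spread = pstdev(rewards)
    return [(r - baseline) / (spread + eps) for r in rewards]


# Four made-up completions sampled for one prompt whose checked answer is "42".
completions = ["42", "41", " 42 ", "forty-two"]
rewards = [exact_match_reward(c, "42") for c in completions]
advantages = group_relative_advantages(rewards)

for completion, reward, advantage in zip(completions, rewards, advantages):
    print(f"{completion!r:12} reward={reward:.1f} advantage={advantage:+.3f}")
```

In this sketch the group itself serves as the baseline, so no separate value model is needed to judge whether a response was better or worse than expected. If every response in a group earns the same reward, all advantages come out as zero and that prompt contributes no learning signal, which is one way to see why the method fits tasks with clear, checkable rewards.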