Which wrench do you actually pick up?
You've learned about LoRA, QLoRA, DPO, GRPO, and the specialization recipes. None of that matters until you actually run a training job. In April 2026 there are six serious fine-tuning frameworks — each one solving a different slice of the problem. Pick the wrong one and you'll waste a week. Pick the right one and you'll ship a specialist by the weekend.
This lesson does two things. First, it walks through the six frameworks in depth — what they're for, how they work under the hood, what their benchmarks actually mean. Second, it gives you a decision tree so that given any reasonable scenario, you can pick in 30 seconds.
How the six frameworks fit together
A critical distinction: some of these tools stack rather than compete. Unsloth rewrites the kernel layer but hands off the trainer to TRL. LLaMA-Factory is a UI that sits on top of both. Axolotl wraps TRL and adds YAML. torchtune is the PyTorch-native alternative that skips the HF stack entirely. MLX-LM-LoRA is the Apple Silicon path that sidesteps the whole CUDA world.
- ✓ 3–5× faster training, 30–90% less VRAM
- ✓ Custom Triton kernels for RoPE, MLP, cross-entropy, attention
- ✓ Single-GPU champion — squeezes maximum out of one card
- ✓ Integrates cleanly with HF TRL for the actual training loop
- ✓ 2026 MoE kernels: 12× faster, 35% less VRAM for MoE models
- ✗ Multi-GPU is paid (Unsloth Pro)
- ✗ Most gains are in LoRA/QLoRA — less dramatic for full FT
- ✗ Kernel is Hopper/Ada-focused; older hardware gets smaller gains
pip install unsloth

Unsloth — the kernel king
Unsloth's value proposition is blunt: same math, different kernels, dramatically faster. Under the hood, it rewrites the training hot path in Triton (OpenAI's GPU kernel DSL) so that operations PyTorch would normally dispatch as separate kernels get fused and specialized.
from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig

# 1. Load base with QLoRA in one call.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Qwen/Qwen3-4B-Instruct",
    max_seq_length = 4096,
    load_in_4bit = True,  # NF4 quantization
)

# 2. Attach LoRA adapters.
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, lora_alpha = 32,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing = "unsloth",  # ← Unsloth's checkpointing
)

# 3. Hand off to TRL for the actual training loop.
trainer = SFTTrainer(
    model = model, tokenizer = tokenizer,
    train_dataset = my_dataset,
    args = SFTConfig(
        learning_rate = 2e-4,
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 4,
        num_train_epochs = 3,
    ),
)
trainer.train()

Axolotl — the production YAML
Axolotl's proposition is different: instead of optimising kernels, optimise reproducibility. A training run is a YAML file. The YAML is checked into git. Every parameter is explicit. Two colleagues running the same YAML on different machines get the same result.
base_model: Qwen/Qwen3-4B-Instruct
model_type: AutoModelForCausalLM
load_in_4bit: true
adapter: qlora
lora_r: 16
lora_alpha: 32
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj
datasets:
  - path: my/function-calling-data
    type: chat_template
sequence_len: 4096
learning_rate: 2e-4
num_epochs: 3
micro_batch_size: 4
gradient_accumulation_steps: 4
bf16: auto
flash_attention: true
# Multi-GPU
deepspeed: configs/ds_zero3.json
# or use fsdp: full_shard auto_wrap

Run with accelerate launch -m axolotl.cli.train config.yaml. Multi-GPU distribution is handled by the YAML reference to DeepSpeed or FSDP, not by your Python code.
LLaMA-Factory — the beginner-friendly UI
LLaMA-Factory is the fastest way to do your first fine-tune. You install it, launch llamafactory-cli webui, open your browser, select a model, upload a dataset, pick a training mode, and click start. The UI (LlamaBoard) is backed by the same Python engine you'd call directly — you can dump the config as YAML and re-run it later headless.
Crucially, LLaMA-Factory detects Unsloth and uses it as a backend automatically when available. You get Unsloth's speed (3.4 hours on the same benchmark, within ~6% of raw Unsloth) with zero kernel configuration.
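For instance, the config the UI exports is plain YAML you can re-run headless later. A sketch of what that file might look like (the key names follow LLaMA-Factory's published example configs, but verify against your installed version's schema; the dataset name and output path are invented):

```yaml
# Illustrative LLaMA-Factory SFT config — check your version's docs
# for the exact schema before relying on any individual key.
model_name_or_path: Qwen/Qwen3-4B-Instruct
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 16
dataset: my_function_calling_data   # must be registered in dataset_info.json
template: qwen
output_dir: ./saves/qwen3-4b-lora
per_device_train_batch_size: 4
learning_rate: 2.0e-4
num_train_epochs: 3
```

Re-run it with llamafactory-cli train config.yaml; it drives the same engine as the web UI.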
TRL — the reference implementation
HuggingFace TRL is where every new preference optimization method debuts. When a paper drops a new technique — DPO, GRPO, ORPO, KTO, CPO, RLOO, XPO, Dr. GRPO, DAPO — the first production-quality implementation is almost always in TRL within weeks. Reading TRL's source is how you understand what these methods actually do beyond the paper's pseudocode.
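To make that concrete, here is a pure-Python sketch of the per-example DPO loss that TRL's DPOTrainer computes. The log-probabilities below are invented numbers for illustration, not real model outputs:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * margin).

    The margin measures how much more the policy prefers the chosen
    response over the rejected one, relative to the frozen reference
    model. Minimizing the loss widens that margin.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy matches the reference, the margin is 0 and the
# loss is -log(0.5) ≈ 0.693; preferring the chosen response more
# than the reference does drives the loss toward 0.
neutral = dpo_loss(-10.0, -12.0, -10.0, -12.0)
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

The real trainer computes those sequence log-probabilities with two forward passes (policy and reference) and averages the loss over a batch; the formula itself is exactly this one line.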
torchtune — PyTorch-native, QAT, and the mobile path
torchtune is PyTorch's official fine-tuning library. It doesn't depend on HuggingFace Transformers; every model is implemented natively. This is less convenient for most users but unlocks two things nothing else in the ecosystem does well: quantization-aware training and ExecuTorch export.
Post-training quantization (PTQ): train in FP16, then quantize afterwards. Simple, fast, no retraining cost. But the weights were never trained to tolerate the rounding, so quality drops measurably: Llama-3 8B @ 4-bit loses +0.35 perplexity.

Quantization-aware training (QAT): fine-tune with simulated quantization in the forward pass and a straight-through estimator for the backward pass. The weights learn to be rounding-robust, so quality is largely preserved: Llama-3 8B @ 4-bit loses only +0.11 perplexity.
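The difference between the two paths comes down to one rounding step. Here is a deliberately simplified pure-Python sketch of the quantize-dequantize ("fake quant") operation that QAT inserts into the forward pass, using a symmetric per-tensor scale (real torchtune/torchao recipes use per-channel or per-group scales):

```python
def fake_quant(w, bits=4):
    """Quantize-dequantize a list of weights: round each weight to
    the nearest point on a signed b-bit grid, then map it back to
    float. This is the rounding the network must learn to tolerate."""
    qmax = 2 ** (bits - 1) - 1                 # 7 levels each side for 4-bit
    scale = max(abs(x) for x in w) / qmax      # per-tensor symmetric scale
    return [round(x / scale) * scale for x in w]

weights = [0.81, -0.33, 0.05, -0.70]
dequant = fake_quant(weights)

# Per-weight rounding error. PTQ pays this cost once, after training;
# QAT exposes it during training so the weights can adapt. The backward
# pass treats round() as the identity function — the straight-through
# estimator — so gradients still flow.
errors = [abs(a - b) for a, b in zip(weights, dequant)]
```

Each error is bounded by half the quantization step; QAT's gain comes from the optimizer steering weights toward values where these residuals matter least.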
MLX-LM-LoRA — the Mac path
If you're on Apple Silicon — and for SLM fine-tuning you often should be, because a top-spec Mac Studio can be configured with up to 512 GB of unified memory, far more than any consumer GPU has of HBM — you want MLX-LM-LoRA.
MLX is Apple's ground-up array framework built specifically for unified memory. There is no PCIe tax, no CPU↔GPU copy step, no CUDA dependency. MLX-LM-LoRA sits on top and provides a training library that supports 12 training algorithms: SFT, DPO, CPO, ORPO, GRPO, GSPO, Dr. GRPO, DAPO, Online DPO, XPO, RLHF, and PPO. That's more algorithms than TRL.
# Install
pip install mlx-lm-lora

# Convert a HuggingFace model to MLX format
python -m mlx_lm.convert \
  --hf-path Qwen/Qwen3-4B-Instruct \
  --mlx-path ./mlx-qwen3-4b \
  -q --q-bits 4

# SFT + LoRA
python -m mlx_lm_lora.sft \
  --model ./mlx-qwen3-4b \
  --data ./my-data.jsonl \
  --lora-rank 16 \
  --iters 1000

# DPO with preference pairs
python -m mlx_lm_lora.dpo \
  --model ./mlx-qwen3-4b \
  --data ./preference-pairs.jsonl \
  --beta 0.1

# GRPO for verifiable rewards (math, code)
python -m mlx_lm_lora.grpo \
  --model ./mlx-qwen3-4b \
  --data ./math-problems.jsonl \
  --reward-fn ./my_reward.py
What about the new frameworks?
The space moves fast, and the incumbents keep shipping. Unsloth's 2026 release brought MoE-specific fused kernels, 3× faster training with smart packing, and auto-tuning for batch size versus sequence length. Expect another step change every six months.
The practical recipe — which to install today
Given all of this, here are my honest 2026 defaults if you're starting fresh:
- Mac with 32+ GB unified memory: install mlx-lm-lora. Train and serve on the same machine.
- One NVIDIA GPU, want to go fast: install unsloth + trl. Use Unsloth's FastLanguageModel loader, hand off to TRL trainers.
- One NVIDIA GPU, first time: install llama-factory. Web UI, Unsloth under the hood.
- Multi-GPU, production: install axolotl. YAML-driven, DeepSpeed or FSDP backend.
- Targeting mobile deployment: install torchtune + torchao. QAT fine-tune → ExecuTorch export → ship.
- Implementing a new algorithm from a paper: read trl source directly.
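The defaults above can be folded into a quick helper. This is purely illustrative (the function and its argument names are mine, encoding this lesson's recipe and nothing more):

```python
def pick_framework(platform, gpus=1, first_time=False,
                   mobile_target=False, new_algorithm=False):
    """Encode the decision recipe above.

    platform: 'mac' or 'nvidia' (simplified — ignores AMD, TPUs, etc.)
    Checks are ordered by specificity: special goals first, then
    hardware, then experience level.
    """
    if new_algorithm:
        return "trl (read the source)"
    if mobile_target:
        return "torchtune + torchao"
    if platform == "mac":
        return "mlx-lm-lora"
    if gpus > 1:
        return "axolotl"
    if first_time:
        return "llama-factory"
    return "unsloth + trl"

print(pick_framework("mac"))             # mlx-lm-lora
print(pick_framework("nvidia", gpus=4))  # axolotl
```

Thirty seconds is generous: in practice it is five boolean questions asked in order.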