Post-Training LLMs: A Practical Guide to Fine-Tuning and RLHF
From supervised fine-tuning to RLHF and DPO—learn how production teams actually customize large language models for specific use cases.
Beyond Pre-Training
When you hear about GPT-4 or Claude being trained on trillions of tokens, that's pre-training—the foundation. But the model you interact with has gone through extensive post-training to become helpful, harmless, and honest.
Post-training is where raw language models become useful assistants. And increasingly, it's where companies differentiate their AI products.
The Post-Training Stack
Modern LLM development typically follows this pipeline:
Pre-trained Base Model
↓
Supervised Fine-Tuning (SFT)
↓
Preference Alignment (RLHF or DPO)
↓
Production Model
Let's break down each stage.
1. Supervised Fine-Tuning (SFT)
SFT is conceptually simple: you show the model examples of good responses and train it to produce similar outputs.
Data Format
Your training data looks like conversations:
````json
{
  "messages": [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "How do I reverse a string in Python?"},
    {"role": "assistant", "content": "Here's how to reverse a string in Python:\n\n```python\n# Method 1: Slicing\nreversed_str = original[::-1]\n\n# Method 2: reversed() function\nreversed_str = ''.join(reversed(original))\n```\n\nThe slicing method is more Pythonic and typically faster."}
  ]
}
````
Practical SFT with Hugging Face
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer
from datasets import load_dataset

# Load base model and tokenizer
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Load your instruction dataset (load_dataset returns a DatasetDict, so grab the split)
dataset = load_dataset("json", data_files="your_data.jsonl")["train"]

# Train
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    max_seq_length=2048,
    formatting_func=format_conversation,  # Your formatting function
)
trainer.train()
```
When to Use SFT
- Format adaptation: Teaching the model your preferred output format
- Domain injection: Adding knowledge about your specific domain
- Behavior shaping: Making the model more helpful, concise, or detailed
SFT Gotchas
- Data quality > quantity: 1,000 excellent examples beat 100,000 mediocre ones
- Catastrophic forgetting: Too much fine-tuning can degrade general capabilities
- Overfitting: Monitor validation loss carefully, as in the sketch below
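To catch overfitting early, hold out part of your data and evaluate during training. A minimal sketch, reusing the model, tokenizer, and formatting function from the SFT example above (exact argument names vary across TRL and Transformers versions):

```python
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# Split off 5% of the instruction data as a validation set
dataset = load_dataset("json", data_files="your_data.jsonl")["train"]
splits = dataset.train_test_split(test_size=0.05, seed=42)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
    max_seq_length=2048,
    formatting_func=format_conversation,
    args=TrainingArguments(
        output_dir="sft-checkpoints",
        evaluation_strategy="steps",  # evaluate periodically during training
        eval_steps=200,
        num_train_epochs=1,           # keep epochs low to limit forgetting
        learning_rate=2e-5,
    ),
)
trainer.train()
```

If validation loss starts rising while training loss keeps falling, reduce the number of epochs or stop early.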
2. RLHF: Reinforcement Learning from Human Feedback
SFT teaches the model what good responses look like. RLHF trains a reward model to rank candidate responses, then optimizes the model so it generates the outputs humans actually prefer.
The RLHF Pipeline
Step 1: Train a Reward Model
- Collect pairs of responses to the same prompt
- Humans label which response is better
- Train a model to predict human preferences
Step 2: RL Optimization
- Generate responses from the SFT model
- Score them with the reward model
- Update the model to generate higher-scoring responses
- Use PPO (Proximal Policy Optimization) to keep changes stable
Reward Model Training
```python
from datasets import Dataset
from transformers import AutoModelForSequenceClassification
from trl import RewardTrainer

# Preference data format: each example has a prompt, a chosen response,
# and a rejected response
preference_dataset = Dataset.from_list([
    {
        "prompt": "Explain quantum computing simply.",
        "chosen": "Quantum computers use qubits that can be 0, 1, or both at once...",
        "rejected": "Quantum computing is a type of computation that harnesses quantum mechanical phenomena..."
    }
])

# The reward model is a sequence classifier with a single scalar output
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    num_labels=1
)

trainer = RewardTrainer(
    model=reward_model,
    tokenizer=tokenizer,
    train_dataset=preference_dataset,
)
trainer.train()
```
PPO Training
```python
from trl import PPOTrainer, PPOConfig, create_reference_model

ppo_config = PPOConfig(
    learning_rate=1e-5,
    batch_size=64,
    mini_batch_size=8,
)

ppo_trainer = PPOTrainer(
    config=ppo_config,
    model=sft_model,  # the policy; TRL's PPO API wraps it with a value head
    ref_model=create_reference_model(sft_model),  # frozen copy; the KL penalty keeps the policy close to it
    tokenizer=tokenizer,
)

# Training loop (schematic)
for batch in dataloader:
    queries = batch["input_ids"]
    responses = ppo_trainer.generate(queries)
    rewards = reward_model(queries, responses)  # schematic: score each (query, response) pair
    ppo_trainer.step(queries, responses, rewards)
```
Why RLHF is Hard
- Reward hacking: The model finds shortcuts to high rewards that don't reflect real quality (see the KL-penalty sketch after this list)
- Mode collapse: The model converges to a narrow range of "safe" responses
- Compute intensive: Training is significantly more expensive than SFT
- Human labeling bottleneck: Preference data is expensive to collect
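The usual mitigation for reward hacking and drift is the KL penalty against the reference model. A sketch of tightening it, assuming the classic TRL PPOConfig fields (names differ in newer releases):

```python
from trl import PPOConfig

# A larger KL coefficient keeps the policy closer to the SFT reference,
# trading some reward for less reward hacking and mode collapse
ppo_config = PPOConfig(
    learning_rate=1e-5,
    batch_size=64,
    mini_batch_size=8,
    init_kl_coef=0.2,   # starting weight of the KL penalty
    adap_kl_ctrl=True,  # adapt the coefficient toward the target KL
    target=6.0,         # target KL divergence for the adaptive controller
)
```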
3. DPO: Direct Preference Optimization
DPO is a simpler alternative to RLHF that's gained massive popularity. It skips the reward model entirely.
The DPO Insight
Instead of training a separate reward model, DPO optimizes the language model directly on preference pairs. The key result is that the KL-constrained RLHF objective can be rewritten as a simple classification-style loss over chosen and rejected responses, so no reward model and no RL loop are needed.
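Concretely, the DPO loss from the original paper raises the log-probability of the chosen response and lowers that of the rejected one, measured relative to a frozen reference model:

$$
\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$

Here β is the same parameter passed to DPOTrainer below; larger values keep the policy closer to the reference.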
```python
from trl import DPOTrainer, create_reference_model

dpo_trainer = DPOTrainer(
    model=sft_model,
    ref_model=create_reference_model(sft_model),  # frozen copy; must not be the same object as model
    beta=0.1,  # Controls how far the policy may deviate from the reference
    train_dataset=preference_dataset,
    tokenizer=tokenizer,
)
dpo_trainer.train()
```
DPO vs RLHF
| Aspect | RLHF | DPO |
|--------|------|-----|
| Complexity | High (reward model + RL) | Low (single training run) |
| Compute | 3-4x SFT | 1.5-2x SFT |
| Stability | Tricky to tune | More stable |
| Performance | Slightly better at scale | Comparable for most uses |
For most teams, DPO is the practical choice. RLHF shines when you need maximum performance and have the resources.
Efficient Fine-Tuning: LoRA and QLoRA
Fully fine-tuning a 70B model requires hundreds of gigabytes of GPU memory. LoRA (Low-Rank Adaptation) makes it feasible on consumer hardware.
How LoRA Works
Instead of updating all weights, LoRA adds small trainable matrices alongside frozen weights:
Original: W (frozen)
LoRA:     W + BA (where B and A are small low-rank matrices whose product matches W's shape)
A 7B model with LoRA might have only 10-50M trainable parameters, well under 1% of the full model.
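As a toy illustration (plain PyTorch, not the peft implementation), the forward pass and parameter count look like this:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Toy LoRA layer: a frozen weight W plus a trainable low-rank update BA."""
    def __init__(self, in_features, out_features, r=16, alpha=32):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)  # frozen W
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)  # small random init
        self.B = nn.Parameter(torch.zeros(out_features, r))        # zero init, so BA starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return x @ self.weight.T + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(4096, 4096, r=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"{trainable:,} trainable vs {layer.weight.numel():,} frozen")  # 131,072 vs 16,777,216
```

At inference time the product BA can be merged into W, so the adapter adds no extra latency.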
QLoRA: Quantized LoRA
QLoRA goes further by quantizing the base model to 4-bit precision:
```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization config (NF4 weights, bfloat16 compute)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load the quantized base model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# Add LoRA adapters to the attention projections
lora_config = LoraConfig(
    r=16,  # Rank of the adaptation matrices
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```
With QLoRA, you can fine-tune a 70B model on a single 80GB A100 GPU.
Production Considerations
1. Evaluation
- Use held-out test sets that mirror production queries
- Human evaluation is essential; automated metrics don't capture everything (see the sketch below)
- A/B test in production when possible
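A minimal sketch of the pairing step for human review, assuming the base and fine-tuned models are already loaded and a hypothetical heldout_prompts.jsonl with one prompt per line:

```python
import json
import torch

eval_prompts = [json.loads(line)["prompt"] for line in open("heldout_prompts.jsonl")]

def generate(model, tokenizer, prompt, max_new_tokens=256):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Decode only the newly generated tokens
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Collect side-by-side outputs for annotators (or feed them into an A/B harness)
rows = [
    {"prompt": p,
     "base": generate(base_model, tokenizer, p),
     "tuned": generate(tuned_model, tokenizer, p)}
    for p in eval_prompts
]
with open("eval_pairs.json", "w") as f:
    json.dump(rows, f, indent=2)
```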
2. Serving LoRA Adapters
Many inference frameworks support serving multiple LoRA adapters with a single base model:
```python
# vLLM example
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    enable_lora=True,
    max_loras=4,  # Serve up to 4 adapters
)
```
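Individual requests are then routed to an adapter via a LoRARequest; the adapter name and path below are placeholders:

```python
from vllm import SamplingParams
from vllm.lora.request import LoRARequest

outputs = llm.generate(
    ["How do I reverse a string in Python?"],
    SamplingParams(temperature=0.0, max_tokens=256),
    # Hypothetical adapter: (name, integer id, local path to the saved LoRA weights)
    lora_request=LoRARequest("code-assistant", 1, "/path/to/code_adapter"),
)
print(outputs[0].outputs[0].text)
```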
3. Continuous Improvement
- Collect user feedback and add to training data
- Retrain adapters periodically
- Monitor for distribution shift
When to Fine-Tune
Fine-tune when:
- You need consistent output formatting
- Domain-specific knowledge is critical
- Latency matters (smaller fine-tuned models can match larger base models)
Don't fine-tune when:
- RAG can solve your problem (cheaper, more flexible)
- Your use case changes frequently
- You don't have quality training data
Key Takeaways
- Post-training transforms base models into useful assistants
- SFT teaches format and behavior; RLHF/DPO aligns with human preferences
- LoRA and QLoRA make fine-tuning accessible on reasonable hardware
- Start with SFT + DPO—only add complexity if needed
The ability to customize LLMs through post-training is one of the most valuable skills in AI engineering today. It's what separates generic chatbots from production-grade AI systems.