ML Engineering

Post-Training LLMs: A Practical Guide to Fine-Tuning and RLHF

From supervised fine-tuning to RLHF and DPO—learn how production teams actually customize large language models for specific use cases.

Marcus Johnson · Jan 18, 2026
15 minute read

Beyond Pre-Training

When you hear about GPT-4 or Claude being trained on trillions of tokens, that's pre-training—the foundation. But the model you interact with has gone through extensive post-training to become helpful, harmless, and honest.

Post-training is where raw language models become useful assistants. And increasingly, it's where companies differentiate their AI products.


The Post-Training Stack

Modern LLM development typically follows this pipeline:

Pre-trained Base Model
    ↓
Supervised Fine-Tuning (SFT)
    ↓
Preference Alignment (RLHF or DPO)
    ↓
Production Model

Let's break down each stage.

1. Supervised Fine-Tuning (SFT)

SFT is conceptually simple: you show the model examples of good responses and train it to produce similar outputs.

Data Format

Your training data looks like conversations:

{
  "messages": [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "How do I reverse a string in Python?"},
    {"role": "assistant", "content": "Here's how to reverse a string in Python:\n\n```python\n# Method 1: Slicing\nreversed_str = original[::-1]\n\n# Method 2: reversed() function\nreversed_str = ''.join(reversed(original))\n```\n\nThe slicing method is more Pythonic and typically faster."}
  ]
}

Practical SFT with Hugging Face

from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer
from datasets import load_dataset

# Load base model
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Load your instruction dataset (one JSON record per line)
dataset = load_dataset("json", data_files="your_data.jsonl")

# Train
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset["train"],  # load_dataset returns a DatasetDict; select the split
    max_seq_length=2048,
    formatting_func=format_conversation,  # Your formatting function (sketched below)
)

trainer.train()
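
The formatting_func referenced above is whatever turns each JSON record into a single training string. A minimal sketch for the chat-style records shown earlier, assuming your trl version calls formatting_func on a batch of examples; the role tags are illustrative, and in practice you would use the base model's own chat template (e.g. tokenizer.apply_chat_template):

def format_conversation(examples):
    """Flatten each {"messages": [...]} record into one training string."""
    texts = []
    for messages in examples["messages"]:
        turns = [f"<|{m['role']}|>\n{m['content']}" for m in messages]
        texts.append("\n".join(turns))
    return texts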

When to Use SFT

  • Format adaptation: Teaching the model your preferred output format
  • Domain injection: Adding knowledge about your specific domain
  • Behavior shaping: Making the model more helpful, concise, or detailed

SFT Gotchas

  1. Data quality > quantity: 1,000 excellent examples beat 100,000 mediocre ones
  2. Catastrophic forgetting: Too much fine-tuning can degrade general capabilities
  3. Overfitting: Monitor validation loss carefully

2. RLHF: Reinforcement Learning from Human Feedback

SFT teaches the model what to say. RLHF teaches it which responses people prefer: human rankings train a reward model, and that reward model steers generation toward outputs people actually rate higher.

The RLHF Pipeline

Step 1: Train a Reward Model
- Collect pairs of responses to the same prompt
- Humans label which response is better
- Train a model to predict human preferences

Step 2: RL Optimization
- Generate responses from the SFT model
- Score them with the reward model
- Update the model to generate higher-scoring responses
- Use PPO (Proximal Policy Optimization) with a KL penalty against the SFT model to keep changes stable (see the sketch below)
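
The "keep changes stable" part comes from that KL penalty: PPO maximizes the reward-model score minus a penalty for drifting away from the SFT reference. A schematic sketch of the shaped reward (the kl_coef value is illustrative; trl's PPOConfig exposes a similar knob as init_kl_coef):

import torch

def shaped_reward(rm_score: float, policy_logprobs: torch.Tensor,
                  ref_logprobs: torch.Tensor, kl_coef: float = 0.05) -> torch.Tensor:
    """Schematic KL-shaped reward for PPO-based RLHF (not trl's internals)."""
    kl = (policy_logprobs - ref_logprobs).sum()  # how far the policy drifted on this response
    return rm_score - kl_coef * kl               # reward-model score minus the drift penalty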

Reward Model Training

from datasets import Dataset
from transformers import AutoModelForSequenceClassification
from trl import RewardTrainer

# Preference data format
# Each example has: prompt, chosen response, rejected response
preference_data = [
    {
        "prompt": "Explain quantum computing simply.",
        "chosen": "Quantum computers use qubits that can be 0, 1, or both at once...",
        "rejected": "Quantum computing is a type of computation that harnesses quantum mechanical phenomena..."
    }
]
preference_dataset = Dataset.from_list(preference_data)

# A single scalar output head turns the base LLM into a reward model
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    num_labels=1
)

trainer = RewardTrainer(
    model=reward_model,
    tokenizer=tokenizer,
    train_dataset=preference_dataset,  # Depending on your trl version, chosen/rejected may need pre-tokenizing
)
trainer.train()

PPO Training

from trl import (
    PPOTrainer,
    PPOConfig,
    AutoModelForCausalLMWithValueHead,
    create_reference_model,
)

ppo_config = PPOConfig(
    learning_rate=1e-5,
    batch_size=64,
    mini_batch_size=8,
)

# PPO needs a policy with a value head, plus a frozen reference copy of it
policy_model = AutoModelForCausalLMWithValueHead.from_pretrained(sft_model)  # wraps the SFT model (a path also works)
ref_model = create_reference_model(policy_model)  # Frozen reference; the KL penalty against it prevents drift

ppo_trainer = PPOTrainer(
    config=ppo_config,
    model=policy_model,
    ref_model=ref_model,
    tokenizer=tokenizer,
)

# Training loop
for batch in dataloader:
    query_tensors = batch["input_ids"]
    response_tensors = ppo_trainer.generate(query_tensors)
    # Score each (prompt, response) pair with the reward model trained above;
    # PPOTrainer.step expects one scalar reward tensor per example
    rewards = score_responses(reward_model, query_tensors, response_tensors)  # your scoring helper
    ppo_trainer.step(query_tensors, response_tensors, rewards)

Why RLHF is Hard

  1. Reward hacking: The model finds shortcuts to high rewards that don't reflect quality
  2. Mode collapse: The model converges to a narrow range of "safe" responses
  3. Compute intensive: Training is significantly more expensive than SFT
  4. Human labeling bottleneck: Preference data is expensive to collect

3. DPO: Direct Preference Optimization

DPO is a simpler alternative to RLHF that's gained massive popularity. It skips the reward model entirely.

The DPO Insight

Instead of training a separate reward model, DPO optimizes the language model directly on preference pairs. The reward model is implicit in the loss, so no RL loop is needed.
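
Schematically, the loss pushes the policy to prefer the chosen response over the rejected one by a wider margin than a frozen reference model does. A minimal PyTorch sketch, assuming you already have summed log-probabilities for each response (the variable names are illustrative, not trl's API):

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor, policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor, ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Batch DPO loss from summed per-response log-probabilities."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps        # how much more the policy likes "chosen" than the reference does
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Logistic loss on the margin between the two log-ratios, scaled by beta
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

In practice you don't write this yourself; trl's DPOTrainer handles the log-prob computation and the loss: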

from trl import DPOTrainer

dpo_trainer = DPOTrainer(
    model=sft_model,
    ref_model=None,    # None tells trl to make a frozen copy of the model as the reference
    beta=0.1,          # Controls how far the policy may deviate from that reference
    train_dataset=preference_dataset,  # prompt / chosen / rejected, same format as before
    tokenizer=tokenizer,
)

dpo_trainer.train()

DPO vs RLHF

| Aspect | RLHF | DPO |
|--------|------|-----|
| Complexity | High (reward model + RL) | Low (single training run) |
| Compute | 3-4x SFT | 1.5-2x SFT |
| Stability | Tricky to tune | More stable |
| Performance | Slightly better at scale | Comparable for most uses |

For most teams, DPO is the practical choice. RLHF shines when you need maximum performance and have the resources.

Efficient Fine-Tuning: LoRA and QLoRA

Full fine-tuning of a 70B model requires hundreds of gigabytes of GPU memory. LoRA (Low-Rank Adaptation) makes fine-tuning feasible on far more modest hardware.

How LoRA Works

Instead of updating all weights, LoRA adds small trainable matrices alongside frozen weights:

Original: W (frozen)
LoRA: W + BA (where B and A are small matrices)

A 7B model with LoRA might have only 10-50M trainable parameters, well under 1% of the full model. The sketch below shows the idea in code.
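
A minimal sketch of the mechanics (not the peft implementation): wrap a frozen nn.Linear and train only the two small matrices.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of a LoRA-wrapped linear layer: y = W x + (alpha / r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # W stays frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # small random init
        self.B = nn.Parameter(torch.zeros(base.out_features, r))        # zero init: the adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)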

QLoRA: Quantized LoRA

QLoRA goes further by quantizing the base model to 4-bit precision:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load the base model with 4-bit weights
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# Add LoRA adapters (only these small matrices are trained)
lora_config = LoraConfig(
    r=16,  # Rank of the adaptation matrices
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)

With QLoRA, you can fine-tune a 70B model on a single A100 GPU.

Production Considerations

1. Evaluation

  • Use held-out test sets that mirror production queries
  • Human evaluation is essential—automated metrics don't capture everything
  • A/B test in production when possible

2. Serving LoRA Adapters

Many inference frameworks support serving multiple LoRA adapters with a single base model:

# vLLM example
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    enable_lora=True,
    max_loras=4,  # Serve up to 4 adapters
)
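
At request time, each call says which adapter to apply. A sketch assuming an adapter saved locally (the adapter name, integer ID, and path are placeholders):

from vllm.lora.request import LoRARequest

outputs = llm.generate(
    ["How do I reverse a string in Python?"],
    lora_request=LoRARequest("coding-adapter", 1, "./adapters/coding"),  # placeholder name, ID, local path
)
print(outputs[0].outputs[0].text)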

3. Continuous Improvement

  • Collect user feedback and add to training data
  • Retrain adapters periodically
  • Monitor for distribution shift

When to Fine-Tune

Fine-tune when:

  • You need consistent output formatting
  • Domain-specific knowledge is critical
  • Latency matters (smaller fine-tuned models can match larger base models)

Don't fine-tune when:

  • RAG can solve your problem (cheaper, more flexible)
  • Your use case changes frequently
  • You don't have quality training data

Key Takeaways

  1. Post-training transforms base models into useful assistants
  2. SFT teaches format and behavior; RLHF/DPO aligns with human preferences
  3. LoRA and QLoRA make fine-tuning accessible on reasonable hardware
  4. Start with SFT + DPO—only add complexity if needed

The ability to customize LLMs through post-training is one of the most valuable skills in AI engineering today. It's what separates generic chatbots from production-grade AI systems.