Parameter-Efficient Fine-Tuning (PEFT)¶
Large Language Models (LLMs) are expensive to fine-tune end-to-end. Parameter-Efficient Fine-Tuning (PEFT) adapts a pre-trained model by training a small number of additional parameters while keeping the original weights frozen. This lecture focuses on two beginner-friendly PEFT techniques widely used in practice:
- LoRA (Low-Rank Adaptation)
- Quantization (8-bit and 4-bit) and QLoRA (LoRA on a quantized base)
These methods enable fine-tuning models like Llama 3 on a single consumer GPU with limited memory. The examples below use the Hugging Face ecosystem: transformers, datasets, peft, bitsandbytes, and trl.
Why PEFT?¶
- Reduce memory and compute costs by training fewer parameters
- Maintain strong performance by reusing powerful base models
- Enable domain adaptation on modest hardware
Typical use cases in biostatistics and biomedical research:
- Adapting a general LLM to clinical language or calculators
- Enforcing structured outputs (e.g., report templates)
- Reducing hallucinations via supervised examples
Quantization Before Adapters¶
Before introducing LoRA, it helps to separate two ideas that are often mixed together:
- Quantization reduces the memory footprint of the frozen base model
- PEFT adapters decide which trainable parameters you update
Quantization is not the same thing as PEFT, but the two are often combined. In practice, the most common options in the Hugging Face + PEFT workflow are:
| Option | Typical tool | When to use it | Main trade-off |
|---|---|---|---|
| bf16 / fp16 weights | native transformers loading |
Model already fits in memory | Simplest and most stable |
| 8-bit quantization | bitsandbytes (load_in_8bit=True) |
Moderate memory pressure | Good stability, modest compression |
| 4-bit quantization | bitsandbytes (load_in_4bit=True) |
Severe memory pressure | Best compression, more care needed |
| GPTQ / AWQ style quantization | specialized inference stacks | Mostly inference deployment | Fast inference, less common for training |
How to choose¶
- Choose full precision / mixed precision if the model fits and you want the simplest debugging experience
- Choose 8-bit + LoRA when memory is somewhat tight but you still want a conservative setup
- Choose 4-bit + LoRA (QLoRA) when the model would otherwise not fit on your GPU
- Prefer
bfloat16compute on newer GPUs; fall back tofloat16on older hardware - If you add new special tokens or resize embeddings, keep an eye on embedding layers because they may need higher precision
Minimal loading patterns¶
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
model_id = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# 8-bit base model
bnb_8bit = BitsAndBytesConfig(load_in_8bit=True)
# 4-bit base model for QLoRA
bnb_4bit = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
)
LoRA: The Math¶
Why low-rank?¶
A pre-trained Transformer contains large weight matrices \(W_0 \in \mathbb{R}^{d \times k}\) (e.g., \(d = k = 4096\) for Llama 3). Fine-tuning all parameters requires storing and updating \(d \times k \approx 16.8\) million numbers per layer.
The key insight of LoRA (Hu et al., 2021) is that the weight updates during fine-tuning have low intrinsic rank—most of the useful signal lives in a small subspace of parameter space.
The low-rank decomposition¶
Instead of updating \(W_0\) directly, LoRA adds a low-rank perturbation:
where:
| Matrix | Shape | Role |
|---|---|---|
| \(W_0\) | \(d \times k\) | Frozen pre-trained weight |
| \(A\) | \(r \times k\) | Trainable; initialized from \(\mathcal{N}(0, \sigma^2)\) |
| \(B\) | \(d \times r\) | Trainable; initialized to zero (so \(\Delta W = 0\) at start) |
| \(r\) | scalar | Rank, \(r \ll \min(d, k)\) (e.g., 8–64) |
Figure: The frozen weight \(W_0\) passes through unchanged. The adapter path computes \(BAx\) with far fewer parameters than \(W_0x\).
Scaled forward pass¶
The full forward pass through an adapted layer is:
where \(\alpha\) is a scaling hyperparameter (lora_alpha in the code). The factor \(\alpha / r\) stabilizes training: increasing rank \(r\)without adjusting\(\alpha\) would otherwise amplify the adapter's contribution.
In practice \(B\) is initialized to zero, so at the start of training \(h = W_0 x\)—the adapter starts as an identity perturbation and gradually learns the task-specific correction.
Parameter savings¶
For a single weight matrix:
| Method | Trainable params |
|---|---|
| Full fine-tuning | \(d \times k\) |
| LoRA rank \(r\) | \(r \times (d + k)\) |
Example: \(d = k = 4096\), \(r = 16\) → LoRA trains \(16 \times 8192 = 131{,}072\) vs. \(4096^2 = 16{,}777{,}216\) parameters. That is a 128× reduction in trainable parameters for this layer.
Across all layers of Llama 3 8B, LoRA (rank 16, all-linear) typically trains ~1–2% of total parameters.
Why does \(B\) start at zero?¶
If both \(A\)and\(B\)were random at initialization, the perturbation\(\Delta W = BA\)would immediately shift the model away from its well-optimized starting point. Initializing\(B = 0\) ensures:
so the fine-tuning starts from the pre-trained model's behavior and only diverges as the task signal accumulates.
PEFT Package¶
The PEFT (Parameter-Efficient Fine-Tuning) package, developed by Hugging Face, provides simple interfaces for applying parameter-efficient methods such as LoRA to large language models. It integrates seamlessly with common model libraries.
The LoRA math maps directly to the LoraConfig parameters:
from peft import LoraConfig, get_peft_model
lora_config = LoraConfig(
r=16, # rank r — capacity of adapter; higher = more expressive
lora_alpha=32, # α scaling factor; effective scale = α/r = 2.0
lora_dropout=0.05, # dropout on the adapter path for regularization
bias="none", # do not train bias terms
target_modules="all-linear", # apply ΔW = BA to every linear layer
task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# → trainable params: ~20M || all params: ~8B || trainable%: ~0.24%
To see the actual adapter matrices inside the model:
# Inspect one LoRA adapter layer
for name, module in model.named_modules():
if hasattr(module, "lora_A"):
print(f"Layer: {name}")
print(f" A shape (r × k): {module.lora_A['default'].weight.shape}")
print(f" B shape (d × r): {module.lora_B['default'].weight.shape}")
break
The forward pass in PEFT's source mirrors the math exactly:
# Conceptual pseudocode matching h = W₀x + (α/r) * B * A * x
result = F.linear(x, self.weight) # W₀ x (frozen)
lora_out = self.lora_B(self.lora_A(self.lora_dropout(x))) # B(A(x))
result += lora_out * (self.lora_alpha / self.r) # + (α/r) B A x
Quantization: 8-bit vs 4-bit¶
Quantization stores model weights in lower precision to reduce memory.
- 8-bit (int8): Good trade-off of speed and stability; widely used for inference and fine-tuning with LoRA.
- 4-bit (int4): Maximum compression; combined with LoRA → QLoRA for efficient fine-tuning on very limited VRAM.
Loading a causal LM (e.g., Llama 3 8B) with quantization:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
model_id = "meta-llama/Meta-Llama-3-8B"
# Choose either 8-bit OR 4-bit config
bnb_8bit = BitsAndBytesConfig(load_in_8bit=True)
bnb_4bit = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True, # second quantization for extra compression
bnb_4bit_quant_type="nf4", # NormalFloat4: best distribution for LLM weights
bnb_4bit_compute_dtype=torch.bfloat16, # compute in bf16 even though weights are 4-bit
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
device_map="auto",
quantization_config=bnb_4bit, # or bnb_8bit
torch_dtype=torch.bfloat16,
)
Tips:
- If you encounter numerical instability on older GPUs, try torch.float16 compute dtype.
- For chat-tuned models, ensure the tokenizer has correct special tokens and chat template.
QLoRA: LoRA on a 4-bit Base¶
QLoRA combines quantization and LoRA: the base model stays in 4-bit (frozen), while the LoRA adapter matrices \(A\)and\(B\) are trained in bfloat16. This is possible because:
- The quantized weights are dequantized on-the-fly during the forward pass to bf16 for matrix multiplication
- Gradients only flow through the adapter \(BA\), never through \(W_0\)
The math remains identical; only the storage format of \(W_0\) changes:
Practical tips:
- Use nf4 quant type and bf16 compute where possible
- Enable gradient checkpointing to trade compute for memory
- Consider training embeddings and lm_head if you add special tokens/chat templates
Example with trl.SFTTrainer:
import torch
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer, setup_chat_format
from peft import LoraConfig
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
model_id = "meta-llama/Meta-Llama-3-8B"
# 4-bit base model — W₀ stored in 4-bit, dequantized to bf16 for compute
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
device_map="auto",
quantization_config=bnb_config,
torch_dtype=torch.bfloat16,
)
# Ensure chat format if training on conversations
model, tokenizer = setup_chat_format(model, tokenizer)
# LoRA adapters: r=16, α=16 → effective scale α/r = 1.0
peft_config = LoraConfig(
r=16,
lora_alpha=16,
lora_dropout=0.05,
bias="none",
target_modules="all-linear",
task_type="CAUSAL_LM",
)
# Example: load a small jsonl dataset in OpenAI messages format
train_data = load_dataset("json", data_files="train_dataset.json", split="train")
args = TrainingArguments(
output_dir="llama3-8b-qlora-demo",
num_train_epochs=1,
per_device_train_batch_size=1,
gradient_accumulation_steps=8,
gradient_checkpointing=True,
learning_rate=2e-4,
logging_steps=10,
save_strategy="epoch",
bf16=True,
tf32=True,
report_to="none",
)
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=train_data,
args=args,
peft_config=peft_config,
max_seq_length=2048,
packing=True,
dataset_kwargs={"add_special_tokens": False, "append_concat_token": False},
)
trainer.train()
trainer.save_model()
Merging Adapters¶
QLoRA saves only adapter weights. For simple deployment without PEFT, you can merge adapters into the base model on CPU and save a standalone checkpoint:
from peft import AutoPeftModelForCausalLM
peft_dir = "llama3-8b-qlora-demo"
peft_model = AutoPeftModelForCausalLM.from_pretrained(peft_dir, torch_dtype=torch.float16, low_cpu_mem_usage=True)
merged = peft_model.merge_and_unload() # computes W₀ + (α/r)BA for each layer
merged.save_pretrained(peft_dir, safe_serialization=True, max_shard_size="2GB")
Combining PEFT with trl¶
Once the base model is loaded, the practical recipe is straightforward:
- LoRA = full-precision or bf16 base model +
LoraConfig - 8-bit LoRA = 8-bit base model +
LoraConfig - QLoRA = 4-bit base model +
LoraConfig
The trainer class changes with the learning objective, but the PEFT pattern stays almost the same.
Shared adapter config¶
from peft import LoraConfig
peft_config = LoraConfig(
r=16,
lora_alpha=16,
lora_dropout=0.05,
bias="none",
target_modules="all-linear",
task_type="CAUSAL_LM",
)
SFT + LoRA / QLoRA¶
from trl import SFTTrainer
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=train_data,
args=args,
peft_config=peft_config,
)
DPO + LoRA / QLoRA¶
from trl import DPOTrainer
trainer = DPOTrainer(
model=model,
ref_model=None,
tokenizer=tokenizer,
train_dataset=preference_data,
args=dpo_args,
peft_config=peft_config,
)
GRPO + LoRA / QLoRA¶
from trl import GRPOTrainer
trainer = GRPOTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=train_data,
reward_funcs=reward_funcs,
args=grpo_args,
peft_config=peft_config,
)
Practical rules of thumb¶
| Goal | Recommended setup |
|---|---|
| Small model fits comfortably | Full fine-tuning or LoRA |
| Medium model, limited VRAM | 8-bit LoRA |
| Large model, tight VRAM | QLoRA |
| Fastest iteration for teaching/demo | LoRA on a smaller model |
| Lowest memory footprint | 4-bit base + LoRA |
If you are unsure, a good default is: start with LoRA on a small model, then move to QLoRA only when memory becomes the bottleneck.
Tips for setting up LoRA:
| Setting | Guidance |
|---|---|
r (rank) |
Start with 16; increase to 64–128 for harder tasks |
lora_alpha |
Set equal to r (scale = 1) or 2× r (scale = 2) |
lora_dropout |
0.05–0.1 for regularization |
| Quant bits | 4-bit for 8B+ models on 24GB VRAM; 8-bit for extra stability |
packing |
True for short examples; boosts throughput |
References¶
- Hu et al., LoRA: Low-Rank Adaptation of Large Language Models
- Dettmers et al., QLoRA: Efficient Finetuning of Quantized LLMs
- Hugging Face, PEFT documentation
- Hugging Face, TRL documentation

