Reinforcement Learning Fine-Tuning with GRPO¶
Group Relative Policy Optimization (GRPO) fine-tunes LLMs via reinforcement learning without a separate value model. For each prompt, the policy samples multiple completions, receives rewards, and updates toward higher-reward behaviors.
Why use RL at all after SFT? Because high-quality target answers are expensive, but reward signals are often cheaper. For example, solving a math problem step by step may require an expert-written solution, yet checking whether the final numeric answer is correct is easy. Likewise, in biomedical tasks, it may be expensive to author ideal long-form responses, but relatively cheap to verify whether the answer format is valid, the units are correct, or the predicted label matches the gold label.
From SFT to RL Fine-Tuning¶
SFT (see Supervised Fine-Tuning) trains the model to imitate gold-standard responses via maximum likelihood. It has a key limitation: you need labeled responses, and the loss treats all tokens of a response equally regardless of whether the response is actually correct.
RL fine-tuning replaces the supervised signal with a reward function \(r(y, x)\) that scores a sampled response \(y\) for a given prompt \(x\). This lets you optimize for non-differentiable objectives like numerical accuracy, format compliance, or clinical correctness.
The GRPO Objective¶
GRPO was introduced in the DeepSeekMath paper and popularized for chain-of-thought reasoning tasks. It eliminates the value/critic network required by PPO, making it simpler and more memory-efficient.
Algorithm overview¶
For each training step:
- Sample a prompt \(q \sim \mathcal{Q}\) from the training distribution
- Generate a group of \(G\) completions from the current (old) policy: \(\{o_1, o_2, \ldots, o_G\} \sim \pi_{\theta_{\text{old}}}(\cdot \mid q)\)
- Score each completion with a reward function: \(r_i = r(o_i, q)\)
- Normalize rewards within the group to compute advantages
- Update the policy \(\pi_\theta\) to increase the probability of high-advantage completions
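The five steps above can be sketched on a toy problem. This is a minimal illustration, not the trainer's implementation: the "policy" is a softmax over four candidate answers, the reward is 1 for one hypothetical correct answer, and the names toy_grpo and softmax are invented for this example. Because each group is used for a single on-policy update, the probability ratio stays at 1 and no clipping is needed here.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def toy_grpo(num_steps=200, group_size=8, lr=0.5, correct=2, seed=0):
    """GRPO-style updates on a 4-way softmax policy; returns P(correct) per step."""
    rng = np.random.default_rng(seed)
    logits = np.zeros(4)                      # uniform initial policy
    history = [softmax(logits)[correct]]
    for _ in range(num_steps):
        probs = softmax(logits)
        # Steps 1-2: one implicit prompt, sample a group of G "completions"
        actions = rng.choice(len(probs), size=group_size, p=probs)
        # Step 3: verifiable reward, 1 if the sampled answer is correct
        rewards = (actions == correct).astype(float)
        # Step 4: group-normalized advantages (zero if all rewards are equal)
        adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
        # Step 5: policy-gradient step; grad of log softmax is onehot(a) - probs
        onehot = np.eye(len(probs))[actions]
        grad = (adv[:, None] * (onehot - probs)).mean(axis=0)
        logits = logits + lr * grad
        history.append(softmax(logits)[correct])
    return np.array(history)

history = toy_grpo()
print(f"P(correct): start={history[0]:.2f}, end={history[-1]:.2f}")
```

Groups in which every sample gets the same reward contribute no update, which is why progress slows once the policy almost always answers correctly.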
Advantage computation¶
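Unlike PPO, which trains a separate value network \(V(s)\) to estimate expected return, GRPO uses the group mean and standard deviation as a baseline:
\[
\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1, \ldots, r_G\})}{\operatorname{std}(\{r_1, \ldots, r_G\})}
\]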
This is the reward of completion \(i\) relative to its peers in the group. A completion that scores above the group average gets a positive advantage; one below average gets a negative advantage. No separate value network is needed.
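In practice a batch holds several prompts, and the normalization is applied to each group separately. The sketch below assumes rewards arranged as one row per prompt; the helper name group_advantages and the small eps constant are choices made for this example.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within each group.

    rewards: array of shape (num_prompts, G), one row per prompt.
    Returns advantages of the same shape.
    """
    rewards = np.asarray(rewards, dtype=float)
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)   # population std (ddof=0)
    return (rewards - mean) / (std + eps)

rewards = [
    [1.0, 0.0, 0.0, 0.0],   # one success: it gets a large positive advantage
    [2.0, 1.0, 0.0, 1.0],   # graded rewards
    [1.0, 1.0, 1.0, 1.0],   # identical rewards: all advantages are zero
]
adv = group_advantages(rewards)
print(np.round(adv, 3))
```

Each row sums to zero, so a completion is only ever pushed up at the expense of its peers from the same prompt.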
The GRPO objective¶
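\[
\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}_{q \sim \mathcal{Q},\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \Big( \min\big(\rho_{i,t}\hat{A}_i,\ \text{clip}(\rho_{i,t}, 1-\epsilon, 1+\epsilon)\hat{A}_i\big) - \beta\, \mathbb{D}_{\text{KL}}\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right] \Big) \right]
\]
where: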
| Symbol | Meaning |
|---|---|
| \(\rho_{i,t} = \dfrac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})}\) | Probability ratio: new policy vs. old (importance weight) |
| \(\hat{A}_i\) | Group-normalized advantage for completion \(i\) |
| \(\epsilon\) | Clipping range (e.g., 0.2); prevents too-large policy updates |
| \(\beta\) | KL penalty coefficient; keeps policy close to a reference model |
| \(\pi_{\text{ref}}\) | Reference (base) policy, typically the SFT-initialized model |
| \(G\) | Group size (number of completions per prompt, e.g., 4–16) |
The clipped surrogate loss¶
The \(\min(\rho\hat{A},\, \text{clip}(\rho, 1-\epsilon, 1+\epsilon)\hat{A})\) term is the PPO clipped surrogate objective:
- When \(\hat{A}_i > 0\) (good completion): the ratio \(\rho\) is clipped at \(1 + \epsilon\), preventing the policy from jumping too aggressively toward this completion
- When \(\hat{A}_i < 0\) (bad completion): the ratio \(\rho\) is clipped at \(1 - \epsilon\), preventing the policy from moving too far away from it in one step
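This clipping acts as an approximate trust region: once the ratio leaves \([1-\epsilon, 1+\epsilon]\) in the direction the advantage favors, the gradient of the term vanishes, which keeps training stable.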
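The two cases can be checked numerically. The following sketch evaluates the clipped term for a few hand-picked ratios; clipped_surrogate is a helper written for this illustration, not a library function.

```python
import numpy as np

def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO-style clipped term: min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)."""
    ratio = np.asarray(ratio, dtype=float)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)

ratios = np.array([0.5, 0.8, 1.0, 1.2, 1.5])

# Positive advantage: the term is flat once rho exceeds 1 + eps
pos = clipped_surrogate(ratios, advantage=1.0)    # 0.5, 0.8, 1.0, 1.2, 1.2
# Negative advantage: the term is flat once rho drops below 1 - eps
neg = clipped_surrogate(ratios, advantage=-1.0)   # -0.8, -0.8, -1.0, -1.2, -1.5
print("A = +1:", pos)
print("A = -1:", neg)
```

Where the term is flat, its gradient with respect to the ratio is zero, so that token stops pushing the policy further in the same direction. Note the asymmetry: with a negative advantage and a ratio above \(1 + \epsilon\), the term is not clipped, so the policy can still be pulled back.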
The KL divergence term¶
The KL penalty:
\[
\beta\, \mathbb{D}_{\text{KL}}\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right]
\]
prevents the policy from drifting into reward hacking, that is, producing nonsensical outputs that happen to score high on a simple reward function. A typical value is \(\beta = 0.01\).
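The KL term is usually estimated from the sampled tokens rather than computed over the whole vocabulary. The DeepSeekMath paper uses the per-token estimator \(\frac{\pi_{\text{ref}}}{\pi_\theta} - \log\frac{\pi_{\text{ref}}}{\pi_\theta} - 1\), which is always non-negative. The sketch below applies it to made-up log-probabilities; kl_estimate and the numbers are illustrative, and other implementations may estimate the term differently.

```python
import numpy as np

def kl_estimate(logp, ref_logp):
    """Per-token estimate exp(d) - d - 1 with d = ref_logp - logp (always >= 0)."""
    d = np.asarray(ref_logp, dtype=float) - np.asarray(logp, dtype=float)
    return np.exp(d) - d - 1.0

logp = np.log([0.50, 0.20, 0.90])       # hypothetical log-probs under pi_theta
ref_logp = np.log([0.50, 0.40, 0.60])   # hypothetical log-probs under pi_ref
kl = kl_estimate(logp, ref_logp)
beta = 0.04                             # example coefficient
print("per-token KL estimate:", np.round(kl, 4))
print("penalty:", beta * kl.mean())
```

The first token has identical probabilities under both policies and contributes zero; the penalty grows as the two policies disagree more.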
Hugging Face TRL GRPO Training¶
As with SFT, TRL provides a ready-made trainer for GRPO.
The GRPO formula maps directly to the GRPOConfig parameters:
from trl import GRPOConfig, GRPOTrainer
config = GRPOConfig(
    # ── Reward / advantage ──────────────────────────────────────
    num_generations=4,  # G: group size; 4 completions per prompt
    # advantages = (r_i - mean) / std computed automatically
    # ── Clipped surrogate ───────────────────────────────────────
    # epsilon = 0.2 (default) → clip(ρ, 0.8, 1.2)
    # ── KL penalty ──────────────────────────────────────────────
    # beta → β * D_KL[π_θ || π_ref]; the default depends on the TRL version
    # (0.04 in early releases), so pass beta explicitly if you rely on it
    # ── Optimization ────────────────────────────────────────────
    learning_rate=1e-5,
    gradient_accumulation_steps=8,
    per_device_train_batch_size=1,
    num_train_epochs=1,
    bf16=True,
    max_prompt_length=512,
    max_completion_length=128,
    # ── Logging ─────────────────────────────────────────────────
    output_dir="llama3-medcalc-grpo",
    logging_steps=10,
    save_strategy="epoch",
    report_to=["tensorboard"],
    remove_unused_columns=False,
)
Load Model and Tokenizer¶
For a first GRPO experiment, keep the setup simple: use a manageable instruct model and fine-tune it directly.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "Qwen/Qwen2.5-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
Data and Prompt Template (MedCalc-Bench)¶
We guide the model to produce structured outputs so rewards are easy and reliable.
from datasets import Dataset, load_dataset
raw_ds = load_dataset("ncbi/MedCalc-Bench-v1.0")
train_ds = raw_ds["train"]
SYSTEM_PROMPT = (
    "You are a clinical calculator assistant. "
    "Provide concise reasoning in <think>...</think> and the final numeric result in <answer>...</answer>."
)
def build_prompt(example):
    patient_note = example.get("Patient Note", "")
    question = example.get("Question", "")
    return (
        f"{SYSTEM_PROMPT}\n\n"
        f"Patient Note: {patient_note}\n"
        f"Question: {question}\n"
        f"Answer with <think> and <answer> tags."
    )
prompts = [build_prompt(ex) for ex in train_ds]
references = [ex.get("Ground Truth Answer", "") for ex in train_ds]
# Build a datasets.Dataset for the trainer; the "prompt" column is what the
# policy sees, and "reference" is kept for the reward functions
train_data = Dataset.from_list([{"prompt": p, "reference": r} for p, r in zip(prompts, references)])
Reward Functions¶
The reward function \(r(o_i, q)\) is the core design choice in GRPO. Use simple, deterministic signals first.
import re
def format_reward(completions, **kwargs):
    """
    r = 1.0 if the completion contains both <think>…</think> and <answer>…</answer>.
    Encourages structured output format.
    """
    pattern = re.compile(r"<think>.*?</think>\s*<answer>.*?</answer>", re.DOTALL)
    return [1.0 if isinstance(c, str) and pattern.search(c) else 0.0 for c in completions]
def exact_answer_reward(completions, references=None, **kwargs):
    """r = 1.0 if extracted <answer> matches reference exactly."""
    out = []
    for c, ref in zip(completions, references or []):
        if not isinstance(c, str):
            out.append(0.0)
            continue
        m = re.search(r"<answer>(.*?)</answer>", c, re.DOTALL)
        pred = m.group(1).strip() if m else ""
        out.append(1.0 if pred == (ref or "").strip() else 0.0)
    return out
def numeric_tolerance_reward(completions, references=None, atol=0.5, **kwargs):
    """r = 1.0 if extracted numeric answer is within atol of reference."""
    out = []
    for c, ref in zip(completions, references or []):
        try:
            pred_m = re.search(r"<answer>\s*([+-]?[0-9]*\.?[0-9]+)", c)
            ref_m = re.search(r"([+-]?[0-9]*\.?[0-9]+)", ref)
            if pred_m and ref_m:
                pred_v = float(pred_m.group(1))
                ref_v = float(ref_m.group(1))
                out.append(1.0 if abs(pred_v - ref_v) <= atol else 0.0)
            else:
                out.append(0.0)
        except Exception:
            out.append(0.0)
    return out
How rewards map to advantages¶
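For a group of \(G = 4\) completions with rewards \([1, 0, 1, 0]\):
\[
\text{mean} = 0.5, \quad \text{std} = 0.5, \quad \hat{A} = \left[\tfrac{1 - 0.5}{0.5},\ \tfrac{0 - 0.5}{0.5},\ \tfrac{1 - 0.5}{0.5},\ \tfrac{0 - 0.5}{0.5}\right] = [+1, -1, +1, -1]
\]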
The policy is updated to increase the probability of completions 1 and 3, and decrease it for completions 2 and 4.
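# Demonstrate advantage computation manually
import numpy as np
rewards = np.array([1.0, 0.0, 1.0, 0.0])
# np.std is the population standard deviation (ddof=0); 1e-8 guards against division by zero
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print("Rewards: ", rewards)
# Round for display: the 1e-8 term makes the raw values 0.99999998 rather than exactly 1
print("Advantages: ", np.round(advantages, 4))
# Rewards:  [1. 0. 1. 0.]
# Advantages:  [ 1. -1.  1. -1.]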
Wrap correctness rewards so they can read the references. TRL forwards extra dataset columns (here, reference) to every reward function as keyword arguments, one entry per completion:
def reward_wrapper(func):
    def _wrapped(completions, reference=None, **kwargs):
        # Map the dataset column name "reference" to the "references" argument
        return func(completions, references=reference, **kwargs)
    return _wrapped
GRPO Trainer Configuration¶
from trl import GRPOConfig, GRPOTrainer
config = GRPOConfig(
    output_dir="llama3-medcalc-grpo",
    learning_rate=1e-5,
    gradient_accumulation_steps=8,
    per_device_train_batch_size=1,
    num_train_epochs=1,
    bf16=True,
    max_prompt_length=512,
    max_completion_length=128,
    num_generations=4,  # G: group size (completions per prompt)
    logging_steps=10,
    save_strategy="epoch",
    report_to=["tensorboard"],
    remove_unused_columns=False,
)
trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,  # GRPOTrainer takes the tokenizer as processing_class
    args=config,
    train_dataset=train_data,
    # Reward functions are called with keyword arguments: prompts, completions,
    # and any extra dataset columns (here, reference)
    # The trainer automatically combines them: r_total = sum of all reward signals
    reward_funcs=[format_reward, reward_wrapper(numeric_tolerance_reward)],
)
trainer.train()
trainer.save_model()
What the trainer does at each step:
- Sample num_generations=4 completions per prompt using π_θ_old
- Compute rewards: r_total = format_reward(...) + numeric_tolerance_reward(...)
- Compute group advantages: Â_i = (r_i - mean) / std
- For each token in each completion, compute ρ = π_θ / π_θ_old
- Compute clipped surrogate loss + KL penalty
- Backpropagate and update θ
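The steps above can be assembled into a single loss for one prompt. The sketch below follows the formulas in this guide with NumPy arrays in place of model outputs; it is not TRL's internal code, and the function name grpo_loss, the array shapes, and the example values are assumptions made for illustration.

```python
import numpy as np

def grpo_loss(logp, old_logp, ref_logp, rewards, mask, eps=0.2, beta=0.04):
    """Toy GRPO loss for one prompt.

    logp, old_logp, ref_logp: (G, T) per-token log-probs under the current,
    sampling (old), and reference policies. mask: (G, T), 1 for real tokens
    and 0 for padding. rewards: (G,), one scalar reward per completion.
    """
    rewards = np.asarray(rewards, dtype=float)
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # (G,)
    adv = adv[:, None]                                          # broadcast over tokens

    ratio = np.exp(logp - old_logp)                             # rho_{i,t}
    surrogate = np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)

    d = ref_logp - logp
    kl = np.exp(d) - d - 1.0                                    # per-token KL estimate

    per_token = surrogate - beta * kl
    # Average over the real tokens of each completion, then over the group
    per_completion = (per_token * mask).sum(axis=1) / mask.sum(axis=1)
    return -per_completion.mean()                               # minimize the negative objective

G, T = 4, 3
old_logp = np.full((G, T), -1.0)           # hypothetical per-token log-probs
mask = np.array([[1, 1, 1], [1, 1, 0], [1, 1, 1], [1, 0, 0]], dtype=float)
rewards = [1.0, 0.0, 1.0, 0.0]

# On-policy start: ratio = 1 and KL = 0, so the loss value is about 0
loss_start = grpo_loss(old_logp, old_logp, old_logp, rewards, mask)
# Nudge log-probs up for rewarded completions and down for the others
shift = 0.1 * np.array([1.0, -1.0, 1.0, -1.0])[:, None]
loss_after = grpo_loss(old_logp + shift, old_logp, old_logp, rewards, mask)
print(loss_start, loss_after)
```

The loss value starts near zero because the advantages sum to zero within the group; what matters is its gradient. Moving probability toward the rewarded completions lowers the loss, which is the direction the optimizer follows.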
You can then run inference to spot-check the tuned model's responses.
def generate_answer(question, patient_note=""):
    prompt = (
        f"{SYSTEM_PROMPT}\n\n"
        f"Patient Note: {patient_note}\n"
        f"Question: {question}\n"
        f"Answer with <think> and <answer> tags."
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=128)
    # Decode only the newly generated tokens: the prompt itself contains
    # <answer>...</answer>, which the regex below would otherwise match first
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    text = tokenizer.decode(new_tokens, skip_special_tokens=True)
    m = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return text, (m.group(1).strip() if m else "")
Tips¶
- Keep rewards sparse and clear; start with one correctness signal
- Constrain outputs with tags to simplify parsing and reward computation
- Start with num_generations=4; scale up if compute allows
- Validate on a held-out split by computing rewards without training
- If all completions in a group score identically, advantages are all zero → no learning signal; favor prompts the model solves only some of the time, or sample more completions per prompt
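The last two tips can be combined into a quick diagnostic: score held-out completions without training and count how many groups carry no signal. The sketch below assumes the rewards have already been collected into one row per prompt; group_reward_stats and the example values are hypothetical.

```python
import numpy as np

def group_reward_stats(rewards):
    """Summarize rewards of shape (num_prompts, G) collected without training."""
    rewards = np.asarray(rewards, dtype=float)
    group_std = rewards.std(axis=1)
    return {
        "mean_reward": float(rewards.mean()),
        # Groups where every completion got the same reward give zero advantages
        "zero_signal_fraction": float((group_std == 0).mean()),
    }

# Hypothetical rewards for 4 held-out prompts, G = 4 completions each
held_out_rewards = [
    [1, 1, 1, 1],   # always solved: no learning signal
    [0, 0, 0, 0],   # never solved: no learning signal
    [1, 0, 1, 0],   # mixed: informative
    [0, 0, 1, 0],   # mixed: informative
]
stats = group_reward_stats(held_out_rewards)
print(stats)
```

A high zero-signal fraction suggests the prompts are too easy or too hard for the current policy, or that the group size is too small to surface mixed outcomes.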
References¶
- Shao et al., DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
- Schulman et al., Proximal Policy Optimization Algorithms
- Hugging Face, TRL documentation


