GRPO σ-normalization
Bereket & Leskovec show GRPO’s advantage scaling induces overconfidence. Their mechanism needs the target to be a sampled outcome — a vertex of the simplex. Under a known soft distributional target it dissolves: the same divisor becomes a harmless z-score.
Their mechanism
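The reward depends on a sampled outcome $y \in \{0, 1\}$: $r = R(p, y)$, so a single prediction $p$ has two possible reward values, $R(p, 1)$ if $y = 1$ and $R(p, 0)$ if $y = 0$. With $y \sim \mathrm{Bernoulli}(p^*)$ and $\bar R_y$ the group-mean reward under outcome $y$, the true advantage is
$$A(p) = p^* \,[R(p, 1) - \bar R_1] + (1 - p^*) \,[R(p, 0) - \bar R_0]$$
The group std in the denominator depends on which outcome occurred, so there are two, $\sigma_1$ and $\sigma_0$ (and likewise two group means, $\bar R_1$ and $\bar R_0$), and the std-normalized advantage is approximately
$$\hat A(p) \approx \frac{p^*}{\sigma_1} \,[R(p, 1) - \bar R_1] + \frac{1 - p^*}{\sigma_0} \,[R(p, 0) - \bar R_0]$$
This differs from the true advantage only in the two coefficients, $1/\sigma_1$ and $1/\sigma_0$. Equal coefficients would be a harmless rescale; the bias is that they are unequal. As the policy concentrates above $\tfrac{1}{2}$, $\sigma_0 > \sigma_1$ (their Fig. 5), so the $y = 0$ term, the penalty that should punish overconfidence, is divided by the larger number and shrinks, while the $y = 1$ reward dominates. Overconfident predictions look better than they are.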
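A minimal numerical sketch of this mechanism, under assumptions the text does not state: the reward is taken to be the log score (log p if the outcome is 1, log(1 - p) if it is 0), the group is a toy set of predictions clustered near 0.8, and the names `per_outcome_stds` and `implied_optimum` are hypothetical. It computes the two per-outcome group stds and the prediction that would maximize the std-weighted objective.

```python
import numpy as np

def per_outcome_stds(preds):
    """Group std of the (assumed) log-score reward under each outcome."""
    sigma1 = np.std(np.log(preds))      # rewards the group receives if y = 1
    sigma0 = np.std(np.log1p(-preds))   # rewards the group receives if y = 0
    return sigma1, sigma0

def implied_optimum(p_star, sigma1, sigma0):
    """Maximizer of a*log(p) + b*log(1-p), with a = p*/sigma1, b = (1-p*)/sigma0.

    Setting the derivative a/p - b/(1-p) to zero gives p = a / (a + b).
    """
    a = p_star / sigma1
    b = (1.0 - p_star) / sigma0
    return a / (a + b)

# Toy group of predictions concentrated above 1/2 (illustrative values only).
rng = np.random.default_rng(0)
preds = np.clip(rng.normal(0.8, 0.05, size=64), 0.01, 0.99)

sigma1, sigma0 = per_outcome_stds(preds)
print("sigma_1:", sigma1, "sigma_0:", sigma0)
print("optimum with equal stds:", implied_optimum(0.7, 1.0, 1.0))
print("optimum with group stds:", implied_optimum(0.7, sigma1, sigma0))
```

With equal stds the maximizer is the true probability itself; when the std under outcome 0 is the larger one, the maximizer moves above it, which is the overconfidence the section describes.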
Our reward
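We score deterministically against a known soft target $q$:
$$r_i = \sum_k q_k \log p_{i,k} = -H(q) - \mathrm{KL}(q \,\|\, p_i)$$
This is one scalar per rollout: there is no sampled outcome to index, so the group has a single $\sigma$. Centering cancels the constant $-H(q)$, leaving the advantage as a divergence gap, with $\overline{\mathrm{KL}}$ the group-mean divergence:
$$\hat A_i = \frac{\overline{\mathrm{KL}} - \mathrm{KL}(q \,\|\, p_i)}{\sigma}$$
The advantage is a z-score of the rollout’s KL-to-target within its group: how many group-$\sigma$ closer to $q$ this rollout is than its peers.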
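The cancellation can be checked numerically. The sketch below assumes the reward is the cross-entropy score against the soft target (an assumption about the exact reward form), uses a made-up target and group of rollout distributions, and the helper names `kl` and `grpo_advantage` are hypothetical.

```python
import numpy as np

def kl(q, p):
    """KL(q || p) along the last axis; assumes strictly positive entries."""
    return np.sum(q * (np.log(q) - np.log(p)), axis=-1)

def grpo_advantage(rewards, eps=1e-8):
    """Group-centered, std-normalized advantage (one mean and one std per group)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Illustrative soft target and a group of four rollout predictions.
q = np.array([0.6, 0.3, 0.1])
group = np.array([
    [0.60, 0.30, 0.10],   # matches the target exactly
    [0.80, 0.15, 0.05],   # sharper than the target
    [0.40, 0.40, 0.20],
    [0.34, 0.33, 0.33],   # nearly uniform
])

# Assumed reward: cross-entropy score, equal to -H(q) - KL(q || p_i).
rewards = np.sum(q * np.log(group), axis=-1)

adv_from_reward = grpo_advantage(rewards)
adv_from_kl = grpo_advantage(-kl(q, group))   # the constant -H(q) is dropped

print(adv_from_reward)
print(adv_from_kl)
```

Both vectors agree because the entropy term is the same for every rollout in the group, so it disappears when the group mean is subtracted and does not affect the std.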
Where it differs
A single $\sigma$ replaces the per-outcome family $\{\sigma_y\}$. It factors out, so the advantage is the true advantage $A_i = r_i - \bar r$ (reward minus group mean) scaled by one positive scalar:
$$\hat A_i = \frac{1}{\sigma} A_i$$
Direction-preserving. There is no outcome axis for the std to split along, so $\sigma$ rescales the whole advantage equally; there is no separate penalty term for it to shrink. The divisor they delete to kill the bias, we render harmless upstream by deleting the outcome sampling that indexes it.
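One way to make "direction-preserving" concrete is to compare the advantage vector over a group before and after normalization. The sketch below is illustrative only: the rewards and predictions are made up, the sampled-outcome case assumes a log-score reward and a true probability of 0.7, and the function names are hypothetical.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two advantage vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def single_sigma_advantage(rewards):
    """Deterministic reward: one group std divides the whole centered reward."""
    centered = rewards - rewards.mean()
    return centered, centered / rewards.std()

def split_sigma_advantage(preds, p_star):
    """Sampled binary outcome (assumed log score): each outcome term has its own std."""
    r1, r0 = np.log(preds), np.log1p(-preds)
    t1 = p_star * (r1 - r1.mean())           # term for outcome 1
    t0 = (1.0 - p_star) * (r0 - r0.mean())   # term for outcome 0 (the penalty)
    return t1 + t0, t1 / r1.std() + t0 / r0.std()

true_soft, norm_soft = single_sigma_advantage(np.array([-0.9, -1.1, -1.4]))
true_hard, norm_hard = split_sigma_advantage(np.array([0.7, 0.8, 0.9]), p_star=0.7)

print("single std, cosine:", cosine(true_soft, norm_soft))
print("per-outcome stds, cosine:", cosine(true_hard, norm_hard))
```

With a single std the cosine is 1 up to rounding, since the normalized vector is a positive multiple of the centered one. With per-outcome stds the two vectors are no longer parallel, and in this toy group the highest-ranked prediction moves from 0.7 to 0.9.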
Simplex geometry
Overconfidence means the optimizer is driven toward a vertex of the simplex — a one-hot distribution.
- One-hot supervision rewards $\log p_y$, whose optimum is the vertex $e_y$ for class $y$.
- A soft target rewards $\sum_k q_k \log p_k$, whose optimum is the interior point $p = q$.
The pathology needs a corner-seeking optimum; a known soft target does not have one. “Overconfident in the one-hot case” and “not overconfident in the soft case” are two readings of the same fact: whether the reward’s optimum sits at a vertex or in the interior.
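The vertex-versus-interior contrast can be sketched with plain gradient ascent on softmax logits. This is an illustration under assumptions, not the training setup: the reward is taken to be the expected log-probability under the target, the target values are made up, and `ascend` is a hypothetical helper.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def ascend(target, steps=2000, lr=0.5):
    """Gradient ascent on sum_k target_k * log softmax(z)_k, starting from uniform.

    For a target that sums to 1, the gradient with respect to the logits
    is target - softmax(z).
    """
    z = np.zeros_like(target, dtype=float)
    for _ in range(steps):
        z += lr * (target - softmax(z))
    return softmax(z)

soft_target = np.array([0.6, 0.3, 0.1])   # interior point of the simplex
one_hot = np.array([1.0, 0.0, 0.0])       # vertex of the simplex

print("soft target ->", ascend(soft_target))
print("one-hot     ->", ascend(one_hot))
```

Under the soft target the iterate settles at the target itself and the gradient vanishes there. Under the one-hot target the gradient never vanishes at any finite logit, so the prediction keeps moving toward the vertex.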