GRPO σ-normalization

Bereket & Leskovec show GRPO’s $\sigma^{-1}$ advantage scaling induces overconfidence. Their mechanism needs the target to be a sampled outcome — a vertex of the simplex. Under a known soft distributional target it dissolves: the same divisor becomes a harmless z-score.

Their mechanism

The reward depends on a sampled outcome $a \sim \operatorname{Bernoulli}(p)$: $r(\hat p, a) = a \log \hat p + (1-a)\log(1-\hat p)$, so a single prediction $\hat p$ has two possible reward values: $\log \hat p$ if $a{=}1$ and $\log(1{-}\hat p)$ if $a{=}0$. With $\mu_a = \mathbb{E}_{\hat p'}[r(\hat p', a)]$ the mean reward over the policy's predictions $\hat p'$ under outcome $a$, the true advantage of prediction $\hat p$ for prompt $q$ is

$$A(q, \hat p) = p\,(r(\hat p, 1) - \mu_1) + (1-p)\,(r(\hat p, 0) - \mu_0).$$

The group std in the denominator depends on which outcome occurred, so there are two — $\sigma_1 = \mathbb{E}_{\hat p}[\operatorname{std}(r(\hat p, 1))]$ and $\sigma_0$ likewise — and the std-normalized advantage is approximately

$$\mathbb{E}[\hat A_{\mathrm{STD}}] \approx \frac{p\,(r(\hat p, 1) - \mu_1)}{\sigma_1 + \epsilon} + \frac{(1-p)\,(r(\hat p, 0) - \mu_0)}{\sigma_0 + \epsilon}.$$

This differs from the true advantage only in the two coefficients, $1/(\sigma_1 + \epsilon)$ and $1/(\sigma_0 + \epsilon)$. Equal coefficients would be a harmless rescale; the bias comes from their being unequal. As the policy concentrates above $0.5$, $\sigma_0 > \sigma_1$ (their Fig. 5), so the $a{=}0$ term, the penalty that should punish overconfidence, is divided by the larger number and shrinks, while the $a{=}1$ reward dominates. Overconfident predictions look better than they are.
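The asymmetry can be checked numerically. The sketch below is an illustration under stated assumptions, not their experiment: the policy is modelled as a hypothetical Beta(8, 2) distribution over $\hat p$ (mean 0.8), groups have 8 rollouts, and the true probability is set to $p = 0.7$. It estimates $\mu_a$ and $\sigma_a$ by sampling and compares the true advantage with the std-normalized approximation for a calibrated and an overconfident prediction.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 1e-6

def log_score(p_hat, a):
    """r(p_hat, a) = a*log(p_hat) + (1-a)*log(1-p_hat)."""
    return a * np.log(p_hat) + (1 - a) * np.log(1 - p_hat)

# Assumption: a policy concentrated above 0.5, modelled as Beta(8, 2).
n_groups, group_size = 2000, 8
samples = rng.beta(8.0, 2.0, size=(n_groups, group_size))
samples = np.clip(samples, 1e-12, 1 - 1e-12)  # guard against log(0)

# mu_a: mean reward under outcome a; sigma_a: mean within-group std under outcome a.
mu = {a: log_score(samples, a).mean() for a in (0, 1)}
sigma = {a: log_score(samples, a).std(axis=1).mean() for a in (0, 1)}

def true_advantage(p, p_hat):
    return (p * (log_score(p_hat, 1) - mu[1])
            + (1 - p) * (log_score(p_hat, 0) - mu[0]))

def std_advantage(p, p_hat):
    # Same two terms, each divided by its own per-outcome std.
    return (p * (log_score(p_hat, 1) - mu[1]) / (sigma[1] + eps)
            + (1 - p) * (log_score(p_hat, 0) - mu[0]) / (sigma[0] + eps))

p = 0.7                                # hypothetical true outcome probability
calibrated, overconfident = 0.7, 0.9   # two candidate predictions

print(f"sigma_1 = {sigma[1]:.3f}, sigma_0 = {sigma[0]:.3f}")
print("true:", true_advantage(p, calibrated), true_advantage(p, overconfident))
print("std :", std_advantage(p, calibrated), std_advantage(p, overconfident))
```

Under these assumptions the true advantage ranks the calibrated prediction above the overconfident one, while the std-normalized version reverses the ranking because the $a{=}0$ penalty is divided by the larger $\sigma_0$.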

Our reward

We score deterministically against a known soft target $p$:

$$R(\hat p) = \sum_i p_i \log \hat p_i = \sum_i p_i \log p_i - \sum_i p_i \log \frac{p_i}{\hat p_i} = -H(p) - \mathrm{KL}(p \,\|\, \hat p).$$

This is one scalar per rollout. There is no sampled outcome to index, so the group has a single $\hat\sigma$. Centering cancels the constant $-H(p)$, leaving the advantage as a divergence gap:

$$\hat A_{\mathrm{STD}} = \frac{R^{(1)} - \bar R}{\hat\sigma + \epsilon} = \frac{\overline{\mathrm{KL}} - \mathrm{KL}^{(1)}}{\hat\sigma_{\mathrm{KL}} + \epsilon}.$$

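Here the superscript marks the rollout being scored, and $\hat\sigma = \hat\sigma_{\mathrm{KL}}$ because $R$ and $\mathrm{KL}$ differ only by a sign and the constant $H(p)$. The advantage is a z-score of the rollout’s KL-to-target within its group: how many group standard deviations closer to $p$ this rollout is than its peers.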
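A small numerical check of this identity, as a sketch: the three-class target `p` and the four rollout predictions in `group` are made-up values, and `std_advantage` is a hypothetical helper that applies the usual mean-centering and std division. The code computes the advantage once from the reward $R$ and once from $-\mathrm{KL}$ and confirms the two agree.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for strictly positive distributions on the same support."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

def soft_reward(p, p_hat):
    """R(p_hat) = sum_i p_i log p_hat_i, the negative cross-entropy."""
    return float(np.sum(p * np.log(p_hat)))

def std_advantage(x, eps=1e-6):
    """Center by the group mean and divide by the group std."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (x.std() + eps)

# Hypothetical soft target and a group of four rollout predictions.
p = np.array([0.6, 0.3, 0.1])
group = np.array([
    [0.60, 0.30, 0.10],   # exactly on target
    [0.80, 0.15, 0.05],
    [0.40, 0.40, 0.20],
    [0.98, 0.01, 0.01],   # near a vertex
])

rewards = np.array([soft_reward(p, g) for g in group])
kls = np.array([kl(p, g) for g in group])

adv_from_reward = std_advantage(rewards)
adv_from_kl = std_advantage(-kls)   # H(p) is constant, so it drops out

print(adv_from_reward)
print(adv_from_kl)
```

In this example the on-target rollout receives the largest advantage and the near-vertex rollout the smallest, which is the z-score reading of the formula.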

Where it differs

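A single $\hat\sigma$ replaces the per-outcome family $\{\sigma_a\}$. It is shared by every term, so it factors out, and the advantage is the true advantage scaled by one positive scalar:

$$\mathbb{E}[\hat A_{\mathrm{STD}}] \approx \frac{A(q, \hat p)}{\hat\sigma + \epsilon} \;\propto\; A(q, \hat p),$$

where $A(q, \hat p) = R(\hat p) - \mathbb{E}_{\hat p'}[R(\hat p')]$ is the true advantage under the soft-target reward and, as in their approximation, the group std is treated as fixed.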

The rescale is direction-preserving: within a group it changes neither the sign nor the ranking of the advantages. There is no outcome axis for the std to split along, so $\sigma^{-1}$ rescales the whole advantage equally, and there is no separate penalty term for it to shrink. Where they remove the divisor to eliminate the bias, we make it harmless upstream by removing the outcome sampling that indexes it.
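The contrast between one shared divisor and two per-outcome divisors can be made concrete with a few numbers. Everything in the sketch below is hypothetical: `rewards` is an invented group of soft-target rewards, and `term_1`, `term_0`, `sigma_1`, `sigma_0` are invented values for the two weighted terms and stds of the sampled-outcome case.

```python
import numpy as np

eps = 1e-6

# Case 1: one scalar reward per rollout, one group std.
rewards = np.array([-1.2, -0.9, -1.5, -1.0])   # hypothetical group rewards
centered = rewards - rewards.mean()
normalized = centered / (rewards.std() + eps)

same_sign = np.array_equal(np.sign(centered), np.sign(normalized))
same_order = np.array_equal(np.argsort(centered), np.argsort(normalized))

# Case 2: two outcome-indexed terms, each divided by its own std.
term_1 = 0.2    # hypothetical p * (r(p_hat, 1) - mu_1)
term_0 = -0.3   # hypothetical (1 - p) * (r(p_hat, 0) - mu_0)
sigma_1, sigma_0 = 0.15, 0.70                  # hypothetical, sigma_0 > sigma_1

true_adv = term_1 + term_0
split_adv = term_1 / (sigma_1 + eps) + term_0 / (sigma_0 + eps)

print(same_sign, same_order)   # single divisor keeps sign and ranking
print(true_adv, split_adv)     # unequal divisors can flip the sign
```

With a single divisor the sign and ordering are untouched. With the two unequal divisors chosen here, a negative true advantage becomes a positive normalized one.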

Simplex geometry

Overconfidence means the optimizer is driven toward a vertex of the simplex — a one-hot distribution.

One-hot supervision rewards $\log \hat p_a$, whose optimum is the vertex for class $a$.
A soft target rewards $-\mathrm{KL}(p \,\|\, \hat p)$, whose optimum is the interior point $\hat p = p$.

The pathology needs a corner-seeking optimum; a known soft target with full support does not have one. “Overconfident in the one-hot case” and “not overconfident in the soft case” are the same statement about whether the reward’s optimum sits at a vertex or in the interior.
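The location of the two optima can be seen with a grid search in the binary case. This is a minimal sketch: the target value `p = 0.7` and the grid bounds are arbitrary choices for illustration, and the one-hot reward is evaluated for the observed class $a = 1$.

```python
import numpy as np

# Candidate predictions p_hat for class 1, kept away from 0 and 1 to avoid log(0).
grid = np.linspace(0.001, 0.999, 999)
p = 0.7   # hypothetical soft target P(class 1)

# One-hot reward for observed class a = 1: log p_hat_a.
one_hot_reward = np.log(grid)

# Soft-target reward: -KL(p || p_hat) for a two-class distribution.
soft_reward = p * np.log(grid / p) + (1 - p) * np.log((1 - grid) / (1 - p))

one_hot_opt = grid[np.argmax(one_hot_reward)]
soft_opt = grid[np.argmax(soft_reward)]

print("one-hot optimum:", one_hot_opt)   # pushed to the edge of the grid
print("soft optimum:   ", soft_opt)      # sits at the target p
```

The one-hot reward is maximized at the largest value the grid allows, so extending the grid toward 1 moves the optimum with it. The soft-target reward peaks at $\hat p = p$ regardless of the grid bounds.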