MedMCQA

4-choice medical-board questions. The model outputs a verbalized probability distribution over the choices. Replication of Bereket & Leskovec, Uncalibrated Reasoning: GRPO Induces Overconfidence for Stochastic Outcomes.

GRPO divides the group-centered reward by the within-group reward standard deviation $\hat\sigma_i$:

$$\hat A_i = \frac{r_i - \operatorname{mean}(r)}{\hat\sigma_i + \epsilon}.$$

Dropping the std leaves the advantage unbiased up to a factor of $\tfrac{G-1}{G}$, so the policy stays calibrated. Keeping it, the std is taken per outcome: for a binary answer, $\sigma_1$ when $a{=}1$ and $\sigma_0$ when $a{=}0$. As the policy concentrates above $0.5$, $\sigma_0 > \sigma_1$, so the penalty for the overconfident wrong call is divided by the larger number and shrinks, which pushes the policy toward overconfidence.

This is not necessarily true in general; it is examined further in GRPO σ-normalization.
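
The $\sigma_0 > \sigma_1$ asymmetry described above can be checked numerically. The following is a minimal sketch, not the training code: the group of five probabilities is made up for illustration, the reward is assumed to be the binary log score, and the population std (`pstdev`) stands in for whatever estimator the trainer actually uses.

```python
import math
from statistics import mean, pstdev

# Hypothetical group of G = 5 verbalized probabilities for the event a = 1,
# all concentrated above 0.5 (illustrative values, not from the experiment).
q = [0.70, 0.75, 0.80, 0.85, 0.90]

def advantages(rewards, normalize, eps=1e-6):
    """Group-centered advantages, optionally divided by the group std."""
    mu = mean(rewards)
    scale = pstdev(rewards) + eps if normalize else 1.0
    return [(r - mu) / scale for r in rewards]

# Log-score rewards for the same group under each outcome.
r1 = [math.log(p) for p in q]        # outcome a = 1
r0 = [math.log(1.0 - p) for p in q]  # outcome a = 0

sigma1, sigma0 = pstdev(r1), pstdev(r0)

# Advantage of the most confident sample (q = 0.90) under each outcome.
raw_gain = advantages(r1, normalize=False)[-1]
raw_loss = advantages(r0, normalize=False)[-1]
norm_gain = advantages(r1, normalize=True)[-1]
norm_loss = advantages(r0, normalize=True)[-1]

print(f"sigma_1 = {sigma1:.3f}, sigma_0 = {sigma0:.3f}")
print(f"penalty/gain ratio without std: {abs(raw_loss) / raw_gain:.2f}")
print(f"penalty/gain ratio with std:    {abs(norm_loss) / norm_gain:.2f}")
```

In this toy group the penalty for the confident sample on $a{=}0$ is several times its gain on $a{=}1$ before normalization, and the two are close to equal after dividing by the per-outcome std. That is the shrinkage the argument relies on, shown for one hand-picked group only.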

Dataset entry

openlifescienceai/medmcqa · validation[0]

Which of the following is not true for myelinated nerve fibers:

A. Impulse through myelinated fibers is slower than non-myelinated fibers
B. Membrane currents are generated at nodes of Ranvier
C. Saltatory conduction of impulses is seen
D. Local anesthesia is effective only when the nerve is not covered by myelin sheath

Gold target

1.0 A

Model output

{A: 0.4; B: 0.3; C: 0.1; D: 0.2}
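
An output in this shape has to be parsed before it can be scored. The sketch below is one possible parser, assuming the `{A: p; B: p; C: p; D: p}` layout shown above; the function name `parse_distribution`, the tolerance, and the validity rules (all four choices present once, each probability in [0, 1], sum close to 1) are assumptions for illustration, not the rules used for the reported format rate.

```python
import re

CHOICES = ("A", "B", "C", "D")

def parse_distribution(text, tol=1e-3):
    """Parse '{A: 0.4; B: 0.3; C: 0.1; D: 0.2}' into a dict.

    Returns None when the text does not match the assumed format.
    """
    match = re.fullmatch(r"\s*\{(.*)\}\s*", text, flags=re.DOTALL)
    if match is None:
        return None
    probs = {}
    for item in match.group(1).split(";"):
        key, sep, value = item.partition(":")
        key = key.strip()
        # Reject missing colons, unknown labels, and repeated labels.
        if not sep or key not in CHOICES or key in probs:
            return None
        try:
            p = float(value)
        except ValueError:
            return None
        # Also rejects NaN, since every comparison with NaN is False.
        if not 0.0 <= p <= 1.0:
            return None
        probs[key] = p
    # Require every choice and a total probability close to 1.
    if set(probs) != set(CHOICES) or abs(sum(probs.values()) - 1.0) > tol:
        return None
    return probs

parsed = parse_distribution("{A: 0.4; B: 0.3; C: 0.1; D: 0.2}")
print(parsed)
```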

Setup

Qwen3-4B + LoRA, trained with GRPO. The reward is the log score of the verbalized distribution. Two runs are compared: with and without normalizing by the group reward std.
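
For concreteness, the log score of the example output above can be computed as follows. This is a sketch under assumptions: the helper name `log_score` and the probability floor are hypothetical, and how malformed outputs are rewarded in the actual runs is not specified here.

```python
import math

def log_score(probs, gold, floor=1e-6):
    """Log of the probability assigned to the gold choice.

    The floor (an assumed value) keeps the reward finite when the
    model assigns zero probability to the correct answer.
    """
    return math.log(max(probs[gold], floor))

# Example output from the dataset entry above; the gold answer is A.
probs = {"A": 0.4, "B": 0.3, "C": 0.1, "D": 0.2}
reward = log_score(probs, "A")
print(f"reward = {reward:.3f}")
```

The log score is a proper scoring rule, so in expectation it is maximized by reporting the true probabilities, which is what makes it a natural reward for studying calibration.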

Results

Eval on 4183 questions:

Metric	with $\sigma^{-1}$	without $\sigma^{-1}$
Format rate	0.865	0.998
Accuracy	0.556	0.585
ECE (marginal)	0.104	0.019
ECE (argmax)	0.21	0.04
Mean reward	−1.40	−1.04
Mean max confidence	0.75	0.62

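ECE (expected calibration error) is much lower without the std normalization on both variants (0.019 vs 0.104 marginal, 0.04 vs 0.21 argmax), which is consistent with Bereket & Leskovec's result. wandb report.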
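
The two ECE rows can be read as follows; this is a sketch of one common definition, not necessarily the one used for the table. It assumes 10 equal-width bins, that "argmax" scores only the top choice's confidence against whether it was correct, and that "marginal" scores every (choice, probability) pair against whether that choice was the gold answer. All function names are hypothetical.

```python
def ece(confidences, hits, n_bins=10):
    """Expected calibration error with equal-width bins over [0, 1]."""
    bins = [[] for _ in range(n_bins)]
    for c, h in zip(confidences, hits):
        idx = min(int(c * n_bins), n_bins - 1)  # c == 1.0 goes in the last bin
        bins[idx].append((c, h))
    total = len(confidences)
    err = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            avg_hit = sum(h for _, h in b) / len(b)
            # Weight each bin's confidence/accuracy gap by its share of samples.
            err += len(b) / total * abs(avg_conf - avg_hit)
    return err

def ece_argmax(dists, golds):
    """ECE of the top choice's confidence against top-1 correctness."""
    confs, hits = [], []
    for d, g in zip(dists, golds):
        top = max(d, key=d.get)
        confs.append(d[top])
        hits.append(1.0 if top == g else 0.0)
    return ece(confs, hits)

def ece_marginal(dists, golds):
    """ECE over every (choice, probability) pair."""
    confs, hits = [], []
    for d, g in zip(dists, golds):
        for choice, p in d.items():
            confs.append(p)
            hits.append(1.0 if choice == g else 0.0)
    return ece(confs, hits)

# Toy check: a fully confident wrong answer.
dists = [{"A": 1.0, "B": 0.0, "C": 0.0, "D": 0.0}]
print(ece_argmax(dists, ["B"]), ece_marginal(dists, ["B"]))
```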