MedMCQA
4-choice medical-board questions. The model outputs a verbalized probability distribution over the choices. Replication of Bereket & Leskovec, Uncalibrated Reasoning: GRPO Induces Overconfidence for Stochastic Outcomes.
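GRPO divides the group-centered reward by the within-group reward standard deviation $\sigma$,
$$A_i = \frac{r_i - \bar{r}}{\sigma},$$
where $r_i$ is the reward of sample $i$ and $\bar{r}$ is the group mean reward.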
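Dropping the std leaves the advantage unbiased up to a constant factor → calibrated. Keeping it, the std $\sigma$ is taken per outcome: for a binary answer $y \in \{0, 1\}$, $\sigma = \sigma_1$ when $y = 1$ and $\sigma = \sigma_0$ when $y = 0$. As the policy's stated probability $p$ of $y = 1$ concentrates above $1/2$, $\sigma_0 > \sigma_1$, so the penalty for the over-confident wrong call is divided by the larger number and shrinks → overconfidence.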
This is not necessarily true in general; it is examined further in GRPO σ-normalization.
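The per-outcome std argument above can be checked numerically. The following is a minimal sketch, not the training code: the group of confidences `p` is made up for illustration, the helper name `grpo_advantages` is hypothetical, and the same group is scored under both outcomes only to compare the two stds side by side.

```python
import numpy as np

def grpo_advantages(rewards, normalize_std=True, eps=1e-8):
    """Group-centered rewards, optionally divided by the within-group std."""
    rewards = np.asarray(rewards, dtype=float)
    centered = rewards - rewards.mean()
    if normalize_std:
        return centered / (rewards.std() + eps)
    return centered

# Hypothetical group of stated probabilities p = P(y = 1), concentrated above 1/2.
p = np.array([0.70, 0.80, 0.90, 0.95])

# Log-score reward of the same group under each possible outcome.
reward_if_y1 = np.log(p)        # y = 1: the confident call was right
reward_if_y0 = np.log(1.0 - p)  # y = 0: the confident call was wrong

sigma_1 = reward_if_y1.std()
sigma_0 = reward_if_y0.std()

# Advantage of the most confident sample (last entry) under each outcome.
gain_raw = grpo_advantages(reward_if_y1, normalize_std=False)[-1]
loss_raw = grpo_advantages(reward_if_y0, normalize_std=False)[-1]
gain_norm = grpo_advantages(reward_if_y1, normalize_std=True)[-1]
loss_norm = grpo_advantages(reward_if_y0, normalize_std=True)[-1]

# Size of the penalty for being wrong relative to the gain for being right.
ratio_raw = abs(loss_raw) / gain_raw
ratio_norm = abs(loss_norm) / gain_norm

print(f"sigma_1={sigma_1:.3f}  sigma_0={sigma_0:.3f}")
print(f"penalty/gain without std: {ratio_raw:.2f}, with std: {ratio_norm:.2f}")
```

In this toy group the wrong-outcome rewards have the larger spread, so dividing by the std shrinks the penalty relative to the gain. Other groups may behave differently, in line with the caveat above.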
Dataset entry
Which of the following is not true for myelinated nerve fibers:
A. Impulse through myelinated fibers is slower than non-myelinated fibers
B. Membrane currents are generated at nodes of Ranvier
C. Saltatory conduction of impulses is seen
D. Local anesthesia is effective only when the nerve is not covered by myelin sheath
{A: 0.4; B: 0.3; C: 0.1; D: 0.2}

Setup
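Qwen3-4B + LoRA, trained with GRPO. The reward is the log score of the verbalized distribution, i.e. the log of the probability assigned to the correct choice. Two runs are compared: with and without normalizing by the group reward std.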
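As an illustration of the log-score reward, here is a minimal sketch that parses a verbalized distribution in the `{A: 0.4; B: 0.3; C: 0.1; D: 0.2}` format shown above and scores it. The function names, the renormalization step, the probability floor, and returning `None` for malformed outputs are all assumptions made for this example; the source does not say how the actual run handles them.

```python
import math
import re

CHOICES = ("A", "B", "C", "D")

def parse_distribution(text):
    """Parse '{A: 0.4; B: 0.3; C: 0.1; D: 0.2}' into a dict, or return None."""
    pairs = re.findall(r"([A-D])\s*:\s*([0-9]*\.?[0-9]+)", text)
    probs = {label: float(value) for label, value in pairs}
    # Require each of the four choices exactly once (assumed format rule).
    if len(pairs) != len(CHOICES) or set(probs) != set(CHOICES):
        return None
    total = sum(probs.values())
    if total <= 0:
        return None
    # Renormalize so the values sum to 1 (assumption, not from the source).
    return {label: probs[label] / total for label in CHOICES}

def log_score(text, correct, floor=1e-6):
    """Log of the probability assigned to the correct choice.

    Returns None when the output cannot be parsed; the caller decides
    what reward a malformed output receives.
    """
    probs = parse_distribution(text)
    if probs is None:
        return None
    # Floor the probability to avoid log(0) (assumed value).
    return math.log(max(probs[correct], floor))

example = "{A: 0.4; B: 0.3; C: 0.1; D: 0.2}"
print(log_score(example, "A"))
```

The log score is a proper scoring rule, so its expectation is maximized by reporting the true probabilities, which is why it suits a calibration experiment.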
Results
Eval on 4183 questions:
| Metric | With std normalization | Without std normalization |
|---|---|---|
| Format rate | 0.865 | 0.998 |
| Accuracy | 0.556 | 0.585 |
| ECE (marginal) | 0.104 | 0.019 |
| ECE (argmax) | 0.21 | 0.04 |
| Mean reward | −1.40 | −1.04 |
| Mean max confidence | 0.75 | 0.62 |
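ECE (expected calibration error) falls from 0.104 to 0.019 (marginal) and from 0.21 to 0.04 (argmax) once the std normalization is dropped, while mean max confidence moves from 0.75 to 0.62, much closer to the measured accuracy. This is consistent with Bereket & Leskovec's result. See the wandb report.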
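For reference, a minimal sketch of the argmax variant of ECE: predictions are binned by their top confidence, and the gap between accuracy and mean confidence in each bin is averaged, weighted by bin size. The function name `ece_argmax`, the equal-width binning, and the choice of 10 bins are assumptions; the source does not state the binning used, and the marginal variant is not shown here.

```python
import numpy as np

def ece_argmax(probs, labels, n_bins=10):
    """Expected calibration error on the top (argmax) confidence.

    probs:  array of shape (n_questions, n_choices), rows sum to 1
    labels: array of shape (n_questions,), index of the correct choice
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    confidence = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)

    # Equal-width bins on [0, 1]; a confidence of exactly 1.0 goes in the last bin.
    bin_ids = np.minimum((confidence * n_bins).astype(int), n_bins - 1)

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidence[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return float(ece)

# Toy check: ten predictions at confidence 0.7, seven of them correct.
toy_probs = [[0.7, 0.1, 0.1, 0.1]] * 10
toy_labels = [0] * 7 + [1] * 3
print(ece_argmax(toy_probs, toy_labels))
```

A model whose top confidence matches its accuracy in every bin scores close to zero, while a model that is always certain and always wrong scores 1.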