MedMCQA

4-choice medical-board questions. The model outputs a verbalized probability distribution over the choices. Replication of Bereket & Leskovec, Uncalibrated Reasoning: GRPO Induces Overconfidence for Stochastic Outcomes.

GRPO divides the group-centered reward by the within-group reward standard deviation $\hat\sigma_i$,

$$\hat A_i = \frac{r_i - \operatorname{mean}(r)}{\hat\sigma_i + \epsilon}.$$
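As a minimal sketch of the advantage above, the snippet below computes it for one group of sampled completions, with and without the division by the group std. The function name `grpo_advantages`, the `eps` default, the use of the population std, and the example rewards are illustrative assumptions, not the training code used here.

```python
import numpy as np

def grpo_advantages(rewards, normalize_std=True, eps=1e-6):
    """Group-relative advantages for the G completions sampled for one prompt."""
    r = np.asarray(rewards, dtype=float)
    centered = r - r.mean()            # r_i - mean(r)
    if not normalize_std:
        return centered                # variant without the std division
    # Population std (ddof=0) is an assumption; implementations may differ.
    return centered / (r.std() + eps)

# Hypothetical log-score rewards for a group of G = 4 completions.
group = [-0.9, -1.2, -0.4, -2.3]
with_std = grpo_advantages(group, normalize_std=True)
without_std = grpo_advantages(group, normalize_std=False)
```

Both variants sum to zero within a group; they differ only by the per-group scale factor $1/(\hat\sigma_i + \epsilon)$.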

Dropping the std leaves the advantage unbiased up to a factor of $\tfrac{G-1}{G}$, so the policy stays calibrated. Keeping it, the std is taken per outcome: for a binary answer, $\sigma_1$ when $a{=}1$ and $\sigma_0$ when $a{=}0$. As the policy concentrates above $0.5$, $\sigma_0 > \sigma_1$, so the penalty for the over-confident wrong call is divided by the larger number and shrinks, which produces overconfidence.
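The inequality $\sigma_0 > \sigma_1$ can be checked numerically. The sketch below assumes a log-score reward and a hypothetical group of stated probabilities for $a{=}1$, all above $0.5$; the specific values are made up for illustration.

```python
import numpy as np

# Hypothetical group of stated probabilities for a=1, all above 0.5.
p = np.array([0.70, 0.75, 0.80, 0.85, 0.90])

reward_if_1 = np.log(p)          # log score of each sample if the outcome is a=1
reward_if_0 = np.log(1.0 - p)    # log score of each sample if the outcome is a=0

sigma_1 = reward_if_1.std()      # group std when a=1 is realized
sigma_0 = reward_if_0.std()      # group std when a=0 is realized

print(f"sigma_1 = {sigma_1:.3f}, sigma_0 = {sigma_0:.3f}")
```

For this group $\sigma_0$ comes out several times larger than $\sigma_1$, because $\log(1-p)$ changes much faster than $\log p$ when $p$ is close to 1. Dividing by $\sigma_0$ therefore scales down the advantages in exactly the groups where the confident prediction was wrong.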

This does not necessarily hold in general; it is examined further in GRPO σ-normalization.

Dataset entry

openlifescienceai/medmcqa · validation[0]
Which of the following is not true for myelinated nerve fibers:

A. Impulse through myelinated fibers is slower than non-myelinated fibers
B. Membrane currents are generated at nodes of Ranvier
C. Saltatory conduction of impulses is seen
D. Local anesthesia is effective only when the nerve is not covered by myelin sheath
  • 1.0 A
{A: 0.4; B: 0.3; C: 0.1; D: 0.2}

Setup

Qwen3-4B + LoRA, GRPO. Log-score reward on the verbalized distribution, trained with vs without normalizing by the group reward std.
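A minimal sketch of a log-score reward on a verbalized distribution, using the example output from the dataset entry above. The output format, the parsing rules, the handling of malformed outputs (returning `None`), and the probability floor are assumptions for illustration; the actual reward code may differ.

```python
import math
import re

CHOICES = ("A", "B", "C", "D")

def parse_distribution(text):
    """Parse '{A: 0.4; B: 0.3; C: 0.1; D: 0.2}' into a dict, or None if malformed."""
    pairs = re.findall(r"([A-D])\s*:\s*([0-9]*\.?[0-9]+)", text)
    probs = {k: float(v) for k, v in pairs}
    # Require each choice exactly once and probabilities that sum to 1.
    if len(pairs) != len(CHOICES) or set(probs) != set(CHOICES):
        return None
    if abs(sum(probs.values()) - 1.0) > 1e-6:
        return None
    return probs

def log_score(probs, gold, floor=1e-6):
    """Log of the probability placed on the gold choice, floored to avoid log(0)."""
    return math.log(max(probs[gold], floor))

dist = parse_distribution("{A: 0.4; B: 0.3; C: 0.1; D: 0.2}")
reward = log_score(dist, "A")    # log(0.4), taking A as the gold choice
```

The log score is a proper scoring rule, so in expectation it is maximized by reporting the true probabilities, which is what makes it a natural reward for studying calibration.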

Results

Eval on 4183 questions:

| Metric | with $\sigma^{-1}$ | without $\sigma^{-1}$ |
|---|---|---|
| Format rate | 0.865 | 0.998 |
| Accuracy | 0.556 | 0.585 |
| ECE (marginal) | 0.104 | 0.019 |
| ECE (argmax) | 0.21 | 0.04 |
| Mean reward | −1.40 | −1.04 |
| Mean max confidence | 0.75 | 0.62 |

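ECE (expected calibration error) is roughly five times lower without the std normalization on both the marginal and the argmax measure, which supports their result. wandb report.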
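For reference, a sketch of how the two ECE variants could be computed. It assumes 10 equal-width bins, reads "argmax" as the usual top-label ECE (confidence is the largest stated probability), and reads "marginal" as treating every (question, choice) probability as a separate binary prediction. These readings and the function names are assumptions; the evaluation code behind the reported numbers may define them differently.

```python
import numpy as np

def ece(confidence, correct, n_bins=10):
    """Expected calibration error with equal-width confidence bins."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Bin index per prediction; confidence == 1.0 falls in the last bin.
    bins = np.minimum((confidence * n_bins).astype(int), n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(confidence[mask].mean() - correct[mask].mean())
            total += mask.mean() * gap   # weight by the fraction of samples in the bin
    return total

def ece_argmax(probs, gold):
    """Top-label ECE: is the highest-probability choice right as often as stated?"""
    probs = np.asarray(probs, dtype=float)   # shape (N, 4)
    gold = np.asarray(gold)                  # shape (N,), index of the gold choice
    return ece(probs.max(axis=1), probs.argmax(axis=1) == gold)

def ece_marginal(probs, gold):
    """ECE over all N*4 per-choice probabilities, each scored against a 0/1 label."""
    probs = np.asarray(probs, dtype=float)
    onehot = np.eye(probs.shape[1])[np.asarray(gold)]
    return ece(probs.ravel(), onehot.ravel())
```

Both functions take an `(N, 4)` array of parsed distributions and the gold choice indices, so outputs that fail to parse would have to be dropped or handled separately before calling them.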