Let's Think Through the Possibilities

We train language models to output a calibrated probability distribution, using a cross-entropy distribution-matching reward. The output can be in any format from which we can induce a distribution. Framework has the setup; the experiments are organized by domain below; the scratchpads work through the analysis.
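As a rough illustration of the reward described above, the sketch below parses one hypothetical output format (one `hypothesis: probability` pair per line), normalizes it into a distribution, and scores it against a gold distribution with negative cross-entropy. The line format, the names `induce_distribution` and `cross_entropy_reward`, the `eps` floor for hypotheses the model did not list, and the toy hypotheses are all assumptions made for illustration; they are not the setup from Framework.

```python
import math

def induce_distribution(text):
    """Parse 'hypothesis: weight' lines into a normalized distribution (assumed format)."""
    weights = {}
    for line in text.strip().splitlines():
        name, _, value = line.rpartition(":")
        name = name.strip().lower()
        if not name:
            continue  # line has no 'name: value' shape
        try:
            w = float(value)
        except ValueError:
            continue  # weight is not a number
        if w > 0:
            weights[name] = weights.get(name, 0.0) + w
    total = sum(weights.values())
    if total == 0:
        return {}
    return {h: w / total for h, w in weights.items()}

def cross_entropy_reward(pred, gold, eps=1e-6):
    """Negative cross-entropy sum_h gold(h) * log pred(h); missing mass is floored at eps."""
    return sum(p * math.log(max(pred.get(h, 0.0), eps)) for h, p in gold.items())

# Toy example: the model spreads mass over three hypotheses, the gold is one-hot.
output = "flu: 0.6\ncold: 0.3\nallergy: 0.1"
pred = induce_distribution(output)
gold = {"flu": 1.0}
reward = cross_entropy_reward(pred, gold)  # equals log(pred["flu"])
```

The reward is at most 0 and reaches 0 only when the prediction puts all gold-weighted mass exactly where the gold does. Any other parser that yields a normalized dictionary could be swapped in without changing the scoring function.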

Experiment map

| Domain | Gold target | Why this dataset |
| --- | --- | --- |
| MedMCQA | 4-choice | replicate Bereket; GRPO σ-normalization overconfidence |
| HotpotQA | single answer, dropped evidence | RLCR’s domain |
| ProtoQA | crowd distribution (~5 clusters) | the support-recovery angle in isolation |
| DDXPlus | distribution over 49 conditions | RLCR’s domain; CoT vs DSL vs PPL representations |

The shape of the gold support decides whether the model should commit to one hypothesis or spread probability across several, and the "other" bucket is what makes each behavior reward-optimal.
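To make the commit-versus-spread point concrete, here is a toy comparison under stated assumptions: the reward is negative cross-entropy, the model lists some hypotheses plus an explicit `other` bucket, and any gold mass on hypotheses the model did not list is scored against `other`. That scoring rule for unlisted mass, the function name `reward_with_other`, and all numbers are illustrative assumptions rather than the exact mechanism used in the experiments.

```python
import math

def reward_with_other(pred, gold, eps=1e-6):
    """Negative cross-entropy; gold hypotheses missing from pred are scored on pred['other']."""
    total = 0.0
    for h, p in gold.items():
        q = pred[h] if h in pred else pred.get("other", 0.0)
        total += p * math.log(max(q, eps))
    return total

# Two candidate behaviors (hypothetical numbers).
commit = {"a": 0.97, "other": 0.03}
spread = {"a": 0.4, "b": 0.3, "c": 0.2, "other": 0.1}

# Two gold support shapes: a single answer vs. a crowd distribution.
one_hot_gold = {"a": 1.0}
crowd_gold = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}

for name, gold in [("one-hot", one_hot_gold), ("crowd", crowd_gold)]:
    r_commit = reward_with_other(commit, gold)
    r_spread = reward_with_other(spread, gold)
    print(f"{name}: commit={r_commit:.3f} spread={r_spread:.3f}")
```

In this toy setting, committing scores higher on the one-hot gold and spreading scores higher on the crowd gold. This is the usual property of cross-entropy that the expected reward is maximized when the predicted distribution matches the gold one.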

Prior work

Uncalibrated Reasoning (Bereket & Leskovec): RL on verbalized probabilities for binary and 4-choice outcomes; finds that GRPO’s std-normalization induces overconfidence.

Beyond Binary Rewards (RLCR) (Damani et al.): the model outputs one answer plus a scalar confidence and is rewarded with correctness plus a Brier term.

Both predict probabilities over a small, fixed option set. We instead target a distribution over a hypothesis space that the model itself generates.
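For reference, the two prior-work ingredients mentioned above can be sketched in a few lines. `rlcr_reward` reads "correctness + Brier" as correctness minus the squared error of the stated confidence, and `grpo_advantages` shows the group-wise mean and std normalization. The function names, the `eps` constant, the use of the population standard deviation, and the sample numbers are assumptions; the papers should be consulted for the exact forms.

```python
import statistics

def rlcr_reward(correct, confidence):
    """Correctness plus a Brier term: y - (confidence - y)^2, with y in {0, 1}."""
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2

def grpo_advantages(rewards, normalize_std=True, eps=1e-8):
    """Group-relative advantages: center by the group mean, optionally divide by the group std."""
    mean = statistics.fmean(rewards)
    centered = [r - mean for r in rewards]
    if not normalize_std:
        return centered
    std = statistics.pstdev(rewards)  # population std is an assumption here
    return [c / (std + eps) for c in centered]

# A group whose rewards differ by only 0.01 ...
narrow_group = [0.50, 0.51]
# ... still gets advantages of roughly -1 and +1 after std normalization.
normalized = grpo_advantages(narrow_group)
unnormalized = grpo_advantages(narrow_group, normalize_std=False)
```

Dividing by the group standard deviation rescales every group to roughly unit spread, so a small reward gap within a group produces the same advantage magnitude as a large one. The sketch only shows that rescaling; it does not reproduce the overconfidence analysis itself.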