Let's Think Through the Possibilities

We train language models to output a calibrated probability distribution, using a cross-entropy distribution-matching reward. The output can take any format from which we can induce a distribution. Framework has the setup; the experiments below are organized by domain; the scratchpads work through the analysis.
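As a rough illustration of what a cross-entropy matching reward could look like, here is a minimal sketch. It assumes the reward is the negative cross-entropy between a gold distribution and the distribution induced from the model output, with a small probability floor. The function name `cross_entropy_reward`, the `eps` floor, and the renormalization step are assumptions made for this example, not the setup described in Framework.

```python
import math

def cross_entropy_reward(gold, pred, eps=1e-6):
    """Negative cross-entropy of pred under gold (higher is better).

    gold, pred: dicts mapping hypothesis -> probability mass.
    eps is an assumed floor so that missing hypotheses give a finite penalty.
    """
    total = sum(pred.values())
    if total <= 0:
        raise ValueError("predicted distribution has no mass")
    reward = 0.0
    for hypothesis, p in gold.items():
        if p > 0:
            # Renormalize the prediction (assumption) and floor it at eps.
            q = max(pred.get(hypothesis, 0.0) / total, eps)
            reward += p * math.log(q)
    return reward

gold = {"A": 0.7, "B": 0.3}
calibrated = {"A": 0.7, "B": 0.3}
overconfident = {"A": 0.99, "B": 0.01}

r_calibrated = cross_entropy_reward(gold, calibrated)
r_overconfident = cross_entropy_reward(gold, overconfident)
print(r_calibrated > r_overconfident)
```

Under these assumptions the prediction that matches the gold distribution scores higher than the overconfident one, which is the property that makes the reward a calibration signal.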

Experiment map

Domain	Gold target	Why this dataset
MedMCQA	4-choice	replicate Bereket; GRPO σ-normalization overconfidence
HotpotQA	single answer, dropped evidence	RLCR’s domain
ProtoQA	crowd distribution (~5 clusters)	the support-recovery angle in isolation
DDXPlus	distribution over 49 conditions	RLCR’s domain; CoT vs DSL vs PPL representations

The shape of the gold support decides whether the model commits to one hypothesis or spreads probability across several, and the “other” bucket is what makes each behavior reward-optimal.
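The dependence on the gold support can be made concrete with a short derivation. This is a sketch assuming the reward is the expected log-probability of the gold distribution p under the model distribution q over a shared hypothesis set; it leaves out the other bucket mentioned above, and the symbols p, q, h, and k are notation introduced here.

```latex
\begin{align*}
R(q) &= \sum_{h} p(h)\,\log q(h) \;=\; -H(p) - \mathrm{KL}(p \,\|\, q),
  \qquad \arg\max_{q} R(q) = p. \\[4pt]
\text{One-hot gold, } p(h^\ast) = 1:\quad
  R(q) &= \log q(h^\ast),
  \quad \text{maximized by } q(h^\ast) = 1 \text{ (commit)}. \\[4pt]
\text{Uniform gold over } k \text{ hypotheses}:\quad
  R(q) &= \frac{1}{k}\sum_{i=1}^{k} \log q(h_i),
  \quad \text{maximized by } q(h_i) = \tfrac{1}{k} \text{ (spread)}.
\end{align*}
```

Because the KL term is zero only when q equals p, the same reward favors committing on single-answer targets and spreading on crowd or differential-diagnosis targets.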

Prior work

Uncalibrated Reasoning (Bereket & Leskovec): RL on verbalized probabilities for binary/4-choice outcomes; GRPO’s std-normalization induces overconfidence. Beyond Binary Rewards (RLCR) (Damani et al.): one answer plus a scalar confidence, correctness + Brier reward. Both predict probabilities over a small fixed option set; we instead target a distribution over a hypothesis space the model itself generates.
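To make the two ingredients named above concrete, here is a minimal sketch of a correctness-plus-Brier reward and of group-wise std-normalization. Both functions are my reading of the descriptions in this paragraph, not code from either paper: `rlcr_reward` assumes the Brier term enters as a penalty on the squared gap between confidence and correctness, and `grpo_advantages` assumes population standard deviation with a small `eps`. The example shows only the normalization step; the argument for why it induces overconfidence is in the cited work.

```python
import statistics

def rlcr_reward(correct, confidence):
    """Correctness minus Brier penalty (assumed form of the RLCR-style reward)."""
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2

def grpo_advantages(rewards, eps=1e-6):
    """Center rewards within a group and divide by the group std (assumed form)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# A confident wrong answer is penalized more than a hedged wrong answer.
confident_wrong = rlcr_reward(False, 1.0)
hedged_wrong = rlcr_reward(False, 0.5)

# Std-normalization maps a tiny reward gap and a large one to nearly
# the same advantages, so the original reward scale is lost.
small_gap = grpo_advantages([0.50, 0.51])
large_gap = grpo_advantages([0.0, 1.0])
print(confident_wrong, hedged_wrong)
print(small_gap, large_gap)
```

The design point worth noting is that after normalization the optimizer sees roughly the same signal for a 0.01 reward difference as for a 1.0 difference within a group.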