Let's Think Through the Possibilities
We train language models to output a calibrated probability distribution, using a cross-entropy distribution-matching reward. The output can be in any format from which we can induce a distribution. Framework describes the setup; the experiments below are organized by domain; the scratchpads work through the analysis.
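As a rough illustration of the reward described above, here is a minimal sketch that scores a predicted distribution against a gold one by negative cross-entropy. The function name `cross_entropy_reward`, the dict representation, the renormalization of the prediction, and the `eps` floor for missing hypotheses are all assumptions of this sketch, not details taken from the framework.

```python
import math

def cross_entropy_reward(gold, pred, eps=1e-9):
    """Return -H(gold, pred) = sum_h gold[h] * log(pred[h]).

    gold, pred: dicts mapping a hypothesis label to a probability.
    Assumptions of this sketch: pred is renormalized to sum to 1, and
    hypotheses missing from pred are floored at eps so the log stays finite.
    """
    total = sum(pred.values())
    if total <= 0:
        raise ValueError("pred must carry positive probability mass")
    reward = 0.0
    for h, p in gold.items():
        if p == 0:
            continue  # zero-mass gold hypotheses contribute nothing
        q = max(pred.get(h, 0.0) / total, eps)
        reward += p * math.log(q)
    return reward

# Hypothetical gold distribution over three made-up labels.
gold = {"flu": 0.6, "cold": 0.3, "covid": 0.1}
matched = cross_entropy_reward(gold, gold)
overconfident = cross_entropy_reward(
    gold, {"flu": 0.98, "cold": 0.01, "covid": 0.01}
)
print(matched > overconfident)  # True: matching the gold distribution scores higher
```

The reward is at most zero, and it reaches zero only when the gold distribution is a point mass and the prediction puts all of its probability on the same hypothesis.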
Experiment map
| Domain | Gold target | Why this dataset |
|---|---|---|
| MedMCQA | 4-choice | replicate Bereket & Leskovec; overconfidence from GRPO σ-normalization |
| HotpotQA | single answer, dropped evidence | RLCR’s domain |
| ProtoQA | crowd distribution (~5 clusters) | the support-recovery angle in isolation |
| DDXPlus | distribution over 49 conditions | RLCR’s domain; CoT vs DSL vs PPL representations |
The shape of the gold support decides whether the model should commit to one hypothesis or spread probability across several, and the "other" bucket is what makes each behavior reward-optimal.
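One way to see why the gold shape drives this behavior, assuming the reward is the negative cross-entropy between a gold distribution $p$ and a predicted distribution $q$ over hypotheses $h$ (the symbols $p$, $q$, $h$ and $R$ are notation introduced here, and the role of the "other" bucket is not modeled):

$$
R(q) \;=\; \sum_{h} p(h)\,\log q(h) \;=\; -H(p) \;-\; \mathrm{KL}(p \,\|\, q) \;\le\; -H(p),
$$

with equality if and only if $q = p$, since $\mathrm{KL}(p \,\|\, q) \ge 0$. Under this assumption, a point-mass $p$ makes committing to a single hypothesis optimal, while a $p$ spread over several hypotheses makes the same spread optimal.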
Prior work
Uncalibrated Reasoning (Bereket & Leskovec): RL on verbalized probabilities for binary and 4-choice outcomes; GRPO’s std-normalization induces overconfidence. Beyond Binary Rewards (RLCR) (Damani et al.): the model outputs one answer plus a scalar confidence and is trained with a correctness plus Brier reward. Both predict probabilities over a small fixed option set, whereas we target a distribution over a hypothesis space that the model itself generates.
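The two ingredients named above can be sketched in a few lines. This is an illustration under assumptions, not either paper's code: `grpo_advantages` uses the population standard deviation and an `eps` term, and `rlcr_reward` reads "correctness + Brier" as the correctness indicator minus the squared error of the stated confidence. Both function names are hypothetical.

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - mean) / (std + eps).

    Assumption: population std (pstdev); implementations may differ.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def rlcr_reward(correct, confidence):
    """Correctness indicator minus a Brier penalty on the stated confidence."""
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2

# Std-normalization discards the scale of the reward gap within a group:
# a 0.02 gap and a 1.0 gap both map to advantages of roughly -1 and +1.
small_gap = grpo_advantages([0.50, 0.52])
large_gap = grpo_advantages([0.0, 1.0])
print(small_gap, large_gap)

# A confident wrong answer is penalized far more than a confident right
# answer is rewarded beyond plain correctness.
print(rlcr_reward(True, 0.9), rlcr_reward(False, 0.9))
```

The first printout shows the rescaling step that the overconfidence finding is attributed to; the sketch does not reproduce the finding itself.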