Framework
The model’s output has to define a probability distribution with evaluable likelihoods over the answer space 𝒴; a proper scoring rule then grades it against the target. The hypothesis list is the simplest such output.
Setup
In a single response to prompt x, the model emits k hypotheses h1, …, hk with weights w1, …, wk plus an Other bucket with weight w0, all summing to 1. This is a categorical distribution with a residual bucket; probabilistic programs are the natural generalization (explored on DDXPlus).
Listed hypotheses get their stated weight; the Other mass w0 is split across all unlisted outcomes in proportion to the base LM’s own probabilities (the LM-prior backoff). This is what gives the output full support over 𝒴, which the log score needs: with induced distribution q and gold y*, the reward is log q(y*).
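The backoff can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a small finite answer space, a hypothetical `base_lm_probs` dictionary standing in for the base LM's probabilities, and a renormalization over the unlisted outcomes so that the Other mass is fully used.

```python
import math


def induced_distribution(hypotheses, other_weight, base_lm_probs):
    """Build the distribution induced by a hypothesis list plus an Other bucket.

    hypotheses:    dict mapping each listed answer to its stated weight
    other_weight:  mass placed on the Other bucket
    base_lm_probs: dict mapping every answer in the (toy, finite) answer
                   space to the base LM's probability (assumed input)
    """
    unknown = [h for h in hypotheses if h not in base_lm_probs]
    if unknown:
        raise ValueError(f"hypotheses outside the answer space: {unknown}")
    total = sum(hypotheses.values()) + other_weight
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"weights must sum to 1, got {total}")

    # Base-LM mass on the outcomes the model did not list.
    unlisted_mass = sum(p for y, p in base_lm_probs.items() if y not in hypotheses)

    q = {}
    for y, p in base_lm_probs.items():
        if y in hypotheses:
            q[y] = hypotheses[y]  # listed: stated weight
        elif unlisted_mass > 0:
            q[y] = other_weight * p / unlisted_mass  # unlisted: share of Other
        else:
            q[y] = 0.0
    return q


def log_score(q, gold):
    """Log score of distribution q at the gold outcome (-inf if q misses it)."""
    p = q.get(gold, 0.0)
    return math.log(p) if p > 0 else float("-inf")
```

With no listed hypotheses and all mass on Other, this sketch returns the base LM's distribution unchanged, which is the degenerate case discussed under the Other-bucket tension.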
Other mass w0 is spread across the remaining outcomes by the base LM (the backoff), giving full support over 𝒴.
Proper scoring rules
A scoring rule grades a predicted distribution against an outcome. It is strictly proper when the expected score is uniquely maximized by reporting the true distribution — so optimizing it drives the prediction toward the target.
The log score is unbounded below: it falls to −∞ as the probability assigned to the gold goes to 0. Brier and spherical are bounded alternatives that trade this cliff for a finite miss penalty. The cliff is why the Other bucket is load-bearing under the log score but only useful under the bounded ones.
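To make the difference in miss penalties concrete, here is a small sketch of the three rules written as rewards (higher is better). The function names and the sign convention for Brier (negated squared error) are choices made for this illustration.

```python
import math


def log_score(q, gold):
    """Log score: log q(gold); -inf when q puts zero mass on the gold."""
    p = q.get(gold, 0.0)
    return math.log(p) if p > 0 else float("-inf")


def brier_score(q, gold):
    """Negated multi-class Brier score, bounded in [-2, 0]."""
    outcomes = set(q) | {gold}
    return -sum(
        (q.get(y, 0.0) - (1.0 if y == gold else 0.0)) ** 2 for y in outcomes
    )


def spherical_score(q, gold):
    """Spherical score: q(gold) / ||q||_2, bounded in [0, 1]."""
    norm = math.sqrt(sum(p * p for p in q.values()))
    return q.get(gold, 0.0) / norm


# A confident miss: all mass on "a" while the gold is "b".
confident_miss = {"a": 1.0, "b": 0.0}
miss_rewards = {
    "log": log_score(confident_miss, "b"),
    "brier": brier_score(confident_miss, "b"),
    "spherical": spherical_score(confident_miss, "b"),
}
```

On the confident miss, the log score is −∞ while Brier and spherical bottom out at their finite minimums of −2 and 0.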
Reward
Apply the scoring rule to the induced distribution q at the gold outcome y*. Under the log score the reward is just log q(y*), the log-probability of the gold; in expectation over golds drawn from the target distribution p, that is the negative cross-entropy −H(p, q), maximized at q = p.
When the dataset gives the gold distribution directly (ProtoQA’s crowd counts, DDXPlus’s per-case differential), the reward is the full (negative) cross-entropy against it. A single label is the one-sample estimate, with the gold distribution replaced by a point mass on that label.
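A short sketch of the two reward variants under the log score, writing p for the gold distribution and q for the model's induced distribution (both names are chosen here for illustration). The single-label case falls out as the special case where p is a point mass.

```python
import math


def expected_log_reward(q, gold_dist):
    """Negative cross-entropy: sum over y of p(y) * log q(y).

    q:         dict answer -> model probability
    gold_dist: dict answer -> gold probability (e.g. normalized crowd counts)
    """
    total = 0.0
    for y, p in gold_dist.items():
        if p == 0:
            continue  # outcomes with no gold mass contribute nothing
        qy = q.get(y, 0.0)
        if qy <= 0:
            return float("-inf")  # gold mass where the model has none
        total += p * math.log(qy)
    return total


def single_label_reward(q, gold):
    """One-sample version: the gold distribution is a point mass on `gold`."""
    return expected_log_reward(q, {gold: 1.0})
```

By Gibbs' inequality, `expected_log_reward(q, p)` is largest when `q` equals `p`, which is the strict-propriety property the reward relies on.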
The Other-bucket tension
Setting w0 = 1 earns the base LM’s own log-probability of the gold with zero reasoning, which makes it a possible local optimum during RL. But listing the correct hypothesis with a weight above the base LM’s probability for it is strictly better, so the global optimum still requires genuine hypothesis generation. Mitigations: prompt for multiple hypotheses, add a w0 penalty, or attach an RLCR-style correctness bonus. Empirically, the gold support decides which way it goes: single-answer gold collapses the list, wide gold support spreads it (ProtoQA).
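The comparison behind the tension is a one-line calculation under the log score. The sketch below assumes the all-Other output reproduces the base LM's distribution, so its reward is the log of the base LM's probability for the gold; the function and argument names are hypothetical.

```python
import math


def all_other_reward(base_lm_prob_gold):
    """Reward when all mass sits on Other: the base LM's log-probability."""
    return math.log(base_lm_prob_gold)


def listed_reward(weight_on_gold):
    """Reward when the gold is listed explicitly with the given weight."""
    return math.log(weight_on_gold)


def gain_from_listing(weight_on_gold, base_lm_prob_gold):
    """Reward gained by listing the gold instead of deferring to Other.

    Equals log(weight / base_prob): positive exactly when the stated
    weight exceeds the base LM's probability for the gold.
    """
    return listed_reward(weight_on_gold) - all_other_reward(base_lm_prob_gold)
```

For example, if the base LM gives the gold probability 0.1, listing it with weight 0.6 yields a positive gain, while listing it with weight 0.05 is worse than deferring entirely to Other.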