Framework

The model’s output must define a probability distribution over the answer space $\mathcal{Y}$ whose likelihoods can be evaluated; a proper scoring rule then grades it against the target. The hypothesis list is the simplest such output.

Setup

Hypothesis list

In a single response to prompt $x$, the model emits $K$ hypotheses $h_k \in \mathcal{Y}$ with weights $w_k \ge 0$, plus an Other bucket with weight $w_0 \ge 0$, all summing to 1. The result is a categorical distribution with a residual bucket; probabilistic programs are the natural generalization (explored on DDXPlus).

\mathcal{L} = \bigl\{(h_1, w_1),\ \ldots,\ (h_K, w_K),\ (\texttt{Other},\, w_0)\bigr\}
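As an illustration of the constraints on $\mathcal{L}$, here is a minimal Python sketch of a container that checks them. The class name `HypothesisList`, its field names, and the example values are hypothetical and chosen only for this example; the requirement that hypotheses be distinct is an added assumption so that each $h_k$ has a single weight.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class HypothesisList:
    """K weighted hypotheses plus an Other bucket (hypothetical container)."""

    hypotheses: tuple[str, ...]  # h_1 .. h_K
    weights: tuple[float, ...]   # w_1 .. w_K
    other: float                 # w_0, the residual Other mass

    def __post_init__(self):
        if len(self.hypotheses) != len(self.weights):
            raise ValueError("one weight per hypothesis is required")
        # Assumption: hypotheses are distinct, so each h_k has one weight.
        if len(set(self.hypotheses)) != len(self.hypotheses):
            raise ValueError("hypotheses must be distinct")
        if self.other < 0 or any(w < 0 for w in self.weights):
            raise ValueError("weights must be non-negative")
        total = self.other + sum(self.weights)
        if not math.isclose(total, 1.0, abs_tol=1e-9):
            raise ValueError(f"weights sum to {total}, expected 1")


# Made-up example: two hypotheses and 10% of the mass left in Other.
example = HypothesisList(("flu", "cold"), (0.6, 0.3), other=0.1)
```

Validating at construction time means a malformed list fails before any score is computed, rather than yielding a reward from weights that do not form a distribution.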

Induced distribution

Listed hypotheses get their stated weight; all unlisted outcomes split $w_0$ in proportion to the base LM’s own probabilities (the LM-prior backoff). With $w_0 > 0$, this is what gives the output full support over $\mathcal{Y}$, which the log score needs: with $w_0 = 0$ and gold $y^* \notin \mathcal{H}$, where $\mathcal{H} = \{h_1, \ldots, h_K\}$ is the set of listed hypotheses, the reward is $-\infty$.

p_V(y \mid x, \mathcal{L}) = \begin{cases} w_k & y = h_k, \\[6pt] w_0 \cdot \dfrac{p_{\mathrm{LM}}(y \mid x)}{p_{\mathrm{LM}}(\mathcal{Y} \setminus \mathcal{H} \mid x)} & \text{otherwise}. \end{cases}

The induced distribution $p_V$. Listed hypotheses keep their stated weights $w_k$; the Other mass $w_0$ is spread across the remaining outcomes by the base LM (the backoff), giving full support over $\mathcal{Y}$.
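The backoff can be sketched in a few lines of Python. This sketch assumes a small, finite answer space whose base-LM probabilities are available as a dictionary; the function name `induced_prob` and all numbers are made up for illustration.

```python
import math


def induced_prob(y, hypotheses, weights, w0, p_lm):
    """p_V(y | x, L): stated weight if listed, LM-prior backoff otherwise.

    p_lm maps every outcome of a finite answer space to its base-LM
    probability (an assumption of this sketch).
    """
    listed = dict(zip(hypotheses, weights))
    if y in listed:
        return listed[y]
    # Base-LM mass on everything that was not listed.
    unlisted_mass = sum(p for o, p in p_lm.items() if o not in listed)
    if unlisted_mass == 0.0:
        return 0.0
    return w0 * p_lm.get(y, 0.0) / unlisted_mass


# Made-up base-LM probabilities over a four-outcome answer space.
p_lm = {"flu": 0.5, "cold": 0.25, "covid": 0.125, "allergy": 0.125}
hyps, wts, w0 = ["flu", "cold"], [0.6, 0.3], 0.1

p_v = {y: induced_prob(y, hyps, wts, w0, p_lm) for y in p_lm}
# The two unlisted outcomes share w0 equally here because the base LM
# rates them equally; the induced distribution still sums to 1.
assert math.isclose(sum(p_v.values()), 1.0)
```

Renormalizing by the unlisted base-LM mass is what keeps the total at 1: the listed outcomes take their stated weights regardless of what the base LM assigned them.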

Proper scoring rules

Strictly proper

A scoring rule $S$ grades a predicted distribution against an outcome. It is strictly proper when the expected score is uniquely maximized by reporting the true distribution — so optimizing it drives the prediction toward the target.

\mathbb{E}_{a \sim p}\!\left[S(p, a)\right] > \mathbb{E}_{a \sim p}\!\left[S(\hat p, a)\right]\quad \forall\,\hat p \neq p
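For the log score $S(\hat p, a) = \log \hat p_a$, the inequality can be checked directly. A short derivation, assuming a finite outcome space and writing $D_{\mathrm{KL}}$ for the Kullback-Leibler divergence (notation not used elsewhere in this section):

```latex
\begin{aligned}
\mathbb{E}_{a \sim p}\!\left[\log p_a\right] - \mathbb{E}_{a \sim p}\!\left[\log \hat p_a\right]
  &= \sum_{a} p_a \log \frac{p_a}{\hat p_a} \\
  &= D_{\mathrm{KL}}(p \,\|\, \hat p) \;\ge\; 0,
\end{aligned}
```

with equality only when $\hat p = p$, so the gap is strictly positive for every $\hat p \neq p$.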

Log, Brier, spherical

The log score is unbounded below; Brier and spherical are bounded alternatives that trade the $-\infty$ cliff for a finite miss penalty. That cliff is why the Other bucket is load-bearing under the log score but only useful under the bounded ones.

S_{\log} = \log \hat p_a \qquad S_{\mathrm{Brier}} = 2\hat p_a - \lVert\hat p\rVert_2^2 \qquad S_{\mathrm{sph}} = \hat p_a / \lVert\hat p\rVert_2
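The three rules are short enough to write out. The sketch below assumes distributions are plain Python lists indexed by outcome, and the probabilities are made up; it spot-checks properness on a few candidate reports, which illustrates the property but does not prove it.

```python
import math


def log_score(p_hat, a):
    # Unbounded below: a zero-probability outcome scores -inf.
    return math.log(p_hat[a]) if p_hat[a] > 0 else -math.inf


def brier_score(p_hat, a):
    return 2 * p_hat[a] - sum(q * q for q in p_hat)


def spherical_score(p_hat, a):
    return p_hat[a] / math.sqrt(sum(q * q for q in p_hat))


def expected_score(score, p_true, p_hat):
    """E_{a ~ p_true}[S(p_hat, a)], skipping zero-probability outcomes."""
    return sum(p * score(p_hat, a) for a, p in enumerate(p_true) if p > 0)


p_true = [0.7, 0.2, 0.1]  # made-up target distribution
candidates = [[0.5, 0.3, 0.2], [0.9, 0.05, 0.05], [1 / 3, 1 / 3, 1 / 3]]

for score in (log_score, brier_score, spherical_score):
    honest = expected_score(score, p_true, p_true)
    # Reporting the true distribution beats each misreport in expectation.
    assert all(honest > expected_score(score, p_true, q) for q in candidates)

# The cliff: a confident miss is -inf under log, finite under the others.
confident_miss = [1.0, 0.0]
print(log_score(confident_miss, 1),
      brier_score(confident_miss, 1),
      spherical_score(confident_miss, 1))
```

The final line makes the contrast concrete: the same confident miss costs $-\infty$ under the log score but only a bounded amount under Brier and spherical.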

Reward

Single label

Apply the scoring rule to the induced distribution at the gold outcome. Under the log score the reward is just the log-probability that $p_V$ assigns to the gold outcome; in expectation over $y^* \sim p^*$, where $p^*$ is the true answer distribution, that is the negative cross-entropy $-H(p^*, p_V)$, maximized at $p_V = p^*$.

r(\mathcal{L}, y^*) = S\bigl(p_V(\cdot \mid x, \mathcal{L}),\ y^*\bigr) \;\xrightarrow{\ S_{\log}\ }\; \log p_V(y^* \mid x, \mathcal{L})
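A minimal sketch of the single-label reward, assuming the induced distribution has already been materialized as a dictionary over a small answer space. The helper names (`reward`, `log_rule`, `brier_rule`) and the numbers are hypothetical.

```python
import math


def log_rule(p, a):
    # -inf when the gold outcome has no mass.
    return math.log(p[a]) if p.get(a, 0.0) > 0 else -math.inf


def brier_rule(p, a):
    return 2 * p.get(a, 0.0) - sum(q * q for q in p.values())


def reward(p_v, gold, rule=log_rule):
    """r(L, y*) = S(p_V(. | x, L), y*) for a chosen scoring rule."""
    return rule(p_v, gold)


# Made-up induced distributions: one with Other mass spread over the
# unlisted outcomes, one with w_0 = 0 (unlisted outcomes get nothing).
p_v = {"flu": 0.6, "cold": 0.3, "covid": 0.05, "allergy": 0.05}
no_other = {"flu": 0.7, "cold": 0.3, "covid": 0.0, "allergy": 0.0}

listed_hit = reward(p_v, "flu")                    # log of the stated weight
backoff_hit = reward(p_v, "covid")                 # log of the backoff share
cliff = reward(no_other, "covid")                  # -inf under the log score
bounded = reward(no_other, "covid", brier_rule)    # finite under Brier
```

Swapping `rule` changes only how a miss is priced: the same `no_other` list that falls off the cliff under the log score receives a finite penalty under Brier.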

Distributional gold

When the dataset gives the gold distribution directly (ProtoQA’s crowd counts, DDXPlus’s per-case differential), the reward is the negative cross-entropy against it. A single label is the special case $p^* = \delta_{y^*}$, which is a one-sample estimate of the same quantity.

r(\mathcal{L}, p^*) = \sum_{j} p_j^* \, \log p_V(y_j \mid x, \mathcal{L})
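The sum above can be sketched as follows, assuming both distributions are dictionaries over a small answer space. The counts are invented for illustration and are not taken from ProtoQA or DDXPlus; the function name `distributional_reward` is hypothetical.

```python
import math


def distributional_reward(p_star, p_v):
    """sum_j p*_j log p_V(y_j); -inf if p_V misses any gold outcome."""
    total = 0.0
    for y, p in p_star.items():
        if p == 0.0:
            continue  # outcomes with no gold mass contribute nothing
        q = p_v.get(y, 0.0)
        if q <= 0.0:
            return -math.inf
        total += p * math.log(q)
    return total


# Made-up crowd counts, normalized into a gold distribution.
counts = {"alarm clock": 6, "sunlight": 3, "pets": 1}
n = sum(counts.values())
p_star = {y: c / n for y, c in counts.items()}

matched = distributional_reward(p_star, p_star)
skewed = distributional_reward(
    p_star, {"alarm clock": 0.9, "sunlight": 0.05, "pets": 0.05}
)
# A point-mass gold recovers the single-label log reward.
point = distributional_reward({"sunlight": 1.0}, p_star)

assert matched > skewed
assert math.isclose(point, math.log(p_star["sunlight"]))
```

Matching the gold distribution scores higher than over-concentrating on its mode, which is the behavior the cross-entropy reward is meant to encourage.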

The Other-bucket tension

Setting $w_0 = 1$ lists nothing, so $p_V = p_{\mathrm{LM}}$ and the reward is $r = \log p_{\mathrm{LM}}(y^* \mid x)$ with zero reasoning, a possible local optimum during RL. But listing the correct hypothesis with $w_k > p_{\mathrm{LM}}(y^* \mid x)$ is strictly better, so the global optimum still requires genuine hypothesis generation. Mitigations: prompt for multiple hypotheses, add a $-\alpha w_0$ penalty, or attach an RLCR-style correctness bonus. Empirically the gold support decides which way it goes: single-answer gold collapses the list, while wide gold support spreads it (ProtoQA).
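A small numeric illustration of the tension under the log score, with made-up base-LM probabilities and an arbitrary penalty coefficient `alpha`. The wrong-guess case is worked out from the backoff formula and is included as one plausible reason the all-Other answer can be locally attractive; it is an illustration under these assumed numbers, not a reported result.

```python
import math

# Made-up base-LM probabilities for the gold answer and one wrong answer.
p_lm_gold, p_lm_wrong = 0.2, 0.3
alpha = 0.5  # arbitrary penalty coefficient for the -alpha * w_0 term

# All mass on Other: p_V equals the base LM, so r = log p_LM(y* | x).
lazy = math.log(p_lm_gold)

# Gold listed with w_k = 0.5 > p_LM(y* | x): strictly better than lazy.
correct = math.log(0.5)

# Wrong hypothesis listed with weight 0.5, w_0 = 0.5: the gold outcome
# only receives its backoff share of w_0 over the unlisted base-LM mass.
wrong = math.log(0.5 * p_lm_gold / (1.0 - p_lm_wrong))

assert correct > lazy > wrong


def penalized(r, w0):
    """Log reward with the -alpha * w_0 mitigation applied."""
    return r - alpha * w0


# The penalty widens the gap between a correct listing and the lazy answer.
gap_plain = correct - lazy
gap_penalized = penalized(correct, 0.5) - penalized(lazy, 1.0)
assert gap_penalized > gap_plain
```

With these numbers a committed but wrong guess scores below the all-Other answer, while a correct listing scores above it; the penalty term raises the relative payoff of committing mass to listed hypotheses.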