Framework

The model’s output has to define a probability distribution with evaluable likelihoods over the answer space $\mathcal{Y}$; a proper scoring rule then grades it against the target. The hypothesis list is the simplest such output.

Setup

Hypothesis list

In a single response to prompt $x$, the model emits $K$ hypotheses $h_k \in \mathcal{Y}$ with weights $w_k \ge 0$ plus an Other bucket with weight $w_0$, all summing to 1. The result is a categorical distribution with a residual bucket; probabilistic programs are the natural generalization (explored on DDXPlus).

$$\mathcal{L} = \bigl\{(h_1, w_1),\ \ldots,\ (h_K, w_K),\ (\texttt{Other},\, w_0)\bigr\}$$

Induced distribution

Listed hypotheses get their stated weight; all unlisted outcomes split $w_0$ in proportion to the base LM’s own probabilities (the LM-prior backoff). This is what gives the output full support over $\mathcal{Y}$, which the log score needs: with $w_0 = 0$ and gold $y^* \notin \mathcal{H}$, where $\mathcal{H} = \{h_1, \ldots, h_K\}$ is the set of listed hypotheses, the reward is $-\infty$.

$$p_V(y \mid x, \mathcal{L}) = \begin{cases} w_k & y = h_k, \\[6pt] w_0 \cdot \dfrac{p_{\mathrm{LM}}(y \mid x)}{p_{\mathrm{LM}}(\mathcal{Y} \setminus \mathcal{H} \mid x)} & \text{otherwise}. \end{cases}$$
[Figure: $p_V(y)$ over the answer space $\mathcal{Y}$, with $h_1$, $h_2$, $h_3$ marked and the Other mass $w_0$ split by $p_{\mathrm{LM}}$.]

The induced distribution $p_V$. Listed hypotheses keep their stated weights $w_k$; the Other mass $w_0$ is spread across the remaining outcomes by the base LM (the backoff), giving full support over $\mathcal{Y}$.
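As a rough illustration of the backoff, the sketch below builds $p_V$ from a hypothesis list. It assumes a small finite answer space whose base-LM probabilities are available as a dictionary; the function name `induced_distribution` and the toy numbers are hypothetical and not part of the original setup.

```python
from typing import Dict


def induced_distribution(
    hypotheses: Dict[str, float],  # listed hypotheses h_k -> weight w_k
    w_other: float,                # Other-bucket weight w_0
    p_lm: Dict[str, float],        # base-LM probabilities p_LM(y | x), assumed enumerable
) -> Dict[str, float]:
    """Return p_V(y | x, L): listed weights plus LM-prior backoff for the rest."""
    total = sum(hypotheses.values()) + w_other
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"weights must sum to 1, got {total}")

    # LM mass on everything outside the list: p_LM(Y \ H | x)
    unlisted_mass = sum(p for y, p in p_lm.items() if y not in hypotheses)

    p_v = dict(hypotheses)  # listed hypotheses keep their stated weights
    if unlisted_mass > 0.0:
        for y, p in p_lm.items():
            if y not in hypotheses:
                # unlisted outcomes share w_0 in proportion to the base LM
                p_v[y] = w_other * p / unlisted_mass
    return p_v


# Toy example (hypothetical numbers): two listed answers, two unlisted ones.
p_lm = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
p_v = induced_distribution({"a": 0.6, "b": 0.2}, w_other=0.2, p_lm=p_lm)
print(p_v)
```

In this toy case the unlisted outcomes `c` and `d` split the 0.2 of Other mass 3:1, matching their base-LM ratio. In practice the answer space is rarely enumerable, so only the value of $p_V$ at the gold outcome would need to be computed.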

Proper scoring rules

Strictly proper

A scoring rule $S$ grades a predicted distribution against an outcome. It is strictly proper when the expected score is uniquely maximized by reporting the true distribution, so optimizing it drives the prediction toward the target.

$$\mathbb{E}_{a \sim p}\!\left[S(p, a)\right] > \mathbb{E}_{a \sim p}\!\left[S(\hat p, a)\right]\quad \forall\,\hat p \neq p$$

Log, Brier, spherical

The log score is unbounded below; Brier and spherical are bounded alternatives that trade the $-\infty$ cliff for a finite miss penalty. That cliff is why the Other bucket is load-bearing under the log score but only useful under the bounded ones.

$$S_{\log} = \log \hat p_a \qquad S_{\mathrm{Brier}} = 2\hat p_a - \lVert\hat p\rVert_2^2 \qquad S_{\mathrm{sph}} = \hat p_a / \lVert\hat p\rVert_2$$
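The three rules above can be sketched in a few lines. The example assumes a finite outcome space indexed by integers; the function names and the toy distributions are hypothetical, chosen only to show the miss penalties and to spot-check strict propriety on one pair of distributions.

```python
import math
from typing import Callable, Sequence


def log_score(p_hat: Sequence[float], a: int) -> float:
    # log of the probability assigned to the realized outcome; -inf on a zero
    return math.log(p_hat[a]) if p_hat[a] > 0 else -math.inf


def brier_score(p_hat: Sequence[float], a: int) -> float:
    # positively oriented Brier (quadratic) score: 2 p_a - ||p||_2^2
    return 2 * p_hat[a] - sum(q * q for q in p_hat)


def spherical_score(p_hat: Sequence[float], a: int) -> float:
    # p_a / ||p||_2
    return p_hat[a] / math.sqrt(sum(q * q for q in p_hat))


def expected_score(
    p_true: Sequence[float],
    p_hat: Sequence[float],
    rule: Callable[[Sequence[float], int], float],
) -> float:
    # E_{a ~ p_true}[S(p_hat, a)], skipping outcomes with zero true mass
    return sum(p * rule(p_hat, a) for a, p in enumerate(p_true) if p > 0)


# A confident miss: all mass on outcome 0, but outcome 1 occurs.
miss = [1.0, 0.0, 0.0]
print(log_score(miss, 1), brier_score(miss, 1), spherical_score(miss, 1))

# Truthful reporting beats a distorted report under all three rules.
p_true = [0.7, 0.2, 0.1]
distorted = [0.5, 0.3, 0.2]
for rule in (log_score, brier_score, spherical_score):
    print(rule.__name__,
          expected_score(p_true, p_true, rule),
          expected_score(p_true, distorted, rule))
```

The confident miss scores $-\infty$ under the log rule but only $-1$ (Brier) and $0$ (spherical), which is the finite miss penalty described above. The final loop checks a single pair of distributions, so it illustrates propriety rather than proving it.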

Reward

Single label

Apply the scoring rule to the induced distribution at the gold outcome. Under the log score the reward is just $\log p_V$ of the gold; in expectation over gold labels drawn from the target distribution $p^*$, that is the negative cross-entropy $-H(p^*, p_V)$, maximized at $p_V = p^*$.

$$r(\mathcal{L}, y^*) = S\bigl(p_V(\cdot \mid x, \mathcal{L}),\ y^*\bigr) \;\xrightarrow{\ S_{\log}\ }\; \log p_V(y^* \mid x, \mathcal{L})$$
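The statement that the expected log-score reward peaks at $p_V = p^*$ follows from a standard decomposition. In the sketch below, $H(p^*)$ denotes the entropy of the target and $\mathrm{KL}$ the Kullback-Leibler divergence; both symbols are introduced here for illustration only.

```latex
\begin{aligned}
\mathbb{E}_{y^* \sim p^*}\bigl[\log p_V(y^* \mid x, \mathcal{L})\bigr]
  &= \sum_{y} p^*(y)\,\log p_V(y \mid x, \mathcal{L}) \\
  &= -H(p^*, p_V) \\
  &= -H(p^*) - \mathrm{KL}\bigl(p^* \,\|\, p_V\bigr).
\end{aligned}
```

Since $\mathrm{KL}(p^* \,\|\, p_V) \ge 0$ with equality only when $p_V = p^*$, the expected reward is at most $-H(p^*)$ and reaches that bound exactly at $p_V = p^*$.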
Distributional gold

When the dataset gives the gold distribution directly (ProtoQA’s crowd counts, DDXPlus’s per-case differential), the reward is the full negative cross-entropy against it. A single label is the one-sample estimate, with $p^* = \delta_{y^*}$.

$$r(\mathcal{L}, p^*) = \sum_{j} p_j^* \, \log p_V(y_j \mid x, \mathcal{L})$$
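A minimal sketch of this reward, assuming both the gold distribution and the induced distribution are available as dictionaries over a finite answer space. The function name `cross_entropy_reward` and the numbers are hypothetical.

```python
import math
from typing import Dict


def cross_entropy_reward(p_gold: Dict[str, float], p_v: Dict[str, float]) -> float:
    """r = sum_j p*_j log p_V(y_j); -inf if a gold outcome has zero mass under p_V."""
    reward = 0.0
    for y, p_star in p_gold.items():
        if p_star == 0.0:
            continue  # outcomes with no gold mass contribute nothing
        q = p_v.get(y, 0.0)
        if q <= 0.0:
            return -math.inf  # the log-score cliff
        reward += p_star * math.log(q)
    return reward


# Hypothetical induced distribution over four answers.
p_v = {"a": 0.6, "b": 0.2, "c": 0.15, "d": 0.05}

# Distributional gold: mass split evenly between "a" and "b".
print(cross_entropy_reward({"a": 0.5, "b": 0.5}, p_v))

# Single label: a point mass on "a" reduces to log p_V("a").
print(cross_entropy_reward({"a": 1.0}, p_v))
```

The second call shows the single-label case as a special case of the same formula: with a point mass on the gold answer, the sum collapses to one log-probability.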

The Other-bucket tension

Setting $w_0 = 1$ earns $r = \log p_{\mathrm{LM}}(y^* \mid x)$ with zero reasoning, a possible local optimum during RL. But listing the correct hypothesis with $w_k > p_{\mathrm{LM}}(y^* \mid x)$ is strictly better, so the global optimum still requires genuine hypothesis generation. Mitigations: prompt for multiple hypotheses, add a $-\alpha w_0$ penalty, or attach an RLCR-style correctness bonus. Empirically the gold support decides which way it goes: single-answer gold collapses the list, wide gold support spreads it (ProtoQA).
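The comparison can be made concrete with toy numbers. The sketch below assumes the log score and a base-LM probability of 0.10 on the gold answer; that value, the listed weight 0.6, and the penalty strength `alpha = 0.5` are all hypothetical, and the helper names are for illustration only.

```python
import math


def reward_all_other(p_lm_gold: float) -> float:
    # w_0 = 1 and an empty list: p_V equals p_LM, so r = log p_LM(y* | x)
    return math.log(p_lm_gold)


def reward_listed(w_k: float) -> float:
    # gold answer listed with weight w_k: r = log w_k
    return math.log(w_k)


def penalized(reward: float, w_other: float, alpha: float) -> float:
    # one of the mitigations named above: subtract alpha * w_0
    return reward - alpha * w_other


p_lm_gold = 0.10  # hypothetical p_LM(y* | x)
w_k = 0.6         # hypothetical weight on the correctly listed gold answer
alpha = 0.5       # hypothetical penalty strength

r_other = reward_all_other(p_lm_gold)
r_listed = reward_listed(w_k)
print(r_other, r_listed)

# With the penalty, the all-Other policy (w_0 = 1) loses a further alpha,
# while the listing policy (w_0 = 0.4 here) loses only 0.4 * alpha.
print(penalized(r_other, 1.0, alpha), penalized(r_listed, 1.0 - w_k, alpha))
```

Listing wins whenever $w_k$ exceeds the base LM's probability of the gold answer, and the penalty widens that gap. The sketch leaves out the case where the list is wrong, in which the gold answer receives only its share of the remaining $w_0$.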