LM-prior backoff

Full support over $\mathcal{Y}$ is what makes the cross-entropy reward well-defined: a zero on the gold outcome sends the log score to $-\infty$. The LM-prior backoff is one way to guarantee it. Any outcome not on the hypothesis list backs off to $p_{\mathrm{LM}}(y \mid x)$ (Framework), the probability that the base LM simply answers $y$.
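As a rough illustration of the backoff, the sketch below scores a gold outcome under a hypothesis list and falls back to the LM prior when the outcome is unlisted. The text only says that unlisted outcomes back off to $p_{\mathrm{LM}}(y \mid x)$; the reserved mass `reserve`, the renormalization of the listed probabilities, and the names `backoff_log_score`, `listed`, and `lm_logprob` are all assumptions made for the example, not the method as specified.

```python
import math

def backoff_log_score(gold, listed, lm_logprob, reserve=0.05):
    """Log score of `gold` under a hypothesis list with LM-prior backoff.

    listed     : dict mapping outcome -> probability (hypothetical format)
    lm_logprob : callable returning log p_LM(y | x) for any outcome y
    reserve    : mass held back for unlisted outcomes (assumed, not from the text)
    """
    total = sum(listed.values())
    if total <= 0:
        # Empty or degenerate list: rely on the LM prior alone.
        return lm_logprob(gold)
    p_listed = listed.get(gold, 0.0)
    if p_listed > 0:
        # Listed outcome: renormalized list probability, scaled by the kept mass.
        return math.log((1.0 - reserve) * p_listed / total)
    # Unlisted (or zero-probability) outcome: back off to the LM prior,
    # so the log score stays finite whenever p_LM(y | x) > 0.
    return math.log(reserve) + lm_logprob(gold)
```

With this particular choice the scores of unlisted outcomes are not renormalized over the unlisted set, so the result is a lower bound on a properly normalized mixture rather than the mixture itself.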

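To compute it, score $y$ as the continuation of a fixed prefix (a plain QA scaffold, not the hypothesis-list instructions) and sum its token log-probabilities, $\sum_t \log p(y_t \mid \text{prefix}, y_{<t})$. The prefix is identical for every candidate, so it is prefilled once and its cached state is reused across candidates.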
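The scoring loop can be sketched without committing to a particular model library. In the example below, `ToyBigramLM` is a hypothetical stand-in for the base LM: its `prefill` and `step` methods are invented for illustration and mimic a real model that runs the prefix once and then extends a cached state token by token. Whether an end-of-answer token is included in the sum is not stated in the text; this sketch omits it.

```python
import math

class ToyBigramLM:
    """Stand-in for a real LM: the next token depends only on the previous one."""

    def __init__(self, table):
        self.table = table          # table[prev][tok] = probability
        self.prefill_calls = 0      # counts how often the prefix is processed

    def prefill(self, prefix_tokens):
        # A real model would run the prefix once and return its cached state;
        # for a bigram model the last prefix token is all the state needed.
        self.prefill_calls += 1
        return prefix_tokens[-1]

    def step(self, state, token):
        # Returns (new state, log p(token | state)).
        return token, math.log(self.table[state][token])

def score_candidates(lm, prefix_tokens, candidates):
    """Sum token log-probs of each candidate as a continuation of the prefix."""
    state0 = lm.prefill(prefix_tokens)   # shared by every candidate
    scores = {}
    for name, tokens in candidates.items():
        state, total = state0, 0.0
        for tok in tokens:
            state, logp = lm.step(state, tok)
            total += logp
        scores[name] = total
    return scores
```

Each candidate restarts from the same prefilled state, so the cost per candidate scales with the candidate's length rather than with the length of the prefix.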