LM-prior backoff

Full support over $\mathcal{Y}$ is what makes the cross-entropy reward well-defined: a zero on the gold outcome sends the log score to $-\infty$. The LM-prior backoff is one way to get it: unlisted outcomes back off to $p_{\mathrm{LM}}(y \mid x)$ (Framework), the chance the base LM just says $y$.
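To make the failure mode concrete, here is a minimal Python sketch of the log score with and without the backoff. It reads the sentence above literally: a listed outcome keeps its listed probability, and anything else is scored at $\log p_{\mathrm{LM}}(y \mid x)$. The names `log_score`, `listed`, and `lm_logprob` are hypothetical, treating an explicit zero as unlisted is an assumption, and any weighting or renormalization defined in Framework is deliberately left out.

```python
import math
from typing import Callable, Dict, Optional


def log_score(
    listed: Dict[str, float],
    gold: str,
    lm_logprob: Optional[Callable[[str], float]] = None,
) -> float:
    """Log probability assigned to the gold outcome.

    listed     -- hypothetical map from listed outcomes to their probabilities
    lm_logprob -- hypothetical callable returning log p_LM(y | x); None = no backoff
    """
    p = listed.get(gold, 0.0)
    if p > 0.0:
        # Gold outcome is listed with positive mass: score it directly.
        return math.log(p)
    if lm_logprob is None:
        # No backoff: zero mass on the gold outcome gives a log score of -inf.
        return -math.inf
    # Backoff (assumed literal reading): fall back to the base LM's log-prob.
    return lm_logprob(gold)
```

With `listed = {"Flu": 0.7, "Cold": 0.3}` and gold outcome `"Pneumonia"`, the call without `lm_logprob` returns `-inf`, while passing a base-LM scorer returns a finite value, so the reward stays defined.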

Score $y$ as a suffix of a fixed prefix (a plain QA scaffold, not the hypothesis-list instructions) and sum its token log-probs. The prefix is the same for every candidate, so it is prefilled once and reused.

[Figure: token sequence …prompt | Answer | : | Pneu | monia | ⟨eos⟩, with summed log-probs $= \log p_{\mathrm{LM}}(\text{answer})$.]
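The suffix scoring described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actual implementation: `toy_next_logprobs` is a made-up stand-in for a base LM (its table and probabilities are invented), `suffix_logprob` and `score_candidates` are hypothetical names, and `"<eos>"` stands in for the ⟨eos⟩ token.

```python
import math
from typing import Callable, Dict, Iterable, Sequence, Tuple

# Hypothetical interface: context tokens -> log-probs over the next token.
NextLogProbs = Callable[[Tuple[str, ...]], Dict[str, float]]

# Invented toy "LM": next-token probabilities depend only on the last token.
TOY_TABLE = {
    ":": {"Pneu": 0.5, "Flu": 0.5},
    "Pneu": {"monia": 0.8, "<eos>": 0.2},
    "monia": {"<eos>": 1.0},
    "Flu": {"<eos>": 1.0},
}


def toy_next_logprobs(context: Tuple[str, ...]) -> Dict[str, float]:
    return {tok: math.log(p) for tok, p in TOY_TABLE[context[-1]].items()}


def suffix_logprob(
    next_logprobs: NextLogProbs, prefix: Tuple[str, ...], suffix: Sequence[str]
) -> float:
    """Sum the log-probs of the suffix tokens, each conditioned on what precedes it."""
    context = prefix
    total = 0.0
    for tok in suffix:
        total += next_logprobs(context)[tok]
        context += (tok,)
    return total


def score_candidates(
    next_logprobs: NextLogProbs,
    prefix: Iterable[str],
    candidates: Dict[str, Sequence[str]],
) -> Dict[str, float]:
    # The prefix is built once and shared by every candidate. With a real LM
    # this is where it would be prefilled once and its cached state reused;
    # the toy model has no state to cache.
    shared_prefix = tuple(prefix)
    return {
        name: suffix_logprob(next_logprobs, shared_prefix, tokens)
        for name, tokens in candidates.items()
    }


scores = score_candidates(
    toy_next_logprobs,
    ["Answer", ":"],
    {
        "Pneumonia": ["Pneu", "monia", "<eos>"],
        "Pneu": ["Pneu", "<eos>"],
        "Flu": ["Flu", "<eos>"],
    },
)
```

Including the end-of-sequence token in the suffix is what makes the scores a distribution over complete answers: in this toy table the three candidates get probabilities 0.4, 0.1, and 0.5, which sum to 1, whereas dropping `"<eos>"` would score "Pneu" as a prefix of "Pneumonia" and double-count mass.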