LM-prior backoff
Full support over the outcome space is what makes the cross-entropy reward well-defined: a zero on the gold outcome sends the log score to −∞. The LM-prior backoff is one way to get it: unlisted outcomes back off to the LM prior (Framework), the chance the base LM just says that outcome.
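A minimal sketch of the backoff idea, under stated assumptions: `log_score`, the `listed` dictionary format, and the constant `toy_prior` are all hypothetical, and the sketch does not address how listed and backed-off mass are normalized together. It only illustrates that a gold outcome missing from the list gets a finite log score instead of log(0).

```python
import math

def log_score(listed, gold, lm_prior):
    """Log score of the gold outcome, backing off to an LM prior if unlisted.

    listed:   dict mapping outcome -> explicitly assigned probability (assumed format)
    lm_prior: callable outcome -> probability > 0 (stand-in for the base LM)
    """
    p = listed.get(gold, 0.0)
    if p > 0.0:
        return math.log(p)
    # Gold outcome is unlisted (or has zero mass): without backoff this
    # would be log(0), i.e. negative infinity. Use the LM prior instead.
    return math.log(lm_prior(gold))

listed = {"Paris": 0.7, "Lyon": 0.2}

def toy_prior(outcome):
    # Assumed constant prior, purely illustrative; a real prior would
    # come from the base LM's probability of producing `outcome`.
    return 1e-4

listed_score = log_score(listed, "Paris", toy_prior)       # uses the listed mass
backoff_score = log_score(listed, "Marseille", toy_prior)  # uses the prior
```

Treating a listed outcome with zero mass the same as an unlisted one is a choice made for this sketch, not something the text specifies.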
Score each candidate outcome as a suffix of a fixed prefix (a plain QA scaffold, not the hypothesis-list instructions) and sum its token log-probs. The prefix is the same for every candidate, so it is prefilled once and reused.
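The scoring step can be sketched as follows. `ToyLM` is a hypothetical stand-in for the base LM (a fixed bigram table, not a real model API), and its `prefill` state is just the last prefix token where a real model would keep a key-value cache. The sketch shows the shape of the computation: one prefill, then a sum of token log-probs per candidate suffix.

```python
import math

class ToyLM:
    """Hypothetical stand-in for a base LM: a fixed bigram table."""

    def __init__(self, bigram):
        self.bigram = bigram      # bigram[prev][next] = probability
        self.prefill_calls = 0    # counts how often the prefix is processed

    def prefill(self, prefix_tokens):
        # A real LM would build a KV cache here; the toy state is the last token.
        self.prefill_calls += 1
        return prefix_tokens[-1]

    def suffix_logprob(self, state, suffix_tokens):
        # Sum log p(token | context) over the suffix, starting from the prefix state.
        total = 0.0
        prev = state
        for tok in suffix_tokens:
            total += math.log(self.bigram[prev][tok])
            prev = tok
        return total

def score_candidates(lm, prefix_tokens, candidates):
    state = lm.prefill(prefix_tokens)  # done once, shared by all candidates
    return {name: lm.suffix_logprob(state, toks)
            for name, toks in candidates.items()}

# Illustrative numbers only.
bigram = {
    "A:": {"yes": 0.5, "no": 0.25, "maybe": 0.25},
    "maybe": {"not": 0.5, "so": 0.5},
}
lm = ToyLM(bigram)
candidates = {"yes": ["yes"], "maybe not": ["maybe", "not"]}
scores = score_candidates(lm, ["Q:", "...", "A:"], candidates)
```

Multi-token candidates accumulate one log-prob per token, so longer suffixes tend to score lower under this plain sum.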