Scratchpads

Score decomposition — reliability / sufficiency decomposition of proper scores, applied to the cross-entropy reward.
GRPO σ-normalization — when does std-normalization cause overconfidence? Hypothesis: only when the target sits on a simplex vertex (one-hot, as in multiple choice), not for interior targets.
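
For reference on the σ-normalization note, a minimal sketch of the normalization itself, assuming the usual group-relative form: center each group's rewards by the group mean, then divide by the group std. The name `grpo_advantages`, the `eps` value, the population std (ddof=0), and the example rewards are all assumptions for illustration, not taken from any particular codebase. It only shows the scale invariance that std-normalization introduces; it does not test the vertex-vs-interior hypothesis.

```python
import numpy as np


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: subtract the group mean, divide by the group std."""
    r = np.asarray(rewards, dtype=float)
    # Population std (ddof=0) and the eps guard are assumptions of this sketch.
    return (r - r.mean()) / (r.std() + eps)


# Hypothetical group of 8 sampled completions, one scalar reward each.
rewards = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0])

adv = grpo_advantages(rewards)
# Same group with every reward shrunk 100x: the reward gap is tiny,
# but the normalized advantages come out (almost exactly) the same.
adv_small = grpo_advantages(0.01 * rewards)

print(adv)
print(adv_small)
```

Because of the division by the std, a group whose rewards differ only slightly gets advantages of the same magnitude as a group whose rewards are widely spread.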