Score decomposition

Any strictly proper score decomposes into uncertainty − resolution + reliability (Bröcker, 2009, generalizing the Murphy decomposition of the Brier score to any proper score).

The decomposition

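For a forecast $\gamma$ and target $Y$, write $s(p, q) = \mathbb{E}_{Y \sim q}[S(p, Y)]$ for the expected score of forecast $p$ when $Y \sim q$, $e(q) = s(q, q)$ for the entropy, $d(p, q) = s(p, q) - e(q)$ for the divergence, $\bar\pi$ for the marginal distribution of $Y$, and $\pi^\gamma$ for the conditional distribution of $Y$ given $\gamma$.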

Then Bröcker's Eq. (13), for any strictly proper loss $S$, is

$$\mathbb{E}[S(\gamma, Y)] = e(\bar\pi) - \mathbb{E}[d(\bar\pi, \pi^\gamma)] + \mathbb{E}[d(\gamma, \pi^\gamma)].$$

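The proof takes two steps. First, conditioning on $\gamma$ gives $\mathbb{E}[S(\gamma, Y)] = \mathbb{E}[s(\gamma, \pi^\gamma)]$, and the Savage split $s(\gamma, \pi^\gamma) = e(\pi^\gamma) + d(\gamma, \pi^\gamma)$ isolates the reliability term. Second, $\mathbb{E}[e(\pi^\gamma)] = e(\bar\pi) - \mathbb{E}[d(\bar\pi, \pi^\gamma)]$: since $s(\bar\pi, \cdot)$ is linear in its second argument and $\mathbb{E}[\pi^\gamma] = \bar\pi$, we have $\mathbb{E}[s(\bar\pi, \pi^\gamma)] = s(\bar\pi, \bar\pi) = e(\bar\pi)$, and applying the same split to $s(\bar\pi, \pi^\gamma)$ gives the result.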

Cross-entropy

The log score sets $s(p, q) = H(q, p)$, $e(p) = H(p)$, $d(p, q) = \mathrm{KL}(q \,\|\, p)$, giving

$$\mathbb{E}[-\log \gamma_Y] = H(Y) - I(Y;\gamma) + \mathbb{E}[\mathrm{KL}(\pi^\gamma \,\|\, \gamma)].$$
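As a short worked step, assuming $Y$ is discrete, the uncertainty and resolution terms of the general decomposition can be matched to their information-theoretic names as follows. The uncertainty is the entropy of the marginal, $e(\bar\pi) = H(\bar\pi) = H(Y)$, and the resolution is the expected divergence of the conditional from the marginal, which is the mutual information:

$$
\mathbb{E}[d(\bar\pi, \pi^\gamma)]
= \mathbb{E}\big[\mathrm{KL}(\pi^\gamma \,\|\, \bar\pi)\big]
= \mathbb{E}\Big[\sum_{y} \pi^\gamma(y) \log \frac{\pi^\gamma(y)}{\bar\pi(y)}\Big]
= I(Y;\gamma).
$$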

A constant forecast equal to the base rate $\bar\pi$ scores $H(Y)$; an oracle maximizes $I(Y;\gamma)$ and drives the reliability term to zero.
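The two reference points can be checked numerically. The following is a minimal sketch on a made-up six-example label vector; `mean_log_loss` and the toy data are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

def mean_log_loss(gamma, y):
    """Average of -log gamma[i, y[i]] over examples (in nats)."""
    gamma = np.asarray(gamma, dtype=float)
    y = np.asarray(y, dtype=int)
    # Select the probability assigned to the true class before taking the log,
    # so zero entries elsewhere in gamma are never passed to np.log.
    return float(-np.mean(np.log(gamma[np.arange(len(y)), y])))

# Toy labels over k = 3 classes (hypothetical data).
y = np.array([0, 0, 0, 1, 2, 2])
k = 3

# Empirical base rate and its entropy H(Y).
base_rate = np.bincount(y, minlength=k) / len(y)
entropy = float(-np.sum(base_rate * np.log(base_rate)))

# Constant forecast at the base rate: every row is the same distribution.
constant = np.tile(base_rate, (len(y), 1))
# Oracle forecast: a point mass on the true label.
oracle = np.eye(k)[y]

constant_score = mean_log_loss(constant, y)
oracle_score = mean_log_loss(oracle, y)
```

On this toy data the constant base-rate forecast scores exactly the empirical entropy, and the oracle scores zero, which is the case where resolution equals uncertainty and reliability vanishes.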

How it’s estimated

Resolution and reliability both need $\pi^\gamma$, which we don't have in closed form. We estimate it by bucketing: $\gamma$ is the model's predicted distribution, the marginal of $Y$ is the dataset's disease base rate, and $\pi^\gamma$ is the empirical gold rate within each $\gamma$-bucket. The decomposition is therefore a bucket-dependent estimate, but the ordering of methods holds across bucketing choices.
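A bucketed estimate of the log-score decomposition might look like the sketch below. It assumes bucket assignments are supplied by the caller (the bucketing rule itself is not specified here), and the function name `decompose_log_score` and the toy inputs are hypothetical. The identity score = uncertainty − resolution + reliability holds exactly only when all forecasts inside a bucket are identical; otherwise the three terms are an approximation.

```python
import numpy as np

def _entropy(p):
    """Shannon entropy in nats, treating 0 * log 0 as 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log(p[nz])))

def _kl(q, p):
    """KL(q || p) in nats; assumes p > 0 wherever q > 0."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    nz = q > 0
    return float(np.sum(q[nz] * (np.log(q[nz]) - np.log(p[nz]))))

def decompose_log_score(gamma, y, bucket):
    """Bucketed estimate of uncertainty, resolution, reliability.

    gamma:  (n, k) predicted distributions
    y:      (n,) integer gold labels
    bucket: (n,) bucket id per example (assumed given)
    """
    gamma = np.asarray(gamma, dtype=float)
    y = np.asarray(y, dtype=int)
    bucket = np.asarray(bucket)
    n, k = gamma.shape
    onehot = np.eye(k)[y]

    pi_bar = onehot.mean(axis=0)            # dataset base rate
    uncertainty = _entropy(pi_bar)
    resolution = 0.0
    reliability = 0.0
    for b in np.unique(bucket):
        idx = bucket == b
        w = idx.mean()                      # fraction of examples in bucket
        pi_b = onehot[idx].mean(axis=0)     # empirical gold rate in bucket
        resolution += w * _kl(pi_b, pi_bar)
        reliability += w * np.mean([_kl(pi_b, g) for g in gamma[idx]])

    score = float(-np.mean(np.log(gamma[np.arange(n), y])))
    return {
        "score": score,
        "uncertainty": uncertainty,
        "resolution": float(resolution),
        "reliability": float(reliability),
    }

# Toy usage: two distinct forecasts, one bucket per forecast (hypothetical data).
gamma = np.array([[0.8, 0.2]] * 4 + [[0.3, 0.7]] * 4)
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])
bucket = np.array([0] * 4 + [1] * 4)
parts = decompose_log_score(gamma, y, bucket)
```

Because each bucket here contains a single repeated forecast, `parts["score"]` matches `parts["uncertainty"] - parts["resolution"] + parts["reliability"]` up to floating-point error. With coarser buckets the residual between the two sides is one way to gauge how much the estimate depends on the bucketing.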

On DDXPlus

Figure: DDXPlus proper-score decomposition across four output representations: Pyro with and without std-normalization, the verbalized hypothesis list, and the hierarchical DSL.
Pyro with std-normalization (red) reaches higher reward and resolution than Pyro without it (blue) while reliability tracks it, so there is no overconfidence on this distributional target, the opposite of the MedMCQA replication. Mechanism: σ-normalization.