ProtoQA
ProtoQA is a set of Family-Feud-style commonsense questions. Each question has a crowd distribution over answer clusters: the empirical frequencies with which a pool of respondents gave each answer. The answer set is not shown in the prompt, so the model must recover the candidate set itself (set recovery), and the reward is cross-entropy against the crowd distribution rather than agreement with a single label.
Dataset entry
Question: At the beach, name something that might protect you from sun.
Gold target
Model output
{sunscreen: 0.4; umbrella: 0.3; hat: 0.15; other: 0.15}
Each cluster groups the surface forms a crowd-sourcer merged together; its probability is that cluster's share of responses.
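As a rough illustration of how a format check might work, the sketch below parses a distribution written in the `{name: weight; ...}` form shown above. The function name `parse_distribution` and the acceptance rules (non-negative finite weights that sum to 1) are assumptions made for this example; the actual format checker behind the "Format rate" metric is not described here.

```python
import math
import re


def parse_distribution(text):
    """Parse '{a: 0.4; b: 0.6}' into a dict, or return None if malformed.

    Hypothetical helper: the acceptance rules below are assumptions,
    not the rules used in the reported runs.
    """
    match = re.fullmatch(r"\s*\{(.*)\}\s*", text, flags=re.DOTALL)
    if match is None:
        return None
    dist = {}
    for item in match.group(1).split(";"):
        if not item.strip():
            continue  # tolerate a trailing ';'
        name, sep, weight = item.rpartition(":")
        if not sep or not name.strip():
            return None
        try:
            value = float(weight)
        except ValueError:
            return None
        if not math.isfinite(value) or value < 0:
            return None
        dist[name.strip().lower()] = value
    # Assumed rule: weights must form a probability distribution.
    if not dist or abs(sum(dist.values()) - 1.0) > 1e-6:
        return None
    return dist


parsed = parse_distribution("{sunscreen: 0.4; umbrella: 0.3; hat: 0.15; other: 0.15}")
```

Under these assumed rules, a format rate would be the fraction of sampled outputs for which the parser returns a dictionary rather than `None`.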
Setup
Qwen3-1.7B, full fine-tune, DAPO. The reward is pure cross-entropy against the crowd distribution, computed over a K=5 hypothesis list with LM-prior backoff.
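A minimal sketch of a cross-entropy reward of this kind is given below. It assumes the model's hypotheses have already been matched to crowd clusters, and that the reward is the negative cross-entropy (consistent with the negative mean rewards reported). The LM-prior backoff is not specified here, so the `floor` parameter is only a simple stand-in for it; `cross_entropy_reward` is a hypothetical name.

```python
import math


def cross_entropy_reward(crowd, predicted, floor=1e-3):
    """Negative cross-entropy of `predicted` against `crowd`.

    crowd:     dict mapping cluster -> crowd probability (sums to 1).
    predicted: dict mapping cluster -> model probability, already
               matched to crowd clusters (assumption).
    floor:     probability given to clusters the model did not name;
               a stand-in for the LM-prior backoff, not the real thing.
    """
    cross_entropy = 0.0
    for cluster, p in crowd.items():
        q = max(predicted.get(cluster, 0.0), floor)
        cross_entropy -= p * math.log(q)
    return -cross_entropy


# Illustrative values only, not taken from the dataset.
crowd = {"sunscreen": 0.5, "umbrella": 0.5}
perfect = cross_entropy_reward(crowd, crowd)
partial = cross_entropy_reward(crowd, {"sunscreen": 1.0})
```

In this sketch the reward is highest when the predicted distribution equals the crowd distribution, and any crowd cluster the model fails to name is penalized through the floor value.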
Results
Training progression, start → step 353 (coverage = fraction of crowd mass on named clusters):
| Metric | start | step 353 |
|---|---|---|
| Format rate | 0.37 | 0.97 |
| Coverage | 0.06 | 0.21 |
| # hypotheses | 3.5 | 3.3 |
| w_other | 0.04 | 0.12 |
| Mean reward | −5.51 | −4.29 |
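The coverage metric, as defined in the caption above the table, can be sketched as follows. The crowd distribution and the set of named clusters are made up for illustration, and the sketch assumes each model hypothesis has already been mapped to at most one crowd cluster; `coverage` is a hypothetical helper name.

```python
def coverage(crowd, named_clusters):
    """Fraction of crowd mass that falls on clusters the model named.

    crowd:          dict mapping cluster -> crowd probability.
    named_clusters: set of crowd clusters matched by model hypotheses.
    """
    return sum(p for cluster, p in crowd.items() if cluster in named_clusters)


# Illustrative values only, not taken from the dataset.
crowd = {"sunscreen": 0.5, "umbrella": 0.3, "hat": 0.2}
named = {"sunscreen", "hat"}
covered = coverage(crowd, named)  # mass on "sunscreen" plus mass on "hat"
```

Read this way, a coverage of 0.21 means that, on average, the clusters the model names account for about a fifth of the crowd's responses.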