ProtoQA

Family-Feud-style commonsense questions (ProtoQA). Each question has a crowd distribution over answer clusters: the empirical frequencies with which a pool of respondents gave each answer. The answer set is not shown in the prompt, so the model must recover the candidate set itself (set recovery), and the reward is the negative cross-entropy against the crowd distribution rather than agreement with a single label.

Dataset entry

community-datasets/proto_qa · train[0]

Question: At the beach, name something that might protect you from sun.

Gold target

0.384 umbrella
0.364 sunscreen
0.141 sun hat
0.051 sunglasses
0.030 cover up
0.030 shade

Model output

{sunscreen: 0.4; umbrella: 0.3; hat: 0.15; other: 0.15}

Each gold cluster groups the surface forms that were merged together during crowdsourcing; its probability is that cluster's share of responses.
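The model output above follows a simple "{name: weight; ...}" pattern, and the format rate reported in the results presumably counts how often an output parses. A minimal sketch of such a parser is shown here. The function name parse_distribution and the strictness rules (finite non-negative weights, no duplicate names, weights summing to 1 within a tolerance) are assumptions for illustration, not the parser used in this run.

```python
import math
from typing import Dict, Optional


def parse_distribution(text: str, tol: float = 1e-3) -> Optional[Dict[str, float]]:
    """Parse '{a: 0.4; b: 0.3; other: 0.3}' into a dict, or return None if malformed.

    Hypothetical helper: the strictness rules below are assumptions.
    """
    text = text.strip()
    if not (text.startswith("{") and text.endswith("}")):
        return None
    dist: Dict[str, float] = {}
    for item in text[1:-1].split(";"):
        # Split on the last colon so names containing ':' are still handled.
        name, sep, weight = item.rpartition(":")
        name = name.strip().lower()
        if not sep or not name:
            return None
        try:
            w = float(weight)
        except ValueError:
            return None
        # Reject NaN, infinities, negative weights, and repeated names.
        if not math.isfinite(w) or w < 0 or name in dist:
            return None
        dist[name] = w
    # Assumed rule: weights must form a probability distribution.
    if not dist or abs(sum(dist.values()) - 1.0) > tol:
        return None
    return dist


parsed = parse_distribution("{sunscreen: 0.4; umbrella: 0.3; hat: 0.15; other: 0.15}")
print(parsed)
```

A stricter or looser parser (for example, one that renormalizes instead of rejecting) would change the measured format rate, so the numbers in the results should be read against whatever rule the run actually used.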

Setup

Qwen3-1.7B, full fine-tune, DAPO. Pure cross-entropy reward against the crowd distribution over a K=5 hypothesis list with LM-prior backoff.
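As a rough illustration of the reward, the sketch below scores a predicted distribution against the crowd distribution. Several details are assumptions rather than facts from this run: the sign convention (reward = negative cross-entropy, which is consistent with the negative mean rewards in the results), exact string matching between hypotheses and cluster names, and the backoff rule, in which a crowd cluster the model did not name receives the "other" weight multiplied by a prior probability. The prior is passed in as a plain dict named backoff; in the actual setup it would come from the language model, and how it is computed is not specified here.

```python
import math
from typing import Dict


def cross_entropy_reward(
    crowd: Dict[str, float],
    predicted: Dict[str, float],
    backoff: Dict[str, float],
    other_key: str = "other",
    eps: float = 1e-12,
) -> float:
    """Return sum_c p(c) * log q(c), i.e. the negative cross-entropy H(p, q).

    Assumed backoff rule: an unnamed cluster c gets q(c) = w_other * backoff[c].
    """
    w_other = predicted.get(other_key, 0.0)
    reward = 0.0
    for cluster, p in crowd.items():
        if cluster != other_key and cluster in predicted:
            q = predicted[cluster]
        else:
            q = w_other * backoff.get(cluster, 0.0)
        # Clamp to eps so a zero probability yields a large finite penalty.
        reward += p * math.log(max(q, eps))
    return reward


# Toy example with made-up numbers (not taken from the dataset).
crowd = {"a": 0.5, "b": 0.5}
predicted = {"a": 0.9, "other": 0.1}
backoff = {"b": 0.5}  # hypothetical prior mass for the unnamed cluster
print(cross_entropy_reward(crowd, predicted, backoff))
```

Under this rule, leaving weight on "other" is what keeps the penalty for a missed cluster bounded, at the cost of taking mass away from the named hypotheses.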

Results

Training progression, start → step 353 (coverage = fraction of crowd mass on named clusters; w_other = weight the model places on the "other" bucket):

Metric	start	step 353
Format rate	0.37	0.97
Coverage	0.06	0.21
# hypotheses	3.5	3.3
w_other	0.04	0.12
Mean reward	−5.51	−4.29
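The coverage metric can be illustrated on the beach example from the dataset entry. This is a minimal sketch: the function name coverage, the exact-match rule, and the optional aliases map (which lets a hypothesis such as "hat" count toward the "sun hat" cluster) are assumptions, since the matching procedure used in this run is not described.

```python
from typing import Dict, Optional


def coverage(
    crowd: Dict[str, float],
    predicted: Dict[str, float],
    aliases: Optional[Dict[str, str]] = None,
    other_key: str = "other",
) -> float:
    """Fraction of crowd mass on clusters the model explicitly named."""
    aliases = aliases or {}
    # Map each named hypothesis to a cluster name; "other" never counts as named.
    named = {aliases.get(name, name) for name in predicted if name != other_key}
    return sum(p for cluster, p in crowd.items() if cluster in named)


crowd = {
    "umbrella": 0.384, "sunscreen": 0.364, "sun hat": 0.141,
    "sunglasses": 0.051, "cover up": 0.030, "shade": 0.030,
}
predicted = {"sunscreen": 0.4, "umbrella": 0.3, "hat": 0.15, "other": 0.15}

# Exact matching: only umbrella and sunscreen count (about 0.748).
print(coverage(crowd, predicted))
# With a hypothetical alias mapping "hat" to "sun hat" (about 0.889).
print(coverage(crowd, predicted, aliases={"hat": "sun hat"}))
```

The gap between the two printed values shows how sensitive coverage is to the matching rule, which matters when comparing the table's numbers across setups.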