ProtoQA

Family-Feud-style commonsense questions (ProtoQA). Each question has a crowd distribution over answer clusters: the empirical frequencies with which a pool of respondents gave each answer. The answer set is not shown in the prompt, so the model must recover the candidate set itself (set recovery), and the reward is the negative cross-entropy against the crowd distribution rather than agreement with a single label.

Dataset entry

community-datasets/proto_qa · train[0]
Question: At the beach, name something that might protect you from sun.
  • 0.384 umbrella
  • 0.364 sunscreen
  • 0.141 sun hat
  • 0.051 sunglasses
  • 0.030 cover up
  • 0.030 shade
{sunscreen: 0.4; umbrella: 0.3; hat: 0.15; other: 0.15}
Each cluster groups the surface forms a crowd-sourcer merged together; its probability is that cluster's share of responses.
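As a small illustration of "share of responses", the sketch below normalizes per-cluster counts into probabilities. The counts are hypothetical: they are not taken from the dataset, and were picked only because they round to the shares listed above.

```python
from collections import Counter

# Hypothetical raw counts per cluster (an assumption, NOT dataset values);
# chosen only so that the shares round to the numbers listed above.
counts = Counter({
    "umbrella": 38,
    "sunscreen": 36,
    "sun hat": 14,
    "sunglasses": 5,
    "cover up": 3,
    "shade": 3,
})


def crowd_distribution(cluster_counts):
    """Normalize per-cluster response counts into probabilities."""
    total = sum(cluster_counts.values())
    if total == 0:
        raise ValueError("no responses to normalize")
    return {cluster: n / total for cluster, n in cluster_counts.items()}


dist = crowd_distribution(counts)
for cluster, p in sorted(dist.items(), key=lambda kv: -kv[1]):
    print(f"{p:.3f} {cluster}")
```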

Setup

Qwen3-1.7B, full fine-tune, DAPO. Pure cross-entropy reward against the crowd distribution over a K=5 hypothesis list with LM-prior backoff.
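The setup does not spell out how hypotheses are matched to clusters or how the backoff is applied, so the following is only one plausible reading, not the actual reward code. It assumes the model emits up to K weighted hypotheses plus an "other" weight, that a crowd cluster matched by a hypothesis receives that hypothesis's weight, and that an unmatched cluster receives the "other" weight scaled by an LM-prior probability. The names `ce_reward`, `word_match`, and `prior` are hypothetical.

```python
import math


def word_match(hypothesis, cluster):
    """Hypothetical matcher: the hypothesis appears as a whole word in the cluster name."""
    return hypothesis.lower() in cluster.lower().split()


def ce_reward(crowd, hypotheses, w_other, prior, match=word_match, k=5, eps=1e-12):
    """Negative cross-entropy of the crowd distribution under the model's list.

    crowd:      {cluster: crowd probability}
    hypotheses: {answer string: model weight}, at most k entries
    w_other:    model weight reserved for answers it did not name
    prior:      {cluster: assumed LM-prior probability}, used only for backoff
    """
    if len(hypotheses) > k:
        raise ValueError(f"expected at most {k} hypotheses")
    total = sum(hypotheses.values()) + w_other
    if total <= 0:
        raise ValueError("model weights must sum to a positive value")
    reward = 0.0
    for cluster, p in crowd.items():
        named = sum(w for h, w in hypotheses.items() if match(h, cluster))
        if named > 0:
            q = named / total  # cluster was named by the model
        else:
            q = (w_other / total) * prior.get(cluster, 0.0)  # assumed backoff
        reward += p * math.log(max(q, eps))  # eps avoids log(0)
    return reward


crowd = {"umbrella": 0.384, "sunscreen": 0.364, "sun hat": 0.141,
         "sunglasses": 0.051, "cover up": 0.030, "shade": 0.030}
hypotheses = {"sunscreen": 0.4, "umbrella": 0.3, "hat": 0.15}
# Assumed prior: uniform over the three clusters the model did not name.
prior = {"sunglasses": 1 / 3, "cover up": 1 / 3, "shade": 1 / 3}
reward = ce_reward(crowd, hypotheses, w_other=0.15, prior=prior)
print(f"reward = {reward:.3f}")
```

Under this reading the reward is always non-positive, and for a list with no "other" mass it is maximized when the named weights equal the crowd probabilities, where it equals the negative entropy of the crowd distribution.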

Results

Training progression, start → step 353 (coverage = fraction of crowd mass on named clusters):

Metric          start   step 353
Format rate      0.37       0.97
Coverage         0.06       0.21
# hypotheses     3.5        3.3
w_other          0.04       0.12
Mean reward     −5.51      −4.29
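The coverage metric, as defined above (fraction of crowd mass on named clusters), can be sketched for a single question as follows. The whole-word matcher `word_match` is a hypothetical stand-in, since the actual matching rule is not described here; the hypothesis list is the one from the dataset-entry example.

```python
def word_match(hypothesis, cluster):
    """Hypothetical matcher: the hypothesis appears as a whole word in the cluster name."""
    return hypothesis.lower() in cluster.lower().split()


def coverage(crowd, hypotheses, match=word_match):
    """Fraction of crowd probability mass on clusters matched by some named hypothesis."""
    return sum(
        p for cluster, p in crowd.items()
        if any(match(h, cluster) for h in hypotheses)
    )


crowd = {"umbrella": 0.384, "sunscreen": 0.364, "sun hat": 0.141,
         "sunglasses": 0.051, "cover up": 0.030, "shade": 0.030}
cov = coverage(crowd, ["sunscreen", "umbrella", "hat"])
print(f"coverage = {cov:.3f}")  # umbrella + sunscreen + sun hat -> 0.889
```

The table values would then be averages of this per-question quantity over the evaluation set, which is an assumption about how the metric is aggregated.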