HotpotQA

Multi-hop QA on HotpotQA with the RLCR-style modification (Damani et al.). Each question ships with 10 context paragraphs (2 supporting, 8 distractors), and 0–2 of the supporting paragraphs are dropped, so the answer is often not derivable from the context. The model emits a verbalized probability distribution over a list of hypotheses; the reward is the log-score (negative cross-entropy) of the gold answer under that distribution, with LM-prior backoff on <other>.
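
The paragraph-dropping step can be sketched as follows. This is a minimal illustration, not the pipeline actually used: the function name `drop_supporting`, the `(title, text)` paragraph representation, and the uniform choice of how many supporting paragraphs to drop are all assumptions.

```python
import random

def drop_supporting(paragraphs, supporting_titles, rng):
    """Remove 0-2 supporting paragraphs from a question's context.

    paragraphs: list of (title, text) tuples (hypothetical representation).
    supporting_titles: set of titles marked as supporting.
    Returns (kept_paragraphs, n_dropped).
    """
    supporting = [title for title, _ in paragraphs if title in supporting_titles]
    # Assumption: the number dropped is uniform over 0..2 (capped by availability).
    n_drop = rng.randint(0, min(2, len(supporting)))
    dropped = set(rng.sample(supporting, n_drop))
    # Distractors are never removed; original paragraph order is preserved.
    kept = [(title, text) for title, text in paragraphs if title not in dropped]
    return kept, n_drop

# Toy context: 2 supporting + 8 distractor paragraphs.
context = [("S1", "..."), ("S2", "...")] + [(f"D{i}", "...") for i in range(8)]
kept, n_dropped = drop_supporting(context, {"S1", "S2"}, random.Random(0))
```

When `n_dropped` is 2, no supporting evidence remains, so a well-calibrated model should shift mass toward `other` rather than commit to a guess.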

Dataset entry

hotpot_qa · distractor · validation[4]

Context:
[Clinton, Minnesota] Clinton is a city in Big Stone County, Minnesota, United States.  The city was named for New York Governor DeWitt Clinton.  The population was 449 at the 2010 census.

[Adriana Trigiani] Adriana Trigiani is an Italian American best-selling author of sixteen books, television writer, film director, and entrepreneur based in Greenwich Village, New York City.  Trigiani has published a novel a year since 2000.

[ … 7 more distractor paragraphs … ]

[Big Stone Gap (film)] Big Stone Gap is a 2014 American drama romantic comedy film written and directed by Adriana Trigiani and produced by Donna Gigliotti for Altar Identity Studios, a subsidiary of Media Society.  Based on Trigiani's 2000 best-selling novel of the same name, the story is set in the actual Virginia town of Big Stone Gap circa 1970s.

Question: The director of the romantic comedy "Big Stone Gap" is based in what New York city?

Gold target

1.0 Greenwich Village, New York City

Model output

{Greenwich Village: 0.5; Manhattan: 0.2; Brooklyn: 0.1; other: 0.2}
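
As a worked example, the output above can be parsed and scored by hand. The parser below is a sketch that assumes the brace-delimited, semicolon-separated format shown here; `parse_distribution` is a hypothetical helper, and it is also assumed that the grader treats "Greenwich Village" as matching the gold string.

```python
import math

def parse_distribution(text):
    """Parse '{label: p; label: p; ...}' into a dict of label -> probability."""
    body = text.strip()
    if not (body.startswith("{") and body.endswith("}")):
        raise ValueError("expected a brace-delimited distribution")
    dist = {}
    for item in body[1:-1].split(";"):
        # Split on the last colon so labels may themselves contain colons.
        label, _, weight = item.rpartition(":")
        if not label.strip():
            raise ValueError(f"malformed entry: {item!r}")
        dist[label.strip()] = float(weight)
    return dist

dist = parse_distribution(
    "{Greenwich Village: 0.5; Manhattan: 0.2; Brooklyn: 0.1; other: 0.2}"
)
# Log-score of the gold hypothesis: log(0.5), roughly -0.693.
score = math.log(dist["Greenwich Village"])
```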

Setup

Qwen3-4B + LoRA, trained with DAPO. The reward is the pure log-score (negative cross-entropy) over a K=3 hypothesis list, with no correctness or auxiliary terms.
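
One way such a reward could be computed is sketched below. The source does not spell out the backoff rule, so the form used here is an assumption: when the gold answer is not among the listed hypotheses, the reward is log(w_other) plus the log-probability a reference LM assigns to the gold answer. The function name, exact-match lookup, and probability floor are likewise illustrative.

```python
import math

def log_score_reward(dist, gold, lm_prior_logprob, other_key="other", floor=1e-6):
    """Log-score of the gold answer under a verbalized distribution.

    dist: dict of hypothesis -> probability, including the `other_key` bucket.
    lm_prior_logprob: log p_LM(gold), supplied by the caller (assumed backoff term).
    floor: hypothetical clamp that keeps the reward finite at zero probability.
    """
    if gold != other_key and gold in dist:
        # Gold is a listed hypothesis: reward is its log-probability.
        return math.log(max(dist[gold], floor))
    # Gold is unlisted: spread the `other` mass according to the LM prior.
    w_other = dist.get(other_key, 0.0)
    return math.log(max(w_other, floor)) + lm_prior_logprob

dist = {"Greenwich Village": 0.5, "Manhattan": 0.2, "Brooklyn": 0.1, "other": 0.2}
hit = log_score_reward(dist, "Greenwich Village", lm_prior_logprob=-3.0)
miss = log_score_reward(dist, "Queens", lm_prior_logprob=-3.0)
```

Under this form, putting more weight on `other` only pays off when the gold answer really is unlisted, which is the incentive a proper scoring rule is meant to provide.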

Results

Training progression, start → step 284:

Metric	start	step 284
Format rate	0.87	1.00
# hypotheses	2.24	2.00
w_other	0.13	0.27
p(gold)	0.24	0.48
Marginal ECE	0.31	0.13
Mean reward	−4.22	−2.95
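
For reference, a standard binned expected calibration error can be sketched as below. The source does not define "Marginal ECE", so treating it as ECE over every stated hypothesis probability pooled across questions is an assumption, as are the function name and the choice of 10 equal-width bins.

```python
def expected_calibration_error(confidences, outcomes, n_bins=10):
    """Binned ECE: weighted mean of |mean confidence - empirical accuracy| per bin.

    confidences: stated probabilities in [0, 1].
    outcomes: 1 if the corresponding hypothesis was correct, else 0.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(confidences, outcomes):
        # Equal-width bins; p == 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(p for p, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece

# Two hypotheses stated at 0.5, one of them correct: perfectly calibrated.
calibrated = expected_calibration_error([0.5, 0.5], [1, 0])
```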