HotpotQA
Multi-hop QA on HotpotQA with the RLCR-style modification (Damani et al.): each question ships with 10 context paragraphs (2 supporting, 8 distractors), and 0–2 of the supporting paragraphs are dropped, so the answer is often not derivable from the context. The model emits a verbalized distribution over a list of hypotheses; the reward is the log-score of the gold answer (the negative cross-entropy against it), with LM-prior backoff on <other>.
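The context construction described above can be sketched as follows. This is a minimal illustration, not the actual data pipeline: the function name `build_context`, the uniform choice of how many supporting paragraphs to drop, and the final shuffle are all assumptions, since the text only states that 0–2 supporting paragraphs are removed.

```python
import random


def build_context(supporting, distractors, rng):
    """Drop 0-2 supporting paragraphs and mix the rest with the distractors.

    Assumption: the number of dropped paragraphs is drawn uniformly.
    The source only says that 0-2 supporting paragraphs are dropped.
    """
    # randint is inclusive on both ends: 0, 1, or 2 for two supporting paragraphs.
    n_drop = rng.randint(0, len(supporting))
    kept = rng.sample(supporting, len(supporting) - n_drop)
    context = kept + list(distractors)
    rng.shuffle(context)  # assumption: paragraph order is randomized
    return context, n_drop


rng = random.Random(0)
supporting = ["[Adriana Trigiani] ...", "[Big Stone Gap (film)] ..."]
distractors = [f"[Distractor {i}] ..." for i in range(8)]
context, n_drop = build_context(supporting, distractors, rng)
```

With `n_drop` equal to 1 or 2, at least one hop of the reasoning chain is missing, which is the case where the model should shift probability mass away from a single confident answer.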
Dataset entry
Context: [Clinton, Minnesota] Clinton is a city in Big Stone County, Minnesota, United States. The city was named for New York Governor DeWitt Clinton. The population was 449 at the 2010 census. [Adriana Trigiani] Adriana Trigiani is an Italian American best-selling author of sixteen books, television writer, film director, and entrepreneur based in Greenwich Village, New York City. Trigiani has published a novel a year since 2000. [ … 7 more distractor paragraphs … ] [Big Stone Gap (film)] Big Stone Gap is a 2014 American drama romantic comedy film written and directed by Adriana Trigiani and produced by Donna Gigliotti for Altar Identity Studios, a subsidiary of Media Society. Based on Trigiani's 2000 best-selling novel of the same name, the story is set in the actual Virginia town of Big Stone Gap circa 1970s. Question: The director of the romantic comedy "Big Stone Gap" is based in what New York city?
Gold target
Model output
{Greenwich Village: 0.5; Manhattan: 0.2; Brooklyn: 0.1; other: 0.2}
Setup
Qwen3-4B + LoRA, trained with DAPO. The reward is the pure log-score (negative cross-entropy) over a K=3 hypothesis list, with no correctness or auxiliary terms.
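A rough sketch of how such a reward could be computed from the verbalized output is shown here. Several details are assumptions rather than the documented method: the helper names `parse_distribution` and `log_score_reward` are hypothetical, gold matching is by exact string, and the backoff is taken to be the `other` weight multiplied by an externally supplied LM-prior probability of the gold answer (`prior_gold`).

```python
import math


def parse_distribution(text):
    """Parse '{A: 0.5; B: 0.2; other: 0.3}' into a dict of label -> probability."""
    body = text.strip().strip("{}")
    dist = {}
    for item in body.split(";"):
        if not item.strip():
            continue  # tolerate a trailing semicolon
        # Split on the last colon so labels may themselves contain colons.
        label, _, prob = item.rpartition(":")
        dist[label.strip()] = float(prob)
    return dist


def log_score_reward(dist, gold, prior_gold, other_key="other", eps=1e-12):
    """Log-score of the gold answer under the verbalized distribution.

    Assumption: if the gold answer is not a listed hypothesis, its probability
    backs off to w_other * prior_gold, where prior_gold is the LM-prior
    probability of the gold string (supplied by the caller).
    """
    if gold != other_key and gold in dist:
        p = dist[gold]
    else:
        p = dist.get(other_key, 0.0) * prior_gold
    return math.log(max(p, eps))  # eps avoids log(0)


dist = parse_distribution(
    "{Greenwich Village: 0.5; Manhattan: 0.2; Brooklyn: 0.1; other: 0.2}"
)
reward = log_score_reward(dist, "Greenwich Village", prior_gold=0.01)
```

If the gold answer matched the "Greenwich Village" hypothesis, this sketch would give a reward of log 0.5 ≈ −0.69. A gold answer outside the list is scored through the `other` mass, so the reward is lower but still finite.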
Results
Training progression, start → step 284:
| Metric | start | step 284 |
|---|---|---|
| Format rate | 0.87 | 1.00 |
| # hypotheses | 2.24 | 2.00 |
| w_other | 0.13 | 0.27 |
| p(gold) | 0.24 | 0.48 |
| Marginal ECE | 0.31 | 0.13 |
| Mean reward | −4.22 | −2.95 |
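For reference, a standard binned expected calibration error over per-hypothesis probabilities can be computed as in the sketch below. Whether this matches the exact "Marginal ECE" definition used for the table is an assumption; the function name, the equal-width binning, and the choice of 10 bins are illustrative.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |avg confidence - accuracy| over bins.

    confidences: stated probabilities in [0, 1], one per hypothesis.
    correct: 1 if that hypothesis matched the gold answer, else 0.
    """
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))

    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece


# Hypotheses stated at 0.9 that are always right, and at 0.1 that are always wrong.
ece = expected_calibration_error([0.9, 0.9, 0.1, 0.1], [1, 1, 0, 0])
```

Under this definition, a lower value means the stated probabilities track empirical hit rates more closely, which is the direction of the change reported in the table.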