Experiments
Each domain probes a different corner of the design space along two axes. The first is the shape of the gold target: point mass, crowd distribution, or full distribution on a closed set. The second is the output representation: verbalized hypothesis list, class-structured spec, or executable probabilistic program. The probes characterize the space; the metric-best variant is not automatically the right RL target.
- MedMCQA — 4-choice; the Bereket replication and GRPO σ-normalization overconfidence.
- HotpotQA — multi-hop QA with dropped supporting paragraphs; RLCR’s domain.
- ProtoQA — crowd-distribution gold; the support-recovery angle in isolation.
- DDXPlus — distributional diagnosis over 49 conditions; RLCR’s domain, where the CoT / DSL / PPL representation comparison lives.
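
The σ-normalization named in the MedMCQA item is, in the standard GRPO formulation, the division of group-centered rewards by the group's reward standard deviation. The sketch below only illustrates that arithmetic. It assumes binary correctness rewards and a population standard deviation (some implementations use the sample standard deviation instead); the function name `grpo_advantages`, the `eps` value, and the two example groups are hypothetical and not taken from the experiments.

```python
from statistics import fmean, pstdev


def grpo_advantages(rewards, normalize_std=True, eps=1e-6):
    """Group-relative advantages for one prompt's sampled completions.

    rewards: per-completion scalar rewards for a single group.
    normalize_std: if True, divide centered rewards by the group std
        (the sigma-normalization); if False, only subtract the mean.
    eps: small constant guarding against a zero std (assumed value).
    """
    mean = fmean(rewards)
    centered = [r - mean for r in rewards]
    if not normalize_std:
        return centered
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [c / (sigma + eps) for c in centered]


# Hypothetical groups of 8 completions with 0/1 correctness rewards.
easy_group = [1, 1, 1, 1, 1, 1, 1, 0]   # 7 of 8 correct
mixed_group = [1, 1, 1, 1, 0, 0, 0, 0]  # 4 of 8 correct

for name, group in [("easy", easy_group), ("mixed", mixed_group)]:
    scaled = grpo_advantages(group, normalize_std=True)
    plain = grpo_advantages(group, normalize_std=False)
    print(name, "scaled:", [round(a, 3) for a in scaled])
    print(name, "plain: ", [round(a, 3) for a in plain])
```

With the mean subtracted but no division, the size of each advantage tracks the raw reward gap. With the division, every group is rescaled to unit spread, so the lone failure in the mostly-correct group receives a much larger-magnitude advantage than any completion in the evenly split group. Comparing the two settings on the same groups is one way to isolate the effect of the normalization.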