Experiments

Each domain probes a different corner of the design space, which has two axes. The first is the shape of the gold target: a point mass, a crowd distribution, or a full distribution on a closed set. The second is the output representation: a verbalized hypothesis list, a class-structured spec, or an executable probabilistic program. The probes characterize the space; the variant that scores best on the metric is not automatically the right RL target.
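
As a reading aid, the two axes can be written down as a small grid. The sketch below is illustrative only: the names `GoldTarget`, `OutputRepresentation`, and `design_space` are hypothetical and not taken from the experiments. It enumerates every combination, whereas each domain probes only one corner, and which domain occupies which cell is not assumed here.

```python
from enum import Enum
from itertools import product


class GoldTarget(Enum):
    """Shape of the gold target (first axis). Names are hypothetical."""
    POINT_MASS = "point mass"
    CROWD_DISTRIBUTION = "crowd distribution"
    FULL_DISTRIBUTION_CLOSED_SET = "full distribution on a closed set"


class OutputRepresentation(Enum):
    """Output representation (second axis). Names are hypothetical."""
    VERBALIZED_HYPOTHESIS_LIST = "verbalized hypothesis list"
    CLASS_STRUCTURED_SPEC = "class-structured spec"
    EXECUTABLE_PROBABILISTIC_PROGRAM = "executable probabilistic program"


def design_space():
    """Return every (gold target, output representation) pair in the grid."""
    return list(product(GoldTarget, OutputRepresentation))


if __name__ == "__main__":
    for target, representation in design_space():
        print(f"{target.value:35s} x {representation.value}")
```

The grid has nine cells in total; the probes sample corners of it rather than covering it exhaustively.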