{"type":"rich","version":"1.0","provider_name":"Transistor","provider_url":"https://transistor.fm","author_name":"Daily Paper Cast","title":"RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards","html":"<iframe width=\"100%\" height=\"180\" frameborder=\"no\" scrolling=\"no\" seamless src=\"https://share.transistor.fm/e/99f378e6\"></iframe>","width":"100%","height":180,"duration":1357,"description":"\n            🤗 Upvotes: 66 | cs.CL, cs.LG\n\n            Authors:\n            Gaotang Li, Bhavana Dalvi Mishra, Zifeng Wang, Jun Yan, Yanfei Chen, Chun-Liang Li, Long T. Le, Rujun Han, George Lee, Hanghang Tong, Chen-Yu Lee, Tomas Pfister\n\n            Title:\n            RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards\n\n            Arxiv:\n            http://arxiv.org/abs/2605.10899v1\n\n            Abstract:\n            Training deep research agents, namely systems that plan, search, evaluate evidence, and synthesize long-form reports, pushes reinforcement learning beyond the regime of verifiable rewards. Their outputs lack ground-truth answers, their trajectories span many tool-augmented decisions, and standard post-training offers little mechanism for turning past attempts into reusable experience. In this work, we argue that rubrics should serve not merely as final-answer evaluators, but as the shared interface that structures policy execution, judge feedback, and agent memory. Based on this view, we introduce RubricEM, a rubric-guided reinforcement learning framework that combines stagewise policy decomposition with reflection-based meta-policy evolution. RubricEM first makes research trajectories stage-aware by conditioning planning, evidence gathering, review, and synthesis on self-generated rubrics. It then assigns credit with Stage-Structured GRPO, which uses stagewise rubric judgments to provide denser semantic feedback for long-horizon optimization. In parallel, RubricEM trains a shared-backbone reflection meta-policy that distills judged trajectories into reusable rubric-grounded guidance for future attempts. The resulting RubricEM-8B achieves strong performance across four long-form research benchmarks, outperforming comparable open models and approaching proprietary deep-research systems. Beyond final performance, we perform thorough analyses to understand the key ingredients of RubricEM.\n            ","thumbnail_url":"https://img.transistorcdn.com/8lOVNnuwhrA3rxrDMv7Osu4j_t1-jORooO6NfGcQhcw/rs:fill:0:0:1/w:400/h:400/q:60/mb:500000/aHR0cHM6Ly9pbWct/dXBsb2FkLXByb2R1/Y3Rpb24udHJhbnNp/c3Rvci5mbS81Zjg1/YzRhODczMDU4MmE4/OGMwN2FiNDlmYzI2/MDliMi5qcGVn.webp","thumbnail_width":300,"thumbnail_height":300}