RL^V: Unifying Reasoning and Verification in Language Models Through Value-Free Reinforcement Learning

LLMs have gained impressive reasoning capabilities through reinforcement learning (RL) on correctness rewards. Modern RL algorithms for LLMs, including GRPO, VinePPO, and Leave-one-out PPO, have moved away from traditional PPO approaches by eliminating the learned value function network in favor of empirically estimated returns. This reduces computational demands and GPU memory consumption, making RL training more feasible with increasingly large models. However, this efficiency comes with a trade-off: the value function could serve as a powerful outcome verifier to evaluate the correctness of reasoning chains. Without this component, LLMs lose a valuable verification capability that could enhance inference through parallel search strategies like Best-of-N or weighted majority voting.
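
To make the “value-free” idea concrete, the sketch below shows how GRPO-style group-relative baselines and leave-one-out baselines estimate advantages directly from sampled rewards, with no learned value network. The function names and normalization details are illustrative assumptions, not reference implementations from the cited papers.

```python
# Minimal sketch of advantage estimation without a learned value network.
# Function names and exact normalization details are illustrative assumptions.
from typing import List

def grpo_advantages(rewards: List[float]) -> List[float]:
    """Group-relative advantages: compare each sampled solution's reward
    against the mean (and std) of the group sampled for the same prompt."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    std = std if std > 1e-8 else 1.0
    return [(r - mean) / std for r in rewards]

def leave_one_out_advantages(rewards: List[float]) -> List[float]:
    """Leave-one-out baseline: each solution is compared against the mean
    reward of the other solutions sampled for the same prompt."""
    n, total = len(rewards), sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]

# Example: four sampled solutions to one problem, reward 1 if correct else 0.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))           # [1.0, -1.0, -1.0, 1.0]
print(leave_one_out_advantages([1.0, 0.0, 0.0, 1.0]))  # ≈ [0.67, -0.67, -0.67, 0.67]
```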

Recent advances in LLM reasoning have explored various RL techniques, with traditional PPO algorithms demonstrating the value model’s utility as a test-time search verifier. However, the growing trend toward “value-free” RL methods (GRPO, VinePPO, Leave-one-out PPO) eliminates this capability, and recovering it means taking on the overhead of training a separate model. Test-time verification approaches are alternatives that improve reasoning by scaling computation, using verifier models trained via binary classification, preference learning, or next-token prediction techniques. However, these models require large training datasets, additional computational resources, and considerable GPU memory during inference.
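
As a rough illustration of what such standalone verifiers cost at inference, the sketch below reranks candidate solutions with a separately hosted verifier. `generate_solutions` and `external_verifier_score` are hypothetical stand-ins for the reasoner LLM and the separately trained verifier; the point is that a second model has to be trained, loaded into GPU memory, and queried alongside the reasoner.

```python
# Sketch of Best-of-N reranking with a *separate* verifier model, as in the
# standalone test-time verification approaches described above. The callables
# are hypothetical placeholders, not a specific library's API.
from typing import Callable, List

def rerank_with_external_verifier(
    problem: str,
    generate_solutions: Callable[[str, int], List[str]],    # reasoner LLM
    external_verifier_score: Callable[[str, str], float],   # separate verifier model
    n: int = 8,
) -> str:
    """Sample n candidate solutions and keep the one the verifier scores highest."""
    candidates = generate_solutions(problem, n)
    scores = [external_verifier_score(problem, sol) for sol in candidates]
    best_idx = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_idx]
```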

Researchers from McGill University, Université de Montréal, Microsoft Research, and Google DeepMind have proposed RLV to tap the potential of value-like signals in RL for LLMs. RLV augments “value-free” methods with a generative verifier without compromising training scalability. RLV leverages the LLM’s generation capabilities by using the abundant data produced during RL training to optimize the model as both a reasoner and a verifier. This dual-function approach frames verification as a next-token prediction task, enabling the same LLM to generate solutions while providing an intrinsic score. Initial results show RLV boosting MATH accuracy by over 20% compared to base RL methods when using parallel sampling, achieving 8-32 times more efficient test-time compute scaling.
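
A minimal sketch of the verification-as-next-token-prediction idea, under my own assumptions about the prompt template and answer tokens (the article does not specify them): the same policy LLM scores a candidate solution by the probability it assigns to a “Yes” continuation. The Hugging Face repo id below is an assumption based on the Qwen2.5 Math 1.5B model named later in the article.

```python
# Sketch: reuse the policy LLM as a generative verifier by framing verification
# as next-token prediction. Prompt wording, answer tokens, and the model repo id
# are illustrative assumptions, not the exact setup used in RLV.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen2.5-Math-1.5B"  # assumed repo id for the model family in the article
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

def verify_score(problem: str, solution: str) -> float:
    """Return P('Yes') vs P('No') as the model's intrinsic correctness score."""
    prompt = f"{problem}\n{solution}\nIs this solution correct? Answer Yes or No:"
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    probs = torch.softmax(next_token_logits.float(), dim=-1)
    yes_id = tok(" Yes", add_special_tokens=False).input_ids[0]
    no_id = tok(" No", add_special_tokens=False).input_ids[0]
    # Normalize over the two candidate answers so the score lies in [0, 1].
    return (probs[yes_id] / (probs[yes_id] + probs[no_id])).item()
```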

RLV unifies a reasoner and a generative verifier within a single LLM, addressing four key research questions about parallel test-time compute scaling, verifier training methodologies, test-time usage strategies, and interactions with sequential scaling in reasoning models. The setup uses Hendrycks’ MATH dataset for RL training, running on 4× A100 80GB Nvidia GPUs for 3 hours, with evaluations reported across the MATH500, MATH², GPQA, and AIME’24 benchmarks. The researchers employ the Qwen2.5 Math 1.5B model, fine-tuning it with the GRPO, Leave-One-Out PPO, and VinePPO algorithms, with and without unified verification, for a shorter-CoT experiment. Training used a 1024-token context window, with inference generating up to 1024 tokens for MATH500 and 2048 tokens for the other test sets.
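
For quick reference, the reported setup can be collected into a small configuration sketch; the field names below are invented for illustration, and only values stated above are filled in.

```python
# Hypothetical configuration object summarizing the setup described above.
# Field names are invented; the values are those reported in the article.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class RLVSetup:
    base_model: str = "Qwen2.5 Math 1.5B"
    rl_algorithms: List[str] = field(
        default_factory=lambda: ["GRPO", "Leave-One-Out PPO", "VinePPO"]
    )
    train_dataset: str = "Hendrycks' MATH"
    hardware: str = "4x NVIDIA A100 80GB"
    train_hours: int = 3
    train_context_tokens: int = 1024
    max_inference_tokens: Dict[str, int] = field(
        default_factory=lambda: {"MATH500": 1024, "other test sets": 2048}
    )
    eval_benchmarks: List[str] = field(
        default_factory=lambda: ["MATH500", "MATH²", "GPQA", "AIME'24"]
    )
```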

RLV shows impressive test-time compute scaling capabilities, achieving up to 32 times greater efficiency and 4% higher accuracy than baseline methods on MATH500 with 512 samples. Testing optimal verification strategies reveals that weighted voting outperforms majority voting and Best-of-N approaches when sampling 8 or more solutions per problem, for both short- and long-CoT models. RLV proves complementary to sequential inference compute scaling, with the GRPOV method achieving the highest success rates on AIME 24 at longer generation lengths. Training the unified verifier requires careful balancing through the verification coefficient λ, which presents a significant trade-off in the GRPOV implementation: increasing λ improves verifier accuracy (from roughly 50% to roughly 80%).
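
For concreteness, here is a small sketch of the aggregation rules compared above, given each sampled solution's extracted final answer and a verifier score such as the one produced by a `verify_score`-style function from the earlier sketch; tie-breaking and score normalization are my own assumptions.

```python
# Sketch of test-time aggregation over sampled solutions, each represented as
# (final_answer, verifier_score). Tie-breaking details are illustrative assumptions.
from collections import defaultdict
from typing import List, Tuple

def majority_vote(candidates: List[Tuple[str, float]]) -> str:
    """Pick the most frequent final answer, ignoring verifier scores."""
    counts = defaultdict(int)
    for answer, _ in candidates:
        counts[answer] += 1
    return max(counts, key=counts.get)

def weighted_vote(candidates: List[Tuple[str, float]]) -> str:
    """Sum verifier scores per distinct answer instead of raw counts."""
    weights = defaultdict(float)
    for answer, score in candidates:
        weights[answer] += score
    return max(weights, key=weights.get)

def best_of_n(candidates: List[Tuple[str, float]]) -> str:
    """Trust the single highest-scoring sample outright."""
    return max(candidates, key=lambda c: c[1])[0]

# Example: weighted voting can overturn a majority of low-confidence answers.
samples = [("42", 0.95), ("7", 0.30), ("7", 0.25), ("7", 0.20), ("42", 0.90)]
print(majority_vote(samples), weighted_vote(samples), best_of_n(samples))  # 7 42 42
```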

In this paper, the researchers introduced RLV, which integrates verification into “value-free” RL frameworks without significant computational overhead and shows improvements in reasoning accuracy, test-time compute efficiency, and cross-domain generalization across the MATH, MATH², GPQA, and AIME 24 datasets. Future research directions could explore enhancing the generative verifier to produce explicit CoT explanations, though this advance would require verification-specific CoT data or dedicated RL training processes. The unified framework for solution generation and verification through RL establishes a valuable foundation for continued progress in LLM reasoning capabilities.


Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.
