In reinforcement learning (RL) training of Large Language Models (LLMs), value-free methods such as GRPO and DAPO have shown notable effectiveness. The real potential, however, lies in value-based methods, which allow much more precise credit assignment by accurately tracing each action's effect on subsequent returns. This precision is crucial for complex reasoning, where subtle errors can lead to catastrophic failures. Yet training effective value models for long chain-of-thought (CoT) tasks faces several challenges: achieving low bias despite lengthy trajectories, managing the distinct preferences of short and long responses, and addressing the sparsity of the reward signal. Despite their theoretical advantages, these difficulties have hindered the full realization of value-based methods.
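To make the credit-assignment distinction concrete, here is a minimal sketch (not from the paper; function names, rewards, and value estimates are illustrative assumptions) contrasting a value-free, group-relative advantage, which assigns one trajectory-level signal to every token, with a value-based, per-token advantage derived from a learned value estimate.

```python
import numpy as np

def group_relative_advantage(rewards):
    # Value-free style (GRPO-like): every token in a response shares one
    # trajectory-level advantage, the response reward normalized within its group.
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def per_token_advantage(token_rewards, values, gamma=1.0):
    # Value-based style: a learned value estimate at each token lets the TD error
    # assign credit to individual steps of the chain of thought.
    values = np.append(values, 0.0)          # bootstrap value after the final token
    return token_rewards + gamma * values[1:] - values[:-1]

# Four sampled responses to one prompt with binary verifier rewards.
print(group_relative_advantage([1, 0, 0, 1]))
# A five-token response rewarded only at the end, with illustrative value estimates.
print(per_token_advantage(np.array([0, 0, 0, 0, 1.0]),
                          values=np.array([0.2, 0.3, 0.1, 0.4, 0.6])))
```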
Value-based reinforcement learning methods for LLMs face three significant challenges when applied to long chain-of-thought reasoning tasks. First, the value model bias issue identified in VC-PPO shows that initializing value models from reward models introduces positive bias. Second, heterogeneous sequence lengths in complex reasoning tasks create difficulties for standard approaches such as GAE with fixed parameters, which cannot effectively adapt to sequences ranging from very short to extremely long. Third, the sparsity of the reward signal becomes problematic in verifier-based tasks that provide binary feedback rather than continuous values. This sparsity is worsened by lengthy CoT responses, creating a difficult exploration-exploitation trade-off during optimization.
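For reference, a minimal sketch of standard GAE with fixed γ and λ (a generic textbook formulation, not code from the paper) shows the last two issues at once: the same λ is applied whether the response has ten tokens or ten thousand, and with a sparse, end-of-sequence verifier reward the signal reaching early tokens decays rapidly.

```python
import numpy as np

def gae(rewards, values, gamma=1.0, lam=0.95):
    # Generalized Advantage Estimation over a single token sequence, with a
    # fixed lambda regardless of how long that sequence is.
    T = len(rewards)
    values = np.append(values, 0.0)        # value estimate after the terminal token
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Sparse verifier reward: zeros everywhere, +1 only on the final token.
rewards = np.zeros(2048); rewards[-1] = 1.0
values = np.zeros(2048)                     # stand-in for an uninformative value model
adv = gae(rewards, values)
print(adv[-3:], adv[:3])                    # strong signal near the end, ~0 at the start
```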
Researchers from ByteDance Seed have proposed Value Augmented Proximal Policy Optimization (VAPO), a value-based RL training framework that addresses the challenges of long CoT reasoning tasks. VAPO introduces three key innovations: a refined value-based training framework with superior performance and efficiency, a length-adaptive GAE mechanism that adjusts the λ parameter based on response length to optimize advantage estimation, and a systematic integration of techniques from prior research. VAPO combines these components into a system where the collective improvements exceed what the individual enhancements could achieve independently. Using the Qwen2.5-32B model without SFT data, VAPO improves scores from 5 to 60, surpassing previous state-of-the-art methods by 10 points.
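The length-adaptive GAE mechanism can be sketched as a simple schedule for λ. The form below, λ = 1 − 1/(α·l) for a response of length l, follows the paper's description of length-adaptive GAE as I read it; the α value is an assumed placeholder rather than the paper's actual setting.

```python
def length_adaptive_lambda(seq_len, alpha=0.05):
    # Longer responses get lambda closer to 1, so the sparse end-of-sequence reward
    # decays less before reaching early tokens; shorter responses get a smaller
    # lambda and therefore lower-variance advantage estimates.
    return 1.0 - 1.0 / (alpha * seq_len)

for seq_len in (64, 512, 4096):
    print(seq_len, round(length_adaptive_lambda(seq_len), 4))
# Plug the result into the gae() sketch above:
#   gae(rewards, values, lam=length_adaptive_lambda(len(rewards)))
```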
VAPO is built upon the PPO algorithm with several key modifications that enhance mathematical reasoning capability. Analysis of the training dynamics reveals VAPO's superior characteristics compared to DAPO: smoother training curves indicating more stable optimization, better length scaling which enhances generalization, faster score growth due to the fine-grained signals provided by the value model, and lower entropy in later training stages. While reduced entropy could potentially limit exploration, the method balances this trade-off effectively, with minimal performance impact and improved reproducibility and stability. This shows how VAPO's design decisions directly address the core challenges of value-based RL in complex reasoning tasks.
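As a baseline for the modifications discussed next, here is the standard PPO clipped surrogate objective that VAPO builds on (a generic per-token formulation in PyTorch; the symmetric clip range ε = 0.2 is the vanilla PPO default, not a VAPO setting).

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    # Per-token importance ratio between the current and behavior policies.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Pessimistic (min) surrogate, negated so it can be minimized with gradient descent.
    return -torch.min(unclipped, clipped).mean()
```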
While DeepSeek R1 using GRPO achieves 47 points on AIME24 and DAPO reaches 50 points, VAPO matches DAPO's performance on Qwen-32B with just 60% of the update steps and achieves a new state-of-the-art score of 60.4 within only 5,000 steps. Vanilla PPO achieves only 5 points due to value model learning collapse, but VAPO ultimately reaches 60 points. Ablation studies validated the effectiveness of the seven proposed modifications: value pretraining prevents collapse, decoupled GAE enables full optimization of long-form responses, adaptive GAE balances the optimization of short and long responses, clip-higher encourages thorough exploration, the token-level loss increases the weighting of long responses, the positive-example LM loss adds 6 points, and group sampling contributes 5 points to the final performance.
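Two of these modifications are easy to illustrate in code. The sketch below is based on my reading of the descriptions above rather than the paper's implementation: it decouples the clip range so the upper bound ε_high exceeds the lower bound ε_low (clip-higher) and averages the loss over every valid token in the batch rather than per response (token-level loss). The ε values and tensor layout are assumptions.

```python
import torch

def clip_higher_token_loss(logp_new, logp_old, advantages, mask,
                           eps_low=0.2, eps_high=0.28):
    # Inputs are [batch, seq_len] tensors; mask is 1 for real response tokens, 0 for padding.
    ratio = torch.exp(logp_new - logp_old)
    surr_unclipped = ratio * advantages
    # Asymmetric clipping: a looser upper bound leaves more room to boost
    # low-probability tokens, which encourages exploration.
    surr_clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantages
    per_token = -torch.min(surr_unclipped, surr_clipped) * mask
    # Token-level aggregation: one flat mean over all valid tokens, so a 4,000-token
    # response carries 100x the weight of a 40-token one, instead of each response
    # contributing equally after a per-sequence mean.
    return per_token.sum() / mask.sum()
```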
In this paper, the researchers introduced VAPO, an algorithm that uses the Qwen2.5-32B model to achieve state-of-the-art performance on the AIME24 benchmark. By introducing seven innovative techniques on top of the PPO framework, VAPO significantly refines value learning and strikes an optimal balance between exploration and exploitation. This value-based approach decisively outperforms value-free methods such as GRPO and DAPO, establishing a new performance ceiling for reasoning tasks. It addresses the fundamental challenges of training value models for long CoT scenarios, providing a robust foundation for advancing LLMs in reasoning-intensive applications.
Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 85k+ ML SubReddit.
Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.