Advancing Vision-Language Reward Models: Challenges, Benchmarks, and the Role of Process-Supervised Learning


Process-supervised reward models (PRMs) offer fine-grained, step-wise feedback on model responses, helping to select effective reasoning paths for complex tasks. Unlike outcome reward models (ORMs), which score responses based only on final outputs, PRMs provide detailed assessments at each step, making them particularly valuable for reasoning-intensive applications. While PRMs have been extensively studied in language tasks, their application in multimodal settings remains largely unexplored. Most vision-language reward models still rely on the ORM approach, highlighting the need for further research into how PRMs can enhance multimodal learning and reasoning.
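The ORM/PRM distinction above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the scoring heuristics and the min-aggregation rule are placeholder assumptions standing in for learned models.

```python
# Toy contrast between outcome and process reward scoring.
# All scoring rules here are hypothetical placeholders.
from typing import List

def orm_score(steps: List[str], final_answer: str) -> float:
    """ORM: one scalar judging only the final output."""
    # Placeholder heuristic standing in for a learned reward model.
    return 1.0 if final_answer.strip() == "42" else 0.0

def prm_scores(steps: List[str]) -> List[float]:
    """PRM: one score per reasoning step."""
    # Placeholder heuristic: reward steps that show explicit numbers.
    return [1.0 if any(ch.isdigit() for ch in s) else 0.5 for s in steps]

def prm_aggregate(step_scores: List[float]) -> float:
    """One common way to rank candidates: take the minimum step score,
    so a single bad step sinks the whole chain."""
    return min(step_scores)

steps = ["Compute 6 * 7.", "6 * 7 = 42.", "So the answer is 42."]
print(orm_score(steps, "42"))            # single outcome judgment
print(prm_aggregate(prm_scores(steps)))  # step-wise judgment
```

The min-aggregation choice reflects the intuition behind process supervision: a chain is only as trustworthy as its weakest step, which an outcome-only score cannot see.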

Existing reward benchmarks mainly focus on text-based models, with some designed specifically for PRMs. In the vision-language domain, evaluation methods generally assess broad model capabilities, including knowledge, reasoning, fairness, and safety. VL-RewardBench is the first benchmark incorporating reinforcement learning preference data to refine knowledge-intensive vision-language tasks. Additionally, Multimodal RewardBench expands evaluation criteria beyond standard visual question answering (VQA) tasks, covering six key areas (correctness, preference, knowledge, reasoning, safety, and VQA) through expert annotations. These benchmarks provide a foundation for developing more effective reward models for multimodal learning.

Researchers from UC Santa Cruz, UT Dallas, and Amazon Research benchmarked VLLMs as ORMs and PRMs across multiple tasks, revealing that neither consistently outperforms the other. To address evaluation gaps, they introduced VILBENCH, a benchmark requiring step-wise reward feedback, on which GPT-4o with Chain-of-Thought achieved only 27.3% accuracy. Additionally, they collected 73.6K vision-language reward samples using an enhanced tree-search algorithm and trained a 3B PRM that improved evaluation accuracy by 3.3%. Their study provides insights into vision-language reward modeling and highlights the challenges of multimodal step-wise evaluation.
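The paper's enhanced tree-search procedure is not detailed here, but a common recipe it builds on for collecting step-level reward labels is Monte Carlo rollouts: from each reasoning prefix, sample several completions and label the step by the fraction that reach a correct answer. The sketch below assumes hypothetical `generate_completion` and `is_correct` callables; it is not the authors' algorithm.

```python
# Hedged sketch of Monte Carlo step labeling for PRM training data.
# `generate_completion` and `is_correct` are hypothetical stand-ins.
import random
from typing import Callable, List

def label_steps(
    steps: List[str],
    generate_completion: Callable[[List[str]], str],
    is_correct: Callable[[str], bool],
    n_rollouts: int = 8,
) -> List[float]:
    """Estimate each step's value as the fraction of sampled completions
    from that prefix that reach a correct final answer."""
    labels = []
    for k in range(1, len(steps) + 1):
        prefix = steps[:k]
        hits = sum(is_correct(generate_completion(prefix)) for _ in range(n_rollouts))
        labels.append(hits / n_rollouts)
    return labels

# Toy demo with a stub "model" that succeeds more often given longer prefixes.
random.seed(0)
steps = ["step 1", "step 2", "step 3"]
demo = label_steps(
    steps,
    generate_completion=lambda p: "42" if random.random() < len(p) / 3 else "wrong",
    is_correct=lambda ans: ans == "42",
)
print(demo)
```

A tree search over prefixes amortizes these rollouts by sharing them across branches, which is presumably what makes collecting 73.6K samples tractable.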

VLLMs are increasingly effective across various tasks, particularly when evaluated for test-time scaling. Seven models were benchmarked using the LLM-as-a-judge approach to analyze their step-wise critique abilities on five vision-language datasets. A Best-of-N (BoN) setting was used, where VLLMs scored responses generated by GPT-4o. Key findings reveal that ORMs outperform PRMs in most cases except for real-world tasks. Additionally, stronger VLLMs do not always excel as reward models, and a hybrid approach between ORM and PRM is optimal. Moreover, VLLMs benefit from text-heavy tasks more than visual ones, underscoring the need for specialized vision-language reward models.
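The Best-of-N setting described above is simple to state in code: a reward model scores N candidate responses and the highest-scoring one is kept. The scoring lambda below is a made-up placeholder, not the paper's judge.

```python
# Minimal Best-of-N (BoN) selection sketch.
# The reward function here is a hypothetical placeholder.
from typing import Callable, List

def best_of_n(candidates: List[str], score: Callable[[str], float]) -> str:
    """Return the candidate with the highest reward-model score."""
    return max(candidates, key=score)

candidates = [
    "The answer is 41.",
    "6 * 7 = 42, so the answer is 42.",
    "I am not sure.",
]
# Placeholder reward: prefer responses that show their arithmetic.
chosen = best_of_n(candidates, score=lambda r: r.count("=") + ("42" in r))
print(chosen)  # -> "6 * 7 = 42, so the answer is 42."
```

In the benchmark, the `score` callable is a VLLM judge (ORM-style, one score per response, or PRM-style, aggregated over per-step scores), which is exactly where the ORM/PRM comparison bites.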

To evaluate ViLPRM's effectiveness, experiments were conducted on VILBENCH using different RMs and solution samplers. The study compared performance across multiple VLLMs, including Qwen2.5-VL-3B, InternVL-2.5-8B, GPT-4o, and o1. Results show that PRMs generally outperform ORMs, improving accuracy by 1.4%, though o1's responses showed minimal difference due to limited detail. ViLPRM surpassed other PRMs, including URSA, by 0.9%, demonstrating superior consistency in response selection. Additionally, the findings suggest that existing VLLMs are not robust enough as reward models, highlighting the need for specialized vision-language PRMs that perform well beyond math reasoning tasks.

In conclusion, vision-language PRMs perform well when reasoning steps are clearly segmented, as seen in structured tasks like math. However, in tasks with unclear step divisions, PRMs can reduce accuracy, particularly in visual-dominant cases. Prioritizing key steps rather than weighting every step equally improves performance. Additionally, current multimodal reward models struggle with generalization, as PRMs trained on specific domains often fail in others. Enhancing training by incorporating diverse data sources and adaptive reward mechanisms is crucial. The introduction of ViLReward-73K improves PRM accuracy by 3.3%, but further advances in step segmentation and evaluation frameworks are needed for robust multimodal models.
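The "prioritize key steps" idea above can be illustrated with an importance-weighted aggregation of per-step scores. Both the scores and the weights below are made-up placeholders; the paper's actual weighting scheme is not reproduced here.

```python
# Hedged illustration: importance-weighted vs uniform step aggregation.
# Scores and weights are hypothetical.
from typing import List

def weighted_step_reward(scores: List[float], weights: List[float]) -> float:
    """Importance-weighted mean of per-step PRM scores."""
    total = sum(weights)
    return sum(s * w for s, w in zip(scores, weights)) / total

scores  = [0.9, 0.2, 0.9]   # middle step is weak
uniform = weighted_step_reward(scores, [1.0, 1.0, 1.0])
keyed   = weighted_step_reward(scores, [0.5, 3.0, 0.5])  # middle step is decisive
print(round(uniform, 3), round(keyed, 3))  # -> 0.667 0.375
```

Up-weighting the decisive step makes the aggregate reflect the chain's real weakness, which uniform averaging smooths over.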


Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 85k+ ML SubReddit.


Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
