cs.CV

[1] Towards Blind and Low-Vision Accessibility of Lightweight VLMs and Custom LLM-Evals

Shruti Singh Baghel, Yash Pratap Singh Rathore, Sushovan Jena, Anurag Pradhan, Amit Shukla, Arnav Bhavsar, Pawan Goyal

🧩 TL;DR

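This study addresses the high resource demands that large vision-language models face in assistive applications for blind and low-vision users. It proposes dedicated accessibility evaluation frameworks and systematically evaluates SmolVLM2 models of different sizes on mobile devices.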


📘 Detailed Summary

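Motivation: Large vision-language models excel at generating video descriptions, but their high memory, compute, and deployment demands limit practical use, especially for blind and low-vision users who rely on detailed, context-aware descriptions. This motivates a study of how model size affects the quality of accessibility-focused descriptions.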

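Method: The study evaluates the 500M- and 2.2B-parameter SmolVLM2 variants on two diverse datasets and introduces two new frameworks designed for BLV accessibility evaluation: the Multi-Context BLV Framework, which assesses spatial orientation, social interaction, action events, and ambience; and the Navigational Assistance Framework, which focuses on mobility-critical information. It also systematically evaluates four prompt design strategies and deploys FP32 and INT8 precision variants on a smartphone.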
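To make the FP32 versus INT8 comparison concrete, here is a minimal sketch of what INT8 weight quantization does to storage and precision. The summary does not say which quantization scheme the authors used; symmetric per-tensor quantization is an assumption chosen for simplicity, and the function names `quantize_int8`, `dequantize`, and `weight_bytes` are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization (an assumed scheme, not the paper's):
    map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the INT8 codes."""
    return [v * scale for v in q]


def weight_bytes(num_params, bits):
    """Storage for the weights alone (ignores activations and runtime overhead)."""
    return num_params * bits // 8


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25, 0.0]
    q, scale = quantize_int8(weights)
    restored = dequantize(q, scale)
    # Each restored weight differs from the original by at most half a step.
    print(max(abs(a - b) for a, b in zip(weights, restored)) <= scale / 2 + 1e-9)

    # Parameter counts are the two model sizes named in the summary.
    for name, n in [("500M", 500_000_000), ("2.2B", 2_200_000_000)]:
        print(name, weight_bytes(n, 32), weight_bytes(n, 8))
```

On weights alone, INT8 needs a quarter of the FP32 storage (for example, 8.8 GB versus 2.2 GB at 2.2B parameters). Actual on-device memory use also depends on activations, the runtime, and which layers are quantized.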

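Result: Experiments were run on two datasets, AVCaps (outdoor) and Charades (indoor), and assessed the practical performance constraints of the different-sized models on resource-limited mobile devices, including key metrics such as computational efficiency, memory usage, and inference speed.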

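Conclusion: The study offers guidance for optimizing vision-language models for accessibility applications. It indicates that, with appropriate model compression and precision quantization, resource demands can be reduced substantially while preserving description quality, opening a feasible path to real-world deployment for BLV users.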


📄 Abstract

Large Vision-Language Models (VLMs) excel at understanding and generating video descriptions, but their high memory, computation, and deployment demands hinder practical use, particularly for blind and low-vision (BLV) users who depend on detailed, context-aware descriptions. To study the effect of model size on accessibility-focused description quality, we evaluate SmolVLM2 variants with 500M and 2.2B parameters across two diverse datasets: AVCaps (outdoor) and Charades (indoor). In this work, we introduce two novel evaluation frameworks specifically designed for BLV accessibility assessment: the Multi-Context BLV Framework, evaluating spatial orientation, social interaction, action events, and ambience contexts; and the Navigational Assistance Framework, focusing on mobility-critical information. Additionally, we conduct a systematic evaluation of four different prompt design strategies and deploy both models on a smartphone, evaluating FP32 and INT8 precision variants to assess real-world performance constraints on resource-limited mobile devices.

[2] Enhancing the Outcome Reward-based RL Training of MLLMs with Self-Consistency Sampling

Jiahao Wang, Weiye Xu, Aijun Yang, Wengang Zhou, Lewei Lu, Houqiang Li, Xiaohua Wang, Jinguo Zhu

🧩 TL;DR

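This paper proposes Self-Consistency Sampling (SCS), which uses visual perturbations and trajectory resampling to correct the unfaithful-trajectory problem in outcome-reward reinforcement learning for multimodal large language models, substantially improving reasoning accuracy on multiple benchmarks.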


📘 Detailed Summary

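Motivation: On multimodal reasoning benchmarks, outcome-reward RL faces an overlooked but critical problem: unfaithful trajectories that reach the correct answer by guessing despite a flawed reasoning chain receive the same reward as trajectories that reason correctly. This misassignment of reward seriously degrades what the model learns.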

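Method: SCS has two core steps: it first introduces small visual perturbations to the input image, then repeatedly truncates and resamples the initial reasoning trajectory. A consistency score computed from the agreement among the resulting trajectories is used to down-weight unreliable trajectories during policy updates.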
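The down-weighting idea can be illustrated with a small sketch. This is not the authors' formulation: the summary gives neither the exact form of the consistency score nor how it enters the policy update, and the abstract describes the score as differentiable, which this plain agreement fraction is not. The functions `consistency_score` and `weighted_reward` are hypothetical, and the perturbation and resampling steps are assumed to have already produced a list of final answers.

```python
def consistency_score(initial_answer, resampled_answers):
    """Fraction of resampled trajectories whose final answer agrees with the
    initial trajectory's answer (an assumed, simplified agreement measure)."""
    if not resampled_answers:
        return 1.0  # nothing to compare against, so leave the reward unchanged
    agree = sum(1 for a in resampled_answers if a == initial_answer)
    return agree / len(resampled_answers)


def weighted_reward(outcome_reward, initial_answer, resampled_answers):
    """Scale the outcome reward by the consistency score, so a correct answer
    that does not survive perturbation and resampling earns less credit."""
    return outcome_reward * consistency_score(initial_answer, resampled_answers)


if __name__ == "__main__":
    # A stable trajectory: the resampled continuations keep choosing "B".
    print(weighted_reward(1.0, "B", ["B", "B", "B", "B"]))  # 1.0
    # A lucky guess: the answer changes under perturbation and resampling.
    print(weighted_reward(1.0, "B", ["B", "C", "A", "D"]))  # 0.25
```

Both trajectories would receive a reward of 1.0 under a plain outcome reward; the agreement weighting is what separates them.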

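Result: With Qwen2.5-VL-7B-Instruct, integrating SCS into the RLOO, GRPO, and REINFORCE++ series of methods improves accuracy by up to 7.7 percentage points on six multimodal benchmarks with negligible computational overhead. The method also yields notable gains on Qwen2.5-VL-3B-Instruct and InternVL3-8B.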

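Conclusion: SCS provides a simple, general remedy for outcome-reward RL in multimodal large language models. It effectively identifies and penalizes unfaithful reasoning trajectories, markedly improving reasoning reliability and accuracy, and is broadly applicable.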


📄 Abstract

Outcome-reward reinforcement learning (RL) is a common and increasingly significant way to refine the step-by-step reasoning of multimodal large language models (MLLMs). In the multiple-choice setting - a dominant format for multimodal reasoning benchmarks - the paradigm faces a significant yet often overlooked obstacle: unfaithful trajectories that guess the correct option after a faulty chain of thought receive the same reward as genuine reasoning, which is a flaw that cannot be ignored. We propose Self-Consistency Sampling (SCS) to correct this issue. For each question, SCS (i) introduces small visual perturbations and (ii) performs repeated truncation and resampling of an initial trajectory; agreement among the resulting trajectories yields a differentiable consistency score that down-weights unreliable traces during policy updates. Based on Qwen2.5-VL-7B-Instruct, plugging SCS into RLOO, GRPO, and REINFORCE++ series improves accuracy by up to 7.7 percentage points on six multimodal benchmarks with negligible extra computation. SCS also yields notable gains on both Qwen2.5-VL-3B-Instruct and InternVL3-8B, offering a simple, general remedy for outcome-reward RL in MLLMs.