Table of Contents

cs.CV

[1] Look Twice: Training-Free Evidence Highlighting in Multimodal Large Language Models

Marco Morini, Sara Sarto, Marcella Cornia, Lorenzo Baraldi

🧩 TL;DR

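This paper proposes Look Twice (LoT), a training-free inference-time framework. LoT uses the model's attention patterns to estimate which visual regions and retrieved text elements are relevant to the query, then highlights them with lightweight prompt-level markers so that the model re-attends to the relevant evidence. This improves pretrained multimodal large language models on knowledge-intensive visual question answering.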


📘 Detailed Summary

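Motivation: When answering knowledge-intensive queries, multimodal large language models often struggle to identify the most relevant visual and textual evidence. They must integrate visual cues with retrieved text that is often noisy or only partially relevant, while also localizing fine-grained visual information in the image.

Method: LoT uses the model's attention patterns to estimate which visual regions and retrieved text elements are relevant to the query. It highlights the selected cues with lightweight prompt-level markers, encouraging the model to re-attend to the relevant evidence during generation. The whole procedure requires no additional training or architectural modification.

Result: Across multiple knowledge-based visual question answering benchmarks, LoT consistently improves over zero-shot multimodal large language models. Additional evaluations on vision-centric and hallucination-oriented benchmarks show that visual evidence highlighting alone improves performance in settings without textual context.

Conclusion: The study shows that inference-time attention guidance can improve how multimodal models use relevant evidence, offering a training-free route to better multimodal performance. It also shows that lightweight prompt markers are effective at directing the model toward relevant visual and textual cues.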


📄 Abstract

Answering questions about images often requires combining visual understanding with external knowledge. Multimodal Large Language Models (MLLMs) provide a natural framework for this setting, but they often struggle to identify the most relevant visual and textual evidence when answering knowledge-intensive queries. In such scenarios, models must integrate visual cues with retrieved textual evidence that is often noisy or only partially relevant, while also localizing fine-grained visual information in the image. In this work, we introduce Look Twice (LoT), a training-free inference-time framework that improves how pretrained MLLMs utilize multimodal evidence. Specifically, we exploit the model's attention patterns to estimate which visual regions and retrieved textual elements are relevant to a query, and then generate the answer conditioned on this highlighted evidence. The selected cues are highlighted through lightweight prompt-level markers that encourage the model to re-attend to the relevant evidence during generation. Experiments across multiple knowledge-based VQA benchmarks show consistent improvements over zero-shot MLLMs. Additional evaluations on vision-centric and hallucination-oriented benchmarks further demonstrate that visual evidence highlighting alone improves model performance in settings without textual context, all without additional training or architectural modifications. Source code will be publicly released.
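The prompt-level highlighting step can be illustrated with a small sketch. It assumes that a relevance score per retrieved sentence (standing in for query-conditioned attention mass) is already available; the abstract does not say how LoT aggregates attention. The function name `highlight_evidence`, the `<evidence>` marker strings, and the top-k selection rule are hypothetical choices for illustration, not the authors' implementation.

```python
from typing import List, Sequence


def highlight_evidence(
    sentences: Sequence[str],
    attention_scores: Sequence[float],
    top_k: int = 2,
    open_marker: str = "<evidence>",    # hypothetical marker string
    close_marker: str = "</evidence>",  # hypothetical marker string
) -> str:
    """Wrap the top_k highest-scoring sentences in marker strings.

    The scores are assumed inputs; the top-k rule is an illustrative
    selection criterion, not one stated in the abstract.
    """
    if len(sentences) != len(attention_scores):
        raise ValueError("need exactly one score per sentence")
    # Rank sentence indices by score, highest first, and keep the top_k.
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: attention_scores[i],
        reverse=True,
    )
    selected = set(ranked[:top_k])
    # Rebuild the context in its original order, marking selected sentences.
    parts: List[str] = []
    for i, sentence in enumerate(sentences):
        if i in selected:
            parts.append(f"{open_marker}{sentence}{close_marker}")
        else:
            parts.append(sentence)
    return " ".join(parts)
```

The marked-up context would then be placed in the prompt for a second generation pass, leaving the model weights untouched.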

[2] HieraVid: Hierarchical Token Pruning for Fast Video Large Language Models

Yansong Guo, Chaoyang Zhu, Jiayi Ji, Jianghang Lin, Liujuan Cao

🧩 TL;DR

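This paper proposes HieraVid, a hierarchical pruning framework for video large language models that progressively and dynamically reduces visual redundancy. With only 30% of tokens retained, it achieves new state-of-the-art performance while preserving more than 98% of the original model's performance.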


📘 Detailed Summary

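Motivation: Video large language models carry a heavy computational burden because of the large number of input video tokens. Current methods mainly prune tokens at the input level and ignore the information structure inherent in videos and in large language models, which limits both pruning efficiency and how well model performance is preserved.

Method: Based on the observations that videos have a segment-frame structure and that LLMs propagate multimodal information unidirectionally, HieraVid prunes at three levels. At the segment level, video tokens are first temporally segmented and spatially merged. At the frame level, similar frames within the same segment are jointly pruned to preserve diversity. At the layer level, redundancy is progressively reduced as LLM depth increases, without compromising performance.

Result: In a comprehensive evaluation on four widely used video understanding benchmarks, HieraVid achieves new state-of-the-art performance with only 30% of tokens retained, while maintaining over 98% of the performance of LLaVA-Video-7B and over 99% of the performance of LLaVA-OneVision-7B.

Conclusion: The study shows that hierarchical pruning that exploits the inherent structure of videos and the internal information flow of LLMs can substantially reduce computational cost without sacrificing performance. It offers an effective path to deploying efficient video large language models and demonstrates the potential of structured pruning for video understanding.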


📄 Abstract

Video Large Language Models (VideoLLMs) have demonstrated impressive capabilities in video understanding, yet the massive number of input video tokens incurs a significant computational burden for deployment. Existing methods mainly prune video tokens at the input level while neglecting the inherent information structure embedded in videos and large language models (LLMs). To address this, we propose HieraVid, a hierarchical pruning framework that progressively and dynamically reduces visual redundancy. Based on two observations, namely that videos possess a segment-frame structure and that LLMs internally propagate multi-modal information unidirectionally, we decompose pruning into three levels: 1) segment-level, where video tokens are first temporally segmented and spatially merged; 2) frame-level, where similar frames within the same segment are jointly pruned to preserve diversity; 3) layer-level, where redundancy gradually shrinks as LLM depth increases without compromising performance. We conduct extensive experiments on four widely used video understanding benchmarks to comprehensively evaluate the effectiveness of HieraVid. Remarkably, with only 30% of tokens retained, HieraVid achieves new state-of-the-art performance, while maintaining over 98% and 99% of the performance of LLaVA-Video-7B and LLaVA-OneVision-7B, respectively.
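As a rough illustration of the frame-level idea (dropping frames that are near-duplicates of ones already kept within a segment), the sketch below uses greedy cosine-similarity filtering. The abstract does not describe the actual pruning criterion, so the greedy rule, the function name `prune_similar_frames`, and the `threshold` value are all assumptions.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def prune_similar_frames(
    frame_features: Sequence[Sequence[float]],
    threshold: float = 0.9,  # assumed similarity cutoff, not from the paper
) -> List[int]:
    """Return indices of frames kept within one segment.

    A frame is kept only if it is less similar than `threshold` to every
    frame already kept, so near-duplicates are dropped and diverse frames
    survive. This greedy rule is an illustrative stand-in.
    """
    kept: List[int] = []
    for idx, feature in enumerate(frame_features):
        if all(
            cosine_similarity(feature, frame_features[k]) < threshold
            for k in kept
        ):
            kept.append(idx)
    return kept
```

Each frame here is represented by a single pooled feature vector; a token-level variant would apply the same comparison per spatial position.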

[3] Reinforcing Consistency in Video MLLMs with Structured Rewards

Yihao Quan, Zeru Shi, Jinman Zhao, Ruixiang Tang

🧩 TL;DR

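This paper addresses the weak visual and temporal grounding of multimodal large language models in video understanding. A compositional consistency audit exposes factual and temporal errors in model outputs, and a structured reward with three components (a scene-graph reward, a temporal reward, and a video question answering reward) improves the faithfulness of video understanding.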


📘 Detailed Summary

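Motivation: In video understanding, multimodal large language models often produce outputs that look plausible but lack reliable visual and temporal grounding: they fabricate object existence, assign incorrect attributes, or collapse repeated events. Standard sentence-level supervision cannot detect these fine-grained errors, and sentence-level rewards in existing reinforcement learning setups are too coarse to localize specific grounding failures.

Method: The study uses a top-down compositional consistency audit that decomposes a video caption into its supporting factual and temporal claims, and replaces generic sentence-level rewards with a structured reward. The training objective has three complementary components: an instance-aware scene-graph reward for factual objects, attributes, and relations; a temporal reward for event ordering and repetition; and a video-grounded question answering reward for hierarchical self-verification.

Result: Experiments show that even correct root relational claims often lack reliable attribute and existence support. The proposed structured reward yields consistent gains on open-source backbones across temporal, general video understanding, and hallucination-oriented benchmarks, supporting its effectiveness for improving the faithfulness of video understanding.

Conclusion: Structured reward shaping is a practical route to more faithful video understanding. The compositional consistency audit exposes the limits of sentence-level supervision as a proxy for faithful video understanding, while fine-grained reward design localizes and corrects grounding failures more accurately, offering a new training paradigm for reliable video understanding in multimodal large language models.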


📄 Abstract

Multimodal large language models (MLLMs) have achieved remarkable progress in video understanding. However, seemingly plausible outputs often suffer from poor visual and temporal grounding: a model may fabricate object existence, assign incorrect attributes, or collapse repeated events while still producing a globally reasonable caption or answer. We study this failure mode through a compositional consistency audit that decomposes a caption into supporting factual and temporal claims, investigating whether a correct high-level prediction is actually backed by valid lower-level evidence. Our top-down audit reveals that even correct root relational claims often lack reliable attribute and existence support. This indicates that standard sentence-level supervision is a weak proxy for faithful video understanding. Furthermore, when turning to reinforcement learning (RL) for better alignment, standard sentence-level rewards often prove too coarse to accurately localize specific grounding failures. To address this, we replace generic sentence-level rewards with a structured reward built from factual and temporal units. Our training objective integrates three complementary components: (1) an instance-aware scene-graph reward for factual objects, attributes, and relations; (2) a temporal reward for event ordering and repetition; and (3) a video-grounded VQA reward for hierarchical self-verification. Across temporal, general video understanding, and hallucination-oriented benchmarks, this objective yields consistent gains on open-source backbones. These results suggest that structured reward shaping is a practical route to more faithful video understanding.
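To make the three-component objective concrete, here is a hedged sketch of how such a reward could be assembled. The abstract does not define the individual terms, so everything below is an assumption: the scene-graph term is approximated by F1 over (subject, relation, object) triples, the temporal term by pairwise order agreement over distinct event labels (repetition is not modeled), and the components are combined by a weighted sum with equal default weights.

```python
from itertools import combinations
from typing import Sequence, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object); assumed format


def triple_f1(predicted: Set[Triple], reference: Set[Triple]) -> float:
    """F1 overlap between predicted and reference scene-graph triples."""
    overlap = len(predicted & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def order_agreement(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of reference event pairs whose order the prediction preserves.

    Assumes event labels are distinct; a missing event counts as a miss.
    """
    position = {event: i for i, event in enumerate(predicted)}
    pairs = list(combinations(reference, 2))
    if not pairs:
        return 0.0
    correct = sum(
        1
        for first, second in pairs
        if first in position
        and second in position
        and position[first] < position[second]
    )
    return correct / len(pairs)


def structured_reward(
    scene_graph_r: float,
    temporal_r: float,
    vqa_r: float,
    weights: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3),  # assumed
) -> float:
    """Weighted sum of the three component rewards (illustrative combination)."""
    w_sg, w_t, w_vqa = weights
    return w_sg * scene_graph_r + w_t * temporal_r + w_vqa * vqa_r
```

The point of the decomposition is that each term can flag a different failure: a caption can score well on event order while still receiving a low triple F1 for a hallucinated object.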

[4] VideoZeroBench: Probing the Limits of Video MLLMs with Spatio-Temporal Evidence Verification

Jiahao Meng, Tan Yue, Qi Xu, Haochen Wang, Zhongwei Ren, Weisong Liu, Yuhao Wang, Renrui Zhang, Yunhai Tong, Haodong Duan

🧩 TL;DR

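This paper presents VideoZeroBench, a hierarchical benchmark for long-video question answering that rigorously verifies spatio-temporal evidence. It reveals substantial weaknesses in current video multimodal large language models in fine-grained visual understanding and evidence-based reasoning.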


📘 Detailed Summary

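Motivation: Current evaluation of video multimodal large language models has two key limitations. First, inflated scores mask deficiencies in fine-grained visual understanding and reasoning. Second, answer correctness is usually measured without verifying whether the model identified the precise spatio-temporal evidence supporting its prediction. The result is a gap between surface-level answer correctness and genuine evidence-based reasoning.

Method: The study presents VideoZeroBench, a hierarchical benchmark for challenging long-video question answering. It contains 500 manually annotated questions across 13 domains, each paired with temporal intervals and spatial bounding boxes as evidence. To disentangle answer generation, temporal grounding, and spatial grounding, the authors introduce a five-level evaluation protocol that progressively tightens evidence requirements, from standard end-to-end question answering to strict evidence verification that requires precise spatio-temporal localization.

Result: Even under the standard end-to-end question answering setting, Gemini-3-Pro correctly answers fewer than 17% of questions. Performance drops sharply once grounding constraints are imposed: at the strictest level, which requires both a correct answer and accurate spatio-temporal localization, no model exceeds 1% accuracy, and most models fail to produce any correct grounded prediction. This exposes a significant gap between surface-level answer correctness and genuine evidence-based reasoning.

Conclusion: The study shows that current video multimodal large language models fall well short in evidence-based reasoning, and that precise localization of spatio-temporal evidence remains a bottleneck for long-video question answering. The benchmark offers insights for future research on grounded video reasoning and underlines the need for models that can genuinely understand and localize spatio-temporal evidence in videos.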


📄 Abstract

Recent video multimodal large language models achieve impressive results across various benchmarks. However, current evaluations suffer from two critical limitations: (1) inflated scores can mask deficiencies in fine-grained visual understanding and reasoning, and (2) answer correctness is often measured without verifying whether models identify the precise spatio-temporal evidence supporting their predictions. To address this, we present VideoZeroBench, a hierarchical benchmark designed for challenging long-video question answering that rigorously verifies spatio-temporal evidence. It comprises 500 manually annotated questions across 13 domains, paired with temporal intervals and spatial bounding boxes as evidence. To disentangle answer generation, temporal grounding, and spatial grounding, we introduce a five-level evaluation protocol that progressively tightens evidence requirements. Experiments show that even Gemini-3-Pro correctly answers fewer than 17% of questions under the standard end-to-end QA setting (Level-3). When grounding constraints are imposed, performance drops sharply: no model exceeds 1% accuracy when both a correct answer and accurate spatio-temporal localization are required (Level-5), with most failing to achieve any correct grounded predictions. These results expose a significant gap between surface-level answer correctness and genuine evidence-based reasoning, revealing that grounded video understanding remains a bottleneck for long-video QA. We further analyze performance across minimal evidence spans, atomic abilities, and inference paradigms, providing insights for future research in grounded video reasoning. The benchmark and code will be made publicly available.
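A strict evidence check of the kind described for the hardest level can be sketched with standard intersection-over-union (IoU) measures for temporal intervals and bounding boxes. The abstract does not state the matching criteria, so the 0.5 thresholds and the function name `grounded_correct` are assumptions.

```python
from typing import Tuple

Interval = Tuple[float, float]           # (start_sec, end_sec)
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def temporal_iou(pred: Interval, gt: Interval) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def box_iou(pred: Box, gt: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    inter_h = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = inter_w * inter_h
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter
    return inter / union if union > 0 else 0.0


def grounded_correct(
    answer_ok: bool,
    t_iou: float,
    s_iou: float,
    t_threshold: float = 0.5,  # assumed threshold
    s_threshold: float = 0.5,  # assumed threshold
) -> bool:
    """Count a prediction only if the answer and both groundings pass."""
    return answer_ok and t_iou >= t_threshold and s_iou >= s_threshold
```

Because all three conditions must hold at once, accuracy under this check can only be equal to or lower than plain answer accuracy, which is consistent with the sharp drop the abstract reports.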

[5] Ego-Grounding for Personalized Question-Answering in Egocentric Videos

Junbin Xiao, Shenglang Zhang, Pengxiang Zhu, Angela Yao

🧩 TL;DR

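This paper introduces MyEgo, the first benchmark dataset for evaluating the personalized question answering ability of multimodal large language models on egocentric videos. It reveals substantial weaknesses in existing models' ego-grounding and long-range memory.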


📘 Detailed Summary

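Motivation: Prior work lacks a systematic evaluation of multimodal large language models on personalized question answering in egocentric videos. In particular, it does not assess ego-grounding: the ability to understand who the camera wearer is, remember their past experiences, and reason about them. This gap holds back models for personalized assistance.

Method: The study introduces the MyEgo dataset, which contains 541 long videos and 5K personalized questions covering three categories: "my things", "my activities", and "my past". It systematically evaluates a range of MLLM variants (open-source and proprietary, thinking and non-thinking, small and large) and uses controlled experiments that supply the relevant evidence to analyze where model ability breaks down.

Result: Top closed-source models (e.g., GPT-5) and open-source models (e.g., Qwen3-VL) reach only about 46% and 36% accuracy on MyEgo, trailing human performance by nearly 40% and 50%, respectively. Neither explicit reasoning nor model scaling yields consistent improvements. Models improve when relevant evidence is provided, but the gains drop over time, indicating fundamental limits in tracking and remembering "me" and "my past".

Conclusion: The study highlights the crucial role of ego-grounding and long-range memory in personalized question answering on egocentric videos, and shows that existing MLLMs are clearly lacking in both. MyEgo provides an evaluation tool for future research and should help advance egocentric AI systems for personalized assistance.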


📄 Abstract

We present the first systematic analysis of multimodal large language models (MLLMs) in personalized question-answering requiring ego-grounding - the ability to understand the camera-wearer in egocentric videos. To this end, we introduce MyEgo, the first egocentric VideoQA dataset designed to evaluate MLLMs' ability to understand, remember, and reason about the camera wearer. MyEgo comprises 541 long videos and 5K personalized questions asking about "my things", "my activities", and "my past". Benchmarking reveals that competitive MLLMs across variants, including open-source vs. proprietary, thinking vs. non-thinking, and small vs. large scales, all struggle on MyEgo. Top closed- and open-source models (e.g., GPT-5 and Qwen3-VL) achieve only about 46% and 36% accuracy, trailing human performance by nearly 40% and 50%, respectively. Surprisingly, neither explicit reasoning nor model scaling yields consistent improvements. Models improve when relevant evidence is explicitly provided, but gains drop over time, indicating limitations in tracking and remembering "me" and "my past". These findings collectively highlight the crucial role of ego-grounding and long-range memory in enabling personalized QA in egocentric videos. We hope MyEgo and our analyses catalyze further progress in these areas for egocentric personalized assistance. Data and code are available at https://github.com/Ryougetsu3606/MyEgo

[6] Hidden Meanings in Plain Sight: RebusBench for Evaluating Cognitive Visual Reasoning

Seyed Amir Kasaei, Arash Marioriyad, Mahbod Khaleti, MohammadAmin Fazli, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban

🧩 TL;DR

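This study shows that large vision-language models are severely deficient at solving puzzles that require neurosymbolic reasoning, such as rebus puzzles, and introduces the RebusBench benchmark to evaluate how well models integrate perception and knowledge. Even state-of-the-art models score below 10% exact-match accuracy on rebus solving, indicating that they lack the cognitive reasoning needed to connect their visual and linguistic components.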


📘 Detailed Summary

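Motivation: Current large vision-language models excel at explicit visual recognition but show a serious cognitive gap on problems that need complex multi-step reasoning, especially when the visual input is only a clue rather than the answer. The study examines how models fall short on rebus-style problems, which require combining visual attribute extraction, retrieval of linguistic prior knowledge, and abstract mapping to synthesize a meaning that lies outside the pixel space.

Method: The study introduces RebusBench, a benchmark of 1,164 purpose-built rebus puzzles for evaluating the neurosymbolic ability to integrate perception and knowledge. Solving a puzzle requires three cognitive steps: extracting visual and textual attributes, retrieving linguistic prior knowledge (such as idioms), and performing abstract mapping to synthesize a meaning outside the pixel space. The study evaluates state-of-the-art models including Qwen, InternVL, and LLaVA, and tests the effect of model scaling and in-context learning.

Result: State-of-the-art models perform poorly on RebusBench: exact-match accuracy stays below 10% and semantic accuracy below 20%. Neither model scaling nor in-context learning brings a significant improvement, suggesting a fundamental limitation of current architectures on problems that require neurosymbolic reasoning. None of the tested models effectively integrates visual perception with linguistic knowledge to solve the puzzles.

Conclusion: Large vision-language models have the necessary visual and linguistic components but lack the cognitive reasoning that connects them. This finding exposes a fundamental limitation in neurosymbolic reasoning and points future research toward architectures that integrate perception with abstract reasoning. RebusBench provides a tool for evaluating this ability.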


📄 Abstract

Large Vision-Language Models (LVLMs) have achieved remarkable proficiency in explicit visual recognition, effectively describing what is directly visible in an image. However, a critical cognitive gap emerges when the visual input serves only as a clue rather than the answer. We identify that current models struggle with the complex, multi-step reasoning required to solve problems where information is not explicitly depicted. Successfully solving a rebus puzzle requires a distinct cognitive workflow: the model must extract visual and textual attributes, retrieve linguistic prior knowledge (such as idioms), and perform abstract mapping to synthesize these elements into a meaning that exists outside the pixel space. To evaluate this neurosymbolic capability, we introduce RebusBench, a benchmark of 1,164 puzzles designed to test this specific integration of perception and knowledge. Our evaluation of state-of-the-art models (including Qwen, InternVL, and LLaVA) shows a severe deficiency: performance saturates below 10% Exact Match and 20% semantic accuracy, with no significant improvement observed from model scaling or In-Context Learning (ICL). These findings suggest that while models possess the necessary visual and linguistic components, they lack the cognitive reasoning glue to connect them. Project page available at https://amirkasaei.com/rebusbench/.

[7] Attention at Rest Stays at Rest: Breaking Visual Inertia for Cognitive Hallucination Mitigation

Boyang Gong, Yu Zheng, Fanye Kong, Jie Zhou, Jiwen Lu

🧩 TL;DR

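This paper proposes Inertia-aware Visual Excitation (IVE), a training-free method that addresses the inertia of visual attention in multimodal large language models. By breaking the inertial attention pattern, IVE mitigates cognitive hallucinations and improves compositional reasoning across multiple benchmarks.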


📘 Detailed Summary

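Motivation: Visual attention in multimodal large language models shows pronounced inertia: once it settles during early decoding steps it stays largely static, and so cannot support the compositional understanding that cognitive inference requires. Existing hallucination mitigation methods mainly target perceptual hallucinations about object existence or attributes and remain inadequate for cognitive hallucinations, which require reasoning about relations between objects. This attention inertia therefore needs to be addressed.

Method: The paper proposes the training-free Inertia-aware Visual Excitation (IVE) method, which breaks the inertial pattern by modeling cognitive inference as the dynamic responsiveness of visual attention. IVE selects visual tokens that are dynamically emerging relative to historical attention trends, while distinguishing tokens that behave inertially. To further support compositional inference, IVE adds an inertia-aware penalty that discourages over-concentration of attention and limits its persistence within localized regions.

Result: Extensive experiments show that IVE is effective across various base MLLMs and multiple hallucination benchmarks, particularly for cognitive hallucinations. Token-wise attention analysis confirms that visual inertia is a key factor, and IVE breaks this inertial pattern and improves relational reasoning.

Conclusion: The study identifies the inertia of visual attention in multimodal large language models as a key mechanism behind cognitive hallucinations, and offers IVE as an effective training-free way to mitigate them. The finding underlines the importance of dynamic attention modulation for compositional reasoning and suggests a direction for improving the cognitive abilities of future multimodal models.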


📄 Abstract

Like a body at rest that stays at rest, we find that visual attention in multimodal large language models (MLLMs) exhibits pronounced inertia, remaining largely static once settled during early decoding steps and failing to support the compositional understanding required for cognitive inference. While existing hallucination mitigation methods mainly target perceptual hallucinations concerning object existence or attributes, they remain inadequate for such cognitive hallucinations that require inter-object relational deduction. Through token-wise attention analysis, we identify this visual inertia as a key factor: attention to semantically critical regions remains persistently focused and fails to dynamically support relational inference. We thereby propose a training-free Inertia-aware Visual Excitation (IVE) method that breaks this inertial pattern by modeling cognitive inference as the dynamic responsiveness of visual attention. Specifically, IVE selects visual tokens that are dynamically emerging relative to historical attention trends while distinguishing tokens exhibiting inertial behavior. To further facilitate compositional inference, IVE introduces an inertia-aware penalty that discourages over-concentration and limits the persistence of attention within localized regions. Extensive experiments show that IVE is effective across various base MLLMs and multiple hallucination benchmarks, particularly for cognitive hallucinations.
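One simple way to picture "dynamically emerging relative to historical attention trends" is to compare each visual token's current attention with an exponential moving average (EMA) of its past attention. The abstract does not specify how IVE tracks the trend, so the EMA, the `decay` and `margin` values, and the function names below are assumptions made purely for illustration.

```python
from typing import List, Sequence


def update_trend(
    trend: Sequence[float],
    current: Sequence[float],
    decay: float = 0.9,  # assumed smoothing factor
) -> List[float]:
    """Update the per-token EMA of attention with the current decoding step."""
    return [decay * t + (1.0 - decay) * c for t, c in zip(trend, current)]


def emerging_tokens(
    trend: Sequence[float],
    current: Sequence[float],
    margin: float = 0.05,  # assumed minimum rise above the trend
) -> List[int]:
    """Indices of tokens whose attention now exceeds their historical trend.

    Tokens whose attention merely stays at a high level are not selected,
    because their current value does not rise above their own trend.
    """
    return [
        i
        for i, (t, c) in enumerate(zip(trend, current))
        if c - t > margin
    ]
```

In this picture, a token that has held 0.5 attention since the first steps is treated as inertial, whereas a token rising from 0.1 to 0.3 is treated as emerging even though its absolute attention is lower.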

[8] A3R: Agentic Affordance Reasoning via Cross-Dimensional Evidence in 3D Gaussian Scenes

Di Li, Jie Feng, Guanbin Li, Ronghua Shang, Yuhui Zheng, Weisheng Dong, Guangming Shi

🧩 TL;DR

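This paper proposes the A3R framework, which reformulates fine-grained affordance reasoning in 3D Gaussian scenes as a sequential evidence acquisition process. By progressively reducing ambiguity through cross-dimensional evidence acquisition, it consistently outperforms static one-shot prediction methods.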


📘 Detailed Summary

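Motivation: Existing methods treat affordance reasoning in 3D Gaussian scenes as one-shot prediction from static observations, assuming that sufficient evidence is already available. In complex 3D scenes, however, many failures stem from incomplete task-relevant evidence under fixed observations rather than from weak prediction capacity.

Method: A3R reformulates fine-grained affordance reasoning as a sequential evidence acquisition process. An MLLM-based policy iteratively selects evidence acquisition actions and updates the affordance belief using cross-dimensional evidence (3D geometric and 2D semantic). A GRPO-based policy learning strategy is introduced to optimize the sequential decision making.

Result: Extensive experiments on scene-level benchmarks show that A3R consistently surpasses static one-shot baselines in complex 3D Gaussian scenes, demonstrating the advantage of agentic cross-dimensional evidence acquisition for fine-grained affordance reasoning.

Conclusion: Reformulating affordance reasoning as sequential evidence acquisition effectively addresses incomplete evidence in complex 3D scenes. The agentic cross-dimensional evidence acquisition framework offers a new paradigm for fine-grained 3D scene understanding and could be extended to a broader range of 3D scene understanding tasks.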


📄 Abstract

Affordance reasoning in 3D Gaussian scenes aims to identify the region that supports the action specified by a given text instruction in complex environments. Existing methods typically cast this problem as one-shot prediction from static scene observations, assuming sufficient evidence is already available for reasoning. However, in complex 3D scenes, many failure cases arise not from weak prediction capacity, but from incomplete task-relevant evidence under fixed observations. To address this limitation, we reformulate fine-grained affordance reasoning as a sequential evidence acquisition process, where ambiguity is progressively reduced through complementary 3D geometric and 2D semantic evidence. Building on this formulation, we propose A3R, an agentic affordance reasoning framework that enables an MLLM-based policy to iteratively select evidence acquisition actions and update the affordance belief through cross-dimensional evidence acquisition. To optimize such sequential decision making, we further introduce a GRPO-based policy learning strategy that improves evidence acquisition efficiency and reasoning accuracy. Extensive experiments on scene-level benchmarks show that A3R consistently surpasses static one-shot baselines, demonstrating the advantage of agentic cross-dimensional evidence acquisition for fine-grained affordance reasoning in complex 3D Gaussian scenes.

[9] Jagle: Building a Large-Scale Japanese Multimodal Post-Training Dataset for Vision-Language Models

Issa Sugiura, Keito Sasagawa, Keisuke Nakao, Koki Maeda, Ziqi Yin, Zhishen Yang, Shuhei Kurita, Yusuke Oda, Ryoko Tokuhisa, Daisuke Kawahara, Naoaki Okazaki

🧩 TL;DR

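This paper introduces Jagle, the largest Japanese multimodal post-training dataset to date, with approximately 9.2 million instances. Its visual question answering pairs are generated through multiple strategies, and it improves the performance of Japanese vision-language models.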


📘 Detailed Summary

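Motivation: Building high-quality multilingual and non-English vision-language models faces a major obstacle: outside English, visual question answering datasets remain limited in both scale and domain coverage. The usual approach of aggregating existing VQA resources, which works for English, does not carry over to other languages.

Method: Instead of relying on existing VQA datasets, the team collects heterogeneous source data, including images, image-text pairs, and PDF documents, and generates VQA pairs through multiple strategies, including VLM-based question-answer generation, translation, and text rendering.

Result: A 2.2B model trained with Jagle performs strongly on Japanese tasks: its average score across ten Japanese evaluation tasks surpasses InternVL3.5-2B and comes within five points of Qwen3-VL-2B-Instruct. Combining Jagle with FineVision does not degrade English performance and in fact improves it relative to training with FineVision alone.

Conclusion: Jagle offers an effective basis for building high-quality Japanese vision-language models. It shows that heterogeneous data collection combined with multi-strategy generation can overcome the scarcity of non-English VQA data, and that multilingual training can improve performance without harming existing language ability.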


📄 Abstract

Developing vision-language models (VLMs) that generalize across diverse tasks requires large-scale training datasets with diverse content. In English, such datasets are typically constructed by aggregating and curating numerous existing visual question answering (VQA) resources. However, this strategy does not readily extend to other languages, where VQA datasets remain limited in both scale and domain coverage, posing a major obstacle to building high-quality multilingual and non-English VLMs. In this work, we introduce Jagle, the largest Japanese multimodal post-training dataset to date, comprising approximately 9.2 million instances across diverse tasks. Rather than relying on existing VQA datasets, we collect heterogeneous source data, including images, image-text pairs, and PDF documents, and generate VQA pairs through multiple strategies such as VLM-based QA generation, translation, and text rendering. Experiments demonstrate that a 2.2B model trained with Jagle achieves strong performance on Japanese tasks, surpassing InternVL3.5-2B in average score across ten Japanese evaluation tasks and approaching within five points of Qwen3-VL-2B-Instruct. Furthermore, combining Jagle with FineVision does not degrade English performance; instead, it improves English performance compared to training with FineVision alone. To facilitate reproducibility and future research, we release the dataset, trained models, and code.

cs.AI

[10] MM-ReCoder: Advancing Chart-to-Code Generation with Reinforcement Learning and Self-Correction

Zitian Tang, Xu Zhang, Jianbo Yuan, Yang Zou, Varad Gunjal, Songyao Jiang, Davide Modolo

🧩 TL;DR

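This paper introduces MM-ReCoder, a chart-to-code generation model trained with reinforcement learning and equipped with self-correction ability. Using a two-stage multi-turn self-correction reinforcement learning strategy, it achieves state-of-the-art performance on three benchmarks.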


📘 Detailed Summary

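Motivation: Existing multimodal large language models rely mainly on supervised fine-tuning for chart-to-code generation. This approach never exposes the model to a code execution environment, and even state-of-the-art models struggle to self-correct effectively, which limits the quality of the generated code.

Method: The paper proposes a two-stage multi-turn self-correction reinforcement learning strategy based on Group Relative Policy Optimization (GRPO). The first stage strengthens self-correction by rolling out from a shared first turn, and the second stage improves coding ability through full-trajectory optimization, so that the model learns to iteratively correct its own outputs by interacting with the environment.

Result: On three chart-to-code benchmarks, MM-ReCoder achieves state-of-the-art performance, producing more accurate and executable code than existing methods.

Conclusion: Combining reinforcement learning with self-correction effectively improves multimodal large language models on code generation, pointing to a new technical path for future multimodal programming assistants.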


📄 Abstract

Multimodal Large Language Models (MLLMs) have recently demonstrated promising capabilities in multimodal coding tasks such as chart-to-code generation. However, existing methods primarily rely on supervised fine-tuning (SFT), which requires the model to learn code patterns through chart-code pairs but does not expose the model to a code execution environment. Moreover, while self-correction through execution feedback offers a potential route to improve coding quality, even state-of-the-art MLLMs have been shown to struggle with effective self-correction. In this work, we introduce MM-ReCoder, a chart-to-code generation model trained with reinforcement learning (RL) and equipped with self-correction ability. We propose a two-stage multi-turn self-correction RL strategy based on Group Relative Policy Optimization (GRPO). The first stage enhances the model's self-correction ability via rolling out a shared first turn, while the second stage improves the coding capability with full-trajectory optimization. MM-ReCoder learns to produce more accurate and executable code through the interaction with the environment and by iteratively correcting its own outputs. Our results on three chart-to-code benchmarks demonstrate the state-of-the-art performance of MM-ReCoder.
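For readers unfamiliar with GRPO, its core step is to score each sampled response relative to the other responses drawn for the same prompt, rather than against a learned value function. The sketch below shows only that group-relative normalization; it does not reproduce the paper's two-stage strategy or its reward design. The use of the population standard deviation and the `eps` constant are implementation assumptions.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(
    rewards: Sequence[float],
    eps: float = 1e-8,  # assumed constant to avoid division by zero
) -> List[float]:
    """Normalize rewards within one group of rollouts for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so
    rollouts that beat their siblings get positive advantages.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

In a chart-to-code setting the rewards would come from executing each generated script and comparing the rendered chart with the target; when every rollout in a group receives the same reward, all advantages are zero and the group contributes no gradient.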

[11] Not All Tokens See Equally: Perception-Grounded Policy Optimization for Large Vision-Language Models

Zekai Ye, Qiming Li, Xiaocheng Feng, Ruihan Chen, Ziming Li, Haoyu Ren, Kun Chen, Dandan Tu, Bing Qin

🧩 TL;DR

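This paper proposes Perception-Grounded Policy Optimization (PGPO), a novel fine-grained credit assignment framework that dynamically reshapes token-level advantages to counter the dilution of learning signals for visually dependent tokens in multimodal reasoning. It improves performance by 18.7% on average across seven challenging benchmarks.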


📘 Detailed Summary

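Motivation: Existing Reinforcement Learning from Verifiable Rewards (RLVR) frameworks have a foundational methodological flaw: by assigning the same advantage to every generated token, they dilute the learning signal needed to optimize the critical visually grounded steps of multimodal reasoning, leaving visually dependent tokens with too little gradient signal.

Method: The paper first formalizes Token Visual Dependency, which quantifies the information gain from visual input as the KL divergence between the visual-conditioned and text-only predictive distributions. On this basis it proposes Perception-Grounded Policy Optimization (PGPO), a fine-grained credit assignment framework that uses a threshold-gated, mass-conserving mechanism to reshape token-level advantages dynamically, amplifying the learning signal for visually dependent tokens while suppressing gradient noise from linguistic priors.

Result: Extensive experiments with the Qwen2.5-VL series on seven challenging multimodal reasoning benchmarks show that PGPO improves performance by 18.7% on average. Theoretical and empirical analyses both confirm that PGPO reduces gradient variance, prevents training collapse, and acts as a strong regularizer for robust, perception-grounded multimodal reasoning.

Conclusion: The study shows that token visual dependency in multimodal reasoning is highly sparse and semantically pivotal. PGPO addresses the foundational flaw of existing RLVR methods through fine-grained credit assignment, offers an effective way to build more robust, perception-grounded multimodal reasoning systems, and shows agreement between theoretical analysis and empirical results.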


📄 Abstract

While Reinforcement Learning from Verifiable Rewards (RLVR) has advanced reasoning in Large Vision-Language Models (LVLMs), prevailing frameworks suffer from a foundational methodological flaw: by distributing identical advantages across all generated tokens, these methods inherently dilute the learning signals essential for optimizing the critical, visually-grounded steps of multimodal reasoning. To bridge this gap, we formulate Token Visual Dependency, quantifying the causal information gain of visual inputs via the Kullback-Leibler (KL) divergence between visual-conditioned and text-only predictive distributions. Revealing that this dependency is highly sparse and semantically pivotal, we introduce Perception-Grounded Policy Optimization (PGPO), which is a novel fine-grained credit assignment framework that dynamically reshapes advantages at the token level. Through a threshold-gated, mass-conserving mechanism, PGPO actively amplifies learning signals for visually-dependent tokens while suppressing gradient noise from linguistic priors. Extensive experiments based on the Qwen2.5-VL series across seven challenging multimodal reasoning benchmarks demonstrate that PGPO boosts models by 18.7% on average. Both theoretical and empirical analyses confirm that PGPO effectively reduces gradient variance, prevents training collapse, and acts as a potent regularizer for robust, perception-grounded multimodal reasoning. Code will be published at https://github.com/Yzk1114/PGPO.
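The two ingredients named in the abstract, a KL-based dependency score and a threshold-gated, mass-conserving reshaping of advantages, can be sketched as follows. The KL divergence itself is standard, and its direction (visual-conditioned relative to text-only) is one reading of the abstract. The gating rule, the `boost` factor, the `threshold` value, and the reading of "mass-conserving" as "the advantages still sum to the original total" are assumptions, not the paper's actual formulation.

```python
import math
from typing import List, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def reshape_advantages(
    advantage: float,
    dependencies: Sequence[float],
    threshold: float = 0.1,  # assumed gate on the dependency score
    boost: float = 2.0,      # assumed amplification for gated tokens
) -> List[float]:
    """Redistribute one sequence-level advantage across its tokens.

    Tokens whose visual dependency exceeds `threshold` get a larger share.
    Weights are rescaled so the per-token advantages sum to
    advantage * num_tokens, the same total as uniform assignment.
    """
    if not dependencies:
        return []
    raw = [boost if d > threshold else 1.0 for d in dependencies]
    scale = len(raw) / sum(raw)
    return [advantage * w * scale for w in raw]
```

Here each entry of `dependencies` would be `kl_divergence(p_with_image, p_text_only)` for the next-token distribution at that position; because the weights are renormalized, boosting some tokens necessarily shrinks the share of the rest.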