cs.CV

[1] First Logit Boosting: Visual Grounding Method to Mitigate Object Hallucination in Large Vision-Language Models

Jiwoo Ha, Jongwoo Baek, Jinhyun So

🧩 TL;DR

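This paper proposes First Logit Boosting (FLB), a training-free technique for mitigating object hallucination in Large Vision-Language Models (LVLMs). FLB stores the logit of the first generated token and reuses it to counter the long-term decay of visual information during generation.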


📘 Detailed Summary

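Motivation: Although large vision-language models perform well on multimodal tasks, object hallucination (generating answers that mention nonexistent objects) remains a persistent challenge. Existing approaches such as retraining or external grounding suffer from high data costs or structural complexity. Training-free methods such as Contrastive Decoding (CD) are cost-effective but suffer from long-term decay: as generation progresses, visual grounding weakens and language priors dominate.

Method: The paper proposes First Logit Boosting (FLB), a simple and effective training-free technique that stores the logit of the first generated token and adds it to subsequent token predictions, thereby alleviating long-term decay in LVLMs. The authors observe that FLB (1) sustains the visual information embedded in the first token throughout generation and (2) suppresses hallucinated words through the stabilizing effect of the "The" token.

Result: Experiments show that FLB significantly reduces object hallucination across tasks, benchmarks, and backbone models. It adds only negligible inference overhead, making it well suited to real-time multimodal systems. Specifically, FLB mitigates hallucination better than existing methods on multiple evaluation benchmarks while preserving the model's generation quality.

Conclusion: FLB offers an efficient, lightweight way to mitigate object hallucination in LVLMs without additional training or external models. The work shows that reusing visual information from the earliest generation step can effectively counter long-term decay, which makes it a practical tool for deploying real-time multimodal systems. The code is open-sourced to support further research and application.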


📄 Abstract

Recent Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across various multimodal tasks that require understanding both visual and linguistic inputs. However, object hallucination -- the generation of nonexistent objects in answers -- remains a persistent challenge. Although several approaches such as retraining and external grounding methods have been proposed to mitigate this issue, they still suffer from high data costs or structural complexity. Training-free methods such as Contrastive Decoding (CD) are more cost-effective, avoiding additional training or external models, but still suffer from long-term decay, where visual grounding weakens and language priors dominate as the generation progresses. In this paper, we propose First Logit Boosting (FLB), a simple yet effective training-free technique designed to alleviate long-term decay in LVLMs. FLB stores the logit of the first generated token and adds it to subsequent token predictions, effectively mitigating long-term decay of visual information. We observe that FLB (1) sustains the visual information embedded in the first token throughout generation, and (2) suppresses hallucinated words through the stabilizing effect of the "The" token. Experimental results show that FLB significantly reduces object hallucination across various tasks, benchmarks, and backbone models. Notably, it causes negligible inference overhead, making it highly applicable to real-time multimodal systems. Code is available at https://github.com/jiwooha20/FLB
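
The core idea in the abstract (store the first-step logits, then add them to every later step) can be illustrated with a minimal greedy-decoding sketch. This is not the authors' implementation: the `next_logits` callable stands in for a real LVLM forward pass, and the scaling weight `alpha` is a hypothetical knob that the abstract does not mention.

```python
def flb_greedy_decode(next_logits, max_new_tokens, alpha=1.0):
    """Greedy decoding with a first-logit boost (illustrative sketch only).

    next_logits: hypothetical callable mapping the tokens generated so far
                 to a list of vocabulary logits for the next position.
    alpha:       hypothetical weight on the stored first-step logits.
    """
    tokens = []
    first_logits = None
    for _ in range(max_new_tokens):
        logits = list(next_logits(tokens))
        if first_logits is None:
            # Step 1: remember the logits of the first generated token.
            first_logits = logits
        else:
            # Later steps: add the stored first-step logits to the prediction.
            logits = [l + alpha * f for l, f in zip(logits, first_logits)]
        # Pick the highest-scoring token (first index wins on ties).
        tokens.append(max(range(len(logits)), key=logits.__getitem__))
    return tokens


# Toy "model": the first step prefers token 1, later steps drift to token 0.
def toy_model(tokens):
    return [0.0, 2.0, 1.0] if not tokens else [1.5, 0.0, 0.0]

print(flb_greedy_decode(toy_model, 3, alpha=0.0))  # no boost
print(flb_greedy_decode(toy_model, 3, alpha=1.0))  # with boost
```

In the toy run, the boosted decoder keeps favoring the token that the first step scored highly, which mirrors the claim that first-step information is carried through generation. The extra work per step is one vector addition over the vocabulary.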

[2] Multimodal Language Models Cannot Spot Spatial Inconsistencies

Om Khangaonkar, Hadi J. Rad, Hamed Pirsiavash

🧩 TL;DR

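This study introduces a new task for evaluating the 3D spatial-consistency reasoning of multimodal large language models and proposes a simple, scalable method for generating spatially inconsistent image pairs, revealing significant deficiencies in current models' understanding of 3D geometry.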


📘 Detailed Summary

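Motivation: Despite recent progress, multimodal large language models often struggle to reason about 3D geometry across multiple views. This work addresses that limitation by introducing a more challenging task that evaluates whether models understand 3D motion consistency.

Method: The study proposes a simple and scalable method for generating realistic, spatially inconsistent image pairs from multi-view scenes. Models are asked to identify the object that violates 3D motion consistency, which allows systematic evaluation of their spatial reasoning.

Result: Experiments show that state-of-the-art multimodal large language models perform significantly below human observers on the 3D spatial-consistency task and vary widely across scene attributes, revealing a fragile and incomplete understanding of 3D structure.

Conclusion: The study highlights serious shortcomings in how current multimodal large language models understand the 3D physical world, points to the need for approaches that develop a more deeply grounded understanding of the physical world, and provides an evaluation benchmark for future research.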


📄 Abstract

Spatial consistency is a fundamental property of the visual world and a key requirement for models that aim to understand physical reality. Despite recent advances, multimodal large language models (MLLMs) often struggle to reason about 3D geometry across multiple views. Rather than asking models to describe scene attributes, we introduce a more challenging task: given two views of the same scene, identify the object that violates 3D motion consistency. We propose a simple and scalable method for generating realistic, spatially inconsistent image pairs from multi-view scenes, enabling systematic evaluation of this capability. Our results show that state-of-the-art MLLMs significantly underperform human observers and exhibit substantial variability across different scene attributes, revealing a fragile and incomplete understanding of 3D structure. We hope our findings underscore the need for approaches that develop a more deeply grounded understanding of the physical world.

[3] The 1st Winner for 5th PVUW MeViS-Text Challenge: Strong MLLMs Meet SAM3 for Referring Video Object Segmentation

Xusheng He, Canyang Wu, Jinrong Zhang, Weili Guan, Jianlong Wu, Liqiang Nie

🧩 TL;DR

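This report proposes a training-free three-stage pipeline for referring video object segmentation under motion-centric language expressions. It combines multimodal large language models with SAM3 and took first place in the PVUW 2026 MeViS-Text Challenge.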


📘 Detailed Summary

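Motivation: The work targets referring video object segmentation under motion-centric language expressions, a task that requires jointly understanding appearance, temporal behavior, and object interactions. Existing methods usually require task-specific fine-tuning, whereas this work explores an efficient training-free solution.

Method: The method is a fully training-free three-stage pipeline. First, Gemini-3.1 Pro decomposes each target event into instance-level grounding targets and selects the frame where the target is most clearly visible. Second, SAM3-agent produces a precise seed mask on the selected frame, and the official SAM3 tracker propagates it through the whole video. Third, Qwen3.5-Plus and behavior-level verification refine and correct the predictions.

Result: The method ranks first on the PVUW 2026 MeViS-Text test set with a Final score of 0.909064 and a J&F score of 0.7897, demonstrating strong performance without task-specific fine-tuning. The code is open-sourced on GitHub.

Conclusion: The study shows that carefully combining existing multimodal large language models with segmentation models yields an efficient, training-free video object segmentation system. This offers a new zero-shot paradigm for video understanding tasks and demonstrates the strong generalization of pretrained foundation models.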


📄 Abstract

This report presents our winning solution to the 5th PVUW MeViS-Text Challenge. The track studies referring video object segmentation under motion-centric language expressions, where the model must jointly understand appearance, temporal behavior, and object interactions. To address this problem, we build a fully training-free pipeline that combines strong multimodal large language models with SAM3. Our method contains three stages. First, Gemini-3.1 Pro decomposes each target event into instance-level grounding targets, selects the frame where the target is most clearly visible, and generates a discriminative description. Second, SAM3-agent produces a precise seed mask on the selected frame, and the official SAM3 tracker propagates the mask through the whole video. Third, a refinement stage uses Qwen3.5-Plus and behavior-level verification to correct ambiguous or semantically inconsistent predictions. Without task-specific fine-tuning, our method ranks first on the PVUW 2026 MeViS-Text test set, achieving a Final score of 0.909064 and a J&F score of 0.7897. The code is available at https://github.com/Moujuruo/MeViSv2_Track_Solution_2026.
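
The three stages described above can be read as a simple orchestration loop. The sketch below is only a structural illustration: every name in it (`GroundingTarget`, `run_pipeline`, and the four injected callables) is hypothetical, none of them correspond to real Gemini, SAM3, or Qwen APIs, and the refinement stage is reduced to a keep-or-drop check, whereas the report describes correcting predictions.

```python
from dataclasses import dataclass


@dataclass
class GroundingTarget:
    """Hypothetical container for one instance-level grounding target."""
    description: str   # discriminative description of the instance
    frame_index: int   # frame where the instance is most clearly visible


def run_pipeline(video, expression, decompose, seed_mask, propagate, verify):
    """Three-stage referring-segmentation skeleton with injected stages.

    decompose(video, expression)          -> list of GroundingTarget  (stage 1)
    seed_mask(frame, description)         -> mask on the chosen frame (stage 2a)
    propagate(video, frame_index, seed)   -> per-frame masks          (stage 2b)
    verify(video, masks, expression)      -> bool                     (stage 3)
    """
    kept = []
    for target in decompose(video, expression):
        seed = seed_mask(video[target.frame_index], target.description)
        masks = propagate(video, target.frame_index, seed)
        # Simplified refinement: keep only predictions that pass verification.
        if verify(video, masks, expression):
            kept.append(masks)
    return kept
```

Passing the stages in as callables keeps the skeleton independent of any particular model, so each stage can be swapped or mocked without touching the control flow.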

[4] IWP: Token Pruning as Implicit Weight Pruning in Large Vision Language Models

Dong-Jae Lee, Sunghyun Baek, Junmo Kim

🧩 TL;DR

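This paper proposes a training-free token pruning framework based on the dual-form view of attention. By reformulating attention as an implicit linear layer and introducing a new metric that quantifies a token's information magnitude and information duplication, it prunes visual tokens in large vision-language models efficiently.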


📘 Detailed Summary

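Motivation: Large vision-language models perform well on image and video understanding tasks, but their computational cost grows rapidly with the number of visual tokens. Existing token pruning methods are mostly empirical, overlook the internal mechanism of attention, and lack a theoretically guided pruning framework.

Method: Starting from the dual-form view of attention, the paper reformulates attention as an implicit linear layer whose weight matrix is a sum of outer products, each generated by a single token's key-value pair. From this it derives a new metric that quantifies a token's information magnitude and information duplication, and it introduces the Progressive Chunked Maximal Marginal Relevance algorithm to select the token subset efficiently.

Result: Extensive experiments show that the method achieves a better trade-off between performance and efficiency, significantly reducing computational cost relative to existing pruning methods while preserving model performance. It also offers a new theoretical perspective on existing pruning methods.

Conclusion: The study gives token pruning a theoretical framework rooted in the dual form of attention, showing that token pruning can be viewed as selecting the subset of rank-1 updates that best approximates the original dual weight matrix, and it provides a new theoretical lens for understanding existing pruning methods.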


📄 Abstract

Large Vision Language Models show impressive performance across image and video understanding tasks, yet their computational cost grows rapidly with the number of visual tokens. Existing token pruning methods mitigate this issue through empirical approaches while overlooking the internal mechanism of attention. In this paper, we propose a novel training-free token pruning framework grounded in the dual-form perspective of attention. We reformulate attention as an implicit linear layer whose weight matrix is the sum of rank-1 outer products, each generated by a single token's key-value pair. Token pruning thus reduces to selecting an optimal subset of these rank-1 updates that best approximates the original dual weight matrix. Extending this perspective to standard softmax attention in LVLMs, we derive a novel metric quantifying both a token's information magnitude and information duplication. To efficiently select the subset with the proposed metric, we introduce Progressive Chunked Maximal Marginal Relevance. Extensive experiments demonstrate that our method achieves a better trade-off between performance and efficiency, while providing another perspective on existing pruning approaches.
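
The dual-form statement can be checked numerically in the simplest setting. The sketch below assumes unnormalized linear attention (no softmax), where the output for a query q is the sum over tokens of (q · k_i) v_i, which equals q applied to the matrix W = sum of k_i v_i^T. It does not reproduce the paper's softmax extension, its pruning metric, or Progressive Chunked Maximal Marginal Relevance; all function names are illustrative.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def linear_attention(q, keys, values):
    """Primal form: weight each value by its unnormalized score q . k_i."""
    out = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        score = dot(q, k)
        out = [o + score * vj for o, vj in zip(out, v)]
    return out


def dual_weight(keys, values, keep=None):
    """Dual form: W = sum of rank-1 outer products k_i v_i^T over kept tokens."""
    indices = range(len(keys)) if keep is None else keep
    W = [[0.0] * len(values[0]) for _ in range(len(keys[0]))]
    for i in indices:
        for r, kr in enumerate(keys[i]):
            for c, vc in enumerate(values[i]):
                W[r][c] += kr * vc
    return W


def apply_weight(q, W):
    """Treat W as the weight matrix of a linear layer applied to q."""
    return [sum(q[r] * W[r][c] for r in range(len(q))) for c in range(len(W[0]))]


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
q = [2.0, 3.0]

print(linear_attention(q, keys, values))                     # primal output
print(apply_weight(q, dual_weight(keys, values)))            # same via dual W
print(apply_weight(q, dual_weight(keys, values, keep=[0, 2])))  # token 1 pruned
```

Dropping a token removes exactly one rank-1 term from W, which is why pruning can be framed as choosing the subset of rank-1 updates that best approximates the full matrix.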

[5] A 4D Representation for Training-Free Agentic Reasoning from Monocular Laparoscopic Video

Maximilian Fehrentz, Nicolas Stellwag, Robert Wiebe, Nicole Thorisch, Fabian Grob, Patrick Remerscheid, Ken-Joel Simmoteit, Benjamin D. Killeen, Christian Heiliger, Nassir Navab

🧩 TL;DR

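This study proposes a spatiotemporal reasoning framework built on an explicit 4D representation, enabling surgical AI systems to ground natural-language reasoning in both 3D space and time. It combines 2D multimodal large language models with 3D computer vision models to build spatiotemporal intelligence without additional training.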


📘 Detailed Summary

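Motivation: Spatiotemporal reasoning is a key capability for AI in soft tissue surgery, but existing 2D vision-language models are limited in handling the spatial complexity of surgical scenes. An explicit 4D representation that captures both time and 3D space is needed to improve reasoning systems.

Method: The framework uses point tracking, depth estimation, and segmentation models to build a coherent 4D model with spatiotemporally consistent tool and tissue semantics. A multimodal large language model then acts as an agent, reasoning over tools derived from the explicit 4D representation (such as trajectories) without any fine-tuning.

Result: Evaluation on a new dataset of 134 clinically relevant questions shows that combining a general-purpose reasoning backbone with the 4D representation significantly improves spatiotemporal understanding and enables 4D grounding, confirming that spatiotemporal intelligence can be "assembled" from 2D MLLMs and 3D computer vision models without additional training.

Conclusion: The study shows that spatiotemporal intelligence can be built by combining existing 2D multimodal large language models with 3D computer vision models, with no additional training. It offers surgical AI systems a new paradigm based on explicit 4D representations and advances intelligent assistive systems and autonomous robotic surgery.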


📄 Abstract

Spatiotemporal reasoning is a fundamental capability for artificial intelligence (AI) in soft tissue surgery, paving the way for intelligent assistive systems and autonomous robotics. While 2D vision-language models show increasing promise at understanding surgical video, the spatial complexity of surgical scenes suggests that reasoning systems may benefit from explicit 4D representations. Here, we propose a framework for equipping surgical agents with spatiotemporal tools based on an explicit 4D representation, enabling AI systems to ground their natural language reasoning in both time and 3D space. Leveraging models for point tracking, depth, and segmentation, we develop a coherent 4D model with spatiotemporally consistent tool and tissue semantics. A Multimodal Large Language Model (MLLM) then acts as an agent on tools derived from the explicit 4D representation (e.g., trajectories) without any fine-tuning. We evaluate our method on a new dataset of 134 clinically relevant questions and find that the combination of a general purpose reasoning backbone and our 4D representation significantly improves spatiotemporal understanding and allows for 4D grounding. We demonstrate that spatiotemporal intelligence can be "assembled" from 2D MLLMs and 3D computer vision models without additional training. Code, data, and examples are available at https://tum-ai.github.io/surg4d/
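
One way to picture a trajectory tool derived from point tracking and depth is to lift each tracked pixel into 3D and measure the resulting path. The sketch below assumes a pinhole camera with known intrinsics (fx, fy, cx, cy) and per-frame metric depth at the tracked pixel; the abstract does not state the camera model or how trajectories are computed, so these are illustrative assumptions and the function names are hypothetical.

```python
import math


def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with depth z to camera-frame 3D (pinhole assumption)."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def lift_track(track_2d, depths, intrinsics):
    """Turn a 2D point track plus per-frame depths into a 3D trajectory."""
    fx, fy, cx, cy = intrinsics
    return [backproject(u, v, z, fx, fy, cx, cy)
            for (u, v), z in zip(track_2d, depths)]


def path_length(trajectory):
    """Total distance travelled along consecutive 3D points."""
    return sum(math.dist(a, b) for a, b in zip(trajectory, trajectory[1:]))


# Hypothetical tool-tip track over three frames.
track = [(50.0, 50.0), (150.0, 50.0), (150.0, 50.0)]
depths = [1.0, 1.0, 2.0]
trajectory = lift_track(track, depths, intrinsics=(100.0, 100.0, 50.0, 50.0))
print(trajectory)
print(path_length(trajectory))
```

A summary such as the path length is the kind of compact, numeric feature an agent could query instead of inspecting raw frames. Note that the third point moves in 3D even though its pixel position is unchanged, which a purely 2D track would miss.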

[6] Query-Conditioned Evidential Keyframe Sampling for MLLM-Based Long-Form Video Understanding

Yiheng Wang, Lichen Zhu, Yueqian Lin, Yudong Liu, Jingyang Zhang, Hai "Helen" Li, Yiran Chen

🧩 TL;DR

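This paper proposes an evidence-driven keyframe sampling framework, grounded in information bottleneck theory, for long-form video question answering. It formulates keyframe selection as maximizing the conditional mutual information between the selected frames and the query, and makes sampling efficient through a decomposed optimization and a query-conditioned evidence scoring network.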


📘 Detailed Summary

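Motivation: Applying multimodal large language models to long-form video question answering is constrained by limited context length and high computational cost, which makes keyframe sampling essential. Existing methods typically rely on semantic relevance or reinforcement learning; they either fail to capture evidential clues or are inefficient at combinatorial optimization, so a more effective, theory-driven sampling strategy is needed.

Method: The paper proposes an evidence-driven keyframe sampling framework grounded in information bottleneck theory, formulating keyframe selection as maximizing the conditional mutual information between the selected frames and the query. By exploiting the structure of this objective, it derives a decomposed optimization that reduces subset selection to independent frame-level scoring. It further introduces a query-conditioned evidence scoring network, trained with a contrastive objective, to estimate evidential importance efficiently.

Result: Experiments on long-form video understanding benchmarks show that the method consistently outperforms prior sampling strategies under strict token budgets while significantly improving training efficiency. It performs strongly across a range of long-form video question answering tasks, supporting the effectiveness of the evidence-driven, information-bottleneck-based sampling framework.

Conclusion: The study provides a theory-driven framework for keyframe sampling in long-form video understanding, using information bottleneck theory to give frame selection a principled objective. The method improves sampling efficiency, offers a workable path for deploying multimodal large language models on long videos, and demonstrates the value of decomposed optimization and query-conditioned scoring in complex video understanding tasks.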


📄 Abstract

Multimodal Large Language Models (MLLMs) have shown strong performance on video question answering, but their application to long-form videos is constrained by limited context length and computational cost, making keyframe sampling essential. Existing approaches typically rely on semantic relevance or reinforcement learning, which either fail to capture evidential clues or suffer from inefficient combinatorial optimization. In this work, we propose an evidence-driven keyframe sampling framework grounded in information bottleneck theory. We formulate keyframe selection as maximizing the conditional mutual information between selected frames and the query, providing a principled objective that reflects each frame's contribution to answering the question. To make this objective tractable, we exploit its structure to derive a decomposed optimization that reduces subset selection to independent frame-level scoring. We further introduce a query-conditioned evidence scoring network trained with a contrastive objective to estimate evidential importance efficiently. Experiments on long-form video understanding benchmarks show that our method consistently outperforms prior sampling strategies under strict token budgets, while significantly improving training efficiency.
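
Once subset selection reduces to independent frame-level scoring, choosing frames under a budget becomes a top-k operation rather than a combinatorial search. The sketch below illustrates that point under the assumption that the objective is additive over per-frame scores; the scores are plain numbers standing in for the output of the paper's learned scoring network, and the function names are hypothetical.

```python
from itertools import combinations


def select_keyframes(scores, budget):
    """Pick the `budget` highest-scoring frames, returned in temporal order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


def best_subset_bruteforce(scores, budget):
    """Exhaustive search over all subsets of size `budget` (for checking only)."""
    return list(max(combinations(range(len(scores)), budget),
                    key=lambda subset: sum(scores[i] for i in subset)))


# Hypothetical per-frame evidence scores for a five-frame clip.
scores = [0.1, 0.9, 0.3, 0.8, 0.2]
print(select_keyframes(scores, 3))
print(best_subset_bruteforce(scores, 3))
```

Under the additive assumption, the top-k result matches the exhaustive search, while costing a single sort instead of enumerating every subset.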

[7] ACT Now: Preempting LVLM Hallucinations via Adaptive Context Integration

Bei Yan, Yuecong Min, Jie Zhang, Shiguang Shan, Xilin Chen

🧩 TL;DR

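This paper proposes Adaptive Context inTegration (ACT), a training-free inference intervention method that mitigates hallucination in large vision-language models by adaptively integrating contextual information, significantly reducing hallucinations while preserving fundamental generation capabilities.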


📘 Detailed Summary

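Motivation: Large vision-language models commonly suffer from severe hallucination. Existing mitigation strategies rely mainly on isolated, single-step states to enhance visual focus or suppress strong linguistic priors, but these static approaches neglect dynamic context changes during generation and struggle to correct inherited information loss.

Method: ACT has two core components. Visual context exploration uses spatio-temporal profiling to adaptively amplify the attention heads responsible for visual exploration. Semantic context aggregation marginalizes over potential semantic queries to aggregate visual evidence effectively, resolving the information loss caused by the discrete nature of token prediction.

Result: Extensive experiments across multiple large vision-language models show that ACT significantly reduces hallucinations and achieves competitive results on both discriminative and generative benchmarks, without sacrificing fundamental generation capabilities.

Conclusion: As a training-free inference intervention, ACT provides a robust and highly adaptable solution for hallucination mitigation. By dynamically integrating contextual information, it addresses the limitations of existing static methods and offers a new technical route for vision-language alignment.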


📄 Abstract

Large Vision-Language Models (LVLMs) frequently suffer from severe hallucination issues. Existing mitigation strategies predominantly rely on isolated, single-step states to enhance visual focus or suppress strong linguistic priors. However, these static approaches neglect dynamic context changes across the generation process and struggle to correct inherited information loss. To address this limitation, we propose Adaptive Context inTegration (ACT), a training-free inference intervention method that mitigates hallucination through the adaptive integration of contextual information. Specifically, we first propose visual context exploration, which leverages spatio-temporal profiling to adaptively amplify attention heads responsible for visual exploration. To further facilitate vision-language alignment, we propose semantic context aggregation that marginalizes potential semantic queries to effectively aggregate visual evidence, thereby resolving the information loss caused by the discrete nature of token prediction. Extensive experiments across diverse LVLMs demonstrate that ACT significantly reduces hallucinations and achieves competitive results on both discriminative and generative benchmarks, acting as a robust and highly adaptable solution without compromising fundamental generation capabilities.

cs.CL

[8] MSA-Thinker: Discrimination-Calibration Reasoning with Hint-Guided Reinforcement Learning for Multimodal Sentiment Analysis

Miaosen Luo, Zhenhao Yang, Jieshen Long, Jinghu Sun, Yichu Liu, Sijie Mai

🧩 TL;DR

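This paper proposes a new training framework for multimodal sentiment analysis that integrates structured Discrimination-Calibration reasoning with hint-based reinforcement learning. By combining cold-start supervised fine-tuning with the Hint-GRPO algorithm, the method improves accuracy on fine-grained sentiment regression while generating high-quality structured reasoning chains, substantially improving interpretability and cross-domain generalization.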


📘 Detailed Summary

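Motivation: Multimodal large language models achieve state-of-the-art performance through supervised fine-tuning, but their end-to-end black-box nature limits interpretability. Existing methods that incorporate Chain-of-Thought reasoning face high annotation costs, while reinforcement learning suffers from low exploration efficiency and sparse rewards on hard samples. A new training paradigm that balances performance and interpretability is needed.

Method: The method is a training framework that integrates structured Discrimination-Calibration (DC) reasoning with hint-based reinforcement learning. It first performs cold-start supervised fine-tuning on high-quality Chain-of-Thought data with the DC structure, synthesized by a teacher model, so that the model learns to perform macro discrimination followed by fine-grained calibration. Building on this, it proposes Hint-GRPO, which uses the discrimination phase of the DC structure as a verifiable anchor to provide directional hints for hard samples, guiding policy optimization and effectively mitigating reward sparsity.

Result: Experiments on the Qwen2.5Omni-7B model show that the method achieves higher accuracy on fine-grained sentiment regression while generating high-quality structured reasoning chains. It also shows superior generalization in cross-domain evaluations, supporting the positive contribution of explicit reasoning steps to model robustness.

Conclusion: The study improves model interpretability and validates the positive contribution of explicit reasoning steps to robustness, offering a new paradigm for building trustworthy and efficient sentiment analysis systems. By combining structured reasoning with reinforcement learning, it provides a new way to balance interpretability and performance in multimodal sentiment analysis.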


📄 Abstract

Multimodal sentiment analysis aims to understand human emotions by integrating textual, auditory, and visual modalities. Although Multimodal Large Language Models (MLLMs) have achieved state-of-the-art performance via supervised fine-tuning (SFT), their end-to-end "black-box" nature limits interpretability. Existing methods incorporating Chain-of-Thought (CoT) reasoning are hindered by high annotation costs, while Reinforcement Learning (RL) faces challenges such as low exploration efficiency and sparse rewards, particularly on hard samples. To address these issues, we propose a novel training framework that integrates structured Discrimination-Calibration (DC) reasoning with Hint-based Reinforcement Learning. First, we perform cold-start SFT using high-quality CoT data synthesized by a teacher model (Qwen3Omni-30B), which inherently contains the DC structure. This equips the model with a reasoning paradigm that performs macro discrimination followed by fine-grained calibration from the initial stage. Building on this, we propose Hint-GRPO, which leverages the discrimination phase within the DC structure as a verifiable anchor during RL to provide directional hints for hard samples, guiding policy optimization and effectively mitigating the reward sparsity problem. Experiments on the Qwen2.5Omni-7B model demonstrate that our method not only achieves higher accuracy in fine-grained sentiment regression tasks but also generates high-quality structured reasoning chains. Crucially, it exhibits superior generalization capability in cross-domain evaluations. This enhances model interpretability while validating the positive contribution of explicit reasoning steps to model robustness, offering a new paradigm for building trustworthy and efficient sentiment analysis systems.
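
The reward-sparsity problem mentioned above is easy to see in the group-relative advantage used by GRPO-style training, where each rollout's reward is normalized against the other rollouts for the same prompt. The sketch below shows only that generic normalization; it is not the paper's Hint-GRPO, and the choice of population standard deviation and the `eps` constant are assumptions (implementations differ on both).

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an implementation choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hard sample: every rollout fails, so every advantage is zero.
print(group_relative_advantages([0, 0, 0, 0]))

# One rollout succeeds: it receives a positive advantage, the rest negative.
print(group_relative_advantages([0, 0, 0, 1]))
```

When no rollout in a group earns any reward, all advantages vanish and the sample contributes no learning signal. That is the situation in which a directional hint, which makes at least some rollouts succeed, can restore a gradient.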

cs.AI

[9] HippoCamp: Benchmarking Contextual Agents on Personal Computers

Zhe Yang, Shulin Tian, Kairui Hu, Shuai Liu, Hoang-Nhat Nguyen, Yichi Zhang, Zujin Guo, Mengying Yu, Zinan Zhang, Jingkang Yang, Chen Change Loy, Ziwei Liu

🧩 TL;DR

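This paper presents HippoCamp, a benchmark for evaluating agents on multimodal file management, with a focus on personalized modeling and context-aware reasoning in user-centric environments. It reveals a significant performance gap for current state-of-the-art models in realistic scenarios.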


📘 Detailed Summary

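Motivation: Existing agent benchmarks focus mainly on web interaction, tool use, or software automation in generic settings. They do not evaluate personalized modeling or large-scale personal file search in user-centric environments, so they cannot adequately test agents' context-aware reasoning over realistic personal file systems.

Method: HippoCamp builds device-scale file systems based on real-world user profiles, covering 42.4 GB of data and more than 2,000 real files. It provides 581 question-answer pairs to assess search, evidence perception, and multi-step reasoning, along with 46.1K densely annotated structured trajectories for step-wise failure diagnosis.

Result: The experiments evaluate a range of state-of-the-art multimodal large language models and agentic methods. Even the most advanced commercial models reach only 48.3% accuracy on user profiling and perform especially poorly on long-horizon retrieval and cross-modal reasoning within dense personal file systems. Failure diagnosis identifies multimodal perception and evidence grounding as the primary bottlenecks.

Conclusion: HippoCamp exposes critical limitations of current agents in realistic, user-centric environments and provides a solid foundation for developing next-generation personal AI assistants. It underscores the importance of multimodal perception and evidence grounding and points to long-horizon retrieval and cross-modal reasoning as challenges for future research.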


📄 Abstract

We present HippoCamp, a new benchmark designed to evaluate agents' capabilities on multimodal file management. Unlike existing agent benchmarks that focus on tasks like web interaction, tool use, or software automation in generic settings, HippoCamp evaluates agents in user-centric environments to model individual user profiles and search massive personal files for context-aware reasoning. Our benchmark instantiates device-scale file systems over real-world profiles spanning diverse modalities, comprising 42.4 GB of data across over 2K real-world files. Building upon the raw files, we construct 581 QA pairs to assess agents' capabilities in search, evidence perception, and multi-step reasoning. To facilitate fine-grained analysis, we provide 46.1K densely annotated structured trajectories for step-wise failure diagnosis. We evaluate a wide range of state-of-the-art multimodal large language models (MLLMs) and agentic methods on HippoCamp. Our comprehensive experiments reveal a significant performance gap: even the most advanced commercial models achieve only 48.3% accuracy in user profiling, struggling particularly with long-horizon retrieval and cross-modal reasoning within dense personal file systems. Furthermore, our step-wise failure diagnosis identifies multimodal perception and evidence grounding as the primary bottlenecks. Ultimately, HippoCamp exposes the critical limitations of current agents in realistic, user-centric environments and provides a robust foundation for developing next-generation personal AI assistants.