cs.CV
[1] Clinical Cognition Alignment for Gastrointestinal Diagnosis with Multimodal LLMs
Huan Zheng, Yucheng Zhou, Tianyi Yan, Dubing Chen, Hongbo Lu, Wenlong Liao, Tao He, Pai Peng, Jianbing Shen
🧩 TL;DR
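This paper proposes the Clinical-Cognitive-Aligned (CogAlign) framework. It builds a hierarchical clinical cognition dataset and applies supervised fine-tuning to give the model rigorous clinical analysis capabilities, then uses a counterfactual-driven reinforcement learning strategy to eliminate visual bias, significantly improving diagnostic accuracy in gastrointestinal endoscopy.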
📘 Detailed Summary
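Motivation: Multimodal large language models show great potential in medical image analysis, but two key limitations hold them back in gastrointestinal endoscopy: general model reasoning is misaligned with standardized clinical cognitive pathways, and visual features lack a causal association with diagnostic outcomes.
Method: The approach has two core components. First, a hierarchical clinical cognition dataset is constructed and used for supervised fine-tuning, internalizing the experts' hierarchical diagnostic logic, from anatomical localization and morphological evaluation to microvascular analysis, in the model. Second, a counterfactual-driven reinforcement learning strategy generates counterfactual normal samples by masking lesions and optimizes with clinical-cognition-centric rewards, constraining the model to base its diagnosis strictly on causal lesion features.
Result: Extensive experiments show that the method achieves state-of-the-art performance on multiple benchmarks and significantly improves diagnostic accuracy in complex clinical scenarios. All source code and datasets will be made publicly available.
Conclusion: Through clinical cognition alignment and causal rectification, the study effectively addresses key limitations of multimodal large language models in medical image analysis, offers a more reliable and interpretable solution for gastrointestinal endoscopy diagnosis, and demonstrates the theoretical value of counterfactual learning for eliminating visual bias.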
📄 Abstract
Multimodal Large Language Models (MLLMs) have demonstrated remarkable potential in medical image analysis. However, their application in gastrointestinal endoscopy is currently hindered by two critical limitations: the misalignment between general model reasoning and standardized clinical cognitive pathways, and the lack of causal association between visual features and diagnostic outcomes. In this paper, we propose a novel Clinical-Cognitive-Aligned (CogAlign) framework to address these challenges. First, we endow the model with rigorous clinical analytical capabilities by constructing the hierarchical clinical cognition dataset and employing Supervised Fine-Tuning (SFT). Unlike conventional approaches, this strategy internalizes the hierarchical diagnostic logic of experts, ranging from anatomical localization and morphological evaluation to microvascular analysis, directly into the model. Second, to eliminate visual bias, we provide a theoretical analysis demonstrating that standard supervised tuning inevitably converges to spurious background correlations. Guided by this insight, we propose a counterfactual-driven reinforcement learning strategy to enforce causal rectification. By generating counterfactual normal samples via lesion masking and optimizing through clinical-cognition-centric rewards, we constrain the model to strictly ground its diagnosis in causal lesion features. Extensive experiments demonstrate that our approach achieves State-of-the-Art (SoTA) performance across multiple benchmarks, significantly enhancing diagnostic accuracy in complex clinical scenarios. All source code and datasets will be made publicly available.
[2] VGS-Decoding: Visual Grounding Score Guided Decoding for Hallucination Mitigation in Medical VLMs
Govinda Kolli, Adinath Madhavrao Dukre, Behzad Bozorgtabar, Dwarikanath Mahapatra, Imran Razzak
🧩 TL;DR
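This paper proposes Visual Grounding Score Guided Decoding (VGS-Decoding), a training-free method that quantifies visual dependency by comparing token probability distributions under the original and distorted images, then adaptively amplifies visually grounded tokens and suppresses hallucinations during decoding, effectively mitigating hallucination in medical vision-language models.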
📘 Detailed Summary
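Motivation: Medical vision-language models suffer from serious hallucination in clinical applications: they generate responses from language priors rather than visual evidence, which can put clinical decisions at risk. Existing methods usually require additional training or use fixed-weight contrastive strategies, and lack adaptive control over each token's visual dependency.
Method: The core of the method is the Visual Grounding Score (VGS), which quantifies each token's visual dependency by comparing its probability distribution conditioned on the original image with that conditioned on a distorted image. During decoding, token probabilities are reweighted according to the VGS, adaptively amplifying visually grounded tokens and suppressing hallucinated ones, with no additional training and only about 2× inference overhead.
Result: On the MIMIC-Diff-VQA and VQA-RAD datasets, experiments with LLaVA-Med, CheXagent, and MedGemma show consistent improvements, with up to +9.12% overall gain and +8.98% in open-ended recall, at about 2× inference overhead and without additional training.
Conclusion: The method provides a training-free, computationally efficient way to mitigate hallucination. Its per-token adaptive control outperforms fixed-weight contrastive methods and makes it practical for clinical deployment. The study also shows that hallucinated and visually grounded tokens follow different probability-change patterns under image distortion, offering a new perspective on VLM hallucination mechanisms.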
📄 Abstract
Medical Vision-Language Models (VLMs) often hallucinate by generating responses based on language priors rather than visual evidence, posing risks in clinical applications. We propose Visual Grounding Score Guided Decoding (VGS-Decoding), a training-free method to mitigate hallucinations during inference. Our key insight is that hallucinated tokens maintain or increase their probability when visual information is degraded, while visually grounded tokens decrease in probability. We introduce the Visual Grounding Score (VGS), which measures each token's visual dependency by comparing distributions from original and distorted images. During decoding, we reweight probabilities by amplifying visually grounded tokens while suppressing hallucinations. Unlike fixed-weight contrastive methods, VGS-Decoding provides per-token adaptive control. Experiments on MIMIC-Diff-VQA and VQA-RAD across LLaVA-Med, CheXagent, and MedGemma demonstrate consistent improvements, with up to +9.12% overall gain and +8.98% in open-ended recall, while introducing only 2× inference overhead and no additional training, making it practical for clinical deployment. Upon acceptance, code will be released publicly to facilitate reproducibility.
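The reweighting idea can be illustrated with a small sketch. The abstract does not give the VGS formula, so everything here is an assumption made for illustration: the score is taken as a normalized probability difference in [-1, 1], the weight is 1 + alpha * score, and `alpha` and the function name `vgs_reweight` are hypothetical.

```python
def vgs_reweight(p_orig, p_dist, alpha=0.5, eps=1e-12):
    """Reweight next-token probabilities with an assumed grounding score.

    p_orig: probabilities conditioned on the original image
    p_dist: probabilities conditioned on the distorted image
    alpha:  hypothetical strength in [0, 1); keeps every weight positive
    """
    # Assumed score: positive when a token loses probability under distortion
    # (visually grounded), negative when it gains probability.
    scores = [(po - pd) / (po + pd + eps) for po, pd in zip(p_orig, p_dist)]
    # Per-token multiplicative weight, followed by renormalization.
    weighted = [po * (1.0 + alpha * s) for po, s in zip(p_orig, scores)]
    total = sum(weighted)
    return [w / total for w in weighted], scores


p_orig = [0.5, 0.4, 0.1]   # token 1 is favoured mostly by the language prior
p_dist = [0.2, 0.7, 0.1]   # it becomes even more likely once the image is degraded
new_p, scores = vgs_reweight(p_orig, p_dist)
```

In this toy case the first token, whose probability drops under distortion, gains mass, while the second token is suppressed. Each decoding step needs one pass with the original image and one with the distorted image, which is consistent with the roughly 2× overhead reported above.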
[3] Predictive Regularization Against Visual Representation Degradation in Multimodal Large Language Models
Enguang Wang, Qiang Wang, Yuanchen Wu, Ke Yan, Xinbin Yuan, Shouhong Ding, Xialei Liu, Ming-Ming Cheng
🧩 TL;DR
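This paper reveals that multimodal large language models suffer visual representation degradation during language-driven training, and proposes Predictive Regularization to preserve the quality of internal visual features, thereby improving performance on vision-language tasks.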
📘 Detailed Summary
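Motivation: Although multimodal large language models perform well on vision-language tasks, the effect of their language-driven training on internal visual foundational competence remains unclear. The study aims to diagnose and address visual representation degradation in MLLMs: compared with the initial visual features, the visual representations in the middle layers degrade in both global function and patch structure, a visual sacrifice driven by the single text-generation objective.
Method: The study first performs a detailed diagnostic analysis of MLLMs, revealing that visual representation degradation is pervasive. To address it, the authors propose Predictive Regularization (PRe), which forces the degraded intermediate features to predict the initial visual features and thereby maintains the inherent visual attributes of the MLLM's internal representations. The method aims to balance cross-modal reasoning and core visual competence.
Result: Extensive experiments confirm that mitigating visual representation degradation effectively improves vision-language performance. Preserving the quality of internal visual representations through Predictive Regularization significantly improves MLLM performance on vision-language tasks, confirming the importance of robust internal visual representations for comprehensive multimodal understanding.
Conclusion: The study emphasizes that a robust MLLM needs both strong cross-modal reasoning and core visual competence. Visual representation degradation exposes the negative effect of a single text-generation objective on visual fidelity, and Predictive Regularization offers an effective remedy, providing insight for building more comprehensive multimodal understanding systems.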
📄 Abstract
While Multimodal Large Language Models (MLLMs) excel at vision-language tasks, the cost of their language-driven training on internal visual foundational competence remains unclear. In this paper, we conduct a detailed diagnostic analysis to unveil a pervasive issue: visual representation degradation in MLLMs. Specifically, we find that, compared to the initial visual features, the visual representation in the middle layers of the LLM exhibits degradation in both global function and patch structure. We attribute this phenomenon to a visual sacrifice driven by the singular text-generation objective, where the model compromises its visual fidelity to optimize for answer generation. We argue that a robust MLLM requires both strong cross-modal reasoning and core visual competence, and propose Predictive Regularization (PRe) to force degraded intermediate features to predict initial visual features, thereby maintaining the inherent visual attributes of the MLLM's internal representations. Extensive experiments confirm that mitigating this visual degradation effectively boosts vision-language performance, underscoring the critical importance of fostering robust internal visual representations within MLLMs for comprehensive multimodal understanding.
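A rough sketch of what such a regularizer might look like is given below. The abstract states only that intermediate features are forced to predict the initial visual features; the cosine-distance form, the `predictor` callable, and the weight `lam` are assumptions chosen for illustration, not the paper's definition.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def predictive_reg_loss(intermediate, initial, predictor):
    """Mean cosine distance between predicted and initial visual features.

    intermediate: per-patch features taken from a middle LLM layer
    initial:      per-patch features from the vision encoder (the targets)
    predictor:    hypothetical callable mapping an intermediate feature
                  into the initial feature space
    """
    distances = [1.0 - cosine(predictor(h), v) for h, v in zip(intermediate, initial)]
    return sum(distances) / len(distances)


def total_loss(text_loss, reg_loss, lam=0.1):
    # lam is an assumed weighting between text generation and regularization.
    return text_loss + lam * reg_loss


identity = lambda h: h                     # stand-in for a learned predictor
initial = [[1.0, 0.0], [0.0, 1.0]]
preserved = [[2.0, 0.0], [0.0, 3.0]]       # same directions as the targets
degraded = [[0.0, 1.0], [1.0, 0.0]]        # orthogonal to the targets
```

With these toy features the regularizer is zero when the intermediate features still point in the direction of the initial ones and reaches one when they are orthogonal, so adding it to the text loss penalizes drift away from the original visual content.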
[4] Consistent but Dangerous: Per-Sample Safety Classification Reveals False Reliability in Medical Vision-Language Models
Binesh Sadanandan, Vahid Behzadan
🧩 TL;DR
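The study exposes a fundamental flaw in consistency-based evaluation of medical vision-language models, proposes a four-quadrant safety taxonomy that jointly evaluates consistency and image reliance, and finds that after LoRA fine-tuning models produce a large share of "Dangerous" samples that are highly accurate yet unreliable.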
📘 Detailed Summary
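Motivation: In current deployments of medical vision-language models, consistency under semantically equivalent prompts is widely used as a proxy for reliability. The study argues that this proxy is fundamentally flawed: a model can achieve perfect consistency by relying on text patterns rather than the input image, which masks the real reliability problem.
Method: The study proposes a four-quadrant per-sample safety taxonomy that jointly evaluates consistency (stable predictions under paraphrased prompts) and image reliance (predictions that change when the image is removed), classifying samples as Ideal, Fragile, Dangerous, or Worst. Five medical VLM configurations are evaluated on two chest X-ray datasets.
Result: LoRA fine-tuning sharply reduces flip rates but moves a majority of samples into the Dangerous quadrant: LLaVA-Rad Base reaches a 1.5% flip rate on PadChest while 98.5% of its samples are Dangerous. These samples show accuracy of up to 99.6% and low entropy, so standard confidence-based screening cannot detect them.
Conclusion: The study finds a negative correlation between flip rate and the fraction of Dangerous samples, and recommends that deployment evaluations pair consistency checks with a text-only baseline, a single additional forward pass that exposes the false reliability trap. This has practical significance for the safe deployment of medical AI systems.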
📄 Abstract
Consistency under paraphrase, the property that semantically equivalent prompts yield identical predictions, is increasingly used as a proxy for reliability when deploying medical vision-language models (VLMs). We show this proxy is fundamentally flawed: a model can achieve perfect consistency by relying on text patterns rather than the input image. We introduce a four-quadrant per-sample safety taxonomy that jointly evaluates consistency (stable predictions across paraphrased prompts) and image reliance (predictions that change when the image is removed). Samples are classified as Ideal (consistent and image-reliant), Fragile (inconsistent but image-reliant), Dangerous (consistent but not image-reliant), or Worst (inconsistent and not image-reliant). Evaluating five medical VLM configurations across two chest X-ray datasets (MIMIC-CXR, PadChest), we find that LoRA fine-tuning dramatically reduces flip rates but shifts a majority of samples into the Dangerous quadrant: LLaVA-Rad Base achieves a 1.5% flip rate on PadChest while 98.5% of its samples are Dangerous. Critically, Dangerous samples exhibit high accuracy (up to 99.6%) and low entropy, making them invisible to standard confidence-based screening. We observe a negative correlation between flip rate and Dangerous fraction (r = -0.89, n=10) and recommend that deployment evaluations always pair consistency checks with a text-only baseline: a single additional forward pass that exposes the false reliability trap.
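The four quadrants follow directly from the two binary properties defined in the abstract, so the taxonomy can be sketched in a few lines. The function and argument names are hypothetical, and the way predictions are compared (exact equality of answer strings) is an assumption.

```python
from collections import Counter


def classify_sample(paraphrase_preds, image_pred, text_only_pred):
    """Assign one sample to a quadrant of the safety taxonomy.

    paraphrase_preds: predictions for semantically equivalent prompts (with image)
    image_pred:       prediction for the reference prompt with the image
    text_only_pred:   prediction for the same prompt with the image removed
    """
    consistent = len(set(paraphrase_preds)) == 1
    image_reliant = image_pred != text_only_pred
    if consistent and image_reliant:
        return "Ideal"
    if not consistent and image_reliant:
        return "Fragile"
    if consistent and not image_reliant:
        return "Dangerous"
    return "Worst"


def quadrant_fractions(samples):
    """Share of samples per quadrant; each sample is an argument tuple."""
    counts = Counter(classify_sample(*s) for s in samples)
    return {q: counts[q] / len(samples) for q in ("Ideal", "Fragile", "Dangerous", "Worst")}


samples = [
    (["yes", "yes", "yes"], "yes", "no"),    # stable and changes without the image
    (["yes", "no", "yes"], "yes", "no"),     # unstable but image-reliant
    (["yes", "yes", "yes"], "yes", "yes"),   # stable, yet ignores the image
    (["no", "yes", "no"], "no", "no"),       # unstable and ignores the image
]
fractions = quadrant_fractions(samples)
```

The text-only prediction is the single additional forward pass the authors recommend: without it, the first and third samples above are indistinguishable by consistency alone.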
[5] CoVFT: Context-aware Visual Fine-tuning for Multimodal Large Language Models
Nan Zhou, Huiqun Wang, Yaoyan Zheng, Di Huang
🧩 TL;DR
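This paper proposes the Context-aware Visual Fine-tuning (CoVFT) framework, which explicitly incorporates multimodal context into visual adaptation to address the instability of vision encoder fine-tuning in multimodal large language models. The method achieves state-of-the-art performance on 12 multimodal benchmarks and significantly improves model efficiency.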
📘 Detailed Summary
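Motivation: Multimodal large language models have made remarkable progress in cross-modal perception and reasoning, but a fundamental question remains unresolved: should the vision encoder be fine-tuned or frozen? Existing visual fine-tuning methods perform inconsistently across multimodal tasks and cannot reliably beat the frozen baseline. This instability stems from visual preference conflicts, in which the context-agnostic nature of the vision encoder leads to divergent parameter updates under different multimodal contexts.
Method: The CoVFT framework combines a Context Vector Extraction (CVE) module and a Contextual Mixture-of-Experts (CoMoE) module to incorporate multimodal context explicitly into visual adaptation. The framework decomposes conflicting optimization signals and enables stable, context-sensitive updates to the visual parameters, resolving the visual preference conflicts.
Result: Extensive experiments on 12 multimodal benchmarks show that CoVFT achieves state-of-the-art performance with superior stability. Notably, a 7B MLLM fine-tuned with CoVFT surpasses the average performance of its 13B counterpart, revealing substantial untapped potential in vision encoder optimization within MLLMs.
Conclusion: The study identifies a key challenge in fine-tuning the vision encoder of multimodal large language models and proposes an effective solution. CoVFT improves both model performance and training stability, points to a new direction for optimizing the visual components of multimodal models, and shows that appropriate context integration can unlock the optimization potential of the vision encoder.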
📄 Abstract
Multimodal large language models (MLLMs) have achieved remarkable progress in cross-modal perception and reasoning, yet a fundamental question remains unresolved: should the vision encoder be fine-tuned or frozen? Despite the success of models such as LLaVA and Qwen-VL, inconsistent design choices and heterogeneous training setups hinder a unified understanding of visual fine-tuning (VFT) in MLLMs. Through a configuration-aligned benchmark, we find that existing VFT methods fail to consistently outperform the frozen baseline across multimodal tasks. Our analysis suggests that this instability arises from visual preference conflicts, where the context-agnostic nature of vision encoders induces divergent parameter updates under diverse multimodal contexts. To address this issue, we propose the Context-aware Visual Fine-tuning (CoVFT) framework, which explicitly incorporates multimodal context into visual adaptation. By integrating a Context Vector Extraction (CVE) module and a Contextual Mixture-of-Experts (CoMoE) module, CoVFT decomposes conflicting optimization signals and enables stable, context-sensitive visual updates. Extensive experiments on 12 multimodal benchmarks demonstrate that CoVFT achieves state-of-the-art performance with superior stability. Notably, fine-tuning a 7B MLLM with CoVFT surpasses the average performance of its 13B counterpart, revealing substantial untapped potential in visual encoder optimization within MLLMs.
[6] Rethinking Token Reduction for Large Vision-Language Models
Yi Wang, Haofei Zhang, Qihan Huang, Anda Cao, Gongfan Fang, Wei Wang, Xuan Jin, Jie Song, Mingli Song, Xinchao Wang
🧩 TL;DR
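This paper proposes MetaCompress, a learning-based, prompt-agnostic visual token compression method that addresses the limitations of existing token reduction strategies in multi-turn visual question answering and achieves a better efficiency-accuracy trade-off.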
📘 Detailed Summary
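Motivation: Large vision-language models incur high inference costs because of excessive visual tokens. Existing token reduction methods mainly target single-turn visual question answering, while the more practical multi-turn setting poses additional challenges: subsequent questions are unknown in advance and may refer to arbitrary image regions. Existing prompt-dependent and prompt-agnostic methods based on heuristic metrics both perform poorly.
Method: MetaCompress formulates token reduction as a learnable compression mapping, unifying existing formats such as pruning and merging into a single learning objective, and introduces a data-efficient training paradigm that learns optimal compression mappings at limited computational cost. It is a learning-based, prompt-agnostic method.
Result: Extensive experiments on multi-turn VQA benchmarks and several LVLM architectures show that MetaCompress achieves superior efficiency-accuracy trade-offs while generalizing well across dialogue turns, significantly outperforming existing heuristic methods.
Conclusion: The study demonstrates the effectiveness of learned compression mappings in multi-turn visual question answering, provides a unified framework for visual token reduction that overcomes the limitations of heuristic designs, and offers a workable path to efficient deployment of practical multi-turn visual dialogue systems.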
📄 Abstract
Large Vision-Language Models (LVLMs) excel in visual understanding and reasoning, but the excessive visual tokens lead to high inference costs. Although recent token reduction methods mitigate this issue, they mainly target single-turn Visual Question Answering (VQA), leaving the more practical multi-turn VQA (MT-VQA) scenario largely unexplored. MT-VQA introduces additional challenges, as subsequent questions are unknown beforehand and may refer to arbitrary image regions, making existing reduction strategies ineffective. Specifically, current approaches fall into two categories: prompt-dependent methods, which bias toward the initial text prompt and discard information useful for subsequent turns; prompt-agnostic ones, which, though technically applicable to multi-turn settings, rely on heuristic reduction metrics such as attention scores, leading to suboptimal performance. In this paper, we propose a learning-based prompt-agnostic method, termed MetaCompress, overcoming the limitations of heuristic designs. We begin by formulating token reduction as a learnable compression mapping, unifying existing formats such as pruning and merging into a single learning objective. Upon this formulation, we introduce a data-efficient training paradigm capable of learning optimal compression mappings with limited computational costs. Extensive experiments on MT-VQA benchmarks and across multiple LVLM architectures demonstrate that MetaCompress achieves superior efficiency-accuracy trade-offs while maintaining strong generalization across dialogue turns. Our code is available at https://github.com/MArSha1147/MetaCompress.
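One way to see how pruning and merging can share a single formulation is to write the reduction as a k × n matrix applied to the n visual tokens. The matrix form below is an assumption made for illustration (the abstract says only "a learnable compression mapping"), and the mappings are hand-written here rather than learned.

```python
def compress(tokens, mapping):
    """Apply a k x n compression mapping to n tokens of dimension d."""
    dim = len(tokens[0])
    return [
        [sum(w * tok[j] for w, tok in zip(row, tokens)) for j in range(dim)]
        for row in mapping
    ]


# Four toy visual tokens of dimension 2.
tokens = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]

# Pruning: each row is one-hot, so the output keeps tokens 0 and 2 unchanged.
prune = [[1, 0, 0, 0],
         [0, 0, 1, 0]]

# Merging: each row averages a pair of neighbouring tokens.
merge = [[0.5, 0.5, 0.0, 0.0],
         [0.0, 0.0, 0.5, 0.5]]

pruned = compress(tokens, prune)
merged = compress(tokens, merge)
```

Under this view a learned mapping is any matrix between these two extremes, and because it does not depend on the text prompt, the same compressed tokens can be reused across dialogue turns.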
[7] CVT-Bench: Counterfactual Viewpoint Transformations Reveal Unstable Spatial Representations in Multimodal LLMs
Shanmukha Vellamcheti, Uday Kiran Kothapalli, Disharee Bhowmick, Sathyanarayanan N. Aakur
🧩 TL;DR
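This paper introduces a diagnostic benchmark that evaluates the consistency of spatial relations in multimodal large language models under counterfactual viewpoint changes. Despite strong single-view spatial reasoning, existing models degrade systematically under viewpoint transformations, while adding representational structure improves stability.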
📘 Detailed Summary
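Motivation: Multimodal large language models perform well on single-view spatial reasoning tasks, but it is unclear whether they maintain stable spatial state representations under counterfactual viewpoint changes. The study evaluates relational consistency under hypothetical camera orbit transformations, without re-rendering images, to determine whether single-view spatial accuracy overestimates the robustness of the induced spatial representations.
Method: The study introduces a controlled diagnostic benchmark with 100 synthetic scenes and 6,000 relational queries that evaluates models under hypothetical camera orbit transformations. It measures viewpoint consistency, 360° cycle agreement, and relational stability over sequential transformations, and compares several input representations: visual input, textual bounding boxes, and structured scene graphs.
Result: Despite high single-view accuracy, state-of-the-art multimodal large language models degrade systematically under counterfactual viewpoint changes, frequently violate cycle consistency, and lose relational stability rapidly. Adding representational structure, such as structured scene graphs, significantly improves stability, whereas visual input and textual bounding boxes are less stable.
Conclusion: The results indicate that single-view spatial accuracy overestimates the robustness of induced spatial representations and that representation structure plays a critical role in counterfactual spatial reasoning. This offers insight for developing more robust spatial reasoning models and underlines the importance of structured representations when handling viewpoint changes.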
📄 Abstract
Multimodal large language models (MLLMs) achieve strong performance on single-view spatial reasoning tasks, yet it remains unclear whether they maintain stable spatial state representations under counterfactual viewpoint changes. We introduce a controlled diagnostic benchmark that evaluates relational consistency under hypothetical camera orbit transformations without re-rendering images. Across 100 synthetic scenes and 6,000 relational queries, we measure viewpoint consistency, 360° cycle agreement, and relational stability over sequential transformations. Despite high single-view accuracy, state-of-the-art MLLMs exhibit systematic degradation under counterfactual viewpoint changes, with frequent violations of cycle consistency and rapid decay in relational stability. We further evaluate multiple input representations (visual input, textual bounding boxes, and structured scene graphs) and show that increasing representational structure improves stability. Our results suggest that single-view spatial accuracy overestimates the robustness of induced spatial representations and that representation structure plays a critical role in counterfactual spatial reasoning.
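The notion of cycle agreement can be made concrete with a toy geometric oracle. The sketch below assumes a camera orbiting the scene centre on the ground plane and a single "left of" relation; the benchmark's actual scene format, relation set, and scoring are not described in the abstract, so the function names and the metric definition are illustrative only.

```python
import math


def left_of(a, b, azimuth_deg):
    """True if object a appears left of object b for a camera orbiting the origin.

    a, b: (x, y) ground-plane positions. The camera sits at the given azimuth
    and looks at the scene centre, with the z axis pointing up.
    """
    theta = math.radians(azimuth_deg)
    right = (-math.sin(theta), math.cos(theta))   # camera's right-hand axis
    project = lambda p: p[0] * right[0] + p[1] * right[1]
    return project(a) < project(b)


def cycle_agreement(answers_start, answers_after_full_orbit):
    """Fraction of queries answered identically before and after a 360° orbit."""
    pairs = list(zip(answers_start, answers_after_full_orbit))
    return sum(x == y for x, y in pairs) / len(pairs)


a, b = (-1.0, 0.0), (1.0, 0.0)
front = left_of(a, b, -90)         # camera in front of the scene: a is on the left
back = left_of(a, b, 90)           # half an orbit later the relation flips
full = left_of(a, b, -90 + 360)    # a full orbit must restore the original answer
```

A model with a stable spatial state should behave like this oracle: its answers after a hypothetical 360° orbit should match its answers at the start, so any `cycle_agreement` value below 1.0 signals an inconsistency.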
[8] Seeing is Improving: Visual Feedback for Iterative Text Layout Refinement
Junrong Guo, Shancheng Fang, Yadong Qu, Hongtao Xie
🧩 TL;DR
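This paper proposes the Visual Feedback Layout Model (VFLM), a self-improving framework that refines layouts iteratively using visual feedback. It addresses the problem that existing code-based layout generation methods ignore the rendered visual result and therefore struggle to guarantee readability and aesthetics.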
📘 Detailed Summary
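Motivation: Existing MLLM-based layout generation methods typically follow a code generation paradigm: the generated code is rendered into the final image by a graphics engine, but the model has no perception of the rendered result, making it difficult to guarantee the readability and aesthetics of the generated layout.
Method: VFLM is a reinforcement-learning-based self-improving framework that refines its output iteratively through visual feedback, giving it adaptive reflective generation. It uses a visually grounded reward model that incorporates OCR accuracy and rewards only the final generated outcome, which effectively stimulates the model's iterative and reflective generation capabilities.
Result: Experiments on multiple benchmarks show that VFLM consistently outperforms advanced multimodal large language models, existing layout models, and code-only baselines, confirming the critical role of visual feedback for design-oriented MLLMs.
Conclusion: The study establishes the critical role of visual feedback in layout generation, offers a new optimization direction for design-oriented multimodal large language models, and achieves markedly better layout quality through its iterative reflection mechanism.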
📄 Abstract
Recent advances in Multimodal Large Language Models (MLLMs) have enabled automated generation of structured layouts from natural language descriptions. Existing methods typically follow a code-only paradigm that generates code to represent layouts, which are then rendered by graphic engines to produce final images. However, they are blind to the rendered visual outcome, making it difficult to guarantee readability and aesthetics. In this paper, we identify visual feedback as a critical factor in layout generation and propose Visual Feedback Layout Model (VFLM), a self-improving framework that leverages visual feedback for iterative refinement. VFLM is capable of performing adaptive reflective generation, which leverages visual information to reflect on previous issues and iteratively generates outputs until satisfactory quality is achieved. This is achieved through reinforcement learning with a visually grounded reward model that incorporates OCR accuracy. By rewarding only the final generated outcome, we can effectively stimulate the model's iterative and reflective generative capabilities. Experiments across multiple benchmarks show that VFLM consistently outperforms advanced MLLMs, existing layout models, and code-only baselines, establishing visual feedback as critical for design-oriented MLLMs. Our code and data are available at https://github.com/FolSpark/VFLM.
[9] Learning Trajectory-Aware Multimodal Large Language Models for Video Reasoning Segmentation
Jingnan Luo, Mingqi Gao, Jun Liu, Bin-Bin Gao, Feng Zheng
🧩 TL;DR
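This paper proposes TrajSeg, a framework that strengthens the trajectory awareness of multimodal large language models for video dynamics through bidirectional text-trajectory alignment. It delivers a unified, end-to-end trainable architecture for video reasoning segmentation and achieves state-of-the-art performance on multiple benchmarks.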
📘 Detailed Summary
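Motivation: Existing video reasoning segmentation methods rely on unidirectional, implicit text-trajectory alignment and struggle to perceive object trajectories under severe video dynamics, which limits the performance of multimodal large language models on video understanding tasks.
Method: TrajSeg introduces bidirectional text-trajectory alignment with two instruction modes: text-to-trajectory grounding and trajectory-to-text captioning. A frame-level content integration module adapts trajectory-level tokens to frame-specific information, and a unified mask decoder handles segmentation for all frames, yielding a simplified, end-to-end trainable architecture.
Result: Extensive experiments on referring and reasoning video segmentation datasets show that TrajSeg outperforms existing video reasoning segmentation methods on all metrics, confirming that bidirectional alignment improves trajectory perception and segmentation performance.
Conclusion: The study shows that bidirectional text-trajectory alignment significantly strengthens the ability of multimodal large language models to perceive video dynamics, and that the unified end-to-end architecture simplifies training and improves performance, offering an effective new solution for video understanding tasks.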
📄 Abstract
The prosperity of Multimodal Large Language Models (MLLMs) has stimulated the demand for video reasoning segmentation, which aims to segment video objects based on human instructions. Previous studies rely on unidirectional and implicit text-trajectory alignment, which struggles with trajectory perception when faced with severe video dynamics. In this work, we propose TrajSeg, a simple and unified framework built upon MLLMs. Concretely, we introduce bidirectional text-trajectory alignment, where MLLMs accept grounding-intended (text-to-trajectory) and captioning-intended (trajectory-to-text) instructions. This way, MLLMs can benefit from enhanced correspondence and better perceive object trajectories in videos. The mask generation from trajectories is achieved via a frame-level content integration (FCI) module and a unified mask decoder. The former adapts the MLLM-parsed trajectory-level token to frame-specific information. The latter unifies segmentation for all frames into a single structure, enabling the proposed framework to be simplified and end-to-end trainable. Extensive experiments on referring and reasoning video segmentation datasets demonstrate the effectiveness of TrajSeg, which outperforms all video reasoning segmentation methods on all metrics. The code will be publicly available at https://github.com/haodi19/TrajSeg.
[10] VIGIL: Part-Grounded Structured Reasoning for Generalizable Deepfake Detection
Xinghan Li, Junhao Xu, Jingjing Chen
🧩 TL;DR
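This paper proposes VIGIL, a part-centric structured forensic framework that achieves interpretable deepfake detection through a plan-then-examine pipeline. The method uses a stage-gated injection mechanism and a progressive three-stage training paradigm, and significantly outperforms existing methods on the OmniFake benchmark.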
📘 Detailed Summary
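Motivation: Current MLLM-based deepfake detection methods combine evidence generation and manipulation localization into a single step, which blurs the boundary between faithful observations and hallucinated explanations and leads to unreliable conclusions. The study aims to fix this structural problem in the reasoning process and improve the interpretability and reliability of detection.
Method: VIGIL takes a part-centric structured forensic approach with a plan-then-examine pipeline: the model first plans which facial parts to inspect based on global visual cues, then examines each part using independently sourced forensic evidence. A stage-gated injection mechanism ensures that part-level forensic evidence is injected only during the examination stage, so that external signals do not bias part selection. A progressive three-stage training paradigm is also proposed, whose reinforcement learning stage uses part-aware rewards to enforce anatomical validity and evidence-conclusion coherence.
Result: On the newly constructed OmniFake benchmark, a hierarchical five-level benchmark, VIGIL is trained on only three foundational generators yet consistently outperforms expert detectors and concurrent MLLM-based methods at every generalization level. Cross-dataset evaluations further confirm the framework's robustness and generalization, with strong performance in progressive testing from foundational generators up to in-the-wild social media data.
Conclusion: By separating evidence generation from manipulation localization in a structured forensic framework, VIGIL significantly improves the interpretability and reliability of deepfake detection. The study highlights the importance of expert-inspired structured reasoning pipelines in MLLMs and offers insight for the design of future interpretable AI systems, especially in domains that require reliable forensic analysis.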
📄 Abstract
Multimodal large language models (MLLMs) offer a promising path toward interpretable deepfake detection by generating textual explanations. However, the reasoning process of current MLLM-based methods combines evidence generation and manipulation localization into a unified step. This combination blurs the boundary between faithful observations and hallucinated explanations, leading to unreliable conclusions. To address this, we present VIGIL, a part-centric structured forensic framework inspired by expert forensic practice through a plan-then-examine pipeline: the model first plans which facial parts warrant inspection based on global visual cues, then examines each part with independently sourced forensic evidence. A stage-gated injection mechanism delivers part-level forensic evidence only during examination, ensuring that part selection remains driven by the model's own perception rather than biased by external signals. We further propose a progressive three-stage training paradigm whose reinforcement learning stage employs part-aware rewards to enforce anatomical validity and evidence-conclusion coherence. To enable rigorous generalizability evaluation, we construct OmniFake, a hierarchical 5-Level benchmark where the model, trained on only three foundational generators, is progressively tested up to in-the-wild social-media data. Extensive experiments on OmniFake and cross-dataset evaluations demonstrate that VIGIL consistently outperforms both expert detectors and concurrent MLLM-based methods across all generalizability levels.
[11] Getting to the Point: Why Pointing Improves LVLMs
Simone Alghisi, Massimo Rizzoli, Seyed Mahed Mousavi, Giuseppe Riccardi
🧩 TL;DR
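This study examines the cognitive role of pointing in large vision-language models. Using a zero-shot counting task, it shows that Point-then-Count generalizes better out of distribution than Direct Counting, and that coordinate prediction improves performance by supplying spatial information.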
📘 Detailed Summary
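Motivation: Although pointing has been shown to improve the accuracy of large vision-language models, the mechanism behind these gains and its relevance to cognitive tasks remain unclear. In addition, the reliability of the intermediate point predictions is understudied, which limits their usefulness as visual explanations. The study investigates the role and reliability of pointing through a cognitive task: zero-shot counting from a visual scene.
Method: The study compares two fine-tuning approaches: Direct Counting, which predicts only the total number of objects, and Point-then-Count, which first predicts the coordinates of the target objects and then generates the count based on those coordinates. State-of-the-art large vision-language models are fine-tuned with each approach and compared on zero-shot counting for performance and generalization.
Result: Point-then-Count generalizes better out of distribution, suggesting that coordinates help models learn skills rather than overfit to narrow tasks. Predicted points are accurate in over 89% of cases (as measured by F1), but performance varies across image regions, revealing spatial biases. Mechanistic analyses show that the counting gains come from the spatial information encoded in the coordinates.
Conclusion: Pointing strengthens the cognitive abilities of large vision-language models by supplying explicit spatial information, particularly for out-of-distribution generalization. Although the predicted points are highly accurate, the spatial biases indicate room for further improvement. The findings provide an empirical basis for understanding the role of pointing in visual reasoning and support its use in explainable AI.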
📄 Abstract
Pointing increases the accuracy and explainability of Large Vision-Language Models (LVLMs) by modeling grounding and reasoning as explicit sequential steps. The model grounds the objects mentioned in the natural-language query by predicting their coordinates, and then generates an answer conditioned on these points. While pointing has been shown to increase LVLMs' accuracy, it is unclear which mechanism supports these gains and its relevance in cognitive tasks. In addition, the reliability of the intermediate points remains understudied, limiting their use as visual explanations. In this work, we study the role of pointing in a cognitive task: zero-shot counting from a visual scene. We fine-tune state-of-the-art LVLMs following two approaches: Direct Counting, where models only predict the total number of objects, and Point-then-Count, where LVLMs generate the target objects' coordinates followed by their count. The results show that Point-then-Count achieves higher out-of-distribution generalization, suggesting that coordinates help LVLMs learn skills rather than overfitting on narrow tasks. Although predicted points are accurately grounded in the image in over 89% of cases (as measured by F1), performance varies across image regions, revealing spatial biases. Finally, mechanistic analyses show that gains in counting arise from the spatial information encoded in the coordinates.
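To make the point-level F1 concrete, here is a minimal sketch of how predicted points might be scored against annotated ones. The abstract does not describe the matching procedure, so the greedy nearest-neighbour matching and the pixel `radius` are assumptions, as is deriving the count directly from the number of predicted points.

```python
import math


def point_f1(predicted, ground_truth, radius=5.0):
    """F1 between predicted and annotated points under greedy matching.

    A predicted point counts as a true positive if the nearest still-unmatched
    ground-truth point lies within `radius` (an assumed tolerance).
    """
    unmatched = list(ground_truth)
    true_positives = 0
    for p in predicted:
        best = min(unmatched, key=lambda g: math.dist(p, g), default=None)
        if best is not None and math.dist(p, best) <= radius:
            unmatched.remove(best)
            true_positives += 1
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(ground_truth)
    return 2 * precision * recall / (precision + recall)


ground_truth = [(10, 10), (50, 50), (90, 20)]
predicted = [(12, 11), (48, 52), (200, 200)]   # two good points, one stray point
count = len(predicted)                          # count read off the predicted points
f1 = point_f1(predicted, ground_truth)
```

Here the count happens to be correct even though one point is misplaced, which is why the reliability of the intermediate points has to be measured separately from counting accuracy.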
[12] Group3D: MLLM-Driven Semantic Grouping for Open-Vocabulary 3D Object Detection
Youbin Kim, Jinho Park, Hogun Park, Eunbyung Park
🧩 TL;DR
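This paper proposes Group3D, a multi-view open-vocabulary 3D detection framework that integrates semantic constraints directly into instance construction. It addresses the over-merging and fragmentation that arise in existing methods when geometric consistency dominates, and jointly enforces semantic and geometric consistency.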
📘 Detailed Summary
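Motivation: In multi-view RGB settings, existing open-vocabulary 3D detection methods usually decouple geometric instance construction from semantic labeling, so instance construction is constrained by geometric consistency alone, without semantic constraints. When geometric evidence is incomplete or view-dependent, this geometry-only merging causes irreversible association errors, including over-merging of distinct objects and fragmentation of a single instance.
Method: Group3D uses a multimodal large language model (MLLM) to maintain a scene-adaptive vocabulary and organizes it into semantic compatibility groups that encode plausible cross-view category equivalences. These groups act as merge-time constraints: 3D fragments are associated only when they satisfy both semantic compatibility and geometric consistency. This semantically gated merging supports both pose-known and pose-free settings and relies only on RGB observations.
Result: Experiments on ScanNet and ARKitScenes show that Group3D achieves state-of-the-art performance in multi-view open-vocabulary 3D detection and generalizes strongly in zero-shot scenarios, mitigating geometry-driven over-merging while absorbing multi-view category variability.
Conclusion: The study demonstrates the importance of integrating semantic constraints directly into instance construction. Semantic compatibility groups let geometry and semantics be optimized together, yielding a more robust framework for open-vocabulary 3D detection and showing an effective use of multimodal large language models in scene understanding.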
📄 Abstract
Open-vocabulary 3D object detection aims to localize and recognize objects beyond a fixed training taxonomy. In multi-view RGB settings, recent approaches often decouple geometry-based instance construction from semantic labeling, generating class-agnostic fragments and assigning open-vocabulary categories post hoc. While flexible, such decoupling leaves instance construction governed primarily by geometric consistency, without semantic constraints during merging. When geometric evidence is view-dependent and incomplete, this geometry-only merging can lead to irreversible association errors, including over-merging of distinct objects or fragmentation of a single instance. We propose Group3D, a multi-view open-vocabulary 3D detection framework that integrates semantic constraints directly into the instance construction process. Group3D maintains a scene-adaptive vocabulary derived from a multimodal large language model (MLLM) and organizes it into semantic compatibility groups that encode plausible cross-view category equivalence. These groups act as merge-time constraints: 3D fragments are associated only when they satisfy both semantic compatibility and geometric consistency. This semantically gated merging mitigates geometry-driven over-merging while absorbing multi-view category variability. Group3D supports both pose-known and pose-free settings, relying only on RGB observations. Experiments on ScanNet and ARKitScenes demonstrate that Group3D achieves state-of-the-art performance in multi-view open-vocabulary 3D detection, while exhibiting strong generalization in zero-shot scenarios. The project page is available at https://ubin108.github.io/Group3D/.
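The merge-time gate described above can be sketched as a conjunction of two tests. The abstract does not specify how geometric consistency is measured, so the axis-aligned 3D IoU, the threshold value, the fragment dictionaries, and the example groups below are all assumptions for illustration.

```python
def box_iou_3d(a, b):
    """IoU of two axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    intersection = 1.0
    for lo_a, hi_a, lo_b, hi_b in zip(a[:3], a[3:], b[:3], b[3:]):
        overlap = min(hi_a, hi_b) - max(lo_a, lo_b)
        if overlap <= 0:
            return 0.0
        intersection *= overlap
    volume = lambda box: (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])
    return intersection / (volume(a) + volume(b) - intersection)


def should_merge(frag_a, frag_b, groups, iou_threshold=0.25):
    """Merge two 3D fragments only if both the semantic and geometric tests pass."""
    semantic_ok = any(frag_a["label"] in g and frag_b["label"] in g for g in groups)
    geometric_ok = box_iou_3d(frag_a["box"], frag_b["box"]) >= iou_threshold
    return semantic_ok and geometric_ok


# Hypothetical compatibility groups, standing in for the MLLM-derived ones.
groups = [{"sofa", "couch"}, {"table", "desk"}]

sofa = {"label": "sofa", "box": (0.0, 0.0, 0.0, 2.0, 1.0, 1.0)}
couch = {"label": "couch", "box": (0.5, 0.0, 0.0, 2.5, 1.0, 1.0)}
table = {"label": "table", "box": (0.5, 0.0, 0.0, 2.5, 1.0, 1.0)}
far_couch = {"label": "couch", "box": (5.0, 0.0, 0.0, 7.0, 1.0, 1.0)}
```

The `table` fragment overlaps the sofa exactly as much as the `couch` fragment does, but it is rejected by the semantic gate; this is the over-merging case that geometry alone cannot rule out.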
[13] ACPO: Counteracting Likelihood Displacement in Vision-Language Alignment with Asymmetric Constraints
Kaili Huang, Hongming Zhang, Rui Shen, Linjun Dai, Jiahao Wang, Hanming Deng, Lewei Lu
🧩 TL;DR
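This paper proposes Asymmetric Constrained Preference Optimization (ACPO), a modality-agnostic alignment mechanism that addresses Likelihood Displacement in Direct Preference Optimization. ACPO uses dynamic, target-oriented scaling to asymmetrically suppress gradient flow on the rejected reward term, effectively mitigating Visual Anchor Collapse and hallucination in large vision-language models.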
📘 Detailed Summary
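Motivation: Direct Preference Optimization suffers from Likelihood Displacement when aligning large vision-language models: the probabilities of both chosen and rejected responses collapse. This optimization flaw is especially harmful in multimodal settings, where it causes Visual Anchor Collapse: the model abandons visual evidence in favor of strong language priors, producing significant hallucinations.
Method: The paper proposes Asymmetric Constrained Preference Optimization, a modality-agnostic alignment mechanism that applies dynamic, target-oriented scaling to preference optimization. ACPO derives a complexity-aware scaling coefficient that is applied exclusively to the rejected reward term, asymmetrically suppressing gradient flow on the rejected term while keeping the chosen distribution as a gradient-stable reference. Breaking this gradient symmetry is crucial for multimodal tasks because it mitigates the suppression of visual tokens by language priors.
Result: Experiments on InternVL models show that ACPO effectively reverses the chosen-reward degradation of standard DPO. By halting Visual Anchor Collapse, ACPO generally outperforms baselines on hallucination benchmarks (HallusionBench, MM-IFEval) and general leaderboards (MMBench, MMStar, OCRBenchV2), while also improving general capabilities.
Conclusion: As a general-purpose objective, ACPO resolves Likelihood Displacement in DPO through asymmetric gradient constraints and is particularly well suited to multimodal alignment. It reduces hallucination while preserving general capabilities, providing a more robust optimization framework for aligning vision-language models.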
📄 Abstract
While Direct Preference Optimization (DPO) has become the de facto approach for aligning Large Vision-Language Models (LVLMs), it suffers from Likelihood Displacement, where the probability of both chosen and rejected responses collapses. This optimization flaw is especially detrimental in multimodal settings: the erosion of chosen likelihoods, a failure we term Visual Anchor Collapse, causes models to abandon visual evidence for strong language priors, precipitating significant hallucinations. To address this, we propose Asymmetric Constrained Preference Optimization (ACPO), a modality-agnostic alignment mechanism that applies dynamic, target-oriented scaling to preference optimization. ACPO derives a complexity-aware scaling coefficient applied exclusively to the rejected reward, asymmetrically suppressing the gradient flow on the rejected term while preserving the chosen distribution as a gradient-stable reference. While fundamentally a general-purpose objective, breaking this gradient symmetry is crucial for multimodal tasks, as it mitigates the suppression of visual tokens by language priors. Experiments on InternVL models demonstrate that ACPO effectively reverses the chosen-reward degradation of standard DPO. By halting Visual Anchor Collapse, ACPO generally outperforms baselines on hallucination benchmarks (HallusionBench, MM-IFEval) and general leaderboards (MMBench, MMStar, OCRBenchV2) while driving concurrent improvements in general capabilities.
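The effect of scaling only the rejected reward can be checked numerically. The sketch below starts from the standard DPO loss, -log σ(r_chosen − r_rejected) with r = β(log π − log π_ref), and multiplies the rejected reward by a coefficient `alpha`. How ACPO derives its complexity-aware coefficient is not stated in the abstract, so `alpha` is simply passed in as an assumed constant, and the exact placement of the coefficient is likewise an assumption.

```python
import math


def preference_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected,
                    beta=0.1, alpha=1.0):
    """-log sigmoid(r_chosen - alpha * r_rejected); alpha = 1 is standard DPO."""
    r_chosen = beta * (logp_chosen - ref_chosen)
    r_rejected = beta * (logp_rejected - ref_rejected)
    margin = r_chosen - alpha * r_rejected
    return math.log1p(math.exp(-margin))


def grad_wrt_rejected(alpha, logp_rejected=-2.0, h=1e-6):
    """Central finite difference of the loss w.r.t. the rejected log-probability."""
    fixed = dict(logp_chosen=-1.0, ref_chosen=-1.2, ref_rejected=-1.5, alpha=alpha)
    up = preference_loss(logp_rejected=logp_rejected + h, **fixed)
    down = preference_loss(logp_rejected=logp_rejected - h, **fixed)
    return (up - down) / (2 * h)


dpo_loss = preference_loss(-1.0, -2.0, -1.2, -1.5)   # alpha = 1
grad_dpo = grad_wrt_rejected(alpha=1.0)
grad_scaled = grad_wrt_rejected(alpha=0.5)           # assumed smaller coefficient
```

With `alpha` below 1 the gradient that pushes down the rejected response shrinks, while the chosen term is left untouched, which is the asymmetry the method relies on.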
[14] VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding
Ruoliu Yang, Chu Wu, Caifeng Shan, Ran He, Chaoyou Fu
🧩 TL;DR
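This paper proposes VideoDetective, a framework that integrates query-to-segment relevance with inter-segment affinity and runs a Hypothesis-Verification-Refinement loop over a visual-temporal affinity graph. It sparsely localizes the key segments for long-video question answering and significantly improves the long-video understanding of multimodal large language models.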
📘 Detailed Summary
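Motivation: Limited context windows make long video understanding difficult for multimodal large language models, which must identify sparse query-relevant video segments. Existing methods localize mainly from the query and overlook the video's intrinsic structure and the varying relevance of different segments, which limits localization quality.
Method: VideoDetective divides the video into segments and builds a visual-temporal affinity graph from visual similarity and temporal proximity. A Hypothesis-Verification-Refinement loop estimates the relevance of observed segments to the query and propagates these scores to unobserved segments, producing a global relevance distribution that guides sparse localization of the key segments.
Result: The method yields substantial gains across mainstream multimodal large language models on representative benchmarks, with accuracy improvements of up to 7.5% on VideoMME-long, confirming its effectiveness and generality.
Conclusion: The study shows the importance of combining query-to-segment relevance with inter-segment affinity for long-video question answering. The Hypothesis-Verification-Refinement loop propagates relevance information effectively, offering a new solution for long-video understanding in multimodal large language models with clear practical value.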
📄 Abstract
Long video understanding remains challenging for multimodal large language models (MLLMs) due to limited context windows, which necessitate identifying sparse query-relevant video segments. However, existing methods predominantly localize clues based solely on the query, overlooking the video's intrinsic structure and varying relevance across segments. To address this, we propose VideoDetective, a framework that integrates query-to-segment relevance and inter-segment affinity for effective clue hunting in long-video question answering. Specifically, we divide a video into various segments and represent them as a visual-temporal affinity graph built from visual similarity and temporal proximity. We then perform a Hypothesis-Verification-Refinement loop to estimate relevance scores of observed segments to the query and propagate them to unseen segments, yielding a global relevance distribution that guides the localization of the most critical segments for final answering with sparse observation. Experiments show our method consistently achieves substantial gains across a wide range of mainstream MLLMs on representative benchmarks, with accuracy improvements of up to 7.5% on VideoMME-long. Our code is available at https://videodetective.github.io/
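The propagation step can be illustrated with a single pass over a small affinity graph. The abstract says the graph is built from visual similarity and temporal proximity but does not give the formulas, so the product of cosine similarity and an exponential temporal decay, the constant `tau`, and the one-step weighted average below are assumptions; the full Hypothesis-Verification-Refinement loop is not reproduced.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def affinity(features, i, j, tau=2.0):
    """Assumed edge weight: visual similarity damped by temporal distance."""
    visual = max(cosine(features[i], features[j]), 0.0)
    temporal = math.exp(-abs(i - j) / tau)
    return visual * temporal


def propagate(features, observed, tau=2.0):
    """Estimate relevance of unobserved segments from observed neighbours.

    observed: dict mapping segment index -> relevance score in [0, 1]
    """
    scores = []
    for i in range(len(features)):
        if i in observed:
            scores.append(observed[i])
            continue
        weights = {j: affinity(features, i, j, tau) for j in observed}
        total = sum(weights.values())
        estimate = sum(w * observed[j] for j, w in weights.items()) / total if total > 0 else 0.0
        scores.append(estimate)
    return scores


# Four segments; only the first and last have been shown to the model.
features = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]]
observed = {0: 0.9, 3: 0.1}
scores = propagate(features, observed)
```

Segment 1 looks like segment 0 and sits next to it, so it inherits a high score without being observed, and would be the next candidate to inspect under a limited frame budget.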
cs.CL
[15] Children's Intelligence Tests Pose Challenges for MLLMs? KidGym: A 2D Grid-Based Reasoning Benchmark for MLLMs
Hengwei Ye, Yuanting Guan, Yuxuan Ge, Tianying Zhu, Zhenhan Guan, Yijia Zhong, Yijing Zhang, Han Zhang, Yingna Wu, Zheng Tian
🧩 TL;DR
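This paper presents KidGym, a comprehensive 2D grid-based benchmark for evaluating five core cognitive abilities of multimodal large language models. Inspired by the Wechsler Intelligence Scales, it aims to measure the adaptability and developmental potential of MLLMs more accurately.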
📘 Detailed Summary
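Motivation: There is no systematic benchmark that comprehensively evaluates the core cognitive abilities of multimodal large language models. Existing evaluations do not adequately reflect the adaptability and potential of MLLMs with respect to human-like cognitive development, and in particular lack a structured framework that decomposes intelligence into interpretable, testable abilities.
Method: Inspired by the Wechsler Intelligence Scales, the study develops KidGym, a 2D grid-based benchmark of 12 unique tasks designed to assess five core abilities: Execution, Perception Reasoning, Learning, Memory, and Planning. The benchmark uses diverse scenarios and objects with randomly generated layouts, and supports user customization and extension to accommodate different difficulty levels.
Result: Evaluating state-of-the-art MLLMs with KidGym reveals significant limitations across several cognitive abilities. The benchmark shows that current models fall short in adaptability and developmental potential, and the evaluation confirms that the benchmark provides accurate and robust assessment.
Conclusion: KidGym provides a systematic framework for evaluating the cognitive abilities of MLLMs, exposes the limitations of current models, and points to directions for future work. Its extensibility and customizability allow it to keep pace with the fast-growing MLLM research community and support more comprehensive and accurate model evaluation.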
📄 Abstract
Multimodal Large Language Models (MLLMs) combine the linguistic strengths of LLMs with the ability to process multimodal data, enabling them to address a broader range of visual tasks. Because MLLMs aim at more general, human-like competence than language-only models, we take inspiration from the Wechsler Intelligence Scales - an established battery for evaluating children by decomposing intelligence into interpretable, testable abilities. We introduce KidGym, a comprehensive 2D grid-based benchmark for assessing five essential capabilities of MLLMs: Execution, Perception Reasoning, Learning, Memory and Planning. The benchmark comprises 12 unique tasks, each targeting at least one core capability, specifically designed to gauge MLLMs' adaptability and developmental potential, mirroring the stages of children's cognitive growth. Additionally, our tasks encompass diverse scenarios and objects with randomly generated layouts, ensuring a more accurate and robust evaluation of MLLM capabilities. KidGym is designed to be fully user-customizable and extensible, allowing researchers to create new evaluation scenarios and adjust difficulty levels to accommodate the rapidly growing MLLM community. Through the evaluation of state-of-the-art MLLMs using KidGym, we identified significant insights into model capabilities and revealed several limitations of current models. We release our benchmark at: https://kidgym.github.io/KidGym-Website/.
cs.AI
[16] RoboAlign: Learning Test-Time Reasoning for Language-Action Alignment in Vision-Language-Action Models
Dongyoung Kim, Sumin Park, Woomin Song, Seungku Kim, Taeyoung Kim, Huiwon Jang, Jinwoo Shin, Jaehyung Kim, Younggyo Seo
🧩 TL;DR
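This paper proposes RoboAlign, a framework that uses reinforcement learning to align the reasoning process of multimodal large language models (MLLMs), substantially improving the performance of vision-language-action models (VLAs) on robotics tasks.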
📘 Detailed Summary
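Motivation: Existing methods strengthen the embodied reasoning of MLLMs through visual question answering supervision, but this often leads to unstable VLA performance, with only marginal or even negative gains. A more systematic training framework is needed to improve VLA performance reliably.
Method: RoboAlign samples action tokens via zero-shot natural language reasoning and refines this reasoning with reinforcement learning (RL) to improve action accuracy. VLAs are built by adding a diffusion-based action head on top of an MLLM backbone, and the RL alignment uses less than 1% of the data.
Result: On major robotics benchmarks, RoboAlign improves over SFT baselines by 17.5%, 18.9%, and 106.6% on LIBERO, CALVIN, and real-world environments, respectively, clearly outperforming conventional supervised fine-tuning.
Conclusion: The study shows that RL-based alignment can bridge the modality gap between language and low-level actions in MLLMs and facilitate knowledge transfer from MLLM to VLA, providing a systematic training framework for building high-performance embodied systems.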
📄 Abstract
Improving embodied reasoning in multimodal large language models (MLLMs) is essential for building vision-language-action models (VLAs) on top of them that readily translate multimodal understanding into low-level actions. Accordingly, recent work has explored enhancing embodied reasoning in MLLMs through visual question answering (VQA)-style supervision. However, these approaches have been reported to result in unstable VLA performance, often yielding only marginal or even negative gains. In this paper, we propose RoboAlign, a more systematic MLLM training framework that reliably improves VLA performance. Our key idea is to sample action tokens via zero-shot natural language reasoning and refine this reasoning using reinforcement learning (RL) to improve action accuracy. As a result, RoboAlign bridges the modality gap between language and low-level actions in MLLMs, and facilitates knowledge transfer from MLLM to VLA. To validate the effectiveness of RoboAlign, we train VLAs by adding a diffusion-based action head on top of an MLLM backbone and evaluate them on major robotics benchmarks. Remarkably, by performing RL-based alignment after supervised fine-tuning (SFT) using less than 1% of the data, RoboAlign achieves performance improvements of 17.5%, 18.9%, and 106.6% over SFT baselines on LIBERO, CALVIN, and real-world environments, respectively.
[17] Mind over Space: Can Multimodal Large Language Models Mentally Navigate?
Qihui Zhu, Shouwei Ruan, Xiao Yang, Hao Jiang, Yao Huang, Shiji Zhao, Hanwei Fan, Hang Su, Xingxing Wei
🧩 TL;DR
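This paper introduces the Video2Mental benchmark for evaluating the mental navigation capabilities of multimodal large language models (MLLMs), and develops NavMind, a model that uses explicit cognitive maps as learnable intermediate representations to substantially improve structured spatial reasoning over long spatiotemporal scales.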
📘 Detailed Summary
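Motivation: The use of MLLMs in embodied agents is currently confined largely to reactive planning from immediate observations, and these models consistently fail at spatial reasoning across extensive spatiotemporal scales. Cognitive science indicates that Biological Intelligence (BI) relies on "mental navigation": constructing spatial representations from experience and mentally simulating paths prior to action. This work aims to bridge the gap between AI and BI in this capability.
Method: The authors first introduce the Video2Mental benchmark, which requires models to construct hierarchical cognitive maps from long egocentric videos and to generate landmark-based path plans step by step. To address the shortcomings of existing models, they propose NavMind, a reasoning model that internalizes mental navigation using explicit, fine-grained cognitive maps as learnable intermediate representations, trained with a difficulty-stratified progressive supervised fine-tuning paradigm that bridges raw perception and structured planning.
Result: Benchmarking shows that mental navigation capability does not naturally emerge from standard pre-training. Frontier MLLMs perform poorly at zero-shot structured spatial representation, and their planning accuracy decays sharply as the horizon lengthens. NavMind significantly outperforms frontier commercial and spatial MLLMs in mental navigation, with planning accuracy verified through simulator-based physical interaction.
Conclusion: The study exposes a fundamental limitation of current MLLMs in spatial reasoning over long spatiotemporal scales and demonstrates the effectiveness of explicit cognitive maps as intermediate representations. NavMind's results suggest that formalizing the mental navigation mechanism of biological intelligence and integrating it into AI systems is a key direction for achieving advanced spatial reasoning, with important implications for embodied intelligence.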
📄 Abstract
Despite the widespread adoption of multimodal large language models (MLLMs) in embodied agents, their capabilities remain largely confined to reactive planning from immediate observations, consistently failing in spatial reasoning across extensive spatiotemporal scales. Cognitive science reveals that Biological Intelligence (BI) thrives on "mental navigation": the strategic construction of spatial representations from experience and the subsequent mental simulation of paths prior to action. To bridge the gap between AI and BI, we introduce Video2Mental, a pioneering benchmark for evaluating the mental navigation capabilities of MLLMs. The task requires constructing hierarchical cognitive maps from long egocentric videos and generating landmark-based path plans step by step, with planning accuracy verified through simulator-based physical interaction. Our benchmarking results reveal that mental navigation capability does not naturally emerge from standard pre-training. Frontier MLLMs struggle profoundly with zero-shot structured spatial representation, and their planning accuracy decays precipitously over extended horizons. To overcome this, we propose NavMind, a reasoning model that internalizes mental navigation using explicit, fine-grained cognitive maps as learnable intermediate representations. Through a difficulty-stratified progressive supervised fine-tuning paradigm, NavMind effectively bridges the gap between raw perception and structured planning. Experiments demonstrate that NavMind achieves superior mental navigation capabilities, significantly outperforming frontier commercial and spatial MLLMs.
[18] Deterministic Hallucination Detection in Medical VQA via Confidence-Evidence Bayesian Gain
Mohammad Asadi, Tahoura Nedaee, Jack W. O'Sullivan, Euan Ashley, Ehsan Adeli
🧩 TL;DR
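This paper proposes CEBaG, a deterministic hallucination detection method for medical multimodal large language models (MLLMs). It requires no stochastic sampling or external models, and detects responses that contradict the input image by combining token-level predictive variance with evidence magnitude.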
📘 Detailed Summary
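Motivation: Medical MLLMs are prone to hallucinated responses in visual question answering (VQA) that contradict the input image, which poses serious risks in clinical settings. Existing hallucination detection methods such as Semantic Entropy (SE) and Vision-Amplified Semantic Entropy (VASE) require many stochastic generations and an external natural language inference model, making them computationally expensive and difficult to deploy in practice.
Method: The paper proposes Confidence-Evidence Bayesian Gain (CEBaG), based on the observation that hallucinated responses show two characteristics in the model's own log-probabilities: inconsistent token-level confidence and weak sensitivity to visual evidence. CEBaG combines two complementary signals: token-level predictive variance, which captures inconsistent confidence across response tokens, and evidence magnitude, which measures how much the image shifts per-token predictions relative to text-only inference.
Result: Across 16 experimental settings spanning four medical MLLMs and three VQA benchmarks, CEBaG achieves the highest AUC in 13 settings and improves over VASE by 8 AUC points on average. The method is fully deterministic and self-contained, requiring no stochastic sampling, external models, or task-specific hyperparameters.
Conclusion: The study shows that hallucinated responses have a detectable signature in the model's internal log-probabilities, and that CEBaG exploits these intrinsic signals to provide an efficient, practical hallucination detection solution. The method offers a useful tool for the safe deployment of medical MLLMs and illustrates the potential of internal confidence signals for hallucination detection.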
📄 Abstract
Multimodal large language models (MLLMs) have shown strong potential for medical Visual Question Answering (VQA), yet they remain prone to hallucinations, defined as generating responses that contradict the input image, posing serious risks in clinical settings. Current hallucination detection methods, such as Semantic Entropy (SE) and Vision-Amplified Semantic Entropy (VASE), require 10 to 20 stochastic generations per sample together with an external natural language inference model for semantic clustering, making them computationally expensive and difficult to deploy in practice. We observe that hallucinated responses exhibit a distinctive signature directly in the model's own log-probabilities: inconsistent token-level confidence and weak sensitivity to visual evidence. Based on this observation, we propose Confidence-Evidence Bayesian Gain (CEBaG), a deterministic hallucination detection method that requires no stochastic sampling, no external models, and no task-specific hyperparameters. CEBaG combines two complementary signals: token-level predictive variance, which captures inconsistent confidence across response tokens, and evidence magnitude, which measures how much the image shifts per-token predictions relative to text-only inference. Evaluated across four medical MLLMs and three VQA benchmarks (16 experimental settings), CEBaG achieves the highest AUC in 13 of 16 settings and improves over VASE by 8 AUC points on average, while being fully deterministic and self-contained. The code will be made available upon acceptance.
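To make the two signals described above more concrete, here is a minimal sketch that computes them from per-token log-probabilities of a fixed response. It rests on assumptions not stated in the abstract: predictive variance is taken as the population variance of the image-conditioned token log-probabilities, and evidence magnitude as the mean absolute shift between image-conditioned and text-only log-probabilities. The function name `cebag_style_signals` is hypothetical, and because the abstract does not say how CEBaG combines the signals into a score, the sketch returns them separately.

```python
from statistics import pvariance


def cebag_style_signals(logprobs_with_image, logprobs_text_only):
    """Compute two illustrative signals from per-token log-probabilities.

    Both inputs score the same response tokens: once with the image in the
    prompt, once with text only. The definitions here are assumptions made
    for illustration, not the paper's exact formulas.
    """
    if not logprobs_with_image:
        raise ValueError("response must contain at least one token")
    if len(logprobs_with_image) != len(logprobs_text_only):
        raise ValueError("both passes must score the same tokens")

    # Inconsistent confidence across tokens shows up as high variance.
    predictive_variance = pvariance(logprobs_with_image)

    # How far the image moves each token's log-probability, on average.
    shifts = [abs(a - b) for a, b in zip(logprobs_with_image, logprobs_text_only)]
    evidence_magnitude = sum(shifts) / len(shifts)

    return {
        "predictive_variance": predictive_variance,
        "evidence_magnitude": evidence_magnitude,
    }


signals = cebag_style_signals([-0.2, -0.1, -2.5], [-0.3, -0.1, -2.6])
print(signals)
```

Both quantities need only two deterministic forward passes over the same response (with and without the image), which is consistent with the abstract's claim that no stochastic sampling or external model is required.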