Table of Contents
cs.CV
[1] Towards Cognitive Defect Analysis in Active Infrared Thermography with Vision-Text Cues
Mohammed Salah, Eman Ouda, Giuseppe Dell'Avvocato, Fabrizio Sarasini, Ester D'Accardi, Jorge Dias, Davor Svetinovic, Stefano Sfarra, Yusra Abdulrahman
🧩 TL;DR
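This paper proposes a language-guided framework for cognitive defect analysis that uses pretrained vision-language models for zero-shot detection of subsurface defects in carbon fiber-reinforced polymers, achieving generative understanding and localization of thermographic defects without any training dataset.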
📘 Detailed Summary
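Motivation: Conventional AI-based active infrared thermography methods require time-consuming and expensive datasets of carbon fiber-reinforced polymer inspection sequences to train neural networks, which limits the practical use of AI in industrial inspection. A zero-shot defect detection method that needs no training dataset is therefore required.
Method: A novel language-guided framework combines pretrained multimodal vision-language model encoders with a lightweight adapter. The AIRT-VLM Adapter bridges the domain gap between thermographic data and natural images, enhances defect visibility, and enables generative zero-shot understanding and localization of subsurface defects.
Result: Three representative vision-language models were validated on 25 carbon fiber-reinforced polymer inspection sequences. The AIRT-VLM Adapter achieves signal-to-noise ratio gains exceeding 10 dB over conventional thermographic dimensionality-reduction methods, zero-shot defect detection reaches an intersection-over-union of 70%, and the framework detects realistic industrial defects introduced at different energy levels.
Conclusion: The study demonstrates that pretrained vision-language models are effective for thermographic defect detection. The proposed adapter addresses the domain adaptation problem and offers a viable route to zero-shot industrial non-destructive testing, significantly reducing data collection and model training costs.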
📄 Abstract
Active infrared thermography (AIRT) is currently witnessing a surge of artificial intelligence (AI) methodologies being deployed for automated subsurface defect analysis of high-performance carbon fiber-reinforced polymers (CFRP). Deploying AI-based AIRT methodologies for inspecting CFRPs requires the creation of time-consuming and expensive datasets of CFRP inspection sequences to train neural networks. To address this challenge, this work introduces a novel language-guided framework for cognitive defect analysis in CFRPs using AIRT and vision-language models (VLMs). Unlike conventional learning-based approaches, the proposed framework does not require developing training datasets for extensive training of defect detectors; instead, it relies solely on pretrained multimodal VLM encoders coupled with a lightweight adapter to enable generative zero-shot understanding and localization of subsurface defects. By leveraging pretrained multimodal encoders, the proposed system enables generative zero-shot understanding of thermographic patterns and automatic detection of subsurface defects. Given the domain gap between thermographic data and natural images used to train VLMs, an AIRT-VLM Adapter is proposed to enhance the visibility of defects while aligning the thermographic domain with the learned representations of VLMs. The proposed framework is validated using three representative VLMs: GroundingDINO, Qwen-VL-Chat, and CogVLM. Validation is performed on 25 CFRP inspection sequences with impacts introduced at different energy levels, reflecting realistic defects encountered in industrial scenarios. Experimental results demonstrate that the AIRT-VLM Adapter achieves signal-to-noise ratio (SNR) gains exceeding 10 dB compared with conventional thermographic dimensionality-reduction methods, while enabling zero-shot defect detection with intersection-over-union values reaching 70%.
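The two metrics quoted above, intersection-over-union and SNR gain in dB, can be illustrated with a minimal sketch. The abstract does not say which SNR definition or box format the authors use, so the following are assumptions: boxes are axis-aligned `(x1, y1, x2, y2)` tuples, and SNR follows one definition common in thermography, 20·log10(|mean of defect region − mean of sound region| / standard deviation of sound region). The function names `box_iou` and `snr_db` are hypothetical.

```python
import math


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlap rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def snr_db(defect_mean, sound_mean, sound_std):
    """Assumed SNR definition: contrast between defect and sound regions,
    relative to the noise (std) of the sound region, in decibels."""
    return 20.0 * math.log10(abs(defect_mean - sound_mean) / sound_std)


# Hypothetical numbers, for illustration only.
predicted, ground_truth = (0, 0, 10, 10), (5, 0, 15, 10)
overlap = box_iou(predicted, ground_truth)
gain = snr_db(12.0, 2.0, 1.0) - snr_db(3.0, 2.0, 1.0)
```

Under this definition, a gain of 10 dB corresponds to roughly a 3.16-fold increase in the contrast-to-noise ratio, since 20·log10(3.16) ≈ 10.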
[2] Are Video Reasoning Models Ready to Go Outside?
Yangfan He, Changgyu Boo, Jaehong Yoon
🧩 TL;DR
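This paper proposes ROVA, a training framework that improves the real-world robustness of vision-language models by modeling a robustness-aware consistency reward under spatio-temporal corruptions, and introduces PVRBench, a benchmark for evaluating models under realistic perturbations.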
📘 Detailed Summary
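Motivation: In real-world deployment, vision-language models often encounter disturbances such as weather, occlusion, and camera motion, which substantially degrade their understanding and reasoning. This reveals a gap between clean, controlled evaluation settings and real-world robustness.
Method: ROVA improves robustness by modeling a robustness-aware consistency reward under spatio-temporal corruptions. It uses a difficulty-aware online training strategy that prioritizes informative samples based on the model's evolving capability, and it continuously re-estimates sample difficulty via self-reflective evaluation to enable adaptive training. The paper also introduces PVRBench, which injects real-world perturbations into embodied video datasets to assess accuracy and reasoning quality under perturbation.
Result: Evaluation on PVRBench, UrbanVideo, and VisBench shows that open-source and proprietary models suffer drops of up to 35% in accuracy and 28% in reasoning under realistic perturbations. ROVA mitigates this degradation, boosting relative accuracy by at least 24% and reasoning by over 9% compared with baseline models (QWen2.5/3-VL, InternVL2.5, Embodied-R). These gains also transfer to clean benchmarks, yielding consistent improvements.
Conclusion: Through difficulty-aware training and a robustness-aware consistency reward, ROVA significantly improves the robustness of vision-language models under real-world perturbations. PVRBench provides a standardized tool for evaluating models under realistic disturbances. The method improves performance in both perturbed and clean settings, suggesting that robustness training transfers broadly.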
📄 Abstract
In real-world deployment, vision-language models often encounter disturbances such as weather, occlusion, and camera motion. Under such conditions, their understanding and reasoning degrade substantially, revealing a gap between clean, controlled (i.e., unperturbed) evaluation settings and real-world robustness. To address this limitation, we propose ROVA, a novel training framework that improves robustness by modeling a robustness-aware consistency reward under spatio-temporal corruptions. ROVA introduces a difficulty-aware online training strategy that prioritizes informative samples based on the model's evolving capability. Specifically, it continuously re-estimates sample difficulty via self-reflective evaluation, enabling adaptive training with a robustness-aware consistency reward. We also introduce PVRBench, a new benchmark that injects real-world perturbations into embodied video datasets to assess both accuracy and reasoning quality under realistic disturbances. We evaluate ROVA and baselines on PVRBench, UrbanVideo, and VisBench, where open-source and proprietary models suffer up to 35% and 28% drops in accuracy and reasoning under realistic perturbations. ROVA effectively mitigates performance degradation, boosting relative accuracy by at least 24% and reasoning by over 9% compared with baseline models (QWen2.5/3-VL, InternVL2.5, Embodied-R). These gains transfer to clean standard benchmarks, yielding consistent improvements.
[3] One Token, Two Fates: A Unified Framework via Vision Token Manipulation Against MLLMs Hallucination
Zhan Fa, Yue Duan, Jian Zhang, Lei Qi, Yinghuan Shi
🧩 TL;DR
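This paper proposes a unified training-free framework for mitigating hallucination in multimodal large language models. It combines two distinct roles of vision tokens, strengthening visual representations and creating latent-space negative samples, to restore the vision-language balance.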
📘 Detailed Summary
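Motivation: Current training-free methods for MLLM hallucination use separate strategies: either enhancing visual signals or suppressing text inertia. These methods face critical trade-offs: enhancing vision alone often fails against strong language priors, while suppressing language can introduce extra image-irrelevant noise. Their naive combination is also ineffective, so a unified framework is needed.
Method: The framework is built around the vision token as its core asset and has two modules. The Synergistic Visual Calibration module incorporates augmented tokens to strengthen visual representations, while the Causal Representation Calibration module uses pruned tokens to create latent-space negative samples that correct internal model biases. Both operate on latent representations.
Result: The method significantly reduces object hallucinations across multiple benchmarks, improving POPE accuracy on LLaVA-1.5 by an average of 2% with an inference latency overhead of only 1.06x, which supports the framework's effectiveness in restoring the vision-language balance.
Conclusion: The study shows that augmented images offer complementary visual semantics, and that removing vision tokens isolates hallucination tendencies more precisely than distorting images. By harmonizing these two roles of vision tokens, the unified framework offers a more effective approach to MLLM hallucination mitigation.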
📄 Abstract
Current training-free methods tackle MLLM hallucination with separate strategies: either enhancing visual signals or suppressing text inertia. However, these separate methods are insufficient due to critical trade-offs: simply enhancing vision often fails against strong language prior, while suppressing language can introduce extra image-irrelevant noise. Moreover, we find their naive combination is also ineffective, necessitating a unified framework. We propose such a framework by focusing on the core asset: the vision token. Our design leverages two key insights: (1) augmented images offer complementary visual semantics, and (2) removing vision tokens (information-gap) isolates hallucination tendencies more precisely than distorting images (modality-gap). Based on these, our framework uses vision tokens in two distinct ways, both operating on latent representations: our Synergistic Visual Calibration (SVC) module incorporates augmented tokens to strengthen visual representations, while our Causal Representation Calibration (CRC) module uses pruned tokens to create latent-space negative samples for correcting internal model biases. By harmonizing these two roles, our framework effectively restores the vision-language balance, significantly reducing object hallucinations, improving POPE accuracy by an average of 2% absolute on LLaVA-1.5 across multiple benchmarks with only a 1.06x inference latency overhead.
[4] GeoSense: Internalizing Geometric Necessity Perception for Multimodal Reasoning
Ruiheng Liu, Haihong Hao, Mingfei Han, Xin Gu, Kecheng Zhang, Changlin Li, Xiaojun Chang
🧩 TL;DR
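This paper proposes a framework that gives multimodal large language models spatial awareness, letting the model decide for itself when geometric information is needed. It substantially improves spatial understanding while preserving 2D visual reasoning capabilities.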
📘 Detailed Summary
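Motivation: Current multimodal large language models have limited spatial understanding. Existing methods typically inject geometric signals rigidly into every input, ignoring whether they are necessary and adding computational overhead. This work aims to let the model autonomously perceive when geometric features are needed.
Method: The method first introduces an independent geometry input channel into the model architecture and conducts alignment training so that geometric features can be used effectively. It then uses a curated spatial-aware supervised fine-tuning dataset to activate the model's latent internal cues, enabling it to determine autonomously whether geometric information is necessary.
Result: Experiments on multiple spatial reasoning benchmarks validate the approach, showing significant gains in spatial understanding without compromising 2D visual reasoning, and pointing toward more robust, efficient, and self-aware multimodal intelligence.
Conclusion: The study offers a new paradigm for spatial understanding in multimodal large language models. By giving the model an awareness of perceptual insufficiency, it lets the model decide when geometric information is needed, which the authors present as a path toward more capable artificial superintelligence.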
📄 Abstract
Advancing towards artificial superintelligence requires rich and intelligent perceptual capabilities. A critical frontier in this pursuit is overcoming the limited spatial understanding of Multimodal Large Language Models (MLLMs), where geometry information is essential. Existing methods often address this by rigidly injecting geometric signals into every input, while ignoring their necessity and adding computation overhead. Contrary to this paradigm, our framework endows the model with an awareness of perceptual insufficiency, empowering it to autonomously engage geometric features in reasoning when 2D cues are deemed insufficient. To achieve this, we first introduce an independent geometry input channel to the model architecture and conduct alignment training, enabling the effective utilization of geometric features. Subsequently, to endow the model with perceptual awareness, we curate a dedicated spatial-aware supervised fine-tuning dataset. This serves to activate the model's latent internal cues, empowering it to autonomously determine the necessity of geometric information. Experiments across multiple spatial reasoning benchmarks validate this approach, demonstrating significant spatial gains without compromising 2D visual reasoning capabilities, offering a path toward more robust, efficient and self-aware multi-modal intelligence.
[5] Fighting Hallucinations with Counterfactuals: Diffusion-Guided Perturbations for LVLM Hallucination Suppression
Hamidreza Dastmalchi, Aijun An, Ali Cheraghian, Hamed Barzamini
🧩 TL;DR
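This paper proposes CIPHER, a training-free method that builds OHC-25K, a dataset of counterfactual visual perturbations, to extract hallucination-related representations, then projects hidden states away from the hallucination subspace at inference time to suppress vision-induced hallucinations in large vision-language models.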
📘 Detailed Summary
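Motivation: Large vision-language models perform well on multimodal tasks but frequently produce hallucinated outputs that are inconsistent with the visual input. Existing training-free methods focus mainly on text-induced hallucinations and lack an effective remedy for hallucinations arising from the visual modality.
Method: CIPHER works in two phases. In the offline phase, it uses diffusion editing to construct the OHC-25K counterfactual dataset, whose images intentionally contradict the original ground-truth captions. Contrasting the representations of authentic (image, caption) pairs with those of counterfactual pairs yields a low-rank subspace that characterizes vision-induced hallucination. In the inference phase, it suppresses hallucinations by projecting intermediate hidden states away from this subspace.
Result: Across multiple benchmarks, CIPHER significantly reduces hallucination rates while preserving task performance. The experiments show that the method improves LVLM faithfulness and support the use of counterfactual visual perturbations for improving model reliability.
Conclusion: The study indicates that vision-induced hallucinations have a systematic structure that can be captured by a low-rank subspace. Counterfactual visual perturbations provide an effective training-free route to improving LVLM faithfulness, and the method is general enough to apply to a variety of LVLM architectures.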
📄 Abstract
While large vision-language models (LVLMs) achieve strong performance on multimodal tasks, they frequently generate hallucinations -- unfaithful outputs misaligned with the visual input. To address this issue, we introduce CIPHER (Counterfactual Image Perturbations for Hallucination Extraction and Removal), a training-free method that suppresses vision-induced hallucinations via lightweight feature-level correction. Unlike prior training-free approaches that primarily focus on text-induced hallucinations, CIPHER explicitly targets hallucinations arising from the visual modality. CIPHER operates in two phases. In the offline phase, we construct OHC-25K (Object-Hallucinated Counterfactuals, 25,000 samples), a counterfactual dataset consisting of diffusion-edited images that intentionally contradict the original ground-truth captions. We pair these edited images with the unchanged ground-truth captions and process them through an LVLM to extract hallucination-related representations. Contrasting these representations with those from authentic (image, caption) pairs reveals structured, systematic shifts spanning a low-rank subspace characterizing vision-induced hallucination. In the inference phase, CIPHER suppresses hallucinations by projecting intermediate hidden states away from this subspace. Experiments across multiple benchmarks show that CIPHER significantly reduces hallucination rates while preserving task performance, demonstrating the effectiveness of counterfactual visual perturbations for improving LVLM faithfulness. Code and additional materials are available at https://hamidreza-dastmalchi.github.io/cipher-cvpr2026/.
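The inference-phase step described above, projecting hidden states away from a subspace, can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation. The abstract does not say how the low-rank basis is extracted, so Gram-Schmidt orthonormalization over a handful of shift vectors is used here as a stand-in. The names `orthonormal_basis` and `project_out`, and the toy 3-dimensional vectors, are hypothetical.

```python
import math


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def orthonormal_basis(shift_vectors, tol=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of the shift vectors.
    Each shift vector stands for (counterfactual representation minus
    authentic representation); this extraction step is an assumption."""
    basis = []
    for v in shift_vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip vectors already inside the span
            basis.append([wi / norm for wi in w])
    return basis


def project_out(hidden, basis):
    """Remove the component of `hidden` that lies in span(basis):
    h' = h - sum_i <h, b_i> b_i for an orthonormal basis {b_i}."""
    out = list(hidden)
    for b in basis:
        c = dot(out, b)
        out = [oi - c * bi for oi, bi in zip(out, b)]
    return out


# Toy example: the "hallucination subspace" is the x-y plane.
basis = orthonormal_basis([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
cleaned = project_out([3.0, 4.0, 5.0], basis)
```

In the toy example only the z-component of the hidden state survives, because the other two directions lie in the removed subspace. A real system would operate on hidden states with thousands of dimensions and a much smaller subspace rank.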
[6] WalkGPT: Grounded Vision-Language Conversation with Depth-Aware Segmentation for Pedestrian Navigation
Rafi Ibn Sultan, Hui Zhu, Xiangyu Zhou, Chengyin Li, Prashant Khanduri, Marco Brocanelli, Dongxiao Zhu
🧩 TL;DR
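This paper proposes WalkGPT, a pixel-grounded large vision-language model for the Grounded Navigation Guide task that unifies language reasoning and segmentation in a single architecture for depth-aware accessibility guidance. Without user-provided cues or anchor points, the model generates conversational responses that include segmentation masks and relative depth estimates.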
📘 Detailed Summary
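Motivation: Existing large vision-language models struggle with complex urban pedestrian navigation scenes. Although they can describe visual content, their lack of explicit grounding leads to object hallucinations and unreliable depth reasoning, which limits their usefulness for accessibility guidance. The work aims to unify semantic and spatial reasoning and to provide reliable depth-aware accessibility guidance from pedestrian-view images.
Method: WalkGPT is a pixel-grounded large vision-language model with two key components. The Multi-Scale Query Projector (MSQP) shapes the final image tokens by aggregating them along text tokens across spatial hierarchies. The Calibrated Text Projector (CTP), guided by a proposed Region Alignment Loss, maps language embeddings into segmentation-aware representations. The model unifies language reasoning and segmentation, generating conversational responses with segmentation masks for accessible and harmful features along with relative depth estimates.
Result: Experiments show that WalkGPT achieves strong grounded reasoning and segmentation performance. The work also introduces PAVE, a large-scale benchmark of 41k pedestrian-view images paired with accessibility-aware questions and depth-grounded answers. The model generates complete and realistic navigation guidance without user-provided cues or anchor points.
Conclusion: With its unified pixel-grounded architecture, WalkGPT addresses the grounding challenges that large vision-language models face in accessible navigation and offers an effective approach to depth-aware pedestrian guidance. The Multi-Scale Query Projector and Calibrated Text Projector open a new route to fine-grained grounding and depth reasoning in vision-language models and advance accessible intelligent navigation systems.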
📄 Abstract
Ensuring accessible pedestrian navigation requires reasoning about both semantic and spatial aspects of complex urban scenes, a challenge that existing Large Vision-Language Models (LVLMs) struggle to meet. Although these models can describe visual content, their lack of explicit grounding leads to object hallucinations and unreliable depth reasoning, limiting their usefulness for accessibility guidance. We introduce WalkGPT, a pixel-grounded LVLM for the new task of Grounded Navigation Guide, unifying language reasoning and segmentation within a single architecture for depth-aware accessibility guidance. Given a pedestrian-view image and a navigation query, WalkGPT generates a conversational response with segmentation masks that delineate accessible and harmful features, along with relative depth estimation. The model incorporates a Multi-Scale Query Projector (MSQP) that shapes the final image tokens by aggregating them along text tokens across spatial hierarchies, and a Calibrated Text Projector (CTP), guided by a proposed Region Alignment Loss, that maps language embeddings into segmentation-aware representations. These components enable fine-grained grounding and depth inference without user-provided cues or anchor points, allowing the model to generate complete and realistic navigation guidance. We also introduce PAVE, a large-scale benchmark of 41k pedestrian-view images paired with accessibility-aware questions and depth-grounded answers. Experiments show that WalkGPT achieves strong grounded reasoning and segmentation performance. The source code and dataset are available on the project website (https://sites.google.com/view/walkgpt-26/home).
[7] CodePercept: Code-Grounded Visual STEM Perception for MLLMs
Tongkun Guan, Zhibo Yang, Jianqiang Wan, Mingkun Yang, Zhengtao Guo, Zijian Hu, Ruilin Luo, Ruize Chen, Songtao Jiang, Peng Wang, Wei Shen, Junyang Lin, Xiaokang Yang
🧩 TL;DR
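Through a systematic scaling analysis, this study finds that the main bottleneck for multimodal large language models in STEM visual reasoning is perception rather than reasoning. It proposes a new paradigm that uses executable code as the perceptual medium, supported by a dataset of 1M image-caption-code triplets and a dedicated evaluation benchmark.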
📘 Detailed Summary
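Motivation: Current multimodal large language models perform poorly on STEM visual reasoning tasks, but the root cause is unclear: perceptual deficiencies or reasoning limitations? The study aims to identify the true bottleneck by scaling the perception and reasoning components independently, and to explore how to strengthen perception systematically.
Method: The study uses a systematic scaling analysis to evaluate perception and reasoning independently, and proposes executable code as a perceptual medium. It builds ICC-1M, a dataset of 1M Image-Caption-Code triplets, through two complementary approaches: Code-Grounded Caption Generation, which treats executable code as ground truth for image captions, and STEM Image-to-Code Translation, which has models generate reconstruction code to enhance perception.
Result: The scaling analysis shows that scaling perception consistently outperforms scaling reasoning, indicating that perception is the main factor limiting current STEM visual reasoning. The resulting ICC-1M dataset and STEM2Code-Eval benchmark provide a new evaluation framework: the benchmark assesses visual perception directly through executable code generation for image reconstruction, rather than using problem-solving accuracy as a proxy.
Conclusion: The study identifies perception, not reasoning, as the core bottleneck for multimodal large language models in STEM visual reasoning. Its code-as-perception paradigm offers a new direction for improving understanding of structured visual content, and its benchmark gives future research a more precise evaluation tool.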
📄 Abstract
When MLLMs fail at Science, Technology, Engineering, and Mathematics (STEM) visual reasoning, a fundamental question arises: is it due to perceptual deficiencies or reasoning limitations? Through systematic scaling analysis that independently scales perception and reasoning components, we uncover a critical insight: scaling perception consistently outperforms scaling reasoning. This reveals perception as the true lever limiting current STEM visual reasoning. Motivated by this insight, our work focuses on systematically enhancing the perception capabilities of MLLMs by establishing code as a powerful perceptual medium--executable code provides precise semantics that naturally align with the structured nature of STEM visuals. Specifically, we construct ICC-1M, a large-scale dataset comprising 1M Image-Caption-Code triplets that materializes this code-as-perception paradigm through two complementary approaches: (1) Code-Grounded Caption Generation treats executable code as ground truth for image captions, eliminating the hallucinations inherent in existing knowledge distillation methods; (2) STEM Image-to-Code Translation prompts models to generate reconstruction code, mitigating the ambiguity of natural language for perception enhancement. To validate this paradigm, we further introduce STEM2Code-Eval, a novel benchmark that directly evaluates visual perception in STEM domains. Unlike existing work relying on problem-solving accuracy as a proxy that only measures problem-relevant understanding, our benchmark requires comprehensive visual comprehension through executable code generation for image reconstruction, providing deterministic and verifiable assessment. Code is available at https://github.com/TongkunGuan/Qwen-CodePercept.
[8] Beyond Sequential Distance: Inter-Modal Distance Invariant Position Encoding
Lin Chen, Bolin Ni, Qi Yang, Zili Wang, Kun Ding, Ying Wang, Houwen Peng, Shiming Xiang
🧩 TL;DR
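This paper proposes inter-modal Distance Invariant Position Encoding (DIPE), which disentangles position encoding based on modality interactions to mitigate visual fading in multimodal large language models in long-context scenarios, so that visual signals remain perceptually consistent regardless of context length.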
📘 Detailed Summary
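Motivation: Multimodal large language models suffer from visual fading in long-context scenarios: as the text sequence grows, attention to visual tokens diminishes and generated text drifts away from visual constraints. The authors attribute this to the inherent inductive bias of Multimodal RoPE, which penalizes inter-modal attention as the distance between visual and text tokens increases.
Method: The paper proposes inter-modal Distance Invariant Position Encoding (DIPE), a simple but effective mechanism that disentangles position encoding based on modality interactions. DIPE retains the natural relative positioning for intra-modal interactions to preserve local structure, while enforcing an anchored perceptual proximity for inter-modal interactions. This mitigates the inter-modal distance-based penalty and keeps visual signals perceptually consistent across context lengths.
Result: Experiments show that integrating DIPE with Multimodal RoPE lets the model maintain stable visual grounding in long-context scenarios, significantly alleviating visual fading while preserving performance on standard short-context benchmarks.
Conclusion: The study identifies the inter-modal distance penalty in multimodal position encoding as the root cause of visual fading and proposes a disentangled position encoding strategy to mitigate it. DIPE offers a new approach to long-context processing in multimodal models, with the potential to keep visual grounding consistent, and provides useful guidance for future work on multimodal interaction.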
📄 Abstract
Despite the remarkable capabilities of Multimodal Large Language Models (MLLMs), they still suffer from visual fading in long-context scenarios. Specifically, the attention to visual tokens diminishes as the text sequence lengthens, leading to text generation detached from visual constraints. We attribute this degradation to the inherent inductive bias of Multimodal RoPE, which penalizes inter-modal attention as the distance between visual and text tokens increases. To address this, we propose inter-modal Distance Invariant Position Encoding (DIPE), a simple but effective mechanism that disentangles position encoding based on modality interactions. DIPE retains the natural relative positioning for intra-modal interactions to preserve local structure, while enforcing an anchored perceptual proximity for inter-modal interactions. This strategy effectively mitigates the inter-modal distance-based penalty, ensuring that visual signals remain perceptually consistent regardless of the context length. Experimental results demonstrate that by integrating DIPE with Multimodal RoPE, the model maintains stable visual grounding in long-context scenarios, significantly alleviating visual fading while preserving performance on standard short-context benchmarks. Code is available at https://github.com/lchen1019/DIPE.
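The distinction DIPE draws between intra-modal and inter-modal positioning can be illustrated with a toy sketch of the relative distance each query-key pair would see. This is only a reading of the abstract, not the released implementation. It collapses positions to a 1D index (Multimodal RoPE uses richer positions for visual tokens), and the constant `anchor` distance, the function name `dipe_relative_distances`, and the modality labels are all hypothetical.

```python
def dipe_relative_distances(modalities, anchor=1):
    """Return rel[q][k], the relative distance used between query token q
    and key token k.

    Same modality   -> ordinary sequential distance (q - k), which
                       preserves local structure.
    Cross modality  -> a fixed `anchor` distance, so the penalty does not
                       grow with context length (assumed form of the
                       "anchored perceptual proximity").
    """
    n = len(modalities)
    rel = [[0] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if modalities[q] == modalities[k]:
                rel[q][k] = q - k
            else:
                rel[q][k] = anchor
    return rel


# Two image tokens followed by a long run of text tokens.
tokens = ["img", "img"] + ["txt"] * 1000
rel = dipe_relative_distances(tokens)
last = len(tokens) - 1
sequential_gap = last - 0    # what plain sequential positions would give
anchored_gap = rel[last][0]  # what the sketch gives for text -> image
```

With plain sequential positions, the last text token sits 1001 steps from the first image token, whereas in the sketch it stays at the anchor distance however long the text grows. Text-to-text distances are unchanged.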
cs.AI
[9] Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities
Ziwei Zhou, Rui Wang, Zuxuan Wu, Yu-Gang Jiang
🧩 TL;DR
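This paper introduces Daily-Omni, an audio-visual question-answering benchmark that requires cross-modal temporal reasoning, together with a semi-automatic annotation pipeline and a diagnostic evaluation suite. An evaluation of 24 foundation models shows that existing MLLMs still face significant challenges in cross-modal temporal alignment.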
📘 Detailed Summary
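Motivation: Current multimodal large language models perform well on visual and audio benchmarks, but their ability to process cross-modal information synchronously remains largely unexplored, especially for tasks that require cross-modal temporal reasoning. This is an important research gap.
Method: The study introduces the Daily-Omni benchmark, with 684 real-world videos and 1,197 questions spanning 6 task families. It develops a semi-automatic annotation pipeline that includes cross-modal consistency refinement, temporal alignment elicitation, and text-only leakage filtering. It builds a diagnostic evaluation suite and evaluates 24 foundation models under 37 model-modality settings. It also designs a training-free modular diagnostic baseline that composes off-the-shelf unimodal models to assess the effect of temporal alignment signals.
Result: The evaluation shows that many end-to-end multimodal large language models perform poorly on questions that require precise temporal alignment, indicating that cross-modal temporal alignment remains an important open challenge. The diagnostic baseline further confirms that explicit temporal alignment signals have a key effect on performance.
Conclusion: The study exposes the limitations of current multimodal large language models in cross-modal temporal alignment, underlines the importance of more robust cross-modal temporal reasoning, and provides a valuable benchmark and diagnostic tools for future research.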
📄 Abstract
Recent Multimodal Large Language Models (MLLMs) achieve promising performance on visual and audio benchmarks independently. However, the ability of these models to process cross-modal information synchronously remains largely unexplored. We introduce Daily-Omni, a multiple-choice Audio-Visual QA benchmark featuring 684 real-world videos and 1,197 questions spanning 6 task families that explicitly require cross-modal temporal reasoning. To support scalable benchmark construction, we develop a semi-automatic pipeline for annotation, cross-modal consistency refinement, temporal alignment elicitation, and text-only leakage filtering, followed by human verification. We further provide a diagnostic evaluation suite and extensively evaluate 24 foundation models under 37 model-modality settings (Audio+Video / Audio-only / Video-only / Text-only). Finally, we include a training-free modular diagnostic baseline that composes off-the-shelf unimodal models to illustrate how explicit temporal alignment signals affect performance. Results indicate that many end-to-end MLLMs still struggle on alignment-critical questions, suggesting that robust cross-modal temporal alignment remains an important open challenge.