Table of Contents

cs.CV

[1] Frequency-Modulated Visual Restoration for Matryoshka Large Multimodal Models

Qingtao Pan, Zhihao Dou, Shuo Li

🧩 TL;DR

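This paper proposes FMVR (Frequency-Modulated Visual Restoration), a simple plug-and-play strategy that decouples low- and high-frequency visual components to strengthen the reasoning ability of large multimodal models under visual token reduction, and combines it with Matryoshka Representation Learning to allow elastic adjustment of compute.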


📘 Detailed Summary

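Motivation: Large multimodal models struggle to adapt to different computational budgets because they process large numbers of visual tokens. Existing token-reduction methods inevitably lose visual semantic information, so a solution is needed that cuts computational overhead while preserving visual semantics.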

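Method: FMVR uses AvgPool and MaxPool to disentangle the representation of the reduced visual tokens into low- and high-frequency components, which are then modulated with lightweight learnable parameters. The high-frequency component from AvgPool acts as a saliency filter that enhances salient visual semantics, while the low-frequency component from MaxPool acts as an anti-saliency filter that strengthens weak visual semantics. FMVR is combined with Matryoshka Representation Learning to learn coarse-to-fine sets of visual tokens.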

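Result: Experiments on 10 image benchmarks and 4 video benchmarks show that FMVR-LLaVA reduces the FLOPs of LLaVA-1.5-7B by 89% while maintaining almost 100% of the original accuracy, a strong balance between computational efficiency and performance.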

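Conclusion: FMVR offers a simple and effective mechanism for restoring visual semantics, letting LMMs keep their reasoning ability while using fewer visual tokens. Combined with the elastic adjustment mechanism, it gives a practical way to adapt to compute budgets in real deployments and is broadly applicable.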


📄 Abstract

Large Multimodal Models (LMMs) struggle to adapt to varying computational budgets because of their large number of visual tokens. Previous methods attempted to reduce the number of visual tokens before or within LLMs. However, these strategies inevitably result in the loss of visual semantics. To address this issue, we introduce FMVR, a plug-and-play and extremely simple Frequency-Modulated Visual Restoration strategy to boost the reasoning ability of LMMs under visual token reduction. Specifically, FMVR disentangles the visual representation of fewer visual tokens into low- and high-frequency components through AvgPool and MaxPool. The derived frequencies are subsequently modulated using lightweight learnable parameters. The high-frequency component from AvgPool acts as a saliency filter to enhance salient visual semantics, while the low-frequency component from MaxPool acts as an anti-saliency filter to strengthen weak visual semantics. This enables the preservation of visual semantics dominated by a few visual tokens and the restoration of diluted visual semantics. Additionally, we inject FMVR into Matryoshka Representation Learning to learn coarse-to-fine visual token sets, making it possible to elastically adjust the number of visual tokens during inference while maintaining comparable performance. Experiments across 10 image-based and 4 video-based benchmarks demonstrate that FMVR-LLaVA reduces the FLOPs of LLaVA-1.5-7B by 89% while maintaining almost 100% of the original accuracy. The code will be open-sourced.
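The abstract does not give FMVR's formulas, so the following is only a rough sketch of pooling-based frequency modulation on a single feature channel. The two residuals (`x - avg` as the high-frequency part, `max - x` as the anti-saliency term), the function names, and the scalar gains `alpha` and `beta` standing in for the learnable parameters are all assumptions made for illustration, not the authors' implementation.

```python
def avg_pool_1d(x, k):
    """Non-overlapping average pooling over windows of size k."""
    return [sum(x[i:i + k]) / len(x[i:i + k]) for i in range(0, len(x), k)]


def max_pool_1d(x, k):
    """Non-overlapping max pooling over windows of size k."""
    return [max(x[i:i + k]) for i in range(0, len(x), k)]


def upsample(pooled, k, n):
    """Nearest-neighbour upsampling of pooled values back to length n."""
    return [pooled[i // k] for i in range(n)]


def fmvr_sketch(x, k=2, alpha=0.5, beta=0.5):
    """Hypothetical frequency modulation of one channel across tokens.

    alpha and beta stand in for lightweight learnable parameters (assumed).
    """
    n = len(x)
    avg = upsample(avg_pool_1d(x, k), k, n)
    mx = upsample(max_pool_1d(x, k), k, n)
    # Residual against the local average: large where a token stands out.
    saliency = [xi - a for xi, a in zip(x, avg)]
    # Gap to the local maximum: large where a token is weak (assumed form).
    anti_saliency = [m - xi for xi, m in zip(x, mx)]
    return [xi + alpha * s + beta * w
            for xi, s, w in zip(x, saliency, anti_saliency)]


tokens = [1.0, 3.0, 2.0, 2.0]
print(fmvr_sketch(tokens))  # [1.5, 3.5, 2.0, 2.0]
```

With both gains set to zero the function returns its input unchanged, which is the behaviour one would expect from a plug-and-play module at initialization under these assumptions.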

[2] DriveXQA: Cross-modal Visual Question Answering for Adverse Driving Scene Understanding

Mingzhe Tao, Ruiping Liu, Junwei Zheng, Yufan Chen, Kedi Ying, M. Saquib Sarfraz, Kailun Yang, Jiaming Zhang, Rainer Stiefelhagen

🧩 TL;DR

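This paper presents DriveXQA, a multimodal question-answering dataset for autonomous driving, and the MVX-LLM architecture, which fuses complementary visual modalities through a Dual Cross-Attention projector to improve scene understanding under adverse weather and sensor failures.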


📘 Detailed Summary

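Motivation: Multimodal large language models remain underexplored for using multi-sensor information to understand abnormal driving scenes. In particular, effective multimodal fusion methods are lacking for adverse weather and sensor-failure conditions, which limits how fully autonomous driving systems can understand complex driving environments.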

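Method: The DriveXQA dataset covers four visual modalities, five sensor failure cases, and five weather conditions, with 102,505 QA pairs at three levels: global scene, allocentric, and ego-vehicle centric. The MVX-LLM architecture uses a token-efficient Dual Cross-Attention projector to fuse the modalities and reduce information redundancy.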

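Result: Experiments show that the Dual Cross-Attention projector performs well under challenging conditions such as fog, reaching a GPTScore of 53.5 versus 25.1 for the baseline, which supports the value of multimodal fusion in adverse driving environments.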

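Conclusion: The work addresses a gap in MLLM-based understanding of abnormal driving scenes. The dataset and architecture offer a new approach to multi-sensor fusion and may help improve the robustness and safety of autonomous driving systems in complex environments. The resources will be released publicly.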


📄 Abstract

Fusing sensors with complementary modalities is crucial for maintaining a stable and comprehensive understanding of abnormal driving scenes. However, Multimodal Large Language Models (MLLMs) are underexplored for leveraging multi-sensor information to understand adverse driving scenarios in autonomous vehicles. To address this gap, we propose DriveXQA, a multimodal dataset for autonomous driving VQA. In addition to four visual modalities, five sensor failure cases, and five weather conditions, it includes $102,505$ QA pairs categorized into three types: global scene level, allocentric level, and ego-vehicle centric level. Since no existing MLLM framework adopts multiple complementary visual modalities as input, we design MVX-LLM, a token-efficient architecture with a Dual Cross-Attention (DCA) projector that fuses the modalities to alleviate information redundancy. Experiments demonstrate that our DCA achieves improved performance under challenging conditions such as fog (GPTScore: $53.5$ vs. $25.1$ for the baseline). The established dataset and source code will be made publicly available.

[3] Seeing Isn't Orienting: A Cognitively Grounded Benchmark Reveals Systematic Orientation Failures in MLLMs Supplementary

Nazia Tasnim, Keanu Nichols, Yuting Yang, Nicholas Ikechukwu, Elva Zou, Deepti Ghadiyaram, Bryan A. Plummer

🧩 TL;DR

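This work introduces DORI, a cognitively grounded hierarchical benchmark that specifically evaluates object orientation reasoning in vision-language models and reveals significant limitations in their understanding of geometric orientation.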


📘 Detailed Summary

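Motivation: Most current vision-language benchmarks conflate object orientation with position and general scene understanding, and no tool specifically evaluates orientation reasoning. Humans learn object orientation progressively, from recognizing which way an object faces, to mental rotation, to reasoning about orientations between objects, yet existing benchmarks cannot systematically evaluate models at these cognitive levels.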

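Method: The Discriminative Orientation Reasoning Intelligence (DORI) benchmark is a cognitively grounded hierarchical benchmark that decomposes object orientation into four dimensions, each evaluated at a coarse (categorical) and a granular (metric) level. It contains 13,652 images and 33,656 multiple-choice questions covering 67 object categories, and uses bounding-box isolation, standardized spatial reference frames, and structured prompts to separate orientation reasoning from confounding factors.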

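Result: An evaluation of 24 state-of-the-art vision-language models shows that models that do well on general spatial benchmarks are near-random on object-centric orientation tasks. The best models reach only 54.2% on coarse judgments and 45.0% on granular judgments, with the largest failures on compound rotations and shifts in inter-object reference frames. The large coarse-to-granular gap indicates that models rely on categorical heuristics rather than geometric reasoning.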

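Conclusion: Orientation understanding is identified as an unsolved challenge for multimodal systems, and existing benchmarks hide models' limitations in geometric reasoning. The findings matter for robotic manipulation, 3D scene reconstruction, and human-AI interaction, and point to the need for stronger geometric understanding.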


📄 Abstract

Humans learn object orientation progressively, from recognizing which way an object faces, to mentally rotating it, to reasoning about orientations between objects. Current vision-language benchmarks largely conflate orientation with position and general scene understanding. We introduce Discriminative Orientation Reasoning Intelligence (DORI), a cognitively grounded hierarchical benchmark that makes object orientation the primary target. Inspired by stages of human orientation cognition, DORI decomposes orientation into four dimensions, each evaluated at coarse (categorical) and granular (metric) levels. Composed from 13,652 images across 14 sources, DORI provides 33,656 multiple-choice questions covering 67 object categories in real-world and synthetic settings. Its coarse-to-granular design isolates orientation from confounds such as object recognition difficulty, scene clutter, and linguistic ambiguity via bounding-box isolation, standardized spatial reference frames, and structured prompts. Evaluating 24 state-of-the-art vision-language models shows a clear pattern: models that perform well on general spatial benchmarks are near-random on object-centric orientation tasks. The best models reach only 54.2% on coarse and 45.0% on granular judgments, with largest failures on compound rotations and shifts in inter-object reference frames. Large coarse-to-granular gaps reveal reliance on categorical heuristics rather than geometric reasoning, a limitation hidden by existing benchmarks. These results identify orientation understanding as an unsolved challenge for multimodal systems, with implications for robotic manipulation, 3D scene reconstruction, and human-AI interaction.

[4] Beyond Single-Sample: Reliable Multi-Sample Distillation for Video Understanding

Songlin Li, Xin Zhu, Zechao Guan, Peipeng Chen, Jian Yao

🧩 TL;DR

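This paper proposes R-MSD (Reliable Multi-Sample Distillation), a framework that explicitly models teacher sampling variance to address the high variance and format inconsistency caused by relying on a single teacher response in conventional black-box distillation, improving the stability and performance of distillation for large vision-language models.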


📘 Detailed Summary

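Motivation: Conventional black-box distillation for large vision-language models usually relies on a single teacher response per input. In multimodal or temporal scenarios this often yields high-variance responses and format inconsistencies, making the supervision signal unreliable and limiting the stability and quality of distillation.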

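Method: R-MSD uses a task-adaptive teacher pool to provide robust supervision and combines quality-aware signal matching with an adversarial distillation objective. It explicitly models teacher sampling variance to stabilize distillation, filters teacher noise while maximizing knowledge transfer, and applies to both closed-ended and open-ended reasoning tasks.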

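Result: On comprehensive video understanding benchmarks, R-MSD consistently outperforms single-sample distillation. With a 4B student model it improves VideoMME (+1.5%), Video-MMMU (+3.2%), and MathVerse (+3.6%), whereas an SFT+RL baseline under the same training budget shows only marginal gains.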

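Conclusion: Explicitly modeling teacher variance is important for distillation stability. The multi-sample framework mitigates unreliable supervision and offers a new approach to efficient knowledge transfer for large vision-language models, improving multimodal understanding while keeping the model compact.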


📄 Abstract

Traditional black-box distillation for Large Vision-Language Models (LVLMs) typically relies on a single teacher response per input, which often yields high-variance responses and format inconsistencies in multimodal or temporal scenarios. To mitigate this unreliable supervision, we propose R-MSD (Reliable Multi-Sample Distillation), a framework that explicitly models teacher sampling variance to enhance distillation stability. Rather than relying on a single teacher response, our approach leverages a task-adaptive teacher pool to provide robust supervision tailored to both closed-ended and open-ended reasoning. By integrating quality-aware signal matching with an adversarial distillation objective, our approach effectively filters teacher noise while maximizing knowledge transfer. Extensive evaluations across comprehensive video understanding benchmarks demonstrate that R-MSD consistently outperforms single sample distillation methods. We additionally include an original SFT+RL 4B baseline under the same training budget, which shows only marginal gains, while our method achieves significant improvements. With a 4B student model, our approach delivers gains on VideoMME (+1.5%), Video-MMMU (+3.2%), and MathVerse (+3.6%).

[5] BackdoorIDS: Zero-shot Backdoor Detection for Pretrained Vision Encoder

Siquan Huang, Yijiang Li, Ningzhi Gao, Xingfu Yan, Leyu Shi

🧩 TL;DR

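This paper proposes BackdoorIDS, a simple and effective zero-shot, inference-time method for detecting backdoor samples, intended to protect pretrained vision encoders from backdoor attacks. It builds on the attention hijacking and restoration that backdoored images exhibit under progressive input masking, and detects the resulting anomalies in the embedding sequence with density-based clustering.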


📘 Detailed Summary

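Motivation: Downstream users often rely on third-party pretrained vision encoders of uncertain provenance, which exposes them to backdoor attacks. Existing defenses typically require retraining or attack-specific knowledge, and zero-shot inference-time detection is lacking, so a general detection method is needed that requires no retraining and works across architectures.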

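Method: BackdoorIDS rests on two observations, Attention Hijacking and Restoration. Under progressive input masking, a backdoored image first concentrates attention on the malicious trigger features; once the masking ratio exceeds the trigger's robustness threshold, attention shifts rapidly to benign content. The method extracts an embedding sequence along the masking trajectory, applies density-based clustering (such as DBSCAN), and uses the number of clusters in the sequence to identify backdoor samples.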

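Result: Extensive experiments show that BackdoorIDS consistently outperforms existing defenses across attack types, datasets, and model families. It works well on a range of encoder architectures (including CNNs, ViTs, CLIP, and LLaVA-1.5), requires no retraining, and runs fully zero-shot at inference time, which makes it plug-and-play.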

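Conclusion: BackdoorIDS provides a simple and effective zero-shot backdoor detection scheme and reveals the distinctive behavior of backdoored images under progressive masking. Its broad architectural compatibility makes it practical to deploy, it helps safeguard the use of pretrained vision encoders, and it offers a new perspective on how backdoor attacks work.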


📄 Abstract

Self-supervised and multimodal vision encoders learn strong visual representations that are widely adopted in downstream vision tasks and large vision-language models (LVLMs). However, downstream users often rely on third-party pretrained encoders with uncertain provenance, exposing them to backdoor attacks. In this work, we propose BackdoorIDS, a simple yet effective zero-shot, inference-time backdoor sample detection method for pretrained vision encoders. BackdoorIDS is motivated by two observations: Attention Hijacking and Restoration. Under progressive input masking, a backdoored image initially concentrates attention on malicious trigger features. Once the masking ratio exceeds the trigger's robustness threshold, the trigger is deactivated, and attention rapidly shifts to benign content. This transition induces a pronounced change in the image embedding, whereas embeddings of clean images evolve more smoothly as masking progresses. BackdoorIDS operationalizes this signal by extracting an embedding sequence along the masking trajectory and applying density-based clustering such as DBSCAN. An input is flagged as backdoored if its embedding sequence forms more than one cluster. Extensive experiments show that BackdoorIDS consistently outperforms existing defenses across diverse attack types, datasets, and model families. Notably, it is a plug-and-play approach that requires no retraining and operates fully zero-shot at inference time, making it compatible with a wide range of encoder architectures, including CNNs, ViTs, CLIP, and LLaVA-1.5.
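The decision rule in the abstract (cluster the embedding sequence, flag the input if more than one cluster forms) can be sketched as follows. This is a minimal illustration with a small hand-written DBSCAN; the function names, the `eps` and `min_pts` values, and the toy 2-D "embeddings" are assumptions, and the masking schedule and encoder that would produce a real embedding sequence are not shown.

```python
import math


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN; returns one label per point (-1 means noise)."""
    n = len(points)
    labels = [None] * n

    def neighbours(i):
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nb = neighbours(i)
        if len(nb) < min_pts:
            labels[i] = -1  # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nb if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbj = neighbours(j)
            if len(nbj) >= min_pts:  # only core points expand the cluster
                queue.extend(nbj)
    return labels


def is_backdoored(embedding_sequence, eps=0.15, min_pts=2):
    """Flag an input whose embeddings along the masking trajectory
    split into more than one cluster (hypothetical thresholds)."""
    labels = dbscan(embedding_sequence, eps, min_pts)
    return len({c for c in labels if c != -1}) > 1


# Toy sequences: a smooth drift versus an abrupt jump mid-trajectory.
clean = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (0.3, 0.0), (0.4, 0.0)]
jumpy = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (5.0, 5.0), (5.1, 5.0), (5.2, 5.0)]
print(is_backdoored(clean), is_backdoored(jumpy))  # False True
```

In practice `eps` would have to be chosen relative to the scale of the encoder's embeddings; the values here only suit the toy data.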

[6] OSCBench: Benchmarking Object State Change in Text-to-Video Generation

Xianjing Han, Bin Zhu, Shiqi Hu, Franklin Mingzhe Li, Patrick Carrington, Roger Zimmermann, Jingjing Chen

🧩 TL;DR

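This paper presents OSCBench, a benchmark designed specifically to evaluate object state change in text-to-video generation models, and shows that current models have a significant bottleneck in producing accurate and temporally consistent object state changes.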


📘 Detailed Summary

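Motivation: Existing text-to-video benchmarks focus mainly on perceptual quality, text-video alignment, or physical plausibility, and overlook a key aspect of action understanding: object state change explicitly specified in the text prompt. Object state change is the transformation of an object's state caused by an action, such as peeling a potato or slicing a lemon, and this capability is essential for generating videos that follow instructions.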

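Method: OSCBench is built from instructional cooking data and systematically organizes action-object interactions into regular, novel, and compositional scenarios to probe both in-distribution performance and generalization. Six representative open-source and proprietary text-to-video models are evaluated with both a human user study and automatic evaluation based on multimodal large language models.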

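Result: Although current text-to-video models do well on semantic and scene alignment, they consistently struggle with accurate and temporally consistent object state changes, especially in novel and compositional settings. This identifies object state change as a key bottleneck in text-to-video generation, and OSCBench can serve as a diagnostic benchmark for advancing state-aware video generation models.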

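Conclusion: The work establishes object state change as a core challenge for text-to-video generation and provides a dedicated evaluation benchmark. OSCBench exposes the limitations of current models and gives future work a diagnostic tool and a research direction for video generation models that accurately understand and render object state changes.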


📄 Abstract

Text-to-video (T2V) generation models have made rapid progress in producing visually high-quality and temporally coherent videos. However, existing benchmarks primarily focus on perceptual quality, text-video alignment, or physical plausibility, leaving a critical aspect of action understanding largely unexplored: object state change (OSC) explicitly specified in the text prompt. OSC refers to the transformation of an object's state induced by an action, such as peeling a potato or slicing a lemon. In this paper, we introduce OSCBench, a benchmark specifically designed to assess OSC performance in T2V models. OSCBench is constructed from instructional cooking data and systematically organizes action-object interactions into regular, novel, and compositional scenarios to probe both in-distribution performance and generalization. We evaluate six representative open-source and proprietary T2V models using both human user study and multimodal large language model (MLLM)-based automatic evaluation. Our results show that, despite strong performance on semantic and scene alignment, current T2V models consistently struggle with accurate and temporally consistent object state changes, especially in novel and compositional settings. These findings position OSC as a key bottleneck in text-to-video generation and establish OSCBench as a diagnostic benchmark for advancing state-aware video generation models.

[7] ZeroSense: How Vision matters in Long Context Compression

Yonghan Gao, Zehong Chen, Lijian Xu, Jingzhi Chen, Jingwei Guan, Xingyu Zeng

🧩 TL;DR

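This paper proposes a new evaluation framework that decouples the capabilities of multimodal large language models in order to assess visual-text compression quality accurately, and introduces the ZeroSense benchmark to ensure low semantic correlation among test samples.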


📘 Detailed Summary

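Motivation: Evaluation protocols for existing visual-text compression methods rely heavily on downstream task performance. Such metrics cannot accurately measure how well text is preserved, because multimodal large language models have strong inherent linguistic priors and the results are therefore influenced by the downstream model's semantic inference ability.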

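Method: The paper introduces a new evaluation framework that decouples the capabilities of multimodal large language models to assess visual-text compression quality faithfully. It further proposes the ZeroSense benchmark, which removes contextual dependencies so that test samples have low semantic correlation and the evaluation results reflect visual-text compression quality alone.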

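Result: Extensive experiments on multiple datasets show that visual-text compression quality and downstream task accuracy diverge significantly. This highlights the need for a decoupled evaluation framework and supports the claim that the proposed framework can assess compression quality without being affected by the semantic inference ability of downstream models.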

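Conclusion: The study exposes the limitations of existing evaluation methods and stresses the importance of evaluation frameworks that are independent of downstream model capability. It provides a more reliable methodological basis for assessing visual-text compression quality and suggests that future work should focus on evaluating compression quality separately from task performance.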


📄 Abstract

Recent visual-text compression (VTC) methods, typified by DeepSeek-OCR, report impressive high token compression ratios for long-context modeling tasks by leveraging text-to-image rendering. However, existing evaluation protocols heavily rely on downstream task performance. Such evaluation metrics fail to accurately measure text preservation due to the strong inherent linguistic priors of Multimodal Large Language Models (MLLMs). In this work, we introduce a new evaluation framework that decouples MLLMs' capabilities to faithfully assess VTC quality. Within this framework, we further introduce the ZeroSense Benchmark to ensure low semantic correlation of testing samples. By eliminating contextual dependencies, our benchmark guarantees that the evaluation results are purely reflective of VTC quality, unaffected by the semantic inference capabilities of downstream models. Extensive experiments across multiple datasets demonstrate that VTC quality and downstream task accuracy diverge significantly, highlighting the necessity of our decoupled evaluation framework.

[8] Think While Watching: Online Streaming Segment-Level Memory for Multi-Turn Video Reasoning in Multimodal Large Language Models

Lu Wang, Zhuoran Jin, Yupu Hao, Yubo Chen, Kang Liu, Yulong Ao, Jun Zhao

🧩 TL;DR

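This paper proposes Think While Watching, a memory-anchored streaming video reasoning framework that decouples perception from generation and maintains continuous segment-level memory, improving the multi-turn interaction ability of multimodal large language models on streaming video input.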


📘 Detailed Summary

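Motivation: Current multimodal large language models perform well on offline video understanding but struggle with continuously arriving video streams. Existing streaming methods typically use an interleaved perception-generation paradigm, which prevents perception and generation from running concurrently and causes early memory decay as the stream grows. This hurts long-range dependency modeling and makes effective multi-turn interaction difficult.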

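Method: Think While Watching is a memory-anchored streaming video reasoning architecture that supports multi-turn interaction by maintaining continuous segment-level memory. The authors build a three-stage, multi-round chain-of-thought dataset, adopt a stage-matched training strategy, and enforce strict causality through a segment-level streaming causal mask and streaming positional encoding. At inference time, an efficient pipeline overlaps watching and thinking and adaptively selects the best attention backend.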

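Result: The method achieves strong results under both single-round and multi-round streaming input protocols. Built on Qwen3-VL, it improves single-round accuracy by 2.6% on StreamingBench and by 3.79% on OVO-Bench. In the multi-round setting it maintains performance while reducing output tokens by 56%, showing gains in both efficiency and effectiveness.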

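Conclusion: Decoupling perception from generation and maintaining continuous memory can effectively address early memory decay in streaming video reasoning. Think While Watching offers a new approach to applying multimodal large language models to real-time video streams, balancing computational efficiency with long-range dependency modeling, and lays groundwork for future streaming multimodal interaction systems.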


📄 Abstract

Multimodal large language models (MLLMs) have shown strong performance on offline video understanding, but most are limited to offline inference or have weak online reasoning, making multi-turn interaction over continuously arriving video streams difficult. Existing streaming methods typically use an interleaved perception-generation paradigm, which prevents concurrent perception and generation and leads to early memory decay as streams grow, hurting long-range dependency modeling. We propose Think While Watching, a memory-anchored streaming video reasoning framework that preserves continuous segment-level memory during multi-turn interaction. We build a three-stage, multi-round chain-of-thought dataset and adopt a stage-matched training strategy, while enforcing strict causality through a segment-level streaming causal mask and streaming positional encoding. During inference, we introduce an efficient pipeline that overlaps watching and thinking and adaptively selects the best attention backend. Under both single-round and multi-round streaming input protocols, our method achieves strong results. Built on Qwen3-VL, it improves single-round accuracy by 2.6% on StreamingBench and by 3.79% on OVO-Bench. In the multi-round setting, it maintains performance while reducing output tokens by 56%. Code is available at: https://github.com/wl666hhh/Think_While_Watching/
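The abstract names a segment-level streaming causal mask but does not define it. One plausible reading is a block-causal mask over segments: a token may attend to every token in its own segment and in earlier segments, and never to a later segment. The sketch below builds such a mask; the function name and this exact visibility rule are assumptions, not the paper's definition.

```python
def segment_block_causal_mask(segment_ids):
    """Build a boolean attention mask from per-token segment indices.

    mask[q][k] is True when query token q may attend to key token k,
    i.e. when k belongs to the same or an earlier segment (assumed rule).
    """
    n = len(segment_ids)
    return [[segment_ids[k] <= segment_ids[q] for k in range(n)]
            for q in range(n)]


# Five tokens: two in segment 0, two in segment 1, one in segment 2.
mask = segment_block_causal_mask([0, 0, 1, 1, 2])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# xx...
# xx...
# xxxx.
# xxxx.
# xxxxx
```

Under this rule no token can see a segment that has not arrived yet, which is the property a streaming setting needs; a stricter per-token causal order inside each segment would be an equally reasonable variant.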

[9] Linking Perception, Confidence and Accuracy in MLLMs

Yuetian Du, Yucheng Wang, Rongyu Zhang, Zhijie Xu, Boyu Yang, Ming Kong, Jie Liu, Qiang Zhu

🧩 TL;DR

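This paper proposes Confidence-Driven Reinforcement Learning (CDRL) and Confidence-Aware Test-Time Scaling (CA-TTS) to address confidence miscalibration in multimodal large language models. By enhancing perceptual sensitivity and dynamically coordinating several reasoning modules, the approach achieves significant gains on multiple benchmarks.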


📘 Detailed Summary

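Motivation: Although multimodal large language models have improved in visual perception accuracy, the key question of whether a model knows when it is uncertain has not been adequately explored. A probing experiment shows that MLLMs are severely miscalibrated: they can express high confidence even when uncertain, which limits their reliability and practical value.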

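Method: Confidence-Driven Reinforcement Learning (CDRL) uses original-noise image pairs and a confidence-based reward to enhance perceptual sensitivity and calibrate the model's confidence. Confidence-Aware Test-Time Scaling (CA-TTS) then dynamically coordinates Self-Consistency, Self-Reflection, and Visual Self-Check modules, with an Expert Model acting in multiple roles such as Planner, Critic, and Voter to schedule these modules and provide external verification.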

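Result: The integrated framework sets new state-of-the-art results on four benchmarks with consistent gains of 8.8%. Ablation studies further demonstrate the effectiveness of each module and the advantage of the scaling approach; calibrated confidence makes test-time scaling a "free lunch" that improves model performance.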

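Conclusion: The work addresses confidence miscalibration in multimodal large language models and shows how calibrated confidence can serve as an effective signal for coordinating complex reasoning. The confidence-aware test-time scaling framework offers a route toward more reliable and transparent multimodal AI systems, improving decision quality under uncertainty through dynamic module scheduling and external verification.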


📄 Abstract

Recent advances in Multi-modal Large Language Models (MLLMs) have predominantly focused on enhancing visual perception to improve accuracy. However, a critical question remains unexplored: Do models know when they do not know? Through a probing experiment, we reveal a severe confidence miscalibration problem in MLLMs. To address this, we propose Confidence-Driven Reinforcement Learning (CDRL), which uses original-noise image pairs and a novel confidence-based reward to enhance perceptual sensitivity and robustly calibrate the model's confidence. Beyond training benefits, calibrated confidence enables more effective test-time scaling as a free lunch. We further propose Confidence-Aware Test-Time Scaling (CA-TTS), which dynamically coordinates Self-Consistency, Self-Reflection, and Visual Self-Check modules guided by confidence signals. An Expert Model acts in multiple roles (e.g., Planner, Critic, Voter) to schedule these modules and provide external verification. Our integrated framework establishes new state-of-the-art results with consistent 8.8% gains across four benchmarks. More ablation studies demonstrate the effectiveness of each module and scaling superiority.
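The abstract does not say how miscalibration is measured. A common way to quantify the gap between stated confidence and accuracy is the expected calibration error (ECE), sketched below as an illustration; using ECE here is an assumption, as are the function name and the equal-width binning.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected calibration error with equal-width confidence bins.

    confidences: predicted confidence in [0, 1] for each answer.
    correct:     whether each answer was actually right.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes to the last bin
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's confidence-accuracy gap by its share of samples.
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


# An overconfident model: 90% confidence on every answer, 25% accuracy.
overconfident = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, False, False, False])
print(round(overconfident, 2))  # 0.65
```

A well-calibrated model drives this value toward zero, which is what makes its confidence usable as a signal for deciding when to spend extra test-time compute.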

[10] EvoTok: A Unified Image Tokenizer via Residual Latent Evolution for Visual Understanding and Generation

Yan Li, Ning Liao, Xiangyu Zhao, Shaofeng Zhang, Xiaoxing Wang, Yifan Yang, Junchi Yan, Xue Yang

🧩 TL;DR

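This paper proposes EvoTok, an image tokenizer that unifies visual understanding and generation in a shared latent space through a residual evolution process, addressing the granularity gap between understanding and generation in multimodal large language models.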


📘 Detailed Summary

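Motivation: A fundamental challenge for multimodal large language models is the granularity gap between visual understanding and generation: understanding requires high-level semantic abstraction, while image generation requires fine-grained pixel-level representations. Existing approaches either impose both kinds of supervision on the same set of representations or decouple them into separate feature spaces, leading to interference and inconsistency, respectively.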

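Method: EvoTok reconciles these requirements through a residual evolution process in a shared latent space, using residual vector quantization to encode an image into a cascaded sequence of residual tokens. The residual sequence forms an evolution trajectory in which earlier stages capture low-level details and deeper stages progressively shift toward high-level semantic representations.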

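Result: Although trained on a relatively small dataset of 13M images, far smaller than the billion-scale datasets used by previous unified tokenizers, EvoTok achieves a reconstruction quality of 0.43 rFID on ImageNet-1K at 256x256 resolution. Integrated with a large language model, it performs well on 7 of 9 visual understanding benchmarks and achieves strong results on image generation benchmarks such as GenEval and GenAI-Bench.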

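Conclusion: Modeling visual representations as an evolution trajectory provides an effective and principled way to unify visual understanding and generation. The results indicate that a residual evolution process can reconcile representation requirements of different granularity within a shared latent space, offering a new direction for multimodal large language models.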


📄 Abstract

The development of unified multimodal large language models (MLLMs) is fundamentally challenged by the granularity gap between visual understanding and generation: understanding requires high-level semantic abstractions, while image generation demands fine-grained pixel-level representations. Existing approaches usually enforce both forms of supervision on the same set of representations or decouple them into separate feature spaces, leading to interference and inconsistency, respectively. In this work, we propose EvoTok, a unified image tokenizer that reconciles these requirements through a residual evolution process within a shared latent space. Instead of maintaining separate token spaces for pixels and semantics, EvoTok encodes an image into a cascaded sequence of residual tokens via residual vector quantization. This residual sequence forms an evolution trajectory where earlier stages capture low-level details and deeper stages progressively transition toward high-level semantic representations. Despite being trained on a relatively modest dataset of 13M images, far smaller than the billion-scale datasets used by many previous unified tokenizers, EvoTok achieves a strong reconstruction quality of 0.43 rFID on ImageNet-1K at 256x256 resolution. When integrated with a large language model, EvoTok shows promising performance across 7 out of 9 visual understanding benchmarks, and remarkable results on image generation benchmarks such as GenEval and GenAI-Bench. These results demonstrate that modeling visual representations as an evolving trajectory provides an effective and principled solution for unifying visual understanding and generation.
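Residual vector quantization, which the abstract uses to produce the cascaded token sequence, can be illustrated generically: each stage quantizes what the previous stages left unexplained. The sketch below shows only that generic mechanism with tiny hand-made codebooks; the function names and codebook values are hypothetical, and it does not reproduce EvoTok's training or its shift from detail to semantics across stages.

```python
def nearest(codebook, vector):
    """Index of the codebook entry closest to vector (squared L2 distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], vector)))


def rvq_encode(x, codebooks):
    """Quantize x stage by stage; each stage encodes the remaining residual."""
    residual = list(x)
    indices = []
    for codebook in codebooks:
        i = nearest(codebook, residual)
        indices.append(i)
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return indices, residual


def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for i, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[i])]
    return out


# Two stages with two entries each (toy values).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.25, -0.25]],
]
indices, leftover = rvq_encode([1.2, 0.8], codebooks)
print(indices)                         # [1, 1]
print(rvq_decode(indices, codebooks))  # [1.25, 0.75]
```

The list of indices is the token sequence for the input, and adding more stages shrinks the leftover residual at the cost of more tokens per image.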

[11] EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models

Xuanlang Dai, Yujie Zhou, Long Xing, Jiazi Bu, Xilin Wei, Yuhong Liu, Beichen Zhang, Kai Chen, Yuhang Zang

🧩 TL;DR

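This paper proposes the Endogenous Chain-of-Thought (EndoCoT) framework, which activates the reasoning potential of multimodal large language models and bridges it to the denoising process of diffusion models, addressing the insufficient reasoning depth and invariant guidance of existing methods on complex spatial reasoning tasks.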


📘 Detailed Summary

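Motivation: The current paradigm of integrating multimodal large language models into diffusion frameworks as text encoders has two key limitations. First, the MLLM text encoder shows insufficient reasoning depth, since single-step encoding cannot activate the chain-of-thought process that complex tasks require. Second, the guidance stays fixed during decoding; even with a correct MLLM encoding, this invariant guidance prevents the diffusion model from progressively decomposing complex instructions into actionable denoising steps.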

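Method: EndoCoT has two core components. An iterative thought guidance module activates the MLLM's reasoning potential by iteratively refining latent thought states and then bridges these states to the diffusion model's denoising process. A terminal thought grounding module aligns the final state with ground-truth answers so that the reasoning trajectory stays grounded in textual supervision, allowing the MLLM text encoder to provide carefully reasoned guidance.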

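Result: Extensive evaluation on several benchmarks (including Maze, TSP, VSP, and Sudoku) shows that EndoCoT reaches an average accuracy of 92.1%, 8.3 percentage points above the strongest baseline, a substantial improvement on complex spatial reasoning tasks.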

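Conclusion: By activating the MLLM's intrinsic reasoning ability and coupling it with the progressive denoising process of diffusion models, the work provides a step-by-step framework for solving complex tasks. It opens a new direction for integrating multimodal reasoning with generative models and shows the key role of chain-of-thought mechanisms in guiding generation.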


📄 Abstract

Recently, Multimodal Large Language Models (MLLMs) have been widely integrated into diffusion frameworks primarily as text encoders to tackle complex tasks such as spatial reasoning. However, this paradigm suffers from two critical limitations: (i) The MLLM text encoder exhibits insufficient reasoning depth. Single-step encoding fails to activate the Chain-of-Thought process, which is essential for MLLMs to provide accurate guidance for complex tasks. (ii) The guidance remains invariant during the decoding process. Invariant guidance during decoding prevents the DiT from progressively decomposing complex instructions into actionable denoising steps, even with correct MLLM encodings. To this end, we propose Endogenous Chain-of-Thought (EndoCoT), a novel framework with two components. First, an iterative thought guidance module activates the MLLM's reasoning potential by iteratively refining latent thought states, and then bridges these states to the DiT's denoising process. Second, a terminal thought grounding module ensures that the reasoning trajectory remains grounded in textual supervision by aligning the final state with ground-truth answers. With these two components, the MLLM text encoder delivers meticulously reasoned guidance, enabling the DiT to execute it progressively and ultimately solve complex tasks in a step-by-step manner. Extensive evaluations across diverse benchmarks (e.g., Maze, TSP, VSP, and Sudoku) show an average accuracy of 92.1%, outperforming the strongest baseline by 8.3 percentage points.

[12] EgoIntent: An Egocentric Step-level Benchmark for Understanding What, Why, and Next

Ye Pan, Chi Kit Wong, Yuanhuiyi Lyu, Hanqian Li, Jiahao Huo, Jiacheng Chen, Lutao Jiang, Xu Zheng, Xuming Hu

🧩 TL;DR

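This work introduces the EgoIntent benchmark for evaluating fine-grained, step-level intent understanding by multimodal large language models in egocentric videos. It finds that existing models perform poorly on this task, exposing an important research challenge.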


📘 Detailed Summary

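Motivation: Existing benchmarks focus mainly on episode-level intent reasoning and overlook the finer granularity of step-level intent understanding. Applications such as intelligent assistants, robotic imitation learning, and augmented reality guidance need to understand what a person is doing at each step, why, and what comes next, in order to provide timely and context-aware support.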

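Method: The EgoIntent benchmark contains 3,014 steps spanning 15 different indoor and outdoor daily-life scenarios and evaluates models on three complementary dimensions: local intent (What), global intent (Why), and next-step plan (Next). A key design choice is that each clip is truncated before the key outcome of the queried step occurs and contains no frames from later steps, which prevents future-frame leakage and allows a clean evaluation of anticipatory step understanding and next-step planning.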

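Result: Fifteen multimodal large language models were evaluated, including state-of-the-art closed-source and open-source models. Even the best-performing model averages only 33.31 across the three intent dimensions, indicating that step-level intent understanding in egocentric videos remains a highly challenging problem.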

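Conclusion: The study reveals a significant weakness of multimodal large language models in step-level intent understanding for egocentric videos, underlines the importance of finer-grained intent understanding for practical applications, and provides a clear evaluation benchmark and direction for future research.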


📄 Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable video reasoning capabilities across diverse tasks. However, their ability to understand human intent at a fine-grained level in egocentric videos remains largely unexplored. Existing benchmarks focus primarily on episode-level intent reasoning, overlooking the finer granularity of step-level intent understanding. Yet applications such as intelligent assistants, robotic imitation learning, and augmented reality guidance require understanding not only what a person is doing at each step, but also why and what comes next, in order to provide timely and context-aware support. To this end, we introduce EgoIntent, a step-level intent understanding benchmark for egocentric videos. It comprises 3,014 steps spanning 15 diverse indoor and outdoor daily-life scenarios, and evaluates models on three complementary dimensions: local intent (What), global intent (Why), and next-step plan (Next). Crucially, each clip is truncated immediately before the key outcome of the queried step (e.g., contact or grasp) occurs and contains no frames from subsequent steps, preventing future-frame leakage and enabling a clean evaluation of anticipatory step understanding and next-step planning. We evaluate 15 MLLMs, including both state-of-the-art closed-source and open-source models. Even the best-performing model achieves an average score of only 33.31 across the three intent dimensions, underscoring that step-level intent understanding in egocentric videos remains a highly challenging problem that calls for further investigation.

[13] LatentGeo: Learnable Auxiliary Constructions in Latent Space for Multimodal Geometric Reasoning

Haiying Xu, Zihan Wang, Song Dai, Zhengxuan Zhang, Kairan Dou, Xuming Hu

🧩 TL;DR

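This work proposes LatentGeo, a framework that learns continuous latent visual representations to internalize auxiliary geometric constructions. It addresses the difficulty multimodal large language models have in representing auxiliary constructions during geometric reasoning and achieves significant gains on geometric reasoning tasks.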


📘 Detailed Summary

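Motivation: Multimodal large language models face a fundamental challenge in representing auxiliary geometric constructions, which are absent from the original diagram yet are a prerequisite for applying theorems. Existing approaches rely mainly on explicit construction paradigms, including text-based geometric specification, visual-token interleaving during reasoning, and tool-augmented geometric execution. These methods either fail to represent complex spatial relationships faithfully, suffer from a representation mismatch between discrete symbols and continuous geometric structures, or depend on external capabilities that hinder end-to-end optimization.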

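Method: LatentGeo learns continuous latent visual representations to internalize auxiliary geometric constructions without pixel-level rendering or external executors. A three-stage curriculum progressively aligns and internalizes these latent representations through auxiliary visual supervision. It is followed by LaGDPO, a latent-aware reinforcement learning procedure that stabilizes the latent representations during policy optimization while improving end-task correctness. To evaluate construction-centric representation quality systematically, the authors introduce GeoAux, a new benchmark targeting visually dependent geometry problems.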

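Result: Experiments on GeoAux and MathVerse show that LatentGeo achieves substantial gains on geometric reasoning tasks, particularly those that require auxiliary constructions. Extensive ablation studies further validate each component of the framework and support the benefit of continuous latent representation learning for internalizing geometric constructions.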

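Conclusion: Continuous latent visual representation learning can effectively internalize auxiliary geometric constructions and overcomes the limitations of existing explicit construction paradigms. By combining curriculum learning with latent-aware reinforcement learning, LatentGeo achieves end-to-end optimized geometric reasoning and offers a new direction for applying multimodal large language models to complex geometry problems.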


📄 Abstract

Despite recent advances in multimodal reasoning, representing auxiliary geometric constructions remains a fundamental challenge for multimodal large language models (MLLMs). Such constructions are absent from the original diagram and must be introduced before theorems apply. Existing approaches predominantly rely on explicit construction paradigms, including text-based geometric specification, visual-token interleaving during reasoning, and tool-augmented geometric execution. However, these methods either fail to faithfully represent complex spatial relationships, incur representation mismatch between discrete symbols and continuous geometric structures, or rely on external capabilities that hinder end-to-end optimization. To address these limitations, we propose LatentGeo, a framework that learns continuous latent visual representations to internalize auxiliary geometric constructions without pixel-level rendering or external executors. We design a three-stage curriculum that progressively aligns and internalizes these latent representations through auxiliary visual supervision, followed by LaGDPO, a latent-aware reinforcement learning procedure that stabilizes latent representations during policy optimization while improving end-task correctness. To systematically evaluate construction-centric representation quality, we introduce GeoAux, a new benchmark targeting visually dependent geometry problems, and conduct experiments on GeoAux and MathVerse. Results show that LatentGeo achieves substantial gains on geometric reasoning tasks, particularly those requiring auxiliary constructions. Extensive analyses and ablation studies further validate the effectiveness of each component in our framework.

[14] ForensicZip: More Tokens are Better but Not Necessary in Forensic Vision-Language Models

Yingxin Lai, Zitong Yu, Jun Wang, Linlin Shen, Yong Xu, Xiaochun Cao

🧩 TL;DR

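This paper presents ForensicZip, a training-free framework that reformulates token compression from a forgery-driven perspective to address the computational cost of multimodal large language models in multimedia forensics, substantially accelerating inference while maintaining strong performance.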


📘 Detailed Summary

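Motivation: Multimodal large language models support interpretable multimedia forensics by generating textual rationales, but processing dense visual sequences is computationally expensive, especially for high-resolution images and videos. Existing visual token pruning methods are mainly semantic-driven: they keep salient objects and discard background regions, yet forgery traces such as high-frequency anomalies and temporal jitter often reside in exactly those background regions.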

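Method: ForensicZip models temporal token evolution as a Birth-Death Optimal Transport problem with a slack dummy node, quantifying the physical discontinuities that indicate transient generative artifacts. The forensic score further combines transport-based novelty with high-frequency priors to separate forensic evidence from semantic content under large-ratio compression, giving training-free, forgery-driven token compression.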

Result: On deepfake and AIGC benchmarks, with only 10% of tokens retained, ForensicZip achieves a 2.97x speedup and over 90% FLOPs reduction while maintaining state-of-the-art detection performance.

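Conclusion: The study shows the value of shifting token compression from a semantic-driven to a forgery-driven perspective and provides an efficient and effective solution for multimedia forensics. ForensicZip can be deployed without training, lowers computational cost substantially while preserving detection performance, and opens a path toward real-time forensic analysis in practice.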


📄 Abstract

Multimodal Large Language Models (MLLMs) enable interpretable multimedia forensics by generating textual rationales for forgery detection. However, processing dense visual sequences incurs high computational costs, particularly for high-resolution images and videos. Visual token pruning is a practical acceleration strategy, yet existing methods are largely semantic-driven, retaining salient objects while discarding background regions where manipulation traces such as high-frequency anomalies and temporal jitters often reside. To address this issue, we introduce ForensicZip, a training-free framework that reformulates token compression from a forgery-driven perspective. ForensicZip models temporal token evolution as a Birth-Death Optimal Transport problem with a slack dummy node, quantifying physical discontinuities indicating transient generative artifacts. The forensic scoring further integrates transport-based novelty with high-frequency priors to separate forensic evidence from semantic content under large-ratio compression. Experiments on deepfake and AIGC benchmarks show that at 10% token retention, ForensicZip achieves a $2.97\times$ speedup and over 90% FLOPs reduction while maintaining state-of-the-art detection performance.
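The overall shape of score-then-retain token compression can be sketched as below. This is deliberately not the paper's method: the Birth-Death Optimal Transport step is replaced by a much cruder nearest-neighbour distance to the previous frame's tokens, and the additive combination with a high-frequency term, the weight `lam`, and all function names are assumptions made for illustration.

```python
import math


def forensic_scores(prev_tokens, cur_tokens, hf_energy, lam=1.0):
    """Score each current token by temporal novelty plus a high-frequency prior.

    Novelty here is the distance to the closest previous-frame token,
    a simple stand-in for a transport-based novelty measure (assumption).
    """
    scores = []
    for token, hf in zip(cur_tokens, hf_energy):
        novelty = min(math.dist(token, p) for p in prev_tokens)
        scores.append(novelty + lam * hf)
    return scores


def keep_top(scores, ratio):
    """Indices of the highest-scoring tokens, keeping a fraction `ratio`."""
    k = max(1, math.ceil(ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


prev = [(0.0, 0.0), (1.0, 0.0)]
cur = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (0.0, 0.1)]
hf = [0.0, 0.0, 0.0, 0.0]  # hypothetical per-token high-frequency energy
scores = forensic_scores(prev, cur, hf)
print(keep_top(scores, 0.25))  # [2]
```

The token that appears abruptly relative to the previous frame is the one retained, regardless of whether it sits on a salient object, which is the contrast with semantic-driven pruning that the abstract draws.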

[15] Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing

Baifeng Shi, Stephanie Fu, Long Lian, Hanrong Ye, David Eigen, Aaron Reite, Boyi Li, Jan Kautz, Song Han, David M. Chan, Pavlo Molchanov, Trevor Darrell, Hongxu Yin

🧩 TL;DR

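This paper presents AutoGaze, a lightweight module that autoregressively selects a minimal set of multi-scale video patches to remove spatiotemporal redundancy, substantially improving the efficiency of multimodal large language models on long, high-resolution videos. It reduces visual tokens by 4x-100x, accelerates ViT and MLLM inference by up to 19x, and achieves strong results on video benchmarks.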


📘 Detailed Summary

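Motivation: Multimodal large language models have advanced general-purpose video understanding but struggle with long, high-resolution videos, because their vision transformers or LLMs process every pixel equally and ignore significant spatiotemporal redundancy. Existing methods cannot efficiently handle long videos of 1K frames at 4K resolution, which leads to computational inefficiency and performance bottlenecks.

Method: AutoGaze is a lightweight module that autoregressively selects a minimal set of multi-scale patches able to reconstruct the video within a user-specified error threshold. It is trained with next-token prediction and reinforcement learning, and it removes redundant patches before the vision transformer or MLLM processes them while preserving key information.

Result: AutoGaze reduces visual tokens by 4x-100x and accelerates ViT and MLLM inference by up to 19x, enabling MLLMs to scale to 1K-frame, 4K-resolution videos. It reaches 67.0% on VideoMME, and on the newly proposed HLVid benchmark (5-minute, 4K-resolution videos) it improves over the baseline by 10.1% and outperforms the previous best MLLM by 4.5%.

Conclusion: By eliminating redundancy before the expensive model stages, AutoGaze significantly improves the efficiency and scalability of multimodal large language models on long, high-resolution videos. It addresses the computational bottleneck of existing methods and extends video understanding to longer, higher-resolution content, offering a practical route to efficient video analysis.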


📄 Abstract

Multi-modal large language models (MLLMs) have advanced general-purpose video understanding but struggle with long, high-resolution videos -- they process every pixel equally in their vision transformers (ViTs) or LLMs despite significant spatiotemporal redundancy. We introduce AutoGaze, a lightweight module that removes redundant patches before they are processed by a ViT or an MLLM. Trained with next-token prediction and reinforcement learning, AutoGaze autoregressively selects a minimal set of multi-scale patches that can reconstruct the video within a user-specified error threshold, eliminating redundancy while preserving information. Empirically, AutoGaze reduces visual tokens by 4x-100x and accelerates ViTs and MLLMs by up to 19x, enabling scaling MLLMs to 1K-frame 4K-resolution videos and achieving superior results on video benchmarks (e.g., 67.0% on VideoMME). Furthermore, we introduce HLVid: the first high-resolution, long-form video QA benchmark with 5-minute 4K-resolution videos, where an MLLM scaled with AutoGaze improves over the baseline by 10.1% and outperforms the previous best MLLM by 4.5%. Project page: https://autogaze.github.io/.
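The selection criterion described above (pick as few patches as possible while keeping reconstruction error under a user-specified threshold) can be sketched with a simple greedy rule. This is only a stand-in: AutoGaze itself is a learned autoregressive model, whereas the sketch below is a hand-written heuristic. It works at a single scale, represents each patch by one number (for example its mean intensity), and reconstructs unselected patches by copying the previous frame; all of these are simplifying assumptions, and `select_patches` is a hypothetical name.

```python
def select_patches(prev_frame, curr_frame, max_error):
    """Greedily choose patches to refresh until the mean squared
    reconstruction error drops to max_error or below.

    Unselected patches are assumed to be reconstructed by copying the
    corresponding patch from the previous frame.
    """
    n = len(curr_frame)
    errors = [(c - p) ** 2 for p, c in zip(prev_frame, curr_frame)]
    # Visit the worst-reconstructed patches first.
    order = sorted(range(n), key=lambda i: errors[i], reverse=True)
    selected = []
    residual = sum(errors)
    for i in order:
        if residual / n <= max_error:
            break
        selected.append(i)
        residual -= errors[i]  # a selected patch is reconstructed exactly
    return sorted(selected)


prev_frame = [0, 0, 0, 0]
curr_frame = [0, 3, 0, 1]
loose = select_patches(prev_frame, curr_frame, max_error=0.5)
strict = select_patches(prev_frame, curr_frame, max_error=0.0)
```

A looser threshold keeps only the patch that changed most, while a zero threshold forces every changed patch to be kept, which mirrors the trade-off between token count and fidelity that the error threshold controls.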

[16] MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning

Haozhan Shen, Shilin Yan, Hongwei Xue, Shuaiqi Lu, Xiaojun Tang, Guannan Zhang, Tiancheng Zhao, Jianwei Yin

🧩 TL;DR

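This paper introduces MM-CondChain, a benchmark for evaluating multimodal large language models on visually grounded deep compositional reasoning, together with an agentic synthesis pipeline that scalably constructs workflow-style data containing multi-layer reasoning chains.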


📘 Detailed Summary

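Motivation: Multimodal large language models are increasingly used in visual workflows such as GUI navigation, where each decision depends on verified visual compositional conditions. Existing benchmarks focus on shallow compositions or independent constraints and do not systematically evaluate reasoning over deeply chained compositional conditionals.

Method: The authors propose an agentic synthesis pipeline. A Planner generates compositional conditions layer by layer, a Verifiable Programmatic Intermediate Representation (VPIR) ensures that each layer's condition is mechanically verifiable, and a Composer assembles the verified layers into complete instructions. The pipeline is used to build benchmarks in three visual domains: natural images, data charts, and GUI trajectories.

Result: Even the strongest model attains only 53.33 Path F1, with sharp drops on hard negatives and as depth or predicate complexity grows, confirming that deep compositional reasoning remains a fundamental challenge for multimodal large language models.

Conclusion: The study exposes significant limitations of current multimodal large language models in deep visual compositional reasoning. MM-CondChain provides a standardized tool for evaluating this capability, and the synthesis method offers a scalable way to construct complex visual reasoning data.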


📄 Abstract

Multimodal Large Language Models (MLLMs) are increasingly used to carry out visual workflows such as navigating GUIs, where the next step depends on verified visual compositional conditions (e.g., "if a permission dialog appears and the color of the interface is green, click Allow") and the process may branch or terminate early. Yet this capability remains under-evaluated: existing benchmarks focus on shallow compositions or independent constraints rather than deeply chained compositional conditionals. In this paper, we introduce MM-CondChain, a benchmark for visually grounded deep compositional reasoning. Each benchmark instance is organized as a multi-layer reasoning chain, where every layer contains a non-trivial compositional condition grounded in visual evidence and built from multiple objects, attributes, or relations. To answer correctly, an MLLM must perceive the image in detail, reason over multiple visual elements at each step, and follow the resulting execution path to the final outcome. To scalably construct such workflow-style data, we propose an agentic synthesis pipeline: a Planner orchestrates layer-by-layer generation of compositional conditions, while a Verifiable Programmatic Intermediate Representation (VPIR) ensures each layer's condition is mechanically verifiable. A Composer then assembles these verified layers into complete instructions. Using this pipeline, we construct benchmarks across three visual domains: natural images, data charts, and GUI trajectories. Experiments on a range of MLLMs show that even the strongest model attains only 53.33 Path F1, with sharp drops on hard negatives and as depth or predicate complexity grows, confirming that deep compositional reasoning remains a fundamental challenge.
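To make the idea of a multi-layer chain with mechanically verifiable conditions concrete, here is a small sketch. It does not reproduce the paper's VPIR format, which the abstract does not specify; the `Layer` class, the dictionary-based scene description, and the second layer (`allow_button_visible`) are hypothetical. Only the first condition is taken from the abstract's own example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Scene = Dict[str, object]  # assumed symbolic description of the visual evidence


@dataclass
class Layer:
    name: str
    condition: Callable[[Scene], bool]  # predicate that can be checked by code
    on_false: str                       # outcome if the chain terminates here


def follow_chain(layers: List[Layer], scene: Scene,
                 final_outcome: str) -> Tuple[List[Tuple[str, bool]], str]:
    """Evaluate layers in order; stop early at the first failed condition."""
    path = []
    for layer in layers:
        holds = layer.condition(scene)
        path.append((layer.name, holds))
        if not holds:
            return path, layer.on_false
    return path, final_outcome


EXAMPLE_LAYERS = [
    # From the abstract: permission dialog appears AND the interface is green.
    Layer("dialog_and_color",
          lambda s: s.get("dialog") == "permission"
          and s.get("interface_color") == "green",
          "stop: no action"),
    # Hypothetical second layer, added only to show chaining.
    Layer("allow_button_visible",
          lambda s: bool(s.get("allow_button_visible")),
          "stop: button missing"),
]

green_scene = {"dialog": "permission", "interface_color": "green",
               "allow_button_visible": True}
red_scene = {"dialog": "permission", "interface_color": "red",
             "allow_button_visible": True}
```

Calling `follow_chain(EXAMPLE_LAYERS, green_scene, "click Allow")` walks both layers, while the red scene terminates at the first layer. The returned path is the kind of execution trace that a path-level metric could compare against a model's answer.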

cs.CL

[17] Multi-Task Reinforcement Learning for Enhanced Multimodal LLM-as-a-Judge

Junjie Wu, Xuan Kan, Zihao He, Shunwen Tan, Bo Pan, Kaitai Zhang

🧩 TL;DR

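This paper proposes MT-RL-Judge, a framework that uses multi-task reinforcement learning to jointly optimize the judging ability of multimodal large language models. It significantly improves judgment consistency and correlation with human preferences, and it generalizes well.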


📘 Detailed Summary

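Motivation: Existing MLLM-as-a-Judge models are mostly optimized for single-task scenarios and generalize poorly to diverse contexts, which limits their usefulness as reliable evaluators. A judge model that adapts to multi-task settings is needed.

Method: MT-RL-Judge jointly optimizes the judge model with multi-task reinforcement learning, training on multiple tasks simultaneously and relying on the generalization capabilities of RL to improve judgment consistency and adaptability across visual tasks.

Result: Compared with several strong baselines, MT-RL-Judge achieves better judgment consistency and higher correlation with human preferences, and it generalizes robustly to out-of-distribution tasks.

Conclusion: The study shows that multi-task reinforcement learning can improve the generalization and reliability of MLLM judge models, suggesting a way to build more robust multimodal evaluation systems that could extend to a broader range of multi-task evaluation scenarios.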


📄 Abstract

Multimodal Large Language Models (MLLMs) have been widely adopted as judges (MLLM-as-a-Judge) due to their strong alignment with human judgment across various visual tasks. However, most existing judge models are optimized for single-task scenarios and struggle to generalize to diverse contexts, which is a critical requirement for reliable evaluation. To address this limitation, we propose Multi-Task Reinforcement Learning for MLLM-as-a-Judge (MT-RL-Judge), a framework that jointly optimizes the judge model across multiple tasks, leveraging the generalization capabilities of RL. Experimental results demonstrate that MT-RL-Judge outperforms several strong baselines in both judgment consistency and correlation with human preferences. Furthermore, our approach exhibits robust generalization on out-of-distribution tasks, further validating its effectiveness.

cs.AI

[18] A Survey of Reasoning in Autonomous Driving Systems: Open Challenges and Emerging Paradigms

Kejin Yu, Yuhan Sun, Taiqiang Wu, Ruixu Zhang, Zhiqiang Lin, Yuxin Meng, Junjie Wang, Yujiu Yang

🧩 TL;DR

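This survey systematically examines frameworks for integrating large language and multimodal models into autonomous driving systems as a cognitive reasoning engine. It proposes a Cognitive Hierarchy to decompose the driving task and identifies seven core reasoning challenges, providing guidance for building interpretable "glass-box" agents.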


📘 Detailed Summary

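Motivation: With perception having advanced, autonomous driving systems now face a more fundamental bottleneck in robust and generalizable reasoning, lacking human-like judgment particularly in long-tail scenarios and complex social interactions. Large language and multimodal models offer a transformative opportunity to integrate a cognitive engine, but there is no systematic framework to guide the move from pattern matching toward genuine comprehension.

Method: The paper proposes a novel Cognitive Hierarchy that decomposes the monolithic driving task according to its cognitive and interactive complexity, and on that basis derives and systematizes seven core reasoning challenges. It reviews recent work from two perspectives: system-centric approaches to architecting intelligent agents and evaluation-centric practices for validating them.

Result: The analysis reveals a clear trend toward holistic and interpretable "glass-box" agents, and it identifies a fundamental, unresolved tension between the high-latency, deliberative nature of LLM-based reasoning and the millisecond-scale, safety-critical demands of vehicle control. The survey also provides a systematic taxonomy and set of challenges for cognitive reasoning in autonomous driving.

Conclusion: A primary objective for future work is to bridge the symbolic-to-physical gap by developing verifiable neuro-symbolic architectures, robust reasoning under uncertainty, and scalable models for implicit social negotiation. The survey argues that reasoning should be elevated from a modular component to the system's cognitive core, offering guidance for the design of next-generation autonomous driving systems.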


📄 Abstract

The development of high-level autonomous driving (AD) is shifting from perception-centric limitations to a more fundamental bottleneck, namely, a deficit in robust and generalizable reasoning. Although current AD systems manage structured environments, they consistently falter in long-tail scenarios and complex social interactions that require human-like judgment. Meanwhile, the advent of large language and multimodal models (LLMs and MLLMs) presents a transformative opportunity to integrate a powerful cognitive engine into AD systems, moving beyond pattern matching toward genuine comprehension. However, a systematic framework to guide this integration is critically lacking. To bridge this gap, we provide a comprehensive review of this emerging field and argue that reasoning should be elevated from a modular component to the system's cognitive core. Specifically, we first propose a novel Cognitive Hierarchy to decompose the monolithic driving task according to its cognitive and interactive complexity. Building on this, we further derive and systematize seven core reasoning challenges, such as the responsiveness-reasoning trade-off and social-game reasoning. Furthermore, we conduct a dual-perspective review of the state-of-the-art, analyzing both system-centric approaches to architecting intelligent agents and evaluation-centric practices for their validation. Our analysis reveals a clear trend toward holistic and interpretable "glass-box" agents. In conclusion, we identify a fundamental and unresolved tension between the high-latency, deliberative nature of LLM-based reasoning and the millisecond-scale, safety-critical demands of vehicle control. For future work, a primary objective is to bridge the symbolic-to-physical gap by developing verifiable neuro-symbolic architectures, robust reasoning under uncertainty, and scalable models for implicit social negotiation.

[19] VisDoT: Enhancing Visual Reasoning through Human-Like Interpretation Grounding and Decomposition of Thought

Eunsoo Lee, Jeongwoo Lee, Minki Hong, Jangho Choi, Jihie Kim

🧩 TL;DR

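This paper proposes VisDoT, a framework that enhances visual reasoning through interpretation grounding inspired by human perception. It decomposes visual questions into perception sub-questions and logic sub-questions and achieves state-of-the-art performance on chart understanding tasks.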


📘 Detailed Summary

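Motivation: Large vision-language models struggle to detect visual primitives in charts and align them with semantic representations. This lack of perceptual grounding severely limits performance on complex visual reasoning and is a major bottleneck for chart-based reasoning.

Method: Building on the theory of graphical perception, VisDoT formalizes four perceptual tasks and introduces Decomposition-of-Thought (DoT) prompting, which sequentially separates a question into visual perception sub-questions and logic sub-questions. The perception-logic separation strategy is realized by fine-tuning InternVL.

Result: The model gains +11.2% on ChartQA, surpasses GPT-4o on the more challenging ChartQAPro benchmark, and improves by +33.2% on the newly introduced VisDoTQA benchmark. It also shows consistent zero-shot gains on diverse open-domain VQA benchmarks.

Conclusion: The perception-logic separation strategy significantly strengthens visual grounding, yielding state-of-the-art chart understanding and interpretable visual reasoning. The zero-shot gains suggest that the approach generalizes to visual question answering more broadly.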


📄 Abstract

Large vision-language models (LVLMs) struggle to reliably detect visual primitives in charts and align them with semantic representations, which severely limits their performance on complex visual reasoning. This lack of perceptual grounding constitutes a major bottleneck for chart-based reasoning. We propose VisDoT, a framework that enhances visual reasoning through human-like interpretation grounding. We formalize four perceptual tasks based on the theory of graphical perception, including position and length. Building on this foundation, we introduce Decomposition-of-Thought (DoT) prompting, which sequentially separates questions into visual perception sub-questions and logic sub-questions. Fine-tuning InternVL with VisDoT achieves a +11.2% improvement on ChartQA and surpasses GPT-4o on the more challenging ChartQAPro benchmark. On the newly introduced VisDoTQA benchmark, the model improves by +33.2%. Furthermore, consistent zero-shot gains on diverse open-domain VQA benchmarks confirm the generalizability of the perception-logic separation strategy for visual question answering. VisDoT leverages human-like perception to enhance visual grounding, achieving state-of-the-art chart understanding and interpretable visual reasoning.

[20] Explicit Logic Channel for Validation and Enhancement of MLLMs on Zero-Shot Tasks

Mei Chee Leong, Ying Gu, Hui Li Tan, Liyuan Li, Nancy Chen

🧩 TL;DR

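This paper proposes an Explicit Logic Channel (ELC) and a Consistency Rate (CR) for validating, selecting, and enhancing multimodal large language models (MLLMs). By running explicit logical reasoning in parallel with the black-box model channel, the framework improves the explainability and trustworthiness of MLLMs on zero-shot tasks.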


📘 Detailed Summary

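Motivation: Frontier multimodal large language models (MLLMs) perform well on visual-language comprehension tasks, but they are usually deployed on new tasks as zero-shot, black-box solutions whose behavior is hard to validate and understand. This limits their trustworthiness and reliability in new applications, so a validation and enhancement mechanism based on explicit logical reasoning is needed.

Method: The paper proposes an Explicit Logic Channel (ELC) that runs in parallel with the black-box model channel, treating the frontier MLLM, which encapsulates latent vision-language knowledge, as an Implicit Logic Channel. Mimicking human logical reasoning, the ELC combines an LLM, a visual foundation model (VFM), and logical reasoning with probabilistic inference to perform factual, counterfactual, and relational reasoning over explicit visual evidence. A Consistency Rate (CR) is proposed for cross-channel validation and model selection without ground-truth annotations.

Result: Comprehensive experiments cover two representative visual-language comprehension tasks (MC-VQA and HC-REC) on three challenging benchmarks, with 11 recent open-source MLLMs from 4 frontier families. The systematic evaluation shows that ELC and CR are effective for model validation, selection, and performance improvement, while enhancing the explainability and trustworthiness of MLLMs.

Conclusion: The study provides a validation and enhancement framework for multimodal large language models in which an explicit logical reasoning channel and a consistency measure enable model selection and validation without ground-truth annotations. Cross-channel integration further improves zero-shot performance, and grounding the reasoning in explicit visual evidence increases trustworthiness, supporting more reliable deployment of MLLMs.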


📄 Abstract

Frontier Multimodal Large Language Models (MLLMs) exhibit remarkable capabilities in Visual-Language Comprehension (VLC) tasks. However, they are often deployed as zero-shot solutions to new tasks in a black-box manner. Validating and understanding the behavior of these models becomes important for application to new tasks. We propose an Explicit Logic Channel (ELC), in parallel with the black-box model channel, to perform explicit logical reasoning for model validation, selection and enhancement. The frontier MLLM, encapsulating latent vision-language knowledge, can be considered as an Implicit Logic Channel. The proposed Explicit Logic Channel, mimicking human logical reasoning, incorporates an LLM, a VFM, and logical reasoning with probabilistic inference for factual, counterfactual, and relational reasoning over the explicit visual evidence. A Consistency Rate (CR) is proposed for cross-channel validation and model selection, even without ground-truth annotations. Additionally, cross-channel integration further improves performance in zero-shot tasks over MLLMs, grounded with explicit visual evidence to enhance trustworthiness. Comprehensive experiments are conducted for two representative VLC tasks, i.e., MC-VQA and HC-REC, on three challenging benchmarks, with 11 recent open-source MLLMs from 4 frontier families. Our systematic evaluations demonstrate the effectiveness of the proposed ELC and CR for model validation, selection and improvement on MLLMs with enhanced explainability and trustworthiness.
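The abstract does not give a formula for the Consistency Rate. One plausible reading, assumed here purely for illustration, is the fraction of samples on which the implicit channel (the MLLM) and the explicit channel (the ELC) produce the same answer. The sketch below implements that reading; `consistency_rate` and its arguments are hypothetical names, and the paper's actual definition may differ.

```python
def consistency_rate(implicit_preds, explicit_preds):
    """Fraction of samples where both channels agree.

    Needs no ground-truth labels: only the two channels' predictions
    are compared. This agreement-based definition is an assumption.
    """
    if len(implicit_preds) != len(explicit_preds):
        raise ValueError("both channels must predict on the same samples")
    if not implicit_preds:
        return 0.0
    agree = sum(a == b for a, b in zip(implicit_preds, explicit_preds))
    return agree / len(implicit_preds)


# Toy multiple-choice answers from two candidate MLLMs and the explicit channel.
explicit = ["A", "B", "D", "D"]
model_x = ["A", "B", "C", "D"]
model_y = ["C", "B", "C", "A"]
cr_x = consistency_rate(model_x, explicit)
cr_y = consistency_rate(model_y, explicit)
```

Under this reading, a model with a higher rate (here `model_x`) would be preferred when no annotations are available, which is how a label-free statistic could support model selection.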

[21] LABSHIELD: A Multimodal Benchmark for Safety-Critical Reasoning and Planning in Scientific Laboratories

Qianpu Sun, Xiaowei Chi, Yuhan Rui, Ying Li, Kuangzhi Ge, Jiajun Li, Sirui Han, Shanghang Zhang

🧩 TL;DR

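This paper introduces LABSHIELD, a benchmark for evaluating multimodal large language models on hazard identification and safety-critical reasoning in laboratories. It reveals a marked drop in safety performance in professional laboratory scenarios.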


📘 Detailed Summary

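Motivation: As multimodal large language model agents evolve from lab assistants into autonomous lab operators, safety requirements in laboratory environments become stringent. Yet the ability of current models to identify hazards and make reliable safety decisions has not been adequately defined or evaluated.

Method: The study introduces LABSHIELD, a realistic multi-view benchmark grounded in U.S. Occupational Safety and Health Administration (OSHA) standards and the Globally Harmonized System (GHS). It establishes a safety taxonomy spanning 164 operational tasks and evaluates 20 proprietary models, 9 open-source models, and 3 embodied models under a dual-track evaluation framework.

Result: The evaluation reveals a systematic gap between general-domain multiple-choice accuracy and semi-open QA safety performance. Models show an average drop of 32.0% in professional laboratory scenarios, with particular weaknesses in hazard interpretation and safety-aware planning.

Conclusion: The study underscores the urgent need for safety-centric reasoning frameworks in embodied laboratory settings to ensure reliable autonomous scientific experimentation. LABSHIELD provides a tool and reference point for evaluating and improving multimodal large language models in safety-critical scenarios.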


📄 Abstract

Artificial intelligence is increasingly catalyzing scientific automation, with multimodal large language model (MLLM) agents evolving from lab assistants into self-driving lab operators. This transition imposes stringent safety requirements on laboratory environments, where fragile glassware, hazardous substances, and high-precision laboratory equipment render planning errors or misinterpreted risks potentially irreversible. However, the safety awareness and decision-making reliability of embodied agents in such high-stakes settings remain insufficiently defined and evaluated. To bridge this gap, we introduce LABSHIELD, a realistic multi-view benchmark designed to assess MLLMs in hazard identification and safety-critical reasoning. Grounded in U.S. Occupational Safety and Health Administration (OSHA) standards and the Globally Harmonized System (GHS), LABSHIELD establishes a rigorous safety taxonomy spanning 164 operational tasks with diverse manipulation complexities and risk profiles. We evaluate 20 proprietary models, 9 open-source models, and 3 embodied models under a dual-track evaluation framework. Our results reveal a systematic gap between general-domain MCQ accuracy and Semi-open QA safety performance, with models exhibiting an average drop of 32.0% in professional laboratory scenarios, particularly in hazard interpretation and safety-aware planning. These findings underscore the urgent necessity for safety-centric reasoning frameworks to ensure reliable autonomous scientific experimentation in embodied laboratory contexts. The full dataset will be released soon.