cs.CV

[1] Insight-V++: Towards Advanced Long-Chain Visual Reasoning with Multimodal Large Language Models

Yuhao Dong, Zuyan Liu, Shulin Tian, Yongming Rao, Ziwei Liu

🧩 TL;DR

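This work presents Insight-V++, a unified multi-agent visual reasoning framework. It autonomously generates high-quality multimodal reasoning data and uses a dual-agent architecture, significantly improving multimodal large language models on complex image and video reasoning tasks.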


📘 Detailed Summary

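Motivation: Multimodal large language models (MLLMs) struggle to scale test-time reasoning, mainly because high-quality long-chain reasoning data is severely scarce and optimized training pipelines are missing. This limits their performance on complex image and video understanding tasks.

Method: The authors propose Insight-V++, a unified multi-agent visual reasoning framework. A scalable data generation pipeline with multi-granularity assessment autonomously synthesizes structured, complex reasoning trajectories across image and video domains. A dual-agent architecture pairs a reasoning agent, which executes the analytical chain, with a summary agent, which evaluates and distills the final outcome. For long-horizon video understanding, two new algorithms, ST-GRPO and J-GRPO, enhance spatial-temporal reasoning and evaluative robustness. Reliable feedback from the summary agent guides iterative reasoning-path generation, forming a continuously self-improving training loop.

Result: Extensive experiments on base models such as LLaVA-NeXT and Qwen2.5-VL show significant gains on challenging image and video reasoning benchmarks while preserving strong capabilities on traditional perception tasks.

Conclusion: By combining autonomous data generation, a dual-agent architecture, and new reinforcement learning algorithms, the work offers a systematic approach to improving complex reasoning in MLLMs. It demonstrates the effectiveness of a continuously self-improving framework for multimodal understanding and serves as a reference point for future multimodal systems.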


📄 Abstract

Large Language Models (LLMs) have achieved remarkable reliability and advanced capabilities through extended test-time reasoning. However, extending these capabilities to Multi-modal Large Language Models (MLLMs) remains a significant challenge due to a critical scarcity of high-quality, long-chain reasoning data and optimized training pipelines. To bridge this gap, we present a unified multi-agent visual reasoning framework that systematically evolves from our foundational image-centric model, Insight-V, into a generalized spatial-temporal architecture, Insight-V++. We first propose a scalable data generation pipeline equipped with multi-granularity assessment that autonomously synthesizes structured, complex reasoning trajectories across image and video domains without human intervention. Recognizing that directly supervising MLLMs with such intricate data yields sub-optimal results, we design a dual-agent architecture comprising a reasoning agent to execute extensive analytical chains, and a summary agent to critically evaluate and distill final outcomes. While our initial framework utilized Direct Preference Optimization (DPO), its off-policy nature fundamentally constrained reinforcement learning potential. To overcome these limitations, particularly for long-horizon video understanding, Insight-V++ introduces two novel algorithms, ST-GRPO and J-GRPO, which enhance spatial-temporal reasoning and improve evaluative robustness. Crucially, by leveraging reliable feedback from the summary agent, we guide an iterative reasoning path generation process, retraining the entire multi-agent system in a continuous, self-improving loop. Extensive experiments on base models like LLaVA-NeXT and Qwen2.5-VL demonstrate significant performance gains across challenging image and video reasoning benchmarks while preserving strong capabilities on traditional perception-focused tasks.
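
The abstract does not spell out how ST-GRPO and J-GRPO differ from the base algorithm, so the following is only a minimal sketch of the standard group-relative advantage that GRPO-style methods build on: each sampled response is scored relative to the other responses drawn for the same prompt. The function name, the `eps` term, and the use of the population standard deviation are assumptions for illustration, not details taken from the paper.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own sampling group.

    Assumption: population std with a small eps to avoid division by zero;
    real implementations differ in these details.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four hypothetical responses to one prompt, scored 1.0 (correct) or 0.0 (wrong).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)  # correct responses get positive advantages, wrong ones negative
```

Because the advantage is computed from responses sampled by the current policy, this style of update is on-policy, which is the property the abstract contrasts with DPO.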

[2] VLM-AutoDrive: Post-Training Vision-Language Models for Safety-Critical Autonomous Driving Events

Mohammad Qazim Bhat, Yufan Huang, Niket Agarwal, Hao Wang, Michael Woods, John Kenyon, Tsung-Yi Lin, Xiaodong Yang, Ming-Yu Liu, Kevin Xie

🧩 TL;DR

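This paper presents VLM-AutoDrive, a modular post-training framework that adapts pretrained vision-language models to safety-critical anomaly detection in driving scenes, significantly improving the detection of collision and near-collision events.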


📘 Detailed Summary

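Motivation: The rapid growth of dashcam video makes it hard to detect safety-critical events such as collisions and near-collisions, which are brief, rare, and difficult for generic vision models to capture. Although multimodal large language models show strong general reasoning ability, they underperform in driving scenes because of domain and temporal misalignment.

Method: VLM-AutoDrive is a modular post-training framework that integrates metadata-derived captions, LLM-generated descriptions, visual question answering pairs, and chain-of-thought reasoning supervision to enable domain-aligned, interpretable learning. It is designed specifically to adapt pretrained vision-language models to high-fidelity anomaly detection.

Result: In the zero-shot setting, off-the-shelf VLMs such as NVIDIA Cosmos-Reason1 7B show near-zero Collision recall. After fine-tuning with VLM-AutoDrive, Collision F1 rises from 0.00 to 0.69 and overall accuracy from 35.35% to 77.27%. On real-world Nexar dashcam videos, the framework achieves substantial gains in Collision and Near-Collision detection.

Conclusion: VLM-AutoDrive offers a scalable recipe for adapting general-purpose VLMs to safety-critical, temporally localized perception tasks. By producing interpretable reasoning traces, it bridges the gap between perception, causality, and decision reasoning in autonomous driving.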


📄 Abstract

The rapid growth of ego-centric dashcam footage presents a major challenge for detecting safety-critical events such as collisions and near-collisions, scenarios that are brief, rare, and difficult for generic vision models to capture. While multimodal large language models (MLLMs) demonstrate strong general reasoning ability, they underperform in driving contexts due to domain and temporal misalignment. We introduce VLM-AutoDrive, a modular post-training framework for adapting pretrained Vision-Language Models (VLMs) to high-fidelity anomaly detection. The framework integrates metadata-derived captions, LLM-generated descriptions, visual question answering (VQA) pairs, and chain-of-thought (CoT) reasoning supervision to enable domain-aligned and interpretable learning. Off-the-shelf VLMs such as NVIDIA's Cosmos-Reason1 7B (CR1) exhibit near-zero Collision recall in zero-shot settings; fine-tuning with VLM-AutoDrive improves Collision F1 from 0.00 to 0.69 and overall accuracy from 35.35% to 77.27%. VLM-AutoDrive offers a scalable recipe for adapting general-purpose VLMs to safety-critical, temporally localized perception tasks. Evaluated on real-world Nexar dashcam videos, it achieves substantial gains in Collision and Near-Collision detection while producing interpretable reasoning traces, bridging the gap between perception, causality, and decision reasoning in autonomous driving.
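
To make the reported metrics concrete, here is a small sketch of how per-class precision, recall, and F1 follow from confusion counts, and why near-zero recall forces F1 to 0.00. The counts below are hypothetical and chosen only for illustration; they are not the paper's data.

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical zero-shot model that never correctly flags a collision.
print(precision_recall_f1(tp=0, fp=3, fn=40))

# Hypothetical fine-tuned model that finds 6 of 10 collisions with 2 false alarms.
print(precision_recall_f1(tp=6, fp=2, fn=4))
```

Since F1 is the harmonic mean of precision and recall, a class the model never detects scores zero regardless of how well the other classes are handled, which is why overall accuracy alone can hide the failure.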

[3] Mind the Rarities: Can Rare Skin Diseases Be Reliably Diagnosed via Diagnostic Reasoning?

Yang Liu, Jiyao Yang, Hongjin Zhao, Xiaoyong Li, Yanzhe Ji, Xingjian Li, Runmin Jiang, Tianyang Wang, Saeed Anwar, Dongwoo Kim, Yue Yao, Zhenyue Qin, Min Xu

🧩 TL;DR

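This work builds DermCase, a benchmark for evaluating the diagnostic reasoning of large vision-language models on rare dermatological conditions, and reveals significant deficiencies in the clinical reasoning of current models.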


📘 Detailed Summary

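Motivation: Existing benchmarks focus on common diseases and assess only final diagnostic accuracy, overlooking the clinical reasoning process that is critical for complex cases. Evaluating diagnostic reasoning for rare conditions remains largely unexplored.

Method: The authors construct DermCase, a long-context benchmark containing 26,030 multi-modal image-text pairs and 6,354 clinically challenging cases, each annotated with comprehensive clinical information and step-by-step reasoning chains. They also establish DermLIP-based similarity metrics to assess differential diagnosis quality more accurately.

Result: Benchmarking 22 leading large vision-language models exposes significant deficiencies in diagnosis accuracy, differential diagnosis, and clinical reasoning. Fine-tuning experiments show that instruction tuning substantially improves performance, while Direct Preference Optimization yields only limited gains. Systematic error analysis further reveals critical limitations in current models' reasoning capabilities.

Conclusion: The study stresses that the clinical reasoning process should be evaluated, not just final diagnostic accuracy, and DermCase provides a more comprehensive framework for evaluating dermatology AI. Current large vision-language models remain fundamentally limited on complex clinical reasoning tasks and need more advanced training methods to improve their medical reasoning.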


📄 Abstract

Large vision-language models (LVLMs) demonstrate strong performance in dermatology; however, evaluating diagnostic reasoning for rare conditions remains largely unexplored. Existing benchmarks focus on common diseases and assess only final accuracy, overlooking the clinical reasoning process, which is critical for complex cases. We address this gap by constructing DermCase, a long-context benchmark derived from peer-reviewed case reports. Our dataset contains 26,030 multi-modal image-text pairs and 6,354 clinically challenging cases, each annotated with comprehensive clinical information and step-by-step reasoning chains. To enable reliable evaluation, we establish DermLIP-based similarity metrics that achieve stronger alignment with dermatologists for assessing differential diagnosis quality. Benchmarking 22 leading LVLMs exposes significant deficiencies across diagnosis accuracy, differential diagnosis, and clinical reasoning. Fine-tuning experiments demonstrate that instruction tuning substantially improves performance while Direct Preference Optimization (DPO) yields minimal gains. Systematic error analysis further reveals critical limitations in current models' reasoning capabilities.

[4] Counting Circuits: Mechanistic Interpretability of Visual Reasoning in Large Vision-Language Models

Liwei Che, Zhiyu Xue, Yihao Quan, Benlin Liu, Zeru Shi, Michelle Hurst, Jacob Feldman, Ruixiang Tang, Ranjay Krishna, Vladimir Pavlovic

🧩 TL;DR

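Through mechanistic analysis, this study uncovers a structured counting circuit in large vision-language models and proposes a lightweight intervention that fine-tunes a model on counting alone using synthetic images, significantly improving counting accuracy and generalizing to complex visual reasoning tasks.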


📘 Detailed Summary

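Motivation: The study asks how large vision-language models (LVLMs) implement counting, a basic but critical visual reasoning skill. Counting is a simple yet powerful test because it forces the model to identify each individual object and then add them up. The authors aim to understand the internal counting mechanism and its effect on visual reasoning overall.

Method: The study combines controlled synthetic and real-world benchmarks with mechanistic analysis. It introduces two new interpretability methods, Visual Activation Patching and HeadLens, to reveal the structure of the model's internal counting circuit. Building on these insights, it proposes a lightweight intervention that uses abundantly available synthetic images to fine-tune arbitrary pretrained LVLMs exclusively on counting.

Result: LVLMs display human-like counting behavior: precise on small numerosities and noisy when estimating larger quantities. The intervention improves counting accuracy on in-distribution synthetic data, yields an average improvement of +8.36% on out-of-distribution counting benchmarks, and gives an average gain of +1.54% on complex, general visual reasoning tasks for Qwen2.5-VL. The analysis also shows that the counting circuit is largely shared across a variety of visual reasoning tasks.

Conclusion: Counting plays a central, influential role in visual reasoning, and targeted enhancement of counting mechanisms is a potential pathway for improving visual reasoning overall. The findings offer insight into how these models work internally and into designing more effective fine-tuning strategies.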


📄 Abstract

Counting serves as a simple but powerful test of a Large Vision-Language Model's (LVLM's) reasoning; it forces the model to identify each individual object and then add them all up. In this study, we investigate how LVLMs implement counting using controlled synthetic and real-world benchmarks, combined with mechanistic analyses. Our results show that LVLMs display a human-like counting behavior, with precise performance on small numerosities and noisy estimation for larger quantities. We introduce two novel interpretability methods, Visual Activation Patching and HeadLens, and use them to uncover a structured "counting circuit" that is largely shared across a variety of visual reasoning tasks. Building on these insights, we propose a lightweight intervention strategy that exploits simple and abundantly available synthetic images to fine-tune arbitrary pretrained LVLMs exclusively on counting. Despite the narrow scope of this fine-tuning, the intervention not only enhances counting accuracy on in-distribution synthetic data, but also yields an average improvement of +8.36% on out-of-distribution counting benchmarks and an average gain of +1.54% on complex, general visual reasoning tasks for Qwen2.5-VL. These findings highlight the central, influential role of counting in visual reasoning and suggest a potential pathway for improving overall visual reasoning capabilities through targeted enhancement of counting mechanisms.
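
The abstract does not describe how Visual Activation Patching is implemented. As background, the sketch below shows generic activation patching on a toy two-unit model: run a clean input, cache its hidden activations, then re-run a corrupted input while overwriting one hidden unit at a time with its clean value. The model, weights, and function names are all hypothetical and are not taken from the paper.

```python
def forward(x, patch=None):
    """Toy two-unit model. `patch` maps a hidden-unit index to a replacement activation."""
    hidden = [max(0.0, 2.0 * x[0] - x[1]), max(0.0, x[0] + x[1])]
    if patch:
        for idx, value in patch.items():
            hidden[idx] = value
    output = 1.5 * hidden[0] - 0.5 * hidden[1]
    return output, hidden

def patching_effects(clean_x, corrupt_x):
    """Change in the corrupted output when each hidden unit is restored to its clean value."""
    _, clean_hidden = forward(clean_x)
    corrupt_out, _ = forward(corrupt_x)
    effects = {}
    for idx, value in enumerate(clean_hidden):
        patched_out, _ = forward(corrupt_x, patch={idx: value})
        effects[idx] = patched_out - corrupt_out
    return effects

print(patching_effects(clean_x=(2.0, 1.0), corrupt_x=(0.0, 1.0)))
```

A unit whose restoration moves the corrupted output most of the way back toward the clean output is a candidate component of the circuit under study. In a real LVLM the patched sites would be attention heads or residual-stream positions instead of scalar units.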

[5] CoDA: Exploring Chain-of-Distribution Attacks and Post-Hoc Token-Space Repair for Medical Vision-Language Models

Xiang Chen, Fangfang Yang, Chunlei Meng, Chengyin Hu, Ang Li, Yiwei Wei, Jiahuan Long, Jiujiang Guo

🧩 TL;DR

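This paper proposes CoDA, a framework for evaluating the robustness of medical vision-language models in clinical workflows. It composes clinically relevant image distribution shifts to simulate real-world degradation, shows that existing models are fragile under such perturbations, and introduces a lightweight post-hoc repair strategy to improve robustness in deployment.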


📘 Detailed Summary

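Motivation: The reliability of medical vision-language models (MVLMs) in clinical workflows remains underexplored. Existing robustness evaluations usually assume clean, curated inputs or study isolated corruptions. They overlook routine acquisition, reconstruction, display, and delivery operations that preserve clinical readability while shifting image statistics, so models may carry undiscovered vulnerabilities into deployment.

Method: CoDA is a chain-of-distribution framework that constructs clinically plausible pipeline shifts by composing acquisition-like shading, reconstruction and display remapping, and delivery and export degradations. Under masked structural-similarity constraints, it jointly optimizes stage compositions and parameters to induce model failures while preserving visual plausibility. The paper also introduces a post-hoc repair strategy based on teacher-guided token-space adaptation with patch-level alignment.

Result: Across brain MRI, chest X-ray, and abdominal CT, CoDA substantially degrades the zero-shot performance of CLIP-style MVLMs, and chained compositions are consistently more damaging than any single stage. Proprietary multimodal models show degraded auditing reliability and persistent high-confidence errors on CoDA-shifted samples, while the medical-specific multimodal large language models tested exhibit clear deficiencies in medical image quality auditing. The post-hoc repair strategy improves accuracy on archived CoDA outputs.

Conclusion: The study characterizes a clinically grounded threat surface for MVLM deployment, showing that existing models are vulnerable to the compound degradations of clinical workflows and that lightweight alignment improves robustness in deployment.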


📄 Abstract

Medical vision-language models (MVLMs) are increasingly used as perceptual backbones in radiology pipelines and as the visual front end of multimodal assistants, yet their reliability under real clinical workflows remains underexplored. Prior robustness evaluations often assume clean, curated inputs or study isolated corruptions, overlooking routine acquisition, reconstruction, display, and delivery operations that preserve clinical readability while shifting image statistics. To address this gap, we propose CoDA, a chain-of-distribution framework that constructs clinically plausible pipeline shifts by composing acquisition-like shading, reconstruction and display remapping, and delivery and export degradations. Under masked structural-similarity constraints, CoDA jointly optimizes stage compositions and parameters to induce failures while preserving visual plausibility. Across brain MRI, chest X-ray, and abdominal CT, CoDA substantially degrades the zero-shot performance of CLIP-style MVLMs, with chained compositions consistently more damaging than any single stage. We also evaluate multimodal large language models (MLLMs) as technical-authenticity auditors of imaging realism and quality rather than pathology. Proprietary multimodal models show degraded auditing reliability and persistent high-confidence errors on CoDA-shifted samples, while the medical-specific MLLMs we test exhibit clear deficiencies in medical image quality auditing. Finally, we introduce a post-hoc repair strategy based on teacher-guided token-space adaptation with patch-level alignment, which improves accuracy on archived CoDA outputs. Overall, our findings characterize a clinically grounded threat surface for MVLM deployment and show that lightweight alignment improves robustness in deployment.

[6] HiMu: Hierarchical Multimodal Frame Selection for Long Video Question Answering

Dan Ben-Ami, Gabriele Serussi, Kobi Cohen, Chaim Baskin

🧩 TL;DR

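This paper proposes HiMu, a framework that uses a text-only LLM to decompose a query into a hierarchical logic tree and routes its leaves to lightweight multimodal experts. It advances the efficiency-accuracy Pareto front for long video question answering, outperforming existing selectors and agentic systems at a much lower computational cost.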


📘 Detailed Summary

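Motivation: Long-form video question answering requires reasoning over extended temporal contexts, and large vision-language models (LVLMs) are bound by finite context windows, which makes frame selection critical. Existing methods face a sharp trade-off. Similarity-based selectors are fast but collapse compositional queries into a single dense vector, losing sub-event ordering and cross-modal bindings. Agent-based methods recover this structure through iterative LVLM inference, but at prohibitive cost.

Method: HiMu is a training-free framework. A single text-only LLM call decomposes the query into a hierarchical logic tree whose leaves are atomic predicates. Each predicate is routed to a lightweight expert spanning vision (CLIP, open-vocabulary detection, OCR) and audio (ASR, CLAP). The resulting signals are normalized, temporally smoothed to align different modalities, and composed bottom-up through fuzzy-logic operators that enforce temporal sequencing and adjacency, producing a continuous satisfaction curve.

Result: Evaluations on Video-MME, LongVideoBench, and HERBench-Lite show that HiMu advances the efficiency-accuracy Pareto front. At 16 frames with Qwen3-VL 8B it outperforms all competing selectors, and with GPT-4o it surpasses agentic systems that operate at 32-512 frames while requiring roughly 10x fewer FLOPs.

Conclusion: HiMu bridges the gap between efficiency and accuracy in long video question answering through hierarchical logic decomposition and routing to lightweight multimodal experts. It shows that complex spatio-temporal reasoning is achievable without training and offers a practical option for large-scale video understanding systems.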


📄 Abstract

Long-form video question answering requires reasoning over extended temporal contexts, making frame selection critical for large vision-language models (LVLMs) bound by finite context windows. Existing methods face a sharp trade-off: similarity-based selectors are fast but collapse compositional queries into a single dense vector, losing sub-event ordering and cross-modal bindings; agent-based methods recover this structure through iterative LVLM inference, but at prohibitive cost. We introduce HiMu, a training-free framework that bridges this gap. A single text-only LLM call decomposes the query into a hierarchical logic tree whose leaves are atomic predicates, each routed to a lightweight expert spanning vision (CLIP, open-vocabulary detection, OCR) and audio (ASR, CLAP). The resulting signals are normalized, temporally smoothed to align different modalities, and composed bottom-up through fuzzy-logic operators that enforce temporal sequencing and adjacency, producing a continuous satisfaction curve. Evaluations on Video-MME, LongVideoBench and HERBench-Lite show that HiMu advances the efficiency-accuracy Pareto front: at 16 frames with Qwen3-VL 8B it outperforms all competing selectors, and with GPT-4o it surpasses agentic systems operating at 32-512 frames while requiring roughly 10x fewer FLOPs.
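
The pipeline described above (normalize, smooth, compose with fuzzy operators, read off a satisfaction curve) can be sketched as follows. The abstract does not give the exact operators, so this sketch assumes min for AND, max for OR, and a simple "earlier then now" operator for temporal sequencing. The per-frame expert scores and all function names are hypothetical.

```python
def normalize(scores):
    """Min-max normalize per-frame expert scores to [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def smooth(scores, radius=1):
    """Moving average so signals from different modalities overlap in time."""
    out = []
    for i in range(len(scores)):
        window = scores[max(0, i - radius): i + radius + 1]
        out.append(sum(window) / len(window))
    return out

def fuzzy_and(a, b):
    return [min(x, y) for x, y in zip(a, b)]

def fuzzy_or(a, b):
    return [max(x, y) for x, y in zip(a, b)]

def then(a, b):
    """Truth of 'a held at some earlier frame and b holds now', per frame."""
    out, best_a = [], 0.0
    for x, y in zip(a, b):
        out.append(min(best_a, y))
        best_a = max(best_a, x)
    return out

def top_k_frames(curve, k):
    """Indices of the k frames with the highest satisfaction, in temporal order."""
    ranked = sorted(range(len(curve)), key=lambda i: curve[i], reverse=True)
    return sorted(ranked[:k])

# Hypothetical query: "What happens after the door opens, when the bell rings?"
door_opens = [0.1, 0.9, 0.2, 0.1, 0.1, 0.1]   # e.g. a visual expert's scores
bell_rings = [0.0, 0.0, 0.1, 0.8, 0.2, 0.0]   # e.g. an audio expert's scores

curve = then(smooth(normalize(door_opens)), smooth(normalize(bell_rings)))
print(top_k_frames(curve, k=2))  # -> [3, 4]
```

The selected frames cluster where the bell rings after the door has already opened, which a single dense query embedding would have no direct way to express.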

[7] GenVideoLens: Where LVLMs Fall Short in AI-Generated Video Detection?

Yueying Zou, Pei Pei Li, Zekun Li, Xinyu Guo, Xing Cui, Huaibo Huang, Ran He

🧩 TL;DR

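This paper introduces GenVideoLens, a benchmark for fine-grained evaluation of large vision-language models on AI-generated video detection. It reveals a pronounced imbalance in current models across dimensions such as optical consistency, physical interactions, and temporal-causal reasoning.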


📘 Detailed Summary

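Motivation: Existing evaluation protocols treat AI-generated video detection as binary classification and rely on coarse-grained metrics such as overall accuracy. They cannot show on which specific dimensions large vision-language models (LVLMs) succeed or fail, so fine-grained diagnostic analysis is missing.

Method: The authors build the GenVideoLens benchmark, which contains 400 highly deceptive AI-generated videos and 100 real videos, annotated by experts across 15 authenticity dimensions covering perceptual, optical, physical, and temporal cues. They evaluate 11 representative LVLMs on it.

Result: The analysis reveals a pronounced dimensional imbalance. Models perform relatively well on perceptual cues but struggle with optical consistency, physical interactions, and temporal-causal reasoning. Performance varies substantially across dimensions, and smaller open-source models sometimes outperform stronger proprietary models on specific authenticity cues. Temporal perturbation experiments show that current models make limited use of temporal information.

Conclusion: GenVideoLens provides diagnostic insight into LVLM behavior, reveals key capability gaps, and offers guidance for improving future AI-generated video detection systems. The study underlines the value of fine-grained evaluation and the limits of current models on complex reasoning dimensions.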


📄 Abstract

In recent years, AI-generated videos have become increasingly realistic and sophisticated. Meanwhile, Large Vision-Language Models (LVLMs) have shown strong potential for detecting such content. However, existing evaluation protocols largely treat the task as a binary classification problem and rely on coarse-grained metrics such as overall accuracy, providing limited insight into where LVLMs succeed or fail. To address this limitation, we introduce GenVideoLens, a fine-grained benchmark that enables dimension-wise evaluation of LVLM capabilities in AI-generated video detection. The benchmark contains 400 highly deceptive AI-generated videos and 100 real videos, annotated by experts across 15 authenticity dimensions covering perceptual, optical, physical, and temporal cues. We evaluate eleven representative LVLMs on this benchmark. Our analysis reveals a pronounced dimensional imbalance. While LVLMs perform relatively well on perceptual cues, they struggle with optical consistency, physical interactions, and temporal-causal reasoning. Model performance also varies substantially across dimensions, with smaller open-source models sometimes outperforming stronger proprietary models on specific authenticity cues. Temporal perturbation experiments further show that current LVLMs make limited use of temporal information. Overall, GenVideoLens provides diagnostic insights into LVLM behavior, revealing key capability gaps and offering guidance for improving future AI-generated video detection systems.
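
The difference between a single overall accuracy and dimension-wise evaluation can be illustrated with a short sketch. The record layout (`dimension`, `label`, `prediction`) and the sample values are assumptions made for this example and do not reflect the benchmark's actual annotation schema or results.

```python
from collections import defaultdict

def per_dimension_accuracy(records):
    """Accuracy computed separately for each annotated authenticity dimension."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["dimension"]] += 1
        correct[r["dimension"]] += int(r["prediction"] == r["label"])
    return {dim: correct[dim] / total[dim] for dim in total}

# Hypothetical judgments from one model on four annotated cues.
records = [
    {"dimension": "perceptual", "label": "fake", "prediction": "fake"},
    {"dimension": "perceptual", "label": "fake", "prediction": "fake"},
    {"dimension": "temporal", "label": "fake", "prediction": "real"},
    {"dimension": "temporal", "label": "fake", "prediction": "fake"},
]
print(per_dimension_accuracy(records))  # -> {'perceptual': 1.0, 'temporal': 0.5}
```

Here the overall accuracy would be 0.75, which hides the fact that every error falls on the temporal dimension.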

[8] Perceptio: Perception Enhanced Vision Language Models via Spatial Token Generation

Yuchen Li, Amanmeet Garg, Shalini Chaudhuri, Rui Zhao, Garin Kessler

🧩 TL;DR

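This paper presents Perceptio, a perception-enhanced large vision-language model with 2D and 3D spatial reasoning abilities. It strengthens fine-grained spatial grounding by introducing explicit semantic segmentation tokens and depth tokens, and achieves state-of-the-art performance on several benchmarks.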


📘 Detailed Summary

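Motivation: Large vision-language models (LVLMs) excel at semantic understanding but struggle with fine-grained spatial grounding, because the model must implicitly infer complex geometry without ever producing an explicit spatial interpretation. This limits their use in tasks that need precise spatial understanding.

Method: Perceptio enhances spatial reasoning with explicit semantic segmentation tokens and depth tokens. It distills a VQ-VAE depth codebook from a strong monocular depth estimation teacher to tokenize dense depth into compact sequences. It integrates SAM2-based semantic segmentation tokens and VQ-VAE depth tokens into the LLM so that the model first emits spatial tokens and then answers. To stabilize depth token generation, it introduces composite depth-token objectives (marker, token, and count losses) and a soft-merging technique for differentiable reconstruction. Training uses a multi-task co-training strategy across diverse datasets.

Result: Built on InternVL, Perceptio achieves state-of-the-art performance across benchmarks. It improves referring expression segmentation by +0.8/+1.4/+1.1 cIoU on RefCOCO/+/g, HardBLINK spatial understanding accuracy by 10.3%, and MMBench accuracy by 1.0%.

Conclusion: An explicit spatial chain-of-thought materially strengthens spatial grounding in LVLMs. This perception-enhanced approach is an effective route to multimodal models with stronger spatial reasoning and lays groundwork for deploying LVLMs where precise spatial understanding is required.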


📄 Abstract

Large Vision Language Models (LVLMs) excel at semantic understanding but struggle with fine-grained spatial grounding, as the model must implicitly infer complex geometry without ever producing a spatial interpretation. We present Perceptio, a perception-enhanced LVLM with 2D and 3D spatial reasoning abilities, enabled via explicit semantic segmentation tokens and depth tokens generated directly within the autoregressive sequence. Concretely, we (i) distill a VQ-VAE depth codebook from a strong monocular teacher to tokenize dense depth into compact sequences, and (ii) integrate SAM2-based semantic segmentation tokens and VQ-VAE depth tokens inside the LLM so the model first emits spatial tokens and then answers. To stabilize depth token generation, we introduce novel composite depth-token objectives (marker, token, and count losses) and a soft-merging technique for differentiable reconstruction. We adopt a multi-task co-training strategy across diverse datasets, letting the model learn perception tokens to tackle multiple downstream tasks. Building on InternVL, Perceptio achieves state-of-the-art performance across benchmarks: improving referring expression segmentation by +0.8/+1.4/+1.1 cIoU on RefCOCO/+/g, HardBLINK spatial understanding accuracy by 10.3%, and MMBench accuracy by 1.0%, demonstrating that explicit spatial chain-of-thought materially strengthens spatial grounding in LVLMs.
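
As a rough illustration of what "tokenizing dense depth" with a VQ-VAE codebook means, the sketch below maps each continuous latent vector to the index of its nearest codebook entry and back. The 2-D latents, the four-entry codebook, and the function names are hypothetical. The distillation from a depth teacher, the composite losses, and the soft-merging step are not reproduced here.

```python
def nearest_code(vector, codebook):
    """Index of the codebook entry closest to `vector` in squared Euclidean distance."""
    def sq_dist(code):
        return sum((v - c) ** 2 for v, c in zip(vector, code))
    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))

def tokenize(latents, codebook):
    """Turn a sequence of continuous latents into discrete token ids."""
    return [nearest_code(z, codebook) for z in latents]

def detokenize(tokens, codebook):
    """Look the token ids back up to get quantized latents for a decoder."""
    return [codebook[t] for t in tokens]

# Hypothetical codebook and encoder outputs for three depth patches.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
latents = [(0.1, -0.2), (0.9, 0.8), (0.2, 0.7)]

tokens = tokenize(latents, codebook)
print(tokens)                        # -> [0, 3, 2]
print(detokenize(tokens, codebook))  # quantized latents
```

Once depth is expressed as integer ids like these, it can be emitted by an autoregressive language model in the same way as text tokens, which is what lets the model "first emit spatial tokens and then answer".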

[9] Rethinking MLLM Itself as a Segmenter with a Single Segmentation Token

Anqi Zhang, Xiaokang Ji, Guangyu Gao, Jianbo Jiao, Chi Harold Liu, Yunchao Wei

🧩 TL;DR

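This paper proposes SELF1E, which lets a multimodal large language model segment by itself with a single segmentation embedding and no external mask decoder. Using feature resolution enhancement and a dual-perception attention mechanism, it achieves performance competitive with specialist-decoder methods on multiple segmentation tasks.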


📘 Detailed Summary

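Motivation: Existing segmentation methods built on multimodal large language models (MLLMs) mainly rely on specialist mask decoders or introduce multiple additional tokens to produce masks. This paper investigates whether an MLLM can segment by itself with a single segmentation embedding, removing the dependence on external decoders. To do so it targets a fundamental limitation: the resolution reduction in pixel-shuffled image features.

Method: SELF1E first retains image features at their original uncompressed resolution and refills them with residual features extracted from the MLLM-processed compressed features, improving feature precision. It then applies pixel-unshuffle operations to image features with and without LLM processing, respectively, to release the details of the compressed features and amplify the residual features at uncompressed resolution. Finally, it redesigns the attention mask with dual perception pathways (image-to-image and image-to-segmentation), enabling rich feature interaction between pixels and the segmentation token.

Result: Comprehensive experiments across multiple segmentation tasks show that SELF1E achieves performance competitive with specialist mask decoder-based methods, demonstrating the feasibility of decoder-free segmentation in MLLMs.

Conclusion: The study shows that an MLLM can segment by itself with a single segmentation embedding, without an external specialist decoder. This points toward simpler, more efficient segmentation architectures, with feature resolution enhancement and the redesigned attention mask addressing the resolution limitation of earlier approaches.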


📄 Abstract

Recent segmentation methods leveraging Multi-modal Large Language Models (MLLMs) have shown reliable object-level segmentation and enhanced spatial perception. However, almost all previous methods predominantly rely on specialist mask decoders to interpret masks from generated segmentation-related embeddings and visual features, or incorporate multiple additional tokens to assist. This paper aims to investigate whether and how we can unlock segmentation from MLLM itSELF with 1 segmentation Embedding (SELF1E) while achieving competitive results, which eliminates the need for external decoders. To this end, our approach targets the fundamental limitation of resolution reduction in pixel-shuffled image features from MLLMs. First, we retain image features at their original uncompressed resolution, and refill them with residual features extracted from MLLM-processed compressed features, thereby improving feature precision. Subsequently, we integrate pixel-unshuffle operations on image features with and without LLM processing, respectively, to unleash the details of compressed features and amplify the residual features under uncompressed resolution, which further enhances the resolution of refilled features. Moreover, we redesign the attention mask with dual perception pathways, i.e., image-to-image and image-to-segmentation, enabling rich feature interaction between pixels and the segmentation token. Comprehensive experiments across multiple segmentation tasks validate that SELF1E achieves performance competitive with specialist mask decoder-based methods, demonstrating the feasibility of decoder-free segmentation in MLLMs. Project page: https://github.com/ANDYZAQ/SELF1E.

[10] Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding

Xianjin Wu, Dingkang Liang, Tianrui Feng, Kui Xia, Yumeng Zhang, Xiaofan Li, Xiao Tan, Xiang Bai

🧩 TL;DR

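This paper proposes VEGA-3D, a framework that exploits the implicit spatial priors in large-scale video generation models to strengthen the geometric reasoning of multimodal large language models, achieving strong spatial perception without explicit 3D supervision.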


📘 Detailed Summary

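Motivation: Multimodal large language models (MLLMs) have strong semantic understanding but often suffer from spatial blindness, struggling with fine-grained geometric reasoning and physical dynamics. Existing solutions rely on explicit 3D modalities or complex geometric scaffolding, which are limited by data scarcity and generalization challenges.

Method: VEGA-3D repurposes a pretrained video diffusion model as a Latent World Simulator. It extracts spatiotemporal features from intermediate noise levels and integrates them with semantic representations through a token-level adaptive gated fusion mechanism. The approach draws on the 3D structural priors and physical regularities that video generation models learn implicitly, giving the MLLM dense geometric cues without explicit 3D supervision.

Result: Extensive experiments across 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks show that the method outperforms state-of-the-art baselines, confirming that the spatial priors implicit in video diffusion models can effectively strengthen the geometric reasoning of MLLMs.

Conclusion: Generative priors provide a scalable foundation for physical-world understanding. Because video generation models must produce temporally coherent output, they naturally learn robust 3D structural priors. This paradigm avoids dependence on explicit 3D data and opens a new route to stronger geometric reasoning in multimodal models.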


📄 Abstract

While Multimodal Large Language Models demonstrate impressive semantic capabilities, they often suffer from spatial blindness, struggling with fine-grained geometric reasoning and physical dynamics. Existing solutions typically rely on explicit 3D modalities or complex geometric scaffolding, which are limited by data scarcity and generalization challenges. In this work, we propose a paradigm shift by leveraging the implicit spatial prior within large-scale video generation models. We posit that to synthesize temporally coherent videos, these models inherently learn robust 3D structural priors and physical laws. We introduce VEGA-3D (Video Extracted Generative Awareness), a plug-and-play framework that repurposes a pre-trained video diffusion model as a Latent World Simulator. By extracting spatiotemporal features from intermediate noise levels and integrating them with semantic representations via a token-level adaptive gated fusion mechanism, we enrich MLLMs with dense geometric cues without explicit 3D supervision. Extensive experiments across 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks demonstrate that our method outperforms state-of-the-art baselines, validating that generative priors provide a scalable foundation for physical-world understanding. Code is publicly available at https://github.com/H-EmbodVis/VEGA-3D.
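
The abstract names a "token-level adaptive gated fusion mechanism" without giving its form. One common way to build such a gate, shown here purely as an assumption, is to compute a scalar gate per token from the concatenated features and use it to blend the two streams. The linear gate, the convex blend, and all names below are hypothetical and may differ from the released code.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(semantic, generative, gate_weights, gate_bias=0.0):
    """Blend two aligned token sequences with one scalar gate per token.

    Assumption: gate = sigmoid(w . [semantic; generative] + b), and the output
    is the convex combination (1 - gate) * semantic + gate * generative.
    """
    fused = []
    for s_tok, g_tok in zip(semantic, generative):
        features = list(s_tok) + list(g_tok)
        logit = gate_bias + sum(w * x for w, x in zip(gate_weights, features))
        gate = sigmoid(logit)
        fused.append([(1.0 - gate) * s + gate * g for s, g in zip(s_tok, g_tok)])
    return fused

# One hypothetical token with 2-D features from each stream.
semantic_tokens = [[1.0, 0.0]]
generative_tokens = [[0.0, 1.0]]
print(gated_fusion(semantic_tokens, generative_tokens, gate_weights=[0.0] * 4))
# -> [[0.5, 0.5]]  (zero weights give a gate of 0.5, i.e. a plain average)
```

Because the gate is computed per token, the model can lean on geometric cues for some tokens and on semantic features for others, which is the sense of "adaptive" assumed in this sketch.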

cs.AI

[11] DEAF: A Benchmark for Diagnostic Evaluation of Acoustic Faithfulness in Audio Language Models

Jiaqi Xiong, Yunjia Qi, Qi Cao, Yu Zheng, Weisheng Xu, Ziteng Wang, Ruofan Liao, Yutong Zhang, Sichen Liu

🧩 TL;DR

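This paper introduces DEAF, a benchmark for systematically testing whether audio multimodal large language models genuinely process acoustic signals or instead rely on text-based semantic inference. It reveals a gap between strong results on standard speech benchmarks and genuine acoustic understanding.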


📘 Detailed Summary

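Motivation: Current audio multimodal large language models (Audio MLLMs) perform impressively on speech benchmarks, yet it remains unclear whether they genuinely process acoustic signals or mainly rely on text-based semantic inference. The study investigates this question systematically.

Method: The authors introduce DEAF, a benchmark of over 2,700 conflict stimuli spanning three acoustic dimensions: emotional prosody, background sounds, and speaker identity. A controlled multi-level evaluation framework progressively increases textual influence, from semantic conflicts in the content to misleading prompts and their combination, to disentangle content-driven bias from prompt-induced sycophancy. Diagnostic metrics quantify how much a model relies on textual cues over acoustic signals.

Result: An evaluation of seven Audio MLLMs reveals a consistent pattern of text dominance: models are sensitive to acoustic variations, yet their predictions are predominantly driven by textual inputs. This exposes a gap between high performance on standard speech benchmarks and genuine acoustic understanding.

Conclusion: Current Audio MLLMs show a marked tendency toward text dominance, suggesting that standard speech benchmarks may not adequately measure genuine acoustic understanding. The work provides diagnostic tools and a methodology for building more reliable audio understanding models and argues for stricter evaluation that checks whether models actually use acoustic information.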


📄 Abstract

Recent Audio Multimodal Large Language Models (Audio MLLMs) demonstrate impressive performance on speech benchmarks, yet it remains unclear whether these models genuinely process acoustic signals or rely on text-based semantic inference. To systematically study this question, we introduce DEAF (Diagnostic Evaluation of Acoustic Faithfulness), a benchmark of over 2,700 conflict stimuli spanning three acoustic dimensions: emotional prosody, background sounds, and speaker identity. Then, we design a controlled multi-level evaluation framework that progressively increases textual influence, ranging from semantic conflicts in the content to misleading prompts and their combination, allowing us to disentangle content-driven bias from prompt-induced sycophancy. We further introduce diagnostic metrics to quantify model reliance on textual cues over acoustic signals. Our evaluation of seven Audio MLLMs reveals a consistent pattern of text dominance: models are sensitive to acoustic variations, yet predictions are predominantly driven by textual inputs, revealing a gap between high performance on standard speech benchmarks and genuine acoustic understanding.
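
The abstract does not define its diagnostic metrics, so the sketch below shows one plausible way to quantify text dominance on conflict stimuli: among trials where the acoustic cue and the textual cue point to different labels, count how often the prediction follows each. The metric names, the trial layout, and the sample values are assumptions for illustration only.

```python
def cue_following_rates(trials):
    """Share of conflict trials where the prediction matches the text cue vs. the audio cue."""
    n = len(trials)
    follows_text = sum(1 for t in trials if t["prediction"] == t["text_label"])
    follows_audio = sum(1 for t in trials if t["prediction"] == t["audio_label"])
    return {"text": follows_text / n, "audio": follows_audio / n}

# Hypothetical emotional-prosody conflicts: the words say one thing, the voice another.
trials = [
    {"audio_label": "angry", "text_label": "happy", "prediction": "happy"},
    {"audio_label": "sad", "text_label": "happy", "prediction": "happy"},
    {"audio_label": "angry", "text_label": "calm", "prediction": "angry"},
    {"audio_label": "sad", "text_label": "excited", "prediction": "excited"},
]
print(cue_following_rates(trials))  # -> {'text': 0.75, 'audio': 0.25}
```

A model that truly listened would score high on the audio rate for these stimuli, so a high text rate on conflict trials is direct evidence of the text dominance the paper reports.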

[12] Cognitive Mismatch in Multimodal Large Language Models for Discrete Symbol Understanding

Yinghui Li, Jiayi Kuang, Peng Xing, Daixian Liu, Junnan Dong, Shu-Yu Guo, Yangning Li, Qingyu Zhou, Wenhao Jiang, Hai-Tao Zheng, Ying Shen, Liang Lin, Philip S. Yu

🧩 TL;DR

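This study introduces a comprehensive benchmark for evaluating how well multimodal large language models understand discrete symbol spaces, and reveals a cognitive mismatch between basic symbol recognition and complex reasoning.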


📘 Detailed Summary

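Motivation: Multimodal large language models (MLLMs) have achieved remarkable success in interpreting natural scenes, but their ability to process discrete symbols such as mathematical formulas, chemical structures, and linguistic characters remains a critical open question. These symbols are fundamental building blocks of human cognition and require precise, deep interpretation, yet existing models have not been adequately evaluated on them.

Method: The study introduces a comprehensive benchmark that evaluates how top-tier MLLMs navigate discrete semantic spaces across five domains: language, culture, mathematics, physics, and chemistry. The benchmark is designed to test perception and understanding of discrete symbols, not the processing of continuous visual data.

Result: The study finds a counterintuitive phenomenon: models often fail at basic symbol recognition yet succeed at complex reasoning tasks, which suggests they rely on linguistic probability instead of true visual perception. This cognitive mismatch exposes a significant gap in current AI capabilities: difficulty in truly perceiving and understanding the symbolic languages that underpin scientific discovery and abstract thought.

Conclusion: The work highlights significant limitations of MLLMs in understanding discrete symbols. Its findings offer a roadmap for developing more rigorous, human-aligned intelligent systems and point future research toward improving symbol-grounded perception.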


📄 Abstract

While Multimodal Large Language Models (MLLMs) have achieved remarkable success in interpreting natural scenes, their ability to process discrete symbols -- the fundamental building blocks of human cognition -- remains a critical open question. Unlike continuous visual data, symbols such as mathematical formulas, chemical structures, and linguistic characters require precise, deeper interpretation. This paper introduces a comprehensive benchmark to evaluate how top-tier MLLMs navigate these "discrete semantic spaces" across five domains: language, culture, mathematics, physics, and chemistry. Our investigation uncovers a counterintuitive phenomenon: models often fail at basic symbol recognition yet succeed in complex reasoning tasks, suggesting they rely on linguistic probability rather than true visual perception. By exposing this "cognitive mismatch", we highlight a significant gap in current AI capabilities: the struggle to truly perceive and understand the symbolic languages that underpin scientific discovery and abstract thought. This work offers a roadmap for developing more rigorous, human-aligned intelligent systems.

[13] MedForge: Interpretable Medical Deepfake Detection via Forgery-aware Reasoning

Zhihui Chen, Kai He, Qingyuan Lei, Bin Pu, Jian Zhang, Yuling Xu, Mengling Feng

🧩 TL;DR

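This paper presents MedForge, a data-and-method solution for medical image forgery detection. It comprises the large-scale benchmark MedForge-90K and the detector MedForge-Reasoner, which uses localize-then-analyze reasoning, and addresses the shortcomings of existing methods in detecting text-guided edits to medical images.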


📘 Detailed Summary

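Motivation: Text-guided image editors can manipulate authentic medical scans with high fidelity, enabling lesion implantation or removal that threatens clinical trust and safety. Existing defenses are inadequate for healthcare: medical detectors are largely black-box, while MLLM-based explainers are typically post-hoc, lack medical expertise, and may hallucinate evidence on ambiguous cases.

Method: MedForge is a data-and-method solution for medical forgery detection. The authors first build MedForge-90K, a large-scale benchmark of realistic lesion edits across 19 pathologies, with expert-guided reasoning supervision provided through doctor inspection guidelines and gold edit locations. Building on it, MedForge-Reasoner performs localize-then-analyze reasoning, predicting suspicious regions before producing a verdict, and is further aligned with Forgery-aware GSPO to strengthen grounding and reduce hallucinations.

Result: Experiments show state-of-the-art detection accuracy together with trustworthy, expert-aligned explanations. MedForge-90K supplies a large-scale, diverse medical forgery benchmark, and MedForge-Reasoner performs well on both detection accuracy and explanation quality.

Conclusion: The study provides an effective pre-hoc, evidence-grounded solution for medical image forgery detection. Through a large-scale benchmark and a localize-then-analyze reasoning framework, it improves both detection accuracy and the trustworthiness of explanations, offering a useful tool for safety in healthcare settings.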


📄 Abstract

Text-guided image editors can now manipulate authentic medical scans with high fidelity, enabling lesion implantation/removal that threatens clinical trust and safety. Existing defenses are inadequate for healthcare. Medical detectors are largely black-box, while MLLM-based explainers are typically post-hoc, lack medical expertise, and may hallucinate evidence on ambiguous cases. We present MedForge, a data-and-method solution for pre-hoc, evidence-grounded medical forgery detection. We introduce MedForge-90K, a large-scale benchmark of realistic lesion edits across 19 pathologies with expert-guided reasoning supervision via doctor inspection guidelines and gold edit locations. Building on it, MedForge-Reasoner performs localize-then-analyze reasoning, predicting suspicious regions before producing a verdict, and is further aligned with Forgery-aware GSPO to strengthen grounding and reduce hallucinations. Experiments demonstrate state-of-the-art detection accuracy and trustworthy, expert-aligned explanations.

[14] Thinking with Constructions: A Benchmark and Policy Optimization for Visual-Text Interleaved Geometric Reasoning

Haokun Zhao, Wanshi Xu, Haidong Yuan, Songjun Cao, Long Ma, Yanghua Xiao

🧩 TL;DR

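This paper proposes a Visual-Text Interleaved Chain-of-Thought framework that uses reinforcement learning to optimize the auxiliary construction strategy of multimodal large language models in geometric reasoning, so that models generate visual aids dynamically instead of reasoning passively over static diagrams.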


📘 Detailed Summary

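Motivation: Existing multimodal large language models (MLLMs) are largely confined to passive inference over static diagrams in geometric reasoning. They lack the strategic knowledge of when and how to construct effective visual aids, and so cannot carry out the dynamic construction process that geometry problem solving requires.

Method: The authors first build GeoAux-Bench, a benchmark of 4,334 geometry problems that aligns textual construction steps with ground-truth visual updates. They then propose Action Applicability Policy Optimization (A2PO), which uses Adaptive Reward Shaping with counterfactual sampling to distinguish necessary from redundant constructions, regulating the timing and quality of visual aids.

Result: Interleaved visual-textual aids outperform single-modality counterparts, and valid constructions act as entropy reducers that correlate strongly with reduced reasoning perplexity. The proposed method enables MLLMs to use selective auxiliary constructions, yielding a 3.51% gain over strong baselines.

Conclusion: The study shows the importance of interleaved visual-textual representations in geometric reasoning and supports the view that valid constructions act as entropy reducers. The proposed reinforcement learning paradigm offers a workable path for multimodal models to master strategic visual construction.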


📄 Abstract

Geometric reasoning inherently requires "thinking with constructions" -- the dynamic manipulation of visual aids to bridge the gap between problem conditions and solutions. However, existing Multimodal Large Language Models (MLLMs) are largely confined to passive inference with static diagrams, lacking the strategic knowledge of when and how to construct effective visual aids. To address this, we present a framework for Visual-Text Interleaved Chain-of-Thought. We first introduce GeoAux-Bench, the first benchmark comprising 4,334 geometry problems that aligns textual construction steps with ground-truth visual updates. Our pilot study reveals two critical insights: (1) interleaved visual-textual aids outperform single-modality counterparts, which cannot losslessly capture geometric synergy; and (2) valid constructions act as entropy reducers, strongly correlating with reduced reasoning perplexity. Building on these findings, we propose Action Applicability Policy Optimization (A2PO), a reinforcement learning paradigm for mastering strategic construction. A2PO employs Adaptive Reward Shaping to regulate the timing and quality of visual aids via counterfactual sampling to distinguish necessary from redundant constructions. Experiments demonstrate our approach enables MLLMs to leverage selective auxiliary constructions, yielding a 3.51% gain over strong baselines. Code and data are available on GitHub.