cs.CV

[1] QUSR: Quality-Aware and Uncertainty-Guided Image Super-Resolution Diffusion Model

Junjie Yin, Jiaju Li, Hanfa Xing

🧩 TL;DR

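This paper proposes QUSR, a diffusion model for real-world image super-resolution. It integrates a quality-aware prior with an uncertainty-guided noise generation module to handle unknown, spatially non-uniform degradations and produce high-fidelity images.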


📘 Detailed Summary

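Motivation: Diffusion-based image super-resolution struggles in real-world scenarios because the degradation process is usually unknown and spatially non-uniform. This leads to lost details and visual artifacts, and existing methods have difficulty handling such complex degradation patterns.

Method: QUSR has two core components. The Uncertainty-Guided Noise Generation module adaptively adjusts the noise injection intensity according to regional uncertainty, applying stronger perturbations to high-uncertainty regions to reconstruct complex details while minimizing noise in low-uncertainty regions to preserve the original information. The Quality-Aware Prior uses an advanced multimodal large language model to generate reliable quality descriptions, giving the restoration process an effective and interpretable quality prior.

Result: Experiments show that QUSR produces high-fidelity, highly realistic images in real-world scenarios, handles unknown and spatially non-uniform degradations effectively, and outperforms existing methods in detail reconstruction and artifact suppression.

Conclusion: The study demonstrates the effectiveness of combining uncertainty-guided noise generation with a quality-aware prior for real-world image super-resolution and offers a new approach to complex degradation patterns. Using a multimodal large language model to generate the quality prior also improves the model's interpretability.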


📄 Abstract

Diffusion-based image super-resolution (ISR) has shown strong potential, but it still struggles in real-world scenarios where degradations are unknown and spatially non-uniform, often resulting in lost details or visual artifacts. To address this challenge, we propose a novel super-resolution diffusion model, QUSR, which integrates a Quality-Aware Prior (QAP) with an Uncertainty-Guided Noise Generation (UNG) module. The UNG module adaptively adjusts the noise injection intensity, applying stronger perturbations to high-uncertainty regions (e.g., edges and textures) to reconstruct complex details, while minimizing noise in low-uncertainty regions (e.g., flat areas) to preserve original information. Concurrently, the QAP leverages an advanced Multimodal Large Language Model (MLLM) to generate reliable quality descriptions, providing an effective and interpretable quality prior for the restoration process. Experimental results confirm that QUSR can produce high-fidelity and high-realism images in real-world scenarios. The source code is available at https://github.com/oTvTog/QUSR.
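
The idea behind the UNG module, stronger noise where uncertainty is high and weaker noise where it is low, can be illustrated with a minimal sketch. This is not the authors' implementation: the linear mapping from uncertainty to noise standard deviation, the function name `uncertainty_guided_noise`, and the parameters `sigma_min` and `sigma_max` are all assumptions chosen for illustration.

```python
import random


def uncertainty_guided_noise(uncertainty, sigma_min=0.05, sigma_max=1.0, seed=0):
    """Sample per-pixel Gaussian noise whose std grows with uncertainty.

    uncertainty: 2D list of floats in [0, 1] (a hypothetical uncertainty map).
    Returns (noise, sigma), both 2D lists with the same shape as the input.
    """
    rng = random.Random(seed)
    noise, sigma = [], []
    for row in uncertainty:
        # Assumed linear interpolation between the minimum and maximum std.
        sigma_row = [
            sigma_min + (sigma_max - sigma_min) * min(max(u, 0.0), 1.0)
            for u in row
        ]
        sigma.append(sigma_row)
        noise.append([rng.gauss(0.0, s) for s in sigma_row])
    return noise, sigma


# Flat regions (low uncertainty) on the top row, edges/textures on the bottom.
u_map = [[0.0, 0.1], [0.9, 1.0]]
noise, sigma = uncertainty_guided_noise(u_map)
```

In this sketch the flat pixels receive a noise scale close to `sigma_min`, while the edge and texture pixels receive a scale close to `sigma_max`.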

[2] EXPLORE-Bench: Egocentric Scene Prediction with Long-Horizon Reasoning

Chengjun Yu, Xuhan Zhu, Chaoqun Du, Pengfei Yu, Wei Zhai, Yang Cao, Zheng-Jun Zha

🧩 TL;DR

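This paper proposes EXPLORE-Bench, a benchmark for evaluating the reasoning ability of multimodal large language models on long-horizon egocentric scene prediction, and reveals a significant gap between current models and human performance.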


📘 Detailed Summary

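Motivation: Multimodal large language models are increasingly used as the foundation for embodied agents, but it remains unclear whether they can reliably reason about the long-term physical consequences of actions from an egocentric viewpoint. Closing this gap requires systematic evaluation.

Method: The study introduces the task Egocentric Scene Prediction with LOng-horizon REasoning and builds EXPLORE-Bench, a benchmark curated from real first-person videos. It pairs long action sequences with structured final-scene annotations to support fine-grained quantitative evaluation. The study also explores test-time scaling via stepwise reasoning.

Result: Experiments on a range of proprietary and open-source multimodal large language models show a significant performance gap to humans, indicating that long-horizon egocentric reasoning remains a major challenge. Decomposing long action sequences into stepwise reasoning improves performance to some extent but incurs non-trivial computational overhead.

Conclusion: EXPLORE-Bench provides a principled testbed for measuring and advancing long-horizon reasoning in embodied perception. The results indicate that current multimodal large language models still have fundamental limitations in long-horizon egocentric physical reasoning, and that new methods and technical breakthroughs are needed.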


📄 Abstract

Multimodal large language models (MLLMs) are increasingly considered as a foundation for embodied agents, yet it remains unclear whether they can reliably reason about the long-term physical consequences of actions from an egocentric viewpoint. We study this gap through a new task, Egocentric Scene Prediction with LOng-horizon REasoning: given an initial-scene image and a sequence of atomic action descriptions, a model is asked to predict the final scene after all actions are executed. To enable systematic evaluation, we introduce EXPLORE-Bench, a benchmark curated from real first-person videos spanning diverse scenarios. Each instance pairs long action sequences with structured final-scene annotations, including object categories, visual attributes, and inter-object relations, which supports fine-grained, quantitative assessment. Experiments on a range of proprietary and open-source MLLMs reveal a significant performance gap to humans, indicating that long-horizon egocentric reasoning remains a major challenge. We further analyze test-time scaling via stepwise reasoning and show that decomposing long action sequences can improve performance to some extent, while incurring non-trivial computational overhead. Overall, EXPLORE-Bench provides a principled testbed for measuring and advancing long-horizon reasoning for egocentric embodied perception.
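
Structured final-scene annotations (object categories, visual attributes, inter-object relations) lend themselves to set-based scoring. The sketch below shows one plausible way to do this with a per-field F1 score. The abstract does not specify the benchmark's actual metric, so the scoring rule, the field names, and the functions `set_f1` and `score_scene` are assumptions for illustration only.

```python
def set_f1(predicted, reference):
    """F1 between two collections treated as sets of hashable items."""
    predicted, reference = set(predicted), set(reference)
    if not predicted and not reference:
        return 1.0
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


def score_scene(pred, ref, fields=("objects", "attributes", "relations")):
    """Score each annotation field separately (hypothetical field names)."""
    return {f: set_f1(pred.get(f, ()), ref.get(f, ())) for f in fields}


ref = {
    "objects": {"cup", "table", "kettle"},
    "attributes": {("cup", "full")},
    "relations": {("cup", "on", "table")},
}
pred = {
    "objects": {"cup", "table"},          # misses the kettle
    "attributes": {("cup", "empty")},     # wrong attribute
    "relations": {("cup", "on", "table")},
}
scores = score_scene(pred, ref)
```

Scoring each field separately makes it visible whether a model fails on what is present, what it looks like, or how objects relate, which is the kind of fine-grained assessment the abstract describes.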

[3] Granulon: Awakening Pixel-Level Visual Encoders with Adaptive Multi-Granularity Semantics for MLLM

Junyuan Mao, Qiankun Li, Linghao Meng, Zhicheng He, Xinliang Zhou, Kun Wang, Yang Liu, Yueming Jin

🧩 TL;DR

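This paper proposes Granulon, a DINOv3-based multimodal large language model. Its adaptive granularity augmentation mechanism addresses the shortcomings of existing visual encoders in fine-grained visual understanding and multi-granularity reasoning, enabling unified pixel-to-fine-to-coarse reasoning.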


📘 Detailed Summary

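Motivation: Current multimodal large language models rely mainly on CLIP visual encoders, which emphasize global semantic alignment but lack fine-grained visual understanding. DINOv3 provides strong pixel-level perception but lacks coarse-grained semantic abstraction, which limits multi-granularity reasoning. This study aims to fill that gap by building a multimodal model that adaptively adjusts the granularity of visual abstraction.

Method: Granulon introduces a text-conditioned granularity controller that dynamically adjusts the level of visual abstraction according to the semantic scope of the textual input. It also includes an Adaptive Token Aggregation module that produces compact, semantically rich visual tokens through granularity-guided pooling and relation-aware clustering. The architecture performs unified "pixel-to-fine-to-coarse" reasoning within a single forward pass.

Result: Extensive and interpretable experiments show that Granulon outperforms all visual encoders under identical settings, improving accuracy by about 30% and reducing hallucination by about 20%. The model shows clear advantages on fine-grained visual understanding and multi-granularity reasoning tasks, validating the adaptive granularity augmentation mechanism.

Conclusion: The study shows that DINOv3-based adaptive granularity augmentation can effectively address multi-granularity reasoning in multimodal large language models. Granulon's unified architecture offers a new paradigm for vision-language understanding, and the design of its text-conditioned granularity controller and Adaptive Token Aggregation module is instructive for future multimodal model research.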


📄 Abstract

Recent advances in multimodal large language models largely rely on CLIP-based visual encoders, which emphasize global semantic alignment but struggle with fine-grained visual understanding. In contrast, DINOv3 provides strong pixel-level perception yet lacks coarse-grained semantic abstraction, leading to limited multi-granularity reasoning. To address this gap, we propose Granulon, a novel DINOv3-based MLLM with adaptive granularity augmentation. Granulon introduces a text-conditioned granularity Controller that dynamically adjusts the visual abstraction level according to the semantic scope of the textual input, and an Adaptive Token Aggregation module that performs granularity-guided pooling and relation-aware clustering to produce compact, semantically rich visual tokens. This design enables unified "pixel-to-fine-to-coarse" reasoning within a single forward pass. Extensive and interpretable experiments demonstrate that Granulon improves accuracy by ~30% and reduces hallucination by ~20%, outperforming all visual encoders under identical settings.
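
To make "granularity-guided pooling" concrete, here is a deliberately simplified sketch in which a scalar granularity value sets the pooling window over a 1D token sequence. The real module operates on learned visual features and also performs relation-aware clustering, which is not modelled here. The function name `granularity_pool`, the scalar controller output, and the `max_window` parameter are assumptions.

```python
def granularity_pool(tokens, granularity, max_window=4):
    """Average-pool a token sequence with a window set by granularity.

    tokens: list of equal-length feature vectors (lists of floats).
    granularity: float in [0, 1]; 0 keeps every token (pixel-level),
        1 pools with the largest window (coarse-level). This scalar stands
        in for the output of a text-conditioned controller (assumption).
    """
    g = min(max(granularity, 0.0), 1.0)
    window = 1 + round(g * (max_window - 1))
    pooled = []
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]
        dim = len(chunk[0])
        pooled.append([sum(t[d] for t in chunk) / len(chunk) for d in range(dim)])
    return pooled


tokens = [[float(i), float(2 * i)] for i in range(8)]
fine = granularity_pool(tokens, granularity=0.0)    # 8 tokens, unchanged
coarse = granularity_pool(tokens, granularity=1.0)  # 2 tokens, window of 4
```

A question about a small detail would map to a low granularity and keep many tokens, while a question about the overall scene would map to a high granularity and produce a short, coarse sequence.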

[4] MEGC2026: Micro-Expression Grand Challenge on Visual Question Answering

Xinqi Fan, Jingting Li, John See, Moi Hoon Yap, Su-Jing Wang, Adrian K. Davison

🧩 TL;DR

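This paper introduces the two new tasks of MEGC 2026: micro-expression video question answering (ME-VQA) and micro-expression long-video question answering (ME-LVQA). They aim to advance micro-expression analysis by drawing on the reasoning capabilities of multimodal large language models (MLLMs) and large vision-language models (LVLMs).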


📘 Detailed Summary

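Motivation: Micro-expression analysis has made significant progress, but a research gap remains in exploiting the strong reasoning capabilities of emerging multimodal large language models and large vision-language models. MEGC 2026 introduces two new tasks to explore the potential of these models for micro-expression understanding, targeting question answering on short video sequences and temporal reasoning on long video sequences.

Method: The study proposes two core tasks. Micro-expression video question answering (ME-VQA) uses MLLMs or LVLMs for visual question answering on relatively short video sequences, covering diverse question types related to micro-expressions. Micro-expression long-video question answering (ME-LVQA) extends VQA to long-duration video sequences in realistic settings, requiring models to perform temporal reasoning and detect subtle micro-expressions across extended time periods.

Result: The study establishes a public evaluation framework for MEGC 2026 and requires all participating algorithms to submit results to a public leaderboard. The challenge provides a standardized benchmark for micro-expression analysis, with particular attention to evaluating multimodal models on micro-expression video understanding tasks.

Conclusion: The study marks an important shift of micro-expression analysis into the era of large multimodal models. By introducing ME-VQA and ME-LVQA, it opens new avenues for micro-expression understanding with advanced MLLMs and LVLMs. The task design reflects a trend from traditional recognition methods toward complex reasoning and temporal analysis, and is expected to promote the application of micro-expression analysis in realistic settings.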


📄 Abstract

Facial micro-expressions (MEs) are involuntary movements of the face that occur spontaneously when a person experiences an emotion but attempts to suppress or repress the facial expression, typically found in a high-stakes environment. In recent years, substantial advancements have been made in the areas of ME recognition, spotting, and generation. The emergence of multimodal large language models (MLLMs) and large vision-language models (LVLMs) offers promising new avenues for enhancing ME analysis through their powerful multimodal reasoning capabilities. The ME grand challenge (MEGC) 2026 introduces two tasks that reflect these evolving research directions: (1) ME video question answering (ME-VQA), which explores ME understanding through visual question answering on relatively short video sequences, leveraging MLLMs or LVLMs to address diverse question types related to MEs; and (2) ME long-video question answering (ME-LVQA), which extends VQA to long-duration video sequences in realistic settings, requiring models to handle temporal reasoning and subtle micro-expression detection across extended time periods. All participating algorithms are required to submit their results on a public leaderboard. More details are available at https://megc2026.github.io.

[5] Point Cloud as a Foreign Language for Multi-modal Large Language Model

Sneha Paul, Zachary Patterson, Nizar Bouguila

🧩 TL;DR

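This paper presents SAGE, the first end-to-end 3D multimodal large language model. It processes raw point clouds directly without a pre-trained 3D encoder and, using a lightweight 3D tokenizer and a preference optimization training strategy, outperforms existing encoder-based methods on 3D understanding tasks.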


📘 Detailed Summary

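Motivation: Existing encoder-based 3D multimodal large language models rely on pre-trained 3D encoders to extract geometric features, but they suffer from semantic misalignment between geometric and linguistic spaces, resolution sensitivity, and substantial computational overhead. A more efficient, semantically aligned end-to-end solution is needed.

Method: The paper proposes a lightweight 3D tokenizer that combines geometric sampling, neighbourhood aggregation, and vector quantization to convert point clouds into discrete tokens, treating 3D data as a foreign language that extends the LLM's vocabulary. It also designs a preference optimization training strategy with a semantic alignment-based reward, aimed specifically at open-ended 3D question answering.

Result: Across diverse 3D understanding benchmarks, the end-to-end approach outperforms existing encoder-based methods, with significant advantages in computational efficiency, generalization across LLM backbones, and robustness to input resolution variations.

Conclusion: The study demonstrates the feasibility of an end-to-end 3D MLLM that processes raw point clouds directly. By treating 3D data as discrete tokens that integrate naturally with the language model, it offers a more efficient, robust, and semantically aligned paradigm for 3D multimodal understanding.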


📄 Abstract

Multi-modal large language models (MLLMs) have shown remarkable progress in integrating visual and linguistic understanding. Recent efforts have extended these capabilities to 3D understanding through encoder-based architectures that rely on pre-trained 3D encoders to extract geometric features. However, such approaches suffer from semantic misalignment between geometric and linguistic spaces, resolution sensitivity, and substantial computational overhead. In this work, we present SAGE, the first end-to-end 3D MLLM that directly processes raw point clouds without relying on a pre-trained 3D encoder. Our approach introduces a lightweight 3D tokenizer that combines geometric sampling and neighbourhood aggregation with vector quantization to convert point clouds into discrete tokens--treating 3D data as a foreign language that naturally extends the LLM's vocabulary. Furthermore, to enhance the model's reasoning capability on complex 3D tasks, we propose a preference optimization training strategy with a semantic alignment-based reward, specifically designed for open-ended 3D question answering where responses are descriptive. Extensive experiments across diverse 3D understanding benchmarks demonstrate that our end-to-end approach outperforms existing encoder-based methods while offering significant advantages in computational efficiency, generalization across LLM backbones, and robustness to input resolution variations. Code is available at: github.com/snehaputul/SAGE3D.
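
The three tokenizer stages named above (geometric sampling, neighbourhood aggregation, vector quantization) can be sketched end to end in a few lines. This is an illustrative toy, not the SAGE tokenizer: farthest point sampling is assumed as the sampling scheme, the aggregated "feature" is just the mean coordinate of the nearest neighbours rather than a learned embedding, and the codebook is hand-written.

```python
import math


def farthest_point_sample(points, m):
    """Pick m point indices that are spread out (farthest point sampling)."""
    chosen = [0]
    dist = [math.dist(p, points[0]) for p in points]
    while len(chosen) < m:
        idx = max(range(len(points)), key=lambda i: dist[i])
        chosen.append(idx)
        dist = [min(d, math.dist(p, points[idx])) for d, p in zip(dist, points)]
    return chosen


def aggregate_neighbours(points, centre_idx, k):
    """Mean coordinate of the k nearest points (stand-in for a learned feature)."""
    centre = points[centre_idx]
    nearest = sorted(points, key=lambda p: math.dist(p, centre))[:k]
    return tuple(sum(p[d] for p in nearest) / len(nearest) for d in range(3))


def quantize(feature, codebook):
    """Index of the nearest codebook entry (vector quantization)."""
    return min(range(len(codebook)), key=lambda i: math.dist(feature, codebook[i]))


def tokenize(points, codebook, m=2, k=2):
    """Point cloud -> list of discrete token ids."""
    centres = farthest_point_sample(points, m)
    return [quantize(aggregate_neighbours(points, c, k), codebook) for c in centres]


points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 5.0, 5.0), (5.1, 5.0, 5.0)]
codebook = [(0.0, 0.0, 0.0), (5.0, 5.0, 5.0)]  # hypothetical 2-entry codebook
token_ids = tokenize(points, codebook)
```

The resulting integer ids are the kind of discrete symbols that could be appended to a language model's vocabulary, which is the sense in which the point cloud is treated as a foreign language.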

[6] TubeMLLM: A Foundation Model for Topology Knowledge Exploration in Vessel-like Anatomy

Yaoyu Liu, Minghui Zhang, Xin You, Hanxiao Zhang, Yun Gu

🧩 TL;DR

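This paper proposes TubeMLLM, a unified foundation model for modeling medical vessel-like anatomy. By integrating topological priors in a shared-attention architecture, it significantly improves topology-aware perception and achieves state-of-the-art out-of-distribution performance and zero-shot cross-modality transfer across multiple datasets.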


📘 Detailed Summary

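Motivation: Modeling medical vessel-like anatomy is challenging because of its intricate topology and sensitivity to dataset shifts. As a result, task-specific models often produce topological inconsistencies, including artificial disconnections and spurious merges. Existing methods lack explicit modeling of topology, which limits generalization and cross-modality adaptability.

Method: The paper proposes TubeMLLM, a unified foundation model that integrates topological priors through explicit natural language prompting and aligns them with visual representations in a shared-attention architecture. It constructs TubeMData, a multimodal benchmark comprising comprehensive topology-centric tasks, and introduces an adaptive loss weighting strategy that emphasizes topology-critical regions during training.

Result: Experiments on 15 diverse datasets show that TubeMLLM achieves state-of-the-art out-of-distribution performance, reducing the β₀ number error on color fundus photography from 37.42 to 8.58. On unseen X-ray angiography it achieves a Dice score of 67.50% while reducing the β₀ error to 1.21. The model is robust to degradations such as blur, noise, and low resolution, and reaches 97.38% accuracy in evaluating mask topological quality in topology-aware understanding tasks.

Conclusion: By coupling structured understanding with controllable generation, TubeMLLM provides an effective unified framework for modeling medical vessel-like anatomy. The study demonstrates the potential of multimodal large language models for topological modeling in medical imaging and opens a direction toward topology-aware medical vision foundation models, with significant clinical value.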


📄 Abstract

Modeling medical vessel-like anatomy is challenging due to its intricate topology and sensitivity to dataset shifts. Consequently, task-specific models often suffer from topological inconsistencies, including artificial disconnections and spurious merges. Motivated by the promise of multimodal large language models (MLLMs) for zero-shot generalization, we propose TubeMLLM, a unified foundation model that couples structured understanding with controllable generation for medical vessel-like anatomy. By integrating topological priors through explicit natural language prompting and aligning them with visual representations in a shared-attention architecture, TubeMLLM significantly enhances topology-aware perception. Furthermore, we construct TubeMData, a pioneering multimodal benchmark comprising comprehensive topology-centric tasks, and introduce an adaptive loss weighting strategy to emphasize topology-critical regions during training. Extensive experiments on fifteen diverse datasets demonstrate our superiority. Quantitatively, TubeMLLM achieves state-of-the-art out-of-distribution performance, substantially reducing global topological discrepancies on color fundus photography (decreasing the $β_{0}$ number error from 37.42 to 8.58 compared to baselines). Notably, TubeMLLM exhibits exceptional zero-shot cross-modality transferring ability on unseen X-ray angiography, achieving a Dice score of 67.50% while significantly reducing the $β_{0}$ error to 1.21. TubeMLLM also maintains robustness against degradations such as blur, noise, and low resolution. Furthermore, in topology-aware understanding tasks, the model achieves 97.38% accuracy in evaluating mask topological quality, significantly outperforming standard vision-language baselines.
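
The two metrics quoted above measure different things: Dice measures pixel overlap, while the β₀ error compares the number of connected components and therefore penalizes broken or merged vessels. The sketch below computes both on tiny binary masks. It assumes 8-connectivity and defines the β₀ error as the absolute difference in component counts; the paper's exact evaluation protocol may differ.

```python
from collections import deque


def betti0(mask):
    """Count 8-connected foreground components in a binary 2D mask."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    count = 0
    for r in range(h):
        for c in range(w):
            if mask[r][c] and not seen[r][c]:
                count += 1
                seen[r][c] = True
                queue = deque([(r, c)])
                while queue:  # breadth-first flood fill of one component
                    y, x = queue.popleft()
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and mask[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                queue.append((ny, nx))
    return count


def dice(pred, gt):
    """Dice coefficient between two binary masks of equal shape."""
    inter = sum(p and g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    total = sum(map(sum, pred)) + sum(map(sum, gt))
    return 1.0 if total == 0 else 2 * inter / total


gt = [[1, 1, 1, 1, 1]]    # one continuous vessel
pred = [[1, 1, 0, 1, 1]]  # same vessel with a one-pixel gap
beta0_error = abs(betti0(pred) - betti0(gt))
overlap = dice(pred, gt)
```

Here a single missing pixel leaves the Dice score high (8/9) but splits the vessel in two, giving a β₀ error of 1. This is why a topology-aware metric is reported alongside overlap.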

[7] OddGridBench: Exposing the Lack of Fine-Grained Visual Discrepancy Sensitivity in Multimodal Large Language Models

Tengjin Weng, Wenhao Jiang, Jingyi Wang, Ming Li, Lin Ma, Zhong Ming

🧩 TL;DR

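This paper proposes the OddGridBench benchmark and the OddGrid-GRPO reinforcement learning framework to systematically evaluate and improve the fine-grained visual discrepancy perception of multimodal large language models, and finds that current MLLMs perform far below human level on this task.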


📘 Detailed Summary

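Motivation: Multimodal large language models perform well across many vision-language tasks, but their low-level visual perception, particularly the ability to detect fine-grained visual discrepancies, has not been thoroughly explored or systematically analyzed. This is an important gap in current research.

Method: The study proposes OddGridBench, a benchmark of over 1,400 grid-based images in which a single element differs from all others in one or more visual attributes such as color, size, rotation, or position. It also develops OddGrid-GRPO, a reinforcement learning framework that integrates curriculum learning with a distance-aware reward. The framework progressively controls the difficulty of training samples and incorporates spatial proximity constraints into the reward design.

Result: Experiments show that all evaluated MLLMs, including open-source models such as Qwen3-VL and InternVL3.5 and proprietary systems such as Gemini-2.5-Pro and GPT-5, perform far below human level on visual discrepancy detection. OddGrid-GRPO significantly improves the model's fine-grained visual discrimination ability.

Conclusion: The study reveals significant limitations of current MLLMs in fine-grained visual perception. OddGridBench and OddGrid-GRPO lay the groundwork for advancing perceptual grounding and visual discrepancy sensitivity in multimodal intelligence, and provide an evaluation tool and an improvement method for future research.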


📄 Abstract

Multimodal large language models (MLLMs) have achieved remarkable performance across a wide range of vision language tasks. However, their ability in low-level visual perception, particularly in detecting fine-grained visual discrepancies, remains underexplored and lacks systematic analysis. In this work, we introduce OddGridBench, a controllable benchmark for evaluating the visual discrepancy sensitivity of MLLMs. OddGridBench comprises over 1,400 grid-based images, where a single element differs from all others by one or multiple visual attributes such as color, size, rotation, or position. Experiments reveal that all evaluated MLLMs, including open-source families such as Qwen3-VL and InternVL3.5, and proprietary systems like Gemini-2.5-Pro and GPT-5, perform far below human levels in visual discrepancy detection. We further propose OddGrid-GRPO, a reinforcement learning framework that integrates curriculum learning and distance-aware reward. By progressively controlling the difficulty of training samples and incorporating spatial proximity constraints into the reward design, OddGrid-GRPO significantly enhances the model's fine-grained visual discrimination ability. We hope OddGridBench and OddGrid-GRPO will lay the groundwork for advancing perceptual grounding and visual discrepancy sensitivity in multimodal intelligence. Code and dataset are available at https://wwwtttjjj.github.io/OddGridBench/.
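
A distance-aware reward gives partial credit to answers that land near the odd element instead of treating every miss equally. The abstract does not give the reward formula, so the following is only one plausible shape: a linear decay with Manhattan distance on the grid. The function name and the choice of distance and normalization are assumptions.

```python
def distance_aware_reward(pred, target, rows, cols):
    """Reward in [0, 1] that decays linearly with grid distance to the target.

    pred, target: (row, col) cell coordinates, zero-indexed.
    Predictions outside the grid receive no reward.
    """
    (pr, pc), (tr, tc) = pred, target
    if not (0 <= pr < rows and 0 <= pc < cols):
        return 0.0
    max_dist = (rows - 1) + (cols - 1)  # largest possible Manhattan distance
    if max_dist == 0:
        return 1.0
    dist = abs(pr - tr) + abs(pc - tc)
    return 1.0 - dist / max_dist


target = (2, 2)  # odd element at the centre of a 5x5 grid
exact = distance_aware_reward((2, 2), target, 5, 5)
near = distance_aware_reward((2, 3), target, 5, 5)
far = distance_aware_reward((0, 0), target, 5, 5)
```

Compared with a binary correct/incorrect signal, this kind of shaping gives the policy a gradient toward the right region of the grid even when the exact cell is wrong.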

[8] Prune Redundancy, Preserve Essence: Vision Token Compression in VLMs via Synergistic Importance-Diversity

Zhengyao Fang, Pengyuan Lyu, Chengquan Zhang, Guangming Lu, Jun Yu, Wenjie Pei

🧩 TL;DR

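This paper proposes PruneSID, a training-free visual token pruning method based on synergistic importance-diversity. Its two-stage pipeline significantly improves the inference efficiency of vision-language models while preserving semantic integrity.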


📘 Detailed Summary

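Motivation: Vision-language models face significant computational inefficiency caused by generating too many visual tokens. Existing compression methods struggle to balance importance preservation and information diversity, so a more effective token pruning strategy is needed.

Method: PruneSID uses a two-stage pipeline. First, Principal Semantic Components Analysis clusters tokens into semantically coherent groups to ensure comprehensive concept coverage. Second, intra-group Non-Maximum Suppression prunes redundant tokens within each group while preserving key representative tokens. An information-aware dynamic compression ratio mechanism then optimizes the compression rate according to image complexity.

Result: Experiments show that PruneSID reaches 96.3% accuracy on LLaVA-1.5 while retaining only 11.1% of tokens, and 92.8% accuracy on LLaVA-NeXT at an extreme compression rate of 5.6%. It outperforms prior methods by 2.5% and prefills 7.8 times faster than the original model. The framework generalizes well across multiple vision-language models and across image and video modalities.

Conclusion: The study demonstrates the effectiveness of the synergistic importance-diversity approach for visual token compression. The training-free framework generalizes strongly across models and modalities and offers a practical solution for efficient vision-language model inference. The code is open source.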


📄 Abstract

Vision-language models (VLMs) face significant computational inefficiencies caused by excessive generation of visual tokens. While prior work shows that a large fraction of visual tokens are redundant, existing compression methods struggle to balance importance preservation and information diversity. To address this, we propose PruneSID, a training-free Synergistic Importance-Diversity approach featuring a two-stage pipeline: (1) Principal Semantic Components Analysis (PSCA) for clustering tokens into semantically coherent groups, ensuring comprehensive concept coverage, and (2) Intra-group Non-Maximum Suppression (NMS) for pruning redundant tokens while preserving key representative tokens within each group. Additionally, PruneSID incorporates an information-aware dynamic compression ratio mechanism that optimizes token compression rates based on image complexity, enabling more effective average information preservation across diverse scenes. Extensive experiments demonstrate state-of-the-art performance, achieving 96.3% accuracy on LLaVA-1.5 with only 11.1% token retention, and 92.8% accuracy at extreme compression rates (5.6%) on LLaVA-NeXT, outperforming prior methods by 2.5% with 7.8$\times$ faster prefilling speed compared to the original model. Our framework generalizes across diverse VLMs and both image and video modalities, showcasing strong cross-modal versatility. Code is available at https://github.com/ZhengyaoFang/PruneSID.
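
The second stage, intra-group NMS, can be pictured by analogy with NMS in object detection: visit tokens in order of importance and drop any token that is too similar to one already kept. The sketch below applies that idea to one group of token vectors. It is an assumption-laden illustration rather than the PruneSID code: cosine similarity, the fixed `threshold`, and the externally supplied importance `scores` are all choices made here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def intra_group_nms(tokens, scores, threshold=0.9):
    """Return indices of tokens kept within one semantic group.

    Tokens are visited from most to least important; a token is suppressed
    when its similarity to any already-kept token reaches the threshold.
    """
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(cosine(tokens[i], tokens[j]) < threshold for j in kept):
            kept.append(i)
    return kept


group = [(1.0, 0.0), (0.99, 0.1), (0.0, 1.0)]
importance = [0.9, 0.8, 0.5]  # hypothetical importance scores
kept = intra_group_nms(group, importance)
```

In this example the second token is nearly parallel to the first and is suppressed, while the third points in a different direction and survives despite its lower score. That is the importance-diversity trade-off in miniature.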

[9] FetalAgents: A Multi-Agent System for Fetal Ultrasound Image and Video Analysis

Xiaotian Hu, Junwei Huang, Mingxuan Liu, Kasidit Anmahapong, Yifei Chen, Yitong Luo, Yiming Huang, Xuguang Bai, Zihan Li, Yi Liao, Haibo Qu, Qiyuan Tian

🧩 TL;DR

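This study proposes FetalAgents, the first multi-agent system for comprehensive fetal ultrasound analysis. A lightweight agentic coordination framework dynamically orchestrates specialized vision experts to achieve the best performance on diagnosis, measurement, and segmentation tasks, and the system supports end-to-end video stream summarization.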


📘 Detailed Summary

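Motivation: Fetal ultrasound is the primary imaging modality for prenatal screening, but its interpretation depends heavily on clinician expertise. Existing automated tools struggle to balance task-specific accuracy with the whole-process versatility needed to support end-to-end clinical workflows, which limits their adoption in clinical practice.

Method: FetalAgents uses a multi-agent architecture in which a lightweight agentic coordination framework dynamically orchestrates specialized vision experts. Beyond static image analysis, the system supports end-to-end video stream summarization: it automatically identifies keyframes across multiple anatomical planes, coordinates experts to analyze them, and combines the results with patient metadata into a structured clinical report.

Result: In multi-center external evaluations, FetalAgents consistently delivers the most robust and accurate performance across eight clinical tasks, outperforming specialized models and multimodal large language models. The system provides an auditable, workflow-aligned solution, validating its effectiveness for fetal ultrasound analysis and report generation.

Conclusion: Through its multi-agent coordination framework, FetalAgents addresses the trade-off between task-specific accuracy and clinical workflow versatility in fetal ultrasound analysis. It provides an auditable, workflow-aligned solution for automated prenatal screening and represents an important shift from static image analysis to dynamic video stream processing.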


📄 Abstract

Fetal ultrasound (US) is the primary imaging modality for prenatal screening, yet its interpretation relies heavily on the expertise of the clinician. Despite advances in deep learning and foundation models, existing automated tools for fetal US analysis struggle to balance task-specific accuracy with the whole-process versatility required to support end-to-end clinical workflows. To address these limitations, we propose FetalAgents, the first multi-agent system for comprehensive fetal US analysis. Through a lightweight, agentic coordination framework, FetalAgents dynamically orchestrates specialized vision experts to maximize performance across diagnosis, measurement, and segmentation. Furthermore, FetalAgents advances beyond static image analysis by supporting end-to-end video stream summarization, where keyframes are automatically identified across multiple anatomical planes, analyzed by coordinated experts, and synthesized with patient metadata into a structured clinical report. Extensive multi-center external evaluations across eight clinical tasks demonstrate that FetalAgents consistently delivers the most robust and accurate performance when compared against specialized models and multimodal large language models (MLLMs), ultimately providing an auditable, workflow-aligned solution for fetal ultrasound analysis and reporting.

[10] InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing

Changyao Tian, Danni Yang, Guanzhou Chen, Erfei Cui, Zhaokai Wang, Yuchen Duan, Penghao Yin, Sitao Chen, Ganlin Yang, Mingxin Liu, Zirun Zhu, Ziqian Fan, Leyao Gu, Haomin Wang, Qi Wei, Jinhui Yin, Xue Yang, Zhihang Zhong, Qi Qin, Yi Xin, Bin Fu, Yihao Liu, Jiaye Ge, Qipeng Guo, Gen Luo, Hongsheng Li, Yu Qiao, Kai Chen, Hongjie Zhang

🧩 TL;DR

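This paper presents InternVL-U, a lightweight 4-billion-parameter unified multimodal model. Through unified contextual modeling and modality-specific modular design, it balances strong semantic understanding with generation capability in a single framework.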


📘 Detailed Summary

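Motivation: Unified multimodal models face an inherent trade-off between maintaining strong semantic understanding and acquiring powerful generation capabilities. Existing models struggle to deliver both high-quality multimodal understanding and generation and editing, so the gap between semantic understanding and visual generation needs to be bridged.

Method: InternVL-U follows the principles of unified contextual modeling and modality-specific modular design with decoupled visual representations. It integrates a state-of-the-art multimodal large language model with a specialized MMDiT-based visual generation head. It also builds a data synthesis pipeline targeting high-semantic-density tasks, under a reasoning-centric paradigm that uses Chain-of-Thought to align abstract user intent with fine-grained visual generation details.

Result: Experiments show that InternVL-U achieves a superior balance between performance and efficiency. Despite using only 4 billion parameters, it consistently outperforms unified baseline models more than 3 times larger, such as BAGEL (14 billion parameters), on various generation and editing tasks, while retaining strong multimodal understanding and reasoning capabilities.

Conclusion: The study shows that a unified framework and a carefully designed data strategy can deliver high-quality multimodal understanding, reasoning, and generation together in a lightweight model. It offers a new design paradigm for unified multimodal models and demonstrates that comprehensive multimodal capability is feasible at a limited parameter scale.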


📄 Abstract

Unified multimodal models (UMMs) that integrate understanding, reasoning, generation, and editing face inherent trade-offs between maintaining strong semantic comprehension and acquiring powerful generation capabilities. In this report, we present InternVL-U, a lightweight 4B-parameter UMM that democratizes these capabilities within a unified framework. Guided by the principles of unified contextual modeling and modality-specific modular design with decoupled visual representations, InternVL-U integrates a state-of-the-art Multimodal Large Language Model (MLLM) with a specialized MMDiT-based visual generation head. To further bridge the gap between aesthetic generation and high-level intelligence, we construct a comprehensive data synthesis pipeline targeting high-semantic-density tasks, such as text rendering and scientific reasoning, under a reasoning-centric paradigm that leverages Chain-of-Thought (CoT) to better align abstract user intent with fine-grained visual generation details. Extensive experiments demonstrate that InternVL-U achieves a superior performance-efficiency balance. Despite using only 4B parameters, it consistently outperforms unified baseline models with over 3x larger scales such as BAGEL (14B) on various generation and editing tasks, while retaining strong multimodal understanding and reasoning capabilities.

cs.CL

[11] Reading, Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal LLMs

Kaiser Sun, Xiaochuang Yuan, Hongjun Liu, Chen Zhao, Cheng Zhang, Mark Dredze, Fan Bai

🧩 TL;DR

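This study systematically diagnoses the "modality gap" that appears when multimodal large language models process text rendered as images, and proposes a self-distillation method that significantly improves visual text understanding. It finds that the modality gap is task- and data-dependent and that rendering choices are an important confound.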


📘 Detailed Summary

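Motivation: Multimodal large language models can process text presented as images, but they usually perform worse than when the same content is provided as text tokens. This study aims to systematically diagnose this "modality gap", examine how it varies across tasks and data conditions, and understand its root causes.

Method: The study evaluates seven MLLMs on seven benchmarks in five input modes, covering both synthetically rendered text and realistic document images. A grounded-theory error analysis of over 4,000 examples reveals the error patterns. The study then proposes a self-distillation method that trains the model on its own pure-text reasoning traces paired with image inputs.

Result: The modality gap is task- and data-dependent: math tasks degrade by over 60 points on synthetic renderings. Rendering choices such as font and resolution are strong confounds, with font alone swinging accuracy by up to 47 percentage points. Image mode selectively amplifies reading errors while leaving knowledge and reasoning errors largely unchanged. Self-distillation raises image-mode accuracy on GSM8K from 30.71% to 92.72% and transfers to unseen benchmarks without catastrophic forgetting.

Conclusion: The study provides a systematic understanding of the modality gap and shows that image mode selectively amplifies reading errors while leaving knowledge and reasoning errors unchanged. The proposed self-distillation method offers a practical path toward better visual text understanding in multimodal language models, with substantial performance gains and cross-task transfer.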


📄 Abstract

Multimodal large language models (MLLMs) can process text presented as images, yet they often perform worse than when the same content is provided as textual tokens. We systematically diagnose this "modality gap" by evaluating seven MLLMs across seven benchmarks in five input modes, spanning both synthetically rendered text and realistic document images from arXiv PDFs to Wikipedia pages. We find that the modality gap is task- and data-dependent. For example, math tasks degrade by over 60 points on synthetic renderings, while natural document images often match or exceed text-mode performance. Rendering choices such as font and resolution are strong confounds, with font alone swinging accuracy by up to 47 percentage points. To understand this, we conduct a grounded-theory error analysis of over 4,000 examples, revealing that image mode selectively amplifies reading errors (calculation and formatting failures) while leaving knowledge and reasoning errors largely unchanged, and that some models exhibit a chain-of-thought reasoning collapse under visual input. Motivated by these findings, we propose a self-distillation method that trains the model on its own pure text reasoning traces paired with image inputs, raising image-mode accuracy on GSM8K from 30.71% to 92.72% and transferring to unseen benchmarks without catastrophic forgetting. Overall, our study provides a systematic understanding of the modality gap and suggests a practical path toward improving visual text understanding in multimodal language models.
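
The self-distillation recipe described above amounts to a data-construction step: obtain the model's reasoning trace from the text version of a question, then use that trace as the training target for the image version of the same question. A minimal sketch of that pairing follows. The callables `generate_from_text` and `render_to_image`, the record fields, and the stub implementations are all hypothetical placeholders, and details such as trace filtering or the training loss are not specified in the abstract and are omitted.

```python
def build_distillation_pairs(examples, generate_from_text, render_to_image):
    """Pair each question's image rendering with the text-mode reasoning trace.

    examples: list of dicts with "id" and "question" keys (assumed schema).
    generate_from_text: callable returning the model's trace for a text prompt.
    render_to_image: callable returning an image (here, a path) for a question.
    """
    pairs = []
    for ex in examples:
        trace = generate_from_text(ex["question"])  # teacher signal, text mode
        image = render_to_image(ex["id"], ex["question"])  # student input
        pairs.append({"id": ex["id"], "image": image, "target": trace})
    return pairs


# Stubs standing in for a real model and a real text renderer.
def fake_text_model(question):
    return f"Step-by-step reasoning for: {question}"


def fake_renderer(example_id, question):
    return f"rendered/{example_id}.png"


examples = [
    {"id": "q1", "question": "What is 2 + 2?"},
    {"id": "q2", "question": "What is 3 * 5?"},
]
pairs = build_distillation_pairs(examples, fake_text_model, fake_renderer)
```

Because the same model acts as teacher (text mode) and student (image mode), no external labels or stronger model are needed in this setup.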

cs.AI

[12] OOD-MMSafe: Advancing MLLM Safety from Harmful Intent to Hidden Consequences

Ming Wen, Kun Yang, Jingyu Zhang, Yuxuan Liu, Shiwen Cui, Shouling Ji, Xingjun Ma

🧩 TL;DR

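This paper proposes a consequence-driven paradigm for safety alignment of multimodal large language models, and develops the OOD-MMSafe benchmark and the CASPO optimization framework, significantly improving models' ability to identify latent risks within causal chains.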


📘 Detailed Summary

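Motivation: Current safety alignment for multimodal large language models mainly targets malicious intent or situational violations and lacks consequence-driven safety considerations. This limits the robust deployment of autonomous and embodied agents, so the safety frontier needs to shift toward identifying latent risks within causal chains.

Method: The study first builds OOD-MMSafe, a benchmark of 455 query-image pairs for evaluating a model's ability to identify latent hazards within context-dependent causal chains. It then develops the Consequence-Aware Safety Policy Optimization (CASPO) framework, which uses the model's intrinsic reasoning as a dynamic reference for a token-level self-distillation reward.

Result: The analysis reveals pervasive causal blindness among frontier models, with failure rates as high as 67.5% in high-capacity closed-source models. It also identifies a preference ceiling, where static alignment yields format-centric failures rather than improved safety reasoning. CASPO significantly improves consequence projection, reducing the risk identification failure rate to 7.3% for Qwen2.5-VL-7B and 5.7% for Qwen3-VL-4B.

Conclusion: The study underscores the importance of a consequence-driven safety paradigm for deploying multimodal large language models and exposes the limitations of current safety alignment methods. CASPO offers an effective way to improve safety reasoning over complex causal chains and lays a foundation for the safe deployment of autonomous agents.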


📄 Abstract

While safety alignment for Multimodal Large Language Models (MLLMs) has gained significant attention, current paradigms primarily target malicious intent or situational violations. We propose shifting the safety frontier toward consequence-driven safety, a paradigm essential for the robust deployment of autonomous and embodied agents. To formalize this shift, we introduce OOD-MMSafe, a benchmark comprising 455 curated query-image pairs designed to evaluate a model's ability to identify latent hazards within context-dependent causal chains. Our analysis reveals a pervasive causal blindness among frontier models, with the highest 67.5% failure rate in high-capacity closed-source models, and identifies a preference ceiling where static alignment yields format-centric failures rather than improved safety reasoning as model capacity grows. To address these bottlenecks, we develop the Consequence-Aware Safety Policy Optimization (CASPO) framework, which integrates the model's intrinsic reasoning as a dynamic reference for token-level self-distillation rewards. Experimental results demonstrate that CASPO significantly enhances consequence projection, reducing the failure ratio of risk identification to 7.3% for Qwen2.5-VL-7B and 5.7% for Qwen3-VL-4B while maintaining overall effectiveness.

[13] PathMem: Toward Cognition-Aligned Memory Transformation for Pathology MLLMs

Jinyue Li, Yuci Liang, Qiankun Li, Xinheng Lyu, Jiayu Qian, Huabao Chen, Kun Wang, Zhigang Zeng, Anil Anthony Bharath, Yang Liu

🧩 TL;DR

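This paper proposes PathMem, a memory-centric framework for pathology multimodal large language models. A hierarchical memory mechanism integrates structured pathology knowledge and significantly improves the performance and consistency of pathological diagnostic reasoning.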


📘 Detailed Summary

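Motivation: Computational pathology requires both visual pattern recognition and dynamic integration of structured domain knowledge. Existing multimodal large language models lack explicit mechanisms for structured knowledge integration and interpretable memory control, so they struggle to consistently incorporate pathology-specific diagnostic standards during reasoning.

Method: Inspired by the hierarchical memory process of human pathologists, PathMem organizes structured pathology knowledge as a long-term memory and introduces a Memory Transformer that models the dynamic transition from long-term memory to working memory. Multimodal memory activation and context-aware knowledge grounding enable context-aware memory refinement.

Result: PathMem achieves state-of-the-art performance across multiple benchmarks. On WSI-Bench report generation it improves WSI-Precision by 12.8% and WSI-Relevance by 10.1%, and on open-ended diagnosis it improves over prior WSI-based models by 9.7% and 8.9%, respectively.

Conclusion: The study demonstrates the effectiveness of a memory-centric framework for integrating structured domain knowledge and provides an interpretable knowledge integration mechanism for computational pathology. It offers a new paradigm for designing medical multimodal large language models and highlights the importance of hierarchical memory modeling for reasoning in specialized domains.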


📄 Abstract

Computational pathology demands both visual pattern recognition and dynamic integration of structured domain knowledge, including taxonomy, grading criteria, and clinical evidence. In practice, diagnostic reasoning requires linking morphological evidence with formal diagnostic and grading criteria. Although multimodal large language models (MLLMs) demonstrate strong vision language reasoning capabilities, they lack explicit mechanisms for structured knowledge integration and interpretable memory control. As a result, existing models struggle to consistently incorporate pathology-specific diagnostic standards during reasoning. Inspired by the hierarchical memory process of human pathologists, we propose PathMem, a memory-centric multimodal framework for pathology MLLMs. PathMem organizes structured pathology knowledge as a long-term memory (LTM) and introduces a Memory Transformer that models the dynamic transition from LTM to working memory (WM) through multimodal memory activation and context-aware knowledge grounding, enabling context-aware memory refinement for downstream reasoning. PathMem achieves SOTA performance across benchmarks, improving WSI-Bench report generation (12.8% WSI-Precision, 10.1% WSI-Relevance) and open-ended diagnosis by 9.7% and 8.9% over prior WSI-based models.