cs.CV

[1] V-Agent: An Interactive Video Search System Using Vision-Language Models

SunYoung Park, Jong-Hyeon Lee, Youngjune Kim, Daegyu Sung, Younghyun Yu, Young-rok Cha, Jeongho Ju

🧩 TL;DR

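This paper presents V-Agent, a multi-agent platform for advanced video search and interactive dialogue. By fine-tuning a vision-language model and integrating a retrieval vector, it achieves state-of-the-art zero-shot performance on the MultiVENT 2.0 benchmark.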


📘 Detailed Summary

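Motivation: Traditional text-based retrieval systems are limited in multimodal scenarios because they cannot effectively handle the visual and spoken content of videos. This motivates a video search system that understands both visual and spoken information.

Method: The authors fine-tune a vision-language model (VLM) on a small video preference dataset and enhance it with a retrieval vector from an image-text retrieval model. The resulting VLM-based retrieval model embeds video frames and audio transcriptions from an ASR module into a shared multimodal representation space. The system comprises three cooperating agents (routing, search, and chat), and the search agent uses an additional re-ranking module to further improve retrieval quality.

Result: The proposed framework demonstrates state-of-the-art zero-shot performance on the MultiVENT 2.0 benchmark, showing its effectiveness for video retrieval and its potential for both academic research and real-world applications.

Conclusion: Through multi-agent collaboration and vision-language modeling, V-Agent addresses the limitations of traditional text-based retrieval in multimodal video search, shows promise on complex video understanding tasks, and offers a reference point for future multimodal interactive systems.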


📄 Abstract

We introduce V-Agent, a novel multi-agent platform designed for advanced video search and interactive user-system conversations. By fine-tuning a vision-language model (VLM) with a small video preference dataset and enhancing it with a retrieval vector from an image-text retrieval model, we overcome the limitations of traditional text-based retrieval systems in multimodal scenarios. The VLM-based retrieval model independently embeds video frames and audio transcriptions from an automatic speech recognition (ASR) module into a shared multimodal representation space, enabling V-Agent to interpret both visual and spoken content for context-aware video search. The system consists of three agents (a routing agent, a search agent, and a chat agent) that work collaboratively to address user intents by refining search outputs and communicating with users. The search agent utilizes the VLM-based retrieval model together with an additional re-ranking module to further enhance video retrieval quality. Our proposed framework demonstrates state-of-the-art zero-shot performance on the MultiVENT 2.0 benchmark, highlighting its potential for both academic research and real-world applications.
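
As a rough illustration of retrieval over a shared embedding space (not the authors' implementation), the sketch below scores each video by mixing the query's cosine similarity to a frame embedding and to a transcript embedding. The abstract does not say how the two similarities are combined, so the weighted sum, the `w_visual` parameter, and the toy three-dimensional vectors are all assumptions; the re-ranking module is omitted.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def search(query_vec, videos, top_k=2, w_visual=0.5):
    """Rank videos by a weighted mix of frame and transcript similarity.

    videos maps a video id to (frame_embedding, transcript_embedding).
    The weighted sum is an assumed fusion rule, chosen for illustration.
    """
    scored = []
    for vid, (frame_vec, asr_vec) in videos.items():
        score = (w_visual * cosine(query_vec, frame_vec)
                 + (1 - w_visual) * cosine(query_vec, asr_vec))
        scored.append((score, vid))
    scored.sort(reverse=True)
    return [vid for _, vid in scored[:top_k]]

# Toy embeddings standing in for the output of a real encoder.
videos = {
    "cooking": ([1.0, 0.0, 0.0], [0.9, 0.1, 0.0]),
    "soccer": ([0.0, 1.0, 0.0], [0.0, 0.8, 0.2]),
    "news": ([0.0, 0.0, 1.0], [0.1, 0.0, 0.9]),
}
top = search([1.0, 0.1, 0.0], videos, top_k=1)
print(top)  # -> ['cooking']
```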

[2] Lights, Camera, Consistency: A Multistage Pipeline for Character-Stable AI Video Stories

Chayan Jain, Rishant Sharma, Archit Garg, Ishan Bhanuka, Pratik Narang, Dhruv Kumar

🧩 TL;DR

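This paper proposes a filmmaker-like, multistage approach to video generation: a large language model writes a detailed production script, a text-to-image model creates consistent visual anchors for each character, and these anchors guide a video generation model as it synthesizes each scene. The approach markedly improves character consistency in long video stories.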


📘 Detailed Summary

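Motivation: Current text-to-video AI struggles with character consistency when generating long, coherent video stories. In narratives that require continuity of character identity, existing one-step generation methods have difficulty maintaining visual consistency.

Method: The method follows a filmmaker-like multistage pipeline. A large language model first generates a detailed production script. A text-to-image model then creates a consistent visual anchor for each character. Finally, these anchors guide a video generation model that synthesizes the video scene by scene, keeping character identity consistent.

Result: Baseline comparisons confirm that the multistage decomposition is necessary: removing the visual anchoring mechanism drops the character consistency score from 7.99 to 0.55, confirming that visual priors are essential for identity preservation. The analysis also reveals cultural biases in current models, with differences in subject consistency and dynamic degree between Indian-themed and Western-themed generations.

Conclusion: The study shows that structured decomposition improves character consistency in video generation and that visual anchoring is key to maintaining identity continuity. It also exposes cultural bias in existing generative models, offering insights for more inclusive and consistent video generation systems.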


📄 Abstract

Generating long, cohesive video stories with consistent characters is a significant challenge for current text-to-video AI. We introduce a method that approaches video generation in a filmmaker-like manner. Instead of creating a video in one step, our proposed pipeline first uses a large language model to generate a detailed production script. This script guides a text-to-image model in creating consistent visuals for each character, which then serve as anchors for a video generation model to synthesize each scene individually. Our baseline comparisons validate the necessity of this multi-stage decomposition; specifically, we observe that removing the visual anchoring mechanism results in a catastrophic drop in character consistency scores (from 7.99 to 0.55), confirming that visual priors are essential for identity preservation. Furthermore, we analyze cultural disparities in current models, revealing distinct biases in subject consistency and dynamic degree between Indian vs Western-themed generations.

[3] RadImageNet-VQA: A Large-Scale CT and MRI Dataset for Radiologic Visual Question Answering

Léo Butsanets, Charles Corbière, Julien Khlaut, Pierre Manceron, Corentin Dancette

🧩 TL;DR

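This paper introduces RadImageNet-VQA, a large-scale radiologic visual question answering dataset for CT and MRI exams. It contains 750K images and 7.5M question-answer samples and addresses the small scale, narrow modality coverage, and text-based shortcuts of existing medical VQA datasets.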


📘 Detailed Summary

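Motivation: Existing medical VQA datasets are limited in scale, rely mainly on X-ray images or biomedical illustrations, and are prone to text-based shortcuts. This holds back radiologic VQA on modalities such as CT and MRI and calls for a large-scale, radiology-specific dataset free of linguistic shortcuts.

Method: The authors build RadImageNet-VQA from expert-curated annotations, pairing 750K CT and MRI images with 7.5M question-answer samples. The dataset covers three tasks (abnormality detection, anatomy recognition, and pathology identification), supports open-ended, closed-ended, and multiple-choice questions, and spans eight anatomical regions and 97 pathology categories.

Result: Experiments show that state-of-the-art vision-language models still perform poorly on fine-grained pathology identification, even after fine-tuning and especially in open-ended settings. A text-only analysis shows that performance falls to near-random without image inputs, confirming that the dataset is free of linguistic shortcuts.

Conclusion: RadImageNet-VQA fills a data gap in radiologic VQA, exposes the limits of current models in fine-grained medical image understanding, and provides a benchmark for developing stronger medical vision-language models. The dataset is publicly available to support further research.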


📄 Abstract

In this work, we introduce RadImageNet-VQA, a large-scale dataset designed to advance radiologic visual question answering (VQA) on CT and MRI exams. Existing medical VQA datasets are limited in scale, dominated by X-ray imaging or biomedical illustrations, and often prone to text-based shortcuts. RadImageNet-VQA is built from expert-curated annotations and provides 750K images paired with 7.5M question-answer samples. It covers three key tasks - abnormality detection, anatomy recognition, and pathology identification - spanning eight anatomical regions and 97 pathology categories, and supports open-ended, closed-ended, and multiple-choice questions. Extensive experiments show that state-of-the-art vision-language models still struggle with fine-grained pathology identification, particularly in open-ended settings and even after fine-tuning. Text-only analysis further reveals that model performance collapses to near-random without image inputs, confirming that RadImageNet-VQA is free from linguistic shortcuts. The full dataset and benchmark are publicly available at https://huggingface.co/datasets/raidium/RadImageNet-VQA.

[4] InfoTok: Adaptive Discrete Video Tokenizer via Information-Theoretic Compression

Haotian Ye, Qiyuan He, Jiaqi Han, Puheng Li, Jiaojiao Fan, Zekun Hao, Fitsum Reda, Yogesh Balaji, Huayu Chen, Sheng Liu, Angela Yao, James Zou, Stefano Ermon, Haoxiang Wang, Ming-Yu Liu

🧩 TL;DR

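This paper proposes InfoTok, an adaptive video tokenization framework grounded in information theory. It proves that existing data-agnostic training methods are suboptimal in representation length and develops an evidence lower bound (ELBO)-based algorithm that achieves near-optimal adaptive compression.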


📘 Detailed Summary

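Motivation: Current video tokenizers face a significant bottleneck on long video sequences. Video content is inherently complex and varies in information density, yet existing tokenizers compress everything at a fixed rate, causing redundancy or information loss. A more effective adaptive compression method is needed.

Method: Drawing on Shannon's information theory, the authors propose InfoTok, a framework for adaptive video tokenization. They prove that existing data-agnostic training methods are suboptimal in representation length, develop a new ELBO-based algorithm, and pair it with a transformer-based adaptive compressor to perform adaptive tokenization.

Result: InfoTok achieves state-of-the-art compression performance: it saves 20% of tokens without affecting performance and reaches 2.3x compression rates while still outperforming prior heuristic adaptive approaches.

Conclusion: By allocating tokens according to informational richness, InfoTok yields a more compressed yet accurate tokenization for video representation, offers useful insights for future research, and illustrates the practical potential of information-theoretic principles in video processing.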


📄 Abstract

Accurate and efficient discrete video tokenization is essential for processing long video sequences. Yet, the inherent complexity and variable information density of videos present a significant bottleneck for current tokenizers, which rigidly compress all content at a fixed rate, leading to redundancy or information loss. Drawing inspiration from Shannon's information theory, this paper introduces InfoTok, a principled framework for adaptive video tokenization. We rigorously prove that existing data-agnostic training methods are suboptimal in representation length, and present a novel evidence lower bound (ELBO)-based algorithm that approaches theoretical optimality. Leveraging this framework, we develop a transformer-based adaptive compressor that enables adaptive tokenization. Empirical results demonstrate state-of-the-art compression performance, saving 20% of tokens without affecting performance, and achieving 2.3x compression rates while still outperforming prior heuristic adaptive approaches. By allocating tokens according to informational richness, InfoTok enables a more compressed yet accurate tokenization for video representation, offering valuable insights for future research.
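
To make "allocating tokens according to informational richness" concrete, here is a deliberately simple sketch. It is a hand-written heuristic, not InfoTok's ELBO-based algorithm: each clip is a string of symbols standing in for quantized content, its Shannon entropy is estimated from symbol frequencies, and a fixed token budget is split in proportion to that entropy. The function names, `min_tokens`, and the toy clips are all assumptions.

```python
import math
from collections import Counter

def entropy_bits(symbols):
    """Empirical Shannon entropy (bits per symbol) of a sequence."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def allocate_tokens(clips, budget, min_tokens=1):
    """Split a token budget across clips in proportion to their entropy.

    Every clip gets min_tokens; the remainder is shared by entropy.
    Flooring means the total can be slightly below the budget.
    """
    ent = [entropy_bits(c) for c in clips]
    total = sum(ent)
    if total == 0:
        return [budget // len(clips)] * len(clips)
    spare = budget - min_tokens * len(clips)
    return [min_tokens + int(spare * e / total) for e in ent]

# A static clip (one repeated symbol) versus a busy clip (all distinct).
clips = ["aaaaaaaa", "abcdefgh"]
allocation = allocate_tokens(clips, budget=10)
print(allocation)  # -> [1, 9]
```

A fixed-rate tokenizer would give both clips five tokens; the entropy-weighted split spends almost the whole budget on the clip that actually carries information.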

[5] A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos

Mohammed Irfan Kurpath, Jaseel Muhammad Kaithakkodan, Jinxing Zhou, Sahal Shaji Mullappilly, Mohammad Almansoori, Noor Ahsan, Beknur Kalmakhanbet, Sambal Shikhar, Rishabh Lalla, Jean Lahoud, Mariette Awad, Fahad Shahbaz Khan, Salman Khan, Rao Muhammad Anwer, Hisham Cholakkal

🧩 TL;DR

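This paper introduces LongShOTBench, a diagnostic benchmark for long-form multimodal video understanding with open-ended, intent-driven questions, multi-turn dialogues, and multimodal reasoning tasks, together with LongShOTAgent, a system for analyzing long videos.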


📘 Detailed Summary

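Motivation: Existing benchmarks emphasize either temporal length or multimodal richness but rarely both, and most rely on a single accuracy score that obscures failure modes. Long-form multimodal video understanding requires integrating vision, speech, and ambient audio with coherent long-range reasoning, and a diagnostic benchmark that evaluates these abilities together has been missing.

Method: LongShOTBench comprises open-ended, intent-driven questions, single- and multi-turn dialogues, and tasks requiring multimodal reasoning and tool use across video, audio, and speech. Each item comes with a reference answer and a graded rubric, and a scalable, human-validated pipeline ensures coverage and reproducibility. The authors also present LongShOTAgent, which analyzes long videos through preprocessing, search, and iterative refinement.

Result: On LongShOTBench, state-of-the-art multimodal large language models (MLLMs) show large gaps: Gemini-2.5-Flash reaches 52.95%, open-source models remain below 30%, and LongShOTAgent attains 44.66%. These results underscore the difficulty of real-world long-form video understanding, and the benchmark provides a practical, reproducible foundation for evaluating and improving MLLMs.

Conclusion: LongShOTBench provides a comprehensive diagnostic evaluation framework for long-form multimodal video understanding and exposes the limits of current models on real-world long videos. Its human-verified samples, reference answers, and graded rubrics make evaluation interpretable and traceable and point to concrete directions for model improvement. The results indicate that long-video understanding remains a major challenge for multimodal AI and calls for stronger reasoning architectures and evaluation methods.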


📄 Abstract

Long-form multimodal video understanding requires integrating vision, speech, and ambient audio with coherent long-range reasoning. Existing benchmarks emphasize either temporal length or multimodal richness, but rarely both. While some incorporate open-ended questions and advanced metrics, they mostly rely on single-score accuracy, obscuring failure modes. We introduce LongShOTBench, a diagnostic benchmark with open-ended, intent-driven questions; single- and multi-turn dialogues; and tasks requiring multimodal reasoning and agentic tool use across video, audio, and speech. Each item includes a reference answer and graded rubric for interpretable and traceable evaluation. LongShOTBench is produced via a scalable, human-validated pipeline to ensure coverage and reproducibility. All samples in LongShOTBench are human-verified and corrected. Furthermore, we present LongShOTAgent, an agentic system that analyzes long videos via preprocessing, search, and iterative refinement. On LongShOTBench, state-of-the-art MLLMs show large gaps: Gemini-2.5-Flash achieves 52.95%, open-source models remain below 30%, and LongShOTAgent attains 44.66%. These results underscore the difficulty of real-world long-form video understanding. LongShOTBench provides a practical, reproducible foundation for evaluating and improving MLLMs. All resources are available on GitHub: https://github.com/mbzuai-oryx/longshot.
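
One way to picture a graded rubric, as opposed to a single right/wrong score, is a weighted checklist. The sketch below is hypothetical: the criterion names, the weights, and the 0 to 100 scale are assumptions and do not reproduce the benchmark's actual rubric format.

```python
def rubric_score(criteria, judged):
    """Weighted rubric score on a 0-100 scale.

    criteria: {criterion name: weight}
    judged:   {criterion name: fraction satisfied, clamped to [0, 1]}
    Criteria missing from `judged` count as unmet.
    """
    total = sum(criteria.values())
    earned = sum(
        weight * min(max(judged.get(name, 0.0), 0.0), 1.0)
        for name, weight in criteria.items()
    )
    return 100.0 * earned / total

# Hypothetical rubric for one open-ended question.
criteria = {"identifies speaker": 2, "cites timestamp": 1, "explains cause": 1}
judged = {"identifies speaker": 1.0, "explains cause": 0.5}
score = rubric_score(criteria, judged)
print(score)  # -> 62.5
```

Because each criterion is scored separately, a low total can be traced back to the specific criteria that were missed, which is the kind of failure-mode visibility a single accuracy number hides.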

[6] A Benchmark for Ultra-High-Resolution Remote Sensing MLLMs

Yunkai Dang, Meiyi Zhu, Donghao Wang, Yizhuo Zhang, Jiacheng Yang, Qi Fan, Yuekun Yang, Wenbin Li, Feng Miao, Yang Gao

🧩 TL;DR

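This paper introduces RSHR-Bench, a super-high-resolution benchmark for remote sensing visual understanding and reasoning. It addresses the unreliable assessment of visual understanding in existing benchmarks, which stems from low-resolution imagery and flawed task design.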


📘 Detailed Summary

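Motivation: Most existing remote sensing MLLM benchmarks rely on low-resolution imagery, and some high-resolution benchmarks have flawed reasoning-task designs. As a result, text-only LLMs can perform competitively on remote sensing reasoning tasks without access to images, revealing a critical mismatch between current benchmarks and the intended evaluation of visual understanding.

Method: The authors build RSHR-Bench from 5,329 full-scene images with a long side of at least 4,000 pixels and up to about 3 x 10^8 pixels per image, sourced from widely used remote sensing corpora and UAV collections. They design four task families (multiple-choice VQA, open-ended VQA, image captioning, and single-image evaluation) covering nine perception categories and four reasoning types, with support for multi-turn and multi-image dialog. To reduce reliance on language priors, they apply adversarial filtering with strong LLMs followed by rigorous human verification.

Result: The benchmark contains 3,864 VQA tasks, 3,913 image captioning tasks, and 500 fully human-written or verified single-image evaluation VQA pairs. Evaluations of open-source, closed-source, and remote-sensing-specific VLMs reveal persistent performance gaps in super-high-resolution scenarios, confirming the limits of existing models in handling fine visual detail.

Conclusion: RSHR-Bench provides a more reliable evaluation framework for visual understanding in remote sensing and exposes the practical limits of current multimodal models in super-high-resolution scenarios. It underlines the importance of designing visual understanding benchmarks that are free of language-prior bias and offers an evaluation tool and direction for future remote sensing vision-language models.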


📄 Abstract

Multimodal large language models (MLLMs) demonstrate strong perception and reasoning performance on existing remote sensing (RS) benchmarks. However, most prior benchmarks rely on low-resolution imagery, and some high-resolution benchmarks suffer from flawed reasoning-task designs. We show that text-only LLMs can perform competitively with multimodal vision-language models on RS reasoning tasks without access to images, revealing a critical mismatch between current benchmarks and the intended evaluation of visual understanding. To enable faithful assessment, we introduce RSHR-Bench, a super-high-resolution benchmark for RS visual understanding and reasoning. RSHR-Bench contains 5,329 full-scene images with a long side of at least 4,000 pixels, with up to about 3 x 10^8 pixels per image, sourced from widely used RS corpora and UAV collections. We design four task families: multiple-choice VQA, open-ended VQA, image captioning, and single-image evaluation. These tasks cover nine perception categories and four reasoning types, supporting multi-turn and multi-image dialog. To reduce reliance on language priors, we apply adversarial filtering with strong LLMs followed by rigorous human verification. Overall, we construct 3,864 VQA tasks, 3,913 image captioning tasks, and 500 fully human-written or verified single-image evaluation VQA pairs. Evaluations across open-source, closed-source, and RS-specific VLMs reveal persistent performance gaps in super-high-resolution scenarios. Code: https://github.com/Yunkaidang/RSHR
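
The adversarial filtering step can be sketched as follows: every candidate question is shown to one or more text-only models without the image, and questions they still answer correctly are discarded because language priors alone solve them. This is a minimal illustration, not the authors' pipeline; `prior_guesser` is a toy stand-in for a strong LLM, and the item format and `max_correct` threshold are assumptions.

```python
def adversarial_filter(items, text_only_models, max_correct=0):
    """Keep only items that at most `max_correct` text-only models solve."""
    kept = []
    for item in items:
        n_correct = sum(
            model(item["question"], item["choices"]) == item["answer"]
            for model in text_only_models
        )
        if n_correct <= max_correct:
            kept.append(item)
    return kept

def prior_guesser(question, choices):
    """Toy 'language prior': runways belong to airports; else pick the first choice."""
    if "runway" in question and "airport" in choices:
        return "airport"
    return choices[0]

items = [
    {"id": 1, "question": "What facility contains the runway?",
     "choices": ["harbor", "airport"], "answer": "airport"},
    {"id": 2, "question": "How many storage tanks are visible?",
     "choices": ["3", "7"], "answer": "7"},
]
kept = adversarial_filter(items, [prior_guesser])
print([item["id"] for item in kept])  # -> [2]
```

The first question is dropped because it can be answered without looking at the image; the counting question survives. In the benchmark, items that pass this filter are additionally checked by human annotators.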

[7] 4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation

Chiao-An Yang, Ryo Hachiuma, Sifei Liu, Subhashree Radhakrishnan, Raymond A. Yeh, Yu-Chiang Frank Wang, Min-Hung Chen

🧩 TL;DR

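This paper proposes 4D-RGPT, a multimodal large language model specialized for 4D perception. Together with the Perceptual 4D Distillation training framework and the R4D-Bench benchmark, it improves understanding of 3D structure and temporal dynamics in video question answering.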


📘 Detailed Summary

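Motivation: Current MLLMs have limited ability to reason over 3D structures and temporal dynamics, constrained by weak 4D perception and temporal understanding. Existing 3D and 4D video question answering benchmarks focus mainly on static scenes and lack region-level prompting, which hinders deeper understanding of dynamic 4D scenes.

Method: The work has three components. 4D-RGPT is a specialized MLLM designed to capture 4D representations from video inputs with enhanced temporal perception. Perceptual 4D Distillation (P4D) is a training framework that transfers 4D representations from a frozen expert model into 4D-RGPT for comprehensive 4D perception. R4D-Bench is a benchmark for depth-aware dynamic scenes with region-level prompting, built via a hybrid automated and human-verified pipeline.

Result: 4D-RGPT achieves notable improvements on both existing 4D VQA benchmarks and the proposed R4D-Bench, supporting the method's effectiveness for 4D perception and temporal understanding, particularly on dynamic scenes and region-level reasoning.

Conclusion: With a specialized model architecture, a distillation training framework, and a dedicated benchmark, the work offers a systematic approach to improving 4D perception in MLLMs. It highlights the importance of temporal dynamics and region-level reasoning in video understanding and lays groundwork for future research on dynamic scene analysis.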


📄 Abstract

Despite advances in Multimodal LLMs (MLLMs), their ability to reason over 3D structures and temporal dynamics remains limited, constrained by weak 4D perception and temporal understanding. Existing 3D and 4D Video Question Answering (VQA) benchmarks also emphasize static scenes and lack region-level prompting. We tackle these issues by introducing: (a) 4D-RGPT, a specialized MLLM designed to capture 4D representations from video inputs with enhanced temporal perception; (b) Perceptual 4D Distillation (P4D), a training framework that transfers 4D representations from a frozen expert model into 4D-RGPT for comprehensive 4D perception; and (c) R4D-Bench, a benchmark for depth-aware dynamic scenes with region-level prompting, built via a hybrid automated and human-verified pipeline. Our 4D-RGPT achieves notable improvements on both existing 4D VQA benchmarks and the proposed R4D-Bench benchmark.

[8] PhysFire-WM: A Physics-Informed World Model for Emulating Fire Spread Dynamics

Nan Zhou, Huandong Wang, Jiahao Li, Yang Li, Xiao-Ping Zhang, Yong Li, Xinlei Chen

🧩 TL;DR

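This paper proposes PhysFire-WM, a physics-informed world model for emulating fire spread dynamics. By integrating structured priors from a physical simulator with a cross-task collaborative training strategy, it improves the accuracy and physical consistency of fine-grained fire prediction.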


📘 Detailed Summary

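Motivation: Existing fire prediction methods are largely limited to binary mask modeling, whose sparse signal cannot capture the complex dynamics of fire. World models show promise in video generation, but their physical inconsistencies pose a major challenge for fire forecasting. A new approach is needed that integrates physical priors and overcomes the limited information in mask-based modeling.

Method: PhysFire-WM encodes structured priors from a physical simulator to correct physical inconsistencies and uses a Cross-task Collaborative Training strategy (CC-Train) to alleviate the limited information in mask-based modeling. Through parameter sharing and gradient coordination, CC-Train integrates thermal radiation dynamics with spatial boundary delineation, improving both physical realism and geometric accuracy.

Result: Extensive experiments on a fine-grained multimodal fire dataset show that PhysFire-WM predicts fire spread with superior accuracy. Validation underscores the importance of physical priors and cross-task collaboration, providing empirical support for applying physics-informed world models to disaster prediction.

Conclusion: The study demonstrates the effectiveness of physics-informed world models for fire prediction: integrating physical priors with cross-task collaboration improves both the physical consistency and the geometric accuracy of predictions. The work offers new insights into building physical constraints into world-model frameworks and opens new directions for disaster prediction.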


📄 Abstract

Fine-grained fire prediction plays a crucial role in emergency response. Infrared images and fire masks provide complementary thermal and boundary information, yet current methods are predominantly limited to binary mask modeling with inherent signal sparsity, failing to capture the complex dynamics of fire. While world models show promise in video generation, their physical inconsistencies pose significant challenges for fire forecasting. This paper introduces PhysFire-WM, a Physics-informed World Model for emulating Fire spread dynamics. Our approach internalizes combustion dynamics by encoding structured priors from a Physical Simulator to rectify physical discrepancies, coupled with a Cross-task Collaborative Training strategy (CC-Train) that alleviates the issue of limited information in mask-based modeling. Through parameter sharing and gradient coordination, CC-Train effectively integrates thermal radiation dynamics and spatial boundary delineation, enhancing both physical realism and geometric accuracy. Extensive experiments on a fine-grained multimodal fire dataset demonstrate the superior accuracy of PhysFire-WM in fire spread prediction. Validation underscores the importance of physical priors and cross-task collaboration, providing new insights for applying physics-informed world models to disaster prediction.

[9] Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding

Jiaqi Tang, Jianmin Chen, Wei Wei, Xiaogang Xu, Runtao Liu, Xiangyu Wu, Qipeng Xie, Jiafei Wu, Lei Zhang, Qifeng Chen

🧩 TL;DR

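This paper proposes Robust-R1, a framework that strengthens the robustness of multimodal large language models by explicitly modeling visual degradations through structured reasoning chains. It achieves state-of-the-art performance on the real-world degradation benchmark R-Bench and shows strong anti-degradation performance across several benchmarks.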


📘 Detailed Summary

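Motivation: Multimodal large language models perform unreliably under extreme real-world visual degradations. Existing robustness methods rely mainly on implicit training or adaptation that focuses solely on visual encoder generalization, and they suffer from limited interpretability and isolated optimization. This work aims to overcome these limitations by explicitly modeling visual degradations through structured reasoning chains.

Method: Robust-R1 has three core components: supervised fine-tuning to lay the foundation for degradation-aware reasoning, reward-driven alignment to perceive degradation parameters accurately, and dynamic reasoning depth scaling that adapts to degradation intensity. To support the approach, the authors build a specialized 11K-sample dataset with degradations synthesized across four critical real-world visual processing stages. Each sample is annotated with a structured chain connecting degradation parameters, perceptual influence, the pristine semantic reasoning chain, and the conclusion.

Result: Comprehensive evaluations show that Robust-R1 outperforms all general and robust baselines on the real-world degradation benchmark R-Bench, achieving state-of-the-art robustness. On MMMB, MMStar, and RealWorldQA, it maintains superior anti-degradation performance under multi-intensity adversarial degradations.

Conclusion: The study shows that explicitly modeling visual degradations through structured reasoning chains is effective and opens a new direction for robustness research on multimodal large language models. The framework's modular design allows targeted optimization for different degradation types while preserving interpretability and adaptability, offering a systematic approach to visual degradation in practical applications.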


📄 Abstract

Multimodal Large Language Models struggle to maintain reliable performance under extreme real-world visual degradations, which impede their practical robustness. Existing robust MLLMs predominantly rely on implicit training/adaptation that focuses solely on visual encoder generalization, suffering from limited interpretability and isolated optimization. To overcome these limitations, we propose Robust-R1, a novel framework that explicitly models visual degradations through structured reasoning chains. Our approach integrates: (i) supervised fine-tuning for degradation-aware reasoning foundations, (ii) reward-driven alignment for accurately perceiving degradation parameters, and (iii) dynamic reasoning depth scaling adapted to degradation intensity. To facilitate this approach, we introduce a specialized 11K dataset featuring realistic degradations synthesized across four critical real-world visual processing stages, each annotated with structured chains connecting degradation parameters, perceptual influence, pristine semantic reasoning chain, and conclusion. Comprehensive evaluations demonstrate state-of-the-art robustness: Robust-R1 outperforms all general and robust baselines on the real-world degradation benchmark R-Bench, while maintaining superior anti-degradation performance under multi-intensity adversarial degradations on MMMB, MMStar, and RealWorldQA.

[10] Can Synthetic Images Serve as Effective and Efficient Class Prototypes?

Dianxing Shi, Dingjie Fu, Yuqiao Liu, Jun Wang

🧩 TL;DR

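This paper proposes the LGCLIP framework, which uses a large language model to generate class-specific prompts that guide a diffusion model in synthesizing reference images. It performs zero-shot image classification from class labels alone, without manually annotated image-text pairs, which lowers data preparation cost and keeps the model lightweight.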


📘 Detailed Summary

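Motivation: Existing vision-language models such as CLIP rely on manually annotated image-text pairs to align the visual and textual modalities. Preparing such data is costly and demands high quality, and the dual-tower encoder architecture hinders lightweight deployment. A more efficient zero-shot classification method that needs no annotated pairs is therefore desirable.

Method: LGCLIP uses a large language model to generate class-specific prompts that guide a diffusion model in synthesizing reference images, which serve as visual prototypes. The visual features of real images are then extracted and compared with the prototype features to make a prediction. Because it uses only a visual encoder and optimizes prompt generation through the LLM, the framework stays lightweight and efficient.

Result: Experiments show that LGCLIP performs well on zero-shot classification tasks, validating the feasibility and efficiency of the framework. It achieves competitive classification performance with class labels as the only input, without any manually annotated image-text pairs or extra pre-processing.

Conclusion: LGCLIP establishes a novel paradigm for image classification that replaces the traditional reliance on annotated pairs with a generative approach. It lowers data preparation cost while keeping the model lightweight, offering a more efficient and cheaper option for zero-shot vision tasks.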


📄 Abstract

Vision-Language Models (VLMs) have shown strong performance in zero-shot image classification tasks. However, existing methods, including Contrastive Language-Image Pre-training (CLIP), all rely on annotated text-to-image pairs for aligning visual and textual modalities. This dependency imposes substantial cost and accuracy requirements on the preparation of high-quality datasets. In addition, processing data from two modalities requires dual-tower encoders in most models, which hinders lightweight deployment. To address these limitations, we introduce a "Contrastive Language-Image Pre-training via Large-Language-Model-based Generation (LGCLIP)" framework. LGCLIP leverages a Large Language Model (LLM) to generate class-specific prompts that guide a diffusion model in synthesizing reference images. These generated images then serve as visual prototypes, and the visual features of real images are extracted and compared with the visual features of these prototypes to make a comparative prediction. By optimizing prompt generation through the LLM and employing only a visual encoder, LGCLIP remains lightweight and efficient. Crucially, our framework requires only class labels as input during the whole experimental procedure, eliminating the need for manually annotated image-text pairs and extra pre-processing. Experimental results validate the feasibility and efficiency of LGCLIP, demonstrating strong performance in zero-shot classification tasks and establishing a novel paradigm for classification.
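
The prototype comparison at the end of the pipeline can be sketched with plain vectors. This is only an illustration under assumptions: the two-dimensional features stand in for the output of a visual encoder, and averaging cosine similarity over a class's prototypes is one plausible comparison rule that the abstract does not specify. The LLM prompt generation and diffusion synthesis steps are not shown.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def classify(image_feat, prototypes):
    """Return the class whose synthetic prototypes are most similar on average.

    prototypes maps a class label to a list of feature vectors, one per
    generated reference image.
    """
    def class_score(vectors):
        return sum(cosine(image_feat, v) for v in vectors) / len(vectors)
    return max(prototypes, key=lambda label: class_score(prototypes[label]))

# Toy features standing in for visual-encoder outputs.
prototypes = {
    "cat": [[1.0, 0.0], [0.9, 0.1]],
    "dog": [[0.0, 1.0], [0.2, 0.8]],
}
label = classify([0.8, 0.2], prototypes)
print(label)  # -> cat
```

Only image features are compared here, which is why no text encoder is needed at inference time.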

[11] Animate Any Character in Any World

Yitong Wang, Fangyun Wei, Hongyang Zhang, Bo Dai, Yan Lu

🧩 TL;DR

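This paper presents AniX, a video generation system that combines a static 3D Gaussian Splatting (3DGS) scene with a user-specified character and lets natural-language instructions direct the character to perform open-ended actions in the environment, from basic locomotion to object interaction.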


📘 Detailed Summary

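Motivation: Existing world models fall mainly into two categories: static world generation models, which build 3D environments without active agents, and controllable-entity models, which let a single entity perform limited actions in an otherwise uncontrollable environment. This work aims to combine the realism and structural grounding of static world generation with an extended controllable-entity model in which user-specified characters perform open-ended actions, addressing the limits of existing methods for interactive environment simulation.

Method: AniX formulates the problem as conditional autoregressive video generation and builds on a pre-trained video generator. The system takes a 3DGS scene and a character as input, and natural-language instructions direct the character through diverse behaviors, from basic locomotion to object-centric interactions. The training strategy significantly enhances motion dynamics while maintaining generalization across actions and characters, so that the generated clips are temporally coherent and visually faithful to the provided scene and character.

Result: The evaluation covers a broad range of aspects, including visual quality, character consistency, action controllability, and long-horizon coherence. Experiments indicate that AniX synthesizes temporally coherent video clips and performs well in visual fidelity, character consistency, and action control, while letting the character explore the environment freely and carry out diverse interactions.

Conclusion: AniX combines the realism of static world generation with the interactivity of controllable-entity models to generate videos of user-specified characters performing open-ended actions in 3D scenes. The work offers a new paradigm for interactive environment simulation, supports behavior control ranging from basic locomotion to object interaction, and lays groundwork for more immersive and controllable virtual environments.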


📄 Abstract

Recent advances in world models have greatly enhanced interactive environment simulation. Existing methods mainly fall into two categories: (1) static world generation models, which construct 3D environments without active agents, and (2) controllable-entity models, which allow a single entity to perform limited actions in an otherwise uncontrollable environment. In this work, we introduce AniX, leveraging the realism and structural grounding of static world generation while extending controllable-entity models to support user-specified characters capable of performing open-ended actions. Users can provide a 3DGS scene and a character, then direct the character through natural language to perform diverse behaviors from basic locomotion to object-centric interactions while freely exploring the environment. AniX synthesizes temporally coherent video clips that preserve visual fidelity with the provided scene and character, formulated as a conditional autoregressive video generation problem. Built upon a pre-trained video generator, our training strategy significantly enhances motion dynamics while maintaining generalization across actions and characters. Our evaluation covers a broad range of aspects, including visual quality, character consistency, action controllability, and long-horizon coherence.

[12] ABE-CLIP: Training-Free Attribute Binding Enhancement for Compositional Image-Text Matching

Qi Zhang, Yuxu Chen, Lei Deng, Lili Shen

🧩 TL;DR

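This paper proposes ABE-CLIP, a training-free attribute binding enhancement method. Using a semantic refinement mechanism and a local token-patch alignment strategy, it significantly improves attribute-object binding in CLIP-like models for compositional image-text matching.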


📘 Detailed Summary

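Motivation: CLIP struggles with compositional image-text matching, particularly in associating objects with their corresponding attributes, because its inherent global representation often overlooks the fine-grained semantics needed for attribute binding. Existing methods usually require additional training or extensive hard negative sampling, yet they generalize poorly to novel compositional concepts and do not fundamentally address the drawbacks of global representations.

Method: ABE-CLIP uses a Semantic Refinement Mechanism to refine the token embeddings of object and attribute phrases in the text, which mitigates attribute confusion and improves semantic precision. It further introduces a Local Token-Patch Alignment strategy that computes similarity scores between the refined text tokens and their most relevant image patches, then aggregates these localized scores into the final image-text similarity.

Result: Experiments on multiple datasets show that ABE-CLIP significantly improves attribute-object binding performance, even surpassing methods that require extensive training. These gains on compositional image-text matching support the effectiveness of the training-free approach.

Conclusion: Through semantic refinement and local alignment, ABE-CLIP addresses attribute binding in CLIP-like models without additional training and offers a new approach to compositional multimodal understanding. It shows that improving representation alignment, rather than relying on large amounts of training data, can strengthen a model's compositional reasoning.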


📄 Abstract

Contrastive Language-Image Pretraining (CLIP) has achieved remarkable performance in various multimodal tasks. However, it still struggles with compositional image-text matching, particularly in accurately associating objects with their corresponding attributes, because its inherent global representation often overlooks fine-grained semantics for attribute binding. Existing methods often require additional training or extensive hard negative sampling, yet they frequently show limited generalization to novel compositional concepts and fail to fundamentally address the drawbacks of global representations. In this paper, we propose ABE-CLIP, a novel training-free Attribute Binding Enhancement method designed to strengthen attribute-object binding in CLIP-like models. Specifically, we employ a Semantic Refinement Mechanism to refine token embeddings for both object and attribute phrases in the text, thereby mitigating attribute confusion and improving semantic precision. We further introduce a Local Token-Patch Alignment strategy that computes similarity scores between refined textual tokens and their most relevant image patches. By aggregating localized similarity scores, ABE-CLIP computes the final image-text similarity. Experiments on multiple datasets demonstrate that ABE-CLIP significantly improves attribute-object binding performance, even surpassing methods that require extensive training.
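
The local token-patch alignment idea can be illustrated in a few lines: each text token is matched to its most similar image patch, and the per-token maxima are aggregated into one score. In this sketch the mean is used as the aggregation, which is an assumption since the abstract only says the scores are aggregated; the semantic refinement step is not shown, and the three-dimensional vectors are toy stand-ins for CLIP token and patch embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def local_alignment_score(token_embs, patch_embs):
    """Mean over tokens of the best-matching patch similarity."""
    best = [max(cosine(t, p) for p in patch_embs) for t in token_embs]
    return sum(best) / len(best)

# Tokens for the phrase "red cube" (toy embeddings).
tokens = [[1.0, 0.0, 0.0],   # "red"
          [0.0, 1.0, 0.0]]   # "cube"

image_red_cube = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # has a red patch and a cube patch
image_blue_cube = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]  # cube present, but no red patch

score_match = local_alignment_score(tokens, image_red_cube)
score_mismatch = local_alignment_score(tokens, image_blue_cube)
print(score_match, score_mismatch)  # -> 1.0 0.5
```

A single global embedding could blur "red" and "cube" together; scoring tokens against patches individually penalizes the image in which the attribute has no supporting region.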

[13] InfSplign: Inference-Time Spatial Alignment of Text-to-Image Diffusion Models

Sarah Rastegar, Violeta Chatalbasheva, Sieger Falkena, Anuj Singh, Yanbo Wang, Tejas Gokhale, Hamid Palangi, Hadi Jamali-Rad

🧩 TL;DR

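This paper proposes InfSplign, a training-free, inference-time method that adjusts the noise through a compound loss at every denoising step. It significantly improves the spatial alignment of text-to-image diffusion models and achieves state-of-the-art results on multiple benchmarks.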


📘 Detailed Summary

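Motivation: Text-to-image diffusion models generate high-quality images but often fail to capture the spatial relations specified in text prompts. Two factors are responsible: the lack of fine-grained spatial supervision in training data and the inability of text embeddings to encode spatial semantics.

Method: InfSplign is a training-free, inference-time method that adjusts the noise through a compound loss at every denoising step. The loss uses cross-attention maps extracted at different levels of the backbone decoder to enforce accurate object placement and balanced object presence during sampling.

Result: Comprehensive evaluations on VISOR and T2I-CompBench show that InfSplign establishes a new state of the art, with substantial gains over the strongest existing inference-time baselines and better results than fine-tuning-based methods.

Conclusion: The method is lightweight, plug-and-play, and compatible with any diffusion backbone. It offers an efficient, general way to improve spatial alignment in text-to-image models without any additional training cost.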


📄 Abstract

Text-to-image (T2I) diffusion models generate high-quality images but often fail to capture the spatial relations specified in text prompts. This limitation can be traced to two factors: lack of fine-grained spatial supervision in training data and inability of text embeddings to encode spatial semantics. We introduce InfSplign, a training-free inference-time method that improves spatial alignment by adjusting the noise through a compound loss in every denoising step. The proposed loss leverages different levels of cross-attention maps extracted from the backbone decoder to enforce accurate object placement and a balanced object presence during sampling. The method is lightweight, plug-and-play, and compatible with any diffusion backbone. Our comprehensive evaluations on VISOR and T2I-CompBench show that InfSplign establishes a new state-of-the-art (to the best of our knowledge), achieving substantial performance gains over the strongest existing inference-time baselines and even outperforming the fine-tuning-based methods. The codebase is available on GitHub.
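
To give a feel for how cross-attention maps can carry a spatial signal, the sketch below computes a simple placement penalty for the relation "A to the left of B": each object's attention map is reduced to a horizontal centroid, and a hinge term is positive while A is not far enough left of B. This centroid hinge is an illustrative stand-in, not the paper's compound loss, and the `margin` parameter is an assumption. In a real sampler the gradient of such a loss with respect to the latent would be used to adjust the noise at each step; that part is not shown.

```python
def centroid_x(attn):
    """Attention-weighted mean column index of a 2D attention map."""
    total = sum(sum(row) for row in attn)
    return sum(x * v for row in attn for x, v in enumerate(row)) / total

def left_of_loss(attn_a, attn_b, margin=1.0):
    """Hinge penalty that is zero once A's centroid is `margin` columns left of B's."""
    return max(0.0, centroid_x(attn_a) - centroid_x(attn_b) + margin)

# Toy 2x3 attention maps: object A attends to the left column, B to the right.
attn_a = [[1.0, 0.0, 0.0],
          [1.0, 0.0, 0.0]]
attn_b = [[0.0, 0.0, 1.0],
          [0.0, 0.0, 1.0]]

satisfied = left_of_loss(attn_a, attn_b)  # A is already left of B
violated = left_of_loss(attn_b, attn_a)   # roles swapped: relation is violated
print(satisfied, violated)  # -> 0.0 3.0
```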

[14] Anatomical Region-Guided Contrastive Decoding: A Plug-and-Play Strategy for Mitigating Hallucinations in Medical VLMs

Xiao Liang, Chenxi Liu, Zhi Ma, Di Wang, Bin Jing, Quan Wang, Yuanyuan Shi

🧩 TL;DR

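This paper proposes Anatomical Region-Guided Contrastive Decoding (ARCD), a plug-and-play strategy that mitigates hallucinations in medical vision-language models by providing targeted, region-specific guidance. An anatomical mask directs a three-tiered contrastive decoding process, improving regional understanding and reducing factually incorrect outputs.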


📘 Detailed Summary

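Motivation: Medical vision-language models show great promise for clinical use, but their reliability is undermined by hallucinations: models often fail to derive answers from visual evidence and fall back on learned textual priors. Existing mitigation strategies have clear limitations. Training-based methods need costly expert annotations and scale poorly, while training-free interventions such as contrastive decoding are data-efficient but apply a global, untargeted correction whose effect in complex real-world clinical settings can be unreliable.

Method: ARCD is a plug-and-play strategy that mitigates hallucinations through targeted, region-specific guidance. It uses an anatomical mask to direct a three-tiered contrastive decoding process. By dynamically re-weighting at the token, attention, and logits levels, it verifiably steers the model's focus onto the specified region, reinforcing anatomical understanding and suppressing factually incorrect outputs.

Result: Extensive experiments on diverse datasets, including chest X-ray, CT, brain MRI, and ocular ultrasound, show that the method improves regional understanding, reduces hallucinations, and raises overall diagnostic accuracy. Its effectiveness across several imaging modalities supports its reliability in complex real-world clinical settings.

Conclusion: ARCD offers a data-efficient and scalable answer to hallucination mitigation in medical vision-language models. Its targeted, region-guided mechanism improves reliability and clinical applicability without costly expert annotations, providing a technical basis for deploying medical AI systems in practice.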


📄 Abstract

Medical Vision-Language Models (MedVLMs) show immense promise in clinical applicability. However, their reliability is hindered by hallucinations, where models often fail to derive answers from visual evidence, instead relying on learned textual priors. Existing mitigation strategies for MedVLMs have distinct limitations: training-based methods rely on costly expert annotations, limiting scalability, while training-free interventions like contrastive decoding, though data-efficient, apply a global, untargeted correction whose effects in complex real-world clinical settings can be unreliable. To address these challenges, we introduce Anatomical Region-Guided Contrastive Decoding (ARCD), a plug-and-play strategy that mitigates hallucinations by providing targeted, region-specific guidance. Our module leverages an anatomical mask to direct a three-tiered contrastive decoding process. By dynamically re-weighting at the token, attention, and logits levels, it verifiably steers the model's focus onto specified regions, reinforcing anatomical understanding and suppressing factually incorrect outputs. Extensive experiments across diverse datasets, including chest X-ray, CT, brain MRI, and ocular ultrasound, demonstrate our method's effectiveness in improving regional understanding, reducing hallucinations, and enhancing overall diagnostic accuracy.
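
For readers unfamiliar with contrastive decoding, the logits-level idea can be written in one line: amplify what the region-guided pass predicts relative to an unguided pass, so that tokens favored mainly by textual priors are pushed down. The sketch below uses the generic contrastive-decoding combination `(1 + alpha) * guided - alpha * unguided`; it is not ARCD's actual re-weighting rule, which the abstract does not spell out, and the vocabulary, logit values, and `alpha` are made up for illustration.

```python
def contrastive_logits(guided, unguided, alpha=1.0):
    """Generic contrastive decoding: boost evidence unique to the guided pass."""
    return [(1 + alpha) * g - alpha * u for g, u in zip(guided, unguided)]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

vocab = ["normal", "effusion", "fracture"]

# Hypothetical next-token logits from two forward passes over the same image.
unguided = [2.2, 1.2, 0.1]  # no region guidance: the prior favors "normal"
guided = [2.0, 1.9, 0.1]    # anatomical mask applied: "effusion" gains support

combined = contrastive_logits(guided, unguided, alpha=1.0)
print(vocab[argmax(guided)], vocab[argmax(combined)])  # -> normal effusion
```

In this toy case the guided pass alone still picks "normal"; contrasting it with the unguided pass surfaces the token whose support actually increased once the model was directed to the region.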

[15] RadarGen: Automotive Radar Point Cloud Generation from Cameras

Tomer Borreda, Fangqiang Ding, Sanja Fidler, Shengyu Huang, Or Litany

🧩 TL;DR

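This paper presents RadarGen, a diffusion model that synthesizes realistic automotive radar point clouds from multi-view camera images. By representing radar measurements in bird's-eye-view (BEV) form and incorporating cues from the visual scene, it enables cross-modal generative simulation.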


📘 Detailed Summary

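Motivation: Autonomous driving systems need large amounts of real radar data to train perception models, but such data is expensive to collect and hard to obtain. Existing methods fall short in cross-modal generative simulation, especially in generating radar from vision, which limits the scalability of multimodal perception systems.

Method: RadarGen adapts efficient image-latent diffusion to the radar domain by representing radar measurements as BEV maps that encode spatial structure together with radar cross section (RCS) and Doppler attributes. A lightweight recovery step reconstructs point clouds from the generated maps. The model also incorporates BEV-aligned depth, semantic, and motion cues extracted from pretrained foundation models, which guide the stochastic generation process toward physically plausible radar patterns.

Result: Evaluations on large-scale driving data show that RadarGen captures characteristic radar measurement distributions and narrows the gap to perception models trained on real data. The generated point clouds are statistically and physically plausible, supporting the effectiveness of cross-modal generation.

Conclusion: RadarGen offers a scalable direction for multimodal generative simulation. Conditioning on images makes it broadly compatible with existing visual datasets and simulation frameworks. The work marks a step toward unified generative simulation across sensing modalities and opens a new route for data augmentation and simulation testing in autonomous driving perception.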


📄 Abstract

We present RadarGen, a diffusion model for synthesizing realistic automotive radar point clouds from multi-view camera imagery. RadarGen adapts efficient image-latent diffusion to the radar domain by representing radar measurements in bird's-eye-view form that encodes spatial structure together with radar cross section (RCS) and Doppler attributes. A lightweight recovery step reconstructs point clouds from the generated maps. To better align generation with the visual scene, RadarGen incorporates BEV-aligned depth, semantic, and motion cues extracted from pretrained foundation models, which guide the stochastic generation process toward physically plausible radar patterns. Conditioning on images makes the approach broadly compatible, in principle, with existing visual datasets and simulation frameworks, offering a scalable direction for multimodal generative simulation. Evaluations on large-scale driving data show that RadarGen captures characteristic radar measurement distributions and reduces the gap to perception models trained on real data, marking a step toward unified generative simulation across sensing modalities.
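
The bird's-eye-view representation and the recovery step can be pictured as a round trip between a point list and a small set of grid maps. The sketch below is a simplified assumption of what such an encoding could look like: one occupancy map plus one map each for RCS and Doppler, with recovery placing a point at the center of every occupied cell. Grid size, extent, and the rule that a later point overwrites an earlier one in the same cell are illustrative choices, not details from the paper.

```python
def rasterize_bev(points, grid=4, extent=40.0):
    """Rasterize (x, y, rcs, doppler) points into three grid x grid BEV maps.

    Coordinates are in metres, with valid x and y in [0, extent).
    If several points fall in one cell, the last one wins.
    """
    cell = extent / grid
    occ = [[0] * grid for _ in range(grid)]
    rcs = [[0.0] * grid for _ in range(grid)]
    dop = [[0.0] * grid for _ in range(grid)]
    for x, y, r, d in points:
        i, j = int(y // cell), int(x // cell)
        if 0 <= i < grid and 0 <= j < grid:
            occ[i][j] = 1
            rcs[i][j] = r
            dop[i][j] = d
    return occ, rcs, dop

def recover_points(occ, rcs, dop, extent=40.0):
    """Rebuild a point list by emitting one point at each occupied cell center."""
    grid = len(occ)
    cell = extent / grid
    return [((j + 0.5) * cell, (i + 0.5) * cell, rcs[i][j], dop[i][j])
            for i in range(grid) for j in range(grid) if occ[i][j]]

# One in-range detection and one outside the grid (dropped).
points = [(5.0, 15.0, 3.2, -1.5), (-3.0, 10.0, 1.0, 0.0)]
occ, rcs, dop = rasterize_bev(points)
recovered = recover_points(occ, rcs, dop)
print(recovered)  # -> [(5.0, 15.0, 3.2, -1.5)]
```

Storing radar returns as image-like maps is what lets an image diffusion model generate them; the price is that positions are quantized to the cell size on recovery.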

[16] CheXPO-v2: Preference Optimization for Chest X-ray VLMs with Knowledge Graph Consistency

Xiao Liang, Yuxuan An, Di Wang, Jiawei Hu, Zhicheng Jiao, Bin Jing, Quan Wang

🧩 TL;DR

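This paper proposes CheXPO-v2, a novel alignment framework for medical vision-language models that tackles hallucinations by shifting from outcome supervision to process supervision. It introduces a Knowledge Graph Consistency Reward based on entity-relation matching, which improves the reliability and verifiability of clinical reasoning.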


📘 Detailed Summary

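Motivation: Medical vision-language models are prone to hallucinations, which compromises clinical reliability. Existing reinforcement learning methods such as GRPO rely on sparse, outcome-based rewards, which encourages models to "overthink": they generate verbose, convoluted, and unverifiable chain-of-thought reasoning to justify their answers. This focus on outcomes obscures factual errors and poses significant safety risks.

Method: The core innovation of CheXPO-v2 is a Knowledge Graph Consistency Reward driven by entity-relation matching. Reasoning steps are explicitly parsed into structured "Disease, Relation, Anatomy" triplets, which provides fine-grained supervision that penalizes incoherent logic and hallucinations at the atomic level. This is combined with a hard-example mining strategy for efficient alignment.

Result: CheXPO-v2 significantly outperforms GRPO and state-of-the-art models on benchmarks such as MIMIC-CXR-VQA. It reaches new state-of-the-art accuracy with only 5k samples, demonstrating exceptional data efficiency, while producing clinically sound and verifiable reasoning and reducing hallucinations.

Conclusion: The study shows that process supervision is superior to outcome supervision for aligning medical VLMs. The Knowledge Graph Consistency Reward provides an effective framework for fine-grained verification of reasoning, and hard-example mining improves data efficiency. The approach points to a new direction for building safe and reliable clinical AI systems, and the project source code is publicly available.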


📄 Abstract

Medical Vision-Language Models (VLMs) are prone to hallucinations, compromising clinical reliability. While reinforcement learning methods like Group Relative Policy Optimization (GRPO) offer a low-cost alignment solution, their reliance on sparse, outcome-based rewards inadvertently encourages models to "overthink" -- generating verbose, convoluted, and unverifiable Chain-of-Thought reasoning to justify answers. This focus on outcomes obscures factual errors and poses significant safety risks. To address this, we propose CheXPO-v2, a novel alignment framework that shifts from outcome to process supervision. Our core innovation is a Knowledge Graph Consistency Reward mechanism driven by Entity-Relation Matching. By explicitly parsing reasoning steps into structured "Disease, Relation, Anatomy" triplets, we provide fine-grained supervision that penalizes incoherent logic and hallucinations at the atomic level. Integrating this with a hard-example mining strategy, our approach significantly outperforms GRPO and state-of-the-art models on benchmarks like MIMIC-CXR-VQA. Crucially, CheXPO-v2 achieves new state-of-the-art accuracy using only 5k samples, demonstrating exceptional data efficiency while producing clinically sound and verifiable reasoning. The project source code is publicly available at: https://github.com/ecoxial2007/CheX-Phi4MM.
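
To make the idea of a triplet-level consistency reward concrete, here is a minimal sketch. It assumes the reasoning has already been parsed into (disease, relation, anatomy) triplets and scores them against reference triplets with an F1 measure. The F1 scoring, the function name, and the example triplets are illustrative assumptions; the abstract does not give the reward formula used by CheXPO-v2.

```python
def kg_consistency_reward(predicted, reference):
    """Score predicted (disease, relation, anatomy) triplets against a reference set.

    Assumption: exact matching after lower-casing, combined into an F1 score.
    Unsupported triplets lower precision; missed findings lower recall.
    """
    def canon(triplets):
        return {tuple(part.strip().lower() for part in t) for t in triplets}

    pred, ref = canon(predicted), canon(reference)
    if not pred or not ref:
        return 0.0
    matched = len(pred & ref)
    if matched == 0:
        return 0.0
    precision = matched / len(pred)
    recall = matched / len(ref)
    return 2 * precision * recall / (precision + recall)


reward = kg_consistency_reward(
    predicted=[
        ("Pneumonia", "located_in", "left lower lobe"),
        ("Effusion", "located_in", "right pleural space"),  # not in the reference
    ],
    reference=[("pneumonia", "located_in", "left lower lobe")],
)
```

Because each triplet is checked individually, a single unsupported claim reduces the reward even when the final answer is correct, which is the sense in which supervision acts at the atomic level.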

[17] Adversarial Robustness of Vision in Open Foundation Models

Jonathon Fox, William J Buchanan, Pavlos Papadopoulos

🧩 TL;DR

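This paper studies the robustness of two vision-language models, LLaVA-1.5-13B and Meta's Llama 3.2 Vision-8B-2, under adversarial attack, and finds that although Llama 3.2 Vision has lower baseline accuracy, it shows stronger adversarial robustness at high perturbation levels.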


📘 Detailed Summary

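Motivation: As deep learning models grow more complex, it becomes harder to understand how AI systems recognize objects, and an attacker may confuse the AI's recognition of an entity by adding imperceptible elements to an image. This study explores the vulnerability of contemporary open-weight vision-language models to adversarial attacks on the visual modality, focusing on the difference in adversarial robustness between LLaVA and Llama 3.2 Vision.

Method: The study applies untargeted Projected Gradient Descent (PGD) attacks to the visual input modality and empirically evaluates LLaVA-1.5-13B and Meta's Llama 3.2 Vision-8B-2 on a subset of the Visual Question Answering (VQA) v2 dataset. Attack effectiveness is quantified with the standard VQA accuracy metric, and the accuracy drop of the two models under attack is compared.

Result: Llama 3.2 Vision, despite a lower baseline accuracy than LLaVA, shows a smaller performance drop under attack, particularly at higher perturbation levels. This suggests that adversarial robustness does not necessarily correlate directly with standard benchmark performance and may be influenced by underlying architectural and training factors.

Conclusion: The study confirms that the visual modality is a viable attack vector for degrading contemporary open-weight VLMs, including Meta's Llama 3.2 Vision. More importantly, it finds no direct correlation between adversarial robustness and standard benchmark performance, which underscores the need to treat adversarial robustness as an independent metric when evaluating VLMs.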


📄 Abstract

With the growth of deep learning, it becomes increasingly difficult to understand how AI systems identify objects. An adversary could therefore modify an image by adding unseen elements that confuse the AI's recognition of an entity. This paper investigates the adversarial robustness of LLaVA-1.5-13B and Meta's Llama 3.2 Vision-8B-2. Both models are subjected to untargeted PGD (Projected Gradient Descent) attacks against the visual input modality and empirically evaluated on a subset of the Visual Question Answering (VQA) v2 dataset. The results of these adversarial attacks are quantified using the standard VQA accuracy metric, and the accuracy degradation (accuracy drop) of LLaVA and Llama 3.2 Vision is then compared. A key finding is that Llama 3.2 Vision, despite a lower baseline accuracy in this setup, exhibited a smaller drop in performance under attack compared to LLaVA, particularly at higher perturbation levels. Overall, the findings confirm that the vision modality represents a viable attack vector for degrading the performance of contemporary open-weight VLMs, including Meta's Llama 3.2 Vision. Furthermore, they highlight that adversarial robustness does not necessarily correlate directly with standard benchmark performance and may be influenced by underlying architectural and training factors.
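
For readers unfamiliar with untargeted PGD, the loop below sketches the attack on a toy logistic-regression "model" rather than a VLM. Several things here are assumptions for illustration: the L-infinity perturbation budget `epsilon`, the step size, the iteration count, and the toy model itself. The abstract only states that untargeted PGD is applied to the visual input at several perturbation levels.

```python
import math


def bce_loss_and_input_grad(x, y, w, b):
    """Binary cross-entropy of a logistic model and its gradient w.r.t. the input x."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    tiny = 1e-12  # avoids log(0)
    loss = -(y * math.log(p + tiny) + (1 - y) * math.log(1 - p + tiny))
    grad = [(p - y) * wi for wi in w]
    return loss, grad


def sign(v):
    return (v > 0) - (v < 0)


def pgd_untargeted(x, y, w, b, epsilon=0.1, step=0.02, iters=10):
    """Untargeted PGD: ascend the loss, then project back into the allowed region."""
    x_adv = list(x)
    for _ in range(iters):
        _, grad = bce_loss_and_input_grad(x_adv, y, w, b)
        # Gradient-sign ascent step: increase the loss on the true label.
        x_adv = [xa + step * sign(g) for xa, g in zip(x_adv, grad)]
        # Project onto the L-infinity ball of radius epsilon around the clean input.
        x_adv = [min(max(xa, xo - epsilon), xo + epsilon) for xa, xo in zip(x_adv, x)]
        # Keep values in the valid pixel range [0, 1].
        x_adv = [min(max(xa, 0.0), 1.0) for xa in x_adv]
    return x_adv


x_clean, label = [0.5, 0.5, 0.5], 1
weights, bias = [2.0, -1.0, 0.5], 0.0
x_attacked = pgd_untargeted(x_clean, label, weights, bias)
```

The attack is untargeted because it only pushes the loss on the correct label upward and never steers the model toward a chosen wrong answer.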

[18] DAVE: A VLM Vision Encoder for Document Understanding and Web Agents

Brandon Huang, Hang Hua, Zhuoran Yu, Trevor Darrell, Rogerio Feris, Roei Herzig

🧩 TL;DR

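This paper proposes DAVE, a vision encoder designed for document understanding and web agent tasks. By combining self-supervised pretraining with supervised autoregressive pretraining, it addresses a core weakness of vision encoders in existing vision-language models: the lack of structural and spatial information.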


📘 Detailed Summary

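Motivation: Vision encoders in current vision-language models (VLMs) have a fundamental weakness: their low-level features lack the robust structural and spatial information needed for document understanding and web agent tasks, which limits performance in these domains.

Method: DAVE uses a two-stage training pipeline: self-supervised pretraining on unlabeled images, followed by supervised autoregressive pretraining on limited high-quality data, in which the model learns tasks such as parsing and localization. Two strategies are introduced in the supervised stage: (i) a novel model-merging scheme that combines encoders trained with different text decoders to ensure broad compatibility with different web agent architectures; (ii) ensemble training that fuses features from pretrained generalist encoders (e.g., SigLIP2) with document- and web-specific representations.

Result: Extensive experiments on classic document tasks, visual question answering (VQA), web localization, and agent-based benchmarks validate the approach and establish DAVE as a strong vision encoder for document and web applications.

Conclusion: Through a training pipeline and design choices tailored to document and web tasks, DAVE addresses the limitations of vision encoders in existing VLMs, provides stronger visual representations for document understanding and web agents, and illustrates the value of task-specific vision encoder design.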


📄 Abstract

While Vision-language models (VLMs) have demonstrated remarkable performance across multi-modal tasks, their choice of vision encoders presents a fundamental weakness: their low-level features lack the robust structural and spatial information essential for document understanding and web agents. To bridge this gap, we introduce DAVE, a vision encoder purpose-built for VLMs and tailored for these tasks. Our training pipeline is designed to leverage abundant unlabeled data to bypass the need for costly large-scale annotations for document and web images. We begin with a self-supervised pretraining stage on unlabeled images, followed by a supervised autoregressive pretraining stage, where the model learns tasks like parsing and localization from limited, high-quality data. Within the supervised stage, we adopt two strategies to improve our encoder's alignment with both general visual knowledge and diverse document and web agentic tasks: (i) We introduce a novel model-merging scheme, combining encoders trained with different text decoders to ensure broad compatibility with different web agentic architectures. (ii) We use ensemble training to fuse features from pretrained generalist encoders (e.g., SigLIP2) with our own document and web-specific representations. Extensive experiments on classic document tasks, VQAs, web localization, and agent-based benchmarks validate the effectiveness of our approach, establishing DAVE as a strong vision encoder for document and web applications.

[19] Learning When to Look: A Disentangled Curriculum for Strategic Perception in Multimodal Reasoning

Siqi Yang, Zilve Gao, Haibo Qiu, Fanfan Liu, Peng Shi, Zhixiong Zeng, Qingmin Liao, Lin Ma

🧩 TL;DR

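This paper proposes a novel curriculum learning framework to address visual forgetting in multimodal large language models on long-chain visual reasoning tasks. By disentangling two cognitive skills, abstract logical reasoning and strategic visual perception, it transforms the model from a heuristic-driven observer into a strategic, grounded reasoner.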


📘 Detailed Summary

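Motivation: Multimodal large language models (MLLMs) are brittle on complex, long-chain visual reasoning tasks and exhibit a critical failure mode called "visual forgetting": as the reasoning chain grows, the model progressively loses visual grounding. The authors argue that the root cause is that current training paradigms prematurely entangle two distinct cognitive skills, abstract logical reasoning ("how-to-think") and strategic visual perception ("when-to-look"), which produces a foundational cold-start deficiency and a strategic perception deficit.

Method: The paper proposes a two-stage curriculum learning framework. First, a disentangled supervised fine-tuning (SFT) curriculum builds a robust abstract reasoning backbone on text-only data, then anchors it to the visual modality through a novel Perception-Grounded Chain-of-Thought (PG-CoT) paradigm. Second, the strategic perception deficit is addressed by formulating timing as a reinforcement learning problem: a Pivotal Perception Reward couples perceptual actions to linguistic markers of cognitive uncertainty (e.g., "wait", "verify"), teaching the model when to look so that it learns an autonomous grounding policy.

Result: By disentangling the two cognitive skills, the method addresses visual forgetting in long-chain visual reasoning. The reported experiments indicate that the framework significantly improves performance on complex visual reasoning tasks, turning the model from a heuristic-driven observer into a strategic, grounded reasoner. The code is open-sourced on GitHub.

Conclusion: The study formalizes two key deficiencies of MLLMs in long-chain visual reasoning and develops a principled two-stage framework to address them. Its core contribution is to disentangle abstract reasoning from visual perception and to use a reinforcement learning policy to teach the model when to perceive, offering a new theoretical framework and practical path toward more robust multimodal reasoning systems.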


📄 Abstract

Multimodal Large Language Models (MLLMs) demonstrate significant potential but remain brittle in complex, long-chain visual reasoning tasks. A critical failure mode is "visual forgetting", where models progressively lose visual grounding as reasoning extends, a phenomenon aptly described as "think longer, see less". We posit this failure stems from current training paradigms prematurely entangling two distinct cognitive skills: (1) abstract logical reasoning ("how-to-think") and (2) strategic visual perception ("when-to-look"). This creates a foundational cold-start deficiency -- weakening abstract reasoning -- and a strategic perception deficit, as models lack a policy for when to perceive. In this paper, we propose a novel curriculum-based framework to disentangle these skills. First, we introduce a disentangled Supervised Fine-Tuning (SFT) curriculum that builds a robust abstract reasoning backbone on text-only data before anchoring it to vision with a novel Perception-Grounded Chain-of-Thought (PG-CoT) paradigm. Second, we resolve the strategic perception deficit by formulating timing as a reinforcement learning problem. We design a Pivotal Perception Reward that teaches the model when to look by coupling perceptual actions to linguistic markers of cognitive uncertainty (e.g., "wait", "verify"), thereby learning an autonomous grounding policy. Our contributions include the formalization of these two deficiencies and the development of a principled, two-stage framework to address them, transforming the model from a heuristic-driven observer to a strategic, grounded reasoner. Code: https://github.com/gaozilve-max/learning-when-to-look.

[20] Video Detective: Seek Critical Clues Recurrently to Answer Question from Long Videos

Henghui Du, Chang Zhou, Chunjie Zhang, Xi Chen, Di Hu

🧩 TL;DR

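This paper proposes VideoDetective, which uses a question-aware memory mechanism and a recurrent processing strategy so that multimodal large language models can handle long-video question answering efficiently, processing long videos of 100K visual tokens with a context length of only 32K.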


📘 Detailed Summary

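Motivation: Long-video question answering is a major challenge for multimodal large language models (MLLMs), mainly because the immense context and overloaded information lead to prohibitive memory consumption. Existing methods reduce visual tokens or extend the context length, which can lose useful information or incur considerable computation, yet answering a question actually requires only a small amount of crucial information.

Method: VideoDetective uses a question-aware memory mechanism and a recurrent processing strategy to process video sub-segments iteratively. Each sub-segment is purposefully compressed by introducing a few special memory tokens; these memory tokens are then recurrently aggregated and stored to update the history context, which is reused for subsequent sub-segments. The authors also build the GLVC dataset to evaluate long-video understanding.

Result: The method enables MLLMs with a context length of only 32K to efficiently process long videos of 100K tokens (3600 frames, an hour-long video sampled at 1 fps), requiring only 2 minutes and 37 GB of GPU memory. Across multiple long-video benchmarks, it seeks critical clues from massive information more effectively.

Conclusion: The study shows that question-aware compression and a recurrent memory mechanism make it possible to handle long-video question answering within a limited context length. The method offers an efficient solution for long-video understanding, and the GLVC dataset provides a more effective benchmark for evaluating it.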


📄 Abstract

Long Video Question-Answering (LVQA) presents a significant challenge for Multi-modal Large Language Models (MLLMs) due to immense context and overloaded information, which can also lead to prohibitive memory consumption. While existing methods attempt to address these issues by reducing visual tokens or extending the model's context length, they may miss useful information or require considerable computation. In fact, when answering a given question, only a small amount of crucial information is required. Therefore, we propose an efficient question-aware memory mechanism, enabling MLLMs to recurrently seek these critical clues. Our approach, named VideoDetective, simplifies this task by iteratively processing video sub-segments. For each sub-segment, a question-aware compression strategy is employed by introducing a few special memory tokens to achieve purposeful compression. This allows models to effectively seek critical clues while reducing visual tokens. Then, because history context can have a significant impact, we recurrently aggregate and store these memory tokens to update the history context, which is reused for subsequent sub-segments. Furthermore, to more effectively measure a model's long video understanding ability, we introduce GLVC (Grounding Long Video Clues), a long video question-answering dataset, which features grounding critical and concrete clues scattered throughout entire videos. Experimental results demonstrate that our method enables MLLMs with a limited context length of 32K to efficiently process 100K tokens (3600 frames, an hour-long video sampled at 1fps), requiring only 2 minutes and 37GB of GPU memory. Evaluation results across multiple long video benchmarks illustrate that our method can more effectively seek critical clues from massive information.
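
The reported figures imply roughly 100,000 / 3,600 ≈ 28 visual tokens per frame, and the recurrent scheme keeps the live context small because only the compressed memory is carried forward. The bookkeeping sketch below illustrates this. Only the 100K total and the 32K limit come from the abstract; the sub-segment size, the number of memory tokens per sub-segment, and the question length are hypothetical values chosen for illustration.

```python
def peak_context_length(total_visual_tokens, segment_tokens,
                        memory_tokens_per_segment, question_tokens):
    """Simulate recurrent processing and return (peak context, final memory size).

    At each step the context holds: accumulated memory tokens from earlier
    sub-segments, the current sub-segment's visual tokens, the question, and
    the new memory tokens that will summarize the current sub-segment.
    """
    memory = 0
    peak = 0
    remaining = total_visual_tokens
    while remaining > 0:
        segment = min(segment_tokens, remaining)
        context = memory + segment + question_tokens + memory_tokens_per_segment
        peak = max(peak, context)
        memory += memory_tokens_per_segment  # carry the compressed summary forward
        remaining -= segment
    return peak, memory


peak, memory = peak_context_length(
    total_visual_tokens=100_000,   # from the abstract
    segment_tokens=8_000,          # hypothetical sub-segment size
    memory_tokens_per_segment=64,  # hypothetical
    question_tokens=50,            # hypothetical
)
fits_in_32k = peak <= 32 * 1024
```

Under these assumed settings the peak context stays far below the 32K limit, since the history grows by only a few memory tokens per sub-segment instead of by the full sub-segment.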

[21] AnyCXR: Human Anatomy Segmentation of Chest X-ray at Any Acquisition Position using Multi-stage Domain Randomized Synthetic Data with Imperfect Annotations and Conditional Joint Annotation Regularization Learning

Dong Zifei, Wu Wenjie, Hao Jinkui, Chen Tianqi, Weng Ziqiao, Zhou Bo

🧩 TL;DR

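This paper proposes AnyCXR, a framework that achieves multi-organ segmentation of chest X-rays at arbitrary projection angles using synthetic supervision. It combines synthetic data generated by multi-stage domain randomization with a conditional joint annotation regularization learning strategy, achieves zero-shot generalization on multiple real-world datasets, and provides a scalable foundation for anatomy-aware CXR analysis.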


📘 Detailed Summary

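Motivation: Anatomical segmentation of chest X-rays is hampered by scarce annotations and the variability of real-world acquisition conditions. Existing methods struggle to deliver robust multi-organ segmentation at arbitrary projection angles, so the trade-off between annotation burden and generalization needs to be resolved.

Method: AnyCXR includes a Multi-stage Domain Randomization (MSDR) engine that generates over 100,000 anatomically faithful and highly diverse synthetic radiographs from 3D CT volumes, combined with a Conditional Joint Annotation Regularization (CAR) learning strategy that leverages partial and imperfect labels by enforcing anatomical consistency in a latent space.

Result: Trained only on synthetic data, AnyCXR achieves strong zero-shot generalization on multiple real-world datasets, accurately segmenting 54 anatomical structures in PA, lateral, and oblique views. It supports downstream clinical tasks such as automated cardiothoracic ratio estimation, spine curvature assessment, and disease classification, where incorporating anatomical priors improves diagnostic performance.

Conclusion: AnyCXR establishes a scalable and reliable foundation for anatomy-aware CXR analysis and offers a practical path to reducing annotation burden while improving robustness across diverse imaging conditions, showing the potential of synthetic supervision in medical image analysis.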


📄 Abstract

Robust anatomical segmentation of chest X-rays (CXRs) remains challenging due to the scarcity of comprehensive annotations and the substantial variability of real-world acquisition conditions. We propose AnyCXR, a unified framework that enables generalizable multi-organ segmentation across arbitrary CXR projection angles using only synthetic supervision. The method combines a Multi-stage Domain Randomization (MSDR) engine, which generates over 100,000 anatomically faithful and highly diverse synthetic radiographs from 3D CT volumes, with a Conditional Joint Annotation Regularization (CAR) learning strategy that leverages partial and imperfect labels by enforcing anatomical consistency in a latent space. Trained entirely on synthetic data, AnyCXR achieves strong zero-shot generalization on multiple real-world datasets, providing accurate delineation of 54 anatomical structures in PA, lateral, and oblique views. The resulting segmentation maps support downstream clinical tasks, including automated cardiothoracic ratio estimation, spine curvature assessment, and disease classification, where the incorporation of anatomical priors improves diagnostic performance. These results demonstrate that AnyCXR establishes a scalable and reliable foundation for anatomy-aware CXR analysis and offers a practical pathway toward reducing annotation burdens while improving robustness across diverse imaging conditions.
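
As an example of how segmentation maps feed a downstream task, the cardiothoracic ratio (CTR) is conventionally the maximal transverse width of the heart divided by the maximal internal width of the thorax on a frontal view. The sketch below computes a simplified version from two binary masks. The mask names and the use of overall horizontal extent for both structures are assumptions for illustration, not the pipeline described in the paper.

```python
def horizontal_extent(mask):
    """Width in pixels between the leftmost and rightmost foreground columns."""
    cols = [c for row in mask for c, value in enumerate(row) if value]
    if not cols:
        return 0
    return max(cols) - min(cols) + 1


def cardiothoracic_ratio(heart_mask, thorax_mask):
    """Simplified CTR: heart width divided by thorax width (both in pixels)."""
    thorax_width = horizontal_extent(thorax_mask)
    if thorax_width == 0:
        raise ValueError("thorax mask is empty")
    return horizontal_extent(heart_mask) / thorax_width


# Toy 3x10 masks: heart spans columns 3..6, thorax spans columns 1..8.
heart = [
    [0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
]
thorax = [
    [0, 0, 1, 1, 1, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 1, 1, 1, 1, 0],
]
ctr = cardiothoracic_ratio(heart, thorax)
```

Because the ratio is dimensionless, pixel spacing cancels out as long as both masks come from the same image.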

[22] Vision-Language Model Guided Image Restoration

Cuixin Yang, Rongkang Dong, Kin-Man Lam

🧩 TL;DR

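This paper proposes the VLMIR framework, which enhances image restoration by leveraging the rich priors of vision-language models such as CLIP. The framework combines visual perception with semantic understanding and achieves superior performance on both universal and degradation-specific image restoration tasks.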


📘 Detailed Summary

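Motivation: Existing image restoration methods struggle to combine pixel-level fidelity with high-level semantic understanding. Recent attempts to incorporate vision-language models (VLMs) into universal image restoration fail to exploit linguistic priors to ensure semantic coherence during restoration, which limits the realism and detail quality of the restored photos.

Method: VLMIR has two stages: VLM-based feature extraction and diffusion-based image restoration. The first stage extracts complementary visual and linguistic representations with a VLM such as CLIP, aligns the caption embeddings of low-quality and high-quality images using LoRA fine-tuning with a cosine similarity loss, and uses a degradation predictor to decompose degradation and clean image content embeddings. The second stage integrates these embeddings into a diffusion model through cross-attention for enhanced restoration.

Result: Extensive experiments and ablation studies show that VLMIR achieves superior performance on both universal and degradation-specific image restoration tasks, supporting the key role of integrating visual and linguistic knowledge from VLMs in improving restoration.

Conclusion: The study highlights the importance of VLMs in image restoration, shows that a combined approach of visual perception and semantic understanding can substantially improve restoration quality, and suggests a new direction for using multimodal priors in complex vision tasks.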


📄 Abstract

Many image restoration (IR) tasks require both pixel-level fidelity and high-level semantic understanding to recover realistic photos with fine-grained details. However, previous approaches often struggle to effectively leverage both the visual and linguistic knowledge. Recent efforts have attempted to incorporate Vision-language models (VLMs), which excel at aligning visual and textual features, into universal IR. Nevertheless, these methods fail to utilize the linguistic priors to ensure semantic coherence during the restoration process. To address this issue, in this paper, we propose the Vision-Language Model Guided Image Restoration (VLMIR) framework, which leverages the rich vision-language priors of VLMs, such as CLIP, to enhance IR performance through improved visual perception and semantic understanding. Our approach consists of two stages: VLM-based feature extraction and diffusion-based image restoration. In the first stage, we extract complementary visual and linguistic representations of input images by condensing the visual perception and high-level semantic priors through VLMs. Specifically, we align the embeddings of captions from low-quality and high-quality images using a cosine similarity loss with LoRA fine-tuning, and employ a degradation predictor to decompose degradation and clean image content embeddings. These complementary visual and textual embeddings are then integrated into a diffusion-based model via cross-attention mechanisms for enhanced restoration. Extensive experiments and ablation studies demonstrate that VLMIR achieves superior performance across both universal and degradation-specific IR tasks, underscoring the critical role of integrated visual and linguistic knowledge from VLMs in advancing image restoration capabilities.
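
The alignment objective mentioned above can be illustrated with the common "one minus cosine similarity" form, which is zero when two embeddings point in the same direction. Whether VLMIR uses exactly this form is an assumption; the abstract only names a cosine similarity loss between caption embeddings of low-quality and high-quality images. The embedding values below are made up.

```python
import math


def cosine_similarity_loss(a, b):
    """Return 1 - cos(a, b); 0 for aligned vectors, 1 for orthogonal ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical caption embeddings for a degraded image and its clean counterpart.
low_quality_caption = [0.2, 0.9, 0.1]
high_quality_caption = [0.3, 0.8, 0.2]
loss = cosine_similarity_loss(low_quality_caption, high_quality_caption)
```

Minimizing this value pulls the low-quality caption embedding toward the direction of the high-quality one while ignoring differences in magnitude.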

[23] Deep But Reliable: Advancing Multi-turn Reasoning for Thinking with Images

Wenhao Yang, Yu Xia, Jinlong Huang, Shiyin Lu, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Yuanyu Wan, Lijun Zhang

🧩 TL;DR

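This paper proposes DRIM, a model whose deep but reliable multi-turn reasoning addresses the difficulty existing vision-language models have in reflecting on and correcting erroneous reasoning trajectories when thinking with images in their chain of thought. It uses a three-stage pipeline comprising data construction, cold-start SFT, and reinforcement learning.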


📘 Detailed Summary

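Motivation: Existing large vision-language models (VLMs) show reasoning ability in their image-based chain of thought by invoking tools to analyze visual inputs, but they often struggle to reflect on and correct themselves when attempting incorrect reasoning trajectories, which limits reliability and depth of reasoning.

Method: DRIM's pipeline has three stages: constructing high-difficulty, verifiable visual question-answer pairs from a high-resolution image dataset; collecting tool trajectories as cold-start supervised fine-tuning data to guide a multi-turn reasoning pattern; and introducing redundancy-penalized policy optimization in the reinforcement learning stage to incentivize a self-reflective reasoning pattern.

Result: Extensive experiments show that DRIM achieves superior performance on visual understanding benchmarks, supporting the effectiveness of its deep but reliable multi-turn reasoning, particularly on high-difficulty visual tasks that require multi-turn tool calls.

Conclusion: The study shows that combining supervised fine-tuning with reinforcement learning policy optimization can substantially improve a VLM's self-reflection and reasoning reliability when thinking with images, offering a new approach and framework for building more reliable visual reasoning systems.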


📄 Abstract

Recent advances in large Vision-Language Models (VLMs) have exhibited strong reasoning capabilities on complex visual tasks by thinking with images in their Chain-of-Thought (CoT), which is achieved by actively invoking tools to analyze visual inputs rather than merely perceiving them. However, existing models often struggle to reflect on and correct themselves when attempting incorrect reasoning trajectories. To address this limitation, we propose DRIM, a model that enables deep but reliable multi-turn reasoning when thinking with images in its multimodal CoT. Our pipeline comprises three stages: data construction, cold-start SFT and RL. Based on a high-resolution image dataset, we construct high-difficulty and verifiable visual question-answer pairs, where solving each task requires multi-turn tool calls to reach the correct answer. In the SFT stage, we collect tool trajectories as cold-start data, guiding a multi-turn reasoning pattern. In the RL stage, we introduce redundancy-penalized policy optimization, which incentivizes the model to develop a self-reflective reasoning pattern. The basic idea is to impose judgment on reasoning trajectories and penalize those that produce incorrect answers without sufficient multi-scale exploration. Extensive experiments demonstrate that DRIM achieves superior performance on visual understanding benchmarks.

[24] CodeDance: A Dynamic Tool-integrated MLLM for Executable Visual Reasoning

Qi Song, Honglin Li, Yingchen Yu, Haoyi Zhou, Lin Yang, Song Bai, Qi She, Zilong Huang, Yunqing Zhao

🧩 TL;DR

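This paper proposes CodeDance, a visual reasoning framework based on executable code. It defines, composes, and executes code to orchestrate multiple tools, yielding transparent and self-checkable reasoning, and outperforms existing methods on multiple benchmarks.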


📘 Detailed Summary

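Motivation: Current open-source visual reasoning approaches rely mainly on text-only chains, rigid visual schemas, or single-step pipelines, which limits flexibility, interpretability, and transferability on complex tasks and cannot support human-like "thinking with images" reasoning. The study aims to fill this gap.

Method: CodeDance uses executable code as a general solver for visual reasoning: it defines, composes, and executes code to orchestrate multiple tools, compute intermediate results, and render visual artifacts (e.g., bounding boxes, lines, plots). It also introduces a reward for balanced and adaptive tool calling that balances exploration with efficiency and mitigates tool overuse.

Result: On benchmarks spanning visual search, mathematical reasoning, and chart question answering, CodeDance not only consistently outperforms schema-driven and text-only baselines but also surpasses advanced closed models such as GPT-4o and larger open-source models. During reinforcement learning training, emergent behaviors are observed, including novel tool invocations, unseen compositions, and cross-task transfer.

Conclusion: Executable code provides a general and scalable mechanism for visual reasoning and can yield emergent capabilities beyond those taught by atomic supervision. These capabilities arise without task-specific fine-tuning, pointing to a new direction for more flexible, transparent, and transferable visual reasoning systems.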


📄 Abstract

Recent releases such as o3 highlight human-like "thinking with images" reasoning that combines structured tool use with stepwise verification, yet most open-source approaches still rely on text-only chains, rigid visual schemas, or single-step pipelines, limiting flexibility, interpretability, and transferability on complex tasks. We introduce CodeDance, which explores executable code as a general solver for visual reasoning. Unlike fixed-schema calls (e.g., only predicting bounding-box coordinates), CodeDance defines, composes, and executes code to orchestrate multiple tools, compute intermediate results, and render visual artifacts (e.g., boxes, lines, plots) that support transparent, self-checkable reasoning. To guide this process, we introduce a reward for balanced and adaptive tool-call, which balances exploration with efficiency and mitigates tool overuse. Interestingly, beyond the expected capabilities taught by atomic supervision, we empirically observe novel emergent behaviors during RL training: CodeDance demonstrates novel tool invocations, unseen compositions, and cross-task transfer. These behaviors arise without task-specific fine-tuning, suggesting a general and scalable mechanism of executable visual reasoning. Extensive experiments across reasoning benchmarks (e.g., visual search, math, chart QA) show that CodeDance not only consistently outperforms schema-driven and text-only baselines, but also surpasses advanced closed models such as GPT-4o and larger open-source models.

[25] Auxiliary Descriptive Knowledge for Few-Shot Adaptation of Vision-Language Model

SuBeen Lee, GilHan Park, WonJun Moon, Hyun Seok Seong, Jae-Pil Heo

🧩 TL;DR

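This paper proposes the Auxiliary Descriptive Knowledge (ADK) framework, which enriches the text representations of vision-language models with rich class-description prompts generated offline. The descriptions are deployed in two ways, as Compositional Knowledge and as Instance-Specific Knowledge, and significantly improve few-shot adaptation without increasing inference-time computational overhead.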


📘 Detailed Summary

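Motivation: Although vision-language models (VLMs) perform well on zero-shot tasks, they struggle with distribution shift in downstream tasks. Existing parameter-efficient fine-tuning (PEFT) methods rely on fixed, handcrafted prompts that are often insufficient to capture class semantics, while methods based on image-induced prompts provide additional clues but introduce prohibitive computational overhead at inference.

Method: ADK first uses a large language model to generate, offline, a rich set of descriptive prompts for each class. These precomputed features are deployed in two ways: as Compositional Knowledge, an averaged representation that provides rich semantics and is especially useful when class names are ambiguous or unfamiliar to the VLM; and as Instance-Specific Knowledge, where a lightweight, non-parametric attention mechanism dynamically selects the most relevant descriptions for a given image. The framework is a parameter-free, plug-and-play component that works alongside existing PEFT methods.

Result: Extensive experiments show that ADK consistently improves multiple PEFT baselines and sets a new state of the art across various scenarios. It strengthens adaptation to distribution shift without compromising efficiency, particularly when class names are ambiguous or unfamiliar to the model.

Conclusion: By introducing rich descriptive knowledge generated offline, ADK addresses the limited semantic understanding of existing PEFT methods while avoiding extra inference-time overhead. It shows how enriching text representations, without adding parameters, can significantly improve few-shot adaptation of VLMs, and suggests a new direction for parameter-efficient fine-tuning.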


📄 Abstract

Despite the impressive zero-shot capabilities of Vision-Language Models (VLMs), they often struggle in downstream tasks with distribution shifts from the pre-training data. Few-Shot Adaptation (FSA-VLM) has emerged as a key solution, typically using Parameter-Efficient Fine-Tuning (PEFT) to adapt models with minimal data. However, these PEFT methods are constrained by their reliance on fixed, handcrafted prompts, which are often insufficient to understand the semantics of classes. While some studies have proposed leveraging image-induced prompts to provide additional clues for classification, they introduce prohibitive computational overhead at inference. Therefore, we introduce Auxiliary Descriptive Knowledge (ADK), a novel framework that efficiently enriches text representations without compromising efficiency. ADK first leverages a Large Language Model to generate a rich set of descriptive prompts for each class offline. These pre-computed features are then deployed in two ways: (1) as Compositional Knowledge, an averaged representation that provides rich semantics, especially beneficial when class names are ambiguous or unfamiliar to the VLM; and (2) as Instance-Specific Knowledge, where a lightweight, non-parametric attention mechanism dynamically selects the most relevant descriptions for a given image. This approach provides two additional types of knowledge alongside the handcrafted prompt, thereby facilitating category distinction across various domains. Also, ADK acts as a parameter-free, plug-and-play component that enhances existing PEFT methods. Extensive experiments demonstrate that ADK consistently boosts the performance of multiple PEFT baselines, setting a new state-of-the-art across various scenarios.
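
The two deployment modes can be sketched with plain vector arithmetic. In the sketch, Compositional Knowledge is the normalized mean of a class's description embeddings, and Instance-Specific Knowledge is a softmax-weighted mix of those embeddings, weighted by cosine similarity to the image embedding. The softmax form, the temperature value, and the toy two-dimensional embeddings are assumptions; the abstract says only that the attention is lightweight and non-parametric.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v] if norm > 0 else list(v)


def compositional_knowledge(desc_feats):
    """Average the precomputed description embeddings of one class."""
    dim = len(desc_feats[0])
    mean = [sum(f[i] for f in desc_feats) / len(desc_feats) for i in range(dim)]
    return l2_normalize(mean)


def instance_specific_knowledge(image_feat, desc_feats, temperature=0.1):
    """Weight descriptions by similarity to the image; no learned parameters."""
    query = l2_normalize(image_feat)
    keys = [l2_normalize(f) for f in desc_feats]
    scores = [sum(a * b for a, b in zip(query, k)) / temperature for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    mixed = [sum(w * k[i] for w, k in zip(weights, keys)) for i in range(len(query))]
    return l2_normalize(mixed), weights


# Toy embeddings: two descriptions of one class and one image.
descriptions = [[1.0, 0.0], [0.0, 1.0]]
image = [1.0, 0.1]
class_feature = compositional_knowledge(descriptions)
image_feature, attention = instance_specific_knowledge(image, descriptions)
```

Since the description embeddings are computed offline, the only per-image work is a handful of dot products, which is consistent with the efficiency claim in the abstract.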

[26] EMMA: Concept Erasure Benchmark with Comprehensive Semantic Metrics and Diverse Categories

Lu Wei, Yuta Nakashima, Noa Garcia

🧩 TL;DR

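This paper proposes the EMMA benchmark for comprehensively evaluating concept erasure techniques in text-to-image generation models, and finds that existing methods fall short on indirect prompts and visually similar concepts and may amplify bias.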


📘 Detailed Summary

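Motivation: With the widespread adoption of text-to-image generation, concerns about privacy, bias, and copyright have grown. Concept erasure techniques offer a way to remove specific concepts from pre-trained models without full retraining. However, existing evaluations are usually limited to a small set of concepts and rely on overly simple, direct prompts, so they cannot fully test the boundaries and effectiveness of these techniques.

Method: The study proposes the EMMA benchmark, which evaluates five key dimensions of concept erasure over 12 metrics, going beyond standard metrics such as image quality and time efficiency. EMMA focuses on challenging conditions, including indirect descriptions, visually similar non-target concepts, and potential gender and ethnicity bias, providing a socially aware analysis of method behavior.

Result: Five concept erasure methods are evaluated with EMMA across five domains (objects, celebrities, art styles, NSFW content, and copyright). Existing methods struggle with indirect prompts: they still generate the erased concept when it is referenced indirectly. They also struggle with visually similar non-target concepts, failing to generate non-target concepts that resemble the erased one, and some methods amplify gender and ethnicity bias compared to the original model.

Conclusion: Current concept erasure techniques do not yet truly remove target concepts from model representations, with notable shortcomings on indirect descriptions and visually similar concepts. The findings underscore the need for more robust, socially aware concept erasure methods and provide a comprehensive evaluation framework for future research.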


📄 Abstract

The widespread adoption of text-to-image (T2I) generation has raised concerns about privacy, bias, and copyright violations. Concept erasure techniques offer a promising solution by selectively removing undesired concepts from pre-trained models without requiring full retraining. However, these methods are often evaluated on a limited set of concepts, relying on overly simplistic and direct prompts. To test the boundaries of concept erasure techniques, and assess whether they truly remove targeted concepts from model representations, we introduce EMMA, a benchmark that evaluates five key dimensions of concept erasure over 12 metrics. EMMA goes beyond standard metrics like image quality and time efficiency, testing robustness under challenging conditions, including indirect descriptions, visually similar non-target concepts, and potential gender and ethnicity bias, providing a socially aware analysis of method behavior. Using EMMA, we analyze five concept erasure methods across five domains (objects, celebrities, art styles, NSFW, and copyright). Our results show that existing methods struggle with implicit prompts (i.e., generating the erased concept when it is indirectly referenced) and visually similar non-target concepts (i.e., failing to generate non-targeted concepts resembling the erased one), while some amplify gender and ethnicity bias compared to the original model.

[27] Democratizing Pathology Co-Pilots: An Open Pipeline and Dataset for Whole-Slide Vision-Language Modelling

Sander Moonemans, Sebastiaan Ram, Frédérique Meeuwsen, Carlijn Lems, Jeroen van der Laak, Geert Litjens, Francesco Ciompi

🧩 TL;DR

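This paper introduces Polysome, a tool for synthetic instruction generation, and uses it to create HISTAI-Instruct, a large whole-slide image instruction tuning dataset. The ANTONI-α vision-language model trained on this dataset outperforms MedGemma on whole-slide image visual question answering tasks of tissue identification, neoplasm detection, and differential diagnosis.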


📘 Detailed Summary

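Motivation: Current vision-language models (VLMs) have several limitations in pathology: most focus only on small regions of interest within whole-slide images (WSIs), provide only static slide-level outputs, or rely on non-public data, which limits reproducibility. In addition, training data that pairs WSIs with detailed clinical reports is scarce, hindering transparent and generalisable VLMs.

Method: The study makes three main contributions. First, it develops Polysome, a standardised tool for synthetic instruction generation. Second, it applies Polysome to the public HISTAI dataset to generate HISTAI-Instruct, a large whole-slide instruction tuning dataset spanning 24,259 slides and over 1.1 million instruction-response pairs. Finally, it uses the dataset to train ANTONI-α, a VLM capable of visual question answering.

Result: ANTONI-α outperforms MedGemma on WSI-level visual question answering tasks of tissue identification, neoplasm detection, and differential diagnosis. The study also compares multiple ANTONI-α variants trained with different amounts of data to examine the effect of data scale on performance. All methods, data, and code are publicly available.

Conclusion: With a standardised tool and a public dataset, the study addresses data scarcity and reproducibility problems for pathology VLMs and indicates the value of large-scale instruction tuning data for model performance. ANTONI-α provides a transparent and generalisable approach to pathology assistance and moves medical AI toward greater openness and reproducibility.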


📄 Abstract

Vision-language models (VLMs) have the potential to become co-pilots for pathologists. However, most VLMs either focus on small regions of interest within whole-slide images, provide only static slide-level outputs, or rely on data that is not publicly available, limiting reproducibility. Furthermore, training data containing WSIs paired with detailed clinical reports is scarce, restricting progress toward transparent and generalisable VLMs. We address these limitations with three main contributions. First, we introduce Polysome, a standardised tool for synthetic instruction generation. Second, we apply Polysome to the public HISTAI dataset, generating HISTAI-Instruct, a large whole-slide instruction tuning dataset spanning 24,259 slides and over 1.1 million instruction-response pairs. Finally, we use HISTAI-Instruct to train ANTONI-α, a VLM capable of visual-question answering (VQA). We show that ANTONI-α outperforms MedGemma on WSI-level VQA tasks of tissue identification, neoplasm detection, and differential diagnosis. We also compare the performance of multiple incarnations of ANTONI-α trained with different amounts of data. All methods, data, and code are publicly available.

[28] Towards Deeper Emotional Reflection: Crafting Affective Image Filters with Generative Priors

Peixuan Zhang, Shuchen Weng, Jiajun Tang, Si Li, Boxin Shi

🧩 TL;DR

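This paper proposes the Affective Image Filter (AIF) task, which aims to turn visually abstract emotions in text into visually concrete images, and develops two models: AIF-B, based on a multi-modal transformer architecture, and AIF-D, which leverages generative priors from pre-trained diffusion models. Both outperform existing methods in content consistency and emotional fidelity.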


📘 Detailed Summary

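Motivation: Social media users express emotions through text and images, but existing methods struggle to translate visually abstract emotions in text into visually concrete images. This motivates the Affective Image Filter task, which reflects the emotion of a text in an image.

Method: The paper first builds the AIF dataset and formulates the AIF models. It presents AIF-B, based on a multi-modal transformer architecture, as an initial attempt, then develops AIF-D, which achieves deeper emotional reflection by leveraging generative priors from pre-trained large-scale diffusion models.

Result: Quantitative and qualitative experiments show that AIF models outperform state-of-the-art methods in both content consistency and emotional fidelity. Extensive user studies confirm that AIF models are significantly more effective at evoking specific emotions.

Conclusion: The study demonstrates the value and potential of AIF models and opens a new research direction for multimodal affective computing and image generation. Based on the results, the authors discuss the models' prospects in applications and future directions.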


📄 Abstract

Social media platforms enable users to express emotions by posting text with accompanying images. In this paper, we propose the Affective Image Filter (AIF) task, which aims to reflect visually-abstract emotions from text into visually-concrete images, thereby creating emotionally compelling results. We first introduce the AIF dataset and the formulation of the AIF models. Then, we present AIF-B as an initial attempt based on a multi-modal transformer architecture. After that, we propose AIF-D as an extension of AIF-B towards deeper emotional reflection, effectively leveraging generative priors from pre-trained large-scale diffusion models. Quantitative and qualitative experiments demonstrate that AIF models achieve superior performance for both content consistency and emotional fidelity compared to state-of-the-art methods. Extensive user study experiments demonstrate that AIF models are significantly more effective at evoking specific emotions. Based on the presented results, we comprehensively discuss the value and potential of AIF models.

[29] AIFloodSense: A Global Aerial Imagery Dataset for Semantic Segmentation and Understanding of Flooded Environments

Georgios Simantiris, Konstantinos Bacharidis, Apostolos Papanikolaou, Petros Giannakakis, Costas Panagiotakis

🧩 TL;DR

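This paper presents AIFloodSense, a comprehensive public dataset of 470 high-resolution aerial images from 230 flood events worldwide. It supports three tasks, image classification, semantic segmentation, and visual question answering, and aims to advance domain-generalized AI tools for climate resilience.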


📘 Detailed Summary

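Motivation: Flood detection is critical for disaster response and risk assessment, but existing flood segmentation datasets are scarce, geographically limited, and lacking in annotation detail, which hinders robust, generalized computer vision methods. A comprehensive dataset with global diversity and temporal relevance is needed to fill this gap.

Method: The paper builds AIFloodSense, comprising 470 high-resolution aerial images from 230 flood events across 64 countries and six continents. It supports three complementary tasks: image classification (with sub-tasks for environment type, camera angle, and continent recognition), semantic segmentation (pixel-level masks for flood, sky, and buildings), and visual question answering. Baselines are established for all tasks using state-of-the-art architectures.

Result: Baseline benchmarks for all tasks demonstrate the dataset's complexity and its value for advancing domain-generalized AI tools. The dataset offers global diversity and temporal relevance (2022-2024), making it a useful resource for robust and generalizable flood detection.

Conclusion: By providing globally diverse flood imagery, AIFloodSense advances computer vision tools for climate resilience. It supports multiple tasks, lays a foundation for robust, generalized flood detection models, and enables natural language reasoning for disaster assessment.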


📄 Abstract

Accurate flood detection from visual data is a critical step toward improving disaster response and risk assessment, yet datasets for flood segmentation remain scarce due to the challenges of collecting and annotating large-scale imagery. Existing resources are often limited in geographic scope and annotation detail, hindering the development of robust, generalized computer vision methods. To bridge this gap, we introduce AIFloodSense, a comprehensive, publicly available aerial imagery dataset comprising 470 high-resolution images from 230 distinct flood events across 64 countries and six continents. Unlike prior benchmarks, AIFloodSense ensures global diversity and temporal relevance (2022-2024), supporting three complementary tasks: (i) Image Classification with novel sub-tasks for environment type, camera angle, and continent recognition; (ii) Semantic Segmentation providing precise pixel-level masks for flood, sky, and buildings; and (iii) Visual Question Answering (VQA) to enable natural language reasoning for disaster assessment. We establish baseline benchmarks for all tasks using state-of-the-art architectures, demonstrating the dataset's complexity and its value in advancing domain-generalized AI tools for climate resilience.

[30] Xiaomi MiMo-VL-Miloco Technical Report

Jiaze Li, Jingyang Chen, Yuxun Qu, Jianzhong Ju, Zhenbo Luo, Jian Luan, Shijie Xu, Zhenru Lin, Junyou Zhu, Boshen Xu, Wenhui Tan, Pei Fu

🧩 TL;DR

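This paper open-sources MiMo-VL-Miloco-7B and its quantized variant, a pair of vision-language models for smart-home environments that perform strongly on both home-scenario understanding and general multimodal reasoning, balancing specialization and generality through a two-stage training pipeline.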


📘 Detailed Summary

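Motivation: The study addresses the balance between specialization for smart-home environments and general reasoning ability in vision-language models. Existing models fall short on home-scenario understanding, so a model is needed that handles home-specific tasks (such as gesture recognition and activity understanding) while retaining broad multimodal reasoning.

Method: Built on the MiMo-VL-7B backbone, the model is trained with a two-stage pipeline that combines supervised fine-tuning with reinforcement learning based on Group Relative Policy Optimization, leveraging efficient multi-domain data. Chain-of-thought supervision and token-budget-aware reasoning are incorporated so that the model learns knowledge in a data-efficient manner and reasons efficiently.

Result: MiMo-VL-Miloco-7B attains leading F1 scores on home-scenario understanding and delivers gains on video benchmarks such as Video-MME, Video-MMMU, and Charades-STA as well as language understanding benchmarks including MMMU-Pro and MMLU-Pro, outperforming several closed-source and open-source baselines. The quantized GGUF variant eases deployment.

Conclusion: Targeted home-scenario training not only enhances activity and gesture understanding but also improves text-only reasoning, with only modest trade-offs on document-centric tasks. The work provides open-source models and an evaluation toolkit for smart-home applications, supporting real-world deployment and research.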


📄 Abstract

We open-source MiMo-VL-Miloco-7B and its quantized variant MiMo-VL-Miloco-7B-GGUF, a pair of home-centric vision-language models that achieve strong performance on both home-scenario understanding and general multimodal reasoning. Built on the MiMo-VL-7B backbone, MiMo-VL-Miloco-7B is specialized for smart-home environments, attaining leading F1 scores on gesture recognition and common home-scenario understanding, while also delivering consistent gains across video benchmarks such as Video-MME, Video-MMMU, and Charades-STA, as well as language understanding benchmarks including MMMU-Pro and MMLU-Pro. In our experiments, MiMo-VL-Miloco-7B outperforms strong closed-source and open-source baselines on home-scenario understanding and several multimodal reasoning benchmarks. To balance specialization and generality, we design a two-stage training pipeline that combines supervised fine-tuning with reinforcement learning based on Group Relative Policy Optimization, leveraging efficient multi-domain data. We further incorporate chain-of-thought supervision and token-budget-aware reasoning, enabling the model to learn knowledge in a data-efficient manner while also performing reasoning efficiently. Our analysis shows that targeted home-scenario training not only enhances activity and gesture understanding, but also improves text-only reasoning with only modest trade-offs on document-centric tasks. Model checkpoints, quantized GGUF weights, and our home-scenario evaluation toolkit are publicly available at https://github.com/XiaoMi/xiaomi-mimo-vl-miloco to support research and deployment in real-world smart-home applications.
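
The reinforcement learning stage above is based on Group Relative Policy Optimization. As a rough illustration of the group-relative idea only, the sketch below normalizes each sampled response's reward against the mean and standard deviation of its own group. The function name, the `eps` constant, the use of the population standard deviation, and the reward values are assumptions for illustration, not details taken from the report.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the group it was sampled in.

    Assumption: advantage = (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std of the group (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for the same prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that beat their group's average get a positive advantage and the others a negative one, so no separate value network is needed to form a baseline.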

[31] LangDriveCTRL: Natural Language Controllable Driving Scene Editing with Multi-modal Agents

Yun He, Francesco Pittaluga, Ziyu Jiang, Matthias Zwicker, Manmohan Chandraker, Zaid Tasneem

🧩 TL;DR

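LangDriveCTRL is a natural-language-controlled framework for editing real-world driving videos. Using explicit 3D scene decomposition and an agentic pipeline, it supports fine-grained editing of object nodes and multi-object behavior in traffic scenes, and it achieves nearly 2× higher instruction alignment than the previous best method.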


📘 Detailed Summary

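Motivation: Existing driving-video editing methods are limited in fine-grained control, preservation of realism, and coordination of multi-object behavior. They struggle to edit a complex traffic scene comprehensively from a single natural-language instruction, which calls for a unified framework that supports both object node editing and multi-object behavior editing.

Method: The framework uses explicit 3D scene decomposition to represent a driving video as a scene graph containing a static background and dynamic objects, and builds an agentic pipeline on top of it. An Orchestrator turns user instructions into execution graphs; an Object Grounding Agent matches free-form text descriptions to nodes in the scene graph; a Behavior Editing Agent generates multi-object trajectories from language instructions; and a Behavior Reviewer Agent iteratively reviews and refines those trajectories. The edited scene graph is rendered, and a video diffusion tool then repairs rendering artifacts.

Result: LangDriveCTRL achieves nearly 2× higher instruction alignment than the previous state of the art, with superior structural preservation, photorealism, and traffic realism. From a single natural-language instruction it supports both object node editing (removal, insertion, and replacement) and multi-object behavior editing.

Conclusion: Combining an explicit 3D representation with an agentic pipeline makes natural-language-driven editing of complex driving scenes practical. The summary positions the approach as a tool for autonomous-driving simulation, data augmentation, and content creation, and as a new direction for multimodal scene understanding and control.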


📄 Abstract

LangDriveCTRL is a natural-language-controllable framework for editing real-world driving videos to synthesize diverse traffic scenarios. It leverages explicit 3D scene decomposition to represent driving videos as a scene graph, containing static background and dynamic objects. To enable fine-grained editing and realism, it incorporates an agentic pipeline in which an Orchestrator transforms user instructions into execution graphs that coordinate specialized agents and tools. Specifically, an Object Grounding Agent establishes correspondence between free-form text descriptions and target object nodes in the scene graph; a Behavior Editing Agent generates multi-object trajectories from language instructions; and a Behavior Reviewer Agent iteratively reviews and refines the generated trajectories. The edited scene graph is rendered and then refined using a video diffusion tool to address artifacts introduced by object insertion and significant view changes. LangDriveCTRL supports both object node editing (removal, insertion and replacement) and multi-object behavior editing from a single natural-language instruction. Quantitatively, it achieves nearly $2\times$ higher instruction alignment than the previous SoTA, with superior structural preservation, photorealism, and traffic realism. Project page is available at: https://yunhe24.github.io/langdrivectrl/.
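
To make the notion of an execution graph concrete, here is a minimal sketch that models the pipeline steps named in the abstract as a dependency graph and derives a valid run order with Python's standard `graphlib`. The step names and the strictly linear dependencies are hypothetical; the abstract does not specify the graph format the Orchestrator actually emits.

```python
from graphlib import TopologicalSorter

# Hypothetical execution graph: each step maps to the steps it depends on.
execution_graph = {
    "ground_objects": set(),                 # Object Grounding Agent
    "edit_behavior": {"ground_objects"},     # Behavior Editing Agent
    "review_behavior": {"edit_behavior"},    # Behavior Reviewer Agent
    "render_scene": {"review_behavior"},     # render the edited scene graph
    "diffusion_refine": {"render_scene"},    # video diffusion tool
}

# A topological order runs every step after all of its dependencies.
order = list(TopologicalSorter(execution_graph).static_order())
print(order)
```

A graph rather than a fixed script lets an orchestrator skip or reorder steps per instruction, for example omitting the behavior steps for a pure object-removal edit.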

[32] MULTIAQUA: A multimodal maritime dataset and robust training strategies for multimodal semantic segmentation

Jon Muhovič, Janez Perš

🧩 TL;DR

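This paper presents MULTIAQUA, a new multimodal maritime dataset with data from RGB, thermal, IR, and LIDAR sensors, among others. It targets scene understanding for unmanned surface vehicles in poor visibility and shows how to train robust multimodal networks using only daytime images.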


📘 Detailed Summary

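Motivation: Unmanned surface vehicles encounter varied and sometimes difficult visual conditions during operation. Under some weather and lighting conditions (such as nighttime or poor visibility), color camera images alone are not enough to interpret the scene accurately. Existing maritime data lacks synchronized, calibrated, and annotated multimodal recordings, which holds back multimodal perception methods.

Method: The authors build the MULTIAQUA dataset, which contains synchronized, calibrated, and annotated data from sensors of different modalities, including RGB, thermal, IR, and LIDAR. They also propose training approaches that let a model trained only on daytime images remain robust at night and in low visibility.

Result: Several multimodal methods are evaluated on the proposed difficult nighttime test set. The results support the proposed training approaches, which yield multimodal networks that stay reliable even in near-complete darkness and significantly simplify data acquisition, annotation, and training.

Conclusion: MULTIAQUA fills a gap in multimodal maritime data. The proposed training approaches keep multimodal perception robust under poor visibility, provide a technical basis for reliable environment perception on unmanned surface vehicles, and reduce the data needed for real-world deployment.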


📄 Abstract

Unmanned surface vehicles can encounter a number of varied visual circumstances during operation, some of which can be very difficult to interpret. While most cases can be solved using only color camera images, some weather and lighting conditions require additional information. To expand the available maritime data, we present a novel multimodal maritime dataset MULTIAQUA (Multimodal Aquatic Dataset). Our dataset contains synchronized, calibrated and annotated data captured by sensors of different modalities, such as RGB, thermal, IR, LIDAR, etc. The dataset is aimed at developing supervised methods that can extract useful information from these modalities in order to provide a high quality of scene interpretation regardless of potentially poor visibility conditions. To illustrate the benefits of the proposed dataset, we evaluate several multimodal methods on our difficult nighttime test set. We present training approaches that enable multimodal methods to be trained in a more robust way, thus enabling them to retain reliable performance even in near-complete darkness. Our approach allows for training a robust deep neural network using only daytime images, thus significantly simplifying data acquisition, annotation, and the training process.

[33] LumiCtrl: Learning Illuminant Prompts for Lighting Control in Personalized Text-to-Image Models

Muhammad Atif Butt, Kai Wang, Javier Vazquez-Corral, Joost Van De Weijer

🧩 TL;DR

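This paper proposes LumiCtrl, an illuminant personalization method for text-to-image models. It learns an illuminant prompt from a single image of an object, giving precise control over scene lighting in generated images and markedly improving illuminant fidelity and aesthetic quality.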


📘 Detailed Summary

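Motivation: Text-to-image models have made remarkable progress in creative image generation, but they still lack precise control over scene illuminants. Lighting control matters to content designers who want to shape the mood, atmosphere, and visual aesthetics of generated images, and this gap limits the usefulness of image generation in professional design work.

Method: LumiCtrl has three core components. First, physics-based illuminant augmentation along the Planckian locus creates fine-tuning variants under standard illuminants. Second, edge-guided prompt disentanglement with a frozen ControlNet keeps the prompts focused on illumination rather than structure. Third, a masked reconstruction loss focuses learning on the foreground object while letting the background adapt contextually, which the authors call contextual light adaptation.

Result: LumiCtrl achieves significantly better illuminant fidelity, aesthetic quality, and scene coherence than existing personalization baselines in both quantitative and qualitative comparisons. A human preference study further shows a strong user preference for LumiCtrl outputs.

Conclusion: Combining physics-based augmentation, prompt disentanglement, and contextual adaptation addresses the lack of precise lighting control in text-to-image models and opens a technical path for professional image generation and design. The authors plan to release code and data upon publication.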


📄 Abstract

Current text-to-image (T2I) models have demonstrated remarkable progress in creative image generation, yet they still lack precise control over scene illuminants, which is a crucial factor for content designers aiming to manipulate the mood, atmosphere, and visual aesthetics of generated images. In this paper, we present an illuminant personalization method named LumiCtrl that learns an illuminant prompt given a single image of an object. LumiCtrl consists of three basic components: given an image of the object, our method applies (a) physics-based illuminant augmentation along the Planckian locus to create fine-tuning variants under standard illuminants; (b) edge-guided prompt disentanglement using a frozen ControlNet to ensure prompts focus on illumination rather than structure; and (c) a masked reconstruction loss that focuses learning on the foreground object while allowing the background to adapt contextually, enabling what we call contextual light adaptation. We qualitatively and quantitatively compare LumiCtrl against other T2I customization methods. The results show that our method achieves significantly better illuminant fidelity, aesthetic quality, and scene coherence compared to existing personalization baselines. A human preference study further confirms strong user preference for LumiCtrl outputs. The code and data will be released upon publication.
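
Component (c), the masked reconstruction loss, can be pictured as an error averaged only over foreground pixels. The sketch below is a minimal version under that assumption: it uses a squared error on flattened pixel lists with a binary mask, and the function name and sample values are hypothetical. The loss actually used in the paper (for instance in a diffusion latent space) may be defined differently.

```python
def masked_reconstruction_loss(pred, target, mask):
    """Mean squared error over pixels where mask is 1 (the foreground).

    pred, target, mask: flat lists of equal length; mask holds 0s and 1s.
    """
    total = 0.0
    count = 0
    for p, t, m in zip(pred, target, mask):
        total += m * (p - t) ** 2
        count += m
    # With an empty mask there is nothing to supervise.
    return total / count if count else 0.0


# Hypothetical 4-pixel example: the first two pixels are foreground.
pred = [0.2, 0.8, 0.5, 0.1]
target = [0.0, 1.0, 0.5, 0.9]
mask = [1, 1, 0, 0]
loss = masked_reconstruction_loss(pred, target, mask)
print(loss)
```

The large error on the last pixel does not contribute because it lies in the background, which is what leaves the background free to adapt to the new lighting.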

[34] MMLANDMARKS: a Cross-View Instance-Level Benchmark for Geo-Spatial Understanding

Oskar Kristoffersen, Alba R. Sánchez, Morten R. Hannemose, Anders B. Dahl, Dim P. Papadopoulos

🧩 TL;DR

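This paper introduces MMLANDMARKS, a multimodal landmark dataset with four modalities of geo-spatial data, and uses a simple CLIP-style baseline to show broad generalization and competitive performance across a range of geo-spatial tasks.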


📘 Detailed Summary

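Motivation: Current geo-spatial benchmarks cover only a limited set of modalities, so current approaches cannot integrate all relevant modalities within a unified framework. This considerably restricts progress, because a real-world location can be described in many ways: images from various viewpoints, textual descriptions, and geographic coordinates.

Method: The authors introduce the Multi-Modal Landmark dataset (MMLANDMARKS), which covers 18,557 distinct landmarks in the United States with four modalities: 197k high-resolution aerial images, 329k ground-view images, textual information, and geographic coordinates. The modalities are in one-to-one correspondence, and a simple CLIP-style baseline is trained and evaluated on the data.

Result: The dataset supports training and benchmarking for several geo-spatial tasks, including cross-view ground-to-satellite retrieval, ground and satellite geolocalization, text-to-image retrieval, and text-to-GPS retrieval. The baseline generalizes broadly and is competitive with off-the-shelf foundation models and specialized state-of-the-art models across tasks.

Conclusion: The results illustrate that multimodal datasets are necessary for broad geo-spatial understanding. MMLANDMARKS closes the modality-coverage gap in current geo-spatial benchmarks and provides a resource for unified multimodal geo-spatial analysis.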


📄 Abstract

Geo-spatial analysis of our world benefits from a multimodal approach, as every single geographic location can be described in numerous ways (images from various viewpoints, textual descriptions, and geographic coordinates). Current geo-spatial benchmarks have limited coverage across modalities, considerably restricting progress in the field, as current approaches cannot integrate all relevant modalities within a unified framework. We introduce the Multi-Modal Landmark dataset (MMLANDMARKS), a benchmark composed of four modalities: 197k high-resolution aerial images, 329k ground-view images, textual information, and geographic coordinates for 18,557 distinct landmarks in the United States. The MMLANDMARKS dataset has a one-to-one correspondence across every modality, which enables training and benchmarking models for various geo-spatial tasks, including cross-view Ground-to-Satellite retrieval, ground and satellite geolocalization, Text-to-Image, and Text-to-GPS retrieval. We demonstrate broad generalization and competitive performance against off-the-shelf foundational models and specialized state-of-the-art models across different tasks by employing a simple CLIP-inspired baseline, illustrating the necessity for multimodal datasets to achieve broad geo-spatial understanding.
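
For readers unfamiliar with what a CLIP-inspired baseline optimizes, the sketch below implements the standard symmetric contrastive objective on paired embeddings from two modalities (say, ground-view and aerial). It is a generic pure-Python illustration, not the authors' code; the temperature of 0.07, the function names, and the toy two-dimensional embeddings are assumptions.

```python
import math


def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _cross_entropy(logits, target_index):
    # Numerically stable log-sum-exp minus the logit of the correct pair.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target_index]


def clip_style_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric contrastive loss; row i of emb_a is paired with row i of emb_b."""
    a = [_normalize(v) for v in emb_a]
    b = [_normalize(v) for v in emb_b]
    n = len(a)
    # Cosine similarity between every cross-modal pair, scaled by temperature.
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    loss_ab = sum(_cross_entropy(logits[i], i) for i in range(n)) / n
    loss_ba = sum(
        _cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (loss_ab + loss_ba)


# Hypothetical embeddings for two landmarks in two modalities.
ground = [[1.0, 0.0], [0.0, 1.0]]
aerial = [[0.9, 0.1], [0.1, 0.9]]
aligned = clip_style_loss(ground, aerial)
mismatched = clip_style_loss(ground, aerial[::-1])
print(aligned, mismatched)
```

The loss is small when each ground-view embedding sits closest to its own aerial counterpart and grows when the pairing is scrambled, which is the property that makes the one-to-one correspondence in the dataset useful for training.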

[35] GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation

Rang Li, Lei Li, Shuhuai Ren, Hao Tian, Shuhao Gu, Shicheng Li, Zihao Yue, Yudong Wang, Wenhan Ma, Zhe Yang, Jingyuan Ma, Zhifang Sui, Fuli Luo

🧩 TL;DR

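This paper introduces GroundingME, a benchmark for systematically evaluating the true visual grounding ability of multimodal large language models. It exposes a significant capability gap in complex real-world scenarios and explores strategies for improvement.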


📘 Detailed Summary

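Motivation: Multimodal large language models score highly on existing visual grounding benchmarks, but it is unclear whether they ground language in vision with human-like sophistication or merely pattern-match on simplified datasets. Existing benchmarks miss real-world complexity, including ambiguous references and cases where grounding is impossible, so a more rigorous evaluation is needed to reveal the real gap.

Method: GroundingME challenges models along four dimensions: Discriminative (distinguishing highly similar objects), Spatial (understanding complex relational descriptions), Limited (handling occlusions or tiny objects), and Rejection (recognizing ungroundable queries). The benchmark contains 1,005 challenging, human-verified examples. Two improvement strategies are explored: test-time scaling, which selects the best response based on its thinking trajectory, and data-mixture training, which teaches models to recognize ungroundable queries.

Result: An evaluation of 25 state-of-the-art MLLMs reveals a large capability gap: the best model reaches only 45.1% accuracy, and most models score 0% on rejection tasks, hallucinating objects rather than acknowledging their absence. The improvement strategies help to a degree: test-time scaling improves complex grounding by up to 2.9%, and data-mixture training raises rejection accuracy from 0% to 27.9%.

Conclusion: GroundingME exposes serious limitations of current MLLMs in visual grounding, especially in handling ambiguous references and recognizing ungroundable cases. It serves both as a diagnostic tool for current models and as a roadmap toward human-level visual grounding, and it highlights safety concerns for deployment.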


📄 Abstract

Visual grounding, localizing objects from natural language descriptions, represents a critical bridge between language and vision understanding. While multimodal large language models (MLLMs) achieve impressive scores on existing benchmarks, a fundamental question remains: can MLLMs truly ground language in vision with human-like sophistication, or are they merely pattern-matching on simplified datasets? Current benchmarks fail to capture real-world complexity where humans effortlessly navigate ambiguous references and recognize when grounding is impossible. To rigorously assess MLLMs' true capabilities, we introduce GroundingME, a benchmark that systematically challenges models across four critical dimensions: (1) Discriminative, distinguishing highly similar objects, (2) Spatial, understanding complex relational descriptions, (3) Limited, handling occlusions or tiny objects, and (4) Rejection, recognizing ungroundable queries. Through careful curation combining automated generation with human verification, we create 1,005 challenging examples mirroring real-world complexity. Evaluating 25 state-of-the-art MLLMs reveals a profound capability gap: the best model achieves only 45.1% accuracy, while most score 0% on rejection tasks, reflexively hallucinating objects rather than acknowledging their absence, raising critical safety concerns for deployment. We explore two strategies for improvement: (1) test-time scaling, which selects the optimal response by thinking trajectory and improves complex grounding by up to 2.9%, and (2) data-mixture training, which teaches models to recognize ungroundable queries and boosts rejection accuracy from 0% to 27.9%. GroundingME thus serves as both a diagnostic tool revealing current limitations in MLLMs and a roadmap toward human-level visual grounding.
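
The sketch below shows one way a grounding benchmark with a rejection category could be scored. It assumes boxes in `(x1, y1, x2, y2)` form, an IoU threshold of 0.5, and `None` as the marker for "no such object"; all three are common conventions chosen here for illustration, and the benchmark's actual protocol may differ.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_correct(prediction, ground_truth, threshold=0.5):
    """None means 'ungroundable' (ground truth) or 'rejected' (prediction)."""
    if ground_truth is None:
        return prediction is None   # only a rejection counts as correct
    if prediction is None:
        return False                # rejected a query that has an answer
    return iou(prediction, ground_truth) >= threshold


def accuracy(samples):
    return sum(is_correct(p, g) for p, g in samples) / len(samples)


# Hypothetical (prediction, ground truth) pairs.
samples = [
    ((0, 0, 10, 10), (0, 0, 10, 10)),   # exact match
    ((0, 0, 10, 10), (5, 5, 15, 15)),   # poor overlap
    ((0, 0, 5, 5), None),               # hallucinated box on an ungroundable query
    (None, None),                       # correct rejection
]
print(accuracy(samples))
```

Under this scoring, a model that always outputs some box gets every rejection sample wrong, which matches the 0% rejection scores reported for most models.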

[36] FLEG: Feed-Forward Language Embedded Gaussian Splatting from Any Views

Qijian Tian, Xin Tan, Jiayu Ying, Xuhong Wang, Yuan Xie, Lizhuang Ma

🧩 TL;DR

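This paper presents FLEG, a feed-forward network that reconstructs language-embedded 3D Gaussians from any views. It addresses the reliance of existing methods on fixed input views and the shortage of 3D training data, enabling 2D-to-3D lifting without 3D annotations.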


📘 Detailed Summary

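Motivation: Existing approaches that combine feed-forward reconstruction with Gaussian heads have two main limitations: they depend on fixed input views, and they lack sufficient 3D training data. This limits the flexibility and scalability of 2D-to-3D lifting from arbitrary uncalibrated and unposed multi-view images.

Method: The authors propose a 3D-annotation-free training framework that uses large-scale video data with easily obtained 2D instance information to enrich the semantic embedding. Instance-guided contrastive learning aligns 2D semantics with the 3D representations, and a geometry-semantic hierarchical sparsification strategy reduces the memory and computational cost of dense views.

Result: FLEG efficiently reconstructs language-embedded 3D Gaussian representations from arbitrary sparse or dense views, jointly producing accurate geometry, high-fidelity appearance, and language-aligned semantics. Extensive experiments show that it outperforms existing methods on various related tasks.

Conclusion: The work shows that language-embedded 3D representations can be reconstructed from arbitrary multi-view images without 3D annotations, which makes large-scale video data usable for 3D semantic understanding, while hierarchical sparsification keeps the computational cost of dense views manageable.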


📄 Abstract

We present FLEG, a feed-forward network that reconstructs language-embedded 3D Gaussians from any views. Previous straightforward solutions combine feed-forward reconstruction with Gaussian heads but suffer from fixed input views and insufficient 3D training data. In contrast, we propose a 3D-annotation-free training framework for 2D-to-3D lifting from arbitrary uncalibrated and unposed multi-view images. Since the framework does not require 3D annotations, we can leverage large-scale video data with easily obtained 2D instance information to enrich semantic embedding. We also propose instance-guided contrastive learning to align 2D semantics with the 3D representations. In addition, to mitigate the high memory and computational cost of dense views, we further propose a geometry-semantic hierarchical sparsification strategy. Our FLEG efficiently reconstructs a language-embedded 3D Gaussian representation in a feed-forward manner from arbitrary sparse or dense views, jointly producing accurate geometry, high-fidelity appearance, and language-aligned semantics. Extensive experiments show that it outperforms existing methods on various related tasks. Project page: https://fangzhou2000.github.io/projects/fleg.

[37] G3Splat: Geometrically Consistent Generalizable Gaussian Splatting

Mehdi Hosseinzadeh, Shin-Fang Chng, Yi Xu, Simon Lucey, Ian Reid, Ravi Garg

🧩 TL;DR

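This paper proposes G3Splat, which introduces geometric priors to resolve the geometric ambiguity that arises when 3D Gaussian splats are learned from a view-synthesis loss alone. It enables pose-free generalizable splatting and achieves state-of-the-art results in geometrically consistent reconstruction, relative pose estimation, and novel-view synthesis.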


📘 Detailed Summary

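Motivation: Existing methods adapt multi-view structure prediction networks to regress per-pixel 3D Gaussian parameters and rely almost exclusively on view-synthesis supervision. A view-synthesis loss alone cannot recover geometrically meaningful splats, and the ambiguity is especially pronounced in the pose-free generalizable splatting setting.

Method: G3Splat enforces geometric priors to obtain geometric consistency, addressing the ambiguities of learning 3D Gaussian splats under self-supervision. The model is trained on RE10K and produces geometrically consistent 3D scene representations.

Result: On RE10K, G3Splat achieves state-of-the-art performance in geometrically consistent reconstruction, relative pose estimation, and novel-view synthesis. In zero-shot generalization experiments on ScanNet, it substantially outperforms prior work in both geometry recovery and relative pose estimation.

Conclusion: A view-synthesis loss alone is insufficient for learning geometrically meaningful 3D Gaussian splats, and geometric priors are essential for pose-free generalizable splatting. The method offers a geometrically consistent framework for learning 3D scene representations.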


📄 Abstract

3D Gaussians have recently emerged as an effective scene representation for real-time splatting and accurate novel-view synthesis, motivating several works to adapt multi-view structure prediction networks to regress per-pixel 3D Gaussians from images. However, most prior work extends these networks to predict additional Gaussian parameters -- orientation, scale, opacity, and appearance -- while relying almost exclusively on view-synthesis supervision. We show that a view-synthesis loss alone is insufficient to recover geometrically meaningful splats in this setting. We analyze and address the ambiguities of learning 3D Gaussian splats under self-supervision for pose-free generalizable splatting, and introduce G3Splat, which enforces geometric priors to obtain geometrically consistent 3D scene representations. Trained on RE10K, our approach achieves state-of-the-art performance in (i) geometrically consistent reconstruction, (ii) relative pose estimation, and (iii) novel-view synthesis. We further demonstrate strong zero-shot generalization on ScanNet, substantially outperforming prior work in both geometry recovery and relative pose estimation. Code and pretrained models are released on our project page (https://m80hz.github.io/g3splat/).

[38] HeadHunt-VAD: Hunting Robust Anomaly-Sensitive Heads in MLLM for Tuning-Free Video Anomaly Detection

Zhaolin Cai, Fan Li, Ziwei Zheng, Haixia Bi, Lijun He

🧩 TL;DR

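This paper proposes HeadHunt-VAD, a tuning-free paradigm for video anomaly detection that directly locates robust, anomaly-sensitive attention heads inside a multimodal large language model. This avoids the information loss, normalcy bias, and prompt sensitivity of methods that rely on textual outputs.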


📘 Detailed Summary

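Motivation: Current tuning-free video anomaly detection methods based on multimodal large language models mostly rely on textual outputs, which causes information loss, normalcy bias, and prompt sensitivity and makes subtle anomalous cues hard to capture. A method is needed that uses the model's internal representations directly instead of generated text.

Method: At the core of HeadHunt-VAD is a Robust Head Identification module that systematically evaluates all attention heads through a multi-criteria analysis of saliency and stability, identifying a sparse subset of expert heads that are consistently discriminative across diverse prompts. Features from these expert heads are fed into a lightweight anomaly scorer and a temporal locator for efficient and accurate anomaly detection.

Result: On two major video anomaly detection benchmarks, HeadHunt-VAD achieves state-of-the-art performance among tuning-free methods while maintaining high efficiency, supporting head-level probing in MLLMs as a practical approach to anomaly detection.

Conclusion: Directly locating anomaly-sensitive attention heads in a multimodal large language model is a powerful and practical approach to video anomaly detection. It sidesteps the limitations of text generation and suggests a way to use the internal representations of large models for fine-grained visual understanding.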


📄 Abstract

Video Anomaly Detection (VAD) aims to locate events that deviate from normal patterns in videos. Traditional approaches often rely on extensive labeled data and incur high computational costs. Recent tuning-free methods based on Multimodal Large Language Models (MLLMs) offer a promising alternative by leveraging their rich world knowledge. However, these methods typically rely on textual outputs, which introduces information loss, exhibits normalcy bias, and suffers from prompt sensitivity, making them insufficient for capturing subtle anomalous cues. To address these constraints, we propose HeadHunt-VAD, a novel tuning-free VAD paradigm that bypasses textual generation by directly hunting robust anomaly-sensitive internal attention heads within the frozen MLLM. Central to our method is a Robust Head Identification module that systematically evaluates all attention heads using a multi-criteria analysis of saliency and stability, identifying a sparse subset of heads that are consistently discriminative across diverse prompts. Features from these expert heads are then fed into a lightweight anomaly scorer and a temporal locator, enabling efficient and accurate anomaly detection with interpretable outputs. Extensive experiments show that HeadHunt-VAD achieves state-of-the-art performance among tuning-free methods on two major VAD benchmarks while maintaining high efficiency, validating head-level probing in MLLMs as a powerful and practical solution for real-world anomaly detection.
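
The abstract does not give the exact saliency and stability criteria, so the following is only a toy illustration of the selection idea: each head gets a per-prompt discriminability score, and heads are ranked by mean score (a stand-in for saliency) minus a penalty on the spread across prompts (a stand-in for stability). The scoring rule, the head identifiers, and all numbers are hypothetical.

```python
from statistics import mean, pstdev


def select_robust_heads(scores, k=2, stability_weight=1.0):
    """Pick the k heads that are both discriminative and stable.

    scores: maps a head id to a list of per-prompt discriminability values,
    e.g. how well that head separates anomalous from normal clips.
    """
    def robustness(head):
        values = scores[head]
        return mean(values) - stability_weight * pstdev(values)

    return sorted(scores, key=robustness, reverse=True)[:k]


# Hypothetical per-prompt scores for four attention heads.
scores = {
    "L10.H3": [0.80, 0.82, 0.79],  # strong and stable
    "L12.H7": [0.95, 0.20, 0.90],  # strong on some prompts, unstable
    "L05.H1": [0.55, 0.56, 0.54],  # moderate and stable
    "L20.H4": [0.10, 0.12, 0.11],  # weak
}
expert_heads = select_robust_heads(scores, k=2)
print(expert_heads)
```

The unstable head is passed over despite its high peak scores, which reflects the stated goal of finding heads that stay discriminative across diverse prompts.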

[39] PathFLIP: Fine-grained Language-Image Pretraining for Versatile Computational Pathology

Fengchun Liu, Songhan Jiang, Linghan Cai, Ziyue Wang, Yongbing Zhang

🧩 TL;DR

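This paper proposes PathFLIP, a fine-grained language-image pretraining framework for holistic interpretation of whole slide images (WSIs) in pathology. It decomposes slide-level captions into region-level subcaptions and generates text-conditioned region embeddings for precise visual-language alignment, improving on existing pathology vision-language models.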


📘 Detailed Summary

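Motivation: Vision-language models have made notable progress in computational pathology, but the gigapixel scale and spatial heterogeneity of whole slide images still challenge multimodal understanding. Existing alignment methods struggle to capture fine-grained correspondences between textual descriptions and the thousands of patches in a slide, which hurts downstream performance.

Method: PathFLIP decomposes slide-level captions into region-level subcaptions and generates text-conditioned region embeddings for precise visual-language grounding. By harnessing large language models, it can follow diverse clinical instructions, adapt to varied diagnostic contexts, and support multiple task paradigms.

Result: Extensive experiments on four representative benchmarks show that PathFLIP outperforms existing large-scale pathology vision-language models while requiring significantly less training data. It handles slide-level classification and retrieval, fine-grained lesion localization, and instruction following.

Conclusion: The work opens a path toward fine-grained, instruction-aware whole slide image interpretation in clinical practice. It shows that region-level text-visual alignment combined with large language models can improve multimodal understanding in pathology while reducing training-data requirements.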


📄 Abstract

While Vision-Language Models (VLMs) have achieved notable progress in computational pathology (CPath), the gigapixel scale and spatial heterogeneity of Whole Slide Images (WSIs) continue to pose challenges for multimodal understanding. Existing alignment methods struggle to capture fine-grained correspondences between textual descriptions and visual cues across thousands of patches from a slide, compromising their performance on downstream tasks. In this paper, we propose PathFLIP (Pathology Fine-grained Language-Image Pretraining), a novel framework for holistic WSI interpretation. PathFLIP decomposes slide-level captions into region-level subcaptions and generates text-conditioned region embeddings to facilitate precise visual-language grounding. By harnessing Large Language Models (LLMs), PathFLIP can seamlessly follow diverse clinical instructions and adapt to varied diagnostic contexts. Furthermore, it exhibits versatile capabilities across multiple paradigms, efficiently handling slide-level classification and retrieval, fine-grained lesion localization, and instruction following. Extensive experiments demonstrate that PathFLIP outperforms existing large-scale pathological VLMs on four representative benchmarks while requiring significantly less training data, paving the way for fine-grained, instruction-aware WSI interpretation in clinical practice.

[40] Generative Human-Object Interaction Detection via Differentiable Cognitive Steering of Multi-modal LLMs

Zhaolin Cai, Huiyu Duan, Zitong Xu, Fan Li, Zhi Liu, Jing Liu, Wei Shen, Xiongkuo Min, Guangtao Zhai

🧩 TL;DR

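This paper proposes GRASP-HO, a framework that recasts human-object interaction detection from a closed-set classification task as an open-vocabulary generation problem. A lightweight cognitive steering module injects fine-grained visual evidence into a frozen multimodal large language model, unifying discriminative perception and generative reasoning.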


📘 Detailed Summary

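Motivation: Existing human-object interaction (HOI) detection methods operate under a closed-world assumption and treat the task as classification over a small, predefined verb set, so they generalize poorly to the long tail of unseen or ambiguous interactions in the wild. Multimodal large language models have the rich world knowledge needed for open-vocabulary understanding, but because fine-tuning them is computationally prohibitive, they remain decoupled from existing HOI detectors.

Method: GRASP-HO first extracts hybrid interaction representations, then uses a lightweight learnable cognitive steering conduit (CSC) module to inject fine-grained visual evidence into a frozen MLLM for reasoning. To address the supervision mismatch between classification-based HOI datasets and open-vocabulary generative models, a hybrid guidance strategy couples a language modeling loss with an auxiliary classification loss, enabling discriminative grounding without sacrificing generative flexibility.

Result: Experiments show state-of-the-art closed-set performance and strong zero-shot generalization. The framework bridges discriminative perception and generative reasoning and offers a unified paradigm for open-world HOI detection.

Conclusion: By recasting HOI detection as open-vocabulary generation, the work combines the world knowledge of MLLMs with fine-grained visual evidence for open-world interaction understanding. The lightweight cognitive steering module avoids the cost of fine-tuning a large model, and the hybrid guidance strategy resolves the mismatch between classification supervision and generative models.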


📄 Abstract

Human-object interaction (HOI) detection aims to localize human-object pairs and the interactions between them. Existing methods operate under a closed-world assumption, treating the task as a classification problem over a small, predefined verb set, which struggles to generalize to the long-tail of unseen or ambiguous interactions in the wild. While recent multi-modal large language models (MLLMs) possess the rich world knowledge required for open-vocabulary understanding, they remain decoupled from existing HOI detectors since fine-tuning them is computationally prohibitive. To address these constraints, we propose GRASP-HO, a novel Generative Reasoning And Steerable Perception framework that reformulates HOI detection from a closed-set classification task to an open-vocabulary generation problem. To bridge vision and cognition, we first extract hybrid interaction representations, then design a lightweight learnable cognitive steering conduit (CSC) module to inject the fine-grained visual evidence into a frozen MLLM for effective reasoning. To address the supervision mismatch between classification-based HOI datasets and open-vocabulary generative models, we introduce a hybrid guidance strategy that couples the language modeling loss and an auxiliary classification loss, enabling discriminative grounding without sacrificing generative flexibility. Experiments demonstrate state-of-the-art closed-set performance and strong zero-shot generalization, achieving a unified paradigm that seamlessly bridges discriminative perception and generative reasoning for open-world HOI detection.
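
The hybrid guidance strategy couples a language modeling loss with an auxiliary classification loss. A minimal sketch of such a coupling is shown below, assuming a simple weighted sum of two cross-entropy terms; the weighting scheme, the `aux_weight` value, the function names, and the toy probabilities are assumptions and are not taken from the paper.

```python
import math


def cross_entropy(probs, target_index):
    """Negative log-likelihood of the target under a probability vector."""
    return -math.log(probs[target_index])


def hybrid_guidance_loss(token_probs, token_targets, verb_probs, verb_target,
                         aux_weight=0.5):
    """Language modeling loss plus a weighted auxiliary classification loss."""
    # Average next-token loss over the generated interaction description.
    lm_loss = sum(
        cross_entropy(p, t) for p, t in zip(token_probs, token_targets)
    ) / len(token_targets)
    # Closed-set verb classification loss from the HOI dataset label.
    cls_loss = cross_entropy(verb_probs, verb_target)
    return lm_loss + aux_weight * cls_loss


# Hypothetical predictions: two generated tokens and one verb classifier output.
token_probs = [[0.5, 0.5], [0.25, 0.75]]
token_targets = [0, 1]
verb_probs = [0.1, 0.9]
loss = hybrid_guidance_loss(token_probs, token_targets, verb_probs, verb_target=1)
print(loss)
```

Setting `aux_weight` to zero recovers pure generative training, so the weight controls how strongly the closed-set labels steer the model.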

[41] AdaptPrompt: Parameter-Efficient Adaptation of VLMs for Generalizable Deepfake Detection

Yichen Jiang, Mohammed Talha Alam, Sohail Ahmed Khan, Duc-Tien Dang-Nguyen, Fakhri Karray

🧩 TL;DR

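This paper proposes a CLIP-based framework for generalizable deepfake detection. With the large-scale Diff-Gen dataset and the parameter-efficient AdaptPrompt adaptation method, it detects synthetic content from diverse generators (GANs, diffusion models, and commercial tools) across domains and achieves state-of-the-art performance on 25 test sets.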


📘 Detailed Summary

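Motivation: Rapid advances in image generation have made highly realistic synthetic media widely available, which makes reliable deepfake detection harder. The key challenge is generalization: detectors trained on a narrow class of generators often fail on unseen models, so detection methods with stronger cross-domain generalization are needed.

Method: The work builds on the large vision-language model CLIP and proposes AdaptPrompt, a parameter-efficient transfer learning framework that jointly learns task-specific textual prompts and visual adapters while keeping the CLIP backbone frozen. A layer ablation shows that pruning the final transformer block of the vision encoder helps retain high-frequency generative artifacts and significantly boosts detection accuracy. The authors also introduce Diff-Gen, a large-scale benchmark dataset of 100k diffusion-generated fakes that capture broader spectral artifacts than traditional GAN datasets.

Result: Across 25 challenging test sets covering content generated by GANs, diffusion models, and commercial tools, the method sets a new state of the art in both standard and cross-domain scenarios. Models trained on Diff-Gen generalize better across domains, particularly to previously unseen image generators. The framework also performs well in few-shot generalization (using as few as 320 images) and in source attribution, identifying generator architectures precisely in closed-set settings.

Conclusion: Large vision-language models are effective for generalizable deepfake detection, and parameter-efficient adaptation together with targeted architectural changes can substantially improve cross-domain generalization. Diff-Gen provides a resource for training more robust detectors, and the flexibility of AdaptPrompt lets it adapt to different detection tasks.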


📄 Abstract

Recent advances in image generation have led to the widespread availability of highly realistic synthetic media, increasing the difficulty of reliable deepfake detection. A key challenge is generalization, as detectors trained on a narrow class of generators often fail when confronted with unseen models. In this work, we address the pressing need for generalizable detection by leveraging large vision-language models, specifically CLIP, to identify synthetic content across diverse generative techniques. First, we introduce Diff-Gen, a large-scale benchmark dataset comprising 100k diffusion-generated fakes that capture broad spectral artifacts unlike traditional GAN datasets. Models trained on Diff-Gen demonstrate stronger cross-domain generalization, particularly on previously unseen image generators. Second, we propose AdaptPrompt, a parameter-efficient transfer learning framework that jointly learns task-specific textual prompts and visual adapters while keeping the CLIP backbone frozen. We further show via layer ablation that pruning the final transformer block of the vision encoder enhances the retention of high-frequency generative artifacts, significantly boosting detection accuracy. Our evaluation spans 25 challenging test sets, covering synthetic content generated by GANs, diffusion models, and commercial tools, establishing a new state-of-the-art in both standard and cross-domain scenarios. We further demonstrate the framework's versatility through few-shot generalization (using as few as 320 images) and source attribution, enabling the precise identification of generator architectures in closed-set settings.

[42] Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing

Shilong Zhang, He Zhang, Zhifei Zhang, Chongjian Ge, Shuchen Xue, Shaoteng Liu, Mengwei Ren, Soo Ye Kim, Yuqian Zhou, Qing Liu, Daniil Pakhomov, Kai Zhang, Zhe Lin, Ping Luo

🧩 TL;DR

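This paper proposes a systematic framework for adapting understanding-oriented encoder features to generative tasks. A semantic-pixel reconstruction objective regularizes the latent space into a compact, semantically rich representation, and the resulting model achieves state-of-the-art reconstruction and substantial gains in text-to-image generation and image editing.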


📘 Detailed Summary

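Motivation: Modern latent diffusion models typically operate in low-level VAE latent spaces that are optimized mainly for pixel-level reconstruction. To unify visual generation and understanding, an emerging trend is to use high-dimensional features from representation encoders as generative latents, but this faces two fundamental obstacles. First, the discriminative feature space lacks compact regularization, so diffusion models tend to produce off-manifold latents that lead to inaccurate object structures. Second, the encoder's inherently weak pixel-level reconstruction keeps the generator from learning accurate fine-grained geometry and texture.

Method: The authors propose a systematic framework to adapt understanding-oriented encoder features for generative tasks. A semantic-pixel reconstruction objective regularizes the latent space, compressing both semantic information and fine-grained detail into a highly compact representation (96 channels with 16x16 spatial downsampling). On top of this representation they design a unified text-to-image generation and image editing model. The latent space stays semantically rich and achieves state-of-the-art image reconstruction while remaining compact enough for accurate generation.

Result: Benchmarked against various feature spaces, the approach achieves state-of-the-art reconstruction, faster convergence, and substantial performance gains in both text-to-image generation and editing. The compact representation (96 channels, 16x16 downsampling) preserves semantics and reconstructs fine-grained detail well, validating that representation encoders can be adapted into robust generative components.

Conclusion: With appropriate regularization and adaptation, understanding-oriented encoder features can become strong components for generative tasks, helping unify visual generation and understanding. The work points toward more efficient and more general visual generation systems and underlines the role of compact, semantically rich latents in generative model performance.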


📄 Abstract

Modern Latent Diffusion Models (LDMs) typically operate in low-level Variational Autoencoder (VAE) latent spaces that are primarily optimized for pixel-level reconstruction. To unify vision generation and understanding, a burgeoning trend is to adopt high-dimensional features from representation encoders as generative latents. However, we empirically identify two fundamental obstacles in this paradigm: (1) the discriminative feature space lacks compact regularization, making diffusion models prone to off-manifold latents that lead to inaccurate object structures; and (2) the encoder's inherently weak pixel-level reconstruction hinders the generator from learning accurate fine-grained geometry and texture. In this paper, we propose a systematic framework to adapt understanding-oriented encoder features for generative tasks. We introduce a semantic-pixel reconstruction objective to regularize the latent space, enabling the compression of both semantic information and fine-grained details into a highly compact representation (96 channels with 16x16 spatial downsampling). This design ensures that the latent space remains semantically rich and achieves state-of-the-art image reconstruction, while remaining compact enough for accurate generation. Leveraging this representation, we design a unified Text-to-Image (T2I) and image editing model. Benchmarking against various feature spaces, we demonstrate that our approach achieves state-of-the-art reconstruction, faster convergence, and substantial performance gains in both T2I and editing tasks, validating that representation encoders can be effectively adapted into robust generative components.
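
To get a feel for how compact the latent is, the short calculation below assumes that "16x16 spatial downsampling" means a factor of 16 along each spatial axis and that the input is a 256x256 RGB image. Both are assumptions for illustration; the abstract does not state the input resolution, and the helper names are hypothetical.

```python
def latent_shape(height, width, channels=96, downsample=16):
    """Shape (channels, height, width) of the latent for a given image size."""
    return (channels, height // downsample, width // downsample)


def compression_ratio(height, width, channels=96, downsample=16, image_channels=3):
    """Number of image values divided by number of latent values."""
    c, h, w = latent_shape(height, width, channels, downsample)
    return (image_channels * height * width) / (c * h * w)


# Assumed input: a 256x256 RGB image.
shape = latent_shape(256, 256)
ratio = compression_ratio(256, 256)
print(shape, ratio)
```

Under these assumptions the latent holds 96 x 16 x 16 = 24,576 values against 196,608 in the image, an 8-fold reduction in element count.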

cs.CL [Back]

[43] Are Vision Language Models Cross-Cultural Theory of Mind Reasoners?

Zabir Al Nazi, G M Shahariar, Abrar Hossain, Wei Peng

🧩 TL;DR

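This work introduces CulturalToM-VQA, a new evaluation benchmark of 5095 questions that probes the theory of mind reasoning of vision-language models across diverse cultural contexts through visual question answering, filling a gap in existing evaluations of cross-cultural theory of mind reasoning.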


📘 Detailed Summary

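Motivation: Theory of mind is fundamental to human social intelligence but remains a major challenge for artificial agents. Vision-language models are increasingly applied to socially grounded tasks, yet their capacity for cross-cultural theory of mind reasoning is largely unexplored. Current benchmarks are mostly Western-centric and do not systematically evaluate theory of mind reasoning across diverse cultural settings.

Method: The dataset is built with a VLM-assisted human-in-the-loop pipeline. Human experts first curate culturally rich images across traditions, rituals, and social interactions; a VLM then helps generate structured, ToM-focused scene descriptions; finally, these descriptions are refined into question-answer pairs spanning a taxonomy of six ToM tasks and four graded complexity levels.

Result: The resulting CulturalToM-VQA benchmark contains 5095 questions and captures culturally grounded cues such as rituals, attire, gestures, and interpersonal dynamics. It covers six theory of mind facets (mental state attribution, false belief reasoning, non-literal communication, social norm violations, perspective coordination, and multi-agent reasoning) at four graded complexity levels.

Conclusion: CulturalToM-VQA offers a standardized tool for systematically evaluating cross-cultural theory of mind reasoning in vision-language models. It underlines the importance of cultural context in theory of mind evaluation and supports more inclusive evaluation of social intelligence in AI.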


📄 Abstract

Theory of Mind (ToM) -- the ability to attribute beliefs, desires, and emotions to others -- is fundamental for human social intelligence, yet remains a major challenge for artificial agents. Existing Vision-Language Models (VLMs) are increasingly applied in socially grounded tasks, but their capacity for cross-cultural ToM reasoning is largely unexplored. In this work, we introduce CulturalToM-VQA, a new evaluation benchmark containing 5095 questions designed to probe ToM reasoning across diverse cultural contexts through visual question answering. The dataset captures culturally grounded cues such as rituals, attire, gestures, and interpersonal dynamics, enabling systematic evaluation of ToM reasoning beyond Western-centric benchmarks. Our dataset is built through a VLM-assisted human-in-the-loop pipeline, where human experts first curate culturally rich images across traditions, rituals, and social interactions; a VLM then assists in generating structured ToM-focused scene descriptions, which are refined into question-answer pairs spanning a taxonomy of six ToM tasks and four graded complexity levels. The resulting dataset covers diverse theory of mind facets such as mental state attribution, false belief reasoning, non-literal communication, social norm violations, perspective coordination, and multi-agent reasoning.

cs.AI [Back]

[44] Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows

Wanghan Xu, Yuhao Zhou, Yifan Zhou, Qinglong Cao, Shuo Li, Jia Bu, Bo Liu, Yixin Chen, Xuming He, Xiangyu Zhao, Xiang Zhuang, Fengxiang Wang, Zhiwang Zhou, Qiantai Feng, Wenxuan Huang, Jiaqi Wei, Hao Wu, Yuejin Yang, Guangshuai Wang, Sheng Xu, Ziyan Huang, Xinyao Liu, Jiyao Liu, Cheng Tang, Wei Li, Ying Chen, Junzhi Ning, Pengfei Jiang, Chenglong Ma, Ye Du, Changkai Ji, Huihui Xu, Ming Hu, Jiangbin Zheng, Xin Chen, Yucheng Wu, Feifei Jiang, Xi Chen, Xiangru Tang, Yuchen Fu, Yingzhou Lu, Yuanyuan Zhang, Lihao Sun, Chengbo Li, Jinzhe Ma, Wanhao Liu, Yating Liu, Kuo-Cheng Wu, Shengdu Chai, Yizhou Wang, Ouwen Zhangjin, Chen Tang, Shufei Zhang, Wenbo Cao, Junjie Ren, Taoyong Cui, Zhouheng Yao, Juntao Deng, Yijie Sun, Feng Liu, Wangxu Wei, Jingyi Xu, Zhangrui Li, Junchao Gong, Zijie Guo, Zhiyu Yao, Zaoyu Chen, Tianhao Peng, Fangchen Yu, Bo Zhang, Dongzhan Zhou, Shixiang Tang, Jiaheng Liu, Fenghua Ling, Yan Lu, Yuchen Ren, Ben Fei, Zhen Zhao, Xinyu Gu, Rui Su, Xiao-Ming Wu, Weikang Si, Yang Liu, Hao Chen, Xiangchao Yan, Xue Yang, Junchi Yan, Jiamin Wu, Qihao Zheng, Chenhui Li, Zhiqiang Gao, Hao Kong, Junjun He, Mao Su, Tianfan Fu, Peng Ye, Chunfeng Song, Nanqing Dong, Yuqiang Li, Huazhu Fu, Siqi Sun, Lijing Cheng, Jintai Lin, Wanli Ouyang, Bowen Zhou, Wenlong Zhang, Lei Bai

🧩 TL;DR

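This paper proposes an operational definition of Scientific General Intelligence (SGI) and builds the SGI-Bench benchmark to systematically evaluate large language models on scientific discovery tasks. It also introduces Test-Time Reinforcement Learning (TTRL) to enhance the novelty of generated hypotheses.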


📘 Detailed Summary

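Motivation: Despite advances in scientific AI, Scientific General Intelligence (SGI), the ability to autonomously conceive, investigate, and reason across scientific domains, still lacks a coherent framework, and existing work does not systematically evaluate AI systems across the full scientific discovery workflow.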

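Method: The authors propose an operational definition of SGI grounded in the Practical Inquiry Model (PIM: Deliberation, Conception, Action, Perception) and design four scientist-aligned tasks: deep research, idea generation, dry/wet experiments, and experimental reasoning. They build SGI-Bench, a benchmark of more than 1,000 expert-curated, cross-disciplinary samples, and introduce Test-Time Reinforcement Learning (TTRL), which optimizes retrieval-augmented novelty rewards at inference time.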

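Result: The evaluation reveals several capability gaps: exact match in deep research is only 10-20%, even though step-level alignment is relatively high; generated ideas lack feasibility and detail; dry experiments show high code executability but low accuracy of execution results; wet-experiment protocols have low sequence fidelity; and multimodal comparative reasoning remains a persistent challenge. TTRL enhances hypothesis novelty without requiring reference answers.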

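Conclusion: Through its PIM-grounded definition, workflow-centric benchmark, and empirical insights, the study lays a foundation for AI systems that genuinely participate in scientific discovery. It exposes systematic shortcomings of current large language models on scientific discovery tasks and points to reinforcement learning as a new direction for strengthening scientific creativity.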


📄 Abstract

Despite advances in scientific AI, a coherent framework for Scientific General Intelligence (SGI), the ability to autonomously conceive, investigate, and reason across scientific domains, remains lacking. We present an operational SGI definition grounded in the Practical Inquiry Model (PIM: Deliberation, Conception, Action, Perception) and operationalize it via four scientist-aligned tasks: deep research, idea generation, dry/wet experiments, and experimental reasoning. SGI-Bench comprises over 1,000 expert-curated, cross-disciplinary samples inspired by Science's 125 Big Questions, enabling systematic evaluation of state-of-the-art LLMs. Results reveal gaps: low exact match (10--20%) in deep research despite step-level alignment; ideas lacking feasibility and detail; high code executability but low execution result accuracy in dry experiments; low sequence fidelity in wet protocols; and persistent multimodal comparative-reasoning challenges. We further introduce Test-Time Reinforcement Learning (TTRL), which optimizes retrieval-augmented novelty rewards at inference, enhancing hypothesis novelty without reference answers. Together, our PIM-grounded definition, workflow-centric benchmark, and empirical insights establish a foundation for AI systems that genuinely participate in scientific discovery.
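
The abstract does not specify how the retrieval-augmented novelty reward is computed. As a rough illustration only, one plausible reading is that a hypothesis scores higher the less it resembles passages retrieved from prior work. The sketch below assumes a bag-of-words cosine similarity and the hypothetical helper names `bow`, `cosine`, and `novelty_reward`; none of these come from the paper.

```python
import math
from collections import Counter


def bow(text):
    """Lowercased bag-of-words counts (a stand-in for a real text embedding)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def novelty_reward(hypothesis, retrieved):
    """Assumed reward: 1 minus the highest similarity to any retrieved passage.

    No reference answer is needed; only the retrieved passages are used.
    """
    if not retrieved:
        return 1.0
    h = bow(hypothesis)
    return 1.0 - max(cosine(h, bow(doc)) for doc in retrieved)
```

A reward of this shape needs no gold answer, which matches the abstract's claim that novelty is enhanced "without reference answers"; the actual reward design in the paper may differ substantially.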

[45] MMRAG-RFT: Two-stage Reinforcement Fine-tuning for Explainable Multi-modal Retrieval-augmented Generation

Shengwei Zhao, Jingwen Yao, Sitong Wei, Linhai Xu, Yuying Liu, Dong Zhang, Zhiqiang Tian, Shaoyi Du

🧩 TL;DR

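This paper proposes a reinforcement-learning-based method for explainable multi-modal retrieval-augmented generation. A two-stage reinforcement fine-tuning framework strengthens the reasoning ability of multi-modal large language models and achieves state-of-the-art performance on the WebQA and MultimodalQA benchmarks.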


📘 Detailed Summary

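Motivation: Existing multi-modal retrieval-augmented generation methods do not expose the reasoning behind retrieval and response generation, which limits the explainability of their results. This calls for a multi-modal retrieval-augmented generation framework that produces an explainable reasoning process.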

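Method: The method brings reinforcement learning into multi-modal retrieval-augmented generation through a two-stage reinforcement fine-tuning framework. The first stage uses rule-based reinforcement fine-tuning to perform coarse-grained point-wise ranking of multi-modal documents and filter out clearly irrelevant ones. The second stage uses reasoning-based reinforcement fine-tuning to jointly optimize fine-grained list-wise ranking and answer generation, guiding the multi-modal large language model to output explainable reasoning logic.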

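Result: The method achieves state-of-the-art performance on WebQA and MultimodalQA, two benchmark datasets for multi-modal retrieval-augmented generation. Comprehensive ablation experiments validate its effectiveness and show the benefit of the reinforcement learning framework for the explainability of multi-modal retrieval-augmented generation.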

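Conclusion: The study shows that reinforcement learning can effectively strengthen the reasoning ability of multi-modal large language models in retrieval-augmented generation. By combining rule-based guidance with reasoning-based optimization, the two-stage fine-tuning framework achieves explainable multi-modal knowledge integration and offers a new approach to building transparent, trustworthy multi-modal AI systems.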


📄 Abstract

Multi-modal Retrieval-Augmented Generation (MMRAG) enables highly credible generation by integrating external multi-modal knowledge, thus demonstrating impressive performance in complex multi-modal scenarios. However, existing MMRAG methods fail to clarify the reasoning logic behind retrieval and response generation, which limits the explainability of the results. To address this gap, we propose to introduce reinforcement learning into multi-modal retrieval-augmented generation, enhancing the reasoning capabilities of multi-modal large language models through a two-stage reinforcement fine-tuning framework to achieve explainable multi-modal retrieval-augmented generation. Specifically, in the first stage, rule-based reinforcement fine-tuning is employed to perform coarse-grained point-wise ranking of multi-modal documents, effectively filtering out those that are significantly irrelevant. In the second stage, reasoning-based reinforcement fine-tuning is utilized to jointly optimize fine-grained list-wise ranking and answer generation, guiding multi-modal large language models to output explainable reasoning logic in the MMRAG process. Our method achieves state-of-the-art results on WebQA and MultimodalQA, two benchmark datasets for multi-modal retrieval-augmented generation, and its effectiveness is validated through comprehensive ablation experiments.
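
The coarse-to-fine structure described above can be sketched as a filter followed by a joint re-ranking. This is a minimal illustration, not the authors' implementation: in the paper both stages are carried out by a reinforcement-fine-tuned multi-modal LLM, whereas here the hypothetical stand-ins `overlap` and `rank_by_overlap` use plain word overlap over text-only documents, and the `threshold` value is an assumption.

```python
def overlap(query, doc):
    """Toy point-wise relevance: fraction of query words that appear in the doc."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0


def rank_by_overlap(query, docs):
    """Toy list-wise ranker: orders the surviving documents jointly."""
    return sorted(docs, key=lambda d: overlap(query, d), reverse=True)


def two_stage_rank(query, docs, pointwise_score, listwise_rank, threshold=0.5):
    """Stage 1 filters each document independently; stage 2 ranks the survivors."""
    # Stage 1: coarse-grained point-wise filtering of clearly irrelevant documents.
    kept = [d for d in docs if pointwise_score(query, d) >= threshold]
    # Stage 2: fine-grained list-wise ranking over the remaining candidates.
    return listwise_rank(query, kept)


docs = ["blue whale swims", "red fox", "red fox jumps high"]
ranked = two_stage_rank("red fox jumps", docs, overlap, rank_by_overlap)
```

Separating the two stages keeps the expensive list-wise step small: it only sees documents that already passed the cheap per-document check.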

[46] UmniBench: Unified Understand and Generation Model Oriented Omni-dimensional Benchmark

Kai Liu, Leyang Chen, Wenbo Li, Zhikai Chen, Zhixin Wang, Renjing Pei, Linghe Kong, Yulun Zhang

🧩 TL;DR

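This work proposes UmniBench, an omni-dimensional evaluation benchmark for unified multimodal models that assesses understanding, generation, and editing abilities within a single evaluation process and covers 13 major domains and more than 200 concepts.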


📘 Detailed Summary

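Motivation: Evaluation of unified multimodal models is still decoupled: understanding and generation abilities are assessed separately on different datasets, and no benchmark evaluates the combined capabilities of these models as a whole.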

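Method: UmniBench uses human-examined prompts and question-answer pairs and relies on the UMM's own understanding ability to evaluate its generation and editing abilities. This simple but effective paradigm allows comprehensive evaluation of UMMs, and the benchmark covers 13 major domains and more than 200 concepts.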

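Result: Using UmniBench, the authors benchmark 24 popular models, including both unified multimodal models and single-ability large models. The benchmark can also provide decoupled, fine-grained assessment by evaluating understanding, generation, and editing abilities separately.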

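Conclusion: UmniBench offers a more comprehensive and objective view of unified models and supports the community in improving model performance. Its broad coverage and integrated evaluation help advance the development of multimodal models.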


📄 Abstract

Unifying multimodal understanding and generation has shown impressive capabilities in cutting-edge proprietary systems. However, evaluations of unified multimodal models (UMMs) remain decoupled, assessing their understanding and generation abilities separately with corresponding datasets. To address this, we propose UmniBench, a benchmark tailored for UMMs with omni-dimensional evaluation. First, UmniBench can assess the understanding, generation, and editing ability within a single evaluation process. Based on human-examined prompts and QA pairs, UmniBench leverages UMM itself to evaluate its generation and editing ability with its understanding ability. This simple but effective paradigm allows comprehensive evaluation of UMMs. Second, UmniBench covers 13 major domains and more than 200 concepts, ensuring a thorough inspection of UMMs. Moreover, UmniBench can also decouple and separately evaluate understanding, generation, and editing abilities, providing a fine-grained assessment. Based on UmniBench, we benchmark 24 popular models, including both UMMs and single-ability large models. We hope this benchmark provides a more comprehensive and objective view of unified models and logistical support for improving the performance of the community model.
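
The self-evaluation paradigm can be pictured as a loop in which the model generates an output from a prompt and then answers human-examined questions about that output. The sketch below is an assumption-laden toy: the `generate` and `answer` method names, the `ToyUMM` class, and the accuracy-style score are all hypothetical, and the "image" is just a string standing in for real generated content.

```python
class ToyUMM:
    """Hypothetical stand-in for a unified multimodal model."""

    def generate(self, prompt):
        # A real model would return an image; here the "image" is the prompt text.
        return prompt.lower()

    def answer(self, image, question):
        # Toy understanding: check whether the last word of the question is depicted.
        keyword = question.rstrip("?").split()[-1].lower()
        return "yes" if keyword in image.split() else "no"


def self_eval_score(model, items):
    """Fraction of QA pairs the model answers correctly about its own outputs."""
    correct = total = 0
    for prompt, qa_pairs in items:
        image = model.generate(prompt)
        for question, expected in qa_pairs:
            correct += model.answer(image, question) == expected
            total += 1
    return correct / total if total else 0.0


items = [
    ("A red cat on a sofa", [("Is there a cat?", "yes"), ("Is there a dog?", "no")]),
    ("Two birds", [("Is there a bird?", "yes")]),
]
score = self_eval_score(ToyUMM(), items)
```

In this toy run the third question is missed because the stand-in matches whole words only, so a single score mixes generation and understanding errors. That is one reason a decoupled assessment of each ability, as the abstract describes, is useful alongside the joint one.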

[47] Accelerating Multi-modal LLM Gaming Performance via Input Prediction and Mishit Correction

Ziyang Lin, Zixuan Sun, Sanhorn Chen, Xiaoyang Chen, Roy Zhao

🧩 TL;DR

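This paper proposes a real-time sequential control framework based on speculative execution. A predict-then-verify mechanism combined with a lightweight corrector substantially reduces the inference latency of model predictive control while keeping control performance stable.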


📘 Detailed Summary

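Motivation: Real-time sequential control agents are often bottlenecked by inference latency, and even small per-step planning delays can destabilize control and degrade overall performance. Existing methods tend to sacrifice control quality or robustness when reducing latency, so a solution is needed that lowers inference frequency while maintaining reliable performance.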

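Method: The authors propose a speculation-and-correction framework that adapts the predict-then-verify idea of speculative execution to model-based control with TD-MPC2. A pretrained world model and a latent-space MPC planner generate a short-horizon action queue together with a predicted latent-state trajectory, so the agent can execute several planned actions without replanning immediately. When a new observation arrives, the system measures the mismatch between the encoded real latent state and the queued predicted latent state. For small to moderate mismatch, a lightweight learned corrector applies a residual update distilled offline from a replanning teacher; for large mismatch, the agent safely falls back to full replanning and clears the stale action queue. The study covers a gated two-tower MLP corrector and a temporal Transformer corrector, which target local errors and systematic drift, respectively.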

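Result: Experiments on the DMC Humanoid-Walk task show that the method reduces the number of planning inferences from 500 to 282 and improves end-to-end step latency by 25%, at the cost of only a 7.1% reduction in return, so control performance remains strong. Ablations confirm that speculative execution without correction is unreliable over longer horizons, which underscores the need for mismatch-aware correction for robust latency reduction.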

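Conclusion: The study demonstrates that a speculative execution framework is effective for real-time control, using a correction mechanism to balance latency reduction against performance. The lightweight corrector lets the system cut computational overhead substantially without sacrificing robustness, offering a new approach to latency optimization in real-time control systems. The method also shows the potential of combining offline distillation with online correction to address latency in model predictive control.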


📄 Abstract

Real-time sequential control agents are often bottlenecked by inference latency. Even modest per-step planning delays can destabilize control and degrade overall performance. We propose a speculation-and-correction framework that adapts the predict-then-verify philosophy of speculative execution to model-based control with TD-MPC2. At each step, a pretrained world model and latent-space MPC planner generate a short-horizon action queue together with predicted latent rollouts, allowing the agent to execute multiple planned actions without immediate replanning. When a new observation arrives, the system measures the mismatch between the encoded real latent state and the queued predicted latent. For small to moderate mismatch, a lightweight learned corrector applies a residual update to the speculative action, distilled offline from a replanning teacher. For large mismatch, the agent safely falls back to full replanning and clears stale action queues. We study both a gated two-tower MLP corrector and a temporal Transformer corrector to address local errors and systematic drift. Experiments on the DMC Humanoid-Walk task show that our method reduces the number of planning inferences from 500 to 282, improves end-to-end step latency by 25 percent, and maintains strong control performance with only a 7.1 percent return reduction. Ablation results demonstrate that speculative execution without correction is unreliable over longer horizons, highlighting the necessity of mismatch-aware correction for robust latency reduction.
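
The per-step decision described in the abstract can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: latents are plain tuples, actions are scalars, `plan` and `corrector` are hypothetical callables standing in for the TD-MPC2 planner and the learned corrector, and a single threshold `tau_replan` separates "small to moderate" from "large" mismatch.

```python
import math


def speculative_step(z_real, queue, plan, corrector, tau_replan):
    """Return (action, remaining_queue, replanned) for one control step.

    queue holds (action, predicted_latent) pairs produced by an earlier plan.
    """
    if queue:
        action, z_pred = queue[0]
        mismatch = math.dist(z_real, z_pred)
        if mismatch <= tau_replan:
            # Small to moderate mismatch: keep the queue, apply a residual update.
            return action + corrector(z_real, z_pred, action), queue[1:], False
    # Empty queue or large mismatch: discard stale actions and replan fully.
    queue = plan(z_real)
    action, _ = queue[0]
    return action, queue[1:], True


def toy_plan(z):
    """Toy planner: three actions with latents predicted to drift by +1 per step."""
    return [(1.0, z), (2.0, (z[0] + 1.0,)), (3.0, (z[0] + 2.0,))]


def toy_corrector(z_real, z_pred, action):
    """Toy residual: nudge the action in proportion to the latent error."""
    return 0.1 * (z_pred[0] - z_real[0])


a0, q, r0 = speculative_step((0.0,), [], toy_plan, toy_corrector, 0.5)
a1, q, r1 = speculative_step((1.2,), q, toy_plan, toy_corrector, 0.5)
a2, q, r2 = speculative_step((5.0,), q, toy_plan, toy_corrector, 0.5)
```

In this toy run the first step plans, the second reuses the queued action with a small correction, and the third sees a large mismatch and replans. Counting how often `replanned` is true corresponds to the planning-inference count the abstract reports.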