cs.CV

[1] Beyond CNNs: Efficient Fine-Tuning of Multi-Modal LLMs for Object Detection on Low-Data Regimes

Nirmal Elamon, Rouzbeh Davoudi

🧩 TL;DR

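This study systematically compares traditional CNNs, zero-shot multi-modal LLMs, and fine-tuned multi-modal LLMs on detecting artificial text overlays in images. It shows that fine-tuning a multi-modal LLM on a small amount of data substantially improves performance, demonstrating strong data efficiency and adaptability in low-resource visual settings.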


📘 Detailed Summary

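Motivation: The potential of multi-modal LLMs on specialized visual tasks remains underexploited, and using pre-trained models out of the box often yields suboptimal performance. This work explores how language-guided models can be adapted for precise visual understanding with limited supervision.

Method: A comprehensive comparison on the artificial text overlay detection task evaluates three strategies: fine-tuned traditional CNNs, zero-shot pre-trained multi-modal LLMs, and fine-tuned multi-modal LLMs, with particular attention to fine-tuning LLMs on very little data.

Result: Fine-tuning multi-modal LLMs on fewer than 1,000 images yields up to a 36% accuracy improvement, matching or surpassing CNN baselines that typically require orders of magnitude more data.

Conclusion: The study highlights the adaptability and data efficiency of LLM-based approaches for real-world object detection tasks and offers practical guidance for applying multi-modal transformers in low-resource visual environments, contributing to efficient learning strategies that bridge vision and language.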


📄 Abstract

The field of object detection and understanding is rapidly evolving, driven by advances in both traditional CNN-based models and emerging multi-modal large language models (LLMs). While CNNs like ResNet and YOLO remain highly effective for image-based tasks, recent transformer-based LLMs introduce new capabilities such as dynamic context reasoning, language-guided prompts, and holistic scene understanding. However, when used out-of-the-box, the full potential of LLMs remains underexploited, often resulting in suboptimal performance on specialized visual tasks. In this work, we conduct a comprehensive comparison of fine-tuned traditional CNNs, zero-shot pre-trained multi-modal LLMs, and fine-tuned multi-modal LLMs on the challenging task of artificial text overlay detection in images. A key contribution of our study is demonstrating that LLMs can be effectively fine-tuned on very limited data (fewer than 1,000 images) to achieve up to 36% accuracy improvement, matching or surpassing CNN-based baselines that typically require orders of magnitude more data. By exploring how language-guided models can be adapted for precise visual understanding with minimal supervision, our work contributes to the broader effort of bridging vision and language, offering novel insights into efficient cross-modal learning strategies. These findings highlight the adaptability and data efficiency of LLM-based approaches for real-world object detection tasks and provide actionable guidance for applying multi-modal transformers in low-resource visual environments. To support continued progress in this area, we have made the code used to fine-tune the models available on GitHub, enabling future improvements and reuse in related applications.

[2] Adjusting Initial Noise to Mitigate Memorization in Text-to-Image Diffusion Models

Hyeonggeun Han, Sehwan Kim, Hyungjun Joo, Sangwoo Hong, Jungwoo Lee

🧩 TL;DR

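This paper adjusts the initial noise sample so that text-to-image diffusion models escape the memorization attraction basin earlier, reducing training-data memorization while preserving image-text alignment. The initial noise is adjusted either collectively or individually.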


📘 Detailed Summary

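Motivation: Text-to-image diffusion models often memorize and replicate training data, raising privacy and copyright concerns. Existing methods avoid memorization by deferring classifier-free guidance (CFG), but the delay leaves images poorly aligned with the input prompt, so the trajectory needs to escape the attraction basin earlier to allow CFG to be applied sooner.

Method: Building on the observation that the initial noise sample determines the escape time, two mitigation strategies adjust the initial noise, collectively or individually, to find initial samples that promote earlier basin escape and therefore permit earlier CFG application during denoising.

Result: The proposed initial-noise adjustments significantly reduce memorization while preserving alignment between generated images and the input prompts, addressing the poor prompt alignment caused by existing deferred-CFG methods.

Conclusion: The initial noise plays a key role in diffusion-model memorization, and adjusting it in a targeted way balances memorization reduction against image-text alignment. This offers a new direction for privacy protection in diffusion models: intervening at the start of the generation process rather than only at intermediate steps.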


📄 Abstract

Despite their impressive generative capabilities, text-to-image diffusion models often memorize and replicate training data, prompting serious concerns over privacy and copyright. Recent work has attributed this memorization to an attraction basin, a region where applying classifier-free guidance (CFG) steers the denoising trajectory toward memorized outputs, and has proposed deferring CFG application until the denoising trajectory escapes this basin. However, such delays often result in non-memorized images that are poorly aligned with the input prompts, highlighting the need to promote earlier escape so that CFG can be applied sooner in the denoising process. In this work, we show that the initial noise sample plays a crucial role in determining when this escape occurs. We empirically observe that different initial samples lead to varying escape times. Building on this insight, we propose two mitigation strategies that adjust the initial noise, either collectively or individually, to find and use initial samples that encourage earlier basin escape. These approaches significantly reduce memorization while preserving image-text alignment.
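As a rough illustration of the deferral idea in this abstract, the sketch below shows the standard classifier-free guidance combination and a rule that switches guidance on only from a chosen step. The function names, the default scale, the `start_step` parameter, and the choice to use the conditional prediction alone before that step are all assumptions made for illustration; the abstract does not specify these details.

```python
from typing import List, Sequence


def cfg_combine(eps_uncond: Sequence[float], eps_cond: Sequence[float],
                scale: float) -> List[float]:
    """Standard classifier-free guidance: move from the unconditional
    noise prediction toward the conditional one by a factor of `scale`."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def guided_prediction(step: int, start_step: int,
                      eps_uncond: Sequence[float], eps_cond: Sequence[float],
                      scale: float = 7.5) -> List[float]:
    """Hypothetical deferral rule: before `start_step` return the
    conditional prediction unchanged (the same as scale = 1); from
    `start_step` onward apply full guidance."""
    if step < start_step:
        return list(eps_cond)
    return cfg_combine(eps_uncond, eps_cond, scale)
```

Under this reading, an initial noise sample that lets the trajectory leave the basin sooner corresponds to a smaller `start_step`, so more denoising steps receive guidance.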

[3] Hulu-Med: A Transparent Generalist Model towards Holistic Medical Vision-Language Understanding

Songtao Jiang, Yuan Wang, Sibo Song, Tianxiang Hu, Chenyi Zhou, Bin Pu, Yan Zhang, Zhibo Yang, Yang Feng, Joey Tianyi Zhou, Jin Hao, Zijian Chen, Ruijia Wu, Tao Tang, Junhui Lv, Hongxia Xu, Hongwei Wang, Jun Xiao, Bin Feng, Fudong Zhu, Kenli Li, Weidi Xie, Jimeng Sun, Jian Wu, Zuozhu Liu

🧩 TL;DR

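This paper presents Hulu-Med, a transparent medical vision-language model that uses a unified patch-based vision encoder and an LLM decoder to understand text, 2D/3D images, and video, achieving state-of-the-art performance across 30 benchmarks.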


📘 Detailed Summary

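Motivation: Real-world clinical decision-making struggles to integrate information from diverse modalities, including medical text, 2D/3D images, and video, which leads to inefficiencies and potential diagnostic oversights. Generalist vision-language models are promising, but their medical development faces opaque pipelines, data scarcity, and architectural inflexibility.

Method: Hulu-Med is built on a unified patch-based vision encoder and an LLM decoder and is trained progressively on 16.7 million samples to scale from 2D to 3D and video understanding. Medical-aware token reduction enables efficient training, requiring only 4,000 to 40,000 GPU hours for the 7B to 32B parameter variants.

Result: Across 30 benchmarks, Hulu-Med achieves state-of-the-art performance, surpassing leading open-source models and competing with proprietary systems on visual question answering, medical report generation, and complex reasoning in multilingual and rare-disease scenarios.

Conclusion: By open-sourcing the complete pipeline, the work shows that a high-performance medical vision-language model can be built transparently, providing a foundational tool for accessible and impactful clinical AI and supporting standardization and reproducibility in medical multimodal understanding.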


📄 Abstract

Real-world clinical decision-making grapples with integrating information from diverse data modalities, including medical text, 2D/3D images, and video, leading to inefficiencies and potential diagnostic oversights. While generalist vision-language models (VLMs) offer promise, their medical development faces challenges of opaque pipelines, data scarcity, and architectural inflexibility. Here we present Hulu-Med, a transparent medical VLM that unifies understanding across all these modalities. Built upon a unified patch-based vision encoder and an LLM decoder, Hulu-Med was progressively trained on 16.7 million (M) samples to scale from 2D to 3D and video comprehension. The medical-aware token reduction enables efficient training, requiring only 4,000 to 40,000 GPU hours for 7B to 32B parameter variants. Extensive evaluation across 30 benchmarks shows state-of-the-art performance, surpassing leading open-source models and competing with proprietary systems in tasks spanning visual question-answering, medical report generation, and complex reasoning in multilingual and rare disease scenarios. By open-sourcing our complete pipeline, we establish that a high-performance medical VLM can be achieved transparently, providing a foundational tool for accessible and impactful clinical AI. Code is released at https://github.com/ZJUI-AI4H/Hulu-Med.

[4] Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation

Kang Liao, Size Wu, Zhonghua Wu, Linyi Jin, Chao Wang, Yikai Wang, Fei Wang, Wei Li, Chen Change Loy

🧩 TL;DR

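Puffin is a unified camera-centric multimodal model that extends spatial intelligence by treating the camera as language. It outperforms specialized models on camera-centric understanding and generation and generalizes to a range of cross-view tasks.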


📘 Detailed Summary

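Motivation: Camera-centric understanding and generation are typically studied in isolation, without a unified framework that handles both spatial interpretation and scene creation, which limits spatial intelligence from arbitrary viewpoints.

Method: Puffin combines language regression with diffusion-based generation and introduces a paradigm that treats the camera as language. It uses both global camera parameters and pixel-wise camera maps and is trained on the large-scale Puffin-4M dataset, aligning spatially grounded visual cues with photographic terminology.

Result: Puffin outperforms specialized models on camera-centric generation and understanding, and with instruction tuning it generalizes to diverse cross-view tasks such as spatial imagination, world exploration, and photography guidance.

Conclusion: The work provides a unified framework for multimodal spatial intelligence, achieving flexible and reliable spatial generation through the camera-as-language paradigm. The code, models, and dataset will be released to support further research.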


📄 Abstract

Camera-centric understanding and generation are two cornerstones of spatial intelligence, yet they are typically studied in isolation. We present Puffin, a unified camera-centric multimodal model that extends spatial awareness along the camera dimension. Puffin integrates language regression and diffusion-based generation to interpret and create scenes from arbitrary viewpoints. To bridge the modality gap between cameras and vision-language, we introduce a novel paradigm that treats the camera as language, enabling thinking with camera. This guides the model to align spatially grounded visual cues with photographic terminology while reasoning across geometric context. Puffin is trained on Puffin-4M, a large-scale dataset of 4 million vision-language-camera triplets. We incorporate both global camera parameters and pixel-wise camera maps, yielding flexible and reliable spatial generation. Experiments demonstrate Puffin's superior performance over specialized models for camera-centric generation and understanding. With instruction tuning, Puffin generalizes to diverse cross-view tasks such as spatial imagination, world exploration, and photography guidance. We will release the code, models, dataset pipeline, and benchmark to advance multimodal spatial intelligence research.

[5] BEAR: Benchmarking and Enhancing Multimodal Language Models for Atomic Embodied Capabilities

Yu Qi, Haibo Zhao, Ziyu Guo, Siyuan Ma, Ziyan Chen, Yaokun Han, Renrui Zhang, Zitiantao Lin, Shiji Xin, Yijian Huang, Kai Cheng, Peiheng Wang, Jiazheng Liu, Jiayi Zhang, Yizhe Zhu, Wenqing Wang, Yiran Qin, Xupeng Zhu, Haojie Huang, Lawson L. S. Wong

🧩 TL;DR

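This paper introduces BEAR, a benchmark for systematically evaluating the embodied capabilities of multimodal large language models, and BEAR-Agent, an agent that strengthens those capabilities and achieves a 17.5% relative improvement on the benchmark.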


📘 Detailed Summary

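Motivation: The potential of multimodal large language models (MLLMs) as embodied agents has not been thoroughly evaluated. Existing benchmarks focus on specific domains such as planning or spatial understanding and lack a comprehensive, systematic evaluation of atomic embodied capabilities.

Method: BEAR comprises 4,469 interleaved image-video-text entries spanning 14 domains in 6 categories. BEAR-Agent is a multimodal conversable agent that integrates pretrained vision models to strengthen perception, 3D understanding, and planning.

Result: Evaluation of 20 representative MLLMs reveals persistent limitations across all domains of embodied capabilities. BEAR-Agent yields a 9.12% absolute gain and a 17.5% relative improvement on BEAR, and improving embodied capabilities also benefits embodied tasks in simulated environments.

Conclusion: Systematic evaluation exposes significant shortcomings in the embodied capabilities of MLLMs, and BEAR-Agent effectively strengthens them, providing both a benchmark and a method for developing more capable embodied agents.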


📄 Abstract

Embodied capabilities refer to a suite of fundamental abilities for an agent to perceive, comprehend, and interact with the physical world. While multimodal large language models (MLLMs) show promise as embodied agents, a thorough and systematic evaluation of their embodied capabilities remains underexplored, as existing benchmarks primarily focus on specific domains such as planning or spatial understanding. To bridge this gap, we introduce BEAR, a comprehensive and fine-grained benchmark that evaluates MLLMs on atomic embodied capabilities. BEAR comprises 4,469 interleaved image-video-text entries across 14 domains in 6 categories, including tasks from low-level pointing, trajectory understanding, spatial reasoning, to high-level planning. Extensive evaluation results of 20 representative MLLMs reveal their persistent limitations across all domains of embodied capabilities. To tackle the shortfall, we propose BEAR-Agent, a multimodal conversable agent that integrates pretrained vision models to strengthen MLLM perception, 3D understanding, and planning capabilities. It substantially enhances MLLM performance across diverse embodied capabilities on BEAR, yielding a 9.12% absolute gain and a relative improvement of 17.5% on GPT-5. Furthermore, our experiments indicate that improving MLLM embodied capabilities can benefit embodied tasks in simulated environments. Project website: https://bear-official66.github.io/
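The two reported gains can be cross-checked with a small piece of arithmetic. The sketch below assumes that the 9.12-point absolute gain and the 17.5% relative improvement refer to the same baseline score; the implied baseline and improved scores are derived values, not figures reported in the abstract.

```python
def implied_baseline(absolute_gain: float, relative_gain: float) -> float:
    """If relative_gain = absolute_gain / baseline, then
    baseline = absolute_gain / relative_gain."""
    return absolute_gain / relative_gain


baseline = implied_baseline(9.12, 0.175)   # score before BEAR-Agent (derived)
improved = baseline + 9.12                 # score after BEAR-Agent (derived)
print(round(baseline, 1), round(improved, 1))  # 52.1 61.2
```

Under that assumption, the gain would correspond to moving from roughly 52 to roughly 61 on the benchmark's scale.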

[6] Q-Router: Agentic Video Quality Assessment with Expert Model Routing and Artifact Localization

Shuo Xing, Soumik Dey, Mingyang Wu, Ashirbad Mishra, Hansi Wu, Binbin Li, Zhengzhong Tu

🧩 TL;DR

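This paper proposes Q-Router, an agentic video quality assessment framework built on multi-tier model routing. A vision-language model dynamically selects and ensembles expert models, enabling universal video quality assessment across diverse content and tasks.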


📘 Detailed Summary

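Motivation: Existing video quality assessment (VQA) models trained with direct score supervision have three main problems: poor generalization across diverse content such as user-generated content, short-form videos, and AI-generated content; limited interpretability; and a lack of extensibility to new use cases or content types. These problems limit the applicability of VQA systems in practice.

Method: Q-Router is an agentic framework with a multi-tier model routing system. It integrates a diverse set of expert models and uses vision-language models as real-time routers that reason over the input video's semantics and ensemble the most appropriate experts. The heaviest compute tier adds spatiotemporal artifact localization for interpretability.

Result: Extensive experiments show that Q-Router matches or surpasses state-of-the-art VQA models on a variety of benchmarks while substantially improving generalization and interpretability. It performs strongly on the quality-based question answering benchmark Q-Bench-Video and localizes spatiotemporal artifacts effectively, showing potential as a reward function for post-training video generation models.

Conclusion: By routing among specialized experts with complementary strengths, Q-Router delivers flexible and robust performance across heterogeneous video sources and tasks, offering a promising foundation for next-generation VQA systems. Its artifact localization also opens new possibilities for optimizing video generation models.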


📄 Abstract

Video quality assessment (VQA) is a fundamental computer vision task that aims to predict the perceptual quality of a given video in alignment with human judgments. Existing performant VQA models trained with direct score supervision suffer from (1) poor generalization across diverse content and tasks, ranging from user-generated content (UGC) and short-form videos to AI-generated content (AIGC), (2) limited interpretability, and (3) lack of extensibility to novel use cases or content types. We propose Q-Router, an agentic framework for universal VQA with a multi-tier model routing system. Q-Router integrates a diverse set of expert models and employs vision-language models (VLMs) as real-time routers that dynamically reason and then ensemble the most appropriate experts conditioned on the input video semantics. We build a multi-tiered routing system based on the computing budget, with the heaviest tier involving a specific spatiotemporal artifact localization for interpretability. This agentic design enables Q-Router to combine the complementary strengths of specialized experts, achieving both flexibility and robustness in delivering consistent performance across heterogeneous video sources and tasks. Extensive experiments demonstrate that Q-Router matches or surpasses state-of-the-art VQA models on a variety of benchmarks, while substantially improving generalization and interpretability. Moreover, Q-Router excels on the quality-based question answering benchmark, Q-Bench-Video, highlighting its promise as a foundation for next-generation VQA systems. Finally, we show that Q-Router capably localizes spatiotemporal artifacts, showing potential as a reward function for post-training video generation models.

[7] Alignment, Mining and Fusion: Representation Alignment with Hard Negative Mining and Selective Knowledge Fusion for Medical Visual Question Answering

Yuanhao Zou, Zhaozheng Yin

🧩 TL;DR

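This study proposes a unified framework for medical visual question answering (Med-VQA) that combines multi-level modality alignment, hard negative mining, and a gated cross-attention module to address non-unified modality alignment, hard negatives, and knowledge fusion, achieving state-of-the-art performance on several benchmark datasets.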


📘 Detailed Summary

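Motivation: Med-VQA faces three main challenges: there is no unified solution for modality alignment, the hard-negative problem is under-explored, and commonly used knowledge fusion techniques may introduce irrelevant information. These limitations hold back the performance and practical usefulness of Med-VQA systems.

Method: The framework has three components: (1) unified alignment of heterogeneous modalities across multiple levels, modalities, views, and stages, using contrastive learning and optimal transport theory; (2) a hard negative mining method that uses soft labels for multi-modality alignment and enforces discrimination of hard negative pairs; and (3) a gated cross-attention module that integrates the answer vocabulary as prior knowledge and selects relevant information from it.

Result: The framework outperforms previous state-of-the-art methods on widely used Med-VQA datasets, including RAD-VQA, SLAKE, PathVQA, and VQA-2019.

Conclusion: The work provides a unified modality alignment solution for Med-VQA, addresses hard negatives and knowledge fusion, and demonstrates the value of multi-level alignment and gating in medical multimodal tasks.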


📄 Abstract

Medical Visual Question Answering (Med-VQA) is a challenging task that requires a deep understanding of both medical images and textual questions. Although recent works leveraging Medical Vision-Language Pre-training (Med-VLP) have shown strong performance on the Med-VQA task, there is still no unified solution for modality alignment, and the issue of hard negatives remains under-explored. Additionally, commonly used knowledge fusion techniques for Med-VQA may introduce irrelevant information. In this work, we propose a framework to address these challenges through three key contributions: (1) a unified solution for heterogeneous modality alignments across multiple levels, modalities, views, and stages, leveraging methods like contrastive learning and optimal transport theory; (2) a hard negative mining method that employs soft labels for multi-modality alignments and enforces the hard negative pair discrimination; and (3) a Gated Cross-Attention Module for Med-VQA that integrates the answer vocabulary as prior knowledge and selects relevant information from it. Our framework outperforms the previous state-of-the-art on widely used Med-VQA datasets like RAD-VQA, SLAKE, PathVQA and VQA-2019.

[8] D-CoDe: Scaling Image-Pretrained VLMs to Video via Dynamic Compression and Question Decomposition

Yiyang Huang, Yizhou Wang, Yun Fu

🧩 TL;DR

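This paper proposes D-CoDe, a training-free framework that uses dynamic compression and question decomposition to address the perception bottleneck and token overload that arise when adapting image-pretrained vision-language models to video, improving video understanding without additional training.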


📘 Detailed Summary

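Motivation: Extending image-pretrained vision-language models to video faces two key challenges: the perception bottleneck and token overload. The perception bottleneck arises because the models struggle with dense, temporally extended visual input, while token overload results from the large number of video frames, which strains computational resources. Both limit the use of image-based models on complex video-language tasks.

Method: D-CoDe has two core components. Dynamic compression reduces redundancy while preserving informative content through adaptive selection of representative frames and content-aware aggregation of spatial tokens. Question decomposition reformulates the original query into sub-questions that guide the model to focus on distinct aspects of the video for more comprehensive understanding. The framework can be deployed without additional training.

Result: D-CoDe effectively improves video understanding across various benchmarks, with strong performance on a challenging long-video benchmark that highlights its potential for complex video-language tasks.

Conclusion: D-CoDe shows that image-pretrained models can be extended to video without training, offering a new approach to the perception bottleneck and token overload. The results suggest that compression and decomposition strategies can improve how models handle long video content.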


📄 Abstract

Video large language models (Vid-LLMs), which excel in diverse video-language tasks, can be effectively constructed by adapting image-pretrained vision-language models (VLMs). However, this adaptation remains challenging, as it requires processing dense and temporally extended visual inputs that exceed the capacity of image-based models. This paper identifies the perception bottleneck and token overload as key challenges in extending image-based VLMs to the video domain. To address these issues, we propose D-CoDe, a training-free adaptation framework that incorporates dynamic compression and question decomposition. Specifically, dynamic compression alleviates the perception bottleneck through adaptive selection of representative frames and content-aware aggregation of spatial tokens, thereby reducing redundancy while preserving informative content. In parallel, question decomposition mitigates token overload by reformulating the original query into sub-questions, guiding the model to focus on distinct aspects of the video and enabling more comprehensive understanding. Experiments demonstrate that D-CoDe effectively improves video understanding across various benchmarks. Furthermore, strong performance on the challenging long-video benchmark highlights the potential of D-CoDe in handling complex video-language tasks. Code is available at https://github.com/hukcc/D-CoDe.
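To make "adaptive selection of representative frames" concrete, here is a minimal sketch that uses greedy farthest-point sampling over per-frame feature vectors. This is a stand-in technique chosen for illustration: the abstract does not state D-CoDe's actual selection criterion, and the function name, the feature representation, and the frame budget `k` are assumptions.

```python
import math
from typing import List, Sequence


def select_representative_frames(features: Sequence[Sequence[float]],
                                 k: int) -> List[int]:
    """Greedy farthest-point sampling: start from the first frame, then
    repeatedly add the frame farthest from every frame chosen so far.
    Returns the sorted indices of at most k frames."""
    if not features or k <= 0:
        return []
    chosen = [0]
    # Distance from each frame to its nearest already-chosen frame.
    nearest = [math.dist(f, features[0]) for f in features]
    while len(chosen) < min(k, len(features)):
        idx = max(range(len(features)), key=lambda i: nearest[i])
        if nearest[idx] == 0.0:
            break  # every remaining frame duplicates a chosen one
        chosen.append(idx)
        nearest = [min(d, math.dist(f, features[idx]))
                   for d, f in zip(nearest, features)]
    return sorted(chosen)
```

Near-duplicate frames end up close to an already-chosen frame and are skipped, which is one simple way to cut temporal redundancy before the frames reach the language model.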

[9] FOLK: Fast Open-Vocabulary 3D Instance Segmentation via Label-guided Knowledge Distillation

Hongrui Wu, Zhicheng Gao, Jin Cao, Kelu Yao, Wen Shen, Zhihua Wei

🧩 TL;DR

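This paper proposes FOLK, which achieves fast open-vocabulary 3D instance segmentation through label-guided knowledge distillation. It avoids the noise and computational overhead of the 2D mapping used by prior methods and substantially accelerates inference while maintaining strong performance.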


📘 Detailed Summary

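Motivation: Existing open-vocabulary 3D instance segmentation methods typically map 3D instances to 2D RGB-D images and then classify them with vision-language models. This mapping introduces noise from 2D occlusions and incurs substantial computational and memory costs at inference, which slows inference down.

Method: FOLK designs a teacher model that extracts high-quality instance embeddings and distills its open-vocabulary knowledge into a 3D student model. The teacher generates a 2D CLIP embedding for each 3D instance, incorporating visibility and viewpoint diversity, as the distillation target; the student directly produces a 3D embedding for each instance; and a label-guided distillation algorithm transfers knowledge from label-consistent 2D embeddings into the student.

Result: In experiments on the ScanNet200 and Replica datasets, FOLK achieves state-of-the-art performance on ScanNet200 with an AP50 score of 35.7 while running approximately 6.0x to 152.2x faster than previous methods.

Conclusion: The method shows that open-vocabulary capability can be integrated directly into a 3D model through knowledge distillation, avoiding the noise of 2D mapping and achieving substantial inference speedups, which makes it a viable option for real-time 3D scene understanding.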


📄 Abstract

Open-vocabulary 3D instance segmentation seeks to segment and classify instances beyond the annotated label space. Existing methods typically map 3D instances to 2D RGB-D images, and then employ vision-language models (VLMs) for classification. However, such a mapping strategy usually introduces noise from 2D occlusions and incurs substantial computational and memory costs during inference, slowing down the inference speed. To address the above problems, we propose a Fast Open-vocabulary 3D instance segmentation method via Label-guided Knowledge distillation (FOLK). Our core idea is to design a teacher model that extracts high-quality instance embeddings and distills its open-vocabulary knowledge into a 3D student model. In this way, during inference, the distilled 3D model can directly classify instances from the 3D point cloud, avoiding noise caused by occlusions and significantly accelerating the inference process. Specifically, we first design a teacher model to generate a 2D CLIP embedding for each 3D instance, incorporating both visibility and viewpoint diversity, which serves as the learning target for distillation. We then develop a 3D student model that directly produces a 3D embedding for each 3D instance. During training, we propose a label-guided distillation algorithm to distill open-vocabulary knowledge from label-consistent 2D embeddings into the student model. We conducted experiments on the ScanNet200 and Replica datasets, where FOLK achieves state-of-the-art performance on the ScanNet200 dataset with an AP50 score of 35.7, while running approximately 6.0x to 152.2x faster than previous methods. All code will be released after the paper is accepted.

[10] PHyCLIP: $\ell_1$-Product of Hyperbolic Factors Unifies Hierarchy and Compositionality in Vision-Language Representation Learning

Daiki Yoshikawa, Takashi Matsubara

🧩 TL;DR

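This paper proposes PHyCLIP, which places an ℓ₁-product metric on a Cartesian product of multi-curvature hyperbolic spaces to capture both the hierarchy within a concept family and the compositional semantics across concept families, addressing the difficulty existing vision-language models have in expressing hierarchy and compositionality at the same time.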


📘 Detailed Summary

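Motivation: Vision-language models have made notable progress in multi-modal representation learning, but they struggle to express both the hierarchy within a concept family (e.g., dog $\preceq$ mammal $\preceq$ animal) and the compositionality across concept families (e.g., "a dog in a car" $\preceq$ dog, car). Recent work uses hyperbolic space, which efficiently captures tree-like hierarchy, but its suitability for representing compositionality remains unclear.

Method: PHyCLIP defines an ℓ₁-product metric on a Cartesian product of multi-curvature hyperbolic spaces. Intra-family hierarchies emerge within individual hyperbolic factors, and cross-family composition is captured by the ℓ₁-product metric, analogous to a Boolean algebra.

Result: Experiments on zero-shot classification, retrieval, hierarchical classification, and compositional understanding show that PHyCLIP outperforms existing single-space approaches and yields more interpretable structures in the embedding space.

Conclusion: The study shows that a product of hyperbolic spaces can model hierarchical and compositional semantics together, offering a new geometric perspective on multi-modal representation learning with better performance and interpretability than single-space methods.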


📄 Abstract

Vision-language models have achieved remarkable success in multi-modal representation learning from large-scale pairs of visual scenes and linguistic descriptions. However, they still struggle to simultaneously express two distinct types of semantic structures: the hierarchy within a concept family (e.g., dog $\preceq$ mammal $\preceq$ animal) and the compositionality across different concept families (e.g., "a dog in a car" $\preceq$ dog, car). Recent works have addressed this challenge by employing hyperbolic space, which efficiently captures tree-like hierarchy, yet its suitability for representing compositionality remains unclear. To resolve this dilemma, we propose PHyCLIP, which employs an $\ell_1$-Product metric on a Cartesian product of Hyperbolic factors. With our design, intra-family hierarchies emerge within individual hyperbolic factors, and cross-family composition is captured by the $\ell_1$-product metric, analogous to a Boolean algebra. Experiments on zero-shot classification, retrieval, hierarchical classification, and compositional understanding tasks demonstrate that PHyCLIP outperforms existing single-space approaches and offers more interpretable structures in the embedding space.
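The ℓ₁-product metric described above can be sketched as the sum of per-factor hyperbolic distances. The sketch below uses the Poincaré ball model with the standard closed-form geodesic distance for curvature -c. The choice of the Poincaré model, the per-factor `curvatures` argument, and the function names are assumptions for illustration; the abstract does not say which model of hyperbolic space PHyCLIP uses.

```python
import math
from typing import Sequence


def poincare_distance(u: Sequence[float], v: Sequence[float],
                      curvature: float = 1.0) -> float:
    """Geodesic distance in the Poincare ball of curvature -c:
    d(u, v) = arcosh(1 + 2c|u-v|^2 / ((1-c|u|^2)(1-c|v|^2))) / sqrt(c)."""
    c = curvature
    sq_norm = lambda x: sum(t * t for t in x)
    diff = sq_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - c * sq_norm(u)) * (1.0 - c * sq_norm(v))
    return math.acosh(1.0 + 2.0 * c * diff / denom) / math.sqrt(c)


def l1_product_distance(x_factors, y_factors, curvatures) -> float:
    """l1-product metric: add up the distances measured independently
    in each hyperbolic factor."""
    return sum(poincare_distance(x, y, c)
               for x, y, c in zip(x_factors, y_factors, curvatures))
```

Because the factor distances are summed rather than combined under a square root, two embeddings that differ in only one factor are separated by exactly that factor's distance, which is one way to read the Boolean-algebra analogy in the abstract.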

[11] Improving the Performance of Unimodal Dynamic Hand-Gesture Recognition with Multimodal Training

Mahdi Abavisani, Hamid Reza Vaezi Joze, Vishal M. Patel

🧩 TL;DR

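This paper presents an efficient method for training unimodal 3D-CNNs with embedded multimodal knowledge for dynamic hand gesture recognition. A spatiotemporal semantic alignment loss and a focal regularization parameter let unimodal networks learn shared semantic representations without requiring multimodal input at test time.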


📘 Detailed Summary

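Motivation: Existing dynamic hand gesture recognition methods usually combine multimodal information explicitly, which makes them depend on multimodal input at test time. This work asks how multimodal knowledge can be embedded in unimodal networks so that each one improves without that dependence.

Method: The framework dedicates a separate network to each available modality and uses a spatiotemporal semantic alignment (SSA) loss to make the networks collaborate and learn common semantics and better representations. A focal regularization parameter is introduced to avoid negative knowledge transfer.

Result: The framework improves the test-time recognition accuracy of unimodal networks and achieves state-of-the-art performance on several dynamic hand gesture recognition datasets.

Conclusion: With suitable regularization and alignment, multimodal knowledge can be embedded in unimodal networks without multimodal input at test time, pointing toward more efficient and practical gesture recognition systems.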


📄 Abstract

We present an efficient approach for leveraging the knowledge from multiple modalities in training unimodal 3D convolutional neural networks (3D-CNNs) for the task of dynamic hand gesture recognition. Instead of explicitly combining multimodal information, which is commonplace in many state-of-the-art methods, we propose a different framework in which we embed the knowledge of multiple modalities in individual networks so that each unimodal network can achieve an improved performance. In particular, we dedicate a separate network to each available modality and encourage the networks to collaborate so that they develop common semantics and better representations. We introduce a "spatiotemporal semantic alignment" loss (SSA) to align the content of the features from different networks. In addition, we regularize this loss with our proposed "focal regularization parameter" to avoid negative knowledge transfer. Experimental results show that our framework improves the test-time recognition accuracy of unimodal networks, and provides state-of-the-art performance on various dynamic hand gesture recognition datasets.

[12] RO-Bench: Large-scale robustness evaluation of MLLMs with text-driven counterfactual videos

Zixi Yang, Jiapeng Li, Muxi Diao, Yinuo Jing, Kongming Liang

🧩 TL;DR

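This paper introduces Ro-Bench, the first benchmark for evaluating the robustness of multimodal large language models on dynamic out-of-distribution counterfactual video test sets. Current models degrade substantially on counterfactual video content, and fine-tuning on counterfactual data improves robustness.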


📘 Detailed Summary

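Motivation: Multimodal large language models (MLLMs) perform well on a variety of video understanding tasks, but their robustness to manipulated video content remains largely unexplored, and there is no dedicated benchmark for dynamic out-of-distribution counterfactual videos.

Method: Ro-Bench is built from high-quality, diverse, and temporally relevant counterfactual video data created by editing style, object, background, and their compositions. Eight recent video MLLMs are evaluated on it, and fine-tuning with counterfactual data is explored as a remedy.

Result: Current models show substantial performance degradation on counterfactual video content in Ro-Bench. Fine-tuning with counterfactual data yields a 21.73% performance increase on Ro-Bench and a 12.78% improvement across 20 tasks in the MVBench dataset.

Conclusion: Counterfactual data effectively enhances the video understanding ability of MLLMs. The results underline the importance of evaluating and improving robustness to out-of-distribution video content and provide a benchmark and a method for future work on robust video understanding.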


📄 Abstract

Recently, Multi-modal Large Language Models (MLLMs) have demonstrated significant performance across various video understanding tasks. However, their robustness, particularly when faced with manipulated video content, remains largely unexplored. In this paper, we introduce Ro-Bench, the first benchmark for evaluating MLLMs on dynamic out-of-distribution (OOD) counterfactual video test sets. Ro-Bench incorporates high-quality, diverse and temporally relevant video data, by editing Style, Object, Background and their compositions. We evaluated eight recent video MLLMs and found that current models exhibit substantial performance degradation on Ro-Bench when exposed to counterfactual video content. Furthermore, we demonstrate that fine-tuning MLLMs with counterfactual data enhances robustness, achieving a 21.73% performance increase on Ro-Bench and a 12.78% improvement across 20 tasks in the MVBench dataset. These findings underscore the effectiveness of counterfactual data in enhancing the video understanding ability of MLLMs. The code and data will be released shortly.

[13] Unleashing Perception-Time Scaling to Multimodal Reasoning Models

Yifan Li, Zhenghao Chen, Ziheng Wu, Kun Zhou, Ruipu Luo, Can Zhang, Zhentao He, Yufei Zhan, Wayne Xin Zhao, Minghui Qiu

🧩 TL;DR

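This paper proposes Perception-Time Scaling (PTS), a paradigm that decomposes complex perception problems into tractable sub-problems and substantially improves the precision of large vision-language models on visual estimation tasks, raising high-precision performance from 8.0% to 64.7%.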


📘 Detailed Summary

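Motivation: Large vision-language models (LVLMs) show limited estimation precision on visual perception tasks, and inference-time scaling brings only marginal gains. The authors attribute this to the fast perception paradigm of current models, which treats visual understanding as a one-shot output without modeling the underlying perceptual process.

Method: Perception-Time Scaling (PTS) encourages token-rich perception and decomposes complex perception problems into intermediate tractable sub-problems. Combined with reinforcement learning techniques, it lets perception align with and benefit from inference-time scaling.

Result: PTS significantly improves perception accuracy, raising high-precision performance on the DisTANCE benchmark from 8.0% to 64.7%, and generalizes well to out-of-domain tasks. Although PTS data are purely synthetic, combining them with math reasoning data yields consistent gains on both reasoning and real-world perception benchmarks.

Conclusion: By introducing more perception-related tokens and increasing the model's attention to image tokens, PTS addresses the precision limits of current LVLMs on visual perception tasks and offers a new direction for improving perception in multimodal models.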


📄 Abstract

Recent advances in inference-time scaling, particularly those leveraging reinforcement learning with verifiable rewards, have substantially enhanced the reasoning capabilities of Large Vision-Language Models (LVLMs). Inspired by this success, similar strategies have been applied to multimodal reasoning, yet their impact on visual perception remains unclear. To investigate this gap, we introduce DisTANCE, a perception-centric benchmark for visual estimation tasks. Evaluation results show that LVLMs exhibit limited estimation precision, and inference-time scaling offers only marginal gains. We attribute this to the fast perception paradigm of current LVLMs, where visual understanding is treated as a one-shot output without modeling the underlying perceptual process. To address this, we propose Perception-Time Scaling (PTS), a novel paradigm that encourages token-rich perception and decomposes complex perception problems into intermediate tractable sub-problems, thereby enabling perception to align with and benefit from inference-time scaling. Combined with reinforcement learning techniques, PTS significantly improves perception accuracy, raising high-precision performance on DisTANCE from 8.0% to 64.7%, and generalizes well to out-of-domain tasks. Surprisingly, even though PTS data are purely synthetic, combining them with math reasoning data yields consistent gains in both reasoning and real-world perception benchmarks. Further analysis reveals that PTS introduces more perception-related tokens and increases the model's attention to image tokens. Our code and data will be publicly released.

[14] Hierarchical Scheduling for Multi-Vector Image Retrieval

Maoliang Li, Ke Li, Yaoyang Liu, Jiayu Chen, Zihao Zheng, Yinjun Wu, Xiang Chen

🧩 TL;DR

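This paper proposes HiMIR, a framework that uses hierarchical multi-granularity image retrieval and cross-hierarchy similarity consistency to improve the accuracy of retrieval-augmented generation in multimodal large language models while reducing computation by up to 3.5 times.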


📘 Detailed Summary

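Motivation: Conventional retrieval approaches offer limited accuracy in retrieval-augmented generation for multimodal large language models. Existing multi-vector retrieval methods improve accuracy by decomposing queries and matching them against segmented images, but they remain sub-optimal in accuracy and efficiency because the query is poorly aligned with varying image objects and fine-grained image segments are redundant.

Method: HiMIR is an efficient scheduling framework. It adopts a hierarchical paradigm with multiple intermediate granularities for varying image objects to improve alignment, exploits cross-hierarchy similarity consistency and hierarchy sparsity to reduce retrieval redundancy, and configures parameters automatically for each dataset to suit diverse scenarios.

Result: Empirically, HiMIR achieves substantial accuracy improvements and reduces computation by up to 3.5 times over the existing multi-vector retrieval system.

Conclusion: The study shows that a hierarchical multi-granularity retrieval strategy balances accuracy and efficiency, providing a practical optimization for multimodal retrieval systems that adapts to different application scenarios.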


📄 Abstract

To effectively leverage user-specific data, retrieval augmented generation (RAG) is employed in multimodal large language model (MLLM) applications. However, conventional retrieval approaches often suffer from limited retrieval accuracy. Recent advances in multi-vector retrieval (MVR) improve accuracy by decomposing queries and matching against segmented images. However, they still suffer from sub-optimal accuracy and efficiency because they overlook the alignment between the query and varying image objects and the redundancy of fine-grained image segments. In this work, we present HiMIR, an efficient scheduling framework for image retrieval. First, we introduce a novel hierarchical paradigm, employing multiple intermediate granularities for varying image objects to enhance alignment. Second, we reduce redundancy in retrieval by leveraging cross-hierarchy similarity consistency and hierarchy sparsity to avoid unnecessary matching computation. Furthermore, we configure parameters for each dataset automatically for practicality across diverse scenarios. Our empirical study shows that HiMIR not only achieves substantial accuracy improvements but also reduces computation by up to 3.5 times over the existing MVR system.
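For readers unfamiliar with multi-vector retrieval, the sketch below shows late-interaction (MaxSim) scoring, a common way to match a decomposed query against image segments: each query vector takes its best-matching segment, and the per-vector maxima are summed. Whether HiMIR or the baseline MVR system scores this way is an assumption, and all names and the toy data are hypothetical.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vectors: Sequence[Vector],
                 segment_vectors: Sequence[Vector]) -> float:
    """For each query vector keep its highest similarity to any image
    segment, then sum those maxima over the query vectors."""
    return sum(max(dot(q, s) for s in segment_vectors)
               for q in query_vectors)


def rank_images(query_vectors: Sequence[Vector],
                images: Dict[str, Sequence[Vector]]) -> List[str]:
    """Return image names ordered from best to worst MaxSim score."""
    scores = {name: maxsim_score(query_vectors, segs)
              for name, segs in images.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

The cost of this scoring grows with the number of query vectors times the number of segments per image, which is the matching computation that hierarchical scheduling and sparsity aim to cut.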

[15] HandEval: Taking the First Step Towards Hand Quality Evaluation in Generated Images

Zichuan Wang, Bo Peng, Songlin Yang, Zhenchen Tang, Jing Dong

🧩 TL;DR

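This paper proposes the first quality assessment task for hand regions in generated images and develops HandEval, a hand quality assessment model based on a multimodal large language model. In downstream applications, HandEval notably improves the realism of generated hands and the accuracy of AIGC detection.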


📘 Detailed Summary

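Motivation: Text-to-image models have greatly improved overall visual quality, but they still struggle with details in complex local regions, especially hands. Generated hands often show structural distortions and unrealistic textures, yet hand quality assessment has been largely neglected, which limits downstream tasks such as human-centric generation quality optimization and AIGC detection.

Method: The authors first build HandPair, a dataset of 48k images formed by high- and low-quality hand pairs, which enables low-cost, efficient supervision without manual annotation. On this basis they develop HandEval, which leverages the visual understanding capability of a multimodal large language model and incorporates prior knowledge of hand keypoints to gain strong perception of hand quality.

Result: On a human-annotated test set of hand images from various state-of-the-art text-to-image models, HandEval aligns better with human judgments than existing state-of-the-art methods. Integrating HandEval into image generation and AIGC detection pipelines notably improves generated hand realism and detection accuracy, respectively, confirming its effectiveness in downstream applications.

Conclusion: The study fills the gap in hand quality assessment for generated images and demonstrates its value for downstream tasks. Combining a multimodal large language model with hand-structure priors offers a new technical route for assessing the quality of generated content.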


📄 Abstract

Although recent text-to-image (T2I) models have significantly improved the overall visual quality of generated images, they still struggle in the generation of accurate details in complex local regions, especially human hands. Generated hands often exhibit structural distortions and unrealistic textures, which can be very noticeable even when the rest of the body is well-generated. However, the quality assessment of hand regions remains largely neglected, limiting downstream task performance like human-centric generation quality optimization and AIGC detection. To address this, we propose the first quality assessment task targeting generated hand regions and showcase its abundant downstream applications. We first introduce the HandPair dataset for training hand quality assessment models. It consists of 48k images formed by high- and low-quality hand pairs, enabling low-cost, efficient supervision without manual annotation. Based on it, we develop HandEval, a carefully designed hand-specific quality assessment model. It leverages the powerful visual understanding capability of Multimodal Large Language Model (MLLM) and incorporates prior knowledge of hand keypoints, gaining strong perception of hand quality. We further construct a human-annotated test set with hand images from various state-of-the-art (SOTA) T2I models to validate its quality evaluation capability. Results show that HandEval aligns better with human judgments than existing SOTA methods. Furthermore, we integrate HandEval into image generation and AIGC detection pipelines, prominently enhancing generated hand realism and detection accuracy, respectively, confirming its universal effectiveness in downstream applications. Code and dataset will be available.

[16] Speculative Jacobi-Denoising Decoding for Accelerating Autoregressive Text-to-image Generation

Yao Teng, Fuyun Wang, Xian Liu, Zhekai Chen, Han Shi, Yu Wang, Zhenguo Li, Weiyang Liu, Difan Zou, Xihui Liu

🧩 TL;DR

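This paper proposes Speculative Jacobi-Denoising Decoding (SJD2), a framework that incorporates the denoising process into Jacobi iterations to enable parallel token generation in autoregressive text-to-image models, substantially accelerating inference.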


📘 Detailed Summary

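Motivation: Autoregressive text-to-image models decode tokens sequentially, so inference is slow: generating a single image often requires thousands of model forward passes, and this inefficiency limits practical use.

Method: The authors introduce a next-clean-token prediction paradigm. Through low-cost fine-tuning, a pre-trained autoregressive model learns to accept noise-perturbed token embeddings and predict the next clean tokens, which guides the model toward more stable Jacobi trajectories. At inference time, token sequences are initialized with Gaussian noise, next-clean-token prediction is iterated in the embedding space, and a probabilistic criterion verifies and accepts multiple tokens in parallel.

Result: Experiments show that the method accelerates generation by reducing the number of model forward passes while maintaining the visual quality of the generated images.

Conclusion: The work offers an effective parallelization approach for autoregressive generative models. By combining denoising with Jacobi iteration, it improves inference efficiency while preserving generation quality, which is relevant to large-scale visual content generation.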


📄 Abstract

As a new paradigm of visual content generation, autoregressive text-to-image models suffer from slow inference due to their sequential token-by-token decoding process, often requiring thousands of model forward passes to generate a single image. To address this inefficiency, we propose Speculative Jacobi-Denoising Decoding (SJD2), a framework that incorporates the denoising process into Jacobi iterations to enable parallel token generation in autoregressive models. Our method introduces a next-clean-token prediction paradigm that enables the pre-trained autoregressive models to accept noise-perturbed token embeddings and predict the next clean tokens through low-cost fine-tuning. This denoising paradigm guides the model towards more stable Jacobi trajectories. During inference, our method initializes token sequences with Gaussian noise and performs iterative next-clean-token-prediction in the embedding space. We employ a probabilistic criterion to verify and accept multiple tokens in parallel, and refine the unaccepted tokens for the next iteration with the denoising trajectory. Experiments show that our method can accelerate generation by reducing model forward passes while maintaining the visual quality of generated images.
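To make the Jacobi-iteration idea concrete, the following minimal Python sketch contrasts ordinary sequential decoding with plain Jacobi decoding on a toy deterministic "model". It does not implement SJD2 itself: there is no denoising, no noise-perturbed embedding, and no probabilistic acceptance criterion. The function `toy_next_token` and all other names are hypothetical stand-ins introduced only for illustration.

```python
from typing import List, Tuple


def toy_next_token(prefix: List[int]) -> int:
    """Hypothetical deterministic stand-in for greedy next-token prediction."""
    return (31 * sum(prefix) + len(prefix)) % 7


def sequential_decode(prompt: List[int], n: int) -> List[int]:
    """Standard autoregressive decoding: one model call per new token."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(toy_next_token(seq))
    return seq[len(prompt):]


def jacobi_decode(prompt: List[int], n: int) -> Tuple[List[int], int]:
    """Plain Jacobi decoding: refine all n positions at once until a fixed point.

    Each loop iteration corresponds to one parallel pass over all positions.
    Position i is recomputed from the prompt plus the previous guess for
    positions 0..i-1, so at least one more token becomes correct per pass.
    """
    guess = [0] * n  # arbitrary initial guess
    iterations = 0
    while True:
        iterations += 1
        updated = [toy_next_token(list(prompt) + guess[:i]) for i in range(n)]
        if updated == guess:  # fixed point: identical to sequential output
            return guess, iterations
        guess = updated


prompt = [3, 1, 4]
tokens, passes = jacobi_decode(prompt, 6)
print(tokens == sequential_decode(prompt, 6), passes)
```

In this toy setting the fixed point always equals the sequential result and is reached in at most n + 1 passes. Speedups in practice depend on several positions converging in the same pass, which is the behaviour the abstract's denoising paradigm aims to encourage.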

[17] On Epistemic Uncertainty of Visual Tokens for Object Hallucinations in Large Vision-Language Models

Hoigi Seo, Dong Un Kang, Hyunjin Cho, Joohoon Lee, Se Young Chun

🧩 TL;DR

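This paper mitigates object hallucination in large vision-language models by modifying only the vision encoder: it identifies visual tokens with high uncertainty and masks them to reduce hallucinations.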


📘 Detailed Summary

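Motivation: Large vision-language models have achieved remarkable success but still suffer from object hallucination, that is, describing objects that are not present in the input image. The authors argue that uncertain visual tokens within the vision encoder are a key contributing factor.

Method: The authors propose a simple yet effective strategy with two parts: a proxy method based on adversarial perturbations that efficiently identifies uncertain visual tokens, and a masking step that excludes these tokens during self-attention in the middle layers of the vision encoder, suppressing their influence on visual encoding.

Result: Extensive experiments show that the method significantly reduces object hallucinations in large vision-language models and works alongside other existing techniques. Statistical analysis supports a positive correlation between high-uncertainty visual tokens and hallucinations, and theoretical and empirical results link large representation deviations under small adversarial perturbations to high epistemic uncertainty.

Conclusion: Uncertain visual tokens in the vision encoder are an important source of object hallucination, and masking them in a targeted way effectively mitigates the problem, suggesting a new direction for improving the reliability of large vision-language models.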


📄 Abstract

Large vision-language models (LVLMs), which integrate a vision encoder (VE) with a large language model, have achieved remarkable success across various tasks. However, there are still crucial challenges in LVLMs such as object hallucination, i.e., generating descriptions of objects that are not in the input image. Here, we argue that uncertain visual tokens within the VE are a key factor that contributes to object hallucination. Our statistical analysis found that there are positive correlations between visual tokens with high epistemic uncertainty and the occurrence of hallucinations. Furthermore, we show theoretically and empirically that visual tokens in early VE layers that exhibit large representation deviations under small adversarial perturbations indicate high epistemic uncertainty. Based on these findings, we propose a simple yet effective strategy to mitigate object hallucination by modifying the VE only. Our method comprises a proxy method with adversarial perturbations for identifying uncertain visual tokens efficiently and a method to mask these uncertain visual tokens during the self-attention process in the middle layers of the VE, suppressing their influence on visual encoding and thus alleviating hallucinations. Extensive experiments show that our method significantly reduces object hallucinations in LVLMs and can work synergistically with other existing methods.
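The two steps described in the abstract (flag tokens whose representations move a lot under a small perturbation, then exclude them as attention keys) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the perturbed features are taken as given rather than produced by an actual adversarial attack, the L2 deviation and the fixed `threshold` are assumed choices, and the function names are hypothetical.

```python
import math
from typing import List


def uncertain_token_mask(clean_feats: List[List[float]],
                         perturbed_feats: List[List[float]],
                         threshold: float) -> List[bool]:
    """Flag tokens whose features deviate strongly under perturbation.

    Assumption: uncertainty is proxied by the L2 distance between a token's
    clean and perturbed representation, compared against a fixed threshold.
    """
    mask = []
    for clean, perturbed in zip(clean_feats, perturbed_feats):
        deviation = math.sqrt(sum((a - b) ** 2 for a, b in zip(clean, perturbed)))
        mask.append(deviation > threshold)
    return mask


def masked_softmax(scores: List[float], masked_keys: List[bool]) -> List[float]:
    """Softmax over attention scores with flagged keys given zero weight."""
    kept = [s for s, m in zip(scores, masked_keys) if not m]
    if not kept:
        raise ValueError("all keys are masked")
    peak = max(kept)  # subtract the max for numerical stability
    exps = [0.0 if m else math.exp(s - peak) for s, m in zip(scores, masked_keys)]
    total = sum(exps)
    return [e / total for e in exps]


clean = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
perturbed = [[0.0, 0.1], [3.0, 1.0], [2.0, 2.0]]
mask = uncertain_token_mask(clean, perturbed, threshold=0.5)
print(mask, masked_softmax([1.0, 5.0, 1.0], mask))
```

Here the second token is flagged, so its large attention score no longer dominates and the remaining weight is shared by the two unflagged tokens.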

[18] Dense2MoE: Restructuring Diffusion Transformer to MoE for Efficient Text-to-Image Generation

Youwei Zheng, Yuxi Ren, Xin Xia, Xuefeng Xiao, Xiaohua Xie

🧩 TL;DR

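This work proposes Dense2MoE, which converts a dense diffusion transformer into a Mixture-of-Experts structure, reducing activated parameters by 60% while maintaining model performance and establishing a new paradigm for efficient text-to-image generation.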


📘 Detailed Summary

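Motivation: Diffusion transformers perform very well in text-to-image generation, but their large parameter count causes substantial inference overhead. Existing parameter compression methods rely mainly on pruning, and aggressive pruning severely degrades performance because it reduces model capacity.

Method: The dense DiT is converted into a Mixture-of-Experts (MoE) structure for structured sparsification: the feed-forward networks (FFNs) in DiT blocks are replaced with MoE layers, cutting activated FFN parameters by 62.5%. A Mixture of Blocks (MoB) selectively activates DiT blocks to increase sparsity further. A multi-step distillation pipeline supports the conversion, combining Taylor metric-based expert initialization, knowledge distillation with load balancing, and a group feature loss for MoB optimization.

Result: Large diffusion transformers are converted into an MoE structure with 60% fewer activated parameters while maintaining original performance, surpassing pruning-based approaches in extensive experiments.

Conclusion: Dense2MoE establishes a new paradigm for efficient text-to-image generation, showing that structured sparsification, rather than conventional pruning, can substantially reduce activated parameters while preserving model capacity, which makes large generative models more practical to deploy.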


📄 Abstract

Diffusion Transformer (DiT) has demonstrated remarkable performance in text-to-image generation; however, its large parameter size results in substantial inference overhead. Existing parameter compression methods primarily focus on pruning, but aggressive pruning often leads to severe performance degradation due to reduced model capacity. To address this limitation, we pioneer the transformation of a dense DiT into a Mixture of Experts (MoE) for structured sparsification, reducing the number of activated parameters while preserving model capacity. Specifically, we replace the Feed-Forward Networks (FFNs) in DiT Blocks with MoE layers, reducing the number of activated parameters in the FFNs by 62.5%. Furthermore, we propose the Mixture of Blocks (MoB) to selectively activate DiT blocks, thereby further enhancing sparsity. To ensure an effective dense-to-MoE conversion, we design a multi-step distillation pipeline, incorporating Taylor metric-based expert initialization, knowledge distillation with load balancing, and group feature loss for MoB optimization. We transform large diffusion transformers (e.g., FLUX.1 [dev]) into an MoE structure, reducing activated parameters by 60% while maintaining original performance and surpassing pruning-based approaches in extensive experiments. Overall, Dense2MoE establishes a new paradigm for efficient text-to-image generation.

[19] SOS: Synthetic Object Segments Improve Detection, Segmentation, and Grounding

Weikai Huang, Jieyu Zhang, Taoyang Jia, Chenhao Zheng, Ziqi Gao, Jae Sung Park, Ranjay Krishna

🧩 TL;DR

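This paper presents SOS, a simple and scalable data synthesis pipeline built on an object-centric composition strategy. It pastes high-quality synthetic object segments into new images using structured layout priors and generative relighting, and substantially improves performance on detection and visual grounding tasks.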


📘 Detailed Summary

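Motivation: Visual grouping tasks depend on large annotated datasets, but real data is costly to annotate, biased in coverage, and hard to scale, while existing synthetic data approaches often lack flexibility, accuracy, and compositional diversity.

Method: SOS uses an object-centric composition strategy: it pastes high-quality synthetic object segments into new images, guides object placement with structured layout priors, and applies generative relighting for visual consistency, producing accurate masks, boxes, and referring expressions.

Result: On detection and grounding tasks, models trained on only 100,000 SOS synthetic images outperform models trained on larger real-image datasets (GRIT with 20M images and V3Det with 200K), achieving +10.9 AP on LVIS detection and +8.4 NAcc on gRefCOCO grounding.

Conclusion: SOS enables controllable dataset construction and improves generalization in both low-data and closed-vocabulary settings. Augmenting LVIS and COCO with synthetic object segments yields strong performance across real-data scales, with even larger gains when real data is extremely limited.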


📄 Abstract

Visual grouping -- operationalized via instance segmentation, visual grounding, and object detection -- underpins applications from robotic perception to photo editing. Large annotated datasets are costly, biased in coverage, and hard to scale. Synthetic data are promising but often lack flexibility, accuracy, and compositional diversity. We present SOS, a simple and scalable data synthesis pipeline based on an object-centric composition strategy. It pastes high-quality synthetic object segments into new images using structured layout priors and generative relighting, producing accurate and diverse masks, boxes, and referring expressions. Models trained on 100000 synthetic images from SOS outperform those trained on larger real-image datasets such as GRIT (20M) and V3Det (200K) on detection and grounding tasks, achieving +10.9 AP on LVIS detection and +8.4 $N_{\text{Acc}}$ on gRefCOCO grounding. SOS enables controllable dataset construction and improves generalization in both low-data and closed-vocabulary settings. Augmenting LVIS and COCO with synthetic object segments yields strong performance across real-data scales and even larger gains under extremely limited real data (for example, +3.83 $AP_{\text{rare}}$ on LVIS instance segmentation and +6.59 AP with a 1 percent COCO setup). This controllability also supports targeted data generation for challenging intra-class referring in visual grounding.

[20] MSDM: Generating Task-Specific Pathology Images with a Multimodal Conditioned Diffusion Model for Cell and Nuclei Segmentation

Dominik Winter, Mai Bui, Monica Azqueta Gavaldon, Nicolas Triltsch, Marco Rosati, Nicolas Brieu

🧩 TL;DR

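This paper proposes MSDM, a Multimodal Semantic Diffusion Model that generates pixel-precise image-mask pairs for cell and nuclei segmentation. By conditioning on morphology, color characteristics, and metadata, it significantly improves segmentation performance on rare cell types.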


📘 Detailed Summary

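Motivation: Annotated data is scarce in computational pathology, especially for cells and nuclei with rare or atypical morphologies. Manual annotation is costly and time-consuming, and synthetic data offers a cost-effective alternative.

Method: The authors propose the Multimodal Semantic Diffusion Model (MSDM), which conditions generation on cellular/nuclear morphologies (using horizontal and vertical maps), RGB color characteristics, and BERT-encoded assay/indication metadata. These heterogeneous modalities are fused via multi-head cross-attention, giving fine-grained control over the generated images.

Result: Quantitative analysis shows that synthetic images closely match real data, with low Wasserstein distances between embeddings of generated and real images under matching biological conditions. Adding synthetic samples, exemplified by columnar cells, significantly improves segmentation accuracy on that cell type.

Conclusion: Multimodal diffusion-based augmentation systematically enriches datasets and directly targets model deficiencies, providing an effective way to improve the robustness and generalizability of cell and nuclei segmentation models and supporting broader use of generative models in computational pathology.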


📄 Abstract

Scarcity of annotated data, particularly for rare or atypical morphologies, presents significant challenges for cell and nuclei segmentation in computational pathology. While manual annotation is labor-intensive and costly, synthetic data offers a cost-effective alternative. We introduce a Multimodal Semantic Diffusion Model (MSDM) for generating realistic pixel-precise image-mask pairs for cell and nuclei segmentation. By conditioning the generative process with cellular/nuclear morphologies (using horizontal and vertical maps), RGB color characteristics, and BERT-encoded assay/indication metadata, MSDM generates datasets with desired morphological properties. These heterogeneous modalities are integrated via multi-head cross-attention, enabling fine-grained control over the generated images. Quantitative analysis demonstrates that synthetic images closely match real data, with low Wasserstein distances between embeddings of generated and real images under matching biological conditions. The incorporation of these synthetic samples, exemplified by columnar cells, significantly improves segmentation model accuracy on columnar cells. This strategy systematically enriches data sets, directly targeting model deficiencies. We highlight the effectiveness of multimodal diffusion-based augmentation for advancing the robustness and generalizability of cell and nuclei segmentation models. Thereby, we pave the way for broader application of generative models in computational pathology.

[21] Towards Safer and Understandable Driver Intention Prediction

Mukilan Karuppasamy, Shankar Gangisetty, Shyam Nandan Rai, Carlo Masone, C V Jawahar

🧩 TL;DR

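This work introduces the task of interpretable driver intention prediction and develops the VCBM framework and the DAAD-X dataset. Using a concept bottleneck model, it generates spatio-temporally coherent explanations and shows that transformer-based models are more interpretable than conventional CNN-based models.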


📘 Detailed Summary

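Motivation: As interactions between autonomous driving systems and humans increase, the interpretability of decision-making in driving systems becomes crucial for safe operation. Deep learning systems still struggle to expose how they represent the environment and the driving task, which motivates interpretable driver intention prediction.

Method: The authors curate DAAD-X, a multimodal dataset that provides hierarchical textual explanations for the driver's decisions, and propose the Video Concept Bottleneck Model (VCBM), a framework that inherently generates spatio-temporally coherent explanations without relying on post-hoc techniques.

Result: Extensive evaluation on DAAD-X shows that transformer-based models exhibit greater interpretability than conventional CNN-based models. The work also introduces a multilabel t-SNE visualization technique to illustrate the disentanglement and causal correlation among multiple explanations.

Conclusion: The study contributes a new dataset and framework for interpretable driver intention prediction, provides evidence that transformer architectures are more interpretable in this setting, and offers a visualization tool for examining the causal structure of driving decisions, supporting safer and more trustworthy autonomous driving systems.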


📄 Abstract

Autonomous driving (AD) systems are becoming increasingly capable of handling complex tasks, mainly due to recent advances in deep learning and AI. As interactions between autonomous systems and humans increase, the interpretability of decision-making processes in driving systems becomes increasingly crucial for ensuring safe driving operations. Successful human-machine interaction requires understanding the underlying representations of the environment and the driving task, which remains a significant challenge in deep learning-based systems. To address this, we introduce the task of interpretable prediction of maneuvers before they occur for driver safety, i.e., driver intent prediction (DIP), which plays a critical role in AD systems. To foster research in interpretable DIP, we curate the eXplainable Driving Action Anticipation Dataset (DAAD-X), a new multimodal, ego-centric video dataset to provide hierarchical, high-level textual explanations as causal reasoning for the driver's decisions. These explanations are derived from both the driver's eye-gaze and the ego-vehicle's perspective. Next, we propose Video Concept Bottleneck Model (VCBM), a framework that generates spatio-temporally coherent explanations inherently, without relying on post-hoc techniques. Finally, through extensive evaluations of the proposed VCBM on the DAAD-X dataset, we demonstrate that transformer-based models exhibit greater interpretability than conventional CNN-based models. Additionally, we introduce a multilabel t-SNE visualization technique to illustrate the disentanglement and causal correlation among multiple explanations. Our data, code and models are available at: https://mukil07.github.io/VCBM.github.io/

[22] Cattle-CLIP: A Multimodal Framework for Cattle Behaviour Recognition

Huimin Liu, Jing Gao, Daria Baran, AxelX Montout, Neill W Campbell, Andrew W Dowsey

🧩 TL;DR

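This paper presents Cattle-CLIP, a multimodal deep learning framework for cattle behaviour recognition that uses semantic cues to improve video-based visual feature recognition, performing well in both supervised and few-shot settings.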


📘 Detailed Summary

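Motivation: Video-based cattle behaviour monitoring can achieve high accuracy on some tasks but falls short on data-scarce behaviour recognition. In particular, the domain gap between the web data used to pre-train models and real farm surveillance footage has not been adequately addressed.

Method: The method adapts the large-scale image-language model CLIP by adding a temporal integration module to handle video sequences, and addresses the domain gap with tailored data augmentation strategies and specialised text prompts.

Result: In the fully supervised setting, Cattle-CLIP achieves 96.1% overall accuracy across six behaviour classes, with nearly 100% recall for feeding, drinking, and standing-ruminating, and it generalises robustly in few-shot scenarios.

Conclusion: The results highlight the potential of multimodal learning in agricultural and animal behaviour analysis, especially its ability to maintain strong performance when data is scarce, offering an effective technical approach to precision livestock monitoring.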


📄 Abstract

Cattle behaviour is a crucial indicator of an individual animal's health, productivity and overall well-being. Video-based monitoring, combined with deep learning techniques, has become a mainstream approach in animal biometrics, and it can offer high accuracy in some behaviour recognition tasks. We present Cattle-CLIP, a multimodal deep learning framework for cattle behaviour recognition, using semantic cues to improve the performance of video-based visual feature recognition. It is adapted from the large-scale image-language model CLIP by adding a temporal integration module. To address the domain gap between web data used for the pre-trained model and real-world cattle surveillance footage, we introduce tailored data augmentation strategies and specialised text prompts. Cattle-CLIP is evaluated under both fully-supervised and few-shot learning scenarios, with a particular focus on data-scarce behaviour recognition, an important yet under-explored goal in livestock monitoring. To evaluate the proposed method, we release the CattleBehaviours6 dataset, which comprises six types of indoor behaviours: feeding, drinking, standing-self-grooming, standing-ruminating, lying-self-grooming and lying-ruminating. The dataset consists of 1905 clips collected from our John Oldacre Centre dairy farm research platform housing 200 Holstein-Friesian cows. Experiments show that Cattle-CLIP achieves 96.1% overall accuracy across six behaviours in a supervised setting, with nearly 100% recall for feeding, drinking and standing-ruminating behaviours, and demonstrates robust generalisation with limited data in few-shot scenarios, highlighting the potential of multimodal learning in agricultural and animal behaviour analysis.

[23] Diagnosing Shoulder Disorders Using Multimodal Large Language Models and Consumer-Grade Cameras

Jindong Hong, Wencheng Zhang, Shiqin Qiao, Jianhai Chen, Jianing Qiu, Chuanyang Zheng, Qian Xu, Yun Ji, Qianyue Wen, Weiwei Sun, Hao Li, Huizhen Li, Huichao Wang, Kai Wu, Meng Li, Yijun He, Lingjie Luo, Jiankai Sun

🧩 TL;DR

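This work proposes HMVDx, a Hybrid Motion Video Diagnosis framework that separates action understanding from disease diagnosis and assigns each task to its own MLLM. It substantially improves the accuracy of preliminary shoulder disorder diagnosis and offers a low-cost, scalable auxiliary diagnostic solution for regions with scarce medical resources.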


📘 Detailed Summary

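Motivation: Shoulder disorders such as frozen shoulder are common worldwide. In regions with scarce medical resources, early and accurate diagnosis is difficult, creating an urgent need for low-cost, easily scalable auxiliary diagnostic solutions. This work uses videos captured by consumer-grade devices as the basis for diagnosis to reduce the cost for users.

Method: The authors propose the Hybrid Motion Video Diagnosis framework (HMVDx), which separates action understanding and disease diagnosis into two tasks handled by two multimodal large language models. They also propose a new metric, the Usability Index, based on the logical process of medical decision-making, to evaluate the effectiveness of MLLMs from the perspective of the complete diagnostic pathway.

Result: In experimental comparisons, the accuracy of HMVDx in diagnosing shoulder joint injuries increased by 79.6% compared with direct video diagnosis.

Conclusion: The study reveals the potential value of low-cost MLLMs for medical practitioners and contributes technical groundwork and direction for future research on MLLM-based video understanding in medicine, particularly for auxiliary diagnosis in resource-constrained settings.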


📄 Abstract

Shoulder disorders, such as frozen shoulder (a.k.a. adhesive capsulitis), are common conditions affecting the health of people worldwide, and have a high incidence rate among the elderly and workers engaged in repetitive shoulder tasks. In regions with scarce medical resources, achieving early and accurate diagnosis poses significant challenges, and there is an urgent need for low-cost and easily scalable auxiliary diagnostic solutions. This research introduces videos captured by consumer-grade devices as the basis for diagnosis, reducing the cost for users. We focus on the innovative application of Multimodal Large Language Models (MLLMs) in the preliminary diagnosis of shoulder disorders and propose a Hybrid Motion Video Diagnosis framework (HMVDx). This framework separates the two tasks of action understanding and disease diagnosis, each of which is completed by its own MLLM. In addition to traditional evaluation indicators, this work proposes a novel metric called the Usability Index, based on the logical process of medical decision-making (action recognition, movement diagnosis, and final diagnosis). This index evaluates the effectiveness of MLLMs in the medical field from the perspective of the entire medical diagnostic pathway, revealing the potential value of low-cost MLLMs in medical applications for medical practitioners. In experimental comparisons, the accuracy of HMVDx in diagnosing shoulder joint injuries has increased by 79.6% compared with direct video diagnosis, a significant technical contribution to future research on the application of MLLMs for video understanding in the medical field.

[24] Clear Roads, Clear Vision: Advancements in Multi-Weather Restoration for Smart Transportation

Vijay M. Galshetwar, Praful Hambarde, Prashant W. Patil, Akshay Dudhane, Sachin Chaudhary, Santosh Kumar Vipparathi, Subrahmanyam Murala

🧩 TL;DR

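This survey comprehensively reviews restoration techniques for weather-degraded images and videos, systematically categorizes traditional prior-based methods and modern data-driven models, and discusses future directions for weather-resilient vision systems in intelligent transportation.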


📘 Detailed Summary

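Motivation: Adverse weather such as haze, rain, and snow significantly degrades image and video quality, posing serious challenges to intelligent transportation systems that rely on visual input, including autonomous driving, traffic monitoring, and surveillance. Effective restoration techniques are needed to mitigate weather-induced visual impairments.

Method: Existing approaches are categorized into traditional prior-based methods and modern data-driven models, including CNNs, transformers, diffusion models, and emerging vision-language models. Restoration strategies are further classified by scope: single-task models, multi-task/multi-weather systems, and all-in-one frameworks capable of handling diverse degradations.

Result: The survey covers daytime and nighttime restoration challenges, benchmark datasets, and evaluation protocols, and reviews how the different families of methods address weather degradation.

Conclusion: The survey discusses limitations of current research and outlines future directions such as mixed/compound-degradation restoration, real-time deployment, and agentic AI frameworks, aiming to serve as a reference for weather-resilient vision systems in smart transportation environments.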


📄 Abstract

Adverse weather conditions such as haze, rain, and snow significantly degrade the quality of images and videos, posing serious challenges to intelligent transportation systems (ITS) that rely on visual input. These degradations affect critical applications including autonomous driving, traffic monitoring, and surveillance. This survey presents a comprehensive review of image and video restoration techniques developed to mitigate weather-induced visual impairments. We categorize existing approaches into traditional prior-based methods and modern data-driven models, including CNNs, transformers, diffusion models, and emerging vision-language models (VLMs). Restoration strategies are further classified based on their scope: single-task models, multi-task/multi-weather systems, and all-in-one frameworks capable of handling diverse degradations. In addition, we discuss day and night time restoration challenges, benchmark datasets, and evaluation protocols. The survey concludes with an in-depth discussion on limitations in current research and outlines future directions such as mixed/compound-degradation restoration, real-time deployment, and agentic AI frameworks. This work aims to serve as a valuable reference for advancing weather-resilient vision systems in smart transportation environments. Lastly, to stay current with rapid advancements in this field, we will maintain regular updates of the latest relevant papers and their open-source implementations at https://github.com/ChaudharyUPES/A-comprehensive-review-on-Multi-weather-restoration

[25] CapGeo: A Caption-Assisted Approach to Geometric Reasoning

Yuying Li, Siyi Qian, Hao Liang, Leqi Zheng, Ruichuan An, Yongzhen Guo, Wentao Zhang

🧩 TL;DR

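This paper proposes CapGeo, a framework that improves the geometric reasoning of multimodal large language models by converting geometric figures into text captions, and introduces CapGeo-Bench, a benchmark dataset for systematically evaluating geometric captioning models.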


📘 Detailed Summary

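Motivation: Geometric reasoning remains a significant bottleneck for state-of-the-art multimodal large language models. Even closed-source systems such as GPT-O3 and Gemini-2.5-Pro struggle to solve geometry problems reliably, which suggests that the core difficulty lies in understanding geometric diagrams rather than in reasoning itself.

Method: The authors introduce CapGeo, a caption-assisted reasoning framework that bridges the visual and textual modalities by converting visual content into concise text descriptions. They also build CapGeo-Bench, a dataset of 4,641 curated figure-caption pairs with a keypoint-based evaluation metric.

Result: Models improve substantially when equipped with captions: Qwen2.5-VL-72B rises from 8.6% (vision-only) to 59.0%, and Claude-Opus-4 from 44.8% to 73.0%. The keypoint-based metric in CapGeo-Bench correlates strongly with downstream CapGeo performance.

Conclusion: The study shows that converting visual content to text is a promising route to better geometric reasoning, underlines how strongly caption quality affects downstream performance, and provides a systematic evaluation framework for advancing geometric reasoning in multimodal large language models.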


📄 Abstract

Geometric reasoning remains a core challenge for Multimodal Large Language Models (MLLMs). Even the most advanced closed-source systems, such as GPT-O3 and Gemini-2.5-Pro, still struggle to solve geometry problems reliably, despite exhibiting strong textual reasoning abilities on tasks like the International Mathematical Olympiad (IMO). This gap suggests that the bottleneck lies in understanding geometric diagrams rather than reasoning itself. Since geometric figures can often be faithfully described in concise textual form, converting visual content into captions offers a promising direction. Motivated by this insight, we introduce CapGeo, a caption-assisted reasoning framework that bridges visual and textual modalities. Experiments show substantial improvements when models are equipped with captions: Qwen2.5-VL-72B improves from 8.6% (vision-only) to 59.0%, while Claude-Opus-4 rises from 44.8% to 73.0%. To systematically evaluate and identify high-quality geometric captioning models, we further propose CapGeo-Bench, a dataset of 4,641 curated figure-caption pairs. Crucially, CapGeo-Bench incorporates a keypoint-based evaluation metric that correlates strongly with downstream CapGeo performance, enabling reliable assessment of geometric captioning ability. Together, our framework and benchmark highlight a new pathway toward advancing geometric reasoning in MLLMs.

[26] StreamingVLM: Real-Time Understanding for Infinite Video Streams

Ruyi Xu, Guangxuan Xiao, Yukang Chen, Liuning He, Kelly Peng, Yao Lu, Song Han

🧩 TL;DR

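StreamingVLM is a vision-language model designed for real-time understanding of infinite visual input. With a unified framework that aligns training with streaming inference and a compact KV cache, it addresses the computational cost and latency of processing long videos.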


📘 Detailed Summary

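Motivation: Existing vision-language models face a key challenge on unbounded video streams: full attention incurs quadratic computational cost and high memory usage, while simple sliding-window methods either break coherence or suffer high latency from redundant recomputation. Neither meets the needs of real-time assistants and autonomous agents.

Method: The authors propose a unified framework that aligns training with streaming inference. At inference time, a compact KV cache is maintained by reusing the states of attention sinks, a short window of recent vision tokens, and a long window of recent text tokens. A supervised fine-tuning (SFT) strategy applies full attention to short, overlapped video chunks, which mimics the inference-time attention pattern without training on prohibitively long contexts.

Result: On the newly built Inf-Streams-Eval benchmark, StreamingVLM achieves a 66.18% win rate against GPT-4O mini and maintains stable real-time performance at up to 8 FPS on a single NVIDIA H100. The SFT strategy also improves general VQA ability, with gains of +4.30 on LongVideoBench and +5.96 on OVOBench Realtime.

Conclusion: StreamingVLM shows that aligning training with inference and keeping a compact cache enables stable real-time understanding of unbounded video streams. Its SFT strategy improves both streaming ability and general visual question answering, offering a practical approach for real-time vision-language applications.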


📄 Abstract

Vision-language models (VLMs) could power real-time assistants and autonomous agents, but they face a critical challenge: understanding near-infinite video streams without escalating latency and memory usage. Processing entire videos with full attention leads to quadratic computational costs and poor performance on long videos. Meanwhile, simple sliding window methods are also flawed, as they either break coherence or suffer from high latency due to redundant recomputation. In this paper, we introduce StreamingVLM, a model designed for real-time, stable understanding of infinite visual input. Our approach is a unified framework that aligns training with streaming inference. During inference, we maintain a compact KV cache by reusing states of attention sinks, a short window of recent vision tokens, and a long window of recent text tokens. This streaming ability is instilled via a simple supervised fine-tuning (SFT) strategy that applies full attention on short, overlapped video chunks, which effectively mimics the inference-time attention pattern without training on prohibitively long contexts. For evaluation, we build Inf-Streams-Eval, a new benchmark with videos averaging over two hours that requires dense, per-second alignment between frames and text. On Inf-Streams-Eval, StreamingVLM achieves a 66.18% win rate against GPT-4O mini and maintains stable, real-time performance at up to 8 FPS on a single NVIDIA H100. Notably, our SFT strategy also enhances general VQA abilities without any VQA-specific fine-tuning, improving performance on LongVideoBench by +4.30 and OVOBench Realtime by +5.96. Code is available at https://github.com/mit-han-lab/streaming-vlm.
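The cache policy described above (attention sinks, a short window of recent vision tokens, a long window of recent text tokens) can be illustrated by a small index-selection routine. This is a sketch under assumptions rather than the released implementation: the function name, the `"vision"`/`"text"` labels, and the window sizes are hypothetical, and a real system would evict key/value tensors rather than return index lists.

```python
from typing import List


def select_kv_positions(kinds: List[str], num_sink: int,
                        vision_window: int, text_window: int) -> List[int]:
    """Return the cache positions to keep, in their original order.

    kinds[i] is "vision" or "text" for the token at position i.
    The first `num_sink` positions are always kept as attention sinks;
    of the remaining positions, only the most recent `vision_window`
    vision tokens and `text_window` text tokens are kept.
    """
    num_sink = min(num_sink, len(kinds))
    sinks = list(range(num_sink))
    rest = range(num_sink, len(kinds))
    vision = [i for i in rest if kinds[i] == "vision"]
    text = [i for i in rest if kinds[i] == "text"]
    # Guard against window size 0, because lst[-0:] would keep everything.
    recent_vision = vision[-vision_window:] if vision_window > 0 else []
    recent_text = text[-text_window:] if text_window > 0 else []
    return sorted(sinks + recent_vision + recent_text)


kinds = ["text", "text", "vision", "vision", "vision",
         "text", "vision", "text", "text"]
print(select_kv_positions(kinds, num_sink=1, vision_window=2, text_window=2))
```

Because the number of kept positions is bounded by `num_sink + vision_window + text_window`, the cache size stays constant no matter how long the stream runs, which is the property that keeps latency and memory from growing.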

[27] Zero-shot image privacy classification with Vision-Language Models

Alina Elena Baia, Alessio Xompero, Andrea Cavallaro

🧩 TL;DR

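This study establishes a zero-shot benchmark for image privacy classification and systematically compares large vision-language models (VLMs) with specialized models. Despite their large parameter counts and slower inference, VLMs still lag behind smaller specialized models in privacy prediction accuracy, although they are more robust to image perturbations.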


📘 Detailed Summary

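Motivation: The literature increasingly favours large vision-language models designed for generic tasks, which risks overlooking the performance ceiling set by purpose-built models because systematic evaluation is lacking. This study addresses the gap by establishing a fair benchmark for assessing VLMs on image privacy prediction.

Method: The study establishes a zero-shot benchmark for image privacy classification, evaluates the top-3 open-source VLMs according to a privacy benchmark using task-aligned prompts, and contrasts their performance, efficiency, and robustness with established vision-only and multi-modal methods.

Result: Despite their resource-intensive nature, including high parameter counts and slower inference, VLMs currently lag behind specialized, smaller models in privacy prediction accuracy. VLMs do, however, exhibit higher robustness to image perturbations.

Conclusion: For image privacy prediction, specialized models retain a clear accuracy advantage, while VLMs offer better robustness to perturbations. The findings inform model selection and highlight the trade-off between task specialization and generality.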


📄 Abstract

While specialized learning-based models have historically dominated image privacy prediction, the current literature increasingly favours adopting large Vision-Language Models (VLMs) designed for generic tasks. This trend risks overlooking the performance ceiling set by purpose-built models due to a lack of systematic evaluation. To address this problem, we establish a zero-shot benchmark for image privacy classification, enabling a fair comparison. We evaluate the top-3 open-source VLMs, according to a privacy benchmark, using task-aligned prompts and we contrast their performance, efficiency, and robustness against established vision-only and multi-modal methods. Counter-intuitively, our results show that VLMs, despite their resource-intensive nature in terms of high parameter count and slower inference, currently lag behind specialized, smaller models in privacy prediction accuracy. We also find that VLMs exhibit higher robustness to image perturbations.

[28] Hallucination Filtering in Radiology Vision-Language Models Using Discrete Semantic Entropy

Patrick Wienholt, Sophie Caselitz, Robert Siepmann, Philipp Bruners, Keno Bressem, Christiane Kuhl, Jakob Nikolas Kather, Sven Nebelung, Daniel Truhn

🧩 TL;DR

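This study uses discrete semantic entropy (DSE) to detect and filter out questions likely to elicit hallucinations, significantly improving the diagnostic accuracy of black-box vision-language models on radiology visual question answering. By quantifying semantic inconsistency, the method provides a filtering strategy for clinical VLM applications.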


📘 Detailed Summary

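Motivation: Black-box vision-language models are prone to hallucination in visual question answering on radiologic images. The study asks whether rejecting questions likely to generate hallucinations can improve accuracy and reliability in clinical diagnostic use.

Method: Meaning-equivalent responses are grouped using bidirectional entailment checks, and DSE is computed from the relative frequencies of the resulting semantic clusters. GPT-4o and GPT-4.1 answered each question 15 times at temperature 1.0, and accuracy was recalculated after excluding questions with DSE above 0.6 or 0.3.

Result: Across 706 image-question pairs, baseline accuracy was 51.7% for GPT-4o and 54.8% for GPT-4.1. After filtering out high-entropy questions (DSE > 0.3), accuracy rose to 76.3% on the 334 retained questions for GPT-4o and to 63.8% on the 499 retained questions for GPT-4.1, both statistically significant.

Conclusion: By quantifying semantic inconsistency, DSE enables reliable hallucination detection in black-box vision-language models, significantly improves diagnostic answer accuracy, and offers a filtering strategy for clinical VLM applications.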


📄 Abstract

To determine whether using discrete semantic entropy (DSE) to reject questions likely to generate hallucinations can improve the accuracy of black-box vision-language models (VLMs) in radiologic image based visual question answering (VQA). This retrospective study evaluated DSE using two publicly available, de-identified datasets: (i) the VQA-Med 2019 benchmark (500 images with clinical questions and short-text answers) and (ii) a diagnostic radiology dataset (206 cases: 60 computed tomography scans, 60 magnetic resonance images, 60 radiographs, 26 angiograms) with corresponding ground-truth diagnoses. GPT-4o and GPT-4.1 answered each question 15 times using a temperature of 1.0. Baseline accuracy was determined using low-temperature answers (temperature 0.1). Meaning-equivalent responses were grouped using bidirectional entailment checks, and DSE was computed from the relative frequencies of the resulting semantic clusters. Accuracy was recalculated after excluding questions with DSE > 0.6 or > 0.3. p-values and 95% confidence intervals were obtained using bootstrap resampling and a Bonferroni-corrected threshold of p < .004 for statistical significance. Across 706 image-question pairs, baseline accuracy was 51.7% for GPT-4o and 54.8% for GPT-4.1. After filtering out high-entropy questions (DSE > 0.3), accuracy on the remaining questions was 76.3% (retained questions: 334/706) for GPT-4o and 63.8% (retained questions: 499/706) for GPT-4.1 (both p < .001). Accuracy gains were observed across both datasets and largely remained statistically significant after Bonferroni correction. DSE enables reliable hallucination detection in black-box VLMs by quantifying semantic inconsistency. This method significantly improves diagnostic answer accuracy and offers a filtering strategy for clinical VLM applications.
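The DSE computation described in the abstract can be sketched in a few lines of Python. This is an illustrative sketch, not the study's code. The abstract does not state the logarithm base or whether the entropy is normalized, so the natural logarithm without normalization is an assumption here, as is the greedy clustering against each cluster's first member. The `entails` predicate is a hypothetical placeholder for a real entailment model.

```python
import math
from typing import Callable, List


def cluster_by_entailment(answers: List[str],
                          entails: Callable[[str, str], bool]) -> List[List[str]]:
    """Greedily group answers that entail each other in both directions."""
    clusters: List[List[str]] = []
    for answer in answers:
        for cluster in clusters:
            representative = cluster[0]
            if entails(answer, representative) and entails(representative, answer):
                cluster.append(answer)
                break
        else:  # no existing cluster matched
            clusters.append([answer])
    return clusters


def discrete_semantic_entropy(clusters: List[List[str]]) -> float:
    """Shannon entropy (natural log, assumed) of cluster relative frequencies."""
    total = sum(len(c) for c in clusters)
    probs = [len(c) / total for c in clusters]
    return sum(-p * math.log(p) for p in probs)


def keep_question(answers: List[str], entails: Callable[[str, str], bool],
                  threshold: float) -> bool:
    """Retain a question only if its sampled answers are semantically consistent."""
    return discrete_semantic_entropy(cluster_by_entailment(answers, entails)) <= threshold


# Toy stand-in for an entailment model: exact match after normalization.
def same_text(a: str, b: str) -> bool:
    return a.strip().lower() == b.strip().lower()


samples = ["Pneumonia"] * 10 + ["pleural effusion"] * 5
print(discrete_semantic_entropy(cluster_by_entailment(samples, same_text)))
```

A question whose 15 sampled answers all fall into one cluster has entropy 0 and is always retained, while answers scattered across many clusters push the entropy up and the question is rejected once it exceeds the chosen threshold.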

[29] Spotlight on Token Perception for Multimodal Reinforcement Learning

Siyuan Huang, Xiaoye Qu, Yafu Li, Yun Luo, Zefeng He, Daizong Liu, Yu Cheng

🧩 TL;DR

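This paper proposes Visually-Perceptive Policy Optimization (VPPO), which brings a token-perception perspective into the optimization of multimodal reinforcement learning and substantially improves the multimodal reasoning of large vision-language models on eight perception and reasoning benchmarks.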


📘 Detailed Summary

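Motivation: Existing methods for reinforcement learning with verifiable rewards (RLVR) in multimodal reasoning neglect the critical role of visual perception in the optimization process. In particular, they lack a fine-grained analysis of how much each generated token depends on the visual input, which limits gains on visually grounded reasoning tasks.

Method: The authors propose VPPO, a policy gradient algorithm that refines the learning signal through a dual mechanism: it reweights a trajectory's advantage by the trajectory's overall visual dependency, and it restricts policy updates to perceptually pivotal tokens.

Result: On a comprehensive suite of eight perception and reasoning benchmarks, VPPO achieves substantial gains over leading open-source RL-tuned models, with its effectiveness consistently validated at both the 7B and 32B model scales.

Conclusion: The study establishes a new token-level perceptual perspective for analyzing multimodal RLVR and presents an effective optimization strategy that significantly enhances the multimodal reasoning capabilities of LVLMs.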


📄 Abstract

While Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capabilities of Large Vision-Language Models (LVLMs), most existing methods in multimodal reasoning neglect the critical role of visual perception within the RLVR optimization process. In this paper, we undertake a pioneering exploration of multimodal RLVR through the novel perspective of token perception, which measures the visual dependency of each generated token. With a granular analysis of Chain-of-Thought (CoT) processes, we uncover two key insights: first, token perception in a rollout trajectory is sparsely distributed, where only a small fraction of tokens have high visual dependency for visually-grounded reasoning; second, different trajectories exhibit significant divergence in their overall visual dependency. Based on these observations, we propose Visually-Perceptive Policy Optimization (VPPO), a novel policy gradient algorithm that explicitly leverages token perception to refine the learning signal. Specifically, VPPO achieves this through a dual mechanism: it reweights a trajectory's advantage by its overall visual dependency, and focuses policy updates exclusively on perceptually pivotal tokens. On a comprehensive suite of eight perception and reasoning benchmarks, VPPO demonstrates substantial gains over leading open-source RL-tuned models, with its effectiveness consistently validated across 7B and 32B model scales. Our findings not only establish a new token-level perceptual perspective for analyzing multimodal RLVR but also present a novel and effective optimization strategy to significantly enhance the multimodal reasoning capabilities of LVLMs.
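The dual mechanism can be illustrated with a small sketch that turns per-token visual-dependency scores into per-token update weights. The abstract does not give the exact formulas, so everything concrete below is an assumption made for illustration: the trajectory's overall dependency is taken to be the mean of its token scores, the advantage is scaled by multiplying with that mean, and "pivotal" tokens are the top `keep_fraction` of tokens by score. The function name is hypothetical.

```python
from typing import List


def vppo_style_token_weights(token_dependency: List[float],
                             advantage: float,
                             keep_fraction: float = 0.4) -> List[float]:
    """Per-token weights for a policy-gradient update (illustrative only).

    1. Trajectory-level: scale the advantage by the trajectory's mean
       visual dependency (assumed form of the reweighting).
    2. Token-level: give non-zero weight only to the top `keep_fraction`
       of tokens ranked by visual dependency (assumed selection rule).
    """
    n = len(token_dependency)
    if n == 0:
        return []
    trajectory_dependency = sum(token_dependency) / n
    shaped_advantage = advantage * trajectory_dependency
    k = max(1, round(keep_fraction * n))
    ranked = sorted(range(n), key=lambda i: token_dependency[i], reverse=True)
    pivotal = set(ranked[:k])
    return [shaped_advantage if i in pivotal else 0.0 for i in range(n)]


print(vppo_style_token_weights([0.1, 0.9, 0.2, 0.8, 0.0], advantage=2.0))
```

In this toy call only the two tokens with the highest dependency scores receive a gradient weight, and that weight is smaller than the raw advantage because the trajectory as a whole depends only moderately on the image.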

[30] Hybrid-grained Feature Aggregation with Coarse-to-fine Language Guidance for Self-supervised Monocular Depth Estimation

Wenyao Zhang, Hongsi Liu, Bohan Li, Jiawei He, Zekun Qi, Yunnan Wang, Shengyang Zhao, Xinqiang Yu, Wenjun Zeng, Xin Jin

🧩 TL;DR

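This paper proposes Hybrid-depth, a framework that integrates the CLIP and DINO foundation models to extract visual priors, addressing insufficient semantic-spatial knowledge extraction in self-supervised monocular depth estimation and significantly improving depth estimation performance.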


📘 Detailed Summary

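Motivation: Current self-supervised monocular depth estimation methods are limited by insufficient semantic-spatial knowledge extraction and need to integrate global semantics with local spatial detail more effectively.

Method: The authors propose a coarse-to-fine progressive learning framework. First, global semantics from CLIP and local spatial features from DINO are aggregated under contrastive language guidance, with a proxy task that compares close and distant image patches. Next, camera pose information and pixel-wise language alignment are integrated to refine depth predictions. The module plugs into existing self-supervised MDE pipelines as a plug-and-play depth encoder.

Result: Extensive experiments on the KITTI benchmark show that the method significantly outperforms existing state-of-the-art methods across all metrics and also benefits downstream tasks such as BEV perception.

Conclusion: Aggregating CLIP's semantic context and DINO's spatial details through language guidance effectively addresses feature granularity mismatches and offers a new way to integrate foundation models into self-supervised depth estimation.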


📄 Abstract

Current self-supervised monocular depth estimation (MDE) approaches encounter performance limitations due to insufficient semantic-spatial knowledge extraction. To address this challenge, we propose Hybrid-depth, a novel framework that systematically integrates foundation models (e.g., CLIP and DINO) to extract visual priors and acquire sufficient contextual information for MDE. Our approach introduces a coarse-to-fine progressive learning framework: 1) First, we aggregate multi-grained features from CLIP (global semantics) and DINO (local spatial details) under contrastive language guidance. A proxy task comparing close-distant image patches is designed to enforce depth-aware feature alignment using text prompts; 2) Next, building on the coarse features, we integrate camera pose information and pixel-wise language alignment to refine depth predictions. This module seamlessly integrates with existing self-supervised MDE pipelines (e.g., Monodepth2, ManyDepth) as a plug-and-play depth encoder, enhancing continuous depth estimation. By aggregating CLIP's semantic context and DINO's spatial details through language guidance, our method effectively addresses feature granularity mismatches. Extensive experiments on the KITTI benchmark demonstrate that our method significantly outperforms SOTA methods across all metrics and also benefits downstream tasks such as BEV perception. Code is available at https://github.com/Zhangwenyao1/Hybrid-depth.

[31] Boosting Multi-modal Keyphrase Prediction with Dynamic Chain-of-Thought in Vision-Language Models

Qihang Ma, Shengyu Li, Jie Tang, Dingkang Yang, Shaodong Chen, Yingyi Zhang, Chao Feng, Jiao Ran

🧩 TL;DR

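This study proposes a vision-language-model-based approach to multi-modal keyphrase prediction. A dynamic chain-of-thought strategy overcomes the limitations of traditional methods in absence and unseen scenarios and improves the model's complex reasoning ability.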


📘 Detailed Summary

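Motivation: Traditional multi-modal methods have significant limitations in absence and unseen scenarios, and existing benchmarks overestimate model capability because of heavy overlap between training and test sets. This study aims to address both problems and improve multi-modal keyphrase prediction.

Method: Vision-language models are applied to multi-modal keyphrase prediction. Zero-shot inference and supervised fine-tuning first establish the lower bound of model performance. Fine-tune-CoT then fine-tunes smaller models on high-quality chain-of-thought data generated by a teacher model. Finally, a dynamic CoT strategy adaptively injects CoT data during training.

Result: Experiments on multiple datasets show that the proposed strategies improve multi-modal keyphrase prediction. The dynamic CoT strategy specifically addresses the "overthinking" phenomenon, letting the model use its reasoning ability flexibly at inference time.

Conclusion: The study demonstrates that vision-language models are effective for multi-modal keyphrase prediction and that a dynamic chain-of-thought strategy offers a new approach to complex reasoning problems in multi-modal understanding.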


📄 Abstract

Multi-modal keyphrase prediction (MMKP) aims to advance beyond text-only methods by incorporating multiple modalities of input information to produce a set of conclusive phrases. Traditional multi-modal approaches have been shown to have significant limitations in handling the challenging absence and unseen scenarios. Additionally, we identify shortcomings in existing benchmarks that overestimate model capability due to significant overlap between training and test sets. In this work, we propose leveraging vision-language models (VLMs) for the MMKP task. First, we use two widely used strategies, zero-shot and supervised fine-tuning (SFT), to assess the lower bound performance of VLMs. Next, to improve the complex reasoning capabilities of VLMs, we adopt Fine-tune-CoT, which leverages high-quality CoT reasoning data generated by a teacher model to fine-tune smaller models. Finally, to address the "overthinking" phenomenon, we propose a dynamic CoT strategy which adaptively injects CoT data during training, allowing the model to flexibly leverage its reasoning capabilities during the inference stage. We evaluate the proposed strategies on various datasets and the experimental results demonstrate the effectiveness of the proposed approaches. The code is available at https://github.com/bytedance/DynamicCoT.

Junyan Ye, Dongzhi Jiang, Jun He, Baichuan Zhou, Zilong Huang, Zhiyuan Yan, Hongsheng Li, Conghui He, Weijia Li

🧩 TL;DR

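This paper presents BLINK-Twice, a vision-centric reasoning benchmark grounded in challenging perceptual tasks. It evaluates whether multimodal large language models can reason from visual content alone, rather than relying on external knowledge or language-dominated reasoning.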


📘 Detailed Summary

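Motivation: Existing reasoning benchmarks mainly assess language-based reasoning and often treat visual input as replaceable context, so reasoning over purely visual content is rarely evaluated directly. A benchmark focused on vision-centric reasoning is needed to shift the emphasis from language-based to image-grounded reasoning.

Method: BLINK-Twice has three core components: seven types of visual challenges that test visual reasoning, natural adversarial image pairs that enforce reliance on visual content, and annotated reasoning chains for fine-grained evaluation of the reasoning process. The benchmark is used to evaluate 20 leading MLLMs, comprising 12 foundation models and 8 reasoning-enhanced models.

Result: BLINK-Twice poses a significant challenge to current models. Existing language-space reasoning strategies such as chain-of-thought or self-criticism can improve performance but often produce unstable and redundant reasoning. Repeated image observation improves performance across models, and active visual interaction, as demonstrated by models like o3, highlights the need for a new paradigm for visual reasoning.

Conclusion: Current MLLMs remain clearly limited in vision-centric reasoning, and a new visual reasoning paradigm is needed. Repeated image observation and active visual interaction are promising directions, underscoring the importance of moving beyond shallow perception toward fine-grained observation and analytical reasoning.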


📄 Abstract

Recently, Multimodal Large Language Models (MLLMs) have made rapid progress, particularly in enhancing their reasoning capabilities. However, existing reasoning benchmarks still primarily assess language-based reasoning, often treating visual input as replaceable context. To address this gap, we introduce BLINK-Twice, a vision-centric reasoning benchmark grounded in challenging perceptual tasks. Instead of relying on external knowledge, our tasks require models to reason from visual content alone, shifting the focus from language-based to image-grounded reasoning. Compared to prior perception benchmarks, it moves beyond shallow perception ("see") and requires fine-grained observation and analytical reasoning ("observe"). BLINK-Twice integrates three core components: seven types of visual challenges for testing visual reasoning, natural adversarial image pairs that enforce reliance on visual content, and annotated reasoning chains for fine-grained evaluation of the reasoning process rather than final answers alone. We evaluate 20 leading MLLMs, including 12 foundation models and 8 reasoning-enhanced models. BLINK-Twice poses a significant challenge to current models. While existing reasoning strategies in the language space, such as chain-of-thought or self-criticism, can improve performance, they often result in unstable and redundant reasoning. We observe that repeated image observation improves performance across models, and active visual interaction, as demonstrated by models like o3, highlights the need for a new paradigm for vision reasoning. The dataset is publicly available at https://github.com/PicoTrex/BLINK-Twice.

[33] Mono4DEditor: Text-Driven 4D Scene Editing from Monocular Video via Point-Level Localization of Language-Embedded Gaussians

Jin-Chuan Shi, Chengye Su, Jiajun Wang, Ariel Shamir, Miao Wang

🧩 TL;DR

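This paper proposes Mono4DEditor, a framework that augments 3D Gaussians with quantized CLIP features to form a language-embedded dynamic representation. It enables text-driven editing of 4D scenes reconstructed from monocular video, with semantically precise local edits that leave unedited content intact.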


📘 Detailed Summary

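Motivation: Editing 4D scenes reconstructed from monocular video based on text prompts is a challenging task. The main difficulty is achieving semantically precise edits in localized regions of complex, dynamic scenes while preserving the integrity of unedited content.

Method: The method augments 3D Gaussians with quantized CLIP features to form a language-embedded dynamic representation that supports efficient semantic querying of arbitrary spatial regions. A two-stage point-level localization strategy first selects candidate Gaussians by CLIP similarity and then refines their spatial extent to improve accuracy. Finally, a diffusion-based video editing model performs targeted edits on the localized regions, with flow and scribble guidance ensuring spatial fidelity and temporal coherence.

Result: Extensive experiments show that Mono4DEditor produces high-quality, text-driven edits across diverse scenes and object types while preserving the appearance and geometry of unedited regions, surpassing prior approaches in both flexibility and visual fidelity.

Conclusion: The study demonstrates the effectiveness of combining a language-embedded dynamic representation with a two-stage localization strategy, providing a flexible and accurate solution for 4D scene editing with applications in content creation and virtual environments.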


📄 Abstract

Editing 4D scenes reconstructed from monocular videos based on text prompts is a valuable yet challenging task with broad applications in content creation and virtual environments. The key difficulty lies in achieving semantically precise edits in localized regions of complex, dynamic scenes, while preserving the integrity of unedited content. To address this, we introduce Mono4DEditor, a novel framework for flexible and accurate text-driven 4D scene editing. Our method augments 3D Gaussians with quantized CLIP features to form a language-embedded dynamic representation, enabling efficient semantic querying of arbitrary spatial regions. We further propose a two-stage point-level localization strategy that first selects candidate Gaussians via CLIP similarity and then refines their spatial extent to improve accuracy. Finally, targeted edits are performed on localized regions using a diffusion-based video editing model, with flow and scribble guidance ensuring spatial fidelity and temporal coherence. Extensive experiments demonstrate that Mono4DEditor enables high-quality, text-driven edits across diverse scenes and object types, while preserving the appearance and geometry of unedited areas and surpassing prior approaches in both flexibility and visual fidelity.

[34] D-TPT: Dimensional Entropy Maximization for Calibrating Test-Time Prompt Tuning in Vision-Language Models

Jisu Han, Wonjun Hwang

🧩 TL;DR

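This paper proposes dimensional entropy maximization, which regularizes the distribution of textual features to reduce the influence of dominant dimensions in contrastive vision-language models. It alleviates the calibration degradation seen in test-time prompt tuning and offers a simple, effective way to make VLMs more reliable in real-world deployment.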


📘 Detailed Summary

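Motivation: The study targets the modality gap in contrastive vision-language models, which is caused by a single dominant feature dimension across modalities. The dominant dimensions in both the text and image modalities show high predictive sensitivity, which limits calibration in test-time prompt tuning.

Method: Dimensional entropy maximization regularizes the distribution of textual features toward uniformity, reducing the dependency on dominant dimensions. The method is designed for test-time prompt tuning and improves calibration error by constraining the influence of dominant dimensions.

Result: Experiments show that the method alleviates the degradation of calibration performance in test-time prompt tuning and improves calibration error by reducing the dependency on dominant dimensions.

Conclusion: Constraining the influence of dominant dimensions in contrastive vision-language models improves calibration. Dimensional entropy maximization is a simple yet effective regularization strategy for test-time adaptation and helps make VLMs more reliable in real-world use.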


📄 Abstract

The test-time adaptation paradigm provides flexibility under domain shifts by immediately adapting the source model on unlabeled target data. Vision-Language Models (VLMs) leverage their generalization capabilities for diverse downstream tasks, and test-time prompt tuning has emerged as a prominent solution for adapting VLMs. In this work, we explore contrastive VLMs and identify the modality gap caused by a single dominant feature dimension across modalities. We observe that the dominant dimensions in both text and image modalities exhibit high predictive sensitivity, and that constraining their influence can improve calibration error. Building on this insight, we propose dimensional entropy maximization that regularizes the distribution of textual features toward uniformity to mitigate the dependency on dominant dimensions. Our method alleviates the degradation of calibration performance in test-time prompt tuning, offering a simple yet effective solution to enhance the reliability of VLMs in real-world deployment scenarios.
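
To make the idea of pushing textual features toward uniformity concrete, here is a minimal sketch of an entropy-based regularizer. The abstract does not give the exact formula, so the normalization used here (absolute feature values turned into a distribution over dimensions) and the function names are assumptions for illustration only.

```python
import math
from typing import List

def dimensional_entropy(feature: List[float], eps: float = 1e-12) -> float:
    """Shannon entropy of how a feature vector's mass spreads over its dimensions."""
    magnitudes = [abs(x) for x in feature]
    total = sum(magnitudes) + eps
    probs = [m / total for m in magnitudes]
    return -sum(p * math.log(p + eps) for p in probs)

def entropy_regularizer(text_features: List[List[float]]) -> float:
    """Negative mean entropy: minimizing it pushes features toward uniformity."""
    entropies = [dimensional_entropy(f) for f in text_features]
    return -sum(entropies) / len(entropies)

# A vector dominated by one dimension has lower entropy than a uniform one.
uniform = [1.0, 1.0, 1.0, 1.0]
dominated = [10.0, 0.1, 0.1, 0.1]
print(dimensional_entropy(uniform) > dimensional_entropy(dominated))
```

In a prompt-tuning loop, a term like `entropy_regularizer` would be added to the adaptation loss with some weight; that weighting is likewise an assumption here.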

[35] Few-shot multi-token DreamBooth with LoRa for style-consistent character generation

Ruben Pascual, Mikel Sesma-Sara, Aranzazu Jurio, Daniel Paternain, Mikel Galar

🧩 TL;DR

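This paper proposes a DreamBooth-based multi-token strategy combined with LoRA parameter-efficient fine-tuning. It generates a virtually unlimited number of new characters from a few reference characters while keeping the artistic style consistent, with applications in animation and gaming.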


📘 Detailed Summary

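Motivation: The audiovisual industry needs a way to generate a virtually unlimited number of new characters from a small set of human-designed reference characters while preserving their artistic style and shared visual traits, broadening creative possibilities in animation, gaming, and related domains.

Method: The method builds on DreamBooth fine-tuning and adopts a multi-token strategy that uses clustering to assign separate tokens to individual characters and to their collective style. It is combined with LoRA-based parameter-efficient fine-tuning, removes the class-specific regularization set, and introduces random tokens and embeddings during generation.

Result: Evaluation on five small specialized datasets shows that the method generates high-quality, diverse characters while preserving the distinctive aesthetic features of the reference characters. A human evaluation study further confirms its effectiveness.

Conclusion: The method shows that unlimited character generation is feasible while keeping the artistic style consistent, giving the creative industries a practical tool and demonstrating the potential of diffusion models for style-preserving generation.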


📄 Abstract

The audiovisual industry is undergoing a profound transformation as it is integrating AI developments not only to automate routine tasks but also to inspire new forms of art. This paper addresses the problem of producing a virtually unlimited number of novel characters that preserve the artistic style and shared visual traits of a small set of human-designed reference characters, thus broadening creative possibilities in animation, gaming, and related domains. Our solution builds upon DreamBooth, a well-established fine-tuning technique for text-to-image diffusion models, and adapts it to tackle two core challenges: capturing intricate character details beyond textual prompts and the few-shot nature of the training data. To achieve this, we propose a multi-token strategy, using clustering to assign separate tokens to individual characters and their collective style, combined with LoRA-based parameter-efficient fine-tuning. By removing the class-specific regularization set and introducing random tokens and embeddings during generation, our approach allows for unlimited character creation while preserving the learned style. We evaluate our method on five small specialized datasets, comparing it to relevant baselines using both quantitative metrics and a human evaluation study. Our results demonstrate that our approach produces high-quality, diverse characters while preserving the distinctive aesthetic features of the reference characters, with human evaluation further reinforcing its effectiveness and highlighting the potential of our method.

[36] PhysToolBench: Benchmarking Physical Tool Understanding for MLLMs

Zixin Zhang, Kanghao Chen, Xingwang Lin, Lutao Jiang, Xu Zheng, Yuanhuiyi Lyu, Litao Guo, Yinchuan Li, Ying-Cong Chen

🧩 TL;DR

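This study presents PhysToolBench, the first benchmark dedicated to evaluating how well multimodal large language models understand physical tools. Built as a visual question answering dataset with over 1,000 image-text pairs, it reveals a significant deficiency in tool understanding among current MLLMs.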


📘 Detailed Summary

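Motivation: Modern multimodal large language models use their extensive common knowledge for high-level planning in embodied AI and in downstream vision-language-action models, but the extent of their true understanding of physical tools has not been quantified. This study aims to fill that gap.

Method: The benchmark is a visual question answering dataset with over 1,000 image-text pairs. It assesses three difficulty levels: tool recognition (recognizing a tool's primary function), tool understanding (grasping the principles of a tool's operation), and tool creation (fashioning a new tool from surrounding objects when conventional options are unavailable).

Result: A comprehensive evaluation of 32 MLLMs, spanning proprietary, open-source, and specialized embodied models as well as VLA backbones, reveals a significant deficiency in tool understanding, particularly in grasping operating principles and creating new tools.

Conclusion: The study exposes the limits of current MLLMs in understanding physical tools, provides a benchmark and preliminary solutions for future improvement, and highlights the need for stronger physical commonsense reasoning in embodied intelligent systems.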


📄 Abstract

The ability to use, understand, and create tools is a hallmark of human intelligence, enabling sophisticated interaction with the physical world. For any general-purpose intelligent agent to achieve true versatility, it must also master these fundamental skills. While modern Multimodal Large Language Models (MLLMs) leverage their extensive common knowledge for high-level planning in embodied AI and in downstream Vision-Language-Action (VLA) models, the extent of their true understanding of physical tools remains unquantified. To bridge this gap, we present PhysToolBench, the first benchmark dedicated to evaluating the comprehension of physical tools by MLLMs. Our benchmark is structured as a Visual Question Answering (VQA) dataset comprising over 1,000 image-text pairs. It assesses capabilities across three distinct difficulty levels: (1) Tool Recognition: Requiring the recognition of a tool's primary function. (2) Tool Understanding: Testing the ability to grasp the underlying principles of a tool's operation. (3) Tool Creation: Challenging the model to fashion a new tool from surrounding objects when conventional options are unavailable. Our comprehensive evaluation of 32 MLLMs, spanning proprietary, open-source, and specialized embodied models as well as backbones in VLAs, reveals a significant deficiency in tool understanding. Furthermore, we provide an in-depth analysis and propose preliminary solutions. Code and dataset are publicly available.

[37] Vision Language Models: A Survey of 26K Papers

Fengming Lin

🧩 TL;DR

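Through a systematic analysis of 26,104 papers accepted at CVPR, ICLR, and NeurIPS from 2023 to 2025, this paper quantifies three macro trends in AI research: a sharp rise in multimodal vision-language-LLM work, steady expansion of generative methods, and resilient 3D and video activity.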


📘 Detailed Summary

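Motivation: The study aims to quantify AI research trends with a transparent, reproducible method, addressing the lack of systematic, cross-venue analysis of how research is evolving at top conferences such as CVPR, ICLR, and NeurIPS.

Method: Titles and abstracts are normalized, phrase-protected, and matched against a hand-crafted lexicon to assign up to 35 topical labels and to mine fine-grained cues about tasks, architectures, training regimes, objectives, datasets, and co-mentioned modalities.

Result: The analysis identifies three macro shifts: a sharp rise in multimodal vision-language-LLM work, which reframes classic perception as instruction following and multi-step reasoning; steady expansion of generative methods, with diffusion research consolidating around controllability, distillation, and speed; and resilient 3D and video activity, with composition moving from NeRFs to Gaussian splatting and a growing emphasis on human- and agent-centric understanding.

Conclusion: The study provides quantitative evidence of AI research trends and shows the rapid rise of multimodal and generative methods. The lexicon and methodology are released to enable auditing and extension.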


📄 Abstract

We present a transparent, reproducible measurement of research trends across 26,104 accepted papers from CVPR, ICLR, and NeurIPS spanning 2023-2025. Titles and abstracts are normalized, phrase-protected, and matched against a hand-crafted lexicon to assign up to 35 topical labels and mine fine-grained cues about tasks, architectures, training regimes, objectives, datasets, and co-mentioned modalities. The analysis quantifies three macro shifts: (1) a sharp rise of multimodal vision-language-LLM work, which increasingly reframes classic perception as instruction following and multi-step reasoning; (2) steady expansion of generative methods, with diffusion research consolidating around controllability, distillation, and speed; and (3) resilient 3D and video activity, with composition moving from NeRFs to Gaussian splatting and a growing emphasis on human- and agent-centric understanding. Within VLMs, parameter-efficient adaptation like prompting/adapters/LoRA and lightweight vision-language bridges dominate; training practice shifts from building encoders from scratch to instruction tuning and finetuning strong backbones; contrastive objectives recede relative to cross-entropy/ranking and distillation. Cross-venue comparisons show CVPR has a stronger 3D footprint and ICLR the highest VLM share, while reliability themes such as efficiency or robustness diffuse across areas. We release the lexicon and methodology to enable auditing and extension. Limitations include lexicon recall and abstract-only scope, but the longitudinal signals are consistent across venues and years.
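
The normalize, phrase-protect, and match pipeline described above can be sketched in a few lines. The lexicon entries, label names, and normalization rules below are hypothetical stand-ins, not the released lexicon; only the cap of 35 labels comes from the abstract.

```python
import re
from typing import Dict, List, Set

# Hypothetical mini-lexicon: label -> trigger phrases (not the released one).
LEXICON: Dict[str, List[str]] = {
    "diffusion": ["diffusion model", "denoising diffusion"],
    "3d_gaussian_splatting": ["gaussian splatting"],
    "vlm": ["vision language model", "vision language"],
}

def normalize(text: str) -> str:
    """Lowercase, turn hyphens and slashes into spaces, drop other punctuation."""
    text = re.sub(r"[-/]", " ", text.lower())
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def protect_phrases(text: str, phrases: List[str]) -> str:
    """Join multi-word phrases with underscores so they match as single tokens."""
    # Longest first, so "vision language model" wins over "vision language".
    for phrase in sorted(phrases, key=len, reverse=True):
        text = re.sub(rf"\b{re.escape(phrase)}s?\b", phrase.replace(" ", "_"), text)
    return text

def assign_labels(title: str, abstract: str, max_labels: int = 35) -> List[str]:
    """Return the sorted lexicon labels whose phrases occur in title or abstract."""
    all_phrases = [p for ps in LEXICON.values() for p in ps]
    protected = protect_phrases(normalize(f"{title} {abstract}"), all_phrases)
    tokens: Set[str] = set(protected.split())
    labels = [label for label, ps in LEXICON.items()
              if any(p.replace(" ", "_") in tokens for p in ps)]
    return sorted(labels)[:max_labels]

print(assign_labels("Vision-Language Models meet Gaussian Splatting",
                    "We use denoising diffusion."))
```

Protecting phrases before tokenizing keeps a multi-word term from being counted through its individual words, which is one plausible reason for that step.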

[38] VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation

Shaoqi Dong, Chaoyou Fu, Haihan Gao, Yi-Fan Zhang, Chi Yan, Chu Wu, Xiaoyu Liu, Yunhang Shen, Jing Huo, Deqiang Jiang, Haoyu Cao, Yang Gao, Xing Sun, Ran He, Caifeng Shan

🧩 TL;DR

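This paper proposes a distillation-based framework that gives vision-language models action-execution capability by transferring knowledge from pretrained small action models, substantially reducing training cost while improving manipulation performance. The method achieves a 97.3% average success rate on the LIBERO benchmark and an 82.0% success rate on real-world tasks.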


📘 Detailed Summary

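Motivation: Vision-language-action models have significantly advanced robotic manipulation by leveraging the strong perception capabilities of pretrained vision-language models, but training them from scratch is costly. The study aims to reduce VLA training cost while preserving generalization.

Method: The distillation framework retains the original VLM structure and adds only an action token and a state encoder to incorporate physical inputs. Training has two stages. First, a lightweight alignment maps VLM hidden states into the action space of the small action model, reusing its pretrained action decoder. Second, the language model, state encoder, and action modules are selectively fine-tuned so that the system integrates multimodal inputs with precise action generation.

Result: The method achieves a 97.3% average success rate on LIBERO (an 11.8% improvement) and 93.5% on LIBERO-LONG (a 24.5% improvement). Across five real-world manipulation tasks it consistently outperforms the teacher model, reaching an 82.0% success rate (a 17% improvement).

Conclusion: Action distillation enables VLMs to generate precise actions while substantially reducing training cost. The framework shows that knowledge transfer, rather than training from scratch, can extend the capabilities of large models and offers a route to efficient robot learning.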


📄 Abstract

Vision-Language Action (VLA) models significantly advance robotic manipulation by leveraging the strong perception capabilities of pretrained vision-language models (VLMs). By integrating action modules into these pretrained models, VLA methods exhibit improved generalization. However, training them from scratch is costly. In this work, we propose a simple yet effective distillation-based framework that equips VLMs with action-execution capability by transferring knowledge from pretrained small action models. Our architecture retains the original VLM structure, adding only an action token and a state encoder to incorporate physical inputs. To distill action knowledge, we adopt a two-stage training strategy. First, we perform lightweight alignment by mapping VLM hidden states into the action space of the small action model, enabling effective reuse of its pretrained action decoder and avoiding expensive pretraining. Second, we selectively fine-tune the language model, state encoder, and action modules, enabling the system to integrate multimodal inputs with precise action generation. Specifically, the action token provides the VLM with a direct handle for predicting future actions, while the state encoder allows the model to incorporate robot dynamics not captured by vision alone. This design yields substantial efficiency gains over training large VLA models from scratch. Compared with previous state-of-the-art methods, our method achieves 97.3% average success rate on LIBERO (11.8% improvement) and 93.5% on LIBERO-LONG (24.5% improvement). In real-world experiments across five manipulation tasks, our method consistently outperforms the teacher model, achieving 82.0% success rate (17% improvement), demonstrating that action distillation effectively enables VLMs to generate precise actions while substantially reducing training costs.

cs.CL [Back]

[39] Hierarchical Self-Supervised Representation Learning for Depression Detection from Speech

Yuxin Li, Eng Siong Chng, Cuntai Guan

🧩 TL;DR

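This paper proposes HAREN-CTC, a multitask learning architecture that integrates multi-layer self-supervised learning features and uses cross-attention and a CTC loss to handle sparse temporal supervision, achieving state-of-the-art performance on speech-based depression detection.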


📘 Detailed Summary

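Motivation: Speech-based depression detection faces two key challenges: extracting meaningful features and capturing sparse, heterogeneous depressive cues over time. Pretrained self-supervised learning models such as WavLM provide rich multi-layer speech representations, but existing methods typically use only the final layer or search for a single best layer. These approaches overfit to specific datasets and do not exploit the full hierarchy needed to detect subtle, persistent depression signals.

Method: HAREN-CTC integrates multi-layer SSL features within a multitask learning framework and adds a Connectionist Temporal Classification loss to handle sparse temporal supervision. It has two key modules: a Hierarchical Adaptive Clustering module that reorganizes SSL features into complementary embeddings, and a Cross-Modal Fusion module that models inter-layer dependencies through cross-attention. The CTC objective enables alignment-aware training, so the model can track the irregular temporal patterns of depressive speech cues.

Result: Evaluated under an upper-bound setting with standard data splits and a generalization setting with five-fold cross-validation, HAREN-CTC reaches macro F1-scores of 0.81 on DAIC-WOZ and 0.82 on MODMA, outperforming prior methods in both scenarios.

Conclusion: Fully exploiting the multi-layer features of pretrained SSL models is essential for detecting subtle depression signals. The proposed hierarchical feature integration and CTC alignment mechanism address sparse temporal supervision in speech-based depression detection and support the development of clinical decision-support systems.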


📄 Abstract

Speech-based depression detection (SDD) is a promising, non-invasive alternative to traditional clinical assessments. However, it remains limited by the difficulty of extracting meaningful features and capturing sparse, heterogeneous depressive cues over time. Pretrained self-supervised learning (SSL) models such as WavLM provide rich, multi-layer speech representations, yet most existing SDD methods rely only on the final layer or search for a single best-performing one. These approaches often overfit to specific datasets and fail to leverage the full hierarchical structure needed to detect subtle and persistent depression signals. To address this challenge, we propose HAREN-CTC, a novel architecture that integrates multi-layer SSL features using cross-attention within a multitask learning framework, combined with Connectionist Temporal Classification loss to handle sparse temporal supervision. HAREN-CTC comprises two key modules: a Hierarchical Adaptive Clustering module that reorganizes SSL features into complementary embeddings, and a Cross-Modal Fusion module that models inter-layer dependencies through cross-attention. The CTC objective enables alignment-aware training, allowing the model to track irregular temporal patterns of depressive speech cues. We evaluate HAREN-CTC under both an upper-bound setting with standard data splits and a generalization setting using five-fold cross-validation. The model achieves state-of-the-art macro F1-scores of 0.81 on DAIC-WOZ and 0.82 on MODMA, outperforming prior methods across both evaluation scenarios.

[40] Human Texts Are Outliers: Detecting LLM-generated Texts via Out-of-distribution Detection

Cong Zeng, Shengkun Tang, Yuanzhou Chen, Zhiqiang Shen, Wenchao Yu, Xujiang Zhao, Haifeng Chen, Wei Cheng, Zhiqiang Xu

🧩 TL;DR

本文提出将AI生成文本检测任务重新定义为离群检测问题,采用单类学习和基于分数的学习方法来提高检测器的泛化能力,在多个数据集上实现了优异的检测性能。


📘 Detailed Summary

Motivation: 现有AI生成文本检测方法大多将任务视为二分类问题,这种二元化表述错误地假设人类文本构成统一的分布,导致检测器容易记忆观察到的分布外特征而非学习真正的'非分布内'行为本质,从而限制了在未见人类文本输入上的泛化能力。

Method: 本文提出将检测任务重新定义为离群检测问题,将人类文本视为分布外样本而机器生成文本作为分布内样本,开发了基于单类学习的方法包括DeepSVDD和HRN,以及基于分数的学习技术如能量方法,构建了鲁棒且可泛化的检测框架。

Result: 在多个数据集上的广泛实验验证了基于离群检测方法的有效性,在DeepFake数据集上实现了98.3%的AUROC和AUPR,FPR95仅为8.9%,并在多语言、受攻击以及未见模型和领域文本设置下测试了检测框架,证明了其鲁棒性和泛化能力。

Conclusion: 研究表明将AI生成文本检测重新定义为离群检测问题能够有效解决传统二分类方法的泛化限制,基于单类学习和分数学习的方法能够捕捉机器生成文本的本质特征,为构建更鲁棒的检测系统提供了新的技术路径和理论视角。


📄 Abstract

The rapid advancement of large language models (LLMs) such as ChatGPT, DeepSeek, and Claude has significantly increased the presence of AI-generated text in digital communication. This trend has heightened the need for reliable detection methods to distinguish between human-authored and machine-generated content. Existing approaches, both zero-shot methods and supervised classifiers, largely conceptualize this task as a binary classification problem, often leading to poor generalization across domains and models. In this paper, we argue that such a binary formulation fundamentally mischaracterizes the detection task by assuming a coherent representation of human-written texts. In reality, human texts do not constitute a unified distribution, and their diversity cannot be effectively captured through limited sampling. This causes previous classifiers to memorize observed OOD characteristics rather than learn the essence of "non-ID" behavior, limiting generalization to unseen human-authored inputs. Based on this observation, we propose reframing the detection task as an out-of-distribution (OOD) detection problem, treating human-written texts as distributional outliers while machine-generated texts are in-distribution (ID) samples. To this end, we develop a detection framework using one-class learning methods, including DeepSVDD and HRN, and score-based learning techniques such as an energy-based method, enabling robust and generalizable performance. Extensive experiments across multiple datasets validate the effectiveness of our OOD-based approach. Specifically, the OOD-based method achieves 98.3% AUROC and AUPR with only 8.9% FPR95 on the DeepFake dataset. Moreover, we test our detection framework on multilingual, attacked, and unseen-model and -domain text settings, demonstrating the robustness and generalizability of our framework. Code, pretrained weights, and demo will be released.
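
As a rough illustration of score-based OOD detection and the FPR95 metric mentioned above, the sketch below computes a standard energy score from classifier logits and the false positive rate at 95% true positive rate. It assumes machine-generated text is the in-distribution (positive) class, as in the abstract. The logits would come from some trained model, which is hypothetical here, and this is not the authors' implementation.

```python
import math
from typing import List

def energy_score(logits: List[float], temperature: float = 1.0) -> float:
    """Negative free energy, T * logsumexp(logits / T); higher suggests in-distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    return temperature * (m + math.log(sum(math.exp(z - m) for z in scaled)))

def fpr_at_95_tpr(id_scores: List[float], ood_scores: List[float]) -> float:
    """Share of OOD samples accepted at the threshold that keeps >= 95% of ID samples."""
    ranked = sorted(id_scores, reverse=True)
    n_keep = math.ceil(0.95 * len(ranked))
    threshold = ranked[n_keep - 1]
    return sum(s >= threshold for s in ood_scores) / len(ood_scores)

# Toy scores: ID = machine-generated text, OOD = human-written text.
id_scores = [5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0]
ood_scores = [1.0, 2.0, 3.0, 20.0]
print(fpr_at_95_tpr(id_scores, ood_scores))
```

A lower FPR95 means fewer human-written texts are mistaken for machine-generated ones at the operating point where 95% of machine-generated texts are caught.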

[41] Centering Emotion Hotspots: Multimodal Local-Global Fusion and Cross-Modal Alignment for Emotion Recognition in Conversations

Yu Liu, Hanlei Shi, Haoxun Li, Yuqing Sun, Yuxuan Ding, Linlin Gong, Leyuan Qu, Taihao Li

🧩 TL;DR

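This paper presents a unified, hotspot-centric model for emotion recognition in conversations. Hotspot-Gated Fusion and a routed Mixture-of-Aligners handle the sparse, localized, and asynchronous emotional evidence in multimodal data, yielding consistent gains over strong baselines on standard benchmarks.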


📘 Detailed Summary

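Motivation: The main challenge in emotion recognition in conversations is that emotional evidence is sparse, localized, and asynchronous across modalities, so conventional methods struggle to capture these scattered and misaligned cues.

Method: The model detects per-utterance emotion hotspots in text, audio, and video, combines these local hotspots with global features through Hotspot-Gated Fusion, aligns modalities with a routed Mixture-of-Aligners, and builds a cross-modal graph that encodes conversational structure to preserve context.

Result: Experiments on standard ERC benchmarks show consistent gains over strong baselines, and ablation studies confirm the contributions of Hotspot-Gated Fusion and the Mixture-of-Aligners.

Conclusion: A hotspot-centric view can inform future multimodal learning and offers a new perspective on modality fusion, underscoring the value of focusing on salient emotional regions in multimodal conversational data.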


📄 Abstract

Emotion Recognition in Conversations (ERC) is hard because discriminative evidence is sparse, localized, and often asynchronous across modalities. We center ERC on emotion hotspots and present a unified model that detects per-utterance hotspots in text, audio, and video, fuses them with global features via Hotspot-Gated Fusion (HGF), and aligns modalities using a routed Mixture-of-Aligners (MoA); a cross-modal graph encodes conversational structure. This design focuses modeling on salient spans, mitigates misalignment, and preserves context. Experiments on standard ERC benchmarks show consistent gains over strong baselines, with ablations confirming the contributions of HGF and MoA. Our results point to a hotspot-centric view that can inform future multimodal learning, offering a new perspective on modality fusion in ERC.

[42] MMA-ASIA: A Multilingual and Multimodal Alignment Framework for Culturally-Grounded Evaluation

Weihua Zheng, Zhengyuan Liu, Tanmoy Chakraborty, Weiwen Xu, Xiaoxue Gao, Bryan Chen Zhengyu Tan, Bowei Zou, Chang Liu, Yujia Hu, Xing Xie, Xiaoyuan Yi, Jing Yao, Chaojun Wang, Long Li, Rui Liu, Huiyao Liu, Koji Inoue, Ryuichi Sumida, Tatsuya Kawahara, Fan Xu, Lingyu Ye, Wei Tian, Dongjun Kim, Jimin Jung, Jaehyung Seo, Nadya Yuki Wangsajaya, Pham Minh Duc, Ojasva Saxena, Palash Nandi, Xiyan Tao, Wiwik Karlina, Tuan Luong, Keertana Arun Vasan, Roy Ka-Wei Lee, Nancy F. Chen

🧩 TL;DR

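This study proposes MMA-ASIA, a framework for evaluating the cultural awareness of LLMs in Asian contexts. It is built on a benchmark of 27,000 multilingual, multimodally aligned questions and probes how model understanding degrades outside Western, high-resource settings.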


📘 Detailed Summary

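Motivation: The multimodal understanding and reasoning of large language models often degrade outside Western, high-resource settings, particularly in Asian cultural contexts, and a systematic framework for evaluating cultural awareness is lacking.

Method: MMA-ASIA comprises a human-curated, multilingual, multimodally aligned benchmark (8 Asian countries, 10 languages, 27,000 questions), a five-dimensional evaluation protocol, and a Cultural Awareness Grounding Validation Module that detects shortcut learning. Model behavior is analyzed through attention tracing and a Vision-ablated Prefix Replay method.

Result: Over 79% of the benchmark questions require multi-step reasoning grounded in cultural context. To the authors' knowledge, it is the first dataset aligned at the input level across three modalities (text, image, and speech), enabling direct tests of cross-modal transfer.

Conclusion: The study probes why LLMs diverge across languages and modalities in cultural understanding and offers actionable insights and a systematic evaluation method for building culturally reliable multimodal LLMs.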


📄 Abstract

Large language models (LLMs) are now used worldwide, yet their multimodal understanding and reasoning often degrade outside Western, high-resource settings. We propose MMA-ASIA, a comprehensive framework to evaluate LLMs' cultural awareness with a focus on Asian contexts. MMA-ASIA centers on a human-curated, multilingual, and multimodally aligned multiple-choice benchmark covering 8 Asian countries and 10 languages, comprising 27,000 questions; over 79 percent require multi-step reasoning grounded in cultural context, moving beyond simple memorization. To our knowledge, this is the first dataset aligned at the input level across three modalities: text, image (visual question answering), and speech. This enables direct tests of cross-modal transfer. Building on this benchmark, we propose a five-dimensional evaluation protocol that measures: (i) cultural-awareness disparities across countries, (ii) cross-lingual consistency, (iii) cross-modal consistency, (iv) cultural knowledge generalization, and (v) grounding validity. To ensure rigorous assessment, a Cultural Awareness Grounding Validation Module detects "shortcut learning" by checking whether the requisite cultural knowledge supports correct answers. Finally, through comparative model analysis, attention tracing, and an innovative Vision-ablated Prefix Replay (VPR) method, we probe why models diverge across languages and modalities, offering actionable insights for building culturally reliable multimodal LLMs.

[43] FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs

Yan Wang, Keyi Wang, Shanshan Yang, Jaisal Patel, Jeff Zhao, Fengran Mo, Xueqing Peng, Lingfei Qian, Jimin Huang, Guojun Xiong, Xiao-Yang Liu, Jian-Yun Nie

🧩 TL;DR

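This paper introduces FinAuditing, the first taxonomy-aligned, structure-aware, multi-document benchmark for evaluating LLMs on financial auditing tasks that require reasoning over structured financial documents. The benchmark exposes systematic limitations of modern LLMs in taxonomy-grounded financial reasoning.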


📘 Detailed Summary

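Motivation: The complexity of the Generally Accepted Accounting Principles and the hierarchical structure of XBRL filings make financial auditing difficult to automate and verify. LLMs are strong at understanding unstructured text, but their ability to reason over structured, interdependent, taxonomy-driven financial documents remains largely unexplored. This study aims to fill that gap.

Method: FinAuditing is built from real US-GAAP-compliant XBRL filings and defines three complementary subtasks: FinSM for semantic consistency, FinRE for relational consistency, and FinMR for numerical consistency, each targeting a distinct aspect of structured auditing reasoning. A unified evaluation framework integrates retrieval, classification, and reasoning metrics.

Result: Zero-shot experiments on 13 state-of-the-art LLMs show that current models perform inconsistently across semantic, relational, and mathematical dimensions, with accuracy drops of up to 60-90% when reasoning over hierarchical multi-document structures.

Conclusion: The findings expose systematic limitations of modern LLMs in taxonomy-grounded financial reasoning and establish FinAuditing as a foundation for developing trustworthy, structure-aware, and regulation-aligned financial intelligence systems.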


📄 Abstract

The complexity of the Generally Accepted Accounting Principles (GAAP) and the hierarchical structure of eXtensible Business Reporting Language (XBRL) filings make financial auditing increasingly difficult to automate and verify. While large language models (LLMs) have demonstrated strong capabilities in unstructured text understanding, their ability to reason over structured, interdependent, and taxonomy-driven financial documents remains largely unexplored. To fill this gap, we introduce FinAuditing, the first taxonomy-aligned, structure-aware, multi-document benchmark for evaluating LLMs on financial auditing tasks. Built from real US-GAAP-compliant XBRL filings, FinAuditing defines three complementary subtasks, FinSM for semantic consistency, FinRE for relational consistency, and FinMR for numerical consistency, each targeting a distinct aspect of structured auditing reasoning. We further propose a unified evaluation framework integrating retrieval, classification, and reasoning metrics across these subtasks. Extensive zero-shot experiments on 13 state-of-the-art LLMs reveal that current models perform inconsistently across semantic, relational, and mathematical dimensions, with accuracy drops of up to 60-90% when reasoning over hierarchical multi-document structures. Our findings expose the systematic limitations of modern LLMs in taxonomy-grounded financial reasoning and establish FinAuditing as a foundation for developing trustworthy, structure-aware, and regulation-aligned financial intelligence systems. The benchmark dataset is available at Hugging Face.

[44] A Unified Biomedical Named Entity Recognition Framework with Large Language Models

Tengxiao Lv, Ling Luo, Juntao Li, Yanhua Wang, Yuchen Pan, Chao Liu, Yanan Wang, Yan Jiang, Huiyi Lv, Yuanyuan Sun, Jian Wang, Hongfei Lin

🧩 TL;DR

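This paper proposes a unified biomedical named entity recognition framework based on large language models. By reformulating BioNER as a text generation task with a symbolic tagging strategy, it achieves state-of-the-art performance on multiple benchmark datasets and robust cross-lingual zero-shot generalization.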


📘 Detailed Summary

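Motivation: Existing biomedical named entity recognition methods struggle with nested entities, entity boundary ambiguity, and cross-lingual generalization, which limits the accuracy of medical information extraction and knowledge discovery.

Method: The method reformulates BioNER as a text generation task, designs a symbolic tagging strategy that jointly handles flat and nested entities, uses bilingual joint fine-tuning to improve multilingual and multi-task generalization, and introduces a contrastive learning-based entity selector that filters incorrect predictions.

Result: Experiments on four benchmark datasets and two unseen corpora show that the method achieves state-of-the-art performance and robust zero-shot generalization across languages.

Conclusion: The study demonstrates the effectiveness of a unified LLM-based framework for biomedical named entity recognition and offers a new solution for complex entity structures and cross-lingual scenarios.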


📄 Abstract

Accurate recognition of biomedical named entities is critical for medical information extraction and knowledge discovery. However, existing methods often struggle with nested entities, entity boundary ambiguity, and cross-lingual generalization. In this paper, we propose a unified Biomedical Named Entity Recognition (BioNER) framework based on Large Language Models (LLMs). We first reformulate BioNER as a text generation task and design a symbolic tagging strategy to jointly handle both flat and nested entities with explicit boundary annotation. To enhance multilingual and multi-task generalization, we perform bilingual joint fine-tuning across multiple Chinese and English datasets. Additionally, we introduce a contrastive learning-based entity selector that filters incorrect or spurious predictions by leveraging boundary-sensitive positive and negative samples. Experimental results on four benchmark datasets and two unseen corpora show that our method achieves state-of-the-art performance and robust zero-shot generalization across languages. The source code is freely available at https://github.com/dreamer-tx/LLMNER.
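
When NER is cast as text generation with explicit boundary tags, the generated string has to be decoded back into entities, including nested ones. The sketch below shows one way to do that with a stack. The XML-like tag format (`<Disease>...</Disease>`) and the entity type names are purely hypothetical; the abstract does not specify the symbols the authors use.

```python
import re
from typing import List, Tuple

# Matches an opening tag, a closing tag, or a run of plain text.
TOKEN = re.compile(r"<(/?)([A-Za-z]+)>|([^<]+)")

def parse_tagged(generated: str) -> List[Tuple[str, str]]:
    """Extract (entity_type, surface_text) pairs, including nested entities."""
    stack: List[Tuple[str, List[str]]] = []  # open entities: (type, text pieces)
    entities: List[Tuple[str, str]] = []
    for closing, tag, text in TOKEN.findall(generated):
        if text:
            # Plain text belongs to every entity that is currently open.
            for _, pieces in stack:
                pieces.append(text)
        elif not closing:
            stack.append((tag, []))
        elif stack and stack[-1][0] == tag:
            etype, pieces = stack.pop()
            entities.append((etype, "".join(pieces).strip()))
        # Mismatched closing tags are ignored in this sketch.
    return entities

output = ("<Disease><Anatomy>lung</Anatomy> cancer</Disease> was treated "
          "with <Chemical>cisplatin</Chemical>.")
print(parse_tagged(output))
```

Inner entities are emitted before the outer entity that contains them, because an entity is recorded when its closing tag is reached.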

[45] CrisiText: A dataset of warning messages for LLM training in emergency communication

Giacomo Gonella, Gian Maria Campedelli, Stefano Menini, Marco Guerini

🧩 TL;DR

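This paper presents CrisiText, the first large-scale dataset for generating warning messages across 13 types of crisis scenarios. It contains more than 400,000 warning messages, and the authors systematically compare supervised fine-tuning, preference alignment, zero-shot, and few-shot natural language generation approaches on the crisis warning task.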


📘 Detailed Summary

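Motivation: The use of NLP techniques in crisis situations remains limited and focuses mostly on classification tasks. The significant potential of NLG architectures for timely warning message generation has been largely overlooked, which limits how effectively AI can protect civilians in emergencies such as natural disasters or violent attacks.

Method: The authors build CrisiText, a dataset of more than 400,000 warning messages covering almost 18,000 crisis situations. Starting from existing crisis descriptions, they create chains of events and pair each event with a warning message, following experts' written guidelines to ensure correct terminology and factuality. Each message is also accompanied by three suboptimal warning types to support the study of different NLG approaches.

Result: Experiments compare supervised fine-tuning setups with preference alignment, zero-shot, and few-shot approaches, assess model performance in out-of-distribution scenarios, and evaluate the effectiveness of an automatic post-editor, yielding a broad benchmark evaluation for crisis warning message generation.

Conclusion: The work fills a gap in NLG for crisis warnings. CrisiText lays a foundation for developing more effective crisis response systems, shows the applicability of different NLG approaches in emergencies, and provides benchmarks and directions for future research.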


📄 Abstract

Effectively identifying threats and mitigating their potential damage during crisis situations, such as natural disasters or violent attacks, is paramount for safeguarding endangered individuals. To tackle these challenges, AI has been used in assisting humans in emergency situations. Still, the use of NLP techniques remains limited and mostly focuses on classification tasks. The significant potential of timely warning message generation using NLG architectures, however, has been largely overlooked. In this paper we present CrisiText, the first large-scale dataset for the generation of warning messages across 13 different types of crisis scenarios. The dataset contains more than 400,000 warning messages (spanning almost 18,000 crisis situations) aimed at assisting civilians during and after such events. To generate the dataset, we started from existing crisis descriptions and created chains of events related to the scenarios. Each event was then paired with a warning message. The generations follow experts' written guidelines to ensure correct terminology and factuality of their suggestions. Additionally, each message is accompanied by three suboptimal warning types to allow for the study of different NLG approaches. To this end, we conducted a series of experiments comparing supervised fine-tuning setups with preference alignment, zero-shot, and few-shot approaches. We further assessed model performance in out-of-distribution scenarios and evaluated the effectiveness of an automatic post-editor.

[46] CFVBench: A Comprehensive Video Benchmark for Fine-grained Multimodal Retrieval-Augmented Generation

Kaiwen Wei, Xiao Liu, Jie Zhang, Zijian Wang, Ruida Liu, Yuming Yang, Xin Xiao, Xiao Sun, Haoyang Zeng, Changzai Pan, Yidan Zhang, Jiang Zhong, Peijin Wang, Yingchao Feng

🧩 TL;DR

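This paper introduces the CFVBench benchmark and the Adaptive Visual Refinement (AVR) framework. Together they address the limited modality coverage and format diversity of existing multimodal retrieval-augmented generation benchmarks and improve how well multimodal LLMs understand fine-grained visual information.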


📘 Detailed Summary

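Motivation: Existing video-based multimodal retrieval-augmented generation (MRAG) benchmarks are limited in modality coverage and format diversity. They mostly focus on single- or limited-modality tasks or coarse-grained scene understanding, and so cannot adequately evaluate a model's ability to retrieve and reason over fine-grained multimodal information in long videos.

Method: The authors introduce CFVBench, a large-scale, manually verified benchmark of 5,360 open-ended QA pairs built from 599 publicly available videos, spanning high-density formats and domains such as chart-heavy reports, news broadcasts, and software tutorials. They also design Adaptive Visual Refinement (AVR), a framework that adaptively increases frame sampling density and selectively invokes external tools to strengthen fine-grained multimodal understanding.

Result: A systematic evaluation of 7 retrieval methods and 14 widely used MLLMs shows that current models face a critical bottleneck in capturing transient yet essential fine-grained multimodal details. Experiments show that AVR consistently enhances fine-grained multimodal comprehension and improves performance across all evaluated models.

Conclusion: The study exposes a fundamental limitation of current MLLMs in understanding fine-grained visual information. AVR offers an effective way to address it and points to directions for improving future MRAG systems.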


📄 Abstract

Multimodal Retrieval-Augmented Generation (MRAG) enables Multimodal Large Language Models (MLLMs) to generate responses with external multimodal evidence, and numerous video-based MRAG benchmarks have been proposed to evaluate model capabilities across retrieval and generation stages. However, existing benchmarks remain limited in modality coverage and format diversity, often focusing on single- or limited-modality tasks, or coarse-grained scene understanding. To address these gaps, we introduce CFVBench, a large-scale, manually verified benchmark constructed from 599 publicly available videos, yielding 5,360 open-ended QA pairs. CFVBench spans high-density formats and domains such as chart-heavy reports, news broadcasts, and software tutorials, requiring models to retrieve and reason over long temporal video spans while maintaining fine-grained multimodal information. Using CFVBench, we systematically evaluate 7 retrieval methods and 14 widely-used MLLMs, revealing a critical bottleneck: current models (even GPT5 or Gemini) struggle to capture transient yet essential fine-grained multimodal details. To mitigate this, we propose Adaptive Visual Refinement (AVR), a simple yet effective framework that adaptively increases frame sampling density and selectively invokes external tools when necessary. Experiments show that AVR consistently enhances fine-grained multimodal comprehension and improves performance across all evaluated MLLMs.
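The idea of adaptively increasing frame sampling density can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the AVR implementation: the function names, the doubling rule, the confidence threshold, and the `answer_fn` callback (a stand-in for an MLLM call that returns an answer and a confidence score) are hypothetical, and the selective tool invocation mentioned in the abstract is not modeled.

```python
def sample_timestamps(duration_s, fps):
    """Return evenly spaced timestamps (in seconds) at the given sampling rate."""
    n = int(duration_s * fps)
    return [i / fps for i in range(n)]


def adaptive_refine(duration_s, answer_fn, base_fps=0.5, max_fps=4.0, threshold=0.7):
    """Hypothetical loop: resample more densely until confidence is high enough.

    answer_fn(frames) -> (answer, confidence) is an assumed callback; the
    doubling schedule and threshold are illustrative choices only.
    """
    fps = base_fps
    while True:
        frames = sample_timestamps(duration_s, fps)
        answer, confidence = answer_fn(frames)
        # Stop when the answer looks reliable or the sampling budget is exhausted.
        if confidence >= threshold or fps >= max_fps:
            return answer, fps, len(frames)
        fps = min(fps * 2, max_fps)
```

With a toy callback whose confidence grows with the number of frames, the loop stops at the first sampling rate that clears the threshold, so sparse sampling is kept whenever it suffices.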

[47] One Sentence, Two Embeddings: Contrastive Learning of Explicit and Implicit Semantic Representations

Kohei Oda, Po-Min Chuang, Kiyoaki Shirai, Natthawut Kertkeidkachorn

🧩 TL;DR

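This paper proposes DualCSE, a sentence embedding method that assigns two embeddings to each sentence, one for explicit semantics and one for implicit semantics, to overcome the difficulty that conventional single-vector methods have in capturing implicit meaning.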


📘 Detailed Summary

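Motivation: Conventional sentence embedding methods assign only a single vector per sentence. This inherent limitation makes it hard to capture implicit semantics, which constrains representation quality and downstream task performance.

Method: DualCSE assigns each sentence two embeddings that coexist in a shared space: one represents explicit semantics and the other implicit semantics, so the representation that suits a given task can be selected.

Result: Experiments show that DualCSE effectively encodes both explicit and implicit semantics and improves performance on downstream tasks such as information retrieval and text classification.

Conclusion: The study shows that a dual-embedding representation captures sentence semantics more fully, opening a new direction for sentence embedding research with practical value across downstream tasks.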


📄 Abstract

Sentence embedding methods have made remarkable progress, yet they still struggle to capture the implicit semantics within sentences. This can be attributed to the inherent limitations of conventional sentence embedding methods that assign only a single vector per sentence. To overcome this limitation, we propose DualCSE, a sentence embedding method that assigns two embeddings to each sentence: one representing the explicit semantics and the other representing the implicit semantics. These embeddings coexist in a shared space, enabling the selection of the desired semantics for specific purposes such as information retrieval and text classification. Experimental results demonstrate that DualCSE can effectively encode both explicit and implicit meanings and improve the performance of downstream tasks.

[48] The Speech-LLM Takes It All: A Truly Fully End-to-End Spoken Dialogue State Tracking Approach

Nizar El Ghazal, Antoine Caubrière, Valentin Vielzeuf

🧩 TL;DR

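This study systematically compares context management strategies for end-to-end spoken dialogue state tracking. Feeding the full spoken dialogue history yields the best performance, while attention-pooling-based compression stays competitive with a much smaller context.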


📘 Detailed Summary

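Motivation: Spoken dialogue state tracking lacks a systematic study of context management strategies, in particular of how to use spoken dialogue history effectively to improve Speech-LLM performance. The strengths and weaknesses of different context representations need to be explored.

Method: The study systematically evaluates three context management strategies on the SpokenWOZ corpus: traditional multimodal context (text history combined with the spoken current turn), full spoken history, and spoken history compressed with attention pooling.

Result: Providing the full spoken conversation as input achieves the highest performance among models of similar size, significantly surpassing prior methods. Attention-pooling compression maintains competitive accuracy while reducing context size.

Conclusion: The gains stem from more effective context utilization. Full spoken history provides the richest dialogue information, while compression balances computational efficiency and performance for practical use, giving guidance for choosing a context management strategy in spoken dialogue systems.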


📄 Abstract

This paper presents a comparative study of context management strategies for end-to-end Spoken Dialog State Tracking using Speech-LLMs. We systematically evaluate traditional multimodal context (combining text history and spoken current turn), full spoken history, and compressed spoken history approaches. Our experiments on the SpokenWOZ corpus demonstrate that providing the full spoken conversation as input yields the highest performance among models of similar size, significantly surpassing prior methods. Furthermore, we show that attention-pooling-based compression of the spoken history offers a strong trade-off, maintaining competitive accuracy with reduced context size. Detailed analysis confirms that improvements stem from more effective context utilization.
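As a rough illustration of attention-pooling-based compression, the sketch below reduces a long sequence of frame embeddings to a fixed number of pooled vectors, one per query. It is a generic formulation under stated assumptions: the query vectors (learned in a real model), the scaled dot-product scoring, and all names are illustrative and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, queries):
    """Compress len(frames) embeddings into len(queries) pooled embeddings.

    frames:  list of d-dimensional vectors (e.g. speech encoder outputs)
    queries: list of d-dimensional query vectors (assumed to be learned)
    """
    d = len(frames[0])
    pooled = []
    for q in queries:
        # Scaled dot-product score between the query and every frame.
        scores = [sum(qi * xi for qi, xi in zip(q, x)) / math.sqrt(d) for x in frames]
        weights = softmax(scores)
        # Weighted average of the frames under the attention weights.
        pooled.append([sum(w * x[j] for w, x in zip(weights, frames)) for j in range(d)])
    return pooled
```

The context length seen by the language model then depends on the number of queries rather than on the duration of the spoken history, which is the trade-off the abstract describes.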

[49] Multimodal Policy Internalization for Conversational Agents

Zhenhailong Wang, Jiateng Liu, Amin Fazel, Ritesh Sarkhel, Xing Fan, Xiang Li, Chenlei Guo, Heng Ji, Ruhi Sarikaya

🧩 TL;DR

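This paper introduces Multimodal Policy Internalization (MPI), a task that internalizes reasoning-intensive multimodal policies into model parameters, enabling stronger policy-following without including the policy at inference time. The authors develop TriMPI, a three-stage training framework that achieves notable gains in end-to-end accuracy, generalization, and robustness to forgetting.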


📘 Detailed Summary

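Motivation: Modern conversational systems rely on predefined policies that specify metadata, response styles, and tool-usage rules. As LLM-based systems expand to support diverse business and user queries, these policies, usually implemented as in-context prompts, grow increasingly complex and lengthy, making faithful adherence difficult and imposing large fixed computational costs. Policies governing visual and multimodal behavior are critical for multimodal agents but remain understudied: prior prompt-compression work mainly shortens task templates and demonstrations, while policy-alignment studies focus only on text-based safety rules.

Method: The paper proposes TriMPI, a three-stage training framework. It first injects policy knowledge via continual pretraining, then performs supervised fine-tuning, and finally applies PolicyRollout, a GRPO-style reinforcement learning extension that augments rollouts with policy-aware responses for grounded exploration. The authors build two datasets spanning synthetic and real-world decision-making and tool-use tasks.

Result: TriMPI achieves notable gains in end-to-end accuracy, generalization, and robustness to forgetting. As the first work on multimodal policy internalization, the paper provides datasets, training recipes, and comprehensive evaluations to foster future research.

Conclusion: This is the first systematic study of multimodal policy internalization. TriMPI's three-stage training addresses the data and algorithmic challenges of the task. By internalizing policy knowledge in model parameters, it avoids the fixed inference-time cost of in-context policies and lays groundwork for future research on multimodal policy learning.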


📄 Abstract

Modern conversational agents like ChatGPT and Alexa+ rely on predefined policies specifying metadata, response styles, and tool-usage rules. As these LLM-based systems expand to support diverse business and user queries, such policies, often implemented as in-context prompts, are becoming increasingly complex and lengthy, making faithful adherence difficult and imposing large fixed computational costs. With the rise of multimodal agents, policies that govern visual and multimodal behaviors are critical but remain understudied. Prior prompt-compression work mainly shortens task templates and demonstrations, while existing policy-alignment studies focus only on text-based safety rules. We introduce Multimodal Policy Internalization (MPI), a new task that internalizes reasoning-intensive multimodal policies into model parameters, enabling stronger policy-following without including the policy during inference. MPI poses unique data and algorithmic challenges. We build two datasets spanning synthetic and real-world decision-making and tool-using tasks and propose TriMPI, a three-stage training framework. TriMPI first injects policy knowledge via continual pretraining, then performs supervised finetuning, and finally applies PolicyRollout, a GRPO-style reinforcement learning extension that augments rollouts with policy-aware responses for grounded exploration. TriMPI achieves notable gains in end-to-end accuracy, generalization, and robustness to forgetting. As the first work on multimodal policy internalization, we provide datasets, training recipes, and comprehensive evaluations to foster future research. Project page: https://mikewangwzhl.github.io/TriMPI.

[50] Can We Reliably Rank Model Performance across Domains without Labeled Data?

Veronica Rammouz, Aaron Gonzalez, Carlos Cruzportillo, Adrian Tan, Nicole Beebe, Anthony Rios

🧩 TL;DR

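This study analyzes the reliability of label-free model performance estimation. LLM-based error predictors rank performance across domains more reliably than drift-based and zero-shot baselines, and the size of performance differences and the alignment of error patterns emerge as the key factors behind ranking reliability.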


📘 Detailed Summary

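Motivation: Prior work has proposed performance estimates based on dataset similarity or predicted correctness, but it remains unclear when these estimates yield reliable performance rankings across domains. The factors that affect ranking reliability therefore need to be analyzed.

Method: The study uses a two-step evaluation setup with four base classifiers and several LLMs as error predictors. Experiments on the GeoOLID and Amazon Reviews datasets, spanning 15 domains, compare LLM-based error predictors against drift-based and zero-shot baselines.

Result: LLM-based error predictors produce stronger and more consistent rank correlations with true accuracy than drift-based or zero-shot baselines. Ranking is more reliable when performance differences across domains are larger and when the error model's predictions align with the base model's true failure patterns.

Conclusion: The study clarifies when performance estimation methods can be trusted in cross-domain model evaluation and provides guidance for their practical use.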


📄 Abstract

Estimating model performance without labels is an important goal for understanding how NLP models generalize. While prior work has proposed measures based on dataset similarity or predicted correctness, it remains unclear when these estimates produce reliable performance rankings across domains. In this paper, we analyze the factors that affect ranking reliability using a two-step evaluation setup with four base classifiers and several large language models as error predictors. Experiments on the GeoOLID and Amazon Reviews datasets, spanning 15 domains, show that large language model-based error predictors produce stronger and more consistent rank correlations with true accuracy than drift-based or zero-shot baselines. Our analysis reveals two key findings: ranking is more reliable when performance differences across domains are larger, and when the error model's predictions align with the base model's true failure patterns. These results clarify when performance estimation methods can be trusted and provide guidance for their use in cross-domain model evaluation.
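To make the notion of rank correlation with true accuracy concrete, here is a small sketch that computes a Spearman-style correlation between estimated and true per-domain accuracies. The abstract does not say which rank statistic the authors use, so Spearman is an assumption, and the accuracy values in the usage note are invented purely for illustration.

```python
import math


def ranks(values):
    """Assign 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(xs, ys):
    """Pearson correlation of the ranks (undefined if either input is constant)."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)
```

For example, `spearman([0.62, 0.71, 0.80, 0.55], [0.60, 0.75, 0.78, 0.50])` compares a hypothetical estimator against hypothetical true accuracies on four domains. The estimates are numerically off, but they order the domains identically, so the correlation is at its maximum. This is why widely separated domain accuracies are easier to rank: small estimation errors are less likely to swap neighbors.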

[51] Hierarchical Indexing with Knowledge Enrichment for Multilingual Video Corpus Retrieval

Yu Wang, Tianhao Tan, Yifei Wang

🧩 TL;DR

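This paper proposes a multi-stage framework for multilingual video corpus retrieval that combines multilingual semantics, domain terminology, and efficient long-form processing to deliver accurate, scalable retrieval over medical video collections. Hierarchical tree indexing with lightweight LLM re-ranking avoids exhaustive cross-encoder scoring while preserving chunk-level precision.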


📘 Detailed Summary

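Motivation: Existing systems for multilingual medical video retrieval either compress hour-long videos into coarse embeddings or incur prohibitive computational costs for fine-grained matching. The multilingual video corpus retrieval task requires answering complex, multi-hop questions across language boundaries, particularly in specialized medical video collections.

Method: The multi-stage framework divides video subtitles into semantically coherent chunks, enriches them with concise knowledge-graph facts, and organizes them into a hierarchical tree. A language-agnostic multilingual encoder generates the node embeddings. At query time, a coarse-to-fine tree search prunes irrelevant branches, and only the top-ranked chunks are re-ranked by a lightweight LLM.

Result: The method achieves state-of-the-art performance on the mVCR test set, and ablation studies confirm the complementary contributions of knowledge-graph enrichment, hierarchical indexing, and targeted LLM re-ranking. The results indicate that the method improves retrieval efficiency while maintaining accuracy.

Conclusion: The method offers an accurate and scalable solution for multilingual retrieval in specialized medical video collections, showing that hierarchical processing combined with lightweight re-ranking can cut computational cost without sacrificing precision.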


📄 Abstract

Retrieving relevant instructional videos from multilingual medical archives is crucial for answering complex, multi-hop questions across language boundaries. However, existing systems either compress hour-long videos into coarse embeddings or incur prohibitive costs for fine-grained matching. We tackle the Multilingual Video Corpus Retrieval (mVCR) task in the NLPCC-2025 M4IVQA challenge with a multi-stage framework that integrates multilingual semantics, domain terminology, and efficient long-form processing. Video subtitles are divided into semantically coherent chunks, enriched with concise knowledge-graph (KG) facts, and organized into a hierarchical tree whose node embeddings are generated by a language-agnostic multilingual encoder. At query time, the same encoder embeds the input question; a coarse-to-fine tree search prunes irrelevant branches, and only the top-ranked chunks are re-scored by a lightweight large language model (LLM). This design avoids exhaustive cross-encoder scoring while preserving chunk-level precision. Experiments on the mVCR test set demonstrate state-of-the-art performance, and ablation studies confirm the complementary contributions of KG enrichment, hierarchical indexing, and targeted LLM re-ranking. The proposed method offers an accurate and scalable solution for multilingual retrieval in specialized medical video collections.
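The coarse-to-fine tree search described above can be sketched as a beam search over a hierarchy of embedded nodes. This is a simplified illustration under assumptions: the `Node` structure, the cosine scoring, and the fixed beam width are hypothetical choices, and the final LLM re-scoring of the surviving chunks is left out.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    embedding: list
    children: list = field(default_factory=list)


def cosine(a, b):
    """Cosine similarity, returning 0.0 for zero-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def tree_search(root, query, beam=2):
    """Descend level by level, keeping only the `beam` best nodes per level."""
    frontier, leaves = [root], []
    while frontier:
        candidates = []
        for node in frontier:
            if node.children:
                candidates.extend(node.children)
            else:
                leaves.append(node)  # a leaf stands for one subtitle chunk
        # Prune: branches outside the beam are never expanded or scored again.
        candidates.sort(key=lambda n: cosine(n.embedding, query), reverse=True)
        frontier = candidates[:beam]
    leaves.sort(key=lambda n: cosine(n.embedding, query), reverse=True)
    return leaves
```

Only the leaves returned by `tree_search` would be passed to a more expensive re-ranker, which is how this kind of index avoids scoring every chunk in the corpus.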

[52] AutoPR: Let's Automate Your Academic Promotion!

Qiguang Chen, Zheng Yan, Mingda Yang, Libo Qin, Yixin Yuan, Hanjing Li, Jinhao Liu, Yiyan Ji, Dengyun Peng, Jiannan Guan, Mengkang Hu, Yantao Du, Wanxiang Che

🧩 TL;DR

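This paper introduces the AutoPR task, which automatically transforms research papers into high-quality promotional content, along with the PRAgent multi-agent framework and the PRBench benchmark, substantially improving the efficiency and effectiveness of academic promotion.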


📘 Detailed Summary

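Motivation: As the volume of peer-reviewed research surges, scholars increasingly rely on social platforms to discover papers, while authors invest considerable effort in promotion to ensure visibility and citations. Streamlining this process and reducing reliance on human effort calls for automated ways to turn research papers into accurate, engaging, and timely public content.

Method: The paper proposes the AutoPR task and PRAgent, a multi-agent framework with three stages: content extraction with multimodal preparation, collaborative synthesis for polished outputs, and platform-specific adaptation that optimizes norms, tone, and tagging. It also releases PRBench, a multimodal benchmark that evaluates systems along three axes: Fidelity, Engagement, and Alignment.

Result: Compared with direct LLM pipelines on PRBench, PRAgent shows substantial improvements, including a 604% increase in total watch time, a 438% rise in likes, and at least a 2.9x boost in overall engagement. Ablation studies show that platform modeling and targeted promotion contribute most to these gains.

Conclusion: The results position AutoPR as a tractable, measurable research problem and provide a roadmap for scalable, impactful automated scholarly communication. The approach can broaden the reach of scholarly content while reducing authors' promotional burden.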


📄 Abstract

As the volume of peer-reviewed research surges, scholars increasingly rely on social platforms for discovery, while authors invest considerable effort in promoting their work to ensure visibility and citations. To streamline this process and reduce the reliance on human effort, we introduce Automatic Promotion (AutoPR), a novel task that transforms research papers into accurate, engaging, and timely public content. To enable rigorous evaluation, we release PRBench, a multimodal benchmark that links 512 peer-reviewed articles to high-quality promotional posts, assessing systems along three axes: Fidelity (accuracy and tone), Engagement (audience targeting and appeal), and Alignment (timing and channel optimization). We also introduce PRAgent, a multi-agent framework that automates AutoPR in three stages: content extraction with multimodal preparation, collaborative synthesis for polished outputs, and platform-specific adaptation to optimize norms, tone, and tagging for maximum reach. When compared to direct LLM pipelines on PRBench, PRAgent demonstrates substantial improvements, including a 604% increase in total watch time, a 438% rise in likes, and at least a 2.9x boost in overall engagement. Ablation studies show that platform modeling and targeted promotion contribute the most to these gains. Our results position AutoPR as a tractable, measurable research problem and provide a roadmap for scalable, impactful automated scholarly communication.

[53] Prompting Test-Time Scaling Is A Strong LLM Reasoning Data Augmentation

Sondos Mahmoud Bsharat, Zhiqiang Shen

🧩 TL;DR

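This paper proposes Prompting Test-Time Scaling (P-TTS), which augments reasoning data at inference time and then fine-tunes LLMs on it. With only 90 manually selected reasoning instances, it substantially improves performance on mathematical reasoning tasks.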


📘 Detailed Summary

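Motivation: LLMs show strong reasoning capabilities when given chain-of-thought exemplars, but curating large reasoning datasets requires extensive manual annotation and compute, which limits their use in resource-constrained or rapidly evolving domains.

Method: P-TTS is an inference-time data augmentation strategy. Starting from only 90 manually selected reasoning instances, it systematically varies instruction prompting intensities to synthesize diverse reasoning trajectory contexts, and Qwen-2.5 models of various sizes are then fine-tuned on the resulting P-TTS data.

Result: On AIME2024, AIME2025, MATH500, and GPQA-Diamond, the P-TTS-7B and 32B models outperform competitive baselines such as S1 and S1.1. P-TTS-7B achieves absolute accuracy gains of +26.66% and +30.00% on AIME'24 and +13.34% and +6.67% on AIME'25 (vs. S1 and S1.1, respectively). P-TTS also improves zero-shot generalization on out-of-domain reasoning benchmarks.

Conclusion: Test-time scaling effectively explores the latent space of reasoning patterns and amplifies LLM problem-solving with minimal annotation overhead, offering a practical, low-cost way to elicit LLM reasoning in resource-constrained or rapidly evolving domains.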


📄 Abstract

Large language models (LLMs) have demonstrated impressive reasoning capabilities when provided with chain-of-thought exemplars, but curating large reasoning datasets remains laborious and resource-intensive. In this work, we introduce Prompting Test-Time Scaling (P-TTS), a simple yet effective inference-time data augmentation strategy for enhancing LLM reasoning through finetuning. Rather than collecting thousands or even millions of examples, P-TTS leverages a small pool of only 90 manually selected reasoning instances and systematically varies exemplar augmentation through principled instruction prompting intensities at test time to synthesize diverse reasoning trajectory contexts. We then finetune Qwen-2.5 models of various sizes on P-TTS data. Across a suite of mathematical reasoning benchmarks (AIME2024 & 25, MATH500, and GPQA-Diamond), our P-TTS-7B and 32B models outperform prior competitive baselines such as S1 and S1.1 (1K-shot). P-TTS-7B achieves absolute accuracy gains of +26.66% and +30.00% on AIME'24, and +13.34% and +6.67% on AIME'25; P-TTS-32B yields gains of +23.33% and +16.63% on AIME'24, and +26.63% and +3.33% on AIME'25 (vs. S1 and S1.1, respectively), with comparable or better performance on MATH500 and GPQA-Diamond. We further show that P-TTS enhances zero-shot generalization accuracy on the out-of-domain reasoning benchmarks Gaokao, Kaoyan, OlympiadBench, AMC23, GradeSchoolMath, and Minerva. Our analysis suggests that test-time scaling effectively explores the latent space of reasoning patterns, amplifying LLM problem-solving with minimal annotation overhead and further unlocking the reasoning potential of LLMs. Prompting Test-Time Scaling offers a practical, low-cost way to elicit LLM reasoning in resource-constrained or rapidly evolving domains.

cs.AI

[54] Unified World Models: Memory-Augmented Planning and Foresight for Visual Navigation

Yifei Dong, Fengyi Wu, Guangyu Chen, Zhi-Qi Cheng, Qiyu Hu, Yuxuan Zhou, Jingdong Sun, Jun-Yan He, Qi Dai, Alexander G Hauptmann

🧩 TL;DR

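This paper proposes UniWM, a unified, memory-augmented world model that integrates egocentric visual foresight and navigation planning in a single multimodal autoregressive backbone, substantially improving success rates and generalization in embodied navigation.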


📘 Detailed Summary

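Motivation: State-of-the-art embodied navigation methods adopt modular architectures that separate navigation planning from visual world modeling, leading to state-action misalignment and limited adaptability in novel or dynamic scenarios. Overcoming this fundamental limitation requires a unified framework that keeps prediction and control tightly aligned.

Method: UniWM integrates egocentric visual foresight and planning within a single multimodal autoregressive backbone. A hierarchical memory mechanism combines detailed short-term perceptual cues with longer-term trajectory context, enabling stable, coherent reasoning over extended horizons. Action decisions are explicitly grounded in visually imagined future states.

Result: Extensive experiments on four challenging benchmarks (Go Stanford, ReCon, SCAND, HuRoN) show that UniWM improves navigation success rates by up to 30%, significantly reduces trajectory errors, and exhibits strong zero-shot generalization on the unseen TartanDrive dataset.

Conclusion: UniWM is a principled step toward unified, imagination-driven embodied navigation. It demonstrates the value of tightly integrating visual foresight and planning in one framework and points toward more robust, general embodied agents.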


📄 Abstract

Enabling embodied agents to effectively imagine future states is critical for robust and generalizable visual navigation. Current state-of-the-art approaches, however, adopt modular architectures that separate navigation planning from visual world modeling, leading to state-action misalignment and limited adaptability in novel or dynamic scenarios. To overcome this fundamental limitation, we propose UniWM, a unified, memory-augmented world model integrating egocentric visual foresight and planning within a single multimodal autoregressive backbone. Unlike modular frameworks, UniWM explicitly grounds action decisions in visually imagined outcomes, ensuring tight alignment between prediction and control. A hierarchical memory mechanism further integrates detailed short-term perceptual cues with longer-term trajectory context, enabling stable, coherent reasoning over extended horizons. Extensive experiments across four challenging benchmarks (Go Stanford, ReCon, SCAND, HuRoN) demonstrate that UniWM substantially improves navigation success rates by up to 30%, significantly reduces trajectory errors compared to strong baselines, and exhibits impressive zero-shot generalization on the unseen TartanDrive dataset. These results highlight UniWM as a principled step toward unified, imagination-driven embodied navigation.

[55] LM Fight Arena: Benchmarking Large Multimodal Models via Game Competition

Yushuo Zheng, Zicheng Zhang, Xiongkuo Min, Huiyu Duan, Guangtao Zhai

🧩 TL;DR

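This study introduces LM Fight Arena, a framework that pits large multimodal models against each other in the fighting game Mortal Kombat II to evaluate their strategic reasoning in a real-time adversarial environment, filling a gap left by existing benchmarks in dynamic, interactive evaluation.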


📘 Detailed Summary

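Motivation: Existing benchmarks for large multimodal models often fail to capture their real performance in real-time adversarial environments. Evaluation is static and non-interactive, so a new benchmark is needed to assess dynamic strategic reasoning.

Method: LM Fight Arena places six leading open- and closed-source models in Mortal Kombat II to compete against each other. Models interpret game frames and state data to select actions, and every agent controls the same character to ensure a fair comparison, giving a fully automated, reproducible, and objective evaluation.

Result: The framework tests multiple leading models in a controlled tournament and provides a systematic assessment of strategic reasoning in a dynamic setting, reflecting real-time adversarial performance better than static evaluation does.

Conclusion: LM Fight Arena introduces a challenging and engaging benchmark that bridges AI evaluation and interactive entertainment, offering a new way to assess the strategic reasoning of large multimodal models in dynamic environments.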


📄 Abstract

Existing benchmarks for large multimodal models (LMMs) often fail to capture their performance in real-time, adversarial environments. We introduce LM Fight Arena (Large Model Fight Arena), a novel framework that evaluates LMMs by pitting them against each other in the classic fighting game Mortal Kombat II, a task requiring rapid visual understanding and tactical, sequential decision-making. In a controlled tournament, we test six leading open- and closed-source models, where each agent controls the same character to ensure a fair comparison. The models are prompted to interpret game frames and state data to select their next actions. Unlike static evaluations, LM Fight Arena provides a fully automated, reproducible, and objective assessment of an LMM's strategic reasoning capabilities in a dynamic setting. This work introduces a challenging and engaging benchmark that bridges the gap between AI evaluation and interactive entertainment.

[56] FATHOMS-RAG: A Framework for the Assessment of Thinking and Observation in Multimodal Systems that use Retrieval Augmented Generation

Samuel Hildebrand, Curtis Taylor, Sean Oesch, James M Ghawaly Jr, Amir Sadovnik, Ryan Shivers, Brandon Schreiber, Kevin Kurian

🧩 TL;DR

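This paper presents a multimodal benchmark for evaluating retrieval-augmented generation (RAG) pipelines, comprising a human-created dataset of 93 questions, a phrase-level recall metric, and a nearest-neighbor embedding classifier. Closed-source pipelines significantly outperform open-source pipelines on both the correctness and hallucination metrics.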


📘 Detailed Summary

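Motivation: Existing benchmarks focus on particular aspects such as retrieval and do not evaluate RAG pipelines as a whole, especially their ability to handle multimodal information (text, tables, images) and information spread across documents. A comprehensive evaluation framework is needed.

Method: The authors build a human-created dataset of 93 multimodal questions, develop a phrase-level recall metric for correctness and a nearest-neighbor embedding classifier for hallucinations, compare 2 pipelines built with open-source retrieval mechanisms and 4 closed-source foundation models, and run a third-party human evaluation to check how well the metrics align with human judgment.

Result: Closed-source pipelines significantly outperform open-source pipelines on both correctness and hallucination metrics, with wider gaps on questions that rely on multimodal and cross-document information. Human evaluation showed average agreement of 4.62 for correctness and 4.53 for hallucination detection on a 1-5 Likert scale.

Conclusion: The benchmark provides a comprehensive evaluation framework for RAG systems and highlights the advantage of closed-source models in handling multimodal information. The proposed metrics agree closely with human judgment, offering a useful benchmark and direction for future RAG development.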


📄 Abstract

Retrieval-augmented generation (RAG) has emerged as a promising paradigm for improving factual accuracy in large language models (LLMs). We introduce a benchmark designed to evaluate RAG pipelines as a whole, evaluating a pipeline's ability to ingest, retrieve, and reason about several modalities of information, differentiating it from existing benchmarks that focus on particular aspects such as retrieval. We present (1) a small, human-created dataset of 93 questions designed to evaluate a pipeline's ability to ingest textual data, tables, images, and data spread across these modalities in one or more documents; (2) a phrase-level recall metric for correctness; (3) a nearest-neighbor embedding classifier to identify potential pipeline hallucinations; (4) a comparative evaluation of 2 pipelines built with open-source retrieval mechanisms and 4 closed-source foundation models; and (5) a third-party human evaluation of the alignment of our correctness and hallucination metrics. We find that closed-source pipelines significantly outperform open-source pipelines in both correctness and hallucination metrics, with wider performance gaps in questions relying on multimodal and cross-document information. Human evaluation of our metrics showed average agreement of 4.62 for correctness and 4.53 for hallucination detection on a 1-5 Likert scale (5 indicating "strongly agree").
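One plausible reading of a phrase-level recall metric is the fraction of reference key phrases that appear in a pipeline's answer. The sketch below implements that reading only. The exact matching rules of the benchmark are not given in the abstract, so the case-insensitive substring match, the whitespace normalization, and the function name are assumptions.

```python
def phrase_recall(answer, key_phrases):
    """Fraction of reference key phrases found verbatim in the answer.

    Matching is a case-insensitive substring test after collapsing
    whitespace; this is an illustrative rule, not the benchmark's definition.
    """
    if not key_phrases:
        return 0.0
    text = " ".join(answer.lower().split())
    hits = sum(1 for p in key_phrases if " ".join(p.lower().split()) in text)
    return hits / len(key_phrases)
```

For instance, an answer that contains two of three reference phrases would score two thirds. A metric of this shape rewards answers that surface the expected facts regardless of surrounding wording, but it cannot detect unsupported extra claims, which is presumably why the benchmark pairs it with a separate hallucination classifier.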

[57] Tiny-R1V: Lightweight Multimodal Unified Reasoning Model via Model Merging

Qixiang Yin, Huanjin Yao, Jianghao Chen, Jiaxing Huang, Zhicheng Zhao, Fei Su

🧩 TL;DR

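This paper proposes Tiny-R1V, a lightweight 3B-parameter multimodal LLM that achieves faster inference and higher accuracy through a two-stage optimization while unifying multimodal reasoning across multiple tasks.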


📘 Detailed Summary

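Motivation: Although MLLMs show remarkable capabilities across diverse tasks, they face reasoning-efficiency challenges such as large model size, overthinking, and compromised accuracy in lightweight scenarios, and research on the reasoning capabilities of lightweight MLLMs is scarce.

Method: Tiny-R1V uses a two-stage optimization. The first stage introduces Length-Informed Relative Policy Optimization (LIPO), which dynamically adjusts the advantages of responses within a group to prioritize concise, high-quality responses. The second stage proposes Adaptive Model Merging (AMM), which adaptively adjusts task-vector weights and optimizes the merged vectors via a gradient projection regularization loss.

Result: Across ten widely used reasoning benchmarks covering mathematics, structured data, OCR, and general capabilities, Tiny-R1V shows superior performance, enabling lightweight models to excel in diverse multimodal reasoning tasks.

Conclusion: The study shows that a two-stage optimization lets lightweight MLLMs combine efficient inference with accurate performance, offering a technical path for developing lightweight MLLMs.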


📄 Abstract

Although Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across diverse tasks, they encounter numerous challenges in terms of reasoning efficiency, such as large model size, overthinking, and compromised accuracy in lightweight scenarios. However, research on the reasoning capabilities of lightweight MLLMs is quite lacking. To this end, we propose Tiny-R1V, a novel lightweight 3B model that achieves faster inference and higher accuracy via a two-stage optimization, while unifying multimodal reasoning across multiple tasks and using fewer tokens. In the first stage, Tiny-R1V introduces Length-Informed Relative Policy Optimization (LIPO), a novel reinforcement learning method, to train each reasoning model. LIPO is designed to dynamically adjust the advantages of responses within groups, prioritizing concise yet high-quality responses to encourage the generation of shorter and more accurate responses. In the second stage, we propose Adaptive Model Merging (AMM), a training-free model merging method that merges multiple specialist models into a unified architecture. Specifically, AMM adaptively adjusts the weights of task vectors and robustly optimizes the merged vectors via a novel gradient projection regularization loss function, thus mitigating redundant conflicts between them. Extensive evaluations on ten widely-used reasoning benchmarks covering mathematics, structured data (charts, tables, documents), OCR, and general capabilities showcase the superior performance of Tiny-R1V, enabling lightweight models to excel in diverse multimodal reasoning tasks.
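Merging specialist models through task vectors can be illustrated with plain task arithmetic: each specialist contributes the difference between its weights and the shared base, scaled by a merge weight. The sketch below uses fixed, hand-chosen weights and toy parameter dictionaries. It is not AMM, whose adaptive weighting and gradient projection regularization are not reproduced here, and all names are hypothetical.

```python
def task_vector(finetuned, base):
    """Per-parameter difference between a specialist and the shared base model."""
    return {k: [f - b for f, b in zip(finetuned[k], base[k])] for k in base}


def merge(base, specialists, weights):
    """Weighted task arithmetic: base + sum_i w_i * (specialist_i - base).

    Parameters are plain dicts mapping names to lists of floats; the fixed
    weights are an illustrative stand-in for adaptively chosen ones.
    """
    merged = {k: list(v) for k, v in base.items()}
    for model, w in zip(specialists, weights):
        delta = task_vector(model, base)
        for k in merged:
            merged[k] = [m + w * d for m, d in zip(merged[k], delta[k])]
    return merged
```

When two specialists change different coordinates, the merged model keeps part of both updates. Conflicts arise when they move the same coordinate in opposite directions, which is the kind of interference an adaptive scheme is meant to reduce.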

[58] OSCAR: Orthogonal Stochastic Control for Alignment-Respecting Diversity in Flow Matching

Jingxuan Wu, Zhenglin Wan, Xingrui Yu, Yuzhe Yang, Bo An, Ivor Tsang

🧩 TL;DR

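This paper presents a training-free, inference-time control mechanism that makes flow-based text-to-image models diversity-aware through a feature-space objective and orthogonal stochastic perturbations, boosting generation diversity while preserving image quality and prompt alignment.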


📘 Detailed Summary

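Motivation: Flow-based text-to-image models follow deterministic trajectories, forcing users to sample repeatedly to discover diverse modes, which is costly and inefficient. Existing methods lack an effective mechanism for controlling generation diversity at inference time.

Method: The method encourages lateral spread among trajectories via a feature-space objective and reintroduces uncertainty through a time-scheduled stochastic perturbation. The key idea is to project the perturbation to be orthogonal to the generation flow, a geometric constraint that increases variation without degrading image details or prompt fidelity.

Result: Under fixed sampling budgets and across multiple text-to-image settings, the method consistently improves diversity metrics such as the Vendi Score and Brisque over strong baselines while maintaining image quality and alignment.

Conclusion: Theoretically, the method is shown to monotonically increase a volume surrogate while, owing to its geometric constraints, approximately preserving the marginal distribution, which gives a principled explanation for why generation quality is maintained. It adds diversity to flow-matching solvers without retraining or modifying the base sampler.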


📄 Abstract

Flow-based text-to-image models follow deterministic trajectories, forcing users to repeatedly sample to discover diverse modes, which is a costly and inefficient process. We present a training-free, inference-time control mechanism that makes the flow itself diversity-aware. Our method simultaneously encourages lateral spread among trajectories via a feature-space objective and reintroduces uncertainty through a time-scheduled stochastic perturbation. Crucially, this perturbation is projected to be orthogonal to the generation flow, a geometric constraint that allows it to boost variation without degrading image details or prompt fidelity. Our procedure requires no retraining or modification to the base sampler and is compatible with common flow-matching solvers. Theoretically, our method is shown to monotonically increase a volume surrogate while, due to its geometric constraints, approximately preserving the marginal distribution. This provides a principled explanation for why generation quality is robustly maintained. Empirically, across multiple text-to-image settings under fixed sampling budgets, our method consistently improves diversity metrics such as the Vendi Score and Brisque over strong baselines, while upholding image quality and alignment.
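The geometric constraint in the abstract, a perturbation projected to be orthogonal to the generation flow, corresponds to removing the component of the noise that lies along the velocity: ε⊥ = ε − (⟨ε, v⟩ / ⟨v, v⟩) v. The sketch below shows that projection and one perturbed Euler step on plain lists. The step rule, the `sigma` scale (standing in for a time schedule), and the function names are assumptions for illustration, not the paper's sampler.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project_orthogonal(noise, velocity):
    """Remove the component of `noise` that is parallel to `velocity`."""
    vv = dot(velocity, velocity)
    if vv == 0.0:
        return list(noise)  # no flow direction to protect
    coef = dot(noise, velocity) / vv
    return [n - coef * v for n, v in zip(noise, velocity)]


def perturbed_euler_step(x, velocity, noise, dt, sigma):
    """One Euler step along the flow plus an orthogonal perturbation.

    sigma is a hypothetical noise scale; a real schedule would vary it in time.
    """
    perp = project_orthogonal(noise, velocity)
    return [xi + dt * vi + sigma * pi for xi, vi, pi in zip(x, velocity, perp)]
```

Because the projected noise has zero inner product with the velocity, it moves the sample sideways relative to the flow direction at that step while leaving progress along the flow unchanged.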

[59] Agentic Systems in Radiology: Design, Applications, Evaluation, and Challenges

Christian Bluethgen, Dave Van Veen, Daniel Truhn, Jakob Nikolas Kather, Michael Moor, Malgorzata Polacin, Akshay Chaudhari, Thomas Frauenfelder, Curtis P. Langlotz, Michael Krauthammer, Farhad Nooralahzadeh

🧩 TL;DR

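This review examines LLM-based agentic systems in radiology, focusing on how external tools and feedback mechanisms extend LLMs to support complex multi-step workflows, and discusses key applications, evaluation methods, and challenges.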


📘 Detailed Summary

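Motivation: In radiology, LLMs are currently applied mainly to isolated tasks such as information extraction and report summarization. This underuses their potential in complex, multi-step workflows where decisions depend on evolving context from multiple information sources, motivating agentic systems that integrate external tools and feedback mechanisms.

Method: The review describes how equipping LLMs with external tools and feedback mechanisms yields agentic systems that handle multimodal data streams and orchestrated workflows, across a spectrum of autonomy from semi-automated workflows to adaptive agents capable of managing complex processes.

Result: The review systematically analyzes design principles of LLM-driven agentic systems, key applications, and evaluation methods for planning and tool use, and identifies major challenges including error cascades, tool-use efficiency, and health IT integration.

Conclusion: LLM-driven agentic systems have considerable potential in radiology to improve workflow efficiency and adaptability, but error propagation, tool integration, and system reliability must be addressed before clinical deployment.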


📄 Abstract

Building agents, systems that perceive and act upon their environment with a degree of autonomy, has long been a focus of AI research. This pursuit has recently become vastly more practical with the emergence of large language models (LLMs) capable of using natural language to integrate information, follow instructions, and perform forms of "reasoning" and planning across a wide range of tasks. With its multimodal data streams and orchestrated workflows spanning multiple systems, radiology is uniquely suited to benefit from agents that can adapt to context and automate repetitive yet complex tasks. In radiology, LLMs and their multimodal variants have already demonstrated promising performance for individual tasks such as information extraction and report summarization. However, using LLMs in isolation underutilizes their potential to support complex, multi-step workflows where decisions depend on evolving context from multiple information sources. Equipping LLMs with external tools and feedback mechanisms enables them to drive systems that exhibit a spectrum of autonomy, ranging from semi-automated workflows to more adaptive agents capable of managing complex processes. This review examines the design of such LLM-driven agentic systems, highlights key applications, discusses evaluation methods for planning and tool use, and outlines challenges such as error cascades, tool-use efficiency, and health IT integration.