cs.CV
[1] Towards Fine-Grained Human Motion Video Captioning
Guorui Song, Guocun Wang, Zhe Huang, Jing Lin, Xuefei Zhe, Jian Li, Haoqian Wang
🧩 TL;DR
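This paper proposes the Motion-Augmented Caption Model (M-ACM), which improves video caption quality through motion-aware decoding built on human mesh recovery. It also releases the HMI dataset and benchmark, both focused on human motion, and markedly improves the accuracy of complex human motion descriptions.
📘 Detailed Summary
Motivation: Existing video captioning models struggle to capture fine-grained motion details, so the captions they generate are vague or semantically inconsistent and fail to describe the dynamics of human actions accurately.
Method: The M-ACM framework uses motion representations derived from human mesh recovery to explicitly highlight human body dynamics. Its motion-aware decoding reduces hallucinations and improves the semantic fidelity and spatial alignment of the generated captions.
Result: Experiments show that M-ACM significantly outperforms previous methods at describing complex human motions and subtle temporal variations, setting a new standard for motion-centric video captioning.
Conclusion: The work underlines the importance of motion representations in video captioning. The proposed framework and dataset open a new direction for motion-aware video understanding and advance fine-grained action description.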
📄 Abstract
Generating accurate descriptions of human actions in videos remains a challenging task for video captioning models. Existing approaches often struggle to capture fine-grained motion details, resulting in vague or semantically inconsistent captions. In this work, we introduce the Motion-Augmented Caption Model (M-ACM), a novel generative framework that enhances caption quality by incorporating motion-aware decoding. At its core, M-ACM leverages motion representations derived from human mesh recovery to explicitly highlight human body dynamics, thereby reducing hallucinations and improving both semantic fidelity and spatial alignment in the generated captions. To support research in this area, we present the Human Motion Insight (HMI) Dataset, comprising 115K video-description pairs focused on human movement, along with HMI-Bench, a dedicated benchmark for evaluating motion-focused video captioning. Experimental results demonstrate that M-ACM significantly outperforms previous methods in accurately describing complex human motions and subtle temporal variations, setting a new standard for motion-centric video captioning.
[2] Cross-Enhanced Multimodal Fusion of Eye-Tracking and Facial Features for Alzheimer's Disease Diagnosis
Yujie Nie, Jianzhang Ni, Yonglong Ye, Yuan-Ting Zhang, Yun Kwok Wing, Xiangqing Xu, Xin Ma, Lizhou Fan
🧩 TL;DR
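This study proposes a multimodal cross-enhanced fusion framework that jointly exploits eye-tracking and facial features for Alzheimer's disease (AD) detection. The framework reaches 95.11% classification accuracy in distinguishing AD patients from healthy controls.
📘 Detailed Summary
Motivation: Multimodal approaches show considerable promise for AD diagnosis because they integrate complementary information across behavioral and perceptual domains, yet few studies have explored the joint integration of eye-tracking and facial features for auxiliary AD diagnosis, which limits diagnostic accuracy and robustness.
Method: The framework has two key modules. The Cross-Enhanced Fusion Attention Module (CEFAM) models inter-modal interactions through cross-attention and global enhancement. The Direction-Aware Convolution Module (DACM) captures fine-grained directional facial features through horizontal-vertical receptive fields. Together they enable adaptive and discriminative multimodal representation learning.
Result: Extensive experiments on a synchronized multimodal dataset of 25 AD patients and 25 healthy controls (HC) show that the framework outperforms traditional late fusion and feature concatenation, reaching 95.11% classification accuracy in distinguishing AD from HC. Explicitly modeling inter-modal dependencies and modality-specific contributions yields better robustness and diagnostic performance.
Conclusion: The study demonstrates the effectiveness, for AD diagnosis, of a multimodal fusion framework that explicitly models inter-modal interactions and captures fine-grained features. It offers insights for building more accurate and robust auxiliary diagnostic tools and contributes an ecologically valid multimodal resource for evaluating integration strategies.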
📄 Abstract
Accurate diagnosis of Alzheimer's disease (AD) is essential for enabling timely intervention and slowing disease progression. Multimodal diagnostic approaches offer considerable promise by integrating complementary information across behavioral and perceptual domains. Eye-tracking and facial features, in particular, are important indicators of cognitive function, reflecting attentional distribution and neurocognitive state. However, few studies have explored their joint integration for auxiliary AD diagnosis. In this study, we propose a multimodal cross-enhanced fusion framework that synergistically leverages eye-tracking and facial features for AD detection. The framework incorporates two key modules: (a) a Cross-Enhanced Fusion Attention Module (CEFAM), which models inter-modal interactions through cross-attention and global enhancement, and (b) a Direction-Aware Convolution Module (DACM), which captures fine-grained directional facial features via horizontal-vertical receptive fields. Together, these modules enable adaptive and discriminative multimodal representation learning. To support this work, we constructed a synchronized multimodal dataset, including 25 patients with AD and 25 healthy controls (HC), by recording aligned facial video and eye-tracking sequences during a visual memory-search paradigm, providing an ecologically valid resource for evaluating integration strategies. Extensive experiments on this dataset demonstrate that our framework outperforms traditional late fusion and feature concatenation methods, achieving a classification accuracy of 95.11% in distinguishing AD from HC, highlighting superior robustness and diagnostic performance by explicitly modeling inter-modal dependencies and modality-specific contributions.
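The cross-attention idea behind CEFAM can be illustrated with a minimal sketch in plain Python. This is not the authors' module: it omits learned projections and the global enhancement step, and the feature vectors, the names `eye` and `face`, and the function `cross_attention` are all assumptions made for illustration. Queries come from one modality (eye-tracking) and keys and values from the other (facial features), so each eye-tracking step is re-expressed as a weighted mix of facial features.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention with queries from one modality and
    keys/values from another. No learned projections (an assumption made
    to keep the sketch short)."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        attended = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        out.append(attended)
    return out

# Hypothetical toy features: 2 eye-tracking steps, 3 facial frames, dimension 2.
eye = [[1.0, 0.0], [0.0, 1.0]]
face = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = cross_attention(eye, face, face)
```

Each row of `fused` is a convex combination of the facial feature vectors, weighted by how strongly the corresponding eye-tracking step matches each facial frame.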
[3] PISA-Bench: The PISA Index as a Multilingual and Multimodal Metric for the Evaluation of Vision-Language Models
Patrick Haller, Fabio Barth, Jonas Golde, Georg Rehm, Alan Akbik
🧩 TL;DR
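This paper presents PISA-Bench, a multilingual vision-language benchmark built from expert-created PISA tests. It includes a parallel corpus in six languages for evaluating multilingual multimodal reasoning.
📘 Detailed Summary
Motivation: Existing benchmarks for vision-language models are short on high-quality, human-verified examples. Many datasets rely on content synthesized by large language models, and most are limited to English because quality assurance of translated samples is time-consuming and costly.
Method: The benchmark is built from expert-created examples of the English PISA tests. Each example contains human-extracted instructions, questions, answer options, and images, is annotated with a question type category, and is translated from English into five additional languages (Spanish, German, Chinese, French, and Italian), yielding a fully parallel corpus covering six languages.
Result: Evaluating state-of-the-art vision-language models shows that small models in particular (<20B parameters) fail to achieve high test scores, that performance degrades substantially on non-English splits, and that error rates are high on spatial and geometric reasoning tasks.
Conclusion: By releasing the dataset and evaluation framework, the work provides a resource for advancing research on multilingual multimodal reasoning. It exposes the limitations of current models on multilingual and complex reasoning tasks and points to directions for future model improvement.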
📄 Abstract
Vision-language models (VLMs) have demonstrated remarkable progress in multimodal reasoning. However, existing benchmarks remain limited in terms of high-quality, human-verified examples. Many current datasets rely on synthetically generated content by large language models (LLMs). Furthermore, most datasets are limited to English, as manual quality assurance of translated samples is time-consuming and costly. To fill this gap, we introduce PISA-Bench, a multilingual benchmark derived from English examples of the expert-created PISA tests, a unified framework for the assessment of student competencies in over eighty countries. Each example consists of human-extracted instructions, questions, answer options, and images, enriched with question type categories, and has been translated from English into five additional languages (Spanish, German, Chinese, French, and Italian), resulting in a fully parallel corpus covering six languages. We evaluate state-of-the-art vision-language models on PISA-Bench and find that especially small models (<20B parameters) fail to achieve high test scores. We further find substantial performance degradation on non-English splits as well as high error-rates when models are tasked with spatial and geometric reasoning. By releasing the dataset and evaluation framework, we provide a resource for advancing research on multilingual multimodal reasoning.
[4] A Survey on Efficient Vision-Language-Action Models
Zhaoshu Yu, Bo Wang, Pengpeng Zeng, Haonan Zhang, Ji Zhang, Lianli Gao, Jingkuan Song, Nicu Sebe, Heng Tao Shen
🧩 TL;DR
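This paper presents the first systematic survey of efficient Vision-Language-Action (VLA) models. A unified taxonomy organizes existing techniques into three core pillars, establishing a foundational reference for the community and charting future research directions.
📘 Detailed Summary
Motivation: VLA models show strong potential in embodied intelligence, but their deployment is severely constrained by the substantial computational and data requirements of the underlying large-scale foundation models, so these efficiency challenges urgently need to be addressed.
Method: A unified taxonomy organizes efficient VLA techniques into three core pillars: Efficient Model Design (efficient architectures and model compression), Efficient Training (reducing the computational burden during model learning), and Efficient Data Collection (addressing bottlenecks in acquiring and utilizing robotic data).
Result: Through a critical review of state-of-the-art methods within this framework, the survey establishes a foundational reference for the community, summarizes representative applications, delineates key challenges, and charts a roadmap for future research.
Conclusion: The survey establishes a systematic analytical framework for efficient VLAs, consolidates scattered research efforts through its three-pillar taxonomy, and gives follow-up research a clear direction. The authors maintain a continuously updated project page to track the latest developments.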
📄 Abstract
Vision-Language-Action models (VLAs) represent a significant frontier in embodied intelligence, aiming to bridge digital knowledge with physical-world interaction. While these models have demonstrated remarkable generalist capabilities, their deployment is severely hampered by the substantial computational and data requirements inherent to their underlying large-scale foundation models. Motivated by the urgent need to address these challenges, this survey presents the first comprehensive review of Efficient Vision-Language-Action models (Efficient VLAs) across the entire data-model-training process. Specifically, we introduce a unified taxonomy to systematically organize the disparate efforts in this domain, categorizing current techniques into three core pillars: (1) Efficient Model Design, focusing on efficient architectures and model compression; (2) Efficient Training, which reduces computational burdens during model learning; and (3) Efficient Data Collection, which addresses the bottlenecks in acquiring and utilizing robotic data. Through a critical review of state-of-the-art methods within this framework, this survey not only establishes a foundational reference for the community but also summarizes representative applications, delineates key challenges, and charts a roadmap for future research. We maintain a continuously updated project page to track our latest developments: https://evla-survey.github.io/
[5] Conflict Adaptation in Vision-Language Models
Xiaoyang Hu
🧩 TL;DR
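Using a sequential Stroop task, this study finds that 12 vision-language models behave in a way consistent with human conflict adaptation. It then uses sparse autoencoders to identify the supernodes responsible for conflict modulation in InternVL 3.5 4B, shedding light on the representational basis of cognitive control in VLMs.
📘 Detailed Summary
Motivation: The study asks whether vision-language models exhibit conflict adaptation, a signature of human cognitive control in which performance on a high-conflict trial improves after another high-conflict trial, in order to understand how these models recruit cognitive control, a scarce resource.
Method: The study evaluates the behavior of 13 vision-language models on a sequential Stroop task, uses sparse autoencoders to identify task-relevant supernodes in InternVL 3.5 4B, and verifies the functional importance of specific supernodes through ablation.
Result: Twelve VLMs show behavior consistent with conflict adaptation; the lone exception likely reflects a ceiling effect. In InternVL 3.5 4B, partially overlapping supernodes for text and color emerge in both early and late layers, and their relative sizes mirror the automaticity asymmetry between reading and color naming in humans. A conflict-modulated supernode in layers 24-25 is also isolated; ablating it significantly increases Stroop errors while minimally affecting congruent trials.
Conclusion: The study provides systematic evidence of human-like cognitive control behavior in vision-language models and reveals similarities between the internal representational structure of VLMs and human cognitive processes, offering a representational account of these models' cognitive abilities.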
📄 Abstract
A signature of human cognitive control is conflict adaptation: improved performance on a high-conflict trial following another high-conflict trial. This phenomenon offers an account for how cognitive control, a scarce resource, is recruited. Using a sequential Stroop task, we find that 12 of 13 vision-language models (VLMs) tested exhibit behavior consistent with conflict adaptation, with the lone exception likely reflecting a ceiling effect. To understand the representational basis of this behavior, we use sparse autoencoders (SAEs) to identify task-relevant supernodes in InternVL 3.5 4B. Partially overlapping supernodes emerge for text and color in both early and late layers, and their relative sizes mirror the automaticity asymmetry between reading and color naming in humans. We further isolate a conflict-modulated supernode in layers 24-25 whose ablation significantly increases Stroop errors while minimally affecting congruent trials.
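The behavioral measure described above can be made concrete with a short sketch. Conflict adaptation is read here as higher accuracy on incongruent (high-conflict) trials that follow an incongruent trial than on those that follow a congruent one. The trial sequence is invented for illustration, and the function name `conflict_adaptation` and the accuracy-difference score are assumptions, not the paper's exact analysis.

```python
def conflict_adaptation(trials):
    """trials: list of (is_incongruent, is_correct) in presentation order.

    Returns accuracy on incongruent trials preceded by an incongruent trial,
    accuracy on incongruent trials preceded by a congruent trial, and their
    difference (positive values are consistent with conflict adaptation)."""
    after_incongruent, after_congruent = [], []
    for prev, cur in zip(trials, trials[1:]):
        if not cur[0]:
            continue  # only score high-conflict (incongruent) current trials
        bucket = after_incongruent if prev[0] else after_congruent
        bucket.append(1.0 if cur[1] else 0.0)
    if not after_incongruent or not after_congruent:
        raise ValueError("need incongruent trials after both trial types")
    acc_ii = sum(after_incongruent) / len(after_incongruent)
    acc_ci = sum(after_congruent) / len(after_congruent)
    return acc_ii, acc_ci, acc_ii - acc_ci

# Hypothetical sequence: (is_incongruent, is_correct)
trials = [
    (False, True), (True, False), (True, True), (False, True),
    (True, False), (True, True), (True, True),
]
acc_ii, acc_ci, effect = conflict_adaptation(trials)
```

A real analysis would aggregate over many sequences and test the difference statistically; the sketch only shows how trials are split by the preceding trial type.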
[6] DualCap: Enhancing Lightweight Image Captioning via Dual Retrieval with Similar Scenes Visual Prompts
Binbin Li, Guimiao Yang, Zisen Qi, Haiping Wang, Yu Ding
🧩 TL;DR
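This paper proposes DualCap, which uses a dual retrieval mechanism to generate visual prompts that enrich the visual representation of image captioning models. It addresses the semantic gap left by existing methods, which use retrieved data only as text prompts and leave the original visual features unenhanced.
📘 Detailed Summary
Motivation: Existing lightweight retrieval-augmented image captioning models typically use retrieved data only as text prompts, so the original visual features are not enhanced, leaving a semantic gap for object details and complex scenes.
Method: DualCap employs a dual retrieval mechanism: standard image-to-text retrieval supplies text prompts, and a novel image-to-image retrieval sources visually similar scenes. Salient keywords and phrases are extracted from the captions of those scenes, encoded, and integrated with the original image features through a lightweight, trainable feature fusion network.
Result: Extensive experiments show that the method achieves competitive performance while requiring fewer trainable parameters than previous visual-prompting captioning approaches.
Conclusion: The study shows that generating visual prompts through dual retrieval effectively strengthens the visual representation of image captioning models, offers a new design idea for lightweight retrieval-augmented models, and strikes a good balance between parameter efficiency and performance.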
📄 Abstract
Recent lightweight retrieval-augmented image caption models often utilize retrieved data solely as text prompts, thereby creating a semantic gap by leaving the original visual features unenhanced, particularly for object details or complex scenes. To address this limitation, we propose DualCap, a novel approach that enriches the visual representation by generating a visual prompt from retrieved similar images. Our model employs a dual retrieval mechanism, using standard image-to-text retrieval for text prompts and a novel image-to-image retrieval to source visually analogous scenes. Specifically, salient keywords and phrases are derived from the captions of visually similar scenes to capture key objects and similar details. These textual features are then encoded and integrated with the original image features through a lightweight, trainable feature fusion network. Extensive experiments demonstrate that our method achieves competitive performance while requiring fewer trainable parameters compared to previous visual-prompting captioning approaches.
[7] Perception, Understanding and Reasoning, A Multimodal Benchmark for Video Fake News Detection
Cui Yakun, Fushuo Huo, Weijie Shi, Juntao Dai, Hang Du, Zhenghao Zhu, Sirui Han, Yike Guo
🧩 TL;DR
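The study introduces the MVFNDB benchmark for systematically evaluating the perception, understanding, and reasoning abilities of multimodal large language models in video fake news detection, and designs the MVFND-CoT framework to validate how combining multiple features affects detection results.
📘 Detailed Summary
Motivation: Traditional video fake news detection benchmarks focus mainly on the accuracy of the final decision and lack fine-grained evaluation of the whole detection process. The process is left a black box, which prevents deeper insight into models' perception, understanding, and reasoning abilities.
Method: Building on an empirical analysis, the study constructs the MVFNDB benchmark, comprising 10 tasks and 9730 human-annotated video-related questions, and designs MVFND-CoT, a framework that incorporates reasoning over both creator-added content and original shooting footage.
Result: The study analyzes in depth the deeper factors that influence detection accuracy, including video processing strategies and the alignment between video features and model capabilities, and validates the impact of multi-feature fusion on the final results.
Conclusion: The benchmark lays a solid foundation for future evaluation and development of multimodal large language models in video fake news detection and helps make research in this area more systematic and thorough.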
📄 Abstract
The advent of multi-modal large language models (MLLMs) has greatly advanced research into video fake news detection (VFND). Traditional video-based FND benchmarks typically focus on the accuracy of the final decision and often fail to provide fine-grained assessments of the entire detection process, leaving that process a black box. We therefore introduce MVFNDB (Multi-modal Video Fake News Detection Benchmark), built on an empirical analysis that provides the foundation for the task definitions. The benchmark comprises 10 tasks and is meticulously crafted to probe MLLMs' perception, understanding, and reasoning capacities during detection, featuring 9730 human-annotated video-related questions based on a carefully constructed taxonomy of VFND abilities. To validate the impact of combining multiple features on the final results, we design a novel framework named MVFND-CoT, which incorporates reasoning over both creator-added content and original shooting footage. Building upon the benchmark, we conduct an in-depth analysis of the deeper factors influencing accuracy, including video processing strategies and the alignment between video features and model capabilities. We believe this benchmark will lay a solid foundation for future evaluations and advancements of MLLMs in the domain of video fake news detection.
[8] SafeEditor: Unified MLLM for Efficient Post-hoc T2I Safety Editing
Ruiyang Zhang, Jiahao Luo, Xiaoru Feng, Qiufan Pang, Yaodong Yang, Juntao Dai
🧩 TL;DR
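This study proposes a multi-round safety editing framework. By constructing the MR-SafeEdit dataset and developing the SafeEditor model, it offers a model-agnostic safety alignment solution for text-to-image generation that reduces over-refusal and improves the balance between safety and utility.
📘 Detailed Summary
Motivation: Existing safety approaches for text-to-image models fall into training-time and inference-time methods. Inference-time methods are widely adopted for their cost-effectiveness but suffer from over-refusal and an imbalance between safety and utility, so a more effective safety alignment approach is needed.
Method: The study proposes a multi-round safety editing framework that works as a model-agnostic, plug-and-play module. At its core is MR-SafeEdit, a multi-round image-text interleaved dataset constructed specifically for safety editing. The study also develops SafeEditor, a unified multimodal large language model that follows a post-hoc safety editing paradigm mirroring the human cognitive process of identifying and refining unsafe content.
Result: Experiments show that SafeEditor surpasses prior safety approaches, reducing over-refusal while achieving a more favorable safety-utility balance.
Conclusion: The study demonstrates the potential of multi-round safety editing for the safety alignment of text-to-image models and contributes a new post-hoc editing paradigm for model safety. The summary suggests it could be extended to a broader range of multimodal generation tasks.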
📄 Abstract
With the rapid advancement of text-to-image (T2I) models, ensuring their safety has become increasingly critical. Existing safety approaches can be categorized into training-time and inference-time methods. While inference-time methods are widely adopted due to their cost-effectiveness, they often suffer from limitations such as over-refusal and imbalance between safety and utility. To address these challenges, we propose a multi-round safety editing framework that functions as a model-agnostic, plug-and-play module, enabling efficient safety alignment for any text-to-image model. Central to this framework is MR-SafeEdit, a multi-round image-text interleaved dataset specifically constructed for safety editing in text-to-image generation. We introduce a post-hoc safety editing paradigm that mirrors the human cognitive process of identifying and refining unsafe content. To instantiate this paradigm, we develop SafeEditor, a unified MLLM capable of multi-round safety editing on generated images. Experimental results show that SafeEditor surpasses prior safety approaches by reducing over-refusal while achieving a more favorable safety-utility balance.
[9] Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation
Inclusion AI: Bowen Ma, Cheng Zou, Canxiang Yan, Chunxiang Jin, Chunjie Shen, Dandan Zheng, Fudong Wang, Furong Xu, GuangMing Yao, Jun Zhou, Jingdong Chen, Jianing Li, Jianxin Sun, Jiajia Liu, Jianjiang Zhu, Jianping Jiang, Jun Peng, Kaixiang Ji, Kaimeng Ren, Libin Wang, Lixiang Ru, Longhua Tan, Lan Wang, Mochen Bai, Ning Gao, Qingpei Guo, Qinglong Zhang, Qiang Xu, Rui Liu, Ruijie Xiong, Ruobing Zheng, Sirui Gao, Tianqi Li, Tinghao Liu, Weilong Chai, Xinyu Xiao, Xiaomei Wang, Xiaolong Wang, Xiao Lu, Xiaoyu Li, Xingning Dong, Xuzheng Yu, Yi Yuan, Yuting Gao, Yuting Xiao, Yunxiao Sun, Yipeng Chen, Yifan Mao, Yifei Wu, Yongjie Lyu, Ziping Ma, Zhiqiang Fang, Zhihao Qiu, Ziyuan Huang, Zizheng Yang, Zhengyu He
🧩 TL;DR
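This paper presents Ming-Flash-Omni, an upgraded version of Ming-Omni built on a sparse Mixture-of-Experts (MoE) architecture based on Ling-Flash-2.0, with 100 billion total parameters of which only 6.1 billion are active per token. It makes marked progress in unified multimodal intelligence and reaches state-of-the-art performance on several tasks, including text-to-image generation, generative segmentation, and contextual speech recognition.
📘 Detailed Summary
Motivation: The work targets the trade-off between computational efficiency and model capacity scaling while advancing unified multimodal intelligence across vision, speech, and language, which the authors frame as a key step toward artificial general intelligence.
Method: The model is built on a sparser MoE variant of Ling-Flash-2.0 with 100 billion total parameters, of which only 6.1 billion are active per token. This architecture enables efficient scaling, and new capabilities such as generative segmentation are introduced to strengthen spatial control and editing consistency.
Result: Compared with its predecessor, the model improves substantially in multimodal understanding and generation. It achieves state-of-the-art performance in contextual ASR, setting new records on all 12 contextual ASR benchmarks, and highly competitive results in dialect-aware ASR. In image generation it delivers high-fidelity text rendering and marked gains in scene consistency and identity preservation.
Conclusion: The study indicates that a sparse MoE architecture can balance computational efficiency and model capacity, and that a unified architecture is a viable path for multimodal intelligent systems. New capabilities such as generative segmentation improve segmentation performance and also strengthen spatial control in image generation.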
📄 Abstract
We propose Ming-Flash-Omni, an upgraded version of Ming-Omni, built upon a sparser Mixture-of-Experts (MoE) variant of Ling-Flash-2.0 with 100 billion total parameters, of which only 6.1 billion are active per token. This architecture enables highly efficient scaling (dramatically improving computational efficiency while significantly expanding model capacity) and empowers stronger unified multimodal intelligence across vision, speech, and language, representing a key step toward Artificial General Intelligence (AGI). Compared to its predecessor, the upgraded version exhibits substantial improvements across multimodal understanding and generation. We significantly advance speech recognition capabilities, achieving state-of-the-art performance in contextual ASR and highly competitive results in dialect-aware ASR. In image generation, Ming-Flash-Omni introduces high-fidelity text rendering and demonstrates marked gains in scene consistency and identity preservation during image editing. Furthermore, Ming-Flash-Omni introduces generative segmentation, a capability that not only achieves strong standalone segmentation performance but also enhances spatial control in image generation and improves editing consistency. Notably, Ming-Flash-Omni achieves state-of-the-art results in text-to-image generation and generative segmentation, and sets new records on all 12 contextual ASR benchmarks, all within a single unified architecture.
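The phrase "active per token" can be made concrete with a small sketch of sparse expert routing. The abstract does not describe Ming-Flash-Omni's router, so top-k gating is used here purely as an assumption (it is a common MoE scheme), and the toy experts, gate logits, and function names are all hypothetical. The only figures taken from the abstract are the 100 billion total and 6.1 billion active parameters.

```python
import math

def top_k_route(gate_logits, k):
    """Pick the k experts with the highest gate logits and renormalise
    their softmax weights. Returns {expert_index: weight}."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    m = max(gate_logits[i] for i in top)
    exps = {i: math.exp(gate_logits[i] - m) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}

def moe_forward(x, experts, gate_logits, k=2):
    # Only the selected experts are evaluated; the rest stay idle for this token.
    routing = top_k_route(gate_logits, k)
    return sum(w * experts[i](x) for i, w in routing.items())

# Hypothetical toy experts: each simply scales its input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
gate_logits = [0.1, 2.0, -1.0, 2.0]
routing = top_k_route(gate_logits, k=2)
y = moe_forward(1.0, experts, gate_logits, k=2)

# Fraction of parameters active per token, from the figures in the abstract.
active_fraction = 6.1 / 100.0
```

With these figures, about 6.1% of the parameters take part in processing any single token, which is where the efficiency gain of a sparse model comes from.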
[10] MCIHN: A Hybrid Network Model Based on Multi-path Cross-modal Interaction for Multimodal Emotion Recognition
Haoyang Zhang, Zhou Yang, Ke Sun, Yucai Pang, Guoliang Xu
🧩 TL;DR
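This paper proposes MCIHN, a hybrid network model based on multi-path cross-modal interaction. It uses adversarial autoencoders and a cross-modal gate mechanism to address modality discrepancy and the difficulty of characterizing emotional features in multimodal emotion recognition, and reports superior performance on public datasets.
📘 Detailed Summary
Motivation: Multimodal emotion recognition is crucial for human-computer interaction, but it faces two challenges that limit accuracy: significant differences between modalities and the difficulty of characterizing unimodal emotional information.
Method: MCIHN first builds an adversarial autoencoder (AAE) for each modality to learn discriminative emotion features and reconstruct them for more discriminative information. The latent codes are then fed into a predefined Cross-modal Gate Mechanism model (CGMM), which reduces the discrepancy between modalities and establishes emotional relationships between them. Finally, a Feature Fusion Module (FFM) performs multimodal fusion.
Result: Experiments on the publicly available SIMS and MOSI datasets show that MCIHN achieves superior performance, supporting the effectiveness of the proposed method.
Conclusion: The study shows that learning discriminative features with adversarial autoencoders and reducing modality discrepancy through cross-modal interaction are both effective, offering a new technical route for multimodal emotion recognition with practical application value.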
📄 Abstract
Multimodal emotion recognition is crucial for future human-computer interaction. However, accurate emotion recognition still faces significant challenges due to differences between modalities and the difficulty of characterizing unimodal emotional information. To solve these problems, a hybrid network model based on multipath cross-modal interaction (MCIHN) is proposed. First, adversarial autoencoders (AAE) are constructed separately for each modality. The AAE learns discriminative emotion features and reconstructs the features through a decoder to obtain more discriminative information about the emotion classes. Then, the latent codes from the AAE of different modalities are fed into a predefined Cross-modal Gate Mechanism model (CGMM) to reduce the discrepancy between modalities, establish the emotional relationship between interacting modalities, and generate the interaction features between different modalities. Finally, multimodal fusion is performed using the Feature Fusion Module (FFM) for better emotion recognition. Experiments were conducted on the publicly available SIMS and MOSI datasets, demonstrating that MCIHN achieves superior performance.
[11] Modality-Aware SAM: Sharpness-Aware-Minimization Driven Gradient Modulation for Harmonized Multimodal Learning
Hossein R. Nowdeh, Jie Ji, Xiaolong Ma, Fatemeh Afghah
🧩 TL;DR
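This paper proposes Modality-Aware Sharpness-Aware Minimization (M-SAM), a framework that balances multimodal learning by identifying the dominant modality and modulating the loss, improving model robustness and performance. The model-agnostic method outperforms state-of-the-art optimization and gradient manipulation methods on four datasets.
📘 Detailed Summary
Motivation: In multimodal learning, dominant modalities often overshadow the contributions of others, limiting generalization. Existing methods struggle to balance the learning dynamics across modalities, which restricts the model's ability to benefit from complementary features.
Method: M-SAM optimizes in three steps per iteration. It first identifies the dominant modality using Shapley values, then decomposes the loss and modulates the loss landscape to prioritize robustness in favor of the dominant modality, and finally updates the weights by backpropagating the modulated gradients. The method supports early and late fusion scenarios and applies to many modality types.
Result: Extensive experiments on four diverse datasets show that M-SAM outperforms the latest state-of-the-art optimization and gradient manipulation methods. It balances the multimodal learning process while improving overall performance.
Conclusion: Through modality-aware loss modulation, M-SAM addresses the problem of dominant modalities overshadowing others. The framework offers a new optimization perspective for multimodal learning and helps the model explore and exploit complementary features across modalities.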
📄 Abstract
In multimodal learning, dominant modalities often overshadow others, limiting generalization. We propose Modality-Aware Sharpness-Aware Minimization (M-SAM), a model-agnostic framework that applies to many modalities and supports early and late fusion scenarios. In every iteration, M-SAM optimizes learning in three steps. First, it identifies the dominant modality based on each modality's contribution to accuracy, using Shapley values. Second, it decomposes the loss landscape, or in other words, it modulates the loss to prioritize the robustness of the model in favor of the dominant modality. Third, M-SAM updates the weights by backpropagation of modulated gradients. This ensures robust learning for the dominant modality while enhancing contributions from others, allowing the model to explore and exploit complementary features that strengthen overall performance. Extensive experiments on four diverse datasets show that M-SAM outperforms the latest state-of-the-art optimization and gradient manipulation methods and significantly balances and improves multimodal learning.
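The first step, identifying the dominant modality from Shapley values, can be sketched as follows. The accuracies per modality subset are invented for illustration, the modality names `audio` and `video` are hypothetical, and the exact Shapley computation over all orderings is a textbook formulation rather than necessarily the estimator used in M-SAM.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values. `value` maps a frozenset of players to a score
    (here: the accuracy obtained when only those modalities are used)."""
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        seen = frozenset()
        for p in order:
            # Marginal contribution of p given the modalities already present.
            totals[p] += value(seen | {p}) - value(seen)
            seen = seen | {p}
    return {p: t / len(orders) for p, t in totals.items()}

# Hypothetical accuracies for each subset of modalities.
acc = {
    frozenset(): 0.10,
    frozenset({"audio"}): 0.60,
    frozenset({"video"}): 0.30,
    frozenset({"audio", "video"}): 0.70,
}
phi = shapley_values(["audio", "video"], acc.__getitem__)
dominant = max(phi, key=phi.get)
```

The Shapley values sum to the gain of the full model over the empty baseline, so they split the joint accuracy among the modalities; the modality with the largest share is treated as dominant.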
[12] FT-ARM: Fine-Tuned Agentic Reflection Multimodal Language Model for Pressure Ulcer Severity Classification with Reasoning
Reza Saadati Fard, Emmanuel Agu, Palawat Busaranuvong, Deepak Kumar, Shefalika Gautam, Bengisu Tulu, Diane Strong, Lorraine Loretz
🧩 TL;DR
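This paper presents FT-ARM, a self-reflective agentic model built on a fine-tuned multimodal large language model. Through iterative reasoning it reaches 85% accuracy on pressure ulcer severity classification, 4% above existing CNN methods, while providing clinically interpretable natural-language explanations.
📘 Detailed Summary
Motivation: Pressure ulcer severity classification is made difficult by subtle visual distinctions and variability in subjective judgment. Existing CNN- and ViT-based AI methods achieve promising accuracy but offer limited interpretability, falling short of the transparency and consistency that clinical deployment requires.
Method: FT-ARM is fine-tuned from LLaMA 3.2 90B. It is a multimodal large language model with an agentic self-reflection mechanism that iteratively reasons over visual features and encoded clinical knowledge, mimicking a clinician's diagnostic reassessment.
Result: On the publicly available Pressure Injury Image Dataset (PIID), FT-ARM achieves 85% accuracy in classifying pressure ulcer Stages I-IV, 4% above prior CNN models. It was also tested under live-inference conditions, and it generates natural-language explanations grounded in clinical knowledge.
Conclusion: By combining fine-tuning with multimodal reflective reasoning, FT-ARM improves the reliability, transparency, and clinical applicability of automated wound assessment, addressing the need for consistent and explainable pressure ulcer staging.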
📄 Abstract
Pressure ulcers (PUs) are a serious and prevalent healthcare concern. Accurate classification of PU severity (Stages I-IV) is essential for proper treatment but remains challenging due to subtle visual distinctions and subjective interpretation, leading to variability among clinicians. Prior AI-based approaches using Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) achieved promising accuracy but offered limited interpretability. We present FT-ARM (Fine-Tuned Agentic Reflection Multimodal model), a fine-tuned multimodal large language model (MLLM) with an agentic self-reflection mechanism for pressure ulcer severity classification. Inspired by clinician-style diagnostic reassessment, FT-ARM iteratively refines its predictions by reasoning over visual features and encoded clinical knowledge from text, enhancing both accuracy and consistency. On the publicly available Pressure Injury Image Dataset (PIID), FT-ARM, fine-tuned from LLaMA 3.2 90B, achieved 85% accuracy in classifying PU stages I-IV, surpassing prior CNN-based models by +4%. Unlike earlier CNN/ViT studies that relied solely on offline evaluations, FT-ARM is designed and tested for live inference, reflecting real-time deployment conditions. Furthermore, it produces clinically grounded natural-language explanations, improving interpretability and trust. By integrating fine-tuning and reflective reasoning across multimodal inputs, FT-ARM advances the reliability, transparency, and clinical applicability of automated wound assessment systems, addressing the critical need for consistent and explainable PU staging to support improved patient care.
[13] Efficient License Plate Recognition via Pseudo-Labeled Supervision with Grounding DINO and YOLOv8
Zahra Ebrahimi Vargoorani, Amir Mohammad Ghoreyshi, Ching Yee Suen
🧩 TL;DR
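This paper proposes an automatic license plate recognition system based on YOLOv8 and semi-supervised learning. Combining pseudo-labels generated by Grounding DINO with manually labeled data improves model performance, and the paper reports recall and character error rates on multiple datasets.
📘 Detailed Summary
Motivation: Automatic license plate recognition (ALPR) systems are challenged by environmental factors (such as lighting, rain, and dust), high vehicle speeds, varying camera angles, and low-quality images, all of which limit the accuracy and robustness of existing systems. The paper aims to improve ALPR performance in complex real-world scenes.
Method: A YOLOv8-based deep learning strategy handles license plate detection and recognition. A semi-supervised learning framework trains the detection model on a small set of manually labeled data combined with pseudo-labels generated by the vision-language model Grounding DINO. Automatically annotating many images reduces the reliance on manual labeling while maintaining label quality.
Result: The method reaches 94% recall on the CENPARMI dataset and 91% recall on the UFPR-ALPR dataset, and character error rates are reported for both datasets.
Conclusion: The study indicates that combining semi-supervised learning with a vision-language model improves license plate recognition performance and reduces manual annotation cost, offering a practical route to large-scale deployment.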
📄 Abstract
Developing a highly accurate automatic license plate recognition system (ALPR) is challenging due to environmental factors such as lighting, rain, and dust. Additional difficulties include high vehicle speeds, varying camera angles, and low-quality or low-resolution images. ALPR is vital in traffic control, parking, vehicle tracking, toll collection, and law enforcement applications. This paper proposes a deep learning strategy using YOLOv8 for license plate detection and recognition tasks. This method seeks to enhance the performance of the model using datasets from Ontario, Quebec, California, and New York State. It achieved an impressive recall rate of 94% on the dataset from the Center for Pattern Recognition and Machine Intelligence (CENPARMI) and 91% on the UFPR-ALPR dataset. In addition, our method follows a semi-supervised learning framework, combining a small set of manually labeled data with pseudo-labels generated by Grounding DINO to train our detection model. Grounding DINO, a powerful vision-language model, automatically annotates many images with bounding boxes for license plates, thereby minimizing the reliance on labor-intensive manual labeling. By integrating human-verified and model-generated annotations, we can scale our dataset efficiently while maintaining label quality, which significantly enhances the training process and overall model performance. Furthermore, it reports character error rates for both datasets, providing additional insight into system performance.
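The character error rate mentioned above is commonly defined as the character-level edit distance between the predicted and reference strings, divided by the number of reference characters. The sketch below follows that standard definition; the paper's exact evaluation protocol is not given in the abstract, and the plate strings are made up for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution (free if characters match)
            ))
        prev = cur
    return prev[-1]

def character_error_rate(refs, hyps):
    """Total edit distance divided by the total number of reference characters."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    chars = sum(len(r) for r in refs)
    return errors / chars

# Hypothetical plates: one substitution in the first, one deletion in the second.
refs = ["ABC123", "XYZ789"]
hyps = ["ABC128", "XYZ89"]
cer = character_error_rate(refs, hyps)
```

Here two character errors over twelve reference characters give a CER of about 0.167.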
[14] Breast Cancer VLMs: Clinically Practical Vision-Language Train-Inference Models
Shunjie-Fabian Zheng, Hyeonjun Lee, Thijs Kooi, Ali Diba
🧩 TL;DR
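This study proposes a multimodal framework that combines visual features from 2D mammograms with structured textual descriptors derived from clinical metadata and synthesized radiological reports, improving breast cancer detection. The method outperforms unimodal baselines in cancer detection and calcification identification and sets out a paradigm for clinically viable, VLM-based computer-aided diagnosis systems.
📘 Detailed Summary
Motivation: Existing computer-aided diagnosis (CAD) systems face critical limitations in clinical deployment: they struggle with the nuanced interpretation of multimodal data, and their feasibility suffers from the requirement of prior clinical history. Breast cancer is the most commonly diagnosed malignancy among women in the developed world and early detection is key to reducing mortality, yet existing methods do not fully exploit the synergy between accessible clinical information and imaging data.
Method: The framework uses tokenization modules to integrate visual features from 2D mammograms with structured textual descriptors derived from easily accessible clinical metadata and synthesized radiological reports. It combines convolutional neural networks with language representations, achieving better performance than vision transformer-based models while handling high-resolution images and supporting practical deployment across diverse populations.
Result: Evaluated on screening mammograms from a multi-national cohort, the multimodal approach outperforms unimodal baselines in cancer detection and calcification identification. The results indicate that fusing visual features with textual descriptors improves diagnostic accuracy and clinical practicality.
Conclusion: The study establishes a paradigm for developing clinically viable, VLM-based CAD systems that use both imaging data and contextual patient information through effective fusion mechanisms. It shows the potential of multimodal integration in medical image analysis and, according to the summary, is especially relevant to resource-constrained settings.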
📄 Abstract
Breast cancer remains the most commonly diagnosed malignancy among women in the developed world. Early detection through mammography screening plays a pivotal role in reducing mortality rates. While computer-aided diagnosis (CAD) systems have shown promise in assisting radiologists, existing approaches face critical limitations in clinical deployment - particularly in handling the nuanced interpretation of multi-modal data and feasibility due to the requirement of prior clinical history. This study introduces a novel framework that synergistically combines visual features from 2D mammograms with structured textual descriptors derived from easily accessible clinical metadata and synthesized radiological reports through innovative tokenization modules. Our proposed methods in this study demonstrate that strategic integration of convolutional neural networks (ConvNets) with language representations achieves superior performance to vision transformer-based models while handling high-resolution images and enabling practical deployment across diverse populations. By evaluating it on multi-national cohort screening mammograms, our multi-modal approach achieves superior performance in cancer detection and calcification identification compared to unimodal baselines, with particular improvements. The proposed method establishes a new paradigm for developing clinically viable VLM-based CAD systems that effectively leverage imaging data and contextual patient information through effective fusion mechanisms.
[15] DRIP: Dynamic patch Reduction via Interpretable Pooling
Yusen Peng, Sachin Kumar
🧩 TL;DR
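This paper proposes Dynamic patch Reduction via Interpretable Pooling (DRIP), which adaptively merges tokens in the deeper layers of a visual encoder, substantially lowering computational cost while maintaining classification and zero-shot performance. The method is validated on ImageNet training from scratch and CLIP contrastive pretraining, and is applied to large-scale continual pretraining in the biology domain.
📘 Detailed Summary
Motivation: Vision-language models have advanced considerably, but because large-scale pretraining is computationally expensive, researchers tend to avoid pretraining a vision-language model from scratch. The study aims to make such pretraining more efficient by reducing its computational cost.
Method: DRIP adapts to the input image and dynamically merges tokens in the deeper layers of the visual encoder. This dynamic merging reduces computation while preserving the model's representational capacity and applies to different pretraining scenarios.
Result: In ImageNet training from scratch and CLIP contrastive pretraining, the method achieves a significant GFLOP reduction while maintaining comparable classification and zero-shot performance. Continual pretraining on a large biology dataset further validates the method and shows its potential in scientific domains.
Conclusion: The study shows that dynamic patch reduction can balance computational efficiency and model performance, offering a practical efficiency improvement for large-scale vision-language pretraining. The method applies to general domains and extends to specialized scientific ones.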
📄 Abstract
Recently, the advances in vision-language models, including contrastive pretraining and instruction tuning, have greatly pushed the frontier of multimodal AI. However, owing to the large-scale and hence expensive pretraining, the efficiency concern has discouraged researchers from attempting to pretrain a vision language model from scratch. In this work, we propose Dynamic patch Reduction via Interpretable Pooling (DRIP), which adapts to the input images and dynamically merges tokens in the deeper layers of a visual encoder. Our results on both ImageNet training from scratch and CLIP contrastive pretraining demonstrate a significant GFLOP reduction while maintaining comparable classification/zero-shot performance. To further validate our proposed method, we conduct continual pretraining on a large biology dataset, extending its impact into scientific domains.
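To illustrate why merging tokens reduces computation, the sketch below merges adjacent tokens whose cosine similarity exceeds a threshold by averaging them. This is not DRIP's pooling mechanism, which the abstract does not detail; the similarity rule, the threshold, and the toy token vectors are all assumptions chosen to show the general idea of input-dependent token reduction.

```python
import math

def cosine(a, b):
    # Cosine similarity between two non-zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def merge_adjacent(tokens, threshold=0.9):
    """Group each token with its predecessor when they are similar enough,
    then replace every group by its mean. The number of output tokens
    depends on the input, so the reduction is dynamic."""
    groups = [[tokens[0]]]
    for tok in tokens[1:]:
        if cosine(groups[-1][-1], tok) >= threshold:
            groups[-1].append(tok)
        else:
            groups.append([tok])
    return [[sum(col) / len(g) for col in zip(*g)] for g in groups]

# Hypothetical patch tokens: two near-duplicates followed by two more.
tokens = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0]]
merged = merge_adjacent(tokens)
```

Because self-attention cost grows with the square of the token count, halving the tokens in deeper layers, as in this toy case, cuts the attention cost of those layers to roughly a quarter.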
[16] Vision-Language Integration for Zero-Shot Scene Understanding in Real-World Environments
Manjunath Prasad Holenarasipura Rajiv, B. M. Vidyavathi
🧩 TL;DR
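This study proposes a vision-language integration framework that unifies pre-trained visual encoders and large language models to achieve semantic alignment between visual and textual modalities, improving zero-shot scene understanding.
📘 Detailed Summary
Motivation: Zero-shot understanding of real-world scenes is difficult because the complexity and variability of natural scenes require models to recognize new objects, actions, and contexts without prior labeled examples, which calls for better cross-modal semantic alignment and generalization.
Method: The approach develops a unified model that embeds visual inputs and textual prompts into a shared space, followed by multimodal fusion and reasoning layers for contextual interpretation. It integrates pre-trained visual encoders such as CLIP and ViT with GPT-based large language models.
Result: Experiments on Visual Genome, COCO, ADE20K, and custom real-world datasets show significant gains over state-of-the-art zero-shot models in object recognition, activity detection, and scene captioning, with up to 18% improvement in top-1 accuracy and notable gains in semantic coherence metrics.
Conclusion: The study demonstrates the effectiveness of cross-modal alignment and language grounding in improving generalization for real-world scene understanding, and offers insights for building more robust zero-shot visual understanding systems.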
📄 Abstract
Zero-shot scene understanding in real-world settings presents major challenges due to the complexity and variability of natural scenes, where models must recognize new objects, actions, and contexts without prior labeled examples. This work proposes a vision-language integration framework that unifies pre-trained visual encoders (e.g., CLIP, ViT) and large language models (e.g., GPT-based architectures) to achieve semantic alignment between visual and textual modalities. The goal is to enable robust zero-shot comprehension of scenes by leveraging natural language as a bridge to generalize over unseen categories and contexts. Our approach develops a unified model that embeds visual inputs and textual prompts into a shared space, followed by multimodal fusion and reasoning layers for contextual interpretation. Experiments on Visual Genome, COCO, ADE20K, and custom real-world datasets demonstrate significant gains over state-of-the-art zero-shot models in object recognition, activity detection, and scene captioning. The proposed system achieves up to 18% improvement in top-1 accuracy and notable gains in semantic coherence metrics, highlighting the effectiveness of cross-modal alignment and language grounding in enhancing generalization for real-world scene understanding.
[17] Visual Diversity and Region-aware Prompt Learning for Zero-shot HOI Detection
Chanhyeong Yang, Taehoon Song, Jihwan Park, Hyunwoo J. Kim
🧩 TL;DR
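This paper proposes VDRP, a visual diversity and region-aware prompt learning framework for zero-shot human-object interaction (HOI) detection. By combining visual diversity-aware prompt learning with region-specific concept retrieval, it addresses both intra-class visual diversity and inter-class visual entanglement.
📘 Detailed Summary
Motivation: Existing zero-shot HOI detection methods built on pretrained vision-language models such as CLIP struggle with the visual complexity of interactions. In particular, they fail to handle two key challenges: intra-class visual diversity (the same verb looks different across poses and contexts) and inter-class visual entanglement (different verbs produce similar visual patterns).
Method: VDRP has two core components. A visual diversity-aware prompt learning strategy injects group-wise visual variance into the context embedding and applies Gaussian perturbation to capture the diverse visual variations of a verb. A region-specific concept retrieval mechanism extracts concepts from the human, object, and union regions to augment the diversity-aware prompt embeddings, yielding region-aware prompts that improve verb-level discrimination.
Result: Experiments on the HICO-DET benchmark show that the method achieves state-of-the-art performance under four zero-shot evaluation settings, addressing both intra-class diversity and inter-class visual entanglement.
Conclusion: The study shows that combining visual diversity modeling with region-aware prompt learning substantially improves zero-shot HOI detection, offers a new technical route for handling complex visual interaction patterns, and opens possibilities for applying vision-language models to fine-grained visual understanding tasks.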
📄 Abstract
Zero-shot Human-Object Interaction detection aims to localize humans and objects in an image and recognize their interaction, even when specific verb-object pairs are unseen during training. Recent works have shown promising results using prompt learning with pretrained vision-language models such as CLIP, which align natural language prompts with visual features in a shared embedding space. However, existing approaches still fail to handle the visual complexity of interaction, including (1) intra-class visual diversity, where instances of the same verb appear in diverse poses and contexts, and (2) inter-class visual entanglement, where distinct verbs yield visually similar patterns. To address these challenges, we propose VDRP, a framework for Visual Diversity and Region-aware Prompt learning. First, we introduce a visual diversity-aware prompt learning strategy that injects group-wise visual variance into the context embedding. We further apply Gaussian perturbation to encourage the prompts to capture diverse visual variations of a verb. Second, we retrieve region-specific concepts from the human, object, and union regions. These are used to augment the diversity-aware prompt embeddings, yielding region-aware prompts that enhance verb-level discrimination. Experiments on the HICO-DET benchmark demonstrate that our method achieves state-of-the-art performance under four zero-shot evaluation settings, effectively addressing both intra-class diversity and inter-class visual entanglement. Code is available at https://github.com/mlvlab/VDRP.
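As a rough illustration of the variance injection and Gaussian perturbation mentioned in this abstract, the sketch below perturbs a context embedding with noise scaled by the per-dimension variance of a group of visual features. The function names, the `scale` parameter, and the choice to scale noise by the standard deviation are assumptions made for this example; the released VDRP code may differ.

```python
import random

def groupwise_variance(features):
    """Per-dimension variance across a group of visual feature vectors."""
    n = len(features)
    dim = len(features[0])
    means = [sum(f[d] for f in features) / n for d in range(dim)]
    return [sum((f[d] - means[d]) ** 2 for f in features) / n for d in range(dim)]

def perturb_context(context, variance, scale=1.0, rng=None):
    """Add zero-mean Gaussian noise whose spread follows the group variance.

    Dimensions in which the group does not vary receive no noise.
    """
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    return [
        c + scale * rng.gauss(0.0, 1.0) * (v ** 0.5)
        for c, v in zip(context, variance)
    ]

# Hypothetical 2-d features for instances of one verb group.
group_features = [[1.0, 0.0], [3.0, 0.0]]
variance = groupwise_variance(group_features)
noisy_context = perturb_context([0.5, 0.5], variance)
```

Sampling several perturbed contexts per verb is one way to expose the prompt to a range of visual appearances during training.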
[18] EA3D: Online Open-World 3D Object Extraction from Streaming Videos
Xiaoyu Zhou, Jingqi Wang, Yuang Jia, Yongtao Wang, Deqing Sun, Ming-Hsuan Yang
🧩 TL;DR
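This paper presents ExtractAnything3D (EA3D), a unified online framework for open-world 3D object extraction that performs geometric reconstruction and holistic scene understanding simultaneously. It supports a range of downstream tasks by dynamically integrating vision-language knowledge with online Gaussian feature updates.
📘 Detailed Summary
Motivation: Current 3D scene understanding methods are limited by offline-collected multi-view data or pre-constructed 3D geometry. This work targets simultaneous geometric reconstruction and semantic understanding in online, dynamic settings, filling the gap left by the lack of an open-world 3D object extraction framework.
Method: EA3D interprets each frame of a video stream with vision-language and 2D vision foundation encoders and integrates the resulting object-level knowledge into a Gaussian feature map via a feed-forward online update strategy. It combines iterative visual odometry estimation with incremental feature updates, and uses a recurrent joint optimization module to direct the model's attention to regions of interest.
Result: Extensive experiments across diverse benchmarks and tasks show that EA3D is effective for photo-realistic rendering, semantic and instance segmentation, 3D bounding box and semantic occupancy estimation, and 3D mesh generation, supporting the claim that the framework is unified and efficient.
Conclusion: The work establishes a unified and efficient framework for online 3D reconstruction and holistic scene understanding, provides a basis for a broad range of downstream tasks, and advances open-world 3D scene understanding.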
📄 Abstract
Current 3D scene understanding methods are limited by offline-collected multi-view data or pre-constructed 3D geometry. In this paper, we present ExtractAnything3D (EA3D), a unified online framework for open-world 3D object extraction that enables simultaneous geometric reconstruction and holistic scene understanding. Given a streaming video, EA3D dynamically interprets each frame using vision-language and 2D vision foundation encoders to extract object-level knowledge. This knowledge is integrated and embedded into a Gaussian feature map via a feed-forward online update strategy. We then iteratively estimate visual odometry from historical frames and incrementally update online Gaussian features with new observations. A recurrent joint optimization module directs the model's attention to regions of interest, simultaneously enhancing both geometric reconstruction and semantic understanding. Extensive experiments across diverse benchmarks and tasks, including photo-realistic rendering, semantic and instance segmentation, 3D bounding box and semantic occupancy estimation, and 3D mesh generation, demonstrate the effectiveness of EA3D. Our method establishes a unified and efficient framework for joint online 3D reconstruction and holistic scene understanding, enabling a broad range of downstream tasks.
[19] Target-Guided Bayesian Flow Networks for Quantitatively Constrained CAD Generation
Wenhao Zheng, Chenwei Sun, Wenbo Zhang, Jiancheng Lv, Xianggen Liu
🧩 TL;DR
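This paper proposes the Target-Guided Bayesian Flow Network (TGBFN), a new framework for quantitatively constrained CAD generation. It is the first to handle the multi-modality of CAD sequences in a unified continuous and differentiable parameter space, and it achieves state-of-the-art performance on single-condition and multi-condition constrained generation tasks.
📘 Detailed Summary
Motivation: Deep generative models have made notable progress in image and audio generation, but generative modeling for multi-modal data such as parametric CAD sequences lags behind. The main challenges are long-range constraints and parameter sensitivity, which limit the quality and controllability of CAD generation.
Method: TGBFN handles multi-modality by mapping the discrete commands and continuous parameters of CAD sequences into a unified continuous and differentiable parameter space. It also penetrates the parameter update kernel and introduces a guided Bayesian flow to control CAD properties.
Result: Extensive experiments on a newly constructed dataset for quantitatively constrained CAD generation show that TGBFN achieves state-of-the-art performance on both single-condition and multi-condition constrained generation tasks, producing high-fidelity, condition-aware CAD sequences.
Conclusion: The study demonstrates the effectiveness of handling multi-modal CAD data in a unified continuous space, offers a new technical route for parametric CAD generation, and shows the potential of guided Bayesian flows for controlling generated properties in automated engineering design.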
📄 Abstract
Deep generative models, such as diffusion models, have shown promising progress in image generation and audio generation via simplified continuity assumptions. However, the development of generative modeling techniques for generating multi-modal data, such as parametric CAD sequences, still lags behind due to the challenges in addressing long-range constraints and parameter sensitivity. In this work, we propose a novel framework for quantitatively constrained CAD generation, termed Target-Guided Bayesian Flow Network (TGBFN). For the first time, TGBFN handles the multi-modality of CAD sequences (i.e., discrete commands and continuous parameters) in a unified continuous and differentiable parameter space rather than in the discrete data space. In addition, TGBFN penetrates the parameter update kernel and introduces a guided Bayesian flow to control the CAD properties. To evaluate TGBFN, we construct a new dataset for quantitatively constrained CAD generation. Extensive comparisons across single-condition and multi-condition constrained generation tasks demonstrate that TGBFN achieves state-of-the-art performance in generating high-fidelity, condition-aware CAD sequences. The code is available at https://github.com/scu-zwh/TGBFN.
[20] MMEdge: Accelerating On-device Multimodal Inference via Pipelined Sensing and Encoding
Runxi Huang, Mingxuan Yu, Mingyu Tsoi, Xiaomin Ouyang
🧩 TL;DR
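MMEdge is an on-device multimodal inference framework based on pipelined sensing and encoding. Through fine-grained incremental computation and cross-modal optimization, it significantly reduces end-to-end latency while maintaining high accuracy.
📘 Detailed Summary
Motivation: Real-time multimodal inference on edge devices is essential for applications such as autonomous driving, human-computer interaction, and mobile health, yet existing methods often overlook the tight coupling between sensing dynamics and model execution as well as the complex inter-modality dependencies.
Method: MMEdge decomposes the whole inference process into a sequence of fine-grained sensing and encoding units and processes data incrementally as it arrives. A lightweight temporal aggregation module captures rich temporal dynamics across pipelined units. An adaptive multimodal configuration optimizer dynamically selects the best configuration, and a cross-modal speculative skipping mechanism bypasses future units of slower modalities once prediction confidence is sufficient.
Result: Evaluation on two public multimodal datasets and deployment on a real-world unmanned aerial vehicle (UAV) multimodal testbed show that MMEdge significantly reduces end-to-end latency while maintaining high task accuracy across various system and data dynamics.
Conclusion: The study shows that a pipelined sensing design can effectively decouple sensing from computation in multimodal inference, provides a practical solution for real-time multimodal applications on resource-constrained edge devices, and highlights the potential of cross-modal optimization and early decision-making.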
📄 Abstract
Real-time multimodal inference on resource-constrained edge devices is essential for applications such as autonomous driving, human-computer interaction, and mobile health. However, prior work often overlooks the tight coupling between sensing dynamics and model execution, as well as the complex inter-modality dependencies. In this paper, we propose MMEdge, a new on-device multimodal inference framework based on pipelined sensing and encoding. Instead of waiting for complete sensor inputs, MMEdge decomposes the entire inference process into a sequence of fine-grained sensing and encoding units, allowing computation to proceed incrementally as data arrive. MMEdge also introduces a lightweight but effective temporal aggregation module that captures rich temporal dynamics across different pipelined units to maintain accuracy. Such a pipelined design also opens up opportunities for fine-grained cross-modal optimization and early decision-making during inference. To further enhance system performance under resource variability and input data complexity, MMEdge incorporates an adaptive multimodal configuration optimizer that dynamically selects optimal sensing and model configurations for each modality under latency constraints, and a cross-modal speculative skipping mechanism that bypasses future units of slower modalities when early predictions reach sufficient confidence. We evaluate MMEdge using two public multimodal datasets and deploy it on a real-world unmanned aerial vehicle (UAV)-based multimodal testbed. The results show that MMEdge significantly reduces end-to-end latency while maintaining high task accuracy across various system and data dynamics.
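The idea of skipping slower-modality units once an early prediction is confident can be sketched as a simple loop. This is a simplified illustration under stated assumptions: each unit is represented by a ready-made class-probability dictionary, aggregation is a plain average, and the `threshold` value is arbitrary. None of these choices are confirmed by the abstract.

```python
def aggregate(unit_probs):
    """Average class probabilities over all units processed so far."""
    labels = unit_probs[0].keys()
    count = len(unit_probs)
    return {label: sum(p[label] for p in unit_probs) / count for label in labels}

def infer_with_skipping(fast_units, slow_units, threshold=0.9):
    """Process units step by step and skip remaining slow units when confident.

    Returns (predicted_label, number_of_slow_units_actually_processed).
    """
    seen = []
    slow_used = 0
    for step, fast in enumerate(fast_units):
        seen.append(fast)
        avg = aggregate(seen)
        label = max(avg, key=avg.get)
        if avg[label] >= threshold:
            return label, slow_used  # confident: bypass the remaining slow units
        if step < len(slow_units):
            seen.append(slow_units[step])
            slow_used += 1
    avg = aggregate(seen)
    return max(avg, key=avg.get), slow_used

# Hypothetical per-unit outputs for a fast and a slow modality.
fast = [{"a": 0.6, "b": 0.4}, {"a": 0.99, "b": 0.01}]
slow = [{"a": 0.7, "b": 0.3}, {"a": 0.8, "b": 0.2}]
early = infer_with_skipping(fast, slow, threshold=0.75)
full = infer_with_skipping(fast, slow, threshold=0.99)
```

With the lower threshold the second slow unit is never processed, which is where the latency saving would come from in a real pipeline.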
[21] $D^2GS$: Dense Depth Regularization for LiDAR-free Urban Scene Reconstruction
Kejing Xia, Jidong Jia, Ke Jin, Yucai Bai, Li Sun, Dacheng Tao, Youjian Zhang
🧩 TL;DR
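This paper proposes D²GS, a LiDAR-free urban scene reconstruction framework. Using multi-view depth prediction and diffusion priors, it obtains geometry priors that are denser and more accurate than LiDAR, and on the Waymo dataset it outperforms state-of-the-art methods, including those that use ground-truth LiDAR data.
📘 Detailed Summary
Motivation: Current urban scene reconstruction methods typically rely on multimodal sensor inputs such as LiDAR and images, but acquiring accurate LiDAR data is challenging: it requires precise spatiotemporal calibration, and mounting LiDAR and cameras at different locations introduces reprojection errors. This work aims to remove these limitations with a high-quality LiDAR-free reconstruction method.
Method: D²GS first initializes a dense point cloud by back-projecting multi-view metric depth predictions and applies a Progressive Pruning strategy to improve global consistency. It then jointly refines Gaussian geometry and predicted depth via a Depth Enhancer, which uses diffusion priors from a depth foundation model to enhance the depth maps rendered by the Gaussians. Finally, it constrains the shape and normal attributes of Gaussians within road regions to improve ground geometry accuracy.
Result: Extensive experiments on the Waymo dataset show that the method consistently outperforms state-of-the-art methods and produces more accurate geometry even when compared with methods that use ground-truth LiDAR data.
Conclusion: The study shows that high-quality urban scene reconstruction is feasible without LiDAR sensors: combining depth prediction with diffusion priors can yield geometry priors that are denser and more accurate than actual LiDAR data, giving a more practical option for scene reconstruction in autonomous driving.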
📄 Abstract
Recently, Gaussian Splatting (GS) has shown great potential for urban scene reconstruction in the field of autonomous driving. However, current urban scene reconstruction methods often depend on multimodal sensors as inputs, i.e., LiDAR and images. Though the geometry prior provided by LiDAR point clouds can largely mitigate ill-posedness in reconstruction, acquiring such accurate LiDAR data is still challenging in practice: i) precise spatiotemporal calibration between LiDAR and other sensors is required, as they may not capture data simultaneously; ii) reprojection errors arise from spatial misalignment when LiDAR and cameras are mounted at different locations. To avoid the difficulty of acquiring accurate LiDAR depth, we propose $D^2GS$, a LiDAR-free urban scene reconstruction framework. In this work, we obtain geometry priors that are as effective as LiDAR while being denser and more accurate. First, we initialize a dense point cloud by back-projecting multi-view metric depth predictions. This point cloud is then optimized by a Progressive Pruning strategy to improve the global consistency. Second, we jointly refine Gaussian geometry and predicted dense metric depth via a Depth Enhancer. Specifically, we leverage diffusion priors from a depth foundation model to enhance the depth maps rendered by Gaussians. In turn, the enhanced depths provide stronger geometric constraints during Gaussian training. Finally, we improve the accuracy of ground geometry by constraining the shape and normal attributes of Gaussians within road regions. Extensive experiments on the Waymo dataset demonstrate that our method consistently outperforms state-of-the-art methods, producing more accurate geometry even when compared with those using ground-truth LiDAR data.
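The back-projection step named in this abstract follows the standard pinhole camera model. The sketch below lifts a metric depth map to camera-frame 3D points; the intrinsics (`fx`, `fy`, `cx`, `cy`) and the toy depth values are hypothetical, and a real pipeline would additionally apply each camera's pose to fuse views into one world-frame point cloud.

```python
def backproject(depth, fx, fy, cx, cy):
    """Lift a depth map (rows of metric depths) to camera-frame 3D points.

    Pinhole model: X = (u - cx) * d / fx, Y = (v - cy) * d / fy, Z = d.
    Pixels with non-positive depth are treated as invalid and skipped.
    """
    points = []
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            if d <= 0:
                continue
            points.append(((u - cx) * d / fx, (v - cy) * d / fy, d))
    return points

# Toy 2x2 depth map with one invalid pixel (hypothetical values).
depth_map = [[2.0, 0.0],
             [4.0, 2.0]]
cloud = backproject(depth_map, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
```

Because every valid pixel yields a point, predicted depth maps give a much denser initialization than sparse LiDAR returns, which is the property the abstract emphasizes.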
[22] Test-Time Adaptive Object Detection with Foundation Model
Yingjie Gao, Yanan Zhang, Zhi Cai, Di Huang
🧩 TL;DR
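This paper proposes the first foundation model-powered test-time adaptive object detection method. With multi-modal prompt tuning and an Instance Dynamic Memory module, it adapts across domains and categories without source data and consistently outperforms existing methods.
📘 Detailed Summary
Motivation: Existing test-time adaptive object detection methods rely heavily on source-derived statistical characteristics and assume that the source and target domains share an identical category space, which limits their use in real open-world scenarios. This work aims to remove the dependence on source data and overcome the traditional closed-set limitation.
Method: The paper proposes a Multi-modal Prompt-based Mean-Teacher framework that combines text and visual prompt tuning to adapt both the language and vision representation spaces to the test data in a parameter-efficient manner. A Test-time Warm-start strategy preserves the representation capability of the vision branch. An Instance Dynamic Memory module stores high-quality pseudo-labels, and Memory Enhancement and Memory Hallucination strategies use it to improve prediction quality.
Result: Extensive experiments on cross-corruption and cross-dataset benchmarks show that the method consistently outperforms previous state-of-the-art methods and can adapt to arbitrary cross-domain and cross-category target data.
Conclusion: The method is the first foundation model-powered approach to test-time adaptive object detection that needs no source data and drops the closed-set assumption. It offers an effective solution for object detection in open-world scenarios and shows the potential of foundation models for adaptive detection.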
📄 Abstract
In recent years, test-time adaptive object detection has attracted increasing attention due to its unique advantages in online domain adaptation, which aligns more closely with real-world application scenarios. However, existing approaches heavily rely on source-derived statistical characteristics while making the strong assumption that the source and target domains share an identical category space. In this paper, we propose the first foundation model-powered test-time adaptive object detection method that eliminates the need for source data entirely and overcomes traditional closed-set limitations. Specifically, we design a Multi-modal Prompt-based Mean-Teacher framework for vision-language detector-driven test-time adaptation, which incorporates text and visual prompt tuning to adapt both language and vision representation spaces on the test data in a parameter-efficient manner. Correspondingly, we propose a Test-time Warm-start strategy tailored for the visual prompts to effectively preserve the representation capability of the vision branch. Furthermore, to guarantee high-quality pseudo-labels in every test batch, we maintain an Instance Dynamic Memory (IDM) module that stores high-quality pseudo-labels from previous test samples, and propose two novel strategies, Memory Enhancement and Memory Hallucination, to leverage IDM's high-quality instances for enhancing original predictions and hallucinating images without available pseudo-labels, respectively. Extensive experiments on cross-corruption and cross-dataset benchmarks demonstrate that our method consistently outperforms previous state-of-the-art methods, and can adapt to arbitrary cross-domain and cross-category target data. Code is available at https://github.com/gaoyingjay/ttaod_foundation.
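Mean-teacher frameworks in general keep a teacher model whose parameters are an exponential moving average (EMA) of the student's. The sketch below shows that generic update on a dictionary of scalar parameters; the parameter name `prompt_w` and the momentum values are hypothetical, and the paper's own update schedule is not specified in the abstract.

```python
def ema_update(teacher, student, momentum=0.999):
    """Move each teacher parameter toward the student: t = m * t + (1 - m) * s."""
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }

# Toy scalar "prompt" parameters (hypothetical names and values).
teacher_params = {"prompt_w": 1.0}
student_params = {"prompt_w": 0.0}
teacher_params = ema_update(teacher_params, student_params, momentum=0.9)
```

A momentum close to 1 makes the teacher change slowly, which is what makes its pseudo-labels more stable than the student's raw predictions.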
[23] AI-Powered Early Detection of Critical Diseases using Image Processing and Audio Analysis
Manisha More, Kavya Bhand, Kaustubh Mukdam, Kavya Sharma, Manas Kawtikwar, Hridayansh Kaware, Prajwal Kavhar
🧩 TL;DR
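This paper presents a multimodal AI diagnostic framework that integrates image analysis, thermal imaging, and audio signal processing for early detection of skin cancer, vascular blood clots, and cardiopulmonary abnormalities. The system remains lightweight while achieving results competitive with state-of-the-art models, offering a path toward scalable, real-time AI-based pre-diagnostic healthcare.
📘 Detailed Summary
Motivation: Existing diagnostic techniques are often costly, invasive, and inaccessible in low-resource regions, while early diagnosis is critical for improving patient survival and reducing treatment costs. This study addresses that accessibility gap with a multimodal AI diagnostic approach that can be deployed in resource-constrained settings.
Method: The framework integrates three diagnostic modalities: a MobileNetV2 convolutional neural network fine-tuned on the ISIC 2019 dataset for skin lesion classification; a support vector machine (SVM) with handcrafted features for thermal clot detection; and Mel-Frequency Cepstral Coefficient (MFCC) feature extraction with a Random Forest classifier for heart and lung sound data.
Result: Skin cancer detection reaches 89.3% accuracy, 91.6% sensitivity, and 88.2% specificity. Thermal clot detection achieves 86.4% accuracy and an AUC of 0.89 on synthetic and clinical data. Cardiopulmonary analysis reaches 87.2% accuracy and 85.7% sensitivity. Compared with state-of-the-art models, the system achieves competitive performance while remaining lightweight.
Conclusion: The framework is a promising step toward scalable, real-time, and accessible pre-diagnostic healthcare, and is particularly suited to resource-constrained settings. The results suggest that integrating several diagnostic modalities can deliver accurate early detection while keeping the models lightweight.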
📄 Abstract
Early diagnosis of critical diseases can significantly improve patient survival and reduce treatment costs. However, existing diagnostic techniques are often costly, invasive, and inaccessible in low-resource regions. This paper presents a multimodal artificial intelligence (AI) diagnostic framework integrating image analysis, thermal imaging, and audio signal processing for early detection of three major health conditions: skin cancer, vascular blood clots, and cardiopulmonary abnormalities. A fine-tuned MobileNetV2 convolutional neural network was trained on the ISIC 2019 dataset for skin lesion classification, achieving 89.3% accuracy, 91.6% sensitivity, and 88.2% specificity. A support vector machine (SVM) with handcrafted features was employed for thermal clot detection, achieving 86.4% accuracy (AUC = 0.89) on synthetic and clinical data. For cardiopulmonary analysis, lung and heart sound datasets from PhysioNet and Pascal were processed using Mel-Frequency Cepstral Coefficients (MFCC) and classified via Random Forest, reaching 87.2% accuracy and 85.7% sensitivity. Comparative evaluation against state-of-the-art models demonstrates that the proposed system achieves competitive results while remaining lightweight and deployable on low-cost devices. The framework provides a promising step toward scalable, real-time, and accessible AI-based pre-diagnostic healthcare solutions.
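The accuracy, sensitivity, and specificity figures quoted in this abstract are standard confusion-matrix metrics. The sketch below shows how they are computed for a binary task; the labels are toy data invented for illustration and have no relation to the paper's datasets or results.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity (recall on positives), and specificity.

    Labels are 1 for the disease class and 0 for healthy; both classes are
    assumed to be present in y_true.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Toy labels: 10 positive and 10 negative cases (illustrative only).
y_true = [1] * 10 + [0] * 10
y_pred = [1] * 8 + [0] * 2 + [0] * 9 + [1] * 1
metrics = binary_metrics(y_true, y_pred)
```

For screening tools, sensitivity is usually the metric to watch, since a missed positive case is more costly than a false alarm.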
[24] DeepShield: Fortifying Deepfake Video Detection with Local and Global Forgery Analysis
Yinqi Cai, Jichang Li, Zhaolun Li, Weikai Chen, Rushi Lan, Xi Xie, Xiaonan Luo, Guanbin Li
🧩 TL;DR
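This paper proposes DeepShield, a deepfake detection framework built on the CLIP-ViT encoder that combines Local Patch Guidance with Global Forgery Diversification to balance local sensitivity and global generalization, markedly improving robustness to unseen forgery techniques.
📘 Detailed Summary
Motivation: Existing deepfake detectors perform well in in-domain scenarios but generalize poorly across diverse manipulation techniques because they rely too heavily on forgery-specific artifacts. This limits their practical value, especially against unknown forgery attacks.
Method: DeepShield builds on the CLIP-ViT encoder and has two core components. Local Patch Guidance (LPG) captures fine-grained inconsistencies through spatiotemporal artifact modeling and patch-wise supervision. Global Forgery Diversification (GFD) synthesizes diverse forgeries through domain feature augmentation, using domain-bridging and boundary-expanding feature generation, which mitigates overfitting and improves cross-domain adaptability.
Result: In cross-dataset and cross-manipulation evaluations, DeepShield outperforms state-of-the-art methods and shows superior robustness against unseen deepfake attacks.
Conclusion: The study shows that combining local and global analysis strategies improves the generalization of deepfake detection, offers guidance for building more robust detectors, and underlines the value of balancing local sensitivity with global generalization.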
📄 Abstract
Recent advances in deep generative models have made it easier to manipulate face videos, raising significant concerns about their potential misuse for fraud and misinformation. Existing detectors often perform well in in-domain scenarios but fail to generalize across diverse manipulation techniques due to their reliance on forgery-specific artifacts. In this work, we introduce DeepShield, a novel deepfake detection framework that balances local sensitivity and global generalization to improve robustness across unseen forgeries. DeepShield enhances the CLIP-ViT encoder through two key components: Local Patch Guidance (LPG) and Global Forgery Diversification (GFD). LPG applies spatiotemporal artifact modeling and patch-wise supervision to capture fine-grained inconsistencies often overlooked by global models. GFD introduces domain feature augmentation, leveraging domain-bridging and boundary-expanding feature generation to synthesize diverse forgeries, mitigating overfitting and enhancing cross-domain adaptability. Through the integration of novel local and global analysis for deepfake detection, DeepShield outperforms state-of-the-art methods in cross-dataset and cross-manipulation evaluations, achieving superior robustness against unseen deepfake attacks.
[25] VADB: A Large-Scale Video Aesthetic Database with Professional and Multi-Dimensional Annotations
Qianqian Qiao, DanDan Zheng, Yihang Bo, Bao Peng, Heng Huang, Longteng Jiang, Huaye Wang, Jingdong Chen, Jun Zhou, Xin Jin
🧩 TL;DR
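This study introduces VADB, the largest video aesthetic assessment database, containing 10,490 diverse videos, and develops VADB-Net, a dual-modal pre-training framework whose two-stage training strategy improves video aesthetic assessment performance.
📘 Detailed Summary
Motivation: Video aesthetic assessment is an important area of multimedia computing, but its progress is limited by the lack of standardized datasets and robust models. The temporal dynamics of video and the challenges of multimodal fusion hinder direct application of image-based methods.
Method: The paper proposes VADB-Net, a dual-modal pre-training framework with a two-stage training strategy, designed to handle multimodal information fusion in video aesthetic assessment.
Result: VADB-Net outperforms existing video quality assessment models in scoring tasks and supports downstream video aesthetic assessment tasks.
Conclusion: The work provides both a large-scale standardized video aesthetic database and an effective dual-modal pre-training framework, supplying infrastructure and methodology for the field.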
📄 Abstract
Video aesthetic assessment, a vital area in multimedia computing, integrates computer vision with human cognition. Its progress is limited by the lack of standardized datasets and robust models, as the temporal dynamics of video and multimodal fusion challenges hinder direct application of image-based methods. This study introduces VADB, the largest video aesthetic database with 10,490 diverse videos annotated by 37 professionals across multiple aesthetic dimensions, including overall and attribute-specific aesthetic scores, rich language comments and objective tags. We propose VADB-Net, a dual-modal pre-training framework with a two-stage training strategy, which outperforms existing video quality assessment models in scoring tasks and supports downstream video aesthetic assessment tasks. The dataset and source code are available at https://github.com/BestiVictory/VADB.
[26] LangHOPS: Language Grounded Hierarchical Open-Vocabulary Part Segmentation
Yang Miao, Jan-Nico Zaech, Xi Wang, Fabien Despinoy, Danda Pani Paudel, Luc Van Gool
🧩 TL;DR
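LangHOPS is the first multimodal large language model (MLLM) based framework for open-vocabulary object-part instance segmentation. It grounds object-part hierarchies in language space to jointly detect and segment objects and parts, and achieves state-of-the-art results on multiple benchmarks.
📘 Detailed Summary
Motivation: Existing methods rely on heuristic or learnable visual grouping strategies and struggle to parse open-vocabulary object-part hierarchies. A new approach is needed that uses language knowledge to link concepts at multiple granularities.
Method: The framework integrates an MLLM into the object-part parsing pipeline, uses its rich knowledge and reasoning capabilities to build object-part hierarchies in language space, and applies an MLLM-driven part query refinement strategy.
Result: On PartImageNet, LangHOPS surpasses previous methods by 5.5% Average Precision (AP) in the in-domain setting and 4.8% in the cross-dataset setting for object-part instance segmentation. In zero-shot semantic segmentation on ADE20K, it gains 2.5% mIOU on unseen object parts.
Conclusion: The study shows that a language-grounded hierarchy is effective for object-part parsing and that the knowledge and reasoning capabilities of MLLMs are key to linking multi-granularity concepts, pointing to a new direction for open-vocabulary hierarchical visual understanding.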
📄 Abstract
We propose LangHOPS, the first Multimodal Large Language Model (MLLM) based framework for open-vocabulary object-part instance segmentation. Given an image, LangHOPS can jointly detect and segment hierarchical object and part instances from open-vocabulary candidate categories. Unlike prior approaches that rely on heuristic or learnable visual grouping, our approach grounds object-part hierarchies in language space. It integrates the MLLM into the object-part parsing pipeline to leverage its rich knowledge and reasoning capabilities, and link multi-granularity concepts within the hierarchies. We evaluate LangHOPS across multiple challenging scenarios, including in-domain and cross-dataset object-part instance segmentation, and zero-shot semantic segmentation. LangHOPS achieves state-of-the-art results, surpassing previous methods by 5.5% Average Precision (AP) (in-domain) and 4.8% (cross-dataset) on the PartImageNet dataset and by 2.5% mIOU on unseen object parts in ADE20K (zero-shot). Ablation studies further validate the effectiveness of the language-grounded hierarchy and the MLLM-driven part query refinement strategy. The code will be released.
[27] Prototype-Driven Adaptation for Few-Shot Object Detection
Yushen Huang, Zhiming Wang
🧩 TL;DR
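This paper proposes Prototype-Driven Alignment (PDA), a lightweight plug-in metric head that provides a prototype-based "second opinion" complementary to the linear classifier, addressing base-class bias and unstable calibration in few-shot object detection. On the VOC FSOD and GFSOD benchmarks it consistently improves novel-class performance while preserving base-class performance and adding negligible computational overhead.
📘 Detailed Summary
Motivation: Few-shot object detection (FSOD) often suffers from base-class bias and unstable calibration when only a few novel-class samples are available. Existing methods struggle to balance base-class and novel-class performance with limited novel data, and linear classifiers tend to produce biased predictions in the few-shot regime.
Method: Within the DeFRCN framework, PDA maintains support-only prototypes in a learnable, identity-initialized projection space and optionally applies prototype-conditioned RoI alignment to reduce geometric mismatch. Prototypes are adapted via exponential moving average (EMA) updates on labeled foreground RoIs without introducing class-specific parameters, and are frozen at inference to ensure protocol compliance. PDA uses a best-of-K matching scheme to capture intra-class multi-modality and temperature-scaled fusion to combine metric similarities with detector logits.
Result: Experiments on the VOC FSOD and GFSOD benchmarks show that PDA consistently improves novel-class detection with minimal impact on base classes and negligible computational overhead.
Conclusion: PDA shows that prototype-based metric learning is an effective complement to the linear classifier and offers a stable solution for few-shot object detection, improving novel-class detection while preserving base-class performance.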
📄 Abstract
Few-shot object detection (FSOD) often suffers from base-class bias and unstable calibration when only a few novel samples are available. We propose Prototype-Driven Alignment (PDA), a lightweight, plug-in metric head for DeFRCN that provides a prototype-based "second opinion" complementary to the linear classifier. PDA maintains support-only prototypes in a learnable identity-initialized projection space and optionally applies prototype-conditioned RoI alignment to reduce geometric mismatch. During fine-tuning, prototypes can be adapted via exponential moving average (EMA) updates on labeled foreground RoIs (without introducing class-specific parameters) and are frozen at inference to ensure strict protocol compliance. PDA employs a best-of-K matching scheme to capture intra-class multi-modality and temperature-scaled fusion to combine metric similarities with detector logits. Experiments on VOC FSOD and GFSOD benchmarks show that PDA consistently improves novel-class performance with minimal impact on base classes and negligible computational overhead.
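Best-of-K matching and temperature-scaled fusion, as named in this abstract, can be sketched as follows. Each class holds several prototypes, the metric score is the highest cosine similarity among them, and that score is divided by a temperature before being added to the detector logit. The additive form, the `alpha` weight, and the `tau` value are assumptions made for this example; the abstract does not give the exact fusion formula.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def best_of_k(feature, prototypes):
    """Score a RoI feature by its closest prototype among a class's K prototypes."""
    return max(cosine(feature, proto) for proto in prototypes)

def fuse(detector_logits, feature, class_prototypes, tau=0.1, alpha=0.5):
    """Add temperature-scaled metric similarity to each detector logit.

    alpha and tau are hypothetical hyperparameters for this sketch.
    """
    return {
        cls: logit + alpha * best_of_k(feature, class_prototypes[cls]) / tau
        for cls, logit in detector_logits.items()
    }

# Toy example: the linear head slightly prefers "dog", the prototypes say "cat".
roi_feature = [1.0, 0.0]
prototype_bank = {"cat": [[1.0, 0.0], [0.0, 1.0]], "dog": [[0.0, 1.0]]}
fused = fuse({"cat": 0.0, "dog": 1.0}, roi_feature, prototype_bank)
```

Keeping several prototypes per class lets one class cover visually distinct modes, since only the best-matching prototype contributes to the score.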
[28] StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA
Yuhang Hu, Zhenyu Yang, Shihan Wang, Shengsheng Qian, Bin Wen, Fan Yang, Tingting Gao, Changsheng Xu
🧩 TL;DR
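This paper introduces StreamingCoT, the first dataset designed for temporally evolving reasoning in streaming video question answering (VideoQA) and multimodal Chain-of-Thought (CoT) tasks. Its dynamic hierarchical annotation architecture and explicit reasoning chain generation paradigm address the limitations of existing VideoQA datasets in temporal dynamics and reasoning transparency.
📘 Detailed Summary
Motivation: Current VideoQA datasets have two key limitations: static annotation mechanisms cannot capture how answers evolve over a temporal video stream, and the absence of explicit reasoning annotations restricts model interpretability and logical deduction. Both hold back multimodal models in streaming video applications that require temporal understanding and complex reasoning.
Method: The work proposes a dynamic hierarchical annotation architecture that generates per-second dense descriptions, builds temporally dependent semantic segments through similarity fusion, and pairs them with question-answer sets constrained by temporal evolution patterns. It further proposes an explicit reasoning chain generation paradigm that extracts spatiotemporal objects via keyframe semantic alignment, uses large language models to derive reasoning paths based on object state transitions, and ensures logical coherence through human verification.
Result: StreamingCoT provides the first VideoQA dataset with temporally evolving reasoning, including dynamic hierarchical annotations and explicit reasoning chains, and serves as a benchmark and toolkit for research on streaming video understanding, complex temporal reasoning, and multimodal inference.
Conclusion: The dataset and its construction toolkit lay a foundation for streaming video understanding, complex temporal reasoning, and multimodal inference, and should help advance temporal understanding and logical reasoning in VideoQA models as well as research on interpretable multimodal reasoning.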
📄 Abstract
The rapid growth of streaming video applications demands multimodal models with enhanced capabilities for temporal dynamics understanding and complex reasoning. However, current Video Question Answering (VideoQA) datasets suffer from two critical limitations: 1) static annotation mechanisms fail to capture the evolving nature of answers in temporal video streams, and 2) the absence of explicit reasoning process annotations restricts model interpretability and logical deduction capabilities. To address these challenges, we introduce StreamingCoT, the first dataset explicitly designed for temporally evolving reasoning in streaming VideoQA and multimodal Chain-of-Thought (CoT) tasks. Our framework first establishes a dynamic hierarchical annotation architecture that generates per-second dense descriptions and constructs temporally dependent semantic segments through similarity fusion, paired with question-answer sets constrained by temporal evolution patterns. We further propose an explicit reasoning chain generation paradigm that extracts spatiotemporal objects via keyframe semantic alignment, derives object state transition-based reasoning paths using large language models, and ensures logical coherence through human-verified validation. This dataset establishes a foundation for advancing research in streaming video understanding, complex temporal reasoning, and multimodal inference. Our StreamingCoT and its construction toolkit can be accessed at https://github.com/Fleeting-hyh/StreamingCoT.
[29] Instance-Level Composed Image Retrieval
Bill Psomas, George Retsinas, Nikos Efthymiadis, Panagiotis Filntisis, Yannis Avrithis, Petros Maragos, Ondrej Chum, Giorgos Tolias
🧩 TL;DR
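This paper introduces the i-CIR evaluation dataset and BASIC, a training-free method, to address the scarcity of high-quality data in composed image retrieval (CIR). BASIC separately estimates the similarity of the visual and textual queries to each image and combines them by late fusion, setting a new state of the art on several CIR datasets.
📘 Detailed Summary
Motivation: Progress in composed image retrieval is held back by the lack of high-quality training and evaluation data. Existing datasets focus on semantic-level class definitions and lack evaluation data for retrieving a specific object instance.
Method: BASIC is a training-free method that separately computes query-image-to-image and query-text-to-image similarities, applies late fusion to upweight images that satisfy both queries, and adds a set of simple, intuitive components that improve each individual similarity estimate.
Result: BASIC sets a new state of the art on the proposed i-CIR dataset and also achieves the best results on existing CIR datasets that follow a semantic-level class definition.
Conclusion: The study shows that a training-free approach built on pre-trained vision-and-language models can be effective for composed image retrieval, and that i-CIR provides a standardized benchmark for instance-level retrieval research.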
📄 Abstract
The progress of composed image retrieval (CIR), a popular research direction in image retrieval, where a combined visual and textual query is used, is held back by the absence of high-quality training and evaluation data. We introduce a new evaluation dataset, i-CIR, which, unlike existing datasets, focuses on an instance-level class definition. The goal is to retrieve images that contain the same particular object as the visual query, presented under a variety of modifications defined by textual queries. Its design and curation process keep the dataset compact to facilitate future research, while maintaining its challenge (comparable to retrieval among more than 40M random distractors) through a semi-automated selection of hard negatives. To overcome the challenge of obtaining clean, diverse, and suitable training data, we leverage pre-trained vision-and-language models (VLMs) in a training-free approach called BASIC. The method separately estimates query-image-to-image and query-text-to-image similarities, performing late fusion to upweight images that satisfy both queries, while down-weighting those that exhibit high similarity with only one of the two. Each individual similarity is further improved by a set of components that are simple and intuitive. BASIC sets a new state of the art not only on i-CIR but also on existing CIR datasets that follow a semantic-level class definition. Project page: https://vrg.fel.cvut.cz/icir/.
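One simple way to realize late fusion that rewards images matching both queries is to rescale each similarity list to [0, 1] and multiply them, so that a high score on only one query yields a low product. The sketch below uses that multiplicative rule as an assumption; it is not confirmed to be the fusion used by BASIC, and the similarity values are invented.

```python
def minmax(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def late_fusion(image_sims, text_sims):
    """Multiply normalized similarities so both queries must agree."""
    return [a * b for a, b in zip(minmax(image_sims), minmax(text_sims))]

# Three database images (hypothetical similarities):
#   index 0 matches only the image query,
#   index 1 matches both queries,
#   index 2 matches only the text query.
image_sims = [0.9, 0.8, 0.1]
text_sims = [0.1, 0.7, 0.9]
fused_scores = late_fusion(image_sims, text_sims)
best_index = max(range(len(fused_scores)), key=fused_scores.__getitem__)
```

An additive fusion would rank index 0 and index 2 nearly as high as index 1, which is the failure mode the multiplicative form avoids.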
[30] SPADE: Sparsity Adaptive Depth Estimator for Zero-Shot, Real-Time, Monocular Depth Estimation in Underwater Environments
Hongjie Zhang, Gideon Billings, Stefan B. Williams
🧩 TL;DR
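This paper presents SPADE (SParsity Adaptive Depth Estimator), a monocular depth estimation pipeline that combines a pre-trained relative depth estimator with sparse depth priors to produce dense, metric-scale depth maps, with the aim of improving the autonomy and safety of underwater infrastructure inspection.
📘 Detailed Summary
Motivation: Underwater infrastructure inspection currently relies on human divers or remotely operated vehicles, which face perceptual and operational challenges around complex structures and in turbid water. Better spatial awareness for underwater vehicles is needed to reduce piloting risk and increase autonomy.
Method: The approach has two stages: it first scales the relative depth map using the sparse depth points, then refines the final metric prediction with the proposed Cascade Conv-Deformable Transformer blocks.
Result: The method improves accuracy and generalization over state-of-the-art baselines and runs at over 15 FPS on embedded hardware, making it a practical option for underwater inspection tasks.
Conclusion: By combining relative depth estimation with sparse depth priors, SPADE achieves efficient and accurate metric depth estimation and supports the development of autonomous underwater inspection systems.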
📄 Abstract
Underwater infrastructure requires frequent inspection and maintenance due to harsh marine conditions. Current reliance on human divers or remotely operated vehicles is limited by perceptual and operational challenges, especially around complex structures or in turbid water. Enhancing the spatial awareness of underwater vehicles is key to reducing piloting risks and enabling greater autonomy. To address these challenges, we present SPADE: SParsity Adaptive Depth Estimator, a monocular depth estimation pipeline that combines a pre-trained relative depth estimator with sparse depth priors to produce dense, metric-scale depth maps. Our two-stage approach first scales the relative depth map with the sparse depth points, then refines the final metric prediction with our proposed Cascade Conv-Deformable Transformer blocks. Our approach achieves improved accuracy and generalisation over state-of-the-art baselines and runs efficiently at over 15 FPS on embedded hardware, promising to support practical underwater inspection and intervention. This work has been submitted to IEEE Journal of Oceanic Engineering Special Issue of AUV 2026.
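The first stage described in this abstract, scaling a relative depth map with sparse metric points, is commonly done with a least-squares scale-and-shift fit. The sketch below shows that generic fit; whether SPADE uses exactly this global alignment is an assumption, and the sample values are invented.

```python
def fit_scale_shift(relative, metric):
    """Least-squares fit of metric ~ scale * relative + shift at sparse points."""
    n = len(relative)
    mean_r = sum(relative) / n
    mean_m = sum(metric) / n
    cov = sum((r - mean_r) * (m - mean_m) for r, m in zip(relative, metric))
    var = sum((r - mean_r) ** 2 for r in relative)
    scale = cov / var  # assumes the sparse relative depths are not all equal
    shift = mean_m - scale * mean_r
    return scale, shift

def apply_scale_shift(relative_map, scale, shift):
    """Convert a dense relative depth map to metric depth."""
    return [[scale * r + shift for r in row] for row in relative_map]

# Relative depths sampled at three sparse points and their metric depths.
sparse_relative = [0.0, 1.0, 2.0]
sparse_metric = [1.0, 3.0, 5.0]
scale, shift = fit_scale_shift(sparse_relative, sparse_metric)
metric_map = apply_scale_shift([[0.5, 1.5]], scale, shift)
```

A single global scale and shift cannot correct local errors in the relative prediction, which is presumably why the pipeline follows it with a learned refinement stage.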
[31] Hawk: Leveraging Spatial Context for Faster Autoregressive Text-to-Image Generation
Zhi-Kai Chen, Jun-Peng Jiang, Han-Jia Ye, De-Chuan Zhan
🧩 TL;DR
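This paper proposes Hawk, which uses the spatial structure of images to guide the speculative model toward more accurate and efficient predictions, achieving a 1.71x speedup over standard autoregressive models while preserving image fidelity and diversity.
📘 Detailed Summary
Motivation: Autoregressive image generation models can produce high-fidelity images but suffer from slow inference because of their inherently sequential, token-by-token decoding. Speculative decoding has shown promise for accelerating text generation, but its application to image generation remains underexplored. The main challenges are a much larger sampling space, which makes it hard to align the draft and target model outputs, and inadequate use of the two-dimensional spatial structure of images for modeling local dependencies.
Method: Hawk uses the spatial structure of images to guide the speculative model toward more accurate and efficient predictions. By modeling local dependencies better, it addresses the large sampling space and the underuse of spatial structure that limit conventional speculative decoding for images.
Result: Experiments on multiple text-to-image benchmarks show a 1.71x speedup over standard autoregressive models while preserving image fidelity and diversity, so the acceleration does not come at the cost of generation quality.
Conclusion: Hawk extends speculative decoding to image generation and obtains a substantial speedup through spatial guidance, suggesting that spatially aware speculative decoding is an effective way to improve inference efficiency in autoregressive image generation.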
📄 Abstract
Autoregressive (AR) image generation models are capable of producing high-fidelity images but often suffer from slow inference due to their inherently sequential, token-by-token decoding process. Speculative decoding, which employs a lightweight draft model to approximate the output of a larger AR model, has shown promise in accelerating text generation without compromising quality. However, its application to image generation remains largely underexplored. The challenges stem from a significantly larger sampling space, which complicates the alignment between the draft and target model outputs, coupled with the inadequate use of the two-dimensional spatial structure inherent in images, thereby limiting the modeling of local dependencies. To overcome these challenges, we introduce Hawk, a new approach that harnesses the spatial structure of images to guide the speculative model toward more accurate and efficient predictions. Experimental results on multiple text-to-image benchmarks demonstrate a 1.71x speedup over standard AR models, while preserving both image fidelity and diversity.
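For background, the sketch below shows the standard speculative sampling acceptance rule for a single drafted token: accept it with probability min(1, p/q), otherwise resample from the normalized positive part of p - q. It illustrates generic speculative decoding only; Hawk's spatial guidance is not modeled, and the token names and probabilities are invented.

```python
import random

def speculative_step(draft_token, p_target, q_draft, rng):
    """Accept or replace one drafted token; returns (token, accepted).

    p_target and q_draft map tokens to probabilities under the target and
    draft models for the same position.
    """
    ratio = p_target[draft_token] / q_draft[draft_token]
    if rng.random() < min(1.0, ratio):
        return draft_token, True
    # Rejected: sample from the residual distribution max(0, p - q).
    residual = {
        tok: max(0.0, p_target[tok] - q_draft.get(tok, 0.0)) for tok in p_target
    }
    total = sum(residual.values())
    threshold = rng.random() * total
    running = 0.0
    for tok, weight in residual.items():
        running += weight
        if threshold < running:
            return tok, False
    # Fallback for floating-point edge cases: take the largest residual.
    return max(residual, key=residual.get), False

p = {"a": 0.7, "b": 0.3}  # target model (hypothetical)
q = {"a": 0.5, "b": 0.5}  # draft model (hypothetical)
accepted_result = speculative_step("a", p, q, random.Random(0))
```

This rule keeps the output distribution equal to the target model's, so the speedup depends entirely on how often drafted tokens are accepted.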
[32] Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks
Xu Zheng, Zihao Dongfang, Lutao Jiang, Boyuan Zheng, Yulong Guo, Zhenquan Zhang, Giuliano Albanese, Runyi Yang, Mengjiao Ma, Zixin Zhang, Chenfei Liao, Dingcheng Zhen, Yuanhuiyi Lyu, Yuqian Fu, Bin Ren, Linfeng Zhang, Danda Pani Paudel, Nicu Sebe, Luc Van Gool, Xuming Hu
🧩 TL;DR
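This survey systematically reviews multimodal spatial reasoning tasks, covers recent progress in multimodal large language models, and introduces open benchmarks for evaluation, laying a foundation for this emerging field.
📘 Detailed Summary
Motivation: Humans understand space through multimodal observations such as vision and sound, and large multimodal models show promise on spatial reasoning tasks. However, systematic reviews and publicly available benchmarks for these models remain limited, so a comprehensive evaluation framework is needed.
Method: The survey categorizes progress in multimodal large language models for spatial reasoning, focusing on post-training techniques, explainability, and architecture. It spans classical 2D tasks, visual question answering and grounding in 3D space, and embodied AI topics such as vision-language navigation and action models.
Result: The work introduces open benchmarks for multimodal spatial reasoning, covering spatial relationship reasoning, scene and layout understanding, and visual question answering and grounding in 3D space, and it considers the contribution of emerging modalities such as audio and egocentric video.
Conclusion: The survey lays a foundation for multimodal spatial reasoning and offers insights into the field's development. Its systematic categorization and benchmarks promote standardization and comparability of research.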
📄 Abstract
Humans possess spatial reasoning abilities that enable them to understand spaces through multimodal observations, such as vision and sound. Large multimodal reasoning models extend these abilities by learning to perceive and reason, showing promising performance across diverse spatial tasks. However, systematic reviews and publicly available benchmarks for these models remain limited. In this survey, we provide a comprehensive review of multimodal spatial reasoning tasks with large models, categorizing recent progress in multimodal large language models (MLLMs) and introducing open benchmarks for evaluation. We begin by outlining general spatial reasoning, focusing on post-training techniques, explainability, and architecture. Beyond classical 2D tasks, we examine spatial relationship reasoning, scene and layout understanding, as well as visual question answering and grounding in 3D space. We also review advances in embodied AI, including vision-language navigation and action models. Additionally, we consider emerging modalities such as audio and egocentric video, which contribute to novel spatial understanding through new sensors. We believe this survey establishes a solid foundation and offers insights into the growing field of multimodal spatial reasoning. Updated information about this survey, codes and implementation of the open benchmarks can be found at https://github.com/zhengxuJosh/Awesome-Spatial-Reasoning.
cs.CL [Back]
[33] Dingtalk DeepResearch: A Unified Multi Agent Framework for Adaptive Intelligence in Enterprise Environments
Mengyuan Chen, Chengjun Dai, Xinyang Dong, Chengzhe Feng, Kewei Fu, Jianshe Li, Zhihan Peng, Yongqi Tong, Junshao Zhang, Hong Zhu
🧩 TL;DR
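This paper presents Dingtalk DeepResearch, a unified multi-agent intelligence framework for enterprise environments that supports deep research, heterogeneous table reasoning, and multimodal report generation. The framework targets the challenges of complex information processing and analysis in enterprise settings.
📘 Detailed Summary
Motivation: Enterprise environments face complex information processing, difficulty integrating heterogeneous multi-source data, and limited deep-analysis capability. A framework that handles diverse enterprise intelligence tasks in a unified way is needed to improve decision efficiency and analytical capability.
Method: The work proposes a unified multi-agent intelligence framework that integrates deep research, heterogeneous table reasoning, and multimodal report generation, with agents collaborating to handle complex information analysis tasks in enterprise environments.
Result: In real enterprise environments, the framework showed strong information processing capability: it integrates multi-source data, performs deep analysis, and generates high-quality multimodal research reports, improving the efficiency of enterprise decision-making.
Conclusion: The work demonstrates the effectiveness of multi-agent frameworks in enterprise intelligence applications and offers a new approach to building a unified enterprise analysis platform, with practical value and potential for wider adoption.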
📄 Abstract
We present Dingtalk DeepResearch, a unified multi-agent intelligence framework for real-world enterprise environments, delivering deep research, heterogeneous table reasoning, and multimodal report generation.
[34] Seeing Through the MiRAGE: Evaluating Multimodal Retrieval Augmented Generation
Alexander Martin, William Walden, Reno Kriz, Dengjia Zhang, Kate Sanders, Eugene Yang, Chihsheng Jin, Benjamin Van Durme
🧩 TL;DR
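This paper proposes MiRAGE, an evaluation framework for multimodal retrieval-augmented generation (RAG). It introduces the InfoF1 and CiteF1 metrics to address the limitations of existing text-centric evaluation methods in multimodal, reasoning-intensive settings.
📘 Detailed Summary
Motivation: As audiovisual media becomes an important source of online information, RAG systems need to integrate multimodal information into generation. Existing RAG evaluation methods mainly target text-centric scenarios and cannot verify information against multimodal sources, which limits their use in multimodal, reasoning-intensive settings.
Method: MiRAGE takes a claim-centric approach, with InfoF1 evaluating factuality and information coverage and CiteF1 measuring citation support and completeness. The authors also introduce automatic variants of MiRAGE and of three prominent TextRAG metrics (ACLE, ARGUE, and RAGAS) for comparison.
Result: Experiments show that MiRAGE, when applied by humans, aligns strongly with extrinsic quality judgments. The comparison with text-centric methods exposes their limitations in multimodal settings and lays the groundwork for automatic evaluation.
Conclusion: The work provides a systematic framework for multimodal RAG evaluation, releases open-source implementations, and outlines how to assess multimodal RAG systems.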
📄 Abstract
We introduce MiRAGE, an evaluation framework for retrieval-augmented generation (RAG) from multimodal sources. As audiovisual media becomes a prevalent source of information online, it is essential for RAG systems to integrate information from these sources into generation. However, existing evaluations for RAG are text-centric, limiting their applicability to multimodal, reasoning-intensive settings because they don't verify information against sources. MiRAGE is a claim-centric approach to multimodal RAG evaluation, consisting of InfoF1, evaluating factuality and information coverage, and CiteF1, measuring citation support and completeness. We show that MiRAGE, when applied by humans, strongly aligns with extrinsic quality judgments. We additionally introduce automatic variants of MiRAGE and three prominent TextRAG metrics (ACLE, ARGUE, and RAGAS), demonstrating the limitations of text-centric work and laying the groundwork for automatic evaluation. We release open-source implementations and outline how to assess multimodal RAG.
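As a rough illustration of the claim-centric idea, the sketch below computes a generic precision/recall/F1 over sets of claims. It assumes claims have already been extracted and judged against the sources; the function name `claim_f1` and the exact-match criterion are hypothetical and are not MiRAGE's actual definitions of InfoF1 or CiteF1.

```python
def claim_f1(predicted_claims, reference_claims):
    """Generic set-based F1 over claims (illustrative, not the paper's exact metric)."""
    predicted = set(predicted_claims)
    reference = set(reference_claims)
    if not predicted or not reference:
        return 0.0
    supported = predicted & reference
    if not supported:
        return 0.0
    precision = len(supported) / len(predicted)  # share of generated claims that are supported
    recall = len(supported) / len(reference)     # share of reference claims that are covered
    return 2 * precision * recall / (precision + recall)
```

Precision here plays the role of factuality and recall the role of coverage, which is one plausible reading of how a single F1 score can capture both.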
[35] Teaching Sarcasm: Few-Shot Multimodal Sarcasm Detection via Distillation to a Parameter-Efficient Student
Soumyadeep Jana, Sanasam Ranbir Singh
🧩 TL;DR
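This paper proposes PEKD, a framework that enhances parameter-efficient fine-tuning (PEFT) by distilling knowledge from an expert model trained on large-scale sarcasm data, addressing the lack of supervision in few-shot multimodal sarcasm detection. An entropy-aware gating mechanism dynamically adjusts the distillation strength, and the framework substantially improves PEFT methods in few-shot settings.
📘 Detailed Summary
Motivation: Multimodal sarcasm detection is difficult in low-resource settings because scarce annotated data makes subtle image-text contradictions hard to learn. Existing PEFT methods reduce overfitting but fall short of optimal performance because few-shot data provides limited supervision.
Method: PEKD is a unified framework that enhances PEFT methods via knowledge distillation from an expert teacher model trained on large-scale sarcasm data. An entropy-aware gating mechanism adjusts the distillation strength according to the teacher's confidence, mitigating unreliable teacher signals.
Result: Experiments on two public datasets show that PEKD enables PEFT methods to outperform both prior parameter-efficient approaches and large multimodal models in the few-shot scenario.
Conclusion: The framework is modular and adaptable to a wide range of multimodal models and tasks. The results show that distillation-enhanced PEFT is effective for low-resource multimodal understanding and suggest a route to model optimization under resource constraints.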
📄 Abstract
Multimodal sarcasm detection is challenging, especially in low-resource settings where subtle image-text contradictions are hard to learn due to scarce annotated data, which hinders the model's performance. Parameter-efficient fine-tuning (PEFT) methods like adapters, LoRA, and prompt tuning reduce overfitting but struggle to reach optimal performance due to limited supervision from few-shot data. We propose PEKD, a unified framework that enhances PEFT methods via distillation from an expert model trained on large-scale sarcasm data, which acts as the teacher. To mitigate unreliable signals from the teacher, we introduce an entropy-aware gating mechanism that dynamically adjusts the distillation strength based on teacher confidence. Experiments on two public datasets demonstrate that our PEKD framework enables PEFT methods to outperform both prior parameter-efficient approaches and large multimodal models, achieving strong results in the few-shot scenario. The framework is modular and adaptable to a wide range of multimodal models and tasks.
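The entropy-aware gate can be sketched as follows. This is a minimal illustration under assumptions: the abstract only says the distillation strength depends on teacher confidence, so the specific gate (one minus normalized entropy), the loss form (cross-entropy plus gated KL), and all function names are hypothetical rather than the PEKD formulation.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_gate(teacher_probs):
    """Weight in [0, 1]: 1 for a fully confident teacher, 0 for a uniform one."""
    k = len(teacher_probs)  # assumes at least two classes
    h = -sum(p * math.log(p) for p in teacher_probs if p > 0)
    return 1.0 - h / math.log(k)

def gated_distillation_loss(student_logits, teacher_logits, label):
    student = softmax(student_logits)
    teacher = softmax(teacher_logits)
    ce = -math.log(student[label])  # supervised term on the few-shot label
    kl = sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)
    return ce + entropy_gate(teacher) * kl  # uncertain teachers contribute less
```

With this gate, a teacher that is unsure about an example (high entropy) has its distillation term scaled toward zero, so the student falls back on the labeled signal.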
[36] CLASS-IT: Conversational and Lecture-Aligned Small-Scale Instruction Tuning for BabyLMs
Luca Capone, Alessandro Bondielli, Alessandro Lenci
🧩 TL;DR
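This study examines whether small language models can benefit from instruction tuning. Instruction tuning yields small but consistent gains in fine-tuning scenarios, but the improvements do not consistently transfer to zero-shot tasks, revealing a trade-off between interaction-focused adaptation and broad linguistic generalization.
📘 Detailed Summary
Motivation: The study asks whether small language models can benefit from instruction tuning and how human-inspired learning strategies can improve performance under limited resources, with particular attention to the applicability and limits of instruction tuning for low-parameter models.
Method: The study compares conversational and question-answering instruction tuning datasets, applied in either a merged or a sequential curriculum, using decoder-only models with 100M and 140M parameters, and evaluates them in both fine-tuning and zero-shot settings.
Result: Instruction tuning produces small but consistent gains in fine-tuning scenarios (SuperGLUE), with sequential curricula outperforming merged data. These improvements do not consistently transfer to zero-shot tasks (BLiMP, EWoK, WUGs, and others), indicating a trade-off.
Conclusion: The study highlights both the potential and the constraints of applying human-inspired learning strategies to low-resource language models, and points toward hybrid, curriculum-based approaches for improving generalization under ecological training limits.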
📄 Abstract
This work investigates whether small-scale LMs can benefit from instruction tuning. We compare conversational and question-answering instruction tuning datasets, applied either in a merged or sequential curriculum, using decoder-only models with 100M and 140M parameters. Evaluation spans both fine-tuning (SuperGLUE) and zero-shot (BLiMP, EWoK, WUGs, entity tracking, and psycholinguistic correlation) settings. Results show that instruction tuning yields small but consistent gains in fine-tuning scenarios, with sequential curricula outperforming merged data; however, improvements do not consistently transfer to zero-shot tasks, suggesting a trade-off between interaction-focused adaptation and broad linguistic generalization. These results highlight both the potential and the constraints of adapting human-inspired learning strategies to low-resource LMs, and point toward hybrid, curriculum-based approaches for enhancing generalization under ecological training limits.
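The difference between the two curricula can be sketched in a few lines. This is an illustrative assumption about data ordering only: the function names are hypothetical, and the order of the sequential stages (conversational first, then question answering) is a guess, not something the abstract specifies.

```python
import random

def merged_curriculum(conversational, qa, seed=0):
    """One training stage with both datasets shuffled together."""
    combined = list(conversational) + list(qa)
    random.Random(seed).shuffle(combined)
    return [combined]

def sequential_curriculum(conversational, qa):
    """Two training stages, one dataset after the other (stage order is an assumption)."""
    return [list(conversational), list(qa)]
```

Each function returns a list of stages; a training loop would iterate over the stages in order and over the examples within each stage.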
[37] Seeing, Signing, and Saying: A Vision-Language Model-Assisted Pipeline for Sign Language Data Acquisition and Curation from Social Media
Shakib Yazdani, Yasser Hamidullah, Cristina España-Bonet, Josef van Genabith
🧩 TL;DR
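This study proposes the first framework that uses vision-language models to automatically annotate and filter sign language translation datasets, substantially reducing reliance on manual annotation. It builds TikTok-SL-8, a large-scale dataset covering eight sign languages, as a scalable source of weakly supervised pretraining data for sign language translation models.
📘 Detailed Summary
Motivation: Existing sign language translation datasets are generally limited in scale and multilingual coverage, and their reliance on expert annotation and controlled recording environments makes them costly. The potential of vision-language models for sign language data acquisition has not been fully explored.
Method: The proposed VLM-based automated annotation and filtering framework has four steps: face visibility detection, sign activity recognition, text extraction from video, and validation of video-text alignment. It is applied to TikTok videos in eight sign languages and to the German Sign Language portion of YouTube-SL-25.
Result: The work produces the multilingual TikTok-SL-8 dataset and evaluates two off-the-shelf sign language translation models on the filtered German and American Sign Language data, establishing baselines for automatically extracted, noisy data.
Conclusion: The work enables scalable, weakly supervised pretraining for sign language translation and facilitates the acquisition of sign language data from social media.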
📄 Abstract
Most existing sign language translation (SLT) datasets are limited in scale, lack multilingual coverage, and are costly to curate due to their reliance on expert annotation and controlled recording setups. Recently, Vision Language Models (VLMs) have demonstrated strong capabilities as evaluators and real-time assistants. Despite these advancements, their potential remains untapped in the context of sign language dataset acquisition. To bridge this gap, we introduce the first automated annotation and filtering framework that utilizes VLMs to reduce reliance on manual effort while preserving data quality. Our method is applied to TikTok videos across eight sign languages and to the already curated YouTube-SL-25 dataset in German Sign Language for the purpose of additional evaluation. Our VLM-based pipeline includes face visibility detection, sign activity recognition, text extraction from video content, and a judgment step to validate alignment between video and text, implementing generic filtering, annotation and validation steps. Using the resulting corpus, TikTok-SL-8, we assess the performance of two off-the-shelf SLT models on our filtered dataset for German and American Sign Languages, with the goal of establishing baselines and evaluating the robustness of recent models on automatically extracted, slightly noisy data. Our work enables scalable, weakly supervised pretraining for SLT and facilitates data acquisition from social media.
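The four-step pipeline can be pictured as a chain of checks applied to each video. The sketch below is a structural illustration only: in the paper each step is carried out by a VLM, whereas here every step is a placeholder callable, and the `curate` function, step names, and video fields are all hypothetical.

```python
def curate(videos, steps):
    """Keep videos that pass every step in order; record which step rejected the rest."""
    kept, rejected = [], {}
    for video in videos:
        for name, check in steps:
            if not check(video):
                rejected[video["id"]] = name
                break
        else:
            kept.append(video)  # no step rejected this video
    return kept, rejected

# Placeholder checks standing in for VLM calls (assumed fields, not the paper's schema).
steps = [
    ("face_visibility", lambda v: v["face"]),
    ("sign_activity", lambda v: v["signing"]),
    ("text_extraction", lambda v: bool(v["text"])),
    ("alignment_judgment", lambda v: v["aligned"]),
]
```

Recording the rejecting step makes it easy to see which stage of the pipeline discards the most data.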
[38] A Critical Study of Automatic Evaluation in Sign Language Translation
Shakib Yazdani, Yasser Hamidullah, Cristina España-Bonet, Eleftherios Avramidis, Josef van Genabith
🧩 TL;DR
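This study systematically analyzes the limitations of text-based metrics for evaluating sign language translation. Conventional lexical overlap metrics fall short, while LLM-based evaluators capture semantic equivalence better but are biased toward LLM-generated paraphrases, which motivates multimodal evaluation frameworks.
📘 Detailed Summary
Motivation: Sign language translation evaluation currently relies on purely text-based metrics such as BLEU and ROUGE, yet it is unclear how reliably these metrics capture translation quality. A systematic analysis of their limitations and reliability is needed.
Method: The study analyzes six metrics: the conventional metrics BLEU, chrF, ROUGE, and BLEURT, and the LLM-based evaluators G-Eval and GEMBA zero-shot direct assessment. It assesses their consistency and robustness under three controlled conditions: paraphrasing, hallucinations in model outputs, and variations in sentence length.
Result: Lexical overlap metrics show clear limitations. LLM-based evaluators better capture semantic equivalence that conventional metrics often miss, but they are biased toward LLM-generated paraphrases. All metrics can detect hallucinations, but BLEU is overly sensitive, whereas BLEURT and LLM-based evaluators are comparatively lenient toward subtle cases.
Conclusion: The study exposes fundamental limitations of text-only metrics for sign language translation evaluation and underscores the need for multimodal evaluation frameworks that go beyond text-based metrics to assess translation outputs more holistically.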
📄 Abstract
Automatic evaluation metrics are crucial for advancing sign language translation (SLT). Current SLT evaluation metrics, such as BLEU and ROUGE, are only text-based, and it remains unclear to what extent text-based metrics can reliably capture the quality of SLT outputs. To address this gap, we investigate the limitations of text-based SLT evaluation metrics by analyzing six metrics, including BLEU, chrF, and ROUGE, as well as BLEURT on the one hand, and large language model (LLM)-based evaluators such as G-Eval and GEMBA zero-shot direct assessment on the other hand. Specifically, we assess the consistency and robustness of these metrics under three controlled conditions: paraphrasing, hallucinations in model outputs, and variations in sentence length. Our analysis highlights the limitations of lexical overlap metrics and demonstrates that while LLM-based evaluators better capture semantic equivalence often missed by conventional metrics, they can also exhibit bias toward LLM-paraphrased translations. Moreover, although all metrics are able to detect hallucinations, BLEU tends to be overly sensitive, whereas BLEURT and LLM-based evaluators are comparatively lenient toward subtle cases. This motivates the need for multimodal evaluation frameworks that extend beyond text-based metrics to enable a more holistic assessment of SLT outputs.
[39] EHR-R1: A Reasoning-Enhanced Foundational Language Model for Electronic Health Record Analysis
Yusheng Liao, Chaoyi Wu, Junwei Liu, Shuyang Jiang, Pengcheng Qiu, Haowen Wang, Yun Yue, Shuai Zhen, Jian Wang, Qianrui Fan, Jinjie Gu, Ya Zhang, Yanfeng Wang, Yu Wang, Weidi Xie
🧩 TL;DR
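This study presents EHR-Ins, a large-scale reasoning instruction dataset for electronic health records (EHRs); EHR-R1, a series of reasoning-enhanced large language models; and EHR-Bench, an evaluation benchmark. Together they substantially improve the reasoning ability and clinical relevance of LLMs in EHR analysis.
📘 Detailed Summary
Motivation: Current large language models are limited in EHR analysis by narrow task coverage and a lack of EHR-oriented reasoning capabilities, which hinders their effective use in clinical decision-making.
Method: A thinking-graph-driven framework generates high-quality reasoning data at scale. The EHR-R1 model series, with up to 72B parameters, is trained through a multi-stage paradigm of domain adaptation, reasoning enhancement, and reinforcement learning. EHR-Bench, a benchmark spanning 42 tasks, is built for evaluation.
Result: EHR-R1 surpasses GPT-4o by over 30 points on MIMIC-Bench and achieves a 10% higher zero-shot AUROC on EHRSHOT, outperforming state-of-the-art commercial and open-source LLMs including DeepSeek-V3 and GPT-4o.
Conclusion: EHR-Ins, EHR-R1, and EHR-Bench together advance more reliable and clinically relevant EHR analysis, offering a systematic approach to acquiring domain knowledge and strengthening diverse reasoning capabilities in medical AI systems.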
📄 Abstract
Electronic Health Records (EHRs) contain rich yet complex information, and their automated analysis is critical for clinical decision-making. Despite recent advances of large language models (LLMs) in clinical workflows, their ability to analyze EHRs remains limited due to narrow task coverage and lack of EHR-oriented reasoning capabilities. This paper aims to bridge this gap. Specifically, we present EHR-Ins, a large-scale, comprehensive EHR reasoning instruction dataset, comprising 300k high-quality reasoning cases and 4M non-reasoning cases across 42 distinct EHR tasks. Its core innovation is a thinking-graph-driven framework that enables the generation of high-quality reasoning data at scale. Based on it, we develop EHR-R1, a series of reasoning-enhanced LLMs with up to 72B parameters tailored for EHR analysis. Through a multi-stage training paradigm, including domain adaptation, reasoning enhancement, and reinforcement learning, EHR-R1 systematically acquires domain knowledge and diverse reasoning capabilities, enabling accurate and robust EHR analysis. Lastly, we introduce EHR-Bench, a new benchmark curated from MIMIC-IV, spanning 42 tasks, to comprehensively assess reasoning and prediction across EHR scenarios. In experiments, we show that the resulting EHR-R1 consistently outperforms state-of-the-art commercial and open-source LLMs (including DeepSeek-V3 and GPT-4o), surpassing GPT-4o by over 30 points on MIMIC-Bench and achieving a 10% higher zero-shot AUROC on EHRSHOT. Collectively, EHR-Ins, EHR-R1, and EHR-Bench significantly advance the development of more reliable and clinically relevant EHR analysis.
[40] PairUni: Pairwise Training for Unified Multimodal Language Models
Jiani Zheng, Zhiyang Teng, Xiangtai Li, Anran Wang, Yu Tian, Kunpeng Qiu, Ye Tian, Haochen Wang, Zhuochen Wang
🧩 TL;DR
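This paper proposes PairUni, a framework that reorganizes vision-language model data into understanding-generation pairs, together with the Pair-GPRO optimization method, to balance understanding and generation tasks during reinforcement learning of unified vision-language models.
📘 Detailed Summary
Motivation: Unified vision-language models must perform both understanding and generation within a single architecture, but these tasks rely on heterogeneous data and supervision, which makes it difficult to balance them during reinforcement learning.
Method: PairUni first uses GPT-o3 to augment single-task data, generating captions for understanding samples and question-answer pairs for generation samples to form aligned pairs from the same instance. It also retrieves a semantically related understanding example for each generation sample to form retrieved pairs. Pair-GPRO then uses a similarity score to modulate the advantage, strengthening learning from well-aligned examples and reducing task interference.
Result: The authors curate PairUG, a high-quality dataset of 16K understanding-generation pairs, and evaluate on the Janus-Pro unified vision-language models. The approach achieves balanced improvements on various unified vision-language models and outperforms strong reinforcement learning baselines.
Conclusion: The work shows that data reorganization combined with pair-aware optimization can mitigate multi-task conflicts in unified vision-language models, offering a new approach to more balanced multimodal model training.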
📄 Abstract
Unified vision-language models (UVLMs) must perform both understanding and generation within a single architecture, but these tasks rely on heterogeneous data and supervision, making it difficult to balance them during reinforcement learning (RL). We propose PairUni, a unified framework that reorganizes data into understanding-generation (UG) pairs and aligns optimization accordingly. We first use GPT-o3 to augment single-task data, generating captions for understanding samples and question-answer (QA) pairs for generation samples, forming aligned pairs from the same instance. Additionally, for each generation sample, we retrieve a semantically related understanding example to form a retrieved pair, linking different but related data points. These paired structures expose cross-task semantic correspondences and support consistent policy learning. To leverage this structure, we present Pair-GPRO, a pair-aware variant based on Group Relative Policy Optimization. It assigns a similarity score to each pair to modulate the advantage, strengthening learning from well-aligned examples and reducing task interference. We curate a high-quality dataset of 16K UG pairs named PairUG for RL fine-tuning and evaluate PairUni on the powerful Janus-Pro UVLMs. Our approach achieves balanced improvements on various UVLMs, outperforming strong UVLM RL baselines. Code: https://github.com/Haochen-Wang409/PairUni
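The idea of modulating a group-relative advantage by a pair similarity score can be sketched as below. This is an assumption-laden illustration: the normalization (reward minus group mean, divided by the population standard deviation) follows the usual description of Group Relative Policy Optimization, while the multiplicative use of the similarity score and both function names are hypothetical, not the Pair-GPRO definition.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples in its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def pair_modulated_advantages(rewards, similarity):
    """Scale the group-relative advantages by a pair similarity score in [0, 1]."""
    return [similarity * a for a in group_relative_advantages(rewards)]
```

Under this reading, a poorly aligned pair (low similarity) contributes a weaker learning signal than a well-aligned one with the same rewards.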
[41] Interpreting LLMs as Credit Risk Classifiers: Do Their Feature Explanations Align with Classical ML?
Saeed AlMarri, Kristof Juhasz, Mathieu Ravaut, Gautier Marti, Hamdan Al Ahbabi, Ibrahim Elfadel
🧩 TL;DR
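This study systematically compares zero-shot LLM classifiers with LightGBM on a financial risk prediction task. LLMs show limitations on structured financial data and their self-explanations are unreliable, which underscores the need for explainability audits and human oversight when deploying LLMs in risk-sensitive financial environments.
📘 Detailed Summary
Motivation: Large language models are widely explored as flexible, zero-shot alternatives for classification, but their suitability for structured tabular data remains underexplored, especially in high-stakes applications such as financial risk assessment. Their actual performance and reliability on such tasks need systematic evaluation.
Method: The study compares zero-shot LLM classifiers with LightGBM, a state-of-the-art gradient-boosting model, on a real-world loan default prediction task. It analyzes feature attributions with SHAP and assesses the reliability of LLM-generated self-explanations, covering predictive performance, feature importance, and explanation consistency.
Result: LLMs can identify key financial risk indicators, but their feature importance rankings diverge notably from LightGBM's, and their self-explanations often fail to align with empirical SHAP attributions. This indicates clear limitations of LLMs as standalone models for structured financial risk prediction.
Conclusion: Deploying LLMs in risk-sensitive financial environments calls for explainability audits, baseline comparisons with interpretable models, and human oversight. The findings inform the trustworthy deployment of AI systems in finance and draw attention to the reliability of LLM self-explanations.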
📄 Abstract
Large Language Models (LLMs) are increasingly explored as flexible alternatives to classical machine learning models for classification tasks through zero-shot prompting. However, their suitability for structured tabular data remains underexplored, especially in high-stakes financial applications such as financial risk assessment. This study conducts a systematic comparison between zero-shot LLM-based classifiers and LightGBM, a state-of-the-art gradient-boosting model, on a real-world loan default prediction task. We evaluate their predictive performance, analyze feature attributions using SHAP, and assess the reliability of LLM-generated self-explanations. While LLMs are able to identify key financial risk indicators, their feature importance rankings diverge notably from LightGBM, and their self-explanations often fail to align with empirical SHAP attributions. These findings highlight the limitations of LLMs as standalone models for structured financial risk prediction and raise concerns about the trustworthiness of their self-generated explanations. Our results underscore the need for explainability audits, baseline comparisons with interpretable models, and human-in-the-loop oversight when deploying LLMs in risk-sensitive financial environments.
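One simple way to quantify how far two feature importance rankings diverge is Spearman's rank correlation. The sketch below is an illustration only: the abstract does not say which agreement measure the authors used, and the function name and the feature names in the usage example are hypothetical.

```python
def spearman_from_rankings(ranking_a, ranking_b):
    """Spearman correlation between two orderings of the same features (no ties)."""
    if len(set(ranking_a)) != len(ranking_a) or set(ranking_a) != set(ranking_b):
        raise ValueError("rankings must contain the same unique features")
    n = len(ranking_a)
    if n < 2:
        raise ValueError("need at least two features")
    position_b = {feature: i for i, feature in enumerate(ranking_b)}
    d_squared = sum((i - position_b[f]) ** 2 for i, f in enumerate(ranking_a))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Hypothetical rankings, most important feature first.
shap_ranking = ["income", "debt_ratio", "loan_amount"]
llm_ranking = ["loan_amount", "debt_ratio", "income"]
agreement = spearman_from_rankings(shap_ranking, llm_ranking)
```

A value near 1 means the two rankings largely agree, a value near -1 means they are close to reversed.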
[42] DiagramEval: Evaluating LLM-Generated Diagrams via Graphs
Chumeng Liang, Jiaxuan You
🧩 TL;DR
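This paper proposes DiagramEval, a novel evaluation metric for assessing the quality of LLM-generated demonstration diagrams. The method represents diagrams as graphs and quantifies quality with two new groups of metrics: node alignment and path alignment.
📘 Detailed Summary
Motivation: Sufficiently discriminative and explainable metrics for evaluating LLM-generated diagrams are lacking. Although diagrams appear in papers as images, standard image generation models struggle to produce clear diagrams with well-defined structure. Having LLMs generate diagrams directly as SVGs is a promising direction, but it lacks an effective evaluation method.
Method: DiagramEval conceptualizes diagrams as graphs, treating text elements as nodes and their connections as directed edges. It proposes two new groups of metrics, node alignment and path alignment, to evaluate diagram quality through this graph representation.
Result: The study is the first to effectively evaluate diagrams produced by state-of-the-art LLMs on recent research literature, quantitatively demonstrating the validity of the proposed metrics. Their enhanced explainability offers valuable insight into the characteristics of LLM-generated diagrams.
Conclusion: DiagramEval provides a quantitative framework for evaluating LLM-generated diagrams. Its explainability supports a deeper understanding of such diagrams and makes it a useful evaluation tool for future diagram generation research.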
📄 Abstract
Diagrams play a central role in research papers for conveying ideas, yet they are often notoriously complex and labor-intensive to create. Although diagrams are presented as images, standard image generative models struggle to produce clear diagrams with well-defined structure. We argue that a promising direction is to generate demonstration diagrams directly in textual form as SVGs, which can leverage recent advances in large language models (LLMs). However, due to the complexity of components and the multimodal nature of diagrams, sufficiently discriminative and explainable metrics for evaluating the quality of LLM-generated diagrams remain lacking. In this paper, we propose DiagramEval, a novel evaluation metric designed to assess demonstration diagrams generated by LLMs. Specifically, DiagramEval conceptualizes diagrams as graphs, treating text elements as nodes and their connections as directed edges, and evaluates diagram quality using two new groups of metrics: node alignment and path alignment. For the first time, we effectively evaluate diagrams produced by state-of-the-art LLMs on recent research literature, quantitatively demonstrating the validity of our metrics. Furthermore, we show how the enhanced explainability of our proposed metrics offers valuable insights into the characteristics of LLM-generated diagrams. Code: https://github.com/ulab-uiuc/diagram-eval.
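The graph view of a diagram can be sketched with plain sets. This is a minimal illustration under assumptions: nodes are matched by exact text, path alignment is approximated by comparing the sets of ordered (source, target) pairs connected by a directed path, and both are scored with F1. None of these choices is confirmed as DiagramEval's actual definition, and the function names are hypothetical.

```python
def reachable_pairs(edges):
    """All ordered (source, target) pairs connected by a directed path."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, set()).add(dst)
    pairs = set()
    for start in adjacency:
        stack, seen = list(adjacency[start]), set()
        while stack:  # depth-first traversal from each source node
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            pairs.add((start, node))
            stack.extend(adjacency.get(node, ()))
    return pairs

def set_f1(predicted, reference):
    """F1 between two sets; 0.0 when either is empty or nothing matches."""
    hits = len(predicted & reference)
    if hits == 0:
        return 0.0
    precision, recall = hits / len(predicted), hits / len(reference)
    return 2 * precision * recall / (precision + recall)

def node_alignment(generated_nodes, reference_nodes):
    return set_f1(set(generated_nodes), set(reference_nodes))

def path_alignment(generated_edges, reference_edges):
    return set_f1(reachable_pairs(generated_edges), reachable_pairs(reference_edges))
```

Scoring reachability rather than individual edges means a generated diagram is not penalized twice for one missing node in the middle of a chain.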
cs.AI [Back]
[43] H3M-SSMoEs: Hypergraph-based Multimodal Learning with LLM Reasoning and Style-Structured Mixture of Experts
Peilin Tan, Liang Xie, Churan Zhi, Dian Tu, Chuanqi Shi
🧩 TL;DR
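This paper proposes H3M-SSMoEs, which combines a hypergraph-based multimodal architecture with LLM reasoning and a style-structured mixture of experts to jointly model the complex temporal dependencies, heterogeneous modalities, and dynamic inter-stock relationships in stock prediction. It reports significant gains in predictive accuracy and investment performance across several major stock markets.
📘 Detailed Summary
Motivation: Stock movement prediction is fundamentally challenging because of complex temporal dependencies, heterogeneous modalities, and dynamically evolving inter-stock relationships. Existing methods struggle to unify structural, semantic, and regime-adaptive modeling within a scalable framework.
Method: The H3M-SSMoEs architecture has three key components. A Multi-Context Multimodal Hypergraph hierarchically captures spatiotemporal dynamics and persistent inter-stock dependencies through local and global context hypergraphs, using shared cross-modal hyperedges and a Jensen-Shannon Divergence weighting mechanism. An LLM-enhanced reasoning module uses a frozen large language model with lightweight adapters to semantically fuse quantitative and textual modalities. A Style-Structured Mixture of Experts combines shared market experts with industry-specialized experts and uses learnable style vectors for regime-aware specialization.
Result: Extensive experiments on three major stock markets show that H3M-SSMoEs surpasses state-of-the-art methods in both predictive accuracy and investment performance while exhibiting effective risk control.
Conclusion: The work shows the value of unifying structural, semantic, and regime-adaptive modeling for complex financial prediction, provides a scalable framework for multimodal time-series forecasting, and supports the use of LLM-enhanced reasoning and style-structured mixtures of experts in finance.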
📄 Abstract
Stock movement prediction remains fundamentally challenging due to complex temporal dependencies, heterogeneous modalities, and dynamically evolving inter-stock relationships. Existing approaches often fail to unify structural, semantic, and regime-adaptive modeling within a scalable framework. This work introduces H3M-SSMoEs, a novel Hypergraph-based MultiModal architecture with LLM reasoning and Style-Structured Mixture of Experts, integrating three key innovations: (1) a Multi-Context Multimodal Hypergraph that hierarchically captures fine-grained spatiotemporal dynamics via a Local Context Hypergraph (LCH) and persistent inter-stock dependencies through a Global Context Hypergraph (GCH), employing shared cross-modal hyperedges and a Jensen-Shannon Divergence weighting mechanism for adaptive relational learning and cross-modal alignment; (2) an LLM-enhanced reasoning module, which leverages a frozen large language model with lightweight adapters to semantically fuse and align quantitative and textual modalities, enriching representations with domain-specific financial knowledge; and (3) a Style-Structured Mixture of Experts (SSMoEs) that combines shared market experts and industry-specialized experts, each parameterized by learnable style vectors enabling regime-aware specialization under sparse activation. Extensive experiments on three major stock markets demonstrate that H3M-SSMoEs surpasses state-of-the-art methods in both predictive accuracy and investment performance, while exhibiting effective risk control. Datasets, source code, and model weights are available at our GitHub repository: https://github.com/PeilinTime/H3M-SSMoEs.
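For readers unfamiliar with the Jensen-Shannon divergence named above, the sketch below computes the standard definition for two discrete distributions. It shows only the divergence itself; how H3M-SSMoEs turns it into hyperedge weights is not described in the abstract and is not modeled here.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by log(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Unlike the KL divergence, this quantity is symmetric and always finite, which makes it convenient as a similarity-style weight between two distributions.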
[44] KnowCoder-A1: Incentivizing Agentic Reasoning Capability with Outcome Supervision for KBQA
Zhuo Chen, Fei Wang, Zixuan Li, Zhao Zhang, Weiwei Ding, Chuanguang Yang, Yongjun Xu, Xiaolong Jin, Jiafeng Guo
🧩 TL;DR
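KnowCoder-A1 uses multi-stage curriculum reinforcement learning under outcome supervision to train an LLM to perform autonomous agentic reasoning over knowledge bases. It substantially improves KBQA performance, with up to an 11.1% relative improvement in the zero-shot setting.
📘 Detailed Summary
Motivation: Existing KBQA methods typically fine-tune LLMs with process supervision, which offers weak incentives for exploration and fails to strengthen agentic reasoning ability. A training method that incentivizes autonomous exploration is therefore needed.
Method: The method uses a multi-stage curriculum reinforcement learning framework. It first fine-tunes the LLM on high-quality trajectories obtained through outcome-based rejection sampling, then applies easy-to-hard reward schedules to alleviate the reward sparsity of outcome-only supervision.
Result: KnowCoder-A1 consistently outperforms prior methods on three mainstream datasets. On the zero-shot subset of GrailQA it achieves up to an 11.1% relative improvement while using only one-twelfth of the training data.
Conclusion: Curriculum reinforcement learning under outcome supervision can effectively cultivate agentic reasoning in LLMs, providing a more data-efficient training paradigm for knowledge base question answering and showing that strong reasoning is achievable under limited supervision.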
📄 Abstract
Knowledge Base Question Answering (KBQA) aims to answer natural-language questions over a structured Knowledge Base (KB). Recent work improves KBQA by adopting an agentic reasoning paradigm, in which Large Language Models (LLMs) iteratively decompose a question, generate its corresponding logical queries, and interact with the KB to derive the answer. However, these methods typically fine-tune LLMs on reasoning trajectories synthesized via process supervision, which offers weak incentives for exploration and thus fails to strengthen the agentic reasoning ability. In this paper, we propose KnowCoder-A1, an LLM that can autonomously perform agentic reasoning on KBs to obtain answers. To incentivize autonomous exploration, KnowCoder-A1 trains the LLM under outcome-only supervision via multi-stage curriculum reinforcement learning with an easy-to-hard curriculum. To establish foundational agentic capabilities, KnowCoder-A1 first fine-tunes the LLM on a small set of high-quality trajectories obtained through outcome-based rejection sampling. Then, to alleviate the reward sparsity inherent in outcome-only supervision, it applies multi-stage curriculum RL with reward schedules that progress from easy to hard. Trained with outcome-only supervision, KnowCoder-A1 exhibits powerful reasoning behaviors and consistently outperforms prior approaches across three mainstream datasets. Notably, on the zero-shot subset of GrailQA, KnowCoder-A1 achieves up to an 11.1% relative improvement while using only one-twelfth of the training data, demonstrating strong agentic reasoning capabilities.
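Outcome-based rejection sampling can be sketched as a filter that looks only at the final answer of each sampled trajectory and ignores the intermediate steps. The trajectory format, the exact-match acceptance rule, and the function name below are assumptions for illustration, not the KnowCoder-A1 implementation.

```python
def rejection_sample(trajectories, gold_answers):
    """Keep only trajectories whose final answer set matches the gold answers."""
    gold = set(gold_answers)
    return [t for t in trajectories if set(t["answers"]) == gold]

# Hypothetical sampled trajectories for one question.
sampled = [
    {"steps": ["query_1", "query_2"], "answers": ["Paris"]},
    {"steps": ["query_3"], "answers": ["Lyon"]},
    {"steps": ["query_4", "query_5", "query_6"], "answers": ["Paris"]},
]
accepted = rejection_sample(sampled, ["Paris"])
```

Because acceptance depends only on the outcome, trajectories that reach the right answer by different reasoning paths are all retained for fine-tuning.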
[45] Agentic Moderation: Multi-Agent Design for Safer Vision-Language Models
Juan Ren, Mark Dras, Usman Naseem
🧩 TL;DR
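This paper proposes Agentic Moderation, a framework that uses specialized agents to defend multimodal systems against jailbreak attacks, achieving context-aware and interpretable moderation through dynamically cooperating agents.
📘 Detailed Summary
Motivation: Existing safety alignment methods are usually applied as a static layer over inputs or outputs and provide only binary classifications (safe or unsafe). They lack dynamism, context awareness, and interpretability, and cannot effectively counter complex jailbreak attacks.
Method: Agentic Moderation comprises four dynamically cooperating agents (Shield, Responder, Evaluator, and Reflector) and provides model-agnostic safety defense for multimodal systems with context-aware and interpretable moderation.
Result: Experiments across five datasets and four representative large vision-language models show that the method reduces the Attack Success Rate by 7-19%, maintains a stable Non-Following Rate, and improves the Refusal Rate by 4-20%, achieving robust, interpretable, and well-balanced safety performance.
Conclusion: By harnessing the flexibility and reasoning capacity of agentic architectures, Agentic Moderation provides modular, scalable, and fine-grained safety enforcement, highlighting the broader potential of agentic systems as a foundation for automated safety governance.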
📄 Abstract
Agentic methods have emerged as a powerful and autonomous paradigm that enhances reasoning, collaboration, and adaptive control, enabling systems to coordinate and independently solve complex tasks. We extend this paradigm to safety alignment by introducing Agentic Moderation, a model-agnostic framework that leverages specialised agents to defend multimodal systems against jailbreak attacks. Unlike prior approaches that are applied as a static layer over inputs or outputs and provide only binary classifications (safe or unsafe), our method integrates dynamic, cooperative agents, including Shield, Responder, Evaluator, and Reflector, to achieve context-aware and interpretable moderation. Extensive experiments across five datasets and four representative Large Vision-Language Models (LVLMs) demonstrate that our approach reduces the Attack Success Rate (ASR) by 7-19%, maintains a stable Non-Following Rate (NF), and improves the Refusal Rate (RR) by 4-20%, achieving robust, interpretable, and well-balanced safety performance. By harnessing the flexibility and reasoning capacity of agentic architectures, Agentic Moderation provides modular, scalable, and fine-grained safety enforcement, highlighting the broader potential of agentic systems as a foundation for automated safety governance.
[46] ALDEN: Reinforcement Learning for Active Navigation and Evidence Gathering in Long Documents
Tianyu Yang, Terry Ruas, Yijun Tian, Jan Philip Wahle, Daniel Kurzawe, Bela Gipp
🧩 TL;DR
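This paper proposes ALDEN, a multi-turn reinforcement learning framework that fine-tunes vision-language models into agents that actively navigate long documents, addressing the limitations of conventional approaches to long-document understanding. It introduces a novel page-index fetch action and a visual-semantic anchoring mechanism, and achieves state-of-the-art performance on five long-document benchmarks.
📘 Detailed Summary
Motivation: Existing vision-language models perform poorly on long, complex documents that require analyzing and integrating information across multiple pages. Conventional methods rely on fixed reasoning templates or rigid pipelines, which force the model into a passive role and limit efficiency and generalization.
Method: ALDEN fine-tunes vision-language models with multi-turn reinforcement learning. It introduces a fetch action that accesses pages by index to exploit document structure, a rule-based cross-level reward for dense process supervision, and a visual-semantic anchoring mechanism that applies a dual-path KL-divergence constraint to stabilize visual and textual representations separately and address training instability.
Result: Trained on a corpus constructed from three open-source datasets, ALDEN achieves state-of-the-art performance on five long-document benchmarks, improving the accuracy and efficiency of long-document understanding.
Conclusion: ALDEN marks a step from passive document reading toward agents that autonomously navigate and reason across long documents, offering a robust path to more accurate and efficient long-document understanding.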
📄 Abstract
Vision-language models (VLMs) excel at interpreting text-rich images but struggle with long, visually complex documents that demand analysis and integration of information spread across multiple pages. Existing approaches typically rely on fixed reasoning templates or rigid pipelines, which force VLMs into a passive role and hinder both efficiency and generalization. We present Active Long-DocumEnt Navigation (ALDEN), a multi-turn reinforcement learning framework that fine-tunes VLMs as interactive agents capable of actively navigating long, visually rich documents. ALDEN introduces a novel fetch action that directly accesses the page by index, complementing the classic search action and better exploiting document structure. For dense process supervision and efficient training, we propose a rule-based cross-level reward that provides both turn- and token-level signals. To address the empirically observed training instability caused by numerous visual tokens from long documents, we further propose a visual-semantic anchoring mechanism that applies a dual-path KL-divergence constraint to stabilize visual and textual representations separately during training. Trained on a corpus constructed from three open-source datasets, ALDEN achieves state-of-the-art performance on five long-document benchmarks. Overall, ALDEN marks a step beyond passive document reading toward agents that autonomously navigate and reason across long, visually rich documents, offering a robust path to more accurate and efficient long-document understanding.
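The contrast between the two navigation actions can be sketched with a toy document environment. Everything here is an illustrative assumption: pages are plain strings rather than page images, `search` ranks pages by simple keyword overlap, and the class and method signatures are hypothetical rather than ALDEN's actual interface.

```python
class DocumentEnv:
    """Toy long-document environment exposing search and fetch actions."""

    def __init__(self, pages):
        self.pages = list(pages)

    def fetch(self, index):
        """Return the page at a 1-based index, or None if it is out of range."""
        if 1 <= index <= len(self.pages):
            return self.pages[index - 1]
        return None

    def search(self, query, top_k=1):
        """Return 1-based indices of the pages sharing the most words with the query."""
        terms = set(query.lower().split())
        scored = []
        for i, page in enumerate(self.pages, start=1):
            overlap = len(terms & set(page.lower().split()))
            if overlap:
                scored.append((overlap, i))
        scored.sort(key=lambda item: (-item[0], item[1]))
        return [i for _, i in scored[:top_k]]
```

Search helps when the agent knows what it is looking for but not where; fetch helps when the document itself points to a location, such as a table of contents entry or a "see page 12" reference.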