cs.CV
[1] MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation
Ye Tian, Ling Yang, Jiongfan Yang, Anran Wang, Yu Tian, Jiani Zheng, Haochen Wang, Zhiyang Teng, Zhuochen Wang, Yinjie Wang, Yunhai Tong, Mengdi Wang, Xiangtai Li
🧩 TL;DR
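This paper proposes MMaDA-Parallel, a parallel multimodal diffusion framework that uses parallel reinforcement learning to address error propagation in thinking-aware generation, achieving a 6.9% improvement in Output Alignment on the ParaBench benchmark over the state-of-the-art model Bagel.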
📘 Detailed Summary
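Motivation: Existing sequential, autoregressive approaches to thinking-aware generation suffer from error propagation, which degrades performance, particularly through poor alignment between the generated reasoning and the final image.
Method: The authors propose MMaDA-Parallel, a parallel multimodal diffusion framework that enables continuous, bidirectional interaction between text and images throughout the entire denoising trajectory. The model is trained with supervised finetuning and then optimized with Parallel Reinforcement Learning (ParaRL), which applies semantic rewards along the trajectory to strengthen cross-modal consistency.
Result: Experiments show that the model significantly improves cross-modal alignment and semantic consistency, achieving a 6.9% improvement in Output Alignment on ParaBench over the state-of-the-art model Bagel.
Conclusion: The work establishes a more robust paradigm for thinking-aware image synthesis. Its parallel interaction mechanism addresses the error propagation of conventional autoregressive approaches and advances multimodal generative models.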
📄 Abstract
While thinking-aware generation aims to improve performance on complex tasks, we identify a critical failure mode where existing sequential, autoregressive approaches can paradoxically degrade performance due to error propagation. To systematically analyze this issue, we propose ParaBench, a new benchmark designed to evaluate both text and image output modalities. Our analysis using ParaBench reveals that this performance degradation is strongly correlated with poor alignment between the generated reasoning and the final image. To resolve this, we propose a parallel multimodal diffusion framework, MMaDA-Parallel, that enables continuous, bidirectional interaction between text and images throughout the entire denoising trajectory. MMaDA-Parallel is trained with supervised finetuning and then further optimized by Parallel Reinforcement Learning (ParaRL), a novel strategy that applies semantic rewards along the trajectory to enforce cross-modal consistency. Experiments validate that our model significantly improves cross-modal alignment and semantic consistency, achieving a 6.9% improvement in Output Alignment on ParaBench compared to the state-of-the-art model, Bagel, establishing a more robust paradigm for thinking-aware image synthesis. Our code is open-sourced at https://github.com/tyfeld/MMaDA-Parallel
[2] SliderEdit: Continuous Image Editing with Fine-Grained Instruction Control
Arman Zarei, Samyadeep Basu, Mobina Pournemat, Sayan Nag, Ryan Rossi, Soheil Feizi
🧩 TL;DR
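This paper proposes SliderEdit, a framework that gives instruction-based image editing models continuous, fine-grained instruction control. It disentangles the individual edits in a multi-instruction prompt and exposes each as an adjustable slider, so the strength of each edit can be adjusted smoothly.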
📘 Detailed Summary
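Motivation: When handling multi-instruction prompts, existing instruction-based image editing models apply each instruction with a fixed strength. This limits the user's ability to control the intensity of individual edits precisely and continuously, and rules out fine-grained adjustment.
Method: SliderEdit disentangles the individual instructions in a multi-part edit instruction and exposes each as a globally trained slider. It learns a single set of low-rank adaptation matrices that generalizes across diverse edits, attributes, and compositional instructions, enabling continuous interpolation along individual edit dimensions while preserving spatial locality and global semantic consistency.
Result: Applied to state-of-the-art image editing models, including FLUX-Kontext and Qwen-Image-Edit, SliderEdit yields substantial improvements in edit controllability, visual consistency, and user steerability, supporting the framework's effectiveness for continuous instruction control.
Conclusion: The work opens a path toward interactive, instruction-driven image manipulation with continuous and compositional control. The authors state that, to the best of their knowledge, it is the first framework for continuous, fine-grained instruction control in instruction-based image editing models, and it moves image editing tools toward greater precision and usability.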
📄 Abstract
Instruction-based image editing models have recently achieved impressive performance, enabling complex edits to an input image from a multi-instruction prompt. However, these models apply each instruction in the prompt with a fixed strength, limiting the user's ability to precisely and continuously control the intensity of individual edits. We introduce SliderEdit, a framework for continuous image editing with fine-grained, interpretable instruction control. Given a multi-part edit instruction, SliderEdit disentangles the individual instructions and exposes each as a globally trained slider, allowing smooth adjustment of its strength. Unlike prior works that introduced slider-based attribute controls in text-to-image generation, typically requiring separate training or fine-tuning for each attribute or concept, our method learns a single set of low-rank adaptation matrices that generalize across diverse edits, attributes, and compositional instructions. This enables continuous interpolation along individual edit dimensions while preserving both spatial locality and global semantic consistency. We apply SliderEdit to state-of-the-art image editing models, including FLUX-Kontext and Qwen-Image-Edit, and observe substantial improvements in edit controllability, visual consistency, and user steerability. To the best of our knowledge, we are the first to explore and propose a framework for continuous, fine-grained instruction control in instruction-based image editing models. Our results pave the way for interactive, instruction-driven image manipulation with continuous and compositional control.
[3] STORM: Segment, Track, and Object Re-Localization from a Single 3D Model
Yu Deng, Teng Cao, Hikaru Shindo, Jiahong Xue, Quentin Delfosse, Kristian Kersting
🧩 TL;DR
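This paper proposes STORM, a real-time 6D pose estimation system that requires no manual annotation. Its three-stage pipeline combines vision-language understanding with self-supervised feature matching and achieves state-of-the-art accuracy and real-time performance on industrial datasets.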
📘 Detailed Summary
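Motivation: Existing 6D pose estimation methods typically rely on a manually annotated segmentation mask of the target in the first frame, which is labor-intensive and degrades performance under occlusion or rapid motion. A robust, annotation-free real-time system is therefore needed.
Method: STORM uses a novel three-stage pipeline: contextual object descriptions guide localization, self-cross-attention mechanisms identify candidate regions, and a segmentation model produces precise masks for pose estimation. It also introduces an automatic re-registration mechanism that detects tracking failures by monitoring feature similarity and recovers from them.
Result: STORM achieves state-of-the-art accuracy on challenging industrial datasets featuring multi-object occlusions, high-speed motion, and varying illumination, while running in real time without additional training.
Conclusion: The annotation-free approach significantly reduces deployment overhead and offers a practical solution for modern applications such as flexible manufacturing and intelligent quality control, demonstrating an effective combination of vision-language understanding and self-supervised learning.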
📄 Abstract
Accurate 6D pose estimation and tracking are fundamental capabilities for physical AI systems such as robots. However, existing approaches typically rely on a manually annotated segmentation mask of the target in the first frame, which is labor-intensive and leads to reduced performance when faced with occlusions or rapid movement. To address these limitations, we propose STORM (Segment, Track, and Object Re-localization from a single 3D Model), an open-source robust real-time 6D pose estimation system that requires no manual annotation. STORM employs a novel three-stage pipeline combining vision-language understanding with self-supervised feature matching: contextual object descriptions guide localization, self-cross-attention mechanisms identify candidate regions, and a segmentation model produces precise masks for accurate pose estimation. Another key innovation is our automatic re-registration mechanism that detects tracking failures through feature similarity monitoring and recovers from severe occlusions or rapid motion. STORM achieves state-of-the-art accuracy on challenging industrial datasets featuring multi-object occlusions, high-speed motion, and varying illumination, while operating at real-time speeds without additional training. This annotation-free approach significantly reduces deployment overhead, providing a practical solution for modern applications, such as flexible manufacturing and intelligent quality control.
[4] PANDA - Patch And Distribution-Aware Augmentation for Long-Tailed Exemplar-Free Continual Learning
Siddeshwar Raghavan, Jiangpeng He, Fengqing Zhu
🧩 TL;DR
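This paper proposes PANDA, a Patch-and-Distribution-Aware Augmentation framework for exemplar-free continual learning. It uses a CLIP encoder to identify representative regions and transplants them across classes, and combines this with an adaptive balancing strategy to address the dual-level imbalance of real-world data streams, significantly improving existing pre-trained-model-based methods in continual learning.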
📘 Detailed Summary
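Motivation: Exemplar-free continual learning is highly susceptible to catastrophic forgetting, and existing methods based on pre-trained models often overlook the inherent imbalance of real-world data distributions. The authors find that real-world data streams commonly exhibit dual-level imbalance: dataset-level distributions combined with extreme or reversed skews within individual tasks create both intra-task and inter-task disparities that hinder effective learning and generalization.
Method: PANDA uses a CLIP encoder to identify representative regions of low-frequency classes and transplants them into frequent-class samples, amplifying the low-frequency classes. It also introduces an adaptive balancing strategy that leverages prior task distributions to smooth inter-task imbalance, reducing the gap between average samples across tasks and enabling fairer learning with frozen pre-trained models.
Result: Extensive experiments and ablation studies show that PANDA integrates seamlessly with existing pre-trained-model-based continual learning methods, improving classification accuracy and reducing catastrophic forgetting. The method performs strongly on multiple benchmarks, supporting its effectiveness against dual-level imbalance.
Conclusion: PANDA highlights the importance of handling dual-level imbalance in real-world data streams and offers a new augmentation strategy for applying pre-trained models to continual learning. The work underscores the role of distribution-aware augmentation in mitigating catastrophic forgetting and informs the design of future continual learning methods.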
📄 Abstract
Exemplar-Free Continual Learning (EFCL) restricts the storage of previous task data and is highly susceptible to catastrophic forgetting. While pre-trained models (PTMs) are increasingly leveraged for EFCL, existing methods often overlook the inherent imbalance of real-world data distributions. We discovered that real-world data streams commonly exhibit dual-level imbalances, dataset-level distributions combined with extreme or reversed skews within individual tasks, creating both intra-task and inter-task disparities that hinder effective learning and generalization. To address these challenges, we propose PANDA, a Patch-and-Distribution-Aware Augmentation framework that integrates seamlessly with existing PTM-based EFCL methods. PANDA amplifies low-frequency classes by using a CLIP encoder to identify representative regions and transplanting those into frequent-class samples within each task. Furthermore, PANDA incorporates an adaptive balancing strategy that leverages prior task distributions to smooth inter-task imbalances, reducing the overall gap between average samples across tasks and enabling fairer learning with frozen PTMs. Extensive experiments and ablation studies demonstrate PANDA's capability to work with existing PTM-based CL methods, improving accuracy and reducing catastrophic forgetting.
[5] Test-Time Spectrum-Aware Latent Steering for Zero-Shot Generalization in Vision-Language Models
Konstantinos M. Dafnis, Dimitris N. Metaxas
🧩 TL;DR
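This paper proposes STS, a lightweight test-time adaptation framework that extracts a spectral subspace from the text embeddings to define principal semantic directions and steers latent representations in a spectrum-aware manner. It requires no backpropagation through, or modification of, the frozen encoders and significantly outperforms existing methods on multiple benchmarks.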
📘 Detailed Summary
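Motivation: Vision-language models excel at zero-shot inference but degrade under test-time domain shift. Test-time adaptation strategies have emerged in response, but existing methods such as test-time prompt tuning typically require backpropagating through large encoder weights or altering core model components, which makes them computationally expensive and complex to implement.
Method: STS extracts a spectral subspace from the text embeddings to define principal semantic directions, then adapts a small number of per-sample shift parameters to minimize entropy across augmented views, steering latent representations in a spectrum-aware manner. The method operates entirely in the latent space at inference, without backpropagation through or modification of the frozen encoders.
Result: Under standard evaluation protocols, STS largely surpasses or compares favorably with state-of-the-art test-time adaptation methods on multiple benchmarks while adding only a handful of parameters, with inference up to 8x faster and a 12x smaller memory footprint than conventional test-time prompt tuning.
Conclusion: STS demonstrates that lightweight adaptation in the latent space is effective and provides an efficient, practical solution for test-time domain adaptation. It shows that substantial gains are possible without modifying core model components, which is relevant to resource-constrained deployment.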
📄 Abstract
Vision-Language Models (VLMs) excel at zero-shot inference but often degrade under test-time domain shifts. For this reason, episodic test-time adaptation strategies have recently emerged as powerful techniques for adapting VLMs to a single unlabeled image. However, existing adaptation strategies, such as test-time prompt tuning, typically require backpropagating through large encoder weights or altering core model components. In this work, we introduce Spectrum-Aware Test-Time Steering (STS), a lightweight adaptation framework that extracts a spectral subspace from the textual embeddings to define principal semantic directions and learns to steer latent representations in a spectrum-aware manner by adapting a small number of per-sample shift parameters to minimize entropy across augmented views. STS operates entirely at inference in the latent space, without backpropagation through or modification of the frozen encoders. Building on standard evaluation protocols, our comprehensive experiments demonstrate that STS largely surpasses or compares favorably against state-of-the-art test-time adaptation methods, while introducing only a handful of additional parameters and achieving inference speeds up to 8x faster with a 12x smaller memory footprint than conventional test-time prompt tuning. The code is available at https://github.com/kdafnis/STS.
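The objective described in the abstract (a shift confined to a spectral subspace of the text embeddings, scored by entropy over augmented views) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the use of an SVD on centered text embeddings, the logit scale of 100, and the function names `spectral_basis` and `marginal_entropy` are all hypothetical choices, and the sketch only evaluates the objective rather than optimizing the shift coefficients.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def spectral_basis(text_emb, k):
    """Top-k right singular vectors of the centered text embeddings, shape (k, d)."""
    centered = text_emb - text_emb.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]

def marginal_entropy(view_emb, text_emb, basis, coeffs, scale=100.0):
    """Entropy of the view-averaged prediction after steering by coeffs @ basis.

    view_emb: (n_views, d) image embeddings of augmented views (encoder frozen).
    coeffs:   (k,) per-sample shift parameters, the only quantity being adapted.
    """
    steered = view_emb + coeffs @ basis              # shift stays inside the subspace
    steered = steered / np.linalg.norm(steered, axis=1, keepdims=True)
    probs = softmax(scale * steered @ text_emb.T).mean(axis=0)
    return float(-(probs * np.log(probs + 1e-12)).sum())
```

Because only the `k` coefficients change, any optimizer applied to `marginal_entropy` touches neither encoder, which is the property the abstract emphasizes.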
[6] From Street to Orbit: Training-Free Cross-View Retrieval via Location Semantics and LLM Guidance
Jeongho Min, Dongyoung Kim, Jaehyup Lee
🧩 TL;DR
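This paper proposes a training-free cross-view image retrieval framework that combines a pretrained vision encoder with a large language model to perform zero-shot street-to-satellite matching, outperforming prior learning-based methods on the benchmark dataset.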
📘 Detailed Summary
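Motivation: Existing cross-view image retrieval methods typically require supervised training on curated datasets and rely on panoramic or UAV-based images, which limits real-world deployment. This paper aims to remove these constraints with a general, training-free method that performs street-to-satellite matching from a single monocular street-view image.
Method: The method uses a pretrained vision encoder (e.g., DINOv2) and a large language model. It extracts geographic cues through web-based image search and LLM-based location inference, generates a satellite query via a geocoding API, and retrieves matching tiles using features refined by PCA-based whitening. The whole pipeline requires no additional training.
Result: Under zero-shot settings, the method outperforms prior learning-based approaches on the benchmark dataset without ground-truth supervision or finetuning. The pipeline also enables automatic construction of semantically aligned street-to-satellite datasets, offering a scalable and cost-efficient alternative to manual annotation.
Conclusion: The work demonstrates the effectiveness of combining pretrained models with LLMs for cross-view retrieval, offers a new paradigm for zero-shot geolocalization, and shows the potential of automatic dataset construction for large-scale applications.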
📄 Abstract
Cross-view image retrieval, particularly street-to-satellite matching, is a critical task for applications such as autonomous navigation, urban planning, and localization in GPS-denied environments. However, existing approaches often require supervised training on curated datasets and rely on panoramic or UAV-based images, which limits real-world deployment. In this paper, we present a simple yet effective cross-view image retrieval framework that leverages a pretrained vision encoder and a large language model (LLM), requiring no additional training. Given a monocular street-view image, our method extracts geographic cues through web-based image search and LLM-based location inference, generates a satellite query via a geocoding API, and retrieves matching tiles using a pretrained vision encoder (e.g., DINOv2) with PCA-based whitening feature refinement. Despite using no ground-truth supervision or finetuning, our proposed method outperforms prior learning-based approaches on the benchmark dataset under zero-shot settings. Moreover, our pipeline enables automatic construction of semantically aligned street-to-satellite datasets, offering a scalable and cost-efficient alternative to manual annotation. All source code will be made publicly available at https://jeonghomin.github.io/street2orbit.github.io/.
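The PCA-based whitening step mentioned in the abstract can be illustrated with a generic sketch. It assumes the whitening statistics are fitted on the candidate tile features and that retrieval uses cosine similarity; the abstract specifies neither, and the function names here are hypothetical.

```python
import numpy as np

def fit_pca_whitening(feats, eps=1e-6):
    """Fit a mean and a whitening matrix on an (n, d) array of encoder features."""
    mean = feats.mean(axis=0)
    centered = feats - mean
    cov = centered.T @ centered / (len(feats) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)           # symmetric eigendecomposition
    eigvals = np.clip(eigvals, 0.0, None)            # guard against tiny negatives
    whitener = eigvecs / np.sqrt(eigvals + eps)      # scale each eigenvector column
    return mean, whitener

def whiten(feats, mean, whitener):
    """Project features so that their covariance is approximately the identity."""
    return (feats - mean) @ whitener

def retrieve(query, gallery):
    """Index of the gallery row with the highest cosine similarity to the query."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return int(np.argmax(g @ q))
```

Whitening equalizes the variance of the principal directions, so a few dominant feature dimensions no longer decide every similarity comparison.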
[7] Remember Me: Bridging the Long-Range Gap in LVLMs with Three-Step Inference-Only Decay Resilience Strategies
Peng Gao, Yujian Lee, Xiaofeng Zhang, Zailong Chen, Hui Zhang
🧩 TL;DR
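This paper proposes inference-only Three-step Decay Resilience Strategies (T-DRS). Through three steps (semantic-driven amplification, distance-aware control, and re-reinforcement of distant dependencies), T-DRS mitigates the long-range attention decay caused by Rotary Positional Encoding in large vision-language models and significantly improves performance without training.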
📘 Detailed Summary
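Motivation: Large vision-language models face a critical challenge in modeling long-range dependencies when using Rotary Positional Encoding. Although it models token positions precisely, it induces progressive attention decay as token distance increases, so attention to distant token pairs steadily weakens, which severely impairs the model's ability to remember global context.
Method: The authors propose inference-only Three-step Decay Resilience Strategies (T-DRS), comprising: (1) Semantic-Driven DRS (SD-DRS), which amplifies semantically meaningful but distant signals via content-aware residuals; (2) Distance-aware Control DRS (DC-DRS), which smoothly modulates weights based on positional distance to purify attention and suppress noise; and (3) re-Reinforce Distant DRS (reRD-DRS), which consolidates the remaining informative long-range dependencies to maintain global coherence.
Result: Extensive experiments on visual question answering benchmarks show that T-DRS consistently improves performance in a training-free manner, recovering suppressed long-range token pairs without harming local inductive biases.
Conclusion: The work shows that carefully designed inference-time strategies can mitigate the attention decay of Rotary Positional Encoding, offering a new approach to long-range dependency modeling in large vision-language models while preserving local reasoning ability.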
📄 Abstract
Large Vision-Language Models (LVLMs) have achieved impressive performance across a wide range of multimodal tasks. However, they still face critical challenges in modeling long-range dependencies when using Rotary Positional Encoding (RoPE). Although RoPE facilitates precise modeling of token positions, it induces progressive attention decay as token distance increases, especially over distant token pairs, which severely impairs the model's ability to remember global context. To alleviate this issue, we propose inference-only Three-step Decay Resilience Strategies (T-DRS), comprising (1) Semantic-Driven DRS (SD-DRS), which amplifies semantically meaningful but distant signals via content-aware residuals, (2) Distance-aware Control DRS (DC-DRS), which purifies attention by smoothly modulating weights based on positional distances, suppressing noise while preserving locality, and (3) re-Reinforce Distant DRS (reRD-DRS), which consolidates the remaining informative remote dependencies to maintain global coherence. Together, these strategies recover suppressed long-range token pairs without harming local inductive biases. Extensive experiments on Visual Question Answering (VQA) benchmarks demonstrate that T-DRS consistently improves performance in a training-free manner. The code is available at https://github.com/labixiaoq-qq/Remember-me
[8] SAM-DAQ: Segment Anything Model with Depth-guided Adaptive Queries for RGB-D Video Salient Object Detection
Jia Lin, Xiaofei Zhou, Jiyuan Liu, Runmin Cong, Guodao Zhang, Zhi Liu, Jiyong Zhang
🧩 TL;DR
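This paper proposes SAM-DAQ, which adapts SAM2 to RGB-D video salient object detection through depth-guided adaptive queries, addressing three key challenges: dependence on manual prompts, the high memory consumption of sequential adapters, and the computational burden of memory attention. The method outperforms state-of-the-art methods on three RGB-D VSOD datasets.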
📘 Detailed Summary
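Motivation: Attempts to apply foundation models directly to RGB-D video salient object detection face three main challenges: dependence on manual prompts, the high memory consumption of sequential adapters, and the computational burden of memory attention. These limitations hinder efficient use of foundation models for this task.
Method: SAM-DAQ has two core modules: a parallel adapter-based multi-modal image encoder (PAMIE) and a query-driven temporal memory (QTM) module. PAMIE uses depth-guided parallel adapters to fine-tune the frozen SAM encoder under prompt-free conditions. QTM leverages frame-level and video-level queries simultaneously to selectively extract temporal consistency features and iteratively update the temporal representations of the queries.
Result: Extensive experiments on three RGB-D VSOD datasets show that SAM-DAQ consistently outperforms state-of-the-art methods on all evaluation metrics.
Conclusion: The work shows how depth-guided adaptive queries can adapt a foundation segmentation model to video salient object detection, providing a unified framework for multimodal video analysis and a direction for applying vision foundation models to temporal tasks.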
📄 Abstract
Recently, the Segment Anything Model (SAM) has attracted widespread attention and is often treated as a vision foundation model for universal segmentation. Some researchers have attempted to apply the foundation model directly to the RGB-D video salient object detection (RGB-D VSOD) task, which often encounters three challenges: the dependence on manual prompts, the high memory consumption of sequential adapters, and the computational burden of memory attention. To address these limitations, we propose a novel method, Segment Anything Model with Depth-guided Adaptive Queries (SAM-DAQ), which adapts SAM2 to pop out salient objects from videos by seamlessly integrating depth and temporal cues within a unified framework. First, we deploy a parallel adapter-based multi-modal image encoder (PAMIE), which incorporates several depth-guided parallel adapters (DPAs) in a skip-connection manner. Notably, we fine-tune the frozen SAM encoder under prompt-free conditions, where the DPA uses depth cues to facilitate the fusion of multi-modal features. Second, we deploy a query-driven temporal memory (QTM) module, which unifies the memory bank and prompt embeddings into a learnable pipeline. Concretely, by leveraging both frame-level queries and video-level queries simultaneously, the QTM module can not only selectively extract temporal consistency features but also iteratively update the temporal representations of the queries. Extensive experiments are conducted on three RGB-D VSOD datasets, and the results show that the proposed SAM-DAQ consistently outperforms state-of-the-art methods in terms of all evaluation metrics.
[9] HCC-3D: Hierarchical Compensatory Compression for 98% 3D Token Reduction in Vision-Language Models
Liheng Zhang, Jin Wang, Hui Li, Bingfeng Zhang, Weifeng Liu
🧩 TL;DR
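This paper proposes HCC-3D, a hierarchical compensatory compression method. Using global structure compression and adaptive detail mining modules, it achieves extreme compression of 3D tokens while retaining critical details, significantly improving the efficiency and performance of 3D vision-language models.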
📘 Detailed Summary
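Motivation: Current 3D vision-language models embed point clouds directly into 3D tokens, so processing all 3D tokens in the large language model incurs a heavy computational cost that limits practical use. A solution is needed that reduces this cost while preserving the integrity of essential information.
Method: HCC-3D is a hierarchical compensatory compression framework with two modules. The global structure compression (GSC) module uses global queries to compress all 3D tokens into a few key tokens while keeping overall structural information. The adaptive detail mining (ADM) module compensates for the resulting information loss by selectively recompressing salient but under-attended features through complementary scoring.
Result: Experiments show that HCC-3D achieves an extreme compression ratio of approximately 98% compared with previous 3D-VLMs while reaching new state-of-the-art performance, improving both efficiency and performance.
Conclusion: The work shows that hierarchical compensatory compression can substantially reduce computational overhead while maintaining or even improving model performance, offering a path toward efficient 3D multimodal understanding and indicating that compression and performance need not trade off.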
📄 Abstract
3D understanding has drawn significant attention recently, leveraging Vision-Language Models (VLMs) to enable multi-modal reasoning between point cloud and text data. Current 3D-VLMs directly embed the 3D point clouds into 3D tokens, following large 2D-VLMs with powerful reasoning capabilities. However, this framework incurs a high computational cost that limits its application, and we identify that the bottleneck lies in processing all 3D tokens in the Large Language Model (LLM) part. This raises the question: how can we reduce the computational overhead introduced by 3D tokens while preserving the integrity of their essential information? To address this question, we introduce Hierarchical Compensatory Compression (HCC-3D) to efficiently compress 3D tokens while maintaining critical detail retention. Specifically, we first propose a global structure compression (GSC), in which we design global queries to compress all 3D tokens into a few key tokens while keeping overall structural information. Then, to compensate for the information loss in GSC, we further propose an adaptive detail mining (ADM) module that selectively recompresses salient but under-attended features through complementary scoring. Extensive experiments demonstrate that HCC-3D not only achieves extreme compression ratios (approximately 98%) compared to previous 3D-VLMs, but also achieves new state-of-the-art performance, showing substantial improvements in both efficiency and performance.
[10] Regional Attention-Enhanced Swin Transformer for Clinically Relevant Medical Image Captioning
Zubia Naz, Farhan Asghar, Muhammad Ishfaq Hussain, Yahya Hadadi, Muhammad Aasim Rafique, Wookjin Choi, Moongu Jeon
🧩 TL;DR
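This paper proposes an automated medical image captioning system based on a Swin-BART encoder-decoder architecture. A lightweight regional attention module amplifies diagnostically salient regions, and the system achieves state-of-the-art semantic fidelity and interpretability on the ROCO dataset.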
📘 Detailed Summary
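Motivation: Automated medical image captioning aims to translate complex radiological images into diagnostic narratives that support reporting workflows, but existing methods fall short in semantic accuracy and interpretability.
Method: The system uses a Swin-BART encoder-decoder architecture with a lightweight regional attention module that amplifies diagnostically salient regions before cross-attention. Decoding uses beam search (beam size = 4, length penalty = 1.1, no-repeat n-gram size = 3, max length = 128).
Result: On the ROCO dataset the model achieves the best results among the compared methods: ROUGE 0.603 (ResNet-CNN 0.356, BLIP2-OPT 0.255) and BERTScore 0.807 (BLIP2-OPT 0.645, ResNet-CNN 0.623), with competitive BLEU, CIDEr, and METEOR. The paper also provides ablations, per-modality analysis, and qualitative heatmaps.
Conclusion: The design produces accurate, clinically phrased captions with transparent regional attributions, supporting safe research use with a human in the loop and offering an interpretable approach to medical image understanding.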
📄 Abstract
Automated medical image captioning translates complex radiological images into diagnostic narratives that can support reporting workflows. We present a Swin-BART encoder-decoder system with a lightweight regional attention module that amplifies diagnostically salient regions before cross-attention. Trained and evaluated on ROCO, our model achieves state-of-the-art semantic fidelity while remaining compact and interpretable. We report results as mean ± std over three seeds and include 95% confidence intervals. Compared with baselines, our approach improves ROUGE (proposed 0.603, ResNet-CNN 0.356, BLIP2-OPT 0.255) and BERTScore (proposed 0.807, BLIP2-OPT 0.645, ResNet-CNN 0.623), with competitive BLEU, CIDEr, and METEOR. We further provide ablations (regional attention on/off and token-count sweep), per-modality analysis (CT/MRI/X-ray), paired significance tests, and qualitative heatmaps that visualize the regions driving each description. Decoding uses beam search (beam size = 4), length penalty = 1.1, no_repeat_ngram_size = 3, and max length = 128. The proposed design yields accurate, clinically phrased captions and transparent regional attributions, supporting safe research use with a human in the loop.
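The decoding settings quoted in the abstract can be written down as keyword arguments. The sketch below assumes the names follow the Hugging Face `generate` convention; the abstract names only `no_repeat_ngram_size` explicitly, so the other keys are an assumption, and the helper `has_repeated_ngram` is a hypothetical illustration of what that constraint forbids rather than part of any library.

```python
# Decoding settings reported in the abstract, expressed as keyword arguments.
# Key names other than no_repeat_ngram_size are assumed, not taken from the paper.
DECODING_KWARGS = {
    "num_beams": 4,
    "length_penalty": 1.1,
    "no_repeat_ngram_size": 3,
    "max_length": 128,
}

def has_repeated_ngram(tokens, n):
    """Return True if any n-gram occurs more than once in `tokens`."""
    seen = set()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if gram in seen:
            return True
        seen.add(gram)
    return False
```

With `no_repeat_ngram_size` set to 3, a caption for which `has_repeated_ngram(caption_tokens, 3)` is true would not be produced, which suppresses the looping phrases that beam search can otherwise emit.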
[11] MosaicDoc: A Large-Scale Bilingual Benchmark for Visually Rich Document Understanding
Ketong Chen, Yuhao Chen, Yang Xue
🧩 TL;DR
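This paper proposes the DocWeaver multi-agent pipeline and the MosaicDoc benchmark, a large-scale bilingual benchmark for visually rich document understanding. Its automatic generation approach addresses the shortcomings of existing benchmarks in language diversity, layout complexity, and task coverage.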
📘 Detailed Summary
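Motivation: Existing benchmarks for vision-language models have three key problems: they are English-centric, feature simplistic layouts, and support limited tasks. As a result, they cannot effectively evaluate models on visually rich document understanding, in particular on complex layouts and dense text.
Method: The authors propose DocWeaver, a multi-agent pipeline that leverages large language models to automatically generate a new benchmark. The result is MosaicDoc, a large-scale bilingual resource sourced from newspapers and magazines, with diverse and complex layouts, rich stylistic variety from 196 publishers, and comprehensive multi-task annotations covering OCR, VQA, reading order, and localization.
Result: MosaicDoc contains 72K images and over 600K QA pairs. Extensive evaluation of state-of-the-art models on the benchmark reveals their current limitations in handling real-world document complexity and points to clear directions for future research.
Conclusion: Beyond providing an authoritative benchmark for visually rich document understanding, the work systematically exposes the shortcomings of current models and highlights the importance of handling complex layouts and multiple languages, offering guidance for the next generation of document understanding models.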
📄 Abstract
Despite the rapid progress of Vision-Language Models (VLMs), their capabilities are inadequately assessed by existing benchmarks, which are predominantly English-centric, feature simplistic layouts, and support limited tasks. Consequently, they fail to evaluate model performance for Visually Rich Document Understanding (VRDU), a critical challenge involving complex layouts and dense text. To address this, we introduce DocWeaver, a novel multi-agent pipeline that leverages Large Language Models to automatically generate a new benchmark. The result is MosaicDoc, a large-scale, bilingual (Chinese and English) resource designed to push the boundaries of VRDU. Sourced from newspapers and magazines, MosaicDoc features diverse and complex layouts (including multi-column and non-Manhattan), rich stylistic variety from 196 publishers, and comprehensive multi-task annotations (OCR, VQA, reading order, and localization). With 72K images and over 600K QA pairs, MosaicDoc serves as a definitive benchmark for the field. Our extensive evaluation of state-of-the-art models on this benchmark reveals their current limitations in handling real-world document complexity and charts a clear path for future research.
[12] TSPE-GS: Probabilistic Depth Extraction for Semi-Transparent Surface Reconstruction via 3D Gaussian Splatting
Zhiyuan Xu, Nan Min, Yuhang Guo, Tong Wei
🧩 TL;DR
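This paper proposes TSPE-GS, which uniformly samples transmittance to model a pixel-wise multi-modal distribution of opacity and depth, addressing the limitations of 3D Gaussian Splatting in reconstructing semi-transparent surfaces and significantly improving semi-transparent geometry reconstruction.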
📘 Detailed Summary
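Motivation: 3D Gaussian Splatting offers a strong speed-quality trade-off but struggles to reconstruct semi-transparent surfaces, because most existing methods assume a single depth per pixel, an assumption that fails when multiple surfaces are visible.
Method: TSPE-GS uniformly samples transmittance to model a pixel-wise multi-modal distribution of opacity and depth, replacing the prior single-peak assumption and resolving cross-surface depth ambiguity. By progressively fusing truncated signed distance functions, it reconstructs external and internal surfaces separately within a unified framework.
Result: Extensive experiments on public and self-collected semi-transparent and opaque datasets show that TSPE-GS significantly improves semi-transparent geometry reconstruction while maintaining performance on opaque scenes.
Conclusion: The method generalizes to other Gaussian-based reconstruction pipelines without extra training overhead, providing a unified solution for semi-transparent surface reconstruction and extending the applicability of 3D Gaussian Splatting.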
📄 Abstract
3D Gaussian Splatting offers a strong speed-quality trade-off but struggles to reconstruct semi-transparent surfaces because most methods assume a single depth per pixel, which fails when multiple surfaces are visible. We propose TSPE-GS (Transparent Surface Probabilistic Extraction for Gaussian Splatting), which uniformly samples transmittance to model a pixel-wise multi-modal distribution of opacity and depth, replacing the prior single-peak assumption and resolving cross-surface depth ambiguity. By progressively fusing truncated signed distance functions, TSPE-GS reconstructs external and internal surfaces separately within a unified framework. The method generalizes to other Gaussian-based reconstruction pipelines without extra training overhead. Extensive experiments on public and self-collected semi-transparent and opaque datasets show TSPE-GS significantly improves semi-transparent geometry reconstruction while maintaining performance on opaque scenes.
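One way to read "uniformly samples transmittance" is sketched below for a single ray. It is an interpretation, not the authors' algorithm: it assumes front-to-back samples with per-sample opacities, the standard accumulated transmittance, and evenly spaced transmittance levels. The function names are hypothetical.

```python
def transmittance(alphas):
    """Accumulated transmittance after each sample: product of (1 - alpha_j), front to back."""
    out, t = [], 1.0
    for a in alphas:
        t *= (1.0 - a)
        out.append(t)
    return out

def depths_at_uniform_transmittance(depths, alphas, n_levels):
    """For evenly spaced transmittance levels, record the first depth at which
    the ray's transmittance has dropped to that level or below."""
    trans = transmittance(alphas)
    levels = [(k + 0.5) / n_levels for k in range(n_levels)]
    hits = []
    for level in levels:
        for d, t in zip(depths, trans):
            if t <= level:
                hits.append(d)
                break
    return hits

# A ray crossing a half-transparent surface at depth 2 and a nearly opaque one at depth 4.
samples = depths_at_uniform_transmittance([1, 2, 3, 4], [0.0, 0.5, 0.0, 0.9], n_levels=4)
```

Here `samples` is `[4, 4, 2, 2]`: the recorded depths cluster at both surfaces, giving a two-peaked depth distribution where a single-depth-per-pixel assumption would keep only one value.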
[13] Beyond Cosine Similarity: Magnitude-Aware CLIP for No-Reference Image Quality Assessment
Zhicheng Liao, Dongxu Wu, Zhenshan Shi, Sijie Mai, Hanwei Zhu, Lingyu Zhu, Yuncheng Jiang, Baoliang Chen
🧩 TL;DR
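This paper proposes an adaptive fusion framework that combines the magnitude of CLIP image features with cosine similarity for no-reference image quality assessment. It normalizes the feature distribution with a Box-Cox transformation and uses a confidence-guided fusion strategy, significantly outperforming existing methods on multiple benchmark datasets.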
📘 Detailed Summary
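Motivation: Existing methods rely mainly on CLIP cosine similarity for no-reference image quality assessment and overlook the strong correlation between the magnitude of CLIP image features and perceptual quality. Semantic similarity alone fails to exploit this critical yet underexplored cue, which limits assessment performance.
Method: The adaptive fusion framework first extracts the absolute values of the CLIP image features and applies a Box-Cox transformation to statistically normalize them and mitigate semantic sensitivity. A confidence-guided fusion scheme then weights each cue adaptively according to its relative strength, integrating the cosine-similarity and magnitude-aware quality cues.
Result: Extensive experiments on multiple benchmark IQA datasets show that the method consistently outperforms standard CLIP-based IQA and state-of-the-art baselines without any task-specific training. The experiments also support the strong correlation between the magnitude cue and perceptual quality and its contribution to the performance gain.
Conclusion: The magnitude of CLIP image features is an important cue for perceptual quality assessment and is complementary to cosine similarity. The adaptive fusion framework exploits both cues effectively, offering a new approach to no-reference image quality assessment and showing the value of the statistical properties of pretrained model features.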
📄 Abstract
Recent efforts have repurposed the Contrastive Language-Image Pre-training (CLIP) model for No-Reference Image Quality Assessment (NR-IQA) by measuring the cosine similarity between the image embedding and textual prompts such as "a good photo" or "a bad photo." However, this semantic similarity overlooks a critical yet underexplored cue: the magnitude of the CLIP image features, which we empirically find to exhibit a strong correlation with perceptual quality. In this work, we introduce a novel adaptive fusion framework that complements cosine similarity with a magnitude-aware quality cue. Specifically, we first extract the absolute CLIP image features and apply a Box-Cox transformation to statistically normalize the feature distribution and mitigate semantic sensitivity. The resulting scalar summary serves as a semantically-normalized auxiliary cue that complements cosine-based prompt matching. To integrate both cues effectively, we further design a confidence-guided fusion scheme that adaptively weighs each term according to its relative strength. Extensive experiments on multiple benchmark IQA datasets demonstrate that our method consistently outperforms standard CLIP-based IQA and state-of-the-art baselines, without any task-specific training.
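The two ingredients named in the abstract, a Box-Cox transform of the absolute features and a fusion weighted by each cue's relative strength, can be sketched as follows. Only the Box-Cox formula itself is standard. The choice of `lam=0.5`, the mean as the scalar summary, and the softmax-style weighting in `fuse` are assumptions made for illustration and should not be read as the paper's formulas.

```python
import math

def box_cox(x, lam):
    """Standard Box-Cox transform of a positive scalar."""
    if x <= 0:
        raise ValueError("Box-Cox requires x > 0")
    return math.log(x) if lam == 0 else (x ** lam - 1.0) / lam

def magnitude_cue(features, lam=0.5):
    """Scalar summary (assumed: the mean) of Box-Cox-transformed absolute features."""
    eps = 1e-8  # keeps the transform defined for zero-valued features
    return sum(box_cox(abs(f) + eps, lam) for f in features) / len(features)

def fuse(cos_score, mag_score):
    """Hypothetical confidence weighting: each cue is weighted by exp(|cue|)."""
    w_cos = math.exp(abs(cos_score))
    w_mag = math.exp(abs(mag_score))
    return (w_cos * cos_score + w_mag * mag_score) / (w_cos + w_mag)
```

In practice both cues would need to be brought to a comparable scale before fusion; that step is omitted here because the abstract does not describe it.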
[14] Robust Object Detection with Pseudo Labels from VLMs using Per-Object Co-teaching
Uday Bhaskar, Rishabh Bhattacharya, Avinash Patel, Sarthak Khoche, Praveen Anil Kulkarni, Naresh Manwani
🧩 TL;DR
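This paper proposes a novel pipeline that uses vision-language models to automatically generate pseudo-labels for training efficient real-time object detectors. A per-object co-teaching strategy mitigates the noise in VLM-generated labels and significantly improves object detection in autonomous driving scenarios.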
📘 Detailed Summary
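Motivation: Foundation models, especially vision-language models, offer promising zero-shot object detection for domains such as autonomous driving that require large amounts of labeled data. However, their detection latency and hallucinated predictions make them unsuitable for direct deployment, while manual annotation is prohibitively expensive.
Method: The authors propose a training strategy based on per-object co-teaching. Two YOLO models learn collaboratively, and during training each filters out unreliable bounding boxes based on its peer's per-object loss values rather than filtering entire images, which mitigates the noise in VLM-generated labels.
Result: On the KITTI dataset, the method significantly outperforms a baseline YOLOv5m model, raising mAP@0.5 from 31.12% to 46.61% while maintaining real-time detection latency. Supplementing the pseudo-labels with 10% ground-truth labels raises performance further to 57.97% mAP@0.5. Similar improvements are observed on the ACDC and BDD100k datasets.
Conclusion: The pipeline provides an efficient, robust, and scalable way to train high-performance object detectors for autonomous driving, significantly reducing reliance on costly human annotation. It demonstrates the effectiveness of combining VLM-generated pseudo-labels with co-teaching and offers a practical route to deployment.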
📄 Abstract
Foundation models, especially vision-language models (VLMs), offer compelling zero-shot object detection for applications like autonomous driving, a domain where manual labelling is prohibitively expensive. However, their detection latency and tendency to hallucinate predictions render them unsuitable for direct deployment. This work introduces a novel pipeline that addresses this challenge by leveraging VLMs to automatically generate pseudo-labels for training efficient, real-time object detectors. Our key innovation is a per-object co-teaching-based training strategy that mitigates the inherent noise in VLM-generated labels. The proposed per-object co-teaching approach filters noisy bounding boxes from training instead of filtering the entire image. Specifically, two YOLO models learn collaboratively, filtering out unreliable boxes from each mini-batch based on their peers' per-object loss values. Overall, our pipeline provides an efficient, robust, and scalable approach to train high-performance object detectors for autonomous driving, significantly reducing reliance on costly human annotation. Experimental results on the KITTI dataset demonstrate that our method outperforms a baseline YOLOv5m model, achieving a significant mAP@0.5 boost (31.12% to 46.61%) while maintaining real-time detection latency. Furthermore, we show that supplementing our pseudo-labelled data with a small fraction of ground truth labels (10%) leads to further performance gains, reaching 57.97% mAP@0.5 on the KITTI dataset. We observe similar performance improvements for the ACDC and BDD100k datasets.
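The per-object filtering rule can be sketched independently of any detector. The code below assumes the usual co-teaching convention of keeping a fixed fraction of small-loss items; the `keep_ratio` value and the function names are hypothetical, and the per-box losses would come from the two YOLO models in the actual pipeline.

```python
def keep_small_loss_boxes(peer_losses, keep_ratio):
    """Indices of the boxes the peer model finds most reliable (smallest loss)."""
    n_keep = max(1, int(round(keep_ratio * len(peer_losses))))
    order = sorted(range(len(peer_losses)), key=lambda i: peer_losses[i])
    return sorted(order[:n_keep])

def co_teaching_step(losses_a, losses_b, keep_ratio=0.75):
    """Each model is trained only on the pseudo-labelled boxes selected by its peer."""
    boxes_for_a = keep_small_loss_boxes(losses_b, keep_ratio)  # model B filters for A
    boxes_for_b = keep_small_loss_boxes(losses_a, keep_ratio)  # model A filters for B
    return boxes_for_a, boxes_for_b

# Per-box losses for four pseudo-labelled boxes in one mini-batch.
for_a, for_b = co_teaching_step([0.1, 2.5, 0.3, 0.2], [0.2, 0.1, 3.0, 0.4])
```

Here `for_a` is `[0, 1, 3]` and `for_b` is `[0, 2, 3]`: each model drops the one box its peer considers least reliable, while the rest of the image still contributes to training.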
[15] Difference Vector Equalization for Robust Fine-tuning of Vision-Language Models
Satoshi Suzuki, Shin'ya Yamaguchi, Shoichiro Takeda, Taiga Yamane, Naoki Makishima, Naotaka Kawata, Mana Ihori, Tomohiro Tanaka, Shota Orihashi, Ryo Masumura
🧩 TL;DR
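This paper proposes Difference Vector Equalization, which constrains the difference vectors between the embeddings of the pre-trained and fine-tuned models to preserve the geometric structure of the embedding space, improving in-distribution performance while retaining out-of-distribution and zero-shot generalization.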
📘 Detailed Summary
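Motivation: Existing robust fine-tuning methods based on contrastive learning distort the geometric structure of the embedding space, which is crucial to the generalization of vision-language models, resulting in limited out-of-distribution and zero-shot performance.
Method: Difference Vector Equalization preserves the geometric structure through two losses: average vector loss and pairwise vector loss. The average vector loss preserves the structure globally by constraining the difference vectors to equal their weighted average, while the pairwise vector loss preserves it locally by ensuring consistent multimodal alignment.
Result: Experiments show that the method effectively preserves the geometric structure of the embedding space and achieves strong results across in-distribution, out-of-distribution, and zero-shot metrics.
Conclusion: The method demonstrates the importance of preserving the geometric structure of the embedding space when fine-tuning vision-language models and offers a new approach to improving generalization that is of practical value.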
📄 Abstract
Contrastive pre-trained vision-language models, such as CLIP, demonstrate strong generalization abilities in zero-shot classification by leveraging embeddings extracted from image and text encoders. This paper aims to robustly fine-tune these vision-language models on in-distribution (ID) data without compromising their generalization abilities in out-of-distribution (OOD) and zero-shot settings. Current robust fine-tuning methods tackle this challenge by reusing contrastive learning, which was used in pre-training, for fine-tuning. However, we found that these methods distort the geometric structure of the embeddings, which plays a crucial role in the generalization of vision-language models, resulting in limited OOD and zero-shot performance. To address this, we propose Difference Vector Equalization (DiVE), which preserves the geometric structure during fine-tuning. The idea behind DiVE is to constrain difference vectors, each of which is obtained by subtracting the embeddings extracted from the pre-trained and fine-tuning models for the same data sample. By constraining the difference vectors to be equal across various data samples, we effectively preserve the geometric structure. Therefore, we introduce two losses: average vector loss (AVL) and pairwise vector loss (PVL). AVL preserves the geometric structure globally by constraining difference vectors to be equal to their weighted average. PVL preserves the geometric structure locally by ensuring a consistent multimodal alignment. Our experiments demonstrate that DiVE effectively preserves the geometric structure, achieving strong results across ID, OOD, and zero-shot metrics.
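The idea behind the average vector loss can be sketched with plain lists of embeddings. The sketch assumes the difference is taken as fine-tuned minus pre-trained, uses a squared-distance penalty, and defaults to uniform weights; the abstract specifies a weighted average but not the weights, the sign convention, or the distance, so all three are illustrative assumptions.

```python
def difference_vectors(pretrained, finetuned):
    """Per-sample difference between fine-tuned and pre-trained embeddings."""
    return [[f - p for p, f in zip(pv, fv)] for pv, fv in zip(pretrained, finetuned)]

def average_vector_loss(pretrained, finetuned, weights=None):
    """Mean squared distance of each difference vector from their weighted average."""
    diffs = difference_vectors(pretrained, finetuned)
    n, dim = len(diffs), len(diffs[0])
    weights = weights or [1.0 / n] * n               # uniform weights by assumption
    mean = [sum(w * d[k] for w, d in zip(weights, diffs)) for k in range(dim)]
    return sum(sum((d[k] - mean[k]) ** 2 for k in range(dim)) for d in diffs) / n
```

If fine-tuning translates every embedding by the same vector, all difference vectors coincide and the loss is zero, which is the sense in which pairwise relations between embeddings are left intact.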
[16] Anomagic: Crossmodal Prompt-driven Zero-shot Anomaly Generation
Yuxin Jiang, Wei Luo, Hui Zhang, Qiyu Chen, Haiming Yao, Weiming Shen, Yunkang Cao
🧩 TL;DR
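This paper proposes Anomagic, a zero-shot anomaly generation method that unifies visual and textual cues through a crossmodal prompt encoding scheme and generates semantically coherent anomalies without any real anomaly exemplars. A contrastive refinement strategy strengthens the alignment between anomalies and their masks, significantly improving downstream anomaly detection.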
📘 Detailed Summary
Motivation: 现有异常生成方法通常依赖真实异常样本作为参考,限制了在零样本场景下的应用能力。本文旨在解决无需真实异常样本即可生成语义一致异常的问题,填补零样本异常生成领域的研究空白。
Method: 提出跨模态提示编码方案统一视觉和文本线索,利用上下文信息引导基于修复的生成流程。采用对比精炼策略强制合成异常与其掩码之间的精确对齐。构建AnomVerse数据集,包含12,987个异常-掩码-描述三元组,通过多模态大语言模型自动生成结构化视觉提示和基于模板的文本提示。
Result: 实验表明Anomagic在AnomVerse上训练后能生成比现有方法更真实多样的异常,显著提升下游异常检测性能。该方法能够为任何正常类别图像生成用户定义提示的异常,展现出强大的泛化能力。
Conclusion: Anomagic建立了异常生成的通用基础模型,为零样本异常检测提供了有效解决方案。该方法展示了跨模态信息融合在异常生成中的关键作用,为未来无监督异常检测研究开辟了新方向。
📄 Abstract
We propose Anomagic, a zero-shot anomaly generation method that produces semantically coherent anomalies without requiring any exemplar anomalies. By unifying both visual and textual cues through a crossmodal prompt encoding scheme, Anomagic leverages rich contextual information to steer an inpainting-based generation pipeline. A subsequent contrastive refinement strategy enforces precise alignment between synthesized anomalies and their masks, thereby bolstering downstream anomaly detection accuracy. To facilitate training, we introduce AnomVerse, a collection of 12,987 anomaly-mask-caption triplets assembled from 13 publicly available datasets, where captions are automatically generated by multimodal large language models using structured visual prompts and template-based textual hints. Extensive experiments demonstrate that Anomagic trained on AnomVerse can synthesize more realistic and varied anomalies than prior methods, yielding superior improvements in downstream anomaly detection. Furthermore, Anomagic can generate anomalies for any normal-category image using user-defined prompts, establishing a versatile foundation model for anomaly generation.
[17] LampQ: Towards Accurate Layer-wise Mixed Precision Quantization for Vision Transformers
Minjun Kim, Jaeri Lee, Jongjin Kim, Jeongin Yun, Yongmo Kwon, U Kang
🧩 TL;DR
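This paper proposes LampQ, a layer-wise mixed precision quantization method for Vision Transformers. It uses a type-aware Fisher-based sensitivity metric, integer linear programming, and iterative bit-width updates to address the limitations of existing mixed precision quantization methods in granularity, metric scale, and bit allocation.
📘 Detailed Summary
Motivation: Existing Vision Transformer quantization methods mostly use uniform precision and ignore how differently components respond to quantization. Metric-based mixed precision quantization methods, in turn, suffer from three main problems: coarse granularity, mismatched metric scales across component types, and bit allocation that ignores the effect of quantization.
Method: LampQ quantizes layer-wise for fine-grained control and efficient acceleration, introduces a type-aware Fisher-based sensitivity metric, assigns bit-widths optimally through integer linear programming, and then refines the bit-width configuration with iterative updates.
Result: Extensive experiments show that LampQ achieves state-of-the-art performance in quantizing Vision Transformers pre-trained on tasks such as image classification, object detection, and zero-shot quantization.
Conclusion: Through fine-grained layer-wise mixed precision quantization, LampQ addresses the differing quantization sensitivity of Vision Transformer components and offers a practical route to deploying large vision Transformers efficiently.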
📄 Abstract
How can we accurately quantize a pre-trained Vision Transformer model? Quantization algorithms compress Vision Transformers (ViTs) into low-bit formats, reducing memory and computation demands with minimal accuracy degradation. However, existing methods rely on uniform precision, ignoring the diverse sensitivity of ViT components to quantization. Metric-based Mixed Precision Quantization (MPQ) is a promising alternative, but previous MPQ methods for ViTs suffer from three major limitations: 1) coarse granularity, 2) mismatch in metric scale across component types, and 3) quantization-unaware bit allocation. In this paper, we propose LampQ (Layer-wise Mixed Precision Quantization for Vision Transformers), an accurate metric-based MPQ method for ViTs to overcome these limitations. LampQ performs layer-wise quantization to achieve both fine-grained control and efficient acceleration, incorporating a type-aware Fisher-based metric to measure sensitivity. Then, LampQ assigns bit-widths optimally through integer linear programming and further updates them iteratively. Extensive experiments show that LampQ provides the state-of-the-art performance in quantizing ViTs pre-trained on various tasks such as image classification, object detection, and zero-shot quantization.
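To make the bit-allocation step concrete, the sketch below solves a tiny instance of the underlying optimization: pick one bit-width per layer so that the summed sensitivity is minimal under a model-size budget. It uses exhaustive search as a stand-in for an integer linear programming solver, which only works for a handful of layers. The sensitivity table, layer sizes, and budget are hypothetical, and nothing here reproduces LampQ's actual metric or its iterative update.

```python
from itertools import product


def allocate_bits(sensitivity, sizes, choices, budget_bits):
    """Return (assignment, cost) minimizing total sensitivity under a size budget.

    sensitivity[l][b]: estimated loss increase when layer l is quantized to b bits.
    sizes[l]: number of parameters in layer l.
    """
    best_cost, best_assign = float("inf"), None
    for assign in product(choices, repeat=len(sizes)):
        total_bits = sum(b * s for b, s in zip(assign, sizes))
        if total_bits > budget_bits:
            continue  # violates the size constraint
        cost = sum(sensitivity[l][b] for l, b in enumerate(assign))
        if cost < best_cost:
            best_cost, best_assign = cost, assign
    return best_assign, best_cost


# Hypothetical three-layer model with an average budget of 6 bits per parameter.
sizes = [100, 100, 200]
sensitivity = [
    {4: 5.0, 8: 1.0},    # very sensitive layer
    {4: 0.5, 8: 0.25},   # robust layer
    {4: 2.0, 8: 0.5},    # large layer, moderately sensitive
]
assignment, cost = allocate_bits(sensitivity, sizes, (4, 8), budget_bits=6 * sum(sizes))
```

In this toy instance the large layer is pushed to 4 bits so that the two small layers can keep 8 bits, which is the kind of trade-off a real ILP formulation resolves at scale.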
[18] Right Looks, Wrong Reasons: Compositional Fidelity in Text-to-Image Generation
Mayank Vatsa, Aparna Bharati, Richa Singh
🧩 TL;DR
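This survey examines a fundamental flaw of current text-to-image models in logical composition. It shows that performance collapses when core primitives such as negation, counting, and spatial relations are combined, and it argues that genuine compositionality requires fundamental advances in representation and reasoning.
📘 Detailed Summary
Motivation: The architectures of today's leading text-to-image models share a fundamental flaw: they cannot handle logical composition. The survey investigates how this failure appears across three core primitives (negation, counting, and spatial relations) and exposes the severe performance collapse that occurs when they are combined.
Method: The survey analyzes recent benchmarks and methods to trace why text-to-image models fail at logical composition. It focuses on three factors: the near-total absence of explicit negation in training data, the fundamental unsuitability of continuous attention architectures for discrete logic, and evaluation metrics that reward visual plausibility over constraint satisfaction.
Result: Models that are accurate on single primitives fail sharply when primitives are combined, exposing severe interference. Current solutions and simple scaling cannot close this gap, and the failures in logical composition are systematic.
Conclusion: Genuine compositionality requires fundamental advances in representation and reasoning rather than incremental adjustments to existing architectures. The survey concludes that approaches built on continuous attention are inherently ill-suited to discrete logic, which points to new directions for future research.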
📄 Abstract
The architectural blueprint of today's leading text-to-image models contains a fundamental flaw: an inability to handle logical composition. This survey investigates this breakdown across three core primitives: negation, counting, and spatial relations. Our analysis reveals a dramatic performance collapse: models that are accurate on single primitives fail precipitously when these are combined, exposing severe interference. We trace this failure to three key factors. First, training data show a near-total absence of explicit negations. Second, continuous attention architectures are fundamentally unsuitable for discrete logic. Third, evaluation metrics reward visual plausibility over constraint satisfaction. By analyzing recent benchmarks and methods, we show that current solutions and simple scaling cannot bridge this gap. Achieving genuine compositionality, we conclude, will require fundamental advances in representation and reasoning rather than incremental adjustments to existing architectures.
[19] AffordBot: 3D Fine-grained Embodied Reasoning via Multimodal Large Language Models
Xinyi Wang, Xun Yang, Yanlong Xu, Yuchen Wu, Zhen Li, Na Zhao
🧩 TL;DR
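This paper introduces a new task, fine-grained 3D embodied reasoning, and the AffordBot framework, which combines multimodal large language models with chain-of-thought reasoning to produce structured predictions of actionable elements in 3D scenes from task instructions. It reaches state-of-the-art performance on the SceneFun3D dataset.
📘 Detailed Summary
Motivation: Existing methods usually operate at the object level or handle fine-grained affordance reasoning in a disjointed way, lacking coherent, instruction-driven grounding and reasoning. They therefore cannot adequately support human-agent collaboration in physical environments, which requires understanding what to act upon, where it is, and how to interact with it.
Method: AffordBot integrates multimodal large language models with a tailored chain-of-thought reasoning paradigm. It renders surround-view images of the scene and projects 3D element candidates into these views, building a rich visual representation aligned with the scene geometry. The reasoning pipeline starts with an active perception stage that selects the most informative viewpoint, then reasons step by step to localize affordance elements and infer plausible interaction motions.
Result: On the SceneFun3D dataset, AffordBot achieves state-of-the-art performance and shows strong generalization and physically grounded reasoning using only 3D point cloud input and multimodal large language models.
Conclusion: The study shows that combining multimodal large language models with a structured reasoning paradigm is effective for fine-grained 3D embodied reasoning. It offers a new solution for human-agent collaboration in physical environments and shows that complex reasoning is possible from 3D point cloud input alone.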
📄 Abstract
Effective human-agent collaboration in physical environments requires understanding not only what to act upon, but also where the actionable elements are and how to interact with them. Existing approaches often operate at the object level or disjointedly handle fine-grained affordance reasoning, lacking coherent, instruction-driven grounding and reasoning. In this work, we introduce a new task: Fine-grained 3D Embodied Reasoning, which requires an agent to predict, for each referenced affordance element in a 3D scene, a structured triplet comprising its spatial location, motion type, and motion axis, based on a task instruction. To solve this task, we propose AffordBot, a novel framework that integrates Multimodal Large Language Models (MLLMs) with a tailored chain-of-thought (CoT) reasoning paradigm. To bridge the gap between 3D input and 2D-compatible MLLMs, we render surround-view images of the scene and project 3D element candidates into these views, forming a rich visual representation aligned with the scene geometry. Our CoT pipeline begins with an active perception stage, prompting the MLLM to select the most informative viewpoint based on the instruction, before proceeding with step-by-step reasoning to localize affordance elements and infer plausible interaction motions. Evaluated on the SceneFun3D dataset, AffordBot achieves state-of-the-art performance, demonstrating strong generalization and physically grounded reasoning with only 3D point cloud input and MLLMs.
[20] GEA: Generation-Enhanced Alignment for Text-to-Image Person Retrieval
Hao Zou, Runqing Zhang, Xue Zhou, Jianxiao Zou
🧩 TL;DR
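This paper proposes Generation-Enhanced Alignment (GEA), which approaches the problem from a generative perspective and uses diffusion-generated images as intermediate semantic representations to bridge the modality gap between text and images, significantly improving text-to-image person retrieval.
📘 Detailed Summary
Motivation: In text-to-image person retrieval, textual queries often fail to reflect image content accurately and comprehensively, leading to poor cross-modal alignment and overfitting to limited datasets. The inherent modality gap between text and image amplifies these problems.
Method: GEA contains two parallel modules. Text-Guided Token Enhancement introduces diffusion-generated images as intermediate semantic representations to enrich text semantics and facilitate cross-modal alignment. Generative Intermediate Fusion combines cross-attention between generated images, original images, and text features to produce a unified representation optimized by a triplet alignment loss.
Result: Extensive experiments on three public TIPR datasets, CUHK-PEDES, RSTPReid, and ICFG-PEDES, validate the effectiveness of GEA and show significant performance gains.
Conclusion: The study shows that, from a generative perspective, using generated images as an intermediate semantic bridge can alleviate the semantic gap in cross-modal retrieval, offering a new approach to text-image alignment.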
📄 Abstract
Text-to-Image Person Retrieval (TIPR) aims to retrieve person images based on natural language descriptions. Although many TIPR methods have achieved promising results, sometimes textual queries cannot accurately and comprehensively reflect the content of the image, leading to poor cross-modal alignment and overfitting to limited datasets. Moreover, the inherent modality gap between text and image further amplifies these issues, making accurate cross-modal retrieval even more challenging. To address these limitations, we propose the Generation-Enhanced Alignment (GEA) from a generative perspective. GEA contains two parallel modules: (1) Text-Guided Token Enhancement (TGTE), which introduces diffusion-generated images as intermediate semantic representations to bridge the gap between text and visual patterns. These generated images enrich the semantic representation of text and facilitate cross-modal alignment. (2) Generative Intermediate Fusion (GIF), which combines cross-attention between generated images, original images, and text features to generate a unified representation optimized by triplet alignment loss. We conduct extensive experiments on three public TIPR datasets, CUHK-PEDES, RSTPReid, and ICFG-PEDES, to evaluate the performance of GEA. The results justify the effectiveness of our method. More implementation details and extended results are available at https://github.com/sugelamyd123/Sup-for-GEA.
[21] Adaptive Residual-Update Steering for Low-Overhead Hallucination Mitigation in Large Vision Language Models
Zhengtao Zou, Ya Gao, Jiarui Guan, Bin Li, Pekka Marttinen
🧩 TL;DR
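This paper proposes RUDDER, a framework that mitigates object hallucination in large vision-language models through residual-update directed decoding regulation, reaching performance comparable to state-of-the-art methods while staying computationally efficient. It extracts a visual evidence vector in a single forward pass and applies token-wise correction through an adaptive gate.
📘 Detailed Summary
Motivation: Large vision-language models commonly suffer from object hallucination, generating text inconsistent with the visual input, which seriously undermines their reliability. Existing inference-time interventions face a trade-off between efficiency and performance: methods that steer internal states or adjust output logits are effective but usually need extra forward passes, and the resulting overhead limits their use in latency-sensitive deployments.
Method: RUDDER rests on two key innovations. The first is the Contextual Activation Residual Direction (CARD) vector, a per-sample visual evidence vector extracted from the residual update of a self-attention layer during a single standard forward pass. The second is a Bayesian-inspired adaptive gate that performs token-wise injection, scaling the corrective signal according to how far the model deviates from the visual context.
Result: Extensive experiments on key hallucination benchmarks, including POPE and CHAIR, show that RUDDER performs comparably to state-of-the-art methods while adding negligible latency, supporting its practicality for improving LVLM reliability without a significant loss of efficiency.
Conclusion: RUDDER offers a practical and efficient solution to object hallucination in LVLMs. Its residual vector extraction and adaptive gating improve reliability while preserving computational efficiency, which makes it suitable for latency-sensitive deployments.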
📄 Abstract
Large Vision-Language Models (LVLMs) often suffer from object hallucination, generating text inconsistent with visual inputs, which can critically undermine their reliability. Existing inference-time interventions to mitigate this issue present a challenging trade-off: while methods that steer internal states or adjust output logits can be effective, they often incur substantial computational overhead, typically requiring extra forward passes. This efficiency bottleneck can limit their practicality for real-world, latency-sensitive deployments. In this work, we aim to address this trade-off with Residual-Update Directed DEcoding Regulation (RUDDER), a low-overhead framework that steers LVLMs towards visually-grounded generation. RUDDER is built on two key innovations: (1) Contextual Activation Residual Direction (CARD) vector, a per-sample visual evidence vector extracted from the residual update of a self-attention layer during a single, standard forward pass. (2) A Bayesian-inspired adaptive gate that performs token-wise injection, applying a corrective signal whose strength is conditioned on the model's deviation from the visual context. Extensive experiments on key hallucination benchmarks, including POPE and CHAIR, indicate that RUDDER achieves performance comparable to state-of-the-art methods while introducing negligible computational latency, validating RUDDER as a pragmatic and effective approach for improving LVLMs' reliability without a significant compromise on efficiency.
[22] DGFusion: Dual-guided Fusion for Robust Multi-Modal 3D Object Detection
Feiyang Jia, Caiyan Jia, Ailin Liu, Shaoqing Xu, Qiming Xia, Lin Liu, Lei Yang, Yan Gong, Ziying Song
🧩 TL;DR
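This paper proposes DGFusion, a multimodal 3D object detection method based on a dual-guided paradigm. It tackles hard instance detection through difficulty-aware instance pairing and dual-guided modules and improves performance on the nuScenes dataset.
📘 Detailed Summary
Motivation: Existing 3D object detection methods struggle with hard instances such as distant, small, or occluded objects, and these failures directly compromise the safety of autonomous driving systems. Existing multimodal methods usually follow a single-guided paradigm and do not account for how the information density of hard instances differs between modalities.
Method: The DGFusion framework centers on the Difficulty-aware Instance Pair Matcher (DIPM) and the Dual-guided Modules. DIPM performs instance-level feature matching based on difficulty to generate easy and hard instance pairs, and the Dual-guided Modules exploit both pair types for effective multimodal feature fusion.
Result: On nuScenes, DGFusion improves over the baseline methods by +1.0% mAP, +0.8% NDS, and +1.3% average recall. Extensive experiments show consistent robustness gains for hard instance detection across ego-distance, size, visibility, and small-scale training scenarios.
Conclusion: The dual-guided paradigm overcomes the limitations of single-guided paradigms, and difficulty-aware instance pairing offers a new approach to multimodal feature fusion. The method provides an effective solution for hard instance detection in autonomous driving perception.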
📄 Abstract
As a critical task in autonomous driving perception systems, 3D object detection is used to identify and track key objects, such as vehicles and pedestrians. However, detecting distant, small, or occluded objects (hard instances) remains a challenge, which directly compromises the safety of autonomous driving systems. We observe that existing multi-modal 3D object detection methods often follow a single-guided paradigm, failing to account for the differences in information density of hard instances between modalities. In this work, we propose DGFusion, based on the Dual-guided paradigm, which fully inherits the advantages of the Point-guide-Image paradigm and integrates the Image-guide-Point paradigm to address the limitations of the single paradigms. The core of DGFusion, the Difficulty-aware Instance Pair Matcher (DIPM), performs instance-level feature matching based on difficulty to generate easy and hard instance pairs, while the Dual-guided Modules exploit the advantages of both pair types to enable effective multi-modal feature fusion. Experimental results demonstrate that our DGFusion outperforms the baseline methods, with respective improvements of +1.0% mAP, +0.8% NDS, and +1.3% average recall on nuScenes. Extensive experiments demonstrate consistent robustness gains for hard instance detection across ego-distance, size, visibility, and small-scale training scenarios.
[23] Rethinking Visual Information Processing in Multimodal LLMs
Dongwan Kim, Viresh Ranjan, Takashi Nagata, Arnab Dhua, Amit Kumar K C
🧩 TL;DR
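LLaViT is a new approach to vision-language modeling that extends the large language model into a vision transformer. Three key modifications let the LLM also act as a vision encoder, and the model significantly outperforms the LLaVA baseline across multiple benchmarks.
📘 Detailed Summary
Motivation: Despite the success of the LLaVA architecture on vision-language tasks, its design inherently struggles to integrate visual features effectively because of the mismatch between text and vision modalities, which limits how fully the model can use visual information.
Method: LLaViT lets the LLM function as a vision encoder through three key modifications: learning separate QKV projections for the vision modality, enabling bidirectional attention on visual tokens, and incorporating both global and local visual representations. Together they yield a unified vision-language framework.
Result: In extensive controlled experiments, LLaViT significantly outperforms the LLaVA baseline on many benchmarks and even surpasses models with twice its parameter count.
Conclusion: The work establishes the use of LLMs as extended vision transformers as a more effective approach to vision-language modeling, shows the potential of unified architectures for multimodal tasks, and suggests a direction for future multimodal model design.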
📄 Abstract
Despite the remarkable success of the LLaVA architecture for vision-language tasks, its design inherently struggles to effectively integrate visual features due to the inherent mismatch between text and vision modalities. We tackle this issue from a novel perspective in which the LLM not only serves as a language model but also a powerful vision encoder. To this end, we present LLaViT - Large Language Models as extended Vision Transformers - which enables the LLM to simultaneously function as a vision encoder through three key modifications: (1) learning separate QKV projections for vision modality, (2) enabling bidirectional attention on visual tokens, and (3) incorporating both global and local visual representations. Through extensive controlled experiments on a wide range of LLMs, we demonstrate that LLaViT significantly outperforms the baseline LLaVA method on a multitude of benchmarks, even surpassing models with double its parameter count, establishing a more effective approach to vision-language modeling.
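Modification (2), bidirectional attention on visual tokens, amounts to a change in the attention mask. The sketch below builds such a mask from a causal baseline, assuming that all visual tokens belong to a single image and that text tokens stay strictly causal; the function name and the token layout are illustrative and are not taken from the LLaViT code.

```python
def build_attention_mask(is_visual):
    """Return mask[i][j] == True when token i may attend to token j.

    is_visual[i] marks whether position i holds a visual token.
    """
    n = len(is_visual)
    # Causal baseline: each token sees itself and earlier positions.
    mask = [[j <= i for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(n):
            if is_visual[i] and is_visual[j]:
                mask[i][j] = True  # visual tokens see each other in both directions
    return mask


# Hypothetical sequence: one text token, three visual tokens, one text token.
layout = [False, True, True, True, False]
mask = build_attention_mask(layout)
```

With this layout the first visual token can attend to the last visual token, while no token can attend to a later text token, so autoregressive text generation is unaffected.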
[24] FreDFT: Frequency Domain Fusion Transformer for Visible-Infrared Object Detection
Wencong Wu, Xiuwei Zhang, Hanlin Yin, Shun Dai, Hongxi Zhang, Yanning Zhang
🧩 TL;DR
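This paper proposes FreDFT, a frequency domain fusion transformer that addresses modality information imbalance in visible-infrared object detection by mining cross-modal complementary information in the frequency domain, significantly improving detection in complex scenes.
📘 Detailed Summary
Motivation: Existing visible-infrared object detection methods suffer from modality information imbalance, which leads to inadequate cross-modal fusion and degraded detection performance. Most methods also apply transformers only in the spatial domain and overlook the advantages of frequency domain transformers for mining complementary information.
Method: FreDFT comprises four components: a multimodal frequency domain attention (MFDA) that mines complementary information between modalities; a frequency domain feed-forward layer (FDFFL) that enhances multimodal features through a mixed-scale frequency feature fusion strategy; a cross-modal global modeling module (CGMM) that performs pixel-wise inter-modal feature interaction along spatial and channel dimensions; and a local feature enhancement module (LFEM) that strengthens multimodal local feature representation using various convolution layers and channel shuffle.
Result: Extensive experiments on multiple public datasets show that FreDFT achieves excellent performance compared with other state-of-the-art methods, supporting the effectiveness of the frequency domain fusion strategy.
Conclusion: Frequency domain transformers offer clear advantages for visible-infrared object detection. They help address modality information imbalance, and frequency domain attention with mixed-scale fusion mines cross-modal complementary information more effectively, providing a new technical route for multimodal object detection.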
📄 Abstract
Visible-infrared object detection has gained considerable attention due to its detection performance in low-light, fog, and rain conditions. However, visible and infrared modalities captured by different sensors suffer from an information imbalance problem in complex scenarios, which can cause inadequate cross-modal fusion and degrade detection performance. Furthermore, most existing methods use transformers in the spatial domain to capture complementary features, ignoring the advantages of developing frequency domain transformers to mine complementary information. To address these weaknesses, we propose a frequency domain fusion transformer, called FreDFT, for visible-infrared object detection. The proposed approach employs a novel multimodal frequency domain attention (MFDA) to mine complementary information between modalities, and a frequency domain feed-forward layer (FDFFL) based on a mixed-scale frequency feature fusion strategy is designed to better enhance multimodal features. To eliminate the imbalance of multimodal information, a cross-modal global modeling module (CGMM) is constructed to perform pixel-wise inter-modal feature interaction in a spatial and channel manner. Moreover, a local feature enhancement module (LFEM) is developed to strengthen multimodal local feature representation and promote multimodal feature fusion by using various convolution layers and applying a channel shuffle. Extensive experimental results have verified that our proposed FreDFT achieves excellent performance on multiple public datasets compared with other state-of-the-art methods. The code of our FreDFT is linked at https://github.com/WenCongWu/FreDFT.
[25] MonkeyOCR v1.5 Technical Report: Unlocking Robust Document Parsing for Complex Patterns
Jiarui Zhang, Yuliang Liu, Zijun Wu, Guosheng Pang, Zhili Ye, Yupei Zhong, Junteng Ma, Tao Wei, Haiyang Xu, Weikai Chen, Zeen Wang, Qiangjun Ji, Fanxi Zhou, Qi Zhang, Yuanrui Hu, Jiahao Liu, Zhang Li, Ziyang Zhang, Qiang Liu, Xiang Bai
🧩 TL;DR
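This paper presents MonkeyOCR v1.5, a unified vision-language framework that strengthens document layout understanding and content recognition through a two-stage parsing pipeline and achieves state-of-the-art performance on documents with complex layouts.
📘 Detailed Summary
Motivation: Real-world documents often contain complex layouts such as multi-level tables, embedded images or formulas, and cross-page structures. These remain challenging for existing OCR systems and call for a unified solution that handles both layout understanding and content recognition.
Method: The pipeline has two stages. The first uses a large multimodal model to jointly predict document layout and reading order; the second performs localized recognition of text, formulas, and tables within the detected regions. For complex table structures, the authors propose a visual consistency-based reinforcement learning scheme, together with two specialized modules, Image-Decoupled Table Parsing and Type-Guided Table Merging.
Result: Comprehensive experiments on OmniDocBench v1.5 show that MonkeyOCR v1.5 achieves state-of-the-art performance, outperforming PPOCR-VL and MinerU 2.5, with exceptional robustness on visually complex documents.
Conclusion: The study shows that a unified vision-language framework is effective for complex document parsing. By combining layout understanding with content recognition and adding methods aimed at complex table structures, it provides a more reliable solution for document intelligence applications.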
📄 Abstract
Document parsing is a core task in document intelligence, supporting applications such as information extraction, retrieval-augmented generation, and automated document analysis. However, real-world documents often feature complex layouts with multi-level tables, embedded images or formulas, and cross-page structures, which remain challenging for existing OCR systems. We introduce MonkeyOCR v1.5, a unified vision-language framework that enhances both layout understanding and content recognition through a two-stage parsing pipeline. The first stage employs a large multimodal model to jointly predict document layout and reading order, leveraging visual information to ensure structural and sequential consistency. The second stage performs localized recognition of text, formulas, and tables within detected regions, maintaining high visual fidelity while reducing error propagation. To address complex table structures, we propose a visual consistency-based reinforcement learning scheme that evaluates recognition quality via render-and-compare alignment, improving structural accuracy without manual annotations. Additionally, two specialized modules, Image-Decoupled Table Parsing and Type-Guided Table Merging, are introduced to enable reliable parsing of tables containing embedded images and reconstruction of tables crossing pages or columns. Comprehensive experiments on OmniDocBench v1.5 demonstrate that MonkeyOCR v1.5 achieves state-of-the-art performance, outperforming PPOCR-VL and MinerU 2.5 while showing exceptional robustness in visually complex document scenarios.
[26] MuSc-V2: Zero-Shot Multimodal Industrial Anomaly Classification and Segmentation with Mutual Scoring of Unlabeled Samples
Xurui Li, Feng Xue, Yu Zhou
🧩 TL;DR
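This paper proposes the MuSc-V2 framework, which exploits the fact that normal image patches find many similar patches in both 2D appearance and 3D shape, and builds a mutual scoring mechanism for zero-shot anomaly classification and segmentation that significantly outperforms existing methods on several datasets.
📘 Detailed Summary
Motivation: Existing zero-shot anomaly detection methods overlook a key property: normal image patches across industrial products typically find many similar patches in both 2D appearance and 3D shape, while anomalies remain diverse and isolated. This work aims to exploit that discriminative property explicitly.
Method: The Mutual Scoring framework MuSc-V2 comprises Iterative Point Grouping to improve 3D representation, Similarity Neighborhood Aggregation with Multi-Degrees to fuse 2D/3D features, a Mutual Scoring Mechanism for scoring between samples, Cross-modal Anomaly Enhancement to recover missed anomalies, and Re-scoring with Constrained Neighborhood to suppress false classifications.
Result: MuSc-V2 achieves a +23.7% AP gain on the MVTec 3D-AD dataset and a +19.3% boost on the Eyecandies dataset, surpassing previous zero-shot benchmarks and even outperforming most few-shot methods.
Conclusion: The framework shows that exploiting the similarity among normal samples is effective for zero-shot anomaly detection. Multimodal fusion and mutual scoring give robust performance and a flexible, efficient solution for industrial defect detection.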
📄 Abstract
Zero-shot anomaly classification (AC) and segmentation (AS) methods aim to identify and outline defects without using any labeled samples. In this paper, we reveal a key property that is overlooked by existing methods: normal image patches across industrial products typically find many other similar patches, not only in 2D appearance but also in 3D shapes, while anomalies remain diverse and isolated. To explicitly leverage this discriminative property, we propose a Mutual Scoring framework (MuSc-V2) for zero-shot AC/AS, which flexibly supports single 2D/3D or multimodality. Specifically, our method begins by improving 3D representation through Iterative Point Grouping (IPG), which reduces false positives from discontinuous surfaces. Then we use Similarity Neighborhood Aggregation with Multi-Degrees (SNAMD) to fuse 2D/3D neighborhood cues into more discriminative multi-scale patch features for mutual scoring. The core comprises a Mutual Scoring Mechanism (MSM) that lets samples within each modality assign scores to each other, and Cross-modal Anomaly Enhancement (CAE) that fuses 2D and 3D scores to recover modality-specific missing anomalies. Finally, Re-scoring with Constrained Neighborhood (RsCon) suppresses false classification based on similarity to more representative samples. Our framework flexibly works on both the full dataset and smaller subsets with consistently robust performance, ensuring seamless adaptability across diverse product lines. With this framework, MuSc-V2 achieves significant performance improvements: a +23.7% AP gain on the MVTec 3D-AD dataset and a +19.3% boost on the Eyecandies dataset, surpassing previous zero-shot benchmarks and even outperforming most few-shot methods. The code will be available at https://github.com/HUST-SLOW/MuSc-V2.
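The property that normal patches have close matches in other unlabeled samples, while anomalous patches do not, suggests a simple scoring rule. The sketch below is one possible reading of mutual scoring: each patch is scored by its mean nearest-neighbor distance to the patches of every other image. The averaging rule, the Euclidean distance, and the toy 2D features are assumptions made for illustration; the paper's MSM, SNAMD, CAE, and RsCon steps are not reproduced.

```python
import math


def mutual_scores(images):
    """Score every patch by how poorly the other images can match it.

    images: list of images, each a list of patch feature vectors.
    Returns a list of per-image lists of anomaly scores.
    """
    scores = []
    for i, patches in enumerate(images):
        image_scores = []
        for p in patches:
            # Distance from p to its nearest patch in each other image.
            nearest = [
                min(math.dist(p, q) for q in other)
                for j, other in enumerate(images)
                if j != i
            ]
            image_scores.append(sum(nearest) / len(nearest))
        scores.append(image_scores)
    return scores


# Three unlabeled toy images; the last one contains an isolated patch.
images = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.1], [1.0, 1.0]],
    [[0.0, 0.0], [5.0, 5.0]],
]
scores = mutual_scores(images)
```

The isolated patch at (5, 5) receives by far the highest score because no other image contains anything similar, while patches with close matches elsewhere score near zero. No labels are used at any point.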
[27] A Style is Worth One Code: Unlocking Code-to-Style Image Generation with Discrete Style Space
Huijie Liu, Shuhao Cui, Haoxiang Cao, Shuai Ma, Kai Wu, Guoliang Kang
🧩 TL;DR
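This paper proposes CoTyle, the first open-source method for code-to-style image generation. It produces images with novel and consistent visual styles from a numerical style code alone, addressing the limits of existing methods in style consistency and creativity.
📘 Detailed Summary
Motivation: Existing generative methods rely on lengthy text prompts, reference images, or parameter-efficient fine-tuning to guide style-aware image generation, but they suffer from poor style consistency, limited creativity, and complex style representations. Generating novel visual styles from a numerical code alone has been explored by industry but has seen no open-source research from the academic community.
Method: A discrete style codebook is first trained on a collection of images to extract style embeddings, which condition a text-to-image diffusion model to generate stylized images. An autoregressive style generator is then trained on the discrete style embeddings to model their distribution, so that novel style embeddings can be synthesized. At inference, the style generator maps a numerical style code to a unique style embedding, which guides the diffusion model to generate images in the corresponding style.
Result: Extensive experiments show that CoTyle effectively turns a numerical code into a style controller, demonstrating that a style is worth one code. The method performs well on style consistency and diversity and unlocks a vast space of reproducible styles from minimal input.
Conclusion: The study shows that a numerical style code can effectively control visual style generation, bringing simplicity and diversity to artistic creation. The work fills the academic gap in code-to-style generation and opens a direction for future research on style-controllable image generation.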
📄 Abstract
Innovative visual stylization is a cornerstone of artistic creation, yet generating novel and consistent visual styles remains a significant challenge. Existing generative approaches typically rely on lengthy textual prompts, reference images, or parameter-efficient fine-tuning to guide style-aware image generation, but often struggle with style consistency, limited creativity, and complex style representations. In this paper, we affirm that a style is worth one numerical code by introducing the novel task, code-to-style image generation, which produces images with novel, consistent visual styles conditioned solely on a numerical style code. To date, this field has only been primarily explored by the industry (e.g., Midjourney), with no open-source research from the academic community. To fill this gap, we propose CoTyle, the first open-source method for this task. Specifically, we first train a discrete style codebook from a collection of images to extract style embeddings. These embeddings serve as conditions for a text-to-image diffusion model (T2I-DM) to generate stylistic images. Subsequently, we train an autoregressive style generator on the discrete style embeddings to model their distribution, allowing the synthesis of novel style embeddings. During inference, a numerical style code is mapped to a unique style embedding by the style generator, and this embedding guides the T2I-DM to generate images in the corresponding style. Unlike existing methods, our method offers unparalleled simplicity and diversity, unlocking a vast space of reproducible styles from minimal input. Extensive experiments validate that CoTyle effectively turns a numerical code into a style controller, demonstrating a style is worth one code.
[28] Image Aesthetic Reasoning via HCM-GRPO: Empowering Compact Model for Superior Performance
Zhiyuan Hu, Zheng Sun, Yi Wei, Long Yu
🧩 TL;DR
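This paper proposes a complete solution for image screening, consisting of a large-scale dataset and the HCM-GRPO method. It significantly improves the image aesthetic reasoning of multimodal large language models and surpasses existing open-source and closed-source models.
📘 Detailed Summary
Motivation: Research on image screening is scarce, and multimodal large language models perform poorly at image aesthetic reasoning, mainly because of the lack of dedicated data and weak aesthetic reasoning ability. This work addresses both the data and the methodology.
Method: The authors build a comprehensive image screening dataset with over 128k samples and about 640k images, evaluated along four aspects: appearance deformation, physical shadow, placement layout, and extension rationality. They propose HCM-GRPO, which integrates a Hard Cases Mining strategy and a Dynamic Proportional Accuracy reward into the Group Relative Policy Optimization framework.
Result: Experiments show that even leading closed-source MLLMs such as GPT4o and Qwen-VL-Max perform close to random guessing on image aesthetic reasoning, whereas HCM-GRPO lets a much smaller model surpass the scores of large-scale open-source and leading closed-source models.
Conclusion: The study shows that dedicated datasets and reinforcement learning matter for improving the image aesthetic reasoning of MLLMs. It offers an effective solution for image quality assessment and shows that a small model with a better training method can outperform larger ones.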
📄 Abstract
The performance of image generation has been significantly improved in recent years. However, the study of image screening is rare and its performance with Multimodal Large Language Models (MLLMs) is unsatisfactory due to the lack of data and the weak image aesthetic reasoning ability in MLLMs. In this work, we propose a complete solution to address these problems in terms of data and methodology. For data, we collect a comprehensive image screening dataset with over 128k samples, about 640k images. Each sample consists of an original image and four generated images. The dataset evaluates the image aesthetic reasoning ability under four aspects: appearance deformation, physical shadow, placement layout, and extension rationality. Regarding data annotation, we investigate multiple approaches, including purely manual, fully automated, and answer-driven annotations, to acquire high-quality chains of thought (CoT) data in the most cost-effective manner. Methodologically, we integrate a Hard Cases Mining (HCM) strategy with a Dynamic Proportional Accuracy (DPA) reward into the Group Relative Policy Optimization (GRPO) framework, and call the result HCM-GRPO. This enhanced method demonstrates superior image aesthetic reasoning capabilities compared to the original GRPO. Our experimental results reveal that even state-of-the-art closed-source MLLMs, such as GPT4o and Qwen-VL-Max, exhibit performance akin to random guessing in image aesthetic reasoning. In contrast, by leveraging HCM-GRPO, we are able to surpass the scores of both large-scale open-source and leading closed-source models with a much smaller model.
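For readers unfamiliar with GRPO, its core step is to normalize each sampled response's reward against the other responses drawn for the same prompt. The sketch below shows that group-relative advantage, plus a purely hypothetical hard-case flag based on mean group reward. The threshold rule and the function names are assumptions for illustration; the abstract does not specify how HCM selects cases or how the DPA reward is computed, so neither is reproduced here.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style advantage: (r - group mean) / (group std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def is_hard_case(rewards, threshold=0.5):
    """Hypothetical mining rule: flag prompts whose mean reward is below a threshold."""
    return statistics.fmean(rewards) < threshold


# Rewards for four sampled responses to one prompt (1.0 = correct, 0.0 = wrong).
group = [1.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(group)
hard = is_hard_case(group)
```

The single correct response receives a positive advantage and the three wrong ones a negative advantage. When every response in a group gets the same reward, all advantages are zero and the group contributes no learning signal, which is one common motivation for mining harder or more informative cases.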
[29] When Eyes and Ears Disagree: Can MLLMs Discern Audio-Visual Confusion?
Qilang Ye, Wei Zeng, Meng Liu, Jie Zhang, Yupeng Hu, Zitong Yu, Yu Zhou
🧩 TL;DR
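This paper proposes RL-CoMM, a reinforcement learning-based collaborative multi-MLLM framework that addresses visually dominated reasoning in audio-visual confusion scenes and improves accuracy on audio-visual question answering by 10-30% over the baseline model.
📘 Detailed Summary
Motivation: Existing multimodal large language models show visually dominated reasoning in audio-visual confusion scenes. When an object is visually present in a video but its audio is absent, the model struggles to recognize that the audio does not exist, which limits reasoning in complex multimodal environments.
Method: RL-CoMM has two stages. First, a Large Audio Language Model is introduced as a reference model to generate audio-only reasoning, and a Step-wise Reasoning Reward function lets the MLLM self-improve its audio-visual reasoning. Second, Answer-centered Confidence Optimization reduces the uncertainty caused by potential heterogeneous reasoning differences.
Result: Extensive experiments on audio-visual question answering and audio-visual hallucination show that RL-CoMM improves accuracy by 10-30% over the baseline model with limited training data, mitigating audio judgment errors caused by visually dominated reasoning.
Conclusion: The study exposes a visual dominance bias in the audio-visual reasoning of MLLMs. RL-CoMM mitigates it through reinforcement learning and multi-model collaboration, offering a new approach to balanced reasoning in multimodal models.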
📄 Abstract
Can Multimodal Large Language Models (MLLMs) discern confused objects that are visually present but audio-absent? To study this, we introduce a new benchmark, AV-ConfuseBench, which simulates an "Audio-Visual Confusion" scene by modifying the corresponding sound of an object in the video, e.g., muting the sounding object and asking MLLMs "Is there a/an muted-object sound". Experimental results reveal that MLLMs, such as Qwen2.5-Omni and Gemini 2.5, struggle to discriminate non-existent audio due to visually dominated reasoning. Motivated by this observation, we introduce RL-CoMM, a Reinforcement Learning-based Collaborative Multi-MLLM that is built upon the Qwen2.5-Omni foundation. RL-CoMM includes two stages: 1) To alleviate visually dominated ambiguities, we introduce an external model, a Large Audio Language Model (LALM), as the reference model to generate audio-only reasoning. Then, we design a Step-wise Reasoning Reward function that enables MLLMs to self-improve audio-visual reasoning with the audio-only reference. 2) To ensure an accurate answer prediction, we introduce Answer-centered Confidence Optimization to reduce the uncertainty of potential heterogeneous reasoning differences. Extensive experiments on audio-visual question answering and audio-visual hallucination show that RL-CoMM improves the accuracy by 10-30% over the baseline model with limited training data. Follow: https://github.com/rikeilong/AVConfusion.
[30] VLF-MSC: Vision-Language Feature-Based Multimodal Semantic Communication System
Gwangyeon Ahn, Jiwan Seo, Joonhyuk Kang
🧩 TL;DR
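This paper proposes VLF-MSC, a vision-language feature-based multimodal semantic communication system that transmits a single compact vision-language representation to support both image and text generation at the receiver, significantly improving spectral efficiency and semantic fidelity.
📘 Detailed Summary
Motivation: Existing semantic communication techniques usually process each modality separately, which requires modality-specific streams or retransmissions and lowers spectral efficiency. This work addresses low resource utilization and insufficient semantic fidelity in multimodal communication by using one unified vision-language representation for both image and text generation.
Method: VLF-MSC uses a pre-trained vision-language model to encode the source image into a vision-language semantic feature, which is transmitted over the wireless channel. At the receiver, a decoder-based language model and a diffusion-based image generator are both conditioned on this feature to produce a descriptive text and a semantically aligned image.
Result: Experiments show that VLF-MSC outperforms text-only and image-only baselines under low SNR, achieving higher semantic accuracy for both modalities with significantly reduced bandwidth. The system is robust to channel noise while preserving semantic fidelity.
Conclusion: The study shows that a unified vision-language representation is effective for multimodal semantic communication and offers a new paradigm for efficient multimodal communication in resource-constrained settings. Building on foundation models opens a direction for robust and efficient semantic communication.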
📄 Abstract
We propose Vision-Language Feature-based Multimodal Semantic Communication (VLF-MSC), a unified system that transmits a single compact vision-language representation to support both image and text generation at the receiver. Unlike existing semantic communication techniques that process each modality separately, VLF-MSC employs a pre-trained vision-language model (VLM) to encode the source image into a vision-language semantic feature (VLF), which is transmitted over the wireless channel. At the receiver, a decoder-based language model and a diffusion-based image generator are both conditioned on the VLF to produce a descriptive text and a semantically aligned image. This unified representation eliminates the need for modality-specific streams or retransmissions, improving spectral efficiency and adaptability. By leveraging foundation models, the system achieves robustness to channel noise while preserving semantic fidelity. Experiments demonstrate that VLF-MSC outperforms text-only and image-only baselines, achieving higher semantic accuracy for both modalities under low SNR with significantly reduced bandwidth.
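The "under low SNR" condition can be made concrete with a small channel simulation. The sketch below passes a feature vector through an additive white Gaussian noise (AWGN) channel at a chosen SNR in dB. AWGN is an assumption made here for illustration, since the abstract only says "wireless channel", and the sinusoidal feature vector is a stand-in for an encoder output, not anything produced by the paper's model.

```python
import math
import random


def awgn_channel(feature, snr_db, rng):
    """Add Gaussian noise so that signal power / noise power matches snr_db."""
    signal_power = sum(x * x for x in feature) / len(feature)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in feature]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


rng = random.Random(0)                          # fixed seed for repeatability
feature = [math.sin(i) for i in range(256)]     # stand-in for a transmitted feature
received_low = awgn_channel(feature, 0.0, rng)   # 0 dB: noise as strong as the signal
received_high = awgn_channel(feature, 20.0, rng) # 20 dB: noise power 100 times smaller
```

Comparing `mse(feature, received_low)` with `mse(feature, received_high)` shows how much more distortion the receiver-side generators must tolerate at low SNR, which is the regime where the abstract reports gains over text-only and image-only baselines.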
[31] GridPrune: From "Where to Look" to "What to Select" in Visual Token Pruning for MLLMs
Yuxiang Duan, Ao Li, Yingqin Li, Luyu Li, Pengwei Wang
🧩 TL;DR
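This paper proposes GridPrune, which addresses inefficient spatial allocation in visual token pruning for MLLMs with a "guide-globally, select-locally" zonal selection system. On LLaVA-NeXT-7B it retains 96.98% of the full performance while using only 11.1% of the tokens.
📘 Detailed Summary
Motivation: Existing visual token pruning methods for MLLMs mainly optimize "what to select" directly and overlook the "where to look" strategy described in cognitive science, which leads to inefficient spatial allocation, positional bias, and the retention of irrelevant or redundant tokens.
Method: GridPrune prunes in two steps: it first uses text-conditional guidance to dynamically allocate a token budget across spatial zones, then performs local selection within each budgeted zone, replacing the global Top-K mechanism.
Result: Experiments show that GridPrune achieves superior performance across various MLLM architectures. On LLaVA-NeXT-7B it outperforms the best baseline by 2.34% at the same pruning rate and retains 96.98% of the full performance using only 11.1% of the tokens.
Conclusion: The study shows that a two-stage pruning strategy modeled on human visual attention is effective and offers a new approach to MLLM efficiency: combine spatial budget allocation with local selection for smarter token pruning.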
📄 Abstract
Multimodal large language models (MLLMs) have shown remarkable capabilities in a wide range of vision-language tasks. However, the large number of visual tokens introduces significant computational overhead. To address this issue, visual token pruning has emerged as a key technique for enhancing the efficiency of MLLMs. In cognitive science, humans tend to first determine which regions of a scene to attend to ("where to look") before deciding which specific elements within those regions to process in detail ("what to select"). This two-stage strategy enables the visual system to efficiently allocate attention at a coarse spatial level before performing fine-grained selection. However, existing pruning methods primarily focus on directly optimizing "what to select", typically using attention scores or similarity metrics. They rarely consider "where to look", which has been shown to lead to inefficient spatial allocation, positional bias, and the retention of irrelevant or redundant tokens. In this paper, we propose GridPrune, a method that replaces the global Top-K mechanism with a "guide-globally, select-locally" zonal selection system. GridPrune splits the pruning process into two steps: first, it uses text-conditional guidance to dynamically allocate a token budget across spatial zones; and then, it performs local selection within each budgeted zone. Experimental results demonstrate that GridPrune achieves superior performance across various MLLM architectures. On LLaVA-NeXT-7B, GridPrune retains 96.98% of the full performance while using 11.1% of the tokens, outperforming the best-performing baseline by 2.34% at the same pruning rate.
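The two-step idea (allocate a budget per zone, then pick the top tokens inside each zone) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the proportional allocation rule, the largest-remainder rounding, and the names `allocate_budget` and `select_locally` are assumptions, and the zone scores stand in for whatever text-conditional guidance the method actually computes.

```python
def allocate_budget(zone_scores, total_budget):
    """Split total_budget across zones in proportion to their relevance scores."""
    total = sum(zone_scores)
    raw = [s * total_budget / total for s in zone_scores]
    budget = [int(r) for r in raw]
    # Give the tokens lost to rounding to the zones with the largest remainders.
    leftover = total_budget - sum(budget)
    by_remainder = sorted(range(len(raw)), key=lambda i: raw[i] - budget[i], reverse=True)
    for i in by_remainder[:leftover]:
        budget[i] += 1
    return budget


def select_locally(zone_tokens, budget):
    """Keep the top-k tokens inside each zone instead of a single global Top-K.

    zone_tokens: one list per zone of (token_id, score) pairs.
    A zone with fewer tokens than its budget simply keeps all of them.
    """
    kept = []
    for tokens, k in zip(zone_tokens, budget):
        top = sorted(tokens, key=lambda t: t[1], reverse=True)[:k]
        kept.extend(token_id for token_id, _ in top)
    return sorted(kept)


# Hypothetical 4-zone example with 4 tokens per zone.
zone_scores = [10, 6, 3, 1]  # assumed text-conditioned relevance per zone
zone_tokens = [
    [(0, 0.9), (1, 0.1), (2, 0.8), (3, 0.2)],
    [(4, 0.3), (5, 0.7), (6, 0.1), (7, 0.2)],
    [(8, 0.4), (9, 0.5), (10, 0.1), (11, 0.2)],
    [(12, 0.6), (13, 0.5), (14, 0.4), (15, 0.3)],
]
budget = allocate_budget(zone_scores, total_budget=4)
kept = select_locally(zone_tokens, budget)
print(budget, kept)
```

Note that the last zone holds tokens with fairly high scores but receives no budget, which is the behavioral difference from a global Top-K over all sixteen tokens.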
[32] SUGAR: Learning Skeleton Representation with Visual-Motion Knowledge for Action Recognition
Qilang Ye, Yu Zhou, Lian He, Jie Zhang, Xuanming Guo, Jiayu Zhang, Mingkui Tan, Weicheng Xie, Yue Sun, Tao Tan, Xiaochen Yuan, Ghada Khoriba, Zitong Yu
🧩 TL;DR
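This paper proposes SUGAR, a framework that uses large-scale video models as a knowledge base to generate visual-motion priors and supervises skeleton learning with them to obtain discrete representations, enabling an LLM without fine-tuning to understand skeleton sequences and perform action classification and description.
📘 Detailed Summary
Motivation: Two key questions arise: how can an LLM understand human skeleton data, and how can it distinguish among action categories? Existing methods for skeleton sequences suffer from insufficient representation learning and limited LLM understanding of skeletons, which calls for an effective skeleton representation learning paradigm.
Method: SUGAR first uses off-the-shelf large-scale video models as a knowledge base to generate visual and motion priors. It then supervises skeleton learning with this prior knowledge to obtain discrete representations. Finally, a pre-trained LLM with untouched weights interprets these representations and generates action targets and descriptions. A Temporal Query Projection (TQP) module is designed to continuously model long skeleton sequences.
Result: Experiments on several skeleton-based action classification benchmarks verify SUGAR's effectiveness. In zero-shot scenarios in particular, the method generalizes and adapts better than linear-based methods.
Conclusion: SUGAR combines the implicit knowledge of LLMs with skeleton representation learning, offering a new paradigm for skeleton-based action recognition. It shows that supervising skeleton learning with visual-motion priors improves LLM performance on action understanding, with especially strong generalization in zero-shot scenarios.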
📄 Abstract
Large Language Models (LLMs) hold rich implicit knowledge and powerful transferability. In this paper, we explore the combination of LLMs with the human skeleton to perform action classification and description. However, when treating LLM as a recognizer, two questions arise: 1) How can LLMs understand skeleton? 2) How can LLMs distinguish among actions? To address these problems, we introduce a novel paradigm named learning Skeleton representation with visUal-motion knowledGe for Action Recognition (SUGAR). In our pipeline, we first utilize off-the-shelf large-scale video models as a knowledge base to generate visual, motion information related to actions. Then, we propose to supervise skeleton learning through this prior knowledge to yield discrete representations. Finally, we use the LLM with untouched pre-training weights to understand these representations and generate the desired action targets and descriptions. Notably, we present a Temporal Query Projection (TQP) module to continuously model the skeleton signals with long sequences. Experiments on several skeleton-based action classification benchmarks demonstrate the efficacy of our SUGAR. Moreover, experiments on zero-shot scenarios show that SUGAR is more versatile than linear-based methods.
[33] MTAttack: Multi-Target Backdoor Attacks against Large Vision-Language Models
Zihan Wang, Guansong Pang, Wenjun Miao, Jin Zheng, Xiao Bai
🧩 TL;DR
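This paper proposes MTAttack, the first multi-target backdoor attack framework against large vision-language models (LVLMs). By jointly optimizing multiple triggers while ensuring their separability, it achieves high attack success rates and exposes a serious security vulnerability of LVLMs to multi-target backdoor attacks.
📘 Detailed Summary
Motivation: Existing backdoor attacks focus on single-target attacks, in which a specific trigger maps to a single malicious output. Multi-target backdoor attacks pose a greater threat in real-world applications, but they are hard to carry out in LVLMs because feature interference among triggers produces many incorrect trigger-target mappings.
Method: MTAttack uses a novel optimization method with two key constraints, a Proxy Space Partitioning constraint and a Trigger Prototype Anchoring constraint. It jointly optimizes multiple triggers in the latent space so that each trigger independently maps clean images to a unique proxy class while the triggers remain separable.
Result: Experiments on popular benchmarks show that MTAttack achieves a high success rate for multi-target attacks and substantially outperforms existing attack methods. The attack also generalizes well across datasets and is robust against backdoor defense strategies.
Conclusion: These findings highlight the vulnerability of LVLMs to multi-target backdoor attacks and the urgent need to mitigate such threats. They inform research on LVLM security and motivate the development of more robust defenses.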
📄 Abstract
Recent advances in Large Visual Language Models (LVLMs) have demonstrated impressive performance across various vision-language tasks by leveraging large-scale image-text pretraining and instruction tuning. However, the security vulnerabilities of LVLMs have become increasingly concerning, particularly their susceptibility to backdoor attacks. Existing backdoor attacks focus on single-target attacks, i.e., targeting a single malicious output associated with a specific trigger. In this work, we uncover multi-target backdoor attacks, where multiple independent triggers corresponding to different attack targets are added in a single pass of training, posing a greater threat to LVLMs in real-world applications. Executing such attacks in LVLMs is challenging since there can be many incorrect trigger-target mappings due to severe feature interference among different triggers. To address this challenge, we propose MTAttack, the first multi-target backdoor attack framework for enforcing accurate multiple trigger-target mappings in LVLMs. The core of MTAttack is a novel optimization method with two constraints, namely Proxy Space Partitioning constraint and Trigger Prototype Anchoring constraint. It jointly optimizes multiple triggers in the latent space, with each trigger independently mapping clean images to a unique proxy class while at the same time guaranteeing their separability. Experiments on popular benchmarks demonstrate a high success rate of MTAttack for multi-target attacks, substantially outperforming existing attack methods. Furthermore, our attack exhibits strong generalizability across datasets and robustness against backdoor defense strategies. These findings highlight the vulnerability of LVLMs to multi-target backdoor attacks and underscore the urgent need for mitigating such threats. Code is available at https://github.com/mala-lab/MTAttack.
[34] Explicit Temporal-Semantic Modeling for Dense Video Captioning via Context-Aware Cross-Modal Interaction
Mingda Jia, Weiliang Meng, Zenghuang Fu, Yiheng Li, Qi Zeng, Yifan Zhang, Ju Xin, Rongtao Xu, Jiguang Zhang, Xiaopeng Zhang
🧩 TL;DR
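This paper proposes CACMI, an explicit temporal-semantic modeling framework that achieves state-of-the-art performance on dense video captioning through cross-modal frame aggregation and context-aware feature enhancement. The framework captures the temporal coherence of event sequences and the comprehensive semantics of visual contexts.
📘 Detailed Summary
Motivation: Existing dense video captioning methods rely mainly on implicit modeling with frame-level or fragmented video features, and fail to capture temporal coherence across event sequences and comprehensive semantics within visual contexts. This limits the model's ability to understand and describe video content.
Method: CACMI has two core components. Cross-modal Frame Aggregation aggregates relevant frames through cross-modal retrieval to extract temporally coherent, event-aligned textual features. Context-aware Feature Enhancement uses query-guided attention to integrate visual dynamics with pseudo-event semantics.
Result: Extensive experiments on the ActivityNet Captions and YouCook2 datasets show that CACMI achieves state-of-the-art performance on dense video captioning, validating explicit temporal-semantic modeling.
Conclusion: The study demonstrates the importance of explicitly modeling the temporal characteristics and semantic context of videos, suggests a new research direction for dense video captioning, and underscores the key role of cross-modal interaction in understanding complex video content.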
📄 Abstract
Dense video captioning jointly localizes and captions salient events in untrimmed videos. Recent methods primarily focus on leveraging additional prior knowledge and advanced multi-task architectures to achieve competitive performance. However, these pipelines rely on implicit modeling that uses frame-level or fragmented video features, failing to capture the temporal coherence across event sequences and comprehensive semantics within visual contexts. To address this, we propose an explicit temporal-semantic modeling framework called Context-Aware Cross-Modal Interaction (CACMI), which leverages both latent temporal characteristics within videos and linguistic semantics from a text corpus. Specifically, our model consists of two core components: Cross-modal Frame Aggregation aggregates relevant frames to extract temporally coherent, event-aligned textual features through cross-modal retrieval; and Context-aware Feature Enhancement utilizes query-guided attention to integrate visual dynamics with pseudo-event semantics. Extensive experiments on the ActivityNet Captions and YouCook2 datasets demonstrate that CACMI achieves state-of-the-art performance on the dense video captioning task.
[35] HeatV2X: Scalable Heterogeneous Collaborative Perception via Efficient Alignment and Interaction
Yueran Zhao, Zhang Zhang, Chao Sun, Tianze Wang, Chao Yue, Nuoran Li
🧩 TL;DR
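This paper proposes HeatV2X, a heterogeneous adaptation framework that addresses heterogeneity and scalability in V2X collaborative perception through local heterogeneous fine-tuning and global collaborative fine-tuning, achieving superior perception performance at significantly lower training cost.
📘 Detailed Summary
Motivation: Existing V2X collaborative perception frameworks face two key challenges. First, participating agents are inherently multi-modal and heterogeneous, which requires effective cross-agent feature alignment to mitigate heterogeneity loss. Second, the collaborative framework must scale to accommodate new agents, which makes full-parameter training impractical and highlights the importance of scalable adaptation.
Method: HeatV2X first trains a high-performance agent based on heterogeneous graph attention as the foundation for collaborative learning. It then applies Local Heterogeneous Fine-Tuning and Global Collaborative Fine-Tuning to achieve effective alignment and interaction among heterogeneous agents. The local stage uses Hetero-Aware Adapters to efficiently extract modality-specific differences; the global stage uses a Multi-Cognitive Adapter to enhance cross-agent collaboration and fully exploit the fusion potential.
Result: Experiments on the OPV2V-H and DAIR-V2X datasets show that the method achieves superior perception performance with significantly reduced training overhead and outperforms existing state-of-the-art approaches.
Conclusion: The study shows that carefully designed adapters and a hierarchical fine-tuning strategy can substantially improve heterogeneous V2X collaborative perception while minimizing training cost. This offers a feasible path to large-scale real-world deployment and points toward scalable collaborative learning for future heterogeneous multi-agent systems.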
📄 Abstract
Vehicle-to-Everything (V2X) collaborative perception extends sensing beyond single vehicle limits through transmission. However, as more agents participate, existing frameworks face two key challenges: (1) the participating agents are inherently multi-modal and heterogeneous, and (2) the collaborative framework must be scalable to accommodate new agents. The former requires effective cross-agent feature alignment to mitigate heterogeneity loss, while the latter renders full-parameter training impractical, highlighting the importance of scalable adaptation. To address these issues, we propose Heterogeneous Adaptation (HeatV2X), a scalable collaborative framework. We first train a high-performance agent based on heterogeneous graph attention as the foundation for collaborative learning. Then, we design Local Heterogeneous Fine-Tuning and Global Collaborative Fine-Tuning to achieve effective alignment and interaction among heterogeneous agents. The former efficiently extracts modality-specific differences using Hetero-Aware Adapters, while the latter employs the Multi-Cognitive Adapter to enhance cross-agent collaboration and fully exploit the fusion potential. These designs enable substantial performance improvement of the collaborative framework with minimal training cost. We evaluate our approach on the OPV2V-H and DAIR-V2X datasets. Experimental results demonstrate that our method achieves superior perception performance with significantly reduced training overhead, outperforming existing state-of-the-art approaches. Our implementation will be released soon.
[36] Next-Frame Feature Prediction for Multimodal Deepfake Detection and Temporal Localization
Ashutosh Anshul, Shreyas Gopal, Deepu Rajan, Eng Siong Chng
🧩 TL;DR
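This paper proposes a single-stage training framework that improves the generalization of deepfake detection by incorporating next-frame prediction for both uni-modal and cross-modal features. A window-level attention mechanism captures discrepancies between predicted and actual frames, enabling accurate classification of fully manipulated videos and precise temporal localization in partially spoofed samples.
📘 Detailed Summary
Motivation: Existing generalization-oriented multimodal deepfake detection methods require pretraining on real samples and focus mainly on audio-visual inconsistencies. They may overlook intra-modal artifacts in manipulations that preserve audio-visual alignment, and therefore fail to detect those manipulations.
Method: The proposed single-stage training framework incorporates next-frame prediction for both uni-modal and cross-modal features to improve generalization. A window-level attention mechanism captures discrepancies between predicted and actual frames, allowing the model to detect local artifacts around each frame.
Result: Evaluation on multiple benchmark datasets shows strong generalization and precise temporal localization.
Conclusion: The study demonstrates the effectiveness of single-stage training for deepfake detection. With next-frame prediction and window-level attention, the model handles both fully manipulated videos and partially spoofed samples, offering a new technical route for multimodal deepfake detection.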
📄 Abstract
Recent multimodal deepfake detection methods designed for generalization conjecture that single-stage supervised training struggles to generalize across unseen manipulations and datasets. However, such approaches that target generalization require pretraining over real samples. Additionally, these methods primarily focus on detecting audio-visual inconsistencies and may overlook intra-modal artifacts, causing them to fail against manipulations that preserve audio-visual alignment. To address these limitations, we propose a single-stage training framework that enhances generalization by incorporating next-frame prediction for both uni-modal and cross-modal features. Additionally, we introduce a window-level attention mechanism to capture discrepancies between predicted and actual frames, enabling the model to detect local artifacts around every frame, which is crucial for accurately classifying fully manipulated videos and effectively localizing deepfake segments in partially spoofed samples. Our model, evaluated on multiple benchmark datasets, demonstrates strong generalization and precise temporal localization.
[37] TubeRMC: Tube-conditioned Reconstruction with Mutual Constraints for Weakly-supervised Spatio-Temporal Video Grounding
Jinxuan Li, Yi Zhang, Jian-Fang Hu, Chaolei Tan, Tianming Liang, Beihao Xia
🧩 TL;DR
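This paper proposes TubeRMC, a framework that uses tube-conditioned reconstruction with mutual constraints to address target identification errors and inconsistent tracking in weakly-supervised spatio-temporal video grounding. It outperforms existing methods on the VidSTG and HCSTVG benchmarks.
📘 Detailed Summary
Motivation: Existing weakly-supervised spatio-temporal video grounding methods typically use simple late fusion and generate tubes independently of the text description, which leads to failed target identification and inconsistent tracking. A more effective text-tube interaction mechanism is needed.
Method: TubeRMC uses pre-trained visual grounding models to generate text-conditioned candidate tubes. Tube-conditioned reconstruction then captures rich tube-text correspondences from temporal, spatial, and spatio-temporal perspectives, and mutual constraints between spatial and temporal proposals improve reconstruction quality.
Result: On the two public benchmarks VidSTG and HCSTVG, TubeRMC outperforms existing methods. Visualizations show that it mitigates both target identification errors and inconsistent tracking.
Conclusion: Tube-conditioned reconstruction with mutual constraints substantially improves weakly-supervised spatio-temporal video grounding and provides an effective solution for complex vision-language understanding and spatio-temporal reasoning. It could be extended to other multimodal grounding tasks.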
📄 Abstract
Spatio-Temporal Video Grounding (STVG) aims to localize a spatio-temporal tube that corresponds to a given language query in an untrimmed video. This is a challenging task since it involves complex vision-language understanding and spatiotemporal reasoning. Recent works have explored the weakly-supervised setting in STVG to eliminate reliance on fine-grained annotations like bounding boxes or temporal stamps. However, they typically follow a simple late-fusion manner, which generates tubes independent of the text description, often resulting in failed target identification and inconsistent target tracking. To address this limitation, we propose a Tube-conditioned Reconstruction with Mutual Constraints (TubeRMC) framework that generates text-conditioned candidate tubes with pre-trained visual grounding models and further refines them via tube-conditioned reconstruction with spatio-temporal constraints. Specifically, we design three reconstruction strategies from temporal, spatial, and spatio-temporal perspectives to comprehensively capture rich tube-text correspondences. Each strategy is equipped with a Tube-conditioned Reconstructor, utilizing spatio-temporal tubes as the condition to reconstruct the key clues in the query. We further introduce mutual constraints between spatial and temporal proposals to enhance their quality for reconstruction. TubeRMC outperforms existing methods on two public benchmarks, VidSTG and HCSTVG. Further visualization shows that TubeRMC effectively mitigates both target identification errors and inconsistent tracking.
[38] Facial-R1: Aligning Reasoning and Recognition for Facial Emotion Analysis
Jiulong Wu, Yucheng Shen, Lingyong Yan, Haixin Sun, Deguo Xia, Jizhou Huang, Min Cao
🧩 TL;DR
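This paper proposes Facial-R1, a three-stage alignment framework that addresses hallucinated reasoning and recognition-reasoning misalignment in facial emotion analysis. It achieves state-of-the-art performance on eight benchmarks and introduces the FEA-20K dataset.
📘 Detailed Summary
Motivation: Current VLM-based facial emotion analysis methods have two key limitations. The first is hallucinated reasoning: lacking sufficient emotion-specific knowledge, models generate plausible but inaccurate explanations. The second is misalignment between emotion reasoning and recognition, caused by fragmented connections between observed facial features and final labels.
Method: The three-stage alignment framework first uses instruction fine-tuning to establish basic emotional reasoning capability. It then applies reinforcement training with emotion and action unit (AU) labels as reward signals, explicitly aligning the generated reasoning with the predicted emotion. Finally, a data synthesis pipeline iteratively leverages the earlier stages to expand the training dataset, enabling scalable self-improvement.
Result: Extensive experiments on eight standard benchmarks show that Facial-R1 achieves state-of-the-art performance in facial emotion analysis, with strong generalization and robust interpretability. The paper also introduces FEA-20K, a benchmark dataset with 17,737 training and 1,688 test samples.
Conclusion: The study shows that a three-stage alignment framework can address the key challenges in facial emotion analysis, achieving continued self-improvement with minimal supervision. It provides a new technical route and a benchmark resource for fine-grained emotion understanding.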
📄 Abstract
Facial Emotion Analysis (FEA) extends traditional facial emotion recognition by incorporating explainable, fine-grained reasoning. The task integrates three subtasks: emotion recognition, facial Action Unit (AU) recognition, and AU-based emotion reasoning to model affective states jointly. While recent approaches leverage Vision-Language Models (VLMs) and achieve promising results, they face two critical limitations: (1) hallucinated reasoning, where VLMs generate plausible but inaccurate explanations due to insufficient emotion-specific knowledge; and (2) misalignment between emotion reasoning and recognition, caused by fragmented connections between observed facial features and final labels. We propose Facial-R1, a three-stage alignment framework that effectively addresses both challenges with minimal supervision. First, we employ instruction fine-tuning to establish basic emotional reasoning capability. Second, we introduce reinforcement training guided by emotion and AU labels as reward signals, which explicitly aligns the generated reasoning process with the predicted emotion. Third, we design a data synthesis pipeline that iteratively leverages the prior stages to expand the training dataset, enabling scalable self-improvement of the model. Built upon this framework, we introduce FEA-20K, a benchmark dataset comprising 17,737 training and 1,688 test samples with fine-grained emotion analysis annotations. Extensive experiments across eight standard benchmarks demonstrate that Facial-R1 achieves state-of-the-art performance in FEA, with strong generalization and robust interpretability.
[39] PROPA: Toward Process-level Optimization in Visual Reasoning via Reinforcement Learning
Yanbei Jiang, Chao Lei, Yihao Ding, Krista Ehinger, Jey Han Lau
🧩 TL;DR
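This paper proposes PROPA, a framework that integrates Monte Carlo Tree Search (MCTS) with GRPO to generate dense process-level rewards, optimizing the intermediate reasoning steps of vision-language models without human annotations. It significantly outperforms existing methods across seven benchmarks.
📘 Detailed Summary
Motivation: Vision-language models suffer from cascading errors in complex visual reasoning because of multi-step dependencies. Supervised fine-tuning requires costly step-level annotations, while reinforcement learning with verifiable rewards, such as GRPO, provides only sparse outcome-level feedback, which hinders stable optimization.
Method: PROPA integrates MCTS with GRPO to generate dense process-level rewards and interleaves GRPO updates with supervised fine-tuning to overcome the cold-start problem. It also trains a process reward model to guide inference-time search, aligning test-time search with the training signal.
Result: Across seven benchmarks and four VLM backbones, PROPA consistently outperforms SFT-based and RL-based baselines, with gains of up to 17.0% on in-domain tasks and up to 21.0% on out-of-domain tasks.
Conclusion: PROPA delivers strong visual reasoning and generalization, demonstrating the effectiveness of process-level reasoning optimization. It offers a feasible approach to complex visual reasoning without human annotations and shows the benefit of interleaved training for overcoming the cold-start problem.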
📄 Abstract
Despite significant progress, Vision-Language Models (VLMs) still struggle with complex visual reasoning, where multi-step dependencies cause early errors to cascade through the reasoning chain. Existing post-training paradigms are limited: Supervised Fine-Tuning (SFT) relies on costly step-level annotations, while Reinforcement Learning with Verifiable Rewards (RLVR) methods like GRPO provide only sparse, outcome-level feedback, hindering stable optimization. We introduce PROPA (Process-level Reasoning Optimization with interleaved Policy Alignment), a novel framework that integrates Monte Carlo Tree Search (MCTS) with GRPO to generate dense, process-level rewards and optimize reasoning at each intermediate step without human annotations. To overcome the cold-start problem, PROPA interleaves GRPO updates with SFT, enabling the model to learn from both successful and failed reasoning trajectories. A Process Reward Model (PRM) is further trained to guide inference-time search, aligning the test-time search with the training signal. Across seven benchmarks and four VLM backbones, PROPA consistently outperforms both SFT- and RLVR-based baselines. It achieves up to 17.0% gains on in-domain tasks and 21.0% gains on out-of-domain tasks compared to existing state-of-the-art, establishing a strong reasoning and generalization capability for visual reasoning tasks. The code is available at: https://github.com/YanbeiJiang/PROPA.
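For readers unfamiliar with GRPO, the sketch below shows the group-relative normalization it is generally described with: rewards for a group of sampled candidates are centered on the group mean and scaled by the group standard deviation. Applying it to per-step rewards, as done here, is only an assumed reading of "process-level" optimization; the reward values, the `eps` constant, and the function name are hypothetical, and PROPA's actual MCTS-derived reward may differ.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four candidate next steps sharing one reasoning prefix,
# e.g. estimated from search rollouts (an assumption, not the paper's values).
step_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(step_rewards)
print([round(a, 3) for a in advantages])
```

Candidates that beat the group average get a positive advantage and the rest a negative one, so no separate value network is needed to form the baseline.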
[40] CLIP4VI-ReID: Learning Modality-shared Representations via CLIP Semantic Bridge for Visible-Infrared Person Re-identification
Xiaomei Yang, Xizhan Gao, Sijie Niu, Fa Zhu, Guang Feng, Xiaofeng Qu, David Camacho
🧩 TL;DR
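This paper proposes CLIP4VI-ReID, a novel CLIP-driven modality-shared representation learning network. Through three modules, Text Semantic Generation, Infrared Feature Embedding, and High-level Semantic Alignment, it addresses the modality gap in visible-infrared person re-identification and achieves state-of-the-art performance on several benchmark datasets.
📘 Detailed Summary
Motivation: Visible and infrared images differ greatly in their physical characteristics, so conventional methods face a severe modality gap in visible-infrared person re-identification, and existing methods struggle to learn shared cross-modal representations.
Method: The framework has three stages. The Text Semantic Generation module generates text descriptions for visible images, achieving preliminary visible-text alignment. The Infrared Feature Embedding module uses the generated text semantics to rectify the feature embeddings of infrared images. The High-level Semantic Alignment module refines high-level semantic alignment so that the text semantics contain only identity-related information.
Result: Extensive experiments on several widely used VI-ReID datasets show that CLIP4VI-ReID outperforms other state-of-the-art methods.
Conclusion: Using text as a bridge, the method achieves indirect visible-infrared alignment and improves the discriminability of the learned modality-shared representations, providing an effective solution for cross-modal person re-identification.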
📄 Abstract
This paper proposes a novel CLIP-driven modality-shared representation learning network named CLIP4VI-ReID for the VI-ReID task, which consists of Text Semantic Generation (TSG), Infrared Feature Embedding (IFE), and High-level Semantic Alignment (HSA). Specifically, considering the huge gap in the physical characteristics between natural images and infrared images, the TSG is designed to generate text semantics only for visible images, thereby enabling preliminary visible-text modality alignment. Then, the IFE is proposed to rectify the feature embeddings of infrared images using the generated text semantics. This process injects id-related semantics into the shared image encoder, enhancing its adaptability to the infrared modality. Besides, with text serving as a bridge, it enables indirect visible-infrared modality alignment. Finally, the HSA is established to refine the high-level semantic alignment. This process ensures that the fine-tuned text semantics only contain id-related information, thereby achieving more accurate cross-modal alignment and enhancing the discriminability of the learned modal-shared representations. Extensive experimental results demonstrate that the proposed CLIP4VI-ReID outperforms other state-of-the-art methods on some widely used VI-ReID datasets.
[41] Learning to Tell Apart: Weakly Supervised Video Anomaly Detection via Disentangled Semantic Alignment
Wenti Yin, Huaxin Zhang, Xiang Wang, Yuqing Lu, Yicheng Zhang, Bingquan Gong, Jialong Zuo, Li Yu, Changxin Gao, Nong Sang
🧩 TL;DR
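This paper proposes the Disentangled Semantic Alignment Network (DSANet), which explicitly separates abnormal and normal features at coarse-grained and fine-grained levels. It addresses the neglect of normal-pattern mining and category confusion in weakly supervised video anomaly detection, and achieves state-of-the-art performance on the XD-Violence and UCF-Crime benchmarks.
📘 Detailed Summary
Motivation: Current weakly supervised video anomaly detection methods based on multimodal foundation models such as CLIP tend to detect the most salient response segments while neglecting to mine diverse normal patterns separated from anomalies. They are also prone to category confusion due to similar appearance, which yields unsatisfactory fine-grained classification.
Method: At the coarse-grained level, DSANet introduces a self-guided normality modeling branch that reconstructs input video features under the guidance of learned normal prototypes, encouraging the model to exploit normality cues inherent in the video. At the fine-grained level, it uses a decoupled contrastive semantic alignment mechanism that first temporally decomposes each video into event-centric and background-centric components using frame-level anomaly scores, then applies visual-language contrastive learning to enhance class-discriminative representations.
Result: Comprehensive experiments on two standard benchmarks, XD-Violence and UCF-Crime, show that DSANet outperforms existing state-of-the-art methods.
Conclusion: By explicitly separating abnormal and normal features at both levels, the study improves the discriminative ability of weakly supervised video anomaly detection and offers a new technical route for handling category confusion and mining normal patterns, advancing fine-grained anomaly detection.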
📄 Abstract
Recent advancements in weakly-supervised video anomaly detection have achieved remarkable performance by applying the multiple instance learning paradigm based on multimodal foundation models such as CLIP to highlight anomalous instances and classify categories. However, their objectives may tend to detect the most salient response segments, while neglecting to mine diverse normal patterns separated from anomalies, and are prone to category confusion due to similar appearance, leading to unsatisfactory fine-grained classification results. Therefore, we propose a novel Disentangled Semantic Alignment Network (DSANet) to explicitly separate abnormal and normal features from coarse-grained and fine-grained aspects, enhancing the distinguishability. Specifically, at the coarse-grained level, we introduce a self-guided normality modeling branch that reconstructs input video features under the guidance of learned normal prototypes, encouraging the model to exploit normality cues inherent in the video, thereby improving the temporal separation of normal patterns and anomalous events. At the fine-grained level, we present a decoupled contrastive semantic alignment mechanism, which first temporally decomposes each video into event-centric and background-centric components using frame-level anomaly scores and then applies visual-language contrastive learning to enhance class-discriminative representations. Comprehensive experiments on two standard benchmarks, namely XD-Violence and UCF-Crime, demonstrate that DSANet outperforms existing state-of-the-art methods.
[42] FOUND: Fourier-based von Mises Distribution for Robust Single Domain Generalization in Object Detection
Mengzhu Wang, Changyuan Deng, Shanshan Wang, Nan Yin, Long Lan, Liang Yang
🧩 TL;DR
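This paper proposes a novel framework for single domain generalization in object detection that integrates the von Mises-Fisher distribution and the Fourier transform into a CLIP-guided pipeline, significantly improving generalization to unseen target domains. The method preserves CLIP's semantic alignment benefits while enriching feature diversity and cross-domain structural consistency.
📘 Detailed Summary
Motivation: Single domain generalization methods for object detection have made progress through CLIP-based semantic augmentation, but they often overlook how the underlying structure of feature distributions and frequency-domain characteristics affect robustness. Existing methods do not adequately model directional feature distributions or frequency-domain perturbations, which limits generalization to unseen domains.
Method: The framework models the directional features of object representations with the von Mises-Fisher distribution to better capture domain-invariant semantic structure in the embedding space. It also introduces a Fourier-based augmentation strategy that perturbs amplitude and phase components to simulate domain shifts in the frequency domain, further improving feature robustness.
Result: Extensive experiments on the diverse weather-driving benchmark show that the method significantly outperforms existing state-of-the-art methods on single domain generalization for object detection.
Conclusion: The study shows the importance of combining directional feature modeling with frequency-domain augmentation for single domain generalization in object detection, and offers a new technical route toward more robust cross-domain vision systems. By considering both semantic structure and frequency-domain characteristics, the method provides an effective response to real-world domain shift.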
📄 Abstract
Single Domain Generalization (SDG) for object detection aims to train a model on a single source domain that can generalize effectively to unseen target domains. While recent methods like CLIP-based semantic augmentation have shown promise, they often overlook the underlying structure of feature distributions and frequency-domain characteristics that are critical for robustness. In this paper, we propose a novel framework that enhances SDG object detection by integrating the von Mises-Fisher (vMF) distribution and Fourier transformation into a CLIP-guided pipeline. Specifically, we model the directional features of object representations using vMF to better capture domain-invariant semantic structures in the embedding space. Additionally, we introduce a Fourier-based augmentation strategy that perturbs amplitude and phase components to simulate domain shifts in the frequency domain, further improving feature robustness. Our method not only preserves the semantic alignment benefits of CLIP but also enriches feature diversity and structural consistency across domains. Extensive experiments on the diverse weather-driving benchmark demonstrate that our approach outperforms the existing state-of-the-art method.
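To make the amplitude-and-phase perturbation concrete, here is a toy one-dimensional sketch: transform a signal, jitter the amplitude and phase of each frequency bin, and transform back. It is an illustration under stated assumptions, not the paper's augmentation: real images would use a 2D FFT, and the uniform jitter ranges, the naive DFT, and the name `fourier_perturb` are all hypothetical choices.

```python
import cmath
import math
import random


def dft(signal):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def fourier_perturb(signal, amp_jitter, phase_jitter, rng):
    """Randomly rescale amplitudes and shift phases, then return to the signal domain."""
    n = len(signal)
    spectrum = dft(signal)
    # Skip the DC bin (and the Nyquist bin for even n) so the mean is preserved.
    for k in range(1, (n + 1) // 2):
        amp = abs(spectrum[k]) * (1.0 + rng.uniform(-amp_jitter, amp_jitter))
        phase = cmath.phase(spectrum[k]) + rng.uniform(-phase_jitter, phase_jitter)
        spectrum[k] = cmath.rect(amp, phase)
        # Mirror the change so the spectrum stays conjugate-symmetric (real output).
        spectrum[n - k] = spectrum[k].conjugate()
    return [value.real for value in idft(spectrum)]


rng = random.Random(0)
row = [0.1, 0.4, 0.9, 0.7, 0.3, 0.2, 0.6, 0.8]  # hypothetical row of pixel intensities
augmented = fourier_perturb(row, amp_jitter=0.2, phase_jitter=0.1, rng=rng)
print([round(v, 3) for v in augmented])
```

Amplitude changes loosely correspond to style or illumination shifts, while phase carries most of the structure, which is why the phase jitter in this example is kept smaller.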
[43] MSGNav: Unleashing the Power of Multi-modal 3D Scene Graph for Zero-Shot Embodied Navigation
Xun Huang, Shijia Zhao, Yunxiang Wang, Xin Lu, Wanfa Zhang, Rongsheng Qu, Weixin Li, Yunhong Wang, Chenglu Wen
🧩 TL;DR
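This paper proposes the Multi-modal 3D Scene Graph (M3DSG) and MSGNav, a zero-shot navigation system built on it. By replacing text-only relations with dynamically assigned image edges that preserve visual cues, it addresses the loss of visual information and constrained vocabularies in existing zero-shot navigation methods, and achieves state-of-the-art performance on the GOAT-Bench and HM3D-OVON datasets.
📘 Detailed Summary
Motivation: Existing zero-shot navigation methods that build explicit 3D scene graphs usually compress rich visual observations into text-only relations. This leads to high construction cost, irreversible loss of visual evidence, and constrained vocabularies, and falls short of the open-vocabulary generalization and low training overhead that real-world deployment requires.
Method: M3DSG replaces textual relational edges with dynamically assigned images to preserve visual cues. MSGNav, the zero-shot navigation system built on it, includes a Key Subgraph Selection module, an Adaptive Vocabulary Update module, and a Closed-Loop Reasoning module, plus a Visibility-based Viewpoint Decision module for the last-mile problem.
Result: Comprehensive experiments on the GOAT-Bench and HM3D-OVON datasets show that MSGNav achieves state-of-the-art performance, validating the method for zero-shot navigation.
Conclusion: By preserving visual evidence, M3DSG addresses the limitations of existing zero-shot navigation methods. MSGNav offers a feasible solution for open-vocabulary navigation and explicitly resolves the last-mile problem, laying groundwork for practical deployment of zero-shot navigation.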
📄 Abstract
Embodied navigation is a fundamental capability for robotic agents operating. Real-world deployment requires open vocabulary generalization and low training overhead, motivating zero-shot methods rather than task-specific RL training. However, existing zero-shot methods that build explicit 3D scene graphs often compress rich visual observations into text-only relations, leading to high construction cost, irreversible loss of visual evidence, and constrained vocabularies. To address these limitations, we introduce the Multi-modal 3D Scene Graph (M3DSG), which preserves visual cues by replacing textual relational edges with dynamically assigned images. Built on M3DSG, we propose MSGNav, a zero-shot navigation system that includes a Key Subgraph Selection module for efficient reasoning, an Adaptive Vocabulary Update module for open vocabulary support, and a Closed-Loop Reasoning module for accurate exploration reasoning. Additionally, we further identify the last-mile problem in zero-shot navigation - determining the feasible target location with a suitable final viewpoint, and propose a Visibility-based Viewpoint Decision module to explicitly resolve it. Comprehensive experimental results demonstrate that MSGNav achieves state-of-the-art performance on GOAT-Bench and HM3D-OVON datasets. The open-source code will be publicly available.
[44] RodEpil: A Video Dataset of Laboratory Rodents for Seizure Detection and Benchmark Evaluation
Daniele Perlo, Vladimir Despotovic, Selma Boudissa, Sang-Yoon Kim, Petr Nazarov, Yanrong Zhang, Max Wintermark, Olivier Keunen
🧩 TL;DR
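This study presents a curated video dataset for automatic detection of convulsive events in laboratory rodents, containing 10,101 negative and 2,952 positive samples, and reports an average F1-score of 97% with a transformer-based video classifier. The dataset and baseline code are publicly released to support reproducible research on non-invasive video monitoring in preclinical epilepsy research.
📘 Detailed Summary
Motivation: Preclinical epilepsy research lacks standardized, publicly available video datasets for automatic detection of convulsive events. This study fills that gap with a carefully annotated video dataset of laboratory rodents, providing a reliable basis for developing non-invasive video monitoring methods, reducing reliance on manual observation, and improving detection efficiency.
Method: Baseline experiments use TimeSformer, a transformer-based video classifier, with strict subject-wise five-fold cross-validation to prevent data leakage. The dataset contains 10-second top-down and side-view video clips labeled as normal activity or seizure, for a total of 13,053 samples from 19 subjects. The paper also describes the data curation, annotation protocol, and preprocessing pipeline.
Result: TimeSformer discriminates between seizure and normal activity with an average F1-score of 97%. The strict subject-wise partitioning, in which each subject appears in only one fold, supports the model's generalization and robustness on unseen subjects.
Conclusion: The study demonstrates the feasibility of video-based automatic seizure detection in preclinical epilepsy research and provides a basis for developing non-invasive monitoring tools. The public dataset and code should promote reproducible research, may reduce reliance on invasive monitoring in animal experiments, and may improve standardization in epilepsy research.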
📄 Abstract
We introduce a curated video dataset of laboratory rodents for automatic detection of convulsive events. The dataset contains short (10 s) top-down and side-view video clips of individual rodents, labeled at clip level as normal activity or seizure. It includes 10,101 negative samples and 2,952 positive samples collected from 19 subjects. We describe the data curation, annotation protocol and preprocessing pipeline, and report baseline experiments using a transformer-based video classifier (TimeSformer). Experiments employ five-fold cross-validation with strict subject-wise partitioning to prevent data leakage (no subject appears in more than one fold). Results show that the TimeSformer architecture enables discrimination between seizure and normal activity with an average F1-score of 97%. The dataset and baseline code are publicly released to support reproducible research on non-invasive, video-based monitoring in preclinical epilepsy research. RodEpil Dataset access - DOI: 10.5281/zenodo.17601357
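Subject-wise partitioning means fold membership is decided per subject, not per clip, so clips from one animal never land in both training and validation data. The sketch below shows one simple way to do this; the greedy size balancing, the synthetic subject IDs and clip counts, and the name `subject_wise_folds` are assumptions, since the abstract does not describe how subjects were distributed across folds.

```python
from collections import Counter


def subject_wise_folds(subject_ids, n_folds=5):
    """Return a fold index per clip such that each subject lives in exactly one fold.

    Subjects are placed largest-first into the currently smallest fold,
    which keeps fold sizes roughly balanced.
    """
    clips_per_subject = Counter(subject_ids)
    fold_sizes = [0] * n_folds
    fold_of_subject = {}
    for subject in sorted(clips_per_subject, key=lambda s: (-clips_per_subject[s], s)):
        fold = fold_sizes.index(min(fold_sizes))
        fold_of_subject[subject] = fold
        fold_sizes[fold] += clips_per_subject[subject]
    return [fold_of_subject[s] for s in subject_ids]


# Synthetic example: 19 hypothetical subjects with varying numbers of clips.
subject_ids = [f"subject{idx:02d}" for idx in range(19) for _ in range(10 + 3 * idx)]
folds = subject_wise_folds(subject_ids)

for held_out in range(5):
    train = [i for i, f in enumerate(folds) if f != held_out]
    val = [i for i, f in enumerate(folds) if f == held_out]
    print(held_out, len(train), len(val))
```

A random clip-level split would likely score higher but would measure memorization of individual animals, which is the leakage the subject-wise protocol is meant to rule out.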
[45] SemanticVLA: Semantic-Aligned Sparsification and Enhancement for Efficient Robotic Manipulation
Wei Li, Renshan Zhang, Rui Shao, Zhijian Fang, Kaiwen Zhou, Zhuotao Tian, Liqiang Nie
🧩 TL;DR
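This paper proposes SemanticVLA, a framework that uses semantic-aligned sparsification and enhancement to address perceptual redundancy and insufficient semantic alignment in vision-language-action models for robotic manipulation, significantly improving efficiency while maintaining high performance.
📘 Detailed Summary
Motivation: Vision-language-action models face two main limitations in robotic manipulation: perceptual redundancy, where irrelevant visual inputs are processed inefficiently, and superficial instruction-vision alignment, which hampers the semantic grounding of actions. Both limit practical deployment.
Method: SemanticVLA has three core components. The Semantic-guided Dual Visual Pruner combines an Instruction-driven Pruner and a Spatial-aggregation Pruner, which handle global action cues and geometric features, respectively. The Semantic-complementary Hierarchical Fuser integrates dense patches and sparse tokens from SigLIP and DINOv2. The Semantic-conditioned Action Coupler replaces the conventional observation-to-DoF approach for more efficient robot behavior modeling.
Result: Extensive experiments on simulation and real-world tasks show that SemanticVLA sets a new state of the art in both performance and efficiency. On the LIBERO benchmark it surpasses OpenVLA by 21.1% in success rate while reducing training cost and inference latency by 3.0-fold and 2.7-fold, respectively.
Conclusion: The study shows that semantic-aligned sparsification can balance performance and efficiency in robotic manipulation, offering a feasible route to practical deployment. The open-source implementation supports further work in the area.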
📄 Abstract
Vision-Language-Action (VLA) models have advanced in robotic manipulation, yet practical deployment remains hindered by two key limitations: 1) perceptual redundancy, where irrelevant visual inputs are processed inefficiently, and 2) superficial instruction-vision alignment, which hampers semantic grounding of actions. In this paper, we propose SemanticVLA, a novel VLA framework that performs Semantic-Aligned Sparsification and Enhancement for Efficient Robotic Manipulation. Specifically: 1) To sparsify redundant perception while preserving semantic alignment, Semantic-guided Dual Visual Pruner (SD-Pruner) performs: Instruction-driven Pruner (ID-Pruner) extracts global action cues and local semantic anchors in SigLIP; Spatial-aggregation Pruner (SA-Pruner) compacts geometry-rich features into task-adaptive tokens in DINOv2. 2) To exploit sparsified features and integrate semantics with spatial geometry, Semantic-complementary Hierarchical Fuser (SH-Fuser) fuses dense patches and sparse tokens across SigLIP and DINOv2 for coherent representation. 3) To enhance the transformation from perception to action, Semantic-conditioned Action Coupler (SA-Coupler) replaces the conventional observation-to-DoF approach, yielding more efficient and interpretable behavior modeling for manipulation tasks. Extensive experiments on simulation and real-world tasks show that SemanticVLA sets a new SOTA in both performance and efficiency. SemanticVLA surpasses OpenVLA on the LIBERO benchmark by 21.1% in success rate, while reducing training cost and inference latency by 3.0-fold and 2.7-fold. SemanticVLA is open-sourced and publicly available at https://github.com/JiuTian-VL/SemanticVLA
[46] Benchmarking Diversity in Image Generation via Attribute-Conditional Human Evaluation
Isabela Albuquerque, Ira Ktena, Olivia Wiles, Ivana Kajić, Amal Rannen-Triki, Cristina Vasconcelos, Aida Nematzadeh
🧩 TL;DR
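This paper proposes a framework for systematically evaluating diversity in text-to-image models, addressing their tendency to generate homogeneous outputs by assessing individual concepts and their relevant factors of variation. The framework comprises a human evaluation template, a curated prompt set, and a binomial-test-based method for comparing models.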
📘 Detailed Summary
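Motivation: Text-to-image models have improved in generation quality but often lack diversity and produce homogeneous outputs. Existing work lacks a systematic framework for evaluating diversity, so it cannot accurately measure generation diversity across concepts and factors of variation.
Method: The paper proposes a systematic diversity evaluation framework with three parts: a novel human evaluation template for nuanced diversity assessment, a prompt set covering diverse concepts and their factors of variation, and a comparison of models on human annotations via binomial tests. It also rigorously compares several image embeddings for diversity measurement.
Result: The framework can rank text-to-image models by diversity and identify categories where they perform poorly. The evaluation reveals marked differences in diversity between models and supports the validity of the proposed method.
Conclusion: The study provides a robust methodology and insights for diversity evaluation, paving the way for improvements in text-to-image model diversity and metric development. The framework can guide model developers toward areas where diversity is lacking and encourage more comprehensive evaluation standards.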
📄 Abstract
Despite advances in generation quality, current text-to-image (T2I) models often lack diversity, generating homogeneous outputs. This work introduces a framework to address the need for robust diversity evaluation in T2I models. Our framework systematically assesses diversity by evaluating individual concepts and their relevant factors of variation. Key contributions include: (1) a novel human evaluation template for nuanced diversity assessment; (2) a curated prompt set covering diverse concepts with their identified factors of variation (e.g. prompt: An image of an apple, factor of variation: color); and (3) a methodology for comparing models in terms of human annotations via binomial tests. Furthermore, we rigorously compare various image embeddings for diversity measurement. Notably, our principled approach enables ranking of T2I models by diversity, identifying categories where they particularly struggle. This research offers a robust methodology and insights, paving the way for improvements in T2I model diversity and metric development.
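The binomial-test comparison in contribution (3) can be illustrated with a minimal sketch. It assumes pairwise judgments in which annotators pick the more diverse of two models and ties are dropped; the function name and the counts are hypothetical and are not taken from the paper.

```python
from math import comb

def binomial_two_sided_p(wins: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial test p-value.

    Sums the probability of every outcome that is no more likely
    than the observed one under the null hypothesis (win rate = p).
    """
    def pmf(k: int) -> float:
        return comb(n, k) * p**k * (1 - p) ** (n - k)

    observed = pmf(wins)
    # The small tolerance guards against floating-point ties.
    tail = sum(pmf(k) for k in range(n + 1) if pmf(k) <= observed * (1 + 1e-9))
    return min(1.0, tail)

# Hypothetical counts: model A judged more diverse in 70 of 100 comparisons.
p_value = binomial_two_sided_p(wins=70, n=100)
print(f"p = {p_value:.2e}")
```

A small p-value suggests that annotators prefer one model's diversity more often than chance would explain; with an even split the p-value is 1.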
[47] OmniVGGT: Omni-Modality Driven Visual Geometry Grounded
Haosong Peng, Hao Li, Yalun Dai, Yushi Lan, Yihang Luo, Tianyu Qi, Zhengshen Zhang, Yufeng Zhan, Junfei Zhang, Wenchao Xu, Ziwei Liu
🧩 TL;DR
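OmniVGGT is a novel 3D foundation model framework that uses a GeoAdapter to integrate an arbitrary number of geometric modality inputs, injecting geometric information progressively through zero-initialized convolutions. It achieves state-of-the-art results on multi-view depth estimation, multi-view stereo, and camera pose estimation.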
📘 Detailed Summary
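Motivation: Most current 3D foundation models assume RGB-only input and ignore readily available geometric cues such as camera intrinsics, poses, and depth maps, which limits how fully they exploit spatial information and how well they perform.
Method: A GeoAdapter module uses zero-initialized convolutions to progressively encode depth and camera intrinsics/extrinsics into a spatial foundation model. A stochastic multimodal fusion strategy randomly samples modality subsets during training to improve robustness. The design keeps inference speed comparable to VGGT and optimization stable.
Result: On monocular/multi-view depth estimation, multi-view stereo, and camera pose estimation, OmniVGGT outperforms prior methods that use auxiliary inputs and reaches state-of-the-art results even with RGB-only input. Integrated into vision-language-action models, it also performs well on mainstream benchmarks and robotic tasks.
Conclusion: The study shows that effective integration of geometric cues can substantially improve 3D foundation models. The GeoAdapter's zero-initialized design keeps optimization stable, and stochastic multimodal fusion improves generalization, offering a practical route to stronger multimodal 3D perception systems.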
📄 Abstract
General 3D foundation models have started to lead the trend of unifying diverse vision tasks, yet most assume RGB-only inputs and ignore readily available geometric cues (e.g., camera intrinsics, poses, and depth maps). To address this issue, we introduce OmniVGGT, a novel framework that can effectively benefit from an arbitrary number of auxiliary geometric modalities during both training and inference. In our framework, a GeoAdapter is proposed to encode depth and camera intrinsics/extrinsics into a spatial foundation model. It employs zero-initialized convolutions to progressively inject geometric information without disrupting the foundation model's representation space. This design ensures stable optimization with negligible overhead, maintaining inference speed comparable to VGGT even with multiple additional inputs. Additionally, a stochastic multimodal fusion regimen is proposed, which randomly samples modality subsets per instance during training. This enables an arbitrary number of modality inputs during testing and promotes learning robust spatial representations instead of overfitting to auxiliary cues. Comprehensive experiments on monocular/multi-view depth estimation, multi-view stereo, and camera pose estimation demonstrate that OmniVGGT outperforms prior methods with auxiliary inputs and achieves state-of-the-art results even with RGB-only input. To further highlight its practical utility, we integrated OmniVGGT into vision-language-action (VLA) models. The enhanced VLA model by OmniVGGT not only outperforms the vanilla point-cloud-based baseline on mainstream benchmarks, but also effectively leverages accessible auxiliary inputs to achieve consistent gains on robotic tasks.
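The stochastic multimodal fusion regimen described above can be sketched as per-instance subset sampling. This is a minimal sketch, assuming each auxiliary cue is kept independently with a fixed probability; the abstract only states that modality subsets are sampled randomly per instance, so `keep_prob`, `sample_modalities`, and the modality labels are hypothetical.

```python
import random

# Auxiliary cues named in the abstract; the string labels are illustrative.
AUX_MODALITIES = ("depth", "intrinsics", "extrinsics")

def sample_modalities(rng: random.Random, keep_prob: float = 0.5) -> list:
    """Pick a random subset of auxiliary modalities for one training instance.

    RGB is always kept; each auxiliary cue is kept independently with
    probability keep_prob (an assumed hyperparameter).
    """
    subset = ["rgb"]
    subset += [m for m in AUX_MODALITIES if rng.random() < keep_prob]
    return subset

rng = random.Random(0)  # seeded for reproducibility
for step in range(3):
    print(step, sample_modalities(rng))
```

Because every subset, including RGB only, can appear during training, a model trained this way can be queried with whichever cues happen to be available at test time.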
cs.CL [Back]
[48] Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages
Omnilingual ASR team, Gil Keren, Artyom Kozhevnikov, Yen Meng, Christophe Ropers, Matthew Setzler, Skyler Wang, Ife Adebara, Michael Auli, Can Balioglu, Kevin Chan, Chierh Cheng, Joe Chuang, Caley Droof, Mark Duppenthaler, Paul-Ambroise Duquenne, Alexander Erben, Cynthia Gao, Gabriel Mejia Gonzalez, Kehan Lyu, Sagar Miglani, Vineel Pratap, Kaushik Ram Sadagopan, Safiyyah Saleem, Arina Turkatenko, Albert Ventayol-Boada, Zheng-Xin Yong, Yu-An Chung, Jean Maillard, Rashel Moritz, Alexandre Mourachko, Mary Williamson, Shireen Yates
🧩 TL;DR
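This paper introduces Omnilingual ASR, the first large-scale automatic speech recognition system designed for extensibility. It can support unserved languages from only a handful of data samples and expands ASR coverage to over 1,600 languages, including over 500 never before served by ASR.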
📘 Detailed Summary
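Motivation: ASR has advanced in high-resource languages, but most of the world's 7,000+ languages remain unsupported, leaving thousands of long-tail languages behind. Expanding coverage is costly and limited by architectural constraints, and it raises ethical concerns when pursued without community collaboration, so most languages have no access to ASR.
Method: The system scales self-supervised pre-training to 7B parameters to learn robust speech representations and introduces an encoder-decoder architecture designed for zero-shot generalization, with an LLM-inspired decoder. A large, diverse training corpus combines public resources with community-sourced recordings gathered through compensated local partnerships.
Result: Automatic evaluations show substantial gains over prior systems in low-resource conditions and strong generalization. The model family ranges from 300M variants for low-power devices to a 7B variant for maximum accuracy and covers over 1,600 languages, the largest such effort to date.
Conclusion: The study highlights how open-sourcing models and tools can lower barriers for researchers and communities and invite new forms of participation. It reflects on the ethical considerations that shaped the design and discusses its societal impact, in particular how community collaboration and compensated partnerships can make language technology more inclusive.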
📄 Abstract
Automatic speech recognition (ASR) has advanced in high-resource languages, but most of the world's 7,000+ languages remain unsupported, leaving thousands of long-tail languages behind. Expanding ASR coverage has been costly and limited by architectures that restrict language support, making extension inaccessible to most, all while entangled with ethical concerns when pursued without community collaboration. To transcend these limitations, we introduce Omnilingual ASR, the first large-scale ASR system designed for extensibility. Omnilingual ASR enables communities to introduce unserved languages with only a handful of data samples. It scales self-supervised pre-training to 7B parameters to learn robust speech representations and introduces an encoder-decoder architecture designed for zero-shot generalization, leveraging an LLM-inspired decoder. This capability is grounded in a massive and diverse training corpus; by combining breadth of coverage with linguistic variety, the model learns representations robust enough to adapt to unseen languages. Incorporating public resources with community-sourced recordings gathered through compensated local partnerships, Omnilingual ASR expands coverage to over 1,600 languages, the largest such effort to date, including over 500 never before served by ASR. Automatic evaluations show substantial gains over prior systems, especially in low-resource conditions, and strong generalization. We release Omnilingual ASR as a family of models, from 300M variants for low-power devices to 7B for maximum accuracy. We reflect on the ethical considerations shaping this design and conclude by discussing its societal impact. In particular, we highlight how open-sourcing models and tools can lower barriers for researchers and communities, inviting new forms of participation. Open-source artifacts are available at https://github.com/facebookresearch/omnilingual-asr.
[49] TermGPT: Multi-Level Contrastive Fine-Tuning for Terminology Adaptation in Legal and Financial Domain
Yidan Sun, Mengying Zhu, Feiyue Chen, Yangyang Wu, Xiaolei Dan, Mengyuan Yang, Xiaolin Zheng, Shenglin Ben
🧩 TL;DR
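This paper proposes TermGPT, a multi-level contrastive fine-tuning framework for terminology adaptation. By constructing a sentence graph and applying contrastive learning at both the sentence and token levels, it addresses the isotropy problem and the weak terminology-level representations of large language models in specialized domains.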
📘 Detailed Summary
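Motivation: Large language models perform well in text generation, but their embedding spaces suffer from the isotropy problem, which leads to poor representation of terminology in specialized domains, particularly law and finance. This weakness severely hinders downstream tasks such as legal judgment prediction and financial risk analysis, where subtle semantic distinctions are critical.
Method: The authors first construct a sentence graph to capture semantic and structural relations and generate semantically consistent yet discriminative positive and negative samples from contextual and topological cues. They then design a multi-level contrastive learning approach that operates at both the sentence and token levels to strengthen global contextual understanding and fine-grained terminology discrimination.
Result: Experiments show that TermGPT outperforms existing baselines on term discrimination tasks in the finance and legal domains. To support robust evaluation, the authors also construct the first financial terminology dataset derived from official regulatory documents.
Conclusion: The study demonstrates that a multi-level contrastive learning framework improves terminology representation in specialized domains, offers a new technical path around this bottleneck for large language models, and contributes the first financial terminology dataset as a resource for related research.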
📄 Abstract
Large language models (LLMs) have demonstrated impressive performance in text generation tasks; however, their embedding spaces often suffer from the isotropy problem, resulting in poor discrimination of domain-specific terminology, particularly in legal and financial contexts. This weakness in terminology-level representation can severely hinder downstream tasks such as legal judgment prediction or financial risk analysis, where subtle semantic distinctions are critical. To address this problem, we propose TermGPT, a multi-level contrastive fine-tuning framework designed for terminology adaptation. We first construct a sentence graph to capture semantic and structural relations, and generate semantically consistent yet discriminative positive and negative samples based on contextual and topological cues. We then devise a multi-level contrastive learning approach at both the sentence and token levels, enhancing global contextual understanding and fine-grained terminology discrimination. To support robust evaluation, we construct the first financial terminology dataset derived from official regulatory documents. Experiments show that TermGPT outperforms existing baselines in term discrimination tasks within the finance and legal domains.
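The idea of contrasting at two levels can be illustrated with a standard InfoNCE loss applied once to sentence embeddings and once to token embeddings. This is a minimal sketch under stated assumptions: the abstract does not specify the loss, so InfoNCE, the temperature, the `token_weight` mixing factor, and all vectors here are illustrative choices rather than TermGPT's actual objective.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: negative log-softmax score of the positive."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]

def multi_level_loss(sentence_triplet, token_triplet, token_weight=1.0):
    """Sum of a sentence-level and a token-level contrastive term."""
    return info_nce(*sentence_triplet) + token_weight * info_nce(*token_triplet)

# Hypothetical 2-D embeddings: (anchor, positive, list of negatives).
sentence = ([1.0, 0.0], [0.9, 0.1], [[0.0, 1.0]])
token = ([0.0, 1.0], [0.1, 0.9], [[1.0, 0.0]])
print(multi_level_loss(sentence, token))
```

The loss is near zero when each anchor sits close to its positive and far from its negatives, and it grows when a negative is more similar to the anchor than the positive is.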
[50] HI-TransPA: Hearing Impairments Translation Personal Assistant
Zhiming Ma, Shiyu Gan, Junhao Zhao, Xianming Li, Qingyun Pan, Peidong Wang, Mingjun Pan, Yuhao Mo, Jiajie Cheng, Chengxin Chen, Zhonglun Cao, Chonghan Liu, Shi Cheng
🧩 TL;DR
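This paper presents HI-TransPA, an instruction-driven audio-visual personal assistant based on the Omni-Model paradigm. By fusing indistinct speech with high-frame-rate lip dynamics, it offers hearing-impaired users a unified solution for translation and dialogue. The model achieves state-of-the-art literal accuracy and semantic fidelity on the purpose-built HI-Dialogue dataset.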
📘 Detailed Summary
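Motivation: To meet the need for a unified and flexible solution for daily communication among hearing-impaired individuals, the work brings the Omni-Model paradigm into assistive technology. Existing Omni-Models adapt poorly to hearing-impaired speech, and the raw data is noisy and heterogeneous, which calls for a dedicated multimodal processing framework.
Method: A comprehensive preprocessing pipeline detects facial landmarks, isolates and stabilizes the lip region, and quantitatively assesses multimodal sample quality. A curriculum learning strategy based on these quality scores trains first on clean, high-confidence samples and then progressively adds harder samples to improve robustness. A SigLIP encoder combined with a Unified 3D-Resampler efficiently encodes high-frame-rate lip motion.
Result: Experiments on the purpose-built HI-Dialogue dataset show that HI-TransPA achieves state-of-the-art performance in both literal accuracy and semantic fidelity. The model supports translation and dialogue within a single multimodal framework.
Conclusion: The study lays a foundation for applying Omni-Models to assistive communication technology and provides an end-to-end modeling framework and essential processing tools. It shows the potential of multimodal fusion for hearing-impairment assistance and offers a reference point and infrastructure for future research.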
📄 Abstract
To provide a unified and flexible solution for daily communication among hearing-impaired individuals, we introduce the Omni-Model paradigm into assistive technology and present HI-TransPA, an instruction-driven audio-visual personal assistant. The model fuses indistinct speech with high-frame-rate lip dynamics, enabling both translation and dialogue within a single multimodal framework. To tackle the challenges of noisy and heterogeneous raw data and the limited adaptability of existing Omni-Models to hearing-impaired speech, we construct a comprehensive preprocessing and curation pipeline that detects facial landmarks, isolates and stabilizes the lip region, and quantitatively assesses multimodal sample quality. These quality scores guide a curriculum learning strategy that first trains on clean, high-confidence samples and progressively incorporates harder cases to strengthen model robustness. We further adopt a SigLIP encoder combined with a Unified 3D-Resampler to efficiently encode high-frame-rate lip motion. Experiments on our purpose-built HI-Dialogue dataset show that HI-TransPA achieves state-of-the-art performance in both literal accuracy and semantic fidelity. This work establishes a foundation for applying Omni-Models to assistive communication technology, providing an end-to-end modeling framework and essential processing tools for future research.
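The quality-guided curriculum described above can be sketched as a sequence of growing training pools. This is a minimal sketch, assuming quality scores in [0, 1] and fixed stage thresholds; the thresholds, the function name, and the sample IDs are hypothetical, since the abstract does not describe the schedule.

```python
def curriculum_stages(samples, thresholds=(0.8, 0.5, 0.0)):
    """Return one training pool per stage, easiest (highest quality) first.

    samples: list of (sample_id, quality_score) pairs, scores in [0, 1]
    thresholds: decreasing minimum quality admitted at each stage (assumed values)
    """
    return [[sid for sid, q in samples if q >= t] for t in thresholds]

# Hypothetical samples with precomputed multimodal quality scores.
samples = [("clip_a", 0.9), ("clip_b", 0.6), ("clip_c", 0.2)]
for stage, pool in enumerate(curriculum_stages(samples)):
    print(f"stage {stage}: {pool}")
```

Each stage keeps the earlier, cleaner samples and adds harder ones, so the model is never trained on difficult cases alone.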
[51] Do Language Models Associate Sound with Meaning? A Multimodal Study of Sound Symbolism
Jinhong Jeong, Sunghyun Lee, Jaeyoung Lee, Seonah Han, Youngjae Yu
🧩 TL;DR
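This study presents the LEX-ICON dataset and analyzes how multimodal large language models perform on sound symbolism tasks, providing the first large-scale quantitative analysis of how such models interpret phonetic iconicity. It finds that MLLMs show phonetic intuitions consistent with linguistic theory and reveals their attention patterns on iconic phonemes.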
📘 Detailed Summary
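Motivation: Sound symbolism is a linguistic concept that refers to non-arbitrary associations between phonetic forms and their meanings. The study explores how multimodal large language models interpret auditory information in human languages, filling a gap in research on how AI models understand phonetic iconicity.
Method: The authors build the LEX-ICON dataset, which contains 8,052 natural-language words from English, French, Japanese, and Korean and 2,930 systematically constructed pseudo-words, annotated with features along 25 semantic dimensions. By measuring phoneme-level attention scores, they analyze layer-wise information processing across textual input forms (orthographic and IPA) and the auditory modality.
Result: MLLMs show phonetic intuitions that align with existing linguistic research across multiple semantic dimensions. The analysis also reveals that models focus attention on iconic phonemes, a pattern quantified through phonosemantic attention analysis.
Conclusion: The study bridges artificial intelligence and cognitive linguistics and provides the first large-scale quantitative analysis framework for MLLM interpretability in this area. It sheds light on how models process sound symbolism and offers a new perspective on the semantic processing of multimodal models.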
📄 Abstract
Sound symbolism is a linguistic concept that refers to non-arbitrary associations between phonetic forms and their meanings. We suggest that this can be a compelling probe into how Multimodal Large Language Models (MLLMs) interpret auditory information in human languages. We investigate MLLMs' performance on phonetic iconicity across textual (orthographic and IPA) and auditory forms of inputs with up to 25 semantic dimensions (e.g., sharp vs. round), observing models' layer-wise information processing by measuring phoneme-level attention fraction scores. To this end, we present LEX-ICON, an extensive mimetic word dataset consisting of 8,052 words from four natural languages (English, French, Japanese, and Korean) and 2,930 systematically constructed pseudo-words, annotated with semantic features applied across both text and audio modalities. Our key findings demonstrate (1) MLLMs' phonetic intuitions that align with existing linguistic research across multiple semantic dimensions and (2) phonosemantic attention patterns that highlight models' focus on iconic phonemes. These results bridge domains of artificial intelligence and cognitive linguistics, providing the first large-scale, quantitative analyses of phonetic iconicity in terms of MLLMs' interpretability.
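One way to read a phoneme-level attention fraction is as the share of attention mass that lands on the tokens of a target phoneme. The sketch below assumes that reading; the abstract does not give the exact formula, and the function name, weights, and token positions are hypothetical.

```python
def attention_fraction(attention_weights, phoneme_positions):
    """Share of total attention mass on the given token positions.

    attention_weights: non-negative attention scores over the input tokens
    phoneme_positions: indices of the tokens that make up the target phoneme
    """
    total = sum(attention_weights)
    if total == 0:
        return 0.0  # no attention mass at all, so no meaningful fraction
    return sum(attention_weights[i] for i in phoneme_positions) / total

# Hypothetical attention row over five tokens; tokens 1 and 2 form the phoneme.
weights = [0.1, 0.3, 0.3, 0.2, 0.1]
print(round(attention_fraction(weights, [1, 2]), 2))
```

Computing this value per layer would give a layer-wise profile of how strongly a model attends to a given phoneme.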
[52] Format Matters: The Robustness of Multimodal LLMs in Reviewing Evidence from Tables and Charts
Xanh Ho, Yun-Ang Wu, Sunisth Kumar, Florian Boudin, Atsuhiro Takasu, Akiko Aizawa
🧩 TL;DR
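This study evaluates 12 multimodal large language models on scientific claim verification and finds that current models perform better with table-based evidence and struggle with chart-based evidence, revealing a critical gap in multimodal reasoning.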
📘 Detailed Summary
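Motivation: As the number of submitted scientific papers grows, so does the demand for systems that can help reviewers evaluate research claims. Experimental results are a core component of scientific work and are often presented in different formats such as tables or charts. How robustly current multimodal LLMs verify scientific claims across evidence formats is an important and underexplored question.
Method: The study designs and conducts a series of experiments to assess how well multimodal LLMs verify scientific claims using tables and charts as evidence. The authors adapt two existing datasets of scientific papers by adding the annotations and structures needed for a multimodal claim verification task, then evaluate 12 multimodal LLMs on the adapted dataset.
Result: Current models perform better with table-based evidence and struggle with chart-based evidence. Human evaluation shows that humans maintain strong performance across both formats, in sharp contrast to the models. Smaller multimodal LLMs (under 8B parameters) show weak correlation between table-based and chart-based task performance, indicating limited cross-modal generalization.
Conclusion: The findings highlight a critical gap in current models' multimodal reasoning. The authors suggest that future multimodal LLMs place greater emphasis on chart understanding to better support scientific claim verification and scientific review applications.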
📄 Abstract
With the growing number of submitted scientific papers, there is an increasing demand for systems that can assist reviewers in evaluating research claims. Experimental results are a core component of scientific work, often presented in varying formats such as tables or charts. Understanding how robust current multimodal large language models (multimodal LLMs) are at verifying scientific claims across different evidence formats remains an important and underexplored challenge. In this paper, we design and conduct a series of experiments to assess the ability of multimodal LLMs to verify scientific claims using both tables and charts as evidence. To enable this evaluation, we adapt two existing datasets of scientific papers by incorporating annotations and structures necessary for a multimodal claim verification task. Using this adapted dataset, we evaluate 12 multimodal LLMs and find that current models perform better with table-based evidence while struggling with chart-based evidence. We further conduct human evaluations and observe that humans maintain strong performance across both formats, unlike the models. Our analysis also reveals that smaller multimodal LLMs (under 8B) show weak correlation in performance between table-based and chart-based tasks, indicating limited cross-modal generalization. These findings highlight a critical gap in current models' multimodal reasoning capabilities. We suggest that future multimodal LLMs should place greater emphasis on improving chart understanding to better support scientific claim verification.
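The correlation analysis between table-based and chart-based performance can be illustrated with a Pearson coefficient over per-model accuracies. This is a minimal sketch, assuming Pearson correlation; the abstract does not say which coefficient was used, and the accuracy values below are invented for illustration only.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# Invented accuracies for four hypothetical small models (not results from the paper).
table_acc = [0.62, 0.70, 0.66, 0.74]
chart_acc = [0.55, 0.51, 0.58, 0.53]
print(f"r = {pearson(table_acc, chart_acc):.2f}")
```

A coefficient near zero would mean that a model's table accuracy says little about its chart accuracy, which is the sense in which weak correlation points to limited cross-format generalization.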
[53] Analogical Structure, Minimal Contextual Cues and Contrastive Distractors: Input Design for Sample-Efficient Linguistic Rule Induction
Chunyang Jiang, Paola Merlo
🧩 TL;DR
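This study proposes a computational approach based on analogical paradigm organization that lets lightweight models rival large language models with very little data. Using cognitively inspired analogical structure, contrastive learning, and minimal contextual cues, it reaches F1 = 0.95 with only 100 structured examples.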
📘 Detailed Summary
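Motivation: Large language models depend on vast training data for their strong performance. The study asks whether analogical paradigm organization can let lightweight models reach comparable performance with minimal data, addressing poor data efficiency.
Method: The approach implements three cognitively inspired principles: analogical structure, contrastive learning, and minimal contextual cues. A lightweight BERT+CNN model (0.5M parameters) is tested on structured completion tasks, where it identifies the correct sentence completion from analogical patterns with contrastive alternatives.
Result: Trained on only 100 structured examples of English causative/inchoative alternations, the lightweight model reaches F1 = 0.95, outperforming zero-shot GPT-o3 at F1 = 0.87. Ablation studies confirm that analogical organization and contrastive structure improve performance, consistently surpassing randomly shuffled baselines across architectures. Cross-phenomenon validation with unspecified object alternations replicates the efficiency gains.
Conclusion: Analogical paradigm organization enables competitive linguistic rule learning with orders of magnitude less data than conventional approaches require. The approach shows the data-efficiency advantage of cognitively inspired structured learning and offers a new route for natural language processing in resource-constrained settings.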
📄 Abstract
Large language models achieve strong performance through training on vast datasets. Can analogical paradigm organization enable lightweight models to match this performance with minimal data? We develop a computational approach implementing three cognitive-inspired principles: analogical structure, contrastive learning, and minimal contextual cues. We test this approach with structured completion tasks where models identify correct sentence completions from analogical patterns with contrastive alternatives. Training lightweight models (BERT+CNN, 0.5M parameters) on only one hundred structured examples of English causative/inchoative alternations achieves F1 = 0.95, outperforming zero-shot GPT-o3 (F1 = 0.87). Ablation studies confirm that analogical organization and contrastive structure improve performance, consistently surpassing randomly shuffled baselines across architectures. Cross-phenomenon validation using unspecified object alternations replicates these efficiency gains, confirming approach robustness. Our results show that analogical paradigm organization enables competitive linguistic rule learning with orders of magnitude less data than conventional approaches require.
[54] URaG: Unified Retrieval and Generation in Multimodal LLMs for Efficient Long Document Understanding
Yongxin Shi, Jiapeng Wang, Zeyu Shan, Dezhi Peng, Zening Lin, Lianwen Jin
🧩 TL;DR
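This study proposes URaG, a framework that unifies retrieval and generation and uses the inherent evidence localization ability of MLLMs for efficient long document understanding. It converts the early Transformer layers into a lightweight evidence selector, keeping performance high while substantially reducing computational overhead.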
📘 Detailed Summary
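Motivation: Multimodal large language models face two challenges in long document understanding: information interference from abundant irrelevant content, and the quadratic computational cost of Transformer-based architectures. Existing approaches fall into two categories: token compression, which sacrifices fine-grained details, and external retrievers, which increase system complexity and prevent end-to-end optimization.
Method: URaG introduces a lightweight cross-modal retrieval module that converts the early Transformer layers into an efficient evidence selector, identifying and preserving the most relevant pages while discarding irrelevant content. The deeper layers can then concentrate computation on pertinent information, unifying retrieval and generation.
Result: Extensive experiments show that URaG achieves state-of-the-art performance while reducing computational overhead by 44-56%, improving both accuracy and efficiency on long document understanding tasks.
Conclusion: The study shows that MLLMs exhibit a human-like coarse-to-fine reasoning pattern: early layers attend broadly across the document, while deeper layers focus on relevant evidence pages. URaG exploits this inherent ability for efficient retrieval, providing an end-to-end optimized solution for long document understanding and showing that a model's own reasoning process can be put to use.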
📄 Abstract
Recent multimodal large language models (MLLMs) still struggle with long document understanding due to two fundamental challenges: information interference from abundant irrelevant content, and the quadratic computational cost of Transformer-based architectures. Existing approaches primarily fall into two categories: token compression, which sacrifices fine-grained details; and introducing external retrievers, which increase system complexity and prevent end-to-end optimization. To address these issues, we conduct an in-depth analysis and observe that MLLMs exhibit a human-like coarse-to-fine reasoning pattern: early Transformer layers attend broadly across the document, while deeper layers focus on relevant evidence pages. Motivated by this insight, we posit that the inherent evidence localization capabilities of MLLMs can be explicitly leveraged to perform retrieval during the reasoning process, facilitating efficient long document understanding. To this end, we propose URaG, a simple-yet-effective framework that Unifies Retrieval and Generation within a single MLLM. URaG introduces a lightweight cross-modal retrieval module that converts the early Transformer layers into an efficient evidence selector, identifying and preserving the most relevant pages while discarding irrelevant content. This design enables the deeper layers to concentrate computational resources on pertinent information, improving both accuracy and efficiency. Extensive experiments demonstrate that URaG achieves state-of-the-art performance while reducing computational overhead by 44-56%. The code is available at https://github.com/shi-yx/URaG.
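The evidence-selection step can be sketched as keeping the top-scoring pages before the deeper layers run. This is a minimal sketch, assuming each page already has a relevance score from the early layers; how URaG computes those scores is not described in the abstract, and `select_pages`, `k`, and the scores below are hypothetical.

```python
def select_pages(page_scores, k):
    """Return the indices of the k highest-scoring pages, in document order.

    page_scores: one relevance score per page (assumed to come from early layers)
    k: number of pages to keep for the deeper layers
    """
    ranked = sorted(range(len(page_scores)), key=lambda i: page_scores[i], reverse=True)
    return sorted(ranked[:k])  # restore reading order for the kept pages

# Hypothetical relevance scores for a six-page document.
scores = [0.05, 0.40, 0.10, 0.30, 0.05, 0.10]
kept = select_pages(scores, k=2)
print(kept)
```

Keeping the selected pages in their original order preserves the reading sequence for the layers that process them.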
[55] Mined Prompting and Metadata-Guided Generation for Wound Care Visual Question Answering
Bavana Durgapraveen, Sornaraj Sivasankaran, Abhinand Balachandran, Sriram Rajkumar
🧩 TL;DR
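For the MEDIQA-WV 2025 shared task, this study proposes two complementary approaches to generating free-text responses to wound care queries: a mined prompting strategy based on embedding retrieval, and a guided generation method based on metadata prediction. Together they improve response relevance and clinical precision.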
📘 Detailed Summary
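Motivation: The rapid expansion of asynchronous remote care has increased provider workload, creating an urgent need for AI-assisted systems that handle patient queries efficiently. In wound care scenarios that combine images with text, existing methods are limited in generating accurate clinical responses.
Method: The first approach uses mined prompting: the training data is embedded, and the top-k most similar examples are retrieved as few-shot demonstrations. The second approach builds on a metadata ablation study that identified four key metadata attributes. Classifiers are trained to predict these attributes, and the generation pipeline is adjusted dynamically according to prediction confidence.
Result: Experiments show that mined prompting improves response relevance and that metadata-guided generation further refines clinical precision. Both approaches perform well on wound care response generation.
Conclusion: These methods point to promising directions for reliable and efficient AI-driven wound care support tools and show how retrieval augmentation and metadata guidance complement each other in medical text generation.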
📄 Abstract
The rapid expansion of asynchronous remote care has intensified provider workload, creating demand for AI systems that can assist clinicians in managing patient queries more efficiently. The MEDIQA-WV 2025 shared task addresses this challenge by focusing on generating free-text responses to wound care queries paired with images. In this work, we present two complementary approaches developed for the English track. The first leverages a mined prompting strategy, where training data is embedded and the top-k most similar examples are retrieved to serve as few-shot demonstrations during generation. The second approach builds on a metadata ablation study, which identified four metadata attributes that consistently enhance response quality. We train classifiers to predict these attributes for test cases and incorporate them into the generation pipeline, dynamically adjusting outputs based on prediction confidence. Experimental results demonstrate that mined prompting improves response relevance, while metadata-guided generation further refines clinical precision. Together, these methods highlight promising directions for developing AI-driven tools that can provide reliable and efficient wound care support.
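The mined prompting strategy can be sketched as nearest-neighbor retrieval followed by prompt assembly. This is a minimal sketch, assuming cosine similarity over precomputed embeddings and a simple question/answer prompt layout; the embedding model, the similarity measure, the prompt format, and the toy data are all assumptions, since the abstract names none of them.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def mine_examples(query_vec, train_set, k=2):
    """Return the k training examples most similar to the query.

    train_set: list of (embedding, question, answer) tuples
    """
    ranked = sorted(train_set, key=lambda ex: cosine(query_vec, ex[0]), reverse=True)
    return ranked[:k]

def build_prompt(query_text, examples):
    """Assemble a few-shot prompt: mined demonstrations, then the new query."""
    parts = [f"Q: {q}\nA: {a}" for _, q, a in examples]
    parts.append(f"Q: {query_text}\nA:")
    return "\n\n".join(parts)

# Toy 2-D embeddings and placeholder texts, for illustration only.
train = [
    ([1.0, 0.0], "question 1", "answer 1"),
    ([0.0, 1.0], "question 2", "answer 2"),
    ([0.9, 0.1], "question 3", "answer 3"),
]
shots = mine_examples([1.0, 0.0], train, k=2)
print(build_prompt("new question", shots))
```

The retrieved examples act as demonstrations, so the generator sees responses to the most similar past queries before answering the new one.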
cs.AI [Back]
[56] SlideBot: A Multi-Agent Framework for Generating Informative, Reliable, Multi-Modal Presentations
Eric Xie, Danielle Waterfield, Michael Kennedy, Aidong Zhang
🧩 TL;DR
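SlideBot is a modular, multi-agent slide generation framework that integrates large language models with retrieval and code generation to improve the reliability and informativeness of automatically generated educational slides. Grounded in Cognitive Load Theory and the Cognitive Theory of Multimedia Learning, it shows strong conceptual accuracy and instructional value in evaluations in AI and biomedical education.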
📘 Detailed Summary
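Motivation: Existing LLM-based slide generation solutions often fail to produce reliable and informative outputs. They struggle in particular with multimodal content creation and precise domain-specific information, which limits their value in education.
Method: SlideBot uses a modular multi-agent framework that integrates retrieval, structured planning, and code generation. Specialized agents collaborate to retrieve information, summarize content, generate figures, and format slides using LaTeX. Following Cognitive Load Theory and the Cognitive Theory of Multimedia Learning, the system uses structured planning to manage intrinsic load and consistent visual macros to reduce extraneous load and enhance dual-channel learning.
Result: In evaluations by domain experts and students in AI and biomedical education, SlideBot consistently performs well in conceptual accuracy, clarity, and instructional value, indicating that it can streamline slide preparation while keeping content accurate and relevant.
Conclusion: SlideBot shows that a framework built on multi-agent collaboration and guided by cognitive theory can improve the quality and efficiency of educational slide generation. It offers a workable solution for adaptive content creation in higher education and underlines the value of combining instructional theory with practical needs.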
📄 Abstract
Large Language Models (LLMs) have shown immense potential in education, automating tasks like quiz generation and content summarization. However, generating effective presentation slides introduces unique challenges due to the complexity of multimodal content creation and the need for precise, domain-specific information. Existing LLM-based solutions often fail to produce reliable and informative outputs, limiting their educational value. To address these limitations, we introduce SlideBot - a modular, multi-agent slide generation framework that integrates LLMs with retrieval, structured planning, and code generation. SlideBot is organized around three pillars: informativeness, ensuring deep and contextually grounded content; reliability, achieved by incorporating external sources through retrieval; and practicality, which enables customization and iterative feedback through instructor collaboration. It incorporates evidence-based instructional design principles from Cognitive Load Theory (CLT) and the Cognitive Theory of Multimedia Learning (CTML), using structured planning to manage intrinsic load and consistent visual macros to reduce extraneous load and enhance dual-channel learning. Within the system, specialized agents collaboratively retrieve information, summarize content, generate figures, and format slides using LaTeX, aligning outputs with instructor preferences through interactive refinement. Evaluations from domain experts and students in AI and biomedical education show that SlideBot consistently enhances conceptual accuracy, clarity, and instructional value. These findings demonstrate SlideBot's potential to streamline slide preparation while ensuring accuracy, relevance, and adaptability in higher education.
[57] EgoEMS: A High-Fidelity Multimodal Egocentric Dataset for Cognitive Assistance in Emergency Medical Services
Keshara Weerasinghe, Xueren Ge, Tessa Heick, Lahiru Nuwan Wijayasingha, Anthony Cortez, Abhishek Satpathy, John Stankovic, Homa Alemzadeh
🧩 TL;DR
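This paper introduces EgoEMS, the first end-to-end, high-fidelity, multimodal, multiperson dataset for Emergency Medical Services. It contains over 20 hours of egocentric data from 233 simulated emergency scenarios and is intended to support AI cognitive assistants for responders' real-time decision making.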
📘 Detailed Summary
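Motivation: First responders in Emergency Medical Services face intense cognitive demands in high-stakes situations. AI cognitive assistants could support real-time data collection and decision making, but realistic and comprehensive datasets are lacking, which holds back the development of intelligent EMS systems.
Method: Working with EMS experts, the team developed an open-source, low-cost, and replicable data collection system. They recorded multimodal data from 62 participants, including 46 EMS professionals, across 233 simulated emergency scenarios, and annotated it with keysteps, timestamped audio transcripts with speaker diarization, action quality metrics, and bounding boxes with segmentation masks.
Result: EgoEMS contains over 20 hours of realistic EMS activity, including responder-patient interactions that reflect real-world emergency dynamics. It comes with a suite of benchmarks for real-time multimodal keystep recognition and action quality estimation, which provide a basis for building AI support tools for EMS.
Conclusion: EgoEMS fills a data gap in the development of intelligent EMS systems. By emphasizing realism and alignment with professional standards, it gives the research community a key resource for advancing such systems, with the eventual aim of improving patient outcomes.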
📄 Abstract
Emergency Medical Services (EMS) are critical to patient survival in emergencies, but first responders often face intense cognitive demands in high-stakes situations. AI cognitive assistants, acting as virtual partners, have the potential to ease this burden by supporting real-time data collection and decision making. In pursuit of this vision, we introduce EgoEMS, the first end-to-end, high-fidelity, multimodal, multiperson dataset capturing over 20 hours of realistic, procedural EMS activities from an egocentric view in 233 simulated emergency scenarios performed by 62 participants, including 46 EMS professionals. Developed in collaboration with EMS experts and aligned with national standards, EgoEMS is captured using an open-source, low-cost, and replicable data collection system and is annotated with keysteps, timestamped audio transcripts with speaker diarization, action quality metrics, and bounding boxes with segmentation masks. Emphasizing realism, the dataset includes responder-patient interactions reflecting real-world emergency dynamics. We also present a suite of benchmarks for real-time multimodal keystep recognition and action quality estimation, essential for developing AI support tools for EMS. We hope EgoEMS inspires the research community to push the boundaries of intelligent EMS systems and ultimately contribute to improved patient outcomes.
[58] Learning to Pose Problems: Reasoning-Driven and Solver-Adaptive Data Synthesis for Large Reasoning Models
Yongxian Wei, Yilin Zhao, Li Shen, Xinrui Chen, Runxi Cheng, Sinan Du, Hao Yu, Gang Liu, Jiahong Yan, Chun Yuan, Dian Li
🧩 TL;DR
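This paper proposes a problem generator that reasons explicitly and adapts to the solver's ability. It constructs related problem pairs, augments them with intermediate problem-design chains of thought produced by a reasoning model, and uses solver feedback as a reward signal to calibrate difficulty, producing high-quality training data.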
📘 Detailed Summary
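Motivation: Existing data synthesis methods face two main challenges. First, indiscriminate generation that ignores the solver's ability yields low-value problems, or else requires complex data pipelines to balance difficulty. Second, problem generation lacks reasoning, which leads to shallow problem variants that do little to improve a model's reasoning ability.
Method: The method constructs related problem pairs and augments them with intermediate problem-design chains of thought produced by a reasoning model, which bootstraps the generator's problem-design strategies. The solver's feedback on synthetic problems then serves as a reward signal, so the generator can calibrate difficulty and produce complementary problems near the edge of the solver's competence.
Result: Extensive experiments on 10 mathematical and general reasoning benchmarks show an average improvement of 2.5%, and the method generalizes to both language and vision-language models. Through co-evolution, a solver trained on the synthesized data provides improved rewards for continued generator training, yielding a further 0.7% gain.
Conclusion: The study demonstrates the effectiveness of combining explicit reasoning-guided problem generation with difficulty adaptation and offers a scalable data synthesis approach for training large reasoning models. The co-evolution mechanism shows that generator and solver can improve each other, opening a route to continued performance gains.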
📄 Abstract
Data synthesis for training large reasoning models offers a scalable alternative to limited, human-curated datasets, enabling the creation of high-quality data. However, existing approaches face several challenges: (i) indiscriminate generation that ignores the solver's ability and yields low-value problems, or reliance on complex data pipelines to balance problem difficulty; and (ii) a lack of reasoning in problem generation, leading to shallow problem variants. In this paper, we develop a problem generator that reasons explicitly to plan problem directions before synthesis and adapts difficulty to the solver's ability. Specifically, we construct related problem pairs and augment them with intermediate problem-design CoT produced by a reasoning model. These data bootstrap problem-design strategies from the generator. Then, we treat the solver's feedback on synthetic problems as a reward signal, enabling the generator to calibrate difficulty and produce complementary problems near the edge of the solver's competence. Extensive experiments on 10 mathematical and general reasoning benchmarks show that our method achieves an average improvement of 2.5% and generalizes to both language and vision-language models. Moreover, a solver trained on the synthesized data provides improved rewards for continued generator training, enabling co-evolution and yielding a further 0.7% performance gain. Our code will be made publicly available here.
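The idea of rewarding problems near the edge of the solver's competence can be illustrated with a reward that peaks when the solver succeeds about half the time. This is a minimal sketch under a stated assumption: the abstract does not give the reward formula, so the tent-shaped function below is a hypothetical stand-in, not the paper's reward.

```python
def difficulty_reward(solver_outcomes):
    """Reward a synthetic problem by how close it sits to the solver's limit.

    solver_outcomes: list of booleans, one per solver attempt (True = solved)
    Returns 1.0 at a 50% solve rate and 0.0 when the problem is always
    or never solved (an assumed tent-shaped reward).
    """
    solve_rate = sum(solver_outcomes) / len(solver_outcomes)
    return 1.0 - abs(2.0 * solve_rate - 1.0)

# Hypothetical attempt logs for three generated problems.
print(difficulty_reward([True, True, True, True]))      # too easy
print(difficulty_reward([True, False, True, False]))    # near the edge
print(difficulty_reward([False, False, False, False]))  # too hard
```

Under this shape, problems the solver always or never solves earn no reward, which pushes the generator toward problems that are informative for training.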
[59] OIDA-QA: A Multimodal Benchmark for Analyzing the Opioid Industry Documents Archive
Xuan Shen, Brian Wingenroth, Zichao Wang, Jason Kuen, Wanrong Zhu, Ruiyi Zhang, Yiwei Wang, Lichun Ma, Anqi Liu, Hongfu Liu, Tong Sun, Kevin S. Hawkins, Kate Tasker, G. Caleb Alexander, Jiuxiang Gu
🧩 TL;DR
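To address the challenges of analyzing documents related to the opioid crisis, this study develops multimodal large language models and a benchmark dataset. By integrating textual, visual, and layout information, it improves performance on document information extraction and question answering.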
📘 Detailed Summary
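Motivation: The opioid crisis reveals systemic shortcomings across regulatory systems, healthcare practices, and corporate governance. Analyzing how these interconnected systems failed requires innovative analytic approaches for the vast amounts of data and documents in the UCSF-JHU Opioid Industry Documents Archive. The complexity, multimodal nature, and specialized character of these healthcare-related legal and corporate documents call for advanced methods and models tailored to specific data types and detailed annotations, so that the analysis is precise and professional.
Method: The study organizes the original dataset by document attributes and builds a benchmark with 400k training documents and 10k test documents. From each document it extracts rich multimodal information, including textual content, visual elements, and layout structures. Multiple AI models are used to generate a large-scale dataset of 360k training QA pairs and 10k test QA pairs. The authors develop domain-specific multimodal large language models and explore how multimodal inputs affect task performance. To improve answer accuracy, they use historical QA pairs as contextual grounding, include page references in answers, and introduce an importance-based page classifier.
Result: Preliminary results indicate that the AI assistant improves document information extraction and question answering, and that multimodal information and contextual grounding improve the precision and relevance of the information provided. The dataset and models are publicly available on Hugging Face as a resource for further research.
Conclusion: The study shows that multimodal large language models are effective for analyzing complex healthcare-related legal documents, with gains from integrating textual, visual, and layout information and from contextual grounding. The benchmark dataset and domain-specific models offer a reproducible framework and methodology for similar complex document analysis tasks.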
📄 Abstract
The opioid crisis represents a significant moment in public health that reveals systemic shortcomings across regulatory systems, healthcare practices, corporate governance, and public policy. Analyzing how these interconnected systems simultaneously failed to protect public health requires innovative analytic approaches for exploring the vast amounts of data and documents disclosed in the UCSF-JHU Opioid Industry Documents Archive (OIDA). The complexity, multimodal nature, and specialized characteristics of these healthcare-related legal and corporate documents necessitate more advanced methods and models tailored to specific data types and detailed annotations, to ensure precision and professionalism in the analysis. In this paper, we tackle this challenge by organizing the original dataset according to document attributes and constructing a benchmark with 400k training documents and 10k for testing. From each document, we extract rich multimodal information (textual content, visual elements, and layout structures) to capture a comprehensive range of features. Using multiple AI models, we then generate a large-scale dataset comprising 360k training QA pairs and 10k testing QA pairs. Building on this foundation, we develop domain-specific multimodal Large Language Models (LLMs) and explore the impact of multimodal inputs on task performance. To further enhance response accuracy, we incorporate historical QA pairs as contextual grounding for answering current queries. Additionally, we incorporate page references within the answers and introduce an importance-based page classifier, further improving the precision and relevance of the information provided. Preliminary results indicate that our AI assistant improves document information extraction and question-answering. The dataset and models are publicly available at: https://huggingface.co/opioidarchive
[60] MTP: Exploring Multimodal Urban Traffic Profiling with Modality Augmentation and Spectrum Fusion
Haolong Xiang, Peisi Wang, Xiaolong Xu, Kun Yi, Xuyun Zhang, Quanzheng Sheng, Amin Beheshti, Wei Fan
🧩 TL;DR
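This paper proposes MTP, a multimodal framework for urban traffic signal analysis that learns multimodal features from numeric, visual, and textual perspectives, models traffic signals in the frequency domain, and outperforms existing methods on six real-world datasets.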
📘 Detailed Summary
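Motivation: Existing traffic signal modeling methods typically rely on the raw numerical modality and overlook the semantic information present in multimodal, heterogeneous urban data, which hinders a comprehensive understanding of traffic signals and limits accurate prediction of complex traffic dynamics.
Method: The MTP framework has three branches: a numeric branch that uses frequency multilayer perceptrons, a visual branch that converts signals into frequency images and periodicity images, and a textual branch that generates descriptive text from the specific topic, background information, and item description. A hierarchical contrastive learning scheme fuses the spectral information of the three modalities.
Result: Extensive experiments on six real-world datasets show superior performance over state-of-the-art methods, validating the effectiveness of the multimodal approach for traffic signal analysis.
Conclusion: The study shows that multimodal learning yields a more comprehensive understanding of urban traffic signals and that frequency-domain learning strategies extract information at a fine granularity, offering a new approach to predicting complex traffic dynamics and a direction for future research.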
📄 Abstract
With rapid urbanization in the modern era, traffic signals from various sensors play a significant role in monitoring the states of cities, providing a strong foundation for ensuring safe travel, reducing traffic congestion, and optimizing urban mobility. Most existing methods for traffic signal modeling rely on the original data modality, i.e., direct numerical readings from the sensors in cities. However, this unimodal approach overlooks the semantic information present in multimodal heterogeneous urban data, which hinders a comprehensive understanding of traffic signals and limits the accurate prediction of complex traffic dynamics. To address this problem, we propose a novel Multimodal framework, MTP, for urban Traffic Profiling, which learns multimodal features through numeric, visual, and textual perspectives. The three branches provide a multimodal perspective on urban traffic signal learning in the frequency domain, while the frequency learning strategies refine the information for extraction. Specifically, we first conduct visual augmentation for the traffic signals, which transforms the original modality into frequency images and periodicity images for visual learning. We also augment the traffic signals with descriptive texts based on the specific topic, background information, and item description for textual learning. To complement the numeric information, we utilize frequency multilayer perceptrons for learning on the original modality. We design a hierarchical contrastive learning scheme on the three branches to fuse the spectra of the three modalities. Finally, extensive experiments on six real-world datasets demonstrate superior performance compared with state-of-the-art approaches.
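As a rough illustration of what viewing a traffic signal in the frequency domain means, the sketch below computes the amplitude spectrum of a synthetic hourly traffic series and reads off its dominant period. It is not the MTP implementation: the function names `amplitude_spectrum` and `dominant_period` and the synthetic `flow` data are assumptions made for this example, and a naive discrete Fourier transform stands in for whatever frequency learning the paper actually uses.

```python
import cmath
import math


def amplitude_spectrum(signal):
    """Naive O(n^2) discrete Fourier transform; returns |X[k]| / n for k = 0..n//2."""
    n = len(signal)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal)
        )
        spectrum.append(abs(coeff) / n)
    return spectrum


def dominant_period(signal):
    """Period (in samples) of the strongest non-constant frequency component."""
    spectrum = amplitude_spectrum(signal)
    # Skip k = 0, which only carries the mean level of the signal.
    k = max(range(1, len(spectrum)), key=lambda i: spectrum[i])
    return len(signal) / k


# Synthetic hourly traffic counts over 4 days with a 24-hour cycle (made-up data).
flow = [100 + 40 * math.sin(2 * math.pi * h / 24) for h in range(96)]
period = dominant_period(flow)
print(period)
```

On this synthetic series the strongest component sits at the daily cycle, which is the kind of periodic structure a frequency image or periodicity image could make visible to a visual branch.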
[61] PepTriX: A Framework for Explainable Peptide Analysis through Protein Language Models
Vincent Schilling, Akshat Dubey, Georges Hattab
🧩 TL;DR
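PepTriX is a novel peptide classification framework that integrates 1D sequence embeddings with 3D structural features using a graph attention network, contrastive training, and cross-modal co-attention. It performs strongly across multiple peptide classification tasks while providing interpretable insights into structural and biophysical motifs.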
📘 Detailed Summary
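Motivation: Traditional peptide classification methods rely on handcrafted encodings of 1D sequences, which limits generalizability across tasks and datasets. Protein language models perform strongly but are computationally costly, and their complex latent representations are hard to interpret. Most existing frameworks are designed for specific peptide classification tasks and lack generality, making it difficult to connect model predictions to biologically relevant motifs and structural properties.
Method: PepTriX integrates 1D sequence embeddings and 3D structural features in a graph attention network enhanced with contrastive training and cross-modal co-attention. It adapts automatically to different datasets and produces task-specific peptide vectors while retaining biological plausibility.
Result: Evaluation by domain experts indicates that PepTriX performs well across multiple peptide classification tasks and provides interpretable insights into the structural and biophysical motifs that drive its predictions, combining predictive robustness with interpretable validation.
Conclusion: PepTriX bridges the gap between performance-driven peptide-level models and domain-level understanding, giving peptide research a solution that offers both predictive robustness and interpretable validation, and moving peptide classification from black-box prediction toward biologically interpretable modeling.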
📄 Abstract
Peptide classification tasks, such as predicting toxicity and HIV inhibition, are fundamental to bioinformatics and drug discovery. Traditional approaches rely heavily on handcrafted encodings of one-dimensional (1D) peptide sequences, which can limit generalizability across tasks and datasets. Recently, protein language models (PLMs), such as ESM-2 and ESMFold, have demonstrated strong predictive performance. However, they face two critical challenges. First, fine-tuning is computationally costly. Second, their complex latent representations hinder interpretability for domain experts. Additionally, many frameworks have been developed for specific types of peptide classification and lack generalization. These limitations restrict the ability to connect model predictions to biologically relevant motifs and structural properties. To address these limitations, we present PepTriX, a novel framework that integrates one-dimensional (1D) sequence embeddings and three-dimensional (3D) structural features via a graph attention network enhanced with contrastive training and cross-modal co-attention. PepTriX automatically adapts to diverse datasets, producing task-specific peptide vectors while retaining biological plausibility. After evaluation by domain experts, we found that PepTriX performs remarkably well across multiple peptide classification tasks and provides interpretable insights into the structural and biophysical motifs that drive predictions. Thus, PepTriX offers both predictive robustness and interpretable validation, bridging the gap between performance-driven peptide-level models (PLMs) and domain-level understanding in peptide research.
[62] Causal-HalBench: Uncovering LVLMs Object Hallucinations Through Causal Intervention
Zhe Xu, Zhicai Wang, Junkang Wu, Jinda Lu, Xiang Wang
🧩 TL;DR
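Using causal analysis, this paper traces the root of object hallucination in large vision-language models, proposes a Structural Causal Model to formally define spurious correlations, and builds the Causal-HalBench benchmark to quantify model robustness to them.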
📘 Detailed Summary
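Motivation: Large vision-language models commonly suffer from object hallucination, misjudging whether objects are present in an image. This stems mainly from spurious correlations between highly co-occurring objects learned during training. Existing benchmarks focus on hallucination detection and lack a formal definition and quantitative evaluation of spurious correlations.
Method: The study introduces causal analysis into the object recognition scenario of LVLMs and establishes a Structural Causal Model to formally define the spurious correlations caused by co-occurrence bias. It develops Causal-HalBench, a benchmark that contains counterfactual samples and integrates comprehensive causal metrics, and proposes an extensible pipeline that uses proprietary LVLMs and text-to-image models to generate the counterfactual samples.
Result: Evaluations of mainstream LVLMs on Causal-HalBench show that these models are susceptible to spurious correlations to varying degrees, supporting the validity of the benchmark and the need to evaluate model robustness.
Conclusion: The study reveals the causal mechanism behind object hallucination in LVLMs and provides a theoretical framework for understanding model bias. The proposed benchmark and evaluation method offer an important tool for improving model robustness and underscore the need to consider causal relationships in model evaluation.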
📄 Abstract
Large Vision-Language Models (LVLMs) often suffer from object hallucination, making erroneous judgments about the presence of objects in images. We propose this primarily stems from spurious correlations arising when models strongly associate highly co-occurring objects during training, leading to hallucinated objects influenced by visual context. Current benchmarks mainly focus on hallucination detection but lack a formal characterization and quantitative evaluation of spurious correlations in LVLMs. To address this, we introduce causal analysis into the object recognition scenario of LVLMs, establishing a Structural Causal Model (SCM). Utilizing the language of causality, we formally define spurious correlations arising from co-occurrence bias. To quantify the influence induced by these spurious correlations, we develop Causal-HalBench, a benchmark specifically constructed with counterfactual samples and integrated with comprehensive causal metrics designed to assess model robustness against spurious correlations. Concurrently, we propose an extensible pipeline for the construction of these counterfactual samples, leveraging the capabilities of proprietary LVLMs and Text-to-Image (T2I) models for their generation. Our evaluations on mainstream LVLMs using Causal-HalBench demonstrate these models exhibit susceptibility to spurious correlations, albeit to varying extents.