Table of Contents

cs.CV

[1] State-Change Learning for Prediction of Future Events in Endoscopic Videos

Saurav Sharma, Chinedu Innocent Nwoye, Didier Mutter, Nicolas Padoy

🧩 TL;DR

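This paper proposes SurgFUTR, a framework that recasts surgical future prediction as a state-change learning problem. A teacher-student architecture with an Action Dynamics module supports prediction that generalizes across surgical settings, yielding consistent improvements on five prediction tasks.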


📘 Detailed Summary

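Motivation: Current surgical AI research focuses on understanding the present state rather than predicting future events. Existing methods target isolated tasks and lack a unified framework, so they cannot handle short-term and long-term prediction together, and methods based on future feature prediction generalize poorly across surgical settings.

Method: SurgFUTR recasts surgical future prediction as state-change learning with a teacher-student architecture: video clips are compressed into state representations via Sinkhorn-Knopp clustering, the teacher network learns from both current and future clips, and the student network predicts future states from the current video alone, guided by the Action Dynamics (ActDyn) module.

Result: The SFPBench benchmark comprises five prediction tasks covering short-term and long-term horizons. Experiments on four datasets and three surgical procedures show consistent improvements, and cross-procedure transfer validates generalizability.

Conclusion: State-change learning offers a more general approach to surgical future prediction that handles tasks at different time scales. Its cross-procedure generalization supports its practical value for operating room safety and efficiency.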


📄 Abstract

Surgical future prediction, driven by real-time AI analysis of surgical video, is critical for operating room safety and efficiency. It provides actionable insights into upcoming events, their timing, and risks, enabling better resource allocation, timely instrument readiness, and early warnings for complications (e.g., bleeding, bile duct injury). Despite this need, current surgical AI research focuses on understanding what is happening rather than predicting future events. Existing methods target specific tasks in isolation, lacking unified approaches that span both short-term (action triplets, events) and long-term horizons (remaining surgery duration, phase transitions). These methods rely on coarse-grained supervision while fine-grained surgical action triplets and steps remain underexplored. Furthermore, methods based only on future feature prediction struggle to generalize across different surgical contexts and procedures. We address these limits by reframing surgical future prediction as state-change learning. Rather than forecasting raw observations, our approach classifies state transitions between current and future timesteps. We introduce SurgFUTR, implementing this through a teacher-student architecture. Video clips are compressed into state representations via Sinkhorn-Knopp clustering; the teacher network learns from both current and future clips, while the student network predicts future states from current videos alone, guided by our Action Dynamics (ActDyn) module. We establish SFPBench with five prediction tasks spanning short-term (triplets, events) and long-term (remaining surgery duration, phase and step transitions) horizons. Experiments across four datasets and three procedures show consistent improvements. Cross-procedure transfer validates generalizability.
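
The abstract mentions compressing clips into state representations via Sinkhorn-Knopp clustering without giving details. The following is a minimal sketch of the generic Sinkhorn-Knopp normalization as it is commonly used for balanced cluster assignment, not SurgFUTR's implementation; the score matrix, the `epsilon` temperature, and the iteration count are illustrative assumptions.

```python
import math

def sinkhorn_knopp(scores, epsilon=0.05, n_iters=50):
    """Convert an N x K clip-to-prototype score matrix into balanced soft assignments."""
    n, k = len(scores), len(scores[0])
    # Shift by the global max before exponentiating, for numerical stability.
    top = max(max(row) for row in scores)
    q = [[math.exp((s - top) / epsilon) for s in row] for row in scores]
    for _ in range(n_iters):
        # Column step: each prototype (state) receives total mass 1/K.
        col = [sum(q[i][j] for i in range(n)) for j in range(k)]
        q = [[q[i][j] / (col[j] * k) for j in range(k)] for i in range(n)]
        # Row step: each clip carries total mass 1/N.
        row = [sum(r) for r in q]
        q = [[q[i][j] / (row[i] * n) for j in range(k)] for i in range(n)]
    # Rescale so every clip's assignment is a probability distribution.
    return [[v * n for v in r] for r in q]

# Made-up similarities between 4 clips and 2 hypothetical state prototypes.
scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.2, 0.8]]
assignments = sinkhorn_knopp(scores, epsilon=1.0)
```

Although three of the four clips prefer the first prototype, the alternating normalization pushes the total mass per prototype toward N/K, which is the property that keeps clustering from collapsing onto a single state.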

[2] Unifying Vision-Language Latents for Zero-label Image Caption Enhancement

Sanghyun Byun, Jung Ick Guack, Mohanad Odema, Baisub Lee, Jacob Song, Woo Seong Chung

🧩 TL;DR

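This paper proposes ViZer, a framework that enhances image captioning with zero labels by aligning vision and language representation features, improving the captioning ability of existing vision-language models without text annotations or full retraining.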


📘 Detailed Summary

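Motivation: Existing vision-language models depend on large-scale labeled image datasets, which limits scalability and leaves large amounts of unlabeled image data underused. This calls for zero-label learning methods that do not rely on human or synthetic annotation.

Method: ViZer (Unified Vision-Language Alignment for Zero-Label Enhancement) actively aligns vision and language representation features during training, so existing vision-language models can generate improved captions without text labels or full retraining.

Result: Experiments on SmolVLM-Base and Qwen2-VL show ViZer's advantage in qualitative evaluation, where it produces more grounded and descriptive captions; automated metrics such as CIDEr and BERTScore often penalize details that are absent from the reference captions.

Conclusion: ViZer offers a practical starting point for zero-label adaptation in vision-language tasks, showing that model capability can be improved through representation alignment rather than annotated data.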


📄 Abstract

Vision-language models (VLMs) achieve remarkable performance through large-scale image-text pretraining. However, their reliance on labeled image datasets limits scalability and leaves vast amounts of unlabeled image data underutilized. To address this, we propose Unified Vision-Language Alignment for Zero-Label Enhancement (ViZer), an enhancement training framework that enables zero-label learning in image captioning, providing a practical starting point for broader zero-label adaptation in vision-language tasks. Unlike prior approaches that rely on human or synthetically annotated datasets, ViZer actively aligns vision and language representation features during training, enabling existing VLMs to generate improved captions without requiring text labels or full retraining. We demonstrate ViZer's advantage in qualitative evaluation, as automated caption metrics such as CIDEr and BERTScore often penalize details that are absent in reference captions. Applying ViZer on SmolVLM-Base and Qwen2-VL, we observe consistent qualitative improvements, producing captions that are more grounded and descriptive than their baseline.

[3] Epistemic-aware Vision-Language Foundation Model for Fetal Ultrasound Interpretation

Xiao He, Huangxuan Zhao, Guojia Wan, Wei Zhou, Yanxing Liu, Juhua Liu, Yongchao Xu, Yong Luo, Dacheng Tao, Bo Du

🧩 TL;DR

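This paper presents FetalMind, a medical AI system tailored to fetal ultrasound. Its Salient Epistemic Disentanglement (SED) method injects an expert-curated bipartite graph into the model to decouple view-disease associations and guide clinical reasoning, and the system outperforms existing baselines on fetal ultrasound report generation and diagnosis.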


📘 Detailed Summary

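Motivation: Existing medical vision-language models are mostly designed for structured adult imaging and underperform on fetal ultrasound, which poses challenges of multi-view reasoning, numerous diseases, and image diversity. A system built for fetal ultrasound is needed to close this gap.

Method: Salient Epistemic Disentanglement (SED) injects an expert-curated bipartite graph into the model to decouple view-disease associations and uses reinforcement learning to steer preference selection along clinically faithful steps. The authors also curate FetalSigma-1M, the first large-scale fetal ultrasound report corpus, comprising 20K reports from twelve medical centers.

Result: FetalMind outperforms open- and closed-source baselines across all gestational stages, with an average gain of 14% and 61.2% higher accuracy on critical conditions, while remaining efficient, stable, and scalable.

Conclusion: Workflow-guided epistemic disentanglement mitigates the learning bottlenecks caused by variability across diseases and heterogeneity across views, aligning the model's inference with obstetric practice and providing both a technical approach and a dataset for fetal ultrasound AI.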


📄 Abstract

Recent medical vision-language models have shown promise on tasks such as VQA, report generation, and anomaly detection. However, most are adapted to structured adult imaging and underperform in fetal ultrasound, which poses challenges of multi-view image reasoning, numerous diseases, and image diversity. To bridge this gap, we introduce FetalMind, a medical AI system tailored to fetal ultrasound for both report generation and diagnosis. Guided by clinical workflow, we propose Salient Epistemic Disentanglement (SED), which injects an expert-curated bipartite graph into the model to decouple view-disease associations and to steer preference selection along clinically faithful steps via reinforcement learning. This design mitigates variability across diseases and heterogeneity across views, reducing learning bottlenecks while aligning the model's inference with obstetric practice. To train FetalMind at scale, we curate the FetalSigma-1M dataset, the first large-scale fetal ultrasound report corpus, comprising 20K reports from twelve medical centers, addressing the scarcity of domain data. Extensive experiments show that FetalMind outperforms open- and closed-source baselines across all gestational stages, achieving +14% average gains and +61.2% higher accuracy on critical conditions while remaining efficient, stable, and scalable. Project Page: https://hexiao0275.github.io/FetalMind.

[4] Scope: Selective Cross-modal Orchestration of Visual Perception Experts

Tianyu Zhang, Suyuchen Wang, Chao Wang, Juan Rodriguez, Ahmed Masry, Xiangru Jian, Yoshua Bengio, Perouz Taslakian

🧩 TL;DR

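This paper proposes SCOPE, a Mixture-of-Encoders framework that uses instance-level routing to dynamically select a specialized vision encoder for each image-text pair, cutting compute while outperforming conventional multi-encoder models. With only one shared encoder and one routed encoder, it outperforms models that use all four extra encoders simultaneously, with 24-49% less compute.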


📘 Detailed Summary

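Motivation: Current vision-language models gain performance by stacking multiple vision encoders, but naive stacking multiplies compute cost with diminishing returns. Existing approaches lack a mechanism for choosing the most suitable encoder for a given image-text pair, which wastes compute.

Method: SCOPE adopts a Mixture-of-Encoders architecture with one shared encoder and a pool of routed encoders. A lightweight router applies cross-attention between the text prompt and the shared visual features to select the best encoder per instance. Training adds dual entropy regularization with auxiliary losses to balance dataset-level load distribution against instance-level routing confidence.

Result: SCOPE with one shared plus one routed encoder outperforms models that use all four extra encoders simultaneously, while reducing compute by 24-49%.

Conclusion: The work challenges the brute-force aggregation paradigm common in multi-encoder vision-language models, showing that intelligent encoder selection beats simple stacking and that instance-level adaptive routing can balance performance and efficiency.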


📄 Abstract

Vision-language models (VLMs) benefit from multiple vision encoders, but naively stacking them yields diminishing returns while multiplying inference costs. We propose SCOPE, a Mixture-of-Encoders (MoEnc) framework that dynamically selects one specialized encoder per image-text pair via instance-level routing, unlike token-level routing in traditional MoE. SCOPE maintains a shared encoder and a pool of routed encoders. A lightweight router uses cross-attention between text prompts and shared visual features to select the optimal encoder from the routed encoders. To train this router, we introduce dual entropy regularization with auxiliary losses to balance dataset-level load distribution with instance-level routing confidence. Remarkably, SCOPE with one shared plus one routed encoder outperforms models using all four extra encoders simultaneously, while reducing compute by 24-49%. This demonstrates that intelligent encoder selection beats brute-force aggregation, challenging the prevailing paradigm in multi-encoder VLMs.
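
The abstract does not spell out the dual entropy regularization, so the sketch below shows one plausible reading: a low per-instance entropy rewards confident routing, while a high entropy of the batch-averaged routing distribution rewards balanced load. The function names, the weights `w_conf` and `w_load`, and the sign convention are assumptions, not SCOPE's actual losses.

```python
import math

def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def dual_entropy_terms(batch_logits):
    """Return (mean per-instance entropy, entropy of the batch-mean routing distribution)."""
    probs = [softmax(row) for row in batch_logits]
    instance_entropy = sum(entropy(p) for p in probs) / len(probs)
    k = len(probs[0])
    mean_probs = [sum(p[j] for p in probs) / len(probs) for j in range(k)]
    return instance_entropy, entropy(mean_probs)

def router_regularizer(batch_logits, w_conf=1.0, w_load=1.0):
    """Hypothetical auxiliary loss: confident per instance, balanced across the batch."""
    instance_entropy, load_entropy = dual_entropy_terms(batch_logits)
    return w_conf * instance_entropy - w_load * load_entropy

def route(logits):
    """Instance-level routing: pick one encoder index for the whole image-text pair."""
    return max(range(len(logits)), key=lambda j: logits[j])

balanced = [[10.0, 0.0], [0.0, 10.0]]   # confident, and both encoders get used
collapsed = [[10.0, 0.0], [10.0, 0.0]]  # confident, but one encoder takes all traffic
```

Under this reading, `router_regularizer(balanced)` is lower than `router_regularizer(collapsed)`, so minimizing it discourages the router from sending every instance to the same encoder.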

[5] SVAG-Bench: A Large-Scale Benchmark for Multi-Instance Spatio-temporal Video Action Grounding

Tanveer Hannan, Shuaicong Wu, Mark Weber, Suprosanna Shit, Jindong Gu, Rajat Koner, Aljoša Ošep, Laura Leal-Taixé, Thomas Seidl

🧩 TL;DR

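This paper introduces Spatio-temporal Video Action Grounding (SVAG), a new task that requires simultaneously detecting, tracking, and temporally localizing all referent objects in a video based on natural language descriptions of their actions. It also builds the large-scale benchmark SVAG-Bench and the baseline framework SVAGFormer to address the limits of existing methods in reasoning about fine-grained object-action interactions.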


📘 Detailed Summary

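Motivation: Existing video understanding methods mainly address coarse-grained action recognition or generic object tracking, overlooking the joint challenge of detecting and tracking multiple objects according to their actions while grounding them temporally. This limits progress on fine-grained action understanding and spatio-temporal localization in next-generation AI systems.

Method: The baseline SVAGFormer adapts state-of-the-art vision-language models for joint spatial and temporal grounding, and the SVAGEval toolkit standardizes evaluation for fair and reproducible benchmarking. The SVAG-Bench benchmark comprises 688 videos, 19,590 annotated records, and 903 unique verbs.

Result: Existing models perform poorly on SVAG, particularly in dense or complex scenes, underscoring the need for more advanced reasoning over fine-grained object-action interactions in long videos.

Conclusion: The study exposes a clear gap in current video understanding systems for joint action-object reasoning and provides a benchmark and direction for spatio-temporal reasoning models relevant to embodied agents, autonomous platforms, and human-AI interaction frameworks.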


📄 Abstract

Understanding fine-grained actions and accurately localizing their corresponding actors in space and time are fundamental capabilities for advancing next-generation AI systems, including embodied agents, autonomous platforms, and human-AI interaction frameworks. Despite recent progress in video understanding, existing methods predominantly address either coarse-grained action recognition or generic object tracking, thereby overlooking the challenge of jointly detecting and tracking multiple objects according to their actions while grounding them temporally. To address this gap, we introduce Spatio-temporal Video Action Grounding (SVAG), a novel task that requires models to simultaneously detect, track, and temporally localize all referent objects in videos based on natural language descriptions of their actions. To support this task, we construct SVAG-Bench, a large-scale benchmark comprising 688 videos, 19,590 annotated records, and 903 unique verbs, covering a diverse range of objects, actions, and real-world scenes. We further propose SVAGFormer, a baseline framework that adapts state-of-the-art vision-language models for joint spatial and temporal grounding, and introduce SVAGEval, a standardized evaluation toolkit for fair and reproducible benchmarking. Empirical results show that existing models perform poorly on SVAG, particularly in dense or complex scenes, underscoring the need for more advanced reasoning over fine-grained object-action interactions in long videos.

[6] Edit-Your-Interest: Efficient Video Editing via Feature Most-Similar Propagation

Yi Zuo, Zitao Wang, Lingling Li, Xu Liu, Fang Liu, Licheng Jiao

🧩 TL;DR

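This paper proposes Edit-Your-Interest, a lightweight, text-driven, zero-shot video editing method. A spatio-temporal feature memory bank and a Feature Most-Similar Propagation mechanism reduce computational overhead and improve temporal consistency.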


📘 Detailed Summary

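Motivation: Video editing methods built on text-to-image diffusion models are severely limited by high computational overhead and memory consumption, and they often sacrifice visual fidelity, producing temporal inconsistencies and artifacts such as blurring and mosaic-like patterns.

Method: A Spatio-Temporal Feature Memory bank (SFM) caches features from previous frames to reduce computational overhead; Feature Most-Similar Propagation (FMP) propagates the most relevant tokens to preserve temporal consistency; an SFM update algorithm continuously refreshes the cached features; and cross-attention maps are used to automatically extract masks for the instances of interest, which are integrated into the diffusion denoising process.

Result: Extensive experiments show that Edit-Your-Interest outperforms state-of-the-art methods in both efficiency and visual fidelity.

Conclusion: Efficient spatio-temporal feature management and automatic mask extraction enable high-quality video editing that preserves background integrity, offering a practical solution for lightweight video editing.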


📄 Abstract

Text-to-image (T2I) diffusion models have recently demonstrated significant progress in video editing. However, existing video editing methods are severely limited by their high computational overhead and memory consumption. Furthermore, these approaches often sacrifice visual fidelity, leading to undesirable temporal inconsistencies and artifacts such as blurring and pronounced mosaic-like patterns. We propose Edit-Your-Interest, a lightweight, text-driven, zero-shot video editing method. Edit-Your-Interest introduces a spatio-temporal feature memory to cache features from previous frames, significantly reducing computational overhead compared to full-sequence spatio-temporal modeling approaches. Specifically, we first introduce a Spatio-Temporal Feature Memory bank (SFM), which is designed to efficiently cache and retain the crucial image tokens processed by spatial attention. Second, we propose the Feature Most-Similar Propagation (FMP) method. FMP propagates the most relevant tokens from previous frames to subsequent ones, preserving temporal consistency. Finally, we introduce an SFM update algorithm that continuously refreshes the cached features, ensuring their long-term relevance and effectiveness throughout the video sequence. Furthermore, we leverage cross-attention maps to automatically extract masks for the instances of interest. These masks are seamlessly integrated into the diffusion denoising process, enabling fine-grained control over target objects and allowing Edit-Your-Interest to perform highly accurate edits while robustly preserving the background integrity. Extensive experiments decisively demonstrate that the proposed Edit-Your-Interest outperforms state-of-the-art methods in both efficiency and visual fidelity, validating its superior effectiveness and practicality.
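
To make the idea of propagating the "most relevant tokens" concrete, here is a minimal sketch that matches each current-frame token to its most similar cached token by cosine similarity and refreshes the cache first-in, first-out. The similarity measure, the FIFO policy, and all function names are assumptions for illustration; the abstract does not specify how FMP or the SFM update are implemented.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0

def most_similar_propagation(current_tokens, memory_tokens):
    """For each current-frame token, return (index, features) of the closest cached token."""
    matches = []
    for tok in current_tokens:
        sims = [cosine(tok, mem) for mem in memory_tokens]
        best = max(range(len(sims)), key=lambda j: sims[j])
        matches.append((best, memory_tokens[best]))
    return matches

def update_memory(memory_tokens, new_tokens, capacity):
    """Hypothetical FIFO refresh: keep only the most recent `capacity` tokens."""
    return (memory_tokens + new_tokens)[-capacity:]

memory = [[1.0, 0.0], [0.0, 1.0]]      # toy cached tokens from earlier frames
current = [[0.9, 0.1], [0.2, 0.8]]     # toy tokens from the frame being edited
matches = most_similar_propagation(current, memory)
```

Reusing cached features in this way avoids attending over the full video sequence for every frame, which is where the abstract locates the efficiency gain.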

[7] EgoSocial: Benchmarking Proactive Intervention Ability of Omnimodal LLMs via Egocentric Social Interaction Perception

Xijun Wang, Tanay Sharma, Achin Kulshrestha, Abhimitra Meka, Aveek Purohit, Dinesh Manocha

🧩 TL;DR

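This work introduces the EgoSocial dataset and the EgoSoD method to address the lack of social awareness in AI assistants for AR/VR settings. EgoSoD models social interactions dynamically using multimodal contextual cues and a social thinking graph, improving both intervention timing detection and social interaction understanding.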


📘 Detailed Summary

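Motivation: As AR/VR technologies enter daily life, AI needs to understand human social dynamics from an egocentric perspective. Current LLMs lack the social awareness to judge when to intervene as an AI assistant, leading to constant, socially unaware responses that may disrupt natural conversation and hurt user focus.

Method: EgoSocial is a large-scale egocentric dataset of 13,500 social video-question pairs for benchmarking intervention timing detection. EgoSoD is an end-to-end method that integrates multimodal contextual cues (e.g., audio and visual cues) into a social thinking graph, dynamically modeling participants and interactions.

Result: Existing omnimodal LLMs struggle to detect intervention timing (14.4% for Gemini 2.5 Pro). EgoSoD improves Intervention Timing performance by 45.6% for Phi-4 and 9.9% for Gemini 2.5 Pro, and overall Social Interaction performance by 20.4% and 6.9%, respectively.

Conclusion: The study exposes the limits of current OLLMs in social perception and shows that multimodal fusion with social dynamics modeling improves intervention timing detection. The authors plan to release the dataset and code.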


📄 Abstract

As AR/VR technologies become integral to daily life, there's a growing need for AI that understands human social dynamics from an egocentric perspective. However, current LLMs often lack the social awareness to discern when to intervene as an AI assistant. This leads to constant, socially unaware responses that may disrupt natural conversation and negatively impact user focus. To address these limitations, we introduce EgoSocial, a large-scale egocentric dataset with 13,500 social video-question pairs, specifically designed to benchmark intervention in social interaction perception. We also present an in-depth analysis of current omnimodal LLMs (OLLMs) to assess their effectiveness in detecting diverse social contextual cues. Experiments show that OLLMs still struggle to detect the intervention timing (14.4% for Gemini 2.5 Pro). We also propose EgoSoD (EgoSocial Detection), an end-to-end method for robustly discerning social dynamics. Informed by our OLLM analysis, EgoSoD integrates multimodal contextual cues (e.g., audio and visual cues) into a social thinking graph, dynamically modeling participants and interactions. Our method proactively detects intervention timing and social interactions, precisely determining when to intervene. EgoSoD improves Phi-4 by 45.6% and Gemini 2.5 Pro by 9.9% on Intervention Timing performance, and improves Phi-4 by 20.4% and Gemini 2.5 Pro by 6.9% on overall Social Interaction performance. We will release the dataset and code soon.

[8] DriveCritic: Towards Context-Aware, Human-Aligned Evaluation for Autonomous Driving with Vision-Language Models

Jingyu Song, Zhenxin Li, Shiyi Lan, Xinglong Sun, Nadine Chang, Maying Shen, Joshua Chen, Katherine A. Skinner, Jose M. Alvarez

🧩 TL;DR

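This paper proposes DriveCritic, which pairs a dataset of scenarios where context is critical with a vision-language-model-based evaluator to address the lack of context awareness in evaluating autonomous driving planners, improving agreement with human judgment.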


📘 Detailed Summary

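Motivation: State-of-the-art metrics for evaluating autonomous driving planners, such as EPDMS, lack context awareness in nuanced scenarios and do not accurately reflect human judgment, which limits reliable evaluation of autonomous driving systems.

Method: DriveCritic has two core components: the DriveCritic dataset, a curated collection of context-critical scenarios annotated with pairwise human preferences, and the DriveCritic model, a vision-language-model-based evaluator fine-tuned with a two-stage supervised and reinforcement learning pipeline that integrates visual and symbolic context to adjudicate between trajectory pairs.

Result: DriveCritic significantly outperforms existing metrics and baselines in matching human preferences and demonstrates strong context awareness.

Conclusion: The work provides a more reliable, human-aligned foundation for evaluating autonomous driving systems, addressing the limits of existing metrics in nuanced scenarios through context-aware evaluation.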


📄 Abstract

Benchmarking autonomous driving planners to align with human judgment remains a critical challenge, as state-of-the-art metrics like the Extended Predictive Driver Model Score (EPDMS) lack context awareness in nuanced scenarios. To address this, we introduce DriveCritic, a novel framework featuring two key contributions: the DriveCritic dataset, a curated collection of challenging scenarios where context is critical for correct judgment, annotated with pairwise human preferences, and the DriveCritic model, a Vision-Language Model (VLM) based evaluator. Fine-tuned using a two-stage supervised and reinforcement learning pipeline, the DriveCritic model learns to adjudicate between trajectory pairs by integrating visual and symbolic context. Experiments show DriveCritic significantly outperforms existing metrics and baselines in matching human preferences and demonstrates strong context awareness. Overall, our work provides a more reliable, human-aligned foundation for evaluating autonomous driving systems.

[9] OS-HGAdapter: Open Semantic Hypergraph Adapter for Large Language Models Assisted Entropy-Enhanced Image-Text Alignment

Rongjun Chen, Chengsi Yao, Jinchang Ren, Xianxian Zeng, Peixian Wang, Jun Yuan, Jiawen Li, Huimin Zhao, Xu Lu

🧩 TL;DR

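This paper proposes a hypergraph adapter that draws on the open semantic knowledge of large language models, raising the information entropy of the text modality to close the entropy gap between text and images and improve cross-modal retrieval. The method achieves state-of-the-art semantic alignment on the Flickr30K and MS-COCO benchmarks.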


📘 Detailed Summary

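Motivation: Text-image alignment is a foundational challenge in multimedia content understanding. Because text and images differ inherently in information entropy, conventional methods often perform unevenly across the two retrieval directions. The paper targets the lower information entropy of text relative to images and aims to close this gap by reproducing the alignment ability humans show in these tasks.

Method: The method has two steps. First, a prompt template that does not rely on explicit task-domain knowledge uses an LLM to enrich the polysemy description of the text modality, increasing its information entropy. Second, a hypergraph adapter builds multilateral connections between the text and image modalities, correcting positive and negative matching errors for synonymous semantics in the same fixed embedding space, while reducing the noise introduced by open semantic entropy by mapping the reduced dimensions back to the original dimensions.

Result: Evaluations on Flickr30K and MS-COCO report cross-modal retrieval gains of 16.8% (text-to-image) and 40.1% (image-to-text) over existing methods, establishing new state-of-the-art performance in semantic alignment.

Conclusion: Open semantic knowledge from large language models can close the information-entropy gap between text and images, and the multilateral connections of a hypergraph improve cross-modal semantic alignment. The approach suggests improving cross-modal retrieval by enriching semantics rather than only optimizing the embedding space.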


📄 Abstract

Text-image alignment constitutes a foundational challenge in multimedia content understanding, where effective modeling of cross-modal semantic correspondences critically enhances retrieval system performance through joint embedding space optimization. Given the inherent difference in information entropy between texts and images, conventional approaches often show an imbalance in the mutual retrieval of these two modalities. To address this particular challenge, we propose to use the open semantic knowledge of a Large Language Model (LLM) to fill the entropy gap and reproduce the alignment ability of humans in these tasks. Our entropy-enhancing alignment is achieved through a two-step process: 1) a new prompt template that does not rely on explicit knowledge in the task domain is designed to use the LLM to enhance the polysemy description of the text modality. By analogy, the information entropy of the text modality relative to the visual modality is increased; 2) a hypergraph adapter is used to construct multilateral connections between the text and image modalities, which can correct the positive and negative matching errors for synonymous semantics in the same fixed embedding space, whilst reducing the noise caused by open semantic entropy by mapping the reduced dimensions back to the original dimensions. Comprehensive evaluations on the Flickr30K and MS-COCO benchmarks validate the superiority of our Open Semantic Hypergraph Adapter (OS-HGAdapter), showcasing 16.8% (text-to-image) and 40.1% (image-to-text) cross-modal retrieval gains over existing methods while establishing new state-of-the-art performance in semantic alignment tasks.

[10] Foveation Improves Payload Capacity in Steganography

Lifeng Qiu Lin, Henry Kam, Qi Sun, Kaan Akşit

🧩 TL;DR

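This work proposes a steganography method based on efficient latent representations and foveated rendering, raising payload capacity from 100 to 500 bits while improving accuracy to as few as 1 failure bit out of 2000, with visual quality of 31.47 dB PSNR and 0.13 LPIPS.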


📘 Detailed Summary

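Motivation: Steganography in visual media, used for purposes such as providing metadata and watermarking, is limited in capacity and accuracy, which motivates methods that improve both.

Method: Models are trained with efficient latent representations and foveated rendering, using a novel perceptual design to create multi-modal latent representations.

Result: Capacity improves from 100 to 500 bits, with accuracy of up to 1 failure bit out of 2000 at 200K test bits, and visual quality reaches 31.47 dB PSNR and 0.13 LPIPS.

Conclusion: The study shows the effectiveness of the novel perceptual design for creating multi-modal latent representations in steganography, offering a way to improve capacity and accuracy together.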


📄 Abstract

Steganography finds use in visual media, for example in providing metadata and watermarking. With the support of efficient latent representations and foveated rendering, we trained models that improve existing capacity limits from 100 to 500 bits, while achieving better accuracy of up to 1 failure bit out of 2000, at 200K test bits. Finally, we achieve a comparable visual quality of 31.47 dB PSNR and 0.13 LPIPS, showing the effectiveness of novel perceptual design in creating multi-modal latent representations in steganography.
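
The reported figures can be put in perspective with a little arithmetic. The sketch below assumes that "1 failure bit out of 2000, at 200K test bits" describes a failure rate measured over a 200,000-bit test set, and that PSNR is computed on images scaled to a peak value of 1.0; both are interpretations rather than details given in the abstract.

```python
import math

def bit_accuracy(failed_bits, total_bits):
    """Fraction of payload bits recovered correctly."""
    return 1.0 - failed_bits / total_bits

def expected_failures(failure_rate, test_bits):
    """Number of wrong bits implied by a failure rate over a test set."""
    return failure_rate * test_bits

def psnr_db(mse, peak=1.0):
    """Standard PSNR definition: 10 * log10(peak^2 / MSE)."""
    return 10.0 * math.log10(peak * peak / mse)

def mse_from_psnr(psnr, peak=1.0):
    """Invert the PSNR definition to recover the mean squared error."""
    return peak * peak / (10.0 ** (psnr / 10.0))

accuracy = bit_accuracy(1, 2000)                  # roughly 99.95% bit accuracy
failures = expected_failures(1 / 2000, 200_000)   # about 100 wrong bits in 200K
implied_mse = mse_from_psnr(31.47)                # MSE on a [0, 1] pixel scale
```

Under these assumptions, a 31.47 dB PSNR corresponds to a mean squared error below 0.001 on a unit pixel scale.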

[11] What "Not" to Detect: Negation-Aware VLMs via Structured Reasoning and Token Merging

Inha Kang, Youngsun Lim, Seonho Lee, Jiho Choi, Junsuk Choe, Hyunjung Shim

🧩 TL;DR

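This paper proposes the CoVAND dataset and the NegToMe module to address affirmative bias, the failure of vision-language models to understand negation. Chain-of-thought data generation and text token merging improve negation understanding in described object detection.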


📘 Detailed Summary

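Motivation: State-of-the-art vision-language models suffer from affirmative bias when interpreting negation. The limitation is particularly severe in described object detection (DOD), where models mishandle queries that contain negated descriptions.

Method: There are two main contributions. CoVAND is a dataset built with a chain-of-thought and VQA-based pipeline that generates high-quality, instance-grounded negation data. NegToMe is a text token merging module that groups negation cues with attributes into coherent semantic phrases, addressing the structural loss of negation cues during tokenization. The module is combined with parameter-efficient LoRA fine-tuning.

Result: The method improves performance on challenging negation benchmarks with a lowered false positive rate, boosting NMS-AP by up to +10.8 points on OVDEval, and generalizes to state-of-the-art vision-language models.

Conclusion: The work is a step toward negation understanding in real-world detection applications: by directly addressing the loss of negation cues at the architectural level, it achieves robust negation understanding even with limited data.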


📄 Abstract

State-of-the-art vision-language models (VLMs) suffer from a critical failure in understanding negation, often referred to as affirmative bias. This limitation is particularly severe in described object detection (DOD) tasks. To address this, we propose two primary contributions: (1) a new dataset pipeline and (2) a novel, lightweight adaptation recipe. First, we introduce CoVAND, a dataset constructed with a systematic chain-of-thought (CoT) and VQA-based pipeline to generate high-quality, instance-grounded negation data. Second, we propose NegToMe, a novel text token merging module that directly tackles the architectural cause of affirmative bias. NegToMe fundamentally addresses the structural loss of negation cues in tokenization, grouping them with attributes into coherent semantic phrases. It maintains correct polarity at the input level, enabling robust negation understanding even with limited data. For instance, to prevent a model from treating the fragmented tokens "not" and "girl" as simply "girl", NegToMe binds them into a single token whose meaning is correctly distinguished from that of "girl" alone. This module is integrated with a parameter-efficient and strategic LoRA fine-tuning approach. Our method significantly improves performance on challenging negation benchmarks with a lowered false positive rate, boosting NMS-AP by up to +10.8 points on OVDEval and demonstrating generalization to SoTA VLMs. This work marks a crucial step forward in addressing negation understanding for real-world detection applications.
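
The "not" plus "girl" example can be illustrated with a toy token-merging routine. This is only a sketch of the general idea: the cue list, the underscore-joined token name, and the choice to average the two embeddings are all assumptions, and the abstract does not describe NegToMe's actual merging rule.

```python
NEGATION_CUES = {"not", "no", "without"}  # hypothetical cue list

def merge_negation_tokens(tokens, embeddings):
    """Bind each negation cue to the token that follows it, yielding one merged token."""
    merged_tokens, merged_embeddings = [], []
    i = 0
    while i < len(tokens):
        if tokens[i].lower() in NEGATION_CUES and i + 1 < len(tokens):
            merged_tokens.append(tokens[i] + "_" + tokens[i + 1])
            # Assumption: represent the merged phrase by the mean of both embeddings.
            pair = zip(embeddings[i], embeddings[i + 1])
            merged_embeddings.append([(a + b) / 2.0 for a, b in pair])
            i += 2
        else:
            merged_tokens.append(tokens[i])
            merged_embeddings.append(list(embeddings[i]))
            i += 1
    return merged_tokens, merged_embeddings

tokens = ["a", "person", "not", "girl"]
embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [2.0, 0.0]]  # toy 2-D vectors
new_tokens, new_embeddings = merge_negation_tokens(tokens, embeddings)
```

After merging, the phrase is a single unit whose vector differs from that of "girl" alone, so a downstream matcher can no longer drop the negation by attending only to the noun.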

[12] Prompt-based Adaptation in Large-scale Vision Models: A Survey

Xi Xiao, Yunbei Zhang, Lin Zhao, Yiyang Liu, Xiaoying Liao, Zheda Mai, Xingjian Li, Xiao Wang, Hao Xu, Jihun Hamm, Xue Lin, Min Xu, Qifan Wang, Tianyang Wang, Cheng Han

🧩 TL;DR

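This survey presents what the authors describe as the first comprehensive review of Prompt-based Adaptation (PA) methods. It proposes a unified conceptual framework and taxonomy, clarifies the boundary between Visual Prompting (VP) and Visual Prompt Tuning (VPT), and gives researchers a roadmap of the field.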


📘 Detailed Summary

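Motivation: Visual Prompting (VP) and Visual Prompt Tuning (VPT) have blurred conceptual boundaries and are often used interchangeably, with no systematic distinction between the two techniques and their respective applications.

Method: The survey revisits the designs of VP and VPT from first principles and conceptualizes them within a unified Prompt-based Adaptation (PA) framework, with a taxonomy based on learnability (learnable, generative, non-learnable) and injection granularity (pixel-level, token-level).

Result: The survey covers methodologies and applications across domains including medical imaging, 3D point clouds, and vision-language tasks, discusses the role of PA in test-time adaptation and trustworthy AI, and summarizes current benchmarks and key challenges.

Conclusion: The survey offers researchers and practitioners a roadmap for understanding the evolving landscape of PA-related research and identifies future directions.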


📄 Abstract

In computer vision, Visual Prompting (VP) and Visual Prompt Tuning (VPT) have recently emerged as lightweight and effective alternatives to full fine-tuning for adapting large-scale vision models within the "pretrain-then-finetune" paradigm. However, despite rapid progress, their conceptual boundaries remain blurred, as VP and VPT are frequently used interchangeably in current research, reflecting a lack of systematic distinction between these techniques and their respective applications. In this survey, we revisit the designs of VP and VPT from first principles, and conceptualize them within a unified framework termed Prompt-based Adaptation (PA). We provide a taxonomy that categorizes existing methods into learnable, generative, and non-learnable prompts, and further organizes them by injection granularity: pixel-level and token-level. Beyond the core methodologies, we examine PA's integrations across diverse domains, including medical imaging, 3D point clouds, and vision-language tasks, as well as its role in test-time adaptation and trustworthy AI. We also summarize current benchmarks and identify key challenges and future directions. To the best of our knowledge, this is the first comprehensive survey dedicated to PA's methodologies and applications in light of their distinct characteristics. Our survey aims to provide a clear roadmap for researchers and practitioners in all areas to understand and explore the evolving landscape of PA-related research.
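
The pixel-level versus token-level distinction in the taxonomy can be shown with toy data structures. The sketch below assumes the common forms of each: a pixel-level prompt added to the border of the input image, and token-level prompts prepended to the sequence of patch embeddings. The helper names and the border layout are illustrative choices, not definitions taken from the survey.

```python
def pixel_level_prompt(image, prompt, border=1):
    """Add prompt values to the border pixels of a 2-D image (a list of rows)."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(h):
        for x in range(w):
            on_border = y < border or y >= h - border or x < border or x >= w - border
            if on_border:
                out[y][x] = image[y][x] + prompt[y][x]
    return out

def token_level_prompt(patch_tokens, prompt_tokens):
    """Prepend prompt tokens to the sequence of patch embeddings."""
    return [tok[:] for tok in prompt_tokens] + [tok[:] for tok in patch_tokens]

image = [[0.0] * 3 for _ in range(3)]
prompt = [[1.0] * 3 for _ in range(3)]
prompted_image = pixel_level_prompt(image, prompt)       # changes the model input

patches = [[0.1, 0.2], [0.3, 0.4]]
prompts = [[9.0, 9.0]]
prompted_sequence = token_level_prompt(patches, prompts)  # changes the token sequence
```

In both cases the backbone weights stay untouched; what differs is where the prompt is injected, which is the axis the taxonomy calls injection granularity.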

[13] Self-Augmented Visual Contrastive Decoding

Eun Woo Im, Muhammad Kashif Ali, Vivek Gupta

🧩 TL;DR

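This paper proposes a training-free decoding strategy that combines query-dependent visual augmentation with an adaptive thresholding algorithm to reduce hallucination in large vision-language models, outperforming existing decoding methods on multiple benchmarks.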


📘 Detailed Summary

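Motivation: Large vision-language models show strong multimodal capabilities but inherit the tendency to hallucinate from their underlying language models. Existing visual contrastive decoding methods usually apply generic visual augmentations that disregard the specific context provided by the text query, which limits their effectiveness.

Method: The method has two key components. A self-augmentation prompting strategy uses the model's intrinsic knowledge to dynamically align semantics between the query and the visual augmentation. An adaptive thresholding algorithm adjusts the size of the next-token candidate set based on output sparsity, using the full information in the logit distribution.

Result: Experiments across four LVLMs and seven benchmarks show that the proposed decoding significantly improves factual consistency over state-of-the-art decoding methods.

Conclusion: The work highlights the importance of integrating query-dependent augmentation and entropy-aware decoding for LVLM generation, offering a route to mitigating hallucination in multimodal models.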


📄 Abstract

Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal capabilities, but they inherit the tendency to hallucinate from their underlying language models. While visual contrastive decoding has been proposed to mitigate this issue, existing methods often apply generic visual augmentations that disregard the specific context provided by the text query, limiting their effectiveness. This study introduces a novel training-free decoding strategy that addresses these limitations, featuring two key contributions. First, a self-augmentation prompting strategy that leverages the intrinsic knowledge of the model to dynamically align semantics between the query and the visual augmentation. Second, an adaptive thresholding algorithm that adaptively adjusts next token candidate size based on the output sparsity, utilizing full information from the logit distribution. Extensive experiments across four LVLMs and seven benchmarks demonstrate that the proposed decoding significantly enhances factual consistency compared to state-of-the-art decoding methods. This work highlights the importance of integrating query-dependent augmentation and entropy-aware decoding for improving effective generation of LVLMs.
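
A single decoding step can be sketched as follows. The contrastive score follows the commonly used visual contrastive decoding form, (1 + alpha) * logits_original - alpha * logits_augmented, restricted to a plausible candidate set. The entropy-scaled cutoff is only one plausible way to make the candidate set adapt to output sparsity; it is an assumption, as are `alpha` and `beta_max`, and it is not the paper's algorithm.

```python
import math

def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def contrastive_step(logits_orig, logits_aug, alpha=1.0, beta_max=0.5):
    """Pick the next token by contrasting original-image and augmented-image logits."""
    probs = softmax(logits_orig)
    # Normalized entropy in [0, 1]: near 0 for a peaked output, 1 for a uniform one.
    flatness = entropy(probs) / math.log(len(probs))
    # Assumed rule: peaked outputs get a strict cutoff, flat outputs keep more candidates.
    cutoff = beta_max * (1.0 - flatness) * max(probs)
    candidates = [i for i, p in enumerate(probs) if p >= cutoff]
    scores = {i: (1.0 + alpha) * logits_orig[i] - alpha * logits_aug[i] for i in candidates}
    return max(scores, key=scores.get), candidates

logits_orig = [2.0, 1.9, -5.0]   # toy logits with the original image
logits_aug = [2.0, 0.0, -20.0]   # toy logits with the augmented image
token, candidates = contrastive_step(logits_orig, logits_aug)
```

In this toy case the third token has the largest raw contrastive score but is implausible under the original distribution, so the candidate filter removes it and the second token is selected.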

[14] EPIPTrack: Rethinking Prompt Modeling with Explicit and Implicit Prompts for Multi-Object Tracking

Yukuan Zhang, Jiarui Zhao, Shangqing Nie, Jin Kuang, Shengsheng Wang

🧩 TL;DR

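This paper proposes EPIPTrack, a unified multimodal vision-language tracking framework that uses explicit and implicit prompts for dynamic target modeling and semantic alignment, improving tracking adaptability and performance.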


📘 Detailed Summary

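Motivation: Existing methods rely on static textual descriptions generated by large language models, which cannot adapt to real-time changes in target state and are prone to hallucination. This limits the potential of multimodal semantic cues in tracking.

Method: In EPIPTrack, explicit prompts transform spatial motion information into natural language descriptions that provide spatiotemporal guidance, and implicit prompts combine pseudo-words with learnable descriptors to construct individualized knowledge representations. Both are dynamically adjusted via the CLIP text encoder, and a Discriminative Feature Augmentor enhances visual and cross-modal representations.

Result: Experiments on MOT17, MOT20, and DanceTrack show that EPIPTrack outperforms existing trackers in diverse scenarios, with robust adaptability.

Conclusion: The study demonstrates the effectiveness of dynamic multimodal prompts in tracking and offers an approach to real-time adaptive target modeling.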


📄 Abstract

Multimodal semantic cues, such as textual descriptions, have shown strong potential in enhancing target perception for tracking. However, existing methods rely on static textual descriptions from large language models, which lack adaptability to real-time target state changes and are prone to hallucinations. To address these challenges, we propose a unified multimodal vision-language tracking framework, named EPIPTrack, which leverages explicit and implicit prompts for dynamic target modeling and semantic alignment. Specifically, explicit prompts transform spatial motion information into natural language descriptions to provide spatiotemporal guidance. Implicit prompts combine pseudo-words with learnable descriptors to construct individualized knowledge representations capturing appearance attributes. Both prompts undergo dynamic adjustment via the CLIP text encoder to respond to changes in target state. Furthermore, we design a Discriminative Feature Augmentor to enhance visual and cross-modal representations. Extensive experiments on MOT17, MOT20, and DanceTrack demonstrate that EPIPTrack outperforms existing trackers in diverse scenarios, exhibiting robust adaptability and superior performance.
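
As an illustration of turning spatial motion into a natural language description, the sketch below converts the displacement between two bounding boxes into a short phrase. The box format, the `min_shift` threshold, and the sentence template are hypothetical; the abstract does not give EPIPTrack's actual prompt wording.

```python
def describe_motion(prev_box, curr_box, min_shift=2.0):
    """Describe motion between two boxes given as (x_center, y_center, width, height).

    Coordinates are in pixels, with y growing downward as in image space.
    """
    dx = curr_box[0] - prev_box[0]
    dy = curr_box[1] - prev_box[1]
    directions = []
    if abs(dx) >= min_shift:
        directions.append("right" if dx > 0 else "left")
    if abs(dy) >= min_shift:
        directions.append("down" if dy > 0 else "up")
    if not directions:
        return "a person staying still"
    return "a person moving " + " and ".join(directions)

prompt = describe_motion((100.0, 50.0, 20.0, 40.0), (110.0, 50.0, 20.0, 40.0))
```

Because the phrase is regenerated from the latest boxes at every frame, the text fed to the text encoder tracks the target's current state instead of staying fixed, which is the contrast with static descriptions that the abstract draws.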

[15] Language as a Label: Zero-Shot Multimodal Classification of Everyday Postures under Data Scarcity

MingZe Tang, Jubal Chandy Jacob

🧩 TL;DR

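This study reports a counter-intuitive effect in zero-shot classification with vision-language models: for the highest-performing models, the simplest prompts give the best results, while adding descriptive detail significantly degrades performance, a phenomenon the authors term "prompt overfitting".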


📘 Detailed Summary

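Motivation: Vision-language models enable zero-shot classification by aligning images and text in a shared space, but the effect of prompt design on recognizing visually similar categories such as human postures is not well understood, particularly under data scarcity.

Method: A suite of modern vision-language models (OpenCLIP, MetaCLIP 2, and SigLip) is evaluated on a 285-image COCO-derived dataset using a three-tiered prompt design that systematically increases linguistic detail, to analyze how prompt specificity affects classification of sitting, standing, and walking/running.

Result: For the highest-performing models (MetaCLIP 2 and OpenCLIP), the simplest prompts consistently achieve the best results, and adding descriptive detail significantly degrades performance; for example, MetaCLIP 2's multi-class accuracy drops from 68.8% to 55.1%. The lower-performing SigLip model instead improves on ambiguous classes when given more descriptive, body-cue-based prompts.

Conclusion: The study introduces the notion of "prompt overfitting" and suggests that prompt design for zero-shot classification should depend on the model: stronger models favor simple prompts, while weaker models can benefit from more descriptive ones.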


📄 Abstract

Recent Vision-Language Models (VLMs) enable zero-shot classification by aligning images and text in a shared space, a promising approach for data-scarce conditions. However, the influence of prompt design on recognizing visually similar categories, such as human postures, is not well understood. This study investigates how prompt specificity affects the zero-shot classification of sitting, standing, and walking/running on a small, 285-image COCO-derived dataset. A suite of modern VLMs, including OpenCLIP, MetaCLIP 2, and SigLip, were evaluated using a three-tiered prompt design that systematically increases linguistic detail. Our findings reveal a compelling, counter-intuitive trend: for the highest-performing models (MetaCLIP 2 and OpenCLIP), the simplest, most basic prompts consistently achieve the best results. Adding descriptive detail significantly degrades performance (for instance, MetaCLIP 2's multi-class accuracy drops from 68.8% to 55.1%), a phenomenon we term "prompt overfitting". Conversely, the lower-performing SigLip model shows improved classification on ambiguous classes when given more descriptive, body-cue-based prompts.
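
Zero-shot classification in a shared embedding space reduces to comparing an image embedding with one text embedding per class prompt. The sketch below uses made-up 3-D vectors in place of real encoder outputs, and the tiered prompt strings are hypothetical examples of increasing detail, not the prompts used in the study.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0

def zero_shot_classify(image_embedding, prompt_embeddings):
    """Return the label whose prompt embedding is most similar to the image embedding."""
    scores = {label: cosine(image_embedding, emb) for label, emb in prompt_embeddings.items()}
    return max(scores, key=scores.get), scores

# Hypothetical prompts at three levels of detail for one class.
PROMPT_TIERS = {
    "basic": "a person sitting",
    "action": "a photo of a person sitting on a chair",
    "body_cue": "a person sitting with bent knees and hips lowered onto a seat",
}

# Toy embeddings standing in for text-encoder outputs of each class prompt.
prompt_embeddings = {
    "sitting": [1.0, 0.1, 0.0],
    "standing": [0.1, 1.0, 0.0],
    "walking/running": [0.0, 0.2, 1.0],
}
label, scores = zero_shot_classify([0.9, 0.3, 0.1], prompt_embeddings)
```

Changing the prompt tier changes only the text embeddings, so any accuracy difference between tiers comes from how the text encoder places the longer descriptions relative to the image embeddings.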

[16] Model-agnostic Adversarial Attack and Defense for Vision-Language-Action Models

Haochuan Xu, Yun Sing Koh, Shuhuai Huang, Zirun Zhou, Di Wang, Jun Sakuma, Jingfeng Zhang

🧩 TL;DR

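This work proposes EDPA, an adversarial patch attack on vision-language-action models, together with a defense. EDPA disrupts the semantic alignment between visual and textual latent representations to interfere with the model's decisions, and an adversarial fine-tuning scheme improves robustness.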


📘 Detailed Summary

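Motivation: Vision-language-action models have driven major progress in robot learning, but their adversarial robustness remains underexplored. Effective attacks and defenses for VLA models are lacking, particularly ones that require no knowledge of the model architecture or the controlled robotic manipulator.

Method: The model-agnostic Embedding Disruption Patch Attack (EDPA) generates patches that can be placed directly within the camera's view by optimizing two objectives: disrupting the semantic alignment between visual and textual latent representations, and maximizing the discrepancy between the latent representations of adversarial and clean inputs. The defense is an adversarial fine-tuning scheme for the visual encoder that trains it to produce similar latent representations for clean and adversarially perturbed inputs.

Result: On the LIBERO robotic simulation benchmark, EDPA substantially increases the task failure rate of cutting-edge VLA models, while the proposed defense effectively mitigates this degradation.

Conclusion: The study exposes the vulnerability of VLA models to adversarial attack. EDPA provides a tool for assessing VLA robustness, and adversarial fine-tuning offers a feasible route to safer robot learning systems.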


📄 Abstract

Vision-Language-Action (VLA) models have achieved revolutionary progress in robot learning, enabling robots to execute complex physical robot tasks from natural language instructions. Despite this progress, their adversarial robustness remains underexplored. In this work, we propose both an adversarial patch attack and corresponding defense strategies for VLA models. We first introduce the Embedding Disruption Patch Attack (EDPA), a model-agnostic adversarial attack that generates patches directly placeable within the camera's view. In comparison to prior methods, EDPA can be readily applied to different VLA models without requiring prior knowledge of the model architecture, or the controlled robotic manipulator. EDPA constructs these patches by (i) disrupting the semantic alignment between visual and textual latent representations, and (ii) maximizing the discrepancy of latent representations between adversarial and corresponding clean visual inputs. Through the optimization of these objectives, EDPA distorts the VLA's interpretation of visual information, causing the model to repeatedly generate incorrect actions and ultimately resulting in failure to complete the given robotic task. To counter this, we propose an adversarial fine-tuning scheme for the visual encoder, in which the encoder is optimized to produce similar latent representations for both clean and adversarially perturbed visual inputs. Extensive evaluations on the widely recognized LIBERO robotic simulation benchmark demonstrate that EDPA substantially increases the task failure rate of cutting-edge VLA models, while our proposed defense effectively mitigates this degradation. The codebase is accessible via the homepage at https://edpa-attack.github.io/.
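
The two attack objectives and the defense objective can be written as scalar functions of latent vectors. The sketch below assumes cosine similarity for visual-text alignment and squared Euclidean distance for the latent discrepancy, with a hypothetical weighting term; the abstract names the objectives but not these specific measures.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def edpa_objective(z_adv, z_clean, z_text, weight=1.0):
    """Quantity an attacker would maximize over the patch pixels (assumed form)."""
    misalignment = -cosine(z_adv, z_text)            # (i) break visual-text alignment
    discrepancy = squared_distance(z_adv, z_clean)   # (ii) push away from the clean latent
    return misalignment + weight * discrepancy

def defense_loss(z_adv, z_clean):
    """Quantity the fine-tuned encoder would minimize: keep the two latents similar."""
    return squared_distance(z_adv, z_clean)

z_clean = [1.0, 0.0]   # toy visual latent for the clean frame
z_text = [1.0, 0.0]    # toy text latent for the instruction
z_patched = [0.0, 1.0] # toy visual latent after a patch is added
```

With these toy vectors, `edpa_objective(z_patched, z_clean, z_text)` exceeds `edpa_objective(z_clean, z_clean, z_text)`, and the defense loss is zero only when the encoder maps the patched frame back onto the clean latent. Neither function touches the action head or the robot controller, which is consistent with the model-agnostic claim.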

[17] FlyAwareV2: A Multimodal Cross-Domain UAV Dataset for Urban Scene Understanding

Francesco Barbato, Matteo Caligiuri, Pietro Zanuttigh

🧩 TL;DR

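This paper presents FlyAwareV2, a new multimodal UAV dataset designed for urban scene understanding. It contains both real and synthetic imagery and addresses the high cost of collecting and annotating data for UAV vision algorithm development.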


📘 Detailed Summary

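Motivation: Developing computer vision algorithms for UAVs in urban environments depends heavily on large-scale datasets with accurate annotations, but collecting and annotating real-world UAV data is extremely difficult and costly, and existing datasets do not meet this need.

Method: Building on the SynDrone and FlyAware datasets, FlyAwareV2 introduces multimodal data (RGB, depth, semantic labels) covering diverse environmental conditions, including varying weather and times of day, and computes depth maps for real samples with state-of-the-art monocular depth estimation.

Result: The work provides benchmarks for RGB and multimodal semantic segmentation on standard architectures, and studies synthetic-to-real domain adaptation to assess how well models trained on synthetic data generalize.

Conclusion: With its rich annotations and environmental diversity, FlyAwareV2 is a valuable resource for research on UAV-based 3D urban scene understanding and supports further development of UAV vision algorithms.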


📄 Abstract

The development of computer vision algorithms for Unmanned Aerial Vehicle (UAV) applications in urban environments heavily relies on the availability of large-scale datasets with accurate annotations. However, collecting and annotating real-world UAV data is extremely challenging and costly. To address this limitation, we present FlyAwareV2, a novel multimodal dataset encompassing both real and synthetic UAV imagery tailored for urban scene understanding tasks. Building upon the recently introduced SynDrone and FlyAware datasets, FlyAwareV2 introduces several new key contributions: 1) Multimodal data (RGB, depth, semantic labels) across diverse environmental conditions including varying weather and daytime; 2) Depth maps for real samples computed via state-of-the-art monocular depth estimation; 3) Benchmarks for RGB and multimodal semantic segmentation on standard architectures; 4) Studies on synthetic-to-real domain adaptation to assess the generalization capabilities of models trained on the synthetic data. With its rich set of annotations and environmental diversity, FlyAwareV2 provides a valuable resource for research on UAV-based 3D urban scene understanding.

[18] MMLongCite: A Benchmark for Evaluating Fidelity of Long-Context Vision-Language Models

Keyan Zhou, Zecheng Tang, Lingfeng Ming, Guanghao Zhou, Qiguang Chen, Dan Qiao, Zheming Yang, Libo Qin, Minghui Qiu, Juntao Li, Min Zhang

🧩 TL;DR

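This paper introduces MMLongCite, a benchmark for evaluating the faithfulness of large vision-language models in long-context scenarios, and finds that existing models have significant limitations when handling long multimodal contexts. The benchmark covers 8 tasks, 6 context-length intervals, and multiple modalities, providing a comprehensive framework for long-context multimodal evaluation.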


📘 Detailed Summary

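Motivation: Although the context windows of large vision-language models keep growing, a longer window does not guarantee that the model uses the context effectively, which is a critical challenge in real applications. Current evaluations of long-context faithfulness focus mainly on the text-only domain, while multimodal evaluations remain limited to short contexts, so a dedicated multimodal long-context benchmark is needed to close this gap.

Method: The authors build the MMLongCite benchmark, which comprises 8 distinct tasks, spans 6 context-length intervals, and integrates multiple modalities including text, images, and videos. The benchmark is designed to systematically evaluate the faithfulness of large vision-language models in long multimodal contexts, analyzing model behavior by controlling context length and the position of crucial content.

Result: Evaluation of state-of-the-art large vision-language models shows limited faithfulness when handling long multimodal contexts. In-depth analysis further reveals that context length and the position of crucial content significantly affect faithfulness, indicating systematic weaknesses in long-context multimodal understanding.

Conclusion: The study highlights the limitations of large vision-language models in long multimodal context processing and points to directions for future model improvement. MMLongCite sets a new standard for multimodal long-context evaluation and should drive progress in the field, particularly regarding reliability in real-world applications.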


📄 Abstract

The rapid advancement of large vision language models (LVLMs) has led to a significant expansion of their context windows. However, an extended context window does not guarantee the effective utilization of the context, posing a critical challenge for real-world applications. Current evaluations of such long-context faithfulness are predominantly focused on the text-only domain, while multimodal assessments remain limited to short contexts. To bridge this gap, we introduce MMLongCite, a comprehensive benchmark designed to evaluate the fidelity of LVLMs in long-context scenarios. MMLongCite comprises 8 distinct tasks spanning 6 context length intervals and incorporates diverse modalities, including text, images, and videos. Our evaluation of state-of-the-art LVLMs reveals their limited faithfulness in handling long multimodal contexts. Furthermore, we provide an in-depth analysis of how context length and the position of crucial content affect the faithfulness of these models.

[19] UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning

Tiancheng Gu, Kaicheng Yang, Kaichen Zhang, Xiang An, Ziyong Feng, Yueyi Zhang, Weidong Cai, Jiankang Deng, Lidong Bing

🧩 TL;DR

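This paper proposes UniME-V2, a universal multimodal embedding method that uses multimodal large language models to enhance representation learning. An MLLM-as-a-Judge mechanism generates soft semantic matching scores for hard negative mining and soft-label alignment, significantly improving the model's discriminative ability.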


📘 Detailed Summary

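Motivation: Existing multimodal embedding methods typically rely on in-batch negative mining, which struggles to capture subtle semantic differences among candidates and yields negatives with limited diversity. The resulting embeddings also have limited ability to distinguish false negatives from hard negatives.

Method: The approach first builds a set of potential hard negatives through global retrieval. It then introduces the MLLM-as-a-Judge mechanism to assess the semantic alignment of query-candidate pairs and generate soft semantic matching scores, which are used for hard negative mining and soft-label alignment. It also proposes UniME-V2-Reranker, a reranking model trained with joint pairwise and listwise optimization.

Result: Comprehensive experiments on the MMEB benchmark and multiple retrieval tasks show that the method achieves state-of-the-art performance on average across all tasks.

Conclusion: By exploiting the semantic understanding of MLLMs, the method effectively mitigates the impact of false negatives and improves the model's ability to discriminate semantic differences among candidates, offering a new technical route for multimodal representation learning.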


📄 Abstract

Universal multimodal embedding models are foundational to various tasks. Existing approaches typically employ in-batch negative mining by measuring the similarity of query-candidate pairs. However, these methods often struggle to capture subtle semantic differences among candidates and lack diversity in negative samples. Moreover, the embeddings exhibit limited discriminative ability in distinguishing false and hard negatives. In this paper, we leverage the advanced understanding capabilities of MLLMs to enhance representation learning and present a novel Universal Multimodal Embedding (UniME-V2) model. Our approach first constructs a potential hard negative set through global retrieval. We then introduce the MLLM-as-a-Judge mechanism, which utilizes MLLMs to assess the semantic alignment of query-candidate pairs and generate soft semantic matching scores. These scores serve as a foundation for hard negative mining, mitigating the impact of false negatives and enabling the identification of diverse, high-quality hard negatives. Furthermore, the semantic matching scores are used as soft labels to mitigate the rigid one-to-one mapping constraint. By aligning the similarity matrix with the soft semantic matching score matrix, the model learns semantic distinctions among candidates, significantly enhancing its discriminative capacity. To further improve performance, we propose UniME-V2-Reranker, a reranking model trained on our mined hard negatives through a joint pairwise and listwise optimization approach. We conduct comprehensive experiments on the MMEB benchmark and multiple retrieval tasks, demonstrating that our method achieves state-of-the-art performance on average across all tasks.
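
The soft-label step, aligning the model's similarity scores with the judge's soft semantic matching scores, can be sketched for a single query as a divergence between two distributions over candidates. The abstract does not state the loss, so the KL form, the shared `temperature`, and the function names below are assumptions for illustration only.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def soft_alignment_loss(similarities, judge_scores, temperature=1.0):
    """Hypothetical soft-label loss for one query.

    similarities: embedding similarities between the query and each candidate.
    judge_scores: soft semantic matching scores from the MLLM judge.
    Returns KL(judge distribution || model distribution).
    """
    target = softmax(judge_scores, temperature)
    predicted = softmax(similarities, temperature)
    return kl_divergence(target, predicted)
```

Unlike a one-hot contrastive target, the judge distribution can give partial credit to several semantically close candidates, which is the point of relaxing the one-to-one mapping constraint.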

[20] Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs

Minji Kim, Taekyung Kim, Bohyung Han

🧩 TL;DR

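This study uses mechanistic interpretability techniques to reveal the internal information flow of VideoLLMs in video question answering. It finds that temporal reasoning is realized through early cross-frame interactions followed by mid-layer video-language integration, and shows that performance can be retained while keeping only the effective information pathways and suppressing a substantial share of attention edges.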


📘 Detailed Summary

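Motivation: Although video large language models have advanced on video question answering, where and how they extract and propagate video and textual information internally remains underexplored. This study aims to fill that gap.

Method: Applies mechanistic interpretability techniques to analyze the internal information flow of VideoLLMs, focusing on attention patterns across layers and on the alignment between video representations and language embeddings.

Result: The analysis reveals patterns that are consistent across tasks: temporal reasoning begins with active cross-frame interactions in early-to-middle layers, followed by progressive video-language integration in middle layers. By selecting the effective information pathways, 58% of attention edges can be suppressed while retaining the performance of LLaVA-NeXT-7B-Video-FT.

Conclusion: The findings provide a blueprint of how VideoLLMs perform temporal reasoning and offer practical insights for improving interpretability and downstream generalization. They also show that alignment between video representations and language embeddings of temporal concepts is key to successful integration.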


📄 Abstract

Video Large Language Models (VideoLLMs) extend the capabilities of vision-language models to spatiotemporal inputs, enabling tasks such as video question answering (VideoQA). Despite recent advances in VideoLLMs, their internal mechanisms on where and how they extract and propagate video and textual information remain less explored. In this study, we investigate the internal information flow of VideoLLMs using mechanistic interpretability techniques. Our analysis reveals consistent patterns across diverse VideoQA tasks: (1) temporal reasoning in VideoLLMs initiates with active cross-frame interactions in early-to-middle layers, (2) followed by progressive video-language integration in middle layers. This is facilitated by alignment between video representations and linguistic embeddings containing temporal concepts. (3) Upon completion of this integration, the model is ready to generate correct answers in middle-to-late layers. (4) Based on our analysis, we show that VideoLLMs can retain their VideoQA performance by selecting these effective information pathways while suppressing a substantial amount of attention edges, e.g., 58% in LLaVA-NeXT-7B-Video-FT. These findings provide a blueprint on how VideoLLMs perform temporal reasoning and offer practical insights for improving model interpretability and downstream generalization. Our project page with the source code is available at https://map-the-flow.github.io

[21] Generative Universal Verifier as Multimodal Meta-Reasoner

Xinchen Zhang, Xiaoying Zhang, Youbin Wu, Yanbin Cao, Renrui Zhang, Ruihang Chu, Ling Yang, Yujiu Yang

🧩 TL;DR

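This paper proposes the Generative Universal Verifier, which gives multimodal reasoning systems the core ability to reflect on and refine visual outcomes. It builds a comprehensive benchmark that exposes a significant gap in the visual verification ability of existing models, and develops the first omni-capable generative verifier together with a test-time scaling paradigm, improving the reliability and controllability of multimodal reasoning.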


📘 Detailed Summary

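Motivation: Current vision-language models show a significant capability gap in reliable visual verification and cannot effectively reflect on and refine visual outcomes during reasoning and generation, which limits the trustworthiness and controllability of multimodal reasoning systems.

Method: Builds ViVerBench, a comprehensive benchmark for evaluating visual outcomes; designs automated pipelines to construct large-scale visual verification data and trains OmniVerifier-7B, an omni-capable generative verifier; and proposes OmniVerifier-TTS, a sequential test-time scaling paradigm for iterative fine-grained optimization.

Result: Existing VLMs perform poorly on ViVerBench. OmniVerifier-7B gains 8.3 points on ViVerBench, and OmniVerifier-TTS gains 3.7 and 4.3 points on T2I-ReasonBench and GenEval++ respectively, outperforming parallel test-time scaling methods such as Best-of-N.

Conclusion: By giving multimodal reasoning a reliable visual verification capability, the Generative Universal Verifier advances reliable reflection during generation and scalable test-time refinement, marking a step toward more trustworthy and controllable next-generation reasoning systems.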


📄 Abstract

We introduce Generative Universal Verifier, a novel concept and plugin designed for next-generation multimodal reasoning in vision-language models and unified multimodal models, providing the fundamental capability of reflection and refinement on visual outcomes during the reasoning and generation process. This work makes three main contributions: (1) We build ViVerBench, a comprehensive benchmark spanning 16 categories of critical tasks for evaluating visual outcomes in multimodal reasoning. Results show that existing VLMs consistently underperform across these tasks, underscoring a substantial gap from human-level capability in reliable visual verification. (2) We design two automated pipelines to construct large-scale visual verification data and train OmniVerifier-7B, the first omni-capable generative verifier trained for universal visual verification, which achieves notable gains on ViVerBench (+8.3). Through training, we identify three atomic capabilities in visual verification and demonstrate how they generalize and interact synergistically. (3) We propose OmniVerifier-TTS, a sequential test-time scaling paradigm that leverages the universal verifier to bridge image generation and editing within unified models, enhancing the upper bound of generative ability through iterative fine-grained optimization. Beyond generation, we extend the universal verifier to broader world-modeling interleaved reasoning scenarios. Empirically, OmniVerifier-TTS achieves improvements on T2I-ReasonBench (+3.7) and GenEval++ (+4.3), outperforming existing parallel test-time scaling methods such as Best-of-N. By endowing multimodal reasoning with reliable visual verification, OmniVerifier advances both reliable reflection during generation and scalable test-time refinement, marking a step toward more trustworthy and controllable next-generation reasoning systems.

[22] Modeling Cultural Bias in Facial Expression Recognition with Adaptive Agents

David Freire-Obregón, José Salas-Cáceres, Javier Lorenzo-Navarro, Oliverio J. Santana, Daniel Hernández-Sosa, Modesto Castrillón-Santana

🧩 TL;DR

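This study introduces an agent-based streaming benchmark that reveals how cross-cultural composition and progressive blurring jointly affect the robustness of facial expression recognition, and finds that different cultural groups degrade asymmetrically as perceptual conditions deteriorate.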


📘 Detailed Summary

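Motivation: Existing evaluations of facial expression recognition usually assume homogeneous data and high-quality images, and lack a systematic study of robustness under cultural variation and perceptual degradation. This study aims to fill that gap.

Method: Uses an agent-based streaming benchmark in which each agent operates in a frozen CLIP feature space with a lightweight residual adapter, moves and interacts on a 5x5 lattice, and receives inputs from the environment with progressive, sigma-scheduled Gaussian blur.

Result: Experiments reveal asymmetric degradation curves across cultural groups: JAFFE (Asian) populations maintain higher performance at low blur but drop more sharply at intermediate stages, whereas KDEF (Western) populations degrade more uniformly. Mixed populations show intermediate patterns; balanced mixtures mitigate early degradation, while imbalanced settings amplify majority-group weaknesses under high blur.

Conclusion: The study quantifies how cultural composition and interaction structure affect the robustness of facial expression recognition as perceptual conditions deteriorate, offering insights for developing face recognition systems that are more culturally adaptive and more robust to their environment.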


📄 Abstract

Facial expression recognition (FER) must remain robust under both cultural variation and perceptually degraded visual conditions, yet most existing evaluations assume homogeneous data and high-quality imagery. We introduce an agent-based, streaming benchmark that reveals how cross-cultural composition and progressive blurring interact to shape face recognition robustness. Each agent operates in a frozen CLIP feature space with a lightweight residual adapter trained online at sigma=0 and fixed during testing. Agents move and interact on a 5x5 lattice, while the environment provides inputs with sigma-scheduled Gaussian blur. We examine monocultural populations (Western-only, Asian-only) and mixed environments with balanced (5/5) and imbalanced (8/2, 2/8) compositions, as well as different spatial contact structures. Results show clear asymmetric degradation curves between cultural groups: JAFFE (Asian) populations maintain higher performance at low blur but exhibit sharper drops at intermediate stages, whereas KDEF (Western) populations degrade more uniformly. Mixed populations exhibit intermediate patterns, with balanced mixtures mitigating early degradation, but imbalanced settings amplify majority-group weaknesses under high blur. These findings quantify how cultural composition and interaction structure influence the robustness of FER as perceptual conditions deteriorate.
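
To make "sigma-scheduled Gaussian blur" concrete, the sketch below builds a schedule of sigma values and blurs a single row of pixels with a normalized Gaussian kernel. The abstract does not specify the schedule shape, the kernel radius, or the border handling, so the linear schedule, the 3-sigma radius, and the edge clamping here are assumptions, and all function names are hypothetical.

```python
import math


def sigma_schedule(num_steps, sigma_max):
    """Sigma values rising linearly from 0 to sigma_max (linear shape is assumed)."""
    if num_steps == 1:
        return [0.0]
    return [sigma_max * step / (num_steps - 1) for step in range(num_steps)]


def gaussian_kernel_1d(sigma):
    """Normalized 1D Gaussian kernel; sigma <= 0 yields the identity kernel."""
    if sigma <= 0:
        return [1.0]
    radius = max(1, math.ceil(3 * sigma))  # cover about three standard deviations
    weights = [math.exp(-(x * x) / (2 * sigma * sigma))
               for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_row(row, sigma):
    """Blur one row of pixel values, clamping indices at the borders."""
    kernel = gaussian_kernel_1d(sigma)
    radius = len(kernel) // 2
    blurred = []
    for i in range(len(row)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = min(max(i + k - radius, 0), len(row) - 1)
            acc += weight * row[j]
        blurred.append(acc)
    return blurred
```

At sigma=0 the input passes through unchanged, which matches the abstract's setup of training the adapter on unblurred inputs and testing as the blur increases.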

[23] End-to-End Multi-Modal Diffusion Mamba

Chunhao Lu, Qiang Lu, Meichen Dong, Jake Luo

🧩 TL;DR

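This paper proposes MDM (Multi-modal Diffusion Mamba), an architecture that unifies representation learning across modalities through a Mamba-based multi-step selection diffusion model and a unified variational autoencoder, and significantly outperforms existing end-to-end models on tasks such as image generation and text comprehension.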


📘 Detailed Summary

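Motivation: Current end-to-end multimodal models use different encoders and decoders to process input and output information. This separation hinders joint representation learning across modalities and limits unified processing in multimodal tasks.

Method: MDM uses a Mamba-based multi-step selection diffusion model that progressively generates and refines modality-specific information through a unified variational autoencoder, providing a single framework for both encoding and decoding.

Result: On image generation, image captioning, visual question answering, text comprehension, and reasoning tasks, MDM significantly outperforms end-to-end models such as MonoFormer, LlamaGen, and Chameleon, and competes effectively with SOTA models such as GPT-4V, Gemini Pro, and Mistral.

Conclusion: MDM shows that unified multimodal processing is feasible while maintaining computational efficiency, establishing a new direction for end-to-end multimodal architectures, with particular strength in handling high-dimensional data.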


📄 Abstract

Current end-to-end multi-modal models utilize different encoders and decoders to process input and output information. This separation hinders the joint representation learning of various modalities. To unify multi-modal processing, we propose a novel architecture called MDM (Multi-modal Diffusion Mamba). MDM utilizes a Mamba-based multi-step selection diffusion model to progressively generate and refine modality-specific information through a unified variational autoencoder for both encoding and decoding. This innovative approach allows MDM to achieve superior performance when processing high-dimensional data, particularly in generating high-resolution images and extended text sequences simultaneously. Our evaluations in areas such as image generation, image captioning, visual question answering, text comprehension, and reasoning tasks demonstrate that MDM significantly outperforms existing end-to-end models (MonoFormer, LlamaGen, and Chameleon etc.) and competes effectively with SOTA models like GPT-4V, Gemini Pro, and Mistral. Our results validate MDM's effectiveness in unifying multi-modal processes while maintaining computational efficiency, establishing a new direction for end-to-end multi-modal architectures.

[24] Universal Image Restoration Pre-training via Masked Degradation Classification

JiaKui Hu, Zhengjian Yao, Lujia Jin, Yinghao Chen, Yanye Lu

🧩 TL;DR

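This paper proposes Masked Degradation Classification Pre-Training (MaskDCPT), which uses the image degradation type as a weak supervision signal and combines masked image modeling with contrastive learning to learn universal representations for image restoration. The method significantly improves the restoration performance of both convolutional neural networks and Transformers and generalizes strongly to unseen degradation types.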


📘 Detailed Summary

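Motivation: Conventional pre-training methods are limited for image restoration and cannot effectively handle multiple degradation types or complex degradation scenarios. This study aims to develop a pre-training method that uses the degradation type as a weak supervision signal to learn robust representations for universal image restoration, addressing the shortcomings of existing methods in degradation recognition and image reconstruction.

Method: MaskDCPT uses an encoder with two decoders. The encoder extracts features from the masked low-quality input image, the classification decoder identifies the degradation type, and the reconstruction decoder restores the high-quality image. The method combines the benefits of masked image modeling and contrastive learning, using the degradation type as weak supervision while the reconstruction task improves performance and robustness.

Result: Experiments show that MaskDCPT significantly improves both CNNs and Transformers, with a PSNR gain of at least 3.77 dB on the 5D all-in-one restoration task and a 34.8% reduction in PIQE in real-world degradation scenarios. The method generalizes strongly to unseen degradation types and levels. The authors also release the UIR-2.5M dataset, with 2.5 million paired restoration samples covering 19 degradation types and more than 200 degradation levels.

Conclusion: MaskDCPT demonstrates the effectiveness of using the degradation type as weak supervision and provides a strong pre-training framework for universal image restoration. Beyond improving existing models, it generalizes well and contributes new ideas and a benchmark dataset to the image restoration field.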


📄 Abstract

This study introduces a Masked Degradation Classification Pre-Training method (MaskDCPT), designed to facilitate the classification of degradation types in input images, leading to comprehensive image restoration pre-training. Unlike conventional pre-training methods, MaskDCPT uses the degradation type of the image as an extremely weak supervision signal, while simultaneously leveraging image reconstruction to enhance performance and robustness. MaskDCPT includes an encoder and two decoders: the encoder extracts features from the masked low-quality input image, the classification decoder uses these features to identify the degradation type, and the reconstruction decoder aims to reconstruct a corresponding high-quality image. This design allows the pre-training to benefit from both masked image modeling and contrastive learning, resulting in a generalized representation suited for restoration tasks. Benefiting from the straightforward yet potent MaskDCPT, the pre-trained encoder can be used to address universal image restoration and achieve outstanding performance. Implementing MaskDCPT significantly improves performance for both convolutional neural networks (CNNs) and Transformers, with a minimum increase in PSNR of 3.77 dB in the 5D all-in-one restoration task and a 34.8% reduction in PIQE compared to baseline in real-world degradation scenarios. It also exhibits strong generalization to previously unseen degradation types and levels. In addition, we curate and release the UIR-2.5M dataset, which includes 2.5 million paired restoration samples across 19 degradation types and over 200 degradation levels, incorporating both synthetic and real-world data. The dataset, source code, and models are available at https://github.com/MILab-PKU/MaskDCPT.
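
For readers less familiar with PSNR, the short sketch below shows the standard definition and what a gain of 3.77 dB means in terms of mean squared error. The 8-bit peak value of 255 is an assumption (the abstract does not state the pixel range), and the per-image reading is only indicative, since reported PSNR figures are usually averages over a test set.

```python
import math


def psnr(mse, max_value=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE must shrink to raise PSNR by gain_db decibels."""
    return 10.0 ** (gain_db / 10.0)


if __name__ == "__main__":
    ratio = mse_ratio_for_gain(3.77)
    print(f"A 3.77 dB PSNR gain corresponds to an MSE about {ratio:.2f}x smaller")
```

Because PSNR is logarithmic, a gain of a few decibels reflects a multiplicative drop in reconstruction error: here the MSE falls by a factor of roughly 2.4 for a single image.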

[25] RECODE: Reasoning Through Code Generation for Visual Question Answering

Junhong Shen, Mu Cai, Bo Hu, Ameet Talwalkar, David A Ross, Cordelia Schmid, Alireza Fathi

🧩 TL;DR

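This paper proposes RECODE, a framework that achieves verifiable multimodal reasoning by reverse-engineering visual content into executable code, significantly improving precise reasoning over structured visuals such as charts and diagrams. It outperforms existing methods that do not use code, or that use code only for auxiliary drawing, on multiple visual reasoning benchmarks.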


📘 Detailed Summary

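Motivation: Multimodal large language models struggle with precise reasoning over structured visual content such as charts and diagrams, because pixel-based perception lacks a verification mechanism, making the reasoning unreliable and hard to check.

Method: Proposes RECODE, an agentic framework that first generates multiple candidate programs to reproduce the input image, then uses a critic to select the most faithful reconstruction and iteratively refines the code, turning an ambiguous perceptual task into a verifiable symbolic problem.

Result: On visual reasoning benchmarks including CharXiv, ChartQA, and Geometry3K, RECODE significantly outperforms methods that do not use code or that use code only for drawing auxiliary lines or cropping, demonstrating the value of grounding visual perception in executable code.

Conclusion: Grounding visual perception in executable code opens a new path toward more accurate and verifiable multimodal reasoning. Reverse-engineering visual content into a symbolic representation enables precise calculation and logical inference.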


📄 Abstract

Multimodal Large Language Models (MLLMs) struggle with precise reasoning for structured visuals like charts and diagrams, as pixel-based perception lacks a mechanism for verification. To address this, we propose to leverage derendering -- the process of reverse-engineering visuals into executable code -- as a new modality for verifiable visual reasoning. Specifically, we propose RECODE, an agentic framework that first generates multiple candidate programs to reproduce the input image. It then uses a critic to select the most faithful reconstruction and iteratively refines the code. This process not only transforms an ambiguous perceptual task into a verifiable, symbolic problem, but also enables precise calculations and logical inferences later on. On various visual reasoning benchmarks such as CharXiv, ChartQA, and Geometry3K, RECODE significantly outperforms methods that do not leverage code or only use code for drawing auxiliary lines or cropping. Our work demonstrates that grounding visual perception in executable code provides a new path toward more accurate and verifiable multimodal reasoning.
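
The generate, select, and refine loop described above can be sketched as a small control function. Everything here is hypothetical scaffolding: `generate`, `render`, `score`, and `refine` stand in for the MLLM code generator, the code executor, the critic, and the refinement step, none of which the abstract specifies, and the greedy acceptance rule is an assumption.

```python
def derender_and_refine(image, generate, render, score, refine,
                        num_candidates=4, num_rounds=2):
    """Pick the most faithful candidate program, then refine it greedily.

    generate(image)         -> candidate program
    render(program)         -> reconstruction of the image
    score(image, rendering) -> fidelity score, higher is better (the critic)
    refine(image, program)  -> revised program
    All four callables are placeholders supplied by the caller.
    """
    candidates = [generate(image) for _ in range(num_candidates)]
    scored = [(score(image, render(program)), program) for program in candidates]
    best_score, best_program = max(scored, key=lambda pair: pair[0])

    for _ in range(num_rounds):
        revised = refine(image, best_program)
        revised_score = score(image, render(revised))
        if revised_score > best_score:  # keep a revision only if the critic prefers it
            best_program, best_score = revised, revised_score
    return best_program, best_score
```

Once a faithful program is found, downstream questions can be answered by inspecting or executing the program rather than by reading pixels, which is what makes the reasoning checkable.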

[26] Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs

Yi Zhang, Bolin Ni, Xin-Sheng Chen, Heng-Rui Zhang, Yongming Rao, Houwen Peng, Qinglin Lu, Han Hu, Meng-Hao Guo, Shi-Min Hu

🧩 TL;DR

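This paper presents the Honey-Data-15M dataset and the HoneyPipe data curation pipeline. With high-quality data cleaning and a dual-level CoT enrichment strategy, the resulting Bee-8B model sets a new state of the art among fully open multimodal large language models and performs on par with semi-open models.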


📘 Detailed Summary

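Motivation: Fully open multimodal large language models lag behind proprietary ones, mainly because of a significant gap in data quality at the supervised fine-tuning stage. Existing open-source datasets are widely noisy and lack complex reasoning data such as Chain-of-Thought, which limits the development of advanced model capabilities.

Method: Introduces Honey-Data-15M, a dataset of approximately 15 million QA pairs processed with multiple cleaning techniques and enriched with a novel dual-level (short and long) Chain-of-Thought strategy. Also develops the HoneyPipe data curation pipeline and its underlying framework DataStudio, giving the community a transparent and adaptable curation methodology.

Result: Bee-8B, trained on Honey-Data-15M, sets a new state of the art among fully open multimodal large language models. Its performance is competitive with, and in some cases surpasses, recent semi-open models such as InternVL3.5-8B.

Conclusion: The study shows that a principled focus on data quality is a key route to competitive fully open multimodal large language models. It provides the community with foundational resources including the dataset, the full tool suite, training recipes, an evaluation harness, and model weights.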


📄 Abstract

Fully open multimodal large language models (MLLMs) currently lag behind proprietary counterparts, primarily due to a significant gap in data quality for supervised fine-tuning (SFT). Existing open-source datasets are often plagued by widespread noise and a critical deficit in complex reasoning data, such as Chain-of-Thought (CoT), which hinders the development of advanced model capabilities. Addressing these challenges, our work makes three primary contributions. First, we introduce Honey-Data-15M, a new SFT dataset comprising approximately 15 million QA pairs, processed through multiple cleaning techniques and enhanced with a novel dual-level (short and long) CoT enrichment strategy. Second, we introduce HoneyPipe, the data curation pipeline, and its underlying framework DataStudio, providing the community with a transparent and adaptable methodology for data curation that moves beyond static dataset releases. Finally, to validate our dataset and pipeline, we train Bee-8B, an 8B model on Honey-Data-15M. Experiments show that Bee-8B establishes a new state-of-the-art (SOTA) for fully open MLLMs, achieving performance that is competitive with, and in some cases surpasses, recent semi-open models such as InternVL3.5-8B. Our work delivers to the community a suite of foundational resources, including: the Honey-Data-15M corpus; the full-stack suite comprising HoneyPipe and DataStudio; training recipes; an evaluation harness; and the model weights. This effort demonstrates that a principled focus on data quality is a key pathway to developing fully open MLLMs that are highly competitive with their semi-open counterparts.

[27] Visual Interestingness Decoded: How GPT-4o Mirrors Human Interests

Fitim Abdullahu, Helmut Grabner

🧩 TL;DR

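This study explores how well the large multimodal model GPT-4o understands the concept of visual interestingness. A comparative analysis finds partial alignment with human assessments, and this ability is used to generate training data for distilling the knowledge into a learning-to-rank model.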


📘 Detailed Summary

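Motivation: The study asks whether large multimodal models capture the concept of visual interestingness and examines how closely GPT-4o's predictions align with human assessments, addressing a gap in our understanding of how well AI systems grasp human interest.

Method: Uses comparative analysis to evaluate the agreement between GPT-4o and humans on judgments of visual interestingness, uses this alignment to generate interestingness labels for image pairs, and distills these labels as training data into a learning-to-rank model.

Result: GPT-4o is partially aligned with humans on visual interestingness and outperforms existing state-of-the-art methods. It can effectively label image pairs by their commonly perceived interestingness for model training.

Conclusion: The study offers a new route to a deeper understanding of human interest and demonstrates the potential of large multimodal models to capture visual interestingness, laying groundwork for AI systems that better match human perception.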


📄 Abstract

Our daily life is highly influenced by what we consume and see. Attracting and holding one's attention -- the definition of (visual) interestingness -- is essential. Large Multimodal Models (LMMs) trained on large-scale visual and textual data have demonstrated impressive capabilities. We explore to what extent these models capture the concept of visual interestingness and, through comparative analysis, examine the alignment between human assessments and the predictions of GPT-4o, a leading LMM. Our studies reveal partial alignment between humans and GPT-4o. GPT-4o already captures the concept best when compared with state-of-the-art methods. This allows image pairs to be labeled effectively according to their (common) interestingness; these labels are used as training data to distill the knowledge into a learning-to-rank model. The insights pave the way for a deeper understanding of human interest.

[28] No-Reference Rendered Video Quality Assessment: Dataset and Metrics

Sipeng Yang, Jiayu Ji, Qingchuan Zhu, Zhiyao Yang, Xiaogang Jin

🧩 TL;DR

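This study presents the first no-reference quality assessment dataset and dedicated metric for rendered videos. It addresses the bias of existing methods on rendered content by assessing both image quality and temporal stability.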


📘 Detailed Summary

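Motivation: Existing no-reference video quality assessment datasets and metrics mainly target camera-captured videos. Applying them directly to rendered videos yields biased predictions because rendered videos are more prone to temporal artifacts, and no quality assessment method has been designed specifically for rendered videos.

Method: Builds a large rendering-oriented video dataset covering a variety of 3D scenes and rendering settings, with subjective quality annotations, and designs a no-reference quality metric specific to rendered videos that is calibrated by considering both image quality and temporal stability.

Result: Compared with existing no-reference video quality metrics, the proposed metric performs better on rendered videos and can be used to benchmark supersampling methods and to assess frame generation strategies in real-time rendering.

Conclusion: The study fills a gap in rendered video quality assessment. The dataset and metric provide dedicated tools for video quality assessment in computer graphics applications and are relevant to user experience in video games, virtual reality, and augmented reality.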


📄 Abstract

Quality assessment of videos is crucial for many computer graphics applications, including video games, virtual reality, and augmented reality, where visual performance has a significant impact on user experience. When test videos cannot be perfectly aligned with references or when references are unavailable, the significance of no-reference video quality assessment (NR-VQA) methods is undeniable. However, existing NR-VQA datasets and metrics are primarily focused on camera-captured videos; applying them directly to rendered videos would result in biased predictions, as rendered videos are more prone to temporal artifacts. To address this, we present a large rendering-oriented video dataset with subjective quality annotations, as well as a designed NR-VQA metric specific to rendered videos. The proposed dataset includes a wide range of 3D scenes and rendering settings, with quality scores annotated for various display types to better reflect real-world application scenarios. Building on this dataset, we calibrate our NR-VQA metric to assess rendered video quality by looking at both image quality and temporal stability. We compare our metric to existing NR-VQA metrics, demonstrating its superior performance on rendered videos. Finally, we demonstrate that our metric can be used to benchmark supersampling methods and assess frame generation strategies in real-time rendering.

[29] DepthVLA: Enhancing Vision-Language-Action Models with Depth-Aware Spatial Reasoning

Tianyuan Yuan, Yicheng Liu, Chenhao Lu, Zhuoguang Chen, Tao Jiang, Hang Zhao

🧩 TL;DR

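DepthVLA is a simple yet effective vision-language-action architecture that explicitly enhances spatial awareness by integrating a pretrained depth prediction module, significantly improving performance on spatial reasoning tasks in several real-world and simulated environments.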


📘 Detailed Summary

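Motivation: Existing vision-language-action models degrade on tasks that require precise spatial reasoning, mainly because they inherit the limited spatial reasoning of vision-language models. They also depend on extensive action-data pretraining to ground vision-language models in 3D space, which reduces training efficiency and is still insufficient for accurate spatial understanding.

Method: DepthVLA adopts a mixture-of-transformers design that unifies a vision-language model, a depth transformer, and an action expert with fully shared attention, forming an end-to-end model. Spatial awareness is introduced explicitly through a pretrained depth prediction module.

Result: DepthVLA reaches 78.5% progress on real-world tasks (versus a 65.0% baseline), 94.9% in the LIBERO simulator (versus 93.6%), and 74.8% in the Simpler simulator (versus 58.8%), outperforming state-of-the-art methods across these benchmarks.

Conclusion: Explicitly integrating a spatial perception module is an effective way to improve the spatial reasoning of vision-language-action models. The mixture-of-transformers architecture unifies representation learning across modalities and offers a more efficient solution for robotic manipulation tasks that require precise spatial understanding.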


📄 Abstract

Vision-Language-Action (VLA) models have recently shown impressive generalization and language-guided manipulation capabilities. However, their performance degrades on tasks requiring precise spatial reasoning due to limited spatial reasoning inherited from Vision-Language Models (VLMs). Existing VLAs rely on extensive action-data pretraining to ground VLMs in 3D space, which reduces training efficiency and is still insufficient for accurate spatial understanding. In this work, we present DepthVLA, a simple yet effective VLA architecture that explicitly incorporates spatial awareness through a pretrained depth prediction module. DepthVLA adopts a mixture-of-transformers design that unifies a VLM, a depth transformer, and an action expert with fully shared attentions, forming an end-to-end model with enhanced spatial reasoning. Extensive evaluations in both real-world and simulated environments show that DepthVLA outperforms state-of-the-art approaches, achieving 78.5% vs. 65.0% progress in real-world tasks, 94.9% vs. 93.6% in the LIBERO simulator, and 74.8% vs. 58.8% in the Simpler simulator. Our code will be made publicly available.

[30] Generalizing WiFi Gesture Recognition via Large-Model-Aware Semantic Distillation and Alignment

Feng-Qi Cui, Yu-Tong Guo, Tianyue Zheng, Jinyang Huang

🧩 TL;DR

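This paper proposes GLSDA, a generalization framework based on large-model semantic distillation and alignment. It uses the semantic prior of pretrained large models to strengthen representation learning for WiFi gesture recognition, significantly improving cross-domain generalization while keeping the model lightweight.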


📘 Detailed Summary

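Motivation: Existing WiFi gesture recognition methods suffer from limited generalization and weak semantic expressiveness, mainly because Channel State Information is domain-sensitive and high-level gesture abstractions are missing. This makes reliable cross-domain deployment in real AIoT environments difficult.

Method: Proposes the GLSDA framework, comprising a dual-path CSI encoding pipeline that captures geometric and dynamic gesture patterns; a Multiscale Semantic Encoder that learns robust temporal embeddings and aligns them with gesture semantics through cross-modal attention; a Semantic-Aware Soft Supervision scheme that encodes inter-class correlations to reduce label ambiguity; and a Robust Dual-Distillation strategy that compresses the aligned model into a lightweight student network.

Result: Extensive experiments on the Widar3.0 benchmark show that GLSDA outperforms existing state-of-the-art methods on both in-domain and cross-domain gesture recognition while significantly reducing model size and inference latency.

Conclusion: The study provides a scalable and deployable solution for generalized RF gesture interfaces in real-world AIoT applications. By exploiting large-model semantic priors and an efficient compression strategy, it balances performance and efficiency and advances contactless human-computer interaction.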


📄 Abstract

WiFi-based gesture recognition has emerged as a promising RF sensing paradigm for enabling non-contact and privacy-preserving human-computer interaction in AIoT environments. However, existing methods often suffer from limited generalization and semantic expressiveness due to the domain-sensitive nature of Channel State Information and the lack of high-level gesture abstraction. To address these challenges, we propose a novel generalization framework, termed Large-Model-Aware Semantic Distillation and Alignment (GLSDA), which leverages the semantic prior of pre-trained large foundation models to enhance gesture representation learning in both in-domain and cross-domain scenarios. Specifically, we first design a dual-path CSI encoding pipeline that captures geometric and dynamic gesture patterns via CSI-Ratio phase sequences and Doppler spectrograms. These representations are then fed into a Multiscale Semantic Encoder, which learns robust temporal embeddings and aligns them with gesture semantics through cross-modal attention mechanisms. To further enhance category discrimination, we introduce a Semantic-Aware Soft Supervision scheme that encodes inter-class correlations and reduces label ambiguity, especially for semantically similar gestures. Finally, we develop a Robust Dual-Distillation strategy to compress the aligned model into a lightweight student network, jointly distilling intermediate features and semantic-informed soft labels from the teacher model. Extensive experiments on the Widar3.0 benchmark show that GLSDA consistently outperforms state-of-the-art methods in both in-domain and cross-domain gesture recognition tasks, while significantly reducing model size and inference latency. Our method offers a scalable and deployable solution for generalized RF-based gesture interfaces in real-world AIoT applications.

[31] Spatial-DISE: A Unified Benchmark for Evaluating Spatial Reasoning in Vision-Language Models

Xinmiao Huang, Qisong He, Zhenglin Huang, Boxuan Wang, Zhuoyun Li, Guangliang Cheng, Yi Dong, Xiaowei Huang

🧩 TL;DR

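This study proposes Spatial-DISE, a spatial reasoning benchmark based on a cognitively grounded taxonomy. An automated pipeline generates diverse and verifiable spatial reasoning questions, and evaluation shows that current vision-language models fall well short of human ability on multi-step, multi-view spatial reasoning.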


📘 Detailed Summary

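Motivation: Existing benchmarks are inadequate for assessing the spatial reasoning of vision-language models, especially intrinsic-dynamic spatial reasoning, which is a fundamental aspect of human spatial cognition. This limits model performance in applications such as robotics, augmented reality, and autonomous navigation.

Method: Builds the unified benchmark Spatial-DISE on a cognitively grounded taxonomy that divides tasks into four fundamental quadrants: intrinsic-static, intrinsic-dynamic, extrinsic-static, and extrinsic-dynamic spatial reasoning. A scalable automated pipeline generates diverse and verifiable spatial reasoning questions.

Result: A comprehensive evaluation of 28 state-of-the-art vision-language models shows a large and consistent gap to human competence on multi-step, multi-view spatial reasoning. The work yields the Spatial-DISE Bench with 559 evaluation VQA pairs and the Spatial-DISE-12K dataset with more than 12,000 training VQA pairs.

Conclusion: Spatial-DISE provides a robust evaluation framework, a valuable dataset, and a clear direction for future research toward human-like spatial intelligence in vision-language models, and exposes the limitations of current models on complex spatial reasoning tasks.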


📄 Abstract

Spatial reasoning ability is crucial for Vision Language Models (VLMs) to support real-world applications in diverse domains including robotics, augmented reality, and autonomous navigation. Unfortunately, existing benchmarks are inadequate in assessing spatial reasoning ability, especially intrinsic-dynamic spatial reasoning, which is a fundamental aspect of human spatial cognition. In this paper, we propose a unified benchmark, Spatial-DISE, based on a cognitively grounded taxonomy that categorizes tasks into four fundamental quadrants: Intrinsic-Static, Intrinsic-Dynamic, Extrinsic-Static, and Extrinsic-Dynamic spatial reasoning. Moreover, to address the issue of data scarcity, we develop a scalable and automated pipeline to generate diverse and verifiable spatial reasoning questions, resulting in a new Spatial-DISE dataset that includes Spatial-DISE Bench (559 evaluation VQA pairs) and Spatial-DISE-12K (12K+ training VQA pairs). Our comprehensive evaluation across 28 state-of-the-art VLMs reveals that current VLMs have a large and consistent gap to human competence, especially on multi-step multi-view spatial reasoning. Spatial-DISE offers a robust framework, valuable dataset, and clear direction for future research toward human-like spatial intelligence. Benchmark, dataset, and code will be publicly released.

[32] Reinforcement Learning Meets Masked Generative Models: Mask-GRPO for Text-to-Image Generation

Yifu Luo, Xinhao Hu, Keyu Fan, Haoyuan Sun, Zeyu Chen, Bo Xia, Tiantian Zhang, Yongzhe Chang, Xueqian Wang

🧩 TL;DR

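This paper proposes Mask-GRPO, the first method to apply GRPO-based reinforcement learning to masked generative models. By redefining the transition probability and modeling the unmasking process as a multi-step decision-making problem, it achieves substantial gains on text-to-image generation.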


📘 Detailed Summary

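Motivation: Existing reinforcement learning methods mainly target diffusion or autoregressive models and overlook masked generative models, an important alternative paradigm. This paper aims to fill that gap by bringing reinforcement learning to masked generative models to improve text-to-image generation.

Method: Proposes Mask-GRPO, whose core innovation is to redefine the transition probability and model the unmasking process as a multi-step decision-making problem. It also explores several optimization strategies: removing the KL constraint, applying a reduction strategy, and filtering out low-quality samples.

Result: On standard text-to-image benchmarks and on preference alignment, Mask-GRPO substantially improves the base model Show-o and outperforms existing state-of-the-art methods.

Conclusion: The study shows that reinforcement learning can be applied effectively to the masked generative paradigm, offering a new optimization route for text-to-image generation with clear gains in both benchmark performance and preference alignment.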


📄 Abstract

Reinforcement learning (RL) has garnered increasing attention in text-to-image (T2I) generation. However, most existing RL approaches are tailored to either diffusion models or autoregressive models, overlooking an important alternative: masked generative models. In this work, we propose Mask-GRPO, the first method to incorporate Group Relative Policy Optimization (GRPO)-based RL into this overlooked paradigm. Our core insight is to redefine the transition probability, which is different from current approaches, and formulate the unmasking process as a multi-step decision-making problem. To further enhance our method, we explore several useful strategies, including removing the KL constraint, applying the reduction strategy, and filtering out low-quality samples. Using Mask-GRPO, we improve a base model, Show-o, with substantial improvements on standard T2I benchmarks and preference alignment, outperforming existing state-of-the-art approaches. The code is available at https://github.com/xingzhejun/Mask-GRPO
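As a rough illustration of the "group relative" part of GRPO, the sketch below standardizes the rewards of several samples generated for the same prompt. It shows only the generic advantage computation; the redefined transition probability that Mask-GRPO introduces is not described in the abstract and is not reproduced here. The reward values and the function name `group_relative_advantages` are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples drawn for the same prompt."""
    mu = mean(rewards)
    # Population standard deviation is an assumption; some implementations use the sample version.
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward-model scores for four images generated from one prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Samples that score above the group mean receive a positive advantage and those below receive a negative one, so no separate value network is needed to form a baseline.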

[33] Fusion Meets Diverse Conditions: A High-diversity Benchmark and Baseline for UAV-based Multimodal Object Detection with Condition Cues

Chen Chen, Kangcheng Bin, Ting Hu, Jiahao Qi, Xingyue Liu, Tianpeng Liu, Zhen Liu, Yongxiang Liu, Ping Zhong

🧩 TL;DR

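This study introduces the ATR-UMOD dataset and the PCDF method. ATR-UMOD is a high-quality RGB-IR UAV object detection dataset covering diverse imaging conditions. PCDF is a prompt-guided, condition-aware dynamic fusion method that adaptively adjusts multimodal contributions.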


📘 Detailed Summary

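Motivation: Existing UAV object detection datasets do not adequately capture complex real-world imaging conditions, in particular the diversity across altitudes, angles, and all-day, all-year time variations. This limits the development of robust around-the-clock detection systems based on RGB and infrared images.

Method: Proposes prompt-guided condition-aware dynamic fusion, which encodes imaging conditions as text prompts and models the relationship between conditions and multimodal contributions through a task-specific soft-gating transformation. A prompt-guided condition-decoupling module keeps the method usable in practice when condition annotations are unavailable.

Result: Experiments on the ATR-UMOD dataset validate PCDF, which adaptively reassigns multimodal contributions and improves object detection under diverse imaging conditions.

Conclusion: The study provides both a high-quality multimodal dataset and a dynamic fusion framework that makes effective use of condition information, supporting the practical deployment of around-the-clock UAV object detection and showing the potential of condition-aware methods in multimodal fusion.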


📄 Abstract

Unmanned aerial vehicle (UAV)-based object detection with visible (RGB) and infrared (IR) images facilitates robust around-the-clock detection, driven by advancements in deep learning techniques and the availability of high-quality datasets. However, existing datasets struggle to fully capture real-world complexity because of their limited imaging conditions. To this end, we introduce a high-diversity dataset, ATR-UMOD, covering varying scenarios, spanning altitudes from 80m to 300m, angles from 0° to 75°, and all-day, all-year time variations in rich weather and illumination conditions. Moreover, each RGB-IR image pair is annotated with 6 condition attributes, offering valuable high-level contextual information. To meet the challenge raised by such diverse conditions, we propose a novel prompt-guided condition-aware dynamic fusion (PCDF) to adaptively reassign multimodal contributions by leveraging annotated condition cues. By encoding imaging conditions as text prompts, PCDF effectively models the relationship between conditions and multimodal contributions through a task-specific soft-gating transformation. A prompt-guided condition-decoupling module further ensures availability in practice without condition annotations. Experiments on the ATR-UMOD dataset reveal the effectiveness of PCDF.
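To make the idea of soft gating concrete, here is a minimal sketch in which a condition embedding produces a single gate value that blends an RGB feature vector with an IR feature vector. This is a generic gating pattern, not the PCDF transformation itself, which the abstract does not specify; `soft_gate_fusion`, `gate_weights`, and all numbers are assumptions made for illustration.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def soft_gate_fusion(rgb_feat, ir_feat, cond_embedding, gate_weights, gate_bias=0.0):
    """Blend two modality features with a gate computed from the condition embedding."""
    score = sum(w * c for w, c in zip(gate_weights, cond_embedding)) + gate_bias
    gate = sigmoid(score)  # weight on the RGB branch; (1 - gate) goes to IR
    fused = [gate * r + (1.0 - gate) * i for r, i in zip(rgb_feat, ir_feat)]
    return fused, gate


# Hypothetical features and a condition embedding (e.g. an encoded "night, fog" prompt).
rgb_feat = [1.0, 2.0]
ir_feat = [3.0, 4.0]
cond_embedding = [0.5, -0.5]
fused, gate = soft_gate_fusion(rgb_feat, ir_feat, cond_embedding, gate_weights=[-2.0, 2.0])
```

With these illustrative weights the gate falls below 0.5, so the fused feature leans toward the IR branch, which is the kind of reweighting one would want when visible-light imagery is degraded.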

[34] AVAR-Net: A Lightweight Audio-Visual Anomaly Recognition Framework with a Benchmark Dataset

Amjid Ali, Zulfiqar Ahmad Khan, Altaf Hussain, Muhammad Munsif, Adnan Hussain, Sung Wook Baik

🧩 TL;DR

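This study proposes AVAR-Net, a lightweight and efficient audio-visual anomaly recognition framework that fuses audio and visual modalities to improve anomaly recognition under challenging conditions such as occlusion and low illumination. It reaches 89.29% accuracy on the newly built VAAR dataset.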


📘 Detailed Summary

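Motivation: Existing anomaly recognition methods rely mainly on visual data and are unreliable under challenging conditions such as occlusion, low illumination, and adverse weather. The lack of large-scale synchronized audio-visual datasets has also held back multimodal anomaly recognition.

Method: AVAR-Net has four core modules: Wav2Vec2 extracts temporal features from audio, MobileViT captures local and global visual representations from video, an early fusion mechanism combines the two modalities, and a Multi-Stage Temporal Convolutional Network (MTCN) learns long-range temporal dependencies in the fused representation for robust spatiotemporal reasoning.

Result: AVAR-Net reaches 89.29% accuracy on VAAR and 88.56% Average Precision on XD-Violence, improving Average Precision by 2.8% over existing state-of-the-art methods. The work also introduces VAAR, a medium-scale benchmark of 3,000 real-world videos.

Conclusion: The framework demonstrates effectiveness, efficiency, and generalization in multimodal anomaly recognition. The VAAR dataset provides a useful benchmark for further research and supports the case for fusing audio and visual information in complex environments.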


📄 Abstract

Anomaly recognition plays a vital role in surveillance, transportation, healthcare, and public safety. However, most existing approaches rely solely on visual data, making them unreliable under challenging conditions such as occlusion, low illumination, and adverse weather. Moreover, the absence of large-scale synchronized audio-visual datasets has hindered progress in multimodal anomaly recognition. To address these limitations, this study presents AVAR-Net, a lightweight and efficient audio-visual anomaly recognition framework designed for real-world environments. AVAR-Net consists of four main modules: an audio feature extractor, a video feature extractor, a fusion strategy, and a sequential pattern learning network that models cross-modal relationships for anomaly recognition. Specifically, the Wav2Vec2 model extracts robust temporal features from raw audio, while MobileViT captures both local and global visual representations from video frames. An early fusion mechanism combines these modalities, and a Multi-Stage Temporal Convolutional Network (MTCN) learns long-range temporal dependencies within the fused representation, enabling robust spatiotemporal reasoning. A novel Visual-Audio Anomaly Recognition (VAAR) dataset is also introduced, serving as a medium-scale benchmark containing 3,000 real-world videos with synchronized audio across ten diverse anomaly classes. Experimental evaluations demonstrate that AVAR-Net achieves 89.29% accuracy on VAAR and 88.56% Average Precision on the XD-Violence dataset, improving Average Precision by 2.8% over existing state-of-the-art methods. These results highlight the effectiveness, efficiency, and generalization capability of the proposed framework, as well as the utility of VAAR as a benchmark for advancing multimodal anomaly recognition research.

[35] Challenges, Advances, and Evaluation Metrics in Medical Image Enhancement: A Systematic Literature Review

Chun Wai Chin, Haniza Yazid, Hoi Leong Lee

🧩 TL;DR

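This systematic review follows the PRISMA approach to analyze 39 peer-reviewed studies. It examines the key challenges, recent advances, and evaluation metrics in medical image enhancement, identifies current limitations, and offers insights into future directions.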


📘 Detailed Summary

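Motivation: Medical images from modalities such as X-ray, CT, MRI, and ultrasound often suffer from noise, artifacts, and low contrast, which limit their diagnostic potential. Robust preprocessing, denoising algorithms, and advanced enhancement methods are needed to improve image quality and interpretability.

Method: The study follows the PRISMA systematic review approach and analyzes 39 peer-reviewed studies covering conventional mathematical methods, deep learning techniques, and hybrid approaches. It also systematically examines reference-based and non-reference-based image quality assessment (IQA) metrics.

Result: Low contrast and noise are the most frequent problems. MRI and multi-modal imaging receive the most attention, while specialized modalities such as histopathology, endoscopy, and bone scintigraphy remain underexplored. Of the 39 studies, 29 use conventional mathematical methods, 9 focus on deep learning, and 1 takes a hybrid approach. A total of 65 IQA metrics are introduced, predominantly non-reference-based.

Conclusion: The review identifies research gaps and current limitations in medical image enhancement, stresses the role of evaluation metrics in judging how well a method works, and points to future directions, especially for specialized imaging modalities and advanced deep learning techniques.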


📄 Abstract

Medical image enhancement is crucial for improving the quality and interpretability of diagnostic images, ultimately supporting early detection, accurate diagnosis, and effective treatment planning. Despite advancements in imaging technologies such as X-ray, CT, MRI, and ultrasound, medical images often suffer from challenges like noise, artifacts, and low contrast, which limit their diagnostic potential. Addressing these challenges requires robust preprocessing, denoising algorithms, and advanced enhancement methods, with deep learning techniques playing an increasingly significant role. This systematic literature review, following the PRISMA approach, investigates the key challenges, recent advancements, and evaluation metrics in medical image enhancement. By analyzing findings from 39 peer-reviewed studies, this review provides insights into the effectiveness of various enhancement methods across different imaging modalities and the importance of evaluation metrics in assessing their impact. Key issues like low contrast and noise are identified as the most frequent, with MRI and multi-modal imaging receiving the most attention, while specialized modalities such as histopathology, endoscopy, and bone scintigraphy remain underexplored. Out of the 39 studies, 29 utilize conventional mathematical methods, 9 focus on deep learning techniques, and 1 explores a hybrid approach. In terms of image quality assessment, 18 studies employ both reference-based and non-reference-based metrics, 9 rely solely on reference-based metrics, and 12 use only non-reference-based metrics, with a total of 65 IQA metrics introduced, predominantly non-reference-based. This review highlights current limitations, research gaps, and potential future directions for advancing medical image enhancement.
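For readers unfamiliar with the distinction, a reference-based metric compares an enhanced image against a ground-truth reference, whereas a non-reference-based metric scores the enhanced image alone. The sketch below computes PSNR, a widely used reference-based metric, on flat lists of pixel values. PSNR is chosen here purely as an illustration; the abstract does not say which of the 65 metrics the reviewed studies used, and the pixel values are made up.

```python
import math


def psnr(reference, enhanced, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized flat pixel lists."""
    if len(reference) != len(enhanced) or not reference:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((r - e) ** 2 for r, e in zip(reference, enhanced)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical 2x2 image flattened to a list, with two pixels off by 10 grey levels.
reference = [100, 100, 100, 100]
enhanced = [110, 90, 100, 100]
score_db = psnr(reference, enhanced)
```

Because PSNR needs the reference image, it cannot be applied when no clean ground truth exists, which is one reason non-reference-based metrics are common in clinical enhancement studies.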

[36] Local-Global Context-Aware and Structure-Preserving Image Super-Resolution

Sanchar Palit, Subhasis Chaudhuri, Biplab Banerjee

🧩 TL;DR

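This paper proposes a contextually precise image super-resolution framework that uses Local-Global Context-Aware Attention and a perceptually aligned conditioning mechanism to generate high-quality super-resolved images while preserving structural consistency.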


📘 Detailed Summary

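Motivation: Existing super-resolution methods built on pretrained text-to-image models suffer from noise amplification and incorrect content generation when applied to diverse and highly degraded images. These limitations must be addressed to achieve more robust super-resolution.

Method: Proposes Local-Global Context-Aware Attention to preserve both local and global pixel relationships, together with a distribution- and perceptual-aligned conditioning mechanism in pixel space that moves progressively from local content details to global structural composition to enhance perceptual fidelity.

Result: Extensive experiments on multiple super-resolution benchmarks show that the method produces structurally consistent, high-quality images with fewer artifacts, achieving high-fidelity and perceptually accurate reconstruction.

Conclusion: The study shows that context-aware attention and perceptually aligned conditioning can address the limitations of diffusion models in super-resolution, offering a new technical route to high-quality image generation with practical value.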


📄 Abstract

Diffusion models have recently achieved significant success in various image manipulation tasks, including image super-resolution and perceptual quality enhancement. Pretrained text-to-image models, such as Stable Diffusion, have exhibited strong capabilities in synthesizing realistic image content, which makes them particularly attractive for addressing super-resolution tasks. While some existing approaches leverage these models to achieve state-of-the-art results, they often struggle when applied to diverse and highly degraded images, leading to noise amplification or incorrect content generation. To address these limitations, we propose a contextually precise image super-resolution framework that effectively maintains both local and global pixel relationships through Local-Global Context-Aware Attention, enabling the generation of high-quality images. Furthermore, we propose a distribution- and perceptual-aligned conditioning mechanism in the pixel space to enhance perceptual fidelity. This mechanism captures fine-grained pixel-level representations while progressively preserving and refining structural information, transitioning from local content details to the global structural composition. During inference, our method generates high-quality images that are structurally consistent with the original content, mitigating artifacts and ensuring realistic detail restoration. Extensive experiments on multiple super-resolution benchmarks demonstrate the effectiveness of our approach in producing high-fidelity, perceptually accurate reconstructions.

[37] OmniGaze: Reward-inspired Generalizable Gaze Estimation In The Wild

Hongyu Qu, Jianan Wei, Xiangbo Shu, Yazhou Yao, Wenguan Wang, Jinhui Tang

🧩 TL;DR

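This paper proposes OmniGaze, a semi-supervised framework for 3D gaze estimation that uses large-scale unlabeled data to mitigate domain bias and generalize gaze estimation in the wild. It combines a pseudo-labeling strategy with a reward model that assesses pseudo-label reliability, and it achieves state-of-the-art performance on multiple datasets.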


📘 Detailed Summary

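Motivation: Current 3D gaze estimation methods struggle to generalize across data domains, mainly because annotated datasets are scarce and labeled data lacks diversity. This work aims to address domain bias and improve generalization in unconstrained real-world environments.

Method: OmniGaze is a semi-supervised framework built on a collection of unlabeled facial images that vary in facial appearance, background environment, illumination, head pose, and eye occlusion. It adopts a standard pseudo-labeling strategy and designs a reward model to assess pseudo-label reliability. The reward model computes confidence scores from the 3D direction-vector pseudo labels, visual embeddings extracted by an off-the-shelf visual encoder, and gaze-perspective semantic cues generated by prompting a Multimodal Large Language Model.

Result: Extensive experiments show that OmniGaze achieves state-of-the-art performance on five datasets in both in-domain and cross-domain settings. As a scalable data engine for gaze estimation, it also shows robust zero-shot generalization on four unseen datasets.

Conclusion: The study shows that large-scale unlabeled data combined with multimodal information can improve the generalization of 3D gaze estimation. OmniGaze serves both as a high-performing gaze estimator and as a scalable data engine, pointing to a new direction for future gaze estimation research.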


📄 Abstract

Current 3D gaze estimation methods struggle to generalize across diverse data domains, primarily due to i) the scarcity of annotated datasets, and ii) the insufficient diversity of labeled data. In this work, we present OmniGaze, a semi-supervised framework for 3D gaze estimation, which utilizes large-scale unlabeled data collected from diverse and unconstrained real-world environments to mitigate domain bias and generalize gaze estimation in the wild. First, we build a diverse collection of unlabeled facial images, varying in facial appearances, background environments, illumination conditions, head poses, and eye occlusions. In order to leverage unlabeled data spanning a broader distribution, OmniGaze adopts a standard pseudo-labeling strategy and devises a reward model to assess the reliability of pseudo labels. Beyond pseudo labels as 3D direction vectors, the reward model also incorporates visual embeddings extracted by an off-the-shelf visual encoder and semantic cues from gaze perspective generated by prompting a Multimodal Large Language Model to compute confidence scores. Then, these scores are utilized to select high-quality pseudo labels and weight them for loss computation. Extensive experiments demonstrate that OmniGaze achieves state-of-the-art performance on five datasets under both in-domain and cross-domain settings. Furthermore, we also evaluate the efficacy of OmniGaze as a scalable data engine for gaze estimation, which exhibits robust zero-shot generalization on four unseen datasets.
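The selection-and-weighting step described above can be sketched as follows: pseudo labels whose confidence falls below a threshold are dropped, and the remaining ones contribute to the loss in proportion to their confidence. The angular-error loss, the threshold value, and the function names are assumptions for illustration; in OmniGaze the confidence scores come from the reward model, which is not reproduced here.

```python
import math


def angular_error(pred, target):
    """Angle in radians between two 3D gaze direction vectors."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    cos = max(-1.0, min(1.0, dot / norm))  # clamp to guard against rounding
    return math.acos(cos)


def weighted_pseudo_label_loss(preds, pseudo_labels, confidences, threshold=0.5):
    """Confidence-weighted mean angular error over pseudo labels that pass the threshold."""
    total, weight_sum = 0.0, 0.0
    for pred, label, conf in zip(preds, pseudo_labels, confidences):
        if conf < threshold:
            continue  # discard unreliable pseudo labels
        total += conf * angular_error(pred, label)
        weight_sum += conf
    return total / weight_sum if weight_sum > 0 else 0.0


# Hypothetical predictions, pseudo labels, and reward-model confidence scores.
preds = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
pseudo_labels = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]
loss = weighted_pseudo_label_loss(preds, pseudo_labels, confidences=[0.9, 0.2])
```

In this toy case the second pseudo label is filtered out for low confidence, so only the first, perfectly matched sample contributes to the loss.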

[38] Seeing and Knowing in the Wild: Open-domain Visual Entity Recognition with Large-scale Knowledge Graphs via Contrastive Learning

Hongkuan Zhou, Lavdim Halilaj, Sebastian Monka, Stefan Schmid, Yuqicheng Zhu, Jingcheng Wu, Nadeem Nazer, Steffen Staab

🧩 TL;DR

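This paper proposes KnowCoL, a knowledge-guided contrastive learning framework that combines visual, textual, and structured knowledge to substantially improve open-domain visual entity recognition, especially for rare and unseen entities. Its smallest model improves accuracy on unseen entities by 10.5% over the state of the art while being 35 times smaller.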


📘 Detailed Summary

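Motivation: Open-domain visual entity recognition poses challenges that classification with a fixed label set does not: most target entities are unseen during training, the distribution is long-tailed, supervision is limited, visual ambiguity is high, and semantic disambiguation is required. Conventional methods therefore perform poorly under open-set conditions.

Method: The proposed KnowCoL framework maps images and text descriptions into a shared semantic space grounded in structured information from Wikidata. By abstracting visual and textual inputs to a conceptual level, it uses entity descriptions, type hierarchies, and relational context to support zero-shot entity recognition.

Result: Experiments on the OVEN benchmark show that combining visual, textual, and structured knowledge greatly improves recognition accuracy, especially for rare and unseen entities. The smallest model improves accuracy on unseen entities by 10.5% over the state of the art while being 35 times smaller.

Conclusion: The study shows that multimodal knowledge fusion is essential for open-domain visual entity recognition. Structured knowledge compensates for gaps in visual and textual information, particularly for long-tail distributions and zero-shot scenarios, offering a new technical route to open-world visual understanding.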


📄 Abstract

Open-domain visual entity recognition aims to identify and link entities depicted in images to a vast and evolving set of real-world concepts, such as those found in Wikidata. Unlike conventional classification tasks with fixed label sets, it operates under open-set conditions, where most target entities are unseen during training and exhibit long-tail distributions. This makes the task inherently challenging due to limited supervision, high visual ambiguity, and the need for semantic disambiguation. In this work, we propose a Knowledge-guided Contrastive Learning (KnowCoL) framework that combines both images and text descriptions into a shared semantic space grounded by structured information from Wikidata. By abstracting visual and textual inputs to a conceptual level, the model leverages entity descriptions, type hierarchies, and relational context to support zero-shot entity recognition. We evaluate our approach on the OVEN benchmark, a large-scale open-domain visual recognition dataset with Wikidata IDs as the label space. Our experiments show that using visual, textual, and structured knowledge greatly improves accuracy, especially for rare and unseen entities. Our smallest model improves the accuracy on unseen entities by 10.5% compared to the state-of-the-art, despite being 35 times smaller.
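One common way to train a shared embedding space of this kind is an InfoNCE-style contrastive loss, which pulls an image embedding toward its matching entity embedding and pushes it away from the others. The sketch below is a generic version of that loss under the assumption that cosine similarity and a temperature are used; the abstract does not state KnowCoL's exact objective, and the embeddings here are toy values rather than model outputs.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def info_nce_loss(image_emb, entity_embs, positive_index, temperature=0.07):
    """Negative log-probability of the matching entity under a softmax over similarities."""
    logits = [cosine(image_emb, e) / temperature for e in entity_embs]
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[positive_index]


# Toy 2D embeddings: one image and three candidate entities.
image_emb = [1.0, 0.0]
entity_embs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
loss_correct = info_nce_loss(image_emb, entity_embs, positive_index=0)
loss_wrong = info_nce_loss(image_emb, entity_embs, positive_index=1)
```

The loss is small when the image sits closest to its true entity and grows when the labeled entity is far away, which is what drives the two modalities into alignment during training.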

[39] Risk-adaptive Activation Steering for Safe Multimodal Large Language Models

Jonghyun Park, Minhyuk Seo, Jonghyun Choi

🧩 TL;DR

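This paper proposes Risk-adaptive Activation Steering (RAS), which reformulates queries to strengthen cross-modal attention to safety-critical image regions, assesses risk at the query level, and adaptively steers activations to generate safe and helpful responses. It significantly reduces attack success rates and improves inference speed over prior inference-time defenses.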


📘 Detailed Summary

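Motivation: A key challenge for modern AI models is to give helpful responses to benign queries while refusing malicious ones, yet models are often vulnerable to multimodal queries with harmful intent embedded in images. Existing safety alignment either requires costly training or, when done at inference time, causes excessive refusals of benign queries and slows inference through iterative output adjustments.

Method: Proposes risk-adaptive activation steering, which reformulates queries to strengthen cross-modal attention to safety-critical image regions, enabling accurate query-level risk assessment. The assessed risk then adaptively steers activations during response generation, avoiding the overhead of iterative output adjustments.

Result: Extensive experiments on multimodal safety and utility benchmarks show that the method significantly reduces attack success rates, preserves general task performance, and improves inference speed over prior inference-time defenses.

Conclusion: The method provides an efficient approach to multimodal safety alignment. Query reformulation and activation steering strengthen safety while preserving utility, offering a new technical route for inference-time safety defenses.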


📄 Abstract

One of the key challenges of modern AI models is ensuring that they provide helpful responses to benign queries while refusing malicious ones. But often, the models are vulnerable to multimodal queries with harmful intent embedded in images. One approach for safety alignment is training with extensive safety datasets, at significant cost in both dataset curation and training. Inference-time alignment mitigates these costs, but introduces two drawbacks: excessive refusals from misclassified benign queries and slower inference speed due to iterative output adjustments. To overcome these limitations, we propose to reformulate queries to strengthen cross-modal attention to safety-critical image regions, enabling accurate risk assessment at the query level. Using the assessed risk, our method adaptively steers activations to generate responses that are safe and helpful without overhead from iterative output adjustments. We call this Risk-adaptive Activation Steering (RAS). Extensive experiments across multiple benchmarks on multimodal safety and utility demonstrate that RAS significantly reduces attack success rates, preserves general task performance, and improves inference speed over prior inference-time defenses.
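A simple way to picture risk-adaptive steering is to add a fixed steering direction to a hidden activation, scaled by the assessed risk, so that benign queries are left almost untouched. The sketch below shows that generic pattern only; how RAS derives its steering direction and risk score is not given in the abstract, so `steering_vector`, `risk`, and `strength` are hypothetical quantities.

```python
def steer_activation(hidden, steering_vector, risk, strength=4.0):
    """Shift a hidden activation along a steering direction in proportion to query risk."""
    if not 0.0 <= risk <= 1.0:
        raise ValueError("risk is expected to lie in [0, 1]")
    return [h + strength * risk * v for h, v in zip(hidden, steering_vector)]


# Hypothetical hidden state and safety-oriented steering direction.
hidden = [1.0, 2.0]
steering_vector = [0.5, -0.5]
benign = steer_activation(hidden, steering_vector, risk=0.0)
risky = steer_activation(hidden, steering_vector, risk=1.0, strength=2.0)
```

Because the shift is a single vector addition per steered layer, this kind of intervention runs in one forward pass, which is consistent with the abstract's claim of avoiding iterative output adjustments.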

[40] Cyclic Self-Supervised Diffusion for Ultra Low-field to High-field MRI Synthesis

Zhenxuan Zhang, Peiyuan Jing, Zi Wang, Ula Briski, Coraline Beitone, Yue Yang, Yinzhe Wu, Fanwen Wang, Liutao Yang, Jiahao Huang, Zhifan Gao, Zhaolin Chen, Kh Tohidul Islam, Guang Yang, Peter J. Lally

🧩 TL;DR

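This paper proposes a cyclic self-supervised diffusion framework (CSS-Diff) for synthesizing high-quality high-field MRI images from real low-field MRI data. A cycle-consistency constraint preserves anatomical fidelity, and the method achieves state-of-the-art results on both quantitative metrics and anatomical consistency.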


📘 Detailed Summary

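Motivation: Low-field MRI is cheaper, more accessible, and safer, but it suffers from low resolution and poor signal-to-noise ratio. Existing high-field MRI synthesis methods fall short on clinical fidelity: they need to preserve anatomical fidelity, enhance fine-grained structural detail, and bridge domain gaps in image contrast.

Method: Proposes the cyclic self-supervised diffusion framework (CSS-Diff), whose core idea is to reformulate diffusion-based synthesis under a cycle-consistent constraint so that anatomical structure is preserved throughout generation. The framework adds two new processes: a slice-wise gap perception network that aligns inter-slice inconsistencies via contrastive learning, and a local structure correction network that improves local feature restoration through self-reconstruction of masked and perturbed patches.

Result: Extensive experiments on cross-field synthesis tasks show state-of-the-art performance (PSNR 31.80 ± 2.70 dB, SSIM 0.943 ± 0.102, LPIPS 0.0864 ± 0.0689). Beyond pixel-wise fidelity, the method preserves fine-grained anatomical structures better than the original low-field MRI (for example, left cerebral white matter error drops from 12.1% to 2.1%, and cortex error from 4.2% to 3.7%).

Conclusion: CSS-Diff synthesizes images that are both quantitatively reliable and anatomically consistent, narrowing the clinical fidelity gap in high-field MRI synthesis and offering a new technical route for medical image synthesis with clinical value.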


📄 Abstract

Synthesizing high-quality images from low-field MRI holds significant potential. Low-field MRI is cheaper, more accessible, and safer, but suffers from low resolution and poor signal-to-noise ratio. This synthesis process can reduce reliance on costly acquisitions and expand data availability. However, synthesizing high-field MRI still suffers from a clinical fidelity gap. There is a need to preserve anatomical fidelity, enhance fine-grained structural details, and bridge domain gaps in image contrast. To address these issues, we propose a cyclic self-supervised diffusion (CSS-Diff) framework for high-field MRI synthesis from real low-field MRI data. Our core idea is to reformulate diffusion-based synthesis under a cycle-consistent constraint. It enforces anatomical preservation throughout the generative process rather than just relying on paired pixel-level supervision. The CSS-Diff framework further incorporates two novel processes. The slice-wise gap perception network aligns inter-slice inconsistencies via contrastive learning. The local structure correction network enhances local feature restoration through self-reconstruction of masked and perturbed patches. Extensive experiments on cross-field synthesis tasks demonstrate the effectiveness of our method, achieving state-of-the-art performance (e.g., 31.80 ± 2.70 dB in PSNR, 0.943 ± 0.102 in SSIM, and 0.0864 ± 0.0689 in LPIPS). Beyond pixel-wise fidelity, our method also preserves fine-grained anatomical structures compared with the original low-field MRI (e.g., left cerebral white matter error drops from 12.1% to 2.1%, cortex from 4.2% to 3.7%). To conclude, our CSS-Diff can synthesize images that are both quantitatively reliable and anatomically consistent.

[41] InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue

Wenwen Tong, Hewei Guo, Dongchuan Ran, Jiangnan Chen, Jiefan Lu, Kaibin Wang, Keqiang Li, Xiaoxu Zhu, Jiakui Li, Kehan Li, Xueheng Li, Lumin Li, Chenxu Guo, Jiasheng Zhou, Jiandong Chen, Xianye Wu, Jiahao Wang, Silei Wu, Lei Chen, Hanming Deng, Yuxuan Song, Dinghao Zhou, Guiping Zhong, Ken Zheng, Shiyin Kang, Lewei Lu

🧩 TL;DR

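InteractiveOmni is a unified, open-source omni-modal large language model offered as a lightweight solution from 4B to 8B parameters. It achieves state-of-the-art results among similarly sized models on image, audio, and video understanding and on speech generation, and it is particularly strong at multi-turn audio-visual interaction and long-term memory.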


📘 Detailed Summary

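Motivation: The study addresses the limited omni-modal understanding and speech generation abilities of existing lightweight models, in particular the lack of a unified architecture that can handle complex multi-turn audio-visual interaction and the lack of systematic benchmarks for multi-turn memory and speech interaction.

Method: Integrates a vision encoder, audio encoder, large language model, and speech decoder into one unified architecture. A multi-stage training strategy applies pre-training for omni-modal understanding followed by post-training on speech conversation and audio-visual interaction, and a carefully curated multi-turn training dataset strengthens the handling of complex interactions.

Result: Experiments show that InteractiveOmni performs strongly on general benchmarks. The 4B version is comparable to Qwen2.5-Omni-7B and retains 97% of the 8B version's performance with only 50% of the model size, and the model significantly outperforms leading open-source models on the multi-turn memory and speech interaction benchmarks.

Conclusion: InteractiveOmni provides an accessible open-source foundation for next-generation intelligent interactive systems. It shows that a lightweight model can deliver omni-modal understanding and generation while maintaining high performance, with clear advantages in long-term memory and multi-turn interaction.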


📄 Abstract

We introduce InteractiveOmni, a unified and open-source omni-modal large language model for audio-visual multi-turn interaction, ranging from 4B to 8B parameters, designed to lead the field of lightweight models by offering comprehensive omni-modal understanding and speech generation capabilities. To achieve this, we integrate the vision encoder, audio encoder, large language model, and speech decoder into a unified model for understanding and generation tasks. We design a multi-stage training strategy to ensure robust cross-modal capabilities, including pre-training for omni-modal understanding, followed by post-training with speech conversation and audio-visual interaction. To enable human-like long-term conversational ability, we meticulously curate a multi-turn training dataset that enhances the model's ability to handle complex and multi-turn interactions. To effectively evaluate the multi-turn memory and speech interaction capabilities, we construct the multi-modal multi-turn memory benchmark and the multi-turn speech interaction benchmark. Experiments demonstrate that InteractiveOmni significantly outperforms leading open-source models and provides a more intelligent multi-turn audio-visual experience, particularly in its long-term memory capabilities. Notably, InteractiveOmni-4B is comparable to much larger models such as Qwen2.5-Omni-7B on general benchmarks, and it retains 97% of the performance of InteractiveOmni-8B while using only 50% of the model size. Achieving state-of-the-art results against similarly sized models across image, audio, and video understanding and speech generation tasks, InteractiveOmni is an accessible, open-source foundation for next-generation intelligent interactive systems.

[42] Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark

Kai Zou, Ziqi Huang, Yuhao Dong, Shulin Tian, Dian Zheng, Hongbo Liu, Jingwen He, Bin Liu, Yu Qiao, Ziwei Liu

🧩 TL;DR

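This paper presents Uni-MMMU, a comprehensive and discipline-aware multimodal benchmark that systematically evaluates the bidirectional synergy between generation and understanding across eight reasoning-centric domains, providing a reliable foundation for advancing unified multimodal models.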


📘 Detailed Summary

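Motivation: Unified multimodal models aim to enable visual understanding and generation jointly, but current benchmarks rarely test their true integration. Existing evaluations either treat the two abilities in isolation or overlook tasks that inherently couple them.

Method: Uni-MMMU systematically unfolds the bidirectional synergy between generation and understanding across eight reasoning-centric domains. Every task is bidirectionally coupled, requiring models either to use conceptual understanding to guide precise visual synthesis or to use generation as a cognitive scaffold for analytical reasoning. The benchmark includes verifiable intermediate reasoning steps, unique ground truths, and a reproducible scoring protocol.

Result: Extensive evaluation of state-of-the-art unified, generation-only, and understanding-only models reveals substantial performance disparities and cross-modal dependencies, offering new insights into when and how these abilities reinforce one another.

Conclusion: The study establishes a reliable foundation for advancing unified multimodal models, sheds light on the synergy between generation and understanding, and provides a systematic framework for evaluating the combined abilities of multimodal models.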


📄 Abstract

Unified multimodal models aim to jointly enable visual understanding and generation, yet current benchmarks rarely examine their true integration. Existing evaluations either treat the two abilities in isolation or overlook tasks that inherently couple them. To address this gap, we present Uni-MMMU, a comprehensive and discipline-aware benchmark that systematically unfolds the bidirectional synergy between generation and understanding across eight reasoning-centric domains, including science, coding, mathematics, and puzzles. Each task is bidirectionally coupled, demanding models to (i) leverage conceptual understanding to guide precise visual synthesis, or (ii) utilize generation as a cognitive scaffold for analytical reasoning. Uni-MMMU incorporates verifiable intermediate reasoning steps, unique ground truths, and a reproducible scoring protocol for both textual and visual outputs. Through extensive evaluation of state-of-the-art unified, generation-only, and understanding-only models, we reveal substantial performance disparities and cross-modal dependencies, offering new insights into when and how these abilities reinforce one another, and establishing a reliable foundation for advancing unified models.

[43] Adaptive Visual Conditioning for Semantic Consistency in Diffusion-Based Story Continuation

Seyed Mohammad Mousavi, Morteza Analoui

🧩 TL;DR

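This paper proposes AVC (Adaptive Visual Conditioning), a framework for diffusion-based story continuation. It uses CLIP to retrieve the most semantically aligned previous image and limits the influence of visual conditioning when no relevant image is found, improving narrative coherence and semantic consistency.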


📘 Detailed Summary

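Motivation: The central challenge in story continuation is to use prior visual context effectively while keeping the output semantically aligned with the current text input. This requires balancing the influence of visual conditioning so that misleading or irrelevant information is not injected.

Method: AVC uses the CLIP model to retrieve the most semantically aligned previous image. When no sufficiently relevant image is found, it adaptively restricts visual conditioning to the early stages of the diffusion process. A noisy dataset is also re-captioned with large language models to strengthen textual supervision and semantic alignment.

Result: Quantitative results and human evaluations show that AVC outperforms strong baselines in narrative coherence, semantic consistency, and visual fidelity, especially in challenging cases where prior visuals conflict with the current input.

Conclusion: The study demonstrates the effectiveness of adaptive visual conditioning for story continuation, suggests a new way to use visual context in multimodal generation, and shows how much data quality improvements matter for model performance.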


📄 Abstract

Story continuation focuses on generating the next image in a narrative sequence so that it remains coherent with both the ongoing text description and the previously observed images. A central challenge in this setting lies in utilizing prior visual context effectively, while ensuring semantic alignment with the current textual input. In this work, we introduce AVC (Adaptive Visual Conditioning), a framework for diffusion-based story continuation. AVC employs the CLIP model to retrieve the most semantically aligned image from previous frames. Crucially, when no sufficiently relevant image is found, AVC adaptively restricts the influence of prior visuals to only the early stages of the diffusion process. This enables the model to exploit visual context when beneficial, while avoiding the injection of misleading or irrelevant information. Furthermore, we improve data quality by re-captioning a noisy dataset using large language models, thereby strengthening textual supervision and semantic alignment. Quantitative results and human evaluations demonstrate that AVC achieves superior coherence, semantic consistency, and visual fidelity compared to strong baselines, particularly in challenging cases where prior visuals conflict with the current input.
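The retrieve-then-gate logic described above can be sketched as follows: pick the previous frame whose embedding is most similar to the current text embedding, and if even the best match is weak, condition on it only during the early denoising steps. The sketch assumes text-to-image cosine similarity in a shared CLIP-style space; the embeddings are toy vectors rather than real CLIP outputs, and `sim_threshold` and `early_fraction` are hypothetical values not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def plan_visual_conditioning(text_emb, prev_image_embs, total_steps,
                             sim_threshold=0.3, early_fraction=0.25):
    """Return (best frame index, its similarity, number of steps that use visual conditioning)."""
    if not prev_image_embs:
        raise ValueError("at least one previous frame is required")
    sims = [cosine(text_emb, e) for e in prev_image_embs]
    best = max(range(len(sims)), key=sims.__getitem__)
    if sims[best] >= sim_threshold:
        cond_steps = total_steps  # relevant frame: condition throughout
    else:
        cond_steps = int(total_steps * early_fraction)  # weak match: early steps only
    return best, sims[best], cond_steps


# Toy embeddings: the current caption and two previous frames.
text_emb = [1.0, 0.0]
relevant = plan_visual_conditioning(text_emb, [[1.0, 0.0], [0.0, 1.0]], total_steps=100)
weak = plan_visual_conditioning(text_emb, [[0.1, 1.0], [0.0, 1.0]], total_steps=100)
```

Restricting a weak match to the early steps lets it influence coarse layout while leaving later, detail-forming steps to the text prompt alone.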

[44] Reasoning in Space via Grounding in the World

Yiming Chen, Zekun Qi, Wenyao Zhang, Xin Jin, Li Zhang, Peidong Liu

🧩 TL;DR

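This paper proposes GS-Reasoner, the first 3D LLM to achieve autoregressive 3D visual grounding without external modules. A dual-path pooling mechanism builds a unified 3D representation, giving grounding performance comparable to state-of-the-art models and state-of-the-art performance on spatial reasoning.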


📘 Detailed Summary

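Motivation: Existing 3D LLMs lack a unified 3D representation that captures semantic and geometric information together. As a result they either perform poorly on visual grounding or rely heavily on external modules, which hinders the seamless integration of grounding and spatial reasoning.

Method: Proposes a simple yet effective dual-path pooling mechanism that tightly aligns geometric features with semantic and positional cues, building a unified image patch-based 3D representation that carries all essential information without increasing the number of input tokens.

Result: GS-Reasoner achieves strong results on 3D visual grounding, which in turn significantly improves its spatial reasoning and leads to state-of-the-art performance, all with autoregressive grounding and no external modules.

Conclusion: The study argues that 3D visual grounding is the cornerstone of spatial reasoning and integrates the two through a unified representation, establishing a self-contained framework for 3D spatial reasoning. It also introduces the GCoT dataset to further connect grounding and reasoning.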


📄 Abstract

In this paper, we claim that 3D visual grounding is the cornerstone of spatial reasoning and introduce the Grounded-Spatial Reasoner (GS-Reasoner) to explore the effective spatial representations that bridge the gap between them. Existing 3D LLMs suffer from the absence of a unified 3D representation capable of jointly capturing semantic and geometric information. This deficiency is manifested either in poor performance on grounding or in an excessive reliance on external modules, ultimately hindering the seamless integration of grounding and spatial reasoning. To address this, we propose a simple yet effective dual-path pooling mechanism that tightly aligns geometric features with both semantic and positional cues, constructing a unified image patch-based 3D representation that encapsulates all essential information without increasing the number of input tokens. Leveraging this holistic representation, GS-Reasoner is the first 3D LLM that achieves autoregressive grounding entirely without external modules while delivering performance comparable to state-of-the-art models, establishing a unified and self-contained framework for 3D spatial reasoning. To further bridge grounding and spatial reasoning, we introduce the Grounded Chain-of-Thought (GCoT) dataset. This dataset is meticulously curated to include both 3D bounding box annotations for objects referenced in reasoning questions and step-by-step reasoning paths that integrate grounding as a core component of the problem-solving process. Extensive experiments demonstrate that GS-Reasoner achieves impressive results on 3D visual grounding, which in turn significantly enhances its spatial reasoning capabilities, leading to state-of-the-art performance.

[45] VisCoP: Visual Probing for Video Domain Adaptation of Vision Language Models

Dominick Reilly, Manish Kumar Govind, Le Xue, Srijan Das

🧩 TL;DR

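This paper proposes Vision Contextualized Probing (VisCoP), which adds learnable visual probes to the vision encoder so that large vision-language models can adapt efficiently under distribution shift. It consistently outperforms existing methods across several challenging domain adaptation settings.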


📘 Detailed Summary

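Motivation: Large vision-language models excel at general visual reasoning but degrade sharply on novel domains with substantial distribution shift from the pretraining data. Existing domain adaptation methods finetune different VLM components, which often leads to limited domain-specific feature learning or catastrophic forgetting of prior capabilities.

Method: VisCoP augments the VLM's vision encoder with a compact set of learnable visual probes, enabling efficient domain-specific adaptation with minimal modification to the pretrained parameters.

Result: Experiments in three challenging domain adaptation settings (cross-view, cross-modal, and cross-task) show that VisCoP consistently outperforms existing adaptation strategies, achieving superior target-domain performance while effectively retaining source-domain knowledge.

Conclusion: VisCoP offers an efficient domain adaptation framework in which visual probes learn domain-specific knowledge without harming the model's original capabilities, providing a practical route to applying VLMs in complex real-world scenarios.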


📄 Abstract

Large Vision-Language Models (VLMs) excel at general visual reasoning tasks but exhibit sharp performance degradation when applied to novel domains with substantial distribution shifts from pretraining data. Existing domain adaptation approaches finetune different VLM components, but this often results in limited domain-specific feature learning or catastrophic forgetting of prior capabilities. To address these issues, we introduce Vision Contextualized Probing (VisCoP), which augments the VLM's vision encoder with a compact set of learnable visual probes. These probes enable efficient domain-specific adaptation with minimal modification to pretrained parameters. We evaluate VisCoP across three challenging domain adaptation settings: cross-view (exocentric to egocentric), cross-modal (RGB to depth), and cross-task (human understanding to robot control). Experiments show that VisCoP consistently outperforms existing adaptation strategies, achieving superior performance on target domains while effectively retaining source-domain knowledge.

cs.CL [Back]

[46] Benchmarking Open-Source Large Language Models for Persian in Zero-Shot and Few-Shot Learning

Mahdi Cherakhloo, Arash Abbasi, Mohammad Saeid Sarafraz, Bijan Vosoughi Vahdat

🧩 TL;DR

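This study comprehensively benchmarks several open-source large language models on Persian natural language processing tasks. Gemma 2 performs best in both zero-shot and few-shot settings, especially on complex reasoning tasks, but most models struggle with token-level understanding tasks such as named entity recognition.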


📘 Detailed Summary

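Motivation: Large language models show remarkable capabilities across many languages, but their effectiveness in low-resource languages such as Persian still needs thorough evaluation. This study fills that gap by systematically evaluating open-source LLMs on Persian NLP tasks.

Method: Evaluates several open-source LLMs under zero-shot and few-shot learning on a range of Persian NLP tasks, including sentiment analysis, named entity recognition, reading comprehension, and question answering. It uses established Persian datasets such as ParsiNLU and ArmanEmo and reports Accuracy, F1-score, BLEU, and ROUGE.

Result: Gemma 2 outperforms the other models on nearly all tasks in both learning paradigms, with particularly strong results on complex reasoning tasks. Most models perform poorly on token-level understanding tasks such as named entity recognition, which points to specific challenges in Persian language processing.

Conclusion: The study contributes to research on multilingual LLMs by characterizing their performance and limitations in Persian, providing a benchmark for future model development, and highlighting the particular difficulty of token-level understanding tasks in low-resource languages.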


📄 Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities across numerous languages; however, their effectiveness in low-resource languages like Persian requires thorough investigation. This paper presents a comprehensive benchmark of several open-source LLMs for Persian Natural Language Processing (NLP) tasks, utilizing both zero-shot and few-shot learning paradigms. We evaluate models across a range of tasks including sentiment analysis, named entity recognition, reading comprehension, and question answering, using established Persian datasets such as ParsiNLU and ArmanEmo. Our methodology encompasses rigorous experimental setups for both zero-shot and few-shot scenarios, employing metrics such as Accuracy, F1-score, BLEU, and ROUGE for performance evaluation. The results reveal that Gemma 2 consistently outperforms other models across nearly all tasks in both learning paradigms, with particularly strong performance in complex reasoning tasks. However, most models struggle with token-level understanding tasks like Named Entity Recognition, highlighting specific challenges in Persian language processing. This study contributes to the growing body of research on multilingual LLMs, providing valuable insights into their performance in Persian and offering a benchmark for future model development.

[47] VLURes: Benchmarking VLM Visual and Linguistic Understanding in Low-Resource Languages

Jesse Atuhurra, Iqra Ali, Tomoya Iwakura, Hidetaka Kamigaito, Tatsuya Hiraoka

🧩 TL;DR

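This study introduces VLURes, a new multilingual vision-language understanding benchmark that evaluates the fine-grained abilities of vision language models in four languages, including low-resource ones. It addresses the fact that existing evaluation is largely limited to English and short texts.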


📘 Detailed Summary

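Motivation: Evaluation of vision language models is currently limited mostly to English-centric benchmarks whose image-text pairs contain short texts, so it cannot fully assess fine-grained visual and linguistic understanding across languages, especially low-resource ones.

Method: The authors build the VLURes benchmark, comprising eight vision-and-language tasks and a pioneering unrelatedness task, covering English, Japanese, and the low-resource languages Swahili and Urdu. The datasets are curated from web resources in the target languages and span ten diverse image categories with rich textual context.

Result: Among the ten VLMs evaluated, the best-performing model, GPT-4o, reaches an overall accuracy of 90.8% but still lags human performance by 6.7%; the gap is larger for open-source models. The evaluation reveals performance disparities across languages and tasks.

Conclusion: VLURes plays a key role in developing intelligent agents that can tackle multimodal visual reasoning. It highlights the limitations of current models on low-resource languages and complex vision-language tasks and provides an evaluation framework and resources for future research.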


📄 Abstract

Vision Language Models (VLMs) are pivotal for advancing perception in intelligent agents. Yet, evaluation of VLMs remains limited to predominantly English-centric benchmarks in which the image-text pairs comprise short texts. To evaluate VLM fine-grained abilities, in four languages under long-text settings, we introduce a novel multilingual benchmark VLURes featuring eight vision-and-language tasks, and a pioneering unrelatedness task, to probe the fine-grained Visual and Linguistic Understanding capabilities of VLMs across English, Japanese, and low-resource languages, Swahili, and Urdu. Our datasets, curated from web resources in the target language, encompass ten diverse image categories and rich textual context, introducing valuable vision-language resources for Swahili and Urdu. By prompting VLMs to generate responses and rationales, evaluated automatically and by native speakers, we uncover performance disparities across languages and tasks critical to intelligent agents, such as object recognition, scene understanding, and relationship understanding. We conducted evaluations of ten VLMs with VLURes. The best performing model, GPT-4o, achieves an overall accuracy of 90.8% and lags human performance by 6.7%, though the gap is larger for open-source models. The gap highlights VLURes' critical role in developing intelligent agents to tackle multi-modal visual reasoning.

[48] SHIELD: Classifier-Guided Prompting for Robust and Safer LVLMs

Juan Ren, Mark Dras, Usman Naseem

🧩 TL;DR

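This paper proposes SHIELD, a lightweight, model-agnostic preprocessing framework that combines fine-grained safety classification with category-specific guidance to defend large vision-language models against adversarial inputs, consistently lowering jailbreak rates while preserving model utility.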


📘 Detailed Summary

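Motivation: Large vision-language models offer powerful multimodal reasoning but also expand the attack surface, particularly through adversarial inputs that hide harmful goals inside benign prompts. Existing binary moderation methods cannot provide fine-grained safety control.

Method: SHIELD couples fine-grained safety classification with category-specific guidance and explicit actions (Block, Reframe, Forward). It composes tailored safety prompts that enforce nuanced refusals or safe redirection, without retraining the model.

Result: Experiments on five benchmarks and five representative LVLMs show that SHIELD consistently lowers jailbreak and non-following rates while preserving utility. The method is plug-and-play, incurs negligible overhead, and is easily extended to new attack types.

Conclusion: SHIELD serves as a practical safety patch for both weakly and strongly aligned LVLMs, offering an efficient and extensible defense for multimodal models that achieves fine-grained safety control without retraining.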


📄 Abstract

Large Vision-Language Models (LVLMs) unlock powerful multimodal reasoning but also expand the attack surface, particularly through adversarial inputs that conceal harmful goals in benign prompts. We propose SHIELD, a lightweight, model-agnostic preprocessing framework that couples fine-grained safety classification with category-specific guidance and explicit actions (Block, Reframe, Forward). Unlike binary moderators, SHIELD composes tailored safety prompts that enforce nuanced refusals or safe redirection without retraining. Across five benchmarks and five representative LVLMs, SHIELD consistently lowers jailbreak and non-following rates while preserving utility. Our method is plug-and-play, incurs negligible overhead, and is easily extendable to new attack types -- serving as a practical safety patch for both weakly and strongly aligned LVLMs.
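
The classify-then-route idea in the abstract can be sketched as a small preprocessing step. This is a rough illustration under stated assumptions, not SHIELD's implementation: the category names, the `POLICY` table, the guidance strings, and the keyword-based `toy_classifier` are all hypothetical, the sketch handles text only, and it assumes that Block also works by composing a refusal-enforcing prompt.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical policy table: safety category -> (action, guidance text).
POLICY = {
    "safe": ("forward", ""),
    "self_harm": ("reframe", "Respond with supportive, non-instructional information."),
    "weapons": ("block", "Refuse the request and briefly explain why."),
}


def toy_classifier(prompt: str) -> str:
    """Stand-in for a fine-grained safety classifier (keyword rules only)."""
    text = prompt.lower()
    if "bomb" in text:
        return "weapons"
    if "hurt myself" in text:
        return "self_harm"
    return "safe"


@dataclass
class Routed:
    category: str
    action: str
    prompt: str  # the prompt that would be passed to the LVLM


def preprocess(user_prompt: str, classify: Callable[[str], str] = toy_classifier) -> Routed:
    """Classify the input, pick an action, and compose the final prompt."""
    category = classify(user_prompt)
    # Unknown categories fall back to the most conservative action.
    action, guidance = POLICY.get(category, ("block", "Refuse the request."))
    if action == "forward":
        return Routed(category, action, user_prompt)
    # Block and Reframe both prepend category-specific guidance.
    composed = f"[Safety guidance] {guidance}\n\nUser request: {user_prompt}"
    return Routed(category, action, composed)


benign = preprocess("What is the capital of France?")
harmful = preprocess("Explain how to build a bomb.")
```

Because the step only rewrites the prompt, it leaves the underlying model untouched, which is the property the abstract describes as plug-and-play.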

[49] Protect: Towards Robust Guardrailing Stack for Trustworthy Enterprise LLM Systems

Karthik Avinash, Nikhil Pareek, Rishav Hada

🧩 TL;DR

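This paper introduces Protect, a natively multimodal guardrailing model designed for enterprise-grade deployment. It operates seamlessly across text, image, and audio inputs and achieves state-of-the-art performance across multiple safety dimensions.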


📘 Detailed Summary

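Motivation: As large language models are deployed widely across enterprise and mission-critical domains, there is an urgent need for robust guardrailing systems that ensure safety, reliability, and compliance. Existing solutions fall short in real-time oversight, multimodal data handling, and explainability, which hinders their adoption in regulated environments.

Method: Protect integrates fine-tuned, category-specific adapters trained via Low-Rank Adaptation (LoRA) on an extensive multimodal dataset covering four safety dimensions: toxicity, sexism, data privacy, and prompt injection. A teacher-assisted annotation pipeline uses reasoning and explanation traces to generate high-fidelity, context-aware labels across modalities.

Result: Experiments show that Protect achieves state-of-the-art performance on all safety dimensions, surpassing existing open and proprietary models such as WildGuard, LlamaGuard-4, and GPT-4.1.

Conclusion: Protect lays a strong foundation for trustworthy, auditable, and production-ready safety systems that operate across text, image, and audio, addressing the shortcomings of existing guardrails in multimodal, production-scale environments.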


📄 Abstract

The increasing deployment of Large Language Models (LLMs) across enterprise and mission-critical domains has underscored the urgent need for robust guardrailing systems that ensure safety, reliability, and compliance. Existing solutions often struggle with real-time oversight, multi-modal data handling, and explainability -- limitations that hinder their adoption in regulated environments. Existing guardrails largely operate in isolation, focused on text alone, making them inadequate for multi-modal, production-scale environments. We introduce Protect, a natively multi-modal guardrailing model for enterprise-grade deployment, designed to operate seamlessly across text, image, and audio inputs. Protect integrates fine-tuned, category-specific adapters trained via Low-Rank Adaptation (LoRA) on an extensive, multi-modal dataset covering four safety dimensions: toxicity, sexism, data privacy, and prompt injection. Our teacher-assisted annotation pipeline leverages reasoning and explanation traces to generate high-fidelity, context-aware labels across modalities. Experimental results demonstrate state-of-the-art performance across all safety dimensions, surpassing existing open and proprietary models such as WildGuard, LlamaGuard-4, and GPT-4.1. Protect establishes a strong foundation for trustworthy, auditable, and production-ready safety systems capable of operating across text, image, and audio modalities.
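
To make "category-specific adapters trained via LoRA" concrete, here is a minimal NumPy sketch of one frozen linear layer with several swappable low-rank adapters. It shows only the standard LoRA update, y = xWᵀ + (α/r)·xAᵀBᵀ with B initialized to zero. The class name, the dimensions, and the idea of selecting an adapter by category name are assumptions for illustration; the abstract does not describe Protect's actual architecture.

```python
import numpy as np


class LoRALinear:
    """A frozen linear layer with named low-rank adapters (illustrative only)."""

    def __init__(self, weight: np.ndarray, rank: int, alpha: float, seed: int = 0):
        self.weight = weight              # frozen base weight, shape (out_dim, in_dim)
        self.rank = rank
        self.scale = alpha / rank         # standard LoRA scaling factor
        self.rng = np.random.default_rng(seed)
        self.adapters = {}                # category name -> (A, B)

    def add_adapter(self, name: str) -> None:
        out_dim, in_dim = self.weight.shape
        a = self.rng.normal(0.0, 0.01, size=(self.rank, in_dim))  # small random init
        b = np.zeros((out_dim, self.rank))                         # zero init: no change at start
        self.adapters[name] = (a, b)

    def forward(self, x: np.ndarray, adapter=None) -> np.ndarray:
        y = x @ self.weight.T
        if adapter is not None:
            a, b = self.adapters[adapter]
            # Low-rank update: only A and B would be trained, W stays frozen.
            y = y + self.scale * (x @ a.T) @ b.T
        return y


layer = LoRALinear(weight=np.eye(4), rank=2, alpha=4.0)
for category in ("toxicity", "sexism", "data_privacy", "prompt_injection"):
    layer.add_adapter(category)

x = np.ones((1, 4))
base_out = layer.forward(x)
untrained_out = layer.forward(x, adapter="toxicity")
```

Keeping one small adapter per safety dimension means the base model is shared and only the low-rank matrices differ between categories.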

[50] Document Intelligence in the Era of Large Language Models: A Survey

Weishi Wang, Hengchang Hu, Zhijie Zhang, Zhaochen Li, Hongxin Shao, Daniel Dahlmeier

🧩 TL;DR

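This paper presents a comprehensive survey of Document AI, focusing on how large language models have transformed the field. It covers key directions such as multimodal, multilingual, and retrieval-augmented approaches and discusses future research directions.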


📘 Detailed Summary

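Motivation: Document AI has undergone a major shift from encoder-decoder architectures to decoder-only large language models. This evolution needs to be reviewed systematically, with an analysis of current research progress and future prospects, to give academia and practitioners a structured picture.

Method: As a survey, the paper systematically analyzes the evolution of Document AI, focusing on key directions (multimodal, multilingual, and retrieval-augmented) and discussing emerging paradigms such as agent-based approaches and document-specific foundation models.

Result: The survey finds that LLMs have brought major advances to Document AI in both understanding and generation, and it identifies key challenges and opportunities such as multimodal integration and cross-lingual processing.

Conclusion: Document AI is undergoing a profound LLM-driven transformation. Future research directions include agent-based systems and document-specific foundation models, and the field matters for both academic research and practical applications.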


📄 Abstract

Document AI (DAI) has emerged as a vital application area and has been significantly transformed by the advent of large language models (LLMs). While earlier approaches relied on encoder-decoder architectures, decoder-only LLMs have revolutionized DAI, bringing remarkable advancements in understanding and generation. This survey provides a comprehensive overview of DAI's evolution, highlighting current research attempts and future prospects of LLMs in this field. We explore key advancements and challenges in multimodal, multilingual, and retrieval-augmented DAI, while also suggesting future research directions, including agent-based approaches and document-specific foundation models. This paper aims to provide a structured analysis of the state-of-the-art in DAI and its implications for both academic and practical applications.

[51] Attention Illuminates LLM Reasoning: The Preplan-and-Anchor Rhythm Enables Fine-Grained Policy Optimization

Yang Li, Zhichen Dong, Yuhan Sun, Weixun Wang, Shaopan Xiong, Yijia Luo, Jiashun Liu, Han Lu, Jiamang Wang, Wenbo Su, Bo Zheng, Junchi Yan

🧩 TL;DR

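This study analyzes attention to reveal the reasoning patterns of LLMs and proposes reinforcement learning strategies based on the preplan-and-anchor mechanism, yielding performance gains across a range of reasoning tasks and offering a structure-aware approach to LLM optimization.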


📘 Detailed Summary

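Motivation: The reasoning patterns of large language models remain opaque, and reinforcement learning typically applies uniform credit across the entire generation, blurring the distinction between pivotal and routine steps. Optimization methods that account for the model's internal reasoning logic are needed.

Method: The paper treats attention as a privileged substrate for making the internal logic of LLMs legible, distinguishes locally focused from globally focused attention heads, and defines two metrics: Windowed Average Attention Distance and Future Attention Influence. From these signals it identifies a preplan-and-anchor mechanism and designs three RL strategies that assign credit to critical nodes.

Result: The study reveals a recurring preplan-and-anchor mechanism in LLMs: the model first makes a long-range contextual reference to generate an introductory token, which is immediately followed by, or coincides with, a semantic anchor token that organizes subsequent reasoning. RL strategies built on this show consistent gains across various reasoning tasks.

Conclusion: Aligning optimization with the model's intrinsic reasoning rhythm turns opaque optimization into an actionable, structure-aware process and offers a potential step toward more transparent and effective optimization of LLM reasoning, including finer-grained credit assignment.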


📄 Abstract

The reasoning pattern of Large language models (LLMs) remains opaque, and Reinforcement learning (RL) typically applies uniform credit across an entire generation, blurring the distinction between pivotal and routine steps. This work positions attention as a privileged substrate that renders the internal logic of LLMs legible, not merely as a byproduct of computation, but as a mechanistic blueprint of reasoning itself. We first distinguish attention heads between locally and globally focused information processing and reveal that locally focused heads produce a sawtooth pattern near the diagonal indicating phrasal chunks, while globally focused heads expose tokens that exert broad downstream influence over future tokens. We formalize these with two metrics: 1) Windowed Average Attention Distance, which measures the extent of backward attention within a clipped window; 2) Future Attention Influence, which quantifies a token's global importance as the average attention it receives from subsequent tokens. Taken together, these signals reveal a recurring preplan-and-anchor mechanism, where the model first performs a long-range contextual reference to generate an introductory token, which is immediately followed by or coincides with a semantic anchor token that organizes subsequent reasoning. Leveraging these insights, we introduce three novel RL strategies that dynamically perform targeted credit assignment to critical nodes (preplan tokens, anchor tokens, and their temporal coupling) and show consistent performance gains across various reasoning tasks. By aligning optimization with the model's intrinsic reasoning rhythm, we aim to transform opaque optimization into an actionable structure-aware process, hoping to offer a potential step toward more transparent and effective optimization of LLM reasoning.
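
The two metrics are described only in words in the abstract, so the sketch below is one plausible reading rather than the paper's definitions. It assumes a single causal attention matrix whose rows are queries and sum to 1, takes Windowed Average Attention Distance as the attention-weighted backward distance with distances clipped at the window size, and takes Future Attention Influence as the mean attention a token receives from all later positions. The function names and the clipping rule are assumptions.

```python
import numpy as np


def uniform_causal_attention(length: int) -> np.ndarray:
    """Toy attention map: each token attends equally to itself and all earlier tokens."""
    mask = np.tril(np.ones((length, length)))
    return mask / mask.sum(axis=1, keepdims=True)


def windowed_average_attention_distance(attn: np.ndarray, window: int) -> np.ndarray:
    """Attention-weighted look-back distance per query token, clipped at `window`."""
    positions = np.arange(attn.shape[0])
    distance = positions[:, None] - positions[None, :]   # query index minus key index
    distance = np.clip(distance, 0, window)               # clip long-range distances
    return (attn * distance).sum(axis=1)


def future_attention_influence(attn: np.ndarray) -> np.ndarray:
    """Average attention each token receives from strictly later tokens."""
    length = attn.shape[0]
    influence = np.zeros(length)
    for key in range(length - 1):
        influence[key] = attn[key + 1:, key].mean()
    return influence  # the last token has no successors, so its value stays 0


attn = uniform_causal_attention(3)
waad = windowed_average_attention_distance(attn, window=8)
waad_clipped = windowed_average_attention_distance(attn, window=1)
fai = future_attention_influence(attn)
```

Under this reading, a spike in the first metric marks a token that looks far back (a candidate preplan token), and a high value of the second marks a token that later positions keep attending to (a candidate anchor token).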

[52] Closing the Gap Between Text and Speech Understanding in LLMs

Santiago Cuervo, Skyler Seto, Maureen de Seyssel, Richard He Bai, Zijin Gu, Tatiana Likhomanenko, Navdeep Jaitly, Zakaria Aldeneh

🧩 TL;DR

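This paper proposes SALAD, which combines cross-modal distillation with targeted synthetic data to narrow the performance gap between speech-adapted LLMs and their text-only counterparts while using only a small amount of public speech data.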


📘 Detailed Summary

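Motivation: Speech-adapted LLMs consistently underperform their text-based counterparts, and even cascaded pipelines, on language understanding tasks, a shortfall the authors call the text-speech understanding gap. Existing approaches to narrowing it rely either on costly large-scale speech synthesis or on proprietary datasets that are not reproducible, so more data-efficient methods are needed.

Method: SALAD combines cross-modal distillation with targeted synthetic data, chosen through active selection, to improve speech-text alignment while mitigating the forgetting of text capabilities during adaptation.

Result: Applied to 3B and 7B LLMs, SALAD achieves performance competitive with a strong open-weight model on broad-domain benchmarks in knowledge, language understanding, and reasoning, while training on over an order of magnitude less speech data from public corpora.

Conclusion: The analysis attributes the text-speech understanding gap mainly to forgetting of text capabilities during adaptation and to cross-modal misalignment. SALAD shows that a data-efficient cross-modal alignment strategy can narrow this gap.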


📄 Abstract

Large Language Models (LLMs) can be adapted to extend their text capabilities to speech inputs. However, these speech-adapted LLMs consistently underperform their text-based counterparts--and even cascaded pipelines--on language understanding tasks. We term this shortfall the text-speech understanding gap: the performance drop observed when a speech-adapted LLM processes spoken inputs relative to when the original text-based LLM processes the equivalent text. Recent approaches to narrowing this gap either rely on large-scale speech synthesis of text corpora, which is costly and heavily dependent on synthetic data, or on large-scale proprietary speech datasets, which are not reproducible. As a result, there remains a need for more data-efficient alternatives for closing the text-speech understanding gap. In this work, we analyze the gap as driven by two factors: (i) forgetting of text capabilities during adaptation, and (ii) cross-modal misalignment between speech and text. Based on this analysis, we introduce SALAD--Sample-efficient Alignment with Learning through Active selection and cross-modal Distillation--which combines cross-modal distillation with targeted synthetic data to improve alignment while mitigating forgetting. Applied to 3B and 7B LLMs, SALAD achieves competitive performance with a strong open-weight model across broad-domain benchmarks in knowledge, language understanding, and reasoning, while training on over an order of magnitude less speech data from public corpora.
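
Cross-modal distillation can be illustrated with a generic objective: the text-based model acts as teacher on the transcript, and the speech-adapted model is trained to match its next-token distribution on the spoken input. The sketch below uses a temperature-scaled KL divergence, which is a common choice for distillation but is an assumption here, since the abstract does not state SALAD's exact loss. The toy logits are made up.

```python
import numpy as np


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def distillation_loss(
    teacher_logits: np.ndarray,
    student_logits: np.ndarray,
    temperature: float = 1.0,
) -> float:
    """Mean KL(teacher || student) over positions.

    teacher_logits: text LLM outputs on the transcript, shape (positions, vocab).
    student_logits: speech-adapted LLM outputs on the audio, same shape.
    """
    teacher_logp = log_softmax(teacher_logits / temperature)
    student_logp = log_softmax(student_logits / temperature)
    kl = (np.exp(teacher_logp) * (teacher_logp - student_logp)).sum(axis=-1)
    return float(kl.mean())


teacher = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
aligned_student = teacher.copy()
misaligned_student = np.array([[-1.0, 0.5, 2.0], [3.0, 0.2, 0.1]])

loss_aligned = distillation_loss(teacher, aligned_student)
loss_misaligned = distillation_loss(teacher, misaligned_student)
```

A loss of this form targets both factors the abstract names: it pulls the speech-conditioned outputs toward the text-conditioned ones (alignment) and anchors the adapted model to the original text model's behavior (less forgetting).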

[53] NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching

Run Luo, Xiaobo Xia, Lu Wang, Longze Chen, Renke Shan, Jing Luo, Min Yang, Tat-Seng Chua

🧩 TL;DR

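This paper presents NExT-OMNI, an open-source omnimodal foundation model built on discrete flow paradigms. It achieves any-to-any understanding and generation through unified modeling and outperforms prior unified models on multi-turn multimodal interaction and cross-modal retrieval.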


📘 Detailed Summary

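Motivation: Most existing multimodal models are constrained by autoregressive architectures, whose inherent limitations prevent a balanced integration of understanding and generation. Hybrid and decoupling strategies handle these tasks separately, but their redundant, non-integrated designs limit applicability to broader scenarios such as cross-modal retrieval.

Method: NExT-OMNI achieves unified modeling through discrete flow paradigms. Using metric-induced probability paths and kinetic optimal velocities, it natively supports any-to-any understanding and generation, and it relies on concise unified representations rather than task-decoupled designs to enable broader application scenarios.

Result: Trained on large-scale interleaved text, image, video, and audio data, NExT-OMNI is competitive on multimodal generation and understanding benchmarks and outperforms prior unified models on multi-turn multimodal interaction and cross-modal retrieval.

Conclusion: The discrete flow paradigm offers an effective unified modeling approach for next-generation multimodal foundation models, with architectural advantages in balancing understanding and generation. The authors release training details and data protocols and open-source the code and model checkpoints.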


📄 Abstract

Next-generation multimodal foundation models capable of any-to-any cross-modal generation and multi-turn interaction will serve as core components of artificial general intelligence systems, playing a pivotal role in human-machine interaction. However, most existing multimodal models remain constrained by autoregressive architectures, whose inherent limitations prevent a balanced integration of understanding and generation capabilities. Although hybrid and decoupling strategies have been explored to address these tasks within unified frameworks separately, their redundant, non-integrated designs limit their applicability to broader scenarios, such as cross-modal retrieval. In this work, we introduce NExT-OMNI, an open-source omnimodal foundation model that achieves unified modeling through discrete flow paradigms. By leveraging metric-induced probability paths and kinetic optimal velocities, NExT-OMNI natively supports any-to-any understanding and generation with enhanced response efficiency, while enabling broader application scenarios through concise unified representations rather than task-decoupled designs. Trained on large-scale interleaved text, image, video, and audio data, NExT-OMNI delivers competitive performance on multimodal generation and understanding benchmarks, while outperforming prior unified models in multi-turn multimodal interaction and cross-modal retrieval, highlighting its architectural advantages as a next-generation multimodal foundation model. To advance further research, we release training details, data protocols, and open-source both the code and model checkpoints.

[54] The Mechanistic Emergence of Symbol Grounding in Language Models

Shuyu Wu, Ziqiao Ma, Xiaoxi Luo, Yidong Huang, Josue Torres-Fonseca, Freda Shi, Joyce Chai

🧩 TL;DR

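Using a mechanistic and causal analysis framework, this study shows how symbol grounding emerges in language models: grounding concentrates in middle-layer computations and is implemented through an aggregate mechanism in attention heads, a phenomenon that replicates across Transformers and state-space models.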


📘 Detailed Summary

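Motivation: Preliminary evidence suggests that symbol grounding may emerge in language models trained at scale without explicit grounding objectives, but where it emerges and what drives it remain largely unexplored. This study systematically investigates how grounding arises within the model's internal computations.

Method: The authors introduce a controlled evaluation framework that traces, through mechanistic and causal analysis, how symbol grounding arises within internal computations. The study covers multimodal dialogue and compares architectures: Transformers, state-space models, and unidirectional LSTMs.

Result: Grounding concentrates in middle-layer computations and is implemented through an aggregate mechanism, in which attention heads aggregate the environmental ground to support the prediction of linguistic forms. The phenomenon replicates in Transformers and state-space models but not in unidirectional LSTMs.

Conclusion: The study provides behavioral and mechanistic evidence that symbol grounding can emerge in language models, with practical implications for predicting and potentially controlling the reliability of generation.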


📄 Abstract

Symbol grounding (Harnad, 1990) describes how symbols such as words acquire their meanings by connecting to real-world sensorimotor experiences. Recent work has shown preliminary evidence that grounding may emerge in (vision-)language models trained at scale without using explicit grounding objectives. Yet, the specific loci of this emergence and the mechanisms that drive it remain largely unexplored. To address this problem, we introduce a controlled evaluation framework that systematically traces how symbol grounding arises within the internal computations through mechanistic and causal analysis. Our findings show that grounding concentrates in middle-layer computations and is implemented through the aggregate mechanism, where attention heads aggregate the environmental ground to support the prediction of linguistic forms. This phenomenon replicates in multimodal dialogue and across architectures (Transformers and state-space models), but not in unidirectional LSTMs. Our results provide behavioral and mechanistic evidence that symbol grounding can emerge in language models, with practical implications for predicting and potentially controlling the reliability of generation.

cs.AI [Back]

[55] From Literal to Liberal: A Meta-Prompting Framework for Eliciting Human-Aligned Exception Handling in Large Language Models

Imran Khan

🧩 TL;DR

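This paper proposes the Rule-Intent Distinction (RID) framework, a low-compute meta-prompting technique that elicits human-aligned exception handling from LLMs in a zero-shot manner. On scenarios requiring nuanced judgment, it achieves a 95% Human Alignment Score.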


📘 Detailed Summary

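Motivation: LLMs used as reasoning engines for agentic AI systems have a critical flaw: rigid adherence to explicit rules leads to decisions misaligned with human common sense and intent. This "rule-rigidity" is a significant barrier to trustworthy autonomous agents, and existing supervised fine-tuning approaches are computationally expensive and inaccessible to many practitioners.

Method: The paper introduces the Rule-Intent Distinction (RID) framework, a low-compute meta-prompting technique. It gives the model a structured cognitive schema for deconstructing tasks, classifying rules, weighing conflicting outcomes, and justifying its final decision. It works zero-shot, with no additional model training.

Result: On a custom benchmark of 20 scenarios requiring nuanced judgment across diverse domains, RID outperforms both baseline and Chain-of-Thought prompting. Human-verified results show a 95% Human Alignment Score (HAS) for RID, versus 80% for the baseline and 75% for CoT. It also consistently produces higher-quality, intent-driven reasoning.

Conclusion: The work offers a practical, accessible, and effective method for steering LLMs from literal instruction-following to liberal, goal-oriented reasoning, paving the way for more reliable and pragmatic AI agents.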


📄 Abstract

Large Language Models (LLMs) are increasingly being deployed as the reasoning engines for agentic AI systems, yet they exhibit a critical flaw: a rigid adherence to explicit rules that leads to decisions misaligned with human common sense and intent. This "rule-rigidity" is a significant barrier to building trustworthy autonomous agents. While prior work has shown that supervised fine-tuning (SFT) with human explanations can mitigate this issue, SFT is computationally expensive and inaccessible to many practitioners. To address this gap, we introduce the Rule-Intent Distinction (RID) Framework, a novel, low-compute meta-prompting technique designed to elicit human-aligned exception handling in LLMs in a zero-shot manner. The RID framework provides the model with a structured cognitive schema for deconstructing tasks, classifying rules, weighing conflicting outcomes, and justifying its final decision. We evaluated the RID framework against baseline and Chain-of-Thought (CoT) prompting on a custom benchmark of 20 scenarios requiring nuanced judgment across diverse domains. Our human-verified results demonstrate that the RID framework significantly improves performance, achieving a 95% Human Alignment Score (HAS), compared to 80% for the baseline and 75% for CoT. Furthermore, it consistently produces higher-quality, intent-driven reasoning. This work presents a practical, accessible, and effective method for steering LLMs from literal instruction-following to liberal, goal-oriented reasoning, paving the way for more reliable and pragmatic AI agents.
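
The four-part schema in the abstract (deconstruct the task, classify the rules, weigh conflicting outcomes, justify the decision) can be pictured as a reusable prompt wrapper. The sketch below is a guess at the general shape only: the step wording, the `build_rid_prompt` helper, and the example scenario are hypothetical and are not taken from the paper's actual meta-prompt.

```python
# Hypothetical wording for the four schema steps named in the abstract.
RID_STEPS = [
    ("Deconstruct the task", "State the user's underlying goal and the explicit rules given."),
    ("Classify the rules", "Mark each rule as a hard constraint or a soft guideline."),
    ("Weigh conflicting outcomes", "Compare the result of following each rule literally against the user's intent."),
    ("Justify the decision", "Give the final decision and the reasoning behind it."),
]


def build_rid_prompt(scenario: str) -> str:
    """Wrap a scenario in a zero-shot meta-prompt that lists the schema steps."""
    lines = ["Before answering, reason through the following steps in order:"]
    for number, (title, detail) in enumerate(RID_STEPS, start=1):
        lines.append(f"{number}. {title}: {detail}")
    lines.append("")
    lines.append(f"Scenario: {scenario}")
    return "\n".join(lines)


prompt = build_rid_prompt(
    "Budget is capped at $10 for a gift, but the only suitable item costs $10.05."
)
```

Since the wrapper only changes the prompt, it needs no gradient updates, which matches the abstract's description of the method as low-compute and zero-shot.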

[56] Toward Reasoning-Centric Time-Series Analysis

Xinlei Wang, Mingtian Tan, Jing Qiu, Junhua Zhao, Jinjin Gu

🧩 TL;DR

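This paper argues for rethinking time-series analysis as a reasoning task that draws on the deeper reasoning abilities of LLMs rather than their numerical regression ability, prioritizing causal structure and explainability and bringing time-series analysis closer to human-aligned understanding.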


📘 Detailed Summary

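Motivation: Traditional time-series analysis is limited in real-world settings, where policies shift, human behavior adapts, and unexpected events unfold. Most existing LLM-based methods still use the models' numerical regression ability and ignore their deeper reasoning potential, leaving the actual forces behind trends unexplained.

Method: The paper repositions time-series analysis as a reasoning task that prioritizes the causal reasoning and explanatory abilities of LLMs over numerical regression. By integrating multimodal inputs and emphasizing causal structure and transparency, the analysis can adapt to contextual changes in complex real-world environments.

Result: The paper argues that this approach lets time-series analysis deliver more transparent, context-aware insights and human-aligned understanding in complex real-world environments. Focusing on causal structure rather than surface-level trends is presented as improving the explainability and practical value of the analysis.

Conclusion: Shifting time-series analysis toward an LLM-based reasoning paradigm is argued to cope better with real-world dynamics and could give policy-making and decision-making a more reliable basis. The shift places explainability and causal understanding at the center of time-series analysis and sets a direction for future research.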


📄 Abstract

Traditional time series analysis has long relied on pattern recognition, trained on static and well-established benchmarks. However, in real-world settings -- where policies shift, human behavior adapts, and unexpected events unfold -- effective analysis must go beyond surface-level trends to uncover the actual forces driving them. The recent rise of Large Language Models (LLMs) presents new opportunities for rethinking time series analysis by integrating multimodal inputs. However, as the use of LLMs becomes popular, we must remain cautious, asking why we use LLMs and how to exploit them effectively. Most existing LLM-based methods still employ their numerical regression ability and ignore their deeper reasoning potential. This paper argues for rethinking time series with LLMs as a reasoning task that prioritizes causal structure and explainability. This shift brings time series analysis closer to human-aligned understanding, enabling transparent and context-aware insights in complex real-world environments.