cs.CV
[1] FusionFM: All-in-One Multi-Modal Image Fusion with Flow Matching
Huayi Zhu, Xiu Shu, Youqiang Xiong, Qiao Liu, Rui Chen, Di Yuan, Xiaojun Chang, Zhenyu He
🧩 TL;DR
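This paper proposes FusionFM, a flow-matching-based image fusion method that models multi-modal image fusion as a direct probabilistic transport from the source modalities to the fused image distribution, significantly improving sampling efficiency while preserving structural consistency. With task-aware pseudo-label selection and a Fusion Refiner module, it achieves competitive fusion performance despite the lack of high-quality supervision data.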
📘 Detailed Summary
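Motivation: Current multi-modal image fusion methods typically rely on task-specific models, which leads to high training costs and limited scalability. Generative methods offer a unified modeling perspective, but they often suffer from slow inference because of the complex sampling trajectories from noise to image. This work aims to address these limitations and improve the efficiency and generalization of image fusion.
Method: The flow matching paradigm is used to model image fusion as a direct probabilistic transport from the source modalities to the fused image distribution, which improves sampling efficiency. Fusion results from multiple state-of-the-art models are collected as priors, and a task-aware selection function picks the most reliable pseudo-labels for each task. A Fusion Refiner module applies a divide-and-conquer strategy to systematically identify, decompose, and enhance degraded components in the pseudo-labels. For multi-task scenarios, elastic weight consolidation and experience replay are integrated to preserve cross-task performance from the perspectives of parameter stability and memory retention.
Result: The method achieves competitive performance across diverse fusion tasks while significantly improving sampling efficiency and keeping the model lightweight. The experiments indicate that, compared with existing methods, inference is substantially faster while fusion quality remains high, supporting the effectiveness and practicality of the approach.
Conclusion: The study demonstrates the effectiveness of the flow matching paradigm for multi-modal image fusion and offers a more efficient alternative for generative fusion methods. Pseudo-label selection and refinement address the lack of high-quality supervision data, and the multi-task learning strategy strengthens the model's continual learning ability. The work suggests a new technical path for designing efficient, scalable image fusion systems.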
📄 Abstract
Current multi-modal image fusion methods typically rely on task-specific models, leading to high training costs and limited scalability. While generative methods provide a unified modeling perspective, they often suffer from slow inference due to the complex sampling trajectories from noise to image. To address this, we formulate image fusion as a direct probabilistic transport from source modalities to the fused image distribution, leveraging the flow matching paradigm to improve sampling efficiency and structural consistency. To mitigate the lack of high-quality fused images for supervision, we collect fusion results from multiple state-of-the-art models as priors, and employ a task-aware selection function to select the most reliable pseudo-labels for each task. We further introduce a Fusion Refiner module that employs a divide-and-conquer strategy to systematically identify, decompose, and enhance degraded components in selected pseudo-labels. For multi-task scenarios, we integrate elastic weight consolidation and experience replay mechanisms to preserve cross-task performance and enhance continual learning ability from both parameter stability and memory retention perspectives. Our approach achieves competitive performance across diverse fusion tasks, while significantly improving sampling efficiency and maintaining a lightweight model design. The code will be available at: https://github.com/Ist-Zhy/FusionFM.
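The transport idea in this abstract can be illustrated with a toy sketch. It assumes the common straight-line flow matching formulation (interpolate between a start sample and a target sample, regress the constant velocity between them, then integrate with Euler steps); the abstract does not state FusionFM's exact path or sampler, so the function names, the single source vector, and the oracle velocity field below are all hypothetical stand-ins for a trained network and real images.

```python
from typing import Callable, List

Vector = List[float]


def interpolate(x0: Vector, x1: Vector, t: float) -> Vector:
    """Point on the straight path from x0 (source) to x1 (fused target) at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0: Vector, x1: Vector) -> Vector:
    """Regression target for a velocity network on the straight path: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(x0: Vector, velocity_fn: Callable[[Vector, float], Vector],
                 num_steps: int = 4) -> Vector:
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        v = velocity_fn(x, step * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy "pixels": a source-modality sample and a pseudo-label fused sample (made-up values).
source = [0.2, 0.5, 0.9]
fused = [0.4, 0.6, 0.7]

# Hypothetical oracle standing in for a trained velocity network.
oracle = lambda x, t: target_velocity(source, fused)

midpoint = interpolate(source, fused, 0.5)
result = euler_sample(source, oracle, num_steps=4)
```

Because the path starts at the source sample rather than at noise, a few Euler steps are enough in this toy case, which is the intuition behind the sampling-efficiency claim.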
[2] Passive Dementia Screening via Facial Temporal Micro-Dynamics Analysis of In-the-Wild Talking-Head Video
Filippo Cenacchi, Longbing Cao, Mitchell McEwan, Deborah Richards
🧩 TL;DR
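This study proposes a passive dementia screening method based on facial temporal micro-dynamics analysis. By analyzing facial kinematic features such as blink dynamics, small mouth movements, gaze variability, and subtle head adjustments, it enables language-independent dementia detection without speech or text.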
📘 Detailed Summary
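Motivation: Existing dementia screening resources rely mainly on speech or scripted interviews, which limits their use outside clinical settings and tightly couples predictions to language and transcription. This study addresses that limitation by exploring language-independent, passive dementia screening from facial kinematic features alone.
Method: Facial signals are stabilized and micro-movements are converted into interpretable facial micro-dynamic time series, which are smoothed, with short windows summarized into compact clip-level statistics. Each window is encoded by its activity mix (the relative share of motion across streams), so the predictor analyzes how motion is distributed across streams rather than its magnitude, making per-channel effects transparent.
Result: On the YT DemTalk dataset (300 clips: 150 with self-reported dementia and 150 controls), ablations identify gaze lability and mouth/jaw dynamics as the most informative cues, and lightweight shallow classifiers reach a dementia prediction performance of 0.953 AUROC, 0.961 average precision, 0.851 F1-score, and 0.857 accuracy.
Conclusion: The study confirms that facial micro-dynamic features are sufficient for effective dementia screening without speech or text, offering a feasible approach to large-scale, cross-device, cross-cultural, unscripted passive monitoring of neurocognitive change and advancing non-interventional health assessment in natural settings.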
📄 Abstract
We target passive dementia screening from short camera-facing talking-head video, developing a facial temporal micro-dynamics analysis for language-free detection of early neurocognitive change. This enables unscripted, in-the-wild video analysis at scale to capture natural facial behaviors, transferable across devices, topics, and cultures without active intervention by clinicians or researchers during recording. Most existing resources prioritize speech or scripted interviews, limiting use outside clinics and coupling predictions to language and transcription. In contrast, we identify and analyze whether temporal facial kinematics, including blink dynamics, small mouth-jaw motions, gaze variability, and subtle head adjustments, are sufficient for dementia screening without speech or text. By stabilizing facial signals, we convert these micro-movements into interpretable facial micro-dynamic time series, smooth them, and summarize short windows into compact clip-level statistics for screening. Each window is encoded by its activity mix (the relative share of motion across streams), so the predictor analyzes the distribution of motion across streams rather than its magnitude, making per-channel effects transparent. We also introduce YT DemTalk, a new dataset curated from publicly available, in-the-wild camera-facing videos. It contains 300 clips (150 with self-reported dementia, 150 controls) to test our model and offer a first benchmarking of the corpus. On YT DemTalk, ablations identify gaze lability and mouth/jaw dynamics as the most informative cues, and lightweight shallow classifiers attain a dementia prediction performance of 0.953 AUROC, 0.961 Average Precision (AP), 0.851 F1-score, and 0.857 accuracy.
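The "activity mix" encoding can be made concrete with a small sketch. It assumes each window already has one non-negative motion-energy value per stream; the stream names and numbers are illustrative, and the paper's actual feature extraction is not described in enough detail here to reproduce.

```python
from typing import Dict


def activity_mix(stream_energy: Dict[str, float]) -> Dict[str, float]:
    """Convert per-stream motion energy in one window into relative shares."""
    total = sum(stream_energy.values())
    if total <= 0.0:
        # No motion in this window: return an all-zero mix instead of dividing by zero.
        return {name: 0.0 for name in stream_energy}
    return {name: value / total for name, value in stream_energy.items()}


# Hypothetical motion energies for one short window.
window = {"blink": 0.2, "mouth_jaw": 0.5, "gaze": 0.2, "head": 0.1}
mix = activity_mix(window)

# Scaling every stream by the same factor leaves the mix unchanged,
# which is what "distribution rather than magnitude" means in practice.
louder = activity_mix({name: 10.0 * value for name, value in window.items()})
```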
[3] Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark
Xinxin Liu, Zhaopan Xu, Kai Wang, Yong Jae Lee, Yuzhang Shang
🧩 TL;DR
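This paper presents Gen-ViRe, a benchmark framework that provides the first quantitative assessment of video generation models as reasoners. It reveals a substantial gap between visual quality and reasoning depth and offers diagnostic tools for developing genuine world simulators.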
📘 Detailed Summary
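Motivation: Video generation models have shown potential as world simulators through Chain-of-Frames reasoning, but existing benchmarks focus on fidelity or alignment and cannot assess core cognitive abilities such as multi-step planning, algorithmic logic, and abstract pattern extrapolation. This prevents a systematic understanding of model capabilities and principled guidance for improvement.
Method: Gen-ViRe is grounded in cognitive science and real-world AI applications. It decomposes Chain-of-Frames reasoning into six cognitive dimensions and 24 subtasks, and performs quantitative evaluation through multi-source data curation, minimal prompting protocols, and hybrid VLM-assisted evaluation with detailed criteria.
Result: Experiments on state-of-the-art systems reveal substantial discrepancies between visual quality and actual reasoning depth, establishing baselines and diagnostic tools that lay a foundation for advancing genuine world simulators.
Conclusion: The work establishes the first framework for quantitatively evaluating the reasoning ability of video models, exposes the reasoning limitations of current models, and provides a systematic evaluation method and direction for building world simulators with deeper cognitive abilities.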
📄 Abstract
While Chain-of-Thought (CoT) prompting enables sophisticated symbolic reasoning in LLMs, it remains confined to discrete text and cannot simulate the continuous, physics-governed dynamics of the real world. Recent video generation models have emerged as potential world simulators through Chain-of-Frames (CoF) reasoning -- materializing thought as frame-by-frame visual sequences, with each frame representing a physically-grounded reasoning step. Despite compelling demonstrations, a challenge persists: existing benchmarks, focusing on fidelity or alignment, do not assess CoF reasoning and thus cannot measure core cognitive abilities in multi-step planning, algorithmic logic, or abstract pattern extrapolation. This evaluation void prevents systematic understanding of model capabilities and principled guidance for improvement. We introduce Gen-ViRe (Generative Visual Reasoning Benchmark), a framework grounded in cognitive science and real-world AI applications, which decomposes CoF reasoning into six cognitive dimensions -- from perceptual logic to abstract planning -- and 24 subtasks. Through multi-source data curation, minimal prompting protocols, and hybrid VLM-assisted evaluation with detailed criteria, Gen-ViRe delivers the first quantitative assessment of video models as reasoners. Our experiments on SOTA systems reveal substantial discrepancies between impressive visual quality and actual reasoning depth, establishing baselines and diagnostic tools to advance genuine world simulators.
[4] Segmenting Collision Sound Sources in Egocentric Videos
Kranti Kumar Parida, Omar Emara, Hazel Doughty, Dima Damen
🧩 TL;DR
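This paper introduces the new task of Collision Sound Source Segmentation (CS3), which aims to segment the objects responsible for a collision sound in the visual input, conditioned on the audio. It develops a weakly-supervised method that uses foundation models and egocentric cues and achieves substantial gains on two new benchmarks.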
📘 Detailed Summary
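Motivation: The work addresses the multisensory perception challenge of recognizing object properties from collision sounds, focusing on the particular difficulties of collision sound source segmentation in egocentric video: cluttered scenes, small objects, and brief interactions. Existing methods struggle with the complex acoustic signature produced when two objects interact.
Method: A weakly-supervised audio-conditioned segmentation method is proposed that uses foundation models such as CLIP and SAM2 and incorporates egocentric cues (e.g., objects in hands) to find acting objects that may be collision sound sources, processing visual and audio information through multimodal fusion.
Result: On the two newly introduced CS3 benchmarks, EPIC-CS3 and Ego4D-CS3, the method outperforms competitive baselines by 3× and 4.7× in mIoU respectively, demonstrating its effectiveness in complex egocentric scenes.
Conclusion: The study shows that multimodal perception with foundation models and egocentric cues is feasible and opens a new direction for understanding object interactions from collision sounds. It is relevant to embodied AI and robot perception and could be extended to broader multisensory reasoning tasks.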
📄 Abstract
Humans excel at multisensory perception and can often recognise object properties from the sound of their interactions. Inspired by this, we propose the novel task of Collision Sound Source Segmentation (CS3), where we aim to segment the objects responsible for a collision sound in visual input (i.e. video frames from the collision clip), conditioned on the audio. This task presents unique challenges. Unlike isolated sound events, a collision sound arises from interactions between two objects, and the acoustic signature of the collision depends on both. We focus on egocentric video, where sounds are often clear, but the visual scene is cluttered, objects are small, and interactions are brief. To address these challenges, we propose a weakly-supervised method for audio-conditioned segmentation, utilising foundation models (CLIP and SAM2). We also incorporate egocentric cues, i.e. objects in hands, to find acting objects that can potentially be collision sound sources. Our approach outperforms competitive baselines by $3\times$ and $4.7\times$ in mIoU on two benchmarks we introduce for the CS3 task: EPIC-CS3 and Ego4D-CS3.
[5] H-CNN-ViT: A Hierarchical Gated Attention Multi-Branch Model for Bladder Cancer Recurrence Prediction
Xueyang Li, Zongren Wang, Yuliang Zhang, Zixuan Pan, Yu-Jen Chen, Nishchal Sapkota, Gelei Xu, Danny Z. Chen, Yiyu Shi
🧩 TL;DR
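This study proposes H-CNN-ViT, a hierarchical gated attention multi-branch architecture for bladder cancer recurrence prediction, and builds the first dedicated multi-sequence MRI dataset for the task. The model reaches an AUC of 78.6% on bladder cancer recurrence detection, surpassing existing state-of-the-art methods.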
📘 Detailed Summary
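Motivation: Bladder cancer is one of the most prevalent malignancies worldwide, with a recurrence rate of up to 78%, so post-operative monitoring is essential. Multi-sequence contrast-enhanced MRI is commonly used for recurrence detection, but post-surgical tissue changes make the scans very hard to interpret, and the field lacks dedicated multi-sequence MRI datasets for recurrence assessment, which hinders the development of AI-assisted diagnostic tools.
Method: H-CNN-ViT is a hierarchical gated attention multi-branch model that selectively weights features from a global ViT path and a local CNN path to achieve balanced, targeted feature fusion. The multi-branch architecture processes each modality independently so that the distinct properties of each imaging channel are captured and integrated.
Result: Evaluated on the dedicated dataset, H-CNN-ViT achieves an AUC of 78.6%, surpassing existing state-of-the-art models and setting a new performance reference for bladder cancer recurrence prediction.
Conclusion: The work provides the first multi-sequence MRI dataset dedicated to bladder cancer recurrence prediction and demonstrates the effectiveness of hierarchical gated attention in medical image analysis, opening a new avenue for AI-assisted bladder cancer monitoring. The model is publicly available to the research community.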
📄 Abstract
Bladder cancer is one of the most prevalent malignancies worldwide, with a recurrence rate of up to 78%, necessitating accurate post-operative monitoring for effective patient management. Multi-sequence contrast-enhanced MRI is commonly used for recurrence detection; however, interpreting these scans remains challenging, even for experienced radiologists, due to post-surgical alterations such as scarring, swelling, and tissue remodeling. AI-assisted diagnostic tools have shown promise in improving bladder cancer recurrence prediction, yet progress in this field is hindered by the lack of dedicated multi-sequence MRI datasets for recurrence assessment study. In this work, we first introduce a curated multi-sequence, multi-modal MRI dataset specifically designed for bladder cancer recurrence prediction, establishing a valuable benchmark for future research. We then propose H-CNN-ViT, a new Hierarchical Gated Attention Multi-Branch model that enables selective weighting of features from the global (ViT) and local (CNN) paths based on contextual demands, achieving a balanced and targeted feature fusion. Our multi-branch architecture processes each modality independently, ensuring that the unique properties of each imaging channel are optimally captured and integrated. Evaluated on our dataset, H-CNN-ViT achieves an AUC of 78.6%, surpassing state-of-the-art models. Our model is publicly available at https://github.com/XLIAaron/H-CNN-ViT.
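The "selective weighting of features from the global and local paths" can be pictured with a minimal gating sketch. This is only an assumption about the general shape of such a gate (a sigmoid over the concatenated features that blends the two paths); the hierarchical structure and real layer definitions of H-CNN-ViT are not given in the abstract, and `gated_fusion`, `gate_w`, and `gate_b` are hypothetical names.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def gated_fusion(global_feat: List[float], local_feat: List[float],
                 gate_w: List[float], gate_b: float) -> List[float]:
    """Blend a global (ViT-style) and a local (CNN-style) feature vector with one scalar gate."""
    concat = global_feat + local_feat
    # The gate is computed from both paths, so the weighting depends on the input.
    gate = sigmoid(sum(w * f for w, f in zip(gate_w, concat)) + gate_b)
    return [gate * g + (1.0 - gate) * l for g, l in zip(global_feat, local_feat)]


vit_feat = [1.0, 0.0]  # toy global feature
cnn_feat = [0.0, 1.0]  # toy local feature

# Zero weights and zero bias give gate = 0.5, an even blend of the two paths.
balanced = gated_fusion(vit_feat, cnn_feat, gate_w=[0.0] * 4, gate_b=0.0)

# A large positive bias pushes the gate toward the global path.
global_heavy = gated_fusion(vit_feat, cnn_feat, gate_w=[0.0] * 4, gate_b=4.0)
```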
[6] QwenCLIP: Boosting Medical Vision-Language Pretraining via LLM Embeddings and Prompt tuning
Xiaoyang Wei, Camille Kurtz, Florence Cloppet
🧩 TL;DR
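This paper proposes QwenCLIP, a vision-language framework that replaces CLIP's text encoder with a large-language-model-based embedding module and introduces learnable prompts to strengthen cross-modal alignment, substantially improving the representation of long radiology reports and medical image-text alignment.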
📘 Detailed Summary
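Motivation: CLIP has shown strong generalization in computer vision and medical domains, but its text encoder accepts at most 77 tokens, which limits its ability to represent long, information-rich radiology reports. Domain-specific encoders such as PubMedBERT or ClinicalBERT mitigate this, but remain constrained by a 512-token input limit and relatively shallow semantic understanding.
Method: QwenCLIP replaces CLIP's text encoder with an LLM-based embedding module (e.g., Qwen3-Embedding) and introduces learnable prompts to enhance cross-modal alignment. By using the extended context window and richer representations of LLMs, the framework captures comprehensive medical semantics from long-form clinical text.
Result: QwenCLIP substantially improves medical image-text alignment and downstream performance on radiology benchmarks. The extended context of the LLM lets it handle long radiology reports and capture richer semantic information.
Conclusion: The study shows that the extended context window and rich representations of large language models can address the long-text challenge in medical vision-language tasks, providing a new technical path for multimodal learning in medicine and illustrating the potential of LLMs in specialized domains.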
📄 Abstract
Contrastive Language-Image Pretraining (CLIP) has demonstrated strong generalization for vision-language tasks in computer vision and medical domains, yet its text encoder accepts only up to 77 tokens, which limits its ability to represent long and information-rich radiology reports. Recent adaptations using domain-specific encoders, such as PubMedBERT or ClinicalBERT, mitigate this issue by leveraging medical corpora, but remain constrained by their limited input length (typically 512 tokens) and relatively shallow semantic understanding. To address these limitations, we propose QwenCLIP, a vision-language framework that replaces CLIP's text encoder with a large language model (LLM)-based embedding module (e.g., Qwen3-Embedding) and introduces learnable prompts to enhance cross-modal alignment. By leveraging the extended context window and richer representations of LLMs, QwenCLIP captures comprehensive medical semantics from long-form clinical text, substantially improving medical image-text alignment and downstream performance on radiology benchmarks. Our code is publicly available at https://github.com/Wxy-24/QwenCLIP.
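For readers unfamiliar with the CLIP-style objective that QwenCLIP builds on, here is a minimal sketch of the symmetric contrastive loss over a batch of paired image and text embeddings. It assumes the standard CLIP formulation with a fixed temperature of 0.07; whether QwenCLIP uses exactly this loss, and how its learnable prompts enter the text side, is not stated in the abstract. The embeddings are toy 2-D vectors.

```python
import math
from typing import List

Matrix = List[List[float]]


def normalize(v: List[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_rows(logits: Matrix) -> float:
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def clip_loss(image_embs: Matrix, text_embs: Matrix, temperature: float = 0.07) -> float:
    """Symmetric image-to-text and text-to-image contrastive loss."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_rows(logits) + cross_entropy_rows(transposed))


images = [[1.0, 0.0], [0.0, 1.0]]
matched_reports = [[0.9, 0.1], [0.2, 0.8]]   # report i describes image i
swapped_reports = [matched_reports[1], matched_reports[0]]

aligned_loss = clip_loss(images, matched_reports)
misaligned_loss = clip_loss(images, swapped_reports)
```

The loss is low when each image embedding is closest to its own report embedding, which is why a text encoder that can read the whole report (rather than its first 77 tokens) matters for alignment.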
[7] VLMs Guided Interpretable Decision Making for Autonomous Driving
Xin Hu, Taotao Jing, Renran Tian, Zhengming Ding
🧩 TL;DR
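This paper proposes a new approach that shifts vision-language models from direct decision generators to semantic enhancers, fusing visual and linguistic features in a multi-modal interactive architecture and achieving state-of-the-art performance on autonomous driving benchmarks.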
📘 Detailed Summary
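Motivation: Vision-language models used in existing autonomous driving research depend on handcrafted prompts and perform inconsistently, so their robustness and generalization in real-world scenarios are limited and they cannot deliver reliable, context-aware decisions.
Method: The role of the VLM is shifted from direct decision generation to semantic enhancement. Its strong scene understanding is used to enrich existing vision-based benchmarks, a multi-modal interactive architecture fuses visual and linguistic features, and a post-hoc refinement module improves prediction reliability.
Result: Extensive experiments on two autonomous driving benchmarks show that the approach achieves state-of-the-art performance, offering an effective solution for reliable and interpretable autonomous driving systems.
Conclusion: The study offers a promising direction for integrating vision-language models into reliable and interpretable autonomous driving systems and shows the potential of semantic enhancement and multi-modal fusion for improving decision accuracy and interpretability.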
📄 Abstract
Recent advancements in autonomous driving (AD) have explored the use of vision-language models (VLMs) within visual question answering (VQA) frameworks for direct driving decision-making. However, these approaches often depend on handcrafted prompts and suffer from inconsistent performance, limiting their robustness and generalization in real-world scenarios. In this work, we evaluate state-of-the-art open-source VLMs on high-level decision-making tasks using ego-view visual inputs and identify critical limitations in their ability to deliver reliable, context-aware decisions. Motivated by these observations, we propose a new approach that shifts the role of VLMs from direct decision generators to semantic enhancers. Specifically, we leverage their strong general scene understanding to enrich existing vision-based benchmarks with structured, linguistically rich scene descriptions. Building on this enriched representation, we introduce a multi-modal interactive architecture that fuses visual and linguistic features for more accurate decision-making and interpretable textual explanations. Furthermore, we design a post-hoc refinement module that utilizes VLMs to enhance prediction reliability. Extensive experiments on two autonomous driving benchmarks demonstrate that our approach achieves state-of-the-art performance, offering a promising direction for integrating VLMs into reliable and interpretable AD systems.
[8] O3SLM: Open Weight, Open Data, and Open Vocabulary Sketch-Language Model
Rishi Gupta, Mukilan Karuppasamy, Shyam Marjit, Aditay Tripathi, Anirban Chakraborty
🧩 TL;DR
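This paper presents the O3SLM model and an accompanying large-scale dataset that target the limitations of large vision-language models in understanding hand-drawn sketches. It achieves state-of-the-art performance on multiple sketch tasks and substantially improves reasoning over abstract visual inputs.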
📘 Detailed Summary
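Motivation: Current large vision-language models are markedly limited in interpreting abstract visual inputs, especially hand-drawn sketches, an intuitive way to express concepts that are hard to describe in text. The main bottleneck is the absence of a large-scale dataset that jointly models sketches, photorealistic images, and natural language instructions.
Method: The work makes two key contributions: a large-scale dataset of image-sketch-instruction triplets that supports both pretraining and instruction tuning, and O3SLM, a model trained on this dataset and optimized for sketch understanding.
Result: Comprehensive evaluation on multiple sketch tasks shows that O3SLM reaches state-of-the-art performance in object localization, counting, image retrieval (including SBIR and fine-grained SBIR), and visual question answering, substantially outperforming existing large vision-language models. The evaluation covers the existing QuickDraw!, Sketchy, and Tu Berlin datasets as well as the newly generated SketchVCL dataset.
Conclusion: The study shows that a dedicated dataset and model design can effectively improve how large vision-language models understand abstract visual content, opening a path for handling non-traditional visual modalities such as hand-drawn sketches, with clear practical value.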
📄 Abstract
While Large Vision Language Models (LVLMs) are increasingly deployed in real-world applications, their ability to interpret abstract visual inputs remains limited. Specifically, they struggle to comprehend hand-drawn sketches, a modality that offers an intuitive means of expressing concepts that are difficult to describe textually. We identify the primary bottleneck as the absence of a large-scale dataset that jointly models sketches, photorealistic images, and corresponding natural language instructions. To address this, we present two key contributions: (1) a new, large-scale dataset of image-sketch-instruction triplets designed to facilitate both pretraining and instruction tuning, and (2) O3SLM, an LVLM trained on this dataset. Comprehensive evaluations on multiple sketch-based tasks, namely (a) object localization, (b) counting, (c) image retrieval (SBIR and fine-grained SBIR), and (d) visual question answering (VQA), incorporating the three existing sketch datasets QuickDraw!, Sketchy, and Tu Berlin along with our generated SketchVCL dataset, show that O3SLM achieves state-of-the-art performance, substantially outperforming existing LVLMs in sketch comprehension and reasoning.
[9] Uni-Hema: Unified Model for Digital Hematopathology
Abdul Rehman, Iqra Rasool, Ayesha Imran, Mohsen Ali, Waqas Sultani
🧩 TL;DR
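This paper proposes Uni-Hema, a unified multi-task model for digital hematopathology that performs detection, classification, segmentation, morphology prediction, and reasoning across multiple diseases. It addresses a key limitation of existing methods, which cannot provide unified multi-task, multi-modal reasoning across the complexities of digital hematopathology.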
📘 Detailed Summary
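Motivation: Existing digital hematopathology methods, whether single-task, vision-language, WSI-optimized, or single-cell hematology models, share a key limitation: they cannot provide unified multi-task, multi-modal reasoning across the complexities of digital hematopathology. This restricts comprehensive analysis across malignant disorders, infectious conditions, and non-malignant red blood cell disorders.
Method: Uni-Hema is built on Hema-Former, a multimodal module that bridges visual and textual representations at the hierarchy level and supports tasks of different granularity, including detection, classification, segmentation, morphology, masked language modeling, and visual question answering. It integrates 46 publicly available datasets with over 700K images and 21K question-answer pairs.
Result: Extensive experiments show that Uni-Hema achieves performance comparable or superior to single-task, single-dataset models across diverse hematological tasks, while providing interpretable, morphologically relevant insights at the single-cell level, supporting the effectiveness of the unified framework.
Conclusion: The work sets a new standard for multi-task, multi-modal digital hematopathology, shows that a unified model can maintain high performance while providing morphological insight, and opens a path toward comprehensive analysis across disease categories. The code will be released to support further research.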
📄 Abstract
Digital hematopathology requires cell-level analysis across diverse disease categories, including malignant disorders (e.g., leukemia), infectious conditions (e.g., malaria), and non-malignant red blood cell disorders (e.g., sickle cell disease). Whether single-task, vision-language, WSI-optimized, or single-cell hematology models, existing approaches share a key limitation: they cannot provide unified, multi-task, multi-modal reasoning across the complexities of digital hematopathology. To overcome these limitations, we propose Uni-Hema, a multi-task, unified model for digital hematopathology integrating detection, classification, segmentation, morphology prediction, and reasoning across multiple diseases. Uni-Hema leverages 46 publicly available datasets, encompassing over 700K images and 21K question-answer pairs, and is built upon Hema-Former, a multimodal module that bridges visual and textual representations at the hierarchy level for the different tasks (detection, classification, segmentation, morphology, masked language modeling, and visual question answering) at different granularities. Extensive experiments demonstrate that Uni-Hema achieves comparable or superior performance to models trained on a single task and a single dataset across diverse hematological tasks, while providing interpretable, morphologically relevant insights at the single-cell level. Our framework establishes a new standard for multi-task and multi-modal digital hematopathology. The code will be made publicly available.
[10] Zero-Training Task-Specific Model Synthesis for Few-Shot Medical Image Classification
Yao Qin, Yangyang Yan, YuanChao Yang, Jinhua Pang, Huanyong Bi, Yuan Liu, HaiHua Wang
🧩 TL;DR
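This paper proposes a new paradigm, Zero-Training Task-Specific Model Synthesis (ZS-TMS), in which a pre-trained generative engine directly synthesizes the full parameter set of a task-specific classifier without any task-specific training or fine-tuning, achieving state-of-the-art performance in ultra-low data regimes.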
📘 Detailed Summary
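Motivation: Deep learning models for medical image analysis depend heavily on large-scale, meticulously annotated datasets, but in medicine patient data is hard to acquire and expert annotation is expensive, especially for rare diseases where samples are scarce. This dependence on "big data" is a critical bottleneck.
Method: The proposed Semantic-Guided Parameter Synthesizer (SGPS) framework uses a large-scale pre-trained generative engine that takes minimal multi-modal task information (as little as a single example image and a corresponding clinical text description) and directly synthesizes the full parameter set of a task-specific classifier. The generated weights define a lightweight, efficient classifier that can be deployed for inference immediately, with no task-specific training.
Result: On few-shot classification benchmarks derived from the ISIC 2018 skin lesion dataset and a custom rare disease dataset, SGPS sets a new state of the art and significantly outperforms advanced few-shot and zero-shot learning methods, most notably in the ultra-low data regimes of 1-shot and 5-shot classification.
Conclusion: The work paves the way for rapid development and deployment of AI-powered diagnostic tools, particularly for the long tail of rare diseases where data is critically limited, and introduces a paradigm in which model parameters are synthesized directly without training.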
📄 Abstract
Deep learning models have achieved remarkable success in medical image analysis but are fundamentally constrained by the requirement for large-scale, meticulously annotated datasets. This dependency on "big data" is a critical bottleneck in the medical domain, where patient data is inherently difficult to acquire and expert annotation is expensive, particularly for rare diseases where samples are scarce by definition. To overcome this fundamental challenge, we propose a novel paradigm: Zero-Training Task-Specific Model Synthesis (ZS-TMS). Instead of adapting a pre-existing model or training a new one, our approach leverages a large-scale, pre-trained generative engine to directly synthesize the entire set of parameters for a task-specific classifier. Our framework, the Semantic-Guided Parameter Synthesizer (SGPS), takes as input minimal, multi-modal task information as little as a single example image (1-shot) and a corresponding clinical text description to directly synthesize the entire set of parameters for a task-specific classifier. The generative engine interprets these inputs to generate the weights for a lightweight, efficient classifier (e.g., an EfficientNet-V2), which can be deployed for inference immediately without any task-specific training or fine-tuning. We conduct extensive evaluations on challenging few-shot classification benchmarks derived from the ISIC 2018 skin lesion dataset and a custom rare disease dataset. Our results demonstrate that SGPS establishes a new state-of-the-art, significantly outperforming advanced few-shot and zero-shot learning methods, especially in the ultra-low data regimes of 1-shot and 5-shot classification. This work paves the way for the rapid development and deployment of AI-powered diagnostic tools, particularly for the long tail of rare diseases where data is critically limited.
[11] Weakly Supervised Ephemeral Gully Detection In Remote Sensing Images Using Vision Language Models
Seyed Mohamad Ali Tousi, John A. Lory, G. N. DeSouza
🧩 TL;DR
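This study presents the first weakly supervised pipeline for ephemeral gully detection, using vision-language models to reduce the manual labeling burden, and releases the first remote sensing image dataset for semi-supervised detection of ephemeral gullies. With a teacher-student model and a noise-aware loss function, the method achieves better detection performance than the VLMs and the label model.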
📘 Detailed Summary
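Motivation: Ephemeral gullies are among the most concerning soil erosion phenomena in agricultural fields, and their short temporal cycles make automatic detection with classical computer vision and remote sensing more difficult. Because accurately labeled data is scarce and hard to produce, machine-learning-based detection has been limited to zero-shot approaches that are hard to implement.
Method: The method relies on remote sensing and uses vision-language models to drastically reduce manual labeling. It exploits the knowledge embedded in the VLM's pretraining and a teacher-student model in which the teacher learns from noisy labels produced by the VLMs and the student learns under weak supervision from teacher-generated labels with a noise-aware loss function.
Result: Experiments show that training the student model with weak supervision outperforms both the VLMs and the label model itself. The study also releases the first dataset for semi-supervised ephemeral gully detection, containing more than 18,000 high-resolution remote sensing images collected over 13 years.
Conclusion: The study demonstrates the effectiveness of weak supervision for ephemeral gully detection and offers a workable solution for similar remote sensing applications with scarce labels. Combining VLM knowledge with a noise-aware training strategy overcomes the limitations of traditional methods in data labeling and detection accuracy.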
📄 Abstract
Among soil erosion problems, Ephemeral Gullies are one of the most concerning phenomena occurring in agricultural fields. Their short temporal cycles increase the difficulty in automatically detecting them using classical computer vision approaches and remote sensing. Also, due to scarcity of and the difficulty in producing accurate labeled data, automatic detection of ephemeral gullies using Machine Learning is limited to zero-shot approaches which are hard to implement. To overcome these challenges, we present the first weakly supervised pipeline for detection of ephemeral gullies. Our method relies on remote sensing and uses Vision Language Models (VLMs) to drastically reduce the labor-intensive task of manual labeling. In order to achieve that, the method exploits: 1) the knowledge embedded in the VLM's pretraining; 2) a teacher-student model where the teacher learns from noisy labels coming from the VLMs, and the student learns by weak supervision using teacher-generated labels and a noise-aware loss function. We also make available the first-of-its-kind dataset for semi-supervised detection of ephemeral gullies from remote-sensed images. The dataset consists of a number of locations labeled by a group of soil and plant scientists, as well as a large number of unlabeled locations. The dataset represents more than 18,000 high-resolution remote-sensing images obtained over the course of 13 years. Our experimental results demonstrate the validity of our approach by showing superior performance compared to VLMs and the label model itself when using weak supervision to train a student model. The code and dataset for this work are made publicly available.
[12] FAPE-IR: Frequency-Aware Planning and Execution Framework for All-in-One Image Restoration
Jingren Liu, Shuning Xu, Qirui Yang, Yun Wang, Xiangyu Chen, Zhong Ji
🧩 TL;DR
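This paper proposes FAPE-IR, a frequency-aware planning and execution framework for all-in-one image restoration. A frozen multimodal large language model generates frequency-aware restoration plans, and a LoRA-MoE module inside a diffusion model dynamically selects high- or low-frequency experts, achieving state-of-the-art performance on seven restoration tasks.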
📘 Detailed Summary
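Motivation: Existing all-in-one image restoration methods usually rely on task-specific designs or latent routing strategies and adapt poorly to real-world scenes with multiple degradations, so a solution that handles complex degradation conditions in a unified way is needed.
Method: FAPE-IR uses a frozen multimodal large language model as a planner that analyzes degraded images and generates concise, frequency-aware restoration plans. These plans guide a LoRA-MoE module in a diffusion-based executor to dynamically select high- or low-frequency experts, complemented by adversarial training and a frequency regularization loss to improve restoration quality.
Result: Extensive experiments show that FAPE-IR achieves state-of-the-art performance on seven image restoration tasks and exhibits strong zero-shot generalization under mixed degradations.
Conclusion: By coupling semantic planning with frequency-based restoration, FAPE-IR offers a unified and interpretable solution for all-in-one image restoration and shows the effectiveness of frequency-aware methods in complex degradation scenarios.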
📄 Abstract
All-in-One Image Restoration (AIO-IR) aims to develop a unified model that can handle multiple degradations under complex conditions. However, existing methods often rely on task-specific designs or latent routing strategies, making it hard to adapt to real-world scenarios with various degradations. We propose FAPE-IR, a Frequency-Aware Planning and Execution framework for image restoration. It uses a frozen Multimodal Large Language Model (MLLM) as a planner to analyze degraded images and generate concise, frequency-aware restoration plans. These plans guide a LoRA-based Mixture-of-Experts (LoRA-MoE) module within a diffusion-based executor, which dynamically selects high- or low-frequency experts, complemented by frequency features of the input image. To further improve restoration quality and reduce artifacts, we introduce adversarial training and a frequency regularization loss. By coupling semantic planning with frequency-based restoration, FAPE-IR offers a unified and interpretable solution for all-in-one image restoration. Extensive experiments show that FAPE-IR achieves state-of-the-art performance across seven restoration tasks and exhibits strong zero-shot generalization under mixed degradations.
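The plan-guided expert selection can be sketched in a few lines. This toy version assumes a hard choice between two LoRA experts keyed by a planner label, with each expert contributing a low-rank update `B @ A` on top of a frozen base weight; the abstract describes dynamic selection that also uses frequency features of the input, so the hard routing, the matrices, and the names below are simplifying assumptions rather than the FAPE-IR implementation.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_expert_forward(x: List[float], base_w: Matrix,
                        experts: Dict[str, Tuple[Matrix, Matrix]],
                        choice: str, scale: float = 1.0) -> List[float]:
    """Frozen base layer plus the low-rank update of the selected expert: W x + scale * B (A x)."""
    a, b = experts[choice]           # A: rank x d_in, B: d_out x rank
    base = matvec(base_w, x)
    delta = matvec(b, matvec(a, x))  # low-rank path, never materializes B @ A
    return [y + scale * d for y, d in zip(base, delta)]


base_w = [[1.0, 0.0], [0.0, 1.0]]  # frozen identity layer for illustration
experts = {
    # Rank-1 toy experts: "low" responds to the sum (smooth content),
    # "high" responds to the difference (detail) of the two inputs.
    "low": ([[1.0, 1.0]], [[0.5], [0.5]]),
    "high": ([[1.0, -1.0]], [[0.5], [-0.5]]),
}

x = [3.0, 1.0]
low_out = lora_expert_forward(x, base_w, experts, choice="low")
high_out = lora_expert_forward(x, base_w, experts, choice="high")
```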
[13] Mind the Gap: Evaluating LLM Understanding of Human-Taught Road Safety Principles
Chalamalasetti Kranti
🧩 TL;DR
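This study evaluates how well multi-modal large language models understand road safety concepts and finds that they struggle substantially with safety reasoning, revealing a gap between human learning and model interpretation.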
📘 Detailed Summary
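Motivation: The study addresses the problem of assessing how well multi-modal large language models understand road safety norms, in particular the road safety standards that AI systems in autonomous driving must follow. A systematic evaluation of these models' safety reasoning is currently lacking.
Method: Using a zero-shot evaluation setting, the study builds a dataset of images of traffic signs and road safety norms collected from school textbooks and tests the models' understanding of road safety concepts through illustrative and schematic representations.
Result: Preliminary results show that multi-modal large language models perform poorly at safety reasoning and have significant difficulty understanding road safety concepts. The study further analyzes how these performance gaps manifest and why they arise.
Conclusion: The study exposes the limitations of multi-modal large language models in road safety understanding, highlights important differences between human learning and AI model interpretation, and provides an analytical basis and research directions for improving models' safety reasoning.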
📄 Abstract
Following road safety norms is non-negotiable not only for humans but also for the AI systems that govern autonomous vehicles. In this work, we evaluate how well multi-modal large language models (LLMs) understand road safety concepts, specifically through schematic and illustrative representations. We curate a pilot dataset of images depicting traffic signs and road-safety norms sourced from school textbooks and use it to evaluate models' capabilities in a zero-shot setting. Our preliminary results show that these models struggle with safety reasoning and reveal gaps between human learning and model interpretation. We further provide an analysis of these performance gaps for future research.
[14] Start Small, Think Big: Curriculum-based Relative Policy Optimization for Visual Grounding
Qingyang Yan, Guangyao Chen, Yixiong Zou
🧩 TL;DR
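This paper proposes Curriculum-based Relative Policy Optimization (CuRPO), a training strategy that uses Chain-of-Thought length and generalized Intersection over Union rewards as complexity indicators. By training progressively from simple to complex data, it addresses the performance degradation of RL-fine-tuned Chain-of-Thought reasoning in visual grounding tasks.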
📘 Detailed Summary
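Motivation: The study finds that RL-fine-tuned Chain-of-Thought reasoning degrades in visual grounding tasks as CoT outputs become longer or more complex, and that a larger dataset does not always improve performance because data complexity varies. This motivates a training strategy that adapts to data of differing complexity.
Method: Curriculum-based Relative Policy Optimization (CuRPO) uses CoT length and generalized Intersection over Union (gIoU) rewards as complexity indicators to order the training data from simple to complex, and applies relative policy optimization within a curriculum learning framework to gradually improve the model's ability on complex visual grounding tasks.
Result: Extensive experiments on RefCOCO, RefCOCO+, RefCOCOg, and LISA show that CuRPO consistently outperforms existing methods, including Visual-RFT, with improvements of up to +12.52 mAP on RefCOCO, and that it is efficient and robust in few-shot learning scenarios.
Conclusion: CuRPO demonstrates the effectiveness of curriculum-based training for visual grounding, especially for tasks with ambiguous and intricate textual descriptions. Progressive complexity ordering yields more stable and efficient training and offers a new idea for RL-based optimization of complex reasoning tasks.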
📄 Abstract
Chain-of-Thought (CoT) prompting has recently shown significant promise across various NLP and computer vision tasks by explicitly generating intermediate reasoning steps. However, we find that reinforcement learning (RL)-based fine-tuned CoT reasoning can paradoxically degrade performance in Visual Grounding tasks, particularly as CoT outputs become lengthy or complex. Additionally, our analysis reveals that increased dataset size does not always enhance performance due to varying data complexities. Motivated by these findings, we propose Curriculum-based Relative Policy Optimization (CuRPO), a novel training strategy that leverages CoT length and generalized Intersection over Union (gIoU) rewards as complexity indicators to progressively structure training data from simpler to more challenging examples. Extensive experiments on RefCOCO, RefCOCO+, RefCOCOg, and LISA datasets demonstrate the effectiveness of our approach. CuRPO consistently outperforms existing methods, including Visual-RFT, with notable improvements of up to +12.52 mAP on RefCOCO. Moreover, CuRPO exhibits exceptional efficiency and robustness, delivering strong localization performance even in few-shot learning scenarios, particularly benefiting tasks characterized by ambiguous and intricate textual descriptions. The code is released on https://github.com/qyoung-yan/CuRPO.
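A short sketch may help with the two complexity indicators named here. The `giou` function follows the standard definition of generalized IoU for axis-aligned boxes with positive area. The `complexity` score is purely an illustrative assumption (longer CoT and lower gIoU reward count as harder, with equal weighting); the abstract does not say how CuRPO combines the two signals, and the example records are made up.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def area(box: Box) -> float:
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def giou(pred: Box, target: Box) -> float:
    """Generalized IoU: IoU minus the fraction of the enclosing box not covered by the union."""
    inter = area((max(pred[0], target[0]), max(pred[1], target[1]),
                  min(pred[2], target[2]), min(pred[3], target[3])))
    union = area(pred) + area(target) - inter
    hull = area((min(pred[0], target[0]), min(pred[1], target[1]),
                 max(pred[2], target[2]), max(pred[3], target[3])))
    return inter / union - (hull - union) / hull


def curriculum_order(examples: List[Dict], max_cot_len: int) -> List[Dict]:
    """Sort examples from easy to hard using a hypothetical combined complexity score."""
    def complexity(ex: Dict) -> float:
        # Assumption: normalized CoT length plus gIoU shortfall mapped from [-1, 1] to [0, 1].
        return ex["cot_len"] / max_cot_len + (1.0 - ex["giou"]) / 2.0
    return sorted(examples, key=complexity)


same = giou((0.0, 0.0, 1.0, 1.0), (0.0, 0.0, 1.0, 1.0))
disjoint = giou((0.0, 0.0, 1.0, 1.0), (2.0, 0.0, 3.0, 1.0))

examples = [
    {"id": "b", "cot_len": 80, "giou": 0.2},
    {"id": "a", "cot_len": 20, "giou": 0.9},
    {"id": "c", "cot_len": 40, "giou": 0.6},
]
ordered_ids = [ex["id"] for ex in curriculum_order(examples, max_cot_len=80)]
```

Unlike plain IoU, gIoU stays informative for non-overlapping boxes (it goes negative as they move apart), which makes it usable as a graded reward.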
[15] Multi-view Phase-aware Pedestrian-Vehicle Incident Reasoning Framework with Vision-Language Models
Hao Zhen, Yunxiang Yang, Jidong J. Yang
🧩 TL;DR
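This paper presents MP-PVIR, a unified multi-view pedestrian-vehicle incident reasoning framework that turns multi-view video streams into structured diagnostic reports, systematically decomposing incidents into cognitive phases and generating causal-chain analyses to improve AI-driven traffic safety analytics.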
📘 Detailed Summary
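Motivation: Existing video-based systems can detect when an incident occurs but give little insight into how it unfolds across the distinct cognitive phases of pedestrian behavior. Current vision-language models typically process videos in isolation, without explicit temporal structuring or multi-view integration, which limits the depth and usefulness of incident analysis.
Method: MP-PVIR has four core stages: event-triggered multi-view video acquisition, pedestrian behavior phase segmentation, phase-specific multi-view reasoning, and hierarchical synthesis and diagnostic reasoning. Two specialized vision-language models underpin the framework, TG-VLM for behavioral phase segmentation and PhaVR-VLM for phase-aware multi-view analysis, and a large language model then generates the comprehensive report.
Result: On the Woven Traffic Safety dataset, TG-VLM reaches an mIoU of 0.4881 on behavioral phase segmentation, and PhaVR-VLM obtains a captioning score of 33.063 and up to 64.70% accuracy on question answering, showing that the framework can turn multi-view video data into actionable insights.
Conclusion: By operationalizing behavioral theory, MP-PVIR automatically segments incidents into cognitive phases, performs synchronized multi-view analysis within each phase, and synthesizes causal chains with targeted prevention strategies. It advances AI-driven traffic safety analytics for vehicle-infrastructure cooperative systems and offers a systematic approach to urban safety challenges.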
📄 Abstract
Pedestrian-vehicle incidents remain a critical urban safety challenge, with pedestrians accounting for over 20% of global traffic fatalities. Although existing video-based systems can detect when incidents occur, they provide little insight into how these events unfold across the distinct cognitive phases of pedestrian behavior. Recent vision-language models (VLMs) have shown strong potential for video understanding, but they remain limited in that they typically process videos in isolation, without explicit temporal structuring or multi-view integration. This paper introduces Multi-view Phase-aware Pedestrian-Vehicle Incident Reasoning (MP-PVIR), a unified framework that systematically processes multi-view video streams into structured diagnostic reports through four stages: (1) event-triggered multi-view video acquisition, (2) pedestrian behavior phase segmentation, (3) phase-specific multi-view reasoning, and (4) hierarchical synthesis and diagnostic reasoning. The framework operationalizes behavioral theory by automatically segmenting incidents into cognitive phases, performing synchronized multi-view analysis within each phase, and synthesizing results into causal chains with targeted prevention strategies. Particularly, two specialized VLMs underpin the MP-PVIR pipeline: TG-VLM for behavioral phase segmentation (mIoU = 0.4881) and PhaVR-VLM for phase-aware multi-view analysis, achieving a captioning score of 33.063 and up to 64.70% accuracy on question answering. Finally, a designated large language model is used to generate comprehensive reports detailing scene understanding, behavior interpretation, causal reasoning, and prevention recommendations. Evaluation on the Woven Traffic Safety dataset shows that MP-PVIR effectively translates multi-view video data into actionable insights, advancing AI-driven traffic safety analytics for vehicle-infrastructure cooperative systems.
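To give the reported segmentation figure some context, here is a minimal sketch of a mean temporal IoU. It assumes the mIoU is the average overlap ratio between predicted and annotated time intervals, one pair per behavior phase; the abstract does not define the metric precisely, so this is an illustrative reading, and the intervals are made-up values in seconds.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_time, end_time)


def temporal_iou(pred: Segment, truth: Segment) -> float:
    """Overlap duration divided by the combined duration of two time intervals."""
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0.0 else 0.0


def mean_iou(preds: List[Segment], truths: List[Segment]) -> float:
    """Average temporal IoU over matched phase pairs."""
    scores = [temporal_iou(p, t) for p, t in zip(preds, truths)]
    return sum(scores) / len(scores)


predicted = [(0.0, 10.0), (10.0, 20.0)]  # hypothetical phase boundaries from a model
annotated = [(0.0, 10.0), (15.0, 25.0)]  # hypothetical ground-truth boundaries

score = mean_iou(predicted, annotated)
```

Here the first phase matches exactly and the second overlaps for 5 of 15 seconds, so the mean is (1 + 1/3) / 2.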
[16] Learning Skill-Attributes for Transferable Assessment in Video
Kumar Ashutosh, Kristen Grauman
🧩 TL;DR
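This paper proposes CrossTrainer, which discovers skill attributes shared across sports and trains a multimodal language model on them, enabling transferable skill assessment from video. It achieves relative gains of up to 60% over the state of the art in both cross-sport and in-domain settings.
📘 Detailed Summary
Motivation: Current video-based skill assessment models typically specialize in a single sport and face costly, scarce expert-level supervision, especially across the long tail of sports, which limits their generalization and practical reach.
Method: CrossTrainer first discovers skill attributes whose meaning crosses sport boundaries (such as balance, control, and hand positioning), then trains a multimodal language model to generate actionable feedback for a novel video (e.g., "lift hands more to generate more power") together with its proficiency level (e.g., early expert).
Result: Validated on multiple datasets in both cross-sport and in-domain settings, the method achieves gains of up to 60% relative to the state of the art. By abstracting out the shared behaviors indicative of human skill, the video representation generalizes better than existing techniques.
Conclusion: By abstracting out behaviors of human skill that are shared across sports, the proposed video representation substantially improves the generalization of multimodal large language models, offers an effective solution for skill assessment across the long tail of sports, and broadens the range of applications of today's multimodal large language models.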
📄 Abstract
Skill assessment from video entails rating the quality of a person's physical performance and explaining what could be done better. Today's models specialize for an individual sport, and suffer from the high cost and scarcity of expert-level supervision across the long tail of sports. Towards closing that gap, we explore transferable video representations for skill assessment. Our CrossTrainer approach discovers skill-attributes, such as balance, control, and hand positioning, whose meaning transcends the boundaries of any given sport, then trains a multimodal language model to generate actionable feedback for a novel video, e.g., "lift hands more to generate more power", as well as its proficiency level, e.g., early expert. We validate the new model on multiple datasets for both cross-sport (transfer) and intra-sport (in-domain) settings, where it achieves gains up to 60% relative to the state of the art. By abstracting out the shared behaviors indicative of human skill, the proposed video representation generalizes substantially better than an array of existing techniques, enriching today's multimodal large language models.
[17] SMART: Shot-Aware Multimodal Video Moment Retrieval with Audio-Enhanced MLLM
An Yu, Weiheng Lu, Jian Li, Zhenfei Zhang, Yunhang Shen, Felix X. -F. Ye, Ming-Ching Chang
🧩 TL;DR
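This paper proposes SMART, an MLLM-based framework for video moment retrieval that integrates audio cues and exploits shot-level temporal structure, significantly outperforming state-of-the-art methods on the Charades-STA and QVHighlights benchmarks.
📘 Detailed Summary
Motivation: Current video moment retrieval methods rely mainly on coarse temporal understanding and a single visual modality, which limits performance on complex videos; finer-grained multimodal representations and temporal structure modeling are needed.
Method: SMART enriches multimodal representations by combining audio and visual features, applies Shot-aware Token Compression to selectively retain high-information tokens within each shot and reduce redundancy, and refines prompt design to make better use of audio-visual cues.
Result: On Charades-STA, SMART improves R1@0.5 by 1.61% and R1@0.7 by 2.59%; it also shows superior performance on the QVHighlights benchmark.
Conclusion: The study shows that combining audio cues with shot-level temporal structure effectively improves video moment retrieval, offering a new technical path for multimodal video understanding and underscoring the importance of fine-grained temporal modeling.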
📄 Abstract
Video Moment Retrieval is a task in video understanding that aims to localize a specific temporal segment in an untrimmed video based on a natural language query. Despite recent progress in moment retrieval from videos using both traditional techniques and Multimodal Large Language Models (MLLM), most existing methods still rely on coarse temporal understanding and a single visual modality, limiting performance on complex videos. To address this, we introduce Shot-aware Multimodal Audio-enhanced Retrieval of Temporal Segments (SMART), an MLLM-based framework that integrates audio cues and leverages shot-level temporal structure. SMART enriches multimodal representations by combining audio and visual features while applying Shot-aware Token Compression, which selectively retains high-information tokens within each shot to reduce redundancy and preserve fine-grained temporal details. We also refine prompt design to better utilize audio-visual cues. Evaluations on Charades-STA and QVHighlights show that SMART achieves significant improvements over state-of-the-art methods, including a 1.61% increase in R1@0.5 and 2.59% gain in R1@0.7 on Charades-STA.
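The R1@0.5 and R1@0.7 figures quoted above are Recall@1 scores at temporal IoU thresholds. The following is a minimal sketch of how such a metric is commonly computed, not the authors' evaluation code; the function names and the toy segments are illustrative assumptions.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) segments, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds, gts, threshold):
    """Fraction of queries whose top-1 predicted segment reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(top1_preds, gts))
    return hits / len(gts)


# Toy data (hypothetical): one top-1 prediction and one ground truth per query.
preds = [(2.0, 8.0), (10.0, 20.0), (0.0, 5.0)]
gts = [(4.0, 9.0), (10.0, 18.0), (6.0, 9.0)]

r1_at_05 = recall_at_1(preds, gts, 0.5)  # IoUs are 4/7, 0.8 and 0.0
r1_at_07 = recall_at_1(preds, gts, 0.7)
```

A stricter threshold can only keep or lower the score, which is why R1@0.7 is normally the smaller of the two numbers.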
[18] Flood-LDM: Generalizable Latent Diffusion Models for rapid and accurate zero-shot High-Resolution Flood Mapping
Sun Han Neo, Sachith Seneviratne, Herath Mudiyanselage Viraj Vidura Herath, Abhishek Saha, Sanka Rasnayaka, Lucy Amanda Marshall
🧩 TL;DR
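This paper proposes a latent-diffusion-based super-resolution method for flood maps that preserves the accuracy of high-resolution flood maps while significantly reducing computation time, and that generalizes across regions better than conventional methods. By incorporating physics-informed inputs, it addresses the common black-box limitation of machine learning and improves interpretability.
📘 Detailed Summary
Motivation: Traditional physics-based hydrodynamic models can produce high-resolution flood maps, but they require fine-grid discretization, which is computationally intensive and unsuitable for real-time, large-scale use. Existing convolutional neural network approaches to flood map super-resolution offer good accuracy and speed but generalize poorly to unseen areas, falling short of practical deployment needs.
Method: The paper proposes a latent diffusion model approach for super-resolving coarse-grid flood maps. It incorporates physics-informed inputs, learns the mapping from low-resolution to high-resolution flood maps through the diffusion process, and uses transfer learning to speed up adaptation to new geographic regions.
Result: Experiments show that latent diffusion models substantially reduce the computation time needed to produce high-fidelity flood maps without compromising accuracy. The model generalizes well across physical locations, and transfer learning further accelerates adaptation to new geographic regions, making real-time flood risk management feasible.
Conclusion: Latent diffusion models provide an efficient and accurate approach to flood prediction, overcoming the computational-efficiency and generalization limits of conventional methods. Physics-informed inputs improve interpretability and suggest a direction for applying machine learning in environmental science.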
📄 Abstract
Flood prediction is critical for emergency planning and response to mitigate human and economic losses. Traditional physics-based hydrodynamic models generate high-resolution flood maps using numerical methods that require fine-grid discretization, which is computationally intensive and impractical for real-time large-scale applications. While recent studies have applied convolutional neural networks for flood map super-resolution with good accuracy and speed, they suffer from limited generalizability to unseen areas. In this paper, we propose a novel approach that leverages latent diffusion models to perform super-resolution on coarse-grid flood maps, with the objective of achieving the accuracy of fine-grid flood maps while significantly reducing inference time. Experimental results demonstrate that latent diffusion models substantially decrease the computational time required to produce high-fidelity flood maps without compromising on accuracy, enabling their use in real-time flood risk management. Moreover, diffusion models exhibit superior generalizability across different physical locations, with transfer learning further accelerating adaptation to new geographic regions. Our approach also incorporates physics-informed inputs, addressing the common limitation of black-box behavior in machine learning, thereby enhancing interpretability. Code is available at https://github.com/neosunhan/flood-diff.
[19] AdaTok: Adaptive Token Compression with Object-Aware Representations for Efficient Multimodal LLMs
Xinliang Zhang, Lei Zhu, Hangzhou He, Shuang Zeng, Ourui Fu, Jiakui Hu, Zhengjian Yao, Yanye Lu
🧩 TL;DR
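This paper proposes an adaptive token compression strategy based on object-level token merging to relieve the computation and memory burden caused by the quadratic growth of image tokens in multimodal large language models. Using only 10% of the tokens, it reaches about 96% of the original model's performance.
📘 Detailed Summary
Motivation: Multimodal large language models achieve text-image understanding by converting images into sequences of patch-level tokens, but patch-level tokenization makes the number of image tokens grow quadratically, imposing heavy computation and memory costs on understanding and reasoning. The traditional patch-wise scanning tokenization workflow is also misaligned with the human visual cognition system, which further leads to hallucination and computational redundancy.
Method: The paper proposes an object-level token merging strategy for adaptive token compression that is consistent with the human visual system: it reduces the token count by identifying and merging semantic objects in the image rather than scanning patch by patch.
Result: Experiments on multiple comprehensive benchmarks show that the method uses only 10% of the tokens on average while reaching about 96% of the vanilla model's performance. Extensive comparisons with related work demonstrate its advantage in balancing compression ratio and performance.
Conclusion: The study shows that object-level token compression can substantially reduce computational demands while preserving the performance of multimodal large language models, pointing to a new direction for more efficient multimodal models and underscoring the value of consistency with the human visual cognition system.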
📄 Abstract
Multimodal Large Language Models (MLLMs) have demonstrated substantial value in unified text-image understanding and reasoning, primarily by converting images into sequences of patch-level tokens that align with their architectural paradigm. However, patch-level tokenization leads to a quadratic growth in image tokens, burdening MLLMs' understanding and reasoning with enormous computation and memory costs. Additionally, the traditional patch-wise scanning tokenization workflow misaligns with the human vision cognition system, further leading to hallucination and computational redundancy. To address this issue, we propose an object-level token merging strategy for Adaptive Token compression, which is consistent with the human vision system. Experiments on multiple comprehensive benchmarks show that our approach uses only 10% of the tokens on average while achieving almost 96% of the vanilla model's performance. More extensive experimental results in comparison with relevant works demonstrate the superiority of our method in balancing compression ratio and performance. Our code will be available.
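To make the idea of object-level token merging concrete, here is a minimal sketch that mean-pools all patch tokens assigned to the same object. It is an illustration under stated assumptions, not the AdaTok implementation: the abstract does not say how object assignments are produced or how tokens are pooled, so `object_ids` (one integer label per patch) and the mean-pooling rule are hypothetical.

```python
from collections import defaultdict


def merge_tokens_by_object(tokens, object_ids):
    """Mean-pool patch tokens that share an object id.

    tokens: list of equal-length float lists, one per image patch.
    object_ids: list of ints, one per patch (assumed to come from some segmenter).
    Returns a dict mapping object id -> merged token.
    """
    groups = defaultdict(list)
    for tok, oid in zip(tokens, object_ids):
        groups[oid].append(tok)
    merged = {}
    for oid, toks in groups.items():
        dim = len(toks[0])
        merged[oid] = [sum(t[d] for t in toks) / len(toks) for d in range(dim)]
    return merged


# Toy example: six 2-D patch tokens covering three objects.
tokens = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [0.0, 2.0], [3.0, 0.0], [5.0, 5.0]]
object_ids = [0, 0, 1, 1, 1, 2]

merged = merge_tokens_by_object(tokens, object_ids)
kept_ratio = len(merged) / len(tokens)  # fraction of tokens kept after merging
```

In this scheme the number of output tokens follows the number of objects rather than the image resolution, which is what makes the compression adaptive.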
[20] Semantic Context Matters: Improving Conditioning for Autoregressive Models
Dongyang Jin, Ryan Xu, Jianhao Zeng, Rui Lan, Yancheng Bai, Lei Sun, Xiangxiang Chu
🧩 TL;DR
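This paper proposes SCAR, a semantic-context-driven image editing method for autoregressive models. Through Compressed Semantic Prefilling and Semantic Alignment Guidance, it addresses the conditioning limitations of existing methods and markedly improves instruction adherence while preserving image generation quality.
📘 Detailed Summary
Motivation: Autoregressive models show strong potential in image generation but are hard to extend to general image editing. The main problem is weak and inefficient conditioning, which leads to poor instruction adherence and visual artifacts and calls for a more effective way to inject semantic conditions.
Method: SCAR introduces two core components. Compressed Semantic Prefilling encodes high-level semantics into a compact and efficient prefix. Semantic Alignment Guidance aligns the last visual hidden states with the target semantics during autoregressive decoding to enhance instruction fidelity. The method builds on the flexibility of vector-quantized-based prefilling while overcoming its semantic limitations and high cost.
Result: SCAR performs strongly on both instruction editing and controllable generation benchmarks, with superior visual fidelity and semantic alignment. It outperforms prior AR-based methods while maintaining controllability, and it generalizes across both next-token and next-set autoregressive paradigms.
Conclusion: SCAR demonstrates the effectiveness of a semantic-context-driven approach to autoregressive image editing and opens a path for applying autoregressive models to complex editing tasks. Its general design adapts to different autoregressive paradigms and supports the construction of unified multi-modal systems.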
📄 Abstract
Recently, autoregressive (AR) models have shown strong potential in image generation, offering better scalability and easier integration with unified multi-modal systems compared to diffusion-based methods. However, extending AR models to general image editing remains challenging due to weak and inefficient conditioning, often leading to poor instruction adherence and visual artifacts. To address this, we propose SCAR, a Semantic-Context-driven method for Autoregressive models. SCAR introduces two key components: Compressed Semantic Prefilling, which encodes high-level semantics into a compact and efficient prefix, and Semantic Alignment Guidance, which aligns the last visual hidden states with target semantics during autoregressive decoding to enhance instruction fidelity. Unlike decoding-stage injection methods, SCAR builds upon the flexibility and generality of vector-quantized-based prefilling while overcoming its semantic limitations and high cost. It generalizes across both next-token and next-set AR paradigms with minimal architectural changes. SCAR achieves superior visual fidelity and semantic alignment on both instruction editing and controllable generation benchmarks, outperforming prior AR-based methods while maintaining controllability. All code will be released.
[21] Few-Shot Precise Event Spotting via Unified Multi-Entity Graph and Distillation
Zhaoyu Liu, Kan Jiang, Murong Ma, Zhe Hou, Yun Lin, Jin Song Dong
🧩 TL;DR
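This paper proposes a Unified Multi-Entity Graph Network (UMEG-Net) for few-shot precise event spotting. It integrates human skeletons and sport-specific object keypoints into a unified graph and uses multimodal knowledge distillation to boost performance, achieving robust event spotting with limited labeled data.
📘 Detailed Summary
Motivation: Precise event spotting is challenged by rapid succession of events, motion blur, and subtle visual differences. Existing methods depend on large labeled datasets and end-to-end training and perform poorly in few-shot conditions, while large labeled datasets are hard to obtain in practice.
Method: UMEG-Net integrates human skeletons and sport-specific object keypoints into a unified graph, uses an efficient spatio-temporal extraction module based on an advanced graph convolutional network and multi-scale temporal shift, and applies multimodal knowledge distillation to transfer knowledge from keypoint-based graphs to visual representations.
Result: The method significantly outperforms baseline models in few-shot settings and achieves robust performance with limited labeled data, providing a scalable and effective solution for few-shot precise event spotting.
Conclusion: Through a unified graph representation and multimodal knowledge distillation, UMEG-Net addresses the challenges of few-shot precise event spotting and offers a data-efficient solution for sports analytics.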
📄 Abstract
Precise event spotting (PES) aims to recognize fine-grained events at exact moments and has become a key component of sports analytics. This task is particularly challenging due to rapid succession, motion blur, and subtle visual differences. Consequently, most existing methods rely on domain-specific, end-to-end training with large labeled datasets and often struggle in few-shot conditions due to their dependence on pixel- or pose-based inputs alone. However, obtaining large labeled datasets is practically hard. We propose a Unified Multi-Entity Graph Network (UMEG-Net) for few-shot PES. UMEG-Net integrates human skeletons and sport-specific object keypoints into a unified graph and features an efficient spatio-temporal extraction module based on advanced GCN and multi-scale temporal shift. To further enhance performance, we employ multimodal distillation to transfer knowledge from keypoint-based graphs to visual representations. Our approach achieves robust performance with limited labeled data and significantly outperforms baseline models in few-shot settings, providing a scalable and effective solution for few-shot PES. Code is publicly available at https://github.com/LZYAndy/UMEG-Net.
[22] CORE: Compact Object-centric REpresentations as a New Paradigm for Token Merging in LVLMs
Jingyu Lei, Gaoang Wang, Der-Horng Lee
🧩 TL;DR
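CORE proposes a new paradigm for visual token compression based on object-centric representations. An efficient segmentation decoder generates object masks that serve as a semantic prior to guide token merging, and a centroid-guided sorting mechanism restores spatial order, markedly improving the efficiency of large vision-language models while maintaining high performance.
📘 Detailed Summary
Motivation: Existing visual token compression methods lack high-level semantic understanding, leading to suboptimal merges, information redundancy, or context loss. Meanwhile, large vision-language models face prohibitive computational and memory costs because visual tokens grow quadratically with image resolution.
Method: CORE uses an efficient segmentation decoder to generate object masks as a high-level semantic prior that guides the merging of visual tokens into compact object-centric representations. It also introduces a novel centroid-guided sorting mechanism that restores a coherent spatial order to the merged tokens, preserving key positional information.
Result: On six authoritative benchmarks, CORE sets a new state of the art for fixed-rate compression and achieves substantial efficiency gains in adaptive-rate settings. Even under extreme compression that retains only 2.2% of visual tokens, it maintains 97.4% of baseline performance.
Conclusion: The study demonstrates the advantage of object-centric representations for efficient and effective large vision-language model processing, offers a semantics-based direction for visual token compression, and shows that high performance can be maintained under extreme compression.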
📄 Abstract
Large Vision-Language Models (LVLMs) usually suffer from prohibitive computational and memory costs due to the quadratic growth of visual tokens with image resolution. Existing token compression methods, while varied, often lack a high-level semantic understanding, leading to suboptimal merges, information redundancy, or context loss. To address these limitations, we introduce CORE (Compact Object-centric REpresentations), a new paradigm for visual token compression. CORE leverages an efficient segmentation decoder to generate object masks, which serve as a high-level semantic prior to guide the merging of visual tokens into a compact set of object-centric representations. Furthermore, a novel centroid-guided sorting mechanism restores a coherent spatial order to the merged tokens, preserving vital positional information. Extensive experiments show that CORE not only establishes a new state-of-the-art on six authoritative benchmarks for fixed-rate compression, but also achieves dramatic efficiency gains in adaptive-rate settings. Even under extreme compression, retaining only 2.2% of all visual tokens, CORE still maintains 97.4% of baseline performance. Our work demonstrates the superiority of object-centric representations for efficient and effective LVLM processing.
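The centroid-guided sorting step can be pictured with a small sketch: compute the centroid of each object mask on the patch grid and order the merged tokens by centroid position. The raster ordering used here (top to bottom, then left to right) is an assumption for illustration; the abstract does not specify the exact sorting rule, and `mask_grid` is a hypothetical stand-in for the segmentation decoder's output.

```python
def centroid_order(mask_grid):
    """Order object ids by the centroid of their mask on the patch grid.

    mask_grid: list of rows, each a list of object ids (one id per patch).
    Returns (ordered_ids, centroids), where centroids maps id -> (row, col).
    """
    sums = {}
    for r, row in enumerate(mask_grid):
        for c, oid in enumerate(row):
            sr, sc, n = sums.get(oid, (0, 0, 0))
            sums[oid] = (sr + r, sc + c, n + 1)
    centroids = {oid: (sr / n, sc / n) for oid, (sr, sc, n) in sums.items()}
    # Assumed rule: sort by centroid row first, then by centroid column.
    ordered = sorted(centroids, key=lambda oid: centroids[oid])
    return ordered, centroids


# Toy 3x4 patch grid with three objects.
mask_grid = [
    [2, 2, 1, 1],
    [2, 2, 1, 1],
    [0, 0, 0, 0],
]

order, centroids = centroid_order(mask_grid)
```

Sorting by centroid gives the merged tokens a reading order that matches their layout in the image, so the language model still receives positional cues after most patch tokens are gone.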
[23] Orion: A Unified Visual Agent for Multimodal Perception, Advanced Visual Reasoning and Execution
N Dinesh Reddy, Sudeep Pillai
🧩 TL;DR
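Orion is a multimodal visual intelligence framework that uses an agentic architecture to coordinate multiple computer vision tools and execute complex visual workflows. It marks a shift from passive visual understanding to active, tool-driven intelligence and reaches state-of-the-art performance on several benchmarks.
📘 Detailed Summary
Motivation: Traditional vision-language models mainly produce descriptive outputs and cannot execute complex multi-step visual workflows. Orion aims to remove this limitation by extending monolithic vision-language models into a production-grade visual intelligence system.
Method: Orion adopts an agentic framework with multiple tool-calling capabilities. It orchestrates specialized computer vision tools, including object detection, keypoint localization, panoptic segmentation, optical character recognition, and geometric analysis, combining neural perception with symbolic execution.
Result: Orion achieves competitive performance on MMMU, MMBench, DocVQA, and MMLongBench, demonstrating its effectiveness on production-grade visual intelligence tasks.
Conclusion: By combining neural perception with symbolic execution, Orion enables autonomous visual reasoning, marking a transition from passive visual understanding to active, tool-driven visual intelligence and offering a new paradigm for building more capable visual AI systems.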
📄 Abstract
We introduce Orion, a visual agent framework that can take in any modality and generate any modality. Using an agentic framework with multiple tool-calling capabilities, Orion is designed for visual AI tasks and achieves state-of-the-art results. Unlike traditional vision-language models that produce descriptive outputs, Orion orchestrates a suite of specialized computer vision tools, including object detection, keypoint localization, panoptic segmentation, Optical Character Recognition, and geometric analysis, to execute complex multi-step visual workflows. The system achieves competitive performance on MMMU, MMBench, DocVQA, and MMLongBench while extending monolithic vision-language models to production-grade visual intelligence. By combining neural perception with symbolic execution, Orion enables autonomous visual reasoning, marking a transition from passive visual understanding to active, tool-driven visual intelligence.
[24] Agentic Video Intelligence: A Flexible Framework for Advanced Video Exploration and Understanding
Hong Gao, Yiming Bao, Xuezhen Tu, Yutong Xu, Yue Jin, Yiyang Mu, Bin Zhong, Linan Yue, Min-Ling Zhang
🧩 TL;DR
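This paper proposes Agentic Video Intelligence (AVI), a training-free, system-level framework. By mirroring the three-phase reasoning process of human video comprehension and combining a structured video knowledge base with an ensemble of open-source models, it achieves competitive performance with markedly better interpretability.
📘 Detailed Summary
Motivation: Video understanding currently faces two main limitations. Traditional vision-language models usually operate in a single pass, with no ability to revisit evidence or refine iteratively. Emerging agent-based methods either rely heavily on expensive proprietary models or require extensive agentic reinforcement learning, which limits their practicality and scalability.
Method: AVI introduces three key innovations: a human-inspired three-phase reasoning process (Retrieve-Perceive-Review) that ensures sufficient global exploration and focused local analysis; a structured video knowledge base organized through entity graphs which, together with multi-granularity integrated tools, forms the agent's interaction environment; and an open-source model ensemble that combines reasoning LLMs with lightweight base CV models and VLMs, removing the dependence on proprietary APIs or RL training.
Result: Experiments on LVBench, VideoMME-Long, LongVideoBench, and Charades-STA show that AVI achieves competitive performance while offering a clear advantage in interpretability, validating the framework on complex video understanding tasks.
Conclusion: The study shows that system-level design and optimization can yield a high-performing video understanding agent without expensive proprietary models or complex training, offering a technical path and design paradigm for more efficient, interpretable, and scalable video intelligence systems.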
📄 Abstract
Video understanding requires not only visual recognition but also complex reasoning. While Vision-Language Models (VLMs) demonstrate impressive capabilities, they typically process videos largely in a single-pass manner, with limited support for revisiting evidence and iterative refinement. Recently emerging agent-based methods enable long-horizon reasoning, but they either depend heavily on expensive proprietary models or require extensive agentic RL training. To overcome these limitations, we propose Agentic Video Intelligence (AVI), a flexible and training-free framework that can mirror human video comprehension through system-level design and optimization. AVI introduces three key innovations: (1) a human-inspired three-phase reasoning process (Retrieve-Perceive-Review) that ensures both sufficient global exploration and focused local analysis, (2) a structured video knowledge base organized through entity graphs, along with multi-granularity integrated tools, constituting the agent's interaction environment, and (3) an open-source model ensemble combining reasoning LLMs with lightweight base CV models and VLMs, eliminating dependence on proprietary APIs or RL training. Experiments on LVBench, VideoMME-Long, LongVideoBench, and Charades-STA demonstrate that AVI achieves competitive performance while offering superior interpretability.
[25] BCE3S: Binary Cross-Entropy Based Tripartite Synergistic Learning for Long-tailed Recognition
Weijia Fan, Qiufu Li, Jiajun Wen, Xiaoyang Peng
🧩 TL;DR
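This paper proposes BCE3S, a binary cross-entropy based tripartite synergistic learning framework that addresses feature compactness and classifier balance in long-tailed recognition. By decoupling the feature metrics from the classifier vectors, it achieves state-of-the-art performance on multiple long-tailed datasets.
📘 Detailed Summary
Motivation: Existing long-tailed recognition methods based on cross-entropy loss struggle to learn features with desirable properties, and they couple imbalanced classifier vectors in the Softmax denominator, amplifying the imbalance effects. The result is insufficient intra-class compactness and inter-class separability for head and tail classes, and unbalanced separability among classifier vectors.
Method: BCE3S has three core components. BCE-based joint learning decouples the feature metrics from the classifier vectors through multiple Sigmoids and optimizes both the classifier and the sample features. BCE-based contrastive learning further improves the intra-class compactness of features. BCE-based uniform learning balances the separability among classifier vectors and interacts with the joint learning to enhance feature properties.
Result: Extensive experiments on long-tailed datasets including CIFAR10-LT, CIFAR100-LT, ImageNet-LT, and iNaturalist2018 show that models trained with BCE3S achieve higher compactness and separability among sample features, balance the classifier's separability, and reach state-of-the-art performance.
Conclusion: By decoupling feature learning from classifier balancing, BCE3S addresses key challenges in long-tailed recognition and offers a new technical path for class imbalance. It highlights the potential of binary cross-entropy for long-tailed learning and informs future work on imbalanced learning.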
📄 Abstract
For long-tailed recognition (LTR) tasks, high intra-class compactness and inter-class separability in both head and tail classes, as well as balanced separability among all the classifier vectors, are preferred. Existing LTR methods based on cross-entropy (CE) loss not only struggle to learn features with desirable properties but also couple imbalanced classifier vectors in the Softmax denominator, amplifying the imbalance effects in LTR. In this paper, we propose a binary cross-entropy (BCE)-based tripartite synergistic learning for LTR, termed BCE3S, which consists of three components: (1) BCE-based joint learning optimizes both the classifier and sample features, achieving better compactness and separability among features than CE-based joint learning by decoupling the metrics between features and the imbalanced classifier vectors across multiple Sigmoids; (2) BCE-based contrastive learning further improves the intra-class compactness of features; (3) BCE-based uniform learning balances the separability among classifier vectors and interactively enhances the feature properties by combining with the joint learning. Extensive experiments show that the LTR model trained by BCE3S not only achieves higher compactness and separability among sample features, but also balances the classifier's separability, achieving SOTA performance on various long-tailed datasets such as CIFAR10-LT, CIFAR100-LT, ImageNet-LT, and iNaturalist2018.
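The decoupling argument can be checked numerically with a generic sketch. It compares standard Softmax cross-entropy with a one-vs-rest BCE over per-class Sigmoids on raw logits. This is a textbook formulation, not the BCE3S loss itself; any biases, scaling, or feature-to-classifier metric used by the authors are not given in the abstract, and the function names are illustrative.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def ce_loss(logits, target):
    """Softmax cross-entropy: every logit enters the shared denominator."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def bce_terms(logits, target):
    """Per-class BCE terms: -log sigmoid(z) for the target, -log(1 - sigmoid(z)) otherwise."""
    return [softplus(-z) if k == target else softplus(z) for k, z in enumerate(logits)]


def bce_loss(logits, target):
    return sum(bce_terms(logits, target))


base = [2.0, 0.5, -1.0]
shifted = [2.0, 5.0, -1.0]  # only the logit of a non-target class changes

# The target-class BCE term ignores the other logits ...
target_term_unchanged = bce_terms(base, 0)[0] == bce_terms(shifted, 0)[0]
# ... whereas the CE loss changes through the shared Softmax denominator.
ce_changes = ce_loss(shifted, 0) > ce_loss(base, 0)
```

In the BCE form each class contributes its own Sigmoid term, so a large logit for a head class does not directly rescale the term of a tail class the way the Softmax denominator does.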
[26] CCSD: Cross-Modal Compositional Self-Distillation for Robust Brain Tumor Segmentation with Missing Modalities
Dongqing Xie, Yonghuang Wu, Zisheng Ai, Jun Min, Zhencun Jiang, Shaojin Geng, Lei Wang
🧩 TL;DR
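This paper proposes CCSD, a novel cross-modal compositional self-distillation framework for brain tumor segmentation when arbitrary combinations of modalities are missing. Using hierarchical modality self-distillation and progressive modality combination distillation, it achieves state-of-the-art performance on public benchmarks.
📘 Detailed Summary
Motivation: Brain tumor segmentation from multi-modal MRI is critical for clinical diagnosis and treatment planning, but in real clinical settings one or more modalities are often missing, which severely degrades the performance and generalizability of deep learning segmentation models. Existing methods cannot flexibly handle arbitrary missing-modality combinations, so a robust segmentation framework is needed.
Method: CCSD adopts a shared-specific encoder-decoder architecture and introduces two self-distillation strategies. A hierarchical modality self-distillation mechanism transfers knowledge across modality hierarchies to reduce semantic discrepancies. A progressive modality combination distillation approach improves robustness to missing modalities by simulating gradual modality dropout during training. The framework can flexibly handle arbitrary combinations of input modalities.
Result: Extensive experiments on public brain tumor segmentation benchmarks show that CCSD achieves state-of-the-art performance across various missing-modality scenarios. It shows strong generalization and stability and significantly outperforms existing methods on multiple metrics, particularly in complex missing-modality cases.
Conclusion: CCSD provides an effective solution to missing modalities in real clinical settings, and its self-distillation strategies improve adaptation to arbitrary modality combinations. The work opens a direction for robust segmentation in multi-modal medical image analysis.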
📄 Abstract
The accurate segmentation of brain tumors from multi-modal MRI is critical for clinical diagnosis and treatment planning. While integrating complementary information from various MRI sequences is a common practice, the frequent absence of one or more modalities in real-world clinical settings poses a significant challenge, severely compromising the performance and generalizability of deep learning-based segmentation models. To address this challenge, we propose a novel Cross-Modal Compositional Self-Distillation (CCSD) framework that can flexibly handle arbitrary combinations of input modalities. CCSD adopts a shared-specific encoder-decoder architecture and incorporates two self-distillation strategies: (i) a hierarchical modality self-distillation mechanism that transfers knowledge across modality hierarchies to reduce semantic discrepancies, and (ii) a progressive modality combination distillation approach that enhances robustness to missing modalities by simulating gradual modality dropout during training. Extensive experiments on public brain tumor segmentation benchmarks demonstrate that CCSD achieves state-of-the-art performance across various missing-modality scenarios, with strong generalization and stability.
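To give a sense of what "arbitrary combinations of input modalities" and "gradual modality dropout" involve, the sketch below enumerates the non-empty modality subsets and builds one progressive dropout sequence. It is only an illustration: the four modality names are the sequences commonly used in brain tumor MRI benchmarks and are an assumption here (the abstract does not list them), and the drop order and helper names are hypothetical rather than the CCSD training schedule.

```python
from itertools import combinations

# Assumed modality set; not stated in the abstract.
MODALITIES = ("T1", "T1ce", "T2", "FLAIR")


def modality_subsets(modalities):
    """All non-empty subsets, from the full set down to single modalities."""
    return [
        combo
        for k in range(len(modalities), 0, -1)
        for combo in combinations(modalities, k)
    ]


def progressive_dropout_schedule(modalities, drop_order):
    """Available modalities after each successive drop, keeping at least one."""
    available = list(modalities)
    schedule = [tuple(available)]
    for name in drop_order:
        if len(available) == 1:
            break
        available.remove(name)
        schedule.append(tuple(available))
    return schedule


subsets = modality_subsets(MODALITIES)
schedule = progressive_dropout_schedule(MODALITIES, ("T2", "FLAIR", "T1"))
```

With four modalities there are 15 non-empty input combinations, which is why a single model that handles all of them is more practical than training one model per combination.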
[27] MRI Embeddings Complement Clinical Predictors for Cognitive Decline Modeling in Alzheimer's Disease Cohorts
Nathaniel Putera, Daniel Vilet Rodríguez, Noah Videcrantz, Julia Machnio, Mostafa Mehdipour Ghazi
🧩 TL;DR
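This study evaluates the complementary roles of tabular data and transformer-based MRI embeddings in predicting cognitive decline in Alzheimer's disease. Clinical features are best at identifying high-risk extremes, while transformer MRI embeddings are more effective at distinguishing cognitively stable individuals.
📘 Detailed Summary
Motivation: Accurate modeling of cognitive decline in Alzheimer's disease is essential for early stratification and personalized management. Tabular predictors provide robust markers of global risk, but their ability to capture subtle brain changes is limited, so the predictive contributions of different representations need to be assessed.
Method: The study introduces a trajectory-aware labeling strategy based on Dynamic Time Warping clustering to capture heterogeneous patterns of cognitive change. It trains a 3D Vision Transformer via unsupervised reconstruction on harmonized and augmented MRI data to obtain anatomy-preserving embeddings without progression labels, then evaluates the pretrained encoder embeddings with traditional machine learning classifiers and deep learning heads.
Result: Clinical and volumetric features achieved the highest AUCs, around 0.70, for predicting mild and severe progression, underscoring their utility in capturing global decline trajectories. In contrast, MRI embeddings from the ViT model were most effective at distinguishing cognitively stable individuals, with an AUC of 0.71. All approaches struggled in the heterogeneous moderate group.
Conclusion: Clinical features excel at identifying high-risk extremes, whereas transformer-based MRI embeddings are more sensitive to subtle markers of stability. This motivates multimodal fusion strategies for AD progression modeling and indicates that the modalities have complementary strengths.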
📄 Abstract
Accurate modeling of cognitive decline in Alzheimer's disease is essential for early stratification and personalized management. While tabular predictors provide robust markers of global risk, their ability to capture subtle brain changes remains limited. In this study, we evaluate the predictive contributions of tabular and imaging-based representations, with a focus on transformer-derived Magnetic Resonance Imaging (MRI) embeddings. We introduce a trajectory-aware labeling strategy based on Dynamic Time Warping clustering to capture heterogeneous patterns of cognitive change, and train a 3D Vision Transformer (ViT) via unsupervised reconstruction on harmonized and augmented MRI data to obtain anatomy-preserving embeddings without progression labels. The pretrained encoder embeddings are subsequently assessed using both traditional machine learning classifiers and deep learning heads, and compared against tabular representations and convolutional network baselines. Results highlight complementary strengths across modalities. Clinical and volumetric features achieved the highest AUCs of around 0.70 for predicting mild and severe progression, underscoring their utility in capturing global decline trajectories. In contrast, MRI embeddings from the ViT model were most effective in distinguishing cognitively stable individuals with an AUC of 0.71. However, all approaches struggled in the heterogeneous moderate group. These findings indicate that clinical features excel in identifying high-risk extremes, whereas transformer-based MRI embeddings are more sensitive to subtle markers of stability, motivating multimodal fusion strategies for AD progression modeling.
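The trajectory-aware labeling relies on Dynamic Time Warping (DTW), which compares sequences that follow a similar shape at different speeds. Below is a minimal sketch of the classic DTW distance using absolute difference as the local cost. The example trajectories are made-up numbers, and the clustering step and any windowing or normalization used in the study are not reproduced here.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic-programming DTW with |x - y| local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Hypothetical cognitive-score trajectories sampled at different visit counts.
slow_decline = [28, 28, 27, 27, 26]
same_shape_fewer_visits = [28, 27, 26]
fast_decline = [28, 24, 20]

d_similar = dtw_distance(slow_decline, same_shape_fewer_visits)
d_different = dtw_distance(slow_decline, fast_decline)
```

Because DTW allows one point to align with several points of the other sequence, two patients with the same decline pattern but different visit schedules end up close together, which is the property a trajectory clustering needs.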
[28] Coffee: Controllable Diffusion Fine-tuning
Ziyao Zeng, Jingcheng Ni, Ruyi Liu, Alex Wong
🧩 TL;DR
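This paper proposes Coffee, which regularizes the fine-tuning of diffusion models by specifying undesired concepts in language, preventing the model from learning undesired concepts in the user data and entangling them with user prompts. It needs no additional training, and the undesired concepts can be adjusted simply by editing the text descriptions.
📘 Detailed Summary
Motivation: Text-to-image diffusion models can be customized by fine-tuning on a small amount of user data, but controllable fine-tuning remains a challenge: the model should not learn undesired concepts present in the fine-tuning data, and those concepts should not become entangled with user prompts. This matters for downstream tasks such as bias mitigation, preventing malicious adaptation, attribute disentanglement, and generalizable fine-tuning of diffusion policy.
Method: Coffee regularizes the adaptation process by specifying undesired concepts in language. Its key mechanism is to keep the embeddings of the user prompt from aligning with the undesired concepts. The method requires no additional training, and undesired concepts can be flexibly changed by modifying the text descriptions.
Result: Experiments show that Coffee prevents text-to-image models from learning the specified undesired concepts during fine-tuning and outperforms existing methods. It performs well when fine-tuning on images in which user prompts are paired with undesired concepts.
Conclusion: Coffee offers an effective solution for controllable fine-tuning of diffusion models, using language-driven regularization to control undesired concepts. Being training-free and flexibly adjustable, it supports practical uses such as bias mitigation and safe adaptation.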
📄 Abstract
Text-to-image diffusion models can generate diverse content with flexible prompts, which makes them well-suited for customization through fine-tuning with a small amount of user-provided data. However, controllable fine-tuning that prevents models from learning undesired concepts present in the fine-tuning data, and from entangling those concepts with user prompts, remains an open challenge. It is crucial for downstream tasks like bias mitigation, preventing malicious adaptation, attribute disentanglement, and generalizable fine-tuning of diffusion policy. We propose Coffee that allows using language to specify undesired concepts to regularize the adaptation process. The crux of our method lies in keeping the embeddings of the user prompt from aligning with undesired concepts. Crucially, Coffee requires no additional training and enables flexible modification of undesired concepts by modifying textual descriptions. We evaluate Coffee by fine-tuning on images associated with user prompts paired with undesired concepts. Experimental results demonstrate that Coffee can prevent text-to-image models from learning specified undesired concepts during fine-tuning and outperforms existing methods. Code will be released upon acceptance.
[29] Learning Representation and Synergy Invariances: A Provable Framework for Generalized Multimodal Face Anti-Spoofing
Xun Lin, Shuai Wang, Yi Yu, Zitong Yu, Jiale Zhou, Yizhong Liu, Xiaochun Cao, Alex Kot, Yefeng Zheng
🧩 TL;DR
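This paper proposes the RiSe framework, which uses asymmetric invariant risk minimization and multimodal synergy disentanglement to address the representation-invariance and synergy-invariance risks in cross-domain multimodal face anti-spoofing, achieving state-of-the-art cross-domain performance.
📘 Detailed Summary
Motivation: Multimodal face anti-spoofing methods degrade more severely in unseen domains than unimodal ones, mainly because of two overlooked cross-domain generalization risks: the modal representation invariant risk (class asymmetry enlarges the upper bound of generalization error) and the modal synergy invariant risk (models overfit to domain-specific inter-modal correlations).
Method: RiSe has two core components. For the representation risk, AsyIRM learns an invariant spherical decision boundary in radial space to fit asymmetric distributions while preserving domain cues in angular space. For the synergy risk, MMSD is a self-supervised task that enhances intrinsic, generalizable modal features via cross-sample mixing and disentanglement.
Result: Theoretical analysis and experiments show that RiSe achieves state-of-the-art cross-domain performance, markedly improving the generalization of multimodal face anti-spoofing to unseen domains.
Conclusion: The study reveals the impact of class asymmetry and modal synergy risk in multimodal anti-spoofing. RiSe provides both theoretical guarantees and a practical solution for cross-domain multimodal learning and can inform the design of future multimodal security systems.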
📄 Abstract
Multimodal Face Anti-Spoofing (FAS) methods, which integrate multiple visual modalities, often suffer even more severe performance degradation than unimodal FAS when deployed in unseen domains. This is mainly due to two overlooked risks that affect cross-domain multimodal generalization. The first is the modal representation invariant risk, i.e., whether representations remain generalizable under domain shift. We theoretically show that the inherent class asymmetry in FAS (diverse spoofs vs. compact reals) enlarges the upper bound of generalization error, and this effect is further amplified in multimodal settings. The second is the modal synergy invariant risk, where models overfit to domain-specific inter-modal correlations. Such spurious synergy cannot generalize to unseen attacks in target domains, leading to performance drops. To solve these issues, we propose a provable framework, namely Multimodal Representation and Synergy Invariance Learning (RiSe). For representation risk, RiSe introduces Asymmetric Invariant Risk Minimization (AsyIRM), which learns an invariant spherical decision boundary in radial space to fit asymmetric distributions, while preserving domain cues in angular space. For synergy risk, RiSe employs Multimodal Synergy Disentanglement (MMSD), a self-supervised task enhancing intrinsic, generalizable modal features via cross-sample mixing and disentanglement. Theoretical analysis and experiments verify RiSe, which achieves state-of-the-art cross-domain performance.
[30] MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs
Huiyi Chen, Jiawei Peng, Dehai Min, Changchang Sun, Kaijie Chen, Yan Yan, Xu Yang, Lu Cheng
🧩 TL;DR
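This paper introduces MVI-Bench, the first comprehensive benchmark specifically designed to evaluate how misleading visual inputs undermine the robustness of large vision-language models. Its hierarchical taxonomy and fine-grained metric reveal pronounced vulnerabilities in existing models.
📘 Detailed Summary
Motivation: Existing robustness benchmarks focus on hallucination or misleading textual inputs and overlook the equally important role of misleading visual inputs in assessing visual understanding. This gap limits the reliable deployment of large vision-language models in real applications.
Method: Grounded in fundamental visual primitives, MVI-Bench is organized around three hierarchical levels of misleading visual inputs: Visual Concept, Visual Attribute, and Visual Relationship. It defines six representative categories, compiles 1,248 expertly annotated VQA instances, and introduces MVI-Sensitivity, a novel fine-grained robustness metric.
Result: An empirical study of 18 state-of-the-art large vision-language models reveals pronounced vulnerabilities to misleading visual inputs, and in-depth analyses on MVI-Bench yield actionable insights for developing more reliable models.
Conclusion: The study highlights misleading visual inputs as a key dimension for evaluating LVLM robustness. The benchmark and metric can guide the development of more robust vision-language models and offer a new direction for model safety evaluation.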
📄 Abstract
Evaluating the robustness of Large Vision-Language Models (LVLMs) is essential for their continued development and responsible deployment in real-world applications. However, existing robustness benchmarks typically focus on hallucination or misleading textual inputs, while largely overlooking the equally critical challenge posed by misleading visual inputs in assessing visual understanding. To fill this important gap, we introduce MVI-Bench, the first comprehensive benchmark specially designed for evaluating how Misleading Visual Inputs undermine the robustness of LVLMs. Grounded in fundamental visual primitives, the design of MVI-Bench centers on three hierarchical levels of misleading visual inputs: Visual Concept, Visual Attribute, and Visual Relationship. Using this taxonomy, we curate six representative categories and compile 1,248 expertly annotated VQA instances. To facilitate fine-grained robustness evaluation, we further introduce MVI-Sensitivity, a novel metric that characterizes LVLM robustness at a granular level. Empirical results across 18 state-of-the-art LVLMs uncover pronounced vulnerabilities to misleading visual inputs, and our in-depth analyses on MVI-Bench provide actionable insights that can guide the development of more reliable and robust LVLMs. The benchmark and codebase can be accessed at https://github.com/chenyil6/MVI-Bench.
[31] DoGCLR: Dominance-Game Contrastive Learning Network for Skeleton-Based Action Recognition
Yanshan Li, Ke Ma, Miaomiao Wei, Linhui Dai
🧩 TL;DR
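This paper proposes DoGCLR, a game-theoretic self-supervised contrastive learning framework for skeleton-based action recognition. It models positive and negative sample construction as a dynamic dominance game and adds a spatio-temporal dual weight localization mechanism and an entropy-driven dominance strategy, significantly improving recognition performance.
📘 Detailed Summary
Motivation: Existing self-supervised contrastive learning methods for skeleton data usually process all skeleton regions uniformly and store negative samples in a first-in-first-out queue, which leads to loss of motion information and suboptimal negative sample selection.
Method: DoGCLR models the construction of positive and negative samples as a dynamic dominance game. A spatio-temporal dual weight localization mechanism identifies key motion regions and guides region-wise augmentations, while an entropy-driven dominance strategy manages the memory bank by retaining high-entropy negatives and replacing low-entropy ones.
Result: On NTU RGB+D 60, DoGCLR reaches 81.1%/89.4% accuracy on X-Sub/X-View; on NTU RGB+D 120, it reaches 71.2%/75.5% on X-Sub/X-Set, surpassing state-of-the-art methods in each case. On PKU-MMD Part II, it achieves a 1.9% higher accuracy.
Conclusion: The study shows that a game-theoretic sample construction strategy can balance semantic preservation and discriminative strength, and that entropy-driven memory bank management keeps the contrastive signal informative, pointing to a new direction for self-supervised skeleton-based action recognition.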
📄 Abstract
Existing self-supervised contrastive learning methods for skeleton-based action recognition often process all skeleton regions uniformly, and adopt a first-in-first-out (FIFO) queue to store negative samples, which leads to motion information loss and non-optimal negative sample selection. To address these challenges, this paper proposes Dominance-Game Contrastive Learning network for skeleton-based action Recognition (DoGCLR), a self-supervised framework based on game theory. DoGCLR models the construction of positive and negative samples as a dynamic Dominance Game, where both sample types interact to reach an equilibrium that balances semantic preservation and discriminative strength. Specifically, a spatio-temporal dual weight localization mechanism identifies key motion regions and guides region-wise augmentations to enhance motion diversity while maintaining semantics. In parallel, an entropy-driven dominance strategy manages the memory bank by retaining high entropy (hard) negatives and replacing low-entropy (weak) ones, ensuring consistent exposure to informative contrastive signals. Extensive experiments are conducted on NTU RGB+D and PKU-MMD datasets. On NTU RGB+D 60 X-Sub/X-View, DoGCLR achieves 81.1%/89.4% accuracy, and on NTU RGB+D 120 X-Sub/X-Set, DoGCLR achieves 71.2%/75.5% accuracy, surpassing state-of-the-art methods by 0.1%, 2.7%, 1.1%, and 2.3%, respectively. On PKU-MMD Part I/Part II, DoGCLR performs comparably to the state-of-the-art methods and achieves a 1.9% higher accuracy on Part II, highlighting its strong robustness on more challenging scenarios.
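The contrast between a FIFO queue and an entropy-driven memory bank can be sketched as follows. This is a minimal illustration rather than DoGCLR's implementation: the abstract does not define how a negative's entropy is computed, so the score is taken as a given number here, and the class name, capacity, and replacement rule (evict the lowest-entropy entry when a higher-entropy candidate arrives) are assumptions.

```python
import heapq


class EntropyMemoryBank:
    """Fixed-size bank that keeps the highest-entropy negatives seen so far."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []   # min-heap of (entropy, insertion_index, sample)
        self._count = 0   # tie-breaker so samples themselves are never compared

    def push(self, sample, entropy):
        """Insert a candidate; return True if it was stored."""
        entry = (entropy, self._count, sample)
        self._count += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
            return True
        if entropy > self._heap[0][0]:
            # Replace the current lowest-entropy (weakest) negative.
            heapq.heapreplace(self._heap, entry)
            return True
        return False

    def samples(self):
        """Stored samples, highest entropy first."""
        return [s for _, _, s in sorted(self._heap, reverse=True)]


bank = EntropyMemoryBank(capacity=3)
stored = [bank.push(name, h) for name, h in
          [("a", 0.2), ("b", 0.9), ("c", 0.5), ("d", 0.1), ("e", 0.7)]]
```

A FIFO queue would have evicted "a" when "d" arrived regardless of how informative either sample was; here "d" is rejected outright and "a" is only displaced later by the harder negative "e".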
[32] Online Data Curation for Object Detection via Marginal Contributions to Dataset-level Average Precision
Zitang Sun, Masakazu Yoshimura, Junji Otsuka, Atsushi Irie, Takeshi Ohashi
🧩 TL;DR
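This paper proposes DetGain, an online data curation method designed for object detection that dynamically selects informative training samples by estimating each image's marginal perturbation to dataset-level Average Precision (AP). It improves accuracy across multiple representative detectors on COCO and is robust to low-quality data.
📘 Detailed Summary
Motivation: Existing online sampling strategies mainly target classification and multimodal learning and rarely extend to object detection, largely because of its structural complexity and domain gaps. High-quality data has become a primary driver of progress under scaling laws, yet object detection lacks an effective dynamic data curation method.
Method: By modeling global score distributions, DetGain estimates each image's marginal perturbation to dataset-level AP from its prediction quality. It computes teacher-student contribution gaps to select informative samples at each iteration. The method is architecture-agnostic and minimally intrusive, so it integrates easily into diverse detection architectures.
Result: Experiments on COCO show consistent accuracy improvements across multiple representative detectors. The method is robust under low-quality data and can be combined with knowledge distillation to further improve performance.
Conclusion: DetGain serves as a general, complementary strategy for data-efficient object detection. It demonstrates the potential of online data curation for complex vision tasks and opens a direction for future research on data optimization in object detection.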
📄 Abstract
High-quality data has become a primary driver of progress under scale laws, with curated datasets often outperforming much larger unfiltered ones at lower cost. Online data curation extends this idea by dynamically selecting training samples based on the model's evolving state. While effective in classification and multimodal learning, existing online sampling strategies rarely extend to object detection because of its structural complexity and domain gaps. We introduce DetGain, an online data curation method specifically for object detection that estimates the marginal perturbation of each image to dataset-level Average Precision (AP) based on its prediction quality. By modeling global score distributions, DetGain efficiently estimates the global AP change and computes teacher-student contribution gaps to select informative samples at each iteration. The method is architecture-agnostic and minimally intrusive, enabling straightforward integration into diverse object detection architectures. Experiments on the COCO dataset with multiple representative detectors show consistent improvements in accuracy. DetGain also demonstrates strong robustness under low-quality data and can be effectively combined with knowledge distillation techniques to further enhance performance, highlighting its potential as a general and complementary strategy for data-efficient object detection.
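The selection step (ranking images by a teacher-student contribution gap) might look like the sketch below. It assumes that a per-image AP contribution estimate is already available for both models; how DetGain derives those estimates from global score distributions is not shown here. The function name `select_by_gap`, the dictionary inputs, and the sign convention (teacher minus student) are hypothetical.

```python
def select_by_gap(teacher_gain, student_gain, k):
    """Pick the k images with the largest teacher-student contribution gap.

    teacher_gain, student_gain -- dicts mapping image id to an estimated
        marginal contribution to dataset-level AP (assumed precomputed)
    k -- number of images to keep for the current iteration
    """
    gaps = {img: teacher_gain[img] - student_gain[img] for img in teacher_gain}
    ranked = sorted(gaps, key=lambda img: gaps[img], reverse=True)
    return ranked[:k]
```

Under this convention, an image where the teacher contributes far more than the student is treated as one the student can still learn from, so it is ranked first.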
[33] Measurement-Constrained Sampling for Text-Prompted Blind Face Restoration
Wenjie Li, Yulun Zhang, Guangwei Gao, Heng Guo, Zhanyu Ma
🧩 TL;DR
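This paper proposes a measurement-constrained sampling approach that addresses the one-to-many mapping in blind face restoration, using text prompts to guide diverse, high-quality face reconstructions. It formulates blind face restoration as a measurement-constrained generative task and performs posterior-guided sampling within a text-to-image diffusion model.
📘 Detailed Summary
Motivation: Existing blind face restoration methods usually produce deterministic results and struggle to capture the one-to-many nature of the problem, where an extremely low-quality input may correspond to multiple plausible high-quality reconstructions. They cannot produce diverse reconstructions conditioned on different text prompts.
Method: Measurement-Constrained Sampling (MCS) constructs an inverse problem through controlled degradations of coarse restorations, which enables posterior-guided sampling within text-to-image diffusion. The measurement constraints comprise a Forward Measurement, which keeps results aligned with input structures, and a Reverse Measurement, which produces projection spaces so that the solution can align with various prompts.
Result: Experiments show that the method generates prompt-aligned results and outperforms existing blind face restoration methods. The framework preserves structural consistency with the input while producing diverse, high-quality reconstructions driven by text prompts.
Conclusion: The work recasts blind face restoration as a measurement-constrained generative task, offering a new way to handle the one-to-many mapping. The framework improves reconstruction quality and enables controllable, text-conditioned diversity, suggesting a direction for restoration tasks with inherent uncertainty.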
📄 Abstract
Blind face restoration (BFR) may correspond to multiple plausible high-quality (HQ) reconstructions under extremely low-quality (LQ) inputs. However, existing methods typically produce deterministic results, struggling to capture this one-to-many nature. In this paper, we propose a Measurement-Constrained Sampling (MCS) approach that enables diverse LQ face reconstructions conditioned on different textual prompts. Specifically, we formulate BFR as a measurement-constrained generative task by constructing an inverse problem through controlled degradations of coarse restorations, which allows posterior-guided sampling within text-to-image diffusion. Measurement constraints include both Forward Measurement, which ensures results align with input structures, and Reverse Measurement, which produces projection spaces, ensuring that the solution can align with various prompts. Experiments show that our MCS can generate prompt-aligned results and outperforms existing BFR methods. Codes will be released after acceptance.
[34] Enhancing Generalization of Depth Estimation Foundation Model via Weakly-Supervised Adaptation with Regularization
Yan Huang, Yongyi Su, Xin Lin, Le Zhang, Xun Xu
🧩 TL;DR
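This paper proposes WeSTAR, a parameter-efficient weakly supervised self-training adaptation framework that improves the generalization of monocular depth estimation foundation models to unseen domains through structural self-supervision, semantically-aware hierarchical normalization, and weakly supervised ordinal constraints, reaching state-of-the-art performance on multiple benchmarks.
📘 Detailed Summary
Motivation: Foundation models have substantially improved zero-shot generalization in monocular depth estimation, but a key question remains: once some downstream data is available, how can their performance in unseen and diverse domains be improved further? Existing methods fall short of adapting effectively while preserving the model's generalization ability.
Method: WeSTAR adopts a dense self-training objective as the primary source of structural self-supervision, introduces semantically-aware hierarchical normalization that uses instance-level segmentation maps for multi-scale structural normalization, adds cost-efficient weak supervision in the form of pairwise ordinal depth annotations to impose informative ordinal constraints that mitigate local topological errors, and applies a weight regularization loss to stabilize the LoRA updates.
Result: Extensive experiments on realistic and corrupted out-of-distribution datasets show that WeSTAR consistently improves generalization in diverse and challenging scenarios, achieves state-of-the-art performance across a wide range of benchmarks, and markedly improves robustness in unseen domains.
Conclusion: The study shows that a parameter-efficient adaptation framework combining dense self-supervision, semantically-aware normalization, and weakly supervised ordinal constraints can strengthen a foundation model's generalization across diverse domains while preserving its general knowledge, providing an effective approach to domain adaptation in monocular depth estimation.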
📄 Abstract
The emergence of foundation models has substantially advanced zero-shot generalization in monocular depth estimation (MDE), as exemplified by the Depth Anything series. However, given access to some data from downstream tasks, a natural question arises: can the performance of these models be further improved? To this end, we propose WeSTAR, a parameter-efficient framework that performs Weakly supervised Self-Training Adaptation with Regularization, designed to enhance the robustness of MDE foundation models in unseen and diverse domains. We first adopt a dense self-training objective as the primary source of structural self-supervision. To further improve robustness, we introduce semantically-aware hierarchical normalization, which exploits instance-level segmentation maps to perform more stable and multi-scale structural normalization. Beyond dense supervision, we introduce a cost-efficient weak supervision in the form of pairwise ordinal depth annotations to further guide the adaptation process, which enforces informative ordinal constraints to mitigate local topological errors. Finally, a weight regularization loss is employed to anchor the LoRA updates, ensuring training stability and preserving the model's generalizable knowledge. Extensive experiments on both realistic and corrupted out-of-distribution datasets under diverse and challenging scenarios demonstrate that WeSTAR consistently improves generalization and achieves state-of-the-art performance across a wide range of benchmarks.
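To make the pairwise ordinal constraint concrete, here is a minimal sketch of a logistic ranking loss, a common formulation for ordinal depth supervision. The abstract does not give WeSTAR's exact loss, so this particular form, the `+1`/`-1` label encoding, and the name `ordinal_ranking_loss` are assumptions.

```python
import math


def ordinal_ranking_loss(depth_i, depth_j, label):
    """Logistic ranking loss for one annotated pixel pair.

    depth_i, depth_j -- predicted depths at the two annotated points
    label -- +1 if point i is annotated as farther than point j,
             -1 if point i is annotated as closer (assumed encoding)

    The loss is small when the predicted ordering agrees with the
    annotation and grows when the ordering is violated.
    """
    return math.log1p(math.exp(-label * (depth_i - depth_j)))
```

A pair predicted in the annotated order, such as `ordinal_ranking_loss(2.0, 1.0, +1)`, incurs a smaller loss than the same pair predicted in reverse, which is what lets sparse ordinal labels correct local ordering errors.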
[35] ManipShield: A Unified Framework for Image Manipulation Detection, Localization and Explanation
Zitong Xu, Huiyu Duan, Xiaoyu Wang, Zhaolin Cai, Kaiwei Zhang, Qiang Hu, Jing Liu, Xiongkuo Min, Guangtao Zhai
🧩 TL;DR
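This paper presents ManipBench, a large-scale benchmark for image manipulation detection, and ManipShield, a unified detection framework built on a multimodal large language model. Together they address the limited content diversity, narrow generative-model coverage, and insufficient interpretability of existing benchmarks, and achieve state-of-the-art manipulation detection, localization, and explanation.
📘 Detailed Summary
Motivation: Existing image manipulation detection and localization benchmarks suffer from limited content diversity, narrow generative-model coverage, and insufficient interpretability. This restricts the generalization and explanation capabilities of current detection methods and leaves them unable to cope with the new kinds of manipulation enabled by rapidly advancing generative models.
Method: The authors build ManipBench, which contains over 450K manipulated images produced by 25 state-of-the-art image editing models, of which 100K are further annotated with bounding boxes, judgment cues, and textual explanations. On top of it they propose ManipShield, a unified model with a multimodal large language model architecture that combines contrastive LoRA fine-tuning with task-specific decoders to handle detection, localization, and explanation in one system.
Result: Extensive experiments on ManipBench and several public datasets show that ManipShield achieves state-of-the-art performance and generalizes well to unseen manipulation models, validating both the benchmark and the method.
Conclusion: The work provides a large-scale, diverse benchmark and a unified detection framework for AI-edited image manipulation, improving the accuracy, generalization, and interpretability of manipulation detection and offering a practical response to increasingly sophisticated image manipulation.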
📄 Abstract
With the rapid advancement of generative models, powerful image editing methods now enable diverse and highly realistic image manipulations that far surpass traditional deepfake techniques, posing new challenges for manipulation detection. Existing image manipulation detection and localization (IMDL) benchmarks suffer from limited content diversity, narrow generative-model coverage, and insufficient interpretability, which hinders the generalization and explanation capabilities of current manipulation detection methods. To address these limitations, we introduce ManipBench, a large-scale benchmark for image manipulation detection and localization focusing on AI-edited images. ManipBench contains over 450K manipulated images produced by 25 state-of-the-art image editing models across 12 manipulation categories, among which 100K images are further annotated with bounding boxes, judgment cues, and textual explanations to support interpretable detection. Building upon ManipBench, we propose ManipShield, an all-in-one model based on a Multimodal Large Language Model (MLLM) that leverages contrastive LoRA fine-tuning and task-specific decoders to achieve unified image manipulation detection, localization, and explanation. Extensive experiments on ManipBench and several public datasets demonstrate that ManipShield achieves state-of-the-art performance and exhibits strong generality to unseen manipulation models. Both ManipBench and ManipShield will be released upon publication.
[36] Let Language Constrain Geometry: Vision-Language Models as Semantic and Spatial Critics for 3D Generation
Weimin Bai, Yubo Li, Weijian Luo, Zeqiang Lai, Yequan Wang, Wenzheng Chen, He Sun
🧩 TL;DR
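This paper proposes VLM3D, a framework that repurposes large vision-language models as differentiable semantic and spatial critics to address semantic alignment and spatial understanding in text-to-3D generation, substantially improving the output quality of existing methods.
📘 Detailed Summary
Motivation: State-of-the-art text-to-3D models face two fundamental limitations. First, they struggle to achieve fine-grained semantic alignment and often miss details in the prompt. Second, they lack robust 3D spatial understanding, which leads to geometric inconsistencies and catastrophic failures in part assembly and spatial relationships.
Method: The core contribution of VLM3D is a dual-query critic signal derived from the vision-language model's Yes/No log-odds, which assesses both semantic fidelity and geometric coherence. The signal is general: it can serve as a reward objective for optimization-based pipelines or as a test-time guidance module for feed-forward pipelines, where it steers the iterative sampling process to correct spatial errors.
Result: On standard benchmarks, VLM3D used as a reward objective for optimization-based pipelines significantly outperforms existing methods. Used as a test-time guidance module for feed-forward pipelines, it corrects severe spatial errors in state-of-the-art native 3D models.
Conclusion: VLM3D establishes a principled and generalizable way to inject a vision-language model's rich, language-grounded understanding of semantics and space into diverse 3D generative pipelines, offering a new direction for the core challenges of text-to-3D generation.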
📄 Abstract
Text-to-3D generation has advanced rapidly, yet state-of-the-art models, encompassing both optimization-based and feed-forward architectures, still face two fundamental limitations. First, they struggle with coarse semantic alignment, often failing to capture fine-grained prompt details. Second, they lack robust 3D spatial understanding, leading to geometric inconsistencies and catastrophic failures in part assembly and spatial relationships. To address these challenges, we propose VLM3D, a general framework that repurposes large vision-language models (VLMs) as powerful, differentiable semantic and spatial critics. Our core contribution is a dual-query critic signal derived from the VLM's Yes or No log-odds, which assesses both semantic fidelity and geometric coherence. We demonstrate the generality of this guidance signal across two distinct paradigms: (1) As a reward objective for optimization-based pipelines, VLM3D significantly outperforms existing methods on standard benchmarks. (2) As a test-time guidance module for feed-forward pipelines, it actively steers the iterative sampling process of SOTA native 3D models to correct severe spatial errors. VLM3D establishes a principled and generalizable path to inject the VLM's rich, language-grounded understanding of both semantics and space into diverse 3D generative pipelines.
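A Yes/No log-odds signal of the kind described here can be sketched as follows. When "Yes" and "No" are scored under the same softmax, the normalizer cancels and the log-odds reduce to a difference of logits. How VLM3D combines its two queries is not stated in the abstract; the weighted sum in `dual_query_score`, its weights, and both function names are assumptions for illustration.

```python
def yes_no_log_odds(logit_yes, logit_no):
    """log p(Yes) - log p(No) for two tokens scored by the same softmax.

    The shared softmax normalizer cancels, so the log-odds equal the
    difference of the raw logits.
    """
    return logit_yes - logit_no


def dual_query_score(semantic_logits, spatial_logits, w_sem=1.0, w_spa=1.0):
    """Combine a semantic query and a spatial query into one critic score.

    semantic_logits, spatial_logits -- (logit_yes, logit_no) pairs read
        from the VLM's answer to each question (hypothetical inputs)
    w_sem, w_spa -- hypothetical weights for the two queries
    """
    return (w_sem * yes_no_log_odds(*semantic_logits)
            + w_spa * yes_no_log_odds(*spatial_logits))
```

Because the score is a smooth function of the logits, it can in principle be backpropagated through the VLM to the rendered views, which is what makes such a critic usable as a differentiable reward.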
[37] NeuralBoneReg: A Novel Self-Supervised Method for Robust and Accurate Multi-Modal Bone Surface Registration
Luohong Wu, Matthias Seibold, Nicola A. Cavalcanti, Yunke Ao, Roman Flepp, Aidana Massalimova, Lilian Calvet, Philipp Fürnstahl
🧩 TL;DR
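This paper proposes NeuralBoneReg, a self-supervised surface registration framework for multi-modal bone surface registration in computer-assisted orthopedic surgery. It matches or surpasses existing methods on multiple datasets and generalizes well across anatomies and modalities.
📘 Detailed Summary
Motivation: In computer-assisted orthopedic surgery, preoperative plans must be transferred to intraoperative data through precise cross-modal registration, but substantial heterogeneity between imaging modalities makes this registration challenging and error-prone. Robust, automatic, and modality-agnostic bone surface registration is therefore clinically important.
Method: NeuralBoneReg has two modules: an implicit neural unsigned distance field (UDF) that learns the preoperative bone model, and an MLP-based registration module that performs global initialization and local refinement by generating transformation hypotheses to align the intraoperative point cloud with the neural UDF. The method is self-supervised and requires no inter-subject training data.
Result: Evaluated on three multi-modal datasets, NeuralBoneReg achieves mean RRE/RTE of 1.68°/1.86 mm on UltraBones100k, 1.88°/1.89 mm on UltraBones-Hip, and 3.79°/2.45 mm on SpineDepth, matching or surpassing existing methods on all datasets.
Conclusion: NeuralBoneReg generalizes well across anatomies and imaging modalities and provides robust, accurate cross-modal alignment for computer-assisted orthopedic surgery. Because it is self-supervised, it does not depend on inter-subject training data, which makes it attractive for clinical use.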
📄 Abstract
In computer- and robot-assisted orthopedic surgery (CAOS), patient-specific surgical plans derived from preoperative imaging define target locations and implant trajectories. During surgery, these plans must be accurately transferred, relying on precise cross-registration between preoperative and intraoperative data. However, substantial heterogeneity across imaging modalities makes this registration challenging and error-prone. Robust, automatic, and modality-agnostic bone surface registration is therefore clinically important. We propose NeuralBoneReg, a self-supervised, surface-based framework that registers bone surfaces using 3D point clouds as a modality-agnostic representation. NeuralBoneReg includes two modules: an implicit neural unsigned distance field (UDF) that learns the preoperative bone model, and an MLP-based registration module that performs global initialization and local refinement by generating transformation hypotheses to align the intraoperative point cloud with the neural UDF. Unlike SOTA supervised methods, NeuralBoneReg operates in a self-supervised manner, without requiring inter-subject training data. We evaluated NeuralBoneReg against baseline methods on two publicly available multi-modal datasets: a CT-ultrasound dataset of the fibula and tibia (UltraBones100k) and a CT-RGB-D dataset of spinal vertebrae (SpineDepth). The evaluation also includes a newly introduced CT-ultrasound dataset of cadaveric subjects containing femur and pelvis (UltraBones-Hip), which will be made publicly available. NeuralBoneReg matches or surpasses existing methods across all datasets, achieving mean RRE/RTE of 1.68°/1.86 mm on UltraBones100k, 1.88°/1.89 mm on UltraBones-Hip, and 3.79°/2.45 mm on SpineDepth. These results demonstrate strong generalizability across anatomies and modalities, providing robust and accurate cross-modal alignment for CAOS.
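The reported RRE/RTE figures are registration error metrics. The sketch below uses the definitions common in the point cloud registration literature: relative rotation error as the geodesic angle between estimated and ground-truth rotations, and relative translation error as the Euclidean distance between translations. Whether the paper uses exactly these definitions is an assumption, as are the function names.

```python
import math


def rre_degrees(R_est, R_gt):
    """Relative rotation error in degrees between two 3x3 rotation matrices.

    Uses angle = arccos((trace(R_gt^T R_est) - 1) / 2). The trace of
    R_gt^T R_est equals the element-wise product sum of the two matrices.
    """
    trace = sum(R_gt[i][j] * R_est[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def rte(t_est, t_gt):
    """Relative translation error: Euclidean distance between translations."""
    return math.dist(t_est, t_gt)
```

With translations expressed in millimetres, `rte` returns millimetres directly, matching the units quoted in the abstract.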
[38] ArchMap: Arch-Flattening and Knowledge-Guided Vision Language Model for Tooth Counting and Structured Dental Understanding
Bohan Zhang, Yiyi Miao, Taoyu Wu, Tong Chen, Ji Jiang, Zhuoxiao Li, Zhe Tang, Limin Yu, Jionglong Su
🧩 TL;DR
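This paper proposes ArchMap, a training-free, knowledge-guided framework for structured understanding of intraoral 3D scans. Using a geometry-aware arch-flattening module and multimodal reasoning, it achieves robust tooth counting, anatomical partitioning, and identification of clinical conditions without modality-specific training.
📘 Detailed Summary
Motivation: Existing deep-learning approaches rely heavily on modality-specific training, large annotated datasets, and controlled scanning conditions, which limits generalization across devices and hinders deployment in real clinical workflows. In addition, raw intraoral meshes vary substantially in arch pose, have incomplete geometry caused by occlusion or tooth contact, and lack texture cues, which makes unified semantic interpretation highly challenging.
Method: ArchMap first introduces a geometry-aware arch-flattening module that standardizes raw 3D meshes into spatially aligned, continuity-preserving multi-view projections. It then constructs a Dental Knowledge Base (DKB) that encodes a hierarchical tooth ontology, dentition-stage policies, and clinical semantics to constrain the symbolic reasoning space. Combining geometric normalization with ontology-guided multimodal reasoning yields training-free structured analysis.
Result: Validation on 1060 pre-/post-orthodontic cases shows robust performance in tooth counting, anatomical partitioning, dentition-stage classification, and the identification of clinical conditions such as crowding, missing teeth, prosthetics, and caries. Compared with supervised pipelines and prompted VLM baselines, ArchMap achieves higher accuracy, reduced semantic drift, and superior stability under sparse or artifact-prone conditions.
Conclusion: As a fully training-free system, ArchMap shows that combining geometric normalization with ontology-guided multimodal reasoning offers a practical and scalable solution for the structured analysis of 3D intraoral scans in modern digital orthodontics. It removes the dependence of conventional deep-learning methods on annotated data and modality-specific training, and illustrates the potential of knowledge-guided frameworks for clinical deployment.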
📄 Abstract
A structured understanding of intraoral 3D scans is essential for digital orthodontics. However, existing deep-learning approaches rely heavily on modality-specific training, large annotated datasets, and controlled scanning conditions, which limit generalization across devices and hinder deployment in real clinical workflows. Moreover, raw intraoral meshes exhibit substantial variation in arch pose, incomplete geometry caused by occlusion or tooth contact, and a lack of texture cues, making unified semantic interpretation highly challenging. To address these limitations, we propose ArchMap, a training-free and knowledge-guided framework for robust structured dental understanding. ArchMap first introduces a geometry-aware arch-flattening module that standardizes raw 3D meshes into spatially aligned, continuity-preserving multi-view projections. We then construct a Dental Knowledge Base (DKB) encoding hierarchical tooth ontology, dentition-stage policies, and clinical semantics to constrain the symbolic reasoning space. We validate ArchMap on 1060 pre-/post-orthodontic cases, demonstrating robust performance in tooth counting, anatomical partitioning, dentition-stage classification, and the identification of clinical conditions such as crowding, missing teeth, prosthetics, and caries. Compared with supervised pipelines and prompted VLM baselines, ArchMap achieves higher accuracy, reduced semantic drift, and superior stability under sparse or artifact-prone conditions. As a fully training-free system, ArchMap demonstrates that combining geometric normalization with ontology-guided multimodal reasoning offers a practical and scalable solution for the structured analysis of 3D intraoral scans in modern digital orthodontics.
[39] ARC-Chapter: Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries
Junfu Pu, Teng Wang, Yixiao Ge, Yuying Ge, Chen Li, Ying Shan
🧩 TL;DR
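This paper presents ARC-Chapter, the first large-scale video chaptering model trained on million-scale long-video chapter annotations. By building a dataset with bilingual, temporally grounded, and hierarchical annotations, it substantially improves the structuring of long video content.
📘 Detailed Summary
Motivation: Existing video chaptering methods are limited by small-scale training data and coarse annotations, so they generalize poorly to the subtle transitions in long videos and cannot meet the need for efficient structuring of hour-long content.
Method: A structured pipeline builds a bilingual English-Chinese chapter dataset that unifies ASR transcripts, scene text, and visual captions into multi-level annotations ranging from short titles to long summaries. The authors also design a new evaluation metric, GRACE, which accounts for many-to-one segment overlaps and semantic similarity.
Result: ARC-Chapter outperforms the previous best method by 14.0% in F1 score and 11.3% in SODA score, setting a new state of the art, and also transfers well to downstream dense video captioning on YouCook2.
Conclusion: The study shows that scaling both data volume and label intensity significantly improves model performance, and that GRACE better reflects the flexibility of real-world chaptering. The approach offers an effective solution for long-video understanding and generalizes well to related tasks.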
📄 Abstract
The proliferation of hour-long videos (e.g., lectures, podcasts, documentaries) has intensified demand for efficient content structuring. However, existing approaches are constrained by small-scale training with annotations that are typically short and coarse, restricting generalization to nuanced transitions in long videos. We introduce ARC-Chapter, the first large-scale video chaptering model trained on million-scale long video chapters, featuring bilingual, temporally grounded, and hierarchical chapter annotations. To achieve this goal, we curated a bilingual English-Chinese chapter dataset via a structured pipeline that unifies ASR transcripts, scene texts, and visual captions into multi-level annotations, from short titles to long summaries. We demonstrate clear performance improvements with data scaling, both in data volume and label intensity. Moreover, we design a new evaluation metric termed GRACE, which incorporates many-to-one segment overlaps and semantic similarity, better reflecting real-world chaptering flexibility. Extensive experiments demonstrate that ARC-Chapter establishes a new state-of-the-art by a significant margin, outperforming the previous best by 14.0% in F1 score and 11.3% in SODA score. Moreover, ARC-Chapter shows excellent transferability, improving the state-of-the-art on downstream tasks like dense video captioning on YouCook2.
[40] Cranio-ID: Graph-Based Craniofacial Identification via Automatic Landmark Annotation in 2D Multi-View X-rays
Ravi Shankar Prasad, Nandani Sharma, Dinesh Singh
🧩 TL;DR
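This paper proposes Cranio-ID, a framework that automatically annotates skull landmarks with a YOLO-pose model and uses graph representations with cross-modal attention to match skulls to faces reliably, improving the accuracy and reliability of forensic craniofacial identification.
📘 Detailed Summary
Motivation: Traditional skull landmark localization is time-consuming and requires specialist expertise, and existing deep-learning-based automatic annotation methods are unreliable because they lack large-scale validation studies. Forensic craniofacial identification and biomedical applications in particular need a more reliable cross-modal matching solution.
Method: Cranio-ID first uses trained YOLO-pose models to automatically annotate landmarks on 2D skull X-ray images, then formulates these landmarks as graph representations and uses cross-attention together with an optimal transport framework to find semantic correspondences between skull and face images.
Result: Extensive experiments on the S2F and CUHK datasets show significant improvements in both reliability and accuracy, with strong performance on cross-domain skull-to-face and sketch-to-face matching, supporting the framework's effectiveness in forensic science.
Conclusion: The work provides a reliable automated solution for forensic craniofacial identification. By combining object detection, graph representation learning, and cross-modal matching, it addresses the reliability shortcomings of existing methods and opens new directions for biomedical and forensic applications.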
📄 Abstract
In forensic craniofacial identification and in many biomedical applications, craniometric landmarks are important. Traditional methods for locating landmarks are time-consuming and require specialized knowledge and expertise. Current approaches use superimposition and deep-learning-based methods that annotate landmarks automatically. However, these methods are not reliable due to insufficient large-scale validation studies. In this paper, we propose a novel framework, Cranio-ID. First, it automatically annotates landmarks on 2D skulls (which are X-ray scans of faces) with their respective optical images using our trained YOLO-pose models. Second, it performs cross-modal matching by formulating these landmarks into graph representations and then finding semantic correspondence between the graphs of the two modalities using cross-attention and an optimal transport framework. Our proposed framework is validated on the S2F and CUHK datasets (the CUHK dataset resembles the S2F dataset). Extensive experiments have been conducted to evaluate the performance of our proposed framework, which demonstrates significant improvements in both reliability and accuracy, as well as its effectiveness in cross-domain skull-to-face and sketch-to-face matching in forensic science.
[41] DIR-TIR: Dialog-Iterative Refinement for Text-to-Image Retrieval
Zongwei Zhen, Biqing Zeng
🧩 TL;DR
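This paper proposes DIR-TIR, a framework that progressively refines text-to-image retrieval through conversational interaction. Its Dialog Refiner and Image Refiner modules work together to improve target image retrieval accuracy and the user experience.
📘 Detailed Summary
Motivation: Conventional single-query text-to-image retrieval suffers from insufficient information and poor fault tolerance, and cannot bridge the semantic gap between user intent and retrieval results. This calls for a retrieval system that narrows down the search target step by step through multi-turn dialogue.
Method: DIR-TIR comprises a Dialog Refiner module and an Image Refiner module. The Dialog Refiner actively queries the user to extract essential information and generate increasingly precise descriptions of the target image. The Image Refiner identifies perceptual gaps between generated images and the user's intent and strategically reduces the visual-semantic discrepancy.
Result: Comprehensive experiments across multiple image datasets show that the dialogue-based approach substantially outperforms baselines that use only the initial description. Integrating the two modules yields higher retrieval precision and a better interactive experience, with a marked gain in target image hit accuracy.
Conclusion: Multi-turn dialogue gives text-to-image retrieval superior controllability and fault tolerance, and the two-module design effectively bridges the visual-semantic gap, offering a technical path and design ideas for future interactive retrieval systems.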
📄 Abstract
This paper addresses the task of interactive, conversational text-to-image retrieval. Our DIR-TIR framework progressively refines the target image search through two specialized modules: the Dialog Refiner Module and the Image Refiner Module. The Dialog Refiner actively queries users to extract essential information and generate increasingly precise descriptions of the target image. Complementarily, the Image Refiner identifies perceptual gaps between generated images and user intentions, strategically reducing the visual-semantic discrepancy. By leveraging multi-turn dialogues, DIR-TIR provides superior controllability and fault tolerance compared to conventional single-query methods, significantly improving target image hit accuracy. Comprehensive experiments across diverse image datasets demonstrate our dialogue-based approach substantially outperforms initial-description-only baselines, while the synergistic module integration achieves both higher retrieval precision and enhanced interactive experience.
[42] Enhancing End-to-End Autonomous Driving with Risk Semantic Distillation from VLM
Jack Qin, Zhitao Wang, Yinan Zheng, Keyu Chen, Yang Zhou, Yuanxin Zhong, Siyuan Cheng
🧩 TL;DR
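This paper proposes Risk Semantic Distillation (RSD), a framework that uses vision-language models to enhance the training of end-to-end autonomous driving backbones, addressing the generalization and consistency challenges of autonomous driving systems. Its RiskHead module distills the VLM's causal risk estimates into BEV features, producing interpretable risk-attention maps.
📘 Detailed Summary
Motivation: Current autonomous driving systems perform well in complex scenarios, but generalization remains a key limitation, especially for unseen scenarios or unfamiliar sensor configurations. Existing approaches either use hybrid systems, which leads to inconsistent planning, or adopt end-to-end vision-language-action frameworks, which are computationally prohibitive. A solution is needed that improves generalization while keeping the system consistent.
Method: RSD introduces RiskHead, a plug-in module that distills causal risk estimates from vision-language models into Bird's-Eye-View (BEV) features, yielding interpretable risk-attention maps. This lets BEV features learn richer and more nuanced risk-attention representations, which directly improves the model's ability to handle spatial boundaries and risky objects.
Result: Experiments on the Bench2Drive benchmark demonstrate the effectiveness of RSD in complex and unpredictable driving conditions. With the BEV representations enhanced by RSD, both perception and planning improve significantly, particularly in handling spatial boundaries and risky objects.
Conclusion: By focusing on risk attention, RSD aligns more closely with human-like driving behavior, which is essential for navigating complex, dynamic environments. It offers a way to improve generalization while remaining computationally efficient and points to a direction for future end-to-end autonomous driving systems.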
📄 Abstract
The autonomous driving (AD) system has exhibited remarkable performance in complex driving scenarios. However, generalization is still a key limitation for the current system, which refers to the ability to handle unseen scenarios or unfamiliar sensor configurations. Related works have explored the use of Vision-Language Models (VLMs) to address few-shot or zero-shot tasks. While promising, these methods introduce a new challenge: the emergence of a hybrid AD system, where two distinct systems are used to plan a trajectory, leading to potential inconsistencies. Alternative research directions have explored Vision-Language-Action (VLA) frameworks that generate control actions from VLM directly. However, these end-to-end solutions demonstrate prohibitive computational demands. To overcome these challenges, we introduce Risk Semantic Distillation (RSD), a novel framework that leverages VLMs to enhance the training of End-to-End (E2E) AD backbones. By providing risk attention for key objects, RSD addresses the issue of generalization. Specifically, we introduce RiskHead, a plug-in module that distills causal risk estimates from Vision-Language Models into Bird's-Eye-View (BEV) features, yielding interpretable risk-attention maps. This approach allows BEV features to learn richer and more nuanced risk attention representations, which directly enhance the model's ability to handle spatial boundaries and risky objects. By focusing on risk attention, RSD aligns better with human-like driving behavior, which is essential to navigate in complex and dynamic environments. Our experiments on the Bench2Drive benchmark demonstrate the effectiveness of RSD in managing complex and unpredictable driving conditions. Due to the enhanced BEV representations enabled by RSD, we observed a significant improvement in both perception and planning capabilities.
[43] OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models
Keda Tao, Kele Shao, Bohan Yu, Weiqiang Wang, Jian liu, Huan Wang
🧩 TL;DR
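This paper presents OmniZip, a training-free, audio-guided audio-visual token compression framework that uses dynamic token pruning and an interleaved spatio-temporal compression scheme to accelerate OmniLLM inference while maintaining performance. It achieves a 3.42x inference speedup and a 1.4x memory reduction, addressing the computational bottleneck of multimodal token processing.
📘 Detailed Summary
Motivation: As omnimodal large language models attract growing attention for unified audio-video understanding, processing audio-video token sequences has become a significant computational bottleneck. Existing token compression methods do not yet meet the emerging need to compress multimodal tokens jointly, so dedicated multimodal token compression is needed to optimize representations and speed up inference.
Method: OmniZip first identifies salient audio tokens, then computes an audio retention score for each time group to capture information density. The score dynamically guides video token pruning while preserving cues from audio anchors enhanced by cross-modal similarity. For each time window, video tokens are compressed with an interleaved spatio-temporal scheme, giving efficient multimodal token compression without training.
Result: Extensive empirical results show that OmniZip achieves a 3.42x inference speedup and a 1.4x memory reduction over other top-performing counterparts while maintaining model performance without any training.
Conclusion: The study demonstrates the effectiveness of audio-guided multimodal token compression and provides a practical solution for efficient inference in omnimodal large language models. Being training-free makes OmniZip easy to deploy, and its dynamic token pruning offers useful guidance for future multimodal compression methods.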
📄 Abstract
Omnimodal large language models (OmniLLMs) have recently attracted increasing research attention for unified audio-video understanding; however, processing audio-video token sequences creates a significant computational bottleneck. Existing token compression methods have yet to accommodate this emerging need of jointly compressing multimodal tokens. To bridge this gap, we present OmniZip, a training-free, audio-guided audio-visual token-compression framework that optimizes multimodal token representation and accelerates inference. Specifically, OmniZip first identifies salient audio tokens, then computes an audio retention score for each time group to capture information density, thereby dynamically guiding video token pruning and preserving cues from audio anchors enhanced by cross-modal similarity. For each time window, OmniZip compresses the video tokens using an interleaved spatio-temporal scheme. Extensive empirical results demonstrate the merits of OmniZip: it achieves 3.42X inference speedup and 1.4X memory reduction over other top-performing counterparts, while maintaining performance with no training.
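One way to picture audio-guided pruning is as a budget allocation: time groups whose audio is denser in salient tokens keep more video tokens. The sketch below is only an illustration of that idea. The definition of the retention score as the fraction of salient audio tokens in a group, the proportional allocation rule, and the names `retention_scores` and `video_keep_counts` are assumptions, not OmniZip's actual procedure.

```python
def retention_scores(salient_flags_per_group):
    """Fraction of salient audio tokens in each time group (assumed score).

    salient_flags_per_group -- list of non-empty lists of 0/1 flags,
        one list per time group
    """
    return [sum(flags) / len(flags) for flags in salient_flags_per_group]


def video_keep_counts(scores, tokens_per_group, budget):
    """Split a video-token budget across time groups in proportion to score.

    scores -- audio retention score per time group
    tokens_per_group -- number of video tokens available in each group
    budget -- total number of video tokens to keep (hypothetical knob)
    """
    total = sum(scores)
    if total == 0:
        # No salient audio anywhere: fall back to a uniform split.
        even = budget // len(scores)
        return [min(tokens_per_group, even)] * len(scores)
    return [min(tokens_per_group, int(budget * s / total)) for s in scores]
```

With scores `[0.5, 0.25, 0.25]` and a budget of 8 tokens, the first group keeps 4 video tokens and the other two keep 2 each.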
[44] XAttn-BMD: Multimodal Deep Learning with Cross-Attention for Femoral Neck Bone Mineral Density Estimation
Yilin Zhang, Leo D. Westbury, Elaine M. Dennison, Nicholas C. Harvey, Nicholas R. Fuggle, Rahman Attar
🧩 TL;DR
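This study presents XAttn-BMD, a multimodal deep learning framework based on a bidirectional cross-attention mechanism that predicts femoral neck bone mineral density from hip X-ray images and clinical metadata, outperforming baseline models in regression generalization and robustness.
📘 Detailed Summary
Motivation: Poor bone health is a significant public health concern. Low bone mineral density increases fracture risk and is a key feature of osteoporosis. Existing methods are limited in how they integrate multimodal data and handle imbalance in bone mineral density values, so a more effective prediction framework is needed.
Method: The framework uses a novel bidirectional cross-attention mechanism to dynamically integrate image and metadata features for cross-modal mutual reinforcement, and a Weighted Smooth L1 loss to address BMD imbalance and prioritize clinically significant cases.
Result: Experiments on data from the Hertfordshire Cohort Study show that the model outperforms baseline models in regression generalization and robustness. Compared with naive feature concatenation without cross-attention, it reduces MSE by 16.7% and MAE by 6.03% and increases the R2 score by 16.4%.
Conclusion: The study demonstrates the effectiveness of cross-attention for multimodal BMD prediction, and the customized loss function successfully addresses data imbalance. The model's binary classification performance at clinically relevant BMD thresholds indicates its potential in real-world settings.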
📄 Abstract
Poor bone health is a significant public health concern, and low bone mineral density (BMD) leads to an increased fracture risk, a key feature of osteoporosis. We present XAttn-BMD (Cross-Attention BMD), a multimodal deep learning framework that predicts femoral neck BMD from hip X-ray images and structured clinical metadata. It utilizes a novel bidirectional cross-attention mechanism to dynamically integrate image and metadata features for cross-modal mutual reinforcement. A Weighted Smooth L1 loss is tailored to address BMD imbalance and prioritize clinically significant cases. Extensive experiments on the data from the Hertfordshire Cohort Study show that our model outperforms the baseline models in regression generalization and robustness. Ablation studies confirm the effectiveness of both cross-attention fusion and the customized loss function. Experimental results show that the integration of multimodal data via cross-attention outperforms naive feature concatenation without cross-attention, reducing MSE by 16.7%, MAE by 6.03%, and increasing the R2 score by 16.4%, highlighting the effectiveness of the approach for femoral neck BMD estimation. Furthermore, screening performance was evaluated using binary classification at clinically relevant femoral neck BMD thresholds, demonstrating the model's potential in real-world scenarios.
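A weighted Smooth L1 loss can be sketched as below. The per-sample Smooth L1 term follows the standard definition (quadratic for small errors, linear for large ones); the weighting scheme is where the paper's tailoring would live, and since the abstract does not specify it, the per-sample `weights` argument and the weighted-mean reduction are assumptions.

```python
def smooth_l1(error, beta=1.0):
    """Standard Smooth L1: quadratic below beta, linear above it."""
    a = abs(error)
    if a < beta:
        return 0.5 * a * a / beta
    return a - 0.5 * beta


def weighted_smooth_l1(preds, targets, weights, beta=1.0):
    """Weighted mean of per-sample Smooth L1 losses.

    weights -- per-sample importance, e.g. larger for under-represented
               or clinically significant BMD ranges (hypothetical scheme)
    """
    total = sum(w * smooth_l1(p - t, beta)
                for p, t, w in zip(preds, targets, weights))
    return total / sum(weights)
```

Giving a rare low-BMD sample a weight of 3 makes its error count three times as much as a common sample's, which is one simple way to counter imbalance in a regression target.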
[45] Fusing Biomechanical and Spatio-Temporal Features for Fall Prediction: Characterizing and Mitigating the Simulation-to-Reality Gap
Md Fokhrul Islam, Sajeda Al-Hammouri, Christopher J. Arellano, Kavan Hazeli, Heman Shakeri
🧩 TL;DR
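This study proposes the Biomechanical Spatio-Temporal Graph Convolutional Network (BioST-GCN), a dual-stream model that combines pose and biomechanical information for vision-based fall prediction. The model clearly outperforms the baseline on simulated datasets but reveals a substantial performance gap between simulated and real-world data.
📘 Detailed Summary
Motivation: The main obstacle for fall prediction systems is the scarcity of real fall data, which limits the development of vision-based prediction models. Simulated data offers an alternative, but there is a significant generalization gap between simulated and real data, particularly given the distinctive movement patterns of older adults.
Method: BioST-GCN is a dual-stream model that combines pose and biomechanical information through a cross-attention fusion mechanism. A spatio-temporal graph convolutional network processes the pose data, a second stream carries biomechanical features, and attention mechanisms identify critical joints and temporal phases.
Result: On the simulated MCF-UA stunt-actor and MUVIM datasets, BioST-GCN improves F1-score over the vanilla ST-GCN baseline by 5.32% and 2.91%, respectively. It reaches an 89.0% F1-score with full supervision on simulated data, but zero-shot generalization to unseen subjects drops to 35.9%, exposing a significant simulation-reality gap.
Conclusion: The study underscores the urgency of bridging the gap between simulated and real-world data, especially for the distinctive movement patterns of frail older adults. It proposes personalization strategies and privacy-preserving data pipelines to enable real-world validation, which is essential for building effective fall prediction systems for older adults.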
📄 Abstract
Falls are a leading cause of injury and loss of independence among older adults. Vision-based fall prediction systems offer a non-invasive solution to anticipate falls seconds before impact, but their development is hindered by the scarcity of available fall data. Contributing to these efforts, this study proposes the Biomechanical Spatio-Temporal Graph Convolutional Network (BioST-GCN), a dual-stream model that combines both pose and biomechanical information using a cross-attention fusion mechanism. Our model outperforms the vanilla ST-GCN baseline by 5.32% and 2.91% F1-score on the simulated MCF-UA stunt-actor and MUVIM datasets, respectively. The spatio-temporal attention mechanisms in the ST-GCN stream also provide interpretability by identifying critical joints and temporal phases. However, a critical simulation-reality gap persists. While our model achieves an 89.0% F1-score with full supervision on simulated data, zero-shot generalization to unseen subjects drops to 35.9%. This performance decline is likely due to biases in simulated data, such as 'intent-to-fall' cues. For older adults, particularly those with diabetes or frailty, this gap is exacerbated by their unique kinematic profiles. To address this, we propose personalization strategies and advocate for privacy-preserving data pipelines to enable real-world validation. Our findings underscore the urgent need to bridge the gap between simulated and real-world data to develop effective fall prediction systems for vulnerable elderly populations.
cs.CL [Back]
[46] Knowledge-Grounded Agentic Large Language Models for Multi-Hazard Understanding from Reconnaissance Reports
Chenchen Kuai, Zihao Li, Braden Rosen, Stephanie Paan, Navid Jafari, Jean-Louis Briaud, Yunlong Zhang, Youssef M. A. Hashash, Yang Zhou
🧩 TL;DR
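This study presents MoRA-RAG, a framework that uses a Mixture-of-Retrieval mechanism and agentic components to turn disaster reconnaissance reports into a structured knowledge foundation. It reaches up to 94.5% accuracy on multi-hazard reasoning, clearly outperforming existing methods and reducing hallucinations.
📘 Detailed Summary
Motivation: Disaster reconnaissance reports contain critical evidence for understanding multi-hazard interactions, but their unstructured narratives make systematic knowledge transfer difficult. Existing large language models produce unreliable or hallucinated outputs when domain grounding is absent, so a knowledge-grounded framework is needed to analyze these reports reliably.
Method: MoRA-RAG integrates a Mixture-of-Retrieval mechanism that dynamically routes queries across hazard-specific databases and uses agentic chunking to preserve contextual coherence during retrieval. It also includes a verification loop that assesses evidence sufficiency, refines queries, and initiates targeted searches when information remains incomplete.
Result: On HazardRecQA, a dataset built from GEER reconnaissance reports, MoRA-RAG achieves up to 94.5% accuracy, outperforming zero-shot LLMs by 30% and state-of-the-art RAG systems by 10%. It also reduces hallucinations across diverse LLM architectures and enables open-weight LLMs to perform comparably to proprietary models.
Conclusion: MoRA-RAG establishes a new paradigm for transforming post-disaster documentation into actionable, trustworthy intelligence for hazard resilience. It shows that a knowledge-grounded framework improves LLM reliability and domain adaptation, and provides a practical analysis tool for disaster management.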
📄 Abstract
Post-disaster reconnaissance reports contain critical evidence for understanding multi-hazard interactions, yet their unstructured narratives make systematic knowledge transfer difficult. Large language models (LLMs) offer new potential for analyzing these reports, but often generate unreliable or hallucinated outputs when domain grounding is absent. This study introduces the Mixture-of-Retrieval Agentic RAG (MoRA-RAG), a knowledge-grounded LLM framework that transforms reconnaissance reports into a structured foundation for multi-hazard reasoning. The framework integrates a Mixture-of-Retrieval mechanism that dynamically routes queries across hazard-specific databases while using agentic chunking to preserve contextual coherence during retrieval. It also includes a verification loop that assesses evidence sufficiency, refines queries, and initiates targeted searches when information remains incomplete. We construct HazardRecQA by deriving question-answer pairs from GEER reconnaissance reports, which document 90 global events across seven major hazard types. MoRA-RAG achieves up to 94.5 percent accuracy, outperforming zero-shot LLMs by 30 percent and state-of-the-art RAG systems by 10 percent, while reducing hallucinations across diverse LLM architectures. MoRA-RAG also enables open-weight LLMs to achieve performance comparable to proprietary models. It establishes a new paradigm for transforming post-disaster documentation into actionable, trustworthy intelligence for hazard resilience.
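The verification loop (assess evidence sufficiency, refine the query, search again) can be sketched as a small control loop. The retrieval, sufficiency check, and query refinement steps are passed in as plain callables because the abstract does not describe their interfaces; the function name, its parameters, and the `max_rounds` cap are all hypothetical.

```python
def gather_evidence(question, retrieve, is_sufficient, refine, max_rounds=3):
    """Collect evidence until it is judged sufficient or rounds run out.

    retrieve(query)                   -> list of evidence items
    is_sufficient(question, evidence) -> bool
    refine(question, evidence)        -> a new, more targeted query
    All three callables are placeholders for LLM- or retriever-backed
    components; max_rounds is an assumed safeguard against endless loops.
    """
    query = question
    evidence = []
    for _ in range(max_rounds):
        evidence.extend(retrieve(query))
        if is_sufficient(question, evidence):
            break
        # Evidence is still incomplete: issue a targeted follow-up search.
        query = refine(question, evidence)
    return evidence
```

Capping the number of rounds keeps latency bounded when the databases simply do not contain an answer, at the cost of sometimes returning incomplete evidence.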
[47] HiEAG: Evidence-Augmented Generation for Out-of-Context Misinformation Detection
Junjie Wu, Yumeng Fu, Nan Yu, Guohong Fu
🧩 TL;DR
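This paper proposes HiEAG, a framework that improves multimodal out-of-context (OOC) misinformation detection through hierarchical evidence-augmented generation. It uses the knowledge of multimodal large language models (MLLMs) to strengthen external consistency checking and surpasses previous state-of-the-art methods on multiple benchmark datasets.
📘 Detailed Summary
Motivation: Existing multimodal OOC misinformation detection methods focus mainly on internal consistency and overlook the importance of checking external consistency between image-text pairs and external evidence, which limits further gains in detection performance.
Method: HiEAG decomposes external consistency checking into a pipeline of retrieval, reranking, and rewriting. The evidence reranking module uses Automatic Evidence Selection Prompting (AESP) to pick the relevant evidence item, and the evidence rewriting module uses Automatic Evidence Generation Prompting (AEGP) to improve the task adaptation of MLLM-based misinformation detectors.
Result: Experiments on multiple benchmark datasets show that HiEAG surpasses previous state-of-the-art methods in accuracy over all samples. It can also explain its judgments, and it achieves strong performance with instruction tuning.
Conclusion: The study highlights the key role of external consistency checking in multimodal misinformation detection. The proposed hierarchical evidence-augmented framework offers an effective way to improve how well MLLMs adapt to the misinformation detection task and has clear practical value.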
📄 Abstract
Recent advancements in multimodal out-of-context (OOC) misinformation detection have made remarkable progress in checking the consistency between different modalities to support or refute image-text pairs. However, existing OOC misinformation detection methods tend to emphasize the role of internal consistency, ignoring the significance of external consistency between image-text pairs and external evidence. In this paper, we propose HiEAG, a novel Hierarchical Evidence-Augmented Generation framework that refines external consistency checking by leveraging the extensive knowledge of multimodal large language models (MLLMs). Our approach decomposes external consistency checking into a comprehensive engine pipeline that integrates reranking and rewriting in addition to retrieval. The evidence reranking module uses Automatic Evidence Selection Prompting (AESP) to acquire the relevant evidence item from the products of evidence retrieval. Subsequently, the evidence rewriting module leverages Automatic Evidence Generation Prompting (AEGP) to improve task adaptation on MLLM-based OOC misinformation detectors. Furthermore, our approach enables explanation of judgments and achieves impressive performance with instruction tuning. Experimental results on different benchmark datasets demonstrate that our proposed HiEAG surpasses previous state-of-the-art (SOTA) methods in accuracy over all samples.
[48] Stealth Fine-Tuning: Efficiently Breaking Alignment in RVLMs Using Self-Generated CoT
Le Yu, Zhengyue Zhao, Yawen Zheng, Yunhao Liu
🧩 TL;DR
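This paper proposes Stealth Fine-Tuning, an attack that uses segment-level interference and self-generated supervision data to bypass the safety alignment of reasoning-augmented vision-language models (RVLMs), needing only 499 samples and under 3 hours.
📘 Detailed Summary
Motivation: RVLMs rely on safety alignment to prevent harmful behavior, but their exposed chain-of-thought traces introduce a new attack surface, and existing safety defenses risk being bypassed easily.
Method: The method elicits harmful reasoning traces through segment-level interference, reuses the self-generated outputs as supervised fine-tuning data, and applies a turn-based weighted loss design to obtain a lightweight, distribution-consistent fine-tuning strategy.
Result: With QLoRA on a single A100, in under 3 hours and with only 499 samples, Stealth Fine-Tuning outperforms IDEATOR by 38.52% in ASR while preserving the original representation distribution and general reasoning ability.
Conclusion: The study shows that current RVLM safety alignment is seriously fragile. As a low-cost, effective attack, Stealth Fine-Tuning indicates that model safety defenses need a more comprehensive evaluation framework.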
📄 Abstract
Reasoning-augmented Vision-Language Models (RVLMs) rely on safety alignment to prevent harmful behavior, yet their exposed chain-of-thought (CoT) traces introduce new attack surfaces. In this work, we find that the safety alignment of RVLMs can be easily broken through a novel attack method termed Stealth Fine-Tuning. Our method elicits harmful reasoning traces through segment-level interference and reuses the self-generated outputs as supervised fine-tuning data. Combined with a turn-based weighted loss design, this yields a lightweight, distribution-consistent fine-tuning method. In our experiment, with only 499 samples and under 3 hours on a single A100 (QLoRA), Stealth Fine-Tuning outperforms IDEATOR by 38.52% ASR while preserving general reasoning ability, as the tuned model retains the original representation distribution. Experiments on AdvBench and several general benchmarks demonstrate that Stealth Fine-Tuning is a low-cost and highly effective way to bypass alignment defenses. Disclaimer: This paper contains content that may be disturbing or offensive.
[49] Synthetic Clinical Notes for Rare ICD Codes: A Data-Centric Framework for Long-Tail Medical Coding
Truong Vo, Weiyi Wu, Kaize Ding
🧩 TL;DR
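This paper proposes a data-centric framework that addresses the extreme long-tail distribution in ICD coding by generating high-quality synthetic discharge summaries. The method significantly expands the training distribution and modestly improves macro-F1 while maintaining strong micro-F1, outperforming prior state-of-the-art methods.
📘 Detailed Summary
Motivation: Automatic ICD coding from clinical text is a key task in medical NLP, but it is hindered by the extreme long-tail distribution of diagnostic codes. Thousands of rare and zero-shot ICD codes are severely underrepresented in datasets such as MIMIC-III, which leads to low macro-F1 scores.
Method: The method constructs realistic multi-label code sets anchored on rare codes, drawing on real-world co-occurrence patterns, ICD descriptions, synonyms, taxonomy, and similar clinical notes. These structured prompts are used to generate 90,000 synthetic notes covering 7,902 ICD codes, which significantly expands the training distribution. Two state-of-the-art transformer-based models, PLM-ICD and GKI-ICD, are fine-tuned on both the original and extended datasets.
Result: Experiments show that the method modestly improves macro-F1 while maintaining strong micro-F1, outperforming prior state-of-the-art methods. Although the gain may seem marginal relative to the computational cost, the results show that carefully crafted synthetic data can improve equity in long-tail ICD code prediction.
Conclusion: The findings indicate that carefully crafted synthetic data can improve equity in long-tail ICD code prediction. Although the performance gain may be limited relative to the computational cost, the approach offers a useful direction for tackling data imbalance in medical NLP.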
📄 Abstract
Automatic ICD coding from clinical text is a critical task in medical NLP but remains hindered by the extreme long-tail distribution of diagnostic codes. Thousands of rare and zero-shot ICD codes are severely underrepresented in datasets like MIMIC-III, leading to low macro-F1 scores. In this work, we propose a data-centric framework that generates high-quality synthetic discharge summaries to mitigate this imbalance. Our method constructs realistic multi-label code sets anchored on rare codes by leveraging real-world co-occurrence patterns, ICD descriptions, synonyms, taxonomy, and similar clinical notes. Using these structured prompts, we generate 90,000 synthetic notes covering 7,902 ICD codes, significantly expanding the training distribution. We fine-tune two state-of-the-art transformer-based models, PLM-ICD and GKI-ICD, on both the original and extended datasets. Experiments show that our approach modestly improves macro-F1 while maintaining strong micro-F1, outperforming prior SOTA. While the gain may seem marginal relative to the computational cost, our results demonstrate that carefully crafted synthetic data can enhance equity in long-tail ICD code prediction.
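The gap between micro-F1 and macro-F1 that motivates this work can be made concrete with a small sketch. The label names and the toy predictions below are invented for illustration; they are not drawn from MIMIC-III or from the paper's experiments.

```python
from collections import Counter

def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(gold, pred, labels):
    """gold, pred: lists of label sets, one set per document."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        for c in labels:
            if c in g and c in p:
                tp[c] += 1
            elif c in p:
                fp[c] += 1
            elif c in g:
                fn[c] += 1
    # Micro-F1 pools counts over all labels, so frequent codes dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro-F1 averages per-label F1, so each rare code counts equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro

# Toy data: ten notes carry a frequent code, one of them also a rare code.
gold = [{"common"}] * 9 + [{"common", "rare"}]
pred = [{"common"}] * 10  # a model that never predicts the rare code
micro, macro = micro_macro_f1(gold, pred, ["common", "rare"])
print(f"micro-F1={micro:.3f} macro-F1={macro:.3f}")
```

A model that ignores the rare code still scores about 0.95 micro-F1 here, while macro-F1 drops to 0.5. This is why macro-F1 is the metric that reflects long-tail performance.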
[50] Towards Authentic Movie Dubbing with Retrieve-Augmented Director-Actor Interaction Learning
Rui Liu, Yuan Zhao, Zhenqi Jia
🧩 TL;DR
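This paper proposes Authentic-Dubber, a retrieve-augmented director-actor interaction learning scheme that mimics how directors guide actors in a real movie dubbing workflow, improving the emotional expressiveness of the dubbing. The method achieves comprehensive improvements on the V2C Animation benchmark dataset.
📘 Detailed Summary
Motivation: Existing automatic movie dubbing models simplify the workflow by assuming that actors dub directly without preparation, overlooking the critical interaction between director and actor. In authentic dubbing, the director actively guides the actor to internalize contextual cues, especially emotion, before performing, and existing methods ignore this interaction.
Method: Authentic-Dubber contains three core mechanisms. First, it builds a multimodal Reference Footage library that simulates the learning material a director provides, using LLMs to achieve deep comprehension of emotional representations across multimodal signals. Second, it applies an Emotion-Similarity-based Retrieval-Augmentation strategy that retrieves the multimodal information most relevant to the target silent video. Third, it uses a Progressive Graph-based speech generation approach that incrementally incorporates the retrieved multimodal emotional knowledge, simulating the actor's final dubbing process.
Result: Both subjective and objective evaluations on the V2C Animation benchmark dataset validate the method and show comprehensive improvements in emotional expressiveness. The experiments indicate that the method faithfully replicates the authentic dubbing workflow and performs better on multiple evaluation metrics.
Conclusion: By simulating the real director-actor interaction, the study offers an automatic movie dubbing solution that better matches the actual workflow and underlines the importance of context understanding and emotion internalization in dubbing. The work opens a new direction for multimodal emotion understanding and generation and shows the potential of retrieval augmentation and progressive integration in complex multimedia tasks.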
📄 Abstract
The automatic movie dubbing model generates vivid speech from given scripts, replicating a speaker's timbre from a brief timbre prompt while ensuring lip-sync with the silent video. Existing approaches simulate a simplified workflow where actors dub directly without preparation, overlooking the critical director-actor interaction. In contrast, authentic workflows involve a dynamic collaboration: directors actively engage with actors, guiding them to internalize the context cues, specifically emotion, before performance. To address this issue, we propose a new Retrieve-Augmented Director-Actor Interaction Learning scheme to achieve authentic movie dubbing, termed Authentic-Dubber, which contains three novel mechanisms: (1) We construct a multimodal Reference Footage library to simulate the learning footage provided by directors. Note that we integrate Large Language Models (LLMs) to achieve deep comprehension of emotional representations across multimodal signals. (2) To emulate how actors efficiently and comprehensively internalize director-provided footage during dubbing, we propose an Emotion-Similarity-based Retrieval-Augmentation strategy. This strategy retrieves the most relevant multimodal information that aligns with the target silent video. (3) We develop a Progressive Graph-based speech generation approach that incrementally incorporates the retrieved multimodal emotional knowledge, thereby simulating the actor's final dubbing process. The above mechanisms enable the Authentic-Dubber to faithfully replicate the authentic dubbing workflow, achieving comprehensive improvements in emotional expressiveness. Both subjective and objective evaluations on the V2C Animation benchmark dataset validate the effectiveness. The code and demos are available at https://github.com/AI-S2-Lab/Authentic-Dubber.
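The retrieval step in mechanism (2) can be pictured as a nearest-neighbor search over emotion embeddings. The sketch below assumes cosine similarity and uses made-up three-dimensional vectors and clip names. The abstract does not specify the similarity measure or the embedding model, so both are assumptions for illustration only.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve_top_k(query_vec, library, k=2):
    """Return the names of the k library clips most similar to the query."""
    ranked = sorted(library.items(),
                    key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical reference footage library: clip name -> emotion embedding.
library = {
    "clip_happy": [0.9, 0.1, 0.0],
    "clip_sad": [0.0, 0.2, 0.9],
    "clip_excited": [0.8, 0.3, 0.1],
}
target_video_emotion = [1.0, 0.2, 0.0]  # hypothetical embedding of the silent video
print(retrieve_top_k(target_video_emotion, library))
```

The retrieved clips would then feed the generation stage. How the real system fuses them is described only at a high level in the abstract and is not modeled here.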
[51] AfriSpeech-MultiBench: A Verticalized Multidomain Multicountry Benchmark Suite for African Accented English ASR
Gabrial Zencha Ashungafac, Mardhiyah Sanni, Busayo Awobade, Alex Gichamba, Tobi Olatunji
🧩 TL;DR
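This study presents AfriSpeech-MultiBench, the first domain-specific evaluation suite for African-accented English. It covers more than 100 accents and seven application domains and fills the gap in public evaluations that reflect Africa's linguistic diversity.
📘 Detailed Summary
Motivation: Although speech AI is advancing rapidly worldwide, there is no publicly available, application-specific model evaluation that caters to Africa's linguistic diversity, which limits the adoption and development of speech technology in African communities.
Method: The authors build an evaluation suite covering more than 10 countries and over 100 African English accents across seven application domains, including finance, legal, and medical. They benchmark open, closed, unimodal ASR, and multimodal LLM-based speech recognition systems.
Result: The empirical analysis shows that open-source ASR models excel on spontaneous speech but degrade on noisy, non-native dialogue. Multimodal LLMs are more robust to accents but struggle with domain-specific named entities. Proprietary models are highly accurate on clean speech but vary significantly by country and domain.
Conclusion: By releasing this benchmark, the authors enable researchers and practitioners to select speech technologies suited to African use cases and to build inclusive voice applications for underserved communities. The study also finds that models fine-tuned on African English achieve competitive accuracy with lower latency.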
📄 Abstract
Recent advances in speech-enabled AI, including Google's NotebookLM and OpenAI's speech-to-speech API, are driving widespread interest in voice interfaces globally. Despite this momentum, there exists no publicly available application-specific model evaluation that caters to Africa's linguistic diversity. We present AfriSpeech-MultiBench, the first domain-specific evaluation suite for over 100 African English accents across 10+ countries and seven application domains: Finance, Legal, Medical, General dialogue, Call Center, Named Entities and Hallucination Robustness. We benchmark a diverse range of open, closed, unimodal ASR and multimodal LLM-based speech recognition systems using both spontaneous and non-spontaneous speech conversations drawn from various open African-accented English speech datasets. Our empirical analysis reveals systematic variation: open-source ASR models excel in spontaneous speech contexts but degrade on noisy, non-native dialogue; multimodal LLMs are more accent-robust yet struggle with domain-specific named entities; proprietary models deliver high accuracy on clean speech but vary significantly by country and domain. Models fine-tuned on African English achieve competitive accuracy with lower latency, a practical advantage for deployment. Hallucinations, however, remain a major problem for most SOTA models. By releasing this comprehensive benchmark, we empower practitioners and researchers to select voice technologies suited to African use cases, fostering inclusive voice applications for underserved communities.
[52] MedBench v4: A Robust and Scalable Benchmark for Evaluating Chinese Medical Language Models, Multimodal Models, and Intelligent Agents
Jinru Ding, Lu Lu, Chao Ding, Mouxiao Bian, Jiayuan Chen, Renjie Lu, Wenrao Pang, Xiaoqin Wu, Zhiqiang Liu, Luyi Jiang, Bing Han, Yunqiu Wang, Jie Xu
🧩 TL;DR
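This study presents MedBench v4, a nationwide, cloud-based medical AI benchmarking platform with over 700,000 expert-curated tasks. An evaluation of 15 frontier models finds that base models score poorly on safety and ethics, while agents built on the same backbones substantially improve end-to-end performance.
📘 Detailed Summary
Motivation: The rapid development of medical LLMs, multimodal models, and agents calls for evaluation frameworks that reflect real clinical workflows and safety constraints. Existing benchmarks fall short in task scale, specialty coverage, and clinical relevance.
Method: The authors build a cloud-based benchmarking infrastructure with over 700,000 expert-curated tasks spanning 24 primary and 91 secondary specialties. Items go through multi-stage refinement and multi-round clinician review, and open-ended answers are scored by an LLM-as-a-judge calibrated to human ratings.
Result: Base LLMs average 54.1/100 overall (best: Claude Sonnet 4.5 at 62.5/100), but their safety and ethics scores remain low at 18.4/100. Multimodal models average 47.5/100, with solid perception but weaker cross-modal reasoning. Agents built on the same backbones raise the mean to 79.8/100, and Claude Sonnet 4.5-based agents reach up to 88.9/100 on safety tasks.
Conclusion: The study reveals persistent gaps in multimodal reasoning and safety for base models, while governance-aware agentic orchestration markedly improves clinical readiness without sacrificing capability. The platform gives Chinese hospitals, developers, and policymakers a practical reference for auditing medical AI.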
📄 Abstract
Recent advances in medical large language models (LLMs), multimodal models, and agents demand evaluation frameworks that reflect real clinical workflows and safety constraints. We present MedBench v4, a nationwide, cloud-based benchmarking infrastructure comprising over 700,000 expert-curated tasks spanning 24 primary and 91 secondary specialties, with dedicated tracks for LLMs, multimodal models, and agents. Items undergo multi-stage refinement and multi-round review by clinicians from more than 500 institutions, and open-ended responses are scored by an LLM-as-a-judge calibrated to human ratings. We evaluate 15 frontier models. Base LLMs reach a mean overall score of 54.1/100 (best: Claude Sonnet 4.5, 62.5/100), but safety and ethics remain low (18.4/100). Multimodal models perform worse overall (mean 47.5/100; best: GPT-5, 54.9/100), with solid perception yet weaker cross-modal reasoning. Agents built on the same backbones substantially improve end-to-end performance (mean 79.8/100), with Claude Sonnet 4.5-based agents achieving up to 85.3/100 overall and 88.9/100 on safety tasks. MedBench v4 thus reveals persisting gaps in multimodal reasoning and safety for base models, while showing that governance-aware agentic orchestration can markedly enhance benchmarked clinical readiness without sacrificing capability. By aligning tasks with Chinese clinical guidelines and regulatory priorities, the platform offers a practical reference for hospitals, developers, and policymakers auditing medical AI.
[53] Bridging Human and Model Perspectives: A Comparative Analysis of Political Bias Detection in News Media Using Large Language Models
Shreya Adrita Banik, Niaz Nafi Rahman, Tahsina Moiukh, Farig Sadeque
🧩 TL;DR
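This study presents a comparative framework for evaluating how well human annotations and multiple large language models agree on political bias detection. RoBERTa aligns best with human labels among traditional transformer models, while GPT shows the strongest overall agreement in a zero-shot setting.
📘 Detailed Summary
Motivation: Although advances in natural language processing have made automatic bias classification possible, the extent to which LLMs align with human judgment in political bias detection remains underexplored and poorly understood. This study aims to fill that gap.
Method: The authors construct a manually annotated dataset of news articles and use a comparative evaluation framework to quantify how human and model perceptions of bias diverge. They evaluate several models, including GPT, BERT, RoBERTa, and FLAN, and analyze annotation consistency, bias polarity, and inter-model agreement.
Result: Among traditional transformer-based models, RoBERTa achieves the highest alignment with human labels, whereas generative models such as GPT show the strongest overall agreement with human annotations in a zero-shot setting. The fine-tuned RoBERTa model achieves the highest accuracy and the strongest alignment with human labels among all transformer baselines.
Conclusion: The study finds systematic differences in how humans and LLMs perceive political slant and underscores the need for hybrid evaluation frameworks that combine human interpretability with model scalability in automated media bias detection.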
📄 Abstract
Detecting political bias in news media is a complex task that requires interpreting subtle linguistic and contextual cues. Although recent advances in Natural Language Processing (NLP) have enabled automatic bias classification, the extent to which large language models (LLMs) align with human judgment remains underexplored and poorly understood. This study presents a comparative framework for evaluating the detection of political bias across human annotations and multiple LLMs, including GPT, BERT, RoBERTa, and FLAN. We construct a manually annotated dataset of news articles and assess annotation consistency, bias polarity, and inter-model agreement to quantify divergence between human and model perceptions of bias. Experimental results show that among traditional transformer-based models, RoBERTa achieves the highest alignment with human labels, whereas generative models such as GPT demonstrate the strongest overall agreement with human annotations in a zero-shot setting. Among all transformer-based baselines, our fine-tuned RoBERTa model achieved the highest accuracy and the strongest alignment with human-annotated labels. Our findings highlight systematic differences in how humans and LLMs perceive political slant, underscoring the need for hybrid evaluation frameworks that combine human interpretability with model scalability in automated media bias detection.
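Agreement between human and model labels is often summarized with a chance-corrected statistic. The sketch below uses Cohen's kappa as one common choice. The abstract does not say which agreement measure the authors use, so the metric, the label set, and the toy labels are all assumptions for illustration.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both raters labeled independently at their own rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[l] * count_b[l]
                   for l in set(labels_a) | set(labels_b)) / (n * n)
    return (observed - expected) / (1 - expected) if expected != 1 else 1.0

# Hypothetical bias labels for six articles.
human = ["left", "left", "right", "center", "right", "center"]
model = ["left", "right", "right", "center", "right", "left"]
print(round(cohen_kappa(human, model), 3))
```

Raw agreement here is 4 of 6 items, but kappa discounts the share expected by chance, which makes scores comparable across models with different label distributions.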
[54] Enhancing Agentic Autonomous Scientific Discovery with Vision-Language Model Capabilities
Kahaan Gandhi, Boris Bolliet, Inigo Zubeldia
🧩 TL;DR
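This study proposes a multi-agent system guided by vision-language models (VLMs) that treats plots as verifiable checkpoints to steer autonomous scientific discovery. The system corrects reasoning errors in real time, adapts to new datasets, and clearly outperforms baseline methods on 10 data-driven discovery tasks.
📘 Detailed Summary
Motivation: Current autonomous scientific discovery systems suffer from faulty reasoning paths and lack real-time self-correction, especially on complex data analysis and visualization tasks. Traditional approaches struggle to evaluate and fix errors dynamically during analysis, which limits the reliability and adaptability of end-to-end scientific discovery.
Method: A VLM acts as a judge and treats plots as verifiable checkpoints, evaluating figure quality against dynamically generated, domain-specific rubrics. The multi-agent system uses the VLM's feedback to correct errors in real time and to steer exploratory data analysis, adapting to new datasets without human intervention.
Result: In case studies in cosmology and astrochemistry, the system recovers from faulty reasoning paths and adapts to new datasets. On a 10-task benchmark for data-driven discovery, VLM-augmented systems achieve pass@1 scores of 0.7-0.8, compared with 0.2-0.3 for code-only and 0.4-0.5 for code-and-text baselines, while also providing auditable reasoning traces.
Conclusion: VLM-guided multi-agent systems markedly improve the reliability and adaptability of autonomous scientific discovery, using real-time error correction and dynamic evaluation to address the limitations of traditional approaches. The method improves performance and interpretability and offers a new technical path for automated scientific research.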
📄 Abstract
We show that multi-agent systems guided by vision-language models (VLMs) improve end-to-end autonomous scientific discovery. By treating plots as verifiable checkpoints, a VLM-as-a-judge evaluates figures against dynamically generated domain-specific rubrics, enabling agents to correct their own errors and steer exploratory data analysis in real time. Case studies in cosmology and astrochemistry demonstrate recovery from faulty reasoning paths and adaptation to new datasets without human intervention. On a 10-task benchmark for data-driven discovery, VLM-augmented systems achieve pass@1 scores of 0.7-0.8, compared to 0.2-0.3 for code-only and 0.4-0.5 for code-and-text baselines, while also providing auditable reasoning traces that improve interpretability. Code is available at https://github.com/CMBAgents/cmbagent
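For readers unfamiliar with the pass@1 metric quoted above, the following sketch computes it with the standard unbiased pass@k estimator, which reduces to the fraction of correct attempts when k = 1. The abstract does not state how many attempts per task were run or how the score was aggregated, so the per-task counts below are invented.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n attempts with c correct ones."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct attempt
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_1(results):
    """results: list of (n_attempts, n_correct) pairs, one per task."""
    return sum(pass_at_k(n, c, 1) for n, c in results) / len(results)

# Hypothetical 10-task benchmark with a single attempt per task, 7 solved.
results = [(1, 1)] * 7 + [(1, 0)] * 3
print(benchmark_pass_at_1(results))
```

With one attempt per task, pass@1 is simply the share of tasks solved on the first try, so 7 solved tasks out of 10 give 0.7.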
cs.AI [Back]
[55] Jailbreaking Large Vision Language Models in Intelligent Transportation Systems
Badhan Chandra Das, Md Tasnim Jawad, Md Jueal Mia, M. Hadi Amini, Yanzhao Wu
🧩 TL;DR
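This paper systematically analyzes how vulnerable large vision language models (LVLMs) in intelligent transportation systems are to carefully crafted jailbreaking attacks. It proposes a new jailbreaking attack based on image typography manipulation and multi-turn prompting, and develops a multi-layered response filtering defense to protect the models.
📘 Detailed Summary
Motivation: LVLMs show strong capabilities in multimodal reasoning and real-world applications, but they have serious security weaknesses when integrated into intelligent transportation systems. In particular, they are highly vulnerable to jailbreaking attacks, which can make a model respond inappropriately to harmful queries and endanger transportation safety.
Method: The authors construct a dataset of transportation-related harmful queries that follows OpenAI's prohibited categories, develop a new jailbreaking attack that combines image typography manipulation with multi-turn prompting, and propose a multi-layered response filtering defense that prevents the model from generating inappropriate responses.
Result: Experiments are run on state-of-the-art open-source and closed-source LVLMs, with GPT-4 judgment and manual verification used to evaluate both the attack and the defense. The results show that jailbreaking attacks based on image typography manipulation and multi-turn prompting pose severe security risks for LVLMs integrated into intelligent transportation systems.
Conclusion: The study shows that LVLMs in intelligent transportation systems face a serious jailbreaking threat and need effective defenses to keep these systems safe. It also highlights the destructive potential of combining image typography manipulation with multi-turn prompting and points to directions for future security research.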
📄 Abstract
Large Vision Language Models (LVLMs) demonstrate strong capabilities in multimodal reasoning and many real-world applications, such as visual question answering. However, LVLMs are highly vulnerable to jailbreaking attacks. This paper systematically analyzes the vulnerabilities of LVLMs integrated into Intelligent Transportation Systems (ITS) under carefully crafted jailbreaking attacks. First, we construct a dataset of harmful queries relevant to transportation, following OpenAI's prohibited categories, to which the LVLMs should not respond. Second, we introduce a novel jailbreaking attack that exploits the vulnerabilities of LVLMs through image typography manipulation and multi-turn prompting. Third, we propose a multi-layered response filtering defense technique to prevent the model from generating inappropriate responses. We perform extensive experiments with the proposed attack and defense on state-of-the-art LVLMs (both open-source and closed-source). To evaluate the attack method and defense technique, we use GPT-4's judgment to determine the toxicity score of the generated responses, as well as manual verification. Further, we compare our proposed jailbreaking method with existing jailbreaking techniques and highlight the severe security risks that image typography manipulation and multi-turn prompting pose for LVLMs integrated into ITS.
[56] Scene Graph-Guided Generative AI Framework for Synthesizing and Evaluating Industrial Hazard Scenarios
Sanjay Acharjee, Abir Khan Ratul, Diego Patino, Md Nazmus Sakib
🧩 TL;DR
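This study proposes a scene graph-guided generative AI framework that analyzes OSHA accident reports to generate realistic images of workplace hazard scenarios, and introduces a visual question answering (VQA) evaluation framework to check the realism and semantic fidelity of the generated data. Across four state-of-the-art generative models, the proposed VQA Graph Score shows higher discriminative sensitivity than CLIP and BLIP metrics.
📘 Detailed Summary
Motivation: Training vision models to detect workplace hazards accurately requires realistic images of unsafe conditions, but such datasets are hard to obtain because capturing accident-triggering scenes as they occur is nearly impossible. The study aims to overcome this data scarcity by generating hazard scene images grounded in historical Occupational Safety and Health Administration (OSHA) accident reports.
Method: The framework uses GPT-4o to analyze OSHA narratives and extract structured hazard reasoning, which is converted into object-level scene graphs that capture spatial and contextual relationships. These graphs guide a text-to-image diffusion model to generate compositionally accurate hazard scenes, and a VQA framework evaluates the realism and semantic fidelity of the generated data.
Result: Experiments on four state-of-the-art generative models show that the proposed VQA Graph Score outperforms CLIP and BLIP metrics under entropy-based validation, confirming its higher discriminative sensitivity. The evaluation framework can effectively check the realism and semantic accuracy of generated hazard scene images.
Conclusion: The study shows the potential of generative AI for creating hazard detection training data. Combining structured scene graph guidance with VQA-based evaluation provides a reliable data generation approach for workplace safety monitoring and a workable alternative for domains where real data is hard to obtain.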
📄 Abstract
Training vision models to detect workplace hazards accurately requires realistic images of unsafe conditions that could lead to accidents. However, acquiring such datasets is difficult because capturing accident-triggering scenarios as they occur is nearly impossible. To overcome this limitation, this study presents a novel scene graph-guided generative AI framework that synthesizes photorealistic images of hazardous scenarios grounded in historical Occupational Safety and Health Administration (OSHA) accident reports. OSHA narratives are analyzed using GPT-4o to extract structured hazard reasoning, which is converted into object-level scene graphs capturing spatial and contextual relationships essential for understanding risk. These graphs guide a text-to-image diffusion model to generate compositionally accurate hazard scenes. To evaluate the realism and semantic fidelity of the generated data, a visual question answering (VQA) framework is introduced. Across four state-of-the-art generative models, the proposed VQA Graph Score outperforms CLIP and BLIP metrics under entropy-based validation, confirming its higher discriminative sensitivity.
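An object-level scene graph can be thought of as a set of subject-predicate-object triples that are flattened into a prompt for the image generator. The sketch below is a minimal illustration under that assumption. The `Relation` class, the `graph_to_prompt` helper, the prompt template, and the example triples are all hypothetical and do not reproduce the authors' representation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Relation:
    """One scene graph edge: subject --predicate--> object."""
    subject: str
    predicate: str
    obj: str

def graph_to_prompt(relations, setting):
    """Flatten scene graph triples into a single text-to-image prompt."""
    clauses = [f"{r.subject} {r.predicate} {r.obj}" for r in relations]
    return f"Photorealistic {setting} scene: " + "; ".join(clauses) + "."

# Hypothetical hazard extracted from an accident narrative.
graph = [
    Relation("worker", "standing on", "ladder"),
    Relation("ladder", "leaning against", "power line"),
]
print(graph_to_prompt(graph, "construction site"))
```

Keeping the relations explicit also gives an evaluator a checklist: each triple can be turned into a yes/no question about the generated image, which is the general idea behind VQA-style scoring.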
[57] Run, Ruminate, and Regulate: A Dual-process Thinking System for Vision-and-Language Navigation
Yu Zhong, Zihao Zhang, Rui Zhang, Lingdong Huang, Haihan Gao, Shuo Wang, Da Li, Ruijian Han, Jiaming Guo, Shaohui Peng, Di Huang, Yunji Chen
🧩 TL;DR
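This paper proposes R3, a dual-process thinking framework that combines the generalization ability of large language models with expert knowledge from vision-and-language navigation (VLN) in a zero-shot manner. The method achieves state-of-the-art performance on the REVERIE benchmark, exceeding existing methods by 3.28% in SPL and 3.30% in RGSPL.
📘 Detailed Summary
Motivation: Current LLM-based VLN methods have two main problems. First, LLMs inherently struggle to understand real-world spatial correlations, which leaves a substantial gap in task completion performance relative to domain experts. Second, introducing LLMs brings substantial computational cost and inference latency, which limits practical use.
Method: R3 comprises three core modules. The Runner is a lightweight transformer-based expert model that ensures efficient and accurate navigation under regular circumstances. The Ruminator uses a powerful multimodal LLM as its backbone and applies chain-of-thought prompting to elicit structured reasoning. The Regulator monitors navigation progress and selects the appropriate thinking mode according to three criteria, coordinating the Runner and the Ruminator.
Result: Experiments show that R3 significantly outperforms other state-of-the-art methods on the REVERIE benchmark, improving SPL and RGSPL by 3.28% and 3.30% respectively. This gain demonstrates the method's effectiveness on challenging VLN tasks.
Conclusion: The study shows that combining LLM generalization with domain-specific expertise is effective and offers a new approach to complex navigation tasks. The success of the dual-process framework indicates that allocating computation sensibly can improve efficiency while preserving performance, which makes practical deployment feasible.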
📄 Abstract
Vision-and-Language Navigation (VLN) requires an agent to dynamically explore complex 3D environments following human instructions. Recent research underscores the potential of harnessing large language models (LLMs) for VLN, given their commonsense knowledge and general reasoning capabilities. Despite their strengths, a substantial gap in task completion performance persists between LLM-based approaches and domain experts, as LLMs inherently struggle to comprehend real-world spatial correlations precisely. Additionally, introducing LLMs is accompanied by substantial computational cost and inference latency. To address these issues, we propose a novel dual-process thinking framework dubbed R3, integrating LLMs' generalization capabilities with VLN-specific expertise in a zero-shot manner. The framework comprises three core modules: Runner, Ruminator, and Regulator. The Runner is a lightweight transformer-based expert model that ensures efficient and accurate navigation under regular circumstances. The Ruminator employs a powerful multimodal LLM as the backbone and adopts chain-of-thought (CoT) prompting to elicit structured reasoning. The Regulator monitors the navigation progress and controls the appropriate thinking mode according to three criteria, integrating Runner and Ruminator harmoniously. Experimental results illustrate that R3 significantly outperforms other state-of-the-art methods, exceeding them by 3.28% and 3.30% in SPL and RGSPL respectively on the REVERIE benchmark. This pronounced enhancement highlights the effectiveness of our method in handling challenging VLN tasks.
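SPL (Success weighted by Path Length), one of the two metrics reported above, rewards an agent for reaching the goal and penalizes detours. The sketch below implements the commonly used definition, where each successful episode contributes the ratio of the shortest path length to the larger of the taken and shortest path lengths. The episode values are invented, and RGSPL, which additionally requires correct object grounding, is not modeled.

```python
def spl(episodes):
    """episodes: list of (success, shortest_path_len, taken_path_len)."""
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            # Efficiency is 1.0 on the shortest path and shrinks with detours.
            total += shortest / max(taken, shortest)
    return total / len(episodes)

# Hypothetical episodes: an optimal success, a success with a 2x detour, a failure.
episodes = [(True, 10.0, 10.0), (True, 10.0, 20.0), (False, 10.0, 5.0)]
print(spl(episodes))
```

The three episodes contribute 1.0, 0.5, and 0.0, so the result is 0.5. A method can therefore raise SPL either by succeeding more often or by taking shorter routes.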
[58] Listen Like a Teacher: Mitigating Whisper Hallucinations using Adaptive Layer Attention and Knowledge Distillation
Kumud Tripathi, Aditya Srinivas Menon, Aman Gaurav, Raj Prakash Gohil, Pankaj Wasnik
🧩 TL;DR
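This paper proposes a two-stage architecture that strengthens encoder robustness with Adaptive Layer Attention and suppresses hallucinations with a multi-objective knowledge distillation framework, notably improving the reliability of the Whisper model under noisy conditions.
📘 Detailed Summary
Motivation: Whisper often produces hallucination errors under noisy acoustic conditions. Existing work focuses mainly on audio preprocessing or post-processing of transcriptions, and modifying the model itself to mitigate hallucinations directly remains largely unexplored.
Method: In the first stage, Adaptive Layer Attention groups encoder layers into semantically coherent blocks and fuses the block representations with a learnable multi-head attention module. In the second stage, a multi-objective knowledge distillation framework trains the student model on noisy audio so that its semantic and attention distributions align with those of a teacher model processing clean inputs.
Result: Experiments on noisy speech benchmarks show notable reductions in hallucinations and word error rates while preserving performance on clean speech.
Conclusion: Together, Adaptive Layer Attention and knowledge distillation offer a principled strategy for improving Whisper's reliability under real-world noisy conditions and an effective way to make ASR systems more robust in challenging environments.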
📄 Abstract
The Whisper model, an open-source automatic speech recognition system, is widely adopted for its strong performance across multilingual and zero-shot settings. However, it frequently suffers from hallucination errors, especially under noisy acoustic conditions. Previous works to reduce hallucinations in Whisper-style ASR systems have primarily focused on audio preprocessing or post-processing of transcriptions to filter out erroneous content. However, modifications to the Whisper model itself remain largely unexplored to mitigate hallucinations directly. To address this challenge, we present a two-stage architecture that first enhances encoder robustness through Adaptive Layer Attention (ALA) and further suppresses hallucinations using a multi-objective knowledge distillation (KD) framework. In the first stage, ALA groups encoder layers into semantically coherent blocks via inter-layer correlation analysis. A learnable multi-head attention module then fuses these block representations, enabling the model to jointly exploit low- and high-level features for more robust encoding. In the second stage, our KD framework trains the student model on noisy audio to align its semantic and attention distributions with a teacher model processing clean inputs. Our experiments on noisy speech benchmarks show notable reductions in hallucinations and word error rates, while preserving performance on clean speech. Together, ALA and KD offer a principled strategy to improve Whisper's reliability under real-world noisy conditions.
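Word error rate, the metric cited above alongside hallucination counts, is the word-level edit distance between the reference and the hypothesis divided by the reference length. The sketch below is a plain dynamic-programming implementation that assumes whitespace tokenization and a non-empty reference. The example sentences are invented to show how hallucinated trailing words register as insertions.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)

# Two hallucinated words appended to an otherwise correct transcript.
print(wer("turn the lights off", "turn the lights off thank you"))
```

The two extra words count as two insertions against a four-word reference, giving a WER of 0.5. WER can exceed 1.0 when a hypothesis contains many inserted words, which is one reason heavy hallucination shows up so strongly in this metric.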