cs.CV [Total: 60]
cs.CL [Total: 14]
cs.AI [Total: 8]

cs.CV

[1] Short-Window Sliding Learning for Real-Time Violence Detection via LLM-based Auto-Labeling

Seoik Jung, Taekyung Song, Yangro Lee, Sungjun Lee

🧩 TL;DR

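This paper proposes a Short-Window Sliding Learning framework for real-time violence detection in CCTV surveillance video. The method splits videos into short 1-2 second clips and uses LLM-based auto-labeling to build a fine-grained dataset, reaching 95.25% accuracy on RWF-2000 and significantly improving violence detection on long videos.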

📘 Detailed Summary

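Motivation: Conventional long-video training struggles to pinpoint violent events that unfold quickly, and fine-grained annotated data is scarce. This work targets two gaps in real-time violence detection, preserving temporal continuity and recognizing rapid events accurately, to give intelligent surveillance systems a more effective solution.

Method: A Short-Window Sliding Learning framework splits videos into 1-2 second clips and uses a large language model for automatic caption labeling to build a fine-grained dataset. Each clip uses all of its frames to preserve temporal continuity, which targets precise recognition of rapid violent events.

Result: The method reaches 95.25% accuracy on RWF-2000 and significantly improves performance on the long-video dataset UCF-Crime, to 83.25%. The experiments support its ability to preserve temporal continuity and recognize rapid violent events.

Conclusion: The Short-Window Sliding Learning framework shows strong generalization and practicality for real-time violence detection. Fine-grained clip processing combined with LLM-assisted labeling addresses the limitations of conventional long-video training and is directly relevant to intelligent surveillance systems.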

📄 Abstract

This paper proposes a Short-Window Sliding Learning framework for real-time violence detection in CCTV footage. Unlike conventional long-video training approaches, the proposed method divides videos into 1-2 second clips and applies Large Language Model (LLM)-based auto-caption labeling to construct fine-grained datasets. Each short clip fully utilizes all frames to preserve temporal continuity, enabling precise recognition of rapid violent events. Experiments demonstrate that the proposed method achieves 95.25% accuracy on RWF-2000 and significantly improves performance on long videos (UCF-Crime: 83.25%), confirming its strong generalization and real-time applicability in intelligent surveillance systems.
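
The clip-splitting step can be illustrated with a minimal sketch. The function name `short_windows`, the stride parameter, and the choice to drop a trailing partial window are assumptions for illustration; the abstract only states that videos are divided into 1-2 second clips and that every frame in a clip is used.

```python
from typing import List, Tuple


def short_windows(num_frames: int, fps: float,
                  window_s: float = 1.0,
                  stride_s: float = 0.5) -> List[Tuple[int, int]]:
    """Return [start, end) frame ranges for short sliding windows.

    window_s and stride_s are hypothetical parameters; a trailing
    partial window is dropped (an assumption, not from the paper).
    """
    win = max(1, round(window_s * fps))
    stride = max(1, round(stride_s * fps))
    spans = []
    start = 0
    while start + win <= num_frames:
        spans.append((start, start + win))  # all frames in the span are kept
        start += stride
    return spans
```

Each returned span would then be captioned and labeled independently, which is where the LLM-based auto-labeling described in the abstract would plug in.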

[2] MCN-CL: Multimodal Cross-Attention Network and Contrastive Learning for Multimodal Emotion Recognition

Feng Li, Ke Wu, Yongwei Li

🧩 TL;DR

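This paper proposes the Multimodal Cross-Attention Network and Contrastive Learning (MCN-CL) method, which uses a triple query mechanism and a hard negative mining strategy to address modal heterogeneity and class imbalance in multimodal emotion recognition. It outperforms existing methods on IEMOCAP and MELD.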

📘 Detailed Summary

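Motivation: Multimodal emotion recognition faces three main challenges: imbalanced class distribution, the complexity of temporal modeling for dynamic facial action units, and difficult feature fusion caused by modal heterogeneity. With multimodal data growing explosively in social media settings, the need for an efficient cross-modal fusion framework is increasingly urgent.

Method: The MCN-CL framework uses a triple query mechanism and a hard negative mining strategy to remove feature redundancy while preserving important emotional cues. This addresses modal heterogeneity and class imbalance and enables effective fusion of cross-modal features.

Result: Experiments on IEMOCAP and MELD show that the method outperforms state-of-the-art approaches, improving Weighted F1 by 3.42% and 5.73%, respectively.

Conclusion: The work offers an effective answer to key challenges in multimodal emotion recognition. Combining the triple query mechanism with contrastive learning improves model performance and provides a useful tool for emotion analysis in practical settings.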

📄 Abstract

Multimodal emotion recognition plays a key role in many domains, including mental health monitoring, educational interaction, and human-computer interaction. However, existing methods often face three major challenges: imbalanced category distribution, the complexity of temporal modeling for dynamic facial action units, and the difficulty of feature fusion due to modal heterogeneity. With the explosive growth of multimodal data in social media scenarios, the need for an efficient cross-modal fusion framework for emotion recognition is becoming increasingly urgent. To this end, this paper proposes Multimodal Cross-Attention Network and Contrastive Learning (MCN-CL) for multimodal emotion recognition. It uses a triple query mechanism and a hard negative mining strategy to remove feature redundancy while preserving important emotional cues, effectively addressing the issues of modal heterogeneity and category imbalance. Experimental results on the IEMOCAP and MELD datasets show that our proposed method outperforms state-of-the-art approaches, with Weighted F1 scores improving by 3.42% and 5.73%, respectively.

[3] Text-guided Weakly Supervised Framework for Dynamic Facial Expression Recognition

Gunho Jung, Heejo Kong, Seong-Whan Lee

🧩 TL;DR

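This paper proposes TG-DFER, a text-guided weakly supervised framework for dynamic facial expression recognition. It combines semantic guidance from a vision-language pre-trained model with multi-grained temporal modeling to address the visual diversity and temporal complexity that challenge MIL-based methods.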

📘 Detailed Summary

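Motivation: Dynamic facial expression recognition suffers from a many-to-one labeling problem: a video of many frames is assigned a single emotion label. Methods based on Multiple Instance Learning (MIL) are inherently limited in handling the visual diversity of emotional expressions and the complexity of temporal dynamics.

Method: TG-DFER integrates a vision-language pre-trained model that provides semantic guidance through fine-grained textual descriptions of emotion. Visual prompts align the enriched textual emotion labels with visual instance features, and a multi-grained temporal network jointly captures short-term facial dynamics and long-range emotional flow.

Result: Extensive experiments show that TG-DFER achieves improved generalization, interpretability, and temporal sensitivity under weak supervision.

Conclusion: The study shows that semantic guidance and coherent temporal modeling are effective for weakly supervised dynamic expression recognition. It offers a new way to handle complex emotional dynamics and illustrates the potential of vision-language models for fine-grained affective analysis.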

📄 Abstract

Dynamic facial expression recognition (DFER) aims to identify emotional states by modeling the temporal changes in facial movements across video sequences. A key challenge in DFER is the many-to-one labeling problem, where a video composed of numerous frames is assigned a single emotion label. A common strategy to mitigate this issue is to formulate DFER as a Multiple Instance Learning (MIL) problem. However, MIL-based approaches inherently suffer from the visual diversity of emotional expressions and the complexity of temporal dynamics. To address this challenge, we propose TG-DFER, a text-guided weakly supervised framework that enhances MIL-based DFER by incorporating semantic guidance and coherent temporal modeling. A vision-language pre-trained (VLP) model is integrated to provide semantic guidance through fine-grained textual descriptions of emotional context. Furthermore, we introduce visual prompts, which align enriched textual emotion labels with visual instance features, enabling fine-grained reasoning and frame-level relevance estimation. In addition, a multi-grained temporal network is designed to jointly capture short-term facial dynamics and long-range emotional flow, ensuring coherent affective understanding across time. Extensive results demonstrate that TG-DFER achieves improved generalization, interpretability, and temporal sensitivity under weak supervision.

[4] PAS: A Training-Free Stabilizer for Temporal Encoding in Video LLMs

Bowen Sun, Yujun Cai, Ming-Hsuan Yang, Hang Wu, Yiwei Wang

🧩 TL;DR

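This paper proposes Phase Aggregated Smoothing (PAS), which addresses temporal inconsistency in Video LLMs by applying opposed phase offsets across heads and aggregating their outputs. The method is training-free, smooths the temporal kernel, and reduces phase sensitivity.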

📘 Detailed Summary

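Motivation: Video LLMs suffer from temporal inconsistency: small shifts in frame timing can flip attention and suppress relevant frames. The authors trace this instability to frame-scale ripples in the temporal kernel that arise when Rotary Position Embeddings are extended to video through multimodal RoPE.

Method: PAS applies small opposed phase offsets across attention heads and then aggregates their outputs. It preserves the per-head spectrum magnitude, while the aggregation smooths the temporal kernel and reduces phase sensitivity without changing the positional encoding structure.

Result: On multiple video understanding benchmarks under matched token budgets, PAS yields consistent improvements with negligible computational overhead. The analysis shows that smoothing the temporal kernel gives Lipschitz stability of attention to small temporal shifts, and that multi-phase averaging attenuates high-frequency ripples under Nyquist-valid sampling.

Conclusion: PAS is a plug-and-play upgrade for robust temporal encoding in Video LLMs. Phase aggregation mitigates the temporal instability of RoPE when it is extended to video while keeping the original positional encoding structure.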

📄 Abstract

Video LLMs suffer from temporal inconsistency: small shifts in frame timing can flip attention and suppress relevant frames. We trace this instability to the common extension of Rotary Position Embeddings to video through multimodal RoPE. The induced inverse Fourier time kernel exhibits frame-scale ripples that multiply adjacent frames by different factors, which perturbs attention that should otherwise be governed by the raw query key inner product. We present Phase Aggregated Smoothing (PAS), a simple, training-free mechanism that applies small opposed phase offsets across heads and then aggregates their outputs. PAS preserves the per-head spectrum magnitude, while the aggregation effectively smooths the temporal kernel and reduces phase sensitivity without changing the positional encoding structure. Our analysis shows that the RoPE rotated logit can be approximated as a content dot product scaled by a time kernel; smoothing this kernel yields Lipschitz stability of attention to small temporal shifts; multi phase averaging attenuates high frequency ripples while preserving per-head spectra under Nyquist-valid sampling. Experiments on multiple video understanding benchmarks under matched token budgets show consistent improvements with negligible computational overhead. PAS provides a plug and play upgrade for robust temporal encoding in Video LLMs.
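
Why averaging opposed offsets smooths a kernel can be seen in a toy model. The sketch below is not the paper's implementation: it assumes the time kernel is a plain mean of cosines over a hypothetical frequency set `omegas`, and it models the opposed offsets as time shifts of `+delta` and `-delta`. Under those assumptions the identity cos(w(t + d)) + cos(w(t - d)) = 2 cos(wt) cos(wd) means each frequency is scaled by cos(w * delta), so higher frequencies are attenuated more for small delta.

```python
import numpy as np


def averaged_kernel(dt, omegas, delta):
    """Average two copies of a cosine kernel shifted by +delta and -delta.

    dt: 1-D array of time gaps; omegas: 1-D array of frequencies
    (both hypothetical inputs for this toy model).
    """
    dt = np.asarray(dt, dtype=float)[:, None]      # shape (T, 1)
    omegas = np.asarray(omegas, dtype=float)       # shape (F,)
    plus = np.cos(omegas * (dt + delta))           # shape (T, F)
    minus = np.cos(omegas * (dt - delta))
    # Mean over the two offsets, then mean over frequencies.
    return (0.5 * (plus + minus)).mean(axis=1)
```

How PAS distributes offsets across heads and aggregates head outputs is specified in the paper itself; this block only illustrates the attenuation effect of symmetric averaging.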

[5] Binary Verification for Zero-Shot Vision

Jeffrey Liu, Rongbin Hu

🧩 TL;DR

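This paper proposes a training-free binary verification workflow that uses quantization and binarization to turn open-ended vision-language queries into multiple-choice questions and True/False verification questions, substantially improving zero-shot vision performance. It emphasizes inference-time design over task-specific training and offers a practical way to get more out of off-the-shelf vision-language models.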

📘 Detailed Summary

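Motivation: Vision-language models struggle with open-ended queries in zero-shot vision tasks, and existing approaches typically require task-specific training or complex optimization. This work aims for a unified, training-free workflow that improves the zero-shot performance of off-the-shelf VLMs by changing how queries are handled at inference time.

Method: The workflow has two steps. Quantization turns the open-ended query into a multiple-choice question with an explicit list of candidates. Binarization asks one True/False question per candidate and resolves deterministically: if exactly one candidate is True, select it; otherwise, revert to a multiple-choice question over the remaining candidates.

Result: Experiments on referring expression grounding, spatial reasoning, and BLINK-Jigsaw show that, relative to answering open-ended queries directly, quantization to multiple-choice questions yields large gains and True/False binarization adds a consistent further boost. The same workflow produces significant improvements across all tested tasks, indicating generality.

Conclusion: The work formalizes a hardness ladder from open-ended vision queries to multiple-choice questions to True/False verification, and a simple analysis explains why Boolean resolution improves accuracy. By emphasizing inference-time design over task-specific training, it offers a practical, drop-in path to stronger zero-shot vision with existing VLMs.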

📄 Abstract

We propose a training-free, binary verification workflow for zero-shot vision with off-the-shelf VLMs. It comprises two steps: (i) quantization, which turns the open-ended query into a multiple-choice question (MCQ) with a small, explicit list of unambiguous candidates; and (ii) binarization, which asks one True/False question per candidate and resolves deterministically: if exactly one is True, select it; otherwise, revert to an MCQ over the remaining plausible candidates. We evaluate the workflow on referring expression grounding (REC), spatial reasoning (Spatial-Map, Spatial-Grid, Spatial-Maze), and BLINK-Jigsaw. Relative to answering open-ended queries directly, quantization to MCQ yields large gains, and True/False binarization provides a consistent additional boost. Across all tasks, the same workflow produces significant improvements, indicating generality. Our theory formalizes how open-ended vision queries can be quantized to MCQs and further binarized into True/False verifications, establishing a hardness ladder. A simple analysis explains why Boolean resolution boosts accuracy. Together, these components yield a simple and unified workflow that emphasizes inference-time design over task-specific training. It offers a practical, drop-in path to stronger zero-shot vision with today's VLMs.
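
The deterministic resolution rule in step (ii) can be sketched as follows. The callables `verify` and `ask_mcq` are hypothetical stand-ins for VLM calls, and the reading of "remaining plausible candidates" as "candidates judged True, or all candidates if none were" is an assumption, since the abstract does not define it.

```python
from typing import Callable, List, Sequence


def resolve(candidates: Sequence[str],
            verify: Callable[[str], bool],
            ask_mcq: Callable[[List[str]], str]) -> str:
    """Pick an answer from True/False verdicts, falling back to an MCQ."""
    accepted = [c for c in candidates if verify(c)]  # one T/F query each
    if len(accepted) == 1:
        return accepted[0]                 # exactly one True: select it
    # Assumption: fall back to the accepted set, or to everything if empty.
    remaining = accepted if accepted else list(candidates)
    return ask_mcq(remaining)
```

For example, with candidates `["left", "right", "center"]` and a verifier that accepts only `"left"`, the function returns `"left"` without issuing the fallback multiple-choice query.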

[6] Semantic VLM Dataset for Safe Autonomous Driving

Yuankai He, Weisong Shi

🧩 TL;DR

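CAR-Scenes is a frame-level dataset for autonomous driving that supports interpretable, scene-level understanding with vision-language models through a 28-key category knowledge base and 350+ leaf attributes. It comes with a GPT-4o-assisted annotation pipeline and reproducible baseline models.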

📘 Detailed Summary

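Motivation: Autonomous driving lacks standardized datasets that support interpretable, scene-level understanding with vision-language models. Existing datasets fall short on fine-grained attribute annotation, cross-source consistency, and risk-aware scenario mining, which limits data-driven methods for intelligent vehicles.

Method: A GPT-4o-assisted vision-language annotation pipeline with human verification builds a knowledge base of 28 key categories and 350+ leaf attributes covering environment, road geometry, vehicle behavior, and other dimensions. Attribute co-occurrence graphs and JSONL records support semantic retrieval and dataset triage.

Result: The dataset contains annotations for 5,192 images drawn from Argoverse 1, Cityscapes, KITTI, and nuScenes. A baseline LoRA-tuned Qwen2-VL-2B is evaluated on a fixed validation split using scalar accuracy, micro-averaged F1, and severity MAE/RMSE, which provides a reproducible reference point.

Conclusion: CAR-Scenes provides a standardized evaluation setup and an explainable, data-centric workflow for vision-language understanding in autonomous driving. The public annotation and analysis scripts support risk-aware scenario mining and cross-dataset consistency analysis for intelligent vehicles.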

📄 Abstract

CAR-Scenes is a frame-level dataset for autonomous driving that enables training and evaluation of vision-language models (VLMs) for interpretable, scene-level understanding. We annotate 5,192 images drawn from Argoverse 1, Cityscapes, KITTI, and nuScenes using a 28-key category/sub-category knowledge base covering environment, road geometry, background-vehicle behavior, ego-vehicle behavior, vulnerable road users, sensor states, and a discrete severity scale (1-10), totaling 350+ leaf attributes. Labels are produced by a GPT-4o-assisted vision-language pipeline with human-in-the-loop verification; we release the exact prompts, post-processing rules, and per-field baseline model performance. CAR-Scenes also provides attribute co-occurrence graphs and JSONL records that support semantic retrieval, dataset triage, and risk-aware scenario mining across sources. To calibrate task difficulty, we include reproducible, non-benchmark baselines, notably a LoRA-tuned Qwen2-VL-2B with deterministic decoding, evaluated via scalar accuracy, micro-averaged F1 for list attributes, and severity MAE/RMSE on a fixed validation split. We publicly release the annotation and analysis scripts, including graph construction and evaluation scripts, to enable explainable, data-centric workflows for future intelligent vehicles. Dataset: https://github.com/Croquembouche/CAR-Scenes
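
The metrics named in the abstract have standard definitions, sketched below under simple assumptions: list attributes are represented as Python sets of strings and severity as numbers. The function names and data layout are illustrative and are not taken from the released CAR-Scenes scripts.

```python
import math
from typing import Sequence, Set, Tuple


def micro_f1(pred: Sequence[Set[str]], gold: Sequence[Set[str]]) -> float:
    """Micro-averaged F1: pool TP/FP/FN over all samples, then compute F1."""
    tp = sum(len(p & g) for p, g in zip(pred, gold))
    fp = sum(len(p - g) for p, g in zip(pred, gold))
    fn = sum(len(g - p) for p, g in zip(pred, gold))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mae_rmse(pred: Sequence[float], gold: Sequence[float]) -> Tuple[float, float]:
    """Mean absolute error and root mean squared error for severity scores."""
    errs = [p - g for p, g in zip(pred, gold)]
    mae = sum(abs(e) for e in errs) / len(errs)
    rmse = math.sqrt(sum(e * e for e in errs) / len(errs))
    return mae, rmse
```

Micro-averaging pools counts across all frames before computing F1, so frequent attributes weigh more than rare ones; RMSE penalizes large severity errors more heavily than MAE.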

[7] VisMem: Latent Vision Memory Unlocks Potential of Vision-Language Models

Xinlei Yu, Chengming Xu, Guibin Zhang, Zhangquan Chen, Yudong Zhang, Yongbo He, Peng-Tao Jiang, Jiangning Zhang, Xiaobin Hu, Shuicheng Yan

🧩 TL;DR

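This paper proposes VisMem, a framework inspired by human cognitive memory theory that equips vision-language models with a dynamic latent vision memory system. Short-term and long-term memory modules address the visual processing bottleneck and significantly improve performance across multiple visual benchmarks.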

📘 Detailed Summary

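Motivation: Vision-language models hit a visual processing bottleneck on complex visual tasks: during prolonged generation they tend to lose grounding in visual evidence and lack contextualized visual experience. This limits their performance on understanding, reasoning, and generation tasks.

Method: Drawing on human cognitive memory theory, VisMem adds a dynamic latent vision memory system with a short-term module for fine-grained perceptual retention and a long-term module for abstract semantic consolidation. These memories are invoked during inference to maintain perceptual fidelity and semantic consistency.

Result: Extensive experiments on visual understanding, reasoning, and generation benchmarks show that VisMem delivers an average performance boost of 11.8% relative to the vanilla model and outperforms all compared methods.

Conclusion: VisMem's cognitively aligned memory mechanism mitigates the visual processing bottleneck of VLMs. The work opens a direction for memory-enhanced vision-language models and shows the value of latent-space memory for improving model performance.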

📄 Abstract

Despite the remarkable success of Vision-Language Models (VLMs), their performance on a range of complex visual tasks is often hindered by a "visual processing bottleneck": a propensity to lose grounding in visual evidence and exhibit a deficit in contextualized visual experience during prolonged generation. Drawing inspiration from human cognitive memory theory, which distinguishes short-term visually-dominant memory and long-term semantically-dominant memory, we propose VisMem, a cognitively-aligned framework that equips VLMs with dynamic latent vision memories, a short-term module for fine-grained perceptual retention and a long-term module for abstract semantic consolidation. These memories are seamlessly invoked during inference, allowing VLMs to maintain both perceptual fidelity and semantic consistency across thinking and generation. Extensive experiments across diverse visual benchmarks for understanding, reasoning, and generation reveal that VisMem delivers a significant average performance boost of 11.8% relative to the vanilla model and outperforms all counterparts, establishing a new paradigm for latent-space memory enhancement. The code will be available: https://github.com/YU-deep/VisMem.git.

[8] Fast Data Attribution for Text-to-Image Models

Sheng-Yu Wang, Aaron Hertzmann, Alexei A Efros, Richard Zhang, Jun-Yan Zhu

🧩 TL;DR

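This paper proposes a scalable and efficient data attribution method that distills a slow, unlearning-based attribution method into a feature embedding space, enabling fast retrieval of highly influential training images. It is 2,500x to 400,000x faster than existing methods.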

📘 Detailed Summary

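Motivation: Existing data attribution methods for text-to-image models require considerable computation for each query, which makes them impractical in real-world applications. A scalable and efficient alternative is needed.

Method: The key idea is to distill a slow, unlearning-based attribution method into a feature embedding space. Combined with efficient indexing and search, this finds highly influential training images at deployment time without running the expensive attribution algorithm.

Result: Extensive experiments on medium-scale models trained on MSCOCO and large-scale Stable Diffusion models trained on LAION show that the method achieves better or competitive performance in a few seconds, 2,500x to 400,000x faster than existing methods.

Conclusion: The work is a meaningful step toward applying data attribution to real-world models such as Stable Diffusion and offers a feasible path to large-scale use.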

📄 Abstract

Data attribution for text-to-image models aims to identify the training images that most significantly influenced a generated output. Existing attribution methods involve considerable computational resources for each query, making them impractical for real-world applications. We propose a novel approach for scalable and efficient data attribution. Our key idea is to distill a slow, unlearning-based attribution method to a feature embedding space for efficient retrieval of highly influential training images. During deployment, combined with efficient indexing and search methods, our method successfully finds highly influential images without running expensive attribution algorithms. We show extensive results on both medium-scale models trained on MSCOCO and large-scale Stable Diffusion models trained on LAION, demonstrating that our method can achieve better or competitive performance in a few seconds, faster than existing methods by 2,500x - 400,000x. Our work represents a meaningful step towards the large-scale application of data attribution methods on real-world models such as Stable Diffusion.
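
At deployment time, attribution in this scheme reduces to nearest-neighbor search in the learned embedding space. The sketch below assumes embeddings are already computed and uses brute-force cosine similarity with NumPy; the function name and the use of cosine similarity are assumptions, and the distilled embedding model itself (the paper's contribution) is not shown. A real deployment would likely swap the brute-force scan for an approximate nearest-neighbor index.

```python
import numpy as np


def top_k_influential(query_emb, train_embs, k=5):
    """Return indices and cosine scores of the k closest training embeddings."""
    q = np.asarray(query_emb, dtype=float)
    t = np.asarray(train_embs, dtype=float)
    q = q / np.linalg.norm(q)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    scores = t @ q                       # cosine similarity per training image
    order = np.argsort(-scores)[:k]      # highest similarity first
    return order, scores[order]
```

Because the expensive attribution method is only run offline to produce training targets for the embedding, each query costs one embedding pass plus a search.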

[9] AirCopBench: A Benchmark for Multi-drone Collaborative Embodied Perception and Reasoning

Jirong Zha, Yuxuan Fan, Tianyu Zhang, Geng Chen, Yingfeng Chen, Chen Gao, Xinlei Chen

🧩 TL;DR

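This paper introduces AirCopBench, the first benchmark designed to evaluate multimodal large language models on embodied aerial collaborative perception. It contains 14.6k+ questions built from simulator and real-world data and reveals significant performance gaps for current MLLMs on collaborative perception tasks.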

📘 Detailed Summary

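Motivation: Multimodal large language models perform well on single-agent vision tasks, but benchmarks for multi-agent collaborative perception are scarce. Existing benchmarks mainly target basic perception with high-quality single-agent images, so they cannot adequately evaluate MLLMs in complex, egocentric collaborative scenarios, especially under real-world degraded perception conditions.

Method: AirCopBench is a comprehensive benchmark of 14.6k+ questions drawn from simulator and real-world data. Questions are generated with model-based, rule-based, and human-based methods under rigorous quality control. The benchmark spans four task dimensions (Scene Understanding, Object Understanding, Perception Assessment, and Collaborative Decision) across 14 task types.

Result: An evaluation of 40 MLLMs shows significant performance gaps on collaborative perception tasks: the best model trails humans by 24.38% on average, and results are inconsistent across tasks. Fine-tuning experiments further confirm the feasibility of sim-to-real transfer in aerial collaborative perception and reasoning.

Conclusion: The study exposes the limits of current MLLMs on multi-agent collaborative perception and underlines the need for benchmarks built specifically for collaborative scenarios. It also shows that sim-to-real transfer is feasible for complex perception tasks, which is a useful reference for future multi-agent collaborative systems.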

📄 Abstract

Multimodal Large Language Models (MLLMs) have shown promise in single-agent vision tasks, yet benchmarks for evaluating multi-agent collaborative perception remain scarce. This gap is critical, as multi-drone systems provide enhanced coverage, robustness, and collaboration compared to single-sensor setups. Existing multi-image benchmarks mainly target basic perception tasks using high-quality single-agent images, thus failing to evaluate MLLMs in more complex, egocentric collaborative scenarios, especially under real-world degraded perception conditions. To address these challenges, we introduce AirCopBench, the first comprehensive benchmark designed to evaluate MLLMs in embodied aerial collaborative perception under challenging perceptual conditions. AirCopBench includes 14.6k+ questions derived from both simulator and real-world data, spanning four key task dimensions: Scene Understanding, Object Understanding, Perception Assessment, and Collaborative Decision, across 14 task types. We construct the benchmark using data from challenging degraded-perception scenarios with annotated collaborative events, generating large-scale questions through model-, rule-, and human-based methods under rigorous quality control. Evaluations on 40 MLLMs show significant performance gaps in collaborative perception tasks, with the best model trailing humans by 24.38% on average and exhibiting inconsistent results across tasks. Fine-tuning experiments further confirm the feasibility of sim-to-real transfer in aerial collaborative perception and reasoning.

[10] S2D-ALIGN: Shallow-to-Deep Auxiliary Learning for Anatomically-Grounded Radiology Report Generation

Jiechao Gao, Chang Liu, Yuangang Li

🧩 TL;DR

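This paper proposes S2D-Align, a novel supervised fine-tuning paradigm that establishes anatomically-grounded cross-modal alignment using auxiliary signals of varying granularity, improving the quality of radiology report generation. It achieves state-of-the-art performance on the MIMIC-CXR and IU X-Ray benchmarks.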

📘 Detailed Summary

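Motivation: Existing methods mostly rely on the standard supervised fine-tuning paradigm, which performs instance-level alignment on image-text pairs. This fails to establish anatomically-grounded alignment, and the templated nature of reports leads to sub-optimal generation quality, so finer-grained cross-modal alignment is needed.

Method: S2D-Align follows a shallow-to-deep strategy that progressively enriches alignment. It starts with coarse radiograph-report pairing, then introduces reference reports for instance-level guidance, and finally uses key phrases to ground generation in specific anatomical details. A memory-based adapter shares features across the alignment stages.

Result: On the public MIMIC-CXR and IU X-Ray benchmarks, S2D-Align achieves state-of-the-art performance compared to existing methods. Ablation studies validate the multi-stage, auxiliary-guided approach.

Conclusion: The study points to a promising direction for strengthening grounding in complex multimodal generation tasks through auxiliary signals of varying granularity, and offers a promising way to improve the anatomical accuracy of radiology report generation.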

📄 Abstract

Radiology Report Generation (RRG) aims to automatically generate diagnostic reports from radiology images. To achieve this, existing methods have leveraged the powerful cross-modal generation capabilities of Multimodal Large Language Models (MLLMs), primarily focusing on optimizing cross-modal alignment between radiographs and reports through Supervised Fine-Tuning (SFT). However, by only performing instance-level alignment with the image-text pairs, the standard SFT paradigm fails to establish anatomically-grounded alignment, where the templated nature of reports often leads to sub-optimal generation quality. To address this, we propose S2D-Align, a novel SFT paradigm that establishes anatomically-grounded alignment by leveraging auxiliary signals of varying granularities. S2D-Align implements a shallow-to-deep strategy, progressively enriching the alignment process: it begins with the coarse radiograph-report pairing, then introduces reference reports for instance-level guidance, and ultimately utilizes key phrases to ground the generation in specific anatomical details. To bridge the different alignment stages, we introduce a memory-based adapter that empowers feature sharing, thereby integrating coarse and fine-grained guidance. For evaluation, we conduct experiments on the public MIMIC-CXR and IU X-Ray benchmarks, where S2D-Align achieves state-of-the-art performance compared to existing methods. Ablation studies validate the effectiveness of our multi-stage, auxiliary-guided approach, highlighting a promising direction for enhancing grounding capabilities in complex, multi-modal generation tasks.

[11] Frequency-Aware Vision-Language Multimodality Generalization Network for Remote Sensing Image Classification

Junjie Zhang, Feng Zhao, Hanqiang Liu, Jun Yu

🧩 TL;DR

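This paper proposes a frequency-aware vision-language multimodality generalization network (FVMGN) for the multimodality generalization problem in remote sensing image classification. Frequency-domain analysis and multiscale feature alignment give it strong cross-scene generalization.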

📘 Detailed Summary

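Motivation: Advances in remote sensing have given rise to a multimodality generalization task that requires models to overcome data heterogeneity and generalize strongly across scenes. Existing vision-language models usually describe remote sensing images with universal texts and lack linguistic prior knowledge specific to different remote sensing vision modalities.

Method: A diffusion-based training-test-time augmentation strategy reconstructs multimodal land-cover distributions. A multimodal wavelet disentanglement module resamples low and high frequency components in the frequency domain to learn cross-domain invariant features. Shared and proprietary class texts serve as linguistic inputs, a spatial-frequency-aware image encoder performs local-global feature reconstruction, and a multiscale spatial-frequency feature alignment module builds a unified semantic space.

Result: Extensive experiments show that FVMGN has excellent multimodality generalization ability compared with state-of-the-art methods on remote sensing image classification.

Conclusion: The study provides an effective learning paradigm for remote sensing multimodality generalization. Frequency-domain analysis and multimodal feature alignment improve cross-scene adaptability and suggest a new direction for remote sensing image analysis.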

📄 Abstract

Rapidly advancing remote sensing (RS) technology is giving rise to a novel multimodality generalization task, which requires the model to overcome data heterogeneity while possessing powerful cross-scene generalization ability. Moreover, most vision-language models (VLMs) describe surface materials in RS images using universal texts, lacking proprietary linguistic prior knowledge specific to different RS vision modalities. In this work, we formalize RS multimodality generalization (RSMG) as a learning paradigm, and propose a frequency-aware vision-language multimodality generalization network (FVMGN) for RS image classification. Specifically, a diffusion-based training-test-time augmentation (DTAug) strategy is designed to reconstruct multimodal land-cover distributions, enriching input information for FVMGN. Following that, to overcome multimodal heterogeneity, a multimodal wavelet disentanglement (MWDis) module is developed to learn cross-domain invariant features by resampling low and high frequency components in the frequency domain. Considering the characteristics of RS vision modalities, shared and proprietary class texts are designed as linguistic inputs for the transformer-based text encoder to extract diverse text features. For multimodal vision inputs, a spatial-frequency-aware image encoder (SFIE) is constructed to realize local-global feature reconstruction and representation. Finally, a multiscale spatial-frequency feature alignment (MSFFA) module is proposed to construct a unified semantic space, ensuring refined multiscale alignment of different text and vision features in spatial and frequency domains. Extensive experiments show that FVMGN has excellent multimodality generalization ability compared with state-of-the-art (SOTA) methods.

[12] CrossMed: A Multimodal Cross-Task Benchmark for Compositional Generalization in Medical Imaging

Pooja Singh, Siddhant Ujjain, Tapan Kumar Gandhi, Sandeep Kumar

🧩 TL;DR

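This paper introduces CrossMed, a benchmark for evaluating how well medical multimodal large language models generalize compositionally to unseen combinations of imaging modality, anatomy, and task type. It reveals significant limitations of existing models in cross-modality and cross-task generalization.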

📘 Detailed Summary

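Motivation: Multimodal large language models show promise for unified processing of visual and textual inputs in medical AI, but their ability to generalize compositionally across unseen combinations of imaging modality, anatomy, and task type remains underexplored. This limits their practical value in real clinical settings.

Method: CrossMed is built on a Modality-Anatomy-Task triplet schema. It reformulates four public medical datasets into a unified visual question answering format with 20,200 multiple-choice instances, and evaluates open-source multimodal LLMs, LLaVA-Vicuna-7B and Qwen2-VL-7B, under Related, Unrelated, and zero-overlap settings.

Result: Models reach 83.2% classification accuracy and 0.75 segmentation cIoU on Related splits, but performance drops significantly under Unrelated and zero-overlap conditions. Segmentation performance improves by 7% cIoU even when training uses classification-only data, which demonstrates cross-task transfer.

Conclusion: CrossMed provides a rigorous testbed for zero-shot, cross-task, and modality-agnostic generalization in medical vision-language models. Multimodal LLMs uniquely excel at compositional generalization, while traditional models show only modest gains, which the authors take as confirming the broad utility of the framework.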

📄 Abstract

Recent advances in multimodal large language models have enabled unified processing of visual and textual inputs, offering promising applications in general-purpose medical AI. However, their ability to generalize compositionally across unseen combinations of imaging modality, anatomy, and task type remains underexplored. We introduce CrossMed, a benchmark designed to evaluate compositional generalization (CG) in medical multimodal LLMs using a structured Modality-Anatomy-Task (MAT) schema. CrossMed reformulates four public datasets, CheXpert (X-ray classification), SIIM-ACR (X-ray segmentation), BraTS 2020 (MRI classification and segmentation), and MosMedData (CT classification) into a unified visual question answering (VQA) format, resulting in 20,200 multiple-choice QA instances. We evaluate two open-source multimodal LLMs, LLaVA-Vicuna-7B and Qwen2-VL-7B, on both Related and Unrelated MAT splits, as well as a zero-overlap setting where test triplets share no Modality, Anatomy, or Task with the training data. Models trained on Related splits achieve 83.2 percent classification accuracy and 0.75 segmentation cIoU, while performance drops significantly under Unrelated and zero-overlap conditions, demonstrating the benchmark difficulty. We also show cross-task transfer, where segmentation performance improves by 7 percent cIoU even when trained using classification-only data. Traditional models (ResNet-50 and U-Net) show modest gains, confirming the broad utility of the MAT framework, while multimodal LLMs uniquely excel at compositional generalization. CrossMed provides a rigorous testbed for evaluating zero-shot, cross-task, and modality-agnostic generalization in medical vision-language models.
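
The zero-overlap condition can be stated as a filter over Modality-Anatomy-Task triplets. The sketch below is an illustration only: the tuple layout, the function name, and the anatomy labels used in the example are assumptions, since the abstract does not list the benchmark's exact label vocabulary.

```python
from typing import Iterable, List, Tuple

Triplet = Tuple[str, str, str]  # (modality, anatomy, task)


def zero_overlap(test: Iterable[Triplet], train: Iterable[Triplet]) -> List[Triplet]:
    """Keep test triplets that share no modality, anatomy, or task with training."""
    train = list(train)
    modalities = {m for m, _, _ in train}
    anatomies = {a for _, a, _ in train}
    tasks = {t for _, _, t in train}
    return [(m, a, t) for m, a, t in test
            if m not in modalities and a not in anatomies and t not in tasks]
```

For instance, if training contains only `("X-ray", "chest", "classification")`, then `("MRI", "brain", "segmentation")` qualifies as zero-overlap, while `("MRI", "chest", "segmentation")` does not because it shares the anatomy.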

[13] Discovering Meaningful Units with Visually Grounded Semantics from Image Captions

Melika Behjati, James Henderson

🧩 TL;DR

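This paper proposes a method that improves the fine-grained understanding of vision-language models by grouping language tokens. Grouping caption tokens and aligning the groups with objects in the image improves fine-grained understanding of both vision and language.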

📘 Detailed Summary

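Motivation: Existing vision-language models mostly focus on aligning image patches with language tokens. But image patches carry no meaning to the human eye, and individual tokens do not necessarily carry groundable information about the image. It is groups of tokens that describe different aspects of a scene, so a model that captures fine-grained representations of language is needed.

Method: The model groups caption tokens as part of its architecture. It aligns the resulting language representations with the output of an image encoder trained to discover objects, so that the learned token groups are expected to sit at the level of objects present in the image.

Result: Experiments show that by learning to group tokens, the vision-language model gains a better fine-grained understanding of vision and language. The token groups the model discovers are highly similar to groundable phrases in text, both qualitatively and quantitatively.

Conclusion: Grouping language tokens and aligning them with visual objects improves the fine-grained understanding of vision-language models and offers a new technical route to more precise vision-language alignment.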

📄 Abstract

Fine-grained knowledge is crucial for vision-language models to obtain a better understanding of the real world. While there has been work trying to acquire this kind of knowledge in the space of vision and language, it has mostly focused on aligning the image patches with the tokens on the language side. However, image patches do not have any meaning to the human eye, and individual tokens do not necessarily carry groundable information in the image. It is groups of tokens which describe different aspects of the scene. In this work, we propose a model which groups the caption tokens as part of its architecture in order to capture a fine-grained representation of the language. We expect our representations to be at the level of objects present in the image, and therefore align our representations with the output of an image encoder trained to discover objects. We show that by learning to group the tokens, the vision-language model has a better fine-grained understanding of vision and language. In addition, the token groups that our model discovers are highly similar to groundable phrases in text, both qualitatively and quantitatively.

[14] VIDEOP2R: Video Understanding from Perception to Reasoning

Yifan Jiang, Yueying Wang, Rui Zhao, Toufiq Parag, Zhimin Chen, Zhenyu Liao, Jayakrishnan Unnikrishnan

🧩 TL;DR

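This paper proposes VideoP2R, a reinforcement fine-tuning framework for video large language models that separates perception from reasoning and models them as distinct processes to improve video reasoning. It achieves state-of-the-art performance on six of seven video reasoning benchmarks.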

📘 Detailed Summary

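Motivation: Reinforcement fine-tuning has improved the reasoning ability of large language models, but extending it to video large language models remains challenging because perception and reasoning interact in complex ways in the video modality.

Method: VideoP2R is a process-aware video reinforcement fine-tuning framework with two stages, supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, a three-step pipeline generates VideoP2R-CoT-162K, a chain-of-thought dataset of 162K samples. In the RL stage, a process-aware group relative policy optimization (PA-GRPO) algorithm supplies separate rewards for perception and reasoning.

Result: VideoP2R achieves state-of-the-art performance on six of seven video reasoning and understanding benchmarks. Ablation studies confirm the effectiveness of process-aware modeling and PA-GRPO and show that the model's perception output is information-sufficient for downstream reasoning.

Conclusion: Decomposing video reasoning into separate perception and reasoning processes is an effective strategy, and process-aware rewards improve model performance. The framework offers a new route for reinforcement fine-tuning of video language models and could be extended to more complex multimodal reasoning tasks.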

📄 Abstract

Reinforcement fine-tuning (RFT), a two-stage framework consisting of supervised fine-tuning (SFT) and reinforcement learning (RL), has shown promising results in improving the reasoning ability of large language models (LLMs). Yet extending RFT to large video language models (LVLMs) remains challenging. We propose VideoP2R, a novel process-aware video RFT framework that enhances video reasoning by modeling perception and reasoning as distinct processes. In the SFT stage, we develop a three-step pipeline to generate VideoP2R-CoT-162K, a high-quality, process-aware chain-of-thought (CoT) dataset for perception and reasoning. In the RL stage, we introduce a novel process-aware group relative policy optimization (PA-GRPO) algorithm that supplies separate rewards for perception and reasoning. Extensive experiments show that VideoP2R achieves state-of-the-art (SotA) performance on six out of seven video reasoning and understanding benchmarks. Ablation studies further confirm the effectiveness of our process-aware modeling and PA-GRPO and demonstrate that the model's perception output is information-sufficient for downstream reasoning.
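
The idea of separate rewards for perception and reasoning can be illustrated with the group-relative normalization commonly used in GRPO, applied once per reward stream. This is a sketch under assumptions: the abstract does not say how PA-GRPO normalizes rewards or assigns advantages to tokens, and the function names here are hypothetical.

```python
import numpy as np


def group_relative(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


def process_aware_advantages(perception_rewards, reasoning_rewards):
    """Compute one advantage vector per process instead of a single mixed one.

    Assumption: each stream is normalized independently over the same group.
    """
    return group_relative(perception_rewards), group_relative(reasoning_rewards)
```

Keeping the two streams separate means a response with accurate perception but flawed reasoning can receive a positive signal for the former and a negative one for the latter, rather than a single blended score.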

[15] From Synthetic Scenes to Real Performance: Enhancing Spatial Reasoning in VLMs

Massimo Rizzoli, Simone Alghisi, Seyed Mahed Mousavi, Giuseppe Riccardi

🧩 TL;DR

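This study proposes improving vision-language model fine-tuning through controlled synthetic data generation. Comprehensively sampling object attributes yields a balanced, bias-free dataset, which significantly improves performance on real-world data and mitigates common biases.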

📘 Detailed Summary

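Motivation: Fine-tuning vision-language models usually depends on collecting and annotating real-world scenes, a process prone to biases, errors, and distribution imbalance that leads to overfitting and imbalanced performance. Prior work has tried generating synthetic data but lacked control over distribution bias and annotation quality.

Method: The fine-tuning process is redesigned in two ways. First, the dataset is constructed automatically by comprehensively sampling object attributes, including color, shape, size, and position within the scene, so that data generation and annotation are free from bias, distribution imbalance, and annotation errors. Second, state-of-the-art VLMs are fine-tuned on this annotated dataset and evaluated for transfer to real-world data on the absolute position task.

Result: Evaluations on synthetic and real-world benchmarks yield two key findings: fine-tuning on balanced synthetic data gives uniform performance across the visual scene and mitigates common biases; and fine-tuning on synthetic stimuli significantly improves performance on real-world data (COCO), outperforming models fine-tuned in the matched setting.

Conclusion: Fine-tuning on controlled synthetic data is an effective strategy for improving vision-language models. It addresses data bias and distribution imbalance, generalizes well to real-world data, and suggests a direction for more robust VLM training.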

📄 Abstract

Fine-tuning Vision-Language Models (VLMs) is a common strategy to improve performance following an ad-hoc data collection and annotation of real-world scenes. However, this process is often prone to biases, errors, and distribution imbalance, resulting in overfitting and imbalanced performance. Although a few studies have tried to address this problem by generating synthetic data, they lacked control over distribution bias and annotation quality. To address these challenges, we redesign the fine-tuning process in two ways. First, we control the generation of data and its annotations, ensuring it is free from bias, distribution imbalance, and annotation errors. We automatically construct the dataset by comprehensively sampling objects' attributes, including color, shape, size, and position within the scene. Secondly, using this annotated dataset, we fine-tune state-of-the-art VLMs and assess performance transferability to real-world data on the absolute position task. We conduct exhaustive evaluations on both synthetic and real-world benchmarks. Our experiments reveal two key findings: 1) fine-tuning on balanced synthetic data yields uniform performance across the visual scene and mitigates common biases; and 2) fine-tuning on synthetic stimuli significantly improves performance on real-world data (COCO), outperforming models fine-tuned in the matched setting.
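
Comprehensive attribute sampling amounts to enumerating the Cartesian product of attribute values, which makes every value appear equally often by construction. The attribute vocabularies below are hypothetical placeholders; the abstract names the attribute types (color, shape, size, position) but not their values.

```python
from itertools import product

# Hypothetical attribute values, chosen only for illustration.
COLORS = ["red", "green", "blue"]
SHAPES = ["circle", "square"]
SIZES = ["small", "large"]
POSITIONS = ["top-left", "top-right", "bottom-left", "bottom-right"]


def balanced_scenes():
    """Enumerate every attribute combination exactly once."""
    return [
        {"color": c, "shape": s, "size": z, "position": p}
        for c, s, z, p in product(COLORS, SHAPES, SIZES, POSITIONS)
    ]
```

Because each scene specification doubles as its own annotation, labels derived this way cannot disagree with the rendered content, which is the sense in which the generated data is free of annotation errors.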

[16] PhaseWin Search Framework Enable Efficient Object-Level Interpretation

Zihan Gu, Ruoyu Chen, Junchi Zhang, Yue Hu, Hua Zhang, Xiaochun Cao

🧩 TL;DR

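This paper proposes PhaseWin, a phase-window search algorithm with near-linear complexity for faithful region attribution in object-level foundation models. It stays close to greedy performance while using only 20% of the computational budget, and sets a new state of the art on multimodal object detection and visual grounding tasks.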

📘 Detailed Summary

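Motivation: Attribution methods based on submodular subset selection achieve high faithfulness, but their efficiency limits hinder deployment in real-world scenarios. Traditional greedy selection has quadratic cost, so an attribution algorithm is needed that keeps faithfulness high while being far more efficient.

Method: PhaseWin replaces quadratic-cost greedy selection with a phased coarse-to-fine search. It combines adaptive pruning, windowed fine-grained selection, and dynamic supervision mechanisms to approximate greedy behavior while sharply reducing the number of model evaluations. The algorithm retains near-greedy approximation guarantees under mild monotone submodular assumptions.

Result: PhaseWin achieves over 95% of greedy attribution faithfulness using only 20% of the computational budget. On object detection and visual grounding tasks with Grounding DINO and Florence-2, it consistently outperforms other attribution baselines.

Conclusion: PhaseWin establishes a new state of the art in scalable, high-faithfulness attribution for object-level multimodal models. It shows that a carefully designed approximation can deliver large efficiency gains while preserving performance, which makes efficient model interpretation more practical to deploy.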

📄 Abstract

Attribution is essential for interpreting object-level foundation models. Recent methods based on submodular subset selection have achieved high faithfulness, but their efficiency limitations hinder practical deployment in real-world scenarios. To address this, we propose PhaseWin, a novel phase-window search algorithm that enables faithful region attribution with near-linear complexity. PhaseWin replaces traditional quadratic-cost greedy selection with a phased coarse-to-fine search, combining adaptive pruning, windowed fine-grained selection, and dynamic supervision mechanisms to closely approximate greedy behavior while dramatically reducing model evaluations. Theoretically, PhaseWin retains near-greedy approximation guarantees under mild monotone submodular assumptions. Empirically, PhaseWin achieves over 95% of greedy attribution faithfulness using only 20% of the computational budget, and consistently outperforms other attribution baselines across object detection and visual grounding tasks with Grounding DINO and Florence-2. PhaseWin establishes a new state of the art in scalable, high-faithfulness attribution for object-level multimodal models.
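
For context, the quadratic-cost greedy baseline that PhaseWin is said to replace can be sketched as below. This is the standard greedy routine for subset selection, not PhaseWin itself; `score` is a hypothetical callable standing in for a model evaluation on a set of image regions, and the evaluation counter is included only to make the cost visible.

```python
from typing import Callable, FrozenSet, List, Tuple


def greedy_select(n_regions: int,
                  score: Callable[[FrozenSet[int]], float],
                  k: int) -> Tuple[List[int], int]:
    """Greedily pick k regions, re-scoring every remaining region each round."""
    chosen: List[int] = []
    evals = 0
    for _ in range(min(k, n_regions)):
        best, best_val = -1, float("-inf")
        for r in range(n_regions):
            if r in chosen:
                continue
            val = score(frozenset(chosen + [r]))  # one model evaluation
            evals += 1
            if val > best_val:
                best, best_val = r, val
        chosen.append(best)
    return chosen, evals
```

Ranking all n regions this way takes n + (n - 1) + ... + 1 evaluations, which is the quadratic cost that pruning and windowed selection aim to avoid.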

[17] Out-of-Distribution Detection with Positive and Negative Prompt Supervision Using Large Language Models

Zhixia He, Chen Zhao, Minglai Shao, Xintao Wu, Xujiang Zhao, Dong Li, Qin Tian, Linlin Yu

🧩 TL;DR

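This paper proposes Positive and Negative Prompt Supervision, which optimizes negative prompts to capture inter-class features and transfers that semantic knowledge to the visual modality, significantly improving energy-based OOD detection. The method outperforms state-of-the-art approaches across multiple benchmark datasets and different LLMs.

📘 Detailed Summary

Motivation: Current OOD detection methods based on vision-language models use negative prompts to emphasize the dissimilarity between image features and prompt content, but these negative prompts typically include a broad range of non-ID features and may capture overlapping or misleading information, leading to suboptimal results.

Method: The proposed Positive and Negative Prompt Supervision first uses LLMs to initialize class-specific positive and negative prompts. It then optimizes them so that positive prompts focus on intra-class features while negative prompts highlight features around category boundaries. A graph-based architecture aggregates semantic-aware supervision from the optimized prompt representations and propagates it to the visual branch to strengthen the energy-based OOD detector.

Result: Extensive experiments on two benchmarks, CIFAR-100 and ImageNet-1K, across eight OOD datasets and five different LLMs show that the method outperforms state-of-the-art baselines in all settings.

Conclusion: By optimizing negative prompts to focus on inter-class features and effectively transferring semantic knowledge to the visual modality, the method delineates boundaries more precisely for OOD detection and demonstrates the potential of multimodal knowledge transfer for improving detection performance.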

📄 Abstract

Out-of-distribution (OOD) detection aims to delineate the classification boundaries between in-distribution (ID) and OOD images. Recent advances in vision-language models (VLMs) have demonstrated remarkable OOD detection performance by integrating both visual and textual modalities. In this context, negative prompts are introduced to emphasize the dissimilarity between image features and prompt content. However, these prompts often include a broad range of non-ID features, which may result in suboptimal outcomes due to the capture of overlapping or misleading information. To address this issue, we propose Positive and Negative Prompt Supervision, which encourages negative prompts to capture inter-class features and transfers this semantic knowledge to the visual modality to enhance OOD detection performance. Our method begins with class-specific positive and negative prompts initialized by large language models (LLMs). These prompts are subsequently optimized, with positive prompts focusing on features within each class, while negative prompts highlight features around category boundaries. Additionally, a graph-based architecture is employed to aggregate semantic-aware supervision from the optimized prompt representations and propagate it to the visual branch, thereby enhancing the performance of the energy-based OOD detector. Extensive experiments on two benchmarks, CIFAR-100 and ImageNet-1K, across eight OOD datasets and five different LLMs, demonstrate that our method outperforms state-of-the-art baselines.
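
As background for the energy-based detector mentioned above, the sketch below computes the commonly used energy score, E(x) = -T · log Σ exp(z_i / T), from classifier logits. It is a generic illustration under the assumption that the detector follows this standard form. The abstract does not give the paper's exact scoring rule, and the function names and the threshold value are hypothetical.

```python
import math


def energy_score(logits, temperature=1.0):
    """Standard energy score: -T * logsumexp(logits / T).

    Lower (more negative) energy usually indicates an in-distribution input.
    """
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return -temperature * lse


def is_ood(logits, threshold, temperature=1.0):
    """Flag an input as OOD when its energy exceeds a chosen threshold (assumption)."""
    return energy_score(logits, temperature) > threshold


if __name__ == "__main__":
    print(energy_score([10.0, 0.0, 0.0]))  # confident prediction, low energy
    print(energy_score([0.0, 0.0, 0.0]))   # flat prediction, higher energy
```

In practice the threshold would be tuned on held-out ID data, for example so that a fixed fraction of ID samples is retained.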

[18] Refine and Align: Confidence Calibration through Multi-Agent Interaction in VQA

Ayush Pandey, Jai Bardhan, Ishita Jain, Ramya S Hebbalaguppe, Rohan Raju Dhanakshirur, Lovekesh Vig

🧩 TL;DR

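This paper proposes AlignVQA, a debate-based multi-agent framework that improves confidence calibration in visual question answering systems through a two-stage interaction process, and introduces a differentiable calibration-aware loss function to improve the confidence estimates of specialized agents.

📘 Detailed Summary

Motivation: Modern VQA systems are increasingly used in high-stakes domains such as medical diagnostics and autonomous navigation, yet the reliability of their confidence estimates remains under-examined. These systems often produce overconfident responses, which poses risks for autonomous decision-making under visual uncertainty.

Method: AlignVQA is a debate framework in which diverse specialized vision-language models, each following a distinct prompting strategy, generate candidate answers. Generalist agents then carry out a two-stage interaction in which they critique, refine, and aggregate these proposals. A differentiable loss function, aligncal, is designed to minimize an upper bound on the calibration error.

Result: Experiments on multiple benchmark VQA datasets show that the method substantially reduces calibration discrepancies, and that better-calibrated specialized agents yield better-aligned confidence estimates.

Conclusion: The debate process yields confidence estimates that more accurately reflect the model's true predictive performance, and calibration-aware fine-tuning improves the fidelity of each agent's confidence estimates, offering an important technical path toward reliable AI systems in high-stakes applications.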

📄 Abstract

In the context of Visual Question Answering (VQA) and Agentic AI, calibration refers to how closely an AI system's confidence in its answers reflects their actual correctness. This aspect becomes especially important when such systems operate autonomously and must make decisions under visual uncertainty. While modern VQA systems, powered by advanced vision-language models (VLMs), are increasingly used in high-stakes domains like medical diagnostics and autonomous navigation due to their improved accuracy, the reliability of their confidence estimates remains under-examined. In particular, these systems often produce overconfident responses. To address this, we introduce AlignVQA, a debate-based multi-agent framework in which diverse specialized VLMs, each following a distinct prompting strategy, generate candidate answers and then engage in a two-stage interaction: generalist agents critique, refine, and aggregate these proposals. This debate process yields confidence estimates that more accurately reflect the model's true predictive performance. We find that more calibrated specialized agents produce better aligned confidences. Furthermore, we introduce a novel differentiable calibration-aware loss function called aligncal, designed to fine-tune the specialized agents by minimizing an upper bound on the calibration error. This objective explicitly improves the fidelity of each agent's confidence estimates. Empirical results across multiple benchmark VQA datasets substantiate the efficacy of our approach, demonstrating substantial reductions in calibration discrepancies.
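
To make the notion of calibration concrete, the following sketch computes the widely used Expected Calibration Error (ECE) with equal-width confidence bins. It illustrates the quantity that calibration work typically reports. It is not the aligncal loss, which the abstract describes only as a differentiable upper bound on the calibration error. The function name and the bin count are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average of |mean confidence - accuracy| over confidence bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece


if __name__ == "__main__":
    # An overconfident system: 90% stated confidence, 50% actual accuracy.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, False, False]))
```

An overconfident system of the kind the abstract describes shows up here as a large gap between stated confidence and observed accuracy.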

[19] Abstract 3D Perception for Spatial Intelligence in Vision-Language Models

Yifan Liu, Fangneng Zhan, Kaichen Zhou, Yilun Du, Paul Pu Liang, Hanspeter Pfister

🧩 TL;DR

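This paper proposes SandboxVLM, a framework that uses abstract bounding boxes to encode geometric structure and physical kinematics, improving the spatial reasoning of vision-language models on 3D tasks. The method significantly improves performance on multiple benchmarks in zero-shot settings, strengthening 3D understanding without additional training.

📘 Detailed Summary

Motivation: Vision-language models perform poorly on 3D-related tasks such as spatial cognition and physical understanding, which limits their potential in real-world applications such as robotics and embodied agents. The authors attribute this to a modality gap between 3D tasks and the 2D training of VLMs, which makes retrieving 3D information from 2D input inefficient.

Method: SandboxVLM uses abstract bounding boxes to encode geometric structure and physical kinematics, in four key stages: generating multi-view priors with abstract control, proxy elevation, multi-view voting and clustering, and 3D-aware reasoning. This 3D Sandbox reconstruction and perception pipeline is intended to bridge the gap between 2D and 3D representations.

Result: In zero-shot evaluations across multiple benchmarks and VLM backbones, the method yields notable gains in spatial intelligence, for example an 8.3% improvement over baseline methods on the SAT Real benchmark. The results indicate that it consistently improves 3D reasoning across different VLM architectures.

Conclusion: Equipping vision-language models with a 3D abstraction substantially enhances their 3D reasoning ability without additional training. This finding opens new possibilities for general-purpose embodied intelligence and shows that a suitable abstract representation can effectively bridge the gap between 2D and 3D understanding.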

📄 Abstract

Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding, which are crucial for real-world applications like robotics and embodied agents. We attribute this to a modality gap between the 3D tasks and the 2D training of VLMs, which leads to inefficient retrieval of 3D information from 2D input. To bridge this gap, we introduce SandboxVLM, a simple yet effective framework that leverages abstract bounding boxes to encode geometric structure and physical kinematics for VLMs. Specifically, we design a 3D Sandbox reconstruction and perception pipeline comprising four stages: generating multi-view priors with abstract control, proxy elevation, multi-view voting and clustering, and 3D-aware reasoning. Evaluated in zero-shot settings across multiple benchmarks and VLM backbones, our approach consistently improves spatial intelligence, achieving, for instance, an 8.3% gain on SAT Real over baseline methods. These results demonstrate that equipping VLMs with a 3D abstraction substantially enhances their 3D reasoning ability without additional training, suggesting new possibilities for general-purpose embodied intelligence.

Haokun Chen, Jianing Li, Yao Zhang, Jinhe Bi, Yan Xia, Jindong Gu, Volker Tresp

🧩 TL;DR

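This paper proposes AUVIC, a visual concept unlearning framework for multimodal large language models that uses adversarial perturbations to remove target concepts precisely, and constructs VCUBench, the first visual concept unlearning benchmark, for evaluation.

📘 Detailed Summary

Motivation: Multimodal large language models are optimized on massive datasets that often contain sensitive or copyrighted content, raising data privacy concerns. Existing research focuses mainly on text unlearning, and visual concept unlearning in MLLMs remains underexplored. The primary challenge is precisely removing a target visual concept without disrupting model performance on related entities.

Method: The AUVIC framework applies adversarial perturbations to achieve precise visual concept unlearning, effectively isolating the target concept while avoiding unintended effects on similar entities. VCUBench is constructed as the first benchmark for assessing visual concept unlearning in group contexts.

Result: Experiments show that AUVIC achieves state-of-the-art target forgetting rates while incurring only minimal performance degradation on non-target concepts, validating the framework for visual concept unlearning.

Conclusion: The study provides an effective solution for visual concept unlearning in MLLMs that meets the requirements of data privacy regulations while preserving overall model performance, laying a foundation and pointing the way for future research on multimodal unlearning.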

📄 Abstract

Multimodal Large Language Models (MLLMs) achieve impressive performance once optimized on massive datasets. Such datasets often contain sensitive or copyrighted content, raising significant data privacy concerns. Regulatory frameworks mandating the 'right to be forgotten' drive the need for machine unlearning. This technique allows for the removal of target data without resource-consuming retraining. However, while well-studied for text, visual concept unlearning in MLLMs remains underexplored. A primary challenge is precisely removing a target visual concept without disrupting model performance on related entities. To address this, we introduce AUVIC, a novel visual concept unlearning framework for MLLMs. AUVIC applies adversarial perturbations to enable precise forgetting. This approach effectively isolates the target concept while avoiding unintended effects on similar entities. To evaluate our method, we construct VCUBench. It is the first benchmark designed to assess visual concept unlearning in group contexts. Experimental results demonstrate that AUVIC achieves state-of-the-art target forgetting rates while incurring minimal performance degradation on non-target concepts.

[21] DEFT-LLM: Disentangled Expert Feature Tuning for Micro-Expression Recognition

Ren Zhang, Huilai Li, Chao qi, Guoliang Xu, Tianyu Zhou, Wei wei, Jianqin Yin

🧩 TL;DR

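This paper proposes DEFT-LLM, which achieves motion-semantic alignment through multi-expert disentanglement. It addresses two core challenges in micro-expression recognition, the entanglement of static appearance with dynamic motion cues and the semantic gap between text supervision and physical motion, and achieves state-of-the-art performance on multiple benchmarks.

📘 Detailed Summary

Motivation: Micro-expression recognition faces two core challenges: the entanglement of static appearance and dynamic motion cues prevents the model from focusing on subtle motion, and textual labels in existing datasets do not fully correspond to the underlying facial muscle movements, creating a semantic gap between text supervision and physical motion.

Method: The DEFT-LLM framework uses multi-expert disentanglement to decouple facial dynamics into independent, interpretable representations (structure, dynamic textures, and motion semantics). It also constructs Uni-MER, a motion-driven instruction dataset that uses dual constraints from optical flow and Action Unit labels to ensure spatio-temporal consistency and correspondence to the movements.

Result: The method demonstrates state-of-the-art performance on multiple challenging micro-expression recognition benchmarks, with a particular advantage in interpretable modeling of local facial motion.

Conclusion: By integrating instruction-aligned knowledge into DEFT-LLM, the method injects effective physical priors for micro-expressions while leveraging the cross-modal reasoning ability of large language models, enabling precise capture of subtle emotional cues and moving micro-expression recognition toward greater interpretability and physical consistency.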

📄 Abstract

Micro-expression recognition (MER) is crucial for inferring genuine emotion. Applying a multimodal large language model (MLLM) to this task enables spatio-temporal analysis of facial motion and provides interpretable descriptions. However, two core challenges remain: (1) the entanglement of static appearance and dynamic motion cues prevents the model from focusing on subtle motion; (2) textual labels in existing MER datasets do not fully correspond to underlying facial muscle movements, creating a semantic gap between text supervision and physical motion. To address these issues, we propose DEFT-LLM, which achieves motion-semantic alignment through multi-expert disentanglement. We first introduce Uni-MER, a motion-driven instruction dataset designed to align text with local facial motion. Its construction leverages dual constraints from optical flow and Action Unit (AU) labels to ensure spatio-temporal consistency and reasonable correspondence to the movements. We then design an architecture with three experts to decouple facial dynamics into independent and interpretable representations (structure, dynamic textures, and motion semantics). By integrating the instruction-aligned knowledge from Uni-MER into DEFT-LLM, our method injects effective physical priors for micro-expressions while also leveraging the cross-modal reasoning ability of large language models, thus enabling precise capture of subtle emotional cues. Experiments on multiple challenging MER benchmarks demonstrate state-of-the-art performance, as well as a particular advantage in interpretable modeling of local facial motion.

[22] Language-Guided Graph Representation Learning for Video Summarization

Wenrui Li, Wei Han, Hengyu Man, Wangmeng Zuo, Xiaopeng Fan, Yonghong Tian

🧩 TL;DR

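This paper proposes a novel Language-guided Graph Representation Learning Network (LGRLN) that models relationships among video frames with graph structures. It addresses the difficulty existing video summarization methods have in capturing global dependencies and accommodating multimodal user customization, significantly improving performance while substantially reducing computational overhead.

📘 Detailed Summary

Motivation: Existing video summarization methods face two main challenges: they struggle to capture global dependencies in video content, and they cannot accommodate multimodal user customization. In addition, temporal proximity between video frames does not always correspond to semantic proximity, which limits the performance of existing methods.

Method: LGRLN includes a video graph generator that converts video frames into a structured graph to preserve temporal order and contextual dependencies; by constructing forward, backward, and undirected graphs, it preserves the sequentiality and contextual relationships of video content. An intra-graph relational reasoning module with a dual-threshold graph convolution mechanism distinguishes semantically relevant frames from irrelevant ones among nodes, and a language-guided cross-modal embedding module generates video summaries with specific textual descriptions.

Result: Experiments show that the method outperforms existing approaches on multiple benchmarks while reducing inference time and model parameters by 87.8% and 91.7%, respectively, significantly improving computational efficiency while maintaining high performance.

Conclusion: The study demonstrates the effectiveness of graph-structured modeling for video summarization and achieves user-customized summary generation through language-guided cross-modal learning. It offers a new technical approach for multimedia processing, and its greatly reduced model complexity makes it better suited to practical deployment.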

📄 Abstract

With the rapid growth of video content on social media, video summarization has become a crucial task in multimedia processing. However, existing methods face challenges in capturing global dependencies in video content and accommodating multimodal user customization. Moreover, temporal proximity between video frames does not always correspond to semantic proximity. To tackle these challenges, we propose a novel Language-guided Graph Representation Learning Network (LGRLN) for video summarization. Specifically, we introduce a video graph generator that converts video frames into a structured graph to preserve temporal order and contextual dependencies. By constructing forward, backward, and undirected graphs, the video graph generator effectively preserves the sequentiality and contextual relationships of video content. We design an intra-graph relational reasoning module with a dual-threshold graph convolution mechanism, which distinguishes semantically relevant frames from irrelevant ones among nodes. Additionally, our proposed language-guided cross-modal embedding module generates video summaries with specific textual descriptions. We model the summary generation output as a mixture of Bernoulli distributions and solve it with the EM algorithm. Experimental results show that our method outperforms existing approaches across multiple benchmarks. Moreover, the proposed LGRLN reduces inference time and model parameters by 87.8% and 91.7%, respectively. Our code and pre-trained models are available at https://github.com/liwrui/LGRLN.
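
The forward, backward, and undirected graphs mentioned above can be pictured with the small sketch below. It is an illustration only: the abstract does not say how edges are chosen, so the fixed temporal `window` rule and the function name `build_frame_graphs` are assumptions, and the actual generator may connect frames by semantic similarity instead.

```python
def build_frame_graphs(n_frames, window=1):
    """Return forward, backward, and undirected adjacency matrices over frames.

    Assumption: frame i links forward to the next `window` frames; the backward
    graph is the transpose, and the undirected graph is their union.
    """
    forward = [[0] * n_frames for _ in range(n_frames)]
    for i in range(n_frames):
        for j in range(i + 1, min(n_frames, i + window + 1)):
            forward[i][j] = 1
    backward = [[forward[j][i] for j in range(n_frames)] for i in range(n_frames)]
    undirected = [
        [forward[i][j] | backward[i][j] for j in range(n_frames)]
        for i in range(n_frames)
    ]
    return forward, backward, undirected


if __name__ == "__main__":
    for name, graph in zip(("forward", "backward", "undirected"), build_frame_graphs(3)):
        print(name, graph)
```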

[23] The Persistence of Cultural Memory: Investigating Multimodal Iconicity in Diffusion Models

Maria-Teresa De Rosa Palmini, Eva Cetinic

🧩 TL;DR

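This paper proposes a new framework for evaluating cultural memory in text-to-image diffusion models. It separates recognition of a cultural reference from its realization and reveals how models balance replicating and reinterpreting cultural knowledge.

📘 Detailed Summary

Motivation: The study addresses the blurred boundary between generalization and memorization in text-to-image diffusion models, focusing on multimodal iconicity: cases where images and texts evoke culturally shared associations, such as a title recalling a familiar artwork or film scene. Prior research on memorization and unlearning emphasizes forgetting, whereas this study examines what models remember and how, focusing on the balance between recognizing cultural references and reproducing them.

Method: An evaluation framework is introduced that separates recognition (whether a model identifies a reference) from realization (how it depicts the reference through replication or reinterpretation), with measures that quantify both dimensions. The framework is used to evaluate five diffusion models on 767 Wikidata-derived cultural references spanning static and dynamic imagery, and prompt perturbation experiments with synonym substitutions and literal image descriptions assess linguistic sensitivity.

Result: The framework distinguishes replication from transformation more effectively than existing similarity-based methods. The prompt perturbation experiments show that models often reproduce iconic visual structures even when textual cues are altered. The analysis also finds that cultural alignment correlates not only with training data frequency but also with textual uniqueness, reference popularity, and creation date.

Conclusion: The value of diffusion models lies not only in what they reproduce but in how they transform and recontextualize cultural knowledge. The work advances evaluation beyond simple text-image matching toward richer contextual understanding and reveals the complex dynamics of how models handle cultural knowledge.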

📄 Abstract

Our work addresses the ambiguity between generalization and memorization in text-to-image diffusion models, focusing on a specific case we term multimodal iconicity. This refers to instances where images and texts evoke culturally shared associations, such as when a title recalls a familiar artwork or film scene. While prior research on memorization and unlearning emphasizes forgetting, we examine what is remembered and how, focusing on the balance between recognizing cultural references and reproducing them. We introduce an evaluation framework that separates recognition, whether a model identifies a reference, from realization, how it depicts it through replication or reinterpretation, quantified through measures capturing both dimensions. By evaluating five diffusion models across 767 Wikidata-derived cultural references spanning static and dynamic imagery, we show that our framework distinguishes replication from transformation more effectively than existing similarity-based methods. To assess linguistic sensitivity, we conduct prompt perturbation experiments using synonym substitutions and literal image descriptions, finding that models often reproduce iconic visual structures even when textual cues are altered. Finally, our analysis shows that cultural alignment correlates not only with training data frequency, but also textual uniqueness, reference popularity, and creation date. Our work reveals that the value of diffusion models lies not only in what they reproduce but in how they transform and recontextualize cultural knowledge, advancing evaluation beyond simple text-image matching toward richer contextual understanding.

[24] Benchmarking Visual LLMs Resilience to Unanswerable Questions on Visually Rich Documents

Davide Napolitano, Luca Cagliero, Fabrizio Battiloro

🧩 TL;DR

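This paper presents the VRD-UQA benchmark for evaluating the robustness of visual large language models to plausible yet unanswerable questions on visually rich documents, revealing the limitations of existing models in detecting unanswerable questions.

📘 Detailed Summary

Motivation: Although visual large language models excel at visual question answering on multi-page visually rich documents, their ability to detect unanswerable questions remains an open research question, particularly for questions that appear plausible but cannot be answered because of concept swaps or layout changes.

Method: The VRD-UQA benchmark automatically alters the questions of existing VQA datasets to generate plausible yet unanswerable questions, verifies their unanswerability with a VLLM-as-a-judge approach, and evaluates model performance under different types of corruption.

Result: Experiments on 12 models show that visual large language models have significant limitations in detecting unanswerable questions, particularly in page-level and document-level accuracy. The experiments also analyze the effect of different corruption types and the effectiveness of different knowledge injection strategies.

Conclusion: The study reveals the limited robustness of visual large language models on visually rich documents. VRD-UQA can serve as an evaluation framework for developing resilient document VQA systems and provides direction for improving models' ability to detect unanswerable questions.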

📄 Abstract

The evolution of Visual Large Language Models (VLLMs) has revolutionized the automatic understanding of Visually Rich Documents (VRDs), which contain both textual and visual elements. Although VLLMs excel in Visual Question Answering (VQA) on multi-page VRDs, their ability to detect unanswerable questions is still an open research question. Our research delves into the robustness of the VLLMs to plausible yet unanswerable questions, i.e., questions that appear valid but cannot be answered due to subtle corruptions caused by swaps between related concepts or plausible question formulations. Corruptions are generated by replacing the original natural language entities with other ones of the same type, belonging to different document elements, and in different layout positions or pages of the related document. To this end, we present VRD-UQA (VISUALLY RICH DOCUMENT UNANSWERABLE QUESTION ANSWERING), a benchmark for evaluating VLLMs' resilience to plausible yet unanswerable questions across multiple dimensions. It automatically alters the questions of existing VQA datasets consisting of multi-page VRDs, verifies their unanswerability using a VLLM-as-a-judge approach, and then thoroughly evaluates VLLMs' performance. Experiments, run on 12 models, analyze: (1) The VLLMs' accuracy in detecting unanswerable questions at both page and document levels; (2) The effect of different types of corruption (NLP entity, document element, layout); (3) The effectiveness of different knowledge injection strategies based on in-context learning (OCR, multi-page selection, or the possibility of unanswerability). Our findings reveal VLLMs' limitations and demonstrate that VRD-UQA can serve as an evaluation framework for developing resilient document VQA systems.

[25] ERMoE: Eigen-Reparameterized Mixture-of-Experts for Stable Routing and Interpretable Specialization

Anzhe Cheng, Shukai Duan, Shixuan Li, Chenzhong Yin, Mingxi Cheng, Heng Ping, Tamoghna Chattopadhyay, Sophia I Thomopoulos, Shahin Nazarian, Paul Thompson, Paul Bogdan

🧩 TL;DR

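This paper proposes ERMoE, a sparse Mixture-of-Experts transformer that reparameterizes each expert in a learned orthonormal eigenbasis and replaces learned gating logits with an eigenbasis score. It addresses unstable routing and expert underutilization in MoE architectures and achieves state-of-the-art performance on multiple benchmarks.

📘 Detailed Summary

Motivation: Mixture-of-Experts architectures expand model capacity by sparsely activating experts but face two core challenges: misalignment between router logits and each expert's internal structure leads to unstable routing and expert underutilization, and load imbalances create bottlenecks. Standard solutions such as auxiliary load-balancing losses can reduce load disparities but often weaken expert specialization and hurt downstream performance.

Method: ERMoE reparameterizes each expert in a learned orthonormal eigenbasis and replaces learned gating logits with an eigenbasis score, defined as the cosine similarity between input features and an expert's basis. This content-aware routing ties token assignments directly to experts' representation spaces, stabilizing utilization and promoting interpretable specialization while preserving sparsity.

Result: ERMoE achieves state-of-the-art accuracy on ImageNet classification and cross-modal image-text retrieval benchmarks while naturally producing flatter expert load distributions. In addition, a 3D MRI variant improves brain age prediction accuracy by more than 7% and yields anatomically interpretable expert specializations.

Conclusion: ERMoE introduces a new architectural principle for sparse expert models that directly addresses routing instabilities and improves performance through scalable, interpretable specialization. It removes the need for explicit balancing losses and avoids the interfering gradients they introduce, giving MoE architectures a more stable and effective routing mechanism.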

📄 Abstract

Mixture-of-Experts (MoE) architectures expand model capacity by sparsely activating experts but face two core challenges: misalignment between router logits and each expert's internal structure leads to unstable routing and expert underutilization, and load imbalances create straggler bottlenecks. Standard solutions, such as auxiliary load-balancing losses, can reduce load disparities but often weaken expert specialization and hurt downstream performance. To address these issues, we propose ERMoE, a sparse MoE transformer that reparameterizes each expert in a learned orthonormal eigenbasis and replaces learned gating logits with an "Eigenbasis Score", defined as the cosine similarity between input features and an expert's basis. This content-aware routing ties token assignments directly to experts' representation spaces, stabilizing utilization and promoting interpretable specialization without sacrificing sparsity. Crucially, ERMoE removes the need for explicit balancing losses and avoids the interfering gradients they introduce. We show that ERMoE achieves state-of-the-art accuracy on ImageNet classification and cross-modal image-text retrieval benchmarks (e.g., COCO, Flickr30K), while naturally producing flatter expert load distributions. Moreover, a 3D MRI variant (ERMoE-ba) improves brain age prediction accuracy by more than 7% and yields anatomically interpretable expert specializations. ERMoE thus introduces a new architectural principle for sparse expert models that directly addresses routing instabilities and enables improved performance with scalable, interpretable specialization.
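
One way to read "cosine similarity between input features and an expert's basis" is as the cosine of the angle between a token and its projection onto the subspace spanned by the expert's orthonormal basis. The sketch below implements that reading. It is an assumption rather than the paper's confirmed definition, and the names `eigenbasis_score` and `route_top_k` are hypothetical.

```python
import math


def eigenbasis_score(x, basis):
    """Cosine between x and its projection onto span(basis).

    `basis` is a list of orthonormal vectors (assumption), so the projection
    norm is the root of the summed squared dot products.
    """
    norm_x = math.sqrt(sum(v * v for v in x))
    if norm_x == 0.0:
        return 0.0
    coeffs = [sum(b_i * x_i for b_i, x_i in zip(b, x)) for b in basis]
    return math.sqrt(sum(c * c for c in coeffs)) / norm_x


def route_top_k(x, expert_bases, k=1):
    """Send the token to the k experts whose subspaces align best with it."""
    scores = [eigenbasis_score(x, basis) for basis in expert_bases]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k], scores


if __name__ == "__main__":
    experts = [
        [[1.0, 0.0, 0.0]],                   # expert 0 spans the first axis
        [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],  # expert 1 spans the other two axes
    ]
    print(route_top_k([3.0, 0.0, 4.0], experts, k=1))
```

Because the score depends only on the token and the expert's own basis, no separate learned gating matrix is involved in this sketch.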

[26] ImAgent: A Unified Multimodal Agent Framework for Test-Time Scalable Image Generation

Kaishen Wang, Ruibo Chen, Tong Zheng, Heng Huang

🧩 TL;DR

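This paper introduces ImAgent, a training-free unified multimodal agent that integrates reasoning, generation, and self-evaluation within a single framework for efficient test-time scaling, significantly improving image generation quality and semantic alignment.

📘 Detailed Summary

Motivation: Current text-to-image models have made remarkable progress in generating visually realistic and semantically coherent images, but they still suffer from randomness and inconsistency with the given prompts, particularly when textual descriptions are vague or underspecified. Existing approaches such as prompt rewriting, best-of-N sampling, and self-refinement can mitigate these issues but usually require additional modules and operate independently, hindering test-time scaling efficiency and increasing computational overhead.

Method: ImAgent is a training-free unified multimodal agent that integrates reasoning, generation, and self-evaluation within a single framework. Guided by a policy controller, multiple generation actions dynamically interact and self-organize to enhance image fidelity and semantic alignment without relying on external models.

Result: Extensive experiments on image generation and editing tasks show that ImAgent consistently improves over the backbone and even surpasses other strong baselines where the backbone model fails, highlighting the potential of unified multimodal agents for adaptive and efficient image generation under test-time scaling.

Conclusion: The study demonstrates the potential of unified multimodal agents for adaptive and efficient image generation and offers a new framework for test-time scaling that significantly improves generation quality and semantic consistency without additional training.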

📄 Abstract

Recent text-to-image (T2I) models have made remarkable progress in generating visually realistic and semantically coherent images. However, they still suffer from randomness and inconsistency with the given prompts, particularly when textual descriptions are vague or underspecified. Existing approaches, such as prompt rewriting, best-of-N sampling, and self-refinement, can mitigate these issues but usually require additional modules and operate independently, hindering test-time scaling efficiency and increasing computational overhead. In this paper, we introduce ImAgent, a training-free unified multimodal agent that integrates reasoning, generation, and self-evaluation within a single framework for efficient test-time scaling. Guided by a policy controller, multiple generation actions dynamically interact and self-organize to enhance image fidelity and semantic alignment without relying on external models. Extensive experiments on image generation and editing tasks demonstrate that ImAgent consistently improves over the backbone and even surpasses other strong baselines where the backbone model fails, highlighting the potential of unified multimodal agents for adaptive and efficient image generation under test-time scaling.

Haoran Chen, Houze Xu, Micah Goldblum, Daoguo Dong, Zuxuan Wu

🧩 TL;DR

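This paper proposes the DMC and DMC-OT frameworks to address classifier bias in class-incremental learning with CLIP. By decoupling vision encoder adaptation from textual soft prompt optimization, they enable continual learning that preserves cross-modal alignment.

📘 Detailed Summary

Motivation: Current CLIP-based class-incremental learning methods suffer from severe classifier bias: when task-specific soft prompts are learned for new classes, the text prototypes overfit to recent categories, especially when prior data are unavailable. In addition, existing generative replay methods overlook the distributional drift that arises when the vision encoder is updated.

Method: DMC is a two-stage framework that decouples vision encoder adaptation from textual soft prompt optimization; in each stage, the other modality is frozen and serves as a semantic anchor. DMC-OT is an enhanced version that adds an optimal-transport guided calibration strategy to align memory statistics across evolving encoders, along with a task-specific prompting design that enhances inter-task separability.

Result: Extensive experiments on CIFAR-100, Imagenet-R, CUB-200, and UCF-101 show that both DMC and DMC-OT achieve state-of-the-art performance, with DMC-OT further improving accuracy by an average of 1.80%.

Conclusion: The study shows that decoupled modality adaptation and optimal-transport calibration can effectively mitigate classifier bias in continual learning with CLIP, offering a new technical path for incremental learning with pre-trained models and underscoring the importance of cross-modal alignment in continual learning.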

📄 Abstract

Class-incremental learning (CIL) enables models to continuously learn new categories from sequential tasks without forgetting previously acquired knowledge. While recent advances in vision-language models such as CLIP have demonstrated strong generalization across domains, extending them to continual settings remains challenging. In particular, learning task-specific soft prompts for newly introduced classes often leads to severe classifier bias, as the text prototypes overfit to recent categories when prior data are unavailable. In this paper, we propose DMC, a simple yet effective two-stage framework for CLIP-based CIL that decouples the adaptation of the vision encoder and the optimization of textual soft prompts. Each stage is trained with the other frozen, allowing one modality to act as a stable semantic anchor for the other to preserve cross-modal alignment. Furthermore, current CLIP-based CIL approaches typically store class-wise Gaussian statistics for generative replay, yet they overlook the distributional drift that arises when the vision encoder is updated over time. To address this issue, we introduce DMC-OT, an enhanced version of DMC that incorporates an optimal-transport guided calibration strategy to align memory statistics across evolving encoders, along with a task-specific prompting design that enhances inter-task separability. Extensive experiments on CIFAR-100, Imagenet-R, CUB-200, and UCF-101 demonstrate that both DMC and DMC-OT achieve state-of-the-art performance, with DMC-OT further improving accuracy by an average of 1.80%.

[28] PAS: Prelim Attention Score for Detecting Object Hallucinations in Large Vision-Language Models

Nhat Hoang-Xuan, Minh Vu, My T. Thai, Manish Bhattarai

🧩 TL;DR

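This paper proposes the Prelim Attention Score (PAS), a lightweight, training-free method that detects object hallucinations by analyzing a large vision-language model's attention weights over previously generated tokens, achieving state-of-the-art hallucination detection performance.

📘 Detailed Summary

Motivation: Large vision-language models are powerful but remain unreliable because of object hallucinations. The study finds that in many hallucinatory predictions the model effectively ignores the image content and instead relies on previously generated output to infer new objects.

Method: A mutual information analysis shows that weak image dependence strongly correlates with hallucination. Building on this, the Prelim Attention Score (PAS) is a lightweight signal computed from attention weights over prelim tokens; it requires no additional forward passes and can be computed on the fly during inference.

Result: PAS achieves state-of-the-art object hallucination detection across multiple models and datasets and enables real-time filtering and intervention, with no training or additional computational overhead.

Conclusion: The study reveals the key role that previously generated tokens play in object hallucination in LVLMs. The proposed PAS signal offers an effective solution for real-time hallucination detection and is significant for improving model reliability.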

📄 Abstract

Large vision-language models (LVLMs) are powerful, yet they remain unreliable due to object hallucinations. In this work, we show that in many hallucinatory predictions the LVLM effectively ignores the image and instead relies on previously generated output (prelim) tokens to infer new objects. We quantify this behavior via the mutual information between the image and the predicted object conditioned on the prelim, demonstrating that weak image dependence strongly correlates with hallucination. Building on this finding, we introduce the Prelim Attention Score (PAS), a lightweight, training-free signal computed from attention weights over prelim tokens. PAS requires no additional forward passes and can be computed on the fly during inference. Exploiting this previously overlooked signal, PAS achieves state-of-the-art object-hallucination detection across multiple models and datasets, enabling real-time filtering and intervention.
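
A rough sketch of an attention-based signal in the spirit of PAS is shown below: for the token being generated, it sums the attention mass that each head places on previously generated (prelim) positions and averages over heads. The abstract does not specify how PAS aggregates over heads or layers, so this aggregation, the function name, and the flagging threshold in the usage example are all assumptions.

```python
def prelim_attention_score(attn_rows, prelim_positions):
    """Average, over heads, of the attention mass placed on prelim tokens.

    attn_rows: one attention distribution per head for the current token,
    each a list of weights over all context positions (summing to 1).
    prelim_positions: indices of previously generated output tokens.
    """
    per_head = [sum(row[p] for p in prelim_positions) for row in attn_rows]
    return sum(per_head) / len(per_head)


if __name__ == "__main__":
    # Hypothetical context: positions 0-1 are image tokens, 2-3 are prelim tokens.
    attention = [
        [0.1, 0.2, 0.3, 0.4],  # head 1 attends mostly to prelim tokens
        [0.4, 0.4, 0.1, 0.1],  # head 2 attends mostly to the image
    ]
    score = prelim_attention_score(attention, prelim_positions=[2, 3])
    # Hypothetical rule: a high score marks the object token for review.
    print(score, score > 0.5)
```

Since the attention weights are already produced during decoding, a signal of this kind needs no extra forward pass, which matches the cost profile the abstract describes.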

[29] CLUE: Controllable Latent space of Unprompted Embeddings for Diversity Management in Text-to-Image Synthesis

Keunwoo Park, Jihye Chae, Joong Ho Ahn, Jihoon Kweon

🧩 TL;DR

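This paper proposes CLUE, a generative model framework that achieves diverse yet stable image generation without additional data, aimed particularly at data-scarce fields such as medicine. It obtains a continuous latent space representation by adding a Style Encoder and a second attention layer to the Stable Diffusion architecture.

📘 Detailed Summary

Motivation: Current text-to-image synthesis models perform well in general domains, but in specialized fields such as medicine they face limited types and insufficient amounts of data. Existing methods struggle to generate diverse images while maintaining stability, particularly without relying on additional data.

Method: Built on the Stable Diffusion architecture, CLUE introduces a Style Encoder that processes images and prompts to generate style embeddings, which are fed into a new second attention layer of the U-Net. Through Kullback-Leibler divergence, the latent space achieves a continuous, prompt-independent representation within Gaussian regions.

Result: On an otitis media dataset, CLUE reduced FID from 46.81 to 9.30 and improved recall from 49.60% to 70.29%. A classifier trained only on synthetic data at 1000% scale achieved an F1 score of 83.21%, and combining synthetic data with an equal amount of real data achieved 94.76%. On an external dataset, synthetic-only training achieved an F1 score of 76.77% and the combined approach achieved 85.78%.

Conclusion: CLUE demonstrates that diverse yet stable image generation is feasible from limited datasets and provides an effective data augmentation method for domain-specific applications, with particular value in data-scarce settings such as medicine.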

📄 Abstract

Text-to-image synthesis models require the ability to generate diverse images while maintaining stability. To overcome this challenge, a number of methods have been proposed, including the collection of prompt-image datasets and the integration of additional data modalities during training. Although these methods have shown promising results in general domains, they face limitations when applied to specialized fields such as medicine, where only limited types and insufficient amounts of data are available. We present CLUE (Controllable Latent space of Unprompted Embeddings), a generative model framework that achieves diverse generation while maintaining stability through fixed-format prompts without requiring any additional data. Based on the Stable Diffusion architecture, CLUE employs a Style Encoder that processes images and prompts to generate style embeddings, which are subsequently fed into a new second attention layer of the U-Net architecture. Through Kullback-Leibler divergence, the latent space achieves continuous representation of image features within Gaussian regions, independent of prompts. Performance was assessed on an otitis media dataset. CLUE reduced FID to 9.30 (vs. 46.81) and improved recall to 70.29% (vs. 49.60%). A classifier trained on synthetic-only data at 1000% scale achieved an F1 score of 83.21% (vs. 73.83%). Combining synthetic data with equal amounts of real data achieved an F1 score of 94.76%, higher than when using only real data. On an external dataset, synthetic-only training achieved an F1 score of 76.77% (vs. 60.61%) at 1000% scale. The combined approach achieved an F1 score of 85.78%, higher than when using only the internal dataset. These results demonstrate that CLUE enables diverse yet stable image generation from limited datasets and serves as an effective data augmentation method for domain-specific applications.
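
The abstract does not spell out the form of the Kullback-Leibler term. A common choice for keeping embeddings within Gaussian regions is the VAE-style closed-form KL divergence between a diagonal Gaussian N(μ, σ²) and the standard normal, KL = ½ Σ (μ² + σ² − 1 − log σ²). The sketch below assumes that form, and the function name and the (mean, log-variance) parameterization are hypothetical.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ).

    mu and log_var are per-dimension lists describing a style embedding
    distribution (assumed parameterization).
    """
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var)
    )


if __name__ == "__main__":
    print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # matches the prior exactly
    print(kl_to_standard_normal([1.0, 0.0], [0.0, 0.0]))  # shifted mean is penalized
```

Under this assumption, minimizing the term pulls style embeddings toward a shared Gaussian prior, so new styles can be sampled from that prior at generation time without a prompt.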

Jiajun Chen, Sai Cheng, Yutao Yuan, Yirui Zhang, Haitao Yuan, Peng Peng, Yi Zhong

🧩 TL;DR

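This paper proposes PROMISE, a cross-modal representation learning framework designed for missing-modality scenarios. By incorporating multimodal prompt learning into a hierarchical contrastive learning framework with a dedicated prompt-attention mechanism, it addresses representation inconsistency when modalities are missing.

📘 Detailed Summary

Motivation: Existing multimodal models degrade significantly when modalities are missing in real-world scenarios, mainly because of inconsistent representation learning between complete multimodal data and incomplete modality scenarios. Existing approaches handle missing modalities through relatively simple generation methods, which fail to adequately preserve cross-modal consistency and lead to suboptimal performance.

Method: PROMISE incorporates multimodal prompt learning into a hierarchical contrastive learning framework, equipped with a specially designed prompt-attention mechanism. The mechanism dynamically generates robust and consistent representations for specific missing-modality scenarios, bridging the representational gap between complete and incomplete data.

Result: Extensive experiments on benchmark datasets and comprehensive ablation studies show that PROMISE outperforms current state-of-the-art multimodal methods, in particular maintaining better representation consistency and generalization in missing-modality scenarios.

Conclusion: The study shows that a framework combining prompt learning with hierarchical contrastive learning can effectively address missing modalities. It offers an important technical path toward robust multimodal representation learning in real-world applications and highlights the key role of dynamically generated robust representations in preserving cross-modal consistency.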

📄 Abstract

Multimodal models integrating natural language and visual information have substantially improved generalization of representation models. However, their effectiveness significantly declines in real-world situations where certain modalities are missing or unavailable. This degradation primarily stems from inconsistent representation learning between complete multimodal data and incomplete modality scenarios. Existing approaches typically address missing modalities through relatively simplistic generation methods, yet these approaches fail to adequately preserve cross-modal consistency, leading to suboptimal performance. To overcome this limitation, we propose a novel multimodal framework named PROMISE, a PROMpting-Attentive HIerarchical ContraStive LEarning approach designed explicitly for robust cross-modal representation under conditions of missing modalities. Specifically, PROMISE innovatively incorporates multimodal prompt learning into a hierarchical contrastive learning framework, equipped with a specially designed prompt-attention mechanism. This mechanism dynamically generates robust and consistent representations for scenarios where particular modalities are absent, thereby effectively bridging the representational gap between complete and incomplete data. Extensive experiments conducted on benchmark datasets, along with comprehensive ablation studies, clearly demonstrate the superior performance of PROMISE compared to current state-of-the-art multimodal methods.

[31] EmoVid: A Multimodal Emotion Video Dataset for Emotion-Centric Video Understanding and Generation

Zongyang Qiu, Bingyuan Wang, Xingbei Chen, Yingqing He, Zeyu Wang

🧩 TL;DR

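This paper introduces EmoVid, the first multimodal, emotion-annotated video dataset designed for creative media. Emotion-conditioned video generation built on it significantly improves generation quality on text-to-video and image-to-video tasks, establishing a new benchmark for affective video computing.

📘 Detailed Summary

Motivation: Existing video generation systems focus mainly on low-level visual metrics while neglecting affective dimensions, and the video community lacks dedicated resources to bridge emotion understanding with generative tasks, particularly in stylized and non-realistic contexts.

Method: EmoVid is a multimodal, emotion-annotated video dataset comprising cartoon animations, movie clips, and animated stickers, with each video annotated with emotion labels, visual attributes, and text captions. Building on insights from analyzing the dataset, an emotion-conditioned video generation technique is developed by fine-tuning the Wan2.1 model.

Result: On text-to-video and image-to-video tasks, the generated videos show significant improvements in both quantitative metrics and visual quality. The analysis also uncovers spatial and temporal patterns linking visual features to emotional perceptions across diverse video forms.

Conclusion: EmoVid establishes a new benchmark for affective video computing. It offers valuable insights into visual emotion analysis in artistically styled videos and provides practical methods for enhancing emotional expression in video generation, advancing the integration of emotion understanding and generation tasks in creative media.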

📄 Abstract

Emotion plays a pivotal role in video-based expression, but existing video generation systems predominantly focus on low-level visual metrics while neglecting affective dimensions. Although emotion analysis has made progress in the visual domain, the video community lacks dedicated resources to bridge emotion understanding with generative tasks, particularly for stylized and non-realistic contexts. To address this gap, we introduce EmoVid, the first multimodal, emotion-annotated video dataset specifically designed for creative media, which includes cartoon animations, movie clips, and animated stickers. Each video is annotated with emotion labels, visual attributes (brightness, colorfulness, hue), and text captions. Through systematic analysis, we uncover spatial and temporal patterns linking visual features to emotional perceptions across diverse video forms. Building on these insights, we develop an emotion-conditioned video generation technique by fine-tuning the Wan2.1 model. The results show a significant improvement in both quantitative metrics and the visual quality of generated videos for text-to-video and image-to-video tasks. EmoVid establishes a new benchmark for affective video computing. Our work not only offers valuable insights into visual emotion analysis in artistically styled videos, but also provides practical methods for enhancing emotional expression in video generation.

[32] Draft and Refine with Visual Experts

Sungheon Jeong, Ryozo Masukawa, Jihong Park, Sanggeon Yun, Wenjun Huang, Hanning Chen, Mahdi Imani, Mohsen Imani

🧩 TL;DR

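This paper proposes Draft and Refine (DnR), an agent framework that reduces hallucination in large vision-language models by quantifying how much visual information they use. A question-conditioned utilization metric guides feedback from visual experts, strengthening visual grounding without retraining.

📘 Detailed Summary

Motivation: Existing large vision-language models rely too heavily on linguistic priors rather than visual evidence during reasoning, which produces ungrounded, hallucinated responses. The underlying problem is the lack of a quantitative measure of how much visual information these models actually use, which limits interpretability and evidence-driven behavior.

Method: The core of the DnR agent framework is a question-conditioned utilization metric. It first builds a query-conditioned relevance map to localize question-specific cues, then measures dependence through relevance-guided probabilistic masking. Guided by this metric, the DnR agent refines its initial draft using targeted feedback from external visual experts: expert outputs are rendered as visual cues on the image, and the model is re-queried to select the response with the largest gain in utilization.

Result: Experiments on VQA and captioning benchmarks show consistent accuracy gains across datasets and reduced hallucination. The model obtains better visual grounding through utilization-guided refinement while its architecture stays unchanged.

Conclusion: Quantifying visual utilization offers a principled path toward more interpretable, evidence-driven multimodal agent systems. The work shows that systematically measuring and improving reliance on visual information can strengthen visual grounding without retraining or architectural changes.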

📄 Abstract

While recent Large Vision-Language Models (LVLMs) exhibit strong multimodal reasoning abilities, they often produce ungrounded or hallucinated responses because they rely too heavily on linguistic priors instead of visual evidence. This limitation highlights the absence of a quantitative measure of how much these models actually use visual information during reasoning. We propose Draft and Refine (DnR), an agent framework driven by a question-conditioned utilization metric. The metric quantifies the model's reliance on visual evidence by first constructing a query-conditioned relevance map to localize question-specific cues and then measuring dependence through relevance-guided probabilistic masking. Guided by this metric, the DnR agent refines its initial draft using targeted feedback from external visual experts. Each expert's output (such as boxes or masks) is rendered as visual cues on the image, and the model is re-queried to select the response that yields the largest improvement in utilization. This process strengthens visual grounding without retraining or architectural changes. Experiments across VQA and captioning benchmarks show consistent accuracy gains and reduced hallucination, demonstrating that measuring visual utilization provides a principled path toward more interpretable and evidence-driven multimodal agent systems.

[33] SP-Guard: Selective Prompt-adaptive Guidance for Safe Text-to-Image Generation

Sumin Yu, Taesup Moon

🧩 TL;DR

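SP-Guard is an adaptive, selective safety-guidance method for diffusion models. It estimates prompt harmfulness and applies a selective guidance mask so that only the unsafe regions of an image are guided, producing safer images while minimizing changes to the intended content.

📘 Detailed Summary

Motivation: Diffusion-based text-to-image models generate high-quality images but also make harmful content easy to produce, raising social concerns. Current inference-time guidance methods lack adaptivity and selectivity: they cannot adjust guidance strength to the prompt, and they cannot restrict guidance to the unsafe regions of the image.

Method: SP-Guard estimates how harmful a prompt is and applies a selective guidance mask so that only unsafe regions of the image are guided. It combines two components: adaptive adjustment of guidance strength and region-selective guidance.

Result: Experiments show that SP-Guard generates safer images than existing methods while minimizing unintended alteration of the intended content, giving a better balance between safety and content preservation.

Conclusion: Beyond improving generation safety, the study highlights the importance of transparency and controllability in image generation.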

📄 Abstract

While diffusion-based T2I models have achieved remarkable image generation quality, they also enable easy creation of harmful content, raising social concerns and highlighting the need for safer generation. Existing inference-time guiding methods lack both adaptivity--adjusting guidance strength based on the prompt--and selectivity--targeting only unsafe regions of the image. Our method, SP-Guard, addresses these limitations by estimating prompt harmfulness and applying a selective guidance mask to guide only unsafe areas. Experiments show that SP-Guard generates safer images than existing methods while minimizing unintended content alteration. Beyond improving safety, our findings highlight the importance of transparency and controllability in image generation.

[34] NP-LoRA: Null Space Projection Unifies Subject and Style in LoRA Fusion

Chuheng Chen, Xiaofei Zhou, Geyuan Zhang, Yong Huang

🧩 TL;DR

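This paper proposes NP-LoRA, a LoRA fusion framework based on null space projection that enforces subspace separation to prevent structural interference among principal directions, improving fusion quality for controllable generation. It extracts principal style directions via singular value decomposition, projects the subject LoRA into their orthogonal null space, and offers smooth control over the trade-off between subject fidelity and style consistency.

📘 Detailed Summary

Motivation: Existing LoRA fusion methods rely on weight-based merging, where one LoRA often dominates the other, causing interference and degraded fidelity. The interference is structural: separately trained LoRAs occupy low-rank subspaces of a high-dimensional space, yielding non-orthogonal, overlapping representations that hinder effective composition.

Method: NP-LoRA first extracts the principal style directions via singular value decomposition, then projects the subject LoRA into the orthogonal null space of those directions to prevent structural interference. A soft projection mechanism gives smooth control over the trade-off between subject fidelity and style consistency. The method applies to different backbones and LoRA pairs without retraining.

Result: Experiments show that NP-LoRA consistently improves fusion quality over strong baselines, as measured by DINO- and CLIP-based metrics and by human and LLM preference scores. It applies broadly across backbones and LoRA pairs.

Conclusion: The key insight is that a LoRA's generative behavior is dominated by a few principal directions in its low-rank subspace, and null space projection is an effective way to keep those directions separated during fusion. NP-LoRA offers a new technical route for composing representations in controllable generation.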

📄 Abstract

Low-Rank Adaptation (LoRA) fusion has emerged as a key technique for reusing and composing learned subject and style representations for controllable generation without costly retraining. However, existing methods rely on weight-based merging, where one LoRA often dominates the other, leading to interference and degraded fidelity. This interference is structural: separately trained LoRAs occupy low-rank high-dimensional subspaces, leading to non-orthogonal and overlapping representations. In this work, we analyze the internal structure of LoRAs and find their generative behavior is dominated by a few principal directions in the low-rank subspace, which should remain free from interference during fusion. To achieve this, we propose Null Space Projection LoRA (NP-LoRA), a projection-based framework for LoRA fusion that enforces subspace separation to prevent structural interference among principal directions. Specifically, we first extract principal style directions via singular value decomposition (SVD) and then project the subject LoRA into its orthogonal null space. Furthermore, we introduce a soft projection mechanism that enables smooth control over the trade-off between subject fidelity and style consistency. Experiments show NP-LoRA consistently improves fusion quality over strong baselines (e.g., DINO and CLIP-based metrics, with human and LLM preference scores), and applies broadly across backbones and LoRA pairs without retraining.
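The projection step described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `null_space_project`, the choice of right singular vectors as the "principal style directions", the rank `k`, and the `alpha` parameter used for the soft projection are all hypothetical, since the abstract does not give the exact formulation.

```python
import numpy as np

def null_space_project(subject_delta, style_delta, k=2, alpha=1.0):
    """Remove the top-k style directions from a subject weight update.

    alpha=1.0 is a hard projection; 0 < alpha < 1 is an assumed form of
    "soft" projection that only partially removes the overlap.
    """
    # Right singular vectors of the style update span its principal input directions.
    _, _, vt = np.linalg.svd(style_delta, full_matrices=False)
    v_k = vt[:k].T                                   # (d_in, k), orthonormal columns
    projector = np.eye(style_delta.shape[1]) - alpha * (v_k @ v_k.T)
    return subject_delta @ projector, v_k

rng = np.random.default_rng(0)
# Toy rank-2 updates standing in for the merged LoRA deltas (B @ A).
style = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 6))
subject = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 6))

hard, v_k = null_space_project(subject, style, k=2, alpha=1.0)
soft, _ = null_space_project(subject, style, k=2, alpha=0.5)
fused = style + hard   # subject update no longer acts along the style directions
```

With `alpha=1.0` the projected subject update has no component along the selected style directions; with `alpha=0.5` half of that component remains, which is one simple way to trade subject fidelity against style consistency.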

[35] Evaluating Latent Generative Paradigms for High-Fidelity 3D Shape Completion from a Single Depth Image

Matthias Humt, Ulrich Hillenbrand, Rudolph Triebel

🧩 TL;DR

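This study systematically compares Denoising Diffusion Probabilistic Models and Autoregressive Causal Transformers on 3D shape generation and completion. The diffusion model performs best with continuous latents, while the autoregressive model can match or exceed diffusion performance when both use the same discrete latent space.

📘 Detailed Summary

Motivation: There is no consensus on which generative model suits which task for 3D data, and conditional generation from partial 3D data, such as shape completion, has not been thoroughly studied. This work fills that gap by systematically comparing generative models on shape modeling and completion.

Method: The study adapts two generative models, Denoising Diffusion Probabilistic Models and Autoregressive Causal Transformers, to generative shape modeling and completion. It includes a discriminative model as a baseline and an extensive ablation study.

Result: The diffusion model with continuous latents performs best on multi-modal shape completion from a single, noisy depth image and reaches state-of-the-art performance. When compared on the same discrete latent space, the autoregressive model can match or even exceed diffusion performance.

Conclusion: Different generative models have different strengths on 3D shape tasks: diffusion excels with continuous representations, while autoregressive models remain competitive with discrete ones. The results give practical guidance for choosing a 3D generative model.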

📄 Abstract

While generative models have seen significant adoption across a wide range of data modalities, including 3D data, a consensus on which model is best suited for which task has yet to be reached. Further, conditioning signals such as text and images are frequently used to steer the generation process, whereas others, such as partial 3D data, have not been thoroughly evaluated. In this work, we compare two of the most promising generative models--Denoising Diffusion Probabilistic Models and Autoregressive Causal Transformers--which we adapt for the tasks of generative shape modeling and completion. We conduct a thorough quantitative evaluation and comparison of both tasks, including a baseline discriminative model and an extensive ablation study. Our results show that (1) the diffusion model with continuous latents outperforms both the discriminative model and the autoregressive approach and delivers state-of-the-art performance on multi-modal shape completion from a single, noisy depth image under realistic conditions and (2) when compared on the same discrete latent space, the autoregressive model can match or exceed diffusion performance on these tasks.

[36] Detection of Bark Beetle Attacks using Hyperspectral PRISMA Data and Few-Shot Learning

Mattia Ferrari, Giancarlo Papitto, Giorgio Deligios, Lorenzo Bruzzone

🧩 TL;DR

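This paper proposes a few-shot learning approach based on contrastive learning that detects bark beetle infestations from PRISMA hyperspectral satellite data. A one-dimensional CNN encoder is pre-trained with contrastive learning to extract robust features, which feed support vector regression estimators that assess forest health status from a few labeled samples.

📘 Detailed Summary

Motivation: Bark beetle infestations are a serious threat to the health of coniferous forests, and conventional monitoring suffers from scarce labeled data. This study addresses the limited availability of labeled hyperspectral samples and explores few-shot learning for forest health monitoring.

Method: A contrastive learning framework pre-trains a one-dimensional CNN encoder to extract robust feature representations from PRISMA hyperspectral data. Support vector regression estimators, one per class, are then trained on a few labeled samples to estimate the proportions of healthy, beetle-attacked, and dead trees in each pixel.

Result: Experiments in the Dolomites, Italy, show that the method outperforms approaches that use the original PRISMA spectral bands or Sentinel-2 data. Combining PRISMA hyperspectral data with few-shot learning gives clear advantages for forest health monitoring.

Conclusion: PRISMA hyperspectral data combined with few-shot learning is an effective solution for forest health monitoring. The method detects bark beetle infestations accurately even with limited labeled data, offering a feasible route to large-scale forest monitoring.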

📄 Abstract

Bark beetle infestations represent a serious challenge for maintaining the health of coniferous forests. This paper proposes a few-shot learning approach leveraging contrastive learning to detect bark beetle infestations using satellite PRISMA hyperspectral data. The methodology is based on a contrastive learning framework to pre-train a one-dimensional CNN encoder, enabling the extraction of robust feature representations from hyperspectral data. These extracted features are subsequently used as input to support vector regression estimators, one for each class, trained on a few labeled samples to estimate the proportions of healthy, beetle-attacked, and dead trees for each pixel. Experiments on the study area in the Dolomites show that our method outperforms the use of original PRISMA spectral bands and of Sentinel-2 data. The results indicate that PRISMA hyperspectral data combined with few-shot learning offers significant advantages for forest health monitoring.

[37] Toward Generalized Detection of Synthetic Media: Limitations, Challenges, and the Path to Multimodal Solutions

Redwan Hussain, Mizanur Rahman, Prithwiraj Bhattacharjee

🧩 TL;DR

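This paper systematically reviews 24 studies on AI-generated media detection, identifies the limitations and key challenges shared by existing methods, and proposes a research direction centered on multimodal deep learning models, aiming at more robust and general defenses against synthetic media.

📘 Detailed Summary

Motivation: With the rapid progress of generative techniques such as GANs and diffusion models, the quality of AI-generated media has risen to the point where real and synthetic content are hard to tell apart. Malicious uses such as deepfakes spread misinformation and enable privacy violations and fraud, while existing detection methods fall short in generalization, cross-model adaptability, and handling of multimodal data.

Method: The study analyzes 24 recent works on AI-generated media detection, focusing on deep learning methods based on convolutional neural networks and vision transformers. These methods mainly detect visual, spatial, or temporal anomalies, but their ability to generalize is limited.

Result: The analysis finds that existing detectors generalize poorly to unseen data, struggle with content produced by different generative models, and handle multimodal data and highly modified content badly, exposing fundamental limits in the robustness and generality of current approaches.

Conclusion: Based on this analysis, the paper proposes a research direction centered on multimodal deep learning models, which have the potential to provide more robust and general detection. It gives future researchers a clear starting point for building stronger defenses against harmful synthetic media.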

📄 Abstract

Artificial intelligence (AI) in media has advanced rapidly over the last decade. The introduction of Generative Adversarial Networks (GANs) improved the quality of photorealistic image generation. Diffusion models later brought a new era of generative media. These advances made it difficult to separate real and synthetic content. The rise of deepfakes demonstrated how these tools could be misused to spread misinformation and political conspiracies, violate privacy, and commit fraud. For this reason, many detection models have been developed. They often use deep learning methods such as Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). These models search for visual, spatial, or temporal anomalies. However, such approaches often fail to generalize to unseen data and struggle with content from different generative models. In addition, existing approaches are ineffective on multimodal data and highly modified content. This study reviews twenty-four recent works on AI-generated media detection. Each study was examined individually to identify its contributions and weaknesses. The review then summarizes the common limitations and key challenges faced by current approaches. Based on this analysis, a research direction is suggested with a focus on multimodal deep learning models. Such models have the potential to provide more robust and generalized detection. This direction offers future researchers a clear starting point for building stronger defenses against harmful synthetic media.

[38] Hindsight Distillation Reasoning with Knowledge Encouragement Preference for Knowledge-based Visual Question Answering

Yu Zhao, Ying Zhang, Xuhui Sui, Baohang Zhou, Li Shen, Dacheng Tao

🧩 TL;DR

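This paper proposes the Hindsight Distilled Reasoning (HinD) framework, which uses Knowledge Encouragement Preference Optimization (KEPO) to elicit the internal knowledge reasoning ability of multimodal large language models. It achieves superior performance on KBVQA without commercial model APIs or external knowledge.

📘 Detailed Summary

Motivation: Existing knowledge-based visual question answering methods exploit knowledge through in-context learning or retrieval-augmented generation, but their reasoning remains implicit. They lack explicit multi-step reasoning trajectories from the multimodal large language model, which limits interpretability and reasoning ability.

Method: HinD first prompts a frozen 7B-size multimodal large language model to complete the reasoning process between a question and its ground-truth answer, constructing the Hindsight-Zero training data, and then self-distills this data into a Chain-of-Thought Generator and a Knowledge Generator. Second, KEPO optimizes the Knowledge Generator to prefer under-confident but helpful knowledge over over-confident but unhelpful knowledge.

Result: Experiments on OK-VQA and A-OKVQA validate HinD. Using only reasoning elicited from a 7B-size multimodal large language model, it achieves superior performance without relying on commercial model APIs or external knowledge sources.

Conclusion: Multimodal large language models have an internal knowledge reasoning ability that a suitable training strategy can elicit. This suggests a new way to build more efficient and interpretable knowledge-based visual question answering systems without depending on external knowledge sources.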

📄 Abstract

Knowledge-based Visual Question Answering (KBVQA) necessitates external knowledge incorporation beyond cross-modal understanding. Existing KBVQA methods either utilize implicit knowledge in multimodal large language models (MLLMs) via in-context learning or explicit knowledge via retrieval augmented generation. However, their reasoning processes remain implicit, without explicit multi-step trajectories from MLLMs. To address this gap, we provide a Hindsight Distilled Reasoning (HinD) framework with Knowledge Encouragement Preference Optimization (KEPO), designed to elicit and harness internal knowledge reasoning ability in MLLMs. First, to tackle the reasoning supervision problem, we propose to emphasize the hindsight wisdom of MLLM by prompting a frozen 7B-size MLLM to complete the reasoning process between the question and its ground truth answer, constructing Hindsight-Zero training data. Then we self-distill Hindsight-Zero into Chain-of-Thought (CoT) Generator and Knowledge Generator, enabling the generation of sequential steps and discrete facts. Secondly, to tackle the misalignment between knowledge correctness and confidence, we optimize the Knowledge Generator with KEPO, preferring under-confident but helpful knowledge over the over-confident but unhelpful one. The generated CoT and sampled knowledge are then exploited for answer prediction. Experiments on OK-VQA and A-OKVQA validate the effectiveness of HinD, showing that HinD with elicited reasoning from 7B-size MLLM achieves superior performance without commercial model APIs or outside knowledge.

[39] Explainable Deep Convolutional Multi-Type Anomaly Detection

Alex George, Lyudmila Mihaylova, Sean Anderson

🧩 TL;DR

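This paper proposes MultiTypeFCDD, a lightweight convolutional framework for explainable multi-type anomaly detection that distinguishes anomaly types within a single unified model while significantly reducing computational overhead and inference time.

📘 Detailed Summary

Motivation: Existing explainable anomaly detection methods usually detect that an anomaly exists but cannot tell which type it is, and they require a separate model to be trained and maintained for each object category. Recent large-scale vision-language models have begun to address this, but they are computationally intensive and memory-heavy, which restricts their use in real-time or embedded systems.

Method: MultiTypeFCDD is a simple, lightweight convolutional framework that learns from image-level labels only and produces multi-channel heatmaps, with each channel corresponding to a specific anomaly type. It works as a single unified model that distinguishes anomaly types across multiple object categories, with no per-category models.

Result: On the Real-IAD dataset, the method delivers results competitive with state-of-the-art complex models at a significantly reduced parameter count and inference time.

Conclusion: MultiTypeFCDD is a practical and viable solution for real-world applications with tightly constrained computational resources. Its unified lightweight design handles multi-type anomaly detection and removes the need to maintain multiple models.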

📄 Abstract

Most explainable anomaly detection methods identify anomalies but cannot differentiate the type of anomaly. Furthermore, they often require the costly training and maintenance of separate models for each object category. The lack of specificity is a significant research gap, as identifying the type of anomaly (e.g., "Crack" vs. "Scratch") is crucial for accurate diagnosis that facilitates cost-saving operational decisions across diverse application domains. While some recent large-scale Vision-Language Models (VLMs) have begun to address this, they are computationally intensive and memory-heavy, restricting their use in real-time or embedded systems. We propose MultiTypeFCDD, a simple and lightweight convolutional framework designed as a practical alternative for explainable multi-type anomaly detection. MultiTypeFCDD uses only image-level labels to learn and produce multi-channel heatmaps, where each channel is trained to correspond to a specific anomaly type. The model functions as a single, unified framework capable of differentiating anomaly types across multiple object categories, eliminating the need to train and manage separate models for each object category. On the Real-IAD dataset, our method delivers results competitive with state-of-the-art complex models at significantly reduced parametric load and inference times. This makes it a highly practical and viable solution for real-world applications where computational resources are tightly constrained.

[40] CATS-V2V: A Real-World Vehicle-to-Vehicle Cooperative Perception Dataset with Complex Adverse Traffic Scenarios

Hangyu Li, Bofeng Cao, Zhaohui Liang, Wuzhen Li, Juyoung Oh, Yuxuan Chen, Shixiao Liang, Hang Zhou, Chengyuan Ma, Jiaxi Liu, Zheng Li, Peng Zhang, KeKe Long, Maolin Liu, Jackson Jiang, Chunlei Yu, Shengxiang Liu, Hongkai Yu, Xiaopeng Li

🧩 TL;DR

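This paper introduces CATS-V2V, the first real-world dataset for V2V cooperative perception under complex adverse traffic scenarios. It contains 100 clips, 60K frames of LiDAR point clouds, and 1.26M multi-view images, providing high-quality data infrastructure for the autonomous driving community.

📘 Detailed Summary

Motivation: Existing autonomous driving datasets focus mainly on ordinary traffic scenarios and lack V2V cooperative perception data for complex adverse traffic scenarios, which constrains the benefits cooperative perception can deliver under harsh conditions.

Method: Data were collected with two hardware time-synchronized vehicles, covering 10 weather and lighting conditions across 10 different locations. A target-based temporal alignment method ensures that objects are precisely aligned across all sensor modalities.

Result: The dataset includes 60K frames of 10 Hz LiDAR point clouds, 1.26M multi-view 30 Hz camera images, and 750K high-precision RTK-fixed GNSS and IMU records, with time-consistent 3D bounding box annotations and static scenes for building a 4D BEV representation.

Conclusion: The authors describe CATS-V2V as the largest-scale, most supportive, and highest-quality dataset of its kind to date. It is intended to support autonomous driving tasks and advance cooperative perception research in complex adverse scenarios.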

📄 Abstract

Vehicle-to-Vehicle (V2V) cooperative perception has great potential to enhance autonomous driving performance by overcoming perception limitations in complex adverse traffic scenarios (CATS). Meanwhile, data serves as the fundamental infrastructure for modern autonomous driving AI. However, due to stringent data collection requirements, existing datasets focus primarily on ordinary traffic scenarios, constraining the benefits of cooperative perception. To address this challenge, we introduce CATS-V2V, the first-of-its-kind real-world dataset for V2V cooperative perception under complex adverse traffic scenarios. The dataset was collected by two hardware time-synchronized vehicles, covering 10 weather and lighting conditions across 10 diverse locations. The 100-clip dataset includes 60K frames of 10 Hz LiDAR point clouds and 1.26M multi-view 30 Hz camera images, along with 750K anonymized yet high-precision RTK-fixed GNSS and IMU records. Correspondingly, we provide time-consistent 3D bounding box annotations for objects, as well as static scenes to construct a 4D BEV representation. On this basis, we propose a target-based temporal alignment method, ensuring that all objects are precisely aligned across all sensor modalities. We hope that CATS-V2V, the largest-scale, most supportive, and highest-quality dataset of its kind to date, will benefit the autonomous driving community in related tasks.

Quoc-Huy Trinh, Mustapha Abdullahi, Do Duy Hung Trinh, Bo Zhao, Debesh Jha

🧩 TL;DR

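Viper-F1 is a hybrid state-space vision-language model that replaces attention with Liquid State-Space Dynamics and adds a Token-Grid Correlation Module for fine-grained visual grounding, significantly improving the efficiency of multimodal understanding while keeping inference linear-time.

📘 Detailed Summary

Motivation: Multimodal large language models have made notable progress in vision-language understanding, but their high computational cost limits deployment in resource-constrained scenarios. Existing methods based on Transformer cross-attention have quadratic complexity, and small vision-language models struggle to capture fine-grained, task-relevant visual regions precisely, which degrades performance on real-world fine-grained reasoning tasks.

Method: Viper-F1 is a hybrid state-space vision-language model that replaces attention with efficient Liquid State-Space Dynamics. A Token-Grid Correlation Module computes lightweight correlations between text tokens and image patches and modulates the state-space dynamics via FiLM conditioning, so the model can selectively emphasize visual regions relevant to the text prompt.

Result: Experiments across multiple benchmarks show that Viper-F1 achieves accurate, fine-grained understanding with significantly improved inference efficiency while maintaining linear time complexity.

Conclusion: The study demonstrates the effectiveness of state-space models for multimodal tasks and offers a feasible solution for efficient vision-language understanding in resource-constrained environments. Its fine-grained visual grounding mechanism also improves the model's practicality in real-world applications.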

📄 Abstract

Recent advances in multimodal large language models (MLLMs) have enabled impressive progress in vision-language understanding, yet their high computational cost limits deployment in resource-constrained scenarios such as robotic manipulation, personal assistants, and smart cameras. Most existing methods rely on Transformer-based cross-attention, whose quadratic complexity hinders efficiency. Moreover, small vision-language models often struggle to precisely capture fine-grained, task-relevant visual regions, leading to degraded performance on fine-grained reasoning tasks that limit their effectiveness in the real world. To address these issues, we introduce Viper-F1, a Hybrid State-Space Vision-Language Model that replaces attention with efficient Liquid State-Space Dynamics. To further enhance visual grounding, we propose a Token-Grid Correlation Module, which computes lightweight correlations between text tokens and image patches and modulates the state-space dynamics via FiLM conditioning. This enables the model to selectively emphasize visual regions relevant to the textual prompt while maintaining linear-time inference. Experimental results across multiple benchmarks demonstrate that Viper-F1 achieves accurate, fine-grained understanding with significantly improved efficiency.
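To make the conditioning step concrete, the sketch below shows generic FiLM modulation (a per-feature scale and shift) driven by a text-to-patch relevance score. It is an illustration under assumptions, not Viper-F1's actual module: the function names, the max-over-tokens pooling, and the way relevance is mapped to the scale `gamma` and shift `beta` are all hypothetical choices, and the state-space dynamics themselves are not modeled here.

```python
import numpy as np

def token_grid_correlation(text_tokens, patches):
    """Return one relevance score per image patch (assumed pooling: max over tokens)."""
    scores = text_tokens @ patches.T / np.sqrt(patches.shape[1])   # (T, P)
    return scores.max(axis=0)                                       # (P,)

def film(features, relevance, w_gamma, w_beta):
    """FiLM: scale and shift each patch feature using its relevance score."""
    gamma = 1.0 + relevance[:, None] * w_gamma    # (P, D) scale, identity when relevance is 0
    beta = relevance[:, None] * w_beta            # (P, D) shift
    return gamma * features + beta

rng = np.random.default_rng(0)
text = rng.normal(size=(4, 16))       # 4 text tokens, 16-dim embeddings
patches = rng.normal(size=(9, 16))    # 3x3 grid of image patches
w_gamma = rng.normal(size=16)         # stand-ins for learned projections
w_beta = rng.normal(size=16)

relevance = token_grid_correlation(text, patches)
modulated = film(patches, relevance, w_gamma, w_beta)

# With an all-zero prompt embedding the relevance is zero and features pass through unchanged.
passthrough = film(patches, token_grid_correlation(np.zeros((4, 16)), patches), w_gamma, w_beta)
```

The correlation costs O(T x P) for T text tokens and P patches, which is the sense in which such a module can stay lightweight compared with full cross-attention over long sequences.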

[42] Geospatial Chain of Thought Reasoning for Enhanced Visual Question Answering on Satellite Imagery

Shambhavi Shanker, Manikandan Padmanaban, Jagabondhu Hazra

🧩 TL;DR

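This study proposes a visual question answering framework that combines chain-of-thought reasoning with Direct Preference Optimization, substantially improving geospatial reasoning on satellite imagery and yielding a 34.9% accuracy gain for climate-related applications.

📘 Detailed Summary

Motivation: Existing visual question answering models enable scalable interpretation of remote sensing data but lack the structured reasoning that complex geospatial queries require. This matters most in high-stakes climate applications such as disaster monitoring, infrastructure risk assessment, and urban resilience planning.

Method: The framework integrates chain-of-thought reasoning with Direct Preference Optimization. By generating intermediate rationales, the model handles key tasks such as detection, classification, spatial relations, and comparative analysis more effectively.

Result: Chain-of-thought supervision improves accuracy by 34.9% over direct baselines, and Direct Preference Optimization yields further gains in accuracy and reasoning quality, improving the system's reliability and robustness.

Conclusion: The work advances visual question answering for multispectral Earth observation. Richer geospatial reasoning gives climate-related applications a more effective decision-support tool for high-stakes domains.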

📄 Abstract

Geospatial chain of thought (CoT) reasoning is essential for advancing Visual Question Answering (VQA) on satellite imagery, particularly in climate-related applications such as disaster monitoring, infrastructure risk assessment, urban resilience planning, and policy support. Existing VQA models enable scalable interpretation of remote sensing data but often lack the structured reasoning required for complex geospatial queries. We propose a VQA framework that integrates CoT reasoning with Direct Preference Optimization (DPO) to improve interpretability, robustness, and accuracy. By generating intermediate rationales, the model better handles tasks involving detection, classification, spatial relations, and comparative analysis, which are critical for reliable decision support in high-stakes climate domains. Experiments show that CoT supervision improves accuracy by 34.9% over direct baselines, while DPO yields additional gains in accuracy and reasoning quality. The resulting system advances VQA for multispectral Earth observation by enabling richer geospatial reasoning and more effective climate use cases.
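For readers unfamiliar with DPO, the standard per-pair objective is the negative log-sigmoid of a scaled difference of log-probability ratios between a preferred and a rejected response. The sketch below computes that loss for a single pair in plain Python; the log-probability values and `beta=0.1` are made-up inputs for illustration, and how this paper constructs its preference pairs is not stated in the abstract.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Arguments are summed token log-probabilities of the chosen and rejected
    responses under the trained policy and under the frozen reference model.
    """
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# Policy identical to the reference: no preference expressed yet.
neutral = dpo_loss(-12.0, -12.0, -12.0, -12.0)
# Policy has raised the chosen rationale and lowered the rejected one.
improved = dpo_loss(-10.0, -14.0, -12.0, -12.0)
```

The loss falls as the policy assigns relatively more probability than the reference to the preferred rationale, which is what lets preference data refine reasoning quality on top of CoT supervision.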

[43] Questioning the Stability of Visual Question Answering

Amir Rosenfeld, Neta Glazer, Ethan Fetaya

🧩 TL;DR

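This paper presents the first large-scale, systematic study of how robust vision-language models are to small, meaning-preserving visual and textual perturbations. Modern VLMs are highly sensitive to benign changes such as pixel-level shifts, geometric transformations, and paraphrasing, and sample-level stability is a strong indicator of correctness.

📘 Detailed Summary

Motivation: Vision-language models have made remarkable progress, but their reliability under small, meaning-preserving input changes is poorly understood. This study fills that gap by examining robustness to benign visual and textual perturbations.

Method: The study uses a systematic evaluation framework covering several types of meaning-preserving perturbation: pixel-level shifts, light geometric transformations, padded rescaling, paraphrasing, and multilingual rewrites. Experiments span a broad set of models and datasets.

Result: Modern VLMs are highly sensitive to minor perturbations: a substantial fraction of samples change their predicted answer under at least one visual or textual modification. Even state-of-the-art systems such as GPT-4o and Gemini 2.0 Flash frequently fail under shifts of a few pixels or harmless rephrasings. Stable samples are consistently far more likely to be answered correctly.

Conclusion: The findings expose a fundamental fragility in current VLMs and highlight the need for robustness evaluations that go beyond adversarial perturbations. The stability patterns of small open-source models can predict the correctness of much larger closed-source models with high precision, offering a new angle on reliability assessment.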

📄 Abstract

Visual Language Models (VLMs) have achieved remarkable progress, yet their reliability under small, meaning-preserving input changes remains poorly understood. We present the first large-scale, systematic study of VLM robustness to benign visual and textual perturbations: pixel-level shifts, light geometric transformations, padded rescaling, paraphrasing, and multilingual rewrites that do not alter the underlying semantics of an image-question pair. Across a broad set of models and datasets, we find that modern VLMs are highly sensitive to such minor perturbations: a substantial fraction of samples change their predicted answer under at least one visual or textual modification. We characterize how this instability varies across perturbation types, question categories, and models, revealing that even state-of-the-art systems (e.g., GPT-4o, Gemini 2.0 Flash) frequently fail under shifts as small as a few pixels or harmless rephrasings. We further show that sample-level stability serves as a strong indicator of correctness: stable samples are consistently far more likely to be answered correctly. Leveraging this, we demonstrate that the stability patterns of small, accessible open-source models can be used to predict the correctness of much larger closed-source models with high precision. Our findings expose a fundamental fragility in current VLMs and highlight the need for robustness evaluations that go beyond adversarial perturbations, focusing instead on invariances that models should reliably uphold.
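One simple way to operationalize sample-level stability is the fraction of perturbed variants of an image-question pair whose answer matches the answer on the unperturbed input. The sketch below is an assumed definition for illustration, not necessarily the paper's exact metric; the function name `stability` and the case-insensitive, whitespace-trimmed string match are hypothetical choices.

```python
def stability(original_answer, perturbed_answers):
    """Fraction of perturbed variants that reproduce the original answer."""
    def norm(text):
        return text.strip().lower()
    if not perturbed_answers:
        return 1.0   # nothing was perturbed, so nothing changed
    same = sum(norm(a) == norm(original_answer) for a in perturbed_answers)
    return same / len(perturbed_answers)

# Answers from one model under four benign perturbations of the same sample.
score = stability("cat", ["Cat", "cat ", "dog", "cat"])
is_stable = score == 1.0   # strict criterion: no perturbation changed the answer
```

A per-sample score of this kind, computed with a small open model, could then serve as a feature for predicting whether a larger model answers the same sample correctly, which is the use the abstract describes.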

Mohammad Areeb Qazi, Munachiso S Nwadike, Ibrahim Almakky, Mohammad Yaqub, Numan Saeed

🧩 TL;DR

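This paper proposes MAFM^3, a framework in which lightweight modular components let a single foundation model expand across multiple domains, tasks, and modalities in medical imaging, enabling efficient multitask, multimodality adaptation with performance gains on both CT and PET tasks.

📘 Detailed Summary

Motivation: Medical imaging suffers from data scarcity, so pre-training a separate foundation model for every domain, modality, or task is impractical. Existing adaptation methods usually treat each new task or modality in isolation and lack a unified, expandable framework for multitask, multimodality adaptation.

Method: MAFM^3 uses lightweight modular components as specialized skill sets, so a single foundation model can flexibly activate the appropriate capability depending on the input type or clinical objective, giving a unified and expandable adaptation framework.

Result: A chest CT foundation model was adapted with prognosis and segmentation modules, and performance improved on both tasks. After incorporating PET scans, MAFM^3 improved the Dice score by 5% over the baselines.

Conclusion: Foundation models equipped with modular components are not constrained to their initial training scope and can evolve into multitask, multimodality systems for medical imaging, offering an efficient and scalable solution for medical AI.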

📄 Abstract

Foundation models are trained on extensive datasets to capture the general trends of a domain. However, in medical imaging, the scarcity of data makes pre-training for every domain, modality, or task challenging. Instead of building separate models, we propose MAFM^3 (Modular Adaptation of Foundation Models for Multi-Modal Medical AI), a framework that enables a single foundation model to expand into diverse domains, tasks, and modalities through lightweight modular components. These components serve as specialized skill sets that allow the system to flexibly activate the appropriate capability at inference time, depending on the input type or clinical objective. Unlike conventional adaptation methods that treat each new task or modality in isolation, MAFM^3 provides a unified and expandable framework for efficient multitask and multimodality adaptation. Empirically, we validate our approach by adapting a chest CT foundation model initially trained for classification into prognosis and segmentation modules. Our results show improved performance on both tasks. Furthermore, by incorporating PET scans, MAFM^3 achieved a 5% improvement in the Dice score compared to the respective baselines. These findings establish that foundation models, when equipped with modular components, are not inherently constrained to their initial training scope but can evolve into multitask, multimodality systems for medical imaging. The code implementation of this work can be found at https://github.com/Areeb2735/CTscan_prognosis_VLM
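For context on the segmentation metric quoted above, the Dice score of a predicted mask A against a reference mask B is 2|A ∩ B| / (|A| + |B|). The sketch below computes it for flat binary masks in plain Python; the toy masks are invented for illustration, and returning 1.0 when both masks are empty is a common convention rather than something the paper specifies.

```python
def dice_score(pred, target):
    """Dice coefficient for two equal-length binary masks (0/1 values)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        return 1.0   # both masks empty: treated as perfect agreement
    return 2.0 * intersection / total

pred = [1, 1, 0, 0, 1]      # predicted lesion voxels
target = [1, 0, 0, 1, 1]    # reference annotation
score = dice_score(pred, target)   # 2 * 2 / (3 + 3)
```

Because the score is bounded between 0 and 1, a reported gain of 5% is worth reading alongside the baseline value, which the abstract does not give.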

[45] Positional Bias in Multimodal Embedding Models: Do They Favor the Beginning, the Middle, or the End?

Kebin Wu, Fatima Albreiki

🧩 TL;DR

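This study shows that positional bias is prevalent in multimodal representation models. In image-text retrieval, text encoders tend to favor the beginning of the input, while image encoders show bias at both the beginning and the end.

📘 Detailed Summary

Motivation: Positional bias has been studied extensively in text generation models, but its presence and effects in representation models, and especially in multimodal models, remain underexplored. This study addresses that gap, focusing on positional bias in image-text retrieval.

Method: The study first distinguishes context importance from positional bias, then systematically measures the extent of positional bias across models and datasets. It analyzes how several factors affect the bias: the positional encoding scheme, the training loss, context importance, and the way image-text pairs are used in training.

Result: Positional bias is prevalent in multimodal models but manifests differently across modalities: text encoders are biased toward the beginning of the input, whereas image encoders are biased toward both the beginning and the end. The bias arises from, or is amplified by, a combination of the positional encoding scheme, training loss, context importance, and the nature of multimodal training.

Conclusion: Positional bias in multimodal representation models is complex and has multiple causes. The study underlines the importance of accounting for positional bias in model design and evaluation and offers insights for building fairer, more robust multimodal systems.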

📄 Abstract

Positional bias - where models overemphasize certain positions regardless of content - has been shown to negatively impact model performance across various tasks. While recent research has extensively examined positional bias in text generation models, its presence and effects in representation models remain underexplored. Even less is known about such biases in multimodal models. In this work, we investigate positional bias in multimodal representation models, specifically in the context of image-text retrieval. We begin by distinguishing between context importance and positional bias, and then assess the presence and extent of positional bias across different models and datasets. Our experiments demonstrate that positional bias is prevalent in multimodal models, but manifests differently across modalities: text encoders tend to exhibit bias toward the beginning of the input, whereas image encoders show bias at both the beginning and end. Furthermore, we find that this bias arises from, or is amplified by, a combination of factors, including the positional encoding scheme, training loss, context importance, and the nature of using image-text pairs in multimodal training.

[46] CountSteer: Steering Attention for Object Counting in Diffusion Models

Hyemin Boo, Hyoryung Kim, Myungjin Lee, Seunghyeon Lee, Jiyoung Lee, Jang-Hwan Choi, Hyunsoo Cho

🧩 TL;DR

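This paper proposes CountSteer, a training-free method that steers a diffusion model's cross-attention hidden states during inference so that text-to-image generation better follows the specified object count. It exploits the model's latent awareness of numerical correctness and improves object-count accuracy by about 4% without compromising visual quality.

📘 Detailed Summary

Motivation: Text-to-image diffusion models generate realistic, coherent images but often fail to follow numerical instructions in the text, revealing a gap between language and visual representation. The authors find that these models are not entirely blind to numbers: their internal signals shift consistently depending on whether the output meets the specified count, which suggests the model already encodes a latent notion of numerical correctness.

Method: Building on this latent awareness, CountSteer is a training-free technique that steers the model's cross-attention hidden states during inference to improve generation of the specified number of objects. It uses the pattern of shifts in the model's internal signals and does not modify model parameters.

Result: CountSteer improves object-count accuracy by about 4% while preserving visual quality, showing that the model's latent internal signals can be used for more controllable text-to-image generation at no extra training cost.

Conclusion: Text-to-image diffusion models already encode a latent notion of numerical correctness, which opens new possibilities for controlling generation. CountSteer shows how these internal signals can be used for more precise semantic control, a simple yet effective step toward more reliable and controllable text-to-image generation.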

📄 Abstract

Text-to-image diffusion models generate realistic and coherent images but often fail to follow numerical instructions in text, revealing a gap between language and visual representation. Interestingly, we found that these models are not entirely blind to numbers: they are implicitly aware of their own counting accuracy, as their internal signals shift in consistent ways depending on whether the output meets the specified count. This observation suggests that the model already encodes a latent notion of numerical correctness, which can be harnessed to guide generation more precisely. Building on this intuition, we introduce CountSteer, a training-free method that improves generation of specified object counts by steering the model's cross-attention hidden states during inference. In our experiments, CountSteer improved object-count accuracy by about 4% without compromising visual quality, demonstrating a simple yet effective step toward more controllable and semantically reliable text-to-image generation.
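A common way to implement this kind of inference-time steering is to estimate a direction in hidden-state space that separates "count satisfied" from "count not satisfied" runs, then add a scaled copy of it to the hidden states. The sketch below follows that generic difference-of-means recipe as an assumption; the abstract does not specify how CountSteer derives or applies its steering signal, and the function names and `strength` parameter are hypothetical.

```python
import numpy as np

def steering_direction(correct_states, incorrect_states):
    """Unit vector from the mean 'incorrect count' state to the mean 'correct count' state."""
    direction = correct_states.mean(axis=0) - incorrect_states.mean(axis=0)
    return direction / np.linalg.norm(direction)

def steer(hidden, direction, strength=1.0):
    """Shift every hidden state along the steering direction; no weights are changed."""
    return hidden + strength * direction

# Toy 2-D hidden states collected from generations that did / did not match the count.
correct = np.array([[1.0, 0.0], [3.0, 0.0]])
incorrect = np.array([[0.0, 0.0], [0.0, 0.0]])

direction = steering_direction(correct, incorrect)
steered = steer(np.array([[0.0, 1.0]]), direction, strength=2.0)
```

Because the shift is applied only at inference, the approach stays training-free; the `strength` value controls how far generation is pushed and would need tuning so that visual quality is preserved.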

[47] GraphPilot: Grounded Scene Graph Conditioning for Language-Based Autonomous Driving

Fabian Schmidt, Markus Enzweiler, Abhinav Valada

🧩 TL;DR

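This study proposes a novel model-agnostic method that conditions language-based driving models on traffic scene graphs to strengthen autonomous driving planning. On the LangAuto benchmark it improves driving score by up to 15.6% for LMDrive and 17.5% for BEVDriver.

📘 Detailed Summary

Motivation: Vision-language models used for driving planning are typically trained without supervision that explicitly encodes spatial structure and dynamic interactions, which limits their ability to infer from raw sensor data how traffic entities influence one another. Because these relational dependencies are not encoded during training, topology-aware reasoning suffers.

Method: The model-agnostic method serializes traffic scene graphs at various abstraction levels and formats and incorporates them into language-based driving models through structured prompt templates. This supports a systematic analysis of when and how relational supervision is most beneficial and yields a scene-graph-conditioned training strategy.

Result: Extensive evaluation on the public LangAuto benchmark shows that scene graph conditioning substantially improves the driving performance of state-of-the-art approaches. Driving score increases by up to 15.6% for LMDrive and 17.5% for BEVDriver, and the gains hold even when no scene graph is provided at test time.

Conclusion: Scene-graph-conditioned training lets models internalize and ground relational priors, so the improvement persists without scene graph input at test time. The code, fine-tuned models, and scene graph dataset are publicly released to support further research.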

📄 Abstract

Vision-language models have recently emerged as promising planners for autonomous driving, where success hinges on topology-aware reasoning over spatial structure and dynamic interactions from multimodal input. However, existing models are typically trained without supervision that explicitly encodes these relational dependencies, limiting their ability to infer how agents and other traffic entities influence one another from raw sensor data. In this work, we bridge this gap with a novel model-agnostic method that conditions language-based driving models on structured relational context in the form of traffic scene graphs. We serialize scene graphs at various abstraction levels and formats, and incorporate them into the models via structured prompt templates, enabling a systematic analysis of when and how relational supervision is most beneficial. Extensive evaluations on the public LangAuto benchmark show that scene graph conditioning of state-of-the-art approaches yields large and persistent improvements in driving performance. Notably, we observe up to a 15.6% increase in driving score for LMDrive and 17.5% for BEVDriver, indicating that models can better internalize and ground relational priors through scene graph-conditioned training, even without requiring scene graph input at test time. Code, fine-tuned models, and our scene graph dataset are publicly available at https://github.com/iis-esslingen/GraphPilot.

[48] DocSLM: A Small Vision-Language Model for Long Multimodal Document Understanding

Tanveer Hannan, Dimitrios Mallios, Parth Pathak, Faegheh Sardari, Thomas Seidl, Gedas Bertasius, Mohsen Fayyaz, Sunando Sengupta

🧩 TL;DR

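DocSLM is an efficient small vision-language model for long-document understanding on resource-constrained edge devices. Using a Hierarchical Multimodal Compressor and a Streaming Abstention mechanism, it matches or surpasses state-of-the-art methods while substantially reducing memory consumption.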

📘 Detailed Summary

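Motivation: Large vision-language models perform well on long-document understanding, but their high memory footprint makes deployment on resource-constrained edge devices impractical. This work targets that deployment bottleneck with an efficient model that can process long multimodal documents under limited memory.

Method: DocSLM uses a Hierarchical Multimodal Compressor that jointly encodes the visual, textual, and layout information of each page into a fixed-length sequence, greatly reducing memory consumption. It also introduces a Streaming Abstention mechanism that processes document segments sequentially and filters low-confidence responses with an entropy-based uncertainty calibrator.

Result: Across multiple long multimodal document benchmarks, DocSLM matches or surpasses state-of-the-art methods while using 82% fewer visual tokens, 75% fewer parameters, and 71% lower latency, delivering reliable multimodal document understanding on lightweight edge devices.

Conclusion: DocSLM shows that compression and streaming can substantially reduce model complexity while preserving performance. It offers a practical route to deploying multimodal document understanding in resource-constrained environments.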

📄 Abstract

Large Vision-Language Models (LVLMs) have demonstrated strong multimodal reasoning capabilities on long and complex documents. However, their high memory footprint makes them impractical for deployment on resource-constrained edge devices. We present DocSLM, an efficient Small Vision-Language Model designed for long-document understanding under constrained memory resources. DocSLM incorporates a Hierarchical Multimodal Compressor that jointly encodes visual, textual, and layout information from each page into a fixed-length sequence, greatly reducing memory consumption while preserving both local and global semantics. To enable scalable processing over arbitrarily long inputs, we introduce a Streaming Abstention mechanism that operates on document segments sequentially and filters low-confidence responses using an entropy-based uncertainty calibrator. Across multiple long multimodal document benchmarks, DocSLM matches or surpasses state-of-the-art methods while using 82% fewer visual tokens, 75% fewer parameters, and 71% lower latency, delivering reliable multimodal document understanding on lightweight edge devices. Code is available in the supplementary material.
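
The idea of filtering low-confidence segment responses by entropy can be sketched as follows. This is a simplified illustration, not DocSLM's calibrator: the normalized-entropy score, the `threshold` value, and the `(answer, probs)` segment format are assumptions made for the example.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy of a probability distribution, scaled to [0, 1]."""
    if len(probs) < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))


def stream_answers(segments, threshold=0.5):
    """Process segments in order and abstain on high-entropy responses.

    Each segment is a hypothetical (answer, probs) pair, where probs is the
    model's distribution over candidate answers for that segment.
    """
    kept = []
    for answer, probs in segments:
        if normalized_entropy(probs) < threshold:
            kept.append(answer)
    return kept


segments = [
    ("Invoice total is 42 USD", [0.97, 0.01, 0.01, 0.01]),
    ("Not sure", [0.25, 0.25, 0.25, 0.25]),
]
confident = stream_answers(segments)
```

Because segments are handled one at a time, memory use in this sketch does not grow with document length; only the kept answers accumulate.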

[49] MicroVQA++: High-Quality Microscopy Reasoning Dataset with Weakly Supervised Graphs for Multimodal Large Language Model

Manyu Li, Ruian He, Chenxi Ma, Weimin Tan, Bo Yan

🧩 TL;DR

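This paper presents MicroVQA++, a large-scale, high-quality microscopy VQA corpus built in three stages. Heterogeneous-graph filtering and human screening ensure data quality, enabling 4B-scale multimodal large language models to reach microscopy reasoning performance competitive with GPT-5.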

📘 Detailed Summary

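Motivation: The application of multimodal large language models to biomedical imaging is limited by the scarcity of large-scale, high-quality training data. For scientific reasoning over microscopy in particular, existing datasets lack the quality and scale needed for effective training.

Method: Construction has three stages. Stage one bootstraps supervision from expert-validated figure-caption pairs taken from peer-reviewed articles. Stage two applies HiCQA-Graph, a heterogeneous graph that fuses NLI-based textual entailment, CLIP-based vision-language alignment, and agent signals to identify and filter inconsistent samples. Stage three uses a multimodal large language model to generate multiple-choice questions, followed by human screening.

Result: The release comprises a large training split and a human-checked test split whose distribution of hard samples by Bloom's level exceeds that of the MicroVQA benchmark. Experiments show that careful data construction lets 4B-scale MLLMs reach microscopy reasoning performance competitive with GPT-5 and achieve state-of-the-art results among open-source MLLMs.

Conclusion: Quality control that couples expert literature with graph-based filtering and human refinement markedly improves dataset quality. HiCQA-Graph is the first graph to jointly model images, captions, and QAs for cross-modal consistency filtering, providing an effective data-construction framework for scientific multimodal reasoning.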

📄 Abstract

Multimodal Large Language Models are increasingly applied to biomedical imaging, yet scientific reasoning for microscopy remains limited by the scarcity of large-scale, high-quality training data. We introduce MicroVQA++, a three-stage, large-scale and high-quality microscopy VQA corpus derived from the BIOMEDICA archive. Stage one bootstraps supervision from expert-validated figure-caption pairs sourced from peer-reviewed articles. Stage two applies HiCQA-Graph, a novel heterogeneous graph over images, captions, and QAs that fuses NLI-based textual entailment, CLIP-based vision-language alignment, and agent signals to identify and filter inconsistent samples. Stage three uses a Multimodal Large Language Model (MLLM) agent to generate multiple-choice questions (MCQ) followed by human screening. The resulting release comprises a large training split and a human-checked test split whose Bloom's level hard-sample distribution exceeds that of the MicroVQA benchmark. Our work delivers (i) a quality-controlled dataset that couples expert literature with graph-based filtering and human refinement; (ii) HiCQA-Graph, the first graph that jointly models (image, caption, QA) for cross-modal consistency filtering; (iii) evidence that careful data construction enables 4B-scale MLLMs to reach microscopy reasoning performance competitive with models such as GPT-5 and to achieve state-of-the-art performance among open-source MLLMs. Code and dataset will be released after the review process concludes.

Jiaxi Huang, Dongxu Wu, Hanwei Zhu, Lingyu Zhu, Jun Xing, Xu Wang, Baoliang Chen

🧩 TL;DR

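This paper proposes Q-Doc, a framework for systematically evaluating multimodal large language models on document image quality assessment (DIQA). MLLMs show nascent DIQA abilities but significant limitations, and Chain-of-Thought prompting substantially improves performance.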

📘 Detailed Summary

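Motivation: Although multimodal large language models have advanced rapidly on high-level vision tasks, their potential for document image quality assessment remains underexplored, and no systematic evaluation framework exists for analyzing their capabilities and limits on DIQA.

Method: Q-Doc is a three-tiered evaluation framework: quality scoring at the coarse level, distortion-type identification at the middle level (single-choice and multi-choice tests), and distortion-severity classification at the fine level. Chain-of-Thought prompting is also applied to enhance model performance.

Result: MLLMs possess nascent DIQA abilities but exhibit critical limitations: inconsistent scoring, distortion misidentification, and severity misjudgment. Chain-of-Thought prompting substantially improves performance at all levels.

Conclusion: The study establishes a benchmark for the DIQA capabilities of MLLMs, reveals pronounced deficiencies in their quality perception, and points to prompt engineering as a feasible path to improvement, providing a reference for future use of MLLMs in document analysis.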

📄 Abstract

The rapid advancement of Multi-modal Large Language Models (MLLMs) has expanded their capabilities beyond high-level vision tasks. Nevertheless, their potential for Document Image Quality Assessment (DIQA) remains underexplored. To bridge this gap, we propose Q-Doc, a three-tiered evaluation framework for systematically probing DIQA capabilities of MLLMs at coarse, middle, and fine granularity levels. a) At the coarse level, we instruct MLLMs to assign quality scores to document images and analyze their correlation with Quality Annotations. b) At the middle level, we design distortion-type identification tasks, including single-choice and multi-choice tests for multi-distortion scenarios. c) At the fine level, we introduce distortion-severity assessment where MLLMs classify distortion intensity against human-annotated references. Our evaluation demonstrates that while MLLMs possess nascent DIQA abilities, they exhibit critical limitations: inconsistent scoring, distortion misidentification, and severity misjudgment. Significantly, we show that Chain-of-Thought (CoT) prompting substantially enhances performance across all levels. Our work provides a benchmark for DIQA capabilities in MLLMs, revealing pronounced deficiencies in their quality perception and promising pathways for enhancement. The benchmark and code are publicly available at: https://github.com/cydxf/Q-Doc.
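
The coarse-level analysis correlates model-assigned quality scores with human annotations. The abstract does not name the correlation measure, so the sketch below assumes Spearman rank correlation, a common choice in quality assessment; the score lists are made-up illustrative values.

```python
def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Illustrative values only, not data from the benchmark.
model_scores = [4.5, 2.0, 3.5, 1.0]
human_scores = [4.8, 2.5, 3.0, 1.2]
rho = spearman(model_scores, human_scores)
```

A rank-based measure rewards a model for ordering documents by quality correctly even when its absolute scores sit on a different scale from the annotations.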

[51] BOFA: Bridge-Layer Orthogonal Low-Rank Fusion for CLIP-Based Class-Incremental Learning

Lan Li, Tao Hu, Da-Wei Zhou, Han-Jia Ye, De-Chuan Zhan

🧩 TL;DR

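This paper proposes BOFA, a framework that confines model adaptation to CLIP's cross-modal bridge-layer and applies Orthogonal Low-Rank Fusion, achieving class-incremental learning without data replay and improving classification performance while remaining efficient.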

📘 Detailed Summary

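Motivation: Applying CLIP to class-incremental learning poses two main challenges: adapting to downstream tasks usually requires additional learnable modules, which increase model complexity and susceptibility to forgetting; and existing methods have not fully exploited the complementary strengths of the visual and textual modalities.

Method: BOFA confines all adaptation to CLIP's existing cross-modal bridge-layer, adding no extra parameters or inference cost. Its Orthogonal Low-Rank Fusion mechanism constrains parameter updates to a low-rank safe subspace constructed to be orthogonal to past task features. A cross-modal hybrid prototype combines stable textual prototypes with visual prototypes derived from the stably adapted bridge-layer.

Result: Extensive experiments on standard benchmarks show that BOFA outperforms existing methods in both accuracy and efficiency, achieving stable knowledge accumulation without data replay and improved classification performance.

Conclusion: Carefully designed orthogonal constraints and cross-modal fusion can mitigate forgetting in class-incremental learning without increasing model complexity, offering a new technical route for multimodal models in continual learning.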

📄 Abstract

Class-Incremental Learning (CIL) aims to continually learn new categories without forgetting previously acquired knowledge. Vision-language models such as CLIP offer strong transferable representations via multi-modal supervision, making them promising for CIL. However, applying CLIP to CIL poses two major challenges: (1) adapting to downstream tasks often requires additional learnable modules, increasing model complexity and susceptibility to forgetting; and (2) while multi-modal representations offer complementary strengths, existing methods have yet to fully realize their potential in effectively integrating visual and textual modalities. To address these issues, we propose BOFA (Bridge-layer Orthogonal Fusion for Adaptation), a novel framework for CIL. BOFA confines all model adaptation exclusively to CLIP's existing cross-modal bridge-layer, thereby adding no extra parameters or inference cost. To prevent forgetting within this layer, it leverages Orthogonal Low-Rank Fusion, a mechanism that constrains parameter updates to a low-rank "safe subspace" mathematically constructed to be orthogonal to past task features. This ensures stable knowledge accumulation without data replay. Furthermore, BOFA employs a cross-modal hybrid prototype that synergizes stable textual prototypes with visual counterparts derived from our stably adapted bridge-layer, enhancing classification performance. Extensive experiments on standard benchmarks show that BOFA achieves superior accuracy and efficiency compared to existing methods.
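
The orthogonality constraint behind the "safe subspace" can be shown with a small pure-Python sketch. It covers only the projection idea, assuming past task features are available as plain vectors; it does not reproduce BOFA's low-rank construction or its fusion step, and the function names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, eps=1e-10):
    """Gram-Schmidt orthonormalization of past-task feature directions."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > eps:  # skip directions already spanned by the basis
            basis.append([wi / norm for wi in w])
    return basis


def project_to_safe_subspace(update, past_features):
    """Remove the components of `update` that lie along past-task features."""
    safe = list(update)
    for b in orthonormal_basis(past_features):
        c = dot(safe, b)
        safe = [si - c * bi for si, bi in zip(safe, b)]
    return safe


past = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
update = [0.5, -0.3, 0.8]
safe_update = project_to_safe_subspace(update, past)
```

After projection, the update has zero inner product with every past feature, so to first order it leaves the layer's response to those features unchanged, which is the intuition for avoiding forgetting without replay.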

[52] Shrinking the Teacher: An Adaptive Teaching Paradigm for Asymmetric EEG-Vision Alignment

Lukun Wu, Jie Li, Ziqi Ren, Kaifan Zhang, Xinbo Gao

🧩 TL;DR

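This paper proposes an adaptive teaching paradigm in which the vision modality dynamically adjusts its knowledge structure to bridge the asymmetry between visual and EEG signals. It reaches 60.2% top-1 accuracy on zero-shot brain-to-image retrieval, surpassing previous state-of-the-art methods by 9.8%.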

📘 Detailed Summary

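Motivation: Existing methods overlook the fundamental asymmetry between visual and EEG signals: a Fidelity Gap (EEG's inherent noise and signal degradation vs. vision's high-fidelity features) and a Semantic Gap (EEG's shallow conceptual representation vs. vision's rich semantic depth). Ignoring it leads to poor cross-modal alignment and weak generalization.

Method: The adaptive teaching paradigm lets the teacher modality (vision) dynamically shrink and adjust its knowledge structure under task guidance, tailoring its semantically dense features to the capacity of the student modality (EEG). It is implemented as the ShrinkAdapter module, which has a residual-free design and a bottleneck structure.

Result: The method achieves 60.2% top-1 accuracy on zero-shot brain-to-image retrieval, surpassing the previous state of the art by 9.8%. Extensive experiments validate the rationale and effectiveness of the paradigm.

Conclusion: The work introduces a new perspective on asymmetric alignment: the teacher must shrink and adapt to bridge the vision-brain gap. The adaptive teaching paradigm offers an effective framework for asymmetric relationships in cross-modal learning.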

📄 Abstract

Decoding visual features from EEG signals is a central challenge in neuroscience, with cross-modal alignment as the dominant approach. We argue that the relationship between visual and brain modalities is fundamentally asymmetric, characterized by two critical gaps: a Fidelity Gap (stemming from EEG's inherent noise and signal degradation, vs. vision's high-fidelity features) and a Semantic Gap (arising from EEG's shallow conceptual representation, vs. vision's rich semantic depth). Previous methods often overlook this asymmetry, forcing alignment between the two modalities as if they were equal partners and thereby leading to poor generalization. To address this, we propose the adaptive teaching paradigm. This paradigm empowers the "teacher" modality (vision) to dynamically shrink and adjust its knowledge structure under task guidance, tailoring its semantically dense features to match the "student" modality (EEG)'s capacity. We implement this paradigm with the ShrinkAdapter, a simple yet effective module featuring a residual-free design and a bottleneck structure. Through extensive experiments, we validate the underlying rationale and effectiveness of our paradigm. Our method achieves a top-1 accuracy of 60.2% on the zero-shot brain-to-image retrieval task, surpassing previous state-of-the-art methods by a margin of 9.8%. Our work introduces a new perspective for asymmetric alignment: the teacher must shrink and adapt to bridge the vision-brain gap.

[53] Comprehension of Multilingual Expressions Referring to Target Objects in Visual Inputs

Francisco Nogueira, Alexandre Bernardino, Bruno Martins

🧩 TL;DR

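This paper presents a multilingual referring expression comprehension system. A large-scale dataset spanning 10 languages and an attention-anchored neural architecture address the predominantly English-centric state of current REC research.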

📘 Detailed Summary

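Motivation: Current REC research focuses mainly on English and lacks multilingual support, which limits practical use in global deployments. This paper addresses that gap by building a visual grounding system that handles multiple languages.

Method: Twelve existing English REC benchmarks are systematically expanded through machine translation and context-based translation enhancement into a large-scale dataset spanning 10 languages. The proposed attention-anchored architecture uses multilingual SigLIP2 encoders, generates coarse spatial anchors from attention distributions, and refines them through learned residuals.

Result: The model is competitive on standard benchmarks, achieving 86.9% accuracy at IoU@50 on the RefCOCO aggregate multilingual evaluation, compared with 91.3% for the English-only result. Multilingual evaluation shows consistent capabilities across languages.

Conclusion: The study establishes the practical feasibility of multilingual visual grounding systems and provides a technical basis for global deployment. The dataset and model are a resource for multilingual REC research.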

📄 Abstract

Referring Expression Comprehension (REC) requires models to localize objects in images based on natural language descriptions. Research in this area remains predominantly English-centric, despite increasing global deployment demands. This work addresses multilingual REC through two main contributions. First, we construct a unified multilingual dataset spanning 10 languages, by systematically expanding 12 existing English REC benchmarks through machine translation and context-based translation enhancement. The resulting dataset comprises approximately 8 million multilingual referring expressions across 177,620 images, with 336,882 annotated objects. Second, we introduce an attention-anchored neural architecture that uses multilingual SigLIP2 encoders. Our attention-based approach generates coarse spatial anchors from attention distributions, which are subsequently refined through learned residuals. Experimental evaluation demonstrates competitive performance on standard benchmarks, e.g. achieving 86.9% accuracy at IoU@50 on RefCOCO aggregate multilingual evaluation, compared to an English-only result of 91.3%. Multilingual evaluation shows consistent capabilities across languages, establishing the practical feasibility of multilingual visual grounding systems. The dataset and model are available at https://multilingual.franreno.com.
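
The reported metric, accuracy at IoU@50, counts a prediction as correct when its box overlaps the ground truth with intersection-over-union of at least 0.5. The sketch below computes it and also shows, in the simplest possible form, a coarse anchor being refined by an additive residual. The box format `(x1, y1, x2, y2)` and the additive form of `refine` are assumptions for illustration, not details taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def refine(anchor, residual):
    """Shift a coarse anchor box by a per-coordinate residual (assumed additive)."""
    return tuple(a + r for a, r in zip(anchor, residual))


def accuracy_at_iou(preds, targets, threshold=0.5):
    """Fraction of predictions whose IoU with the target meets the threshold."""
    hits = sum(1 for p, t in zip(preds, targets) if iou(p, t) >= threshold)
    return hits / len(targets)


preds = [refine((10, 10, 50, 50), (2, -1, 5, 4)), (0, 0, 10, 10)]
targets = [(12, 10, 56, 54), (20, 20, 30, 30)]
acc = accuracy_at_iou(preds, targets)
```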

[54] WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation

Wei Chow, Jiachun Pan, Yongyuan Liang, Mingze Zhou, Xue Song, Liyu Jia, Saining Zhang, Siliang Tang, Juncheng Li, Fengda Zhang, Weijia Wu, Hanwang Zhang, Tat-Seng Chua

🧩 TL;DR

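This paper presents WEAVE, the first suite for in-context interleaved cross-modality comprehension and generation. It comprises the large-scale dataset WEAVE-100k and the human-annotated benchmark WEAVEBench, and targets the limitations of existing multimodal models in real-world multi-turn image creation and editing.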

📘 Detailed Summary

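Motivation: Unified multimodal models have made notable progress in visual comprehension and generation, but current datasets and benchmarks focus mainly on single-turn interactions and fail to capture the multi-turn, context-dependent nature of real-world image creation and editing.

Method: The WEAVE suite has two complementary parts. WEAVE-100k is a large-scale dataset of 100K interleaved samples spanning over 370K dialogue turns and 500K images, covering comprehension, editing, and generation tasks that require reasoning over historical context. WEAVEBench is a human-annotated benchmark of 100 tasks based on 480 images, with a hybrid VLM judger evaluation framework that uses both the reference image and the combination of the original image with the editing instructions.

Result: Training on WEAVE-100k improves visual comprehension, image editing, and comprehension-generation collaboration, and helps unified multimodal models develop emergent visual-memory capabilities. Extensive evaluation on WEAVEBench exposes persistent limitations of current approaches in multi-turn, context-aware image generation and editing.

Conclusion: WEAVE provides a perspective and a foundation for studying in-context interleaved comprehension and generation, exposes the capability boundaries of current multimodal models on multi-turn interaction, and points to directions for future work.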

📄 Abstract

Recent advances in unified multimodal models (UMMs) have enabled impressive progress in visual comprehension and generation. However, existing datasets and benchmarks focus primarily on single-turn interactions, failing to capture the multi-turn, context-dependent nature of real-world image creation and editing. To address this gap, we present WEAVE, the first suite for in-context interleaved cross-modality comprehension and generation. Our suite consists of two complementary parts. WEAVE-100k is a large-scale dataset of 100K interleaved samples spanning over 370K dialogue turns and 500K images, covering comprehension, editing, and generation tasks that require reasoning over historical context. WEAVEBench is a human-annotated benchmark with 100 tasks based on 480 images, featuring a hybrid VLM judger evaluation framework, based on both the reference image and the combination of the original image with editing instructions, that assesses models' abilities in multi-turn generation, visual memory, and world-knowledge reasoning across diverse domains. Experiments demonstrate that training on WEAVE-100k enables vision comprehension, image editing, and comprehension-generation collaboration capabilities. Furthermore, it helps UMMs develop emergent visual-memory capabilities, while extensive evaluations on WEAVEBench expose the persistent limitations and challenges of current approaches in multi-turn, context-aware image generation and editing. We believe WEAVE provides a view and foundation for studying in-context interleaved comprehension and generation for the multimodal community.

[55] VP-Bench: A Comprehensive Benchmark for Visual Prompting in Multimodal Large Language Models

Mingjie Xu, Jinpeng Chen, Yuzhi Zhao, Jason Chun Lok Li, Yue Qiu, Zekang Du, Mengyang Wu, Pingping Zhang, Kun Li, Hongzheng Yang, Wenao Ma, Jiaheng Wei, Qinbin Li, Kangcheng Liu, Wenqiang Lei

🧩 TL;DR

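This paper introduces VP-Bench, the first benchmark to systematically evaluate how well multimodal large language models understand visual prompts. A comprehensive analysis of 28 MLLMs identifies key factors in visual prompt perception and utilization.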

📘 Detailed Summary

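Motivation: No existing work systematically evaluates the ability of MLLMs to interpret visual prompts, so it is unclear whether current models can recognize this prompting method, which is intuitive for humans, and use it to solve problems. This gap limits the practical potential of MLLMs.

Method: VP-Bench uses a two-stage evaluation framework. Stage 1 examines models' ability to perceive visual prompts in natural scenes, using 30k visualized prompts spanning eight shapes and 355 attribute combinations. Stage 2 investigates the impact of visual prompts on downstream tasks, measuring their effectiveness in real-world problem-solving scenarios.

Result: Twenty-eight MLLMs, including proprietary systems and open-source models, are evaluated, with a comprehensive analysis of factors that affect visual prompt understanding, such as variations in prompt attributes, question arrangement, and model scale.

Conclusion: VP-Bench establishes a new reference framework for studying how MLLMs comprehend and resolve grounded referring questions, and shows where current models stand on visual prompt understanding and where they can improve.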

📄 Abstract

Multimodal large language models (MLLMs) have enabled a wide range of advanced vision-language applications, including fine-grained object recognition and contextual understanding. When querying specific regions or objects in an image, human users naturally use "visual prompts" (VPs), such as bounding boxes, to provide reference. However, no existing benchmark systematically evaluates the ability of MLLMs to interpret such VPs. This gap leaves it unclear whether current MLLMs can effectively recognize VPs, an intuitive prompting method for humans, and use them to solve problems. To address this limitation, we introduce VP-Bench, a benchmark for assessing MLLMs' capability in VP perception and utilization. VP-Bench employs a two-stage evaluation framework: Stage 1 examines models' ability to perceive VPs in natural scenes, using 30k visualized prompts spanning eight shapes and 355 attribute combinations. Stage 2 investigates the impact of VPs on downstream tasks, measuring their effectiveness in real-world problem-solving scenarios. Using VP-Bench, we evaluate 28 MLLMs, including proprietary systems (e.g., GPT-4o) and open-source models (e.g., InternVL3 and Qwen2.5-VL), and provide a comprehensive analysis of factors that affect VP understanding, such as variations in VP attributes, question arrangement, and model scale. VP-Bench establishes a new reference framework for studying how MLLMs comprehend and resolve grounded referring questions.

[56] VoxTell: Free-Text Promptable Universal 3D Medical Image Segmentation

Maximilian Rokuss, Moritz Langenberg, Yannick Kirchhoff, Fabian Isensee, Benjamin Hamm, Constantin Ulrich, Sebastian Regnery, Lukas Bauer, Efthimios Katsigiannopulos, Tobias Norajitra, Klaus Maier-Hein

🧩 TL;DR

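VoxTell is a vision-language model for text-prompted volumetric medical image segmentation that maps free-form text descriptions to 3D segmentation masks. Trained on more than 62K CT, MRI, and PET volumes spanning over 1K anatomical and pathological classes, it achieves state-of-the-art zero-shot segmentation performance across modalities.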

📘 Detailed Summary

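Motivation: Current medical image segmentation methods usually require predefined classes or large amounts of annotated data and cannot flexibly handle free-form clinical text. This work aims at a universal model that understands text prompts ranging from single words to full clinical sentences and produces the corresponding 3D masks.

Method: VoxTell uses a multi-stage vision-language fusion architecture that aligns textual and visual features across decoder layers. It is trained on more than 62K CT, MRI, and PET volumes spanning over 1K anatomical and pathological classes, and fuses features at multiple scales to map text to 3D segmentations.

Result: VoxTell achieves state-of-the-art zero-shot segmentation performance on unseen datasets, excelling on familiar concepts while generalizing to related unseen classes. Experiments show strong cross-modality transfer, robustness to linguistic variations and clinical language, and accurate instance-specific segmentation from real-world text.

Conclusion: The study demonstrates the potential of vision-language models for medical image segmentation and provides a flexible, text-driven segmentation tool for clinical practice. The multi-stage fusion strategy and cross-modality training could be extended to further imaging modalities and clinical applications.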

📄 Abstract

We introduce VoxTell, a vision-language model for text-prompted volumetric medical image segmentation. It maps free-form descriptions, from single words to full clinical sentences, to 3D masks. Trained on 62K+ CT, MRI, and PET volumes spanning over 1K anatomical and pathological classes, VoxTell uses multi-stage vision-language fusion across decoder layers to align textual and visual features at multiple scales. It achieves state-of-the-art zero-shot performance across modalities on unseen datasets, excelling on familiar concepts while generalizing to related unseen classes. Extensive experiments further demonstrate strong cross-modality transfer, robustness to linguistic variations and clinical language, as well as accurate instance-specific segmentation from real-world text. Code is available at: https://www.github.com/MIC-DKFZ/VoxTell

[57] Rethinking Efficient Mixture-of-Experts for Remote Sensing Modality-Missing Classification

Qinghao Gao, Jianhai Qu, Yunsong Li, Weiqiang Dong

🧩 TL;DR

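This paper proposes the Missing-aware Mixture-of-Loras (MaMOL) framework, which reformulates modality missing as a multi-task learning problem to classify incomplete multimodal remote sensing data robustly. Its dual-routing mechanism improves generalization under varying missing rates while remaining parameter-efficient.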

📘 Detailed Summary

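Motivation: Multimodal remote sensing classification often suffers from missing modalities caused by environmental interference, sensor failures, or atmospheric effects, which severely degrade performance. Existing two-stage adaptation methods are computationally expensive and assume complete multimodal data during training, limiting their generalization to real-world incompleteness.

Method: MaMOL introduces a dual-routing mechanism: a task-oriented dynamic router adaptively activates experts for different missing patterns, and a modality-specific-shared static router maintains stable cross-modal knowledge sharing. Rather than training a separate network for each missing configuration, it adapts parameter-efficiently through lightweight expert updates and shared expert reuse.

Result: On multiple remote sensing benchmarks, MaMOL shows superior robustness and generalization under varying missing rates with minimal computational overhead. Transfer experiments on natural image datasets validate its scalability and cross-domain applicability.

Conclusion: MaMOL is a general and efficient solution for incomplete multimodal learning that handles missing modalities through a multi-task learning formulation. Its parameter efficiency and cross-domain applicability make it practical for real-world multimodal remote sensing analysis.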

📄 Abstract

Multimodal classification in remote sensing often suffers from missing modalities caused by environmental interference, sensor failures, or atmospheric effects, which severely degrade classification performance. Existing two-stage adaptation methods are computationally expensive and assume complete multimodal data during training, limiting their generalization to real-world incompleteness. To overcome these issues, we propose a Missing-aware Mixture-of-Loras (MaMOL) framework that reformulates modality missing as a multi-task learning problem. MaMOL introduces a dual-routing mechanism: a task-oriented dynamic router that adaptively activates experts for different missing patterns, and a modality-specific-shared static router that maintains stable cross-modal knowledge sharing. Unlike prior methods that train separate networks for each missing configuration, MaMOL achieves parameter-efficient adaptation via lightweight expert updates and shared expert reuse. Experiments on multiple remote sensing benchmarks demonstrate superior robustness and generalization under varying missing rates, with minimal computational overhead. Moreover, transfer experiments on natural image datasets validate its scalability and cross-domain applicability, highlighting MaMOL as a general and efficient solution for incomplete multimodal learning.

[58] Multimodal Posterior Sampling-based Uncertainty in PD-L1 Segmentation from H&E Images

Roman Kinakh, Gonzalo R. Ríos-Muñoz, Arrate Muñoz-Barrutia

🧩 TL;DR

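This paper proposes nnUNet-B, a Bayesian segmentation framework that infers PD-L1 expression directly from H&E-stained histology images using Multimodal Posterior Sampling. It achieves competitive segmentation performance while providing pixel-wise uncertainty estimates, offering a scalable and interpretable approach to biomarker assessment in clinical workflows.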

📘 Detailed Summary

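Motivation: Current immunohistochemistry-based assessment of PD-L1 expression is resource-intensive and time-consuming, which limits its scalability in clinical workflows. This work aims at an automated method that predicts PD-L1 expression directly from routine H&E-stained histology images, reducing reliance on specialized staining and expert assessment.

Method: nnUNet-B is a Bayesian segmentation framework built upon nnUNet-v2. Multimodal Posterior Sampling samples diverse model checkpoints during cyclic training to approximate the posterior. The framework provides both accurate segmentation and epistemic uncertainty estimates via entropy and standard deviation, producing pixel-wise uncertainty maps.

Result: On a lung squamous cell carcinoma dataset, the method achieves competitive performance with a mean Dice score of 0.805 and a mean IoU of 0.709. Uncertainty estimates correlate strongly with segmentation error, though calibration remains imperfect.

Conclusion: Uncertainty-aware PD-L1 prediction from H&E images is a promising step toward scalable, interpretable biomarker assessment. The uncertainty maps add confidence information for clinical decision-making, but calibration must improve before the method can be relied on clinically.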

📄 Abstract

Accurate assessment of PD-L1 expression is critical for guiding immunotherapy, yet current immunohistochemistry (IHC) based methods are resource-intensive. We present nnUNet-B: a Bayesian segmentation framework that infers PD-L1 expression directly from H&E-stained histology images using Multimodal Posterior Sampling (MPS). Built upon nnUNet-v2, our method samples diverse model checkpoints during cyclic training to approximate the posterior, enabling both accurate segmentation and epistemic uncertainty estimation via entropy and standard deviation. Evaluated on a dataset of lung squamous cell carcinoma, our approach achieves competitive performance against established baselines with mean Dice Score and mean IoU of 0.805 and 0.709, respectively, while providing pixel-wise uncertainty maps. Uncertainty estimates show strong correlation with segmentation error, though calibration remains imperfect. These results suggest that uncertainty-aware H&E-based PD-L1 prediction is a promising step toward scalable, interpretable biomarker assessment in clinical workflows.
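
The uncertainty measures named in this abstract (entropy and standard deviation over sampled checkpoints) and the Dice score can be sketched in a few lines. This is a toy illustration on flat lists of per-pixel foreground probabilities, assuming binary segmentation; the checkpoint values are invented, and nothing here reproduces the nnUNet-B training procedure.

```python
import math
from statistics import pstdev


def pixel_uncertainty(samples):
    """Summarize one pixel's foreground probabilities across checkpoints.

    Returns (mean probability, binary predictive entropy, standard deviation).
    """
    mean_p = sum(samples) / len(samples)
    entropy = 0.0
    for p in (mean_p, 1.0 - mean_p):
        if p > 0:
            entropy -= p * math.log(p)
    return mean_p, entropy, pstdev(samples)


def dice(pred, target):
    """Dice score between two binary masks given as flat 0/1 lists."""
    inter = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 2 * inter / total if total else 1.0


# Three hypothetical checkpoints, each predicting three pixels.
checkpoints = [
    [0.9, 0.2, 0.6],
    [0.8, 0.1, 0.4],
    [0.95, 0.15, 0.5],
]
per_pixel = [pixel_uncertainty(list(s)) for s in zip(*checkpoints)]
mask = [1 if mean_p >= 0.5 else 0 for mean_p, _, _ in per_pixel]
```

In this example the third pixel, where the checkpoints disagree around 0.5, receives the highest entropy, which is the behaviour a pixel-wise uncertainty map is meant to expose.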

[59] OpenUS: A Fully Open-Source Foundation Model for Ultrasound Image Analysis via Self-Adaptive Masked Contrastive Learning

Xiaoyu Zheng, Xu Chen, Awais Rauf, Qifan Fu, Benedetta Monosi, Felice Rivellese, Myles J. Lewis, Shaogang Gong, Gregory Slabaugh

🧩 TL;DR

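This paper presents OpenUS, the first reproducible, open-source ultrasound foundation model built on large-scale public data. It uses a vision Mamba backbone and a self-adaptive masking framework, and is pre-trained on over 308K images from 42 public datasets to enable label-efficient ultrasound AI models.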

📘 Detailed Summary

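Motivation: Ultrasound image interpretation is highly operator-dependent and varies significantly across anatomical regions, acquisition protocols, and device types. Together with challenges such as speckle, low contrast, and limited standardized annotations, this hinders the development of generalizable, label-efficient ultrasound AI models.

Method: OpenUS uses a vision Mamba backbone to capture both local and global long-range dependencies across the image. A novel self-adaptive masking framework combines contrastive learning with masked image modeling, integrating the teacher's attention map with the student reconstruction loss to adaptively refine clinically relevant masking. A dynamic learning schedule progressively adjusts the difficulty of pre-training.

Result: The authors compile the largest public ultrasound dataset to date, with over 308K images from 42 publicly available datasets covering diverse anatomical regions, institutions, imaging devices, and disease types. The pre-trained OpenUS model can serve as a backbone for label-efficient fine-tuning on specific downstream tasks.

Conclusion: OpenUS gives ultrasound AI research its first reproducible foundation model. Its self-adaptive masking framework and dynamic learning schedule are designed to enhance pre-training effectiveness, laying a basis for generalizable, label-efficient ultrasound AI systems and for reproducible research in medical image analysis.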

📄 Abstract

Ultrasound (US) is one of the most widely used medical imaging modalities, thanks to its low cost, portability, real-time feedback, and absence of ionizing radiation. However, US image interpretation remains highly operator-dependent and varies significantly across anatomical regions, acquisition protocols, and device types. These variations, along with unique challenges such as speckle, low contrast, and limited standardized annotations, hinder the development of generalizable, label-efficient ultrasound AI models. In this paper, we propose OpenUS, the first reproducible, open-source ultrasound foundation model built on a large collection of public data. OpenUS employs a vision Mamba backbone, capturing both local and global long-range dependencies across the image. To extract rich features during pre-training, we introduce a novel self-adaptive masking framework that combines contrastive learning with masked image modeling. This strategy integrates the teacher's attention map with student reconstruction loss, adaptively refining clinically-relevant masking to enhance pre-training effectiveness. OpenUS also applies a dynamic learning schedule to progressively adjust the difficulty of the pre-training process. To develop the foundation model, we compile the largest to-date public ultrasound dataset comprising over 308K images from 42 publicly available datasets, covering diverse anatomical regions, institutions, imaging devices, and disease types. Our pre-trained OpenUS model can be easily adapted to specific downstream tasks by serving as a backbone for label-efficient fine-tuning. Code is available at https://github.com/XZheng0427/OpenUS.

[60] Bridging Hidden States in Vision-Language Models

Benjamin Fein-Ashley, Jacob Fein-Ashley

🧩 TL;DR

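This paper proposes BRIDGE, a lightweight fusion module for vision-language models that adds a few bidirectional cross-attention layers near the top of the encoders to align vision and text hidden states, improving multimodal understanding while preserving bi-encoder efficiency.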

📘 Detailed Summary

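Motivation: Existing fusion approaches for vision-language models have limitations: early fusion mixes tokens/features inside the encoders, late fusion only compares pooled embeddings, and many methods tie fusion to an autoregressive decoder. Yet the hidden states of both modalities already carry rich, modality-specific structure, so aligning these states directly is a more natural way to match the two modalities.

Method: The lightweight fusion module places a few cross-only, bidirectional attention layers near the top of the vision and text encoders. Each layer projects the two encoders' hidden-state sequences into a shared space, attends across modalities, and sends gated residual updates back, with simple stabilizers to improve alignment. The encoders remain non-causal and strong for understanding, while generation stays cleanly decoupled via an optional decoder.

Result: Across standard retrieval, VQA, and visual reasoning benchmarks, BRIDGE outperforms comparable vision-language models while preserving the bi-encoder efficiency of contrastive models.

Conclusion: BRIDGE shows that directly aligning modality-specific hidden states with lightweight bidirectional cross-attention layers is effective, offering a new fusion paradigm for vision-language model design that keeps the encoders independent while aligning the modalities efficiently.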

📄 Abstract

Vision-Language Models (VLMs) are a new family of models that align image content with natural language. Existing approaches typically fuse either (a) early: by mixing tokens/features inside the encoders, or (b) late: by comparing pooled embeddings. Many methods also tie fusion to an autoregressive decoder. However, the hidden states of both modalities already carry rich, modality-specific structure (spatial layout in vision; syntax and semantics in text), so directly aligning these states is a natural way to match what the two modalities "think". We propose a lightweight fusion module: a few cross-only, bidirectional attention layers placed near the top of both encoders. Each layer projects the vision and text encoder hidden-state sequences into a shared space, attends across modalities, and sends gated residual updates back, with simple stabilizers to improve alignment. The encoders remain non-causal and strong for understanding, while generation stays cleanly decoupled via an optional decoder. Across standard retrieval, VQA, and visual reasoning benchmarks, BRIDGE outperforms comparable VLMs while preserving the bi-encoder efficiency of contrastive models. We make our code publicly available at https://github.com/jfeinashley/BRIDGE.
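
A cross-only, bidirectional attention layer with gated residual updates can be sketched in plain Python. This is a deliberately reduced illustration, not the BRIDGE implementation: it omits the learned projections into a shared space and the stabilizers, uses a single scalar `gate` instead of learned gates, and assumes both modalities already share one hidden dimension.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Each query token attends only over the other modality's tokens."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * kv[j] for w, kv in zip(weights, keys_values)) for j in range(d)])
    return out


def gated_update(hidden, attended, gate):
    """Residual update scaled by a gate (a fixed scalar in this sketch)."""
    return [[h + gate * a for h, a in zip(hr, ar)] for hr, ar in zip(hidden, attended)]


def bridge_layer(vision, text, gate=0.1):
    """Bidirectional step: vision attends to text and text attends to vision."""
    new_vision = gated_update(vision, cross_attend(vision, text), gate)
    new_text = gated_update(text, cross_attend(text, vision), gate)
    return new_vision, new_text


vision = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # three image tokens
text = [[1.0, 1.0], [0.0, 2.0]]                # two text tokens
fused_vision, fused_text = bridge_layer(vision, text)
```

With the gate at zero the layer is an identity map, so each encoder's original hidden states are recoverable; this is one plausible reading of why gated residuals keep the encoders usable on their own.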

cs.CL [Back]

[61] Patent Representation Learning via Self-supervision

You Zuo, Kim Gerdes, Eric Villemonte de La Clergerie, Benoît Sagot

🧩 TL;DR

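This paper proposes a contrastive learning framework that learns patent embeddings from multiple views within the same document, using different sections of a patent (e.g., abstract, claims, background) as complementary views. It addresses the overly uniform embeddings that SimCSE-style dropout augmentation produces on patents and, without external annotations, matches or surpasses citation- and IPC-supervised baselines.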

📘 Detailed Summary

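Motivation: SimCSE-style dropout augmentation exhibits a patent-specific failure mode: it produces overly uniform embeddings that lose semantic cohesion. Existing methods rely on brittle or incomplete annotations, which motivates a patent representation learning method that needs no external annotations.

Method: Section-based augmentation treats different sections of a patent (abstract, claims, background, and so on) as complementary views for contrastive learning. This design introduces natural semantic and structural diversity, mitigates over-dispersion, and preserves both global structure and local continuity.

Result: On large-scale benchmarks, the fully self-supervised method matches or surpasses citation- and IPC-supervised baselines in prior-art retrieval and classification. Different sections specialize for different tasks: claims and summaries benefit retrieval, while background sections aid classification.

Conclusion: The study shows the value of intra-document views for representation learning: the inherent discourse structure of patent sections carries rich semantic information. The method offers a scalable, generalizable route to patent understanding that avoids reliance on external annotations.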

📄 Abstract

This paper presents a simple yet effective contrastive learning framework for learning patent embeddings by leveraging multiple views from within the same document. We first identify a patent-specific failure mode of SimCSE-style dropout augmentation: it produces overly uniform embeddings that lose semantic cohesion. To remedy this, we propose section-based augmentation, where different sections of a patent (e.g., abstract, claims, background) serve as complementary views. This design introduces natural semantic and structural diversity, mitigating over-dispersion and yielding embeddings that better preserve both global structure and local continuity. On large-scale benchmarks, our fully self-supervised method matches or surpasses citation- and IPC-supervised baselines in prior-art retrieval and classification, while avoiding reliance on brittle or incomplete annotations. Our analysis further shows that different sections specialize for different tasks (claims and summaries benefit retrieval, while background sections aid classification), highlighting the value of patents' inherent discourse structure for representation learning. These results highlight the value of exploiting intra-document views for scalable and generalizable patent understanding.
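
Section-based augmentation can be read as a standard in-batch contrastive objective in which two sections of the same patent form the positive pair. The sketch below uses an InfoNCE loss with cosine similarity; the loss form, the temperature value, and the toy two-dimensional embeddings are assumptions for illustration and are not confirmed details of the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den


def info_nce(anchors, positives, temperature=0.05):
    """In-batch InfoNCE loss.

    anchors[i] and positives[i] embed two sections of the same patent
    (for example abstract and claims); other rows act as negatives.
    """
    loss = 0.0
    for i, a in enumerate(anchors):
        logits = [cosine(a, p) / temperature for p in positives]
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_sum - logits[i]
    return loss / len(anchors)


# Toy embeddings for two patents: abstract view and claims view.
abstract_emb = [[1.0, 0.0], [0.0, 1.0]]
claims_emb = [[0.9, 0.1], [0.1, 0.9]]
aligned_loss = info_nce(abstract_emb, claims_emb)
shuffled_loss = info_nce(abstract_emb, claims_emb[::-1])
```

The loss is small when each abstract is closest to its own patent's claims and large when the pairing is shuffled, which is the training signal that section views provide in place of dropout noise.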

[62] Evaluating Open-Weight Large Language Models for Structured Data Extraction from Narrative Medical Reports Across Multiple Use Cases and Languages

Douwe J. Spaanderman, Karthik Prathaban, Petr Zelina, Kaouther Mouheb, Lukáš Hejtmánek, Matthew Marzetti, Antonius W. Schurink, Damian Chan, Ruben Niemantsverdriet, Frederik Hartmann, Zhen Qian, Maarten G. J. Thomeer, Petr Holub, Farhan Akram, Frank J. Wolters, Meike W. Vernooij, Cornelis Verhoef, Esther E. Bron, Vít Nováček, Dirk J. Grünhagen, Wiro J. Niessen, Martijn P. A. Starmans, Stefan Klein

🧩 TL;DR

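This study evaluates 15 open-weight large language models on structured information extraction from clinical reports across diseases, languages, and institutions. Small-to-medium general-purpose models perform comparably to large models, and prompt graph and few-shot prompting improve performance by about 13%.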

📘 Detailed Summary

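Motivation: Prior work on extracting structured information from clinical records with large language models often focuses on single tasks, limited models, and English-language reports, and lacks a systematic evaluation across diseases, languages, and institutions. This study addresses that gap.

Method: The study evaluates 15 open-weight LLMs, both general-purpose and medical-specialised, on pathology and radiology reports across six clinical use cases, comparing six prompting strategies: zero-shot, one-shot, few-shot, chain-of-thought, self-consistency, and prompt graph. Performance is quantified with task-appropriate metrics and linear mixed-effects models.

Result: Top-ranked models achieve macro-average scores close to inter-rater agreement across tasks. Small-to-medium general-purpose models perform comparably to large models, while tiny and specialised models perform worse. Prompt graph and few-shot prompting improve performance by about 13%, and task-specific factors influence results more than model size or prompting strategy.

Conclusion: Open-weight LLMs can extract structured data from clinical reports across diseases, languages, and institutions, offering a scalable approach to clinical data curation. Variable complexity and annotation variability are key factors in performance.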

📄 Abstract

Large language models (LLMs) are increasingly used to extract structured information from free-text clinical records, but prior work often focuses on single tasks, limited models, and English-language reports. We evaluated 15 open-weight LLMs on pathology and radiology reports across six use cases, colorectal liver metastases, liver tumours, neurodegenerative diseases, soft-tissue tumours, melanomas, and sarcomas, at three institutes in the Netherlands, UK, and Czech Republic. Models included general-purpose and medical-specialised LLMs of various sizes, and six prompting strategies were compared: zero-shot, one-shot, few-shot, chain-of-thought, self-consistency, and prompt graph. Performance was assessed using task-appropriate metrics, with consensus rank aggregation and linear mixed-effects models quantifying variance. Top-ranked models achieved macro-average scores close to inter-rater agreement across tasks. Small-to-medium general-purpose models performed comparably to large models, while tiny and specialised models performed worse. Prompt graph and few-shot prompting improved performance by ~13%. Task-specific factors, including variable complexity and annotation variability, influenced results more than model size or prompting strategy. These findings show that open-weight LLMs can extract structured data from clinical reports across diseases, languages, and institutions, offering a scalable approach for clinical data curation.
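
Two of the prompting strategies compared here, few-shot prompting and self-consistency, are simple enough to sketch without a model. The prompt layout, the example reports, and the lower-casing normalization are all hypothetical choices for illustration; the study's actual prompts are not reproduced.

```python
from collections import Counter


def build_few_shot_prompt(task, examples, report):
    """Assemble a few-shot prompt from (report, answer) example pairs."""
    parts = [task]
    for text, label in examples:
        parts.append(f"Report: {text}\nAnswer: {label}")
    parts.append(f"Report: {report}\nAnswer:")
    return "\n\n".join(parts)


def self_consistency(answers):
    """Majority vote over several sampled answers, after light normalization."""
    normalized = [a.strip().lower() for a in answers]
    return Counter(normalized).most_common(1)[0][0]


# Invented examples, not taken from the study's data.
examples = [
    ("Lesion in segment VII, 2.1 cm, consistent with metastasis.", "malignant"),
    ("Simple hepatic cyst, no suspicious features.", "benign"),
]
prompt = build_few_shot_prompt(
    "Classify the lesion described in the report as malignant or benign.",
    examples,
    "Hypervascular mass with washout in segment IV.",
)
final = self_consistency(["Malignant", "malignant ", "benign"])
```

In a real pipeline the list passed to `self_consistency` would hold several sampled model outputs for the same prompt; here it is hard-coded so the voting step can be inspected in isolation.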

[63] Grounded Visual Factualization: Factual Anchor-Based Finetuning for Enhancing MLLM Factual Consistency

Filippo Morbiato, Luca Romano, Alessandro Persona

🧩 TL;DR

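This paper proposes Grounded Visual Factualization (GVF) Finetuning, which introduces explicit factual signals to systematically improve the visual factual consistency of multimodal large language models and significantly reduces visual hallucination.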

📘 Detailed Summary

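Motivation: Visual hallucination in multimodal large language models (MLLMs) seriously undermines their reliability. Existing fine-tuning methods offer limited improvement because they do not intervene deeply in factual reasoning, so a more effective way to improve visual factual consistency is needed.

Method: GVF Finetuning has three core mechanisms: Factual Anchor Data Augmentation, which enriches training data with structured factual anchors and counter-factual prompts; Fact-Aware Instruction Tuning, which embeds these factual cues into explicit instructions; and a Factual Consistency Loss, which specifically penalizes factual inaccuracies.

Result: On LLaVA-1.5-13B, GVF Finetuning significantly outperforms standard fine-tuning on the VHTest benchmark in both the open-ended question and yes/no question formats, while maintaining or slightly improving performance on general multimodal benchmarks such as MME and POPE.

Conclusion: GVF mitigates visual hallucination without compromising general understanding and reasoning abilities, offering a systematic and practically useful way to improve the reliability of MLLMs.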

📄 Abstract

Visual hallucination, where Multimodal Large Language Models fabricate details inconsistent with image content, critically undermines their reliability. Existing fine-tuning methods offer limited improvement, failing to deeply intervene in factual reasoning. This paper introduces Grounded Visual Factualization (GVF) Finetuning, a novel approach to systematically enhance MLLM visual factual consistency. GVF integrates explicit factual signals via three core mechanisms: Factual Anchor Data Augmentation, enriching training data with structured factual anchors and counter-factual prompts; Fact-Aware Instruction Tuning, embedding these cues into explicit instructions; and a Factual Consistency Loss function, specifically penalizing factual inaccuracies. Evaluated on LLaVA-1.5-13B, GVF Finetuning significantly outperforms standard fine-tuning on the VHTest benchmark for both Open-Ended Question (OEQ) and Yes/No Question (YNQ) formats. Crucially, GVF maintains or even slightly improves performance on general multimodal benchmarks like MME and POPE, demonstrating effective mitigation of visual hallucinations without compromising general understanding and reasoning abilities.

[64] Saying the Unsaid: Revealing the Hidden Language of Multimodal Systems Through Telephone Games

Juntu Zhao, Jialing Zhang, Chongxuan Li, Dequan Wang

🧩 TL;DR

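Using a multi-round "telephone game" framework, this study exploits the preference bias of multimodal systems to reveal their hidden language. It proposes a quantitative method for analysing concept connection strength and contributes Telescope, a dataset of 10,000+ concept pairs.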

📘 Detailed Summary

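Motivation: Closed-source multimodal systems have advanced considerably, but the hidden language they use to understand the world remains opaque because of their black-box architectures. The authors use the systems' preference bias to probe these hidden representations.

Method: The method runs multi-round telephone games that exploit the preference bias arising when a multimodal system compresses an image into text and then reconstructs an image from that text. Concept connection strength is quantified from the co-occurrence frequencies of concepts, and reasoning LLMs are used to uncover concept relationships that go beyond textual and visual similarity.

Result: The study contributes Telescope, a dataset of 10,000+ concept pairs. By running telephone games iteratively, it builds a global map of concept connections in multimodal systems' understanding, which makes it possible to identify preference bias inherited from training, assess advances in generalization capability, and discover more stable pathways for fragile concept connections.

Conclusion: The work offers a new perspective on the hidden language of multimodal systems and lays a foundation for future research on their interpretability and controllability, shedding light on how these systems understand and simulate the world.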

📄 Abstract

Recent closed-source multimodal systems have made great advances, but their hidden language for understanding the world remains opaque because of their black-box architectures. In this paper, we use the systems' preference bias to study their hidden language: During the process of compressing the input images (typically containing multiple concepts) into texts and then reconstructing them into images, the systems' inherent preference bias introduces specific shifts in the outputs, disrupting the original input concept co-occurrence. We employ the multi-round "telephone game" to strategically leverage this bias. By observing the co-occurrence frequencies of concepts in telephone games, we quantitatively investigate the concept connection strength in the understanding of multimodal systems, i.e., "hidden language." We also contribute Telescope, a dataset of 10,000+ concept pairs, as the database of our telephone game framework. Our telephone game is test-time scalable: By iteratively running telephone games, we can construct a global map of concept connections in multimodal systems' understanding. Here we can identify preference bias inherited from training, assess generalization capability advancement, and discover more stable pathways for fragile concept connections. Furthermore, we use Reasoning-LLMs to uncover unexpected concept relationships that transcend textual and visual similarities, inferring how multimodal systems understand and simulate the world. This study offers a new perspective on the hidden language of multimodal systems and lays the foundation for future research on the interpretability and controllability of multimodal systems.
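The co-occurrence idea in the abstract can be sketched as a simple counting exercise: for each concept pair present in the input image, check how often both concepts are still present after the final round. This is a toy illustration under strong assumptions. The paper's actual strength measure is not given in the abstract, and `pair_survival`, the representation of each round as a set of concept strings, and the example runs are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_survival(runs):
    """Per concept pair, the fraction of runs in which the pair survives to the last round.

    `runs` is a list of games; each game is a non-empty list of concept sets,
    one set per round (the first set describes the input image).
    """
    survived = Counter()
    started = Counter()
    for rounds in runs:
        first, last = rounds[0], rounds[-1]
        for pair in combinations(sorted(first), 2):
            started[pair] += 1
            if pair[0] in last and pair[1] in last:
                survived[pair] += 1
    return {pair: survived[pair] / started[pair] for pair in started}


# Hypothetical concept sets extracted from three telephone-game runs.
runs = [
    [{"dog", "frisbee"}, {"dog", "frisbee"}, {"dog", "frisbee"}],
    [{"dog", "frisbee"}, {"dog", "ball"}, {"dog", "ball"}],
    [{"dog", "piano"}, {"dog"}, {"dog", "bone"}],
]
strength = pair_survival(runs)
print(strength)
```

In this toy data the pair (dog, frisbee) survives more often than (dog, piano), which is the kind of asymmetry the abstract attributes to preference bias.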

[65] $π$-Attention: Periodic Sparse Transformers for Efficient Long-Context Modeling

Dong Liu, Yanxuan Yu

🧩 TL;DR

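This paper proposes ΠAttention, a periodic sparse Transformer that factorizes attention into ring-local neighborhoods, deterministic π-stride skips, and an adaptive fusion gate. It extends the receptive field while keeping per-layer complexity linear, and outperforms RingAttention on language modeling, retrieval, and vision-language tasks.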

📘 Detailed Summary

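Motivation: The quadratic complexity of Transformers limits long-sequence modeling. Sparse attention mechanisms such as RingAttention reduce computational cost but suffer from limited receptive fields and a lack of adaptability, so an attention mechanism is needed that keeps linear complexity while still covering distant tokens.

Method: ΠAttention uses a periodic sparse design that factorizes attention into three components: ring-local neighborhoods for local information flow, deterministic π-stride skips for predictable coverage of distant tokens, and an adaptive fusion gate that dynamically balances local and global information. Per-layer complexity stays linear in sequence length.

Result: Theoretical analysis shows that ΠAttention achieves O(kL + π log L) receptive field growth, compared with O(kL) for RingAttention. Empirically, it reaches 8.3% lower perplexity than RingAttention while using 50% fewer GPUs for the same context length, and it matches or surpasses dense attention quality across language modeling, retrieval, and vision-language tasks.

Conclusion: Periodic skips, adaptive fusion, and head-level sparsity coordination are all important for efficient long-context modeling. ΠAttention shows that near-dense-attention quality is achievable at linear complexity, suggesting a new architectural direction for large-scale long-sequence processing.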

📄 Abstract

Transformers have revolutionized natural language processing, but their quadratic complexity with respect to sequence length remains a fundamental bottleneck for long-range modeling. While sparse attention mechanisms like RingAttention reduce computational costs by restricting attention to local neighborhoods, they suffer from limited receptive fields and lack of adaptability. We present ΠAttention, a periodic sparse Transformer that factorizes attention into ring-local neighborhoods, deterministic $π$-stride skips, and an adaptive fusion gate. The periodic structure provides predictable coverage of distant tokens, while the sparse footprint keeps the per-layer complexity linear in context length. We prove that ΠAttention achieves $\mathcal{O}(kL + π\log L)$ receptive field growth compared to $\mathcal{O}(kL)$ for RingAttention, where $k$ is the local window size, $π$ is the skip period, and $L$ is the sequence length. Extensive experiments on language modeling, retrieval, and vision-language tasks demonstrate that ΠAttention matches or surpasses dense attention quality with 8.3% lower perplexity than RingAttention while using 50% fewer GPUs for the same context length. Our detailed ablations and visualizations reveal the importance of periodic skips, adaptive fusion, and head-level sparsity coordination for efficient long-context modeling.
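To make the combination of ring-local neighborhoods and stride skips concrete, the sketch below builds a sparse attention pattern in which each query sees a wrapped local window plus a fixed number of keys at multiples of the skip period. It is a guess at the general shape only. The abstract does not specify the exact pattern, so the `n_skips` parameter, the forward-only skip direction, and the name `periodic_sparse_pattern` are assumptions, and the adaptive fusion gate is left out entirely.

```python
def periodic_sparse_pattern(seq_len, window, period, n_skips):
    """For each query position, list the key positions it may attend to."""
    if seq_len <= 0 or window < 0 or period <= 0 or n_skips < 0:
        raise ValueError("seq_len and period must be positive; window and n_skips non-negative")
    pattern = []
    for i in range(seq_len):
        keys = set()
        # Ring-local neighbourhood: the window wraps around the sequence ends.
        for offset in range(-window, window + 1):
            keys.add((i + offset) % seq_len)
        # Deterministic stride skips: a fixed number of keys at multiples of `period`.
        for step in range(1, n_skips + 1):
            keys.add((i + step * period) % seq_len)
        pattern.append(sorted(keys))
    return pattern


pattern = periodic_sparse_pattern(seq_len=16, window=1, period=4, n_skips=2)
print(pattern[0])   # keys visible to query 0
print(pattern[14])  # keys visible to query 14
```

Because every query attends to at most `2 * window + 1 + n_skips` keys, the total number of query-key pairs grows linearly with `seq_len`, which is the property the abstract emphasises.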

[66] Faithful Summarization of Consumer Health Queries: A Cross-Lingual Framework with LLMs

Ajwad Abrar, Nafisa Tabassum Oeshy, Prianka Maheru, Farzana Tabassum, Tareque Mohmud Chowdhury

🧩 TL;DR

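This study proposes a framework that combines TextRank-based sentence extraction, medical named entity recognition, and large language models to improve faithfulness in medical text summarization. Fine-tuning LLaMA-2-7B on the MeQSum and BanglaCHQ-Summ datasets yields consistent improvements in both quality and faithfulness metrics, showing the approach's potential for safer deployment of LLMs in healthcare.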

📘 Detailed Summary

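Motivation: Summarizing consumer health questions can ease communication in healthcare, but unfaithful summaries that misrepresent medical details pose serious risks. Current medical summarization systems fall short on faithfulness, so dedicated methods are needed to ensure that critical medical information is preserved accurately.

Method: The framework combines TextRank-based sentence extraction and medical named entity recognition with large language models. LLaMA-2-7B is fine-tuned on the MeQSum (English) and BanglaCHQ-Summ (Bangla) datasets, and this structured approach is used to improve the medical accuracy of the summaries.

Result: The approach achieves consistent improvements across quality metrics (ROUGE, BERTScore, readability) and faithfulness metrics (SummaC, AlignScore), outperforming zero-shot baselines and prior systems. Human evaluation shows that over 80% of generated summaries preserve critical medical information.

Conclusion: Faithfulness is an essential dimension of reliable medical summarization, and the framework shows potential for safer deployment of LLMs in healthcare contexts. The study underlines the value of domain-specific methods for medical text processing.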

📄 Abstract

Summarizing consumer health questions (CHQs) can ease communication in healthcare, but unfaithful summaries that misrepresent medical details pose serious risks. We propose a framework that combines TextRank-based sentence extraction and medical named entity recognition with large language models (LLMs) to enhance faithfulness in medical text summarization. In our experiments, we fine-tuned the LLaMA-2-7B model on the MeQSum (English) and BanglaCHQ-Summ (Bangla) datasets, achieving consistent improvements across quality (ROUGE, BERTScore, readability) and faithfulness (SummaC, AlignScore) metrics, and outperforming zero-shot baselines and prior systems. Human evaluation further shows that over 80% of generated summaries preserve critical medical information. These results highlight faithfulness as an essential dimension for reliable medical summarization and demonstrate the potential of our approach for safer deployment of LLMs in healthcare contexts.
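The TextRank extraction step mentioned in the abstract can be sketched as PageRank over a sentence-similarity graph. The version below is a simplified illustration rather than the authors' pipeline: it uses Jaccard word overlap as the edge weight (the original TextRank paper uses a log-normalised overlap), and the function name, the parameters, and the example question are all assumptions.

```python
import re


def textrank_sentences(sentences, top_k=2, damping=0.85, iterations=50):
    """Return the top_k highest-ranked sentences, kept in their original order."""
    n = len(sentences)
    if n == 0:
        return []
    tokens = [set(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    # Edge weight: Jaccard overlap between the word sets of two sentences.
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and tokens[i] and tokens[j]:
                weights[i][j] = len(tokens[i] & tokens[j]) / len(tokens[i] | tokens[j])
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    # Power iteration of the weighted PageRank update.
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(
                weights[j][i] / out_sum[j] * scores[j] for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    best = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_k]
    return [sentences[i] for i in sorted(best)]


# Hypothetical consumer health question split into sentences.
question = [
    "I have had a headache for three days.",
    "The headache gets worse after I take ibuprofen.",
    "Can ibuprofen make a headache worse?",
    "My cat is named Oliver.",
]
selected = textrank_sentences(question, top_k=2)
print(selected)
```

The off-topic sentence shares no words with the others, so it receives only the baseline score and is never selected here. In a pipeline like the one described, the selected sentences and recognised medical entities would presumably be passed to the LLM as additional input.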

[67] Multimodal Peer Review Simulation with Actionable To-Do Recommendations for Community-Aware Manuscript Revisions

Mengze Hong, Di Jiang, Weiwei Zhao, Yawen Li, Yihang Wang, Xinyuan Luo, Yanjie Sun, Chen Jason Zhang

🧩 TL;DR

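This study presents an interactive peer review simulation system built on multimodal large language models. By integrating textual and visual information, retrieval-augmented generation, and a structured feedback format, it provides revision guidance before paper submission.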

📘 Detailed Summary

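Motivation: Existing academic peer review systems are constrained by text-only inputs, limited contextual grounding, and a lack of actionable feedback, which leaves the potential of LLMs for automating academic workflows underused. A more comprehensive, interactive review solution is therefore needed.

Method: The system integrates textual and visual information through multimodal LLMs, improves review quality with retrieval-augmented generation (RAG) grounded in OpenReview data, and uses the proposed Action:Objective[#] format to convert generated reviews into structured, traceable to-do lists.

Result: Experiments show that the system generates more comprehensive and useful reviews than ablated baselines, with output aligned with expert standards.

Conclusion: The work shows the potential of multimodal LLMs in scholarly assistance tools. Its structured feedback and real-time interactive interface advance transparent, human-centered scholarly assistance, and the system can be integrated into existing academic writing platforms.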

📄 Abstract

While large language models (LLMs) offer promising capabilities for automating academic workflows, existing systems for academic peer review remain constrained by text-only inputs, limited contextual grounding, and a lack of actionable feedback. In this work, we present an interactive web-based system for multimodal, community-aware peer review simulation to enable effective manuscript revisions before paper submission. Our framework integrates textual and visual information through multimodal LLMs, enhances review quality via retrieval-augmented generation (RAG) grounded in web-scale OpenReview data, and converts generated reviews into actionable to-do lists using the proposed Action:Objective[#] format, providing structured and traceable guidance. The system integrates seamlessly into existing academic writing platforms, providing interactive interfaces for real-time feedback and revision tracking. Experimental results highlight the effectiveness of the proposed system in generating more comprehensive and useful reviews aligned with expert standards, surpassing ablated baselines and advancing transparent, human-centered scholarly assistance.

[68] Automated Analysis of Learning Outcomes and Exam Questions Based on Bloom's Taxonomy

Ramya Kumar, Dhruv Gulwani, Sonit Singh

🧩 TL;DR

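This study systematically evaluates a range of machine learning methods for automatic Bloom's Taxonomy classification. Support Vector Machines with data augmentation perform best on a small dataset, while more complex deep learning models suffer from severe overfitting.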

📘 Detailed Summary

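Motivation: The study addresses the challenge of automatically classifying educational text according to Bloom's Taxonomy, focusing on how different machine learning methods compare, and how badly they overfit, on a small labeled dataset.

Method: The study evaluates traditional machine learning models (Naive Bayes, Logistic Regression, Support Vector Machines), recurrent neural network architectures (LSTM, BiLSTM, GRU, BiGRU), transformer-based models (BERT and RoBERTa), and large language models (OpenAI, Gemini, Ollama, Anthropic) under different preprocessing and data augmentation strategies.

Result: SVM with data augmentation achieves the best performance, reaching 94% accuracy, recall, and F1 with minimal overfitting. The RNN models and BERT overfit severely, while RoBERTa initially avoids overfitting but begins to show signs of it as training progresses. In zero-shot evaluation, OpenAI and Gemini perform best among the LLMs tested, with accuracy of approximately 0.72-0.73.

Conclusion: The findings highlight the difficulty of training complex deep models on limited data and the value of careful data augmentation and simpler algorithms, such as augmented SVM, for Bloom's Taxonomy classification.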

📄 Abstract

This paper explores the automatic classification of exam questions and learning outcomes according to Bloom's Taxonomy. A small dataset of 600 sentences labeled with six cognitive categories - Knowledge, Comprehension, Application, Analysis, Synthesis, and Evaluation - was processed using traditional machine learning (ML) models (Naive Bayes, Logistic Regression, Support Vector Machines), recurrent neural network architectures (LSTM, BiLSTM, GRU, BiGRU), transformer-based models (BERT and RoBERTa), and large language models (OpenAI, Gemini, Ollama, Anthropic). Each model was evaluated under different preprocessing and augmentation strategies (for example, synonym replacement, word embeddings, etc.). Among traditional ML approaches, Support Vector Machines (SVM) with data augmentation achieved the best overall performance, reaching 94 percent accuracy, recall, and F1 scores with minimal overfitting. In contrast, the RNN models and BERT suffered from severe overfitting, while RoBERTa initially overcame it but began to show signs as training progressed. Finally, zero-shot evaluations of large language models (LLMs) indicated that OpenAI and Gemini performed best among the tested LLMs, achieving approximately 0.72-0.73 accuracy and comparable F1 scores. These findings highlight the challenges of training complex deep models on limited data and underscore the value of careful data augmentation and simpler algorithms (such as augmented SVM) for Bloom's Taxonomy classification.

[69] CardioEmbed: Domain-Specialized Text Embeddings for Clinical Cardiology

Richard J. Young, Alice M. Matthews

🧩 TL;DR

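This study develops CardioEmbed, a cardiology-specific embedding model based on Qwen3-Embedding-8B and trained with contrastive learning on seven cardiology textbooks. It reaches 99.60% retrieval accuracy on cardiac-specific semantic retrieval tasks, a 15.94 percentage point improvement over MedTE, the current state-of-the-art medical embedding model.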

📘 Detailed Summary

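Motivation: Existing biomedical text embedding models have mostly been developed on PubMed research literature, whereas clinical cardiology practice relies heavily on the procedural knowledge and specialized terminology found in comprehensive textbooks. This research-practice gap limits the effectiveness of existing embedding models for clinical cardiology applications.

Method: Building on Qwen3-Embedding-8B, the model is trained with contrastive learning, using InfoNCE loss with in-batch negatives, on a curated corpus of seven comprehensive cardiology textbooks totaling approximately 150,000 sentences after deduplication.

Result: The model achieves 99.60% retrieval accuracy on cardiac-specific semantic retrieval tasks, a 15.94 percentage point improvement over MedTE. On MTEB medical benchmarks it obtains 0.77 Spearman on BIOSSES and 0.61 NDCG@10 on SciFact, indicating competitive performance on related biomedical domains.

Conclusion: Domain-specialized training on comprehensive clinical textbooks yields near-perfect cardiology retrieval and a large improvement over existing medical embedding models, demonstrating the practical value of domain-specific training for clinical applications.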

📄 Abstract

Biomedical text embeddings have primarily been developed using research literature from PubMed, yet clinical cardiology practice relies heavily on procedural knowledge and specialized terminology found in comprehensive textbooks rather than research abstracts. This research-practice gap limits the effectiveness of existing embedding models for clinical applications in cardiology. This study trained CardioEmbed, a domain-specialized embedding model based on Qwen3-Embedding-8B, using contrastive learning on a curated corpus of seven comprehensive cardiology textbooks totaling approximately 150,000 sentences after deduplication. The model employs InfoNCE loss with in-batch negatives and achieves 99.60% retrieval accuracy on cardiac-specific semantic retrieval tasks, a +15.94 percentage point improvement over MedTE, the current state-of-the-art medical embedding model. On MTEB medical benchmarks, the model obtained BIOSSES 0.77 Spearman and SciFact 0.61 NDCG@10, indicating competitive performance on related biomedical domains. Domain-specialized training on comprehensive clinical textbooks yields near-perfect cardiology retrieval (99.60% Acc@1), improving over MedTE by +15.94 percentage points.
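The InfoNCE loss with in-batch negatives named in the abstract can be written out in a few lines. The sketch below is a dependency-free illustration, not the authors' training code: each anchor is scored against every positive in the batch, and the other anchors' positives serve as its negatives. The temperature value, the function name, and the toy two-dimensional embeddings are assumptions, and the vectors are assumed to be non-zero.

```python
import math


def info_nce(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss; positives[j] with j != i act as negatives for anchors[i]."""

    def normalise(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    a = [normalise(v) for v in anchors]
    p = [normalise(v) for v in positives]
    losses = []
    for i, anchor in enumerate(a):
        # Cosine similarity of this anchor with every positive in the batch.
        logits = [sum(x * y for x, y in zip(anchor, pos)) / temperature for pos in p]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings: the aligned batch pairs each anchor with its true positive.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(anchors, [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce(anchors, [[0.0, 1.0], [1.0, 0.0]])
print(aligned, shuffled)
```

The loss is near zero when each anchor is closest to its own positive and large when the pairs are mismatched, which is the signal that pulls related sentences together during training.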

[70] AV-Dialog: Spoken Dialogue Models with Audio-Visual Input

Tuochao Chen, Bandhav Veluri, Hongyu Gong, Shyamnath Gollakota

🧩 TL;DR

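This paper presents AV-Dialog, the first multimodal dialogue framework to use both audio and visual cues. Combining acoustic tokenization with multi-task, multi-stage training, it achieves robust streaming transcription, semantically grounded turn-boundary detection, and accurate responses in noisy, multi-speaker environments.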

📘 Detailed Summary

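Motivation: Existing dialogue models falter in noisy, multi-speaker environments, often producing irrelevant responses and awkward turn-taking. The work targets natural conversational flow under real-world noise.

Method: AV-Dialog combines acoustic tokenization with multi-task, multi-stage training on monadic, synthetic, and real audio-visual dialogue datasets to track the target speaker, predict turn-taking, and generate coherent responses.

Result: Experiments show that AV-Dialog outperforms audio-only models under interference, reducing transcription errors, improving turn-taking prediction, and raising human-rated dialogue quality.

Conclusion: The results highlight the value of seeing as well as hearing for speaker-aware interaction, paving the way for spoken dialogue agents that perform robustly in real-world, noisy environments.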

📄 Abstract

Dialogue models falter in noisy, multi-speaker environments, often producing irrelevant responses and awkward turn-taking. We present AV-Dialog, the first multimodal dialog framework that uses both audio and visual cues to track the target speaker, predict turn-taking, and generate coherent responses. By combining acoustic tokenization with multi-task, multi-stage training on monadic, synthetic, and real audio-visual dialogue datasets, AV-Dialog achieves robust streaming transcription, semantically grounded turn-boundary detection and accurate responses, resulting in a natural conversational flow. Experiments show that AV-Dialog outperforms audio-only models under interference, reducing transcription errors, improving turn-taking prediction, and enhancing human-rated dialogue quality. These results highlight the power of seeing as well as hearing for speaker-aware interaction, paving the way for spoken dialogue agents that perform robustly in real-world, noisy environments.

Yi Shi, Wenlong Meng, Zhenyuan Guo, Chengkun Wei, Wenzhi Chen

🧩 TL;DR

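This paper proposes MemoDetector, a framework that uses multimodal large language models for textual enhancement together with a dual-stage modal fusion strategy. It improves meme emotion understanding, raising F1 scores by 4.3% and 3.4% on two benchmark datasets.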

📘 Detailed Summary

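Motivation: Research on meme emotion understanding faces two main challenges: a lack of fine-grained multimodal fusion strategies, and insufficient mining of memes' implicit meanings and background knowledge. These limitations hinder accurate understanding of complex emotional intent.

Method: The method introduces a four-step textual enhancement module that uses the knowledge and reasoning capabilities of MLLMs to progressively infer and extract implicit and contextual information from memes. It also designs a dual-stage modal fusion strategy: the first stage performs shallow fusion of the raw meme image and text, and the second stage deeply integrates the enhanced visual and textual features.

Result: Experiments on the MET-MEME and MOOD datasets show that MemoDetector consistently outperforms state-of-the-art baselines, improving F1 by 4.3% and 3.4% respectively. Ablation studies and further analyses validate the effectiveness and robustness of the approach.

Conclusion: The study shows the benefit of MLLM-based textual enhancement and hierarchical fusion for meme emotion understanding, providing a practical framework for multimodal emotion analysis.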

📄 Abstract

With the rapid rise of social media and Internet culture, memes have become a popular medium for expressing emotional tendencies. This has sparked growing interest in Meme Emotion Understanding (MEU), which aims to classify the emotional intent behind memes by leveraging their multimodal contents. While existing efforts have achieved promising results, two major challenges remain: (1) a lack of fine-grained multimodal fusion strategies, and (2) insufficient mining of memes' implicit meanings and background knowledge. To address these challenges, we propose MemoDetector, a novel framework for advancing MEU. First, we introduce a four-step textual enhancement module that utilizes the rich knowledge and reasoning capabilities of Multimodal Large Language Models (MLLMs) to progressively infer and extract implicit and contextual insights from memes. These enhanced texts significantly enrich the original meme contents and provide valuable guidance for downstream classification. Next, we design a dual-stage modal fusion strategy: the first stage performs shallow fusion on raw meme image and text, while the second stage deeply integrates the enhanced visual and textual features. This hierarchical fusion enables the model to better capture nuanced cross-modal emotional cues. Experiments on two datasets, MET-MEME and MOOD, demonstrate that our method consistently outperforms state-of-the-art baselines. Specifically, MemoDetector improves F1 scores by 4.3% on MET-MEME and 3.4% on MOOD. Further ablation studies and in-depth analyses validate the effectiveness and robustness of our approach, highlighting its strong potential for advancing MEU. Our code is available at https://github.com/singing-cat/MemoDetector.

[72] PRSM: A Measure to Evaluate CLIP's Robustness Against Paraphrases

Udo Schlegel, Franziska Weeber, Jian Lan, Thomas Seidl

🧩 TL;DR

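This paper proposes the Paraphrase Ranking Stability Metric (PRSM) to quantify CLIP's sensitivity to paraphrased queries. It reveals limited robustness to linguistic variation, which in socially sensitive contexts may amplify demographic biases.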

📘 Detailed Summary

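Motivation: CLIP performs strongly on zero-shot and few-shot tasks, but its robustness to linguistic variation, particularly paraphrasing, remains underexplored. Paraphrase robustness is essential for reliable deployment, especially in socially sensitive contexts where inconsistent representations can amplify demographic biases.

Method: The paper introduces the Paraphrase Ranking Stability Metric (PRSM), a new measure of CLIP's sensitivity to paraphrased queries. CLIP's stability under paraphrastic variation is assessed empirically on the Social Counterfactuals dataset, a benchmark designed to reveal social and demographic biases.

Result: Robustness varies across paraphrasing strategies, with subtle yet consistent differences between male- and female-associated queries.

Conclusion: The study examines the interaction between paraphrase robustness and gender and discusses implications for the fairness and equitable deployment of multimodal systems. The findings inform the development of more robust and fair multimodal AI systems.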

📄 Abstract

Contrastive Language-Image Pre-training (CLIP) is a widely used multimodal model that aligns text and image representations through large-scale training. While it performs strongly on zero-shot and few-shot tasks, its robustness to linguistic variation, particularly paraphrasing, remains underexplored. Paraphrase robustness is essential for reliable deployment, especially in socially sensitive contexts where inconsistent representations can amplify demographic biases. In this paper, we introduce the Paraphrase Ranking Stability Metric (PRSM), a novel measure for quantifying CLIP's sensitivity to paraphrased queries. Using the Social Counterfactuals dataset, a benchmark designed to reveal social and demographic biases, we empirically assess CLIP's stability under paraphrastic variation, examine the interaction between paraphrase robustness and gender, and discuss implications for fairness and equitable deployment of multimodal systems. Our analysis reveals that robustness varies across paraphrasing strategies, with subtle yet consistent differences observed between male- and female-associated queries.

[73] LANE: Lexical Adversarial Negative Examples for Word Sense Disambiguation

Jader Martins Camboim de Sá, Jooyoung Lee, Cédric Pruski, Marcos Da Silveira

🧩 TL;DR

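This paper proposes LANE, an adversarial training strategy that generates challenging negative training examples by selectively marking alternate words in the training set, addressing the tendency of neural language models to over-rely on global sentence representations in fine-grained word meaning resolution. The method improves results on lexical semantic change detection and word sense disambiguation benchmarks and yields more discriminative word representations.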

📘 Detailed Summary

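Motivation: Fine-grained word meaning resolution remains a critical challenge for neural language models, which often overfit to global sentence representations and fail to capture local semantic details. A training strategy is needed that makes the model more sensitive to the local semantics of the target word.

Method: LANE is an adversarial training strategy that deliberately shifts the model's learning focus to the target word. It generates challenging negative training examples by selectively marking alternate words in the training set, forcing the model to separate identical sentences that differ only in which word is marked. The method is model-agnostic and can be integrated into existing representation learning frameworks.

Result: On lexical semantic change detection and word sense disambiguation benchmarks, the method improves performance over standard contrastive learning baselines. Qualitative analyses further show that the proposed negatives lead to representations that better capture subtle meaning differences, even in challenging settings.

Conclusion: LANE offers an effective adversarial training strategy for neural language models that improves the discriminative power of word representations, showing that carefully designed negative examples can increase sensitivity to local semantics in fine-grained semantic understanding tasks.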

📄 Abstract

Fine-grained word meaning resolution remains a critical challenge for neural language models (NLMs) as they often overfit to global sentence representations, failing to capture local semantic details. We propose a novel adversarial training strategy, called LANE, to address this limitation by deliberately shifting the model's learning focus to the target word. This method generates challenging negative training examples through the selective marking of alternate words in the training set. The goal is to force the model to create a greater separability between same sentences with different marked words. Experimental results on lexical semantic change detection and word sense disambiguation benchmarks demonstrate that our approach yields more discriminative word representations, improving performance over standard contrastive learning baselines. We further provide qualitative analyses showing that the proposed negatives lead to representations that better capture subtle meaning differences even in challenging environments. Our method is model-agnostic and can be integrated into existing representation learning frameworks.
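The idea of marking alternate words in the same sentence can be illustrated with a small helper that produces one marked anchor and a set of negatives that differ only in which word is marked. This is a sketch of the general idea as described in the abstract. The marker tokens `[TGT]` and `[/TGT]`, whitespace tokenisation, and the choice to mark every other word (rather than a selected subset) are assumptions, not details from the paper.

```python
def mark(tokens, index, left="[TGT]", right="[/TGT]"):
    """Wrap the token at `index` in marker tokens and join the sentence back up."""
    return " ".join(tokens[:index] + [left, tokens[index], right] + tokens[index + 1:])


def lane_negatives(sentence, target_index):
    """Return the marked anchor plus negatives that mark a different word of the same sentence."""
    tokens = sentence.split()
    if not 0 <= target_index < len(tokens):
        raise IndexError("target_index is outside the sentence")
    anchor = mark(tokens, target_index)
    negatives = [mark(tokens, i) for i in range(len(tokens)) if i != target_index]
    return anchor, negatives


anchor, negatives = lane_negatives("the bank was steep", target_index=1)
print(anchor)
for negative in negatives:
    print(negative)
```

Because the anchor and its negatives share every token except the marker position, a model can only separate them by attending to the marked word, which matches the stated goal of shifting focus away from the global sentence representation.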

[74] iMAD: Intelligent Multi-Agent Debate for Efficient and Accurate LLM Inference

Wei Fan, JinYi Yoon, Bo Ji

🧩 TL;DR

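This paper proposes iMAD, an intelligent multi-agent debate framework that triggers debate selectively, substantially reducing computational cost while improving answer accuracy. It learns generalizable model behaviors to make accurate debate decisions, cutting token usage by up to 92% and improving accuracy by up to 13.5%.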

📘 Detailed Summary

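Motivation: Existing multi-agent debate (MAD) frameworks are inefficient: triggering a debate for every query incurs substantial computational cost and can even degrade accuracy by overturning correct single-agent answers. The work aims at a token-efficient framework that decides intelligently when to trigger debate.

Method: iMAD first prompts a single agent to produce a structured self-critique response, from which 41 interpretable linguistic and semantic features capturing hesitation cues are extracted. A lightweight debate-decision classifier, trained with the proposed FocusCal loss, then determines whether to trigger MAD, enabling robust debate decisions without test dataset-specific tuning.

Result: In extensive experiments on six (visual) question answering datasets against five competitive baselines, iMAD reduces token usage by up to 92% while improving final answer accuracy by up to 13.5%.

Conclusion: iMAD shows that choosing when to debate can improve both computational efficiency and accuracy. Learned features of model behavior can guide resource allocation, supporting more efficient deployment of multi-agent systems.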

📄 Abstract

Large Language Model (LLM) agent systems have advanced rapidly, driven by their strong generalization in zero-shot settings. To further enhance reasoning and accuracy on complex tasks, Multi-Agent Debate (MAD) has emerged as a promising framework that engages multiple LLM agents in structured debates to encourage diverse reasoning. However, triggering MAD for every query is inefficient, as it incurs substantial computational (token) cost and may even degrade accuracy by overturning correct single-agent answers. To address these limitations, we propose intelligent Multi-Agent Debate (iMAD), a token-efficient framework that selectively triggers MAD only when it is likely to be beneficial (i.e., correcting an initially wrong answer). To achieve this goal, iMAD learns generalizable model behaviors to make accurate debate decisions. Specifically, iMAD first prompts a single agent to produce a structured self-critique response, from which we extract 41 interpretable linguistic and semantic features capturing hesitation cues. Then, iMAD uses a lightweight debate-decision classifier, trained using our proposed FocusCal loss, to determine whether to trigger MAD, enabling robust debate decisions without test dataset-specific tuning. Through extensive experiments using six (visual) question answering datasets against five competitive baselines, we have shown that iMAD significantly reduces token usage (by up to 92%) while also improving final answer accuracy (by up to 13.5%).
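The gating step can be pictured as a small classifier over hesitation features extracted from the self-critique. The sketch below is a deliberately tiny stand-in: it uses two hand-written features and fixed logistic weights, whereas the paper extracts 41 features and trains its classifier with the FocusCal loss, neither of which is reproduced here. The cue words, weights, bias, threshold, and function names are all hypothetical.

```python
import math

# Hypothetical hesitation cues; the paper's 41 features are not listed in the abstract.
HEDGE_WORDS = ("maybe", "possibly", "not sure", "unclear", "might")


def hesitation_features(self_critique):
    """Two toy features: hedge-phrase density and presence of a contrastive 'however'."""
    text = self_critique.lower()
    n_words = max(len(text.split()), 1)
    hedges = sum(text.count(word) for word in HEDGE_WORDS)
    return [hedges / n_words, float("however" in text)]


def should_debate(self_critique, weights=(40.0, 1.5), bias=-2.0, threshold=0.5):
    """Trigger multi-agent debate only when the logistic hesitation score is high."""
    features = hesitation_features(self_critique)
    logit = bias + sum(w * f for w, f in zip(weights, features))
    score = 1.0 / (1.0 + math.exp(-logit))
    return score >= threshold


confident = "The image clearly shows two dogs, so the answer is two."
hesitant = "It might be two dogs, however I am not sure; possibly three."
print(should_debate(confident), should_debate(hesitant))
```

Only the hesitant response would be escalated to a full debate, so the confident query costs a single-agent call. That is the source of the token savings the abstract reports, although the real decision boundary is learned rather than hand-set.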

cs.AI [Back]

[75] Co-EPG: A Framework for Co-Evolution of Planning and Grounding in Autonomous GUI Agents

Yuan Zhao, Hualei Zhu, Tingyu Jiang, Shen Li, Xiaohang Xu, Hao Henry Wang

🧩 TL;DR

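This paper proposes Co-EPG, a framework that automates GUI tasks through the co-evolution of planning and grounding capabilities. Its self-iterative training mechanism surpasses existing state-of-the-art methods after only three iterations, without external data.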

📘 Detailed Summary

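Motivation: Current approaches to GUI task automation have two fundamental limitations: insufficient exploitation of cross-model synergies, and over-reliance on synthetic data generation without sufficient utilization of that data. Both limit the overall planning and grounding performance of agents.

Method: Co-EPG establishes a co-evolution mechanism between the planning and grounding models. The planning model explores better strategies under grounding-based reward guidance via Group Relative Policy Optimization (GRPO), generating diverse data to optimize the grounding model. The optimized grounding model in turn provides more effective rewards for subsequent training of the planning model, forming a positive feedback loop.

Result: On the Multimodal-Mind2Web and AndroidControl benchmarks, Co-EPG outperforms existing state-of-the-art methods after just three iterations without external data. The agent improves consistently with each iteration, demonstrating robust self-enhancement.

Conclusion: The work establishes a new training paradigm for GUI agents, shifting from isolated optimization to an integrated, self-driven co-evolution approach that iteratively enhances agent capabilities through self-play optimization and training data distillation.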

📄 Abstract

Graphical User Interface (GUI) task automation constitutes a critical frontier in artificial intelligence research. While effective GUI agents synergistically integrate planning and grounding capabilities, current methodologies exhibit two fundamental limitations: (1) insufficient exploitation of cross-model synergies, and (2) over-reliance on synthetic data generation without sufficient utilization. To address these challenges, we propose Co-EPG, a self-iterative training framework for Co-Evolution of Planning and Grounding. Co-EPG establishes an iterative positive feedback loop: through this loop, the planning model explores superior strategies under grounding-based reward guidance via Group Relative Policy Optimization (GRPO), generating diverse data to optimize the grounding model. Concurrently, the optimized Grounding model provides more effective rewards for subsequent GRPO training of the planning model, fostering continuous improvement. Co-EPG thus enables iterative enhancement of agent capabilities through self-play optimization and training data distillation. On the Multimodal-Mind2Web and AndroidControl benchmarks, our framework outperforms existing state-of-the-art methods after just three iterations without requiring external data. The agent consistently improves with each iteration, demonstrating robust self-enhancement capabilities. This work establishes a novel training paradigm for GUI agents, shifting from isolated optimization to an integrated, self-driven co-evolution approach.
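GRPO, which the abstract uses to train the planning model, scores each sampled output relative to the other outputs sampled for the same prompt instead of using a learned value function. The sketch below shows only that normalisation step, under the assumption that the rewards are scalar scores from the grounding model. The reward values, the `eps` constant, and the use of the population standard deviation are illustrative choices, not details from the paper.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each reward by the mean and standard deviation of its sampling group."""
    if not rewards:
        raise ValueError("need at least one reward")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps keeps the division finite when every sample received the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical grounding-based rewards for four plans sampled for one GUI task.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Plans that the grounding model rewards above the group average get positive advantages and are reinforced. As the grounding model improves, these rewards become more informative, which is the feedback loop the abstract describes.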

[76] Advanced Tool for Traffic Crash Analysis: An AI-Driven Multi-Agent Approach to Pre-Crash Reconstruction

Gerui Xu, Boyou Chen, Huizhong Guo, Dave LeBlanc, Ananna Ahmed, Zhaonan Sun, Shan Bao

🧩 TL;DR

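This study develops a multi-agent AI framework that reconstructs pre-crash scenarios and infers vehicle behaviors from fragmented collision data. On 39 complex rear-end crash cases it achieves 100% accuracy, surpassing the 92% achieved by human researchers.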

📘 Detailed Summary

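Motivation: Traffic collision reconstruction traditionally relies on human expertise and often yields inconsistent results when the multimodal data are incomplete. The study addresses the reconstruction of pre-crash scenarios and the inference of vehicle behaviors from fragmented collision data, particularly in complex cases with missing or conflicting data.

Method: The study proposes a two-phase collaborative framework combining reconstruction and reasoning. The system processes 277 rear-end lead vehicle deceleration (LVD) collisions, integrating textual crash reports, structured tabular data, and visual scene diagrams. Phase I generates natural-language crash reconstructions from the multimodal inputs; Phase II performs in-depth crash reasoning by combining these reconstructions with temporal Event Data Recorder (EDR) data.

Result: On the 39 complex LVD crash cases, the framework achieves perfect accuracy across all test cases, identifying the most relevant EDR event and correctly distinguishing striking from struck vehicles, compared with 92% accuracy for human researchers on the same dataset. Performance remains robust with incomplete data, including missing or erroneous EDR records and ambiguous scene diagrams.

Conclusion: The study demonstrates superior AI capabilities in processing heterogeneous collision data, with unprecedented precision in reconstructing impact dynamics and characterizing pre-crash behaviors. The framework offers a more reliable and consistent approach to crash analysis, especially for complex cases with incomplete data.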

📄 Abstract

Traffic collision reconstruction traditionally relies on human expertise, often yielding inconsistent results when analyzing incomplete multimodal data. This study develops a multi-agent AI framework that reconstructs pre-crash scenarios and infers vehicle behaviors from fragmented collision data. We present a two-phase collaborative framework combining reconstruction and reasoning phases. The system processes 277 rear-end lead vehicle deceleration (LVD) collisions from the Crash Investigation Sampling System, integrating textual crash reports, structured tabular data, and visual scene diagrams. Phase I generates natural-language crash reconstructions from multimodal inputs. Phase II performs in-depth crash reasoning by combining these reconstructions with temporal Event Data Recorder (EDR) data. For validation, we applied it to all LVD cases, focusing on a subset of 39 complex crashes where multiple EDR records per collision introduced ambiguity (e.g., due to missing or conflicting data). The evaluation of the 39 LVD crash cases revealed that our framework achieved perfect accuracy across all test cases, successfully identifying the most relevant EDR event and correctly distinguishing striking versus struck vehicles, surpassing the 92% accuracy achieved by human researchers on the same challenging dataset. The system maintained robust performance even when processing incomplete data, including missing or erroneous EDR records and ambiguous scene diagrams. This study demonstrates superior AI capabilities in processing heterogeneous collision data, providing unprecedented precision in reconstructing impact dynamics and characterizing pre-crash behaviors.

[77] LLM enhanced graph inference for long-term disease progression modelling

Tiantian He, An Zhao, Elinor Thompson, Anna Schroder, Ahmed Abdulaal, Frederik Barkhof, Daniel C. Alexander

🧩 TL;DR

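This study proposes a framework that uses large language models as expert guides to improve the learning of neurodegenerative disease progression from irregularly sampled longitudinal patient data. Using LLMs to synthesize multi-modal relationships and disease-driving mechanisms, the method jointly optimizes the construction of long-term disease trajectories and a biologically constrained graph structure, and shows superior prediction accuracy and interpretability on Alzheimer's disease tau-PET data.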

📘 Detailed Summary

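Motivation: Current models of neurodegenerative disease progression oversimplify the complexity of brain connectivity by assuming a single-modality brain connectome as the disease-spreading substrate, which leads to inaccurate predictions of pathology spread over long-term progression. Purely data-driven methods, meanwhile, face identifiability issues because they lack proper constraints.

Method: The framework uses LLMs as expert guides on the interactions among regional brain variables. Leveraging the LLMs' ability to synthesize multi-modal relationships and incorporate diverse disease-driving mechanisms, it simultaneously optimizes two components: the construction of long-term disease trajectories from individual-level observations, and a biologically constrained graph structure that captures interactions among brain regions with better identifiability.

Result: The approach is demonstrated by estimating pathology propagation from tau-PET imaging data in an Alzheimer's disease cohort. It shows superior prediction accuracy and interpretability compared with traditional approaches and reveals additional disease-driving factors beyond conventional connectivity measures.

Conclusion: The study shows the value of LLMs as a source of expert guidance in disease progression modelling, addressing limitations of existing approaches. The framework offers a more accurate and interpretable tool for understanding how neurodegenerative pathology spreads.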

📄 Abstract

Understanding the interactions between biomarkers among brain regions during neurodegenerative disease is essential for unravelling the mechanisms underlying disease progression. For example, pathophysiological models of Alzheimer's Disease (AD) typically describe how variables, such as regional levels of toxic proteins, interact spatiotemporally within a dynamical system driven by an underlying biological substrate, often based on brain connectivity. However, current methods grossly oversimplify the complex relationship between brain connectivity by assuming a single-modality brain connectome as the disease-spreading substrate. This leads to inaccurate predictions of pathology spread, especially during the long-term progression period. Meanwhile, other methods of learning such a graph in a purely data-driven way face the identifiability issue due to a lack of proper constraints. We thus present a novel framework that uses Large Language Models (LLMs) as expert guides on the interaction of regional variables to enhance learning of disease progression from irregularly sampled longitudinal patient data. By leveraging LLMs' ability to synthesize multi-modal relationships and incorporate diverse disease-driving mechanisms, our method simultaneously optimizes 1) the construction of long-term disease trajectories from individual-level observations and 2) the biologically-constrained graph structure that captures interactions among brain regions with better identifiability. We demonstrate the new approach by estimating the pathology propagation using tau-PET imaging data from an Alzheimer's disease cohort. The new framework demonstrates superior prediction accuracy and interpretability compared to traditional approaches while revealing additional disease-driving factors beyond conventional connectivity measures.

[78] GGBench: A Geometric Generative Reasoning Benchmark for Unified Multimodal Models

Jingxuan Wei, Caijun Jia, Xi Bai, Xinglong Xu, Siyuan Li, Linzhuang Sun, Bihui Yu, Conghui He, Lijun Wu, Cheng Tan

🧩 TL;DR

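This paper introduces GGBench, a benchmark designed to evaluate the geometric generative reasoning of unified multimodal models, filling a gap in existing evaluations, which do not measure the integrated cognitive process of generative reasoning. Through geometric construction tasks, it sets a more rigorous standard for the next generation of intelligent systems.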

📘 Detailed Summary

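Motivation: Unified multimodal models (UMMs) have an unprecedented ability to synthesize information, but existing benchmarks mainly assess discriminative understanding or unconstrained image generation separately and cannot measure the integrated cognitive process of generative reasoning. Geometric construction inherently demands a fusion of language comprehension and precise visual generation, making it an ideal testbed for this gap.

Method: The authors introduce GGBench, a benchmark designed specifically to evaluate geometric generative reasoning. It provides a comprehensive framework for systematically diagnosing a model's ability not only to understand and reason but also to actively construct a solution.

Result: GGBench sets a new evaluation standard for UMMs, testing generative reasoning through geometric construction tasks, and is intended to reflect a model's integrated cognitive performance more faithfully.

Conclusion: Geometric construction provides an ideal testbed for evaluating the generative reasoning of UMMs, and GGBench sets a more rigorous standard for the next generation of intelligent systems. The work underlines the shift in AI from passive perception to active, cross-modal generation.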

📄 Abstract

The advent of Unified Multimodal Models (UMMs) signals a paradigm shift in artificial intelligence, moving from passive perception to active, cross-modal generation. Despite their unprecedented ability to synthesize information, a critical gap persists in evaluation: existing benchmarks primarily assess discriminative understanding or unconstrained image generation separately, failing to measure the integrated cognitive process of generative reasoning. To bridge this gap, we propose that geometric construction provides an ideal testbed as it inherently demands a fusion of language comprehension and precise visual generation. We introduce GGBench, a benchmark designed specifically to evaluate geometric generative reasoning. It provides a comprehensive framework for systematically diagnosing a model's ability to not only understand and reason but to actively construct a solution, thereby setting a more rigorous standard for the next generation of intelligent systems. Project website: https://opendatalab-raiser.github.io/GGBench/.

[79] Multi-agent Undercover Gaming: Hallucination Removal via Counterfactual Test for Multimodal Reasoning

Dayong Liang, Xiao-Yong Wei, Changmeng Zheng

🧩 TL;DR

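This paper proposes the Multi-agent Undercover Gaming (MUG) protocol, which detects hallucinating agents by introducing counterfactual tests and dynamically modified evidence, providing a more reliable framework for multimodal reasoning. The method reframes multi-agent debate as a process of identifying hallucinating agents, significantly improving reasoning reliability.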

📘 Detailed Summary

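Motivation: The existing Multi-Agent Debate (MAD) paradigm relies on the unrealistic assumption that all debaters are rational and reflective. In practice, agents are themselves prone to hallucination, which limits MAD's effectiveness at improving the reasoning reliability of large language models.

Method: The authors propose the Multi-agent Undercover Gaming (MUG) protocol, inspired by social deduction games. Reference images are modified to introduce counterfactual evidence, and agents are observed to see whether they accurately identify the changes, which provides ground truth for identifying hallucinating agents. The method makes three key innovations: factual verification through counterfactual testing, cross-evidence reasoning through dynamically modified evidence sources, and probing discussions that foster active reasoning.

Result: MUG shows higher reliability on multimodal reasoning tasks. Counterfactual testing effectively identifies hallucinating agents, yielding more robust crowd-powered multimodal reasoning than conventional MAD methods.

Conclusion: MUG is a substantive improvement to the multi-agent debate paradigm. Through counterfactual verification and dynamic evidence handling, it establishes a more reliable framework for multimodal reasoning in large language models and opens a new route to addressing hallucination.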

📄 Abstract

Hallucination continues to pose a major obstacle in the reasoning capabilities of large language models (LLMs). Although the Multi-Agent Debate (MAD) paradigm offers a promising solution by promoting consensus among multiple agents to enhance reliability, it relies on the unrealistic assumption that all debaters are rational and reflective, which is a condition that may not hold when agents themselves are prone to hallucinations. To address this gap, we introduce the Multi-agent Undercover Gaming (MUG) protocol, inspired by social deduction games like "Who is Undercover?". MUG reframes MAD as a process of detecting "undercover" agents (those suffering from hallucinations) by employing multimodal counterfactual tests. Specifically, we modify reference images to introduce counterfactual evidence and observe whether agents can accurately identify these changes, providing ground-truth for identifying hallucinating agents and enabling robust, crowd-powered multimodal reasoning. MUG advances MAD protocols along three key dimensions: (1) enabling factual verification beyond statistical consensus through counterfactual testing; (2) introducing cross-evidence reasoning via dynamically modified evidence sources instead of relying on static inputs; and (3) fostering active reasoning, where agents engage in probing discussions rather than passively answering questions. Collectively, these innovations offer a more reliable and effective framework for multimodal reasoning in LLMs. The source code can be accessed at https://github.com/YongLD/MUG.git.
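
The counterfactual test described in this abstract can be sketched in a few lines. Because the tester knows which reference images were modified, each agent's answers can be scored against ground truth rather than against the group consensus. The function name `flag_undercover`, the boolean answer format, and the `min_accuracy` threshold are hypothetical choices made for illustration; they are not taken from the MUG source code.

```python
# Hypothetical sketch of counterfactual scoring; not the MUG implementation.

def flag_undercover(agent_answers, ground_truth, min_accuracy=0.75):
    """Return the agents whose change-detection accuracy is below min_accuracy.

    agent_answers: {agent_name: [bool, ...]}, True if the agent reported
                   that the image had been changed.
    ground_truth:  [bool, ...], True if the image really was modified.
    """
    flagged = []
    for name, answers in agent_answers.items():
        correct = sum(a == t for a, t in zip(answers, ground_truth))
        accuracy = correct / len(ground_truth)
        if accuracy < min_accuracy:
            flagged.append(name)
    return flagged

truth = [True, False, True, True]
answers = {
    "agent_a": [True, False, True, True],    # 4/4 correct
    "agent_b": [False, False, False, True],  # 2/4 correct
    "agent_c": [True, False, True, False],   # 3/4 correct
}
flagged = flag_undercover(answers, truth)    # ["agent_b"]
```

The point of the design is that a majority of agents sharing the same hallucination would still be caught, since the reference is the known modification and not the majority vote.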

[80] AIonopedia: an LLM agent orchestrating multimodal learning for ionic liquid discovery

Yuqi Yin, Yibo Fu, Siyuan Wang, Peng Sun, Hongyu Wang, Xiaohui Wang, Lei Zheng, Zhiyong Li, Zhirong Liu, Jianji Wang, Zhaoxi Sun

🧩 TL;DR

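This work presents AIonopedia, to the authors' knowledge the first LLM agent for ionic liquid discovery. By building a multimodal domain foundation model and a hierarchical search architecture, it achieves accurate ionic liquid property prediction and molecular design, and it shows strong generalization in real-world wet-lab validation.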

📘 Detailed Summary

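Motivation: The discovery of ionic liquids (ILs) faces critical challenges, including limited data, poor model accuracy, and fragmented workflows, which seriously hinder the efficient development and application of novel ILs.

Method: The authors introduce an LLM-augmented multimodal domain foundation model for ILs and build a hierarchical search architecture for molecular screening and design. The model is trained and evaluated on a newly curated, comprehensive IL dataset.

Result: The model delivers superior performance on the new dataset. Evaluations on literature-reported systems indicate that the agent can perform effective IL modification, and real-world wet-lab validation further confirms strong generalization on challenging out-of-distribution tasks.

Conclusion: The study demonstrates the practical efficacy of an LLM agent in accelerating real-world IL discovery, offers a new paradigm for intelligent design in materials science, and illustrates the potential of AI for optimizing complex molecular systems.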

📄 Abstract

The discovery of novel Ionic Liquids (ILs) is hindered by critical challenges in property prediction, including limited data, poor model accuracy, and fragmented workflows. Leveraging the power of Large Language Models (LLMs), we introduce AIonopedia, to the best of our knowledge, the first LLM agent for IL discovery. Powered by an LLM-augmented multimodal domain foundation model for ILs, AIonopedia enables accurate property predictions and incorporates a hierarchical search architecture for molecular screening and design. Trained and evaluated on a newly curated and comprehensive IL dataset, our model delivers superior performance. Complementing these results, evaluations on literature-reported systems indicate that the agent can perform effective IL modification. Moving beyond offline tests, the practical efficacy was further confirmed through real-world wet-lab validation, in which the agent demonstrated exceptional generalization capabilities on challenging out-of-distribution tasks, underscoring its ability to accelerate real-world IL discovery.

[81] EcoAlign: An Economically Rational Framework for Efficient LVLM Alignment

Ruoxi Cheng, Haoxuan Ma, Teng Ma, Hongyi Zhang

🧩 TL;DR

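This paper proposes EcoAlign, a framework that recasts the alignment of large vision-language models as an economically rational search problem. A forward-looking value function dynamically weighs safety, utility, and computational cost, matching or surpassing the safety and utility of state-of-the-art methods at a lower computational cost.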

📘 Detailed Summary

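Motivation: Current large vision-language models have sophisticated jailbreak vulnerabilities, and conventional alignment methods struggle to balance safety, utility, and operational cost. Focusing only on final outputs (process-blindness) wastes substantial compute on unsafe deliberation and lets harmful reasoning, disguised with benign justifications, slip past simple additive safety scores.

Method: EcoAlign is an inference-time framework that treats the LVLM as a boundedly rational agent. It incrementally expands a thought graph and scores actions with a forward-looking value function (analogous to net present value) that dynamically weighs expected safety, utility, and cost against the remaining budget. Path safety is enforced via the weakest-link principle to prevent deception.

Result: Extensive experiments across 3 closed-source and 2 open-source models on 6 datasets show that EcoAlign matches or surpasses the safety and utility of state-of-the-art methods at a lower computational cost, offering a principled, economical path to robust LVLM alignment.

Conclusion: The work offers an economic perspective on LVLM alignment. By bringing safety, utility, and cost into a single optimization framework, it addresses the process-blindness of conventional methods and provides theoretical grounding and practical guidance for building more efficient and safer AI systems, moving alignment research from safety alone toward joint optimization with economic efficiency.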

📄 Abstract

Large Vision-Language Models (LVLMs) exhibit powerful reasoning capabilities but suffer from sophisticated jailbreak vulnerabilities. Fundamentally, aligning LVLMs is not just a safety challenge but a problem of economic efficiency. Current alignment methods struggle with the trade-off between safety, utility, and operational costs. Critically, a focus solely on final outputs (process-blindness) wastes significant computational budget on unsafe deliberation. This flaw allows harmful reasoning to be disguised with benign justifications, thereby circumventing simple additive safety scores. To address this, we propose EcoAlign, an inference-time framework that reframes alignment as an economically rational search by treating the LVLM as a boundedly rational agent. EcoAlign incrementally expands a thought graph and scores actions using a forward-looking function (analogous to net present value) that dynamically weighs expected safety, utility, and cost against the remaining budget. To prevent deception, path safety is enforced via the weakest-link principle. Extensive experiments across 3 closed-source and 2 open-source models on 6 datasets show that EcoAlign matches or surpasses state-of-the-art safety and utility at a lower computational cost, thereby offering a principled, economical pathway to robust LVLM alignment.
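
The two scoring ideas named in this abstract can be illustrated with a small sketch. The abstract does not give the actual formulas, so everything below is an assumption: `path_safety` takes the minimum over per-step safety scores (one plain reading of "weakest link"), and `action_value` is a made-up linear score in which cost is penalized relative to the remaining budget.

```python
# Hypothetical sketch; the formulas are illustrative assumptions,
# not EcoAlign's published scoring function.

def path_safety(step_safety):
    """Weakest-link safety: a path is only as safe as its least safe step."""
    return min(step_safety)

def additive_safety(step_safety):
    """Mean of step scores, shown only for contrast with the weakest link."""
    return sum(step_safety) / len(step_safety)

def action_value(expected_safety, expected_utility, cost, remaining_budget,
                 w_safety=1.0, w_utility=1.0):
    """Assumed forward-looking score: benefit minus budget-relative cost."""
    if cost > remaining_budget:
        return float("-inf")  # unaffordable actions are never selected
    cost_pressure = cost / remaining_budget  # grows as the budget shrinks
    return w_safety * expected_safety + w_utility * expected_utility - cost_pressure

steps = [0.9, 0.2, 0.95]  # one unsafe step hidden between two benign ones
weakest = path_safety(steps)
average = additive_safety(steps)
value = action_value(expected_safety=0.9, expected_utility=0.6,
                     cost=2.0, remaining_budget=10.0)
```

For the example path, the additive average stays well above the minimum, which is how a harmful step wrapped in benign justification could pass an additive check while failing the weakest-link one.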

[82] CURENet: Combining Unified Representations for Efficient Chronic Disease Prediction

Cong-Tinh Dao, Nguyen Minh Thao Phan, Jun-En Ding, Chenwei Wu, David Restrepo, Dongsheng Luo, Fanyi Zhao, Chun-Chieh Liao, Wen-Chih Peng, Chi-Te Wang, Pei-Fu Chen, Ling Chen, Xinglong Ju, Feng Liu, Fang-Ming Hung

🧩 TL;DR

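This paper presents CURENet, a multimodal model for chronic disease prediction that integrates unstructured clinical text, lab tests, and patient time-series data, achieving over 94% accuracy on the MIMIC-III and FEMH datasets. The model uses large language models to process clinical text and lab test data, and transformer encoders to process longitudinal sequential visit data.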

📘 Detailed Summary

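Motivation: Most current predictive models fail to exploit the interactions, redundancies, and temporal patterns across the multimodal data in electronic health records (EHRs), often focusing on a single data type or overlooking these complexities. EHRs contain unstructured clinical notes, structured lab tests, and time-series visit data, but existing methods cannot fully capture the complex relationships among these multimodal and temporal sources, which limits the effectiveness of clinical decision support systems.

Method: CURENet adopts a multimodal architecture that uses large language models to process clinical text and textual lab tests, and transformer encoders to process longitudinal sequential visit data. It integrates the three data sources through unified representation learning, capturing the intricate interactions between different forms of clinical data to build a more reliable chronic disease prediction model.

Result: Evaluated on the public MIMIC-III and private FEMH datasets, CURENet achieves over 94% accuracy in predicting the top 10 chronic conditions in a multi-label framework. These results support the effectiveness of multimodal data fusion for clinical prediction tasks.

Conclusion: The study highlights the potential of multimodal EHR integration to enhance clinical decision support and improve patient outcomes. CURENet's results underline the importance of exploiting the complex interactions between data modalities in EHRs, point to a direction for developing more reliable clinical prediction models, and lay groundwork for the practical deployment of medical AI systems.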

📄 Abstract

Electronic health records (EHRs) are designed to synthesize diverse data types, including unstructured clinical notes, structured lab tests, and time-series visit data. Physicians draw on these multimodal and temporal sources of EHR data to form a comprehensive view of a patient's health, which is crucial for informed therapeutic decision-making. Yet, most predictive models fail to fully capture the interactions, redundancies, and temporal patterns across multiple data modalities, often focusing on a single data type or overlooking these complexities. In this paper, we present CURENet, a multimodal model (Combining Unified Representations for Efficient chronic disease prediction) that integrates unstructured clinical notes, lab tests, and patients' time-series data by utilizing large language models (LLMs) for clinical text processing and textual lab tests, as well as transformer encoders for longitudinal sequential visits. CURENet is capable of capturing the intricate interaction between different forms of clinical data and creating a more reliable predictive model for chronic illnesses. We evaluated CURENet using the public MIMIC-III and private FEMH datasets, where it achieved over 94% accuracy in predicting the top 10 chronic conditions in a multi-label framework. Our findings highlight the potential of multimodal EHR integration to enhance clinical decision-making and improve patient outcomes.

Table of Contents

cs.CV [Back]

[1] Short-Window Sliding Learning for Real-Time Violence Detection via LLM-based Auto-Labeling

🧩 TL;DR

📘 Detailed Summary

📄 Abstract

[2] MCN-CL: Multimodal Cross-Attention Network and Contrastive Learning for Multimodal Emotion Recognition

🧩 TL;DR

📘 Detailed Summary

📄 Abstract

[3] Text-guided Weakly Supervised Framework for Dynamic Facial Expression Recognition

🧩 TL;DR

📘 Detailed Summary

📄 Abstract

[4] PAS: A Training-Free Stabilizer for Temporal Encoding in Video LLMs

🧩 TL;DR

📘 Detailed Summary

📄 Abstract

[5] Binary Verification for Zero-Shot Vision

🧩 TL;DR

📘 Detailed Summary

📄 Abstract

[6] Semantic VLM Dataset for Safe Autonomous Driving

🧩 TL;DR

📘 Detailed Summary

📄 Abstract

[7] VisMem: Latent Vision Memory Unlocks Potential of Vision-Language Models

🧩 TL;DR

📘 Detailed Summary

📄 Abstract

[8] Fast Data Attribution for Text-to-Image Models

🧩 TL;DR

📘 Detailed Summary

📄 Abstract

[9] AirCopBench: A Benchmark for Multi-drone Collaborative Embodied Perception and Reasoning

🧩 TL;DR

📘 Detailed Summary

📄 Abstract

[10] S2D-ALIGN: Shallow-to-Deep Auxiliary Learning for Anatomically-Grounded Radiology Report Generation

🧩 TL;DR

📘 Detailed Summary

📄 Abstract

[11] Frequency-Aware Vision-Language Multimodality Generalization Network for Remote Sensing Image Classification

🧩 TL;DR

📘 Detailed Summary

📄 Abstract

[12] CrossMed: A Multimodal Cross-Task Benchmark for Compositional Generalization in Medical Imaging

🧩 TL;DR

📘 Detailed Summary

📄 Abstract

[13] Discovering Meaningful Units with Visually Grounded Semantics from Image Captions

🧩 TL;DR

📘 Detailed Summary

📄 Abstract

[14] VIDEOP2R: Video Understanding from Perception to Reasoning

🧩 TL;DR

📘 Detailed Summary

📄 Abstract

[15] From Synthetic Scenes to Real Performance: Enhancing Spatial Reasoning in VLMs

🧩 TL;DR

📘 Detailed Summary

📄 Abstract

[16] PhaseWin Search Framework Enable Efficient Object-Level Interpretation

🧩 TL;DR

📘 Detailed Summary

📄 Abstract

[17] Out-of-Distribution Detection with Positive and Negative Prompt Supervision Using Large Language Models

🧩 TL;DR

📘 Detailed Summary

📄 Abstract

[18] Refine and Align: Confidence Calibration through Multi-Agent Interaction in VQA

🧩 TL;DR

📘 Detailed Summary

📄 Abstract

[19] Abstract 3D Perception for Spatial Intelligence in Vision-Language Models

🧩 TL;DR

📘 Detailed Summary

📄 Abstract

[20] AUVIC: Adversarial Unlearning of Visual Concepts for Multi-modal Large Language Models

🧩 TL;DR