cs.CV

[1] An Empirical Study on Knowledge Transfer under Domain and Label Shifts in 3D LiDAR Point Clouds

Subeen Lee, Siyeong Lee, Namil Kim, Jaesik Choi

🧩 TL;DR

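This study proposes the ROAD benchmark for evaluating the robustness of LiDAR point cloud classification under continual and transfer learning, with a focus on simultaneous domain shift and label evolution. It addresses a gap in research on how 3D perception systems adapt in real-world use.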


📘 Detailed Summary

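Motivation: 3D perception systems in real-world applications such as autonomous driving and embodied AI must adapt to continuously evolving object definitions and sensor domains. Compared with 2D vision, however, continual and transfer learning for 3D point cloud perception remains underexplored, particularly when domain shift and label change occur together.

Method: The study proposes ROAD, an evaluation suite designed for LiDAR point cloud classification that explicitly accounts for domain shift and three key forms of label evolution: class split, class expansion, and class insertion. Using large-scale datasets (Waymo, NuScenes, Argoverse2), it evaluates zero-shot transfer, linear probing, and continual learning, and analyzes the impact of backbone architectures, training objectives, and continual learning methods.

Result: The results reveal the limitations of existing methods under realistic shifts, particularly when domain and label changes must be handled at the same time. Through systematic evaluation, the study establishes strong baselines for future research on robust 3D perception and quantifies how different architectures and methods behave in complex evolving scenarios.

Conclusion: The study stresses that 3D perception systems must adapt to continual change in real-world use and shows where current methods fall short under simultaneous domain and label shifts. ROAD provides a standardized evaluation framework for future work on more robust 3D perception models, particularly in safety-critical applications such as autonomous driving.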


📄 Abstract

For 3D perception systems to be practical in real-world applications -- from autonomous driving to embodied AI -- models must adapt to continuously evolving object definitions and sensor domains. Yet, continual and transfer learning in 3D point cloud perception remains underexplored compared to 2D vision -- particularly under simultaneous domain and label shifts. To address this gap, we propose the RObust Autonomous driving under Dataset shifts (ROAD) benchmark, a comprehensive evaluation suite for LiDAR-based object classification that explicitly accounts for domain shifts as well as three key forms of label evolution: class split, class expansion, and class insertion. Using large-scale datasets (Waymo, NuScenes, Argoverse2), we evaluate zero-shot transfer, linear probing, and continual learning (CL), and analyze the impact of backbone architectures, training objectives, and CL methods. Our findings reveal limitations of existing approaches under realistic shifts and establish strong baselines for future research in robust 3D perception.

[2] CASHEW: Stabilizing Multimodal Reasoning via Iterative Trajectory Aggregation

Chaoyu Li, Deeparghya Dutta Barua, Fei Tao, Pooyan Fazli

🧩 TL;DR

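This paper proposes two complementary methods, CASHEW and CASHEW-RL, which improve the stability and performance of multi-step reasoning in vision-language models through inference-time aggregation of multiple candidate trajectories and through reinforcement learning.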


📘 Detailed Summary

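Motivation: Vision-language models perform well on many multimodal understanding tasks, but their multi-step reasoning is unstable: repeated sampling on the same input yields divergent reasoning trajectories and inconsistent final predictions, which limits reliability and practical value.

Method: Two complementary methods are proposed. CASHEW is an inference-time framework that iteratively aggregates multiple candidate reasoning trajectories into higher-quality reasoning traces and uses explicit visual verification to filter hallucinated steps. CASHEW-RL is trained with Group Sequence Policy Optimization (GSPO) using a composite reward that encourages correct answers grounded in minimal yet sufficient visual evidence, while adaptively allocating reasoning effort.

Result: Extensive experiments on 13 image understanding, video understanding, and video reasoning benchmarks show significant improvements, including gains of up to 23.6 percentage points on ScienceQA and 8.1 percentage points on EgoSchema.

Conclusion: The study shows that inference-time trajectory aggregation and reinforcement learning can substantially improve the stability and reliability of multi-step reasoning in vision-language models. It offers a route to more robust multimodal reasoning systems and highlights the value of adaptively allocating reasoning compute.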


📄 Abstract

Vision-language models achieve strong performance across a wide range of multimodal understanding and reasoning tasks, yet their multi-step reasoning remains unstable. Repeated sampling over the same input often produces divergent reasoning trajectories and inconsistent final predictions. To address this, we introduce two complementary approaches inspired by test-time scaling: (1) CASHEW, an inference-time framework that stabilizes reasoning by iteratively aggregating multiple candidate trajectories into higher-quality reasoning traces, with explicit visual verification filtering hallucinated steps and grounding reasoning in visual evidence, and (2) CASHEW-RL, a learned variant that internalizes this aggregation behavior within a single model. CASHEW-RL is trained using Group Sequence Policy Optimization (GSPO) with a composite reward that encourages correct answers grounded in minimal yet sufficient visual evidence, while adaptively allocating reasoning effort based on task difficulty. This training objective enables robust self-aggregation at inference. Extensive experiments on 13 image understanding, video understanding, and video reasoning benchmarks show significant performance improvements, including gains of up to +23.6 percentage points on ScienceQA and +8.1 percentage points on EgoSchema.
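
The instability described in the abstract (repeated sampling giving inconsistent final predictions) can be quantified with a simple agreement rate. The sketch below is not CASHEW's aggregation procedure, which the abstract does not specify in detail; `agreement_rate` is a hypothetical helper that only measures how often sampled answers match the most common one.

```python
from collections import Counter


def agreement_rate(answers):
    """Fraction of sampled final answers that match the most common answer."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)


# Hypothetical final answers from five samples on the same multiple-choice input.
samples = ["B", "B", "C", "B", "A"]
rate = agreement_rate(samples)
```

A rate near 1.0 means the model answers consistently; lower values indicate the divergence that the paper aims to reduce.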

[3] Representations of Text and Images Align From Layer One

Evžen Wybitul, Javier Rando, Florian Tramèr, Stanislav Fort

🧩 TL;DR

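This paper proposes an optimisation-based synthesis method showing that image and text representations in adapter-based vision-language models are already meaningfully aligned in early layers, challenging the view that such alignment appears only in late layers.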


📘 Detailed Summary

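Motivation: The study challenges the established view that image-text alignment in adapter-based vision-language models appears only in late layers. Prior work generally assumes that cross-modal alignment forms only after many layers of processing; this study asks whether it is already present much earlier in the network.

Method: The study proposes a DeepDream-inspired synthesis method: given a textual concept such as "Jupiter", it extracts the concept vector at a given layer and then optimises an image so that its representation aligns with that vector. The method needs no auxiliary models or datasets. It is applied to hundreds of concepts across seven layers of Gemma 3, and the synthesised images visualise the model's representation space as evidence of image-text alignment.

Result: Already at layer 1 of Gemma 3, more than 50% of the synthesised images depict recognisable features of the target concepts, such as animals, activities, or seasons. Image-text alignment therefore begins in the earliest layers rather than only in late layers, and the method gives direct evidence of alignment at the level of individual concepts.

Conclusion: The study provides direct, constructive evidence that image-text alignment in adapter-based vision-language models exists from the early layers. The method opens a new path towards interpretability by visualising the representation space, and it offers a simple, fast way to assess multimodal alignment without external resources.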


📄 Abstract

We show that for a variety of concepts in adapter-based vision-language models, the representations of their images and their text descriptions are meaningfully aligned from the very first layer. This contradicts the established view that such image-text alignment only appears in late layers. We show this using a new synthesis-based method inspired by DeepDream: given a textual concept such as "Jupiter", we extract its concept vector at a given layer, and then use optimisation to synthesise an image whose representation aligns with that vector. We apply our approach to hundreds of concepts across seven layers in Gemma 3, and find that the synthesised images often depict salient visual features of the targeted textual concepts: for example, already at layer 1, more than 50 % of images depict recognisable features of animals, activities, or seasons. Our method thus provides direct, constructive evidence of image-text alignment on a concept-by-concept and layer-by-layer basis. Unlike previous methods for measuring multimodal alignment, our approach is simple, fast, and does not require auxiliary models or datasets. It also offers a new path towards model interpretability, by providing a way to visualise a model's representation space by backtracing through its image processing components.

[4] Training Free Zero-Shot Visual Anomaly Localization via Diffusion Inversion

Samet Hicsonmez, Abd El Rahman Shabayek, Djamila Aouada

🧩 TL;DR

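This paper proposes DIVAD, a training-free, vision-only zero-shot anomaly detection framework that uses the inversion of a pretrained Denoising Diffusion Implicit Model (DDIM) to detect and localise anomalies without fine-grained prompts, achieving state-of-the-art performance on the VISA dataset.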


📘 Detailed Summary

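Motivation: Current zero-shot image anomaly detection methods have two main problems: language-based methods depend on fine-grained prompts for localisation, while vision-only methods are usually limited to image-level classification and lack spatial precision. The study aims for a training-free, vision-only framework that needs no auxiliary modality, removing the dependence on prompts and improving localisation.

Method: The proposed framework, DIVAD, is built on the inversion of a pretrained DDIM. Given an input image and a generic text description, it first inverts the image to obtain latent representations, then starts the denoising process from a fixed intermediate timestep to reconstruct the image. Because the underlying diffusion model is trained only on normal data, the reconstruction looks normal, and the discrepancy between the input and the reconstruction highlights potential anomalies.

Result: The method achieves state-of-the-art performance on the VISA dataset and shows strong localisation capability without any auxiliary modality. It detects and localises anomalies without the fine-grained prompts that earlier methods require.

Conclusion: The study shows that inverting a pretrained diffusion model enables effective training-free zero-shot anomaly detection and localisation. It demonstrates the potential of vision-only methods and supports a shift in the field away from prompt dependence.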


📄 Abstract

Zero-Shot image Anomaly Detection (ZSAD) aims to detect and localise anomalies without access to any normal training samples of the target data. While recent ZSAD approaches leverage additional modalities such as language to generate fine-grained prompts for localisation, vision-only methods remain limited to image-level classification, lacking spatial precision. In this work, we introduce a simple yet effective training-free vision-only ZSAD framework that circumvents the need for fine-grained prompts by leveraging the inversion of a pretrained Denoising Diffusion Implicit Model (DDIM). Specifically, given an input image and a generic text description (e.g., "an image of an [object class]"), we invert the image to obtain latent representations and initiate the denoising process from a fixed intermediate timestep to reconstruct the image. Since the underlying diffusion model is trained solely on normal data, this process yields a normal-looking reconstruction. The discrepancy between the input image and the reconstructed one highlights potential anomalies. Our method achieves state-of-the-art performance on the VISA dataset, demonstrating strong localisation capabilities without auxiliary modalities and facilitating a shift away from prompt dependence for zero-shot anomaly detection research. Code is available at https://github.com/giddyyupp/DIVAD.
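
The final step, turning the discrepancy between input and reconstruction into a localisation map, can be sketched as follows. The DDIM inversion and denoising are not shown; `reconstruction` is a stand-in array, and the scoring rule (mean absolute difference over channels, scaled to [0, 1]) is an assumption, since the abstract only says that the discrepancy highlights anomalies.

```python
import numpy as np


def anomaly_map(image, reconstruction):
    """Per-pixel anomaly score: mean absolute difference over channels, scaled to [0, 1]."""
    diff = np.abs(image.astype(np.float64) - reconstruction.astype(np.float64))
    score = diff.mean(axis=-1)  # collapse the channel axis -> (H, W)
    peak = score.max()
    return score / peak if peak > 0 else score


# Synthetic example: the "reconstruction" is the image without a small defect.
rng = np.random.default_rng(0)
recon = rng.random((8, 8, 3))
img = recon.copy()
img[2:4, 5:7] += 0.5  # defect present in the input but absent from the reconstruction
amap = anomaly_map(img, recon)
```

Thresholding `amap` would give a binary anomaly mask; the threshold choice is left open here.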

[5] A Highly Efficient Diversity-based Input Selection for DNN Improvement Using VLMs

Amin Abbasishahkoo, Mahboubeh Dadkhah, Lionel Briand

🧩 TL;DR

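This paper proposes Concept-Based Diversity (CBD), an efficient diversity metric for selecting image inputs that is computed with vision-language models, and combines it with an uncertainty metric into a hybrid selection strategy that improves both the efficiency and the effectiveness of choosing samples to label for DNN fine-tuning.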


📘 Detailed Summary

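Motivation: Fine-tuning deep neural networks requires labeling newly collected inputs, which is usually costly and time-consuming. Existing diversity-based selection methods are effective but computationally intensive and do not scale, which limits their use on large input sets.

Method: The paper proposes the CBD metric, which uses vision-language models to compute the diversity of image inputs efficiently. Building on the finding that CBD correlates strongly with Geometric Diversity, it constructs a hybrid input selection approach that combines CBD with Margin, a simple uncertainty metric, to balance efficiency and effectiveness.

Result: CBD correlates strongly with Geometric Diversity while requiring only a fraction of its computation time. Across multiple DNN models, input sets, and selection budgets, CBD-based selection outperforms five state-of-the-art baselines, with selection times close to those of simple uncertainty-based methods, and it remains efficient on large input sets such as ImageNet.

Conclusion: The CBD-based approach is effective and computationally cheaper than hybrid baselines, and it scales to repetitive and large-scale input selection. It offers a practical way to select samples for labeling that balances computational cost and selection quality.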


📄 Abstract

Maintaining or improving the performance of Deep Neural Networks (DNNs) through fine-tuning requires labeling newly collected inputs, a process that is often costly and time-consuming. To alleviate this problem, input selection approaches have been developed in recent years to identify small, yet highly informative subsets for labeling. Diversity-based selection is one of the most effective approaches for this purpose. However, such approaches are often computationally intensive and lack scalability for large input sets, limiting their practical applicability. To address this challenge, we introduce Concept-Based Diversity (CBD), a highly efficient metric for image inputs that leverages Vision-Language Models (VLM). Our results show that CBD exhibits a strong correlation with Geometric Diversity (GD), an established diversity metric, while requiring only a fraction of its computation time. Building on this finding, we propose a hybrid input selection approach that combines CBD with Margin, a simple uncertainty metric. We conduct a comprehensive evaluation across a diverse set of DNN models, input sets, selection budgets, and the five most effective state-of-the-art selection baselines. The results demonstrate that the CBD-based selection consistently outperforms all baselines at guiding input selection to improve the DNN model. Furthermore, the CBD-based selection approach remains highly efficient, requiring selection times close to those of simple uncertainty-based methods such as Margin, even on larger input sets like ImageNet. These results confirm not only the effectiveness and computational advantage of the CBD-based approach, particularly compared to hybrid baselines, but also its scalability in repetitive and extensive input selection scenarios.
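
For readers unfamiliar with Margin, the sketch below shows the usual definition: the gap between the two highest predicted class probabilities, where a small gap means the model is uncertain. How the paper combines Margin with CBD is not described in the abstract, so only the uncertainty part is shown; the function names are hypothetical.

```python
import numpy as np


def margin_scores(probs):
    """Margin per input: top-1 probability minus top-2 probability (small = uncertain)."""
    top2 = np.sort(probs, axis=1)[:, -2:]  # two largest probabilities per row, ascending
    return top2[:, 1] - top2[:, 0]


def select_most_uncertain(probs, budget):
    """Indices of the `budget` inputs with the smallest margin."""
    return np.argsort(margin_scores(probs), kind="stable")[:budget]


# Softmax outputs of a hypothetical 3-class model on three unlabeled inputs.
probs = np.array([[0.90, 0.05, 0.05],
                  [0.40, 0.35, 0.25],
                  [0.50, 0.49, 0.01]])
chosen = select_most_uncertain(probs, budget=2)
```

Because Margin needs only one sort per input, it is cheap, which is why the abstract uses it as the reference point for selection time.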

[6] FigEx2: Visual-Conditioned Panel Detection and Captioning for Scientific Compound Figures

Jifeng Song, Arun Das, Pan Wang, Hui Ji, Kun Zhao, Yufei Huang

🧩 TL;DR

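This paper proposes FigEx2, a visual-conditioned framework that localizes panels in scientific compound figures and generates panel-level captions, improving detection and captioning through a noise-aware gated fusion module and a two-stage optimization strategy.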


📘 Detailed Summary

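Motivation: Scientific compound figures combine multiple labeled panels into a single image, but captions in real pipelines are often missing or give only figure-level summaries. This makes panel-level understanding difficult and calls for a method that localizes panels and generates panel-level captions directly from the compound figure.

Method: FigEx2 is a visual-conditioned framework with a noise-aware gated fusion module that adaptively filters token-level features to stabilize the detection query space. It is trained with a two-stage strategy that combines supervised learning and reinforcement learning, using CLIP-based alignment and BERTScore-based semantic rewards to enforce strict multimodal consistency.

Result: FigEx2 achieves 0.726 mAP@0.5:0.95 for detection and outperforms Qwen3-VL-8B by 0.51 in METEOR and 0.24 in BERTScore. It also transfers zero-shot to out-of-distribution scientific domains without fine-tuning.

Conclusion: The noise-aware fusion mechanism and two-stage optimization address panel-level understanding of scientific compound figures. The curated BioSci-Fig-Cap benchmark and cross-disciplinary test suites provide high-quality supervision for later work, and the results show strong generalization on scientific vision-language tasks.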


📄 Abstract

Scientific compound figures combine multiple labeled panels into a single image, but captions in real pipelines are often missing or only provide figure-level summaries, making panel-level understanding difficult. In this paper, we propose FigEx2, a visual-conditioned framework that localizes panels and generates panel-wise captions directly from the compound figure. To mitigate the impact of diverse phrasing in open-ended captioning, we introduce a noise-aware gated fusion module that adaptively filters token-level features to stabilize the detection query space. Furthermore, we employ a staged optimization strategy combining supervised learning with reinforcement learning (RL), utilizing CLIP-based alignment and BERTScore-based semantic rewards to enforce strict multimodal consistency. To support high-quality supervision, we curate BioSci-Fig-Cap, a refined benchmark for panel-level grounding, alongside cross-disciplinary test suites in physics and chemistry. Experimental results demonstrate that FigEx2 achieves a superior 0.726 mAP@0.5:0.95 for detection and significantly outperforms Qwen3-VL-8B by 0.51 in METEOR and 0.24 in BERTScore. Notably, FigEx2 exhibits remarkable zero-shot transferability to out-of-distribution scientific domains without any fine-tuning.
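
The detection metric mAP@0.5:0.95 averages over ten IoU thresholds from 0.50 to 0.95 in steps of 0.05. The sketch below shows only the IoU computation and the threshold sweep for one predicted panel box; it is not a full mAP implementation (which also needs precision-recall curves), and the box coordinates are invented for illustration.

```python
def box_iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


# The ten IoU thresholds 0.50, 0.55, ..., 0.95.
THRESHOLDS = [(50 + 5 * i) / 100 for i in range(10)]


def thresholds_passed(pred, gt):
    """Number of thresholds at which `pred` would count as a correct detection of `gt`."""
    iou = box_iou(pred, gt)
    return sum(iou >= t for t in THRESHOLDS)


gt_panel = (0.0, 0.0, 10.0, 10.0)
pred_panel = (0.0, 0.0, 10.0, 8.2)  # slightly too short; IoU is about 0.82
n_passed = thresholds_passed(pred_panel, gt_panel)
```

A box that is only roughly right passes the loose thresholds but fails the strict ones, which is why this metric rewards tight localization.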

[7] Rescind: Countering Image Misconduct in Biomedical Publications with Vision-Language and State-Space Modeling

Soumyaroop Nandi, Prem Natarajan

🧩 TL;DR

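This paper presents the first vision-language guided framework for generating and detecting biomedical image forgeries. By combining diffusion-based synthesis with vision-language prompting, it produces realistic and semantically controlled manipulations (duplication, splicing, and region removal), and it introduces the large-scale benchmark Rescind and the detection framework Integscan.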


📘 Detailed Summary

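Motivation: Scientific image manipulation in biomedical publications is a growing threat to research integrity and reproducibility. Unlike natural image forensics, biomedical forgery detection faces domain-specific artifacts, complex textures, and unstructured figure layouts, which existing methods handle poorly.

Method: The paper proposes a generation and detection framework that combines diffusion-based synthesis with vision-language prompting, using a vision-language model verification loop to ensure semantic fidelity. It introduces Rescind, a large-scale benchmark with fine-grained annotations and modality-specific splits, and Integscan, a structured state space modeling framework that integrates attention-enhanced visual encoding with prompt-conditioned semantic alignment for precise forgery localization.

Result: Extensive experiments on Rescind and existing benchmarks show that Integscan achieves state-of-the-art performance in both detection and localization, providing a foundation for automated scientific integrity analysis.

Conclusion: The study provides the first comprehensive vision-language guided framework for biomedical image forgery detection, addressing domain-specific challenges by designing generation and detection together. The dataset and detector open a new direction for scientific integrity analysis.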


📄 Abstract

Scientific image manipulation in biomedical publications poses a growing threat to research integrity and reproducibility. Unlike natural image forensics, biomedical forgery detection is uniquely challenging due to domain-specific artifacts, complex textures, and unstructured figure layouts. We present the first vision-language guided framework for both generating and detecting biomedical image forgeries. By combining diffusion-based synthesis with vision-language prompting, our method enables realistic and semantically controlled manipulations, including duplication, splicing, and region removal, across diverse biomedical modalities. We introduce Rescind, a large-scale benchmark featuring fine-grained annotations and modality-specific splits, and propose Integscan, a structured state space modeling framework that integrates attention-enhanced visual encoding with prompt-conditioned semantic alignment for precise forgery localization. To ensure semantic fidelity, we incorporate a vision-language model based verification loop that filters generated forgeries based on consistency with intended prompts. Extensive experiments on Rescind and existing benchmarks demonstrate that Integscan achieves state-of-the-art performance in both detection and localization, establishing a strong foundation for automated scientific integrity analysis.

[8] From Prompts to Deployment: Auto-Curated Domain-Specific Dataset Generation via Diffusion Models

Dongsik Yoon, Jongeun Kim

🧩 TL;DR

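This paper proposes an automated, diffusion-based pipeline for generating domain-specific synthetic datasets. Its three-stage framework addresses the distribution shift between pre-trained models and real deployment environments and reduces reliance on large-scale real data collection.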


📘 Detailed Summary

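Motivation: The study addresses the distribution shift between pre-trained models and real-world deployment environments, especially when domain-specific real data is scarce. Conventional approaches rely on collecting large amounts of real data, which is costly and slow, so an automated way to generate high-quality synthetic datasets is needed.

Method: The study proposes a three-stage automated pipeline. It first synthesizes target objects within domain-specific backgrounds through controlled inpainting. It then validates the outputs with a multi-modal assessment that integrates object detection, aesthetic scoring, and vision-language alignment. Finally, a user-preference classifier captures subjective selection criteria.

Result: The pipeline enables efficient construction of high-quality, deployable synthetic datasets and reduces reliance on extensive real-world data collection. The multi-modal assessment checks the quality of generated data, and the user-preference classifier accounts for subjective selection criteria.

Conclusion: The study offers an automated solution for generating domain-specific datasets. By combining diffusion models with systematic validation, it addresses distribution shift and provides a scalable alternative for meeting data needs in real deployment environments.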


📄 Abstract

In this paper, we present an automated pipeline for generating domain-specific synthetic datasets with diffusion models, addressing the distribution shift between pre-trained models and real-world deployment environments. Our three-stage framework first synthesizes target objects within domain-specific backgrounds through controlled inpainting. The generated outputs are then validated via a multi-modal assessment that integrates object detection, aesthetic scoring, and vision-language alignment. Finally, a user-preference classifier is employed to capture subjective selection criteria. This pipeline enables the efficient construction of high-quality, deployable datasets while reducing reliance on extensive real-world data collection.

[9] How Do Optical Flow and Textual Prompts Collaborate to Assist in Audio-Visual Semantic Segmentation?

Peng Gao, Yujian Lee, Yongqi Xu, Wentao Fan

🧩 TL;DR

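This paper proposes SSP (Stepping Stone Plus), a collaborative framework for audio-visual semantic segmentation that integrates optical flow and textual prompts to improve segmentation accuracy and outperforms existing AVS methods.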


📘 Detailed Summary

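Motivation: Audio-visual semantic segmentation requires semantic understanding of a scene, beyond simply identifying sound-emitting objects. A previous method decomposes the task into two subtasks, but both moving objects and stationary sound-emitting objects (such as alarm clocks) remain challenging and call for better temporal context and semantic integration.

Method: The paper proposes the SSP collaborative framework. A pre-mask technique uses optical flow to capture motion dynamics and provide temporal context for precise segmentation. For stationary sound-emitting objects, SSP incorporates two textual prompts: one identifying the object category and one describing the scene. A visual-textual alignment module facilitates cross-modal integration, and a post-mask technique is used during training so that the model learns the optical flow map.

Result: Experiments show that SSP outperforms existing AVS methods, delivering efficient and precise segmentation results.

Conclusion: The study shows that a collaborative framework combining optical flow and textual prompts is effective for audio-visual semantic segmentation, handling both moving objects and stationary sound-emitting objects.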


📄 Abstract

Audio-visual semantic segmentation (AVSS) represents an extension of the audio-visual segmentation (AVS) task, necessitating a semantic understanding of audio-visual scenes beyond merely identifying sound-emitting objects at the visual pixel level. A previous methodology decomposes the AVSS task into two discrete subtasks, first providing a prompted segmentation mask to facilitate subsequent semantic analysis; our approach innovates on this foundational strategy. We introduce a novel collaborative framework, Stepping Stone Plus (SSP), which integrates optical flow and textual prompts to assist the segmentation process. In scenarios where sound sources frequently coexist with moving objects, our pre-mask technique leverages optical flow to capture motion dynamics, providing essential temporal context for precise segmentation. To address the challenge posed by stationary sound-emitting objects, such as alarm clocks, SSP incorporates two specific textual prompts: one identifies the category of the sound-emitting object, and the other provides a broader description of the scene. Additionally, we implement a visual-textual alignment module (VTA) to facilitate cross-modal integration, delivering more coherent and contextually relevant semantic interpretations. Our training regimen involves a post-mask technique aimed at compelling the model to learn the diagram of the optical flow. Experimental results demonstrate that SSP outperforms existing AVS methods, delivering efficient and precise segmentation results.

[10] Subspace Alignment for Vision-Language Model Test-time Adaptation

Zhichen Zeng, Wenxuan Bao, Xiao Lin, Ruizhong Qiu, Tianxin Wei, Xuying Ning, Yuchen Yan, Chen Luo, Monica Xiao Cheng, Jingrui He, Hanghang Tong

🧩 TL;DR

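This paper proposes SubTTA, which aligns the semantic subspaces of the visual and textual modalities to strengthen test-time adaptation of vision-language models. It addresses unreliable pseudo-labels under distribution shift and yields an average improvement of 2.24% across multiple benchmarks.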


📘 Detailed Summary

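Motivation: Vision-language models have strong zero-shot capabilities but are vulnerable to distribution shifts. Existing test-time adaptation (TTA) methods rely heavily on zero-shot predictions as pseudo-labels for self-training, and these pseudo-labels are unreliable under distribution shift because of two fundamental limitations: a modality gap that makes cross-modal relations inaccurate, and visual embeddings that encode rich but task-irrelevant noise, which often overwhelms task-specific semantics.

Method: SubTTA aligns the semantic subspaces of both modalities to improve the zero-shot predictions that guide TTA. To bridge the modality gap, it extracts the principal subspaces of both modalities and aligns the visual manifold to the textual semantic anchor by minimizing their chordal distance. To remove visual nuisance, it projects the aligned visual features onto the task-specific textual subspace, which filters task-irrelevant noise by constraining visual embeddings to the valid semantic span. Standard TTA is then performed on the purified space to refine the decision boundaries.

Result: Extensive experiments on various benchmarks and VLM architectures show an average improvement of 2.24% over state-of-the-art TTA methods. Subspace alignment and noise filtering make the pseudo-labels more reliable under distribution shift.

Conclusion: Through semantic subspace alignment, SubTTA addresses two core limitations of test-time adaptation for vision-language models. Beyond the performance gain, it highlights the role of modality alignment and noise filtering in cross-modal learning.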


📄 Abstract

Vision-language models (VLMs), despite their extraordinary zero-shot capabilities, are vulnerable to distribution shifts. Test-time adaptation (TTA) emerges as a predominant strategy to adapt VLMs to unlabeled test data on the fly. However, existing TTA methods heavily rely on zero-shot predictions as pseudo-labels for self-training, which can be unreliable under distribution shifts and misguide adaptation due to two fundamental limitations. First (Modality Gap), distribution shifts induce gaps between visual and textual modalities, making cross-modal relations inaccurate. Second (Visual Nuisance), visual embeddings encode rich but task-irrelevant noise that often overwhelms task-specific semantics under distribution shifts. To address these limitations, we propose SubTTA, which aligns the semantic subspaces of both modalities to enhance zero-shot predictions to better guide the TTA process. To bridge the modality gap, SubTTA extracts the principal subspaces of both modalities and aligns the visual manifold to the textual semantic anchor by minimizing their chordal distance. To eliminate visual nuisance, SubTTA projects the aligned visual features onto the task-specific textual subspace, which filters out task-irrelevant noise by constraining visual embeddings within the valid semantic span, and standard TTA is further performed on the purified space to refine the decision boundaries. Extensive experiments on various benchmarks and VLM architectures demonstrate the effectiveness of SubTTA, yielding an average improvement of 2.24% over state-of-the-art TTA methods.
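
The two geometric quantities named in the abstract, the chordal distance between principal subspaces and the projection onto the textual subspace, can be sketched with NumPy. This is a minimal illustration under stated assumptions: subspaces are taken from an uncentred SVD, the subspace dimension `k` is arbitrary, the embeddings are random stand-ins, and the optimisation that SubTTA uses to minimise the distance is not shown.

```python
import numpy as np


def principal_subspace(X, k):
    """Orthonormal basis (d, k) spanning the top-k right singular directions of row-wise X."""
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[:k].T


def chordal_distance(U, V):
    """Chordal distance between the subspaces spanned by orthonormal bases U and V (d x k)."""
    k = U.shape[1]
    overlap = np.linalg.norm(U.T @ V, "fro") ** 2  # sum of squared cosines of principal angles
    return float(np.sqrt(max(k - overlap, 0.0)))


def project_onto(U, X):
    """Project row-wise embeddings X onto the subspace spanned by the columns of U."""
    return X @ U @ U.T


rng = np.random.default_rng(0)
text = rng.normal(size=(10, 16))    # stand-in class-prompt embeddings
visual = rng.normal(size=(50, 16))  # stand-in image embeddings
U_text = principal_subspace(text, k=4)
U_visual = principal_subspace(visual, k=4)
gap = chordal_distance(U_visual, U_text)
purified = project_onto(U_text, visual)  # visual features restricted to the text span
```

The distance is zero when the two subspaces coincide and grows to the square root of `k` when they are orthogonal, so it is a natural objective for pulling the visual subspace toward the textual one.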

[11] Where Does Vision Meet Language? Understanding and Refining Visual Fusion in MLLMs via Contrastive Attention

Shezheng Song, Shasha Li, Jie Yu

🧩 TL;DR

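Through a systematic layer-wise masking analysis, this study reveals how visual-text fusion evolves inside multimodal large language models and proposes a training-free contrastive attention framework that improves multimodal reasoning performance.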


📘 Detailed Summary

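Motivation: Multimodal large language models have made remarkable progress in vision-language understanding, yet how they internally integrate visual and textual information remains poorly understood. This black-box character limits further optimization and interpretability.

Method: The study performs a systematic layer-wise masking analysis across multiple architectures to trace how visual-text fusion evolves. Guided by the findings, it proposes a training-free contrastive attention framework that models the transformation of attention between the early fusion layers and the final layers to highlight meaningful attention shifts.

Result: Fusion emerges at several specific layers rather than being spread uniformly across the network, and some models show a late-stage "review" phenomenon in which visual signals are reactivated before output generation. Attention analysis shows persistent high-attention noise on irrelevant regions alongside gradually increasing attention on text-aligned areas. The proposed contrastive attention framework improves multimodal reasoning performance across various MLLMs and benchmarks.

Conclusion: The study reveals the layer-wise evolution of visual-text fusion in MLLMs, giving insight into model interpretability. The training-free contrastive attention method shows that understanding internal representation dynamics can be used to improve performance, and it suggests directions for future multimodal model optimization and architecture design.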


📄 Abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable progress in vision-language understanding, yet how they internally integrate visual and textual information remains poorly understood. To bridge this gap, we perform a systematic layer-wise masking analysis across multiple architectures, revealing how visual-text fusion evolves within MLLMs. The results show that fusion emerges at several specific layers rather than being uniformly distributed across the network, and certain models exhibit a late-stage "review" phenomenon where visual signals are reactivated before output generation. We further analyze layer-wise attention evolution and observe persistent high-attention noise on irrelevant regions, along with gradually increasing attention on text-aligned areas. Guided by these insights, we introduce a training-free contrastive attention framework that models the transformation between early fusion and final layers to highlight meaningful attention shifts. Extensive experiments across various MLLMs and benchmarks validate our analysis and demonstrate that the proposed approach improves multimodal reasoning performance. Code will be released.
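
One simple way to picture "highlighting meaningful attention shifts" is to keep only the attention mass that a visual token gains between an early fusion layer and the final layer. The sketch below is an assumed formulation for illustration, not the paper's method, whose exact transformation the abstract does not give; the function name and the attention values are hypothetical.

```python
import numpy as np


def contrastive_attention(early, final, eps=1e-12):
    """Keep only attention gained between the early and the final layer, renormalised to sum to 1."""
    gain = np.clip(final - early, 0.0, None)
    total = gain.sum()
    if total <= eps:  # no token gained attention: fall back to a uniform map
        return np.full_like(final, 1.0 / final.size)
    return gain / total


# Attention over four visual tokens. Tokens 0 and 1 carry persistent high attention
# in both layers (noise), while token 2 gains attention as the text is processed.
early = np.array([0.40, 0.40, 0.10, 0.10])
final = np.array([0.40, 0.30, 0.25, 0.05])
contrast = contrastive_attention(early, final)
```

In this toy case the persistent high-attention tokens cancel out and the token whose attention grows stands out, which mirrors the two observations reported in the abstract.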

[12] Representation Learning with Semantic-aware Instance and Sparse Token Alignments

Phuoc-Nguyen Bui, Toan Duc Nguyen, Junghyun Bum, Duc-Tai Le, Hyunseung Choo

🧩 TL;DR

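This paper proposes SISTA, a multi-level alignment framework that exploits the semantic correspondence between medical images and radiology reports at the image-report and patch-word levels. It improves medical contrastive vision-language pre-training by addressing the damage to semantic structure caused by treating all unpaired samples as negatives.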


📘 Detailed Summary

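Motivation: Conventional medical contrastive vision-language pre-training treats paired image-report samples as positives and unpaired samples as negatives. In medical datasets, however, images or reports from different patients can be very similar, so treating every unpaired sample as a negative disrupts the underlying semantic structure and degrades the learned representations.

Method: The paper proposes SISTA, a multi-level alignment framework that exploits the semantic correspondence between medical images and radiology reports at two levels: image-report and patch-word. It improves conventional contrastive learning by incorporating inter-report similarity to eliminate false negatives, and it introduces a method for aligning image patches with relevant word tokens.

Result: The framework improves transfer performance across different datasets on three downstream tasks: image classification, image segmentation, and object detection. It achieves significant improvements on fine-grained tasks even with limited labeled data. Code and pre-trained models will be made available.

Conclusion: The study underlines the importance of accounting for semantic similarity in medical vision-language pre-training. The multi-level alignment framework handles the semantic relations in medical data and improves fine-grained task performance even when labeled data is limited.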


📄 Abstract

Medical contrastive vision-language pre-training (VLP) has demonstrated significant potential in improving performance on downstream tasks. Traditional approaches typically employ contrastive learning, treating paired image-report samples as positives and unpaired ones as negatives. However, in medical datasets, there can be substantial similarities between images or reports from different patients. Rigidly treating all unpaired samples as negatives can disrupt the underlying semantic structure and negatively impact the quality of the learned representations. In this paper, we propose a multi-level alignment framework, Representation Learning with Semantic-aware Instance and Sparse Token Alignments (SISTA), which exploits the semantic correspondence between medical images and radiology reports at two levels, i.e., image-report and patch-word levels. Specifically, we improve conventional contrastive learning by incorporating inter-report similarity to eliminate false negatives and introduce a method to effectively align image patches with relevant word tokens. Experimental results demonstrate the effectiveness of the proposed framework in improving transfer performance across different datasets on three downstream tasks: image classification, image segmentation, and object detection. Notably, our framework achieves significant improvements in fine-grained tasks even with limited labeled data. Codes and pre-trained models will be made available.
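
The false-negative problem can be illustrated by replacing the one-hot targets of a standard contrastive loss with soft targets derived from inter-report similarity. The sketch below is an assumed formulation for illustration only: the abstract does not say how SISTA incorporates inter-report similarity, and the function names, temperatures, and embeddings are hypothetical.

```python
import numpy as np


def _normalise(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def _softmax(x):
    z = x - x.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def soft_targets(txt, temperature=0.1):
    """Row-wise target distribution from report-to-report cosine similarity."""
    t = _normalise(txt)
    return _softmax(t @ t.T / temperature)


def soft_contrastive_loss(img, txt, temperature=0.1):
    """Image-to-report cross-entropy against similarity-based soft targets."""
    logits = _normalise(img) @ _normalise(txt).T / temperature
    log_probs = np.log(_softmax(logits))
    return float(-(soft_targets(txt, temperature) * log_probs).sum(axis=1).mean())


# Reports 0 and 1 are near-duplicates from different patients; report 2 differs.
reports = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
targets = soft_targets(reports)
```

With one-hot targets, report 1 would be pushed away from image 0 even though it describes the same finding; the soft targets split the positive mass between the two near-duplicate reports instead.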

[13] Instruction-Driven 3D Facial Expression Generation and Transition

Anh H. Vo, Tae-Seok Kim, Hulin Jin, Soo-Mi Choi, Yong-Guk Kim

🧩 TL;DR

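This study proposes an instruction-driven 3D facial expression generation framework that, given a text instruction, produces a smooth transition sequence between any two designated expressions, substantially extending the expressive range of 3D avatars.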


📘 Detailed Summary

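Motivation: Conventional 3D avatars usually support only six basic facial expressions and lack the flexibility to simulate realistic emotional variation. The study addresses how to generate a smooth transition between any two facial expressions from a text instruction.

Method: The study introduces the Instruction-driven Facial Expression Decomposer (IFED) module to learn from multimodal data and capture the correlation between textual descriptions and facial expression features. It then proposes the Instruction to Facial Expression Transition (I2FET) method, which uses IFED and a vertex reconstruction loss to refine the semantic comprehension of latent vectors and generate a facial expression sequence that follows the instruction. Finally, a Facial Expression Transition model generates smooth transitions between expressions.

Result: Extensive evaluation on the CK+ and CelebV-HQ datasets suggests that the proposed model outperforms state-of-the-art methods on facial expression generation. The framework generates facial expression trajectories that follow text instructions, and text prompts greatly expand the range of expressions and transitions.

Conclusion: The study gives 3D avatars a flexible, text-driven expression control mechanism that broadens the diversity of emotional expression. The authors expect practical applications; the summary names virtual reality, game character animation, and human-computer interaction as candidate areas.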


📄 Abstract

A 3D avatar typically has one of six cardinal facial expressions. To simulate realistic emotional variation, we should be able to render a facial transition between two arbitrary expressions. This study presents a new framework for instruction-driven facial expression generation that produces a 3D face and, starting from an image of the face, transforms the facial expression from one designated facial expression to another. The Instruction-driven Facial Expression Decomposer (IFED) module is introduced to facilitate multimodal data learning and capture the correlation between textual descriptions and facial expression features. Subsequently, we propose the Instruction to Facial Expression Transition (I2FET) method, which leverages IFED and a vertex reconstruction loss function to refine the semantic comprehension of latent vectors, thus generating a facial expression sequence according to the given instruction. Lastly, we present the Facial Expression Transition model to generate smooth transitions between facial expressions. Extensive evaluation suggests that the proposed model outperforms state-of-the-art methods on the CK+ and CelebV-HQ datasets. The results show that our framework can generate facial expression trajectories according to text instruction. Considering that text prompts allow us to make diverse descriptions of human emotional states, the repertoire of facial expressions and the transitions between them can be expanded greatly. We expect our framework to find various practical applications. More information about our project can be found at https://vohoanganh.github.io/tg3dfet/

[14] GI-Bench: A Panoramic Benchmark Revealing the Knowledge-Experience Dissociation of Multimodal Large Language Models in Gastrointestinal Endoscopy Against Clinical Standards

Yan Zhu, Te Luo, Pei-Yao Fu, Zhen Zhang, Zi-Long Wang, Yi-Fan Qu, Zi-Han Geng, Jia-Qi Xu, Lu Yao, Li-Yun Ma, Wei Su, Wei-Feng Chen, Quan-Lin Li, Shuo Wang, Ping-Hong Zhou

🧩 TL;DR

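This study systematically evaluates multimodal large language models across the clinical workflow of gastrointestinal endoscopy. It finds that models rival junior endoscopists in diagnostic reasoning but face significant bottlenecks in spatial grounding and factual correctness, and it introduces GI-Bench, a benchmark with a dynamic leaderboard.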


📘 Detailed Summary

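Motivation: Multimodal large language models show promise in gastroenterology, but their performance across a complete clinical workflow, and against human benchmarks, has not been systematically verified. The study fills this gap by assessing their clinical utility across a panoramic gastrointestinal endoscopy workflow.

Method: The study constructs GI-Bench, a benchmark covering 20 fine-grained lesion categories, and evaluates 12 multimodal large language models across a five-stage clinical workflow: anatomical localization, lesion identification, diagnosis, findings description, and management. Model performance is compared with three junior endoscopists and three residency trainees using Macro-F1, mean Intersection-over-Union (mIoU), and a multi-dimensional Likert scale.

Result: Gemini-3-Pro achieved state-of-the-art performance. In diagnostic reasoning, top-tier models (Macro-F1 0.641) outperformed residency trainees (0.492) and rivaled junior endoscopists (0.727). A critical "spatial grounding bottleneck" persisted, however: human lesion localization (mIoU above 0.506) significantly outperformed the best model (0.345). The study also found a "fluency-accuracy paradox": model-generated reports were more readable than human reports but significantly less factually correct.

Conclusion: Multimodal large language models rival junior endoscopists in diagnostic reasoning but still need improvement in spatial grounding and in interpreting visual features, where "over-interpretation" and hallucination occur. The GI-Bench dynamic leaderboard provides ongoing evaluation of how models evolve in clinical endoscopy.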


📄 Abstract

Multimodal Large Language Models (MLLMs) show promise in gastroenterology, yet their performance against comprehensive clinical workflows and human benchmarks remains unverified. We systematically evaluate state-of-the-art MLLMs across a panoramic gastrointestinal endoscopy workflow and determine their clinical utility compared with human endoscopists. We constructed GI-Bench, a benchmark encompassing 20 fine-grained lesion categories. Twelve MLLMs were evaluated across a five-stage clinical workflow: anatomical localization, lesion identification, diagnosis, findings description, and management. Model performance was benchmarked against three junior endoscopists and three residency trainees using Macro-F1, mean Intersection-over-Union (mIoU), and a multi-dimensional Likert scale. Gemini-3-Pro achieved state-of-the-art performance. In diagnostic reasoning, top-tier models (Macro-F1 0.641) outperformed trainees (0.492) and rivaled junior endoscopists (0.727; p>0.05). However, a critical "spatial grounding bottleneck" persisted; human lesion localization (mIoU >0.506) significantly outperformed the best model (0.345; p<0.05). Furthermore, qualitative analysis revealed a "fluency-accuracy paradox": models generated reports with superior linguistic readability compared with humans (p<0.05) but exhibited significantly lower factual correctness (p<0.05) due to "over-interpretation" and hallucination of visual features. GI-Bench maintains a dynamic leaderboard that tracks the evolving performance of MLLMs in clinical endoscopy. The current rankings and benchmark results are available at https://roterdl.github.io/GIBench/.
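
Macro-F1, the diagnosis metric quoted above, is the unweighted mean of per-class F1 scores, so rare lesion categories count as much as common ones. The sketch below is a plain-Python illustration; the class names and predictions are invented and are not taken from GI-Bench.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)  # F1 = 2TP / (2TP + FP + FN)
    return sum(scores) / len(scores)


# Hypothetical diagnoses for four endoscopy images.
truth = ["polyp", "polyp", "ulcer", "normal"]
model = ["polyp", "ulcer", "ulcer", "normal"]
score = macro_f1(truth, model)
```

Here the per-class F1 values are 2/3 for "polyp", 2/3 for "ulcer", and 1 for "normal", giving a Macro-F1 of 7/9.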

[15] Route, Retrieve, Reflect, Repair: Self-Improving Agentic Framework for Visual Detection and Linguistic Reasoning in Medical Imaging

Md. Faiyaz Abdullah Sayeedi, Rashedur Rahman, Siam Tahsin Bhuiyan, Sefatul Wasi, Ashraful Islam, Saadia Binte Alam, AKM Mahbubur Rahman

🧩 TL;DR

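This paper proposes R⁴, a multi-agent framework for medical image analysis. Four coordinated agents (Route, Retrieve, Reflect, Repair) turn strong but brittle vision-language models into more reliable and better grounded tools for clinical image interpretation, improving performance without gradient-based fine-tuning.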


📘 Detailed Summary

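Motivation: Medical image analysis increasingly relies on large vision-language models, but most systems remain single-pass black boxes that offer limited control over reasoning, safety, and spatial grounding, with no effective mechanism for handling clinical error modes.

Method: R⁴ decomposes the medical imaging workflow into four coordinated agents. The Router configures task- and specialization-aware prompts from the image, patient history, and metadata. The Retriever uses exemplar memory and pass@k sampling to jointly generate free-text reports and bounding boxes. The Reflector critiques each draft-box pair for six key clinical error modes. The Repairer iteratively revises both narrative and spatial outputs under targeted constraints while curating high-quality exemplars for future cases.

Result: On chest X-ray analysis with multiple modern VLM backbones, evaluated on report generation and weakly supervised detection, R⁴ consistently raises LLM-as-a-Judge scores by roughly 1.7 to 2.5 points and mAP50 by 2.5 to 3.5 absolute points over strong single-VLM baselines, without any gradient-based fine-tuning.

Conclusion: Agentic routing, reflection, and repair can turn strong but brittle vision-language models into more reliable and better grounded tools for clinical image interpretation, giving medical imaging a more controllable framework with better performance.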


📄 Abstract

Medical image analysis increasingly relies on large vision-language models (VLMs), yet most systems remain single-pass black boxes that offer limited control over reasoning, safety, and spatial grounding. We propose R^4, an agentic framework that decomposes medical imaging workflows into four coordinated agents: a Router that configures task- and specialization-aware prompts from the image, patient history, and metadata; a Retriever that uses exemplar memory and pass@k sampling to jointly generate free-text reports and bounding boxes; a Reflector that critiques each draft-box pair for key clinical error modes (negation, laterality, unsupported claims, contradictions, missing findings, and localization errors); and a Repairer that iteratively revises both narrative and spatial outputs under targeted constraints while curating high-quality exemplars for future cases. Instantiated on chest X-ray analysis with multiple modern VLM backbones and evaluated on report generation and weakly supervised detection, R^4 consistently boosts LLM-as-a-Judge scores by roughly +1.7-+2.5 points and mAP50 by +2.5-+3.5 absolute points over strong single-VLM baselines, without any gradient-based fine-tuning. These results show that agentic routing, reflection, and repair can turn strong but brittle VLMs into more reliable and better grounded tools for clinical image interpretation. Our code can be found at: https://github.com/faiyazabdullah/MultimodalMedAgent

[16] Unified Multi-Site Multi-Sequence Brain MRI Harmonization Enriched by Biomedical Semantic Style

Mengqi Wu, Yongheng Sun, Qianqian Wang, Pew-Thian Yap, Mingxia Liu

🧩 TL;DR

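This paper proposes MMH, a unified framework for multi-site multi-sequence brain MRI harmonization. It uses biomedical semantic priors for sequence-aware style alignment and a two-stage diffusion model to harmonize images while preserving anatomy, without paired data.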


📘 Detailed Summary

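Motivation: Aggregating multi-site brain MRI data can improve deep learning model training, but it introduces non-biological heterogeneity from site-specific variations (e.g., scanner vendors, acquisition parameters, and imaging protocols) that undermines generalizability. Existing retrospective harmonization methods often rely on limited paired traveling-subject data or fail to disentangle style from anatomy, and most handle only a single sequence, which limits their use in real-world settings where multi-sequence MRI is routinely acquired.

Method: MMH works in two stages: a diffusion-based global harmonizer maps MR images to a sequence-specific unified domain using style-agnostic gradient conditioning, and a target-specific fine-tuner adapts the globally aligned images to the desired target domain. A tri-planar attention BiomedCLIP encoder aggregates multi-view embeddings to characterize volumetric style, which allows explicit disentanglement of image style from anatomy without paired data.

Result: On 4,163 T1- and T2-weighted MRIs, MMH outperforms state-of-the-art methods in image feature clustering, voxel-level comparison, tissue segmentation, and downstream age and site classification.

Conclusion: The study shows that biomedical semantic priors are effective for sequence-aware style alignment and offers a unified solution for multi-site multi-sequence brain MRI harmonization. By explicitly disentangling style from anatomy, MMH achieves high-quality harmonization without paired data and suggests a new direction for domain adaptation and generalization in medical image analysis.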


📄 Abstract

Aggregating multi-site brain MRI data can enhance deep learning model training, but also introduces non-biological heterogeneity caused by site-specific variations (e.g., differences in scanner vendors, acquisition parameters, and imaging protocols) that can undermine generalizability. Recent retrospective MRI harmonization seeks to reduce such site effects by standardizing image style (e.g., intensity, contrast, noise patterns) while preserving anatomical content. However, existing methods often rely on limited paired traveling-subject data or fail to effectively disentangle style from anatomy. Furthermore, most current approaches address only single-sequence harmonization, restricting their use in real-world settings where multi-sequence MRI is routinely acquired. To this end, we introduce MMH, a unified framework for multi-site multi-sequence brain MRI harmonization that leverages biomedical semantic priors for sequence-aware style alignment. MMH operates in two stages: (1) a diffusion-based global harmonizer that maps MR images to a sequence-specific unified domain using style-agnostic gradient conditioning, and (2) a target-specific fine-tuner that adapts globally aligned images to desired target domains. A tri-planar attention BiomedCLIP encoder aggregates multi-view embeddings to characterize volumetric style information, allowing explicit disentanglement of image styles from anatomy without requiring paired data. Evaluations on 4,163 T1- and T2-weighted MRIs demonstrate MMH's superior harmonization over state-of-the-art methods in image feature clustering, voxel-level comparison, tissue segmentation, and downstream age and site classification.

[17] Knowledge-based learning in Text-RAG and Image-RAG

Alexander Shim, Khalil Saieh, Samuel Clarke

🧩 TL;DR

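This study compares a multimodal approach that pairs an EVA-ViT image encoder with a LLaMA or ChatGPT LLM, aiming to reduce hallucination and improve disease detection on chest X-rays. Text-based RAG lowers the hallucination rate, while image-based RAG with KNN improves prediction confidence and calibration.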


📘 Detailed Summary

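Motivation: The study addresses hallucination in multimodal medical image analysis, specifically chest X-ray disease detection. It explores how to combine a Vision Transformer image encoder with a large language model to improve diagnostic accuracy and reliability while coping with data imbalance and a complex multi-stage structure.

Method: An EVA-ViT image encoder is combined with a LLaMA or ChatGPT LLM and trained on the NIH Chest X-ray dataset. Three configurations are compared: image-based RAG (using KNN), text-based RAG (using external knowledge), and a baseline, to assess how each strategy affects hallucination and disease detection performance.

Result: Text-based RAG uses external knowledge to reduce hallucination effectively, while image-based RAG with KNN improves prediction confidence and calibration. The GPT LLM outperforms the LLaMA-based model in performance, hallucination rate, and Expected Calibration Error (ECE).

Conclusion: The study highlights the challenges of data imbalance and structural complexity in multimodal medical image analysis and shows that RAG with external knowledge and a GPT LLM reduce hallucination and improve calibration. It suggests that a large-scale experimental environment and balanced use-case examples are needed to improve the system further.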


📄 Abstract

This research analyzed and compared a multi-modal approach that pairs a Vision Transformer (EVA-ViT) based image encoder with a LLaMA or ChatGPT LLM to reduce the hallucination problem and detect diseases in chest X-ray images. We used the NIH Chest X-ray dataset to train the model and compared image-based RAG, text-based RAG, and a baseline [3] [5]. As a result, text-based RAG [2] effectively reduces the hallucination problem by using external knowledge, and image-based RAG improves prediction confidence and calibration by using the KNN method [4]. Moreover, the GPT LLM showed better performance, a lower hallucination rate, and better Expected Calibration Error (ECE) than the LLaMA-based model. This research shows the challenges of data imbalance and a complex multi-stage structure, and suggests a large experimental environment and balanced use-case examples.
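
The calibration metric named in this abstract, Expected Calibration Error, can be sketched as follows. This is a minimal illustration of the standard equal-width-bin definition, not the authors' evaluation code; the function name, the default of 10 bins, and the toy inputs are assumptions made here.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: sample-weighted mean of |accuracy - mean confidence| over confidence bins.

    confidences: predicted probabilities in [0, 1] for the chosen label.
    correct: booleans, True where the prediction was right.
    n_bins: number of equal-width bins (10 is a common choice, assumed here).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Toy example (made-up numbers): two confident correct predictions,
# and two less confident predictions of which one is wrong.
ece = expected_calibration_error([0.95, 0.95, 0.65, 0.65], [True, True, True, False])
```

A lower value means that the stated confidence tracks the observed accuracy more closely, which is the sense in which the abstract reports "better" ECE.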

[18] Improving Zero-shot ADL Recognition with Large Language Models through Event-based Context and Confidence

Michele Fiori, Gabriele Civitarese, Marco Colussi, Claudio Bettini

🧩 TL;DR

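This paper proposes a zero-shot ADL recognition method based on event-based segmentation and confidence estimation. Replacing time-based segmentation with event-based segmentation and adding a prediction confidence estimate significantly improves LLM performance on smart-home activity recognition.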


📘 Detailed Summary

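Motivation: Existing LLM-based zero-shot ADL recognition methods rely on time-based segmentation, which is poorly aligned with the contextual reasoning capabilities of LLMs, and they lack a way to estimate prediction confidence. Both gaps limit effectiveness and reliability in complex, realistic scenarios.

Method: The method replaces time-based segmentation with event-based segmentation, so that segment boundaries align with the natural boundaries of activity events, and introduces a novel prediction confidence estimate that separates correct from incorrect predictions.

Result: Event-based segmentation consistently outperforms time-based LLM approaches on complex, realistic datasets and even surpasses supervised data-driven methods, including with relatively small LLMs (e.g., Gemma 3 27B). The proposed confidence measure effectively distinguishes correct from incorrect predictions.

Conclusion: Event-based segmentation makes better use of the contextual reasoning of LLMs and significantly improves zero-shot ADL recognition, while confidence estimation makes the model more reliable in practice, yielding a more effective zero-shot solution for smart-home activity recognition.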


📄 Abstract

Unobtrusive sensor-based recognition of Activities of Daily Living (ADLs) in smart homes by processing data collected from IoT sensing devices supports applications such as healthcare, safety, and energy management. Recent zero-shot methods based on Large Language Models (LLMs) have the advantage of removing the reliance on labeled ADL sensor data. However, existing approaches rely on time-based segmentation, which is poorly aligned with the contextual reasoning capabilities of LLMs. Moreover, existing approaches lack methods for estimating prediction confidence. This paper proposes to improve zero-shot ADL recognition with event-based segmentation and a novel method for estimating prediction confidence. Our experimental evaluation shows that event-based segmentation consistently outperforms time-based LLM approaches on complex, realistic datasets and surpasses supervised data-driven methods, even with relatively small LLMs (e.g., Gemma 3 27B). The proposed confidence measure effectively distinguishes correct from incorrect predictions.

[19] KidVis: Do Multimodal Large Language Models Possess the Visual Perceptual Capabilities of a 6-Year-Old?

Xianfeng Wang, Kaiwei Zhang, Qi Jia, Zijian Chen, Guangtao Zhai, Xiongkuo Min

🧩 TL;DR

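This study introduces KidVis, a benchmark grounded in human visual development theory that evaluates the basic visual capabilities of multimodal large language models (MLLMs). State-of-the-art MLLMs show marked deficits in atomic visual abilities that children have already mastered, and scaling parameters does not improve these abilities linearly.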


📘 Detailed Summary

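Motivation: MLLMs perform well on high-level reasoning tasks, but it remains unclear whether they possess fundamental visual primitives comparable to human intuition. The study asks whether MLLMs have the basic visual abilities already mastered by 6-7 year old children, as a way to assess the physiological foundation of generalized visual intelligence.

Method: KidVis, grounded in the theory of human visual development, deconstructs visual intelligence into six atomic capabilities: Concentration, Tracking, Discrimination, Memory, Spatial, and Closure. These comprise 10 categories of low-semantic-dependent visual tasks, used to evaluate 20 state-of-the-art MLLMs against a human physiological baseline.

Result: The evaluation reveals a stark gap: human children reach a near-perfect average score of 95.32, while the state-of-the-art GPT-5 attains only 67.33. The study observes a "Scaling Law Paradox": simply increasing model parameters does not yield linear improvements in these foundational visual capabilities. All 20 MLLMs fall short on atomic visual tasks that children have already mastered.

Conclusion: Current MLLMs, despite their high-level reasoning ability, lack the basic physiological perceptual primitives required for generalized visual intelligence. This challenges the assumption that parameter scaling alone leads to comprehensive visual intelligence and suggests that new architectures or training paradigms are needed.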


📄 Abstract

While Multimodal Large Language Models (MLLMs) have demonstrated impressive proficiency in high-level reasoning tasks, such as complex diagrammatic interpretation, it remains an open question whether they possess the fundamental visual primitives comparable to human intuition. To investigate this, we introduce KidVis, a novel benchmark grounded in the theory of human visual development. KidVis deconstructs visual intelligence into six atomic capabilities - Concentration, Tracking, Discrimination, Memory, Spatial, and Closure - already possessed by 6-7 year old children, comprising 10 categories of low-semantic-dependent visual tasks. Evaluating 20 state-of-the-art MLLMs against a human physiological baseline reveals a stark performance disparity. Results indicate that while human children achieve a near-perfect average score of 95.32, the state-of-the-art GPT-5 attains only 67.33. Crucially, we observe a "Scaling Law Paradox": simply increasing model parameters fails to yield linear improvements in these foundational visual capabilities. This study confirms that current MLLMs, despite their reasoning prowess, lack the essential physiological perceptual primitives required for generalized visual intelligence.

[20] Enhancing Image Quality Assessment Ability of LMMs via Retrieval-Augmented Generation

Kang Fu, Huiyu Duan, Zicheng Zhang, Yucheng Zhu, Jun Zhao, Xiongkuo Min, Jia Wang, Guangtao Zhai

🧩 TL;DR

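This paper proposes IQARAG, a novel training-free framework that uses retrieval-augmented generation to improve large multimodal models on image quality assessment, offering a resource-efficient alternative to fine-tuning.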


📘 Detailed Summary

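Motivation: Large multimodal models (LMMs) show strong zero-shot capability on image quality assessment (IQA), but reaching state-of-the-art performance usually requires computationally expensive fine-tuning that aligns the output distribution of quality-related tokens with image quality levels. A more efficient alternative is needed.

Method: IQARAG is a retrieval-augmented generation framework with three key phases: Retrieval Feature Extraction, Image Retrieval, and Integration & Quality Score Generation. It retrieves reference images that are semantically similar to the input but vary in quality, together with their Mean Opinion Scores (MOSs), and integrates them with the input image into a specific prompt, giving the LMM a visual perception anchor.

Result: Extensive experiments on diverse IQA datasets, including KADID, KonIQ, LIVE Challenge, and SPAQ, show that IQARAG effectively improves the IQA performance of LMMs.

Conclusion: A training-free retrieval-augmented generation framework can improve the IQA ability of LMMs and provides a practical, efficient alternative to computationally intensive fine-tuning.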


📄 Abstract

Large Multimodal Models (LMMs) have recently shown remarkable promise in low-level visual perception tasks, particularly in Image Quality Assessment (IQA), demonstrating strong zero-shot capability. However, achieving state-of-the-art performance often requires computationally expensive fine-tuning methods, which aim to align the distribution of quality-related tokens in the output with image quality levels. Inspired by recent training-free works for LMMs, we introduce IQARAG, a novel, training-free framework that enhances LMMs' IQA ability. IQARAG leverages Retrieval-Augmented Generation (RAG) to retrieve semantically similar but quality-variant reference images, with their corresponding Mean Opinion Scores (MOSs), for the input image. The retrieved images and the input image are integrated into a specific prompt, so that the retrieved images provide the LMM with a visual perception anchor for the IQA task. IQARAG contains three key phases: Retrieval Feature Extraction, Image Retrieval, and Integration & Quality Score Generation. Extensive experiments across multiple diverse IQA datasets, including KADID, KonIQ, LIVE Challenge, and SPAQ, demonstrate that the proposed IQARAG effectively boosts the IQA performance of LMMs, offering a resource-efficient alternative to fine-tuning for quality assessment.
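
The retrieve-then-prompt flow described above can be sketched roughly as below. This is only an illustration under assumptions made here: features are plain lists of floats, retrieval is top-k cosine similarity, and the prompt is text only. The function names, the prompt wording, and the toy database are hypothetical and not taken from the paper, which also passes the retrieved images themselves to the LMM.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_references(query_feature, database, k=3):
    """Return (image_id, mos) for the k database images most similar to the query.

    database: list of (image_id, feature, mos) tuples (hypothetical layout).
    """
    ranked = sorted(database, key=lambda item: cosine(query_feature, item[1]), reverse=True)
    return [(image_id, mos) for image_id, _, mos in ranked[:k]]


def build_prompt(references, query_id):
    """Assemble a text prompt that anchors the query on the reference scores."""
    lines = [f"Reference image {image_id}: MOS = {mos:.2f}" for image_id, mos in references]
    lines.append(f"Rate the quality of image {query_id} on the same scale.")
    return "\n".join(lines)


# Toy database with made-up features and scores.
database = [
    ("a", [1.0, 0.0], 4.2),
    ("b", [0.0, 1.0], 1.5),
    ("c", [0.9, 0.1], 2.8),
]
references = retrieve_references([1.0, 0.0], database, k=2)
prompt = build_prompt(references, "query")
```

In this toy case the two references share content with the query but carry different scores, which is the "semantically similar but quality-variant" property the abstract relies on.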

[21] UM-Text: A Unified Multimodal Model for Image Understanding

Lichen Ma, Xiaolong Fu, Gaojing Zhou, Zipeng Guo, Ting Zhu, Yichun Liu, Yu Shi, Jason Li, Junshi Huang

🧩 TL;DR

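This paper proposes UM-Text, a unified multimodal model that performs context understanding and visual text editing from natural language instructions. It addresses style consistency in visual text generation and achieves state-of-the-art performance on multiple benchmarks.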


📘 Detailed Summary

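Motivation: Existing visual text editing methods usually require complex steps to specify text content and attributes (such as font size, color, and layout) and do not adequately consider stylistic consistency with the reference image, which limits instruction-driven visual text generation.

Method: UM-Text introduces a Visual Language Model (VLM) that processes the instruction and reference image to design the text content and layout. A UM-Encoder combines the embeddings of various condition information, with the combination configured automatically. A regional consistency loss supervises glyph generation in both latent and RGB space, and a three-stage training strategy further improves performance. The paper also contributes UM-DATA-200K, a dataset of 200,000 visual text images across diverse scenes.

Result: Extensive qualitative and quantitative results on multiple public benchmarks show state-of-the-art performance, with generated visual text that is accurate and harmonious with the reference image.

Conclusion: A unified multimodal framework addresses style consistency in visual text editing. The regional consistency loss and three-stage training strategy give more effective supervision for glyph generation, and the large-scale dataset is a useful resource for further work on instruction-driven visual text generation.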


📄 Abstract

With the rapid advancement of image generation, visual text editing using natural language instructions has received increasing attention. The main challenge of this task is to fully understand the instruction and reference image, and thus generate visual text that is style-consistent with the image. Previous methods often involve complex steps of specifying the text content and attributes, such as font size, color, and layout, without considering the stylistic consistency with the reference image. To address this, we propose UM-Text, a unified multimodal model for context understanding and visual text editing by natural language instructions. Specifically, we introduce a Visual Language Model (VLM) to process the instruction and reference image, so that the text content and layout can be elaborately designed according to the context information. To generate an accurate and harmonious visual text image, we further propose the UM-Encoder to combine the embeddings of various condition information, where the combination is automatically configured by VLM according to the input instruction. During training, we propose a regional consistency loss to offer more effective supervision for glyph generation on both latent and RGB space, and design a tailored three-stage training strategy to further enhance model performance. In addition, we contribute the UM-DATA-200K, a large-scale visual text image dataset on diverse scenes for model training. Extensive qualitative and quantitative results on multiple public benchmarks demonstrate that our method achieves state-of-the-art performance.

[22] CoMa: Contextual Massing Generation with Vision-Language Models

Evgenii Maslov, Valentin Khrulkov, Anastasia Volkova, Anton Gusarov, Andrey Kuznetsov, Ivan Oseledets

🧩 TL;DR

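This study proposes an automated framework for generating building massing and introduces the CoMa-20K dataset to address data scarcity in data-driven architectural design. Formulating massing generation as a conditional task for vision-language models demonstrates the potential of data-driven methods in conceptual design.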


📘 Detailed Summary

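Motivation: The conceptual design phase in architecture and urban planning, particularly building massing, is complex and relies heavily on designer intuition and manual effort. The main obstacle to data-driven methods has been the lack of suitable datasets, which has limited automated design frameworks.

Method: The study proposes an automated framework that generates building massing from functional requirements and site context, and introduces the CoMa-20K dataset, which includes detailed massing geometries, associated economic and programmatic data, and visual representations of the development site within its existing urban context. The dataset is benchmarked by formulating massing generation as a conditional task for vision-language models (VLMs), evaluating both fine-tuned and large zero-shot models.

Result: The experiments reveal the inherent complexity of the task while demonstrating the potential of VLMs to produce context-sensitive massing options. The dataset and analysis establish a foundational benchmark and highlight significant opportunities for future research.

Conclusion: The study provides a benchmark and resource for data-driven architectural design, shows that VLMs can be applied to complex design tasks, and underlines how challenging massing generation remains, pointing to future research in automated architectural design and urban planning.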


📄 Abstract

The conceptual design phase in architecture and urban planning, particularly building massing, is complex and heavily reliant on designer intuition and manual effort. To address this, we propose an automated framework for generating building massing based on functional requirements and site context. A primary obstacle to such data-driven methods has been the lack of suitable datasets. Consequently, we introduce the CoMa-20K dataset, a comprehensive collection that includes detailed massing geometries, associated economical and programmatic data, and visual representations of the development site within its existing urban context. We benchmark this dataset by formulating massing generation as a conditional task for Vision-Language Models (VLMs), evaluating both fine-tuned and large zero-shot models. Our experiments reveal the inherent complexity of the task while demonstrating the potential of VLMs to produce context-sensitive massing options. The dataset and analysis establish a foundational benchmark and highlight significant opportunities for future research in data-driven architectural design.

[23] Tissue Classification and Whole-Slide Images Analysis via Modeling of the Tumor Microenvironment and Biological Pathways

Junzhuo Liu, Xuemei Du, Daniel Reisenbuchler, Ye Chen, Markus Eckstein, Christian Matek, Friedrich Feuerhake, Dorit Merhof

🧩 TL;DR

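This paper proposes BioMorphNet, a multimodal network that automatically integrates tissue morphological features and spatial gene expression to support tissue classification and differential gene analysis, with clear classification gains on several cancer datasets.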


📘 Detailed Summary

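Motivation: Existing studies focus mainly on individual gene sequences and slide-level classification, with limited attention to spatial transcriptomics and patch-level applications. This limits the potential of integrating whole slide images (WSIs) with gene expression profiles for precision clinical diagnosis and cancer progression studies.

Method: BioMorphNet builds a graph to model the relationships between target patches and their neighbors and adjusts the response strength based on morphological and molecular-level similarity, to better characterize the tumor microenvironment. It derives clinical pathway features from spatial transcriptomic data as a bridge between tissue morphology and gene expression, and adds a learnable pathway module that automatically simulates the biological pathway formation process, complementing existing clinical pathways.

Result: Compared with the latest morphology-gene multimodal methods, BioMorphNet improves average classification metrics by 2.67%, 5.48%, and 6.29% on prostate, colorectal, and breast cancer datasets, respectively. It classifies tissue categories within WSIs accurately enough to support tumor localization and analyzes differential gene expression between tissue categories based on prediction confidence.

Conclusion: The study offers a framework for integrating tissue morphology with spatial gene expression that improves cancer tissue classification and supports differential gene analysis and the discovery of potential tumor biomarkers.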


📄 Abstract

Automatic integration of whole slide images (WSIs) and gene expression profiles has demonstrated substantial potential in precision clinical diagnosis and cancer progression studies. However, most existing studies focus on individual gene sequences and slide level classification tasks, with limited attention to spatial transcriptomics and patch level applications. To address this limitation, we propose a multimodal network, BioMorphNet, which automatically integrates tissue morphological features and spatial gene expression to support tissue classification and differential gene analysis. For considering morphological features, BioMorphNet constructs a graph to model the relationships between target patches and their neighbors, and adjusts the response strength based on morphological and molecular level similarity, to better characterize the tumor microenvironment. In terms of multimodal interactions, BioMorphNet derives clinical pathway features from spatial transcriptomic data based on a predefined pathway database, serving as a bridge between tissue morphology and gene expression. In addition, a novel learnable pathway module is designed to automatically simulate the biological pathway formation process, providing a complementary representation to existing clinical pathways. Compared with the latest morphology gene multimodal methods, BioMorphNet's average classification metrics improve by 2.67%, 5.48%, and 6.29% for prostate cancer, colorectal cancer, and breast cancer datasets, respectively. BioMorphNet not only classifies tissue categories within WSIs accurately to support tumor localization, but also analyzes differential gene expression between tissue categories based on prediction confidence, contributing to the discovery of potential tumor biomarkers.

[24] VideoHEDGE: Entropy-Based Hallucination Detection for Video-VLMs via Semantic Clustering and Spatiotemporal Perturbations

Sushant Gautam, Cise Midoglu, Vajira Thambawita, Michael A. Riegler, Pål Halvorsen

🧩 TL;DR

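This paper proposes VideoHEDGE, a framework for detecting hallucinations in video vision-language models. It extends entropy-based reliability estimation to spatiotemporally structured inputs and introduces the Vision-Amplified Semantic Entropy (VASE) score, which outperforms existing scores on several 7B-parameter Video-VLMs.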


📘 Detailed Summary

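Motivation: Hallucinations in video vision-language models are frequent and high-confidence, while existing uncertainty metrics often fail to align with correctness, a key challenge for video question answering.

Method: VideoHEDGE is modular. It draws a baseline answer and multiple high-temperature samples from the clean clip and from its photometrically and spatiotemporally perturbed variants, clusters the textual outputs into semantic hypotheses using natural language inference (NLI) or embedding-based methods, and computes three reliability scores from cluster-level probability masses: Semantic Entropy (SE), RadFlag, and Vision-Amplified Semantic Entropy (VASE).

Result: On the SoccerChat benchmark with three 7B-parameter Video-VLMs, VASE consistently achieves the highest ROC-AUC, especially at larger distortion budgets, while SE and RadFlag often operate near chance. Embedding-based clustering matches NLI-based clustering in detection performance at substantially lower computational cost.

Conclusion: VASE is an effective score for detecting hallucinations in Video-VLMs, and embedding-based clustering is a computationally cheaper alternative. Domain fine-tuning reduces hallucination frequency but improves calibration only modestly. The released hedge-bench library supports reproducible and extensible benchmarking.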


📄 Abstract

Hallucinations in video-capable vision-language models (Video-VLMs) remain frequent and high-confidence, while existing uncertainty metrics often fail to align with correctness. We introduce VideoHEDGE, a modular framework for hallucination detection in video question answering that extends entropy-based reliability estimation from images to temporally structured inputs. Given a video-question pair, VideoHEDGE draws a baseline answer and multiple high-temperature generations from both clean clips and photometrically and spatiotemporally perturbed variants, then clusters the resulting textual outputs into semantic hypotheses using either Natural Language Inference (NLI)-based or embedding-based methods. Cluster-level probability masses yield three reliability scores: Semantic Entropy (SE), RadFlag, and Vision-Amplified Semantic Entropy (VASE). We evaluate VideoHEDGE on the SoccerChat benchmark using an LLM-as-a-judge to obtain binary hallucination labels. Across three 7B Video-VLMs (Qwen2-VL, Qwen2.5-VL, and a SoccerChat-finetuned model), VASE consistently achieves the highest ROC-AUC, especially at larger distortion budgets, while SE and RadFlag often operate near chance. We further show that embedding-based clustering matches NLI-based clustering in detection performance at substantially lower computational cost, and that domain fine-tuning reduces hallucination frequency but yields only modest improvements in calibration. The hedge-bench PyPI library enables reproducible and extensible benchmarking, with full code and experimental resources available at https://github.com/Simula/HEDGE#videohedge .
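
As an illustration of the first of these scores, the sketch below computes Semantic Entropy from cluster assignments. It assumes that each sampled answer has already been assigned a cluster id by some NLI-based or embedding-based step, and that a cluster's probability mass is estimated as the fraction of samples it contains. Both are simplifying assumptions made here; this is not the hedge-bench API, and RadFlag and VASE are not shown.

```python
import math
from collections import Counter


def semantic_entropy(cluster_ids):
    """Entropy (in nats) of the empirical distribution over semantic clusters.

    cluster_ids: one cluster label per sampled answer (hypothetical input format).
    Returns 0.0 when every sample lands in the same cluster.
    """
    if not cluster_ids:
        raise ValueError("need at least one sampled answer")
    n = len(cluster_ids)
    counts = Counter(cluster_ids)
    entropy = 0.0
    for count in counts.values():
        mass = count / n  # count-based estimate of the cluster's probability mass
        entropy -= mass * math.log(mass)
    return entropy


# Samples that all agree give zero entropy; an even split over two
# semantic hypotheses gives log(2).
agree = semantic_entropy(["goal", "goal", "goal", "goal"])
split = semantic_entropy(["goal", "goal", "offside", "offside"])
```

Higher entropy means the sampled answers scatter over more semantic hypotheses, which such detectors treat as a sign of a possible hallucination.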

[25] Semantic Misalignment in Vision-Language Models under Perceptual Degradation

Guo Cheng

🧩 TL;DR

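This study systematically analyzes semantic alignment in vision-language models under perceptual degradation, reveals a disconnect between pixel-level robustness and multimodal semantic reliability, and proposes a set of language-level misalignment metrics to quantify VLM failures in safety-critical applications.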


📘 Detailed Summary

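Motivation: Although VLMs perform strongly on multimodal benchmarks, their robustness to realistic perception degradation is poorly understood. In safety-critical systems such as autonomous driving and embodied AI, perception uncertainty can cause serious failures in semantic reasoning and decision-making.

Method: Using semantic segmentation on Cityscapes as a representative perception module, the study introduces perception-realistic corruptions that cause only moderate drops in conventional segmentation metrics yet trigger severe failures in downstream VLM behavior. It proposes language-level misalignment metrics that quantify hallucination, critical omission, and safety misinterpretation.

Result: Corruptions that cause only moderate drops in segmentation metrics lead to severe downstream VLM failures, including hallucinated object mentions, omission of safety-critical entities, and inconsistent safety judgments. The disconnect between pixel-level robustness and multimodal semantic reliability holds across multiple contrastive and generative VLMs.

Conclusion: The study exposes a critical limitation of current VLM-based systems in safety-critical applications and argues that evaluation frameworks must explicitly account for perception uncertainty.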


📄 Abstract

Vision-Language Models (VLMs) are increasingly deployed in autonomous driving and embodied AI systems, where reliable perception is critical for safe semantic reasoning and decision-making. While recent VLMs demonstrate strong performance on multimodal benchmarks, their robustness to realistic perception degradation remains poorly understood. In this work, we systematically study semantic misalignment in VLMs under controlled degradation of upstream visual perception, using semantic segmentation on the Cityscapes dataset as a representative perception module. We introduce perception-realistic corruptions that induce only moderate drops in conventional segmentation metrics, yet observe severe failures in downstream VLM behavior, including hallucinated object mentions, omission of safety-critical entities, and inconsistent safety judgments. To quantify these effects, we propose a set of language-level misalignment metrics that capture hallucination, critical omission, and safety misinterpretation, and analyze their relationship with segmentation quality across multiple contrastive and generative VLMs. Our results reveal a clear disconnect between pixel-level robustness and multimodal semantic reliability, highlighting a critical limitation of current VLM-based systems and motivating the need for evaluation frameworks that explicitly account for perception uncertainty in safety-critical applications.

[26] SafeRedir: Prompt Embedding Redirection for Robust Unlearning in Image Generation Models

Renyang Liu, Kangjie Chen, Han Qiu, Jie Zhang, Kwok-Yan Lam, Tianwei Zhang, See-Kiong Ng

🧩 TL;DR

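SafeRedir is a lightweight inference-time framework that achieves robust unlearning through prompt embedding redirection. It removes harmful concepts without modifying the underlying image generation model while preserving the quality and semantic integrity of benign generations.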


📘 Detailed Summary

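Motivation: Image generation models tend to memorize undesirable concepts from their training data, leading to unsafe content and copyrighted artistic styles. Existing unlearning methods require costly retraining, degrade benign generation quality, or fail to withstand prompt paraphrasing and adversarial attacks.

Method: SafeRedir has two core components: a latent-aware multi-modal safety classifier that identifies unsafe generation trajectories, and a token-level delta generator for precise semantic redirection, equipped with auxiliary predictors for token masking and adaptive scaling that localize and regulate the intervention.

Result: Across multiple representative unlearning tasks, SafeRedir achieves effective unlearning, high semantic and perceptual preservation, robust image quality, and stronger resistance to adversarial attacks, and it generalizes across a variety of diffusion backbones and existing unlearned models.

Conclusion: The study provides a lightweight inference-time unlearning solution that leaves the underlying model unmodified, is plug-and-play, and is broadly applicable, offering a route to safer deployment of image generation models while preserving generation quality and semantic integrity.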


📄 Abstract

Image generation models (IGMs), while capable of producing impressive and creative content, often memorize a wide range of undesirable concepts from their training data, leading to the reproduction of unsafe content such as NSFW imagery and copyrighted artistic styles. Such behaviors pose persistent safety and compliance risks in real-world deployments and cannot be reliably mitigated by post-hoc filtering, owing to the limited robustness of such mechanisms and a lack of fine-grained semantic control. Recent unlearning methods seek to erase harmful concepts at the model level, which exhibit the limitations of requiring costly retraining, degrading the quality of benign generations, or failing to withstand prompt paraphrasing and adversarial attacks. To address these challenges, we introduce SafeRedir, a lightweight inference-time framework for robust unlearning via prompt embedding redirection. Without modifying the underlying IGMs, SafeRedir adaptively routes unsafe prompts toward safe semantic regions through token-level interventions in the embedding space. The framework comprises two core components: a latent-aware multi-modal safety classifier for identifying unsafe generation trajectories, and a token-level delta generator for precise semantic redirection, equipped with auxiliary predictors for token masking and adaptive scaling to localize and regulate the intervention. Empirical results across multiple representative unlearning tasks demonstrate that SafeRedir achieves effective unlearning capability, high semantic and perceptual preservation, robust image quality, and enhanced resistance to adversarial attacks. Furthermore, SafeRedir generalizes effectively across a variety of diffusion backbones and existing unlearned models, validating its plug-and-play compatibility and broad applicability. Code and data are available at https://github.com/ryliu68/SafeRedir.

[27] Edge-Optimized Multimodal Learning for UAV Video Understanding via BLIP-2

Yizhan Feng, Hichem Snoussi, Jing Teng, Jian Liu, Yuyang Wang, Abel Cherouat, Tian Wang

🧩 TL;DR

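This paper proposes a lightweight multimodal task platform based on BLIP-2 that integrates YOLO-World and YOLOv8-Seg, with a content-aware key frame sampling mechanism and a unified prompt optimization scheme. It addresses the conflict between the high computational cost of large vision-language models and the limited computing resources of UAV edge devices.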


📘 Detailed Summary

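Motivation: UAVs need real-time visual understanding and interaction in complex scenarios, but the high computational cost of large vision-language models conflicts with the limited computing resources of UAV edge devices, which hinders practical deployment.

Method: The method has three parts. First, BLIP-2 is deeply integrated with YOLO-World and YOLOv8-Seg so that it can use YOLO's precise perception results to strengthen visual-attention understanding. Second, a content-aware key frame sampling mechanism based on K-Means clustering combines intelligent frame selection with temporal feature concatenation. Third, a unified prompt optimization scheme for multi-task adaptation injects structured event logs from the YOLO models into BLIP-2's input as context, together with output constraints that filter out technical details.

Result: Without task-specific fine-tuning on drone data, the method extends BLIP-2's multi-task capabilities, lets it handle video-level interactive tasks, and produces accurate, contextually relevant outputs while reducing computational demands and maintaining multimodal understanding performance.

Conclusion: The study offers a practical lightweight multimodal task platform for UAV edge computing. The combined design of model integration, intelligent frame sampling, and prompt optimization retains the capabilities of an advanced vision-language model under limited resources.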


📄 Abstract

The demand for real-time visual understanding and interaction in complex scenarios is increasingly critical for unmanned aerial vehicles. However, a significant challenge arises from the contradiction between the high computational cost of large Vision language models and the limited computing resources available on UAV edge devices. To address this challenge, this paper proposes a lightweight multimodal task platform based on BLIP-2, integrated with YOLO-World and YOLOv8-Seg models. This integration extends the multi-task capabilities of BLIP-2 for UAV applications with minimal adaptation and without requiring task-specific fine-tuning on drone data. Firstly, the deep integration of BLIP-2 with YOLO models enables it to leverage the precise perceptual results of YOLO for fundamental tasks like object detection and instance segmentation, thereby facilitating deeper visual-attention understanding and reasoning. Secondly, a content-aware key frame sampling mechanism based on K-Means clustering is designed, which incorporates intelligent frame selection and temporal feature concatenation. This equips the lightweight BLIP-2 architecture with the capability to handle video-level interactive tasks effectively. Thirdly, a unified prompt optimization scheme for multi-task adaptation is implemented. This scheme strategically injects structured event logs from the YOLO models as contextual information into BLIP-2's input. Combined with output constraints designed to filter out technical details, this approach effectively guides the model to generate accurate and contextually relevant outputs for various tasks.
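
The K-Means key frame sampling step can be illustrated with the following sketch. It is an assumption-laden toy rather than the authors' pipeline: frames are represented by small feature vectors, centroids are initialized from evenly spaced frames to keep the example deterministic, and one frame per cluster (the one nearest the centroid) is kept. The function names and the toy features are hypothetical.

```python
def _squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_keyframes(features, k, n_iter=20):
    """Pick up to k key frame indices by clustering per-frame features.

    features: list of per-frame feature vectors (lists of floats), in temporal order.
    Returns the selected frame indices sorted in temporal order.
    """
    n = len(features)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of frames")
    dim = len(features[0])
    # Deterministic initialization from evenly spaced frames (an assumption).
    centroids = [list(features[(i * n) // k]) for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(n_iter):
        # Assignment step: attach every frame to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for index, feature in enumerate(features):
            nearest = min(range(k), key=lambda c: _squared_distance(feature, centroids[c]))
            clusters[nearest].append(index)
        # Update step: move each centroid to the mean of its members.
        updated = []
        for j, members in enumerate(clusters):
            if members:
                updated.append(
                    [sum(features[i][d] for i in members) / len(members) for d in range(dim)]
                )
            else:
                updated.append(centroids[j])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    # Keep the frame closest to each centroid as that cluster's key frame.
    selected = set()
    for j, members in enumerate(clusters):
        if members:
            selected.add(min(members, key=lambda i: _squared_distance(features[i], centroids[j])))
    return sorted(selected)


# Toy clip with two visually distinct scenes, described by 1-D features.
frames = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
keyframes = kmeans_keyframes(frames, k=2)
```

Returning the indices in temporal order keeps the selected frames usable for the temporal feature concatenation mentioned in the abstract.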

[28] UR-Bench: A Benchmark for Multi-Hop Reasoning over Ultra-High-Resolution Images

Siqi Li, Xinyu Cai, Jianbiao Mei, Nianchen Deng, Pinlong Cai, Licheng Wen, Yufan Shen, Xuemeng Yang, Botian Shi, Yong Liu

🧩 TL;DR

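This paper introduces UR-Bench, a benchmark for evaluating the reasoning of multimodal large language models on ultra-high-resolution images, and develops an agent-based framework that invokes external visual tools to process gigapixel-scale images efficiently.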


📘 Detailed Summary

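Motivation: Multimodal large language models (MLLMs) show strong visual-language reasoning, but their performance on ultra-high-resolution imagery remains largely unexplored. Existing visual question answering benchmarks typically use medium-resolution data with limited visual complexity and cannot assess reasoning under extreme visual information.

Method: UR-Bench comprises two major categories, Humanistic Scenes and Natural Scenes, covering four subsets of ultra-high-resolution images with distinct spatial structures and data sources, at resolutions from hundreds of megapixels to gigapixels. The study also develops an agent-based framework in which a language model reasons by invoking external visual tools, and introduces Semantic Abstraction and Retrieval tools for efficient processing of ultra-high-resolution images.

Result: The study evaluates state-of-the-art end-to-end MLLMs and the agent-based framework, and the results demonstrate the effectiveness of the proposed framework on ultra-high-resolution reasoning tasks.

Conclusion: UR-Bench fills a gap in evaluating reasoning over ultra-high-resolution images and provides a standardized testbed for MLLMs in complex visual scenes. The agent-based framework shows that modular tool invocation is a feasible way to handle extreme visual information.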


📄 Abstract

Recent multimodal large language models (MLLMs) show strong capabilities in visual-language reasoning, yet their performance on ultra-high-resolution imagery remains largely unexplored. Existing visual question answering (VQA) benchmarks typically rely on medium-resolution data, offering limited visual complexity. To bridge this gap, we introduce the Ultra-high-resolution Reasoning Benchmark (UR-Bench), a benchmark designed to evaluate the reasoning capabilities of MLLMs under extreme visual information. UR-Bench comprises two major categories, Humanistic Scenes and Natural Scenes, covering four subsets of ultra-high-resolution images with distinct spatial structures and data sources. Each subset contains images ranging from hundreds of megapixels to gigapixels, accompanied by questions organized into three levels, enabling evaluation of models' reasoning capabilities in ultra-high-resolution scenarios. We further propose an agent-based framework in which a language model performs reasoning by invoking external visual tools. In addition, we introduce Semantic Abstraction and Retrieval tools that enable more efficient processing of ultra-high-resolution images. We evaluate state-of-the-art models using both end-to-end MLLMs and our agent-based framework, demonstrating the effectiveness of our framework.

[29] MMLGNet: Cross-Modal Alignment of Remote Sensing Data using CLIP

Aditya Chaudhary, Sneha Barman, Mainak Singha, Ankit Jha, Girish Mishra, Biplab Banerjee

🧩 TL;DR

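This paper proposes the Multimodal Language-Guided Network (MMLGNet), a novel multimodal framework that uses vision-language models such as CLIP to align heterogeneous remote sensing modalities, such as hyperspectral imaging and LiDAR, with natural language semantics. Language supervision markedly improves semantic understanding of remote sensing data.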


📘 Detailed Summary

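Motivation: With the growing availability of multimodal Earth observation data, there is a need for methods that fuse spectral, spatial, and geometric information effectively while enabling semantic-level understanding. Existing methods fall short in aligning heterogeneous remote sensing modalities with language semantics.

Method: MMLGNet uses modality-specific encoders and aligns visual features with handcrafted textual embeddings in a shared latent space via bi-directional contrastive learning. Following CLIP's training paradigm, it bridges high-dimensional remote sensing data and language-guided interpretation with simple CNN-based encoders.

Result: On two benchmark datasets, MMLGNet outperforms several established multimodal visual-only methods, demonstrating the significant benefit of language supervision.

Conclusion: Language supervision improves semantic understanding of remote sensing data and offers a new paradigm for interpreting multimodal Earth observation data, showing what simple CNN encoders can achieve with language guidance.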


📄 Abstract

In this paper, we propose a novel multimodal framework, Multimodal Language-Guided Network (MMLGNet), to align heterogeneous remote sensing modalities like Hyperspectral Imaging (HSI) and LiDAR with natural language semantics using vision-language models such as CLIP. With the increasing availability of multimodal Earth observation data, there is a growing need for methods that effectively fuse spectral, spatial, and geometric information while enabling semantic-level understanding. MMLGNet employs modality-specific encoders and aligns visual features with handcrafted textual embeddings in a shared latent space via bi-directional contrastive learning. Inspired by CLIP's training paradigm, our approach bridges the gap between high-dimensional remote sensing data and language-guided interpretation. Notably, MMLGNet achieves strong performance with simple CNN-based encoders, outperforming several established multimodal visual-only methods on two benchmark datasets, demonstrating the significant benefit of language supervision. Codes are available at https://github.com/AdityaChaudhary2913/CLIP_HSI.
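
The bi-directional contrastive objective mentioned above is sketched below in the CLIP style: a symmetric cross-entropy over a temperature-scaled cosine similarity matrix. The sketch assumes one text embedding per visual sample in the batch, with matching pairs on the diagonal, and a temperature of 0.07. These choices, the function names, and the toy embeddings are assumptions made here; MMLGNet's exact loss and batching may differ.

```python
import math


def _normalize(vector):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def _cross_entropy(logits, target):
    """Cross-entropy of one row of logits against a target index (log-sum-exp form)."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]


def symmetric_contrastive_loss(visual, text, temperature=0.07):
    """Average of visual-to-text and text-to-visual cross-entropy losses.

    visual[i] and text[i] are assumed to be a matching pair.
    """
    v = [_normalize(x) for x in visual]
    t = [_normalize(x) for x in text]
    n = len(v)
    # logits[i][j] = cosine similarity of visual i and text j, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    visual_to_text = sum(_cross_entropy(logits[i], i) for i in range(n)) / n
    text_to_visual = sum(
        _cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (visual_to_text + text_to_visual)


# Toy embeddings: aligned pairs give a loss near zero, swapped pairs a large loss.
visual = [[1.0, 0.0], [0.0, 1.0]]
aligned = symmetric_contrastive_loss(visual, [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_contrastive_loss(visual, [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing this loss pulls each visual feature toward its paired text embedding and pushes it away from the other texts in the batch, in both directions.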

[30] S3-CLIP: Video Super Resolution for Person-ReID

Tamas Endrei, Gyorgy Cserey

🧩 TL;DR

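This paper proposes S3-CLIP, a video super-resolution-based CLIP-ReID framework, and presents the first systematic study of how video super-resolution can improve cross-view person re-identification by enhancing tracklet quality, particularly in challenging aerial-to-ground and ground-to-aerial scenarios.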


📘 Detailed Summary

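Motivation: Most person re-identification (ReID) methods treat tracklet quality as an afterthought and focus on architectural modifications to foundational models. This neglects an important limitation when deploying ReID in difficult real-world scenarios, especially under cross-view conditions with low-quality tracklets.

Method: S3-CLIP integrates recent advances in super-resolution networks with task-driven super-resolution pipelines and adapts them to video-based person ReID, using video super-resolution to enhance tracklet quality.

Result: S3-CLIP achieves competitive performance in the VReID-XFD challenge, with 37.52% mAP in aerial-to-ground and 29.16% mAP in ground-to-aerial scenarios. In the ground-to-aerial setting, Rank-1, Rank-5, and Rank-10 accuracy improve by 11.24%, 13.48%, and 17.98%, respectively.

Conclusion: This is the first systematic investigation of video super-resolution as a means of enhancing tracklet quality for person ReID, particularly under cross-view conditions, and it shows the potential of super-resolution for handling low-quality tracklets in difficult real-world scenarios.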


📄 Abstract

Tracklet quality is often treated as an afterthought in most person re-identification (ReID) methods, with the majority of research presenting architectural modifications to foundational models. Such approaches neglect an important limitation, posing challenges when deploying ReID systems in real-world, difficult scenarios. In this paper, we introduce S3-CLIP, a video super-resolution-based CLIP-ReID framework developed for the VReID-XFD challenge at WACV 2026. The proposed method integrates recent advances in super-resolution networks with task-driven super-resolution pipelines, adapting them to the video-based person re-identification setting. To the best of our knowledge, this work represents the first systematic investigation of video super-resolution as a means of enhancing tracklet quality for person ReID, particularly under challenging cross-view conditions. Experimental results demonstrate performance competitive with the baseline, achieving 37.52% mAP in aerial-to-ground and 29.16% mAP in ground-to-aerial scenarios. In the ground-to-aerial setting, S3-CLIP achieves substantial gains in ranking accuracy, improving Rank-1, Rank-5, and Rank-10 performance by 11.24%, 13.48%, and 17.98%, respectively.

[31] Incentivizing Cardiologist-Like Reasoning in MLLMs for Interpretable Echocardiographic Diagnosis

Yi Qin, Lehan Wang, Chenxu Zhao, Alex P. W. Lee, Xiaomeng Li

🧩 TL;DR

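This paper proposes a new approach to strengthen the echocardiographic diagnostic reasoning of multimodal large language models. By introducing the Cardiac Reasoning Template and the reinforcement-learning-based CardiacMind framework, it substantially improves the diagnosis of complex cardiac diseases.


📘 Detailed Summary

Motivation: Existing echocardiography foundation models fail to capture the relationship between quantitative measurements and clinical manifestations, while medical reasoning MLLMs require costly construction of detailed reasoning paths and struggle to incorporate echocardiographic priors directly. Both issues limit their effectiveness in diagnosing complex cardiac diseases.

Method: The approach has two core components. The Cardiac Reasoning Template provides stepwise canonical diagnostic procedures for complex cardiac diseases, which simplifies reasoning path construction. The CardiacMind reinforcement learning framework introduces three new rewards: the Procedural Quantity Reward promotes detailed reasoning, the Procedural Quality Reward promotes integration of evidence across views and modalities, and the Echocardiographic Semantic Reward keeps stepwise descriptions consistent with the visual content.

Result: The method yields a 48% improvement in multiview echocardiographic diagnosis across 15 complex cardiac diseases and a 5% improvement over prior methods on the CardiacNet-PAH dataset. A user study reports 93.33% clinician agreement that its reasoning outputs follow cardiologist-like logic.

Conclusion: By introducing a cardiologist-like mindset, the study improves MLLM reasoning in echocardiographic diagnosis and gives medical AI systems a structured reasoning framework and effective reinforcement learning rewards, with significant clinical value.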


📄 Abstract

Echocardiographic diagnosis is vital for cardiac screening yet remains challenging. Existing echocardiography foundation models do not effectively capture the relationships between quantitative measurements and clinical manifestations, whereas medical reasoning multimodal large language models (MLLMs) require costly construction of detailed reasoning paths and remain ineffective at directly incorporating such echocardiographic priors into their reasoning. To address these limitations, we propose a novel approach comprising the Cardiac Reasoning Template (CRT) and CardiacMind to enhance MLLMs' echocardiographic reasoning by introducing a cardiologist-like mindset. Specifically, CRT provides stepwise canonical diagnostic procedures for complex cardiac diseases to streamline reasoning path construction without the need for costly case-by-case verification. To incentivize MLLM reasoning under CRT, we develop CardiacMind, a new reinforcement learning scheme with three novel rewards: Procedural Quantity Reward (PQtR), Procedural Quality Reward (PQlR), and Echocardiographic Semantic Reward (ESR). PQtR promotes detailed reasoning; PQlR promotes integration of evidence across views and modalities, while ESR grounds stepwise descriptions in visual content. Our methods show a 48% improvement in multiview echocardiographic diagnosis for 15 complex cardiac diseases and a 5% improvement on CardiacNet-PAH over prior methods. The user study on our method's reasoning outputs shows 93.33% clinician agreement with cardiologist-like reasoning logic. Our code will be available.

[32] Reasoning Matters for 3D Visual Grounding

Hsiang-Wei Huang, Kuang-Ming Chen, Wenhao Chai, Cheng-Yen Yang, Jen-Hao Cheng, Jenq-Neng Hwang

🧩 TL;DR

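This paper proposes a data pipeline that automatically synthesizes 3D visual grounding data together with the corresponding reasoning process, and fine-tunes an LLM on that data to obtain Reason3DVG-8B. Using only 1.6% of the training data, the model outperforms the previous method on 3D visual grounding, demonstrating the importance of reasoning for this task.


📘 Detailed Summary

Motivation: 3D visual grounding currently faces two challenges: existing models have limited reasoning ability, and they depend on large amounts of manually annotated 3D data for supervised training. Approaches that scale up synthetic data deliver limited gains that are not proportional to the data collection cost, so a more efficient way to generate data that improves reasoning in 3D visual grounding models is needed.

Method: The study proposes a data pipeline that automatically synthesizes 3D visual grounding data along with the corresponding reasoning process, producing high-quality training data without manual annotation. The generated data is then used to fine-tune a large language model, yielding Reason3DVG-8B, a language model optimized for 3D visual grounding.

Result: Reason3DVG-8B outperforms the previous LLM-based method 3D-GRAND while using only 1.6% of its training data. This result demonstrates both the effectiveness of the proposed data pipeline and the key role of the reasoning process in 3D visual grounding.

Conclusion: Automatically synthesized 3D visual grounding data with reasoning traces can significantly improve model performance while greatly reducing data requirements. The finding offers a new data generation paradigm for 3D visual understanding, underlines the importance of reasoning in cross-modal 3D tasks, and points the way toward more efficient 3D visual grounding systems.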


📄 Abstract

The recent development of Large Language Models (LLMs) with strong reasoning ability has driven research in various domains such as mathematics, coding, and scientific discovery. Meanwhile, 3D visual grounding, as a fundamental task in 3D understanding, still remains challenging due to the limited reasoning ability of recent 3D visual grounding models. Most of the current methods incorporate a text encoder and a visual feature encoder to generate cross-modal fused features and predict the referred object. These models often require supervised training on extensive 3D annotation data. On the other hand, recent research also focuses on scaling synthetic data to train stronger 3D visual grounding LLMs; however, the performance gain remains limited and non-proportional to the data collection cost. In this work, we propose a 3D visual grounding data pipeline, which is capable of automatically synthesizing 3D visual grounding data along with the corresponding reasoning process. Additionally, we leverage the generated data for LLM fine-tuning and introduce Reason3DVG-8B, a strong 3D visual grounding LLM that outperforms the previous LLM-based method 3D-GRAND using only 1.6% of its training data, demonstrating the effectiveness of our data and the importance of reasoning in 3D visual grounding.

[33] Modality-Decoupled RGB-Thermal Object Detector via Query Fusion

Chao Tian, Zikun Zhou, Chao Yang, Guoqing Zhu, Fu'an Zhong, Zhenyu He

🧩 TL;DR

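This paper proposes MDQF, a modality-decoupled RGB-T detection framework that uses a query fusion mechanism to balance modality complementation and separation. It excludes interference from a degraded modality under extreme conditions and allows each branch to be optimized with unpaired data.


📘 Detailed Summary

Motivation: RGB-T detection achieves robust detection by fusing modalities and exploiting cross-modal complementary information, but under extreme conditions a poor-quality modality can disturb detection. Modality separation is then needed to mitigate the noise, and existing methods fall short in balancing complementation against separation.

Method: The proposed MDQF framework uses DETR-like detectors as separate branches for RGB and TIR images. In each refinement stage, high-quality queries from one branch are fed to the other after query selection and adaptation, which realizes query fusion and excludes interference from the degraded modality.

Result: Extensive experiments show that the method outperforms existing RGB-T detectors and achieves better modality independence. The decoupled framework also allows each branch to be optimized separately with unpaired RGB or TIR images, which removes the need for paired RGB-T data.

Conclusion: The study shows the importance of balancing modality complementation and separation under extreme conditions. The proposed query fusion mechanism effectively excludes a degraded modality, and the decoupled framework offers a new way to optimize multimodal systems with unpaired data, improving the robustness and practicality of RGB-T detection.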


📄 Abstract

The advantage of RGB-Thermal (RGB-T) detection lies in its ability to perform modality fusion and integrate cross-modality complementary information, enabling robust detection under diverse illumination and weather conditions. However, under extreme conditions where one modality exhibits poor quality and disturbs detection, modality separation is necessary to mitigate the impact of noise. To address this problem, we propose a Modality-Decoupled RGB-T detection framework with Query Fusion (MDQF) to balance modality complementation and separation. In this framework, DETR-like detectors are employed as separate branches for the RGB and TIR images, with query fusion interspersed between the two branches in each refinement stage. Herein, query fusion is performed by feeding the high-quality queries from one branch to the other one after query selection and adaptation. This design effectively excludes the degraded modality and corrects the predictions using high-quality queries. Moreover, the decoupled framework allows us to optimize each individual branch with unpaired RGB or TIR images, eliminating the need for paired RGB-T data. Extensive experiments demonstrate that our approach delivers superior performance to existing RGB-T detectors and achieves better modality independence.

[34] Motion Attribution for Video Generation

Xindi Wu, Despoina Paschalidou, Jun Gao, Antonio Torralba, Laura Leal-Taixé, Olga Russakovsky, Sanja Fidler, Jonathan Lorraine

🧩 TL;DR

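This paper presents Motive, the first motion attribution method for video generation models. It uses gradient-based attribution to identify the training data that most influences temporal dynamics, and uses these findings to guide data curation that improves video generation quality.


📘 Detailed Summary

Motivation: Despite rapid progress in video generation models, how data influences motion dynamics is poorly understood. Existing work lacks data attribution methods that target motion rather than visual appearance, which limits the ability to improve temporal consistency and physical plausibility through data curation.

Method: Motive is a gradient-based, motion-centric data attribution framework. It separates temporal dynamics from static appearance using motion-weighted loss masks, enabling efficient and scalable computation of motion-specific influence. The framework scales to modern, large, high-quality video datasets and models, and is used to analyze which fine-tuning clips improve or degrade temporal dynamics.

Result: On text-to-video models, Motive identifies training clips that strongly affect motion, and the data curation it guides improves temporal consistency and physical plausibility. Training on Motive-selected high-influence data improves both motion smoothness and dynamic degree on VBench and achieves a 74.1% human preference win rate over the pretrained base model.

Conclusion: The study is the first to attribute motion, rather than visual appearance, to training data in video generative models, providing a systematic framework for understanding how data shapes temporal dynamics. Motive can diagnose motion problems in existing models and guide efficient data curation, opening a new research direction for improving video generation quality.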


📄 Abstract

Despite the rapid progress of video generation models, the role of data in influencing motion is poorly understood. We present Motive (MOTIon attribution for Video gEneration), a motion-centric, gradient-based data attribution framework that scales to modern, large, high-quality video datasets and models. We use this to study which fine-tuning clips improve or degrade temporal dynamics. Motive isolates temporal dynamics from static appearance via motion-weighted loss masks, yielding efficient and scalable motion-specific influence computation. On text-to-video models, Motive identifies clips that strongly affect motion and guides data curation that improves temporal consistency and physical plausibility. With Motive-selected high-influence data, our method improves both motion smoothness and dynamic degree on VBench, achieving a 74.1% human preference win rate compared with the pretrained base model. To our knowledge, this is the first framework to attribute motion rather than visual appearance in video generative models and to use it to curate fine-tuning data.

[35] Zero-Shot Distracted Driver Detection via Vision Language Models with Double Decoupling

Takamichi Miyata, Sumiko Miyata, Andrew Morris

🧩 TL;DR

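This paper proposes a subject decoupling framework that extracts a driver appearance embedding and removes its influence from the image embedding before zero-shot classification. This addresses the performance bottleneck that subject-specific appearance variation causes in VLM-based distracted driver detection.


📘 Detailed Summary

Motivation: Distracted driving is a major cause of traffic collisions and calls for robust, scalable detection methods. Existing VLM-based distracted driver detectors underperform in real-world conditions. The main bottleneck is that subject-specific appearance variations (such as clothing, age, and gender) are entangled with behavior cues, so the model's decisions are driven by who the driver is rather than what the driver is doing.

Method: The subject decoupling framework extracts a driver appearance embedding and removes its influence from the image embedding before zero-shot classification, thereby emphasizing distraction-relevant evidence. Text embeddings are further orthogonalized via metric projection onto the Stiefel manifold, which improves separability while staying close to the original semantics.

Result: Experiments show consistent gains over prior baselines on multiple benchmarks, supporting the effectiveness of subject decoupling and text embedding orthogonalization and indicating promise for practical road-safety applications.

Conclusion: The study identifies subject appearance variation as a key bottleneck for VLMs in distracted driver detection. The proposed decoupling framework improves detection by separating appearance from behavior cues, offers a new technical route for VLM-based zero-shot classification, and has practical value for road safety.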


📄 Abstract

Distracted driving is a major cause of traffic collisions, calling for robust and scalable detection methods. Vision-language models (VLMs) enable strong zero-shot image classification, but existing VLM-based distracted driver detectors often underperform in real-world conditions. We identify subject-specific appearance variations (e.g., clothing, age, and gender) as a key bottleneck: VLMs entangle these factors with behavior cues, leading to decisions driven by who the driver is rather than what the driver is doing. To address this, we propose a subject decoupling framework that extracts a driver appearance embedding and removes its influence from the image embedding prior to zero-shot classification, thereby emphasizing distraction-relevant evidence. We further orthogonalize text embeddings via metric projection onto the Stiefel manifold to improve separability while staying close to the original semantics. Experiments demonstrate consistent gains over prior baselines, indicating the promise of our approach for practical road-safety applications.
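
One simple way to picture "removing the influence" of an appearance embedding is to subtract the component of the image embedding that lies along the appearance direction and then classify by cosine similarity. The abstract does not state the exact removal operator, so the orthogonal projection below is an assumption, and the embeddings, labels, and function names are hypothetical toy values.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def remove_direction(image_emb, appearance_emb):
    """Subtract the component of image_emb along appearance_emb.

    Assumption: decoupling is modeled as an orthogonal projection;
    the paper may use a different operator.
    """
    a = normalize(appearance_emb)
    coeff = dot(image_emb, a)
    return [x - coeff * y for x, y in zip(image_emb, a)]

def zero_shot_classify(image_emb, text_embs):
    """Return the label whose text embedding has the highest cosine similarity."""
    img = normalize(image_emb)
    scores = {label: dot(img, normalize(t)) for label, t in text_embs.items()}
    return max(scores, key=scores.get)

# Toy 3-D embeddings: axis 0 stands in for appearance, axis 2 for the behavior.
appearance = [1.0, 0.0, 0.0]
image = [0.9, 0.1, 0.4]  # dominated by the appearance axis
text_embs = {
    "safe_driving": [0.7, 0.7, 0.0],
    "texting": [0.0, 0.3, 0.95],
}

before = zero_shot_classify(image, text_embs)
decoupled = remove_direction(image, appearance)
after = zero_shot_classify(decoupled, text_embs)
```

In this toy setup the raw embedding is matched to the wrong prompt because the appearance axis dominates, while the decoupled embedding is matched on the behavior axis.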

[36] Cross-modal Proxy Evolving for OOD Detection with Vision-Language Models

Hao Tang, Yu Liu, Shuanglin Yan, Fei Shen, Shengfeng He, Jing Qin

🧩 TL;DR

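This paper proposes CoEvo, a training-free and annotation-free test-time framework that performs zero-shot out-of-distribution detection through bidirectional, sample-conditioned adaptation of textual and visual proxies. Its proxy-aligned co-evolution mechanism dynamically mines contextual textual negatives and iteratively refines visual proxies, substantially improving detection in open-world settings.


📘 Detailed Summary

Motivation: Deploying vision-language models in open-world settings requires reliable zero-shot OOD detection. Existing negative-label methods rely on a fixed set of textual proxies and have two key problems: they sample the OOD semantic space sparsely, and the textual proxies stay static while visual features drift, which causes cross-modal misalignment and unstable predictions. Both limit the effectiveness of zero-shot OOD detection under distribution shift.

Method: CoEvo is a training-free and annotation-free test-time framework. Its proxy-aligned co-evolution mechanism maintains two evolving proxy caches and adapts them bidirectionally, conditioned on each sample. It dynamically mines contextual textual negatives guided by test images and iteratively refines visual proxies, progressively realigning cross-modal similarities and enlarging local OOD margins. Finally, it dynamically re-weights the contributions of the two proxy modalities to obtain a calibrated OOD score that is robust to distribution shift.

Result: Extensive experiments on standard benchmarks show that CoEvo achieves state-of-the-art performance. Compared with strong negative-label baselines, it improves AUROC by 1.33% and reduces FPR95 by 45.98% on ImageNet-1K. The method shows strong zero-shot OOD detection across multiple datasets and clearly outperforms existing methods.

Conclusion: Bidirectional, sample-conditioned proxy adaptation can effectively address cross-modal misalignment in zero-shot OOD detection. Dynamic proxy caches and the co-evolution strategy offer a new approach to reliable deployment of open-world vision-language models, and could be extended to more complex multimodal scenarios and dynamic environments.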


📄 Abstract

Reliable zero-shot detection of out-of-distribution (OOD) inputs is critical for deploying vision-language models in open-world settings. However, the lack of labeled negatives in zero-shot OOD detection necessitates proxy signals that remain effective under distribution shift. Existing negative-label methods rely on a fixed set of textual proxies, which (i) sparsely sample the semantic space beyond in-distribution (ID) classes and (ii) remain static while only visual features drift, leading to cross-modal misalignment and unstable predictions. In this paper, we propose CoEvo, a training- and annotation-free test-time framework that performs bidirectional, sample-conditioned adaptation of both textual and visual proxies. Specifically, CoEvo introduces a proxy-aligned co-evolution mechanism to maintain two evolving proxy caches, which dynamically mines contextual textual negatives guided by test images and iteratively refines visual proxies, progressively realigning cross-modal similarities and enlarging local OOD margins. Finally, we dynamically re-weight the contributions of dual-modal proxies to obtain a calibrated OOD score that is robust to distribution shift. Extensive experiments on standard benchmarks demonstrate that CoEvo achieves state-of-the-art performance, improving AUROC by 1.33% and reducing FPR95 by 45.98% on ImageNet-1K compared to strong negative-label baselines.
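
For readers unfamiliar with the two metrics quoted above, the following sketch shows how AUROC and FPR95 are commonly computed from per-sample scores. It assumes the convention that a higher score means "more in-distribution"; the toy scores and function names are hypothetical and unrelated to the CoEvo experiments.

```python
import math

def auroc(id_scores, ood_scores):
    """Probability that a random ID sample scores above a random OOD sample.

    Ties count as half a win (pairwise form of the ROC area).
    """
    wins = 0.0
    for s in id_scores:
        for o in ood_scores:
            if s > o:
                wins += 1.0
            elif s == o:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))

def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """Fraction of OOD samples accepted when the threshold keeps `tpr` of ID samples."""
    ranked = sorted(id_scores, reverse=True)
    k = math.ceil(tpr * len(ranked))  # number of ID samples that must be accepted
    threshold = ranked[k - 1]
    false_pos = sum(1 for o in ood_scores if o >= threshold)
    return false_pos / len(ood_scores)

# Toy scores: ten ID samples and five OOD samples.
id_scores = [0.9, 0.8, 0.85, 0.7, 0.95, 0.6, 0.88, 0.92, 0.75, 0.99]
ood_scores = [0.1, 0.3, 0.65, 0.2, 0.5]

toy_auroc = auroc(id_scores, ood_scores)
toy_fpr95 = fpr_at_tpr(id_scores, ood_scores)
```

Higher AUROC and lower FPR95 are better, which is why the abstract reports one as an improvement and the other as a reduction.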

[37] End-to-End Video Character Replacement without Structural Guidance

Zhengbo Xu, Jie Ma, Ziheng Wang, Zhan Peng, Jun Liang, Jing Li

🧩 TL;DR

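This paper proposes MoCha, a framework for controllable video character replacement that needs only a single arbitrary frame mask. It removes the dependence of earlier methods on per-frame segmentation masks and explicit structural guidance, and substantially improves generalization and visual quality in complex scenes.


📘 Detailed Summary

Motivation: Controllable video character replacement is hampered by the lack of paired video data. Existing methods mostly follow a reconstruction-based paradigm that requires per-frame segmentation masks and explicit structural guidance (such as skeleton or depth). This severely limits generalization in complex scenarios involving occlusions, character-object interactions, unusual poses, or challenging illumination, and often leads to visual artifacts and temporal inconsistencies.

Method: MoCha requires only a single arbitrary frame mask, with no per-frame segmentation or explicit structural guidance. It introduces a condition-aware RoPE to adapt to the multi-modal input condition and enhance facial identity, and adds an RL-based post-training stage. To address the scarcity of qualified paired training data, it proposes a comprehensive data construction pipeline comprising a high-fidelity rendered dataset built with Unreal Engine 5, an expression-driven dataset synthesized with current portrait animation techniques, and an augmented dataset derived from existing video-mask pairs.

Result: Extensive experiments show that the method substantially outperforms existing state-of-the-art approaches, generalizes better in complex scenes, reduces visual artifacts, and improves temporal consistency, producing higher-quality character replacement.

Conclusion: By simplifying the input requirements and building comprehensive training data, MoCha offers a more practical and robust solution for controllable video character replacement. Its condition-aware mechanism and post-training strategy provide new ideas for related research, and the code release is intended to support further work in this direction.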


📄 Abstract

Controllable video character replacement with a user-provided identity remains a challenging problem due to the lack of paired video data. Prior works have predominantly relied on a reconstruction-based paradigm that requires per-frame segmentation masks and explicit structural guidance (e.g., skeleton, depth). This reliance, however, severely limits their generalizability in complex scenarios involving occlusions, character-object interactions, unusual poses, or challenging illumination, often leading to visual artifacts and temporal inconsistencies. In this paper, we propose MoCha, a pioneering framework that bypasses these limitations by requiring only a single arbitrary frame mask. To effectively adapt the multi-modal input condition and enhance facial identity, we introduce a condition-aware RoPE and employ an RL-based post-training stage. Furthermore, to overcome the scarcity of qualified paired-training data, we propose a comprehensive data construction pipeline. Specifically, we design three specialized datasets: a high-fidelity rendered dataset built with Unreal Engine 5 (UE5), an expression-driven dataset synthesized by current portrait animation techniques, and an augmented dataset derived from existing video-mask pairs. Extensive experiments demonstrate that our method substantially outperforms existing state-of-the-art approaches. We will release the code to facilitate further research. Please refer to our project page for more details: orange-3dv-team.github.io/MoCha

[38] SoC: Semantic Orthogonal Calibration for Test-Time Prompt Tuning

Leo Fillioux, Omprakash Chakraborty, Ismail Ben Ayed, Paul-Henry Cournède, Stergios Christodoulidis, Maria Vakalopoulou, Jose Dolz

🧩 TL;DR

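This paper proposes Semantic Orthogonal Calibration (SoC), a new method for improving uncertainty calibration of vision-language models under test-time prompt tuning. It uses a Huber-based regularizer to enforce smooth prototype separation while preserving semantic proximity, significantly improving calibration while keeping discriminative performance competitive.


📘 Detailed Summary

Motivation: As vision-language models are increasingly adopted in critical decision-making systems such as healthcare and autonomous driving, the calibration of their uncertainty estimates becomes paramount. Yet the test-time prompt tuning literature for VLMs has focused mainly on discriminative performance, leaving calibration largely unexplored. The current state of the art enforces full orthogonality between text prompt embeddings to enhance separability, but theoretical analysis shows that this constraint pushes semantically related classes too far apart and makes the model overconfident.

Method: The paper proposes Semantic Orthogonal Calibration (SoC), a Huber-based regularizer that enforces smooth prototype separation while preserving semantic proximity. The theoretical analysis exposes the limitation of the full orthogonality constraint, namely that its gradients strongly push semantically related classes apart, and motivates a regularization strategy that balances separability against semantic preservation and calibrates better than prior orthogonality-based methods.

Result: A comprehensive empirical validation shows that SoC consistently improves calibration across multiple benchmarks. It does so while maintaining competitive discriminative capability, balancing calibration and accuracy in VLM test-time prompt tuning.

Conclusion: The study exposes the limitation of full orthogonality constraints for VLM calibration and proposes a more effective, semantics-aware calibration method. By balancing prototype separation and semantic preservation, SoC offers a new technical route for VLM uncertainty calibration, is relevant to reliable decision-making in high-stakes applications such as healthcare and autonomous driving, and provides both theoretical guidance and a practical tool for future work.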


📄 Abstract

With the increasing adoption of vision-language models (VLMs) in critical decision-making systems such as healthcare or autonomous driving, the calibration of their uncertainty estimates becomes paramount. Yet, this dimension has been largely underexplored in the VLM test-time prompt-tuning (TPT) literature, which has predominantly focused on improving their discriminative performance. Recent state-of-the-art work advocates for enforcing full orthogonality over pairs of text prompt embeddings to enhance separability, and therefore calibration. Nevertheless, as we theoretically show in this work, the inherent gradients from fully orthogonal constraints will strongly push semantically related classes away, ultimately making the model overconfident. Based on our findings, we propose Semantic Orthogonal Calibration (SoC), a Huber-based regularizer that enforces smooth prototype separation while preserving semantic proximity, thereby improving calibration compared to prior orthogonality-based approaches. Across a comprehensive empirical validation, we demonstrate that SoC consistently improves calibration performance, while also maintaining competitive discriminative capabilities.
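
The contrast between a full-orthogonality penalty and a Huber-style penalty can be seen by comparing their gradients on the cosine similarity of two prototypes. The sketch below is an illustration only: applying the Huber function directly to pairwise cosine similarity, the threshold `delta=0.3`, and the toy prototypes are all assumptions, not the formulation used in the SoC paper.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def squared(s):
    """Full-orthogonality style penalty: drives every similarity toward zero."""
    return 0.5 * s * s

def squared_grad(s):
    return s  # grows with similarity, so related classes are pushed hardest

def huber(s, delta=0.3):
    """Quadratic near zero, linear beyond delta (delta is an assumed value)."""
    a = abs(s)
    if a <= delta:
        return 0.5 * s * s
    return delta * (a - 0.5 * delta)

def huber_grad(s, delta=0.3):
    if abs(s) <= delta:
        return s
    return delta if s > 0 else -delta  # push is capped at delta

def separation_penalty(prototypes, penalty):
    """Sum a penalty over all unordered pairs of class prototypes."""
    names = sorted(prototypes)
    total = 0.0
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            total += penalty(cosine(prototypes[a], prototypes[b]))
    return total

# Toy prototypes: "cat" and "kitten" are semantically close, "truck" is not.
prototypes = {"cat": [1.0, 0.0], "kitten": [0.9, 0.436], "truck": [0.0, 1.0]}
s_related = cosine(prototypes["cat"], prototypes["kitten"])
```

For the closely related pair, the squared penalty pushes with a force proportional to the similarity, whereas the Huber gradient saturates, which is one way to read "smooth prototype separation while preserving semantic proximity".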

[39] CtrlFuse: Mask-Prompt Guided Controllable Infrared and Visible Image Fusion

Yiming Sun, Yuan Ruan, Qinghua Hu, Pengfei Zhu

🧩 TL;DR

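This paper proposes CtrlFuse, a controllable infrared and visible image fusion framework that enables interactive, dynamic fusion guided by mask prompts. It addresses two shortcomings of existing methods: neglect of downstream task adaptability and the inability to interactively handle diverse semantic target perception needs.


📘 Detailed Summary

Motivation: Existing infrared and visible image fusion methods have two main limitations. Some focus on pixel-level fusion and overlook downstream task adaptability; others implicitly learn rigid semantics through cascaded detection or segmentation models and cannot interactively address diverse semantic target perception needs. The study aims to build a controllable fusion framework that adjusts its fusion strategy to the task at hand.

Method: CtrlFuse comprises a multi-modal feature extractor, a reference prompt encoder (RPE), and a prompt-semantic fusion module (PSFM). The RPE dynamically encodes task-specific semantic prompts by fine-tuning pre-trained segmentation models under input mask guidance, and the PSFM explicitly injects these semantics into the fusion features. Synergistic optimization of parallel segmentation and fusion branches lets task performance and fusion quality reinforce each other.

Result: Experiments show state-of-the-art results in both fusion controllability and segmentation accuracy. Notably, the adapted task branch even outperforms the original segmentation model, which supports the framework's effectiveness in improving downstream task performance.

Conclusion: The study shows that task-aware image fusion is feasible through an interactive, dynamic fusion framework, giving intelligent unmanned systems a more flexible and effective solution for environmental perception. The controllable design opens a new direction for multi-modal fusion, allowing the fusion strategy to be tailored to the application.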


📄 Abstract

Infrared and visible image fusion generates all-weather perception-capable images by combining complementary modalities, enhancing environmental awareness for intelligent unmanned systems. Existing methods either focus on pixel-level fusion while overlooking downstream task adaptability or implicitly learn rigid semantics through cascaded detection/segmentation models, unable to interactively address diverse semantic target perception needs. We propose CtrlFuse, a controllable image fusion framework that enables interactive dynamic fusion guided by mask prompts. The model integrates a multi-modal feature extractor, a reference prompt encoder (RPE), and a prompt-semantic fusion module (PSFM). The RPE dynamically encodes task-specific semantic prompts by fine-tuning pre-trained segmentation models with input mask guidance, while the PSFM explicitly injects these semantics into fusion features. Through synergistic optimization of parallel segmentation and fusion branches, our method achieves mutual enhancement between task performance and fusion quality. Experiments demonstrate state-of-the-art results in both fusion controllability and segmentation accuracy, with the adapted task branch even outperforming the original segmentation model.

[40] Near-perfect photo-ID of the Hula painted frog with zero-shot deep local-feature matching

Maayan Yesharim, R. G. Bina Perl, Uri Roll, Sarig Gafny, Eli Geffen, Yoav Ram

🧩 TL;DR

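This study presents a non-invasive photographic re-identification approach for the endangered Hula painted frog. It compares local-feature matching with global-feature embedding models and develops a two-stage workflow that combines the strengths of both, achieving accurate and scalable individual identification.


📘 Detailed Summary

Motivation: Accurate individual identification is essential for monitoring rare amphibians, but invasive marking is often unsuitable for critically endangered species. The study evaluates state-of-the-art computer vision methods to develop non-invasive photographic re-identification for the Hula painted frog in support of conservation monitoring and capture-recapture analyses.

Method: The study compares deep local-feature matching in a zero-shot setting with deep global-feature embedding models, using 1,233 ventral images of 191 individuals collected during 2013-2020. To combine scalability with accuracy, it implements a two-stage workflow: a fine-tuned global-feature model first retrieves a candidate list, which is then re-ranked by local-feature matching.

Result: The local-feature pipeline achieves 98% top-1 closed-set identification accuracy, outperforming all global-feature models; the best global-feature model reaches 60% top-1 (91% top-10) after fine-tuning. The two-stage workflow cuts end-to-end runtime from 6.5-7.8 hours to about 38 minutes while maintaining about 96% top-1 closed-set accuracy. The separation of match scores between same-individual and different-individual pairs supports thresholding for open-set identification.

Conclusion: For this species, zero-shot deep local-feature matching outperforms global-feature embedding and provides a strong default for photo-identification. The two-stage pipeline combines scalability with accuracy and has been deployed as a web application for routine field use, offering a practical solution for non-invasive conservation monitoring.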


📄 Abstract

Accurate individual identification is essential for monitoring rare amphibians, yet invasive marking is often unsuitable for critically endangered species. We evaluate state-of-the-art computer-vision methods for photographic re-identification of the Hula painted frog (Latonia nigriventer) using 1,233 ventral images from 191 individuals collected during 2013-2020 capture-recapture surveys. We compare deep local-feature matching in a zero-shot setting with deep global-feature embedding models. The local-feature pipeline achieves 98% top-1 closed-set identification accuracy, outperforming all global-feature models; fine-tuning improves the best global-feature model to 60% top-1 (91% top-10) but remains below local matching. To combine scalability with accuracy, we implement a two-stage workflow in which a fine-tuned global-feature model retrieves a short candidate list that is re-ranked by local-feature matching, reducing end-to-end runtime from 6.5-7.8 hours to ~38 minutes while maintaining ~96% top-1 closed-set accuracy on the labeled dataset. Separation of match scores between same- and different-individual pairs supports thresholding for open-set identification, enabling practical handling of novel individuals. We deploy this pipeline as a web application for routine field use, providing rapid, standardized, non-invasive identification to support conservation monitoring and capture-recapture analyses. Overall, in this species, zero-shot deep local-feature matching outperformed global-feature embedding and provides a strong default for photo-identification.
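
The two-stage workflow described above (a fast global-embedding shortlist followed by slower local-feature re-ranking) can be sketched as follows. Everything concrete here is hypothetical: the gallery entries, the two-dimensional embeddings, and the use of shared keypoint IDs as a stand-in for a local-feature match score are illustrative assumptions, not the models or matcher used in the paper.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def shortlist(query_emb, gallery, k):
    """Stage 1: keep the k gallery identities closest in global-embedding space."""
    ranked = sorted(gallery,
                    key=lambda name: cosine(query_emb, gallery[name]["emb"]),
                    reverse=True)
    return ranked[:k]

def rerank(query_keypoints, candidates, gallery):
    """Stage 2: re-order only the shortlist by a (more expensive) local match score.

    The score here is the number of shared keypoint IDs, a toy stand-in
    for a real local-feature matcher.
    """
    def local_score(name):
        return len(query_keypoints & gallery[name]["keypoints"])
    return sorted(candidates, key=local_score, reverse=True)

# Hypothetical gallery of three individuals.
gallery = {
    "frog_A": {"emb": [0.8, 0.6], "keypoints": {1, 2, 3, 4, 5}},
    "frog_B": {"emb": [0.9, 0.44], "keypoints": {6, 7, 8}},
    "frog_C": {"emb": [-1.0, 0.0], "keypoints": {1, 9}},
}
query = {"emb": [1.0, 0.4], "keypoints": {1, 2, 3, 4, 10}}

candidates = shortlist(query["emb"], gallery, k=2)
ranked = rerank(query["keypoints"], candidates, gallery)
```

The design choice is that the expensive matcher runs only on the shortlist rather than on the whole gallery, which is the source of the runtime reduction reported in the abstract, at the cost of missing any true match that stage 1 fails to shortlist.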

[41] RAVEN: Erasing Invisible Watermarks via Novel View Synthesis

Fahad Shamshad, Nils Lukas, Karthik Nandakumar

🧩 TL;DR

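This paper exposes a fundamental vulnerability of invisible watermarks. By reformulating watermark removal as a view synthesis problem, it proposes a zero-shot diffusion-based framework that achieves state-of-the-art watermark suppression across 15 watermarking methods while maintaining superior perceptual quality.


📘 Detailed Summary

Motivation: Invisible watermarking has become a critical mechanism for authenticating AI-generated image content, so evaluating how these schemes hold up against sophisticated removal attacks is essential for assessing their reliability and guiding robust design. The study aims to expose a fundamental vulnerability: watermarks that are robust to pixel-space and frequency-domain attacks remain vulnerable to semantic-preserving viewpoint transformations.

Method: The study introduces a zero-shot diffusion-based framework that reformulates watermark removal as view synthesis. The key insight is that generating a perceptually consistent alternative view of the same semantic content, akin to re-observing a scene from a shifted perspective, naturally removes the embedded watermark while preserving visual fidelity. The framework applies controlled geometric transformations in latent space, augmented with view-guided correspondence attention to maintain structural consistency during reconstruction, and operates on frozen pre-trained models without detector access or watermark knowledge.

Result: The method achieves state-of-the-art watermark suppression across 15 watermarking methods, outperforming 14 baseline attacks. It maintains superior perceptual quality across multiple datasets while effectively removing watermarks, supporting the effectiveness of semantic-preserving viewpoint transformations for watermark removal.

Conclusion: The study shows that invisible watermarks remain vulnerable to semantic-preserving viewpoint transformations even when they are robust to conventional pixel-space and frequency-domain attacks. This finding gives important guidance for robust watermark design and highlights the need to account for semantic-level attack vectors in future watermarking techniques.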


📄 Abstract

Invisible watermarking has become a critical mechanism for authenticating AI-generated image content, with major platforms deploying watermarking schemes at scale. However, evaluating the vulnerability of these schemes against sophisticated removal attacks remains essential to assess their reliability and guide robust design. In this work, we expose a fundamental vulnerability in invisible watermarks by reformulating watermark removal as a view synthesis problem. Our key insight is that generating a perceptually consistent alternative view of the same semantic content, akin to re-observing a scene from a shifted perspective, naturally removes the embedded watermark while preserving visual fidelity. This reveals a critical gap: watermarks robust to pixel-space and frequency-domain attacks remain vulnerable to semantic-preserving viewpoint transformations. We introduce a zero-shot diffusion-based framework that applies controlled geometric transformations in latent space, augmented with view-guided correspondence attention to maintain structural consistency during reconstruction. Operating on frozen pre-trained models without detector access or watermark knowledge, our method achieves state-of-the-art watermark suppression across 15 watermarking methods--outperforming 14 baseline attacks while maintaining superior perceptual quality across multiple datasets.

cs.CL [Back]

[42] Cross-Cultural Expert-Level Art Critique Evaluation with Vision-Language Models

Haorui Yu, Ramon Ruiz-Dolz, Xuehang Wen, Fengrui Zhang, Qiufeng Yi

🧩 TL;DR

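This paper proposes a tri-tier evaluation framework for systematically assessing the cultural understanding of vision-language models in cross-cultural art critique. Its calibrated scoring reduces the error against human ratings and reveals the limits of existing automated metrics as measures of cultural depth.


📘 Detailed Summary

Motivation: Vision-language models excel at visual perception, but their ability to interpret cultural meaning in artworks remains under-validated, and existing evaluation methods lack a systematic framework for measuring depth of cultural understanding in cross-cultural art critique.

Method: The study proposes a tri-tier framework. Tier I computes automated coverage and risk indicators offline. Tier II applies rubric-based scoring by a single primary judge across five dimensions. Tier III calibrates the Tier II aggregate score to human ratings via isotonic regression, producing a calibrated cultural-understanding score.

Result: The framework achieves a 5.2% reduction in mean absolute error on a 152-sample held-out set. Evaluating 15 vision-language models on 294 expert anchors spanning six cultural traditions, the study finds that automated metrics are unreliable proxies for cultural depth, that Western samples score higher than non-Western samples, and that cross-judge scale mismatch makes naive score averaging unreliable.

Conclusion: The study argues for dedicated cultural-understanding evaluation frameworks rather than reliance on generic automated metrics, and finds a single primary judge with explicit calibration to be more reliable. The calibrated scores can be used for model selection and cultural-gap diagnosis, providing a systematic methodology for cross-cultural art AI evaluation.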


📄 Abstract

Vision-Language Models (VLMs) excel at visual perception, yet their ability to interpret cultural meaning in art remains under-validated. We present a tri-tier evaluation framework for cross-cultural art-critique assessment: Tier I computes automated coverage and risk indicators offline; Tier II applies rubric-based scoring using a single primary judge across five dimensions; and Tier III calibrates the Tier II aggregate score to human ratings via isotonic regression, yielding a 5.2% reduction in MAE on a 152-sample held-out set. The framework outputs a calibrated cultural-understanding score for model selection and cultural-gap diagnosis, together with dimension-level diagnostics and risk indicators. We evaluate 15 VLMs on 294 expert anchors spanning six cultural traditions. Key findings are that (i) automated metrics are unreliable proxies for cultural depth, (ii) Western samples score higher than non-Western samples under our sampling and rubric, and (iii) cross-judge scale mismatch makes naive score averaging unreliable, motivating a single primary judge with explicit calibration. Dataset and code are available in the supplementary materials.
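
Tier III's calibration step fits a monotone map from judge scores to human ratings. The sketch below implements isotonic regression with the standard pool-adjacent-violators procedure on made-up numbers; the judge and human scores are hypothetical, and the error is measured on the fitting data rather than on a held-out set as in the paper.

```python
def fit_isotonic(x, y):
    """Pool-adjacent-violators: least-squares non-decreasing fit of y against x.

    Returns (sorted_x, fitted_y) aligned position by position.
    """
    pairs = sorted(zip(x, y))
    blocks = []  # each block is [sum_of_y, count]
    for _, yi in pairs:
        blocks.append([yi, 1])
        # Merge while the previous block's mean exceeds the current block's mean.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return [p[0] for p in pairs], fitted

def mean_absolute_error(pred, target):
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

# Hypothetical scores: the judge uses a wider scale than the human raters.
judge = [1.0, 2.0, 3.0, 4.0, 5.0]
human = [1.0, 3.0, 2.0, 4.0, 3.5]

sorted_judge, calibrated = fit_isotonic(judge, human)
mae_raw = mean_absolute_error(judge, human)
mae_calibrated = mean_absolute_error(calibrated, human)
```

Because the fitted map is constrained to be non-decreasing, calibration can rescale and compress the judge's scores but never reverses the judge's ranking of two responses.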

[43] Multilingual, Multimodal Pipeline for Creating Authentic and Structured Fact-Checked Claim Dataset

Z. Melce Hüsünbeyi, Virginie Mouilleron, Leonie Uhling, Daniel Foppe, Tatjana Scheffler, Djamé Seddah

🧩 TL;DR

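This paper proposes a data collection and processing pipeline for building multilingual, multimodal fact-checking datasets. It aggregates ClaimReview feeds, scrapes full debunking articles, normalizes heterogeneous claim verdicts, and uses large language models for evidence extraction and justification generation, laying a foundation for explainable fact-checking models.


📘 Detailed Summary

Motivation: The rapid spread of misinformation across online platforms underscores the urgent need for robust, up-to-date, explainable, and multilingual fact-checking resources. Existing datasets are limited in scope and often lack multimodal evidence, structured annotations, and detailed links between claims, evidence, and verdicts.

Method: The paper presents a comprehensive data collection and processing pipeline that aggregates ClaimReview feeds, scrapes full debunking articles, and normalizes heterogeneous claim verdicts. It uses state-of-the-art large language models and multimodal LLMs for evidence extraction and justification generation, and enriches the data with structured metadata and aligned visual content.

Result: Evaluation with G-Eval and human assessment shows that the pipeline enables fine-grained comparison of fact-checking practices across organizations and media markets, facilitates more interpretable and evidence-grounded fact-checking models, and lays the groundwork for future research on multilingual, multimodal misinformation verification.

Conclusion: The study provides a systematic approach to building comprehensive multilingual, multimodal fact-checking datasets. Structured evidence extraction and justification generation improve explainability, providing an important basis for more reliable fact-checking systems and for comparative research across languages and media markets.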


📄 Abstract

The rapid proliferation of misinformation across online platforms underscores the urgent need for robust, up-to-date, explainable, and multilingual fact-checking resources. However, existing datasets are limited in scope, often lacking multimodal evidence, structured annotations, and detailed links between claims, evidence, and verdicts. This paper introduces a comprehensive data collection and processing pipeline that constructs multimodal fact-checking datasets in French and German languages by aggregating ClaimReview feeds, scraping full debunking articles, normalizing heterogeneous claim verdicts, and enriching them with structured metadata and aligned visual content. We used state-of-the-art large language models (LLMs) and multimodal LLMs for (i) evidence extraction under predefined evidence categories and (ii) justification generation that links evidence to verdicts. Evaluation with G-Eval and human assessment demonstrates that our pipeline enables fine-grained comparison of fact-checking practices across different organizations or media markets, facilitates the development of more interpretable and evidence-grounded fact-checking models, and lays the groundwork for future research on multilingual, multimodal misinformation verification.

[44] VULCA-Bench: A Multicultural Vision-Language Benchmark for Evaluating Cultural Understanding

Haorui Yu, Ramon Ruiz-Dolz, Diji Yang, Hang He, Fengrui Zhang, Qiufeng Yi

🧩 TL;DR

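This paper introduces VULCA-Bench, a multicultural art-critique benchmark for evaluating the cultural understanding of vision-language models beyond surface-level visual perception. It contains 7,410 image-critique pairs spanning eight cultural traditions and uses a five-layer cultural understanding framework for systematic evaluation.


📘 Detailed Summary

Motivation: Existing vision-language model benchmarks mainly measure L1-L2 capabilities (object recognition, scene description, and factual question answering) and neglect higher-order cultural interpretation. As a result, cultural understanding is under-evaluated and deep cognitive performance in cross-cultural settings cannot be measured comprehensively.

Method: The study builds a multicultural art-critique benchmark of 7,410 matched image-critique pairs spanning eight cultural traditions, with Chinese-English bilingual coverage. It adopts a five-layer cultural understanding framework (L1-L5, from Visual Perception to Philosophical Aesthetics), instantiated as 225 culture-specific dimensions and supported by expert-written bilingual critiques.

Result: Pilot results indicate that higher-layer reasoning (L3-L5) is consistently more challenging than visual and technical analysis (L1-L2), confirming the limits of existing vision-language models in deep cultural understanding. The dataset, evaluation scripts, and annotation tools are released under CC BY 4.0 in the supplementary materials.

Conclusion: The study underlines the importance of evaluating cultural understanding in vision-language models. The five-layer framework gives a structured method for systematic evaluation, and the released dataset and tools are intended to advance cross-cultural AI research and move models from surface perception toward deep cultural interpretation.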


📄 Abstract

We introduce VULCA-Bench, a multicultural art-critique benchmark for evaluating Vision-Language Models' (VLMs) cultural understanding beyond surface-level visual perception. Existing VLM benchmarks predominantly measure L1-L2 capabilities (object recognition, scene description, and factual question answering) while under-evaluating higher-order cultural interpretation. VULCA-Bench contains 7,410 matched image-critique pairs spanning eight cultural traditions, with Chinese-English bilingual coverage. We operationalise cultural understanding using a five-layer framework (L1-L5, from Visual Perception to Philosophical Aesthetics), instantiated as 225 culture-specific dimensions and supported by expert-written bilingual critiques. Our pilot results indicate that higher-layer reasoning (L3-L5) is consistently more challenging than visual and technical analysis (L1-L2). The dataset, evaluation scripts, and annotation tools are available under CC BY 4.0 in the supplementary materials.

[45] Generation-Augmented Generation: A Plug-and-Play Framework for Private Knowledge Injection in Large Language Models

Rongji Li, Jian Xu, Xueqing Chen, Yisheng Yang, Jiayi Wang, Xingyu Chen, Chunyu Xie, Dawei Leng, Xu-Yao Zhang

🧩 TL;DR

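This paper proposes Generation-Augmented Generation (GAG), which treats private expertise as an expert modality and aligns it to a frozen base model through a compact, representation-level interface. It addresses two problems in private knowledge injection: the high iteration cost of fine-tuning and the brittleness of RAG on specialized corpora.


📘 Detailed Summary

Motivation: Deploying large language models in high-stakes domains such as biomedicine, materials, and finance requires injecting private, domain-specific knowledge that is proprietary, fast-evolving, and under-represented in public pretraining. The two dominant paradigms both have pronounced drawbacks. Fine-tuning is expensive to iterate, and continual updates risk catastrophic forgetting and regression of general capabilities. Retrieval-augmented generation (RAG) keeps the base model intact but is brittle on specialized private corpora because of chunk-induced evidence fragmentation, retrieval drift, and long-context pressure.

Method: Inspired by how multimodal LLMs align heterogeneous modalities into a shared semantic space, the paper proposes GAG, which treats private expertise as an additional expert modality and aligns it to the frozen base model through a compact, representation-level interface. This avoids serializing evidence into the prompt while enabling plug-and-play specialization and scalable multi-domain composition with reliable selective activation.

Result: On two private scientific QA benchmarks (immunology adjuvant and catalytic materials) and in mixed-domain evaluations, GAG improves specialist performance over strong RAG baselines by 15.34% and 14.86% on the two benchmarks, respectively. It also maintains performance on six open general benchmarks and achieves near-oracle selective activation, supporting scalable multi-domain deployment.

Conclusion: By injecting private knowledge as an expert modality, GAG offers an efficient and reliable way to integrate private knowledge while avoiding the limitations of the traditional approaches. It supports plug-and-play specialization and scalable multi-domain composition, giving high-stakes LLM deployments a new technical route that balances specialist performance against retention of general capabilities.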


📄 Abstract

In domains such as biomedicine, materials, and finance, high-stakes deployment of large language models (LLMs) requires injecting private, domain-specific knowledge that is proprietary, fast-evolving, and under-represented in public pretraining. However, the two dominant paradigms for private knowledge injection each have pronounced drawbacks: fine-tuning is expensive to iterate, and continual updates risk catastrophic forgetting and general-capability regression; retrieval-augmented generation (RAG) keeps the base model intact but is brittle in specialized private corpora due to chunk-induced evidence fragmentation, retrieval drift, and long-context pressure that yields query-dependent prompt inflation. Inspired by how multimodal LLMs align heterogeneous modalities into a shared semantic space, we propose Generation-Augmented Generation (GAG), which treats private expertise as an additional expert modality and injects it via a compact, representation-level interface aligned to the frozen base model, avoiding prompt-time evidence serialization while enabling plug-and-play specialization and scalable multi-domain composition with reliable selective activation. Across two private scientific QA benchmarks (immunology adjuvant and catalytic materials) and mixed-domain evaluations, GAG improves specialist performance over strong RAG baselines by 15.34% and 14.86% on the two benchmarks, respectively, while maintaining performance on six open general benchmarks and enabling near-oracle selective activation for scalable multi-domain deployment.

[46] AgriAgent: Contract-Driven Planning and Capability-Aware Tool Orchestration in Real-World Agriculture

Bo Yang, Yu Zhang, Yunkui Chen, Lanfei Feng, Xiao Xu, Nueraili Aierken, Shijian Li

🧩 TL;DR

本文提出了AgriAgent,一个用于真实农业场景的两级智能体框架,通过分层执行策略处理不同复杂度的任务,在复杂任务上相比现有统一执行范式的工具中心化智能体基线实现了更高的执行成功率和鲁棒性。


📘 Detailed Summary

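Motivation: Agent systems in real-world agricultural scenarios must handle tasks with multimodal inputs that range from lightweight information understanding to complex multi-step execution. Most existing methods rely on a unified execution paradigm, which struggles with the large variation in task complexity and the incomplete tool availability common in agricultural environments.

Method: AgriAgent adopts a hierarchical execution strategy based on task complexity. Simple tasks are handled through direct reasoning by modality-specific agents, while complex tasks trigger a contract-driven planning mechanism that formalizes the task as capability requirements and performs capability-aware tool orchestration and dynamic tool generation, enabling verifiable multi-step execution with failure recovery.

Result: Experiments show that AgriAgent achieves higher execution success rates and robustness on complex tasks than existing tool-centric agent baselines that rely on a unified execution paradigm. All code and data will be released once the paper is accepted, to promote reproducible research.

Conclusion: The study demonstrates the effectiveness of a hierarchical execution strategy for the diverse tasks found in agricultural scenarios. The contract-driven planning mechanism adapts to incomplete tool availability, providing a more flexible and robust solution for real-world agricultural agent systems and moving agricultural AI in a more practical direction.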


📄 Abstract

Intelligent agent systems in real-world agricultural scenarios must handle diverse tasks with multimodal inputs, ranging from lightweight information understanding to complex multi-step execution. However, most existing approaches rely on a unified execution paradigm, which struggles to accommodate the large variations in task complexity and the incomplete tool availability commonly observed in agricultural environments. To address this challenge, we propose AgriAgent, a two-level agent framework for real-world agriculture. AgriAgent adopts a hierarchical execution strategy based on task complexity: simple tasks are handled through direct reasoning by modality-specific agents, while complex tasks trigger a contract-driven planning mechanism that formulates tasks as capability requirements and performs capability-aware tool orchestration and dynamic tool generation, enabling multi-step and verifiable execution with failure recovery. Experimental results show that AgriAgent achieves higher execution success rates and robustness on complex tasks compared to existing tool-centric agent baselines that rely on unified execution paradigms. All code and data will be released after our work is accepted, to promote reproducible research.

[47] Detecting Mental Manipulation in Speech via Synthetic Multi-Speaker Dialogue

Run Chen, Wen Liang, Ziwei Gong, Lin Ai, Julia Hirschberg

🧩 TL;DR

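This paper presents the first study of mental manipulation detection in spoken dialogue, builds the first multi-speaker speech benchmark, SPEECHMENTALMANIP, and shows that model detection performance drops markedly in the speech modality, underscoring the need for modality-aware safety alignment in multimodal dialogue systems.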


📘 Detailed Summary

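Motivation: Prior work on mental manipulation detection has focused only on textual conversations, overlooking how manipulative tactics manifest in speech and leaving a gap in modality coverage. This paper addresses the emerging task of detecting mental manipulation in spoken dialogue and examines how modality affects detection accuracy and perception.

Method: The study builds SPEECHMENTALMANIP, the first synthetic multi-speaker speech benchmark, by augmenting a text dataset with audio rendered through high-quality, voice-consistent text-to-speech. Few-shot large audio-language models and human annotation are used to systematically evaluate the effect of modality on detection performance.

Result: Models show high specificity on speech but markedly lower recall than on text, suggesting sensitivity to acoustic or prosodic cues missing from training. Human annotators show similar uncertainty in the audio setting, highlighting the inherent ambiguity of manipulative speech.

Conclusion: These findings underscore the need for modality-aware evaluation and safety alignment in multimodal dialogue systems. The study exposes the particular challenges of detecting mental manipulation in speech and points to directions for more robust multimodal social reasoning systems.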


📄 Abstract

Mental manipulation, the strategic use of language to covertly influence or exploit others, is a newly emerging task in computational social reasoning. Prior work has focused exclusively on textual conversations, overlooking how manipulative tactics manifest in speech. We present the first study of mental manipulation detection in spoken dialogues, introducing a synthetic multi-speaker benchmark SPEECHMENTALMANIP that augments a text-based dataset with high-quality, voice-consistent Text-to-Speech rendered audio. Using few-shot large audio-language models and human annotation, we evaluate how modality affects detection accuracy and perception. Our results reveal that models exhibit high specificity but markedly lower recall on speech compared to text, suggesting sensitivity to missing acoustic or prosodic cues in training. Human raters show similar uncertainty in the audio setting, underscoring the inherent ambiguity of manipulative speech. Together, these findings highlight the need for modality-aware evaluation and safety alignment in multimodal dialogue systems.
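The "high specificity but markedly lower recall" pattern can be made concrete with a small sketch. The labels below are toy values invented for illustration (1 = manipulative, 0 = not manipulative), not data from the benchmark, and the helper name `recall_and_specificity` is hypothetical.

```python
def recall_and_specificity(y_true, y_pred):
    """Return (recall, specificity) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return recall, specificity


# Toy example: a model that rarely flags manipulation misses most positives
# (low recall) while almost never raising a false alarm (high specificity).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]
recall, specificity = recall_and_specificity(y_true, y_pred)
```

A detector with this profile looks safe if only false alarms are counted, which is why the two metrics are worth reporting separately for each modality.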

[48] GraphSearch: Agentic Search-Augmented Reasoning for Zero-Shot Graph Learning

Jiajin Liu, Yuanfu Sun, Dongzhe Fan, Qiaoyu Tan

🧩 TL;DR

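This paper proposes GraphSearch, the first framework to extend search-augmented reasoning to graph learning, enabling zero-shot graph learning without task-specific fine-tuning. With a graph-aware query planner and retriever, it matches or exceeds supervised methods across diverse graph benchmarks.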


📘 Detailed Summary

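Motivation: Current search-augmented large reasoning models fall short on graph-structured data, which is pervasive in e-commerce, social networks, and scientific citations. Graph structure encodes rich topological signals that can serve as valuable priors for retrieval, but exploiting it poses unique challenges, including the difficulty of generating graph-expressive queries and of achieving reliable retrieval that balances structural and semantic relevance.

Method: GraphSearch comprises a graph-aware query planner, which decouples the search space (e.g., 1-hop, multi-hop, or global neighbors) from the semantic query, and a graph-aware retriever, which builds candidate sets from topology and ranks them with a hybrid scoring function. The framework instantiates two traversal modes: GraphSearch-R recursively expands neighborhoods, while GraphSearch-F flexibly retrieves across local and global neighborhoods without hop constraints.

Result: Extensive experiments on diverse benchmarks show that GraphSearch achieves performance competitive with or superior to supervised graph learning methods on zero-shot node classification and link prediction, setting state-of-the-art results. The framework shows strong generalization and efficiency across multiple graph learning tasks.

Conclusion: As a flexible and generalizable paradigm, GraphSearch offers a new route to agentic reasoning over graphs and demonstrates the effectiveness of search-augmented reasoning on graph-structured data. The work opens new possibilities for large-scale graph learning without task-specific training and offers insights into combining graphs with language models.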


📄 Abstract

Recent advances in search-augmented large reasoning models (LRMs) enable the retrieval of external knowledge to reduce hallucinations in multistep reasoning. However, their ability to operate on graph-structured data, prevalent in domains such as e-commerce, social networks, and scientific citations, remains underexplored. Unlike plain text corpora, graphs encode rich topological signals that connect related entities and can serve as valuable priors for retrieval, enabling more targeted search and improved reasoning efficiency. Yet, effectively leveraging such structure poses unique challenges, including the difficulty of generating graph-expressive queries and ensuring reliable retrieval that balances structural and semantic relevance. To address this gap, we introduce GraphSearch, the first framework that extends search-augmented reasoning to graph learning, enabling zero-shot graph learning without task-specific fine-tuning. GraphSearch combines a Graph-aware Query Planner, which disentangles search space (e.g., 1-hop, multi-hop, or global neighbors) from semantic queries, with a Graph-aware Retriever, which constructs candidate sets based on topology and ranks them using a hybrid scoring function. We further instantiate two traversal modes: GraphSearch-R, which recursively expands neighborhoods hop by hop, and GraphSearch-F, which flexibly retrieves across local and global neighborhoods without hop constraints. Extensive experiments across diverse benchmarks show that GraphSearch achieves competitive or even superior performance compared to supervised graph learning methods, setting state-of-the-art results in zero-shot node classification and link prediction. These findings position GraphSearch as a flexible and generalizable paradigm for agentic reasoning over graphs.
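The idea of building candidates from topology and ranking them with a hybrid score can be sketched as follows. The abstract does not specify the scoring function, so everything here is an assumption for illustration: the structural term is a Jaccard overlap of neighbor sets, the semantic term is cosine similarity, and `alpha`, `k_hop_candidates`, and `rank_candidates` are hypothetical names rather than the GraphSearch API.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def k_hop_candidates(adj, anchor, hops):
    """Collect all nodes within `hops` edges of `anchor`, excluding the anchor."""
    frontier, seen = {anchor}, {anchor}
    for _ in range(hops):
        frontier = {n for node in frontier for n in adj.get(node, ())} - seen
        seen |= frontier
    return seen - {anchor}


def rank_candidates(adj, emb, anchor, query_vec, hops=1, alpha=0.5):
    """Rank topology-derived candidates by an assumed semantic/structural mix."""
    anchor_nbrs = set(adj.get(anchor, ()))
    scored = []
    for cand in k_hop_candidates(adj, anchor, hops):
        nbrs = set(adj.get(cand, ()))
        union = anchor_nbrs | nbrs
        structural = len(anchor_nbrs & nbrs) / len(union) if union else 0.0
        semantic = cosine(query_vec, emb[cand])
        scored.append((alpha * semantic + (1 - alpha) * structural, cand))
    # Highest score first; ties broken by node id for determinism.
    return [cand for _, cand in sorted(scored, key=lambda t: (-t[0], t[1]))]


# Toy graph and embeddings, invented for illustration.
adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
emb = {"b": [1.0, 0.0], "c": [0.0, 1.0], "d": [1.0, 1.0]}
ranked = rank_candidates(adj, emb, "a", [1.0, 0.0], hops=1, alpha=1.0)
```

Keeping `hops` separate from `query_vec` mirrors the planner's separation of search space from semantic query: the same query can be rerun over a wider neighborhood without changing the ranking logic.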

[49] How Order-Sensitive Are LLMs? OrderProbe for Deterministic Structural Reconstruction

Yingjie He, Zhaolu Kang, Kehan Jiang, Qianyuan Zhang, Jiachen Qian, Chunlei Meng, Yujie Feng, Yuan Wang, Jiabao Dou, Aming Wu, Leqi Zheng, Pengxiang Zhao, Jiaxin Liu, Zeyu Zhang, Lei Wang, Guansu Wang, Qishi Zhan, Xiaomin He, Meisheng Zhang, Jianyuan Ni

🧩 TL;DR

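This study proposes the OrderProbe benchmark and a diagnostic framework for evaluating large language models on structural reconstruction. It shows that even frontier models struggle to recover the canonical order of fixed four-character expressions in the zero-shot setting, and that semantic competence and structural planning are systematically dissociated.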


📘 Detailed Summary

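Motivation: Large language models excel at semantic understanding, but their ability to reconstruct internal structure from scrambled input remains underexplored. Sentence-level restoration is hard to evaluate automatically because multiple valid word orders exist, so a deterministic benchmark supporting exact-match scoring is needed to evaluate structural reconstruction systematically.

Method: The study introduces the OrderProbe benchmark, which evaluates structural reconstruction using fixed four-character expressions in Chinese, Japanese, and Korean. These expressions have a unique canonical order and therefore support exact-match scoring. It also proposes a diagnostic framework that goes beyond recovery accuracy to assess semantic fidelity, logical validity, consistency, robustness sensitivity, and information density.

Result: Experiments on twelve widely used large language models show that structural reconstruction is challenging even for frontier systems: zero-shot recovery accuracy frequently falls below 35%. The study also observes a consistent dissociation between semantic recall and structural planning, indicating that structural robustness is not an automatic byproduct of semantic competence.

Conclusion: The study exposes the limitations of current large language models in structural reasoning and indicates that semantic understanding and structural planning are relatively independent capabilities. OrderProbe provides a reliable tool for systematically evaluating structural reconstruction, and future work will need to target structural robustness directly rather than relying on gains in semantic competence alone.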


📄 Abstract

Large language models (LLMs) excel at semantic understanding, yet their ability to reconstruct internal structure from scrambled inputs remains underexplored. Sentence-level restoration is ill-posed for automated evaluation because multiple valid word orders often exist. We introduce OrderProbe, a deterministic benchmark for structural reconstruction using fixed four-character expressions in Chinese, Japanese, and Korean, which have a unique canonical order and thus support exact-match scoring. We further propose a diagnostic framework that evaluates models beyond recovery accuracy, including semantic fidelity, logical validity, consistency, robustness sensitivity, and information density. Experiments on twelve widely used LLMs show that structural reconstruction remains difficult even for frontier systems: zero-shot recovery frequently falls below 35%. We also observe a consistent dissociation between semantic recall and structural planning, suggesting that structural robustness is not an automatic byproduct of semantic competence.
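Because each expression has a single canonical order, scoring reduces to string equality. The sketch below illustrates that setup under stated assumptions: the two Chinese idioms are common examples chosen here, the "predictions" are made-up model outputs, and `scramble` and `exact_match_accuracy` are hypothetical helpers, not the benchmark's code.

```python
import random


def scramble(expression, rng):
    """Return a character permutation of `expression` that differs from the original."""
    chars = list(expression)
    if len(set(chars)) < 2:
        return expression  # No distinct permutation exists.
    while True:
        shuffled = rng.sample(chars, len(chars))
        if shuffled != chars:
            return "".join(shuffled)


def exact_match_accuracy(predictions, references):
    """Fraction of predictions that equal the canonical reference exactly."""
    if not references:
        return 0.0
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)


rng = random.Random(0)
references = ["一石二鸟", "画蛇添足"]
prompts = [scramble(r, rng) for r in references]  # Scrambled inputs for a model.
predictions = ["一石二鸟", "添足画蛇"]              # Hypothetical model outputs.
accuracy = exact_match_accuracy(predictions, references)
```

The second prediction contains the right characters in a plausible-looking but non-canonical order, which exact match correctly counts as a miss.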

[50] A Parallel Cross-Lingual Benchmark for Multimodal Idiomaticity Understanding

Dilara Torunoğlu-Selamet, Dogukan Arslan, Rodrigo Wilkens, Wei He, Doruk Eryiğit, Thomas Pickard, Adriana S. Pagano, Aline Villavicencio, Gülşen Eryiğit, Ágnes Abuczki, Aida Cardoso, Alesia Lazarenka, Dina Almassova, Amalia Mendes, Anna Kanellopoulou, Antoni Brosa-Rodríguez, Baiba Saulite, Beata Wojtowicz, Bolette Pedersen, Carlos Manuel Hidalgo-Ternero, Chaya Liebeskind, Danka Jokić, Diego Alves, Eleni Triantafyllidi, Erik Velldal, Fred Philippy, Giedre Valunaite Oleskeviciene, Ieva Rizgeliene, Inguna Skadina, Irina Lobzhanidze, Isabell Stinessen Haugen, Jauza Akbar Krito, Jelena M. Marković, Johanna Monti, Josue Alejandro Sauca, Kaja Dobrovoljc, Kingsley O. Ugwuanyi, Laura Rituma, Lilja Øvrelid, Maha Tufail Agro, Manzura Abjalova, Maria Chatzigrigoriou, María del Mar Sánchez Ramos, Marija Pendevska, Masoumeh Seyyedrezaei, Mehrnoush Shamsfard, Momina Ahsan, Muhammad Ahsan Riaz Khan, Nathalie Carmen Hau Norman, Nilay Erdem Ayyıldız, Nina Hosseini-Kivanani, Noémi Ligeti-Nagy, Numaan Naeem, Olha Kanishcheva, Olha Yatsyshyna, Daniil Orel, Petra Giommarelli, Petya Osenova, Radovan Garabik, Regina E. Semou, Rozane Rebechi, Salsabila Zahirah Pranida, Samia Touileb, Sanni Nimb, Sarfraz Ahmad, Sarvinoz Nematkhonova, Shahar Golan, Shaoxiong Ji, Sopuruchi Christian Aboh, Srdjan Sucur, Stella Markantonatou, Sussi Olsen, Vahide Tajalli, Veronika Lipp, Voula Giouli, Yelda Yeşildal Eraydın, Zahra Saaberi, Zhuohan Xie

🧩 TL;DR

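This paper presents XMPIE, a parallel multilingual and multimodal dataset of potentially idiomatic expressions covering 34 languages and more than ten thousand items, for evaluating the linguistic and cultural understanding of NLP systems and supporting research on idiom understanding across languages and modalities.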


📘 Detailed Summary

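Motivation: Potentially idiomatic expressions are closely tied to the everyday experience of a particular language community and pose a challenge for assessing the linguistic and cultural understanding of NLP systems. High-quality datasets that support comparative analysis across languages and modalities are currently lacking.

Method: The study builds XMPIE, a parallel multilingual and multimodal dataset covering 34 languages and more than ten thousand items. Each potentially idiomatic expression is paired with five images spanning a spectrum from idiomatic to literal meaning, including semantically related and random distractors. The data was created by language experts following multilingual guidelines.

Result: XMPIE provides a high-quality multilingual, multimodal benchmark that supports comparative analysis of language-specific realisations and preferences, and allows evaluation of idiom understanding in different languages as well as transfer of understanding across languages and modalities.

Conclusion: The dataset provides a standardized benchmark for evaluating multilingual and multimodal idiom understanding, helps study cultural aspects shared across language communities, and offers an empirical basis for understanding how idiomatic knowledge transfers across languages and modalities.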


📄 Abstract

Potentially idiomatic expressions (PIEs) construe meanings inherently tied to the everyday experience of a given language community. As such, they constitute an interesting challenge for assessing the linguistic (and to some extent cultural) capabilities of NLP systems. In this paper, we present XMPIE, a parallel multilingual and multimodal dataset of potentially idiomatic expressions. The dataset, containing 34 languages and over ten thousand items, allows comparative analyses of idiomatic patterns among language-specific realisations and preferences in order to gather insights about shared cultural aspects. This parallel dataset makes it possible to evaluate model performance for a given PIE in different languages and to test whether idiomatic understanding in one language can be transferred to another. Moreover, the dataset supports the study of PIEs across textual and visual modalities, to measure to what extent PIE understanding in one modality transfers to or implies understanding in another modality (text vs. image). The data was created by language experts, with both textual and visual components crafted under multilingual guidelines, and each PIE is accompanied by five images representing a spectrum from idiomatic to literal meanings, including semantically related and random distractors. The result is a high-quality benchmark for evaluating multilingual and multimodal idiomatic language understanding.

[51] From Rows to Reasoning: A Retrieval-Augmented Multimodal Framework for Spreadsheet Understanding

Anmol Gulati, Sahil Sen, Waqar Sarguroh, Kevin Paul

🧩 TL;DR

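This paper presents FRTR-Bench, the first large-scale benchmark for multimodal spreadsheet reasoning, and the FRTR framework, which substantially improves large language model reasoning over complex enterprise spreadsheets through fine-grained embedding decomposition, hybrid retrieval, and visual integration.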


📘 Detailed Summary

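Motivation: Large language models struggle to reason over enterprise spreadsheets that contain thousands of numeric rows, multiple linked sheets, and embedded visual content such as charts and receipts. Existing methods typically rely on single-sheet compression or full-context encoding, which limits scalability and fails to reflect how users actually interact with complex multimodal workbooks.

Method: The study proposes the FRTR-Bench benchmark, comprising 30 enterprise-grade Excel workbooks with nearly four million cells and more than 50 embedded images. To address these challenges it develops the FRTR framework, which decomposes Excel workbooks into fine-grained row, column, and block embeddings, uses hybrid lexical-dense retrieval based on Reciprocal Rank Fusion, and integrates multimodal embeddings to reason over both numerical and visual information.

Result: FRTR was tested on six large language models. On FRTR-Bench it reaches 74% answer accuracy with Claude Sonnet 4.5, a substantial improvement over the prior state-of-the-art approaches, which reached only 24%. On the SpreadsheetLLM benchmark, FRTR reaches 87% accuracy with GPT-5 while using roughly 50% fewer tokens than context-compression methods.

Conclusion: The study shows that fine-grained embedding decomposition and hybrid retrieval, combined with multimodal integration, can substantially improve large language model reasoning over complex enterprise spreadsheets. FRTR maintains high performance while greatly reducing computational overhead, offering an effective solution for automated spreadsheet analysis in real enterprise applications.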


📄 Abstract

Large Language Models (LLMs) struggle to reason over large-scale enterprise spreadsheets containing thousands of numeric rows, multiple linked sheets, and embedded visual content such as charts and receipts. Prior state-of-the-art spreadsheet reasoning approaches typically rely on single-sheet compression or full-context encoding, which limits scalability and fails to reflect how real users interact with complex, multimodal workbooks. We introduce FRTR-Bench, the first large-scale benchmark for multimodal spreadsheet reasoning, comprising 30 enterprise-grade Excel workbooks spanning nearly four million cells and more than 50 embedded images. To address these challenges, we present From Rows to Reasoning (FRTR), an advanced, multimodal retrieval-augmented generation framework that decomposes Excel workbooks into granular row, column, and block embeddings, employs hybrid lexical-dense retrieval with Reciprocal Rank Fusion (RRF), and integrates multimodal embeddings to reason over both numerical and visual information. We tested FRTR on six LLMs, achieving 74% answer accuracy on FRTR-Bench with Claude Sonnet 4.5, a substantial improvement over prior state-of-the-art approaches that reached only 24%. On the SpreadsheetLLM benchmark, FRTR achieved 87% accuracy with GPT-5 while reducing token usage by roughly 50% compared to context-compression methods.
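Reciprocal Rank Fusion, the step that merges the lexical and dense result lists, is simple enough to sketch directly. This is the generic RRF formula, score(d) = Σ 1 / (k + rank), not the paper's implementation; the constant `k=60` is a commonly used default assumed here, and the chunk ids are invented.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one list, best first.

    Each id receives 1 / (k + rank) from every list it appears in,
    with ranks starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical row/column/block chunk ids returned by two retrievers.
lexical_hits = ["row_12", "col_B", "block_3"]
dense_hits = ["block_3", "row_12", "row_40"]
fused = reciprocal_rank_fusion([lexical_hits, dense_hits])
```

Because RRF uses only ranks, it sidesteps the problem of lexical and dense scores living on incomparable scales, and chunks found by both retrievers rise to the top.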

cs.AI [Back]

[52] Forecast Aware Deep Reinforcement Learning for Efficient Electricity Load Scheduling in Dairy Farms

Nawazish Ali, Rachael Shaw, Karl Mason

🧩 TL;DR

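This study proposes a deep reinforcement learning framework for efficient load scheduling in dairy farms that integrates short-term forecasts and adaptive KL divergence control to minimize cost under dynamic electricity prices and intermittent renewable generation.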


📘 Detailed Summary

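Motivation: Dairy farming is an energy-intensive sector that relies heavily on grid electricity, and the intermittency of renewables makes real-time supply-demand balancing difficult. Existing reinforcement learning scheduling methods usually assume complete knowledge of future prices or generation, which is unrealistic in dynamic environments, and standard PPO variants rely on fixed clipping or KL divergence thresholds, which often leads to unstable training under variable tariffs.

Method: The study proposes a deep reinforcement learning framework focused on scheduling battery storage and water-heating loads. Forecast Aware PPO incorporates short-term forecasts of demand and renewable generation with residual calibration based on hour of day and month, while the PID KL PPO variant uses a proportional-integral-derivative controller to adaptively regulate KL divergence for stable policy updates.

Result: Trained on real-world dairy farm data, the method lowers electricity cost by up to 1% relative to standard PPO, 4.8% relative to DQN, and 1.5% relative to SAC. For battery scheduling, PPO reduces grid imports by 13.1%, demonstrating scalability and effectiveness for sustainable energy management in modern dairy farms.

Conclusion: The study shows that a deep reinforcement learning framework combining forecasts with adaptive KL control can effectively address energy scheduling in dynamic environments, offering a scalable solution for sustainable energy management in energy-intensive agricultural sectors and supporting United Nations Sustainable Development Goal 7.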


📄 Abstract

Dairy farming is an energy-intensive sector that relies heavily on grid electricity. With increasing renewable energy integration, sustainable energy management has become essential for reducing grid dependence and supporting the United Nations Sustainable Development Goal 7 on affordable and clean energy. However, the intermittent nature of renewables poses challenges in balancing supply and demand in real time. Intelligent load scheduling is therefore crucial to minimize operational costs while maintaining reliability. Reinforcement Learning (RL) has shown promise in improving energy efficiency and reducing costs. However, most RL-based scheduling methods assume complete knowledge of future prices or generation, which is unrealistic in dynamic environments. Moreover, standard PPO variants rely on fixed clipping or KL divergence thresholds, often leading to unstable training under variable tariffs. To address these challenges, this study proposes a Deep Reinforcement Learning framework for efficient load scheduling in dairy farms, focusing on battery storage and water heating under realistic operational constraints. The proposed Forecast Aware PPO incorporates short-term forecasts of demand and renewable generation using hour-of-day and month-based residual calibration, while the PID KL PPO variant employs a proportional-integral-derivative controller to adaptively regulate KL divergence for stable policy updates. Trained on real-world dairy farm data, the method achieves up to 1% lower electricity cost than PPO, 4.8% lower than DQN, and 1.5% lower than SAC. For battery scheduling, PPO reduces grid imports by 13.1%, demonstrating scalability and effectiveness for sustainable energy management in modern dairy farming.
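One way a PID controller could regulate KL divergence during PPO training is sketched below. The abstract does not say what quantity the controller adjusts, so this sketch assumes it tunes a KL penalty coefficient `beta` in a KL-penalized objective; the class name, gains, and bounds are all hypothetical and are not taken from the paper.

```python
class PIDKLController:
    """Adjust a KL penalty coefficient so measured KL tracks a target value.

    Assumption: the policy loss includes a term beta * KL(old || new), so a
    larger beta pushes the next update toward a smaller KL divergence.
    """

    def __init__(self, target_kl, kp=10.0, ki=1.0, kd=0.0,
                 beta=1.0, beta_min=1e-4, beta_max=100.0):
        self.target_kl = target_kl
        self.kp, self.ki, self.kd = kp, ki, kd
        self.beta = beta
        self.beta_min, self.beta_max = beta_min, beta_max
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, measured_kl):
        """Take the KL measured after a policy update and return the new beta."""
        error = measured_kl - self.target_kl       # Positive when KL overshoots.
        self.integral += error                     # Accumulated error (I term).
        derivative = error - self.prev_error       # Change in error (D term).
        self.prev_error = error
        self.beta += (self.kp * error
                      + self.ki * self.integral
                      + self.kd * derivative)
        # Clamp so the penalty never vanishes or explodes.
        self.beta = min(max(self.beta, self.beta_min), self.beta_max)
        return self.beta


controller = PIDKLController(target_kl=0.01)
beta_after_overshoot = controller.update(0.05)   # KL too large: penalty rises.
beta_after_undershoot = controller.update(0.0)   # KL too small: penalty falls.
```

Compared with a fixed threshold, the integral term lets the penalty settle at whatever level a given tariff regime requires, which is the kind of adaptivity the abstract motivates.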

[53] ZeroDVFS: Zero-Shot LLM-Guided Core and Frequency Allocation for Embedded Platforms

Mohammad Pivezhandi, Mahdi Banisharif, Abusayeed Saifullah, Ali Jannesari

🧩 TL;DR

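This paper proposes a model-based hierarchical multi-agent reinforcement learning framework for thermal- and energy-aware scheduling on multi-core embedded systems. Combining LLM-based semantic feature extraction with an environment model, it enables zero-shot deployment and fast decisions and substantially improves energy efficiency and scheduling performance.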


📘 Detailed Summary

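Motivation: Existing approaches to dynamic voltage and frequency scaling and task allocation have two main problems: utilization-based heuristics overlook stall times, and profiling-based approaches require extensive offline profiling to generate tables and cannot adapt to runtime changes. These limitations hinder efficient thermal management and energy-performance balancing in dynamic embedded systems.

Method: The paper proposes a model-based hierarchical multi-agent reinforcement learning framework in which two collaborative agents decompose the exponential action space, with a latency of 358 ms for subsequent decisions. The framework uses LLM-based semantic feature extraction to derive 13 code-level features from OpenMP programs without executing them, and uses regression techniques to build an accurate environment model that predicts thermal dynamics and performance states. A Dyna-Q-inspired design combines direct reinforcement learning with model-based planning and achieves zero-shot deployment by generating synthetic training data.

Result: Experiments on the BOTS and PolybenchC benchmarks cover the NVIDIA Jetson TX2, Jetson Orin NX, RubikPi, and Intel Core i7 platforms. Compared with the Linux ondemand governor, energy efficiency improves by 7.09x and makespan by 4.0x. First decisions take 3.5 to 8.0 seconds (including one-time LLM feature extraction), which is 8,300x faster than table-based profiling, and subsequent decisions take 358 ms. Convergence is 20x faster than with model-free methods.

Conclusion: The study demonstrates the practicality of a multi-agent reinforcement learning framework that combines LLM-based semantic feature extraction with an environment model in dynamic embedded systems, achieving zero-shot deployment and fast adaptive scheduling. The approach offers a new solution for thermal- and energy-aware scheduling, particularly for dynamic embedded systems that must adapt to new workloads in real time.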


📄 Abstract

Dynamic voltage and frequency scaling (DVFS) and task-to-core allocation are critical for thermal management and balancing energy and performance in embedded systems. Existing approaches either rely on utilization-based heuristics that overlook stall times, or require extensive offline profiling for table generation, preventing runtime adaptation. We propose a model-based hierarchical multi-agent reinforcement learning (MARL) framework for thermal- and energy-aware scheduling on multi-core platforms. Two collaborative agents decompose the exponential action space, achieving 358ms latency for subsequent decisions. First decisions require 3.5 to 8.0s including one-time LLM feature extraction. An accurate environment model leverages regression techniques to predict thermal dynamics and performance states. When combined with LLM-extracted semantic features, the environment model enables zero-shot deployment for new workloads on trained platforms by generating synthetic training data without requiring workload-specific profiling samples. We introduce LLM-based semantic feature extraction that characterizes OpenMP programs through 13 code-level features without execution. The Dyna-Q-inspired framework integrates direct reinforcement learning with model-based planning, achieving 20x faster convergence than model-free methods. Experiments on BOTS and PolybenchC benchmarks across NVIDIA Jetson TX2, Jetson Orin NX, RubikPi, and Intel Core i7 demonstrate 7.09x better energy efficiency and 4.0x better makespan than Linux ondemand governor. First-decision latency is 8,300x faster than table-based profiling, enabling practical deployment in dynamic embedded systems.
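The Dyna-Q pattern of mixing direct reinforcement learning with model-based planning can be illustrated with a minimal tabular sketch. It is a generic Dyna-Q step, not the paper's framework: a lookup table stands in for the regression-based environment model, and the state names, action names, and hyperparameters are hypothetical.

```python
import random
from collections import defaultdict


def dyna_q_update(q, model, s, a, r, s_next, actions, rng,
                  alpha=0.1, gamma=0.95, planning_steps=5):
    """One Dyna-Q step: learn from a real transition, then replay simulated ones."""

    def td_update(state, action, reward, next_state):
        best_next = max(q[(next_state, b)] for b in actions)
        q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])

    td_update(s, a, r, s_next)        # Direct RL from the real transition.
    model[(s, a)] = (r, s_next)       # Model learning (deterministic table here).
    for _ in range(planning_steps):   # Planning from model-generated experience.
        (ps, pa), (pr, pn) = rng.choice(list(model.items()))
        td_update(ps, pa, pr, pn)


actions = ["low_freq", "high_freq"]   # Hypothetical frequency choices.

# Without planning, a single real transition moves Q by alpha * reward.
q_direct, model_direct = defaultdict(float), {}
dyna_q_update(q_direct, model_direct, "hot", "low_freq", 1.0, "cool",
              actions, random.Random(0), planning_steps=0)

# With planning, the same transition is replayed and Q moves further.
q_planned, model_planned = defaultdict(float), {}
dyna_q_update(q_planned, model_planned, "hot", "low_freq", 1.0, "cool",
              actions, random.Random(0), planning_steps=5)
```

Each replayed transition costs no real execution on the device, which is the intuition behind faster convergence than model-free learning when real samples are expensive.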

[54] The Agent's First Day: Benchmarking Learning, Exploration, and Scheduling in the Workplace Scenarios

Daocheng Fu, Jianbiao Mei, Rong Wu, Xuemeng Yang, Jia Xu, Ding Wang, Pinlong Cai, Yong Liu, Licheng Wen, Botian Shi

🧩 TL;DR

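This paper presents EvoEnv, a dynamic evaluation framework for assessing the robustness of multimodal large language models in dynamic environments, addressing the fact that existing research mainly targets performance upper bounds in static environments and overlooks the stochasticity of real-world deployment.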


📘 Detailed Summary

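Motivation: Existing research on multimodal large language models mainly targets performance upper bounds in static environments and overlooks the stochastic, dynamic challenges of real-world deployment. Three key aspects remain underexplored: dynamic task scheduling, active exploration under uncertainty, and continual learning from experience.

Method: The paper introduces EvoEnv, a dynamic evaluation environment that simulates a "trainee" agent continuously exploring a novel setting. It evaluates agents along three dimensions: context-aware scheduling for streaming tasks, prudent information acquisition that reduces hallucination through active exploration, and continual evolution by distilling general strategies from rule-based, dynamically generated tasks.

Result: Experiments show that state-of-the-art agents have significant deficiencies in dynamic environments, especially in active exploration and continual learning, exposing the gap between existing evaluation methods and real production scenarios.

Conclusion: The study establishes a framework for assessing agent reliability, shifting evaluation from static tests to realistic, production-oriented scenarios and providing a methodological basis for evaluating the robustness of multimodal large language models in real deployments.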


📄 Abstract

The rapid evolution of Multi-modal Large Language Models (MLLMs) has advanced workflow automation; however, existing research mainly targets performance upper bounds in static environments, overlooking robustness for stochastic real-world deployment. We identify three key challenges: dynamic task scheduling, active exploration under uncertainty, and continuous learning from experience. To bridge this gap, we introduce EvoEnv, a dynamic evaluation environment that simulates a "trainee" agent continuously exploring a novel setting. Unlike traditional benchmarks, EvoEnv evaluates agents along three dimensions: (1) context-aware scheduling for streaming tasks with varying priorities; (2) prudent information acquisition to reduce hallucination via active exploration; and (3) continuous evolution by distilling generalized strategies from rule-based, dynamically generated tasks. Experiments show that cutting-edge agents have significant deficiencies in dynamic environments, especially in active exploration and continual learning. Our work establishes a framework for assessing agent reliability, shifting evaluation from static tests to realistic, production-oriented scenarios. Our code is available at https://github.com/KnowledgeXLab/EvoEnv

[55] MPCI-Bench: A Benchmark for Multimodal Pairwise Contextual Integrity Evaluation of Language Model Agents

Shouju Wang, Haopeng Zhang

🧩 TL;DR

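This paper presents MPCI-Bench, the first multimodal pairwise contextual integrity benchmark for evaluating the privacy behavior of agents. Its three-tier evaluation framework reveals systematic failures of existing multimodal models to balance privacy and utility.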


📘 Detailed Summary

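Motivation: As language-model agents evolve from passive chatbots into proactive assistants that handle personal data, evaluating their adherence to social norms is increasingly important. Existing contextual integrity benchmarks, however, focus mainly on text scenarios and negative refusal cases, overlooking multimodal privacy risks and the fundamental trade-off between privacy and utility.

Method: The paper introduces MPCI-Bench, a multimodal pairwise contextual integrity benchmark consisting of paired positive and negative instances derived from the same visual source and instantiated at three tiers: normative Seed judgments, context-rich Story reasoning, and executable agent action Traces. A Tri-Principle Iterative Refinement pipeline ensures data quality.

Result: Evaluation of state-of-the-art multimodal models reveals systematic failures to balance privacy and utility, along with a pronounced modality leakage gap in which sensitive visual information is leaked more often than textual information, indicating serious shortcomings in multimodal privacy protection.

Conclusion: The study underlines the importance of building multimodal agents that balance privacy and utility and identifies the modality leakage gap as a new problem. The planned open-source release of MPCI-Bench will provide a foundation for research on agentic contextual integrity and support more comprehensive privacy evaluation frameworks.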


📄 Abstract

As language-model agents evolve from passive chatbots into proactive assistants that handle personal data, evaluating their adherence to social norms becomes increasingly critical, often through the lens of Contextual Integrity (CI). However, existing CI benchmarks are largely text-centric and primarily emphasize negative refusal scenarios, overlooking multimodal privacy risks and the fundamental trade-off between privacy and utility. In this paper, we introduce MPCI-Bench, the first Multimodal Pairwise Contextual Integrity benchmark for evaluating privacy behavior in agentic settings. MPCI-Bench consists of paired positive and negative instances derived from the same visual source and instantiated across three tiers: normative Seed judgments, context-rich Story reasoning, and executable agent action Traces. Data quality is ensured through a Tri-Principle Iterative Refinement pipeline. Evaluations of state-of-the-art multimodal models reveal systematic failures to balance privacy and utility and a pronounced modality leakage gap, where sensitive visual information is leaked more frequently than textual information. We will open-source MPCI-Bench to facilitate future research on agentic CI.

[56] Creativity in AI as Emergence from Domain-Limited Generative Models

Corina Chutaux

🧩 TL;DR

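This paper proposes a generative perspective on creativity in AI, treating creativity as an emergent property of domain-limited generative models in bounded informational environments rather than as a post hoc evaluative label, and introduces a conceptual decomposition into four interacting components.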


📘 Detailed Summary

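Motivation: Existing research on AI creativity mainly uses evaluative frameworks to measure the novelty, diversity, or usefulness of generated outputs, treating creativity as a property to be assessed rather than a phenomenon to be modeled. This overlooks the structural and contextual conditions under which creativity emerges, and a more fundamental technical framework is needed to study how creativity emerges in AI systems.

Method: The paper proposes a generative framework that treats creativity as an emergent property of domain-limited generative models in bounded informational environments. It introduces a conceptual decomposition into four interacting components: pattern-based generation, induced world models, contextual grounding, and arbitrarity, with particular attention to how these components manifest in multimodal generative systems.

Result: The study establishes a technical framework that treats creativity as a phenomenon emerging from the interaction between generative dynamics and domain-specific representations. The conceptual decomposition shows how creativity manifests in multimodal generative systems and provides a theoretical basis for understanding the increasingly sophisticated pattern recombination of large-scale generative systems, particularly multimodal architectures.

Conclusion: The work provides a technical framework for studying creativity as an emergent phenomenon in AI systems rather than as a post hoc evaluative label. By grounding creativity in the interaction between generative dynamics and domain-specific representations, it opens a new route to understanding the nature and limits of machine creativity and offers a conceptual basis for future research on creative behavior in AI systems.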


📄 Abstract

Creativity in artificial intelligence is most often addressed through evaluative frameworks that aim to measure novelty, diversity, or usefulness in generated outputs. While such approaches have provided valuable insights into the behavior of modern generative models, they largely treat creativity as a property to be assessed rather than as a phenomenon to be explicitly modeled. In parallel, recent advances in large-scale generative systems, particularly multimodal architectures, have demonstrated increasingly sophisticated forms of pattern recombination, raising questions about the nature and limits of machine creativity. This paper proposes a generative perspective on creativity in AI, framing it as an emergent property of domain-limited generative models embedded within bounded informational environments. Rather than introducing new evaluative criteria, we focus on the structural and contextual conditions under which creative behaviors arise. We introduce a conceptual decomposition of creativity into four interacting components (pattern-based generation, induced world models, contextual grounding, and arbitrarity) and examine how these components manifest in multimodal generative systems. By grounding creativity in the interaction between generative dynamics and domain-specific representations, this work aims to provide a technical framework for studying creativity as an emergent phenomenon in AI systems, rather than as a post hoc evaluative label.

[57] An Under-Explored Application for Explainable Multimodal Misogyny Detection in code-mixed Hindi-English

Sargam Yadav, Abhishek Kaushik, Kevin Mc Daid

🧩 TL;DR

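This study presents a multimodal, explainable web application for detecting misogyny in code-mixed Hindi-English text and memes. The system combines state-of-the-art Transformer models with explainability techniques such as SHAP and LIME, aiming to give researchers and content moderators a transparent hate speech detection tool.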


📘 Detailed Summary

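Motivation: The user base of digital platforms keeps growing, which has also fueled the spread of hate speech and misogyny. Existing AI models are underused in low-resource and code-mixed language settings and lack interpretability, which is especially critical in sensitive domains such as hate speech detection.

Method: The system uses Transformer-based multilingual and multimodal architectures. Text detection uses XLM-RoBERTa and mBERT on approximately 4,193 comments, multimodal meme detection combines mBERT with EfficientNet and ResNet on approximately 4,218 memes, and SHAP and LIME are integrated to provide feature importance explanations.

Result: The system was evaluated by human evaluators using the Chatbot Usability Questionnaire and the User Experience Questionnaire to determine overall usability. Specific performance metrics are not detailed in the abstract; the evaluation centers on user experience and practical utility.

Conclusion: The study provides a transparent multimodal solution for misogyny detection in code-mixed settings, advances the use of explainable AI in sensitive domains, helps combat gender-based digital violence and ensure a safe digital space, and offers a practical tool for follow-up research.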


📄 Abstract

Digital platforms have an ever-expanding user base, and act as a hub for communication, business, and connectivity. However, this has also allowed for the spread of hate speech and misogyny. Artificial intelligence models have emerged as an effective solution for countering online hate speech but are underexplored for low-resource and code-mixed languages and suffer from a lack of interpretability. Explainable Artificial Intelligence (XAI) can enhance transparency in the decisions of deep learning models, which is crucial for a sensitive domain such as hate speech detection. In this paper, we present a multimodal and explainable web application for detecting misogyny in text and memes in code-mixed Hindi and English. The system leverages state-of-the-art transformer-based models that support multilingual and multimodal settings. For text-based misogyny identification, the system utilizes XLM-RoBERTa (XLM-R) and multilingual Bidirectional Encoder Representations from Transformers (mBERT) on a dataset of approximately 4,193 comments. For multimodal misogyny identification from memes, the system utilizes mBERT + EfficientNet and mBERT + ResNet, trained on a dataset of approximately 4,218 memes. It also provides feature importance scores using explainability techniques including SHapley Additive exPlanations (SHAP) and Local Interpretable Model-agnostic Explanations (LIME). The application aims to serve as a tool for both researchers and content moderators, to promote further research in the field, combat gender-based digital violence, and ensure a safe digital space. The system has been evaluated by human evaluators, who provided responses to the Chatbot Usability Questionnaire (CUQ) and the User Experience Questionnaire (UEQ) to determine overall usability.

[58] What If TSF: A Benchmark for Reframing Forecasting as Scenario-Guided Multimodal Forecasting

Jinkwan Jang, Hyunbin Jin, Hyungjin Park, Kyubyung Chae, Taesup Kim

🧩 TL;DR

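This study presents What If TSF (WIT), a multimodal time series forecasting benchmark designed to evaluate whether models can condition their forecasts on contextual text, especially future scenarios, filling a gap in existing benchmarks for scenario-guided forecasting.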


📘 Detailed Summary

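Motivation: Most existing time series forecasting methods are unimodal and extrapolate historical patterns, while current multimodal forecasting benchmarks mainly provide retrospective or misaligned raw context and cannot clearly assess whether models truly use textual input. In practice, human experts combine historical evidence with what-if scenarios and produce different forecasts from the same observations under different scenarios.

Method: The study introduces the What If TSF (WIT) multimodal forecasting benchmark, which provides expert-crafted plausible or counterfactual scenarios as a rigorous testbed for scenario-guided multimodal forecasting. The benchmark is designed to evaluate whether models can condition their forecasts on contextual text, especially future scenarios.

Result: WIT provides expert-crafted plausible or counterfactual scenarios and establishes a rigorous evaluation framework for scenario-guided multimodal forecasting. The benchmark is publicly available, giving the research community a standard testbed for assessing whether models truly use textual context.

Conclusion: The study highlights the importance of scenario-guided forecasting and provides a standardized benchmark for evaluating multimodal time series forecasting models. WIT addresses shortcomings of existing evaluation methods and offers a foundation and direction for research on integrating textual scenario information into forecasts.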


📄 Abstract

Time series forecasting is critical to real-world decision making, yet most existing approaches remain unimodal and rely on extrapolating historical patterns. While recent progress in large language models (LLMs) highlights the potential for multimodal forecasting, existing benchmarks largely provide retrospective or misaligned raw context, making it unclear whether such models meaningfully leverage textual inputs. In practice, human experts incorporate what-if scenarios with historical evidence, often producing distinct forecasts from the same observations under different scenarios. Inspired by this, we introduce What If TSF (WIT), a multimodal forecasting benchmark designed to evaluate whether models can condition their forecasts on contextual text, especially future scenarios. By providing expert-crafted plausible or counterfactual scenarios, WIT offers a rigorous testbed for scenario-guided multimodal forecasting. The benchmark is available at https://github.com/jinkwan1115/WhatIfTSF.

[59] Sketch-Based Facade Renovation With Generative AI: A Streamlined Framework for Bypassing As-Built Modelling in Industrial Adaptive Reuse

Warissara Booranamaitree, Xusheng Du, Yushu Cai, Zhengyang Wang, Ye Zhang, Haoran Xie

🧩 TL;DR

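This paper proposes a three-stage framework combining generative AI and vision-language models that directly processes rough structural sketches and textual descriptions to produce consistent facade renovation proposals, bypassing the detailed as-built modelling that conventional workflows require.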


📘 Detailed Summary

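Motivation: Facade renovation is a more sustainable alternative to full demolition, but current workflows typically require detailed as-built modelling before design. This process is time-consuming, labour-intensive, and often involves repeated revisions, which hinders architects from quickly exploring design options and iterating on early concepts.

Method: The framework has three stages. First, a fine-tuned vision-language model takes the input sketch and predicts bounding boxes for the regions to modify and the components to add. Next, a stable diffusion model generates detailed sketches of the new elements, which are merged with the original outline through a generative inpainting pipeline. Finally, ControlNet refines the result into a photorealistic image.

Result: Experiments on datasets and real industrial buildings show that the framework can generate renovation proposals that preserve the original structure while improving facade detail quality, supporting the feasibility of bypassing detailed as-built modelling.

Conclusion: The approach gives architects a tool to rapidly explore design alternatives, iterate on early-stage concepts, and communicate renovation intent more clearly, representing an application of generative AI in the renovation design workflow with clear practical value.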


📄 Abstract

Facade renovation offers a more sustainable alternative to full demolition, yet producing design proposals that preserve existing structures while expressing new intent remains challenging. Current workflows typically require detailed as-built modelling before design, which is time-consuming, labour-intensive, and often involves repeated revisions. To address this issue, we propose a three-stage framework combining generative artificial intelligence (AI) and vision-language models (VLMs) that directly processes rough structural sketches and textual descriptions to produce consistent renovation proposals. First, a fine-tuned VLM uses the input sketch to predict bounding boxes specifying where modifications are needed and which components should be added. Next, a stable diffusion model generates detailed sketches of new elements, which are merged with the original outline through a generative inpainting pipeline. Finally, ControlNet is employed to refine the result into a photorealistic image. Experiments on datasets and real industrial buildings indicate that the proposed framework can generate renovation proposals that preserve the original structure while improving facade detail quality. This approach effectively bypasses the need for detailed as-built modelling, enabling architects to rapidly explore design alternatives, iterate on early-stage concepts, and communicate renovation intentions with greater clarity.

[60] ViDoRe V3: A Comprehensive Evaluation of Retrieval Augmented Generation in Complex Real-World Scenarios

António Loison, Quentin Macé, Antoine Edy, Victor Xing, Tom Balough, Gabriel Moreira, Bo Liu, Manuel Faysse, Céline Hudelot, Gautier Viaud

🧩 TL;DR

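This paper presents ViDoRe v3, a comprehensive multimodal retrieval-augmented generation benchmark that evaluates how well models handle visually rich documents, multi-type queries, and multilingual settings, and reveals the limitations of current RAG systems in visual understanding and fine-grained grounding.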


📘 Detailed Summary

Motivation: 现有检索增强生成基准测试主要关注文本数据、单文档理解或孤立评估检索与生成组件,无法捕捉真实应用中处理视觉元素(表格、图表、图像)、跨文档信息合成和准确来源定位的复杂性,这限制了多模态RAG系统的有效评估与发展。

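Method: The authors built the ViDoRe v3 benchmark, which comprises 10 datasets across professional domains, about 26,000 document pages, and 3,099 human-verified queries, available in 6 languages. Through 12,000 hours of human annotation, it provides high-quality annotations for retrieval relevance, bounding box localization, and verified reference answers. State-of-the-art RAG pipelines are evaluated systematically under different configurations.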

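Result: Visual retrievers outperform textual ones, late-interaction models and textual reranking substantially improve performance, and hybrid or purely visual contexts improve answer generation quality. Current models still struggle with non-textual elements, open-ended queries, and fine-grained visual grounding.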

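Conclusion: The study exposes key challenges for multimodal RAG systems in visual understanding and cross-document reasoning, and provides an evaluation benchmark for future research. The datasets are released under a commercially permissive license, which should encourage progress in the field, particularly in processing visually rich documents and fine-grained information grounding.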


📄 Abstract

Retrieval-Augmented Generation (RAG) pipelines must address challenges beyond simple single-document retrieval, such as interpreting visual elements (tables, charts, images), synthesizing information across documents, and providing accurate source grounding. Existing benchmarks fail to capture this complexity, often focusing on textual data, single-document comprehension, or evaluating retrieval and generation in isolation. We introduce ViDoRe v3, a comprehensive multimodal RAG benchmark featuring multi-type queries over visually rich document corpora. It covers 10 datasets across diverse professional domains, comprising ~26,000 document pages paired with 3,099 human-verified queries, each available in 6 languages. Through 12,000 hours of human annotation effort, we provide high-quality annotations for retrieval relevance, bounding box localization, and verified reference answers. Our evaluation of state-of-the-art RAG pipelines reveals that visual retrievers outperform textual ones, late-interaction models and textual reranking substantially improve performance, and hybrid or purely visual contexts enhance answer generation quality. However, current models still struggle with non-textual elements, open-ended queries, and fine-grained visual grounding. To encourage progress in addressing these challenges, the benchmark is released under a commercially permissive license at https://hf.co/vidore.
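
For readers unfamiliar with the term, late-interaction retrieval scores a query against a document using many embeddings per side rather than a single pooled vector. The sketch below shows one common formulation, ColBERT-style MaxSim, in plain Python. It is illustrative only: the abstract does not say which late-interaction models were evaluated, and the function names and toy vectors are assumptions.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    """MaxSim: for each query embedding take its best-matching document
    embedding, then sum those maxima over all query embeddings."""
    if not doc_vecs:
        return 0.0
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def rank_pages(query_vecs: List[Vector], pages: Dict[str, List[Vector]]) -> List[str]:
    """Return page ids ordered from highest to lowest MaxSim score."""
    return sorted(pages, key=lambda pid: maxsim_score(query_vecs, pages[pid]), reverse=True)


if __name__ == "__main__":
    query = [[1.0, 0.0], [0.0, 1.0]]           # two query token embeddings
    pages = {
        "page_a": [[1.0, 0.0], [0.0, 1.0]],    # matches both query tokens
        "page_b": [[1.0, 0.0], [1.0, 0.0]],    # matches only the first
    }
    print(rank_pages(query, pages))
```

In a visual retriever the document-side vectors would typically be image patch embeddings of a page rather than text token embeddings; the scoring rule itself is unchanged.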

[61] Resisting Manipulative Bots in Memecoin Copy Trading: A Multi-Agent Approach with Chain-of-Thought Reasoning

Yichen Luo, Yebo Feng, Jiahua Xu, Yang Liu

🧩 TL;DR

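This paper proposes an explainable multi-agent system for meme coin copy trading. By decomposing the complex task and coordinating specialized agents, it significantly outperforms traditional machine learning models and single large language models at identifying high-quality meme coin projects and key opinion leader wallets.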


📘 Detailed Summary

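Motivation: Meme coin copy trading faces three challenges: the prevalence of manipulative bots, uncertainty about the future performance of the followed wallets, and lag in trade execution. A single large language model also has limited ability on complex, multi-faceted tasks such as asset allocation, and lacks sufficient domain expertise in cryptocurrency.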

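Method: Inspired by the structure of an asset management team, the authors propose an explainable multi-agent system that decomposes the complex task into subtasks solved collaboratively by specialized agents. Few-shot chain-of-thought prompting lets each agent acquire professional meme coin trading knowledge, interpret multi-modal data, and generate explainable decisions.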

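Result: On a dataset of transaction data from 1,000 meme coin projects, the proposed multi-agent system achieves 73% and 70% precision in identifying high-quality meme coin projects and key opinion leader wallets, respectively. The selected key opinion leaders collectively generated a total profit of USD 500,000 across these projects.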

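Conclusion: The study shows that a multi-agent system can effectively address the limitations of a single large language model on complex financial tasks. Task decomposition and collaboration among specialized agents significantly improve meme coin copy trading performance and offer a new direction for designing explainable financial AI systems.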


📄 Abstract

The launch of \$Trump coin ignited a wave in meme coin investment. Copy trading, as a strategy-agnostic approach that eliminates the need for deep trading knowledge, quickly gained widespread popularity in the meme coin market. However, copy trading is not a guarantee of profitability due to the prevalence of manipulative bots, the uncertainty of the followed wallets' future performance, and the lag in trade execution. Recently, large language models (LLMs) have shown promise in financial applications by effectively understanding multi-modal data and producing explainable decisions. However, a single LLM struggles with complex, multi-faceted tasks such as asset allocation. These challenges are even more pronounced in cryptocurrency markets, where LLMs often lack sufficient domain-specific knowledge in their training data. To address these challenges, we propose an explainable multi-agent system for meme coin copy trading. Inspired by the structure of an asset management team, our system decomposes the complex task into subtasks and coordinates specialized agents to solve them collaboratively. Employing few-shot chain-of-thought (CoT) prompting, each agent acquires professional meme coin trading knowledge, interprets multi-modal data, and generates explainable decisions. Using a dataset of 1,000 meme coin projects' transaction data, our empirical evaluation shows that the proposed multi-agent system outperforms both traditional machine learning models and single LLMs, achieving 73% and 70% precision in identifying high-quality meme coin projects and key opinion leader (KOL) wallets, respectively. The selected KOLs collectively generated a total profit of \$500,000 across these projects.
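
The headline numbers are precision figures, so it may help to recall what the metric measures: of the items the system flags as high quality, the fraction that truly are. The sketch below uses made-up project ids and a hypothetical `precision` helper; it does not reproduce the paper's data or evaluation code.

```python
from typing import Set


def precision(predicted_positive: Set[str], actual_positive: Set[str]) -> float:
    """Precision = true positives / (true positives + false positives).

    Returns 0.0 when nothing is predicted positive (a convention chosen
    here to avoid dividing by zero).
    """
    if not predicted_positive:
        return 0.0
    true_positives = len(predicted_positive & actual_positive)
    return true_positives / len(predicted_positive)


if __name__ == "__main__":
    # Hypothetical ids: the system flags four projects, three are truly high quality.
    flagged = {"p1", "p2", "p3", "p4"}
    high_quality = {"p1", "p2", "p3", "p9"}
    print(precision(flagged, high_quality))
```

Precision says nothing about high-quality projects the system failed to flag (here `p9`); that is what recall measures.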

[62] MEMEWEAVER: Inter-Meme Graph Reasoning for Sexism and Misogyny Detection

Paolo Italiani, David Gimeno-Gomez, Luca Ragazzi, Gianluca Moro, Paolo Rosso

🧩 TL;DR

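This paper presents MemeWeaver, an end-to-end trainable multimodal framework that detects sexist and misogynistic content through a novel inter-meme graph reasoning mechanism. It outperforms existing methods on the MAMI and EXIST benchmarks and converges faster in training.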


📘 Detailed Summary

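Motivation: Women are twice as likely as men to face online harassment, yet most existing multimodal content moderation approaches overlook the social dynamics behind this phenomenon, in which perpetrators reinforce prejudices and group identity within like-minded communities. Graph-based methods are a promising way to capture such interactions, but existing solutions remain limited by heuristic graph construction, shallow modality fusion, and instance-level reasoning.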

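Method: The paper presents MemeWeaver, an end-to-end trainable multimodal framework that uses a novel inter-meme graph reasoning mechanism to detect sexist and misogynistic content. The approach systematically evaluates multiple visual-textual fusion strategies and uses graph structure learning to capture semantic relations between memes, going beyond conventional heuristic graph construction.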

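Result: MemeWeaver consistently outperforms state-of-the-art baselines on the MAMI and EXIST benchmarks while converging faster in training. Further analysis shows that the learned graph structure captures semantically meaningful patterns, offering valuable insight into the relational nature of online hate.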

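Conclusion: The study demonstrates the effectiveness of graph reasoning for capturing the social dynamics of online sexism and points to a new direction for multimodal content moderation. The learned graph structure not only improves detection performance but also reveals semantic relations between memes, which helps in understanding how online hate spreads.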


📄 Abstract

Women are twice as likely as men to face online harassment due to their gender. Despite recent advances in multimodal content moderation, most approaches still overlook the social dynamics behind this phenomenon, where perpetrators reinforce prejudices and group identity within like-minded communities. Graph-based methods offer a promising way to capture such interactions, yet existing solutions remain limited by heuristic graph construction, shallow modality fusion, and instance-level reasoning. In this work, we present MemeWeaver, an end-to-end trainable multimodal framework for detecting sexism and misogyny through a novel inter-meme graph reasoning mechanism. We systematically evaluate multiple visual--textual fusion strategies and show that our approach consistently outperforms state-of-the-art baselines on the MAMI and EXIST benchmarks, while achieving faster training convergence. Further analyses reveal that the learned graph structure captures semantically meaningful patterns, offering valuable insights into the relational nature of online hate.
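
To make "inter-meme graph reasoning" concrete, the sketch below builds a soft adjacency matrix over a batch of meme embeddings from pairwise cosine similarity and runs one neighbour-aggregation step. This is a generic similarity graph, not the MemeWeaver mechanism: the abstract does not specify how the graph is learned, and every name here (`soft_adjacency`, `aggregate`, `temperature`) is a hypothetical chosen for illustration.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def soft_adjacency(embs: List[Vector], temperature: float = 1.0) -> List[List[float]]:
    """Row-wise softmax over pairwise cosine similarities (self-loops included)."""
    adj = []
    for ei in embs:
        logits = [cosine(ei, ej) / temperature for ej in embs]
        peak = max(logits)                      # subtract max for numerical stability
        exps = [math.exp(l - peak) for l in logits]
        total = sum(exps)
        adj.append([e / total for e in exps])
    return adj


def aggregate(embs: List[Vector], adj: List[List[float]]) -> List[Vector]:
    """One message-passing step: each meme becomes a weighted mean of all memes."""
    dim = len(embs[0])
    return [
        [sum(adj[i][j] * embs[j][k] for j in range(len(embs))) for k in range(dim)]
        for i in range(len(embs))
    ]


if __name__ == "__main__":
    memes = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # toy fused image-text embeddings
    adj = soft_adjacency(memes, temperature=0.5)
    print(aggregate(memes, adj))
```

In a trainable system the similarity would be computed on learned projections so that gradients shape the graph; the fixed cosine similarity here stands in for that learned component.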