cs.CV

[1] From Sight to Insight: Improving Visual Reasoning Capabilities of Multimodal Models via Reinforcement Learning

Omar Sharif, Eftekhar Hossain, Patrick Ng

🧩 TL;DR

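This work proposes a reward-driven reinforcement learning approach to strengthen the visual reasoning of multimodal large language models. By designing reward functions that target different aspects of reasoning and training with group relative policy optimization, it significantly improves performance on tasks such as visual puzzles.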


📘 Detailed Summary

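Motivation: The reasoning chains generated by current multimodal large language models (MLLMs) do not effectively integrate visual information, which limits their performance on tasks that require accurate visual perception, such as visual puzzles. The study shows that visual perception is the key bottleneck in such tasks: converting images into textual descriptions significantly improves performance, but existing methods require costly supervised data.

Method: The study uses reward-driven reinforcement learning with six reward functions targeting different aspects of reasoning, covering image understanding, thinking steps, and answer accuracy. Training with group relative policy optimization (GRPO) explicitly incentivizes longer, structured reasoning and discourages the model from bypassing visual information.

Result: Converting images into textual descriptions improves the performance of Claude 3.5 and Claude 3.7 by 26.7% and 23.6%, respectively. On Qwen-2.5-VL-7B, the proposed RL approach achieves a 5.56% improvement over the base model, with consistent gains in both in-domain and out-of-domain settings.

Conclusion: The study confirms that reinforcement learning is an effective mechanism for unlocking long visual reasoning in open-source MLLMs without costly supervised data. The proposed reward design and GRPO training offer a systematic framework for strengthening visual reasoning, with clear practical value.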


📄 Abstract

Reinforcement learning (RL) has emerged as a promising approach for eliciting reasoning chains before generating final answers. However, multimodal large language models (MLLMs) generate reasoning that lacks integration of visual information. This limits their ability to solve problems that demand accurate visual perception, such as visual puzzles. We show that visual perception is the key bottleneck in such tasks: converting images into textual descriptions significantly improves performance, yielding gains of 26.7% for Claude 3.5 and 23.6% for Claude 3.7. To address this, we investigate reward-driven RL as a mechanism to unlock long visual reasoning in open-source MLLMs without requiring costly supervision. We design and evaluate six reward functions targeting different reasoning aspects, including image understanding, thinking steps, and answer accuracy. Using group relative policy optimization (GRPO), our approach explicitly incentivizes longer, structured reasoning and mitigates bypassing of visual information. Experiments on Qwen-2.5-VL-7B achieve 5.56% improvements over the base model, with consistent gains across both in-domain and out-of-domain settings.
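
The group-relative step that GRPO relies on can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the reward component names, the equal weights, and the function names are hypothetical, and the abstract does not give the exact form of the six rewards. Only the idea of scoring each sampled response against the mean and standard deviation of its own group follows the commonly described GRPO advantage.

```python
from statistics import mean, pstdev


def total_reward(components, weights):
    """Combine named reward components into one scalar (weights are hypothetical)."""
    return sum(weights[name] * value for name, value in components.items())


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every response in the group scores the same.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four responses sampled for the same puzzle.
weights = {"image_understanding": 1.0, "thinking_steps": 1.0, "answer_accuracy": 1.0}
group = [
    {"image_understanding": 1.0, "thinking_steps": 1.0, "answer_accuracy": 1.0},
    {"image_understanding": 0.0, "thinking_steps": 0.0, "answer_accuracy": 0.0},
    {"image_understanding": 1.0, "thinking_steps": 0.5, "answer_accuracy": 0.0},
    {"image_understanding": 0.5, "thinking_steps": 1.0, "answer_accuracy": 0.0},
]
rewards = [total_reward(c, weights) for c in group]
advantages = group_relative_advantages(rewards)
```

Because the advantage is relative to the group, a response is reinforced only when it scores better than its siblings for the same prompt; no separate value model is needed in this sketch.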

[2] TeleWorld: Towards Dynamic Multimodal Synthesis with a 4D World Model

Yabo Chen, Yuanzhi Liang, Jiepeng Wang, Tingxi Chen, Junfei Cheng, Zixiao Gu, Yuyang Huang, Zicheng Jiang, Wei Li, Tian Li, Weichen Li, Zuoxin Li, Guangce Liu, Jialun Liu, Junqi Liu, Haoyuan Wang, Qizhen Weng, Xuan'er Wu, Xunzhi Xiang, Xiaoyan Yang, Xin Zhang, Shiwen Zhang, Junyu Zhou, Chengcheng Zhou, Haibin Huang, Chi Zhang, Xuelong Li

🧩 TL;DR

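TeleWorld is a real-time multimodal 4D world modeling framework. Through a generation-reconstruction-guidance paradigm, it unifies video generation, dynamic scene reconstruction, and long-term world memory in a closed-loop system that maintains spatial, temporal, and physical consistency, advancing practical interactive world models.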


📘 Detailed Summary

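Motivation: Current video generation models achieve impressive visual quality but remain limited in real-time interaction, long-horizon consistency, and persistent memory of dynamic scenes, which hinders their evolution into practical world models. The work aims to address these limitations and move world models toward practical, interactive, and computationally accessible systems.

Method: TeleWorld adopts a generation-reconstruction-guidance paradigm in which generated video streams are continuously reconstructed into a dynamic 4D spatio-temporal representation, which in turn guides subsequent generation to maintain consistency. The framework uses an autoregressive diffusion-based video model enhanced with Macro-from-Micro Planning (MMPL), a hierarchical planning method that reduces error accumulation, and achieves real-time synthesis through efficient Distribution Matching Distillation (DMD). Dynamic object modeling and static scene representation are integrated within a unified 4D framework.

Result: Extensive experiments show that TeleWorld performs strongly in static and dynamic world understanding, long-term consistency, and real-time generation efficiency. The framework achieves low-latency long-horizon generation while maintaining spatial, temporal, and physical consistency, offering a practical solution for interactive, memory-enabled world models.

Conclusion: TeleWorld is a significant step toward practical interactive world models. Its closed-loop system, which unifies video generation, dynamic scene reconstruction, and long-term memory, lays a foundation for multimodal generation and embodied intelligence applications. The framework shows how the generation-reconstruction-guidance paradigm can achieve spatio-temporal consistency and offers a feasible path to real-time world modeling under limited compute.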


📄 Abstract

World models aim to endow AI systems with the ability to represent, generate, and interact with dynamic environments in a coherent and temporally consistent manner. While recent video generation models have demonstrated impressive visual quality, they remain limited in real-time interaction, long-horizon consistency, and persistent memory of dynamic scenes, hindering their evolution into practical world models. In this report, we present TeleWorld, a real-time multimodal 4D world modeling framework that unifies video generation, dynamic scene reconstruction, and long-term world memory within a closed-loop system. TeleWorld introduces a novel generation-reconstruction-guidance paradigm, where generated video streams are continuously reconstructed into a dynamic 4D spatio-temporal representation, which in turn guides subsequent generation to maintain spatial, temporal, and physical consistency. To support long-horizon generation with low latency, we employ an autoregressive diffusion-based video model enhanced with Macro-from-Micro Planning (MMPL), a hierarchical planning method that reduces error accumulation from frame-level to segment-level, alongside efficient Distribution Matching Distillation (DMD), enabling real-time synthesis under practical computational budgets. Our approach achieves seamless integration of dynamic object modeling and static scene representation within a unified 4D framework, advancing world models toward practical, interactive, and computationally accessible systems. Extensive experiments demonstrate that TeleWorld achieves strong performance in both static and dynamic world understanding, long-term consistency, and real-time generation efficiency, positioning it as a practical step toward interactive, memory-enabled world models for multimodal generation and embodied intelligence.

[3] It's Never Too Late: Noise Optimization for Collapse Recovery in Trained Diffusion Models

Anne Harrington, A. Sophia Koepke, Shyamgopal Karthik, Trevor Darrell, Alexei A. Efros

🧩 TL;DR

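This paper proposes noise optimization as a way to mitigate mode collapse in text-to-image generation models, significantly improving generation diversity while preserving the fidelity of the base model.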


📘 Detailed Summary

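Motivation: Contemporary text-to-image models exhibit pronounced mode collapse: images generated from the same text prompt lack diversity. Existing methods address this mainly through guidance mechanisms or by generating a large pool of candidates and refining them; this paper instead explores noise optimization as a different route to diversity.

Method: The paper proposes a simple noise optimization objective to mitigate mode collapse, analyzes the frequency characteristics of the noise, and shows that alternative noise initializations with different frequency profiles can improve both optimization and search.

Result: Experiments show that noise optimization yields superior results in both generation quality and diversity, effectively mitigating mode collapse while preserving the fidelity of the base model.

Conclusion: The study shows that noise optimization is an effective way to address mode collapse in text-to-image generation. By analyzing the frequency characteristics of the noise and choosing a suitable initialization strategy, diversity can be improved significantly without sacrificing generation quality, offering a new technical direction for improving generative models.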


📄 Abstract

Contemporary text-to-image models exhibit a surprising degree of mode collapse, as can be seen when sampling several images given the same text prompt. While previous work has attempted to address this issue by steering the model using guidance mechanisms, or by generating a large pool of candidates and refining them, in this work we take a different direction and aim for diversity in generations via noise optimization. Specifically, we show that a simple noise optimization objective can mitigate mode collapse while preserving the fidelity of the base model. We also analyze the frequency characteristics of the noise and show that alternative noise initializations with different frequency profiles can improve both optimization and search. Our experiments demonstrate that noise optimization yields superior results in terms of generation quality and variety.

[4] Spatial4D-Bench: A Versatile 4D Spatial Intelligence Benchmark

Pan Wang, Yang Liu, Guile Wu, Eduardo R. Corral-Soto, Chengjie Huang, Binbin Xu, Dongfeng Bai, Xu Yan, Yuan Ren, Xingxin Chen, Yizhe Wu, Tao Huang, Wenjun Wan, Xin Wu, Pei Zhou, Xuyang Dai, Kangbo Lv, Hongbo Zhang, Yosef Fried, Aixue Ye, Bailan Feng, Zhenyu Chen, Zhen Li, Yingcong Chen, Yiyi Liao, Bingbing Liu

🧩 TL;DR

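This paper presents Spatial4D-Bench, a large-scale benchmark for comprehensively evaluating the 4D spatial intelligence of multimodal large language models. It contains about 40,000 question-answer pairs across 18 tasks and reveals substantial limitations of current MLLMs in 4D spatial reasoning.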


📘 Detailed Summary

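Motivation: Humans naturally possess 4D spatial intelligence, the ability to perceive and process how objects change over time. Existing spatial intelligence benchmarks are often small-scale or limited in diversity, so they cannot comprehensively assess whether MLLMs reach human-level 4D spatial intelligence.

Method: The team built Spatial4D-Bench, a large-scale multi-task evaluation benchmark of about 40,000 question-answer pairs covering 18 well-defined tasks. The tasks are systematically organized into six cognitive categories: object understanding, scene understanding, spatial relationship understanding, spatiotemporal relationship understanding, spatial reasoning, and spatiotemporal reasoning.

Result: Benchmarking a range of open-source and proprietary MLLMs on Spatial4D-Bench reveals substantial limitations in many aspects of 4D spatial reasoning, including route planning, action recognition, and physical plausibility reasoning, indicating that current models remain far from human-level 4D spatial intelligence.

Conclusion: The study gives the community valuable insight into the spatial cognition abilities of MLLMs. Spatial4D-Bench is intended to facilitate the development of more capable MLLMs toward human-level 4D spatial intelligence and provides a structured evaluation framework for future research.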


📄 Abstract

4D spatial intelligence involves perceiving and processing how objects move or change over time. Humans naturally possess 4D spatial intelligence, supporting a broad spectrum of spatial reasoning abilities. To what extent can Multimodal Large Language Models (MLLMs) achieve human-level 4D spatial intelligence? In this work, we present Spatial4D-Bench, a versatile 4D spatial intelligence benchmark designed to comprehensively assess the 4D spatial reasoning abilities of MLLMs. Unlike existing spatial intelligence benchmarks that are often small-scale or limited in diversity, Spatial4D-Bench provides a large-scale, multi-task evaluation benchmark consisting of ~40,000 question-answer pairs covering 18 well-defined tasks. We systematically organize these tasks into six cognitive categories: object understanding, scene understanding, spatial relationship understanding, spatiotemporal relationship understanding, spatial reasoning, and spatiotemporal reasoning. Spatial4D-Bench thereby offers a structured and comprehensive benchmark for evaluating the spatial cognition abilities of MLLMs, covering a broad spectrum of tasks that parallel the versatility of human spatial intelligence. We benchmark various state-of-the-art open-source and proprietary MLLMs on Spatial4D-Bench and reveal their substantial limitations in a wide variety of 4D spatial reasoning aspects, such as route planning, action recognition, and physical plausibility reasoning. We hope that the findings provided in this work offer valuable insights to the community and that our benchmark can facilitate the development of more capable MLLMs toward human-level 4D spatial intelligence. More resources can be found on our project page.

[5] A Spatially Masked Adaptive Gated Network for multimodal post-flood water extent mapping using SAR and incomplete multispectral data

Hyunho Lee, Wenwen Li

🧩 TL;DR

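This paper proposes SMAGNet, a multimodal deep learning model that uses SAR data as the primary input for post-flood water extent mapping and integrates complementary MSI data through feature fusion, improving robustness to missing data and applicability to real-world flood management scenarios.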


📘 Detailed Summary

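Motivation: During the flood response stage, timely and accurate water extent mapping is essential for disaster management. SAR data are the primary source, but combining SAR with MSI data in a multimodal approach can improve mapping accuracy, particularly when timely observations during or shortly after the flood peak are limited. However, how to adaptively integrate partially available MSI data into SAR-based post-flood water extent mapping remains an underexplored research gap.

Method: The paper proposes the Spatially Masked Adaptive Gated Network (SMAGNet), a multimodal deep learning model that takes SAR data as the primary input for post-flood water extent mapping and integrates complementary MSI data through a feature fusion mechanism. Its adaptive integration strategy handles varying levels of MSI availability and maintains performance even when MSI data are completely missing.

Result: Experiments on the C2S-MS Floods dataset show that SMAGNet consistently outperforms other multimodal deep learning models in prediction performance across varying levels of MSI data availability. Notably, even when MSI data are completely missing, SMAGNet's performance remains statistically comparable to that of a U-Net trained solely on SAR data, indicating robustness to missing data.

Conclusion: By adaptively integrating partially available MSI data, SMAGNet improves the accuracy of post-flood water extent mapping and strengthens robustness to missing data. The study offers a feasible approach for applying multimodal deep learning to real-world flood management, with particular practical value when data acquisition is incomplete or unreliable.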


📄 Abstract

Mapping water extent during a flood event is essential for effective disaster management throughout all phases: mitigation, preparedness, response, and recovery. In particular, during the response stage, when timely and accurate information is important, Synthetic Aperture Radar (SAR) data are primarily employed to produce water extent maps. Recently, leveraging the complementary characteristics of SAR and MSI data through a multimodal approach has emerged as a promising strategy for advancing water extent mapping using deep learning models. This approach is particularly beneficial when timely post-flood observations, acquired during or shortly after the flood peak, are limited, as it enables the use of all available imagery for more accurate post-flood water extent mapping. However, the adaptive integration of partially available MSI data into the SAR-based post-flood water extent mapping process remains underexplored. To bridge this research gap, we propose the Spatially Masked Adaptive Gated Network (SMAGNet), a multimodal deep learning model that utilizes SAR data as the primary input for post-flood water extent mapping and integrates complementary MSI data through feature fusion. In experiments on the C2S-MS Floods dataset, SMAGNet consistently outperformed other multimodal deep learning models in prediction performance across varying levels of MSI data availability. Furthermore, we found that even when MSI data were completely missing, the performance of SMAGNet remained statistically comparable to that of a U-Net model trained solely on SAR data. These findings indicate that SMAGNet enhances the model robustness to missing data as well as the applicability of multimodal deep learning in real-world flood management scenarios.
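
One way to picture spatially masked, gated fusion is the per-pixel sketch below. It is an assumption-laden illustration rather than SMAGNet's architecture, which the abstract does not detail: the function names, the scalar features, and the sigmoid gate are all hypothetical. The point it shows is why such a design can fall back to the SAR pathway wherever MSI pixels are missing.

```python
import math


def sigmoid(x):
    """Logistic function mapping a gate logit to the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def masked_gated_fusion(sar_feat, msi_feat, msi_valid, gate_logit):
    """Fuse per-pixel features; pixels without valid MSI keep the SAR feature only.

    sar_feat, msi_feat: per-pixel scalar features (hypothetical stand-ins for feature maps)
    msi_valid:          1.0 where MSI is available, 0.0 where it is missing
    gate_logit:         per-pixel logits of a gate (learned in a real model)
    """
    fused = []
    for s, m, valid, g in zip(sar_feat, msi_feat, msi_valid, gate_logit):
        gate = sigmoid(g) * valid  # the mask forces the gate to 0 where MSI is missing
        fused.append(s + gate * m)
    return fused


sar = [0.2, 0.8, 0.5]
msi = [0.6, 0.1, 0.9]
logits = [2.0, -1.0, 0.0]

partial = masked_gated_fusion(sar, msi, [1.0, 0.0, 1.0], logits)
missing = masked_gated_fusion(sar, msi, [0.0, 0.0, 0.0], logits)
```

With an all-zero validity mask the fused output equals the SAR features exactly, which mirrors the reported behavior of staying comparable to a SAR-only model when MSI is absent.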

[6] FCMBench: A Comprehensive Financial Credit Multimodal Benchmark for Real-world Applications

Yehui Yang, Dalu Yang, Wenshuo Zhou, Fangxin Shang, Yifan Liu, Jie Ren, Haojun Fei, Qing Yang, Tao Chen

🧩 TL;DR

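This paper introduces FCMBench-V1.0, a large-scale multimodal benchmark for the financial credit domain. It covers 18 core certificate types, with 4,043 privacy-compliant images and 8,446 QA samples, and is designed to evaluate how vision-language models perform in real financial credit applications.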


📘 Detailed Summary

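Motivation: As multimodal AI becomes widely used for credit risk assessment and document review, the field lacks a domain-specific benchmark that reflects the documents and workflows of financial credit applications, covers credit-specific understanding and real-world robustness, and preserves privacy compliance without sacrificing practical utility.

Method: The team built the FCMBench-V1.0 evaluation framework around three dimensions: Perception, Reasoning, and Robustness. Perception comprises 3 foundational perception tasks; Reasoning comprises 4 credit-specific tasks that require decision-oriented understanding of visual evidence; Robustness comprises 10 types of real-world acquisition artifacts. All samples are built through a closed synthesis-capture pipeline: document templates with virtual content are synthesized manually, and scenario-aware images are captured in-house. Avoiding web-sourced or publicly released images mitigates pre-training data leakage.

Result: Extensive experiments cover 23 state-of-the-art vision-language models from 14 top AI companies and research institutes. Gemini 3 Pro achieves the best F1 score among commercial models (64.61%), Qwen3-VL-235B achieves the best score among open-source baselines (57.27%), and the team's financial credit-specific model, Qfin-VL-Instruct, achieves the top overall score (64.92%). Robustness evaluations show that even top-performing models suffer noticeable performance drops under acquisition artifacts.

Conclusion: FCMBench effectively discriminates performance disparities and robustness across modern vision-language models and provides a standardized benchmark for evaluating multimodal AI in financial credit. Its privacy-compliant synthesis-capture approach addresses data leakage while retaining practical utility, making it a useful evaluation tool for developing domain-specific multimodal models.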


📄 Abstract

As multimodal AI becomes widely used for credit risk assessment and document review, a domain-specific benchmark is urgently needed that (1) reflects documents and workflows specific to financial credit applications, (2) includes credit-specific understanding and real-world robustness, and (3) preserves privacy compliance without sacrificing practical utility. Here, we introduce FCMBench-V1.0, a large-scale financial credit multimodal benchmark for real-world applications, covering 18 core certificate types, with 4,043 privacy-compliant images and 8,446 QA samples. The FCMBench evaluation framework consists of three dimensions: Perception, Reasoning, and Robustness, including 3 foundational perception tasks, 4 credit-specific reasoning tasks that require decision-oriented understanding of visual evidence, and 10 real-world acquisition artifact types for robustness stress testing. To reconcile compliance with realism, we construct all samples via a closed synthesis-capture pipeline: we manually synthesize document templates with virtual content and capture scenario-aware images in-house. This design also mitigates pre-training data leakage by avoiding web-sourced or publicly released images. FCMBench can effectively discriminate performance disparities and robustness across modern vision-language models. Extensive experiments were conducted on 23 state-of-the-art vision-language models (VLMs) from 14 top AI companies and research institutes. Among them, Gemini 3 Pro achieves the best F1 (%) score as a commercial model (64.61), Qwen3-VL-235B achieves the best score as an open-source baseline (57.27), and our financial credit-specific model, Qfin-VL-Instruct, achieves the top overall score (64.92). Robustness evaluations show that even top-performing models suffer noticeable performance drops under acquisition artifacts.

[7] FaithSCAN: Model-Driven Single-Pass Hallucination Detection for Faithful Visual Question Answering

Chaodong Tong, Qi Zhang, Chen Li, Lei Jiang, Yanbing Liu

🧩 TL;DR

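This paper proposes FaithSCAN, a lightweight network that detects hallucinations in VQA by exploiting the rich internal signals of vision-language models. It also extends the LLM-as-a-Judge paradigm to generate supervision signals automatically, without costly human annotation.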


📘 Detailed Summary

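Motivation: Faithfulness hallucinations in VQA severely undermine the reliability of vision-language models in safety-critical applications, and existing detection methods have inherent limitations. External verification approaches incur high computational overhead and depend on the quality of external resources; uncertainty-driven approaches capture only limited facets of model uncertainty and fail to sufficiently explore the rich internal signals associated with diverse failure modes.

Method: FaithSCAN is a lightweight network that detects hallucinations from rich internal signals of VLMs, including token-level decoding uncertainty, intermediate visual representations, and cross-modal alignment features. These signals are fused via branch-wise evidence encoding and uncertainty-aware attention. The work also extends the LLM-as-a-Judge paradigm to VQA hallucination detection and proposes a low-cost strategy to automatically generate model-dependent supervision signals, enabling supervised training without costly human labels.

Result: Experiments on multiple VQA benchmarks show that FaithSCAN significantly outperforms existing methods in both effectiveness and efficiency. In-depth analysis shows that hallucinations arise from systematic internal state variations in visual perception, cross-modal reasoning, and language decoding; that different internal signals provide complementary diagnostic cues; and that hallucination patterns vary across VLM architectures.

Conclusion: The study shows that hallucinations stem from systematic variations in the internal states of vision-language models, that different internal signals have complementary diagnostic value, and that hallucination patterns are architecture-dependent, offering new insight into the underlying mechanisms of multimodal hallucination. The proposed automatic supervision strategy offers a feasible route to low-cost training of hallucination detectors.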


📄 Abstract

Faithfulness hallucinations in VQA occur when vision-language models produce fluent yet visually ungrounded answers, severely undermining their reliability in safety-critical applications. Existing detection methods mainly fall into two categories: external verification approaches relying on auxiliary models or knowledge bases, and uncertainty-driven approaches using repeated sampling or uncertainty estimates. The former suffer from high computational overhead and are limited by external resource quality, while the latter capture only limited facets of model uncertainty and fail to sufficiently explore the rich internal signals associated with the diverse failure modes. Both paradigms thus have inherent limitations in efficiency, robustness, and detection performance. To address these challenges, we propose FaithSCAN: a lightweight network that detects hallucinations by exploiting rich internal signals of VLMs, including token-level decoding uncertainty, intermediate visual representations, and cross-modal alignment features. These signals are fused via branch-wise evidence encoding and uncertainty-aware attention. We also extend the LLM-as-a-Judge paradigm to VQA hallucination and propose a low-cost strategy to automatically generate model-dependent supervision signals, enabling supervised training without costly human labels while maintaining high detection accuracy. Experiments on multiple VQA benchmarks show that FaithSCAN significantly outperforms existing methods in both effectiveness and efficiency. In-depth analysis shows hallucinations arise from systematic internal state variations in visual perception, cross-modal reasoning, and language decoding. Different internal signals provide complementary diagnostic cues, and hallucination patterns vary across VLM architectures, offering new insights into the underlying causes of multimodal hallucinations.
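
Of the internal signals listed, token-level decoding uncertainty is the easiest to make concrete. The sketch below computes the Shannon entropy of each decoding step's next-token distribution and averages it over the answer. It is only an assumed form of that one signal: the abstract does not say which uncertainty measure FaithSCAN uses, and the function names and toy distributions are hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_decoding_entropy(step_distributions):
    """Average per-token entropy over all decoding steps of one generated answer."""
    entropies = [token_entropy(p) for p in step_distributions]
    return sum(entropies) / len(entropies)


# Toy distributions over a 4-token vocabulary, one per decoding step.
confident_answer = [[0.97, 0.01, 0.01, 0.01], [0.90, 0.05, 0.03, 0.02]]
uncertain_answer = [[0.25, 0.25, 0.25, 0.25], [0.40, 0.30, 0.20, 0.10]]

low = mean_decoding_entropy(confident_answer)
high = mean_decoding_entropy(uncertain_answer)
```

A detector built this way would treat higher mean entropy as one piece of evidence for a possibly ungrounded answer, to be combined with the visual and cross-modal signals rather than used alone.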

[8] Focal-RegionFace: Generating Fine-Grained Multi-attribute Descriptions for Arbitrarily Selected Face Focal Regions

Kaiwen Zheng, Junchen Fu, Songpei Xu, Yaoqing He, Joemon M. Jose, Han Hu, Xuri Ge

🧩 TL;DR

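This paper introduces FaceFocalDesc, a new problem: generating and recognizing multi-attribute natural language descriptions, covering facial action units, emotional states, and age estimation, for arbitrarily selected face regions. It also builds Focal-RegionFace, a model based on Qwen2.5-VL that achieves fine-grained region-focused analysis through progressive fine-tuning.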


📘 Detailed Summary

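Motivation: The paper addresses an underexplored problem in facial analysis: generating and recognizing multi-attribute natural language descriptions, covering facial action units (AUs), emotional states, and age estimation, for arbitrarily selected face regions. The authors argue that a system's ability to focus on individual facial areas leads to better understanding and control, which calls for a new dataset and a corresponding analysis model.

Method: The paper constructs a new multi-attribute description dataset that provides rich region-level annotations and natural language descriptions for arbitrarily selected face regions. Building on the Qwen2.5-VL vision-language model, it proposes Focal-RegionFace, which incrementally refines its focus on localized facial features through multiple progressive fine-tuning stages, yielding interpretable age estimation, AU detection, and emotion detection.

Result: Focal-RegionFace achieves the best performance on the new benchmark, on traditional and widely used metrics as well as on newly proposed ones, which verifies its effectiveness and versatility in fine-grained, multi-attribute, region-focused face analysis.

Conclusion: The study demonstrates the importance of region-focused approaches in facial analysis, and Focal-RegionFace provides an effective solution for fine-grained multi-attribute analysis of facial state. The work opens a new direction for interpretable and controllable face understanding and could be extended to a broader range of facial analysis tasks and real-world applications.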


📄 Abstract

In this paper, we introduce an underexplored problem in facial analysis: generating and recognizing multi-attribute natural language descriptions, containing facial action units (AUs), emotional states, and age estimation, for arbitrarily selected face regions (termed FaceFocalDesc). We argue that the system's ability to focus on individual facial areas leads to better understanding and control. To achieve this capability, we construct a new multi-attribute description dataset for arbitrarily selected face regions, providing rich region-level annotations and natural language descriptions. Further, we propose a fine-tuned vision-language model based on Qwen2.5-VL, called Focal-RegionFace, for facial state analysis, which incrementally refines its focus on localized facial features through multiple progressive fine-tuning stages, resulting in interpretable age estimation, AU detection, and emotion detection. Experimental results show that Focal-RegionFace achieves the best performance on the new benchmark in terms of traditional and widely used metrics, as well as newly proposed metrics. This fully verifies its effectiveness and versatility in fine-grained multi-attribute face region-focal analysis scenarios.

[9] HarmoniAD: Harmonizing Local Structures and Global Semantics for Anomaly Detection

Naiqi Zhang, Chuancheng Shi, Jingtong Dou, Wenhua Wu, Fei Shen, Jianhua Cao

🧩 TL;DR

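This paper proposes HarmoniAD, a frequency-guided dual-branch anomaly detection framework that decouples high- and low-frequency paths to balance structural detail against semantic consistency, achieving state-of-the-art performance on multiple industrial defect detection benchmarks.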


📘 Detailed Summary

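Motivation: Anomaly detection is crucial in industrial product quality inspection, but existing methods face a structure-semantics trade-off: structure-oriented models are sensitive to noise, while semantics-oriented models often miss fine details, making tiny defects hard to detect.

Method: HarmoniAD is a frequency-guided dual-branch framework. Features are first extracted by the CLIP image encoder, then transformed into the frequency domain and decoupled into high- and low-frequency paths. The high-frequency branch uses a fine-grained structural attention module (FSAM) to enhance textures and edges, while the low-frequency branch uses a global structural context module (GSCM) to capture long-range dependencies and preserve semantic consistency; together they model structure and semantics in a complementary way.

Result: Experiments on three benchmark datasets, MVTec-AD, VisA, and BTAD, show that HarmoniAD achieves state-of-the-art performance with both high sensitivity and robustness, effectively balancing fine-detail detection against preservation of global semantics.

Conclusion: The study demonstrates the effectiveness of frequency-domain decoupling for anomaly detection. Complementary dual-branch modeling resolves the structure-semantics trade-off, giving industrial defect detection a method that is both sensitive and robust, and the work also shows the practical value of a multi-class joint training strategy.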


📄 Abstract

Anomaly detection is crucial in industrial product quality inspection. Failing to detect tiny defects often leads to serious consequences. Existing methods face a structure-semantics trade-off: structure-oriented models (such as frequency-based filters) are noise-sensitive, while semantics-oriented models (such as CLIP-based encoders) often miss fine details. To address this, we propose HarmoniAD, a frequency-guided dual-branch framework. Features are first extracted by the CLIP image encoder, then transformed into the frequency domain, and finally decoupled into high- and low-frequency paths for complementary modeling of structure and semantics. The high-frequency branch is equipped with a fine-grained structural attention module (FSAM) to enhance textures and edges for detecting small anomalies, while the low-frequency branch uses a global structural context module (GSCM) to capture long-range dependencies and preserve semantic consistency. Together, these branches balance fine detail and global semantics. HarmoniAD further adopts a multi-class joint training strategy, and experiments on MVTec-AD, VisA, and BTAD show state-of-the-art performance with both sensitivity and robustness.
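
The frequency decoupling step can be illustrated on a one-dimensional signal. The sketch below is a simplification under stated assumptions: HarmoniAD operates on CLIP image features, whereas this example uses a plain list of numbers, a naive discrete Fourier transform, and a hypothetical hard cutoff, and the function names are not from the paper. It shows the basic property that the low- and high-frequency paths are complementary and sum back to the input.

```python
import cmath


def dft(x):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def split_frequencies(x, cutoff):
    """Split a real signal into low- and high-frequency parts that sum to x."""
    n = len(x)
    spectrum = dft(x)
    # Bin k and bin n - k carry the same frequency, so the mask is symmetric;
    # this keeps both reconstructed parts real-valued.
    is_low = [min(k, n - k) <= cutoff for k in range(n)]
    low = idft([c if keep else 0 for c, keep in zip(spectrum, is_low)])
    high = idft([0 if keep else c for c, keep in zip(spectrum, is_low)])
    return [v.real for v in low], [v.real for v in high]


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
low_part, high_part = split_frequencies(signal, cutoff=1)
flat_low, flat_high = split_frequencies([3.0] * 8, cutoff=1)
```

A constant signal has no high-frequency content, so its high part is numerically zero, while sharp local changes (the analogue of small defects) land mostly in the high-frequency path.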

[10] IntraStyler: Exemplar-based Style Synthesis for Cross-modality Domain Adaptation

Han Liu, Yubo Fan, Hao Li, Dewei Hu, Daniel Moyer, Zhoubing Xu, Benoit M. Dawant, Ipek Oguz

🧩 TL;DR

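This paper proposes IntraStyler, an exemplar-based style synthesis method for controllable style diversification in unsupervised domain adaptation. It captures diverse intra-domain styles without prior knowledge, which improves downstream segmentation performance.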


📘 Detailed Summary

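Motivation: Existing unsupervised domain adaptation methods focus mainly on the domain shift between source and target domains, while intra-domain variability remains under-explored. Previous methods require intra-domain variations to be pre-specified for style synthesis, which is often impractical, so a method is needed that captures diverse intra-domain styles without prior knowledge.

Method: IntraStyler adopts an exemplar-based style synthesis strategy in which an exemplar image guides synthesis so that the output style matches the exemplar style. A style encoder, trained discriminatively with contrastive learning, extracts style-only features and enables controllable style synthesis.

Result: The method is evaluated on CrossMoDA 2023, the largest public dataset for cross-modality domain adaptation. Experiments show its efficacy in controllable style synthesis and the benefit of diverse synthetic data for downstream segmentation.

Conclusion: The study highlights the importance of intra-domain style diversification in unsupervised domain adaptation. IntraStyler achieves controllable style synthesis without prior knowledge, offering an effective solution for practical applications such as cross-modality medical image analysis and moving domain adaptation toward finer-grained style control.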


📄 Abstract

Image-level domain alignment is the de facto approach for unsupervised domain adaptation, where unpaired image translation is used to minimize the domain gap. Prior studies mainly focus on the domain shift between the source and target domains, whereas the intra-domain variability remains under-explored. To address the latter, an effective strategy is to diversify the styles of the synthetic target domain data during image translation. However, previous methods typically require intra-domain variations to be pre-specified for style synthesis, which may be impractical. In this paper, we propose an exemplar-based style synthesis method named IntraStyler, which can capture diverse intra-domain styles without any prior knowledge. Specifically, IntraStyler uses an exemplar image to guide the style synthesis such that the output style matches the exemplar style. To extract the style-only features, we introduce a style encoder to learn styles discriminatively based on contrastive learning. We evaluate the proposed method on the largest public dataset for cross-modality domain adaptation, CrossMoDA 2023. Our experiments show the efficacy of our method in controllable style synthesis and the benefits of diverse synthetic data for downstream segmentation. Code is available at https://github.com/han-liu/IntraStyler.
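
Contrastive learning of style features is often implemented with an InfoNCE-style loss, sketched below. This is a generic illustration, not IntraStyler's training objective, which the abstract does not specify: the temperature value, the two-dimensional toy embeddings, and the function names are assumptions. The loss is small when the anchor embedding is closer to its positive (same style) than to the negatives (other styles).

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: negative log-softmax of the positive's similarity."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    peak = max(logits)  # subtract the maximum for a numerically stable log-sum-exp
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]


# Hypothetical 2-D style embeddings.
anchor = [1.0, 0.0]
same_style = [0.9, 0.1]
other_styles = [[0.0, 1.0], [-1.0, 0.0]]

well_separated = info_nce(anchor, same_style, other_styles)
mismatched = info_nce(anchor, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
```

Minimizing this loss pulls embeddings of images that share a style together and pushes other styles apart, which is the discriminative behavior a style encoder needs before its features can steer synthesis.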

[11] MotionPhysics: Learnable Motion Distillation for Text-Guided Simulation

Miaowei Wang, Jakub Zadrożny, Oisin Mac Aodha, Amir Vaxman

🧩 TL;DR

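This paper presents MotionPhysics, an end-to-end differentiable framework that infers plausible physical parameters for a 3D scene from a natural language prompt, without guidance from ground-truth trajectories or annotated videos, enabling realistic text-guided dynamic simulation.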


📘 Detailed Summary

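Motivation: Accurately simulating existing 3D objects and a wide variety of materials usually demands expert knowledge and time-consuming physical parameter tuning to achieve the desired dynamic behavior. Existing methods rely on guidance from ground-truth trajectories or annotated videos, which limits their applicability and ease of use.

Method: The method first uses a multimodal large language model to estimate material parameter values, constrained to lie within plausible ranges. It then proposes a learnable motion distillation loss that extracts robust motion priors from pretrained video diffusion models while minimizing appearance and geometry inductive biases, and uses it to guide the simulation.

Result: MotionPhysics is evaluated on more than thirty scenarios, including real-world, human-designed, and AI-generated 3D objects, spanning materials such as elastic solids, metals, foams, sand, and both Newtonian and non-Newtonian fluids. It surpasses the state of the art in visual realism and physical plausibility.

Conclusion: The study demonstrates the effectiveness of inferring physical parameters from natural language, providing a more accessible and automated solution for 3D dynamic simulation. It advances the combination of physical simulation with generative models and opens new avenues for creative design and virtual content creation.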


📄 Abstract

Accurately simulating existing 3D objects and a wide variety of materials often demands expert knowledge and time-consuming physical parameter tuning to achieve the desired dynamic behavior. We introduce MotionPhysics, an end-to-end differentiable framework that infers plausible physical parameters from a user-provided natural language prompt for a chosen 3D scene of interest, removing the need for guidance from ground-truth trajectories or annotated videos. Our approach first utilizes a multimodal large language model to estimate material parameter values, which are constrained to lie within plausible ranges. We further propose a learnable motion distillation loss that extracts robust motion priors from pretrained video diffusion models while minimizing appearance and geometry inductive biases to guide the simulation. We evaluate MotionPhysics across more than thirty scenarios, including real-world, human-designed, and AI-generated 3D objects, spanning a wide range of materials such as elastic solids, metals, foams, sand, and both Newtonian and non-Newtonian fluids. We demonstrate that MotionPhysics produces visually realistic dynamic simulations guided by natural language, surpassing the state of the art while automatically determining physically plausible parameters. The code and project page are available at: https://wangmiaowei.github.io/MotionPhysics.github.io/.

[12] A Comprehensive Dataset for Human vs. AI Generated Image Detection

Rajarshi Roy, Nasrin Imanpour, Ashhar Aziz, Shashwat Bajpai, Gurpreet Singh, Shwetangshu Biswas, Kapil Wanaskar, Parth Patwa, Subhankar Ghosh, Shreyas Dixit, Nilesh Ranjan Pal, Vipula Rawte, Ritvik Garimella, Gaytri Jena, Vasu Sharma, Vinija Jain, Aman Chadha, Aishwarya Naresh Reganti, Amitava Das

🧩 TL;DR

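This work releases MS COCOAI, a new dataset for AI-generated image detection consisting of 96,000 real and synthetic datapoints, and proposes two tasks: classifying images as real or generated, and identifying the generating model.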


📘 Detailed Summary

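Motivation: With the spread of multimodal generative AI systems such as Stable Diffusion, DALL-E, and MidJourney, synthetic images are increasingly hard to distinguish from real photographs, enabling the spread of false information and misleading content. Developing effective detection methods has therefore become an urgent need.

Method: Building on the MS COCO dataset, the study constructs MS COCOAI, a dataset of 96,000 datapoints. Synthetic images are generated with five generators (Stable Diffusion 3, Stable Diffusion 2.1, SDXL, DALL-E 3, and MidJourney v6), and two detection tasks are defined: classifying images as real or generated, and attributing a synthetic image to the model that produced it.

Result: The study produces a large-scale, diverse benchmark dataset for AI-generated image detection, with synthetic images from five different generative models, providing a standardized evaluation platform for developing more robust detection algorithms. The dataset is publicly available on Hugging Face.

Conclusion: By building a comprehensive dataset and defining clear detection tasks, the study provides an important benchmark resource for AI-generated image detection, supporting the development of more effective detectors to address the security challenges posed by synthetic media.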


📄 Abstract

Multimodal generative AI systems like Stable Diffusion, DALL-E, and MidJourney have fundamentally changed how synthetic images are created. These tools drive innovation but also enable the spread of misleading content, false information, and manipulated media. As generated images become harder to distinguish from photographs, detecting them has become an urgent priority. To combat this challenge, we release MS COCOAI, a novel dataset for AI-generated image detection consisting of 96,000 real and synthetic datapoints, built using the MS COCO dataset. To generate synthetic images, we use five generators: Stable Diffusion 3, Stable Diffusion 2.1, SDXL, DALL-E 3, and MidJourney v6. Based on the dataset, we propose two tasks: (1) classifying images as real or generated, and (2) identifying which model produced a given synthetic image. The dataset is available at https://huggingface.co/datasets/Rajarshi-Roy-research/Defactify_Image_Dataset.

[13] Application Research of a Deep Learning Model Integrating CycleGAN and YOLO in PCB Infrared Defect Detection

Chao Yang, Haoyuan Zheng, Yue Ma

🧩 TL;DR

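This paper proposes a cross-modal data augmentation framework integrating CycleGAN and YOLOv8. It translates abundant visible-light PCB images into the infrared domain to alleviate infrared data scarcity, significantly improving PCB defect detection under low-data conditions.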


📘 Detailed Summary

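Motivation: The study addresses the critical bottleneck of infrared (IR) data scarcity in printed circuit board (PCB) defect detection. Conventional methods rely on paired supervision, while real infrared samples are costly to acquire and limited in number, which severely constrains the training and practical deployment of deep learning defect detectors.

Method: The cross-modal data augmentation framework first uses CycleGAN for unpaired image-to-image translation, mapping abundant visible-light PCB images into the infrared domain to produce high-fidelity pseudo-IR samples that preserve the structural semantics of defects while accurately simulating thermal distribution patterns. It then builds a heterogeneous training strategy that fuses the generated pseudo-IR data with limited real IR samples to train a lightweight YOLOv8 detector.

Result: Experiments show that the method effectively enhances feature learning under low-data conditions. The augmented detector significantly outperforms models trained on limited real data alone and approaches the benchmark performance of fully supervised training, demonstrating the efficacy of pseudo-IR synthesis as a robust augmentation strategy for industrial inspection.

Conclusion: The study demonstrates that cross-modal data augmentation is effective at alleviating data scarcity in industrial inspection. High-quality pseudo-IR data generated by unpaired image translation significantly improve defect detection, offering a feasible solution where data are hard to acquire and illustrating the practical value of combining generative models with established detection frameworks.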


📄 Abstract

This paper addresses the critical bottleneck of infrared (IR) data scarcity in Printed Circuit Board (PCB) defect detection by proposing a cross-modal data augmentation framework integrating CycleGAN and YOLOv8. Unlike conventional methods relying on paired supervision, we leverage CycleGAN to perform unpaired image-to-image translation, mapping abundant visible-light PCB images into the infrared domain. This generative process synthesizes high-fidelity pseudo-IR samples that preserve the structural semantics of defects while accurately simulating thermal distribution patterns. Subsequently, we construct a heterogeneous training strategy that fuses generated pseudo-IR data with limited real IR samples to train a lightweight YOLOv8 detector. Experimental results demonstrate that this method effectively enhances feature learning under low-data conditions. The augmented detector significantly outperforms models trained on limited real data alone and approaches the performance benchmarks of fully supervised training, proving the efficacy of pseudo-IR synthesis as a robust augmentation strategy for industrial inspection.

[14] Detecting Performance Degradation under Data Shift in Pathology Vision-Language Model

Hao Guan, Li Zhou

🧩 TL;DR

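This study proposes a complementary framework that combines input data shift detection with output confidence indicators to monitor performance degradation of a pathology vision-language model under distribution shift. The DomainSAT toolbox and a confidence-based indicator together make monitoring of model reliability more dependable.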


📘 Detailed Summary

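Motivation: Vision-language models (VLMs) show strong potential in medical image analysis, but after deployment their performance may degrade when the input data distribution shifts from that observed during development. Detecting such degradation is essential for clinical reliability, yet it is challenging for large pre-trained VLMs operating without labeled data.

Method: The study examines both input-level data shift and output-level prediction behavior. It develops DomainSAT, a lightweight toolbox with a graphical interface that integrates representative shift detection algorithms, and introduces a label-free, confidence-based degradation indicator that directly captures changes in model prediction confidence.

Result: Input data shift detection effectively identifies distributional changes and provides early diagnostic signals, but does not always correspond to actual performance degradation. The output-based confidence indicator is closely related to performance degradation and serves as an effective complement to input shift detection. Tumor classification experiments on a large-scale pathology dataset show that combining the two enables more reliable detection and interpretation of VLM performance degradation under data shift.

Conclusion: The study provides a practical and complementary framework for monitoring the reliability of foundation models in digital pathology. Combining input data shift detection with output confidence indicators gives a more complete picture of how a model's performance changes under distribution shift, an important safeguard for clinically deployed VLMs.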


📄 Abstract

Vision-Language Models have demonstrated strong potential in medical image analysis and disease diagnosis. However, after deployment, their performance may deteriorate when the input data distribution shifts from that observed during development. Detecting such performance degradation is essential for clinical reliability, yet remains challenging for large pre-trained VLMs operating without labeled data. In this study, we investigate performance degradation detection under data shift in a state-of-the-art pathology VLM. We examine both input-level data shift and output-level prediction behavior to understand their respective roles in monitoring model reliability. To facilitate systematic analysis of input data shift, we develop DomainSAT, a lightweight toolbox with a graphical interface that integrates representative shift detection algorithms and enables intuitive exploration of data shift. Our analysis shows that while input data shift detection is effective at identifying distributional changes and providing early diagnostic signals, it does not always correspond to actual performance degradation. Motivated by this observation, we further study output-based monitoring and introduce a label-free, confidence-based degradation indicator that directly captures changes in model prediction confidence. We find that this indicator exhibits a close relationship with performance degradation and serves as an effective complement to input shift detection. Experiments on a large-scale pathology dataset for tumor classification demonstrate that combining input data shift detection and output confidence-based indicators enables more reliable detection and interpretation of performance degradation in VLMs under data shift. These findings provide a practical and complementary framework for monitoring the reliability of foundation models in digital pathology.
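
A label-free, confidence-based indicator can be as simple as the drop in mean top-class probability between a reference batch and a deployed batch. The sketch below shows that one plausible form; it is an assumption, since the abstract does not define the study's indicator, and the function names and toy probabilities are hypothetical.

```python
def mean_max_confidence(prob_rows):
    """Average of the highest class probability across a batch of predictions."""
    return sum(max(row) for row in prob_rows) / len(prob_rows)


def confidence_drop(reference_probs, deployed_probs):
    """Positive values mean the model is less confident on deployed data than on reference data."""
    return mean_max_confidence(reference_probs) - mean_max_confidence(deployed_probs)


# Toy two-class predictions (for example tumor vs. normal); no labels are needed.
reference = [[0.9, 0.1], [0.2, 0.8]]
deployed = [[0.6, 0.4], [0.5, 0.5]]

drop = confidence_drop(reference, deployed)
```

Because it needs only model outputs, such an indicator can run continuously after deployment and be read alongside input shift detectors, which flag that the data changed but not whether predictions suffered.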

[15] TotalFM: An Organ-Separated Framework for 3D-CT Vision Foundation Models

Kohei Yamamoto, Tomohiro Kikuchi

🧩 TL;DR

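This study proposes TotalFM, a radiological foundation model built on the concept of organ separation. By combining self-supervised pre-training with contrastive learning, it efficiently learns the correspondence between 3D-CT images and linguistic expressions and significantly improves zero-shot lesion classification performance.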


📘 Detailed Summary

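Motivation: Radiological foundation models face a major challenge from computational cost constraints when applied to 3D-CT volumetric data. Existing methods are limited in how well they balance computational efficiency and representation capability, so a more practical design framework is needed for real-world deployment of 3D-CT foundation models.

Method: TotalFM is built on the concept of organ separation and uses a large-scale dataset of 140,000 series. Organ volume and finding-sentence pairs are created automatically through segmentation techniques and LLM-based radiology report processing. Self-supervised pre-training via VideoMAE is combined with contrastive learning on volume-text pairs to balance computational efficiency and representation capability.

Result: In zero-shot organ-wise lesion classification, the model achieved higher F1 scores than CT-CLIP in 83% of organs and than Merlin in 64% of organs. In zero-shot finding-wise lesion classification, it achieved a higher AUROC than Merlin in 83% of finding categories. In radiology report generation, its performance was comparable to existing VLMs.

Conclusion: The organ-separated learning framework offers a realistic and effective design guideline for the practical implementation of 3D-CT foundation models. The model shows high generalization performance in a clinical evaluation setting while balancing computational efficiency and representation capability, making it a feasible option for clinical use of radiological foundation models.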


📄 Abstract

While foundation models in radiology are expected to be applied to various clinical tasks, computational cost constraints remain a major challenge when training on 3D-CT volumetric data. In this study, we propose TotalFM, a radiological foundation model that efficiently learns the correspondence between 3D-CT images and linguistic expressions based on the concept of organ separation, utilizing a large-scale dataset of 140,000 series. By automating the creation of organ volume and finding-sentence pairs through segmentation techniques and Large Language Model (LLM)-based radiology report processing, and by combining self-supervised pre-training via VideoMAE with contrastive learning using volume-text pairs, we aimed to balance computational efficiency and representation capability. In zero-shot organ-wise lesion classification tasks, the proposed model achieved higher F1 scores in 83% (5/6) of organs compared to CT-CLIP and 64% (9/14) of organs compared to Merlin. These results suggest that the proposed model exhibits high generalization performance in a clinical evaluation setting using actual radiology report sentences. Furthermore, in zero-shot finding-wise lesion classification tasks, our model achieved a higher AUROC in 83% (25/30) of finding categories compared to Merlin. We also confirmed performance comparable to existing Vision-Language Models (VLMs) in radiology report generation tasks. Our results demonstrate that the organ-separated learning framework can serve as a realistic and effective design guideline for the practical implementation of 3D-CT foundation models.
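
The contrastive step over volume-text pairs can be sketched as follows, assuming a CLIP-style symmetric InfoNCE objective. The abstract does not state which contrastive loss is used, so the loss form, the temperature value, and the function names here are assumptions.

```python
import math

def log_softmax(row):
    # Stable log-softmax over one row of similarity scores.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def symmetric_contrastive_loss(sim, temperature=0.07):
    # sim[i][j]: similarity of organ volume i and finding sentence j.
    # Matched pairs are assumed to sit on the diagonal.
    n = len(sim)
    scaled = [[s / temperature for s in row] for row in sim]
    transposed = [list(col) for col in zip(*scaled)]
    volume_to_text = -sum(log_softmax(scaled[i])[i] for i in range(n)) / n
    text_to_volume = -sum(log_softmax(transposed[i])[i] for i in range(n)) / n
    return 0.5 * (volume_to_text + text_to_volume)

aligned = [[1.0, 0.0], [0.0, 1.0]]
misaligned = [[0.0, 1.0], [1.0, 0.0]]
print(symmetric_contrastive_loss(aligned) < symmetric_contrastive_loss(misaligned))
```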

[16] S1-MMAlign: A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure-Text Understanding

He Wang, Longteng Guo, Pengkang Huo, Xuanxu Lin, Yichen Yuan, Jie Jiang, Jing Liu

🧩 TL;DR

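This paper presents S1-MMAlign, a large-scale, multi-disciplinary multimodal dataset of over 15.5 million high-quality image-text pairs, designed to address the semantic gap between complex scientific imagery and sparse textual descriptions in scientific discovery. It also introduces a Qwen-VL-based semantic enhancement pipeline that significantly improves the quality of scientific image-text alignment.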


📘 Detailed Summary

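Motivation: Multimodal learning has revolutionized general-domain tasks, but its application in scientific discovery is hindered by the profound semantic gap between complex scientific imagery and sparse textual descriptions. Existing scientific datasets commonly suffer from weak image-text alignment, which limits progress in cross-modal understanding and scientific reasoning. This study aims to fill that gap by providing a high-quality multimodal resource for the era of AI for Science.

Method: The authors build the S1-MMAlign dataset, comprising over 15.5 million high-quality image-text pairs from 2.5 million open-access scientific papers across disciplines such as physics, biology, and engineering. To address the pervasive weak alignment in raw scientific captions, they introduce an AI-ready semantic enhancement pipeline that uses the Qwen-VL multimodal large model series to recaption images by synthesizing context from paper abstracts and citation contexts.

Result: Technical validation shows that the semantic enhancement significantly improves data quality: SciBERT-based pseudo-perplexity metrics show reduced semantic ambiguity, and CLIP scores indicate an 18.21% improvement in image-text alignment. The dataset covers diverse visual modalities, including experimental setups, heatmaps, and microscopic imagery, providing a rich multimodal basis for scientific reasoning. It is publicly available on Hugging Face.

Conclusion: S1-MMAlign is a foundational resource for advancing scientific reasoning and cross-modal understanding in the era of AI for Science. The work addresses alignment quality in scientific multimodal data and establishes a new benchmark for training and evaluating multimodal models on scientific discovery tasks. The semantic enhancement pipeline also offers a reusable approach for weakly aligned multimodal data in other domains.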


📄 Abstract

Multimodal learning has revolutionized general domain tasks, yet its application in scientific discovery is hindered by the profound semantic gap between complex scientific imagery and sparse textual descriptions. We present S1-MMAlign, a large-scale, multi-disciplinary multimodal dataset comprising over 15.5 million high-quality image-text pairs derived from 2.5 million open-access scientific papers. Spanning disciplines from physics and biology to engineering, the dataset captures diverse visual modalities including experimental setups, heatmaps, and microscopic imagery. To address the pervasive issue of weak alignment in raw scientific captions, we introduce an AI-ready semantic enhancement pipeline that utilizes the Qwen-VL multimodal large model series to recaption images by synthesizing context from paper abstracts and citation contexts. Technical validation demonstrates that this enhancement significantly improves data quality: SciBERT-based pseudo-perplexity metrics show reduced semantic ambiguity, while CLIP scores indicate an 18.21% improvement in image-text alignment. S1-MMAlign provides a foundational resource for advancing scientific reasoning and cross-modal understanding in the era of AI for Science. The dataset is publicly available at https://huggingface.co/datasets/ScienceOne-AI/S1-MMAlign.

[17] ActErase: A Training-Free Paradigm for Precise Concept Erasure via Activation Patching

Yi Sun, Xinhao Zhong, Hongyan Li, Yimin Zhou, Junhao Li, Bin Chen, Xuan Wang

🧩 TL;DR

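This paper proposes ActErase, a training-free method for concept erasure in text-to-image diffusion models. By identifying activation difference regions and dynamically replacing input activations, it achieves efficient concept erasure with state-of-the-art erasure performance while preserving generative capability.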


📘 Detailed Summary

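Motivation: Most existing concept erasure methods rely on data-intensive and computationally expensive fine-tuning, which is a significant limitation in practice. Text-to-image diffusion models carry safety, copyright, and ethical risks, so more efficient ways to remove sensitive concepts are needed that do not damage the model's overall generative capability.

Method: The method builds on the observation that model activations consist predominantly of generic concepts, with only a minimal component representing the target concept. It identifies activation difference regions through prompt-pair analysis, extracts target activations, and dynamically replaces input activations during forward passes. Being training-free, it enables plug-and-play concept manipulation.

Result: Comprehensive evaluations on three critical erasure tasks (nudity, artistic style, and object removal) show that the method achieves state-of-the-art erasure performance while effectively preserving the model's overall generative capability. It is also strongly robust against adversarial attacks, which supports its reliability in practical use.

Conclusion: The study establishes a new lightweight, plug-and-play paradigm for concept manipulation in diffusion models, achieving efficient concept erasure without training. The method is valuable for mitigating safety, copyright, and ethical risks, and offers a new technical path for model interpretability and controllable generation.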


📄 Abstract

Recent advances in text-to-image diffusion models have demonstrated remarkable generation capabilities, yet they raise significant concerns regarding safety, copyright, and ethical implications. Existing concept erasure methods address these risks by removing sensitive concepts from pre-trained models, but most of them rely on data-intensive and computationally expensive fine-tuning, which poses a critical limitation. To overcome these challenges, inspired by the observation that the model's activations are predominantly composed of generic concepts, with only a minimal component representing the target concept, we propose a novel training-free method (ActErase) for efficient concept erasure. Specifically, the proposed method operates by identifying activation difference regions via prompt-pair analysis, extracting target activations, and dynamically replacing input activations during forward passes. Comprehensive evaluations across three critical erasure tasks (nudity, artistic style, and object removal) demonstrate that our training-free method achieves state-of-the-art (SOTA) erasure performance while effectively preserving the model's overall generative capability. Our approach also exhibits strong robustness against adversarial attacks, establishing a new plug-and-play paradigm for lightweight yet effective concept manipulation in diffusion models.
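
A minimal sketch of the activation-patching idea on flat activation vectors is given below. It assumes that the difference region is found by thresholding the gap between activations for a target prompt and a neutral prompt, and that patching overwrites that region with the neutral activations. The threshold, the replacement source, and the function names are illustrative assumptions; the abstract does not give these details.

```python
def find_difference_region(target_act, neutral_act, threshold):
    # Indices where the target-prompt activation diverges from the neutral one.
    return [i for i, (t, n) in enumerate(zip(target_act, neutral_act))
            if abs(t - n) > threshold]

def patch_activations(current_act, neutral_act, region):
    # Overwrite only the difference region; all other entries stay untouched.
    patched = list(current_act)
    for i in region:
        patched[i] = neutral_act[i]
    return patched

target = [0.1, 2.0, 0.3]    # activations for a prompt containing the concept
neutral = [0.1, 0.2, 0.3]   # activations for the paired prompt without it
region = find_difference_region(target, neutral, threshold=0.5)
print(region, patch_activations([0.1, 1.9, 0.35], neutral, region))
```

In a real diffusion model the same two steps would run inside a forward hook on selected layers, which this sketch leaves out.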

[18] OmniVaT: Single Domain Generalization for Multimodal Visual-Tactile Learning

Liuxiang Qiu, Hui Da, Yuzhen Niu, Tiesong Zhao, Yang Cao, Zheng-Jun Zha

🧩 TL;DR

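This paper proposes the OmniVaT framework, which for the first time successfully addresses single domain generalization in visual-tactile learning. A multimodal fractional Fourier adapter and a discrete tree generation module mitigate the modality discrepancy and domain shift challenges.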


📘 Detailed Summary

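Motivation: Visual-tactile learning suffers from modality discrepancies between visual and tactile images, as well as domain gaps caused by non-standardized tactile sensors and inconsistent data collection procedures. These challenges are formalized as the task of single domain generalization for multimodal visual-tactile learning.

Method: OmniVaT has two core components. The multimodal fractional Fourier adapter maps visual and tactile embeddings into a unified embedding-frequency space to mitigate the modality gap. The discrete tree generation module obtains diverse and reliable multimodal fractional representations through a hierarchical tree structure, improving adaptivity to fluctuating domain shifts in unseen domains.

Result: Extensive experiments show that OmniVaT achieves superior cross-domain generalization on the SDG-VTL task, handling modality discrepancies and domain shifts without multi-domain training data or careful cross-modal fusion strategies.

Conclusion: The study is the first to successfully address single domain generalization in visual-tactile learning. The unified embedding-frequency space mapping and the hierarchical tree-structured representation offer a new technical path toward robust generalization of multimodal perception systems in the physical world.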


📄 Abstract

Visual-tactile learning (VTL) enables embodied agents to perceive the physical world by integrating visual (VIS) and tactile (TAC) sensors. However, VTL still suffers from modality discrepancies between VIS and TAC images, as well as domain gaps caused by non-standardized tactile sensors and inconsistent data collection procedures. We formulate these challenges as a new task, termed single domain generalization for multimodal VTL (SDG-VTL). In this paper, we propose an OmniVaT framework that, for the first time, successfully addresses this task. On the one hand, OmniVaT integrates a multimodal fractional Fourier adapter (MFFA) to map VIS and TAC embeddings into a unified embedding-frequency space, thereby effectively mitigating the modality gap without multi-domain training data or careful cross-modal fusion strategies. On the other hand, it also incorporates a discrete tree generation (DTG) module that obtains diverse and reliable multimodal fractional representations through a hierarchical tree structure, thereby enhancing its adaptivity to fluctuating domain shifts in unseen domains. Extensive experiments demonstrate the superior cross-domain generalization performance of OmniVaT on the SDG-VTL task.

[19] Efficient Prediction of Dense Visual Embeddings via Distillation and RGB-D Transformers

Söhnke Benedikt Fischedick, Daniel Seichter, Benedict Stephan, Robin Schmidt, Horst-Michael Gross

🧩 TL;DR

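This paper proposes DVEFormer, an efficient RGB-D Transformer-based approach that predicts dense text-aligned visual embeddings via knowledge distillation. It gives mobile robots flexible semantic understanding and natural-language querying while meeting real-time requirements.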


📘 Detailed Summary

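Motivation: In domestic environments, robots need a comprehensive understanding of their surroundings to interact effectively and intuitively with untrained humans. Classical semantic segmentation uses fixed predefined classes, which lacks flexibility and cannot support natural-language queries or more advanced applications.

Method: The approach uses DVEFormer, a Transformer-based RGB-D architecture that learns fine-grained pixel-wise embeddings via knowledge distillation from an Alpha-CLIP teacher model. It predicts dense text-aligned visual embeddings instead of directly performing classical fixed-class semantic segmentation.

Result: Evaluations on common indoor datasets show that the approach achieves competitive performance while meeting real-time requirements: on an NVIDIA Jetson AGX Orin, the full model runs at 26.3 FPS and a smaller variant at 77.0 FPS. Qualitative results illustrate its effectiveness in real-world applications.

Conclusion: The method can serve as a drop-in replacement for traditional segmentation approaches while enabling flexible natural-language querying and seamless integration into 3D mapping pipelines for mobile robotics, giving robots a more general and extensible form of environment understanding.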


📄 Abstract

In domestic environments, robots require a comprehensive understanding of their surroundings to interact effectively and intuitively with untrained humans. In this paper, we propose DVEFormer - an efficient RGB-D Transformer-based approach that predicts dense text-aligned visual embeddings (DVE) via knowledge distillation. Instead of directly performing classical semantic segmentation with fixed predefined classes, our method uses teacher embeddings from Alpha-CLIP to guide our efficient student model DVEFormer in learning fine-grained pixel-wise embeddings. While this approach still enables classical semantic segmentation, e.g., via linear probing, it further enables flexible text-based querying and other applications, such as creating comprehensive 3D maps. Evaluations on common indoor datasets demonstrate that our approach achieves competitive performance while meeting real-time requirements, operating at 26.3 FPS for the full model and 77.0 FPS for a smaller variant on an NVIDIA Jetson AGX Orin. Additionally, we show qualitative results that highlight the effectiveness and possible use cases in real-world applications. Overall, our method serves as a drop-in replacement for traditional segmentation approaches while enabling flexible natural-language querying and seamless integration into 3D mapping pipelines for mobile robotics.
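
Text-based querying over dense embeddings can be pictured as a per-pixel cosine similarity against a text embedding, as in the sketch below. The tiny two-dimensional vectors, the threshold, and the function names are hypothetical; in practice the text embedding would come from a text encoder aligned with the teacher, which is not shown here.

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors; 0.0 if either has zero norm.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def query_mask(pixel_embeddings, text_embedding, threshold=0.5):
    # pixel_embeddings: H x W grid of vectors; returns an H x W boolean mask.
    return [[cosine(p, text_embedding) >= threshold for p in row]
            for row in pixel_embeddings]

grid = [[[1.0, 0.0], [0.0, 1.0]]]   # one row, two pixels
chair = [1.0, 0.0]                  # stand-in for an embedded query such as "chair"
print(query_mask(grid, chair))
```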

[20] BHaRNet: Reliability-Aware Body-Hand Modality Expertized Networks for Fine-grained Skeleton Action Recognition

Seungyeon Cho, Tae-kyun Kim

🧩 TL;DR

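This paper proposes a probabilistic dual-stream framework for skeleton-based action recognition. It unifies reliability modeling and multi-modal integration to enable expertized learning under uncertainty, with a particular focus on fine-grained hand action recognition.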


📘 Detailed Summary

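Motivation: Most existing skeleton-based action recognition methods are body-centric: they focus on large-scale motions and neglect the subtle hand articulations that carry crucial information. This limits fine-grained recognition performance, especially under noisy and heterogeneous conditions.

Method: The framework has three key components: a calibration-free preprocessing pipeline that learns directly from native coordinates; a probabilistic Noisy-OR fusion that enables reliability-aware dual-stream learning without explicit confidence supervision; and an intra- to cross-modal ensemble that couples the skeleton modalities (Joint, Bone, Joint Motion, and Bone Motion) to RGB representations, bridging structural and visual motion cues in a unified framework.

Result: Comprehensive evaluations on multiple benchmarks (NTU RGB+D 60/120, PKU-MMD, N-UCLA) and a newly defined hand-centric benchmark show consistent performance gains and improved robustness under noisy and heterogeneous conditions.

Conclusion: The study demonstrates the effectiveness of a probabilistic dual-stream framework for unified reliability modeling and multi-modal integration. It highlights the importance of subtle hand articulations for fine-grained action recognition and offers a new approach to expertized learning under uncertainty.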


📄 Abstract

Skeleton-based human action recognition (HAR) has achieved remarkable progress with graph-based architectures. However, most existing methods remain body-centric, focusing on large-scale motions while neglecting subtle hand articulations that are crucial for fine-grained recognition. This work presents a probabilistic dual-stream framework that unifies reliability modeling and multi-modal integration, generalizing expertized learning under uncertainty across both intra-skeleton and cross-modal domains. The framework comprises three key components: (1) a calibration-free preprocessing pipeline that removes canonical-space transformations and learns directly from native coordinates; (2) a probabilistic Noisy-OR fusion that stabilizes reliability-aware dual-stream learning without requiring explicit confidence supervision; and (3) an intra- to cross-modal ensemble that couples four skeleton modalities (Joint, Bone, Joint Motion, and Bone Motion) to RGB representations, bridging structural and visual motion cues in a unified cross-modal formulation. Comprehensive evaluations across multiple benchmarks (NTU RGB+D 60/120, PKU-MMD, N-UCLA) and a newly defined hand-centric benchmark exhibit consistent improvements and robustness under noisy and heterogeneous conditions.
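
For readers unfamiliar with Noisy-OR, the sketch below shows the standard combination rule applied per class to a body stream and a hand stream: a class is considered present unless every stream fails to detect it. How the paper wires this rule into training is not stated in the abstract, so the per-class application and the function names are assumptions.

```python
def noisy_or(probabilities):
    # Probability that at least one independent source fires.
    remaining = 1.0
    for p in probabilities:
        remaining *= (1.0 - p)
    return 1.0 - remaining

def fuse_streams(body_probs, hand_probs):
    # Per-class Noisy-OR of two streams; the outputs need not sum to 1.
    return [noisy_or([b, h]) for b, h in zip(body_probs, hand_probs)]

print(fuse_streams([0.2, 0.9], [0.5, 0.0]))
```

A useful property of this rule is that a stream reporting probability 0 for a class leaves the other stream's score unchanged, so an uninformative stream does not drag the fused score down.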

[21] CPPO: Contrastive Perception for Vision Language Policy Optimization

Ahmad Rezaei, Mohsen Gholami, Saeed Ranjbar Alvar, Kevin Cannons, Mohammad Asiful Hossain, Zhou Weimin, Shunbo Zhou, Yong Zhang, Mohammad Akbari

🧩 TL;DR

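This paper proposes CPPO (Contrastive Perception Policy Optimization), a method for finetuning vision-language models. By detecting perception tokens through entropy shifts and introducing a contrastive perception loss, it addresses the difficulty of separating perception from reasoning in multimodal reasoning.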


📘 Detailed Summary

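Motivation: Although reinforcement learning has advanced reasoning in language models, extending it to multimodal reasoning requires improving both perception and reasoning. Prior work relies mainly on explicit perception rewards, but separating perception tokens from reasoning tokens is difficult and requires extra LLMs, ground-truth data, forcing the policy model to separate perception from reasoning, or applying rewards indiscriminately to all output tokens.

Method: CPPO detects perception tokens via entropy shifts in the model outputs under perturbed input images. It extends the RL objective with a contrastive perception loss that enforces consistency under information-preserving perturbations and sensitivity under information-removing perturbations, which avoids the need for extra models.

Result: Experiments show that CPPO surpasses previous perception-rewarding methods while avoiding extra models, making training more efficient and scalable and yielding superior performance on multimodal reasoning tasks.

Conclusion: CPPO improves the perception capability of vision-language models without extra models or forced separation. Its contrastive perception loss enables automatic detection and optimization of perception tokens, providing a more efficient and scalable approach to multimodal reinforcement learning.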


📄 Abstract

We introduce CPPO, a Contrastive Perception Policy Optimization method for finetuning vision-language models (VLMs). While reinforcement learning (RL) has advanced reasoning in language models, extending it to multimodal reasoning requires improving both the perception and reasoning aspects. Prior works tackle this challenge mainly with explicit perception rewards, but disentangling perception tokens from reasoning tokens is difficult, requiring extra LLMs, ground-truth data, forced separation of perception from reasoning by the policy model, or applying rewards indiscriminately to all output tokens. CPPO addresses this problem by detecting perception tokens via entropy shifts in the model outputs under perturbed input images. CPPO then extends the RL objective function with a Contrastive Perception Loss (CPL) that enforces consistency under information-preserving perturbations and sensitivity under information-removing ones. Experiments show that CPPO surpasses previous perception-rewarding methods, while avoiding extra models, making training more efficient and scalable.
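
The entropy-shift detection step might look roughly like the following, where each token position has a next-token distribution computed once with the clean image and once with a perturbed image. Using the absolute entropy change and a fixed threshold are assumptions made for this sketch; the abstract does not specify the criterion.

```python
import math

def entropy(probs):
    # Shannon entropy (in nats) of one next-token distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)

def perception_token_indices(clean_dists, perturbed_dists, min_shift=0.2):
    # Positions whose predictive entropy changes by at least min_shift
    # when the input image is perturbed.
    return [i for i, (c, p) in enumerate(zip(clean_dists, perturbed_dists))
            if abs(entropy(p) - entropy(c)) >= min_shift]

clean = [[0.9, 0.1], [0.5, 0.5]]
perturbed = [[0.5, 0.5], [0.5, 0.5]]
print(perception_token_indices(clean, perturbed))
```

Here only the first token is flagged: its distribution depends on the image, while the second token is equally uncertain with or without the perturbation.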

[22] FreeText: Training-Free Text Rendering in Diffusion Transformers via Attention Localization and Spectral Glyph Injection

Ruiqiang Zhang, Hengyi Wang, Chang Liu, Guanjie Wang, Zehua Ma, Weiming Zhang

🧩 TL;DR

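This paper proposes FreeText, a training-free, plug-and-play framework that improves text rendering by exploiting intrinsic mechanisms of Diffusion Transformer models. It decomposes the problem into two sub-problems, "where to write" and "what to write", addressed by spatial localization and glyph injection respectively.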


📘 Detailed Summary

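Motivation: Large-scale text-to-image diffusion models excel at open-domain synthesis but still struggle with precise text rendering, especially for multi-line layouts, dense typography, and long-tailed scripts such as Chinese. Existing solutions typically require costly retraining or rigid external layout constraints, which can degrade aesthetic quality and limit flexibility.

Method: FreeText decomposes the problem into "where to write" and "what to write". For spatial localization, it locates writing regions by reading token-wise spatial attribution from endogenous image-to-text attention, using sink-like tokens as stable spatial anchors and topology-aware refinement to produce high-confidence masks. For content generation, it introduces spectral-modulated glyph injection, which injects a noise-aligned glyph prior with frequency-domain band-pass modulation to strengthen glyph structure and suppress semantic leakage.

Result: Extensive experiments on Qwen-Image, FLUX.1-dev, and SD3 variants across longText-Benchmark, CVTG, and CLT-Bench show consistent gains in text readability while largely preserving semantic alignment and aesthetic quality, with only modest inference overhead.

Conclusion: The study shows that exploiting the intrinsic mechanisms of Diffusion Transformer models can significantly improve text rendering quality without retraining. FreeText offers a flexible and efficient solution that opens a new research direction for precise text rendering in text-to-image generation, with particular value for complex layouts and non-Latin scripts.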


📄 Abstract

Large-scale text-to-image (T2I) diffusion models excel at open-domain synthesis but still struggle with precise text rendering, especially for multi-line layouts, dense typography, and long-tailed scripts such as Chinese. Prior solutions typically require costly retraining or rigid external layout constraints, which can degrade aesthetics and limit flexibility. We propose FreeText, a training-free, plug-and-play framework that improves text rendering by exploiting intrinsic mechanisms of Diffusion Transformer (DiT) models. FreeText decomposes the problem into "where to write" and "what to write". For "where to write", we localize writing regions by reading token-wise spatial attribution from endogenous image-to-text attention, using sink-like tokens as stable spatial anchors and topology-aware refinement to produce high-confidence masks. For "what to write", we introduce Spectral-Modulated Glyph Injection (SGMI), which injects a noise-aligned glyph prior with frequency-domain band-pass modulation to strengthen glyph structure and suppress semantic leakage (rendering the concept instead of the word). Extensive experiments on Qwen-Image, FLUX.1-dev, and SD3 variants across longText-Benchmark, CVTG, and our CLT-Bench show consistent gains in text readability while largely preserving semantic alignment and aesthetic quality, with modest inference overhead.
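
To make "frequency-domain band-pass modulation" concrete, the sketch below applies a hard band-pass filter to a one-dimensional signal with a plain discrete Fourier transform. This is a deliberate simplification: the paper operates on two-dimensional glyph priors in a diffusion latent space, and the band limits, the hard cutoff, and all function names here are assumptions.

```python
import cmath
import math

def dft(signal):
    # Naive O(n^2) discrete Fourier transform.
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    # Inverse of dft above.
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def band_pass(signal, low, high):
    # Keep only bins whose frequency index lies in [low, high].
    n = len(signal)
    spectrum = dft(signal)
    for k in range(n):
        freq = min(k, n - k)   # bins k and n - k carry the same frequency
        if not (low <= freq <= high):
            spectrum[k] = 0
    return [value.real for value in idft(spectrum)]

flat = [1.0] * 8                                             # pure DC component
wave = [math.cos(2 * math.pi * 2 * t / 8) for t in range(8)] # frequency index 2
print([round(v, 6) for v in band_pass(wave, 1, 3)])
```

The constant signal is removed entirely, while the mid-frequency wave passes through unchanged, which is the behavior a band-pass prior would rely on to keep stroke-scale structure and drop low-frequency content.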

[23] Boosting Segment Anything Model to Generalize Visually Non-Salient Scenarios

Guangqian Guo, Pengfei Chen, Yong Guo, Huafeng Chen, Boqiang Zhang, Shan Gao

🧩 TL;DR

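This paper proposes VNS-SAM, a method that improves SAM's segmentation performance in visually non-salient scenarios. By introducing a Mask-Edge Token Interactive decoder and a Non-Salient Feature Mining module, it significantly improves segmentation accuracy in low-contrast scenes while preserving the original zero-shot generalizability.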


📘 Detailed Summary

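Motivation: The Segment Anything Model (SAM) performs poorly in visually non-salient scenarios, where contrast between foreground and background is low. Existing methods struggle to capture accurate contours and produce good segmentation results in these cases, so SAM's perception of such scenes needs to be strengthened while keeping its zero-shot generalizability.

Method: VNS-SAM exploits SAM's low-level features through a Mask-Edge Token Interactive decoder and a Non-Salient Feature Mining module, helping the decoder gain a deeper understanding of non-salient characteristics with only marginal increases in parameters and computation. The authors also build VNS-SEG, a unified dataset of more than 35K images covering various VNS scenarios, to improve model robustness.

Result: VNS-SAM shows superior performance across various VNS segmentation tasks, particularly under zero-shot settings. The additional parameters can be optimized within 4 hours, demonstrating feasibility and practicality. The VNS-SEG dataset provides a comprehensive benchmark for segmentation performance and generalizability.

Conclusion: VNS-SAM addresses SAM's performance bottleneck in visually non-salient scenarios. Its efficient feature mining mechanism significantly improves segmentation accuracy while preserving the original generalizability, offering a practical solution for real-world applications. The publicly released code and dataset should support further research in this area.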


📄 Abstract

Segment Anything Model (SAM), known for its remarkable zero-shot segmentation capabilities, has garnered significant attention in the community. Nevertheless, its performance is challenged when dealing with what we refer to as visually non-salient scenarios, where there is low contrast between the foreground and background. In these cases, existing methods often cannot capture accurate contours and fail to produce promising segmentation results. In this paper, we propose Visually Non-Salient SAM (VNS-SAM), aiming to enhance SAM's perception of visually non-salient scenarios while preserving its original zero-shot generalizability. We achieve this by effectively exploiting SAM's low-level features through two designs: Mask-Edge Token Interactive decoder and Non-Salient Feature Mining module. These designs help the SAM decoder gain a deeper understanding of non-salient characteristics with only marginal parameter increments and computational requirements. The additional parameters of VNS-SAM can be optimized within 4 hours, demonstrating its feasibility and practicality. In terms of data, we established VNS-SEG, a unified dataset for various VNS scenarios, with more than 35K images, in contrast to previous single-task adaptations. It is designed to make the model learn more robust VNS features and comprehensively benchmark the model's segmentation performance and generalizability on VNS scenarios. Extensive experiments across various VNS segmentation tasks demonstrate the superior performance of VNS-SAM, particularly under zero-shot settings, highlighting its potential for broad real-world applications. Codes and datasets are publicly available at https://guangqian-guo.github.io/VNS-SAM.

[24] AEGIS: Exploring the Limit of World Knowledge Capabilities for Unified Multimodal Models

Jintao Lin, Bowen Dong, Weikang Shi, Chenyang Lei, Suiyun Zhang, Rui Liu, Xihui Liu

🧩 TL;DR

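This paper proposes the AEGIS benchmark for evaluating how well unified multimodal models apply world knowledge, and introduces a deterministic checklist-based evaluation protocol to improve evaluation reliability. Experiments reveal significant knowledge deficits in current models on complex reasoning tasks.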


📘 Detailed Summary

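Motivation: Existing benchmarks are limited: they offer only siloed, single-task evaluations with weak diagnostic power and cannot comprehensively assess how unified multimodal models apply world knowledge across tasks. This hinders accurate evaluation of cross-modal knowledge transfer and complex reasoning.

Method: The authors propose the AEGIS benchmark, which comprises 1,050 challenging, manually annotated questions spanning 21 topics and 6 reasoning types, and covers visual understanding, generation, editing, and interleaved generation. They also propose a deterministic checklist-based evaluation protocol that replaces ambiguous prompt-based scoring with atomic "Y/N" judgments to improve evaluation reliability.

Result: Experiments show that most unified multimodal models have severe world knowledge deficits and that performance degrades significantly as reasoning complexity increases. Simple plug-in reasoning modules can partially mitigate these vulnerabilities, pointing to a promising direction for future research.

Conclusion: The study highlights world-knowledge-based reasoning as a critical frontier for unified multimodal models. The AEGIS benchmark and the deterministic evaluation protocol provide a more reliable framework for assessing model capabilities and expose the limitations of current models on complex multimodal tasks.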


📄 Abstract

The capability of Unified Multimodal Models (UMMs) to apply world knowledge across diverse tasks remains a critical, unresolved challenge. Existing benchmarks fall short, offering only siloed, single-task evaluations with limited diagnostic power. To bridge this gap, we propose AEGIS (i.e., Assessing Editing, Generation, Interpretation-Understanding for Super-intelligence), a comprehensive multi-task benchmark covering visual understanding, generation, editing, and interleaved generation. AEGIS comprises 1,050 challenging, manually-annotated questions spanning 21 topics (including STEM, humanities, daily life, etc.) and 6 reasoning types. To concretely evaluate the performance of UMMs in world knowledge scope without ambiguous metrics, we further propose Deterministic Checklist-based Evaluation (DCE), a protocol that replaces ambiguous prompt-based scoring with atomic "Y/N" judgments, to enhance evaluation reliability. Our extensive experiments reveal that most UMMs exhibit severe world knowledge deficits and that performance degrades significantly with complex reasoning. Additionally, simple plug-in reasoning modules can partially mitigate these vulnerabilities, highlighting a promising direction for future research. These results highlight the importance of world-knowledge-based reasoning as a critical frontier for UMMs.
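
One way to picture checklist-based scoring is as a deterministic aggregation of atomic "Y/N" judgments. The sketch below scores a response as the fraction of checklist items judged "Y"; this aggregation rule and the function name are assumptions, since the abstract does not say how DCE combines the individual judgments.

```python
def checklist_score(judgments):
    # judgments: one "Y" or "N" string per atomic checklist item.
    if not judgments:
        raise ValueError("checklist must contain at least one item")
    normalized = [j.strip().upper() for j in judgments]
    if any(j not in ("Y", "N") for j in normalized):
        raise ValueError("every judgment must be 'Y' or 'N'")
    return normalized.count("Y") / len(normalized)

print(checklist_score(["Y", "N", "Y", "Y"]))
```

Rejecting anything other than "Y" or "N" keeps the score reproducible: a judge that returns free text fails loudly instead of being silently mapped to a number.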

[25] GranAlign: Granularity-Aware Alignment Framework for Zero-Shot Video Moment Retrieval

Mingyu Jeon, Sunjae Yoon, Jonghee Kim, Junyeoung Kim

🧩 TL;DR

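This paper proposes GranAlign, a training-free granularity-aware alignment framework. Using granularity-aware query rewriting and query-aware caption generation, it addresses the semantic granularity mismatch in zero-shot video moment retrieval and sets a new state of the art on multiple benchmarks.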


📘 Detailed Summary

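Motivation: The main challenge in zero-shot video moment retrieval (ZVMR) is the mismatch in semantic granularity between textual queries and video content. Previous studies use high-quality pre-trained knowledge to represent video and language in a joint space, but they fail to balance the semantic granularity of the pre-trained knowledge each modality provides for a given scene. As a result, the granularity mismatch causes inaccurate retrieval even though each modality's representations are of high quality.

Method: The paper proposes GranAlign, a training-free granularity-aware alignment framework with two complementary techniques: granularity-based query rewriting, which generates query variants at different semantic granularities, and query-aware caption generation, which embeds query intent into video content. Pairing multi-level queries with both query-agnostic and query-aware captions resolves the semantic mismatch.

Result: The method sets a new state of the art on all three major benchmarks (QVHighlights, Charades-STA, ActivityNet-Captions), including a notable 3.23% mAP@avg improvement on the challenging QVHighlights dataset.

Conclusion: The study shows that a granularity-aware alignment strategy can resolve the semantic granularity mismatch in zero-shot video moment retrieval and substantially improve performance without additional training. The framework offers a new perspective on cross-modal alignment, underlines the importance of balancing semantic granularity across modalities, and provides a useful reference for future video-language understanding research.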


📄 Abstract

Zero-shot video moment retrieval (ZVMR) is the task of localizing a temporal moment within an untrimmed video using a natural language query without relying on task-specific training data. The primary challenge in this setting lies in the mismatch in semantic granularity between textual queries and visual content. Previous studies in ZVMR have attempted to achieve alignment by leveraging high-quality pre-trained knowledge that represents video and language in a joint space. However, these approaches failed to balance the semantic granularity between the pre-trained knowledge provided by each modality for a given scene. As a result, despite the high quality of each modality's representations, the mismatch in granularity led to inaccurate retrieval. In this paper, we propose a training-free framework, called Granularity-Aware Alignment (GranAlign), that bridges this gap between coarse and fine semantic representations. Our approach introduces two complementary techniques: granularity-based query rewriting to generate varied semantic granularities, and query-aware caption generation to embed query intent into video content. By pairing multi-level queries with both query-agnostic and query-aware captions, we effectively resolve semantic mismatches. As a result, our method sets a new state-of-the-art across all three major benchmarks (QVHighlights, Charades-STA, ActivityNet-Captions), with a notable 3.23% mAP@avg improvement on the challenging QVHighlights dataset.

[26] Modality Dominance-Aware Optimization for Embodied RGB-Infrared Perception

Xianhui Liu, Siqi Jiang, Yi Xie, Yuqing Lin, Siao Liu

🧩 TL;DR

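This paper proposes MDACL, a modality dominance-aware cross-modal learning framework. By quantifying the optimization bias between RGB and infrared modalities and introducing hierarchical guidance and adversarial equilibrium regularization, it significantly improves RGB-IR multimodal detection performance.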


📘 Detailed Summary

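Motivation: RGB-infrared multimodal perception is fundamental to embodied multimedia systems operating in complex physical environments, but existing cross-modal fusion methods overlook the optimization dynamics caused by asymmetric modality characteristics. In practice, disparities in information density and feature quality introduce persistent optimization bias, leading training to overemphasize a dominant modality and hindering effective fusion.

Method: The paper first proposes the Modality Dominance Index (MDI), which quantifies modality dominance by jointly modeling feature entropy and gradient contribution. Based on MDI, it develops the Modality Dominance-Aware Cross-modal Learning (MDACL) framework, which includes Hierarchical Cross-modal Guidance (HCG) to enhance feature alignment and Adversarial Equilibrium Regularization (AER) to balance optimization dynamics during fusion.

Result: Extensive experiments on three RGB-IR benchmarks show that MDACL effectively mitigates optimization bias and achieves state-of-the-art performance. The method significantly outperforms existing cross-modal fusion methods on multiple detection metrics, supporting the effectiveness of the proposed index and balancing mechanisms.

Conclusion: The study exposes the optimization bias caused by asymmetric modality characteristics in multimodal learning and proposes a systematic solution based on quantitative analysis and balanced optimization. MDACL improves RGB-IR detection performance and offers a general methodology for optimization imbalance in other multimodal tasks.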


📄 Abstract

RGB-Infrared (RGB-IR) multimodal perception is fundamental to embodied multimedia systems operating in complex physical environments. Although recent cross-modal fusion methods have advanced RGB-IR detection, the optimization dynamics caused by asymmetric modality characteristics remain underexplored. In practice, disparities in information density and feature quality introduce persistent optimization bias, leading training to overemphasize a dominant modality and hindering effective fusion. To quantify this phenomenon, we propose the Modality Dominance Index (MDI), which measures modality dominance by jointly modeling feature entropy and gradient contribution. Based on MDI, we develop a Modality Dominance-Aware Cross-modal Learning (MDACL) framework that regulates cross-modal optimization. MDACL incorporates Hierarchical Cross-modal Guidance (HCG) to enhance feature alignment and Adversarial Equilibrium Regularization (AER) to balance optimization dynamics during fusion. Extensive experiments on three RGB-IR benchmarks demonstrate that MDACL effectively mitigates optimization bias and achieves SOTA performance.

[27] HyperPriv-EPN: Hypergraph Learning with Privileged Knowledge for Ependymoma Prognosis

Shuren Gabriel Yu, Sikang Ren, Yongji Tian

🧩 TL;DR

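This paper proposes HyperPriv-EPN, a hypergraph-based framework for learning using privileged information. Through dual-stream distillation, the preoperative model learns to hallucinate semantic community structures from visual features, enabling preoperative ependymoma prognosis without text input at inference.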


📘 Detailed Summary

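Motivation: Preoperative prognosis of ependymoma is critical for treatment planning but challenging, because MRI lacks the semantic insights found in post-operative surgical reports. Existing multimodal methods cannot exploit this privileged text data when it is unavailable at inference, and this gap needs to be bridged.

Method: The HyperPriv-EPN framework adopts a Severed Graph Strategy, using a shared encoder to process both a Teacher graph (enriched with privileged post-surgery information) and a Student graph (restricted to pre-operation data). Through dual-stream distillation, the Student learns to hallucinate semantic community structures from visual features alone.

Result: Validated on a multi-center cohort of 311 patients, HyperPriv-EPN achieves state-of-the-art diagnostic accuracy and survival stratification, effectively transferring expert knowledge to the preoperative setting.

Conclusion: The study unlocks the value of historical post-operative data to guide the diagnosis of new patients without requiring text at inference, providing a novel framework for using privileged information in medical image analysis.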


📄 Abstract

Preoperative prognosis of Ependymoma is critical for treatment planning but challenging due to the lack of semantic insights in MRI compared to post-operative surgical reports. Existing multimodal methods fail to leverage this privileged text data when it is unavailable during inference. To bridge this gap, we propose HyperPriv-EPN, a hypergraph-based Learning Using Privileged Information (LUPI) framework. We introduce a Severed Graph Strategy, utilizing a shared encoder to process both a Teacher graph (enriched with privileged post-surgery information) and a Student graph (restricted to pre-operation data). Through dual-stream distillation, the Student learns to hallucinate semantic community structures from visual features alone. Validated on a multi-center cohort of 311 patients, HyperPriv-EPN achieves state-of-the-art diagnostic accuracy and survival stratification. This effectively transfers expert knowledge to the preoperative setting, unlocking the value of historical post-operative data to guide the diagnosis of new patients without requiring text at inference.

[28] CRoPS: A Training-Free Hallucination Mitigation Framework for Vision-Language Models

Neeraj Anand, Samyak Jha, Udbhav Bamba, Rahul Rahaman

🧩 TL;DR

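This paper proposes CRoPS, a training-free hallucination mitigation framework. It builds hallucinated models by selectively removing key text tokens and combines them through generalized contrastive decoding to capture diverse hallucination sources, significantly improving the reliability of large vision-language models.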


📘 Detailed Summary

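Motivation: Despite the rapid success of large vision-language models, their tendency to generate hallucinated content seriously undermines reliability in real-world use. Existing training-free methods have two main limitations: they rely on narrow assumptions about hallucination sources, and their effectiveness declines toward the end of generation, where hallucinations are most likely to occur. The current strategy of building hallucinated models by removing visual tokens is also insufficient, because visual information still propagates into the generated text.

Method: The paper proposes a novel way to construct a hallucinated model that captures hallucination effects by selectively removing key text tokens. It further introduces Generalized Contrastive Decoding, which integrates multiple hallucinated models to represent diverse hallucination sources. Together these ideas form CRoPS, a training-free framework that reduces hallucinations by contrasting the original model with several purpose-built hallucinated models.

Result: CRoPS improves CHAIR scores by 20% and achieves consistent gains across six benchmarks and three large vision-language model families. It outperforms state-of-the-art training-free methods on multiple evaluation metrics, demonstrating broad applicability and effectiveness.

Conclusion: The study shows that selectively removing text tokens, rather than focusing only on visual tokens, captures and mitigates hallucination effects more effectively. The multi-model integration in Generalized Contrastive Decoding offers a new way to handle the diversity of hallucination sources. CRoPS provides a practical and efficient training-free solution for improving the reliability of large vision-language models.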


📄 Abstract

Despite the rapid success of Large Vision-Language Models (LVLMs), a persistent challenge is their tendency to generate hallucinated content, undermining reliability in real-world use. Existing training-free methods address hallucinations but face two limitations: (i) they rely on narrow assumptions about hallucination sources, and (ii) their effectiveness declines toward the end of generation, where hallucinations are most likely to occur. A common strategy is to build hallucinated models by completely or partially removing visual tokens and contrasting them with the original model. Yet, this alone proves insufficient, since visual information still propagates into generated text. Building on this insight, we propose a novel hallucinated model that captures hallucination effects by selectively removing key text tokens. We further introduce Generalized Contrastive Decoding, which integrates multiple hallucinated models to represent diverse hallucination sources. Together, these ideas form CRoPS, a training-free hallucination mitigation framework that improves CHAIR scores by 20% and achieves consistent gains across six benchmarks and three LVLM families, outperforming state-of-the-art training-free methods.
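
The idea of contrasting the original model with several hallucinated models can be sketched at the logit level. The formula below extends the common contrastive decoding form, (1 + α)·original − α·hallucinated, to a weighted set of hallucinated models. It is an assumption for illustration and may differ from the Generalized Contrastive Decoding rule defined in the paper; the weights and function name are hypothetical.

```python
def contrastive_logits(original, hallucinated_list, weights):
    # original: next-token logits from the unmodified model.
    # hallucinated_list: one logit vector per hallucinated model.
    # weights: one non-negative contrast strength per hallucinated model.
    total = sum(weights)
    adjusted = []
    for i, o in enumerate(original):
        penalty = sum(w * h[i] for w, h in zip(weights, hallucinated_list))
        adjusted.append((1.0 + total) * o - penalty)
    return adjusted

original = [2.0, 1.9]                       # token 0 narrowly preferred
hallucinated = [[3.0, 0.5], [2.5, 1.0]]     # both hallucinated models favor token 0
adjusted = contrastive_logits(original, hallucinated, [0.5, 0.5])
print(adjusted.index(max(adjusted)))
```

In this toy case the token that every hallucinated model favors is demoted, so the preferred token flips from index 0 to index 1.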

[29] Grading Handwritten Engineering Exams with Multimodal Large Language Models

Janez Perš, Jon Muhovič, Andrej Košir, Boštjan Murovec

🧩 TL;DR

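This paper presents an end-to-end workflow based on multimodal large language models for automatically grading handwritten STEM exams. A multi-stage design and conditioning on a reference solution make the pipeline reliable, and on a real course quiz it reaches a mean absolute difference of about 8 points from lecturer grades.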


📘 Detailed Summary

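Motivation: Handwritten STEM exams capture open-ended reasoning and diagrams, but manual grading is slow and difficult to scale. An automated grading approach is needed that preserves the standard exam process (A4 paper, unconstrained handwriting).

Method: The end-to-end workflow uses multimodal large language models to process scanned handwritten exams, conditioned on a handwritten reference solution and a short set of grading rules provided by the lecturer. The multi-stage design includes a format/presence check to prevent grading blank answers, an ensemble of independent graders, supervisor aggregation, and rigid templates with deterministic validation to produce auditable reports.

Result: On a real course quiz, the full pipeline with state-of-the-art backends (GPT-5.2 and Gemini-3 Pro) reaches a mean absolute difference of about 8 points from lecturer grades, with low bias and an estimated manual-review trigger rate of about 17%. Ablations show that simplified prompting and removing the reference solution substantially degrade accuracy and introduce systematic over-grading.

Conclusion: The study shows that structured prompting and reference-solution conditioning are essential for automated grading, and that the multi-stage design ensures reliability. The workflow offers a scalable solution for grading handwritten STEM exams while keeping the standard exam process intact.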


📄 Abstract

Handwritten STEM exams capture open-ended reasoning and diagrams, but manual grading is slow and difficult to scale. We present an end-to-end workflow for grading scanned handwritten engineering quizzes with multimodal large language models (LLMs) that preserves the standard exam process (A4 paper, unconstrained student handwriting). The lecturer provides only a handwritten reference solution (100%) and a short set of grading rules; the reference is converted into a text-only summary that conditions grading without exposing the reference scan. Reliability is achieved through a multi-stage design with a format/presence check to prevent grading blank answers, an ensemble of independent graders, supervisor aggregation, and rigid templates with deterministic validation to produce auditable, machine-parseable reports. We evaluate the frozen pipeline in a clean-room protocol on a held-out real course quiz in Slovenian, including hand-drawn circuit schematics. With state-of-the-art backends (GPT-5.2 and Gemini-3 Pro), the full pipeline achieves an approximately 8-point mean absolute difference to lecturer grades with low bias and an estimated manual-review trigger rate of approximately 17% at D_max = 40. Ablations show that trivial prompting and removing the reference solution substantially degrade accuracy and introduce systematic over-grading, confirming that structured prompting and reference grounding are essential.
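
The ensemble-plus-review-trigger stage could be sketched as below. The abstract does not define the disagreement threshold or the aggregation rule, so this sketch assumes that the threshold is the largest tolerated spread between independent graders (on a 0 to 100 scale) and that the supervisor takes the median; both choices, and the function name, are hypothetical.

```python
from statistics import median

def aggregate_grades(grades, d_max=40):
    # grades: scores from independent graders for one answer.
    # Returns (final_grade, needs_manual_review).
    spread = max(grades) - min(grades)
    return median(grades), spread > d_max

print(aggregate_grades([70, 75, 80]))   # graders agree
print(aggregate_grades([30, 75, 80]))   # one grader disagrees strongly
```

Using the median keeps a single outlying grader from moving the final grade, while the spread check still flags that answer for a human to inspect.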

cs.CL [Back]

[30] Vision-Language Reasoning for Geolocalization: A Reinforcement Learning Approach

Biao Wu, Meng Fang, Ling Chen, Ke Xu, Tao Cheng, Jun Wang

🧩 TL;DR

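This paper proposes Geo-R, a retrieval-free, reinforcement learning-based framework for image geolocalization. It extracts structured reasoning paths from ground-truth coordinates and optimizes with coordinate-aligned rewards based on Haversine distance, yielding interpretable and scalable geolocalization.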


📘 Detailed Summary

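Motivation: Existing vision-language models for image geolocalization typically rely on synthetic reasoning annotations or external image retrieval, which limits interpretability and generalization. What is needed is a retrieval-free approach that learns structured reasoning directly from ground-truth coordinates.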

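Method: The Geo-R framework includes Chain of Region, a rule-based hierarchical reasoning paradigm that maps GPS coordinates to a hierarchy of geographic entities to generate precise supervision. It then applies a lightweight reinforcement learning strategy with a coordinate-aligned reward based on Haversine distance, optimizing the model's predictions through spatially meaningful feedback.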

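Result: Experiments show that Geo-R achieves better localization accuracy, stronger generalization, and more transparent reasoning across multiple benchmarks, supporting the effectiveness of the retrieval-free paradigm for scalable and interpretable image geolocalization.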

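Conclusion: The work bridges structured geographic reasoning and direct spatial supervision and offers a new retrieval-free paradigm for image geolocalization. The authors state that the model and code will be made publicly available to support reproducibility and further research.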


📄 Abstract

Recent advances in vision-language models have opened up new possibilities for reasoning-driven image geolocalization. However, existing approaches often rely on synthetic reasoning annotations or external image retrieval, which can limit interpretability and generalizability. In this paper, we present Geo-R, a retrieval-free framework that uncovers structured reasoning paths from existing ground-truth coordinates and optimizes geolocation accuracy via reinforcement learning. We propose the Chain of Region, a rule-based hierarchical reasoning paradigm that generates precise, interpretable supervision by mapping GPS coordinates to geographic entities (e.g., country, province, city) without relying on model-generated or synthetic labels. Building on this, we introduce a lightweight reinforcement learning strategy with coordinate-aligned rewards based on Haversine distance, enabling the model to refine predictions through spatially meaningful feedback. Our approach bridges structured geographic reasoning with direct spatial supervision, yielding improved localization accuracy, stronger generalization, and more transparent inference. Experimental results across multiple benchmarks confirm the effectiveness of Geo-R, establishing a new retrieval-free paradigm for scalable and interpretable image geolocalization. To facilitate further research and ensure reproducibility, both the model and code will be made publicly available.
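
To make the idea of a coordinate-aligned reward concrete, the sketch below computes the Haversine (great-circle) distance between a predicted and a ground-truth coordinate and turns it into a reward. The Haversine formula itself is standard; the exponential decay, the `scale_km` value, and the function names are assumptions for illustration, as the abstract does not specify how Geo-R shapes its reward.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against rounding slightly above the asin domain.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def coordinate_reward(pred, truth, scale_km=750.0):
    """Hypothetical reward in (0, 1]: 1 at zero distance, decaying with distance."""
    distance = haversine_km(pred[0], pred[1], truth[0], truth[1])
    return math.exp(-distance / scale_km)


if __name__ == "__main__":
    paris, london, tokyo = (48.8566, 2.3522), (51.5074, -0.1278), (35.6762, 139.6503)
    # A nearby guess should earn more reward than a distant one.
    print(coordinate_reward(london, paris), coordinate_reward(tokyo, paris))
```

A smooth, distance-based reward of this kind gives partial credit to near misses, which is one way to read the abstract's "spatially meaningful feedback".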

cs.AI [Back]

[31] Explicit Abstention Knobs for Predictable Reliability in Video Question Answering

Jorge Ortiz

🧩 TL;DR

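This study evaluates confidence-based selective prediction for vision-language models in video question answering. It finds that confidence thresholding provides mechanistic control over error rates in-distribution, but that this control breaks down under distribution shift.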


📘 Detailed Summary

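Motivation: High-stakes deployment of vision-language models requires selective prediction, in which the system abstains when uncertain to avoid costly errors. This study asks whether confidence-based abstention gives reliable control over error rates in video question answering, and whether that control remains robust under distribution shift.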

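Method: The study runs experiments on the NExT-QA dataset with Gemini 2.0 Flash, systematically sweeping the confidence threshold epsilon to trace the risk-coverage tradeoff curve. It compares how confidence thresholding behaves in-distribution and under distribution shift.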
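
A threshold sweep of this kind can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's evaluation code: it assumes the model answers when its confidence is at least epsilon and abstains otherwise, defines coverage as the fraction of questions answered and risk as the error rate among answered questions, and uses made-up confidences and correctness labels.

```python
def risk_coverage_curve(confidences, correct, thresholds):
    """Return a list of (epsilon, coverage, risk) tuples, one per threshold.

    The model is assumed to answer when confidence >= epsilon and abstain otherwise.
    Risk is None when nothing is answered, since the error rate is undefined there.
    """
    total = len(confidences)
    curve = []
    for eps in thresholds:
        answered = [ok for conf, ok in zip(confidences, correct) if conf >= eps]
        coverage = len(answered) / total
        risk = 1.0 - sum(answered) / len(answered) if answered else None
        curve.append((eps, coverage, risk))
    return curve


if __name__ == "__main__":
    # Hypothetical per-question confidences and correctness flags.
    confidences = [0.9, 0.8, 0.6, 0.4]
    correct = [True, True, False, False]
    for eps, coverage, risk in risk_coverage_curve(confidences, correct, [0.0, 0.5, 0.7, 0.95]):
        print(f"epsilon={eps:.2f} coverage={coverage:.2f} risk={risk}")
```

When confidence tracks correctness, as in this toy data, raising epsilon lowers risk at the cost of coverage. A control knob of this kind stops working once confidence and correctness decouple.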

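Result: The experiments yield two key findings. First, in-distribution, confidence thresholding provides mechanistic control: adjusting the threshold produces a smooth risk-coverage tradeoff and effectively reduces the error rate. Second, this control breaks down under distribution shift, indicating that existing confidence calibration methods are not sufficiently robust to distribution changes.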

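Conclusion: Although confidence thresholding controls error rates effectively in-distribution, it is challenged by distribution shift in real deployments. This underscores the need for confidence calibration methods that are more robust to distribution change, and offers key insights for reliably deploying vision-language models in high-stakes applications.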


📄 Abstract

High-stakes deployment of vision-language models (VLMs) requires selective prediction, where systems abstain when uncertain rather than risk costly errors. We investigate whether confidence-based abstention provides reliable control over error rates in video question answering, and whether that control remains robust under distribution shift. Using NExT-QA and Gemini 2.0 Flash, we establish two findings. First, confidence thresholding provides mechanistic control in-distribution. Sweeping threshold epsilon produces smooth risk-coverage tradeoffs, reducing error rates f

[32] DA-DPO: Cost-efficient Difficulty-aware Preference Optimization for Reducing MLLM Hallucinations

Longtian Qiu, Shan Ning, Chuyu Zhang, Jiaxuan Sun, Xuming He

🧩 TL;DR

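This paper proposes Difficulty-Aware Direct Preference Optimization (DA-DPO), a cost-effective framework that addresses the difficulty imbalance in preference data for multimodal large language models. It reweights preference pairs to mitigate overfitting and thereby suppress hallucinations more effectively.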


📘 Detailed Summary

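Motivation: Existing multimodal DPO methods often overfit because of the difficulty imbalance in preference data. MLLMs tend to overemphasize easily distinguishable preference pairs, which hinders fine-grained hallucination suppression and degrades overall performance.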

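Method: DA-DPO has two core components. Difficulty Estimation uses pre-trained vision-language models with complementary generative and contrastive objectives and combines their outputs through a distribution-aware voting strategy, producing robust difficulty scores without additional training. Difficulty-Aware Training reweights preference pairs by estimated difficulty, down-weighting easy samples and emphasizing hard ones to alleviate overfitting.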

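Result: Extensive experiments show that DA-DPO consistently improves multimodal preference optimization, with stronger robustness to hallucinations and better generalization on standard benchmarks, while remaining computationally efficient and requiring no new data or extra fine-tuning stages.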

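Conclusion: By balancing the learning process with a difficulty-aware mechanism, the work addresses overfitting in multimodal preference optimization and provides a practical framework for finer-grained hallucination suppression, while remaining cost-effective and computationally efficient.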


📄 Abstract

Direct Preference Optimization (DPO) has shown strong potential for mitigating hallucinations in Multimodal Large Language Models (MLLMs). However, existing multimodal DPO approaches often suffer from overfitting due to the difficulty imbalance in preference data. Our analysis shows that MLLMs tend to overemphasize easily distinguishable preference pairs, which hinders fine-grained hallucination suppression and degrades overall performance. To address this issue, we propose Difficulty-Aware Direct Preference Optimization (DA-DPO), a cost-effective framework designed to balance the learning process. DA-DPO consists of two main components: (1) Difficulty Estimation leverages pre-trained vision--language models with complementary generative and contrastive objectives, whose outputs are integrated via a distribution-aware voting strategy to produce robust difficulty scores without additional training; and (2) Difficulty-Aware Training reweights preference pairs based on their estimated difficulty, down-weighting easy samples while emphasizing harder ones to alleviate overfitting. This framework enables more effective preference optimization by prioritizing challenging examples, without requiring new data or extra fine-tuning stages. Extensive experiments demonstrate that DA-DPO consistently improves multimodal preference optimization, yielding stronger robustness to hallucinations and better generalization across standard benchmarks, while remaining computationally efficient. The project page is available at https://artanic30.github.io/project_pages/DA-DPO/.
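
The reweighting idea in Difficulty-Aware Training can be illustrated with a small, dependency-free sketch. The per-pair DPO loss below is the standard form; everything about the weighting is an assumption, because the abstract does not give DA-DPO's actual scheme. Here, difficulty scores in [0, 1] are shifted by a hypothetical `floor` and normalized to mean 1 so that harder pairs count more without changing the overall loss scale.

```python
import math


def dpo_pair_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair: -log(sigmoid(margin))."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    # -log(sigmoid(m)) == log(1 + exp(-m)); fall back to -m for very negative m.
    return math.log1p(math.exp(-margin)) if margin > -30 else -margin


def difficulty_weights(difficulty, floor=0.1):
    """Map difficulty scores in [0, 1] to weights with mean 1 (hypothetical scheme)."""
    raw = [floor + d for d in difficulty]
    mean = sum(raw) / len(raw)
    return [r / mean for r in raw]


def weighted_dpo_loss(pairs, difficulty, beta=0.1):
    """Average DPO loss over pairs, up-weighting harder pairs.

    Each pair is (logp_chosen, logp_rejected, ref_chosen, ref_rejected).
    """
    weights = difficulty_weights(difficulty)
    losses = [dpo_pair_loss(*pair, beta=beta) for pair in pairs]
    return sum(w * l for w, l in zip(weights, losses)) / len(pairs)
```

With equal difficulty scores this reduces to the ordinary mean DPO loss, so the only effect of the weights is to shift emphasis from easy pairs to hard ones.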

[33] A Vision-and-Knowledge Enhanced Large Language Model for Generalizable Pedestrian Crossing Behavior Inference

Qingwen Pu, Kun Xie, Hong Yang, Guocong Zhai

🧩 TL;DR

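This study proposes PedX-LLM, a vision-and-knowledge enhanced framework that integrates visual features, textual data, and transportation domain knowledge. It shifts pedestrian crossing behavior inference from site-specific pattern recognition to generalizable behavioral reasoning and substantially improves cross-site generalization.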


📘 Detailed Summary

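Motivation: Existing paradigms for inferring pedestrian crossing behavior, from statistical models to supervised learning methods, generalize poorly and perform inadequately at new sites. Large language models offer a shift from numerical pattern fitting to semantic, context-aware behavioral reasoning, but existing LLM applications lack domain-specific adaptation and visual context. A general reasoning framework that integrates visual information and domain knowledge is therefore needed.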

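Method: The study proposes the Pedestrian Crossing LLM (PedX-LLM) framework, which integrates LLaVA-extracted visual features with textual data and transportation domain knowledge. It fine-tunes a LLaMA-2-7B foundation model with Low-Rank Adaptation (LoRA) to reason from vision-augmented features to pedestrian crossing decisions.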

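Result: PedX-LLM achieves 82.0% balanced accuracy, outperforming the best statistical and supervised learning methods. The vision-augmented module contributes a 2.9% performance gain, and integrating domain knowledge adds a further 4.1%. On five unseen test sites, the zero-shot configuration reaches 66.9% balanced accuracy, at least 18 percentage points above the baseline data-driven methods; few-shot learning with just five validation examples raises balanced accuracy further to 72.2%.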

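Conclusion: PedX-LLM demonstrates strong cross-site generalization, confirming that vision-and-knowledge-enhanced reasoning lets the model mimic human-like decision logic and overcome the limitations of purely data-driven methods. The approach marks a shift from pattern recognition to semantic reasoning in pedestrian behavior modeling and opens a new route to understanding human behavior in intelligent transportation systems.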


📄 Abstract

Existing paradigms for inferring pedestrian crossing behavior, ranging from statistical models to supervised learning methods, demonstrate limited generalizability and perform inadequately on new sites. Recent advances in Large Language Models (LLMs) offer a shift from numerical pattern fitting to semantic, context-aware behavioral reasoning, yet existing LLM applications lack domain-specific adaptation and visual context. This study introduces Pedestrian Crossing LLM (PedX-LLM), a vision-and-knowledge enhanced framework designed to transform pedestrian crossing inference from site-specific pattern recognition to generalizable behavioral reasoning. By integrating LLaVA-extracted visual features with textual data and transportation domain knowledge, PedX-LLM fine-tunes a LLaMA-2-7B foundation model via Low-Rank Adaptation (LoRA) to infer crossing decisions. PedX-LLM achieves 82.0% balanced accuracy, outperforming the best statistical and supervised learning methods. Results demonstrate that the vision-augmented module contributes a 2.9% performance gain by capturing the built environment, and that integrating domain knowledge yields an additional 4.1% improvement. To evaluate generalizability across unseen environments, cross-site validation was conducted using site-based partitioning. The zero-shot PedX-LLM configuration achieves 66.9% balanced accuracy on five unseen test sites, outperforming the baseline data-driven methods by at least 18 percentage points. Adding just five validation examples to PedX-LLM via few-shot learning further raises the balanced accuracy to 72.2%. PedX-LLM demonstrates strong generalizability to unseen scenarios, confirming that vision-and-knowledge-enhanced reasoning enables the model to mimic human-like decision logic and overcome the limitations of purely data-driven methods.
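
Because the headline numbers are reported as balanced accuracy, a short sketch of that metric may help in reading them. Balanced accuracy is the mean of per-class recall, so a model cannot score well by always predicting the majority class. The labels and predictions below are hypothetical and are not taken from the paper's data.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        indices = [i for i, label in enumerate(y_true) if label == cls]
        hits = sum(1 for i in indices if y_pred[i] == cls)
        recalls.append(hits / len(indices))
    return sum(recalls) / len(recalls)


if __name__ == "__main__":
    # Hypothetical imbalanced sample: 8 "cross" decisions and 2 "wait" decisions.
    y_true = ["cross"] * 8 + ["wait"] * 2
    always_cross = ["cross"] * 10
    plain_accuracy = sum(t == p for t, p in zip(y_true, always_cross)) / len(y_true)
    print(plain_accuracy, balanced_accuracy(y_true, always_cross))
```

In this toy case a predictor that always outputs "cross" reaches 80% plain accuracy but only 50% balanced accuracy, which is why the metric suits imbalanced crossing data.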