cs.CV

[1] MultiFoodChat: A potential new paradigm for intelligent food quality inspection

Yue Hu, Guohang Zhuang

🧩 TL;DR

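This paper proposes MultiFoodChat, a dialogue-driven, multi-agent reasoning framework for zero-shot food recognition. It combines vision-language models and large language models to classify food without additional training, and reports superior recognition accuracy and interpretability on multiple public datasets.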


📘 Detailed Summary

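Motivation: Existing supervised models rely heavily on large labeled datasets and generalize poorly to unseen food categories, which limits the practical use of intelligent food quality inspection and dietary assessment systems. A zero-shot method that recognizes new food categories without additional training is therefore needed.

Method: The framework uses a multi-agent dialogue reasoning mechanism in which vision-language models (VLMs) and large language models (LLMs) collaborate through multi-round visual-textual dialogues. An Object Perception Token (OPT) captures fine-grained visual attributes, and an Interactive Reasoning Agent (IRA) dynamically interprets contextual cues to refine predictions.

Result: Experiments on multiple public food datasets show that MultiFoodChat achieves higher recognition accuracy than existing unsupervised and few-shot methods while remaining interpretable, supporting the framework's effectiveness and generalization ability.

Conclusion: MultiFoodChat offers a potential new paradigm for intelligent food quality inspection and analysis. Its multi-agent dialogue design enables understanding of complex food scenes without manual annotation and shows the potential of combining VLMs and LLMs for zero-shot recognition.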


📄 Abstract

Food image classification plays a vital role in intelligent food quality inspection, dietary assessment, and automated monitoring. However, most existing supervised models rely heavily on large labeled datasets and exhibit limited generalization to unseen food categories. To overcome these challenges, this study introduces MultiFoodChat, a dialogue-driven multi-agent reasoning framework for zero-shot food recognition. The framework integrates vision-language models (VLMs) and large language models (LLMs) to enable collaborative reasoning through multi-round visual-textual dialogues. An Object Perception Token (OPT) captures fine-grained visual attributes, while an Interactive Reasoning Agent (IRA) dynamically interprets contextual cues to refine predictions. This multi-agent design allows flexible and human-like understanding of complex food scenes without additional training or manual annotations. Experiments on multiple public food datasets demonstrate that MultiFoodChat achieves superior recognition accuracy and interpretability compared with existing unsupervised and few-shot methods, highlighting its potential as a new paradigm for intelligent food quality inspection and analysis.

[2] Efficient Few-Shot Learning in Remote Sensing: Fusing Vision and Vision-Language Models

Jia Yun Chua, Argyrios Zolotas, Miguel Arana-Catania

🧩 TL;DR

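This study combines traditional vision models with vision-language models to enhance remote sensing image analysis, reporting substantial gains in aircraft detection and scene understanding, particularly in few-shot learning scenarios.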


📘 Detailed Summary

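Motivation: Traditional vision models in remote sensing are constrained by the need for extensive domain-specific labelled data and by their limited ability to understand context in complex environments. Vision-language models (VLMs) remain underexplored in remote sensing, particularly given their generalist nature.

Method: The approach integrates the YOLO object detector with VLMs such as LLaVA, ChatGPT, and Gemini to achieve more accurate and context-aware image interpretation. It is evaluated on labelled and unlabelled remote sensing data as well as on degraded images.

Result: For aircraft detection and counting, MAE improves by 48.46% on average across models, especially under challenging conditions. For comprehensive understanding of remote sensing images, CLIPScore improves by 6.17%.

Conclusion: Combining traditional vision models with VLMs opens a path toward more advanced and efficient remote sensing image analysis. The approach is especially relevant to few-shot learning scenarios and to image understanding in complex environments.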


📄 Abstract

Remote sensing has become a vital tool across sectors such as urban planning, environmental monitoring, and disaster response. While the volume of data generated has increased significantly, traditional vision models are often constrained by the requirement for extensive domain-specific labelled data and their limited ability to understand the context within complex environments. Vision Language Models offer a complementary approach by integrating visual and textual data; however, their application to remote sensing remains underexplored, particularly given their generalist nature. This work investigates the combination of vision models and VLMs to enhance image analysis in remote sensing, with a focus on aircraft detection and scene understanding. The integration of YOLO with VLMs such as LLaVA, ChatGPT, and Gemini aims to achieve more accurate and contextually aware image interpretation. Performance is evaluated on both labelled and unlabelled remote sensing data, as well as degraded image scenarios which are crucial for remote sensing. The findings show an average MAE improvement of 48.46% across models in the accuracy of aircraft detection and counting, especially in challenging conditions, in both raw and degraded scenarios. A 6.17% improvement in CLIPScore for comprehensive understanding of remote sensing images is obtained. The proposed approach combining traditional vision models and VLMs paves the way for more advanced and efficient remote sensing image analysis, especially in few-shot learning scenarios.
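
The counting metric mentioned above can be illustrated with a minimal sketch. It computes the mean absolute error (MAE) between predicted and true aircraft counts and the relative improvement between two systems. The counts and the variable names (`vlm_only_counts`, `yolo_guided_counts`) are hypothetical and are not taken from the paper; the abstract does not state how the detector output is passed to the VLM.

```python
from statistics import mean

def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and true counts."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return mean(abs(p - a) for p, a in zip(predicted, actual))

def relative_improvement(baseline_mae, new_mae):
    """Percentage reduction in MAE relative to the baseline (higher is better)."""
    return 100.0 * (baseline_mae - new_mae) / baseline_mae

# Hypothetical aircraft counts for four images (not data from the paper).
true_counts = [12, 5, 8, 20]
vlm_only_counts = [9, 7, 8, 14]       # assumed: a VLM counting on its own
yolo_guided_counts = [11, 5, 9, 18]   # assumed: a VLM combined with a detector

baseline = mean_absolute_error(vlm_only_counts, true_counts)
guided = mean_absolute_error(yolo_guided_counts, true_counts)
print(f"baseline MAE={baseline:.2f}, guided MAE={guided:.2f}, "
      f"improvement={relative_improvement(baseline, guided):.1f}%")
```

Because MAE is an error, an "improvement" here means a reduction, which is why the relative change is computed as baseline minus new.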

[3] Vgent: Graph-based Retrieval-Reasoning-Augmented Generation For Long Video Understanding

Xiaoqian Shen, Wenxuan Zhang, Jun Chen, Mohamed Elhoseiny

🧩 TL;DR

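This paper proposes Vgent, a graph-based retrieval-reasoning-augmented generation framework. It uses structured graph representations and an intermediate reasoning step to address disrupted temporal dependencies and irrelevant retrieved content in long video understanding, improving the performance of large video language models.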


📘 Detailed Summary

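Motivation: Long video understanding is a major challenge for large video language models (LVLMs) because dense video tokens exceed the context window and long-term sequential information is hard to retain. Existing retrieval-augmented generation (RAG) methods, when applied to long videos, suffer from disrupted temporal dependencies and the inclusion of irrelevant information, both of which hinder accurate reasoning.

Method: Vgent introduces two key innovations. First, it represents a video as a structured graph that preserves semantic relationships across video clips, which improves retrieval. Second, it adds an intermediate reasoning step that uses structured verification to reduce retrieval noise and to explicitly aggregate relevant information across clips, yielding more accurate and context-aware responses.

Result: Across three long-video understanding benchmarks, the method improves overall performance by 3.0% to 5.4% over base models on MLVU and outperforms state-of-the-art video RAG methods by 8.6%.

Conclusion: Structured graph representations and intermediate reasoning steps effectively address key challenges in long video understanding and give video language models a new technical route for handling complex temporal information.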


📄 Abstract

Understanding and reasoning over long videos pose significant challenges for large video language models (LVLMs) due to the difficulty of processing intensive video tokens beyond the context window and retaining long-term sequential information. Retrieval-Augmented Generation (RAG) has demonstrated effectiveness in processing long context for Large Language Models (LLMs); however, applying RAG to long videos faces challenges such as disrupted temporal dependencies and inclusion of irrelevant information that can hinder accurate reasoning. To address these limitations, we propose Vgent, a novel graph-based retrieval-reasoning-augmented generation framework to enhance LVLMs for long video understanding. Our approach introduces two key innovations: (i) It represents videos by structured graphs with semantic relationships across video clips preserved to improve retrieval effectiveness. (ii) It introduces an intermediate reasoning step to mitigate the reasoning limitation of LVLMs, which leverages structured verification to reduce retrieval noise and facilitate the explicit aggregation of relevant information across clips, resulting in more accurate and context-aware responses. We comprehensively evaluate our framework with various open-source LVLMs on three long-video understanding benchmarks. Our approach yielded an overall performance improvement of $3.0\%\sim 5.4\%$ over base models on MLVU, and outperformed state-of-the-art video RAG methods by $8.6\%$. Our code is publicly available at https://xiaoqian-shen.github.io/Vgent.

[4] Capture, Canonicalize, Splat: Zero-Shot 3D Gaussian Avatars from Unstructured Phone Images

Emanuel Garbin, Guy Adam, Oded Krams, Zohar Barzelay, Eran Guendelman, Michael Schwarz, Moran Vatelmacher, Yigal Shenkman, Eli Peker, Itai Druker, Uri Patish, Yoav Blum, Max Bluvstein, Junxuan Li, Rawal Khirodkar, Shunsuke Saito

🧩 TL;DR

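This paper proposes a zero-shot pipeline that creates hyperrealistic, identity-preserving 3D avatars from a few unstructured phone images. A generative canonicalization module and a transformer-based model address the limitations of existing methods in geometric consistency and high-frequency detail.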


📘 Detailed Summary

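Motivation: Existing methods face several challenges. Single-view approaches suffer from geometric inconsistencies and hallucinations, which degrade identity preservation, while models trained on synthetic data fail to capture high-frequency details such as skin wrinkles and fine hair, which limits realism. These limitations motivate a solution that generates high-quality avatars from unstructured photos.

Method: The method makes two key contributions: a generative canonicalization module that converts multiple unstructured views into a standardized, consistent representation, and a transformer-based model trained on a large-scale dataset of Gaussian splatting avatars derived from dome captures of real people. The full "Capture, Canonicalize, Splat" pipeline produces static quarter-body avatars from unstructured photos.

Result: The pipeline produces static quarter-body avatars with compelling realism and robust identity preservation. The reported results indicate that the method captures high-frequency detail while keeping identity consistent, improving the visual quality of the avatars.

Conclusion: The work demonstrates that high-quality 3D avatars can be generated from unstructured photos and offers a new technical route for creating digital identities. Its gains in realism and identity preservation support applications such as virtual reality and digital humans.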


📄 Abstract

We present a novel, zero-shot pipeline for creating hyperrealistic, identity-preserving 3D avatars from a few unstructured phone images. Existing methods face several challenges: single-view approaches suffer from geometric inconsistencies and hallucinations, degrading identity preservation, while models trained on synthetic data fail to capture high-frequency details like skin wrinkles and fine hair, limiting realism. Our method introduces two key contributions: (1) a generative canonicalization module that processes multiple unstructured views into a standardized, consistent representation, and (2) a transformer-based model trained on a new, large-scale dataset of high-fidelity Gaussian splatting avatars derived from dome captures of real people. This "Capture, Canonicalize, Splat" pipeline produces static quarter-body avatars with compelling realism and robust identity preservation from unstructured photos.

[5] Joint Modeling of Big Five and HEXACO for Multimodal Apparent Personality-trait Recognition

Ryo Masumura, Shota Orihashi, Mana Ihori, Tomohiro Tanaka, Naoki Makishima, Taiga Yamane, Naotaka Kawata, Satoshi Suzuki, Taichi Katayama

🧩 TL;DR

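This study proposes a joint modeling method for automatically recognizing Big Five and HEXACO personality traits from multimodal human behavior. Jointly optimizing the recognition of both trait models improves multimodal apparent personality-trait recognition.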


📘 Detailed Summary

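Motivation: Previous work on multimodal apparent personality-trait recognition has mainly used the Big Five and has not addressed HEXACO, which can evaluate an Honesty-Humility trait related to displaced aggression and vengefulness. In addition, the relationships between the Big Five and HEXACO when modeled by machine learning have not been clarified; taking them into account is expected to improve awareness of multimodal human behavior.

Method: The proposed method jointly optimizes the recognition of the Big Five and HEXACO from multimodal human behavior data, exploring the relationship between the two personality models to achieve more comprehensive trait recognition.

Result: Experiments on a self-introduction video dataset show that the proposed method effectively recognizes both Big Five and HEXACO traits, supporting the joint modeling approach.

Conclusion: The study demonstrates that jointly modeling the Big Five and HEXACO is feasible for multimodal personality recognition and offers a route to more comprehensive trait assessment. Future work could further explore how the two personality models complement each other in behavior analysis.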


📄 Abstract

This paper proposes a joint modeling method of the Big Five, which has long been studied, and HEXACO, which has recently attracted attention in psychology, for automatically recognizing apparent personality traits from multimodal human behavior. Most previous studies have used the Big Five for multimodal apparent personality-trait recognition. However, no study has focused on apparent HEXACO which can evaluate an Honesty-Humility trait related to displaced aggression and vengefulness, social-dominance orientation, etc. In addition, the relationships between the Big Five and HEXACO when modeled by machine learning have not been clarified. We expect awareness of multimodal human behavior to improve by considering these relationships. The key advance of our proposed method is to optimize jointly recognizing the Big Five and HEXACO. Experiments using a self-introduction video dataset demonstrate that the proposed method can effectively recognize the Big Five and HEXACO.

[6] PIA: Deepfake Detection Using Phoneme-Temporal and Identity-Dynamic Analysis

Soumyya Kanti Datta, Tanvi Ranga, Chengzhe Sun, Siwei Lyu

🧩 TL;DR

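This paper proposes PIA, a multimodal audio-visual framework that combines language, dynamic face motion, and facial identification cues to improve detection of deepfakes produced by advanced generative models. It addresses the inability of traditional detectors to spot the temporal inconsistencies left by modern deepfake techniques.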


📘 Detailed Summary

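Motivation: Conventional deepfake detection relies mainly on manually designed phoneme-viseme alignment thresholds, basic frame-level consistency checks, or unimodal strategies, and cannot reliably identify modern deepfakes generated by GANs, diffusion models, and neural rendering techniques. These techniques produce nearly perfect individual frames but inadvertently introduce minor temporal discrepancies that traditional detectors often overlook.

Method: The paper proposes Phoneme-Temporal and Identity-Dynamic Analysis (PIA), a multimodal audio-visual framework that incorporates language, dynamic face motion, and facial identification cues. It uses phoneme sequences, lip geometry data, and advanced facial identity embeddings, and detects subtle manipulations by identifying inconsistencies across these complementary modalities.

Result: The integrated method significantly improves detection of subtle deepfake alterations. Multimodal consistency analysis identifies temporal discrepancies that traditional detectors miss, and the reported experiments show strong performance on content from modern generative models.

Conclusion: PIA addresses the limitations of conventional deepfake detection through multimodal analysis and provides an effective way to detect content produced by advanced generative techniques. The study highlights the value of combining language, dynamic motion, and identity cues for deepfake detection.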


📄 Abstract

The rise of manipulated media has made deepfakes a particularly insidious threat, involving various generative manipulations such as lip-sync modifications, face-swaps, and avatar-driven facial synthesis. Conventional detection methods, which predominantly depend on manually designed phoneme-viseme alignment thresholds, fundamental frame-level consistency checks, or a unimodal detection strategy, inadequately identify modern-day deepfakes generated by advanced generative models such as GANs, diffusion models, and neural rendering techniques. These advanced techniques generate nearly perfect individual frames yet inadvertently create minor temporal discrepancies frequently overlooked by traditional detectors. We present a novel multimodal audio-visual framework, Phoneme-Temporal and Identity-Dynamic Analysis (PIA), incorporating language, dynamic face motion, and facial identification cues to address these limitations. We utilize phoneme sequences, lip geometry data, and advanced facial identity embeddings. This integrated method significantly improves the detection of subtle deepfake alterations by identifying inconsistencies across multiple complementary modalities. Code is available at https://github.com/skrantidatta/PIA

[7] Watermarking for Factuality: Guiding Vision-Language Models Toward Truth via Tri-layer Contrastive Decoding

Kyungryul Back, Seongbeom Park, Milim Kim, Mincheol Kwon, SangHyeok Lee, Hyunyoung Lee, Junhee Cho, Seunghyun Park, Jinkyu Kim

🧩 TL;DR

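This paper proposes a training-free, tri-layer contrastive decoding method with watermarking. It selects a mature layer and an amateur layer, identifies a visually well-grounded pivot layer, and applies tri-layer contrastive decoding to reduce hallucinations in large vision-language models, achieving state-of-the-art results on several benchmarks.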


📘 Detailed Summary

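Motivation: Large vision-language models (LVLMs) perform well on multimodal tasks but remain prone to hallucinations. They often rely too heavily on a single modality or memorize training data without properly grounding their outputs, which limits their reliability and practical usefulness.

Method: The proposed training-free method proceeds in three steps: it first selects a mature layer and an amateur layer among the decoding layers, then uses a watermark-related question to identify a visually well-grounded pivot layer, and finally applies tri-layer contrastive decoding to generate the output.

Result: Experiments on public benchmarks such as POPE, MME, and AMBER show that the method achieves state-of-the-art performance in reducing hallucinations in LVLMs and produces more visually grounded responses.

Conclusion: The tri-layer contrastive decoding mechanism improves the reliability of LVLM outputs and provides an efficient, training-free way to reduce hallucinations.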


📄 Abstract

Large Vision-Language Models (LVLMs) have recently shown promising results on various multimodal tasks, even achieving human-comparable performance in certain cases. Nevertheless, LVLMs remain prone to hallucinations -- they often rely heavily on a single modality or memorize training data without properly grounding their outputs. To address this, we propose a training-free, tri-layer contrastive decoding with watermarking, which proceeds in three steps: (1) select a mature layer and an amateur layer among the decoding layers, (2) identify a pivot layer using a watermark-related question to assess whether the layer is visually well-grounded, and (3) apply tri-layer contrastive decoding to generate the final output. Experiments on public benchmarks such as POPE, MME and AMBER demonstrate that our method achieves state-of-the-art performance in reducing hallucinations in LVLMs and generates more visually grounded responses.
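
To make the idea of contrasting three layers concrete, here is a minimal sketch in plain Python. The abstract does not give the scoring formula, so the combination rule in `tri_layer_scores`, the weights `alpha` and `beta`, and all logit values are illustrative assumptions, not the authors' method. The sketch only shows the general pattern of contrastive decoding: tokens favored by the better-grounded layers and disfavored by the amateur layer are boosted.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]

def tri_layer_scores(mature, pivot, amateur, alpha=1.0, beta=1.0):
    """Assumed rule: mature log-prob plus weighted contrasts against the amateur layer."""
    lm, lp, la = log_softmax(mature), log_softmax(pivot), log_softmax(amateur)
    return [m + alpha * (m - a) + beta * (p - a) for m, p, a in zip(lm, lp, la)]

# Hypothetical next-token logits read out from three layers.
vocab = ["cat", "dog", "car"]
mature_logits = [1.9, 2.0, 0.1]    # final layer slightly prefers "dog"
pivot_logits = [2.5, 0.5, 0.0]     # assumed visually grounded layer prefers "cat"
amateur_logits = [0.5, 2.0, 0.2]   # early layer strongly prefers "dog"

scores = tri_layer_scores(mature_logits, pivot_logits, amateur_logits)
best = vocab[max(range(len(vocab)), key=scores.__getitem__)]
print(best)
```

In this toy input the mature layer alone would pick "dog", while the combined score selects "cat". Practical contrastive decoding schemes usually also restrict the candidate set to tokens the mature layer finds plausible; that step is omitted here for brevity.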

[8] Vision-Centric Activation and Coordination for Multimodal Large Language Models

Yunnan Wang, Fan Lu, Kecheng Zheng, Ziyuan Huang, Ziqiang Li, Wenjun Zeng, Xin Jin

🧩 TL;DR

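This paper proposes VaCo, which optimizes the representations of multimodal large language models (MLLMs) through vision-centric activation and coordination across multiple vision foundation models, improving MLLM performance on visual understanding tasks.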


📘 Detailed Summary

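Motivation: Mainstream MLLMs are supervised only by next-token prediction on textual tokens and neglect vision-centric information that is essential for analytical abilities, which limits their visual understanding.

Method: VaCo adds learnable Modular Task Queries (MTQs) and Visual Alignment Layers (VALs) to activate specific visual signals, and uses a Token Gateway Mask (TGM) to coordinate representation conflicts across different vision foundation models (VFMs), unifying the optimization of textual and visual outputs.

Result: Extensive experiments show that VaCo significantly improves the performance of different MLLMs on various benchmarks, demonstrating strong visual comprehension.

Conclusion: Vision-centric activation and coordination effectively strengthen the visual understanding of MLLMs and suggest a new optimization direction for multimodal representation learning.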


📄 Abstract

Multimodal large language models (MLLMs) integrate image features from visual encoders with LLMs, demonstrating advanced comprehension capabilities. However, mainstream MLLMs are solely supervised by the next-token prediction of textual tokens, neglecting critical vision-centric information essential for analytical abilities. To tackle this dilemma, we introduce VaCo, which optimizes MLLM representations through Vision-Centric activation and Coordination from multiple vision foundation models (VFMs). VaCo introduces visual discriminative alignment to integrate task-aware perceptual features extracted from VFMs, thereby unifying the optimization of both textual and visual outputs in MLLMs. Specifically, we incorporate the learnable Modular Task Queries (MTQs) and Visual Alignment Layers (VALs) into MLLMs, activating specific visual signals under the supervision of diverse VFMs. To coordinate representation conflicts across VFMs, the crafted Token Gateway Mask (TGM) restricts the information flow among multiple groups of MTQs. Extensive experiments demonstrate that VaCo significantly improves the performance of different MLLMs on various benchmarks, showcasing its superior capabilities in visual comprehension.

[9] Spatial Preference Rewarding for MLLMs Spatial Understanding

Han Qiu, Peng Gao, Lewei Lu, Xiaoqin Zhang, Ling Shao, Shijian Lu

🧩 TL;DR

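This paper proposes SPR (Spatial Preference Rewarding), which strengthens the spatial understanding of multimodal large language models by rewarding detailed responses with precise localization. It improves performance on standard benchmarks with minimal training overhead.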


📘 Detailed Summary

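Motivation: Multimodal large language models (MLLMs) show promising spatial understanding but fall short in fine-grained spatial perception, such as generating detailed region descriptions or accurately localizing objects. Existing approaches focus on fitting pre-annotated instruction data and lack direct supervision of the model's actual responses, so they often fail to meet users' needs for fine-grained spatial understanding.

Method: Using randomly selected image regions and the model's own region descriptions, SPR introduces semantic and localization scores to evaluate both text quality and localization quality. It refines the model's descriptions to improve localization accuracy, then pairs the best-scored refinement with the lowest-scored initial description for direct preference optimization, strengthening fine-grained alignment with the visual input.

Result: Extensive experiments on standard referring and grounding benchmarks show that SPR effectively improves MLLM spatial understanding with minimal training overhead, with gains on multiple evaluation metrics.

Conclusion: Through direct preference optimization, SPR addresses the limitations of MLLMs in fine-grained spatial understanding and offers an efficient, scalable way to strengthen spatial perception.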


📄 Abstract

Multimodal large language models (MLLMs) have demonstrated promising spatial understanding capabilities, such as referencing and grounding object descriptions. Despite their successes, MLLMs still fall short in fine-grained spatial perception abilities, such as generating detailed region descriptions or accurately localizing objects. Additionally, they often fail to respond to the user's requirements for desired fine-grained spatial understanding. This issue might arise because existing approaches primarily focus on tuning MLLMs to model pre-annotated instruction data to inject spatial knowledge, without direct supervision of MLLMs' actual responses. We address this issue with SPR, a Spatial Preference Rewarding approach that enhances MLLMs' spatial capabilities by rewarding MLLMs' detailed responses with precise object localization over vague or inaccurate responses. With randomly selected image regions and region descriptions from MLLMs, SPR introduces semantic and localization scores to comprehensively evaluate the text quality and localization quality in MLLM-generated descriptions. We also refine the MLLM descriptions with better localization accuracy and pair the best-scored refinement with the initial descriptions of the lowest score for direct preference optimization, thereby enhancing fine-grained alignment with visual input. Extensive experiments over standard referring and grounding benchmarks show that SPR improves MLLM spatial understanding capabilities effectively with minimal overhead in training. Data and code will be released at https://github.com/hanqiu-hq/SPR
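
The pairing step described above (best-scored refinement as the preferred response, lowest-scored initial description as the rejected one) can be sketched as follows. The `Description` class, the score ranges, and the equal-weight sum in `total` are assumptions for illustration; the abstract does not specify how the semantic and localization scores are combined.

```python
from dataclasses import dataclass

@dataclass
class Description:
    text: str
    semantic_score: float      # text quality, assumed to lie in [0, 1]
    localization_score: float  # localization quality, assumed to lie in [0, 1]

    @property
    def total(self) -> float:
        # Assumption: equal-weight sum of the two scores.
        return self.semantic_score + self.localization_score

def build_preference_pair(initial, refined):
    """Return (chosen, rejected): best refinement vs. lowest-scored initial description."""
    chosen = max(refined, key=lambda d: d.total)
    rejected = min(initial, key=lambda d: d.total)
    return chosen, rejected

# Hypothetical descriptions of one image region with made-up scores.
initial = [
    Description("a dog somewhere on the left", 0.6, 0.3),
    Description("an animal", 0.4, 0.1),
]
refined = [
    Description("a brown dog lying at the lower left, next to a red ball", 0.9, 0.8),
    Description("a dog at the left edge", 0.7, 0.6),
]

chosen, rejected = build_preference_pair(initial, refined)
print("chosen:", chosen.text)
print("rejected:", rejected.text)
```

The resulting (chosen, rejected) pair is the kind of input a direct preference optimization trainer consumes; the training step itself is outside the scope of this sketch.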

[10] DOS: Directional Object Separation in Text Embeddings for Multi-Object Image Generation

Dongnam Byun, Jungwon Park, Jumgmin Ko, Changin Choi, Wonjong Rhee

🧩 TL;DR

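This paper proposes DOS, which modifies CLIP text embeddings to improve text-to-image generation for multi-object prompts. It reduces object neglect and object mixing, and in human evaluations it receives 26.24% to 43.04% more votes than four competing methods.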


📘 Detailed Summary

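Motivation: Current text-to-image models struggle with prompts containing multiple objects, often neglecting or mixing objects. Failures are most common in four scenarios: similar shapes, similar textures, dissimilar background biases, and many objects. This limits the reliability of these models in practice.

Method: Motivated by two key observations about CLIP embeddings, DOS modifies three types of CLIP text embeddings before passing them to the text-to-image model, improving object separation in multi-object image generation.

Result: DOS consistently improves the success rate of multi-object image generation and reduces object mixing. In human evaluations it significantly outperforms four competing methods, receiving 26.24% to 43.04% more votes across four benchmarks.

Conclusion: DOS is a practical and effective solution to key challenges in multi-object image generation, improving generation quality by changing how CLIP embeddings are processed.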


📄 Abstract

Recent progress in text-to-image (T2I) generative models has led to significant improvements in generating high-quality images aligned with text prompts. However, these models still struggle with prompts involving multiple objects, often resulting in object neglect or object mixing. Through extensive studies, we identify four problematic scenarios, Similar Shapes, Similar Textures, Dissimilar Background Biases, and Many Objects, where inter-object relationships frequently lead to such failures. Motivated by two key observations about CLIP embeddings, we propose DOS (Directional Object Separation), a method that modifies three types of CLIP text embeddings before passing them into text-to-image models. Experimental results show that DOS consistently improves the success rate of multi-object image generation and reduces object mixing. In human evaluations, DOS significantly outperforms four competing methods, receiving 26.24%-43.04% more votes across four benchmarks. These results highlight DOS as a practical and effective solution for improving multi-object image generation.

[11] Noise Projection: Closing the Prompt-Agnostic Gap Behind Text-to-Image Misalignment in Diffusion Models

Yunze Tong, Didi Zhu, Zijing Hu, Jinluan Yang, Ziyu Zhao

🧩 TL;DR

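This paper proposes a noise projector that applies text-conditioned refinement to the initial noise before denoising. It maps the noise to a prompt-aware counterpart that better matches the training distribution, addressing the training-inference mismatch in text-to-image generation and improving text-image alignment.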


📘 Detailed Summary

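Motivation: The paper targets a training-inference mismatch in text-to-image generation. During training, prompt-conditioned noises lie in a prompt-specific subset of the latent space, whereas at inference the noise is drawn from a prompt-agnostic Gaussian prior. This distribution gap causes generated images to align poorly with the prompt.

Method: The framework first samples noises and obtains token-level feedback on their corresponding images from a vision-language model, then distills these signals into a reward model, and finally optimizes the noise projector via quasi-direct preference optimization. The design requires no reference images or handcrafted priors and has a low inference cost.

Result: Extensive experiments show that prompt-aware noise projection improves text-image alignment across diverse prompts. Compared with multi-sample selection, it needs only a single forward pass, improving alignment while preserving generation diversity.

Conclusion: The study identifies the training-inference distribution mismatch as a key cause of misalignment in text-to-image generation. The proposed noise projection improves generation quality without modifying the original diffusion model, which makes it practical and extensible.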


📄 Abstract

In text-to-image generation, different initial noises induce distinct denoising paths with a pretrained Stable Diffusion (SD) model. While this pattern could output diverse images, some of them may fail to align well with the prompt. Existing methods alleviate this issue either by altering the denoising dynamics or by drawing multiple noises and conducting post-selection. In this paper, we attribute the misalignment to a training-inference mismatch: during training, prompt-conditioned noises lie in a prompt-specific subset of the latent space, whereas at inference the noise is drawn from a prompt-agnostic Gaussian prior. To close this gap, we propose a noise projector that applies text-conditioned refinement to the initial noise before denoising. Conditioned on the prompt embedding, it maps the noise to a prompt-aware counterpart that better matches the distribution observed during SD training, without modifying the SD model. Our framework consists of these steps: we first sample some noises and obtain token-level feedback for their corresponding images from a vision-language model (VLM), then distill these signals into a reward model, and finally optimize the noise projector via a quasi-direct preference optimization. Our design has two benefits: (i) it requires no reference images or handcrafted priors, and (ii) it incurs small inference cost, replacing multi-sample selection with a single forward pass. Extensive experiments further show that our prompt-aware noise projection improves text-image alignment across diverse prompts.

[12] PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model

Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, Yue Zhang, Yubo Zhang, Handong Zheng, Jing Zhang, Jun Zhang, Yi Liu, Dianhai Yu, Yanjun Ma

🧩 TL;DR

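This report proposes PaddleOCR-VL, a SOTA and resource-efficient model for document parsing. Its core is PaddleOCR-VL-0.9B, a compact yet powerful vision-language model that combines a dynamic-resolution visual encoder with a language model for accurate element recognition.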


📘 Detailed Summary

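Motivation: Current document parsing systems are limited in multilingual support and in recognizing complex elements, especially in resource-constrained deployments. A solution is needed that keeps performance high while minimizing resource consumption.

Method: The model combines a NaViT-style dynamic-resolution visual encoder with the ERNIE-4.5-0.3B language model. It supports 109 languages and efficiently recognizes complex document elements such as text, tables, formulas, and charts.

Result: On widely used public benchmarks and in-house benchmarks, PaddleOCR-VL achieves SOTA performance in both page-level document parsing and element-level recognition. It significantly outperforms existing solutions, is strongly competitive with top-tier VLMs, and delivers fast inference.

Conclusion: PaddleOCR-VL shows that high performance and resource efficiency can be achieved together. Its compact design and fast inference make it well suited to real-world deployment.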


📄 Abstract

In this report, we propose PaddleOCR-VL, a SOTA and resource-efficient model tailored for document parsing. Its core component is PaddleOCR-VL-0.9B, a compact yet powerful vision-language model (VLM) that integrates a NaViT-style dynamic resolution visual encoder with the ERNIE-4.5-0.3B language model to enable accurate element recognition. This innovative model efficiently supports 109 languages and excels in recognizing complex elements (e.g., text, tables, formulas, and charts), while maintaining minimal resource consumption. Through comprehensive evaluations on widely used public benchmarks and in-house benchmarks, PaddleOCR-VL achieves SOTA performance in both page-level document parsing and element-level recognition. It significantly outperforms existing solutions, exhibits strong competitiveness against top-tier VLMs, and delivers fast inference speeds. These strengths make it highly suitable for practical deployment in real-world scenarios.

[13] Towards Generalist Intelligence in Dentistry: Vision Foundation Models for Oral and Maxillofacial Radiology

Xinrui Huang, Fan Xiao, Dongming He, Anqi Gao, Dandan Li, Xiaofan Zhang, Shaoting Zhang, Xudong Wang

🧩 TL;DR

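This paper introduces DentVFM, the first family of vision foundation models designed for dentistry. Trained with self-supervised learning on a large dataset of approximately 1.6 million multi-modal radiographic images, it improves the generalization, label efficiency, and scalability of dental AI.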


📘 Detailed Summary

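Motivation: Oral and maxillofacial radiology is vital to dental healthcare, but radiographic image interpretation is limited by a shortage of trained professionals. Existing dental AI systems are restricted by their single-modality focus, task-specific design, and reliance on costly labeled data, which hinders generalization across diverse clinical scenarios.

Method: DentVFM is a family of vision foundation models with 2D and 3D variants based on the Vision Transformer architecture. It is trained with self-supervised learning on DentVista, a dataset of approximately 1.6 million multi-modal radiographic images. The authors also introduce DentBench, a comprehensive benchmark covering eight dental subspecialties.

Result: DentVFM shows strong generalist intelligence, generalizing robustly to diverse dental tasks such as disease diagnosis, treatment analysis, biomarker identification, and anatomical landmark detection and segmentation. It significantly outperforms supervised, self-supervised, and weakly supervised baselines, and it enables cross-modality diagnostics that are more reliable than those of experienced dentists when conventional imaging is unavailable.

Conclusion: DentVFM sets a new paradigm for dental AI, offering a scalable, adaptable, and label-efficient model that can improve intelligent dental healthcare and address critical gaps in global oral healthcare. It moves dental AI from task-specific systems toward generalist intelligence.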


📄 Abstract

Oral and maxillofacial radiology plays a vital role in dental healthcare, but radiographic image interpretation is limited by a shortage of trained professionals. While AI approaches have shown promise, existing dental AI systems are restricted by their single-modality focus, task-specific design, and reliance on costly labeled data, hindering their generalization across diverse clinical scenarios. To address these challenges, we introduce DentVFM, the first family of vision foundation models (VFMs) designed for dentistry. DentVFM generates task-agnostic visual representations for a wide range of dental applications and uses self-supervised learning on DentVista, a large curated dental imaging dataset with approximately 1.6 million multi-modal radiographic images from various medical centers. DentVFM includes 2D and 3D variants based on the Vision Transformer (ViT) architecture. To address gaps in dental intelligence assessment and benchmarks, we introduce DentBench, a comprehensive benchmark covering eight dental subspecialties, more diseases, imaging modalities, and a wide geographical distribution. DentVFM shows impressive generalist intelligence, demonstrating robust generalization to diverse dental tasks, such as disease diagnosis, treatment analysis, biomarker identification, and anatomical landmark detection and segmentation. Experimental results indicate DentVFM significantly outperforms supervised, self-supervised, and weakly supervised baselines, offering superior generalization, label efficiency, and scalability. Additionally, DentVFM enables cross-modality diagnostics, providing more reliable results than experienced dentists in situations where conventional imaging is unavailable. DentVFM sets a new paradigm for dental AI, offering a scalable, adaptable, and label-efficient model to improve intelligent dental healthcare and address critical gaps in global oral healthcare.

[14] Acquisition of interpretable domain information during brain MR image harmonization for content-based image retrieval

Keima Abe, Hayato Muraki, Shuhei Tomoshige, Kenichi Oishi, Hitoshi Iyatomi

🧩 TL;DR

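This paper proposes PL-SE-ADA, a framework that uses a pseudo-linear-style encoder and adversarial domain adaptation for domain harmonization and interpretable representation learning on brain MR images, preserving disease-relevant information while offering high interpretability.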


📘 Detailed Summary

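Motivation: Medical images such as MR scans often exhibit domain shifts caused by scanner and protocol differences, which degrade machine learning performance in tasks such as disease classification. Existing methods achieve strong results by disentangling the latent space but lack the interpretability that medical applications require, leaving practical issues unresolved.

Method: PL-SE-ADA comprises two encoders, f_E and f_SE, that extract the domain-invariant features z_u and the domain-specific features z_d respectively, together with a decoder f_D and a domain predictor g_D. Beyond adversarial training between the encoder and the domain predictor, the model learns to reconstruct the input image x as the sum of the reconstructions from z_u and z_d, ensuring both harmonization and information preservation.

Result: Compared with prior methods, PL-SE-ADA achieves equal or better performance in image reconstruction, disease classification, and domain recognition. It also enables visualization of both domain-independent brain features and domain-specific components, making the whole framework highly interpretable.

Conclusion: The study shows that high performance and high interpretability can be achieved together in medical image harmonization. PL-SE-ADA provides an intuitive view of domain-invariant and domain-specific features, supporting more reliable medical AI applications.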


📄 Abstract

Medical images like MR scans often show domain shifts across imaging sites due to scanner and protocol differences, which degrade machine learning performance in tasks such as disease classification. Domain harmonization is thus a critical research focus. Recent approaches encode brain images $\boldsymbol{x}$ into a low-dimensional latent space $\boldsymbol{z}$, then disentangle it into $\boldsymbol{z_u}$ (domain-invariant) and $\boldsymbol{z_d}$ (domain-specific), achieving strong results. However, these methods often lack interpretability (an essential requirement in medical applications), leaving practical issues unresolved. We propose Pseudo-Linear-Style Encoder Adversarial Domain Adaptation (PL-SE-ADA), a general framework for domain harmonization and interpretable representation learning that preserves disease-relevant information in brain MR images. PL-SE-ADA includes two encoders $f_E$ and $f_{SE}$ to extract $\boldsymbol{z_u}$ and $\boldsymbol{z_d}$, a decoder $f_D$ to reconstruct the image, and a domain predictor $g_D$. Beyond adversarial training between the encoder and domain predictor, the model learns to reconstruct the input image $\boldsymbol{x}$ by summing reconstructions from $\boldsymbol{z_u}$ and $\boldsymbol{z_d}$, ensuring both harmonization and informativeness. Compared to prior methods, PL-SE-ADA achieves equal or better performance in image reconstruction, disease classification, and domain recognition. It also enables visualization of both domain-independent brain features and domain-specific components, offering high interpretability across the entire framework.
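
Written out with the abstract's own symbols, the additive reconstruction can be read as follows. This is one plausible reading: it assumes the single decoder $f_D$ is applied to each latent code separately, and the abstract does not state the exact loss used to enforce the approximation.

$$
\boldsymbol{z_u} = f_E(\boldsymbol{x}), \qquad \boldsymbol{z_d} = f_{SE}(\boldsymbol{x}), \qquad \hat{\boldsymbol{x}} = f_D(\boldsymbol{z_u}) + f_D(\boldsymbol{z_d}) \approx \boldsymbol{x}
$$

Under this reading, each term of the sum is itself an image that can be displayed, which would account for the visualization of domain-independent and domain-specific components mentioned in the abstract.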

[15] Exploring Cross-Modal Flows for Few-Shot Learning

Ziqi Jiang, Yanghao Wang, Long Chen

🧩 TL;DR

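This paper proposes Flow Matching Alignment (FMA), described as the first model-agnostic multi-step adjustment approach. It learns a cross-modal velocity field to handle complex datasets where features of different modalities are highly entangled, achieving more precise and robust alignment than one-step PEFT methods.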


📘 Detailed Summary

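Motivation: Existing parameter-efficient fine-tuning (PEFT) methods perform only a one-step adjustment, which is insufficient for complex datasets with highly entangled features and cannot achieve adequate cross-modal alignment. A method capable of multi-step rectification is therefore needed.

Method: FMA first uses a fixed coupling strategy to keep categories in correspondence during training, then applies a noise augmentation strategy to alleviate data scarcity, and finally uses an early-stopping solver that terminates the transformation process earlier to improve both efficiency and accuracy.

Result: Across multiple benchmarks and backbones, FMA consistently yields significant performance gains, particularly on challenging datasets, supporting the value of its multi-step rectification ability.

Conclusion: FMA achieves more precise cross-modal alignment through multi-step adjustment and offers a new approach to parameter-efficient fine-tuning in complex scenarios, improving performance on difficult datasets while remaining efficient.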


📄 Abstract

Aligning features from different modalities is one of the most fundamental challenges for cross-modal tasks. Although pre-trained vision-language models can achieve a general alignment between image and text, they often require parameter-efficient fine-tuning (PEFT) for further adjustment. Today's PEFT methods (e.g., prompt tuning, LoRA-based, or adapter-based) always selectively fine-tune a subset of parameters, which can slightly adjust either visual or textual features, and avoid overfitting. In this paper, we are the first to highlight that all existing PEFT methods perform one-step adjustment. It is insufficient for complex (or difficult) datasets, where features of different modalities are highly entangled. To this end, we propose the first model-agnostic multi-step adjustment approach by learning a cross-modal velocity field: Flow Matching Alignment (FMA). Specifically, to ensure the correspondence between categories during training, we first utilize a fixed coupling strategy. Then, we propose a noise augmentation strategy to alleviate the data scarcity issue. Finally, we design an early-stopping solver, which terminates the transformation process earlier, improving both efficiency and accuracy. Compared with one-step PEFT methods, FMA has the multi-step rectification ability to achieve more precise and robust alignment. Extensive results have demonstrated that FMA can consistently yield significant performance gains across various benchmarks and backbones, particularly on challenging datasets.
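
A minimal sketch of what multi-step adjustment with early stopping can look like: a feature is moved along a velocity field with fixed Euler steps, and the integration may be cut off before the final step. Everything here is illustrative. `toy_velocity` stands in for a learned network, the direction of transport (image feature toward text feature) is an assumption, and the abstract does not describe the actual solver or stopping rule.

```python
def euler_transport(x, velocity, num_steps=10, stop_after=None):
    """Integrate dx/dt = velocity(x, t) from t=0 using fixed-size Euler steps.

    stop_after limits how many of the num_steps are taken (early stopping).
    """
    steps_to_run = num_steps if stop_after is None else min(stop_after, num_steps)
    dt = 1.0 / num_steps
    for i in range(steps_to_run):
        t = i * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Hypothetical 2-D features; real features would be high-dimensional embeddings.
image_feature = [0.0, 0.0]
text_feature = [1.0, 2.0]

def toy_velocity(x, t):
    # Stand-in for a learned velocity field: the constant velocity of a
    # straight path from image_feature to text_feature.
    return [b - a for a, b in zip(image_feature, text_feature)]

full = euler_transport(image_feature, toy_velocity, num_steps=10)
early = euler_transport(image_feature, toy_velocity, num_steps=10, stop_after=6)
print("full:", full)
print("early:", early)
```

With this straight-line velocity, running all steps reaches the text feature, while stopping after 6 of 10 steps leaves the feature 60% of the way along the path. A one-step method corresponds to `num_steps=1`.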

[16] Consistent text-to-image generation via scene de-contextualization

Song Tang, Peihao Gong, Kunyu Li, Kai Guo, Boyu Wang, Mao Ye, Jianwei Zhang, Xiatian Zhu

🧩 TL;DR

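This paper proposes Scene De-Contextualization (SDeC), a training-free prompt embedding editing method that suppresses the scene-identity correlation inherent in text-to-image models, improving identity consistency across scenes. It does not require knowing all target scenes in advance, which makes it highly flexible in practice.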


📘 Detailed Summary

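Motivation: Existing consistent text-to-image (T2I) generation methods often fail to preserve identity across scenes because of identity (ID) shift, and they typically rely on the unrealistic assumption that all target scenes are known in advance. The paper identifies scene contextualization as a key source of the problem: a native correlation between subject and scene context that arises naturally as T2I models fit the training distribution of natural images.

Method: SDeC imposes an inversion of the T2I model's built-in scene contextualization to suppress the latent scene-ID correlation within the ID prompt's embedding. It quantifies SVD directional stability to adaptively re-weight the corresponding eigenvalues, editing the prompt embedding efficiently without training. A key feature is per-scene use, with no need for prior access to all target scenes.

Result: Experiments show that SDeC significantly enhances identity preservation while maintaining scene diversity, supporting both its theoretical framework and its practical feasibility.

Conclusion: The paper formally proves the near-universality of the scene-ID correlation and derives theoretical bounds on its strength, giving a new theoretical view of ID shift in T2I generation. SDeC is a flexible and general solution for real-world applications where prior scene knowledge is unavailable.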


📄 Abstract

Consistent text-to-image (T2I) generation seeks to produce identity-preserving images of the same subject across diverse scenes, yet it often fails due to a phenomenon called identity (ID) shift. Previous methods have tackled this issue, but typically rely on the unrealistic assumption of knowing all target scenes in advance. This paper reveals that a key source of ID shift is the native correlation between subject and scene context, called scene contextualization, which arises naturally as T2I models fit the training distribution of vast natural images. We formally prove the near-universality of this scene-ID correlation and derive theoretical bounds on its strength. On this basis, we propose a novel, efficient, training-free prompt embedding editing approach, called Scene De-Contextualization (SDeC), that imposes an inversion process of T2I's built-in scene contextualization. Specifically, it identifies and suppresses the latent scene-ID correlation within the ID prompt's embedding by quantifying the SVD directional stability to adaptively re-weight the corresponding eigenvalues. Critically, SDeC allows for per-scene use (one scene per prompt) without requiring prior access to all target scenes. This makes it a highly flexible and general solution well-suited to real-world applications where such prior knowledge is often unavailable or varies over time. Experiments demonstrate that SDeC significantly enhances identity preservation while maintaining scene diversity.

[17] Talking Points: Describing and Localizing Pixels

Matan Rusanovsky, Shimon Malnick, Shai Avidan

🧩 TL;DR

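This paper proposes a framework for pixel-level vision-language understanding. Two complementary components, a Point Descriptor and a Point Localizer, let the model describe pixel-level keypoints in natural language and localize them from such descriptions, filling a gap in existing models.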


📘 Detailed Summary

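Motivation: Vision-language models have achieved remarkable success in cross-modal understanding, but they remain limited to object-level or region-level grounding and cannot comprehend pixel-precise keypoints through natural language. This work targets that gap, aiming to localize pixel-level keypoints from free-form language descriptions.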

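Method: The framework has two complementary components: a Point Descriptor that generates rich, contextual keypoint descriptions, and a Point Localizer that regresses precise pixel coordinates from them. Unlike prior work that relies on templated prompts or keypoint names, it produces free-form, multi-scale descriptions ranging from scene-level context to the visual features around the keypoint. To address the lack of training data, the authors build LlamaPointInPart, a dataset of 20K+ image-keypoint-description triplets, and optimize the Point Descriptor on AP-10K via GRPO, using the frozen Point Localizer as a reward model to maximize localization accuracy.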

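Result: On LlamaPointInPart, the framework outperforms baseline models. The authors establish a new evaluation protocol that judges description quality by the distance between the localizer's predicted point and the ground-truth point, rather than by comparing text descriptions directly. The framework also generalizes well across categories.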

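Conclusion: The bidirectional framework opens applications in both keypoint-guided image understanding and language-guided precise localization. By coupling natural-language description with pixel-level localization, it pushes vision-language models toward finer-grained understanding. The publicly released code and dataset provide a basis for follow-up research.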


📄 Abstract

Vision-language models have achieved remarkable success in cross-modal understanding. Yet, these models remain limited to object-level or region-level grounding, lacking the capability for pixel-precise keypoint comprehension through natural language. We introduce a novel framework for pixel-level grounding. The framework consists of two complementary components: a Point Descriptor that generates rich, contextual descriptions of individual keypoints, and a Point Localizer that regresses precise pixel coordinates from these descriptions. Unlike prior work that relies on templated prompts or keypoint names, our approach produces free-form, coarse-to-fine descriptions that situate keypoints within their visual context. Since there is no available dataset to train such a system, we introduce LlamaPointInPart, a carefully curated dataset of 20K+ image-keypoint-description triplets synthesized from multiple vision-language models, capturing multi-scale information from scene-level context to visual features around the keypoint. For cross-category generalization, we optimize the Point Descriptor on AP-10K via GRPO, using the frozen Point Localizer as a reward model to produce descriptions that maximize localization accuracy. To evaluate our results, we establish a new evaluation protocol. Instead of comparing the text description produced by our method to the ground truth, we use the localizer to determine how close the predicted point is to the ground-truth point. Experiments demonstrate superior performance compared to baseline models on LlamaPointInPart. The bidirectional nature of our framework should enable future applications in both keypoint-guided image understanding and language-guided precise localization. Our code and dataset are publicly available at https://github.com/matanr/Talking_Points.
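
The localizer-based evaluation described above can be sketched as a point-distance metric. This is a minimal illustration, not the authors' protocol: normalizing by the longer bounding-box side and the 0.1 threshold (a PCK-style score) are assumptions, and the function names are hypothetical.

```python
import math

def localization_error(pred, gt, bbox_wh):
    """Pixel distance between predicted and ground-truth points,
    normalized by the longer bounding-box side (an assumed convention)."""
    dist = math.hypot(pred[0] - gt[0], pred[1] - gt[1])
    return dist / max(bbox_wh)

def pck(preds, gts, bbox_whs, threshold=0.1):
    """Fraction of points whose normalized error is within the threshold."""
    hits = sum(
        localization_error(p, g, wh) <= threshold
        for p, g, wh in zip(preds, gts, bbox_whs)
    )
    return hits / len(preds)

# Two points localized from generated descriptions (toy coordinates).
preds = [(10.0, 10.0), (50.0, 40.0)]
gts = [(13.0, 14.0), (90.0, 40.0)]
boxes = [(100.0, 80.0), (100.0, 80.0)]
score = pck(preds, gts, boxes)
```

A description is scored well only if the localizer can recover the point from it, so the metric avoids comparing free-form text directly.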

[18] Hierarchical Re-Classification: Combining Animal Classification Models with Vision Transformers

Hugo Markoff, Jevgenijs Galaktionovs

🧩 TL;DR

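This study presents a hierarchical re-classification system for the Animal Detect platform. It combines SpeciesNet EfficientNetV2-M predictions with CLIP embeddings and metric learning to refine high-level taxonomic labels toward species-level identification, substantially improving species recognition accuracy.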


📘 Detailed Summary

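Motivation: State-of-the-art animal classification models such as SpeciesNet predict across thousands of species but use conservative rollup strategies, so many animals are labeled only at high taxonomic levels rather than at species level, which limits precise species identification.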

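Method: The authors build a five-stage hierarchical re-classification pipeline (high-confidence acceptance, bird override, centroid building, triplet-loss metric learning, and adaptive cosine-distance scoring) that combines SpeciesNet EfficientNetV2-M predictions with CLIP embeddings and metric learning.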

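Result: Evaluated on a segment of the LILA BC Desert Lion Conservation dataset (4,018 images, 15,031 detections), the system recovers 761 bird detections from "blank" and "animal" labels and re-classifies 456 detections labeled animal, mammal, or blank with 96.5% accuracy, reaching a species-level identification rate of 64.9%.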

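Conclusion: The hierarchical re-classification system addresses the species-level limitations of existing animal classification models and shows that combining multimodal embeddings with metric learning can substantially improve fine-grained species recognition, giving wildlife monitoring and conservation more precise tooling.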


📄 Abstract

State-of-the-art animal classification models like SpeciesNet provide predictions across thousands of species but use conservative rollup strategies, resulting in many animals labeled at high taxonomic levels rather than species. We present a hierarchical re-classification system for the Animal Detect platform that combines SpeciesNet EfficientNetV2-M predictions with CLIP embeddings and metric learning to refine high-level taxonomic labels toward species-level identification. Our five-stage pipeline (high-confidence acceptance, bird override, centroid building, triplet-loss metric learning, and adaptive cosine-distance scoring) is evaluated on a segment of the LILA BC Desert Lion Conservation dataset (4,018 images, 15,031 detections). After recovering 761 bird detections from "blank" and "animal" labels, we re-classify 456 detections labeled animal, mammal, or blank with 96.5% accuracy, achieving species-level identification for 64.9 percent
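
The centroid-building and cosine-distance scoring stages can be sketched as follows. This is a simplified illustration under stated assumptions: the paper's scoring is adaptive, whereas the fixed `max_distance` threshold here is hypothetical, as are the species names and the 2-D toy embeddings standing in for CLIP features.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def build_centroids(embeddings, labels):
    """Average the unit-normalized embeddings of each species."""
    sums, counts = {}, {}
    for emb, lab in zip(embeddings, labels):
        e = normalize(emb)
        acc = sums.setdefault(lab, [0.0] * len(e))
        for i, x in enumerate(e):
            acc[i] += x
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: normalize([x / counts[lab] for x in acc]) for lab, acc in sums.items()}

def reclassify(embedding, centroids, max_distance=0.3):
    """Return the nearest species by cosine distance, or None to keep the coarse label."""
    e = normalize(embedding)
    best_label, best_dist = None, float("inf")
    for lab, c in centroids.items():
        dist = 1.0 - sum(a * b for a, b in zip(e, c))
        if dist < best_dist:
            best_label, best_dist = lab, dist
    return best_label if best_dist <= max_distance else None

# High-confidence detections seed the centroids (toy data).
centroids = build_centroids(
    [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.9]],
    ["lion", "lion", "oryx", "oryx"],
)
```

Returning `None` when no centroid is close enough keeps the original high-level label instead of forcing a species guess.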

[19] Zero-Shot Wildlife Sorting Using Vision Transformers: Evaluating Clustering and Continuous Similarity Ordering

Hugo Markoff, Jevgenijs Galaktionovs

🧩 TL;DR

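This study evaluates self-supervised vision transformers for zero-shot sorting of wildlife images. On a 5-species test set, DINOv2 with UMAP and GMM reaches 88.6% accuracy, and continuous similarity ordering has been deployed in production, accelerating manual annotation for biodiversity monitoring.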


📘 Detailed Summary

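Motivation: Camera traps generate millions of wildlife images, but many datasets contain species that existing classifiers do not cover. Zero-shot methods for organizing unlabeled wildlife imagery are needed to relieve this annotation bottleneck.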

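Method: The study compares three architectures (CLIP, DINOv2, MegaDescriptor) combined with unsupervised clustering methods (DBSCAN, GMM) and dimensionality reduction techniques (PCA, UMAP), and obtains a continuous 1D similarity ordering via t-SNE projection. The work was developed and tested within the Animal Detect platform.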

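Result: On a 5-species test set whose ground-truth labels are used only for evaluation, DINOv2 with UMAP and GMM achieves 88.6% accuracy (macro-F1 = 0.874). Across 1,500 images, 1D sorting reaches 88.2% coherence for mammals and birds and 95.2% for fish.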

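Conclusion: Based on these results, continuous similarity ordering has been deployed in production, where it enables rapid exploratory analysis and accelerates manual annotation workflows for biodiversity monitoring, offering a practical way to organize wildlife imagery zero-shot.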


📄 Abstract

Camera traps generate millions of wildlife images, yet many datasets contain species that are absent from existing classifiers. This work evaluates zero-shot approaches for organizing unlabeled wildlife imagery using self-supervised vision transformers, developed and tested within the Animal Detect platform for camera trap analysis. We compare unsupervised clustering methods (DBSCAN, GMM) across three architectures (CLIP, DINOv2, MegaDescriptor) combined with dimensionality reduction techniques (PCA, UMAP), and we demonstrate continuous 1D similarity ordering via t-SNE projection. On a 5-species test set with ground truth labels used only for evaluation, DINOv2 with UMAP and GMM achieves 88.6 percent accuracy (macro-F1 = 0.874), while 1D sorting reaches 88.2 percent coherence for mammals and birds and 95.2 percent for fish across 1,500 images. Based on these findings, we deployed continuous similarity ordering in production, enabling rapid exploratory analysis and accelerating manual annotation workflows for biodiversity monitoring.
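
Scoring unsupervised clusters against ground-truth labels requires some mapping from cluster IDs to species. The sketch below assumes a majority-vote mapping, which the abstract does not confirm, and then computes accuracy and macro-F1; the species names and cluster assignments are toy data.

```python
from collections import Counter

def map_clusters_to_labels(cluster_ids, true_labels):
    """Assign each cluster the most common ground-truth label among its members."""
    votes = {}
    for cid, label in zip(cluster_ids, true_labels):
        votes.setdefault(cid, Counter())[label] += 1
    mapping = {cid: counts.most_common(1)[0][0] for cid, counts in votes.items()}
    return [mapping[cid] for cid in cluster_ids]

def accuracy(predicted, true_labels):
    return sum(p == t for p, t in zip(predicted, true_labels)) / len(true_labels)

def macro_f1(predicted, true_labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for cls in sorted(set(true_labels)):
        tp = sum(p == cls and t == cls for p, t in zip(predicted, true_labels))
        fp = sum(p == cls and t != cls for p, t in zip(predicted, true_labels))
        fn = sum(p != cls and t == cls for p, t in zip(predicted, true_labels))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

cluster_ids = [0, 0, 0, 1, 1, 1]
true_labels = ["deer", "deer", "fox", "fox", "fox", "fox"]
predicted = map_clusters_to_labels(cluster_ids, true_labels)
```

Because the labels are used only after clustering, the clustering itself stays zero-shot.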

[20] Knowledge-based Visual Question Answer with Multimodal Processing, Retrieval and Filtering

Yuyang Hong, Jiaqi Gu, Qi Yang, Lubin Fan, Yue Wu, Ying Wang, Kun Ding, Shiming Xiang, Jieping Ye

🧩 TL;DR

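This paper proposes Wiki-PRF, a three-stage method (Processing, Retrieval, Filtering) for knowledge-based visual question answering. Combining visual tool invocation with reinforcement-learning training, it achieves state-of-the-art performance on the E-VQA and InfoSeek benchmarks.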


📘 Detailed Summary

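Motivation: Retrieval-augmented generation has advanced knowledge-based visual question answering, but it still struggles with the quality of multimodal queries and the relevance of retrieved results. The work addresses the limitations of vision-language models in integrating visual understanding with external knowledge retrieval.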

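Method: Wiki-PRF has three stages. The processing stage dynamically invokes visual tools to extract precise multimodal information; the retrieval stage integrates visual and text features for multimodal knowledge retrieval; the filtering stage performs relevance filtering and concentration on the retrieved results. The vision-language model is trained with reinforcement learning, using answer accuracy and format consistency as reward signals.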

Result: Experiments on the E-VQA and InfoSeek benchmarks show significant improvements in answer quality (36.0 and 42.8), reaching state-of-the-art performance.

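Conclusion: The study shows that a carefully designed multi-stage retrieval-augmented framework, combined with reinforcement-learning training, can improve vision-language models on knowledge-intensive tasks and offers a new approach to multimodal knowledge retrieval and reasoning.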


📄 Abstract

Knowledge-based visual question answering (KB-VQA) requires visual language models (VLMs) to integrate visual understanding with external knowledge retrieval. Although retrieval-augmented generation (RAG) achieves significant advances in this task by incorporating knowledge-base querying, it still struggles with the quality of multimodal queries and the relevance of retrieved results. To overcome these challenges, we propose a novel three-stage method, termed Wiki-PRF, including Processing, Retrieval and Filtering stages. The processing stage dynamically invokes visual tools to extract precise multimodal information for retrieval. The retrieval stage integrates visual and text features to achieve multimodal knowledge retrieval. The filtering stage performs relevance filtering and concentration on retrieval results. To this end, we introduce a visual language model trained via reinforcement learning with answer accuracy and format consistency as reward signals. This enhances the model's reasoning, tool invocation for accurate queries, and filtering of irrelevant content. Experiments on benchmark datasets (E-VQA and InfoSeek) show significant improvements (36.0 and 42.8) in answer quality, achieving state-of-the-art performance. Code is available at https://github.com/cqu-student/Wiki-PRF
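
A reward that combines answer accuracy with format consistency might look like the sketch below. Everything specific here is an assumption for illustration: the `<think>`/`<answer>` tag format, exact-match scoring, and the 0.2 format weight are hypothetical and not taken from the paper.

```python
import re

# Hypothetical response format: reasoning in <think>, final answer in <answer>.
ANSWER_PATTERN = re.compile(r"^<think>.*</think>\s*<answer>(.*)</answer>$", re.DOTALL)

def format_reward(response):
    """1.0 if the response follows the expected tag structure, else 0.0."""
    return 1.0 if ANSWER_PATTERN.match(response.strip()) else 0.0

def accuracy_reward(response, gold_answers):
    """1.0 if the extracted answer exactly matches any gold answer (case-insensitive)."""
    m = ANSWER_PATTERN.match(response.strip())
    if not m:
        return 0.0
    pred = m.group(1).strip().lower()
    return 1.0 if any(pred == g.strip().lower() for g in gold_answers) else 0.0

def total_reward(response, gold_answers, format_weight=0.2):
    """Accuracy reward plus a down-weighted format bonus (weight is an assumption)."""
    return accuracy_reward(response, gold_answers) + format_weight * format_reward(response)

good = "<think>The landmark is in Paris.</think><answer>Eiffel Tower</answer>"
wrong = "<think>Looks like London.</think><answer>Big Ben</answer>"
```

Keeping the two terms separate lets a well-formed but wrong answer earn only the small format bonus.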

[21] Shot2Tactic-Caption: Multi-Scale Captioning of Badminton Videos for Tactical Understanding

Ning Ding, Keisuke Fujii, Toru Tamaki

🧩 TL;DR

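This paper proposes Shot2Tactic-Caption, described as the first system to generate both shot-level and tactic-level multi-scale video captions for badminton. A dual-branch design and a prompt-guided mechanism give it a semantic and temporal understanding of tactical execution.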


📘 Detailed Summary

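Motivation: Tactical understanding in badminton requires interpreting not only individual actions but also how tactics are executed dynamically over time. Existing methods cannot describe this tactic-level execution, especially in complex cases where a tactic is interrupted and later resumed.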

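Method: The framework uses a dual-branch architecture in which each branch includes a visual encoder, a spatio-temporal Transformer encoder, and a Transformer-based decoder. A Tactic Unit Detector identifies valid tactic units, tactic types, and tactic states. A shot-wise prompt-guided mechanism embeds the predicted tactic type and state as prompts and injects them into the decoder via cross-attention.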

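Result: Experiments demonstrate the framework's effectiveness in generating both shot and tactic captions. Ablation studies show that the ResNet50-based spatio-temporal encoder outperforms other variants and that shot-wise prompt structuring yields more coherent and accurate tactic captions.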

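Conclusion: The study shows the value of multi-scale video captioning for sports analysis. The prompt-guided mechanism handles complex tactical scenarios and offers a new way to understand dynamic tactical execution; it could be extended to other video analysis tasks that require temporal understanding.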


📄 Abstract

Tactical understanding in badminton involves interpreting not only individual actions but also how tactics are dynamically executed over time. In this paper, we propose Shot2Tactic-Caption, a novel framework for semantic and temporal multi-scale video captioning in badminton, capable of generating shot-level captions that describe individual actions and tactic-level captions that capture how these actions unfold over time within a tactical execution. We also introduce the Shot2Tactic-Caption Dataset, the first badminton captioning dataset containing 5,494 shot captions and 544 tactic captions. Shot2Tactic-Caption adopts a dual-branch design, with both branches including a visual encoder, a spatio-temporal Transformer encoder, and a Transformer-based decoder to generate shot and tactic captions. To support tactic captioning, we additionally introduce a Tactic Unit Detector that identifies valid tactic units, tactic types, and tactic states (e.g., Interrupt, Resume). For tactic captioning, we further incorporate a shot-wise prompt-guided mechanism, where the predicted tactic type and state are embedded as prompts and injected into the decoder via cross-attention. The shot-wise prompt-guided mechanism enables our system not only to describe successfully executed tactics but also to capture tactical executions that are temporarily interrupted and later resumed. Experimental results demonstrate the effectiveness of our framework in generating both shot and tactic captions. Ablation studies show that the ResNet50-based spatio-temporal encoder outperforms other variants, and that shot-wise prompt structuring leads to more coherent and accurate tactic captioning.

[22] Efficient Video Sampling: Pruning Temporally Redundant Tokens for Faster VLM Inference

Natan Bagrov, Eugene Khvedchenia, Borys Tymchenko, Shay Aharon, Lior Kadoch, Tomer Keren, Ofri Masad, Yonatan Geifman, Ran Zilberstein, Tuomas Rintamaki, Matthieu Le, Andrew Tao

🧩 TL;DR

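This paper proposes Efficient Video Sampling (EVS), which reduces token redundancy in video processing by identifying and pruning temporally static visual patches. It significantly lowers computational cost without sacrificing semantic fidelity, offering a route to scalable video-language understanding.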


📘 Detailed Summary

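Motivation: Vision-language models face severe scalability limits on video: the quadratic cost of processing dense frame sequences means long videos exceed the token budget, causing context limitations and latency issues. Reducing token redundancy in video is needed to support long-video understanding.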

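Method: EVS identifies and prunes temporally static patches, that is, spatial regions that remain unchanged across consecutive frames. It preserves positional identity, requires no architectural changes or retraining, and can be applied directly at inference time.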

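Result: EVS substantially reduces token count while maintaining semantic fidelity, cutting LLM time-to-first-token (TTFT) by up to 4x with minimal accuracy loss. Combined with an uptraining phase that uses stochastic pruning rates, it yields models that are robust to varying compression levels.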

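Conclusion: EVS improves the efficiency-accuracy trade-off and opens a path to scalable video-language understanding, showing that well-targeted token reduction enables large-scale video processing without sacrificing quality.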


📄 Abstract

Vision-language models (VLMs) have recently expanded from static image understanding to video reasoning, but their scalability is fundamentally limited by the quadratic cost of processing dense frame sequences. Long videos often exceed the token budget of modern language models, leading to severe context limitations and latency issues. We introduce Efficient Video Sampling (EVS), a simple, plug-and-play method for reducing token redundancy in videos by identifying and pruning temporally static patches, that is, spatial regions that remain unchanged across consecutive frames. EVS preserves positional identity and requires no architectural changes or retraining. We show that EVS substantially reduces token count while maintaining semantic fidelity, enabling faster inference and longer input sequences. Applied at inference time, EVS reduces large language model (LLM) time-to-first-token (TTFT) by up to 4x with minimal accuracy loss. When combined with an uptraining phase using stochastic pruning rates, EVS yields models that are robust to varying compression levels and retain full performance under aggressive pruning. Extensive experiments demonstrate that EVS consistently improves efficiency-accuracy trade-offs, unlocking scalable video-language understanding without sacrificing quality.
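
The idea of pruning temporally static patches can be sketched in a few lines. This is an illustration under assumptions, not the EVS implementation: comparing each patch with the same position in the immediately preceding frame, the mean-absolute-difference measure, and the `threshold` value are all hypothetical choices.

```python
def prune_static_patches(frames, threshold=0.05):
    """frames: list of frames; each frame is a list of patches; each patch is a list of floats.
    Returns (frame_index, patch_index, patch) for every patch that is kept."""
    kept = []
    for t, frame in enumerate(frames):
        for p, patch in enumerate(frame):
            if t == 0:
                kept.append((t, p, patch))  # the first frame is always kept in full
                continue
            prev = frames[t - 1][p]
            mean_abs_diff = sum(abs(a - b) for a, b in zip(patch, prev)) / len(patch)
            if mean_abs_diff > threshold:
                kept.append((t, p, patch))  # the patch changed, so keep its token
    return kept

# Three frames with two patches each; only patch 1 changes, and only once.
frames = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.5, 0.5]],
    [[0.0, 0.01], [0.5, 0.5]],
]
kept = prune_static_patches(frames)
```

Each surviving token carries its original `(frame_index, patch_index)` pair, which is one simple way to preserve positional identity after pruning.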

[23] Benchmarking Multimodal Large Language Models for Face Recognition

Hatef Otroshi Shahreza, Sébastien Marcel

🧩 TL;DR

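This study systematically benchmarks multimodal large language models (MLLMs) on face recognition. It finds that MLLMs capture rich semantic cues but, in zero-shot use, still lag behind specialized models in high-precision recognition scenarios. The benchmark provides a foundation for advancing MLLM-based face recognition and offers insights for designing next-generation models with higher accuracy and generalization.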


📘 Detailed Summary

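Motivation: MLLMs have achieved remarkable performance across vision-and-language tasks, but their potential in face recognition remains underexplored. In particular, open-source MLLMs need to be evaluated on standard benchmarks and compared with existing face recognition models under a similar protocol.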

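Method: The study systematically benchmarks state-of-the-art MLLMs on several face recognition datasets, including LFW, CALFW, CPLFW, CFP, AgeDB, and RFW, and compares them with specialized face recognition models under a standardized evaluation protocol.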

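Result: Although MLLMs capture rich semantic cues useful for face-related tasks, they lag behind specialized models in high-precision recognition scenarios when applied zero-shot. Results across the datasets expose the limits of current MLLMs.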

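Conclusion: The benchmark lays a foundation for MLLM-based face recognition research and marks the current capability boundary of MLLMs on this task. Its findings offer insights for designing next-generation models with higher accuracy and generalization.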


📄 Abstract

Multimodal large language models (MLLMs) have achieved remarkable performance across diverse vision-and-language tasks. However, their potential in face recognition remains underexplored. In particular, the performance of open-source MLLMs needs to be evaluated and compared with existing face recognition models on standard benchmarks under a similar protocol. In this work, we present a systematic benchmark of state-of-the-art MLLMs for face recognition on several face recognition datasets, including LFW, CALFW, CPLFW, CFP, AgeDB and RFW. Experimental results reveal that while MLLMs capture rich semantic cues useful for face-related tasks, they lag behind specialized models in high-precision recognition scenarios in zero-shot applications. This benchmark provides a foundation for advancing MLLM-based face recognition, offering insights for the design of next-generation models with higher accuracy and generalization. The source code of our benchmark is publicly available on the project page.

[24] Adapting Self-Supervised Representations as a Latent Space for Efficient Generation

Ming Gui, Johannes Schusterbauer, Timy Phan, Felix Krause, Josh Susskind, Miguel Angel Bautista, Björn Ommer

🧩 TL;DR

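This paper proposes Representation Tokenizer (RepTok), a generative modeling framework built on self-supervised vision transformers that represents an image with a single continuous latent token, achieving competitive generation performance while keeping training efficient.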


📘 Detailed Summary

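Motivation: The work targets the spatial redundancy of conventional 2D latent spaces in generative modeling and explores how pre-trained self-supervised representations can form a compact, effective latent space that significantly reduces training cost while maintaining generation quality.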

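Method: RepTok builds on a pre-trained SSL encoder, fine-tunes only the semantic token embedding, and pairs it with a generative decoder trained jointly using a flow matching objective. A cosine-similarity loss regularizes the adapted token, preserving the favorable geometry of the original SSL space while the token is enriched with low-level, reconstruction-relevant details.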

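Result: RepTok achieves competitive results on class-conditional ImageNet generation and, on MS-COCO text-to-image synthesis, reaches competitive zero-shot performance under an extremely limited training budget, while significantly reducing training costs.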

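Conclusion: Fine-tuned SSL representations can serve as compact, effective latent spaces for efficient generative modeling. The single-token formulation resolves the spatial redundancy of 2D latent spaces and offers a viable route to high-quality generation in resource-constrained settings.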


📄 Abstract

We introduce Representation Tokenizer (RepTok), a generative modeling framework that represents an image using a single continuous latent token obtained from self-supervised vision transformers. Building on a pre-trained SSL encoder, we fine-tune only the semantic token embedding and pair it with a generative decoder trained jointly using a standard flow matching objective. This adaptation enriches the token with low-level, reconstruction-relevant details, enabling faithful image reconstruction. To preserve the favorable geometry of the original SSL space, we add a cosine-similarity loss that regularizes the adapted token, ensuring the latent space remains smooth and suitable for generation. Our single-token formulation resolves spatial redundancies of 2D latent spaces and significantly reduces training costs. Despite its simplicity and efficiency, RepTok achieves competitive results on class-conditional ImageNet generation and naturally extends to text-to-image synthesis, reaching competitive zero-shot performance on MS-COCO under extremely limited training budgets. Our findings highlight the potential of fine-tuned SSL representations as compact and effective latent spaces for efficient generative modeling.
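
The cosine-similarity regularizer mentioned above can be sketched as a penalty on how far the adapted token drifts in direction from the frozen SSL token. The loss form `1 - cos`, the `weight` value, and the way it is added to the flow matching loss are assumptions for illustration; the abstract does not specify them.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_regularizer(adapted_token, frozen_token):
    """Zero when the adapted token points the same way as the original SSL token."""
    return 1.0 - cosine_similarity(adapted_token, frozen_token)

def total_loss(flow_matching_loss, adapted_token, frozen_token, weight=0.1):
    """Generative loss plus the weighted regularizer (weight is a hypothetical value)."""
    return flow_matching_loss + weight * cosine_regularizer(adapted_token, frozen_token)
```

Because cosine similarity ignores magnitude, this penalty constrains only the token's direction and leaves its norm free to change during fine-tuning.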

[25] You May Speak Freely: Improving the Fine-Grained Visual Recognition Capabilities of Multimodal Large Language Models with Answer Extraction

Logan Lawrence, Oindrila Saha, Megan Wei, Chen Sun, Subhransu Maji, Grant Van Horn

🧩 TL;DR

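This study investigates nlg2choice, a two-stage method for evaluating multimodal large language models on fine-grained visual classification. It combines open-ended question answering with constrained decoding to handle classification and retrieval in highly multi-way multiple-choice settings.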


📘 Detailed Summary

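Motivation: Zero-shot visual classification evaluation faces two challenges. First, most existing work focuses on language-only tasks or on multiple-choice questions with at most 5 options, whereas fine-grained visual classification involves hundreds to thousands of highly related choices. Second, in this highly multi-way setting it is unclear how to extend LLM choice extraction to retrieval-based problems, where computing probabilities over the choice set is computationally costly.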

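Method: nlg2choice works in two stages: it first asks the MLLM an open-ended question with minimal constraints, then uses text-only constrained decoding to predict the most likely choice. In retrieval settings, it computes the probability of the constrained response taking a given choice with an early stopping method, which significantly improves throughput.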

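Result: Across seven fine-grained visual datasets, the method improves both classification and retrieval metrics, and the gains hold across the different ways a task can be phrased in natural language.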

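Conclusion: In highly multi-way multiple-choice settings, a two-stage method that combines open-ended answering with constrained decoding addresses the evaluation challenges of fine-grained visual classification. It gives a practical framework for evaluating MLLMs on complex visual tasks, and its computational optimizations keep it feasible in practice.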


📄 Abstract

Despite the renewed interest in zero-shot visual classification due to the rise of Multimodal Large Language Models (MLLMs), the problem of evaluating free-form responses of auto-regressive models remains a persistent challenge. Most existing works focus on language-only tasks or don't consider Multiple Choice Questions (MCQs) beyond 5-way options, both of which are critical capabilities to solve tasks in Fine-Grained Visual Classification (FGVC) where choice counts are in the hundreds to thousands and the choices are highly related. Furthermore, in this highly multi-way MCQ setting it is not clear how to extend LLM choice extraction to retrieval-based problems, where computing probabilities over the choice set is computationally costly. In this work we investigate nlg2choice, a simple two-stage method which first asks the MLLM an open-ended question for the task with minimal constraints, then uses text-only constrained decoding to predict the most likely choice. In retrieval settings, we compute the probability of the constrained response taking that choice with an early stopping method to significantly improve throughput. Our results show improvement over a suite of seven fine-grained visual datasets when evaluating in terms of classification and retrieval, and show that this performance holds over the various ways that users of LLMs can implement tasks in natural language.
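
The second stage, constrained decoding over a fixed choice set with early stopping, can be sketched with a toy scorer. This is a simplified illustration, not the authors' implementation: whitespace tokenization, greedy selection, and the word-overlap scorer are all assumptions. In practice `token_score` would come from a language model's next-token log-probabilities conditioned on the free-form answer.

```python
def constrained_choice(choices, token_score):
    """Greedy decoding restricted to the token sequences of `choices`.
    Stops as soon as the decoded prefix matches exactly one choice."""
    remaining = [c.split() for c in choices]
    prefix = []
    while len(remaining) > 1:
        step = len(prefix)
        candidates = {c[step] for c in remaining if len(c) > step}
        if not candidates:
            break  # every remaining choice is already fully decoded
        # Only tokens that continue some valid choice are allowed.
        best = max(sorted(candidates), key=lambda tok: token_score(prefix, tok))
        prefix.append(best)
        remaining = [c for c in remaining if len(c) > step and c[step] == best]
    return " ".join(remaining[0]), len(prefix)

def make_overlap_scorer(free_form_answer):
    """Toy stand-in for an LLM: a token scores 1.0 if it appears in the stage-one answer."""
    words = set(free_form_answer.lower().replace(".", " ").replace(",", " ").split())
    return lambda prefix, token: 1.0 if token.lower() in words else 0.0

scorer = make_overlap_scorer("This looks like an American goldfinch with yellow plumage.")
```

With the choices `["American Crow", "American Goldfinch", "Fish Crow"]` two tokens are decoded, while `["Fish Crow", "American Goldfinch"]` is resolved after one; early stopping saves work exactly when a short prefix is already unambiguous.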

[26] In-Context Learning with Unpaired Clips for Instruction-based Video Editing

Xinyao Liao, Xianfang Zeng, Ziye Song, Zhoujie Fu, Gang Yu, Guosheng Lin

🧩 TL;DR

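This paper proposes a pretraining strategy for instruction-based video editing that learns editing concepts from unpaired video clips through in-context learning. It sharply reduces the cost of building large-scale paired video editing datasets and surpasses existing methods in instruction following and visual quality.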


📘 Detailed Summary

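Motivation: Instruction-based image editing has progressed rapidly, but its extension to video remains underexplored, mainly because of the prohibitive cost and complexity of constructing large-scale paired video editing datasets.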

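Method: A low-cost pretraining strategy uses in-context learning from unpaired video clips to give a foundation video generation model general editing capabilities. Built on HunyuanVideoT2V, the framework first pretrains on approximately 1M real video clips to learn basic editing concepts, then fine-tunes on fewer than 150k curated editing pairs to cover more editing tasks and improve editing quality.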

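Result: Comparative experiments show that the method surpasses existing instruction-based video editing approaches in both instruction alignment and visual fidelity, with a 12% improvement in editing instruction following and a 15% improvement in editing quality.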

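Conclusion: In-context learning can give video generation models general editing capabilities, and efficient fine-tuning is feasible with limited high-quality paired data. The approach is a cost-effective solution for instruction-based video editing that achieves high-quality edits with less dependence on paired data.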


📄 Abstract

Despite the rapid progress of instruction-based image editing, its extension to video remains underexplored, primarily due to the prohibitive cost and complexity of constructing large-scale paired video editing datasets. To address this challenge, we introduce a low-cost pretraining strategy for instruction-based video editing that leverages in-context learning from unpaired video clips. We show that pretraining a foundation video generation model with this strategy endows it with general editing capabilities, such as adding, replacing, or deleting operations, according to input editing instructions. The pretrained model can then be efficiently refined with a small amount of high-quality paired editing data. Built upon HunyuanVideoT2V, our framework first pretrains on approximately 1M real video clips to learn basic editing concepts, and subsequently fine-tunes on fewer than 150k curated editing pairs to extend to more editing tasks and improve the editing quality. Comparative experiments show that our method surpasses existing instruction-based video editing approaches in both instruction alignment and visual fidelity, achieving a 12% improvement in editing instruction following and a 15% improvement in editing quality.

[27] WeCKD: Weakly-supervised Chained Distillation Network for Efficient Multimodal Medical Imaging

Md. Abdur Rahman, Mohaimenul Azam Khan Raiaan, Sami Azam, Asif Karim, Jemima Beissbarth, Amanda Leach

🧩 TL;DR

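This paper presents WeCKD, described as the first weakly-supervised chain-based knowledge distillation network. It redefines knowledge transfer as a progressive distillation chain in which each model learns from its predecessor and refines the knowledge before passing it on, reducing data dependency and improving feature learning.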


📘 Detailed Summary

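Motivation: Conventional knowledge distillation suffers from knowledge degradation, inefficient supervision, and reliance on either a very strong teacher or large labeled datasets. These issues limit its effectiveness in real-world, limited-data scenarios, particularly in data-scarce domains such as medical imaging.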

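Method: WeCKD arranges interconnected models in a structured sequence to form a progressive distillation chain. Each model is trained on only a fraction of the dataset; it learns from its predecessor and refines the knowledge before passing it forward, enabling efficient knowledge transfer under weak supervision.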

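Result: Extensive evaluation on four otoscopic imaging datasets shows that the method matches and in many cases surpasses existing supervised methods. Experiments on two further datasets confirm generalization across medical imaging modalities, including microscopic and magnetic resonance imaging. Cumulative accuracy gains reach up to +23% over a single backbone trained on the same limited data.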

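Conclusion: A structured knowledge-transfer chain enables efficient learning under weak supervision and substantially reduces the dependence of medical image analysis on large labeled datasets. It offers a viable route to deploying models in data-scarce real-world settings and generalizes well across diverse medical imaging modalities.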


📄 Abstract

Knowledge distillation (KD) has traditionally relied on a static teacher-student framework, where a large, well-trained teacher transfers knowledge to a single student model. However, these approaches often suffer from knowledge degradation, inefficient supervision, and reliance on either a very strong teacher model or large labeled datasets, which limits their effectiveness in real-world, limited-data scenarios. To address these, we present the first-ever Weakly-supervised Chain-based KD network (WeCKD) that redefines knowledge transfer through a structured sequence of interconnected models. Unlike conventional KD, it forms a progressive distillation chain, where each model not only learns from its predecessor but also refines the knowledge before passing it forward. This structured knowledge transfer further enhances feature learning, reduces data dependency, and mitigates the limitations of one-step KD. Each model in the distillation chain is trained on only a fraction of the dataset and demonstrates that effective learning can be achieved with minimal supervision. Extensive evaluations across four otoscopic imaging datasets demonstrate that it not only matches but in many cases surpasses the performance of existing supervised methods. Experimental results on two other datasets further underscore its generalization across diverse medical imaging modalities, including microscopic and magnetic resonance imaging. Furthermore, our evaluations resulted in cumulative accuracy gains of up to +23% over a single backbone trained on the same limited data, which highlights its potential for real-world adoption.
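
One link of such a distillation chain can be sketched with the standard temperature-softened KD loss. This is an assumption-laden illustration: the abstract does not state which loss WeCKD uses, so the Hinton-style KL term, the temperature of 2.0, and the disjoint contiguous data split below are stand-ins, not the paper's design.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

def chain_splits(num_samples, num_models):
    """Give each model in the chain its own contiguous fraction of the data.
    Leftover samples (when the division is not exact) are dropped in this sketch."""
    size = num_samples // num_models
    return [range(i * size, (i + 1) * size) for i in range(num_models)]
```

In a chain, model k would act as the teacher for model k+1 on the next split, so each model sees only its own fraction of the data plus its predecessor's soft predictions.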

[28] MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal Mathematical Reasoning

Weikang Shi, Aldrich Yu, Rongyao Fang, Houxing Ren, Ke Wang, Aojun Zhou, Changyao Tian, Xinyu Fu, Yuxuan Hu, Zimu Lu, Linjiang Huang, Si Liu, Rui Liu, Hongsheng Li

🧩 TL;DR

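This paper proposes MathCanvas, a framework that gives unified large multimodal models intrinsic visual chain-of-thought capabilities. With two-phase training it achieves an 86% relative improvement on MathCanvas-Bench, and it provides a complete toolkit for reasoning in visually dependent areas of mathematics where LLMs struggle.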


📘 Detailed Summary

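Motivation: Large language models excel at textual reasoning but struggle in mathematical domains such as geometry that rely on visual aids. Existing visual chain-of-thought approaches are limited by rigid external tools or fail to generate the high-fidelity, strategically timed diagrams needed for complex problem solving.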

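Method: MathCanvas has two phases. A Visual Manipulation stage pre-trains the model on a 15.2M-pair corpus to master diagram generation and editing. A Strategic Visual-Aided Reasoning stage then fine-tunes it on a 219K-example dataset of interleaved visual-textual reasoning paths, teaching it when and how to use visual aids.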

Result: BAGEL-Canvas, trained under this framework, achieves an 86% relative improvement over strong LMM baselines on MathCanvas-Bench and generalizes well to other public math benchmarks.

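Conclusion: The work provides a complete toolkit (framework, datasets, and benchmark) for unlocking complex, human-like visual-aided reasoning in multimodal models, advancing problem solving in visually dependent mathematics.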


📄 Abstract

While Large Language Models (LLMs) have excelled in textual reasoning, they struggle with mathematical domains like geometry that intrinsically rely on visual aids. Existing approaches to Visual Chain-of-Thought (VCoT) are often limited by rigid external tools or fail to generate the high-fidelity, strategically-timed diagrams necessary for complex problem-solving. To bridge this gap, we introduce MathCanvas, a comprehensive framework designed to endow unified Large Multimodal Models (LMMs) with intrinsic VCoT capabilities for mathematics. Our approach consists of two phases. First, a Visual Manipulation stage pre-trains the model on a novel 15.2M-pair corpus, comprising 10M caption-to-diagram pairs (MathCanvas-Imagen) and 5.2M step-by-step editing trajectories (MathCanvas-Edit), to master diagram generation and editing. Second, a Strategic Visual-Aided Reasoning stage fine-tunes the model on MathCanvas-Instruct, a new 219K-example dataset of interleaved visual-textual reasoning paths, teaching it when and how to leverage visual aids. To facilitate rigorous evaluation, we introduce MathCanvas-Bench, a challenging benchmark with 3K problems that require models to produce interleaved visual-textual solutions. Our model, BAGEL-Canvas, trained under this framework, achieves an 86% relative improvement over strong LMM baselines on MathCanvas-Bench, demonstrating excellent generalization to other public math benchmarks. Our work provides a complete toolkit (framework, datasets, and benchmark) to unlock complex, human-like visual-aided reasoning in LMMs. Project Page: https://mathcanvas.github.io/

[29] VTimeCoT: Thinking by Drawing for Video Temporal Grounding and Reasoning

Jinglei Zhang, Yuanfan Guo, Rolandos Alexandros Potamias, Jiankang Deng, Hang Xu, Chao Ma

🧩 TL;DR

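This paper proposes VTimeCoT, a framework that introduces progress-bar visual tools and a visuotemporal chain-of-thought. It significantly improves multimodal large language models on video temporal grounding and reasoning and yields a compositional, interpretable reasoning process.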


📘 Detailed Summary

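Motivation: Video question answering systems based on multimodal large language models have a notable deficiency in video temporal grounding and reasoning, which hampers the development of effective real-world video understanding systems.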

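Method: VTimeCoT is a training-free framework with two novel progress-bar visual tools: a plug-and-play progress bar integration tool and a high-efficiency highlighting tool. It also introduces a visuotemporal chain-of-thought process that integrates cross-modality reasoning across video and text.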

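Result: On both Qwen2VL-7B and GPT4o baselines, the approach delivers significant improvements in video temporal grounding and reasoning-based question answering, and it exhibits a compositional and interpretable reasoning process.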

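Conclusion: The study demonstrates the effectiveness of progress-bar visual tools combined with a visuotemporal chain-of-thought, offering a new technical path for video understanding systems and strengthening temporal reasoning in these models.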


📄 Abstract

In recent years, video question answering based on multimodal large language models (MLLM) has garnered considerable attention, due to the benefits from the substantial advancements in LLMs. However, these models have a notable deficiency in the domains of video temporal grounding and reasoning, posing challenges to the development of effective real-world video understanding systems. Inspired by how humans use video players to interact with the progress bar for video comprehension, we introduce VTimeCoT, a simple yet effective training-free framework, designed for high-performance video grounding and reasoning. The proposed framework incorporates two novel visual tools of the progress bar: a plug-and-play progress bar integration tool and a high-efficiency highlighting tool. In addition, to address the limitations of conventional text-based chain-of-thought (CoT) approaches, we introduce a visuotemporal CoT process that integrates cross-modality reasoning across both video and text. Our approach demonstrates significant performance improvements on both Qwen2VL-7B and GPT4o baselines in tasks of video temporal grounding and reasoning-based question answering. Finally, we showcase that the proposed framework achieves a compositional and interpretable reasoning process. Project page: https://vtimecot.github.io
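
The progress bar integration idea can be illustrated by drawing a bar under each frame so that a model can read the temporal position from pixels. This sketch is hypothetical throughout: the function name, the grayscale 2D-list frame format, and the bar layout are assumptions, not the paper's tool.

```python
def add_progress_bar(frame, frame_index, total_frames, bar_height=2, fill=255, empty=64):
    """frame: 2D list of grayscale pixels. Returns a new frame with a bar appended below."""
    width = len(frame[0])
    progress = (frame_index + 1) / total_frames
    filled = round(progress * width)
    bar_row = [fill] * filled + [empty] * (width - filled)
    # Copy the rows so the input frame is left untouched.
    return [row[:] for row in frame] + [bar_row[:] for _ in range(bar_height)]

frame = [[0] * 8 for _ in range(2)]
# Frame 4 of 8: half of the bar is filled.
with_bar = add_progress_bar(frame, frame_index=3, total_frames=8)
```

Appending the bar below the image rather than overwriting pixels keeps the original content intact, which is one simple way to make such a tool plug-and-play.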

[30] WithAnyone: Towards Controllable and ID Consistent Image Generation

Hengyuan Xu, Wei Cheng, Peng Xing, Yixiao Fang, Shuhan Wu, Rui Wang, Xianfang Zeng, Daxin Jiang, Gang Yu, Xingjun Ma, Yu-Gang Jiang

🧩 TL;DR

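This paper proposes WithAnyone. By constructing the large-scale multi-identity dataset MultiID-2M and training with a contrastive identity loss, it mitigates the copy-paste problem in text-to-image generation, preserving identity consistency while enabling diverse, controllable generation.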


📘 Detailed Summary

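Motivation: Because large-scale paired datasets are scarce, current identity-consistent generation methods rely on reconstruction-based training. This leads to copy-paste, where the model directly replicates the reference face instead of preserving identity across variations in pose, expression, or lighting, which undermines controllability and expressive power.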

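Method: The authors (1) construct MultiID-2M, a large-scale paired dataset tailored to multi-person scenarios that provides diverse references for each identity; (2) introduce a benchmark that quantifies both copy-paste artifacts and the trade-off between identity fidelity and variation; and (3) propose a training paradigm with a contrastive identity loss that uses paired data to balance fidelity with diversity. These culminate in WithAnyone, a diffusion-based model.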

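Result: Extensive qualitative and quantitative experiments show that WithAnyone significantly reduces copy-paste artifacts, improves controllability over pose and expression, and maintains strong perceptual quality. User studies further confirm high identity fidelity together with expressive, controllable generation.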

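Conclusion: Large-scale dataset construction combined with a contrastive learning strategy mitigates the copy-paste problem in identity-consistent generation, offering a new way to balance identity fidelity and diversity and advancing controllable text-to-image generation.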


📄 Abstract

Identity-consistent generation has become an important focus in text-to-image research, with recent models achieving notable success in producing images aligned with a reference identity. Yet, the scarcity of large-scale paired datasets containing multiple images of the same individual forces most approaches to adopt reconstruction-based training. This reliance often leads to a failure mode we term copy-paste, where the model directly replicates the reference face rather than preserving identity across natural variations in pose, expression, or lighting. Such over-similarity undermines controllability and limits the expressive power of generation. To address these limitations, we (1) construct a large-scale paired dataset MultiID-2M, tailored for multi-person scenarios, providing diverse references for each identity; (2) introduce a benchmark that quantifies both copy-paste artifacts and the trade-off between identity fidelity and variation; and (3) propose a novel training paradigm with a contrastive identity loss that leverages paired data to balance fidelity with diversity. These contributions culminate in WithAnyone, a diffusion-based model that effectively mitigates copy-paste while preserving high identity similarity. Extensive qualitative and quantitative experiments demonstrate that WithAnyone significantly reduces copy-paste artifacts, improves controllability over pose and expression, and maintains strong perceptual quality. User studies further validate that our method achieves high identity fidelity while enabling expressive controllable generation.
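
A contrastive identity loss can be sketched in the generic InfoNCE form: pull the generated face embedding toward another image of the same identity and push it away from other identities. The abstract does not give the paper's exact loss, so the InfoNCE formulation, the temperature of 0.1, and the toy 2-D embeddings are assumptions.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def contrastive_identity_loss(generated, positive, negatives, temperature=0.1):
    """InfoNCE-style loss over face embeddings.
    positive: embedding of a different image of the same identity.
    negatives: embeddings of other identities."""
    pos = math.exp(cosine(generated, positive) / temperature)
    neg = sum(math.exp(cosine(generated, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

# Toy embeddings: the first case matches the positive, the second matches a negative.
low = contrastive_identity_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
high = contrastive_identity_loss([0.0, 1.0], [1.0, 0.0], [[0.0, 1.0]])
```

Using a paired image of the same person as the positive, rather than the reference image itself, is what lets such a loss reward identity similarity without rewarding pixel-level copying.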

[31] Free-Grained Hierarchical Recognition

Seulki Park, Zilin Wang, Stella X. Yu

🧩 TL;DR

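This paper introduces the ImageNet-F benchmark and free-grain learning to address inconsistent annotation granularity in real-world image classification. By combining vision-language models with semi-supervised learning, it substantially improves hierarchical classification under mixed-granularity supervision.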


📘 Detailed Summary

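Motivation: Existing hierarchical image classification methods typically assume complete, fine-grained annotations, but real-world supervision varies in granularity with image quality, annotator expertise, and task demands. Such mixed-granularity labels are common in practice, yet existing methods do not handle them well.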

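Method: The authors build ImageNet-F, a large-scale benchmark curated from ImageNet and structured into cognitively inspired basic, subordinate, and fine-grained levels. Using CLIP as a proxy for semantic ambiguity, they simulate mixed-granularity labels that reflect human annotation behavior. Their free-grain learning methods enhance semantic guidance with pseudo-attributes from vision-language models and visual guidance with semi-supervised learning.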

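Result: The proposed methods, together with strong baselines, substantially improve hierarchical classification under mixed supervision on ImageNet-F, supporting their effectiveness under real-world annotation constraints.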

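Conclusion: By introducing a cognitively inspired benchmark and a free-grain learning framework, the work advances hierarchical classification under real-world constraints. Combining mixed-granularity supervision with vision-language models offers a new approach to inconsistent annotation and a useful reference for practical image classification systems.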


📄 Abstract

Hierarchical image classification predicts labels across a semantic taxonomy, but existing methods typically assume complete, fine-grained annotations, an assumption rarely met in practice. Real-world supervision varies in granularity, influenced by image quality, annotator expertise, and task demands; a distant bird may be labeled Bird, while a close-up reveals Bald eagle. We introduce ImageNet-F, a large-scale benchmark curated from ImageNet and structured into cognitively inspired basic, subordinate, and fine-grained levels. Using CLIP as a proxy for semantic ambiguity, we simulate realistic, mixed-granularity labels reflecting human annotation behavior. We propose free-grain learning, with heterogeneous supervision across instances. We develop methods that enhance semantic guidance via pseudo-attributes from vision-language models and visual guidance via semi-supervised learning. These, along with strong baselines, substantially improve performance under mixed supervision. Together, our benchmark and methods advance hierarchical classification under real-world constraints.
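
Heterogeneous supervision across instances can be sketched as a loss that sums cross-entropy only over the taxonomy levels an image is actually labeled at. This is a minimal illustration, not the paper's method: the averaging scheme and the level names as dictionary keys are assumptions, and classes other than the abstract's Bird and Bald eagle examples are hypothetical.

```python
import math

def mixed_granularity_loss(level_probs, labels):
    """level_probs: {level: {class: predicted probability}}.
    labels: {level: class or None}; None marks a level with no annotation."""
    losses = [
        -math.log(level_probs[level][cls])
        for level, cls in labels.items()
        if cls is not None  # unlabeled levels contribute no loss
    ]
    return sum(losses) / len(losses) if losses else 0.0

probs = {
    "basic": {"bird": 0.9, "dog": 0.1},
    "subordinate": {"eagle": 0.6, "owl": 0.4},
    "fine": {"bald eagle": 0.5, "golden eagle": 0.5},
}
# A distant bird labeled only at the basic level vs. a fully labeled close-up.
distant = mixed_granularity_loss(probs, {"basic": "bird", "subordinate": None, "fine": None})
close_up = mixed_granularity_loss(probs, {"basic": "bird", "subordinate": "eagle", "fine": "bald eagle"})
```

Masking unlabeled levels lets coarsely labeled images still contribute a training signal instead of being discarded.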

[32] From Pixels to Words -- Towards Native Vision-Language Primitives at Scale

Haiwen Diao, Mingxuan Li, Silei Wu, Linjun Dai, Xiaohua Wang, Hanming Deng, Lewei Lu, Dahua Lin, Ziwei Liu

🧩 TL;DR

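This paper introduces NEO, a family of native vision-language models built from first principles as dense, monolithic architectures. It examines the fundamental differences between native and modular VLMs and, with only 390M image-text examples, rivals top-tier modular models.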


📘 Detailed Summary

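Motivation: Native vision-language models face two open questions: what fundamental constraints set them apart from modular VLMs and how far those barriers can be overcome, and how to make native VLM research more accessible and democratized so that the field progresses faster.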

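Method: The paper sets out three guiding principles for a native VLM primitive: effectively align pixel and word representations within a shared semantic space; seamlessly integrate the strengths of formerly separate vision and language modules; and inherently embody cross-modal properties that support unified vision-language encoding, aligning, and reasoning. The NEO family of dense, monolithic models is built on these principles.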

Result: NEO模型仅使用3.9亿图像文本样本即可从零开始高效发展视觉感知能力,在多样化现实场景中能够与顶级模块化模型竞争,同时缓解了密集单体模型内部的视觉语言冲突。

Conclusion: NEO为可扩展且强大的原生VLM奠定了基石,配合丰富的可复用组件构建了成本效益高且可扩展的生态系统,推动了原生视觉语言模型研究的民主化和加速发展。


📄 Abstract

The edifice of native Vision-Language Models (VLMs) has emerged as a rising contender to typical modular VLMs, shaped by evolving model architectures and training paradigms. Yet, two lingering clouds cast shadows over its widespread exploration and promotion: (-) What fundamental constraints set native VLMs apart from modular ones, and to what extent can these barriers be overcome? (-) How to make research in native VLMs more accessible and democratized, thereby accelerating progress in the field. In this paper, we clarify these challenges and outline guiding principles for constructing native VLMs. Specifically, one native VLM primitive should: (i) effectively align pixel and word representations within a shared semantic space; (ii) seamlessly integrate the strengths of formerly separate vision and language modules; (iii) inherently embody various cross-modal properties that support unified vision-language encoding, aligning, and reasoning. Hence, we launch NEO, a novel family of native VLMs built from first principles, capable of rivaling top-tier modular counterparts across diverse real-world scenarios. With only 390M image-text examples, NEO efficiently develops visual perception from scratch while mitigating vision-language conflicts inside a dense and monolithic model crafted from our elaborate primitives. We position NEO as a cornerstone for scalable and powerful native VLMs, paired with a rich set of reusable components that foster a cost-effective and extensible ecosystem. Our code and models are publicly available at: https://github.com/EvolvingLMMs-Lab/NEO.

[33] CoT-PL: Visual Chain-of-Thought Reasoning Meets Pseudo-Labeling for Open-Vocabulary Object Detection

Hojun Choi, Youngsun Lim, Jaeyo Shin, Hyunjung Shim

🧩 TL;DR

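This paper proposes CoT-PL, a framework that brings structured visual chain-of-thought reasoning into pseudo-label generation, markedly improving the robustness of open-vocabulary object detection in crowded and occluded scenes. It sets a new state of the art on both the open-vocabulary COCO and LVIS benchmarks.


📘 Detailed Summary

Motivation: Existing open-vocabulary object detection methods rely mainly on direct image-text matching and neglect the intermediate reasoning steps needed to interpret semantically complex scenes, which limits robustness in crowded or occluded visual contexts.

Method: CoT-PL decomposes object understanding into three interpretable steps: region perception, zero-shot category recognition, and background grounding. It also introduces contrastive background learning, which uses pre-computed background cues as negatives to promote feature disentanglement between objects and background.

Result: In crowded and occluded scenes, novel-class pseudo-label quality improves by 103.4% and 168.4%, respectively, relative to the best prior method. CoT-PL gains +7.7 AP50 on open-vocabulary COCO and +2.9 mask AP on LVIS for novel classes, both new state-of-the-art results.

Conclusion: Integrating structured chain-of-thought reasoning with contrastive background learning effectively improves the robustness of open-vocabulary detection in complex scenes and offers a new paradigm for handling semantically complex visual environments.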


📄 Abstract

Open-vocabulary object detection (OVD) seeks to recognize and localize object categories beyond those seen during training. Recent approaches typically leverage vision-language models (VLMs) to generate pseudo-labels using image-text alignment, allowing detectors to generalize to unseen classes without explicit supervision. However, these methods depend heavily on direct image-text matching, neglecting the intermediate reasoning steps essential for interpreting semantically complex scenes. This results in limited robustness when confronted with crowded or occluded visual contexts. In this paper, we introduce CoT-PL, a new framework that incorporates structured visual chain-of-thought (CoT) reasoning into the pseudo-labeling process. CoT-PL decomposes object understanding into three interpretable steps: (1) region perception even for unseen objects, (2) category recognition via zero-shot reasoning, and (3) background grounding to separate semantically complex objects. Crucially, the third step naturally motivates our contrastive background learning (CBL) that uses the pre-computed background cues as negatives to promote feature disentanglement between objects and background. In this way, CoT reasoning and CBL form an integrated pipeline tailored to robust pseudo-labeling in crowded or occluded scenes. Notably, in these two settings, our novel-class pseudo-label quality achieves relative improvements of 103.4% and 168.4% over the best prior method, respectively. Our extensive experiments demonstrate that CoT-PL achieves +7.7 AP50 on open-vocabulary COCO and +2.9 mask AP on LVIS for novel classes, setting a new state of the art.
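
The abstract does not spell out the CBL objective. As a rough illustration of using background cues as negatives, the sketch below writes an InfoNCE-style loss in NumPy. The function name, the temperature value, and the toy embeddings are assumptions for illustration, not the authors' formulation.

```python
import numpy as np

def contrastive_background_loss(region, text, backgrounds, temperature=0.1):
    """InfoNCE-style loss: pull a region embedding toward its class text
    embedding and push it away from background embeddings (negatives)."""
    def unit(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)

    region, text, backgrounds = unit(region), unit(text), unit(backgrounds)
    pos = region @ text / temperature         # logit of the positive pair
    neg = backgrounds @ region / temperature  # one logit per background negative
    logits = np.concatenate([[pos], neg])
    logits -= logits.max()                    # numerical stability
    return -(logits[0] - np.log(np.exp(logits).sum()))

# Toy 2-D embeddings: one class text vector and two background cues.
text = np.array([1.0, 0.0])
backgrounds = np.array([[0.0, 1.0], [-1.0, 0.0]])
print(contrastive_background_loss(np.array([1.0, 0.0]), text, backgrounds))
print(contrastive_background_loss(np.array([0.0, 1.0]), text, backgrounds))
```

The first region embedding points at the text vector and gets a small loss; the second coincides with a background cue and is penalized much more.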

[34] QDepth-VLA: Quantized Depth Prediction as Auxiliary Supervision for Vision-Language-Action Models

Yixuan Li, Yuhui Chen, Mingcai Zhou, Haoran Li

🧩 TL;DR

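This paper proposes QDepth-VLA, a framework that adds a depth prediction task to strengthen the 3D spatial perception of vision-language-action models, achieving strong spatial reasoning and competitive performance on manipulation tasks.


📘 Detailed Summary

Motivation: Existing vision-language-action models lack the ability to understand and reason over key 3D structure in fine-grained manipulation tasks, which limits their precision in control.

Method: A dedicated depth expert predicts the quantized latent tokens of depth maps obtained from a VQ-VAE encoder, so that the model learns depth-aware representations that capture critical geometric cues.

Result: Experiments on simulation benchmarks and real-world tasks show that QDepth-VLA delivers strong spatial reasoning and competitive performance on manipulation tasks.

Conclusion: Augmenting vision-language-action models with a depth prediction task effectively improves their spatial perception and reasoning and offers a new approach to fine-grained manipulation.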


📄 Abstract

Spatial perception and reasoning are crucial for Vision-Language-Action (VLA) models to accomplish fine-grained manipulation tasks. However, existing approaches often lack the ability to understand and reason over the essential 3D structures necessary for precise control. To address this limitation, we propose QDepth-VLA, a general framework that augments VLA models with an auxiliary depth prediction task. A dedicated depth expert is designed to predict quantized latent tokens of depth maps obtained from a VQ-VAE encoder, enabling the model to learn depth-aware representations that capture critical geometric cues. Experimental results on the simulation benchmarks and real-world tasks demonstrate that QDepth-VLA yields strong spatial reasoning and competitive performance on manipulation tasks.
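
The quantized latent tokens mentioned above come from a vector-quantization step, in which each continuous latent is replaced by the index of its nearest codebook vector. The following is a minimal sketch of that lookup with a made-up codebook; it shows standard VQ-VAE quantization in general, not QDepth-VLA's actual encoder.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry.

    latents:  (N, D) array of continuous encoder outputs.
    codebook: (K, D) array of learned code vectors.
    Returns an (N,) integer array of token ids.
    """
    # Squared Euclidean distance between every latent and every code: (N, K).
    diffs = latents[:, None, :] - codebook[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

# Hypothetical 3-entry codebook and three latents from a depth-map encoder.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]])
latents = np.array([[0.1, -0.2], [0.9, 1.2], [3.5, 0.3]])
tokens = quantize(latents, codebook)
print(tokens)  # discrete targets a depth expert could be trained to predict
```

Predicting token ids turns depth supervision into a classification problem over the codebook rather than a per-pixel regression.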

[35] Multi-modal video data-pipelines for machine learning with minimal human supervision

Mihai-Cristian Pîrvu, Marius Leordeanu

🧩 TL;DR

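This work proposes a multimodal visual learning approach that needs little to no human supervision. It builds a fully autonomous data pipeline from pre-trained experts and procedural combinations of them, and efficiently distills the PHG-MAE model to fewer than 1M parameters, reaching performance competitive with ~300M-parameter models on real-time semantic segmentation.


📘 Detailed Summary

Motivation: The real world is inherently multimodal, but traditional machine learning models are mostly unimodal or bimodal and cannot understand the world in full. This work aims to integrate as many visual modalities as possible with little or no human supervision, addressing the limited modality integration of existing methods.

Method: Pre-trained expert models and procedural combinations of them are used to build a fully autonomous data pipeline over raw video. PHG-MAE, a model designed specifically to exploit multimodal data, is then applied and efficiently distilled to fewer than 1M parameters. The deployment framework supports tasks such as real-time semantic segmentation and depth estimation.

Result: The distilled PHG-MAE model (<1M parameters) is competitive with models of about 300M parameters. It is deployed for real-time semantic segmentation from handheld devices and webcams and runs efficiently on commodity hardware. The same framework also deploys DPT for near real-time depth estimation.

Conclusion: The study shows that effective multimodal integration and model distillation can cut parameter counts sharply while staying competitive. The fully autonomous data pipeline and open-source framework offer a scalable route to multimodal learning and show that high-performing multimodal vision tasks are feasible on resource-constrained devices.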


📄 Abstract

The real world is inherently multi-modal. Our tools observe it and take digital snapshots of it, such as videos or sounds, but much of it is lost. Similarly, for actions and information passing between humans, languages are used as a written form of communication. Traditionally, Machine Learning models have been unimodal (i.e. rgb -> semantic or text -> sentiment_class). Recent trends move towards bi-modality, where images and text are learned together; however, in order to truly understand the world, we need to integrate all these independent modalities. In this work we try to combine as many visual modalities as we can using little to no human supervision. To do this, we use pre-trained experts and procedural combinations between them on top of raw videos using a fully autonomous data-pipeline, which we also open-source. We then make use of PHG-MAE, a model specifically designed to leverage multi-modal data. We show that this model, once efficiently distilled into a low-parameter (<1M) version, can achieve results competitive with models of ~300M parameters. We deploy this model and analyze the use-case of real-time semantic segmentation from handheld devices or webcams on commodity hardware. Finally, we deploy other off-the-shelf models using the same framework, such as DPT for near real-time depth estimation.

[36] ScaleWeaver: Weaving Efficient Controllable T2I Generation with Multi-Scale Reference Attention

Keli Liu, Zhendong Wang, Wengang Zhou, Shaodong Xu, Ruixiao Dong, Houqiang Li

🧩 TL;DR

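This paper proposes ScaleWeaver, a framework that achieves high-quality controllable generation on visual autoregressive models through parameter-efficient fine-tuning. Its core innovations are an improved MMDiT block and a Reference Attention module, which give precise control while preserving generation quality.


📘 Detailed Summary

Motivation: Visual autoregressive (VAR) models have made notable progress in text-to-image generation, but compared with diffusion models, precise and flexible control mechanisms under the VAR paradigm remain underexplored. Closing this gap is needed for efficient controllable generation.

Method: ScaleWeaver uses parameter-efficient fine-tuning. Its core module is an improved MMDiT block with the proposed Reference Attention module, which discards the unnecessary image-to-condition attention, lowering computational cost and stabilizing control injection. A zero-initialized linear projection ensures that control signals are incorporated effectively without disrupting the base model's generative capability.

Result: Extensive experiments show that ScaleWeaver delivers high-quality generation and precise control and is more efficient than diffusion-based methods, making it a practical and effective solution for controllable text-to-image generation under the visual autoregressive paradigm.

Conclusion: Through its Reference Attention module and parameter-reuse strategy, ScaleWeaver brings precise control to visual autoregressive models, achieving control precision comparable to diffusion models while remaining efficient, and opens a new route to controllable generation under the VAR paradigm.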


📄 Abstract

Text-to-image generation with visual autoregressive (VAR) models has recently achieved impressive advances in generation fidelity and inference efficiency. While control mechanisms have been explored for diffusion models, enabling precise and flexible control within the VAR paradigm remains underexplored. To bridge this critical gap, in this paper, we introduce ScaleWeaver, a novel framework designed to achieve high-fidelity, controllable generation upon advanced VAR models through parameter-efficient fine-tuning. The core module in ScaleWeaver is the improved MMDiT block with the proposed Reference Attention module, which efficiently and effectively incorporates conditional information. Different from MM Attention, the proposed Reference Attention module discards the unnecessary image→condition attention, reducing computational cost while stabilizing control injection. Besides, it strategically emphasizes parameter reuse, leveraging the capability of the VAR backbone itself with a few introduced parameters to process control information, and adding a zero-initialized linear projection to ensure that control signals are incorporated effectively without disrupting the generative capability of the base model. Extensive experiments show that ScaleWeaver delivers high-quality generation and precise control while attaining superior efficiency over diffusion-based methods, making ScaleWeaver a practical and effective solution for controllable text-to-image generation within the visual autoregressive paradigm. Code and models will be released.
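
A zero-initialized projection leaves the base model untouched at the start of fine-tuning because the control branch contributes exactly zero until its weights are updated. The NumPy sketch below shows this property under the assumption that the control signal is added as a residual; the class and function names and the feature shapes are hypothetical and not taken from ScaleWeaver's code.

```python
import numpy as np

rng = np.random.default_rng(0)

class ZeroInitProjection:
    """Linear layer whose weights and bias start at zero."""

    def __init__(self, dim_in, dim_out):
        self.weight = np.zeros((dim_in, dim_out))
        self.bias = np.zeros(dim_out)

    def __call__(self, x):
        return x @ self.weight + self.bias

def inject_control(base_features, control_features, proj):
    """Add the projected control signal to the backbone features as a residual."""
    return base_features + proj(control_features)

base = rng.normal(size=(4, 8))      # stand-in for backbone features
control = rng.normal(size=(4, 16))  # stand-in for condition features
proj = ZeroInitProjection(16, 8)

out = inject_control(base, control, proj)
print(np.allclose(out, base))  # True: at initialization the control branch is a no-op
```

Once training moves the projection weights away from zero, the control features start to shift the output, so the influence of the condition grows gradually instead of perturbing the pretrained model from the first step.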

[37] Leveraging Multimodal LLM Descriptions of Activity for Explainable Semi-Supervised Video Anomaly Detection

Furkan Mumcu, Michael J. Jones, Anoop Cherian, Yasin Yilmaz

🧩 TL;DR

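This paper proposes a video anomaly detection framework built on multimodal large language models. It detects complex anomalies by extracting and interpreting textual descriptions of object activity and interactions. The framework detects interaction-based anomalies effectively and also reaches state-of-the-art performance on datasets without interaction anomalies.


📘 Detailed Summary

Motivation: Existing semi-supervised video anomaly detection methods struggle with complex anomalies involving object interactions and generally lack explainability, which limits their effectiveness and trustworthiness in real-world settings.

Method: The method queries a multimodal large language model with visual inputs of object pairs at different moments to generate textual descriptions of object activity and interactions in nominal videos. These descriptions serve as a high-level representation of object activity in a video. At test time, anomalies are detected by comparing descriptions against those from the nominal training videos.

Result: Extensive experiments on benchmark datasets show that the method detects complex interaction-based anomalies effectively and also achieves state-of-the-art performance on datasets without interaction anomalies.

Conclusion: The framework is inherently explainable and can be combined with many traditional video anomaly detection methods to further improve their interpretability, offering a new approach to complex anomaly detection and model transparency.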


📄 Abstract

Existing semi-supervised video anomaly detection (VAD) methods often struggle with detecting complex anomalies involving object interactions and generally lack explainability. To overcome these limitations, we propose a novel VAD framework leveraging Multimodal Large Language Models (MLLMs). Unlike previous MLLM-based approaches that make direct anomaly judgments at the frame level, our method focuses on extracting and interpreting object activity and interactions over time. By querying an MLLM with visual inputs of object pairs at different moments, we generate textual descriptions of the activity and interactions from nominal videos. These textual descriptions serve as a high-level representation of the activity and interactions of objects in a video. They are used to detect anomalies during test time by comparing them to textual descriptions found in nominal training videos. Our approach inherently provides explainability and can be combined with many traditional VAD methods to further enhance their interpretability. Extensive experiments on benchmark datasets demonstrate that our method not only detects complex interaction-based anomalies effectively but also achieves state-of-the-art performance on datasets without interaction anomalies.

[38] 3D Scene Prompting for Scene-Consistent Camera-Controllable Video Generation

JoungBin Lee, Jaewoo Jung, Jisang Han, Takuya Narihira, Kazumi Fukuda, Junyoung Seo, Sunghwan Hong, Yuki Mitsufuji, Seungryong Kim

🧩 TL;DR

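3DScenePrompt is a video generation framework that uses dual spatio-temporal conditioning and a 3D scene memory to generate long videos with precise camera control and preserved scene consistency, significantly outperforming existing methods.


📘 Detailed Summary

Motivation: Existing video generation methods usually condition on a single image or a short clip and struggle to maintain scene consistency and precise camera control for long videos. In particular, when generating beyond temporal boundaries, dynamic elements from the past are incorrectly preserved.

Method: The paper proposes dual spatio-temporal conditioning, which combines temporally adjacent frames for motion continuity with spatially adjacent content for scene consistency. It introduces a 3D scene memory that represents exclusively the static geometry extracted from the input video, built with dynamic SLAM and a newly introduced dynamic masking strategy that separates static scene geometry from moving elements.

Result: Extensive experiments show that the framework significantly outperforms existing methods in scene consistency, camera controllability, and generation quality, maintaining long-range spatial coherence and precise camera control without sacrificing computational efficiency or motion realism.

Conclusion: The study shows that a 3D scene representation combined with a spatio-temporal separation strategy can address scene consistency in long video generation, offering a new direction for controllable video generation that could later be extended to more complex dynamic scene modeling.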


📄 Abstract

We present 3DScenePrompt, a framework that generates the next video chunk from arbitrary-length input while enabling precise camera control and preserving scene consistency. Unlike methods conditioned on a single image or a short clip, we employ dual spatio-temporal conditioning that reformulates context-view referencing across the input video. Our approach conditions on both temporally adjacent frames for motion continuity and spatially adjacent content for scene consistency. However, when generating beyond temporal boundaries, directly using spatially adjacent frames would incorrectly preserve dynamic elements from the past. We address this by introducing a 3D scene memory that represents exclusively the static geometry extracted from the entire input video. To construct this memory, we leverage dynamic SLAM with our newly introduced dynamic masking strategy that explicitly separates static scene geometry from moving elements. The static scene representation can then be projected to any target viewpoint, providing geometrically consistent warped views that serve as strong 3D spatial prompts while allowing dynamic regions to evolve naturally from temporal context. This enables our model to maintain long-range spatial coherence and precise camera control without sacrificing computational efficiency or motion realism. Extensive experiments demonstrate that our framework significantly outperforms existing methods in scene consistency, camera controllability, and generation quality. Project page : https://cvlab-kaist.github.io/3DScenePrompt/

[39] OmniMotion: Multimodal Motion Generation with Continuous Masked Autoregression

Zhe Li, Weihao Yuan, Weichao Shen, Siyu Zhu, Zilong Dong, Chang Xu

🧩 TL;DR

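This paper proposes a continuous masked autoregressive motion transformer that combines gated linear attention with an RMSNorm module to tackle the key challenges of whole-body multimodal human motion generation, achieving state-of-the-art performance across text, speech, and music modalities.


📘 Detailed Summary

Motivation: Whole-body multimodal human motion generation poses two main challenges: building an effective motion generation mechanism, and integrating modalities such as text, speech, and music into a unified framework. Traditional methods usually rely on discrete masked modeling or autoregressive modeling and cannot adequately handle the sequential nature of human motion or the heterogeneity of multimodal distributions.

Method: The authors develop a continuous masked autoregressive motion transformer that applies causal attention to respect the sequential nature of human motion. Gated linear attention and an RMSNorm module drive the transformer to attend to key actions and suppress instability caused by abnormal movements or heterogeneous multimodal distributions. A DiT structure diffuses the conditions from the transformer towards the targets, and AdaLN and cross-attention inject the text, speech, and music signals.

Result: Experiments show that the framework outperforms previous methods across all modalities, including text-to-motion, speech-to-gesture, and music-to-dance. Its strong results on multiple benchmarks demonstrate the effectiveness and generalization of the method.

Conclusion: The work offers a unified solution for multimodal human motion generation. Continuous masked autoregressive modeling and effective multimodal fusion markedly improve generation quality and adaptability across modalities, and the framework's extensibility lays groundwork for more complex multimodal interaction tasks.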


📄 Abstract

Whole-body multi-modal human motion generation poses two primary challenges: creating an effective motion generation mechanism and integrating various modalities, such as text, speech, and music, into a cohesive framework. Unlike previous methods that usually employ discrete masked modeling or autoregressive modeling, we develop a continuous masked autoregressive motion transformer, where a causal attention is performed considering the sequential nature within the human motion. Within this transformer, we introduce a gated linear attention and an RMSNorm module, which drive the transformer to pay attention to the key actions and suppress the instability caused by either the abnormal movements or the heterogeneous distributions within multi-modalities. To further enhance both the motion generation and the multimodal generalization, we employ the DiT structure to diffuse the conditions from the transformer towards the targets. To fuse different modalities, AdaLN and cross-attention are leveraged to inject the text, speech, and music signals. Experimental results demonstrate that our framework outperforms previous methods across all modalities, including text-to-motion, speech-to-gesture, and music-to-dance. The code of our method will be made public.
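
RMSNorm rescales each feature vector by its root-mean-square value, which is one reason it can damp abnormally large activations. The sketch below shows the standard RMSNorm computation in NumPy; the example inputs are made up, and how the module is placed inside the motion transformer is not specified in the abstract.

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    """Scale each vector in x to unit root-mean-square, then apply a learned gain."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * gain

# The second row mimics an abnormally large motion feature (100x the first).
x = np.array([[3.0, 4.0], [300.0, 400.0]])
y = rms_norm(x, gain=np.ones(2))
print(y)  # both rows map to nearly the same normalized vector
```

Unlike LayerNorm, RMSNorm does not subtract the mean, so it preserves the direction of the feature vector and only changes its scale.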

[40] ChangingGrounding: 3D Visual Grounding in Changing Scenes

Miao Hu, Zhiwei Huang, Tai Wang, Jiangmiao Pang, Dahua Lin, Nanning Zheng, Runsen Xu

🧩 TL;DR

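This paper introduces the ChangingGrounding benchmark and the Mem-ChangingGrounder method, which for the first time reformulate 3D visual grounding as an active, memory-driven problem. In changing scenes, the method achieves precise 3D bounding-box localization by exploiting past observations and exploring efficiently.


📘 Detailed Summary

Motivation: Existing 3D visual grounding methods usually assume a reconstructed, up-to-date point cloud, which requires costly re-scans and hinders deployment. Real-world robots, however, must localize objects from natural-language instructions while the scene keeps changing, so 3D visual grounding should be reformulated as an active, memory-driven problem.

Method: Mem-ChangingGrounder is a zero-shot method that combines cross-modal retrieval with lightweight multi-view fusion. It identifies the object type implied by the query, retrieves relevant memories to guide actions, explores the scene efficiently for the target, falls back when previous operations are invalid, performs multi-view scanning of the target, and projects the fused evidence from the multi-view scans to obtain an accurate object bounding box.

Result: Among the baselines evaluated on the ChangingGrounding benchmark, Mem-ChangingGrounder achieves the highest localization accuracy while greatly reducing exploration cost.

Conclusion: The work pushes 3D visual grounding toward practical, memory-centric research, provides a new benchmark and method for real-world applications, and may catalyze a shift in the field's research paradigm.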


📄 Abstract

Real-world robots localize objects from natural-language instructions while scenes around them keep changing. Yet most existing 3D visual grounding (3DVG) methods still assume a reconstructed and up-to-date point cloud, an assumption that forces costly re-scans and hinders deployment. We argue that 3DVG should be formulated as an active, memory-driven problem, and we introduce ChangingGrounding, the first benchmark that explicitly measures how well an agent can exploit past observations, explore only where needed, and still deliver precise 3D boxes in changing scenes. To set a strong reference point, we also propose Mem-ChangingGrounder, a zero-shot method for this task that marries cross-modal retrieval with lightweight multi-view fusion: it identifies the object type implied by the query, retrieves relevant memories to guide actions, then explores the target efficiently in the scene, falls back when previous operations are invalid, performs multi-view scanning of the target, and projects the fused evidence from multi-view scans to get accurate object bounding boxes. We evaluate different baselines on ChangingGrounding, and our Mem-ChangingGrounder achieves the highest localization accuracy while greatly reducing exploration cost. We hope this benchmark and method catalyze a shift toward practical, memory-centric 3DVG research for real-world applications. Project page: https://hm123450.github.io/CGB/ .

[41] Learning an Image Editing Model without Image Editing Pairs

Nupur Kumari, Sheng-Yu Wang, Nanxuan Zhao, Yotam Nitzan, Yuheng Li, Krishna Kumar Singh, Richard Zhang, Eli Shechtman, Jun-Yan Zhu, Xun Huang

🧩 TL;DR

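This paper proposes a training paradigm for image editing that needs no paired data. It unrolls a diffusion model during training and optimizes it end to end with feedback from vision-language models, reaching performance on par with supervised methods without any paired supervision.


📘 Detailed Summary

Motivation: Current image editing models depend on supervised fine-tuning with large datasets of input-target pairs, but such naturally occurring pairs are hard to obtain at scale, and synthetic training pairs propagate the artifacts of the pretrained model. The dependence on paired data therefore needs to be removed.

Method: The method directly optimizes a few-step diffusion model by unrolling it during training. A vision-language model evaluates whether an edit follows the instruction and preserves unchanged content, providing direct gradients for end-to-end optimization. A distribution matching loss keeps generated images within the image manifold learned by the pretrained model.

Result: On standard benchmarks, without any paired data, the method performs on par with various image editing diffusion models trained on extensive supervised paired data under the few-step setting. Given the same vision-language model as the reward model, it also outperforms RL-based techniques such as Flow-GRPO.

Conclusion: The study shows that training image editing models without paired data is feasible. Direct optimization with vision-language model feedback can overcome the bottleneck of scarce supervised data and offers a new training paradigm with practical value.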


📄 Abstract

Recent image editing models have achieved impressive results while following natural language editing instructions, but they rely on supervised fine-tuning with large datasets of input-target pairs. This is a critical bottleneck, as such naturally occurring pairs are hard to curate at scale. Current workarounds use synthetic training pairs that leverage the zero-shot capabilities of existing models. However, this can propagate and magnify the artifacts of the pretrained model into the final trained model. In this work, we present a new training paradigm that eliminates the need for paired data entirely. Our approach directly optimizes a few-step diffusion model by unrolling it during training and leveraging feedback from vision-language models (VLMs). For each input and editing instruction, the VLM evaluates if an edit follows the instruction and preserves unchanged content, providing direct gradients for end-to-end optimization. To ensure visual fidelity, we incorporate distribution matching loss (DMD), which constrains generated images to remain within the image manifold learned by pretrained models. We evaluate our method on standard benchmarks and include an extensive ablation study. Without any paired data, our method performs on par with various image editing diffusion models trained on extensive supervised paired data, under the few-step setting. Given the same VLM as the reward model, we also outperform RL-based techniques like Flow-GRPO.

cs.CL [Back]

[42] Bridging the Semantic Gap: Contrastive Rewards for Multilingual Text-to-SQL

Ashish Kattamuri, Ishita Prasad, Meetu Malhotra, Arpita Vats, Rahul Raja, Albert Lie

🧩 TL;DR

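This study proposes a framework that combines Group Relative Policy Optimization (GRPO) with a multilingual contrastive reward signal to improve the task efficiency and semantic accuracy of cross-lingual Text-to-SQL systems. A reward based on semantic similarity strengthens the alignment between generated SQL and user intent, and markedly improves execution accuracy and semantic accuracy with only a small number of reinforcement learning training examples.


📘 Detailed Summary

Motivation: Current Text-to-SQL methods are evaluated only on executable queries and overlook the semantic alignment challenge, which covers both the semantic meaning of the query and the correctness of the execution results. When moving from English to other languages, execution accuracy drops by 6 percentage points on average, indicating a significant bottleneck in cross-lingual settings.

Method: The framework combines GRPO with a multilingual contrastive reward signal. A reward based on semantic similarity teaches the model a better correspondence between SQL generation and user intent. The semantic alignment objective is integrated into reinforcement learning training, and parameter-efficient fine-tuning uses only 3,000 training examples.

Result: On the seven-language MultiSpider dataset, fine-tuning LLaMA-3-3B with GRPO raises execution accuracy to 87.4% (+26 percentage points over zero-shot) and semantic accuracy to 52.29% (+32.86 percentage points). Adding the contrastive reward signal further raises average semantic accuracy to 59.14% (+6.85 percentage points, up to +10 for Vietnamese). Trained on only 3,000 examples, the 3B model exceeds the zero-shot 8B model in execution accuracy (88.86% vs. 81.43%) and approaches it in semantic accuracy (59.14% vs. 68.57%).

Conclusion: Directed semantic alignment through contrastive rewards can markedly improve Text-to-SQL performance without large-scale training datasets. A smaller 3B model, after efficient fine-tuning, can surpass a larger zero-shot model, demonstrating the method's effectiveness and parameter efficiency in cross-lingual settings. The approach offers a new way to address semantic alignment and is valuable for multilingual NLP applications.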


📄 Abstract

Current Text-to-SQL methods focus on, and are evaluated only on, executable queries, overlooking the semantic alignment challenge -- both in terms of the semantic meaning of the query and the correctness of the execution results. Even execution accuracy itself shows significant drops when moving from English to other languages, with an average decline of 6 percentage points across non-English languages. We address these challenges by presenting a new framework that combines Group Relative Policy Optimization (GRPO) with a multilingual contrastive reward signal to enhance both task efficiency and semantic accuracy in Text-to-SQL systems in cross-lingual scenarios. Our method teaches models to obtain better correspondence between SQL generation and user intent by incorporating a reward signal based on semantic similarity. On the seven-language MultiSpider dataset, fine-tuning the LLaMA-3-3B model with GRPO improved the execution accuracy up to 87.4 percent (+26 pp over zero-shot) and semantic accuracy up to 52.29 percent (+32.86 pp). Adding our contrastive reward signal in the GRPO framework further improved the average semantic accuracy to 59.14 percent (+6.85 pp, up to +10 pp for Vietnamese). Our experiments showcase that a smaller, parameter-efficient 3B LLaMA model fine-tuned with our contrastive reward signal outperforms a much larger zero-shot 8B LLaMA model, with an uplift of 7.43 pp in execution accuracy (from 81.43 percent on the 8B model to 88.86 percent on the 3B model), and nearly matches its semantic accuracy (59.14 percent vs. 68.57 percent) -- all using just 3,000 reinforcement learning training examples. These results demonstrate how we can improve the performance of Text-to-SQL systems with contrastive rewards for directed semantic alignment, without requiring large-scale training datasets.
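
GRPO scores each sampled output relative to the other samples drawn for the same prompt. The sketch below shows that group-relative normalization together with a simple reward that adds a semantic-similarity term to an execution check. The additive form, the `weight` value, and the candidate scores are assumptions for illustration and stand in for the paper's contrastive reward, whose exact form the abstract does not give.

```python
import statistics

def combined_reward(executes_correctly, semantic_similarity, weight=0.5):
    """Execution reward plus a weighted semantic-similarity bonus.

    The additive form and the weight are assumed, not taken from the paper.
    """
    return float(executes_correctly) + weight * semantic_similarity

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four hypothetical SQL candidates sampled for one question:
# (did the query return the right result?, similarity to the user's intent in [0, 1])
candidates = [(True, 0.9), (True, 0.4), (False, 0.7), (False, 0.1)]
rewards = [combined_reward(ok, sim) for ok, sim in candidates]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Because advantages are centered within the group, two candidates that both execute correctly are still separated by how well they match the user's intent.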

[43] Multimodal Retrieval-Augmented Generation with Large Language Models for Medical VQA

A H M Rezaul Karim, Ozlem Uzuner

🧩 TL;DR

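This paper presents the MasonNLP system, which pairs a retrieval-augmented generation framework with a general-domain, instruction-tuned large language model to provide a simple and effective solution for wound-care visual question answering. It ranked third in the MEDIQA-WV 2025 shared task.


📘 Detailed Summary

Motivation: Medical visual question answering plays an important role in clinical decision-making and patient care, but existing systems struggle with multimodal understanding and structured attribute generation in wound care, so reasoning, schema adherence, and response quality need to improve.

Method: The system uses a retrieval-augmented generation framework with a general-domain, instruction-tuned large language model. In-domain textual and visual examples are incorporated through simple indexing and fusion, with no extra training or complex re-ranking, giving lightweight multimodal reasoning.

Result: The system ranked 3rd among 19 teams and 51 submissions in the MEDIQA-WV 2025 shared task with an average score of 41.37%, performing well across dBLEU, ROUGE, BERTScore, and LLM-based metrics.

Conclusion: Lightweight RAG combined with a general-purpose LLM is a simple and effective baseline for multimodal clinical NLP tasks. Retrieving a few relevant exemplars is enough to improve performance markedly, with no complex architecture or extra training cost.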


📄 Abstract

Medical Visual Question Answering (MedVQA) enables natural language queries over medical images to support clinical decision-making and patient care. The MEDIQA-WV 2025 shared task addressed wound-care VQA, requiring systems to generate free-text responses and structured wound attributes from images and patient queries. We present the MasonNLP system, which employs a general-domain, instruction-tuned large language model with a retrieval-augmented generation (RAG) framework that incorporates textual and visual examples from in-domain data. This approach grounds outputs in clinically relevant exemplars, improving reasoning, schema adherence, and response quality across dBLEU, ROUGE, BERTScore, and LLM-based metrics. Our best-performing system ranked 3rd among 19 teams and 51 submissions with an average score of 41.37%, demonstrating that lightweight RAG with general-purpose LLMs -- a minimal inference-time layer that adds a few relevant exemplars via simple indexing and fusion, with no extra training or complex re-ranking -- provides a simple and effective baseline for multimodal clinical NLP tasks.
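
The "simple indexing and fusion" described above can be pictured as nearest-neighbor retrieval over embedded exemplars followed by prompt assembly. The sketch below uses cosine similarity over made-up 3-D embeddings and placeholder texts. It is an illustration of the general pattern under those assumptions, not the MasonNLP implementation, and the function names are hypothetical.

```python
import numpy as np

def top_k_exemplars(query_vec, exemplar_vecs, k=2):
    """Return indices of the k exemplars most similar to the query (cosine similarity)."""
    q = query_vec / np.linalg.norm(query_vec)
    e = exemplar_vecs / np.linalg.norm(exemplar_vecs, axis=1, keepdims=True)
    sims = e @ q
    return np.argsort(-sims)[:k].tolist()

def build_prompt(question, exemplars):
    """Prepend retrieved question/answer exemplars to the new question."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in exemplars)
    return f"{shots}\n\nQ: {question}\nA:"

# Toy in-domain store; texts and embeddings are placeholders for illustration.
store = [
    ("example question 1", "example answer 1"),
    ("example question 2", "example answer 2"),
    ("example question 3", "example answer 3"),
]
store_vecs = np.array([[1.0, 0.1, 0.0], [0.8, 0.6, 0.0], [0.0, 0.2, 1.0]])
query_vec = np.array([1.0, 0.2, 0.0])

picked = top_k_exemplars(query_vec, store_vecs, k=2)
prompt = build_prompt("new patient question", [store[i] for i in picked])
print(prompt)
```

In a real system the vectors would come from text and image encoders, and the assembled prompt would be sent to the instruction-tuned model at inference time.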

[44] Ensembling Large Language Models to Characterize Affective Dynamics in Student-AI Tutor Dialogues

Chenyu Zhang, Sharifa Alghowinem, Cynthia Breazeal

🧩 TL;DR

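This study introduces the first ensemble-LLM framework for large-scale affect sensing in tutoring dialogues. By analyzing learners' affective dynamics during AI tutoring, it offers a new perspective on integrating generative AI into education responsibly. Students typically show mildly positive affect during AI tutoring, confusion and curiosity frequently accompany problem solving, and negative emotions often resolve quickly.


📘 Detailed Summary

Motivation: Although prior studies have examined the learning impact of large language models in educational contexts, the affective dynamics of LLM-mediated tutoring remain insufficiently understood. This study aims to fill that gap and, by attending to learners' evolving affective states, advance the discussion on responsibly integrating generative AI into education.

Method: The study develops the first ensemble-LLM framework for large-scale affect sensing in tutoring dialogues and analyzes 16,986 conversational turns over two semesters between 261 undergraduate learners and the PyTutor AI tutor. Three frontier LLMs (Gemini, GPT-4o, Claude) produce zero-shot affect annotations, including scalar ratings of valence, arousal, and learning-helpfulness along with free-text emotion labels. Rank-weighted intra-model pooling and plurality consensus across models fuse these estimates into robust emotion profiles.

Result: Students typically report mildly positive affect and moderate arousal when interacting with the AI tutor, but learning is not uniformly smooth: confusion and curiosity are frequent companions to problem solving, and frustration, while less common, can still derail progress. Emotional states are short-lived. Positive moments last slightly longer than neutral or negative ones but are easily disrupted, and negative emotions often resolve quickly, sometimes rebounding directly into positive states. Neutral moments often act as turning points that steer students upward more often than downward.

Conclusion: The study reveals the complexity of learners' affective dynamics in AI tutoring. The finding that neutral moments are potential intervention points has important implications for the design of intelligent tutoring systems. The results underline the importance of the affective dimension in educational AI and point toward more responsive and supportive systems, especially through timely intervention at affective turning points.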


📄 Abstract

While recent studies have examined the learning impact of large language models (LLMs) in educational contexts, the affective dynamics of LLM-mediated tutoring remain insufficiently understood. This work introduces the first ensemble-LLM framework for large-scale affect sensing in tutoring dialogues, advancing the conversation on responsible pathways for integrating generative AI into education by attending to learners' evolving affective states. To achieve this, we analyzed two semesters' worth of 16,986 conversational turns exchanged between PyTutor, an LLM-powered AI tutor, and 261 undergraduate learners across three U.S. institutions. To investigate learners' emotional experiences, we generate zero-shot affect annotations from three frontier LLMs (Gemini, GPT-4o, Claude), including scalar ratings of valence, arousal, and learning-helpfulness, along with free-text emotion labels. These estimates are fused through rank-weighted intra-model pooling and plurality consensus across models to produce robust emotion profiles. Our analysis shows that during interaction with the AI tutor, students typically report mildly positive affect and moderate arousal. Yet learning is not uniformly smooth: confusion and curiosity are frequent companions to problem solving, and frustration, while less common, still surfaces in ways that can derail progress. Emotional states are short-lived: positive moments last slightly longer than neutral or negative ones, but they are fragile and easily disrupted. Encouragingly, negative emotions often resolve quickly, sometimes rebounding directly into positive states. Neutral moments frequently act as turning points, more often steering students upward than downward, suggesting opportunities for tutors to intervene at precisely these junctures.
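
The abstract names rank-weighted intra-model pooling and plurality consensus but does not define them. The sketch below shows one plausible reading, under two stated assumptions: each model returns ranked emotion labels over several runs, and rank r is weighted by 1/r. The function names, weighting, and tie-breaking rule are hypothetical.

```python
from collections import Counter

def rank_weighted_pool(ranked_labels):
    """Score one run's ranked emotion labels, weighting rank r by 1/r (assumed weighting)."""
    scores = Counter()
    for rank, label in enumerate(ranked_labels, start=1):
        scores[label] += 1.0 / rank
    return scores

def top_label(ranked_runs):
    """Pick one model's label by summing rank-weighted scores over its repeated runs."""
    total = Counter()
    for run in ranked_runs:
        total.update(rank_weighted_pool(run))
    return max(sorted(total), key=total.get)  # sorted() makes ties deterministic

def plurality(labels):
    """Return the label named by the most models (ties broken alphabetically)."""
    counts = Counter(labels)
    return max(sorted(counts), key=counts.get)

# Made-up ranked outputs from three anonymous models, two runs each.
runs_by_model = {
    "model_a": [["confused", "curious"], ["confused", "frustrated"]],
    "model_b": [["curious", "confused"], ["curious", "neutral"]],
    "model_c": [["confused", "neutral"], ["neutral", "confused"]],
}
per_model = {m: top_label(runs) for m, runs in runs_by_model.items()}
consensus = plurality(per_model.values())
print(per_model, consensus)
```

Pooling within a model first and voting across models second keeps any single model's noisy run from deciding the final label.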

[45] Order from Chaos: Comparative Study of Ten Leading LLMs on Unstructured Data Categorization

Ariel Kamen

🧩 TL;DR

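This study systematically evaluates ten state-of-the-art large language models on unstructured text categorization. Despite growing model scale, categorization performance remains limited, while an ensemble approach substantially improves accuracy and completely eliminates hallucinations.


📘 Detailed Summary

Motivation: The practical performance of large language models on unstructured text categorization is not well established, especially when rich text must be compressed into a limited taxonomy. A systematic evaluation of model behavior and limitations in realistic categorization settings is needed.

Method: The study uses the IAB 2.2 hierarchical taxonomy and 8,660 human-annotated samples, evaluates ten large language models with identical zero-shot prompts, and develops an ensemble approach in which multiple models act as independent experts to improve categorization performance.

Result: Contemporary LLMs achieve only moderate scores on classic metrics, averaging 34% accuracy, 42% precision, 45% recall, and 41% F1-score, with high hallucination and inflation ratios. The ensemble approach substantially improves accuracy and completely eliminates hallucinations.

Conclusion: Scaling and architectural improvements alone do not guarantee better categorization. Coordinated orchestration of models, rather than sheer scale, is the more effective route for large-scale text categorization and may reach or surpass human-expert performance.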


📄 Abstract

This study presents a comparative evaluation of ten state-of-the-art large language models (LLMs) applied to unstructured text categorization using the Interactive Advertising Bureau (IAB) 2.2 hierarchical taxonomy. The analysis employed a uniform dataset of 8,660 human-annotated samples and identical zero-shot prompts to ensure methodological consistency across all models. Evaluation metrics included four classic measures - accuracy, precision, recall, and F1-score - and three LLM-specific indicators: hallucination ratio, inflation ratio, and categorization cost. Results show that, despite their rapid advancement, contemporary LLMs achieve only moderate classic performance, with average scores of 34% accuracy, 42% precision, 45% recall, and 41% F1-score. Hallucination and inflation ratios reveal that models frequently overproduce categories relative to human annotators. Among the evaluated systems, Gemini 1.5/2.0 Flash and GPT 20B/120B offered the most favorable cost-to-performance balance, while GPT 120B demonstrated the lowest hallucination ratio. The findings suggest that scaling and architectural improvements alone do not ensure better categorization accuracy, as the task requires compressing rich unstructured text into a limited taxonomy - a process that challenges current model architectures. To address these limitations, a separate ensemble-based approach was developed and tested. The ensemble method, in which multiple LLMs act as independent experts, substantially improved accuracy, reduced inflation, and completely eliminated hallucinations. These results indicate that coordinated orchestration of models - rather than sheer scale - may represent the most effective path toward achieving or surpassing human-expert performance in large-scale text categorization.
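
The abstract does not define the hallucination and inflation ratios. The sketch below assumes one natural reading: hallucination is the share of predicted categories that are not in the taxonomy, and inflation is the number of predicted categories divided by the number a human assigned. Both definitions and the four-entry toy taxonomy (which is not the real IAB 2.2 list) are assumptions for illustration.

```python
def hallucination_ratio(predicted, taxonomy):
    """Fraction of predicted categories that do not exist in the taxonomy (assumed definition)."""
    if not predicted:
        return 0.0
    return sum(1 for c in predicted if c not in taxonomy) / len(predicted)

def inflation_ratio(predicted, human):
    """Number of predicted categories relative to the number a human assigned (assumed definition)."""
    return len(predicted) / len(human)

# Toy taxonomy and labels for one document.
taxonomy = {"Sports", "Travel", "Food & Drink", "Technology"}
human = ["Travel"]
predicted = ["Travel", "Food & Drink", "Adventure Tourism"]

print(hallucination_ratio(predicted, taxonomy))  # one of three labels is outside the taxonomy
print(inflation_ratio(predicted, human))         # three predicted labels against one human label
```

Under these definitions a model can score well on recall while still inflating its output, which is why the two ratios are reported separately from the classic metrics.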

[46] Knowledge Reasoning Language Model: Unifying Knowledge and Language for Inductive Knowledge Graph Reasoning

Xingrui Zhuo, Jiapu Wang, Gongqing Wu, Zhongyuan Wang, Jichen Zhang, Shirui Pan, Xindong Wu

🧩 TL;DR

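This paper proposes the Knowledge Reasoning Language Model (KRLM), which uses a Knowledge Reasoning Language instruction format and a dynamic knowledge memory mechanism to coordinate LLM knowledge with knowledge graph context in inductive knowledge graph reasoning, addressing LLM knowledge distortion and generative hallucination.


📘 Detailed Summary

Motivation: Existing LLM-based knowledge graph foundation models face two key problems in inductive knowledge graph reasoning. The intrinsic knowledge of the LLM may be overshadowed by sparse knowledge graph context, causing knowledge distortion, and existing methods struggle to fully constrain generative hallucinations, which severely limits the credibility of reasoning results.

Method: KRLM comprises a Knowledge Reasoning Language (KRL) instruction format and a KRL tokenizer that align LLM knowledge with KG representations; a KRL attention layer that coordinates intrinsic LLM knowledge with additional KG context through a dynamic knowledge memory mechanism; and a structure-aware next-entity predictor that strictly constrains reasoning results to a trustworthy knowledge domain.

Result: Extensive experiments on 25 real-world inductive knowledge graph reasoning datasets show that KRLM is significantly superior in both zero-shot reasoning and fine-tuning scenarios, supporting its effectiveness against knowledge distortion and generative hallucination.

Conclusion: By coordinating LLM knowledge with KG context in a unified way, the work addresses knowledge distortion and generative hallucination in inductive knowledge graph reasoning, offers an effective solution for trustworthy open-domain knowledge reasoning systems, and shows broad potential in real-world settings.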


📄 Abstract

Inductive Knowledge Graph Reasoning (KGR) aims to discover facts in open-domain KGs containing unknown entities and relations, which poses a challenge for KGR models in comprehending uncertain KG components. Existing studies have proposed Knowledge Graph Foundation Models (KGFMs) that learn structural invariances across KGs to handle this uncertainty. Recently, Large Language Models (LLMs) have demonstrated strong capabilities for open-domain knowledge reasoning. As a result, the latest research has focused on LLM-based KGFMs that integrate LLM knowledge with KG context for inductive KGR. However, the intrinsic knowledge of LLMs may be overshadowed by sparse KG context, leading to LLM knowledge distortion, which can cause irreversible damage to model reasoning. Moreover, existing LLM-based KGR methods still struggle to fully constrain generative hallucinations in LLMs, severely limiting the credibility of reasoning results. To address these limitations, we propose a Knowledge Reasoning Language Model (KRLM) that achieves unified coordination between LLM knowledge and KG context throughout the KGR process. Specifically, we design a Knowledge Reasoning Language (KRL) instruction format and a KRL tokenizer to align LLM knowledge with KG representations. Then, we propose a KRL attention layer that coordinates intrinsic LLM knowledge with additional KG context through a dynamic knowledge memory mechanism. Finally, a structure-aware next-entity predictor is proposed, which strictly constrains the reasoning results within a trustworthy knowledge domain. Extensive experimental results on 25 real-world inductive KGR datasets demonstrate the significant superiority of the proposed KRLM in both zero-shot reasoning and fine-tuning scenarios. Our source code is available at https://anonymous.4open.science/r/KRLM-EA36.

[47] MathMist: A Parallel Multilingual Benchmark Dataset for Mathematical Problem Solving and Reasoning

Mahbub E Sobhani, Md. Faiyaz Abdullah Sayeedi, Tasnim Mohiuddin, Md Mofijul Islam, Swakkhar Shatabda

🧩 TL;DR

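This study introduces MathMist, a parallel multilingual benchmark for mathematical reasoning with over 21K aligned question-answer pairs across seven languages, and uses it to systematically evaluate how consistently and interpretably LLMs reason about mathematics across languages.


📘 Detailed Summary

Motivation: Existing mathematical reasoning benchmarks focus mainly on English or a few high-resource languages and do not assess multilingual and cross-lingual mathematical reasoning comprehensively, leaving a notable gap for medium- and low-resource languages.

Method: The authors build a parallel multilingual dataset covering high-, medium-, and low-resource languages and systematically evaluate open-source and proprietary LLMs, including multilingual-reasoning-focused models, under zero-shot, chain-of-thought, and code-switched reasoning paradigms.

Result: The experiments reveal persistent deficiencies in cross-lingual mathematical reasoning: performance degrades markedly in low-resource languages, and models struggle to keep their reasoning consistent and interpretable.

Conclusion: The study exposes the limitations of current LLMs in multilingual mathematical reasoning, underscores the need for more robust multilingual reasoning capabilities, and provides a benchmark and direction for future research.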


📄 Abstract

Mathematical reasoning remains one of the most challenging domains for large language models (LLMs), requiring not only linguistic understanding but also structured logical deduction and numerical precision. While recent LLMs demonstrate strong general-purpose reasoning abilities, their mathematical competence across diverse languages remains underexplored. Existing benchmarks primarily focus on English or a narrow subset of high-resource languages, leaving significant gaps in assessing multilingual and cross-lingual mathematical reasoning. To address this, we introduce MathMist, a parallel multilingual benchmark for mathematical problem solving and reasoning. MathMist encompasses over 21K aligned question-answer pairs across seven languages, representing a balanced coverage of high-, medium-, and low-resource linguistic settings. The dataset captures linguistic variety, multiple types of problem settings, and solution synthesizing capabilities. We systematically evaluate a diverse suite of models, including open-source small and medium LLMs, proprietary systems, and multilingual-reasoning-focused models, under zero-shot, chain-of-thought (CoT), and code-switched reasoning paradigms. Our results reveal persistent deficiencies in LLMs' ability to perform consistent and interpretable mathematical reasoning across languages, with pronounced degradation in low-resource settings. All the codes and data are available at GitHub: https://github.com/mahbubhimel/MathMist

[48] MERLIN: A Testbed for Multilingual Multimodal Entity Recognition and Linking

Sathyanarayanan Ramamoorthy, Vishwa Shah, Simran Khanuja, Zaid Sheikh, Shan Jie, Ann Chia, Shearman Chua, Graham Neubig

🧩 TL;DR

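This paper introduces MERLIN, a novel testbed system for multilingual multimodal entity linking, with a multilingual dataset of BBC news article titles paired with images, and shows that visual data improves entity linking accuracy, especially when the textual context is ambiguous or insufficient.


📘 Detailed Summary

Motivation: Multilingual entity linking currently relies mainly on text and makes little use of multimodal data; accuracy suffers when the textual context is ambiguous or insufficient, which motivates examining the value of visual information for multilingual entity linking.

Method: The authors build a multilingual dataset of BBC news article titles paired with corresponding images in five languages (Hindi, Japanese, Indonesian, Vietnamese, and Tamil), containing over 7,000 named entity mentions linked to 2,500 unique Wikidata entities, and benchmark multilingual and multimodal entity linking methods with language models such as LLaMa-2 and Aya-23.

Result: Incorporating visual data improves entity linking accuracy, most clearly for entities whose textual context is ambiguous or insufficient, and the benefit is particularly pronounced for models without strong multilingual abilities.

Conclusion: Multimodal methods are valuable for multilingual entity linking because visual information compensates for gaps in the text; future work should integrate multimodal information more fully, particularly for low-resource languages.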


📄 Abstract

This paper introduces MERLIN, a novel testbed system for the task of Multilingual Multimodal Entity Linking. The created dataset includes BBC news article titles, paired with corresponding images, in five languages: Hindi, Japanese, Indonesian, Vietnamese, and Tamil, featuring over 7,000 named entity mentions linked to 2,500 unique Wikidata entities. We also include several benchmarks using multilingual and multimodal entity linking methods exploring different language models like LLaMa-2 and Aya-23. Our findings indicate that incorporating visual data improves the accuracy of entity linking, especially for entities where the textual context is ambiguous or insufficient, and particularly for models that do not have strong multilingual abilities. The dataset and methods are available at https://github.com/rsathya4802/merlin

[49] PluriHop: Exhaustive, Recall-Sensitive QA over Distractor-Rich Corpora

Mykolas Sveistrys, Richard Kunert

🧩 TL;DR

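This paper proposes the PluriHopRAG framework to handle repetitive documents and distractors in question answering that must aggregate over all documents. Using document-level subquestion decomposition and a cross-encoder filter, it achieves relative F1 improvements of 18-52% on a dataset of real-world wind industry reports.


📘 Detailed Summary

Motivation: Existing QA systems struggle with pluri-hop questions, which require aggregating information across all documents: such questions are recall-sensitive, demand exhaustive retrieval, and are highly sensitive to any missed document, while conventional retrieval-augmented generation performs poorly on repetitive, distractor-dense report corpora.

Method: The proposed PluriHopRAG architecture follows a "check all documents individually, filter cheaply" approach: it first decomposes the query into document-level subquestions, then uses a cross-encoder filter to discard irrelevant documents before costly LLM reasoning.

Result: On the multilingual PluriHopWIND dataset, none of the tested approaches (a traditional RAG pipeline and graph-based and multimodal variants) exceeds 40% statement-wise F1, while PluriHopRAG achieves relative F1 improvements of 18-52%, depending on the base LLM.

Conclusion: The study exposes the limitations of current QA systems on repetitive, distractor-rich corpora and shows the value of exhaustive retrieval with early filtering as an alternative to top-k methods for real-world report data.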


📄 Abstract

Recent advances in large language models (LLMs) and retrieval-augmented generation (RAG) have enabled progress on question answering (QA) when relevant evidence is in one (single-hop) or multiple (multi-hop) passages. Yet many realistic questions about recurring report data - medical records, compliance filings, maintenance logs - require aggregation across all documents, with no clear stopping point for retrieval and high sensitivity to even one missed passage. We term these pluri-hop questions and formalize them by three criteria: recall sensitivity, exhaustiveness, and exactness. To study this setting, we introduce PluriHopWIND, a diagnostic multilingual dataset of 48 pluri-hop questions built from 191 real-world wind industry reports in German and English. We show that PluriHopWIND is 8-40% more repetitive than other common datasets and thus has higher density of distractor documents, better reflecting practical challenges of recurring report corpora. We test a traditional RAG pipeline as well as graph-based and multimodal variants, and find that none of the tested approaches exceed 40% in statement-wise F1 score. Motivated by this, we propose PluriHopRAG, a RAG architecture that follows a "check all documents individually, filter cheaply" approach: it (i) decomposes queries into document-level subquestions and (ii) uses a cross-encoder filter to discard irrelevant documents before costly LLM reasoning. We find that PluriHopRAG achieves relative F1 score improvements of 18-52% depending on base LLM. Despite its modest size, PluriHopWIND exposes the limitations of current QA systems on repetitive, distractor-rich corpora. PluriHopRAG's performance highlights the value of exhaustive retrieval and early filtering as a powerful alternative to top-k methods.
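
The "check all documents individually, filter cheaply" idea can be sketched as a small control loop. This is only an illustration under stated assumptions: `plurihop_answer`, `decompose`, `relevance`, and `extract` are hypothetical names, the cross-encoder and the LLM are replaced by plain callables, and the threshold value is arbitrary. It is not the authors' implementation.

```python
from typing import Callable, Dict, List


def plurihop_answer(
    question: str,
    documents: Dict[str, str],
    decompose: Callable[[str], List[str]],   # hypothetical: query -> document-level subquestions
    relevance: Callable[[str, str], float],  # hypothetical stand-in for a cross-encoder score
    extract: Callable[[str, str], str],      # hypothetical stand-in for the costly LLM call
    threshold: float = 0.5,                  # arbitrary illustrative cut-off
) -> Dict[str, List[str]]:
    """Visit every document; call the costly extractor only where the cheap filter passes."""
    subquestions = decompose(question)
    findings: Dict[str, List[str]] = {}
    for doc_id, text in documents.items():   # exhaustive pass, no top-k cut-off
        for sub in subquestions:
            if relevance(sub, text) < threshold:
                continue                     # cheap filter discards the pair
            findings.setdefault(doc_id, []).append(extract(sub, text))
    return findings


# Toy usage with stub callables in place of real models.
docs = {
    "r1": "turbine A gearbox fault 2021",
    "r2": "annual catering invoice",
    "r3": "turbine B gearbox fault 2022",
}
result = plurihop_answer(
    "Which turbines had gearbox faults?",
    docs,
    decompose=lambda q: ["Does this report mention a gearbox fault?"],
    relevance=lambda sub, text: 1.0 if "gearbox" in text else 0.0,
    extract=lambda sub, text: text.split(" gearbox")[0],
)
print(result)
```

Because every document is visited, a missed passage can only come from the filter or the extractor, not from a retrieval cut-off, which is the property the abstract emphasizes for recall-sensitive questions.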

[50] Explore to Evolve: Scaling Evolved Aggregation Logic via Proactive Online Exploration for Deep Research Agents

Rui Wang, Ce Zhang, Jun-Yu Ma, Jianshu Zhang, Hongru Wang, Yi Chen, Boyang Xue, Tianqing Fang, Zhisong Zhang, Hongming Zhang, Haitao Mi, Dong Yu, Kam-Fai Wong

🧩 TL;DR

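This paper proposes an Explore to Evolve paradigm for constructing verifiable training data for web agents and develops the WebAggregator series of foundation models; WebAggregator-8B matches GPT-4.1, and the 32B variant surpasses GPT-4.1 by more than 10% on GAIA-text and closely approaches Claude-3.7-sonnet.


📘 Detailed Summary

Motivation: Existing open-source deep research agents focus mainly on information-seeking and overlook information aggregation, which limits their ability to support in-depth research; the work targets this gap in web agents' aggregation capabilities.

Method: Under the Explore to Evolve paradigm, an agent first gathers grounded evidence through proactive online exploration of the real web, then self-evolves an aggregation program by selecting, composing, and refining operations from 12 high-level logical types to synthesize a verifiable QA pair. Supervised fine-tuning trajectories collected with the SmolAgents framework are used to develop the WebAggregator series of foundation models.

Result: The resulting WebAggregatorQA dataset contains 10K samples across 50K websites and 11 domains. WebAggregator-8B matches GPT-4.1, and the 32B variant surpasses GPT-4.1 by more than 10% on GAIA-text and closely approaches Claude-3.7-sonnet. On the human-annotated WebAggregatorQA evaluation split, Claude-3.7-sonnet achieves only 28% and GPT-4.1 scores 25.8%.

Conclusion: Even when agents manage to retrieve all references, they still struggle on WebAggregatorQA, which highlights the need to strengthen the information aggregation capabilities of web agent foundation models.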


📄 Abstract

Deep research web agents not only retrieve information from diverse sources such as web environments, files, and multimodal inputs, but more importantly, they need to rigorously analyze and aggregate knowledge for insightful research. However, existing open-source deep research agents predominantly focus on enhancing information-seeking capabilities of web agents to locate specific information, while overlooking the essential need for information aggregation, which limits their ability to support in-depth research. We propose an Explore to Evolve paradigm to scalably construct verifiable training data for web agents. Beginning with proactive online exploration, an agent sources grounded information by exploring the real web. Using the collected evidence, the agent then self-evolves an aggregation program by selecting, composing, and refining operations from 12 high-level logical types to synthesize a verifiable QA pair. This evolution from high-level guidance to concrete operations allows us to scalably produce WebAggregatorQA, a dataset of 10K samples across 50K websites and 11 domains. Based on an open-source agent framework, SmolAgents, we collect supervised fine-tuning trajectories to develop a series of foundation models, WebAggregator. WebAggregator-8B matches the performance of GPT-4.1, while the 32B variant surpasses GPT-4.1 by more than 10% on GAIA-text and closely approaches Claude-3.7-sonnet. Moreover, given the limited availability of benchmarks that evaluate web agents' information aggregation abilities, we construct a human-annotated evaluation split of WebAggregatorQA as a challenging test set. On this benchmark, Claude-3.7-sonnet only achieves 28%, and GPT-4.1 scores 25.8%. Even when agents manage to retrieve all references, they still struggle on WebAggregatorQA, highlighting the need to strengthen the information aggregation capabilities of web agent foundations.

[51] Beyond Correctness: Evaluating Subjective Writing Preferences Across Cultures

Shuangshuang Ying, Yunwen Li, Xingwei Qu, Xin Li, Sheng Jin, Minghao Liu, Zhoufutu Wen, Xeron Du, Tianyu Zheng, Yichi Zhang, Letian Ni, Yuyang Cheng, Qiguang Chen, Jingzhe Ding, Shengda Long, Wangchunshu Zhou, Jiazhan Feng, Wanjun Zhong, Libo Qin, Ge Zhang, Wenhao Huang, Wanxiang Che, Chenghua Lin

🧩 TL;DR

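This paper introduces the WritingPreferenceBench benchmark and finds that current RLHF methods primarily learn to detect objective errors rather than capture subjective quality preferences; generative reward models that produce explicit reasoning chains substantially outperform sequence-based reward models on subjective preference evaluation.


📘 Detailed Summary

Motivation: Current preference learning methods do well on standard benchmarks but degrade significantly once objective quality signals are removed, exposing their limited ability to capture subjective quality preferences such as creativity, stylistic flair, and emotional resonance.

Method: The authors build a multilingual writing dataset of 1,800 human-annotated preference pairs and compare sequence-based reward models, zero-shot language model judges, and generative reward models, which produce explicit reasoning chains before making a preference judgment.

Result: Sequence-based reward models reach only 52.7% mean accuracy and zero-shot language model judges 53.9%, while generative reward models reach 81.8%. Within-model variance across genres is high: individual models range from 18.2% to 81.8% accuracy across writing categories, with standard deviations averaging 10.1%, and scaling up model size brings no consistent improvement.

Conclusion: The results suggest that successful preference modeling may require intermediate reasoning representations rather than direct classification, and that current RLHF methods primarily learn to detect objective errors rather than capture subjective quality preferences.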


📄 Abstract

Current preference learning methods achieve high accuracy on standard benchmarks but exhibit significant performance degradation when objective quality signals are removed. We introduce WritingPreferenceBench, a dataset of 1,800 human-annotated preference pairs (1,200 English, 600 Chinese) across 8 creative writing genres, where responses are matched for objective correctness, factual accuracy, and length. On this benchmark, sequence-based reward models--the standard architecture for RLHF--achieve only 52.7% mean accuracy, while zero-shot language model judges perform at 53.9%. In contrast, generative reward models that produce explicit reasoning chains achieve 81.8% accuracy. We observe high within-model variance across genres: individual models range from 18.2% to 81.8% accuracy across different writing categories, with standard deviations averaging 10.1%. This variance persists regardless of model scale, with 27B parameter models showing no consistent improvement over 8B variants. Our results suggest that current RLHF methods primarily learn to detect objective errors rather than capture subjective quality preferences (e.g., creativity, stylistic flair, and emotional resonance), and that successful preference modeling may require intermediate reasoning representations rather than direct classification.

[52] AutoRubric-R1V: Rubric-Based Generative Rewards for Faithful Multimodal Reasoning

Mengzhao Jia, Zhihan Zhang, Ignacio Cases, Zheyuan Liu, Meng Jiang, Peng Qi

🧩 TL;DR

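This paper proposes AutoRubric-R1V, a framework that combines reinforcement learning with process-level supervision through automatically collected rubric-based generative rewards, addressing the spurious reasoning that arises in multimodal large language models when only final-answer correctness is rewarded.


📘 Detailed Summary

Motivation: Multimodal large language models have advanced from perception tasks to complex multi-step reasoning, but reinforcement learning with verifiable rewards often leads to spurious reasoning because it rewards only final-answer correctness and ignores the quality of the reasoning process.

Method: The core of AutoRubric-R1V is a scalable self-aggregation method that distills consistent reasoning checkpoints from successful trajectories, which allows problem-specific rubrics to be built without human annotation or stronger teacher models. Process-level supervision comes from jointly using rubric-based and outcome rewards.

Result: AutoRubric-R1V achieves state-of-the-art performance on six multimodal reasoning benchmarks and substantially improves reasoning faithfulness in dedicated evaluations.

Conclusion: Combining process-level supervision with reinforcement learning improves the quality and reliability of multimodal reasoning and gives a scalable way to train for complex reasoning tasks without external supervision.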


📄 Abstract

Multimodal large language models (MLLMs) have rapidly advanced from perception tasks to complex multi-step reasoning, yet reinforcement learning with verifiable rewards (RLVR) often leads to spurious reasoning since only the final-answer correctness is rewarded. To address this limitation, we propose AutoRubric-R1V, a framework that integrates RLVR with process-level supervision through automatically collected rubric-based generative rewards. Our key innovation lies in a scalable self-aggregation method that distills consistent reasoning checkpoints from successful trajectories, enabling problem-specific rubric construction without human annotation or stronger teacher models. By jointly leveraging rubric-based and outcome rewards, AutoRubric-R1V achieves state-of-the-art performance on six multimodal reasoning benchmarks and substantially improves reasoning faithfulness in dedicated evaluations.

[53] Supervised Fine-Tuning or Contrastive Learning? Towards Better Multimodal LLM Reranking

Ziqi Dai, Xin Zhang, Mingxin Li, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Meishan Zhang, Wenjie Li, Min Zhang

🧩 TL;DR

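Through systematic comparison and analysis, this study finds that supervised fine-tuning (SFT) has a consistent advantage over contrastive learning (CL) for LLM-based reranking, mainly because SFT provides a substantially stronger weighting scheme, and it presents new state-of-the-art rerankers on the MRB benchmark.


📘 Detailed Summary

Motivation: Reranking models are trained with two main kinds of objectives: metric learning with a contrastive loss, and classification via supervised fine-tuning. Contrastive learning has been shown to be more effective for BERT-style encoders, while SFT appears more promising for large language models. This divergence raises the central question of which objective is intrinsically better suited to LLM reranking and what mechanism underlies the difference.

Method: The study systematically compares CL and SFT using universal multimodal retrieval (UMR) as the experimental playground. It decomposes each objective into a weight component, which controls the magnitude of model updates, and a direction component, which guides them, presents a unified framework for understanding their interaction, and uses probing experiments to measure the contribution of each component.

Result: SFT provides a substantially stronger weighting scheme than CL, while the preferred scoring direction shows no clear winner. Taken together, the results point to a consistent advantage of SFT for LLM reranking, and large-scale SFT training yields new state-of-the-art rerankers on the MRB benchmark.

Conclusion: The advantage of SFT in LLM reranking stems mainly from its stronger weighting of updates. The paper also provides ablations on SFT settings and expects its findings to guide future research and applications in this area.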


📄 Abstract

In information retrieval, training reranking models mainly focuses on two types of objectives: metric learning (e.g. contrastive loss to increase the predicted scores on relevant query-document pairs) and classification (binary label prediction of relevance vs. irrelevance). For BERT-style encoders, various studies have shown that contrastive learning (CL) can be more effective than discriminative (classification) learning. However, for large language models (LLMs), classification via supervised fine-tuning (SFT), which predicts the "yes" (resp. "no") token for relevant (resp. irrelevant) pairs, appears more promising as it aligns well with the generative nature of LLMs. This divergence raises a central question: which objective is intrinsically better suited to LLM-based reranking, and what mechanism underlies the difference? In this work, we conduct a comprehensive comparison and analysis between CL and SFT for reranking, taking universal multimodal retrieval (UMR) as the experimental playground. We first decompose the objectives into two components: weight, which controls the magnitude of the model updates, and direction, which guides them, then present a unified framework for understanding their interactions. Through probing experiments, we find that SFT provides a substantially stronger weighting scheme than CL, whereas the preferred scoring direction shows no clear winner. Taken together, these results point to a consistent advantage of SFT over CL for LLM reranking. To further validate our findings, we conduct large-scale training with SFT and present new state-of-the-art rerankers on the MRB benchmark. We also provide ablations on SFT settings and expect our findings to benefit future research and applications in this area.
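
To make the two objectives concrete, the sketch below computes a contrastive (InfoNCE-style) loss and a pointwise yes/no cross-entropy loss on the same relevance scores, together with their gradients with respect to those scores. These are the textbook forms of the two losses, used here as an assumption; the paper's own weight/direction decomposition is not reproduced, and the function names are hypothetical.

```python
import numpy as np


def contrastive_loss_and_grad(scores, pos_index=0, temperature=1.0):
    """InfoNCE over one positive and several negatives for a single query.

    Returns the loss and d(loss)/d(scores).
    """
    z = np.asarray(scores, dtype=float) / temperature
    z = z - z.max()                          # numerical stability
    p = np.exp(z) / np.exp(z).sum()          # softmax over candidates
    loss = -np.log(p[pos_index])
    grad = p.copy()
    grad[pos_index] -= 1.0                   # softmax minus one-hot target
    return loss, grad / temperature


def sft_loss_and_grad(scores, labels):
    """Pointwise binary cross-entropy, treating each score as logit(yes) - logit(no).

    Returns the summed loss and d(loss)/d(scores).
    """
    s = np.asarray(scores, dtype=float)
    y = np.asarray(labels, dtype=float)
    loss = np.sum(np.logaddexp(0.0, s) - y * s)   # softplus(s) - y * s
    grad = 1.0 / (1.0 + np.exp(-s)) - y           # sigmoid(s) - y
    return loss, grad


scores = np.array([2.0, 0.5, -1.0])   # one relevant document, two irrelevant ones
labels = np.array([1.0, 0.0, 0.0])
cl_loss, cl_grad = contrastive_loss_and_grad(scores, pos_index=0)
sft_loss, sft_grad = sft_loss_and_grad(scores, labels)
print(cl_grad, sft_grad)
```

In this form the contrastive gradient always sums to zero across the candidates of a query, while each pointwise gradient depends only on its own score. That is one simple way to see that the two objectives scale updates differently even when they push scores in the same direction.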

[54] AI-Powered Early Diagnosis of Mental Health Disorders from Real-World Clinical Conversations

Jianfeng Zhu, Julina Maharjan, Xinyu Li, Karin G. Coifman, Ruoming Jin

🧩 TL;DR

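This study evaluates LLM-based machine learning models for mental health screening on a dataset of 553 real-world semi-structured interviews, achieving over 80% accuracy in diagnosing depression, anxiety, and PTSD and pointing toward scalable AI-assisted diagnostic tools for clinical settings.


📘 Detailed Summary

Motivation: Mental health disorders are among the leading causes of disability worldwide, yet depression, anxiety, and post-traumatic stress disorder are frequently underdiagnosed or misdiagnosed because of subjective assessments, limited clinical resources, stigma, and low awareness. In primary care settings, providers misidentify depression or anxiety in over 60% of cases, so scalable, accessible, and context-aware diagnostic tools are urgently needed to support early detection and intervention.

Method: Using a dataset of 553 real-world semi-structured interviews, the study benchmarks several model classes: zero-shot prompting with GPT-4.1 Mini and MetaLLaMA, and RoBERTa models fine-tuned with Low-Rank Adaptation (LoRA). It also examines how shorter, focused context segments affect detection performance.

Result: The models achieve over 80% accuracy across diagnostic categories, with especially strong performance on PTSD (up to 89% accuracy and 98% recall). Shorter, focused context segments improve recall, and lower-rank LoRA configurations remain competitive while keeping fine-tuning efficient.

Conclusion: LLM-based models can offer substantial improvements over traditional self-report screening tools and provide a path toward low-barrier, AI-assisted early diagnosis. The work lays the groundwork for integrating machine learning into real-world clinical workflows, particularly in low-resource or high-stigma environments.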


📄 Abstract

Mental health disorders remain among the leading causes of disability worldwide, yet conditions such as depression, anxiety, and Post-Traumatic Stress Disorder (PTSD) are frequently underdiagnosed or misdiagnosed due to subjective assessments, limited clinical resources, stigma, and low awareness. In primary care settings, studies show that providers misidentify depression or anxiety in over 60% of cases, highlighting the urgent need for scalable, accessible, and context-aware diagnostic tools that can support early detection and intervention. In this study, we evaluate the effectiveness of machine learning models for mental health screening using a unique dataset of 553 real-world, semi-structured interviews, each paired with ground-truth diagnoses for major depressive episodes (MDE), anxiety disorders, and PTSD. We benchmark multiple model classes, including zero-shot prompting with GPT-4.1 Mini and MetaLLaMA, as well as fine-tuned RoBERTa models using Low-Rank Adaptation (LoRA). Our models achieve over 80% accuracy across diagnostic categories, with especially strong performance on PTSD (up to 89% accuracy and 98% recall). We also find that using shorter, focused context segments improves recall, suggesting that focused narrative cues enhance detection sensitivity. LoRA fine-tuning proves both efficient and effective, with lower-rank configurations (e.g., rank 8 and 16) maintaining competitive performance across evaluation metrics. Our results demonstrate that LLM-based models can offer substantial improvements over traditional self-report screening tools, providing a path toward low-barrier, AI-powered early diagnosis. This work lays the groundwork for integrating machine learning into real-world clinical workflows, particularly in low-resource or high-stigma environments where access to timely mental health care is most limited.
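
A short sketch may help show why low ranks such as 8 or 16 keep fine-tuning cheap. It uses the standard LoRA formulation, in which a frozen weight W is adapted by a low-rank product scaled by alpha / r. The layer size 768 and the value alpha = 16 are assumptions chosen for illustration (the abstract does not state the RoBERTa variant or the scaling), and the function names are hypothetical.

```python
import numpy as np


def lora_forward(x, W, A, B, alpha):
    """Frozen linear layer plus a low-rank update: y = x W^T + (alpha / r) * x A^T B^T."""
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T


def trainable_params(d_out, d_in, r):
    """Only A (r x d_in) and B (d_out x r) are trained; W stays frozen."""
    return r * (d_in + d_out)


rng = np.random.default_rng(0)
d_in, d_out, r = 6, 4, 2
x = rng.normal(size=(3, d_in))
W = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(r, d_in))
B_zero = np.zeros((d_out, r))        # common initialization: the update starts at zero
B_rand = rng.normal(size=(d_out, r))

y_init = lora_forward(x, W, A, B_zero, alpha=16)
y_adapted = lora_forward(x, W, A, B_rand, alpha=16)

# Assumed 768 x 768 projection: rank 8 trains about 2% of the full matrix.
print(trainable_params(768, 768, 8), 768 * 768)
```

With B initialized to zero the adapted layer reproduces the frozen layer exactly, and after training the update can be folded back into W as W + (alpha / r) B A, so inference cost does not change.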

[55] DialectGen: Benchmarking and Improving Dialect Robustness in Multimodal Generation

Yu Zhou, Sohyun An, Haikang Deng, Da Yin, Clark Peng, Cho-Jui Hsieh, Kai-Wei Chang, Nanyun Peng

🧩 TL;DR

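This paper builds a new large-scale benchmark for multimodal generation from English dialect prompts, finds that existing models degrade by 32.26% to 48.17% on dialect input, and proposes an encoder-based mitigation method that raises performance on five dialects by 34.4% on Stable Diffusion 1.5 while preserving Standard American English performance.


📘 Detailed Summary

Motivation: Multimodal generative models degrade severely on English dialect input, but no benchmark systematically evaluates their dialect understanding; this work fills that gap and explores effective mitigation strategies.

Method: The authors build a large-scale benchmark spanning six common English dialects with over 4200 prompts verified by dialect speakers, and design a general encoder-based mitigation method that teaches a model to recognize new dialect features without harming Standard American English (SAE) performance.

Result: Across 17 image and video generative models, performance drops by 32.26% to 48.17% when a single dialect word is used in the prompt. On Stable Diffusion 1.5, the proposed encoder-based method raises performance on five dialects to be on par with SAE (+34.4%) at near zero cost to SAE performance.

Conclusion: Multimodal generative models have a marked dialect-understanding deficit. Fine-tuning and prompt rewriting help only marginally, whereas the proposed encoder-based strategy balances dialect adaptation with SAE preservation, offering a practical path toward more inclusive multimodal AI systems.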


📄 Abstract

Contact languages like English exhibit rich regional variations in the form of dialects, which are often used by dialect speakers interacting with generative models. However, can multimodal generative models effectively produce content given dialectal textual input? In this work, we study this question by constructing a new large-scale benchmark spanning six common English dialects. We work with dialect speakers to collect and verify over 4200 unique prompts and evaluate on 17 image and video generative models. Our automatic and human evaluation results show that current state-of-the-art multimodal generative models exhibit 32.26% to 48.17% performance degradation when a single dialect word is used in the prompt. Common mitigation methods such as fine-tuning and prompt rewriting can only improve dialect performance by small margins (< 7%), while potentially incurring significant performance degradation in Standard American English (SAE). To this end, we design a general encoder-based mitigation strategy for multimodal generative models. Our method teaches the model to recognize new dialect features while preserving SAE performance. Experiments on models such as Stable Diffusion 1.5 show that our method is able to simultaneously raise performance on five dialects to be on par with SAE (+34.4%), while incurring near zero cost to SAE performance.

cs.AI [Back]

[56] Do Slides Help? Multi-modal Context for Automatic Transcription of Conference Talks

Supriti Sinhamahapatra, Jan Niehues

🧩 TL;DR

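This study proposes a multimodal speech recognition approach that integrates presentation slides to improve transcription of scientific talks, with clear gains on domain-specific terminology.


📘 Detailed Summary

Motivation: State-of-the-art automatic speech recognition systems rely mainly on acoustic information and disregard multimodal context, even though visual information is essential for disambiguation and adaptation. Most prior work uses speaker images to handle noisy conditions; this work also integrates presentation slides for scientific presentations.

Method: The study first creates a benchmark for multimodal presentations that includes an automatic analysis of how domain-specific terminology is transcribed, then explores methods for augmenting speech models with multimodal information. Data augmentation mitigates the lack of datasets with accompanying slides, and a model is trained on the augmented dataset.

Result: Compared with the baseline model, the proposed approach reduces word error rate by approximately 34% relative across all words and by 35% relative for domain-specific terms.

Conclusion: Integrating visual information such as presentation slides improves speech recognition for scientific talks, especially for domain-specific terms, and provides a useful reference for multimodal speech recognition in specialized domains.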


📄 Abstract

State-of-the-art (SOTA) Automatic Speech Recognition (ASR) systems primarily rely on acoustic information while disregarding additional multi-modal context. However, visual information is essential in disambiguation and adaptation. While most work focuses on speaker images to handle noise conditions, this work also focuses on integrating presentation slides for the use case of scientific presentations. In a first step, we create a benchmark for multi-modal presentations, including an automatic analysis of transcribing domain-specific terminology. Next, we explore methods for augmenting speech models with multi-modal information. We mitigate the lack of datasets with accompanying slides through a suitable data augmentation approach. Finally, we train a model using the augmented dataset, resulting in a relative reduction in word error rate of approximately 34% across all words and 35% for domain-specific terms compared to the baseline model.
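
For readers unfamiliar with the metric, the sketch below computes word error rate (word-level edit distance divided by the reference length) and a relative reduction between two WER values. The example sentences and the WER values 0.20 and 0.132 are invented purely to illustrate the arithmetic; they are not results from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))             # distances from an empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]                                # deleting i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                     # deletion
                cur[j - 1] + 1,                  # insertion
                prev[j - 1] + (ref_word != hyp_word),  # match or substitution
            ))
        prev = cur
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative reduction: the share of the baseline error that was removed."""
    return (baseline - improved) / baseline


wer_term = word_error_rate(
    "the graph neural network converges",
    "the graph neutral network converges",       # one substituted domain term
)
wer_drop = word_error_rate("a b c d", "a c d")    # one deleted word
example_reduction = relative_reduction(0.20, 0.132)   # invented values
print(wer_term, wer_drop, round(example_reduction, 2))
```

A relative reduction of 34% therefore means the error rate fell to about two thirds of the baseline value, not that it fell by 34 percentage points.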

[57] GammaZero: Learning To Guide POMDP Belief Space Search With Graph Representations

Rajesh Mangannavar, Prasad Tadepalli

🧩 TL;DR

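This paper introduces GammaZero, a framework based on action-centric graph representations for learning to guide planning in partially observable Markov decision processes (POMDPs). Its unified graph representation enables zero-shot generalization across problem sizes, maintaining solution quality with reduced search requirements.


📘 Detailed Summary

Motivation: Existing POMDP planning approaches require domain-specific neural architectures and struggle to scale. GammaZero addresses this with a unified graph-based belief representation that generalizes across problem sizes within a domain.

Method: GammaZero represents belief states as action-centric graphs and uses a graph neural network with a decoder architecture to learn value functions and policies from expert demonstrations, then applies the learned heuristics to guide Monte Carlo tree search on larger problems.

Result: On standard POMDP benchmarks, GammaZero performs comparably to BetaZero when trained and tested on same-sized problems, and it generalizes zero-shot to problems 2-4 times larger than those seen during training, maintaining solution quality with reduced search requirements.

Conclusion: Action-centric graph representations capture structural patterns in POMDPs, so heuristics learned on small problems transfer to larger instances, pointing to a new direction for scalable POMDP planning.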


📄 Abstract

We introduce an action-centric graph representation framework for learning to guide planning in Partially Observable Markov Decision Processes (POMDPs). Unlike existing approaches that require domain-specific neural architectures and struggle with scalability, GammaZero leverages a unified graph-based belief representation that enables generalization across problem sizes within a domain. Our key insight is that belief states can be systematically transformed into action-centric graphs where structural patterns learned on small problems transfer to larger instances. We employ a graph neural network with a decoder architecture to learn value functions and policies from expert demonstrations on computationally tractable problems, then apply these learned heuristics to guide Monte Carlo tree search on larger problems. Experimental results on standard POMDP benchmarks demonstrate that GammaZero achieves comparable performance to BetaZero when trained and tested on the same-sized problems, while uniquely enabling zero-shot generalization to problems 2-4 times larger than those seen during training, maintaining solution quality with reduced search requirements.

[58] A Multimodal Approach to Heritage Preservation in the Context of Climate Change

David Roqui, Adèle Cormier, Nistor Grozavu, Ann Bourges

🧩 TL;DR

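This paper proposes a lightweight multimodal architecture that fuses sensor data with visual imagery to predict degradation severity at cultural heritage sites. With simplified encoders and an Adaptive Barlow Twins loss, it reaches 76.9% accuracy in a data-scarce setting, a 43% improvement over standard multimodal architectures.


📘 Detailed Summary

Motivation: Cultural heritage sites face accelerating degradation due to climate change, but traditional monitoring relies on unimodal analysis (visual inspection or environmental sensors alone) that fails to capture the complex interplay between environmental stressors and material deterioration.

Method: The approach adapts PerceiverIO with two key innovations: simplified encoders (64D latent space) that prevent overfitting on small datasets, and an Adaptive Barlow Twins loss that encourages modality complementarity rather than redundancy.

Result: On data from Strasbourg Cathedral, the model achieves 76.9% accuracy, a 43% improvement over standard multimodal architectures and 25% over vanilla PerceiverIO. In ablations, sensor-only reaches 61.5% and image-only 46.2%, confirming multimodal synergy.

Conclusion: Architectural simplicity combined with contrastive regularization enables effective multimodal learning in data-scarce heritage monitoring, providing a foundation for AI-driven conservation decision support systems. A systematic hyperparameter study identifies a moderate correlation target (τ = 0.3) as the best balance between alignment and complementarity.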


📄 Abstract

Cultural heritage sites face accelerating degradation due to climate change, yet traditional monitoring relies on unimodal analysis (visual inspection or environmental sensors alone) that fails to capture the complex interplay between environmental stressors and material deterioration. We propose a lightweight multimodal architecture that fuses sensor data (temperature, humidity) with visual imagery to predict degradation severity at heritage sites. Our approach adapts PerceiverIO with two key innovations: (1) simplified encoders (64D latent space) that prevent overfitting on small datasets (n=37 training samples), and (2) Adaptive Barlow Twins loss that encourages modality complementarity rather than redundancy. On data from Strasbourg Cathedral, our model achieves 76.9% accuracy, a 43% improvement over standard multimodal architectures (VisualBERT, Transformer) and 25% over vanilla PerceiverIO. Ablation studies reveal that sensor-only achieves 61.5% while image-only reaches 46.2%, confirming successful multimodal synergy. A systematic hyperparameter study identifies an optimal moderate correlation target (τ = 0.3) that balances alignment and complementarity, achieving 69.2% accuracy compared to other τ values (τ = 0.1/0.5/0.7: 53.8%, τ = 0.9: 61.5%). This work demonstrates that architectural simplicity combined with contrastive regularization enables effective multimodal learning in data-scarce heritage monitoring contexts, providing a foundation for AI-driven conservation decision support systems.
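
The role of a correlation target can be illustrated with a Barlow Twins-style loss in which the diagonal of the cross-correlation matrix is pulled toward a target value instead of 1. The abstract does not specify how the Adaptive Barlow Twins loss uses τ, so the `target` parameter below is an assumption made for illustration, as are the function name and the off-diagonal weight. This is not the authors' loss.

```python
import numpy as np


def barlow_twins_loss(z_a, z_b, target=1.0, lambda_offdiag=5e-3):
    """Barlow Twins-style loss between two batches of embeddings (batch x dim).

    target=1.0 gives the standard loss; a smaller target (assumed here to play
    the role of tau) asks for only partial correlation between the two views.
    """
    n = z_a.shape[0]
    za = (z_a - z_a.mean(axis=0)) / (z_a.std(axis=0) + 1e-8)   # standardize per dimension
    zb = (z_b - z_b.mean(axis=0)) / (z_b.std(axis=0) + 1e-8)
    c = za.T @ zb / n                                          # cross-correlation matrix
    diag = np.diag(c)
    on_diag = np.sum((diag - target) ** 2)                     # pull diagonal toward target
    off_diag = np.sum(c ** 2) - np.sum(diag ** 2)              # decorrelate other pairs
    return on_diag + lambda_offdiag * off_diag


rng = np.random.default_rng(0)
z = rng.normal(size=(256, 8))          # stand-in for sensor and image embeddings
loss_full = barlow_twins_loss(z, z, target=1.0)
loss_partial = barlow_twins_loss(z, z, target=0.3)
print(loss_full, loss_partial)
```

With identical embeddings the standard loss is near zero, while a target of 0.3 penalizes them, which matches the stated goal of rewarding complementary rather than redundant modalities.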

[59] ARM-FM: Automated Reward Machines via Foundation Models for Compositional Reinforcement Learning

Roger Creus Castanyer, Faisal Mohamed, Pablo Samuel Castro, Cyrus Neary, Glen Berseth

🧩 TL;DR

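This paper presents ARM-FM, a framework that uses foundation models to automatically construct reward machines, addressing the central challenge of reward function specification in reinforcement learning by turning natural language specifications into structured reward specifications, and demonstrates its effectiveness across diverse environments.


📘 Detailed Summary

Motivation: Reinforcement learning algorithms are highly sensitive to reward function specification, which limits their broad applicability. Existing methods struggle to provide automated, compositional reward design, so a way to translate high-level task descriptions into concrete reward specifications is needed.

Method: Reward machines serve as the formalism for specifying RL objectives. Foundation models use their high-level reasoning capabilities to generate reward machines automatically from natural language specifications, and a language embedding is associated with each automaton state to enable generalization across tasks.

Result: The paper provides empirical evidence of ARM-FM's effectiveness in a diverse suite of challenging environments, including evidence of zero-shot generalization.

Conclusion: By combining the reasoning capabilities of foundation models with the structured formalism of reward machines, ARM-FM offers an automated approach to reward design in reinforcement learning and shows the potential of language-guided reinforcement learning in complex environments.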


📄 Abstract

Reinforcement learning (RL) algorithms are highly sensitive to reward function specification, which remains a central challenge limiting their broad applicability. We present ARM-FM: Automated Reward Machines via Foundation Models, a framework for automated, compositional reward design in RL that leverages the high-level reasoning capabilities of foundation models (FMs). Reward machines (RMs) -- an automata-based formalism for reward specification -- are used as the mechanism for RL objective specification, and are automatically constructed via the use of FMs. The structured formalism of RMs yields effective task decompositions, while the use of FMs enables objective specifications in natural language. Concretely, we (i) use FMs to automatically generate RMs from natural language specifications; (ii) associate language embeddings with each RM automata-state to enable generalization across tasks; and (iii) provide empirical evidence of ARM-FM's effectiveness in a diverse suite of challenging environments, including evidence of zero-shot generalization.
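
A reward machine can be pictured as a small finite-state automaton that consumes high-level events and emits rewards. The sketch below hand-writes one for a hypothetical "get the key, then open the door" task; in ARM-FM the machine would be generated by a foundation model from a natural language specification, which is not shown here. The class, state, and event names are all illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Tuple


@dataclass
class RewardMachine:
    """Finite-state reward specification: (state, event) -> (next_state, reward)."""
    initial: str
    terminal: FrozenSet[str]
    transitions: Dict[Tuple[str, str], Tuple[str, float]]
    state: str = field(init=False)

    def __post_init__(self) -> None:
        self.state = self.initial

    def step(self, event: str) -> float:
        # Events with no matching transition leave the state unchanged and give no reward.
        next_state, reward = self.transitions.get((self.state, event), (self.state, 0.0))
        self.state = next_state
        return reward

    @property
    def done(self) -> bool:
        return self.state in self.terminal


# Hypothetical task: the door only counts once the key has been collected.
rm = RewardMachine(
    initial="u0",
    terminal=frozenset({"u2"}),
    transitions={
        ("u0", "got_key"): ("u1", 0.1),
        ("u1", "door_open"): ("u2", 1.0),
    },
)
rewards = [rm.step(event) for event in ["door_open", "got_key", "door_open"]]
print(rewards, rm.done)
```

The machine state makes the task decomposition explicit: the same "door_open" event is worth nothing in u0 and completes the task in u1, which a single stateless reward function could not express.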

[60] Implementation of AI in Precision Medicine

Göktuğ Bender, Samer Faraj, Anand Bhardwaj

🧩 TL;DR

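This scoping review analyzes the implementation of artificial intelligence in precision medicine in literature from 2019-2024, uses an ecosystem-based framework to identify key barriers and enablers, and proposes future directions for trustworthy and sustainable implementation.


📘 Detailed Summary

Motivation: AI plays an increasingly central role in precision medicine by integrating and interpreting multimodal data, yet its implementation in clinical settings remains limited; this study addresses that implementation gap.

Method: The paper conducts a scoping review of literature from 2019-2024, identifies barriers and enablers across data quality, clinical reliability, workflow integration, and governance, and organizes them in an ecosystem-based framework.

Result: The review identifies the key barriers and enablers that affect real-world translation of AI in precision medicine and highlights the interdependent relationships that shape it.

Conclusion: The study stresses the value of an ecosystem perspective for understanding the complexity of implementing AI in precision medicine and proposes future directions to support trustworthy and sustainable implementation.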


📄 Abstract

Artificial intelligence (AI) has become increasingly central to precision medicine by enabling the integration and interpretation of multimodal data, yet implementation in clinical settings remains limited. This paper provides a scoping review of literature from 2019-2024 on the implementation of AI in precision medicine, identifying key barriers and enablers across data quality, clinical reliability, workflow integration, and governance. Through an ecosystem-based framework, we highlight the interdependent relationships shaping real-world translation and propose future directions to support trustworthy and sustainable implementation.

[61] Can MLLMs Absorb Math Reasoning Abilities from LLMs as Free Lunch?

Yijie Hu, Zihao Zhou, Kaizhu Huang, Xiaowei Huang, Qiufeng Wang

🧩 TL;DR

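This paper proposes IP-Merging, a tuning-free model-merging method that transfers math reasoning ability directly from off-the-shelf math LLMs to multimodal LLMs while maintaining multimodal alignment.


📘 Detailed Summary

Motivation: The math reasoning performance of multimodal large language models (MLLMs) lags behind that of text-only LLMs, and existing model-merging approaches overlook the alignment between the MLLM and the LLM, whose parameter spaces differ substantially, which results in lower performance.

Method: IP-Merging first identifies the reasoning-associated parameters in both the MLLM and the math LLM, then projects them into the subspace of the MLLM to maintain alignment, and finally merges parameters within this subspace.

Result: Extensive experiments show that IP-Merging enhances the math reasoning ability of MLLMs directly from math LLMs without compromising their other capabilities.

Conclusion: With careful parameter-space alignment and merging, reasoning ability can be transferred from text-only LLMs to MLLMs, offering a new route to strengthening specialized MLLM capabilities.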


📄 Abstract

Math reasoning has been one crucial ability of large language models (LLMs), where significant advancements have been achieved in recent years. However, most efforts focus on LLMs by curating high-quality annotation data and intricate training (or inference) paradigms, while the math reasoning performance of multi-modal LLMs (MLLMs) remains lagging behind. Since the MLLM typically consists of an LLM and a vision block, we wonder: Can MLLMs directly absorb math reasoning abilities from off-the-shelf math LLMs without tuning? Recent model-merging approaches may offer insights into this question. However, they overlook the alignment between the MLLM and LLM, where we find that there is a large gap between their parameter spaces, resulting in lower performance. Our empirical evidence reveals two key factors behind this issue: the identification of crucial reasoning-associated layers in the model and the mitigation of the gaps in parameter space. Based on the empirical insights, we propose IP-Merging that first identifies the reasoning-associated parameters in both MLLM and Math LLM, then projects them into the subspace of MLLM, aiming to maintain the alignment, and finally merges parameters in this subspace. IP-Merging is a tuning-free approach since parameters are directly adjusted. Extensive experiments demonstrate that our IP-Merging method can enhance the math reasoning ability of MLLMs directly from Math LLMs without compromising their other capabilities.

[62] Hi-Agent: Hierarchical Vision-Language Agents for Mobile Device Control

Zhe Wu, Hongjin Lu, Junliang Xing, Changhao Zhang, Yin Zhu, Yuhao Yang, Yuheng Jing, Kai Li, Kun Shao, Jianye Hao, Jun Wang, Yuanchun Shi

🧩 TL;DR

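This paper introduces Hi-Agent, a trainable hierarchical vision-language agent for mobile device control. By jointly optimizing a high-level reasoning model and a low-level action model, it achieves a new state-of-the-art task success rate of 87.9% on the Android-in-the-Wild benchmark.


📘 Detailed Summary

Motivation: Most existing VLM-based approaches to mobile device control rely on direct state-to-action mappings, which lack structured reasoning and planning and therefore generalize poorly to novel tasks or unseen UI layouts.

Method: Hi-Agent uses a hierarchical architecture with a high-level reasoning model and a low-level action model. It reformulates multi-step decision-making as a sequence of single-step subgoals and proposes a foresight advantage function that uses execution feedback from the low-level model to guide high-level optimization, alleviating the path explosion issue in long-horizon tasks.

Result: Hi-Agent achieves an 87.9% task success rate on Android-in-the-Wild, significantly outperforming the prompt-based AppAgent (17.7%), the supervised Filtered BC (54.5%), and the reinforcement learning-based DigiRL (71.9%). It also shows competitive zero-shot generalization on the ScreenSpot-v2 benchmark.

Conclusion: The hierarchical design and joint optimization address the generalization challenge in mobile device control, and the foresight advantage function enables stable, critic-free training. The approach shows strong adaptability in high-complexity mobile control scenarios and scales effectively with larger backbones.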


📄 Abstract

Building agents that autonomously operate mobile devices has attracted increasing attention. While Vision-Language Models (VLMs) show promise, most existing approaches rely on direct state-to-action mappings, which lack structured reasoning and planning, and thus generalize poorly to novel tasks or unseen UI layouts. We introduce Hi-Agent, a trainable hierarchical vision-language agent for mobile control, featuring a high-level reasoning model and a low-level action model that are jointly optimized. For efficient training, we reformulate multi-step decision-making as a sequence of single-step subgoals and propose a foresight advantage function, which leverages execution feedback from the low-level model to guide high-level optimization. This design alleviates the path explosion issue encountered by Group Relative Policy Optimization (GRPO) in long-horizon tasks and enables stable, critic-free joint training. Hi-Agent achieves a new State-Of-The-Art (SOTA) 87.9% task success rate on the Android-in-the-Wild (AitW) benchmark, significantly outperforming prior methods across three paradigms: prompt-based (AppAgent: 17.7%), supervised (Filtered BC: 54.5%), and reinforcement learning-based (DigiRL: 71.9%). It also demonstrates competitive zero-shot generalization on the ScreenSpot-v2 benchmark. On the more challenging AndroidWorld benchmark, Hi-Agent also scales effectively with larger backbones, showing strong adaptability in high-complexity mobile control scenarios.
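
As background for the "critic-free" remark, the sketch below shows the group-relative advantage commonly used in GRPO: each sampled response is scored against the mean and standard deviation of its own group, so no learned value function is needed. This is the commonly described baseline computation only; the paper's foresight advantage function is not specified in the abstract and is not reproduced here. The function name and the reward values are illustrative.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its sampling group (no critic network involved)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


# Four sampled subgoals for the same screen, scored 1 if the low-level model succeeded.
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(adv)
```

Successful samples receive positive advantages and failed ones negative, and the advantages within a group sum to roughly zero.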

[63] ColorBench: Benchmarking Mobile Agents with Graph-Structured Framework for Complex Long-Horizon Tasks

Yuanyi Song, Heyuan Huang, Qiqiang Lin, Yin Zhao, Xiangmou Qu, Jun Wang, Xingyu Lou, Weiwen Liu, Zhuosheng Zhang, Jun Wang, Yong Yu, Weinan Zhang, Zhaoxiang Wang

🧩 TL;DR

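This paper presents ColorBench, a novel graph-structured benchmarking framework for evaluating mobile agents on complex long-horizon tasks. It bridges the gap between offline and online evaluation through static simulation of dynamic behaviors.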


📘 Detailed Summary

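Motivation: Current mobile agent evaluation standards are limited. Offline static benchmarks can validate only a single predefined path, while online dynamic testing is constrained by the complexity and non-reproducibility of real devices. Neither can comprehensively assess agent capabilities on complex real-world tasks that allow multiple valid solutions.

Method: By modeling the finite states observed during real-device interactions, the framework achieves static simulation of dynamic behaviors. Building on this, the authors develop the ColorBench benchmark, which supports evaluation of multiple valid solutions, subtask completion rate statistics, and atomic-level capability analysis.

Result: ColorBench contains 175 tasks (74 single-app and 101 cross-app) with an average length of over 13 steps. Each task includes at least two correct paths and several typical error paths. Evaluating a range of baseline models reveals the limitations of existing models.

Conclusion: Based on the experimental results, the authors propose improvement directions and feasible technical pathways for enhancing agent performance on complex long-horizon problems, providing a new benchmark framework and methodology for comprehensive evaluation of mobile agents.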


📄 Abstract

The rapid advancement of multimodal large language models has enabled agents to operate mobile devices by directly interacting with graphical user interfaces, opening new possibilities for mobile automation. However, real-world mobile tasks are often complex and allow for multiple valid solutions. This contradicts current mobile agent evaluation standards: offline static benchmarks can only validate a single predefined "golden path", while online dynamic testing is constrained by the complexity and non-reproducibility of real devices, making both approaches inadequate for comprehensively assessing agent capabilities. To bridge the gap between offline and online evaluation and enhance testing stability, this paper introduces a novel graph-structured benchmarking framework. By modeling the finite states observed during real-device interactions, it achieves static simulation of dynamic behaviors. Building on this, we develop ColorBench, a benchmark focused on complex long-horizon tasks. It supports evaluation of multiple valid solutions, subtask completion rate statistics, and atomic-level capability analysis. ColorBench contains 175 tasks (74 single-app, 101 cross-app) with an average length of over 13 steps. Each task includes at least two correct paths and several typical error paths, enabling quasi-dynamic interaction. By evaluating ColorBench across various baselines, we discover limitations of existing models and propose improvement directions and feasible technical pathways to enhance agents' performance on complex, long-horizon problems based on experimental results. Code and data are available at: https://github.com/MadeAgents/ColorBench.
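The graph-structured idea can be illustrated with a small sketch: a task is stored as a finite set of screen states connected by actions, so an agent's action sequence can be replayed offline and still succeed along more than one path. This is a toy model under stated assumptions, not ColorBench's actual data format or scoring code; the `TaskGraph` class, the state and action names, and the milestone-based subtask rate are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TaskGraph:
    """Toy finite-state model of one task (hypothetical structure)."""
    start: str
    goals: set
    edges: dict                      # (state, action) -> next state
    milestones: list = field(default_factory=list)

    def replay(self, actions):
        """Follow the actions through the graph; stop at the first invalid one."""
        state = self.start
        visited = [state]
        for action in actions:
            nxt = self.edges.get((state, action))
            if nxt is None:
                break
            state = nxt
            visited.append(state)
        return visited

    def evaluate(self, actions):
        """Report task success and the fraction of milestone states reached."""
        visited = self.replay(actions)
        hit = [m for m in self.milestones if m in visited]
        rate = len(hit) / len(self.milestones) if self.milestones else 0.0
        return {"success": visited[-1] in self.goals, "subtask_completion": rate}


# Two valid routes to the notes screen, plus one dead-end error path.
graph = TaskGraph(
    start="home",
    goals={"done"},
    edges={
        ("home", "open_notes"): "notes",
        ("home", "open_search"): "search",
        ("search", "tap_notes"): "notes",
        ("home", "open_settings"): "settings",  # typical error path
        ("notes", "create"): "done",
    },
    milestones=["notes", "done"],
)

direct = graph.evaluate(["open_notes", "create"])
via_search = graph.evaluate(["open_search", "tap_notes", "create"])
wrong_app = graph.evaluate(["open_settings"])
partial = graph.evaluate(["open_notes"])
```

Both `direct` and `via_search` count as successes, which a single "golden path" check would not allow, and `partial` still earns credit for the milestone it reached.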

[64] LabOS: The AI-XR Co-Scientist That Sees and Works With Humans

Le Cong, Zaixi Zhang, Xiaotong Wang, Yin Di, Ruofan Jin, Michal Gerasimiuk, Yinkai Wang, Ravi K. Dinesh, David Smerkous, Alex Smerkous, Xuekun Wu, Shilong Liu, Peishan Li, Yi Zhu, Simran Serrao, Ning Zhao, Imran A. Mohammad, John B. Sunwoo, Joseph C. Wu, Mengdi Wang

🧩 TL;DR

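LabOS is the first AI co-scientist that unites computational reasoning with physical experimentation, turning the laboratory into an intelligent, collaborative environment through multimodal perception, self-evolving agents, and extended-reality (XR)-enabled human-AI collaboration.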


📘 Detailed Summary

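Motivation: The work addresses the problem that AI in scientific research has been confined to computational design and cannot take part in physical experiments. By connecting the human scientist's experimental environment with AI systems, it aims to move from computational design to actual participation.

Method: LabOS combines a multi-model AI agent system, smart glasses, and extended-reality (XR) technology with multimodal perception, so that the AI can understand the experimental environment and assist scientists with experimental procedures in real time.

Result: Across application areas including cancer immunotherapy target discovery and stem-cell engineering, LabOS shows that AI can go beyond traditional computational design and participate directly in physical experiments, creating an intelligent, collaborative human-AI laboratory environment.

Conclusion: The study shows that AI can become an active participant in scientific research, turning the laboratory into an intelligent discovery environment through human-AI collaboration and pointing toward a transformative shift in future research paradigms.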


📄 Abstract

Modern science advances fastest when thought meets action. LabOS represents the first AI co-scientist that unites computational reasoning with physical experimentation through multimodal perception, self-evolving agents, and Extended-Reality (XR)-enabled human-AI collaboration. By connecting multi-model AI agents, smart glasses, and human-AI collaboration, LabOS allows AI to see what scientists see, understand experimental context, and assist in real-time execution. Across applications, from cancer immunotherapy target discovery to stem-cell engineering, LabOS shows that AI can move beyond computational design to participation, turning the laboratory into an intelligent, collaborative environment where human and machine discovery evolve together.

[65] TRI-DEP: A Trimodal Comparative Study for Depression Detection Using Speech, Text, and EEG

Annisaa Fitri Nurfidausi, Eleonora Mancini, Paolo Torroni

🧩 TL;DR

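This study systematically explores multimodal feature representations and modelling strategies across EEG, speech, and text, establishing a robust benchmarking framework for depression detection. It shows that trimodal combinations with pre-trained embeddings significantly improve detection performance and reach state-of-the-art results.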


📘 Detailed Summary

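Motivation: Existing research on automatic depression detection is limited in scope, lacks systematic comparisons of features, and uses inconsistent evaluation protocols. Multimodal approaches in particular show promise but remain underexplored.

Method: The study systematically evaluates handcrafted features against pre-trained embeddings, compares different neural encoders, analyses unimodal, bimodal, and trimodal configurations, and pays particular attention to the role of EEG in multimodal fusion. Consistent subject-independent splits are used to ensure reproducibility.

Result: The experiments show that combining EEG, speech, and text significantly enhances multimodal detection performance, that pre-trained embeddings consistently outperform handcrafted features, and that carefully designed trimodal models achieve state-of-the-art detection performance.

Conclusion: The study lays a solid foundation for future research in multimodal depression detection, demonstrates the importance of systematic feature exploration and a robust evaluation framework, and offers methodological support for developing more reliable depression screening tools for clinical use.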


📄 Abstract

Depression is a widespread mental health disorder, yet its automatic detection remains challenging. Prior work has explored unimodal and multimodal approaches, with multimodal systems showing promise by leveraging complementary signals. However, existing studies are limited in scope, lack systematic comparisons of features, and suffer from inconsistent evaluation protocols. We address these gaps by systematically exploring feature representations and modelling strategies across EEG, together with speech and text. We evaluate handcrafted features versus pre-trained embeddings, assess the effectiveness of different neural encoders, compare unimodal, bimodal, and trimodal configurations, and analyse fusion strategies with attention to the role of EEG. Consistent subject-independent splits are applied to ensure robust, reproducible benchmarking. Our results show that (i) the combination of EEG, speech and text modalities enhances multimodal detection, (ii) pretrained embeddings outperform handcrafted features, and (iii) carefully designed trimodal models achieve state-of-the-art performance. Our work lays the groundwork for future research in multimodal depression detection.
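The subject-independent splitting mentioned above can be sketched as follows: recordings are partitioned by subject rather than by sample, so that no person contributes data to both the training and test sets. This is a minimal illustration, not the authors' protocol; the function name, the `subject` and `modality` keys, the 20% test fraction, and the synthetic sample list are assumptions.

```python
import random


def subject_independent_split(samples, test_fraction=0.2, seed=0):
    """Split samples so that every subject lands entirely in train or in test."""
    subjects = sorted({s["subject"] for s in samples})
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(subjects)
    n_test = max(1, round(len(subjects) * test_fraction))
    test_subjects = set(subjects[:n_test])
    train = [s for s in samples if s["subject"] not in test_subjects]
    test = [s for s in samples if s["subject"] in test_subjects]
    return train, test


# Synthetic example: 5 subjects, 2 sessions each, 3 modalities per session.
samples = [
    {"subject": f"S{i % 5}", "modality": m}
    for i in range(10)
    for m in ("eeg", "speech", "text")
]
train, test = subject_independent_split(samples)
train_ids = {s["subject"] for s in train}
test_ids = {s["subject"] for s in test}
```

Splitting by sample instead would let the same person's recordings appear on both sides, which tends to inflate reported accuracy.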

[66] Towards Unified Multimodal Misinformation Detection in Social Media: A Benchmark Dataset and Baseline

Haiyang Li, Yaxiong Wang, Shengeng Tang, Lianwei Wu, Lechao Cheng, Zhun Zhong

🧩 TL;DR

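This paper proposes UMFDet, a unified multimodal fake content detection framework, and constructs OmniFake, a comprehensive dataset containing both human-crafted and AI-generated fake content. Together they address the limitation that existing methods target only a single type of fake content, enabling robust detection when the type of a multimodal post is unknown.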


📘 Detailed Summary

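Motivation: Current research on multimodal fake content detection is siloed: NLP research focuses on human-crafted misinformation, while the computer vision community mainly targets AI-generated content, so existing models usually handle only one type of fake content. In real-world scenarios the type of a multimodal post is usually unknown, which limits the effectiveness of such specialized systems.

Method: The authors propose UMFDet, a unified multimodal fake content detection framework. It uses a vision-language model as its backbone, augmented with a category-aware mixture-of-experts adapter to capture category-specific cues, and introduces an attribution chain-of-thought mechanism that provides implicit reasoning guidance for locating salient deceptive signals.

Result: Extensive experiments show that UMFDet achieves robust and consistent performance across both misinformation types, outperforming specialized baselines and offering a practical solution for real-world multimodal deception detection.

Conclusion: The study demonstrates the effectiveness of a unified framework for handling different types of multimodal fake content. By constructing a comprehensive dataset and introducing a category-aware mechanism, it opens a new research direction for multimodal fake content detection and underlines the importance of handling fake content of unknown type in real-world settings.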


📄 Abstract

In recent years, detecting fake multimodal content on social media has drawn increasing attention. Two major forms of deception dominate: human-crafted misinformation (e.g., rumors and misleading posts) and AI-generated content produced by image synthesis models or vision-language models (VLMs). Although both share deceptive intent, they are typically studied in isolation. NLP research focuses on human-written misinformation, while the CV community targets AI-generated artifacts. As a result, existing models are often specialized for only one type of fake content. In real-world scenarios, however, the type of a multimodal post is usually unknown, limiting the effectiveness of such specialized systems. To bridge this gap, we construct the Omnibus Dataset for Multimodal News Deception (OmniFake), a comprehensive benchmark of 127K samples that integrates human-curated misinformation from existing resources with newly synthesized AI-generated examples. Based on this dataset, we propose Unified Multimodal Fake Content Detection (UMFDet), a framework designed to handle both forms of deception. UMFDet leverages a VLM backbone augmented with a Category-aware Mixture-of-Experts (MoE) Adapter to capture category-specific cues, and an attribution chain-of-thought mechanism that provides implicit reasoning guidance for locating salient deceptive signals. Extensive experiments demonstrate that UMFDet achieves robust and consistent performance across both misinformation types, outperforming specialized baselines and offering a practical solution for real-world multimodal deception detection.