cs.CV

[1] Seeing Beyond Words: Self-Supervised Visual Learning for Multimodal Large Language Models

Davide Caffagni, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Pier Luigi Dovesi, Shaghayegh Roohi, Mark Granroth-Wilding, Rita Cucchiara

🧩 TL;DR

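This paper proposes JARVIS, a JEPA-based framework for self-supervised visual enhancement that targets the limitations of multimodal large language models (MLLMs) on visual reasoning tasks. By integrating the I-JEPA learning paradigm, the framework lets MLLMs learn structural and semantic regularities of images without relying exclusively on language supervision.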


📘 Detailed Summary

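Motivation: Current MLLMs perform poorly on fundamental visual reasoning tasks, mainly because they learn visual understanding from textual descriptions, which are a subjective and incomplete supervisory signal. In addition, multimodal instruction tuning is far smaller in scale than text-only pre-training, so models over-rely on language priors and overlook visual details.

Method: The paper proposes JARVIS, which integrates the I-JEPA learning paradigm into the standard vision-language alignment pipeline of MLLM training. It uses frozen vision foundation models as the context and target encoders and trains a predictor, implemented as the early layers of the LLM, to learn structural and semantic regularities of images without relying exclusively on language supervision.

Result: Extensive experiments on standard MLLM benchmarks show that JARVIS consistently improves performance on vision-centric benchmarks across different LLM families without degrading multimodal reasoning ability. The gains on multiple visual reasoning tasks support the effectiveness of self-supervised visual enhancement.

Conclusion: Through self-supervised visual enhancement, JARVIS mitigates MLLMs' over-reliance on language priors and suggests a new direction for multimodal model training. The results indicate that integrating the JEPA paradigm can improve visual understanding without harming multimodal capabilities, offering a feasible path toward more balanced vision-language alignment.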


📄 Abstract

Multimodal Large Language Models (MLLMs) have recently demonstrated impressive capabilities in connecting vision and language, yet their proficiency in fundamental visual reasoning tasks remains limited. This limitation can be attributed to the fact that MLLMs learn visual understanding primarily from textual descriptions, which constitute a subjective and inherently incomplete supervisory signal. Furthermore, the modest scale of multimodal instruction tuning compared to massive text-only pre-training leads MLLMs to overfit language priors while overlooking visual details. To address these issues, we introduce JARVIS, a JEPA-inspired framework for self-supervised visual enhancement in MLLMs. Specifically, we integrate the I-JEPA learning paradigm into the standard vision-language alignment pipeline of MLLM training. Our approach leverages frozen vision foundation models as context and target encoders, while training the predictor, implemented as the early layers of an LLM, to learn structural and semantic regularities from images without relying exclusively on language supervision. Extensive experiments on standard MLLM benchmarks show that JARVIS consistently improves performance on vision-centric benchmarks across different LLM families, without degrading multimodal reasoning abilities. Our source code is publicly available at: https://github.com/aimagelab/JARVIS.
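
The JEPA-style objective described in this abstract can be illustrated with a toy sketch. It is not the authors' implementation: the abstract does not state the loss, so the squared L2 distance in embedding space (the form used in the original I-JEPA work) is an assumption here, and the function name `jepa_loss`, the toy embeddings, and the masked indices are hypothetical.

```python
def jepa_loss(predicted, target, masked_indices):
    """Mean squared L2 distance between predicted and target patch
    embeddings, computed on the masked (target) patches only."""
    total = 0.0
    for i in masked_indices:
        total += sum((p - t) ** 2 for p, t in zip(predicted[i], target[i]))
    return total / len(masked_indices)


# Toy setup: 4 image patches with 2-dimensional embeddings.
# `target` stands in for the output of a frozen target encoder.
target = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

# Hypothetical predictor output. Patches 2 and 3 were hidden from the
# context, so only they contribute to the loss.
predicted = [[1.0, 0.0], [0.0, 1.0], [0.5, 1.0], [0.0, 1.0]]

loss = jepa_loss(predicted, target, masked_indices=[2, 3])
print(loss)
```

Because the encoders are frozen, only the predictor would receive gradients from a loss of this kind; no text supervision enters the computation.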

[2] City Navigation in the Wild: Exploring Emergent Navigation from Web-Scale Knowledge in MLLMs

Dwip Dalal, Utkarsh Mishra, Narendra Ahuja, Nebojsa Jojic

🧩 TL;DR

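This paper introduces the Sparsely Grounded Visual Navigation task and the CityNav benchmark for evaluating the sequential decision-making of multimodal large language models in knowledge-intensive real-world environments, and proposes Verbalization of Path (VoP), which substantially improves navigation success.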


📘 Detailed Summary

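Motivation: Current benchmarks for embodied agents built on MLLMs are predominantly language-centric or rely on simulated environments. They rarely provide a systematic evaluation of knowledge-intensive reasoning in complex real-world scenarios, which limits the potential of such agents in real-world tasks.

Method: The study introduces the Sparsely Grounded Visual Navigation task and builds CityNav, a benchmark spanning four global cities in which agents must navigate sequentially through 50+ decision points using only visual inputs and multimodal reasoning. It also proposes Verbalization of Path (VoP), which explicitly grounds the agent's internal reasoning by probing an explicit cognitive map from the MLLM.

Result: Experiments show that current state-of-the-art MLLMs and standard reasoning techniques significantly underperform on CityNav, whereas VoP, by explicitly eliciting a cognitive map of key landmarks and directions, substantially improves navigation success and underscores the importance of explicitly grounded reasoning.

Conclusion: The study exposes the limitations of current MLLMs in knowledge-intensive real-world navigation. VoP offers an effective way to improve agents' spatial reasoning and decision-making, and highlights the importance of building explicit cognitive maps for complex tasks.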


📄 Abstract

Leveraging multimodal large language models (MLLMs) to develop embodied agents offers significant promise for addressing complex real-world tasks. However, current evaluation benchmarks remain predominantly language-centric or heavily reliant on simulated environments, rarely probing the nuanced, knowledge-intensive reasoning essential for practical, real-world scenarios. To bridge this critical gap, we introduce the task of Sparsely Grounded Visual Navigation, explicitly designed to evaluate the sequential decision-making abilities of MLLMs in challenging, knowledge-intensive real-world environments. We operationalize this task with CityNav, a comprehensive benchmark encompassing four diverse global cities, specifically constructed to assess raw MLLM-driven agents in city navigation. Agents are required to rely solely on visual inputs and internal multimodal reasoning to sequentially navigate 50+ decision points without additional environmental annotations or specialized architectural modifications. Crucially, agents must autonomously achieve localization through interpreting city-specific cues and recognizing landmarks, perform spatial reasoning, and strategically plan and execute routes to their destinations. Through extensive evaluations, we demonstrate that current state-of-the-art MLLMs and standard reasoning techniques (e.g., Chain-of-Thought, Reflection) significantly underperform in this challenging setting. To address this, we propose Verbalization of Path (VoP), which explicitly grounds the agent's internal reasoning by probing an explicit cognitive map (key landmarks and directions toward the destination) from the MLLMs, substantially enhancing navigation success. Project Webpage: https://dwipddalal.github.io/AgentNav/

[3] R4: Retrieval-Augmented Reasoning for Vision-Language Models in 4D Spatio-Temporal Space

Tin Stribor Sohn, Maximilian Dillitzer, Jason J. Corso, Eric Sax

🧩 TL;DR

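This paper proposes R4, a training-free framework for retrieval-augmented reasoning in 4D spatio-temporal space that builds a persistent, structured world model so that vision-language models can perceive and reason over space and time. The framework anchors object-level semantic descriptions in metric space and time, enabling collaborative reasoning across agents.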


📘 Detailed Summary

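Motivation: Humans perceive and understand their four-dimensional surroundings by building persistent, structured internal representations, but existing vision-language models lack the ability to continuously build and use such spatio-temporal memory. The work addresses the limits of spatio-temporal perception and reasoning for agents in dynamic environments, and in particular how to achieve training-free 4D retrieval-augmented reasoning.

Method: R4 continuously constructs a 4D knowledge database by anchoring object-level semantic descriptions in metric space and time, yielding a persistent world model. At inference, a natural language query is decomposed into semantic, spatial, and temporal keys that retrieve relevant observations, which are then integrated into the VLM's reasoning. Retrieval operates directly in 4D space and supports episodic and collaborative reasoning.

Result: On embodied question answering and navigation benchmarks, R4 substantially outperforms baselines in retrieving and reasoning over spatio-temporal information. The experiments indicate that the framework improves the understanding and use of spatio-temporal information in dynamic environments.

Conclusion: R4 offers a new paradigm for embodied 4D reasoning in dynamic environments and shows that structured spatio-temporal memory can be built and used without training. The work advances spatio-temporal perception and reasoning in vision-language models and lays groundwork for cross-agent collaboration and long-term environment understanding.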


📄 Abstract

Humans perceive and reason about their surroundings in four dimensions by building persistent, structured internal representations that encode semantic meaning, spatial layout, and temporal dynamics. These multimodal memories enable them to recall past events, infer unobserved states, and integrate new information into context-dependent reasoning. Inspired by this capability, we introduce R4, a training-free framework for retrieval-augmented reasoning in 4D spatio-temporal space that equips vision-language models (VLMs) with structured, lifelong memory. R4 continuously constructs a 4D knowledge database by anchoring object-level semantic descriptions in metric space and time, yielding a persistent world model that can be shared across agents. At inference, natural language queries are decomposed into semantic, spatial, and temporal keys to retrieve relevant observations, which are integrated into the VLM's reasoning. Unlike classical retrieval-augmented generation methods, retrieval in R4 operates directly in 4D space, enabling episodic and collaborative reasoning without training. Experiments on embodied question answering and navigation benchmarks demonstrate that R4 substantially improves retrieval and reasoning over spatio-temporal information compared to baselines, advancing a new paradigm for embodied 4D reasoning in dynamic environments.
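
Retrieval with semantic, spatial, and temporal keys, as described in this abstract, might look roughly like the sketch below. It is an illustration under stated assumptions, not the R4 implementation: the `Observation` record, the `retrieve` function, the keyword match standing in for semantic similarity, and the radius and time-window parameters are all hypothetical.

```python
from dataclasses import dataclass
import math


@dataclass
class Observation:
    description: str   # object-level semantic description
    position: tuple    # (x, y, z) in metres, metric world frame
    timestamp: float   # seconds since the start of the episode


def retrieve(memory, semantic_key, spatial_key=None, radius=2.0, time_window=None):
    """Filter stored observations by keyword, distance to a query point,
    and a closed time interval; return the most recent matches first."""
    hits = []
    for obs in memory:
        # Semantic key: a plain substring match stands in for embedding similarity.
        if semantic_key.lower() not in obs.description.lower():
            continue
        # Spatial key: keep observations within `radius` metres of the query point.
        if spatial_key is not None and math.dist(obs.position, spatial_key) > radius:
            continue
        # Temporal key: keep observations inside the requested interval.
        if time_window is not None and not (time_window[0] <= obs.timestamp <= time_window[1]):
            continue
        hits.append(obs)
    return sorted(hits, key=lambda o: o.timestamp, reverse=True)


memory = [
    Observation("red mug on the kitchen table", (1.0, 2.0, 0.8), 10.0),
    Observation("red mug in the sink", (4.0, 2.5, 0.9), 55.0),
    Observation("laptop on the desk", (7.0, 1.0, 0.7), 60.0),
]

results = retrieve(memory, "mug", spatial_key=(4.0, 2.0, 1.0),
                   radius=1.0, time_window=(30.0, 120.0))
for obs in results:
    print(obs.description, obs.timestamp)
```

The retrieved observations would then be placed in the VLM's context. Because the store is keyed by metric position and time rather than by agent, several agents could in principle write to and read from the same memory.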

[4] The Perceptual Observatory Characterizing Robustness and Grounding in MLLMs

Tejas Anvekar, Fenil Bardoliya, Pavan K. Turaga, Chitta Baral, Vivek Gupta

🧩 TL;DR

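This paper presents The Perceptual Observatory, a framework for systematically evaluating the visual perception of multimodal large language models. It goes beyond conventional end-task accuracy by using controlled perturbations to test the models' visual grounding and robustness.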


📘 Detailed Summary

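Motivation: Despite rapid progress, the perceptual capacities of MLLMs remain poorly characterized. Most model families scale the language component while reusing nearly identical vision encoders, which raises a key question: does progress reflect genuine visual grounding, or reliance on internet-scale textual world knowledge? Existing evaluation methods emphasize end-task accuracy and overlook robustness, attribution fidelity, and reasoning under controlled perturbations.

Method: The Perceptual Observatory characterizes MLLMs across several verticals: simple vision tasks (such as face matching and text-in-vision comprehension) and local-to-global understanding (such as image matching, the grid pointing game, and attribute localization). Each vertical is instantiated with ground-truth datasets of faces and words that are systematically perturbed through pixel-based augmentations and diffusion-based stylized illusions.

Result: The framework moves beyond leaderboard accuracy and yields insights into how MLLMs preserve perceptual grounding and relational structure under perturbations. It provides a principled foundation for analyzing the strengths and weaknesses of current and future models, including their visual grounding and robustness under controlled perturbations.

Conclusion: The study stresses the importance of systematically evaluating the visual perception of MLLMs beyond conventional accuracy metrics. The Perceptual Observatory gives a more complete view of model behavior and can reveal whether a model is genuinely visually grounded rather than relying on textual knowledge alone, informing future model design and evaluation.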


📄 Abstract

Recent advances in multimodal large language models (MLLMs) have yielded increasingly powerful models, yet their perceptual capacities remain poorly characterized. In practice, most model families scale the language component while reusing nearly identical vision encoders (e.g., Qwen2.5-VL 3B/7B/72B), which raises pivotal concerns about whether progress reflects genuine visual grounding or reliance on internet-scale textual world knowledge. Existing evaluation methods emphasize end-task accuracy, overlooking robustness, attribution fidelity, and reasoning under controlled perturbations. We present The Perceptual Observatory, a framework that characterizes MLLMs across verticals such as: (i) simple vision tasks, such as face matching and text-in-vision comprehension capabilities; (ii) local-to-global understanding, encompassing image matching, grid pointing game, and attribute localization, which tests general visual grounding. Each vertical is instantiated with ground-truth datasets of faces and words, systematically perturbed through pixel-based augmentations and diffusion-based stylized illusions. The Perceptual Observatory moves beyond leaderboard accuracy to yield insights into how MLLMs preserve perceptual grounding and relational structure under perturbations, providing a principled foundation for analyzing strengths and weaknesses of current and future models.

[5] From Words to Wavelengths: VLMs for Few-Shot Multispectral Object Detection

Manuel Nkegoum, Minh-Tan Pham, Élisa Fromont, Bruno Avignon, Sébastien Lefèvre

🧩 TL;DR

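This study explores vision-language models for few-shot multispectral object detection. By adapting Grounding DINO and YOLO-World to multispectral inputs and integrating text, visual, and thermal modalities, it significantly improves detection performance in data-scarce settings.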


📘 Detailed Summary

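Motivation: Multispectral object detection is critical for safety-sensitive applications such as autonomous driving and surveillance, but the scarcity of annotated data severely restricts the training of deep detectors. In data-scarce settings, textual class information can serve as a valuable source of semantic supervision, and the success of vision-language models in computer vision motivates exploring their potential for few-shot multispectral detection.

Method: The study adapts two representative VLM-based detectors, Grounding DINO and YOLO-World, to multispectral inputs and proposes a mechanism for integrating text, visual, and thermal modalities. By exploiting the semantic priors of large-scale pretrained VLMs, the approach transfers knowledge across spectral modalities.

Result: Extensive experiments on two popular multispectral benchmarks, FLIR and M3FD, show that VLM-based detectors excel in few-shot regimes, significantly outperforming specialized multispectral models trained with comparable data. Under fully supervised settings they also achieve competitive or superior results.

Conclusion: The semantic priors learned by large-scale VLMs transfer effectively to unseen spectral modalities, offering a powerful pathway toward data-efficient multispectral perception. The approach addresses annotation scarcity and opens new research directions in cross-modal visual understanding.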


📄 Abstract

Multispectral object detection is critical for safety-sensitive applications such as autonomous driving and surveillance, where robust perception under diverse illumination conditions is essential. However, the limited availability of annotated multispectral data severely restricts the training of deep detectors. In such data-scarce scenarios, textual class information can serve as a valuable source of semantic supervision. Motivated by the recent success of Vision-Language Models (VLMs) in computer vision, we explore their potential for few-shot multispectral object detection. Specifically, we adapt two representative VLM-based detectors, Grounding DINO and YOLO-World, to handle multispectral inputs and propose an effective mechanism to integrate text, visual and thermal modalities. Through extensive experiments on two popular multispectral image benchmarks, FLIR and M3FD, we demonstrate that VLM-based detectors not only excel in few-shot regimes, significantly outperforming specialized multispectral models trained with comparable data, but also achieve competitive or superior results under fully supervised settings. Our findings reveal that the semantic priors learned by large-scale VLMs effectively transfer to unseen spectral modalities, offering a powerful pathway toward data-efficient multispectral perception.

[6] Are vision-language models ready to zero-shot replace supervised classification models in agriculture?

Earl Ranario, Mason J. Earles

🧩 TL;DR

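This study benchmarks a range of open-source and closed-source vision-language models on 27 agricultural classification datasets and finds that current off-the-shelf VLMs perform substantially below a task-specific supervised baseline on agricultural diagnostic tasks. They are not yet suitable as standalone agricultural diagnostic systems, but can serve as assistive components when paired with constrained interfaces.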


📘 Detailed Summary

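Motivation: Vision-language models are increasingly proposed as general-purpose solutions for visual recognition, yet their reliability for agricultural decision support remains unclear. The study evaluates how current VLMs actually perform on agricultural diagnostic tasks.

Method: The study benchmarks a diverse set of open-source and closed-source VLMs on 27 agricultural classification datasets from the AgML collection, covering 162 classes across plant disease, pest and damage, and plant and weed species identification. Evaluation is zero-shot and uses multiple-choice prompting, open-ended prompting, and LLM-based semantic judging, with a supervised task-specific baseline (YOLO11) for comparison.

Result: Across all tasks, zero-shot VLMs substantially underperform the supervised task-specific baseline (YOLO11). The best VLM (Gemini-3 Pro) reaches approximately 62% average accuracy under multiple-choice prompting, while open-ended prompting performs much worse (typically below 25%). LLM-based semantic judging raises open-ended accuracy (for example, from 21% to 30%). Among open-source models, Qwen-VL-72B performs best and approaches closed-source performance. Task-level analysis shows that plant and weed species classification is easier than pest and damage identification.

Conclusion: Current off-the-shelf VLMs are not yet suitable as standalone agricultural diagnostic systems, but can act as assistive components when paired with constrained interfaces, explicit label ontologies, and domain-aware evaluation strategies. Evaluation methodology meaningfully affects reported conclusions, and agricultural diagnosis calls for dedicated evaluation frameworks and domain adaptation.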


📄 Abstract

Vision-language models (VLMs) are increasingly proposed as general-purpose solutions for visual recognition tasks, yet their reliability for agricultural decision support remains poorly understood. We benchmark a diverse set of open-source and closed-source VLMs on 27 agricultural classification datasets from the AgML collection, spanning 162 classes across plant disease, pest and damage, and plant and weed species identification. Across all tasks, zero-shot VLMs substantially underperform a supervised task-specific baseline (YOLO11), which consistently achieves markedly higher accuracy than any foundation model. Under multiple-choice prompting, the best-performing VLM (Gemini-3 Pro) reaches approximately 62% average accuracy, while open-ended prompting yields much lower performance, with raw accuracies typically below 25%. Applying LLM-based semantic judging increases open-ended accuracy (for example, from 21% to 30% for top models) and alters model rankings, demonstrating that evaluation methodology meaningfully affects reported conclusions. Among open-source models, Qwen-VL-72B performs best, approaching closed-source performance under constrained prompting but still trailing top proprietary systems. Task-level analysis shows that plant and weed species classification is consistently easier than pest and damage identification, which remains the most challenging category across models. Overall, these results indicate that current off-the-shelf VLMs are not yet suitable as standalone agricultural diagnostic systems, but can function as assistive components when paired with constrained interfaces, explicit label ontologies, and domain-aware evaluation strategies.

[7] C-DGPA: Class-Centric Dual-Alignment Generative Prompt Adaptation

Chao Li, Dasha Hu, Chengyang Li, Yuming Jiang, Yuncheng Shen

🧩 TL;DR

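This paper proposes C-DGPA, a class-centric dual-alignment generative prompt adaptation method for unsupervised domain adaptation. By jointly optimizing marginal and conditional distribution alignment, it mitigates domain discrepancies when vision-language models are applied to domain adaptation and achieves state-of-the-art results on multiple benchmarks.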


📘 Detailed Summary

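Motivation: Existing VLM-based prompt-tuning strategies for unsupervised domain adaptation mainly align marginal distributions and neglect conditional distribution discrepancies, which leads to class prototype misalignment and degraded semantic discriminability. A method that handles both kinds of discrepancy is needed.

Method: C-DGPA uses a dual-branch architecture to jointly optimize marginal and conditional distribution alignment. The marginal branch employs a dynamic adversarial training framework to bridge marginal distribution discrepancies. The conditional branch introduces a Class Mapping Mechanism (CMM) that aligns conditional distributions by standardizing semantic prompt understanding and preventing over-reliance on the source domain, yielding domain-invariant and semantically discriminative representations.

Result: Extensive experiments on OfficeHome, Office31, and VisDA-2017 validate C-DGPA, which achieves new state-of-the-art results on all three benchmarks.

Conclusion: The study shows that jointly optimizing marginal and conditional distribution alignment effectively integrates domain knowledge into prompt learning, offering a new approach to applying vision-language models to unsupervised domain adaptation and demonstrating strong generalization in complex domain adaptation scenarios.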


📄 Abstract

Unsupervised Domain Adaptation (UDA) transfers knowledge from a labeled source domain to an unlabeled target domain. Directly deploying Vision-Language Models (VLMs) with prompt tuning in downstream UDA tasks faces the significant challenge of mitigating domain discrepancies. Existing prompt-tuning strategies primarily align marginal distribution, but neglect conditional distribution discrepancies, leading to critical issues such as class prototype misalignment and degraded semantic discriminability. To address these limitations, the work proposes C-DGPA: Class-Centric Dual-Alignment Generative Prompt Adaptation. C-DGPA synergistically optimizes marginal distribution alignment and conditional distribution alignment through a novel dual-branch architecture. The marginal distribution alignment branch employs a dynamic adversarial training framework to bridge marginal distribution discrepancies. Simultaneously, the conditional distribution alignment branch introduces a Class Mapping Mechanism (CMM) to align conditional distribution discrepancies by standardizing semantic prompt understanding and preventing source domain over-reliance. This dual alignment strategy effectively integrates domain knowledge into prompt learning via synergistic optimization, ensuring domain-invariant and semantically discriminative representations. Extensive experiments on OfficeHome, Office31, and VisDA-2017 validate the superiority of C-DGPA. It achieves new state-of-the-art results on all benchmarks.

[8] CoVAR: Co-generation of Video and Action for Robotic Manipulation via Multi-Modal Diffusion

Liudi Yang, Yang Bai, George Eskandar, Fengyi Shen, Mohammad Altillawi, Dong Chen, Ziyuan Liu, Abhinav Valada

🧩 TL;DR

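This paper presents a method that generates video-action pairs from a text instruction, an initial image, and the robot's joint states. By extending a pretrained video diffusion model and introducing a cross-modal interaction mechanism, it addresses the lack of action annotations in robot policy learning.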


📘 Detailed Summary

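Motivation: Existing methods face two main limitations: two-stage pipelines limit tightly coupled cross-modal information sharing, and adapting a single-modal diffusion model to a joint distribution cannot fully leverage pretrained video knowledge. Video data also typically lacks action annotations, which prevents video diffusion models from being fully used for robot policy learning.

Method: The method has three core components. First, it extends a pretrained video diffusion model with a parallel, dedicated action diffusion model that preserves pretrained knowledge. Second, it introduces a Bridge Attention mechanism for effective cross-modal interaction. Third, it adds an action refinement module that converts coarse actions into precise controls for low-resolution datasets.

Result: Extensive evaluations on multiple public benchmarks and real-world datasets show that the method generates higher-quality videos and more accurate actions and significantly outperforms existing baselines, offering a scalable framework for leveraging large-scale video data in robot learning.

Conclusion: Through its cross-modal architecture, the work addresses the lack of action annotations in video data and offers a new data generation paradigm for robot policy learning. Combining Bridge Attention with the action refinement module tightly couples pretrained video knowledge with action generation, pointing to a feasible way to use large-scale video data to improve robot learning.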


📄 Abstract

We present a method to generate video-action pairs that follow text instructions, starting from an initial image observation and the robot's joint states. Our approach automatically provides action labels for video diffusion models, overcoming the common lack of action annotations and enabling their full use for robotic policy learning. Existing methods either adopt two-stage pipelines, which limit tightly coupled cross-modal information sharing, or rely on adapting a single-modal diffusion model for a joint distribution that cannot fully leverage pretrained video knowledge. To overcome these limitations, we (1) extend a pretrained video diffusion model with a parallel, dedicated action diffusion model that preserves pretrained knowledge, (2) introduce a Bridge Attention mechanism to enable effective cross-modal interaction, and (3) design an action refinement module to convert coarse actions into precise controls for low-resolution datasets. Extensive evaluations on multiple public benchmarks and real-world datasets demonstrate that our method generates higher-quality videos, more accurate actions, and significantly outperforms existing baselines, offering a scalable framework for leveraging large-scale video data for robotic learning.

[9] Open Ad-hoc Categorization with Contextualized Feature Learning

Zilin Wang, Sangwoo Mo, Stella X. Yu, Sima Behpour, Liu Ren

🧩 TL;DR

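This paper proposes OAK, which introduces learnable context tokens and combines CLIP's image-text alignment objective with GCD's visual clustering objective. It achieves state-of-the-art performance on open ad-hoc categorization and markedly improves accuracy on novel categories.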


📘 Detailed Summary

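Motivation: The work addresses adaptive categorization of visual scenes for AI agents facing changing tasks, focusing on ad-hoc categorization. Unlike fixed common categories such as plants or animals, ad-hoc categories are created dynamically to serve specific goals. Given a few labeled exemplars and abundant unlabeled data, the goal is to discover the underlying context and expand ad-hoc categories through semantic extension and visual clustering.

Method: Building on the insight that ad-hoc and common categories rely on similar perceptual mechanisms, the authors propose OAK. The model introduces a small set of learnable context tokens at the input of a frozen CLIP and optimizes them with both CLIP's image-text alignment objective and GCD's visual clustering objective, unifying semantic extension and visual clustering in one framework.

Result: On the Stanford and Clevr-4 datasets, OAK achieves state-of-the-art accuracy and concept discovery across multiple categorizations, including 87.4% novel accuracy on Stanford Mood, surpassing CLIP and GCD by over 50%. OAK also produces interpretable saliency maps that focus on task-relevant features: hands for Action, faces for Mood, and backgrounds for Location.

Conclusion: The study shows that combining the semantic understanding of a pretrained vision-language model with visual clustering can effectively address open ad-hoc categorization. Beyond its accuracy, OAK's interpretable saliency maps promote transparency and trust, making it a promising direction for adaptive and generalizable categorization by AI agents in dynamic environments.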


📄 Abstract

Adaptive categorization of visual scenes is essential for AI agents to handle changing tasks. Unlike fixed common categories for plants or animals, ad-hoc categories are created dynamically to serve specific goals. We study open ad-hoc categorization: Given a few labeled exemplars and abundant unlabeled data, the goal is to discover the underlying context and to expand ad-hoc categories through semantic extension and visual clustering around it. Building on the insight that ad-hoc and common categories rely on similar perceptual mechanisms, we propose OAK, a simple model that introduces a small set of learnable context tokens at the input of a frozen CLIP and optimizes with both CLIP's image-text alignment objective and GCD's visual clustering objective. On Stanford and Clevr-4 datasets, OAK achieves state-of-the-art in accuracy and concept discovery across multiple categorizations, including 87.4% novel accuracy on Stanford Mood, surpassing CLIP and GCD by over 50%. Moreover, OAK produces interpretable saliency maps, focusing on hands for Action, faces for Mood, and backgrounds for Location, promoting transparency and trust while enabling adaptive and generalizable categorization.

[10] Auto-Vocabulary 3D Object Detection

Haomeng Zhang, Kuan-Chuan Peng, Suhas Lohit, Raymond A. Yeh

🧩 TL;DR

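This paper proposes a framework for Auto-Vocabulary 3D Object Detection (AV3DOD) that uses 2D vision-language models to automatically generate semantic classes for detected objects without user-specified classes, achieving state-of-the-art localization accuracy and semantic quality.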


📘 Detailed Summary

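Motivation: Existing open-vocabulary 3D object detection methods can localize 3D boxes of classes unseen during training, yet they still rely on user-specified classes at both training and inference. The paper studies auto-vocabulary 3D object detection, in which classes are generated automatically for detected objects without any user input, and introduces a Semantic Score (SS) to evaluate the quality of the generated class names.

Method: AV3DOD uses 2D vision-language models to generate rich semantic candidates through three key steps: image captioning, pseudo 3D box generation, and feature-space semantics expansion. Captioning supplies the semantic information, pseudo 3D boxes provide spatial alignment, and expansion in feature space enriches the class representations.

Result: AV3DOD achieves state-of-the-art localization (mAP) and semantic quality (SS) on the ScanNetV2 and SUNRGB-D datasets. On ScanNetV2, it surpasses the previous state of the art, CoDA, by 3.48 overall mAP and attains a 24.5% relative improvement in SS.

Conclusion: The study shows that semantic classes for 3D detections can be generated automatically with 2D vision-language models, opening a direction toward fully autonomous 3D scene understanding. The Semantic Score provides a quantitative measure of generated class-name quality, and the framework illustrates the value of cross-modal knowledge transfer for 3D vision tasks.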


📄 Abstract

Open-vocabulary 3D object detection methods are able to localize 3D boxes of classes unseen during training. Despite the name, existing methods rely on user-specified classes both at training and inference. We propose to study Auto-Vocabulary 3D Object Detection (AV3DOD), where the classes are automatically generated for the detected objects without any user input. To this end, we introduce Semantic Score (SS) to evaluate the quality of the generated class names. We then develop a novel framework, AV3DOD, which leverages 2D vision-language models (VLMs) to generate rich semantic candidates through image captioning, pseudo 3D box generation, and feature-space semantics expansion. AV3DOD achieves the state-of-the-art (SOTA) performance on both localization (mAP) and semantic quality (SS) on the ScanNetV2 and SUNRGB-D datasets. Notably, it surpasses the SOTA, CoDA, by 3.48 overall mAP and attains a 24.5% relative improvement in SS on ScanNetV2.

[11] AI-Powered Dermatological Diagnosis: From Interpretable Models to Clinical Implementation A Comprehensive Framework for Accessible and Trustworthy Skin Disease Detection

Satya Narayana Panda, Vaishnavi Kukkala, Spandana Iyer

🧩 TL;DR

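This study develops a multimodal AI framework that integrates family history data with clinical image analysis to improve the accuracy of dermatological diagnosis, particularly for hereditary skin conditions, and designs an interpretable clinical decision support system.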


📘 Detailed Summary

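Motivation: Dermatological conditions affect 1.9 billion people globally, yet accurate diagnosis is hampered by limited specialist availability and complex clinical presentations. Family history significantly influences skin disease susceptibility and treatment response but is often underutilized in diagnosis. The study asks how family history data can be combined with clinical imaging to improve dermatological diagnosis.

Method: The authors develop a comprehensive multimodal AI framework that combines deep learning-based image analysis with structured clinical data, including detailed family history patterns. It pairs interpretable convolutional neural networks with clinical decision trees that incorporate hereditary risk factors. Validation in this work was conducted with healthcare professionals; prospective clinical trials across diverse healthcare settings are proposed as future work.

Result: When family history data is incorporated, the integrated AI system shows enhanced diagnostic accuracy, particularly for hereditary skin conditions such as melanoma, psoriasis, and atopic dermatitis. Expert feedback indicates potential for improved early detection and more personalized recommendations; formal clinical trials are planned.

Conclusion: The study indicates that integrating family history data matters for AI-assisted dermatological diagnosis, especially for hereditary conditions. The framework is designed for integration into clinical workflows and maintains transparency through explainable AI mechanisms, offering a route toward personalized dermatological diagnosis and clinical decision support.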


📄 Abstract

Dermatological conditions affect 1.9 billion people globally, yet accurate diagnosis remains challenging due to limited specialist availability and complex clinical presentations. Family history significantly influences skin disease susceptibility and treatment responses, but is often underutilized in diagnostic processes. This research addresses the critical question: How can AI-powered systems integrate family history data with clinical imaging to enhance dermatological diagnosis while supporting clinical trial validation and real-world implementation? We developed a comprehensive multi-modal AI framework that combines deep learning-based image analysis with structured clinical data, including detailed family history patterns. Our approach employs interpretable convolutional neural networks integrated with clinical decision trees that incorporate hereditary risk factors. The methodology includes prospective clinical trials across diverse healthcare settings to validate AI-assisted diagnosis against traditional clinical assessment. In this work, validation was conducted with healthcare professionals to assess AI-assisted outputs against clinical expectations; prospective clinical trials across diverse healthcare settings are proposed as future work. The integrated AI system demonstrates enhanced diagnostic accuracy when family history data is incorporated, particularly for hereditary skin conditions such as melanoma, psoriasis, and atopic dermatitis. Expert feedback indicates potential for improved early detection and more personalized recommendations; formal clinical trials are planned. The framework is designed for integration into clinical workflows while maintaining interpretability through explainable AI mechanisms.

[12] Interaction-via-Actions: Cattle Interaction Detection with Joint Learning of Action-Interaction Latent Space

Ren Nakagawa, Yang Yang, Risa Shinoda, Hiroaki Santo, Kenji Oyama, Fumio Okura, Takenao Ohkawa

🧩 TL;DR

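This paper proposes CattleAct, a data-efficient method for automatically detecting behavioral interactions between grazing cattle from a single image by decomposing interactions into combinations of individual cattle actions, and develops a practical system that integrates video and GPS inputs.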


📘 Detailed Summary

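Motivation: Although interaction detection for humans is actively studied, cattle interaction detection poses a non-trivial challenge: comprehensive behavioral datasets that include interactions are lacking because interactions between grazing cattle are rare events. Detecting such interactions is essential for smart livestock management, for example estrus detection.

Method: The method first learns an action latent space from a large-scale cattle action dataset, then embeds rare interactions by fine-tuning the pretrained latent space with contrastive learning, producing a unified latent space of actions and interactions. A practical working system integrating video and GPS inputs is built on top.

Result: Experiments on a commercial-scale pasture show that the method detects interactions accurately compared with the baselines, supporting its effectiveness in real-world conditions.

Conclusion: With a data-efficient approach, the work addresses data scarcity in cattle interaction detection and provides a practical solution for smart livestock management; it could be extended to behavior analysis of other animals.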


📄 Abstract

This paper introduces a method and application for automatically detecting behavioral interactions between grazing cattle from a single image, which is essential for smart livestock management in the cattle industry, such as for detecting estrus. Although interaction detection for humans has been actively studied, a non-trivial challenge lies in cattle interaction detection, specifically the lack of a comprehensive behavioral dataset that includes interactions, as the interactions of grazing cattle are rare events. We, therefore, propose CattleAct, a data-efficient method for interaction detection by decomposing interactions into the combinations of actions by individual cattle. Specifically, we first learn an action latent space from a large-scale cattle action dataset. Then, we embed rare interactions via the fine-tuning of the pre-trained latent space using contrastive learning, thereby constructing a unified latent space of actions and interactions. On top of the proposed method, we develop a practical working system integrating video and GPS inputs. Experiments on a commercial-scale pasture demonstrate the accurate interaction detection achieved by our method compared to the baselines. Our implementation is available at https://github.com/rakawanegan/CattleAct.

[13] TextEditBench: Evaluating Reasoning-aware Text Editing Beyond Rendering

Rui Gui, Yang Wan, Haochen Han, Dongxing Mao, Fangming Liu, Min Li, Alex Jinpeng Wang

🧩 TL;DR

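This paper introduces TextEditBench, a comprehensive benchmark focused on editing text regions in images, and proposes Semantic Expectation (SE), a new evaluation dimension that measures a model's reasoning ability in text editing.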


📘 Detailed Summary

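Motivation: Although text rendering has become an important frontier in visual generation, text editing within images remains largely unexplored because it requires generating legible characters while preserving semantic, geometric, and contextual coherence. Existing work lacks a dedicated benchmark for text-centric regions.

Method: The authors propose TextEditBench, a comprehensive evaluation benchmark focused on text-centric regions in images. Beyond basic pixel manipulations, it emphasizes reasoning-intensive editing scenarios that require understanding physical plausibility, linguistic meaning, and cross-modal dependencies. They also introduce Semantic Expectation (SE), a new evaluation dimension that measures a model's ability to maintain semantic consistency, contextual coherence, and cross-modal alignment during text editing.

Result: Extensive experiments on state-of-the-art editing systems show that current models can follow simple textual instructions but still struggle with context-dependent reasoning, physical consistency, and layout-aware integration. By focusing on this long-overlooked yet fundamental capability, TextEditBench establishes a new testing ground for text-guided image editing and reasoning in multimodal generation.

Conclusion: The study exposes the limitations of current text editing models on complex reasoning tasks and highlights the importance of semantic consistency and contextual understanding when evaluating text editing. TextEditBench provides an evaluation framework for future work on text editing in multimodal generative models.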


📄 Abstract

Text rendering has recently emerged as one of the most challenging frontiers in visual generation, drawing significant attention from large-scale diffusion and multimodal models. However, text editing within images remains largely unexplored, as it requires generating legible characters while preserving semantic, geometric, and contextual coherence. To fill this gap, we introduce TextEditBench, a comprehensive evaluation benchmark that explicitly focuses on text-centric regions in images. Beyond basic pixel manipulations, our benchmark emphasizes reasoning-intensive editing scenarios that require models to understand physical plausibility, linguistic meaning, and cross-modal dependencies. We further propose a novel evaluation dimension, Semantic Expectation (SE), which measures a model's reasoning ability to maintain semantic consistency, contextual coherence, and cross-modal alignment during text editing. Extensive experiments on state-of-the-art editing systems reveal that while current models can follow simple textual instructions, they still struggle with context-dependent reasoning, physical consistency, and layout-aware integration. By focusing evaluation on this long-overlooked yet fundamental capability, TextEditBench establishes a new testing ground for advancing text-guided image editing and reasoning in multimodal generation.

[14] PixelArena: A benchmark for Pixel-Precision Visual Intelligence

Feng Liang, Sizhe Cheng, Chenqi Yi

🧩 TL;DR

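This paper proposes the PixelArena benchmark, which uses semantic segmentation tasks to objectively evaluate the fine-grained image generation capabilities of multimodal large language models, and finds that Gemini 3 Pro Image shows previously unseen semantic mask generation ability under zero-shot settings.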


📘 Detailed Summary

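Motivation: Current image generation benchmarks for multimodal large language models focus mainly on aesthetics rather than fine-grained generation capabilities. Existing evaluations cannot objectively measure generative intelligence at pixel-level precision, so a new evaluation framework is needed.

Method: PixelArena uses semantic segmentation tasks as the evaluation instrument, examining the fine-grained generative intelligence of multimodal large language models with pixel precision and testing their semantic mask generation under zero-shot settings.

Result: Gemini 3 Pro Image shows emergent image generation capabilities, producing high-fidelity semantic masks under zero-shot settings and exhibiting visual intelligence not seen before as well as generalization to new image generation tasks. The study also provides qualitative and quantitative comparisons with other models and an analysis of failure cases.

Conclusion: The findings signal progress in multimodal generation, provide insights for future research on multimodality, reasoning, interpretability, and benchmarking, and point to the potential of multimodal large language models on fine-grained visual tasks.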


📄 Abstract

Multi-modal large language models that have image output are emerging. Many image generation benchmarks focus on aesthetics instead of fine-grained generation capabilities. In PixelArena, we propose using semantic segmentation tasks to objectively examine their fine-grained generative intelligence with pixel precision. We find the latest Gemini 3 Pro Image has emergent image generation capabilities that generate semantic masks with high fidelity under zero-shot settings, showcasing visual intelligence unseen before and true generalization in new image generation tasks. We further investigate its results, compare them qualitatively and quantitatively with those of other models, and present failure cases. The findings not only signal exciting progress in the field but also provide insights into future research related to multimodality, reasoning, interpretability and benchmarking.
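
Scoring a generated semantic mask against ground truth with pixel precision could be done with a metric such as mean intersection-over-union. The abstract does not name the metric PixelArena uses, so the choice of mean IoU, the function name `mean_iou`, and the toy masks below are assumptions made purely for illustration.

```python
def mean_iou(pred, truth, num_classes):
    """Mean intersection-over-union over the classes that appear in
    either mask. Masks are lists of rows of integer class labels."""
    ious = []
    for c in range(num_classes):
        inter = 0
        union = 0
        for pred_row, truth_row in zip(pred, truth):
            for p, t in zip(pred_row, truth_row):
                inter += int(p == c and t == c)
                union += int(p == c or t == c)
        if union:  # skip classes absent from both masks
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy 2x3 masks with two classes; one pixel is mislabelled.
truth = [[0, 0, 1],
         [0, 1, 1]]
pred = [[0, 1, 1],
        [0, 1, 1]]

score = mean_iou(pred, truth, num_classes=2)
print(round(score, 4))
```

A per-pixel metric of this kind depends only on the labels in the mask, which is what makes a segmentation-based evaluation objective compared with aesthetic judgments.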

[15] Avatar4D: Synthesizing Domain-Specific 4D Humans for Real-World Pose Estimation

Jerrin Bright, Zhibo Wang, Dmytro Klepachevskyi, Yuhao Chen, Sirisha Rambhatla, David Clausi, John Zelek

🧩 TL;DR

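This paper presents Avatar4D, a real-world transferable pipeline for generating customizable synthetic human motion datasets for domain-specific applications, and introduces Syn2Sport, a large-scale synthetic sports dataset, to validate synthetic data for supervised learning, zero-shot transfer, and generalization across sports.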


📘 Detailed Summary

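Motivation: Existing human motion generation methods focus on general, everyday motions and offer limited flexibility, so they struggle to meet the needs of domain-specific applications. In domains such as sports, which require fine-grained control over body pose, appearance, camera viewpoint, and environmental context, there is no customizable synthetic data generation pipeline that avoids manual annotation.

Method: The Avatar4D pipeline generates high-fidelity 4D (3D geometry over time) human motion sequences with fine-grained control over body pose, appearance, camera viewpoint, and environmental context, without any manual annotation. Built on it, Syn2Sport is a large-scale synthetic sports dataset covering sports including baseball and ice hockey.

Result: Several state-of-the-art pose estimation models are benchmarked on Syn2Sport, demonstrating the effectiveness of the synthetic data for supervised learning, zero-shot transfer to real-world data, and generalization across sports. The study also evaluates how closely the generated synthetic data aligns with real-world datasets in feature space.

Conclusion: The work shows that Avatar4D can generate scalable, controllable, and transferable human datasets without relying on domain-specific real data, providing a synthetic data solution for diverse domain-specific tasks, particularly challenging applications such as sports motion understanding.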


📄 Abstract

We present Avatar4D, a real-world transferable pipeline for generating customizable synthetic human motion datasets tailored to domain-specific applications. Unlike prior works, which focus on general, everyday motions and offer limited flexibility, our approach provides fine-grained control over body pose, appearance, camera viewpoint, and environmental context, without requiring any manual annotations. To validate the impact of Avatar4D, we focus on sports, where domain-specific human actions and movement patterns pose unique challenges for motion understanding. In this setting, we introduce Syn2Sport, a large-scale synthetic dataset spanning sports, including baseball and ice hockey. Avatar4D features high-fidelity 4D (3D geometry over time) human motion sequences with varying player appearances rendered in diverse environments. We benchmark several state-of-the-art pose estimation models on Syn2Sport and demonstrate their effectiveness for supervised learning, zero-shot transfer to real-world data, and generalization across sports. Furthermore, we evaluate how closely the generated synthetic data aligns with real-world datasets in feature space. Our results highlight the potential of such systems to generate scalable, controllable, and transferable human datasets for diverse domain-specific tasks without relying on domain-specific real data.

[16] Collaborative Edge-to-Server Inference for Vision-Language Models

Soochang Song, Yongjune Kim

🧩 TL;DR

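This paper proposes a collaborative edge-to-server inference framework for vision-language models that uses a selective retransmission strategy to substantially reduce communication cost while maintaining inference accuracy. The framework uses the VLM's internal attention to identify a region of interest and relies on a confidence measure to decide whether a detail-preserved local image must be transmitted to refine the inference.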


📘 Detailed Summary

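Motivation: In typical VLM deployments, visual data captured at edge devices must be transmitted to a server for inference, but resizing the original image to the vision encoder's input resolution often discards fine-grained details and degrades accuracy. Existing approaches trade communication cost against inference accuracy, which calls for a solution that selectively transmits key visual content to preserve accuracy.

Method: The paper designs a two-stage collaborative inference framework. In the first stage, the server runs inference on the global image, identifies a region of interest (RoI) using the VLM's internal attention, and computes the min-entropy of the output tokens as a confidence measure. If the min-entropy exceeds a predefined threshold, the server asks the edge device to send a detail-preserved local image of the RoI. In the second stage, the server refines its inference by jointly using the global and local images, which realizes the selective retransmission strategy.

Result: Experiments across multiple VLM architectures show that the proposed framework significantly reduces communication cost while maintaining inference accuracy. Selective retransmission ensures that only essential visual content is transmitted, achieving a good balance between communication efficiency and inference accuracy across several benchmarks and supporting the framework's effectiveness and generality.

Conclusion: The study offers an efficient communication-optimization scheme for deploying vision-language models in edge computing environments, reducing data transmission through selective retransmission while preserving accuracy. The framework's generality suggests it applies to many VLM architectures and opens a direction for deploying large models in resource-constrained environments, balancing computational efficiency against inference quality.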


📄 Abstract

We propose a collaborative edge-to-server inference framework for vision-language models (VLMs) that reduces the communication cost while maintaining inference accuracy. In typical deployments, visual data captured at edge devices (clients) is transmitted to the server for VLM inference. However, resizing the original image (global image) to match the vision encoder's input resolution often discards fine-grained details, leading to accuracy degradation. To overcome this limitation, we design a two-stage framework. In the first stage, the server performs inference on the global image and identifies a region of interest (RoI) using the VLM's internal attention. The min-entropy of the output tokens is then computed as a confidence measure to determine whether retransmission is required. If the min-entropy exceeds a predefined threshold, the server requests the edge device to send a detail-preserved local image of the RoI. The server then refines its inference by jointly leveraging the global and local images. This selective retransmission strategy ensures that only essential visual content is transmitted. Experiments across multiple VLM architectures show that the proposed framework significantly reduces communication cost while maintaining inference accuracy.
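The retransmission test described above can be illustrated with a small sketch. It uses the standard definition of min-entropy, H_min(p) = -log2(max_i p_i), for each output token's probability distribution. How the paper aggregates per-token values is not stated in the abstract, so averaging over tokens is an assumption here, and the function names and threshold value are hypothetical.

```python
import math

def min_entropy(probs):
    """Min-entropy (in bits) of one token's probability distribution."""
    return -math.log2(max(probs))

def needs_retransmission(token_distributions, threshold):
    """Return True when the averaged min-entropy exceeds the threshold.

    Averaging across tokens is an assumption made for illustration;
    the abstract only says the min-entropy of the output tokens is
    compared with a predefined threshold.
    """
    scores = [min_entropy(p) for p in token_distributions]
    return sum(scores) / len(scores) > threshold

# Toy distributions (hypothetical values, not from the paper).
confident_output = [[0.9, 0.1], [0.95, 0.05]]
uncertain_output = [[0.4, 0.3, 0.3]]
request_for_confident = needs_retransmission(confident_output, threshold=0.5)
request_for_uncertain = needs_retransmission(uncertain_output, threshold=0.5)
```

A low min-entropy means the most likely token dominates, so the server keeps its first-stage answer; a high value triggers a request for the local RoI image.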

[17] Visual Alignment of Medical Vision-Language Models for Grounded Radiology Report Generation

Sarosij Bose, Ravi K. Rajendran, Biplob Debnath, Konstantinos Karydis, Amit K. Roy-Chowdhury, Srimat Chakradhar

🧩 TL;DR

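This paper proposes VALOR, a reinforcement learning-based visual alignment framework that improves the factual accuracy and visual grounding of medical vision-language models in radiology report generation, using a two-stage training strategy to address hallucinations caused by insufficient cross-modal alignment.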


📘 Detailed Summary

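Motivation: Existing medical vision-language models show insufficient visual grounding and limited clinical accuracy in radiology report generation, mainly because inadequate cross-modal alignment leads to hallucinations. Existing approaches rely on large labeled corpora, costly task-specific preference data, or retrieval-based methods, and do not effectively resolve the misalignment between visual and linguistic representations.

Method: VALOR adopts a reinforcement learning-based post-alignment framework that uses Group-Relative Proximal Optimization (GRPO) for two-stage training. The first stage improves the Med-VLM with textual rewards to encourage clinically precise terminology. The second stage aligns the vision projection module of the textually grounded model with disease findings, guiding attention toward the image regions most relevant to the diagnostic task.

Result: Extensive experiments on multiple benchmarks show that VALOR substantially improves factual accuracy and visual grounding and achieves performance gains over existing state-of-the-art methods on radiology report generation, supporting the effectiveness of the proposed alignment framework.

Conclusion: The study shows that a reinforcement learning-driven visual alignment strategy can effectively mitigate hallucinations in medical vision-language models, providing a more reliable foundation for automated healthcare workflows and pointing to a new direction for the accuracy and interpretability of multimodal medical AI systems.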


📄 Abstract

Radiology Report Generation (RRG) is a critical step toward automating healthcare workflows, facilitating accurate patient assessments, and reducing the workload of medical professionals. Despite recent progress in Large Medical Vision-Language Models (Med-VLMs), generating radiology reports that are both visually grounded and clinically accurate remains a significant challenge. Existing approaches often rely on large labeled corpora for pre-training, costly task-specific preference data, or retrieval-based methods. However, these strategies do not adequately mitigate hallucinations arising from poor cross-modal alignment between visual and linguistic representations. To address these limitations, we propose VALOR: Visual Alignment of Medical Vision-Language Models for GrOunded Radiology Report Generation. Our method introduces a reinforcement learning-based post-alignment framework utilizing Group-Relative Proximal Optimization (GRPO). The training proceeds in two stages: (1) improving the Med-VLM with textual rewards to encourage clinically precise terminology, and (2) aligning the vision projection module of the textually grounded model with disease findings, thereby guiding attention toward image regions most relevant to the diagnostic task. Extensive experiments on multiple benchmarks demonstrate that VALOR substantially improves factual accuracy and visual grounding, achieving significant performance gains over state-of-the-art report generation methods.
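The abstract names GRPO without spelling out the computation. A minimal sketch of the group-relative step usually associated with GRPO-style training follows, assuming each sampled report for the same image receives a scalar reward and that rewards are normalized against the group's mean and standard deviation. The function name and reward values are hypothetical, and VALOR's actual reward design is not described in the abstract.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its sampling group.

    Assumption: the advantage of a sample is (reward - group mean)
    divided by the group standard deviation; eps avoids division by
    zero when every sample in the group has the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four reports sampled for one image.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Samples that score above their group's average get a positive advantage and are reinforced; those below it are discouraged, without needing a separate value network.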

[18] Smile on the Face, Sadness in the Eyes: Bridging the Emotion Gap with a Multimodal Dataset of Eye and Facial Behaviors

Kejun Liu, Yuanyuan Liu, Lin Wei, Chang Tang, Yibing Zhan, Zijing Chen, Zhe Chen

🧩 TL;DR

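This paper proposes an eye-behavior-aided approach to multimodal emotion recognition. It builds the EMER dataset, which includes eye movement sequences and eye fixation maps, and designs the EMERT model to bridge the gap between facial expression recognition and genuine emotion recognition.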


📘 Detailed Summary

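Motivation: Emotion recognition currently relies too heavily on facial expression recognition, yet facial expressions are often used as social tools rather than reflecting genuine inner emotions. This creates a significant gap between facial expression recognition and genuine emotion recognition, and more reliable physiological cues are needed to close it.

Method: Using a spontaneous emotion induction paradigm, the study constructs the EMER dataset, which contains eye movement sequences, eye fixation maps, and facial videos. It also designs the EMERT model, based on modality-adversarial feature decoupling and a multitask Transformer, to model eye behaviors as a complement to facial expressions.

Result: Experiments show that EMERT significantly outperforms existing state-of-the-art methods on seven multimodal benchmark protocols, demonstrating the importance of modeling eye behaviors for robust emotion recognition and providing a new evaluation framework for multimodal emotion recognition.

Conclusion: Eye behaviors are an important cue for understanding genuine emotion and can effectively narrow the gap between facial expression recognition and emotion recognition. The study provides a new dataset and methodological foundation for building more robust emotion recognition systems, moving the field toward more genuine and reliable recognition.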


📄 Abstract

Emotion Recognition (ER) is the process of analyzing and identifying human emotions from sensing data. Currently, the field heavily relies on facial expression recognition (FER) because the visual channel conveys rich emotional cues. However, facial expressions are often used as social tools rather than manifestations of genuine inner emotions. To understand and bridge this gap between FER and ER, we introduce eye behaviors as an important emotional cue and construct an Eye-behavior-aided Multimodal Emotion Recognition (EMER) dataset. To collect data with genuine emotions, a spontaneous emotion induction paradigm is used with stimulus material, during which non-invasive eye behavior data, such as eye movement sequences and eye fixation maps, is captured together with facial expression videos. To better illustrate the gap between ER and FER, multi-view emotion labels for multimodal ER and FER are annotated separately. Furthermore, based on the new dataset, we design a simple yet effective Eye-behavior-aided MER Transformer (EMERT) that enhances ER by bridging the emotion gap. EMERT leverages modality-adversarial feature decoupling and a multitask Transformer to model eye behaviors as a strong complement to facial expressions. In the experiment, we introduce seven multimodal benchmark protocols for a variety of comprehensive evaluations of the EMER dataset. The results show that EMERT outperforms other state-of-the-art multimodal methods by a large margin, revealing the importance of modeling eye behaviors for robust ER. To sum up, we provide a comprehensive analysis of the importance of eye behaviors in ER, advancing the study on addressing the gap between FER and ER for more robust ER performance. Our EMER dataset and the trained EMERT models will be publicly available at https://github.com/kejun1/EMER.

[19] TTP: Test-Time Padding for Adversarial Detection and Robust Adaptation on Vision-Language Models

Zhiwei Li, Yitian Pang, Weining Wang, Zhenan Sun, Qi Li

🧩 TL;DR

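This paper proposes Test-Time Padding (TTP), a lightweight defense framework that improves the adversarial robustness of vision-language models such as CLIP. At inference time, the method uses spatial padding to detect adversarial examples and then applies targeted adaptation, substantially improving adversarial defense without harming accuracy on clean samples.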


📘 Detailed Summary

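Motivation: Vision-language models such as CLIP excel at zero-shot recognition but are highly susceptible to adversarial perturbations, which poses significant risks in safety-critical scenarios. Existing training-time defenses require labeled data and costly retraining, while test-time strategies cannot reliably distinguish clean from adversarial inputs, so adversarial robustness and clean accuracy cannot both reach their optimum.

Method: The paper proposes the Test-Time Padding (TTP) framework, which performs adversarial detection followed by targeted adaptation at inference. It identifies adversarial inputs by the cosine similarity shift between CLIP feature embeddings computed before and after spatial padding, which yields a universal detection threshold across architectures and datasets. For detected adversarial samples, TTP uses trainable padding to restore disrupted attention patterns and combines it with a similarity-aware ensemble strategy for a more robust final prediction.

Result: Comprehensive experiments on diverse CLIP backbones and fine-grained benchmarks show that TTP consistently outperforms state-of-the-art test-time defenses, delivering substantial gains in adversarial robustness without compromising clean accuracy. The method provides a reliable universal detection threshold across architectures and datasets and achieves effective defense while preserving inference efficiency.

Conclusion: TTP offers a lightweight and effective test-time defense that substantially improves the adversarial robustness of vision-language models without retraining or labeled data. By combining padding-based detection with targeted adaptation, it provides a practical framework for safety-critical deployments while preserving the model's zero-shot recognition ability.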


📄 Abstract

Vision-Language Models (VLMs), such as CLIP, have achieved impressive zero-shot recognition performance but remain highly susceptible to adversarial perturbations, posing significant risks in safety-critical scenarios. Previous training-time defenses rely on adversarial fine-tuning, which requires labeled data and costly retraining, while existing test-time strategies fail to reliably distinguish between clean and adversarial inputs, thereby preventing both adversarial robustness and clean accuracy from reaching their optimum. To address these limitations, we propose Test-Time Padding (TTP), a lightweight defense framework that performs adversarial detection followed by targeted adaptation at inference. TTP identifies adversarial inputs via the cosine similarity shift between CLIP feature embeddings computed before and after spatial padding, yielding a universal threshold for reliable detection across architectures and datasets. For detected adversarial cases, TTP employs trainable padding to restore disrupted attention patterns, coupled with a similarity-aware ensemble strategy for a more robust final prediction. For clean inputs, TTP leaves them unchanged by default or optionally integrates existing test-time adaptation techniques for further accuracy gains. Comprehensive experiments on diverse CLIP backbones and fine-grained benchmarks show that TTP consistently surpasses state-of-the-art test-time defenses, delivering substantial improvements in adversarial robustness without compromising clean accuracy. The code for this paper will be released soon.
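The detection step can be sketched as a comparison of two embeddings of the same image, one computed before and one after spatial padding. The sketch below assumes that a larger drop in cosine similarity signals an adversarial input and that the shift is measured as one minus the cosine similarity; the abstract specifies neither detail. The embeddings are toy vectors standing in for CLIP features, and the threshold value and function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def is_adversarial(emb_original, emb_padded, threshold):
    """Flag an input when padding moves its embedding too far.

    Assumption: shift = 1 - cosine similarity, and a shift above the
    threshold marks the input as adversarial.
    """
    shift = 1.0 - cosine_similarity(emb_original, emb_padded)
    return shift > threshold

# Toy embeddings (hypothetical values, not real CLIP outputs).
clean_flag = is_adversarial([1.0, 0.0], [1.0, 0.0], threshold=0.1)
attack_flag = is_adversarial([1.0, 0.0], [0.0, 1.0], threshold=0.1)
```

In a real pipeline the two vectors would come from the same CLIP image encoder applied to the original and the padded image.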

[20] ARMFlow: AutoRegressive MeanFlow for Online 3D Human Reaction Generation

Zichen Geng, Zeeshan Hayder, Wei Liu, Hesheng Wang, Ajmal Mian

🧩 TL;DR

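This paper proposes ARMFlow, a MeanFlow-based autoregressive framework for 3D human reaction generation that addresses the failure of existing methods to deliver high motion fidelity, real-time inference, and autoregressive adaptability at the same time. The framework consists of a causal context encoder and an MLP velocity predictor, and introduces Bootstrap Contextual Encoding to reduce error accumulation.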


📘 Detailed Summary

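Motivation: 3D human reaction generation faces three main challenges: high motion fidelity, real-time inference, and autoregressive adaptability for online scenarios. Existing methods cannot meet all three at once; in online settings in particular, they show significant limitations in semantic alignment, inference latency, and error accumulation.

Method: The paper proposes ARMFlow, a MeanFlow-based autoregressive framework that models temporal dependencies between actor and reactor motions. It comprises a causal context encoder and an MLP velocity predictor, and introduces the Bootstrap Contextual Encoding (BSCE) training strategy, which encodes generated history rather than ground-truth history to mitigate error accumulation in autoregressive generation. The paper also introduces an offline variant, ReMFlow, and uses a global contextual encoder to enhance semantic alignment.

Result: On InterHuman and InterX, ARMFlow's single-step online generation surpasses existing online methods by over 40% in FID while matching offline state-of-the-art performance despite using only partial sequence conditions. The offline variant ReMFlow achieves state-of-the-art performance with the fastest inference among offline methods, and single-step online inference delivers high accuracy and low latency.

Conclusion: The study shows that an autoregressive framework combined with Bootstrap Contextual Encoding can address fidelity, real-time performance, and adaptability in 3D human reaction generation together. ARMFlow offers a practical solution for online interactive scenarios, and its single-step inference architecture and error-control mechanism provide a new technical path for real-time human motion generation.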


📄 Abstract

3D human reaction generation faces three main challenges: (1) high motion fidelity, (2) real-time inference, and (3) autoregressive adaptability for online scenarios. Existing methods fail to meet all three simultaneously. We propose ARMFlow, a MeanFlow-based autoregressive framework that models temporal dependencies between actor and reactor motions. It consists of a causal context encoder and an MLP-based velocity predictor. We introduce Bootstrap Contextual Encoding (BSCE) in training, encoding generated history instead of ground-truth history, to alleviate error accumulation in autoregressive generation. We further introduce the offline variant ReMFlow, achieving state-of-the-art performance with the fastest inference among offline methods. Our ARMFlow addresses key limitations of online settings by: (1) enhancing semantic alignment via a global contextual encoder; (2) achieving high accuracy and low latency in a single-step inference; and (3) reducing accumulated errors through BSCE. Our single-step online generation surpasses existing online methods on InterHuman and InterX by over 40% in FID, while matching offline state-of-the-art performance despite using only partial sequence conditions.

[21] Yuan-TecSwin: A text conditioned Diffusion model with Swin-transformer blocks

Shaohua Wu, Tong Yu, Shenling Wang, Xudong Zhao

🧩 TL;DR

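This paper proposes Yuan-TecSwin, a text-conditioned diffusion model built on Swin-transformer blocks, which replace the conventional CNN blocks to strengthen long-range semantic understanding. It achieves a state-of-the-art FID score of 1.37 on the ImageNet generation benchmark.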


📘 Detailed Summary

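Motivation: Diffusion models perform well in image synthesis, but the locality of convolution in their CNN-based U-shaped architecture may limit the model's ability to understand long-range semantic information. This paper aims to address that limitation.

Method: The paper proposes Yuan-TecSwin, which replaces the CNN blocks in the encoder and decoder with Swin-transformer blocks to improve non-local modeling in feature extraction and image restoration. Text-image alignment is improved through a well-chosen text encoder, effective use of text embeddings, and a carefully designed way of incorporating the text condition. An adapted time step search across different diffusion stages further improves inference performance.

Result: Yuan-TecSwin achieves a state-of-the-art FID score of 1.37 on the ImageNet generation benchmark. The adapted time step search improves inference performance by 10%. In a human evaluation, participants found it difficult to tell model-generated images from human-painted ones.

Conclusion: The study indicates that integrating Swin-transformer blocks into diffusion models strengthens long-range semantic understanding, that the improved incorporation of the text condition improves text-image alignment, and that the adapted time step search further improves inference performance, offering a new architectural design for high-quality text-to-image generation.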


📄 Abstract

Diffusion models have shown remarkable capacity in image synthesis based on their U-shaped architecture and convolutional neural networks (CNN) as basic blocks. The locality of the convolution operation in CNN may limit the model's ability to understand long-range semantic information. To address this issue, we propose Yuan-TecSwin, a text-conditioned diffusion model with Swin-transformer in this work. The Swin-transformer blocks take the place of CNN blocks in the encoder and decoder, to improve the non-local modeling ability in feature extraction and image restoration. The text-image alignment is improved with a well-chosen text encoder, effective utilization of text embedding, and careful design in the incorporation of text condition. Using an adapted time step to search in different diffusion stages, inference performance is further improved by 10%. Yuan-TecSwin achieves the state-of-the-art FID score of 1.37 on ImageNet generation benchmark, without any additional models at different denoising stages. In a side-by-side comparison, we find it difficult for human interviewees to tell the model-generated images from the human-painted ones.

[22] MACL: Multi-Label Adaptive Contrastive Learning Loss for Remote Sensing Image Retrieval

Amna Amir, Erchan Aptoula

🧩 TL;DR

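This paper proposes Multi-Label Adaptive Contrastive Learning (MACL), which integrates label-aware sampling, frequency-sensitive weighting, and dynamic-temperature scaling to address semantic overlap, highly imbalanced label distributions, and complex inter-class co-occurrence patterns in remote sensing image retrieval.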


📘 Detailed Summary

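Motivation: Multi-label remote sensing image retrieval faces three major challenges: semantic overlap among land-cover categories, highly imbalanced label distributions, and complex inter-class co-occurrence patterns. These factors seriously affect the performance and reliability of retrieval systems over large-scale remote sensing archives.

Method: MACL extends contrastive learning to the multi-label setting and integrates three key techniques: label-aware sampling to handle complex inter-class relationships, frequency-sensitive weighting to balance representation learning across common and rare categories, and dynamic-temperature scaling to adapt to the semantic characteristics of different categories.

Result: Extensive experiments on three benchmark datasets (DLRSD, ML-AID, and WHDLD) show that MACL consistently outperforms contrastive-loss based baselines, effectively mitigates semantic imbalance, and delivers more reliable retrieval performance in large-scale remote sensing archives.

Conclusion: The study demonstrates the effectiveness of an adaptive contrastive learning framework for handling semantic imbalance in multi-label remote sensing image retrieval and offers a new technical route for intelligent retrieval over large-scale remote sensing archives. The code and pretrained models will be released to support the community.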


📄 Abstract

Semantic overlap among land-cover categories, highly imbalanced label distributions, and complex inter-class co-occurrence patterns constitute significant challenges for multi-label remote-sensing image retrieval. In this article, Multi-Label Adaptive Contrastive Learning (MACL) is introduced as an extension of contrastive learning to address them. It integrates label-aware sampling, frequency-sensitive weighting, and dynamic-temperature scaling to achieve balanced representation learning across both common and rare categories. Extensive experiments on three benchmark datasets (DLRSD, ML-AID, and WHDLD) show that MACL consistently outperforms contrastive-loss based baselines, effectively mitigating semantic imbalance and delivering more reliable retrieval performance in large-scale remote-sensing archives. Code, pretrained models, and evaluation scripts will be released at https://github.com/amna/MACL upon acceptance.

[23] Factorized Video Generation: Decoupling Scene Construction and Temporal Synthesis in Text-to-Video Diffusion Models

Mariam Hassan, Bastien Van Delft, Wuyang Li, Alexandre Alahi

🧩 TL;DR

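This paper proposes Factorized Video Generation (FVG), a pipeline that decomposes text-to-video generation into three specialized stages. By decoupling initial scene construction from temporal animation, it substantially improves the logical consistency, compositional quality, and efficiency of video synthesis.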


📘 Detailed Summary

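Motivation: State-of-the-art text-to-video diffusion models produce visually impressive results but often fail to compose complex scenes or follow logical temporal instructions. Many errors stem from the model's inability to generate a semantically correct or logically consistent initial frame, which limits the quality and controllability of video synthesis.

Method: FVG decomposes text-to-video generation into three specialized stages. In the Reasoning stage, a large language model rewrites the video prompt to describe only the initial scene, resolving temporal ambiguities. In the Composition stage, a text-to-image model synthesizes a high-quality, compositionally correct anchor frame from the new prompt. In the Temporal Synthesis stage, a finetuned video model focuses on animating the scene and following the original prompt, making full use of the anchor frame.

Result: The method sets a new state of the art on the T2V CompBench benchmark and significantly improves all tested models on VBench2. Visual anchoring allows the number of sampling steps to be cut by 70% without loss in performance, yielding a substantial sampling speed-up.

Conclusion: Factorized Video Generation offers a simple, practical path toward more efficient, robust, and controllable video synthesis. By decoupling the tasks and handling each with a specialized stage, it addresses core challenges of current text-to-video models in logical consistency and compositional quality while substantially improving generation efficiency.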


📄 Abstract

State-of-the-art Text-to-Video (T2V) diffusion models can generate visually impressive results, yet they still frequently fail to compose complex scenes or follow logical temporal instructions. In this paper, we argue that many errors, including apparent motion failures, originate from the model's inability to construct a semantically correct or logically consistent initial frame. We introduce Factorized Video Generation (FVG), a pipeline that decouples these tasks by decomposing the Text-to-Video generation into three specialized stages: (1) Reasoning, where a Large Language Model (LLM) rewrites the video prompt to describe only the initial scene, resolving temporal ambiguities; (2) Composition, where a Text-to-Image (T2I) model synthesizes a high-quality, compositionally-correct anchor frame from this new prompt; and (3) Temporal Synthesis, where a video model, finetuned to understand this anchor, focuses its entire capacity on animating the scene and following the prompt. Our decomposed approach sets a new state-of-the-art on the T2V CompBench benchmark and significantly improves all tested models on VBench2. Furthermore, we show that visual anchoring allows us to cut the number of sampling steps by 70% without any loss in performance, leading to a substantial speed-up in sampling. Factorized Video Generation offers a simple yet practical path toward more efficient, robust, and controllable video synthesis.
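The three stages compose naturally as a function pipeline. The sketch below shows only the control flow: the three callables are hypothetical stand-ins for an LLM, a text-to-image model, and a finetuned video model, and passing the original prompt to the final stage is an assumption. None of the names come from the paper.

```python
from typing import Any, Callable

def factorized_video_generation(
    prompt: str,
    rewrite_prompt: Callable[[str], str],
    text_to_image: Callable[[str], Any],
    animate: Callable[[Any, str], Any],
) -> Any:
    """Run the three FVG stages in order (control flow only)."""
    scene_prompt = rewrite_prompt(prompt)        # (1) Reasoning
    anchor_frame = text_to_image(scene_prompt)   # (2) Composition
    return animate(anchor_frame, prompt)         # (3) Temporal Synthesis

# Stub models, purely illustrative: they return small dictionaries.
def fake_rewrite(prompt: str) -> str:
    return "initial scene: " + prompt.split(",")[0]

def fake_text_to_image(scene_prompt: str) -> dict:
    return {"frame_from": scene_prompt}

def fake_animate(anchor: dict, prompt: str) -> dict:
    return {"anchor": anchor, "prompt": prompt, "frames": 16}

video = factorized_video_generation(
    "a cat sits on a mat, then jumps off",
    fake_rewrite,
    fake_text_to_image,
    fake_animate,
)
```

Keeping each stage behind a plain callable means any of the three models can be swapped without touching the others, which matches the decoupling the abstract argues for.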

[24] BrepLLM: Native Boundary Representation Understanding with Large Language Models

Liyuan Deng, Hao Guo, Yunpeng Bai, Yongkang Dai, Huaxi Huang, Yilei Shi

🧩 TL;DR

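This paper proposes BrepLLM, the first framework that enables large language models to parse and process raw 3D boundary representation data directly. Using cross-modal alignment pre-training and a multi-stage fine-tuning strategy, it achieves state-of-the-art performance on 3D object classification and captioning.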


📘 Detailed Summary

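Motivation: Current token-sequence-based large language models struggle to directly process 3D boundary representation models, which contain complex geometric and topological information. The resulting modality gap between 3D geometric structure and natural language limits the use of LLMs in 3D geometry understanding.

Method: BrepLLM uses a two-stage training pipeline. In the cross-modal alignment pre-training stage, adaptive UV sampling converts Breps into graph representations, a hierarchical BrepEncoder extracts geometric and topological features, and contrastive learning aligns the global token with CLIP text embeddings. In the multi-stage LLM fine-tuning stage, the pretrained BrepEncoder is integrated through a three-stage progressive strategy: MLP-based semantic mapping, LLM fine-tuning, and a Mixture-of-Query Experts mechanism that enhances geometric diversity modeling.

Result: Experiments show that BrepLLM achieves state-of-the-art performance on 3D object classification and captioning. The study also constructs Brep2Text, a dataset of 269,444 Brep-text question-answer pairs, supporting the framework's effectiveness for cross-modal 3D geometry understanding.

Conclusion: The study builds a bridge between 3D geometric representations and natural language, offering a new paradigm for applying LLMs to CAD modeling, 3D design, and related fields. The proposed hierarchical encoder and progressive training strategy provide a reference methodology for other cross-modal geometry understanding tasks.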


📄 Abstract

Current token-sequence-based Large Language Models (LLMs) are not well-suited for directly processing 3D Boundary Representation (Brep) models that contain complex geometric and topological information. We propose BrepLLM, the first framework that enables LLMs to parse and reason over raw Brep data, bridging the modality gap between structured 3D geometry and natural language. BrepLLM employs a two-stage training pipeline: Cross-modal Alignment Pre-training and Multi-stage LLM Fine-tuning. In the first stage, an adaptive UV sampling strategy converts Breps into graph representations with geometric and topological information. We then design a hierarchical BrepEncoder to extract features from geometry (i.e., faces and edges) and topology, producing both a single global token and a sequence of node tokens. Then we align the global token with text embeddings from a frozen CLIP text encoder (ViT-L/14) via contrastive learning. In the second stage, we integrate the pretrained BrepEncoder into an LLM. We then align its sequence of node tokens using a three-stage progressive training strategy: (1) training an MLP-based semantic mapping from Brep representation to 2D with 2D-LLM priors; (2) performing fine-tuning of the LLM; and (3) designing a Mixture-of-Query Experts (MQE) to enhance geometric diversity modeling. We also construct Brep2Text, a dataset comprising 269,444 Brep-text question-answer pairs. Experiments show that BrepLLM achieves state-of-the-art (SOTA) results on 3D object classification and captioning tasks.

[25] CountZES: Counting via Zero-Shot Exemplar Selection

Muhammad Ibraheem Siddiqui, Muhammad Haris Khan

🧩 TL;DR

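This paper proposes CountZES, a training-free zero-shot object counting framework that uses progressive exemplar selection to address existing methods' inability to accurately identify single object instances in complex scenes, achieving superior performance on multiple datasets.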


📘 Detailed Summary

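Motivation: Zero-shot object counting is challenging in complex scenes. Existing methods either rely on open-vocabulary detectors, which yield multi-instance candidates, or on random patch sampling, which fails to delineate object instances accurately. A solution that precisely identifies single-object exemplars is needed.

Method: CountZES uses a three-stage synergistic exemplar selection framework. The Detection-Anchored Exemplar stage refines open-vocabulary detections to isolate precise single-instance exemplars. The Density-Guided Exemplar stage introduces a density-driven, self-supervised paradigm to identify statistically consistent and semantically compact exemplars. The Feature-Consensus Exemplar stage reinforces visual coherence through feature-space clustering. Together they yield a diverse exemplar set that balances textual grounding, count consistency, and feature representativeness.

Result: Experiments on diverse datasets show that CountZES achieves superior performance among zero-shot object counting methods and generalizes effectively across natural, aerial, and medical image domains.

Conclusion: The proposed progressive exemplar selection framework, through the combined action of detection anchoring, density guidance, and feature consensus, effectively addresses exemplar selection in zero-shot counting and offers a training-free solution for cross-domain object counting that generalizes well.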


📄 Abstract

Object counting in complex scenes remains challenging, particularly in the zero-shot setting, where the goal is to count instances of unseen categories specified only by a class name. Existing zero-shot object counting (ZOC) methods that infer exemplars from text either rely on open-vocabulary detectors, which often yield multi-instance candidates, or on random patch sampling, which fails to accurately delineate object instances. To address this, we propose CountZES, a training-free framework for object counting via zero-shot exemplar selection. CountZES progressively discovers diverse exemplars through three synergistic stages: Detection-Anchored Exemplar (DAE), Density-Guided Exemplar (DGE), and Feature-Consensus Exemplar (FCE). DAE refines open-vocabulary detections to isolate precise single-instance exemplars. DGE introduces a density-driven, self-supervised paradigm to identify statistically consistent and semantically compact exemplars, while FCE reinforces visual coherence through feature-space clustering. Together, these stages yield a diverse, complementary exemplar set that balances textual grounding, count consistency, and feature representativeness. Experiments on diverse datasets demonstrate CountZES's superior performance among ZOC methods while generalizing effectively across natural, aerial, and medical domains.

[26] Geometric Disentanglement of Text Embeddings for Subject-Consistent Text-to-Image Generation using A Single Prompt

Shangxun Li, Youngjung Uh

🧩 TL;DR

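This paper proposes a training-free method that refines text embeddings from a geometric perspective to suppress unwanted semantics, significantly improving subject consistency across multiple outputs and text alignment in text-to-image diffusion models.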


📘 Detailed Summary

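Motivation: Text-to-image diffusion models excel at generating high-quality images from natural language descriptions but struggle to preserve subject consistency across multiple outputs, which limits their use in visual storytelling. Existing approaches rely on model fine-tuning or image conditioning, which are computationally expensive and require per-subject optimization, while training-free methods such as 1Prompt1Story suffer from semantic leakage, where embeddings across frames become entangled and cause text misalignment.

Method: The paper proposes a simple yet effective training-free method that addresses semantic entanglement from a geometric perspective. It refines text embeddings to suppress unwanted semantics, avoiding the semantic leakage caused by cross-frame embedding entanglement in existing methods.

Result: Extensive experiments show that the method significantly outperforms existing baselines in both subject consistency and text alignment. The results indicate that it reduces semantic leakage and improves subject consistency across multiple generated images while maintaining better alignment with the text descriptions.

Conclusion: The study provides a computationally efficient way to improve subject consistency in diffusion models without model fine-tuning. Treating text embeddings from a geometric perspective opens a new avenue for visual storytelling applications and offers a fresh perspective on semantic control in diffusion models.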


📄 Abstract

Text-to-image diffusion models excel at generating high-quality images from natural language descriptions but often fail to preserve subject consistency across multiple outputs, limiting their use in visual storytelling. Existing approaches rely on model fine-tuning or image conditioning, which are computationally expensive and require per-subject optimization. 1Prompt1Story, a training-free approach, concatenates all scene descriptions into a single prompt and rescales token embeddings, but it suffers from semantic leakage, where embeddings across frames become entangled, causing text misalignment. In this paper, we propose a simple yet effective training-free approach that addresses semantic entanglement from a geometric perspective by refining text embeddings to suppress unwanted semantics. Extensive experiments prove that our approach significantly improves both subject consistency and text alignment over existing baselines.

[27] SNOW: Spatio-Temporal Scene Understanding with World Knowledge for Open-World Embodied Reasoning

Tin Stribor Sohn, Maximilian Dillitzer, Jason J. Corso, Eric Sax

🧩 TL;DR

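This paper proposes SNOW, a training-free and backbone-agnostic framework for unified 4D scene understanding. It combines the open-world semantic priors of vision-language models with point cloud geometry and temporal consistency to build a queryable 4D scene graph that serves as a spatio-temporal prior for downstream reasoning.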


📘 Detailed Summary

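Motivation: Autonomous robotic systems need spatio-temporal understanding of dynamic environments for reliable navigation and interaction. Existing vision-language models lack grounding in 3D geometry and temporal dynamics, while geometric perception methods capture structure and motion but remain semantically sparse. A unified framework is therefore needed to integrate open-world semantics with spatio-temporal geometric information.

Method: SNOW processes synchronized RGB images and 3D point clouds, using HDBSCAN clustering to generate object-level proposals that guide SAM2 segmentation. The proposed Spatio-Temporal Tokenized Patch Encoding (STEP) produces multimodal tokens for each segmented region that capture local semantic, geometric, and temporal attributes, and these tokens are incrementally integrated into a 4D scene graph. A lightweight SLAM backend anchors all tokens spatially in the environment to ensure global reference alignment.

Result: Experiments on diverse benchmarks show that SNOW enables precise 4D scene understanding and spatially grounded inference, setting new state-of-the-art performance in several settings and highlighting the importance of structured 4D priors for embodied reasoning and autonomous robotics.

Conclusion: The study demonstrates the effectiveness of unifying open-world semantics with geometric spatio-temporal representations. As a queryable, unified world model, the 4D scene graph lets vision-language models directly interpret spatial scene structure and temporal dynamics, offering a new paradigm of structured 4D priors for reliable environment understanding in autonomous robotic systems.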


📄 Abstract

Autonomous robotic systems require spatio-temporal understanding of dynamic environments to ensure reliable navigation and interaction. While Vision-Language Models (VLMs) provide open-world semantic priors, they lack grounding in 3D geometry and temporal dynamics. Conversely, geometric perception captures structure and motion but remains semantically sparse. We propose SNOW (Scene Understanding with Open-World Knowledge), a training-free and backbone-agnostic framework for unified 4D scene understanding that integrates VLM-derived semantics with point cloud geometry and temporal consistency. SNOW processes synchronized RGB images and 3D point clouds, using HDBSCAN clustering to generate object-level proposals that guide SAM2-based segmentation. Each segmented region is encoded through our proposed Spatio-Temporal Tokenized Patch Encoding (STEP), producing multimodal tokens that capture localized semantic, geometric, and temporal attributes. These tokens are incrementally integrated into a 4D Scene Graph (4DSG), which serves as 4D prior for downstream reasoning. A lightweight SLAM backend anchors all STEP tokens spatially in the environment, providing the global reference alignment, and ensuring unambiguous spatial grounding across time. The resulting 4DSG forms a queryable, unified world model through which VLMs can directly interpret spatial scene structure and temporal dynamics. Experiments on a diverse set of benchmarks demonstrate that SNOW enables precise 4D scene understanding and spatially grounded inference, thereby setting new state-of-the-art performance in several settings, highlighting the importance of structured 4D priors for embodied reasoning and autonomous robotics.

[28] VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding Tasks

Beitong Zhou, Zhexiao Huang, Yuan Guo, Zhangxuan Gu, Tianyu Xia, Zichen Luo, Fei Tang, Dehan Kong, Yanyi Shang, Suling Ou, Zhenlin Guo, Changhua Meng, Shuheng Shen

🧩 TL;DR

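This paper presents VenusBench-GD, a cross-platform, bilingual GUI grounding benchmark that broadens the evaluation of element grounding through a hierarchical task taxonomy and finds that general-purpose multimodal models now match or even surpass specialized GUI models on basic tasks.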


📘 Detailed Summary

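Motivation: Existing GUI grounding benchmarks have significant limitations: they either provide insufficient data volume and narrow domain coverage, or focus excessively on a single platform and require highly specialized domain knowledge. This hinders comprehensive evaluation of GUI agent capabilities.

Method: The paper presents the VenusBench-GD benchmark, establishes a high-quality data construction pipeline, and proposes a hierarchical task taxonomy that divides grounding into basic and advanced categories covering six complementary evaluation subtasks.

Result: Experiments show that general-purpose multimodal models now match or even surpass specialized GUI models on basic grounding tasks. Advanced tasks still favor GUI-specialized models, although these models exhibit significant overfitting and poor robustness.

Conclusion: The study underscores the need for comprehensive, multi-tiered evaluation frameworks, exposes the limitations of current GUI grounding models, and provides an evaluation benchmark and guidance for the future development of GUI agents.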


📄 Abstract

GUI grounding is a critical component in building capable GUI agents. However, existing grounding benchmarks suffer from significant limitations: they either provide insufficient data volume and narrow domain coverage, or focus excessively on a single platform and require highly specialized domain knowledge. In this work, we present VenusBench-GD, a comprehensive, bilingual benchmark for GUI grounding that spans multiple platforms, enabling hierarchical evaluation for real-world applications. VenusBench-GD contributes as follows: (i) we introduce a large-scale, cross-platform benchmark with extensive coverage of applications, diverse UI elements, and rich annotated data, (ii) we establish a high-quality data construction pipeline for grounding tasks, achieving higher annotation accuracy than existing benchmarks, and (iii) we extend the scope of element grounding by proposing a hierarchical task taxonomy that divides grounding into basic and advanced categories, encompassing six distinct subtasks designed to evaluate models from complementary perspectives. Our experimental findings reveal critical insights: general-purpose multimodal models now match or even surpass specialized GUI models on basic grounding tasks. In contrast, advanced tasks still favor GUI-specialized models, though they exhibit significant overfitting and poor robustness. These results underscore the necessity of comprehensive, multi-tiered evaluation frameworks.

[29] Skeleton-Snippet Contrastive Learning with Multiscale Feature Fusion for Action Localization

Qiushuo Cheng, Jingjing Liu, Catherine Morgan, Alan Whone, Majid Mirmehdi

🧩 TL;DR

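This study proposes a self-supervised pretraining method for skeleton-based temporal action localization. Using a snippet discrimination pretext task and a U-shaped feature fusion module, it consistently improves the action localization performance of existing skeleton-based contrastive learning methods.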


📘 Detailed Summary

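Motivation: Self-supervised pretraining for skeleton-based temporal action localization remains challenging and underexplored. Unlike video-level action recognition, detecting action boundaries requires temporally sensitive features that capture subtle differences between adjacent frames, so that the boundary regions where labels change can be identified accurately.

Method: The paper formulates a snippet discrimination pretext task for self-supervised pretraining, which densely projects skeleton sequences into non-overlapping segments and uses contrastive learning to promote features that distinguish them across videos. It also builds on strong backbones from skeleton-based action recognition models, fusing intermediate features with a U-shaped module to enhance feature resolution for frame-level localization.

Result: The method consistently improves existing skeleton-based contrastive learning methods for action localization on BABEL across diverse subsets and evaluation protocols, and achieves state-of-the-art transfer learning performance on PKUMMD with pretraining on NTU RGB+D and BABEL.

Conclusion: The study indicates that the snippet discrimination pretext task learns temporally sensitive features effectively and that combining it with the U-shaped feature fusion module improves skeleton-based temporal action localization, pointing to a new direction for self-supervised learning in skeleton-based action analysis.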


📄 Abstract

The self-supervised pretraining paradigm has achieved great success in learning 3D action representations for skeleton-based action recognition using contrastive learning. However, learning effective representations for skeleton-based temporal action localization remains challenging and underexplored. Unlike video-level action recognition, detecting action boundaries requires temporally sensitive features that capture subtle differences between adjacent frames where labels change. To this end, we formulate a snippet discrimination pretext task for self-supervised pretraining, which densely projects skeleton sequences into non-overlapping segments and promotes features that distinguish them across videos via contrastive learning. Additionally, we build on strong backbones of skeleton-based action recognition models by fusing intermediate features with a U-shaped module to enhance feature resolution for frame-level localization. Our approach consistently improves existing skeleton-based contrastive learning methods for action localization on BABEL across diverse subsets and evaluation protocols. We also achieve state-of-the-art transfer learning performance on PKUMMD with pretraining on NTU RGB+D and BABEL.
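To make the snippet discrimination idea concrete, the sketch below splits each sequence into non-overlapping, fixed-length snippets and gives every snippet its own identity, which a contrastive objective could then treat as a separate instance. The fixed snippet length, the dropping of leftover frames, and all function names are assumptions for illustration; the abstract does not describe how segments are formed.

```python
def split_into_snippets(num_frames, snippet_len):
    """Return (start, end) frame ranges of non-overlapping snippets.

    Assumption: trailing frames that do not fill a whole snippet are
    dropped; the end index is exclusive.
    """
    last_start = num_frames - snippet_len
    return [(s, s + snippet_len) for s in range(0, last_start + 1, snippet_len)]

def snippet_instances(video_lengths, snippet_len):
    """List (video_id, snippet_id, start, end) for every snippet.

    Each tuple identifies one instance to be discriminated from all
    other snippets, including those from other videos.
    """
    instances = []
    for video_id, num_frames in enumerate(video_lengths):
        ranges = split_into_snippets(num_frames, snippet_len)
        for snippet_id, (start, end) in enumerate(ranges):
            instances.append((video_id, snippet_id, start, end))
    return instances

# Two toy sequences of 8 and 5 frames, snippets of 4 frames each.
instances = snippet_instances([8, 5], snippet_len=4)
```

Because neighboring snippets of the same video become distinct instances, features that only encode video-level content cannot separate them, which is the pressure toward temporal sensitivity that the abstract describes.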

[30] N3D-VLM: Native 3D Grounding Enables Accurate Spatial Reasoning in Vision-Language Models

Yuxin Wang, Lei Ke, Boqiang Zhang, Tianyuan Qu, Hanxun Yu, Zhenpeng Huang, Meng Yu, Dan Xu, Dong Yu

🧩 TL;DR

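This paper proposes N3D-VLM, a unified framework that integrates native 3D object perception with 3D-aware visual reasoning. It addresses the lack of intrinsic 3D perception in existing multimodal models, achieving state-of-the-art 3D grounding and surpassing existing methods in 3D spatial reasoning.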


📘 Detailed Summary

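Motivation: Current multimodal models can answer questions based on 2D images but lack intrinsic 3D object perception, which limits their ability to understand spatial relationships and depth cues in 3D scenes. This is a core limitation of existing methods for 3D spatial understanding.

Method: The paper proposes the unified N3D-VLM framework, which equips the model with native 3D object perception so that it can localize objects directly in 3D space from textual descriptions and then perform explicit 3D reasoning on top of that localization. To support robust training, the authors develop a scalable data construction pipeline that uses depth estimation to lift large-scale 2D annotations into 3D space, producing training data for 3D object grounding and spatial reasoning.

Result: Experiments show that the framework achieves state-of-the-art performance on 3D grounding tasks and consistently surpasses existing methods in 3D spatial reasoning. The data construction pipeline yields 3D object grounding data more than six times larger than the largest existing single-image 3D detection dataset, significantly increasing diversity and coverage.

Conclusion: The study demonstrates the effectiveness of integrating native 3D perception into vision-language models and offers a new route to more interpretable and structured 3D spatial understanding. The unified design supports joint training of 3D grounding and spatial reasoning, laying a foundation for future 3D-aware multimodal systems.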


📄 Abstract

While current multimodal models can answer questions based on 2D images, they lack intrinsic 3D object perception, limiting their ability to comprehend spatial relationships and depth cues in 3D scenes. In this work, we propose N3D-VLM, a novel unified framework that seamlessly integrates native 3D object perception with 3D-aware visual reasoning, enabling both precise 3D grounding and interpretable spatial understanding. Unlike conventional end-to-end models that directly predict answers from RGB/RGB-D inputs, our approach equips the model with native 3D object perception capabilities, enabling it to directly localize objects in 3D space based on textual descriptions. Building upon accurate 3D object localization, the model further performs explicit reasoning in 3D, achieving more interpretable and structured spatial understanding. To support robust training for these capabilities, we develop a scalable data construction pipeline that leverages depth estimation to lift large-scale 2D annotations into 3D space, significantly increasing the diversity and coverage of 3D object grounding data and yielding a dataset over six times larger than the largest existing single-image 3D detection dataset. Moreover, the pipeline generates spatial question-answering datasets that target chain-of-thought (CoT) reasoning in 3D, facilitating joint training for both 3D object localization and 3D spatial reasoning. Experimental results demonstrate that our unified framework not only achieves state-of-the-art performance on 3D grounding tasks, but also consistently surpasses existing methods in 3D spatial reasoning in vision-language models.
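Lifting a 2D annotation into 3D with estimated depth can be illustrated with the standard pinhole back-projection. This is a sketch under stated assumptions: a pinhole camera with known intrinsics (fx, fy, cx, cy), metric depth, and a box lifted only through its center pixel. The paper's actual lifting procedure is not described in the abstract, and the function names and sample values are hypothetical.

```python
def backproject_pixel(u, v, depth, fx, fy, cx, cy):
    """Map pixel (u, v) with depth to camera-frame (x, y, z).

    Standard pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def lift_box_center(box_2d, depth_map, intrinsics):
    """Lift the center of a 2D box (x_min, y_min, x_max, y_max) to 3D.

    Assumption: depth_map is indexed as depth_map[row][col] and the
    depth at the box center is representative of the object.
    """
    x_min, y_min, x_max, y_max = box_2d
    u = (x_min + x_max) // 2
    v = (y_min + y_max) // 2
    fx, fy, cx, cy = intrinsics
    return backproject_pixel(u, v, depth_map[v][u], fx, fy, cx, cy)

# Toy 4x4 depth map with constant depth of 2.0 (hypothetical values).
depth_map = [[2.0] * 4 for _ in range(4)]
center_3d = lift_box_center((0, 0, 2, 2), depth_map, (1.0, 1.0, 1.0, 1.0))
```

A full pipeline would also need the object's 3D extent and orientation, which a single center point does not provide.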

[31] Sketch-in-Latents: Eliciting Unified Reasoning in MLLMs

Jintao Tong, Jiaqi Gu, Yujing Lou, Lubin Fan, Yixiong Zou, Yue Wu, Jieping Ye, Ruixuan Li

🧩 TL;DR

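This paper proposes SkiLa (Sketch-in-Latents), a novel paradigm for unified multimodal reasoning that extends the autoregressive capabilities of MLLMs so that they natively generate continuous visual embeddings as visual thoughts, addressing the weakness of MLLMs in scenarios that require visual imagination.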


📘 Detailed Summary

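Motivation: Current multimodal large language models excel at visual understanding tasks through text reasoning but often fall short in scenarios that require visual imagination. Humans form flexible visual-text imagination and interactions while thinking, whereas existing methods either rely on predefined external toolkits or generate images during thinking. The work is inspired by the human ability to construct the visual-text thinking process in a unified space; since current MLLMs already encode visual and text information in the same feature space, the authors argue that visual tokens can be seamlessly inserted into the reasoning process carried by text tokens.

Method: The paper proposes the Sketch-in-Latents (SkiLa) paradigm, which extends the autoregressive capabilities of MLLMs to natively generate continuous visual embeddings, termed latent sketch tokens, as visual thoughts. During multi-step reasoning, the model dynamically alternates between a textual thinking mode that generates textual think tokens and a visual sketching mode that generates latent sketch tokens. A latent visual semantics reconstruction mechanism ensures that these latent sketch tokens are semantically grounded, yielding a unified visual-text reasoning process.

Result: Extensive experiments show that SkiLa achieves superior performance on vision-centric tasks while generalizing strongly to diverse general multimodal benchmarks. The method improves the visual imagination ability of MLLMs and, while retaining the strengths of text reasoning, improves performance on tasks that require visual thinking, supporting the effectiveness of a unified visual-text reasoning space.

Conclusion: The study demonstrates that interleaved visual-text thinking in a unified feature space is feasible, offering a new paradigm for multimodal reasoning. SkiLa's results indicate that extending the autoregressive capabilities of MLLMs to natively generate visual embeddings can compensate for current models' shortcomings in visual imagination, pointing toward more natural and flexible multimodal systems.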


📄 Abstract

While Multimodal Large Language Models (MLLMs) excel at visual understanding tasks through text reasoning, they often fall short in scenarios requiring visual imagination. Unlike current works that rely on predefined external toolkits or generate images during thinking, humans can form flexible visual-text imagination and interactions during thinking without predefined toolkits; one important reason is that humans construct the visual-text thinking process in a unified space inside the brain. Inspired by this capability, and given that current MLLMs already encode visual and text information in the same feature space, we hold that visual tokens can be seamlessly inserted into the reasoning process carried by text tokens, where ideally, all visual imagination processes can be encoded by the latent features. To achieve this goal, we propose Sketch-in-Latents (SkiLa), a novel paradigm for unified multi-modal reasoning that expands the auto-regressive capabilities of MLLMs to natively generate continuous visual embeddings, termed latent sketch tokens, as visual thoughts. During multi-step reasoning, the model dynamically alternates between textual thinking mode for generating textual think tokens and visual sketching mode for generating latent sketch tokens. A latent visual semantics reconstruction mechanism is proposed to ensure these latent sketch tokens are semantically grounded. Extensive experiments demonstrate that SkiLa achieves superior performance on vision-centric tasks while exhibiting strong generalization to diverse general multi-modal benchmarks. Code will be released at https://github.com/TungChintao/SkiLa.

[32] DeContext as Defense: Safe Image Editing in Diffusion Transformers

Linghui Shen, Mingyue Cui, Xingyi Yang

🧩 TL;DR

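This paper proposes DeContext, a method that injects targeted perturbations into multimodal attention layers to block the propagation of contextual information, protecting input images from unauthorized in-context editing; it blocks malicious image manipulation while preserving visual quality.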


📘 Detailed Summary

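Motivation: In-context diffusion models make image editing easy, but they also raise serious privacy and security concerns: personal images can be used without authorization for identity impersonation, misinformation, and other malicious purposes. Prior input-perturbation defenses target personalized text-to-image generation, and the robustness of modern, large-scale DiT-based in-context models remains largely unexamined.

Method: DeContext builds on the insight that contextual information propagates from the source image to the output primarily through multimodal attention layers. It injects small, targeted perturbations that weaken these cross-attention pathways, breaking the information flow and decoupling the input from the output. The method focuses on early denoising steps and specific transformer blocks, concentrating perturbations where context propagation matters most.

Result: Experiments on Flux Kontext and Step1X-Edit show that DeContext consistently blocks unwanted image edits while preserving visual quality, supporting attention-based perturbations as an effective defense against image manipulation.

Conclusion: The study reveals the key role of attention in propagating contextual information and proposes an efficient and robust defense framework, offering a new technical route to protecting personal image privacy and insight into the mechanism of in-context editing in diffusion models.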


📄 Abstract

In-context diffusion models allow users to modify images with remarkable ease and realism. However, the same power raises serious privacy concerns: personal images can be easily manipulated for identity impersonation, misinformation, or other malicious uses, all without the owner's consent. While prior work has explored input perturbations to protect against misuse in personalized text-to-image generation, the robustness of modern, large-scale in-context DiT-based models remains largely unexamined. In this paper, we propose DeContext, a new method to safeguard input images from unauthorized in-context editing. Our key insight is that contextual information from the source image propagates to the output primarily through multimodal attention layers. By injecting small, targeted perturbations that weaken these cross-attention pathways, DeContext breaks this flow, effectively decoupling the input from the output. This simple defense is both efficient and robust. We further show that early denoising steps and specific transformer blocks dominate context propagation, which allows us to concentrate perturbations where they matter most. Experiments on Flux Kontext and Step1X-Edit show that DeContext consistently blocks unwanted image edits while preserving visual quality. These results highlight the effectiveness of attention-based perturbations as a powerful defense against image manipulation.

[33] Task-Oriented Data Synthesis and Control-Rectify Sampling for Remote Sensing Semantic Segmentation

Yunkai Yang, Yudong Zhang, Kunquan Zhang, Jinxiao Zhang, Xinying Chen, Haohuan Fu, Runmin Dong

🧩 TL;DR

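This paper proposes TODSynth, a task-oriented data synthesis framework for remote sensing semantic segmentation, comprising a Multimodal Diffusion Transformer (MM-DiT) with unified triple attention and a plug-and-play sampling strategy guided by task feedback, with the aim of generating more stable, downstream-task-oriented synthetic data.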


📘 Detailed Summary

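Motivation: Controllable generation offers the remote sensing field a new way to expand labeled datasets, but the complexity of semantic mask control and the uncertainty of sampling quality limit the practical utility of synthetic data for downstream semantic segmentation, especially in few-shot and complex-scene settings.

Method: The method has two core components. A DiT-based Multimodal Diffusion Transformer (MM-DiT) uses a text-image-mask joint attention mechanism with full fine-tuning of the image and mask branches. A control-rectify flow matching (CRFM) method dynamically adjusts sampling directions using a semantic loss during the early high-plasticity stage, reducing the instability of generated images and bridging the gap between synthetic data and downstream tasks.

Result: Experiments show that the method consistently outperforms state-of-the-art controllable generation methods for remote sensing semantic segmentation data synthesis, particularly in few-shot and complex-scene settings, and that the synthetic data it produces is more stable and better matched to downstream task needs.

Conclusion: The study demonstrates the effectiveness of a task-oriented data synthesis framework for remote sensing semantic segmentation: combining multimodal joint attention with a task-feedback-guided sampling strategy yields high-quality, task-oriented synthetic data and offers a promising way to ease the manual annotation burden.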


📄 Abstract

With the rapid progress of controllable generation, training data synthesis has become a promising way to expand labeled datasets and alleviate manual annotation in remote sensing (RS). However, the complexity of semantic mask control and the uncertainty of sampling quality often limit the utility of synthetic data in downstream semantic segmentation tasks. To address these challenges, we propose a task-oriented data synthesis framework (TODSynth), including a Multimodal Diffusion Transformer (MM-DiT) with unified triple attention and a plug-and-play sampling strategy guided by task feedback. Built upon the powerful DiT-based generative foundation model, we systematically evaluate different control schemes, showing that a text-image-mask joint attention scheme combined with full fine-tuning of the image and mask branches significantly enhances the effectiveness of RS semantic segmentation data synthesis, particularly in few-shot and complex-scene scenarios. Furthermore, we propose a control-rectify flow matching (CRFM) method, which dynamically adjusts sampling directions guided by semantic loss during the early high-plasticity stage, mitigating the instability of generated images and bridging the gap between synthetic data and downstream segmentation tasks. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art controllable generation methods, producing more stable and task-oriented synthetic data for RS semantic segmentation.
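The abstract does not spell out how CRFM adjusts the sampling direction, so the following is only a generic sketch of loss-guided Euler sampling for a flow model, restricted to the first few steps. The function name, the `velocity` and `loss_grad` callables, and the `scale` and `rectify_steps` parameters are all hypothetical and are not taken from the paper.

```python
def sample_with_rectification(x0, velocity, loss_grad,
                              steps=10, rectify_steps=3, scale=0.5):
    """Euler-integrate dx/dt = velocity(x, t) from t = 0 to t = 1.

    During the first `rectify_steps` steps (standing in for the "early
    high-plasticity stage"), the velocity is nudged against the gradient
    of a task loss. All names and defaults here are illustrative.
    """
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity(x, t)
        if i < rectify_steps:
            g = loss_grad(x, t)
            # Shift the sampling direction toward lower task loss.
            v = [vi - scale * gi for vi, gi in zip(v, g)]
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In this sketch the correction is applied only early in sampling, mirroring the abstract's statement that guidance acts during the early stage; later steps follow the unmodified velocity field.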

[34] Kling-Omni Technical Report

Kling Team, Jialu Chen, Yuanzheng Ci, Xiangyu Du, Zipeng Feng, Kun Gai, Sainan Guo, Feng Han, Jingbin He, Kang He, Xiao Hu, Xiaohua Hu, Boyuan Jiang, Fangyuan Kong, Hang Li, Jie Li, Qingyu Li, Shen Li, Xiaohan Li, Yan Li, Jiajun Liang, Borui Liao, Yiqiao Liao, Weihong Lin, Quande Liu, Xiaokun Liu, Yilun Liu, Yuliang Liu, Shun Lu, Hangyu Mao, Yunyao Mao, Haodong Ouyang, Wenyu Qin, Wanqi Shi, Xiaoyu Shi, Lianghao Su, Haozhi Sun, Peiqin Sun, Pengfei Wan, Chao Wang, Chenyu Wang, Meng Wang, Qiulin Wang, Runqi Wang, Xintao Wang, Xuebo Wang, Zekun Wang, Min Wei, Tiancheng Wen, Guohao Wu, Xiaoshi Wu, Zhenhua Wu, Da Xie, Yingtong Xiong, Yulong Xu, Sile Yang, Zikang Yang, Weicai Ye, Ziyang Yuan, Shenglong Zhang, Shuaiyu Zhang, Yuanxing Zhang, Yufan Zhang, Wenzheng Zhao, Ruiliang Zhou, Yan Zhou, Guosheng Zhu, Yongjie Zhu

🧩 TL;DR

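This paper presents Kling-Omni, a generalist generative framework that synthesizes high-fidelity videos directly from multimodal visual-language inputs, integrating video generation, editing, and intelligent reasoning tasks into a unified end-to-end system.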


📘 Detailed Summary

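Motivation: Video generation, editing, and intelligent reasoning tasks are usually handled by separate pipelines, with no unified end-to-end framework that accepts diverse user inputs and produces high-quality video, which limits the overall capability and efficiency of multimodal video creation.

Method: Kling-Omni adopts an end-to-end architecture that supports diverse user inputs, including text instructions, reference images, and video contexts, and processes them into a unified multimodal representation. The framework is built on a comprehensive data system and is strengthened by efficient large-scale pre-training strategies and inference infrastructure optimizations.

Result: Comprehensive evaluations show that Kling-Omni demonstrates exceptional capabilities in in-context generation, reasoning-based editing, and multimodal instruction following, producing cinematic-quality, intelligent video content and moving beyond the limitations of disjointed pipeline approaches.

Conclusion: Kling-Omni is not only a content creation tool but also, in the authors' view, a pivotal step toward multimodal world simulators that can perceive, reason, generate, and interact with a dynamic and complex world, laying a foundation for unified multimodal video understanding and generation systems.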


📄 Abstract

We present Kling-Omni, a generalist generative framework designed to synthesize high-fidelity videos directly from multimodal visual language inputs. Adopting an end-to-end perspective, Kling-Omni bridges the functional separation among diverse video generation, editing, and intelligent reasoning tasks, integrating them into a holistic system. Unlike disjointed pipeline approaches, Kling-Omni supports a diverse range of user inputs, including text instructions, reference images, and video contexts, processing them into a unified multimodal representation to deliver cinematic-quality and highly intelligent video content creation. To support these capabilities, we constructed a comprehensive data system that serves as the foundation for multimodal video creation. The framework is further empowered by efficient large-scale pre-training strategies and infrastructure optimizations for inference. Comprehensive evaluations reveal that Kling-Omni demonstrates exceptional capabilities in in-context generation, reasoning-based editing, and multimodal instruction following. Moving beyond a content creation tool, we believe Kling-Omni is a pivotal advancement toward multimodal world simulators capable of perceiving, reasoning, generating, and interacting with dynamic and complex worlds.

[35] R3ST: A Synthetic 3D Dataset With Realistic Trajectories

Simone Teglia, Claudia Melis Tonti, Francesco Pro, Leonardo Russo, Andrea Alfarano, Leonardo Pentassuglia, Irene Amerini

🧩 TL;DR

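This paper introduces the R3ST dataset, which integrates real-world trajectories into a synthetic 3D environment to address the lack of realistic vehicle motion in synthetic data, giving trajectory forecasting research a dataset with both precise multimodal annotations and authentic human driving behavior.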


📘 Detailed Summary

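Motivation: Existing real datasets capture authentic road-object behavior but typically lack precise ground-truth annotations. Synthetic datasets can produce large numbers of annotated frames at low cost, but their vehicle motion is generally unrealistic because trajectories are generated by AI models or rule-based systems, which holds back trajectory forecasting research.

Method: The paper introduces R3ST, which overcomes this limitation by generating a synthetic 3D environment and integrating real-world trajectories derived from SinD, a bird's-eye-view dataset recorded from drone footage. Combining SinD's real trajectories with the synthetic environment pairs authentic human-driven vehicle trajectories with precise multimodal ground-truth annotations.

Result: R3ST closes the gap between synthetic data and realistic trajectories, providing a synthetic dataset that offers both accurate multimodal ground-truth annotations and authentic human-driven vehicle trajectories. It serves as a training and evaluation resource for road-vehicle trajectory forecasting and addresses the trade-off between realism and annotation precision in existing datasets.

Conclusion: By integrating real trajectories into a synthetic 3D environment, R3ST addresses the lack of realistic vehicle motion in synthetic data and advances trajectory forecasting research. It provides higher-quality data for training and evaluating computer vision models, particularly for road safety applications, and illustrates the potential of combining synthetic and real-world data.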


📄 Abstract

Datasets are essential to train and evaluate computer vision models used for traffic analysis and to enhance road safety. Existing real datasets fit real-world scenarios and capture authentic road object behaviors; however, they typically lack precise ground-truth annotations. In contrast, synthetic datasets play a crucial role, allowing for the annotation of a large number of frames without additional costs or extra time. However, a general drawback of synthetic datasets is the lack of realistic vehicle motion, since trajectories are generated using AI models or rule-based systems. In this work, we introduce R3ST (Realistic 3D Synthetic Trajectories), a synthetic dataset that overcomes this limitation by generating a synthetic 3D environment and integrating real-world trajectories derived from SinD, a bird's-eye-view dataset recorded from drone footage. The proposed dataset closes the gap between synthetic data and realistic trajectories, advancing research in trajectory forecasting of road vehicles and offering both accurate multimodal ground-truth annotations and authentic human-driven vehicle trajectories.

cs.CL [Back]

[36] MRG-R1: Reinforcement Learning for Clinically Aligned Medical Report Generation

Pengyu Wang, Shuchang Ye, Usman Naseem, Jinman Kim

🧩 TL;DR

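This paper proposes a semantic-driven reinforcement learning method (MRG-R1) for medical report generation that optimizes a report-level reward based on clinical correctness rather than conventional token-level supervision, improving the clinical accuracy of generated reports. The method achieves state-of-the-art performance on the IU X-Ray and MIMIC-CXR datasets.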


📘 Detailed Summary

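Motivation: Existing medical report generation methods are usually trained on token-level objectives that favor imitating radiologists' linguistic style over ensuring clinical correctness, so the generated text is fluent but not medically accurate enough. The study addresses this problem by moving beyond language imitation toward learning that is directly guided by clinical correctness.

Method: The paper proposes a semantic-driven reinforcement learning method that adopts Group Relative Policy Optimization (GRPO) to encourage clinical-correctness-guided learning. It optimizes a report-level reward, a margin-based cosine similarity computed between the key radiological findings extracted from the generated and reference reports. A lightweight reasoning format constraint further guides the model to produce structured "thinking report" outputs, and the method is applied to a large vision-language model.

Result: MRG-R1 reaches a clinical efficacy F1 score of 51.88 on IU X-Ray and 40.39 on MIMIC-CXR, both state of the art. The experiments show that label-semantic reinforcement outperforms conventional token-level supervision, supporting the value of optimizing a clinically grounded, report-level reward for clinical correctness.

Conclusion: The study shows that optimizing a report-level reward based on clinical correctness improves the clinical accuracy of medical report generation more effectively than optimizing token overlap, and it points to semantic reinforcement learning as a direction for supervising medical large vision-language models. The authors describe the work as an early exploration of semantic reinforcement for supervising clinical correctness in Med-LVLM training.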


📄 Abstract

Medical report generation (MRG) aims to automatically derive radiology-style reports from medical images to aid in clinical decision-making. However, existing methods often generate text that mimics the linguistic style of radiologists but fails to guarantee clinical correctness, because they are trained on token-level objectives which focus on word choice and sentence structure rather than actual medical accuracy. We propose a semantic-driven reinforcement learning (SRL) method for medical report generation, applied to a large vision-language model (LVLM). SRL adopts Group Relative Policy Optimization (GRPO) to encourage clinical-correctness-guided learning beyond imitation of language style. Specifically, we optimise a report-level reward: a margin-based cosine similarity (MCCS) computed between key radiological findings extracted from generated and reference reports, thereby directly aligning clinical-label agreement and improving semantic correctness. A lightweight reasoning format constraint further guides the model to generate structured "thinking report" outputs. We evaluate Medical Report Generation with Semantic-driven Reinforcement Learning (MRG-R1) on two datasets, IU X-Ray and MIMIC-CXR, using clinical efficacy (CE) metrics. MRG-R1 achieves state-of-the-art performance with CE-F1 51.88 on IU X-Ray and 40.39 on MIMIC-CXR. We found that the label-semantic reinforcement is better than conventional token-level supervision. These results indicate that optimizing a clinically grounded, report-level reward, rather than token overlap, meaningfully improves clinical correctness. This work is an early exploration of semantic reinforcement for supervising medical correctness in medical large vision-language model (Med-LVLM) training.
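As a rough illustration of the two ingredients named above, the sketch below computes a cosine similarity between finding-label vectors and then the group-relative advantages that GRPO uses, normalizing each reward by the mean and standard deviation of its sampling group. The label-vector encoding is an assumption, and the margin term of the paper's reward is deliberately left out because the abstract does not define it.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length label vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for this sketch: an empty finding set scores zero
    return dot / (norm_a * norm_b)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled reports (GRPO-style)."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical example: three sampled reports scored against one reference.
reference = [1, 0, 1, 0]  # assumed binary vector of extracted findings
samples = [[1, 0, 1, 0], [1, 0, 0, 0], [0, 1, 0, 1]]
rewards = [cosine_similarity(s, reference) for s in samples]
advantages = group_relative_advantages(rewards)
```

Reports that agree with more of the reference findings than their group average receive positive advantages, and the rest receive negative ones.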

[37] Exploration of Augmentation Strategies in Multi-modal Retrieval-Augmented Generation for the Biomedical Domain: A Case Study Evaluating Question Answering in Glycobiology

Primož Kocbek, Azra Frkatović-Hodžić, Dora Lalić, Vivian Hui, Gordan Lauc, Gregor Štiglic

🧩 TL;DR

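This study systematically evaluates two ways of handling visual information in multi-modal retrieval-augmented generation (MM-RAG) for biomedical question answering: converting figures and tables into text versus OCR-free visual retrieval. In glycobiology, a visually dense domain, the better choice depends on model capacity: mid-size models do better with text conversion, while frontier models can make effective use of OCR-free retrieval.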


📘 Detailed Summary

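Motivation: MM-RAG for biomedical question answering faces a key decision: when to convert figures and tables into text, and when to use OCR-free visual retrieval that returns page images and leaves interpretation to the generator. The study examines this trade-off in glycobiology, a visually dense domain where systematic comparisons of these approaches on complex biomedical content have been lacking.

Method: The authors built a benchmark of 120 multiple-choice questions drawn from 25 papers and stratified by retrieval difficulty (easy text, medium figures/tables, hard cross-evidence). They implemented four augmentation strategies: no augmentation, Text RAG, Multi-modal conversion, and late-interaction visual retrieval (ColPali), using Docling for parsing and Qdrant for indexing. They evaluated mid-size open-source models (e.g., Gemma-3-27B-IT) and frontier proprietary models (the GPT-4o family), with additional tests on the GPT-5 family and multiple visual retrievers (ColPali/ColQwen/ColFlor). Accuracy was reported with Agresti-Coull 95% confidence intervals over 5 runs per configuration.

Result: With Gemma-3-27B-IT, Text and Multi-modal augmentation outperformed OCR-free retrieval (average accuracy 0.722-0.740 vs. 0.510). With GPT-4o, Multi-modal reached 0.808, Text 0.782, and ColPali 0.745, with small within-model differences. In follow-on experiments with the GPT-5 family, the best results with ColPali and ColFlor both improved by about 2% to 0.828. Across the GPT-5 family, ColPali, ColQwen, and ColFlor were statistically indistinguishable, while GPT-5-nano trailed the larger variants by roughly 8-10%.

Conclusion: Pipeline choice is capacity-dependent: converting visuals to text lowers the reading burden and is more reliable for mid-size models, whereas OCR-free visual retrieval becomes competitive with frontier models. Among retrievers, ColFlor matches heavier options at a smaller footprint, making it an efficient default when a strong generator is available. The study offers empirical guidance for designing biomedical multi-modal RAG systems and underlines the value of matching the information representation strategy to model capability.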


📄 Abstract

Multi-modal retrieval-augmented generation (MM-RAG) promises grounded biomedical QA, but it is unclear when to (i) convert figures/tables into text versus (ii) use optical character recognition (OCR)-free visual retrieval that returns page images and leaves interpretation to the generator. We study this trade-off in glycobiology, a visually dense domain. We built a benchmark of 120 multiple-choice questions (MCQs) from 25 papers, stratified by retrieval difficulty (easy text, medium figures/tables, hard cross-evidence). We implemented four augmentations (None, Text RAG, Multi-modal conversion, and late-interaction visual retrieval with ColPali) using Docling parsing and Qdrant indexing. We evaluated mid-size open-source and frontier proprietary models (e.g., Gemma-3-27B-IT, GPT-4o family). Additional testing used the GPT-5 family and multiple visual retrievers (ColPali/ColQwen/ColFlor). Accuracy with Agresti-Coull 95% confidence intervals (CIs) was computed over 5 runs per configuration. With Gemma-3-27B-IT, Text and Multi-modal augmentation outperformed OCR-free retrieval (0.722-0.740 vs. 0.510 average accuracy). With GPT-4o, Multi-modal achieved 0.808, with Text 0.782 and ColPali 0.745 close behind; within-model differences were small. In follow-on experiments with the GPT-5 family, the best results with ColPali and ColFlor improved by ~2% to 0.828 in both cases. In general, across the GPT-5 family, ColPali, ColQwen, and ColFlor were statistically indistinguishable. GPT-5-nano trailed larger GPT-5 variants by roughly 8-10%. Pipeline choice is capacity-dependent: converting visuals to text lowers the reader burden and is more reliable for mid-size models, whereas OCR-free visual retrieval becomes competitive under frontier models. Among retrievers, ColFlor offers parity with heavier options at a smaller footprint, making it an efficient default when strong generators are available.
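For readers unfamiliar with the Agresti-Coull interval mentioned above, here is a minimal sketch of the standard formula for a binomial proportion. How the study pools answers across its 5 runs is not stated in the abstract, so the function simply takes a count of correct answers and a total; the function name and signature are illustrative.

```python
import math


def agresti_coull_interval(successes, trials, z=1.959964):
    """Agresti-Coull confidence interval for a binomial proportion.

    z = 1.959964 corresponds to a two-sided 95% interval.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    n_adj = trials + z * z                      # adjusted sample size
    p_adj = (successes + z * z / 2.0) / n_adj   # adjusted proportion
    half_width = z * math.sqrt(p_adj * (1.0 - p_adj) / n_adj)
    # Clip to [0, 1] because the unclipped interval can overshoot.
    return max(0.0, p_adj - half_width), min(1.0, p_adj + half_width)
```

For instance, `agresti_coull_interval(50, 100)` gives approximately `(0.404, 0.596)`, an interval symmetric around 0.5.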

cs.AI [Back]

[38] Scaling Spatial Reasoning in MLLMs through Programmatic Data Synthesis

Zhi Helu, Huang Jingjing, Xu Wang, Xu Yangbin, Zhang Wanyue, Jiang Baoyang, Deng Shirui, Zhu Liang, Li Fangfang, Zhao Tiejun, Lin Yankai, Yao Yuan

🧩 TL;DR

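This paper proposes SPRITE, a framework that reframes ground-truth generation for spatial reasoning as a code-generation task and uses simulators and large models to programmatically synthesize scalable, diverse, and high-quality spatial reasoning data, improving the spatial understanding of vision-language models.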


📘 Detailed Summary

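Motivation: Embodied intelligence is fundamentally constrained by limited spatial understanding and reasoning. Existing remedies are caught in a dilemma: template-based datasets are scalable but structurally rigid, while manual annotation is linguistically diverse but unscalable and computationally imprecise. A method that delivers scalability, diversity, and high quality together is needed.

Method: SPRITE's core innovation is to reframe ground-truth generation as a code-generation task. LLMs compile complex spatial questions into executable programs, which are then verified against high-precision scene meta-information extracted from simulators. This makes the ground truth both computationally precise and verifiable, while the generative power of large models supplies rich linguistic diversity.

Result: The pipeline produced a dataset spanning 3 simulators, more than 11,000 scenes, and more than 300,000 image/video instruction-tuning pairs. A vision-language model trained on this data achieves significant gains on multiple spatial benchmarks and outperforms other open-source datasets of equivalent size. A scalability analysis supports the importance of overcoming the low diversity of traditional template methods.

Conclusion: The study indicates that programmatically synthesizing high-quality spatial reasoning data is key to building robust, generalizable spatial intelligence. By combining simulator precision with the generative power of large models, SPRITE addresses the core limitations of existing approaches and offers a scalable data generation paradigm; the authors plan to release the framework code and the full dataset.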


📄 Abstract

Embodied intelligence, a grand challenge in artificial intelligence, is fundamentally constrained by the limited spatial understanding and reasoning capabilities of current models. Prevailing efforts to address this through enhancing Vision-Language Models (VLMs) are trapped in a dilemma: template-based datasets are scalable but structurally rigid, while manual annotation is linguistically diverse but unscalable and, critically, computationally imprecise. We introduce SPRITE, a novel framework that overcomes this dilemma by leveraging simulators and large models to programmatically synthesize scalable, diverse, and high-quality spatial reasoning data. The core innovation of SPRITE is to reframe ground-truth generation as a code-generation task. We utilize LLMs to compile complex spatial questions into executable programs, which are then verified against high-precision scene meta-information extracted from simulators. This ensures our ground truth is both computationally precise and verifiable, while the generative power of LLMs provides vast linguistic diversity. Leveraging this pipeline, we have curated a dataset encompassing 3 simulators, 11k+ scenes, and 300k+ image/video instruction-tuning pairs. We demonstrate that a VLM trained on our data achieves significant performance gains on multiple spatial benchmarks and outperforms other open-source datasets of equivalent size. Furthermore, a scalability analysis confirms our hypothesis that overcoming the low-diversity nature of traditional template methods is essential for building robust, generalizable spatial intelligence. We will make the SPRITE framework code and the full 300k+ dataset publicly available to facilitate future research in spatial intelligence.
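To illustrate what "ground truth as an executable program" could look like, the sketch below answers a spatial question ("which object is closer to the camera?") by running code over scene metadata rather than by asking an annotator. The metadata schema, object names, and function name are invented for this example and do not reflect SPRITE's actual format.

```python
import math

# Hypothetical scene metadata, as a simulator might export it:
# 3D positions in meters, in a shared world frame.
EXAMPLE_SCENE = {
    "camera": (0.0, 0.0, 0.0),
    "objects": {
        "chair": (1.0, 0.0, 2.0),
        "lamp": (4.0, 0.0, 3.0),
    },
}


def closer_to_camera(scene, name_a, name_b):
    """Return the name of the object nearer to the camera.

    This plays the role of a generated program: the answer is computed
    from exact coordinates, so it is precise and reproducible.
    """
    camera = scene["camera"]
    dist_a = math.dist(camera, scene["objects"][name_a])
    dist_b = math.dist(camera, scene["objects"][name_b])
    return name_a if dist_a <= dist_b else name_b


answer = closer_to_camera(EXAMPLE_SCENE, "chair", "lamp")
```

Because the answer comes from exact coordinates, the question can be rephrased in many ways without changing the computed ground truth, which is the property the abstract attributes to the approach.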

[39] AMUSE: Audio-Visual Benchmark and Alignment Framework for Agentic Multi-Speaker Understanding

Sanjoy Chowdhury, Karren D. Yang, Xudong Liu, Fartash Faghri, Pavan Kumar Anasosalu Vasu, Oncel Tuzel, Dinesh Manocha, Chun-Liang Li, Raviteja Vemulapalli

🧩 TL;DR

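This paper introduces the AMUSE benchmark and the RAFT framework for evaluating and improving the agentic reasoning of multimodal large language models on multi-speaker, dialogue-centric audio-video understanding. AMUSE exposes the weakness of current models at multi-speaker reasoning, while RAFT delivers substantial gains through reward optimization and selective parameter adaptation.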


📘 Detailed Summary

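Motivation: Current multimodal large language models such as GPT-4o and Qwen3-Omni show strong perception but struggle in multi-speaker, dialogue-centric settings that demand agentic reasoning: tracking who speaks, maintaining roles, and grounding events across time. These settings are central to multimodal audio-video understanding, where models must reason jointly over audio and visual streams in applications such as conversational video assistants and meeting analytics.

Method: The paper makes two main contributions. AMUSE is a benchmark designed around inherently agentic tasks that require models to decompose complex audio-visual interactions into planning, grounding, and reflection steps; it evaluates MLLMs in three modes (zero-shot, guided, and agentic) across six task families, including spatio-temporal speaker grounding and multimodal dialogue summarization. RAFT is a data-efficient agentic alignment framework that integrates reward optimization, using intrinsic multimodal self-evaluation as the reward, with selective parameter adaptation for data- and parameter-efficient updates.

Result: Across all evaluation modes, current models exhibit weak multi-speaker reasoning and inconsistent behavior under both non-agentic and agentic evaluation. With RAFT, the authors achieve up to 39.52% relative improvement in accuracy on the benchmark. AMUSE provides a comprehensive platform for evaluating the agentic reasoning of multimodal models.

Conclusion: Together, AMUSE and RAFT provide a practical platform for examining agentic reasoning in multimodal models and improving their capabilities. The study exposes the limitations of current MLLMs on multi-speaker audio-video understanding and shows that an agentic alignment framework can substantially improve performance. The work underlines the importance of stronger agentic reasoning in multimodal settings and provides a benchmark and methodology for future research.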


📄 Abstract

Recent multimodal large language models (MLLMs) such as GPT-4o and Qwen3-Omni show strong perception but struggle in multi-speaker, dialogue-centric settings that demand agentic reasoning: tracking who speaks, maintaining roles, and grounding events across time. These scenarios are central to multimodal audio-video understanding, where models must jointly reason over audio and visual streams in applications such as conversational video assistants and meeting analytics. We introduce AMUSE, a benchmark designed around tasks that are inherently agentic, requiring models to decompose complex audio-visual interactions into planning, grounding, and reflection steps. It evaluates MLLMs across three modes (zero-shot, guided, and agentic) and six task families, including spatio-temporal speaker grounding and multimodal dialogue summarization. Across all modes, current models exhibit weak multi-speaker reasoning and inconsistent behavior under both non-agentic and agentic evaluation. Motivated by the inherently agentic nature of these tasks and recent advances in LLM agents, we propose RAFT, a data-efficient agentic alignment framework that integrates reward optimization with intrinsic multimodal self-evaluation as reward and selective parameter adaptation for data- and parameter-efficient updates. Using RAFT, we achieve up to 39.52% relative improvement in accuracy on our benchmark. Together, AMUSE and RAFT provide a practical platform for examining agentic reasoning in multimodal models and improving their capabilities.

[40] Code-in-the-Loop Forensics: Agentic Tool Use for Image Forgery Detection

Fanrui Zhang, Qiang Zhang, Sizhuo Zhou, Jianwen Sun, Chuanhao Li, Jiaxin Ai, Yukang Feng, Yujie Zhang, Wenjie Li, Zizhen Li, Yifan Chang, Jiawei Liu, Kaipeng Zhang

🧩 TL;DR

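This paper proposes ForenAgent, a multi-round interactive image forgery detection framework in which a multimodal large language model autonomously generates, executes, and iteratively refines Python-based low-level tools, enabling more flexible and interpretable forgery analysis and charting a route toward general-purpose image forgery detection.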


📘 Detailed Summary

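Motivation: Existing image forgery detection methods either exploit low-level, semantics-agnostic artifacts or rely on multimodal large language models with high-level semantic knowledge. These two information streams are highly heterogeneous in both paradigm and reasoning, so existing methods struggle to unify them or to model their cross-level interactions effectively. The study aims to close this gap.

Method: The paper proposes the ForenAgent framework, trained with a two-stage pipeline that combines Cold Start and Reinforcement Fine-Tuning. It designs a dynamic reasoning loop comprising global perception, local focusing, iterative probing, and holistic adjudication, and constructs the FABench dataset, which contains 100k images and approximately 200k agent-interaction question-answer pairs.

Result: Experiments show that ForenAgent, when assisted by low-level tools, exhibits emergent tool-use competence and reflective reasoning on challenging image forgery detection tasks, supporting the effectiveness and potential of the framework.

Conclusion: The study charts a promising route toward general-purpose image forgery detection: agent-driven tool interaction combines low-level artifacts with high-level semantic knowledge and provides a more flexible and interpretable analysis framework.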


📄 Abstract

Existing image forgery detection (IFD) methods either exploit low-level, semantics-agnostic artifacts or rely on multimodal large language models (MLLMs) with high-level semantic knowledge. Although naturally complementary, these two information streams are highly heterogeneous in both paradigm and reasoning, making it difficult for existing methods to unify them or effectively model their cross-level interactions. To address this gap, we propose ForenAgent, a multi-round interactive IFD framework that enables MLLMs to autonomously generate, execute, and iteratively refine Python-based low-level tools around the detection objective, thereby achieving more flexible and interpretable forgery analysis. ForenAgent follows a two-stage training pipeline combining Cold Start and Reinforcement Fine-Tuning to enhance its tool interaction capability and reasoning adaptability progressively. Inspired by human reasoning, we design a dynamic reasoning loop comprising global perception, local focusing, iterative probing, and holistic adjudication, and instantiate it as both a data-sampling strategy and a task-aligned process reward. For systematic training and evaluation, we construct FABench, a heterogeneous, high-quality agent-forensics dataset comprising 100k images and approximately 200k agent-interaction question-answer pairs. Experiments show that ForenAgent exhibits emergent tool-use competence and reflective reasoning on challenging IFD tasks when assisted by low-level tools, charting a promising route toward general-purpose IFD. The code will be released after the review process is completed.

[41] Scaling Laws for Energy Efficiency of Local LLMs

Ander Alvarez, Alessandro Genuardi, Nilotpal Sinha, Antonio Tiene, Samuel Mugel, Román Orús

🧩 TL;DR

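This study systematically quantifies the computational scaling of local large language model and vision-language model inference on CPUs. It finds approximately linear scaling of compute cost with text length and a resolution knee effect in vision-language models, and shows that quantum-inspired compression substantially reduces resource consumption.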


📘 Detailed Summary

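Motivation: Although most consumer hardware relies on central processing units, the computational laws governing CPU-only inference for local language and vision-language workloads remain largely unexplored. The study aims to fill this gap and to quantify systematically the balance between accuracy and compute/energy constraints on edge devices.

Method: Using a unified methodology, the study benchmarks large language models and vision-language models on two representative CPU tiers, a MacBook Pro M2 and a Raspberry Pi 5. It continuously samples processor and memory usage and applies area-under-curve integration to characterize how computational load scales with input text length and image resolution, and it evaluates the effect of quantum-inspired compression.

Result: The study uncovers two empirical scaling laws: compute cost for language-model inference scales approximately linearly with token length, and vision-language models exhibit a preprocessing-driven "resolution knee", where compute remains constant above an internal resolution clamp and decreases sharply below it. Quantum-inspired compression reduces processor and memory usage by up to 71.9% and energy consumption by up to 62%, while preserving or improving semantic accuracy.

Conclusion: The study provides a systematic quantification of local multimodal CPU-only inference and identifies model compression and input-resolution preprocessing as effective, low-cost levers for sustainable edge inference. The findings can guide resource optimization for AI deployment on edge devices, especially under compute and energy constraints.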


📄 Abstract

Deploying local large language models and vision-language models on edge devices requires balancing accuracy with constrained computational and energy budgets. Although graphics processors dominate modern artificial-intelligence deployment, most consumer hardware--including laptops, desktops, industrial controllers, and embedded systems--relies on central processing units. Despite this, the computational laws governing central-processing-unit-only inference for local language and vision-language workloads remain largely unexplored. We systematically benchmark large language and vision-language models on two representative central-processing-unit tiers widely used for local inference: a MacBook Pro M2, reflecting mainstream laptop-class deployment, and a Raspberry Pi 5, representing constrained, low-power embedded settings. Using a unified methodology based on continuous sampling of processor and memory usage together with area-under-curve integration, we characterize how computational load scales with input text length for language models and with image resolution for vision-language models. We uncover two empirical scaling laws: (1) computational cost for language-model inference scales approximately linearly with token length; and (2) vision-language models exhibit a preprocessing-driven "resolution knee", where compute remains constant above an internal resolution clamp and decreases sharply below it. Beyond these laws, we show that quantum-inspired compression reduces processor and memory usage by up to 71.9% and energy consumption by up to 62%, while preserving or improving semantic accuracy. These results provide a systematic quantification of multimodal central-processing-unit-only scaling for local language and vision-language workloads, and they identify model compression and input-resolution preprocessing as effective, low-cost levers for sustainable edge inference.
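The measurement approach described above, sampling usage over time and integrating the area under the curve, can be sketched as follows. The abstract does not say which integration rule the authors use; the trapezoidal rule here is an assumption, as are the function name and the units in the example.

```python
def area_under_curve(timestamps, values):
    """Integrate sampled usage over time with the trapezoidal rule.

    timestamps: sample times in seconds, in increasing order.
    values: usage readings (for example CPU percent) at those times.
    Returns the integral in value-seconds (for example percent-seconds).
    """
    if len(timestamps) != len(values):
        raise ValueError("timestamps and values must have the same length")
    total = 0.0
    for i in range(1, len(timestamps)):
        dt = timestamps[i] - timestamps[i - 1]
        total += 0.5 * (values[i] + values[i - 1]) * dt
    return total


# Hypothetical trace: CPU utilization sampled once per second during inference.
cpu_load = area_under_curve([0.0, 1.0, 2.0, 3.0], [10.0, 80.0, 80.0, 10.0])
```

Integrating the trace rather than averaging it means that a longer run at the same utilization counts as more total load, which is what a cost-versus-token-length comparison needs.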

[42] Comprehensive AI Literacy: The Case for Centering Human Agency

Sri Yash Tadimalla, Justin Cary, Gordon Hull, Jordan Register, Daniel Maxwell, David Pugalee, Tina Heafner

🧩 TL;DR

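This paper argues for a systemic shift toward comprehensive AI literacy education that centers human agency, and proposes AI Literacy, Fluency, and Competency frameworks to close the dangerous gap in current AI education between functional skills and critical, ethical reasoning.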


📘 Detailed Summary

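Motivation: AI technologies are being absorbed rapidly into many areas of society, and current educational frameworks are failing to respond effectively. The result is a dangerous literacy gap in which a focus on the functional, operational skills of using AI tools eclipses the development of critical and ethical reasoning, creating an urgent educational imperative.

Method: As a position paper, the work proposes a comprehensive approach to AI literacy centered on human agency, comprising the AI Literacy, Fluency, and Competency frameworks. It frames technology as a choice to be made rather than an inevitability to be adopted, and calls for a deep commitment to critical thinking and a robust understanding of epistemology.

Result: The paper sets out a systemic educational shift in which educators and students become agents in their own human-centric approaches to AI, providing pathways to articulate clearly the intentions behind decisions about AI and the impact of those decisions on academic work, career, and society, and connecting functional skills with critical and ethical reasoning.

Conclusion: True AI literacy involves teaching about agency itself, framing technology as a choice to be made rather than an inevitability to be adopted. This applies to all stakeholders in the educational ecosystem, from the student's agency to question and create to the teacher's agency to design learning experiences that align with instructional values.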


📄 Abstract

The rapid assimilation of Artificial Intelligence technologies into various facets of society has created a significant educational imperative that current frameworks are failing to effectively address. We are witnessing the rise of a dangerous literacy gap, where a focus on the functional, operational skills of using AI tools is eclipsing the development of critical and ethical reasoning about them. This position paper argues for a systemic shift toward comprehensive AI literacy that centers human agency - the empowered capacity for intentional, critical, and responsible choice. This principle applies to all stakeholders in the educational ecosystem: it is the student's agency to question, create with, or consciously decide not to use AI based on the task; it is the teacher's agency to design learning experiences that align with instructional values, rather than ceding pedagogical control to a tool. True literacy involves teaching about agency itself, framing technology not as an inevitability to be adopted, but as a choice to be made. This requires a deep commitment to critical thinking and a robust understanding of epistemology. Through the AI Literacy, Fluency, and Competency frameworks described in this paper, educators and students will become agents in their own human-centric approaches to AI, providing necessary pathways to clearly articulate the intentions informing decisions and attitudes toward AI and the impact of these decisions on academic work, career, and society.

[43] Do Multi-Agents Solve Better Than Single? Evaluating Agentic Frameworks for Diagram-Grounded Geometry Problem Solving and Reasoning

Mahbub E Sobhani, Md. Faiyaz Abdullah Sayeedi, Mohammad Nehad Alam, Proma Hossain Progga, Swakkhar Shatabda

🧩 TL;DR

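This study systematically compares single-agent and multi-agent pipelines on geometry problem solving. Multi-agent design offers clear benefits for open-source models but only modest improvements for a closed-source model on a newer dataset, indicating that agentic decomposition is not universally optimal.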


📘 Detailed Summary

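Motivation: Diagram-grounded geometry problem solving is a critical benchmark for multimodal large language models, yet the benefits of multi-agent design over single-agent remain unclear. The study systematically compares the two approaches on visual math benchmarks to determine the practical value of multi-agent architectures.

Method: The study evaluates single-agent and multi-agent pipelines on four visual math benchmarks: Geometry3K, MathVerse, OlympiadBench, and We-Math. It compares the open-source Qwen-2.5-VL (7B and 32B) with the closed-source Gemini-2.0-Flash and analyzes how each architecture performs across the datasets.

Result: For open-source models, the multi-agent pipeline consistently improves performance: Qwen-2.5-VL (7B) gains 6.8 points on Geometry3K and the 32B variant gains 3.3, with further gains on OlympiadBench and We-Math. In contrast, the closed-source Gemini-2.0-Flash generally performs better in single-agent mode on classic benchmarks, and multi-agent yields only modest improvements on the newer We-Math dataset.

Conclusion: Multi-agent pipelines provide clear benefits for open-source models and can assist strong proprietary systems on newer, less familiar benchmarks, but agentic decomposition is not a universally optimal strategy; its effectiveness depends on model type and dataset characteristics. The results offer empirical guidance for choosing architectures for multimodal large language models.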


📄 Abstract

Diagram-grounded geometry problem solving is a critical benchmark for multimodal large language models (MLLMs), yet the benefits of multi-agent design over single-agent remain unclear. We systematically compare single-agent and multi-agent pipelines on four visual math benchmarks: Geometry3K, MathVerse, OlympiadBench, and We-Math. For open-source models, multi-agent consistently improves performance. For example, Qwen-2.5-VL (7B) gains +6.8 points and Qwen-2.5-VL (32B) gains +3.3 on Geometry3K, and both Qwen-2.5-VL variants see further gains on OlympiadBench and We-Math. In contrast, the closed-source Gemini-2.0-Flash generally performs better in single-agent mode on classic benchmarks, while multi-agent yields only modest improvements on the newer We-Math dataset. These findings show that multi-agent pipelines provide clear benefits for open-source models and can assist strong proprietary systems on newer, less familiar benchmarks, but agentic decomposition is not universally optimal. All code, data, and reasoning files are available at https://github.com/faiyazabdullah/Interpreter-Solver

[44] CitySeeker: How Do VLMS Explore Embodied Urban Navigation With Implicit Human Needs?

Siqi Wang, Chao Liang, Yunfan Gao, Erxin Yu, Sen Li, Yushi Li, Jing Li, Haofen Wang

🧩 TL;DR

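This paper introduces CitySeeker, a benchmark for evaluating the spatial reasoning and decision-making abilities of vision-language models when they must address implicit needs in dynamic urban environments. It reveals key bottlenecks of existing models in long-horizon reasoning, spatial cognition, and experiential recall.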


📘 Detailed Summary

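Motivation: Vision-language models (VLMs) have made significant progress in explicit instruction-based navigation, but their ability to interpret implicit human needs (e.g., "I am thirsty") in dynamic urban environments remains underexplored, which limits their usefulness in real-world applications.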

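Method: The study introduces the CitySeeker benchmark, which contains 6,440 trajectories across 8 cities and covers 7 goal-driven scenarios. To analyze model bottlenecks, it investigates a set of exploratory strategies: Backtracking Mechanisms, Enriching Spatial Cognition, and Memory-Based Retrieval (BCR). These strategies are inspired by the iterative observation-reasoning cycles and adaptive path optimization found in human cognitive mapping.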

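Result: Experiments show that even top-performing models (e.g., Qwen2.5-VL-32B-Instruct) achieve only 21.1% task completion. The study identifies error accumulation in long-horizon reasoning, inadequate spatial cognition, and deficient experiential recall as the main bottlenecks, giving concrete directions for model improvement.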

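Conclusion: The study offers actionable insights for developing VLMs with robust spatial intelligence, particularly for tackling "last-mile" navigation challenges. It highlights the importance of iterative observation-reasoning cycles and adaptive path optimization for improving the spatial reasoning of these models.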


📄 Abstract

Vision-Language Models (VLMs) have made significant progress in explicit instruction-based navigation; however, their ability to interpret implicit human needs (e.g., "I am thirsty") in dynamic urban environments remains underexplored. This paper introduces CitySeeker, a novel benchmark designed to assess VLMs' spatial reasoning and decision-making capabilities for exploring embodied urban navigation to address implicit needs. CitySeeker includes 6,440 trajectories across 8 cities, capturing diverse visual characteristics and implicit needs in 7 goal-driven scenarios. Extensive experiments reveal that even top-performing models (e.g., Qwen2.5-VL-32B-Instruct) achieve only 21.1% task completion. We find key bottlenecks in error accumulation in long-horizon reasoning, inadequate spatial cognition, and deficient experiential recall. To further analyze them, we investigate a series of exploratory strategies, namely Backtracking Mechanisms, Enriching Spatial Cognition, and Memory-Based Retrieval (BCR), inspired by human cognitive mapping's emphasis on iterative observation-reasoning cycles and adaptive path optimization. Our analysis provides actionable insights for developing VLMs with robust spatial intelligence required for tackling "last-mile" navigation challenges.