Table of Contents

cs.CV

[1] Object Counting with GPT-4o and GPT-5: A Comparative Study

Richard Füzesséry, Kaziwa Saleh, Sándor Szénási, Zoltán Vámossy

🧩 TL;DR

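This study explores zero-shot object counting with multimodal large language models (GPT-4o and GPT-5), performing the counting task from text prompts alone without any supervision. On the FSC-147 dataset it achieves performance comparable to, and in some cases better than, existing zero-shot methods.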


📘 Detailed Summary

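Motivation: Zero-shot object counting aims to estimate the number of instances of novel object categories that the vision model never saw during training. Existing methods typically require large amounts of annotated data or visual exemplars for guidance, whereas large language models have strong reasoning and data understanding abilities. This study explores using multimodal LLMs for fully unsupervised zero-shot counting.

Method: The study leverages the visual capabilities of two multimodal large language models (GPT-4o and GPT-5) to count objects in a zero-shot manner using only text prompts, with no supervised training or visual exemplars, so that counting is guided entirely by text.

Result: Evaluation on the FSC-147 and CARPK datasets shows that both models achieve performance on FSC-147 comparable to state-of-the-art zero-shot methods and in some cases surpass them, supporting the effectiveness of multimodal LLMs for zero-shot counting.

Conclusion: The study shows that multimodal large language models can perform zero-shot object counting effectively, matching purpose-built methods with text prompts alone. This opens a new route for unsupervised visual counting and illustrates the potential of LLMs in visual understanding tasks.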


📄 Abstract

Zero-shot object counting attempts to estimate the number of object instances belonging to novel categories that the vision model performing the counting has never encountered during training. Existing methods typically require large amounts of annotated data and often require visual exemplars to guide the counting process. However, large language models (LLMs) are powerful tools with remarkable reasoning and data understanding abilities, which suggests the possibility of utilizing them for counting tasks without any supervision. In this work we aim to leverage the visual capabilities of two multi-modal LLMs, GPT-4o and GPT-5, to perform object counting in a zero-shot manner using only textual prompts. We evaluate both models on the FSC-147 and CARPK datasets and provide a comparative analysis. Our findings show that the models achieve performance comparable to the state-of-the-art zero-shot approaches on FSC-147 and, in some cases, even surpass them.
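
To make the evaluation setup above concrete, here is a minimal sketch of how text replies from a model could be turned into counts and scored. The abstract does not name its metrics; mean absolute error (MAE) and root mean squared error (RMSE) are assumed here because they are commonly reported for counting benchmarks. The `parse_count` helper and the sample replies are hypothetical and are not taken from the paper.

```python
import math
import re


def parse_count(reply: str) -> int:
    """Extract the first integer in a free-text reply (hypothetical helper)."""
    match = re.search(r"\d[\d,]*", reply)
    if match is None:
        raise ValueError(f"no count found in reply: {reply!r}")
    # Drop thousands separators such as "1,050" before converting.
    return int(match.group().replace(",", ""))


def mae(pred, true):
    """Mean absolute error between predicted and true counts."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


def rmse(pred, true):
    """Root mean squared error between predicted and true counts."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))


# Made-up replies and ground-truth counts, for illustration only.
replies = ["There are 12 apples.", "I count 1,050 birds", "7"]
ground_truth = [10, 1000, 7]

predictions = [parse_count(r) for r in replies]
print(predictions, mae(predictions, ground_truth), rmse(predictions, ground_truth))
```

RMSE penalizes the single large miss (1,050 vs. 1,000) more heavily than MAE does, which is why the two are usually reported together.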

[2] LLM-Guided Material Inference for 3D Point Clouds

Nafiseh Izadyar, Teseo Schneider

🧩 TL;DR

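This paper proposes a two-stage method based on large language models for inferring material composition directly from 3D point clouds with coarse segmentations. By decoupling object recognition from material reasoning, it achieves zero-shot material understanding.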


📘 Detailed Summary

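Motivation: Existing 3D shape datasets and models focus mainly on geometry and overlook the material properties that determine how objects appear. As a result, there is no reliable way to infer material composition from 3D data, which limits a complete physical understanding of 3D shapes.

Method: The method uses a two-stage LLM architecture. In the first stage, an LLM predicts the object's semantic category; in the second stage, an LLM assigns a plausible material to each geometric segment based on the inferred semantics. Both stages run zero-shot, with no task-specific training.

Result: Across 1,000 shapes from the Fusion/ABS and ShapeNet datasets, the method achieves high semantic and material plausibility. An LLM-as-a-Judge evaluation framework implemented in DeepEval supports its effectiveness and demonstrates the feasibility of zero-shot material inference.

Conclusion: The study shows that language models can serve as general-purpose priors that connect geometric reasoning with material understanding in 3D data, offering a new paradigm for zero-shot material inference and a new direction for modeling the full physical properties of 3D shapes.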


📄 Abstract

Most existing 3D shape datasets and models focus solely on geometry, overlooking the material properties that determine how objects appear. We introduce a two-stage large language model (LLM) based method for inferring material composition directly from 3D point clouds with coarse segmentations. Our key insight is to decouple reasoning about what an object is from what it is made of. In the first stage, an LLM predicts the object's semantics; in the second stage, it assigns plausible materials to each geometric segment, conditioned on the inferred semantics. Both stages operate in a zero-shot manner, without task-specific training. Because existing datasets lack reliable material annotations, we evaluate our method using an LLM-as-a-Judge implemented in DeepEval. Across 1,000 shapes from Fusion/ABS and ShapeNet, our method achieves high semantic and material plausibility. These results demonstrate that language models can serve as general-purpose priors for bridging geometric reasoning and material understanding in 3D data.
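
The two-stage decoupling described above can be sketched as follows. This is only an illustration of the control flow: the abstract does not say how segments are presented to the LLM, so short text descriptions of each segment are assumed, and the prompts, `infer_materials`, and the canned `toy_llm` stand-in are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def infer_materials(
    segment_descriptions: List[str],
    llm: Callable[[str], str],
) -> Tuple[str, Dict[str, str]]:
    """Two-stage inference: object semantics first, then per-segment materials."""
    # Stage 1: ask what the object is, given all of its segments.
    semantic_prompt = (
        "These are parts of one 3D object:\n"
        + "\n".join(f"- {d}" for d in segment_descriptions)
        + "\nName the object."
    )
    semantics = llm(semantic_prompt).strip()

    # Stage 2: ask for a material per segment, conditioned on the semantics.
    materials = {}
    for description in segment_descriptions:
        material_prompt = (
            f"The object is a {semantics}. "
            f"What material is the part '{description}' most plausibly made of?"
        )
        materials[description] = llm(material_prompt).strip()
    return semantics, materials


def toy_llm(prompt: str) -> str:
    """Canned stand-in for a real LLM call (assumption, for illustration)."""
    if prompt.startswith("These are parts"):
        return "chair"
    return "fabric" if "cushion" in prompt else "wood"


semantics, materials = infer_materials(["four thin legs", "flat seat cushion"], toy_llm)
print(semantics, materials)
```

Keeping the stages separate means the second prompt can state the object category as a fact, which is the conditioning the abstract describes.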

[3] Text-Printed Image: Bridging the Image-Text Modality Gap for Text-centric Training of Large Vision-Language Models

Shojiro Yamabe, Futa Waseda, Daiki Shiono, Tsubasa Takahashi

🧩 TL;DR

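This paper proposes the Text-Printed Image (TPI), which generates synthetic images by rendering text descriptions directly onto a blank canvas. It addresses the scarcity and high collection cost of image data in training large vision-language models and enables more effective text-centric training.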


📘 Detailed Summary

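Motivation: Current large vision-language models need large numbers of image-text pairs for task-specific fine-tuning on VQA tasks, but image collection is constrained by privacy restrictions and domain scarcity and is costly. Text, in contrast, is widely available and easy to edit and expand. However, training on raw text alone yields limited gains because of the modality gap.

Method: The paper proposes the Text-Printed Image, which generates a synthetic image by rendering a given text description directly onto a plain white canvas. This simple rendering projects text into the image modality and can be integrated into existing LVLM training pipelines at low cost. It also preserves the semantics of the text, avoiding the semantic distortion that text-to-image models often introduce.

Result: In systematic experiments across four models and seven benchmarks, TPI proves more effective for text-centric training than synthetic images generated by a diffusion model. The paper further explores the practical utility of TPI as a low-cost data augmentation strategy and shows performance benefits on a variety of VQA tasks.

Conclusion: The findings indicate that text-centric training has significant potential. TPI charts a path toward fully automated data generation for large vision-language models and offers a scalable, low-cost training paradigm that uses widely available text to improve model performance.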


📄 Abstract

Recent large vision-language models (LVLMs) have been applied to diverse VQA tasks. However, achieving practical performance typically requires task-specific fine-tuning with large numbers of image-text pairs, which are costly to collect. In this work, we study text-centric training, a setting where only textual descriptions are available and no real images are provided, as a paradigm for low-cost data scaling. Unlike images, whose collection is often restricted by privacy constraints and scarcity in niche domains, text is widely available. Moreover, text is easily editable, enabling automatic diversification and expansion with LLMs at minimal human effort. While this offers clear advantages over image collection in terms of scalability and cost, training on raw text without images still yields limited gains on VQA tasks because of the image-text modality gap. To address this issue, we propose a Text-Printed Image (TPI), which generates synthetic images by directly rendering the given textual description on a plain white canvas. This simple rendering projects text into the image modality and can be integrated into arbitrary existing LVLM training pipelines at low cost. Moreover, TPI preserves the semantics of the text, whereas text-to-image models often fail to do so. Across four models and seven benchmarks, our systematic experiments show that TPI enables more effective text-centric training than synthetic images generated by a diffusion model. We further explore TPI as a low-cost data-augmentation strategy and demonstrate its practical utility. Overall, our findings highlight the significant potential of text-centric training and, more broadly, chart a path toward fully automated data generation for LVLMs.

[4] SpatialReasoner: Active Perception for Large-Scale 3D Scene Understanding

Hongpei Zheng, Shijie Li, Yanran Li, Hujun Yin

🧩 TL;DR

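This study presents H²U3D, a house-scale 3D visual question answering dataset, and SpatialReasoner, an active perception framework. Using coarse-to-fine hierarchical representations and an adaptive exploration reward, it achieves state-of-the-art performance on spatial reasoning in large-scale 3D environments while substantially reducing the number of images required.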


📘 Detailed Summary

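Motivation: Current vision-language models have limited spatial reasoning ability in large-scale 3D environments and are mostly confined to room-scale scenes. Systematic evaluation benchmarks and effective exploration methods for complex, multi-floor, house-scale environments are lacking.

Method: The study introduces the H²U3D dataset, which uses an automated annotation pipeline to build coarse-to-fine hierarchical visual representations and generate diverse question-answer pairs. It also develops SpatialReasoner, an active perception framework that autonomously invokes spatial tools to explore 3D scenes based on textual queries. Training follows a two-stage strategy: a supervised cold start followed by reinforcement learning, with an adaptive exploration reward that promotes efficient exploration while reducing redundant operations.

Result: SpatialReasoner achieves state-of-the-art performance on H²U3D, clearly outperforming strong baselines such as GPT-4o and Gemini-2.5-Pro. It reaches these results with only 3-4 images on average, whereas the baselines require 16 or more, supporting the effectiveness of the coarse-to-fine active exploration paradigm.

Conclusion: By building a house-scale 3D visual question answering dataset and an active perception framework, the study provides a systematic solution for spatial reasoning in large-scale 3D environments. The adaptive exploration reward and two-stage training strategy markedly improve exploration efficiency and offer a new benchmark and methodological guidance for future 3D scene understanding research.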


📄 Abstract

Spatial reasoning in large-scale 3D environments remains challenging for current vision-language models, which are typically constrained to room-scale scenarios. We introduce H$^2$U3D (Holistic House Understanding in 3D), a 3D visual question answering dataset designed for house-scale scene understanding. H$^2$U3D features multi-floor environments spanning up to three floors and 10-20 rooms, covering more than 300 m$^2$. Through an automated annotation pipeline, it constructs hierarchical coarse-to-fine visual representations and generates diverse question-answer pairs with chain-of-thought annotations. We further propose SpatialReasoner, an active perception framework that autonomously invokes spatial tools to explore 3D scenes based on textual queries. SpatialReasoner is trained through a two-stage strategy: a supervised cold start followed by reinforcement learning with an adaptive exploration reward that promotes efficient exploration while discouraging redundant operations. Extensive experiments demonstrate that SpatialReasoner achieves state-of-the-art performance on H$^2$U3D, outperforming strong baselines including GPT-4o and Gemini-2.5-Pro. Notably, our method attains superior results while using only 3-4 images in total on average, compared to baselines requiring 16+ images, highlighting the effectiveness of our coarse-to-fine active exploration paradigm.
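
The abstract describes a reward that encourages efficient exploration and discourages redundant operations, but does not give its form. The sketch below is one plausible shape for such a reward, not the paper's formula: the budget, bonus, and penalty values, the definition of "redundant" as a repeated identical tool call, and the function name are all assumptions made for illustration.

```python
from typing import List


def exploration_reward(
    answer_correct: bool,
    tool_calls: List[str],
    image_budget: int = 4,            # hypothetical budget of views per question
    efficiency_bonus: float = 0.2,    # hypothetical bonus for staying within budget
    redundancy_penalty: float = 0.1,  # hypothetical cost per repeated call
) -> float:
    """Toy reward: correctness, minus repeated calls, plus a bonus for brevity."""
    reward = 1.0 if answer_correct else 0.0
    # A call counts as redundant when the same view was already requested.
    redundant_calls = len(tool_calls) - len(set(tool_calls))
    reward -= redundancy_penalty * redundant_calls
    # The bonus is only granted when the answer is right, so the policy
    # cannot collect it by refusing to explore.
    if answer_correct and len(tool_calls) <= image_budget:
        reward += efficiency_bonus
    return reward


efficient = exploration_reward(True, ["floor_2", "kitchen", "sink_closeup"])
wasteful = exploration_reward(True, ["kitchen", "kitchen", "kitchen", "sink", "sink"])
print(efficient, wasteful)
```

Both toy rollouts answer correctly, yet the one that revisits the same views scores lower, which is the behaviour the abstract attributes to the adaptive exploration reward.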

[5] CartoMapQA: A Fundamental Benchmark Dataset Evaluating Vision-Language Models on Cartographic Map Understanding

Huy Quang Ung, Guillaume Habault, Yasutaka Nishimura, Hao Niu, Roberto Legaspi, Tomoki Oya, Ryoichi Kojima, Masato Taya, Chihiro Ono, Atsunori Minamikawa, Yan Liu

🧩 TL;DR

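This paper presents CartoMapQA, a benchmark designed to evaluate how well vision-language models interpret cartographic maps. It reveals significant limitations of existing models in map semantics and geospatial reasoning.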


📘 Detailed Summary

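Motivation: Although vision-language models show promise in fusing visual and textual information, their ability to understand cartographic maps remains largely unexplored. There is no evaluation benchmark dedicated to map interpretation, which limits how much these models can be relied on in practical applications such as navigation, geographic search, and urban planning.

Method: The authors build the CartoMapQA benchmark dataset, which contains over 2,000 samples, each consisting of a cartographic map, a question (open-ended or multiple-choice), and a ground-truth answer. The benchmark covers map interpretation skills from low level to high level, including symbol recognition, embedded information extraction, scale interpretation, and route-based reasoning.

Result: Evaluation of open-source and proprietary vision-language models shows persistent difficulty with map-specific semantics, limited geospatial reasoning, and susceptibility to errors related to optical character recognition. These weaknesses are most pronounced in symbol recognition, scale interpretation, and complex route reasoning.

Conclusion: By systematically identifying the weaknesses of vision-language models in map understanding, CartoMapQA offers a useful tool for guiding future improvements in model architecture. The work supports the development of models better suited to real-world applications that depend on reliable map understanding, and the source code and dataset are publicly released.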


📄 Abstract

The rise of Large Visual-Language Models (LVLMs) has unlocked new possibilities for seamlessly integrating visual and textual information. However, their ability to interpret cartographic maps remains largely unexplored. In this paper, we introduce CartoMapQA, a benchmark specifically designed to evaluate LVLMs' understanding of cartographic maps through question-answering tasks. The dataset includes over 2000 samples, each composed of a cartographic map, a question (with open-ended or multiple-choice answers), and a ground-truth answer. These tasks span key low-, mid- and high-level map interpretation skills, including symbol recognition, embedded information extraction, scale interpretation, and route-based reasoning. Our evaluation of both open-source and proprietary LVLMs reveals persistent challenges: models frequently struggle with map-specific semantics, exhibit limited geospatial reasoning, and are prone to Optical Character Recognition (OCR)-related errors. By isolating these weaknesses, CartoMapQA offers a valuable tool for guiding future improvements in LVLM architectures. Ultimately, it supports the development of models better equipped for real-world applications that depend on robust and reliable map understanding, such as navigation, geographic search, and urban planning. Our source code and data are openly available to the research community at: https://github.com/ungquanghuy-kddi/CartoMapQA.git

[6] Step-by-step Layered Design Generation

Faizan Farooq Khan, K J Joseph, Koustava Goswami, Mohamed Elhoseiny, Balaji Vasan Srinivasan

🧩 TL;DR

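This paper proposes a new problem setting, step-by-step layered design generation, and develops the SLEDGE model to mimic how designers modify a design incrementally. Each update is modeled as an atomic, layered change, enabling instruction-driven design generation.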


📘 Detailed Summary

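Motivation: Existing approaches mostly treat design synthesis as a single-step generation problem, which seriously underestimates the inherent complexity of the creative process and fails to capture how designers progressively refine and enhance their work. A machine learning approach that can model the step-by-step design process is therefore needed.

Method: The paper proposes the new problem setting of step-by-step layered design generation and develops the SLEDGE model, which uses a multimodal large language model to represent each design update as an atomic, layered change over the previous state while keeping it semantically consistent with the design instruction.

Result: The study builds a complete evaluation suite comprising a dataset and a benchmark. Exhaustive experimental analysis shows that, compared with state-of-the-art approaches tailored to the new setting, the proposed method is effective and performs better on step-by-step design generation.

Conclusion: The work highlights the importance of step-by-step design generation and offers a practical solution for real-world use. The authors hope it will draw more research attention to this under-explored but practically valuable area and move design generation methods closer to the human creative process.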


📄 Abstract

Design generation, in its essence, is a step-by-step process where designers progressively refine and enhance their work through careful modifications. Despite this fundamental characteristic, existing approaches mainly treat design synthesis as a single-step generation problem, significantly underestimating the inherent complexity of the creative process. To bridge this gap, we propose a novel problem setting called Step-by-Step Layered Design Generation, which tasks a machine learning model with generating a design that adheres to a sequence of instructions from a designer. Leveraging recent advancements in multi-modal LLMs, we propose SLEDGE: Step-by-step LayEred Design GEnerator to model each update to a design as an atomic, layered change over its previous state, while being grounded in the instruction. To complement our new problem setting, we introduce a new evaluation suite, including a dataset and a benchmark. Our exhaustive experimental analysis and comparison with state-of-the-art approaches tailored to our new setup demonstrate the efficacy of our approach. We hope our work will attract attention to this pragmatic and under-explored research area.

[7] Thinking with Programming Vision: Towards a Unified View for Thinking with Images

Zirun Guo, Minjie Hong, Feng Zhang, Kai Jia, Tao Jin

🧩 TL;DR

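This paper proposes CodeVision, a scalable code-as-tool framework in which the model generates code as a universal interface for invoking arbitrary image operations, addressing the brittleness and limited scalability of tool-based reasoning in multimodal large language models.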


📘 Detailed Summary

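Motivation: Current multimodal large language models have notable limitations in tool-based reasoning. Even state-of-the-art models are surprisingly brittle on images with simple orientation changes or natural corruptions, showing significant performance drops. Existing approaches also rely on a limited tool set with little real-world necessity or scalability.

Method: The paper proposes the CodeVision framework, which uses code as a universal interface for invoking arbitrary image operations, moving beyond fixed tool registries. Training has two stages: supervised fine-tuning on a high-quality dataset built specifically for complex multi-turn tool composition and error recovery, followed by reinforcement learning with a novel dense process reward function that encourages strategic and efficient tool use.

Result: Experiments on the Qwen2.5-VL and Qwen3-VL series show that the approach significantly improves model performance and fosters emergent capabilities, including flexible tool composition, efficient chained execution, and robust error recovery from runtime feedback. The authors also construct new SFT and RL datasets and a challenging benchmark suite for rigorously evaluating robustness to orientation changes and multi-tool reasoning.

Conclusion: The study exposes a key brittleness in the tool-based reasoning of multimodal large language models and proposes a scalable remedy. Through its code-as-tool approach, CodeVision enables more flexible and robust multimodal reasoning and points to a direction for future tool-augmented multimodal systems, particularly for complex real-world visual tasks.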


📄 Abstract

Multimodal large language models (MLLMs) that think with images can interactively use tools to reason about visual inputs, but current approaches often rely on a narrow set of tools with limited real-world necessity and scalability. In this work, we first reveal a critical and previously overlooked weakness: even state-of-the-art MLLMs are surprisingly brittle, showing significant performance degradation on images with simple orientation changes or natural corruptions, underscoring the need for more robust tool-based reasoning. To address this, we propose CodeVision, a flexible and scalable code-as-tool framework where the model generates code as a universal interface to invoke any image operation, moving beyond fixed tool registries. We train our model using a two-stage methodology, beginning with Supervised Fine-Tuning (SFT) on a high-quality dataset curated for complex, multi-turn tool composition and error recovery, followed by Reinforcement Learning (RL) with a novel and dense process reward function to encourage strategic and efficient tool use. To facilitate this research, we construct new SFT and RL datasets and introduce a challenging new benchmark suite designed to rigorously evaluate robustness to orientation changes and multi-tool reasoning. Experiments on Qwen2.5-VL and Qwen3-VL series show that our approach significantly improves model performance and fosters emergent capabilities such as flexible tool composition, efficient chained execution, and robust error recovery from runtime feedback. Code is available at https://github.com/ByteDance-BandAI/CodeVision.
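
The code-as-tool idea above, together with error recovery from runtime feedback, can be illustrated with a small executor loop. This is a sketch under stated assumptions rather than the CodeVision implementation: the image is a nested list standing in for real pixel data, `run_generated_code` is a hypothetical helper, and the two code strings stand in for successive model outputs.

```python
def run_generated_code(code: str, image):
    """Run model-written code that should assign `result`; return (result, feedback)."""
    namespace = {"image": image}
    try:
        # WARNING: exec on untrusted code is unsafe; a real system would sandbox it.
        exec(code, namespace)
    except Exception as exc:
        # The error text becomes feedback the model can use on its next turn.
        return None, f"{type(exc).__name__}: {exc}"
    if "result" not in namespace:
        return None, "code did not assign `result`"
    return namespace["result"], "ok"


image = [[1, 2], [3, 4]]

# First attempt calls a helper that does not exist, so execution fails.
first_attempt = "result = rotate(image)"
# Second attempt rotates the image 90 degrees clockwise using plain Python.
second_attempt = "result = [list(row) for row in zip(*image[::-1])]"

first_result, first_feedback = run_generated_code(first_attempt, image)
second_result, second_feedback = run_generated_code(second_attempt, image)
print(first_feedback)
print(second_result, second_feedback)
```

Because any operation expressible in code can be invoked this way, no fixed registry of tools is needed; the trade-off is that execution has to be isolated and errors have to be surfaced back to the model.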

[8] FireSentry: A Multi-Modal Spatio-temporal Benchmark Dataset for Fine-Grained Wildfire Spread Forecasting

Nan Zhou, Huandong Wang, Jiahao Li, Han Li, Yali Song, Qiuhua Wang, Yong Li, Xinlei Chen

🧩 TL;DR

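This study presents the FireSentry dataset and the FiReDiff paradigm. FireSentry is the first provincial-scale multimodal wildfire dataset with sub-meter spatial and sub-second temporal resolution. FiReDiff is a novel dual-modality prediction approach that first generates infrared video sequences and then precisely segments fire masks, significantly improving fine-grained wildfire spread prediction.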


📘 Detailed Summary

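Motivation: Existing wildfire spread prediction research focuses mainly on coarse spatiotemporal scales and relies on low-resolution satellite data. It captures only macroscopic fire states and severely limits high-precision modeling of localized fire dynamics. Fine-grained prediction methods are needed to improve emergency response efficacy and decision-making precision.

Method: The study first builds the FireSentry dataset, using synchronized UAV platforms to collect visible and infrared video streams at sub-meter spatial and sub-second temporal resolution, in-situ environmental measurements, and manually validated fire masks. On this basis it establishes a comprehensive benchmark covering physics-based, data-driven, and generative models. It then proposes FiReDiff, a dual-modality paradigm that first predicts future video sequences in the infrared modality and then precisely segments fire masks in the mask modality based on the generated dynamics.

Result: FiReDiff achieves state-of-the-art performance when applied to generative models, with video quality gains of 39.2% in PSNR, 36.1% in SSIM, 50.0% in LPIPS, and 29.4% in FVD, and mask accuracy gains of 3.3% in AUPRC, 59.1% in F1 score, 42.9% in IoU, and 62.5% in MSE, outperforming existing mask-only approaches across the board.

Conclusion: The FireSentry benchmark dataset and the FiReDiff paradigm together advance fine-grained wildfire forecasting and dynamic disaster simulation. They highlight the key role of multimodal data fusion and temporal dynamics modeling in accurate fire prediction and provide a technical foundation and data support for future high-precision emergency response systems.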


📄 Abstract

Fine-grained wildfire spread prediction is crucial for enhancing emergency response efficacy and decision-making precision. However, existing research predominantly focuses on coarse spatiotemporal scales and relies on low-resolution satellite data, capturing only macroscopic fire states while fundamentally constraining high-precision localized fire dynamics modeling capabilities. To bridge this gap, we present FireSentry, a provincial-scale multi-modal wildfire dataset characterized by sub-meter spatial and sub-second temporal resolution. Collected using synchronized UAV platforms, FireSentry provides visible and infrared video streams, in-situ environmental measurements, and manually validated fire masks. Building on FireSentry, we establish a comprehensive benchmark encompassing physics-based, data-driven, and generative models, revealing the limitations of existing mask-only approaches. Our analysis proposes FiReDiff, a novel dual-modality paradigm that first predicts future video sequences in the infrared modality, and then precisely segments fire masks in the mask modality based on the generated dynamics. FiReDiff achieves state-of-the-art performance, with video quality gains of 39.2% in PSNR, 36.1% in SSIM, 50.0% in LPIPS, 29.4% in FVD, and mask accuracy gains of 3.3% in AUPRC, 59.1% in F1 score, 42.9% in IoU, and 62.5% in MSE when applied to generative models. The FireSentry benchmark dataset and FiReDiff paradigm collectively advance fine-grained wildfire forecasting and dynamic disaster simulation. The processed benchmark dataset is publicly available at: https://github.com/Munan222/FireSentry-Benchmark-Dataset.
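
Two of the mask accuracy metrics named above, IoU and F1 score, have standard definitions that are easy to state in code. The sketch below computes them for binary fire masks stored as nested lists; the toy masks and the `mask_metrics` name are illustrative and do not come from the FireSentry benchmark.

```python
def mask_metrics(pred, target):
    """Return (IoU, F1) for two binary masks of the same shape."""
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1      # fire predicted and present
            elif p:
                fp += 1      # fire predicted but absent
            elif t:
                fn += 1      # fire present but missed
    union = tp + fp + fn
    if union == 0:
        # Both masks are empty; treat this as a perfect match by convention.
        return 1.0, 1.0
    iou = tp / union
    f1 = 2 * tp / (2 * tp + fp + fn)
    return iou, f1


predicted_mask = [[1, 1, 0],
                  [0, 1, 0]]
true_mask = [[1, 0, 0],
             [0, 1, 1]]

iou, f1 = mask_metrics(predicted_mask, true_mask)
print(iou, f1)
```

F1 is never lower than IoU for the same pair of masks, so identical segmentation changes can produce different percentage gains on the two metrics.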

[9] AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition

Zichuan Lin, Yicheng Liu, Yang Yang, Lvfang Tao, Deheng Ye

🧩 TL;DR

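This paper proposes AdaptVision, an efficient vision-language model paradigm inspired by human active vision. Through coarse-to-fine adaptive visual token acquisition, it reduces computational overhead while maintaining visual question answering performance. The method uses a reinforcement learning framework with Decoupled Turn Policy Optimization, allowing the model to adjust how much visual information it acquires according to task demands.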


📘 Detailed Summary

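Motivation: Current vision-language models rely on large numbers of visual tokens, which causes significant computational overhead, while existing efficient methods use fixed-ratio compression and lack adaptivity. The paper asks whether VLMs can autonomously determine the minimum number of visual tokens needed for each sample and, inspired by human active vision, explores a new paradigm of adaptive visual token acquisition.

Method: AdaptVision acquires visual tokens adaptively in a coarse-to-fine manner: it first processes compressed visual tokens from a low-resolution image and, when necessary, invokes a bounding box tool to crop key regions and obtain additional visual information. Training uses a reinforcement learning framework centered on Decoupled Turn Policy Optimization (DTPO), which splits the learning objective into two components, tool learning and accuracy improvement, and further decouples advantage estimation by computing a separate advantage for each objective.

Result: Comprehensive experiments on multiple VQA benchmarks show that AdaptVision achieves superior performance while consuming substantially fewer visual tokens than existing efficient VLM methods. Compared with standard GRPO, its optimization is more effective, and it strikes a good balance between accuracy and efficiency.

Conclusion: The study demonstrates that VLMs can autonomously determine how many visual tokens they need and proposes a new paradigm for adaptive visual information acquisition. Decoupled Turn Policy Optimization provides an effective framework for multi-objective reinforcement learning and opens a direction toward more efficient and capable vision-language systems.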


📄 Abstract

Vision-Language Models (VLMs) have achieved remarkable success in visual question answering tasks, but their reliance on large numbers of visual tokens introduces significant computational overhead. While existing efficient VLM approaches reduce visual tokens through fixed-ratio compression, they operate passively and lack the ability to adapt to varying task requirements. This motivates a fundamental question: Can VLMs autonomously determine the minimum number of visual tokens required for each sample? Inspired by human active vision mechanisms, we introduce AdaptVision, an efficient VLM paradigm that enables adaptive visual token acquisition through a coarse-to-fine approach. Our model initially processes compressed visual tokens from low-resolution images and selectively acquires additional visual information by invoking a bounding box tool to crop key regions when necessary. We train AdaptVision using a reinforcement learning framework that carefully balances accuracy and efficiency. Central to our approach is Decoupled Turn Policy Optimization (DTPO), which decouples the learning objective into two components: (1) tool learning, which optimizes correct tool utilization, and (2) accuracy improvement, which refines the generated responses to improve answer correctness. Based on this formulation, we further decouple advantage estimation by computing separate advantages for tokens associated with each objective. This formulation enables more effective optimization for AdaptVision compared to vanilla GRPO. Comprehensive experiments across multiple VQA benchmarks demonstrate that AdaptVision achieves superior performance while consuming substantially fewer visual tokens than state-of-the-art efficient VLM methods.
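
The decoupled advantage estimation described above can be sketched as two independent group-normalized advantages, one per objective, each applied only to the tokens tied to that objective. This is an interpretation of the abstract, not the paper's exact formulation: the reward names, the token labelling scheme, and the use of population standard deviation in the GRPO-style normalization are assumptions.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Group-relative advantage in the style of GRPO: (r - mean) / std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def decoupled_token_advantages(rollouts):
    """Give tool-call tokens the tool advantage and answer tokens the accuracy one."""
    tool_adv = group_advantages([r["tool_reward"] for r in rollouts])
    acc_adv = group_advantages([r["accuracy_reward"] for r in rollouts])
    per_rollout = []
    for rollout, a_tool, a_acc in zip(rollouts, tool_adv, acc_adv):
        per_rollout.append(
            [a_tool if kind == "tool" else a_acc for kind in rollout["token_kinds"]]
        )
    return per_rollout


# Two hypothetical rollouts for the same question.
rollouts = [
    {"tool_reward": 1.0, "accuracy_reward": 0.0,
     "token_kinds": ["tool", "tool", "answer"]},
    {"tool_reward": 0.0, "accuracy_reward": 1.0,
     "token_kinds": ["tool", "answer", "answer"]},
]

token_advantages = decoupled_token_advantages(rollouts)
print(token_advantages)
```

With a single shared advantage, the first rollout's good tool call and wrong answer would receive the same signal; separating them lets each part of the response be credited on its own terms.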

[10] ShelfGaussian: Shelf-Supervised Open-Vocabulary Gaussian-based 3D Scene Understanding

Lingjun Zhao, Yandong Luo, James Hay, Lu Gan

🧩 TL;DR

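This paper presents ShelfGaussian, an open-vocabulary, multi-modal, Gaussian-based 3D scene understanding framework supervised by off-the-shelf vision foundation models. It achieves state-of-the-art performance on zero-shot semantic occupancy prediction.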


📘 Detailed Summary

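Motivation: Existing Gaussian-based methods have two main limitations. Some model objects as closed-set semantic Gaussians, which depend on annotated 3D labels and neglect rendering ability. Others learn open-set Gaussian representations through purely 2D self-supervision, which degrades geometric quality and restricts them to camera-only settings. This work aims to exploit the full potential of Gaussians and remove these limitations.

Method: The paper proposes a Multi-Modal Gaussian Transformer that lets Gaussians query features from multiple sensor modalities, and a Shelf-Supervised Learning Paradigm that jointly optimizes Gaussians with vision foundation model features at the 2D image level and the 3D scene level. Together these form an open-vocabulary, multi-modal Gaussian framework supervised by off-the-shelf vision foundation models.

Result: The method achieves state-of-the-art zero-shot semantic occupancy prediction on the Occ3D-nuScenes benchmark. It is also evaluated in the real world on an unmanned ground vehicle to assess in-the-wild performance across diverse urban scenarios, and it is shown to be effective on a range of perception and planning tasks.

Conclusion: The study shows that combining multi-modal sensor information with supervision from vision foundation models can substantially improve the scene understanding ability of Gaussian-based methods. It provides an effective framework for open-vocabulary 3D scene understanding and shows potential for real robotic applications.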


📄 Abstract

We introduce ShelfGaussian, an open-vocabulary multi-modal Gaussian-based 3D scene understanding framework supervised by off-the-shelf vision foundation models (VFMs). Gaussian-based methods have demonstrated superior performance and computational efficiency across a wide range of scene understanding tasks. However, existing methods either model objects as closed-set semantic Gaussians supervised by annotated 3D labels, neglecting their rendering ability, or learn open-set Gaussian representations via purely 2D self-supervision, leading to degraded geometry and a restriction to camera-only settings. To fully exploit the potential of Gaussians, we propose a Multi-Modal Gaussian Transformer that enables Gaussians to query features from diverse sensor modalities, and a Shelf-Supervised Learning Paradigm that efficiently optimizes Gaussians with VFM features jointly at 2D image and 3D scene levels. We evaluate ShelfGaussian on various perception and planning tasks. Experiments on Occ3D-nuScenes demonstrate its state-of-the-art zero-shot semantic occupancy prediction performance. ShelfGaussian is further evaluated on an unmanned ground vehicle (UGV) to assess its in-the-wild performance across diverse urban scenarios. Project website: https://lunarlab-gatech.github.io/ShelfGaussian/.

[11] Multi-Aspect Knowledge-Enhanced Medical Vision-Language Pretraining with Multi-Agent Data Generation

Xieji Li, Siyuan Yan, Yingsheng Liu, H. Peter Soyer, Monika Janda, Victoria Mar, Zongyuan Ge

🧩 TL;DR

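This paper proposes a novel medical vision-language pretraining framework that combines a multi-agent data generation system with ontology-based, multi-aspect, knowledge-enhanced pretraining. It addresses data noise and the complexity of long texts in medical image analysis and achieves state-of-the-art zero-shot performance in dermatology.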


📘 Detailed Summary

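Motivation: Existing vision-language pretraining methods face two main challenges in medical image analysis: the noise inherent in web-collected data and the complexity of long, unstructured medical texts. These limitations hinder the learning of high-quality representations from large-scale image-text pairs, especially in fields that require precise medical knowledge, such as dermatology.

Method: The paper proposes a framework that integrates a multi-agent data generation system with ontology-based, multi-aspect, knowledge-enhanced pretraining. The MAGEN system improves data quality by synthesizing knowledge-enriched descriptions through foundation-model-assisted captioning and a retrieval-based verification pipeline. The O-MAKE method decomposes long unstructured texts into distinct knowledge aspects, enabling fine-grained alignment at both the global and local (patch) levels, and explicitly models relationships between medical concepts through ontology-guided mechanisms.

Result: In comprehensive dermatology experiments, the method achieves state-of-the-art zero-shot performance on disease classification and cross-modal retrieval across eight datasets, and the experiments confirm the effectiveness of each component. The team also releases Derm1M-AgentAug, an augmented dataset of over 400k skin image-text pairs.

Conclusion: The study shows that intelligent data augmentation and the integration of ontological knowledge can address key challenges in medical vision-language pretraining. The framework offers a new paradigm for medical image analysis and is of practical value particularly where data quality is limited and texts are complex.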


📄 Abstract

Vision-language pretraining (VLP) has emerged as a powerful paradigm in medical image analysis, enabling representation learning from large-scale image-text pairs without relying on expensive manual annotations. However, existing methods often struggle with the noise inherent in web-collected data and the complexity of unstructured long medical texts. To address these challenges, we propose a novel VLP framework integrating a Multi-Agent data GENeration (MAGEN) system and Ontology-based Multi-Aspect Knowledge-Enhanced (O-MAKE) pretraining. First, MAGEN enhances data quality by synthesizing knowledge-enriched descriptions via a foundation model-assisted captioning and retrieval-based verification pipeline. Second, O-MAKE addresses the difficulty of learning from long, unstructured texts by decomposing them into distinct knowledge aspects. This facilitates fine-grained alignment at both global and patch levels, while explicitly modeling medical concept relationships through ontology-guided mechanisms. We validate our framework in the field of dermatology, where comprehensive experiments demonstrate the effectiveness of each component. Our approach achieves state-of-the-art zero-shot performance on disease classification and cross-modal retrieval tasks across eight datasets. Our code and the augmented dataset Derm1M-AgentAug, comprising over 400k skin-image-text pairs, will be released at https://github.com/SiyuanYan1/Derm1M.

[12] MOS: Mitigating Optical-SAR Modality Gap for Cross-Modal Ship Re-Identification

Yujian Zhao, Hankun Liu, Guanglin Niu

🧩 TL;DR

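This paper proposes the MOS framework, which mitigates the modality gap between optical and SAR images through modality-consistent representation learning and cross-modal data generation with feature fusion, significantly improving cross-modal ship re-identification.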


📘 Detailed Summary

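Motivation: Cross-modal ship re-identification between optical and synthetic aperture radar (SAR) imagery is a critical task in maritime intelligence and surveillance, but the substantial gap between the two modalities poses a major challenge for robust identification, and the problem remains underexplored.

Method: MOS has two core components. Modality-Consistent Representation Learning applies SAR image denoising and a class-wise modality alignment loss to align intra-identity feature distributions across modalities. Cross-modal Data Generation and Feature Fusion uses a Brownian bridge diffusion model to synthesize cross-modal samples and fuses the synthesized features with the original features at inference time to improve alignment and discriminability.

Result: Extensive experiments on the HOSS ReID dataset show that MOS significantly surpasses state-of-the-art methods under all evaluation protocols, improving R1 accuracy by +3.0%, +6.2%, and +16.4% in the ALL to ALL, Optical to SAR, and SAR to Optical settings, respectively.

Conclusion: The study shows that modality alignment and cross-modal data generation can mitigate the optical-SAR modality gap, providing an effective solution for cross-modal ship re-identification and illustrating the potential of diffusion models for cross-modal feature enhancement.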


📄 Abstract

Cross-modal ship re-identification (ReID) between optical and synthetic aperture radar (SAR) imagery has recently emerged as a critical yet underexplored task in maritime intelligence and surveillance. However, the substantial modality gap between optical and SAR images poses a major challenge for robust identification. To address this issue, we propose MOS, a novel framework designed to mitigate the optical-SAR modality gap and achieve modality-consistent feature learning for optical-SAR cross-modal ship ReID. MOS consists of two core components: (1) Modality-Consistent Representation Learning (MCRL) applies SAR image denoising and a class-wise modality alignment loss to align intra-identity feature distributions across modalities. (2) Cross-modal Data Generation and Feature fusion (CDGF) leverages a Brownian bridge diffusion model to synthesize cross-modal samples, which are subsequently fused with original features during inference to enhance alignment and discriminability. Extensive experiments on the HOSS ReID dataset demonstrate that MOS significantly surpasses state-of-the-art methods across all evaluation protocols, achieving notable improvements of +3.0%, +6.2%, and +16.4% in R1 accuracy under the ALL to ALL, Optical to SAR, and SAR to Optical settings, respectively. The code and trained models will be released upon publication.
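
A class-wise modality alignment loss, as named above, can take several forms, and the abstract does not specify which one MOS uses. The sketch below assumes one simple variant: for each ship identity, the squared Euclidean distance between the mean optical feature and the mean SAR feature, averaged over identities. The function names and toy features are hypothetical.

```python
from collections import defaultdict


def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(v[d] for v in vectors) / count for d in range(len(vectors[0]))]


def classwise_alignment_loss(features, identities, modalities):
    """Mean squared distance between per-identity optical and SAR centroids."""
    groups = defaultdict(list)
    for feature, identity, modality in zip(features, identities, modalities):
        groups[(identity, modality)].append(feature)

    losses = []
    for identity in sorted(set(identities)):
        optical = groups.get((identity, "optical"))
        sar = groups.get((identity, "sar"))
        if optical and sar:  # skip identities seen in only one modality
            c_opt, c_sar = centroid(optical), centroid(sar)
            losses.append(sum((a - b) ** 2 for a, b in zip(c_opt, c_sar)))
    return sum(losses) / len(losses) if losses else 0.0


# Toy 2-D features for two ship identities.
features = [[0.0, 0.0], [2.0, 0.0], [1.0, 1.0], [5.0, 5.0], [5.0, 7.0]]
identities = ["ship_a", "ship_a", "ship_a", "ship_b", "ship_b"]
modalities = ["optical", "optical", "sar", "optical", "sar"]

loss = classwise_alignment_loss(features, identities, modalities)
print(loss)
```

Minimizing a term of this kind pulls the two modalities' features for the same ship together without forcing different ships to collapse onto each other, which is why it is computed per identity.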

[13] Think Before You Drive: World Model-Inspired Multimodal Grounding for Autonomous Vehicles

Haicheng Liao, Huanming Shen, Bonan Wang, Yongkang Li, Yihong Tang, Chengyue Wang, Dingyi Zhuang, Kehua Chen, Hai Yang, Chengzhong Xu, Zhenning Li

🧩 TL;DR

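This paper presents ThinkDeeper, a framework that uses a spatial-aware world model to reason ahead in visual grounding for autonomous driving. It addresses the challenges of ambiguous and context-dependent instructions and achieves state-of-the-art performance on multiple benchmarks.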


📘 Detailed Summary

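Motivation: Existing visual grounding methods for autonomous driving struggle with ambiguous, context-dependent instructions, mainly because they cannot reason about 3D spatial relations and scene evolution. This limits how accurately an autonomous driving system can interpret natural-language commands and localize targets.

Method: The paper proposes the ThinkDeeper framework. At its core is a spatial-aware world model that distills the current scene into a command-aware latent state and rolls out a sequence of future latent states to provide forward-looking cues. A hypergraph-guided decoder then hierarchically fuses these states with the multimodal input, capturing higher-order spatial dependencies for robust localization. The authors also build the DrivePilot dataset, whose semantic annotations are generated by an LLM pipeline that uses retrieval-augmented generation and chain-of-thought prompting.

Result: ThinkDeeper ranks first on the Talk2Car leaderboard and surpasses state-of-the-art methods on the DrivePilot, MoCAD, and RefCOCO/+/g benchmarks. It shows strong robustness and efficiency in challenging scenes and retains superior performance even when trained on only 50% of the data.

Conclusion: The study shows that forward-looking spatial reasoning through a world model can significantly improve visual grounding in autonomous driving, especially with ambiguous instructions and complex scenes. The approach offers a new solution for natural-language interaction with autonomous driving systems and underlines the importance of spatial-aware reasoning in vision-language tasks.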


📄 Abstract

Interpreting natural-language commands to localize target objects is critical for autonomous driving (AD). Existing visual grounding (VG) methods for autonomous vehicles (AVs) typically struggle with ambiguous, context-dependent instructions, as they lack reasoning over 3D spatial relations and anticipated scene evolution. Grounded in the principles of world models, we propose ThinkDeeper, a framework that reasons about future spatial states before making grounding decisions. At its core is a Spatial-Aware World Model (SA-WM) that learns to reason ahead by distilling the current scene into a command-aware latent state and rolling out a sequence of future latent states, providing forward-looking cues for disambiguation. Complementing this, a hypergraph-guided decoder then hierarchically fuses these states with the multimodal input, capturing higher-order spatial dependencies for robust localization. In addition, we present DrivePilot, a multi-source VG dataset in AD, featuring semantic annotations generated by a Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT)-prompted LLM pipeline. In extensive evaluations on six benchmarks, ThinkDeeper ranks #1 on the Talk2Car leaderboard and surpasses state-of-the-art baselines on DrivePilot, MoCAD, and RefCOCO/+/g benchmarks. Notably, it shows strong robustness and efficiency in challenging scenes (long-text, multi-agent, ambiguity) and retains superior performance even when trained on 50% of the data.

[14] ViDiC: Video Difference Captioning

Jiangtao Wu, Shihao Li, Zhaozhou Bian, Yuanxing Zhang, Jialu Chen, Runzhe Wen, An Ping, Yiwen He, Jiakai Wang, Jiaheng Liu

🧩 TL;DR

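This paper introduces the Video Difference Captioning (ViDiC) task and the accompanying ViDiC-1K dataset, which evaluate the ability of multimodal large language models to give fine-grained descriptions of similarities and differences between video pairs. The work fills a gap in the comparative perception of dynamic scenes by existing vision-language systems.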


📘 Detailed Summary

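Motivation: Existing vision-language systems fall short in understanding visual differences between dynamic scenes; in particular, comparative perception of compositional, spatial, and temporal changes remains underexplored. Image Difference Captioning (IDC) research has enabled models to describe semantic changes between static images, but these approaches cannot capture motion continuity, event evolution, or editing consistency over time.

Method: The paper proposes the Video Difference Captioning (ViDiC) task and builds the ViDiC-1K dataset, which contains 1,000 curated video pairs annotated with over 4,000 comparative checklist items covering seven categories: subject, style, background, cinematography, motion, location, and playback techniques. For reliable evaluation, it proposes a dual-checklist framework based on the LLM-as-a-Judge protocol that measures the accuracy of similarities and differences separately.

Result: Experiments on 19 representative multimodal models reveal a significant performance gap in their comparative description and difference perception abilities. As a challenging benchmark, ViDiC-1K provides a systematic framework for evaluating video understanding and exposes the limitations of current models in comparative analysis of dynamic scenes.

Conclusion: ViDiC-1K lays a solid foundation for advancing video understanding, edit awareness, and comparative reasoning in multimodal intelligence. The study stresses the importance of comparative analysis of dynamic scenes and provides clear evaluation criteria for future model development, which may help drive progress in video difference understanding.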


📄 Abstract

Understanding visual differences between dynamic scenes requires the comparative perception of compositional, spatial, and temporal changes--a capability that remains underexplored in existing vision-language systems. While prior work on Image Difference Captioning (IDC) has enabled models to describe semantic changes between static images, these approaches fail to capture motion continuity, event evolution, or editing consistency over time. We introduce the ViDiC (Video Difference Captioning) task and its corresponding ViDiC-1K dataset, designed to evaluate the ability of Multimodal Large Language Models (MLLMs) to provide fine-grained descriptions of similarities and differences between video pairs. ViDiC-1K comprises 1,000 curated video pairs annotated with over 4,000 comparative checklist items, covering seven categories: subject, style, background, cinematography, motion, location, and playback techniques. To ensure reliable evaluation, we propose a dual-checklist framework that measures the accuracy of similarity and difference separately, based on the LLM-as-a-Judge protocol. Experiments on nineteen representative multimodal models reveal a significant performance gap in their comparative description and difference perception abilities. We hope ViDiC-1K can be a challenging benchmark that lays a solid foundation for advancing video understanding, edit awareness, and comparative reasoning in multimodal intelligence.
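
The dual-checklist idea above, scoring similarities and differences separately, amounts to computing accuracy over two disjoint subsets of judged checklist items. The sketch below assumes each item already carries a boolean verdict from the judge; the field names and sample items are hypothetical and the judge call itself is not shown.

```python
def dual_checklist_accuracy(items):
    """Return separate accuracies for 'similarity' and 'difference' checklist items."""
    scores = {}
    for kind in ("similarity", "difference"):
        verdicts = [item["judged_correct"] for item in items if item["kind"] == kind]
        # None signals that a video pair had no items of this kind.
        scores[kind] = sum(verdicts) / len(verdicts) if verdicts else None
    return scores


# Hypothetical judged items for one video pair.
checklist = [
    {"kind": "similarity", "text": "Both clips show the same subject", "judged_correct": True},
    {"kind": "similarity", "text": "Both clips share the background", "judged_correct": False},
    {"kind": "difference", "text": "Second clip is played in reverse", "judged_correct": True},
    {"kind": "difference", "text": "Camera pans left only in clip one", "judged_correct": True},
    {"kind": "difference", "text": "Style changes to black and white", "judged_correct": False},
]

scores = dual_checklist_accuracy(checklist)
print(scores)
```

Reporting the two numbers separately prevents a model that simply lists many differences from hiding poor recognition of what the videos have in common, and vice versa.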

[15] Generalization Evaluation of Deep Stereo Matching Methods for UAV-Based Forestry Applications

Yida Lin, Bing Xue, Mengjie Zhang, Sam Schofield, Richard Green

🧩 TL;DR

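This study presents the first systematic zero-shot evaluation of eight state-of-the-art stereo depth estimation methods in forestry scenes, filling a gap left by existing evaluations that focus on urban and indoor environments. It identifies DEFOM as the best baseline for depth estimation in vegetation-dense environments.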


📘 Detailed Summary

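Motivation: Autonomous UAV forestry operations need robust depth estimation methods with strong cross-domain generalization, but existing evaluations concentrate on urban and indoor scenes and lack systematic study of specialized vegetation-dense environments. This is a critical research gap.

Method: The study systematically evaluates eight state-of-the-art stereo methods in a zero-shot setting, spanning iterative refinement, foundation model, and zero-shot adaptation paradigms, including RAFT-Stereo, IGEV, IGEV++, BridgeDepth, StereoAnywhere, and DEFOM, plus the baseline methods ACVNet, PSMNet, and TCstereo. All methods are trained only on the Scene Flow dataset and evaluated without fine-tuning on four standard benchmarks (ETH3D, KITTI 2012/2015, Middlebury) and a new Canterbury forestry dataset of 5,313 image pairs.

Result: The experiments reveal scene-dependent performance patterns. Foundation models excel on structured scenes (BridgeDepth: 0.23 px on ETH3D and 0.83-1.07 px on KITTI; DEFOM: 0.35-4.65 px across benchmarks), while iterative methods maintain cross-domain robustness (IGEV++: 0.36-6.77 px; IGEV: 0.33-21.91 px). A key finding is that RAFT-Stereo fails catastrophically on ETH3D (26.23 px EPE, 98% error rate) but performs normally on KITTI (0.90-1.11 px). On the Canterbury forestry dataset, DEFOM is identified as the best gold-standard baseline for vegetation depth estimation, showing better depth smoothness, occlusion handling, and cross-domain consistency than IGEV++.

Conclusion: The study reveals the scene dependence of stereo depth estimation methods and stresses the importance of targeted evaluation in specialized domains such as forestry. DEFOM is established as the recommended baseline for depth estimation in vegetation-dense environments, giving practical guidance for autonomous UAV forestry operations. The study also pinpoints specific failure modes of methods such as RAFT-Stereo, indicating directions for improving robustness.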


📄 Abstract

Autonomous UAV forestry operations require robust depth estimation methods with strong cross-domain generalization. However, existing evaluations focus on urban and indoor scenarios, leaving a critical gap for specialized vegetation-dense environments. We present the first systematic zero-shot evaluation of eight state-of-the-art stereo methods--RAFT-Stereo, IGEV, IGEV++, BridgeDepth, StereoAnywhere, DEFOM (plus baseline methods ACVNet, PSMNet, TCstereo)--spanning iterative refinement, foundation model, and zero-shot adaptation paradigms. All methods are trained exclusively on Scene Flow and evaluated without fine-tuning on four standard benchmarks (ETH3D, KITTI 2012/2015, Middlebury) plus a novel 5,313-pair Canterbury forestry dataset captured with a ZED Mini camera (1920x1080). Performance reveals scene-dependent patterns: foundation models excel on structured scenes (BridgeDepth: 0.23 px on ETH3D, 0.83-1.07 px on KITTI; DEFOM: 0.35-4.65 px across benchmarks), while iterative methods maintain cross-domain robustness (IGEV++: 0.36-6.77 px; IGEV: 0.33-21.91 px). Critical finding: RAFT-Stereo exhibits catastrophic ETH3D failure (26.23 px EPE, 98 percent error rate) due to negative disparity predictions, while performing normally on KITTI (0.90-1.11 px). Qualitative evaluation on the Canterbury forestry dataset identifies DEFOM as the optimal gold-standard baseline for vegetation depth estimation, exhibiting superior depth smoothness, occlusion handling, and cross-domain consistency compared to IGEV++, despite IGEV++'s finer detail preservation.
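
The figures above are end-point error (EPE) values in pixels alongside an error rate. A minimal sketch of both metrics follows, together with a toy demonstration of why negative disparity predictions inflate them. The abstract does not state the error-rate threshold, so the 1-pixel default here is an assumption, and the disparity values are made up.

```python
def epe_and_bad_rate(pred, gt, threshold=1.0):
    """Return (mean absolute disparity error in px, fraction of pixels above threshold)."""
    errors = [abs(p - g) for p, g in zip(pred, gt)]
    epe = sum(errors) / len(errors)
    bad_rate = sum(e > threshold for e in errors) / len(errors)
    return epe, bad_rate


# Toy ground-truth disparities (always positive for a rectified stereo pair).
gt_disparity = [10.0, 20.0, 30.0, 40.0]

# A reasonable prediction: small errors, one pixel above the threshold.
good_prediction = [10.5, 19.5, 30.2, 41.5]
# A sign-flipped prediction: every pixel is off by twice the true disparity.
negative_prediction = [-10.0, -20.0, -30.0, -40.0]

good_epe, good_bad = epe_and_bad_rate(good_prediction, gt_disparity)
neg_epe, neg_bad = epe_and_bad_rate(negative_prediction, gt_disparity)
print(good_epe, good_bad)
print(neg_epe, neg_bad)
```

In the sign-flipped case the error at each pixel is twice the true disparity, so both EPE and the error rate jump together, the same qualitative pattern the abstract reports for RAFT-Stereo on ETH3D.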

[16] Rethinking Prompt Design for Inference-time Scaling in Text-to-Visual Generation

Subin Kim, Sangwoo Mo, Mamshad Nayeem Rizve, Yiran Xu, Difan Liu, Jinwoo Shin, Tobias Hinz

🧩 TL;DR

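This paper proposes PRIS, a framework that improves alignment in text-to-visual generation by adaptively revising the prompt during inference. It introduces an element-level factual correction verifier and achieves notable gains on several visual generation benchmarks.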


📘 Detailed Summary

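Motivation: The central challenge in text-to-visual generation is precise alignment between user intent and the generated visuals. Prior approaches mainly scale the visual generation process (for example, more sampling steps or seeds), but this quickly hits a quality plateau because the prompt stays fixed during generation and cannot adapt to the results.

Method: PRIS adaptively revises the prompt during inference in response to the scaled visual generations. The core idea is to review the generated visuals, identify failure patterns that recur across them, redesign the prompt accordingly, and regenerate with the revised prompt. To give precise alignment feedback, the authors introduce an element-level factual correction verifier that assesses alignment between prompt attributes and generated visuals at a fine-grained level, yielding more accurate and interpretable assessments than holistic measures.

Result: Extensive experiments on text-to-image and text-to-video benchmarks demonstrate the effectiveness of the approach, including a 15% gain on VBench 2.0. The results indicate that jointly scaling prompts and visuals is key to fully exploiting inference-time scaling laws.

Conclusion: Jointly scaling prompts and visuals during inference is essential for precise alignment in text-to-visual generation. Through adaptive prompt revision and fine-grained verification, PRIS offers a new direction for alignment in generative models and highlights the value of adjusting prompts dynamically.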


📄 Abstract

Achieving precise alignment between user intent and generated visuals remains a central challenge in text-to-visual generation, as a single attempt often fails to produce the desired output. To handle this, prior approaches mainly scale the visual generation process (e.g., increasing sampling steps or seeds), but this quickly leads to a quality plateau. This limitation arises because the prompt, crucial for guiding generation, is kept fixed. To address this, we propose Prompt Redesign for Inference-time Scaling, coined PRIS, a framework that adaptively revises the prompt during inference in response to the scaled visual generations. The core idea of PRIS is to review the generated visuals, identify recurring failure patterns across visuals, and redesign the prompt accordingly before regenerating the visuals with the revised prompt. To provide precise alignment feedback for prompt revision, we introduce a new verifier, element-level factual correction, which evaluates the alignment between prompt attributes and generated visuals at a fine-grained level, achieving more accurate and interpretable assessments than holistic measures. Extensive experiments on both text-to-image and text-to-video benchmarks demonstrate the effectiveness of our approach, including a 15% gain on VBench 2.0. These results highlight that jointly scaling prompts and visuals is key to fully leveraging scaling laws at inference-time. Visualizations are available at the website: https://subin-kim-cv.github.io/PRIS.
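The review, identify, redesign, regenerate cycle described in the abstract can be pictured as a small control loop. The sketch below is only an illustration of that loop, not the PRIS implementation: `revise_prompt_loop`, the three callables, the majority-vote rule for a "recurring" failure, and the toy stand-ins are all hypothetical.

```python
from typing import Callable, Dict, List, Set

ELEMENTS = ("red", "cube")  # hypothetical prompt elements checked by the verifier


def revise_prompt_loop(
    prompt: str,
    generate: Callable[[str, int], Set[str]],
    verify: Callable[[str, Set[str]], Dict[str, bool]],
    redesign: Callable[[str, List[str]], str],
    rounds: int = 2,
    samples: int = 4,
) -> List[str]:
    """Return the sequence of prompts tried, starting with the original one."""
    history = [prompt]
    for _ in range(rounds):
        visuals = [generate(prompt, seed) for seed in range(samples)]
        # Count how often each prompt element fails across the sampled visuals.
        failures: Dict[str, int] = {}
        for visual in visuals:
            for element, ok in verify(prompt, visual).items():
                if not ok:
                    failures[element] = failures.get(element, 0) + 1
        # Assumption: an element is a recurring failure if most samples miss it.
        recurring = [e for e, n in failures.items() if n > samples // 2]
        if not recurring:
            break
        prompt = redesign(prompt, recurring)
        history.append(prompt)
    return history


# Toy stand-ins: the "generator" only renders red when told "clearly red".
def toy_generate(prompt: str, seed: int) -> Set[str]:
    rendered = {"cube"}
    if "clearly red" in prompt:
        rendered.add("red")
    return rendered


def toy_verify(prompt: str, visual: Set[str]) -> Dict[str, bool]:
    return {e: e in visual for e in ELEMENTS}


def toy_redesign(prompt: str, recurring: List[str]) -> str:
    return prompt + "; " + ", ".join(f"clearly {e}" for e in recurring)


history = revise_prompt_loop("a red cube", toy_generate, toy_verify, toy_redesign)
print(history)
```

In a real system the three callables would wrap a generator, a verifier model, and a prompt-rewriting model; here they are reduced to set operations so that the loop itself is easy to follow.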

[17] Label-Efficient Hyperspectral Image Classification via Spectral FiLM Modulation of Low-Level Pretrained Diffusion Features

Yuzhen Hu, Biplab Banerjee, Saurabh Prasad

🧩 TL;DR

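This paper proposes a label-efficient framework that uses spatial features from a pretrained diffusion model for hyperspectral image classification. A lightweight FiLM fusion module integrates spectral and spatial information, outperforming existing methods under sparse annotation.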


📘 Detailed Summary

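Motivation: Hyperspectral imaging enables detailed land cover classification but suffers from low spatial resolution and sparse annotations. Existing methods struggle to integrate spectral and spatial information effectively with limited labels, which motivates a label-efficient multimodal learning framework.

Method: The method extracts spatial features from a frozen diffusion model pretrained on natural images, taking low-level representations from high-resolution decoder layers at early denoising timesteps. A lightweight FiLM fusion module then adaptively modulates the frozen spatial features using spectral cues, enabling robust multimodal learning under sparse supervision.

Result: Experiments on two recent hyperspectral datasets show that the method outperforms state-of-the-art approaches using only the provided sparse training labels. Ablation studies further confirm the benefits of diffusion-derived features and spectral-aware fusion.

Conclusion: Pretrained diffusion models can support domain-agnostic, label-efficient representation learning, offering a new approach for remote sensing and broader scientific imaging tasks and demonstrating that frozen diffusion features are effective and transferable in multimodal fusion.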


📄 Abstract

Hyperspectral imaging (HSI) enables detailed land cover classification, yet low spatial resolution and sparse annotations pose significant challenges. We present a label-efficient framework that leverages spatial features from a frozen diffusion model pretrained on natural images. Our approach extracts low-level representations from high-resolution decoder layers at early denoising timesteps, which transfer effectively to the low-texture structure of HSI. To integrate spectral and spatial information, we introduce a lightweight FiLM-based fusion module that adaptively modulates frozen spatial features using spectral cues, enabling robust multimodal learning under sparse supervision. Experiments on two recent hyperspectral datasets demonstrate that our method outperforms state-of-the-art approaches using only the provided sparse training labels. Ablation studies further highlight the benefits of diffusion-derived features and spectral-aware fusion. Overall, our results indicate that pretrained diffusion models can support domain-agnostic, label-efficient representation learning for remote sensing and broader scientific imaging tasks.
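FiLM (feature-wise linear modulation) scales and shifts each feature channel with parameters predicted from a conditioning input. The sketch below shows that operation for a single pixel, with the spectral vector as the condition and a frozen spatial feature vector as the modulated input. The linear predictors, shapes, and weights are made up for illustration and are not the paper's architecture.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def linear(weights: Matrix, bias: Sequence[float], x: Sequence[float]) -> List[float]:
    """Plain affine map: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def film(
    spatial: Sequence[float],   # frozen spatial feature, one value per channel
    spectral: Sequence[float],  # conditioning input (spectral signature)
    w_gamma: Matrix, b_gamma: Sequence[float],
    w_beta: Matrix, b_beta: Sequence[float],
) -> List[float]:
    """Apply FiLM: out_c = gamma_c(spectral) * spatial_c + beta_c(spectral)."""
    gamma = linear(w_gamma, b_gamma, spectral)
    beta = linear(w_beta, b_beta, spectral)
    return [g * f + b for g, f, b in zip(gamma, spatial, beta)]


# Toy example: 2 spatial channels, 3 spectral bands, hand-picked weights.
spatial = [1.0, -2.0]
spectral = [0.5, 0.25, 0.25]
w_gamma = [[1.0, 0.0, 0.0], [0.0, 2.0, 2.0]]
b_gamma = [1.0, 0.0]
w_beta = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
b_beta = [0.0, 0.0]

out = film(spatial, spectral, w_gamma, b_gamma, w_beta, b_beta)
print(out)
```

Only the small predictors for gamma and beta would be trained in such a setup, which is one reason this style of fusion is described as lightweight.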

[18] CookAnything: A Framework for Flexible and Consistent Multi-Step Recipe Image Generation

Ruoxuan Zhang, Bin Wen, Hongxia Xie, Yi Yao, Songhan Zuo, Jian-Yu Jiang-Lin, Hong-Han Shuai, Wen-Huang Cheng

🧩 TL;DR

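This paper presents CookAnything, a flexible diffusion-based framework that generates coherent, semantically distinct image sequences from textual cooking instructions of arbitrary length, addressing the limitations of existing methods on structured multi-step scenarios and variable instruction lengths.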


📘 Detailed Summary

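Motivation: Existing diffusion models struggle with structured multi-step scenarios such as recipe illustration, and current recipe illustration methods cannot adapt to the natural variability in recipe length: they generate a fixed number of images regardless of the actual instruction structure, which limits their use in procedural content creation.

Method: The framework introduces three key components: Step-wise Regional Control (SRC), which aligns textual steps with corresponding image regions within a single denoising process; Flexible RoPE, a step-aware positional encoding mechanism that enhances temporal coherence and spatial diversity; and Cross-Step Consistency Control (CSCC), which maintains fine-grained ingredient consistency across steps.

Result: On recipe illustration benchmarks, CookAnything outperforms existing methods in both training-based and training-free settings and supports scalable, high-quality visual synthesis of complex multi-step instructions.

Conclusion: The framework has broad potential for procedural content creation, particularly instructional media and structured visual content generation, and offers a systematic solution for visual synthesis from variable-length multi-step instructions.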


📄 Abstract

Cooking is a sequential and visually grounded activity, where each step such as chopping, mixing, or frying carries both procedural logic and visual semantics. While recent diffusion models have shown strong capabilities in text-to-image generation, they struggle to handle structured multi-step scenarios like recipe illustration. Additionally, current recipe illustration methods are unable to adjust to the natural variability in recipe length, generating a fixed number of images regardless of the actual instruction structure. To address these limitations, we present CookAnything, a flexible and consistent diffusion-based framework that generates coherent, semantically distinct image sequences from textual cooking instructions of arbitrary length. The framework introduces three key components: (1) Step-wise Regional Control (SRC), which aligns textual steps with corresponding image regions within a single denoising process; (2) Flexible RoPE, a step-aware positional encoding mechanism that enhances both temporal coherence and spatial diversity; and (3) Cross-Step Consistency Control (CSCC), which maintains fine-grained ingredient consistency across steps. Experimental results on recipe illustration benchmarks show that CookAnything performs better than existing methods in training-based and training-free settings. The proposed framework supports scalable, high-quality visual synthesis of complex multi-step instructions and holds significant potential for broad applications in instructional media and procedural content creation.

[19] V-ITI: Mitigating Hallucinations in Multimodal Large Language Models via Visual Inference-Time Intervention

Nan Sun, Zhenyu Zhang, Xixun Lin, Kun Wang, Yanmin Shang, Naibin Gu, Shuohuan Wang, Yu Sun, Hua Wu, Haifeng Wang, Yanan Cao

🧩 TL;DR

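This paper proposes V-ITI, a lightweight visual inference-time intervention framework that detects visual neglect from head-level activation patterns and intervenes only when necessary, mitigating hallucinations in multimodal large language models while preserving general task performance.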


📘 Detailed Summary

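Motivation: Multimodal large language models (MLLMs) perform well on many vision-language tasks but suffer from hallucinations, producing outputs inconsistent with the input visuals, which severely undermines reliability in precision-sensitive domains. Existing methods typically intervene in attention scores or output logits, focusing on "how to intervene" while overlooking the prerequisite "when to intervene". This leads to over-intervention, which introduces new hallucinations and unnecessary computational overhead.

Method: The authors first investigate the mechanism of visual neglect and find that it can be accurately detected from head-level activation patterns in MLLMs. Building on this, V-ITI integrates two components: a Visual Neglect Detector that identifies visual neglect via head-level discriminative probes, and a Visual Recall Intervenor that modulates activations with prestored visual activation information only when visual neglect is detected.

Result: Extensive experiments across eight benchmarks and different MLLM families show that V-ITI consistently mitigates vision-related hallucinations while preserving general task performance. It performs well across multiple evaluation metrics, supporting its effectiveness and robustness.

Conclusion: The study shows that visual neglect can be accurately detected from head-level activation patterns and proposes V-ITI, a lightweight inference-time intervention framework. By controlling when to intervene, it addresses over-intervention and offers a new approach to mitigating MLLM hallucinations, improving visual reliability while preserving general capabilities.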


📄 Abstract

Multimodal Large Language Models (MLLMs) excel in numerous vision-language tasks yet suffer from hallucinations, producing content inconsistent with input visuals, that undermine reliability in precision-sensitive domains. This issue stems from a fundamental problem of visual neglect, where models fail to adequately prioritize input images. Existing methods typically alleviate hallucinations by intervening in the attention score or output logits, focusing on "how to intervene" but overlooking the prerequisite "when to intervene", which leads to the "over-intervention" problem and subsequently introduces new hallucinations and unnecessary computational overhead. To address this gap, we first investigate the mechanism of visual neglect and reveal it can be accurately detected via head-level activation patterns in MLLMs. We thus propose V-ITI, a lightweight visual inference-time intervention framework integrating a Visual Neglect Detector that identifies visual neglect via head-level discriminative probes and a Visual Recall Intervenor that modulates activations with prestored visual activation information only when the visual neglect is detected. Extensive experiments across eight benchmarks and different MLLM families demonstrate that V-ITI consistently mitigates vision-related hallucinations while preserving general task performance.

[20] LM-CartSeg: Automated Segmentation of Lateral and Medial Cartilage and Subchondral Bone for Radiomics Analysis

Tongxu Zhang

🧩 TL;DR

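This study proposes LM-CartSeg, a fully automatic pipeline for knee MRI segmentation and radiomics analysis. Using geometric post-processing rules and fusion of zero-shot predictions, it achieves robust cartilage/bone segmentation and lateral/medial compartmentalisation, providing a practical foundation for multi-centre osteoarthritis radiomics studies.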


📘 Detailed Summary

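Motivation: Radiomics of knee MRI requires robust, anatomically meaningful regions of interest that jointly capture cartilage and subchondral bone. Most existing work relies on manual annotation and rarely reports quality control, which limits the reliability and reproducibility of multi-centre studies.

Method: Two 3D nnU-Net models are trained on the SKM-TEA and OAIZIB-CM datasets. At test time, zero-shot predictions are fused and post-processed with geometric rules: connected-component cleaning, construction of 10 mm subchondral bone bands in physical space, and a data-driven tibial lateral/medial split based on PCA and k-means.

Result: Post-processing substantially improved segmentation: on the OAIZIB-CM test set, macro ASSD dropped from 2.63 mm to 0.36 mm and HD95 from 25.2 mm to 3.35 mm, with a DSC of 0.91; zero-shot DSC on SKI-10 was 0.80. The geometric lateral/medial rule produced stable compartments across datasets, whereas a direct lateral/medial nnU-Net showed domain-dependent side swaps.

Conclusion: LM-CartSeg yields automatic, quality-controlled regions of interest and radiomic features that carry discriminative information beyond simple morphometry, providing a practical foundation for multi-centre knee osteoarthritis radiomics studies. Only 6-12% of features were strongly correlated with volume or thickness.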


📄 Abstract

Background and Objective: Radiomics of knee MRI requires robust, anatomically meaningful regions of interest (ROIs) that jointly capture cartilage and subchondral bone. Most existing work relies on manual ROIs and rarely reports quality control (QC). We present LM-CartSeg, a fully automatic pipeline for cartilage/bone segmentation, geometric lateral/medial (L/M) compartmentalisation and radiomics analysis. Methods: Two 3D nnU-Net models were trained on SKM-TEA (138 knees) and OAIZIB-CM (404 knees). At test time, zero-shot predictions were fused and refined by simple geometric rules: connected-component cleaning, construction of 10 mm subchondral bone bands in physical space, and a data-driven tibial L/M split based on PCA and k-means. Segmentation was evaluated on an OAIZIB-CM test set (103 knees) and on SKI-10 (100 knees). QC used volume and thickness signatures. From 10 ROIs we extracted 4,650 non-shape radiomic features to study inter-compartment similarity, dependence on ROI size, and OA vs. non-OA classification on OAIZIB-CM. Results: Post-processing improved macro ASSD on OAIZIB-CM from 2.63 to 0.36 mm and HD95 from 25.2 to 3.35 mm, with DSC 0.91; zero-shot DSC on SKI-10 was 0.80. The geometric L/M rule produced stable compartments across datasets, whereas a direct L/M nnU-Net showed domain-dependent side swaps. Only 6 to 12 percent of features per ROI were strongly correlated with volume or thickness. Radiomics-based models outperformed models restricted to size-linked features. Conclusions: LM-CartSeg yields automatic, quality-controlled ROIs and radiomic features that carry discriminative information beyond simple morphometry, providing a practical foundation for multi-centre knee OA radiomics studies.
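The DSC values quoted above are Dice similarity coefficients. For readers unfamiliar with the metric, the following minimal sketch computes a per-label Dice score and a macro average over labels on flattened label maps. The helper names and the toy labels are assumptions for illustration, and the abstract does not say whether its DSC is macro-averaged in exactly this way.

```python
from typing import Iterable, Sequence


def dice(pred: Sequence[bool], target: Sequence[bool]) -> float:
    """Dice = 2 * |P ∩ T| / (|P| + |T|); defined as 1.0 when both masks are empty."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    return 1.0 if total == 0 else 2.0 * inter / total


def macro_dice(pred: Sequence[int], target: Sequence[int], labels: Iterable[int]) -> float:
    """Unweighted mean of per-label Dice scores (background excluded by the caller)."""
    scores = [
        dice([p == lab for p in pred], [t == lab for t in target])
        for lab in labels
    ]
    return sum(scores) / len(scores)


# Toy flattened label maps: 0 = background, 1 = cartilage, 2 = bone.
pred_labels = [0, 1, 1, 2, 2, 2]
true_labels = [0, 1, 2, 2, 2, 1]

cartilage = dice([p == 1 for p in pred_labels], [t == 1 for t in true_labels])
macro = macro_dice(pred_labels, true_labels, labels=[1, 2])
print(f"cartilage DSC = {cartilage:.3f}, macro DSC = {macro:.3f}")
```

Dice measures overlap only, which is why the abstract pairs it with the distance-based ASSD and HD95 metrics to capture boundary errors.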

[21] Dynamic Content Moderation in Livestreams: Combining Supervised Classification with MLLM-Boosted Similarity Matching

Wei Chee Yew, Hailun Xu, Sanjay Saha, Xiaotian Fan, Hiok Hian Ong, David Yuchen Wang, Kanchan Sarkar, Zhenheng Yang, Danhui Guan

🧩 TL;DR

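This paper presents a hybrid content moderation framework for large-scale user-generated video platforms. It combines supervised classification with reference-based similarity matching to meet the livestream requirements of timely, multimodal detection that adapts to evolving unwanted content.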


📘 Detailed Summary

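Motivation: Content moderation on large-scale user-generated video platforms is difficult, especially in livestreaming, where detection must be timely, multimodal, and robust to evolving forms of unwanted content. Traditional approaches struggle with novel or subtle violations.

Method: The hybrid framework combines a supervised classification pipeline for known violations with a reference-based similarity matching pipeline for novel or subtle cases. Multimodal inputs pass through both pipelines, and a multimodal large language model distills knowledge into each to improve accuracy while keeping inference lightweight.

Result: In production, the classification pipeline achieves 67% recall at 80% precision and the similarity pipeline achieves 76% recall at 80% precision. Large-scale A/B tests show a 6-8% reduction in user views of unwanted livestreams.

Conclusion: The work demonstrates a scalable and adaptable approach to multimodal content governance that handles both explicit violations and emerging adversarial behaviors, offering large video platforms a practical solution that balances detection accuracy and system efficiency.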


📄 Abstract

Content moderation remains a critical yet challenging task for large-scale user-generated video platforms, especially in livestreaming environments where moderation must be timely, multimodal, and robust to evolving forms of unwanted content. We present a hybrid moderation framework deployed at production scale that combines supervised classification for known violations with reference-based similarity matching for novel or subtle cases. This hybrid design enables robust detection of both explicit violations and novel edge cases that evade traditional classifiers. Multimodal inputs (text, audio, visual) are processed through both pipelines, with a multimodal large language model (MLLM) distilling knowledge into each to boost accuracy while keeping inference lightweight. In production, the classification pipeline achieves 67% recall at 80% precision, and the similarity pipeline achieves 76% recall at 80% precision. Large-scale A/B tests show a 6-8% reduction in user views of unwanted livestreams. These results demonstrate a scalable and adaptable approach to multimodal content governance, capable of addressing both explicit violations and emerging adversarial behaviors.
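"Recall at 80% precision" means choosing the score threshold that keeps precision at or above 80% and reporting the best recall reachable under that constraint. The sketch below shows one common way to compute it from scored, labelled examples. The function name and the toy data are assumptions, and the paper's exact evaluation protocol is not described in the abstract.

```python
from typing import Optional, Sequence, Tuple


def recall_at_precision(
    scores: Sequence[float],
    labels: Sequence[int],  # 1 = violation, 0 = benign
    min_precision: float = 0.8,
) -> Tuple[float, Optional[float]]:
    """Return (best recall, threshold) among thresholds with precision >= min_precision.

    An item is flagged when its score is >= the threshold. The threshold is None
    if no operating point meets the precision target.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive example")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    best_recall, best_threshold = 0.0, None
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Tied scores cannot be separated by a threshold, so evaluate after the tie.
        if i + 1 < len(ranked) and ranked[i + 1][0] == score:
            continue
        precision = tp / (tp + fp)
        recall = tp / total_pos
        if precision >= min_precision and recall > best_recall:
            best_recall, best_threshold = recall, score
    return best_recall, best_threshold


scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]
labels = [1, 1, 1, 0, 1, 1, 0, 1]
recall, threshold = recall_at_precision(scores, labels, min_precision=0.8)
print(f"recall = {recall:.3f} at threshold {threshold}")
```

Reporting both pipelines at the same precision level, as the abstract does, makes their recall figures directly comparable.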

[22] GeoVideo: Introducing Geometric Regularization into Video Generation Model

Yunpeng Bai, Shaoheng Fang, Chaohui Yu, Fan Wang, Qixing Huang

🧩 TL;DR

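This paper improves the geometric consistency of video generation through depth prediction, introducing a multi-view geometric loss into latent diffusion models to markedly improve the spatio-temporal coherence and structural plausibility of generated videos.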


📘 Detailed Summary

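Motivation: Existing video generation methods operate mainly in 2D pixel space and lack explicit mechanisms for modeling 3D structure, leading to temporally inconsistent geometry, unnatural motion, and structural artifacts. Geometric regularization is needed to improve spatio-temporal consistency.

Method: The method adds per-frame depth prediction to latent diffusion models as a form of geometric regularization. Depth is chosen as the geometric representation for its compatibility with image-based encoders, and a multi-view geometric loss aligns depth maps across frames within a shared 3D coordinate system to enforce structural consistency.

Result: Experiments on multiple datasets show significantly more stable and geometrically consistent results than existing baselines, with clear improvements in spatio-temporal coherence, shape consistency, and physical plausibility.

Conclusion: The method bridges the gap between appearance generation and 3D structure modeling, provides an effective geometric regularization framework for video generation, shows the potential of depth representations for improving video quality, and lays groundwork for future work with richer 3D priors.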


📄 Abstract

Recent advances in video generation have enabled the synthesis of high-quality and visually realistic clips using diffusion transformer models. However, most existing approaches operate purely in the 2D pixel space and lack explicit mechanisms for modeling 3D structures, often resulting in temporally inconsistent geometries, implausible motions, and structural artifacts. In this work, we introduce geometric regularization losses into video generation by augmenting latent diffusion models with per-frame depth prediction. We adopted depth as the geometric representation because of the great progress in depth prediction and its compatibility with image-based latent encoders. Specifically, to enforce structural consistency over time, we propose a multi-view geometric loss that aligns the predicted depth maps across frames within a shared 3D coordinate system. Our method bridges the gap between appearance generation and 3D structure modeling, leading to improved spatio-temporal coherence, shape consistency, and physical plausibility. Experiments across multiple datasets show that our approach produces significantly more stable and geometrically consistent results than existing baselines.

[23] ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos

Qi'ao Xu, Tianwen Qian, Yuqian Fu, Kailing Li, Yang Jiao, Jiacheng Zhang, Xiaoling Wang, Liang He

🧩 TL;DR

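This paper introduces ToG-Bench, the first task-oriented spatio-temporal video grounding benchmark. It focuses on task-oriented object grounding for embodied intelligence and addresses the restriction of existing work to descriptive instructions without task reasoning.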


📘 Detailed Summary

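Motivation: Existing spatio-temporal video grounding research is largely confined to object-centric, descriptive instructions and neglects the task-oriented reasoning that embodied agents need for goal-directed interaction, which limits applicability in real embodied scenarios.

Method: ToG-Bench is built on videos from ScanNet and comprises 100 annotated clips with 2,704 task-oriented grounding instructions, constructed with a semi-automated pipeline that combines foundation model annotation and human refinement. Task-level evaluation metrics are designed for multi-object and explicit-implicit object grounding.

Result: Experiments on seven state-of-the-art multimodal large language models reveal the intrinsic challenges of task-oriented spatio-temporal video grounding, with substantial performance gaps across explicit-implicit and multi-object grounding that highlight how hard it is to bridge perception and interaction in embodied scenarios.

Conclusion: The study stresses the importance of task-oriented reasoning for embodied intelligence. The benchmark and metrics open a direction for connecting visual perception with physical interaction and expose the limits of current models in understanding context and task intent.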


📄 Abstract

A core capability towards general embodied intelligence lies in localizing task-relevant objects from an egocentric perspective, formulated as Spatio-Temporal Video Grounding (STVG). Despite recent progress, existing STVG studies remain largely confined to object-centric and descriptive instructions, neglecting the task-oriented reasoning that is crucial for embodied agents to accomplish goal-directed interactions. To bridge this gap, we introduce ToG-Bench, the first task-oriented spatio-temporal video grounding benchmark for egocentric videos. ToG-Bench is characterized by three key features: (1) Task-oriented Grounding, which requires identifying and localizing objects based on intended tasks rather than straightforward descriptions; (2) Explicit-Implicit Dual Grounding, where target objects can be either explicitly mentioned or implicitly inferred by contextual reasoning; (3) One-to-Many Grounding, where a single instruction may correspond to multiple objects involved in task execution. Built upon videos sourced from ScanNet, ToG-Bench comprises 100 annotated clips with 2,704 task-oriented grounding instructions, constructed via a semi-automated pipeline that combines foundation model annotation and human refinement. In addition, we introduce a set of task-level evaluation metrics tailored for multi-object and explicit-implicit object grounding, and systematically benchmark seven state-of-the-art MLLMs. Extensive experiments reveal the intrinsic challenges of task-oriented STVG and substantial performance gaps across explicit-implicit and multi-object grounding, highlighting the difficulty of bridging perception and interaction in embodied scenarios. Data and code will be released at: https://github.com/qaxuDev/ToG-Bench.

[24] PULSE: A Unified Multi-Task Architecture for Cardiac Segmentation, Diagnosis, and Few-Shot Cross-Modality Clinical Adaptation

Hania Ghouse, Maryam Alsharqi, Farhad R. Nezami, Muzammil Behzad

🧩 TL;DR

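This paper presents PULSE, a multi-task vision-language framework that unifies anatomical segmentation, disease classification, and clinical report generation in cardiac image analysis, using self-supervised representations and a composite supervision strategy to generalize across modalities and datasets.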


📘 Detailed Summary

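Motivation: Cardiac image analysis is fragmented across tasks: anatomical segmentation, disease classification, and clinical report generation are typically handled by separate networks trained under different data regimes. No existing framework unifies these objectives in a single architecture while retaining generalization across imaging modalities and datasets.

Method: PULSE is built on self-supervised representations and optimized with a composite supervision strategy that balances region overlap learning, pixel-wise classification fidelity, and boundary-aware IoU refinement. A multi-scale token reconstruction decoder performs anatomical segmentation, while shared global representations support disease classification and clinical text output, letting the model move from pixels to structures to clinical reasoning within one architecture.

Result: Unlike prior task-specific pipelines, PULSE learns task-invariant cardiac priors, generalizes robustly across datasets, and adapts to new imaging modalities with minimal supervision, laying groundwork for a scalable, foundation-style cardiac analysis framework.

Conclusion: The work moves cardiac image analysis toward a unified, scalable foundation framework. By integrating multiple analysis tasks in a single architecture with cross-modality generalization, it offers a more efficient and consistent tool for clinical practice and reduces the complexity and data requirements of task-specific pipelines.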


📄 Abstract

Cardiac image analysis remains fragmented across tasks: anatomical segmentation, disease classification, and grounded clinical report generation are typically handled by separate networks trained under different data regimes. No existing framework unifies these objectives within a single architecture while retaining generalization across imaging modalities and datasets. We introduce PULSE, a multi-task vision-language framework built on self-supervised representations and optimized through a composite supervision strategy that balances region overlap learning, pixel-wise classification fidelity, and boundary-aware IoU refinement. A multi-scale token reconstruction decoder enables anatomical segmentation, while shared global representations support disease classification and clinically grounded text output, allowing the model to transition from pixels to structures and finally to clinical reasoning within one architecture. Unlike prior task-specific pipelines, PULSE learns task-invariant cardiac priors, generalizes robustly across datasets, and can be adapted to new imaging modalities with minimal supervision. This moves the field closer to a scalable, foundation-style cardiac analysis framework.

[25] Procedural Mistake Detection via Action Effect Modeling

Wenliang Guo, Yujiang Pu, Yu Kong

🧩 TL;DR

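This paper proposes Action Effect Modeling (AEM), a framework that improves mistake detection in procedural tasks by jointly modeling action execution and its outcomes. It achieves state-of-the-art performance on the EgoPER and CaptainCook4D benchmarks under the one-class classification setting.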


📘 Detailed Summary

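Motivation: Existing mistake detection methods for procedural tasks mainly analyze how an action is performed and overlook what it produces, that is, the action effect. Many errors show up not in the execution but in the resulting state, such as an unintended object state or an incorrect spatial arrangement.

Method: AEM is a unified framework that jointly captures action execution and its outcomes through a probabilistic formulation. It first identifies the outcome of an action by selecting the most informative effect frame based on semantic relevance and visual quality. It then extracts complementary cues from visual grounding and symbolic scene graphs and aligns them in a shared latent space to form robust effect-aware representations. For detection, a prompt-based detector incorporates task-specific prompts and aligns each action segment with its intended execution semantics.

Result: The method achieves state-of-the-art performance on EgoPER and CaptainCook4D under the challenging one-class classification (OCC) setting. The results show that jointly modeling execution and outcome yields more reliable mistake detection and that effect-aware representations improve detection accuracy.

Conclusion: Modeling both execution and outcome yields more reliable mistake detection, and effect-aware representations may benefit a broader range of downstream tasks. The framework offers a new approach to outcome-based mistake detection and underlines the value of considering both the execution process and the resulting state in intelligent systems.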


📄 Abstract

Mistake detection in procedural tasks is essential for building intelligent systems that support learning and task execution. Existing approaches primarily analyze how an action is performed, while overlooking what it produces, i.e., the action effect. Yet many errors manifest not in the execution itself but in the resulting outcome, such as an unintended object state or incorrect spatial arrangement. To address this gap, we propose Action Effect Modeling (AEM), a unified framework that jointly captures action execution and its outcomes through a probabilistic formulation. AEM first identifies the outcome of an action by selecting the most informative effect frame based on semantic relevance and visual quality. It then extracts complementary cues from visual grounding and symbolic scene graphs, aligning them in a shared latent space to form robust effect-aware representations. To detect mistakes, we further design a prompt-based detector that incorporates task-specific prompts and aligns each action segment with its intended execution semantics. Our approach achieves state-of-the-art performance on the EgoPER and CaptainCook4D benchmarks under the challenging one-class classification (OCC) setting. These results demonstrate that modeling both execution and outcome yields more reliable mistake detection, and highlight the potential of effect-aware representations to benefit a broader range of downstream applications.

[26] DIQ-H: Evaluating Hallucination Persistence in VLMs Under Temporal Visual Degradation

Zexin Lin, Hawen Wan, Yebin Zhong, Xiaoqiang

🧩 TL;DR

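This paper introduces DIQ-H, the first benchmark for evaluating the robustness of vision-language models under dynamic visual degradation in temporal sequences. Physics-based corruptions expose serious reliability gaps in existing models for real-world deployment.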


📘 Detailed Summary

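Motivation: Existing vision-language model benchmarks focus on static, high-quality images and ignore visual degradation and error propagation over time. These are critical failure modes in safety-critical applications such as autonomous driving, where transient visual corruption induces hallucinations that persist across frames.

Method: DIQ-H applies physics-based corruptions, including motion blur, sensor noise, and compression artifacts, and measures hallucination persistence, error recovery, and temporal consistency through multi-turn question-answering tasks. The authors also propose Uncertainty-Guided Iterative Refinement (UIR), which generates reliable pseudo-ground-truth annotations using lightweight vision-language models with uncertainty filtering.

Result: Experiments on 16 state-of-the-art vision-language models reveal substantial robustness gaps: even advanced models such as GPT-4o reach only a 78.5% recovery rate, while open-source models score below 60% on temporal consistency. UIR achieves a 15.3% accuracy improvement.

Conclusion: DIQ-H provides a comprehensive platform for evaluating vision-language model reliability in real-world deployments, exposes serious robustness gaps under dynamic visual degradation, and underlines the importance of temporal evaluation in safety-critical applications.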


📄 Abstract

Vision-Language Models (VLMs) deployed in safety-critical applications such as autonomous driving must handle continuous visual streams under imperfect conditions. However, existing benchmarks focus on static, high-quality images and ignore temporal degradation and error propagation, which are critical failure modes where transient visual corruption induces hallucinations that persist across subsequent frames. We introduce DIQ-H, the first benchmark for evaluating VLM robustness under dynamic visual degradation in temporal sequences. DIQ-H applies physics-based corruptions including motion blur, sensor noise, and compression artifacts, and measures hallucination persistence, error recovery, and temporal consistency through multi-turn question-answering tasks. To enable scalable annotation, we propose Uncertainty-Guided Iterative Refinement (UIR), which generates reliable pseudo-ground-truth using lightweight VLMs with uncertainty filtering, achieving a 15.3 percent accuracy improvement. Experiments on 16 state-of-the-art VLMs reveal substantial robustness gaps: even advanced models such as GPT-4o achieve only a 78.5 percent recovery rate, while open-source models struggle with temporal consistency at less than 60 percent. DIQ-H provides a comprehensive platform for evaluating VLM reliability in real-world deployments.

[27] Fairness-Aware Fine-Tuning of Vision-Language Models for Medical Glaucoma Diagnosis

Zijian Gu, Yuxi Liu, Zhenhao Zhang, Song Wang

🧩 TL;DR

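This paper proposes fairness-aware low-rank adaptation, which uses a differentiable MaxAccGap loss to reduce diagnostic accuracy disparities of medical vision-language models across demographic groups, cutting the fairness gap by 69% while remaining parameter-efficient.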


📘 Detailed Summary

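Motivation: Medical vision-language models reach expert-level performance on medical imaging tasks but show significant diagnostic accuracy disparities across demographic groups, which limits their fair use in clinical practice. An adaptation method is needed that is parameter-efficient and also optimizes fairness.

Method: The paper proposes a fairness-aware low-rank adaptation framework whose key contribution is a differentiable MaxAccGap loss that enables end-to-end optimization of accuracy parity across demographic groups. Three methods are proposed: FR-LoRA integrates MaxAccGap regularization into the training objective, GR-LoRA applies inverse frequency weighting to balance gradient contributions, and Hybrid-LoRA combines both mechanisms. The approach requires only 0.24% trainable parameters.

Result: Evaluated on 10,000 glaucoma fundus images, GR-LoRA reduces diagnostic accuracy disparities by 69% while maintaining 53.15% overall accuracy. Ablation studies show that strong regularization achieves optimal fairness with minimal accuracy trade-off, and race-specific optimization yields a 60% disparity reduction.

Conclusion: Through parameter-efficient fairness optimization, the method offers a practical route to deploying fair medical AI in resource-constrained healthcare settings. The study shows that strong regularization is effective at balancing accuracy and fairness and points to the potential of group-specific optimization strategies.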


📄 Abstract

Vision-language models achieve expert-level performance on medical imaging tasks but exhibit significant diagnostic accuracy disparities across demographic groups. We introduce fairness-aware Low-Rank Adaptation for medical VLMs, combining parameter efficiency with explicit fairness optimization. Our key algorithmic contribution is a differentiable MaxAccGap loss that enables end-to-end optimization of accuracy parity across demographic groups. We propose three methods: FR-LoRA integrates MaxAccGap regularization into the training objective, GR-LoRA applies inverse frequency weighting to balance gradient contributions, and Hybrid-LoRA combines both mechanisms. Evaluated on 10,000 glaucoma fundus images, GR-LoRA reduces diagnostic accuracy disparities by 69% while maintaining 53.15% overall accuracy. Ablation studies reveal that strong regularization strength achieves optimal fairness with minimal accuracy trade-off, and race-specific optimization yields 60% disparity reduction. Our approach requires only 0.24% trainable parameters, enabling practical deployment of fair medical AI in resource-constrained healthcare settings.
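Two quantities in this abstract are easy to illustrate: the accuracy gap between the best and worst demographic group, and inverse-frequency weights that give small groups more influence. The sketch below computes the hard (non-differentiable) version of the gap as an evaluation metric and one common inverse-frequency scheme. Both are assumptions for illustration; the paper's differentiable MaxAccGap loss and its exact weighting formula are not given in the abstract.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Sequence, Tuple


def max_acc_gap(
    preds: Sequence[int], labels: Sequence[int], groups: Sequence[Hashable]
) -> Tuple[float, Dict[Hashable, float]]:
    """Return (max group accuracy - min group accuracy, per-group accuracies)."""
    correct: Dict[Hashable, int] = defaultdict(int)
    count: Dict[Hashable, int] = defaultdict(int)
    for p, y, g in zip(preds, labels, groups):
        count[g] += 1
        correct[g] += int(p == y)
    acc = {g: correct[g] / count[g] for g in count}
    return max(acc.values()) - min(acc.values()), acc


def inverse_frequency_weights(groups: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Weight each group by n / (k * n_g), so the weights average to 1 over samples."""
    counts = Counter(groups)
    n, k = len(groups), len(counts)
    return {g: n / (k * c) for g, c in counts.items()}


# Toy data: group A has 6 samples, group B has 2.
groups = ["A"] * 6 + ["B"] * 2
labels = [1, 0, 1, 1, 0, 1, 1, 0]
preds = [1, 0, 1, 1, 0, 0, 0, 0]

gap, per_group = max_acc_gap(preds, labels, groups)
weights = inverse_frequency_weights(groups)
print(f"gap = {gap:.3f}, accuracies = {per_group}, weights = {weights}")
```

In training, weights like these would multiply each sample's loss so that the smaller group contributes as much gradient as the larger one.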

[28] Highly Efficient Test-Time Scaling for T2I Diffusion Models with Text Embedding Perturbation

Hang Xu, Linjiang Huang, Feng Zhao

🧩 TL;DR

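This paper proposes text embedding perturbation to improve test-time scaling in text-to-image diffusion models, exploiting the complementary frequency-domain behavior of spatial noise and text embedding perturbation to markedly improve generative diversity and quality.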


📘 Detailed Summary

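Motivation: Existing test-time scaling methods focus on search strategies and reward models and overlook how noise randomness in text-to-image diffusion models affects performance, in particular the potential limitations of spatial noise in handling high-frequency details.

Method: The method has two key designs: step-based text embedding perturbation, which combines frequency-guided noise schedules with spatial noise perturbation; and adaptive perturbation intensity based on frequency-specific contribution and tolerance to perturbation. Both exploit the complementary frequency-domain behavior of text embedding perturbation and SDE-injected noise.

Result: The method integrates seamlessly into existing test-time scaling methods and shows significant improvements on multiple benchmarks with almost no additional computation. Frequency-domain analysis confirms the complementary behavior: spatial noise favors low-frequency components, while text embedding perturbation enhances high-frequency details.

Conclusion: Text embedding perturbation is an effective form of randomness that couples with spatial noise to enhance generative diversity and quality. It offers a new perspective on exploiting randomness for test-time optimization in diffusion models and shows that different dimensions tolerate perturbation to different degrees.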


📄 Abstract

Test-time scaling (TTS) aims to achieve better results by increasing random sampling and evaluating samples based on rules and metrics. However, in text-to-image (T2I) diffusion models, most related works focus on search strategies and reward models, yet the impact of the stochastic characteristic of noise in T2I diffusion models on the method's performance remains unexplored. In this work, we analyze the effects of randomness in T2I diffusion models and explore a new form of randomness for TTS: text embedding perturbation, which couples with existing randomness like SDE-injected noise to enhance generative diversity and quality. We start with a frequency-domain analysis of these forms of randomness and their impact on generation, and find that the two exhibit complementary behavior in the frequency domain: spatial noise favors low-frequency components (early steps), while text embedding perturbation enhances high-frequency details (later steps), thereby compensating for the potential limitations of spatial noise randomness in high-frequency manipulation. Concurrently, text embedding demonstrates varying levels of tolerance to perturbation across different dimensions of the generation process. Specifically, our method consists of two key designs: (1) introducing step-based text embedding perturbation, combining frequency-guided noise schedules with spatial noise perturbation; and (2) adapting the perturbation intensity selectively based on frequency-specific contributions to generation and tolerance to perturbation. Our approach can be seamlessly integrated into existing TTS methods and demonstrates significant improvements on multiple benchmarks with almost no additional computation. Code is available at https://github.com/xuhang07/TEP-Diffusion.

[29] Towards Object-centric Understanding for Instructional Videos

Wenliang Guo, Yu Kong

🧩 TL;DR

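This paper proposes an object-centric paradigm for video understanding that treats actions as mechanisms driving state transitions, and introduces the Object-IVQA benchmark and an agent framework that substantially improve object-level reasoning in procedural activity understanding.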


📘 Detailed Summary

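Motivation: Existing action-centric methods struggle with the flexibility of real procedures, where step order varies with object states. The paper shifts the focus to an object-centric paradigm that treats actions as mechanisms driving state transitions.

Method: The paper introduces Object-IVQA, a benchmark of 107 long-form instructional videos and 514 open-ended question-answer pairs annotated with temporally grounded evidence, which evaluates four dimensions of object-centric reasoning. It also proposes an agent framework that orchestrates object-centric planning, perception, analysis, and generation tools, enabling explicit evidence retrieval and multi-hop reasoning across disjoint segments.

Result: Existing large vision-language models struggle with object-level recognition and reasoning, whereas the proposed framework achieves substantial improvement. Object-IVQA evaluates four reasoning dimensions: state evolution, precondition verification, counterfactual reasoning, and mistake recognition.

Conclusion: The study underlines the importance of the object-centric paradigm for procedural activity understanding. The framework addresses limitations of existing methods through explicit evidence retrieval and multi-hop reasoning, and Object-IVQA provides an evaluation tool for future assistive AI systems that reason about complex real-world tasks.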


📄 Abstract

Understanding procedural activities is crucial for developing future assistive AI that can reason about complex real-world tasks. Existing action-centric methods struggle with the flexibility of real procedures, where step order varies depending on object states. In this work, we propose to shift the focus to an object-centric paradigm by regarding actions as mechanisms that drive state transitions. To advance this direction, we introduce Object-IVQA, a long-form instructional video benchmark with 107 videos and 514 open-ended question-answer pairs annotated with temporally grounded evidence. The benchmark evaluates four dimensions of object-centric reasoning: state evolution, precondition verification, counterfactual reasoning, and mistake recognition. We further propose an agent framework that orchestrates object-centric planning, perception, analysis, and generation tools, enabling explicit evidence retrieval and multi-hop reasoning across disjoint segments. Experiments show that existing large vision-language models struggle in object-level recognition and reasoning, whereas our framework achieves substantial improvement.

[30] Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding

Jialuo Li, Bin Li, Jiahao Li, Yan Lu

🧩 TL;DR

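This paper proposes DIG, a training-free video frame selection framework that adapts to the query type, using either uniform sampling or query-aware selection, and markedly improves the performance of large multimodal models on long-form video understanding.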


📘 Detailed Summary

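Motivation: Applying large multimodal models to long-form video understanding is constrained by limited context lengths and the computational cost of processing dense video tokens. Existing query-aware frame selection methods are effective but computationally expensive. The paper examines whether such complex search mechanisms are necessary for every query and explores a more efficient frame selection strategy.

Method: The paper first establishes and validates a query typology that distinguishes global queries from localized queries. Building on it, DIG uses efficient uniform sampling for global queries and activates a specialized pipeline to extract query-relevant frames for localized queries. The whole framework is training-free.

Result: On three long-form video understanding benchmarks, DIG consistently outperforms existing baselines and robustly improves large multimodal model performance, even when the input frame count is scaled to 256.

Conclusion: The query type strongly influences which frame selection strategy works best, and complex search mechanisms are not necessary in every scenario. DIG offers an efficient and effective adaptive solution and a new approach to computational efficiency in long-form video understanding.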


📄 Abstract

The application of Large Multimodal Models (LMMs) to long-form video understanding is constrained by limited context lengths and the computationally prohibitive cost of processing dense video tokens. Consequently, recent research has focused on query-aware frame selection, methods that often incur significant computational overhead. This paper challenges the assumption that such complex search mechanisms are universally necessary. We first identify and validate a query typology distinguishing between global queries and localized queries. We demonstrate that while uniform sampling is both effective and efficient for global queries, localized queries indeed necessitate query-aware selection for optimal performance. Building on this insight, we propose DIG, a training-free frame selection framework that adapts its strategy based on the query type. Specifically, DIG employs efficient uniform sampling for global queries while activating a specialized pipeline to extract query-relevant frames for localized queries. Experiments on three long-form video understanding benchmarks demonstrate that DIG consistently outperforms existing baselines and robustly improves LMM performance, even when scaling the input frame count to 256.
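The routing idea can be pictured as a small dispatcher: global queries get evenly spaced frames, localized queries get the frames most relevant to the query. The sketch below is a hypothetical illustration of that split. The function names, the string query types, and the top-k relevance selection are stand-ins; the abstract does not describe how DIG classifies queries or what its localized pipeline does internally.

```python
from typing import List, Optional, Sequence


def uniform_indices(num_frames: int, budget: int) -> List[int]:
    """Pick `budget` evenly spaced frame indices (centres of equal-width bins)."""
    if budget >= num_frames:
        return list(range(num_frames))
    step = num_frames / budget
    return [int(step * i + step / 2) for i in range(budget)]


def top_k_indices(relevance: Sequence[float], budget: int) -> List[int]:
    """Stand-in for query-aware selection: keep the most relevant frames, in time order."""
    ranked = sorted(range(len(relevance)), key=lambda i: -relevance[i])
    return sorted(ranked[:budget])


def select_frames(
    query_type: str,  # "global" or "localized"; assumed to come from an upstream classifier
    num_frames: int,
    budget: int,
    relevance: Optional[Sequence[float]] = None,
) -> List[int]:
    if query_type == "global":
        return uniform_indices(num_frames, budget)
    if query_type == "localized":
        if relevance is None or len(relevance) != num_frames:
            raise ValueError("localized queries need one relevance score per frame")
        return top_k_indices(relevance, budget)
    raise ValueError(f"unknown query type: {query_type!r}")


global_pick = select_frames("global", num_frames=100, budget=4)
local_pick = select_frames(
    "localized", num_frames=5, budget=2, relevance=[0.1, 0.9, 0.2, 0.8, 0.05]
)
print(global_pick, local_pick)
```

The global branch never computes relevance scores, which is where a design like this saves work on queries that do not need query-aware search.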

[31] EEA: Exploration-Exploitation Agent for Long Video Understanding

Te Yang, Xiangyu Zhu, Bo Wang, Quan Chen, Peng Jiang, Zhen Lei

🧩 TL;DR

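This paper proposes EEA, a new video agent framework for long video understanding that balances exploration and exploitation through a semantically guided hierarchical tree search. It autonomously discovers and dynamically updates task-relevant semantic queries and combines the intrinsic rewards of vision-language models with semantic priors for efficient and accurate long-video analysis.


📘 Detailed Summary

Motivation: Current long video understanding methods face two main problems: dense preprocessing incurs severe computational overhead, and a poor balance between exploration and exploitation leads to incomplete information coverage and inefficiency. Existing methods are either too expensive or fail to balance the two, which limits the practicality and efficiency of long-video analysis.

Method: EEA balances exploration and exploitation through a semantically guided hierarchical tree search. It autonomously discovers and dynamically updates task-relevant semantic queries and collects the video frames that closely match them as semantic anchors. During the tree search, EEA prioritizes semantically relevant frames while ensuring sufficient coverage of unknown segments, and it explicitly models uncertainty to adaptively combine the intrinsic rewards of vision-language models with semantic priors, yielding stable and precise evaluation of video segments.

Result: Experiments on multiple long-video benchmarks validate the superior performance and computational efficiency of the proposed method. EEA significantly improves long video understanding while remaining computationally efficient, demonstrating its effectiveness at balancing exploration and exploitation.

Conclusion: The study shows that a semantically guided hierarchical tree search can effectively address the exploration-exploitation balance in long video understanding, offering a new approach to efficient long-video analysis. EEA's adaptive semantic query mechanism and uncertainty modeling provide a useful reference for future video agent design, particularly for balancing computational efficiency and information coverage on large-scale visual data.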


📄 Abstract

Long-form video understanding requires efficient navigation of extensive visual data to pinpoint sparse yet critical information. Current approaches to long-form video understanding either suffer from severe computational overhead due to dense preprocessing, or fail to effectively balance exploration and exploitation, resulting in incomplete information coverage and inefficiency. In this work, we introduce EEA, a novel video agent framework that achieves exploration-exploitation balance through semantic guidance with a hierarchical tree search process. EEA autonomously discovers and dynamically updates task-relevant semantic queries, and collects video frames closely matched to these queries as semantic anchors. During the tree search process, instead of uniform expansion, EEA preferentially explores semantically relevant frames while ensuring sufficient coverage within unknown segments. Moreover, EEA adaptively combines intrinsic rewards from vision-language models (VLMs) with semantic priors by explicitly modeling uncertainty to achieve stable and precise evaluation of video segments. Experiments across various long-video benchmarks validate the superior performance and computational efficiency of our proposed method.

[32] PSA: Pyramid Sparse Attention for Efficient Video Understanding and Generation

Xiaolong Li, Youping Gu, Xi Lin, Weijie Wang, Bohan Zhuang

🧩 TL;DR

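This paper proposes Pyramid Sparse Attention, which replaces conventional binary masks with multi-level pooled key-value representations, substantially reducing information loss under high sparsity while preserving computational efficiency, for both video understanding and generation tasks.


📘 Detailed Summary

Motivation: Attention is the core of foundation models, and its quadratic complexity is a key bottleneck for scaling. Existing efficient attention methods usually rely on sparsification, but current methods use binary masks to keep or discard entire key-value blocks, which causes substantial information loss under high sparsity.

Method: The paper proposes Pyramid Sparse Attention, which uses multi-level pooled key-value representations instead of binary masks. Each query block dynamically assigns lower pooling levels to critical key-value blocks and higher pooling levels to less important ones, creating an informative interpolation between full retention and complete pruning. A decoupled block-tile design ensures hardware-friendly, efficient execution.

Result: On video understanding and generation benchmarks, PSA preserves contextual information and visual fidelity, consistently outperforming or matching existing sparse attention baselines with a better efficiency-quality trade-off. It mitigates information loss under a low compute budget while remaining computationally efficient.

Conclusion: Through multi-level pooled representations, Pyramid Sparse Attention offers a new paradigm for sparse attention design, analogous to fixed-point quantization and the classical feature pyramid networks of computer vision. It substantially reduces information loss under high sparsity while preserving computational efficiency, pointing to a promising direction for efficient attention mechanisms.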


📄 Abstract

Attention mechanisms are the core of foundation models, but their quadratic complexity remains a critical bottleneck for scaling. This challenge has driven the development of efficient attention mechanisms, with sparsity emerging as the dominant paradigm. Current methods typically retain or discard entire key-value blocks with binary masks, resulting in substantial information loss under high sparsity. To close this gap, we present Pyramid Sparse Attention (PSA), a versatile module applicable to both video understanding and generation tasks. Instead of binary masking, PSA introduces multi-level pooled KV representations, enabling finer mask granularity. Specifically, each query block dynamically allocates lower pooling levels to critical KV blocks and higher levels to less important ones, creating an informative interpolation between full retention and complete pruning. This design, analogous to fixed-point quantization and classical feature pyramid networks in computer vision, effectively mitigates information loss while preserving computational efficiency under a low compute budget. It works with a native, hardware-friendly kernel that leverages a decoupled block-tile design to ensure efficient execution. Across video understanding and generation benchmarks, PSA preserves contextual information and visual fidelity, consistently outperforming or matching existing sparse attention baselines with superior efficiency-quality trade-offs. Our code and model weights are publicly available at: http://ziplab.co/PSA
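
To make the "interpolation between full retention and complete pruning" concrete, here is a minimal pure-Python sketch for a single query block. It assumes mean pooling with a window of 2^level inside each KV block and a rank-based level assignment; both choices, and all function names, are illustrative assumptions rather than the PSA kernel, which the abstract does not specify.

```python
from typing import List

Vector = List[float]


def mean_pool(block: List[Vector], window: int) -> List[Vector]:
    """Average consecutive groups of `window` vectors within one KV block."""
    pooled = []
    for start in range(0, len(block), window):
        group = block[start:start + window]
        dim = len(group[0])
        pooled.append([sum(v[d] for v in group) / len(group) for d in range(dim)])
    return pooled


def assign_levels(importance: List[float], levels_by_rank: List[int]) -> List[int]:
    """Give the most important block levels_by_rank[0], the next one
    levels_by_rank[1], and so on. Level -1 marks a pruned block.
    Requires len(levels_by_rank) >= len(importance)."""
    order = sorted(range(len(importance)), key=lambda b: importance[b], reverse=True)
    levels = [0] * len(importance)
    for rank, block_id in enumerate(order):
        levels[block_id] = levels_by_rank[rank]
    return levels


def pyramid_kv(blocks: List[List[Vector]], levels: List[int]) -> List[Vector]:
    """Build the reduced KV sequence seen by one query block."""
    reduced: List[Vector] = []
    for block, level in zip(blocks, levels):
        if level < 0:
            continue  # complete pruning
        reduced.extend(mean_pool(block, 2 ** level))  # level 0 keeps every token
    return reduced
```

With three blocks of four tokens and levels 0, 1, and -1, the query attends to 4 + 2 + 0 = 6 entries instead of 12, whereas a binary mask could only offer 4 or 8.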

[33] Exploiting Domain Properties in Language-Driven Domain Generalization for Semantic Segmentation

Seogkyu Jeon, Kibeom Hong, Hyeran Byun

🧩 TL;DR

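This paper proposes DPMFormer, a new framework for domain generalized semantic segmentation that addresses visual-textual semantic misalignment through domain-aware prompt learning and consistency learning, achieving state-of-the-art performance on multiple benchmarks.


📘 Detailed Summary

Motivation: Existing vision-language-model-based methods for domain generalized semantic segmentation overlook the semantic misalignment between visual and textual contexts. This misalignment stems from the rigidity of a fixed context prompt learned on a single source domain and limits cross-domain generalization.

Method: The paper proposes a domain-aware prompt-driven masked transformer framework with three core components: domain-aware prompt learning to promote semantic alignment between visual and textual cues; domain-aware contrastive learning combined with texture perturbation to diversify the observable domains; and domain-robust consistency learning to minimize the discrepancy between predictions on original and augmented images.

Result: Experiments show that the framework sets a new state of the art on multiple domain generalized semantic segmentation benchmarks, and systematic analysis verifies that each component contributes to cross-domain generalization.

Conclusion: The study demonstrates the importance of resolving visual-textual semantic misalignment for domain generalization. The multi-component framework offers an effective route to semantic segmentation systems that are robust to environmental changes and suggests new directions for cross-domain visual understanding.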


📄 Abstract

Recent domain generalized semantic segmentation (DGSS) studies have achieved notable improvements by distilling semantic knowledge from Vision-Language Models (VLMs). However, they overlook the semantic misalignment between visual and textual contexts, which arises due to the rigidity of a fixed context prompt learned on a single source domain. To this end, we present a novel domain generalization framework for semantic segmentation, namely Domain-aware Prompt-driven Masked Transformer (DPMFormer). First, we introduce domain-aware prompt learning to facilitate semantic alignment between visual and textual cues. Second, to capture various domain-specific properties with a single source dataset, we propose domain-aware contrastive learning along with texture perturbation that diversifies the observable domains. Lastly, to establish a framework resilient against diverse environmental changes, we propose domain-robust consistency learning, which guides the model to minimize discrepancies between predictions on the original and augmented images. Through experiments and analyses, we demonstrate the superiority of the proposed framework, which establishes a new state-of-the-art on various DGSS benchmarks. The code is available at https://github.com/jone1222/DPMFormer.

[34] OpenTrack3D: Towards Accurate and Generalizable Open-Vocabulary 3D Instance Segmentation

Zhishan Zhou, Siyuan Wei, Zengran Wang, Chunjie Wang, Xiaosheng Yan, Xiao Liu

🧩 TL;DR

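This paper presents OpenTrack3D, a generalizable and accurate framework for open-vocabulary 3D instance segmentation. It builds cross-view consistent object proposals with an online visual-spatial tracker and uses a multi-modal large language model to strengthen compositional reasoning, achieving state-of-the-art performance on multiple benchmarks.


📘 Detailed Summary

Motivation: Existing open-vocabulary 3D instance segmentation methods have two key limitations. First, proposal generation relies on dataset-specific proposal networks or mesh-based superpoints, which makes the methods inapplicable in mesh-free scenarios and limits generalization to novel scenes. Second, CLIP-based classifiers have weak textual reasoning and struggle with compositional and functional user queries.

Method: OpenTrack3D uses an online visual-spatial tracker to build cross-view consistent object proposals. It first generates masks with a 2D open-vocabulary segmenter and lifts them to 3D point clouds using depth, then extracts mask-guided instance features from DINO feature maps; the tracker fuses visual and spatial cues to maintain instance consistency. The core pipeline is entirely mesh-free, with an optional superpoint refinement module that further improves performance when a scene mesh is available. CLIP is replaced by a multi-modal large language model to strengthen compositional reasoning for complex queries.

Result: Extensive experiments on diverse benchmarks, including ScanNet200, Replica, ScanNet++, and SceneFun3D, show that OpenTrack3D achieves state-of-the-art performance and strong generalization, confirming its effectiveness in mesh-free scenarios and its accurate understanding of complex user queries.

Conclusion: The study demonstrates the effectiveness of an online visual-spatial tracker for building cross-view consistent object proposals and the advantage of multi-modal large language models for compositional reasoning. It provides a generalizable and accurate solution for open-vocabulary 3D instance segmentation in robotics and AR/VR while remaining applicable in mesh-free environments.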


📄 Abstract

Generalizing open-vocabulary 3D instance segmentation (OV-3DIS) to diverse, unstructured, and mesh-free environments is crucial for robotics and AR/VR, yet remains a significant challenge. We attribute this to two key limitations of existing methods: (1) proposal generation relies on dataset-specific proposal networks or mesh-based superpoints, rendering them inapplicable in mesh-free scenarios and limiting generalization to novel scenes; and (2) the weak textual reasoning of CLIP-based classifiers, which struggle to recognize compositional and functional user queries. To address these issues, we introduce OpenTrack3D, a generalizable and accurate framework. Unlike methods that rely on pre-generated proposals, OpenTrack3D employs a novel visual-spatial tracker to construct cross-view consistent object proposals online. Given an RGB-D stream, our pipeline first leverages a 2D open-vocabulary segmenter to generate masks, which are lifted to 3D point clouds using depth. Mask-guided instance features are then extracted using DINO feature maps, and our tracker fuses visual and spatial cues to maintain instance consistency. The core pipeline is entirely mesh-free, yet we also provide an optional superpoint refinement module to further enhance performance when a scene mesh is available. Finally, we replace CLIP with a multi-modal large language model (MLLM), significantly enhancing compositional reasoning for complex user queries. Extensive experiments on diverse benchmarks, including ScanNet200, Replica, ScanNet++, and SceneFun3D, demonstrate state-of-the-art performance and strong generalization capabilities.
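
The step of lifting 2D masks to 3D point clouds using depth is commonly done by pinhole back-projection. The sketch below assumes a standard pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy` and a metric depth map aligned with the mask; the function name and the camera model are assumptions, since the abstract does not give the paper's exact procedure.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def lift_mask_to_points(
    mask: Sequence[Sequence[bool]],
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point3D]:
    """Back-project masked pixels with valid depth into camera coordinates."""
    points: List[Point3D] = []
    for v, (mask_row, depth_row) in enumerate(zip(mask, depth)):
        for u, (inside, z) in enumerate(zip(mask_row, depth_row)):
            if inside and z > 0:  # skip background pixels and missing depth
                x = (u - cx) * z / fx
                y = (v - cy) * z / fy
                points.append((x, y, z))
    return points
```

The resulting points live in the camera frame; placing them in a shared scene frame would additionally need the per-frame camera pose, which this sketch leaves out.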

[35] UniComp: Rethinking Video Compression Through Informational Uniqueness

Chao Yuan, Shimin Chen, Minliang Lin, Limeng Qiao, Guanglu Wan, Lin Ma

🧩 TL;DR

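This paper proposes UniComp, a video compression framework driven by information uniqueness. It optimizes compression by minimizing the conditional entropy between retained and full tokens, significantly improving which visual tokens are retained under a limited compute budget.


📘 Detailed Summary

Motivation: Existing attention-based compression methods have limitations. Taking an information-theoretic perspective, this paper aims to maximize the information fidelity of video representations under a constrained compute budget, addressing how to retain key information effectively in visual compression.

Method: The paper introduces the notion of information uniqueness to measure intrinsic redundancy among tokens and designs three progressive modules: Frame Group Fusion for semantic frame grouping, Token Allocation for adaptive resource allocation, and Spatial Dynamic Compression for fine-grained spatial compression. Together they form the UniComp framework.

Result: Extensive experiments show that UniComp consistently outperforms existing compression methods under limited compute budgets and excels at preserving essential visual tokens, confirming the key role of information uniqueness in token compression efficacy.

Conclusion: The study confirms the central value of information uniqueness in visual compression, offers a new perspective for information-theoretic compression methods, and shows the advantage of a progressive semantic compression framework in resource-constrained settings.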


📄 Abstract

Distinct from attention-based compression methods, this paper presents an information-uniqueness-driven video compression framework, termed UniComp, which aims to maximize the information fidelity of video representations under constrained computational budgets. Starting from the information-theoretic perspective, we formulate vision compression as an optimization problem that minimizes conditional entropy (reconstruction error) between retained and full tokens. To achieve this, we introduce the notion of information uniqueness, which measures intrinsic redundancy among tokens and links it to reconstruction error. Based on uniqueness, we design three modules (Frame Group Fusion, Token Allocation, and Spatial Dynamic Compression) that progressively perform semantic frame grouping, adaptive resource allocation, and fine-grained spatial compression. Extensive experiments demonstrate that UniComp consistently outperforms existing compression methods in preserving essential visual tokens under limited computational budgets, highlighting the pivotal role of information uniqueness in token compression efficacy.
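
One simple way to picture redundancy-aware token retention is a greedy selection that keeps the token least similar to those already kept. The sketch below defines uniqueness as one minus the maximum cosine similarity to the kept set; this definition, the greedy loop, and the function names are assumptions for illustration and should not be read as UniComp's actual uniqueness measure or modules.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, defined as 0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def uniqueness(token: Vector, kept: List[Vector]) -> float:
    """Assumed proxy: 1 - max similarity to any already-kept token."""
    if not kept:
        return 1.0
    return 1.0 - max(cosine(token, k) for k in kept)


def select_unique_tokens(tokens: List[Vector], budget: int) -> List[int]:
    """Greedily keep up to `budget` token indices, most unique first."""
    kept: List[int] = []
    remaining = list(range(len(tokens)))
    while remaining and len(kept) < budget:
        kept_vectors = [tokens[j] for j in kept]
        best = max(remaining, key=lambda i: uniqueness(tokens[i], kept_vectors))
        kept.append(best)
        remaining.remove(best)
    return sorted(kept)
```

A token that is nearly a copy of a kept token scores close to zero and is dropped first, which matches the intuition that it can be reconstructed from what was retained.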

[36] Cross-Stain Contrastive Learning for Paired Immunohistochemistry and Histopathology Slide Representation Learning

Yizhi Zhang, Lei Fan, Zhulin Tao, Donglin Di, Yang Song, Sidong Liu, Cong Cong

🧩 TL;DR

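This study proposes the Cross-Stain Contrastive Learning (CSCL) framework, which uses a newly built aligned five-stain dataset to strengthen general-purpose representations of H&E whole-slide images so that they incorporate multi-stain biomarker information, improving performance on computational pathology tasks.


📘 Detailed Summary

Motivation: Universal, transferable whole-slide image representations are central to computational pathology. Combining multi-stain information such as immunohistochemistry with H&E enriches the features, but existing methods are limited by the scarcity of well-aligned multi-stain datasets. Inter-stain misalignment makes tissue correspondence inconsistent, which hinders stable patch-level feature extraction and degrades slide-level embeddings.

Method: The study first curates a slide-level aligned five-stain dataset (H&E, HER2, KI67, ER, PGR) and then proposes CSCL, a two-stage pretraining framework. The first stage trains a lightweight adapter with patch-wise contrastive alignment to improve the compatibility of H&E features with the corresponding IHC contextual cues. The second stage performs slide-level representation learning with multiple instance learning, using a cross-stain attention fusion module to integrate stain-specific patch features and a cross-stain global alignment module to enforce consistency among slide-level embeddings across stains.

Result: Experiments on cancer subtype classification, IHC biomarker status classification, and survival prediction show consistent gains and yield high-quality, transferable H&E slide-level representations, validating the cross-stain contrastive learning framework.

Conclusion: By building an aligned multi-stain dataset and a cross-stain contrastive learning framework, the study addresses the challenge of integrating multi-stain information in computational pathology and offers an effective route to general, biologically meaningful H&E representations. The code and data are publicly available.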


📄 Abstract

Universal, transferable whole-slide image (WSI) representations are central to computational pathology. Incorporating multiple markers (e.g., immunohistochemistry, IHC) alongside H&E enriches H&E-based features with diverse, biologically meaningful information. However, progress is limited by the scarcity of well-aligned multi-stain datasets. Inter-stain misalignment shifts corresponding tissue across slides, hindering consistent patch-level features and degrading slide-level embeddings. To address this, we curated a slide-level aligned, five-stain dataset (H&E, HER2, KI67, ER, PGR) to enable paired H&E-IHC learning and robust cross-stain representation. Leveraging this dataset, we propose Cross-Stain Contrastive Learning (CSCL), a two-stage pretraining framework with a lightweight adapter trained using patch-wise contrastive alignment to improve the compatibility of H&E features with corresponding IHC-derived contextual cues, and slide-level representation learning with Multiple Instance Learning (MIL), which uses a cross-stain attention fusion module to integrate stain-specific patch features and a cross-stain global alignment module to enforce consistency among slide-level embeddings across different stains. Experiments on cancer subtype classification, IHC biomarker status classification, and survival prediction show consistent gains, yielding high-quality, transferable H&E slide-level representations. The code and data are available at https://github.com/lily-zyz/CSCL.
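
Patch-wise contrastive alignment between paired H&E and IHC patches is often formulated as an InfoNCE loss, where patch i in one stain should match patch i in the other. The sketch below uses that formulation as an assumption; the abstract does not state which contrastive objective CSCL uses, and the function names and temperature value are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def info_nce(anchors: List[Vector], positives: List[Vector], temperature: float = 0.1) -> float:
    """Mean InfoNCE loss; anchors[i] is paired with positives[i]."""
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [sum(x * y for x, y in zip(anchor, pos)) / temperature for pos in p]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum - logits[i]
    return total / len(a)
```

Here `anchors` would hold adapted H&E patch embeddings and `positives` the embeddings of the spatially corresponding IHC patches, which is why slide-level alignment of the dataset matters: misaligned pairs would supply wrong positives.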

[37] Dynamic Optical Test for Bot Identification (DOT-BI): A simple check to identify bots in surveys and online processes

Malte Bleeker, Mauro Gotsch

🧩 TL;DR

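This paper proposes DOT-BI (Dynamic Optical Test for Bot Identification), a quick method that uses human motion perception to tell humans from automated systems. A number is hidden in a dynamic background texture so that humans can perceive it while algorithms cannot.


📘 Detailed Summary

Motivation: Online surveys and processes currently lack effective ways to distinguish human respondents from automated systems. Existing verification mechanisms are easily bypassed by advanced AI models, which calls for verification techniques based on perceptual abilities unique to humans.

Method: DOT-BI uses a dynamic optical test: a number is displayed with the same random black-and-white pixel texture as its background, and only the difference in motion and scale between the number and the background makes it perceptible to humans, while frame-by-frame algorithmic processing yields no meaningful signal.

Result: The evaluation shows that state-of-the-art multimodal models (GPT-5-Thinking and Gemini 2.5 Pro) fail to extract the correct value. In an online survey, 99.5% of participants solved the task, with an average completion time of 10.7 seconds, and a lab study found no negative effects on ease of use or completion time relative to a control.

Conclusion: By exploiting human motion perception, DOT-BI offers an effective approach to bot identification that is user-friendly and resists attacks from advanced AI models, providing a new research direction and a practical tool for online verification systems.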


📄 Abstract

We propose the Dynamic Optical Test for Bot Identification (DOT-BI): a quick and easy method that uses human perception of motion to differentiate between human respondents and automated systems in surveys and online processes. In DOT-BI, a 'hidden' number is displayed with the same random black-and-white pixel texture as its background. Only the difference in motion and scale between the number and the background makes the number perceptible to humans across frames, while frame-by-frame algorithmic processing yields no meaningful signal. We conducted two preliminary assessments. Firstly, state-of-the-art, video-capable, multimodal models (GPT-5-Thinking and Gemini 2.5 Pro) fail to extract the correct value, even when given explicit instructions about the mechanism. Secondly, in an online survey (n=182), 99.5% (181/182) of participants solved the task, with an average end-to-end completion time of 10.7 seconds; a supervised lab study (n=39) found no negative effects on perceived ease-of-use or completion time relative to a control. We release code to generate tests and 100+ pre-rendered variants to facilitate adoption in surveys and online processes.
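
The mechanism can be illustrated with a small frame generator in which the hidden shape and the background share the same kind of random binary texture and differ only in how that texture moves. This is a simplified sketch, not the released DOT-BI generator: it models motion only (the abstract also mentions scale), scrolls textures horizontally, and uses a hypothetical `make_frames` function with a boolean mask standing in for the digit shape.

```python
import random
from typing import List, Sequence

Frame = List[List[int]]


def make_frames(
    mask: Sequence[Sequence[bool]],
    num_frames: int,
    bg_shift: int = 1,
    fg_shift: int = 0,
    seed: int = 0,
) -> List[Frame]:
    """Render frames where masked pixels show a texture scrolling by `fg_shift`
    columns per frame and all other pixels show one scrolling by `bg_shift`."""
    rng = random.Random(seed)
    height, width = len(mask), len(mask[0])
    # Two independent random black-and-white textures of the same statistics.
    background = [[rng.randint(0, 1) for _ in range(width)] for _ in range(height)]
    foreground = [[rng.randint(0, 1) for _ in range(width)] for _ in range(height)]
    frames: List[Frame] = []
    for t in range(num_frames):
        frame = [
            [
                foreground[r][(c - t * fg_shift) % width]
                if mask[r][c]
                else background[r][(c - t * bg_shift) % width]
                for c in range(width)
            ]
            for r in range(height)
        ]
        frames.append(frame)
    return frames
```

Any single frame is uniform random noise inside and outside the mask, so a still image carries no outline; the shape only emerges from the relative motion across frames.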

[38] Beyond Boundary Frames: Audio-Visual Semantic Guidance for Context-Aware Video Interpolation

Yuchen Deng, Xiuyang Wu, Hai-Tao Zheng, Jie Wang, Feidiao Yang, Yuxing Han

🧩 TL;DR

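This paper presents BBF (Beyond Boundary Frames), a context-aware video frame interpolation framework that can be guided by audio/visual semantics and outperforms specialized state-of-the-art methods on both generic interpolation and audio-visual synchronized interpolation.


📘 Detailed Summary

Motivation: Existing video frame interpolation methods struggle with fast, complex, and highly non-linear motion. Diffusion-based methods improve on traditional optical-flow-based methods, but they still fail to cover diverse application scenarios and often cannot produce sharp, temporally consistent frames in fine-grained motion tasks such as audio-visual synchronized interpolation.

Method: BBF enhances the input design so that the model can flexibly handle multiple conditional modalities, including text, audio, images, and video. It proposes a decoupled multimodal fusion mechanism that sequentially injects the different conditional signals into a DiT backbone, and it adopts a progressive multi-stage training paradigm in which the start-end frame difference embedding dynamically adjusts data sampling and loss weighting.

Result: Extensive experiments show that BBF outperforms specialized state-of-the-art methods on both generic interpolation and audio-visual synchronized interpolation, establishing a unified framework for video frame interpolation under coordinated multi-channel conditioning that produces sharper, temporally consistent frames.

Conclusion: The study shows the importance of context awareness and multimodal guidance in video frame interpolation. With its flexible input design, decoupled fusion mechanism, and progressive training strategy, BBF addresses interpolation across diverse application scenarios and provides a unified solution for multi-condition video generation.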


📄 Abstract

Handling fast, complex, and highly non-linear motion patterns has long posed challenges for video frame interpolation. Although recent diffusion-based approaches improve upon traditional optical-flow-based methods, they still struggle to cover diverse application scenarios and often fail to produce sharp, temporally consistent frames in fine-grained motion tasks such as audio-visual synchronized interpolation. To address these limitations, we introduce BBF (Beyond Boundary Frames), a context-aware video frame interpolation framework that can be guided by audio/visual semantics. First, we enhance the input design of the interpolation model so that it can flexibly handle multiple conditional modalities, including text, audio, images, and video. Second, we propose a decoupled multimodal fusion mechanism that sequentially injects different conditional signals into a DiT backbone. Finally, to maintain the generation abilities of the foundation model, we adopt a progressive multi-stage training paradigm, where the start-end frame difference embedding is used to dynamically adjust both the data sampling and the loss weighting. Extensive experimental results demonstrate that BBF outperforms specialized state-of-the-art methods on both generic interpolation and audio-visual synchronized interpolation tasks, establishing a unified framework for video frame interpolation under coordinated multi-channel conditioning.

[39] Colon-X: Advancing Intelligent Colonoscopy from Multimodal Understanding to Clinical Reasoning

Ge-Peng Ji, Jingyi Liu, Deng-Ping Fan, Nick Barnes

🧩 TL;DR

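This study presents Colon-X, an open initiative that builds ColonVQA, the most comprehensive multimodal colonoscopy dataset, and develops ColonR1, the first R1-styled model, which uses task-adaptive rewarding and gradient-stable optimization to move from multimodal understanding to clinical reasoning under data-scarce conditions.


📘 Detailed Summary

Motivation: The study targets the key transition in colonoscopy multimodal intelligence from understanding to clinical reasoning. The clinical outputs of existing multimodal large language models lack robustness and trustworthiness, which calls for reasoning-centric intelligence tailored to colonoscopy.

Method: The study first builds ColonVQA, a multimodal dataset with over 1.1M visual question answering entries covering 76 clinical findings and 18 multimodal tasks. It then curates ColonReason, a clinical reasoning dataset annotated through a multi-expert debating pipeline, and develops ColonR1, the first R1-styled colonoscopy reasoning model, which incorporates task-adaptive rewarding and gradient-stable optimization.

Result: A systematic assessment of the generalizability and perturbation robustness of 22 multimodal large language models shows that their clinical outputs remain far from robust and trustworthy. Under data-scarce conditions, ColonR1 achieves 56.61% overall accuracy, outperforming supervised fine-tuning by 25.22%, and sets a new reasoning-enabled baseline for multimodal colonoscopy analysis.

Conclusion: The study moves from multimodal understanding to clinical reasoning and provides a data foundation and model baseline for intelligent colonoscopy analysis. ColonR1's performance under data-scarce conditions demonstrates the effectiveness of task-adaptive rewarding and gradient-stable optimization. All data and model resources are publicly available to support the community.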


📄 Abstract

In this study, we present Colon-X, an open initiative aimed at advancing multimodal intelligence in colonoscopy. We begin by constructing ColonVQA, the most comprehensive multimodal dataset ever built for colonoscopy, featuring over 1.1M visual question answering entries across 76 clinical findings and 18 multimodal tasks. Beyond serving as a community-wide data foundation, we further investigate a critical yet underexplored transition in colonoscopy, evolving from multimodal understanding to clinical reasoning: (a) To capture the current landscape of multimodal understanding behaviors, we systematically assess the generalizability of 22 multimodal large language models and examine their reliability under human-induced perturbations. The results reveal that clinical outputs from leading MLLMs remain far from robust and trustworthy. (b) To narrow this gap, we further explore reasoning-centric intelligence tailored for colonoscopy. Specifically, we curate ColonReason, a clinically grounded reasoning dataset annotated through a multi-expert debating pipeline, and develop ColonR1, the first R1-styled model incorporating task-adaptive rewarding and gradient-stable optimization techniques. Under data-scarce conditions, our ColonR1 achieves 56.61% overall accuracy, outperforming supervised fine-tuning by 25.22%, and sets a new reasoning-enabled baseline for multimodal colonoscopy analysis. All data and model resources are publicly available at https://github.com/ai4colonoscopy/Colon-X.

[40] GaussianBlender: Instant Stylization of 3D Gaussians with Disentangled Latent Spaces

Melis Ocal, Xiaoyan Xing, Yue Li, Ngo Anh Vien, Sezer Karaoglu, Theo Gevers

🧩 TL;DR

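This paper presents GaussianBlender, a feed-forward framework for text-driven 3D stylization that performs edits instantly at inference, addressing the per-asset optimization and multi-view inconsistency of existing methods.


📘 Detailed Summary

Motivation: Existing text-to-3D stylization methods are typically distilled from 2D image editors. They require time-consuming per-asset optimization and exhibit multi-view inconsistency due to the limitations of current text-to-image models, which makes them impractical for large-scale production.

Method: The method learns structured, disentangled latent spaces from spatially grouped 3D Gaussians, with controlled information sharing between geometry and appearance, and then uses a latent diffusion model to apply text-conditioned edits on these learned representations.

Result: Comprehensive evaluations show that GaussianBlender delivers instant, high-fidelity, geometry-preserving, multi-view consistent stylization and also surpasses methods that require per-instance test-time optimization, enabling practical, democratized 3D stylization at scale.

Conclusion: The study provides a feed-forward framework for instant 3D stylization that overcomes the limitations of existing methods in large-scale production, offering a practical and scalable solution for 3D asset creation in game development, virtual reality, and digital arts.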


📄 Abstract

3D stylization is central to game development, virtual reality, and digital arts, where the demand for diverse assets calls for scalable methods that support fast, high-fidelity manipulation. Existing text-to-3D stylization methods typically distill from 2D image editors, requiring time-intensive per-asset optimization and exhibiting multi-view inconsistency due to the limitations of current text-to-image models, which makes them impractical for large-scale production. In this paper, we introduce GaussianBlender, a pioneering feed-forward framework for text-driven 3D stylization that performs edits instantly at inference. Our method learns structured, disentangled latent spaces with controlled information sharing for geometry and appearance from spatially-grouped 3D Gaussians. A latent diffusion model then applies text-conditioned edits on these learned representations. Comprehensive evaluations show that GaussianBlender not only delivers instant, high-fidelity, geometry-preserving, multi-view consistent stylization, but also surpasses methods that require per-instance test-time optimization - unlocking practical, democratized 3D stylization at scale.

[41] Active Visual Perception: Opportunities and Challenges

Yian Li, Xiaoyu Guo, Hao Zhang, Shuiwang Li, Xiaowei Dai

🧩 TL;DR

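This paper gives a comprehensive survey of active visual perception, examining how systems acquire information in dynamic environments through the interplay of sensing and action, and systematically analyzing the field's opportunities, challenges, and future directions.


📘 Detailed Summary

Motivation: Active visual perception systems interact with their environment through dynamic sensing and action, but they face challenges in real-time processing of complex visual data, decision-making in dynamic environments, and multimodal sensory fusion. This paper systematically reviews the field's opportunities and obstacles to provide a foundation for broader adoption.

Method: The paper takes a survey approach, systematically analyzing the core concepts, technical frameworks, and application scenarios of active visual perception, with a focus on key techniques such as directing attention, moving sensors, and interacting with objects, and on how they are realized in complex environments.

Result: The study gives an overview of active visual perception applications in robotics, autonomous vehicles, human-computer interaction, and surveillance systems, identifies key technical challenges such as real-time data processing, dynamic decision-making, and multimodal fusion, and points out the limitations of current research and future directions.

Conclusion: Active visual perception represents a paradigm shift from passive observation to active interaction. Despite its technical challenges, it has clear advantages in complex dynamic environments; further research on real-time algorithms, decision frameworks, and cross-modal integration is needed for broader practical use.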


📄 Abstract

Active visual perception refers to the ability of a system to dynamically engage with its environment through sensing and action, allowing it to modify its behavior in response to specific goals or uncertainties. Unlike passive systems that rely solely on visual data, active visual perception systems can direct attention, move sensors, or interact with objects to acquire more informative data. This approach is particularly powerful in complex environments where static sensing methods may not provide sufficient information. Active visual perception plays a critical role in numerous applications, including robotics, autonomous vehicles, human-computer interaction, and surveillance systems. However, despite its significant promise, there are several challenges that need to be addressed, including real-time processing of complex visual data, decision-making in dynamic environments, and integrating multimodal sensory inputs. This paper explores both the opportunities and challenges inherent in active visual perception, providing a comprehensive overview of its potential, current research, and the obstacles that must be overcome for broader adoption.

[42] PosA-VLA: Enhancing Action Generation via Pose-Conditioned Anchor Attention

Ziwen Li, Xin Wang, Hanlue Zhang, Runnan Chen, Runqi Lin, Xiao He, Han Huang, Yandong Guo, Fakhri Karray, Tongliang Liu, Mingming Gong

🧩 TL;DR

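This paper proposes PosA-VLA, an efficient framework that anchors visual attention via pose-conditioned supervision, addressing the redundant actions that existing vision-language-action models produce in complex environments and significantly improving the precision and efficiency of action generation.


📘 Detailed Summary

Motivation: Current vision-language-action models still struggle to produce consistent and precise target-oriented actions in embodied tasks; they often generate redundant or unstable motions along trajectories, which limits their use in time-sensitive scenarios. The authors attribute these redundant actions to the spatially uniform perception field of existing VLA models, which makes them easily distracted by target-irrelevant objects in complex environments.

Method: The paper proposes the efficient PosA-VLA framework, which anchors visual attention via pose-conditioned supervision and consistently guides the model's perception toward task-relevant regions. The pose-conditioned anchor attention mechanism lets the model better align instruction semantics with actionable visual cues, improving action generation precision and efficiency. The framework uses a lightweight architecture and needs no auxiliary perception modules, ensuring efficient inference.

Result: Extensive experiments verify that the method executes embodied tasks precisely and time-efficiently across diverse robotic manipulation benchmarks and generalizes robustly in a variety of challenging environments. Compared with existing methods, PosA-VLA significantly improves both the precision and the efficiency of action generation.

Conclusion: By anchoring visual attention via pose-conditioned supervision, the study addresses redundant actions of VLA models in complex environments and gives embodied systems a more precise and efficient action generation framework. Its lightweight design, which needs no extra perception modules, is an advantage in practice and offers a feasible solution for robotic manipulation in time-sensitive scenarios.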


📄 Abstract

Vision-Language-Action (VLA) models have demonstrated remarkable performance on embodied tasks and shown promising potential for real-world applications. However, current VLAs still struggle to produce consistent and precise target-oriented actions, as they often generate redundant or unstable motions along trajectories, limiting their applicability in time-sensitive scenarios. In this work, we attribute these redundant actions to the spatially uniform perception field of existing VLAs, which causes them to be distracted by target-irrelevant objects, especially in complex environments. To address this issue, we propose an efficient PosA-VLA framework that anchors visual attention via pose-conditioned supervision, consistently guiding the model's perception toward task-relevant regions. The pose-conditioned anchor attention mechanism enables the model to better align instruction semantics with actionable visual cues, thereby improving action generation precision and efficiency. Moreover, our framework adopts a lightweight architecture and requires no auxiliary perception modules (e.g., segmentation or grounding networks), ensuring efficient inference. Extensive experiments verify that our method executes embodied tasks with precise and time-efficient behavior across diverse robotic manipulation benchmarks and shows robust generalization in a variety of challenging environments.

[43] Dual-level Modality Debiasing Learning for Unsupervised Visible-Infrared Person Re-Identification

Jiaze Li, Yan Lu, Bin Liu, Guojun Yin, Mang Ye

🧩 TL;DR

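This paper proposes a Dual-level Modality Debiasing Learning (DMDL) framework that tackles modality bias in unsupervised visible-infrared person re-identification through interventions at both the model level and the optimization level, achieving more generalizable modality-invariant feature learning.


📘 Detailed Summary

Motivation: Existing two-stage methods for unsupervised visible-infrared person re-identification introduce modality-specific cues during single-modality learning. These biases propagate into the cross-modality learning stage and impair identity discrimination and generalization, so modality bias needs to be addressed.

Method: The paper proposes a dual-level modality debiasing learning framework. At the model level, a causality-inspired adjustment intervention module replaces likelihood-based modeling with causal modeling to prevent modality-induced spurious patterns. At the optimization level, a collaborative bias-free training strategy interrupts the propagation of modality bias across data, labels, and features through modality-specific augmentation, label refinement, and feature alignment.

Result: Extensive experiments on benchmark datasets show that DMDL enables modality-invariant feature learning and a more generalizable model, with competitive performance on unsupervised visible-infrared person re-identification.

Conclusion: Through debiasing at both the model and optimization levels, the study addresses modality bias and offers a new debiasing framework for cross-modal learning. It indicates that handling bias propagation in both the model structure and the training process is key to generalizable modality-invariant representations.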


📄 Abstract

The two-stage learning pipeline has achieved promising results in unsupervised visible-infrared person re-identification (USL-VI-ReID). It first performs single-modality learning and then performs cross-modality learning to tackle the modality discrepancy. Although promising, this pipeline inevitably introduces modality bias: modality-specific cues learned during single-modality training naturally propagate into the subsequent cross-modality learning, impairing identity discrimination and generalization. To address this issue, we propose a Dual-level Modality Debiasing Learning (DMDL) framework that implements debiasing at both the model and optimization levels. At the model level, we propose a Causality-inspired Adjustment Intervention (CAI) module that replaces likelihood-based modeling with causal modeling, preventing modality-induced spurious patterns from being introduced and leading to a low-biased model. At the optimization level, a Collaborative Bias-free Training (CBT) strategy is introduced to interrupt the propagation of modality bias across data, labels, and features by integrating modality-specific augmentation, label refinement, and feature alignment. Extensive experiments on benchmark datasets demonstrate that DMDL enables modality-invariant feature learning and yields a more generalized model.

[44] Fully Unsupervised Self-debiasing of Text-to-Image Diffusion Models

Korada Sri Vardhana, Shrikrishna Lolla, Soma Biswas

🧩 TL;DR

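This paper proposes SelfDebias, a fully unsupervised test-time debiasing method applicable to any diffusion model that uses a UNet as its noise predictor. It automatically identifies semantic modes and guides generation to reduce bias while preserving visual quality.


📘 Detailed Summary

Motivation: Text-to-image diffusion models trained on large internet datasets such as LAION-5B learn and reproduce the many biases in the data, producing stereotypical outputs. Existing methods usually require human-annotated datasets or an external classifier trained for each generated concept, which limits their applicability and scalability.

Method: SelfDebias is a fully unsupervised test-time debiasing method. It identifies semantic clusters in an image encoder's embedding space and uses them to guide the diffusion process during inference, minimizing the KL divergence between the output distribution and the uniform distribution. It relies on neither human-annotated datasets nor external classifiers and identifies semantic modes automatically.

Result: Extensive experiments show that SelfDebias generalizes across prompts and diffusion model architectures, including conditional and unconditional models. It effectively reduces bias along key demographic dimensions while maintaining the visual fidelity of generated images, and it also handles more abstract concepts for which identifying biases is harder.

Conclusion: The study provides an unsupervised debiasing framework that adapts automatically to different concepts and model architectures, opening a new direction for fairness research on diffusion models. It suggests that unsupervised methods can identify and mitigate complex biases without compromising generation quality.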


📄 Abstract

Text-to-image (T2I) diffusion models have achieved widespread success due to their ability to generate high-resolution, photorealistic images. These models are trained on large-scale datasets, like LAION-5B, often scraped from the internet. However, since this data contains numerous biases, the models inherently learn and reproduce them, resulting in stereotypical outputs. We introduce SelfDebias, a fully unsupervised test-time debiasing method applicable to any diffusion model that uses a UNet as its noise predictor. SelfDebias identifies semantic clusters in an image encoder's embedding space and uses these clusters to guide the diffusion process during inference, minimizing the KL divergence between the output distribution and the uniform distribution. Unlike supervised approaches, SelfDebias does not require human-annotated datasets or external classifiers trained for each generated concept. Instead, it is designed to automatically identify semantic modes. Extensive experiments show that SelfDebias generalizes across prompts and diffusion model architectures, including both conditional and unconditional models. It not only effectively debiases images along key demographic dimensions while maintaining the visual fidelity of the generated images, but also handles more abstract concepts for which identifying biases is itself challenging.
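
The quantity being minimized can be made concrete for a batch of generated images that have each been assigned to one semantic cluster. The sketch below computes KL(p ‖ u) between the empirical cluster distribution p and the uniform distribution u over K clusters, which equals log K minus the entropy of p. The direction of the divergence and the use of hard cluster counts are assumptions; the abstract does not say how SelfDebias estimates or differentiates this term.

```python
import math
from typing import Sequence


def kl_to_uniform(cluster_counts: Sequence[int]) -> float:
    """KL divergence (in nats) from the empirical cluster distribution
    to the uniform distribution over the same clusters."""
    total = sum(cluster_counts)
    if total == 0:
        raise ValueError("at least one sample is required")
    num_clusters = len(cluster_counts)
    kl = 0.0
    for count in cluster_counts:
        if count > 0:  # terms with p = 0 contribute nothing
            p = count / total
            kl += p * math.log(p * num_clusters)
    return kl
```

A perfectly balanced batch scores zero, and a batch that collapses onto a single cluster scores log K, so the value gives a bounded measure of how far the outputs are from balanced.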

[45] Heatmap Pooling Network for Action Recognition from RGB Videos

Mengyuan Liu, Jinfu Liu, Yongkang Jiang, Bin He

🧩 TL;DR

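This paper proposes a novel heatmap pooling network (HP-Net) for video action recognition. A feedback pooling module extracts information-rich, robust, and concise pooled features of the human body, and multimodal fusion modules enable more robust action recognition.


📘 Detailed Summary

Motivation: Existing methods that extract deep features from RGB videos for action recognition suffer from information redundancy, susceptibility to noise, and high storage costs. The useful information in videos needs to be fully exploited, with more efficient and robust feature extraction.

Method: The paper proposes the heatmap pooling network (HP-Net). It includes a feedback pooling module that extracts information-rich and robust pooled features of the human body, and a spatial-motion co-learning module and a text refinement modulation module that integrate the pooled features with other multimodal data.

Result: Extensive experiments on several benchmarks, including NTU RGB+D 60, NTU RGB+D 120, Toyota-Smarthome, and UAV-Human, verify the effectiveness of HP-Net, which outperforms existing human action recognition methods.

Conclusion: With its pooled feature extraction and multimodal fusion mechanisms, HP-Net offers a more efficient and robust solution for video action recognition. The extracted pooled features show clear performance advantages over conventional pose data and heatmap features.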


📄 Abstract

Human action recognition (HAR) in videos has garnered widespread attention due to the rich information in RGB videos. Nevertheless, existing methods for extracting deep features from RGB videos face challenges such as information redundancy, susceptibility to noise, and high storage costs. To address these issues and fully harness the useful information in videos, we propose a novel heatmap pooling network (HP-Net) for action recognition from videos, which extracts information-rich, robust, and concise pooled features of the human body in videos through a feedback pooling module. The extracted pooled features demonstrate obvious performance advantages over the previously obtained pose data and heatmap features from videos. In addition, we design a spatial-motion co-learning module and a text refinement modulation module to integrate the extracted pooled features with other multimodal data, enabling more robust action recognition. Extensive experiments on several benchmarks, namely NTU RGB+D 60, NTU RGB+D 120, Toyota-Smarthome, and UAV-Human, consistently verify the effectiveness of our HP-Net, which outperforms the existing human action recognition methods. Our code is publicly available at: https://github.com/liujf69/HPNet-Action.
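
A generic form of heatmap pooling is a weighted average of a feature map, with a body-joint heatmap supplying the weights, so that one compact vector per joint replaces the full spatial grid. The sketch below shows only that generic operation under stated assumptions: nested lists stand in for tensors, the function name is hypothetical, and the feedback part of the paper's feedback pooling module is not modeled because the abstract does not describe it.

```python
from typing import List, Sequence


def heatmap_pool(
    features: Sequence[Sequence[Sequence[float]]],  # shape: height x width x channels
    heatmap: Sequence[Sequence[float]],             # shape: height x width, non-negative
) -> List[float]:
    """Return the heatmap-weighted average feature vector."""
    channels = len(features[0][0])
    pooled = [0.0] * channels
    weight_sum = 0.0
    for feature_row, heat_row in zip(features, heatmap):
        for feature, weight in zip(feature_row, heat_row):
            weight_sum += weight
            for c in range(channels):
                pooled[c] += weight * feature[c]
    if weight_sum == 0:
        return pooled  # an all-zero heatmap yields a zero vector
    return [value / weight_sum for value in pooled]
```

Storing one such vector per joint and frame is far smaller than storing the dense feature map, which is one way to read the abstract's claim about concise features and storage cost.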

[46] CoDA: From Text-to-Image Diffusion Models to Training-Free Dataset Distillation

Letian Zhou, Songhua Liu, Xinchao Wang

🧩 TL;DR

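This paper proposes the Core Distribution Alignment (CoDA) framework, which performs dataset distillation with an off-the-shelf text-to-image model, requiring no generative model pre-trained on the target dataset, while resolving the distributional mismatch between general-purpose generative priors and target semantics.


📘 Detailed Summary

Motivation: Existing dataset distillation methods face two fundamental limitations. Most diffusion-based methods require a model pre-trained on the full target dataset, which undermines the purpose of dataset distillation and is expensive to train. Methods that rely on general text-to-image models suffer from a significant distributional mismatch, because web-scale priors fail to capture target-specific semantics accurately, leading to suboptimal performance.

Method: CoDA first identifies the "intrinsic core distribution" of the target dataset with a robust density-based discovery mechanism, then steers the generative process so that generated samples align with this core distribution. This bridges the gap between general-purpose generative priors and target semantics and yields highly representative distilled datasets.

Result: Experiments show that, without a generative model trained specifically on the target dataset, CoDA matches or even surpasses previous methods that rely on such models across all benchmarks, including ImageNet-1K and its subsets. It reaches a new state-of-the-art accuracy of 60.4% at 50 images per class on ImageNet-1K.

Conclusion: The study demonstrates that efficient dataset distillation with an off-the-shelf text-to-image model is feasible. Core distribution alignment resolves the mismatch between general-purpose generative priors and target semantics, giving a more practical and cost-effective solution that avoids expensive dataset-specific model training.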


📄 Abstract

Prevailing Dataset Distillation (DD) methods leveraging generative models confront two fundamental limitations. First, despite pioneering the use of diffusion models in DD and delivering impressive performance, the vast majority of approaches paradoxically require a diffusion model pre-trained on the full target dataset, undermining the very purpose of DD and incurring prohibitive training costs. Second, although some methods turn to general text-to-image models without relying on such target-specific training, they suffer from a significant distributional mismatch, as the web-scale priors encapsulated in these foundation models fail to faithfully capture the target-specific semantics, leading to suboptimal performance. To tackle these challenges, we propose Core Distribution Alignment (CoDA), a framework that enables effective DD using only an off-the-shelf text-to-image model. Our key idea is to first identify the "intrinsic core distribution" of the target dataset using a robust density-based discovery mechanism. We then steer the generative process to align the generated samples with this core distribution. By doing so, CoDA effectively bridges the gap between general-purpose generative priors and target semantics, yielding highly representative distilled datasets. Extensive experiments suggest that, without relying on a generative model specifically trained on the target dataset, CoDA achieves performance on par with or even superior to previous methods with such reliance across all benchmarks, including ImageNet-1K and its subsets. Notably, it establishes a new state-of-the-art accuracy of 60.4% at the 50-images-per-class (IPC) setup on ImageNet-1K. Our code is available on the project webpage: https://github.com/zzzlt422/CoDA
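
As an illustration of what a density-based core discovery step could look like, the sketch below ranks samples by the distance to their k-th nearest neighbor and keeps the densest ones. The k-NN density proxy, the function names, and the `k` and `keep` parameters are assumptions chosen for brevity; the abstract does not specify CoDA's actual discovery mechanism or the feature space it operates in.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def kth_neighbor_distance(points: List[Point], index: int, k: int) -> float:
    """Distance from points[index] to its k-th nearest other point.
    Requires 1 <= k <= len(points) - 1."""
    distances = sorted(
        math.dist(points[index], other)
        for j, other in enumerate(points)
        if j != index
    )
    return distances[k - 1]


def core_indices(points: List[Point], k: int, keep: int) -> List[int]:
    """Indices of the `keep` samples in the densest regions
    (smallest k-th neighbor distance), returned in ascending order."""
    radius = [kth_neighbor_distance(points, i, k) for i in range(len(points))]
    ranked = sorted(range(len(points)), key=lambda i: radius[i])
    return sorted(ranked[:keep])
```

Samples far from any cluster get a large radius and are excluded, so the retained set approximates the high-density core toward which generation would then be steered.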

[47] Zero-Shot Video Translation and Editing with Frame Spatial-Temporal Correspondence

Shuai Yang, Junxin Lin, Yifan Zhou, Ziwei Liu, Chen Change Loy

🧩 TL;DR

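This paper proposes FRESCO, a framework that combines intra-frame and inter-frame correspondence to strengthen the spatial-temporal constraint. It markedly improves the spatial-temporal consistency of zero-shot video editing and performs well on video-to-video translation and text-guided video editing.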


📘 Detailed Summary

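Motivation: Current zero-shot video editing methods mainly integrate inter-frame correspondence into attention mechanisms, but the resulting soft constraint is insufficient for identifying valid features and can lead to temporal inconsistency. A more robust spatial-temporal constraint is needed to keep edited videos visually coherent.

Method: FRESCO integrates intra-frame correspondence with inter-frame correspondence to form a more robust spatial-temporal constraint. Beyond attention guidance, it explicitly optimizes features so that semantically similar content is transformed consistently across frames.

Result: Comprehensive experiments on two zero-shot tasks, video-to-video translation and text-guided video editing, show that FRESCO generates high-quality, coherent videos and improves significantly over existing zero-shot methods.

Conclusion: By integrating intra-frame and inter-frame correspondence into a more robust spatial-temporal constraint, the work offers an effective way to improve the visual consistency of zero-shot video editing, marks a significant advance over current zero-shot methods, and suggests new directions for future video editing techniques.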


📄 Abstract

The remarkable success of text-to-image diffusion models has motivated extensive investigation of their potential for video applications. Zero-shot techniques aim to adapt image diffusion models for videos without requiring further model training. Recent methods largely emphasize integrating inter-frame correspondence into attention mechanisms. However, the soft constraint applied to identify the valid features to attend to is insufficient, which could lead to temporal inconsistency. In this paper, we present FRESCO, which integrates intra-frame correspondence with inter-frame correspondence to formulate a more robust spatial-temporal constraint. This enhancement ensures a consistent transformation of semantically similar content between frames. Our method goes beyond attention guidance to explicitly optimize features, achieving high spatial-temporal consistency with the input video, significantly enhancing the visual coherence of manipulated videos. We verify FRESCO adaptations on two zero-shot tasks of video-to-video translation and text-guided video editing. Comprehensive experiments demonstrate the effectiveness of our framework in generating high-quality, coherent videos, highlighting a significant advance over current zero-shot methods.

[48] UniMo: Unifying 2D Video and 3D Human Motion with an Autoregressive Framework

Youxin Pang, Yong Zhang, Ruizhi Shao, Xiang Deng, Feng Gao, Xu Xiaoming, Xiaoming Wei, Yebin Liu

🧩 TL;DR

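This paper proposes UniMo, an autoregressive model that, for the first time, jointly models 2D human videos and 3D human motions within a unified framework. It enables simultaneous generation and understanding of both modalities and opens a path toward multimodal fusion of human-centric information.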


📘 Detailed Summary

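Motivation: Existing methods mostly generate one modality conditioned on the other, or integrate either modality with others such as text and audio. Unifying 2D videos and 3D motions for simultaneous optimization and generation remains largely unexplored and is challenging because the two differ substantially in structure and distribution.

Method: The method models videos and 3D motions as a unified token sequence, uses separate embedding layers to mitigate distribution gaps, and adopts a sequence modeling strategy that integrates two distinct tasks. To align efficiently with visual tokens and preserve 3D spatial information, it introduces a new 3D motion tokenizer with a temporal expansion strategy that uses a single VQ-VAE to produce quantized motion tokens. The tokenizer has multiple expert decoders that handle body shape, translation, global orientation, and body pose for reliable 3D motion reconstruction.

Result: Extensive experiments show that the method simultaneously generates corresponding videos and motions while performing accurate motion capture, supporting the effectiveness of unified modeling and showing strong performance across multiple tasks.

Conclusion: The work taps into the capacity of LLMs to fuse diverse data types, paving the way for integrating human-centric information into existing models and potentially enabling multimodal, controllable joint modeling of humans, objects, and scenes. It represents an advance in cross-modal human representation learning.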


📄 Abstract

We propose UniMo, an innovative autoregressive model for joint modeling of 2D human videos and 3D human motions within a unified framework, enabling simultaneous generation and understanding of these two modalities for the first time. Current methods predominantly focus on generating one modality given another as the condition or integrating either of them with other modalities such as text and audio. Unifying 2D videos and 3D motions for simultaneous optimization and generation remains largely unexplored, presenting significant challenges due to their substantial structural and distributional differences. Inspired by the LLM's ability to unify different modalities, our method models videos and 3D motions as a unified token sequence, utilizing separate embedding layers to mitigate distribution gaps. Additionally, we devise a sequence modeling strategy that integrates two distinct tasks within a single framework, proving the effectiveness of unified modeling. Moreover, to efficiently align with visual tokens and preserve 3D spatial information, we design a novel 3D motion tokenizer with a temporal expansion strategy, using a single VQ-VAE to produce quantized motion tokens. It features multiple expert decoders that handle body shapes, translation, global orientation, and body poses for reliable 3D motion reconstruction. Extensive experiments demonstrate that our method simultaneously generates corresponding videos and motions while performing accurate motion capture. This work taps into the capacity of LLMs to fuse diverse data types, paving the way for integrating human-centric information into existing models and potentially enabling multimodal, controllable joint modeling of humans, objects, and scenes.
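
The quantization step of a VQ-VAE, which the abstract names as the source of motion tokens, can be sketched as a nearest-neighbour lookup in a codebook. This is the generic VQ-VAE operation, not UniMo's tokenizer: the codebook values, the latent dimension, and the function name `quantize` are illustrative assumptions, and the temporal expansion strategy and expert decoders are not modelled.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry."""
    # Squared Euclidean distance between every latent (N, D)
    # and every codebook entry (K, D), giving an (N, K) matrix.
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    tokens = d2.argmin(axis=1)        # discrete token ids
    return tokens, codebook[tokens]   # ids and their quantized vectors

# Toy codebook with three 2-D entries (assumed values).
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
latents = np.array([[0.1, -0.1], [0.9, 1.2], [-0.8, 0.7]])
tokens, quantized = quantize(latents, codebook)
```

The resulting integer ids are what an autoregressive model would consume alongside visual tokens in a single sequence.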

[49] TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning

Tao Wu, Li Yang, Gen Zhan, Yiting Liao, Junlin Li, Deliang Fu, Li Zhang, Limin Wang

🧩 TL;DR

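This paper proposes TempR1, a temporal-aware multi-task reinforcement learning framework for multimodal large language models (MLLMs). It systematically strengthens their understanding of video temporal structure through multi-task optimization and achieves state-of-the-art performance on multiple benchmarks.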


📘 Detailed Summary

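Motivation: Existing reinforcement learning approaches for improving the temporal understanding of MLLMs are often confined to limited task types and data, which restricts generalization across diverse temporal understanding scenarios. A general framework that systematically strengthens temporal comprehension is therefore needed.

Method: TempR1 builds on the Group Relative Policy Optimization (GRPO) algorithm. It curates a multi-task corpus covering diverse temporal structures and semantics, categorizes temporal tasks into three correspondence types between predicted intervals and ground-truth instances, and designs a tailored localization reward for each type to achieve stable and effective cross-task optimization.

Result: Experiments show that TempR1 attains state-of-the-art performance on multiple benchmarks, and that joint optimization over complementary tasks yields a strong synergistic effect that improves both generalization and single-task performance.

Conclusion: The work establishes a scalable and principled paradigm for temporal reasoning in MLLMs, shows that multi-task reinforcement learning can systematically strengthen temporal understanding, and offers a new technical route for long-form video analysis.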


📄 Abstract

Enhancing the temporal understanding of Multimodal Large Language Models (MLLMs) is essential for advancing long-form video analysis, enabling tasks such as temporal localization, action detection, and time-sensitive question answering. While reinforcement learning (RL) has recently been explored for improving temporal reasoning, existing approaches are often confined to limited task types and data, restricting their generalization across diverse temporal understanding scenarios. To address this challenge, we present TempR1, a temporal-aware multi-task reinforcement learning framework that systematically strengthens MLLMs' temporal comprehension. We curate a multi-task corpus that exposes the model to diverse temporal structures and semantics, and build upon the Group Relative Policy Optimization (GRPO) algorithm to achieve stable and effective cross-task optimization. Specifically, we categorize temporal tasks into three correspondence types between predicted intervals and ground-truth instances, and design tailored localization rewards for each, enabling TempR1 to capture fine-grained temporal dependencies and adapt to different temporal patterns. Extensive experiments demonstrate that TempR1 attains state-of-the-art performance across multiple benchmarks. Moreover, its joint optimization over complementary tasks yields a strong synergistic effect, enhancing both generalization and single-task performance, establishing a scalable and principled paradigm for temporal reasoning in MLLMs.
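
The abstract does not give the form of the localization rewards. A common building block for rewards of this kind is temporal IoU between a predicted interval and a ground-truth interval, sketched below for the simplest case of one prediction matched to one instance. Treating temporal IoU as the reward is an assumption made for illustration, not the paper's definition.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two (start, end) intervals given in seconds."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap is zero when the intervals do not intersect.
    inter = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - inter
    return inter / union if union > 0 else 0.0

partial = temporal_iou((2.0, 6.0), (4.0, 8.0))   # 2 s overlap, 6 s union
disjoint = temporal_iou((0.0, 1.0), (2.0, 3.0))
exact = temporal_iou((3.0, 5.0), (3.0, 5.0))
```

A reward of this shape is dense: a prediction that overlaps the target partially still receives credit, which gives a policy-gradient method a smoother signal than an exact-match score.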

[50] Ultra-lightweight Neural Video Representation Compression

Ho Man Kwan, Tianhao Peng, Ge Gao, Fan Zhang, Mike Nilsson, Andrew Gower, David Bull

🧩 TL;DR

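This work proposes NVRC-Lite, a lightweight neural video representation compression framework that integrates multi-scale feature grids and an octree-based context model, improving compression performance and accelerating entropy coding while keeping computational complexity low.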


📘 Detailed Summary

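Motivation: Video compression methods based on implicit neural representations (INRs) perform well but face two challenges. Lightweight INRs still leave room for improvement at low complexity, and existing methods typically use autoregressive models for entropy coding, which are effective but slow, limiting practical use.

Method: NVRC-Lite makes two key changes. First, it integrates multi-scale feature grids into the lightweight neural representation, where higher-resolution grids significantly improve INR performance at low complexity. Second, it proposes an octree-based context model for entropy coding high-dimensional feature grids in place of an autoregressive model, which accelerates the entropy coding module.

Result: NVRC-Lite outperforms C3, one of the best lightweight INR-based video codecs, with BD-rate savings of up to 21.03% and 23.06% when measured in PSNR and MS-SSIM, respectively. It also achieves 8.4x encoding and 2.5x decoding speedup, with computational complexity kept below 10 kMACs/pixel.

Conclusion: The work shows that combining multi-scale feature grids with an efficient entropy coding model can significantly improve lightweight INR video compression while keeping computational complexity low, offering a feasible route to practical deployment and demonstrating the value of octree-based context models for accelerating entropy coding.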


📄 Abstract

Recent works have demonstrated the viability of utilizing over-fitted implicit neural representations (INRs) as alternatives to autoencoder-based models for neural video compression. Among these INR-based video codecs, Neural Video Representation Compression (NVRC) was the first to adopt a fully end-to-end compression framework that compresses INRs, achieving state-of-the-art performance. Moreover, some recently proposed lightweight INRs have shown comparable performance to their baseline codecs with computational complexity lower than 10kMACs/pixel. In this work, we extend NVRC toward lightweight representations, and propose NVRC-Lite, which incorporates two key changes. Firstly, we integrated multi-scale feature grids into our lightweight neural representation, and the use of higher resolution grids significantly improves the performance of INRs at low complexity. Secondly, we address the issue that existing INRs typically leverage autoregressive models for entropy coding: these are effective but impractical due to their slow coding speed. In this work, we propose an octree-based context model for entropy coding high-dimensional feature grids, which accelerates the entropy coding module of the model. Our experimental results demonstrate that NVRC-Lite outperforms C3, one of the best lightweight INR-based video codecs, with up to 21.03% and 23.06% BD-rate savings when measured in PSNR and MS-SSIM, respectively, while achieving 8.4x encoding and 2.5x decoding speedup. The implementation of NVRC-Lite will be made available.
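
The BD-rate figures above are computed from rate-distortion curves, with PSNR as one of the two distortion measures. As a reminder of what that measure is, here is a minimal sketch of the standard PSNR formula for 8-bit frames. The function name and the 8-bit peak value are assumptions for this example; the BD-rate interpolation itself is not shown.

```python
import numpy as np

def psnr(reference, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two frames of equal shape."""
    ref = reference.astype(np.float64)
    rec = reconstructed.astype(np.float64)
    mse = np.mean((ref - rec) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(max_value ** 2 / mse)

frame = np.full((4, 4), 100, dtype=np.uint8)
off_by_one = psnr(frame, frame + 1)  # MSE = 1
identical = psnr(frame, frame)
```

With an error of one grey level on every pixel the MSE is 1, so the result is 20·log10(255), about 48.13 dB.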

[51] PosterCopilot: Toward Layout Reasoning and Controllable Editing for Professional Graphic Design

Jiazhe Wei, Ken Li, Tianyu Lao, Haofan Wang, Liang Wang, Caifeng Shan, Chenyang Si

🧩 TL;DR

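This paper proposes PosterCopilot, a framework that uses a progressive three-stage training strategy to strengthen the geometric understanding and aesthetic reasoning of large multimodal models (LMMs), and couples the trained model with generative models to enable controllable, iterative editing for professional graphic design.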


📘 Detailed Summary

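Motivation: Existing LMM-based approaches to automating graphic design often produce geometrically inaccurate layouts and lack the iterative, layer-specific editing that professional workflows require, which limits their use in professional design settings.

Method: PosterCopilot uses a progressive three-stage training strategy (Perturbed Supervised Fine-Tuning, Reinforcement Learning for Visual-Reality Alignment, and Reinforcement Learning from Aesthetic Feedback) to equip LMMs with geometric understanding and aesthetic reasoning. It couples the trained design model with generative models in a layer-controllable, iterative editing workflow.

Result: Extensive experiments show that PosterCopilot produces geometrically accurate and aesthetically superior layouts and offers unprecedented controllability for professional iterative design, significantly outperforming existing methods.

Conclusion: By strengthening the geometric understanding and aesthetic reasoning of LMMs and pairing them with a controllable editing workflow, the work offers a new approach to automating professional graphic design, with notable gains in geometric accuracy and design controllability.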


📄 Abstract

Graphic design forms the cornerstone of modern visual communication, serving as a vital medium for promoting cultural and commercial events. Recent advances have explored automating this process using Large Multimodal Models (LMMs), yet existing methods often produce geometrically inaccurate layouts and lack the iterative, layer-specific editing required in professional workflows. To address these limitations, we present PosterCopilot, a framework that advances layout reasoning and controllable editing for professional graphic design. Specifically, we introduce a progressive three-stage training strategy that equips LMMs with geometric understanding and aesthetic reasoning for layout design, consisting of Perturbed Supervised Fine-Tuning, Reinforcement Learning for Visual-Reality Alignment, and Reinforcement Learning from Aesthetic Feedback. Furthermore, we develop a complete workflow that couples the trained LMM-based design model with generative models, enabling layer-controllable, iterative editing for precise element refinement while maintaining global visual consistency. Extensive experiments demonstrate that PosterCopilot achieves geometrically accurate and aesthetically superior layouts, offering unprecedented controllability for professional iterative design.

[52] Unique Lives, Shared World: Learning from Single-Life Videos

Tengda Han, Sayna Ebrahimi, Dilara Gokay, Li Yang Ku, Maks Ovsjanikov, Iva Babukova, Daniel Zoran, Viorica Patraucean, Joao Carreira, Andrew Zisserman, Dima Damen

🧩 TL;DR

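This paper proposes the "single-life" learning paradigm, in which a distinct vision model is trained exclusively on egocentric videos captured by one individual, using the naturally captured multiple viewpoints for self-supervised learning. It shows that this paradigm yields highly aligned and generalizable geometric representations.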


📘 Detailed Summary

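Motivation: The study explores the "single-life" learning paradigm, in which a vision model is trained only on egocentric videos captured by a single individual. It asks whether such a restricted data source can yield effective visual representations and whether models trained on different individuals develop an aligned geometric understanding.

Method: The approach uses a self-supervised learning framework that exploits the multiple viewpoints naturally captured in one individual's egocentric video to train a visual encoder. It introduces a new cross-attention-based metric to quantify the functional alignment of the internal representations of different models, and trains on several datasets, each capturing a different life, both indoors and outdoors.

Result: The experiments yield three key findings. Models trained independently on different lives develop a highly aligned geometric understanding. Single-life models learn geometric representations that transfer effectively to downstream tasks, such as depth estimation, in unseen environments. Training on up to 30 hours from one week of the same person's life performs comparably to training on 30 hours of diverse web data.

Conclusion: The work establishes the effectiveness of single-life representation learning, showing that the shared structure of the world both leads to consistency across models trained on individual lives and provides a powerful signal for visual representation learning. It opens new directions for personalized vision models and for representation learning under restricted data.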


📄 Abstract

We introduce the "single-life" learning paradigm, where we train a distinct vision model exclusively on egocentric videos captured by one individual. We leverage the multiple viewpoints naturally captured within a single life to learn a visual encoder in a self-supervised manner. Our experiments demonstrate three key findings. First, models trained independently on different lives develop a highly aligned geometric understanding. We demonstrate this by training visual encoders on distinct datasets each capturing a different life, both indoors and outdoors, as well as introducing a novel cross-attention-based metric to quantify the functional alignment of the internal representations developed by different models. Second, we show that single-life models learn generalizable geometric representations that effectively transfer to downstream tasks, such as depth estimation, in unseen environments. Third, we demonstrate that training on up to 30 hours from one week of the same person's life leads to comparable performance to training on 30 hours of diverse web data, highlighting the strength of single-life representation learning. Overall, our results establish that the shared structure of the world both leads to consistency in models trained on individual lives and provides a powerful signal for visual representation learning.

cs.CL [Back]

[53] A Preliminary Study on the Promises and Challenges of Native Top-$k$ Sparse Attention

Di Xiu, Hongyin Tang, Bolin Rong, Lizhi Yan, Jingang Wang, Yifan Lu, Xunliang Cai

🧩 TL;DR

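This report presents a preliminary investigation into the effectiveness and theoretical mechanisms of Top-k attention during decoding and training. It finds that exact Top-k decoding preserves performance while reducing computational cost, and it explores a training-inference-consistent Top-k attention strategy that further unlocks the model's potential.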


📘 Detailed Summary

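Motivation: LLMs are increasingly prevalent in long-context modeling, but their inference cost has become a critical bottleneck hindering agent and multimodal applications, which motivates more efficient attention mechanisms.

Method: The study uses exact Top-k decoding, which retains only the pivotal keys with the highest similarity to the query as the context window during decoding. It further explores a native Top-k attention training strategy that keeps training and inference operations consistent, and examines how the precision of approximate Top-k algorithms affects downstream tasks.

Result: Exact Top-k decoding matches or even surpasses full attention on downstream tasks such as HELMET and LongBench v2. Training-inference-consistent Top-k attention significantly improves model performance, downstream performance correlates positively with approximation fidelity, and models subjected to Top-k attention SFT show a distinct entropy reduction on downstream tasks.

Conclusion: The report offers a theoretical interpretation from the perspective of entropy, supporting the hypothesis that low-entropy states are better adapted to Top-k decoding. It provides an effective approach to reducing LLM inference cost and highlights the importance of consistency between training and inference.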


📄 Abstract

Large Language Models (LLMs) are increasingly prevalent in the field of long-context modeling; however, their inference computational costs have become a critical bottleneck hindering the advancement of tasks such as agents and multimodal applications. This report conducts a preliminary investigation into the effectiveness and theoretical mechanisms of the Top-$k$ Attention mechanism during both the decoding and training phases. First, we validate the effectiveness of exact Top-$k$ Decoding through extensive experimentation. Experiments demonstrate that retaining only the pivotal Keys with the highest similarity to the Query as the context window during the decoding stage achieves performance comparable to, or even surpassing, full attention on downstream tasks such as HELMET and LongBench v2. Second, we further explore the native Top-$k$ Attention training strategy. Experiments confirm that ensuring the consistency between training and inference regarding Top-$k$ Attention operations facilitates the further unlocking of Top-$k$ Decoding's potential, thereby significantly enhancing model performance. Furthermore, considering the high computational complexity of exact Top-$k$ Attention, we investigate the impact of approximate Top-$k$ algorithm precision on downstream tasks. Our research confirms a positive correlation between downstream task performance and approximation fidelity, and we provide statistical evaluations of the Lightning Indexer's precision within the DeepSeek-V3.2-Exp model. Finally, this report provides a theoretical interpretation from the perspective of Entropy. Experimental observations indicate that models subjected to Top-$k$ Attention SFT exhibit a distinct phenomenon of entropy reduction in downstream tasks, which validates the hypothesis that low-entropy states are better adapted to Top-$k$ Decoding.
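
Exact Top-$k$ decoding as described above (keep only the keys most similar to the current query, then attend over that subset) can be sketched for a single head and a single decoding step as follows. The scaled dot-product similarity, the single-head setup, and the name `topk_attention` are assumptions for this example; the report's implementation details are not given in the abstract.

```python
import numpy as np

def topk_attention(q, K, V, k):
    """Single-query attention restricted to the k highest-scoring keys."""
    scores = K @ q / np.sqrt(q.shape[0])      # (T,) similarity per key
    k = min(k, scores.shape[0])
    # Indices of the k largest scores (order within the set is arbitrary).
    keep = np.argpartition(scores, -k)[-k:]
    s = scores[keep]
    w = np.exp(s - s.max())                   # stable softmax over the subset
    w /= w.sum()
    return w @ V[keep], keep

rng = np.random.default_rng(0)
q = rng.normal(size=8)
K = rng.normal(size=(16, 8))
V = rng.normal(size=(16, 4))
out_topk, kept = topk_attention(q, K, V, k=4)
out_full, _ = topk_attention(q, K, V, k=16)   # k = T recovers full attention
```

Note that this exact variant still scores every key before selecting, which is the cost that approximate Top-$k$ indexers aim to avoid.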

[54] DZ-TDPO: Non-Destructive Temporal Alignment for Mutable State Tracking in Long-Context Dialogue

Yijun Liao

🧩 TL;DR

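This paper proposes DZ-TDPO, a non-destructive alignment framework that addresses state inertia in long-context dialogue systems by combining conflict-aware dynamic KL constraints with a learnable temporal attention bias, achieving strong dialogue performance while preserving the model's general capabilities.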


📘 Detailed Summary

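Motivation: Long-context dialogue systems suffer from state inertia, where static constraints prevent models from resolving conflicts between evolving user intents and established historical context. This limits adaptability and response accuracy in dynamic interactions.

Method: DZ-TDPO is a non-destructive alignment framework that combines conflict-aware dynamic KL constraints with a learnable temporal attention bias, alleviating state inertia through precise attention regulation rather than destructive weight updates.

Result: On the Multi-Session Chat (MSC) dataset, DZ-TDPO achieves an 86.2% win rate with Phi-3.5, while Qwen2.5-7B reaches near-perfect alignment (99.4% win rate) with negligible perplexity overhead. General capabilities (MMLU) are preserved.

Conclusion: The study reveals a "Capacity-Stability Trade-off": smaller models pay an "alignment tax" (a perplexity surge) to overcome historical inertia, whereas the larger model achieves near-perfect alignment through precise attention regulation. This offers guidance for the design of long-context dialogue systems.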


📄 Abstract

Long-context dialogue systems suffer from State Inertia, where static constraints prevent models from resolving conflicts between evolving user intents and established historical context. To address this, we propose DZ-TDPO, a non-destructive alignment framework that synergizes conflict-aware dynamic KL constraints with a learnable temporal attention bias. Experiments on the Multi-Session Chat (MSC) dataset demonstrate that DZ-TDPO achieves state-of-the-art win rates (86.2% on Phi-3.5) while maintaining robust zero-shot generalization. Crucially, our scaling analysis reveals a "Capacity-Stability Trade-off": while smaller models incur an "alignment tax" (perplexity surge) to overcome historical inertia, the larger Qwen2.5-7B model achieves near-perfect alignment (99.4% win rate) with negligible perplexity overhead. This confirms that TAI can be alleviated via precise attention regulation rather than destructive weight updates, preserving general capabilities (MMLU) across model scales. Code and data are available: https://github.com/lyj20071013/DZ-TDPO

[55] Improving Alignment Between Human and Machine Codes: An Empirical Assessment of Prompt Engineering for Construct Identification in Psychology

Kylie L. Anglin, Stephanie Milan, Brittney Hernandez, Claudia Ventura

🧩 TL;DR

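This study presents a systematic framework for optimizing LLM performance on psychological text classification through prompt engineering. It finds that combining human-guided and automatically generated prompts substantially improves agreement between model output and expert judgment.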


📘 Detailed Summary

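Motivation: LLMs perform well on text classification, but their output depends heavily on prompt wording. Few studies address specialized domains such as psychology, where constructs have precise, theory-driven definitions that may not be well represented in pre-training data, leading to divergence between model and expert judgments.

Method: The study presents an empirical framework that evaluates five prompting strategies (codebook-guided empirical prompt selection, automatic prompt engineering, persona prompting, chain-of-thought reasoning, and explanatory prompting) under zero-shot and few-shot classification, focusing on key prompt features such as construct definition, task framing, and example selection.

Result: Persona, chain-of-thought, and explanatory prompting do not fully offset the performance loss caused by a badly worded prompt. The most influential prompt features are the construct definition, the task framing, and, to a lesser extent, the examples provided. Across three constructs and two models, the classifications most aligned with expert judgments came from a few-shot prompt combining codebook-guided empirical prompt selection with automatic prompt engineering.

Conclusion: The authors recommend generating and evaluating as many prompt variants as feasible, both human-crafted and automatically generated, selecting prompts and examples based on empirical performance on a training dataset, and validating the final approach on a holdout set. This gives a practical, systematic, and theory-driven method for optimizing LLM prompts where alignment with expert judgment is critical.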


📄 Abstract

Due to their architecture and vast pre-training data, large language models (LLMs) demonstrate strong text classification performance. However, LLM output (here, the category assigned to a text) depends heavily on the wording of the prompt. While literature on prompt engineering is expanding, few studies focus on classification tasks, and even fewer address domains like psychology, where constructs have precise, theory-driven definitions that may not be well represented in pre-training data. We present an empirical framework for optimizing LLM performance for identifying constructs in texts via prompt engineering. We experimentally evaluate five prompting strategies (codebook-guided empirical prompt selection, automatic prompt engineering, persona prompting, chain-of-thought reasoning, and explanatory prompting) with zero-shot and few-shot classification. We find that persona, chain-of-thought, and explanations do not fully address performance loss accompanying a badly worded prompt. Instead, the most influential features of a prompt are the construct definition, task framing, and, to a lesser extent, the examples provided. Across three constructs and two models, the classifications most aligned with expert judgments resulted from a few-shot prompt combining codebook-guided empirical prompt selection with automatic prompt engineering. Based on our findings, we recommend that researchers generate and evaluate as many prompt variants as feasible, whether human-crafted, automatically generated, or ideally both, and select prompts and examples based on empirical performance in a training dataset, validating the final approach in a holdout set. This procedure offers a practical, systematic, and theory-driven method for optimizing LLM prompts in settings where alignment with expert judgment is critical.
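
The recommended procedure (score many prompt variants on a training set, pick the best, then validate once on a holdout set) can be sketched as below. Everything here is illustrative: `toy_predict` is a keyword matcher standing in for an LLM call, the "prompts" are single keywords, and plain accuracy against expert labels is an assumed agreement measure that the abstract does not specify.

```python
def accuracy(predict, prompt, examples):
    """Share of examples where the model's label matches the expert label."""
    hits = sum(predict(prompt, text) == label for text, label in examples)
    return hits / len(examples)

def select_prompt(predict, prompts, train, holdout):
    """Pick the prompt with the best training score, then validate it once."""
    best = max(prompts, key=lambda p: accuracy(predict, p, train))
    return best, accuracy(predict, best, holdout)

def toy_predict(prompt, text):
    # Hypothetical stand-in for an LLM classifier: label 1 if the
    # prompt keyword appears in the text, else 0.
    return int(prompt in text.lower())

train = [("I feel so alone", 1),
         ("Had lunch with friends", 0),
         ("Nobody is around, I am alone", 1)]
holdout = [("alone again tonight", 1), ("great party", 0)]
best_prompt, holdout_score = select_prompt(
    toy_predict, ["friends", "alone"], train, holdout)
```

Keeping the holdout set out of the selection step matters: choosing among many variants on the same data used to report performance would overstate agreement with expert judgment.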

[56] Training and Evaluation of Guideline-Based Medical Reasoning in LLMs

Michael Staniek, Artem Sokolov, Stefan Riezler

🧩 TL;DR

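This study fine-tunes LLMs to follow medical consensus guidelines step by step in their reasoning, addressing the lack of trustworthy explanations in medical AI. Experiments show that small models fine-tuned on a specific medical area (here, the Sepsis-3 definition) outperform much larger LLMs prompted with the explicit definition, in both reasoning correctness and prediction accuracy.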


📘 Detailed Summary

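Motivation: Machine learning for early prediction in medicine has shown breakthrough performance, but the focus on prediction accuracy has led to a neglect of the faithful explanations needed to gain the trust of medical practitioners. Existing methods do not make effective use of the consensus guidelines that are ubiquitous in medicine. These guidelines provide standardized reasoning steps and handling of exceptions, and are essential for faithful, interpretable model reasoning.

Method: The study instantiates verbalized medical inference rules from consensus guidelines on electronic health records and uses these instances as fine-tuning data so that LLMs learn the consensus rules and their exceptions. Evaluation has two parts: derivation correctness assesses whether the model correctly and faithfully deduces a conclusion from given premises, and value correctness compares predicted values against real-world measurements. To address the time series forecasting challenge, the study further proposes a multimodal setup that integrates the output representations of a time series forecasting model with the LLM.

Result: Experiments on the Sepsis-3 consensus definition show that small models fine-tuned on rule instantiations from a specific medical area achieve nearly perfect derivation correctness on unseen patient data. They outperform considerably larger LLMs prompted one-shot with the explicit definition, as well as models trained on medical texts that include consensus definitions. The multimodal integration further improves forecasting of sparsely and irregularly sampled clinical variables, addressing the orthogonal problem of generalization into the future.

Conclusion: The bottleneck for early prediction is not out-of-distribution generalization but the orthogonal problem of generalization into the future. Fine-tuning on verbalized rule instantiations effectively teaches LLMs to follow medical consensus guidelines, yielding faithful and interpretable reasoning. Multimodal integration offers an effective route to the time series forecasting challenge for clinical variables, laying a methodological foundation for trustworthy medical AI.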


📄 Abstract

Machine learning for early prediction in medicine has recently shown breakthrough performance; however, the focus on improving prediction accuracy has led to a neglect of faithful explanations that are required to gain the trust of medical practitioners. The goal of this paper is to teach LLMs to follow medical consensus guidelines step-by-step in their reasoning and prediction process. Since consensus guidelines are ubiquitous in medicine, instantiations of verbalized medical inference rules to electronic health records provide data for fine-tuning LLMs to learn consensus rules and possible exceptions thereof for many medical areas. Consensus rules also enable an automatic evaluation of the model's inference process regarding its derivation correctness (evaluating correct and faithful deduction of a conclusion from given premises) and value correctness (comparing predicted values against real-world measurements). We exemplify our work using the complex Sepsis-3 consensus definition. Our experiments show that small fine-tuned models outperform one-shot learning of considerably larger LLMs that are prompted with the explicit definition and models that are trained on medical texts including consensus definitions. Since fine-tuning on verbalized rule instantiations of a specific medical area yields nearly perfect derivation correctness for rules (and exceptions) on unseen patient data in that area, the bottleneck for early prediction is not out-of-distribution generalization, but the orthogonal problem of generalization into the future by forecasting sparsely and irregularly sampled clinical variables. We show that the latter results can be improved by integrating the output representations of a time series forecasting model with the LLM in a multimodal setup.

[57] Jina-VLM: Small Multilingual Vision Language Model

Andreas Koukounas, Georgios Mastrapas, Florian Hönicke, Sedigheh Eslami, Guillaume Roncari, Scott Martens, Han Xiao

🧩 TL;DR

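This paper presents Jina-VLM, a 2.4B-parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. It couples a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector that enables token-efficient processing of arbitrary-resolution images.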


📘 Detailed Summary

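Motivation: The work targets the limited multilingual visual question answering performance of existing open 2B-scale VLMs. It aims to improve multimodal understanding while preserving competitive text-only performance, using a more efficient architecture for arbitrary-resolution images.

Method: Jina-VLM combines a SigLIP2 vision encoder with a Qwen3 language backbone, fusing the visual and language modalities through an attention-pooling connector. The connector supports token-efficient processing of arbitrary-resolution images, reducing computational overhead, at a model size of 2.4B parameters.

Result: Across standard VQA benchmarks and multilingual evaluations, Jina-VLM outperforms comparable models while preserving competitive text-only performance, setting a new reference point among open 2B-scale VLMs.

Conclusion: The work shows that coupling a strong vision encoder with a language backbone through a carefully designed attention-pooling connector can deliver strong multilingual visual understanding at a moderate parameter count. The architecture balances computational efficiency and performance, which supports practical deployment.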


📄 Abstract

We present Jina-VLM, a 2.4B parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector that enables token-efficient processing of arbitrary-resolution images. Across standard VQA benchmarks and multilingual evaluations, Jina-VLM outperforms comparable models while preserving competitive text-only performance.
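
To make "attention pooling" concrete, the sketch below reduces a variable number of patch tokens to a fixed, smaller number of output tokens with a single cross-attention step. This is a generic formulation and not Jina-VLM's connector: the query vectors (learned in a real model, random here), the absence of projection matrices, and the token counts are all assumptions for illustration.

```python
import numpy as np

def attention_pool(tokens, queries):
    """Pool N patch tokens (N, D) into M output tokens (M, D), M < N."""
    # Scaled dot-product scores of each query against each patch token.
    scores = queries @ tokens.T / np.sqrt(tokens.shape[1])   # (M, N)
    scores -= scores.max(axis=1, keepdims=True)              # stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    # Each output token is a convex combination of the patch tokens.
    return weights @ tokens

rng = np.random.default_rng(0)
patch_tokens = rng.normal(size=(64, 16))   # e.g. an 8x8 grid of patches
pool_queries = rng.normal(size=(16, 16))   # 16 output tokens (assumed)
pooled = attention_pool(patch_tokens, pool_queries)
```

Because the number of output tokens is set by the queries rather than by the image, the language backbone sees the same sequence length regardless of input resolution, which is where the token efficiency would come from.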

cs.AI [Back]

[58] Evaluating Generalization Capabilities of LLM-Based Agents in Mixed-Motive Scenarios Using Concordia

Chandler Smith, Marwa Abdulhai, Manfred Diaz, Marko Tesic, Rakshit S. Trivedi, Alexander Sasha Vezhnevets, Lewis Hammond, Jesse Clifton, Minsuk Chang, Edgar A. Duéñez-Guzmán, John P. Agapiou, Jayd Matyas, Danny Karmon, Akash Kundu, Aliaksei Korshuk, Ananya Ananya, Arrasy Rahman, Avinaash Anand Kulandaivel, Bain McHale, Beining Zhang, Buyantuev Alexander, Carlos Saith Rodriguez Rojas, Caroline Wang, Chetan Talele, Chenao Liu, Chichen Lin, Diana Riazi, Di Yang Shi, Emanuel Tewolde, Elizaveta Tennant, Fangwei Zhong, Fuyang Cui, Gang Zhao, Gema Parreño Piqueras, Hyeonggeun Yun, Ilya Makarov, Jiaxun Cui, Jebish Purbey, Jim Dilkes, Jord Nguyen, Lingyun Xiao, Luis Felipe Giraldo, Manuela Chacon-Chamorro, Manuel Sebastian Rios Beltran, Marta Emili García Segura, Mengmeng Wang, Mogtaba Alim, Nicanor Quijano, Nico Schiavone, Olivia Macmillan-Scott, Oswaldo Peña, Peter Stone, Ram Mohan Rao Kadiyala, Rolando Fernandez, Ruben Manrique, Sunjia Lu, Sheila A. McIlraith, Shamika Dhuri, Shuqing Shi, Siddhant Gupta, Sneheel Sarangi, Sriram Ganapathi Subramanian, Taehun Cha, Toryn Q. Klassen, Wenming Tu, Weijian Fan, Wu Ruiyang, Xue Feng, Yali Du, Yang Liu, Yiding Wang, Yipeng Kang, Yoonchang Sung, Yuxuan Chen, Zhaowei Zhang, Zhihan Wang, Zhiqiang Wu, Ziang Chen, Zilong Zheng, Zixia Jia, Ziyan Wang, Dylan Hadfield-Menell, Natasha Jaques, Tim Baarslag, Jose Hernandez-Orallo, Joel Z. Leibo

🧩 TL;DR

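This paper introduces a method for evaluating the ability of LLM-based agents to cooperate in zero-shot, mixed-motive environments using Concordia, a natural language multi-agent simulation environment. It reveals significant generalization gaps in scenarios that demand persuasion and norm enforcement.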


📘 Detailed Summary

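Motivation: LLM agents have shown impressive capabilities for social interaction and are increasingly deployed where they may engage with both human and artificial agents. Existing evaluation methods fail to measure how well these capabilities generalize to novel social situations, which is a critical frontier for LLM-based agents.

Method: The study introduces an evaluation method built on Concordia, a natural language multi-agent simulation environment. It measures general cooperative intelligence by testing an agent's ability to identify and exploit opportunities for mutual gain across diverse partners and contexts, with a focus on zero-shot, mixed-motive environments.

Result: Empirical results from the NeurIPS 2024 Concordia Contest, which covered scenarios ranging from negotiation to collective action problems, show significant gaps between current agent capabilities and the robust generalization required for reliable cooperation, particularly in scenarios demanding persuasion and norm enforcement.

Conclusion: The study shows that LLM agents generalize poorly in complex social interactions and underlines the need for more robust cooperative intelligence. It provides a benchmark for future agent evaluation frameworks and points to persuasion and norm enforcement as key social capabilities to improve.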


📄 Abstract

Large Language Model (LLM) agents have demonstrated impressive capabilities for social interaction and are increasingly being deployed in situations where they might engage with both human and artificial agents. These interactions represent a critical frontier for LLM-based agents, yet existing evaluation methods fail to measure how well these capabilities generalize to novel social situations. In this paper, we introduce a method for evaluating the ability of LLM-based agents to cooperate in zero-shot, mixed-motive environments using Concordia, a natural language multi-agent simulation environment. Our method measures general cooperative intelligence by testing an agent's ability to identify and exploit opportunities for mutual gain across diverse partners and contexts. We present empirical results from the NeurIPS 2024 Concordia Contest, where agents were evaluated on their ability to achieve mutual gains across a suite of diverse scenarios ranging from negotiation to collective action problems. Our findings reveal significant gaps between current agent capabilities and the robust generalization required for reliable cooperation, particularly in scenarios demanding persuasion and norm enforcement.

[59] Multimodal Reinforcement Learning with Agentic Verifier for AI Agents

Reuben Tan, Baolin Peng, Zhengyuan Yang, Hao Cheng, Oier Mees, Theodore Zhao, Andrea Tupini, Isar Meijier, Qianhui Wu, Yuncong Yang, Lars Liden, Yu Gu, Sheng Zhang, Xiaodong Liu, Lijuan Wang, Marc Pollefeys, Yong Jae Lee, Jianfeng Gao

🧩 TL;DR

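This paper introduces Argos (Agentic Reward for Grounded & Objective Scoring), a reward agent for training multimodal reasoning models. It selects among teacher-model-derived and rule-based scoring functions to provide fine-grained reward signals, addressing sparse rewards and reward hacking in multimodal reinforcement learning.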


📘 Detailed Summary

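Motivation: Agentic reasoning models trained with multimodal reinforcement learning are usually optimized with sparse rewards based only on the final answer, which provide no fine-grained guidance. Richer rewards are hard to compute because different samples may require different scoring functions and teacher models may give noisy reward signals, limiting learning efficiency and performance.

Method: For each sample, Argos selects from a pool of teacher-model-derived and rule-based scoring functions to evaluate three dimensions at once: final response accuracy, spatiotemporal localization of referred entities and actions, and the quality of the reasoning process. The agentic verifier is applied in both SFT data curation and RL training.

Result: Models trained with Argos achieve state-of-the-art results on spatial reasoning, visual hallucination, and robotics and embodied AI benchmarks. The experiments show that SFT post-training on highly curated reasoning data alone is insufficient, since agents collapse to ungrounded solutions during RL without online verification. Argos also helps reduce reward hacking in multimodal reinforcement learning.

Conclusion: The work demonstrates the importance of fine-grained, multi-dimensional rewards for training multimodal reasoning agents and gives a theoretical justification for Argos through the concept of Pareto optimality. It highlights the key role of online verification in preventing agent collapse and offers a new paradigm for reward design in multimodal reinforcement learning.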


📄 Abstract

Agentic reasoning models trained with multimodal reinforcement learning (MMRL) have become increasingly capable, yet they are almost universally optimized using sparse, outcome-based rewards computed based on the final answers. Richer rewards computed from the reasoning tokens can improve learning significantly by providing more fine-grained guidance. However, it is challenging to compute more informative rewards in MMRL beyond those based on outcomes since different samples may require different scoring functions and teacher models may provide noisy reward signals too. In this paper, we introduce Argos (Agentic Reward for Grounded & Objective Scoring), a principled reward agent to train multimodal reasoning models for agentic tasks. For each sample, Argos selects from a pool of teacher-model derived and rule-based scoring functions to simultaneously evaluate: (i) final response accuracy, (ii) spatiotemporal localization of referred entities and actions, and (iii) the quality of the reasoning process. We find that by leveraging our agentic verifier across both SFT data curation and RL training, our model achieves state-of-the-art results across multiple agentic tasks such as spatial reasoning, visual hallucination, as well as robotics and embodied AI benchmarks. Critically, we demonstrate that just relying on SFT post-training on highly curated reasoning data is insufficient, as agents invariably collapse to ungrounded solutions during RL without our online verification. We also show that our agentic verifier can help to reduce reward-hacking in MMRL. Finally, we also provide a theoretical justification for the effectiveness of Argos through the concept of Pareto optimality.

[60] MemVerse: Multimodal Memory for Lifelong Learning Agents

Junming Liu, Yifei Sun, Weihua Cheng, Haodong Lei, Yirong Chen, Licheng Wen, Xuemeng Yang, Daocheng Fu, Pinlong Cai, Nianchen Deng, Yi Yu, Shuyue Hu, Botian Shi, Ding Wang

🧩 TL;DR

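This paper introduces MemVerse, a model-agnostic, plug-and-play memory framework that bridges fast parametric recall with hierarchical retrieval-based memory, enabling scalable and adaptive multimodal intelligence and markedly improving agents' memory and reasoning.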


📘 Detailed Summary

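Motivation: Despite rapid progress in large-scale language and vision models, AI agents still face a fundamental limitation: they cannot remember effectively. Without reliable memory, agents catastrophically forget past experiences, struggle with long-horizon reasoning, and fail to operate coherently in multimodal or interactive environments.

Method: MemVerse is a model-agnostic, plug-and-play architecture that combines fast parametric recall with hierarchical retrieval-based memory. It maintains short-term memory for recent context while transforming raw multimodal experiences into structured long-term memories organized as hierarchical knowledge graphs. A periodic distillation mechanism compresses essential knowledge from long-term memory into the parametric model, allowing fast, differentiable recall while preserving interpretability.

Result: Extensive experiments show that MemVerse significantly improves multimodal reasoning and continual learning efficiency, enabling agents to remember, adapt, and reason coherently across extended interactions, with strong results on a range of benchmarks.

Conclusion: MemVerse offers an effective framework for agent memory, achieving a scalable and adaptive memory system through its hierarchical memory structure and periodic distillation. The work opens a direction for building AI agents with long-term memory, with relevance to multimodal interaction and continual learning.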


📄 Abstract

Despite rapid progress in large-scale language and vision models, AI agents still suffer from a fundamental limitation: they cannot remember. Without reliable memory, agents catastrophically forget past experiences, struggle with long-horizon reasoning, and fail to operate coherently in multimodal or interactive environments. We introduce MemVerse, a model-agnostic, plug-and-play memory framework that bridges fast parametric recall with hierarchical retrieval-based memory, enabling scalable and adaptive multimodal intelligence. MemVerse maintains short-term memory for recent context while transforming raw multimodal experiences into structured long-term memories organized as hierarchical knowledge graphs. This design supports continual consolidation, adaptive forgetting, and bounded memory growth. To handle real-time demands, MemVerse introduces a periodic distillation mechanism that compresses essential knowledge from long-term memory into the parametric model, allowing fast, differentiable recall while preserving interpretability. Extensive experiments demonstrate that MemVerse significantly improves multimodal reasoning and continual learning efficiency, empowering agents to remember, adapt, and reason coherently across extended interactions.

[61] Omni-AutoThink: Adaptive Multimodal Reasoning via Reinforcement Learning

Dongchao Yang, Songxiang Liu, Disong Wang, Yuanyuan Wang, Guanglu Wan, Helen Meng

🧩 TL;DR

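This paper proposes Omni-AutoThink, an adaptive reasoning framework that dynamically adjusts reasoning depth to address the tendency of existing Omni models to overthink simple problems or fail to reason on complex ones, significantly improving adaptive multimodal reasoning performance.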


📘 Detailed Summary

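Motivation: Existing Omni models have advanced multimodal perception and generation, but their reasoning behavior remains rigid: they either overthink simple problems or fail to reason when reasoning is needed, which limits practical effectiveness and efficiency.

Method: The framework has two stages. The first, Adaptive Supervised Fine-Tuning, uses large-scale reasoning-augmented data to give the model fundamental reasoning capability. The second, Adaptive Reinforcement Learning (Adaptive GRPO), optimizes reasoning behavior based on task complexity and reward feedback. The authors also construct a comprehensive adaptive reasoning benchmark spanning text-only, text-audio, text-visual, and text-audio-visual modalities.

Result: Experiments show that the proposed framework significantly improves adaptive reasoning performance over previous baselines. The benchmark provides both training and evaluation splits for multimodal reasoning assessment.

Conclusion: The work shows that an adaptive reasoning framework can improve both the efficiency and effectiveness of reasoning in Omni models, giving multimodal systems a more flexible reasoning mechanism. All benchmark data and code will be publicly released.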


📄 Abstract

Recent advances in Omni models have enabled unified multimodal perception and generation. However, most existing systems still exhibit rigid reasoning behaviors, either overthinking simple problems or failing to reason when necessary. To address this limitation, we propose Omni-AutoThink, a novel adaptive reasoning framework that dynamically adjusts the model's reasoning depth according to task difficulty. Our framework comprises two stages: (1) an Adaptive Supervised Fine-Tuning (Adaptive SFT) stage, which endows the Omni model with fundamental reasoning capability using large-scale reasoning-augmented data, and (2) an Adaptive Reinforcement Learning (Adaptive GRPO) stage, which optimizes reasoning behaviors based on task complexity and reward feedback. We further construct a comprehensive adaptive reasoning benchmark that spans text-only, text-audio, text-visual, and text-audio-visual modalities, providing both training and evaluation splits for multimodal reasoning assessment. Experimental results demonstrate that our proposed framework significantly improves adaptive reasoning performance compared to previous baselines. All benchmark data and code will be publicly released.
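
The second stage is labelled "Adaptive GRPO". In standard Group Relative Policy Optimization, the advantage of each sampled response is its reward normalized by the mean and standard deviation of the rewards in its group, with no learned value function. The sketch below shows only that standard normalization; how the adaptive variant shapes rewards by task complexity is not described in the abstract and is not modelled here. The function name and the `eps` constant are assumptions.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one prompt's sampled responses within the group."""
    r = np.asarray(rewards, dtype=np.float64)
    # eps guards against division by zero when all rewards are equal.
    return (r - r.mean()) / (r.std() + eps)

# Four responses sampled for the same prompt, scored 1 (correct) or 0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
no_signal = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

When every response in a group receives the same reward, all advantages are zero and the group contributes no gradient, so reward designs that separate responses within a group matter for training.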