cs.CV

[1] Personalized Reward Modeling for Text-to-Image Generation

Jeongeun Lee, Ryang Heo, Dongha Lee

🧩 TL;DR

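This paper proposes PIGReward, a personalized reward model that evaluates text-to-image generations by dynamically generating user-conditioned evaluation dimensions and applying chain-of-thought reasoning. It addresses the inability of conventional evaluation methods to capture individual visual preferences and establishes a scalable framework for personalized evaluation and optimization.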


📘 Detailed Summary

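Motivation: Existing evaluation methods for text-to-image models, such as general reward functions or similarity-based metrics, fail to capture the diversity and complexity of individual visual preferences, leaving a significant gap between evaluation results and what users actually prefer.

Method: PIGReward adopts a self-bootstrapping strategy that reasons over limited reference data to construct rich user contexts, enabling personalization without user-specific training. The model dynamically generates user-conditioned evaluation dimensions through chain-of-thought reasoning and provides personalized feedback that drives user-specific prompt optimization.

Result: Extensive experiments on the PIGBench benchmark show that PIGReward outperforms existing methods in both accuracy and interpretability, establishing a scalable, reasoning-based foundation for personalized text-to-image evaluation and optimization.

Conclusion: PIGReward represents a notable step toward individually aligned text-to-image generation. Its reasoning-driven personalized evaluation framework improves evaluation accuracy and also provides an effective tool for personalized prompt optimization, bringing generated images closer to individual intent.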


📄 Abstract

Recent text-to-image (T2I) models generate semantically coherent images from textual prompts, yet evaluating how well they align with individual user preferences remains an open challenge. Conventional evaluation methods, such as general reward functions or similarity-based metrics, fail to capture the diversity and complexity of personal visual tastes. In this work, we present PIGReward, a personalized reward model that dynamically generates user-conditioned evaluation dimensions and assesses images through CoT reasoning. To address the scarcity of user data, PIGReward adopts a self-bootstrapping strategy that reasons over limited reference data to construct rich user contexts, enabling personalization without user-specific training. Beyond evaluation, PIGReward provides personalized feedback that drives user-specific prompt optimization, improving alignment between generated images and individual intent. We further introduce PIGBench, a per-user preference benchmark capturing diverse visual interpretations of shared prompts. Extensive experiments demonstrate that PIGReward surpasses existing methods in both accuracy and interpretability, establishing a scalable and reasoning-based foundation for personalized T2I evaluation and optimization. Taken together, our findings highlight PIGReward as a robust step toward individually aligned T2I generation.

[2] Tracking and Segmenting Anything in Any Modality

Tianlu Zhang, Qiang Zhang, Guiguang Ding, Jungong Han

🧩 TL;DR

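This paper proposes SATA, a framework that unifies a broad range of tracking and segmentation subtasks through a Decoupled Mixture-of-Experts mechanism and a Task-aware Multi-object Tracking pipeline. It shows superior performance on 18 benchmarks and offers a new perspective on general-purpose video understanding.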


📘 Detailed Summary

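Motivation: Existing tracking and segmentation methods typically use specialized architectures or modality-specific parameters, which limits generalization and scalability. Earlier attempts at multi-task unification overlooked two key challenges: the distributional gap across modalities and the feature representation gap across tasks. These gaps hinder effective cross-task and cross-modal knowledge sharing and constrain the development of a true generalist model.

Method: SATA is a universal tracking and segmentation framework. A Decoupled Mixture-of-Experts (DeMoE) mechanism splits unified representation learning into modeling cross-modal shared knowledge and modeling specific information. A Task-aware Multi-object Tracking (TaMOT) pipeline unifies all task outputs as a set of instances with calibrated ID information, alleviating the degradation of task-specific knowledge during multi-task training.

Result: SATA delivers superior performance on 18 challenging tracking and segmentation benchmarks, demonstrating the framework's effectiveness and generalization when handling diverse video understanding tasks within one model.

Conclusion: The work offers an effective solution to cross-modal and cross-task knowledge sharing. By decoupling representation learning and unifying output representations, it opens a path toward more general video understanding models and shows that unified modeling across diverse tracking and segmentation tasks is feasible.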


📄 Abstract

Tracking and segmentation play essential roles in video understanding, providing basic positional information and temporal association of objects within video sequences. Despite their shared objective, existing approaches often tackle these tasks using specialized architectures or modality-specific parameters, limiting their generalization and scalability. Recent efforts have attempted to unify multiple tracking and segmentation subtasks from the perspectives of any modality input or multi-task inference. However, these approaches tend to overlook two critical challenges: the distributional gap across different modalities and the feature representation gap across tasks. These issues hinder effective cross-task and cross-modal knowledge sharing, ultimately constraining the development of a true generalist model. To address these limitations, we propose a universal tracking and segmentation framework named SATA, which unifies a broad spectrum of tracking and segmentation subtasks with any modality input. Specifically, a Decoupled Mixture-of-Expert (DeMoE) mechanism is presented to decouple the unified representation learning task into the modeling process of cross-modal shared knowledge and specific information, thus enabling the model to maintain flexibility while enhancing generalization. Additionally, we introduce a Task-aware Multi-object Tracking (TaMOT) pipeline to unify all the task outputs as a unified set of instances with calibrated ID information, thereby alleviating the degradation of task-specific knowledge during multi-task training. SATA demonstrates superior performance on 18 challenging tracking and segmentation benchmarks, offering a novel perspective for more generalizable video understanding.

[3] Connecting the Dots: Training-Free Visual Grounding via Agentic Reasoning

Liqin Luo, Guangyao Chen, Xiawu Zheng, Yongxing Dai, Yixiong Zou, Yonghong Tian

🧩 TL;DR

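This paper proposes GroundingAgent, an agentic visual grounding framework that requires no task-specific fine-tuning. It integrates pretrained models through an iterative reasoning mechanism and achieves an average zero-shot grounding accuracy of 65.1% across several benchmarks.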


📘 Detailed Summary

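Motivation: Existing visual grounding methods typically rely on extensive task-specific annotations and fine-tuning, which limits their generalization to novel or out-of-distribution scenarios. A solution that generalizes well without fine-tuning is needed.

Method: GroundingAgent uses a structured, iterative reasoning mechanism that integrates pretrained open-vocabulary object detectors, multimodal large language models, and large language models, progressively refining candidate regions through joint semantic and spatial analysis.

Result: On RefCOCO, RefCOCO+, and RefCOCOg, the framework achieves an average zero-shot grounding accuracy of 65.1%. When MLLM-generated captions are replaced with the original query texts, accuracy at the selection stage alone reaches approximately 90%, close to supervised performance.

Conclusion: The study underscores the critical role of LLM reasoning in visual grounding. GroundingAgent also offers strong interpretability by transparently showing each reasoning step, pointing to a new direction for fine-tuning-free vision-language understanding.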


📄 Abstract

Visual grounding, the task of linking textual queries to specific regions within images, plays a pivotal role in vision-language integration. Existing methods typically rely on extensive task-specific annotations and fine-tuning, limiting their ability to generalize effectively to novel or out-of-distribution scenarios. To address these limitations, we introduce GroundingAgent, a novel agentic visual grounding framework that operates without any task-specific fine-tuning. GroundingAgent employs a structured, iterative reasoning mechanism that integrates pretrained open-vocabulary object detectors, multimodal large language models (MLLMs), and large language models (LLMs) to progressively refine candidate regions through joint semantic and spatial analyses. Remarkably, GroundingAgent achieves an average zero-shot grounding accuracy of 65.1% on widely used benchmarks (RefCOCO, RefCOCO+, RefCOCOg), entirely without fine-tuning. Furthermore, by substituting MLLM-generated captions with the original query texts, the accuracy at the selection stage alone reaches approximately 90%, closely matching supervised performance and underscoring the critical role of LLM reasoning capabilities. GroundingAgent also offers strong interpretability, transparently illustrating each reasoning step and providing clear insights into its decision-making process.

[4] Towards Efficient VLMs: Information-Theoretic Driven Compression via Adaptive Structural Pruning

Zhaoqi Xu, Yingying Zhang, Jian Li, Jianwei Guo, Qiannan Zhu, Hua Huang

🧩 TL;DR

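This paper proposes InfoPrune, an information-theoretic framework for adaptive structural compression of vision-language models. Grounded in the Information Bottleneck principle, it combines attention-head pruning with low-rank FFN approximation, yielding substantial efficiency gains while preserving performance.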


📘 Detailed Summary

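Motivation: The growing scale of vision-language models poses severe deployment and efficiency challenges. Existing compression methods rely on heuristic importance metrics or empirical pruning rules and lack theoretical guarantees about information preservation.

Method: Grounded in the Information Bottleneck principle, pruning is formulated as a trade-off between retaining task-relevant semantics and discarding redundant dependencies. An entropy-based effective rank (eRank) quantifies each attention head's contribution, and the Kolmogorov-Smirnov distance measures the divergence between the original and compressed structures. Two complementary schemes follow: training-based attention-head pruning guided by the information loss objective, and training-free FFN compression via adaptive low-rank approximation.

Result: Experiments on VQAv2, TextVQA, and GQA show that InfoPrune achieves up to 3.2x FLOP reduction and 1.8x acceleration with negligible performance degradation.

Conclusion: The work provides a theoretically grounded and practically effective compression method for efficient multimodal large models, establishes a unified criterion covering structural sparsity and informational efficiency, and supports the practical deployment of vision-language models.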


📄 Abstract

Recent advances in vision-language models (VLMs) have shown remarkable performance across multimodal tasks, yet their ever-growing scale poses severe challenges for deployment and efficiency. Existing compression methods often rely on heuristic importance metrics or empirical pruning rules, lacking theoretical guarantees about information preservation. In this work, we propose InfoPrune, an information-theoretic framework for adaptive structural compression of VLMs. Grounded in the Information Bottleneck principle, we formulate pruning as a trade-off between retaining task-relevant semantics and discarding redundant dependencies. To quantify the contribution of each attention head, we introduce an entropy-based effective rank (eRank) and employ the Kolmogorov-Smirnov (KS) distance to measure the divergence between original and compressed structures. This yields a unified criterion that jointly considers structural sparsity and informational efficiency. Building on this foundation, we further design two complementary schemes: (1) a training-based head pruning guided by the proposed information loss objective, and (2) a training-free FFN compression via adaptive low-rank approximation. Extensive experiments on VQAv2, TextVQA, and GQA demonstrate that InfoPrune achieves up to 3.2x FLOP reduction and 1.8x acceleration with negligible performance degradation, establishing a theoretically grounded and practically effective step toward efficient multimodal large models.
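
The abstract does not spell out how eRank is computed. A common definition of entropy-based effective rank is the exponential of the Shannon entropy of the normalized singular values, and the sketch below assumes that definition; the paper's exact formulation may differ. The function name `effective_rank` and the choice to pass in precomputed singular values (for example, from an SVD of a head's weight or activation matrix) are illustrative assumptions.

```python
import math


def effective_rank(singular_values, eps=1e-12):
    """Entropy-based effective rank (assumed definition, not taken from the paper).

    Normalizes the singular values into a probability distribution and
    returns exp(Shannon entropy). The result ranges from 1 (one dominant
    direction) up to the number of non-zero singular values (flat spectrum).
    """
    # Drop numerically zero singular values; they contribute no entropy.
    values = [s for s in singular_values if s > eps]
    total = sum(values)
    if total <= 0.0:
        return 0.0
    entropy = 0.0
    for s in values:
        p = s / total
        entropy -= p * math.log(p)
    return math.exp(entropy)


if __name__ == "__main__":
    print(effective_rank([1.0, 1.0, 1.0, 1.0]))  # flat spectrum
    print(effective_rank([5.0, 0.0, 0.0]))       # single dominant direction
```

Under this reading, a head whose spectrum is concentrated in a few directions gets a low effective rank, which is one plausible signal that it carries less distinct information.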

[5] VideoChat-M1: Collaborative Policy Planning for Video Understanding via Multi-Agent Reinforcement Learning

Boyu Chen, Zikang Wang, Zhengrong Yue, Kainan Yan, Chenyun Yu, Yi Huang, Zijun Liu, Yafei Wen, Xiaoxin Chen, Yang Liu, Peng Li, Yali Wang

🧩 TL;DR

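This paper proposes VideoChat-M1, a multi-agent video understanding system built on a Collaborative Policy Planning paradigm. With dynamic policy generation, execution, and communication, combined with multi-agent reinforcement learning, it achieves state-of-the-art performance on multiple video understanding benchmarks.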


📘 Detailed Summary

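Motivation: Most existing multi-agent video understanding systems built on tool-augmented multimodal large language models use static, non-learnable tool invocation mechanisms. This limits the discovery of diverse clues in temporally or spatially complex videos and weakens the robustness of perception and reasoning.

Method: The Collaborative Policy Planning paradigm comprises three key processes. In policy generation, each agent generates its own tool invocation policy tailored to the user's query. In policy execution, each agent sequentially invokes relevant tools to execute its policy and explore the video content. In policy communication, agents interact during policy execution to update their respective policies. The agents are jointly optimized with a multi-agent reinforcement learning method.

Result: The system achieves state-of-the-art performance across eight benchmarks spanning four tasks. On LongVideoBench in particular, it outperforms the state-of-the-art model Gemini 2.5 Pro by 3.6% and GPT-4o by 15.6%.

Conclusion: Through dynamic policy adjustment and inter-agent interaction, Collaborative Policy Planning substantially improves video understanding performance, demonstrating the effectiveness of multi-agent collaboration on complex video analysis tasks and suggesting a direction for the design of future video understanding systems.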


📄 Abstract

By leveraging tool-augmented Multimodal Large Language Models (MLLMs), multi-agent frameworks are driving progress in video understanding. However, most of them adopt static and non-learnable tool invocation mechanisms, which limit the discovery of diverse clues essential for robust perception and reasoning regarding temporally or spatially complex videos. To address this challenge, we propose a novel multi-agent system for video understanding, namely VideoChat-M1. Instead of using a single or fixed policy, VideoChat-M1 adopts a distinct Collaborative Policy Planning (CPP) paradigm with multiple policy agents, which comprises three key processes. (1) Policy Generation: Each agent generates its unique tool invocation policy tailored to the user's query; (2) Policy Execution: Each agent sequentially invokes relevant tools to execute its policy and explore the video content; (3) Policy Communication: During the intermediate stages of policy execution, agents interact with one another to update their respective policies. Through this collaborative framework, all agents work in tandem, dynamically refining their preferred policies based on contextual insights from peers to effectively respond to the user's query. Moreover, we equip our CPP paradigm with a concise Multi-Agent Reinforcement Learning (MARL) method. Consequently, the team of policy agents can be jointly optimized to enhance VideoChat-M1's performance, guided by both the final answer reward and intermediate collaborative process feedback. Extensive experiments demonstrate that VideoChat-M1 achieves SOTA performance across eight benchmarks spanning four tasks. Notably, on LongVideoBench, our method outperforms the SOTA model Gemini 2.5 Pro by 3.6% and GPT-4o by 15.6%.

[6] Perceptual Taxonomy: Evaluating and Guiding Hierarchical Scene Reasoning in Vision-Language Models

Jonathan Lee, Xingrui Wang, Jiawei Peng, Luoxin Ye, Zehan Zheng, Tiezheng Zhang, Tao Wang, Wufei Ma, Siyi Chen, Yu-Cheng Chou, Prakhar Kaushik, Alan Yuille

🧩 TL;DR

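This work introduces the Perceptual Taxonomy benchmark for evaluating physically grounded visual reasoning. It reveals a marked performance drop for current vision-language models when reasoning over structured properties and shows that in-context reasoning examples from simulated scenes improve results.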


📘 Detailed Summary

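Motivation: Current vision-language benchmarks focus mainly on surface-level recognition or image-text alignment and lack a comprehensive evaluation of perceptual taxonomy, a structured process of scene understanding that involves recognizing objects, their spatial configurations, and task-relevant properties. This process is fundamental to human cognition, yet existing benchmarks do not measure it adequately.

Method: The Perceptual Taxonomy benchmark annotates 3173 objects with four property families covering 84 fine-grained attributes. From 5802 synthetic and real images, it builds 28033 template-based questions spanning four types (object description, spatial reasoning, property matching, and taxonomy reasoning), along with 50 expert-crafted questions.

Result: Leading vision-language models perform well on recognition tasks but degrade by 10 to 20 percent on property-driven questions, especially those requiring multi-step reasoning over structured attributes. Providing in-context reasoning examples from simulated scenes improves performance on real-world and expert-crafted questions.

Conclusion: The study exposes a persistent gap in structured visual understanding: current models rely heavily on pattern matching and lack deeper reasoning. Perceptual-taxonomy-guided prompting shows potential for improvement and points toward more capable visual reasoning systems.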


📄 Abstract

We propose Perceptual Taxonomy, a structured process of scene understanding that first recognizes objects and their spatial configurations, then infers task-relevant properties such as material, affordance, function, and physical attributes to support goal-directed reasoning. While this form of reasoning is fundamental to human cognition, current vision-language benchmarks lack comprehensive evaluation of this ability and instead focus on surface-level recognition or image-text alignment. To address this gap, we introduce Perceptual Taxonomy, a benchmark for physically grounded visual reasoning. We annotate 3173 objects with four property families covering 84 fine-grained attributes. Using these annotations, we construct a multiple-choice question benchmark with 5802 images across both synthetic and real domains. The benchmark contains 28033 template-based questions spanning four types (object description, spatial reasoning, property matching, and taxonomy reasoning), along with 50 expert-crafted questions designed to evaluate models across the full spectrum of perceptual taxonomy reasoning. Experimental results show that leading vision-language models perform well on recognition tasks but degrade by 10 to 20 percent on property-driven questions, especially those requiring multi-step reasoning over structured attributes. These findings highlight a persistent gap in structured visual understanding and the limitations of current models that rely heavily on pattern matching. We also show that providing in-context reasoning examples from simulated scenes improves performance on real-world and expert-curated questions, demonstrating the effectiveness of perceptual-taxonomy-guided prompting.

[7] Vidi2: Large Multimodal Models for Video Understanding and Creation

Vidi Team, Celong Liu, Chia-Wen Kuo, Chuang Huang, Dawei Du, Fan Chen, Guang Chen, Haoji Zhang, Haojun Zhao, Lingxi Zhang, Lu Guo, Lusha Li, Longyin Wen, Qihang Fan, Qingyu Chen, Rachel Deng, Sijie Zhu, Stuart Siew, Tong Jin, Weiyan Tao, Wen Zhong, Xiaohui Shen, Xin Gu, Zhenfang Chen, Zuhua Lin

🧩 TL;DR

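Vidi2 makes substantial progress in video understanding by adding fine-grained spatio-temporal grounding and video question answering, strengthening multimodal reasoning and outperforming leading proprietary systems on several benchmarks.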


📘 Detailed Summary

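Motivation: Video has become the primary medium for communication and creativity on the Internet, driving growing demand for scalable, high-quality video production. Existing video understanding models are limited in fine-grained spatio-temporal grounding and comprehensive reasoning, and cannot meet the needs of complex editing scenarios.

Method: Vidi2 adds spatio-temporal grounding: given a text query, it identifies both the timestamps and the bounding boxes of target objects, end to end. The work also introduces a new benchmark, VUE-STG, with key improvements in four areas: video duration, query format, annotation quality, and evaluation metric.

Result: Vidi2 substantially outperforms leading proprietary systems such as Gemini 3 Pro (Preview) and GPT-5 on VUE-TR-V2 and VUE-STG, while achieving competitive results against open-source models of similar scale on video QA benchmarks.

Conclusion: The work shows the application potential of end-to-end spatio-temporal grounding in complex video editing scenarios, including plot understanding, automatic multi-view switching, and intelligent, composition-aware cropping. The new benchmark provides a more comprehensive evaluation framework for video understanding research.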


📄 Abstract

Video has emerged as the primary medium for communication and creativity on the Internet, driving strong demand for scalable, high-quality video production. Vidi models continue to evolve toward next-generation video creation and have achieved state-of-the-art performance in multimodal temporal retrieval (TR). In its second release, Vidi2 advances video understanding with fine-grained spatio-temporal grounding (STG) and extends its capability to video question answering (Video QA), enabling comprehensive multimodal reasoning. Given a text query, Vidi2 can identify not only the corresponding timestamps but also the bounding boxes of target objects within the output time ranges. This end-to-end spatio-temporal grounding capability enables potential applications in complex editing scenarios, such as plot or character understanding, automatic multi-view switching, and intelligent, composition-aware reframing and cropping. To enable comprehensive evaluation of STG in practical settings, we introduce a new benchmark, VUE-STG, which offers four key improvements over existing STG datasets: 1) Video duration: spans from roughly 10s to 30 mins, enabling long-context reasoning; 2) Query format: queries are mostly converted into noun phrases while preserving sentence-level expressiveness; 3) Annotation quality: all ground-truth time ranges and bounding boxes are manually annotated with high accuracy; 4) Evaluation metric: a refined vIoU/tIoU/vIoU-Intersection scheme. In addition, we upgrade the previous VUE-TR benchmark to VUE-TR-V2, achieving a more balanced video-length distribution and more user-style queries. Remarkably, the Vidi2 model substantially outperforms leading proprietary systems, such as Gemini 3 Pro (Preview) and GPT-5, on both VUE-TR-V2 and VUE-STG, while achieving competitive results with popular open-source models with similar scale on video QA benchmarks.
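
The abstract names a refined vIoU/tIoU/vIoU-Intersection scheme without defining it. As a rough orientation only, the sketch below shows the two standard building blocks such metrics are usually composed from: temporal IoU between two time ranges and spatial IoU between two bounding boxes. The function names and the `(start, end)` and `(x1, y1, x2, y2)` conventions are assumptions for illustration, not the benchmark's actual definitions.

```python
def temporal_iou(pred, gt):
    """IoU of two time intervals given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Predicted range 0-10 s against ground truth 5-15 s.
    print(temporal_iou((0.0, 10.0), (5.0, 15.0)))
    # Two 2x2 boxes overlapping in a 1x1 square.
    print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

Video-level scores are typically obtained by averaging per-frame box IoU over some set of frames; which frames are included is exactly what distinguishes metric variants, and the abstract does not say how VUE-STG resolves that.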

[8] Cross-Domain Generalization of Multimodal LLMs for Global Photovoltaic Assessment

Muhao Guo, Yang Weng

🧩 TL;DR

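This study investigates the cross-domain generalization of multimodal large language models for distributed photovoltaic systems. Using structured prompts and fine-tuning, it unifies detection, localization, and quantification in one framework and outperforms traditional computer vision methods in cross-regional evaluation.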


📘 Detailed Summary

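Motivation: The rapid expansion of distributed photovoltaic systems poses challenges for power grid management, as many installations remain undocumented. Satellite imagery provides global coverage, but traditional computer vision models such as CNNs and U-Nets require extensive labeled data and generalize poorly across regions.

Method: A multimodal large language model is adapted with structured prompts and fine-tuning so that detection, localization, and quantification of photovoltaic systems are integrated within a unified framework.

Result: Cross-regional evaluation with the ΔF1 metric shows that the proposed model has the smallest performance degradation on unseen regions, outperforming conventional computer vision and transformer baselines.

Conclusion: The results indicate that multimodal large language models are robust under domain shift and have potential for scalable, transferable, and interpretable global photovoltaic mapping.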


📄 Abstract

The rapid expansion of distributed photovoltaic (PV) systems poses challenges for power grid management, as many installations remain undocumented. While satellite imagery provides global coverage, traditional computer vision (CV) models such as CNNs and U-Nets require extensive labeled data and fail to generalize across regions. This study investigates the cross-domain generalization of a multimodal large language model (LLM) for global PV assessment. By leveraging structured prompts and fine-tuning, the model integrates detection, localization, and quantification within a unified schema. Cross-regional evaluation using the ΔF1 metric demonstrates that the proposed model achieves the smallest performance degradation across unseen regions, outperforming conventional CV and transformer baselines. These results highlight the robustness of multimodal LLMs under domain shift and their potential for scalable, transferable, and interpretable global PV mapping.
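
The abstract does not define ΔF1. One natural reading, assumed here, is the drop in F1 between the region a model was trained on and an unseen region, so that a smaller value means better cross-regional generalization. The helper names and the example counts below are hypothetical and are not results from the paper.

```python
def f1_score(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def delta_f1(source_f1, target_f1):
    """Assumed definition: F1 on the source region minus F1 on an unseen region."""
    return source_f1 - target_f1


if __name__ == "__main__":
    # Made-up detection counts, for illustration only.
    in_region = f1_score(tp=8, fp=2, fn=2)
    unseen_region = f1_score(tp=6, fp=4, fn=4)
    print(delta_f1(in_region, unseen_region))
```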

[9] Think First, Assign Next (ThiFAN-VQA): A Two-stage Chain-of-Thought Framework for Post-Disaster Damage Assessment

Ehsan Karimi, Nhut Le, Maryam Rahnemoonfar

🧩 TL;DR

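This paper proposes ThiFAN-VQA, a two-stage reasoning-based visual question answering framework for damage assessment from UAV imagery after natural disasters. It generates structured reasoning traces with chain-of-thought prompting and in-context learning, then applies an answer selection module to improve performance, bridging the gap between zero-shot and supervised methods.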


📘 Detailed Summary

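Motivation: Existing AI-based approaches to post-disaster damage assessment face high annotation costs and limited dataset sizes, and traditional classification frameworks have fixed answer spaces that cannot flexibly provide new information. Pre-trained generative models support open-ended answers but often produce hallucinated outputs or generic responses lacking domain relevance, which limits their practical value.

Method: ThiFAN-VQA is a two-stage reasoning framework. It first generates structured reasoning traces using chain-of-thought prompting and in-context learning, enabling interpretable reasoning under limited supervision. An answer selection module then evaluates the generated responses and assigns the most coherent and contextually accurate answer. The framework integrates a custom information retrieval system, domain-specific prompting, and reasoning-guided answer selection.

Result: Experiments on the UAV-based FloodNet and RescueNet-VQA datasets show that ThiFAN-VQA achieves superior accuracy, interpretability, and adaptability in flood and hurricane scenarios. It improves performance on real-world post-disaster damage assessment tasks and outperforms existing methods.

Conclusion: ThiFAN-VQA combines the flexibility of generative models with the consistency of supervised methods, offering an effective solution for disaster assessment with limited labeled data. The work shows the potential of reasoning-guided frameworks for open-ended visual question answering and sets a new standard for reliability and adaptability in practical deployment.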


📄 Abstract

Timely and accurate assessment of damages following natural disasters is essential for effective emergency response and recovery. Recent AI-based frameworks have been developed to analyze large volumes of aerial imagery collected by Unmanned Aerial Vehicles, providing actionable insights rapidly. However, creating and annotating data for training these models is costly and time-consuming, resulting in datasets that are limited in size and diversity. Furthermore, most existing approaches rely on traditional classification-based frameworks with fixed answer spaces, restricting their ability to provide new information without additional data collection or model retraining. Using pre-trained generative models built on in-context learning (ICL) allows for flexible and open-ended answer spaces. However, these models often generate hallucinated outputs or produce generic responses that lack domain-specific relevance. To address these limitations, we propose ThiFAN-VQA, a two-stage reasoning-based framework for visual question answering (VQA) in disaster scenarios. ThiFAN-VQA first generates structured reasoning traces using chain-of-thought (CoT) prompting and ICL to enable interpretable reasoning under limited supervision. A subsequent answer selection module evaluates the generated responses and assigns the most coherent and contextually accurate answer, effectively improving model performance. By integrating a custom information retrieval system, domain-specific prompting, and reasoning-guided answer selection, ThiFAN-VQA bridges the gap between zero-shot and supervised methods, combining flexibility with consistency. Experiments on FloodNet and RescueNet-VQA, UAV-based datasets from flood- and hurricane-affected regions, demonstrate that ThiFAN-VQA achieves superior accuracy, interpretability, and adaptability for real-world post-disaster damage assessment tasks.

[10] Training-Free Generation of Diverse and High-Fidelity Images via Prompt Semantic Space Optimization

Debin Meng, Chen Jin, Zheng Gao, Yanran Li, Ioannis Patras, Georgios Tzimiropoulos

🧩 TL;DR

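This paper proposes TPSO (Token-Prompt embedding Space Optimization), a training-free and model-agnostic module that substantially improves the generative diversity of text-to-image diffusion models by exploring underrepresented regions of the token embedding space, while maintaining image quality.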


📘 Detailed Summary

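Motivation: Text-to-image diffusion models face a fundamental challenge of limited generative diversity. Low-diversity models tend to generate repetitive outputs, increasing sampling redundancy and hindering creative exploration and downstream applications. Existing approaches such as noise resampling, prompt rewriting, or steering-based guidance often still collapse to dominant modes or introduce distortions that degrade image quality.

Method: TPSO introduces learnable parameters to explore underrepresented regions of the token embedding space, reducing the model's tendency to repeatedly sample from strong modes of the learned distribution. The prompt-level space provides a global semantic constraint that regulates distribution shifts, preventing quality degradation while maintaining high fidelity.

Result: Extensive experiments on MS-COCO and three diffusion backbones show that TPSO significantly enhances generative diversity, improving baseline performance from 1.10 to 4.18 points without sacrificing image quality.

Conclusion: By exploring underrepresented regions of the token embedding space, TPSO addresses the diversity problem of diffusion models with a training-free, model-agnostic solution that preserves image quality, offering a new optimization direction for text-to-image generation.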


📄 Abstract

Image diversity remains a fundamental challenge for text-to-image diffusion models. Low-diversity models tend to generate repetitive outputs, increasing sampling redundancy and hindering both creative exploration and downstream applications. A primary cause is that generation often collapses toward a strong mode in the learned distribution. Existing attempts to improve diversity, such as noise resampling, prompt rewriting, or steering-based guidance, often still collapse to dominant modes or introduce distortions that degrade image quality. In light of this, we propose Token-Prompt embedding Space Optimization (TPSO), a training-free and model-agnostic module. TPSO introduces learnable parameters to explore underrepresented regions of the token embedding space, reducing the tendency of the model to repeatedly generate samples from strong modes of the learned distribution. At the same time, the prompt-level space provides a global semantic constraint that regulates distribution shifts, preventing quality degradation while maintaining high fidelity. Extensive experiments on MS-COCO and three diffusion backbones show that TPSO significantly enhances generative diversity, improving baseline performance from 1.10 to 4.18 points, without sacrificing image quality. Code will be released upon acceptance.

[11] HunyuanOCR Technical Report

Hunyuan Vision Team, Pengyuan Lyu, Xingyu Wan, Gengluo Li, Shangpin Peng, Weinong Wang, Liang Wu, Huawen Shen, Yu Zhou, Canhui Tang, Qi Yang, Qiming Peng, Bin Luo, Hower Yang, Houwen Peng, Hongming Yang, Senhao Xie, Binghong Wu, Mana Yang, Sergey Wang, Raccoon Liu, Dick Zhu, Jie Jiang, Linus, Han Hu, Chengquan Zhang

🧩 TL;DR

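This paper presents HunyuanOCR, a commercial-grade, open-source, lightweight vision-language model dedicated to OCR tasks. It unifies versatility and efficiency, adopts an end-to-end architecture, and applies data-driven and reinforcement learning strategies, outperforming commercial APIs and larger models on several benchmarks.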


📘 Detailed Summary

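Motivation: OCR currently suffers from narrow specialist models on one side and inefficient general vision-language models on the other. Traditional pipelines depend on pre-processing modules, which causes error propagation, and there is no efficient, unified solution that supports both perception and semantic tasks.

Method: The architecture connects a native Vision Transformer and a lightweight large language model through an MLP adapter. A pure end-to-end paradigm removes the dependency on pre-processing modules. Reinforcement learning strategies, which the authors report as a first in the industry for OCR, are combined with high-quality training data to improve performance.

Result: HunyuanOCR took first place in the ICDAR 2025 DIMT Challenge (Small Model Track) and achieves state-of-the-art results on OCRBench among vision-language models with fewer than 3B parameters. It surpasses existing public solutions and commercial APIs on perception tasks such as text spotting and parsing and on semantic tasks such as information extraction and text image translation.

Conclusion: HunyuanOCR shows that a lightweight model can unify versatility and efficiency through careful architecture design and training strategy, providing a solid foundation for industrial applications. Its end-to-end paradigm addresses the error propagation of traditional pipelines, and the successful use of reinforcement learning in OCR opens a direction for future research.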


📄 Abstract

This paper presents HunyuanOCR, a commercial-grade, open-source, and lightweight (1B parameters) Vision-Language Model (VLM) dedicated to OCR tasks. The architecture comprises a Native Vision Transformer (ViT) and a lightweight LLM connected via an MLP adapter. HunyuanOCR demonstrates superior performance, outperforming commercial APIs, traditional pipelines, and larger models (e.g., Qwen3-VL-4B). Specifically, it surpasses current public solutions in perception tasks (Text Spotting, Parsing) and excels in semantic tasks (IE, Text Image Translation), securing first place in the ICDAR 2025 DIMT Challenge (Small Model Track). Furthermore, it achieves state-of-the-art (SOTA) results on OCRBench among VLMs with fewer than 3B parameters. HunyuanOCR achieves breakthroughs in three key aspects: 1) Unifying Versatility and Efficiency: We implement comprehensive support for core capabilities including spotting, parsing, IE, VQA, and translation within a lightweight framework. This addresses the limitations of narrow "OCR expert models" and inefficient "General VLMs". 2) Streamlined End-to-End Architecture: Adopting a pure end-to-end paradigm eliminates dependencies on pre-processing modules (e.g., layout analysis). This fundamentally resolves error propagation common in traditional pipelines and simplifies system deployment. 3) Data-Driven and RL Strategies: We confirm the critical role of high-quality data and, for the first time in the industry, demonstrate that Reinforcement Learning (RL) strategies yield significant performance gains in OCR tasks. HunyuanOCR is officially open-sourced on HuggingFace. We also provide a high-performance deployment solution based on vLLM, placing its production efficiency in the top tier. We hope this model will advance frontier research and provide a solid foundation for industrial applications.

[12] CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception

Miguel Carvalho, Helder Dias, Bruno Martins

🧩 TL;DR

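CropVLM is a low-cost external method, trained with reinforcement learning, that enables vision-language models to dynamically zoom in on relevant image regions. It significantly improves performance on fine-grained visual understanding tasks without modifying or fine-tuning the target VLM.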


📘 Detailed Summary

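Motivation: Vision-language models perform poorly on tasks that require fine-grained image understanding, such as scene-text recognition or document analysis, mainly because of perception limitations and visual fragmentation. Existing approaches typically need human-labeled bounding boxes or expensive synthetic evaluations.

Method: CropVLM is an external enhancement trained with reinforcement learning to dynamically zoom in on relevant image regions. It needs no human-labeled bounding boxes as a supervision signal and no expensive synthetic evaluations. It is trained once and can be paired with both open-source and proprietary VLMs.

Result: The method brings significant gains on tasks that require high-resolution image understanding, notably on benchmarks that are out-of-domain for the target VLM, while avoiding catastrophic forgetting.

Conclusion: CropVLM offers an efficient external enhancement that improves fine-grained visual understanding without modifying the base VLM, which makes it practical for real-world VLM deployment.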


📄 Abstract

Vision-Language Models (VLMs) often struggle with tasks that require fine-grained image understanding, such as scene-text recognition or document analysis, due to perception limitations and visual fragmentation. To address these challenges, we introduce CropVLM as an external low-cost method for boosting performance, enabling VLMs to dynamically "zoom in" on relevant image regions, enhancing their ability to capture fine details. CropVLM is trained using reinforcement learning, without using human-labeled bounding boxes as a supervision signal, and without expensive synthetic evaluations. The model is trained once and can be paired with both open-source and proprietary VLMs to improve their performance. Our approach delivers significant improvements on tasks that require high-resolution image understanding, notably for benchmarks that are out-of-domain for the target VLM, without modifying or fine-tuning the VLM, thus avoiding catastrophic forgetting.

[13] On the Utility of Foundation Models for Fast MRI: Vision-Language-Guided Image Reconstruction

Ruimin Feng, Xingxin He, Ronald Mercer, Zachary Stewart, Fang Liu

🧩 TL;DR

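This study proposes a semantic distribution-guided reconstruction framework based on a vision-language foundation model. It encodes the reconstructed image and auxiliary information into high-level semantic features and uses a contrastive objective to keep the reconstructed representation consistent with the target semantic distribution, significantly improving the perceptual quality of undersampled MRI reconstruction.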


📘 Detailed Summary

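Motivation: Conventional MRI reconstruction methods rely mainly on low-level image priors and make little use of high-level semantic information. The study investigates whether a vision-language foundation model can enhance undersampled MRI reconstruction by providing high-level contextual information beyond conventional priors.

Method: The semantic distribution-guided reconstruction framework uses a pre-trained vision-language foundation model to encode both the reconstructed image and auxiliary information into high-level semantic features. A contrastive objective aligns the reconstructed representation with the target semantic distribution. The framework is compatible with various deep learning-based reconstruction methods and can flexibly incorporate semantic priors from multimodal sources.

Result: Experiments on knee and brain datasets show that image-based semantic priors preserve fine anatomical structures and achieve superior perceptual quality, reflected in lower LPIPS values, higher Tenengrad scores, and higher scores in the reader study. Image-language information further expands the semantic distribution and enables high-level control over reconstruction attributes.

Conclusion: Vision-language foundation models can significantly improve undersampled MRI reconstruction through semantic-space optimization. The work introduces a semantic-level optimization paradigm for medical image reconstruction and shows the potential of multimodal semantic priors for improving reconstruction quality.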


📄 Abstract

Purpose: To investigate whether a vision-language foundation model can enhance undersampled MRI reconstruction by providing high-level contextual information beyond conventional priors. Methods: We proposed a semantic distribution-guided reconstruction framework that uses a pre-trained vision-language foundation model to encode both the reconstructed image and auxiliary information into high-level semantic features. A contrastive objective aligns the reconstructed representation with the target semantic distribution, ensuring consistency with high-level perceptual cues. The proposed objective works with various deep learning-based reconstruction methods and can flexibly incorporate semantic priors from multimodal sources. To test the effectiveness of these semantic priors, we evaluated reconstruction results guided by priors derived from either image-only or image-language auxiliary information. Results: Experiments on knee and brain datasets demonstrate that semantic priors from images preserve fine anatomical structures and achieve superior perceptual quality, as reflected in lower LPIPS values, higher Tenengrad scores, and improved scores in the reader study, compared with conventional regularization. The image-language information further expands the semantic distribution and enables high-level control over reconstruction attributes. Across all evaluations, the contrastive objective consistently guided the reconstructed features toward the desired semantic distributions while maintaining data fidelity, demonstrating the effectiveness of the proposed optimization framework. Conclusion: The study highlights that vision-language foundation models can improve undersampled MRI reconstruction through semantic-space optimization.
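
Of the metrics cited in the results, Tenengrad is simple enough to illustrate directly. It is a classical sharpness measure based on Sobel gradient energy, where higher values indicate crisper edges. The sketch below averages the squared gradient magnitude over interior pixels; the averaging, the lack of a threshold, and the border handling are assumptions, since the abstract does not say which variant the authors used.

```python
def tenengrad(image):
    """Mean squared Sobel gradient magnitude over the interior pixels.

    `image` is a 2D grayscale image as a list of equal-length rows.
    Border pixels are skipped so the 3x3 Sobel kernels stay in bounds.
    """
    height, width = len(image), len(image[0])
    if height < 3 or width < 3:
        return 0.0
    total = 0.0
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Horizontal Sobel response (right column minus left column).
            gx = (image[y - 1][x + 1] + 2 * image[y][x + 1] + image[y + 1][x + 1]
                  - image[y - 1][x - 1] - 2 * image[y][x - 1] - image[y + 1][x - 1])
            # Vertical Sobel response (bottom row minus top row).
            gy = (image[y + 1][x - 1] + 2 * image[y + 1][x] + image[y + 1][x + 1]
                  - image[y - 1][x - 1] - 2 * image[y - 1][x] - image[y - 1][x + 1])
            total += gx * gx + gy * gy
    return total / ((height - 2) * (width - 2))


if __name__ == "__main__":
    sharp = [[0, 0, 1, 1] for _ in range(4)]             # abrupt vertical edge
    blurred = [[0, 1 / 3, 2 / 3, 1] for _ in range(4)]   # same edge, smoothed
    print(tenengrad(sharp), tenengrad(blurred))
```

The sharp edge scores higher than the smoothed one, which matches how the abstract uses higher Tenengrad as evidence of better-preserved detail.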

[14] MAPS: Preserving Vision-Language Representations via Module-Wise Proximity Scheduling for Better Vision-Language-Action Generalization

Chengyue Huang, Mellon M. Zhang, Robert Azarcon, Glen Chou, Zsolt Kira

🧩 TL;DR

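This paper proposes MAPS (Module-Wise Proximity Scheduling), the first robust fine-tuning framework for vision-language-action models. It linearly schedules the relaxation of proximity constraints to balance stability and flexibility, significantly improving in-distribution and out-of-distribution performance on several benchmarks.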


📘 Detailed Summary

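Motivation: Vision-language-action models inherit strong priors from pretrained vision-language models, but naive fine-tuning often disrupts these representations and harms generalization. Existing methods either overconstrain adaptation or ignore the differing functional roles of VLA components.

Method: Through systematic analysis, MAPS uncovers an empirical order in which proximity constraints should be relaxed. It schedules this relaxation linearly so that visual encoders stay close to their pretrained priors while action-oriented language layers adapt more freely. It adds no parameters or data and can be integrated into existing VLAs.

Result: Across MiniVLA-VQ, MiniVLA-OFT, OpenVLA-OFT, and challenging benchmarks such as SimplerEnv, CALVIN, and LIBERO, MAPS consistently improves in-distribution and out-of-distribution performance (by up to 30%), and it is also validated in real-world evaluations on the Franka Emika Panda platform.

Conclusion: The findings indicate that empirically guided proximity to pretrained VLMs is a simple yet powerful principle for preserving broad generalization in VLM-to-VLA transfer, offering a new paradigm for robust fine-tuning of vision-language-action models.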


📄 Abstract

Vision-Language-Action (VLA) models inherit strong priors from pretrained Vision-Language Models (VLMs), but naive fine-tuning often disrupts these representations and harms generalization. Existing fixes (freezing modules or applying uniform regularization) either overconstrain adaptation or ignore the differing roles of VLA components. We present MAPS (Module-Wise Proximity Scheduling), the first robust fine-tuning framework for VLAs. Through systematic analysis, we uncover an empirical order in which proximity constraints should be relaxed to balance stability and flexibility. MAPS linearly schedules this relaxation, enabling visual encoders to stay close to their pretrained priors while action-oriented language layers adapt more freely. MAPS introduces no additional parameters or data, and can be seamlessly integrated into existing VLAs. Across MiniVLA-VQ, MiniVLA-OFT, OpenVLA-OFT, and challenging benchmarks such as SimplerEnv, CALVIN, LIBERO, as well as real-world evaluations on the Franka Emika Panda platform, MAPS consistently boosts both in-distribution and out-of-distribution performance (up to +30%). Our findings highlight empirically guided proximity to pretrained VLMs as a simple yet powerful principle for preserving broad generalization in VLM-to-VLA transfer.
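
The abstract does not say what form the proximity constraint takes or how the schedule is parameterized. Purely as an illustration of the idea, the sketch below assumes an L2 penalty that pulls each module toward its pretrained weights, with a per-module coefficient that decreases linearly along an ordered list of modules. The module names, the endpoint values, and both function names are hypothetical; the actual method may use a different constraint, such as a projection, and a different ordering.

```python
def linear_proximity_weights(module_order, strongest=1.0, weakest=0.0):
    """Assign each module a proximity coefficient that decays linearly.

    `module_order` lists module names from most constrained to least
    constrained (an assumed ordering, not the one found in the paper).
    """
    count = len(module_order)
    if count == 1:
        return {module_order[0]: strongest}
    step = (strongest - weakest) / (count - 1)
    return {name: strongest - i * step for i, name in enumerate(module_order)}


def proximity_penalty(current, pretrained, weights):
    """Weighted squared distance between current and pretrained parameters."""
    total = 0.0
    for name, coeff in weights.items():
        squared = sum((c - p) ** 2 for c, p in zip(current[name], pretrained[name]))
        total += coeff * squared
    return total


if __name__ == "__main__":
    order = ["vision_encoder", "projector", "language_layers"]  # hypothetical names
    weights = linear_proximity_weights(order)
    pretrained = {"vision_encoder": [0.0, 0.0], "projector": [0.0], "language_layers": [0.0]}
    current = {"vision_encoder": [1.0, 2.0], "projector": [1.0], "language_layers": [10.0]}
    print(weights)
    print(proximity_penalty(current, pretrained, weights))
```

In this toy setup the language layers can drift freely because their coefficient is zero, while the same drift in the vision encoder would be penalized at full strength, which mirrors the behavior the abstract describes.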

[15] Navigating Gigapixel Pathology Images with Large Multimodal Models

Thomas A. Buckley, Kian R. Weihrauch, Katherine Latham, Andrew Z. Zhou, Padmini A. Manrai, Arjun K. Manrai

🧩 TL;DR

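This study presents GIANT, the first framework that lets large multimodal models iteratively navigate whole-slide images like a pathologist, and releases the MultiPathQA benchmark. On pathology image interpretation tasks, it substantially outperforms conventional approaches.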


📘 Detailed Summary

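Motivation: Although general-purpose large multimodal models are widely used in clinical care, they perform poorly at medical image interpretation, particularly in pathology. Prior studies relied on low-resolution thumbnails or random patches, which may have underestimated model performance, so a system is needed that can reason coherently and accurately over whole-slide images as a pathologist would.

Method: The authors propose the GIANT framework, which allows large multimodal models to iteratively navigate whole-slide images, and release MultiPathQA, a benchmark of 934 WSI-level questions spanning five clinically relevant tasks from cancer diagnosis to open-ended reasoning, including 128 pathologist-authored questions that require direct slide interpretation.

Result: On MultiPathQA, GIANT substantially outperforms conventional patch-based and thumbnail-based baselines, approaching or surpassing specialist models trained on millions of images. On the pathologist-authored questions, GPT-5 with GIANT reaches 62.5% accuracy, ahead of specialist pathology models such as TITAN (43.8%) and SlideChat (37.5%).

Conclusion: The study reveals the strengths and limitations of current foundation models in expert pathology reasoning, lays groundwork for future development of large multimodal models in pathology, and shows that agentic systems can substantially improve medical image interpretation.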


📄 Abstract

Despite being widely used to support clinical care, general-purpose large multimodal models (LMMs) have generally shown poor or inconclusive performance in medical image interpretation, particularly in pathology, where gigapixel images are used. However, prior studies have used either low-resolution thumbnails or random patches, which likely underestimated model performance. Here, we ask whether LMMs can be adapted to reason coherently and accurately in the evaluation of such images. In this study, we introduce Gigapixel Image Agent for Navigating Tissue (GIANT), the first framework that allows LMMs to iteratively navigate whole-slide images (WSIs) like a pathologist. Accompanying GIANT, we release MultiPathQA, a new benchmark, which comprises 934 WSI-level questions, encompassing five clinically-relevant tasks ranging from cancer diagnosis to open-ended reasoning. MultiPathQA also includes 128 questions, authored by two professional pathologists, requiring direct slide interpretation. Using MultiPathQA, we show that our simple agentic system substantially outperforms conventional patch- and thumbnail-based baselines, approaching or surpassing the performance of specialized models trained on millions of images. For example, on pathologist-authored questions, GPT-5 with GIANT achieves 62.5% accuracy, outperforming specialist pathology models such as TITAN (43.8%) and SlideChat (37.5%). Our findings reveal the strengths and limitations of current foundation models and ground future development of LMMs for expert reasoning in pathology.

[16] CounterVQA: Evaluating and Improving Counterfactual Reasoning in Vision-Language Models for Video Understanding

Yuefei Chen, Jiang Liu, Xiaodong Lin, Ruixiang Tang

🧩 TL;DR

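This paper introduces the CounterVQA benchmark, which systematically evaluates the counterfactual reasoning ability of vision-language models on video, and develops CFGPT, a post-training method that strengthens this ability and yields consistent improvements across all three difficulty levels.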


📘 Detailed Summary

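Motivation: Vision-language models have made notable progress in video understanding, but their counterfactual reasoning ability remains underexplored. This ability is essential for robust video understanding because it requires identifying underlying causal structure and reasoning about unobserved possibilities rather than merely recognizing observed patterns.

Method: The authors introduce CounterVQA, a video benchmark with three progressive difficulty levels that assess different aspects of counterfactual reasoning, and develop CFGPT, a post-training method that enhances a model's visual counterfactual reasoning by distilling its counterfactual reasoning capability from the language modality.

Result: A comprehensive evaluation of state-of-the-art open-source and closed-source models reveals a substantial performance gap: the models reach reasonable accuracy on simple counterfactual questions, but performance degrades significantly on complex multi-hop causal chains. CFGPT yields consistent improvements across all CounterVQA difficulty levels.

Conclusion: Current vision-language models show marked limitations in complex counterfactual reasoning. CFGPT improves this reasoning through cross-modal knowledge distillation and points to an important direction for making future video understanding systems more robust.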


📄 Abstract

Vision Language Models (VLMs) have recently shown significant advancements in video understanding, especially in feature alignment, event reasoning, and instruction-following tasks. However, their capability for counterfactual reasoning, that is, inferring alternative outcomes under hypothetical conditions, remains underexplored. This capability is essential for robust video understanding, as it requires identifying underlying causal structures and reasoning about unobserved possibilities, rather than merely recognizing observed patterns. To systematically evaluate this capability, we introduce CounterVQA, a video-based benchmark featuring three progressive difficulty levels that assess different aspects of counterfactual reasoning. Through comprehensive evaluation of both state-of-the-art open-source and closed-source models, we uncover a substantial performance gap: while these models achieve reasonable accuracy on simple counterfactual questions, performance degrades significantly on complex multi-hop causal chains. To address these limitations, we develop a post-training method, CFGPT, that enhances a model's visual counterfactual reasoning ability by distilling its counterfactual reasoning capability from the language modality, yielding consistent improvements across all CounterVQA difficulty levels. The dataset and code will be released.

[17] CodeV: Code with Images for Faithful Visual Reasoning via Tool-Aware Policy Optimization

Xinhai Hou, Shaoyuan Xu, Manan Biyani, Mayan Li, Jia Liu, Todd C. Hollon, Bryan Wang

🧩 TL;DR

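This paper presents CodeV, a visual agent, together with the Tool-Aware Policy Optimization framework, which addresses unfaithful tool use in visual agents by directly supervising the inputs and outputs of visual tools. The approach substantially improves tool-use faithfulness while maintaining high accuracy.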


📘 Detailed Summary

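Motivation: Existing vision-language models use tools unfaithfully when calling image operations: even when the final answer is correct, a model may invoke tools on irrelevant regions or ignore tool outputs entirely, exposing a serious weakness of current visual agents in faithful reasoning.

Method: The authors propose CodeV, a code-based visual agent, and the Tool-Aware Policy Optimization framework, which defines dense rewards directly on visual tool inputs and outputs rather than on chain-of-thought tokens. Visual tools are represented as Python code, and training follows a two-stage SFT+RL pipeline.

Result: CodeV achieves competitive or superior accuracy on visual search benchmarks while substantially increasing faithful tool-use rates, and it performs strongly on multimodal reasoning and math benchmarks.

Conclusion: Explicitly supervising intermediate tool behavior is crucial for building trustworthy agentic visual reasoning systems. Tool-Aware Policy Optimization offers an effective way to address faithfulness in visual agents and supports the development of more reliable visual reasoning systems.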


📄 Abstract

Agentic vision-language models are increasingly trained to "think with images" by calling image operations. However, we show that high final-answer accuracy often hides unfaithful visual reasoning: models may invoke tools on irrelevant regions or ignore tool outputs entirely, yet still guess the correct answer. In this work, we first propose a faithfulness evaluation protocol that measures whether intermediate visual tool outputs (e.g., crops) actually contain the queried evidence. This reveals that recent visual agents achieve high final-answer accuracy but exhibit low rates of faithful tool-use on visual search benchmarks. We then introduce CodeV, a code-based visual agent trained with Tool-Aware Policy Optimization (TAPO). TAPO is a process-level RL framework that augments GRPO with dense rewards defined directly on visual tool inputs and outputs, rather than on chain-of-thought tokens, making supervision easier to verify and less susceptible to reward hacking. CodeV represents visual tools as executable Python code, and TAPO assigns step-wise rewards based solely on the question and tool output, encouraging both necessary and evidence-consistent tool use. In a two-stage SFT+RL pipeline, CodeV achieves competitive or superior accuracy while substantially increasing faithful tool-use rates on related visual search benchmarks. Beyond visual search, CodeV attains strong performance on a range of multimodal reasoning and math benchmarks, suggesting that explicitly supervising intermediate tool behavior is crucial for building trustworthy, agentic visual reasoning systems.
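
The faithfulness protocol asks whether an intermediate tool output such as a crop actually contains the queried evidence. One simple way to approximate that check, assuming a ground-truth evidence bounding box is available, is to measure how much of the evidence box falls inside the crop. The box format, the `min_coverage` threshold, and both function names below are hypothetical; the paper's actual protocol may differ.

```python
def evidence_coverage(crop, evidence):
    """Fraction of the evidence box covered by the crop.
    Boxes are (x1, y1, x2, y2) tuples in pixel coordinates (assumed format)."""
    ix1, iy1 = max(crop[0], evidence[0]), max(crop[1], evidence[1])
    ix2, iy2 = min(crop[2], evidence[2]), min(crop[3], evidence[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = (evidence[2] - evidence[0]) * (evidence[3] - evidence[1])
    return inter / area if area > 0 else 0.0


def faithful_tool_use_rate(pairs, min_coverage=0.5):
    """Share of (crop, evidence) pairs whose crop covers enough of the evidence.
    The 0.5 threshold is an illustrative choice, not a value from the paper."""
    if not pairs:
        return 0.0
    hits = sum(evidence_coverage(c, e) >= min_coverage for c, e in pairs)
    return hits / len(pairs)


crop = (0, 0, 100, 100)
pairs = [(crop, (50, 50, 150, 150)),   # only a quarter of the evidence is visible
         (crop, (10, 10, 20, 20))]     # evidence fully inside the crop
rate = faithful_tool_use_rate(pairs)
```

A metric of this kind is computed independently of the final answer, which is what lets it expose cases where the answer is right but the crop missed the evidence.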

[18] OncoVision: Integrating Mammography and Clinical Data through Attention-Driven Multimodal AI for Enhanced Breast Cancer Diagnosis

Istiak Ahmed, Galib Ahmed, K. Shahriar Sanjid, Md. Tanzim Hossain, Md. Nishan Khan, Md. Misbah Khan, Md. Arifur Rahman, Sheikh Anisul Haque, Sharmin Akhtar Rupa, Mohammed Mejbahuddin Mia, Mahmud Hasan Mostofa Kamal, Md. Mostafa Kamal Sarker, M. Monir Uddin

🧩 TL;DR

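OncoVision is a multimodal AI pipeline that combines mammography images with clinical data. Using an attention-based encoder-decoder architecture, it jointly segments four key regions and predicts ten clinical features, and it applies late-fusion strategies to improve diagnostic precision and reduce inter-observer variability.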


📘 Detailed Summary

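Motivation: The study targets the high inter-observer variability and limited diagnostic precision of conventional mammography reading, especially in under-resourced regions that lack specialist radiologists, where an AI system that combines imaging and clinical data is needed to improve early breast cancer detection.

Method: An attention-based encoder-decoder backbone jointly segments four key regions (masses, calcifications, axillary findings, and breast tissue) and predicts ten structured clinical features, including mass morphology, calcification type, ACR breast density, and BI-RADS category. Two late-fusion strategies integrate the imaging and clinical insights.

Result: The system achieves state-of-the-art segmentation accuracy and robustly predicts multiple clinical features. Multimodal fusion markedly improves diagnostic precision and reduces inter-observer variability. The system is deployed as a secure web application that provides structured reports and dual-confidence scoring.

Conclusion: By combining accurate segmentation with clinical intuition, OncoVision sets a new standard for AI-based mammography, offering a scalable and equitable solution that enables early breast cancer detection in under-resourced regions worldwide and improves treatment outcomes through timely intervention.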


📄 Abstract

OncoVision is a multimodal AI pipeline that combines mammography images and clinical data for better breast cancer diagnosis. Employing an attention-based encoder-decoder backbone, it jointly segments four ROIs - masses, calcifications, axillary findings, and breast tissues - with state-of-the-art accuracy and robustly predicts ten structured clinical features: mass morphology, calcification type, ACR breast density, and BI-RADS categories. To fuse imaging and clinical insights, we developed two late-fusion strategies. By utilizing complementary multimodal data, late fusion strategies improve diagnostic precision and reduce inter-observer variability. Operationalized as a secure, user-friendly web application, OncoVision produces structured reports with dual-confidence scoring and attention-weighted visualizations for real-time diagnostic support to improve clinician trust and facilitate medical teaching. It can be easily incorporated into the clinic, making screening available in underprivileged areas around the world, such as rural South Asia. Combining accurate segmentation with clinical intuition, OncoVision raises the bar for AI-based mammography, offering a scalable and equitable solution to detect breast cancer at an earlier stage and enhancing treatment through timely interventions.

[19] INTERLACE: Interleaved Layer Pruning and Efficient Adaptation in Large Vision-Language Models

Parsa Madinei, Ryan Solgi, Ziqi Wen, Jonathan Skaza, Miguel Eckstein, Ramtin Pedarsani

🧩 TL;DR

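INTERLACE is a novel layer-pruning framework for vision-language models that removes redundant layers while preserving performance through sample-efficient finetuning. It identifies local redundancy by analyzing triplets of consecutive layers and, after finetuning on only 1% of the data, retains 88.9% of average performance with 25% of the network pruned.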


📘 Detailed Summary

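Motivation: Existing layer-pruning methods cause significant performance drops when applied to vision-language models, which limits the practicality of model compression. The study asks how to prune redundant VLM layers effectively while preserving performance, overcoming the losses incurred by conventional pruning.

Method: INTERLACE analyzes triplets of consecutive layers to identify local redundancy, removes the more redundant of the first two layers, finetunes the remaining layer to compensate for the lost capacity, and freezes the third layer as a stable anchor during finetuning. This interleaved finetune-freeze design lets the pruned model recover performance with little data and fast convergence.

Result: With a single epoch of finetuning on only 1% of the FineVision dataset, INTERLACE retains 88.9% of average performance after pruning 25% of the network's layers, a state-of-the-art result. It is markedly more sample-efficient than existing methods.

Conclusion: INTERLACE shows that a carefully designed interleaved finetuning strategy can recover the performance of a pruned model with very little data, offering a new route to efficient compression of vision-language models. Its sample efficiency makes it valuable for practical deployment and opens a new direction for lightweight deployment of large models.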


📄 Abstract

We introduce INTERLACE, a novel framework that prunes redundant layers in VLMs while maintaining performance through sample-efficient finetuning. Existing layer pruning methods lead to a significant performance drop when applied to VLMs. Instead, we analyze triplets of consecutive layers to identify local redundancy, remove the more redundant of the first two layers, finetune the remaining layer to compensate for the lost capacity, and freeze the third layer to serve as a stable anchor during finetuning. We find that this interleaved finetune-freeze design enables rapid convergence with minimal data after pruning. By finetuning only a subset of layers on just 1% of the FineVision dataset for one epoch, INTERLACE achieves 88.9% average performance retention after dropping 25% of the network, achieving SOTA performance. Our code is available at: https://github.com/pmadinei/Interlace.git
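
The drop/finetune/freeze assignment within a triplet can be sketched as follows. This is an illustrative sketch only: the per-layer redundancy scores are taken as given (the abstract does not say how redundancy is measured), triplets are walked left to right without overlap, and `plan_triplets` is a hypothetical helper rather than a function from the released code.

```python
def plan_triplets(redundancy, num_to_drop):
    """Assign a role to each layer in consecutive, non-overlapping triplets.

    redundancy[i] is a hypothetical score for layer i (higher = more redundant).
    In each triplet, the more redundant of the first two layers is dropped,
    the other is finetuned, and the third layer is frozen as an anchor.
    Layers outside the processed triplets are left unassigned."""
    plan = {}
    dropped = 0
    for start in range(0, len(redundancy) - 2, 3):
        if dropped == num_to_drop:
            break
        first, second, third = start, start + 1, start + 2
        if redundancy[first] >= redundancy[second]:
            drop, keep = first, second
        else:
            drop, keep = second, first
        plan[drop] = "drop"
        plan[keep] = "finetune"
        plan[third] = "freeze"
        dropped += 1
    return plan


scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.5, 0.4, 0.4, 0.4]  # toy values
plan = plan_triplets(scores, num_to_drop=2)
```

Processing every triplet would remove a third of the layers, so the `num_to_drop` budget is what allows a lower pruning ratio such as the 25% reported in the abstract; how the authors choose which triplets to process is not stated there.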

[20] IndEgo: A Dataset of Industrial Scenarios and Collaborative Work for Egocentric Assistants

Vivek Chavan, Yasmina Imgrund, Tung Dao, Sanwantri Bai, Bosong Wang, Ze Lu, Oliver Heimann, Jörg Krüger

🧩 TL;DR

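This paper presents IndEgo, a multimodal egocentric and exocentric dataset of industrial tasks comprising 3,460 egocentric and 1,092 exocentric recordings. It focuses on collaborative work scenarios, provides rich multimodal data with detailed annotations, and establishes a new benchmark for industrial task understanding.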


📘 Detailed Summary

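Motivation: Industrial task understanding lacks comprehensive multimodal datasets that include collaborative work, particularly data covering both egocentric and exocentric viewpoints, which limits the evaluation and development of models for complex industrial environments.

Method: The authors build an industrial task dataset with 3,460 egocentric and 1,092 exocentric recordings covering assembly/disassembly, logistics and organisation, inspection and repair, woodworking, and other tasks, with particular attention to collaborative work. It provides multimodal data such as eye gaze, narration, sound, and motion, along with detailed action annotations, summaries, mistake annotations, and metadata.

Result: Baseline evaluations on mistake detection, question answering, and collaborative task understanding show that current state-of-the-art multimodal models struggle on the dataset, confirming that it is a challenging benchmark for industrial task understanding.

Conclusion: IndEgo establishes a new benchmark for multimodal task understanding in industrial settings, with particular emphasis on collaborative work. It is an important resource for developing more robust industrial AI systems and exposes the limitations of current models in understanding complex industrial tasks.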


📄 Abstract

We introduce IndEgo, a multimodal egocentric and exocentric dataset addressing common industrial tasks, including assembly/disassembly, logistics and organisation, inspection and repair, woodworking, and others. The dataset contains 3,460 egocentric recordings (approximately 197 hours), along with 1,092 exocentric recordings (approximately 97 hours). A key focus of the dataset is collaborative work, where two workers jointly perform cognitively and physically intensive tasks. The egocentric recordings include rich multimodal data and added context via eye gaze, narration, sound, motion, and others. We provide detailed annotations (actions, summaries, mistake annotations, narrations), metadata, processed outputs (eye gaze, hand pose, semi-dense point cloud), and benchmarks on procedural and non-procedural task understanding, Mistake Detection, and reasoning-based Question Answering. Baseline evaluations for Mistake Detection, Question Answering and collaborative task understanding show that the dataset presents a challenge for the state-of-the-art multimodal models. Our dataset is available at: https://huggingface.co/datasets/FraunhoferIPK/IndEgo

[21] RADSeg: Unleashing Parameter and Compute Efficient Zero-Shot Open-Vocabulary Segmentation Using Agglomerative Models

Omar Alama, Darshil Jariwala, Avigyan Bhattacharya, Seungchan Kim, Wenshan Wang, Sebastian Scherer

🧩 TL;DR

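This paper proposes RADSeg, which uses the overlooked RADIO vision foundation model to improve zero-shot open-vocabulary semantic segmentation along three axes at once: higher mIoU, lower latency, and better parameter efficiency. Despite its smaller size, it surpasses previous methods that combine several large vision models.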


📘 Detailed Summary

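Motivation: Existing open-vocabulary semantic segmentation methods either rely on limited training data, which hinders generalization, or apply zero-shot heuristics to vision-language models, while the most competitive approaches combine multiple models at high computational and memory cost. This work aims to overcome these limitations while improving segmentation performance, latency, and parameter efficiency.

Method: The paper presents the first comprehensive study of RADIO for zero-shot open-vocabulary semantic segmentation and enhances its performance through self-correlating recursive attention, self-correlating global aggregation, and computationally efficient mask refinement. Together these components form the RADSeg framework, improving both segmentation accuracy and computational efficiency.

Result: RADSeg achieves a 6-30% mIoU improvement in the base ViT class while being 3.95x faster and using 2.5x fewer parameters. Surprisingly, RADSeg-base, with only 105M parameters, outperforms previous combinations of huge vision models (850-1350M parameters) in mIoU, reaching state-of-the-art accuracy at substantially lower computational and memory cost.

Conclusion: The study shows that the overlooked RADIO model has considerable potential for open-vocabulary semantic segmentation and that, with suitable enhancements, it delivers strong performance while remaining efficient. The finding suggests a new direction for lighter yet capable segmentation systems and challenges the prevailing reliance on combinations of large models.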


📄 Abstract

Open-vocabulary semantic segmentation (OVSS) underpins many vision and robotics tasks that require generalizable semantic understanding. Existing approaches either rely on limited segmentation training data, which hinders generalization, or apply zero-shot heuristics to vision-language models (e.g., CLIP), while the most competitive approaches combine multiple models to improve performance at the cost of high computational and memory demands. In this work, we leverage an overlooked agglomerative vision foundation model, RADIO, to improve zero-shot OVSS along three key axes simultaneously: mIoU, latency, and parameter efficiency. We present the first comprehensive study of RADIO for zero-shot OVSS and enhance its performance through self-correlating recursive attention, self-correlating global aggregation, and computationally efficient mask refinement. Our approach, RADSeg, achieves 6-30% mIoU improvement in the base ViT class while being 3.95x faster and using 2.5x fewer parameters. Surprisingly, RADSeg-base (105M) outperforms previous combinations of huge vision models (850-1350M) in mIoU, achieving state-of-the-art accuracy with substantially lower computational and memory cost.

[22] Leveraging Foundation Models for Histological Grading in Cutaneous Squamous Cell Carcinoma using PathFMTools

Abdul Rahman Diab, Emily E. Karn, Renchin Wu, Emily S. Ruiz, William Lotter

🧩 TL;DR

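This study presents PathFMTools, a lightweight, extensible Python package for efficiently running, analyzing, and visualizing pathology foundation models. By evaluating state-of-the-art vision-language foundation models such as CONCH and MUSK on histological grading of cutaneous squamous cell carcinoma, it validates the potential of foundation model embeddings for clinical applications.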


📘 Detailed Summary

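Motivation: Despite the considerable promise of computational pathology foundation models, adapting them to specific clinical tasks remains challenging because of the complexity of whole-slide image processing, the opacity of learned features, and the wide range of possible adaptation strategies.

Method: The authors develop PathFMTools, a lightweight Python package for interfacing with state-of-the-art vision-language foundation models such as CONCH and MUSK, and benchmark multiple adaptation strategies on a cohort of 440 H&E whole-slide images of cutaneous squamous cell carcinoma.

Result: The study demonstrates trade-offs across prediction approaches and validates the potential of training small specialist models on foundation model embeddings, supporting the feasibility of pathology foundation models in real-world clinical applications.

Conclusion: These findings underscore the promise of pathology foundation models for real-world clinical applications. PathFMTools enables efficient analysis and validation and provides useful tooling for the development of clinical decision support systems.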


📄 Abstract

Despite the promise of computational pathology foundation models, adapting them to specific clinical tasks remains challenging due to the complexity of whole-slide image (WSI) processing, the opacity of learned features, and the wide range of potential adaptation strategies. To address these challenges, we introduce PathFMTools, a lightweight, extensible Python package that enables efficient execution, analysis, and visualization of pathology foundation models. We use this tool to interface with and evaluate two state-of-the-art vision-language foundation models, CONCH and MUSK, on the task of histological grading in cutaneous squamous cell carcinoma (cSCC), a critical criterion that informs cSCC staging and patient management. Using a cohort of 440 cSCC H&E WSIs, we benchmark multiple adaptation strategies, demonstrating trade-offs across prediction approaches and validating the potential of using foundation model embeddings to train small specialist models. These findings underscore the promise of pathology foundation models for real-world clinical applications, with PathFMTools enabling efficient analysis and validation.

[23] Efficient Transferable Optimal Transport via Min-Sliced Transport Plans

Xinran Liu, Elaheh Akbari, Rocio Diaz Martin, Navid NaderiAlizadeh, Soheil Kolouri

🧩 TL;DR

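This paper studies the transferability of optimized slicers in the min-Sliced Transport Plan (min-STP) framework. It proves that slicers remain stable under slight perturbations of the data distributions, introduces a minibatch min-STP for scalability, and achieves strong one-shot matching performance on point cloud alignment and flow-based generative modeling.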


📘 Detailed Summary

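Motivation: Sliced transport methods reduce computational cost by exploiting the closed-form solutions of 1D OT problems, but a key question remains: when the data distributions shift, can a trained slicer transfer effectively to new distribution pairs? Understanding this transferability matters in settings where data evolves or OT must be computed repeatedly across related distributions.

Method: The paper studies the transferability of optimized slicers in the min-STP framework and shows theoretically that optimized slicers remain close under slight perturbations of the data distributions, enabling efficient transfer across related tasks. For scalability, it introduces a minibatch formulation of min-STP and provides statistical guarantees on its accuracy.

Result: Empirically, transferable min-STP achieves strong one-shot matching performance on point cloud alignment and flow-based generative modeling and facilitates amortized training. The method retains effective transport plan quality under distributional shift.

Conclusion: The study confirms that optimized slicers transfer across related distributions, providing both theoretical grounding and a practical approach for applying OT efficiently to evolving data. The min-STP framework lowers computational complexity and, through transferability, generalizes across tasks, opening a new route to large-scale OT applications.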


📄 Abstract

Optimal Transport (OT) offers a powerful framework for finding correspondences between distributions and addressing matching and alignment problems in various areas of computer vision, including shape analysis, image generation, and multimodal tasks. The computation cost of OT, however, hinders its scalability. Slice-based transport plans have recently shown promise for reducing the computational cost by leveraging the closed-form solutions of 1D OT problems. These methods optimize a one-dimensional projection (slice) to obtain a conditional transport plan that minimizes the transport cost in the ambient space. While efficient, these methods leave open the question of whether learned optimal slicers can transfer to new distribution pairs under distributional shift. Understanding this transferability is crucial in settings with evolving data or repeated OT computations across closely related distributions. In this paper, we study the min-Sliced Transport Plan (min-STP) framework and investigate the transferability of optimized slicers: can a slicer trained on one distribution pair yield effective transport plans for new, unseen pairs? Theoretically, we show that optimized slicers remain close under slight perturbations of the data distributions, enabling efficient transfer across related tasks. To further improve scalability, we introduce a minibatch formulation of min-STP and provide statistical guarantees on its accuracy. Empirically, we demonstrate that the transferable min-STP achieves strong one-shot matching performance and facilitates amortized training for point cloud alignment and flow-based generative modeling.
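
The core idea of a sliced transport plan, matching points by the rank of their 1D projections and then scoring that matching in the ambient space, can be illustrated with a toy example. This sketch assumes two equal-size point sets with uniform weights and replaces the paper's slicer optimization with a search over a small hand-picked list of candidate directions; the function names are hypothetical.

```python
def sliced_plan(xs, ys, theta):
    """Match points by the rank of their projection onto direction theta.
    For equal-size, uniformly weighted sets, sorting solves the 1D OT problem."""
    def project(p):
        return sum(pi * ti for pi, ti in zip(p, theta))
    order_x = sorted(range(len(xs)), key=lambda i: project(xs[i]))
    order_y = sorted(range(len(ys)), key=lambda j: project(ys[j]))
    return list(zip(order_x, order_y))


def ambient_cost(xs, ys, plan):
    """Mean squared Euclidean distance of the matched pairs in the original space."""
    total = sum(sum((a - b) ** 2 for a, b in zip(xs[i], ys[j])) for i, j in plan)
    return total / len(plan)


def best_slicer(xs, ys, candidates):
    """Pick the candidate direction whose induced plan has the lowest ambient cost.
    A finite search stands in for the learned slicer described in the paper."""
    return min(candidates, key=lambda th: ambient_cost(xs, ys, sliced_plan(xs, ys, th)))


xs = [(0, 0), (1, 0)]
ys = [(1, 1), (0, 1)]
theta = best_slicer(xs, ys, [(0, 1), (1, 0)])
plan = sliced_plan(xs, ys, theta)
```

Transfer in this picture means reusing `theta` on a new, nearby pair of point sets and calling only `sliced_plan`, which costs a sort rather than a fresh optimization.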

[24] Prune-Then-Plan: Step-Level Calibration for Stable Frontier Exploration in Embodied Question Answering

Noah Frahm, Prakrut Patel, Yue Zhang, Shoubin Yu, Mohit Bansal, Roni Sengupta

🧩 TL;DR

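This paper proposes Prune-Then-Plan, a framework that uses step-level calibration to resolve the frontier oscillations of large vision-language models in embodied question answering, converting overconfident predictions into conservative, interpretable actions. It achieves relative improvements of up to 49% on the OpenEQA and EXPRESS-Bench datasets.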


📘 Detailed Summary

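Motivation: When used directly for step-level exploration, large vision-language models often exhibit frontier oscillations, unstable back-and-forth movements caused by overconfidence and miscalibration, which reduce navigation efficiency and degrade answer quality. Existing methods lack a reliable mechanism for calibrating VLM predictions, so exploration is unstable.

Method: Prune-Then-Plan prunes implausible frontier choices with a Holm-Bonferroni-inspired procedure and then delegates the final decision to a coverage-based planner. This separation relies on human-level judgments to calibrate the step-level behavior of VLMs, converting overconfident predictions into conservative, interpretable actions.

Result: Integrated into the 3D-Mem EQA framework, the method achieves relative improvements over baselines of up to 49% in visually grounded SPL and 33% in LLM-Match. On both OpenEQA and EXPRESS-Bench it achieves better scene coverage under equal exploration budgets.

Conclusion: Separating VLM predictions from conservative planning effectively addresses frontier oscillations and offers a new approach to reliable exploration for embodied agents. The method demonstrates that calibration can stabilize VLM behavior in embodied tasks and provides a useful reference for the design of future embodied AI systems.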


📄 Abstract

Large vision-language models (VLMs) have improved embodied question answering (EQA) agents by providing strong semantic priors for open-vocabulary reasoning. However, when used directly for step-level exploration, VLMs often exhibit frontier oscillations, unstable back-and-forth movements caused by overconfidence and miscalibration, leading to inefficient navigation and degraded answer quality. We propose Prune-Then-Plan, a simple and effective framework that stabilizes exploration through step-level calibration. Instead of trusting raw VLM scores, our method prunes implausible frontier choices using a Holm-Bonferroni inspired pruning procedure and then delegates final decisions to a coverage-based planner. This separation converts overconfident predictions into conservative, interpretable actions by relying on human-level judgments to calibrate the step-level behavior of VLMs. Integrated into the 3D-Mem EQA framework, our approach achieves relative improvements of up to 49% and 33% in visually grounded SPL and LLM-Match metrics respectively over baselines. Overall, our method achieves better scene coverage under equal exploration budgets on both OpenEQA and EXPRESS-Bench datasets.
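
For readers unfamiliar with the statistical procedure the pruning step is modeled on, here is the classical Holm-Bonferroni step-down test in a few lines. How the paper derives per-frontier p-values and adapts the procedure is not described in the abstract, so the mapping from rejected hypotheses to pruned frontiers at the end is a hypothetical illustration.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Classical Holm-Bonferroni step-down procedure.

    Sort p-values ascending; reject the k-th smallest (k = 1..m) while
    p_(k) <= alpha / (m - k + 1), and stop at the first failure.
    Returns the set of indices of rejected hypotheses."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = set()
    for rank, idx in enumerate(order):  # rank is zero-based, so m - rank = m - k + 1
        if p_values[idx] <= alpha / (m - rank):
            rejected.add(idx)
        else:
            break
    return rejected


# Hypothetical use: one p-value per candidate frontier, where rejection means
# "this frontier is implausible" and the survivors go to the planner.
p_values = [0.01, 0.04, 0.03, 0.005]
pruned = holm_bonferroni(p_values)
surviving = [i for i in range(len(p_values)) if i not in pruned]
```

The step-down thresholds control the family-wise error rate while being less conservative than a plain Bonferroni correction, which is presumably why such a procedure is attractive for pruning only clearly implausible choices.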

[25] What You See is (Usually) What You Get: Multimodal Prototype Networks that Abstain from Expensive Modalities

Muchang Bahng, Charlie Berens, Jon Donnelly, Eric Chen, Chaofan Chen, Cynthia Rudin

🧩 TL;DR

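This study proposes a multimodal prototype network that ensembles prototypes from different modalities and adds a cost-aware mechanism, addressing the poor interpretability of neural networks and the high cost of collecting genetic data in species detection. While maintaining high accuracy, it reserves expensive genetic data for fine-grained classification and uses abundant image data for visual classification.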


📘 Detailed Summary

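Motivation: Species detection is essential for monitoring ecosystem health and identifying invasive species, but current multimodal neural networks have two major drawbacks: their black-box nature makes decisions hard to interpret, and genetic data is expensive to collect and often requires invasive procedures. These limitations hinder the automation and wider use of species detection, especially where interpretability and cost-effectiveness are required.

Method: The study extends prototype networks (ProtoPNets) to the multimodal, cost-aware setting by ensembling prototypes from each modality and using an associated weight to determine how much a prediction relies on each modality. It further introduces methods for identifying cases in which a confident prediction can be made without the expensive genetic information, yielding a selective allocation strategy for genetic data.

Result: Experiments show that the approach allocates expensive genetic data to fine-grained classification while using abundant image data for clear-cut visual classification, reaching accuracy comparable to models that always use both modalities. It substantially reduces data collection cost while maintaining performance.

Conclusion: The study demonstrates the effectiveness of multimodal prototype networks for species detection: they provide an interpretable decision process and optimize resource allocation through cost awareness. The approach offers a more practical and economical solution for ecological monitoring and species conservation and a reference framework for other applications that need multimodal data and cost optimization.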


📄 Abstract

Species detection is important for monitoring the health of ecosystems and identifying invasive species, serving a crucial role in guiding conservation efforts. Multimodal neural networks have seen increasing use for identifying species to help automate this task, but they have two major drawbacks. First, their black-box nature prevents the interpretability of their decision making process. Second, collecting genetic data is often expensive and requires invasive procedures, often necessitating researchers to capture or kill the target specimen. We address both of these problems by extending prototype networks (ProtoPNets), which are a popular and interpretable alternative to traditional neural networks, to the multimodal, cost-aware setting. We ensemble prototypes from each modality, using an associated weight to determine how much a given prediction relies on each modality. We further introduce methods to identify cases for which we do not need the expensive genetic information to make confident predictions. We demonstrate that our approach can intelligently allocate expensive genetic data for fine-grained distinctions while using abundant image data for clearer visual classifications and achieving comparable accuracy to models that consistently use both modalities.
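
Abstaining from the expensive modality can be pictured as a confidence gate: predict from the image alone when the image-only prediction is confident enough, and request genetic data otherwise. The sketch below is a generic illustration under stated assumptions (a softmax confidence threshold and a fixed mixing weight); the abstract does not specify the paper's actual abstention criteria, and all names here are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_abstention(image_scores, get_genetic_scores,
                            image_weight=0.5, threshold=0.9):
    """Return (predicted_class, used_genetic_data).

    get_genetic_scores is a zero-argument callable that is invoked only when
    the image-only confidence falls below the threshold, so its cost is paid
    only for hard cases. Threshold and weight are illustrative values."""
    probs = softmax(image_scores)
    if max(probs) >= threshold:
        return probs.index(max(probs)), False
    genetic_scores = get_genetic_scores()
    combined = [image_weight * i + (1.0 - image_weight) * g
                for i, g in zip(image_scores, genetic_scores)]
    return combined.index(max(combined)), True


calls = []


def fake_genetic_lookup():
    """Stand-in for an expensive genetic assay; records each call."""
    calls.append(1)
    return [0.0, 3.0, 0.0]


easy = predict_with_abstention([5.0, 0.0, 0.0], fake_genetic_lookup)
hard = predict_with_abstention([1.0, 0.9, 0.0], fake_genetic_lookup)
```

Passing the genetic lookup as a callable makes the cost saving explicit: in the confident case the assay is never run at all.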

[26] Rectified SpaAttn: Revisiting Attention Sparsity for Efficient Video Generation

Xuewen Liu, Zhikai Li, Jing Zhang, Mengjuan Chen, Qingyi Gu

🧩 TL;DR

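This paper proposes Rectified SpaAttn, which rectifies biases in attention allocation using an implicit full-attention reference. It addresses the performance degradation that existing sparse attention methods cause in video generation and delivers substantial speedups while maintaining high generation quality.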


📘 Detailed Summary

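Motivation: Diffusion Transformers dominate video generation, but the quadratic complexity of attention introduces substantial latency. Existing attention sparsity methods cut computation by focusing on critical tokens and ignoring non-critical ones, yet they suffer severe performance degradation. Revisiting attention sparsity, the paper finds that existing methods introduce systematic biases in attention allocation: excessive focus on critical tokens amplifies their attention weights, and complete neglect of non-critical tokens loses relevant attention weights.

Method: Rectified SpaAttn rectifies attention allocation with an implicit full-attention reference, improving the alignment between sparse and full attention maps. For critical tokens, Isolated-Pooling Attention Reallocation computes accurate rectification factors by reallocating multimodal pooled weights. For non-critical tokens, Gain-Aware Pooling Rectification ensures that the rectified gain always exceeds the induced error. The Rectified SpaAttn kernel is customized and integrated using Triton.

Result: The method achieves speedups of up to 3.33x on HunyuanVideo and 2.08x on Wan 2.1 while maintaining high generation quality. It is released as open source, providing an effective solution for attention sparsity in video generation.

Conclusion: By systematically rectifying biases in attention allocation, Rectified SpaAttn substantially improves computational efficiency while preserving generation quality. The work gives Diffusion Transformers a more efficient attention mechanism for video generation, balancing computational cost against model performance, and has clear practical value.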


📄 Abstract

Diffusion Transformers dominate video generation, but the quadratic complexity of attention computation introduces substantial latency. Attention sparsity reduces computational costs by focusing on critical tokens while ignoring non-critical tokens. However, existing methods suffer from severe performance degradation. In this paper, we revisit attention sparsity and reveal that existing methods induce systematic biases in attention allocation: (1) excessive focus on critical tokens amplifies their attention weights; (2) complete neglect of non-critical tokens causes the loss of relevant attention weights. To address these issues, we propose Rectified SpaAttn, which rectifies attention allocation with implicit full attention reference, thereby enhancing the alignment between sparse and full attention maps. Specifically: (1) for critical tokens, we show that their bias is proportional to the sparse attention weights, with the ratio governed by the amplified weights. Accordingly, we propose Isolated-Pooling Attention Reallocation, which calculates accurate rectification factors by reallocating multimodal pooled weights. (2) for non-critical tokens, recovering attention weights from the pooled query-key yields attention gains but also introduces pooling errors. Therefore, we propose Gain-Aware Pooling Rectification, which ensures that the rectified gain consistently surpasses the induced error. Moreover, we customize and integrate the Rectified SpaAttn kernel using Triton, achieving up to 3.33 and 2.08 times speedups on HunyuanVideo and Wan 2.1, respectively, while maintaining high generation quality. We release Rectified SpaAttn as open-source at https://github.com/BienLuky/Rectified-SpaAttn .
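
The amplification bias on critical tokens follows directly from how softmax normalizes: restricting the softmax to a subset of tokens divides each kept weight by the full-attention mass on that subset, so every kept weight grows. The toy example below illustrates only this effect for a single query. It computes the rectification factor from the full attention, which a real sparse kernel cannot afford; the paper instead estimates it from pooled weights, and that estimation is not reproduced here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_softmax(scores, keep):
    """Softmax restricted to the indices in keep; all other weights are zero."""
    kept = softmax([scores[i] for i in keep])
    weights = [0.0] * len(scores)
    for i, w in zip(keep, kept):
        weights[i] = w
    return weights


scores = [2.0, 1.0, 0.0, -1.0]  # toy attention logits for one query
critical = [0, 1]               # hypothetical set of critical tokens
full = softmax(scores)
sparse = sparse_softmax(scores, critical)

# Full-attention mass on the critical tokens; sparse weights equal the
# full weights divided by this mass, so multiplying by it undoes the bias.
mass = sum(full[i] for i in critical)
rectified = [w * mass for w in sparse]
```

Because `mass` is below one whenever any non-critical token has non-zero attention, every sparse weight on a critical token is larger than its full-attention counterpart, which matches the first bias described in the abstract.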

[27] Vision-Language Enhanced Foundation Model for Semi-supervised Medical Image Segmentation

Jiaqi Guo, Mingzhen Li, Hanyu Su, Santiago López, Lexiaozi Fan, Daniel Kim, Aggelos Katsaggelos

🧩 TL;DR

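This paper proposes VESSA, a vision-language enhanced approach to semi-supervised medical image segmentation that integrates foundation-level visual-semantic understanding into semi-supervised learning frameworks and significantly improves segmentation accuracy under extremely limited annotation. It uses a two-stage training strategy that combines visual feature matching with a dynamic interaction mechanism and outperforms state-of-the-art methods on multiple datasets.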


📘 Detailed Summary

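Motivation: Semi-supervised learning reduces reliance on expert annotation, but medical image segmentation still faces extreme scarcity of labeled data. Meanwhile, vision-language models show strong generalization and few-shot capabilities across domains but have not yet been effectively integrated into semi-supervised medical image segmentation frameworks, leaving their potential underused.

Method: VESSA is trained in two stages. In the first, it is trained as a reference-guided segmentation assistant using a template bank of gold-standard exemplars; through visual feature matching it extracts representative semantic and spatial cues and generates structured prompts for a SAM2-inspired mask decoder, which produces the segmentation masks. In the second, VESSA is integrated into a state-of-the-art semi-supervised learning framework and interacts dynamically with the student model: as the student's predictions become more refined, they are fed back to VESSA as prompts to generate higher-quality pseudo-labels and stronger guidance.

Result: Extensive experiments across multiple segmentation datasets and domains show that VESSA-augmented semi-supervised learning significantly improves segmentation accuracy and outperforms state-of-the-art baselines under extremely limited annotation. The advantage is most pronounced when labeled data is scarce.

Conclusion: The study confirms the effectiveness of integrating the foundation-level visual-semantic understanding of vision-language models into semi-supervised learning frameworks, offering a new technical route for medical image segmentation. VESSA's dynamic interaction mechanism and reference-guided strategy provide a strong solution for segmentation under extremely limited annotation, with potential for clinical application and wider adoption.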


📄 Abstract

Semi-supervised learning (SSL) has emerged as an effective paradigm for medical image segmentation, reducing the reliance on extensive expert annotations. Meanwhile, vision-language models (VLMs) have demonstrated strong generalization and few-shot capabilities across diverse visual domains. In this work, we integrate VLM-based segmentation into semi-supervised medical image segmentation by introducing a Vision-Language Enhanced Semi-supervised Segmentation Assistant (VESSA) that incorporates foundation-level visual-semantic understanding into SSL frameworks. Our approach consists of two stages. In Stage 1, the VLM-enhanced segmentation foundation model VESSA is trained as a reference-guided segmentation assistant using a template bank containing gold-standard exemplars, simulating learning from limited labeled data. Given an input-template pair, VESSA performs visual feature matching to extract representative semantic and spatial cues from exemplar segmentations, generating structured prompts for a SAM2-inspired mask decoder to produce segmentation masks. In Stage 2, VESSA is integrated into a state-of-the-art SSL framework, enabling dynamic interaction with the student model: as student predictions become more refined, they are fed back to VESSA as prompts, allowing it to generate higher-quality pseudo-labels and stronger guidance. Extensive experiments across multiple segmentation datasets and domains show that VESSA-augmented SSL significantly enhances segmentation accuracy, outperforming state-of-the-art baselines under extremely limited annotation conditions.

[28] Reading Between the Lines: Abstaining from VLM-Generated OCR Errors via Latent Representation Probes

Jihan Yao, Achin Kulshrestha, Nathalie Rauschmayr, Reed Roberts, Banghua Zhu, Yulia Tsvetkov, Federico Tombari

🧩 TL;DR

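This paper proposes Latent Representation Probing (LRP), which trains lightweight probes on the hidden states and attention patterns of vision-language models and markedly improves their ability to abstain in scene text visual question answering, raising abstention accuracy by 7.6% over baseline methods.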


📘 Detailed Summary

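Motivation: As vision-language models are deployed in safety-critical applications, their ability to decline to answer when uncertain becomes crucial, especially in scene text visual question answering, where OCR errors can have severe consequences. Existing abstention methods are limited: they either rely on miscalibrated output probabilities or use semantic agreement approaches that are unsuitable for OCR tasks, which suggests that uncertainty signals may be hidden in the model's internal representations.

Method: Latent Representation Probing (LRP) trains lightweight probes on hidden states or attention patterns. Three probe designs are explored: concatenating representations across all layers, aggregating attention over visual tokens, and ensembling single-layer probes by majority vote. The method extracts confidence signals from internal states rather than from unreliable outputs.

Result: On four benchmarks across image and video modalities, LRP improves abstention accuracy by 7.6% over the best baselines. Analysis shows that the probes generalize across various uncertainty sources and datasets and that the optimal signals emerge from intermediate rather than final layers.

Conclusion: The study establishes a principled framework for detecting confidence signals from internal states, offering a new route to deployment-ready AI systems. The finding that intermediate layers carry richer confidence information provides insight into the uncertainty mechanisms of vision-language models and advances the development of reliable AI systems.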


📄 Abstract

As VLMs are deployed in safety-critical applications, their ability to abstain from answering when uncertain becomes crucial for reliability, especially in Scene Text Visual Question Answering (STVQA) tasks. For example, OCR errors like misreading "50 mph" as "60 mph" could cause severe traffic accidents. This leads us to ask: Can VLMs know when they can't see? Existing abstention methods suggest pessimistic answers: they either rely on miscalibrated output probabilities or require semantic agreement unsuitable for OCR tasks. However, this failure may indicate we are looking in the wrong place: uncertainty signals could be hidden in VLMs' internal representations. Building on this insight, we propose Latent Representation Probing (LRP): training lightweight probes on hidden states or attention patterns. We explore three probe designs: concatenating representations across all layers, aggregating attention over visual tokens, and ensembling single-layer probes by majority vote. Experiments on four benchmarks across image and video modalities show LRP improves abstention accuracy by 7.6% over the best baselines. Our analysis reveals that probes generalize across various uncertainty sources and datasets, and that optimal signals emerge from intermediate rather than final layers. This establishes a principled framework for building deployment-ready AI systems by detecting confidence signals from internal states rather than unreliable outputs.
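
Of the three probe designs, the majority-vote ensemble is the simplest to illustrate. The sketch below assumes each single-layer probe is a linear classifier (a weight vector and a bias) over that layer's hidden state and that a positive score means "abstain"; the probe form, the layer indices, and all numbers are hypothetical stand-ins, not trained values from the paper.

```python
def probe_votes_abstain(hidden, weights, bias):
    """A linear probe on one layer's hidden state; a positive score votes abstain."""
    return sum(h * w for h, w in zip(hidden, weights)) + bias > 0.0


def majority_vote_abstain(hidden_states, probes):
    """Abstain when strictly more than half of the single-layer probes vote to.

    hidden_states maps layer index -> hidden vector (list of floats);
    probes maps layer index -> (weights, bias) for that layer's probe."""
    votes = [probe_votes_abstain(hidden_states[layer], weights, bias)
             for layer, (weights, bias) in probes.items()]
    return sum(votes) * 2 > len(votes)


# Toy hidden states and probe parameters for three intermediate layers.
hidden_states = {8: [1.0, 0.0], 12: [0.0, 1.0], 16: [1.0, 1.0]}
probes = {8: ([1.0, 0.0], -0.5), 12: ([0.0, -1.0], 0.0), 16: ([0.5, 0.5], -0.5)}
should_abstain = majority_vote_abstain(hidden_states, probes)
```

Requiring a strict majority means a tied vote defaults to answering; a deployment that prefers caution could flip that tie-break.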

[29] Distilling Cross-Modal Knowledge via Feature Disentanglement

Junhong Liu, Yuan Zhang, Tao Huang, Wenchao Xu, Renyu Yang

🧩 TL;DR

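This paper proposes frequency-decoupled cross-modal knowledge distillation. Based on an analysis of how consistent features are across modalities in the frequency domain, it applies strong alignment to low-frequency features and relaxed alignment to high-frequency features, effectively addressing representation inconsistency in cross-modal knowledge distillation.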


📘 Detailed Summary

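Motivation: Conventional knowledge distillation is much less effective in cross-modal scenarios such as vision-to-language, mainly because inconsistent representations across modalities make knowledge transfer difficult. Existing methods struggle to handle these differences in feature distribution between modalities.

Method: The proposed framework builds on a frequency-domain analysis showing that low-frequency features are highly consistent across modalities while high-frequency features have extremely low cross-modal similarity. It therefore applies a strong alignment loss to low-frequency features and a relaxed alignment strategy to high-frequency features, adds a scale consistency loss to handle distributional shifts between modalities, and uses a shared classifier to unify the feature spaces.

Result: Extensive experiments on multiple benchmark datasets show that the method substantially outperforms conventional knowledge distillation and state-of-the-art cross-modal knowledge distillation methods, validating frequency decoupling for cross-modal knowledge transfer.

Conclusion: Frequency-domain feature analysis offers a new perspective on cross-modal knowledge distillation, and treating frequency components separately balances knowledge transfer effectively. The method provides a general and effective solution to representation inconsistency across modalities.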


📄 Abstract

Knowledge distillation (KD) has proven highly effective for compressing large models and enhancing the performance of smaller ones. However, its effectiveness diminishes in cross-modal scenarios, such as vision-to-language distillation, where inconsistencies in representation across modalities lead to difficult knowledge transfer. To address this challenge, we propose frequency-decoupled cross-modal knowledge distillation, a method designed to decouple and balance knowledge transfer across modalities by leveraging frequency-domain features. We observed that low-frequency features exhibit high consistency across different modalities, whereas high-frequency features demonstrate extremely low cross-modal similarity. Accordingly, we apply distinct losses to these features: enforcing strong alignment in the low-frequency domain and introducing relaxed alignment for high-frequency features. We also propose a scale consistency loss to address distributional shifts between modalities, and employ a shared classifier to unify feature spaces. Extensive experiments across multiple benchmark datasets show our method substantially outperforms traditional KD and state-of-the-art cross-modal KD approaches. Code is available at https://github.com/Johumliu/FD-CMKD.
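
To make the idea of decoupling features by frequency concrete, the sketch below splits a 1D feature vector into low-frequency and high-frequency parts with a discrete Fourier transform and weights the two alignment terms differently. Everything specific here is an assumption for illustration: the paper's features are not 1D lists, the cutoff is arbitrary, and modeling "relaxed alignment" as a smaller MSE weight is a stand-in for whatever loss the authors actually use.

```python
import cmath


def dft(x):
    """Naive O(n^2) discrete Fourier transform of a real sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft_real(spectrum):
    """Inverse DFT, keeping the real part of each sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                 for k in range(n)) / n).real for t in range(n)]


def split_frequencies(x, cutoff):
    """Split x into low and high parts; bins with min(k, n - k) <= cutoff are low."""
    n = len(x)
    spectrum = dft(x)
    low_spec = [c if min(k, n - k) <= cutoff else 0j for k, c in enumerate(spectrum)]
    low = idft_real(low_spec)
    high = [a - b for a, b in zip(x, low)]
    return low, high


def mse(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)


def decoupled_loss(student, teacher, cutoff=1, w_low=1.0, w_high=0.1):
    """Strong alignment on low frequencies, weaker on high (weights are hypothetical)."""
    s_low, s_high = split_frequencies(student, cutoff)
    t_low, t_high = split_frequencies(teacher, cutoff)
    return w_low * mse(s_low, t_low) + w_high * mse(s_high, t_high)


constant = [2.0] * 8        # purely low-frequency signal
alternating = [1.0, -1.0] * 4  # purely highest-frequency signal
const_low, const_high = split_frequencies(constant, cutoff=0)
alt_low, alt_high = split_frequencies(alternating, cutoff=1)
```

The symmetric mask `min(k, n - k)` keeps conjugate frequency bins together, so the low-frequency part of a real input stays real.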

[30] Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning

Jiaqi Liu, Kaiwen Xiong, Peng Xia, Yiyang Zhou, Haonian Ji, Lu Feng, Siwei Han, Mingyu Ding, Huaxiu Yao

🧩 TL;DR

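This paper proposes Agent0-VL, a self-evolving vision-language agent that improves continually through tool-integrated reasoning. Without external rewards, it uses tool-assisted self-evaluation and self-repair to achieve a 12.5% improvement on geometric problem solving and visual scientific analysis.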


📘 Detailed Summary

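Motivation: Learning in current vision-language agents is constrained by human-annotated supervision, while purely text-based self-evaluation struggles to verify complex visual reasoning steps and is prone to evaluation hallucinations.

Method: Agent0-VL integrates tool use into reasoning, self-evaluation, and self-repair, and unifies two synergistic roles within a single LVLM: a Solver that performs multi-turn tool-integrated reasoning, and a Verifier that generates structured feedback and fine-grained self-rewards through tool-grounded critique. The roles interact through a Self-Evolving Reasoning Cycle in which tool-based verification and reinforcement learning jointly align the reasoning and evaluation distributions for stable self-improvement.

Result: Experiments on geometric problem solving and visual scientific analysis show that Agent0-VL achieves a 12.5% improvement over the base model and improves continually without any human annotation or external reward model.

Conclusion: The work demonstrates the effectiveness of tool-integrated reasoning for self-evolving vision-language agents. Extending tool use to self-evaluation and self-repair aligns reasoning and verification behaviors and offers a viable path to continual learning with zero external reward.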


📄 Abstract

Vision-language agents have achieved remarkable progress in a variety of multimodal reasoning tasks; however, their learning remains constrained by the limitations of human-annotated supervision. Recent self-rewarding approaches attempt to overcome this constraint by allowing models to act as their own critics or reward providers. Yet, purely text-based self-evaluation struggles to verify complex visual reasoning steps and often suffers from evaluation hallucinations. To address these challenges, inspired by recent advances in tool-integrated reasoning, we propose Agent0-VL, a self-evolving vision-language agent that achieves continual improvement with tool-integrated reasoning. Agent0-VL incorporates tool usage not only into reasoning but also into self-evaluation and self-repair, enabling the model to introspect, verify, and refine its reasoning through evidence-grounded analysis. It unifies two synergistic roles within a single LVLM: a Solver that performs multi-turn tool-integrated reasoning, and a Verifier that generates structured feedback and fine-grained self-rewards through tool-grounded critique. These roles interact through a Self-Evolving Reasoning Cycle, where tool-based verification and reinforcement learning jointly align the reasoning and evaluation distributions for stable self-improvement. Through this zero-external-reward evolution, Agent0-VL aligns its reasoning and verification behaviors without any human annotation or external reward models, achieving continual self-improvement. Experiments on geometric problem solving and visual scientific analysis show that Agent0-VL achieves a 12.5% improvement over the base model. Our code is available at https://github.com/aiming-lab/Agent0/Agent0-VL.

[31] MambaEye: A Size-Agnostic Visual Encoder with Causal Sequential Processing

Changho Choi, Minho Kim, Jinkyu Kim

🧩 TL;DR

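This paper proposes MambaEye, an input-size-agnostic visual encoder based on a causal state space model. Strictly unidirectional processing and relative move embedding let it adapt to arbitrary image resolutions, and it performs strongly on ImageNet-1K classification.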


📘 Detailed Summary

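Motivation: Despite decades of progress in computer vision, a truly input-size-agnostic visual encoder, a fundamental property of human vision, remains an unsolved challenge. Existing methods are limited when handling images of different resolutions, so a general visual encoding architecture that adapts to arbitrary image sizes is needed.

Method: MambaEye builds a causal sequential encoder on a pure Mamba2 backbone and uses strictly unidirectional processing to preserve the inherent causality of state space models, so the model can produce a prediction at any point in the input sequence. Its core innovations are relative move embedding, which encodes the spatial shift between consecutive patches and provides a strong inductive bias for translation invariance, and a diffusion-inspired loss function that provides dense, step-wise supervision and trains the model to build confidence as it gathers more visual evidence.

Result: On ImageNet-1K classification, MambaEye performs robustly across a wide range of image resolutions, especially at high resolutions such as 1536×1536, while maintaining linear time and memory complexity in the number of patches.

Conclusion: MambaEye shows that a visual encoder based on a causal state space model can achieve input-size agnosticism, opening a new direction for general vision models that are closer to the human visual system. Its core innovations, relative move embedding and step-wise supervision, offer an effective solution to scale variation in vision tasks.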


📄 Abstract

Despite decades of progress, a truly input-size agnostic visual encoder, a fundamental characteristic of human vision, has remained elusive. We address this limitation by proposing MambaEye, a novel causal sequential encoder that leverages the low-complexity, causal pure Mamba2 backbone. Unlike previous Mamba-based vision encoders that often employ bidirectional processing, our strictly unidirectional approach preserves the inherent causality of State Space Models, enabling the model to generate a prediction at any point in its input sequence. A core innovation is our use of relative move embedding, which encodes the spatial shift between consecutive patches, providing a strong inductive bias for translation invariance and making the model inherently adaptable to arbitrary image resolutions and scanning patterns. To achieve this, we introduce a novel diffusion-inspired loss function that provides dense, step-wise supervision, training the model to build confidence as it gathers more visual evidence. We demonstrate that MambaEye exhibits robust performance across a wide range of image resolutions, especially at higher resolutions such as $1536^2$ on the ImageNet-1K classification task. This feat is achieved while maintaining linear time and memory complexity relative to the number of patches.
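The relative move idea can be sketched in a few lines: instead of giving each patch an absolute position, describe each step by the shift from the previous patch. The helper name `relative_moves` and the zero move assigned to the first patch are assumptions for this example; how MambaEye turns these shifts into learned embeddings is not shown here.

```python
def relative_moves(path):
    """Convert a scan path of (row, col) patch coordinates into per-step shifts.

    The first patch has no predecessor, so it is given a (0, 0) move
    (an assumption made for this sketch).
    """
    moves = [(0, 0)]
    for (r0, c0), (r1, c1) in zip(path, path[1:]):
        moves.append((r1 - r0, c1 - c0))
    return moves


def shift_path(path, dr, dc):
    """Translate every coordinate in a scan path by (dr, dc)."""
    return [(r + dr, c + dc) for r, c in path]
```

Because only differences between consecutive coordinates are encoded, translating the whole scan path leaves the move sequence unchanged, which is the translation-invariance bias the abstract refers to. The same representation works for any grid size or scan order.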

[32] Large Language Model Aided Birt-Hogg-Dube Syndrome Diagnosis with Multimodal Retrieval-Augmented Generation

Haoqing Li, Jun Shi, Xianmeng Chen, Qiwei Jia, Rui Wang, Wei Wei, Hong An, Xiaowen Hu

🧩 TL;DR

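This paper proposes BHD-RAG, a multimodal retrieval-augmented generation framework for diagnosing Birt-Hogg-Dube syndrome. It integrates domain-specific knowledge and clinical precedents to counter MLLM hallucination in rare lung disease diagnosis, and achieves superior diagnostic accuracy on a dataset covering four types of diffuse cystic lung disease.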


📘 Detailed Summary

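Motivation: Deep learning methods for advancing Birt-Hogg-Dube syndrome diagnosis from CT imaging face two challenges: limited clinical samples and low inter-class differentiation among diffuse cystic lung diseases. Multimodal large language models show diagnostic potential, but the lack of domain-specific knowledge and referable radiological features increases the risk of hallucination.

Method: BHD-RAG has three core components: a specialized agent that generates imaging manifestation descriptions of CT images to build a multimodal case corpus, a cosine similarity-based retriever that locates relevant image-description pairs for a query image, and an MLLM that combines the retrieved evidence with the imaging data for diagnostic reasoning.

Result: Validation on a dataset covering four types of diffuse cystic lung disease shows that BHD-RAG achieves superior diagnostic accuracy and generates evidence-based descriptions that closely match expert insights.

Conclusion: The study demonstrates the effectiveness of retrieval-augmented generation for integrating domain knowledge and clinical precedents, provides a reliable approach to multimodal diagnosis of rare diseases, and significantly reduces the hallucination risk of MLLMs in medical image analysis.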


📄 Abstract

Deep learning methods face dual challenges of limited clinical samples and low inter-class differentiation among Diffuse Cystic Lung Diseases (DCLDs) in advancing Birt-Hogg-Dube syndrome (BHD) diagnosis via Computed Tomography (CT) imaging. While Multimodal Large Language Models (MLLMs) demonstrate diagnostic potential for such rare diseases, the absence of domain-specific knowledge and referable radiological features intensifies hallucination risks. To address this problem, we propose BHD-RAG, a multimodal retrieval-augmented generation framework that integrates DCLD-specific expertise and clinical precedents with MLLMs to improve BHD diagnostic accuracy. BHD-RAG employs: (1) a specialized agent generating imaging manifestation descriptions of CT images to construct a multimodal corpus of DCLD cases, (2) a cosine similarity-based retriever pinpointing relevant image-description pairs for query images, and (3) an MLLM synthesizing retrieved evidence with imaging data for diagnosis. BHD-RAG is validated on a dataset involving four types of DCLDs, achieving superior accuracy and generating evidence-based descriptions closely aligned with expert insights.
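The retrieval step (2) can be illustrated with a small cosine-similarity ranker. This is a generic sketch, not the BHD-RAG implementation: the embeddings, the corpus layout as `(embedding, description)` tuples, and the function names are hypothetical, and how the paper actually embeds CT images is not specified here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_embedding, corpus, k=2):
    """Return the descriptions of the k corpus entries most similar to the query.

    corpus: list of (embedding, description) pairs (hypothetical layout).
    """
    ranked = sorted(
        corpus,
        key=lambda item: cosine(query_embedding, item[0]),
        reverse=True,
    )
    return [description for _, description in ranked[:k]]
```

In a retrieval-augmented setup, the returned descriptions would be placed in the MLLM prompt alongside the query image, so the model reasons over retrieved precedents rather than unaided recall.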

[33] EmoFeedback2: Reinforcement of Continuous Emotional Image Generation via LVLM-based Reward and Textual Feedback

Jingyang Jia, Kai Shu, Gang Yang, Long Xing, Xun Chen, Aiping Liu

🧩 TL;DR

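This paper proposes the EmoFeedback2 framework, a generation-understanding-feedback reinforcement learning paradigm that uses a fine-tuned large vision-language model to provide rewards and textual feedback for continuous emotional image generation, significantly improving the emotional continuity and fidelity of the generated images.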


📘 Detailed Summary

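Motivation: Existing continuous emotional image generation methods lack emotional feedback on generated images, which limits control over emotional continuity. Their simple alignment between emotions and text also cannot adapt emotional prompts to image content, leading to insufficient emotional fidelity.

Method: The paper proposes a generation-understanding-feedback reinforcement paradigm. An emotion-aware reward feedback strategy has the LVLM evaluate the emotional values of generated images and compute a reward against the target emotions, which guides reinforcement fine-tuning of the generative model. A self-promotion textual feedback framework has the LVLM iteratively analyze the emotional content of generated images and adaptively produce refinement suggestions for the next-round prompt.

Result: Extensive experiments show that the method effectively generates high-quality images with the desired emotions, outperforms existing state-of-the-art methods on the authors' custom dataset, and significantly improves emotional continuity and fidelity.

Conclusion: The study demonstrates the value of LVLM reasoning for continuous emotional image generation. Its feedback mechanism closes the loop on emotional control and offers a new technical paradigm for emotion-oriented content generation with practical value.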


📄 Abstract

Continuous emotional image generation (C-EICG) is emerging rapidly due to its ability to produce images aligned with both user descriptions and continuous emotional values. However, existing approaches lack emotional feedback from generated images, limiting the control of emotional continuity. Additionally, their simple alignment between emotions and naively generated texts fails to adaptively adjust emotional prompts according to image content, leading to insufficient emotional fidelity. To address these concerns, we propose a novel generation-understanding-feedback reinforcement paradigm (EmoFeedback2) for C-EICG, which exploits the reasoning capability of the fine-tuned large vision-language model (LVLM) to provide reward and textual feedback for generating high-quality images with continuous emotions. Specifically, we introduce an emotion-aware reward feedback strategy, where the LVLM evaluates the emotional values of generated images and computes the reward against target emotions, guiding the reinforcement fine-tuning of the generative model and enhancing the emotional continuity of images. Furthermore, we design a self-promotion textual feedback framework, in which the LVLM iteratively analyzes the emotional content of generated images and adaptively produces refinement suggestions for the next-round prompt, improving the emotional fidelity with fine-grained content. Extensive experimental results demonstrate that our approach effectively generates high-quality images with the desired emotions, outperforming existing state-of-the-art methods in our custom dataset. The code and dataset will be released soon.

[34] On the Feasibility of Hijacking MLLMs' Decision Chain via One Perturbation

Changyue Li, Jiaying Li, Youliang Yuan, Jiaming He, Zhicong Huang, Pinjia He

🧩 TL;DR

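This paper reveals a new adversarial threat: a single perturbation can hijack an entire decision chain. Semantic-Aware Universal Perturbations (SAUPs) steer multimodal large language models toward multiple targets at once, reaching a 70% attack success rate when controlling five distinct targets.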


📘 Detailed Summary

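Motivation: Conventional adversarial attacks focus on manipulating a single decision of a neural network, but real-world models usually operate over a sequence of decisions, where an isolated mistake is easily corrected while cascading errors can lead to severe risks. This paper addresses the open question of how a single perturbation can hijack an entire decision chain.

Method: The paper proposes Semantic-Aware Universal Perturbations (SAUPs) and develops an effective optimization algorithm that searches for perturbations in normalized space and uses a semantic separation strategy to overcome optimization challenges. It also builds RIST, a real-world image dataset with fine-grained semantic annotations.

Result: Extensive experiments on three multimodal large language models demonstrate their vulnerability: when five distinct targets are controlled using just a single adversarial frame, the attack success rate reaches 70%.

Conclusion: The study exposes a serious security vulnerability of multimodal large language models under cascading adversarial attacks, underscores the importance of model robustness in sequential decision settings, and provides key insights for designing future defenses.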


📄 Abstract

Conventional adversarial attacks focus on manipulating a single decision of neural networks. However, real-world models often operate in a sequence of decisions, where an isolated mistake can be easily corrected, but cascading errors can lead to severe risks. This paper reveals a novel threat: a single perturbation can hijack the whole decision chain. We demonstrate the feasibility of manipulating a model's outputs toward multiple, predefined outcomes, such as simultaneously misclassifying "non-motorized lane" signs as "motorized lane" and "pedestrian" as "plastic bag". To expose this threat, we introduce Semantic-Aware Universal Perturbations (SAUPs), which induce varied outcomes based on the semantics of the inputs. We overcome optimization challenges by developing an effective algorithm, which searches for perturbations in normalized space with a semantic separation strategy. To evaluate the practical threat of SAUPs, we present RIST, a new real-world image dataset with fine-grained semantic annotations. Extensive experiments on three multimodal large language models demonstrate their vulnerability, achieving a 70% attack success rate when controlling five distinct targets using just an adversarial frame.

[35] 4DWorldBench: A Comprehensive Evaluation Framework for 3D/4D World Generation Models

Yiting Lu, Wei Luo, Peiyan Tu, Haoran Li, Hanxin Zhu, Zihao Yu, Xingrui Wang, Xinyi Chen, Xinge Peng, Xin Li, Zhibo Chen

🧩 TL;DR

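This paper proposes the 4DWorldBench benchmark for systematically evaluating world generation models along four dimensions: perceptual quality, condition alignment, physical realism, and 4D consistency. Unified multimodal condition mapping and adaptive selection of evaluation tools support the shift from visual generation to world generation.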


📘 Detailed Summary

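Motivation: Existing benchmarks emphasize different evaluation dimensions and lack a unified assessment of the world-realism capability of world generation models, so they cannot systematically measure how well these models build physically consistent, dynamic 3D/4D worlds.

Method: The paper proposes the 4DWorldBench benchmark framework with four core evaluation dimensions: Perceptual Quality, Condition-4D Alignment, Physical Realism, and 4D Consistency. It supports tasks such as Image-to-3D/4D, Video-to-4D, and Text-to-3D/4D, and introduces an adaptive cross-modal conditioning mechanism that maps all modality conditions into a unified textual space for evaluation.

Result: The benchmark integrates LLM-as-judge, MLLM-as-judge, and traditional network-based methods. Preliminary human studies show that adaptive tool selection agrees more closely with subjective human judgments, giving world generation models a more comprehensive and consistent evaluation framework.

Conclusion: The benchmark lays a foundation for objective comparison and improvement of world generation models. Its unified multimodal evaluation is intended to accelerate the shift from visual generation to world generation and may benefit applications in virtual reality, autonomous driving, and embodied intelligence.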


📄 Abstract

World Generation Models are emerging as a cornerstone of next-generation multimodal intelligence systems. Unlike traditional 2D visual generation, World Models aim to construct realistic, dynamic, and physically consistent 3D/4D worlds from images, videos, or text. These models not only need to produce high-fidelity visual content but also maintain coherence across space, time, physics, and instruction control, enabling applications in virtual reality, autonomous driving, embodied intelligence, and content creation. However, prior benchmarks emphasize different evaluation dimensions and lack a unified assessment of world-realism capability. To systematically evaluate World Models, we introduce the 4DWorldBench, which measures models across four key dimensions: Perceptual Quality, Condition-4D Alignment, Physical Realism, and 4D Consistency. The benchmark covers tasks such as Image-to-3D/4D, Video-to-4D, Text-to-3D/4D. Beyond these, we innovatively introduce adaptive conditioning across multiple modalities, which not only integrates but also extends traditional evaluation paradigms. To accommodate different modality-conditioned inputs, we map all modality conditions into a unified textual space during evaluation, and further integrate LLM-as-judge, MLLM-as-judge, and traditional network-based methods. This unified and adaptive design enables more comprehensive and consistent evaluation of alignment, physical realism, and cross-modal coherence. Preliminary human studies further demonstrate that our adaptive tool selection achieves closer agreement with subjective human judgments. We hope this benchmark will serve as a foundation for objective comparisons and improvements, accelerating the transition from "visual generation" to "world generation." Our project can be found at https://yeppp27.github.io/4DWorldBench.github.io/.

[36] Pedestrian Crossing Intention Prediction Using Multimodal Fusion Network

Yuanzhe Li, Steffen Müller

🧩 TL;DR

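This paper proposes a multimodal fusion network that predicts pedestrian crossing intention by integrating seven modality features from visual and motion branches. The network uses a depth-guided attention mechanism and a dual attention strategy, and outperforms baseline methods on the JAAD dataset.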


📘 Detailed Summary

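Motivation: Pedestrian crossing intention prediction is essential for deploying autonomous vehicles in urban scenes, but the task is challenging because pedestrian behavior is diverse and depends on multiple contextual factors. Existing methods struggle to integrate complementary information across modalities, so more effective multimodal fusion mechanisms are needed.

Method: The paper proposes a multimodal fusion network that extracts seven modality features from visual and motion branches. Transformer-based extraction modules process the raw inputs, and a depth-guided attention module uses depth information to guide cross-modal spatial feature interactions. Modality attention and temporal attention are introduced to selectively emphasize important modalities and to capture temporal dependencies, respectively.

Result: Extensive experiments on the JAAD dataset validate the effectiveness of the proposed network, which outperforms the baseline methods. The network integrates multimodal information effectively and significantly improves the accuracy of pedestrian crossing intention prediction.

Conclusion: The study shows that multimodal fusion and attention mechanisms improve pedestrian behavior prediction. Depth-guided attention strengthens cross-modal feature interaction, and the dual attention strategy adaptively focuses on important information. The work offers an effective technical approach to pedestrian behavior understanding in autonomous driving.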


📄 Abstract

Pedestrian crossing intention prediction is essential for the deployment of autonomous vehicles (AVs) in urban environments. Ideal prediction provides AVs with critical environmental cues, thereby reducing the risk of pedestrian-related collisions. However, the prediction task is challenging due to the diverse nature of pedestrian behavior and its dependence on multiple contextual factors. This paper proposes a multimodal fusion network that leverages seven modality features from both visual and motion branches, aiming to effectively extract and integrate complementary cues across different modalities. Specifically, motion and visual features are extracted from the raw inputs using multiple Transformer-based extraction modules. A depth-guided attention module leverages depth information to guide attention towards salient regions in another modality through comprehensive spatial feature interactions. To account for the varying importance of different modalities and frames, modality attention and temporal attention are designed to selectively emphasize informative modalities and effectively capture temporal dependencies. Extensive experiments on the JAAD dataset validate the effectiveness of the proposed network, achieving superior performance compared to the baseline methods.

[37] Face, Whole-Person, and Object Classification in a Unified Space Via The Interleaved Multi-Domain Identity Curriculum

Thomas M Metz, Matthew Q Hill, Alice J O'Toole

🧩 TL;DR

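This study proposes the Interleaved Multi-Domain Identity Curriculum (IMIC), which fine-tunes multiple vision tasks simultaneously in a single embedding space. It effectively mitigates catastrophic forgetting in multi-task learning with foundation models and achieves performance comparable to domain experts on object recognition, face recognition, and person recognition.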


📘 Detailed Summary

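Motivation: Vision foundation models can perform generalized object classification in zero-shot mode but often suffer from catastrophic forgetting after fine-tuning. The study asks how to perform object recognition, face recognition from high- and low-quality images, and person recognition from whole-body images in a single embedding space without substantial performance degradation.

Method: The paper proposes two variants of the Interleaved Multi-Domain Identity Curriculum (IMIC), a gradient-coupled, interleaving training schedule that fine-tunes a foundation backbone on all four tasks simultaneously. The method is validated on three foundation models, DINOv3, CLIP, and EVA-02, and achieves multi-task learning through a unified embedding space.

Result: The EVA-02 and CLIP models performed comparably to domain experts on all four tasks concurrently and were more accurate than humans at multi-tasking across face, body, and object datasets. Analysis shows that the most accurate model variants have linearly separable representations of the four tasks in the unified embedding space, but with substantial feature sharing across tasks: fewer than 100 principal components computed from any one task can perform the other tasks with almost no performance loss.

Conclusion: IMIC achieves multi-task learning without substantially harming the out-of-distribution generalization of the foundation model, demonstrating that multiple vision tasks can be handled together in a unified embedding space. The method preserves a key property of foundation models while learning efficient multi-task representations through feature sharing, offering useful guidance for building more general vision systems.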


📄 Abstract

Vision foundation models can perform generalized object classification in zero-shot mode, and face/person recognition when they are fine-tuned. However, fine-tuned models suffer from catastrophic forgetting. We create models that perform four tasks (object recognition, face recognition from high- and low-quality images, and person recognition from whole-body images) in a single embedding space -- without incurring substantial catastrophic forgetting. To accomplish this, we introduce two variants of the Interleaved Multi-Domain Identity Curriculum (IMIC): a gradient-coupled, interleaving training schedule that fine-tunes a foundation backbone simultaneously on all four tasks. The IMIC method proved effective with three foundation model bases: DINOv3, CLIP, and EVA-02. Two of these (EVA-02 and CLIP) performed comparably with domain experts on all four tasks concurrently and were more accurate than humans at multi-tasking across face, body, and object datasets. Further, we demonstrate that our approach does not substantially harm out-of-distribution generalization, thus maintaining a key property of foundation models. Analysis of the most accurate model variants (EVA-02 + IMIC A and B) showed linearly separable representations of the four tasks in the unified embedding space, but with substantial sharing of features across tasks. Fewer than 100 PCs calculated from any one task could perform all other tasks with nearly zero performance degradation.

[38] WaymoQA: A Multi-View Visual Question Answering Dataset for Safety-Critical Reasoning in Autonomous Driving

Seungjun Yu, Seonho Lee, Namho Kim, Jaeyo Shin, Junsung Park, Wonjeong Ryu, Raehyuk Jung, Hyunjung Shim

🧩 TL;DR

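This paper proposes the Safety-Critical Reasoning task, which uses multi-view inputs to reason about complex driving scenarios where avoiding one traffic risk can create another, and introduces WaymoQA, a dataset of 35,000 human-annotated question-answer pairs, to improve the safety reasoning of multimodal large language models.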


📘 Detailed Summary

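Motivation: Multimodal large language models show strong understanding of driving scenes, but high-level reasoning in safety-critical scenarios remains a major challenge. In particular, when avoiding one traffic risk can create another, a single front view is often insufficient for effective reasoning and a more comprehensive view of the environment is required.

Method: The study decomposes safety-critical reasoning into two stages: first resolve the immediate risk, then mitigate the downstream risks induced by that decision. To support this, it introduces WaymoQA, a dataset of 35,000 human-annotated question-answer pairs covering complex, high-risk driving scenarios, with multiple-choice and open-ended formats across both image and video modalities.

Result: Experiments show that existing multimodal large language models perform markedly worse in safety-critical scenarios than in normal scenes, but fine-tuning on WaymoQA significantly improves their safety reasoning, demonstrating the dataset's value for developing safer and more reasoning-capable driving agents.

Conclusion: The study highlights the importance of multi-view inputs for safety-critical reasoning. The two-stage reasoning framework and the WaymoQA dataset provide effective tools for complex risk decisions in autonomous driving, lay groundwork for more reliable autonomous driving systems, and point to where multimodal large language models need further improvement in safety-critical applications.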


📄 Abstract

Recent advancements in multimodal large language models (MLLMs) have shown strong understanding of driving scenes, drawing interest in their application to autonomous driving. However, high-level reasoning in safety-critical scenarios, where avoiding one traffic risk can create another, remains a major challenge. Such reasoning is often infeasible with only a single front view and requires a comprehensive view of the environment, which we achieve through multi-view inputs. We define Safety-Critical Reasoning as a new task that leverages multi-view inputs to address this challenge. Then, we distill Safety-Critical Reasoning into two stages: first resolve the immediate risk, then mitigate the decision-induced downstream risks. To support this, we introduce WaymoQA, a dataset of 35,000 human-annotated question-answer pairs covering complex, high-risk driving scenarios. The dataset includes multiple-choice and open-ended formats across both image and video modalities. Experiments reveal that existing MLLMs underperform in safety-critical scenarios compared to normal scenes, but fine-tuning with WaymoQA significantly improves their reasoning ability, highlighting the effectiveness of our dataset in developing safer and more reasoning-capable driving agents.

[39] Temporal-Visual Semantic Alignment: A Unified Architecture for Transferring Spatial Priors from Vision Models to Zero-Shot Temporal Tasks

Xiangkai Ma, Han Zhang, Wenzhong Li, Sanglu Lu

🧩 TL;DR

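TimeArtist is a temporal-visual conversion framework that uses semantic-level alignment to generate high-quality images directly from time series, while building a cross-modal bridge between temporal dynamics and visual semantics.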


📘 Detailed Summary

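Motivation: Existing methods that convert time series into pseudo-images for temporal forecasting fail to establish semantic-level alignment, and the potential of non-visual continuous sequences as a conditioning signal for high-fidelity image generation remains underexplored.

Method: The framework adopts a "warmup-align" paradigm: first, a dual autoencoder and a shared quantizer are trained with self-supervision on large-scale datasets to learn modality-shared representations; then the encoders and quantizer are frozen and a projection layer is introduced to align temporal and visual samples at the representation level.

Result: Experiments show that TimeArtist achieves satisfactory performance on image generation metrics and superior results on zero-shot temporal tasks, and that it can capture temporal fluctuation patterns to render images in the manner of style transfer.

Conclusion: The study establishes a new paradigm for cross-modal generation, bridges the gap between temporal dynamics and visual semantics, and provides a feasible approach to semantic-level conversion from time series to images.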


📄 Abstract

Large Multimodal Models (LMMs) have achieved remarkable progress in aligning and generating content across text and image modalities. However, the potential of using non-visual, continuous sequential data as a conditioning signal for high-fidelity image generation remains largely unexplored. Furthermore, existing methods that convert series into "pseudo-images" for temporal forecasting fail to establish semantic-level alignment. In this paper, we propose TimeArtist, a temporal-visual conversion framework that pioneers semantic-level alignment between time series fluctuations and visual concepts. It introduces a "warmup-align" paradigm: first, a dual-autoencoder and shared quantizer are trained with self-supervision on large-scale datasets to learn modality-shared representations. Then, the encoders and quantizer are frozen, and a projection is introduced to align temporal and visual samples at the representation level. TimeArtist establishes a versatile cross-modal framework, enabling high-quality, diverse image generation directly from time series, while capturing temporal fluctuation patterns to render images in the manner of style transfer. Extensive experiments show that TimeArtist achieves satisfactory performance in image generation metrics, while also attaining superior results in zero-shot temporal tasks. Our work establishes a new paradigm for cross-modal generation, bridging the gap between temporal dynamics and visual semantics.

[40] LungEvaty: A Scalable, Open-Source Transformer-based Deep Learning Model for Lung Cancer Risk Prediction in LDCT Screening

Johannes Brandt, Maulik Chevli, Rickmer Braren, Georgios Kaissis, Philip Müller, Daniel Rueckert

🧩 TL;DR

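LungEvaty is a Transformer-based end-to-end framework that predicts 1-6 year lung cancer risk from a single low-dose CT scan. It analyzes the whole lung without pixel-level annotations and reaches state-of-the-art performance, having been trained on a dataset of more than 90,000 CT scans.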


📘 Detailed Summary

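Motivation: As low-dose CT lung cancer screening programs roll out worldwide, processing large-scale screening data requires scalable and efficient methods. Existing methods either over-rely on pixel-level annotations, which limits scalability, or analyze the lung in fragments, which weakens performance.

Method: The paper proposes LungEvaty, a fully Transformer-based framework that learns directly from whole-lung inputs and captures comprehensive anatomical and pathological cues relevant to malignancy risk from large-scale screening data. The model needs no region supervision, and an optional Anatomically Informed Attention Guidance (AIAG) loss can encourage anatomically focused attention.

Result: Using only imaging data and no region supervision, LungEvaty matches state-of-the-art performance. The model was trained on more than 90,000 CT scans, including over 28,000 for fine-tuning and 6,000 for evaluation.

Conclusion: The framework offers a simple, data-efficient, and fully open-source solution and an extensible foundation for future research on longitudinal and multimodal lung cancer risk prediction, showing the potential of whole-lung analysis without region supervision for cancer risk prediction.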


📄 Abstract

Lung cancer risk estimation is gaining increasing importance as more countries introduce population-wide screening programs using low-dose CT (LDCT). As imaging volumes grow, scalable methods that can process entire lung volumes efficiently are essential to tap into the full potential of these large screening datasets. Existing approaches either over-rely on pixel-level annotations, limiting scalability, or analyze the lung in fragments, weakening performance. We present LungEvaty, a fully transformer-based framework for predicting 1-6 year lung cancer risk from a single LDCT scan. The model operates on whole-lung inputs, learning directly from large-scale screening data to capture comprehensive anatomical and pathological cues relevant for malignancy risk. Using only imaging data and no region supervision, LungEvaty matches state-of-the-art performance, refinable by an optional Anatomically Informed Attention Guidance (AIAG) loss that encourages anatomically focused attention. In total, LungEvaty was trained on more than 90,000 CT scans, including over 28,000 for fine-tuning and 6,000 for evaluation. The framework offers a simple, data-efficient, and fully open-source solution that provides an extensible foundation for future research in longitudinal and multimodal lung cancer risk prediction.

[41] GigaWorld-0: World Models as Data Engine to Empower Embodied AI

GigaWorld Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jiagang Zhu, Kerui Li, Mengyuan Xu, Qiuping Deng, Siting Wang, Wenkang Qin, Xinze Chen, Xiaofeng Wang, Yankai Wang, Yu Cao, Yifan Chang, Yuan Xu, Yun Ye, Yang Wang, Yukun Zhou, Zhengyuan Zhang, Zhehao Dong, Zheng Zhu

🧩 TL;DR

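This paper presents GigaWorld-0, a unified world model framework that combines video generation and 3D generative modeling to serve as a scalable data engine for Vision-Language-Action learning, significantly improving the generalization and task success of physical robots in the real world.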


📘 Detailed Summary

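Motivation: Building scalable, data-efficient world models is a key challenge in embodied AI. This work targets data scarcity in Vision-Language-Action learning, in particular the need to generate large-scale interaction data that is visually realistic, spatially coherent, and physically plausible.

Method: The GigaWorld-0 framework has two synergistic components. GigaWorld-0-Video uses large-scale video generation to produce diverse embodied sequences with fine-grained control over appearance, camera viewpoint, and action semantics. GigaWorld-0-3D combines 3D generative modeling, 3D Gaussian Splatting reconstruction, physically differentiable system identification, and executable motion planning to ensure geometric consistency and physical realism. The efficient GigaTrain framework uses FP8 precision and sparse attention to drastically reduce memory and compute requirements.

Result: Comprehensive evaluations show that GigaWorld-0 generates high-quality, diverse, and controllable data across multiple dimensions. Critically, VLA models trained on GigaWorld-0-generated data (e.g., GigaBrain-0) achieve strong real-world performance on physical robots, significantly improving generalization and task success without any real-world interaction during training.

Conclusion: The study demonstrates that a unified world model framework can serve as a data engine for embodied AI. Joint optimization of the video and 3D generation components enables scalable synthesis of interaction data that is visually compelling, spatially coherent, physically plausible, and instruction-aligned, providing an important foundation for data-driven embodied intelligence.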


📄 Abstract

World models are emerging as a foundational paradigm for scalable, data-efficient embodied AI. In this work, we present GigaWorld-0, a unified world model framework designed explicitly as a data engine for Vision-Language-Action (VLA) learning. GigaWorld-0 integrates two synergistic components: GigaWorld-0-Video, which leverages large-scale video generation to produce diverse, texture-rich, and temporally coherent embodied sequences under fine-grained control of appearance, camera viewpoint, and action semantics; and GigaWorld-0-3D, which combines 3D generative modeling, 3D Gaussian Splatting reconstruction, physically differentiable system identification, and executable motion planning to ensure geometric consistency and physical realism. Their joint optimization enables the scalable synthesis of embodied interaction data that is visually compelling, spatially coherent, physically plausible, and instruction-aligned. Training at scale is made feasible through our efficient GigaTrain framework, which exploits FP8 precision and sparse attention to drastically reduce memory and compute requirements. We conduct comprehensive evaluations showing that GigaWorld-0 generates high-quality, diverse, and controllable data across multiple dimensions. Critically, VLA models (e.g., GigaBrain-0) trained on GigaWorld-0-generated data achieve strong real-world performance, significantly improving generalization and task success on physical robots without any real-world interaction during training.

[42] While recognizing actions, LMMs struggle to detect core interaction events

Daniel Harari, Michael Sidorov, Liel David, Chen Shterental, Abrham Kahsay Gebreselasie, Muhammad Haris Khan

🧩 TL;DR

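This study evaluates the perceptual grounding of large multimodal models in physical interaction events. Although the models accurately describe objects and actions, they show marked deficits in pinpointing when and where an interaction begins and ends.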


📘 Detailed Summary

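Motivation: The study examines whether large multimodal models truly ground their semantic understanding in visual input, in particular whether they can accurately locate the key moments and positions at which a physical interaction begins and ends.

Method: The authors build the first large-scale dataset of annotated interaction events, with more than 20K annotations on videos from the Something-Something-V2 dataset. 250 human annotators labeled the core interaction events (contact and release), and two large multimodal models, Qwen-2.5VL and GPT-4o, were asked to locate these events in short single-event videos.

Result: Although the models reliably name the target objects, identify the action, and provide coherent reasoning, they consistently fail to identify the frame where the interaction begins or ends and to localize the event within the scene, indicating a lack of precise spatiotemporal localization.

Conclusion: The results suggest that large multimodal models struggle to pinpoint the moment and location of the physical contact that defines an interaction and lack the perceptual grounding needed for deeper understanding of dynamic scenes, revealing a limitation of current models in fine-grained spatiotemporal understanding.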


📄 Abstract

Large multi-modal models (LMMs) show increasing performance in realistic visual tasks for images and, more recently, for videos. For example, given a video sequence, such models are able to describe in detail objects, the surroundings and dynamic actions. In this study, we explored the extent to which these models ground their semantic understanding in the actual visual input. Specifically, given sequences of hands interacting with objects, we asked models when and where the interaction begins or ends. For this purpose, we introduce a first of its kind, large-scale dataset with more than 20K annotated interactions on videos from the Something-Something-V2 dataset. 250 AMTurk human annotators labeled core interaction events, particularly when and where objects and agents become attached ('contact') or detached ('release'). We asked two LMMs (Qwen-2.5VL and GPT-4o) to locate these events in short videos, each with a single event. The results show that although the models can reliably name the target objects, identify the action and provide coherent reasoning, they consistently fail to identify the frame where the interaction begins or ends and cannot localize the event within the scene. Our findings suggest that in struggling to pinpoint the moment and location of physical contact that defines the interaction, the models lack the perceptual grounding required for deeper understanding of dynamic scenes.

[43] ChessMamba: Structure-Aware Interleaving of State Spaces for Change Detection in Remote Sensing Images

Lei Ding, Tong Liu, Xuanguang Liu, Xiangyun Liu, Haitao Guo, Jun Lu

🧩 TL;DR

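This paper proposes the ChessMamba framework, which uses chessboard-interleaved state space modeling to address heterogeneity and spatiotemporal misalignment in change detection on multitemporal remote sensing imagery, achieving substantial accuracy gains on several change detection tasks.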


📘 Detailed Summary

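Motivation: Change detection in multitemporal remote sensing imagery is challenged by heterogeneity and spatiotemporal misalignment. Existing methods based on vision Transformers or state space models disrupt local structural consistency during temporal serialization, which obscures discriminative cues under misalignment and hinders reliable change localization.

Method: ChessMamba integrates a SpatialMamba encoder with a lightweight cross-source interaction module. It serializes multitemporal features into a unified sequence using chessboard interleaving with a snake scanning order, and performs structure-aware fusion via multi-dilated convolutions that selectively capture center-and-corner neighborhood contexts within each single temporal phase.

Result: Comprehensive evaluation on three tasks, binary change detection, semantic change detection, and multimodal building damage assessment, shows that ChessMamba effectively fuses heterogeneous features and achieves substantial accuracy gains over state-of-the-art methods.

Conclusion: The study demonstrates the effectiveness of structure-aware state space modeling for multitemporal change detection and offers a new approach to spatiotemporal misalignment in remote sensing imagery. The code will be released on GitHub.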


📄 Abstract

Change detection (CD) in multitemporal remote sensing imagery presents significant challenges for fine-grained recognition, owing to heterogeneity and spatiotemporal misalignment. However, existing methodologies based on vision transformers or state-space models typically disrupt local structural consistency during temporal serialization, obscuring discriminative cues under misalignment and hindering reliable change localization. To address this, we introduce ChessMamba, a structure-aware framework leveraging interleaved state-space modeling for robust CD with multi-temporal inputs. ChessMamba integrates a SpatialMamba encoder with a lightweight cross-source interaction module, featuring two key innovations: (i) Chessboard interleaving with snake scanning order, which serializes multi-temporal features into a unified sequence within a single forward pass, thereby shortening interaction paths and enabling direct comparison for accurate change localization; and (ii) Structure-aware fusion via multi-dilated convolutions, selectively capturing center-and-corner neighborhood contexts within each mono-temporal input. Comprehensive evaluations on three CD tasks, including binary CD, semantic CD and multimodal building damage assessment, demonstrate that ChessMamba effectively fuses heterogeneous features and achieves substantial accuracy improvements over state-of-the-art methods. The relevant code will be available at: github.com/DingLei14/ChessMamba.
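One possible reading of "chessboard interleaving with snake scanning order" is sketched below for two same-sized feature grids. The exact layout used by ChessMamba is not given in the abstract, so the widened grid, the alternation rule, and the function names are assumptions made purely for illustration.

```python
def chessboard_interleave(grid_a, grid_b):
    """Interleave two H x W grids into one H x 2W grid in a chessboard pattern.

    Cell (r, j) of the result comes from grid_a when (r + j) is even and from
    grid_b otherwise, so every token's horizontal and vertical neighbors come
    from the other time step (assumed layout).
    """
    merged = []
    for r, (row_a, row_b) in enumerate(zip(grid_a, grid_b)):
        row = []
        for a, b in zip(row_a, row_b):
            row.extend((a, b) if r % 2 == 0 else (b, a))
        merged.append(row)
    return merged


def snake_scan(grid):
    """Flatten a grid row by row, reversing every odd row (boustrophedon order)."""
    sequence = []
    for r, row in enumerate(grid):
        sequence.extend(row if r % 2 == 0 else reversed(row))
    return sequence
```

Under this layout, features from the two dates at the same location sit next to each other in the serialized sequence, and the snake order avoids the long jump from the end of one row to the start of the next, which is one way to get the shortened interaction paths the abstract mentions.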

[44] VeriSciQA: An Auto-Verified Dataset for Scientific Visual Question Answering

Yuyi Li, Daoyuan Chen, Zhen Wang, Yutong Lu, Yaliang Li

🧩 TL;DR

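This paper proposes a Generate-then-Verify framework for building the VeriSciQA dataset, addressing the scarcity of high-quality data for scientific visual question answering. The framework removes erroneous QA pairs through cross-modal consistency checks and auxiliary filters, and significantly improves open-source models on scientific visual question answering.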


📘 Detailed Summary

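Motivation: Open-source large vision-language models perform poorly on scientific visual question answering, mainly because large-scale, high-quality datasets are lacking. Existing approaches that synthesize data with large models introduce systematic errors, which stem from the models' inherent limitations and from information asymmetry between figures and text.

Method: The paper proposes a verification-centric Generate-then-Verify framework that first generates QA pairs with figure-associated textual context, then applies cross-modal consistency checks against the figures together with auxiliary filters to eliminate erroneous pairs. The framework is instantiated as VeriSciQA, a dataset of 20,351 QA pairs spanning 20 scientific domains and 12 figure types.

Result: VeriSciQA is a challenging benchmark for open-source models: the leading open-source models reach 64% accuracy while a proprietary model reaches 82%. Models fine-tuned on VeriSciQA improve consistently on scientific visual question answering benchmarks, with gains that scale with data size and surpass models trained on existing datasets. Human evaluation further validates the superior correctness of VeriSciQA.

Conclusion: The study shows that continued data expansion with the scalable framework can further advance scientific visual question answering in the open-source community. The framework systematically addresses errors in data synthesis, offers an effective way to build high-quality scientific multimodal datasets, and illustrates how strongly data quality affects model performance.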


📄 Abstract

Large Vision-Language Models (LVLMs) show promise for scientific applications, yet open-source models still struggle with Scientific Visual Question Answering (SVQA), namely answering questions about figures from scientific papers. A key bottleneck lies in the lack of public, large-scale, high-quality SVQA datasets. Although recent work uses LVLMs to synthesize data at scale, we identify systematic errors in their resulting QA pairs, stemming from LVLMs' inherent limitations and information asymmetry between figures and text. To address these challenges, we propose a verification-centric Generate-then-Verify framework that first generates QA pairs with figure-associated textual context, then applies cross-modal consistency checks against figures along with auxiliary filters to eliminate erroneous pairs. We instantiate this framework to curate VeriSciQA, a dataset of 20,351 QA pairs spanning 20 scientific domains and 12 figure types. VeriSciQA poses a challenging benchmark for open-source models, with a substantial accuracy gap between the leading open-source models (64%) and a proprietary model (82%). Moreover, models fine-tuned on VeriSciQA achieve consistent improvements on SVQA benchmarks, with performance gains that scale with data size and surpass models trained on existing datasets. Human evaluation further validates the superior correctness of VeriSciQA. Together, this evidence demonstrates that continued data expansion by our scalable framework can further advance SVQA capability in the open-source community.

[45] MHB: Multimodal Handshape-aware Boundary Detection for Continuous Sign Language Recognition

Mingyu Zhao, Zhanfu Yang, Yang Zhou, Zhaoyang Xia, Can Jin, Xiaoxiao He, Carol Neidle, Dimitris N. Metaxas

🧩 TL;DR

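This paper proposes a multimodal approach to continuous sign language recognition that detects sign boundaries by combining 3D skeletal features with handshape classification, and reports significant improvements on the ASLLRP corpus. The method first detects the start and end frames of signs in videos of American Sign Language sentences, then recognizes the segmented signs.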


📘 Detailed Summary

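Motivation: The main challenge in continuous sign language recognition is accurately detecting sign boundaries, because signs in continuous signing differ in several respects from their isolated forms. Existing methods make little effective use of 3D handshape information for boundary detection, even though particular handshapes are typically expected at the beginning and end of signs.

Method: The method uses 3D skeletal features extracted from sign language videos to capture the convergence of sign properties and their dynamics. A handshape classifier covering 87 linguistically defined canonical handshape categories is pretrained on a dataset built by integrating and normalizing several existing datasets. A multimodal fusion module unifies the pretrained sign video segmentation framework with the handshape classification model, and the estimated boundaries are then used for sign recognition.

Result: Evaluation on the ASLLRP corpus shows significant improvements over previous work. Multimodal fusion of 3D skeletal features and handshape information improves the accuracy and robustness of sign boundary detection.

Conclusion: The study demonstrates the effectiveness of multimodal approaches for continuous sign language recognition; in particular, combining 3D skeletal features with handshape information markedly improves boundary detection. Future work could explore fusing additional modalities and generalization across sign language varieties and environments.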


📄 Abstract

This paper presents a multimodal approach for continuous sign recognition that first uses machine learning to detect the start and end frames of signs in videos of American Sign Language (ASL) sentences, and then recognizes the segmented signs. For improved robustness, we use 3D skeletal features extracted from sign language videos to capture the convergence of sign properties and their dynamics, which tend to cluster at sign boundaries. Another focus of this work is the incorporation of information from 3D handshape for boundary detection. To detect handshapes normally expected at the beginning and end of signs, we pretrain a handshape classifier for 87 linguistically defined canonical handshape categories using a dataset that we created by integrating and normalizing several existing datasets. A multimodal fusion module is then used to unify the pretrained sign video segmentation framework and the handshape classification models. Finally, the estimated boundaries are used for sign recognition, where the recognition model is trained on a large database containing both citation-form isolated signs and signs pre-segmented (based on manual annotations) from continuous signing, as such signs often differ in certain respects. We evaluate our method on the ASLLRP corpus and demonstrate significant improvements over previous work.

[46] Motion Marionette: Rethinking Rigid Motion Transfer via Prior Guidance

Haoxuan Wang, Jiachen Tao, Junyi Wu, Gaowen Liu, Ramana Rao Kompella, Yan Yan

🧩 TL;DR

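Motion Marionette is a zero-shot framework for rigid motion transfer that uses an internal spatial-temporal prior to transfer motion from a monocular source video to a single-view target image, achieving general cross-object motion transfer and temporally consistent video generation without external priors.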


📘 Detailed Summary

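Motivation: Existing motion transfer methods typically rely on external priors (geometric, generative, or simulation-based) to guide the transfer process. These priors introduce auxiliary constraints that force a trade-off between generalizability and temporal consistency, limiting the methods' applicability and effectiveness.

Method: The framework first lifts the source video and the target image into a unified 3D representation space. Motion trajectories extracted from the source video are used to construct a spatial-temporal prior that is independent of object geometry and semantics and encodes relative spatial variations over time. This prior is then integrated with the target object to synthesize a controllable velocity field, which is refined with Position-Based Dynamics to mitigate artifacts and enhance visual coherence.

Result: Experiments show that Motion Marionette generalizes across diverse objects, produces temporally consistent videos aligned with the source motion, and supports controllable video generation, performing well in both motion transfer quality and generality.

Conclusion: The study demonstrates the effectiveness of an internal spatial-temporal prior for motion transfer and provides a general solution that needs no external constraints, opening new avenues for efficient video production and cross-object motion control with clear practical value.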


📄 Abstract

We present Motion Marionette, a zero-shot framework for rigid motion transfer from monocular source videos to single-view target images. Previous works typically employ geometric, generative, or simulation priors to guide the transfer process, but these external priors introduce auxiliary constraints that lead to trade-offs between generalizability and temporal consistency. To address these limitations, we propose guiding the motion transfer process through an internal prior that exclusively captures the spatial-temporal transformations and is shared between the source video and any transferred target video. Specifically, we first lift both the source video and the target image into a unified 3D representation space. Motion trajectories are then extracted from the source video to construct a spatial-temporal (SpaT) prior that is independent of object geometry and semantics, encoding relative spatial variations over time. This prior is further integrated with the target object to synthesize a controllable velocity field, which is subsequently refined using Position-Based Dynamics to mitigate artifacts and enhance visual coherence. The resulting velocity field can be flexibly employed for efficient video production. Empirical results demonstrate that Motion Marionette generalizes across diverse objects, produces temporally consistent videos that align well with the source motion, and supports controllable video generation.

[47] Reasoning-VLA: A Fast and General Vision-Language-Action Reasoning Model for Autonomous Driving

Dapeng Zhang, Zhenlong Yuan, Zhangquan Chen, Chih-Ting Liao, Yinda Chen, Fei Shen, Qingguo Zhou, Tat-Seng Chua

🧩 TL;DR

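This paper proposes Reasoning-VLA, a framework that uses learnable action queries and reasoning-enhanced vision-language features to generate autonomous driving decisions efficiently and with strong generalization, achieving state-of-the-art performance on multiple benchmarks.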


📘 Detailed Summary

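Motivation: Existing vision-language-action models for autonomous driving suffer from inefficient inference and limited generalization, particularly to novel vehicle configurations and driving scenarios, which calls for a more efficient and general action-generation framework.

Method: Reasoning-VLA initializes learnable action queries via Gaussian sampling from ground-truth trajectories; these queries interact with reasoning-enhanced vision-language features to generate continuous action trajectories in parallel. Eight public autonomous driving datasets are consolidated into a standardized, Chain-of-Thought reasoning-based data format, and the model is trained with a combination of supervised learning and reinforcement learning fine-tuning.

Result: Extensive empirical evaluation on multiple benchmarks shows that Reasoning-VLA achieves state-of-the-art performance, superior generalization, and the best inference speed reported to date.

Conclusion: The study demonstrates the effectiveness of combining learnable action queries with reasoning-enhanced features, providing an efficient and general solution for autonomous driving decision systems and advancing the practical use of VLA models.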


📄 Abstract

Vision-Language-Action (VLA) models have recently shown strong decision-making capabilities in autonomous driving. However, existing VLAs often struggle with achieving efficient inference and generalizing to novel autonomous vehicle configurations and driving scenarios. In this paper, we propose Reasoning-VLA, a general and fast action-generation VLA framework. The proposed model employs a set of learnable action queries, initialized via Gaussian sampling from ground-truth trajectories within the training corpus. These learnable queries interact with reasoning-enhanced vision-language features to generate continuous action trajectories in parallel. To promote robust generalization, we consolidate eight publicly available autonomous driving datasets into a standardized, Chain-of-Thought reasoning-based, and easy-to-use data format for model training. The model is trained with both supervised learning and reinforcement learning fine-tuning. Extensive empirical evaluations across multiple benchmarks demonstrate that Reasoning-VLA achieves state-of-the-art performance, superior generalization capability, and the best inference speed reported to date.

[48] Coupled Physics-Gated Adaptation: Spatially Decoding Volumetric Photochemical Conversion in Complex 3D-Printed Objects

Maryam Eftekharifar, Churun Zhang, Jialiang Wei, Xudong Cao, Hossein Heidari

🧩 TL;DR

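This study presents a pioneering framework that, for the first time, predicts photochemical conversion in complex 3D-printed objects. Its Coupled Physics-Gated Adaptation (C-PGA) architecture predicts dense, non-visual volumetric physical properties from 3D visual data, offering a breakthrough approach to virtual chemical characterisation.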


📘 Detailed Summary

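Motivation: Conventional vision models cannot handle the coupled, non-linear interactions between optical physics (diffraction, absorption) and material physics (diffusion, convection) in complex 3D-printed objects. This limits the prediction of dense volumetric physical properties from 3D visual data, and existing methods lack an inductive bias for this physical coupling.

Method: The paper proposes Coupled Physics-Gated Adaptation (C-PGA), a multimodal fusion architecture that uses sparse geometrical and process parameters as a Query to dynamically gate and adapt dense visual features via feature-wise linear modulation (FiLM). The mechanism spatially modulates dual 3D visual streams, extracted by parallel 3D-CNNs that process raw projection stacks and their diffusion-diffraction corrected counterparts.

Result: The method builds on the largest optically printed 3D specimen dataset to date, comprising a large family of parametrically designed complex minimal surface structures that have undergone terminal chemical characterisation, and it successfully predicts the photochemical conversion state.

Conclusion: C-PGA advances virtual chemical characterisation by eliminating the need for traditional post-print measurements and enabling precise control over the chemical conversion state, opening a new direction for physics-aware computer vision tasks.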


📄 Abstract

We present a framework that pioneers the prediction of photochemical conversion in complex three-dimensionally printed objects, introducing a challenging new computer vision task: predicting dense, non-visual volumetric physical properties from 3D visual data. This approach leverages the largest-ever optically printed 3D specimen dataset, comprising a large family of parametrically designed complex minimal surface structures that have undergone terminal chemical characterisation. Conventional vision models are ill-equipped for this task, as they lack an inductive bias for the coupled, non-linear interactions of optical physics (diffraction, absorption) and material physics (diffusion, convection) that govern the final chemical state. To address this, we propose Coupled Physics-Gated Adaptation (C-PGA), a novel multimodal fusion architecture. Unlike standard concatenation, C-PGA explicitly models physical coupling by using sparse geometrical and process parameters (e.g., surface transport, print layer height) as a Query to dynamically gate and adapt the dense visual features via feature-wise linear modulation (FiLM). This mechanism spatially modulates dual 3D visual streams (extracted by parallel 3D-CNNs processing raw projection stacks and their diffusion-diffraction corrected counterparts), allowing the model to recalibrate its visual perception based on the physical context. This approach offers a breakthrough in virtual chemical characterisation, eliminating the need for traditional post-print measurements and enabling precise control over the chemical conversion state.
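Feature-wise linear modulation (FiLM), which the abstract names as the gating mechanism, scales and shifts each feature channel with coefficients predicted from a conditioning vector. The following is a minimal pure-Python sketch of that general technique, not the C-PGA implementation: the layer sizes, weights, and the two example process parameters are illustrative assumptions.

```python
from typing import List, Sequence


def linear(weights: Sequence[Sequence[float]], bias: Sequence[float],
           x: Sequence[float]) -> List[float]:
    """Dense layer: returns W @ x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def film(features: Sequence[Sequence[float]], gamma: Sequence[float],
         beta: Sequence[float]) -> List[List[float]]:
    """Apply gamma[c] * value + beta[c] to every value of channel c."""
    return [[g * v + b for v in channel]
            for channel, g, b in zip(features, gamma, beta)]


# Hypothetical conditioning vector, e.g. two normalized process parameters.
process_params = [0.5, 2.0]

# Hypothetical weights of the two small layers that predict gamma and beta.
W_gamma, b_gamma = [[1.0, 0.0], [0.0, 0.5]], [0.0, 0.0]
W_beta, b_beta = [[0.0, 0.0], [1.0, 0.0]], [0.1, 0.0]

gamma = linear(W_gamma, b_gamma, process_params)  # one scale per channel
beta = linear(W_beta, b_beta, process_params)     # one shift per channel

# Two channels of flattened visual features (two voxels each).
features = [[1.0, 2.0], [3.0, 4.0]]
modulated = film(features, gamma, beta)
```

Because gamma and beta depend only on the conditioning vector, the same visual features are rescaled differently for different physical contexts, which is the sense in which the sparse parameters "gate" the dense stream.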

[49] Scale Where It Matters: Training-Free Localized Scaling for Diffusion Models

Qin Ren, Yufei Wang, Lanqing Guo, Wen Zhang, Zhiwen Fan, Chenyu You

🧩 TL;DR

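This paper proposes LoTTS, the first training-free framework for localized test-time scaling, which adaptively resamples defective regions while preserving high-quality ones, substantially reducing the search space and the computational cost. The method achieves state-of-the-art performance on several diffusion models while cutting GPU cost by 2-4x.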


📘 Detailed Summary

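Motivation: Existing test-time scaling methods operate at the full-image level and overlook the spatial heterogeneity of image quality, which leads to unnecessary computation on already satisfactory regions and insufficient correction of localized defects. This motivates localized test-time scaling, which adaptively resamples defective regions while preserving high-quality ones.

Method: LoTTS has two core components. Defect localization contrasts cross-attention and self-attention signals under quality-aware prompts to identify defective regions, then refines them into coherent masks. Consistency preservation perturbs only the defective regions and denoises them locally, so that corrections stay confined and the rest of the image remains undisturbed.

Result: Extensive experiments on SD2.1, SDXL, and FLUX show that LoTTS achieves state-of-the-art performance: it consistently improves both local quality and global fidelity while reducing GPU cost by 2-4x compared with Best-of-N sampling. Across multiple benchmarks it shows a favorable balance of efficiency and effectiveness.

Conclusion: The study establishes localized test-time scaling as a promising new direction for scaling diffusion models at inference time. LoTTS shows that spatially adaptive allocation of computation can substantially improve efficiency while maintaining or even improving generation quality.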


📄 Abstract

Diffusion models have become the dominant paradigm in text-to-image generation, and test-time scaling (TTS) further improves quality by allocating more computation during inference. However, existing TTS methods operate at the full-image level, overlooking the fact that image quality is often spatially heterogeneous. This leads to unnecessary computation on already satisfactory regions and insufficient correction of localized defects. In this paper, we explore a new direction - Localized TTS - that adaptively resamples defective regions while preserving high-quality regions, thereby substantially reducing the search space. This paradigm poses two central challenges: accurately localizing defects and maintaining global consistency. We propose LoTTS, the first fully training-free framework for localized TTS. For defect localization, LoTTS contrasts cross- and self-attention signals under quality-aware prompts (e.g., high-quality vs. low-quality) to identify defective regions, and then refines them into coherent masks. For consistency, LoTTS perturbs only defective regions and denoises them locally, ensuring that corrections remain confined while the rest of the image remains undisturbed. Extensive experiments on SD2.1, SDXL, and FLUX demonstrate that LoTTS achieves state-of-the-art performance: it consistently improves both local quality and global fidelity, while reducing GPU cost by 2-4x compared to Best-of-N sampling. These findings establish localized TTS as a promising new direction for scaling diffusion models at inference time.
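The consistency step ("perturbs only defective regions") can be illustrated with a masked noise blend. This is a rough sketch under assumptions, not the LoTTS procedure: the variance-preserving mixing rule, the `strength` parameter, and the flat-list latent are all illustrative choices, and the subsequent local denoising is omitted.

```python
import random
from typing import List, Sequence


def perturb_defects(latent: Sequence[float], mask: Sequence[float],
                    strength: float, rng: random.Random) -> List[float]:
    """Re-noise only the entries where mask == 1; keep the rest untouched."""
    out = []
    for value, m in zip(latent, mask):
        # Mix the current value with fresh Gaussian noise (assumed rule).
        noisy = (1.0 - strength) ** 0.5 * value + strength ** 0.5 * rng.gauss(0.0, 1.0)
        # Blend: masked entries take the noisy value, others keep the original.
        out.append(m * noisy + (1.0 - m) * value)
    return out


rng = random.Random(0)
latent = [1.0, 2.0, 3.0, 4.0]
defect_mask = [0.0, 1.0, 1.0, 0.0]  # hypothetical mask from defect localization
perturbed = perturb_defects(latent, defect_mask, strength=0.5, rng=rng)
```

Entries outside the mask come back bit-for-bit identical, which is what keeps a correction confined to the flagged region before any denoising is applied.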

[50] Intelligent Image Search Algorithms Fusing Visual Large Models

Kehan Wang, Tingqiong Cui, Yang Zhang, Yu Chen, Shifeng Wu, Zhenzhang Li

🧩 TL;DR

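This paper proposes DetVLM, a framework that fuses an object detector with Visual Large Models to enable state search and zero-shot search in fine-grained image retrieval, achieving a state-of-the-art retrieval accuracy of 94.82% on a vehicle component dataset.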


📘 Detailed Summary

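Motivation: Fine-grained image retrieval is critical in security and industrial inspection, but conventional methods have significant limitations: handcrafted features lack robustness; deep learning-based detectors cannot perform state-specific retrieval or zero-shot search; and Visual Large Models offer semantic and zero-shot capabilities but suffer from poor spatial grounding and high computational cost, so they cannot be used directly for efficient retrieval.

Method: DetVLM uses a two-stage pipeline: a YOLO detector first performs efficient, high-recall component-level screening to determine component presence; a VLM then acts as a recall-enhancement unit, performing secondary verification for components the detector missed. This design supports two advanced capabilities, state search and zero-shot search.

Result: Experiments on a vehicle component dataset show that DetVLM achieves a state-of-the-art overall retrieval accuracy of 94.82%, significantly outperforming detection-only baselines. It also reaches 94.95% accuracy in zero-shot search for driver mask-wearing and over 90% average accuracy in state search tasks.

Conclusion: The study demonstrates the effectiveness of fusing a detector with a VLM, introduces a search-enhancement paradigm, and provides an efficient solution for fine-grained image retrieval with potential for practical applications such as industrial inspection and security monitoring.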


📄 Abstract

Fine-grained image retrieval, which aims to find images containing specific object components and assess their detailed states, is critical in fields like security and industrial inspection. However, conventional methods face significant limitations: manual features (e.g., SIFT) lack robustness; deep learning-based detectors (e.g., YOLO) can identify component presence but cannot perform state-specific retrieval or zero-shot search; Visual Large Models (VLMs) offer semantic and zero-shot capabilities but suffer from poor spatial grounding and high computational cost, making them inefficient for direct retrieval. To bridge these gaps, this paper proposes DetVLM, a novel intelligent image search framework that synergistically fuses object detection with VLMs. The framework pioneers a search-enhancement paradigm via a two-stage pipeline: a YOLO detector first conducts efficient, high-recall component-level screening to determine component presence; then, a VLM acts as a recall-enhancement unit, performing secondary verification for components missed by the detector. This architecture directly enables two advanced capabilities: 1) State Search: Guided by task-specific prompts, the VLM refines results by verifying component existence and executing sophisticated state judgments (e.g., "sun visor lowered"), allowing retrieval based on component state. 2) Zero-shot Search: The framework leverages the VLM's inherent zero-shot capability to recognize and retrieve images containing unseen components or attributes (e.g., "driver wearing a mask") without any task-specific training. Experiments on a vehicle component dataset show DetVLM achieves a state-of-the-art overall retrieval accuracy of 94.82%, significantly outperforming detection-only baselines. It also attains 94.95% accuracy in zero-shot search for driver mask-wearing and over 90% average accuracy in state search tasks.
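The two-stage control flow described above (detector first, VLM only for what the detector misses) can be sketched as follows. Everything here is a stand-in: `detect` and `vlm_verify` are hypothetical callables, and the toy "images" are dictionaries rather than pixels, so the sketch shows only the routing logic and not any real detector or VLM API.

```python
from typing import Callable, Dict, List, Set

Image = Dict[str, object]


def two_stage_search(images: List[Image], component: str,
                     detect: Callable[[Image], Set[str]],
                     vlm_verify: Callable[[Image, str], bool]) -> List[Image]:
    """Return images containing `component`, using the VLM only on detector misses."""
    hits = []
    for image in images:
        if component in detect(image):
            hits.append(image)            # stage 1: the detector found it
        elif vlm_verify(image, component):
            hits.append(image)            # stage 2: the VLM recovered a miss
    return hits


# Toy data: "detected" simulates detector output, "content" the true content.
images = [
    {"id": "a", "detected": {"sun visor"}, "content": {"sun visor"}},
    {"id": "b", "detected": set(), "content": {"sun visor"}},  # detector miss
    {"id": "c", "detected": set(), "content": set()},
]
vlm_calls: List[str] = []


def toy_detect(image: Image) -> Set[str]:
    return image["detected"]


def toy_vlm_verify(image: Image, component: str) -> bool:
    vlm_calls.append(image["id"])  # record how often the costly model runs
    return component in image["content"]


hits = two_stage_search(images, "sun visor", toy_detect, toy_vlm_verify)
hit_ids = [image["id"] for image in hits]
```

In this toy run the expensive verifier is invoked for images "b" and "c" only, which reflects the efficiency argument for screening with the detector first.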

[51] Image Diffusion Models Exhibit Emergent Temporal Propagation in Videos

Youngseo Kim, Dohyun Kim, Geohee Han, Paul Hongsuck Seo

🧩 TL;DR

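This paper proposes DRIFT, a framework that reinterprets the self-attention maps of diffusion models as semantic label propagation kernels to perform zero-shot video object tracking and segmentation, achieving state-of-the-art performance on standard benchmarks.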


📘 Detailed Summary

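Motivation: Although image diffusion models were originally developed for image generation, the rich semantic structure they implicitly capture has not been fully exploited for recognition and localization tasks. This study explores how the self-attention mechanism of diffusion models can be used for zero-shot video object tracking and segmentation.

Method: Self-attention maps are reinterpreted as semantic label propagation kernels and extended across frames to build a temporal propagation kernel for frame-to-frame label propagation. This is combined with test-time optimization strategies (DDIM inversion, textual inversion, and adaptive head weighting) and SAM-guided mask refinement.

Result: DRIFT achieves state-of-the-art zero-shot performance on standard video object segmentation benchmarks, demonstrating the robustness and consistency of diffusion features for label propagation.

Conclusion: The self-attention mechanism of diffusion models can serve as a powerful tool for semantic label propagation, offering a new zero-shot solution for video analysis tasks and showing the potential of generative models in discriminative tasks.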


📄 Abstract

Image diffusion models, though originally developed for image generation, implicitly capture rich semantic structures that enable various recognition and localization tasks beyond synthesis. In this work, we show that their self-attention maps can be reinterpreted as semantic label propagation kernels, providing robust pixel-level correspondences between relevant image regions. Extending this mechanism across frames yields a temporal propagation kernel that enables zero-shot object tracking via segmentation in videos. We further demonstrate the effectiveness of test-time optimization strategies (DDIM inversion, textual inversion, and adaptive head weighting) in adapting diffusion features for robust and consistent label propagation. Building on these findings, we introduce DRIFT, a framework for object tracking in videos leveraging a pretrained image diffusion model with SAM-guided mask refinement, achieving state-of-the-art zero-shot performance on standard video object segmentation benchmarks.
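Using an attention map as a label propagation kernel means each target pixel takes a weighted vote over the labels of reference pixels, with weights given by row-normalized affinities. The sketch below illustrates that generic operation only; the affinity values are made up, and how DRIFT actually extracts and combines attention across heads and frames is not shown.

```python
import math
from typing import List, Sequence


def softmax(row: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one row of affinities."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def propagate_labels(affinity: Sequence[Sequence[float]],
                     ref_labels: Sequence[int],
                     num_classes: int) -> List[int]:
    """affinity[i][j]: similarity of target pixel i to reference pixel j."""
    propagated = []
    for row in affinity:
        weights = softmax(row)
        scores = [0.0] * num_classes
        for weight, label in zip(weights, ref_labels):
            scores[label] += weight  # weighted vote for the reference label
        propagated.append(max(range(num_classes), key=scores.__getitem__))
    return propagated


# Three reference pixels (labels: 0 = background, 1 = object), two target pixels.
ref_labels = [0, 1, 1]
affinity = [
    [5.0, 0.0, 0.0],  # target 0 attends mostly to reference 0
    [0.0, 0.0, 5.0],  # target 1 attends mostly to reference 2
]
target_labels = propagate_labels(affinity, ref_labels, num_classes=2)
```

Applying the same step from one frame to the next, with the previous frame acting as the reference, gives a simple picture of the temporal propagation kernel.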

[52] Supervise Less, See More: Training-free Nuclear Instance Segmentation with Prototype-Guided Prompting

Wen Zhang, Qin Ren, Wenjing Liu, Haibin Ling, Chenyu You

🧩 TL;DR

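This paper proposes SPROUT, a fully training- and annotation-free prompting framework for nuclear instance segmentation. It uses histology-informed priors to construct slide-specific reference prototypes that mitigate domain gaps, producing precise nuclear segmentation without any parameter updates.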


📘 Detailed Summary

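Motivation: Most current nuclear instance segmentation methods still depend on dense supervision and computationally expensive fine-tuning, while training-free methods, though appealing, remain largely unexplored. This matters because accurate nuclear segmentation in computational pathology is essential for clinical insights and translational applications.

Method: SPROUT uses histology-informed priors to build slide-specific reference prototypes, which progressively guide feature alignment through a partial optimal transport scheme. The resulting foreground and background features are converted into positive and negative point prompts, allowing the Segment Anything Model to produce precise nuclear segmentations without parameter updates.

Result: Extensive experiments on multiple histopathology benchmarks show that SPROUT achieves competitive performance without supervision or retraining.

Conclusion: The study establishes a new paradigm for scalable, training-free nuclear instance segmentation in pathology, showing that histology-informed priors and prompting can deliver high-quality nuclear segmentation without dense supervision.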


📄 Abstract

Accurate nuclear instance segmentation is a pivotal task in computational pathology, supporting data-driven clinical insights and facilitating downstream translational applications. While large vision foundation models have shown promise for zero-shot biomedical segmentation, most existing approaches still depend on dense supervision and computationally expensive fine-tuning. Consequently, training-free methods present a compelling research direction, yet remain largely unexplored. In this work, we introduce SPROUT, a fully training- and annotation-free prompting framework for nuclear instance segmentation. SPROUT leverages histology-informed priors to construct slide-specific reference prototypes that mitigate domain gaps. These prototypes progressively guide feature alignment through a partial optimal transport scheme. The resulting foreground and background features are transformed into positive and negative point prompts, enabling the Segment Anything Model (SAM) to produce precise nuclear delineations without any parameter updates. Extensive experiments across multiple histopathology benchmarks demonstrate that SPROUT achieves competitive performance without supervision or retraining, establishing a novel paradigm for scalable, training-free nuclear instance segmentation in pathology.

[53] HiCoGen: Hierarchical Compositional Text-to-Image Generation in Diffusion Models via Reinforcement Learning

Hongji Yang, Yucheng Zhou, Wencheng Han, Runzhou Tao, Zhongying Qiu, Jianfei Yang, Jianbing Shen

🧩 TL;DR

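This paper proposes HiCoGen, a hierarchical compositional generation framework built on a Chain of Synthesis paradigm, which decomposes complex prompts into minimal semantic units and synthesizes them iteratively to address concept omission and poor compositionality when diffusion models handle complex multi-object prompts.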


📘 Detailed Summary

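Motivation: Existing diffusion models are markedly limited on complex prompts involving multiple objects and hierarchical structures: they often omit or confuse concepts and compose scenes poorly, failing to follow complex instructions accurately.

Method: HiCoGen builds on the Chain of Synthesis paradigm: a large language model decomposes a complex prompt into minimal semantic units, which are then synthesized iteratively. A reinforcement learning framework is introduced, with a Decaying Stochasticity Schedule, motivated by theoretical analysis, to enhance exploration, and a hierarchical reward mechanism that evaluates images at the global, subject, and relationship levels.

Result: Experiments show that the method significantly outperforms existing methods in concept coverage and compositional accuracy, with evaluation carried out on the newly constructed HiCoPrompt benchmark of hierarchical prompts.

Conclusion: The study shows that hierarchical decomposition and iterative synthesis can effectively address image generation for complex prompts. The Decaying Stochasticity Schedule and the hierarchical reward mechanism offer new ideas for reinforcement learning with diffusion models and a promising direction for compositional image generation.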


📄 Abstract

Recent advances in diffusion models have demonstrated impressive capability in generating high-quality images for simple prompts. However, when confronted with complex prompts involving multiple objects and hierarchical structures, existing models struggle to accurately follow instructions, leading to issues such as concept omission, confusion, and poor compositionality. To address these limitations, we propose a Hierarchical Compositional Generative framework (HiCoGen) built upon a novel Chain of Synthesis (CoS) paradigm. Instead of monolithic generation, HiCoGen first leverages a Large Language Model (LLM) to decompose complex prompts into minimal semantic units. It then synthesizes these units iteratively, where the image generated in each step provides crucial visual context for the next, ensuring all textual concepts are faithfully constructed into the final scene. To further optimize this process, we introduce a reinforcement learning (RL) framework. Crucially, we identify that the limited exploration of standard diffusion samplers hinders effective RL. We theoretically prove that sample diversity is maximized by concentrating stochasticity in the early generation stages and, based on this insight, propose a novel Decaying Stochasticity Schedule to enhance exploration. Our RL algorithm is then guided by a hierarchical reward mechanism that jointly evaluates the image at the global, subject, and relationship levels. We also construct HiCoPrompt, a new text-to-image benchmark with hierarchical prompts for rigorous evaluation. Experiments show our approach significantly outperforms existing methods in both concept coverage and compositional accuracy.

[54] Boosting Reasoning in Large Multimodal Models via Activation Replay

Yun Xing, Xiaobin Hu, Qingdong He, Jiangning Zhang, Shuicheng Yan, Shijian Lu, Yu-Gang Jiang

🧩 TL;DR

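This paper proposes Activation Replay, a training-free technique that improves the reasoning of post-trained large multimodal models by replaying low-entropy activations from the base model to regulate the RLVR model, with significant gains in mathematical reasoning, visual agents, and video reasoning.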


📘 Detailed Summary

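Motivation: Reinforcement Learning with Verifiable Rewards (RLVR) is effective at improving reasoning in large multimodal models, but its underlying mechanisms are poorly understood, in particular how RLVR affects internal activations and how that effect relates to reasoning ability.

Method: The study first analyzes, through the logit lens, how RLVR affects input activations, and finds that RLVR unexpectedly shifts low-entropy activations while high-entropy ones are less affected. Based on this, Activation Replay manipulates visual tokens at test time to replay low-entropy activations from the base model into the RLVR model, regulating reasoning without expensive policy optimization.

Result: Experiments show that Activation Replay triggers better reasoning across diverse scenarios, including mathematics, o3-like visual agents, and video reasoning; it boosts Pass@K and mitigates the narrower reasoning coverage of RLVR. It also outperforms alternatives such as replaying high-entropy activations or direct cross-model intervention.

Conclusion: The findings suggest a potentially beneficial role for modulating low-entropy activations in multimodal reasoning. As a simple, effective, training-free method, Activation Replay offers a new perspective on how RLVR works and a new route to improving post-trained models. The code will be made publicly available.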


📄 Abstract

Recently, Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach to incentivizing reasoning capability in Large Multimodal Models (LMMs), while the underlying mechanisms behind this post-training paradigm are poorly understood. We begin by exploring how input activations are affected by RLVR through the perspective of logit lens. Our systematic investigations across multiple post-trained LMMs suggest that RLVR shifts low-entropy activations unexpectedly, while high-entropy ones are less affected. We further demonstrate that such phenomena are associated with LMM reasoning by controlled experiments, suggesting a potentially beneficial role of modulating low-entropy activations. To this end, we propose Activation Replay, a simple yet effective training-free approach that boosts multimodal reasoning of post-trained LMMs without requiring expensive policy optimization. Our design involves manipulation of visual tokens at test time, replaying low-entropy activations from the input context of base LMMs to regulate the RLVR counterparts. Activation Replay triggers better reasoning across diverse scenarios, including mathematics, o3-like visual agents, and video reasoning. We further show that Activation Replay boosts Pass@K and mitigates narrower reasoning coverage of RLVR. Our design is compared against alternative choices, such as replaying high-entropy activations instead of low-entropy ones, or direct cross-model intervention instead of manipulating input tokens, demonstrating the superiority of our implementation. Codes will be made publicly available.
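One common reading of "entropy of an activation through the logit lens" is: project the hidden activation through the unembedding matrix, apply a softmax, and take the Shannon entropy of the resulting distribution. The sketch below follows that reading as an assumption; the abstract does not spell out the paper's exact definition, and the tiny identity unembedding matrix is purely illustrative.

```python
import math
from typing import Sequence


def logit_lens_entropy(activation: Sequence[float],
                       unembedding: Sequence[Sequence[float]]) -> float:
    """Entropy (in nats) of softmax(unembedding @ activation)."""
    logits = [sum(w * a for w, a in zip(row, activation)) for row in unembedding]
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical 3-token vocabulary with an identity unembedding matrix.
unembedding = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

peaked = logit_lens_entropy([10.0, 0.0, 0.0], unembedding)  # confident: low entropy
flat = logit_lens_entropy([0.0, 0.0, 0.0], unembedding)     # uniform: log(3)
```

Under this reading, a low-entropy activation is one that already decodes to a sharply peaked token distribution, which gives a concrete criterion for selecting which activations to replay.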

[55] SONIC: Spectral Optimization of Noise for Inpainting with Consistency

Seungyeon Baek, Erqun Dong, Shadan Namazifard, Mark J. Matthews, Kwang Moo Yi

🧩 TL;DR

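This paper proposes a training-free inpainting method that optimizes the initial seed noise to improve off-the-shelf text-to-image models on inpainting. Supported by a linear approximation and spectral-domain optimization, it outperforms existing techniques on a variety of inpainting tasks.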


📘 Detailed Summary

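Motivation: Guidance-based methods in theory allow generic models to be used for inverse problems such as inpainting, but their effectiveness in practice is limited, which has made specialized inpainting models necessary. The paper argues that the missing ingredient for training-free inpainting is optimization of the initial seed noise so that it better matches the unmasked parts of the data.

Method: The initial seed noise is optimized to approximately match the unmasked parts of the data, in only a few tens of optimization steps. Two ideas make this practical: a linear approximation avoids the costly unrolling needed to relate the initial noise to the generated result, and the optimization is performed in the spectral domain for stability.

Result: The method is effective on a variety of inpainting tasks and outperforms the state of the art. Experiments show that optimized seed noise markedly improves training-free inpainting.

Conclusion: The study shows that optimizing the initial seed noise can substantially improve off-the-shelf text-to-image models on inpainting without dedicated training, offering a new way to apply generic models to inverse problems.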


📄 Abstract

We propose a novel training-free method for inpainting with off-the-shelf text-to-image models. While guidance-based methods in theory allow generic models to be used for inverse problems such as inpainting, in practice, their effectiveness is limited, leading to the necessity of specialized inpainting-specific models. In this work, we argue that the missing ingredient for training-free inpainting is the optimization (guidance) of the initial seed noise. We propose to optimize the initial seed noise to approximately match the unmasked parts of the data, in only a few tens of optimization steps. We then apply conventional training-free inpainting methods on top of our optimized initial seed noise. Critically, we propose two core ideas to effectively implement this idea: (i) to avoid the costly unrolling required to relate the initial noise and the generated outcome, we use a linear approximation; and (ii) to stabilize the optimization, we optimize the initial seed noise in the spectral domain. We demonstrate the effectiveness of our method on various inpainting tasks, outperforming the state of the art. Project page: https://ubc-vision.github.io/sonic/

[56] GazeProphetV2: Head-Movement-Based Gaze Prediction Enabling Efficient Foveated Rendering on Mobile VR

Farhaan Ebadulla, Chiraag Mudlpaur, Shreya Chaurasia, Gaurav BV

🧩 TL;DR

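This paper proposes a multimodal VR gaze prediction approach that uses a gated fusion mechanism to combine temporal gaze patterns, head movement data, and visual scene information. On a dataset of 5.3M gaze samples across 22 VR scenes, multimodal fusion improves prediction accuracy over individual data streams.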


📘 Detailed Summary

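Motivation: Predicting gaze behavior in virtual reality matters for rendering optimization and interface design, but existing methods are limited in accuracy and generalization, so methods that effectively integrate multiple information sources are needed.

Method: The approach combines temporal gaze patterns, head movement data, and visual scene information. A gated fusion mechanism with cross-modal attention learns to weight past gaze trajectories, head orientation, and scene content adaptively, based on contextual relevance, for joint prediction.

Result: On a dataset spanning 22 VR scenes with 5.3M gaze samples, combining modalities improves predictive accuracy over individual data streams across 1-3 future frames. Cross-scene generalization testing reaches 93.1% validation accuracy, with temporally consistent predicted gaze trajectories.

Conclusion: The findings contribute to understanding attention mechanisms in virtual environments, show that multimodal fusion is effective for gaze prediction, and provide a technical basis for rendering optimization, interaction design, and user experience evaluation, a step toward VR systems that do not need expensive eye tracking hardware.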


📄 Abstract

Predicting gaze behavior in virtual reality environments remains a significant challenge with implications for rendering optimization and interface design. This paper introduces a multimodal approach to VR gaze prediction that combines temporal gaze patterns, head movement data, and visual scene information. By leveraging a gated fusion mechanism with cross-modal attention, the approach learns to adaptively weight gaze history, head movement, and scene content based on contextual relevance. Evaluations using a dataset spanning 22 VR scenes with 5.3M gaze samples demonstrate improvements in predictive accuracy when combining modalities compared to using individual data streams alone. The results indicate that integrating past gaze trajectories with head orientation and scene content enhances prediction accuracy across 1-3 future frames. Cross-scene generalization testing shows consistent performance with 93.1% validation accuracy and temporal consistency in predicted gaze trajectories. These findings contribute to understanding attention mechanisms in virtual environments while suggesting potential applications in rendering optimization, interaction design, and user experience evaluation. The approach represents a step toward more efficient virtual reality systems that can anticipate user attention patterns without requiring expensive eye tracking hardware.
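Gated fusion, in its simplest form, turns one relevance score per modality into softmax weights and mixes the modality embeddings with those weights. The sketch below shows only that generic step: in a trained model the gate logits would be produced by a learned network from context, whereas here they are passed in directly, and the two-dimensional embeddings are made-up values.

```python
import math
from typing import List, Sequence, Tuple


def gated_fusion(embeddings: Sequence[Sequence[float]],
                 gate_logits: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Fuse same-sized modality embeddings with softmax gate weights."""
    peak = max(gate_logits)
    exps = [math.exp(g - peak) for g in gate_logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(embeddings[0])
    fused = [sum(w * emb[i] for w, emb in zip(weights, embeddings))
             for i in range(dim)]
    return fused, weights


# Hypothetical embeddings for gaze history, head movement, and scene content.
gaze, head, scene = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]

# Equal gate logits: every modality contributes one third.
fused_equal, weights_equal = gated_fusion([gaze, head, scene], [0.0, 0.0, 0.0])

# A dominant gaze logit: the fused vector is almost the gaze embedding.
fused_gaze, weights_gaze = gated_fusion([gaze, head, scene], [50.0, 0.0, 0.0])
```

Letting the gate logits depend on the current context is what allows such a model to lean on head movement in one moment and on scene content in another.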

[57] CREward: A Type-Specific Creativity Reward Model

Jiyeon Han, Ali Mahdavi-Amiri, Hao Zhang, Haedong Jeong

🧩 TL;DR

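This paper proposes CREward, the first type-specific creativity reward model, which evaluates and generates creative images along three creativity axes: geometry, material, and texture. The model is trained by exploiting the alignment between large vision-language models and human perception, and three applications are explored: creativity assessment, explainable creativity, and creative sample acquisition.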


📘 Detailed Summary

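Motivation: Treating creativity as a single undifferentiated quantity is limiting and cannot adequately represent or assess its complexity. The paper addresses this gap by viewing creativity at a finer granularity through the lens of the image formation pipeline, building an evaluation framework that distinguishes different types of creativity.

Method: A human benchmark evaluation first captures human perception of each creativity type across creative images. The correlation between LVLM predictions and human judgments is then analyzed; after confirming strong alignment, LVLM-generated labels are collected to train CREward. The model applies to both evaluation and generation of creative images, and generation guidance via low-rank adaptation is explored.

Result: Experiments confirm that LVLMs align strongly with human perception of creativity, and the CREward model trained on this basis performs type-specific creativity assessment effectively. It performs well on all three creativity axes, providing a reliable tool for assessing creative images and for controllable generation.

Conclusion: The study demonstrates the feasibility of type-specific creativity assessment, and CREward offers a new methodological framework for computational creativity. The work makes creativity assessment more fine-grained and lays a foundation for developing creative generation systems.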


📄 Abstract

Creativity is a complex phenomenon. When it comes to representing and assessing creativity, treating it as a single undifferentiated quantity would appear naive and underwhelming. In this work, we learn the first type-specific creativity reward model, coined CREward, which spans three creativity "axes," geometry, material, and texture, to allow us to view creativity through the lens of the image formation pipeline. To build our reward model, we first conduct a human benchmark evaluation to capture human perception of creativity for each type across various creative images. We then analyze the correlation between human judgments and predictions by large vision-language models (LVLMs), confirming that LVLMs exhibit strong alignment with human perception. Building on this observation, we collect LVLM-generated labels to train our CREward model that is applicable to both evaluation and generation of creative images. We explore three applications of CREward: creativity assessment, explainable creativity, and creative sample acquisition for both human design inspiration and guiding creative generation through low-rank adaptation.

[58] ACIT: Attention-Guided Cross-Modal Interaction Transformer for Pedestrian Crossing Intention Prediction

Yuanzhe Li, Steffen Müller

🧩 TL;DR

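This paper proposes an attention-guided cross-modal interaction Transformer (ACIT) for pedestrian crossing intention prediction. It extracts complementary features through three interaction pairs formed from six visual and motion modalities, and achieves state-of-the-art performance on the JAAD dataset.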


📘 Detailed Summary

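Motivation: A major challenge in pedestrian crossing intention prediction is effectively extracting and integrating complementary cues from different types of data, especially through deep interaction between visual and motion modalities; this limits the accuracy and robustness of prediction models.

Method: ACIT uses six visual and motion modalities grouped into three interaction pairs: global semantic map with global optical flow, local RGB image with local optical flow, and ego-vehicle speed with pedestrian bounding box. Each visual pair uses a dual-path attention mechanism that enhances salient regions of the primary modality through intra-modal self-attention and interacts deeply with the auxiliary modality via optical flow-guided attention. The motion pair uses cross-modal attention to model cross-modal dynamics. A multi-modal feature fusion module and a Transformer-based temporal feature aggregation module complete the model.

Result: ACIT reaches accuracies of 70% and 89% on the JAADbeh and JAADall datasets, respectively, outperforming existing state-of-the-art methods. Extensive ablation studies examine the contribution of each module.

Conclusion: The study shows that carefully designed cross-modal interaction can effectively integrate visual and motion information and improve pedestrian intention prediction. The attention-guided interaction strategy offers a new approach to multimodal fusion and could be extended to more complex traffic scene understanding tasks.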


📄 Abstract

Predicting pedestrian crossing intention is crucial for autonomous vehicles to prevent pedestrian-related collisions. However, effectively extracting and integrating complementary cues from different types of data remains one of the major challenges. This paper proposes an attention-guided cross-modal interaction Transformer (ACIT) for pedestrian crossing intention prediction. ACIT leverages six visual and motion modalities, which are grouped into three interaction pairs: (1) Global semantic map and global optical flow, (2) Local RGB image and local optical flow, and (3) Ego-vehicle speed and pedestrian's bounding box. Within each visual interaction pair, a dual-path attention mechanism enhances salient regions within the primary modality through intra-modal self-attention and facilitates deep interactions with the auxiliary modality (i.e., optical flow) via optical flow-guided attention. Within the motion interaction pair, cross-modal attention is employed to model the cross-modal dynamics, enabling the effective extraction of complementary motion features. Beyond pairwise interactions, a multi-modal feature fusion module further facilitates cross-modal interactions at each time step. Furthermore, a Transformer-based temporal feature aggregation module is introduced to capture sequential dependencies. Experimental results demonstrate that ACIT outperforms state-of-the-art methods, achieving accuracy rates of 70% and 89% on the JAADbeh and JAADall datasets, respectively. Extensive ablation studies are further conducted to investigate the contribution of different modules of ACIT.
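Cross-modal attention, as used for the motion pair, is standard scaled dot-product attention in which queries come from one modality and keys and values from another. The sketch below shows that generic operation in pure Python; the learned projection layers are omitted, and the two-dimensional speed and bounding-box features are illustrative assumptions rather than ACIT's actual inputs.

```python
import math
from typing import List, Sequence


def cross_modal_attention(queries: Sequence[Sequence[float]],
                          keys: Sequence[Sequence[float]],
                          values: Sequence[Sequence[float]]) -> List[List[float]]:
    """Scaled dot-product attention: queries from modality A, keys/values from B."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(len(values[0]))])
    return outputs


# Hypothetical per-time-step features: queries from ego-vehicle speed,
# keys and values from the pedestrian bounding box.
speed_queries = [[10.0, 0.0]]
bbox_keys = [[1.0, 0.0], [0.0, 1.0]]
bbox_values = [[1.0, 2.0], [3.0, 4.0]]

attended = cross_modal_attention(speed_queries, bbox_keys, bbox_values)
single = cross_modal_attention(speed_queries, bbox_keys[:1], bbox_values[:1])
```

The query here aligns with the first key, so the output is pulled toward the first value vector; swapping the roles of the two modalities gives the attention in the opposite direction.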

[59] Tell Model Where to Look: Mitigating Hallucinations in MLLMs by Vision-Guided Attention

Jianfei Zhao, Feng Zhang, Xin Sun, Chong Feng, Zhixing Tan

🧩 TL;DR

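This paper proposes Vision-Guided Attention (VGA), a training-free method that builds precise visual grounding from the semantic content of visual tokens and uses it to guide the model's focus toward relevant visual regions, significantly reducing hallucinations in multimodal large language models.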


📘 Detailed Summary

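Motivation: The visual attention mechanism in multimodal large language models (MLLMs) has limited localization capability, which often leads to hallucinations. Although the models can accurately extract semantics from visual tokens, they fail to fully exploit this during subsequent inference.

Method: VGA first constructs precise visual grounding from the semantic content of visual tokens, then uses the grounding to guide the model's focus toward relevant visual regions. In image captioning, VGA further refines the guidance dynamically by suppressing regions that have already been described. Each token needs only a single forward pass, the latency overhead is just 4.36%, and the method is fully compatible with efficient attention implementations such as FlashAttention.

Result: Extensive experiments across diverse MLLMs and multiple hallucination benchmarks show that VGA achieves state-of-the-art dehallucination performance. Further analysis confirms that explicit visual guidance plays a crucial role in enhancing the visual understanding of MLLMs.

Conclusion: Explicit visual guidance is essential for improving the visual understanding of MLLMs, and VGA, as an efficient training-free method, offers an effective way to reduce hallucinations.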


📄 Abstract

Visual attention serves as the primary mechanism through which MLLMs interpret visual information; however, its limited localization capability often leads to hallucinations. We observe that although MLLMs can accurately extract visual semantics from visual tokens, they fail to fully leverage this advantage during subsequent inference. To address this limitation, we propose Vision-Guided Attention (VGA), a training-free method that first constructs precise visual grounding by exploiting the semantic content of visual tokens, and then uses this grounding to guide the model's focus toward relevant visual regions. In image captioning, VGA further refines this guidance dynamically during generation by suppressing regions that have already been described. In VGA, each token undergoes only a single forward pass, introducing a negligible latency overhead of just 4.36%. In addition, VGA is fully compatible with efficient attention implementations such as FlashAttention. Extensive experiments across diverse MLLMs and multiple hallucination benchmarks demonstrate that VGA achieves state-of-the-art dehallucination performance. Further analysis confirms that explicit visual guidance plays a crucial role in enhancing the visual understanding capabilities of MLLMs.

[60] History-Augmented Contrastive Meta-Learning for Unsupervised Blind Super-Resolution of Planetary Remote Sensing Images

Huijia Zhao, Jie Lu, Yunqing Jiang, Xiao-Ping Lu, Kaichang Di

🧩 TL;DR

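This paper proposes HACBSR, an unsupervised blind super-resolution framework that requires neither ground-truth images nor external kernel priors, and that handles the diverse, unknown degradations in planetary remote sensing images through contrastive kernel sampling and history-augmented contrastive learning.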


📘 Detailed Summary

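Motivation: Planetary remote sensing images suffer from diverse and unknown degradations caused by imaging environments and hardware constraints. These factors limit image quality, and the lack of ground-truth images hinders supervised blind super-resolution.

Method: HACBSR has two core components: a contrastive kernel sampling mechanism that uses kernel similarity control to mitigate the distribution bias of Gaussian sampling, and history-augmented contrastive learning that uses historical models to generate negative samples, enabling less greedy optimization and inducing strong convexity.

Result: Experiments show that HACBSR is competitive with state-of-the-art unsupervised methods across multiple upscaling factors. The work also introduces Ceres-50, a dataset with diverse geological features and simulated degradation patterns, to support evaluation in planetary applications.

Conclusion: The study demonstrates that unsupervised blind super-resolution is feasible without ground-truth images or external kernel priors, offering an effective solution for planetary remote sensing image processing. A convergence analysis of the history-augmented contrastive learning provides theoretical support.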


📄 Abstract

Planetary remote sensing images are affected by diverse and unknown degradations caused by imaging environments and hardware constraints. These factors limit image quality and hinder supervised blind super-resolution due to the lack of ground-truth images. This work presents History-Augmented Contrastive Blind Super-Resolution (HACBSR), an unsupervised framework for blind super-resolution that operates without ground-truth images and external kernel priors. HACBSR comprises two components: (1) a contrastive kernel sampling mechanism with kernel similarity control to mitigate distribution bias from Gaussian sampling, and (2) a history-augmented contrastive learning scheme that uses historical models to generate negative samples to enable less greedy optimization and to induce strong convexity without ground truth. A convergence analysis of the history-augmented contrastive learning is given in the Appendix. To support evaluation in planetary applications, we introduce Ceres-50, a dataset with diverse geological features and simulated degradation patterns. Experiments show that HACBSR achieves competitive performance compared with state-of-the-art unsupervised methods across multiple upscaling factors. The code is available at https://github.com/2333repeat/HACBSR, and the dataset is available at https://github.com/2333repeat/Ceres-50.

[61] PRADA: Probability-Ratio-Based Attribution and Detection of Autoregressive-Generated Images

Simon Damm, Jonas Ricker, Henning Petzka, Asja Fischer

🧩 TL;DR

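This paper proposes PRADA, an interpretable probability-ratio-based framework for detecting and attributing autoregressively generated images. By analyzing the ratio of a model's conditional and unconditional probabilities, it reliably identifies AR-generated images and traces them to the specific source model.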


📘 Detailed Summary

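Motivation: Autoregressive image generation models borrow the generation principle of large language models and can efficiently produce realistic images, which increases the need for reliable detection methods. However, no existing work specifically targets the detection of images produced by autoregressive image generators, leaving a clear research gap.

Method: The core idea of PRADA is to inspect the ratio of a model's conditional and unconditional probabilities for the autoregressive token sequence representing an image. When an image was generated by a particular model, its probability ratio shows distinctive characteristics that are absent for images generated by other models and for real images. A simple, model-specific score function is calibrated to exploit these characteristics for threshold-based attribution and detection.

Result: The experimental evaluation shows that PRADA is highly effective against eight class-to-image models and four text-to-image models. It reliably detects autoregressively generated images and attributes them to their source models.

Conclusion: PRADA offers a simple and interpretable solution that fills a gap in the detection of autoregressively generated images. It demonstrates the usefulness of probability-ratio features for image provenance and detection, and provides technical support for the safe use of generative models.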


📄 Abstract

Autoregressive (AR) image generation has recently emerged as a powerful paradigm for image synthesis. Leveraging the generation principle of large language models, AR models can efficiently generate deceptively real-looking images, further increasing the need for reliable detection methods. However, to date there is a lack of work specifically targeting the detection of images generated by AR image generators. In this work, we present PRADA (Probability-Ratio-Based Attribution and Detection of Autoregressive-Generated Images), a simple and interpretable approach that can reliably detect AR-generated images and attribute them to their respective source model. The key idea is to inspect the ratio of a model's conditional and unconditional probability for the autoregressive token sequence representing a given image. Whenever an image is generated by a particular model, its probability ratio shows unique characteristics which are not present for images generated by other models or real images. We exploit these characteristics for threshold-based attribution and detection by calibrating a simple, model-specific score function. Our experimental evaluation shows that PRADA is highly effective against eight class-to-image and four text-to-image models.
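
The probability-ratio idea can be illustrated with a minimal sketch. The abstract does not give the actual score function, so the mean per-token log ratio, the direction of the threshold test, and the names `mean_log_ratio` and `is_attributed` are all assumptions made for illustration; the log-probabilities are made-up numbers standing in for what a real AR model would return.

```python
def mean_log_ratio(cond_logprobs, uncond_logprobs):
    """Average per-token log ratio: log p(token | context, condition) - log p(token | context)."""
    if not cond_logprobs or len(cond_logprobs) != len(uncond_logprobs):
        raise ValueError("expected two non-empty sequences of equal length")
    diffs = [c - u for c, u in zip(cond_logprobs, uncond_logprobs)]
    return sum(diffs) / len(diffs)


def is_attributed(score, threshold):
    """Hypothetical threshold rule: flag the image as generated by this model."""
    return score > threshold


# Toy per-token log-probabilities for a three-token image sequence (made-up values).
cond = [-1.0, -0.5, -0.8]
uncond = [-2.0, -1.5, -1.8]
score = mean_log_ratio(cond, uncond)
flagged = is_attributed(score, threshold=0.5)
```

In practice the threshold would be calibrated per model on held-out images, as the abstract indicates; the value 0.5 here is arbitrary.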

[62] Learning Procedural-aware Video Representations through State-Grounded Hierarchy Unfolding

Jinghan Zhao, Yifei Huang, Feng Lu

🧩 TL;DR

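This paper proposes a Task-Step-State (TSS) framework that strengthens procedure-aware video representation learning by introducing states as a visually grounded semantic layer, and it outperforms baseline models on the COIN and CrossTask datasets.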


📘 Detailed Summary

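Motivation: Existing methods inject procedural semantics by aligning visual content with textual descriptions at the task and step levels. Because task and step descriptions are highly abstract, they struggle to align robustly with the concrete, observable details in visual data, which limits the effectiveness of procedure-aware video representation learning.

Method: The paper proposes the Task-Step-State (TSS) framework, which defines states as textual snapshots of object configurations and uses them as a visually grounded semantic layer. A progressive pre-training strategy unfolds the TSS hierarchy, forcing the model to ground its representations in states while associating them with steps and high-level tasks.

Result: Extensive experiments on COIN and CrossTask show that the method outperforms baseline models on several downstream tasks, including task recognition, step recognition, and next-step prediction. Ablation studies confirm that state supervision is a key driver of the gains, and that the progressive pre-training strategy is more effective than standard joint training.

Conclusion: Introducing states as a visually grounded semantic layer bridges the gap between abstract procedures and concrete visual content, and the progressive pre-training strategy better enforces the intended hierarchy. Together they provide a foundation for building agents that can reason about and execute complex tasks.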


📄 Abstract

Learning procedural-aware video representations is a key step towards building agents that can reason about and execute complex tasks. Existing methods typically address this problem by aligning visual content with textual descriptions at the task and step levels to inject procedural semantics into video representations. However, due to their high level of abstraction, 'task' and 'step' descriptions fail to form a robust alignment with the concrete, observable details in visual data. To address this, we introduce 'states', i.e., textual snapshots of object configurations, as a visually-grounded semantic layer that anchors abstract procedures to what a model can actually see. We formalize this insight in a novel Task-Step-State (TSS) framework, where tasks are achieved via steps that drive transitions between observable states. To enforce this structure, we propose a progressive pre-training strategy that unfolds the TSS hierarchy, forcing the model to ground representations in states while associating them with steps and high-level tasks. Extensive experiments on the COIN and CrossTask datasets show that our method outperforms baseline models on multiple downstream tasks, including task recognition, step recognition, and next step prediction. Ablation studies show that introducing state supervision is a key driver of performance gains across all tasks. Additionally, our progressive pretraining strategy proves more effective than standard joint training, as it better enforces the intended hierarchical structure.

[63] Vision-Language Models for Automated 3D PET/CT Report Generation

Wenpei Jiao, Kun Shang, Hui Li, Ke Yan, Jiajin Zhang, Guangjie Yang, Lijuan Guo, Yan Wan, Xing Yang, Dakai Jin, Zhaoheng Xie

🧩 TL;DR

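This paper proposes PETRG-3D, an end-to-end 3D dual-branch framework for automated PET/CT report generation. Together with a multi-center lymphoma dataset and a clinical evaluation protocol, it substantially improves both the natural language quality and the clinical utility of generated reports.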


📘 Detailed Summary

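Motivation: PET/CT is essential in oncology, but the rapid expansion of scanners has outpaced the availability of trained specialists, making automated PET/CT report generation increasingly important for reducing clinical workload. Compared with structural imaging, functional PET poses distinct challenges: metabolic patterns vary with tracer physiology, and whole-body 3D context is required rather than local-region interpretation.

Method: PETRG-3D is an end-to-end 3D dual-branch framework that encodes PET and CT volumes separately and incorporates style-adaptive prompts to mitigate inter-hospital variability in reporting practices. The authors construct PETRG-Lym, a multi-center lymphoma dataset (824 reports from four hospitals, with 245,509 paired PET/CT slices), and AutoPET-RG-Lym, a publicly accessible benchmark.

Result: Experiments show that PETRG-3D substantially outperforms existing methods on natural language metrics (e.g., +31.49% ROUGE-L) and clinical efficacy metrics (e.g., +8.18% PET-All), highlighting the benefits of volumetric dual-modality modeling and style-aware prompting.

Conclusion: By building standardized datasets and an evaluation protocol, this work lays a foundation for future PET/CT-specific models that emphasize disease-aware reasoning and clinically reliable evaluation, and advances automated medical report generation.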


📄 Abstract

Positron emission tomography/computed tomography (PET/CT) is essential in oncology, yet the rapid expansion of scanners has outpaced the availability of trained specialists, making automated PET/CT report generation (PETRG) increasingly important for reducing clinical workload. Compared with structural imaging (e.g., X-ray, CT, and MRI), functional PET poses distinct challenges: metabolic patterns vary with tracer physiology, and whole-body 3D contextual information is required rather than local-region interpretation. To advance PETRG, we propose PETRG-3D, an end-to-end 3D dual-branch framework that separately encodes PET and CT volumes and incorporates style-adaptive prompts to mitigate inter-hospital variability in reporting practices. We construct PETRG-Lym, a multi-center lymphoma dataset collected from four hospitals (824 reports with 245,509 paired PET/CT slices), and AutoPET-RG-Lym, a publicly accessible PETRG benchmark derived from open imaging data but equipped with new expert-written, clinically validated reports (135 cases). To assess clinical utility, we introduce PETRG-Score, a lymphoma-specific evaluation protocol that jointly measures metabolic and structural findings across curated anatomical regions. Experiments show that PETRG-3D substantially outperforms existing methods on both natural language metrics (e.g., +31.49% ROUGE-L) and clinical efficacy metrics (e.g., +8.18% PET-All), highlighting the benefits of volumetric dual-modality modeling and style-aware prompting. Overall, this work establishes a foundation for future PET/CT-specific models emphasizing disease-aware reasoning and clinically reliable evaluation. Codes, models, and AutoPET-RG-Lym will be released.

[64] Map-World: Masked Action planning and Path-Integral World Model for Autonomous Driving

Bin Hu, Zijian Lu, Haicheng Liao, Chengran Yuan, Bin Rao, Yongkang Li, Guofa Li, Zhiyong Cui, Cheng-zhong Xu, Zhenning Li

🧩 TL;DR

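This paper proposes MAP-World, a prior-free multi-modal planning framework that couples masked action planning with a path-weighted world model to address the information loss and optimization complexity of multi-modal trajectory planning in autonomous driving. On the NAVSIM benchmark it achieves the best performance among world-model-based methods, while avoiding reinforcement learning and maintaining real-time inference.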


📘 Detailed Summary

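Motivation: Current motion planning systems for autonomous driving can predict rich multi-modal trajectories, but they typically rely on handcrafted anchors or reinforcement learning to select a single best mode for training and control. This selection discards information about alternative futures and complicates optimization.

Method: The Masked Action Planning module treats future ego motion as masked sequence completion: past waypoints are encoded as visible tokens, future waypoints are represented as mask tokens, and a driving-intent path provides a coarse scaffold. Diverse trajectory queries are generated by injecting noise into a compact latent planning state, and a lightweight world model rolls out the future in BEV semantic space for each candidate trajectory.

Result: On NAVSIM, the method matches anchor-based approaches and achieves state-of-the-art performance among world-model-based methods, while avoiding reinforcement learning and maintaining real-time inference latency.

Conclusion: Through its path-weighted training strategy, MAP-World lets the planner learn from the full distribution of plausible futures rather than a single selected path, yielding a more efficient planner that retains more information and offering a new technical route for multi-modal planning.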


📄 Abstract

Motion planning for autonomous driving must handle multiple plausible futures while remaining computationally efficient. Recent end-to-end systems and world-model-based planners predict rich multi-modal trajectories, but typically rely on handcrafted anchors or reinforcement learning to select a single best mode for training and control. This selection discards information about alternative futures and complicates optimization. We propose MAP-World, a prior-free multi-modal planning framework that couples masked action planning with a path-weighted world model. The Masked Action Planning (MAP) module treats future ego motion as masked sequence completion: past waypoints are encoded as visible tokens, future waypoints are represented as mask tokens, and a driving-intent path provides a coarse scaffold. A compact latent planning state is expanded into multiple trajectory queries with injected noise, yielding diverse, temporally consistent modes without anchor libraries or teacher policies. A lightweight world model then rolls out future BEV semantics conditioned on each candidate trajectory. During training, semantic losses are computed as an expectation over modes, using trajectory probabilities as discrete path weights, so the planner learns from the full distribution of plausible futures instead of a single selected path. On NAVSIM, our method matches anchor-based approaches and achieves state-of-the-art performance among world-model-based methods, while avoiding reinforcement learning and maintaining real-time inference latency.
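
The difference between training on an expectation over modes and training on a single selected mode can be shown with a small sketch. Everything here is an illustrative assumption rather than the authors' implementation: the trajectory probabilities are taken to be a softmax over hypothetical per-mode logits, and the per-mode semantic losses are made-up scalars.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def path_weighted_loss(mode_logits, mode_losses):
    """Expected loss over all modes, using trajectory probabilities as weights."""
    probs = softmax(mode_logits)
    return sum(p * loss for p, loss in zip(probs, mode_losses))


def best_mode_loss(mode_logits, mode_losses):
    """Loss of the single highest-scoring mode (the selection-based alternative)."""
    best = max(range(len(mode_logits)), key=lambda k: mode_logits[k])
    return mode_losses[best]


# Three candidate trajectories with equal scores and different semantic losses.
logits = [0.0, 0.0, 0.0]
losses = [0.3, 0.6, 0.9]
expected = path_weighted_loss(logits, losses)
selected = best_mode_loss(logits, losses)
```

With equal scores, the path-weighted loss averages over all three modes, so every mode contributes to the training signal, while the selection-based loss only sees the first mode.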

[65] Harmonious Parameter Adaptation in Continual Visual Instruction Tuning for Safety-Aligned MLLMs

Ziqi Wang, Chang Che, Qi Wang, Hui Ma, Zenglin Shi, Cees G. M. Snoek, Meng Wang

🧩 TL;DR

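This paper proposes Harmonious Parameter Adaptation (HPA), a framework that addresses the balance between safety and task performance when safety-aligned multimodal large language models undergo continual visual instruction tuning. Through parameter partition, balanced selection, and orthogonality constraints, it mitigates catastrophic forgetting while preserving model safety.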


📘 Detailed Summary

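Motivation: Existing work on continual visual instruction tuning focuses mainly on models without safety alignment, overlooking the fact that real-world MLLMs must be equipped with safety mechanisms to mitigate potential risks. During continual adaptation, the model not only forgets earlier tasks but also degrades in safety, so achieving a harmonious balance between safety and task performance is a key challenge.

Method: HPA is a post-training framework composed of focusing-based parameter partition, harmoniously balanced parameter selection, and orthogonal parameter adjustment. It partitions parameters into two types according to whether they focus on safety or on task performance, selects the focused parameters to preserve from a balanced perspective, and imposes orthogonality constraints on parameter updates to alleviate catastrophic forgetting.

Result: Extensive experiments on the continual visual instruction tuning benchmark and safety evaluation datasets show that HPA maintains high safety and mitigates forgetting better than existing baselines, achieving a better balance between safety and task performance.

Conclusion: The study highlights the particular challenges that safety-aligned MLLMs face in continual learning. HPA offers an effective way to balance safety and task performance, with practical significance for the continual adaptation of real-world MLLMs.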


📄 Abstract

While continual visual instruction tuning (CVIT) has shown promise in adapting multimodal large language models (MLLMs), existing studies predominantly focus on models without safety alignment. This critical oversight ignores the fact that real-world MLLMs inherently require such mechanisms to mitigate potential risks. In this work, we shift our focus to CVIT for safety-aligned MLLMs and observe that during continual adaptation, the model not only suffers from task forgetting but also exhibits degradation in its safety. Achieving a harmonious balance between safety and task performance remains a crucial challenge. To address this, we propose Harmonious Parameter Adaptation (HPA), a post-training framework composed of focusing-based parameter partition, harmoniously balanced parameter selection, and orthogonal parameter adjustment. Specifically, HPA partitions parameters into two types based on their focus on safety or task performance, and selects the focused ones to preserve from a balanced perspective. In addition, HPA imposes orthogonality constraints on parameter updates to further alleviate catastrophic forgetting. Extensive experiments on the CVIT benchmark and safety evaluation datasets demonstrate that HPA better maintains high safety and mitigates forgetting than existing baselines.

[66] ADNet: A Large-Scale and Extensible Multi-Domain Benchmark for Anomaly Detection Across 380 Real-World Categories

Hai Ling, Jia Guo, Zhulin Tao, Yunkang Cao, Donglin Di, Hongyan Xu, Xiu Su, Yang Song, Lei Fan

🧩 TL;DR

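This paper presents ADNet, a large-scale multi-domain anomaly detection benchmark covering 380 categories, and develops Dinomaly-m to address the scalability challenges that existing methods face in large-scale multi-class settings.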


📘 Detailed Summary

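Motivation: Existing anomaly detection benchmarks (e.g., MVTec-AD, with only 15 categories) have limited coverage and cannot adequately evaluate cross-context generalization and scalability, which limits progress on anomaly detection for diverse real-world applications.

Method: ADNet aggregates 49 publicly available datasets into a multi-domain collection of 380 categories containing 196,294 standardized RGB images. The paper also develops Dinomaly-m, a context-guided Mixture-of-Experts extension that expands decoder capacity without increasing inference cost.

Result: Experiments show that existing state-of-the-art methods reach 90.6% I-AUROC in the one-for-one setting but drop to 78.5% in the multi-class setting with all 380 categories. Dinomaly-m achieves 83.2% I-AUROC and 93.1% P-AUROC, outperforming existing approaches.

Conclusion: ADNet exposes the scalability challenge of anomaly detection in large-scale multi-class settings, provides a standardized and extensible foundation for future anomaly detection foundation models, and supports the community in expanding anomaly detection datasets across domains.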


📄 Abstract

Anomaly detection (AD) aims to identify defects using normal-only training data. Existing anomaly detection benchmarks (e.g., MVTec-AD with 15 categories) cover only a narrow range of categories, limiting the evaluation of cross-context generalization and scalability. We introduce ADNet, a large-scale, multi-domain benchmark comprising 380 categories aggregated from 49 publicly available datasets across Electronics, Industry, Agrifood, Infrastructure, and Medical domains. The benchmark includes a total of 196,294 RGB images, consisting of 116,192 normal samples for training and 80,102 test images, of which 60,311 are anomalous. All images are standardized with MVTec-style pixel-level annotations and structured text descriptions spanning both spatial and visual attributes, enabling multimodal anomaly detection tasks. Extensive experiments reveal a clear scalability challenge: existing state-of-the-art methods achieve 90.6% I-AUROC in one-for-one settings but drop to 78.5% when scaling to all 380 categories in a multi-class setting. To address this, we propose Dinomaly-m, a context-guided Mixture-of-Experts extension of Dinomaly that expands decoder capacity without increasing inference cost. It achieves 83.2% I-AUROC and 93.1% P-AUROC, demonstrating superior performance over existing approaches. ADNet is designed as a standardized and extensible benchmark, supporting the community in expanding anomaly detection datasets across diverse domains and providing a scalable foundation for future anomaly detection foundation models. Dataset: https://grainnet.github.io/ADNet

[67] SFA: Scan, Focus, and Amplify toward Guidance-aware Answering for Video TextVQA

Haibin He, Qihuang Zhong, Juhua Liu, Bo Du, Peng Wang, Jing Zhang

🧩 TL;DR

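This paper proposes SFA, a training-free framework for Video TextVQA that mimics the human process of answering questions: it adaptively scans video frames, selectively focuses on key regions, and directly amplifies them, guiding the Video-LLM toward essential cues and achieving state-of-the-art performance on several public datasets.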


📘 Detailed Summary

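Motivation: Video TextVQA poses several challenges: accurately perceiving and understanding scene text that varies in scale, orientation, and clarity; effectively integrating spatiotemporal and semantic context to generate precise answers; and identifying question-relevant textual cues while filtering out redundant information, so that answers are based on the most relevant and informative cues.

Method: SFA is a training-free framework and the first Video-LLM-based method for Video TextVQA. By adaptively scanning video frames, selectively focusing on key regions, and directly amplifying those regions, it guides the Video-LLM's attention toward essential cues.

Result: SFA sets new state-of-the-art results on several public Video TextVQA datasets and surpasses previous methods by a substantial margin, demonstrating its effectiveness and generalizability.

Conclusion: The study shows that mimicking the human cognitive process is effective for video text understanding, suggests a new direction for applying Video-LLMs to complex visual text understanding tasks, and demonstrates that model performance can be improved substantially without additional training.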


📄 Abstract

The video text-based visual question answering (Video TextVQA) task aims to answer questions about videos by leveraging the visual text appearing within the videos. This task poses significant challenges, requiring models to accurately perceive and comprehend scene text that varies in scale, orientation, and clarity across frames, while effectively integrating temporal and semantic context to generate precise answers. Moreover, the model must identify question-relevant textual cues and filter out redundant or irrelevant information to ensure answering is guided by the most relevant and informative cues. To address these challenges, we propose SFA, a training-free framework and the first Video-LLM-based method tailored for Video TextVQA, motivated by the human process of answering questions. By adaptively scanning video frames, selectively focusing on key regions, and directly amplifying them, SFA effectively guides the Video-LLM's attention toward essential cues, enabling it to generate more accurate answers. SFA achieves new state-of-the-art results across several public Video TextVQA datasets and surpasses previous methods by a substantial margin, demonstrating its effectiveness and generalizability.

[68] GHR-VQA: Graph-guided Hierarchical Relational Reasoning for Video Question Answering

Dionysia Danai Brilli, Dimitrios Mallis, Vassilis Pitsikalis, Petros Maragos

🧩 TL;DR

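This paper proposes GHR-VQA, a framework for video question answering based on graph-guided hierarchical relational reasoning. It uses scene graphs to capture intricate human-object interactions in video sequences and achieves a 7.3% improvement in object-relation reasoning on the AGQA dataset.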


📘 Detailed Summary

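Motivation: Traditional pixel-based video question answering methods struggle to capture the intricate human-object interactions in video and lack a deep understanding of spatiotemporal dynamics. A framework that explicitly models human-centric interactions is needed to improve reasoning.

Method: The framework performs hierarchical relational reasoning over scene graphs. Each frame is represented as a scene graph, and human nodes across frames are linked to a global root to form a video-level graph. Graph neural networks process this graph and produce context-aware embeddings, which are then integrated with question features in a hierarchical network that reasons at multiple levels of abstraction.

Result: The approach is validated on the Action Genome Question Answering (AGQA) dataset, where it achieves a 7.3% improvement in object-relation reasoning over the state of the art.

Conclusion: The explicit human-rooted graph structure enhances interpretability by decomposing actions into human-object interactions, enables a deeper understanding of spatiotemporal dynamics, and provides an effective structured representation for human-centric video understanding.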


📄 Abstract

We propose GHR-VQA, Graph-guided Hierarchical Relational Reasoning for Video Question Answering (Video QA), a novel human-centric framework that incorporates scene graphs to capture intricate human-object interactions within video sequences. Unlike traditional pixel-based methods, each frame is represented as a scene graph and human nodes across frames are linked to a global root, forming the video-level graph and enabling cross-frame reasoning centered on human actors. The video-level graphs are then processed by Graph Neural Networks (GNNs), transforming them into rich, context-aware embeddings for efficient processing. Finally, these embeddings are integrated with question features in a hierarchical network operating across different abstraction levels, enhancing both local and global understanding of video content. This explicit human-rooted structure enhances interpretability by decomposing actions into human-object interactions and enables a more profound understanding of spatiotemporal dynamics. We validate our approach on the Action Genome Question Answering (AGQA) dataset, achieving significant performance improvements, including a 7.3% improvement in object-relation reasoning over the state of the art.
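
The human-rooted video-level graph can be sketched as a plain edge list. This is only an illustration of the structure described above, assuming one human per frame; the node naming scheme (`human@t`, `object@t`), the `contains` relation on root edges, and the function name `build_video_graph` are hypothetical, and no GNN is involved.

```python
def build_video_graph(frames):
    """Build a video-level edge list from per-frame human-object interactions.

    frames: one list per frame, each holding (relation, object) pairs
            describing what the human in that frame interacts with.
    Returns (source, relation, target) triples. Every frame's human node
    is linked to a single global root so that reasoning can cross frames.
    """
    edges = []
    for t, interactions in enumerate(frames):
        human = f"human@{t}"
        edges.append(("root", "contains", human))  # cross-frame link via the root
        for relation, obj in interactions:
            edges.append((human, relation, f"{obj}@{t}"))  # per-frame scene graph edge
    return edges


# Two toy frames with made-up interactions.
frames = [
    [("holding", "cup")],
    [("sitting_on", "chair"), ("looking_at", "laptop")],
]
edges = build_video_graph(frames)
```

Each frame contributes one root edge plus one edge per interaction, so the two toy frames yield five edges in total.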

[69] Text-guided Controllable Diffusion for Realistic Camouflage Images Generation

Yuhang Qian, Haiyan Chen, Wentong Li, Ningzhong Liu, Jie Qin

🧩 TL;DR

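This paper proposes CT-CIG, a controllable text-guided camouflage image generation method. It uses large vision-language models to annotate high-quality text prompts and combines a lightweight controller with a frequency interaction refinement module to generate logically consistent, realistic camouflage images.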


📘 Detailed Summary

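Motivation: Existing camouflage image generation methods either fuse objects into specific backgrounds or outpaint the surroundings via foreground object-guided diffusion. They often overlook the logical relationship between camouflaged objects and background environments, which leads to unnatural results.

Method: CT-CIG uses large vision-language models in a Camouflage-Revealing Dialogue Mechanism to annotate existing camouflage datasets with high-quality text prompts. The resulting image-prompt pairs are used to fine-tune Stable Diffusion, with a lightweight controller guiding the location and shape of camouflaged objects. A Frequency Interaction Refinement Module captures high-frequency texture features to help learn complex camouflage patterns.

Result: Extensive experiments, including CLIPScore evaluation and camouflage effectiveness assessment, show that the generated text prompts are semantically aligned and that CT-CIG produces realistic camouflage images with strong semantic consistency and visual quality.

Conclusion: The study shows that combining vision-language models with controllable generation is effective for camouflage image generation, offers a new approach to producing logically plausible and visually consistent camouflage scenes, and has potential applications in areas such as military camouflage and artistic creation.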


📄 Abstract

Camouflage Images Generation (CIG) is an emerging research area that focuses on synthesizing images in which objects are harmoniously blended and exhibit high visual consistency with their surroundings. Existing methods perform CIG by either fusing objects into specific backgrounds or outpainting the surroundings via foreground object-guided diffusion. However, they often fail to obtain natural results because they overlook the logical relationship between camouflaged objects and background environments. To address this issue, we propose CT-CIG, a Controllable Text-guided Camouflage Images Generation method that produces realistic and logically plausible camouflage images. Leveraging Large Visual Language Models (VLM), we design a Camouflage-Revealing Dialogue Mechanism (CRDM) to annotate existing camouflage datasets with high-quality text prompts. Subsequently, the constructed image-prompt pairs are utilized to finetune Stable Diffusion, incorporating a lightweight controller to guide the location and shape of camouflaged objects for enhanced camouflage scene fitness. Moreover, we design a Frequency Interaction Refinement Module (FIRM) to capture high-frequency texture features, facilitating the learning of complex camouflage patterns. Extensive experiments, including CLIPScore evaluation and camouflage effectiveness assessment, demonstrate the semantic alignment of our generated text prompts and CT-CIG's ability to produce photorealistic camouflage images.

[70] Patch-Level Glioblastoma Subregion Classification with a Contrastive Learning-Based Encoder

Juexin Zhang, Qifeng Zhong, Ying Weng, Ke Chen

🧩 TL;DR

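To address the diagnostic challenge posed by the histopathological heterogeneity of glioblastoma, this study develops a Vision Transformer-based deep learning method that placed second in the BraTS-Path 2025 Challenge. The method fine-tunes a pre-trained ViT encoder for automated analysis of whole slide images and establishes a solid baseline for ViT-based histopathological analysis.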


📘 Detailed Summary

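Motivation: Glioblastoma is an aggressive brain tumor whose significant molecular and pathological heterogeneity complicates diagnosis and patient stratification. Traditional histopathological assessment remains the standard, but deep learning offers a promising path toward objective, automated analysis of whole slide images, particularly in the context of the BraTS-Path 2025 Challenge.

Method: The study develops a deep learning method built on a pre-trained Vision Transformer encoder, fine-tuning the encoder with a dedicated classification head on the official training dataset. The method was designed for the BraTS-Path Challenge and evaluated on the online validation set via the Synapse platform.

Result: On the online validation set, the model achieved a Matthews Correlation Coefficient (MCC) of 0.7064 and an F1-score of 0.7676. On the final test set, it achieved an MCC of 0.6509 and an F1-score of 0.5330, which secured second place in the BraTS-Pathology 2025 Challenge. The results reveal a performance gap on unseen validation data that remains to be addressed.

Conclusion: The study establishes a solid baseline for Vision Transformer-based histopathological analysis and demonstrates the effectiveness of ViT for brain tumor pathology images. Future work will focus on closing the performance gap observed on unseen validation data, further improving generalization and clinical value.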


📄 Abstract

The significant molecular and pathological heterogeneity of glioblastoma, an aggressive brain tumor, complicates diagnosis and patient stratification. While traditional histopathological assessment remains the standard, deep learning offers a promising path toward objective and automated analysis of whole slide images. For the BraTS-Path 2025 Challenge, we developed a method that fine-tunes a pre-trained Vision Transformer (ViT) encoder with a dedicated classification head on the official training dataset. Our model's performance on the online validation set, evaluated via the Synapse platform, yielded a Matthews Correlation Coefficient (MCC) of 0.7064 and an F1-score of 0.7676. On the final test set, the model achieved an MCC of 0.6509 and an F1-score of 0.5330, which secured our team second place in the BraTS-Pathology 2025 Challenge. Our results establish a solid baseline for ViT-based histopathological analysis, and future efforts will focus on bridging the performance gap observed on the unseen validation data.
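
For readers less familiar with the two reported metrics, the sketch below computes MCC and F1 from confusion-matrix counts. It covers only the binary case for illustration; the challenge involves multiple tissue subregions and the abstract does not say how the scores are aggregated across classes, so this is not a reproduction of the official evaluation. The counts are made up.

```python
import math


def mcc(tp, fp, fn, tn):
    """Binary Matthews Correlation Coefficient; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def f1_score(tp, fp, fn):
    """Binary F1: harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    if denom == 0:
        return 0.0
    return 2 * tp / denom


# Made-up counts for a balanced binary problem with 100 patches.
tp, fp, fn, tn = 40, 10, 10, 40
example_mcc = mcc(tp, fp, fn, tn)
example_f1 = f1_score(tp, fp, fn)
```

MCC uses all four cells of the confusion matrix, including true negatives, whereas F1 ignores true negatives, which is one reason the two scores can move differently between a validation set and a test set.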

[71] V-Attack: Targeting Disentangled Value Features for Controllable Adversarial Attacks on LVLMs

Sen Nie, Jie Zhang, Jianxin Yan, Shiguang Shan, Xilin Chen

🧩 TL;DR

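This paper proposes V-Attack, which performs precise semantic attacks by targeting the value features in transformer attention blocks, addressing the difficulty existing adversarial attacks have in precisely controlling local semantics in large vision-language models (LVLMs). The method substantially raises the attack success rate and exposes critical vulnerabilities in modern visual-language understanding systems.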


📘 Detailed Summary

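Motivation: Existing adversarial attacks struggle to precisely control the semantics of specific concepts in LVLMs. The main cause is semantic entanglement in patch-token representations: global context aggregated by self-attention in the vision encoder dominates individual patch features, making them unreliable handles for precise local semantic manipulation.

Method: V-Attack builds on the key finding that value features serve as precise handles for manipulation. It has two core components: a Self-Value Enhancement module that refines the intrinsic semantic richness of the value features, and a Text-Guided Value Manipulation module that uses text prompts to locate the source concept and optimize it toward a target concept. By bypassing the entangled patch features, it achieves highly effective semantic control.

Result: Extensive experiments across LVLMs including LLaVA, InternVL, DeepseekVL, and GPT-4o show that V-Attack improves the attack success rate by an average of 36% over state-of-the-art methods, exposing critical vulnerabilities in modern visual-language understanding systems.

Conclusion: The study reveals the distinct advantage of value features as precise handles for semantic manipulation within transformer attention: they suppress global-context channels and retain high-entropy, disentangled local semantic information. This opens a new technical route for adversarial attacks and underscores significant semantic-security weaknesses in modern vision-language models.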


📄 Abstract

Adversarial attacks have evolved from simply disrupting predictions on conventional task-specific models to the more complex goal of manipulating image semantics on Large Vision-Language Models (LVLMs). However, existing methods struggle with controllability and fail to precisely manipulate the semantics of specific concepts in the image. We attribute this limitation to semantic entanglement in the patch-token representations on which adversarial attacks typically operate: global context aggregated by self-attention in the vision encoder dominates individual patch features, making them unreliable handles for precise local semantic manipulation. Our systematic investigation reveals a key insight: value features (V) computed within the transformer attention block serve as much more precise handles for manipulation. We show that V suppresses global-context channels, allowing it to retain high-entropy, disentangled local semantic information. Building on this discovery, we propose V-Attack, a novel method designed for precise local semantic attacks. V-Attack targets the value features and introduces two core components: (1) a Self-Value Enhancement module to refine V's intrinsic semantic richness, and (2) a Text-Guided Value Manipulation module that leverages text prompts to locate the source concept and optimize it toward a target concept. By bypassing the entangled patch features, V-Attack achieves highly effective semantic control. Extensive experiments across diverse LVLMs, including LLaVA, InternVL, DeepseekVL and GPT-4o, show that V-Attack improves the attack success rate by an average of 36% over state-of-the-art methods, exposing critical vulnerabilities in modern visual-language understanding. Our code and data are available at https://github.com/Summu77/V-Attack.

[72] PromptMoG: Enhancing Diversity in Long-Prompt Image Generation via Prompt Embedding Mixture-of-Gaussian Sampling

Bo-Kai Ruan, Teng-Fang Hsiao, Ling Lo, Yi-Lun Wu, Hong-Han Shuai

🧩 TL;DR

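This paper systematically studies the fidelity-diversity dilemma of text-to-image generation under long prompts, proposes PromptMoG to enhance diversity through Mixture-of-Gaussians sampling, and validates it on four state-of-the-art models.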


📘 Detailed Summary

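Motivation: Under long prompts, current text-to-image models gain fidelity but lose substantial diversity, producing repetitive and less creative outputs. This fidelity-diversity dilemma has not yet been studied systematically.

Method: The paper develops a theoretical framework that increases sampling entropy through prompt reformulation, and a training-free method, PromptMoG, which samples prompt embeddings from a Mixture-of-Gaussians in the embedding space to enhance diversity while preserving semantics.

Result: Extensive experiments on four state-of-the-art models (SD3.5-Large, Flux.1-Krea-Dev, CogView4, and Qwen-Image) show that PromptMoG consistently improves the diversity of long-prompt generation without semantic drift.

Conclusion: The study reveals a systematic drop in diversity in long-prompt generation. PromptMoG offers an effective way to balance fidelity and diversity and improves text-to-image generation under complex prompts.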


📄 Abstract

Recent advances in text-to-image (T2I) generation have achieved remarkable visual outcomes through large-scale rectified flow models. However, how these models behave under long prompts remains underexplored. Long prompts encode rich content, spatial, and stylistic information that enhances fidelity but often suppresses diversity, leading to repetitive and less creative outputs. In this work, we systematically study this fidelity-diversity dilemma and reveal that state-of-the-art models exhibit a clear drop in diversity as prompt length increases. To enable consistent evaluation, we introduce LPD-Bench, a benchmark designed for assessing both fidelity and diversity in long-prompt generation. Building on our analysis, we develop a theoretical framework that increases sampling entropy through prompt reformulation and propose a training-free method, PromptMoG, which samples prompt embeddings from a Mixture-of-Gaussians in the embedding space to enhance diversity while preserving semantics. Extensive experiments on four state-of-the-art models, SD3.5-Large, Flux.1-Krea-Dev, CogView4, and Qwen-Image, demonstrate that PromptMoG consistently improves long-prompt generation diversity without semantic drifting.
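
Sampling a prompt embedding from a Mixture-of-Gaussians can be sketched in a few lines. The abstract does not specify how the mixture is constructed, so this sketch assumes the component means are given (for example, embeddings of reformulated prompts), the covariance is isotropic with a shared `sigma`, and the weights default to uniform; `sample_mog` is a hypothetical helper, and the two-dimensional vectors stand in for real embeddings.

```python
import random


def sample_mog(means, sigma, weights=None, rng=None):
    """Draw one vector from an isotropic Mixture-of-Gaussians.

    means:   list of component mean vectors (lists of floats)
    sigma:   shared standard deviation for every dimension (assumption)
    weights: mixture weights; uniform when omitted
    Returns (component_index, sampled_vector).
    """
    rng = rng or random.Random()
    if weights is None:
        weights = [1.0 / len(means)] * len(means)
    k = rng.choices(range(len(means)), weights=weights, k=1)[0]  # pick a component
    sample = [rng.gauss(mu, sigma) for mu in means[k]]           # perturb its mean
    return k, sample


# Two toy component means in a 2-D "embedding space".
means = [[0.0, 0.0], [1.0, 1.0]]
rng = random.Random(0)
k, embedding = sample_mog(means, sigma=0.1, rng=rng)
```

A larger `sigma` or more widely spread component means would increase the entropy of the sampled embeddings, at the risk of drifting from the prompt's meaning; how the paper balances the two is not described in the abstract.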

[73] Zoo3D: Zero-Shot 3D Object Detection at Scene Level

Andrey Lemeshko, Bulat Gabdullin, Nikita Drozdov, Anton Konushin, Danila Rukhovich, Maksim Kolodiazhnyi

🧩 TL;DR

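Zoo3D is the first training-free open-vocabulary 3D object detection framework. It constructs 3D bounding boxes via graph clustering of 2D instance masks and achieves state-of-the-art performance on the ScanNet200 and ARKitScenes benchmarks, with its zero-shot variant even surpassing existing self-supervised methods.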


📘 Detailed Summary

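Motivation: Real-world environments demand models that can recognize diverse, previously unseen objects, which is a major limitation of closed-set methods. Existing open-vocabulary 3D detectors relax annotation requirements but still depend on training scenes, either as point clouds or images, which motivates a fully training-free 3D detection framework.

Method: The method constructs 3D bounding boxes via graph clustering of 2D instance masks, then assigns semantic labels using a novel open-vocabulary module with best-view selection and view-consensus mask generation. Zoo3D operates in two modes: the zero-shot Zoo3D₀, which requires no training at all, and the self-supervised Zoo3D₁, which refines 3D box prediction by training a class-agnostic detector on pseudo labels generated by Zoo3D₀.

Result: On the ScanNet200 and ARKitScenes benchmarks, both Zoo3D₀ and Zoo3D₁ achieve state-of-the-art results in open-vocabulary 3D object detection. Notably, the zero-shot Zoo3D₀ outperforms all existing self-supervised methods, demonstrating the power and adaptability of training-free approaches.

Conclusion: The study demonstrates the potential of training-free, off-the-shelf approaches for real-world 3D understanding and offers a new paradigm for open-vocabulary 3D detection. Zoo3D works not only with point clouds but also directly with posed and even unposed images, giving it broad applicability.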


📄 Abstract

3D object detection is fundamental for spatial understanding. Real-world environments demand models capable of recognizing diverse, previously unseen objects, which remains a major limitation of closed-set methods. Existing open-vocabulary 3D detectors relax annotation requirements but still depend on training scenes, either as point clouds or images. We take this a step further by introducing Zoo3D, the first training-free 3D object detection framework. Our method constructs 3D bounding boxes via graph clustering of 2D instance masks, then assigns semantic labels using a novel open-vocabulary module with best-view selection and view-consensus mask generation. Zoo3D operates in two modes: the zero-shot Zoo3D$_0$, which requires no training at all, and the self-supervised Zoo3D$_1$, which refines 3D box prediction by training a class-agnostic detector on Zoo3D$_0$-generated pseudo labels. Furthermore, we extend Zoo3D beyond point clouds to work directly with posed and even unposed images. Across ScanNet200 and ARKitScenes benchmarks, both Zoo3D$_0$ and Zoo3D$_1$ achieve state-of-the-art results in open-vocabulary 3D object detection. Remarkably, our zero-shot Zoo3D$_0$ outperforms all existing self-supervised methods, hence demonstrating the power and adaptability of training-free, off-the-shelf approaches for real-world 3D understanding. Code is available at https://github.com/col14m/zoo3d .

[74] Modality-Balanced Collaborative Distillation for Multi-Modal Domain Generalization

Xiaohan Wang, Zhangtao Cheng, Ting Zhong, Leiting Chen, Fan Zhou

🧩 TL;DR

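This paper proposes MBCD, a framework that combines adaptive modality dropout, a gradient consistency constraint, and distillation from a weight-averaged teacher. It addresses the early-stage overfitting that weight averaging suffers in multi-modal domain generalization because modalities optimize at different speeds, yielding a flatter loss landscape and better generalization.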


📘 Detailed Summary

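Motivation: Weight averaging faces a challenge in multi-modal domain generalization: differences in optimization speed across modalities cause it to overfit to faster-converging modalities in early stages and suppress the contribution of slower yet complementary ones. This hinders effective modality fusion and skews the loss surface toward sharper, less generalizable minima.

Method: MBCD uses adaptive modality dropout to curb early-stage bias toward dominant modalities, applies a gradient consistency constraint that aligns the learning signals of the uni-modal branches with the fused representation to encourage coordinated optimization, and uses a weight-averaging-based teacher for cross-modal distillation, transferring fused knowledge to each uni-modal branch to strengthen cross-modal interactions.

Result: Extensive experiments on multi-modal domain generalization benchmarks show that MBCD consistently outperforms existing methods, achieving superior accuracy and robustness across diverse unseen domains.

Conclusion: The study shows that MBCD retains the flatness-inducing advantages of weight averaging while overcoming its limitations in multi-modal contexts. It offers an effective solution for domain generalization in multi-modal learning and underscores the importance of coordinating modality optimization to reach flat loss surfaces.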


📄 Abstract

Weight Averaging (WA) has emerged as a powerful technique for enhancing generalization by promoting convergence to a flat loss landscape, which correlates with stronger out-of-distribution performance. However, applying WA directly to multi-modal domain generalization (MMDG) is challenging: differences in optimization speed across modalities lead WA to overfit to faster-converging ones in early stages, suppressing the contribution of slower yet complementary modalities, thereby hindering effective modality fusion and skewing the loss surface toward sharper, less generalizable minima. To address this issue, we propose MBCD, a unified collaborative distillation framework that retains WA's flatness-inducing advantages while overcoming its shortcomings in multi-modal contexts. MBCD begins with adaptive modality dropout in the student model to curb early-stage bias toward dominant modalities. A gradient consistency constraint then aligns learning signals between uni-modal branches and the fused representation, encouraging coordinated and smoother optimization. Finally, a WA-based teacher conducts cross-modal distillation by transferring fused knowledge to each uni-modal branch, which strengthens cross-modal interactions and steers convergence toward flatter solutions. Extensive experiments on MMDG benchmarks show that MBCD consistently outperforms existing methods, achieving superior accuracy and robustness across diverse unseen domains.
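
As background for the WA-based teacher, the sketch below maintains a running mean of model weights across checkpoints. It assumes a uniform average over checkpoints, which is one common form of weight averaging; the abstract does not say whether MBCD uses a uniform or an exponential average, and the flat lists of floats stand in for real parameter tensors.

```python
def update_average(avg, new, n_seen):
    """Running mean update: avg_next = avg + (new - avg) / (n_seen + 1).

    avg:    current averaged weights (mean of n_seen checkpoints)
    new:    weights of the next checkpoint
    n_seen: number of checkpoints already included in avg
    """
    return [a + (w - a) / (n_seen + 1) for a, w in zip(avg, new)]


# Three toy student checkpoints with two parameters each (made-up values).
checkpoints = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

teacher = list(checkpoints[0])
for n_seen, weights in enumerate(checkpoints[1:], start=1):
    teacher = update_average(teacher, weights, n_seen)
```

After all three checkpoints, the teacher holds the element-wise mean of the student weights. The incremental form avoids storing every checkpoint, which matters when the parameters are large.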

[75] VKnowU: Evaluating Visual Knowledge Understanding in Multimodal LLMs

Tianxiang Jiang, Sheng Xia, Yicheng Xu, Linquan Wu, Xiangyu Zeng, Limin Wang, Yu Qiao, Yi Wang

🧩 TL;DR

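This work presents the VKnowU benchmark for evaluating the visual knowledge of multimodal large language models, and develops VideoKnow+, a model that explicitly incorporates visual knowledge and a reinforcement learning reward, achieving notable gains on several benchmarks.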


📘 Detailed Summary

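Motivation: Current multimodal large language models are adept at recognizing objects but lack an intuitive understanding of the physical and social principles of the human world. This visual knowledge, which bridges perception and reasoning, remains underexplored in existing research.

Method: The authors build the VKnowU benchmark with 1,680 questions and propose VideoKnow+, a baseline model that follows a structured See-Think-Answer paradigm and adopts reinforcement learning with a visual knowledge reward.

Result: Evaluation of 23 state-of-the-art MLLMs shows that they still fall short of human performance, with particularly notable gaps in world-centric knowledge. VideoKnow+ achieves a +3.7% improvement on VKnowU and consistent gains on MVBench, Video-MME, and MMVU.

Conclusion: Visual knowledge is a key cornerstone for developing more generalizable MLLMs that can not only see but also truly understand the physical and social world, pointing the way toward AI systems with human-level understanding.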


📄 Abstract

While Multimodal Large Language Models (MLLMs) have become adept at recognizing objects, they often lack the intuitive, human-like understanding of the world's underlying physical and social principles. This high-level vision-grounded semantics, which we term visual knowledge, forms a bridge between perception and reasoning, yet remains an underexplored area in current MLLMs. To systematically evaluate this capability, we present VKnowU, a comprehensive benchmark featuring 1,680 questions in 1,249 videos, covering 8 core types of visual knowledge spanning both world-centric (e.g., intuitive physics) and human-centric (e.g., subjective intentions). Evaluation of 23 SOTA MLLMs reveals that leading models still fall short of human performance, with particularly notable gaps in world-centric knowledge. To bridge this gap, we introduce a new dataset, VKnowQA, and VideoKnow+, a baseline model that explicitly incorporates visual knowledge into MLLMs. VideoKnow+ follows a structured See-Think-Answer paradigm and adopts reinforcement learning with visual knowledge reward, achieving a +3.7% improvement on VKnowU and consistent gains on MVBench, Video-MME, and MMVU. Our work highlights visual knowledge as a missing cornerstone for developing more generalizable MLLMs that can not only see but also truly understand our physical and social worlds.

[76] ScenarioCLIP: Pretrained Transferable Visual Language Models and Action-Genome Dataset for Natural Scene Analysis

Advik Sinha, Saurabh Atreya, Aashutosh A, Sk Aziz Ali, Abhijit Das

🧩 TL;DR

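This paper proposes ScenarioCLIP, a model that explicitly models object relations and compositional structure in scenes to address the limitations of conventional CLIP models in complex scene understanding. The method shows strong zero-shot and fine-tuned performance on several domain-specific tasks.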


📘 Detailed Summary

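Motivation: Existing CLIP models focus mainly on short-text retrieval or single-object classification and cannot adequately handle the rich compositional structure of real-world scenes. Recent methods improve class-level discrimination by mining hard negatives and refining text prompts, but they remain confined to predefined class lists and lack explicit modeling of relational and compositional structure. PyramidCLIP partially addresses this gap yet still lacks explicit modeling of inter-object relations.

Method: ScenarioCLIP takes input texts, grounded relations, input images, and focused regions highlighting relations as input. The model is pretrained on curated scenario data and fine-tuned on specialized downstream tasks. To address the lack of domain-specific datasets, the authors build a new dataset by extending image-text pairs from existing indoor and outdoor scenario datasets, using a pipeline of language models to ground actions, objects, and relations, together with manual and automatic curation.

Result: ScenarioCLIP establishes a comprehensive benchmark for several scenario-based tasks and is compared against many baseline methods. Experiments show strong zero-shot and fine-tuned performance on a variety of domain-specific tasks, significantly outperforming existing methods.

Conclusion: The study underscores the importance of explicitly modeling scene compositional structure for complex visual understanding. ScenarioCLIP's success suggests that integrating relational information and focused regions can substantially improve CLIP models for real-world scene analysis, opening a new direction for fine-grained visual understanding.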


📄 Abstract

Until recently, the general corpus of CLIP-type foundation models has mostly explored either the retrieval of short descriptions or the classification of objects in the scene as a single-object image classification task. The same holds for retrieving the image embedding (image retrieval task) given a text prompt. However, real-world scene images exhibit rich compositional structure involving multiple objects and actions. The latest methods in the CLIP-based literature improve class-level discrimination by mining harder negative image-text pairs and by refining permanent text prompts, often using LLMs. However, these improvements remain confined to predefined class lists and do not explicitly model relational or compositional structure. PyramidCLIP partially addresses this gap by aligning global and local visual features, yet it still lacks explicit modeling of inter-object relations. Hence, to further leverage this aspect for scene analysis, the proposed ScenarioCLIP model accepts input texts, grounded relations, and input images, along with focused regions highlighting relations. The proposed model is pretrained on curated scenario data and fine-tuned for specialized downstream tasks, such as cross-modal retrieval and fine-grained visual understanding tasks. To address the lack of domain-specific datasets, we generate a novel dataset by extending image-text pairs from existing diverse indoor and outdoor scenario datasets that are publicly available. We used a pipeline of existing language models to ground actions, objects, and relations, followed by manual and automatic curation. We established a comprehensive benchmark for several scenario-based tasks and compared it with many baseline methods. ScenarioCLIP demonstrates robust zero-shot and fine-tuned performance on various domain-specific tasks. Our code and dataset are available at https://github.com/scenario-clip/ScenarioCLIP

[77] Bootstrapping Physics-Grounded Video Generation through VLM-Guided Iterative Self-Refinement

Yang Liu, Xilin Zhao, Peisong Wen, Siran Dai, Qingming Huang

🧩 TL;DR

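This paper proposes an iterative self-refinement framework based on multimodal chain-of-thought that uses large language models and vision-language models to provide physics-aware guidance, markedly improving the physical consistency of generated videos. The method is training-free and plug-and-play, and raises the Physics-IQ score on the PhyIQ benchmark from 56.31 to 62.38.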


📘 Detailed Summary

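Motivation: Although current video generation models have made notable progress in visual quality, they still fall short in aligning their outputs with real-world physical principles, which limits the realism and credibility of generated videos.

Method: The authors propose an iterative self-refinement framework with a multimodal chain-of-thought process that progressively refines prompts based on feedback about physical inconsistencies, using large language models and vision-language models for physics-aware guidance. The method requires no additional training and is plug-and-play.

Result: On the PhyIQ benchmark, the method improves the Physics-IQ score from the baseline's 56.31 to 62.38, demonstrating its effectiveness at improving the physical consistency of video generation.

Conclusion: This work is a preliminary exploration of physics-consistent video generation. It demonstrates the potential of iterative refinement with multimodal models and offers insights and a methodological basis for future research.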


📄 Abstract

Recent progress in video generation has led to impressive visual quality, yet current models still struggle to produce results that align with real-world physical principles. To this end, we propose an iterative self-refinement framework that leverages large language models and vision-language models to provide physics-aware guidance for video generation. Specifically, we introduce a multimodal chain-of-thought (MM-CoT) process that refines prompts based on feedback from physical inconsistencies, progressively enhancing generation quality. This method is training-free and plug-and-play, making it readily applicable to a wide range of video generation models. Experiments on the PhyIQ benchmark show that our method improves the Physics-IQ score from 56.31 to 62.38. We hope this work serves as a preliminary exploration of physics-consistent video generation and may offer insights for future research.
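
The refinement loop described in the abstract can be pictured roughly as follows. This is only a sketch under stated assumptions: `generate`, `critique`, and `rewrite` are hypothetical hooks standing in for a video generator, a VLM-based physics critic, and an LLM prompt rewriter, and the stopping rule (no issues reported, or a round limit) is an illustrative choice rather than the paper's.

```python
def refine_prompt(prompt, generate, critique, rewrite, max_rounds=3):
    """Iteratively rewrite `prompt` until the critic reports no physics issues.

    generate(prompt) -> video; critique(prompt, video) -> list of issues;
    rewrite(prompt, issues) -> new prompt. All three are hypothetical hooks.
    """
    history = []
    for _ in range(max_rounds):
        video = generate(prompt)
        issues = critique(prompt, video)
        history.append((prompt, issues))
        if not issues:
            break  # the critic found nothing physically inconsistent
        prompt = rewrite(prompt, issues)
    return prompt, history


# Toy stand-ins so the loop can be run without any model.
def fake_generate(prompt):
    return {"prompt": prompt}


def fake_critique(prompt, video):
    return [] if "falls" in video["prompt"] else ["ball ignores gravity"]


def fake_rewrite(prompt, issues):
    return prompt + " The ball falls and bounces lower each time."


final_prompt, history = refine_prompt(
    "A ball is dropped on a table.", fake_generate, fake_critique, fake_rewrite
)
```

Because the loop only touches the prompt, it needs no access to the generator's weights, which is consistent with the training-free, plug-and-play framing.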

[78] Image-Free Timestep Distillation via Continuous-Time Consistency with Trajectory-Sampled Pairs

Bao Tang, Shuai Zhang, Yueting Zhu, Jijun Xiang, Xin Yang, Li Yu, Wenyu Liu, Xinggang Wang

🧩 TL;DR

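This paper proposes the Trajectory-Backward Consistency Model (TBCM), which extracts latent representations directly from the teacher model's generation trajectory, eliminating the dependence on external training data and enabling more efficient data-free consistency distillation.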


📘 Detailed Summary

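Motivation: Current continuous-time consistency distillation methods rely heavily on training data and computational resources, which limits their deployment in resource-constrained scenarios and their scalability across domains. A more efficient, self-contained distillation paradigm is needed to remove this bottleneck.

Method: TBCM adopts a self-contained distillation paradigm that extracts latent representations directly from the teacher model's generation trajectory, replacing the VAE encoding and large-scale datasets required by conventional methods. The trajectory-extracted samples naturally bridge the distribution gap between training and inference.

Result: On MJHQ-30k, TBCM achieves an FID of 6.52 and a CLIP score of 28.08 under one-step generation, while reducing training time by approximately 40% compared to Sana-Sprint and saving a substantial amount of GPU memory, showing superior efficiency without sacrificing quality.

Conclusion: The study reveals the diffusion-generation space discrepancy in continuous-time consistency distillation and analyzes how sampling strategies affect distillation performance, offering insights for future distillation research while demonstrating the potential of self-contained distillation for efficiency and scalability.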


📄 Abstract

Timestep distillation is an effective approach for improving the generation efficiency of diffusion models. The Consistency Model (CM), as a trajectory-based framework, demonstrates significant potential due to its strong theoretical foundation and high-quality few-step generation. Nevertheless, current continuous-time consistency distillation methods still rely heavily on training data and computational resources, hindering their deployment in resource-constrained scenarios and limiting their scalability to diverse domains. To address this issue, we propose Trajectory-Backward Consistency Model (TBCM), which eliminates the dependence on external training data by extracting latent representations directly from the teacher model's generation trajectory. Unlike conventional methods that require VAE encoding and large-scale datasets, our self-contained distillation paradigm significantly improves both efficiency and simplicity. Moreover, the trajectory-extracted samples naturally bridge the distribution gap between training and inference, thereby enabling more effective knowledge transfer. Empirically, TBCM achieves 6.52 FID and 28.08 CLIP scores on MJHQ-30k under one-step generation, while reducing training time by approximately 40% compared to Sana-Sprint and saving a substantial amount of GPU memory, demonstrating superior efficiency without sacrificing quality. We further reveal the diffusion-generation space discrepancy in continuous-time consistency distillation and analyze how sampling strategies affect distillation performance, offering insights for future distillation research. GitHub Link: https://github.com/hustvl/TBCM.

[79] MajutsuCity: Language-driven Aesthetic-adaptive City Generation with Controllable 3D Assets and Layouts

Zilong Huang, Jun He, Xiaobin Huang, Ziyi Xiong, Yang Luo, Junyan Ye, Weijia Li, Yiping Chen, Ting Han

🧩 TL;DR

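MajutsuCity is a natural language-driven, aesthetically adaptive framework for 3D city generation. It represents a city as a composition of controllable layouts, assets, and materials through a four-stage pipeline, and sets a new state of the art in geometric fidelity, stylistic adaptability, and semantic controllability.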


📘 Detailed Summary

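Motivation: Existing methods struggle to balance the creative flexibility of text-based generation with the object-level editability offered by explicit structural representations, limiting the potential of 3D city generation to deliver stylistic diversity, fine-grained detail, and controllability.

Method: The framework uses a four-stage pipeline that represents a city as a composition of controllable layouts, assets, and materials. It integrates MajutsuAgent, an interactive language-grounded editing agent supporting five object-level operations, and builds MajutsuDataset, a high-quality multimodal dataset containing 2D semantic layouts, height maps, 3D building assets, PBR materials, and skyboxes.

Result: Experiments show that MajutsuCity reduces layout FID by 83.7% compared with CityDreamer and by 20.1% compared with CityCraft, ranks first across all AQS and RDR scores, and significantly outperforms existing methods in geometric fidelity, stylistic adaptability, and semantic controllability.

Conclusion: The study sets a new standard for 3D city generation in geometric fidelity, stylistic adaptability, and semantic controllability. The framework is expected to inspire new research directions in 3D city generation and to provide stronger tools for virtual reality, game development, and world models.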


📄 Abstract

Generating realistic 3D cities is fundamental to world models, virtual reality, and game development, where an ideal urban scene must combine stylistic diversity, fine-grained detail, and controllability. However, existing methods struggle to balance the creative flexibility offered by text-based generation with the object-level editability enabled by explicit structural representations. We introduce MajutsuCity, a natural language-driven and aesthetically adaptive framework for synthesizing structurally consistent and stylistically diverse 3D urban scenes. MajutsuCity represents a city as a composition of controllable layouts, assets, and materials, and operates through a four-stage pipeline. To extend controllability beyond initial generation, we further integrate MajutsuAgent, an interactive language-grounded editing agent that supports five object-level operations. To support photorealistic and customizable scene synthesis, we also construct MajutsuDataset, a high-quality multimodal dataset containing 2D semantic layouts and height maps, diverse 3D building assets, and curated PBR materials and skyboxes, each accompanied by detailed annotations. Meanwhile, we develop a practical set of evaluation metrics, covering key dimensions such as structural consistency, scene complexity, material fidelity, and lighting atmosphere. Extensive experiments demonstrate that MajutsuCity reduces layout FID by 83.7% compared with CityDreamer and by 20.1% compared with CityCraft. Our method ranks first across all AQS and RDR scores, outperforming existing methods by a clear margin. These results confirm MajutsuCity as a new state of the art in geometric fidelity, stylistic adaptability, and semantic controllability for 3D city generation. We expect our framework to inspire new avenues of research in 3D city generation. Our dataset and code will be released at https://github.com/LongHZ140516/MajutsuCity.

cs.CL [Back]

[80] A Systematic Analysis of Large Language Models with RAG-enabled Dynamic Prompting for Medical Error Detection and Correction

Farzad Ahmed, Joniel Augustine Jerome, Meliha Yetisgen, Özlem Uzuner

🧩 TL;DR

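This study evaluates retrieval-augmented dynamic prompting (RDP) on medical error processing tasks and finds that, compared with zero-shot and static prompting, RDP substantially reduces the false-positive rate and improves detection recall, offering a more reliable approach to detecting and correcting errors in clinical documentation.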


📘 Detailed Summary

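Motivation: Clinical documentation contains factual, diagnostic, and management errors that can compromise patient safety, yet how large language models handle medical errors under different prompting strategies remains unclear. The study systematically evaluates zero-shot prompting, static prompting with random exemplars, and retrieval-augmented dynamic prompting on three subtasks: error flag detection, error sentence detection, and error correction.

Method: Using the MEDEC dataset, the study evaluates nine instruction-tuned large language models (including GPT, Claude, Gemini, and OpenAI o-series models) under three strategies: zero-shot prompting, static prompting with random exemplars, and retrieval-augmented dynamic prompting. Performance is measured with accuracy, recall, false-positive rate, and an aggregate score of ROUGE-1, BLEURT, and BERTScore.

Result: Zero-shot prompting shows low recall in both detection tasks, often missing abbreviation-heavy or atypical errors. Static prompting with random exemplars improves recall but increases the false-positive rate. Across all nine large language models, retrieval-augmented dynamic prompting reduces the false-positive rate by about 15%, improves recall in error sentence detection by 5-10%, and generates more contextually accurate corrections.

Conclusion: Across diverse large language models, retrieval-augmented dynamic prompting outperforms zero-shot prompting and static prompting with random exemplars. Using retrieved exemplars improves detection accuracy, reduces false positives, and makes medical error correction more reliable, providing an effective technical path for improving the quality of clinical documentation.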


📄 Abstract

Objective: Clinical documentation contains factual, diagnostic, and management errors that can compromise patient safety. Large language models (LLMs) may help detect and correct such errors, but their behavior under different prompting strategies remains unclear. We evaluate zero-shot prompting, static prompting with random exemplars (SPR), and retrieval-augmented dynamic prompting (RDP) for three subtasks of medical error processing: error flag detection, error sentence detection, and error correction. Methods: Using the MEDEC dataset, we evaluated nine instruction-tuned LLMs (GPT, Claude, Gemini, and OpenAI o-series models). We measured performance using accuracy, recall, false-positive rate (FPR), and an aggregate score of ROUGE-1, BLEURT, and BERTScore for error correction. We also analyzed example outputs to identify failure modes and differences between LLM and clinician reasoning. Results: Zero-shot prompting showed low recall in both detection tasks, often missing abbreviation-heavy or atypical errors. SPR improved recall but increased FPR. Across all nine LLMs, RDP reduced FPR by about 15 percent, improved recall by 5 to 10 percent in error sentence detection, and generated more contextually accurate corrections. Conclusion: Across diverse LLMs, RDP outperforms zero-shot and SPR prompting. Using retrieved exemplars improves detection accuracy, reduces false positives, and enhances the reliability of medical error correction.
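
The difference between static and retrieval-augmented dynamic prompting can be sketched in a few lines of Python. Everything below is an assumption made for illustration: the token-overlap (Jaccard) retriever stands in for whatever retriever the study used, the prompt wording is invented, and the three notes are toy data, not MEDEC examples.

```python
def jaccard(a, b):
    """Token-overlap similarity between two strings (a stand-in retriever)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def build_dynamic_prompt(note, pool, k=2):
    """Pick the k pool examples most similar to `note` and format a few-shot prompt."""
    shots = sorted(pool, key=lambda ex: jaccard(note, ex["text"]), reverse=True)[:k]
    parts = ["Does the clinical note contain an error? Answer yes or no."]
    for ex in shots:
        parts.append(f"Note: {ex['text']}\nAnswer: {ex['label']}")
    parts.append(f"Note: {note}\nAnswer:")
    return "\n\n".join(parts)


# Toy exemplar pool (invented for this sketch).
pool = [
    {"text": "Patient with strep throat was started on amoxicillin.", "label": "no"},
    {"text": "Left femur fracture was treated with a cast.", "label": "no"},
    {"text": "Patient with viral cold was started on amoxicillin.", "label": "yes"},
]
prompt = build_dynamic_prompt(
    "Patient with viral flu was started on amoxicillin.", pool, k=2
)
```

A static prompt would reuse the same randomly chosen exemplars for every note; here the exemplars change with the query, so the unrelated fracture note is never shown for an antibiotic-prescribing question.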

[81] AppSelectBench: Application-Level Tool Selection Benchmark

Tianyi Chen, Michael Solodko, Sen Wang, Jongwoo Ko, Junheng Hao, Colby Banbury, Sara Abdali, Saeed Amizadeh, Qing Xiao, Yinheng Li, Tianyu Ding, Kamran Ghasedi Dizaji, Suzhen Zheng, Hao Fan, Justin Wagle, Pashmina Cameron, Kazuhito Koishida

🧩 TL;DR

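This paper introduces AppSelectBench, a comprehensive benchmark for evaluating application selection in Computer Using Agents (CUAs), filling the gap left by existing benchmarks that focus on fine-grained API selection and offer little evaluation of cross-application reasoning. The benchmark contains more than 100,000 realistic and diverse user tasks and reveals that even state-of-the-art models still face systematic challenges in application selection.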


📘 Detailed Summary

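Motivation: Existing benchmarks primarily assess fine-grained API selection and cannot effectively measure whether models can reason across and choose between different applications. Application selection is a fundamental capability for computer using agents carrying out complex, realistic tasks: it determines whether the agent initializes the correct environment, avoids orchestration confusion, and focuses efficiently on relevant context.

Method: AppSelectBench includes a novel user task generation pipeline that produces realistic, diverse, and semantically grounded user intents at scale, together with unified evaluation protocols covering random, heuristic, zero-shot, few-shot, and retrieval-augmented settings. It covers 100 widely used desktop applications.

Result: Extensive experiments on closed-source and open-source large language models reveal systematic strengths and weaknesses in cross-application reasoning, showing that even the most capable models still struggle to make consistent application choices. The benchmark contains more than 100,000 realistic and diverse user tasks.

Conclusion: AppSelectBench lays a foundation for studying and advancing application-level reasoning, an essential yet underexplored capability of intelligent computer using agents. The results underscore the need to improve models' cross-application reasoning and provide a standardized evaluation framework for future research.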


📄 Abstract

Computer Using Agents (CUAs) are increasingly equipped with external tools, enabling them to perform complex and realistic tasks. For CUAs to operate effectively, application selection, which refers to deciding which application to use before invoking fine-grained tools such as APIs, is a fundamental capability. It determines whether the agent initializes the correct environment, avoids orchestration confusion, and efficiently focuses on relevant context. However, existing benchmarks primarily assess fine-grained API selection, offering limited insight into whether models can reason across and choose between different applications. To fill this gap, we introduce AppSelectBench, a comprehensive benchmark for evaluating application selection in CUAs. AppSelectBench contains a novel user task generation pipeline that produces realistic, diverse, and semantically grounded user intents at scale, together with unified evaluation protocols covering random, heuristic, zero-shot, few-shot, and retrieval-augmented settings. AppSelectBench covers one hundred widely used desktop applications and includes more than one hundred thousand realistic, diverse, and semantically grounded user tasks. Extensive experiments across both closed-source and open-source large language models reveal systematic strengths and weaknesses in inter-application reasoning, showing that even the most capable models still struggle to make consistent application choices. Together, these results establish AppSelectBench as a foundation for studying and advancing application-level reasoning, an essential yet underexplored capability of intelligent CUAs. The source is available at https://github.com/microsoft/appselectbench.

[82] A Machine Learning Approach for Detection of Mental Health Conditions and Cyberbullying from Social Media

Edward Ajayi, Martha Kachweka, Mawuli Deku, Emily Aiken

🧩 TL;DR

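This paper proposes a unified multiclass classification framework for detecting ten distinct mental health and cyberbullying categories from social media data. An end-to-end fine-tuned MentalBERT model achieves the best performance, and a hybrid explainability framework combining SHAP and an LLM supports human-in-the-loop screening.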


📘 Detailed Summary

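Motivation: Growing mental health challenges and cyberbullying in digital spaces call for scalable and interpretable detection systems, yet existing methods have limited ability to accurately detect multiple mental health and cyberbullying categories within a unified framework.

Method: The authors adopt a unified multiclass classification framework, curate datasets from Twitter and Reddit, implement a rigorous "split-then-balance" pipeline, compare traditional lexical models, hybrid approaches, and several end-to-end fine-tuned transformers, and introduce a hybrid explainability framework combining SHAP and an LLM.

Result: End-to-end fine-tuning proves critical for performance. The domain-adapted MentalBERT emerges as the top model with an accuracy of 0.92 and a Macro F1 score of 0.76, surpassing both its generic counterpart and a zero-shot LLM baseline.

Conclusion: The system is framed as a human-in-the-loop screening aid rather than a diagnostic tool. The work highlights the future need for multi-label, clinically validated datasets and provides a robust baseline at the intersection of online safety and computational mental health.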


📄 Abstract

Mental health challenges and cyberbullying are increasingly prevalent in digital spaces, necessitating scalable and interpretable detection systems. This paper introduces a unified multiclass classification framework for detecting ten distinct mental health and cyberbullying categories from social media data. We curate datasets from Twitter and Reddit, implementing a rigorous "split-then-balance" pipeline to train on balanced data while evaluating on a realistic, held-out imbalanced test set. We conducted a comprehensive evaluation comparing traditional lexical models, hybrid approaches, and several end-to-end fine-tuned transformers. Our results demonstrate that end-to-end fine-tuning is critical for performance, with the domain-adapted MentalBERT emerging as the top model, achieving an accuracy of 0.92 and a Macro F1 score of 0.76, surpassing both its generic counterpart and a zero-shot LLM baseline. Grounded in a comprehensive ethical analysis, we frame the system as a human-in-the-loop screening aid, not a diagnostic tool. To support this, we introduce a hybrid SHAP-LLM explainability framework and present a prototype dashboard ("Social Media Screener") designed to integrate model predictions and their explanations into a practical workflow for moderators. Our work provides a robust baseline, highlighting future needs for multi-label, clinically validated datasets at the critical intersection of online safety and computational mental health.
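
The ordering implied by "split-then-balance" matters: the test set is carved out first and left imbalanced, and only the training portion is rebalanced, so no duplicated sample can leak into evaluation. A minimal sketch follows; balancing by random oversampling, the function name, and the toy labels are assumptions, since the abstract does not say which balancing method was used.

```python
import random
from collections import Counter, defaultdict


def split_then_balance(samples, test_fraction=0.2, seed=0):
    """Hold out an untouched test split, then oversample only the training split.

    `samples` is a list of (text, label) pairs. Oversampling by random
    duplication is an assumed balancing method for this sketch.
    """
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]

    by_label = defaultdict(list)
    for text, label in train:
        by_label[label].append((text, label))

    target = max(len(items) for items in by_label.values())
    balanced = []
    for items in by_label.values():
        balanced.extend(items)
        # Duplicate random items until this class reaches the largest class size.
        balanced.extend(rng.choices(items, k=target - len(items)))
    rng.shuffle(balanced)
    return balanced, test


# Toy data: roughly one "depression" post per ten posts.
samples = [(f"post {i}", "depression" if i % 10 == 0 else "neutral") for i in range(100)]
balanced_train, test = split_then_balance(samples, test_fraction=0.2, seed=0)
train_counts = Counter(label for _, label in balanced_train)
```

Balancing before splitting would place copies of the same minority post on both sides of the split and inflate test scores, which is the failure this ordering avoids.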

[83] "When Data is Scarce, Prompt Smarter"... Approaches to Grammatical Error Correction in Low-Resource Settings

Somsubhra De, Harsh Kumar, Arun Prakash A

🧩 TL;DR

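This study explores prompting-based large language models for grammatical error correction in low-resource Indic languages, demonstrating the strong multilingual generalization of contemporary LLMs for GEC and achieving leading results across several Indic languages with carefully designed prompting strategies.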


📘 Detailed Summary

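Motivation: Although transformer-based models and large annotated datasets have greatly improved grammatical error correction for high-resource languages such as English, GEC remains challenging for most Indic languages because of limited resources, linguistic diversity, and complex morphology. This study aims to address that resource gap.

Method: The authors take a prompting-based approach with state-of-the-art large language models such as GPT-4.1, Gemini-2.5, and LLaMA-4, using zero-shot and few-shot strategies to adapt them to low-resource settings and relying on carefully designed prompts and lightweight adaptation to improve correction quality in multilingual GEC.

Result: Even basic prompting strategies enable these LLMs to substantially outperform fine-tuned Indic-language models such as Sarvam-22B. The approach achieved leading results in the shared task: 1st in Tamil (GLEU: 91.57), 1st in Hindi (GLEU: 85.69), 2nd in Telugu (GLEU: 85.22), 4th in Bangla (GLEU: 92.86), and 5th in Malayalam (GLEU: 92.97).

Conclusion: These findings highlight the effectiveness of prompt-driven NLP techniques and underscore the potential of large-scale LLMs to bridge resource gaps in multilingual GEC, offering a new path for grammatical error correction in low-resource languages.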


📄 Abstract

Grammatical error correction (GEC) is an important task in Natural Language Processing that aims to automatically detect and correct grammatical mistakes in text. While recent advances in transformer-based models and large annotated datasets have greatly improved GEC performance for high-resource languages such as English, the progress has not extended equally. For most Indic languages, GEC remains a challenging task due to limited resources, linguistic diversity and complex morphology. In this work, we explore prompting-based approaches using state-of-the-art large language models (LLMs), such as GPT-4.1, Gemini-2.5 and LLaMA-4, combined with a few-shot strategy to adapt them to low-resource settings. We observe that even basic prompting strategies, such as zero-shot and few-shot approaches, enable these LLMs to substantially outperform fine-tuned Indic-language models like Sarvam-22B, thereby illustrating the exceptional multilingual generalization capabilities of contemporary LLMs for GEC. Our experiments show that carefully designed prompts and lightweight adaptation significantly enhance correction quality across multiple Indic languages. We achieved leading results in the shared task, ranking 1st in Tamil (GLEU: 91.57) and Hindi (GLEU: 85.69), 2nd in Telugu (GLEU: 85.22), 4th in Bangla (GLEU: 92.86), and 5th in Malayalam (GLEU: 92.97). These findings highlight the effectiveness of prompt-driven NLP techniques and underscore the potential of large-scale LLMs to bridge resource gaps in multilingual GEC.

[84] BengaliFig: A Low-Resource Challenge for Figurative and Culturally Grounded Reasoning in Bengali

Abdullah Al Sefat

🧩 TL;DR

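This paper presents BengaliFig, a compact challenge set in Bengali for evaluating the figurative and culturally grounded reasoning of large language models in a low-resource cultural context. The study reveals consistent weaknesses of frontier LLMs in metaphorical and culturally specific reasoning and provides a diagnostic tool for inclusive, culturally aware NLP evaluation.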


📘 Detailed Summary

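Motivation: Large language models excel on broad multilingual benchmarks, but their figurative and culturally grounded reasoning remains insufficiently evaluated, especially in low-resource contexts. This study fills that evaluation gap for Bengali, a widely spoken but low-resourced language, focusing on how well models understand culturally specific content and metaphorical reasoning.

Method: The authors build BengaliFig, a dataset of 435 unique riddles drawn from Bengali oral and literary traditions. Each item is annotated along five orthogonal dimensions and automatically converted to multiple-choice format through a constraint-aware, AI-assisted pipeline. Eight frontier LLMs from major providers are evaluated under zero-shot and few-shot chain-of-thought prompting.

Result: Frontier LLMs show consistent weaknesses in metaphorical and culturally specific reasoning. Models perform poorly on riddles that require deep cultural background knowledge and figurative thinking, revealing the limited robustness of current multilingual models in low-resource cultural contexts.

Conclusion: BengaliFig provides a diagnostic probe for evaluating LLM robustness in low-resource cultural contexts and advances inclusive, heritage-aware NLP evaluation. The study underscores the importance of accounting for cultural depth and figurative reasoning in model evaluation and points the way toward more culturally sensitive language models.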


📄 Abstract

Large language models excel on broad multilingual benchmarks but remain to be evaluated extensively in figurative and culturally grounded reasoning, especially in low-resource contexts. We present BengaliFig, a compact yet richly annotated challenge set that targets this gap in Bengali, a widely spoken low-resourced language. The dataset contains 435 unique riddles drawn from Bengali oral and literary traditions. Each item is annotated along five orthogonal dimensions capturing reasoning type, trap type, cultural depth, answer category, and difficulty, and is automatically converted to multiple-choice format through a constraint-aware, AI-assisted pipeline. We evaluate eight frontier LLMs from major providers under zero-shot and few-shot chain-of-thought prompting, revealing consistent weaknesses in metaphorical and culturally specific reasoning. BengaliFig thus contributes both a diagnostic probe for evaluating LLM robustness in low-resource cultural contexts and a step toward inclusive and heritage-aware NLP evaluation.

cs.AI [Back]

[85] Scaling Agentic Reinforcement Learning for Tool-Integrated Reasoning in VLMs

Meng Lu, Ran Xu, Yi Fang, Wenxuan Zhang, Yue Yu, Gaurav Srivastava, Yuchen Zhuang, Mohamed Elhoseiny, Charles Fleming, Carl Yang, Zhengzhong Tu, Yang Xie, Guanghua Xiao, Hanrui Wang, Di Jin, Wenqi Shi, Xuan Wang

🧩 TL;DR

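This paper presents VISTA-Gym, a scalable training environment for vision-language models that strengthens multi-step visual reasoning. VISTA-R1, trained in this environment, significantly outperforms comparable models on 11 VQA benchmarks, demonstrating the effectiveness of tool-integrated reasoning.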


📘 Detailed Summary

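Motivation: Current vision-language models have strong image understanding but remain limited in multi-step visual interactive reasoning, particularly in tool selection, invocation, and coordination. The study aims to remove this bottleneck in tool-integrated reasoning.

Method: VISTA-Gym unifies multimodal reasoning tasks (7 tasks from 13 datasets) and provides a standardized interface for visual tools, executable interaction loops, verifiable feedback signals, and efficient trajectory logging. VISTA-R1 is trained with multi-turn trajectory sampling and end-to-end reinforcement learning to interleave tool use with reasoning.

Result: Across 11 public reasoning-intensive VQA benchmarks, VISTA-R1-8B outperforms state-of-the-art baselines of similar size by 9.51%-18.72%, whereas both proprietary and open-source models still struggle with tool-integrated reasoning.

Conclusion: VISTA-Gym is an effective training ground that substantially improves the tool-integrated reasoning of vision-language models, offering a viable path toward stronger visual reasoning agents and demonstrating the effectiveness of reinforcement learning for coordinating visual tools.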


📄 Abstract

While recent vision-language models (VLMs) demonstrate strong image understanding, their ability to "think with images", i.e., to reason through multi-step visual interactions, remains limited. We introduce VISTA-Gym, a scalable training environment for incentivizing tool-integrated visual reasoning capabilities in VLMs. VISTA-Gym unifies diverse real-world multimodal reasoning tasks (7 tasks from 13 datasets in total) with a standardized interface for visual tools (e.g., grounding, parsing), executable interaction loops, verifiable feedback signals, and efficient trajectory logging, enabling visual agentic reinforcement learning at scale. While recent VLMs exhibit strong text-only reasoning, both proprietary and open-source models still struggle with tool selection, invocation, and coordination. With VISTA-Gym, we train VISTA-R1 to interleave tool-use with agentic reasoning via multi-turn trajectory sampling and end-to-end reinforcement learning. Extensive experiments across 11 public reasoning-intensive VQA benchmarks show that VISTA-R1-8B outperforms state-of-the-art baselines with similar sizes by 9.51%-18.72%, demonstrating VISTA-Gym as an effective training ground to unlock the tool-integrated reasoning capabilities for VLMs.

[86] Agentic AI-Empowered Conversational Embodied Intelligence Networks in 6G

Mingkai Chen, Zijie Feng, Lei Wang, Yaser Khamayseh

🧩 TL;DR

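This paper proposes a collaborative Conversational Embodied Intelligence Network (CC-EIN) that uses multimodal feature fusion, adaptive semantic communication, and interpretable decision-making to address collaboration among multiple embodied intelligent devices executing complex tasks in the 6G era, achieving a 95.4% task completion rate and 95% transmission efficiency in post-earthquake rescue scenarios.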


📘 Detailed Summary

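Motivation: In the 6G era, semantic collaboration among multiple embodied intelligent devices (MEIDs) is crucial for complex task execution, but existing systems face challenges in multimodal information fusion, adaptive communication, and decision interpretability. These limitations must be addressed to achieve efficient and reliable device collaboration.

Method: The CC-EIN framework integrates multimodal feature fusion, adaptive semantic communication, task coordination, and interpretability. PerceptiNet performs cross-modal fusion of image and radar data to generate unified semantic representations; an adaptive semantic communication strategy dynamically adjusts coding schemes and transmission power according to task urgency and channel quality; a semantic-driven collaboration mechanism supports task decomposition and conflict-free coordination among heterogeneous devices; and the InDec module enhances decision transparency through Grad-CAM visualization.

Result: In simulations of post-earthquake rescue scenarios, CC-EIN achieves a 95.4% task completion rate and 95% transmission efficiency while maintaining strong semantic consistency and energy efficiency, demonstrating the framework's effectiveness and reliability in complex collaborative tasks.

Conclusion: The study offers a complete solution for collaboration among multiple embodied intelligent devices in the 6G era. Its semantic-driven collaboration framework achieves efficient task execution and reliable communication while improving the transparency and interpretability of decision-making, providing a useful reference for future collaborative intelligent device systems.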


📄 Abstract

In the 6G era, semantic collaboration among multiple embodied intelligent devices (MEIDs) becomes crucial for complex task execution. However, existing systems face challenges in multimodal information fusion, adaptive communication, and decision interpretability. To address these limitations, we propose a collaborative Conversational Embodied Intelligence Network (CC-EIN) integrating multimodal feature fusion, adaptive semantic communication, task coordination, and interpretability. PerceptiNet performs cross-modal fusion of image and radar data to generate unified semantic representations. An adaptive semantic communication strategy dynamically adjusts coding schemes and transmission power according to task urgency and channel quality. A semantic-driven collaboration mechanism further supports task decomposition and conflict-free coordination among heterogeneous devices. Finally, the InDec module enhances decision transparency through Grad-CAM visualization. Simulation results in post-earthquake rescue scenarios demonstrate that CC-EIN achieves 95.4% task completion rate and 95% transmission efficiency while maintaining strong semantic consistency and energy efficiency.

[87] M$^3$Prune: Hierarchical Communication Graph Pruning for Efficient Multi-Modal Multi-Agent Retrieval-Augmented Generation

Weizi Shao, Taolin Zhang, Zijie Zhou, Chen Chen, Chengyu Wang, Xiaofeng He

🧩 TL;DR

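This paper proposes M³Prune, a multi-modal multi-agent hierarchical communication graph pruning framework that significantly reduces the token overhead and computational cost of multi-modal retrieval-augmented generation systems while preserving task performance. By eliminating redundant edges across modalities and building a dynamic communication topology, it balances efficiency and performance in multi-agent systems.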


📘 Detailed Summary

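Motivation: Existing multi-agent multi-modal retrieval-augmented generation systems gain substantial performance from effective inter-agent communication, but they inherently incur high token overhead and computational cost, which challenges large-scale deployment. While multi-agent systems deliver the benefits of collective intelligence, their communication introduces many redundant interactions, so the balance between communication efficiency and performance needs to be optimized.

Method: M³Prune first applies graph sparsification within the textual and visual modalities to identify the edges most critical for solving the task. It then builds a dynamic communication topology from these key edges for inter-modal graph sparsification, and finally prunes redundant edges progressively to obtain a more efficient, hierarchical topology.

Result: Extensive experiments on general and domain-specific multi-modal retrieval-augmented generation benchmarks show that the method consistently outperforms both single-agent and robust multi-agent systems while significantly reducing token consumption. The framework substantially improves communication efficiency while maintaining high performance, supporting its effectiveness for practical deployment.

Conclusion: The study shows that a carefully designed communication graph pruning strategy can substantially improve the efficiency of multi-agent systems without sacrificing performance. M³Prune offers a practical technical path for deploying large-scale multi-modal retrieval-augmented generation systems and demonstrates the potential of hierarchical communication topologies for balancing performance and efficiency.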


📄 Abstract

Recent advancements in multi-modal retrieval-augmented generation (mRAG), which enhance multi-modal large language models (MLLMs) with external knowledge, have demonstrated that the collective intelligence of multiple agents can significantly outperform a single model through effective communication. Despite impressive performance, existing multi-agent systems inherently incur substantial token overhead and increased computational costs, posing challenges for large-scale deployment. To address these issues, we propose a novel Multi-Modal Multi-agent hierarchical communication graph PRUNING framework, termed M$^3$Prune. Our framework eliminates redundant edges across different modalities, achieving an optimal balance between task performance and token overhead. Specifically, M$^3$Prune first applies intra-modal graph sparsification to textual and visual modalities, identifying the edges most critical for solving the task. Subsequently, we construct a dynamic communication topology using these key edges for inter-modal graph sparsification. Finally, we progressively prune redundant edges to obtain a more efficient and hierarchical topology. Extensive experiments on both general and domain-specific mRAG benchmarks demonstrate that our method consistently outperforms both single-agent and robust multi-agent mRAG systems while significantly reducing token consumption.
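
As a rough illustration of progressive edge pruning on an agent communication graph, the sketch below keeps only the highest-scoring edges over several rounds. It is not the M$^3$Prune algorithm: the per-edge importance scores, the keep ratios, and the agent names are hypothetical, and the abstract does not describe how edge importance is actually estimated.

```python
import math


def keep_top_edges(edge_scores, keep_ratio):
    """Keep the highest-scoring fraction of edges (at least one if any exist)."""
    n_keep = max(1, math.ceil(len(edge_scores) * keep_ratio))
    ranked = sorted(edge_scores, key=edge_scores.get, reverse=True)
    return {edge: edge_scores[edge] for edge in ranked[:n_keep]}


def progressive_prune(edge_scores, keep_ratios):
    """Apply several pruning rounds, each on the edges that survived the last."""
    kept = dict(edge_scores)
    for ratio in keep_ratios:
        kept = keep_top_edges(kept, ratio)
    return kept


# Edges are (sender, receiver) agent pairs; scores are hypothetical importances.
scores = {
    ("text_a", "text_b"): 0.9,
    ("text_b", "vision_a"): 0.5,
    ("vision_a", "vision_b"): 0.3,
    ("vision_b", "text_a"): 0.1,
}
topology = progressive_prune(scores, keep_ratios=[0.75, 0.5])
```

Every edge removed is one fewer message passed between agents, which is where the token savings in such a scheme would come from.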

[88] "Are We Done Yet?": A Vision-Based Judge for Autonomous Task Completion of Computer Use Agents

Marta Sumyk, Oleksandr Kosovan

🧩 TL;DR

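This paper presents an autonomous evaluation framework based on vision-language models for assessing task completion by computer use agents. The framework detects task success directly from screenshots and task descriptions, improving the reliability and self-correction of autonomous agents.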


📘 Detailed Summary

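Motivation: Computer use agents that autonomously operate digital interfaces often cannot reliably determine whether a task has been completed. Existing systems lack an effective mechanism for evaluating task completion, so agents cannot accurately check execution results and correct themselves.

Method: The authors propose an autonomous evaluation and feedback framework that uses vision-language models to assess task completion directly from screenshots and task descriptions. They build a dataset covering 42 built-in macOS applications and 1,260 human-labeled tasks across a wide range of scenarios.

Result: The framework achieves up to 73% accuracy in task success detection, and applying evaluator feedback yields an average relative improvement of 27% in overall task success. The results show that vision-based evaluation can effectively improve agent reliability.

Conclusion: Vision-based evaluation can serve as an effective feedback mechanism that improves the reliability and self-correction of autonomous computer use agents. The approach offers a new technical path toward more reliable autonomous agent systems and has practical value.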


📄 Abstract

Computer Use Agents (CUAs) are designed to autonomously operate digital interfaces, yet they often fail to reliably determine whether a given task has been completed. We present an autonomous evaluation and feedback framework that uses vision-language models to assess task completion directly from screenshots and task descriptions. Our dataset covers 42 built-in macOS applications and 1,260 human-labeled tasks across a wide range of scenarios. Our framework achieves up to 73 percent accuracy in task success detection and yields an average relative improvement of 27 percent in overall task success when evaluator feedback is applied. These results show that vision-based evaluation can serve as an effective feedback mechanism that improves the reliability and self-correction of autonomous computer-use agents.
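
Because the 27 percent figure is a relative improvement, it is worth seeing how that differs from a gain in percentage points. The small Python example below uses made-up success rates chosen only so the arithmetic is easy to follow; they are not numbers reported by the paper.

```python
def relative_improvement(baseline, improved):
    """Relative change of `improved` over `baseline`, as a fraction."""
    return (improved - baseline) / baseline


baseline_success = 0.40  # hypothetical task success rate without feedback
with_feedback = 0.508    # hypothetical task success rate with evaluator feedback

gain = relative_improvement(baseline_success, with_feedback)
absolute_points = with_feedback - baseline_success
```

With these illustrative values the relative gain is about 0.27 (27 percent) while the absolute gain is about 10.8 percentage points, so the same relative improvement corresponds to a smaller absolute change when the baseline is low.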

[89] VICoT-Agent: A Vision-Interleaved Chain-of-Thought Framework for Interpretable Multimodal Reasoning and Scalable Remote Sensing Analysis

Chujie Wang, Zhiyuan Luo, Ruiqi Liu, Can Ran, Shenghua Fan, Xi Chen, Chu He

🧩 TL;DR

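This paper proposes Vision-Interleaved Chain-of-Thought (VICoT), a multimodal agent framework that achieves explicit multi-round reasoning by dynamically incorporating visual tools into the chain of thought, and significantly outperforms existing SOTA methods on multiple remote sensing benchmarks.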


📘 Detailed Summary

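Motivation: Remote sensing image analysis is evolving from traditional object recognition toward complex intelligent reasoning, placing higher demands on models' reasoning ability and the flexibility of tool invocation. Existing methods fall short in reasoning transparency and execution efficiency.

Method: VICoT uses a stack-based reasoning structure and a modular MCP-compatible tool suite, enabling LLMs to efficiently perform multi-round, interleaved vision-language reasoning. A Reasoning Stack distillation method transfers complex agent behaviors to lightweight models.

Result: Experiments on multiple remote sensing benchmarks show that VICoT significantly outperforms existing SOTA frameworks in reasoning transparency, execution efficiency, and generation quality, while the distillation method substantially reduces model complexity and preserves reasoning capability.

Conclusion: The study demonstrates the effectiveness of dynamically integrating visual tools into the chain of thought, providing a scalable solution for complex remote sensing image analysis. The proposed distillation method also offers a viable path for lightweight deployment of agent capabilities.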


📄 Abstract

Remote sensing image analysis is increasingly evolving from traditional object recognition to complex intelligent reasoning, which places higher requirements on the model's reasoning ability and the flexibility of tool invocation. To this end, we propose a new multimodal agent framework, Vision-Interleaved Chain-of-Thought Framework (VICoT), which implements explicit multi-round reasoning by dynamically incorporating visual tools into the chain of thought. Through a stack-based reasoning structure and a modular MCP-compatible tool suite, VICoT enables LLMs to efficiently perform multi-round, interleaved vision-language reasoning tasks with strong generalization and flexibility. We also propose the Reasoning Stack distillation method to migrate complex agent behaviors to small, lightweight models, which preserves reasoning capability while significantly reducing complexity. Experiments on multiple remote sensing benchmarks demonstrate that VICoT significantly outperforms existing SOTA frameworks in reasoning transparency, execution efficiency, and generation quality.

[90] Towards Benign Memory Forgetting for Selective Multimodal Large Language Model Unlearning

Zhen Zeng, Leijiang Gu, Zhangling Duan, Feng Li, Zenglin Shi, Cees G. M. Snoek, Meng Wang

🧩 TL;DR

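This paper proposes the Sculpted Memory Forgetting Adapter (SMFA), a new method for unlearning privacy-sensitive information in multimodal large language models that precisely removes sensitive knowledge while preserving general image understanding. It also introduces S-MLLMUn Bench, the first benchmark for selective MLLM unlearning.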


📘 Detailed Summary

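Motivation: Multimodal large language models achieve remarkable capabilities but can inadvertently memorize privacy-sensitive information. Existing unlearning methods can remove such knowledge, but they often degrade the model's general image understanding and therefore fail to achieve benign forgetting.

Method: SMFA first fine-tunes the model to replace sensitive responses with refusals, yielding a memory forgetting adapter, and then applies a retaining anchor-guided masking mechanism to prevent interference with unrelated knowledge and understanding ability.

Result: Extensive experiments show that, unlike prior methods, SMFA achieves precise and controllable unlearning while maintaining the model's foundational image understanding, and that it performs well on the S-MLLMUn Bench benchmark.

Conclusion: The study demonstrates that selective unlearning can protect privacy while maintaining a model's core capabilities, offers technical support for the safe deployment of multimodal large language models, and establishes a systematic evaluation framework to guide future research.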


📄 Abstract

Multimodal Large Language Models (MLLMs) achieve remarkable capabilities but can inadvertently memorize privacy-sensitive information. Although existing unlearning methods can remove such knowledge, they fail to achieve benign forgetting because they often degrade the model's general image understanding performance. To address this, we propose the Sculpted Memory Forgetting Adapter (SMFA), which confines forgetting to targeted memory regions while preserving overall capabilities. SMFA first fine-tunes the model to replace sensitive responses with refusals, yielding a memory forgetting adapter, and then applies a retaining anchor-guided masking mechanism to prevent interference with unrelated knowledge and understanding ability. To systematically evaluate selective MLLM unlearning, we introduce S-MLLMUn Bench, the first benchmark designed to jointly assess the removal of sensitive knowledge and retention of general visual understanding. Extensive experiments show that, unlike prior methods, SMFA achieves precise and controllable unlearning while maintaining the model's foundational image understanding.

[91] Interactive AI NPCs Powered by LLMs: Technical Report for the CPDC Challenge 2025

Yitian Huang, Yuxuan Lei, Jianxun Lian, Hao Liao

🧩 TL;DR

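This paper presents a simple yet effective unified framework for the CPDC 2025 challenge that uses context engineering and GRPO reinforcement learning to significantly improve a commonsense persona-grounded dialogue system, achieving top rankings in several tasks.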


📘 Detailed Summary

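Motivation: The work targets tool call stability, execution reliability, and role-playing guidance in commonsense persona-grounded dialogue systems, while also mitigating small-sample overfitting and improving overall task-oriented dialogue performance.

Method: The method has two core components. Context Engineering applies dynamic tool pruning and persona clipping for input compression, combined with post-processing techniques such as parameter normalization and function merging. In the GPU Track, GRPO training is additionally adopted, replacing supervised fine-tuning with reinforcement learning optimized directly by reward signals.

Result: In the final evaluation, the team ranked 1st in Task 2 API, 2nd in Task 1 API, and 3rd in both Task 3 API and the GPU Track, demonstrating the effectiveness of the proposed method.

Conclusion: The study shows that a unified framework combining context engineering and reinforcement learning can significantly improve the stability and performance of dialogue systems, providing an effective solution for building more reliable commonsense persona-grounded dialogue systems. The code has been released publicly for community use.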


📄 Abstract

This report presents the solution and results of our team MSRA_SC in the Commonsense Persona-Grounded Dialogue Challenge (CPDC 2025). We propose a simple yet effective framework that unifies improvements across both GPU Track and API Track. Our method centers on two key components. First, Context Engineering applies dynamic tool pruning and persona clipping for input compression, combined with post-processing techniques such as parameter normalization and function merging. Together with manually refined prompts, this design improves tool call stability, execution reliability, and role-playing guidance. Second, in the GPU Track, we further adopt GRPO training, replacing supervised fine-tuning with reinforcement learning directly optimized by reward signals. This mitigates small-sample overfitting and significantly enhances task-oriented dialogue performance. In the final evaluation, our team ranks 1st in Task 2 API, 2nd in Task 1 API, and 3rd in both Task 3 API and GPU track, demonstrating the effectiveness of our approach. Our code is publicly available at https://gitlab.aicrowd.com/nikoo_yu/cpdc-2025-winning-solution
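
The GRPO step mentioned above replaces a learned value baseline with statistics computed over a group of responses sampled for the same prompt. The following is a minimal sketch of that group-relative advantage computation as GRPO is commonly described, not the team's actual training code. The function name `group_relative_advantages` and the example reward values are hypothetical, and the report does not describe its reward design here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled response relative to its own group.

    Each reward is centered on the group mean and scaled by the group
    standard deviation, so no separate value network is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for one dialogue turn.
group_rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(group_rewards)
```

Responses that score above the group mean receive positive advantages and are reinforced, while those below the mean are discouraged. When all responses in a group score the same, every advantage is zero and the group contributes no gradient signal.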

[92] NNGPT: Rethinking AutoML with Large Language Models

Roman Kochnev, Waleed Khalid, Tolgay Atinc Uzun, Xi Zhang, Yashkumar Sanjaybhai Dhameliya, Furui Qin, Chandini Vysyaraju, Raghuvir Duvvuri, Avi Goyal, Dmitry Ignatov, Radu Timofte

🧩 TL;DR

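NNGPT is an open-source framework that turns a large language model into a self-improving AutoML engine for neural network development. It extends the neural network dataset by generating new models, enabling continuous fine-tuning through a closed loop of generation, assessment, and self-improvement.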


📘 Detailed Summary

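Motivation: Building self-improving AI systems remains a fundamental challenge. Existing frameworks cannot continuously self-optimize neural network development, which calls for an AutoML solution that can extend the neural network dataset and learn in a closed loop.

Method: NNGPT integrates five synergistic LLM pipelines: zero-shot architecture synthesis, hyperparameter optimization, code-aware accuracy/early-stop prediction, retrieval-augmented PyTorch block synthesis, and reinforcement learning. It builds on the LEMUR dataset as a reproducible, audited corpus and is made framework-agnostic through a PyTorch adapter.

Result: NN-RAG reaches 73% executability on 1,289 targets, 3-shot prompting improves accuracy on common datasets, and hash-based deduplication saves hundreds of runs. Hyperparameter optimization on LEMUR reaches RMSE 0.60 and outperforms Optuna, the code-aware predictor reaches RMSE 0.14 with Pearson r=0.78, and the system has generated over 5K validated models.

Conclusion: NNGPT is shown to work as an autonomous AutoML engine that automates neural network development end to end within a unified workflow. Its open-source release is intended to support reproducibility and community use, and it provides a practical framework for self-improving AI systems.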


📄 Abstract

Building self-improving AI systems remains a fundamental challenge in the AI domain. We present NNGPT, an open-source framework that turns a large language model (LLM) into a self-improving AutoML engine for neural network development, primarily for computer vision. Unlike previous frameworks, NNGPT extends the dataset of neural networks by generating new models, enabling continuous fine-tuning of LLMs based on a closed-loop system of generation, assessment, and self-improvement. It integrates five synergistic LLM-based pipelines within one unified workflow: zero-shot architecture synthesis, hyperparameter optimization (HPO), code-aware accuracy/early-stop prediction, retrieval-augmented synthesis of scope-closed PyTorch blocks (NN-RAG), and reinforcement learning. Built on the LEMUR dataset as an audited corpus with reproducible metrics, NNGPT emits and validates network architecture, preprocessing code, and hyperparameters from a single prompt, executes them end-to-end, and learns from the results. The PyTorch adapter makes NNGPT framework-agnostic, enabling strong performance: NN-RAG achieves 73% executability on 1,289 targets, 3-shot prompting boosts accuracy on common datasets, and hash-based deduplication saves hundreds of runs. One-shot prediction matches search-based AutoML, reducing the need for numerous trials. HPO on LEMUR achieves RMSE 0.60, outperforming Optuna (0.64), while the code-aware predictor reaches RMSE 0.14 with Pearson r=0.78. The system has already generated over 5K validated models, proving NNGPT as an autonomous AutoML engine. Upon acceptance, the code, prompts, and checkpoints will be released for public access to enable reproducibility and facilitate community usage.
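
The abstract credits hash-based deduplication with saving hundreds of runs but does not say how candidates are hashed. The sketch below shows one plausible form of the idea: fingerprint each generated model's source code and skip any candidate whose fingerprint has already been seen. The whitespace normalization, the choice of SHA-256, and the names `code_fingerprint` and `deduplicate` are assumptions for illustration, not NNGPT's actual implementation.

```python
import hashlib


def code_fingerprint(source: str) -> str:
    """Hash model source code after light normalization (an assumption here).

    Blank lines and trailing whitespace are dropped so that purely cosmetic
    differences do not produce distinct fingerprints.
    """
    lines = [line.rstrip() for line in source.splitlines() if line.strip()]
    normalized = "\n".join(lines)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(candidates):
    """Keep the first occurrence of each distinct fingerprint, in order."""
    seen = set()
    unique = []
    for source in candidates:
        fingerprint = code_fingerprint(source)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(source)
    return unique


# Hypothetical generated snippets: the second differs from the first only
# in whitespace, so it would not be trained again.
candidates = ["x = 1\ny = 2\n", "x = 1   \n\ny = 2", "x = 2\n"]
unique_candidates = deduplicate(candidates)
```

Each duplicate caught this way avoids one full training and evaluation run, which is presumably where the reported savings come from.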

[93] VibraVerse: A Large-Scale Geometry-Acoustics Alignment Dataset for Physically-Consistent Multimodal Learning

Bo Pang, Chenxi Xu, Jierui Ren, Guoping Wang, Sheng Li

🧩 TL;DR

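This paper presents the VibraVerse dataset and the CLASP contrastive learning framework, which achieve physically consistent multimodal learning by establishing the causal chain 3D geometry → physical attributes → modal parameters → acoustic signals, providing a foundation for sound-based embodied perception.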


📘 Detailed Summary

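Motivation: Existing multimodal learning frameworks built on vision and language lack physical consistency. They overlook the intrinsic causal relationships among an object's geometry, material, vibration modes, and the sounds it produces, and so cannot truly understand the physical world.

Method: The authors build VibraVerse, a large-scale geometry-acoustics alignment dataset that contains the physical properties and volumetric geometry of 3D models, from which modal features are computed for impact sound synthesis. They also propose CLASP, a contrastive learning framework whose cross-modal alignment preserves the causal correspondence between physical structure and acoustic response.

Result: Extensive validation on benchmark tasks, including geometry-to-sound prediction, sound-guided shape reconstruction, and cross-modal representation learning, shows that models trained on VibraVerse are superior in accuracy, interpretability, and cross-modal generalization.

Conclusion: VibraVerse establishes a benchmark for physically consistent and causally interpretable multimodal learning and provides a foundation for sound-guided embodied perception and a deeper understanding of the physical world. The dataset will be open-sourced to support related research.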


📄 Abstract

Understanding the physical world requires perceptual models grounded in physical laws rather than mere statistical correlations. However, existing multimodal learning frameworks, focused on vision and language, lack physical consistency and overlook the intrinsic causal relationships among an object's geometry, material, vibration modes, and the sounds it produces. We introduce VibraVerse, a large-scale geometry-acoustics alignment dataset that explicitly bridges the causal chain from 3D geometry -> physical attributes -> modal parameters -> acoustic signals. Each 3D model has explicit physical properties (density, Young's modulus, Poisson's ratio) and volumetric geometry, from which modal eigenfrequencies and eigenvectors are computed for impact sound synthesis under controlled excitations. To establish this coherence, we introduce CLASP, a contrastive learning framework for cross-modal alignment that preserves the causal correspondence between an object's physical structure and its acoustic response. This framework enforces physically consistent alignment across modalities, ensuring that every sample is coherent, traceable to the governing equations, and embedded within a unified representation space spanning shape, image, and sound. Built upon VibraVerse, we define a suite of benchmark tasks for geometry-to-sound prediction, sound-guided shape reconstruction, and cross-modal representation learning. Extensive validations on these tasks demonstrate that models trained on VibraVerse exhibit superior accuracy, interpretability, and generalization across modalities. These results establish VibraVerse as a benchmark for physically consistent and causally interpretable multimodal learning, providing a foundation for sound-guided embodied perception and a deeper understanding of the physical world. The dataset will be open-sourced.
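
The last link of the causal chain, from modal parameters to acoustic signals, is often modeled with classical modal synthesis: each vibration mode contributes an exponentially damped sinusoid at its eigenfrequency, and the impact sound is their sum. The sketch below illustrates that general technique only. The mode frequencies, damping rates, and amplitudes are made-up values, and the function `synthesize_impact` is hypothetical rather than taken from the VibraVerse pipeline.

```python
import math


def synthesize_impact(modes, duration=0.5, sample_rate=16000):
    """Sum exponentially damped sinusoids, one per vibration mode.

    Each mode is a (frequency_hz, damping_per_second, amplitude) tuple.
    """
    num_samples = int(duration * sample_rate)
    samples = []
    for i in range(num_samples):
        t = i / sample_rate
        value = sum(
            amp * math.exp(-damping * t) * math.sin(2.0 * math.pi * freq * t)
            for freq, damping, amp in modes
        )
        samples.append(value)
    return samples


# Hypothetical modes; in a physically based pipeline these would come from
# an eigenanalysis of the object's geometry and material properties.
modes = [(440.0, 8.0, 1.0), (1230.0, 15.0, 0.5), (2750.0, 30.0, 0.25)]
signal = synthesize_impact(modes)
```

In this model, the object's geometry and material determine the mode frequencies and damping rates, which is why the resulting sound carries information about shape and physical properties.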