cs.CV [Total: 52]
cs.CL [Total: 5]
cs.AI [Total: 10]

cs.CV

[1] Leveraging AI multimodal geospatial foundation models for improved near-real-time flood mapping at a global scale

Mirela G. Tulbure, Julio Caineta, Mark Broich, Mollie D. Gaines, Philippe Rufin, Leon-Friedrich Thomas, Hamed Alemohammad, Jan Hemmerling, Patrick Hostert

🧩 TL;DR

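This study fine-tunes the geospatial foundation model TerraMind for flood extent mapping and provides one of the first global-scale evaluations of a GFM on flood segmentation, showing the potential of combining multimodal optical and SAR data with fine-tuning for near-real-time flood mapping.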

📘 Detailed Summary

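Motivation: Floods are among the most damaging weather-related hazards, and extreme flood events in 2024 affected communities across five continents. Earth observation satellites provide the frequent coverage that flood mapping needs, but operational accuracy depends heavily on labeled datasets and model generalization. Recent geospatial foundation models (GFMs) offer improved generalization through large-scale self-supervised pretraining, yet their performance on diverse global flood events remains poorly understood.

Method: The study fine-tunes ESA-IBM's TerraMind for flood extent mapping on FloodsNet, a dataset of co-located Sentinel-1 SAR and Sentinel-2 optical imagery covering 85 flood events worldwide. Four configurations are tested (base vs. large model; frozen vs. unfrozen backbone) and compared against the TerraMind Sen1Floods11 example and a U-Net trained on both FloodsNet and Sen1Floods11.

Result: The base-unfrozen configuration gives the best balance of accuracy, precision, and recall at substantially lower computational cost than the large model. The large unfrozen model achieves the highest recall among the GFM configurations. Models trained on FloodsNet outperform the Sen1Floods11-trained example in recall at similar overall accuracy. The U-Net achieves higher recall than every GFM configuration, with slightly lower accuracy and precision.

Conclusion: The results indicate that integrating multimodal optical and SAR data and fine-tuning a GFM can enhance near-real-time flood mapping. The study provides one of the first global-scale evaluations of a GFM for flood segmentation, highlights both its potential and its current limitations for climate adaptation and disaster resilience, and offers a benchmark for future improvements.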

📄 Abstract

Floods are among the most damaging weather-related hazards, and in 2024, the warmest year on record, extreme flood events affected communities across five continents. Earth observation (EO) satellites provide critical, frequent coverage for mapping inundation, yet operational accuracy depends heavily on labeled datasets and model generalization. Recent Geospatial Foundation Models (GFMs), such as ESA-IBM's TerraMind, offer improved generalizability through large-scale self-supervised pretraining, but their performance on diverse global flood events remains poorly understood. We fine-tune TerraMind for flood extent mapping using FloodsNet, a harmonized multimodal dataset containing co-located Sentinel-1 (Synthetic Aperture Radar, SAR data) and Sentinel-2 (optical) imagery for 85 flood events worldwide. We tested four configurations (base vs. large models; frozen vs. unfrozen backbones) and compared against the TerraMind Sen1Floods11 example and a U-Net trained on both FloodsNet and Sen1Floods11. The base-unfrozen configuration provided the best balance of accuracy, precision, and recall at substantially lower computational cost than the large model. The large unfrozen model achieved the highest recall. Models trained on FloodsNet outperformed the Sen1Floods11-trained example in recall with similar overall accuracy. U-Net achieved higher recall than all GFM configurations, though with slightly lower accuracy and precision. Our results demonstrate that integrating multimodal optical and SAR data and fine-tuning a GFM can enhance near-real-time flood mapping. This study provides one of the first global-scale evaluations of a GFM for flood segmentation, highlighting both its potential and current limitations for climate adaptation and disaster resilience.
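The comparison above is reported in terms of accuracy, precision, and recall. As a minimal sketch of how such pixel-wise scores can be computed for a binary flood mask, the function below counts true and false positives and negatives. The function name `segmentation_metrics` and the convention that 1 marks a flooded pixel are assumptions for illustration, not details taken from the paper.

```python
def segmentation_metrics(pred, truth):
    """Pixel-wise accuracy, precision, and recall for binary masks.

    `pred` and `truth` are equally sized 2D lists of 0/1 values.
    Assumption for this sketch: 1 marks a flooded pixel.
    """
    tp = fp = fn = tn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p == 1 and t == 1:
                tp += 1  # flooded pixel correctly detected
            elif p == 1 and t == 0:
                fp += 1  # dry pixel wrongly marked as flooded
            elif p == 0 and t == 1:
                fn += 1  # flooded pixel that was missed
            else:
                tn += 1  # dry pixel correctly left unmarked
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }
```

A model with high recall but lower precision, as reported for the U-Net, would miss few flooded pixels (small `fn`) while over-predicting flood extent (larger `fp`).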

[2] Context-Enriched Contrastive Loss: Enhancing Presentation of Inherent Sample Connections in Contrastive Learning Framework

Haojin Deng, Yimin Yang

🧩 TL;DR

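This paper proposes a context-enriched contrastive loss that optimizes two convergence targets at once to improve contrastive learning and address information distortion, outperforming 16 state-of-the-art contrastive learning methods on several large-scale benchmark datasets.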

📘 Detailed Summary

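Motivation: The contrastive loss is effective at discerning similarity between samples, but it can introduce information distortion from augmented samples: the model may over-rely on information from samples with identical labels while neglecting positive pairs that originate from the same source image, a problem that is more pronounced on large datasets.

Method: The proposed context-enriched contrastive loss has two components, each with its own convergence target. The first is sensitive to label contrast and separates features of identical and distinct classes to improve contrastive training efficiency. The second pulls augmented samples from the same source image closer together while pushing all other samples away, which addresses the information distortion.

Result: Experiments on eight large-scale benchmark datasets (CIFAR10, CIFAR100, Caltech-101, Caltech-256, ImageNet, BiasedMNIST, UTKFace, and CelebA) show that the method outperforms 16 state-of-the-art contrastive learning methods in both generalization performance and convergence speed. On BiasedMNIST it improves on the original contrastive loss by 22.9%, standing out on tasks with systematic distortion.

Conclusion: The context-enriched contrastive loss effectively addresses information distortion in contrastive learning. It improves model performance and learning efficiency and shows a clear advantage on tasks with systematic bias, offering a promising route to more efficient and equitable downstream training.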

📄 Abstract

Contrastive learning has gained popularity and pushed state-of-the-art performance across numerous large-scale benchmarks. In contrastive learning, the contrastive loss function plays a pivotal role in discerning similarities between samples through techniques such as rotation or cropping. However, this learning mechanism can also introduce information distortion from the augmented samples. This is because the trained model may develop a significant overreliance on information from samples with identical labels, while concurrently neglecting positive pairs that originate from the same initial image, especially in expansive datasets. This paper proposes a context-enriched contrastive loss function that concurrently improves learning effectiveness and addresses the information distortion by encompassing two convergence targets. The first component, which is notably sensitive to label contrast, differentiates between features of identical and distinct classes, which boosts the contrastive training efficiency. Meanwhile, the second component draws closer the augmented samples from the same source image and distances all other samples. We evaluate the proposed approach on image classification tasks using eight widely accepted large-scale benchmark datasets: CIFAR10, CIFAR100, Caltech-101, Caltech-256, ImageNet, BiasedMNIST, UTKFace, and CelebA. The experimental results demonstrate that the proposed method achieves improvements over 16 state-of-the-art contrastive learning methods in terms of both generalization performance and learning convergence speed. Interestingly, our technique stands out in addressing systematic distortion tasks. It demonstrates a 22.9% improvement compared to original contrastive loss functions in the downstream BiasedMNIST dataset, highlighting its promise for more efficient and equitable downstream training.
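To make the two-component idea concrete, here is a rough sketch that sums two InfoNCE-style terms: one whose positives share a label, and one whose positives come from the same source image. This is not the paper's formulation, which the abstract does not spell out. The function names, the `weight` and `temperature` parameters, and the choice of a plain InfoNCE term for both components are all assumptions for illustration.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def contrastive_term(embeddings, positives, temperature=0.5):
    """Mean InfoNCE-style loss; positives[i] lists the positive indices of anchor i."""
    z = [_normalize(v) for v in embeddings]
    losses = []
    for i, pos in enumerate(positives):
        others = [j for j in range(len(z)) if j != i]
        # Denominator runs over every sample except the anchor itself.
        denom = sum(math.exp(_dot(z[i], z[j]) / temperature) for j in others)
        for j in pos:
            numer = math.exp(_dot(z[i], z[j]) / temperature)
            losses.append(-math.log(numer / denom))
    return sum(losses) / len(losses) if losses else 0.0


def two_target_loss(embeddings, labels, source_ids, weight=1.0, temperature=0.5):
    """Hypothetical combination: label-level term plus source-image-level term."""
    n = len(embeddings)
    label_pos = [[j for j in range(n) if j != i and labels[j] == labels[i]]
                 for i in range(n)]
    source_pos = [[j for j in range(n) if j != i and source_ids[j] == source_ids[i]]
                  for i in range(n)]
    return (contrastive_term(embeddings, label_pos, temperature)
            + weight * contrastive_term(embeddings, source_pos, temperature))
```

The second term treats only views of the same source image as positives, so it keeps pulling those views together even when many other samples share the label, which is the behavior the abstract attributes to the second component.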

[3] FineGRAIN: Evaluating Failure Modes of Text-to-Image Models with Vision Language Model Judges

Kevin David Hayes, Micah Goldblum, Vikash Sehwag, Gowthami Somepalli, Ashwinee Panda, Tom Goldstein

🧩 TL;DR

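This study proposes a hierarchical framework for jointly evaluating text-to-image (T2I) models and vision language models (VLMs) by testing whether VLMs can identify 27 specific failure modes in T2I-generated images. It exposes the shortcomings of existing evaluation metrics and contributes a dataset covering 5 generative models and annotations from 3 VLMs.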

📘 Detailed Summary

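Motivation: T2I models often fail to capture specific attributes in user prompts, such as the correct number of objects or the specified colors, and existing evaluation frameworks lack a hierarchical analysis of these diverse errors. Meanwhile, VLM benchmarks have not kept pace with the complexity of the scenes VLMs are used to annotate, which makes it hard to evaluate prompt adherence in generative models.

Method: The study proposes a structured methodology that jointly evaluates both model types by testing whether VLMs can identify 27 specific failure modes in images that T2I models generate from challenging prompts. It builds a dataset of images generated by 5 T2I models and annotated by 3 VLMs, uses an LLM to check whether the annotations are correct, and systematically analyzes errors in attribute fidelity and object representation.

Result: Failure-mode analysis on a curated prompt set reveals systematic errors in attribute fidelity and object representation. The experiments use 5 generative models (Flux, SD3-Medium, SD3-Large, SD3.5-Medium, SD3.5-Large) and 3 VLMs (Molmo, InternVL3, Pixtral), with Llama3 verifying annotation quality. The results suggest that current evaluation metrics fail to capture these nuanced errors.

Conclusion: The study underscores the importance of targeted benchmarks for improving the reliability and interpretability of generative models. The joint evaluation framework gives a more complete picture of both T2I and VLM performance and provides a systematic error analysis method for improving future generative models.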

📄 Abstract

Text-to-image (T2I) models are capable of generating visually impressive images, yet they often fail to accurately capture specific attributes in user prompts, such as the correct number of objects with the specified colors. The diversity of such errors underscores the need for a hierarchical evaluation framework that can compare prompt adherence abilities of different image generation models. Simultaneously, benchmarks of vision language models (VLMs) have not kept pace with the complexity of scenes that VLMs are used to annotate. In this work, we propose a structured methodology for jointly evaluating T2I models and VLMs by testing whether VLMs can identify 27 specific failure modes in the images generated by T2I models conditioned on challenging prompts. Our second contribution is a dataset of prompts and images generated by 5 T2I models (Flux, SD3-Medium, SD3-Large, SD3.5-Medium, SD3.5-Large) and the corresponding annotations from VLMs (Molmo, InternVL3, Pixtral) annotated by an LLM (Llama3) to test whether VLMs correctly identify the failure mode in a generated image. By analyzing failure modes on a curated set of prompts, we reveal systematic errors in attribute fidelity and object representation. Our findings suggest that current metrics are insufficient to capture these nuanced errors, highlighting the importance of targeted benchmarks for advancing generative model reliability and interpretability.

[4] Towards Unified Video Quality Assessment

Chen Feng, Tianhao Peng, Fan Zhang, David Bull

🧩 TL;DR

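This paper proposes Unified-VQA, a framework that recasts generic video quality assessment as a diagnostic mixture-of-experts problem, yielding a single quality model that applies to multiple video formats and distortion types while also producing an interpretable multi-dimensional artifact vector.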

📘 Detailed Summary

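Motivation: Current video quality assessment (VQA) methods have two main limitations. First, monolithic models that predict a single quality score cannot provide diagnostic, interpretable feedback. Second, most methods are specialized, format-specific metrics rather than truly generic solutions, because they learn a compromised representation from disparate perceptual domains.

Method: Unified-VQA recasts generic VQA as a diagnostic mixture-of-experts problem with multiple "perceptual experts", each dedicated to a distinct perceptual domain. A multi-proxy expert training strategy optimizes each expert with a ranking-inspired loss, guided by the proxy metric best suited to its domain. A diagnostic multi-task head produces a global quality score and an interpretable multi-dimensional artifact vector, and is optimized with a weakly supervised learning strategy.

Result: With static model parameters (no retraining or fine-tuning), Unified-VQA shows consistent and superior performance against more than 18 benchmark methods on both generic VQA and diagnostic artifact detection, across 17 databases containing diverse streaming artifacts in HD, UHD, HDR, and HFR formats.

Conclusion: The work is an important step toward practical, actionable, and interpretable video quality assessment. Its unified framework addresses both format specificity and the lack of diagnostic capability.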

📄 Abstract

Recent works in video quality assessment (VQA) typically employ monolithic models that predict a single quality score for each test video. These approaches cannot provide diagnostic, interpretable feedback, offering little insight into why the video quality is degraded. Most of them are also specialized, format-specific metrics rather than truly "generic" solutions, as they are designed to learn a compromised representation from disparate perceptual domains. To address these limitations, this paper proposes Unified-VQA, a framework that provides a single, unified quality model applicable to various distortion types within multiple video formats by recasting generic VQA as a Diagnostic Mixture-of-Experts (MoE) problem. Unified-VQA employs multiple "perceptual experts" dedicated to distinct perceptual domains. A novel multi-proxy expert training strategy is designed to optimize each expert using a ranking-inspired loss, guided by the most suitable proxy metric for its domain. We also integrated a diagnostic multi-task head into this framework to generate a global quality score and an interpretable multi-dimensional artifact vector, which is optimized using a weakly-supervised learning strategy, leveraging the known properties of the large-scale training database generated for this work. With static model parameters (without retraining or fine-tuning), Unified-VQA demonstrates consistent and superior performance compared to over 18 benchmark methods for both generic VQA and diagnostic artifact detection tasks across 17 databases containing diverse streaming artifacts in HD, UHD, HDR and HFR formats. This work represents an important step towards practical, actionable, and interpretable video quality assessment.
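One common way to train against a proxy metric with a ranking objective is a pairwise margin loss: whenever the proxy ranks one video above another, the predicted scores should agree by at least a margin. The sketch below illustrates that general idea only. The abstract does not give the loss used in Unified-VQA, so the function name, the hinge form, and the `margin` value are assumptions.

```python
def pairwise_ranking_loss(predicted, proxy, margin=0.1):
    """Mean hinge loss over all pairs that the proxy metric orders strictly.

    `predicted` holds model scores, `proxy` holds proxy-metric scores for the
    same videos. Higher is assumed to mean better quality for both.
    """
    total = 0.0
    pairs = 0
    n = len(predicted)
    for i in range(n):
        for j in range(i + 1, n):
            if proxy[i] == proxy[j]:
                continue  # the proxy gives no ordering for this pair
            sign = 1.0 if proxy[i] > proxy[j] else -1.0
            # Penalize pairs whose predicted gap disagrees with the proxy
            # ordering or agrees by less than the margin.
            total += max(0.0, margin - sign * (predicted[i] - predicted[j]))
            pairs += 1
    return total / pairs if pairs else 0.0
```

Because only the ordering of proxy scores matters here, experts guided by proxies on different numeric scales can share the same loss form.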

[5] See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models

Le Thien Phuc Nguyen, Zhuoran Yu, Samuel Low Yu Hang, Subin An, Jeongik Lee, Yohan Ban, SeungEun Chung, Thanh-Huy Nguyen, JuWan Maeng, Soochahn Lee, Yong Jae Lee

🧩 TL;DR

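This paper introduces AV-SpeakerBench, a benchmark of 3,212 multiple-choice questions on speaker-centric audiovisual reasoning. It evaluates fine-grained speech understanding in multimodal large language models and fills gaps in existing video benchmarks around speaker identification, content alignment, and temporal grounding.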

📘 Detailed Summary

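Motivation: Existing video benchmarks rarely assess fine-grained reasoning about human speech in multimodal large language models (MLLMs). Many tasks can be solved from vision alone or evaluate speech only coarsely, so they offer little insight into whether models can align who speaks, what is said, and when it occurs.

Method: AV-SpeakerBench contains 3,212 curated multiple-choice questions. It uses a speaker-centered formulation that treats speakers, not scenes, as the core reasoning unit; a fusion-grounded question design that embeds audiovisual dependencies in the question semantics; and expert-curated annotations that ensure temporal precision and cross-modal validity.

Result: Comprehensive evaluation shows that the Gemini family consistently outperforms open-source systems, with Gemini 2.5 Pro achieving the best results. Among open models, Qwen3-Omni-30B approaches Gemini 2.0 Flash but remains far behind Gemini 2.5 Pro, mainly because of weaker audiovisual fusion rather than weaker visual perception.

Conclusion: The study establishes a rigorous evaluation foundation for advancing fine-grained audiovisual reasoning in future multimodal systems. It exposes the limits of current models on speaker-centric reasoning, in particular the weaker audiovisual fusion of open-source models.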

📄 Abstract

Multimodal large language models (MLLMs) are expected to jointly interpret vision, audio, and language, yet existing video benchmarks rarely assess fine-grained reasoning about human speech. Many tasks remain visually solvable or only coarsely evaluate speech, offering limited insight into whether models can align who speaks, what is said, and when it occurs. We introduce AV-SpeakerBench, a curated benchmark of 3,212 multiple-choice questions focused on speaker-centric audiovisual reasoning in real-world videos. It features: (1) a speaker-centered formulation that treats speakers, not scenes, as the core reasoning unit; (2) fusion-grounded question design embedding audiovisual dependencies into question semantics; and (3) expert-curated annotations ensuring temporal precision and cross-modal validity. Comprehensive evaluations show that the Gemini family consistently outperforms open-source systems, with Gemini 2.5 Pro achieving the best results. Among open models, Qwen3-Omni-30B approaches Gemini 2.0 Flash but remains far behind Gemini 2.5 Pro, primarily due to weaker audiovisual fusion rather than visual perception. We believe AV-SpeakerBench establishes a rigorous foundation for advancing fine-grained audiovisual reasoning in future multimodal systems.

[6] Progressive Image Restoration via Text-Conditioned Video Generation

Peng Kang, Xijun Wang, Yu Yuan

🧩 TL;DR

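This paper repurposes the CogVideo text-to-video model for progressive visual restoration. Fine-tuning it to generate restoration trajectories instead of natural video motion yields spatially and temporally consistent restoration on super-resolution, deblurring, and low-light enhancement.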

📘 Detailed Summary

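Motivation: Text-to-video models show strong temporal generation capabilities, but their potential for image restoration remains underexplored, in particular how temporal generation can be repurposed to produce a gradual transition from degraded to clean frames.

Method: CogVideo is fine-tuned to generate restoration trajectories rather than natural video motion. Synthetic datasets are built for super-resolution, deblurring, and low-light enhancement, with each sample depicting a gradual transition from degraded to clean frames. Two prompting strategies are compared: a uniform text prompt shared across all samples, and scene-specific prompts generated by the LLaVA multimodal LLM and refined with ChatGPT.

Result: The fine-tuned model learns to associate temporal progression with restoration quality, producing sequences whose PSNR, SSIM, and LPIPS scores improve from frame to frame. Experiments show that CogVideo restores spatial detail and illumination consistency while maintaining temporal coherence. The model also generalizes to real-world scenes in the ReLoBlur dataset without additional training, demonstrating strong zero-shot robustness and interpretability through temporal restoration.

Conclusion: The study shows that text-to-video models can be repurposed for visual restoration: tying the temporal dimension to restoration quality yields a progressive restoration process. The model shows good zero-shot generalization and temporal consistency, suggests a new paradigm for restoration tasks, and makes the restoration process interpretable through its temporal trajectory.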

📄 Abstract

Recent text-to-video models have demonstrated strong temporal generation capabilities, yet their potential for image restoration remains underexplored. In this work, we repurpose CogVideo for progressive visual restoration tasks by fine-tuning it to generate restoration trajectories rather than natural video motion. Specifically, we construct synthetic datasets for super-resolution, deblurring, and low-light enhancement, where each sample depicts a gradual transition from degraded to clean frames. Two prompting strategies are compared: a uniform text prompt shared across all samples, and a scene-specific prompting scheme generated via LLaVA multi-modal LLM and refined with ChatGPT. Our fine-tuned model learns to associate temporal progression with restoration quality, producing sequences that improve perceptual metrics such as PSNR, SSIM, and LPIPS across frames. Extensive experiments show that CogVideo effectively restores spatial detail and illumination consistency while maintaining temporal coherence. Moreover, the model generalizes to real-world scenarios on the ReLoBlur dataset without additional training, demonstrating strong zero-shot robustness and interpretability through temporal restoration.
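The claim that quality improves across frames can be checked by scoring every generated frame against the clean reference. The sketch below does this with PSNR, defined as 10·log10(MAX² / MSE). The helper names and the representation of frames as 2D lists of 8-bit intensities are assumptions for illustration; SSIM and LPIPS would need dedicated implementations and are not shown.

```python
import math


def psnr(frame, reference, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2D frames."""
    squared_errors = [
        (a - b) ** 2
        for frame_row, ref_row in zip(frame, reference)
        for a, b in zip(frame_row, ref_row)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


def improves_over_time(frames, reference):
    """Return (is_non_decreasing, per_frame_psnr) for a restoration trajectory."""
    scores = [psnr(frame, reference) for frame in frames]
    monotone = all(later >= earlier for earlier, later in zip(scores, scores[1:]))
    return monotone, scores
```

A trajectory that moves steadily from the degraded input toward the clean target should yield a non-decreasing PSNR sequence under this check.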

[7] Understanding and Harnessing Sparsity in Unified Multimodal Models

Shwai He, Chaorui Deng, Ang Li, Shen Yan

🧩 TL;DR

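This paper systematically analyzes inference inefficiency in unified multimodal models, finds that the generation components are highly sensitive to compression, and proposes a Mixture-of-Experts (MoE) adaptation with sparse activation that lets BAGEL match full-model performance while activating only about half of its parameters.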

📘 Detailed Summary

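Motivation: Unified multimodal models integrate understanding and generation, but this introduces inference inefficiencies because a given task or sample may not need the model's full knowledge or capacity. A systematic understanding of how these inefficiencies manifest across components is still limited.

Method: The study first uses training-free pruning as a probe, covering both depth pruning and width reduction, to systematically analyze the components of unified multimodal models. It then proposes an MoE adaptation that partitions the generation module into multiple experts and enables sparse activation to restore generation quality, validated through expert-frozen tuning and a fully trainable adaptation.

Result: The understanding component is notably compressible on both understanding and generation tasks, more so on the latter. The generation components, in contrast, are highly sensitive to compression, with performance dropping sharply even at moderate compression ratios. With the MoE adaptation, BAGEL achieves performance comparable to the full model while activating only about half of its parameters.

Conclusion: The study reveals that components of unified multimodal models differ in their sensitivity to compression. The MoE adaptation uses sparse activation to address the efficiency bottleneck in the generation components, offering a practical path toward more efficient multimodal models and illustrating the value of dynamic activation patterns.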

📄 Abstract

Large multimodal models have achieved remarkable progress in both understanding and generation. Recent efforts pursue unified multimodal models that integrate heterogeneous components to support both capabilities within a single framework. However, such unification introduces inference inefficiencies, e.g., specific tasks or samples may not require the full knowledge or capacity of the unified model. Yet, a systematic understanding of how these inefficiencies manifest across different components remains limited. In this work, we first conduct a systematic analysis of unified multimodal model components using training-free pruning as a probing methodology, considering both depth pruning and width reduction. Our study reveals that the understanding component exhibits notable compressibility in both understanding and generation tasks, which is more pronounced in the latter. In contrast, the generation components are highly sensitive to compression, with performance deteriorating sharply even under moderate compression ratios. To address this limitation, we propose the Mixture-of-Experts (MoE) Adaptation, inspired by the dynamic activation patterns observed across different samples. This approach partitions the generation module into multiple experts and enables sparse activation to restore generation quality. We validate the effectiveness of sparse activation through expert-frozen tuning and further demonstrate that a fully trainable adaptation delivers additional gains. As a result, the adapted BAGEL model achieves performance comparable to the full model while activating only about half of its parameters. The code is released at https://github.com/Shwai-He/SparseUnifiedModel.
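Sparse activation in an MoE layer is usually implemented with top-k gating: a router scores every expert, only the k best run, and their outputs are mixed with renormalized weights. The sketch below shows that generic mechanism on scalar inputs. It is not the released SparseUnifiedModel code, and the function names and the scalar toy experts are assumptions for illustration.

```python
import math


def top_k_gate(logits, k):
    """Return {expert_index: weight} for the k highest-scoring experts.

    Weights are a softmax restricted to the selected experts, so they sum to 1.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: value / total for i, value in exps.items()}


def sparse_moe(x, experts, gate_logits, k):
    """Evaluate only the selected experts and mix their outputs."""
    weights = top_k_gate(gate_logits, k)
    return sum(weight * experts[i](x) for i, weight in weights.items())
```

With four equally sized experts and k=2, each input touches roughly half of the expert parameters, which is the kind of budget the abstract describes for the adapted BAGEL model.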

[8] WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning

Woongyeong Yeo, Kangsan Kim, Jaehong Yoon, Sung Ju Hwang

🧩 TL;DR

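This paper presents WorldMM, a multimodal memory agent that builds and retrieves from complementary textual and visual memories to address limited context capacity and loss of visual detail in long-video understanding, substantially outperforming existing methods on several long-video question-answering benchmarks.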

📘 Detailed Summary

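Motivation: Video large language models face limited context capacity and loss of visual detail when handling videos that last hours or days. Existing memory-augmented methods based on text summaries rely too heavily on text and fail to use visual evidence when reasoning over complex scenes, and retrieval at fixed temporal scales struggles to capture events of variable duration.

Method: WorldMM builds three complementary types of memory: episodic memory that indexes factual events across multiple temporal scales, semantic memory that continuously updates high-level conceptual knowledge, and visual memory that preserves detailed scene information. During inference, an adaptive retrieval agent iteratively selects the most relevant memory source and temporal granularity for the query until it determines that enough information has been gathered.

Result: WorldMM significantly outperforms existing baselines on five long-video question-answering benchmarks, with an average gain of 8.4% over the previous state-of-the-art methods.

Conclusion: The study shows that a multimodal memory architecture can address context limits in long-video understanding. Combining textual and visual representations with adaptive multi-scale retrieval offers a new approach to complex long-video reasoning, with potential for applications that need fine visual detail and long-range dependencies.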

📄 Abstract

Recent advances in video large language models have demonstrated strong capabilities in understanding short clips. However, scaling them to hours- or days-long videos remains highly challenging due to limited context capacity and the loss of critical visual details during abstraction. Existing memory-augmented methods mitigate this by leveraging textual summaries of video segments, yet they heavily rely on text and fail to utilize visual evidence when reasoning over complex scenes. Moreover, retrieving from fixed temporal scales further limits their flexibility in capturing events that span variable durations. To address this, we introduce WorldMM, a novel multimodal memory agent that constructs and retrieves from multiple complementary memories, encompassing both textual and visual representations. WorldMM comprises three types of memory: episodic memory indexes factual events across multiple temporal scales, semantic memory continuously updates high-level conceptual knowledge, and visual memory preserves detailed information about scenes. During inference, an adaptive retrieval agent iteratively selects the most relevant memory source and leverages multiple temporal granularities based on the query, continuing until it determines that sufficient information has been gathered. WorldMM significantly outperforms existing baselines across five long video question-answering benchmarks, achieving an average 8.4% performance gain over previous state-of-the-art methods, showing its effectiveness on long video reasoning.

[9] Multi-Domain Enhanced Map-Free Trajectory Prediction with Selective Attention

Wenyi Xiong, Jian Chen

🧩 TL;DR

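This paper proposes a map-free trajectory prediction algorithm that predicts across the temporal, spatial, and frequency domains, using a Mixture of Experts mechanism and a selective attention module to handle redundant information in complex interactive scenarios and improve trajectory prediction for autonomous driving.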

📘 Detailed Summary

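Motivation: Existing trajectory prediction methods struggle to extract valuable scene information from redundant data in complex interactive scenarios, which reduces computational efficiency and prediction accuracy, especially for intricate agent interactions. This limits the reliability and safety of autonomous driving systems.

Method: The proposed map-free algorithm predicts trajectories across the temporal, spatial, and frequency domains. In temporal information processing, a Mixture of Experts (MoE) mechanism adaptively selects critical frequency components, which are integrated with multi-scale temporal features. A selective attention module filters redundant information in both temporal sequences and spatial interactions. Finally, a multimodal decoder generates plausible trajectories under the supervision of patch-level and point-level losses.

Result: Experiments on the nuScenes dataset demonstrate the superiority of the algorithm and validate its effectiveness in complex interactive scenarios.

Conclusion: With a multi-domain prediction framework and adaptive information filtering, the study offers an effective solution for trajectory prediction in complex interactive scenarios. The MoE mechanism and selective attention module improve information processing efficiency, supporting the reliability and safety of autonomous driving systems.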

📄 Abstract

Trajectory prediction is crucial for the reliability and safety of autonomous driving systems, yet it remains a challenging task in complex interactive scenarios. Existing methods often struggle to efficiently extract valuable scene information from redundant data, thereby reducing computational efficiency and prediction accuracy, especially when dealing with intricate agent interactions. To address these challenges, we propose a novel map-free trajectory prediction algorithm that achieves trajectory prediction across the temporal, spatial, and frequency domains. Specifically, in temporal information processing, we utilize a Mixture of Experts (MoE) mechanism to adaptively select critical frequency components. Concurrently, we extract these components and integrate multi-scale temporal features. Subsequently, a selective attention module is proposed to filter out redundant information in both temporal sequences and spatial interactions. Finally, we design a multimodal decoder. Under the supervision of patch-level and point-level losses, we obtain reasonable trajectory results. Experiments on the nuScenes dataset demonstrate the superiority of our algorithm, validating its effectiveness in handling complex interactive scenarios.

[10] See, Think, Learn: A Self-Taught Multimodal Reasoner

Sourabh Sharma, Sonam Gupta, Sadbhawna

🧩 TL;DR

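This paper proposes See-Think-Learn (STL), a self-training framework that uses a structured reasoning template and negative-rationale augmentation to improve both perception and reasoning in vision-language models without relying on high-quality human annotations or costly proprietary models.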

📘 Detailed Summary

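Motivation: Vision-language models (VLMs) have made remarkable progress in integrating visual perception with language understanding, but effective multimodal reasoning requires both accurate perception and robust reasoning, and weakness in either limits performance. Prior efforts to improve reasoning usually depend on high-quality chain-of-thought data obtained through labor-intensive human annotation, costly proprietary models, or self-training methods that overlook perception.

Method: STL is built around a structured reasoning template that makes the model "see before thinking": it first extracts visual attributes in textual form, then uses them to guide reasoning. The model generates and learns from its own structured rationales in a self-training loop, which jointly improves perception and reasoning. The training data is further augmented with negative rationales, that is, explanations of why certain answer choices are incorrect, to improve the model's ability to distinguish correct from misleading responses.

Result: Experiments across diverse domains show that STL consistently outperforms baselines trained directly on answers only or on self-generated reasoning. Qualitative analysis confirms the high quality of the rationales STL generates.

Conclusion: STL offers a cost-effective way to strengthen multimodal reasoning in VLMs, jointly improving perception and reasoning through a structured reasoning template and negative-rationale augmentation, without depending on high-quality human annotations or costly proprietary models.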

📄 Abstract

Vision-Language Models (VLMs) have achieved remarkable progress in integrating visual perception with language understanding. However, effective multimodal reasoning requires both accurate perception and robust reasoning, and weakness in either limits the performance of VLMs. Prior efforts to enhance reasoning often depend on high-quality chain-of-thought (CoT) data, obtained via labor-intensive human annotations, costly proprietary models, or self-training methods that overlook perception. To address these limitations, we propose a simple yet effective self-training framework called See-Think-Learn (STL). At its core, STL introduces a structured reasoning template that encourages the model to see before thinking, first extracting visual attributes in textual form, then using them to guide reasoning. The framework jointly improves perception and reasoning by having the model generate and learn from its own structured rationales in a self-training loop. Furthermore, we augment the training data with negative rationales, i.e. explanations that justify why certain answer choices are incorrect, to enhance the model's ability to distinguish between correct and misleading responses. This fosters more discriminative and robust learning. Experiments across diverse domains show that STL consistently outperforms baselines trained directly only on answers or self-generated reasoning, while qualitative analysis confirms the high quality of its rationales. STL thus provides a cost-effective solution to enhance multimodal reasoning ability of VLMs.

[11] From Detection to Association: Learning Discriminative Object Embeddings for Multi-Object Tracking

Yuqing Shao, Yuchen Yang, Rui Yu, Weilong Li, Xu Guo, Huaicheng Yan, Wei Wang, Xiao Sun

🧩 TL;DR

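This paper proposes FDTA, an explicit feature refinement framework that improves association accuracy in end-to-end multi-object tracking by addressing the excessively similar object embeddings produced by the shared DETR architecture, achieving state-of-the-art performance on several challenging benchmarks.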

📘 Detailed Summary

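Motivation: End-to-end multi-object tracking (MOT) methods detect well but have relatively low association accuracy. The main cause is that object embeddings from the shared DETR architecture show excessively high inter-object similarity: the architecture emphasizes only category-level discrimination within single frames, whereas tracking needs instance-level distinction across frames with spatial and temporal continuity, which current methods do not sufficiently optimize.

Method: FDTA is an explicit feature refinement framework that enhances object discriminativeness from three complementary perspectives: a Spatial Adapter integrates depth-aware cues for spatial continuity, a Temporal Adapter aggregates historical information for temporal dependencies, and an Identity Adapter uses quality-aware contrastive learning for instance-level separability.

Result: Extensive experiments show that FDTA achieves state-of-the-art performance on multiple challenging MOT benchmarks, including DanceTrack, SportsMOT, and BFT, which validates the proposed discriminative embedding enhancement strategy.

Conclusion: The study identifies insufficiently discriminative object embeddings as a core problem in end-to-end MOT and shows that its multi-perspective feature refinement framework improves association accuracy, indicating that explicitly optimizing spatiotemporal continuity and instance-level distinction matters for tracking performance.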

📄 Abstract

End-to-end multi-object tracking (MOT) methods have recently achieved remarkable progress by unifying detection and association within a single framework. Despite their strong detection performance, these methods suffer from relatively low association accuracy. Through detailed analysis, we observe that object embeddings produced by the shared DETR architecture display excessively high inter-object similarity, as it emphasizes only category-level discrimination within single frames. In contrast, tracking requires instance-level distinction across frames with spatial and temporal continuity, for which current end-to-end approaches insufficiently optimize object embeddings. To address this, we introduce FDTA (From Detection to Association), an explicit feature refinement framework that enhances object discriminativeness across three complementary perspectives. Specifically, we introduce a Spatial Adapter (SA) to integrate depth-aware cues for spatial continuity, a Temporal Adapter (TA) to aggregate historical information for temporal dependencies, and an Identity Adapter (IA) to leverage quality-aware contrastive learning for instance-level separability. Extensive experiments demonstrate that FDTA achieves state-of-the-art performance on multiple challenging MOT benchmarks, including DanceTrack, SportsMOT, and BFT, highlighting the effectiveness of our proposed discriminative embedding enhancement strategy. The code is available at https://github.com/Spongebobbbbbbbb/FDTA.

[12] Contextual Image Attack: How Visual Context Exposes Multimodal Safety Vulnerabilities

Yuan Xiong, Ziqi Miao, Lijun Li, Chen Qian, Jie Li, Jing Shao

🧩 TL;DR

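This paper proposes Contextual Image Attack (CIA), an image-centric jailbreak attack on multimodal large language models (MLLMs) that uses a multi-agent system to embed harmful queries in seemingly benign visual contexts. It substantially raises attack success rates and shows that the visual modality itself is a potent vector for attacking advanced MLLMs.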

📘 Detailed Summary

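Motivation: Existing jailbreak attacks on MLLMs mainly focus on text-image interplay and treat the visual modality as a secondary prompt, which underuses the unique potential of images to carry complex contextual information. This motivates a more effective image-centric attack method.

Method: CIA employs a multi-agent system that subtly embeds harmful queries into seemingly benign visual contexts using four distinct visualization strategies, and incorporates contextual element enhancement and automatic toxicity obfuscation to further increase attack efficacy.

Result: On the MMSafetyBench-tiny dataset, CIA achieves toxicity scores of 4.73 and 4.83 against GPT-4o and Qwen2.5-VL-72B, respectively, with attack success rates of 86.31% and 91.07%, significantly outperforming prior work.

Conclusion: The study shows that the visual modality itself is a potent vector for attacking advanced MLLMs. The image-centric attack significantly raises jailbreak success rates and exposes the vulnerability of current MLLM safety alignment to visual attacks.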

📄 Abstract

While Multimodal Large Language Models (MLLMs) show remarkable capabilities, their safety alignments are susceptible to jailbreak attacks. Existing attack methods typically focus on text-image interplay, treating the visual modality as a secondary prompt. This approach underutilizes the unique potential of images to carry complex, contextual information. To address this gap, we propose a new image-centric attack method, Contextual Image Attack (CIA), which employs a multi-agent system to subtly embed harmful queries into seemingly benign visual contexts using four distinct visualization strategies. To further enhance the attack's efficacy, the system incorporates contextual element enhancement and automatic toxicity obfuscation techniques. Experimental results on the MMSafetyBench-tiny dataset show that CIA achieves high toxicity scores of 4.73 and 4.83 against the GPT-4o and Qwen2.5-VL-72B models, respectively, with Attack Success Rates (ASR) reaching 86.31% and 91.07%. Our method significantly outperforms prior work, demonstrating that the visual modality itself is a potent vector for jailbreaking advanced MLLMs.

[13] Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch

Yifan Zhang, Liang Hu, Haofeng Sun, Peiyu Wang, Yichen Wei, Shukang Yin, Jiangbo Pei, Wei Shen, Peng Xia, Yi Peng, Tianyidan Xie, Eric Li, Yang Liu, Xuchen Song, Yahui Zhou

🧩 TL;DR

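This paper presents Skywork-R1V4, a 30B-parameter multimodal agentic model that combines image manipulation and web search within a unified planning framework. Trained with supervised fine-tuning alone, it achieves state-of-the-art agentic performance and surpasses Gemini 2.5 Flash on multiple benchmarks.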

📘 Detailed Summary

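Motivation: Existing multimodal agentic systems have three main limitations: they treat image manipulation and web search as disjoint capabilities, rely heavily on costly reinforcement learning, and lack planning grounded in real tool-execution traces. These limitations hold agents back on complex multi-step tasks.

Method: Skywork-R1V4 is a 30B-parameter multimodal agentic model that unifies multimodal planning, active image manipulation, deep multimodal search, and, most critically, interleaved reasoning. It is trained solely via supervised fine-tuning on fewer than 30,000 high-quality, planning-execution-consistent trajectories, validated through stepwise consistency filtering, with no reliance on reinforcement learning.

Result: Skywork-R1V4 achieves state-of-the-art results on perception and multimodal search benchmarks, scoring 66.1 on MMSearch and 67.2 on FVQA and surpassing Gemini 2.5 Flash on all 11 metrics. It exhibits emergent long-horizon reasoning, successfully orchestrating more than 10 tool calls to solve complex multi-step tasks.

Conclusion: The results show that sophisticated agentic multimodal intelligence can be achieved through carefully curated supervised learning alone, without reinforcement learning. This challenges the prevailing reliance on reinforcement learning and points toward more efficient, scalable agentic systems.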

📄 Abstract

Despite recent progress in multimodal agentic systems, existing approaches often treat image manipulation and web search as disjoint capabilities, rely heavily on costly reinforcement learning, and lack planning grounded in real tool-execution traces. To address these limitations, we present Skywork-R1V4, a 30B (A3B) parameter multimodal agentic model that unifies multimodal planning, active image manipulation ("thinking with images"), deep multimodal search, and, most critically, interleaved reasoning that dynamically alternates between visual operations and external knowledge retrieval. Trained solely via supervised fine-tuning on fewer than 30,000 high-quality, planning-execution-consistent trajectories and validated through stepwise consistency filtering, Skywork-R1V4 achieves state-of-the-art results across perception and multimodal search benchmarks: it scores 66.1 on MMSearch and 67.2 on FVQA, surpassing Gemini 2.5 Flash on all 11 metrics. Skywork-R1V4 exhibits emergent long-horizon reasoning at inference time, successfully orchestrating more than 10 tool calls to solve complex, multi-step tasks. Our results demonstrate that sophisticated agentic multimodal intelligence can be achieved through carefully curated supervised learning alone, without any reliance on reinforcement learning.

[14] WISE: Weighted Iterative Society-of-Experts for Robust Multimodal Multi-Agent Debate

Anoop Cherian, River Doyle, Eyal Ben-Dov, Suhas Lohit, Kuan-Chuan Peng

🧩 TL;DR

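This paper presents WISE (Weighted Iterative Society-of-Experts), a framework that extends multi-agent debate to multimodal reasoning. With heterogeneous experts in distinct roles and a two-stage debate model, it improves accuracy by 2-7% on a range of vision-and-language tasks.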

📘 Detailed Summary

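Motivation: Multi-agent debate (MAD) has mostly been applied to language-only tasks, and its efficacy on multimodal problems remains underexplored. Modern large language models, trained on diverse corpora and tasks, develop complementary strengths, which calls for a debate framework that can combine heterogeneous experts and handle multimodal reasoning.

Method: WISE partitions the agents into Solvers, which generate solutions, and Reflectors, which verify correctness, assign weights, and provide natural language feedback. A modified Dawid-Skene algorithm, used for post-processing, integrates the two-stage debate model and accounts for variance in agent responses and in the feedback weights.

Result: Evaluated on SMART-840, VisualPuzzles, EvoChart-QA, and the newly constructed SMART-840++ dataset, WISE consistently improves accuracy by 2-7% over state-of-the-art MAD setups and aggregation methods across diverse multimodal tasks and LLM configurations.

Conclusion: The study shows that multi-agent debate extends effectively to multimodal reasoning. Heterogeneous expert roles and a two-stage debate model improve performance, giving a modular and generalizable framework for complex multimodal problem solving that draws on complementary model strengths.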

📄 Abstract

Recent large language models (LLMs) are trained on diverse corpora and tasks, leading them to develop complementary strengths. Multi-agent debate (MAD) has emerged as a popular way to leverage these strengths for robust reasoning, though it has mostly been applied to language-only tasks, leaving its efficacy on multimodal problems underexplored. In this paper, we study MAD for solving vision-and-language reasoning problems. Our setup enables generalizing the debate protocol with heterogeneous experts that possess single- and multi-modal capabilities. To this end, we present Weighted Iterative Society-of-Experts (WISE), a generalized and modular MAD framework that partitions the agents into Solvers, that generate solutions, and Reflectors, that verify correctness, assign weights, and provide natural language feedback. To aggregate the agents' solutions across debate rounds, while accounting for variance in their responses and the feedback weights, we present a modified Dawid-Skene algorithm for post-processing that integrates our two-stage debate model. We evaluate WISE on SMART-840, VisualPuzzles, EvoChart-QA, and a new SMART-840++ dataset with programmatically generated problem instances of controlled difficulty. Our results show that WISE consistently improves accuracy by 2-7% over the state-of-the-art MAD setups and aggregation methods across diverse multimodal tasks and LLM configurations.
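The aggregation step can be pictured with a much simpler stand-in than the paper's modified Dawid-Skene algorithm: a weighted vote, in which each Solver answer counts in proportion to the weight a Reflector gave it. The sketch below implements only that simplified stand-in. The function name and the single-weight-per-answer interface are assumptions for illustration, and no per-agent reliability is estimated as Dawid-Skene would do.

```python
from collections import defaultdict


def weighted_vote(answers, weights):
    """Return the answer with the largest total Reflector weight.

    `answers[i]` is the answer proposed by Solver i and `weights[i]` is the
    weight assigned to it. Ties resolve to the answer seen first.
    """
    scores = defaultdict(float)
    for answer, weight in zip(answers, weights):
        scores[answer] += weight
    return max(scores, key=scores.get)
```

Unlike a plain majority vote, a single highly weighted answer can outvote several weakly supported ones, for example `weighted_vote(["A", "A", "B"], [0.2, 0.2, 0.9])` selects `"B"`.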

[15] Generalizing Vision-Language Models with Dedicated Prompt Guidance

Xinyao Li, Yinjie Min, Hongbo Chen, Zhekai Du, Fengling Li, Jingjing Li

🧩 TL;DR

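This paper proposes GuiDG, a domain-expert-guided framework. Motivated by a theoretical analysis showing that multiple expert models generalize better than a single universal model, it achieves better domain generalization in vision-language model fine-tuning while remaining parameter-efficient.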

📘 Detailed Summary

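Motivation: Current fine-tuning methods for vision-language models face a key trade-off between domain specificity and domain generalization. Fine-tuning a single universal model on the entire dataset typically harms generalization to unseen domains, a gap this work aims to fill.

Method: The paper proposes GuiDG, a two-step domain-expert-guided framework. It first obtains source-domain expert models through prompt tuning, then introduces a Cross-Modal Attention module that guides fine-tuning of the vision encoder via adaptive expert integration. It also constructs the ImageNet-DG benchmark for evaluating few-shot domain generalization.

Result: Extensive experiments on standard domain generalization benchmarks and the new ImageNet-DG dataset show that GuiDG outperforms state-of-the-art fine-tuning methods while maintaining efficiency, supporting the advantage of the multi-expert strategy for domain generalization.

Conclusion: The study establishes theoretically that multi-expert models have an advantage in domain generalization. The proposed GuiDG framework offers an effective approach to vision-language model fine-tuning, and the ImageNet-DG benchmark provides an evaluation tool for few-shot domain generalization research.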

📄 Abstract

Fine-tuning large pretrained vision-language models (VLMs) has emerged as a prevalent paradigm for downstream adaptation, yet it faces a critical trade-off between domain specificity and domain generalization (DG) ability. Current methods typically fine-tune a universal model on the entire dataset, which potentially compromises the ability to generalize to unseen domains. To fill this gap, we provide a theoretical understanding of the generalization ability for VLM fine-tuning, which reveals that training multiple parameter-efficient expert models on partitioned source domains leads to better generalization than fine-tuning a universal model. Inspired by this finding, we propose a two-step domain-expert-Guided DG (GuiDG) framework. GuiDG first employs prompt tuning to obtain source domain experts, then introduces a Cross-Modal Attention module to guide the fine-tuning of the vision encoder via adaptive expert integration. To better evaluate few-shot DG, we construct ImageNet-DG from ImageNet and its variants. Extensive experiments on standard DG benchmarks and ImageNet-DG demonstrate that GuiDG improves upon state-of-the-art fine-tuning methods while maintaining efficiency.

[16] Boosting Medical Vision-Language Pretraining via Momentum Self-Distillation under Limited Computing Resources

Phuc Pham, Nhu Pham, Ngoc Quoc Ly

🧩 TL;DR

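This study proposes an efficient vision-language model training method that combines momentum self-distillation with gradient accumulation. It targets the scarcity of annotated data in the medical domain and the high computational cost of contrastive learning, and achieves performance comparable to state-of-the-art methods while training on a single GPU.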

📘 Detailed Summary

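Motivation: In medical healthcare, detailed annotations are hard to obtain, so robust vision-language models are needed. Contrastive learning, a key paradigm for training such models, usually requires large batch sizes to learn effectively, which makes it computationally expensive and often limits it to well-resourced institutions. With limited medical data, it is also important to extract knowledge from both the data and the model during training.

Method: The core idea is to combine the momentum method with distillation to address computational efficiency and knowledge exploitation together. There are two main contributions: (1) momentum self-distillation to enhance multimodal learning, and (2) integrating momentum mechanisms with gradient accumulation to enlarge the effective batch size without increasing resource consumption. This allows efficient training on a single GPU.

Result: The method is competitive with state-of-the-art approaches in zero-shot classification and provides a substantial boost in few-shot adaptation, achieving over 90% AUC-ROC and improving retrieval tasks by 2-3%. It trains efficiently on a single GPU with reasonable training time, substantially lowering resource requirements.

Conclusion: By integrating momentum self-distillation with gradient accumulation, the study enables training of high-performing vision-language models in resource-constrained settings. The approach is particularly suited to domains such as healthcare where annotated data is scarce, reducing resource requirements while improving performance.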

📄 Abstract

In medical healthcare, obtaining detailed annotations is challenging, highlighting the need for robust Vision-Language Models (VLMs). Pretrained VLMs enable fine-tuning on small datasets or zero-shot inference, achieving performance comparable to task-specific models. Contrastive learning (CL) is a key paradigm for training VLMs but inherently requires large batch sizes for effective learning, making it computationally demanding and often limited to well-resourced institutions. Moreover, with limited data in healthcare, it is important to prioritize knowledge extraction from both data and models during training to improve performance. Therefore, we focus on leveraging the momentum method combined with distillation to simultaneously address computational efficiency and knowledge exploitation. Our contributions can be summarized as follows: (1) leveraging momentum self-distillation to enhance multimodal learning, and (2) integrating momentum mechanisms with gradient accumulation to enlarge the effective batch size without increasing resource consumption. Our method attains competitive performance with state-of-the-art (SOTA) approaches in zero-shot classification, while providing a substantial boost in the few-shot adaptation, achieving over 90% AUC-ROC and improving retrieval tasks by 2-3%. Importantly, our method achieves high training efficiency with a single GPU while maintaining reasonable training time. Our approach aims to advance efficient multimodal learning by reducing resource requirements while improving performance over SOTA methods. The implementation of our method is available at https://github.com/phphuc612/MSD .
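
The two generic building blocks named above can be sketched in a few lines: an exponential-moving-average (momentum) update of a teacher copy of the weights, and gradient accumulation over micro-batches. This is only a toy illustration on plain lists; the function names, the momentum value, and the batch sizes are assumptions, and how the paper couples the two mechanisms for the contrastive loss is not specified in the abstract.

```python
def ema_update(teacher, student, momentum=0.99):
    """Momentum (EMA) update: the teacher slowly tracks the student weights."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]

def accumulate_gradients(micro_batch_grads):
    """Average gradients from equal-size micro-batches so a single optimizer
    step sees the mean gradient of the combined batch."""
    n = len(micro_batch_grads)
    dim = len(micro_batch_grads[0])
    return [sum(g[i] for g in micro_batch_grads) / n for i in range(dim)]

# Effective batch size under accumulation (illustrative values, not from the paper).
micro_batch_size = 32
accumulation_steps = 8
effective_batch_size = micro_batch_size * accumulation_steps
print(effective_batch_size)  # 256

print(accumulate_gradients([[1.0, 2.0], [3.0, 4.0]]))  # [2.0, 3.0]
```

Note that for a contrastive loss, plain accumulation does not by itself enlarge the pool of negatives seen per step, which is presumably why the abstract pairs it with a momentum mechanism.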

[17] nuScenes Revisited: Progress and Challenges in Autonomous Driving

Whye Kit Fong, Venice Erin Liong, Kok Seang Tan, Holger Caesar

🧩 TL;DR

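This paper presents a comprehensive retrospective of nuScenes, a widely used autonomous driving dataset. It reveals details of the dataset's creation and the technical standards it set, and analyzes its lasting influence on subsequent autonomous driving research and datasets.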

📘 Detailed Summary

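Motivation: Autonomous driving and advanced driver assistance systems depend on large amounts of labeled data. nuScenes is one of the most widely used autonomous driving datasets, yet many details of its creation and technical standards have not been disclosed in the academic literature. A systematic review of its design, technical implementation, and influence on the field is therefore needed.

Method: The study takes a retrospective approach, examining the creation process and technical details of nuScenes and its extensions, nuImages and Panoptic nuScenes. It analyzes the dataset's multi-modal sensor fusion design, its standardized benchmarks, and its broad task coverage spanning perception, localization and mapping, prediction, and planning.

Result: The study discloses many previously unpublished technical details of how nuScenes was created. It highlights the dataset's firsts: the first to include radar data, to feature diverse urban driving scenes from two continents, and to be collected using a fully autonomous vehicle operating on public roads. It also traces the dataset's influence on many later autonomous driving datasets and the standards it defined.

Conclusion: nuScenes provides an important benchmark platform for autonomous driving research and, through its multi-modal sensor fusion, standardized evaluation, and diverse task design, has shaped the direction of the field. Its standards and design principles remain in wide use today, making it a milestone in the development of autonomous driving datasets.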

📄 Abstract

Autonomous Vehicles (AV) and Advanced Driver Assistance Systems (ADAS) have been revolutionized by Deep Learning. As a data-driven approach, Deep Learning relies on vast amounts of driving data, typically labeled in great detail. As a result, datasets, alongside hardware and algorithms, are foundational building blocks for the development of AVs. In this work we revisit one of the most widely used autonomous driving datasets: the nuScenes dataset. nuScenes exemplifies key trends in AV development, being the first dataset to include radar data, to feature diverse urban driving scenes from two continents, and to be collected using a fully autonomous vehicle operating on public roads, while also promoting multi-modal sensor fusion, standardized benchmarks, and a broad range of tasks including perception, localization & mapping, prediction and planning. We provide an unprecedented look into the creation of nuScenes, as well as its extensions nuImages and Panoptic nuScenes, summarizing many technical details that have hitherto not been revealed in academic publications. Furthermore, we trace how the influence of nuScenes impacted a large number of other datasets that were released later and how it defined numerous standards that are used by the community to this day. Finally, we present an overview of both official and unofficial tasks using the nuScenes dataset and review major methodological developments, thereby offering a comprehensive survey of the autonomous driving literature, with a particular focus on nuScenes.

[18] UCAgents: Unidirectional Convergence for Visual Evidence Anchored Multi-Agent Medical Decision-Making

Qianhan Feng, Zhongzhen Huang, Yakun Zhu, Xiaofan Zhang, Qi Dou

🧩 TL;DR

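This paper proposes UCAgents, a hierarchical multi-agent framework for medical visual question answering. It enforces unidirectional convergence through structured evidence auditing, improving diagnostic accuracy while sharply reducing computational cost, and addresses two weaknesses of existing multi-agent approaches: reasoning detached from visual evidence and low computational efficiency.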

📘 Detailed Summary

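Motivation: Vision-language models used for medical diagnosis suffer from reasoning detachment, where linguistically fluent explanations drift away from verifiable image evidence and undermine clinical trust. Existing multi-agent frameworks simulate multidisciplinary team debates to mitigate single-model bias, but open-ended discussion amplifies textual noise and computational cost and fails to anchor reasoning to visual evidence, the cornerstone of medical decision-making.

Method: UCAgents is a hierarchical multi-agent framework that enforces unidirectional convergence through structured evidence auditing. Inspired by clinical workflows, it forbids position changes and limits agent interactions to targeted evidence verification, suppressing rhetorical drift while amplifying visual signal extraction. A one-round inquiry discussion is introduced to uncover potential risks of visual-textual misalignment. The design jointly constrains visual ambiguity and textual noise, a dual-noise bottleneck that is formalized via information theory.

Result: Extensive experiments on four medical VQA benchmarks show that UCAgents reaches 71.3% accuracy on PathVQA, 6.0% above the state of the art, with 87.7% lower token cost. The evaluation further confirms that UCAgents balances uncovering more visual evidence against avoiding confusing textual interference, combining diagnostic reliability with computational efficiency.

Conclusion: Through structured evidence auditing and unidirectional convergence, UCAgents addresses reasoning detachment in medical vision-language models, improving computational efficiency while maintaining diagnostic reliability, both of which are critical for real-world clinical deployment. The framework shows how a multi-agent design inspired by clinical workflows can balance visual evidence extraction against textual noise control.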

📄 Abstract

Vision-Language Models (VLMs) show promise in medical diagnosis, yet suffer from reasoning detachment, where linguistically fluent explanations drift from verifiable image evidence, undermining clinical trust. Recent multi-agent frameworks simulate Multidisciplinary Team (MDT) debates to mitigate single-model bias, but open-ended discussions amplify textual noise and computational cost while failing to anchor reasoning to visual evidence, the cornerstone of medical decision-making. We propose UCAgents, a hierarchical multi-agent framework enforcing unidirectional convergence through structured evidence auditing. Inspired by clinical workflows, UCAgents forbids position changes and limits agent interactions to targeted evidence verification, suppressing rhetorical drift while amplifying visual signal extraction. In UCAgents, a one-round inquiry discussion is introduced to uncover potential risks of visual-textual misalignment. This design jointly constrains visual ambiguity and textual noise, a dual-noise bottleneck that we formalize via information theory. Extensive experiments on four medical VQA benchmarks show UCAgents achieves superior accuracy (71.3% on PathVQA, +6.0% over state-of-the-art) with 87.7% lower token cost. The evaluation results further confirm that UCAgents strikes a balance between uncovering more visual evidence and avoiding confusing textual interference. These results demonstrate that UCAgents exhibits both diagnostic reliability and computational efficiency critical for real-world clinical deployment. Code is available at https://github.com/fqhank/UCAgents.

[19] Masking Matters: Unlocking the Spatial Reasoning Capabilities of LLMs for 3D Scene-Language Understanding

Yerim Jeon, Miso Lee, WonJun Moon, Jae-Pil Heo

🧩 TL;DR

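This paper proposes 3D-SLIM, an adaptive attention masking strategy for 3D scene-language understanding. By replacing the causal attention mask, it addresses sequential bias and restricted instruction attention, yielding substantial gains in 3D multi-modal reasoning.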

📘 Detailed Summary

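Motivation: Existing 3D scene-language understanding methods typically adopt standard decoders from language modeling, which rely on a causal attention mask. This creates two fundamental conflicts: sequential bias among order-agnostic 3D objects, and restricted object-instruction attention, both of which hinder task-specific reasoning.

Method: The paper proposes 3D-SLIM, an adaptive attention masking strategy with two key components. A Geometry-adaptive Mask constrains attention based on spatial density rather than token order, and an Instruction-aware Mask lets object tokens directly access the instruction context. The method requires no architectural modifications and adds no extra parameters.

Result: Extensive experiments across multiple benchmarks and LLM baselines validate 3D-SLIM, which yields substantial performance improvements across diverse 3D scene-language tasks and underscores the critical role of decoder design in 3D multi-modal reasoning.

Conclusion: The study shows that attention masks adapted to 3D spatial structure can overcome the limitations of the conventional causal mask in 3D scene understanding, offering a simple yet effective solution for 3D multi-modal reasoning and highlighting the importance of decoder design in this area.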

📄 Abstract

Recent advances in 3D scene-language understanding have leveraged Large Language Models (LLMs) for 3D reasoning by transferring their general reasoning ability to 3D multi-modal contexts. However, existing methods typically adopt standard decoders from language modeling, which rely on a causal attention mask. This design introduces two fundamental conflicts in 3D scene understanding: sequential bias among order-agnostic 3D objects and restricted object-instruction attention, hindering task-specific reasoning. To overcome these limitations, we propose 3D Spatial Language Instruction Mask (3D-SLIM), an effective masking strategy that replaces the causal mask with an adaptive attention mask tailored to the spatial structure of 3D scenes. Our 3D-SLIM introduces two key components: a Geometry-adaptive Mask that constrains attention based on spatial density rather than token order, and an Instruction-aware Mask that enables object tokens to directly access instruction context. This design allows the model to process objects based on their spatial relationships while being guided by the user's task. 3D-SLIM is simple, requires no architectural modifications, and adds no extra parameters, yet it yields substantial performance improvements across diverse 3D scene-language tasks. Extensive experiments across multiple benchmarks and LLM baselines validate its effectiveness and underscore the critical role of decoder design in 3D multi-modal reasoning.
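
To make the contrast with a causal mask concrete, here is a minimal sketch of a non-causal attention mask in the spirit of the two components described above. It is not the paper's construction: k-nearest-neighbor selection stands in for the density-based rule, and the token layout (object tokens first, then instruction tokens), the function name `build_attention_mask`, and the parameter `k` are all assumptions.

```python
import math

def build_attention_mask(object_centers, num_instruction_tokens, k=1):
    """Return mask[i][j] == True when token i may attend to token j.

    Assumed layout: object tokens first, instruction tokens after them.
    """
    n_obj = len(object_centers)
    n = n_obj + num_instruction_tokens
    mask = [[False] * n for _ in range(n)]
    for i in range(n_obj):
        # Geometry-driven: attend to self and the k spatially nearest objects,
        # regardless of where they sit in the token sequence.
        others = sorted(
            (j for j in range(n_obj) if j != i),
            key=lambda j: math.dist(object_centers[i], object_centers[j]),
        )
        for j in [i] + others[:k]:
            mask[i][j] = True
        # Instruction-aware: object tokens see every instruction token,
        # which a causal mask would forbid because they come later.
        for j in range(n_obj, n):
            mask[i][j] = True
    for i in range(n_obj, n):
        # Instruction tokens keep ordinary causal attention over the sequence.
        for j in range(i + 1):
            mask[i][j] = True
    return mask

centers = [(0, 0, 0), (1, 0, 0), (10, 0, 0), (11, 0, 0)]
mask = build_attention_mask(centers, num_instruction_tokens=2, k=1)
print(mask[0][1], mask[0][2], mask[0][4])  # True False True
```

Object 0 attends to its spatial neighbor (object 1) even though that token comes later in the sequence, ignores the distant object 2, and can read the instruction tokens directly.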

[20] Does Hearing Help Seeing? Investigating Audio-Video Joint Denoising for Video Generation

Jianzong Wu, Hao Lian, Dachao Hao, Ye Tian, Qingyu Shi, Biaolong Chen, Hao Jiang

🧩 TL;DR

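This paper provides the first systematic evidence that audio-video joint denoising training improves not only cross-modal synchrony but also video generation quality, even when only the video modality matters. Using the parameter-efficient AVFullDiT architecture, it suggests that audio acts as a privileged signal that regularizes how the model learns the physical dynamics of video.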

📘 Detailed Summary

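Motivation: Prior work has focused mainly on the cross-modal synchrony benefits of audio-video generative systems and has not systematically examined whether joint denoising training improves the quality of the video modality itself. This paper asks a fundamental question: even when only video quality matters, does audio-video joint denoising training improve video generation, particularly in challenging scenes with complex motion?

Method: The study introduces the parameter-efficient Audio-Video Full DiT (AVFullDiT) architecture, which leverages pre-trained text-to-video (T2V) and text-to-audio (T2A) modules for joint denoising. A T2AV joint model and a T2V-only counterpart are trained under identical settings to allow a fair comparison that isolates the effect of the audio signal.

Result: The experiments provide the first systematic evidence that audio-video joint denoising can deliver more than synchrony. On challenging subsets featuring large motions and object-contact motions, the jointly trained model shows consistent improvements over the video-only model.

Conclusion: The authors hypothesize that predicting audio acts as a privileged signal, encouraging the model to internalize causal relationships between visual events and their acoustic consequences, which in turn regularizes video dynamics and improves physical plausibility. The findings suggest that cross-modal co-training is a promising route to stronger, more physically grounded world models.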

📄 Abstract

Recent audio-video generative systems suggest that coupling modalities benefits not only audio-video synchrony but also the video modality itself. We pose a fundamental question: Does audio-video joint denoising training improve video generation, even when we only care about video quality? To study this, we introduce a parameter-efficient Audio-Video Full DiT (AVFullDiT) architecture that leverages pre-trained text-to-video (T2V) and text-to-audio (T2A) modules for joint denoising. We train (i) a T2AV model with AVFullDiT and (ii) a T2V-only counterpart under identical settings. Our results provide the first systematic evidence that audio-video joint denoising can deliver more than synchrony. We observe consistent improvements on challenging subsets featuring large and object contact motions. We hypothesize that predicting audio acts as a privileged signal, encouraging the model to internalize causal relationships between visual events and their acoustic consequences (e.g., collision $\times$ impact sound), which in turn regularizes video dynamics. Our findings suggest that cross-modal co-training is a promising approach to developing stronger, more physically grounded world models. Code and dataset will be made publicly available.

[21] From Panel to Pixel: Zoom-In Vision-Language Pretraining from Biomedical Scientific Literature

Kun Yuan, Min Woo Sun, Zhen Chen, Alejandro Lozano, Xiangteng He, Shi Li, Nassir Navab, Xiaoxiao Sun, Nicolas Padoy, Serena Yeung-Levy

🧩 TL;DR

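This paper proposes Panel2Patch, a new data pipeline that mines hierarchical structure from biomedical scientific literature. It converts multi-panel, marker-heavy figures and their surrounding text into multi-granular supervision for training biomedical vision-language models, substantially improving performance while reducing the amount of pretraining data required.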

📘 Detailed Summary

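Motivation: Current biomedical vision-language pretraining typically compresses rich scientific figures and text into coarse figure-level pairs, discarding the fine-grained correspondences that clinicians rely on when zooming into local structures. This limits a model's grasp of fine-grained semantics in biomedical images.

Method: The Panel2Patch pipeline parses the layouts, panels, and visual markers of scientific figures and constructs aligned vision-language pairs at three levels: figure, panel, and patch. On top of this hierarchical corpus, a granularity-aware pretraining strategy unifies heterogeneous objectives ranging from coarse didactic descriptions to fine region-focused phrases.

Result: Experiments show that Panel2Patch, applied to only a small set of literature figures, extracts more effective supervision than prior pipelines, achieving substantially better performance with less pretraining data and supporting the value of hierarchical multi-granular supervision.

Conclusion: The study demonstrates the value of mining hierarchical structure from existing scientific literature. It gives biomedical vision-language models finer-grained supervision, reduces the need for large-scale data, and opens a path toward stronger biomedical multimodal models.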

📄 Abstract

There is a growing interest in developing strong biomedical vision-language models. A popular approach to achieve robust representations is to use web-scale scientific data. However, current biomedical vision-language pretraining typically compresses rich scientific figures and text into coarse figure-level pairs, discarding the fine-grained correspondences that clinicians actually rely on when zooming into local structures. To tackle this issue, we introduce Panel2Patch, a novel data pipeline that mines hierarchical structure from existing biomedical scientific literature, i.e., multi-panel, marker-heavy figures and their surrounding text, and converts them into multi-granular supervision. Given scientific figures and captions, Panel2Patch parses layouts, panels, and visual markers, then constructs hierarchical aligned vision-language pairs at the figure, panel, and patch levels, preserving local semantics instead of treating each figure as a single data sample. Built on this hierarchical corpus, we develop a granularity-aware pretraining strategy that unifies heterogeneous objectives from coarse didactic descriptions to fine region-focused phrases. By applying Panel2Patch to only a small set of the literature figures, we extract far more effective supervision than prior pipelines, enabling substantially better performance with less pretraining data.

[22] Vision to Geometry: 3D Spatial Memory for Sequential Embodied MLLM Reasoning and Exploration

Zhongyi Cai, Yi Du, Chen Wang, Yu Kong

🧩 TL;DR

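This study introduces the SEER-Bench benchmark and the 3DSPMR method, which the authors describe as the first to explicitly incorporate geometric information into the spatial understanding of multi-modal large language models. It targets the core challenges of reusing spatial knowledge and of reasoning and exploration in sequential embodied tasks.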

📘 Detailed Summary

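Motivation: Existing research on indoor embodied tasks typically requires agents to actively explore unknown environments and reason about the scene. In real deployments, however, agents often face sequential tasks in which each new sub-task follows the completion of the previous one, and some sub-tasks may be infeasible. Compared with the single-task setting, the core challenge is reusing spatial knowledge accumulated from earlier exploration to support later reasoning and exploration, an underexplored yet practically significant problem in embodied AI.

Method: The study introduces SEER-Bench, a Sequential Embodied Exploration and Reasoning Benchmark covering two classic tasks: Embodied Question Answering (EQA) and Embodied Multi-modal Navigation (EMN). Building on it, the authors propose 3DSPMR, which exploits relational, visual, and geometric cues from explored regions to augment multi-modal large language models (MLLMs) for reasoning and exploration in sequential embodied tasks. To the authors' knowledge, it is the first work to explicitly incorporate geometric information into MLLM-based spatial understanding and reasoning.

Result: Extensive experiments verify that 3DSPMR achieves substantial performance gains on both sequential EQA and EMN tasks on SEER-Bench, supporting its effectiveness at reusing spatial knowledge and at exploration and reasoning in sequential tasks.

Conclusion: By explicitly integrating geometric information into an MLLM-based spatial understanding framework, the study offers a new approach to the core challenges of sequential embodied tasks. 3DSPMR shows the potential of fusing relational, visual, and geometric cues to strengthen spatial reasoning, laying groundwork for embodied AI systems in complex sequential tasks.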

📄 Abstract

Existing research on indoor embodied tasks typically requires agents to actively explore unknown environments and reason about the scene to achieve a specific goal. However, when deployed in real life, agents often face sequential tasks, where each new sub-task follows the completion of the previous one, and certain sub-tasks may be infeasible, such as searching for a non-existent object. Compared with the single-task setting, the core challenge lies in reusing spatial knowledge accumulated from previous explorations to support subsequent reasoning and exploration. In this work, we investigate this underexplored yet practically significant embodied AI challenge. To evaluate this challenge, we introduce SEER-Bench, a new Sequential Embodied Exploration and Reasoning Benchmark encompassing two classic embodied tasks: Embodied Question Answering (EQA) and Embodied Multi-modal Navigation (EMN). Building on SEER-Bench, we propose 3DSPMR, a 3D SPatial Memory Reasoning approach that exploits relational, visual, and geometric cues from explored regions to augment Multi-Modal Large Language Models (MLLMs) for reasoning and exploration in sequential embodied tasks. To the best of our knowledge, this is the first work to explicitly incorporate geometric information into MLLM-based spatial understanding and reasoning. Extensive experiments verify that 3DSPMR achieves substantial performance gains on both sequential EQA and EMN tasks.

[23] Reasoning-Aware Multimodal Fusion for Hateful Video Detection

Shuonan Yang, Tailin Chen, Jiangbei Yue, Guangliang Cheng, Jianbo Jiao, Zeyu Fu

🧩 TL;DR

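This paper proposes a Reasoning-Aware Multimodal Fusion (RAMF) framework. It tackles semantic fusion in multimodal hateful video detection with Local-Global Context Fusion and Semantic Cross Attention, and introduces an adversarial reasoning process to improve understanding of subtle hateful intent.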

📘 Detailed Summary

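Motivation: Hate speech in online videos is an increasingly serious threat to digital platforms. Existing methods struggle to fuse the complex semantic relationships between modalities and lack the ability to understand nuanced hateful content; these are the main limitations this work addresses.

Method: The paper proposes the Reasoning-Aware Multimodal Fusion (RAMF) framework with two core components: Local-Global Context Fusion (LGCF), which captures local salient cues and global temporal structure, and Semantic Cross Attention (SCA), which enables fine-grained multimodal semantic interaction. It also introduces an adversarial reasoning process in which a vision-language model generates three complementary semantic perspectives: objective descriptions, hate-assumed inferences, and non-hate-assumed inferences.

Result: Evaluations on two real-world hateful video datasets show robust generalization, with improvements over state-of-the-art methods of 3% in Macro-F1 and 7% in hate-class recall.

Conclusion: By combining structured adversarial reasoning with fine-grained multimodal fusion, the study provides a more complete semantic understanding framework for hateful video detection, indicating that jointly considering local-global context and multi-perspective reasoning improves recognition of subtle hateful content.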

📄 Abstract

Hate speech in online videos is posing an increasingly serious threat to digital platforms, especially as video content becomes increasingly multimodal and context-dependent. Existing methods often struggle to effectively fuse the complex semantic relationships between modalities and lack the ability to understand nuanced hateful content. To address these issues, we propose an innovative Reasoning-Aware Multimodal Fusion (RAMF) framework. To tackle the first challenge, we design Local-Global Context Fusion (LGCF) to capture both local salient cues and global temporal structures, and propose Semantic Cross Attention (SCA) to enable fine-grained multimodal semantic interaction. To tackle the second challenge, we introduce adversarial reasoning, a structured three-stage process in which a vision-language model generates (i) objective descriptions, (ii) hate-assumed inferences, and (iii) non-hate-assumed inferences, providing complementary semantic perspectives that enrich the model's contextual understanding of nuanced hateful intent. Evaluations on two real-world hateful video datasets demonstrate that our method achieves robust generalisation performance, improving upon state-of-the-art methods by 3% and 7% in Macro-F1 and hate class recall, respectively. We will release the code after the anonymity period ends.

[24] MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding

Fan Yang, Kaihao Zhang

🧩 TL;DR

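This paper proposes the Multi-resolution Retrieval-Detection (MRD) framework, a training-free method for high-resolution image understanding. It uses multi-resolution semantic fusion and open-vocabulary object detection to address the semantic similarity bias that arises in existing crop-based methods when target objects are split across crops.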

📘 Detailed Summary

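Motivation: Existing crop-based methods for high-resolution image understanding with multimodal large language models have a notable flaw: when a target object is split across multiple crops, the semantic similarity computation is disrupted and becomes biased. Cropping breaks up complete objects and degrades the accuracy of target localization.

Method: The paper proposes the Multi-resolution Retrieval-Detection (MRD) framework with two core components. A multi-resolution semantic fusion method integrates semantic similarity maps obtained at different resolutions to produce more accurate semantic information and preserve the integrity of target objects. An open-vocabulary object detection model, applied with a sliding window, directly localizes target object regions at a global scale.

Result: Experiments on high-resolution image understanding benchmarks with different multimodal large language models demonstrate the method's effectiveness. According to the summary, MRD improves target localization, particularly for objects that are split across multiple crops.

Conclusion: The study indicates that multi-resolution processing matters for high-resolution image understanding, and the proposed training-free framework offers an effective way to handle objects of different sizes. Multi-resolution semantic fusion mitigates semantic similarity bias, while the open-vocabulary detection component provides more precise global localization.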

📄 Abstract

Understanding high-resolution images remains a significant challenge for multimodal large language models (MLLMs). Recent studies address this issue by dividing the image into smaller crops and computing the semantic similarity between each crop and a query using a pretrained retrieval-augmented generation (RAG) model. The most relevant crops are then selected to localize the target object and suppress irrelevant information. However, such crop-based processing can fragment complete objects across multiple crops, thereby disrupting the computation of semantic similarity. In our experiments, we find that image crops of objects with different sizes are better handled at different resolutions. Based on this observation, we propose Multi-resolution Retrieval-Detection (MRD), a training-free framework for high-resolution image understanding. To address the issue of semantic similarity bias caused by objects being split across different image crops, we propose a multi-resolution semantic fusion method, which integrates semantic similarity maps obtained at different resolutions to produce more accurate semantic information and preserve the integrity of target objects. Furthermore, to achieve direct localization of target objects at a global scale, we introduce an open-vocabulary object detection (OVD) model that identifies object regions using a sliding-window approach. Experiments on high-resolution image understanding benchmarks using different MLLMs demonstrate the effectiveness of our approach.
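
One simple way to picture multi-resolution semantic fusion is to bring crop-level similarity maps from different grid sizes onto a common grid and average them. The sketch below does exactly that with nearest-neighbor upsampling; plain averaging is an assumption (the abstract does not give the fusion rule), as are the function names, the square-grid restriction, and the example scores.

```python
def upsample_nearest(grid, size):
    """Nearest-neighbor upsampling of a square grid to size x size.
    Assumes size is a multiple of the grid's side length."""
    rows, cols = len(grid), len(grid[0])
    return [
        [grid[r * rows // size][c * cols // size] for c in range(size)]
        for r in range(size)
    ]

def fuse_similarity_maps(maps):
    """Average similarity maps after resizing them to the finest grid."""
    size = max(len(m) for m in maps)
    resized = [upsample_nearest(m, size) for m in maps]
    return [
        [sum(m[r][c] for m in resized) / len(resized) for c in range(size)]
        for r in range(size)
    ]

# Illustrative scores: an object in the top-left is split across four fine
# crops (each scoring modestly) but is seen whole by one coarse crop.
coarse = [[0.8, 0.1],
          [0.2, 0.1]]
fine = [[0.3, 0.4, 0.1, 0.0],
        [0.4, 0.3, 0.0, 0.1],
        [0.1, 0.0, 0.1, 0.0],
        [0.0, 0.1, 0.0, 0.1]]
fused = fuse_similarity_maps([coarse, fine])
print(len(fused), len(fused[0]))  # 4 4
```

In the fused map the four top-left cells are all lifted by the coarse-level score, so the fragmented object stands out as one contiguous region.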

[25] YingVideo-MV: Music-Driven Multi-Stage Video Generation

Jiahui Chen, Weida Wang, Runhua Shi, Huan Yang, Chaofan Ding, Zihao Chen

🧩 TL;DR

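This paper presents YingVideo-MV, described as the first cascaded framework for music-driven long-video generation. It integrates audio semantic analysis, an interpretable shot planning module, temporal-aware diffusion Transformer architectures, and long-sequence consistency modeling to synthesize high-quality music performance videos automatically from audio signals.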

📘 Detailed Summary

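Motivation: Diffusion models have made notable progress in audio-driven avatar video generation, but generating music performance videos, especially with camera motion, remains largely unexplored. Existing long-video generation methods lack explicit camera motion control and suffer from weak continuity between clips.

Method: The approach combines audio semantic analysis, an interpretable shot planning module (MV-Director), temporal-aware diffusion Transformer architectures, and long-sequence consistency modeling. It adds a camera adapter module that embeds camera poses into latent noise, and a time-aware dynamic window range strategy that adaptively adjusts denoising ranges based on the audio embedding.

Result: Comprehensive benchmark tests show that YingVideo-MV performs strongly at generating coherent and expressive music videos and enables precise music-motion-camera synchronization. A large-scale Music-in-the-Wild dataset was constructed to support diverse, high-quality results.

Conclusion: The study provides a complete cascaded framework for music-driven video generation, with explicit camera motion control and coherent long-sequence generation, and illustrates the potential of tightly coupling audio semantic analysis with visual generation.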

📄 Abstract

While diffusion models for audio-driven avatar video generation have achieved notable progress in synthesizing long sequences with natural audio-visual synchronization and identity consistency, the generation of music-performance videos with camera motions remains largely unexplored. We present YingVideo-MV, the first cascaded framework for music-driven long-video generation. Our approach integrates audio semantic analysis, an interpretable shot planning module (MV-Director), temporal-aware diffusion Transformer architectures, and long-sequence consistency modeling to enable automatic synthesis of high-quality music performance videos from audio signals. We construct a large-scale Music-in-the-Wild Dataset by collecting web data to support the achievement of diverse, high-quality results. Observing that existing long-video generation methods lack explicit camera motion control, we introduce a camera adapter module that embeds camera poses into latent noise. To enhance continuity between clips during long-sequence inference, we further propose a time-aware dynamic window range strategy that adaptively adjusts denoising ranges based on audio embedding. Comprehensive benchmark tests demonstrate that YingVideo-MV achieves outstanding performance in generating coherent and expressive music videos, and enables precise music-motion-camera synchronization. More videos are available on our project page: https://giantailab.github.io/YingVideo-MV/ .

[26] Benchmarking Scientific Understanding and Reasoning for Video Generation using VideoScience-Bench

Lanxiang Hu, Abhilash Shankarampeta, Yixin Huang, Zilin Dai, Haoyang Yu, Yujie Zhao, Haoqiang Kang, Daniel Zhao, Tajana Rosing, Hao Zhang

🧩 TL;DR

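This paper introduces VideoScience-Bench, described as the first benchmark for evaluating the scientific reasoning of video models. It contains 200 prompts covering multi-concept scenarios in physics and chemistry, and evaluates seven state-of-the-art video models using expert annotation and a VLM-as-a-Judge approach.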

📘 Detailed Summary

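Motivation: The next frontier for video generation is models capable of zero-shot reasoning, which requires understanding real-world scientific laws in order to model physical outcomes accurately under diverse conditions. Existing video benchmarks are based mainly on physical commonsense and offer limited insight into scientific reasoning, so a benchmark targeting undergraduate-level scientific understanding is needed.

Method: The team built VideoScience-Bench with 200 carefully curated prompts spanning 14 topics and 103 concepts in physics and chemistry. Each prompt encodes a composite scientific scenario that requires understanding and reasoning across multiple concepts. Seven state-of-the-art video models are evaluated by expert annotators in text-to-video and image-to-video settings along five dimensions: Prompt Consistency, Phenomenon Congruency, Correct Dynamism, Immutability, and Spatio-Temporal Continuity. A VLM-as-a-Judge is also used to assess the generated videos.

Result: VLM-as-a-Judge assessments correlate strongly with human assessments. The benchmark evaluates video models both as generators and as reasoners, requiring generated videos to be consistent with the expected physical and chemical phenomena, and so provides a quantitative framework for assessing scientific reasoning.

Conclusion: VideoScience-Bench fills a gap in existing benchmarks by specifically evaluating the scientific reasoning of video models. It offers a systematic evaluation method that encourages video models to move from pure generation toward scientific understanding, and the data and evaluation code are publicly available.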

📄 Abstract

The next frontier for video generation lies in developing models capable of zero-shot reasoning, where understanding real-world scientific laws is crucial for accurate physical outcome modeling under diverse conditions. However, existing video benchmarks are physical commonsense-based, offering limited insight into video models' scientific reasoning capability. We introduce VideoScience-Bench, a benchmark designed to evaluate undergraduate-level scientific understanding in video models. Each prompt encodes a composite scientific scenario that requires understanding and reasoning across multiple scientific concepts to generate the correct phenomenon. The benchmark comprises 200 carefully curated prompts spanning 14 topics and 103 concepts in physics and chemistry. We conduct expert-annotated evaluations across seven state-of-the-art video models in T2V and I2V settings along five dimensions: Prompt Consistency, Phenomenon Congruency, Correct Dynamism, Immutability, and Spatio-Temporal Continuity. Using a VLM-as-a-Judge to assess video generations, we observe strong correlation with human assessments. To the best of our knowledge, VideoScience-Bench is the first benchmark to evaluate video models not only as generators but also as reasoners, requiring their generations to demonstrate scientific understanding consistent with expected physical and chemical phenomena. Our data and evaluation code are available at: https://github.com/hao-ai-lab/VideoScience.
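
Agreement between an automatic judge and human raters is often quantified with a rank correlation. The sketch below computes Spearman's rho from scratch as one plausible way to check such agreement; the abstract does not say which correlation statistic the authors report, and the score lists here are made up purely for illustration.

```python
def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)

def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical per-video scores (not from the paper).
human_scores = [4, 2, 5, 1, 3]
judge_scores = [3.5, 2.0, 4.8, 1.2, 3.9]
print(round(spearman(human_scores, judge_scores), 3))
```

A value near 1 means the judge orders the videos almost the same way the human raters do, even if the raw score scales differ.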

[27] A Large Scale Benchmark for Test Time Adaptation Methods in Medical Image Segmentation

Wenjing Yu, Shuo Jiang, Yifei Chen, Shuo Chang, Yuanhan Wang, Beining Wu, Jie Dong, Mingxuan Liu, Shenghao Zhu, Feiwei Qin, Changmiao Wang, Qiyuan Tian

🧩 TL;DR

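This study presents MedSeg-TTA, a comprehensive benchmark for test-time adaptation in medical image segmentation. It systematically evaluates 20 representative adaptation methods across 7 imaging modalities and clarifies the conditions under which different adaptation paradigms work and where they fall short.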

📘 Detailed Summary

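Motivation: Current research on test-time adaptation for medical image segmentation is limited in modality coverage, task diversity, and consistency of evaluation. The lack of systematic cross-modality comparison leaves method selection for clinical deployment without a sound basis.

Method: The MedSeg-TTA benchmark unifies data preprocessing, backbone configuration, and test-time protocols. It evaluates 20 representative adaptation methods covering four paradigms (Input-level Transformation, Feature-level Alignment, Output-level Regularization, and Prior Estimation) across seven imaging modalities: MRI, CT, ultrasound, pathology, dermoscopy, OCT, and chest X-ray.

Result: No single paradigm performs best in all conditions. Input-level methods are more stable under mild appearance shifts; feature-level and output-level methods offer greater advantages on boundary-related metrics; prior-based methods show strong modality dependence; and several methods degrade significantly under large inter-center and inter-device shifts.

Conclusion: The study underlines the importance of choosing an adaptation paradigm that fits the clinical scenario and maps out where each paradigm applies. It provides standardized datasets, validated implementations, and a public leaderboard as a rigorous foundation for developing robust, clinically reliable test-time adaptation methods.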

📄 Abstract

Test-time adaptation is a promising approach for mitigating domain shift in medical image segmentation; however, current evaluations remain limited in terms of modality coverage, task diversity, and methodological consistency. We present MedSeg-TTA, a comprehensive benchmark that examines twenty representative adaptation methods across seven imaging modalities, including MRI, CT, ultrasound, pathology, dermoscopy, OCT, and chest X-ray, under fully unified data preprocessing, backbone configuration, and test-time protocols. The benchmark encompasses four significant adaptation paradigms: Input-level Transformation, Feature-level Alignment, Output-level Regularization, and Prior Estimation, enabling the first systematic cross-modality comparison of their reliability and applicability. The results show that no single paradigm performs best in all conditions. Input-level methods are more stable under mild appearance shifts. Feature-level and Output-level methods offer greater advantages in boundary-related metrics, whereas prior-based methods exhibit strong modality dependence. Several methods degrade significantly under large inter-center and inter-device shifts, which highlights the importance of principled method selection for clinical deployment. MedSeg-TTA provides standardized datasets, validated implementations, and a public leaderboard, establishing a rigorous foundation for future research on robust, clinically reliable test-time adaptation. All source codes and open-source datasets are available at https://github.com/wenjing-gg/MedSeg-TTA.

[28] dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model

Yumeng Li, Guang Yang, Hao Liu, Bowen Wang, Colin Zhang

🧩 TL;DR

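This paper introduces dots.ocr, described as the first vision-language model to jointly learn the three core tasks of document layout parsing within a unified end-to-end framework. A scalable data engine synthesizes a multilingual corpus, and the model achieves state-of-the-art performance on multiple benchmarks.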

📘 Detailed Summary

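Motivation: Current document layout parsing methods rely on fragmented multi-stage pipelines that suffer from error propagation and cannot exploit the benefits of joint training. This limits the ability of AI to access and interpret structured knowledge, which is particularly important for empowering next-generation vision-language models.

Method: dots.ocr is a single vision-language model that jointly learns layout detection, text recognition, and relational understanding within a unified end-to-end framework. A highly scalable data engine synthesizes a large multilingual corpus, enabling robust performance across diverse languages, layouts, and domains.

Result: dots.ocr achieves state-of-the-art performance on the comprehensive OmniDocBench benchmark. On the newly introduced XDocParse benchmark, which spans 126 languages, it sets a strong new baseline and outperforms the next-best competitor by 7.4 points.

Conclusion: The study demonstrates the advantage of a unified end-to-end framework for document layout parsing: jointly learning the core tasks avoids the error propagation of multi-stage pipelines. The XDocParse benchmark provides a new testbed for global document intelligence research, and dots.ocr's multilingual performance sets a new reference point for cross-lingual document understanding.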

📄 Abstract

Document Layout Parsing serves as a critical gateway for Artificial Intelligence (AI) to access and interpret the world's vast stores of structured knowledge. This process, which encompasses layout detection, text recognition, and relational understanding, is particularly crucial for empowering next-generation Vision-Language Models. Current methods, however, rely on fragmented, multi-stage pipelines that suffer from error propagation and fail to leverage the synergies of joint training. In this paper, we introduce dots.ocr, a single Vision-Language Model that, for the first time, demonstrates the advantages of jointly learning three core tasks within a unified, end-to-end framework. This is made possible by a highly scalable data engine that synthesizes a vast multilingual corpus, empowering the model to deliver robust performance across a wide array of tasks, encompassing diverse languages, layouts, and domains. The efficacy of our unified paradigm is validated by state-of-the-art performance on the comprehensive OmniDocBench. Furthermore, to catalyze research in global document intelligence, we introduce XDocParse, a challenging new benchmark spanning 126 languages. On this testbed, dots.ocr establishes a powerful new baseline, outperforming the next-best competitor by a remarkable +7.4 point margin and proving its unparalleled multilingual capabilities.

[29] GeoDiT: A Diffusion-based Vision-Language Model for Geospatial Understanding

Jiaqi Liu, Ronghao Fu, Haoran Liu, Lang Sun, Bo Yang

🧩 TL;DR

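This paper proposes GeoDiT, described as the first diffusion-based vision-language model for the geospatial domain. Its parallel refinement generation process addresses the structural mismatch between autoregressive models and geospatial understanding, achieving new state-of-the-art performance on structured-output tasks.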

📘 Detailed Summary

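Motivation: Autoregressive models are structurally misaligned with the inherently parallel nature of geospatial understanding. They force a rigid sequential narrative onto scenes, which fundamentally hinders the generation of structured, coherent outputs, especially in geospatial analysis tasks that require resolving many semantic elements at once.

Method: The study reframes geospatial generation as a parallel refinement process, enabling holistic coarse-to-fine synthesis that resolves all semantic elements simultaneously. To operationalize this, it introduces GeoDiT, a diffusion-based vision-language model tailored to the geospatial domain.

Result: Extensive experiments show that GeoDiT establishes a new state of the art on benchmarks requiring structured, object-centric outputs, with significant gains in image captioning, visual grounding, and multi-object detection, the tasks where autoregressive models falter.

Conclusion: The study supports the view that aligning the generative process with the data's intrinsic structure is key to strong performance in complex geospatial analysis. It offers a new generative paradigm for geospatial AI and underlines the importance of matching model architecture to domain characteristics.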

📄 Abstract

Autoregressive models are structurally misaligned with the inherently parallel nature of geospatial understanding, forcing a rigid sequential narrative onto scenes and fundamentally hindering the generation of structured and coherent outputs. We challenge this paradigm by reframing geospatial generation as a parallel refinement process, enabling a holistic, coarse-to-fine synthesis that resolves all semantic elements simultaneously. To operationalize this, we introduce GeoDiT, the first diffusion-based vision-language model tailored for the geospatial domain. Extensive experiments demonstrate that GeoDiT establishes a new state-of-the-art on benchmarks requiring structured, object-centric outputs. It achieves significant gains in image captioning, visual grounding, and multi-object detection, precisely the tasks where autoregressive models falter. Our work validates that aligning the generative process with the data's intrinsic structure is key to unlocking superior performance in complex geospatial analysis.

[30] SkyMoE: A Vision-Language Foundation Model for Enhancing Geospatial Interpretation with Mixture of Experts

Jiaqi Liu, Ronghao Fu, Lang Sun, Haoran Liu, Xiao Yang, Weipeng Zhang, Xu Na, Zhuoran Duan, Bo Yang

🧩 TL;DR

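This paper presents SkyMoE, a Mixture-of-Experts vision-language model designed for remote sensing tasks. With a task- and granularity-aware routing mechanism, it delivers strong multimodal, multi-task remote sensing interpretation and reaches state-of-the-art performance on 21 public datasets.

📘 Detailed Summary

Motivation: General-purpose vision-language models remain suboptimal for remote sensing tasks. Existing geospatial VLMs adopt a unified modeling strategy and struggle to differentiate between task types and interpretation granularities, which limits their ability to balance local detail perception with global contextual understanding.

Method: SkyMoE uses a Mixture-of-Experts architecture with an adaptive router that generates task- and granularity-aware routing instructions, so that specialized large language model experts handle different sub-tasks. A context-disentangled augmentation strategy creates contrastive pairs between local and global features, guiding experts toward level-specific representation learning. The authors also construct MGRS-Bench, a benchmark covering multiple remote sensing interpretation tasks and granularity levels.

Result: Extensive experiments on 21 public datasets show that SkyMoE achieves state-of-the-art performance across tasks, validating its adaptability, scalability, and multi-granularity understanding in complex scenarios.

Conclusion: The study demonstrates the effectiveness of Mixture-of-Experts architectures for remote sensing vision-language modeling. Task- and granularity-aware routing achieves expert decoupling and specialization, providing a scalable and efficient solution for multimodal remote sensing interpretation and moving geospatial AI toward finer-grained, more adaptive models.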

📄 Abstract

The emergence of large vision-language models (VLMs) has significantly enhanced the efficiency and flexibility of geospatial interpretation. However, general-purpose VLMs remain suboptimal for remote sensing (RS) tasks. Existing geospatial VLMs typically adopt a unified modeling strategy and struggle to differentiate between task types and interpretation granularities, limiting their ability to balance local detail perception and global contextual understanding. In this paper, we present SkyMoE, a Mixture-of-Experts (MoE) vision-language model tailored for multimodal, multi-task RS interpretation. SkyMoE employs an adaptive router that generates task- and granularity-aware routing instructions, enabling specialized large language model experts to handle diverse sub-tasks. To further promote expert decoupling and granularity sensitivity, we introduce a context-disentangled augmentation strategy that creates contrastive pairs between local and global features, guiding experts toward level-specific representation learning. We also construct MGRS-Bench, a comprehensive benchmark covering multiple RS interpretation tasks and granularity levels, to evaluate generalization in complex scenarios. Extensive experiments on 21 public datasets demonstrate that SkyMoE achieves state-of-the-art performance across tasks, validating its adaptability, scalability, and superior multi-granularity understanding in remote sensing.

[31] On the Problem of Consistent Anomalies in Zero-Shot Anomaly Detection

Tai Le-Gia

🧩 TL;DR

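This dissertation addresses the core challenges of zero-shot anomaly classification and segmentation with solutions grounded in theoretical analysis and algorithm design: it identifies the consistent-anomaly problem, develops the graph-based CoDeGraph framework, extends it to 3D medical imaging, and bridges batch-based and text-based methods.

📘 Detailed Summary

Motivation: Zero-shot anomaly classification and segmentation (AC/AS) is increasingly important in industrial inspection and medical imaging, but existing methods suffer from a failure mode called consistent anomalies, in which recurring similar anomalies systematically bias distance-based methods. The work investigates the core challenges of zero-shot AC/AS and offers principled solutions rooted in theory and algorithm design.

Method: The work first analyzes the statistical and geometric behavior of patch representations from pre-trained Vision Transformers and identifies two key phenomena, similarity scaling and neighbor-burnout. It then proposes CoDeGraph, a graph-based framework that filters consistent anomalies through multi-stage graph construction, community detection, and structured refinement. The framework is extended to 3D medical imaging with a training-free, computationally efficient volumetric tokenization strategy. Finally, the work shows how CoDeGraph-derived pseudo-masks can supervise prompt-driven vision-language models.

Result: The study characterizes how relationships among normal patches change with and without consistent anomalies in settings with highly similar objects. CoDeGraph effectively suppresses the influence of consistent anomalies, and the volumetric tokenization strategy enables a genuinely zero-shot 3D anomaly detection pipeline, showing that volumetric anomaly segmentation is achievable without any 3D training samples. The approach also bridges batch-based and text-based methods.

Conclusion: The dissertation provides both theoretical understanding and practical solutions for zero-shot AC/AS, covering the full chain from problem formalization to algorithm design to application. The framework applies to 2D industrial inspection, extends to 3D medical imaging, and integrates with vision-language models, which demonstrates its generality and extensibility.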

📄 Abstract

Zero-shot anomaly classification and segmentation (AC/AS) aim to detect anomalous samples and regions without any training data, a capability increasingly crucial in industrial inspection and medical imaging. This dissertation aims to investigate the core challenges of zero-shot AC/AS and presents principled solutions rooted in theory and algorithmic design. We first formalize the problem of consistent anomalies, a failure mode in which recurring similar anomalies systematically bias distance-based methods. By analyzing the statistical and geometric behavior of patch representations from pre-trained Vision Transformers, we identify two key phenomena - similarity scaling and neighbor-burnout - that describe how relationships among normal patches change with and without consistent anomalies in settings characterized by highly similar objects. We then introduce CoDeGraph, a graph-based framework for filtering consistent anomalies built on the similarity scaling and neighbor-burnout phenomena. Through multi-stage graph construction, community detection, and structured refinement, CoDeGraph effectively suppresses the influence of consistent anomalies. Next, we extend this framework to 3D medical imaging by proposing a training-free, computationally efficient volumetric tokenization strategy for MRI data. This enables a genuinely zero-shot 3D anomaly detection pipeline and shows that volumetric anomaly segmentation is achievable without any 3D training samples. Finally, we bridge batch-based and text-based zero-shot methods by demonstrating that CoDeGraph-derived pseudo-masks can supervise prompt-driven vision-language models. Together, this dissertation provides theoretical understanding and practical solutions for the zero-shot AC/AS problem.

[32] WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens

Jian Yang, Dacheng Yin, Xiaoxuan He, Yong Li, Fengyun Rao, Jing Lyu, Wei Zhai, Yang Cao, Zheng-Jun Zha

🧩 TL;DR

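This paper proposes Noisy Query Tokens, which learn a distributed representation space between a vision-language model and a diffusion model through end-to-end optimization. The method addresses the generalization collapse caused by fixed query tokens in multimodal large language models, and a VAE branch is added to recover fine-grained image details.

📘 Detailed Summary

Motivation: In current multimodal large language models, methods that use a fixed number of learnable query tokens are computationally efficient but suffer from task generalization collapse: they fail to adapt to new tasks that differ substantially from the pre-training tasks, which limits performance in continual-learning settings.

Method: The paper proposes Noisy Query Tokens, which learn a distributed representation space between the vision-language model and the diffusion model via end-to-end optimization and thereby enhance continual learning. It also introduces a VAE branch with linear projection to recover fine-grained image details.

Result: Experiments show that the method effectively mitigates generalization collapse and enables stable continual learning across diverse tasks, supporting the value of the distributed representation space and the VAE branch for adaptability and detail recovery.

Conclusion: The study offers a new approach to efficiently bridging multimodal large language models and diffusion models. Distributed representation learning combined with a detail-recovery mechanism improves generalization and adaptability in continual-learning settings and lays groundwork for further progress in cross-modal generation.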

📄 Abstract

Recent progress in multimodal large language models (MLLMs) has highlighted the challenge of efficiently bridging pre-trained Vision-Language Models (VLMs) with Diffusion Models. While methods using a fixed number of learnable query tokens offer computational efficiency, they suffer from task generalization collapse, failing to adapt to new tasks that are distant from their pre-training tasks. To overcome this, we propose Noisy Query Tokens, which learn a distributed representation space between the VLM and Diffusion Model via end-to-end optimization, enhancing continual learning. Additionally, we introduce a VAE branch with linear projection to recover fine-grained image details. Experimental results confirm our approach mitigates generalization collapse and enables stable continual learning across diverse tasks.

[33] OmniPerson: Unified Identity-Preserving Pedestrian Generation

Changxiao Ma, Chao Yuan, Xincheng Shi, Yuzhuo Ma, Yongfei Zhang, Longkun Zhou, Yujia Zhang, Shangze Li, Yifan Xu

🧩 TL;DR

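This paper presents OmniPerson, the first unified identity-preserving pedestrian generation framework for visible/infrared image/video person re-identification tasks. With a Multi-Refer Fuser and the PersonSyn dataset, it achieves high-fidelity, controllable pedestrian generation and improves the performance of ReID models.

📘 Detailed Summary

Motivation: Person re-identification lacks large-scale, high-quality training data, mainly because of data privacy concerns and annotation costs. Existing pedestrian generation methods fall short on identity consistency and controllability, which limits their usefulness for dataset augmentation. A generation framework that preserves identity and offers fine-grained control is therefore needed.

Method: The paper proposes OmniPerson, a unified generation model that supports RGB/IR image and video generation conditioned on multiple reference images, two kinds of person poses, and text, and that also provides RGB-to-IR transfer and image super-resolution. A Multi-Refer Fuser extracts a unified identity representation from multi-view reference images to ensure identity consistency. The authors also build the PersonSyn dataset and its automated curation pipeline, which converts public, ID-only ReID benchmarks into a richly annotated resource with dense multimodal supervision.

Result: Experiments show that OmniPerson achieves state-of-the-art pedestrian generation, excelling in both visual fidelity and identity consistency. Augmenting existing datasets with the generated data consistently improves ReID model performance, confirming the framework's value for data augmentation.

Conclusion: The study provides an effective identity-preserving data augmentation solution for person re-identification, with a unified generation framework that addresses the identity-consistency and controllability limitations of existing methods. The planned open-source release of OmniPerson should support related research, and PersonSyn provides a valuable resource for controllable pedestrian generation.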

📄 Abstract

Person re-identification (ReID) suffers from a lack of large-scale high-quality training data due to challenges in data privacy and annotation costs. While previous approaches have explored pedestrian generation for data augmentation, they often fail to ensure identity consistency and suffer from insufficient controllability, thereby limiting their effectiveness in dataset augmentation. To address this, we introduce OmniPerson, the first unified identity-preserving pedestrian generation pipeline for visible/infrared image/video ReID tasks. Our contributions are threefold: 1) We propose OmniPerson, a unified generation model offering holistic and fine-grained control over all key pedestrian attributes. It supports RGB/IR image and video generation conditioned on any number of reference images, two kinds of person poses, and text, and it also provides RGB-to-IR transfer and image super-resolution. 2) We design the Multi-Refer Fuser for robust identity preservation with any number of reference images as input, allowing OmniPerson to distill a unified identity from a set of multi-view reference images and generate high-fidelity pedestrians. 3) We introduce PersonSyn, the first large-scale dataset for multi-reference, controllable pedestrian generation, and present its automated curation pipeline, which transforms public, ID-only ReID benchmarks into a richly annotated resource with the dense, multi-modal supervision required for this task. Experimental results demonstrate that OmniPerson achieves SoTA in pedestrian generation, excelling in both visual fidelity and identity consistency. Furthermore, augmenting existing datasets with our generated data consistently improves the performance of ReID models. We will open-source the full codebase, pretrained model, and the PersonSyn dataset.

[34] RULER-Bench: Probing Rule-based Reasoning Abilities of Next-level Video Generation Models for Vision Foundation Intelligence

Xuming He, Zehao Fan, Hengjia Li, Fan Zhuo, Hankun Xu, Senlin Cheng, Di Weng, Haifeng Liu, Can Ye, Boxi Wu

🧩 TL;DR

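This paper presents RULER-Bench, a benchmark for evaluating the rule-based reasoning abilities of video generation models. The state-of-the-art model reaches only 48.87% on the rule coherence metric, revealing a substantial gap in the reasoning ability of video generation models.

📘 Detailed Summary

Motivation: Existing benchmarks for video generation models focus mainly on visual perception and understanding, such as visual aesthetics, instruction adherence, and temporal coherence, while rule-based reasoning remains largely unexplored. Recent studies have made preliminary explorations of video models as zero-shot learners, but they lack a fine-grained decomposition of reasoning capabilities and a comprehensive evaluation protocol.

Method: RULER-Bench is built on two fundamental paradigms, text-to-video and image-to-video, and covers 40 representative tasks across six rule categories with 622 high-quality annotated instances. For each generated video, the authors construct a checklist covering four metrics and use GPT-o3 to score each question, reaching 85% alignment with human judgements.

Result: Extensive experiments show that the state-of-the-art video generation model scores only 48.87% on the rule coherence metric, leaving significant room for improvement in the reasoning ability of next-level video models. The 85% alignment between GPT-o3 scoring and human judgement supports the reliability of the automated evaluation.

Conclusion: RULER-Bench exposes the limits of current video generation models in rule-based reasoning and provides an evaluation tool for research on reasoning-aware video generation. The authors expect it to help advance video generation models toward vision foundation intelligence and better understanding of cognitive rules.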

📄 Abstract

Recent advances in video generation have enabled the synthesis of videos with strong temporal consistency and impressive visual quality, marking a crucial step toward vision foundation models. To evaluate these video generation models, existing benchmarks primarily focus on factors related to visual perception and understanding, like visual aesthetics, instruction adherence, and temporal coherence. However, the rule-based reasoning capabilities of video generation models remain largely unexplored. Although recent studies have carried out preliminary explorations into whether video models can serve as zero-shot learners, they still lack a fine-grained decomposition of reasoning capabilities and a comprehensive evaluation protocol. To address this gap, we introduce RULER-Bench, a benchmark designed to evaluate the reasoning ability of video generation models from the perspective of cognitive rules. Built upon two fundamental paradigms: text-to-video and image-to-video, RULER-Bench covers 40 representative tasks spanning six rule categories with 622 high-quality annotated instances. For the evaluation of each generated video, we construct a checklist covering four metrics and leverage GPT-o3 to assign scores to each question, achieving 85% alignment with human judgements. Extensive experiments show that the state-of-the-art model achieves only 48.87% on the rule coherence metric, highlighting significant room for improvement in the reasoning capability of next-level video models. We expect that the insight obtained from RULER-Bench will facilitate further development of reasoning-aware video generation, advancing video generation models toward vision foundation intelligence.

[35] PPTBench: Towards Holistic Evaluation of Large Language Models for PowerPoint Layout and Design Understanding

Zheng Huang, Xukai Liu, Tianyu Hu, Kai Zhang, Ye Liu

🧩 TL;DR

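This paper presents PPTBench, a comprehensive benchmark for evaluating multimodal large language models on PowerPoint-related tasks. It reveals a marked weakness of current models in visual-layout reasoning and offers a new evaluation perspective for research on visual-structural reasoning.

📘 Detailed Summary

Motivation: Existing benchmarks focus on narrow subtasks and overlook layout understanding, a core challenge that is central to real-world slide creation and editing. PowerPoint presentations combine rich textual content with structured visual layouts, which makes them a natural testbed for the multimodal reasoning and layout understanding abilities of modern multimodal large language models.

Method: The authors introduce PPTBench, a comprehensive multimodal benchmark built from a diverse source of 958 PPTX files. It evaluates models across four categories (Detection, Understanding, Modification, and Generation) with 4,439 samples in total. The benchmark systematically assesses visual-layout reasoning and semantic understanding, with particular attention to how well models combine visual cues with JSON-based layout structures.

Result: Experiments reveal a substantial gap between semantic understanding and visual-layout reasoning in current multimodal large language models: models can interpret slide content but fail to produce coherent spatial arrangements. Ablations and further analysis show that current models struggle to combine visual cues with JSON-based layout structures and fail to integrate visual information into their API planning ability. Case studies visually expose systematic layout errors such as misalignment and element overlap.

Conclusion: The findings provide a new perspective on evaluating vision-language models in PPT scenarios and highlight challenges and future directions in visual-structural reasoning and coherent slide generation. All datasets and code are fully released to support reproducibility and future research, and the results underline the need for multimodal models that combine semantic understanding with spatial layout reasoning.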

📄 Abstract

PowerPoint presentations combine rich textual content with structured visual layouts, making them a natural testbed for evaluating the multimodal reasoning and layout understanding abilities of modern MLLMs. However, existing benchmarks focus solely on narrow subtasks while overlooking layout-centric challenges, which are central to real-world slide creation and editing. To bridge this gap, we introduce PPTBench, a comprehensive multimodal benchmark for evaluating LLMs on PowerPoint-related tasks. Leveraging a diverse source of 958 PPTX files, PPTBench evaluates models across four categories with 4,439 samples, including Detection, Understanding, Modification, and Generation. Our experiments reveal a substantial gap between semantic understanding and visual-layout reasoning in current MLLMs: models can interpret slide content but fail to produce coherent spatial arrangements. Ablation and further analysis show that current MLLMs struggle to combine visual cues with JSON-based layout structures and fail to integrate visual information into their API planning ability. Case studies also visually expose systematic layout errors such as misalignment and element overlap. These findings provide a new perspective on evaluating VLLMs in PPT scenarios, highlighting challenges and directions for future research on visual-structural reasoning and coherent slide generation. All datasets and code are fully released to support reproducibility and future research.

[36] Leveraging Large-Scale Pretrained Spatial-Spectral Priors for General Zero-Shot Pansharpening

Yongchuan Cui, Peng Liu, Yi Zeng

🧩 TL;DR

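This study proposes a new strategy of pretraining on large-scale simulated data so that foundation-style models learn robust spatial-spectral priors. It significantly improves the cross-sensor generalization of remote sensing image fusion models and performs strongly in both zero-shot and few-shot settings.

📘 Detailed Summary

Motivation: Existing deep learning methods for remote sensing image fusion generalize poorly to unseen datasets because real training data is limited and there is a domain gap between different satellite sensors. This restricts their practical use in cross-domain scenarios.

Method: The study proposes a novel pretraining strategy that learns robust spatial-spectral priors from diverse simulated datasets. It applies various degradation operations (blur, noise, downsampling) and augmentations (band generation, channel shuffling, high-pass filtering, color jittering, and others) to natural images from ImageNet and remote sensing images from SkyScript, pretrains fusion models on the simulated data, and evaluates them on six datasets under zero-shot and one-shot paradigms, exploring both full-tuning and freeze-tuning for fine-tuning.

Result: Extensive experiments on several network architectures, including convolutional neural networks, Transformer, and Mamba, show that the pretraining strategy significantly improves the generalization of various fusion models across satellite sensors and imaging conditions. The pretrained models achieve strong zero-shot results and adapt well with minimal real data in the one-shot setting, as validated on six datasets: WorldView-2/3/4, IKONOS, QuickBird, and GaoFen-2.

Conclusion: The study provides a practical solution for cross-domain pansharpening, establishes a new benchmark for generalization in remote sensing image fusion, and opens a path toward leveraging foundation models through advanced training strategies. It shows the potential of simulated-data pretraining for improving cross-sensor adaptability and reducing dependence on real data.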

📄 Abstract

Existing deep learning methods for remote sensing image fusion often suffer from poor generalization when applied to unseen datasets due to the limited availability of real training data and the domain gap between different satellite sensors. To address this challenge, we explore the potential of foundation models by proposing a novel pretraining strategy that leverages large-scale simulated datasets to learn robust spatial-spectral priors. Specifically, our approach first constructs diverse simulated datasets by applying various degradation operations (blur, noise, downsampling) and augmentations (bands generation, channel shuffling, high-pass filtering, color jittering, etc.) to natural images from ImageNet and remote sensing images from SkyScript. We then pretrain fusion models on these simulated data to learn generalizable spatial-spectral representations. The pretrained models are subsequently evaluated on six datasets (WorldView-2/3/4, IKONOS, QuickBird, GaoFen-2) using zero-shot and one-shot paradigms, with both full- and freeze-tuning approaches for fine-tuning. Extensive experiments on different network architectures including convolutional neural networks, Transformer, and Mamba demonstrate that our pretraining strategy significantly improves generalization performance across different satellite sensors and imaging conditions for various fusion models. The pretrained models achieve superior results in zero-shot scenarios and show remarkable adaptation capability with minimal real data in one-shot settings. Our work provides a practical solution for cross-domain pansharpening, establishes a new benchmark for generalization in remote sensing image fusion tasks, and paves the way for leveraging foundation models through advanced training strategies.
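The degradation step described above (blur, noise, downsampling) can be illustrated with a minimal NumPy sketch. This is not the authors' pipeline: the box-blur kernel, the noise level, the scale factor of 4, the band-mean panchromatic image, and the function names `box_blur` and `simulate_pair` are all assumptions made for illustration.

```python
import numpy as np

def box_blur(img, k=3):
    """Blur each band with a k x k box filter (k odd, edge-padded)."""
    pad = k // 2
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    h, w = img.shape[:2]
    out = np.zeros(img.shape, dtype=np.float64)
    # Sum the k*k shifted copies of the image, then average.
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w, :]
    return out / (k * k)

def simulate_pair(hr_ms, scale=4, noise_sigma=0.01, seed=0):
    """Build a (low-res multispectral, panchromatic, target) training triple.

    hr_ms: float array of shape (H, W, bands) with values in [0, 1].
    The PAN image is approximated by the band mean (an assumption).
    """
    rng = np.random.default_rng(seed)
    blurred = box_blur(hr_ms, k=3)
    lr_ms = blurred[::scale, ::scale, :]            # decimate after blurring
    noise = rng.normal(0.0, noise_sigma, lr_ms.shape)
    lr_ms = np.clip(lr_ms + noise, 0.0, 1.0)        # additive Gaussian noise
    pan = hr_ms.mean(axis=2)                        # crude panchromatic proxy
    return lr_ms, pan, hr_ms
```

A fusion model would then be trained to reconstruct the target from the low-resolution multispectral image and the panchromatic image; the augmentations listed in the abstract (channel shuffling, color jittering, and so on) would be applied before this step.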

[37] Hear What Matters! Text-conditioned Selective Video-to-Audio Generation

Junwon Lee, Juhan Nam, Jiyoung Lee

🧩 TL;DR

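This paper presents SelVA, a model for text-conditioned selective video-to-audio generation that produces only the user-specified sound source from a multi-object video. It does so by modulating the video encoder with the prompt and suppressing irrelevant activations.

📘 Detailed Summary

Motivation: Current video-to-audio methods generate single source-mixed sounds at once, largely because visual features are entangled and region cues or prompts often fail to specify the source. This falls short of multimedia production needs, where each sound source must be edited, mixed, and creatively controlled on its own track.

Method: SelVA treats the text prompt as an explicit selector of the target source and modulates the video encoder to extract prompt-relevant video features. Supplementary tokens promote cross-attention by suppressing text-irrelevant activations with efficient parameter tuning. A self-augmentation scheme compensates for the lack of mono audio track supervision.

Result: On the curated VGG-MONOAUDIO benchmark, SelVA performs well in audio quality, semantic alignment, and temporal synchronization. Extensive experiments and ablations consistently verify its effectiveness at generating the user-specified sound source.

Conclusion: The study provides a tool for precise audio editing in multimedia production. Text-conditioned modulation disentangles visual features, the work establishes a benchmark for selective audio generation, and the self-augmentation scheme offers a workable answer to the shortage of supervised data.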

📄 Abstract

This work introduces a new task, text-conditioned selective video-to-audio (V2A) generation, which produces only the user-intended sound from a multi-object video. This capability is especially crucial in multimedia production, where audio tracks are handled individually for each sound source for precise editing, mixing, and creative control. However, current approaches generate single source-mixed sounds at once, largely because visual features are entangled, and region cues or prompts often fail to specify the source. We propose SelVA, a novel text-conditioned V2A model that treats the text prompt as an explicit selector of target source and modulates video encoder to distinctly extract prompt-relevant video features. The proposed supplementary tokens promote cross-attention by suppressing text-irrelevant activations with efficient parameter tuning, yielding robust semantic and temporal grounding. SelVA further employs a self-augmentation scheme to overcome the lack of mono audio track supervision. We evaluate SelVA on VGG-MONOAUDIO, a curated benchmark of clean single-source videos for such a task. Extensive experiments and ablations consistently verify its effectiveness across audio quality, semantic alignment, and temporal synchronization. Code and demo are available at https://jnwnlee.github.io/selva-demo/.

[38] Spatially-Grounded Document Retrieval via Patch-to-Region Relevance Propagation

Agathoklis Georgiou

🧩 TL;DR

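This paper proposes a hybrid architecture that combines the fine-grained similarity scores of vision-language models with OCR-extracted regions to achieve precise region localization in document retrieval without additional training. Coordinate mapping and intersection metrics handle the alignment between visual patches and OCR bounding boxes.

📘 Detailed Summary

Motivation: Vision-language models such as ColPali return entire pages rather than specific regions in document retrieval, which limits their usefulness for retrieval-augmented generation where precise context matters. OCR-based systems extract structured text with coordinates but cannot assess semantic relevance. Each paradigm has its own limits, so a unified approach that combines their strengths is needed.

Method: The hybrid architecture uses ColPali's patch-level similarity scores as spatial relevance filters over OCR-extracted regions. The work formalizes the coordinate mapping between vision transformer patch grids and OCR bounding boxes, introduces intersection metrics for relevance propagation, and establishes theoretical bounds on retrieval precision. The whole method runs at inference time without additional training.

Result: The author releases Snappy, an open-source implementation that demonstrates practical applicability. Empirical evaluation is still ongoing; the current contribution is the theoretical analysis of retrieval precision bounds and the patch-to-region alignment via coordinate mapping and intersection metrics.

Conclusion: The study shows how to unify the semantic understanding of vision-language models with the spatial precision of OCR systems, giving retrieval-augmented generation a more precise way to extract context. The hybrid architecture is especially relevant for applications that need fine-grained region localization.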

📄 Abstract

Vision-language models (VLMs) like ColPali achieve state-of-the-art document retrieval by embedding pages as images and computing fine-grained similarity between query tokens and visual patches. However, they return entire pages rather than specific regions, limiting utility for retrieval-augmented generation (RAG) where precise context is paramount. Conversely, OCR-based systems extract structured text with bounding box coordinates but lack semantic grounding for relevance assessment. We propose a hybrid architecture that unifies these paradigms: using ColPali's patch-level similarity scores as spatial relevance filters over OCR-extracted regions. We formalize the coordinate mapping between vision transformer patch grids and OCR bounding boxes, introduce intersection metrics for relevance propagation, and establish theoretical bounds on retrieval precision. Our approach operates at inference time without additional training. We release Snappy, an open-source implementation demonstrating practical applicability, with empirical evaluation ongoing.
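The patch-to-region idea can be sketched as follows. This is an illustrative assumption, not the paper's formulation or the Snappy API: it supposes that per-patch relevance has already been reduced to one score per cell of a uniform grid, and it uses an area-weighted mean as one possible intersection metric. The name `region_score` is hypothetical.

```python
import math
import numpy as np

def region_score(patch_scores, box, page_w, page_h):
    """Area-weighted mean of patch scores over the patches a box overlaps.

    patch_scores: array of shape (rows, cols), one relevance score per patch.
    box: (x0, y0, x1, y1) OCR bounding box in page pixel coordinates.
    """
    rows, cols = patch_scores.shape
    pw, ph = page_w / cols, page_h / rows           # patch size in pixels
    x0, y0, x1, y1 = box
    # Index range of patches that the box can touch.
    c0, c1 = int(x0 // pw), min(cols - 1, math.ceil(x1 / pw) - 1)
    r0, r1 = int(y0 // ph), min(rows - 1, math.ceil(y1 / ph) - 1)
    total, weight = 0.0, 0.0
    for r in range(r0, r1 + 1):
        for c in range(c0, c1 + 1):
            # Overlap between the box and patch (r, c).
            ix = min(x1, (c + 1) * pw) - max(x0, c * pw)
            iy = min(y1, (r + 1) * ph) - max(y0, r * ph)
            area = max(ix, 0.0) * max(iy, 0.0)
            total += area * float(patch_scores[r, c])
            weight += area
    return total / weight if weight > 0 else 0.0
```

Ranking the OCR regions of a retrieved page by this score, and keeping only the top few, would yield region-level context instead of the whole page. Other intersection metrics (for example, maximum score over overlapped patches) fit the same interface.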

[39] UAUTrack: Towards Unified Multimodal Anti-UAV Visual Tracking

Qionglin Ren, Dawei Zhang, Chunxu Tian, Dan Zhang

🧩 TL;DR

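This paper proposes UAUTrack, a unified single-target tracking framework for Anti-UAV tracking. Its single-stream, single-stage, end-to-end architecture integrates multiple modalities, and a text prior prompt strategy directs the model to focus on UAV targets across scenarios.

📘 Detailed Summary

Motivation: Anti-UAV tracking research has explored several modalities, including RGB, TIR, and RGB-T fusion, but a unified framework for cross-modal collaboration is still lacking. Existing approaches focus on independent models for individual tasks and overlook the potential of cross-modal information sharing. Anti-UAV tracking techniques are also still in their infancy, and current solutions struggle to fuse multimodal data effectively.

Method: UAUTrack uses a single-stream, single-stage, end-to-end architecture that integrates multiple modalities. Its key component is a text prior prompt strategy that directs the model to focus on UAV targets across different scenarios, supporting cross-modal fusion and collaboration.

Result: Experiments show that UAUTrack achieves state-of-the-art performance on the Anti-UAV and DUT Anti-UAV datasets. On the Anti-UAV410 dataset it maintains a favourable trade-off between accuracy and speed, demonstrating both high accuracy and practical efficiency across diverse Anti-UAV scenarios.

Conclusion: The study provides a unified framework for the multimodal fusion challenge in Anti-UAV tracking, with cross-modal collaboration driven by the text prior prompt strategy. UAUTrack balances accuracy and efficiency well, making it a viable basis for practical Anti-UAV tracking systems.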

📄 Abstract

Research in Anti-UAV (Unmanned Aerial Vehicle) tracking has explored various modalities, including RGB, TIR, and RGB-T fusion. However, a unified framework for cross-modal collaboration is still lacking. Existing approaches have primarily focused on independent models for individual tasks, often overlooking the potential for cross-modal information sharing. Furthermore, Anti-UAV tracking techniques are still in their infancy, with current solutions struggling to achieve effective multimodal data fusion. To address these challenges, we propose UAUTrack, a unified single-target tracking framework built upon a single-stream, single-stage, end-to-end architecture that effectively integrates multiple modalities. UAUTrack introduces a key component: a text prior prompt strategy that directs the model to focus on UAVs across various scenarios. Experimental results show that UAUTrack achieves state-of-the-art performance on the Anti-UAV and DUT Anti-UAV datasets, and maintains a favourable trade-off between accuracy and speed on the Anti-UAV410 dataset, demonstrating both high accuracy and practical efficiency across diverse Anti-UAV scenarios.

[40] ClimaOoD: Improving Anomaly Segmentation via Physically Realistic Synthetic Data

Yuxing Liu, Yong Liu

🧩 TL;DR

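This paper proposes ClimaDrive, a semantics-guided image-to-image synthesis framework for generating semantically coherent, weather-diverse, and physically plausible anomalous driving data, and builds the large-scale benchmark ClimaOoD, which improves the robustness and generalization of anomaly segmentation models.

📘 Detailed Summary

Motivation: Anomaly segmentation is essential for autonomous driving, but the scarcity and limited diversity of anomaly data severely constrain generalization in open-world environments. Existing methods mitigate this with synthetic data generation, yet the results often lack contextual coherence and physical realism, leaving a domain gap between synthetic and real data.

Method: ClimaDrive is a semantics-guided image-to-image framework that unifies structure-guided multi-weather generation with prompt-driven anomaly inpainting. It synthesizes semantically coherent, weather-diverse, and physically plausible anomalous driving data. Based on it, the authors construct ClimaOoD, a benchmark spanning six representative driving scenarios under both clear and adverse weather conditions.

Result: Extensive experiments on four state-of-the-art methods show that training with ClimaOoD leads to robust improvements in anomaly segmentation. AUROC, AP, and FPR95 improve notably across all methods; for RbA on Fishyscapes LAF, FPR95 drops from 3.97 to 3.52.

Conclusion: ClimaOoD enhances model robustness and offers valuable training data for open-world anomaly detection, promoting better generalization. The study demonstrates that semantics-guided synthetic data generation is an effective response to anomaly data scarcity and supports the development of safer autonomous driving systems.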

📄 Abstract

Anomaly segmentation seeks to detect and localize unknown or out-of-distribution (OoD) objects that fall outside predefined semantic classes, a capability essential for safe autonomous driving. However, the scarcity and limited diversity of anomaly data severely constrain model generalization in open-world environments. Existing approaches mitigate this issue through synthetic data generation, either by copy-pasting external objects into driving scenes or by leveraging text-to-image diffusion models to inpaint anomalous regions. While these methods improve anomaly diversity, they often lack contextual coherence and physical realism, resulting in domain gaps between synthetic and real data. In this paper, we present ClimaDrive, a semantics-guided image-to-image framework for synthesizing semantically coherent, weather-diverse, and physically plausible OoD driving data. ClimaDrive unifies structure-guided multi-weather generation with prompt-driven anomaly inpainting, enabling the creation of visually realistic training data. Based on this framework, we construct ClimaOoD, a large-scale benchmark spanning six representative driving scenarios under both clear and adverse weather conditions. Extensive experiments on four state-of-the-art methods show that training with ClimaOoD leads to robust improvements in anomaly segmentation. Across all methods, AUROC, AP, and FPR95 show notable gains, with FPR95 dropping from 3.97 to 3.52 for RbA on Fishyscapes LAF. These results demonstrate that ClimaOoD enhances model robustness, offering valuable training data for better generalization in open-world anomaly detection.
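FPR95, one of the metrics cited above, is conventionally the false-positive rate at the threshold where 95% of true anomalies are detected, so lower values are better. The sketch below shows one way to compute it, assuming per-pixel anomaly scores where a higher score means more anomalous. The function name `fpr_at_tpr` is illustrative and is not taken from the paper's code.

```python
import numpy as np

def fpr_at_tpr(scores, labels, tpr=0.95):
    """False-positive rate at the threshold that detects `tpr` of anomalies.

    scores: anomaly scores (higher = more anomalous).
    labels: 1/True for anomalous pixels, 0/False for normal pixels.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos = np.sort(scores[labels])[::-1]        # anomaly scores, descending
    k = int(np.ceil(tpr * pos.size))           # anomalies that must be detected
    threshold = pos[k - 1]                     # lowest score still flagged
    # Fraction of normal pixels whose score reaches the threshold.
    return float(np.mean(scores[~labels] >= threshold))
```

For example, with anomaly scores `[0.9, 0.8, 0.7, 0.6]` and normal scores `[0.1, 0.2, 0.65, 0.3]`, all four anomalies must be flagged to reach 95% recall, the threshold becomes 0.6, and one of four normal pixels is wrongly flagged.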

[41] GeoBridge: A Semantic-Anchored Multi-View Foundation Model Bridging Images and Text for Geo-Localization

Zixuan Song, Jing Zhang, Di Wang, Zidie Zhou, Wenbin Liu, Haonan Guo, En Wang, Bo Du

🧩 TL;DR

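This paper presents GeoBridge, a foundation model for cross-view geo-localization. A novel semantic-anchor mechanism enables bidirectional matching across views and supports language-to-image retrieval, and the authors build GeoLoc, the first large-scale, cross-modal, multi-view aligned dataset.

📘 Detailed Summary

Motivation: The traditional satellite-centric geo-localization paradigm loses robustness when high-resolution or up-to-date satellite imagery is unavailable, and it underexploits complementary cues across views (for example, drone, satellite, and street) and modalities (for example, language and image). A more flexible and robust localization approach is needed.

Method: GeoBridge builds on a novel semantic-anchor mechanism that bridges multi-view features through textual descriptions, enabling bidirectional cross-view matching and language-to-image retrieval. The authors also construct the GeoLoc dataset, comprising over 50,000 pairs of drone, street-view panorama, and satellite images with textual descriptions, collected from 36 countries.

Result: Experiments show that GeoLoc pre-training markedly improves the geo-localization accuracy of GeoBridge while promoting cross-domain generalization and cross-modal knowledge transfer. The model is evaluated broadly across multiple tasks, which confirms its effectiveness.

Conclusion: The study moves beyond the traditional satellite-centric paradigm and achieves more robust, flexible localization through the semantic-anchor mechanism and multimodal alignment. GeoLoc provides an important resource for cross-view, cross-modal geo-localization research and advances the use of foundation models in geospatial understanding.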

📄 Abstract

Cross-view geo-localization infers a location by retrieving geo-tagged reference images that visually correspond to a query image. However, the traditional satellite-centric paradigm limits robustness when high-resolution or up-to-date satellite imagery is unavailable. It further underexploits complementary cues across views (e.g., drone, satellite, and street) and modalities (e.g., language and image). To address these challenges, we propose GeoBridge, a foundation model that performs bidirectional matching across views and supports language-to-image retrieval. Going beyond traditional satellite-centric formulations, GeoBridge builds on a novel semantic-anchor mechanism that bridges multi-view features through textual descriptions for robust, flexible localization. In support of this task, we construct GeoLoc, the first large-scale, cross-modal, and multi-view aligned dataset comprising over 50,000 pairs of drone, street-view panorama, and satellite images as well as their textual descriptions, collected from 36 countries, ensuring both geographic and semantic alignment. We performed broad evaluations across multiple tasks. Experiments confirm that GeoLoc pre-training markedly improves geo-location accuracy for GeoBridge while promoting cross-domain generalization and cross-modal knowledge transfer. The dataset, source code, and pretrained models were released at https://github.com/MiliLab/GeoBridge.

[42] VLM-Pruner: Buffering for Spatial Sparsity in an Efficient VLM Centrifugal Token Pruning Paradigm

Zhenkai Wu, Xiaowen Ma, Zhenliang Ni, Dengming Zhang, Han Shu, Xin Jiang, Xinghao Chen

🧩 TL;DR

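This paper proposes VLM-Pruner, a training-free token pruning algorithm for vision-language models that explicitly balances redundancy and spatial sparsity. At an 88.9% pruning rate it delivers end-to-end inference speedup and consistently outperforms existing baselines across five VLMs.

📘 Detailed Summary

Motivation: The large number of visual tokens in vision-language models imposes significant computational costs and hinders deployment on mobile devices. Many pruning methods rely solely on token importance and overlook inter-token redundancy, retaining many duplicated tokens and wasting capacity. Some redundancy-aware approaches, in turn, ignore the spatial relationships among visual tokens, which can leave the retained tokens too sparse to cover the regions of target objects adequately.

Method: VLM-Pruner adopts a centrifugal token pruning paradigm that selects tokens from near to far while prioritizing the preservation of fine-grained object details. A Buffering for Spatial Sparsity (BSS) criterion defers the selection of spatially distant tokens, a parallel greedy strategy makes token selection efficient, and salient information from discarded tokens is selectively fused into the retained ones to mitigate information loss.

Result: Across five vision-language models, VLM-Pruner consistently outperforms strong baselines at an 88.9% pruning rate while delivering an end-to-end inference speedup. Comprehensive comparisons show that it improves computational efficiency while keeping the pruning rate high.

Conclusion: The study demonstrates the importance of explicitly balancing redundancy and spatial sparsity in visual token pruning. The centrifugal pruning paradigm and the BSS criterion address the limitations of existing methods and open a path toward efficient deployment of vision-language models on resource-constrained devices.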

📄 Abstract

Vision-language models (VLMs) excel at image understanding tasks, but the large number of visual tokens imposes significant computational costs, hindering deployment on mobile devices. Many pruning methods rely solely on token importance and thus overlook inter-token redundancy, retaining numerous duplicated tokens and wasting capacity. Although some redundancy-aware approaches have been proposed, they often ignore the spatial relationships among visual tokens. This can lead to overly sparse selections of retained tokens that fail to adequately cover the regions of target objects. To address these limitations, we propose VLM-Pruner, a training-free token pruning algorithm that explicitly balances redundancy and spatial sparsity. We introduce a centrifugal token pruning paradigm that enables near-to-far selection while prioritizing the preservation of fine-grained object details. Moreover, we design a Buffering for Spatial Sparsity (BSS) criterion that defers the selection of spatially distant tokens. We further adopt a parallel greedy strategy to conduct token selection efficiently. To mitigate information loss from pruning, we selectively fuse salient information from the discarded tokens into the retained ones. Comprehensive comparisons demonstrate that VLM-Pruner consistently outperforms strong baselines across five VLMs with an 88.9% pruning rate, while delivering an end-to-end inference speedup.
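To make the 88.9% pruning rate concrete, the short sketch below converts between a pruning rate and a retained-token budget. The 576-token input (for example, a 24 x 24 patch grid) is a hypothetical figure chosen for illustration; the abstract does not state the token counts used, and the helper names are not from the paper.

```python
def pruning_rate(num_tokens, kept):
    """Fraction of visual tokens removed when `kept` tokens are retained."""
    return 1.0 - kept / num_tokens

def retained_tokens(num_tokens, rate):
    """Number of visual tokens left after pruning at the given rate."""
    return round(num_tokens * (1.0 - rate))

# Hypothetical example: keeping 64 of 576 tokens removes 8/9 of them.
rate = pruning_rate(576, 64)          # about 0.889
budget = retained_tokens(576, 0.889)  # 64
```

Under this assumption the language model attends over one ninth of the original visual tokens, which is where the inference speedup would come from.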

[43] GeoViS: Geospatially Rewarded Visual Search for Remote Sensing Visual Grounding

Peirong Zhang, Yidan Zhang, Luxiao Xu, Jinliang Lin, Zonghao Guo, Fengxiang Wang, Xue Yang, Kaiwen Wei, Lei Wang

🧩 TL;DR

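This paper proposes GeoViS, a framework that reformulates remote sensing visual grounding as a progressive search-and-reasoning process. It actively explores the global image through a tree-structured sequence of visual cues, achieves precise geospatial understanding, and surpasses existing methods on multiple remote sensing benchmarks.

📘 Detailed Summary

Motivation: Multimodal large language models have made remarkable progress in visual grounding, but transferring these capabilities to remote sensing imagery remains challenging. Targets are often extremely small within kilometer-scale scenes, and queries typically involve intricate spatial relations such as relative positions, spatial hierarchies, or contextual dependencies across distant objects.

Method: GeoViS reformulates remote sensing visual grounding as a progressive search-and-reasoning process. It actively explores the global image through a tree-structured sequence of visual cues, integrating multimodal perception, spatial reasoning, and reward-guided exploration to refine geospatial hypotheses iteratively. This lets the model detect small targets while maintaining holistic scene awareness.

Result: Extensive experiments on five remote sensing grounding benchmarks show that GeoViS achieves precise geospatial understanding and consistently surpasses existing methods on key visual grounding metrics, with strong cross-domain generalization and interpretability.

Conclusion: The study shows that a progressive search-and-reasoning framework is effective for remote sensing visual grounding and provides a new way to handle small targets and complex spatial relations. It underlines the importance of integrating multimodal perception with spatial reasoning for geospatial AI applications.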

📄 Abstract

Recent advances in multimodal large language models (MLLMs) have led to remarkable progress in visual grounding, enabling fine-grained cross-modal alignment between textual queries and image regions. However, transferring such capabilities to remote sensing imagery remains challenging, as targets are often extremely small within kilometer-scale scenes, and queries typically involve intricate geospatial relations such as relative positions, spatial hierarchies, or contextual dependencies across distant objects. To address these challenges, we propose GeoViS, a Geospatially Rewarded Visual Search framework that reformulates remote sensing visual grounding as a progressive search-and-reasoning process. Rather than directly predicting the target location in a single step, GeoViS actively explores the global image through a tree-structured sequence of visual cues, integrating multimodal perception, spatial reasoning, and reward-guided exploration to refine geospatial hypotheses iteratively. This design enables the model to detect subtle small-scale targets while maintaining holistic scene awareness. Extensive experiments on five remote sensing grounding benchmarks demonstrate that GeoViS achieves precise geospatial understanding and consistently surpasses existing methods across key visual grounding metrics, highlighting its strong cross-domain generalization and interpretability.

[44] UnicEdit-10M: A Dataset and Benchmark Breaking the Scale-Quality Barrier via Unified Verification for Reasoning-Enriched Edits

Keming Ye, Zhipeng Huang, Canmiao Fu, Qingyang Liu, Jiani Cai, Zheqi Lv, Chen Li, Jing Lyu, Zhou Zhao, Shengyu Zhang

🧩 TL;DR

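This paper introduces the UnicEdit-10M dataset and the UnicBench benchmark. A lightweight data pipeline and a 7B expert model, Qwen-Verify, address the trade-off between data quality and scale in multimodal image editing, giving open-source models large-scale, high-quality training data and a fine-grained diagnostic benchmark.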

📘 Detailed Summary

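Motivation: The performance gap between closed-source and open-source models in multimodal image editing is widening, mainly because large-scale, high-quality training data is scarce and no benchmark comprehensively diagnoses model weaknesses across diverse editing behaviors. Existing data construction methods face a scale-quality trade-off: human annotation is high-quality but not scalable, while automated pipelines suffer from error propagation and noise.

Method: The authors propose a lightweight data pipeline that replaces multi-tool chains with an end-to-end model and adds a unified post-verification stage. For scalable quality control, they train a 7B dual-task expert model, Qwen-Verify, for efficient failure detection and instruction recaptioning. The pipeline yields the UnicEdit-10M dataset, which spans diverse basic and complex editing tasks. They also propose UnicBench, a benchmark that extends beyond basic edits to explicitly assess spatial and knowledge-driven reasoning, and they introduce new metrics including Non-edit Consistency and Reasoning Accuracy.

Result: The pipeline produces UnicEdit-10M, a 10M-scale dataset covering diverse basic and complex editing tasks. An analysis of mainstream models on UnicBench reveals their limitations, including weaknesses in spatial reasoning and knowledge-driven editing, and gives clear directions for future research.

Conclusion: Through its data construction method and comprehensive benchmark, the study provides important infrastructure for multimodal image editing. UnicEdit-10M eases the tension between training data scale and quality, while UnicBench provides fine-grained model diagnosis. Together they can help narrow the performance gap between open-source and closed-source models and point to future research directions.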

📄 Abstract

With the rapid advances of powerful multimodal models such as GPT-4o, Nano Banana, and Seedream 4.0 in image editing, the performance gap between closed-source and open-source models is widening, primarily due to the scarcity of large-scale, high-quality training data and comprehensive benchmarks capable of diagnosing model weaknesses across diverse editing behaviors. Existing data construction methods face a scale-quality trade-off: human annotations are high-quality but not scalable, while automated pipelines suffer from error propagation and noise. To address this, we introduce a lightweight data pipeline that replaces multi-toolchains with an end-to-end model and a unified post-verification stage. For scalable quality control, we train a 7B dual-task expert model, Qwen-Verify, for efficient failure detection and instruction recaptioning. This pipeline yields UnicEdit-10M, a 10M-scale dataset spanning diverse basic and complex editing tasks. We also propose UnicBench, a general benchmark that extends beyond basic edits to explicitly assess spatial and knowledge-driven reasoning. To enable fine-grained diagnosis, we introduce novel metrics, including Non-edit Consistency and Reasoning Accuracy. Our analysis of mainstream models on UnicBench reveals their limitations and provides clear directions for future research.

[45] HUD: Hierarchical Uncertainty-Aware Disambiguation Network for Composed Video Retrieval

Zhiwei Chen, Yupeng Hu, Zixu Li, Zhiheng Fu, Haokun Wen, Weili Guan

🧩 TL;DR

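This paper proposes a novel Hierarchical Uncertainty-aware Disambiguation network (HUD) for composed video retrieval. HUD addresses the problems caused by the disparity in information density between the video and text modalities, exploits that disparity to improve multimodal query understanding, and achieves state-of-the-art performance on multiple benchmark datasets.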

📘 Detailed Summary

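Motivation: In composed video retrieval (CVR), a multimodal query consists of a reference video and modification text. Prior work has overlooked the disparity in information density between the two modalities, which leads to two key issues that degrade CVR performance: ambiguity about which subject the modification refers to, and limited focus on detailed semantics.

Method: HUD comprises three key components: Holistic Pronoun Disambiguation, Atomistic Uncertainty Modeling, and Holistic-to-Atomistic Alignment. It exploits overlapping semantics through holistic cross-modal interaction and achieves fine-grained semantic alignment through atomistic-level cross-modal interaction, enabling effective object disambiguation and a stronger focus on detailed semantics.

Result: HUD achieves state-of-the-art performance on three benchmark datasets covering both composed video retrieval and composed image retrieval (CIR), showing that the framework applies to CIR as well as CVR.

Conclusion: According to the authors, this is the first work to exploit the information density disparity between video and text to enhance multimodal query understanding. Hierarchical uncertainty modeling and semantic alignment address the information imbalance between modalities, providing a new framework for composed retrieval with potential for cross-task use.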

📄 Abstract

Composed Video Retrieval (CVR) is a challenging video retrieval task that utilizes multi-modal queries, consisting of a reference video and modification text, to retrieve the desired target video. The core of this task lies in understanding the multi-modal composed query and achieving accurate composed feature learning. Within multi-modal queries, the video modality typically carries richer semantic content compared to the textual modality. However, previous works have largely overlooked the disparity in information density between these two modalities. This limitation can lead to two critical issues: 1) modification subject referring ambiguity and 2) limited detailed semantic focus, both of which degrade the performance of CVR models. To address the aforementioned issues, we propose a novel CVR framework, namely the Hierarchical Uncertainty-aware Disambiguation network (HUD). HUD is the first framework that leverages the disparity in information density between video and text to enhance multi-modal query understanding. It comprises three key components: (a) Holistic Pronoun Disambiguation, (b) Atomistic Uncertainty Modeling, and (c) Holistic-to-Atomistic Alignment. By exploiting overlapping semantics through holistic cross-modal interaction and fine-grained semantic alignment via atomistic-level cross-modal interaction, HUD enables effective object disambiguation and enhances the focus on detailed semantics, thereby achieving precise composed feature learning. Moreover, our proposed HUD is also applicable to the Composed Image Retrieval (CIR) task and achieves state-of-the-art performance across three benchmark datasets for both CVR and CIR tasks. The codes are available on https://zivchen-ty.github.io/HUD.github.io/.

[46] PhyCustom: Towards Realistic Physical Customization in Text-to-Image Generation

Fan Wu, Cheng Chen, Zhoujie Fu, Jiacheng Wei, Yi Xu, Deheng Ye, Guosheng Lin

🧩 TL;DR

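This paper proposes PhyCustom, a framework that uses two novel regularization losses to activate diffusion models for physical concept customization. It addresses the shortcomings of existing methods in customizing physical properties and outperforms them in both quantitative and qualitative evaluations.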

📘 Detailed Summary

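Motivation: Existing diffusion-based text-to-image customization methods handle concrete concepts such as styles and shapes well but fall short on physical concepts. The core limitation is that physical knowledge is never explicitly introduced during training: even when the prompt contains physics-related words, existing methods fail to reflect the corresponding physical properties accurately.

Method: PhyCustom is a fine-tuning framework with two novel regularization losses. The isometric loss activates the diffusion model to learn physical concepts, while the decouple loss helps eliminate the mixed learning of independent concepts. By explicitly introducing physical knowledge, the framework strengthens the model's understanding and control of physical properties.

Result: Experiments on a diverse dataset show that PhyCustom outperforms previous state-of-the-art and popular methods in physical customization, both quantitatively and qualitatively. The benchmark results support the framework's effectiveness at reflecting physical properties accurately, particularly for physics-related prompts.

Conclusion: The study highlights the importance of explicitly introducing physical knowledge into diffusion models and offers an effective solution for physical concept customization. PhyCustom's results suggest that purpose-built regularization losses can substantially improve a model's understanding and generation of physical properties, pointing to new directions for physics-aware image generation.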

📄 Abstract

Recent diffusion-based text-to-image customization methods have achieved significant success in understanding concrete concepts, such as styles and shapes, to control generation processes. However, few efforts address the realistic yet challenging customization of physical concepts. The core limitation of current methods is that physical knowledge is not explicitly introduced during training. Even when physics-related words appear in the input text prompts, our experiments consistently demonstrate that these methods fail to accurately reflect the corresponding physical properties in the generated results. In this paper, we propose PhyCustom, a fine-tuning framework comprising two novel regularization losses that activate diffusion models to perform physical customization. Specifically, the proposed isometric loss aims at activating diffusion models to learn physical concepts, while the decouple loss helps to eliminate the mixture learning of independent concepts. Experiments are conducted on a diverse dataset, and our benchmark results demonstrate that PhyCustom outperforms previous state-of-the-art and popular methods in physical customization, both quantitatively and qualitatively.

[47] Action Anticipation at a Glimpse: To What Extent Can Multimodal Cues Replace Video?

Manuel Benavent-Lledo, Konstantinos Bacharidis, Victoria Manousaki, Konstantinos Papoutsakis, Antonis Argyros, Jose Garcia-Rodriguez

🧩 TL;DR

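This paper proposes AAG, which anticipates actions by combining single-frame RGB features, depth cues, and prior action information. On three instructional activity datasets it is competitive with methods that aggregate temporal information from video, demonstrating that single-frame multimodal action anticipation is feasible.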

📘 Detailed Summary

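Motivation: Conventional action anticipation methods rely on extracting and aggregating temporal information from video, yet humans can often predict an upcoming action from a single moment of a scene when given sufficient context. The study investigates whether video aggregation can be replaced by alternative modalities, enabling action anticipation from a single frame.

Method: AAG combines single-frame RGB features with depth cues to enhance spatial reasoning and incorporates prior action information to provide long-term context. That context comes either from textual summaries generated by vision-language models or from the predictions of a single-frame action recognizer, forming a multimodal single-frame action anticipation framework.

Result: Experiments show that multimodal single-frame action anticipation with AAG performs competitively against both temporally aggregated video baselines and current state-of-the-art methods on three instructional activity datasets: IKEA-ASM, Meccano, and Assembly101.

Conclusion: The study suggests that, with suitable spatial cues and contextual information, a single-frame multimodal approach can stand in for conventional temporal video aggregation in action anticipation, although the abstract notes that effectiveness depends on task complexity. This offers a new direction for action understanding research and shows the potential of multimodal fusion to reduce computational demands.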

📄 Abstract

Anticipating actions before they occur is a core challenge in action understanding research. While conventional methods rely on extracting and aggregating temporal information from videos, as humans we can often predict upcoming actions by observing a single moment from a scene, when given sufficient context. Can a model achieve this competence? The short answer is yes, although its effectiveness depends on the complexity of the task. In this work, we investigate to what extent video aggregation can be replaced with alternative modalities. To this end, based on recent advances in visual feature extraction and language-based reasoning, we introduce AAG, a method for Action Anticipation at a Glimpse. AAG combines RGB features with depth cues from a single frame for enhanced spatial reasoning, and incorporates prior action information to provide long-term context. This context is obtained either through textual summaries from Vision-Language Models, or from predictions generated by a single-frame action recognizer. Our results demonstrate that multimodal single-frame action anticipation using AAG can perform competitively compared to both temporally aggregated video baselines and state-of-the-art methods across three instructional activity datasets: IKEA-ASM, Meccano, and Assembly101.

[48] MICCAI STSR 2025 Challenge: Semi-Supervised Teeth and Pulp Segmentation and CBCT-IOS Registration

Yaqi Wang, Zhi Li, Chengyu Wu, Jun Liu, Yifan Zhang, Jialuo Chen, Jiaxue Ni, Qian Luo, Jin Liu, Can Han, Changkai Ji, Zhi Qin Tan, Ajo Babu George, Liangyu Chen, Qianni Zhang, Dahong Qian, Shuai Wang, Huiyu Zhou

🧩 TL;DR

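By organizing the STSR 2025 Challenge, this study establishes a semi-supervised learning benchmark for CBCT and IOS data in digital dentistry, addresses the scarcity of annotated data, and advances automated solutions for teeth segmentation and cross-modal registration.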

📘 Detailed Summary

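Motivation: Cone-Beam Computed Tomography (CBCT) and Intraoral Scanning (IOS) are essential for automated diagnosis and treatment planning in digital dentistry, but the scarcity of annotated data severely limits deep learning solutions for tasks such as teeth and pulp canal segmentation and cross-modal registration. A semi-supervised learning benchmark is therefore needed to advance the field.

Method: The STSR 2025 Challenge comprised two tasks: semi-supervised segmentation of teeth and pulp canals in CBCT, and semi-supervised rigid registration of CBCT and IOS. The organizers provided 60 labeled and 640 unlabeled IOS samples, plus 30 labeled and 250 unlabeled CBCT scans. Leading segmentation solutions used nnU-Net and Mamba-like state space models with pseudo-labeling and consistency regularization, while registration solutions combined PointNetLK with differentiable SVD and geometric augmentation to handle modality gaps.

Result: The challenge attracted strong community participation. The best segmentation methods reached a Dice score of 0.967 and an Instance Affinity of 0.738 on the hidden test set. For registration, hybrid neural-classical refinement achieved accurate alignment despite limited labels. All data and code are publicly available to ensure reproducibility.

Conclusion: The study demonstrates the effectiveness of semi-supervised learning in digital dentistry, particularly for medical image analysis tasks with scarce annotations. The challenge gives the community a standardized benchmark and open-source solutions, advances CBCT and IOS processing, and provides datasets and methods for future research.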

📄 Abstract

Cone-Beam Computed Tomography (CBCT) and Intraoral Scanning (IOS) are essential for digital dentistry, but annotated data scarcity limits automated solutions for pulp canal segmentation and cross-modal registration. To benchmark semi-supervised learning (SSL) in this domain, we organized the STSR 2025 Challenge at MICCAI 2025, featuring two tasks: (1) semi-supervised segmentation of teeth and pulp canals in CBCT, and (2) semi-supervised rigid registration of CBCT and IOS. We provided 60 labeled and 640 unlabeled IOS samples, plus 30 labeled and 250 unlabeled CBCT scans with varying resolutions and fields of view. The challenge attracted strong community participation, with top teams submitting open-source deep learning-based SSL solutions. For segmentation, leading methods used nnU-Net and Mamba-like State Space Models with pseudo-labeling and consistency regularization, achieving a Dice score of 0.967 and Instance Affinity of 0.738 on the hidden test set. For registration, effective approaches combined PointNetLK with differentiable SVD and geometric augmentation to handle modality gaps; hybrid neural-classical refinement enabled accurate alignment despite limited labels. All data and code are publicly available at https://github.com/ricoleehduu/STS-Challenge-2025 to ensure reproducibility.
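The segmentation result above is reported as a Dice score. As a reminder of what that number measures, here is a minimal sketch of the standard Dice coefficient on flattened binary masks. It is not the challenge's evaluation code; the function name `dice_score` and the convention of returning 1.0 when both masks are empty are assumptions made for this illustration.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must contain the same number of voxels")
    # Voxels labeled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground voxels across both masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


# Toy example: 3 predicted voxels, 2 ground-truth voxels, 2 in common.
print(dice_score([1, 1, 0, 0, 1], [1, 0, 0, 0, 1]))  # 2*2 / (3+2) = 0.8
```

A score of 0.967 therefore means the predicted and reference masks overlap almost completely relative to their combined size.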

[49] MindGPT-4ov: An Enhanced MLLM via a Multi-Stage Post-Training Paradigm

Wei Chen, Chaoqun Du, Feng Gu, Wei He, Qizhen Li, Zide Liu, Xuhao Pan, Chang Ren, Xudong Rao, Chenfeng Wang, Tao Wei, Chengjun Yu, Pengfei Yu, Yufei Zheng, Chunpeng Zhou, Pan Zhou, Xuhan Zhu

🧩 TL;DR

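This paper presents MindGPT-4ov, a multimodal large language model that introduces a general post-training paradigm spanning data production, model training, and efficient deployment. It achieves state-of-the-art performance on multiple benchmarks at low cost and strengthens the foundational capabilities and generalization of MLLMs.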

📘 Detailed Summary

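Motivation: Current multimodal large language models struggle with generating high-quality cross-domain data, balancing domain-specific knowledge against general capabilities, jointly optimizing reasoning ability with multiple other objectives, and keeping training and deployment efficient and affordable. A systematic post-training paradigm is needed to improve model performance and lower the barrier to adoption.

Method: The work proposes three key innovations: an information density-based data generation scheme, combined with a dual-dimensional tree-structured label system, for automated generation of high-quality cross-domain data; a collaborative curriculum supervised fine-tuning approach that balances domain knowledge injection with preservation of general capabilities; and a hybrid reinforcement learning paradigm that enhances reasoning while addressing objectives such as diversity exploration, maintenance of multimodal perception, and response conciseness. It also implements infrastructure optimizations including 5D parallel training, operator optimization, and inference quantization.

Result: MindGPT-4ov outperforms state-of-the-art models on benchmarks such as MMBench, MMStar, MathVision, and MathVista. It also delivers a superior user experience in vertical domain tasks, enabling a seamless transition from academic research to industrial deployment, and reaches high performance at low cost.

Conclusion: The study provides a general post-training paradigm applicable to a wide range of MLLMs, improving performance and practicality through a systematic framework for data construction, training strategy, and deployment optimization. The model weights, datasets, and code for the Qwen3-VL-based variants will be open-sourced to support community development of MLLMs.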

📄 Abstract

We present MindGPT-4ov, a multimodal large language model (MLLM) that introduces a general post-training paradigm spanning data production, model training, and efficient deployment. It achieves state-of-the-art performance across multiple benchmarks at low cost, effectively enhancing the foundational capabilities of MLLMs and their generalization ability. Focusing on data construction, supervised fine-tuning strategies, and multimodal reinforcement learning methods, this work proposes three key innovations: (1) An information density-based data generation scheme, integrated with a dual-dimensional tree-structured label system, enabling automated generation of high-quality cross-domain data. (2) A collaborative curriculum supervised fine-tuning approach that balances the injection of domain-specific knowledge with the preservation of general capabilities. (3) A hybrid reinforcement learning paradigm that enhances reasoning ability while simultaneously addressing multi-objective optimization such as diversity exploration, maintenance of multimodal perception, and response conciseness. Moreover, we implement a series of infrastructure optimizations, such as 5D parallel training, operator optimization, and inference quantization, to enhance training and inference efficiency while reducing the cost of domain adaptation. Experimental results demonstrate that the MindGPT-4ov model outperforms state-of-the-art models on benchmarks such as MMBench, MMStar, MathVision, and MathVista. In addition, MindGPT-4ov demonstrates superior user experience in vertical domain tasks, enabling a seamless transition from academic research to industrial deployment. MindGPT-4ov provides a general post-training paradigm applicable to a wide range of MLLMs. The model weights, datasets, and code for the Qwen3-VL-based variants will be open-sourced soon to support the community's development of MLLMs.

[50] Layout Anything: One Transformer for Universal Room Layout Estimation

Md Sohag Mia, Muhammad Abdullah Adnan

🧩 TL;DR

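This paper presents Layout Anything, a transformer-based framework for indoor layout estimation that adapts OneFormer's universal segmentation architecture to geometric structure prediction. It performs end-to-end layout estimation, eliminates complex post-processing pipelines, and reaches high-speed inference at 114 ms.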

📘 Detailed Summary

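Motivation: The study targets two problems in indoor layout estimation: complex post-processing pipelines and insufficient integration of geometric constraints. It aims for an end-to-end framework that predicts geometric structure directly while preserving Manhattan-world constraints, improving both accuracy and computational efficiency.

Method: The method adapts OneFormer's universal segmentation architecture, integrating its task-conditioned queries and contrastive learning with two key modules. A layout degeneration strategy augments training data through topology-aware transformations while preserving Manhattan-world constraints, and differentiable geometric losses directly enforce planar consistency and sharp boundary predictions during training. The components are unified in an end-to-end framework.

Result: The method achieves state-of-the-art performance on standard benchmarks: pixel error of 5.43% and corner error of 4.02% on LSUN, pixel error of 7.04% and corner error of 5.17% on Hedau, and pixel error of 4.03% and corner error of 3.15% on Matterport3D-Layout, with inference at 114 ms.

Conclusion: The study shows that combining a universal segmentation architecture with geometric constraints is effective, yielding a layout estimation solution that is both geometry-aware and computationally efficient. It is particularly suited to augmented reality applications and large-scale 3D scene reconstruction, and it illustrates the potential of end-to-end frameworks for geometric structure prediction.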

📄 Abstract

We present Layout Anything, a transformer-based framework for indoor layout estimation that adapts OneFormer's universal segmentation architecture to geometric structure prediction. Our approach integrates OneFormer's task-conditioned queries and contrastive learning with two key modules: (1) a layout degeneration strategy that augments training data while preserving Manhattan-world constraints through topology-aware transformations, and (2) differentiable geometric losses that directly enforce planar consistency and sharp boundary predictions during training. By unifying these components in an end-to-end framework, the model eliminates complex post-processing pipelines while achieving high-speed inference at 114 ms. Extensive experiments demonstrate state-of-the-art performance across standard benchmarks, with pixel error (PE) of 5.43% and corner error (CE) of 4.02% on LSUN, PE of 7.04% (CE 5.17%) on Hedau, and PE of 4.03% (CE 3.15%) on Matterport3D-Layout. The framework's combination of geometric awareness and computational efficiency makes it particularly suitable for augmented reality applications and large-scale 3D scene reconstruction tasks.

Guowen Zhang, Chenhang He, Liyi Chen, Lei Zhang

🧩 TL;DR

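This paper proposes BEVDilation, a LiDAR-centric BEV fusion framework. By treating image BEV features as implicit guidance rather than simply concatenating them, it alleviates the spatial misalignment caused by image depth estimation errors and outperforms existing methods on the nuScenes benchmark.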

📘 Detailed Summary

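Motivation: Existing methods that fuse LiDAR and camera data in the BEV representation often degrade performance because of the fundamental disparity in geometric accuracy between the two sensors. In particular, image depth estimation errors cause spatial misalignment, and the inherent sparsity and semantic limitations of point clouds also need to be addressed.

Method: BEVDilation adopts a LiDAR-centric fusion strategy. A Sparse Voxel Dilation Block densifies foreground voxels using image priors to mitigate point sparsity, and a Semantic-Guided BEV Dilation Block enhances LiDAR feature diffusion with image semantic guidance and long-range context capture.

Result: On the nuScenes benchmark, BEVDilation outperforms state-of-the-art methods while maintaining competitive computational efficiency. Notably, the LiDAR-centric strategy is more robust to depth noise than naive fusion.

Conclusion: The study demonstrates the advantages of a LiDAR-centric fusion strategy for multi-sensor data. Using image features as implicit guidance rather than direct concatenation alleviates spatial misalignment, while image priors compensate for the sparsity and semantic limitations of point clouds, offering a new approach to sensor fusion for 3D object detection.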

📄 Abstract

Integrating LiDAR and camera information in the bird's eye view (BEV) representation has demonstrated its effectiveness in 3D object detection. However, because of the fundamental disparity in geometric accuracy between these sensors, indiscriminate fusion in previous methods often leads to degraded performance. In this paper, we propose BEVDilation, a novel LiDAR-centric framework that prioritizes LiDAR information in the fusion. By formulating image BEV features as implicit guidance rather than naive concatenation, our strategy effectively alleviates the spatial misalignment caused by image depth estimation errors. Furthermore, the image guidance can effectively help the LiDAR-centric paradigm to address the sparsity and semantic limitations of point clouds. Specifically, we propose a Sparse Voxel Dilation Block that mitigates the inherent point sparsity by densifying foreground voxels through image priors. Moreover, we introduce a Semantic-Guided BEV Dilation Block to enhance the LiDAR feature diffusion processing with image semantic guidance and long-range context capture. On the challenging nuScenes benchmark, BEVDilation achieves better performance than state-of-the-art methods while maintaining competitive computational efficiency. Importantly, our LiDAR-centric strategy demonstrates greater robustness to depth noise compared to naive fusion. The source code is available at https://github.com/gwenzhang/BEVDilation.

Zhongyu Yang, Yingfang Yuan, Xuanming Jiang, Baoyi An, Wei Pang

🧩 TL;DR

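This paper proposes InEx, a training-free multi-agent framework that autonomously mitigates hallucination in multimodal large language models through introspective reasoning and external multi-agent collaboration. It outperforms existing methods on multiple benchmarks.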

📘 Detailed Summary

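Motivation: Hallucination seriously hinders the development of reliable multimodal large language models. Existing solutions often rely on human intervention or underutilize an agent's ability to mitigate hallucination on its own, so a more autonomous mitigation method is needed.

Method: Inspired by human cognition, InEx combines introspective reasoning guided by entropy-based uncertainty estimation with external cross-modal multi-agent collaboration, in which an editing agent and self-reflection agents iteratively verify and refine the response.

Result: In extensive experiments, InEx consistently outperforms existing methods on general and hallucination benchmarks, with gains of 4%-27%, and shows strong robustness.

Conclusion: The study shows that a multi-agent framework combining introspective reasoning with external verification, modeled on human cognition, can autonomously mitigate hallucination in multimodal large language models, offering a new direction for building more reliable AI systems.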

📄 Abstract

Hallucination remains a critical challenge in large language models (LLMs), hindering the development of reliable multimodal LLMs (MLLMs). Existing solutions often rely on human intervention or underutilize the agent's ability to autonomously mitigate hallucination. To address these limitations, we draw inspiration from how humans make reliable decisions in the real world. They begin with introspective reasoning to reduce uncertainty and form an initial judgment, then rely on external verification from diverse perspectives to reach a final decision. Motivated by this cognitive paradigm, we propose InEx, a training-free, multi-agent framework designed to autonomously mitigate hallucination. InEx introduces internal introspective reasoning, guided by entropy-based uncertainty estimation, to improve the reliability of the decision agent's reasoning process. The agent first generates a response, which is then iteratively verified and refined through external cross-modal multi-agent collaboration with the editing agent and self-reflection agents, further enhancing reliability and mitigating hallucination. Extensive experiments show that InEx consistently outperforms existing methods, achieving 4%-27% gains on general and hallucination benchmarks, and demonstrating strong robustness.
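The abstract says introspective reasoning is guided by entropy-based uncertainty estimation but does not give the formula. As a rough illustration only, the sketch below computes the Shannon entropy of next-token distributions and flags a response for further verification when the mean entropy is high. The function names, the use of a mean over steps, and the threshold value are all assumptions for this example, not details from InEx.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def needs_introspection(step_distributions: List[Sequence[float]],
                        threshold: float) -> bool:
    """Flag a response whose mean per-token entropy exceeds a threshold.

    Both the averaging and the threshold are hypothetical choices.
    """
    entropies = [token_entropy(p) for p in step_distributions]
    return sum(entropies) / len(entropies) > threshold


confident = [[0.97, 0.01, 0.01, 0.01]]   # peaked distribution, low entropy
uncertain = [[0.25, 0.25, 0.25, 0.25]]   # uniform distribution, entropy ln(4)
print(needs_introspection(confident, threshold=1.0))  # False
print(needs_introspection(uncertain, threshold=1.0))  # True
```

A uniform distribution over four tokens has entropy ln 4 ≈ 1.386 nats, the maximum for four outcomes, which is why the second call crosses the illustrative threshold.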

cs.CL [Back]

[53] From Imitation to Discrimination: Toward A Generalized Curriculum Advantage Mechanism Enhancing Cross-Domain Reasoning Tasks

Changpeng Yang, Jinyang Wu, Yuchen Liu, Shuai Zhang, Yang Li, Qiliang Liang, Hongzhen Wang, Shuai Nie, Jiaming Xu, Runyu Shi, Ying Huang, Guoquan Zhang

🧩 TL;DR

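This paper proposes CAPO (Curriculum Advantage Policy Optimization), an adaptive curriculum mechanism based on advantage signals. By separating positive-advantage and negative-advantage samples, it strengthens reinforcement learning for post-training large language models and delivers stable, significant gains on mathematical reasoning and multimodal GUI reasoning tasks.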

📘 Detailed Summary

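Motivation: When post-training large language models, existing reinforcement learning methods usually mix positive and negative advantage signals indiscriminately, especially in the early stages of training. This mixing can produce ambiguous guidance and limited gains, which holds back further improvement in reasoning ability.

Method: CAPO uses an adaptive curriculum mechanism based on advantage signals. It first performs imitation learning with positive-advantage samples only, to establish a robust foundation, and then introduces negative signals to cultivate discriminative capabilities. The method is compatible with several optimization methods, including GRPO, PPO, RLOO, and Reinforce++.

Result: CAPO achieves stable and significant improvements on mathematical reasoning tasks and generalizes effectively to multimodal Graphical User Interface (GUI) reasoning scenarios, supporting its robustness and effectiveness as a general optimization framework.

Conclusion: The study shows that separating positive and negative advantage signals in a curriculum can significantly improve reinforcement learning post-training and offers a new approach to generalization in complex scenarios. CAPO's compatibility makes it a general enhancement for multiple optimization methods.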

📄 Abstract

Reinforcement learning has emerged as a paradigm for post-training large language models, boosting their reasoning capabilities. Such approaches compute an advantage value for each sample, reflecting better or worse performance than expected, thereby yielding both positive and negative signals for training. However, the indiscriminate mixing of the two signals in existing methods, especially from the early stages, may lead to ambiguous guidance and limited gains. To address this issue, we propose CAPO (Curriculum Advantage Policy Optimization), an adaptive curriculum mechanism based on advantage signals. The proposed mechanism bootstraps imitation learning with positive-only advantage samples to establish robust foundations, and subsequently introduces negative signals to cultivate discriminative capabilities, thereby improving generalization across complex scenarios. Compatible with diverse optimization methods including GRPO, PPO, RLOO, and Reinforce++, our method consistently achieves stable and significant improvements in mathematical reasoning tasks, and further generalizes effectively to multimodal Graphical User Interface (GUI) reasoning scenarios, establishing itself as a versatile and robust optimization framework.
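The two-phase idea in the abstract (positive-only advantages first, negative signals later) can be sketched in a few lines. This is an illustration under stated assumptions, not the CAPO algorithm: the abstract calls the curriculum adaptive but does not specify the schedule, so the fixed `switch_step` below, the mean-centered advantage, and both function names are hypothetical.

```python
from typing import List


def group_advantages(rewards: List[float]) -> List[float]:
    """Mean-centered advantages within one group of sampled responses.

    A simplified, GRPO-style baseline without standard-deviation scaling.
    """
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def curriculum_advantages(advantages: List[float], step: int,
                          switch_step: int) -> List[float]:
    """Phase 1 (step < switch_step): keep positive advantages only.
    Phase 2: pass positive and negative advantages through unchanged.

    The hard switch at a fixed step is an assumption for illustration.
    """
    if step < switch_step:
        return [a if a > 0 else 0.0 for a in advantages]
    return list(advantages)


adv = group_advantages([1.0, 0.0, 0.0, 1.0])
print(curriculum_advantages(adv, step=0, switch_step=10))   # [0.5, 0.0, 0.0, 0.5]
print(curriculum_advantages(adv, step=10, switch_step=10))  # [0.5, -0.5, -0.5, 0.5]
```

Zeroing negative advantages in the first phase means only better-than-expected samples contribute to the policy gradient, which matches the abstract's description of bootstrapping with imitation before introducing discrimination.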

[54] Spoken Conversational Agents with Large Language Models

Chao-Han Huck Yang, Andreas Stolcke, Larry Heck

🧩 TL;DR

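This tutorial traces the path from cascaded ASR/NLU systems to end-to-end, voice-native large language models. It offers a practical roadmap from industrial assistants to open-domain and task-oriented agents, focusing on key technical challenges such as cross-modal alignment, joint training, and robustness evaluation.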

📘 Detailed Summary

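Motivation: Spoken conversational agents are evolving toward voice-native large language models, but the literature lacks a systematic account of the technical path from traditional cascaded architectures to end-to-end systems, especially regarding cross-modal alignment, joint speech-text training, and robustness evaluation in real deployments.

Method: The tutorial covers how text LLMs are adapted to audio, including cross-modal alignment and joint speech-text training. It compares cascaded and end-to-end designs, examines choices such as post-ASR correction and streaming, and reviews datasets, metrics, and robustness across accents.

Result: The tutorial highlights reproducible baselines and practical recipes, links industrial voice assistants to current open-domain and task-oriented agents, and weighs the trade-offs among design choices, giving system builders a clear reference.

Conclusion: The tutorial lays out a systems-level roadmap for voice-native large language models and stresses open problems in privacy, safety, and evaluation. It gives practitioners a path from theory to practice and points to future work on cross-modal learning, real-time processing, and robustness.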

📄 Abstract

Spoken conversational agents are converging toward voice-native LLMs. This tutorial distills the path from cascaded ASR/NLU to end-to-end, retrieval- and vision-grounded systems. We frame adaptation of text LLMs to audio, cross-modal alignment, and joint speech-text training; review datasets, metrics, and robustness across accents; and compare design choices (cascaded vs. E2E, post-ASR correction, streaming). We link industrial assistants to current open-domain and task-oriented agents, highlight reproducible baselines, and outline open problems in privacy, safety, and evaluation. Attendees leave with practical recipes and a clear systems-level roadmap.

[55] Emergent Bayesian Behaviour and Optimal Cue Combination in LLMs

Julian Ma, Jun Wang, Zafeirios Fountas

🧩 TL;DR

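This study introduces BayesBench, a psychophysics benchmark for evaluating the implicit computational strategies of large language models without explicit training or instruction, in particular Bayes-consistent multimodal cue integration. It finds a critical dissociation between capability and strategy: accuracy does not guarantee robustness.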

📘 Detailed Summary

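Motivation: Large language models excel at explicit reasoning, but their implicit computational strategies remain underexplored. Humans intuitively process noisy signals in perceptual tasks using near-optimal Bayesian strategies, and the study asks whether LLMs show similar behaviour and perform optimal multimodal integration without explicit training or instruction.

Method: Adopting the psychophysics paradigm, the study infers the computational principles of LLMs from systematic behavioural experiments. It introduces the BayesBench behavioural benchmark, which contains four magnitude estimation tasks (length, location, distance, and duration) over text and image, and evaluates nine diverse LLMs alongside human judgments for calibration. Controlled ablations of noise, context, and instruction prompts measure performance, behaviour, and efficiency in multimodal cue combination. Beyond accuracy and efficiency metrics, the study introduces a Bayesian Consistency Score that detects Bayes-consistent behavioural shifts even when accuracy saturates.

Result: Capable models often adapt in Bayes-consistent ways, but accuracy does not guarantee robustness. Notably, GPT-5 Mini achieves perfect text accuracy but fails to integrate visual cues efficiently. This reveals a critical dissociation between capability and strategy and suggests that accuracy-centric benchmarks may over-index on performance while missing brittle uncertainty handling. The findings reveal emergent principled handling of uncertainty and highlight the correlation between accuracy and Bayesian tendencies.

Conclusion: The study exposes an important dissociation between model capability and computational strategy, showing that accuracy-oriented evaluation can mask brittle uncertainty handling. The released psychophysics benchmark and consistency metric serve as evaluation tools and can inform future multimodal architecture design, underlining the value of assessing implicit computational strategies and Bayesian consistency.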

📄 Abstract

Large language models (LLMs) excel at explicit reasoning, but their implicit computational strategies remain underexplored. Decades of psychophysics research show that humans intuitively process and integrate noisy signals using near-optimal Bayesian strategies in perceptual tasks. We ask whether LLMs exhibit similar behaviour and perform optimal multimodal integration without explicit training or instruction. Adopting the psychophysics paradigm, we infer computational principles of LLMs from systematic behavioural studies. We introduce a behavioural benchmark - BayesBench: four magnitude estimation tasks (length, location, distance, and duration) over text and image, inspired by classic psychophysics, and evaluate a diverse set of nine LLMs alongside human judgments for calibration. Through controlled ablations of noise, context, and instruction prompts, we measure performance, behaviour and efficiency in multimodal cue-combination. Beyond accuracy and efficiency metrics, we introduce a Bayesian Consistency Score that detects Bayes-consistent behavioural shifts even when accuracy saturates. Our results show that while capable models often adapt in Bayes-consistent ways, accuracy does not guarantee robustness. Notably, GPT-5 Mini achieves perfect text accuracy but fails to integrate visual cues efficiently. This reveals a critical dissociation between capability and strategy, suggesting accuracy-centric benchmarks may over-index on performance while missing brittle uncertainty handling. These findings reveal emergent principled handling of uncertainty and highlight the correlation between accuracy and Bayesian tendencies. We release our psychophysics benchmark and consistency metric (https://bayes-bench.github.io) as evaluation tools and to inform future multimodal architecture designs.
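For readers unfamiliar with "optimal cue combination", the classical psychophysics reference point is inverse-variance weighting of two independent Gaussian cues. The sketch below shows that textbook rule. It is not BayesBench code and not the paper's Bayesian Consistency Score; the function name and the example numbers are made up for illustration.

```python
from typing import Tuple


def combine_cues(mu_a: float, var_a: float,
                 mu_b: float, var_b: float) -> Tuple[float, float]:
    """Optimal fusion of two independent Gaussian cues.

    Each cue is weighted by its reliability (inverse variance), and the
    fused variance is never larger than that of either cue alone.
    """
    rel_a, rel_b = 1.0 / var_a, 1.0 / var_b
    w_a = rel_a / (rel_a + rel_b)
    mu = w_a * mu_a + (1.0 - w_a) * mu_b
    var = 1.0 / (rel_a + rel_b)
    return mu, var


# A reliable text cue (variance 1) and a noisier visual cue (variance 3).
mu, var = combine_cues(mu_a=10.0, var_a=1.0, mu_b=14.0, var_b=3.0)
print(round(mu, 6), round(var, 6))  # 11.0 0.75
```

An observer that follows this rule shifts its estimate toward whichever cue is currently less noisy, which is the kind of behavioural shift a Bayes-consistency measure would look for when noise levels are manipulated.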

[56] BOOM: Beyond Only One Modality KIT's Multimodal Multilingual Lecture Companion

Sai Koneru, Fabian Retkowski, Christian Huber, Lukas Hilgert, Seymanur Akti, Enes Yavuz Ugan, Alexander Waibel, Jan Niehues

🧩 TL;DR

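This paper presents BOOM, a multimodal multilingual lecture companion that jointly translates lecture audio and slides and produces synchronized outputs across three modalities (text, visuals, and speech), giving online learners a complete localized experience.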

📘 Detailed Summary

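Motivation: The globalization of education and the rapid growth of online learning have made localizing educational content a critical challenge. Lecture materials are inherently multimodal, combining spoken audio with visual slides, so systems must handle multiple input modalities. To provide an accessible and complete learning experience, translation must preserve all modalities: text for reading, slides for visual understanding, and speech for auditory learning.

Method: BOOM is an end-to-end multimodal multilingual lecture companion that jointly translates lecture audio and slides and produces synchronized cross-modal outputs: translated text, localized slides with preserved visual elements, and synthesized speech. It processes the multimodal input with slide-aware transcription.

Result: Experiments show that slide-aware transcripts yield cascading benefits for downstream tasks such as summarization and question answering. The system generates synchronized translated text, localized slides, and synthesized speech. All released code and models are under the MIT License.

Conclusion: The study offers an end-to-end solution for localizing multimodal educational content, letting students access lectures in their native language while aiming to preserve the original content in full. Slide-aware transcription benefits downstream tasks and suggests new research directions for multimodal machine translation and educational technology.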

📄 Abstract

The globalization of education and rapid growth of online learning have made localizing educational content a critical challenge. Lecture materials are inherently multimodal, combining spoken audio with visual slides, which requires systems capable of processing multiple input modalities. To provide an accessible and complete learning experience, translations must preserve all modalities: text for reading, slides for visual understanding, and speech for auditory learning. We present BOOM, a multimodal multilingual lecture companion that jointly translates lecture audio and slides to produce synchronized outputs across three modalities: translated text, localized slides with preserved visual elements, and synthesized speech. This end-to-end approach enables students to access lectures in their native language while aiming to preserve the original content in its entirety. Our experiments demonstrate that slide-aware transcripts also yield cascading benefits for downstream tasks such as summarization and question answering. We release our Slide Translation code at https://github.com/saikoneru/image-translator and integrate it in Lecture Translator at https://gitlab.kit.edu/kit/isl-ai4lt/lt-middleware/ltpipeline. All released code and models are licensed under the MIT License.

[57] AutoNeural: Co-Designing Vision-Language Models for NPU Inference

Wei Chen, Liangmin Wu, Yunhai Hu, Zhiyuan Li, Zhiyuan Cheng, Yicheng Qian, Lingyue Zhu, Zhipeng Hu, Luoyi Liang, Qiang Tang, Zhen Liu, Han Yang

🧩 TL;DR

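This paper proposes AutoNeural, an NPU-native vision-language model architecture co-designed for integer-only inference. It replaces the standard ViT encoder with a MobileNetV5-style backbone and uses a hybrid of state-space model principles and Transformer layers, substantially improving inference efficiency and quantization stability on edge devices.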

📘 Detailed Summary

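Motivation: Vision-language models optimized for GPUs perform poorly on Neural Processing Units (NPUs). The authors attribute this hardware-model mismatch to the quantization brittleness of Vision Transformers and the I/O-bound nature of autoregressive attention, neither of which makes good use of an NPU's high arithmetic throughput.

Method: AutoNeural uses an NPU-native design. The standard ViT encoder is replaced by a MobileNetV5-style backbone built on depthwise separable convolutions, which keeps activation distributions bounded for stable INT4/8/16 quantization. The language backbone integrates state-space model principles with Transformer layers and uses efficient gated convolutions to reach linear-time complexity, removing the heavy memory I/O overhead of key-value caching during generation.

Result: The approach reduces the vision encoder's quantization error by up to 7x and end-to-end latency by 14x compared with conventional baselines. It also delivers 3x decoding speed and a 4x longer context window. A real-world automotive case study on the Qualcomm SA8295P SoC demonstrates real-time performance for cockpit applications.

Conclusion: The results indicate that rethinking model topology specifically for NPU constraints is a prerequisite for robust multimodal edge intelligence. Hardware-aware co-design can address quantization brittleness and memory bottlenecks, offering a new architectural approach for edge AI deployment.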

📄 Abstract

While Neural Processing Units (NPUs) offer high theoretical efficiency for edge AI, state-of-the-art Vision-Language Models (VLMs) tailored for GPUs often falter on these substrates. We attribute this hardware-model mismatch to two primary factors: the quantization brittleness of Vision Transformers (ViTs) and the I/O-bound nature of autoregressive attention mechanisms, which fail to utilize the high arithmetic throughput of NPUs. To bridge this gap, we propose AutoNeural, an NPU-native VLM architecture co-designed for integer-only inference. We replace the standard ViT encoder with a MobileNetV5-style backbone utilizing depthwise separable convolutions, which ensures bounded activation distributions for stable INT4/8/16 quantization. Complementing this, our language backbone integrates State-Space Model (SSM) principles with Transformer layers, employing efficient gated convolutions to achieve linear-time complexity. This hybrid design eliminates the heavy memory I/O overhead of Key-Value caching during generation. Our approach delivers substantial efficiency gains, reducing the quantization error of the vision encoder by up to 7x and end-to-end latency by 14x compared to conventional baselines. AutoNeural also delivers 3x decoding speed and a 4x longer context window than the baseline. We validate these improvements via a real-world automotive case study on the Qualcomm SA8295P SoC, demonstrating real-time performance for cockpit applications. Our results highlight that rethinking model topology specifically for NPU constraints is a prerequisite for robust multi-modal edge intelligence.
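To see why key-value caching is described as a memory I/O burden, it helps to work out how large the cache of a conventional Transformer decoder becomes. The sketch below applies the usual back-of-envelope formula (keys plus values, per layer, per token). The model dimensions are hypothetical and are not AutoNeural's or its baseline's; the function name is also an assumption.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int) -> int:
    """Size of a standard Transformer KV cache for one sequence.

    The factor 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical decoder: 32 layers, 32 KV heads of dimension 128, 16-bit cache.
size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                      seq_len=4096, bytes_per_elem=2)
print(size / 2**30)  # 2.0 (GiB)
```

The cache grows linearly with sequence length and must be re-read at every decoding step, which is the traffic a fixed-size recurrent or convolutional state avoids.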

cs.AI [Back]

[58] Flowchart2Mermaid: A Vision-Language Model Powered System for Converting Flowcharts into Editable Diagram Code

Pritam Deka, Barry Devereux

🧩 TL;DR

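This paper presents Flowchart2Mermaid, a lightweight web tool that converts flowchart images into editable Mermaid.js code. Using vision-language models and a mixed-initiative interface, it makes static flowcharts editable and reusable.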

📘 Detailed Summary

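Motivation: Flowcharts are a common way to communicate processes but are usually shared as static images that are hard to edit and reuse. Existing image-to-diagram tools lack a structured, version-controllable textual representation, which interrupts workflows and hampers collaboration.

Method: The system uses a detailed system prompt and vision-language models to convert flowchart images into Mermaid.js markup. The interface supports mixed-initiative refinement through inline text editing, drag-and-drop node insertion, and natural-language commands interpreted by an integrated AI assistant, keeping the rendered diagram synchronized with its structured textual representation.

Result: The study introduces evaluation metrics covering structural accuracy, flow correctness, syntax validity, and completeness, and applies them across multiple models. The system produces a version-controllable textual representation and a more structured output format than prior tools.

Conclusion: The study shows that converting static flowcharts into an editable textual representation is feasible and supports collaborative work on visual workflows. The proposed evaluation metrics offer a basis for comparing future image-to-diagram conversion systems.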

📄 Abstract

Flowcharts are common tools for communicating processes but are often shared as static images that cannot be easily edited or reused. We present Flowchart2Mermaid, a lightweight web system that converts flowchart images into editable Mermaid.js code, a markup language for visual workflows, using a detailed system prompt and vision-language models. The interface supports mixed-initiative refinement through inline text editing, drag-and-drop node insertion, and natural-language commands interpreted by an integrated AI assistant. Unlike prior image-to-diagram tools, our approach produces a structured, version-controllable textual representation that remains synchronized with the rendered diagram. We further introduce evaluation metrics to assess structural accuracy, flow correctness, syntax validity, and completeness across multiple models.
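To make the target format concrete, the sketch below serializes a small node-and-edge structure into Mermaid flowchart syntax. It only illustrates the kind of text such a system would emit; it is not the paper's pipeline, which relies on a vision-language model. The function name `to_mermaid` and the input structure are assumptions, and labels containing double quotes are not escaped here.

```python
from typing import Dict, List, Tuple


def to_mermaid(nodes: Dict[str, str],
               edges: List[Tuple[str, str, str]],
               direction: str = "TD") -> str:
    """Serialize nodes and labeled edges as a Mermaid flowchart definition."""
    lines = [f"flowchart {direction}"]
    for node_id, label in nodes.items():
        # Rectangular node with a quoted label.
        lines.append(f'    {node_id}["{label}"]')
    for src, dst, label in edges:
        # An empty label produces a plain arrow.
        arrow = f"-->|{label}|" if label else "-->"
        lines.append(f"    {src} {arrow} {dst}")
    return "\n".join(lines)


nodes = {"A": "Start", "B": "Check input", "C": "Done"}
edges = [("A", "B", ""), ("B", "C", "valid")]
print(to_mermaid(nodes, edges))
```

Because the output is plain text, it can be diffed and version-controlled like source code, which is the property the abstract emphasizes.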

[59] Bridging the Gap: Toward Cognitive Autonomy in Artificial Intelligence

Noorbakhsh Amiri Golilarz, Sindhuja Penchala, Shahram Rahimi

🧩 TL;DR

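This paper systematically analyzes seven core deficiencies of contemporary AI systems, including the absence of intrinsic self-monitoring, insufficient meta-cognitive awareness, and fixed, non-adaptive learning mechanisms, and proposes cognitively autonomous AI architectures grounded in neurocognitive principles as a direction for future development.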

📘 Detailed Summary

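Motivation: Although AI has advanced rapidly in perception, language, reasoning, and multimodal domains, modern AI systems remain fundamentally limited in dynamic environments: they cannot self-monitor, self-correct, or autonomously regulate their behavior. The paper addresses seven core deficiencies of contemporary AI models: the absence of intrinsic self-monitoring, insufficient meta-cognitive awareness, fixed and non-adaptive learning mechanisms, the inability to restructure goals, insufficient representational maintenance, inadequate embodied feedback, and the absence of intrinsic agency. These limitations prevent systems from achieving robust generalization, lifelong adaptability, and real-world autonomy.

Method: The paper takes a comparative approach, contrasting artificial systems with biological cognition and integrating insights from AI research, cognitive science, and neuroscience. After systematically identifying and analyzing the seven core deficiencies, it outlines forward-looking architectural design ideas based on neurocognitive principles, stressing the need to move beyond current deep learning and Transformer-based architectures toward cognitively autonomous AI systems capable of self-directed adaptation, dynamic representation management, and intentional, goal-oriented behavior.

Result: The study identifies seven structural deficiencies that constrain contemporary AI models and argues that scaling model size alone cannot resolve them. Its interdisciplinary analysis exposes the inherent limits of current architectures with respect to robust generalization, lifelong adaptability, and real-world autonomy, provides a theoretical basis and analytical framework for cognitively autonomous AI, and emphasizes that a paradigm shift is needed rather than incremental improvement.

Conclusion: The study advocates a paradigm shift toward cognitively grounded AI (cognitive autonomy), meaning systems capable of self-directed adaptation, dynamic representation management, and intentional, goal-oriented behavior. This shift must be paired with reformative oversight mechanisms that keep autonomous systems interpretable, governable, and aligned with human values. The authors regard the shift as essential for AI systems that are truly adaptive, autonomous, and reliable, and as a clear direction for future AI architecture design.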

📄 Abstract

Artificial intelligence has advanced rapidly across perception, language, reasoning, and multimodal domains. Yet despite these achievements, modern AI systems remain fundamentally limited in their ability to self-monitor, self-correct, and regulate their behavior autonomously in dynamic contexts. This paper identifies and analyzes seven core deficiencies that constrain contemporary AI models: the absence of intrinsic self-monitoring, lack of meta-cognitive awareness, fixed and non-adaptive learning mechanisms, inability to restructure goals, lack of representational maintenance, insufficient embodied feedback, and the absence of intrinsic agency. Alongside identifying these limitations, we also outline a forward-looking perspective on how AI may evolve beyond them through architectures that mirror neurocognitive principles. We argue that these structural limitations prevent current architectures, including deep learning and transformer-based systems, from achieving robust generalization, lifelong adaptability, and real-world autonomy. Drawing on a comparative analysis of artificial systems and biological cognition [7], and integrating insights from AI research, cognitive science, and neuroscience, we outline how these capabilities are absent in current models and why scaling alone cannot resolve them. We conclude by advocating for a paradigmatic shift toward cognitively grounded AI (cognitive autonomy) capable of self-directed adaptation, dynamic representation management, and intentional, goal-oriented behavior, paired with reformative oversight mechanisms [8] that ensure autonomous systems remain interpretable, governable, and aligned with human values.

Boyu Zhu, Xiaofei Wen, Wenjie Jacky Mo, Tinghui Zhu, Yanan Xie, Peng Qi, Muhao Chen

🧩 TL;DR

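This paper proposes OmniGuard, the first guardrail system for omni-modal large language models, which uses structured safety annotations and distillation from expert models to provide a unified safeguarding framework across text, images, video, and audio.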

📘 Detailed Summary

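Motivation: Omni-modal large language models that process text, images, video, and audio introduce new safety challenges. Existing guardrail research mainly targets unimodal settings and usually treats safeguarding as binary classification, which limits robustness across diverse modalities and tasks.

Method: The paper proposes the OmniGuard family, the first omni-modal guardrails with deliberate reasoning ability. To support training, the authors build a large omni-modal safety dataset of more than 210K diverse samples, covering all modalities through both unimodal and cross-modal samples. Each sample receives structured safety labels and carefully curated safety critiques from expert models through targeted distillation.

Result: Extensive experiments on 15 benchmarks show that OmniGuard achieves strong effectiveness and generalization across a wide range of multimodal safety scenarios and can perform safeguarding across all modalities.

Conclusion: OmniGuard provides a unified framework that enforces policies and mitigates risks across all modalities, paving the way toward more robust and capable omni-modal safeguarding systems and addressing the limitations of existing methods in multimodal settings.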

📄 Abstract

Omni-modal Large Language Models (OLLMs) that process text, images, videos, and audio introduce new challenges for safety and value guardrails in human-AI interaction. Prior guardrail research largely targets unimodal settings and typically frames safeguarding as binary classification, which limits robustness across diverse modalities and tasks. To address this gap, we propose OmniGuard, the first family of omni-modal guardrails that performs safeguarding across all modalities with deliberate reasoning ability. To support the training of OmniGuard, we curate a large, comprehensive omni-modal safety dataset comprising over 210K diverse samples, with inputs that cover all modalities through both unimodal and cross-modal samples. Each sample is annotated with structured safety labels and carefully curated safety critiques from expert models through targeted distillation. Extensive experiments on 15 benchmarks show that OmniGuard achieves strong effectiveness and generalization across a wide range of multimodal safety scenarios. Importantly, OmniGuard provides a unified framework that enforces policies and mitigates risks across all modalities, paving the way toward building more robust and capable omni-modal safeguarding systems.

[61] Reasoning Path and Latent State Analysis for Multi-view Visual Spatial Reasoning: A Cognitive Science Perspective

Qiyao Xue, Weichen Liu, Shiqi Wang, Haoming Wang, Yuyang Wu, Wei Gao

🧩 TL;DR

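This paper presents ReMindView-Bench, a benchmark for evaluating how well vision-language models construct, align, and maintain spatial mental models in multi-view spatial reasoning, and it reveals systematic weaknesses of current models in cross-view alignment and perspective-taking.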

📘 Detailed Summary

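Motivation: Current vision-language models struggle to maintain geometric coherence and cross-view consistency when reasoning spatially in multi-view settings. The gap stems from the lack of fine-grained benchmarks that separate multi-view reasoning from single-view perception and temporal factors, so a cognitively grounded evaluation framework is needed to diagnose the spatial reasoning ability of VLMs.

Method: The study proposes ReMindView-Bench, which systematically varies viewpoint spatial patterns and query types to probe key factors of spatial cognition. It evaluates the reasoning process with explicit phase-wise analysis (LLM-as-a-judge and self-consistency prompting) and with implicit analysis (linear probing and entropy dynamics) that tracks the loss of task-relevant information and the separation in uncertainty.

Result: Evaluations of 15 current VLMs show widespread failures in cross-view alignment and perspective-taking in multi-view spatial reasoning. The explicit analysis shows that models perform well on in-frame perception but degrade sharply when integrating information across views. The implicit analysis further reveals a progressive loss of task-relevant information and a separation in uncertainty between correct and incorrect trajectories.

Conclusion: The study provides a cognitively grounded diagnosis of VLM spatial reasoning and reveals how multi-view spatial mental models are formed, degraded, and destabilized during reasoning. The public availability of the benchmark should support future model improvements and evaluation methods for spatial reasoning.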

📄 Abstract

Spatial reasoning is a core aspect of human intelligence that allows perception, inference and planning in 3D environments. However, current vision-language models (VLMs) struggle to maintain geometric coherence and cross-view consistency for spatial reasoning in multi-view settings. We attribute this gap to the lack of fine-grained benchmarks that isolate multi-view reasoning from single-view perception and temporal factors. To address this, we present ReMindView-Bench, a cognitively grounded benchmark for evaluating how VLMs construct, align and maintain spatial mental models across complementary viewpoints. ReMindView-Bench systematically varies viewpoint spatial pattern and query type to probe key factors of spatial cognition. Evaluations of 15 current VLMs reveal consistent failures in cross-view alignment and perspective-taking in multi-view spatial reasoning, motivating deeper analysis of the reasoning process. Explicit phase-wise analysis using LLM-as-a-judge and self-consistency prompting shows that VLMs perform well on in-frame perception but degrade sharply when integrating information across views. Implicit analysis, including linear probing and entropy dynamics, further shows progressive loss of task-relevant information and uncertainty separation between correct and incorrect trajectories. These results provide a cognitively grounded diagnosis of VLM spatial reasoning and reveal how multi-view spatial mental models are formed, degraded and destabilized across reasoning phases. The ReMindView-Bench benchmark is available at https://huggingface.co/datasets/Xue0823/ReMindView-Bench, and the source code for benchmark construction and VLM reasoning analysis is available at https://github.com/pittisl/ReMindView-Bench.

[62] Aetheria: A multimodal interpretable content safety framework based on multi-agent debate and collaboration

Yuxiang He, Jian Zhao, Yuchen Yuan, Tianle Zhang, Wei Cai, Haojie Cheng, Ziyan Shi, Ming Zhu, Haichuan Tang, Chi Zhang, Xuelong Li

🧩 TL;DR

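This paper proposes Aetheria, a multimodal interpretable content safety framework based on multi-agent debate and collaboration, which uses a dynamic debate mechanism and RAG-based knowledge retrieval to significantly improve the accuracy and interpretability of content safety moderation.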

📘 Detailed Summary

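Motivation: The exponential growth of digital content poses major challenges for content safety. Current moderation systems, built on single models or fixed pipelines, are limited in identifying implicit risks and in providing an interpretable judgment process, so a more transparent and interpretable moderation paradigm is needed.

Method: The paper proposes Aetheria, a multimodal interpretable content safety framework with a collaborative architecture of five core agents. Through RAG-based knowledge retrieval and a dynamic, mutually persuasive debate mechanism, it analyzes and adjudicates multimodal content in depth and produces detailed, traceable audit reports.

Result: Comprehensive experiments on the proposed AIR-Bench benchmark show that Aetheria not only generates detailed and traceable audit reports but also significantly outperforms baselines in overall content safety accuracy, with a clear advantage in identifying implicit risks.

Conclusion: The framework establishes a transparent and interpretable moderation paradigm, significantly advances trustworthy AI content moderation, offers a new solution for multimodal content safety, and underlines the importance of interpretability and transparency in content safety systems.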

📄 Abstract

The exponential growth of digital content presents significant challenges for content safety. Current moderation systems, often based on single models or fixed pipelines, exhibit limitations in identifying implicit risks and providing interpretable judgment processes. To address these issues, we propose Aetheria, a multimodal interpretable content safety framework based on multi-agent debate and collaboration. Employing a collaborative architecture of five core agents, Aetheria conducts in-depth analysis and adjudication of multimodal content through a dynamic, mutually persuasive debate mechanism, which is grounded by RAG-based knowledge retrieval. Comprehensive experiments on our proposed benchmark (AIR-Bench) validate that Aetheria not only generates detailed and traceable audit reports but also demonstrates significant advantages over baselines in overall content safety accuracy, especially in the identification of implicit risks. This framework establishes a transparent and interpretable paradigm, significantly advancing the field of trustworthy AI content moderation.

Yufei Xiao, Shangfei Wang

🧩 TL;DR

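This paper proposes an advanced multimodal empathy prediction method that integrates video, audio, and text, and introduces supervisory documents as privileged information to strengthen text feature extraction and improve model performance during training.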

📘 Detailed Summary

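Motivation: Existing empathy prediction techniques focus mainly on a single modality, usually text, and neglect multimodal processing. They also overlook certain privileged information that may carry additional empathetic content, which limits predictive ability.

Method: The method has two parts: multimodal empathy prediction and supervisory-document-assisted training. Pre-trained networks extract video, audio, and text features, and cross-modal fusion produces a multimodal feature representation used to predict empathy labels. During the assisted training phase, supervisory documents are introduced as privileged information, and a Latent Dirichlet Allocation model identifies latent topic distributions that constrain the text features.

Result: Experiments on multimodal and dialogue empathy datasets show that the method outperforms existing approaches, confirming the effectiveness of multimodal fusion and supervisory-document-assisted training. Introducing privileged information during training significantly improves model performance.

Conclusion: The study demonstrates the value of integrating multimodal information and exploiting privileged supervisory documents, giving a more comprehensive approach to empathy prediction. It also shows how auxiliary information available only during training can strengthen model learning, offering new ideas for affective computing and counseling applications.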

📄 Abstract

Prevalent empathy prediction techniques primarily concentrate on a singular modality, typically textual, thus neglecting multi-modal processing capabilities. They also overlook the utilization of certain privileged information, which may encompass additional empathetic content. In response, we introduce an advanced multi-modal empathy prediction method integrating video, audio, and text information. The method comprises the Multi-Modal Empathy Prediction and Supervisory Documentation Assisted Training. We use pre-trained networks in the empathy prediction network to extract features from various modalities, followed by a cross-modal fusion. This process yields a multi-modal feature representation, which is employed to predict empathy labels. To enhance the extraction of text features, we incorporate supervisory documents as privileged information during the assisted training phase. Specifically, we apply the Latent Dirichlet Allocation model to identify potential topic distributions to constrain text features. These supervisory documents, created by supervisors, focus on the counseling topics and the counselor's display of empathy. Notably, this privileged information is only available during training and is not accessible during the prediction phase. Experimental results on the multi-modal and dialogue empathy datasets demonstrate that our approach is superior to the existing methods.

[64] Zero-Shot Instruction Following in RL via Structured LTL Representations

Mattia Giuri, Mathias Jackermeier, Alessandro Abate

🧩 TL;DR

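This paper proposes a novel method that encodes linear temporal logic instructions as sequences of Boolean formulae and uses a graph neural network to produce structured task representations, learning a multi-task reinforcement learning policy that can execute arbitrary LTL instructions and overcoming the limits of existing methods in environments where multiple events occur concurrently and interact in complex ways.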

📘 Detailed Summary

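Motivation: Existing LTL-based reinforcement learning methods interpret LTL instructions as finite automata and can learn generalist policies that execute arbitrary instructions, but they fall short in environments where multiple high-level events can be true at the same time and may interact in complicated ways. The paper targets this limitation, particularly in scenarios with complex interactions among concurrent events.

Method: The paper proposes a novel multi-task policy learning approach that conditions the policy on sequences of simple Boolean formulae aligned directly with automaton transitions. A graph neural network encodes these formulae into structured task representations, which handles cases where multiple atomic propositions are true simultaneously and interact in complex ways.

Result: Experiments in a complex chess-based environment show that the method has significant advantages over existing approaches. They confirm its effectiveness in scenarios with concurrent events and complex interactions and demonstrate superior performance on complex structured tasks.

Conclusion: The study provides a framework for executing LTL instructions that handles complex concurrent events, and its structured task representations improve policy generalization. The method opens a new route to generalist reinforcement learning policies in complex interactive environments, with application value for robot task planning and autonomous system control.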

📄 Abstract

Linear temporal logic (LTL) is a compelling framework for specifying complex, structured tasks for reinforcement learning (RL) agents. Recent work has shown that interpreting LTL instructions as finite automata, which can be seen as high-level programs monitoring task progress, enables learning a single generalist policy capable of executing arbitrary instructions at test time. However, existing approaches fall short in environments where multiple high-level events (i.e., atomic propositions) can be true at the same time and potentially interact in complicated ways. In this work, we propose a novel approach to learning a multi-task policy for following arbitrary LTL instructions that addresses this shortcoming. Our method conditions the policy on sequences of simple Boolean formulae, which directly align with transitions in the automaton, and are encoded via a graph neural network (GNN) to yield structured task representations. Experiments in a complex chess-based environment demonstrate the advantages of our approach.
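
The idea of automaton transitions labeled by Boolean formulae over atomic propositions can be sketched in a few lines. This is an illustrative toy, not the authors' implementation: the proposition names (`check`, `capture`, `promotion`), the two-transition automaton, and the tuple encoding of formulae are all hypothetical, and no GNN encoder or policy is included.

```python
def holds(formula, true_props):
    """Evaluate a Boolean formula against the set of propositions currently true.

    A formula is either a proposition name (str) or a tuple
    ("not", f), ("and", f, g, ...), ("or", f, g, ...).
    """
    if isinstance(formula, str):
        return formula in true_props
    op, *args = formula
    if op == "not":
        return not holds(args[0], true_props)
    if op == "and":
        return all(holds(a, true_props) for a in args)
    if op == "or":
        return any(holds(a, true_props) for a in args)
    raise ValueError(f"unknown operator: {op}")


# Hypothetical task: first reach a step with `check` but no `capture`,
# then reach a step where `promotion` holds.
TRANSITIONS = {
    "q0": [(("and", "check", ("not", "capture")), "q1")],
    "q1": [("promotion", "accept")],
}


def step(state, true_props):
    """Take the first enabled transition, or stay in place if none is enabled."""
    for formula, next_state in TRANSITIONS.get(state, []):
        if holds(formula, true_props):
            return next_state
    return state


def run(trace, state="q0"):
    """Advance the automaton over a trace of proposition sets."""
    for true_props in trace:
        state = step(state, true_props)
    return state


# Several propositions can be true in the same step; the formula decides progress.
trace = [{"check", "capture"}, {"check"}, {"promotion", "check"}]
print(run(trace))
```

In this toy, the sequence of formulae along the path `q0 -> q1 -> accept` is the kind of object the abstract describes the policy as being conditioned on; the first step of the trace makes no progress because `check` and `capture` hold together.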

[65] Learning What to Attend First: Modality-Importance-Guided Reasoning for Reliable Multimodal Emotion Understanding

Hyeongseop Rha, Jeong Hun Yeo, Junil Won, Se Jin Park, Yong Man Ro

🧩 TL;DR

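This paper proposes the Modality-Importance-Guided Reasoning (MIGR) framework, which identifies the emotion-dominant modality and reorganizes the reasoning sequence, substantially improving the reasoning reliability of multimodal large language models on emotion understanding and reducing the share of emotionally inconsistent explanations from 18.10% to 7.37%.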

📘 Detailed Summary

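Motivation: Existing reasoning-based methods for multimodal emotion understanding suffer from reasoning drift: models gradually rely on their own generated text rather than multimodal evidence, and explanations are overly shaped by visually initiated reasoning paths, which undermines the reliability of emotion understanding.

Method: The paper proposes a Modality Importance (MI) mechanism to identify the emotion-dominant modality and builds the MIGR framework to reorganize reasoning sequences so that explanations start from the modality most critical to the target emotion. A two-stage training framework, consisting of modality-aligned supervised fine-tuning and modality-aware reward optimization, encourages explanations that are emotionally grounded, causally relevant, and coherent.

Result: On the DFEW benchmark, MIGR substantially improves reasoning reliability, reducing the share of correct predictions accompanied by emotionally inconsistent explanations from 18.10% to 7.37% and confirming the benefit of starting reasoning from the emotion-dominant modality.

Conclusion: The study shows that identifying the emotion-dominant modality and reorganizing the reasoning sequence prevents early reasoning from being misled by less informative cues, improving the reliability and explanation quality of multimodal emotion understanding and offering new ideas for the design of multimodal reasoning systems.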

📄 Abstract

In this paper, we present Modality-Importance-Guided Reasoning (MIGR), a framework designed to improve the reliability of reasoning-based multimodal emotion understanding in multimodal large language models. Although existing methods have advanced emotion understanding, they often suffer from reasoning drift: models gradually rely on their own generated text instead of multimodal evidence, and their explanations are overly shaped by visually initiated reasoning paths. To address these issues, we introduce Modality Importance (MI), a simple yet effective mechanism for identifying the emotion-dominant modality. Using MI, MIGR reorganizes reasoning sequences so that explanations begin from the modality most critical to the target emotion, preventing early reasoning from being misled by less informative cues. Our two-stage framework, comprising modality-aligned supervised fine-tuning and modality-aware reward optimization, encourages models to generate emotionally grounded, causally relevant, and coherence-preserving explanations. Experimental results on the DFEW benchmark show that MIGR substantially improves reasoning reliability, decreasing instances of correct predictions accompanied by emotionally inconsistent explanations from 18.10% to 7.37%. These results confirm the benefit of initiating reasoning from the emotion-dominant modality.
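
The reordering step can be sketched as follows. The abstract does not say how Modality Importance is computed, so the scores below are simply assumed as given, and the function names and evidence strings are hypothetical; the sketch only shows reasoning steps being arranged so that the highest-scoring modality comes first.

```python
def order_modalities(importance):
    """Sort modality names by descending importance (ties broken alphabetically)."""
    return sorted(importance, key=lambda m: (-importance[m], m))


def build_reasoning_outline(evidence, importance):
    """Arrange per-modality evidence so the dominant modality is reasoned about first."""
    order = order_modalities(importance)
    return "\n".join(
        f"{i}. [{modality}] {evidence[modality]}"
        for i, modality in enumerate(order, start=1)
    )


# Hypothetical importance scores and evidence for one clip.
importance = {"visual": 0.2, "audio": 0.7, "text": 0.1}
evidence = {
    "visual": "neutral facial expression",
    "audio": "trembling, raised voice",
    "text": "short, non-committal reply",
}

outline = build_reasoning_outline(evidence, importance)
print(outline)
```

Here the audio cue leads the outline even though a default visually initiated order would have started from the neutral face.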

[66] Training Data Attribution for Image Generation using Ontology-Aligned Knowledge Graphs

Theodoros Aivalis, Iraklis A. Klampanos, Antonis Troumpoukis, Joemon M. Jose

🧩 TL;DR

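This paper proposes a framework that explains generative model outputs by constructing ontology-aligned knowledge graphs. It uses multimodal large language models to extract structured triples from images and traces potential influences by comparing the knowledge graphs of generated and training images, supporting copyright analysis and interpretable AI.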

📘 Detailed Summary

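Motivation: As generative models grow more capable, concerns about transparency, accountability, and copyright infringement have intensified, and understanding how specific training data influences model outputs has become critical. Extracting structured, ontology-consistent representations from visual content remains challenging because images are rich and contain multiple objects.

Method: The method uses multimodal large language models to extract structured triples from images and aligns them with a domain-specific ontology. It automatically constructs ontology-aligned knowledge graphs and compares the graphs of generated and training images to trace potential influences. The framework is validated through unlearning and style-specific experiments.

Result: The method is validated with unlearning experiments on locally trained models and a style-specific experiment on large-scale models. It can trace potential links between generated outputs and training data, supporting copyright analysis, dataset transparency, and interpretable AI.

Conclusion: The framework gives generative models traceability and interpretability, supporting copyright analysis, greater dataset transparency, and the development of interpretable AI. It contributes to AI systems that foster human collaboration and creativity and stimulate curiosity, and it provides a technical basis for the responsible use of generative models.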

📄 Abstract

As generative models become powerful, concerns around transparency, accountability, and copyright violations have intensified. Understanding how specific training data contributes to a model's output is critical. We introduce a framework for interpreting generative outputs through the automatic construction of ontology-aligned knowledge graphs (KGs). While automatic KG construction from natural text has advanced, extracting structured and ontology-consistent representations from visual content remains challenging due to the richness and multi-object nature of images. Leveraging multimodal large language models (LLMs), our method extracts structured triples from images, aligned with a domain-specific ontology. By comparing the KGs of generated and training images, we can trace potential influences, enabling copyright analysis, dataset transparency, and interpretable AI. We validate our method through experiments on locally trained models via unlearning, and on large-scale models through a style-specific experiment. Our framework supports the development of AI systems that foster human collaboration and creativity and stimulate curiosity.
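
One simple way to compare the knowledge graphs of a generated image and candidate training images is set overlap between their triples. The abstract does not specify the comparison measure, so Jaccard similarity is a swapped-in stand-in here, and the triples, predicate names, and image identifiers are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of (subject, predicate, object) triples."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_training_images(generated_kg, training_kgs):
    """Rank training images by triple overlap with the generated image, highest first."""
    scores = {name: jaccard(generated_kg, kg) for name, kg in training_kgs.items()}
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical triples, as an extractor might produce after ontology alignment.
generated_kg = {
    ("woman", "wears", "hat"),
    ("sky", "hasStyle", "swirling"),
    ("village", "locatedIn", "valley"),
}
training_kgs = {
    "train_a": {
        ("sky", "hasStyle", "swirling"),
        ("village", "locatedIn", "valley"),
        ("tree", "near", "village"),
    },
    "train_b": {("dog", "sitsOn", "sofa")},
}

ranking = rank_training_images(generated_kg, training_kgs)
print(ranking)
```

A high overlap flags a training image as a candidate influence worth inspecting; it is not evidence of causation on its own.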

[67] Radiologist Copilot: An Agentic Assistant with Orchestrated Tools for Radiology Reporting with Quality Control

Yongrui Yu, Zhongzhen Huang, Linjie Mu, Shaoting Zhang, Xiaofan Zhang

🧩 TL;DR

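This paper proposes Radiologist Copilot, an agentic system based on large language models that orchestrates multiple tools to automate radiology report generation and quality control, significantly outperforming existing methods and giving radiologists comprehensive support.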

📘 Detailed Summary

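Motivation: Writing radiology reports is a time-consuming and error-prone task in clinical examinations. Existing automated approaches focus mainly on the report generation phase and neglect the critical quality control process, so they cannot give radiologists comprehensive support, which limits their value in clinical practice.

Method: The method uses a large language model as the reasoning core of an agentic system that autonomously selects tools, plans, and executes actions, emulating the behavior of radiologists throughout the reporting process. The orchestrated tools include region localization, region analysis planning based on the "think with image" paradigm, strategic template selection, and quality control mechanisms such as quality assessment and feedback-driven adaptive refinement.

Result: Experiments show that Radiologist Copilot significantly outperforms other state-of-the-art methods in radiology report generation and produces accurate, complete, and efficient reports, effectively assisting radiologists and improving clinical efficiency.

Conclusion: The study demonstrates the potential of agentic systems in medical image analysis. By integrating generation and quality control, it gives radiologists comprehensive automated support. The method improves report quality and, by emulating the workflow of human experts, strengthens clinical practicality, offering a useful reference for future medical AI assistants.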

📄 Abstract

Radiology reporting is an essential yet time-consuming and error-prone task for radiologists in clinical examinations, especially for volumetric medical images. Rigorous quality control, which ensures that the final report meets clinical standards, is also critical but tedious. Existing automated approaches, including radiology report generation methods and medical vision-language models, focus mainly on the report generation phase and neglect the crucial quality control procedure, limiting their capability to provide comprehensive support to radiologists. We propose Radiologist Copilot, an agentic AI assistant equipped with orchestrated tools designed for automated radiology reporting with quality control. Leveraging large language models as the reasoning backbone, the agentic system autonomously selects tools, plans, and executes actions, emulating the behavior of radiologists throughout the holistic radiology reporting process. The orchestrated tools include region localization, region analysis planning directed by the "think with image" paradigm, strategic template selection for report generation, and quality assessment and feedback-driven adaptive refinement for quality control. Radiologist Copilot thereby facilitates accurate, complete, and efficient radiology reporting, assisting radiologists and improving clinical efficiency. Experimental results demonstrate that Radiologist Copilot significantly surpasses other state-of-the-art methods in radiology reporting. The source code will be released upon acceptance.

Table of Contents

cs.CV [Back]

[1] Leveraging AI multimodal geospatial foundation models for improved near-real-time flood mapping at a global scale

[2] Context-Enriched Contrastive Loss: Enhancing Presentation of Inherent Sample Connections in Contrastive Learning Framework

[3] FineGRAIN: Evaluating Failure Modes of Text-to-Image Models with Vision Language Model Judges

[4] Towards Unified Video Quality Assessment

[5] See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models

[6] Progressive Image Restoration via Text-Conditioned Video Generation

[7] Understanding and Harnessing Sparsity in Unified Multimodal Models

[8] WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning

[9] Multi-Domain Enhanced Map-Free Trajectory Prediction with Selective Attention

[10] See, Think, Learn: A Self-Taught Multimodal Reasoner

[11] From Detection to Association: Learning Discriminative Object Embeddings for Multi-Object Tracking

[12] Contextual Image Attack: How Visual Context Exposes Multimodal Safety Vulnerabilities

[13] Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch

[14] WISE: Weighted Iterative Society-of-Experts for Robust Multimodal Multi-Agent Debate

[15] Generalizing Vision-Language Models with Dedicated Prompt Guidance

[16] Boosting Medical Vision-Language Pretraining via Momentum Self-Distillation under Limited Computing Resources

[17] nuScenes Revisited: Progress and Challenges in Autonomous Driving

[18] UCAgents: Unidirectional Convergence for Visual Evidence Anchored Multi-Agent Medical Decision-Making

[19] Masking Matters: Unlocking the Spatial Reasoning Capabilities of LLMs for 3D Scene-Language Understanding

[20] Does Hearing Help Seeing? Investigating Audio-Video Joint Denoising for Video Generation

🧩 TL;DR