cs.CV

[1] Scaling Remote Sensing Foundation Models: Data Domain Tradeoffs at the Peta-Scale

Charith Wickrema, Eliza Mace, Hunter Brown, Heidys Cabrera, Nick Krall, Matthew O'Neill, Shivangi Sarkar, Lowell Weissman, Eric Hughes, Guido Zarrella

🧩 TL;DR

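This study explores scaling laws for foundation models on high-resolution electro-optical remote sensing data. By training vision transformer backbones at scales far beyond the current state of the art, it shows that even with a quadrillion pixels of data, performance in remote sensing remains limited by data rather than by model parameters.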


📘 Detailed Summary

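Motivation: Modern multimodal machine learning applications depend on robust, domain-specialized encoders for non-text modalities. In natural-image domains, well-established scaling laws guide the joint optimization of model capacity, training compute, and dataset size, but these relationships are still poorly understood in the high-value remote sensing domain, which limits the development of frontier-scale remote sensing foundation models.

Method: Using over a quadrillion pixels of commercial satellite electro-optical data, the authors train progressively larger vision transformer backbones in the MITRE Federal AI Sandbox, analyze the success and failure modes observed at petascale, and study how to bridge domain gaps across additional remote sensing modalities.

Result: Even at the quadrillion-pixel scale, performance is consistent with a data-limited regime rather than a model-parameter-limited one. This suggests that scaling laws in remote sensing differ markedly from those in natural-image domains and call for different optimization strategies.

Conclusion: The study offers practical insights into data-collection strategies, compute budgets, and optimization schedules, intended to guide the future development of frontier-scale remote sensing foundation models. It stresses that remote sensing needs scaling laws and training methods distinct from those of natural-image domains.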


📄 Abstract

We explore the scaling behaviors of artificial intelligence to establish practical techniques for training foundation models on high-resolution electro-optical (EO) datasets that exceed the current state-of-the-art scale by orders of magnitude. Modern multimodal machine learning (ML) applications, such as generative artificial intelligence (GenAI) systems for image captioning, search, and reasoning, depend on robust, domain-specialized encoders for non-text modalities. In natural-image domains where internet-scale data is plentiful, well-established scaling laws help optimize the joint scaling of model capacity, training compute, and dataset size. Unfortunately, these relationships are much less well-understood in high-value domains like remote sensing (RS). Using over a quadrillion pixels of commercial satellite EO data and the MITRE Federal AI Sandbox, we train progressively larger vision transformer (ViT) backbones, report success and failure modes observed at petascale, and analyze implications for bridging domain gaps across additional RS modalities. We observe that even at this scale, performance is consistent with a data-limited regime rather than a model-parameter-limited one. These practical insights are intended to inform data-collection strategies, compute budgets, and optimization schedules that advance the future development of frontier-scale RS foundation models.

[2] MGML: A Plug-and-Play Meta-Guided Multi-Modal Learning Framework for Incomplete Multimodal Brain Tumor Segmentation

Yulong Zou, Bo Liu, Cun-Jing Zheng, Yuan-ming Geng, Siyue Li, Qiankun Zuo, Shuihua Wang, Yudong Zhang, Jin Hong

🧩 TL;DR

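This paper proposes a meta-guided multimodal learning framework that uses meta-parameterized adaptive modality fusion and a consistency regularization module to address brain tumor segmentation when multimodal MRI data are incomplete, as is common in clinical practice.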


📘 Detailed Summary

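Motivation: In clinical practice, multimodal MRI data are often incomplete, which makes it hard to fully exploit the available information for lesion segmentation. For brain tumor segmentation in particular, making the most of incomplete multimodal information is a key research problem.

Method: The paper proposes a novel meta-guided multimodal learning framework with two core components: a meta-parameterized adaptive modality fusion module and a consistency regularization module. The fusion module generates adaptive soft-label supervision signals based on the available modalities, so that multimodal information is integrated effectively under varying input conditions. The consistency regularization module improves segmentation performance and implicitly strengthens the framework's robustness and generalization. The method does not alter the original model architecture and can be integrated easily into an end-to-end training pipeline.

Result: Extensive experiments on the public BraTS2020 and BraTS2023 datasets show that the method outperforms multiple state-of-the-art methods in average Dice score across fifteen missing-modality combinations. On BraTS2020, building on the baseline model, it obtains Dice scores of 87.55, 79.36, and 62.67 for the whole tumor, tumor core, and enhancing tumor regions, respectively.

Conclusion: The work offers an effective solution for incomplete multimodal medical image segmentation. Its meta-guided fusion mechanism and consistency regularization improve segmentation accuracy and robustness to missing modalities, giving a practical framework for handling multimodal data in clinical settings. Its modular design also makes it easy to integrate into existing training pipelines.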


📄 Abstract

Leveraging multimodal information from Magnetic Resonance Imaging (MRI) plays a vital role in lesion segmentation, especially for brain tumors. However, in clinical practice, multimodal MRI data are often incomplete, making it challenging to fully utilize the available information. Therefore, maximizing the utilization of this incomplete multimodal information presents a crucial research challenge. We present a novel meta-guided multi-modal learning (MGML) framework that comprises two components: meta-parameterized adaptive modality fusion and consistency regularization module. The meta-parameterized adaptive modality fusion (Meta-AMF) enables the model to effectively integrate information from multiple modalities under varying input conditions. By generating adaptive soft-label supervision signals based on the available modalities, Meta-AMF explicitly promotes more coherent multimodal fusion. In addition, the consistency regularization module enhances segmentation performance and implicitly reinforces the robustness and generalization of the overall framework. Notably, our approach does not alter the original model architecture and can be conveniently integrated into the training pipeline for end-to-end model optimization. We conducted extensive experiments on the public BraTS2020 and BraTS2023 datasets. Compared to multiple state-of-the-art methods from previous years, our method achieved superior performance. On BraTS2020, for the average Dice scores across fifteen missing modality combinations, building upon the baseline, our method obtained scores of 87.55, 79.36, and 62.67 for the whole tumor (WT), the tumor core (TC), and the enhancing tumor (ET), respectively. We have made our source code publicly available at https://github.com/worldlikerr/MGML.
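The "fifteen missing modality combinations" most plausibly correspond to every non-empty subset of the four MRI sequences shipped with BraTS (2^4 - 1 = 15). The sketch below simply enumerates them; the modality names are a naming assumption for illustration and are not taken from the MGML code.

```python
from itertools import combinations

# Assumed labels for the four BraTS MRI sequences (illustrative naming only).
MODALITIES = ("FLAIR", "T1", "T1ce", "T2")

def available_modality_subsets(modalities=MODALITIES):
    """Return every non-empty subset of modalities, smallest subsets first."""
    subsets = []
    for k in range(1, len(modalities) + 1):
        subsets.extend(combinations(modalities, k))
    return subsets

subsets = available_modality_subsets()
print(len(subsets))  # 15
```

Averaging a per-region Dice score over these subsets would give one number per region (WT, TC, ET), which matches the way the results above are reported.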

[3] T2VAttack: Adversarial Attack on Text-to-Video Diffusion Models

Changzhen Li, Yuecong Min, Jie Zhang, Zheng Yuan, Shiguang Shan, Xilin Chen

🧩 TL;DR

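This paper presents T2VAttack, the first comprehensive study of adversarial attacks on text-to-video diffusion models. It evaluates model vulnerability along semantic and temporal dimensions and shows that even tiny prompt modifications can substantially degrade the quality of the generated video.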


📘 Detailed Summary

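Motivation: Although text-to-video diffusion models have made remarkable progress in generating high-quality, temporally coherent videos, their vulnerability to adversarial attacks remains underexplored. The dynamic nature of video data requires evaluating robustness along both semantic and temporal dimensions.

Method: The study proposes two attack objectives: a semantic objective that evaluates video-text alignment and a temporal objective that evaluates temporal dynamics. It also develops two attack methods: T2VAttack-S, which identifies critical words via greedy search and replaces them with synonyms, and T2VAttack-I, which iteratively inserts optimized words to keep the prompt perturbation minimal.

Result: Experiments on several state-of-the-art T2V models (ModelScope, CogVideoX, Open-Sora, and HunyuanVideo) show that substituting or inserting even a single word causes substantial degradation in semantic fidelity and temporal dynamics, exposing serious vulnerabilities in current models.

Conclusion: The study exposes critical vulnerabilities of text-to-video diffusion models under adversarial attack, underscores the importance of considering robustness during model development, and provides a security evaluation framework and benchmark for safer video generation systems.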


📄 Abstract

The rapid evolution of Text-to-Video (T2V) diffusion models has driven remarkable advancements in generating high-quality, temporally coherent videos from natural language descriptions. Despite these achievements, their vulnerability to adversarial attacks remains largely unexplored. In this paper, we introduce T2VAttack, a comprehensive study of adversarial attacks on T2V diffusion models from both semantic and temporal perspectives. Considering the inherently dynamic nature of video data, we propose two distinct attack objectives: a semantic objective to evaluate video-text alignment and a temporal objective to assess the temporal dynamics. To achieve an effective and efficient attack process, we propose two adversarial attack methods: (i) T2VAttack-S, which identifies semantically or temporally critical words in prompts and replaces them with synonyms via greedy search, and (ii) T2VAttack-I, which iteratively inserts optimized words with minimal perturbation to the prompt. By combining these objectives and strategies, we conduct a comprehensive evaluation on the adversarial robustness of several state-of-the-art T2V models, including ModelScope, CogVideoX, Open-Sora, and HunyuanVideo. Our experiments reveal that even minor prompt modifications, such as the substitution or insertion of a single word, can cause substantial degradation in semantic fidelity and temporal dynamics, highlighting critical vulnerabilities in current T2V diffusion models.
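The greedy synonym substitution behind T2VAttack-S can be pictured with a small sketch. This is not the authors' implementation: `score` is a hypothetical stand-in for whichever semantic or temporal objective is evaluated on the generated video, and the synonym table is supplied by the caller. Here a toy text-only score replaces the expensive generate-then-evaluate step.

```python
from typing import Callable, Dict, List

def greedy_synonym_attack(
    prompt: str,
    synonyms: Dict[str, List[str]],
    score: Callable[[str], float],
    max_swaps: int = 1,
) -> str:
    """Greedily replace up to `max_swaps` words so that `score` decreases."""
    words = prompt.split()
    best_score = score(prompt)
    for _ in range(max_swaps):
        best_candidate = None
        for i, word in enumerate(words):
            for alt in synonyms.get(word, []):
                candidate = words[:i] + [alt] + words[i + 1:]
                s = score(" ".join(candidate))
                if s < best_score:  # keep the most damaging single swap
                    best_score, best_candidate = s, candidate
        if best_candidate is None:  # no swap lowers the score any further
            break
        words = best_candidate
    return " ".join(words)

# Toy objective (assumption): counts words a pretend model is sensitive to.
def toy_alignment(text: str) -> float:
    return float(sum(w in {"dog", "runs"} for w in text.split()))

adv = greedy_synonym_attack(
    "a dog runs fast",
    {"dog": ["hound", "puppy"], "fast": ["quickly"]},
    toy_alignment,
)
print(adv)  # a hound runs fast
```

In a real attack each call to `score` would require generating a video, so the number of candidate swaps evaluated per step is the main cost driver.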

[4] Bridging Structure and Appearance: Topological Features for Robust Self-Supervised Segmentation

Haotang Li, Zhenyu Qi, Hao Qin, Huanrui Yang, Sen He, Kebin Peng

🧩 TL;DR

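This paper proposes the GASeg framework, which introduces a differentiable box-counting module and a topological augmentation strategy to combine geometric and topological information with appearance features, addressing the performance drop that appearance ambiguity causes in self-supervised semantic segmentation.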


📘 Detailed Summary

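Motivation: Self-supervised semantic segmentation methods often fail under appearance ambiguity, mainly because they over-rely on unstable appearance features such as shadows, glare, and local textures while ignoring more stable geometric and topological information.

Method: The core of the method is a differentiable box-counting module that quantifies multi-scale topological statistics from a geometric feature stream and an appearance feature stream. A topological augmentation strategy simulates real-world ambiguities through morphological operations. Finally, the multi-objective GALoss explicitly enforces cross-modal alignment between geometric and appearance features.

Result: Extensive experiments on four benchmarks, including COCO-Stuff, Cityscapes, and PASCAL, show that GASeg achieves state-of-the-art performance, validating the approach of bridging geometry and appearance through topological information.

Conclusion: The study shows that integrating stable geometric and topological information effectively mitigates appearance ambiguity in self-supervised semantic segmentation, offers a new approach to cross-modal feature alignment, and demonstrates the potential of topological augmentation for simulating real-world ambiguities.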


📄 Abstract

Self-supervised semantic segmentation methods often fail when faced with appearance ambiguities. We argue that this is due to an over-reliance on unstable, appearance-based features such as shadows, glare, and local textures. We propose GASeg, a novel framework that bridges appearance and geometry by leveraging stable topological information. The core of our method is the Differentiable Box-Counting (DBC) module, which quantifies multi-scale topological statistics from two parallel streams: geometric-based features and appearance-based features. To force the model to learn these stable structural representations, we introduce Topological Augmentation (TopoAug), an adversarial strategy that simulates real-world ambiguities by applying morphological operators to the input images. A multi-objective loss, GALoss, then explicitly enforces cross-modal alignment between geometric-based and appearance-based features. Extensive experiments demonstrate that GASeg achieves state-of-the-art performance on four benchmarks, including COCO-Stuff, Cityscapes, and PASCAL, validating our approach of bridging geometry and appearance via topological information.
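For readers unfamiliar with box counting, here is a minimal NumPy sketch of the classical, non-differentiable version on a binary mask. It is only meant to show what "multi-scale statistics" means here; the paper's DBC module is differentiable and operates on feature streams, and its actual formulation is not given in the abstract. The function names and the choice of box sizes are assumptions.

```python
import numpy as np

def box_counts(mask, sizes=(1, 2, 4, 8)):
    """Count the s x s boxes that contain at least one foreground pixel."""
    mask = np.asarray(mask, dtype=bool)
    h, w = mask.shape
    counts = []
    for s in sizes:
        # Crop so the grid divides evenly, then test each s x s box.
        hh, ww = (h // s) * s, (w // s) * s
        boxes = mask[:hh, :ww].reshape(hh // s, s, ww // s, s)
        counts.append(int(boxes.any(axis=(1, 3)).sum()))
    return counts

def box_counting_dimension(mask, sizes=(1, 2, 4, 8)):
    """Least-squares slope of log N(s) against log(1/s); mask must be non-empty."""
    counts = box_counts(mask, sizes)
    x = np.log(1.0 / np.array(sizes, dtype=float))
    slope, _ = np.polyfit(x, np.log(counts), 1)
    return float(slope)

filled = np.ones((16, 16), dtype=bool)   # a solid region
line = np.zeros((16, 16), dtype=bool)
line[0, :] = True                        # a thin horizontal line

print(box_counts(filled))  # [256, 64, 16, 4]
print(box_counts(line))    # [16, 8, 4, 2]
```

The solid region yields a slope of about 2 and the line a slope of about 1, which is why such statistics describe shape structure independently of color or texture.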

[5] Bridging the Perception-Cognition Gap: Re-engineering SAM2 with Hilbert-Mamba for Robust VLM-based Medical Diagnosis

Hao Wu, Hui Li, Yiyun Su

🧩 TL;DR

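This paper proposes Hilbert-VLM, a two-stage fusion framework for 3D multimodal medical image analysis. It integrates Hilbert space-filling curves into the Mamba state space model to preserve spatial locality, and uses enhanced prompts to guide a vision-language model toward accurate disease classification.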


📘 Detailed Summary

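Motivation: Current vision-language models face two main challenges when processing complex 3D multimodal medical images: effectively integrating complementary information, and occasionally overlooking subtle but critical pathological features. These limit their accuracy and reliability in automated medical diagnosis.

Method: The proposed Hilbert-VLM two-stage fusion framework comprises the HilbertMed-SAM module for precise lesion segmentation and a prompt enhancement module. The core innovation is a systematic redesign of the SAM2 architecture: Hilbert space-filling curves are integrated into the scanning mechanism of the Mamba state space model to preserve as much of the spatial locality of 3D data as possible, and a Hilbert-Mamba cross-attention mechanism and a scale-aware decoder are introduced to capture fine-grained details. Segmentation masks and their corresponding textual attributes are unified into an information-dense prompt that supports vision-language model inference.

Result: On the BraTS2021 segmentation benchmark, the model achieves a Dice score of 82.35% and a disease classification accuracy of 78.85%, indicating substantial potential to improve the accuracy and reliability of medical vision-language model analysis.

Conclusion: The study demonstrates the importance of systematically redesigning the architecture to preserve spatial locality in 3D medical images. The Hilbert-VLM framework offers an effective solution for complex multimodal medical images and shows substantial potential to improve the accuracy and reliability of vision-language models in automated medical diagnosis.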


📄 Abstract

Recent studies suggest that Visual Language Models (VLMs) hold great potential for tasks such as automated medical diagnosis. However, processing complex three-dimensional (3D) multimodal medical images poses significant challenges - specifically, the effective integration of complementary information and the occasional oversight of subtle yet critical pathological features. To address these issues, we present a novel two-stage fusion framework termed Hilbert-VLM. This framework leverages the HilbertMed-SAM module for precise lesion segmentation, with the generated multimodal enhanced prompts then guiding the VLM toward accurate disease classification. Our key innovation lies in the systematic redesign of the Segment Anything Model 2 (SAM2) architecture: we incorporate Hilbert space-filling curves into the scanning mechanism of the Mamba State Space Model (SSM) to maximize the preservation of spatial locality in 3D data, a property critical for medical image analysis. We also introduce a novel Hilbert-Mamba Cross-Attention (HMCA) mechanism and a scale-aware decoder to capture fine-grained details. Meanwhile, the prompt enhancement module unifies segmentation masks and their corresponding textual attributes into an information-dense prompt to support VLM inference. Extensive experiments were conducted to validate the effectiveness of the Hilbert-VLM model. On the BraTS2021 segmentation benchmark, it achieves a Dice score of 82.35 percent, with a diagnostic classification accuracy (ACC) of 78.85 percent. These results demonstrate that the proposed model offers substantial potential to improve the accuracy and reliability of medical VLM-based analysis.
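To see why a Hilbert scan preserves locality better than a row-by-row scan, the sketch below generates a 2D Hilbert ordering with the standard index-to-coordinate conversion and compares the largest jump between consecutive positions. The paper works with 3D volumes and its exact scan construction is not described in the abstract, so this 2D version is an illustrative assumption.

```python
def hilbert_d2xy(order, d):
    """Map index d along a Hilbert curve to (x, y) on a 2^order x 2^order grid."""
    n = 1 << order
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:  # flip the quadrant before swapping axes
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def hilbert_scan(order):
    """Visit every cell of the grid in Hilbert order."""
    n = 1 << order
    return [hilbert_d2xy(order, d) for d in range(n * n)]

def max_step(path):
    """Largest Manhattan distance between consecutive positions in a scan."""
    return max(abs(x1 - x0) + abs(y1 - y0)
               for (x0, y0), (x1, y1) in zip(path, path[1:]))

raster = [(x, y) for y in range(8) for x in range(8)]
print(max_step(hilbert_scan(3)))  # 1
print(max_step(raster))           # 8
```

A sequence model that reads tokens in Hilbert order therefore always moves to a spatially adjacent cell, whereas a raster scan jumps across the whole row at each line break.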

[6] FUSE-RSVLM: Feature Fusion Vision-Language Model for Remote Sensing

Yunkai Dang, Donghao Wang, Jiacheng Yang, Yifan Jiang, Meiyi Zhu, Yuekun Yang, Cong Wang, Qi Fan, Wenbin Li, Yang Gao

🧩 TL;DR

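This paper proposes MF-RSVLM, a multi-feature fusion vision-language model for remote sensing. Through multi-scale visual representation learning and a recurrent visual feature injection mechanism, it addresses fine-grained feature extraction and visual forgetting in remote sensing image understanding.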


📘 Detailed Summary

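Motivation: Large vision-language models perform well in general domains but face significant challenges in remote sensing, mainly because of the inherent differences between remote sensing images and natural images. Existing remote sensing VLMs struggle to extract fine-grained visual features and suffer from visual forgetting during deep language processing, which limits their effectiveness for remote sensing scene understanding.

Method: MF-RSVLM adopts a multi-feature fusion architecture that learns multi-scale visual representations and combines global context with local details to better capture small and complex structures in remote sensing scenes. A recurrent visual feature injection scheme keeps the language model grounded in visual evidence throughout generation, reducing visual forgetting.

Result: Extensive experiments on multiple remote sensing benchmarks show that MF-RSVLM achieves state-of-the-art or highly competitive performance on remote sensing classification, image captioning, and visual question answering. The model shows clear advantages in fine-grained feature extraction and visual information retention, validating the proposed method.

Conclusion: The work offers an effective solution for vision-language modeling in remote sensing; multi-scale feature fusion and recurrent visual injection markedly improve the model's understanding of remote sensing images. The method provides a new way to address the challenges posed by the gap between remote sensing and natural images, and lays groundwork for remote sensing multimodal tasks.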


📄 Abstract

Large vision-language models (VLMs) exhibit strong performance across various tasks. However, these VLMs encounter significant challenges when applied to the remote sensing domain due to the inherent differences between remote sensing images and natural images. Existing remote sensing VLMs often fail to extract fine-grained visual features and suffer from visual forgetting during deep language processing. To address this, we introduce MF-RSVLM, a Multi-Feature Fusion Remote Sensing Vision-Language Model that effectively extracts and fuses visual features for RS understanding. MF-RSVLM learns multi-scale visual representations and combines global context with local details, improving the capture of small and complex structures in RS scenes. A recurrent visual feature injection scheme ensures the language model remains grounded in visual evidence and reduces visual forgetting during generation. Extensive experiments on diverse RS benchmarks show that MF-RSVLM achieves state-of-the-art or highly competitive performance across remote sensing classification, image captioning, and VQA tasks. Our code is publicly available at https://github.com/Yunkaidang/RSVLM.

[7] RSAgent: Learning to Reason and Act for Text-Guided Segmentation via Multi-Turn Tool Invocations

Xingqi He, Yujie Zhang, Shuyong Gao, Wenjie Li, Lingyi Hong, Mingxi Chen, Kaixun Jiang, Jiyuan Fu, Wenqiang Zhang

🧩 TL;DR

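This paper proposes RSAgent, an agentic system built on a multimodal large language model that performs text-guided object segmentation through multi-turn tool invocations, interleaving reasoning and action to overcome the limitations of conventional one-shot grounding.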


📘 Detailed Summary

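Motivation: Current text-guided segmentation methods usually treat the task as one-shot grounding: the model predicts pixel prompts in a single forward pass to drive an external segmentor. When the initial localization is wrong, this approach cannot verify, refocus, or refine, which limits further gains in segmentation performance.

Method: RSAgent uses a multimodal large language model as an agent that interleaves reasoning and action through multi-turn tool invocations: it queries a segmentation toolbox, observes visual feedback, and uses historical observations to revise its spatial hypothesis, re-localize targets, and iteratively refine masks. The authors also build a data pipeline that synthesizes multi-turn reasoning segmentation trajectories and adopt a two-stage training framework: cold-start supervised fine-tuning followed by agentic reinforcement learning with fine-grained, task-specific rewards.

Result: RSAgent achieves a zero-shot gIoU of 66.5% on the ReasonSeg test set, improving over Seg-Zero-7B by 9%, and reaches 81.5% cIoU on RefCOCOg, demonstrating state-of-the-art performance on both in-domain and out-of-domain benchmarks.

Conclusion: The study shows that bringing the agentic paradigm to text-guided segmentation is effective: multi-turn interactive reasoning markedly improves segmentation accuracy and robustness, provides a new framework for complex vision-language tasks, and confirms that reinforcement learning is applicable to fine-grained visual tasks.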


📄 Abstract

Text-guided object segmentation requires both cross-modal reasoning and pixel grounding abilities. Most recent methods treat text-guided segmentation as one-shot grounding, where the model predicts pixel prompts in a single forward pass to drive an external segmentor, which limits verification, refocusing and refinement when initial localization is wrong. To address this limitation, we propose RSAgent, an agentic Multimodal Large Language Model (MLLM) which interleaves reasoning and action for segmentation via multi-turn tool invocations. RSAgent queries a segmentation toolbox, observes visual feedback, and revises its spatial hypothesis using historical observations to re-localize targets and iteratively refine masks. We further build a data pipeline to synthesize multi-turn reasoning segmentation trajectories, and train RSAgent with a two-stage framework: cold-start supervised fine-tuning followed by agentic reinforcement learning with fine-grained, task-specific rewards. Extensive experiments show that RSAgent achieves a zero-shot performance of 66.5% gIoU on ReasonSeg test, improving over Seg-Zero-7B by 9%, and reaches 81.5% cIoU on RefCOCOg, demonstrating state-of-the-art performance on both in-domain and out-of-domain benchmarks.
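The two reported metrics aggregate differently. As they are commonly defined in the reasoning-segmentation literature, gIoU is the mean of per-image IoUs and cIoU is the cumulative intersection divided by the cumulative union. The abstract does not restate these definitions, so the sketch below rests on that assumption; scoring an empty prediction against an empty target as 1.0 is a further convention chosen here.

```python
import numpy as np

def giou_ciou(preds, targets):
    """Return (gIoU, cIoU) for paired lists of binary masks."""
    inter = np.array([np.logical_and(p, t).sum() for p, t in zip(preds, targets)], dtype=float)
    union = np.array([np.logical_or(p, t).sum() for p, t in zip(preds, targets)], dtype=float)
    # Per-image IoU; an empty union is scored as 1.0 (assumed convention).
    per_image = np.where(union > 0, inter / np.maximum(union, 1.0), 1.0)
    giou = float(per_image.mean())
    ciou = float(inter.sum() / union.sum())
    return giou, ciou

# A small object segmented poorly and a large object segmented perfectly.
pred_small, gt_small = np.array([1, 1, 0, 0], bool), np.array([1, 0, 0, 0], bool)
pred_large, gt_large = np.ones(8, bool), np.ones(8, bool)

giou, ciou = giou_ciou([pred_small, pred_large], [gt_small, gt_large])
print(giou, ciou)  # 0.75 0.9
```

Because cIoU pools pixels across images, large objects dominate it, while gIoU weights every image equally. That difference should be kept in mind when comparing the two figures above.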

[8] Neighbor-aware Instance Refining with Noisy Labels for Cross-Modal Retrieval

Yizhi Liu, Ruitao Pu, Shilin Xu, Yingke Chen, Quan-Hui Liu, Yuan Sun

🧩 TL;DR

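This paper proposes NIRNL, a robust cross-modal learning framework that uses Cross-modal Margin Preserving and Neighbor-aware Instance Refining to handle noisy labels in multimodal data, achieving state-of-the-art retrieval performance on several benchmark datasets.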


📘 Detailed Summary

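Motivation: Cross-modal retrieval suffers from the difficulty of collecting large-scale, well-annotated data. Annotations of multimodal data inevitably contain noisy labels, which significantly degrade retrieval performance. Existing robust learning methods struggle to satisfy model performance ceilings, calibration reliability, and data utilization rate at the same time, so a more effective framework for handling noisy labels is needed.

Method: The NIRNL framework has two core techniques. Cross-modal Margin Preserving adjusts the relative distance between positive and negative pairs to sharpen the discrimination between sample pairs. Neighbor-aware Instance Refining uses cross-modal neighborhood consensus to identify pure, hard, and noisy subsets, and tailored optimization strategies are constructed for this fine-grained partition, maximizing data utilization while reducing error propagation.

Result: Extensive experiments on three benchmark datasets show that NIRNL achieves state-of-the-art performance and remarkable robustness, especially under high noise rates. The method outperforms existing approaches in model performance, calibration reliability, and data utilization rate.

Conclusion: Through fine-grained instance partitioning and tailored optimization strategies, the study effectively addresses noisy labels in cross-modal retrieval and offers a new perspective on noise robustness in multimodal learning. The framework balances data utilization against error control and has practical value.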


📄 Abstract

In recent years, Cross-Modal Retrieval (CMR) has made significant progress in the field of multi-modal analysis. However, since it is time-consuming and labor-intensive to collect large-scale and well-annotated data, the annotation of multi-modal data inevitably contains some noise. This will degrade the retrieval performance of the model. To tackle the problem, numerous robust CMR methods have been developed, including robust learning paradigms, label calibration strategies, and instance selection mechanisms. Unfortunately, they often fail to simultaneously satisfy model performance ceilings, calibration reliability, and data utilization rate. To overcome the limitations, we propose a novel robust cross-modal learning framework, namely Neighbor-aware Instance Refining with Noisy Labels (NIRNL). Specifically, we first propose Cross-modal Margin Preserving (CMP) to adjust the relative distance between positive and negative pairs, thereby enhancing the discrimination between sample pairs. Then, we propose Neighbor-aware Instance Refining (NIR) to identify pure subset, hard subset, and noisy subset through cross-modal neighborhood consensus. Afterward, we construct different tailored optimization strategies for this fine-grained partitioning, thereby maximizing the utilization of all available data while mitigating error propagation. Extensive experiments on three benchmark datasets demonstrate that NIRNL achieves state-of-the-art performance, exhibiting remarkable robustness, especially under high noise rates.
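One way to picture a partition by cross-modal neighborhood consensus is sketched below. The abstract does not give the actual NIR criterion, so everything here is an assumption made for illustration: each image looks up its `k` nearest text embeddings, and the fraction of those neighbors that share its label decides whether the sample counts as pure, hard, or noisy. The thresholds `hi` and `lo` are hypothetical.

```python
import numpy as np

def partition_by_neighbor_consensus(img_emb, txt_emb, labels, k=4, hi=0.75, lo=0.25):
    """Label each sample 'pure', 'hard', or 'noisy' from neighbor label agreement."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    sim = img @ txt.T                      # cosine similarity, image to text
    np.fill_diagonal(sim, -np.inf)         # ignore each sample's own pair
    nn = np.argsort(-sim, axis=1)[:, :k]   # indices of the k nearest texts
    agreement = (labels[nn] == labels[:, None]).mean(axis=1)
    return np.where(agreement >= hi, "pure",
                    np.where(agreement <= lo, "noisy", "hard"))

# Three tight clusters of five samples each; some labels are corrupted.
clusters = np.repeat([0, 1, 2], 5)
emb = np.eye(3)[clusters]
labels = np.array([0, 0, 0, 0, 1,   1, 1, 1, 1, 1,   2, 2, 2, 3, 3])

parts = partition_by_neighbor_consensus(emb, emb, labels)
print(parts[0], parts[4], parts[10])  # pure noisy hard
```

A framework in this style could then train normally on the pure subset, use a softer objective on the hard subset, and down-weight or relabel the noisy subset, which is the kind of tailored treatment the abstract alludes to.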

[9] Balanced Hierarchical Contrastive Learning with Decoupled Queries for Fine-grained Object Detection in Remote Sensing Images

Jingzhou Chen, Dexin Chen, Fengchao Xiong, Yuntao Qian, Liang Xiao

🧩 TL;DR

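This paper proposes a balanced hierarchical contrastive loss and a decoupled learning strategy to address the challenge of embedding hierarchical label structures in fine-grained remote sensing object detection. Within the DETR framework, learnable class prototypes and a gradient equalization mechanism mitigate the effect of imbalanced data distribution on hierarchical semantic learning.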


📘 Detailed Summary

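Motivation: Fine-grained remote sensing datasets usually use hierarchical label structures to distinguish objects, but embedding this semantic hierarchy into the representation learning space to improve detection performance remains challenging. Earlier work that applied supervised contrastive learning overlooked two key issues: imbalanced data distribution across the label hierarchy lets high-frequency classes dominate learning, and learning semantic relationships among categories interferes with class-agnostic localization.

Method: The paper proposes a balanced hierarchical contrastive loss combined with a decoupled learning strategy within the detection transformer (DETR) framework. The loss introduces learnable class prototypes and equalizes the gradients contributed by different classes at each hierarchical level, so that every hierarchical class contributes equally to the loss in each mini-batch. The decoupled strategy splits DETR's object queries into classification and localization sets, enabling task-specific feature extraction and optimization.

Result: Experiments on three fine-grained datasets with hierarchical annotations show that the method outperforms existing state-of-the-art approaches. Specifically, the balanced hierarchical contrastive loss mitigates data imbalance, and the decoupled learning strategy markedly improves both classification and localization.

Conclusion: The study demonstrates the value of exploiting hierarchical semantic information in fine-grained object detection; the proposed loss and decoupled strategy offer an effective remedy for data imbalance and task interference. The method opens a new direction for representation learning on hierarchically annotated data and could extend to other fine-grained vision tasks.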


📄 Abstract

Fine-grained remote sensing datasets often use hierarchical label structures to differentiate objects in a coarse-to-fine manner, with each object annotated across multiple levels. However, embedding this semantic hierarchy into the representation learning space to improve fine-grained detection performance remains challenging. Previous studies have applied supervised contrastive learning at different hierarchical levels to group objects under the same parent class while distinguishing sibling subcategories. Nevertheless, they overlook two critical issues: (1) imbalanced data distribution across the label hierarchy causes high-frequency classes to dominate the learning process, and (2) learning semantic relationships among categories interferes with class-agnostic localization. To address these issues, we propose a balanced hierarchical contrastive loss combined with a decoupled learning strategy within the detection transformer (DETR) framework. The proposed loss introduces learnable class prototypes and equilibrates gradients contributed by different classes at each hierarchical level, ensuring that each hierarchical class contributes equally to the loss computation in every mini-batch. The decoupled strategy separates DETR's object queries into classification and localization sets, enabling task-specific feature extraction and optimization. Experiments on three fine-grained datasets with hierarchical annotations demonstrate that our method outperforms state-of-the-art approaches.
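The idea that "each hierarchical class contributes equally to the loss" can be illustrated at a single hierarchy level with a prototype-based loss that is averaged within each class before being averaged across classes. This is a simplified sketch under assumptions, not the paper's loss: the prototypes are fixed arrays rather than learnable parameters, only one level is shown, and the temperature value is arbitrary.

```python
import numpy as np

def balanced_prototype_loss(features, labels, prototypes, temperature=0.1):
    """Cross-entropy over prototype similarities, averaged per class, then over classes."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = f @ p.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    per_sample = -log_prob[np.arange(len(labels)), labels]
    # Average inside each class first, so frequent classes cannot dominate.
    per_class = [per_sample[labels == c].mean() for c in np.unique(labels)]
    return float(np.mean(per_class))

rng = np.random.default_rng(0)
prototypes = rng.normal(size=(2, 4))
features = rng.normal(size=(3, 4))
labels = np.array([0, 0, 1])

loss = balanced_prototype_loss(features, labels, prototypes)

# Duplicating every class-0 sample leaves the balanced loss unchanged.
features_dup = np.concatenate([features, features[labels == 0]])
labels_dup = np.concatenate([labels, labels[labels == 0]])
loss_dup = balanced_prototype_loss(features_dup, labels_dup, prototypes)
```

With a plain per-sample mean, the duplicated class would pull the loss toward its own value; the per-class average removes that dependence on class frequency within the mini-batch.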

[10] Factorized Learning for Temporally Grounded Video-Language Models

Wenzheng Zeng, Difei Gao, Mike Zheng Shou, Hwee Tou Ng

🧩 TL;DR

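This paper proposes the D²VLM framework and a factorized preference optimization algorithm. By decoupling the learning of temporal grounding and textual response while emphasizing the logical hierarchy between them, it tackles inaccurate event-level perception in video-language models.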


📘 Detailed Summary

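Motivation: Existing video-language models ground events inaccurately in time during event-level perception. Temporal grounding and textual response, the two core tasks, are usually handled in a coupled manner without a clear logical hierarchy, which leads to sub-optimal objectives. The authors observe that accurate temporal evidence grounding is the foundation of a reliable textual response and address the problem from a factorized learning perspective.

Method: The proposed D²VLM framework adopts a "grounding then answering with evidence referencing" paradigm, decoupling the learning of temporal grounding and textual response while emphasizing their inherent dependency. It introduces evidence tokens for evidence grounding, which focus on capturing event-level visual semantics rather than on timestamp representation alone. A factorized preference optimization (FPO) algorithm then incorporates probabilistic temporal grounding modeling explicitly into the optimization objective, enabling preference learning for both temporal grounding and textual response. To support factorized preference learning, the authors construct a synthetic dataset with explicit temporal grounding annotations.

Result: Experiments on a variety of tasks show a clear advantage for the method. D²VLM combined with FPO yields marked gains in both temporal grounding accuracy and textual response quality, validating the factorized learning approach. The evidence token mechanism captures event-level visual semantics beyond timestamp representation and improves the accuracy of event perception.

Conclusion: The study confirms the logical hierarchy between temporal grounding and textual response and shows that factorized learning effectively improves video-language models. Evidence tokens and FPO offer a new technical route for video understanding and highlight the importance of event-level semantic capture. The approach provides a new paradigm for video-language model design and could extend to more complex multimodal reasoning tasks.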


📄 Abstract

Recent video-language models have shown great potential for video understanding, but still struggle with accurate temporal grounding for event-level perception. We observe that two main factors in video understanding (i.e., temporal grounding and textual response) form a logical hierarchy: accurate temporal evidence grounding lays the foundation for reliable textual response. However, existing works typically handle these two tasks in a coupled manner without a clear logical structure, leading to sub-optimal objectives. We address this from a factorized learning perspective. We first propose D²VLM, a framework that decouples the learning of these two tasks while also emphasizing their inherent dependency. We adopt a "grounding then answering with evidence referencing" paradigm and introduce evidence tokens for evidence grounding, which emphasize event-level visual semantic capture beyond the focus on timestamp representation in existing works. To further facilitate the learning of these two tasks, we introduce a novel factorized preference optimization (FPO) algorithm. Unlike standard preference optimization, FPO explicitly incorporates probabilistic temporal grounding modeling into the optimization objective, enabling preference learning for both temporal grounding and textual response. We also construct a synthetic dataset to address the lack of suitable datasets for factorized preference learning with explicit temporal grounding. Experiments on various tasks demonstrate the clear advantage of our approach. Our source code is available at https://github.com/nusnlp/d2vlm.

[11] GeoBench: Rethinking Multimodal Geometric Problem-Solving via Hierarchical Evaluation

Yuan Feng, Yue Yang, Xiaohan He, Jiatong Zhao, Jianlong Chen, Zijun Chen, Daocheng Fu, Qi Liu, Renqiu Xia, Bo Zhang, Junchi Yan

🧩 TL;DR

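This paper presents GeoBench, a hierarchical benchmark for evaluating the geometric reasoning of vision-language models. It uses four reasoning levels and six formally verified tasks to analyze systematically how models perform in geometric problem solving.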


📘 Detailed Summary

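Motivation: Current evaluations of geometric reasoning in vision-language models have three main limitations: textbook-based benchmarks risk test data contamination, final answers are overemphasized at the expense of the reasoning process, and diagnostic granularity is insufficient. These issues hinder accurate assessment of a model's geometric reasoning ability.

Method: The study proposes the hierarchical GeoBench benchmark with four reasoning levels: Visual Perception, Goal-Oriented Planning, Rigorous Theorem Application, and Self-Reflective Backtracking. Six formally verified tasks generated with TrustGeoGen are used to assess capabilities ranging from attribute extraction to logical error correction.

Result: Reasoning models such as OpenAI-o3 outperform general-purpose multimodal large language models, but performance drops significantly as task complexity increases. Sub-goal decomposition and irrelevant premise filtering strongly influence final problem-solving accuracy, whereas Chain-of-Thought prompting unexpectedly degrades performance on some tasks.

Conclusion: GeoBench provides a comprehensive evaluation benchmark for geometric problem solving, reveals how key factors such as sub-goal decomposition and premise filtering affect model performance, and offers actionable guidelines for developing geometric problem-solving systems.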


📄 Abstract

Geometric problem solving constitutes a critical branch of mathematical reasoning, requiring precise analysis of shapes and spatial relationships. Current evaluations of geometric reasoning in vision-language models (VLMs) face limitations, including the risk of test data contamination from textbook-based benchmarks, overemphasis on final answers over reasoning processes, and insufficient diagnostic granularity. To address these issues, we present GeoBench, a hierarchical benchmark featuring four reasoning levels in geometric problem-solving: Visual Perception, Goal-Oriented Planning, Rigorous Theorem Application, and Self-Reflective Backtracking. Through six formally verified tasks generated via TrustGeoGen, we systematically assess capabilities ranging from attribute extraction to logical error correction. Experiments reveal that while reasoning models like OpenAI-o3 outperform general MLLMs, performance declines significantly with increasing task complexity. Key findings demonstrate that sub-goal decomposition and irrelevant premise filtering critically influence final problem-solving accuracy, whereas Chain-of-Thought prompting unexpectedly degrades performance in some tasks. These findings establish GeoBench as a comprehensive benchmark while offering actionable guidelines for developing geometric problem-solving systems.

[12] DermaVQA-DAS: Dermatology Assessment Schema (DAS) & Datasets for Closed-Ended Question Answering & Segmentation in Patient-Generated Dermatology Images

Wen-wai Yim, Yujuan Fu, Asma Ben Abacha, Meliha Yetisgen, Noel Codella, Roberto Andres Novoa, Josep Malvehy

🧩 TL;DR

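This study introduces the DermaVQA-DAS dataset and the Dermatology Assessment Schema (DAS), extending existing dermatological image analysis benchmarks. By supporting closed-ended question answering and skin lesion segmentation, it provides a standardized evaluation framework for patient-centered dermatological vision-language modeling.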


📘 Detailed Summary

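Motivation: Existing dermatological image analysis benchmarks focus mainly on dermatoscopic images and lack patient-authored queries and clinical context, which limits their applicability to patient-centered care. The study aims to fill this gap with a standardized evaluation framework that supports patient-centered care.

Method: The study introduces the Dermatology Assessment Schema (DAS), a systematic, expert-developed framework comprising 36 high-level and 27 fine-grained assessment questions with multiple-choice options in English and Chinese. The DermaVQA-DAS dataset built on DAS supports two complementary tasks, closed-ended question answering and skin lesion segmentation, and is used to evaluate several state-of-the-art multimodal models and prompting strategies.

Result: For segmentation, prompt design significantly affects performance: the default prompt performs best under the Mean-of-Max and Mean-of-Mean evaluation schemes, while an augmented prompt that combines the patient query title and content achieves the highest performance under majority-vote-based microscore evaluation, with a Jaccard index of 0.395 and a Dice score of 0.566 using BiomedParse. For closed-ended question answering, o3 achieves the best overall accuracy (0.798), closely followed by GPT-4.1 (0.796), and Gemini-1.5-Pro is competitive within the Gemini family (0.783).

Conclusion: DermaVQA-DAS and the DAS framework provide a standardized evaluation basis for patient-centered dermatological vision-language modeling, reveal the strong influence of prompt design on segmentation performance, and show that current multimodal models perform robustly on dermatological question answering. The publicly released dataset and evaluation protocols should accelerate future research in this area.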


📄 Abstract

Recent advances in dermatological image analysis have been driven by large-scale annotated datasets; however, most existing benchmarks focus on dermatoscopic images and lack patient-authored queries and clinical context, limiting their applicability to patient-centered care. To address this gap, we introduce DermaVQA-DAS, an extension of the DermaVQA dataset that supports two complementary tasks: closed-ended question answering (QA) and dermatological lesion segmentation. Central to this work is the Dermatology Assessment Schema (DAS), a novel expert-developed framework that systematically captures clinically meaningful dermatological features in a structured and standardized form. DAS comprises 36 high-level and 27 fine-grained assessment questions, with multiple-choice options in English and Chinese. Leveraging DAS, we provide expert-annotated datasets for both closed QA and segmentation and benchmark state-of-the-art multimodal models. For segmentation, we evaluate multiple prompting strategies and show that prompt design impacts performance: the default prompt achieves the best results under Mean-of-Max and Mean-of-Mean evaluation aggregation schemes, while an augmented prompt incorporating both patient query title and content yields the highest performance under majority-vote-based microscore evaluation, achieving a Jaccard index of 0.395 and a Dice score of 0.566 with BiomedParse. For closed-ended QA, overall performance is strong across models, with average accuracies ranging from 0.729 to 0.798; o3 achieves the best overall accuracy (0.798), closely followed by GPT-4.1 (0.796), while Gemini-1.5-Pro shows competitive performance within the Gemini family (0.783). We publicly release DermaVQA-DAS, the DAS schema, and evaluation protocols to support and accelerate future research in patient-centered dermatological vision-language modeling (https://osf.io/72rp3).
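The Jaccard index and Dice score reported for segmentation are tied by the identity Dice = 2J / (1 + J) whenever both are computed from the same intersection and union counts. The minimal sketch below computes both on a toy mask pair; the function name is an assumption, and it does not reproduce the paper's aggregation schemes.

```python
import numpy as np

def jaccard_and_dice(pred, target):
    """Return (Jaccard, Dice) for two binary masks with a non-empty union."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    jaccard = inter / union
    dice = 2 * inter / (pred.sum() + target.sum())
    return float(jaccard), float(dice)

j, d = jaccard_and_dice([1, 1, 1, 0], [0, 1, 1, 1])
print(j)  # 0.5
```

Plugging the reported Jaccard index of 0.395 into the identity gives 2 × 0.395 / 1.395 ≈ 0.566, which agrees with the reported Dice score and is what one would expect if both figures come from the same pooled counts.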

[13] Taming Preference Mode Collapse via Directional Decoupling Alignment in Diffusion Reinforcement Learning

Chubin Chen, Sujie Hu, Jiashu Zhu, Meiqi Wu, Jintao Chen, Yanxun Li, Nisha Huang, Chengyu Fang, Jiahong Wu, Xiangxiang Chu, Xiu Li

🧩 TL;DR

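This paper proposes Directional Decoupling Alignment (D²-Align), a new framework that mitigates the Preference Mode Collapse that arises when text-to-image diffusion models are aligned with human preference. The method preserves generative diversity by directionally correcting the reward signal.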


📘 Detailed Summary

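Motivation: Existing methods that align text-to-image diffusion models through Reinforcement Learning from Human Feedback achieve high scores on automated reward metrics, but they often lead to Preference Mode Collapse, a specific form of reward hacking in which the model converges on narrow, high-scoring output modes and generative diversity is severely degraded. The paper sets out to solve this problem.

Method: The paper first introduces and quantifies Preference Mode Collapse and designs the DivGenBench benchmark to measure its extent. Building on this analysis, it proposes the Directional Decoupling Alignment framework, which learns a directional correction in the reward model's embedding space while keeping the model frozen, then applies this correction to the reward signal during optimization to prevent the model from collapsing into specific modes.

Result: A comprehensive evaluation combining qualitative analysis with quantitative metrics for quality and diversity shows that D²-Align achieves better alignment with human preference while maintaining generative diversity, effectively mitigates Preference Mode Collapse, and performs strongly on DivGenBench.

Conclusion: The study indicates that Preference Mode Collapse is driven by over-optimization along the reward model's inherent biases, and that directionally correcting the reward signal effectively addresses it. This gives a diversity-preserving approach to aligning text-to-image diffusion models and has implications for preference alignment research on generative models.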


📄 Abstract

Recent studies have demonstrated significant progress in aligning text-to-image diffusion models with human preference via Reinforcement Learning from Human Feedback. However, while existing methods achieve high scores on automated reward metrics, they often lead to Preference Mode Collapse (PMC), a specific form of reward hacking where models converge on narrow, high-scoring outputs (e.g., images with monolithic styles or pervasive overexposure), severely degrading generative diversity. In this work, we introduce and quantify this phenomenon, proposing DivGenBench, a novel benchmark designed to measure the extent of PMC. We posit that this collapse is driven by over-optimization along the reward model's inherent biases. Building on this analysis, we propose Directional Decoupling Alignment (D²-Align), a novel framework that mitigates PMC by directionally correcting the reward signal. Specifically, our method first learns a directional correction within the reward model's embedding space while keeping the model frozen. This correction is then applied to the reward signal during the optimization process, preventing the model from collapsing into specific modes and thereby maintaining diversity. Our comprehensive evaluation, combining qualitative analysis with quantitative metrics for both quality and diversity, reveals that D²-Align achieves superior alignment with human preference.

[14] CPJ: Explainable Agricultural Pest Diagnosis via Caption-Prompt-Judge with LLM-Judged Refinement

Wentao Zhang, Tao Fang, Lina Lu, Lifei Wang, Weihe Zhong

🧩 TL;DR

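This paper proposes the Caption-Prompt-Judge (CPJ) framework, a training-free few-shot method that improves agricultural pest and disease visual question answering through structured, interpretable image captions, markedly raising the accuracy and interpretability of cross-domain diagnosis.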


📘 Detailed Summary

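Motivation: Existing agricultural pest and disease diagnosis methods usually rely on costly supervised fine-tuning, perform poorly under domain shift, and lack interpretability, which limits their use in real agricultural decision-making.

Method: CPJ is a training-free few-shot approach. It first uses large vision-language models to generate multi-angle image captions, then refines the captions iteratively with an LLM-as-Judge module, and finally performs dual-answer visual question answering based on the refined captions, producing both a recognition answer and a management recommendation.

Result: On the CDDMBench benchmark, CPJ significantly improves performance: with captions generated by GPT-5-mini, GPT-5-Nano gains 22.7 percentage points in disease classification and 19.5 points in QA score over the no-caption baseline, while providing transparent, evidence-based reasoning.

Conclusion: CPJ shows that robust, interpretable agricultural diagnosis is feasible without fine-tuning. Structured captions and iterative refinement improve cross-domain adaptability and give agricultural decision-support systems a transparent, evidence-driven solution.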


📄 Abstract

Accurate and interpretable crop disease diagnosis is essential for agricultural decision-making, yet existing methods often rely on costly supervised fine-tuning and perform poorly under domain shifts. We propose Caption-Prompt-Judge (CPJ), a training-free few-shot framework that enhances Agri-Pest VQA through structured, interpretable image captions. CPJ employs large vision-language models to generate multi-angle captions, refined iteratively via an LLM-as-Judge module, which then inform a dual-answer VQA process for both recognition and management responses. Evaluated on CDDMBench, CPJ significantly improves performance: using GPT-5-mini captions, GPT-5-Nano achieves +22.7 pp in disease classification and +19.5 points in QA score over no-caption baselines. The framework provides transparent, evidence-based reasoning, advancing robust and explainable agricultural diagnosis without fine-tuning. Our code and data are publicly available at: https://github.com/CPJ-Agricultural/CPJ-Agricultural-Diagnosis.
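The caption-then-judge refinement can be sketched as a simple loop. This is only a schematic under assumptions: `captioner` and `judge` are hypothetical callables standing in for the vision-language model and the LLM-as-Judge, and the score threshold and round limit are invented for illustration. The stubs let the loop run without any model.

```python
from typing import Any, Callable, Tuple

def refine_caption(
    image: Any,
    captioner: Callable[[Any, str], str],
    judge: Callable[[str], Tuple[float, str]],
    threshold: float = 0.8,
    max_rounds: int = 3,
) -> str:
    """Regenerate a caption with judge feedback until it scores high enough."""
    caption = captioner(image, "")
    for _ in range(max_rounds):
        score, feedback = judge(caption)
        if score >= threshold:
            break
        caption = captioner(image, feedback)  # retry, guided by the critique
    return caption

# Stub models (assumptions) so the control flow can be exercised offline.
def stub_captioner(image, feedback):
    return "leaf with brown spots" if feedback else "leaf"

def stub_judge(caption):
    if "spots" in caption:
        return 1.0, ""
    return 0.2, "describe visible lesions"

final_caption = refine_caption(None, stub_captioner, stub_judge)
print(final_caption)  # leaf with brown spots
```

The refined caption would then be placed in the prompt of the downstream VQA model, which is where the reported gains over the no-caption baseline come from.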

[15] Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation

Zhe Huang, Hao Wen, Aiming Hao, Bingze Song, Meiqi Wu, Jiahong Wu, Xiangxiang Chu, Sheng Lu, Haoqian Wang

🧩 TL;DR

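This paper proposes the DualityForge framework, which synthesizes counterfactual video data through controllable diffusion-based video editing, and the DNA-Train training method. Together they markedly reduce hallucinations of multimodal large language models in video understanding.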


📘 Detailed Summary

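Motivation: Multimodal large language models rely heavily on language priors in video understanding, which leads to visually ungrounded hallucinations when they process counterfactual videos that defy common sense. Collecting and annotating counterfactual data is costly, so existing methods struggle to address this data imbalance.

Method: The paper proposes DualityForge, a counterfactual data synthesis framework that uses controllable diffusion-based video editing to transform real videos into counterfactual scenarios and embeds structured contextual information to generate high-quality QA pairs automatically. It builds DualityVidQA, a large-scale video dataset, and proposes Duality-Normalized Advantage Training, a two-stage training regime whose reinforcement learning phase applies pair-wise ℓ1 advantage normalization for more stable policy optimization.

Result: On DualityVidQA-Test, the method substantially reduces hallucinations on counterfactual videos, a relative improvement of 24.0% over the Qwen2.5-VL-7B baseline. It also achieves significant gains on both hallucination and general-purpose benchmarks, indicating strong generalization.

Conclusion: By combining synthetic counterfactual data with a contrastive training mechanism, the study effectively mitigates MLLM hallucinations and offers a systematic solution for visual grounding in video understanding. Its data synthesis framework and training method have broad application potential.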


📄 Abstract

Multimodal Large Language Models (MLLMs) have made remarkable progress in video understanding. However, they suffer from a critical vulnerability: an over-reliance on language priors, which can lead to visually ungrounded hallucinations, especially when processing counterfactual videos that defy common sense. This limitation, stemming from the intrinsic data imbalance between text and video, is challenging to address due to the substantial cost of collecting and annotating counterfactual data. To address this, we introduce DualityForge, a novel counterfactual data synthesis framework that employs controllable, diffusion-based video editing to transform real-world videos into counterfactual scenarios. By embedding structured contextual information into the video editing and QA generation processes, the framework automatically produces high-quality QA pairs together with original-edited video pairs for contrastive training. Based on this, we build DualityVidQA, a large-scale video dataset designed to reduce MLLM hallucinations. In addition, to fully exploit the contrastive nature of our paired data, we propose Duality-Normalized Advantage Training (DNA-Train), a two-stage SFT-RL training regime where the RL phase applies pair-wise $\ell_1$ advantage normalization, thereby enabling a more stable and efficient policy optimization. Experiments on DualityVidQA-Test demonstrate that our method substantially reduces model hallucinations on counterfactual videos, yielding a relative improvement of 24.0% over the Qwen2.5-VL-7B baseline. Moreover, our approach achieves significant gains across both hallucination and general-purpose benchmarks, indicating strong generalization capability. We will open-source our dataset and code.
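
The abstract does not spell out how the pair-wise $\ell_1$ advantage normalization is computed, so the following is only one plausible reading, offered as a sketch. It assumes that each side of an original/edited video pair has its own group of sampled responses, that advantages are rewards centered on the group mean, and that each side is rescaled to unit $\ell_1$ norm so both sides contribute equally to the policy update. The function names and the example rewards are hypothetical.

```python
from typing import List, Tuple


def centered(rewards: List[float]) -> List[float]:
    """Group-relative advantages: reward minus the group mean."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def l1_normalize(advantages: List[float], eps: float = 1e-8) -> List[float]:
    """Rescale so that the absolute values sum to (approximately) one."""
    norm = sum(abs(a) for a in advantages)
    return [a / (norm + eps) for a in advantages]


def pairwise_l1_advantages(
    real_rewards: List[float], edited_rewards: List[float]
) -> Tuple[List[float], List[float]]:
    """Assumed scheme: normalize each side of a real/edited pair separately."""
    return (
        l1_normalize(centered(real_rewards)),
        l1_normalize(centered(edited_rewards)),
    )


# Made-up rewards for four sampled answers on each video of one pair.
real_adv, edited_adv = pairwise_l1_advantages(
    [1.0, 0.0, 1.0, 1.0], [0.0, 0.0, 1.0, 0.0]
)
```

Under this assumption, an easy original video and a hard edited video end up with the same total advantage mass, which is one way such a normalization could stabilize training on paired data.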

[16] Towards Open-Vocabulary Industrial Defect Understanding with a Large-Scale Multimodal Dataset

TsaiChing Ni, ZhenQi Chen, YuanFu Yang

🧩 TL;DR

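This paper presents IMDD-1M, the first large-scale industrial multimodal defect dataset with 1,000,000 aligned image-text pairs, and uses it to train from scratch a diffusion-based vision-language foundation model tailored to industrial scenarios, which adapts efficiently to specialized domains through lightweight fine-tuning.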


📘 Detailed Summary

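Motivation: Industrial defect inspection lacks large-scale, high-quality multimodal datasets, which limits the use of multimodal learning in manufacturing and quality inspection, especially for industrial scenarios that require fine-grained understanding and generation.

Method: The authors first build IMDD-1M, which contains 1,000,000 aligned image-text pairs covering more than 60 material categories and more than 400 defect types; each sample comes with expert-verified annotations and a fine-grained textual description. On this dataset they train a diffusion-based vision-language foundation model from scratch, tailored to industrial scenarios.

Result: With less than 5% of the task-specific data required by dedicated expert models, the foundation model reaches comparable performance through lightweight fine-tuning. It supports classification, segmentation, retrieval, captioning, and generation, demonstrating data-efficient foundation model adaptation.

Conclusion: The work offers a scalable, domain-adaptive, and knowledge-grounded approach to industrial inspection and generation, and shows the potential of foundation models in industrial settings. Its data-efficient adaptation strategy in particular lowers the barrier to applying such models in specialized domains.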


📄 Abstract

We present IMDD-1M, the first large-scale Industrial Multimodal Defect Dataset comprising 1,000,000 aligned image-text pairs, designed to advance multimodal learning for manufacturing and quality inspection. IMDD-1M contains high-resolution real-world defects spanning over 60 material categories and more than 400 defect types, each accompanied by expert-verified annotations and fine-grained textual descriptions detailing defect location, severity, and contextual attributes. This dataset enables a wide spectrum of applications, including classification, segmentation, retrieval, captioning, and generative modeling. Building upon IMDD-1M, we train a diffusion-based vision-language foundation model from scratch, specifically tailored for industrial scenarios. The model serves as a generalizable foundation that can be efficiently adapted to specialized domains through lightweight fine-tuning. With less than 5% of the task-specific data required by dedicated expert models, it achieves comparable performance, highlighting the potential of data-efficient foundation model adaptation for industrial inspection and generation, paving the way for scalable, domain-adaptive, and knowledge-grounded manufacturing intelligence.

[17] DiffThinker: Towards Generative Multimodal Reasoning with Diffusion Models

Zefeng He, Xiaoye Qu, Yafu Li, Tong Zhu, Siyuan Huang, Yu Cheng

🧩 TL;DR

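This paper proposes DiffThinker, a diffusion-based generative multimodal reasoning framework that reformulates multimodal reasoning as a native image-to-image generation task and achieves superior logical consistency and spatial precision on vision-centric tasks.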


📘 Detailed Summary

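Motivation: The reasoning of current multimodal large language models remains text-centric, which leads to suboptimal performance on complex, long-horizon, vision-centric tasks and calls for a more effective multimodal reasoning paradigm.

Method: The paper establishes a new generative multimodal reasoning paradigm and introduces DiffThinker, a diffusion-based reasoning framework that reformulates multimodal reasoning as a native image-to-image generation task. It also systematically compares the intrinsic characteristics of this paradigm with those of MLLMs.

Result: Extensive experiments across four domains (sequential planning, combinatorial optimization, constraint satisfaction, and spatial configuration) show that DiffThinker significantly outperforms leading closed-source models, including GPT-5 (+314.2%) and Gemini-3-Flash (+111.6%), as well as the fine-tuned Qwen3-VL-32B baseline (+39.0%).

Conclusion: The study identifies four core properties of the generative multimodal reasoning paradigm: efficiency, controllability, native parallelism, and collaboration. It indicates that generative multimodal reasoning is a promising approach to vision-centric reasoning.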


📄 Abstract

While recent Multimodal Large Language Models (MLLMs) have made significant strides in multimodal reasoning, their reasoning processes remain predominantly text-centric, leading to suboptimal performance in complex long-horizon, vision-centric tasks. In this paper, we establish a novel Generative Multimodal Reasoning paradigm and introduce DiffThinker, a diffusion-based reasoning framework. Conceptually, DiffThinker reformulates multimodal reasoning as a native generative image-to-image task, achieving superior logical consistency and spatial precision in vision-centric tasks. We perform a systematic comparison between DiffThinker and MLLMs, providing the first in-depth investigation into the intrinsic characteristics of this paradigm, revealing four core properties: efficiency, controllability, native parallelism, and collaboration. Extensive experiments across four domains (sequential planning, combinatorial optimization, constraint satisfaction, and spatial configuration) demonstrate that DiffThinker significantly outperforms leading closed-source models including GPT-5 (+314.2%) and Gemini-3-Flash (+111.6%), as well as the fine-tuned Qwen3-VL-32B baseline (+39.0%), highlighting generative multimodal reasoning as a promising approach for vision-centric reasoning.

[18] CorGi: Contribution-Guided Block-Wise Interval Caching for Training-Free Acceleration of Diffusion Transformers

Yonglak Son, Suhyeok Kim, Seungryong Kim, Young Geun Kim

🧩 TL;DR

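This paper proposes CorGi and CorGi+, training-free frameworks for accelerating diffusion transformer inference. They reduce redundant computation across denoising steps by selectively reusing transformer block outputs, achieving up to 2.0x speedup while preserving generation quality.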


📘 Detailed Summary

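Motivation: Diffusion transformers perform well in visual generation, but their iterative denoising combined with large model capacity makes inference expensive. Recent work shows that DiT denoising involves substantial redundant computation across steps, which must be reduced to improve inference efficiency.

Method: CorGi is a training-free DiT inference acceleration method that uses contribution-guided block-wise interval caching to selectively reuse the outputs of transformer blocks in DiT. For text-to-image tasks, CorGi+ additionally uses per-block cross-attention maps to identify salient tokens and applies partial attention updates to protect important object details.

Result: Evaluation on state-of-the-art DiT models shows that CorGi and CorGi+ achieve up to 2.0x speedup on average while maintaining high generation quality, demonstrating that the framework is effective at reducing redundant computation.

Conclusion: The work shows that selectively caching and reusing low-contribution transformer blocks can significantly accelerate DiT inference without harming generation quality. For text-to-image generation, the attention-based protection strategy further improves detail preservation.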


📄 Abstract

Diffusion transformer (DiT) achieves remarkable performance in visual generation, but its iterative denoising process combined with larger capacity leads to a high inference cost. Recent works have demonstrated that the iterative denoising process of DiT models involves substantial redundant computation across steps. To effectively reduce the redundant computation in DiT, we propose CorGi (Contribution-Guided Block-Wise Interval Caching), a training-free DiT inference acceleration framework that selectively reuses the outputs of transformer blocks in DiT across denoising steps. CorGi caches low-contribution blocks and reuses them in later steps within each interval to reduce redundant computation while preserving generation quality. For text-to-image tasks, we further propose CorGi+, which leverages per-block cross-attention maps to identify salient tokens and applies partial attention updates to protect important object details. Evaluation on state-of-the-art DiT models demonstrates that CorGi and CorGi+ achieve up to 2.0x speedup on average, while preserving high generation quality.
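
The interval-caching idea can be illustrated with a toy loop over scalar "hidden states". This is a sketch under stated assumptions, not the CorGi algorithm itself: blocks are modeled as residual functions, a block's contribution is approximated by the relative size of its residual at the first step of each interval, and the names `run_with_interval_caching`, `interval`, and `num_cached` are hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

# A toy "transformer block": (hidden_state, step) -> residual update.
Block = Callable[[float, int], float]


def run_with_interval_caching(
    blocks: List[Block], x0: float, num_steps: int, interval: int, num_cached: int
) -> Tuple[float, int]:
    """Return the final state and the number of block evaluations performed."""
    cache: Dict[int, float] = {}
    cached_ids: Set[int] = set()
    x, computed = x0, 0
    for step in range(num_steps):
        refresh = step % interval == 0   # first step of an interval
        contributions = []
        h = x
        for i, block in enumerate(blocks):
            if not refresh and i in cached_ids:
                delta = cache[i]         # reuse the residual cached earlier
            else:
                delta = block(h, step)
                computed += 1
                cache[i] = delta
            if refresh:
                # Assumed contribution proxy: relative residual magnitude.
                contributions.append((abs(delta) / (abs(h) + 1e-8), i))
            h = h + delta
        if refresh:
            # Cache the blocks that changed the hidden state the least.
            cached_ids = {i for _, i in sorted(contributions)[:num_cached]}
        x = h
    return x, computed


def make_block(scale: float) -> Block:
    return lambda h, step: scale * h


toy_blocks = [make_block(s) for s in (0.5, 0.01, 0.3, 0.02)]
final_state, evaluations = run_with_interval_caching(
    toy_blocks, x0=1.0, num_steps=6, interval=3, num_cached=2
)
```

In this toy setting the loop performs 16 block evaluations instead of the 24 a cache-free run would need, because the two low-contribution blocks are skipped on four of the six steps.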

[19] ARM: A Learnable, Plug-and-Play Module for CLIP-based Open-vocabulary Semantic Segmentation

Ziquan Liu, Zhewei Zhu, Xuyang Shi

🧩 TL;DR

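This paper proposes the Attention Refinement Module (ARM), a lightweight learnable module that improves open-vocabulary semantic segmentation by adaptively fusing CLIP's hierarchical features. It follows a "train once, use anywhere" paradigm and consistently improves a range of training-free baselines.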


📘 Detailed Summary

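Motivation: Open-vocabulary semantic segmentation (OVSS) is fundamentally limited by CLIP's image-level representations, which lack pixel-level detail. Existing training-free methods either depend on costly external foundation models (e.g., SAM, DINO) or apply static heuristics to CLIP's internal features, so they are either computationally expensive or suboptimal.

Method: ARM consists of a semantically guided cross-attention block followed by a self-attention block. It uses robust deep features (K, V) to select and refine detail-rich shallow features (Q), achieving adaptive hierarchical feature fusion. Under the "train once, use anywhere" paradigm, ARM is trained once on a general-purpose dataset (e.g., COCO-Stuff) and then serves as a universal plug-and-play post-processor.

Result: Extensive experiments show that ARM consistently boosts baseline performance on multiple benchmarks with negligible inference overhead, establishing an efficient and effective paradigm for training-free OVSS and clearly outperforming static-fusion methods.

Conclusion: By unlocking and refining CLIP's internal potential, ARM addresses the trade-off between computational cost and effectiveness in existing training-free methods. Its plug-and-play design offers an efficient solution for open-vocabulary semantic segmentation and demonstrates the advantage of adaptive feature fusion over static methods.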


📄 Abstract

Open-vocabulary semantic segmentation (OVSS) is fundamentally hampered by the coarse, image-level representations of CLIP, which lack precise pixel-level details. Existing training-free methods attempt to resolve this by either importing priors from costly external foundation models (e.g., SAM, DINO) or by applying static, hand-crafted heuristics to CLIP's internal features. These approaches are either computationally expensive or sub-optimal. We propose the Attention Refinement Module (ARM), a lightweight, learnable module that effectively unlocks and refines CLIP's internal potential. Unlike static-fusion methods, ARM learns to adaptively fuse hierarchical features. It employs a semantically-guided cross-attention block, using robust deep features (K, V) to select and refine detail-rich shallow features (Q), followed by a self-attention block. The key innovation lies in a "train once, use anywhere" paradigm. Trained once on a general-purpose dataset (e.g., COCO-Stuff), ARM acts as a universal plug-and-play post-processor for diverse training-free frameworks. Extensive experiments show that ARM consistently boosts baseline performance on multiple benchmarks with negligible inference overhead, establishing an efficient and effective paradigm for training-free OVSS.
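
The cross-attention step, with shallow features as queries and deep features as keys and values, can be sketched in plain Python. This is an illustration only: the learned projection matrices, multiple heads, and the subsequent self-attention block are omitted, the residual connection is an assumption, and `refine_shallow_with_deep` is a hypothetical name.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    m = max(row)                           # subtract the max for stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Single-head scaled dot-product attention without learned projections."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def refine_shallow_with_deep(shallow: Matrix, deep: Matrix) -> Matrix:
    """Shallow tokens query the deep tokens; a residual keeps shallow detail."""
    attended = cross_attention(shallow, deep, deep)
    return [[s + a for s, a in zip(s_row, a_row)]
            for s_row, a_row in zip(shallow, attended)]


shallow_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy detail-rich features
deep_tokens = [[0.5, 0.5], [1.0, -1.0]]                # toy semantic features
refined = refine_shallow_with_deep(shallow_tokens, deep_tokens)
```

Each refined token keeps its own shallow value and adds a convex combination of deep tokens, so the output has one row per shallow token regardless of how many deep tokens exist.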

[20] Evolving, Not Training: Zero-Shot Reasoning Segmentation via Evolutionary Prompting

Kai Ye, Xiaotong You, Jianghang Lin, Jiayi Ji, Pingyang Dai, Liujuan Cao

🧩 TL;DR

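This paper proposes EVOL-SAM3, a novel zero-shot reasoning segmentation framework that recasts the task as an inference-time evolutionary search. It iteratively refines prompt hypotheses through a "Generate-Evaluate-Evolve" loop and substantially outperforms existing methods without any training.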


📘 Detailed Summary

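Motivation: Current reasoning segmentation methods have clear limitations. Supervised fine-tuning suffers from catastrophic forgetting and domain dependency; reinforcement learning faces training instability and a rigid reliance on predefined reward functions. Training-free methods avoid the training burden, but their static inference paradigm, typically a single-pass "generate-then-segment" chain, has insufficient reasoning depth and cannot self-correct linguistic hallucinations or spatial misinterpretations.

Method: EVOL-SAM3 reformulates reasoning segmentation as an inference-time evolutionary search and iteratively refines prompt hypotheses through a "Generate-Evaluate-Evolve" loop. The framework maintains a population of prompt hypotheses, assesses prompt fitness in a Visual Arena through reference-free pairwise tournaments, and introduces a Semantic Mutation operator to inject diversity and correct semantic errors. A Heterogeneous Arena module integrates geometric priors with semantic reasoning to make the final selection robust.

Result: On the challenging ReasonSeg benchmark, EVOL-SAM3 not only substantially outperforms static baselines but also, in a zero-shot setting, significantly surpasses fully supervised state-of-the-art methods. Extensive experiments confirm the framework's effectiveness.

Conclusion: The study shows the potential of bringing an evolutionary search paradigm to reasoning segmentation, providing a training-free route to high performance. The evolutionary mechanism can self-correct linguistic hallucinations and spatial misinterpretations, and may inspire work on other vision tasks that require deep reasoning and multi-round interaction.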


📄 Abstract

Reasoning Segmentation requires models to interpret complex, context-dependent linguistic queries to achieve pixel-level localization. Current dominant approaches rely heavily on Supervised Fine-Tuning (SFT) or Reinforcement Learning (RL). However, SFT suffers from catastrophic forgetting and domain dependency, while RL is often hindered by training instability and rigid reliance on predefined reward functions. Although recent training-free methods circumvent these training burdens, they are fundamentally limited by a static inference paradigm. These methods typically rely on a single-pass "generate-then-segment" chain, which suffers from insufficient reasoning depth and lacks the capability to self-correct linguistic hallucinations or spatial misinterpretations. In this paper, we challenge these limitations and propose EVOL-SAM3, a novel zero-shot framework that reformulates reasoning segmentation as an inference-time evolutionary search process. Instead of relying on a fixed prompt, EVOL-SAM3 maintains a population of prompt hypotheses and iteratively refines them through a "Generate-Evaluate-Evolve" loop. We introduce a Visual Arena to assess prompt fitness via reference-free pairwise tournaments, and a Semantic Mutation operator to inject diversity and correct semantic errors. Furthermore, a Heterogeneous Arena module integrates geometric priors with semantic reasoning to ensure robust final selection. Extensive experiments demonstrate that EVOL-SAM3 not only substantially outperforms static baselines but also significantly surpasses fully supervised state-of-the-art methods on the challenging ReasonSeg benchmark in a zero-shot setting. The code is available at https://github.com/AHideoKuzeA/Evol-SAM3.
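
A "Generate-Evaluate-Evolve" loop with pairwise tournaments can be sketched as follows. This is a generic evolutionary search skeleton under stated assumptions, not the released EVOL-SAM3 code: `compare` stands in for a reference-free pairwise judge, `mutate` stands in for a semantic mutation operator, and the survivor ratio and generation count are hypothetical.

```python
import random
from typing import Callable, List


def tournament_rank(population: List[str],
                    compare: Callable[[str, str], str]) -> List[str]:
    """Round-robin pairwise tournament; more wins means a higher rank."""
    wins = [0] * len(population)
    for i in range(len(population)):
        for j in range(i + 1, len(population)):
            winner = compare(population[i], population[j])
            wins[i if winner == population[i] else j] += 1
    order = sorted(range(len(population)), key=lambda k: wins[k], reverse=True)
    return [population[k] for k in order]


def evolve_prompts(seeds: List[str],
                   compare: Callable[[str, str], str],
                   mutate: Callable[[str], str],
                   generations: int = 3,
                   seed: int = 0) -> str:
    rng = random.Random(seed)
    population = list(seeds)
    for _ in range(generations):
        ranked = tournament_rank(population, compare)        # Evaluate
        survivors = ranked[: max(1, len(ranked) // 2)]       # keep the top half
        children = [mutate(rng.choice(survivors))            # Evolve
                    for _ in range(len(population) - len(survivors))]
        population = survivors + children                    # next generation
    return tournament_rank(population, compare)[0]


# Toy judge and mutation: prefer longer prompts, and append a detail phrase.
def toy_compare(a: str, b: str) -> str:
    return a if len(a) >= len(b) else b


def toy_mutate(prompt: str) -> str:
    return prompt + " (more specific)"


seed_prompts = ["the cup", "the red cup", "cup on table", "cup"]
best_prompt = evolve_prompts(seed_prompts, toy_compare, toy_mutate)
```

Pairwise comparison needs no ground-truth mask, which is why a tournament is a natural fit when fitness must be judged without references.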

[21] MambaSeg: Harnessing Mamba for Accurate and Efficient Image-Event Semantic Segmentation

Fuqiang Gu, Yuanke Li, Xianlei Long, Kangping Ji, Chao Chen, Qingyi Gu, Zhenliang Ni

🧩 TL;DR

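This paper proposes MambaSeg, a novel dual-branch semantic segmentation framework that uses parallel Mamba encoders to model RGB images and event streams efficiently and a Dual-Dimensional Interaction Module for fine-grained fusion along spatial and temporal dimensions, reaching state-of-the-art segmentation performance at significantly lower computational cost.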


📘 Detailed Summary

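Motivation: RGB cameras degrade under fast motion, low light, or high dynamic range, while event cameras offer high temporal resolution and low latency but lack color and texture. Existing RGB-event fusion methods are computationally expensive and focus mainly on spatial fusion, neglecting the temporal dynamics inherent in event streams, so a more efficient and complete cross-modal fusion scheme is needed.

Method: MambaSeg is a dual-branch semantic segmentation framework in which parallel Mamba encoders process RGB images and event streams separately. It introduces the Dual-Dimensional Interaction Module (DDIM), comprising a Cross-Spatial Interaction Module (CSIM) and a Cross-Temporal Interaction Module (CTIM), which jointly perform fine-grained fusion along the spatial and temporal dimensions, improving cross-modal alignment and reducing ambiguity.

Result: Extensive experiments on the DDD17 and DSEC datasets show that MambaSeg achieves state-of-the-art semantic segmentation performance while significantly reducing computational cost, showing its promise for efficient, scalable, and robust multimodal perception.

Conclusion: The work demonstrates that the Mamba architecture is effective for processing multimodal visual data efficiently, and that the dual-dimensional interaction mechanism exploits the complementary properties of RGB and event data. It is a promising approach to real-time, robust semantic segmentation, particularly in dynamic and challenging environments.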


📄 Abstract

Semantic segmentation is a fundamental task in computer vision with wide-ranging applications, including autonomous driving and robotics. While RGB-based methods have achieved strong performance with CNNs and Transformers, their effectiveness degrades under fast motion, low-light, or high dynamic range conditions due to limitations of frame cameras. Event cameras offer complementary advantages such as high temporal resolution and low latency, yet lack color and texture, making them insufficient on their own. To address this, recent research has explored multimodal fusion of RGB and event data; however, many existing approaches are computationally expensive and focus primarily on spatial fusion, neglecting the temporal dynamics inherent in event streams. In this work, we propose MambaSeg, a novel dual-branch semantic segmentation framework that employs parallel Mamba encoders to efficiently model RGB images and event streams. To reduce cross-modal ambiguity, we introduce the Dual-Dimensional Interaction Module (DDIM), comprising a Cross-Spatial Interaction Module (CSIM) and a Cross-Temporal Interaction Module (CTIM), which jointly perform fine-grained fusion along both spatial and temporal dimensions. This design improves cross-modal alignment, reduces ambiguity, and leverages the complementary properties of each modality. Extensive experiments on the DDD17 and DSEC datasets demonstrate that MambaSeg achieves state-of-the-art segmentation performance while significantly reducing computational cost, showcasing its promise for efficient, scalable, and robust multimodal perception.

[22] Video and Language Alignment in 2D Systems for 3D Multi-object Scenes with Multi-Information Derivative-Free Control

Jason Armitage, Rico Sennrich

🧩 TL;DR

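This paper proposes a new method that improves multivariate mutual information estimates through regret minimisation with derivative-free optimisation, enabling off-the-shelf cross-modal systems to adapt to object occlusions and differentiate features in 3D scenes without pretraining or finetuning.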


📘 Detailed Summary

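Motivation: Cross-modal systems trained on 2D visual inputs face a dimensional shift when processing 3D scenes. An in-scene camera bridges this gap but requires learning a control module, and existing methods are limited in handling object occlusions and differentiating features.

Method: The method improves multivariate mutual information estimates by regret minimisation with derivative-free optimisation. Pairing expressive measures with value-based optimisation helps the in-scene camera controller learn directly from the noisy outputs of vision-language models, so that off-the-shelf 2D cross-modal systems can adapt online to 3D scenes.

Result: The resulting pipeline improves performance on cross-modal tasks in multi-object 3D scenes without relying on pretraining or finetuning, handling object occlusions and improving feature differentiation.

Conclusion: The work shows that better mutual information estimation and optimisation strategies allow off-the-shelf 2D cross-modal systems to adapt to 3D scenes, avoiding costly pretraining or finetuning, which makes the approach practical to deploy.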


📄 Abstract

Cross-modal systems trained on 2D visual inputs are presented with a dimensional shift when processing 3D scenes. An in-scene camera bridges the dimensionality gap but requires learning a control module. We introduce a new method that improves multivariate mutual information estimates by regret minimisation with derivative-free optimisation. Our algorithm enables off-the-shelf cross-modal systems trained on 2D visual inputs to adapt online to object occlusions and differentiate features. The pairing of expressive measures and value-based optimisation assists control of an in-scene camera to learn directly from the noisy outputs of vision-language models. The resulting pipeline improves performance in cross-modal tasks on multi-object 3D scenes without resorting to pretraining or finetuning.

[23] DarkEQA: Benchmarking Vision-Language Models for Embodied Question Answering in Low-Light Indoor Environments

Yohan Park, Hyunwoo Ha, Wonjun Jo, Tae-Hyun Oh

🧩 TL;DR

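This paper presents DarkEQA, a benchmark for evaluating the perceptual capabilities of vision-language models under low-light conditions. It exposes the limitations of existing models in dark environments and establishes a physically realistic framework for simulating visual degradation.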


📘 Detailed Summary

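Motivation: Existing benchmarks evaluate vision-language models mainly under ideal lighting and overlook the 24/7 operation that real applications require, in particular low-light conditions at night or in dark environments, a core requirement that remains largely unexplored.

Method: The authors develop DarkEQA, an open-source benchmark that evaluates question answering from egocentric observations under controlled visual degradations. Its key design feature is physical fidelity: physics-based illumination drop and sensor noise are simulated in linear RAW space, followed by an ISP-inspired rendering pipeline.

Result: Evaluating a wide range of state-of-the-art vision-language models and low-light image enhancement models systematically reveals their limitations under these challenging visual conditions. DarkEQA isolates the perception bottleneck and enables attributable robustness analysis.

Conclusion: The study underlines the importance of evaluating vision-language models under low-light conditions. DarkEQA provides a standardized testbed for future research, exposes the shortcomings of existing methods in dark environments, and supports the development of more robust embodied agents.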


📄 Abstract

Vision Language Models (VLMs) are increasingly adopted as central reasoning modules for embodied agents. Existing benchmarks evaluate their capabilities under ideal, well-lit conditions, yet robust 24/7 operation demands performance under a wide range of visual degradations, including low-light conditions at night or in dark environments--a core necessity that has been largely overlooked. To address this underexplored challenge, we present DarkEQA, an open-source benchmark for evaluating EQA-relevant perceptual primitives under multi-level low-light conditions. DarkEQA isolates the perception bottleneck by evaluating question answering from egocentric observations under controlled degradations, enabling attributable robustness analysis. A key design feature of DarkEQA is its physical fidelity: visual degradations are modeled in linear RAW space, simulating physics-based illumination drop and sensor noise followed by an ISP-inspired rendering pipeline. We demonstrate the utility of DarkEQA by evaluating a wide range of state-of-the-art VLMs and Low-Light Image Enhancement (LLIE) models. Our analysis systematically reveals VLMs' limitations when operating under these challenging visual conditions. Our code and benchmark dataset will be released upon acceptance.
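
The idea of degrading an image in linear space rather than on display-encoded pixels can be sketched in a few lines. This is a rough illustration and not the DarkEQA pipeline: it assumes a plain power-law gamma of 2.2 in place of a real ISP, a Gaussian approximation to signal-dependent shot noise, and made-up values for `shot_gain` and `read_sigma`.

```python
import random
from typing import List


def simulate_low_light(
    pixels: List[float],          # display-encoded intensities in [0, 1]
    illumination: float = 0.1,    # fraction of the original light that remains
    shot_gain: float = 0.01,      # assumed shot-noise strength
    read_sigma: float = 0.002,    # assumed read-noise standard deviation
    gamma: float = 2.2,           # assumed display gamma (stand-in for an ISP)
    seed: int = 0,
) -> List[float]:
    rng = random.Random(seed)
    out = []
    for p in pixels:
        linear = p ** gamma                       # approximate inverse rendering
        dimmed = linear * illumination            # illumination drop
        shot_sigma = (shot_gain * dimmed) ** 0.5  # variance grows with signal
        noisy = dimmed + rng.gauss(0.0, shot_sigma) + rng.gauss(0.0, read_sigma)
        clipped = min(1.0, max(0.0, noisy))       # sensor range
        out.append(clipped ** (1.0 / gamma))      # re-render for display
    return out


bright_patch = [0.8] * 1000
dark_patch = simulate_low_light(bright_patch, illumination=0.1)
```

Applying the scale and noise before the gamma step is what distinguishes this from simply darkening the displayed image: noise that is small in linear space becomes much more visible once dark values are re-encoded.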

[24] UniAct: Unified Motion Generation and Action Streaming for Humanoid Robots

Nan Jiang, Zimo He, Wanhe Yu, Lexi Pang, Yunhao Li, Hongjie Li, Jieming Cui, Yuhan Li, Yizhou Wang, Yixin Zhu, Siyuan Huang

🧩 TL;DR

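This paper presents UniAct, a two-stage framework that integrates a fine-tuned multimodal large language model with a causal streaming pipeline, enabling humanoid robots to execute multimodal instructions with sub-500 ms latency and yielding a 19% improvement in the success rate of zero-shot tracking of imperfect reference motions.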


📘 Detailed Summary

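Motivation: A long-standing goal in humanoid robotics is a versatile agent that can follow diverse multimodal instructions. Existing methods struggle to translate heterogeneous instructions such as language, music, and trajectories into stable, real-time whole-body actions, and the gap between high-level perception and whole-body execution limits system flexibility.

Method: UniAct is a two-stage framework that integrates a fine-tuned multimodal large language model with a causal streaming pipeline. It unifies multimodal inputs through a shared discrete codebook via FSQ, ensuring cross-modal alignment while constraining motions to a physically grounded manifold, and converts multimodal instructions into actions in real time.

Result: The method achieves sub-500 ms latency and a 19% improvement in the success rate of zero-shot tracking of imperfect reference motions. Validation on the UniMoCap benchmark demonstrates robust generalization across diverse real-world scenarios.

Conclusion: The work marks a critical step toward responsive, general-purpose humanoid assistants that interact seamlessly through unified perception and control, and offers an effective approach to the bottleneck of converting multimodal instructions into whole-body actions.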


📄 Abstract

A long-standing objective in humanoid robotics is the realization of versatile agents capable of following diverse multimodal instructions with human-level flexibility. Despite advances in humanoid control, bridging high-level multimodal perception with whole-body execution remains a significant bottleneck. Existing methods often struggle to translate heterogeneous instructions -- such as language, music, and trajectories -- into stable, real-time actions. Here we show that UniAct, a two-stage framework integrating a fine-tuned MLLM with a causal streaming pipeline, enables humanoid robots to execute multimodal instructions with sub-500 ms latency. By unifying inputs through a shared discrete codebook via FSQ, UniAct ensures cross-modal alignment while constraining motions to a physically grounded manifold. This approach yields a 19% improvement in the success rate of zero-shot tracking of imperfect reference motions. We validate UniAct on UniMoCap, our 20-hour humanoid motion benchmark, demonstrating robust generalization across diverse real-world scenarios. Our results mark a critical step toward responsive, general-purpose humanoid assistants capable of seamless interaction through unified perception and control.
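
Finite scalar quantization (FSQ) builds a discrete codebook by bounding each latent dimension and rounding it to a small number of levels. The sketch below shows the general technique only, restricted to odd level counts for simplicity; how UniAct configures its levels or maps modalities onto the codebook is not stated in the abstract, and the straight-through gradient used during training is omitted.

```python
import math
from typing import List, Sequence


def fsq_quantize(z: Sequence[float], levels: Sequence[int]) -> List[int]:
    """Quantize each latent dimension to an integer in [-(L-1)/2, (L-1)/2]."""
    codes = []
    for value, num_levels in zip(z, levels):
        if num_levels % 2 == 0:
            raise ValueError("this sketch only handles odd level counts")
        half = (num_levels - 1) // 2
        codes.append(round(half * math.tanh(value)))  # bound, scale, round
    return codes


def fsq_index(codes: Sequence[int], levels: Sequence[int]) -> int:
    """Flatten per-dimension codes into one token id (mixed-radix encoding)."""
    index = 0
    for code, num_levels in zip(codes, levels):
        index = index * num_levels + (code + (num_levels - 1) // 2)
    return index


toy_levels = [5, 5, 5]                      # implicit codebook of 5**3 entries
toy_codes = fsq_quantize([0.0, 10.0, -10.0], toy_levels)
toy_token = fsq_index(toy_codes, toy_levels)
```

Because the codebook is implicit in the rounding grid, there are no learned code vectors to collapse, which is one commonly cited reason for choosing FSQ over a learned vector-quantization codebook.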

[25] Robust Egocentric Referring Video Object Segmentation via Dual-Modal Causal Intervention

Haijing Liu, Zhiyuan Song, Hefeng Wu, Tao Pu, Keze Wang, Liang Lin

🧩 TL;DR

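This paper presents CERES, a causal inference framework for referring video object segmentation in first-person video. It uses dual-modal causal intervention to address training data bias and the visual confounding of the egocentric viewpoint, achieving state-of-the-art performance on Ego-RVOS benchmarks.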


📘 Detailed Summary

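Motivation: Egocentric referring video object segmentation faces two main challenges: skewed object-action pairings in the training data lead models to learn spurious correlations, and the egocentric viewpoint introduces visual confounding factors such as rapid motion and frequent occlusions. Both limit the robustness of existing methods.

Method: CERES is a plug-in causal framework with a dual-modal causal intervention strategy. It applies backdoor adjustment principles to counteract language representation biases learned from dataset statistics, and uses front-door adjustment concepts to address visual confounding by integrating semantic visual features with geometric depth information under the guidance of causal principles.

Result: Extensive experiments on Ego-RVOS benchmarks show that CERES achieves state-of-the-art performance, supporting the effectiveness of the causal framework in improving robustness and segmentation accuracy.

Conclusion: The work shows the potential of causal reasoning for building more reliable egocentric video understanding models. By addressing data bias and viewpoint confounding, it suggests a methodological direction for broader egocentric vision tasks.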


📄 Abstract

Egocentric Referring Video Object Segmentation (Ego-RVOS) aims to segment the specific object actively involved in a human action, as described by a language query, within first-person videos. This task is critical for understanding egocentric human behavior. However, achieving such segmentation robustly is challenging due to ambiguities inherent in egocentric videos and biases present in training data. Consequently, existing methods often struggle, learning spurious correlations from skewed object-action pairings in datasets and fundamental visual confounding factors of the egocentric perspective, such as rapid motion and frequent occlusions. To address these limitations, we introduce Causal Ego-REferring Segmentation (CERES), a plug-in causal framework that adapts strong, pre-trained RVOS backbones to the egocentric domain. CERES implements dual-modal causal intervention: applying backdoor adjustment principles to counteract language representation biases learned from dataset statistics, and leveraging front-door adjustment concepts to address visual confounding by intelligently integrating semantic visual features with geometric depth information guided by causal principles, creating representations more robust to egocentric distortions. Extensive experiments demonstrate that CERES achieves state-of-the-art performance on Ego-RVOS benchmarks, highlighting the potential of applying causal reasoning to build more reliable models for broader egocentric video understanding.
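
For reference, the two adjustment formulas named in the abstract have the following standard textbook forms. The symbols are generic: $X$ is the input, $Y$ the prediction, $Z$ an observed confounder, and $M$ a mediator. How CERES maps its language and visual components onto these variables, and how it approximates the sums in a neural network, is not specified in the abstract.

```latex
% Backdoor adjustment: Z blocks every backdoor path from X to Y.
P(Y \mid do(X = x)) = \sum_{z} P(Y \mid X = x, Z = z)\, P(Z = z)

% Front-door adjustment: M carries the entire effect of X on Y.
P(Y \mid do(X = x)) = \sum_{m} P(M = m \mid X = x)
    \sum_{x'} P(Y \mid M = m, X = x')\, P(X = x')
```

The backdoor form requires the confounder to be observed, while the front-door form works with an unobserved confounder as long as a suitable mediator is available, which matches the abstract's pairing of the former with dataset statistics and the latter with visual confounding.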

[26] SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning

Yong Xien Chng, Tao Hu, Wenwen Tong, Xueheng Li, Jiandong Chen, Haojia Yu, Jiefan Lu, Hewei Guo, Hanming Deng, Chengjun Xie, Gao Huang, Dahua Lin, Lewei Lu

🧩 TL;DR

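This paper presents SenseNova-MARS, a multimodal agentic reasoning and search framework that uses reinforcement learning to give vision-language models interleaved visual reasoning and tool-use capabilities, reaching state-of-the-art performance on search-oriented benchmarks.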


📘 Detailed Summary

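Motivation: Agentic reasoning in current vision-language models is largely confined to text-oriented chain-of-thought or isolated tool invocation. The models lack the human-like ability to interleave dynamic tool manipulation with continuous reasoning, particularly in knowledge-intensive and visually complex scenarios that require coordinating external tools such as search and image cropping.

Method: SenseNova-MARS is a multimodal agentic reasoning and search framework that gives vision-language models interleaved visual reasoning and tool-use capabilities through reinforcement learning. It integrates image search, text search, and image crop tools to handle fine-grained and knowledge-intensive visual understanding tasks. In the reinforcement learning stage, the authors propose the Batch-Normalized Group Sequence Policy Optimization algorithm to improve training stability and strengthen the model's ability to invoke tools and reason.

Result: SenseNova-MARS achieves state-of-the-art performance on open-source search and fine-grained image understanding benchmarks. On search-oriented benchmarks, SenseNova-MARS-8B scores 67.84 on MMSearch and 41.64 on HR-MMSearch, surpassing proprietary models such as Gemini-3-Flash and GPT-5. The authors also introduce HR-MMSearch, the first search-oriented benchmark composed of high-resolution images with knowledge-intensive and search-driven questions.

Conclusion: SenseNova-MARS is a promising step toward agentic vision-language models, providing effective and robust tool-use capabilities. To support further research, the authors will release all code, models, and datasets.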


📄 Abstract

While Vision-Language Models (VLMs) can solve complex tasks through agentic reasoning, their capabilities remain largely constrained to text-oriented chain-of-thought or isolated tool invocation. They fail to exhibit the human-like proficiency required to seamlessly interleave dynamic tool manipulation with continuous reasoning, particularly in knowledge-intensive and visually complex scenarios that demand coordinated external tools such as search and image cropping. In this work, we introduce SenseNova-MARS, a novel Multimodal Agentic Reasoning and Search framework that empowers VLMs with interleaved visual reasoning and tool-use capabilities via reinforcement learning (RL). Specifically, SenseNova-MARS dynamically integrates the image search, text search, and image crop tools to tackle fine-grained and knowledge-intensive visual understanding challenges. In the RL stage, we propose the Batch-Normalized Group Sequence Policy Optimization (BN-GSPO) algorithm to improve the training stability and advance the model's ability to invoke tools and reason effectively. To comprehensively evaluate the agentic VLMs on complex visual tasks, we introduce the HR-MMSearch benchmark, the first search-oriented benchmark composed of high-resolution images with knowledge-intensive and search-driven questions. Experiments demonstrate that SenseNova-MARS achieves state-of-the-art performance on open-source search and fine-grained image understanding benchmarks. Specifically, on search-oriented benchmarks, SenseNova-MARS-8B scores 67.84 on MMSearch and 41.64 on HR-MMSearch, surpassing proprietary models such as Gemini-3-Flash and GPT-5. SenseNova-MARS represents a promising step toward agentic VLMs by providing effective and robust tool-use capabilities. To facilitate further research in this field, we will release all code, models, and datasets.

[27] Spatial-aware Vision Language Model for Autonomous Driving

Weijie Wei, Zhipeng Luo, Ling Feng, Venice Erin Liong

🧩 TL;DR

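This paper proposes LVLDrive, a framework that augments existing vision-language models with LiDAR point clouds as an additional input modality to address the limitations of 2D image-based autonomous driving methods in 3D metric spatial reasoning, improving the safety and reliability of autonomous driving systems.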


📘 Detailed Summary

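Motivation: Current vision-language-model-based driving methods rely mainly on 2D image cues for complex scene understanding and decision-making, which creates a critical bottleneck in precise metric spatial reasoning and geometric inference and undermines the safety and reliability of driving policies. Explicit 3D metric data is therefore needed to build trustworthy VLM-based driving systems.

Method: LVLDrive strengthens the 3D metric spatial understanding of existing vision-language models by adding LiDAR point clouds as an extra input modality. To mitigate the catastrophic disturbance that 3D data introduces to a pre-trained VLM, the authors design a Gradual Fusion Q-Former that injects LiDAR features incrementally, keeping the VLM's existing knowledge stable. They also develop a spatial-aware question-answering dataset to explicitly teach the model advanced 3D perception and reasoning.

Result: Extensive experiments on driving benchmarks show that LVLDrive outperforms vision-only counterparts in scene understanding, metric spatial perception, and reliable driving decision-making. The results support the gradual fusion strategy as a way to integrate 3D information while preserving the VLM's knowledge.

Conclusion: The work highlights the necessity of explicit 3D metric data for building trustworthy VLM-based autonomous driving systems and demonstrates the effectiveness of LiDAR-vision-language fusion. The gradual fusion strategy offers a workable way to integrate heterogeneous sensor data into pre-trained models.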


📄 Abstract

While Vision-Language Models (VLMs) show significant promise for end-to-end autonomous driving by leveraging the common sense embedded in language models, their reliance on 2D image cues for complex scene understanding and decision-making presents a critical bottleneck for safety and reliability. Current image-based methods struggle with accurate metric spatial reasoning and geometric inference, leading to unreliable driving policies. To bridge this gap, we propose LVLDrive (LiDAR-Vision-Language), a novel framework specifically designed to upgrade existing VLMs with robust 3D metric spatial understanding for autonomous driving by incorporating LiDAR point clouds as an extra input modality. A key challenge lies in mitigating the catastrophic disturbance introduced by disparate 3D data to the pre-trained VLMs. To this end, we introduce a Gradual Fusion Q-Former that incrementally injects LiDAR features, ensuring the stability and preservation of the VLM's existing knowledge base. Furthermore, we develop a spatial-aware question-answering (SA-QA) dataset to explicitly teach the model advanced 3D perception and reasoning capabilities. Extensive experiments on driving benchmarks demonstrate that LVLDrive achieves superior performance compared to vision-only counterparts across scene understanding, metric spatial perception, and reliable driving decision-making. Our work highlights the necessity of explicit 3D metric data for building trustworthy VLM-based autonomous systems.

[28] Forging Spatial Intelligence: A Roadmap of Multi-Modal Data Pre-Training for Autonomous Systems

Song Wang, Lingdong Kong, Xiaolu Liu, Hao Shi, Wentong Li, Jianke Zhu, Steven C. H. Hoi

🧩 TL;DR

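This paper presents a comprehensive framework for multi-modal pre-training and a unified taxonomy, aimed at the challenge of fusing multi-modal sensor data in autonomous systems and offering a systematic path toward Spatial Intelligence. It identifies critical technical bottlenecks and proposes a roadmap toward general-purpose multi-modal foundation models.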


📘 Detailed Summary

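Motivation: The rapid advancement of autonomous systems such as self-driving vehicles and drones has intensified the need to build true Spatial Intelligence from multi-modal onboard sensor data. Foundation models excel in single-modal settings, but integrating capabilities across sensors such as cameras and LiDAR into a unified understanding remains a major challenge, and the field lacks a systematic multi-modal pre-training framework and a unified taxonomy.

Method: The paper presents a comprehensive framework for multi-modal pre-training, analyzes the interplay between foundational sensor characteristics and learning strategies, and evaluates the role of platform-specific datasets in driving progress. Its central contribution is a unified taxonomy of pre-training paradigms, ranging from single-modality baselines to sophisticated unified frameworks that learn holistic representations for advanced tasks such as 3D object detection and semantic occupancy prediction. It also investigates the integration of textual inputs and occupancy representations to facilitate open-world perception and planning.

Result: The analysis identifies the core set of techniques driving multi-modal Spatial Intelligence and establishes a complete taxonomy from single-modality baselines to unified frameworks. The surveyed frameworks support advanced perception tasks such as 3D object detection and semantic occupancy prediction, and integrating textual inputs enables open-world perception. The paper also evaluates how platform-specific datasets contribute to progress.

Conclusion: The work provides a systematic framework and taxonomy for multi-modal foundation models and identifies critical bottlenecks such as computational efficiency and model scalability. It proposes a roadmap toward general-purpose multi-modal foundation models capable of robust Spatial Intelligence for real-world deployment.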


📄 Abstract

The rapid advancement of autonomous systems, including self-driving vehicles and drones, has intensified the need to forge true Spatial Intelligence from multi-modal onboard sensor data. While foundation models excel in single-modal contexts, integrating their capabilities across diverse sensors like cameras and LiDAR to create a unified understanding remains a formidable challenge. This paper presents a comprehensive framework for multi-modal pre-training, identifying the core set of techniques driving progress toward this goal. We dissect the interplay between foundational sensor characteristics and learning strategies, evaluating the role of platform-specific datasets in enabling these advancements. Our central contribution is the formulation of a unified taxonomy for pre-training paradigms: ranging from single-modality baselines to sophisticated unified frameworks that learn holistic representations for advanced tasks like 3D object detection and semantic occupancy prediction. Furthermore, we investigate the integration of textual inputs and occupancy representations to facilitate open-world perception and planning. Finally, we identify critical bottlenecks, such as computational efficiency and model scalability, and propose a roadmap toward general-purpose multi-modal foundation models capable of achieving robust Spatial Intelligence for real-world deployment.

[29] Hierarchical Vector-Quantized Latents for Perceptual Low-Resolution Video Compression

Manikanta Kotthapalli, Banafsheh Rekabdar

🧩 TL;DR

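This paper presents a Multi-Scale Vector Quantized Variational Autoencoder (MS-VQ-VAE) that generates compact, high-fidelity latent representations of low-resolution video for efficient storage and transmission in bandwidth-sensitive scenarios. The model extends the VQ-VAE-2 framework to the spatiotemporal domain, is lightweight (approximately 18.5M parameters), and achieves 25.96 dB PSNR and 0.8375 SSIM on the UCF101 dataset.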


📘 Detailed Summary

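Motivation: The exponential growth of video traffic places increasing demands on bandwidth and storage infrastructure, particularly for content delivery networks and edge devices. Traditional codecs such as H.264 and HEVC achieve high compression ratios but are designed primarily for pixel-domain reconstruction and lack native support for machine-learning-centric latent representations, which limits their integration into deep learning pipelines.

Method: The paper proposes a Multi-Scale Vector Quantized Variational Autoencoder (MS-VQ-VAE) that extends the VQ-VAE-2 framework to a spatiotemporal setting with a two-level hierarchical latent structure built from 3D residual convolutions. The model is lightweight (approximately 18.5M parameters) and optimized for 64x64 video clips, making it suitable for edge devices. To improve perceptual reconstruction quality, it adds a perceptual loss derived from a pre-trained VGG16 network.

Result: Trained on UCF101 using 2-second video clips (32 frames at 16 FPS), the model achieves 25.96 dB PSNR and 0.8375 SSIM on the test set. On validation, it improves over the single-scale baseline by 1.41 dB PSNR and 0.0248 SSIM, supporting the value of the multi-scale architecture.

Conclusion: The framework is well suited to scalable video compression in bandwidth-sensitive scenarios, including real-time streaming, mobile video analytics, and CDN-level storage optimization. Its lightweight design makes it appropriate for edge devices with constrained compute and memory.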


📄 Abstract

The exponential growth of video traffic has placed increasing demands on bandwidth and storage infrastructure, particularly for content delivery networks (CDNs) and edge devices. While traditional video codecs like H.264 and HEVC achieve high compression ratios, they are designed primarily for pixel-domain reconstruction and lack native support for machine learning-centric latent representations, limiting their integration into deep learning pipelines. In this work, we present a Multi-Scale Vector Quantized Variational Autoencoder (MS-VQ-VAE) designed to generate compact, high-fidelity latent representations of low-resolution video, suitable for efficient storage, transmission, and client-side decoding. Our architecture extends the VQ-VAE-2 framework to a spatiotemporal setting, introducing a two-level hierarchical latent structure built with 3D residual convolutions. The model is lightweight (approximately 18.5M parameters) and optimized for 64x64 resolution video clips, making it appropriate for deployment on edge devices with constrained compute and memory resources. To improve perceptual reconstruction quality, we incorporate a perceptual loss derived from a pre-trained VGG16 network. Trained on the UCF101 dataset using 2-second video clips (32 frames at 16 FPS), on the test set we achieve 25.96 dB PSNR and 0.8375 SSIM. On validation, our model improves over the single-scale baseline by 1.41 dB PSNR and 0.0248 SSIM. The proposed framework is well-suited for scalable video compression in bandwidth-sensitive scenarios, including real-time streaming, mobile video analytics, and CDN-level storage optimization.
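
To make the reported PSNR figures easier to interpret, the metric can be computed as below. This is the standard definition, $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$; the assumption that pixel values are scaled to $[0, 1]$ is ours, since the abstract does not state the value range used in the evaluation.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], reconstruction: Sequence[float],
         max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized signals."""
    if len(reference) != len(reconstruction):
        raise ValueError("signals must have the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0.0:
        return math.inf          # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


# A uniform error of 0.1 on a [0, 1] scale gives an MSE of 0.01, i.e. 20 dB.
toy_psnr = psnr([0.0, 1.0], [0.1, 0.9])
```

Since PSNR is logarithmic in the mean squared error, a gain of 1.41 dB corresponds to an MSE reduction of roughly 28%, regardless of the value range.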

[30] PhyGDPO: Physics-Aware Groupwise Direct Preference Optimization for Physically Consistent Text-to-Video Generation

Yuanhao Cai, Kunpeng Li, Menglin Jia, Jialiang Wang, Junzhe Sun, Feng Liang, Weifeng Chen, Felix Juefei-Xu, Chu Wang, Ali Thabet, Xiaoliang Dai, Xuan Ju, Alan Yuille, Ji Hou

🧩 TL;DR

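This paper proposes PhyGDPO, a physics-aware preference optimization framework for text-to-video generation. By building a large-scale physics-augmented dataset, PhyVidGen-135K, and designing a physics-guided reward mechanism based on vision-language models, it significantly improves the physical consistency of generated videos.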


📘 Detailed Summary

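Motivation: Current text-to-video generation methods have advanced in visual quality, but their outputs often fail to follow physical laws. Existing approaches based on graphics or prompt extension struggle to generalize beyond simple simulated environments or to learn implicit physical reasoning, and training data with rich physical interactions is scarce.

Method: The paper first introduces PhyAugPipe, a physics-augmented video data construction pipeline that uses a vision-language model (VLM) with chain-of-thought reasoning to collect the large-scale training dataset PhyVidGen-135K. It then formulates PhyGDPO, a physics-aware groupwise direct preference optimization framework built on the groupwise Plackett-Luce probabilistic model. Within PhyGDPO, a Physics-Guided Rewarding scheme embeds VLM-based physics rewards into the optimization, and a LoRA-Switch Reference scheme removes memory-heavy reference duplication for efficient training.

Result: Experiments show that the method significantly outperforms state-of-the-art open-source methods on PhyGenBench and VideoPhy2, demonstrating improved physical consistency in generated videos. The code, models, and data will be released on GitHub.

Conclusion: The work offers a systematic approach to physical consistency in text-to-video generation that combines data construction, an optimization framework, and reward design. The planned open-source release is intended to support further research on physics-aware content generation.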


📄 Abstract

Recent advances in text-to-video (T2V) generation have achieved good visual quality, yet synthesizing videos that faithfully follow physical laws remains an open challenge. Existing methods mainly based on graphics or prompt extension struggle to generalize beyond simple simulated environments or learn implicit physical reasoning. The scarcity of training data with rich physics interactions and phenomena is also a problem. In this paper, we first introduce a Physics-Augmented video data construction Pipeline, PhyAugPipe, that leverages a vision-language model (VLM) with chain-of-thought reasoning to collect a large-scale training dataset, PhyVidGen-135K. Then we formulate a principled Physics-aware Groupwise Direct Preference Optimization, PhyGDPO, framework that builds upon the groupwise Plackett-Luce probabilistic model to capture holistic preferences beyond pairwise comparisons. In PhyGDPO, we design a Physics-Guided Rewarding (PGR) scheme that embeds VLM-based physics rewards to steer optimization toward physical consistency. We also propose a LoRA-Switch Reference (LoRA-SR) scheme that eliminates memory-heavy reference duplication for efficient training. Experiments show that our method significantly outperforms state-of-the-art open-source methods on PhyGenBench and VideoPhy2. Please check our project page at https://caiyuanhao1998.github.io/project/PhyGDPO for more video results. Our code, models, and data will be released at https://github.com/caiyuanhao1998/Open-PhyGDPO
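
For readers unfamiliar with the Plackett-Luce model mentioned above, the sketch below computes the log-likelihood of one ranking over a group of candidates: the top item is chosen by a softmax over all scores, the second by a softmax over the remaining ones, and so on. It is a generic textbook formulation, not PhyGDPO's loss; how the paper derives per-video scores from the policy, the reference model, and the VLM physics rewards is not specified in the abstract, so the `scores_in_rank_order` input is a hypothetical stand-in.

```python
import math


def plackett_luce_log_likelihood(scores_in_rank_order):
    """Log-probability of a ranking under the Plackett-Luce model.

    scores_in_rank_order: real-valued scores, listed from the most
    preferred item to the least preferred item.
    """
    total = 0.0
    for i in range(len(scores_in_rank_order)):
        remaining = scores_in_rank_order[i:]
        # Numerically stable log-sum-exp over the items not yet ranked.
        m = max(remaining)
        log_norm = m + math.log(sum(math.exp(s - m) for s in remaining))
        total += scores_in_rank_order[i] - log_norm
    return total


# Hypothetical scores for a group of three generated videos.
agrees_with_scores = plackett_luce_log_likelihood([2.0, 1.0, 0.0])
contradicts_scores = plackett_luce_log_likelihood([0.0, 1.0, 2.0])
```

With two items this reduces to the pairwise Bradley-Terry form used in standard DPO, which is why the groupwise model can be read as a generalization beyond pairwise comparisons.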

[31] RGBT-Ground Benchmark: Visual Grounding Beyond RGB in Complex Real-World Scenarios

Tianyi Zhao, Jiawen Xi, Linhui Xiao, Junnan Li, Xue Yang, Maoxun Yuan, Xingxing Wei

🧩 TL;DR

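This paper presents RGBT-Ground, the first large-scale visual grounding benchmark for complex real-world scenarios, consisting of spatially aligned RGB and thermal infrared image pairs with high-quality annotations. It also proposes a unified visual grounding framework and the RGBT-VGNet baseline, which fuses complementary visual modalities for robust grounding.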


📘 Detailed Summary

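Motivation: Most existing visual grounding benchmarks are built from datasets collected in clean environments, such as COCO, where scene diversity is limited. They do not reflect real-world variation in conditions such as illumination and weather, which limits the evaluation of model robustness and generalization in safety-critical applications.

Method: The paper constructs the RGBT-Ground benchmark, which contains spatially aligned RGB and thermal infrared image pairs, high-quality referring expressions, object bounding boxes, and fine-grained annotations at the scene, environment, and object levels. It also proposes a unified visual grounding framework that supports both uni-modal and multi-modal visual inputs, and designs the RGBT-VGNet baseline to fuse complementary visual modalities.

Result: Experiments show that RGBT-VGNet significantly outperforms adapted versions of existing methods on RGBT-Ground, particularly in nighttime and long-distance scenarios, supporting the value of multi-modal fusion in complex real-world scenes.

Conclusion: The work provides the first multi-modal benchmark and an effective baseline for robust visual grounding in complex real-world scenarios, supporting vision-language understanding in safety-critical applications. Multi-modal fusion substantially improves performance under challenging conditions, and the released resources are intended to guide future research.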


📄 Abstract

Visual Grounding (VG) aims to localize specific objects in an image according to natural language expressions, serving as a fundamental task in vision-language understanding. However, existing VG benchmarks are mostly derived from datasets collected under clean environments, such as COCO, where scene diversity is limited. Consequently, they fail to reflect the complexity of real-world conditions, such as changes in illumination, weather, etc., that are critical to evaluating model robustness and generalization in safety-critical applications. To address these limitations, we present RGBT-Ground, the first large-scale visual grounding benchmark built for complex real-world scenarios. It consists of spatially aligned RGB and Thermal infrared (TIR) image pairs with high-quality referring expressions, corresponding object bounding boxes, and fine-grained annotations at the scene, environment, and object levels. This benchmark enables comprehensive evaluation and facilitates the study of robust grounding under diverse and challenging conditions. Furthermore, we establish a unified visual grounding framework that supports both uni-modal (RGB or TIR) and multi-modal (RGB-TIR) visual inputs. Based on it, we propose RGBT-VGNet, a simple yet effective baseline for fusing complementary visual modalities to achieve robust grounding. We conduct extensive adaptations to the existing methods on RGBT-Ground. Experimental results show that our proposed RGBT-VGNet significantly outperforms these adapted methods, particularly in nighttime and long-distance scenarios. All resources will be publicly released to promote future research on robust visual grounding in complex real-world environments.

[32] Improving Few-Shot Change Detection Visual Question Answering via Decision-Ambiguity-guided Reinforcement Fine-Tuning

Fuyu Dong, Ke Li, Di Wang, Nan Luo, Yiming Zhang, Kaiyu Li, Jianfei Yang, Quan Wang

🧩 TL;DR

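This paper proposes DARFT, a decision-ambiguity-guided reinforcement fine-tuning framework that improves the discriminability and robustness of change detection visual question answering models by explicitly optimizing decision-ambiguous samples to suppress strong distractors and sharpen decision boundaries.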


📘 Detailed Summary

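Motivation: In change detection visual question answering (CDVQA), existing supervised fine-tuning methods show many failures caused by decision ambiguity: the model assigns similar confidence to the correct answer and to strong distractors rather than making a clearly wrong prediction. These decision-ambiguous samples limit the model's discriminability and robustness.

Method: DARFT first mines decision-ambiguous samples using a reference policy trained with supervised fine-tuning, then applies group-relative policy optimization on the mined subset. Multi-sample decoding and intra-group relative advantages suppress strong distractors and sharpen decision boundaries without additional supervision.

Result: Extensive experiments show consistent gains over supervised fine-tuning baselines, particularly in few-shot settings, supporting the value of explicitly optimizing decision-ambiguous samples.

Conclusion: Decision ambiguity is an important source of failure in CDVQA models. Explicitly optimizing these samples improves robustness, and group-relative policy optimization provides an effective framework for handling them, especially in few-shot scenarios.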


📄 Abstract

Change detection visual question answering (CDVQA) requires answering text queries by reasoning about semantic changes in bi-temporal remote sensing images. A straightforward approach is to boost CDVQA performance with generic vision-language models via supervised fine-tuning (SFT). Despite recent progress, we observe that a significant portion of failures do not stem from clearly incorrect predictions, but from decision ambiguity, where the model assigns similar confidence to the correct answer and strong distractors. To formalize this challenge, we define Decision-Ambiguous Samples (DAS) as instances with a small probability margin between the ground-truth answer and the most competitive alternative. We argue that explicitly optimizing DAS is crucial for improving the discriminability and robustness of CDVQA models. To this end, we propose DARFT, a Decision-Ambiguity-guided Reinforcement Fine-Tuning framework that first mines DAS using an SFT-trained reference policy and then applies group-relative policy optimization on the mined subset. By leveraging multi-sample decoding and intra-group relative advantages, DARFT suppresses strong distractors and sharpens decision boundaries without additional supervision. Extensive experiments demonstrate consistent gains over SFT baselines, particularly under few-shot settings.
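
The two ingredients named in the abstract, mining samples with a small probability margin and computing group-relative advantages, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the paper's procedure: the threshold value `0.1`, the use of the absolute margin, the dictionary layout of a sample, and the mean/standard-deviation normalization (the form commonly used in group-relative policy optimization) are all choices made for the example.

```python
import statistics


def probability_margin(probs, gt_index):
    """Ground-truth probability minus the strongest competing option."""
    best_other = max(p for i, p in enumerate(probs) if i != gt_index)
    return probs[gt_index] - best_other


def mine_das(samples, threshold=0.1):
    """Keep samples whose margin is small (threshold is an assumed value)."""
    return [
        s for s in samples
        if abs(probability_margin(s["probs"], s["gt"])) < threshold
    ]


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sampled response's reward against its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical reference-policy outputs over three answer options.
samples = [
    {"id": "ambiguous", "probs": [0.48, 0.45, 0.07], "gt": 0},
    {"id": "confident", "probs": [0.90, 0.05, 0.05], "gt": 0},
]
mined = mine_das(samples)

# Hypothetical 0/1 correctness rewards for four decoded answers to one query.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct answers in the group receive positive advantages and distractor answers negative ones, which is the mechanism by which a group-relative objective can push probability mass away from strong distractors without extra labels.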

[33] SliceLens: Fine-Grained and Grounded Error Slice Discovery for Multi-Instance Vision Tasks

Wei Zhang, Chaoqun Wang, Zixuan Guan, Sam Kao, Pengfei Zhao, Peng Wu, Sifeng He

🧩 TL;DR

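This paper proposes SliceLens, a hypothesis-driven framework built on large language models and vision-language models for identifying fine-grained error slices in multi-instance vision tasks. It also introduces FeSD, the first benchmark for fine-grained error slice discovery, and substantially improves the precision and interpretability of discovered slices.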


📘 Detailed Summary

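Motivation: Existing error slice discovery methods mainly target image classification and transfer poorly to multi-instance tasks such as detection, segmentation, and pose estimation. Current benchmarks are typically tailored to specific algorithms or biased toward image classification, and they lack fine-grained evaluation that reflects real model failures. In real-world scenarios, error slices often arise from corner cases involving complex visual relationships, where instance-level methods without fine-grained reasoning yield little insight.

Method: SliceLens uses LLMs and VLMs to generate and verify diverse failure hypotheses through grounded visual reasoning, enabling reliable identification of fine-grained, interpretable error slices. The paper also introduces FeSD, the first benchmark designed specifically for evaluating fine-grained error slice discovery in instance-level vision tasks, with expert-annotated and carefully refined ground-truth slices that are precisely grounded to local error regions.

Result: Extensive experiments on existing benchmarks and FeSD show that SliceLens achieves state-of-the-art performance, improving Precision@10 on FeSD by 0.42 (0.73 vs. 0.31). It also identifies interpretable slices that support actionable model improvements, as validated in model repair experiments.

Conclusion: The work provides an effective framework and benchmark for error slice discovery in multi-instance vision tasks. Combining the visual reasoning abilities of LLMs and VLMs improves the precision and interpretability of discovered slices and gives practitioners a practical tool for model evaluation and repair. FeSD offers a standardized evaluation platform for future research on fine-grained error slices.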


📄 Abstract

Systematic failures of computer vision models on subsets with coherent visual patterns, known as error slices, pose a critical challenge for robust model evaluation. Existing slice discovery methods are primarily developed for image classification, limiting their applicability to multi-instance tasks such as detection, segmentation, and pose estimation. In real-world scenarios, error slices often arise from corner cases involving complex visual relationships, where existing instance-level approaches lacking fine-grained reasoning struggle to yield meaningful insights. Moreover, current benchmarks are typically tailored to specific algorithms or biased toward image classification, with artificial ground truth that fails to reflect real model failures. To address these limitations, we propose SliceLens, a hypothesis-driven framework that leverages LLMs and VLMs to generate and verify diverse failure hypotheses through grounded visual reasoning, enabling reliable identification of fine-grained and interpretable error slices. We further introduce FeSD (Fine-grained Slice Discovery), the first benchmark specifically designed for evaluating fine-grained error slice discovery across instance-level vision tasks, featuring expert-annotated and carefully refined ground-truth slices with precise grounding to local error regions. Extensive experiments on both existing benchmarks and FeSD demonstrate that SliceLens achieves state-of-the-art performance, improving Precision@10 by 0.42 (0.73 vs. 0.31) on FeSD, and identifies interpretable slices that facilitate actionable model improvements, as validated through model repair experiments.
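
The headline metric, Precision@10, is the fraction of the ten highest-ranked items that are judged correct. The sketch below shows the standard information-retrieval definition in plain Python; how FeSD decides that a retrieved item belongs to a ground-truth slice is not described in the abstract, so the `relevant` set and the item identifiers here are hypothetical.

```python
def precision_at_k(ranked_items, relevant, k=10):
    """Fraction of the top-k ranked items that appear in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in relevant)
    # Dividing by k (not len(top_k)) penalizes rankings shorter than k.
    return hits / k


# Hypothetical ranking of instance ids returned for one discovered slice.
ranked = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k"]
relevant = {"a", "c", "e", "k"}  # "k" is ranked 11th, so it does not count

score = precision_at_k(ranked, relevant, k=10)  # 3 hits out of 10
```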

[34] MoniRefer: A Real-world Large-scale Multi-modal Dataset based on Roadside Infrastructure for 3D Visual Grounding

Panquan Yang, Junfei Huang, Zongzhangbao Yin, Yingsong Hu, Anni Xu, Xinyi Luo, Xueqi Sun, Hai Wu, Sheng Ao, Zhaoxing Zhu, Chenglu Wen, Cheng Wang

🧩 TL;DR

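This paper introduces a new task, 3D visual grounding for outdoor monitoring scenarios, and builds MoniRefer, the first large-scale real-world multi-modal dataset for it. It also develops Moni3DVG, an end-to-end method that enables infrastructure-level understanding of traffic scenes.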


📘 Detailed Summary

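Motivation: Existing 3D visual grounding research focuses mainly on indoor scenes and outdoor driving scenes. Because paired point cloud-text data captured by roadside infrastructure sensors is scarce, 3D visual grounding in outdoor monitoring scenarios remains unexplored, which limits infrastructure-level understanding of complex traffic environments.

Method: The paper proposes Moni3DVG, an end-to-end method that combines the rich appearance information from images with the geometric and optical information from point clouds for multi-modal feature learning and 3D object localization. It also constructs MoniRefer, the first large-scale real-world multi-modal dataset for this task, containing about 136,018 objects and 411,128 natural language expressions.

Result: Extensive experiments and ablation studies on the proposed benchmark demonstrate the superiority and effectiveness of the method. The dataset covers multiple scenes from complex real-world traffic intersections, and all language descriptions and 3D labels were manually verified for quality and accuracy.

Conclusion: The study fills a gap in 3D visual grounding for outdoor monitoring scenarios and offers a new perspective on infrastructure-level traffic scene understanding. The dataset and method lay a foundation for future work on understanding traffic environments beyond the ego-vehicle perspective.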


📄 Abstract

3D visual grounding aims to localize the object in a 3D point cloud scene that semantically corresponds to a given natural language sentence. It is critical for roadside infrastructure systems to interpret natural language and localize relevant target objects in complex traffic environments. However, most existing datasets and approaches for 3D visual grounding focus on indoor and outdoor driving scenes, while outdoor monitoring scenarios remain unexplored due to the scarcity of paired point cloud-text data captured by roadside infrastructure sensors. In this paper, we introduce a novel task of 3D Visual Grounding for Outdoor Monitoring Scenarios, which enables infrastructure-level understanding of traffic scenes beyond the ego-vehicle perspective. To support this task, we construct MoniRefer, the first real-world large-scale multi-modal dataset for roadside-level 3D visual grounding. The dataset consists of about 136,018 objects with 411,128 natural language expressions collected from multiple complex traffic intersections in real-world environments. To ensure the quality and accuracy of the dataset, we manually verified all linguistic descriptions and 3D labels for objects. Additionally, we propose a new end-to-end method, named Moni3DVG, which utilizes the rich appearance information provided by images and the geometric and optical information from point clouds for multi-modal feature learning and 3D object localization. Extensive experiments and ablation studies on the proposed benchmarks demonstrate the superiority and effectiveness of our method. Our dataset and code will be released.

[35] EchoFoley: Event-Centric Hierarchical Control for Video Grounded Creative Sound Generation

Bingxuan Li, Yiming Cui, Yicheng He, Yiwei Wang, Shu Zhang, Longyin Wen, Yulei Niu

🧩 TL;DR

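This paper introduces the EchoFoley task, which targets three problems in video-text-to-audio generation: visual dominance, missing fine-grained control, and weak instruction understanding. Using a symbolic representation of sounding events and an agentic generation framework with a slow-fast thinking strategy, it achieves sound generation with event-level local control and hierarchical semantic control in video scenes.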


📘 Detailed Summary

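Motivation: Current video-text-to-audio (VT2A) generation faces three key limitations: an imbalance between visual and textual conditioning that leads to visual dominance; the absence of a concrete definition of fine-grained controllable generation; and weak instruction understanding and following, because existing datasets rely on brief categorical tags. These limitations hinder precise, controllable sound generation for video.

Method: The study introduces the EchoFoley task, which uses a symbolic representation of sounding events to specify when, what, and how each sound is produced within a video or instruction, supporting fine-grained controls such as sound generation, insertion, and editing. On this basis it builds EchoFoley-6k, a large-scale expert-curated benchmark with over 6,000 video-instruction-annotation triplets, and proposes EchoVidia, a sounding-event-centric agentic generation framework with a slow-fast thinking strategy.

Result: Experiments show that EchoVidia surpasses recent VT2A models by 40.7% in controllability and 12.5% in perceptual quality. EchoFoley-6k provides large-scale, high-quality data for video sound generation and supports both event-level local control and hierarchical semantic control.

Conclusion: Through a symbolic sounding-event representation and an agentic generation framework, the work addresses fine-grained control in video sound generation and offers a new technical route for sound design in multimodal storytelling. The EchoFoley task and benchmark lay a foundation for future research on controllable sound generation, and the slow-fast thinking strategy improves the model's ability to understand and execute complex instructions.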


📄 Abstract

Sound effects build an essential layer of multimodal storytelling, shaping the emotional atmosphere and the narrative semantics of videos. Despite recent advances in video-text-to-audio (VT2A), the current formulation faces three key limitations: first, an imbalance between visual and textual conditioning that leads to visual dominance; second, the absence of a concrete definition for fine-grained controllable generation; third, weak instruction understanding and following, as existing datasets rely on brief categorical tags. To address these limitations, we introduce EchoFoley, a new task designed for video-grounded sound generation with both event-level local control and hierarchical semantic control. Our symbolic representation for sounding events specifies when, what, and how each sound is produced within a video or instruction, enabling fine-grained controls like sound generation, insertion, and editing. To support this task, we construct EchoFoley-6k, a large-scale, expert-curated benchmark containing over 6,000 video-instruction-annotation triplets. Building upon this foundation, we propose EchoVidia, a sounding-event-centric agentic generation framework with a slow-fast thinking strategy. Experiments show that EchoVidia surpasses recent VT2A models by 40.7% in controllability and 12.5% in perceptual quality.

[36] UniC-Lift: Unified 3D Instance Segmentation via Contrastive Learning

Ankit Dhiman, Srinath R, Jaswanth Reddy, Lokesh R Boregowda, Venkatesh Babu Radhakrishnan

🧩 TL;DR

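This paper proposes a unified framework for instance segmentation of 3D Gaussian Splatting scenes. It introduces a learnable feature embedding and a novel "Embedding-to-Label" decoding process to resolve inconsistent multi-view 2D labels, and uses boundary hard mining to improve segmentation quality.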


📘 Detailed Summary

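Motivation: When multi-view 2D instance segmentation is lifted to 3D scenes, inconsistent labels across views are a key challenge and lead to poor 3D predictions. Existing solutions are typically two-stage: some rely on contrastive learning with hyperparameter-sensitive clustering, while others preprocess labels to enforce consistency. These approaches train inefficiently and are limited in performance.

Method: The paper proposes a unified framework that merges these steps by introducing a learnable feature embedding in the 3D Gaussian primitives, which is efficiently decoded into instance labels through a novel "Embedding-to-Label" process. To address artifacts at object boundaries, it hard-mines samples along the boundaries and applies a linear layer to the feature embeddings before computing the triplet loss, which stabilizes training.

Result: The method outperforms baselines on the ScanNet, Replica3D, and Messy-Rooms datasets in both qualitative and quantitative evaluations. Boundary hard mining combined with the stabilizing linear layer significantly improves segmentation quality, especially near object boundaries.

Conclusion: The study shows that a unified framework is effective for 3D scene instance segmentation, reducing training time and improving performance through end-to-end optimization. The boundary handling strategy offers a practical answer to geometric consistency challenges in 3D segmentation and a useful reference for future work on 3D scene understanding.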


📄 Abstract

3D Gaussian Splatting (3DGS) and Neural Radiance Fields (NeRF) have advanced novel-view synthesis. Recent methods extend multi-view 2D segmentation to 3D, enabling instance/semantic segmentation for better scene understanding. A key challenge is the inconsistency of 2D instance labels across views, leading to poor 3D predictions. Existing methods use a two-stage approach in which some rely on contrastive learning with hyperparameter-sensitive clustering, while others preprocess labels for consistency. We propose a unified framework that merges these steps, reducing training time and improving performance by introducing a learnable feature embedding for segmentation in Gaussian primitives. This embedding is then efficiently decoded into instance labels through a novel "Embedding-to-Label" process, effectively integrating the optimization. While this unified framework offers substantial benefits, we observed artifacts at the object boundaries. To address the object boundary issues, we propose hard-mining samples along these boundaries. However, directly applying hard mining to the feature embeddings proved unstable. Therefore, we apply a linear layer to the rasterized feature embeddings before calculating the triplet loss, which stabilizes training and significantly improves performance. Our method outperforms baselines qualitatively and quantitatively on the ScanNet, Replica3D, and Messy-Rooms datasets.
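
The stabilization trick described above, projecting embeddings through a linear layer before the triplet loss, can be illustrated with a small sketch. It is a generic margin-based triplet loss in plain Python under stated assumptions: the 2-dimensional embeddings, the identity weight matrix, the margin of `1.0`, and the Euclidean distance are placeholders, and the paper's rasterization and boundary sampling are not reproduced.

```python
import math


def linear(x, weight, bias):
    """Apply y = W x + b to a single embedding vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Pull same-instance pairs together, push different instances apart."""
    return max(0.0,
               euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Hypothetical projection parameters (identity map, for illustration only).
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]

# Hypothetical rasterized embeddings sampled near an object boundary.
anchor = linear([0.0, 0.0], W, b)
positive = linear([0.0, 1.0], W, b)    # same instance as the anchor
negative = linear([0.0, 1.5], W, b)    # neighboring instance, a hard negative

loss = triplet_loss(anchor, positive, negative)  # 1.0 - 1.5 + 1.0 = 0.5
```

A hard negative is one that lies close to the anchor, so it produces a non-zero loss; a distant negative would already satisfy the margin and contribute nothing.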

[37] VLN-MME: Diagnosing MLLMs as Language-guided Visual Navigation agents

Xunyi Zhao, Gengze Zhou, Qi Wu

🧩 TL;DR

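This paper proposes VLN-MME, an evaluation framework for systematically assessing the zero-shot ability of multimodal large language models on embodied navigation tasks. It finds that adding chain-of-thought reasoning and self-reflection actually lowers performance, exposing the limits of MLLMs in 3D spatial reasoning.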


📘 Detailed Summary

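Motivation: Although multimodal large language models (MLLMs) perform well across many vision-language tasks, their performance as embodied agents in navigation tasks, which require multi-round dialogue, spatial reasoning, and sequential action prediction, needs further exploration. There is currently no unified evaluation framework for systematically assessing the zero-shot ability of MLLMs in embodied navigation.

Method: The study proposes the VLN-MME evaluation framework, which converts traditional navigation datasets into a standardized benchmark for evaluating MLLMs as zero-shot agents. Its highly modular and accessible design simplifies evaluation and supports structured comparisons and component-level ablations across MLLM architectures, agent designs, and navigation tasks.

Result: Experiments show that enhancing the baseline agent with chain-of-thought reasoning and self-reflection leads to a performance decrease. This suggests that MLLMs have poor context awareness in embodied navigation tasks: they can follow instructions and structure their output, but the fidelity of their 3D spatial reasoning is low.

Conclusion: VLN-MME lays the groundwork for systematically evaluating general-purpose MLLMs in embodied navigation settings and reveals limitations in their sequential decision-making. The findings offer guidance for post-training MLLMs as embodied agents and point to the need for better spatial reasoning and context awareness.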


📄 Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across a wide range of vision-language tasks. However, their performance as embodied agents, which requires multi-round dialogue spatial reasoning and sequential action prediction, needs further exploration. Our work investigates this potential in the context of Vision-and-Language Navigation (VLN) by introducing a unified and extensible evaluation framework to probe MLLMs as zero-shot agents by bridging traditional navigation datasets into a standardized benchmark, named VLN-MME. We simplify the evaluation with a highly modular and accessible design. This flexibility streamlines experiments, enabling structured comparisons and component-level ablations across diverse MLLM architectures, agent designs, and navigation tasks. Crucially, enabled by our framework, we observe that enhancing our baseline agent with Chain-of-Thought (CoT) reasoning and self-reflection leads to an unexpected performance decrease. This suggests MLLMs exhibit poor context awareness in embodied navigation tasks; although they can follow instructions and structure their output, their 3D spatial reasoning fidelity is low. VLN-MME lays the groundwork for systematic evaluation of general-purpose MLLMs in embodied navigation settings and reveals limitations in their sequential decision-making capabilities. We believe these findings offer crucial guidance for MLLM post-training as embodied agents.

[38] FinMMDocR: Benchmarking Financial Multimodal Reasoning with Scenario Awareness, Document Understanding, and Multi-Step Computation

Zichen Tang, Haihong E, Rongjin Li, Jiacheng Liu, Linwei Jia, Zhuodi Hao, Zhongjun Yang, Yuanze Li, Haolin Tian, Xinyi Hu, Peizhi Zhao, Yuan Liu, Zhengyu Wang, Xianghe Wang, Yiling Huang, Xueyuan Lin, Ruofei Bai, Zijian Xie, Qian Huang, Ruining Cao, Haocheng Gao

🧩 TL;DR

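This paper introduces FinMMDocR, a novel bilingual multimodal benchmark for evaluating multimodal large language models on real-world financial numerical reasoning. It substantially exceeds existing benchmarks in three respects: scenario awareness, document understanding, and multi-step computation.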


📘 Detailed Summary

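Motivation: Existing benchmarks fall short in evaluating multimodal large language models on real-world financial numerical reasoning. They are not challenging enough in complex financial scenarios, long-document understanding, and multi-step reasoning, so they cannot fully test model ability in practical financial applications.

Method: The team built the FinMMDocR benchmark with 1,200 expert-annotated problems, 57.9% of which incorporate one of 12 types of implicit financial scenarios. It collects 837 Chinese and English financial documents spanning 9 types, averaging 50.8 pages each. Problems require 11 reasoning steps on average (5.3 information extraction steps and 5.7 calculation steps), and 65.0% of them require cross-page evidence.

Result: On FinMMDocR, the best-performing multimodal large language model reaches only 58.0% accuracy, and different retrieval-augmented generation methods show significant performance variation. This highlights the limitations of current models on complex financial multimodal reasoning.

Conclusion: FinMMDocR reveals clear weaknesses of current multimodal large language models on real-world financial numerical reasoning, particularly with implicit scenarios, long documents, and multi-step computation. The benchmark is expected to drive improvements in models and reasoning-enhanced methods for complex multimodal reasoning tasks.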


📄 Abstract

We introduce FinMMDocR, a novel bilingual multimodal benchmark for evaluating multimodal large language models (MLLMs) on real-world financial numerical reasoning. Compared to existing benchmarks, our work delivers three major advancements. (1) Scenario Awareness: 57.9% of 1,200 expert-annotated problems incorporate 12 types of implicit financial scenarios (e.g., Portfolio Management), challenging models to perform expert-level reasoning based on assumptions; (2) Document Understanding: 837 Chinese/English documents spanning 9 types (e.g., Company Research) average 50.8 pages with rich visual elements, significantly surpassing existing benchmarks in both breadth and depth of financial documents; (3) Multi-Step Computation: Problems demand 11-step reasoning on average (5.3 extraction + 5.7 calculation steps), with 65.0% requiring cross-page evidence (2.4 pages average). The best-performing MLLM achieves only 58.0% accuracy, and different retrieval-augmented generation (RAG) methods show significant performance variations on this task. We expect FinMMDocR to drive improvements in MLLMs and reasoning-enhanced methods on complex multimodal reasoning tasks in real-world scenarios.

[39] VIPER: Process-aware Evaluation for Generative Video Reasoning

Yifan Li, Yukai Gu, Yingqian Min, Zikang Liu, Yifan Du, Kun Zhou, Min Yang, Wayne Xin Zhao, Minghui Qiu

🧩 TL;DR

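This study proposes the VIPER benchmark and a process-aware evaluation paradigm to address the "outcome-hacking" problem caused by existing evaluation methods in generative video reasoning, using a hierarchical evaluation framework to quantify the reliability of a model's reasoning process.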


📘 Detailed Summary

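Motivation: Video generation models now show Chain-of-Frames reasoning ability, but current evaluation frameworks rely mainly on single-frame assessments. This allows "outcome-hacking", where a model reaches a correct conclusion through an erroneous process, and leaves the validity of the reasoning process without systematic evaluation.

Method: The study proposes a process-aware evaluation paradigm comprising the VIPER benchmark, which covers 16 tasks across temporal, structural, symbolic, spatial, physics, and planning reasoning, and the Process-outcome Consistency metric POC@r, which uses a VLM-as-Judge with a hierarchical rubric to evaluate both the validity of intermediate steps and the final result.

Result: Experiments show that state-of-the-art video models achieve only about 20% POC@1.0 and exhibit significant outcome-hacking. Analyses of test-time scaling and sampling robustness further reveal a substantial gap between current video generation and true generalized visual reasoning.

Conclusion: The study underlines the importance of process evaluation in generative video reasoning and exposes the limits of current models' reliable reasoning ability. The proposed benchmark and evaluation framework give future research a systematic assessment tool for building more reliable visual reasoning systems.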


📄 Abstract

Recent breakthroughs in video generation have demonstrated an emerging capability termed Chain-of-Frames (CoF) reasoning, where models resolve complex tasks through the generation of continuous frames. While these models show promise for Generative Video Reasoning (GVR), existing evaluation frameworks often rely on single-frame assessments, which can lead to outcome-hacking, where a model reaches a correct conclusion through an erroneous process. To address this, we propose a process-aware evaluation paradigm. We introduce VIPER, a comprehensive benchmark spanning 16 tasks across temporal, structural, symbolic, spatial, physics, and planning reasoning. Furthermore, we propose Process-outcome Consistency (POC@r), a new metric that utilizes VLM-as-Judge with a hierarchical rubric to evaluate both the validity of the intermediate steps and the final result. Our experiments reveal that state-of-the-art video models achieve only about 20% POC@1.0 and exhibit significant outcome-hacking. We further explore the impact of test-time scaling and sampling robustness, highlighting a substantial gap between current video generation and true generalized visual reasoning. Our benchmark will be publicly released.

[40] GaMO: Geometry-aware Multi-view Diffusion Outpainting for Sparse-View 3D Reconstruction

Yi-Chuan Huang, Hao-Jen Chien, Chin-Yang Lin, Ying-Huan Chen, Yu-Lun Liu

🧩 TL;DR

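This paper proposes GaMO (Geometry-aware Multi-view Outpainter), a framework that improves sparse-view 3D reconstruction through multi-view outpainting rather than novel-view generation. It provides broader scene coverage while preserving geometric consistency, and achieves a 25× speedup and better reconstruction quality than existing methods.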


📘 Detailed Summary

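Motivation: Current diffusion-based methods for sparse-view 3D reconstruction have three key limitations: inadequate coverage beyond the periphery of known views, geometric inconsistencies across generated views, and computationally expensive pipelines. These limitations hinder practical high-quality scene reconstruction from a limited number of input views.

Method: GaMO reformulates sparse-view reconstruction as multi-view outpainting: it expands the field of view from existing camera poses instead of generating new viewpoints, which inherently preserves geometric consistency. The method uses multi-view conditioning and geometry-aware denoising strategies and runs in a zero-shot manner without additional training.

Result: Extensive experiments on Replica and ScanNet++ show that GaMO achieves state-of-the-art reconstruction quality with 3, 6, and 9 input views, outperforming prior methods in PSNR and LPIPS, while achieving a 25× speedup over SOTA diffusion-based methods with processing time under 10 minutes.

Conclusion: The study demonstrates that multi-view outpainting, rather than novel-view generation, is an effective way to improve sparse-view reconstruction, yielding a computationally efficient and geometrically consistent method. GaMO's results suggest that reformulating the problem can overcome the limitations of existing methods and enable fast, high-quality 3D reconstruction in practice.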


📄 Abstract

Recent advances in 3D reconstruction have achieved remarkable progress in high-quality scene capture from dense multi-view imagery, yet struggle when input views are limited. Various approaches, including regularization techniques, semantic priors, and geometric constraints, have been implemented to address this challenge. Latest diffusion-based methods have demonstrated substantial improvements by generating novel views from new camera poses to augment training data, surpassing earlier regularization and prior-based techniques. Despite this progress, we identify three critical limitations in these state-of-the-art approaches: inadequate coverage beyond known view peripheries, geometric inconsistencies across generated views, and computationally expensive pipelines. We introduce GaMO (Geometry-aware Multi-view Outpainter), a framework that reformulates sparse-view reconstruction through multi-view outpainting. Instead of generating new viewpoints, GaMO expands the field of view from existing camera poses, which inherently preserves geometric consistency while providing broader scene coverage. Our approach employs multi-view conditioning and geometry-aware denoising strategies in a zero-shot manner without training. Extensive experiments on Replica and ScanNet++ demonstrate state-of-the-art reconstruction quality across 3, 6, and 9 input views, outperforming prior methods in PSNR and LPIPS, while achieving a $25\times$ speedup over SOTA diffusion-based methods with processing time under 10 minutes. Project page: https://yichuanh.github.io/GaMO/

cs.CL

[41] Break Out the Silverware -- Semantic Understanding of Stored Household Items

Michaela Levi-Richter, Reuth Mirsky, Oren Glickman

🧩 TL;DR

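This paper introduces the Stored Household Item Challenge, a benchmark for evaluating the cognitive ability of service robots to locate items that are out of sight, and develops NOAM, a hybrid agent pipeline that combines structured scene understanding with large language model inference. NOAM significantly improves prediction accuracy and approaches human-level performance.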


📘 Detailed Summary

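Motivation: Despite advances in computer vision and manipulation, service robots still lack the commonsense reasoning needed to infer where everyday items are stored. Such items are usually hidden in drawers, cabinets, or closets and cannot be observed directly, so existing systems struggle with simple commands like "Bring me a plate." A benchmark is therefore needed for the cognitive ability to understand how household spaces are organized.

Method: The study proposes the Stored Household Item Challenge benchmark, which includes two datasets: a real-world evaluation set of 100 item-image pairs and a development set of 6,500 item-image pairs annotated with storage polygons over public kitchen images. To tackle the challenge, it develops NOAM, a hybrid agent pipeline that converts visual input into natural language descriptions of spatial context and visible containers, then prompts a large language model to infer the most likely hidden storage location, combining structured scene understanding with language model inference.

Result: NOAM significantly outperforms baselines in prediction accuracy, including random selection, vision-language pipelines, and leading multimodal models, and approaches human-level performance. The evaluation shows that this integrated vision-language agent exhibits emergent commonsense reasoning and highlights best practices for deploying cognitively capable agents in domestic environments.

Conclusion: The study underlines the effectiveness of combining structured scene understanding with large language model inference for robotic commonsense reasoning, and provides a standardized benchmark for evaluating and deploying the cognitive abilities of service robots in household environments. NOAM's modular design allows it to be integrated into broader robotic systems.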


📄 Abstract

"Bring me a plate." For domestic service robots, this simple command reveals a complex challenge: inferring where everyday items are stored, often out of sight in drawers, cabinets, or closets. Despite advances in vision and manipulation, robots still lack the commonsense reasoning needed to complete this task. We introduce the Stored Household Item Challenge, a benchmark task for evaluating service robots' cognitive capabilities: given a household scene and a queried item, predict its most likely storage location. Our benchmark includes two datasets: (1) a real-world evaluation set of 100 item-image pairs with human-annotated ground truth from participants' kitchens, and (2) a development set of 6,500 item-image pairs annotated with storage polygons over public kitchen images. These datasets support realistic modeling of household organization and enable comparative evaluation across agent architectures. To begin tackling this challenge, we introduce NOAM (Non-visible Object Allocation Model), a hybrid agent pipeline that combines structured scene understanding with large language model inference. NOAM converts visual input into natural language descriptions of spatial context and visible containers, then prompts a language model (e.g., GPT-4) to infer the most likely hidden storage location. This integrated vision-language agent exhibits emergent commonsense reasoning and is designed for modular deployment within broader robotic systems. We evaluate NOAM against baselines including random selection, vision-language pipelines (Grounding-DINO + SAM), leading multimodal models (e.g., Gemini, GPT-4o, Kosmos-2, LLaMA, Qwen), and human performance. NOAM significantly improves prediction accuracy and approaches human-level results, highlighting best practices for deploying cognitively capable agents in domestic environments.

[42] Automated Analysis of Sustainability Reports: Using Large Language Models for the Extraction and Prediction of EU Taxonomy-Compliant KPIs

Jonathan Schmoll, Adam Jatowt

🧩 TL;DR

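To address the lack of public benchmark datasets for automating EU Taxonomy compliance, this study builds the first structured dataset and systematically evaluates large language models on the compliance workflow, revealing a clear performance gap between qualitative and quantitative tasks.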


📘 Detailed Summary

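Motivation: Manual EU Taxonomy compliance is resource-intensive and inefficient, while research on automating it with large language models is held back by the lack of public benchmark datasets. This study aims to fill that gap and provide the first systematic evaluation of LLMs on the core compliance workflow.

Method: The study builds a novel structured dataset from 190 corporate reports with ground-truth annotations of economic activities and quantitative Key Performance Indicators (KPIs). It evaluates the qualitative task with a multi-step agentic framework, tests quantitative prediction in a zero-shot setting, and systematically compares input formats (concise metadata versus full, unstructured reports).

Result: The experiments reveal a clear gap between qualitative and quantitative tasks. LLMs show moderate success in the qualitative task of identifying economic activities, and the multi-step agentic framework only modestly improves precision. In the quantitative task of predicting financial KPIs, the models comprehensively fail in a zero-shot setting. The study also finds a paradox in which concise metadata often outperforms full, unstructured reports, and model confidence scores are poorly calibrated.

Conclusion: Large language models are not yet ready for fully automated compliance, but they can serve as powerful assistive tools for human experts. The public dataset provides a benchmark for future research, and the results expose both the models' limitations in complex quantitative reasoning and the importance of input format.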


📄 Abstract

The manual, resource-intensive process of complying with the EU Taxonomy presents a significant challenge for companies. While Large Language Models (LLMs) offer a path to automation, research is hindered by a lack of public benchmark datasets. To address this gap, we introduce a novel, structured dataset from 190 corporate reports, containing ground-truth economic activities and quantitative Key Performance Indicators (KPIs). We use this dataset to conduct the first systematic evaluation of LLMs on the core compliance workflow. Our results reveal a clear performance gap between qualitative and quantitative tasks. LLMs show moderate success in the qualitative task of identifying economic activities, with a multi-step agentic framework modestly enhancing precision. Conversely, the models comprehensively fail at the quantitative task of predicting financial KPIs in a zero-shot setting. We also discover a paradox, where concise metadata often yields superior performance to full, unstructured reports, and find that model confidence scores are poorly calibrated. We conclude that while LLMs are not ready for full automation, they can serve as powerful assistive tools for human experts. Our dataset provides a public benchmark for future research.

[43] Figure It Out: Improving the Frontier of Reasoning with Active Visual Thinking

Meiqi Chen, Fandong Meng, Jie Zhou

🧩 TL;DR

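This paper proposes FIGR, a framework that integrates active visual thinking into multi-turn reasoning via end-to-end reinforcement learning. By using visual representations to externalize intermediate structural hypotheses, it outperforms strong text-only chain-of-thought baselines on challenging mathematical reasoning benchmarks.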


📘 Detailed Summary

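Motivation: Complex reasoning problems often involve implicit spatial, geometric, and structural relationships that are not explicitly encoded in text. Purely text-based reasoning struggles to represent global structural constraints in complex settings, which limits performance on tasks that require structured thinking.

Method: FIGR integrates active visual thinking into multi-turn reasoning via end-to-end reinforcement learning. During problem solving it constructs visual representations to externalize intermediate structural hypotheses, and it adaptively regulates when and how visual reasoning is invoked, enabling stable and coherent reasoning over global structural properties.

Result: On challenging mathematical reasoning benchmarks, FIGR outperforms strong text-only chain-of-thought baselines, improving the base model by 13.12% on AIME 2025 and 11.00% on BeyondAIME. This highlights the effectiveness of figure-guided multimodal reasoning in improving the stability and reliability of complex reasoning.

Conclusion: Integrating visual representations into the reasoning process captures global structural constraints that are hard to express in text. Active visual thinking offers a more stable and reliable approach to complex reasoning tasks and opens a direction for multimodal reasoning systems.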


📄 Abstract

Complex reasoning problems often involve implicit spatial, geometric, and structural relationships that are not explicitly encoded in text. While recent reasoning models have achieved strong performance across many domains, purely text-based reasoning struggles to represent global structural constraints in complex settings. In this paper, we introduce FIGR, which integrates active visual thinking into multi-turn reasoning via end-to-end reinforcement learning. FIGR externalizes intermediate structural hypotheses by constructing visual representations during problem solving. By adaptively regulating when and how visual reasoning should be invoked, FIGR enables more stable and coherent reasoning over global structural properties that are difficult to capture from text alone. Experiments on challenging mathematical reasoning benchmarks demonstrate that FIGR outperforms strong text-only chain-of-thought baselines. In particular, FIGR improves the base model by 13.12% on AIME 2025 and 11.00% on BeyondAIME, highlighting the effectiveness of figure-guided multimodal reasoning in enhancing the stability and reliability of complex reasoning.

[44] Skim-Aware Contrastive Learning for Efficient Document Representation

Waheed Ahmed Abro, Zied Bouraoui

🧩 TL;DR

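This paper proposes a new self-supervised contrastive learning framework for long-document representation. Inspired by how humans skim, the method randomly masks a section of the document and uses a natural language inference (NLI)-based contrastive objective to align it with relevant parts, improving representation quality on legal and biomedical texts while remaining computationally efficient.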


📘 Detailed Summary

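Motivation: Although transformer-based models perform well on word- and sentence-level tasks, effectively representing long documents in fields such as law and medicine remains difficult. Sparse attention mechanisms can handle longer inputs but are resource-intensive and often fail to capture full-document context. Hierarchical transformer models are more efficient but do not clearly explain how they relate different sections of a document. Humans, in contrast, often skim a text and focus on important sections to understand the overall message, and this strategy inspired the present work.

Method: The paper proposes a new self-supervised contrastive learning framework inspired by human skimming. It randomly masks a section of the document and uses an NLI-based contrastive objective to align that section with relevant parts while distancing it from unrelated ones. The method mimics how humans synthesize information and aims to produce document representations that are both rich and computationally efficient, particularly for specialized long documents in law and biomedicine.

Result: Experiments on legal and biomedical texts confirm significant gains in both accuracy and efficiency. The method improves the quality of document representations while remaining computationally efficient.

Conclusion: The study shows the potential of self-supervised learning inspired by human cognitive strategies for long-document representation and offers a new approach to long texts in specialized domains. By mimicking skimming and information synthesis, the method produces document representations that are both rich and efficient.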


📄 Abstract

Although transformer-based models have shown strong performance in word- and sentence-level tasks, effectively representing long documents, especially in fields like law and medicine, remains difficult. Sparse attention mechanisms can handle longer inputs, but are resource-intensive and often fail to capture full-document context. Hierarchical transformer models offer better efficiency but do not clearly explain how they relate different sections of a document. In contrast, humans often skim texts, focusing on important sections to understand the overall message. Drawing from this human strategy, we introduce a new self-supervised contrastive learning framework that enhances long document representation. Our method randomly masks a section of the document and uses a natural language inference (NLI)-based contrastive objective to align it with relevant parts while distancing it from unrelated ones. This mimics how humans synthesize information, resulting in representations that are both richer and more computationally efficient. Experiments on legal and biomedical texts confirm significant gains in both accuracy and efficiency.
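The abstract does not spell out the NLI-based contrastive objective, so the following is only a minimal sketch of a generic InfoNCE-style loss, which is one common way to "align a masked section with relevant parts while distancing it from unrelated ones." The function names, the temperature value, and the toy two-dimensional embeddings are all assumptions made for illustration, not the authors' implementation.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss for one anchor (assumed stand-in for the paper's objective).

    anchor:    embedding of the masked section (hypothetical)
    positive:  embedding of a related part of the same document
    negatives: embeddings of unrelated sections
    """
    a = normalize(anchor)
    logits = [dot(a, normalize(positive)) / temperature]
    logits += [dot(a, normalize(n)) / temperature for n in negatives]
    # Numerically stable log-sum-exp over the positive and all negatives.
    top = max(logits)
    log_denominator = top + math.log(sum(math.exp(z - top) for z in logits))
    # Negative log-probability of picking the positive among all candidates.
    return log_denominator - logits[0]


# Toy embeddings: the related section points the same way as the masked one.
masked = [1.0, 0.0]
related = [0.9, 0.1]
unrelated = [[0.0, 1.0], [-1.0, 0.2]]

loss_aligned = info_nce_loss(masked, related, unrelated)
loss_misaligned = info_nce_loss(masked, unrelated[0], [related, unrelated[1]])
print(loss_aligned < loss_misaligned)
```

The loss is small when the masked section is closest to its related part and grows when an unrelated section is treated as the positive, which is the behavior a contrastive objective of this kind is meant to reward.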

[45] Comparing Approaches to Automatic Summarization in Less-Resourced Languages

Chester Palen-Michel, Constantine Lignos

🧩 TL;DR

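This study systematically compares approaches to automatic text summarization in less-resourced languages. It finds that a multilingually fine-tuned mT5 model outperforms zero-shot LLM approaches on most metrics and that LLM-as-judge evaluation may be unreliable for less-resourced languages.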


📘 Detailed Summary

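Motivation: Automatic text summarization has achieved high performance in high-resourced languages such as English, but summarization in less-resourced languages has received comparatively little attention. This study aims to fill that gap by systematically comparing approaches to summarization in less-resourced languages.

Method: The study compares zero-shot prompting of LLMs of different sizes, fine-tuning mT5 with and without three data augmentation approaches, multilingual transfer, and an LLM translation pipeline (source language to English, summarize, then translate back to the source language).

Result: Evaluation with five metrics shows that LLMs of similar parameter size vary in performance, that the multilingual fine-tuned mT5 baseline outperforms most other approaches, including zero-shot LLMs, on most metrics, and that LLM-as-judge evaluation may be less reliable on less-resourced languages.

Conclusion: For summarization in less-resourced languages, multilingual fine-tuning is more effective than zero-shot LLM prompting. The study also exposes limitations of current evaluation methods for these languages, which points to directions for future work.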


📄 Abstract

Automatic text summarization has achieved high performance in high-resourced languages like English, but comparatively less attention has been given to summarization in less-resourced languages. This work compares a variety of approaches to summarization, from zero-shot prompting of LLMs large and small to fine-tuning smaller models like mT5 with and without three data augmentation approaches and multilingual transfer. We also explore an LLM translation pipeline approach, translating from the source language to English, summarizing, and translating back. Evaluating with five different metrics, we find that there is variation across LLMs in their performance across similar parameter sizes, that our multilingual fine-tuned mT5 baseline outperforms most other approaches including zero-shot LLM performance for most metrics, and that LLM-as-judge may be less reliable on less-resourced languages.

[46] Safe in the Future, Dangerous in the Past: Dissecting Temporal and Linguistic Vulnerabilities in LLMs

Muhammad Abdullahi Said, Muhammad Sammani Sani

🧩 TL;DR

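This study systematically audits safety alignment differences between English and Hausa in leading LLMs. It shows that safety does not simply degrade in the low-resource language but is a dynamic state determined by the interaction of language and temporal framing, challenging the prevailing multilingual safety gap narrative.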


📘 Detailed Summary

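Motivation: As LLMs integrate into critical global infrastructure, the assumption that safety alignment transfers zero-shot from English to other languages is a dangerous blind spot. This is especially true for low-resource languages such as Hausa and for West African threat scenarios (e.g., Yahoo-Yahoo fraud, Dane gun manufacturing). Existing work lacks a systematic understanding of how language and temporal framing interact.

Method: The study uses a 2 x 4 factorial design across 1,440 evaluations with HausaSafety, a novel adversarial dataset grounded in West African threat scenarios. It tests three state-of-the-art models (GPT-5.1, Gemini 3 Pro, and Claude 4.5 Opus) for non-linear interactions between language (English vs. Hausa) and temporal framing.

Result: The study finds a mechanism of Complex Interference rather than simple safety degradation. Claude 4.5 Opus was significantly safer in Hausa than in English (45.0% vs. 36.7%), yet the models failed catastrophically at temporal reasoning: past-tense framing bypassed defenses (15.6% safe), while future-tense scenarios triggered hyper-conservative refusals (57.2% safe). The safest and most vulnerable configurations differ by a factor of 9.2.

Conclusion: Current models rely on superficial heuristics rather than robust semantic understanding, creating safety gaps that leave Global South users exposed to localized harms. The authors call for a shift to Invariant Alignment to keep safety stable across linguistic and temporal shifts, since safety is a context-dependent state rather than a fixed property.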


📄 Abstract

As Large Language Models (LLMs) integrate into critical global infrastructure, the assumption that safety alignment transfers zero-shot from English to other languages remains a dangerous blind spot. This study presents a systematic audit of three state-of-the-art models (GPT-5.1, Gemini 3 Pro, and Claude 4.5 Opus) using HausaSafety, a novel adversarial dataset grounded in West African threat scenarios (e.g., Yahoo-Yahoo fraud, Dane gun manufacturing). Employing a 2 x 4 factorial design across 1,440 evaluations, we tested the non-linear interaction between language (English vs. Hausa) and temporal framing. Our results challenge the prevailing multilingual safety gap narrative. Instead of a simple degradation in low-resource settings, we identified a mechanism of Complex Interference where safety is determined by the intersection of variables. While models exhibited a Reverse Linguistic, with Claude 4.5 Opus proving significantly safer in Hausa (45.0%) than in English (36.7%) due to uncertainty-driven refusal, they suffered catastrophic failures in temporal reasoning. We report a profound Temporal Asymmetry, where past-tense framing bypassed defenses (15.6% safe) while future-tense scenarios triggered hyper-conservative refusals (57.2% safe). The magnitude of this volatility is illustrated by a 9.2x disparity between the safest and most vulnerable configurations, proving that safety is not a fixed property but a context-dependent state. We conclude that current models rely on superficial heuristics rather than robust semantic understanding, creating Safety Pockets that leave Global South users exposed to localized harms. We propose Invariant Alignment as a necessary paradigm shift to ensure safety stability across linguistic and temporal shifts.
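To make the 2 x 4 factorial design concrete, the sketch below enumerates the language-by-framing cells and divides the 1,440 evaluations among them. It assumes, without confirmation from the abstract, that the evaluations are spread evenly over cells and over the three models. Only the past- and future-tense framings are named in the abstract, so the other two framing labels are placeholders.

```python
from itertools import product

LANGUAGES = ["English", "Hausa"]
# Only "past" and "future" appear in the abstract; the remaining two
# framing levels are unnamed there, so these labels are placeholders.
TEMPORAL_FRAMINGS = ["past", "future", "framing_3 (placeholder)", "framing_4 (placeholder)"]
MODELS = ["GPT-5.1", "Gemini 3 Pro", "Claude 4.5 Opus"]
TOTAL_EVALUATIONS = 1440

# Every combination of one language with one temporal framing is a design cell.
cells = list(product(LANGUAGES, TEMPORAL_FRAMINGS))

# Assumption: a balanced design, i.e. equal counts per cell and per model.
per_cell = TOTAL_EVALUATIONS // len(cells)
per_cell_per_model = per_cell // len(MODELS)

print(f"{len(cells)} cells, {per_cell} evaluations per cell, "
      f"{per_cell_per_model} per model per cell")
```

Under the balance assumption this gives 8 cells with 180 evaluations each, or 60 per model per cell. The paper's actual allocation may differ.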

cs.AI [Back]

[47] ROAD: Reflective Optimization via Automated Debugging for Zero-Shot Agent Alignment

Natchaya Temyingyong, Daman Jain, Neeraj Kumarsahu, Prabhat Kumar, Rachata Phondi, Wachiravit Modecrua, Krittanon Kaewtawee, Krittin Pachtrachai, Touchapon Kraisingkorn

🧩 TL;DR

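This paper presents ROAD, a framework that optimizes LLM prompts without a labeled dataset. By mimicking a human engineer's debugging loop, it converts failure logs into structured decision protocols and improves agent performance.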


📘 Detailed Summary

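Motivation: Current automatic prompt optimization methods rely heavily on large labeled development sets to compute fitness scores. In real-world software engineering, and especially during the cold start of agent development, engineers usually have only messy production logs and evolving failure modes rather than curated datasets.

Method: ROAD uses a multi-agent architecture that treats optimization as a dynamic debugging investigation rather than a stochastic search. It comprises an Analyzer for root-cause analysis, an Optimizer for pattern aggregation, and a Coach for strategy integration, and it converts unstructured failure logs into robust, structured Decision Tree Protocols.

Result: Experiments on a standardized academic benchmark and a live production Knowledge Management engine show that ROAD is highly sample-efficient. Within three automated iterations, the success rate rose from 73.6% to 79.2% (a gain of 5.6 percentage points) and search accuracy rose by 3.8%. On complex reasoning tasks in the retail domain, ROAD improved agent performance by approximately 19% relative to the baseline.

Conclusion: Mimicking the human engineering loop of failure analysis and patching offers a viable, data-efficient alternative to resource-intensive reinforcement learning for deploying reliable LLM agents, particularly in practical settings that lack labeled data.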


📄 Abstract

Automatic Prompt Optimization (APO) has emerged as a critical technique for enhancing Large Language Model (LLM) performance, yet current state-of-the-art methods typically rely on large, labeled gold-standard development sets to compute fitness scores for evolutionary or Reinforcement Learning (RL) approaches. In real-world software engineering, however, such curated datasets are rarely available during the initial cold start of agent development, where engineers instead face messy production logs and evolving failure modes. We present ROAD (Reflective Optimization via Automated Debugging), a novel framework that bypasses the need for refined datasets by treating optimization as a dynamic debugging investigation rather than a stochastic search. Unlike traditional mutation strategies, ROAD utilizes a specialized multi-agent architecture, comprising an Analyzer for root-cause analysis, an Optimizer for pattern aggregation, and a Coach for strategy integration, to convert unstructured failure logs into robust, structured Decision Tree Protocols. We evaluated ROAD across both a standardized academic benchmark and a live production Knowledge Management engine. Experimental results demonstrate that ROAD is highly sample-efficient, achieving a 5.6 percentage-point increase in success rate (73.6 percent to 79.2 percent) and a 3.8 percent increase in search accuracy within just three automated iterations. Furthermore, on complex reasoning tasks in the retail domain, ROAD improved agent performance by approximately 19 percent relative to the baseline. These findings suggest that mimicking the human engineering loop of failure analysis and patching offers a viable, data-efficient alternative to resource-intensive RL training for deploying reliable LLM agents.
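The success-rate figures quoted above (73.6 to 79.2) can be read either as an absolute gain in percentage points or as a relative gain over the starting rate. The short sketch below computes both; the helper names are illustrative only.

```python
def absolute_gain(before, after):
    """Difference between two rates, in percentage points."""
    return after - before


def relative_gain(before, after):
    """Improvement expressed as a percentage of the starting rate."""
    return (after - before) / before * 100.0


before, after = 73.6, 79.2  # success rates reported in the abstract
print(f"absolute: {absolute_gain(before, after):.1f} percentage points")
print(f"relative: {relative_gain(before, after):.1f}% over the starting rate")
```

The reported 5.6 matches the absolute difference, while the relative improvement works out to about 7.6%. The roughly 19 percent retail-domain figure is stated as relative to the baseline, so the two numbers are not directly comparable.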

[48] Multi-modal cross-domain mixed fusion model with dual disentanglement for fault diagnosis under unseen working conditions

Pengcheng Xia, Yixiang Huang, Chengjin Qin, Chengliang Liu

🧩 TL;DR

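This paper proposes a multi-modal cross-domain mixed fusion model. A dual disentanglement framework separates modality-invariant from modality-specific features and domain-invariant from domain-specific representations. Combined with a cross-domain mixed fusion strategy and a triple-modal fusion mechanism, it improves the generalization of machinery fault diagnosis under unseen working conditions.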


📘 Detailed Summary

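Motivation: Existing intelligent fault diagnosis methods face two challenges in real-world scenarios. Performance declines significantly under unseen working conditions, while domain adaptation approaches depend on target-domain samples. In addition, most studies rely on single-modal sensing signals and overlook how complementary multi-modal information can improve generalization.

Method: The paper proposes a multi-modal cross-domain mixed fusion model. A dual disentanglement framework decouples modality-invariant and modality-specific features, as well as domain-invariant and domain-specific representations. A cross-domain mixed fusion strategy randomly mixes modality information across domains to augment modality and domain diversity. A triple-modal fusion mechanism adaptively integrates heterogeneous multi-modal information.

Result: In extensive experiments on induction motor fault diagnosis, the method outperforms advanced methods under both unseen constant and unseen time-varying working conditions. Comprehensive ablation studies further verify the effectiveness of each proposed component and of multi-modal fusion.

Conclusion: Through dual disentanglement and cross-domain mixed fusion, the study addresses domain generalization in multi-modal fault diagnosis and provides a more robust approach to ensuring machinery reliability. The framework is general and could extend to other industrial diagnosis scenarios that require multi-modal fusion.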


📄 Abstract

Intelligent fault diagnosis has become an indispensable technique for ensuring machinery reliability. However, existing methods suffer significant performance decline in real-world scenarios where models are tested under unseen working conditions, while domain adaptation approaches are limited by their reliance on target domain samples. Moreover, most existing studies rely on single-modal sensing signals, overlooking the complementary nature of multi-modal information for improving model generalization. To address these limitations, this paper proposes a multi-modal cross-domain mixed fusion model with dual disentanglement for fault diagnosis. A dual disentanglement framework is developed to decouple modality-invariant and modality-specific features, as well as domain-invariant and domain-specific representations, enabling both comprehensive multi-modal representation learning and robust domain generalization. A cross-domain mixed fusion strategy is designed to randomly mix modality information across domains for modality and domain diversity augmentation. Furthermore, a triple-modal fusion mechanism is introduced to adaptively integrate multi-modal heterogeneous information. Extensive experiments are conducted on induction motor fault diagnosis under both unseen constant and time-varying working conditions. The results demonstrate that the proposed method consistently outperforms advanced methods, and comprehensive ablation studies further verify the effectiveness of each proposed component and multi-modal fusion. The code is available at: https://github.com/xiapc1996/MMDG.
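The cross-domain mixed fusion strategy is described only as "randomly mix modality information across domains." One simple reading, sketched below, builds an augmented sample by drawing each modality from a randomly chosen source domain. This is an illustrative assumption and not the code from the linked repository. The modality names, the domain names, and the requirement that the mixed samples share a fault label are all hypothetical.

```python
import random


def mix_modalities(samples_by_domain, rng):
    """Build one mixed sample by taking each modality from a random domain.

    samples_by_domain maps a domain name to a dict of modality -> feature list.
    All samples are assumed (hypothetically) to carry the same fault label.
    Returns the mixed sample and a record of which domain supplied each modality.
    """
    domains = list(samples_by_domain)
    modalities = list(samples_by_domain[domains[0]])
    mixed, origin = {}, {}
    for modality in modalities:
        source = rng.choice(domains)
        mixed[modality] = samples_by_domain[source][modality]
        origin[modality] = source
    return mixed, origin


# Toy features for one fault class recorded under two working conditions.
samples = {
    "speed_low": {"vibration": [0.1, 0.2], "current": [1.0, 1.1], "acoustic": [0.5]},
    "speed_high": {"vibration": [0.3, 0.4], "current": [2.0, 2.1], "acoustic": [0.9]},
}

mixed_sample, origin = mix_modalities(samples, random.Random(0))
print(origin)
```

Each call yields a sample whose modalities may come from different working conditions, which is one way to increase the modality and domain diversity seen during training.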

[49] GenZ: Foundational models as latent variable generators within traditional statistical models

Marko Jojic, Nebojsa Jojic

🧩 TL;DR

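This paper presents GenZ, a hybrid approach that bridges foundational models and statistical modeling through interpretable semantic features. It discovers dataset-specific patterns that improve prediction and substantially outperforms a baseline that relies only on a foundational model's domain knowledge.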


📘 Detailed Summary

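Motivation: Large language models possess broad domain knowledge but often fail to capture dataset-specific patterns that are critical for prediction tasks, which limits their usefulness where precise statistical modeling is required.

Method: The method discovers semantic feature descriptions through an iterative process that contrasts groups of items identified via statistical modeling errors, rather than relying solely on the foundational model's domain understanding. This is formulated as a generalized EM algorithm that jointly optimizes semantic feature descriptors and statistical model parameters. A frozen foundational model is prompted to classify items based on the discovered features, and these judgments are treated as noisy observations of latent binary features that predict real-valued targets.

Result: On house price prediction, the model achieves 12% median relative error using semantic features discovered from multimodal listing data, substantially outperforming a GPT-5 baseline (38% error) that relies on the LLM's general domain knowledge. In cold-start collaborative filtering on Netflix movie embeddings, the model predicts collaborative filtering representations with 0.59 cosine similarity purely from semantic descriptions, matching the performance that would require approximately 4000 user ratings through traditional collaborative filtering.

Conclusion: The method bridges the semantic understanding of foundational models and the precision of statistical modeling. The discovered features reveal dataset-specific patterns (e.g., architectural details predicting local housing markets, franchise membership predicting user preferences) that diverge from the model's domain knowledge alone, pointing toward hybrid systems that are both more interpretable and more accurate.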


📄 Abstract

We present GenZ, a hybrid model that bridges foundational models and statistical modeling through interpretable semantic features. While large language models possess broad domain knowledge, they often fail to capture dataset-specific patterns critical for prediction tasks. Our approach addresses this by discovering semantic feature descriptions through an iterative process that contrasts groups of items identified via statistical modeling errors, rather than relying solely on the foundational model's domain understanding. We formulate this as a generalized EM algorithm that jointly optimizes semantic feature descriptors and statistical model parameters. The method prompts a frozen foundational model to classify items based on discovered features, treating these judgments as noisy observations of latent binary features that predict real-valued targets through learned statistical relationships. We demonstrate the approach on two domains: house price prediction (hedonic regression) and cold-start collaborative filtering for movie recommendations. On house prices, our model achieves 12% median relative error using discovered semantic features from multimodal listing data, substantially outperforming a GPT-5 baseline (38% error) that relies on the LLM's general domain knowledge. For Netflix movie embeddings, our model predicts collaborative filtering representations with 0.59 cosine similarity purely from semantic descriptions -- matching the performance that would require approximately 4000 user ratings through traditional collaborative filtering. The discovered features reveal dataset-specific patterns (e.g., architectural details predicting local housing markets, franchise membership predicting user preferences) that diverge from the model's domain knowledge alone.
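The two headline numbers are a median relative error and a cosine similarity. The abstract does not give formulas, so the sketch below uses the standard textbook definitions, which may differ in detail from the authors' evaluation code. The toy prices and vectors are invented for illustration.

```python
import math
from statistics import median


def median_relative_error(predicted, actual):
    """Median of |prediction - truth| / |truth| over all items."""
    return median(abs(p - a) / abs(a) for p, a in zip(predicted, actual))


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy house prices (arbitrary units): relative errors are 0.10, 0.25, 0.10.
predicted_prices = [110.0, 150.0, 330.0]
actual_prices = [100.0, 200.0, 300.0]
mre = median_relative_error(predicted_prices, actual_prices)

# Toy embedding pair standing in for predicted vs. true collaborative filtering vectors.
similarity = cosine_similarity([1.0, 0.0], [1.0, 1.0])

print(f"median relative error: {mre:.2f}")
print(f"cosine similarity: {similarity:.2f}")
```

The median makes the error metric robust to a few badly mispriced listings, which is a common reason to prefer it over the mean in price prediction.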

[50] Semi-Automated Data Annotation in Multisensor Datasets for Autonomous Vehicle Testing

Andrii Gamalii, Daniel Górniak, Robert Nowak, Bartłomiej Olber, Krystian Radlak, Jakub Winter

🧩 TL;DR

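This report presents a semi-automated data annotation pipeline developed within the DARTS project. The pipeline takes a human-in-the-loop approach that combines artificial intelligence with human expertise to reduce the cost and time of annotating a large-scale, multimodal dataset of driving scenarios.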


📘 Detailed Summary

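Motivation: Manually annotating a large-scale, multimodal dataset recorded in Polish driving conditions is costly and time-consuming. The DARTS project needs to remove this bottleneck to speed up data preparation for autonomous vehicle research.

Method: The solution takes a human-in-the-loop approach. 3D object detection algorithms generate preliminary annotations, the system supports iterative model retraining, and it incorporates data anonymization and domain adaptation techniques, forming a complete semi-automated annotation pipeline.

Result: The tools and methodology yield substantial time savings while ensuring consistent, high-quality annotations across different sensor modalities, directly supporting the rapid preparation of a large annotated dataset in the DARTS project.

Conclusion: The work provides an efficient data annotation solution for autonomous vehicle research. The semi-automated pipeline speeds up data preparation while maintaining annotation quality, strengthening the technological base for autonomous vehicle research in Poland.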


📄 Abstract

This report presents the design and implementation of a semi-automated data annotation pipeline developed within the DARTS project, whose goal is to create a large-scale, multimodal dataset of driving scenarios recorded in Polish conditions. Manual annotation of such heterogeneous data is both costly and time-consuming. To address this challenge, the proposed solution adopts a human-in-the-loop approach that combines artificial intelligence with human expertise to reduce annotation cost and duration. The system automatically generates initial annotations, enables iterative model retraining, and incorporates data anonymization and domain adaptation techniques. At its core, the tool relies on 3D object detection algorithms to produce preliminary annotations. Overall, the developed tools and methodology result in substantial time savings while ensuring consistent, high-quality annotations across different sensor modalities. The solution directly supports the DARTS project by accelerating the preparation of a large annotated dataset in the project's standardized format, strengthening the technological base for autonomous vehicle research in Poland.