Table of Contents

cs.CV

[1] Multi-task Cross-modal Learning for Chest X-ray Image Retrieval

Zhaohui Liang, Sivaramakrishnan Rajaraman, Niccolo Marini, Zhiyun Xue, Sameer Antani

🧩 TL;DR

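This study proposes a multi-task learning framework for fine-tuning BiomedCLIP to improve cross-modal retrieval between chest X-ray images and text reports. Combining classification, supervised contrastive, and CLIP losses yields more balanced and clinically relevant retrieval results.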


📘 Detailed Summary

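Motivation: Vision-language foundation models such as CLIP and BiomedCLIP provide strong cross-modal embeddings, but they are not optimized for fine-grained medical retrieval tasks, such as retrieving clinically relevant radiology reports from chest X-ray image queries. Their domain adaptation is insufficient for this setting.

Method: Using BiomedCLIP as the backbone, the study adds a lightweight MLP projection head trained with a multi-task composite loss comprising a binary cross-entropy loss that distinguishes normal from abnormal chest X-ray studies, a supervised contrastive loss that reinforces intra-class consistency, and a CLIP loss that maintains cross-modal alignment.

Result: The fine-tuned model achieves more balanced and clinically meaningful performance than the pretrained BiomedCLIP and general-purpose CLIP models on both image-to-text and text-to-image retrieval. t-SNE visualizations show clearer semantic clustering of normal and abnormal cases, indicating enhanced diagnostic sensitivity.

Conclusion: The study highlights the value of domain-adaptive multi-task learning for biomedical cross-modal retrieval. Combining domain-specific tasks with contrastive objectives can improve foundation-model performance on fine-grained medical retrieval and offers an effective optimization framework for medical image analysis.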


📄 Abstract

CLIP and BiomedCLIP are examples of vision-language foundation models and offer strong cross-modal embeddings; however, they are not optimized for fine-grained medical retrieval tasks, such as retrieving clinically relevant radiology reports using chest X-ray (CXR) image queries. To address this shortcoming, we propose a multi-task learning framework to fine-tune BiomedCLIP and evaluate improvements to CXR image-text retrieval. Using BiomedCLIP as the backbone, we incorporate a lightweight MLP projector head trained with a multi-task composite loss function that includes: (1) a binary cross-entropy loss to distinguish normal from abnormal CXR studies, (2) a supervised contrastive loss to reinforce intra-class consistency, and (3) a CLIP loss to maintain cross-modal alignment. Experimental results demonstrate that the fine-tuned model achieves more balanced and clinically meaningful performance across both image-to-text and text-to-image retrieval tasks compared to the pretrained BiomedCLIP and general-purpose CLIP models. Furthermore, t-SNE visualizations reveal clearer semantic clustering of normal and abnormal cases, demonstrating the model's enhanced diagnostic sensitivity. These findings highlight the value of domain-adaptive, multi-task learning for advancing cross-modal retrieval in biomedical applications.
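The three-term objective described in the abstract can be sketched as a weighted sum. This is a minimal illustration in plain Python, not the authors' implementation: the weights `w_bce`, `w_sup`, `w_clip`, the temperature value, and all function names are assumptions, and the abstract does not say how the terms are balanced.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]

def log_softmax_at(logits, idx):
    """Log-softmax of logits, evaluated at position idx (numerically stable)."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[idx] - lse

def bce(prob_abnormal, label):
    """Binary cross-entropy for one study (label 1 = abnormal, 0 = normal)."""
    p = min(max(prob_abnormal, 1e-7), 1 - 1e-7)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

def clip_loss(img_embs, txt_embs, temperature=0.07):
    """Symmetric InfoNCE over matched pairs: image i belongs with report i."""
    imgs = [normalize(v) for v in img_embs]
    txts = [normalize(v) for v in txt_embs]
    n = len(imgs)
    sims = [[dot(i, t) / temperature for t in txts] for i in imgs]
    i2t = -sum(log_softmax_at(sims[k], k) for k in range(n)) / n
    t2i = -sum(log_softmax_at([sims[r][k] for r in range(n)], k)
               for k in range(n)) / n
    return 0.5 * (i2t + t2i)

def supcon_loss(embs, labels, temperature=0.07):
    """Supervised contrastive loss: pull together embeddings sharing a label."""
    z = [normalize(v) for v in embs]
    n = len(z)
    total, counted = 0.0, 0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        positives = [j for j in others if labels[j] == labels[i]]
        if not positives:
            continue  # anchors without a same-class partner are skipped
        logits = [dot(z[i], z[j]) / temperature for j in others]
        total += -sum(log_softmax_at(logits, others.index(j))
                      for j in positives) / len(positives)
        counted += 1
    return total / max(counted, 1)

def composite_loss(probs, labels, img_embs, txt_embs,
                   w_bce=1.0, w_sup=1.0, w_clip=1.0):
    """Weighted sum of the three terms; the weights are hypothetical."""
    l_bce = sum(bce(p, y) for p, y in zip(probs, labels)) / len(labels)
    return (w_bce * l_bce
            + w_sup * supcon_loss(img_embs, labels)
            + w_clip * clip_loss(img_embs, txt_embs))
```

In practice each term would be computed on batched tensors from the projector head; the scalar version here only makes the structure of the objective explicit.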

[2] Thinking with Map: Reinforced Parallel Map-Augmented Agent for Geolocalization

Yuxiang Ji, Yong Wang, Ziyu Ma, Yiming Hu, Hailang Huang, Xuecai Hu, Guanhua Chen, Liaoni Wu, Xiangxiang Chu

🧩 TL;DR

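This paper proposes an image geolocalization method that thinks with maps. An agent reasons in a loop inside a map environment and is trained with a two-stage optimization scheme, which substantially improves geolocalization accuracy. The paper also introduces MAPBench, a benchmark built from real-world images.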


📘 Detailed Summary

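Motivation: Existing large vision-language models use world knowledge, chain-of-thought reasoning, and agentic capabilities for geolocalization, but they overlook a strategy humans commonly rely on: consulting maps. This limits localization accuracy and motivates methods that incorporate map-based reasoning.

Method: The method first gives the model a "Thinking with Map" ability, formulated as an agent-in-the-map loop, and then applies a two-stage optimization scheme. The first stage uses agentic reinforcement learning to improve the model's sampling efficiency; the second uses parallel test-time scaling so that the model explores multiple candidate paths before making its final prediction.

Result: The method outperforms existing open- and closed-source models on most metrics. In particular, Acc@500m rises from 8.0% for Gemini-3-Pro to 22.1%. The work also establishes MAPBench, a comprehensive geolocalization training and evaluation benchmark composed entirely of real-world images.

Conclusion: The study shows the importance of integrating map reasoning into vision-language models. The two-stage optimization framework effectively improves geolocalization performance, and MAPBench provides an evaluation standard closer to real-world use, pointing to a new direction for intelligent geolocalization systems.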


📄 Abstract

The image geolocalization task aims to predict the location where an image was taken anywhere on Earth using visual clues. Existing large vision-language model (LVLM) approaches leverage world knowledge, chain-of-thought reasoning, and agentic capabilities, but overlook a common strategy used by humans: using maps. In this work, we first equip the model with the Thinking with Map ability and formulate it as an agent-in-the-map loop. We develop a two-stage optimization scheme for it, including agentic reinforcement learning (RL) followed by parallel test-time scaling (TTS). The RL strengthens the agentic capability of the model to improve sampling efficiency, and the parallel TTS enables the model to explore multiple candidate paths before making the final prediction, which is crucial for geolocalization. To evaluate our method on up-to-date and in-the-wild images, we further present MAPBench, a comprehensive geolocalization training and evaluation benchmark composed entirely of real-world images. Experimental results show that our method outperforms existing open- and closed-source models on most metrics, specifically improving Acc@500m from 8.0% to 22.1% compared to Gemini-3-Pro with Google Search/Map grounded mode.
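A threshold metric such as Acc@500m counts a prediction as correct when it falls within 500 m of the ground-truth location. The sketch below assumes great-circle (haversine) distance on a spherical Earth; the abstract does not state which distance the benchmark uses, and the function names are illustrative.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters (spherical model)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def acc_at(preds, truths, threshold_m=500.0):
    """Fraction of predictions within threshold_m meters of the ground truth."""
    hits = sum(haversine_m(*p, *t) <= threshold_m for p, t in zip(preds, truths))
    return hits / len(truths)
```

For example, a prediction 0.001 degrees of latitude away from the truth is roughly 111 m off and counts as a hit at the 500 m threshold, while one a full degree away does not.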

[3] ROAP: A Reading-Order and Attention-Prior Pipeline for Optimizing Layout Transformers in Key Information Extraction

Tingwei Xie, Jinxin He, Yonghong Song

🧩 TL;DR

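This paper proposes ROAP, a lightweight, architecture-agnostic pipeline that optimizes the attention distributions of Layout Transformers by explicitly modeling reading order and suppressing visual noise, improving performance on visually-rich document understanding tasks.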


📘 Detailed Summary

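Motivation: The effectiveness of multimodal Transformers in visually-rich document understanding is severely constrained by two inherent limitations: the lack of explicit modeling of logical reading order, and interference from visual tokens that dilutes attention on textual semantics.

Method: The ROAP pipeline first uses an Adaptive-XY-Gap Tree to robustly extract hierarchical reading sequences from complex layouts. It then integrates these sequences into the attention mechanism through a reading-order-aware relative position bias, and introduces a textual-token sub-block attention prior that adaptively suppresses visual noise and strengthens fine-grained text-text interactions.

Result: Extensive experiments on the FUNSD and CORD benchmarks show that ROAP consistently improves representative backbones, including LayoutLMv3 and GeoLayoutLM, confirming its effectiveness on complex document understanding tasks.

Conclusion: The study confirms that explicitly modeling reading logic and regulating modality interference are critical for robust document understanding. ROAP offers a scalable solution for complex layout analysis while leaving the pre-trained backbone architecture unchanged, keeping the optimization lightweight.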


📄 Abstract

The efficacy of Multimodal Transformers in visually-rich document understanding (VrDU) is critically constrained by two inherent limitations: the lack of explicit modeling for logical reading order and the interference of visual tokens that dilutes attention on textual semantics. To address these challenges, this paper presents ROAP, a lightweight and architecture-agnostic pipeline designed to optimize attention distributions in Layout Transformers without altering their pre-trained backbones. The proposed pipeline first employs an Adaptive-XY-Gap Tree (AXG-Tree) to robustly extract hierarchical reading sequences from complex layouts. These sequences are then integrated into the attention mechanism via a Reading-Order-Aware Relative Position Bias (RO-RPB). Furthermore, a Textual-Token Sub-block Attention Prior (TT-Prior) is introduced to adaptively suppress visual noise and enhance fine-grained text-text interactions. Extensive experiments on the FUNSD and CORD benchmarks demonstrate that ROAP consistently improves the performance of representative backbones, including LayoutLMv3 and GeoLayoutLM. These findings confirm that explicitly modeling reading logic and regulating modality interference are critical for robust document understanding, offering a scalable solution for complex layout analysis. The implementation code will be released at https://github.com/KevinYuLei/ROAP.
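One way to picture a reading-order-aware relative position bias is as an additive attention term looked up from the clipped difference in reading-order rank between two tokens. The sketch below is a generic relative-bias construction under that assumption, not the paper's RO-RPB; `bias_table` stands in for learned parameters and `max_dist` is a hypothetical clipping range.

```python
def reading_order_bias(order, bias_table, max_dist):
    """Return an n x n matrix of additive attention biases.

    order[i] is the reading-order rank of token i (for example, produced by a
    layout parser). bias_table has 2 * max_dist + 1 entries and is indexed by
    the relative rank order[j] - order[i], clipped to [-max_dist, max_dist].
    """
    assert len(bias_table) == 2 * max_dist + 1
    n = len(order)
    bias = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            rel = max(-max_dist, min(max_dist, order[j] - order[i]))
            bias[i][j] = bias_table[rel + max_dist]
    return bias
```

The resulting matrix would be added to the attention logits before the softmax, so tokens that are adjacent in reading order can be favored even when they are far apart in the raw token sequence.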

[4] Multi-Image Super Resolution Framework for Detection and Analysis of Plant Roots

Shubham Agarwal, Ofek Nourian, Michael Sidorov, Sharon Chemweno, Ofer Hadar, Naftali Lazarovitch, Jhonathan E. Ephrath

🧩 TL;DR

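This paper proposes a new multi-image super-resolution framework for underground plant root imaging. It integrates multiple overlapping views to enhance root visibility and detail, supporting root phenotyping and accurate estimation of key traits.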


📘 Detailed Summary

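Motivation: Imaging plant roots underground remains a persistent challenge. Adverse conditions such as occlusion, varying soil moisture, and inherently low contrast limit the effectiveness of conventional vision-based methods and hinder accurate visualization and analysis of root systems.

Method: The study proposes a new underground imaging system that captures multiple overlapping views of plant roots and integrates a deep learning-based multi-image super-resolution framework. The framework exploits spatial redundancy across views to reconstruct high-resolution images, and a synthetic dataset simulating realistic underground imaging scenarios is built for training and evaluation.

Result: Quantitative evaluation shows that the method outperforms state-of-the-art super-resolution baselines, achieving a 2.3% reduction in BRISQUE at the same CLIP-IQA score. This indicates improved image quality and enables enhanced phenotypic analysis of root systems.

Conclusion: The framework is a promising direction for robust, automatic underground root imaging and trait quantification in agricultural and ecological research. It facilitates accurate estimation of key root traits such as root hair count and root hair density, advancing research on soil-plant interactions and nutrient uptake.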


📄 Abstract

Understanding plant root systems is critical for advancing research in soil-plant interactions, nutrient uptake, and overall plant health. However, accurate imaging of roots in subterranean environments remains a persistent challenge due to adverse conditions such as occlusion, varying soil moisture, and inherently low contrast, which limit the effectiveness of conventional vision-based approaches. In this work, we propose a novel underground imaging system that captures multiple overlapping views of plant roots and integrates a deep learning-based Multi-Image Super Resolution (MISR) framework designed to enhance root visibility and detail. To train and evaluate our approach, we construct a synthetic dataset that simulates realistic underground imaging scenarios, incorporating key environmental factors that affect image quality. Our proposed MISR algorithm leverages spatial redundancy across views to reconstruct high-resolution images with improved structural fidelity and visual clarity. Quantitative evaluations show that our approach outperforms state-of-the-art super resolution baselines, achieving a 2.3 percent reduction in BRISQUE, indicating improved image quality with the same CLIP-IQA score, thereby enabling enhanced phenotypic analysis of root systems. This, in turn, facilitates accurate estimation of critical root traits, including root hair count and root hair density. The proposed framework presents a promising direction for robust automatic underground plant root imaging and trait quantification for agricultural and ecological research.

[5] MMViR: A Multi-Modal and Multi-Granularity Representation for Long-range Video Understanding

Zizhong Li, Haopeng Zhang, Jiawei Zhang

🧩 TL;DR

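This paper proposes MMViR, a multi-modal, multi-granularity structured representation for long video understanding. It identifies key turning points and builds a three-level description that preserves the global narrative while capturing fine-grained visual details, substantially improving long video understanding while reducing computational overhead.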


📘 Detailed Summary

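Motivation: Current multi-modal large language models face major challenges with long videos lasting minutes to hours, including complex events, diverse scenes, and long-range dependencies. Direct encoding is computationally too expensive, while simple video-to-text conversion often produces redundant or fragmented content, so an efficient structured representation is needed.

Method: MMViR segments the video at key turning points and builds a three-level description structure that couples the global narrative with fine-grained visual details. This multi-granularity representation supports efficient query-based retrieval, generalizes well across scenarios, and avoids the computational burden of direct encoding.

Result: Extensive evaluation on three tasks (question answering, summarization, and retrieval) shows that MMViR surpasses the prior strongest method, achieving a 19.67% improvement on hour-long video understanding while reducing processing latency to 45.4% of the original.

Conclusion: MMViR offers an efficient structured-representation paradigm for long video understanding. Its coupled multi-granularity descriptions address redundancy and fragmentation in long video content, and its query-based retrieval mechanism provides a practical solution for large-scale video analysis and a useful reference for future multi-modal video understanding systems.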


📄 Abstract

Long videos, ranging from minutes to hours, present significant challenges for current Multi-modal Large Language Models (MLLMs) due to their complex events, diverse scenes, and long-range dependencies. Direct encoding of such videos is computationally too expensive, while simple video-to-text conversion often results in redundant or fragmented content. To address these limitations, we introduce MMViR, a novel multi-modal, multi-grained structured representation for long video understanding. MMViR identifies key turning points to segment the video and constructs a three-level description that couples global narratives with fine-grained visual details. This design supports efficient query-based retrieval and generalizes well across various scenarios. Extensive evaluations across three tasks, including QA, summarization, and retrieval, show that MMViR outperforms the prior strongest method, achieving a 19.67% improvement in hour-long video understanding while reducing processing latency to 45.4% of the original.
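The abstract does not say how turning points are detected. As a purely illustrative stand-in, the sketch below cuts a video wherever the cosine distance between consecutive frame features exceeds a threshold; the feature vectors, the threshold value, and the function names are all assumptions rather than the MMViR procedure.

```python
import math

def cosine(u, v):
    """Cosine similarity, returning 0.0 if either vector has zero norm."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def segment_at_turning_points(frame_feats, threshold=0.5):
    """Group frame indices into segments, cutting where consecutive
    frame features differ by more than `threshold` in cosine distance."""
    if not frame_feats:
        return []
    segments, current = [], [0]
    for t in range(1, len(frame_feats)):
        if 1.0 - cosine(frame_feats[t - 1], frame_feats[t]) > threshold:
            segments.append(current)  # close the segment at the turning point
            current = []
        current.append(t)
    segments.append(current)
    return segments
```

Each resulting segment could then be described at several granularities (for instance a global summary, a per-segment summary, and per-frame details), which is the kind of structure a query-based retriever can index.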

[6] Enabling Stroke-Level Structural Analysis of Hieroglyphic Scripts without Language-Specific Priors

Fuwen Luo, Zihao Wan, Ziyue Wang, Yaluo Liu, Pau Tong Lin Xu, Xuanjia Qiao, Xiaolong Wang, Peng Li, Yang Liu

🧩 TL;DR

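This paper proposes Hieroglyphic Stroke Analyzer (HieroSA), a novel and generalizable framework that enables multimodal large language models to automatically extract stroke-level structures from character bitmaps without manually annotated data, giving them a deeper understanding of the internal structure of hieroglyphic characters.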


📘 Detailed Summary

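Motivation: Current advanced LLMs and multimodal LLMs are structurally blind when processing hieroglyphic scripts: LLMs treat characters as textual tokens and MLLMs treat them as raw pixel grids, so neither models the underlying logic of character strokes. Existing structural analysis methods are also typically script-specific and labor-intensive, with no general solution.

Method: HieroSA converts character images from modern logographic and ancient hieroglyphic scripts into explicit, interpretable line-segment representations in a normalized coordinate space. It requires no handcrafted data, derives stroke-level structures automatically, generalizes across languages, and gives MLLMs structural awareness.

Result: Extensive experiments show that HieroSA effectively captures character-internal structure and semantics without language-specific priors. The results indicate the framework's potential as a graphematic analysis tool for deeper understanding of hieroglyphic scripts.

Conclusion: The study provides a general, automated solution for analyzing hieroglyphic scripts, overcoming the limitations of existing methods in structural modeling and cross-lingual generalization. HieroSA demonstrates the potential of MLLMs for glyph structure analysis and offers new tools and methodology for philology and computational linguistics.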


📄 Abstract

Hieroglyphs, as logographic writing systems, encode rich semantic and cultural information within their internal structural composition. Yet, current advanced Large Language Models (LLMs) and Multimodal LLMs (MLLMs) usually remain structurally blind to this information. LLMs process characters as textual tokens, while MLLMs additionally view them as raw pixel grids. Both fall short of modeling the underlying logic of character strokes. Furthermore, existing structural analysis methods are often script-specific and labor-intensive. In this paper, we propose Hieroglyphic Stroke Analyzer (HieroSA), a novel and generalizable framework that enables MLLMs to automatically derive stroke-level structures from character bitmaps without handcrafted data. It transforms character images from modern logographic and ancient hieroglyphic scripts into explicit, interpretable line-segment representations in a normalized coordinate space, allowing for cross-lingual generalization. Extensive experiments demonstrate that HieroSA effectively captures character-internal structures and semantics, bypassing the need for language-specific priors. Experimental results highlight the potential of our work as a graphematics analysis tool for a deeper understanding of hieroglyphic scripts. View our code at https://github.com/THUNLP-MT/HieroSA.
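A line-segment representation in a normalized coordinate space can be illustrated by rescaling pixel-space stroke endpoints so that bitmaps of different resolutions share one frame of reference. The sketch assumes a simple mapping to the unit square; the exact normalization used by HieroSA is not given in the abstract, and the function name is hypothetical.

```python
def normalize_segments(segments, width, height):
    """Map pixel-space segments ((x1, y1), (x2, y2)) into [0, 1] x [0, 1].

    Pixel (0, 0) maps to (0.0, 0.0) and pixel (width - 1, height - 1)
    maps to (1.0, 1.0), so the result is independent of bitmap size.
    """
    sx, sy = max(width - 1, 1), max(height - 1, 1)  # guard 1-pixel bitmaps
    return [((x1 / sx, y1 / sy), (x2 / sx, y2 / sy))
            for (x1, y1), (x2, y2) in segments]
```

With this convention, a diagonal stroke across a 64 x 64 bitmap and the same stroke across a 128 x 128 bitmap yield identical segments, which is what makes the representation comparable across scripts and rendering sizes.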

[7] SAS-VPReID: A Scale-Adaptive Framework with Shape Priors for Video-based Person Re-Identification at Extreme Far Distances

Qiwei Yang, Pingping Zhang, Yuhao Wang, Zijing Gong

🧩 TL;DR

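This paper proposes the SAS-VPReID framework, which uses a scale-adaptive mechanism and shape priors to address resolution degradation, viewpoint variation, and appearance noise in far-distance video-based person re-identification, achieving leading performance on the VReID-XFD benchmark.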


📘 Detailed Summary

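Motivation: Far-distance video-based person re-identification faces severe resolution degradation, drastic viewpoint variation, and unavoidable appearance noise. These factors make feature representations hard to discriminate and limit existing methods in extreme far-distance scenarios.

Method: SAS-VPReID comprises three complementary modules: a memory-enhanced visual backbone, built on the CLIP vision encoder and multi-proxy memory, that extracts discriminative feature representations; a multi-granularity temporal modeling module that constructs sequences at different temporal granularities and adaptively emphasizes motion cues across scales; and a prior-regularized shape dynamics module that captures body structure dynamics.

Result: Experiments on the VReID-XFD benchmark validate the effectiveness of each module, and the final framework ranks first on the VReID-XFD challenge leaderboard.

Conclusion: The study shows that combining a scale-adaptive mechanism with shape priors effectively addresses the challenges of far-distance video-based person re-identification. The modules work together to produce more discriminative feature representations, suggesting a technical approach for visual recognition under extreme conditions.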


📄 Abstract

Video-based Person Re-IDentification (VPReID) aims to retrieve the same person from videos captured by non-overlapping cameras. At extreme far distances, VPReID is highly challenging due to severe resolution degradation, drastic viewpoint variation, and inevitable appearance noise. To address these issues, we propose a Scale-Adaptive framework with Shape Priors for VPReID, named SAS-VPReID. The framework is built upon three complementary modules. First, we deploy a Memory-Enhanced Visual Backbone (MEVB) to extract discriminative feature representations, which leverages the CLIP vision encoder and multi-proxy memory. Second, we propose a Multi-Granularity Temporal Modeling (MGTM) module to construct sequences at multiple temporal granularities and adaptively emphasize motion cues across scales. Third, we incorporate Prior-Regularized Shape Dynamics (PRSD) to capture body structure dynamics. With these modules, our framework can obtain more discriminative feature representations. Experiments on the VReID-XFD benchmark demonstrate the effectiveness of each module, and our final framework ranks first on the VReID-XFD challenge leaderboard. The source code is available at https://github.com/YangQiWei3/SAS-VPReID.

[8] DIFF-MF: A Difference-Driven Channel-Spatial State Space Model for Multi-Modal Image Fusion

Yiming Sun, Zifan Ye, Qinghua Hu, Pengfei Zhu

🧩 TL;DR

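This paper proposes DIFF-MF, a difference-driven channel-spatial state space model that addresses the balance between infrared intensity and visible detail in multi-modal image fusion, using cross-modal feature differences to guide better integration of complementary information.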


📘 Detailed Summary

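Motivation: Existing state-space-model-based methods for multi-modal image fusion have a clear limitation: they tend either to over-prioritize infrared intensity at the cost of visible details, or to preserve visible structure while weakening thermal target salience. This imbalance limits the quality and usefulness of fused images.

Method: DIFF-MF uses a difference-driven channel-spatial state space model in which feature discrepancy maps between modalities guide feature extraction. In the channel dimension, a channel-exchange module performs adaptive feature reweighting through cross-attention dual state space modeling. In the spatial dimension, a spatial-exchange module uses cross-modal state space scanning for comprehensive spatial fusion. Global dependencies are captured efficiently while computational complexity stays linear.

Result: Experiments on driving-scenario and low-altitude UAV datasets show that DIFF-MF outperforms existing methods in both visual quality and quantitative evaluation, demonstrating its strength in integrating complementary multi-modal features.

Conclusion: Difference-driven channel-spatial state space modeling effectively addresses the balance problem in multi-modal image fusion and offers a way to capture global dependencies at linear computational cost, with practical value for applications such as autonomous driving and UAV surveillance.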


📄 Abstract

Multi-modal image fusion aims to integrate complementary information from multiple source images to produce high-quality fused images with enriched content. Although existing approaches based on state space models have achieved satisfactory performance with high computational efficiency, they tend to either over-prioritize infrared intensity at the cost of visible details, or conversely, preserve visible structure while diminishing thermal target salience. To overcome these challenges, we propose DIFF-MF, a novel difference-driven channel-spatial state space model for multi-modal image fusion. Our approach leverages feature discrepancy maps between modalities to guide feature extraction, followed by a fusion process across both channel and spatial dimensions. In the channel dimension, a channel-exchange module enhances channel-wise interaction through cross-attention dual state space modeling, enabling adaptive feature reweighting. In the spatial dimension, a spatial-exchange module employs cross-modal state space scanning to achieve comprehensive spatial fusion. By efficiently capturing global dependencies while maintaining linear computational complexity, DIFF-MF effectively integrates complementary multi-modal features. Experimental results on driving-scenario and low-altitude UAV datasets demonstrate that our method outperforms existing approaches in both visual quality and quantitative evaluation.

[9] MoGen: A Unified Collaborative Framework for Controllable Multi-Object Image Generation

Yanfeng Li, Yue Sun, Keren Fu, Sio-Kei Im, Xiaoming Liu, Guangtao Zhai, Xiaohong Liu, Tao Tan

🧩 TL;DR

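This paper proposes MoGen, a user-friendly multi-object image generation method. A Regional Semantic Anchor module and an Adaptive Multi-modal Guidance module give precise control over object quantities and attributes, substantially improving generation quality and consistency.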


📘 Detailed Summary

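Motivation: Existing multi-object image generation methods struggle to align language descriptions precisely with localized image regions, often producing inconsistent object counts and attribute confusion. Mainstream methods rely on external control signals to constrain spatial layout and visual attributes, but this strong dependency makes the input format rigid and unable to accommodate users' heterogeneous resources and diverse constraint requirements.

Method: MoGen first introduces a Regional Semantic Anchor module that, during generation, anchors phrase units in the language description to their corresponding image regions, enabling text-to-image generation that follows quantity specifications for multiple objects. It then adds an Adaptive Multi-modal Guidance module that adaptively parses and integrates various combinations of multi-source control signals into a structured intent, which guides selective constraints on scene layout and object attributes for dynamic fine-grained control.

Result: Experiments show that MoGen significantly outperforms existing methods in generation quality, quantity consistency, and fine-grained control, while offering superior accessibility and control flexibility. It effectively addresses quantity alignment and attribute control in multi-object generation.

Conclusion: Through regional semantic anchoring and adaptive multi-modal guidance, MoGen provides a more flexible and precise control framework for multi-object image generation. It reduces dependence on external control signals, improves generality and user-friendliness, and opens a new direction for fine-grained controllable image generation.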


📄 Abstract

Existing multi-object image generation methods face difficulties in achieving precise alignment between localized image generation regions and their corresponding semantics based on language descriptions, frequently resulting in inconsistent object quantities and attribute aliasing. To mitigate this limitation, mainstream approaches typically rely on external control signals to explicitly constrain the spatial layout, local semantics, and visual attributes of images. However, this strong dependency makes the input format rigid, rendering it incompatible with the heterogeneous resource conditions of users and diverse constraint requirements. To address these challenges, we propose MoGen, a user-friendly multi-object image generation method. First, we design a Regional Semantic Anchor (RSA) module that precisely anchors phrase units in language descriptions to their corresponding image regions during the generation process, enabling text-to-image generation that follows quantity specifications for multiple objects. Building upon this foundation, we further introduce an Adaptive Multi-modal Guidance (AMG) module, which adaptively parses and integrates various combinations of multi-source control signals to formulate corresponding structured intent. This intent subsequently guides selective constraints on scene layouts and object attributes, achieving dynamic fine-grained control. Experimental results demonstrate that MoGen significantly outperforms existing methods in generation quality, quantity consistency, and fine-grained control, while exhibiting superior accessibility and control flexibility. Code is available at: https://github.com/Tear-kitty/MoGen/tree/master.

[10] VIB-Probe: Detecting and Mitigating Hallucinations in Vision-Language Models via Variational Information Bottleneck

Feiran Zhang, Yixin Wu, Zhenghua Wang, Xiaohua Wang, Changze Lv, Xuanjing Huang, Xiaoqing Zheng

🧩 TL;DR

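This paper proposes VIB-Probe, a framework that uses Variational Information Bottleneck theory to extract discriminative patterns from the internal attention heads of vision-language models, enabling hallucination detection and mitigation that significantly outperforms existing baselines.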


📘 Detailed Summary

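Motivation: Vision-language models perform well on multimodal tasks but are prone to hallucination, where generated text deviates from the visual content. Existing hallucination detection methods rely mainly on output logits or external verification tools and often overlook the model's internal mechanisms, in particular the signals for truthful generation that attention heads may carry.

Method: VIB-Probe uses Variational Information Bottleneck theory to extract discriminative patterns from internal attention heads while filtering out semantic noise through the information bottleneck principle. It analyzes attention outputs across layers and heads, uses the gradients of the VIB probe to identify heads with strong causal influence on hallucinations, and proposes an inference-time intervention strategy for mitigation.

Result: Extensive experiments on multiple benchmarks show that VIB-Probe significantly outperforms existing baselines in both hallucination detection and mitigation. It effectively identifies key signals in the internal attention mechanism and reduces hallucinated generation through intervention.

Conclusion: The study shows that the internal attention heads of vision-language models carry rich signals for truthful generation, which Variational Information Bottleneck theory can extract to mitigate hallucination. The method offers a new perspective on the internal workings of multimodal models and opens a mechanism-based path for hallucination detection and mitigation.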


📄 Abstract

Vision-Language Models (VLMs) have demonstrated remarkable progress in multimodal tasks, but remain susceptible to hallucinations, where generated text deviates from the underlying visual content. Existing hallucination detection methods primarily rely on output logits or external verification tools, often overlooking the models' internal mechanisms. In this work, we investigate the outputs of internal attention heads, postulating that specific heads carry the primary signals for truthful generation. However, directly probing these high-dimensional states is challenging due to the entanglement of visual-linguistic syntax and noise. To address this, we propose VIB-Probe, a novel hallucination detection and mitigation framework leveraging the Variational Information Bottleneck (VIB) theory. Our method extracts discriminative patterns across layers and heads while filtering out semantic nuisances through the information bottleneck principle. Furthermore, by leveraging the gradients of our VIB probe, we identify attention heads with strong causal influence on hallucinations and introduce an inference-time intervention strategy for hallucination mitigation. Extensive experiments across diverse benchmarks demonstrate that VIB-Probe significantly outperforms existing baselines in both settings. Our code will be made publicly available.
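The standard Variational Information Bottleneck objective combines a prediction loss with a KL term that compresses the latent code toward a prior. The sketch below shows that general form for a diagonal-Gaussian encoder and a standard-normal prior; it is a textbook VIB loss, not the VIB-Probe architecture, and the value of `beta` and the function names are assumptions.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))

def vib_loss(prob_true_label, mu, log_var, beta=1e-3):
    """Cross-entropy on the probe's prediction plus beta times the KL term.

    prob_true_label: probability the probe assigns to the correct label
    (for example, hallucinated vs. faithful). mu and log_var parameterize
    the latent code produced from the attention-head outputs.
    """
    ce = -math.log(max(prob_true_label, 1e-12))  # clamp to avoid log(0)
    return ce + beta * kl_to_standard_normal(mu, log_var)
```

A larger `beta` forces the latent code to discard more of the input, which is the mechanism by which an information bottleneck can filter nuisance information while keeping what is predictive of the label.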

[11] One Language-Free Foundation Model Is Enough for Universal Vision Anomaly Detection

Bin-Bin Gao, Chengjie Wang

🧩 TL;DR

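This paper proposes UniADet, an extremely simple, general, and effective framework for universal visual anomaly detection. By decoupling classification from segmentation and decoupling cross-level features, it learns only a small number of parameters yet surpasses existing methods in zero-/few-shot settings and, for the first time, even outperforms full-shot methods on multiple benchmarks.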


📘 Detailed Summary

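Motivation: Current universal anomaly detection methods built on vision-language foundation models typically struggle with complex prompt engineering, elaborate adaptation modules, and challenging training strategies, which limit their flexibility and generality. This paper rethinks the basic mechanism of vision-language models in anomaly detection to address these limitations in complexity and generalization.

Method: UniADet first observes that the language encoder is used only to derive decision weights for anomaly classification and segmentation, and shows that it is unnecessary for universal anomaly detection. It then proposes an extremely simple method that fully decouples classification from segmentation and decouples cross-level features, that is, it learns independent weights for different tasks and hierarchical features. Only these decoupled weights are learned, which makes the method highly parameter-efficient.

Result: UniADet performs strongly on 14 real-world anomaly detection benchmarks covering industrial and medical domains. It surpasses state-of-the-art zero-/few-shot methods by a large margin and, for the first time, outperforms full-shot anomaly detection methods on multiple benchmarks, with only 0.002M learnable parameters.

Conclusion: The study shows that decoupling classification from segmentation and decoupling cross-level features yields an extremely simple yet effective universal anomaly detection framework. Vision-language foundation models can reach strong performance without complex prompt engineering or elaborate adaptation modules, suggesting a new design approach for universal anomaly detection.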


📄 Abstract

Universal visual anomaly detection (AD) aims to identify anomaly images and segment anomaly regions towards open and dynamic scenarios, following zero- and few-shot paradigms without any dataset-specific fine-tuning. Recent approaches have made significant progress through the wide use of visual-language foundation models. However, current methods often struggle with complex prompt engineering, elaborate adaptation modules, and challenging training strategies, ultimately limiting their flexibility and generality. To address these issues, this paper rethinks the fundamental mechanism behind visual-language models for AD and presents an embarrassingly simple, general, and effective framework for Universal vision Anomaly Detection (UniADet). Specifically, we first find that the language encoder is used to derive decision weights for anomaly classification and segmentation, and then demonstrate that it is unnecessary for universal AD. Second, we propose an embarrassingly simple method to completely decouple classification and segmentation, and decouple cross-level features, i.e., learning independent weights for different tasks and hierarchical features. UniADet is highly simple (learning only decoupled weights), parameter-efficient (only 0.002M learnable parameters), general (adapting to a variety of foundation models), and effective (surpassing state-of-the-art zero-/few-shot methods by a large margin and even full-shot AD methods for the first time) on 14 real-world AD benchmarks covering both industrial and medical domains. We will make the code and model of UniADet available at https://github.com/gaobb/UniADet.
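If the language encoder only supplies decision weights, those weights can in principle be learned directly as vectors. The sketch below illustrates that idea under several assumptions that are not taken from the paper: each task and feature level has its own hypothetical (normal, anomalous) weight pair, scores are a two-way softmax over cosine similarities, and levels are combined by simple averaging.

```python
import math

def cosine(u, v):
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def anomaly_prob(feature, w_normal, w_anomalous, temperature=0.07):
    """Two-way softmax over cosine similarities to the learned weights."""
    s_n = cosine(feature, w_normal) / temperature
    s_a = cosine(feature, w_anomalous) / temperature
    m = max(s_n, s_a)  # subtract the max for numerical stability
    e_n, e_a = math.exp(s_n - m), math.exp(s_a - m)
    return e_a / (e_n + e_a)

def score_image(global_feats, patch_feats, cls_weights, seg_weights):
    """Image-level score and per-patch anomaly map.

    global_feats[l]: one vector per feature level l.
    patch_feats[l]: list of patch vectors at level l.
    cls_weights[l], seg_weights[l]: separate (w_normal, w_anomalous) pairs,
    so classification and segmentation never share weights.
    """
    levels = len(global_feats)
    image_score = sum(anomaly_prob(global_feats[l], *cls_weights[l])
                      for l in range(levels)) / levels
    n_patches = len(patch_feats[0])
    anomaly_map = [
        sum(anomaly_prob(patch_feats[l][p], *seg_weights[l])
            for l in range(levels)) / levels
        for p in range(n_patches)
    ]
    return image_score, anomaly_map
```

Because only the weight pairs are trainable, the parameter count scales with the feature dimension times the number of tasks and levels, which is consistent in spirit with the very small parameter budget the abstract reports.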

[12] What's Left Unsaid? Detecting and Correcting Misleading Omissions in Multimodal News Previews

Fanxiao Li, Jiaying Wu, Tingchao Fu, Dayang Li, Herun Wan, Wei Zhou, Min-Yen Kan

🧩 TL;DR

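This study addresses the implicit misleadingness that arises when social-media news previews selectively omit key context. It develops a multi-stage analysis pipeline to build the MM-Misleading benchmark and proposes the OMGuard framework to improve multimodal misleadingness detection and correction.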


📘 Detailed Summary

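Motivation: Social-media news previews (image-headline pairs), even when factually correct, can selectively omit key context and lead readers to an interpretation that diverges from the full article. This covert harm is harder to detect than explicit misinformation and remains underexplored.

Method: The study develops a multi-stage pipeline that separates and simulates preview-based and context-based understanding, and uses it to build the MM-Misleading benchmark. It proposes OMGuard, which combines Interpretation-Aware Fine-Tuning to improve multimodal misleadingness detection with a rationale-guided correction mechanism that guides headline rewriting.

Result: Experiments show that OMGuard lifts an 8B model's detection accuracy to match that of a 235B LVLM and delivers markedly stronger end-to-end correction. Analysis reveals that misleadingness usually stems from local narrative shifts rather than global frame changes, and identifies image-driven scenarios in which text-only correction fails.

Conclusion: The study reveals how selective omission produces implicit misleadingness and demonstrates the need for visual interventions. OMGuard offers an effective solution for multimodal content safety, and the analysis identifies local narrative shifts as the main source of misleadingness.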


📄 Abstract

Even when factually correct, social-media news previews (image-headline pairs) can induce interpretation drift: by selectively omitting crucial context, they lead readers to form judgments that diverge from what the full article conveys. This covert harm is harder to detect than explicit misinformation yet remains underexplored. To address this gap, we develop a multi-stage pipeline that disentangles and simulates preview-based versus context-based understanding, enabling construction of the MM-Misleading benchmark. Using this benchmark, we systematically evaluate open-source LVLMs and uncover pronounced blind spots in omission-based misleadingness detection. We further propose OMGuard, which integrates (1) Interpretation-Aware Fine-Tuning, which improves multimodal misleadingness detection, and (2) Rationale-Guided Misleading Content Correction, which uses explicit rationales to guide headline rewriting and reduce misleading impressions. Experiments show that OMGuard lifts an 8B model's detection accuracy to match a 235B LVLM and delivers markedly stronger end-to-end correction. Further analysis reveals that misleadingness typically stems from local narrative shifts (e.g., missing background) rather than global frame changes, and identifies image-driven scenarios where text-only correction fails, highlighting the necessity of visual interventions.

[13] Towards Generalized Multi-Image Editing for Unified Multimodal Models

Pengcheng Xu, Peng Tang, Donghao Luo, Xiaobin Hu, Weichu Cui, Qingdong He, Zhennan Chen, Jiangning Zhang, Charles Ling, Boyu Wang

🧩 TL;DR

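This paper proposes a scalable multi-image editing framework that introduces learnable latent separators and a sinusoidal index encoding to address the limitations of unified multimodal models in maintaining visual consistency and distinguishing image identities during multi-image editing.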


📘 Detailed Summary

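Motivation: Unified multimodal models are strong at multimodal understanding and generation but limited on multi-image editing. In particular, they fall short in maintaining visual consistency and in disambiguating visual cues from different input images, and they cannot reliably handle a variable number of inputs.

Method: The paper proposes a scalable multi-image editing framework with two core innovations. Learnable latent separators explicitly differentiate each reference image in the latent space, enabling accurate and disentangled conditioning. A sinusoidal index encoding assigns the visual tokens of the same image a continuous sinusoidal index embedding, providing explicit image identity while supporting generalization and extrapolation to a variable number of inputs. In addition, a high-quality benchmark is built using an inverse dataset construction methodology.

Result: Experiments on diverse multi-image editing tasks show clear improvements over prior baselines in semantic consistency, visual fidelity, and cross-image integration, validating the method's advantages in consistency and generalization.

Conclusion: By explicitly distinguishing image identities and generalizing to variable input counts, the study substantially improves unified multimodal models on multi-image editing and provides an effective technical framework for multi-image understanding and generation with practical value.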


📄 Abstract

Unified Multimodal Models (UMMs) integrate multimodal understanding and generation, yet they are limited in maintaining visual consistency and disambiguating visual cues when referencing details across multiple input images. In this work, we propose a scalable multi-image editing framework for UMMs that explicitly distinguishes image identities and generalizes to variable input counts. Algorithmically, we introduce two innovations: 1) The learnable latent separators explicitly differentiate each reference image in the latent space, enabling accurate and disentangled conditioning. 2) The sinusoidal index encoding assigns visual tokens from the same image a continuous sinusoidal index embedding, which provides explicit image identity while allowing generalization and extrapolation on a variable number of inputs. To facilitate training and evaluation, we establish a high-fidelity benchmark using an inverse dataset construction methodology to guarantee artifact-free, achievable outputs. Experiments show clear improvements in semantic consistency, visual fidelity, and cross-image integration over prior baselines on diverse multi-image editing tasks, validating our advantages in consistency and generalization ability.
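A sinusoidal index encoding can be sketched with the familiar sine/cosine positional formula, applied to the image index rather than the token position, so that all tokens from one image share the same identity embedding. The interleaved layout, the `base` constant, and the additive combination below are assumptions borrowed from standard Transformer positional encodings, not details confirmed by the abstract.

```python
import math

def sinusoidal_index_embedding(image_index, dim, base=10000.0):
    """Sine/cosine embedding of an image index; dim must be even."""
    assert dim % 2 == 0
    emb = []
    for k in range(dim // 2):
        freq = 1.0 / (base ** (2 * k / dim))
        emb.append(math.sin(image_index * freq))
        emb.append(math.cos(image_index * freq))
    return emb

def add_image_identity(tokens_per_image, dim):
    """Add the same index embedding to every visual token of a given image.

    tokens_per_image[i] is the list of token vectors for reference image i.
    """
    out = []
    for idx, tokens in enumerate(tokens_per_image):
        e = sinusoidal_index_embedding(idx, dim)
        out.append([[t + ei for t, ei in zip(tok, e)] for tok in tokens])
    return out
```

Because the embedding is a closed-form function of the index, it is defined for any number of reference images, including counts larger than those seen during training, which is one plausible reading of the extrapolation property the abstract describes.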

[14] Orient Anything V2: Unifying Orientation and Rotation Understanding

Zehan Wang, Ziang Zhang, Jiayang Xu, Jialei Wang, Tianyu Pang, Chao Du, HengShuang Zhao, Zhou Zhao

🧩 TL;DR

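This paper presents Orient Anything V2, an enhanced foundation model for unified understanding of object 3D orientation and rotation from single or paired images. Four key innovations extend V1's capabilities, allowing the model to handle objects with different rotational symmetries and to estimate relative rotations directly.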


📘 Detailed Summary

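Motivation: Orient Anything V1 defines object orientation through a single unique front face, so it cannot handle objects with different rotational symmetries and cannot directly estimate relative rotations. This work addresses these limitations to broaden the applicability of orientation estimation in diverse downstream tasks.

Method: The study introduces four key innovations: 1) scalable 3D assets synthesized by generative models, ensuring broad category coverage and balanced data distribution; 2) an efficient model-in-the-loop annotation system that robustly identifies 0 to N valid front faces for each object; 3) a symmetry-aware periodic distribution fitting objective that captures all plausible front-facing orientations and effectively models rotational symmetry; 4) a multi-frame architecture that directly predicts relative object rotations.

Result: Extensive experiments show that Orient Anything V2 achieves state-of-the-art zero-shot performance on orientation estimation, 6DoF pose estimation, and object symmetry recognition across 11 widely used benchmarks. The model generalizes strongly and significantly broadens the applicability of orientation estimation in diverse downstream tasks.

Conclusion: By handling rotational symmetry and estimating relative rotation directly, the study substantially extends the capabilities of orientation foundation models. Its strong benchmark performance indicates that a unified orientation understanding framework can support a wide range of computer vision applications and offers a more complete solution for 3D object understanding.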


📄 Abstract

This work presents Orient Anything V2, an enhanced foundation model for unified understanding of object 3D orientation and rotation from single or paired images. Building upon Orient Anything V1, which defines orientation via a single unique front face, V2 extends this capability to handle objects with diverse rotational symmetries and directly estimate relative rotations. These improvements are enabled by four key innovations: 1) Scalable 3D assets synthesized by generative models, ensuring broad category coverage and balanced data distribution; 2) An efficient, model-in-the-loop annotation system that robustly identifies 0 to N valid front faces for each object; 3) A symmetry-aware, periodic distribution fitting objective that captures all plausible front-facing orientations, effectively modeling object rotational symmetry; 4) A multi-frame architecture that directly predicts relative object rotations. Extensive experiments show that Orient Anything V2 achieves state-of-the-art zero-shot performance on orientation estimation, 6DoF pose estimation, and object symmetry recognition across 11 widely used benchmarks. The model demonstrates strong generalization, significantly broadening the applicability of orientation estimation in diverse downstream tasks.
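To see why symmetry matters for orientation targets, consider a simplified one-dimensional case: an object with N equally spaced valid front faces around the vertical axis. Any predicted azimuth should be scored against the nearest valid front. The sketch below illustrates only that idea; the paper's objective is a periodic distribution fitting loss over 3D orientation, and the function names and the convention for objects with no front face are assumptions.

```python
def circular_diff_deg(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def symmetry_aware_error(pred_deg, front_deg, n_fronts):
    """Azimuth error to the nearest of n_fronts equally spaced valid fronts.

    n_fronts = 1 is an object with a unique front; n_fronts = 2 covers
    objects that look the same after a 180-degree turn, and so on.
    """
    if n_fronts == 0:
        return 0.0  # assumed convention: no meaningful front, nothing to penalize
    step = 360.0 / n_fronts
    return min(circular_diff_deg(pred_deg, front_deg + k * step)
               for k in range(n_fronts))
```

For instance, a prediction of 185 degrees against a front at 0 degrees is 175 degrees off for an object with a unique front, but only 5 degrees off for an object with two-fold symmetry, since 180 degrees is an equally valid front.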

[15] SceneAlign: Aligning Multimodal Reasoning to Scene Graphs in Complex Visual Scenes

Chuhan Wang, Xintong Li, Jennifer Yuntong Zhang, Junda Wu, Chengkai Huang, Lina Yao, Julian McAuley, Jingbo Shang

🧩 TL;DR

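This paper proposes SceneAlign, a framework that uses scene graphs to perform controllable structural interventions for generating contrastive samples, and applies Direct Preference Optimization to substantially improve the reasoning faithfulness and accuracy of multimodal large language models in complex visual scenes.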


📘 Detailed Summary

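Motivation: Multimodal large language models often reason unfaithfully in complex visual scenes, producing hallucinated entities, mis-grounded relations, skipped reasoning steps, and over-specified reasoning. Existing preference-based methods typically rely on textual perturbations or answer-conditioned rationales and fail to address this, because models can exploit language priors to bypass visual grounding.

Method: SceneAlign uses scene graphs as structured visual information for controllable structural interventions. It identifies reasoning-critical nodes and applies four perturbation strategies that target typical grounding failures, constructing hard negative rationales that are linguistically plausible but visually inaccurate. These contrastive pairs are used in Direct Preference Optimization to steer the model toward fine-grained, structure-faithful reasoning.

Result: Across seven visual reasoning benchmarks, SceneAlign consistently improves answer accuracy and reasoning faithfulness, validating the effectiveness of grounding-aware alignment for multimodal reasoning.

Conclusion: The study underscores the importance of grounding-aware alignment for improving multimodal reasoning quality. By generating meaningful contrastive samples through structured visual interventions, SceneAlign offers an effective route to faithful reasoning alignment and points toward more reliable reasoning in future vision-language models.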


📄 Abstract

Multimodal large language models often struggle with faithful reasoning in complex visual scenes, where intricate entities and relations require precise visual grounding at each step. This reasoning unfaithfulness frequently manifests as hallucinated entities, mis-grounded relations, skipped steps, and over-specified reasoning. Existing preference-based approaches, typically relying on textual perturbations or answer-conditioned rationales, fail to address this challenge as they allow models to exploit language priors to bypass visual grounding. To address this, we propose SceneAlign, a framework that leverages scene graphs as structured visual information to perform controllable structural interventions. By identifying reasoning-critical nodes and perturbing them through four targeted strategies that mimic typical grounding failures, SceneAlign constructs hard negative rationales that remain linguistically plausible but are grounded in inaccurate visual facts. These contrastive pairs are used in Direct Preference Optimization to steer models toward fine-grained, structure-faithful reasoning. Across seven visual reasoning benchmarks, SceneAlign consistently improves answer accuracy and reasoning faithfulness, highlighting the effectiveness of grounding-aware alignment for multimodal reasoning.
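The contrastive pairs feed into Direct Preference Optimization, whose per-pair loss has a standard closed form: the negative log-sigmoid of a scaled difference in log-probability ratios between the preferred and rejected rationale. The sketch below shows that standard loss; the `beta` value and argument names are illustrative, and how SceneAlign computes the sequence log-probabilities is not described in the abstract.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss.

    logp_*: sequence log-probabilities under the policy being trained.
    ref_logp_*: the same quantities under the frozen reference model.
    'chosen' is the faithful rationale, 'rejected' the hard negative.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) computed stably as softplus(-margin)
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy assigns relatively more probability to the faithful rationale than the reference model does, the margin is positive and the loss falls below log 2; the reverse ordering pushes it above log 2.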

[16] LatentVLA: Efficient Vision-Language Models for Autonomous Driving via Latent Action Prediction

Chengen Xie, Bin Sun, Tianyu Li, Junjie Wu, Zhihui Hao, XianPeng Lang, Hongyang Li

🧩 TL;DR

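This paper proposes LatentVLA, a framework that trains vision-language-action models with self-supervised latent action prediction, learning rich driving representations without language annotations. Knowledge distillation then transfers the resulting generalization ability to an efficient vision-based network, improving long-tail performance while preserving real-time operation.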


📘 Detailed Summary

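Motivation: End-to-end autonomous driving models trained on large-scale datasets perform well in common scenarios but struggle in rare, long-tail scenarios because scenario diversity is limited. Existing vision-language-action (VLA) models draw on the broad knowledge of pre-trained vision-language models yet still face three key challenges: numerical imprecision in trajectory prediction, heavy reliance on language annotations that introduces linguistic bias and annotation burden, and computational inefficiency from multi-step chain-of-thought reasoning that hinders real-time deployment.

Method: LatentVLA trains VLA models with self-supervised latent action prediction and requires no language annotations, which removes linguistic bias while learning rich driving representations from unlabeled trajectory data. Knowledge distillation then transfers the generalization ability of the VLA model to an efficient vision-based network, balancing robust performance with real-time efficiency.

Result: LatentVLA achieves a PDMS score of 92.4 on the NAVSIM benchmark, a new state of the art, and shows strong zero-shot generalization on the nuScenes benchmark. These results support the claim that the method improves long-tail performance of autonomous driving models while preserving real-time efficiency.

Conclusion: The study shows that self-supervised latent action prediction can train VLA models effectively without language annotations, removing linguistic bias and reducing annotation burden. Knowledge distillation transfers the large model's generalization ability to an efficient network, giving real-time autonomous driving systems a solution that is both robust and efficient and advancing the practical deployment of VLA models.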


📄 Abstract

End-to-end autonomous driving models trained on large-scale datasets perform well in common scenarios but struggle with rare, long-tail situations due to limited scenario diversity. Recent Vision-Language-Action (VLA) models leverage broad knowledge from pre-trained vision-language models to address this limitation, yet face critical challenges: (1) numerical imprecision in trajectory prediction due to discrete tokenization, (2) heavy reliance on language annotations that introduce linguistic bias and annotation burden, and (3) computational inefficiency from multi-step chain-of-thought reasoning that hinders real-time deployment. We propose LatentVLA, a novel framework that employs self-supervised latent action prediction to train VLA models without language annotations, eliminating linguistic bias while learning rich driving representations from unlabeled trajectory data. Through knowledge distillation, LatentVLA transfers the generalization capabilities of VLA models to efficient vision-based networks, achieving both robust performance and real-time efficiency. LatentVLA establishes a new state-of-the-art on the NAVSIM benchmark with a PDMS score of 92.4 and demonstrates strong zero-shot generalization on the nuScenes benchmark.

[17] Goal Force: Teaching Video Models To Accomplish Physics-Conditioned Goals

Nate Gillman, Yinghua Zhou, Zitian Tang, Evan Luo, Arjan Chakravarthy, Daksh Aggarwal, Michael Freeman, Charles Herrmann, Chen Sun

🧩 TL;DR

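This paper proposes Goal Force, a framework that defines video-generation goals through force vectors and intermediate dynamics, allowing the model to act as an implicit neural physics simulator for precise, physics-aware planning.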


📘 Detailed Summary

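Motivation: Current video-generation world models are hard to steer toward precise goals: text instructions are too abstract to capture physical nuances, and target images are often infeasible to specify for dynamic tasks. A goal-specification method is needed that reflects how humans conceptualize physical tasks.

Method: Goal Force lets users define goals through explicit force vectors and intermediate dynamics. A video generation model is trained on a dataset of synthetic causal primitives, such as elastic collisions and falling dominoes, teaching it to propagate forces through time and space.

Result: Although trained on simple physics data, the model shows strong zero-shot generalization to complex real-world scenarios, including tool manipulation and multi-object causal chains. It can act as an implicit neural physics simulator that supports precise, physics-aware planning.

Conclusion: Grounding video generation in fundamental physical interactions lets the model emerge as an implicit neural physics simulator, enabling precise, physics-aware planning without reliance on external physics engines. This offers a new goal-specification paradigm for robotics and planning tasks.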


📄 Abstract

Recent advancements in video generation have enabled the development of "world models" capable of simulating potential futures for robotics and planning. However, specifying precise goals for these models remains a challenge; text instructions are often too abstract to capture physical nuances, while target images are frequently infeasible to specify for dynamic tasks. To address this, we introduce Goal Force, a novel framework that allows users to define goals via explicit force vectors and intermediate dynamics, mirroring how humans conceptualize physical tasks. We train a video generation model on a curated dataset of synthetic causal primitives, such as elastic collisions and falling dominoes, teaching it to propagate forces through time and space. Despite being trained on simple physics data, our model exhibits remarkable zero-shot generalization to complex, real-world scenarios, including tool manipulation and multi-object causal chains. Our results suggest that by grounding video generation in fundamental physical interactions, models can emerge as implicit neural physics simulators, enabling precise, physics-aware planning without reliance on external engines. We release all datasets, code, model weights, and interactive video demos at our project page.
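
As an illustration of one causal primitive named in the abstract, the sketch below applies the textbook formulas for a perfectly elastic one-dimensional collision. It shows only the underlying physics; how the authors actually simulate or render their synthetic dataset is not described here, and the function name and example values are assumptions.

```python
def elastic_collision_1d(m1, v1, m2, v2):
    """Post-collision velocities for a perfectly elastic 1D collision.

    Derived from conservation of momentum and of kinetic energy.
    """
    total = m1 + m2
    v1_after = ((m1 - m2) * v1 + 2.0 * m2 * v2) / total
    v2_after = ((m2 - m1) * v2 + 2.0 * m1 * v1) / total
    return v1_after, v2_after

# A moving ball hits an identical ball at rest: the velocities swap.
u1, u2 = elastic_collision_1d(1.0, 2.0, 1.0, 0.0)

# Unequal masses moving toward each other (hypothetical values).
w1, w2 = elastic_collision_1d(2.0, 3.0, 1.0, -1.0)
momentum_before = 2.0 * 3.0 + 1.0 * (-1.0)
momentum_after = 2.0 * w1 + 1.0 * w2
energy_before = 0.5 * 2.0 * 3.0 ** 2 + 0.5 * 1.0 * (-1.0) ** 2
energy_after = 0.5 * 2.0 * w1 ** 2 + 0.5 * 1.0 * w2 ** 2
```

A force-conditioned video model has to reproduce this kind of momentum transfer implicitly, frame by frame, rather than by evaluating a closed-form expression.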

[18] SGDrive: Scene-to-Goal Hierarchical World Cognition for Autonomous Driving

Jingyu Li, Junjie Wu, Dongnan Hu, Xiangkai Huang, Bin Sun, Zhihui Hao, Xianpeng Lang, Xiatian Zhu, Li Zhang

🧩 TL;DR

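This paper proposes SGDrive, a framework that gives vision-language models a structured spatial-temporal representation through a scene-agent-goal hierarchy. It addresses the lack of driving-specific reasoning in generalist VLMs and achieves state-of-the-art performance among camera-only methods on the NAVSIM benchmark.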


📘 Detailed Summary

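Motivation: Current end-to-end autonomous driving methods built on vision-language models are limited because generalist VLMs lack specialized understanding of driving-specific reasoning in 3D space and time. They struggle to build structured spatial-temporal representations that capture geometric relationships, scene context, and motion patterns, which limits safe trajectory planning.

Method: Built on a pre-trained VLM backbone, SGDrive decomposes driving understanding into a scene-agent-goal hierarchy that mirrors human driving cognition: first perceive the overall environment (scene context), then attend to safety-critical agents and their behaviors, and finally formulate short-term goals before acting. This hierarchical decomposition supplies the VLM with a structured spatial-temporal representation.

Result: Extensive experiments on the NAVSIM benchmark show that SGDrive achieves the best performance among camera-only methods on both PDMS and EPDMS, validating hierarchical knowledge structuring as a way to adapt generalist VLMs to autonomous driving.

Conclusion: The study shows that explicitly organizing VLM representation learning around a driving-specific knowledge hierarchy compensates for the shortcomings of generalist models. The hierarchical decomposition provides a structured spatial-temporal representation framework for VLMs in autonomous driving, and the paradigm may extend to other domains that require specialized spatial-temporal reasoning.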


📄 Abstract

Recent end-to-end autonomous driving approaches have leveraged Vision-Language Models (VLMs) to enhance planning capabilities in complex driving scenarios. However, VLMs are inherently trained as generalist models, lacking specialized understanding of driving-specific reasoning in 3D space and time. When applied to autonomous driving, these models struggle to establish structured spatial-temporal representations that capture geometric relationships, scene context, and motion patterns critical for safe trajectory planning. To address these limitations, we propose SGDrive, a novel framework that explicitly structures the VLM's representation learning around driving-specific knowledge hierarchies. Built upon a pre-trained VLM backbone, SGDrive decomposes driving understanding into a scene-agent-goal hierarchy that mirrors human driving cognition: drivers first perceive the overall environment (scene context), then attend to safety-critical agents and their behaviors, and finally formulate short-term goals before executing actions. This hierarchical decomposition provides the structured spatial-temporal representation that generalist VLMs lack, integrating multi-level information into a compact yet comprehensive format for trajectory planning. Extensive experiments on the NAVSIM benchmark demonstrate that SGDrive achieves state-of-the-art performance among camera-only methods on both PDMS and EPDMS, validating the effectiveness of hierarchical knowledge structuring for adapting generalist VLMs to autonomous driving.

[19] SketchVL: Policy Optimization via Fine-Grained Credit Assignment for Chart Understanding and More

Muye Huang, Lingling Zhang, Yifei Li, Yaqiang Wu, Jun Liu

🧩 TL;DR

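This paper proposes SketchVL, a multimodal large language model optimized with the FinePO reinforcement learning algorithm. FinePO uses a fine-grained process reward model to score each drawing action in a trajectory, yielding precise credit assignment and improving chart understanding and complex reasoning.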


📘 Detailed Summary

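Motivation: Existing multimodal large language models trained with reinforcement learning face a credit-assignment problem in complex visual reasoning tasks such as chart understanding. Their trajectory-level advantage estimates cannot distinguish correct from incorrect reasoning steps within a single generated response, which limits gains on tasks that require precise multi-step reasoning.

Method: SketchVL is optimized with FinePO, a reinforcement learning algorithm that uses a fine-grained process reward model to score each drawing action in a trajectory and so assigns credit precisely. The model draws its intermediate reasoning steps as markers on the image and feeds the annotated image back to itself, building a robust multi-step reasoning process.

Result: Experiments show an average performance gain of 7.23% across chart datasets, natural image datasets, and mathematics tasks. The model aligns its step-level behavior with FinePRM and improves on complex reasoning tasks, supporting the effectiveness of fine-grained credit assignment.

Conclusion: The work offers a new direction for training strong reasoning models. With fine-grained process rewards and precise credit assignment, multimodal large language models can better handle tasks that require complex multi-step reasoning, particularly the parsing of high-density visual data such as charts.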


📄 Abstract

Charts are high-density visual carriers of complex data and a medium for information extraction and analysis. Due to the need for precise and complex visual reasoning, automated chart understanding poses a significant challenge to existing Multimodal Large Language Models (MLLMs). Many MLLMs trained with reinforcement learning (RL) face the challenge of credit assignment. Their advantage estimation, typically performed at the trajectory level, cannot distinguish between correct and incorrect reasoning steps within a single generated response. To address this limitation, we introduce SketchVL, a novel MLLM that is optimized with FinePO, a new RL algorithm designed for fine-grained credit assignment within each trajectory. SketchVL's methodology involves drawing its intermediate reasoning steps as markers on the image and feeding the annotated image back to itself, creating a robust, multi-step reasoning process. During training, the FinePO algorithm leverages a Fine-grained Process Reward Model (FinePRM) to score each drawing action within a trajectory, thereby precisely assigning credit for each step. This mechanism allows FinePO to more strongly reward correct tokens when a trajectory is globally successful, and more heavily penalize incorrect tokens when the trajectory is globally suboptimal, thus achieving fine-grained reinforcement signals. Experiments show that SketchVL learns to align its step-level behavior with the FinePRM, achieving an average performance gain of 7.23% over its base model across chart datasets, natural image datasets, and mathematics, providing a promising new direction for training powerful reasoning models.
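
The reweighting idea described above can be sketched as follows. This is a hypothetical weighting rule written only to illustrate the stated behavior (correct steps get more credit in successful trajectories, incorrect steps get more blame in unsuccessful ones); it is not FinePO's actual formula, which the abstract does not give. The function name, the `alpha` parameter, and the assumption that step scores lie in [0, 1] are all illustrative.

```python
def step_advantages(traj_advantage, step_scores, alpha=0.5):
    """Spread one trajectory-level advantage over steps using per-step scores.

    traj_advantage : scalar advantage of the whole trajectory
    step_scores    : process-reward scores in [0, 1], one per drawing action
    alpha          : reweighting strength in [0, 1); keeps every weight positive
    """
    advantages = []
    for score in step_scores:
        if traj_advantage >= 0:
            # Successful trajectory: steps scored as correct receive more credit.
            weight = 1.0 + alpha * (2.0 * score - 1.0)
        else:
            # Unsuccessful trajectory: steps scored as incorrect receive more blame.
            weight = 1.0 + alpha * (1.0 - 2.0 * score)
        advantages.append(traj_advantage * weight)
    return advantages

good = step_advantages(1.0, [1.0, 0.0])   # [1.5, 0.5]
bad = step_advantages(-1.0, [1.0, 0.0])   # [-0.5, -1.5]
```

With a uniform score of 0.5 every weight is 1 and the rule reduces to ordinary trajectory-level advantage, which is the baseline behavior the abstract criticizes.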

[20] TAGRPO: Boosting GRPO on Image-to-Video Generation with Direct Trajectory Alignment

Jin Wang, Jianxiang Lu, Guangzheng Xu, Comi Chen, Haoyu Yang, Linqing Wang, Peng Chen, Mingtao Chen, Zhichao Hu, Longhuang Wu, Shuai Shao, Qinglin Lu, Ping Luo

🧩 TL;DR

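This paper proposes TAGRPO, a robust post-training framework for image-to-video (I2V) generation models. Using a contrastive-learning-inspired GRPO loss and a memory bank, it addresses the inconsistent reward improvements that existing methods show on I2V tasks.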


📘 Detailed Summary

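Motivation: Prior work shows that integrating Group Relative Policy Optimization (GRPO) into flow matching models is effective for text-to-image and text-to-video generation, but applying it directly to image-to-video models often fails to yield consistent reward improvements. This is the core limitation the work addresses.

Method: TAGRPO builds on the observation that rollout videos generated from identical initial noise provide better guidance for optimization. It proposes a GRPO loss applied to intermediate latents that encourages direct alignment with high-reward trajectories while maximizing distance from low-reward ones, and it introduces a memory bank of rollout videos to increase diversity and reduce computational overhead.

Result: Despite its simplicity, TAGRPO achieves significant improvements over DanceGRPO in image-to-video generation, demonstrating its effectiveness at improving reward consistency and generation quality.

Conclusion: The study indicates that image-to-video generation calls for a purpose-built post-training framework. The contrastive GRPO loss and the memory bank offer an effective remedy for inconsistent reward optimization in I2V models and open a direction for fine-grained tuning of video generation models.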


📄 Abstract

Recent studies have demonstrated the efficacy of integrating Group Relative Policy Optimization (GRPO) into flow matching models, particularly for text-to-image and text-to-video generation. However, we find that directly applying these techniques to image-to-video (I2V) models often fails to yield consistent reward improvements. To address this limitation, we present TAGRPO, a robust post-training framework for I2V models inspired by contrastive learning. Our approach is grounded in the observation that rollout videos generated from identical initial noise provide superior guidance for optimization. Leveraging this insight, we propose a novel GRPO loss applied to intermediate latents, encouraging direct alignment with high-reward trajectories while maximizing distance from low-reward counterparts. Furthermore, we introduce a memory bank for rollout videos to enhance diversity and reduce computational overhead. Despite its simplicity, TAGRPO achieves significant improvements over DanceGRPO in I2V generation.
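
For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage normalization that GRPO-style methods share: rewards from a group of rollouts for the same input are standardized within the group. It does not implement TAGRPO's latent-space alignment term or its memory bank, which the abstract describes only at a high level; the function name and reward values are assumptions.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one rollout group (zero mean, unit scale)."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    # eps avoids division by zero when every rollout gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for three videos rolled out from the same input image.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
flat = group_relative_advantages([5.0, 5.0, 5.0])
```

Rollouts with positive advantage play the role of the high-reward trajectories to align with, and those with negative advantage play the role of the low-reward trajectories to move away from.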

[21] Context-Aware Decoding for Faithful Vision-Language Generation

Mehrdad Fazli, Bowen Wei, Ziwei Zhu

🧩 TL;DR

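This paper analyzes the layer-wise generation dynamics of large vision-language models (LVLMs) to uncover how hallucinations arise, and proposes Context Embedding Injection (CEI), a training-free mitigation method that reduces hallucinations in open-ended tasks.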


📘 Detailed Summary

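Motivation: Large vision-language models often produce hallucinated responses that are inconsistent with the visual input in open-ended tasks such as image captioning and visual reasoning, which is a critical limitation. Existing methods fall short both in explaining how hallucinations are generated and in providing scalable interventions, so new mitigation approaches are needed.

Method: The study uses the Logit Lens to analyze how LVLMs build next-token distributions across decoder layers, revealing a "commitment-depth gap": truthful tokens accumulate probability mass earlier than hallucinated ones. Building on this finding, it proposes Context Embedding Injection (CEI), which uses the hidden state of the last input token (the context embedding) as a grounding signal to maintain visual fidelity and suppress hallucinations during decoding.

Result: Evaluated on the CHAIR, AMBER, and MMHal-Bench benchmarks (with a maximum token length of 512), CEI outperforms state-of-the-art baselines across three LVLMs, and its dynamic variant achieves the lowest overall hallucination rate. The method reduces hallucinations consistently across benchmarks.

Conclusion: By combining new mechanistic insight with a scalable intervention, this work advances hallucination mitigation in LVLMs. It characterizes the layer-wise dynamics of hallucination and provides an effective training-free mitigation strategy, laying groundwork for more reliable vision-language systems.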


📄 Abstract

Hallucinations, generating responses inconsistent with the visual input, remain a critical limitation of large vision-language models (LVLMs), especially in open-ended tasks such as image captioning and visual reasoning. In this work, we probe the layer-wise generation dynamics that drive hallucinations and propose a training-free mitigation strategy. Employing the Logit Lens, we examine how LVLMs construct next-token distributions across decoder layers, uncovering a pronounced commitment-depth gap: truthful tokens accumulate probability mass on their final candidates earlier than hallucinatory ones. Drawing on this discovery, we introduce Context Embedding Injection (CEI), a lightweight method that harnesses the hidden state of the last input token (the context embedding) as a grounding signal to maintain visual fidelity throughout decoding and curb hallucinations. Evaluated on the CHAIR, AMBER, and MMHal-Bench benchmarks (with a maximum token length of 512), CEI outperforms state-of-the-art baselines across three LVLMs, with its dynamic variant yielding the lowest overall hallucination rates. By integrating novel mechanistic insights with a scalable intervention, this work advances the mitigation of hallucinations in LVLMs.
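
One way to picture the commitment-depth gap is sketched below. It assumes the Logit Lens has already produced, for each decoder layer, the probability assigned to the token that is finally emitted, and it defines "commitment depth" as the first layer from which that probability stays at or above a threshold. This definition, the threshold, and the example probabilities are assumptions for illustration; the paper's exact measure is not given in the abstract.

```python
def commitment_depth(layer_probs, threshold=0.5):
    """Earliest layer index from which the final token's probability stays
    at or above `threshold` through the last layer; None if the last layer
    itself is below the threshold."""
    depth = None
    # Walk backward from the last layer while the probability holds.
    for index in range(len(layer_probs) - 1, -1, -1):
        if layer_probs[index] >= threshold:
            depth = index
        else:
            break
    return depth

# Hypothetical per-layer probabilities of the emitted token.
truthful = [0.05, 0.20, 0.60, 0.80, 0.90]      # commits early
hallucinated = [0.01, 0.05, 0.10, 0.30, 0.70]  # commits late

gap = commitment_depth(hallucinated) - commitment_depth(truthful)
```

Under this reading, a positive gap means the hallucinated token settled later in the network than the truthful one, which is the pattern the abstract reports.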

[22] WaveRNet: Wavelet-Guided Frequency Learning for Multi-Source Domain-Generalized Retinal Vessel Segmentation

Chanchan Wang, Yuanfang Wang, Qing Xu, Guanxin Chen

🧩 TL;DR

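This paper proposes WaveRNet, a wavelet-guided frequency learning framework for robust multi-source domain-generalized retinal vessel segmentation. It combines wavelet decomposition with learnable domain tokens to separate illumination-robust low-frequency structures from high-frequency vessel boundaries, and achieves state-of-the-art generalization on four public retinal datasets.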


📘 Detailed Summary

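Motivation: Domain-generalized retinal vessel segmentation faces domain shift caused by non-uniform illumination and varying contrast. Existing SAM-based methods rely on simple adapter fine-tuning and overlook frequency-domain information that encodes domain-invariant features, so generalization degrades under illumination and contrast changes. In addition, SAM's direct upsampling loses fine vessel details.

Method: WaveRNet has three core modules. The Spectral-guided Domain Modulator (SDM) integrates wavelet decomposition with learnable domain tokens to separate illumination-robust low-frequency structures from high-frequency vessel boundaries. The Frequency-Adaptive Domain Fusion (FADF) module performs test-time domain selection through wavelet-based frequency similarity and soft-weighted fusion. The Hierarchical Mask-Prompt Refiner (HMPR) overcomes SAM's upsampling limitation through coarse-to-fine refinement with long-range dependency modeling.

Result: In extensive experiments on four public retinal datasets under the Leave-One-Domain-Out protocol, WaveRNet achieves state-of-the-art generalization and outperforms existing methods, particularly under illumination and contrast variation, supporting the value of frequency-domain learning for this task.

Conclusion: The study shows that frequency-domain learning plays a key role in addressing domain shift in retinal vessel segmentation, and that wavelet decomposition can separate domain-invariant from domain-specific features. The proposed test-time domain selection and hierarchical refinement strategies offer new ideas for domain generalization in medical image segmentation, with clear clinical relevance.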


📄 Abstract

Domain-generalized retinal vessel segmentation is critical for automated ophthalmic diagnosis, yet faces significant challenges from domain shift induced by non-uniform illumination and varying contrast, compounded by the difficulty of preserving fine vessel structures. While the Segment Anything Model (SAM) exhibits remarkable zero-shot capabilities, existing SAM-based methods rely on simple adapter fine-tuning while overlooking frequency-domain information that encodes domain-invariant features, resulting in degraded generalization under illumination and contrast variations. Furthermore, SAM's direct upsampling inevitably loses fine vessel details. To address these limitations, we propose WaveRNet, a wavelet-guided frequency learning framework for robust multi-source domain-generalized retinal vessel segmentation. Specifically, we devise a Spectral-guided Domain Modulator (SDM) that integrates wavelet decomposition with learnable domain tokens, enabling the separation of illumination-robust low-frequency structures from high-frequency vessel boundaries while facilitating domain-specific feature generation. Furthermore, we introduce a Frequency-Adaptive Domain Fusion (FADF) module that performs intelligent test-time domain selection through wavelet-based frequency similarity and soft-weighted fusion. Finally, we present a Hierarchical Mask-Prompt Refiner (HMPR) that overcomes SAM's upsampling limitation through coarse-to-fine refinement with long-range dependency modeling. Extensive experiments under the Leave-One-Domain-Out protocol on four public retinal datasets demonstrate that WaveRNet achieves state-of-the-art generalization performance. The source code is available at https://github.com/Chanchan-Wang/WaveRNet.
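
As background for the wavelet decomposition used by the SDM, here is a minimal sketch of a single-level one-dimensional Haar transform, which splits a signal into a low-frequency (approximation) band and a high-frequency (detail) band and can be inverted exactly. The abstract does not say which wavelet WaveRNet uses or how it is applied to 2D feature maps, so the choice of Haar, the function names, and the example row are assumptions.

```python
import math

def haar_dwt(signal):
    """One level of the orthonormal 1D Haar transform; len(signal) must be even."""
    scale = math.sqrt(2.0)
    pairs = range(0, len(signal), 2)
    approx = [(signal[i] + signal[i + 1]) / scale for i in pairs]  # low-frequency band
    detail = [(signal[i] - signal[i + 1]) / scale for i in pairs]  # high-frequency band
    return approx, detail

def haar_idwt(approx, detail):
    """Invert haar_dwt and recover the original samples."""
    scale = math.sqrt(2.0)
    signal = []
    for a, d in zip(approx, detail):
        signal.extend([(a + d) / scale, (a - d) / scale])
    return signal

# A toy intensity row: flat region, then a sharp drop (an "edge").
row = [4.0, 4.0, 2.0, 0.0]
low, high = haar_dwt(row)
restored = haar_idwt(low, high)
```

The detail band is zero over the flat pair and non-zero at the edge, which is why high-frequency coefficients are a natural place to look for thin boundaries such as vessel walls.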

cs.CL [Back]

[23] Large Language Models Are Bad Dice Players: LLMs Struggle to Generate Random Numbers from Statistical Distributions

Minda Zhao, Yilun Du, Mengyu Wang

🧩 TL;DR

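This study presents the first large-scale statistical audit of native probabilistic sampling in frontier large language models. It finds that current LLMs lack a functional internal sampler and show serious deficits in statistical validity under both batch-generation and independent-request protocols, which lead to failures in downstream tasks.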


📘 Detailed Summary

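Motivation: As large language models move from chat interfaces to integral components of stochastic pipelines in areas such as educational assessment and synthetic data construction, faithfully sampling from a specified probability distribution has become a functional requirement rather than a theoretical curiosity. Whether current LLMs have a reliable internal probabilistic sampler has not been evaluated systematically, which limits their reliability in applications that need statistical guarantees.

Method: The study uses a dual-protocol design to separate failure modes: the Batch Generation protocol asks a model to produce N=1000 samples within a single response, while the Independent Requests protocol consists of N=1000 stateless calls. Eleven frontier LLMs are audited across 15 distributions, with sampling fidelity assessed through systematic benchmarking.

Result: The study finds a sharp protocol asymmetry: batch generation achieves only modest statistical validity, with a 13% median pass rate, while independent requests collapse almost entirely, with 10 of 11 models passing none of the distributions. Sampling fidelity degrades monotonically with distributional complexity and worsens as the sampling horizon N increases. These failures propagate to downstream tasks, such as enforcing uniform answer positions in multiple-choice question generation and meeting demographic targets in attribute-constrained text-to-image prompt synthesis.

Conclusion: Current large language models lack a functional internal sampler and cannot meet the needs of applications that require statistical guarantees. Practical applications that involve probabilistic sampling must therefore rely on external tools rather than native model capabilities. The findings are an important caution for fields that depend on statistical reliability, such as educational assessment and synthetic data generation.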


📄 Abstract

As large language models (LLMs) transition from chat interfaces to integral components of stochastic pipelines across domains like educational assessment and synthetic data construction, the ability to faithfully sample from specified probability distributions has become a functional requirement rather than a theoretical curiosity. We present the first large-scale, statistically powered audit of native probabilistic sampling in frontier LLMs, benchmarking 11 models across 15 distributions. To disentangle failure modes, we employ a dual-protocol design: Batch Generation, where a model produces N=1000 samples within one response, and Independent Requests, comprising N=1000 stateless calls. We observe a sharp protocol asymmetry: batch generation achieves only modest statistical validity, with a 13% median pass rate, while independent requests collapse almost entirely, with 10 of 11 models passing none of the distributions. Beyond this asymmetry, we reveal that sampling fidelity degrades monotonically with distributional complexity and aggravates as the requested sampling horizon N increases. Finally, we demonstrate the propagation of these failures into downstream tasks: models fail to enforce uniform answer-position constraints in MCQ generation and systematically violate demographic targets in attribute-constrained text-to-image prompt synthesis. These findings indicate that current LLMs lack a functional internal sampler, necessitating the use of external tools for applications requiring statistical guarantees.
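
To show what a pass/fail check on sampled values can look like, the sketch below runs a chi-square goodness-of-fit test on die rolls against a uniform distribution. The abstract does not name the statistical tests the authors use, so the choice of test, the 0.05 significance level, and the sample data are assumptions; the critical value 11.070 is the standard chi-square quantile for 5 degrees of freedom at that level.

```python
from collections import Counter

# Chi-square critical value for 5 degrees of freedom at alpha = 0.05.
CHI2_CRITICAL_DF5 = 11.070

def chi_square_uniform_die(samples):
    """Chi-square statistic of six-sided die samples against a fair die.

    Assumes every sample is an integer from 1 to 6.
    """
    counts = Counter(samples)
    expected = len(samples) / 6
    return sum((counts.get(face, 0) - expected) ** 2 / expected
               for face in range(1, 7))

def passes_uniformity(samples):
    """True when the uniform hypothesis is not rejected at the 5% level."""
    return chi_square_uniform_die(samples) <= CHI2_CRITICAL_DF5

# Hypothetical model outputs: one balanced batch, one that favors six.
balanced = [1, 2, 3, 4, 5, 6] * 100
favors_six = [6] * 300 + [1, 2, 3, 4, 5] * 60
```

In practice the samples would come from parsed model responses, and an external random number generator such as Python's `random` module would be the kind of external tool the abstract recommends in place of native sampling.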

[24] Afri-MCQA: Multimodal Cultural Question Answering for African Languages

Atnafu Lambebo Tonja, Srija Anand, Emilio Villa-Cueva, Israel Abebe Azime, Jesujoba Oluwadara Alabi, Muhidin A. Mohamed, Debela Desalegn Yadeta, Negasi Haile Abadi, Abigail Oppong, Nnaemeka Casmir Obiefuna, Idris Abdulmumin, Naome A Etori, Eric Peter Wairagala, Kanda Patrick Tshinu, Imanigirimbabazi Emmanuel, Gabofetswe Malema, Alham Fikri Aji, David Ifeoluwa Adelani, Thamar Solorio

🧩 TL;DR

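This study introduces Afri-MCQA, the first multimodal cultural question-answering benchmark covering 15 African languages, with 7.5k parallel Q&A pairs. It reveals serious weaknesses of current large language models in understanding African languages and cultures, and stresses the need for speech-first approaches and culturally grounded pretraining.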


📘 Detailed Summary

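Motivation: Africa is home to more than one-third of the world's languages, yet it is severely underrepresented in AI research. There is no multimodal evaluation benchmark that covers many African languages and cultural contexts, which holds back the development of inclusive AI systems.

Method: The team built the Afri-MCQA benchmark, which contains 7.5k parallel English-African language Q&A pairs covering 15 African languages from 12 countries, in both text and speech modalities. All data were created by native speakers, and control experiments are included to separate the evaluation of linguistic competence from that of cultural knowledge.

Result: Benchmarking shows that open-weight large language models perform poorly on the evaluated cultural tasks, with near-zero accuracy on open-ended visual question answering when queried in a native language or in speech. There are significant performance gaps between native languages and English in both text and speech, highlighting a dual challenge of linguistic and cultural understanding.

Conclusion: The study underscores the need for speech-first approaches, culturally grounded pretraining, and cross-lingual cultural transfer. To support multimodal AI development for African languages, the team releases Afri-MCQA on HuggingFace under an academic license or CC BY-NC 4.0.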


📄 Abstract

Africa is home to over one-third of the world's languages, yet remains underrepresented in AI research. We introduce Afri-MCQA, the first Multilingual Cultural Question-Answering benchmark covering 7.5k Q&A pairs across 15 African languages from 12 countries. The benchmark offers parallel English-African language Q&A pairs across text and speech modalities and was entirely created by native speakers. Benchmarking large language models (LLMs) on Afri-MCQA shows that open-weight models perform poorly across evaluated cultures, with near-zero accuracy on open-ended VQA when queried in native language or speech. To evaluate linguistic competence, we include control experiments meant to assess this specific aspect separate from cultural knowledge, and we observe significant performance gaps between native languages and English for both text and speech. These findings underscore the need for speech-first approaches, culturally grounded pretraining, and cross-lingual cultural transfer. To support more inclusive multimodal AI development in African languages, we release Afri-MCQA under academic license or CC BY-NC 4.0 on HuggingFace (https://huggingface.co/datasets/Atnafu/Afri-MCQA).

[25] Multimodal In-context Learning for ASR of Low-resource Languages

Zhaolin Li, Jan Niehues

🧩 TL;DR

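This paper studies whether speech large language models can handle unseen languages through multimodal in-context learning (MICL), and proposes an ASR system that combines a stronger acoustic model with a speech LLM. Experiments on three endangered languages confirm the effectiveness of MICL and the benefit of cross-lingual transfer learning.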


📘 Detailed Summary

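Motivation: Automatic speech recognition (ASR) still covers only a small fraction of the world's languages, mainly because supervised data are scarce. Existing work on in-context learning (ICL) with large language models (LLMs) focuses mostly on high-resource languages covered during training and on text-only settings. This paper asks whether speech LLMs can handle unseen languages through multimodal in-context learning (MICL) and how that ability can improve ASR.

Method: Experiments use two speech LLMs, Phi-4 and Qwen3-Omni, on three diverse endangered languages. The study analyzes the effectiveness of MICL on unseen languages, examines how cross-lingual transfer learning improves MICL efficiency on target languages, interprets MICL mechanisms through attention-pattern analysis, and proposes an ASR system that combines a stronger acoustic model with a speech LLM, using MICL to select among acoustic hypotheses.

Result: MICL is effective for unseen languages and uses both speech and text modalities. Cross-lingual transfer learning improves MICL efficiency without using target-language data. Attention analysis shows layer-dependent preferences between audio and text context, with an overall bias toward text. The proposed ASR system improves consistently with MICL, and cross-lingual transfer learning matches or outperforms corpus-trained language models.

Conclusion: The study confirms the potential of speech LLMs to handle unseen languages through MICL, and shows that cross-lingual transfer learning is an effective route for low-resource ASR. The attention analysis sheds light on the internal workings of multimodal learning, and the proposed hybrid ASR framework is a practical way to combine conventional acoustic models with emerging LLM capabilities, with relevance for the preservation of endangered languages.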


📄 Abstract

Automatic speech recognition (ASR) still covers only a small fraction of the world's languages, mainly due to supervised data scarcity. In-context learning (ICL) with large language models (LLMs) addresses this problem, but prior work largely focuses on high-resource languages covered during training and text-only settings. This paper investigates whether speech LLMs can learn unseen languages with multimodal ICL (MICL), and how this learning can be used to improve ASR. We conduct experiments with two speech LLMs, Phi-4 and Qwen3-Omni, on three diverse endangered languages. Firstly, we find that MICL is effective for unseen languages, leveraging both speech and text modalities. We further show that cross-lingual transfer learning improves MICL efficiency on target languages without training on them. Moreover, we analyze attention patterns to interpret MICL mechanisms, and we observe layer-dependent preferences between audio and text context, with an overall bias towards text. Finally, we show that prompt-based ASR with speech LLMs performs poorly on unseen languages, motivating a simple ASR system that combines a stronger acoustic model with a speech LLM via MICL-based selection of acoustic hypotheses. Results show that MICL consistently improves ASR performance, and that cross-lingual transfer learning matches or outperforms corpus-trained language models without using target-language data. Our code is publicly available.

[26] Router-Suggest: Dynamic Routing for Multimodal Auto-Completion in Visually-Grounded Dialogs

Sandeep Mishra, Devichand Budagam, Anubhab Mandal, Bishal Santra, Pawan Goyal, Manish Gupta

🧩 TL;DR

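This paper introduces the Multimodal Auto-Completion (MAC) task, which uses visual context to predict upcoming characters in live chats, and develops the Router-Suggest routing framework, which achieves a 2.3x to 10x speedup while maintaining user satisfaction.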


📘 Detailed Summary

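Motivation: Traditional text-only auto-completion (TAC) is limited in settings that depend on shared visual context, such as digital assistants and chatbots, because it cannot use multimodal information to capture user intent. A real-time multimodal auto-completion method that incorporates visual cues is needed.

Method: The study defines the Multimodal Auto-Completion (MAC) task, which predicts upcoming characters from partially typed text and visual cues. It adapts MMDialog and ImageChat to create benchmark datasets, and develops Router-Suggest, a routing framework that dynamically selects between a textual model and a vision-language model based on dialog context, along with a lightweight variant for resource-constrained environments.

Result: Router-Suggest achieves a 2.3x to 10x speedup over the best-performing vision-language model. A user study shows that vision-language models significantly outperform textual models on user satisfaction, notably in saving typing effort and improving completion quality in multi-turn conversations.

Conclusion: Multimodal context is essential for auto-completion and enables smarter, user-aware assistants. Router-Suggest strikes a good balance between efficiency and accuracy, offers a practical solution for resource-constrained environments, and advances multimodal interactive systems.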


📄 Abstract

Real-time multimodal auto-completion is essential for digital assistants, chatbots, design tools, and healthcare consultations, where user inputs rely on shared visual context. We introduce Multimodal Auto-Completion (MAC), a task that predicts upcoming characters in live chats using partially typed text and visual cues. Unlike traditional text-only auto-completion (TAC), MAC grounds predictions in multimodal context to better capture user intent. To enable this task, we adapt MMDialog and ImageChat to create benchmark datasets. We evaluate leading vision-language models (VLMs) against strong textual baselines, highlighting trade-offs in accuracy and efficiency. We present Router-Suggest, a router framework that dynamically selects between textual models and VLMs based on dialog context, along with a lightweight variant for resource-constrained environments. Router-Suggest achieves a 2.3x to 10x speedup over the best-performing VLM. A user study shows that VLMs significantly excel over textual models on user satisfaction, notably saving user typing effort and improving the quality of completions in multi-turn conversations. These findings underscore the need for multimodal context in auto-completions, leading to smarter, user-aware assistants.

[27] CLewR: Curriculum Learning with Restarts for Machine Translation Preference Learning

Alexandra Dragomir, Florin Brad, Radu Tudor Ionescu

🧩 TL;DR

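This study proposes CLewR, a curriculum learning strategy with restarts that is integrated into preference optimization algorithms and improves the performance of large language models on zero-shot multilingual machine translation.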


📘 Detailed Summary

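Motivation: Large language models perform well in zero-shot multilingual machine translation, and follow-up work has improved performance further through preference optimization. Existing methods, however, largely ignore the order in which data samples are presented during training, and in particular how to mitigate catastrophic forgetting of easy examples while the model learns hard ones.

Method: The study proposes CLewR, a curriculum learning strategy with restarts that repeats the easy-to-hard curriculum multiple times during training to mitigate catastrophic forgetting of easy examples. The strategy is integrated into several state-of-the-art preference optimization algorithms.

Result: Experiments show that CLewR yields gains across several model families (Gemma2, Qwen2.5, and Llama3.1) and several preference optimization techniques, supporting its effectiveness and generality. The code is released on GitHub.

Conclusion: The study shows that training data order matters for preference optimization and provides an effective remedy for catastrophic forgetting, offering a new direction for improving machine translation with large language models.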


📄 Abstract

Large language models (LLMs) have demonstrated competitive performance in zero-shot multilingual machine translation (MT). Some follow-up works further improved MT performance via preference optimization, but they leave a key aspect largely underexplored: the order in which data samples are given during training. We address this topic by integrating curriculum learning into various state-of-the-art preference optimization algorithms to boost MT performance. We introduce a novel curriculum learning strategy with restarts (CLewR), which repeats the easy-to-hard curriculum multiple times during training to effectively mitigate the catastrophic forgetting of easy examples. We demonstrate consistent gains across several model families (Gemma2, Qwen2.5, Llama3.1) and preference optimization techniques. We publicly release our code at https://github.com/alexandra-dragomir/CLewR.
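
The data-ordering idea can be sketched in a few lines: sort the training examples from easy to hard, then replay that ordering several times so that easy examples are revisited after each hard stretch. The difficulty function (sentence length here), the number of restarts, and the function name are assumptions for illustration; the abstract does not say how CLewR scores difficulty or schedules restarts.

```python
def curriculum_with_restarts(examples, difficulty, num_restarts=3):
    """Return a training order that repeats an easy-to-hard pass several times.

    examples     : list of training items
    difficulty   : function mapping an item to a sortable difficulty score
    num_restarts : number of easy-to-hard passes (illustrative default)
    """
    ordered = sorted(examples, key=difficulty)
    schedule = []
    for _ in range(num_restarts):
        # Each restart begins again from the easiest example.
        schedule.extend(ordered)
    return schedule

# Hypothetical source sentences, with length standing in for difficulty.
sentences = ["a bb", "a", "a bb ccc"]
order = curriculum_with_restarts(sentences, difficulty=len, num_restarts=2)
```

The repository linked in the abstract is the authoritative implementation; this sketch only conveys the shape of the schedule.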

[28] Continual-learning for Modelling Low-Resource Languages from Large Language Models

Santosh Srinath K, Mudit Somani, Varun Reddy Padala, Prajna Devi Upadhyay, Abhijit Das

🧩 TL;DR

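This paper proposes a continual learning strategy based on parts-of-speech (POS) code-switching and a replay adapter to mitigate the catastrophic forgetting that arises when small language models are derived from large language models in multilingual settings, and validates it on vision-language and language modelling tasks.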


📘 Detailed Summary

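Motivation: Building language models for multilingual settings raises many challenges, with catastrophic forgetting chief among them. The problem is especially acute when small language models for low-resource languages are built by adapting large language models, and it limits how well performance is retained across multilingual tasks.

Method: The study proposes a continual learning strategy that combines POS-based code-switching with a replay adapter. POS-guided code-switching strengthens the multilingual adaptability of the language representations, while the replay adapter retains previously learned knowledge, together mitigating catastrophic forgetting.

Result: Experiments on vision-language tasks (such as visual question answering) and on a language modelling task show that the proposed architecture mitigates catastrophic forgetting. The method preserves multilingual ability while a small language model is derived from a large one, and performs well on the target tasks.

Conclusion: The study shows that a continual learning strategy combining POS code-switching with a replay adapter is an effective way to mitigate catastrophic forgetting when transferring multilingual models. It offers a new approach to building small language models for low-resource languages and applies to both vision-language and text-only tasks.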


📄 Abstract

Building a language model for a multilingual scenario poses several challenges, chief among them catastrophic forgetting. For example, small language models (SLMs) built for low-resource languages by adapting large language models (LLMs) are prone to catastrophic forgetting. This work proposes a continual learning strategy that combines parts-of-speech (POS)-based code-switching with a replay adapter to mitigate catastrophic forgetting while training an SLM from an LLM. Experiments on vision-language tasks, such as visual question answering, and on a language modelling task demonstrate the success of the proposed architecture.

[29] iReasoner: Trajectory-Aware Intrinsic Reasoning Supervision for Self-Evolving Large Multimodal Models

Meghana Sunil, Manikandarajan Venmathimaran, Muthu Subash Kavitha

🧩 TL;DR

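This paper proposes iReasoner, a self-evolving framework that improves the implicit reasoning of large multimodal models by explicitly eliciting chain-of-thought and rewarding its internal agreement, yielding gains on multimodal reasoning benchmarks under fully unsupervised post-training.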


📘 Detailed Summary

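Motivation: Existing self-evolving frameworks mainly reward final outcomes and neglect the intermediate reasoning process, leaving intermediate reasoning weakly constrained even though it matters for visually grounded decision making.

Method: iReasoner self-evolves over unlabeled images in a Proposer-Solver loop. It explicitly elicits chain-of-thought and rewards its internal agreement, augmenting outcome-level intrinsic rewards with a trajectory-aware signal defined over intermediate reasoning steps. This distinguishes reasoning paths that lead to the same answer without ground-truth labels or external judges.

Result: Starting from Qwen2.5-VL-7B, iReasoner yields gains of up to +2.1 points across diverse multimodal reasoning benchmarks under fully unsupervised post-training.

Conclusion: The work serves as a starting point for reasoning-aware self-improvement of large multimodal models in purely unsupervised settings. It emphasizes the importance of intermediate reasoning and offers an effective framework for improving reasoning without external supervision.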


📄 Abstract

Recent work shows that large multimodal models (LMMs) can self-improve from unlabeled data via self-play and intrinsic feedback. Yet existing self-evolving frameworks mainly reward final outcomes, leaving intermediate reasoning weakly constrained despite its importance for visually grounded decision making. We propose iReasoner, a self-evolving framework that improves an LMM's implicit reasoning by explicitly eliciting chain-of-thought (CoT) and rewarding its internal agreement. In a Proposer-Solver loop over unlabeled images, iReasoner augments outcome-level intrinsic rewards with a trajectory-aware signal defined over intermediate reasoning steps, providing learning signals that distinguish reasoning paths leading to the same answer without ground-truth labels or external judges. Starting from Qwen2.5-VL-7B, iReasoner yields up to +2.1 points across diverse multimodal reasoning benchmarks under fully unsupervised post-training. We hope this work serves as a starting point for reasoning-aware self-improvement in LMMs in purely unsupervised settings.

[30] Gender Bias in LLMs: Preliminary Evidence from Shared Parenting Scenario in Czech Family Law

Jakub Harasta, Matej Vasina, Martin Kornel, Tomas Foltynek

🧩 TL;DR

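This study evaluates whether leading large language models show gender bias in a family law scenario. Using an expert-designed divorce scenario, it finds gender-dependent output patterns in some models, highlighting the risks of laypeople relying on LLMs for legal guidance.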


📘 Detailed Summary

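Motivation: As laypeople increasingly rely on large language models for legal self-help, they may form expectations based on incomplete, incorrect, or biased outputs. The study evaluates whether leading LLMs exhibit gender bias in a realistic family law scenario, addressing a gap in the evaluation of model behavior in sensitive legal contexts.

Method: The study uses an expert-designed divorce scenario grounded in Czech family law to evaluate four state-of-the-art LLMs (GPT-5 nano, Claude Haiku 4.5, Gemini 2.5 Flash, and Llama 3.3) in a fully zero-shot interaction. Two versions of the scenario are deployed, one with gendered names and one with neutral labels, to establish a baseline for comparison. Nine legally relevant factors vary the factual circumstances of the case to test whether these changes influence the shared-parenting ratios the models propose.

Result: Preliminary results show differences across models and suggest gender-dependent patterns in the outcomes generated by some systems. The evidence is exploratory and descriptive, intended to identify systematic asymmetries rather than to establish causal effects, and it highlights how model behavior varies in sensitive legal contexts.

Conclusion: The findings underscore the risks of laypeople relying on LLMs for legal guidance and the need for more robust evaluation of model behavior in sensitive legal contexts. They have implications for the development and use of legal AI systems, particularly for ensuring fairness and reducing bias.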


📄 Abstract

Access to justice remains limited for many people, leading laypersons to increasingly rely on Large Language Models (LLMs) for legal self-help. Laypeople use these tools intuitively, which may lead them to form expectations based on incomplete, incorrect, or biased outputs. This study examines whether leading LLMs exhibit gender bias in their responses to a realistic family law scenario. We present an expert-designed divorce scenario grounded in Czech family law and evaluate four state-of-the-art LLMs (GPT-5 nano, Claude Haiku 4.5, Gemini 2.5 Flash, and Llama 3.3) in a fully zero-shot interaction. We deploy two versions of the scenario, one with gendered names and one with neutral labels, to establish a baseline for comparison. We further introduce nine legally relevant factors that vary the factual circumstances of the case and test whether these variations influence the models' proposed shared-parenting ratios. Our preliminary results highlight differences across models and suggest gender-dependent patterns in the outcomes generated by some systems. The findings underscore both the risks associated with laypeople's reliance on LLMs for legal guidance and the need for more robust evaluation of model behavior in sensitive legal contexts. We present exploratory and descriptive evidence intended to identify systematic asymmetries rather than to establish causal effects.

[31] Pantagruel: Unified Self-Supervised Encoders for French Text and Speech

Phuong-Hang Le, Valentin Pelloin, Arnault Chatelain, Maryem Bouziane, Mohammed Ghennai, Qianwen Guan, Kirill Milintsevich, Salima Mdhaffar, Aidan Mannion, Nils Defauw, Shuyue Gu, Alexandre Audibert, Marco Dinarelli, Yannick Estève, Lorraine Goeuriot, Steffen Lalande, Nicolas Hervé, Maximin Coavoux, François Portet, Étienne Ollion, Marie Candito, Maxime Peyrard, Solange Rossato, Benjamin Lecouteux, Aurélie Nardy, Gilles Sérasset, Vincent Segonne, Solène Evain, Diandra Fabre, Didier Schwab

🧩 TL;DR

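This paper presents the Pantagruel model family, self-supervised encoder models for French text and speech that learn contextualized target representations in the feature space and show competitive or superior performance to existing French baselines on a range of downstream tasks.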


📘 Detailed Summary

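Motivation: Existing self-supervised learning methods usually predict modality-specific targets, such as textual tokens or speech units, and lack a unified representation learning approach that captures linguistic and acoustic regularities effectively. This gap is particularly visible in French multimodal understanding.

Method: Pantagruel uses a feature-space self-supervised objective, so that modality-specific encoders learn contextualized target representations instead of modality-specific targets. The models are pre-trained on large French corpora: Wikipedia, OSCAR, and CroissantLLM for text, and MultilingualLibriSpeech, LeBenchmark, and the newly introduced INA-100k for speech. INA-100k is a 100,000-hour corpus of French audio from the Institut National de l'Audiovisuel (INA).

Result: Evaluated on a broad range of downstream tasks, including the standard French benchmarks FLUE and LeBenchmark, Pantagruel models show competitive or superior performance to strong baselines such as CamemBERT, FlauBERT, and LeBenchmark2.0 on both text and speech tasks, while keeping a shared architecture that handles either speech or text input.

Conclusion: The study confirms the effectiveness of feature-space self-supervised objectives for French representation learning. Pantagruel is a robust foundation for multimodal speech-text understanding, and its unified architecture supports flexible use across modalities.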


📄 Abstract

We release Pantagruel models, a new family of self-supervised encoder models for French text and speech. Instead of predicting modality-tailored targets such as textual tokens or speech units, Pantagruel learns contextualized target representations in the feature space, allowing modality-specific encoders to capture linguistic and acoustic regularities more effectively. Separate models are pre-trained on large-scale French corpora, including Wikipedia, OSCAR and CroissantLLM for text, together with MultilingualLibriSpeech, LeBenchmark, and INA-100k for speech. INA-100k is a newly introduced 100,000-hour corpus of French audio derived from the archives of the Institut National de l'Audiovisuel (INA), the national repository of French radio and television broadcasts, providing highly diverse audio data. We evaluate Pantagruel across a broad range of downstream tasks spanning both modalities, including those from the standard French benchmarks such as FLUE or LeBenchmark. Across these tasks, Pantagruel models show competitive or superior performance compared to strong French baselines such as CamemBERT, FlauBERT, and LeBenchmark2.0, while maintaining a shared architecture that can seamlessly handle either speech or text inputs. These results confirm the effectiveness of feature-space self-supervised objectives for French representation learning and highlight Pantagruel as a robust foundation for multimodal speech-text understanding.
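As a rough illustration of what predicting targets "in the feature space" can mean, the sketch below regresses a student encoder's outputs at masked positions onto a teacher's contextualized features, and updates the teacher as an exponential moving average of the student. This follows a generic teacher-student recipe and is an assumption: the abstract does not spell out Pantagruel's exact objective, and `ema_update`, `masked_feature_loss`, and the decay value are hypothetical names and settings.

```python
from typing import List, Sequence

Vector = List[float]


def ema_update(teacher: Vector, student: Vector, decay: float = 0.999) -> Vector:
    """Move teacher weights toward student weights (exponential moving average)."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def masked_feature_loss(
    student_out: Sequence[Vector],
    teacher_targets: Sequence[Vector],
    masked: Sequence[bool],
) -> float:
    """Mean squared error between student predictions and teacher features,
    averaged over the masked positions only."""
    total, count = 0.0, 0
    for pred, target, is_masked in zip(student_out, teacher_targets, masked):
        if not is_masked:
            continue  # unmasked positions do not contribute to the loss
        total += sum((p - t) ** 2 for p, t in zip(pred, target))
        count += len(pred)
    return total / count if count else 0.0


# Toy example: three positions (tokens or audio frames), two feature dimensions.
student_out = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
teacher_targets = [[0.0, 0.0], [1.0, 3.0], [5.0, 5.0]]
masked = [False, True, False]

loss = masked_feature_loss(student_out, teacher_targets, masked)
new_teacher = ema_update([1.0, 1.0], [0.0, 2.0], decay=0.5)
```

The same loss applies whether the positions are text tokens or speech frames, which is one reason a feature-space target can be shared across modalities.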

cs.AI [Back]

[32] Conformity and Social Impact on AI Agents

Alessandro Bellina, Giordano De Marzo, David Garcia

🧩 TL;DR

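Using experimental paradigms from social psychology, this study finds that multimodal large language models acting as AI agents show a systematic conformity bias in group settings, exposing a fundamental security vulnerability in agent decision-making that could be exploited to spread misinformation and propagate bias.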


📘 Detailed Summary

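Motivation: As AI agents become more common in multi-agent environments, understanding their collective behavior is essential for predicting the dynamics of artificial societies. The study examines how AI agents respond to group influence when they act as social actors, focusing on conformity, a well-studied phenomenon in social psychology, to expose potential security vulnerabilities in multi-agent systems.

Method: The authors adapt classic visual experiments from social psychology and place multimodal large language models, acting as AI agents, in group-influence situations. The design follows Social Impact Theory and systematically examines how group size, unanimity, task difficulty, and source characteristics affect conformity.

Result: AI agents show a systematic conformity bias whose pattern matches the predictions of Social Impact Theory: they are sensitive to group size, unanimity, task difficulty, and source characteristics. Agents that perform near-perfectly in isolation become highly susceptible to manipulation under social influence. This vulnerability persists across model scales: larger models conform less on simple tasks because they are more capable, but remain vulnerable when operating near the boundary of their competence.

Conclusion: These findings reveal fundamental security vulnerabilities in AI agent decision-making that could be exploited for misinformation campaigns, bias propagation, and manipulation of multi-agent systems. The study stresses the urgency of safeguards in collective AI deployments and the need for robust AI systems that resist manipulation through social influence.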


📄 Abstract

As AI agents increasingly operate in multi-agent environments, understanding their collective behavior becomes critical for predicting the dynamics of artificial societies. This study examines conformity, the tendency to align with group opinions under social pressure, in large multimodal language models functioning as AI agents. By adapting classic visual experiments from social psychology, we investigate how AI agents respond to group influence as social actors. Our experiments reveal that AI agents exhibit a systematic conformity bias, aligned with Social Impact Theory, showing sensitivity to group size, unanimity, task difficulty, and source characteristics. Critically, AI agents achieving near-perfect performance in isolation become highly susceptible to manipulation through social influence. This vulnerability persists across model scales: while larger models show reduced conformity on simple tasks due to improved capabilities, they remain vulnerable when operating at their competence boundary. These findings reveal fundamental security vulnerabilities in AI agent decision-making that could enable malicious manipulation, misinformation campaigns, and bias propagation in multi-agent systems, highlighting the urgent need for safeguards in collective AI deployments.
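For readers unfamiliar with Social Impact Theory, its best-known quantitative statement is a power law in group size: impact grows as `s * N ** t` with `0 < t < 1`, so each additional source of pressure adds less than the previous one. The sketch below shows that shape only. The function name and the values of `strength` and `exponent` are illustrative assumptions, not quantities fitted or reported in the paper.

```python
from typing import List


def social_impact(n_sources: int, strength: float = 1.0, exponent: float = 0.5) -> float:
    """Power-law form of social impact: strength * N ** exponent, with 0 < exponent < 1."""
    if n_sources < 0:
        raise ValueError("n_sources must be non-negative")
    if not 0.0 < exponent < 1.0:
        raise ValueError("exponent must lie strictly between 0 and 1")
    return strength * n_sources ** exponent


# Impact for groups of 1 to 5 sources, and the gain from each added source.
impacts: List[float] = [social_impact(n) for n in range(1, 6)]
marginal_gains: List[float] = [b - a for a, b in zip(impacts, impacts[1:])]
```

With any exponent below 1, `marginal_gains` is strictly decreasing, which is the qualitative "sensitivity to group size" pattern the theory predicts.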

[33] ART: Adaptive Reasoning Trees for Explainable Claim Verification

Sahil Wadhwa, Himanshu Kumar, Guanqun Yang, Abbaas Alif Mohamed Nishar, Pranab Mohanty, Swapnil Shinde, Yue Wu

🧩 TL;DR

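This paper proposes ART (Adaptive Reasoning Trees), a hierarchical method for claim verification that builds a tree of supporting and attacking child arguments and uses a judge LLM for pairwise comparisons, yielding a transparent and contestable reasoning process that substantially improves the explainability and reliability of claim verification.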


📘 Detailed Summary

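Motivation: Although large language models show strong potential for complex decision-making, their use in high-stakes settings is limited by opacity: their outputs lack faithful explanations and cannot be effectively contested to correct errors, which undermines trust. Existing methods such as Chain-of-Thought lack a systematic mechanism for transparent, contestable reasoning.

Method: ART verifies claims with a hierarchical tree. Starting from the root claim, it branches into supporting and attacking child arguments. A judge LLM compares child arguments in a pairwise tournament, argument strength is determined bottom-up, and a transparent, contestable verdict is derived systematically.

Result: Empirical validation on multiple datasets shows that ART's structured reasoning outperforms strong baselines and sets a new benchmark for explainable claim verification, with a more reliable decision process and clearer overall decision steps. The study also analyzes different argument generators and comparison strategies.

Conclusion: ART addresses the opacity of LLM decision-making through a structured reasoning framework and offers a workable path toward trustworthy AI decisions in high-stakes settings. Its transparent, contestable verification mechanism opens a new direction for explainable AI research and underlines the role of systematic reasoning in improving model trustworthiness.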


📄 Abstract

Large Language Models (LLMs) are powerful candidates for complex decision-making, leveraging vast encoded knowledge and remarkable zero-shot abilities. However, their adoption in high-stakes environments is hindered by their opacity; their outputs lack faithful explanations and cannot be effectively contested to correct errors, undermining trustworthiness. In this paper, we propose ART (Adaptive Reasoning Trees), a hierarchical method for claim verification. The process begins with a root claim, which branches into supporting and attacking child arguments. An argument's strength is determined bottom-up via a pairwise tournament of its children, adjudicated by a judge LLM. This allows a final, transparent, and contestable verdict to be derived systematically, a property missing from methods like Chain-of-Thought (CoT). We empirically validate ART on multiple datasets, analyzing different argument generators and comparison strategies. Our findings show that ART's structured reasoning outperforms strong baselines, establishing a new benchmark for explainable claim verification that is more reliable and keeps the overall decision-making process clear.
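The bottom-up tournament can be pictured with a small sketch. Everything here is a stand-in chosen for illustration: the `Argument` class, the `prior` scores on leaves, the `stub_judge` that replaces the judge LLM, and the rule that a winning attacker inverts the strength of its parent are all assumptions, since the abstract does not give ART's actual scoring or aggregation rules.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import Callable, List


@dataclass
class Argument:
    text: str
    stance: str = "root"   # "support" or "attack", relative to the parent
    prior: float = 0.5     # hypothetical stand-in score, used for leaves only
    children: List["Argument"] = field(default_factory=list)


# A judge sees two sibling arguments with their strengths and returns the winner.
Judge = Callable[[Argument, float, Argument, float], Argument]


def stub_judge(a: Argument, sa: float, b: Argument, sb: float) -> Argument:
    """Placeholder for the judge LLM: the stronger argument wins, ties go to `a`."""
    return a if sa >= sb else b


def strength(node: Argument, judge: Judge = stub_judge) -> float:
    """Score a node bottom-up from a round-robin tournament of its children."""
    if not node.children:
        return node.prior
    scores = [strength(child, judge) for child in node.children]
    wins = [0] * len(node.children)
    for i, j in combinations(range(len(node.children)), 2):
        winner = judge(node.children[i], scores[i], node.children[j], scores[j])
        wins[i if winner is node.children[i] else j] += 1
    best = max(range(len(node.children)), key=lambda k: wins[k])
    # Assumed aggregation: a winning supporter passes its strength up,
    # a winning attacker weakens the parent by the same amount.
    if node.children[best].stance == "support":
        return scores[best]
    return 1.0 - scores[best]


root = Argument(
    "The claim under verification",
    children=[
        Argument("Evidence A backs the claim", "support", prior=0.8),
        Argument(
            "Source B contradicts the claim", "attack", prior=0.6,
            children=[Argument("Source B is outdated", "attack", prior=0.9)],
        ),
    ],
)
root_strength = strength(root)
verdict = "supported" if root_strength >= 0.5 else "refuted"
```

Because every comparison is an explicit call to the judge on two named arguments, each step of the verdict can be inspected and contested individually, which is the property the abstract contrasts with a single free-form chain of thought.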

[34] MMUEChange: A Generalized LLM Agent Framework for Intelligent Multi-Modal Urban Environment Change Analysis

Zixuan Xiao, Jun Ma, Siwei Zhang

🧩 TL;DR

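This paper proposes MMUEChange, a multi-modal agent framework that flexibly integrates and aligns heterogeneous urban data through a modular toolkit and a Modality Controller, substantially improving performance on complex urban change analysis tasks and effectively mitigating hallucination.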


📘 Detailed Summary

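Motivation: Current approaches to urban environment change analysis, particularly remote sensing change detection, often rely on rigid, single-modal analysis and struggle with complex urban change scenarios. This limits effective monitoring and understanding of sustainable urban development.

Method: The paper proposes MMUEChange, a multi-modal agent framework. It combines a modular toolkit for flexibly integrating heterogeneous urban data with a core module, the Modality Controller, which handles cross-modal and intra-modal alignment to enable robust analysis of complex urban change scenarios.

Result: Compared with the best-performing baseline, the MMUEChange agent achieves a 46.7% improvement in task success rate and effectively mitigates hallucination. Case studies demonstrate its analysis of real-world scenarios: changes in community parks in New York, the spread of water pollution in Hong Kong, and the decline of open dumpsites in Shenzhen.

Conclusion: The study demonstrates the effectiveness of a multi-modal agent framework for urban change analysis. The framework can reveal the complex links behind different urban pressures, provides practically meaningful insights for urban policy-making, and offers a new technical route for monitoring sustainable urban development.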


📄 Abstract

Understanding urban environment change is essential for sustainable development. However, current approaches, particularly remote sensing change detection, often rely on rigid, single-modal analysis. To overcome these limitations, we propose MMUEChange, a multi-modal agent framework that flexibly integrates heterogeneous urban data via a modular toolkit and a core module, the Modality Controller, for cross- and intra-modal alignment, enabling robust analysis of complex urban change scenarios. Case studies include: a shift toward small, community-focused parks in New York, reflecting local green space efforts; the spread of concentrated water pollution across districts in Hong Kong, pointing to coordinated water management; and a notable decline in open dumpsites in Shenzhen, with contrasting links between nighttime economic activity and waste types, indicating differing urban pressures behind domestic and construction waste. Compared to the best-performing baseline, the MMUEChange agent achieves a 46.7% improvement in task success rate and effectively mitigates hallucination, demonstrating its capacity to support complex urban change analysis tasks with real-world policy implications.

[35] Safety Not Found (404): Hidden Risks of LLM-Based Robotics Decision Making

Jua Han, Jaeyoon Seo, Jungbin Min, Jean Oh, Jihie Kim

🧩 TL;DR

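This study systematically evaluates the decision-making of large language models in safety-critical robotics scenarios, shows that even 99% accuracy can still lead to catastrophic outcomes where physical risk is involved, and demonstrates that current state-of-the-art models cannot guarantee safe deployment.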


📘 Detailed Summary

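Motivation: As large language models become integral to robotics decision-making, the physical dimension of risk grows markedly, and a single wrong instruction can directly endanger human safety. The study addresses the urgent need to systematically evaluate LLM performance in safety-critical scenarios, especially settings where even minor errors can be catastrophic.

Method: A qualitative evaluation of a fire evacuation scenario identifies critical failure cases. Based on these, the authors design seven quantitative tasks in three categories: Complete Information, Incomplete Information, and Safety-Oriented Spatial Reasoning (SOSR). Complete information tasks use ASCII maps to minimize interpretation ambiguity and isolate spatial reasoning from visual processing. Incomplete information tasks require models to infer missing context, testing spatial continuity versus hallucination. SOSR tasks use natural language to evaluate safe decision-making in life-threatening contexts. Various LLMs and vision-language models are benchmarked.

Result: The results reveal serious vulnerabilities: several models reach a 0% success rate in ASCII navigation, and in a simulated fire drill models instruct robots to move toward hazardous areas instead of emergency exits. The study analyzes the implications of a 1% failure rate and shows how "rare" errors escalate into catastrophic outcomes. The benchmark indicates that even state-of-the-art models cannot guarantee safety, and that 99% accuracy is misleading in robotics because it implies that one in every hundred executions could cause catastrophic harm.

Conclusion: The study reaches a sobering conclusion: current large language models are not ready for direct deployment in safety-critical systems. In robotics, 99% accuracy is dangerously misleading because it implies that one in every hundred executions could cause catastrophic harm. Absolute reliance on these models creates unacceptable risk, and stricter safety mechanisms and evaluation frameworks are needed for reliable deployment in physical environments.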


📄 Abstract

One mistake by an AI system in a safety-critical setting can cost lives. As Large Language Models (LLMs) become integral to robotics decision-making, the physical dimension of risk grows; a single wrong instruction can directly endanger human safety. This paper addresses the urgent need to systematically evaluate LLM performance in scenarios where even minor errors are catastrophic. Through a qualitative evaluation of a fire evacuation scenario, we identified critical failure cases in LLM-based decision-making. Based on these, we designed seven tasks for quantitative assessment, categorized into: Complete Information, Incomplete Information, and Safety-Oriented Spatial Reasoning (SOSR). Complete information tasks utilize ASCII maps to minimize interpretation ambiguity and isolate spatial reasoning from visual processing. Incomplete information tasks require models to infer missing context, testing for spatial continuity versus hallucinations. SOSR tasks use natural language to evaluate safe decision-making in life-threatening contexts. We benchmark various LLMs and Vision-Language Models (VLMs) across these tasks. Beyond aggregate performance, we analyze the implications of a 1% failure rate, highlighting how "rare" errors escalate into catastrophic outcomes. Results reveal serious vulnerabilities: several models achieved a 0% success rate in ASCII navigation, while in a simulated fire drill, models instructed robots to move toward hazardous areas instead of emergency exits. Our findings lead to a sobering conclusion: current LLMs are not ready for direct deployment in safety-critical systems. A 99% accuracy rate is dangerously misleading in robotics, as it implies one out of every hundred executions could result in catastrophic harm. We demonstrate that even state-of-the-art models cannot guarantee safety, and absolute reliance on them creates unacceptable risks.
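The point about a 1% failure rate is easy to check with a few lines of arithmetic. The sketch below computes the probability of at least one failure over repeated executions. It assumes that runs fail independently with the same probability, which is a simplification introduced here rather than a claim from the paper, and the function name is hypothetical.

```python
def prob_at_least_one_failure(per_run_accuracy: float, runs: int) -> float:
    """Probability that at least one of `runs` independent executions fails."""
    if not 0.0 <= per_run_accuracy <= 1.0:
        raise ValueError("per_run_accuracy must be in [0, 1]")
    if runs < 0:
        raise ValueError("runs must be non-negative")
    # All runs succeed with probability accuracy ** runs; take the complement.
    return 1.0 - per_run_accuracy ** runs


# How quickly a "rare" 1% error becomes likely as the number of runs grows.
for runs in (1, 10, 100, 1000):
    print(runs, round(prob_at_least_one_failure(0.99, runs), 4))
```

Under this independence assumption, a system that is 99% accurate per execution has roughly a 63% chance of at least one failure across 100 executions, which is why per-run accuracy alone says little about safety over a deployment lifetime.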

[36] GenCtrl -- A Formal Controllability Toolkit for Generative Models

Emily Cheng, Carmen Amo Alonso, Federico Danieli, Arno Blaas, Luca Zappella, Pau Rodriguez, Xavier Suau

🧩 TL;DR

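This study proposes a theoretical framework for formally assessing whether generative models are truly controllable. It models human-model interaction as a control process, develops a new algorithm for estimating a model's controllable sets, and provides distribution-free PAC guarantees.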


📘 Detailed Summary

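Motivation: As generative models become ubiquitous, the need for fine-grained control over generation grows. Control methods ranging from prompting to fine-tuning keep proliferating, yet a fundamental question remains unanswered: are these models truly controllable? The study aims to answer this question formally through a theoretical framework, filling a gap in the understanding of the fundamental limits of controllability in generative models.

Method: The study frames human-model interaction as a control process and proposes a novel algorithm for estimating a model's controllable sets in a dialogue setting. It gives formal guarantees on the estimation error as a function of sample complexity: it derives probably-approximately-correct bounds for controllable set estimates that are distribution-free, make no assumptions other than output boundedness, and apply to any black-box nonlinear control system (that is, any generative model).

Result: The framework is validated empirically on different tasks in controlling dialogue processes, for both language models and text-to-image generation. The results show that model controllability is surprisingly fragile and highly dependent on the experimental setting. The empirical validation shows that the controllability analysis is useful in practice.

Conclusion: The study stresses the need for rigorous controllability analysis, shifting the focus from simply attempting control to first understanding its fundamental limits. The observed fragility of controllability is an important warning for reliable control in real applications, and the framework provides a theoretical and methodological basis for future research on controllable generation.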


📄 Abstract

As generative models become ubiquitous, there is a critical need for fine-grained control over the generation process. Yet, while controlled generation methods from prompting to fine-tuning proliferate, a fundamental question remains unanswered: are these models truly controllable in the first place? In this work, we provide a theoretical framework to formally answer this question. Framing human-model interaction as a control process, we propose a novel algorithm to estimate the controllable sets of models in a dialogue setting. Notably, we provide formal guarantees on the estimation error as a function of sample complexity: we derive probably-approximately correct bounds for controllable set estimates that are distribution-free, employ no assumptions except for output boundedness, and work for any black-box nonlinear control system (i.e., any generative model). We empirically demonstrate the theoretical framework on different tasks in controlling dialogue processes, for both language models and text-to-image generation. Our results show that model controllability is surprisingly fragile and highly dependent on the experimental setting. This highlights the need for rigorous controllability analysis, shifting the focus from simply attempting control to first understanding its fundamental limits.
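To give a feel for a distribution-free guarantee that relies only on bounded outputs, the sketch below uses the textbook Hoeffding inequality for the mean of i.i.d. samples in [0, 1]. This is a swapped-in, much simpler bound chosen for illustration. It is not the paper's PAC bound for controllable set estimates, and the function names are hypothetical.

```python
import math


def hoeffding_samples(epsilon: float, delta: float) -> int:
    """Samples needed so that the empirical mean of values in [0, 1] is within
    `epsilon` of the true mean with probability at least 1 - delta."""
    if epsilon <= 0.0 or not 0.0 < delta < 1.0:
        raise ValueError("need epsilon > 0 and 0 < delta < 1")
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


def hoeffding_error(n: int, delta: float) -> float:
    """Error radius guaranteed with probability at least 1 - delta after n samples."""
    if n <= 0 or not 0.0 < delta < 1.0:
        raise ValueError("need n > 0 and 0 < delta < 1")
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


n_needed = hoeffding_samples(epsilon=0.05, delta=0.05)
radius = hoeffding_error(n_needed, delta=0.05)
```

The guarantee holds for any output distribution on [0, 1], and the error shrinks as the inverse square root of the sample count, so halving the error costs four times as many queries to the model.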

[37] From Off-Policy to On-Policy: Enhancing GUI Agents via Bi-level Expert-to-Policy Assimilation

Zezhou Wang, Ziyun Zhang, Xiaoyi Zhang, Zhuzhong Qian, Yan Lu

🧩 TL;DR

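This paper proposes BEPA (Bi-Level Expert-to-Policy Assimilation), which turns static expert trajectories into policy-aligned guidance. It addresses the structural mismatch that arises when reinforcement learning is mixed with expert trajectories on GUI tasks, and substantially improves end-to-end screenshot-to-action policies on benchmarks such as OSWorld.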


📘 Detailed Summary

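Motivation: GUI datasets such as OSWorld pose two bottlenecks: they expose only a few hundred interactive, verifiable tasks and environments, and expert trajectories must be collected by interacting with those environments, which makes them hard to scale. The study asks how reinforcement learning from verifiable rewards (RLVR) can best exploit a small pool of existing expert trajectories to train end-to-end policies, and addresses the structural mismatch and distribution shift that appear when off-policy expert trajectories are mixed into on-policy RLVR.

Method: The paper proposes BEPA (Bi-Level Expert-to-Policy Assimilation), which has two levels. LEVEL-1 turns static expert trajectories into policy-aligned guidance via self-rolled reachable trajectories under the base policy. LEVEL-2 uses a per-task, dynamically updated cache during RLVR training. The expert trajectories are thus converted into guidance aligned with the current policy rather than mixed in directly.

Result: On OSWorld-Verified, BEPA improves the success rate of UITARS1.5-7B from 22.87% to 32.13%, and raises a held-out split from 5.74% to 10.30%. It also yields consistent gains on MMBench-GUI and Online-Mind2Web, supporting the effectiveness of the method.

Conclusion: The study shows that a bi-level assimilation mechanism, which turns static expert trajectories into policy-aligned guidance, can resolve the structural mismatch between expert trajectories and the learned policy. It offers a practical way to train strong end-to-end GUI agents from limited expert data and moves computer-use agents closer to real deployment.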


📄 Abstract

Vision-language models are increasingly deployed as computer-use agents (CUAs) that operate desktops and browsers. Top-performing CUAs are framework-based systems that decompose planning and execution, while end-to-end screenshot-to-action policies are easier to deploy but lag behind on benchmarks such as OSWorld-Verified. GUI datasets like OSWorld pose two bottlenecks: they expose only a few hundred interactive, verifiable tasks and environments, and expert trajectories must be gathered by interacting with these environments, making such data hard to scale. We therefore ask how reinforcement learning from verifiable rewards (RLVR) can best exploit a small pool of existing expert trajectories to train end-to-end policies. Naively mixing these off-policy traces into on-policy RLVR is brittle: even after format conversion, expert trajectories exhibit structural mismatch and distribution shift from the learner. We propose BEPA (Bi-Level Expert-to-Policy Assimilation), which turns static expert traces into policy-aligned guidance via self-rolled reachable trajectories under the base policy (LEVEL-1) and a per-task, dynamically updated cache used in RLVR (LEVEL-2). On OSWorld-Verified, BEPA improves UITARS1.5-7B success from 22.87% to 32.13% and raises a held-out split from 5.74% to 10.30%, with consistent gains on MMBench-GUI and Online-Mind2Web. Our code and data are available at: https://github.com/LEON-gittech/Verl_GUI.git
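The reported success rates are easier to compare when expressed as both absolute and relative gains. The short calculation below derives them from the four figures quoted in the abstract. The helper name `gains` is introduced here for illustration only.

```python
from typing import Tuple


def gains(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after - before
    relative = absolute / before * 100.0
    return round(absolute, 2), round(relative, 1)


# Success rates quoted in the abstract, in percent.
osworld_verified = gains(22.87, 32.13)
held_out_split = gains(5.74, 10.30)

print("OSWorld-Verified:", osworld_verified)
print("Held-out split:", held_out_split)
```

The main benchmark gain is about 9.3 points (roughly 40% relative), while the held-out split gains about 4.6 points from a much lower starting point (roughly 79% relative).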

[38] TowerMind: A Tower Defence Game Learning Environment and Benchmark for LLM as Agents

Dawei Wang, Chengming Zhou, Di Zhao, Xinyuan Liu, Marci Chi Ma, Gary Ushaw, Richard Davison

🧩 TL;DR

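This paper presents TowerMind, a new lightweight, multimodal environment based on tower defense games for evaluating the long-term planning and decision-making of large language models. It addresses the high computational demands and missing textual observations of existing RTS game environments.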


📘 Detailed Summary

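Motivation: Existing environments based on real-time strategy (RTS) games either have relatively high computational demands or lack support for textual observations. This has limited the use of RTS games for evaluating the long-term planning and decision-making of large language models, two core general-purpose capabilities that agents need to adapt to diverse scenarios and tasks.

Method: The paper presents TowerMind, an environment built on the tower defense subgenre of RTS games. It keeps the key strengths of RTS games for evaluating LLMs while offering low computational demands and a multimodal observation space with pixel-based, textual, and structured game-state representations. It also supports the evaluation of model hallucination and is highly customizable. Five benchmark levels are designed to evaluate LLMs under different multimodal input settings.

Result: The experiments show a clear performance gap between LLMs and human experts on both the capability and hallucination dimensions. They also expose key limitations in LLM behavior, including inadequate planning validation, a lack of multifinality in decision-making, and inefficient action use. Two classic reinforcement learning algorithms, Ape-X DQN and PPO, are evaluated as well.

Conclusion: With its lightweight, multimodal design, TowerMind complements the existing landscape of RTS game-based environments and introduces a new benchmark for the AI agent field. It reveals systematic weaknesses of LLMs in planning and decision-making and provides an evaluation tool and direction for future agent research.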


📄 Abstract

Recent breakthroughs in Large Language Models (LLMs) have positioned them as a promising paradigm for agents, with long-term planning and decision-making emerging as core general-purpose capabilities for adapting to diverse scenarios and tasks. Real-time strategy (RTS) games serve as an ideal testbed for evaluating these two capabilities, as their inherent gameplay requires both macro-level strategic planning and micro-level tactical adaptation and action execution. Existing RTS game-based environments either suffer from relatively high computational demands or lack support for textual observations, which has constrained the use of RTS games for LLM evaluation. Motivated by this, we present TowerMind, a novel environment grounded in the tower defense (TD) subgenre of RTS games. TowerMind preserves the key evaluation strengths of RTS games for assessing LLMs, while featuring low computational demands and a multimodal observation space, including pixel-based, textual, and structured game-state representations. In addition, TowerMind supports the evaluation of model hallucination and provides a high degree of customizability. We design five benchmark levels to evaluate several widely used LLMs under different multimodal input settings. The results reveal a clear performance gap between LLMs and human experts across both capability and hallucination dimensions. The experiments further highlight key limitations in LLM behavior, such as inadequate planning validation, a lack of multifinality in decision-making, and inefficient action use. We also evaluate two classic reinforcement learning algorithms: Ape-X DQN and PPO. By offering a lightweight and multimodal design, TowerMind complements the existing RTS game-based environment landscape and introduces a new benchmark for the AI agent field. The source code is publicly available on GitHub (https://github.com/tb6147877/TowerMind).

[39] Open-Vocabulary 3D Instruction Ambiguity Detection

Jiayu Ding, Haoran Tang, Ge Li

🧩 TL;DR

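This paper is the first to define the task of Open-Vocabulary 3D Instruction Ambiguity Detection. It builds Ambi3D, a large-scale benchmark, and proposes AmbiVer, a two-stage framework that addresses the limitations of existing 3D large language models in judging instruction ambiguity, providing an important safeguard for embodied AI systems in safety-critical domains.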


📘 Detailed Summary

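Motivation: In safety-critical domains, linguistic ambiguity can have severe consequences, yet most embodied AI research overlooks the problem, assuming instructions are clear and focusing on execution rather than confirmation. To close this safety gap, the paper is the first to define Open-Vocabulary 3D Instruction Ambiguity Detection, a fundamental new task in which a model must determine whether an instruction has a single, unambiguous meaning within a given 3D scene.

Method: The paper proposes AmbiVer, a two-stage framework that first collects explicit visual evidence from multiple views and then uses that evidence to guide a vision-language model in judging instruction ambiguity. It also builds Ambi3D, a large-scale benchmark with over 700 diverse 3D scenes and around 22k instructions, to support research on the task.

Result: The analysis reveals a surprising limitation: state-of-the-art 3D large language models struggle to judge instruction ambiguity reliably. Extensive experiments demonstrate both the difficulty of the task and the effectiveness of AmbiVer, laying groundwork for safer and more trustworthy embodied AI systems.

Conclusion: The study provides an important safeguard for embodied AI systems in safety-critical domains by detecting ambiguous instructions before they lead to errors. The Ambi3D benchmark and the AmbiVer framework open a new direction for future research and underline the need to build safety confirmation mechanisms into AI systems.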


📄 Abstract

In safety-critical domains, linguistic ambiguity can have severe consequences; a vague command like "Pass me the vial" in a surgical setting could lead to catastrophic errors. Yet, most embodied AI research overlooks this, assuming instructions are clear and focusing on execution rather than confirmation. To address this critical safety gap, we are the first to define Open-Vocabulary 3D Instruction Ambiguity Detection, a fundamental new task where a model must determine if a command has a single, unambiguous meaning within a given 3D scene. To support this research, we build Ambi3D, a large-scale benchmark for this task, featuring over 700 diverse 3D scenes and around 22k instructions. Our analysis reveals a surprising limitation: state-of-the-art 3D Large Language Models (LLMs) struggle to reliably determine if an instruction is ambiguous. To address this challenge, we propose AmbiVer, a two-stage framework that collects explicit visual evidence from multiple views and uses it to guide a vision-language model (VLM) in judging instruction ambiguity. Extensive experiments demonstrate the challenge of our task and the effectiveness of AmbiVer, paving the way for safer and more trustworthy embodied AI. Code and dataset available at https://jiayuding031020.github.io/ambi3d/.
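The "Pass me the vial" example can be made concrete with a toy check: a referring expression is treated as ambiguous when it matches more than one object in the scene. This is a deliberately simplified, closed-vocabulary sketch over a symbolic object list. It is not AmbiVer, which works on 3D scenes with a vision-language model, and the scene format and function names are assumptions made for illustration.

```python
from typing import Dict, List

Scene = List[Dict[str, str]]


def matching_objects(scene: Scene, category: str, **attributes: str) -> Scene:
    """Objects of the given category whose attributes all match the request."""
    return [
        obj for obj in scene
        if obj.get("category") == category
        and all(obj.get(key) == value for key, value in attributes.items())
    ]


def is_ambiguous(scene: Scene, category: str, **attributes: str) -> bool:
    """True when the description fits more than one object in the scene."""
    return len(matching_objects(scene, category, **attributes)) > 1


# Hypothetical surgical tray with two vials and one scalpel.
tray: Scene = [
    {"category": "vial", "color": "red"},
    {"category": "vial", "color": "blue"},
    {"category": "scalpel", "color": "silver"},
]

vague = is_ambiguous(tray, "vial")                   # "Pass me the vial"
specific = is_ambiguous(tray, "vial", color="blue")  # "Pass me the blue vial"
```

In this toy setting the vague command is flagged and the specific one is not. A description that matches no object at all is a separate failure case that this check does not flag, and a real system would need to handle it as well.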