cs.CV

[1] ESCA: Contextualizing Embodied Agents via Scene-Graph Generation

Jiani Huang, Amish Sethi, Matthew Kuo, Mayank Keoliya, Neelay Velingker, JungHo Jung, Ser-Nam Lim, Ziyang Li, Mayur Naik

🧩 TL;DR

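This paper proposes ESCA, a framework that contextualizes embodied agents through structured spatial-temporal understanding. Its core is SGClip, a CLIP-based, open-domain, promptable scene graph generation model, which significantly improves the performance of multi-modal large language models in embodied environments.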


📘 Detailed Summary

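Motivation: Current training of multi-modal large language models relies mainly on high-level vision-sound-text pairs and lacks fine-grained, structured alignment between pixel-level visual content and textual semantics, which limits the perception and reasoning abilities of embodied agents.

Method: The paper proposes the ESCA framework, whose core is SGClip, a CLIP-based, open-domain, promptable scene graph generation model. SGClip is trained on 87K+ open-domain videos through a neurosymbolic learning pipeline that uses model-driven self-supervision from video-caption pairs together with structured reasoning, so no human-labeled scene graph annotations are needed.

Result: SGClip performs strongly on scene graph generation and action localization benchmarks. ESCA consistently improves both open-source and commercial multi-modal large language models, achieves state-of-the-art performance in two embodied environments, significantly reduces agent perception errors, and enables open-source models to surpass proprietary baselines.

Conclusion: The study shows the importance of structured spatial-temporal understanding for embodied agents. Neurosymbolic learning can effectively address the lack of annotated data and offers a new technical path toward more capable general-purpose embodied agents.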


📄 Abstract

Multi-modal large language models (MLLMs) are making rapid progress toward general-purpose embodied agents. However, current training pipelines primarily rely on high-level vision-sound-text pairs and lack fine-grained, structured alignment between pixel-level visual content and textual semantics. To overcome this challenge, we propose ESCA, a new framework for contextualizing embodied agents through structured spatial-temporal understanding. At its core is SGClip, a novel CLIP-based, open-domain, and promptable model for generating scene graphs. SGClip is trained on 87K+ open-domain videos via a neurosymbolic learning pipeline, which harnesses model-driven self-supervision from video-caption pairs and structured reasoning, thereby eliminating the need for human-labeled scene graph annotations. We demonstrate that SGClip supports both prompt-based inference and task-specific fine-tuning, excelling in scene graph generation and action localization benchmarks. ESCA with SGClip consistently improves both open-source and commercial MLLMs, achieving state-of-the-art performance across two embodied environments. Notably, it significantly reduces agent perception errors and enables open-source models to surpass proprietary baselines.

[2] CrossRay3D: Geometry and Distribution Guidance for Efficient Multimodal 3D Detection

Huiming Yang

🧩 TL;DR

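This paper proposes CrossRay3D, a sparse multi-modal detector that improves token representation quality through Ray-Aware Supervision and Class-Balanced Supervision. It achieves state-of-the-art results of 72.4 mAP and 74.7 NDS on the nuScenes benchmark while running 1.84× faster than other leading methods.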


📘 Detailed Summary

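Motivation: Existing sparse detectors overlook the quality of token representation, which leads to sub-optimal foreground quality and limited performance. In particular, they pay little attention to preserving geometric structure and to the class distribution.

Method: The paper proposes the Sparse Selector (SS) as its core module. SS comprises Ray-Aware Supervision (RAS), which preserves rich geometric information during training, and Class-Balanced Supervision, which adaptively reweights the salience of class semantics. In addition, Ray Positional Encoding (Ray PE) addresses the distribution differences between the LiDAR and image modalities.

Result: On the nuScenes benchmark, CrossRay3D achieves state-of-the-art performance with 72.4 mAP and 74.7 NDS, runs 1.84× faster than other leading methods, and remains robust when LiDAR or camera data are partially or entirely missing.

Conclusion: The study shows that preserving geometric structure and balancing the class distribution are key to improving sparse detectors. The proposed method significantly improves detection accuracy and robustness while retaining computational efficiency, and adapts better to downstream tasks.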


📄 Abstract

The sparse cross-modality detector offers more advantages than its counterpart, the Bird's-Eye-View (BEV) detector, particularly in terms of adaptability for downstream tasks and computational cost savings. However, existing sparse detectors overlook the quality of token representation, leaving them with sub-optimal foreground quality and limited performance. In this paper, we identify that preserving geometric structure and accounting for the class distribution are key to improving the performance of the sparse detector, and propose a Sparse Selector (SS). The core modules of SS are Ray-Aware Supervision (RAS), which preserves rich geometric information during the training stage, and Class-Balanced Supervision, which adaptively reweights the salience of class semantics, ensuring that tokens associated with small objects are retained during token sampling. As a result, SS outperforms other sparse multi-modal detectors in the representation of tokens. Additionally, we design Ray Positional Encoding (Ray PE) to address the distribution differences between the LiDAR modality and the image. Finally, we integrate the aforementioned modules into an end-to-end sparse multi-modality detector, dubbed CrossRay3D. Experiments show that, on the challenging nuScenes benchmark, CrossRay3D achieves state-of-the-art performance with 72.4 mAP and 74.7 NDS, while running 1.84× faster than other leading methods. Moreover, CrossRay3D demonstrates strong robustness even in scenarios where LiDAR or camera data are partially or entirely missing.

[3] InfraGPT Smart Infrastructure: An End-to-End VLM-Based Framework for Detecting and Managing Urban Defects

Ibrahim Sheikh Mohamed, Abdullah Yahya Abdullah Omaisan

🧩 TL;DR

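This study proposes a system for detecting urban infrastructure defects and planning repairs in smart cities from CCTV footage. It combines YOLO object detectors with a vision-language model to automatically identify multiple defect types and generate structured maintenance action plans.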


📘 Detailed Summary

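Motivation: Manual inspection of smart-city infrastructure is costly and hazardous, and existing automatic systems typically handle a single defect type or produce unstructured outputs that cannot directly guide maintenance crews. An integrated solution is needed that detects multiple defect types and generates structured maintenance plans.

Method: The system uses the YOLO family of object detectors for multi-defect detection and segmentation, then passes the detections to a vision-language model for scene-aware summarization. The model generates a structured action plan in JSON format that includes incident descriptions, recommended tools, dimensions, repair plans, and urgent alerts. The work draws on recent advances in large vision-language models such as QwenVL and LLaVA.

Result: Experimental evaluation on public datasets and captured CCTV clips shows that the system accurately identifies diverse infrastructure defects and produces coherent structured summaries, supporting the effectiveness and practicality of the approach.

Conclusion: The system offers an end-to-end automated solution for smart-city infrastructure maintenance that can significantly improve detection efficiency and maintenance response time. The paper also discusses the challenges of scaling the system to city-wide deployment and directions for future work.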


📄 Abstract

Infrastructure in smart cities is increasingly monitored by networks of closed-circuit television (CCTV) cameras. Roads, bridges, and tunnels develop cracks, potholes, and fluid leaks that threaten public safety and require timely repair. Manual inspection is costly and hazardous, and existing automatic systems typically address individual defect types or provide unstructured outputs that cannot directly guide maintenance crews. This paper proposes a comprehensive pipeline that leverages street CCTV streams for multi-defect detection and segmentation using the YOLO family of object detectors and passes the detections to a vision-language model (VLM) for scene-aware summarization. The VLM generates a structured action plan in JSON format that includes incident descriptions, recommended tools, dimensions, repair plans, and urgent alerts. We review literature on pothole, crack, and leak detection, highlight recent advances in large vision-language models such as QwenVL and LLaVA, and describe the design of our early prototype. Experimental evaluation on public datasets and captured CCTV clips demonstrates that the system accurately identifies diverse defects and produces coherent summaries. We conclude by discussing challenges and directions for scaling the system to city-wide deployments.
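
The abstract names the fields of the JSON action plan but not its schema. As a rough illustration only, the sketch below assembles one such record in Python. Every key name, every value, and the `build_action_plan` helper are hypothetical and are not taken from the paper.

```python
import json


def build_action_plan(defect_type, description, tools, dimensions_m, repair_steps, urgent):
    """Assemble a hypothetical action-plan record (all field names are assumptions)."""
    return {
        "incident_description": f"{defect_type}: {description}",
        "recommended_tools": list(tools),
        "dimensions_m": dict(dimensions_m),
        "repair_plan": list(repair_steps),
        "urgent_alert": bool(urgent),
    }


# Made-up example values; a real system would fill these from detector and VLM output.
plan = build_action_plan(
    defect_type="pothole",
    description="pothole in the right lane near a pedestrian crossing",
    tools=["cold-mix asphalt", "tamper"],
    dimensions_m={"width": 0.6, "length": 0.9},
    repair_steps=["cordon off the lane", "clean and fill the cavity", "compact the patch"],
    urgent=True,
)

# Serialize to JSON, the output format named in the abstract.
plan_json = json.dumps(plan, indent=2)
print(plan_json)
```

A fixed set of keys like this is what would let downstream tooling, such as a maintenance ticketing system, consume the output without parsing free text.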

[4] IAD-GPT: Advancing Visual Knowledge in Multimodal Large Language Model for Industrial Anomaly Detection

Zewen Li, Zitong Yu, Qilang Ye, Weicheng Xie, Wei Zhuo, Linlin Shen

🧩 TL;DR

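This paper proposes IAD-GPT, a new paradigm for industrial anomaly detection based on multimodal large language models. An Abnormal Prompt Generator and a Text-Guided Enhancer activate the detection and segmentation capabilities of a pre-trained vision-language model, achieving state-of-the-art performance on the MVTec-AD and VisA datasets.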


📘 Detailed Summary

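Motivation: Traditional industrial anomaly detection methods cannot provide multi-turn human-machine dialogue or detailed descriptions, and methods based on large pre-trained models have not fully tapped the potential of large models for anomaly detection, particularly in combining text semantics with image-level and pixel-level information.

Method: An Abnormal Prompt Generator produces detailed anomaly prompts for specific objects. A Text-Guided Enhancer lets image features interact with normal and abnormal text prompts to dynamically select enhancement pathways. A Multi-Mask Fusion module incorporates masks as expert knowledge, strengthening the large language model's perception of pixel-level anomalies.

Result: Extensive experiments on the MVTec-AD and VisA datasets show that the method achieves state-of-the-art performance on self-supervised and few-shot anomaly detection and segmentation tasks.

Conclusion: The study demonstrates the strong potential of multimodal large language models for industrial anomaly detection. By combining text semantics with multi-level visual information, the new paradigm offers an effective solution for intelligent industrial inspection systems and opens a new direction for applying large models in specialized domains.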


📄 Abstract

The robust causal capability of Multimodal Large Language Models (MLLMs) holds the potential of detecting defective objects in Industrial Anomaly Detection (IAD). However, most traditional IAD methods lack the ability to provide multi-turn human-machine dialogues and detailed descriptions, such as the color of objects, the shape of an anomaly, or specific types of anomalies. At the same time, methods based on large pre-trained models have not fully stimulated the ability of large models in anomaly detection tasks. In this paper, we explore the combination of rich text semantics with both image-level and pixel-level information from images and propose IAD-GPT, a novel paradigm based on MLLMs for IAD. We employ an Abnormal Prompt Generator (APG) to generate detailed anomaly prompts for specific objects. These specific prompts from the large language model (LLM) are used to activate the detection and segmentation functions of the pre-trained visual-language model (i.e., CLIP). To enhance the visual grounding ability of MLLMs, we propose a Text-Guided Enhancer, wherein image features interact with normal and abnormal text prompts to dynamically select enhancement pathways, which enables language models to focus on specific aspects of visual data, enhancing their ability to accurately interpret and respond to anomalies within images. Moreover, we design a Multi-Mask Fusion module to incorporate masks as expert knowledge, which enhances the LLM's perception of pixel-level anomalies. Extensive experiments on the MVTec-AD and VisA datasets demonstrate state-of-the-art performance on self-supervised and few-shot anomaly detection and segmentation tasks. The code is available at https://github.com/LiZeWen1225/IAD-GPT.

[5] Aria Gen 2 Pilot Dataset

Chen Kong, James Fort, Aria Kang, Jonathan Wittmer, Simon Green, Tianwei Shen, Yipu Zhao, Cheng Peng, Gustavo Solaira, Andrew Berkovich, Nikhil Raina, Vijay Baiyya, Evgeniy Oleinik, Eric Huang, Fan Zhang, Julian Straub, Mark Schwesinger, Luis Pesqueira, Xiaqing Pan, Jakob Julian Engel, Carl Ren, Mingfei Yan, Richard Newcombe

🧩 TL;DR

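The Aria Gen 2 Pilot Dataset (A2PD) is an egocentric multimodal open dataset captured with the state-of-the-art Aria Gen 2 glasses. Released incrementally, it provides raw sensor data and machine perception algorithm outputs across a variety of everyday scenarios, supporting research on comprehensive perception of the wearer, the environment, and their interactions.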


📘 Detailed Summary

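Motivation: The work aims to address the lack of large-scale, diverse, real-world datasets in egocentric vision and multimodal perception. By providing comprehensive data across five primary scenarios (cleaning, cooking, eating, playing, and outdoor walking), it fills gaps in user diversity and environmental complexity left by existing datasets.

Method: Multimodal sensor data are captured with Aria Gen 2 glasses, and the dataset grows through an incremental release strategy. It contains raw sensor data and the outputs of various machine perception algorithms, covering the wearer's state, perception of the environment, and interactions between the wearer and the environment.

Result: The dataset illustrates the device's ability to maintain robust performance across diverse users and conditions and to perceive the wearer, the surrounding environment, and the interactions between them, providing rich real-world benchmark data for multimodal perception research.

Conclusion: A2PD is an important benchmark resource for egocentric vision and multimodal perception research. Its open access and accompanying tools should advance the field, and it is especially valuable for evaluating perception algorithms in real environments and studying generalization across users.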


📄 Abstract

The Aria Gen 2 Pilot Dataset (A2PD) is an egocentric multimodal open dataset captured using the state-of-the-art Aria Gen 2 glasses. To facilitate timely access, A2PD is released incrementally with ongoing dataset enhancements. The initial release features Dia'ane, our primary subject, who records her daily activities alongside friends, each equipped with Aria Gen 2 glasses. It encompasses five primary scenarios: cleaning, cooking, eating, playing, and outdoor walking. In each of the scenarios, we provide comprehensive raw sensor data and output data from various machine perception algorithms. These data illustrate the device's ability to perceive the wearer, the surrounding environment, and interactions between the wearer and the environment, while maintaining robust performance across diverse users and conditions. The A2PD is publicly available at projectaria.com, with open-source tools and usage examples provided in Project Aria Tools.

[6] StretchySnake: Flexible SSM Training Unlocks Action Recognition Across Spatio-Temporal Scales

Nyle Siddiqui, Rohit Gupta, Sirnam Swetha, Mubarak Shah

🧩 TL;DR

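This paper proposes StretchySnake, a flexible, spatio-temporally adaptive training method that trains video state space models while dynamically varying spatial and temporal resolution. It addresses the performance degradation of conventional video models at unseen spatio-temporal resolutions and significantly outperforms transformer and SSM baselines on multiple action recognition benchmarks.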


📘 Detailed Summary

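Motivation: Current training methods for video understanding are designed mainly for transformers and do not fully exploit the unique properties of state space models (SSMs). As a result, models degrade when they encounter spatial and temporal resolutions unseen during training, and this spatio-temporal inflexibility severely limits generalization across short and long videos.

Method: The flexible training method samples videos at varying spatial and temporal resolutions during training and dynamically interpolates model weights to fit any spatio-temporal scale. The authors develop five flexible-training variants, identify the strategy best suited to video SSMs, and build a spatio-temporally adaptive SSM called StretchySnake.

Result: On short-action (UCF-101, HMDB-51) and long-action (COIN, Breakfast) benchmarks, StretchySnake outperforms transformer and SSM baselines by up to 28%, and it adapts well to fine-grained action datasets (SSV2, Diving-48).

Conclusion: The method is a simple drop-in training recipe that makes video SSMs more robust, resolution-agnostic, and efficient across action recognition scenarios. It offers a new training paradigm for video understanding models that takes full advantage of the linear complexity and hidden-state recurrence of SSMs.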


📄 Abstract

State space models (SSMs) have emerged as a competitive alternative to transformers in various tasks. Their linear complexity and hidden-state recurrence make them particularly attractive for modeling long sequences, whereas attention becomes quadratically expensive. However, current training methods for video understanding are tailored towards transformers and fail to fully leverage the unique attributes of SSMs. For example, video models are often trained at a fixed resolution and video length to balance the quadratic scaling of attention cost against performance. Consequently, these models suffer from degraded performance when evaluated on videos with spatial and temporal resolutions unseen during training; a property we call spatio-temporal inflexibility. In the context of action recognition, this severely limits a model's ability to retain performance across both short- and long-form videos. Therefore, we propose a flexible training method that leverages and improves the inherent adaptability of SSMs. Our method samples videos at varying temporal and spatial resolutions during training and dynamically interpolates model weights to accommodate any spatio-temporal scale. This instills our SSM, which we call StretchySnake, with spatio-temporal flexibility and enables it to seamlessly handle videos ranging from short, fine-grained clips to long, complex activities. We introduce and compare five different variants of flexible training, and identify the most effective strategy for video SSMs. On short-action (UCF-101, HMDB-51) and long-action (COIN, Breakfast) benchmarks, StretchySnake outperforms transformer and SSM baselines alike by up to 28%, with strong adaptability to fine-grained actions (SSV2, Diving-48). Therefore, our method provides a simple drop-in training recipe that makes video SSMs more robust, resolution-agnostic, and efficient across diverse action recognition scenarios.
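
The abstract does not say which weights are interpolated or how. As a minimal sketch, assume the interpolated weights are a learned temporal positional table and that linear interpolation is used. Both are assumptions, and the helper names and the candidate resolutions below are hypothetical.

```python
import random


def interpolate_rows(table, new_len):
    """Linearly resample a list of equal-length vectors to new_len rows."""
    old_len = len(table)
    if new_len == 1 or old_len == 1:
        return [list(table[0]) for _ in range(new_len)]
    out = []
    for i in range(new_len):
        # Map output index i onto the original index range [0, old_len - 1].
        pos = i * (old_len - 1) / (new_len - 1)
        lo = int(pos)
        hi = min(lo + 1, old_len - 1)
        frac = pos - lo
        out.append([(1 - frac) * a + frac * b for a, b in zip(table[lo], table[hi])])
    return out


def sample_clip_shape(rng, frame_choices=(8, 16, 32), size_choices=(112, 224, 448)):
    """Pick a (num_frames, spatial_size) pair for one training step (values are made up)."""
    return rng.choice(frame_choices), rng.choice(size_choices)


rng = random.Random(0)
temporal_table = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]  # toy table: 3 positions, dim 2

# One "flexible" step: sample a clip shape, then stretch the table to match it.
frames, size = sample_clip_shape(rng)
resized = interpolate_rows(temporal_table, frames)
print(frames, size, len(resized))
```

The point of the sketch is only that a table learned at one length can be stretched to whatever length a sampled clip requires, so no single resolution is baked into the model.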

[7] Embody 3D: A Large-scale Multimodal Motion and Behavior Dataset

Claire McLean, Makenzie Meendering, Tristan Swartz, Orri Gabbay, Alexandra Olsen, Rachel Jacobs, Nicholas Rosen, Philippe de Bree, Tony Garcia, Gadsden Merrill, Jake Sandakly, Julia Buffalini, Neham Jain, Steven Krenn, Moneish Kumar, Dejan Markovic, Evonne Ng, Fabian Prada, Andrew Saba, Siwei Zhang, Vasu Agrawal, Tim Godisart, Alexander Richard, Michael Zollhoefer

🧩 TL;DR

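Embody 3D is a large-scale multimodal 3D motion dataset containing 500 hours of 3D motion data from 439 participants. It covers single-person, multi-person, and interactive behavior and is an important resource for research on human motion analysis and behavior understanding.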


📘 Detailed Summary

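Motivation: Large-scale, high-quality multimodal 3D human motion datasets are scarce, and data gaps are especially pronounced for multi-person interaction and behavior analysis, which limits the training and evaluation of related algorithms.

Method: A multi-camera capture system records 3D motion data from 439 participants, including single-person motion, hand gestures, and locomotion, as well as multi-person conversations, collaborative activities, and co-living scenarios. The dataset provides motion tracking, body shape, text annotations, and a separate audio track for each participant.

Result: The resulting large-scale dataset contains 500 hours of data and more than 54 million frames of tracked 3D motion, covers a wide range of motion types and interaction scenarios, and provides a comprehensive benchmark for multimodal behavior analysis.

Conclusion: Embody 3D fills gaps in the scale and diversity of 3D human motion data and provides important infrastructure for research in computer vision, human-computer interaction, and behavior analysis, which should drive the development and application of related algorithms.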


📄 Abstract

The Codec Avatars Lab at Meta introduces Embody 3D, a multimodal dataset of 500 individual hours of 3D motion data from 439 participants collected in a multi-camera collection stage, amounting to over 54 million frames of tracked 3D motion. The dataset features a wide range of single-person motion data, including prompted motions, hand gestures, and locomotion; as well as multi-person behavioral and conversational data like discussions, conversations in different emotional states, collaborative activities, and co-living scenarios in an apartment-like space. We provide tracked human motion including hand tracking and body shape, text annotations, and a separate audio track for each participant.

[8] Cerberus: Real-Time Video Anomaly Detection via Cascaded Vision-Language Models

Yue Zheng, Xiufang Shi, Jiming Chen, Yuanchao Shu

🧩 TL;DR

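This paper proposes Cerberus, a two-stage cascaded system for real-time video anomaly detection. By combining lightweight filtering with fine-grained vision-language model reasoning, it achieves a substantial speedup while maintaining high accuracy.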


📘 Detailed Summary

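Motivation: Current video anomaly detection methods based on vision-language models offer superior zero-shot detection, but their immense computational cost and unstable visual grounding performance hinder real-time deployment.

Method: Cerberus uses a two-stage cascaded architecture. It learns rules of normal behavior offline and combines lightweight filtering with fine-grained VLM reasoning during online inference. Its key innovations are motion mask prompting and rule-based deviation detection.

Result: Extensive evaluation on four datasets shows that Cerberus averages 57.68 fps on an NVIDIA L40S GPU, a 151.79× speedup, while reaching 97.2% accuracy, comparable to state-of-the-art VLM-based VAD methods.

Conclusion: Through motion-guided attention and rule-based deviation detection, Cerberus shows that real-time video anomaly detection is feasible without sacrificing accuracy, offering a practical solution for real-world video analytics.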


📄 Abstract

Video anomaly detection (VAD) has advanced rapidly with the recent development of Vision-Language Models (VLMs). While these models offer superior zero-shot detection capabilities, their immense computational cost and unstable visual grounding performance hinder real-time deployment. To overcome these challenges, we introduce Cerberus, a two-stage cascaded system designed for efficient yet accurate real-time VAD. Cerberus learns normal behavioral rules offline, and combines lightweight filtering with fine-grained VLM reasoning during online inference. The performance gains of Cerberus come from two key innovations: motion mask prompting and rule-based deviation detection. The former directs the VLM's attention to regions relevant to motion, while the latter identifies anomalies as deviations from learned norms rather than enumerating possible anomalies. Extensive evaluations on four datasets show that Cerberus on average achieves 57.68 fps on an NVIDIA L40S GPU, a 151.79× speedup, and 97.2% accuracy comparable to the state-of-the-art VLM-based VAD methods, establishing it as a practical solution for real-time video analytics.
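
The cascade idea, a cheap filter that decides which frames reach an expensive model, can be sketched generically. The code below is not Cerberus's filter or VLM. The `cascade_detect` helper, the motion score, the threshold, and the toy frames are all hypothetical stand-ins chosen to show where the savings come from.

```python
def cascade_detect(frames, cheap_score, expensive_check, threshold=0.5):
    """Two-stage cascade: a cheap filter gates calls to an expensive model."""
    results = []
    expensive_calls = 0
    for frame in frames:
        if cheap_score(frame) < threshold:
            results.append(False)  # filtered out as normal; no expensive call
            continue
        expensive_calls += 1
        results.append(bool(expensive_check(frame)))
    return results, expensive_calls


# Toy stand-ins: a "frame" is a dict with a motion level and a rule-violation flag.
frames = [
    {"motion": 0.1, "violates_rule": False},
    {"motion": 0.9, "violates_rule": True},
    {"motion": 0.7, "violates_rule": False},
    {"motion": 0.0, "violates_rule": False},
]
flags, calls = cascade_detect(
    frames,
    cheap_score=lambda f: f["motion"],             # stands in for a lightweight filter
    expensive_check=lambda f: f["violates_rule"],  # stands in for VLM reasoning
)
print(flags, calls)
```

In this toy run only two of the four frames reach the expensive stage. The speedup of any such cascade depends on how many frames the cheap stage can safely discard.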

[9] OpenLVLM-MIA: A Controlled Benchmark Revealing the Limits of Membership Inference Attacks on Large Vision-Language Models

Ryoto Miyamoto, Xin Fan, Fuyuko Kido, Tsuneo Matsumoto, Hayato Yamana

🧩 TL;DR

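OpenLVLM-MIA is a new benchmark that reveals fundamental challenges in evaluating membership inference attacks on large vision-language models. The study finds that previously reported high attack success rates stem mainly from detecting distributional bias introduced during dataset construction rather than from identifying true membership status.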


📘 Detailed Summary

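Motivation: The study addresses a fundamental problem in evaluating membership inference attacks on large vision-language models. The high attack success rates reported in prior work may arise from distributional bias introduced during dataset construction rather than from a genuine ability to identify membership, which leads to a misunderstanding of how well the attacks actually perform.

Method: The study introduces a controlled benchmark of 6,000 images in which the distributions of member and non-member samples are carefully balanced and ground-truth membership labels are provided across three distinct training stages. Removing distributional bias makes the evaluation fair.

Result: Under unbiased conditions, the performance of state-of-the-art membership inference attacks converges to random chance. This supports the view that previously reported high success rates stem mainly from dataset bias rather than genuine attack capability.

Conclusion: OpenLVLM-MIA clarifies the current limitations of membership inference attack research on LVLMs and provides a solid foundation for developing stronger privacy-preserving techniques. The study underscores the importance of removing dataset bias when evaluating privacy attacks.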


📄 Abstract

OpenLVLM-MIA is a new benchmark that highlights fundamental challenges in evaluating membership inference attacks (MIA) against large vision-language models (LVLMs). While prior work has reported high attack success rates, our analysis suggests that these results often arise from detecting distributional bias introduced during dataset construction rather than from identifying true membership status. To address this issue, we introduce a controlled benchmark of 6,000 images where the distributions of member and non-member samples are carefully balanced, and ground-truth membership labels are provided across three distinct training stages. Experiments using OpenLVLM-MIA demonstrated that the performance of state-of-the-art MIA methods converged to random chance under unbiased conditions. By offering a transparent and unbiased benchmark, OpenLVLM-MIA clarifies the current limitations of MIA research on LVLMs and provides a solid foundation for developing stronger privacy-preserving techniques.
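
The distinction between detecting bias and detecting membership can be illustrated with synthetic numbers. The sketch below uses made-up Gaussian attack scores, not data from the benchmark, and a plain pairwise ROC AUC. The distributions, sample sizes, and the `roc_auc` helper are all assumptions for illustration.

```python
import random


def roc_auc(member_scores, nonmember_scores):
    """Probability that a random member outscores a random non-member (ties count half)."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


rng = random.Random(0)

# Biased split: non-members come from a shifted distribution, so an attack can
# "succeed" by detecting the shift rather than membership.
biased_members = [rng.gauss(1.0, 1.0) for _ in range(500)]
biased_nonmembers = [rng.gauss(0.0, 1.0) for _ in range(500)]

# Balanced split: both groups share one distribution, so scores carry no signal.
balanced_members = [rng.gauss(0.0, 1.0) for _ in range(500)]
balanced_nonmembers = [rng.gauss(0.0, 1.0) for _ in range(500)]

biased_auc = roc_auc(biased_members, biased_nonmembers)
balanced_auc = roc_auc(balanced_members, balanced_nonmembers)
print("biased AUC:  ", round(biased_auc, 3))
print("balanced AUC:", round(balanced_auc, 3))
```

The biased split gives an AUC well above 0.5 and the balanced split gives an AUC near 0.5, which is the pattern the abstract describes as convergence to random chance.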

[10] Scale-DiT: Ultra-High-Resolution Image Generation with Hierarchical Local Attention

Yuyao Zhang, Yu-Wing Tai

🧩 TL;DR

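This paper proposes Scale-DiT, a new diffusion framework that combines hierarchical local attention with low-resolution global guidance to generate ultra-high-resolution images efficiently, scaling to 4K resolution without additional high-resolution training data.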


📘 Detailed Summary

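Motivation: Current diffusion models are constrained by the quadratic complexity of attention and the scarcity of native 4K training data, which prevents ultra-high-resolution text-to-image generation and limits both fine-grained texture synthesis and global structural coherence.

Method: Scale-DiT uses hierarchical local attention, dividing high-resolution latents into fixed-size local windows to reduce attention complexity, while a low-resolution latent equipped with scaled positional anchors injects global semantics. A lightweight LoRA adapter bridges the global and local pathways during denoising.

Result: Experiments show that Scale-DiT achieves more than 2× faster inference and lower memory usage than dense attention baselines. It generates images at 4K×4K resolution with superior global coherence and sharper local detail, and on quantitative metrics such as FID, IS, and CLIP Score it matches or outperforms state-of-the-art methods that rely on native 4K training.

Conclusion: The study shows that hierarchical local attention combined with guided low-resolution anchors is an effective approach to ultra-high-resolution image generation, offering a viable path to large-scale image synthesis without additional high-resolution training data.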


📄 Abstract

Ultra-high-resolution text-to-image generation demands both fine-grained texture synthesis and globally coherent structure, yet current diffusion models remain constrained to sub-1K×1K resolutions due to the prohibitive quadratic complexity of attention and the scarcity of native 4K training data. We present Scale-DiT, a new diffusion framework that introduces hierarchical local attention with low-resolution global guidance, enabling efficient, scalable, and semantically coherent image synthesis at ultra-high resolutions. Specifically, high-resolution latents are divided into fixed-size local windows to reduce attention complexity from quadratic to near-linear, while a low-resolution latent equipped with scaled positional anchors injects global semantics. A lightweight LoRA adaptation bridges global and local pathways during denoising, ensuring consistency across structure and detail. To maximize inference efficiency, we re-permute the token sequence in Hilbert curve order and implement a fused kernel for skipping masked operations, resulting in a GPU-friendly design. Extensive experiments demonstrate that Scale-DiT achieves more than 2× faster inference and lower memory usage compared to dense attention baselines, while reliably scaling to 4K×4K resolution without requiring additional high-resolution training data. On both quantitative benchmarks (FID, IS, CLIP Score) and qualitative comparisons, Scale-DiT delivers superior global coherence and sharper local detail, matching or outperforming state-of-the-art methods that rely on native 4K training. Taken together, these results highlight hierarchical local attention with guided low-resolution anchors as a promising and effective approach for advancing ultra-high-resolution image generation.
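
The move from quadratic to near-linear attention cost can be checked by counting query-key pairs. The sketch below is back-of-the-envelope arithmetic, not the paper's implementation. The grid and window sizes are hypothetical, and the count ignores the extra low-resolution global latent.

```python
def dense_attention_pairs(num_tokens):
    """Every token attends to every token: quadratic in sequence length."""
    return num_tokens * num_tokens


def windowed_attention_pairs(num_tokens, window_tokens):
    """Tokens attend only within fixed-size windows: linear in sequence length."""
    if num_tokens % window_tokens != 0:
        raise ValueError("num_tokens must be a multiple of window_tokens")
    num_windows = num_tokens // window_tokens
    return num_windows * window_tokens * window_tokens


# Hypothetical sizes: a 256 x 256 latent grid split into 32 x 32 windows.
tokens = 256 * 256
window = 32 * 32
dense = dense_attention_pairs(tokens)
local = windowed_attention_pairs(tokens, window)
print(dense // local)  # reduction factor = tokens // window = 64
```

With the window size fixed, doubling the number of tokens doubles the windowed cost but quadruples the dense cost, which is why the gap widens as resolution grows.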

[11] RL makes MLLMs see better than SFT

Junha Song, Sangdoo Yun, Dongyoon Han, Jaegul Choo, Byeongho Heo

🧩 TL;DR

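This study shows that reinforcement learning significantly strengthens the representations of the vision encoder in multimodal language models, and proposes PIVOT, a method that builds stronger vision encoders at less than 1% of the computational cost of standard vision pretraining.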


📘 Detailed Summary

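Motivation: MLLM research commonly assumes that performance is inherited mainly from the LLM backbone, which has left a gap in the understanding of the vision encoder. As training paradigms shift from supervised finetuning to reinforcement learning, a systematic analysis of how training strategies reshape the vision encoder is still missing.

Method: The authors analyze how training strategies affect the vision encoder through diverse, in-depth experiments, including ImageNet classification, segmentation, and gradient visualization. Based on the findings, they propose Preference-Instructed Vision OpTimization (PIVOT).

Result: Reinforcement learning outperforms supervised finetuning on strongly vision-related VQA benchmarks and produces stronger, more precisely localized visual representations. A PIVOT-trained vision encoder outperforms larger and more heavily trained counterparts at less than 1% of the computational cost of standard pretraining.

Conclusion: Training strategy affects not only downstream MLLM performance but also fundamentally reshapes the underlying visual representations. PIVOT offers an efficient path for advancing the vision backbones of MLLMs and challenges the prevailing underestimation of the vision encoder's role.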


📄 Abstract

A dominant assumption in Multimodal Language Model (MLLM) research is that its performance is largely inherited from the LLM backbone, given its immense parameter scale and remarkable capabilities. This has created a void in the understanding of the vision encoder, which determines how MLLMs perceive images. The recent shift in MLLM training paradigms, from Supervised Finetuning (SFT) to Reinforcement Learning (RL), magnifies this oversight, namely the significant lack of analysis on how such training reshapes the vision encoder as well as the MLLM. To address this, we first investigate the impact of training strategies on MLLMs, where RL shows a clear advantage over SFT in strongly vision-related VQA benchmarks. Motivated by this, we conduct a critical yet under-explored analysis of the vision encoder of MLLMs through diverse and in-depth experiments, ranging from ImageNet classification and segmentation to gradient visualization. Our results demonstrate that an MLLM's post-training strategy (i.e., SFT or RL) not only leads to distinct outcomes on MLLM downstream tasks, but also fundamentally reshapes the MLLM's underlying visual representations. Specifically, the key finding of our study is that RL produces stronger and more precisely localized visual representations compared to SFT, boosting the ability of the vision encoder for MLLMs. We then reframe our findings into a simple recipe for building strong vision encoders for MLLMs, Preference-Instructed Vision OpTimization (PIVOT). When integrated into MLLMs, a PIVOT-trained vision encoder outperforms even larger and more heavily trained counterparts, despite requiring less than 1% of the computational cost of standard vision pretraining. This result opens an effective and efficient path for advancing the vision backbones of MLLMs. Project page available at https://june-page.github.io/pivot/

[12] On the Provable Importance of Gradients for Language-Assisted Image Clustering

Bo Peng, Jie Lu, Guangquan Zhang, Zhen Fang

🧩 TL;DR

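This paper proposes GradNorm, a gradient-based framework for filtering positive nouns in language-assisted image clustering. The method comes with theoretical guarantees and achieves state-of-the-art clustering performance on multiple benchmarks.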


📘 Detailed Summary

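Motivation: The core challenge of Language-assisted Image Clustering (LaIC) is filtering positive nouns, those semantically close to the target images, from unlabeled wild corpus data. Existing filtering strategies based on the CLIP feature space are intuitive but lack a rigorous theoretical foundation.

Method: The proposed gradient-based framework, GradNorm, measures the positiveness of each noun by the magnitude of the gradients back-propagated from the cross-entropy between the predicted target distribution and the softmax output. In theory, the method subsumes existing filtering strategies as special cases.

Result: Extensive experiments show that GradNorm achieves state-of-the-art clustering performance on multiple benchmarks, and the theoretical analysis provides a rigorous error bound on the separability of positive nouns.

Conclusion: GradNorm provides a theoretically rigorous solution for language-assisted image clustering and proves that existing methods are extremely special cases of the framework, laying a theoretical foundation for future research.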


📄 Abstract

This paper investigates the recently emerged problem of Language-assisted Image Clustering (LaIC), where textual semantics are leveraged to improve the discriminability of visual representations to facilitate image clustering. Due to the unavailability of true class names, one of the core challenges of LaIC lies in how to filter positive nouns, i.e., those semantically close to the images of interest, from unlabeled wild corpus data. Existing filtering strategies are predominantly based on the off-the-shelf feature space learned by CLIP; however, despite being intuitive, these strategies lack a rigorous theoretical foundation. To fill this gap, we propose a novel gradient-based framework, termed GradNorm, which is theoretically guaranteed and shows strong empirical performance. In particular, we measure the positiveness of each noun based on the magnitude of gradients back-propagated from the cross-entropy between the predicted target distribution and the softmax output. Theoretically, we provide a rigorous error bound to quantify the separability of positive nouns by GradNorm and prove that GradNorm naturally subsumes existing filtering strategies as extremely special cases of itself. Empirically, extensive experiments show that GradNorm achieves state-of-the-art clustering performance on various benchmarks.
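
The abstract does not state which parameters the gradient is taken with respect to, how the target distribution is predicted, or which norm is used. As a minimal sketch under stated assumptions, the code below takes the gradient with respect to the logits, where the cross-entropy gradient has the closed form softmax(z) − target, and uses an L1 norm. The target and logits are toy values.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_grad_norm(logits, target):
    """L1 norm of d CE(target, softmax(logits)) / d logits, i.e. sum |softmax - target|."""
    probs = softmax(logits)
    return sum(abs(p - t) for p, t in zip(probs, target))


target = [1.0, 0.0, 0.0]        # toy target distribution (assumption)
close_logits = [5.0, 0.0, 0.0]  # softmax output near the target
far_logits = [0.0, 0.0, 5.0]    # softmax output far from the target

close_norm = cross_entropy_grad_norm(close_logits, target)
far_norm = cross_entropy_grad_norm(far_logits, target)
print(close_norm, far_norm)
```

The gradient magnitude shrinks as the softmax output approaches the target, so it acts as a score of how well a candidate matches. How the paper turns such magnitudes into a positiveness measure for nouns is not specified in the abstract.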

[13] MIRAD - A comprehensive real-world robust anomaly detection dataset for Mass Individualization

Pulin Li, Guocheng Wu, Li Yin, Yuxin Zheng, Wei Zhang, Yanjie Zhou

🧩 TL;DR

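This paper presents MIRAD, the first anomaly detection benchmark dataset designed specifically for social manufacturing. The dataset captures three key dimensions of the challenges in individualized production, and extensive experiments reveal a significant performance drop for existing state-of-the-art methods on real-world defect detection in individualized production.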


📘 Detailed Summary

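Motivation: Social manufacturing achieves mass individualization through community collaboration and scattered resources, but this shift poses major challenges for quality control, especially defect detection. The main difficulties are highly customized product configurations, fragmented small-batch orders, and imaging environments that vary considerably across distributed sites. Real-world datasets and algorithms tailored to this setting are lacking.

Method: The authors build the Mass Individualization Robust Anomaly Detection (MIRAD) dataset, the first benchmark explicitly designed for anomaly detection in social manufacturing. It captures three key dimensions of individualized production: diverse individualized products with large intra-class variation, data collected from six geographically dispersed manufacturing nodes, and substantial imaging heterogeneity, including variations in lighting, background, and motion conditions.

Result: State-of-the-art anomaly detection methods, covering one-class, multi-class, and zero-shot approaches, are evaluated extensively on MIRAD. All models show a significant performance drop compared with conventional benchmarks, highlighting the unresolved complexity of defect detection in real-world individualized production.

Conclusion: By bridging industrial requirements and academic research, MIRAD provides a realistic foundation for developing the robust quality control solutions needed for Industry 5.0. It exposes the limitations of existing anomaly detection methods in individualized production and points to directions for future research.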


📄 Abstract

Social manufacturing leverages community collaboration and scattered resources to realize mass individualization in modern industry. However, this paradigm shift also introduces substantial challenges in quality control, particularly in defect detection. The main difficulties stem from three aspects. First, products often have highly customized configurations. Second, production typically involves fragmented, small-batch orders. Third, imaging environments vary considerably across distributed sites. To overcome the scarcity of real-world datasets and tailored algorithms, we introduce the Mass Individualization Robust Anomaly Detection (MIRAD) dataset. As the first benchmark explicitly designed for anomaly detection in social manufacturing, MIRAD captures three critical dimensions of this domain: (1) diverse individualized products with large intra-class variation, (2) data collected from six geographically dispersed manufacturing nodes, and (3) substantial imaging heterogeneity, including variations in lighting, background, and motion conditions. We then conduct extensive evaluations of state-of-the-art (SOTA) anomaly detection methods on MIRAD, covering one-class, multi-class, and zero-shot approaches. Results show a significant performance drop across all models compared with conventional benchmarks, highlighting the unresolved complexities of defect detection in real-world individualized production. By bridging industrial requirements and academic research, MIRAD provides a realistic foundation for developing robust quality control solutions essential for Industry 5.0. The dataset is publicly available at https://github.com/wu33learn/MIRAD.

[14] REALM: An MLLM-Agent Framework for Open World 3D Reasoning Segmentation and Editing on Gaussian Splatting

Changyue Shi, Minghao Chen, Yiping Mao, Chuxiao Yang, Xinyuan Hu, Jiajun Ding, Zhou Yu

🧩 TL;DR

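This paper proposes REALM, a framework that combines 3D Gaussian Splatting representations with multimodal large language models to perform open-world reasoning-based segmentation without extensive 3D-specific post-training.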


📘 Detailed Summary

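Motivation: Existing 3D segmentation methods struggle with ambiguous, reasoning-based instructions, while 2D vision-language models that excel at such reasoning lack intrinsic 3D spatial understanding. This leaves a significant gap between complex human instructions and precise 3D object grounding.

Method: REALM performs segmentation directly on 3D Gaussian Splatting representations, exploiting their ability to render photorealistic novel views, and proposes a Global-to-Local Spatial Grounding strategy. Multiple global views are first fed in parallel for coarse-level localization, and several close-up views of the object are then synthesized for fine-grained local segmentation.

Result: On the LERF, 3D-OVS, and newly introduced REALM3D benchmarks, REALM performs strongly in interpreting both explicit and implicit instructions. The framework also seamlessly supports 3D interaction tasks such as object removal, replacement, and style transfer.

Conclusion: REALM shows that the capabilities of advanced 2D vision-language models can be transferred effectively to 3D space, providing a practical and versatile solution for open-world 3D scene understanding and interaction.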


📄 Abstract

Bridging the gap between complex human instructions and precise 3D object grounding remains a significant challenge in vision and robotics. Existing 3D segmentation methods often struggle to interpret ambiguous, reasoning-based instructions, while 2D vision-language models that excel at such reasoning lack intrinsic 3D spatial understanding. In this paper, we introduce REALM, an innovative MLLM-agent framework that enables open-world reasoning-based segmentation without requiring extensive 3D-specific post-training. We perform segmentation directly on 3D Gaussian Splatting representations, capitalizing on their ability to render photorealistic novel views that are highly suitable for MLLM comprehension. As directly feeding one or more rendered views to the MLLM can lead to high sensitivity to viewpoint selection, we propose a novel Global-to-Local Spatial Grounding strategy. Specifically, multiple global views are first fed into the MLLM agent in parallel for coarse-level localization, aggregating responses to robustly identify the target object. Then, several close-up novel views of the object are synthesized to perform fine-grained local segmentation, yielding accurate and consistent 3D masks. Extensive experiments show that REALM achieves remarkable performance in interpreting both explicit and implicit instructions across LERF, 3D-OVS, and our newly introduced REALM3D benchmarks. Furthermore, our agent framework seamlessly supports a range of 3D interaction tasks, including object removal, replacement, and style transfer, demonstrating its practical utility and versatility. Project page: https://ChangyueShi.github.io/REALM.
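
The abstract says responses from multiple global views are aggregated but does not say how. One simple possibility is a majority vote, sketched below. The voting rule, the `aggregate_view_answers` helper, and the toy answers are all assumptions and not the paper's method.

```python
from collections import Counter


def aggregate_view_answers(answers):
    """Majority vote over per-view answers; None marks views where no object was named."""
    votes = Counter(a for a in answers if a is not None)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Hypothetical answers from five rendered global views for one instruction.
per_view_answers = ["mug", "mug", None, "bowl", "mug"]
target = aggregate_view_answers(per_view_answers)
print(target)
```

Voting across views is one way to reduce the sensitivity to viewpoint selection that the abstract identifies, since a single occluded or misleading view cannot decide the outcome alone.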

[15] SSL4RL: Revisiting Self-supervised Learning as Intrinsic Reward for Visual-Language Reasoning

Xiaojun Guo, Runyu Zhou, Yifei Wang, Qi Zhang, Chenheng Zhang, Stefanie Jegelka, Xiaohan Wang, Jiajun Chai, Guojun Yin, Wei Lin, Yisen Wang

🧩 TL;DR

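This paper proposes SSL4RL, a framework that turns self-supervised learning tasks into verifiable reward signals. It addresses the lack of scalable, reliable reward mechanisms for reinforcement learning fine-tuning of vision-language models and substantially improves performance on vision-centric and vision-language reasoning tasks.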


📘 Detailed Summary

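Motivation: Current vision-language models make inadequate use of visual evidence, either relying too heavily on linguistic priors or resorting to textual shortcuts during reasoning. The application of reinforcement learning to vision-language models has been held back by the lack of scalable and reliable reward mechanisms.

Method: SSL4RL reformulates self-supervised learning objectives, such as predicting image rotation or reconstructing masked patches, into dense, automatic reward signals. This provides a source of verifiable rewards for RL-based fine-tuning without human preference data or unreliable AI evaluators.

Result: Experiments show that SSL4RL substantially improves performance on vision-centric and vision-language reasoning benchmarks. Systematic ablations identify key factors such as task difficulty, model scale, and semantic alignment with the target domain, and an application to graph learning demonstrates the framework's generality.

Conclusion: SSL4RL establishes a versatile and effective paradigm for aligning multimodal models with verifiable, self-supervised objectives, offers new design principles for future work, and shows potential for application across domains.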


📄 Abstract

Vision-language models (VLMs) have shown remarkable abilities by integrating large language models with visual inputs. However, they often fail to utilize visual evidence adequately, either depending on linguistic priors in vision-centric tasks or resorting to textual shortcuts during reasoning. Although reinforcement learning (RL) can align models with desired behaviors, its application to VLMs has been hindered by the lack of scalable and reliable reward mechanisms. To overcome this challenge, we propose SSL4RL, a novel framework that leverages self-supervised learning (SSL) tasks as a source of verifiable rewards for RL-based fine-tuning. Our approach reformulates SSL objectives, such as predicting image rotation or reconstructing masked patches, into dense, automatic reward signals, eliminating the need for human preference data or unreliable AI evaluators. Experiments show that SSL4RL substantially improves performance on both vision-centric and vision-language reasoning benchmarks. Furthermore, through systematic ablations, we identify key factors (such as task difficulty, model scale, and semantic alignment with the target domain) that influence the effectiveness of SSL4RL tasks, offering new design principles for future work. We also demonstrate the framework's generality by applying it to graph learning, where it yields significant gains. SSL4RL establishes a versatile and effective paradigm for aligning multimodal models using verifiable, self-supervised objectives.
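
Rotation prediction is verifiable because the applied rotation is known by construction. The sketch below shows the simplest form of such a reward on a toy grid. The binary reward, the helper names, and the 2×2 "image" are assumptions for illustration, not the paper's reward design.

```python
import random


def rotate90(grid, k):
    """Rotate a 2D list clockwise by k quarter turns."""
    for _ in range(k % 4):
        grid = [list(row) for row in zip(*grid[::-1])]
    return grid


def rotation_reward(predicted_k, true_k):
    """Verifiable reward: 1.0 when the predicted rotation matches the one applied."""
    return 1.0 if predicted_k % 4 == true_k % 4 else 0.0


rng = random.Random(0)
image = [[1, 2], [3, 4]]          # toy stand-in for an image
true_k = rng.randrange(4)         # label is generated, not annotated
rotated = rotate90(image, true_k)

# A model would see `rotated` and output a guess; here the guess is hard-coded.
guess = true_k
print(rotation_reward(guess, true_k))
```

No human label or learned judge is involved: the environment applies the transformation itself, so it can always check the answer.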

[16] EDVD-LLaMA: Explainable Deepfake Video Detection via Multimodal Large Language Model Reasoning

Haoran Sun, Chen Cai, Huiping Zhuang, Kong Aik Lee, Lap-Pui Chau, Yi Wang

🧩 TL;DR

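This paper proposes the explainable deepfake video detection (EDVD) task and designs EDVD-LLaMA, a multimodal large language model reasoning framework that delivers accurate detection results together with traceable reasoning and trustworthy explanations.


📘 Detailed Summary

Motivation: Traditional deepfake video detection methods lack transparency in their principles and generalize poorly to evolving forgery techniques. Detectors are urgently needed that can identify forged content and provide verifiable reasoning for their decisions.

Method: EDVD-LLaMA is a multimodal large language model reasoning framework. It includes Spatio-Temporal Subtle Information Tokenization (ST-SIT), which extracts and fuses global and local cross-frame deepfake features, and a Fine-grained Multimodal Chain-of-Thought (Fg-MCoT) mechanism, which introduces facial feature data as hard constraints during reasoning to achieve pixel-level spatio-temporal video localization.

Result: Extensive experiments show that EDVD-LLaMA achieves outstanding performance and robustness in detection accuracy, explainability, and cross-forgery and cross-dataset scenarios, providing a more explainable and superior solution than previous DVD methods.

Conclusion: The work offers a more transparent and trustworthy solution for deepfake detection. Its multimodal reasoning framework makes detection results traceable and explainable, improving the reliability and practicality of detection systems.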


📄 Abstract

The rapid development of deepfake video technology has not only facilitated artistic creation but also made it easier to spread misinformation. Traditional deepfake video detection (DVD) methods face issues such as a lack of transparency in their principles and insufficient generalization capabilities to cope with evolving forgery techniques. This highlights an urgent need for detectors that can identify forged content and provide verifiable reasoning explanations. This paper proposes the explainable deepfake video detection (EDVD) task and designs EDVD-LLaMA, a multimodal large language model (MLLM) reasoning framework, which provides traceable reasoning processes alongside accurate detection results and trustworthy explanations. Our approach first incorporates a Spatio-Temporal Subtle Information Tokenization (ST-SIT) to extract and fuse global and local cross-frame deepfake features, providing rich spatio-temporal semantic information input for MLLM reasoning. Second, we construct a Fine-grained Multimodal Chain-of-Thought (Fg-MCoT) mechanism, which introduces facial feature data as hard constraints during the reasoning process to achieve pixel-level spatio-temporal video localization, suppress hallucinated outputs, and enhance the reliability of the chain of thought. In addition, we build an Explainable Reasoning FF++ benchmark dataset (ER-FF++set), leveraging structured data to annotate videos and ensure quality control, thereby supporting dual supervision for reasoning and detection. Extensive experiments demonstrate that EDVD-LLaMA achieves outstanding performance and robustness in terms of detection accuracy, explainability, and its ability to handle cross-forgery methods and cross-dataset scenarios. Compared to previous DVD methods, it provides a more explainable and superior solution. The source code and dataset will be publicly available.

[17] RefAtomNet++: Advancing Referring Atomic Video Action Recognition using Semantic Retrieval based Multi-Trajectory Mamba

Kunyu Peng, Di Wen, Jia Fu, Jiamin Wu, Kailun Yang, Junwei Zheng, Ruiping Liu, Yufan Chen, Yuqian Fu, Danda Pani Paudel, Luc Van Gool, Rainer Stiefelhagen

🧩 TL;DR

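This paper presents the RefAtomNet++ framework and the RefAVA++ dataset for atomic-level video action recognition conditioned on language descriptions. Using multi-hierarchical semantic-aligned cross-attention and multi-trajectory Mamba modeling, the framework sets a new state of the art in fine-grained action recognition and person localization.


📘 Detailed Summary

Motivation: Existing atomic video action recognition methods are limited in aligning and retrieving cross-modal information, which leads to poor target-person localization and fine-grained action prediction. Precise language-guided action understanding is especially lacking in complex multi-person scenes, calling for more effective cross-modal representation learning.

Method: RefAtomNet++ combines a multi-hierarchical semantic-aligned cross-attention mechanism with multi-trajectory Mamba modeling at three levels: partial keyword, scene attribute, and holistic sentence. Scanning trajectories are built by dynamically selecting the nearest visual spatial tokens, enabling effective aggregation of spatial and temporal tokens across semantic hierarchies.

Result: Experiments show that RefAtomNet++ establishes new state-of-the-art results on RefAVA++. The dataset contains more than 2.9 million frames and more than 75.1k annotated persons, providing a large-scale benchmark for fine-grained action recognition.

Conclusion: The study demonstrates the effectiveness of multi-hierarchical semantic alignment and dynamic trajectory modeling for cross-modal action recognition, and offers a new technical route for interactive human action analysis in complex scenes. Fine-grained semantic alignment in multimodal representation learning is a key factor in improving performance.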


📄 Abstract

Referring Atomic Video Action Recognition (RAVAR) aims to recognize fine-grained, atomic-level actions of a specific person of interest conditioned on natural language descriptions. Distinct from conventional action recognition and detection tasks, RAVAR emphasizes precise language-guided action understanding, which is particularly critical for interactive human action analysis in complex multi-person scenarios. In this work, we extend our previously introduced RefAVA dataset to RefAVA++, which comprises >2.9 million frames and >75.1k annotated persons in total. We benchmark this dataset using baselines from multiple related domains, including atomic action localization, video question answering, and text-video retrieval, as well as our earlier model, RefAtomNet. Although RefAtomNet surpasses other baselines by incorporating agent attention to highlight salient features, its ability to align and retrieve cross-modal information remains limited, leading to suboptimal performance in localizing the target person and predicting fine-grained actions. To overcome the aforementioned limitations, we introduce RefAtomNet++, a novel framework that advances cross-modal token aggregation through a multi-hierarchical semantic-aligned cross-attention mechanism combined with multi-trajectory Mamba modeling at the partial-keyword, scene-attribute, and holistic-sentence levels. In particular, scanning trajectories are constructed by dynamically selecting the nearest visual spatial tokens at each timestep for both partial-keyword and scene-attribute levels. Moreover, we design a multi-hierarchical semantic-aligned cross-attention strategy, enabling more effective aggregation of spatial and temporal tokens across different semantic hierarchies. Experiments show that RefAtomNet++ establishes new state-of-the-art results. The dataset and code are released at https://github.com/KPeng9510/refAVA2.

[18] Instance-Aware Pseudo-Labeling and Class-Focused Contrastive Learning for Weakly Supervised Domain Adaptive Segmentation of Electron Microscopy

Shan Xiong, Jiabao Chen, Ye Wang, Jialin Peng

🧩 TL;DR

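This paper proposes a weakly supervised domain adaptation method for mitochondria segmentation in electron microscopy images. By combining sparse point annotations with a multitask learning framework, it significantly improves segmentation performance while reducing annotation cost.


📘 Detailed Summary

Motivation: Annotating mitochondria instances in electron microscopy images is expensive, and existing unsupervised domain adaptation methods perform poorly in practice. The paper therefore studies weakly supervised domain adaptation with sparse point labels on the target domain, minimizing annotation effort and the need for expert knowledge.

Method: A multitask learning framework jointly performs segmentation and center detection, using a cross-teaching mechanism and class-focused cross-domain contrastive learning. Segmentation self-training with an instance-aware pseudo-label selection strategy semantically selects reliable and diverse pseudo-labels.

Result: Comprehensive validation on several challenging datasets shows that the method outperforms existing unsupervised and weakly supervised domain adaptation methods and significantly narrows the gap to the supervised upper bound. Under the unsupervised setting it also achieves substantial improvements over other techniques.

Conclusion: By making effective use of sparse point annotations and multitask learning, the method provides an efficient, practical solution for biomedical image segmentation that greatly reduces annotation cost while maintaining performance.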


📄 Abstract

Annotation-efficient segmentation of the numerous mitochondria instances from various electron microscopy (EM) images is highly valuable for biological and neuroscience research. Although unsupervised domain adaptation (UDA) methods can help mitigate domain shifts and reduce the high costs of annotating each domain, they typically have relatively low performance in practical applications. Thus, we investigate weakly supervised domain adaptation (WDA) that utilizes additional sparse point labels on the target domain, which require minimal annotation effort and minimal expert knowledge. To make full use of the incomplete and imprecise point annotations, we introduce a multitask learning framework that jointly conducts segmentation and center detection with a novel cross-teaching mechanism and class-focused cross-domain contrastive learning. Because leveraging unlabeled image regions is essential, we also introduce segmentation self-training with a novel instance-aware pseudo-label (IPL) selection strategy. Unlike existing methods that typically rely on pixel-wise pseudo-label filtering, the IPL semantically selects reliable and diverse pseudo-labels with the help of the detection task. Comprehensive validations and comparisons on challenging datasets demonstrate that our method outperforms existing UDA and WDA methods, significantly narrowing the performance gap with the supervised upper bound. Furthermore, under the UDA setting, our method also achieves substantial improvements over other UDA techniques.

[19] NavQ: Learning a Q-Model for Foresighted Vision-and-Language Navigation

Peiran Xu, Xicheng Gong, Yadong MU

🧩 TL;DR

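This paper proposes a foresighted agent for goal-oriented vision-and-language navigation. It uses Q-learning to learn knowledge of scene layout and object relations from large-scale unlabeled trajectory data and combines it with an A*-style search strategy to explore regions likely to lead to the destination.


📘 Detailed Summary

Motivation: Existing vision-and-language navigation methods usually decide based on historical information and overlook the future implications and long-term outcomes of actions, which limits navigation efficiency. The paper addresses this by developing a foresighted navigation agent that can anticipate future states.

Method: A Q-model is trained with Q-learning on large-scale unlabeled trajectory data to learn the layout and object relations of indoor scenes, producing a Q-feature that describes the potential future information of each candidate action. A cross-modal future encoder combines the task-agnostic Q-feature with the navigation instruction to produce action scores that reflect future prospects; these are combined with the original history-based scores to drive an A*-style search strategy.

Result: Extensive experiments on widely used goal-oriented vision-and-language navigation datasets validate the effectiveness of the method, showing substantially improved navigation performance, particularly in exploring regions likely to lead to the destination.

Conclusion: The study demonstrates the importance of incorporating future information into navigation decisions. Combining task-agnostic scene knowledge with task-specific navigation instructions yields more effective path planning and goal reaching, and highlights the role of long-term planning in navigating complex environments.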


📄 Abstract

In this work we concentrate on the task of goal-oriented Vision-and-Language Navigation (VLN). Existing methods often make decisions based on historical information, overlooking the future implications and long-term outcomes of the actions. In contrast, we aim to develop a foresighted agent. Specifically, we draw upon Q-learning to train a Q-model using large-scale unlabeled trajectory data, in order to learn the general knowledge regarding the layout and object relations within indoor scenes. This model can generate a Q-feature, analogous to the Q-value in traditional Q-network, for each candidate action, which describes the potential future information that may be observed after taking the specific action. Subsequently, a cross-modal future encoder integrates the task-agnostic Q-feature with navigation instructions to produce a set of action scores reflecting future prospects. These scores, when combined with the original scores based on history, facilitate an A*-style searching strategy to effectively explore the regions that are more likely to lead to the destination. Extensive experiments conducted on widely used goal-oriented VLN datasets validate the effectiveness of the proposed method.
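The combination of history-based and future-aware scores into an A*-style search can be sketched as a best-first exploration over a navigation graph. This is a rough illustration under stated assumptions, not the authors' method: the graph representation, the two scoring callbacks, the additive combination, and the `weight` parameter are all hypothetical stand-ins for the learned components described in the abstract.

```python
import heapq


def a_star_style_search(graph, start, goal, history_score, future_score,
                        weight=1.0, max_steps=100):
    """Best-first exploration: always expand the frontier node with the
    highest combined (history + weight * future) score."""
    frontier = [(0.0, start, [start])]  # heapq is a min-heap, so scores are negated
    visited = set()
    for _ in range(max_steps):
        if not frontier:
            break
        _, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for nxt in graph.get(node, []):
            if nxt not in visited:
                # Assumed combination rule: a simple weighted sum of both scores.
                score = history_score(node, nxt) + weight * future_score(nxt)
                heapq.heappush(frontier, (-score, nxt, path + [nxt]))
    return None  # goal not reached within the step budget
```

Because the frontier keeps every unexpanded candidate, the agent can return to an earlier branch when the currently favored region turns out to be unpromising, which is the practical difference from a purely greedy step-by-step policy.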

[20] PRISMM-Bench: A Benchmark of Peer-Review Grounded Multimodal Inconsistencies

Lukas Selch, Yufang Hou, M. Jehanzeb Mirza, Sivan Doveh, James Glass, Rogerio Feris, Wei Lin

🧩 TL;DR

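This paper introduces PRISMM-Bench, the first multimodal benchmark grounded in reviewer-flagged inconsistencies in real scientific papers, to evaluate how reliably large multimodal models reason about science. Existing models perform poorly at detecting and resolving cross-modal inconsistencies, revealing how challenging scientific multimodal reasoning is.


📘 Detailed Summary

Motivation: Large multimodal models are increasingly applied to scientific research, but it remains unclear whether they can reliably understand and reason over the multimodal complexity of papers. The central challenge is detecting and resolving inconsistencies across text, figures, tables, and equations; these are often subtle and domain-specific and undermine clarity, reproducibility, and trust. Existing benchmarks either isolate single modalities or rely on synthetic errors that fail to capture real-world complexity.

Method: PRISMM-Bench is built through a multi-stage pipeline of review mining, LLM-assisted filtering, and human verification, yielding 262 inconsistencies from 242 papers. Three tasks, inconsistency identification, remedy, and pair matching, assess a model's ability to detect, correct, and reason over inconsistencies across modalities. Structured JSON-based answer representations minimize linguistic bias by reducing reliance on superficial stylistic cues.

Result: Twenty-one leading LMMs are benchmarked, including large open-weight and proprietary models. Performance is strikingly low (26.1-54.2%), underscoring the difficulty of multimodal scientific reasoning. Even state-of-the-art models struggle to detect and resolve inconsistencies in real scientific papers.

Conclusion: The study exposes significant limitations of current multimodal models on scientific reasoning tasks and underlines the need for trustworthy scientific assistants. PRISMM-Bench provides a benchmark for evaluating and improving multimodal scientific understanding.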


📄 Abstract

Large Multimodal Models (LMMs) are increasingly applied to scientific research, yet it remains unclear whether they can reliably understand and reason over the multimodal complexity of papers. A central challenge lies in detecting and resolving inconsistencies across text, figures, tables, and equations, issues that are often subtle, domain-specific, and ultimately undermine clarity, reproducibility, and trust. Existing benchmarks overlook this issue, either isolating single modalities or relying on synthetic errors that fail to capture real-world complexity. We introduce PRISMM-Bench (Peer-Review-sourced Inconsistency Set for Multimodal Models), the first benchmark grounded in real reviewer-flagged inconsistencies in scientific papers. Through a multi-stage pipeline of review mining, LLM-assisted filtering and human verification, we curate 262 inconsistencies from 242 papers. Based on this set, we design three tasks, namely inconsistency identification, remedy and pair matching, which assess a model's capacity to detect, correct, and reason over inconsistencies across different modalities. Furthermore, to address the notorious problem of choice-only shortcuts in multiple-choice evaluation, where models exploit answer patterns without truly understanding the question, we further introduce structured JSON-based answer representations that minimize linguistic biases by reducing reliance on superficial stylistic cues. We benchmark 21 leading LMMs, including large open-weight models (GLM-4.5V 106B, InternVL3 78B) and proprietary models (Gemini 2.5 Pro, GPT-5 with high reasoning). Results reveal strikingly low performance (26.1-54.2%), underscoring the challenge of multimodal scientific reasoning and motivating progress towards trustworthy scientific assistants.

[21] Enhancing Compositional Reasoning in CLIP via Reconstruction and Alignment of Text Descriptions

Jihoon Kwon, Kyle Min, Jy-yong Sohn

🧩 TL;DR

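This paper proposes READ, which strengthens the compositional reasoning of vision-language models by adding two auxiliary objectives, reconstruction and alignment. READ-CLIP achieves state-of-the-art performance on five major compositional reasoning benchmarks, improving on conventional fine-tuning baselines by up to 4.1%.


📘 Detailed Summary

Motivation: Despite recent progress, vision-language models trained with standard contrastive objectives still struggle with compositional reasoning, that is, understanding structured relationships between visual and linguistic elements. The shortcoming stems largely from the text encoder's tendency to focus on individual words rather than their relations, a limitation reinforced by contrastive training that mainly aligns words with visual objects.

Method: READ is a fine-tuning method that enhances compositional reasoning by adding two auxiliary objectives to contrastive learning: (1) a token-level reconstruction objective, in which a frozen pre-trained decoder reconstructs alternative captions from the embedding of the original caption; and (2) a sentence-level alignment objective, which explicitly aligns paraphrased sentences in the embedding space.

Result: READ-CLIP achieves state-of-the-art performance on five major compositional reasoning benchmarks, outperforming the strongest conventional fine-tuning baseline by up to 4.1%. Applying READ to existing CLIP variants, including NegCLIP and FSC-CLIP, also improves performance on these benchmarks. Quantitative and qualitative analyses show that the reconstruction and alignment objectives are complementary.

Conclusion: The two objectives offer complementary benefits: reconstruction encourages the encoder to capture relationships between words within a caption, while alignment ensures consistent representations for paraphrases with different wording. The method is an effective fine-tuning strategy for improving compositional reasoning in vision-language models.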


📄 Abstract

Despite recent advances, vision-language models trained with standard contrastive objectives still struggle with compositional reasoning -- the ability to understand structured relationships between visual and linguistic elements. This shortcoming is largely due to the tendency of the text encoder to focus on individual words rather than their relations, a limitation reinforced by contrastive training that primarily aligns words with visual objects. In this paper, we introduce REconstruction and Alignment of text Descriptions (READ), a fine-tuning method designed to enhance compositional reasoning by adding two auxiliary objectives to contrastive learning: (1) a token-level reconstruction objective, where a frozen pre-trained decoder reconstructs alternative captions based on the embedding of the original caption; and (2) a sentence-level alignment objective, which explicitly aligns paraphrased sentences in the embedding space. We show that READ-CLIP, a model derived by applying the READ method to the pre-trained CLIP model, achieves state-of-the-art performance across five major compositional reasoning benchmarks, outperforming the strongest conventional fine-tuning baseline by up to 4.1%. Furthermore, applying READ to existing CLIP variants (including NegCLIP and FSC-CLIP) also improves performance on these benchmarks. Quantitative and qualitative analyses reveal that our proposed objectives -- reconstruction and alignment -- offer complementary benefits: the former encourages the encoder to capture relationships between words within a caption, while the latter ensures consistent representations for paraphrases expressed with different wording.

[22] Fit for Purpose? Deepfake Detection in the Real World

Guangyu Lin, Li Lin, Christina P. Walker, Daniel S. Schiff, Shu Hu

🧩 TL;DR

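This paper presents the first systematic benchmark based on real-world political deepfake incidents and evaluates how well existing detectors generalize to such content. It finds that current detectors perform poorly in real political settings and are vulnerable to simple attacks.


📘 Detailed Summary

Motivation: Most current deepfake detection models are trained and validated on laboratory-controlled synthetic datasets, which limits their generalization to the real-world political deepfakes circulating on social media, content that seriously threatens public trust and democratic institutions.

Method: The study builds the first systematic benchmark on the Political Deepfakes Incident Database, a collection of real political deepfakes shared on social media since 2018, and systematically evaluates state-of-the-art deepfake detectors from academia, government, and industry.

Result: Detectors from academia and government perform relatively poorly. Paid detection tools outperform free-access models, but all detectors struggle to generalize to real political deepfakes and are especially vulnerable to simple manipulations in the video domain.

Conclusion: The results underline the urgent need for politically contextualized deepfake detection frameworks that better protect the public in real-world settings, with more robust detection designed around the specifics of political content.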


📄 Abstract

The rapid proliferation of AI-generated content, driven by advances in generative adversarial networks, diffusion models, and multimodal large language models, has made the creation and dissemination of synthetic media effortless, heightening the risks of misinformation, particularly political deepfakes that distort truth and undermine trust in political institutions. In turn, governments, research institutions, and industry have strongly promoted deepfake detection initiatives as solutions. Yet, most existing models are trained and validated on synthetic, laboratory-controlled datasets, limiting their generalizability to the kinds of real-world political deepfakes circulating on social platforms that affect the public. In this work, we introduce the first systematic benchmark based on the Political Deepfakes Incident Database, a curated collection of real-world political deepfakes shared on social media since 2018. Our study includes a systematic evaluation of state-of-the-art deepfake detectors across academia, government, and industry. We find that the detectors from academia and government perform relatively poorly. While paid detection tools achieve relatively higher performance than free-access models, all evaluated detectors struggle to generalize effectively to authentic political deepfakes, and are vulnerable to simple manipulations, especially in the video domain. Results urge the need for politically contextualized deepfake detection frameworks to better safeguard the public in real-world settings.

[23] SHIELD: Suppressing Hallucinations In LVLM Encoders via Bias and Vulnerability Defense

Yiyang Huang, Liang Shi, Yitian Zhang, Yi Xu, Yun Fu

🧩 TL;DR

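This paper proposes SHIELD, a framework that for the first time traces object hallucination in LVLMs to the visual encoder and uses three training-free strategies to mitigate hallucinations caused by statistical bias, inherent bias, and vulnerability.


📘 Detailed Summary

Motivation: Large vision-language models excel at cross-modal tasks, but object hallucination, in which a model produces plausible but inaccurate object descriptions, remains a major challenge. Unlike earlier work focused on the LLM component, this paper is the first to trace LVLM hallucinations to the visual encoder.

Method: SHIELD is a training-free framework with three strategies: re-weighting visual tokens to reduce statistical bias, introducing noise-derived tokens to counter inherent bias, and applying adversarial attacks with contrastive decoding to address vulnerability.

Result: Experiments show that SHIELD effectively mitigates object hallucination across diverse benchmarks and LVLM families while performing strongly on a general LVLM benchmark, demonstrating broad applicability.

Conclusion: The study reveals the key role of the visual encoder in LVLM hallucination. SHIELD addresses specific hallucination problems and shows potential to improve robustness while preserving general performance.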


📄 Abstract

Large Vision-Language Models (LVLMs) excel in diverse cross-modal tasks. However, object hallucination, where models produce plausible but inaccurate object descriptions, remains a significant challenge. In contrast to previous work focusing on LLM components, this paper is the first to trace LVLM hallucinations to visual encoders and identifies three key issues: statistical bias, inherent bias, and vulnerability. To address these challenges, we propose SHIELD, a training-free framework that mitigates hallucinations through three strategies: re-weighting visual tokens to reduce statistical bias, introducing noise-derived tokens to counter inherent bias, and applying adversarial attacks with contrastive decoding to address vulnerability. Experiments demonstrate that SHIELD effectively mitigates object hallucinations across diverse benchmarks and LVLM families. Moreover, SHIELD achieves strong performance on the general LVLM benchmark, highlighting its broad applicability. Code will be released.

[24] VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs

Jiaying Zhu, Yurui Zhu, Xin Lu, Wenrui Yan, Dong Li, Kunlin Liu, Xueyang Fu, Zheng-Jun Zha

🧩 TL;DR

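This paper proposes VisionSelector, a lightweight plug-and-play visual token compression framework for multimodal large language models. A learnable decision process enables efficient, adaptive token selection, significantly improving computational efficiency while preserving model performance.


📘 Detailed Summary

Motivation: Multimodal large language models face computational and memory bottlenecks from the large number of visual tokens produced by high-resolution or multi-image inputs. Existing token compression techniques are constrained by heuristic rules, risk discarding critical information, and suffer from attention biases, causing sharp performance drops at aggressive compression ratios.

Method: VisionSelector reformulates token compression as an end-to-end learnable decision process. It consists of a scorer module decoupled from the MLLM backbone and uses a differentiable Top-K mechanism and a curriculum annealing strategy to bridge the training-inference gap, supporting efficient, adaptive token selection at arbitrary compression rates.

Result: With only 12.85M trainable parameters, VisionSelector preserves 100% accuracy on MME at a 30% retention budget, outperforms prior methods by 12.14% at a 10% retention budget, and doubles prefill speed, showing superior performance across compression budgets and the ability to adaptively identify critical tokens.

Conclusion: The study demonstrates the effectiveness of token compression through a learnable decision process. VisionSelector's lightweight design and generalization make it a practical solution for efficient deployment of multimodal large language models and point to directions for adaptive token selection.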


📄 Abstract

Multimodal Large Language Models (MLLMs) encounter significant computational and memory bottlenecks from the massive number of visual tokens generated by high-resolution images or multi-image inputs. Previous token compression techniques are often constrained by heuristic rules that risk discarding critical information. They may also suffer from biases, such as attention sinks, that lead to sharp performance drops under aggressive compression ratios. To address these limitations, we reformulate token compression as an end-to-end learnable decision process within a lightweight plug-and-play framework. Specifically, we propose VisionSelector, a scorer module decoupled from the MLLM backbone that incorporates a differentiable Top-K mechanism and a curriculum annealing strategy to bridge the training-inference gap, enabling efficient and adaptive token selection at arbitrary compression rates. Remarkably lightweight with only 12.85M trainable parameters, VisionSelector generalizes across various compression rates and adaptively identifies critical tokens. This leads to superior performance across all compression budgets, evidenced by preserving 100% accuracy on MME with 30% retention budget, outperforming prior methods by 12.14% at 10% retention budget, and doubling prefill speed. Our code is available at https://github.com/JulietChoo/VisionSelector.
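The inference-time effect of a retention budget can be illustrated with a plain hard Top-K selection over per-token scores. This sketch assumes the scores already exist (in the paper they come from the learned scorer module) and does not attempt the differentiable Top-K or curriculum annealing used during training; the function name and the rounding rule for the budget are illustrative choices.

```python
def select_tokens(scores, retention):
    """Return the indices of the highest-scoring tokens, in original order.

    scores    -- one importance score per visual token (assumed given)
    retention -- fraction of tokens to keep, in (0, 1]
    """
    if not 0.0 < retention <= 1.0:
        raise ValueError("retention must be in (0, 1]")
    # Assumed rounding rule: nearest integer, but always keep at least one token.
    k = max(1, round(len(scores) * retention))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)  # restore positional order for the downstream model
```

For example, with ten tokens and `retention=0.3`, the three highest-scoring tokens are kept and the other seven never enter the language model's prefill, which is where the speedup reported in the abstract would come from.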

[25] Xiaoice: Training-Free Video Understanding via Self-Supervised Spatio-Temporal Clustering of Semantic Features

Shihao Ji, Zihui Song

🧩 TL;DR

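This paper proposes a training-free video understanding framework that combines the semantic priors of pre-trained vision-language models with classic machine learning algorithms, reframing video understanding as a self-supervised spatio-temporal clustering problem in a high-dimensional semantic feature space.


📘 Detailed Summary

Motivation: Large-scale vision-language models show remarkable zero-shot reasoning on static images, but this ability has not yet fully transferred to video. Conventional video understanding models depend on large annotated datasets and task-specific training, which is costly and limits scalability.

Method: The framework first converts a video stream into a semantic feature trajectory using the frozen visual encoder of a pre-trained VLM. It then applies Kernel Temporal Segmentation to partition the continuous feature stream into discrete, semantically coherent event segments, and finally uses unsupervised density-based clustering to identify recurring macroscopic scenes and themes in the video.

Result: By selecting representative keyframes from each discovered cluster and using the VLM's generative capabilities to describe them in text, the framework automatically produces a structured multimodal summary of the video content, achieving zero-shot automated structural analysis of video.

Conclusion: The approach provides an effective, interpretable, and model-agnostic route to zero-shot automated structural analysis of video content, opening a paradigm for video understanding without end-to-end training.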


📄 Abstract

The remarkable zero-shot reasoning capabilities of large-scale Visual Language Models (VLMs) on static images have yet to be fully translated to the video domain. Conventional video understanding models often rely on extensive, task-specific training on annotated datasets, a process that is both costly and limited in scalability. This paper introduces a novel, training-free framework for video understanding that circumvents end-to-end training by synergistically combining the rich semantic priors of pre-trained VLMs with classic machine learning algorithms for pattern discovery. Our core idea is to reframe video understanding as a self-supervised spatio-temporal clustering problem within a high-dimensional semantic feature space. The proposed pipeline first transforms a video stream into a semantic feature trajectory using the frozen visual encoder of a pre-trained VLM. Subsequently, we employ Kernel Temporal Segmentation (KTS), a robust machine learning technique, to partition the continuous feature stream into discrete, semantically coherent event segments. These segments are then subjected to unsupervised density-based clustering to identify recurring macroscopic scenes and themes throughout the video. By selecting representative keyframes from each discovered cluster and leveraging the VLM's generative capabilities for textual description, our framework automatically produces a structured, multi-modal summary of the video content. This approach provides an effective, interpretable, and model-agnostic pathway for zero-shot, automated structural analysis of video content.
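The segmentation step of the pipeline can be pictured with a deliberately simplified stand-in. The sketch below is not Kernel Temporal Segmentation: it replaces KTS with a plain threshold on the cosine similarity of consecutive frame features, purely to show how a feature trajectory becomes a list of event segments. The threshold value and function names are assumptions, and the features are assumed to be non-zero vectors.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def segment(features, threshold=0.8):
    """Split a feature trajectory into (start, end) frame ranges, cutting
    wherever consecutive frames fall below the similarity threshold.
    Simplified substitute for KTS, for illustration only."""
    segments, start = [], 0
    for t in range(1, len(features)):
        if cosine(features[t - 1], features[t]) < threshold:
            segments.append((start, t))  # end index is exclusive
            start = t
    segments.append((start, len(features)))
    return segments
```

In the described framework the resulting segments would then be clustered to find recurring scenes, and a keyframe from each cluster would be passed to the VLM for captioning.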

[26] Pursuing Minimal Sufficiency in Spatial Reasoning

Yejie Guo, Yunzhong Hou, Wufei Ma, Meng Tang, Ming-Hsuan Yang

🧩 TL;DR

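This paper proposes MSSR (Minimal Sufficient Spatial Reasoner), a dual-agent framework that addresses the difficulties vision-language models have with 3D spatial reasoning by constructing a minimal sufficient set of information, significantly improving the accuracy and interpretability of spatial reasoning.


📘 Detailed Summary

Motivation: Current vision-language models face two fundamental bottlenecks in spatial reasoning: inadequate 3D understanding caused by 2D-centric pre-training, and reasoning failures induced by redundant 3D information. Both limit a model's ability to ground language in 3D scenes.

Method: MSSR uses a dual-agent framework. A Perception Agent programmatically queries the 3D scene with a versatile perception toolbox, including a novel SOG (Situated Orientation Grounding) module, to extract sufficient information. A Reasoning Agent then iteratively refines this information toward minimality, pruning redundant details and requesting missing ones in a closed loop until the minimal sufficient set is built.

Result: Extensive experiments show that explicitly pursuing both sufficiency and minimality significantly improves accuracy and achieves state-of-the-art performance on two challenging benchmarks, while the framework also produces interpretable reasoning paths.

Conclusion: The study offers a new solution for 3D spatial reasoning: the minimal sufficient set principle addresses both redundant and missing information, and the interpretable reasoning paths are a promising source of high-quality training data for future models.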


📄 Abstract

Spatial reasoning, the ability to ground language in 3D understanding, remains a persistent challenge for Vision-Language Models (VLMs). We identify two fundamental bottlenecks: inadequate 3D understanding capabilities stemming from 2D-centric pre-training, and reasoning failures induced by redundant 3D information. To address these, we first construct a Minimal Sufficient Set (MSS) of information before answering a given question: a compact selection of 3D perception results from expert models. We introduce MSSR (Minimal Sufficient Spatial Reasoner), a dual-agent framework that implements this principle. A Perception Agent programmatically queries 3D scenes using a versatile perception toolbox to extract sufficient information, including a novel SOG (Situated Orientation Grounding) module that robustly extracts language-grounded directions. A Reasoning Agent then iteratively refines this information to pursue minimality, pruning redundant details and requesting missing ones in a closed loop until the MSS is curated. Extensive experiments demonstrate that our method, by explicitly pursuing both sufficiency and minimality, significantly improves accuracy and achieves state-of-the-art performance across two challenging benchmarks. Furthermore, our framework produces interpretable reasoning paths, offering a promising source of high-quality training data for future models. Source code is available at https://github.com/gyj155/mssr.

[27] Connecting Domains and Contrasting Samples: A Ladder for Domain Generalization

Tianxin Wei, Yifan Chen, Xinrui He, Wenxuan Bao, Jingrui He

🧩 TL;DR

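This paper proposes domain-connecting contrastive learning (DCCL), a new paradigm that strengthens conceptual connectivity across domains to fix the performance drop caused by applying contrastive learning directly to domain generalization. It outperforms state-of-the-art methods on five standard benchmarks.


📘 Detailed Summary

Motivation: Distribution shifts between training and test samples severely impede generalization. Although contrastive learning should in theory improve domain generalization through class-separated representations, applying it directly actually degrades performance; the paper finds that this is caused by a lack of intra-class connectivity in the domain generalization setting.

Method: DCCL improves intra-class connectivity on the data side with more aggressive data augmentation and cross-domain positive samples. On the model side, it proposes model anchoring to exploit the intra-class connectivity in pre-trained representations, complemented by a generative transformation loss, to better embed unseen test domains.

Result: Extensive experiments on five standard domain generalization benchmarks verify that DCCL outperforms state-of-the-art baselines even without domain supervision. The implementation is available on GitHub.

Conclusion: The study shows that strengthening conceptual connectivity across domains is crucial for domain generalization. Through data augmentation and model anchoring, DCCL resolves the performance drop of contrastive learning in this setting and offers a new solution for domain generalization without domain supervision.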


📄 Abstract

Distribution shifts between training and testing samples frequently occur in practice and impede model generalization performance. This crucial challenge thereby motivates studies on domain generalization (DG), which aim to predict the label on unseen target domain data by solely using data from source domains. It is intuitive to conceive the class-separated representations learned in contrastive learning (CL) are able to improve DG, while the reality is quite the opposite: users observe directly applying CL deteriorates the performance. We analyze the phenomenon with the insights from CL theory and discover lack of intra-class connectivity in the DG setting causes the deficiency. We thus propose a new paradigm, domain-connecting contrastive learning (DCCL), to enhance the conceptual connectivity across domains and obtain generalizable representations for DG. On the data side, more aggressive data augmentation and cross-domain positive samples are introduced to improve intra-class connectivity. On the model side, to better embed the unseen test domains, we propose model anchoring to exploit the intra-class connectivity in pre-trained representations and complement the anchoring with generative transformation loss. Extensive experiments on five standard DG benchmarks are performed. The results verify that DCCL outperforms state-of-the-art baselines even without domain supervision. The detailed model implementation and the code are provided through https://github.com/weitianxin/DCCL

[28] Res-Bench: Benchmarking the Robustness of Multimodal Large Language Models to Dynamic Resolution Input

Chenxu Li, Zhicai Wang, Yuan Sheng, Xingyu Zhu, Yanbin Hao, Xiang Wang

🧩 TL;DR

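This paper introduces Res-Bench, a benchmark for evaluating the robustness of multimodal large language models to varying input resolutions, together with a new evaluation framework that quantifies how stable performance is as resolution changes.


📘 Detailed Summary

Motivation: Current evaluation of multimodal large language models focuses on semantic performance and overlooks resolution robustness, that is, whether performance stays stable across input resolutions. Closing this gap requires a systematic evaluation framework.

Method: Res-Bench comprises 14,400 samples across 12 resolution levels and six core capability dimensions. Performance stability is measured with several robustness metrics, including Spearman's correlation and Absolute/Relative Continuous Error.

Result: A large-scale evaluation of leading MLLMs analyzes robustness from model-centric and task-centric perspectives, investigates preprocessing strategies such as padding and super-resolution, and explores fine-tuning to improve stability.

Conclusion: The study highlights the importance of evaluating resolution robustness, provides a systematic framework for doing so, and exposes performance fluctuations of current models across resolutions, pointing to directions for future optimization.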


📄 Abstract

Multimodal Large Language Models (MLLMs) increasingly support dynamic image resolutions. However, current evaluation paradigms primarily assess semantic performance, overlooking the critical question of resolution robustness: whether performance remains stable across varying input resolutions. To address this gap, we introduce Res-Bench, a comprehensive benchmark comprising 14,400 samples across 12 resolution levels and six core capability dimensions. We designed a novel evaluation framework that goes beyond traditional accuracy metrics to capture performance stability. This framework introduces multiple robustness metrics: Spearman's correlation for assessing resolution-performance trends, and Absolute/Relative Continuous Error (ACE/RCE) for measuring performance volatility. Using these metrics, we conducted a large-scale evaluation of leading MLLMs. Our analysis encompasses: (1) model-centric and task-centric robustness examination, (2) investigation of preprocessing strategies including padding and super-resolution, and (3) exploration of fine-tuning for stability enhancement.
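The two kinds of robustness metric named in the abstract can be illustrated with a small, dependency-free sketch. Spearman's correlation is standard and is computed here as the Pearson correlation of average ranks. The abstract does not define ACE or RCE, so `mean_abs_step` below is only a hypothetical volatility measure (mean absolute accuracy change between adjacent resolution levels) and should not be read as the paper's formula; the accuracy values in the usage note are made up.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            result[idx] = mean_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the ranks.
    Assumes neither input is constant."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


def mean_abs_step(accs):
    """Hypothetical volatility measure, not the paper's ACE/RCE."""
    return sum(abs(b - a) for a, b in zip(accs, accs[1:])) / (len(accs) - 1)
```

Calling `spearman([224, 336, 448, 672], [0.50, 0.55, 0.60, 0.70])` on this illustrative data describes a model whose accuracy rises monotonically with resolution, while `mean_abs_step` summarizes how large the level-to-level swings are regardless of direction.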

[29] Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes

Xiongkun Linghu, Jiangyong Huang, Ziyu Zhu, Baoxiong Jia, Siyuan Huang

🧩 TL;DR

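This paper presents SCENECOT, the first chain-of-thought reasoning framework for 3D scene understanding. Built on the large-scale annotated dataset SCENECOT-185K, it performs step-by-step scene-object grounded reasoning with multimodal expert modules and achieves strong performance on several complex 3D scene reasoning benchmarks.


📘 Detailed Summary

Motivation: Existing 3D large language models still struggle with grounded question answering, mainly because the mechanism of human-like scene-object grounded reasoning is under-explored. This work aims to fill that gap and address grounding in complex 3D scene reasoning.

Method: SCENECOT is a grounded chain-of-thought reasoning method for 3D scenes. It decomposes a complex reasoning task into simpler sub-problems and builds the corresponding visual clues with multimodal expert modules. The authors also develop SCENECOT-185K, the first large-scale grounded CoT reasoning dataset, with 185K high-quality instances.

Result: Extensive experiments on several complex 3D scene reasoning benchmarks show that the framework achieves strong performance with high grounding-QA coherence, validating CoT reasoning for 3D scene understanding.

Conclusion: This is the first successful application of chain-of-thought reasoning to 3D scene understanding. It enables step-by-step, human-like reasoning and shows potential for extension to broader 3D scene understanding scenarios.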


📄 Abstract

Existing research on 3D Large Language Models (LLMs) still struggles to achieve grounded question-answering, primarily due to the under-exploration of the mechanism of human-like scene-object grounded reasoning. This paper bridges the gap by presenting a novel framework. We first introduce a grounded Chain-of-Thought reasoning method in 3D scenes (SCENECOT), decoupling a complex reasoning task into simpler and manageable problems, and building corresponding visual clues based on multimodal expert modules. To enable such a method, we develop SCENECOT-185K, the first large-scale grounded CoT reasoning dataset, consisting of 185K high-quality instances. Extensive experiments across various complex 3D scene reasoning benchmarks demonstrate that our new framework achieves strong performance with high grounding-QA coherence. To the best of our knowledge, this is the first successful application of CoT reasoning to 3D scene understanding, enabling step-by-step human-like reasoning and showing potential for extension to broader 3D scene understanding scenarios.

[30] $\mathcal{V}isi\mathcal{P}runer$: Decoding Discontinuous Cross-Modal Dynamics for Efficient Multimodal LLMs

Yingqi Fan, Anhao Zhao, Jinlan Fu, Junlong Tong, Hui Su, Yijie Pan, Wei Zhang, Xiaoyu Shen

🧩 TL;DR

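Through systematic analysis, this paper uncovers a three-stage cross-modal interaction process in multimodal large language models and, building on it, proposes VisiPruner, a training-free pruning framework that greatly reduces vision-related attention computation and FLOPs, outperforms existing methods, and applies to diverse MLLM architectures.


📘 Detailed Summary

Motivation: Multimodal large language models incur heavy computational overhead on vision-language tasks, in particular because attention computation grows quadratically with the number of multimodal tokens. Existing token pruning methods lack a fundamental understanding of how MLLMs process and fuse multimodal information.

Method: Systematic analysis reveals three stages of cross-modal interaction: shallow layers recognize task intent, middle layers perform cross-modal fusion driven by a few critical visual tokens, and deep layers focus solely on linguistic refinement. Based on this, VisiPruner, a training-free framework, prunes visual tokens in a stage-specific way.

Result: On LLaVA-v1.5 7B, VisiPruner reduces vision-related attention computation by up to 99% and FLOPs by 53.9%. It significantly outperforms existing token pruning methods and generalizes across diverse MLLM architectures.

Conclusion: Beyond an efficient pruning solution, the study reveals the intrinsic layer-wise processing dynamics of MLLMs and provides actionable guidelines for training efficient MLLMs by aligning model architecture with those dynamics.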


📄 Abstract

Multimodal Large Language Models (MLLMs) have achieved strong performance across vision-language tasks, but suffer from significant computational overhead due to the quadratic growth of attention computations with the number of multimodal tokens. Though efforts have been made to prune tokens in MLLMs, they lack a fundamental understanding of how MLLMs process and fuse multimodal information. Through systematic analysis, we uncover a three-stage cross-modal interaction process: (1) Shallow layers recognize task intent, with visual tokens acting as passive attention sinks; (2) Cross-modal fusion occurs abruptly in middle layers, driven by a few critical visual tokens; (3) Deep layers discard vision tokens, focusing solely on linguistic refinement. Based on these findings, we propose VisiPruner, a training-free pruning framework that reduces up to 99% of vision-related attention computations and 53.9% of FLOPs on LLaVA-v1.5 7B. It significantly outperforms existing token pruning methods and generalizes across diverse MLLMs. Beyond pruning, our insights further provide actionable guidelines for training efficient MLLMs by aligning model architecture with its intrinsic layer-wise processing dynamics. Our code is available at: https://github.com/EIT-NLP/VisiPruner.

[31] Visual Autoregressive Models Beat Diffusion Models on Inference Time Scaling

Erik Riise, Mehmet Onurcan Kaya, Dim P. Papadopoulos

🧩 TL;DR

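This study shows that discrete autoregressive visual models can effectively exploit beam search for inference-time optimization, enabling a 2B-parameter autoregressive model to outperform a 12B-parameter diffusion model on text-to-image generation. The result highlights the key role of model architecture in inference-time optimization.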


📘 Detailed Summary

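Motivation: Although inference-time search has revolutionized large language models, it has had limited impact on image generation. Search strategies in continuous diffusion models yield little gain, and simple random sampling often performs best. This motivates examining how well different model architectures lend themselves to inference-time optimization.

Method: The study applies beam search to discrete autoregressive visual models for text-to-image generation, exploiting the discrete token space for early pruning and computational reuse. It also systematically analyzes the trade-off between verifier speed and reasoning capability.

Result: Beam search substantially improves text-to-image generation quality. A 2B-parameter autoregressive model outperforms a 12B-parameter diffusion model across benchmarks, and systematic ablations confirm that the discrete token space is the key factor behind this advantage.

Conclusion: Model architecture, not just scale, is critical for inference-time optimization in visual generation. Discrete autoregressive models have a distinctive advantage because they support effective search, which has implications for the design of future visual generation systems.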


📄 Abstract

While inference-time scaling through search has revolutionized Large Language Models, translating these gains to image generation has proven difficult. Recent attempts to apply search strategies to continuous diffusion models show limited benefits, with simple random sampling often performing best. We demonstrate that the discrete, sequential nature of visual autoregressive models enables effective search for image generation. We show that beam search substantially improves text-to-image generation, enabling a 2B parameter autoregressive model to outperform a 12B parameter diffusion model across benchmarks. Systematic ablations show that this advantage comes from the discrete token space, which allows early pruning and computational reuse, and our verifier analysis highlights trade-offs between speed and reasoning capability. These findings suggest that model architecture, not just scale, is critical for inference-time optimization in visual generation.
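
The search procedure described above can be sketched as a generic verifier-guided beam search over discrete tokens. This is a minimal illustration under stated assumptions, not the authors' implementation: `next_token_logprobs` and `verifier` are hypothetical callables standing in for an autoregressive image-token model and a prompt-alignment scorer, and `expand` is an assumed knob for how many next tokens each beam proposes.

```python
from typing import Callable, List, Sequence, Tuple

Tokens = Tuple[int, ...]


def verifier_guided_beam_search(
    next_token_logprobs: Callable[[Tokens], Sequence[float]],
    verifier: Callable[[Tokens], float],
    length: int,
    beam_width: int,
    expand: int,
) -> Tokens:
    """Return the best-scoring token sequence of the requested length."""
    beams: List[Tokens] = [()]
    for _ in range(length):
        candidates: List[Tokens] = []
        for prefix in beams:
            logprobs = next_token_logprobs(prefix)
            # Propose only the `expand` most likely next tokens for this prefix.
            ranked = sorted(range(len(logprobs)), key=lambda t: logprobs[t], reverse=True)
            candidates.extend(prefix + (t,) for t in ranked[:expand])
        # Early pruning: score partial sequences and keep the best few.
        candidates.sort(key=verifier, reverse=True)
        beams = candidates[:beam_width]
    return beams[0]
```

Because every surviving beam is a shared discrete prefix, its computation can in principle be reused by all of its continuations, which is the kind of reuse the abstract attributes to the discrete token space. Continuous diffusion trajectories have no comparable prefix to share.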

[32] Region in Context: Text-condition Image editing with Human-like semantic reasoning

Thuy Phuong Vu, Dinh-Cuong Hoang, Minhhuy Le, Phan Xuan Tan

🧩 TL;DR

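This paper proposes Region in Context, a framework for text-conditioned image editing that uses multilevel semantic alignment so that local regions are edited in harmony with the global image context. It addresses the lack of overall semantic consistency in existing region-editing methods.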


📘 Detailed Summary

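Motivation: Existing text-conditioned image editing methods typically treat image regions as isolated units, relying solely on local cues and ignoring how each part contributes to the overall visual and semantic composition. This leads to inconsistent edits, unnatural transitions, or loss of coherence across the image.

Method: The framework introduces a dual-level guidance mechanism. Regions are represented with full-image context and aligned with detailed region-level descriptions, while the entire image is matched to a comprehensive scene-level description generated by a large vision-language model. These descriptions serve as explicit verbal references that guide both local modifications and global structure.

Result: Experiments show that the method produces more coherent and instruction-aligned edits, preserving local precision while keeping the whole image semantically consistent.

Conclusion: The study underlines the importance of global context in image editing. Multilevel semantic alignment brings local edits into agreement with the overall scene and offers a new approach to text-driven image editing.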


📄 Abstract

Recent research has made significant progress in localizing and editing image regions based on text. However, most approaches treat these regions in isolation, relying solely on local cues without accounting for how each part contributes to the overall visual and semantic composition. This often results in inconsistent edits, unnatural transitions, or loss of coherence across the image. In this work, we propose Region in Context, a novel framework for text-conditioned image editing that performs multilevel semantic alignment between vision and language, inspired by the human ability to reason about edits in relation to the whole scene. Our method encourages each region to understand its role within the global image context, enabling precise and harmonized changes. At its core, the framework introduces a dual-level guidance mechanism: regions are represented with full-image context and aligned with detailed region-level descriptions, while the entire image is simultaneously matched to a comprehensive scene-level description generated by a large vision-language model. These descriptions serve as explicit verbal references of the intended content, guiding both local modifications and global structure. Experiments show that it produces more coherent and instruction-aligned results. Code is available at: https://github.com/thuyvuphuong/Region-in-Context.git

[33] UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action

Yuhao Yang, Zhen Yang, Zi-Yi Dou, Anh Nguyen, Keen You, Omar Attia, Andrew Szot, Michael Feng, Ram Ramrakhya, Alexander Toshev, Chao Huang, Yinfei Yang, Zhe Gan

🧩 TL;DR

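UltraCUA is a foundation model with hybrid action that seamlessly integrates GUI primitives with high-level programmatic tool calls. It addresses the cascading failures and inefficient execution that arise when computer-use agents rely only on primitive actions.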


📘 Detailed Summary

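Motivation: Current computer-use agents rely exclusively on primitive actions such as click, type, and scroll, which require accurate visual grounding and lengthy execution chains and lead to cascading failures and performance bottlenecks. These agents also cannot take advantage of rich programmatic interfaces such as APIs, MCP servers, and tools.

Method: The approach has four key components: an automated pipeline that scales programmatic tools from software documentation, open-source repositories, and code generation; a synthetic data engine that produces over 17,000 verifiable tasks; a large-scale, high-quality collection of hybrid-action trajectories containing both low-level GUI actions and high-level programmatic tool calls; and a two-stage training pipeline that combines supervised fine-tuning with online reinforcement learning, enabling strategic alternation between low-level and high-level actions.

Result: On OSWorld, UltraCUA models achieve an average 22% relative improvement over base models while using 11% fewer steps. In out-of-domain evaluation on WindowsAgentArena, the model reaches a 21.7% success rate, outperforming baselines trained on Windows data.

Conclusion: The hybrid action mechanism proves critical: it reduces error propagation while maintaining execution efficiency, giving computer-use agents a more capable and flexible action space than primitive actions alone.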


📄 Abstract

Multimodal agents for computer use rely exclusively on primitive actions (click, type, scroll) that require accurate visual grounding and lengthy execution chains, leading to cascading failures and performance bottlenecks. While other agents leverage rich programmatic interfaces (APIs, MCP servers, tools), computer-use agents (CUAs) remain isolated from these capabilities. We present UltraCUA, a foundation model that bridges this gap through hybrid action -- seamlessly integrating GUI primitives with high-level programmatic tool calls. To achieve this, our approach comprises four key components: (1) an automated pipeline that scales programmatic tools from software documentation, open-source repositories, and code generation; (2) a synthetic data engine producing over 17,000 verifiable tasks spanning real-world computer-use scenarios; (3) a large-scale high-quality hybrid action trajectory collection with both low-level GUI actions and high-level programmatic tool calls; and (4) a two-stage training pipeline combining supervised fine-tuning with online reinforcement learning, enabling strategic alternation between low-level and high-level actions. Experiments with our 7B and 32B models demonstrate substantial improvements over state-of-the-art agents. On OSWorld, UltraCUA models achieve an average 22% relative improvement over base models, while being 11% faster in terms of steps. Out-of-domain evaluation on WindowsAgentArena shows our model reaches 21.7% success rate, outperforming baselines trained on Windows data. The hybrid action mechanism proves critical, reducing error propagation while maintaining execution efficiency.

[34] EMRRG: Efficient Fine-Tuning Pre-trained X-ray Mamba Networks for Radiology Report Generation

Mingzheng Zhang, Jinfeng Gao, Dan Xu, Jiangrui Yu, Yuhan Qiao, Lan Chen, Jin Tang, Xiao Wang

🧩 TL;DR

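This paper proposes EMRRG, a framework that fine-tunes pre-trained Mamba networks with parameter-efficient methods, enabling end-to-end training for X-ray medical report generation and achieving strong performance on multiple benchmark datasets.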


📘 Detailed Summary

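Motivation: Existing medical report generation models rely mainly on large language models, with limited exploration of pre-trained vision foundation models and advanced fine-tuning techniques. Mainstream frameworks either avoid fine-tuning or use simple methods such as LoRA, and they overlook the potential of non-Transformer architectures such as Mamba networks for medical report generation.

Method: EMRRG divides X-ray images into patches and tokenizes them, then extracts features with an SSM-based vision backbone, with Partial LoRA performing best. An LLM with a hybrid decoder generates the medical report, enabling end-to-end training.

Result: Extensive experiments on three widely used benchmark datasets validate the effectiveness of the proposed strategies for X-ray medical report generation, with strong results.

Conclusion: The study demonstrates the potential of Mamba networks for medical report generation, shows that parameter-efficient fine-tuning is effective, and points to new directions for research on medical vision-language tasks.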


📄 Abstract

X-ray image-based medical report generation (MRG) is a pivotal area in artificial intelligence that can significantly reduce diagnostic burdens for clinicians and patient wait times. Existing MRG models predominantly rely on Large Language Models (LLMs) to improve report generation, with limited exploration of pre-trained vision foundation models or advanced fine-tuning techniques. Mainstream frameworks either avoid fine-tuning or utilize simplistic methods like LoRA, often neglecting the potential of enhancing cross-attention mechanisms. Additionally, while Transformer-based models dominate vision-language tasks, non-Transformer architectures, such as the Mamba network, remain underexplored for medical report generation, presenting a promising avenue for future research. In this paper, we propose EMRRG, a novel X-ray report generation framework that fine-tunes pre-trained Mamba networks using parameter-efficient methods. Specifically, X-ray images are divided into patches, tokenized, and processed by an SSM-based vision backbone for feature extraction, with Partial LoRA yielding optimal performance. An LLM with a hybrid decoder generates the medical report, enabling end-to-end training and achieving strong results on benchmark datasets. Extensive experiments on three widely used benchmark datasets fully validated the effectiveness of our proposed strategies for the X-ray MRG. The source code of this paper will be released on https://github.com/Event-AHU/Medical_Image_Analysis.

[35] Glyph: Scaling Context Windows via Visual-Text Compression

Jiale Cheng, Yusen Liu, Xinyu Zhang, Yulin Fei, Wenyi Hong, Ruiliang Lyu, Weihan Wang, Zhe Su, Xiaotao Gu, Xiao Liu, Yushi Bai, Jie Tang, Hongning Wang, Minlie Huang

🧩 TL;DR

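This paper proposes Glyph, a framework that renders long text into images and processes them with a vision-language model, achieving 3-4x text compression and substantially improving long-context efficiency while maintaining accuracy. The approach sidesteps the computational bottleneck that conventional LLMs face when scaling context to the million-token level.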


📘 Detailed Summary

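Motivation: Large language models increasingly need long-context modeling for tasks such as document understanding, code analysis, and multi-step reasoning. Scaling context windows to the million-token level brings prohibitive computational and memory costs, which limits the practicality of long-context LLMs.

Method: Glyph takes the alternative route of visual context scaling: long texts are rendered into images and processed by a vision-language model. An LLM-driven genetic search optimizes the visual rendering configuration to balance accuracy and compression.

Result: Across a range of long-context benchmarks, the method achieves 3-4x token compression with accuracy comparable to leading LLMs such as Qwen3-8B, around 4x faster prefilling and decoding, and approximately 2x faster SFT training. Under extreme compression, a 128K-context VLM can scale to million-token-level text tasks.

Conclusion: Visual context scaling is an efficient alternative for long-text processing that avoids the computational bottleneck of token-based context extension. The rendered text data also benefits real-world multimodal tasks such as document understanding, opening a new path for large-scale long-context applications.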


📄 Abstract

Large language models (LLMs) increasingly rely on long-context modeling for tasks such as document understanding, code analysis, and multi-step reasoning. However, scaling context windows to the million-token level brings prohibitive computational and memory costs, limiting the practicality of long-context LLMs. In this work, we take a different perspective, visual context scaling, to tackle this challenge. Instead of extending token-based sequences, we propose Glyph, a framework that renders long texts into images and processes them with vision-language models (VLMs). This approach substantially compresses textual input while preserving semantic information, and we further design an LLM-driven genetic search to identify optimal visual rendering configurations for balancing accuracy and compression. Through extensive experiments, we demonstrate that our method achieves 3-4x token compression while maintaining accuracy comparable to leading LLMs such as Qwen3-8B on various long-context benchmarks. This compression also leads to around 4x faster prefilling and decoding, and approximately 2x faster SFT training. Furthermore, under extreme compression, a 128K-context VLM could scale to handle 1M-token-level text tasks. In addition, the rendered text data benefits real-world multimodal tasks, such as document understanding. Our code and model are released at https://github.com/thu-coai/Glyph.
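
The relation between compression ratio and usable context is simple arithmetic, sketched below. The function names are hypothetical, and treating 128K as 128,000 tokens and 1M as 1,000,000 tokens is an assumption made for round numbers. The abstract does not quantify "extreme compression"; the ratio computed here is only what the stated context sizes imply.

```python
def effective_text_tokens(vlm_context_tokens: int, compression_ratio: float) -> float:
    """Text tokens representable when each visual token stands for
    `compression_ratio` text tokens on average."""
    return vlm_context_tokens * compression_ratio


def required_compression_ratio(text_tokens: int, vlm_context_tokens: int) -> float:
    """Compression ratio needed to fit `text_tokens` into the VLM context."""
    return text_tokens / vlm_context_tokens
```

Under these assumptions, `effective_text_tokens(128_000, 4)` gives 512,000 text tokens at the upper end of the reported 3-4x range, while `required_compression_ratio(1_000_000, 128_000)` shows that reaching the million-token level needs a ratio of roughly 8x.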

[36] Segmentation as A Plug-and-Play Capability for Frozen Multimodal LLMs

Jiazhen Liu, Long Chen

🧩 TL;DR

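This paper proposes LENS, a plug-and-play solution that adds pixel-level segmentation to multimodal large language models without fine-tuning them. By extracting keypoint features from attention maps, it achieves competitive segmentation performance while preserving the model's generalization ability.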


📘 Detailed Summary

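Motivation: Current methods for adding segmentation to MLLMs usually fine-tune the model so that its outputs are compatible with a mask decoder. This alters the model's output space and harms its intrinsic generalization, which undermines the goal of building a unified model.

Method: LENS attaches a lightweight, trainable head to a completely frozen MLLM. It refines the spatial cues embedded in attention maps to extract keypoints and describes them as point-wise features that are directly compatible with the mask decoder.

Result: Extensive experiments validate the approach: LENS matches or exceeds the segmentation performance of retraining-based methods while fully preserving the MLLM's generalization ability, which fine-tuning approaches significantly degrade.

Conclusion: The attachable design of LENS establishes an efficient and powerful paradigm for extending MLLMs and paves the way for truly multi-talented unified models, while avoiding the loss of generalization caused by fine-tuning.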


📄 Abstract

Integrating diverse visual capabilities into a unified model is a significant trend in Multimodal Large Language Models (MLLMs). Among these, the inclusion of segmentation poses a distinct set of challenges. To equip MLLMs with pixel-level segmentation abilities, prevailing methods require finetuning the model to produce specific outputs compatible with a mask decoder. This process typically alters the model's output space and compromises its intrinsic generalization, which undermines the goal of building a unified model. We introduce LENS (Leveraging kEypoiNts for MLLMs' Segmentation), a novel plug-and-play solution. LENS attaches a lightweight, trainable head to a completely frozen MLLM. By refining the spatial cues embedded in attention maps, LENS extracts keypoints and describes them into point-wise features directly compatible with the mask decoder. Extensive experiments validate our approach: LENS achieves segmentation performance competitive with or superior to that of retraining-based methods. Crucially, it does so while fully preserving the MLLM's generalization capabilities, which are significantly degraded by finetuning approaches. As such, the attachable design of LENS establishes an efficient and powerful paradigm for extending MLLMs, paving the way for truly multi-talented, unified models.

[37] Personalized Image Filter: Mastering Your Photographic Style

Chengxuan Zhu, Shuchen Weng, Jiacong Fang, Peixuan Zhang, Si Li, Chao Xu, Boxin Shi

🧩 TL;DR

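This paper proposes the Personalized Image Filter (PIF), which builds on a pretrained text-to-image diffusion model and uses textual inversion to learn the photographic style of reference images, extracting and transferring a variety of photographic styles.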


📘 Detailed Summary

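Motivation: Existing methods either fail to learn meaningful photographic concepts from reference images or cannot preserve the content of the content image. Photographic style, as a composition of specific photographic concepts, is central to the appeal of renowned photographers' work.

Method: Built on a pretrained text-to-image diffusion model, PIF uses the generative prior to learn the average appearance of photographic concepts and how to adjust them according to text prompts. It then learns the photographic style of reference images through textual inversion, optimizing the prompts for the photographic concepts.

Result: PIF performs well at extracting and transferring a variety of photographic styles, learning the average appearance of photographic concepts and adjusting them according to text prompts.

Conclusion: By combining the generative capability of a pretrained diffusion model with textual inversion, the method offers an effective solution for learning and transferring photographic style.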


📄 Abstract

Photographic style, as a composition of certain photographic concepts, is the charm behind renowned photographers. But learning and transferring photographic style requires a profound understanding of how the photo is edited from the unknown original appearance. Previous works either fail to learn meaningful photographic concepts from reference images or cannot preserve the content of the content image. To tackle these issues, we propose a Personalized Image Filter (PIF). Based on a pretrained text-to-image diffusion model, the generative prior enables PIF to learn the average appearance of photographic concepts, as well as how to adjust them according to text prompts. PIF then learns the photographic style of reference images with the textual inversion technique, by optimizing the prompts for the photographic concepts. PIF shows outstanding performance in extracting and transferring various kinds of photographic style. Project page: https://pif.pages.dev/

[38] ReefNet: A Large scale, Taxonomically Enriched Dataset and Benchmark for Hard Coral Classification

Yahia Battach, Abdulwahab Felemban, Faizan Farooq Khan, Yousef A. Radwan, Xiang Li, Fabio Marchese, Sara Beery, Burton H. Jones, Francesca Benzoni, Mohamed Elhoseiny

🧩 TL;DR

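This study presents ReefNet, a large public coral reef image dataset with approximately 925,000 expert-verified genus-level hard coral annotations. It aims to address the limitations of existing datasets in scale, geographic coverage, and label granularity, and it provides a challenging domain generalization benchmark for automated coral reef monitoring.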


📘 Detailed Summary

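Motivation: Coral reefs are declining rapidly under anthropogenic pressures such as climate change, creating an urgent need for scalable, automated monitoring. Existing datasets are often limited in size, geographic coverage, or label granularity and are not ML-ready, so they cannot support fine-grained coral classification at a global scale.

Method: ReefNet aggregates imagery from 76 CoralNet sources and a site at Al Wajh in the Red Sea, with fine-grained taxonomic labels mapped to the World Register of Marine Species. Two evaluation settings are defined: a within-source benchmark that partitions each source's images for localized evaluation, and a cross-source benchmark that withholds entire sources to test domain generalization.

Result: Supervised learning performs well on the within-source benchmark but drops sharply across domains. Zero-shot models perform poorly in all settings, especially on rare and visually similar genera. Together these results make ReefNet a challenging benchmark for domain generalization research.

Conclusion: The study underlines the importance of domain generalization in coral reef monitoring and exposes the limitations of current models in cross-domain recognition and rare-taxon classification. The authors plan to release the dataset, benchmarking code, and pretrained models to advance robust, domain-adaptive global coral reef monitoring and conservation.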


📄 Abstract

Coral reefs are rapidly declining due to anthropogenic pressures such as climate change, underscoring the urgent need for scalable, automated monitoring. We introduce ReefNet, a large public coral reef image dataset with point-label annotations mapped to the World Register of Marine Species (WoRMS). ReefNet aggregates imagery from 76 curated CoralNet sources and an additional site from Al Wajh in the Red Sea, totaling approximately 925,000 genus-level hard coral annotations with expert-verified labels. Unlike prior datasets, which are often limited by size, geography, or coarse labels and are not ML-ready, ReefNet offers fine-grained labels mapped to WoRMS at a global scale. We propose two evaluation settings: (i) a within-source benchmark that partitions each source's images for localized evaluation, and (ii) a cross-source benchmark that withholds entire sources to test domain generalization. We analyze both supervised and zero-shot classification performance on ReefNet and find that while supervised within-source performance is promising, supervised performance drops sharply across domains, and performance is low across the board for zero-shot models, especially for rare and visually similar genera. This provides a challenging benchmark intended to catalyze advances in domain generalization and fine-grained coral classification. We will release our dataset, benchmarking code, and pretrained models to advance robust, domain-adaptive, global coral reef monitoring and conservation.
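
The two evaluation settings can be sketched as two ways of splitting the same sample list. This is a minimal illustration, not the ReefNet benchmark code: the `(source_id, image_id)` sample layout, the function names, and the 20% test fraction are all assumptions made for the example.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple

Sample = Tuple[str, str]  # (source_id, image_id); hypothetical identifiers


def within_source_split(
    samples: Sequence[Sample], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[Sample], List[Sample]]:
    """Partition each source's images so every source appears in train and test."""
    rng = random.Random(seed)
    by_source: Dict[str, List[Sample]] = defaultdict(list)
    for sample in samples:
        by_source[sample[0]].append(sample)
    train: List[Sample] = []
    test: List[Sample] = []
    for source in sorted(by_source):
        items = sorted(by_source[source])
        rng.shuffle(items)
        n_test = round(len(items) * test_fraction)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test


def cross_source_split(
    samples: Sequence[Sample], held_out_sources: Iterable[str]
) -> Tuple[List[Sample], List[Sample]]:
    """Withhold entire sources so the test domains are never seen in training."""
    held_out = set(held_out_sources)
    train = [s for s in samples if s[0] not in held_out]
    test = [s for s in samples if s[0] in held_out]
    return train, test
```

The design difference is the point: the first split measures performance on familiar sites, while the second removes any overlap in source between train and test, which is where the abstract reports the sharp drop for supervised models.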

[39] Uncovering Brain-Like Hierarchical Patterns in Vision-Language Models through fMRI-Based Neural Encoding

Yudan Ren, Xinlong Wang, Kexin Wang, Tian Xia, Zihan Ma, Zhaowei Li, Xiangrong Bi, Xiao Li, Xiaowei He

🧩 TL;DR

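This study proposes a neuron-level analysis framework that combines artificial neuron analysis with fMRI-based voxel encoding. It reveals similarities between vision-language models and the human brain in how they process multimodal information, providing new evidence on the correspondence between artificial neural networks and biological neural systems.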


📘 Detailed Summary

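Motivation: Current understanding of the parallels between artificial neural networks (ANNs) and human brain processing has two main limitations: unimodal ANN studies cannot capture the brain's inherently multimodal processing, and multimodal ANN studies focus mainly on high-level model outputs while neglecting the role of individual neurons.

Method: The study proposes a neuron-level analysis framework that combines fine-grained artificial neuron analysis with fMRI-based voxel encoding, and applies it systematically to two architecturally distinct vision-language models, CLIP and METER.

Result: Artificial neurons successfully predict biological neuron activity across multiple functional networks; both kinds of neurons exhibit functional redundancy; artificial neurons show polarity patterns similar to those of biological neurons; and different VLM architectures drive distinct patterns of biological neuron activation.

Conclusion: The results provide strong evidence for brain-like hierarchical processing in vision-language models and reveal representational mechanisms shared by artificial and biological neural systems at the neuron level, opening new ways to study the brain-like properties of AI systems and their correspondence with human cognition.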


📄 Abstract

While brain-inspired artificial intelligence (AI) has demonstrated promising results, current understanding of the parallels between artificial neural networks (ANNs) and human brain processing remains limited: (1) unimodal ANN studies fail to capture the brain's inherent multimodal processing capabilities, and (2) multimodal ANN research primarily focuses on high-level model outputs, neglecting the crucial role of individual neurons. To address these limitations, we propose a novel neuron-level analysis framework that investigates the multimodal information processing mechanisms in vision-language models (VLMs) through the lens of human brain activity. Our approach uniquely combines fine-grained artificial neuron (AN) analysis with fMRI-based voxel encoding to examine two architecturally distinct VLMs: CLIP and METER. Our analysis reveals four key findings: (1) ANs successfully predict the activity of biological neurons (BNs) across multiple functional networks (including language, vision, attention, and default mode), demonstrating shared representational mechanisms; (2) Both ANs and BNs demonstrate functional redundancy through overlapping neural representations, mirroring the brain's fault-tolerant and collaborative information processing mechanisms; (3) ANs exhibit polarity patterns that parallel those of BNs, with oppositely activated BNs showing mirrored activation trends across VLM layers, reflecting the complexity and bidirectional nature of neural information processing; (4) The architectures of CLIP and METER drive distinct BNs: CLIP's independent branches show modality-specific specialization, whereas METER's cross-modal design yields unified cross-modal activation, highlighting the architecture's influence on ANN brain-like properties. These results provide compelling evidence for brain-like hierarchical processing in VLMs at the neuronal level.

[40] Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback

Zongjian Li, Zheyuan Liu, Qihui Zhang, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Yang Ye, Wangbo Yu, Yuwei Niu, Li Yuan

🧩 TL;DR

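This paper proposes Edit-R1, a post-training framework based on policy optimization. Using the DiffusionNFT method and a multimodal large language model as the reward model, it substantially improves the generalization and performance of instruction-based image editing models and reaches state-of-the-art results on multiple benchmarks.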


📘 Detailed Summary

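Motivation: Instruction-based image editing models trained only with supervised fine-tuning tend to overfit to annotated patterns, which limits their ability to explore and generalize beyond the training distribution. A post-training framework is needed to overcome this limitation.

Method: The framework uses Diffusion Negative-aware Finetuning (DiffusionNFT), a policy optimization method that is consistent with the flow matching forward process and supports higher-order samplers and efficient training. It employs a multimodal large language model as a unified, training-free reward model that provides fine-grained feedback through its output logits, and it adds a low-variance group filtering mechanism to reduce scoring noise and stabilize optimization.

Result: UniWorld-V2 scores 4.49 on ImgEdit and 7.83 on GEdit-Bench, reaching state-of-the-art results. The framework is model-agnostic and delivers substantial gains on different base models such as Qwen-Image-Edit and FLUX-Kontext.

Conclusion: Through policy optimization and MLLM-based rewards, Edit-R1 addresses the generalization problem in instruction-based image editing. Its model-agnostic design indicates broad applicability as a post-training approach.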


📄 Abstract

Instruction-based image editing has achieved remarkable progress; however, models solely trained via supervised fine-tuning often overfit to annotated patterns, hindering their ability to explore and generalize beyond training distributions. To this end, we introduce Edit-R1, a novel post-training framework for instruction-based image editing based on policy optimization. Specifically, we utilize Diffusion Negative-aware Finetuning (DiffusionNFT), a likelihood-free policy optimization method consistent with the flow matching forward process, thereby enabling the use of higher-order samplers and more efficient training. Another key challenge here is the absence of a universal reward model, resulting from the diverse nature of editing instructions and tasks. To bridge this gap, we employ a Multimodal Large Language Model (MLLM) as a unified, training-free reward model, leveraging its output logits to provide fine-grained feedback. Furthermore, we carefully design a low-variance group filtering mechanism to reduce MLLM scoring noise and stabilize optimization. UniWorld-V2, trained with this framework, achieves state-of-the-art results on the ImgEdit and GEdit-Bench benchmarks, scoring 4.49 and 7.83, respectively. Crucially, our framework is model-agnostic, delivering substantial performance gains when applied to diverse base models like Qwen-Image-Edit and FLUX-Kontext, demonstrating its wide applicability. Code and models are publicly available at https://github.com/PKU-YuanGroup/UniWorld-V2.
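
The abstract does not spell out how logits become a reward or how groups are filtered, so the following is only one plausible reading, sketched under loud assumptions. It assumes the MLLM emits logits over a small set of numeric score tokens, takes the softmax-weighted expected score as the reward, and drops sample groups whose rewards barely differ, since normalizing such groups would mostly amplify scoring noise. All names and the `min_std` threshold are hypothetical.

```python
import math
from statistics import pstdev
from typing import Dict, List, Sequence


def expected_score(score_logits: Dict[int, float]) -> float:
    """Softmax-weighted mean of candidate scores (assumed reward definition)."""
    peak = max(score_logits.values())
    # Subtract the maximum logit for numerical stability.
    weights = {score: math.exp(logit - peak) for score, logit in score_logits.items()}
    total = sum(weights.values())
    return sum(score * weight for score, weight in weights.items()) / total


def filter_low_variance_groups(
    groups: Sequence[Sequence[float]], min_std: float
) -> List[Sequence[float]]:
    """Keep only groups whose reward spread is at least `min_std` (assumed rule)."""
    return [group for group in groups if pstdev(group) >= min_std]
```

Using the full logit distribution instead of the single most likely score token yields a continuous reward, which is one way to obtain the fine-grained feedback the abstract mentions.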

[41] Foundation Models in Medical Image Analysis: A Systematic Review and Meta-Analysis

Praveenbalaji Rajendran, Mojtaba Safari, Wenfeng He, Mingzhe Hu, Shansong Wang, Jun Zhou, Xiaofeng Yang

🧩 TL;DR

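This paper presents a comprehensive review of foundation models in medical image analysis. It systematically categorizes vision-only and vision-language foundation models, analyzes their architectural evolution, training strategies, and clinical applications, and proposes future research directions to accelerate clinical translation.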


📘 Detailed Summary

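Motivation: Despite the rapid growth of foundation model research in medical image analysis, the field still lacks a systematic synthesis of architectural evolution, training paradigms, and clinical applications across modalities. A unified framework is needed to consolidate existing work and guide future development.

Method: The review systematically categorizes studies into vision-only and vision-language foundation models, analyzes their architectural foundations, training strategies, and downstream clinical tasks, and conducts a quantitative meta-analysis to characterize temporal trends in dataset utilization and application domains.

Result: The systematic review and quantitative analysis reveal patterns in architectural evolution, trends in training strategies, and the distribution of clinical applications for foundation models in medical image analysis, and they identify key challenges including domain adaptation, efficient fine-tuning, and computational constraints.

Conclusion: Foundation models show great potential in medical image analysis, but challenges in domain adaptation, interpretability, and clinical integration remain. Future research should focus on improving robustness, explainability, and clinical utility to accelerate translation into real-world practice.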


📄 Abstract

Recent advancements in artificial intelligence (AI), particularly foundation models (FMs), have revolutionized medical image analysis, demonstrating strong zero- and few-shot performance across diverse medical imaging tasks, from segmentation to report generation. Unlike traditional task-specific AI models, FMs leverage large corpora of labeled and unlabeled multimodal datasets to learn generalized representations that can be adapted to various downstream clinical applications with minimal fine-tuning. However, despite the rapid proliferation of FM research in medical imaging, the field remains fragmented, lacking a unified synthesis that systematically maps the evolution of architectures, training paradigms, and clinical applications across modalities. To address this gap, this review article provides a comprehensive and structured analysis of FMs in medical image analysis. We systematically categorize studies into vision-only and vision-language FMs based on their architectural foundations, training strategies, and downstream clinical tasks. Additionally, a quantitative meta-analysis of the studies was conducted to characterize temporal trends in dataset utilization and application domains. We also critically discuss persistent challenges, including domain adaptation, efficient fine-tuning, computational constraints, and interpretability along with emerging solutions such as federated learning, knowledge distillation, and advanced prompting. Finally, we identify key future research directions aimed at enhancing the robustness, explainability, and clinical integration of FMs, thereby accelerating their translation into real-world medical practice.

[42] One-step Diffusion Models with Bregman Density Ratio Matching

Yuanzhi Zhu, Eleftherios Tsonis, Lucas Degeorge, Vicky Kalogeiton

🧩 TL;DR

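This paper proposes Di-Bregman, a framework that unifies diffusion distillation methods through Bregman divergence-based density-ratio matching. It enables efficient one-step generation and achieves better FID than reverse-KL distillation on CIFAR-10 and text-to-image generation.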


📘 Detailed Summary

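Motivation: Diffusion and flow models produce high-quality samples but are computationally expensive because of multi-step sampling. Existing distillation methods lack a unified theoretical foundation, so a more systematic distillation framework is needed to accelerate generation.

Method: Di-Bregman formulates diffusion distillation as a density-ratio matching problem based on Bregman divergence. This convex-analytic view unifies several existing objectives under a common theoretical foundation.

Result: In experiments on CIFAR-10 and text-to-image generation, Di-Bregman achieves better one-step FID than reverse-KL distillation while maintaining high visual fidelity relative to the teacher model.

Conclusion: Bregman density-ratio matching is a practical and theoretically grounded route to efficient one-step diffusion generation. It unifies multiple distillation objectives and offers a new theoretical perspective for future work on accelerating diffusion models.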


📄 Abstract

Diffusion and flow models achieve high generative quality but remain computationally expensive due to slow multi-step sampling. Distillation methods accelerate them by training fast student generators, yet most existing objectives lack a unified theoretical foundation. In this work, we propose Di-Bregman, a compact framework that formulates diffusion distillation as Bregman divergence-based density-ratio matching. This convex-analytic view connects several existing objectives through a common lens. Experiments on CIFAR-10 and text-to-image generation demonstrate that Di-Bregman achieves improved one-step FID over reverse-KL distillation and maintains high visual fidelity compared to the teacher model. Our results highlight Bregman density-ratio matching as a practical and theoretically-grounded route toward efficient one-step diffusion generation.
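
The abstract does not give the Di-Bregman objective, so the sketch below shows only the standard scalar Bregman divergence that such a formulation builds on: for a differentiable convex function f, D_f(a, b) = f(a) - f(b) - f'(b)(a - b). The function names are hypothetical. In density-ratio matching, a and b would be a target ratio and an estimated ratio at a sample point; how the paper instantiates them is not specified here.

```python
import math
from typing import Callable


def bregman(
    f: Callable[[float], float], grad_f: Callable[[float], float], a: float, b: float
) -> float:
    """Bregman divergence D_f(a, b) = f(a) - f(b) - f'(b) * (a - b)."""
    return f(a) - f(b) - grad_f(b) * (a - b)


def squared_error(a: float, b: float) -> float:
    """f(x) = x^2 recovers the squared difference (a - b)^2."""
    return bregman(lambda x: x * x, lambda x: 2.0 * x, a, b)


def generalized_kl(a: float, b: float) -> float:
    """f(x) = x log x - x recovers a * log(a / b) - a + b for a, b > 0."""
    return bregman(lambda x: x * math.log(x) - x, math.log, a, b)
```

Different choices of the convex function f yield different familiar losses from the same template, which illustrates how a Bregman view can place several objectives under a common lens.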

[43] Video Reasoning without Training

Deepak Sridhar, Kartikeya Bhardwaj, Jeya Pradha Jeyaraj, Nuno Vasconcelos, Ankita Nayak, Harris Teague

🧩 TL;DR

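This paper proposes V-Reason, which uses an entropy-based controller to tune the micro-exploration and micro-exploitation behavior of large multimodal models during video reasoning, improving reasoning performance without reinforcement learning or supervised fine-tuning. On several video reasoning datasets it approaches the accuracy of RL-trained models while greatly reducing computational overhead.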


📘 Detailed Summary

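Motivation: Video reasoning with large multimodal models (LMMs) currently relies on costly reinforcement learning and verbose chain-of-thought, leading to substantial computational overhead in both training and inference. The mechanisms for controlling the reasoning process are also very limited.

Method: Using the entropy of the model's output as a signal, the authors find that high-quality models go through a sequence of micro-explorations and micro-exploitations that keep the reasoning process grounded. Building on this, V-Reason runs a few optimization steps on a small trainable controller at inference time, using an entropy-based objective, to adapt the LMM's value cache.

Result: Across several video reasoning datasets, the method improves significantly over the base instruction-tuned models and narrows the gap with RL-trained models to within 0.6% average accuracy, while reducing output tokens by 58.6% compared to the RL model.

Conclusion: Optimizing reasoning behavior directly through an entropy signal can improve model performance without additional training data or reinforcement learning, offering both theoretical guidance and a practical method for efficient video reasoning.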


📄 Abstract

Video reasoning using Large Multimodal Models (LMMs) relies on costly reinforcement learning (RL) and verbose chain-of-thought, resulting in substantial computational overhead during both training and inference. Moreover, the mechanisms that control the thinking process in these reasoning models are very limited. In this paper, using entropy of the model's output as a signal, we discover that the high-quality models go through a series of micro-explorations and micro-exploitations which keep the reasoning process grounded (i.e., avoid excessive randomness while the model is exploring or thinking through an answer). We further observe that once this "thinking" process is over, more accurate models demonstrate a better convergence by reducing the entropy significantly via a final exploitation phase (i.e., a more certain convergence towards a solution trajectory). We then use these novel, theoretically-grounded insights to tune the model's behavior directly at inference, without using any RL or supervised fine-tuning. Specifically, during inference, our proposed approach called V-Reason (Video-Reason) adapts the value cache of the LMM via a few optimization steps on a small, trainable controller using an entropy-based objective, i.e., no supervision from any dataset or RL is necessary. This tuning improves the model's micro-exploration and exploitation behavior during inference. Our experiments show that our proposed method achieves significant improvements over the base instruction-tuned models across several video reasoning datasets, narrowing the gap with RL-trained models to within 0.6% average accuracy without any training, while offering massive efficiency benefits: output tokens are reduced by 58.6% compared to the RL model.
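
The signal the method monitors is the entropy of the model's output distribution. A minimal sketch of computing it from raw next-token logits follows. This shows only the standard Shannon entropy of a softmax distribution, not the V-Reason controller or its objective, and the function name is hypothetical.

```python
import math
from typing import Sequence


def token_entropy(logits: Sequence[float]) -> float:
    """Shannon entropy (in nats) of the softmax distribution over `logits`."""
    peak = max(logits)
    # Subtract the maximum logit before exponentiating for numerical stability.
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    probs = [value / total for value in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

High entropy corresponds to the model spreading probability over many continuations (exploration), and low entropy to commitment to one continuation (exploitation). Tracking this value per generated token gives the kind of trace on which the micro-exploration and micro-exploitation pattern described above would appear.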

[44] GOOD: Training-Free Guided Diffusion Sampling for Out-of-Distribution Detection

Xin Gao, Jiyao Liu, Guanghao Li, Yueming Lyu, Jianxiong Gao, Weichen Yu, Ningsheng Xu, Liang Wang, Caifeng Shan, Ziwei Liu, Chenyang Si

🧩 TL;DR

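This paper proposes GOOD, a framework that uses off-the-shelf in-distribution classifiers to directly guide diffusion sampling trajectories toward out-of-distribution samples. It addresses the semantic instability and limited diversity of existing methods and notably improves out-of-distribution detection performance.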


📘 Detailed Summary

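Motivation: Existing methods that generate out-of-distribution (OOD) samples with text-to-image diffusion models typically perturb text-conditioned embeddings, which causes semantic instability and insufficient shift diversity and limits generalization to realistic OOD scenarios.

Method: GOOD uses dual-level guidance. Image-level guidance, based on the gradient of the log partition, reduces input likelihood and drives samples toward low-density regions in pixel space. Feature-level guidance, based on k-NN distance in the classifier's latent space, promotes sampling in feature-sparse regions. Together they enable more controllable and diverse OOD sample generation.

Result: Thorough quantitative and qualitative analyses show that training with GOOD-generated samples notably improves OOD detection performance. The paper also introduces a unified OOD score that adaptively combines image and feature discrepancies, improving detection robustness.

Conclusion: By directly guiding diffusion sampling trajectories, GOOD provides a more effective and controllable way to generate samples for OOD detection. Its dual-level guidance is designed to deliver both semantic stability and diversity, offering a practical route to better OOD detection.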


📄 Abstract

Recent advancements have explored text-to-image diffusion models for synthesizing out-of-distribution (OOD) samples, substantially enhancing the performance of OOD detection. However, existing approaches typically rely on perturbing text-conditioned embeddings, resulting in semantic instability and insufficient shift diversity, which limit generalization to realistic OOD. To address these challenges, we propose GOOD, a novel and flexible framework that directly guides diffusion sampling trajectories towards OOD regions using off-the-shelf in-distribution (ID) classifiers. GOOD incorporates dual-level guidance: (1) Image-level guidance, based on the gradient of the log partition to reduce input likelihood, drives samples toward low-density regions in pixel space. (2) Feature-level guidance, derived from k-NN distance in the classifier's latent space, promotes sampling in feature-sparse regions. Hence, this dual-guidance design enables more controllable and diverse OOD sample generation. Additionally, we introduce a unified OOD score that adaptively combines image and feature discrepancies, enhancing detection robustness. We perform thorough quantitative and qualitative analyses to evaluate the effectiveness of GOOD, demonstrating that training with samples generated by GOOD can notably enhance OOD detection performance.
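
The feature-level signal rests on a k-NN distance in feature space. The sketch below shows that quantity on plain Python tuples. It is a minimal illustration, not GOOD's guidance term: the feature bank, the Euclidean metric, and the function name are assumptions, and a real system would use classifier embeddings and differentiate through the distance.

```python
import math
from typing import Sequence


def knn_distance(
    query: Sequence[float], feature_bank: Sequence[Sequence[float]], k: int
) -> float:
    """Euclidean distance from `query` to its k-th nearest neighbor in the bank."""
    if not 1 <= k <= len(feature_bank):
        raise ValueError("k must be between 1 and the size of the feature bank")
    distances = sorted(math.dist(query, feature) for feature in feature_bank)
    return distances[k - 1]
```

A large k-NN distance means the query lies in a sparsely populated part of the in-distribution feature space. Pushing samples toward larger values of such a score is the intuition behind promoting sampling in feature-sparse regions.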

[45] Training-free Online Video Step Grounding

Luca Zanella, Massimiliano Mancini, Yiming Wang, Alessio Tonioni, Elisa Ricci

🧩 TL;DR

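This paper proposes BaGLM, a training-free, online method for video step grounding. It exploits the zero-shot capabilities of large multimodal models and integrates information from past frames through Bayesian filtering, outperforming training-based offline methods on three datasets.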


📘 Detailed Summary

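Motivation: Standard video step grounding methods require labeled training data and process the full video offline, which makes annotation costly and rules out scenarios that need online decisions. This work explores how to perform video step grounding online and without training.

Method: The approach uses the zero-shot capabilities of large multimodal models (LMMs) to predict the step from a restricted set of frames. BaGLM then integrates information from past frames following Bayesian filtering principles, modeling step transitions through a dependency matrix extracted with large language models and an estimate of step progress.

Result: The online strategy without task-specific tuning outperforms offline and training-based models. On three datasets, BaGLM outperforms state-of-the-art training-based offline methods.

Conclusion: The zero-shot capabilities of large multimodal models can effectively address video step grounding, and the Bayesian filtering framework improves performance further, offering a new training-free approach to video understanding tasks.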


📄 Abstract

Given a task and a set of steps composing it, Video Step Grounding (VSG) aims to detect which steps are performed in a video. Standard approaches for this task require a labeled training set (e.g., with step-level annotations or narrations), which may be costly to collect. Moreover, they process the full video offline, limiting their applications for scenarios requiring online decisions. Thus, in this work, we explore how to perform VSG online and without training. We achieve this by exploiting the zero-shot capabilities of recent Large Multimodal Models (LMMs). In particular, we use LMMs to predict the step associated with a restricted set of frames, without access to the whole video. We show that this online strategy without task-specific tuning outperforms offline and training-based models. Motivated by this finding, we develop Bayesian Grounding with Large Multimodal Models (BaGLM), further injecting knowledge of past frames into the LMM-based predictions. BaGLM exploits Bayesian filtering principles, modeling step transitions via (i) a dependency matrix extracted through large language models and (ii) an estimation of step progress. Experiments on three datasets show superior performance of BaGLM over state-of-the-art training-based offline methods.
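
The Bayesian filtering principle referred to above can be sketched as a generic discrete predict-update step. This is a textbook filter under stated assumptions, not the BaGLM implementation: `transition` is a hypothetical row-stochastic matrix standing in for whatever the dependency matrix and step-progress estimate produce, and `likelihood` stands in for per-step scores from the LMM on the current frames.

```python
from typing import List, Sequence


def bayes_filter_step(
    prior: Sequence[float],
    transition: Sequence[Sequence[float]],
    likelihood: Sequence[float],
) -> List[float]:
    """One predict-update step over a discrete set of steps.

    prior[i]         belief that step i was active at the previous time
    transition[i][j] probability of moving from step i to step j
    likelihood[j]    score of the current observation under step j
    """
    n = len(prior)
    # Predict: propagate the previous belief through the transition model.
    predicted = [sum(prior[i] * transition[i][j] for i in range(n)) for j in range(n)]
    # Update: reweight by the observation likelihood and renormalize.
    unnormalized = [predicted[j] * likelihood[j] for j in range(n)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero likelihood under every step")
    return [value / total for value in unnormalized]
```

Calling this once per incoming frame window keeps a running belief over steps, so the prediction for the current frames is informed by the past without access to the whole video, which matches the online setting.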

[46] GACO-CAD: Geometry-Augmented and Conciseness-Optimized CAD Model Generation from Single Image

Yinghui Wang, Xinyu Zhang, Peng Du

🧩 TL;DR

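This paper proposes GACO-CAD, a two-stage post-training framework that uses depth and surface normal maps as geometric priors and introduces a group length reward, improving both the geometric accuracy and the modeling conciseness of editable parametric CAD models generated from a single image.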


📘 Detailed Summary

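Motivation: Current multimodal large language models struggle to infer 3D geometry accurately from 2D images because of limited spatial reasoning, which hinders the practical use of single-image generation of editable parametric CAD models in industrial concept design.

Method: The framework has two post-training stages. During supervised fine-tuning, depth and surface normal maps serve as dense geometric priors and are combined with the RGB image into a multi-channel input. During reinforcement learning, a group length reward promotes more compact modeling sequences while preserving high geometric fidelity, and a dynamic weighting strategy stabilizes training.

Result: Experiments on the DeepCAD and Fusion360 datasets show that GACO-CAD achieves state-of-the-art performance under the same MLLM backbone, consistently outperforming existing methods in code validity, geometric accuracy, and modeling conciseness.

Conclusion: Jointly optimizing for geometric priors and a conciseness reward improves the quality of generated CAD models, offering a more reliable approach to industrial design automation and showing the potential of multimodal learning for geometric reasoning tasks.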


📄 Abstract

Generating editable, parametric CAD models from a single image holds great potential to lower the barriers of industrial concept design. However, current multi-modal large language models (MLLMs) still struggle with accurately inferring 3D geometry from 2D images due to limited spatial reasoning capabilities. We address this limitation by introducing GACO-CAD, a novel two-stage post-training framework. It is designed to achieve a joint objective: simultaneously improving the geometric accuracy of the generated CAD models and encouraging the use of more concise modeling procedures. First, during supervised fine-tuning, we leverage depth and surface normal maps as dense geometric priors, combining them with the RGB image to form a multi-channel input. In the context of single-view reconstruction, these priors provide complementary spatial cues that help the MLLM more reliably recover 3D geometry from 2D observations. Second, during reinforcement learning, we introduce a group length reward that, while preserving high geometric fidelity, promotes the generation of more compact and less redundant parametric modeling sequences. A simple dynamic weighting strategy is adopted to stabilize training. Experiments on the DeepCAD and Fusion360 datasets show that GACO-CAD achieves state-of-the-art performance under the same MLLM backbone, consistently outperforming existing methods in terms of code validity, geometric accuracy, and modeling conciseness.

[47] An empirical study of the effect of video encoders on Temporal Video Grounding

Ignacio M. De la Jara, Cristian Rodriguez-Opazo, Edison Marrese-Taylor, Felipe Bravo-Marquez

🧩 TL;DR

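This paper empirically studies how different video features affect temporal video grounding, finding that merely changing the video encoder in a classical architecture leads to significant performance differences, and reveals potential complementarity between features.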


📘 Detailed Summary

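Motivation: Current temporal video grounding research focuses on a small number of video representations, which may lead to architectural overfitting in the long run, so a systematic study of how different video features affect model performance is needed.

Method: The study extracts features with video encoders based on CNNs, temporal reasoning, and Transformers, and runs experiments on three benchmarks (Charades-STA, ActivityNet-Captions, and YouCookII), using a classical architecture to assess the impact of different video features.

Result: Experiments show that changing only the video encoder produces significant differences in model performance, and also reveal clear patterns and errors that arise from using particular features, indicating potential complementarity between features.

Conclusion: Different video features have a significant impact on temporal video grounding; feature choice should not be limited to a few representations, and combining multiple features may yield better performance, which offers design guidance for future research.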


📄 Abstract

Temporal video grounding is a fundamental task in computer vision, aiming to localize a natural language query in a long, untrimmed video. It has a key role in the scientific community, in part due to the large amount of video generated every day. Although we find extensive work in this task, we note that research remains focused on a small selection of video representations, which may lead to architectural overfitting in the long run. To address this issue, we propose an empirical study to investigate the impact of different video features on a classical architecture. We extract features for three well-known benchmarks, Charades-STA, ActivityNet-Captions and YouCookII, using video encoders based on CNNs, temporal reasoning and transformers. Our results show significant differences in the performance of our model by simply changing the video encoder, while also revealing clear patterns and errors derived from the use of certain features, ultimately indicating potential feature complementarity.

[48] ZSPAPrune: Zero-Shot Prompt-Aware Token Pruning for Vision-Language Models

Pu Zhang, Yuwei Li, Xingyuan Xian, Guoming Tang

🧩 TL;DR

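This paper proposes a zero-shot visual token pruning method that introduces a prompt-aware perspective to balance task relevance and information diversity, maintaining performance while pruning up to 90% of visual tokens and significantly reducing inference cost.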


📘 Detailed Summary

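Motivation: As vision-language models (VLMs) become more capable, processing large inputs produces significant visual token redundancy and leads to prohibitive inference costs. Existing attention-based or diversity-based pruning methods typically ignore the guidance of the text prompt and therefore fail to prioritize task relevance.

Method: A hierarchical zero-shot method that first selects a core set of task-relevant visual tokens and then supplements them with diversity tokens to preserve broader context, explicitly modeling visual token pruning as a balance between task relevance and information diversity.

Result: Experiments across multiple models and benchmarks show that, even when pruning up to 90% of tokens, the method matches or surpasses the state of the art with only minimal accuracy loss, while significantly reducing GPU memory footprint and inference latency.

Conclusion: The study demonstrates the effectiveness of prompt-aware visual token pruning, offers a new approach to reducing VLM inference cost, and highlights the importance of balancing task relevance and information diversity during pruning.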


📄 Abstract

As the capabilities of Vision-Language Models (VLMs) advance, they can process increasingly large inputs, which, unlike in LLMs, generates significant visual token redundancy and leads to prohibitive inference costs. While many methods aim to reduce these costs by pruning visual tokens, existing approaches, whether based on attention or diversity, typically neglect the guidance of the text prompt and thus fail to prioritize task relevance. In this work, we propose a novel, zero-shot method that reframes the problem by introducing a prompt-aware perspective, explicitly modeling visual token pruning as a balance between task relevance and information diversity. Our hierarchical approach first selects a core set of task-relevant visual tokens and then supplements them with diversity tokens to preserve broader context. Experiments across multiple models and benchmarks show that our method achieves performance that matches or surpasses the state-of-the-art with only minimal accuracy loss, even when pruning up to 90% of the tokens. Furthermore, these gains are accompanied by significant reductions in GPU memory footprint and inference latency.
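The two-level selection (task-relevant core tokens first, diversity tokens second) can be sketched as follows. This is an illustrative sketch only: the abstract does not specify the scoring functions, so cosine similarity to a single prompt embedding and a greedy "least similar to anything already kept" rule are assumptions, and `prune_tokens`, `n_core`, and `n_diverse` are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def prune_tokens(tokens, prompt, n_core, n_diverse):
    """Return sorted indices of kept tokens: relevant core plus diversity picks."""
    # Stage 1: rank tokens by similarity to the prompt and keep the top n_core.
    order = sorted(range(len(tokens)), key=lambda i: cosine(tokens[i], prompt), reverse=True)
    kept = order[:n_core]
    remaining = order[n_core:]
    # Stage 2: greedily add the token least similar to anything already kept.
    for _ in range(min(n_diverse, len(remaining))):
        best = min(
            remaining,
            key=lambda i: max((cosine(tokens[i], tokens[k]) for k in kept), default=0.0),
        )
        kept.append(best)
        remaining.remove(best)
    return sorted(kept)


# Toy 2-D "visual token" embeddings and a prompt embedding (illustrative values).
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
prompt = [1.0, 0.0]
kept = prune_tokens(tokens, prompt, n_core=2, n_diverse=1)
```

Here tokens 0 and 1 form the prompt-relevant core, and token 4 is added as the diversity pick because it points away from both of them.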

[49] Enrich and Detect: Video Temporal Grounding with Multimodal LLMs

Shraman Pramanick, Effrosyni Mavroudi, Yale Song, Rama Chellappa, Lorenzo Torresani, Triantafyllos Afouras

🧩 TL;DR

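This paper proposes ED-VTG, a method that uses multimodal large language models for fine-grained video temporal grounding. A two-stage process turns natural language queries into detail-rich enriched sentences and uses a lightweight decoder for precise boundary prediction, achieving state-of-the-art performance on multiple benchmarks.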


📘 Detailed Summary

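Motivation: Existing video temporal grounding methods face missing details and hallucination noise when handling natural language queries, and need to exploit multimodal information more effectively to improve grounding accuracy and robustness.

Method: A two-stage pipeline: the language query is first transformed into an enriched sentence that incorporates missing details; a lightweight decoder then predicts accurate boundaries conditioned on contextualized representations of the enriched query. A multiple-instance-learning objective dynamically selects the optimal version of the query to mitigate noise.

Result: The method achieves state-of-the-art performance on multiple video temporal grounding and paragraph grounding benchmarks, significantly outperforms all previously proposed LLM-based temporal grounding approaches, and maintains a clear advantage over specialized models in zero-shot evaluation.

Conclusion: The method demonstrates the effectiveness of multimodal LLMs for video temporal grounding; query enrichment and dynamic selection effectively mitigate hallucinations, providing a new technical path for fine-grained video understanding.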


📄 Abstract

We introduce ED-VTG, a method for fine-grained video temporal grounding utilizing multi-modal large language models. Our approach harnesses the capabilities of multimodal LLMs to jointly process text and video, in order to effectively localize natural language queries in videos through a two-stage process. Rather than being directly grounded, language queries are initially transformed into enriched sentences that incorporate missing details and cues to aid in grounding. In the second stage, these enriched queries are grounded using a lightweight decoder that specializes in predicting accurate boundaries conditioned on contextualized representations of the enriched queries. To mitigate noise and reduce the impact of hallucinations, our model is trained with a multiple-instance-learning objective that dynamically selects the optimal version of the query for each training sample. We demonstrate state-of-the-art results across various benchmarks in temporal video grounding and paragraph grounding settings. Experiments reveal that our method significantly outperforms all previously proposed LLM-based temporal grounding approaches and is either superior or comparable to specialized models, while maintaining a clear advantage against them in zero-shot evaluation scenarios.

[50] When One Moment Isn't Enough: Multi-Moment Retrieval with Cross-Moment Interactions

Zhuo Cao, Heming Du, Bingqing Zhang, Xin Yu, Xue Li, Sen Wang

🧩 TL;DR

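This work introduces the QV-M² multi-moment retrieval dataset and the FlashMMR framework, addressing the limitation of existing video moment retrieval methods to single-moment retrieval, and improves G-mAP on QV-M² by 3.00% over the prior state-of-the-art method.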


📘 Detailed Summary

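Motivation: Existing moment retrieval methods focus on single-moment retrieval, but in real-world applications one query may correspond to multiple relevant moments, which makes existing datasets and methods insufficient for video temporal grounding.

Method: The FlashMMR framework includes a multi-moment post-verification module that refines moment boundaries: constrained temporal adjustment is followed by a verification module that re-evaluates candidate segments, and this filtering pipeline prunes low-confidence proposals to achieve robust multi-moment alignment.

Result: On QV-M², FlashMMR improves over the prior state-of-the-art method by 3.00% on G-mAP, 2.70% on mAP@3+tgt, and 2.56% on mR@3. QV-M² consists of 2,212 annotations covering 6,384 video segments.

Conclusion: The QV-M² dataset and the FlashMMR method lay a foundation for research on more realistic and challenging video temporal grounding scenarios; the benchmark and method provide an effective platform for training and evaluation in the MMR setting.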


📄 Abstract

Existing moment retrieval (MR) methods focus on Single-Moment Retrieval (SMR). However, one query can correspond to multiple relevant moments in real-world applications. This makes the existing datasets and methods insufficient for video temporal grounding. By revisiting the gap between current MR tasks and real-world applications, we introduce a high-quality dataset called QVHighlights Multi-Moment Dataset (QV-M²), along with new evaluation metrics tailored for multi-moment retrieval (MMR). QV-M² consists of 2,212 annotations covering 6,384 video segments. Building on existing efforts in MMR, we propose a framework called FlashMMR. Specifically, we propose a Multi-moment Post-verification module to refine the moment boundaries. We introduce constrained temporal adjustment and subsequently leverage a verification module to re-evaluate the candidate segments. Through this sophisticated filtering pipeline, low-confidence proposals are pruned, and robust multi-moment alignment is achieved. We retrain and evaluate 6 existing MR methods on QV-M² and QVHighlights under both SMR and MMR settings. Results show that QV-M² serves as an effective benchmark for training and evaluating MMR models, while FlashMMR provides a strong baseline. Specifically, on QV-M², it achieves improvements over the prior SOTA method of 3.00% on G-mAP, 2.70% on mAP@3+tgt, and 2.56% on mR@3. The proposed benchmark and method establish a foundation for advancing research in more realistic and challenging video temporal grounding scenarios. Code is released at https://github.com/Zhuo-Cao/QV-M2.
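To make "pruning low-confidence proposals" concrete in a multi-moment setting, the sketch below filters candidate moments by confidence and then removes near-duplicates using temporal IoU. It is a generic post-processing illustration, not the FlashMMR post-verification module: the thresholds, the greedy overlap suppression, and the names `temporal_iou` and `filter_moments` are assumptions.

```python
def temporal_iou(a, b):
    """Intersection over union of two (start, end, ...) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def filter_moments(proposals, min_conf=0.5, max_iou=0.5):
    """Keep confident, mutually distinct moments from (start, end, confidence) tuples."""
    kept = []
    # Visit proposals from most to least confident.
    for p in sorted(proposals, key=lambda p: p[2], reverse=True):
        if p[2] < min_conf:
            continue  # prune low-confidence proposals
        if all(temporal_iou(p, k) <= max_iou for k in kept):
            kept.append(p)  # keep only moments that do not duplicate a kept one
    return sorted(kept)  # return in temporal order


# Hypothetical proposals for one query: two overlapping, one distinct, one weak.
proposals = [(0.0, 10.0, 0.9), (1.0, 11.0, 0.8), (30.0, 40.0, 0.7), (50.0, 60.0, 0.2)]
moments = filter_moments(proposals)
```

Unlike single-moment retrieval, which would return only the top-scoring interval, this keeps every confident, non-overlapping moment, so a query with several relevant moments yields several results.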

[51] Where, Not What: Compelling Video LLMs to Learn Geometric Causality for 3D-Grounding

Yutong Zhong

🧩 TL;DR

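This paper proposes the W2R2 training framework, which addresses 2D semantic bias in multimodal 3D grounding through disentangled representation learning and targeted shortcut suppression, significantly improving 3D grounding accuracy without modifying the inference architecture.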


📘 Detailed Summary

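Motivation: Multimodal 3D grounding models suffer from a severe 2D semantic bias: they over-rely on 2D image features for coarse localization and largely disregard 3D geometric inputs, resulting in suboptimal fusion performance.

Method: The What-Where Representation Re-Forming framework designates 2D features as semantic beacons for object identification and 3D features as spatial anchors for localization. It uses a dual-objective loss comprising an alignment loss that supervises fused predictions and a margin-based pseudo-label loss that penalizes 2D-dominant pseudo-outputs.

Result: Experiments on ScanRefer and ScanQA show that W2R2 yields significant gains in localization accuracy and robustness, particularly in cluttered outdoor scenes.

Conclusion: The study shows that disentangled representation learning and targeted shortcut suppression can effectively mitigate 2D semantic bias, offering a new training paradigm for multimodal 3D grounding with practical application value.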


📄 Abstract

Multimodal 3D grounding has garnered considerable interest in Vision-Language Models (VLMs) \cite{yin2025spatial} for advancing spatial reasoning in complex environments. However, these models suffer from a severe "2D semantic bias" that arises from over-reliance on 2D image features for coarse localization, largely disregarding 3D geometric inputs and resulting in suboptimal fusion performance. In this paper, we propose a novel training framework called What-Where Representation Re-Forming (W2R2) to tackle this issue via disentangled representation learning and targeted shortcut suppression. Our approach fundamentally reshapes the model's internal space by designating 2D features as semantic beacons for "What" identification and 3D features as spatial anchors for "Where" localization, enabling precise 3D grounding without modifying inference architecture. Key components include a dual-objective loss function with an Alignment Loss that supervises fused predictions using adapted cross-entropy for multimodal synergy, and a Pseudo-Label Loss that penalizes overly effective 2D-dominant pseudo-outputs via a margin-based mechanism. Experiments conducted on ScanRefer and ScanQA demonstrate the effectiveness of W2R2, with significant gains in localization accuracy and robustness, particularly in cluttered outdoor scenes.

[52] FineVision: Open Data Is All You Need

Luis Wiedmann, Orr Zohar, Amir Mahla, Xiaohan Wang, Rui Li, Thibaud Frere, Leandro von Werra, Aritra Roy Gosthipaty, Andrés Marafioti

🧩 TL;DR

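This paper introduces FineVision, a meticulously collected, curated, and unified large-scale vision-language dataset of 24 million samples, the largest open resource of its kind. The dataset unifies more than 200 sources through a semi-automated, human-in-the-loop pipeline with rigorous de-duplication and decontamination, and models trained on it clearly outperform models trained on existing open mixtures across a broad evaluation suite.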


📘 Detailed Summary

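Motivation: Progress in vision-language models (VLMs) is hampered by fragmented, inconsistent, and contaminated public datasets that lack unified standards and rigorous quality control, limiting further performance gains and reliable evaluation.

Method: A semi-automated, human-in-the-loop pipeline: automation handles bulk ingestion and schema mapping, while reviewers audit the mappings and spot-check outputs to verify faithful conversion of annotations, appropriate formatting, diversity, and safety. The pipeline also applies rigorous de-duplication within and across sources and decontamination against 66 public benchmarks.

Result: Models trained on FineVision consistently outperform those trained on existing open mixtures across a broad evaluation suite, demonstrating the combined benefits of scale, data hygiene, and balanced automation with human oversight.

Conclusion: The study underscores the importance of large-scale, high-quality datasets for VLM performance and shows the effectiveness of a semi-automated, human-in-the-loop pipeline for dataset construction; the released corpus and curation tools are intended to accelerate data-centric VLM research.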


📄 Abstract

The advancement of vision-language models (VLMs) is hampered by a fragmented landscape of inconsistent and contaminated public datasets. We introduce FineVision, a meticulously collected, curated, and unified corpus of 24 million samples - the largest open resource of its kind. We unify more than 200 sources into 185 subsets via a semi-automated, human-in-the-loop pipeline: automation performs bulk ingestion and schema mapping, while reviewers audit mappings and spot-check outputs to verify faithful consumption of annotations, appropriate formatting and diversity, and safety; issues trigger targeted fixes and re-runs. The workflow further applies rigorous de-duplication within and across sources and decontamination against 66 public benchmarks. FineVision also encompasses agentic/GUI tasks with a unified action space; reviewers validate schemas and inspect a sample of trajectories to confirm executable fidelity. Models trained on FineVision consistently outperform those trained on existing open mixtures across a broad evaluation suite, underscoring the benefits of scale, data hygiene, and balanced automation with human oversight. We release the corpus and curation tools to accelerate data-centric VLM research.
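De-duplication and benchmark decontamination can be pictured with a small fingerprinting sketch. The abstract does not describe the matching technique, so exact matching on whitespace- and case-normalized text is an assumption made here for illustration (a real pipeline might instead compare images or use near-duplicate detection); `fingerprint` and `dedup_and_decontaminate` are hypothetical names.

```python
import hashlib


def fingerprint(text):
    """Hash of the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedup_and_decontaminate(samples, benchmark_items):
    """Drop samples that repeat an earlier sample or match a benchmark item."""
    blocked = {fingerprint(item) for item in benchmark_items}
    seen = set()
    kept = []
    for sample in samples:
        fp = fingerprint(sample)
        if fp in blocked or fp in seen:
            continue  # contaminated or duplicate
        seen.add(fp)
        kept.append(sample)
    return kept


# Illustrative data: one near-verbatim duplicate and one benchmark overlap.
samples = ["What is shown?", "what  is shown?", "Count the cars.", "Name the animal."]
benchmark = ["Count the cars."]
clean = dedup_and_decontaminate(samples, benchmark)
```

Keeping a single `seen` set across all sources is what makes the de-duplication work both within and across sources in this sketch.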

[53] How Universal Are SAM2 Features?

Masoud Khairi Atani, Alon Harell, Hyomin Choi, Runyu Yang, Fabien Racape, Ivan V. Bajic

🧩 TL;DR

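This study compares the feature versatility of the general-purpose vision foundation model Hiera with the segmentation-specialized SAM2, quantifying the information-theoretic cost of specialization and revealing the trade-off between a specialized model's advantage on spatially related tasks and its loss of semantic information.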


📘 Detailed Summary

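Motivation: The trade-off between general-purpose vision foundation models and their specialized counterparts is not yet fully understood, although it is critical for efficient feature coding design; the information-theoretic cost of specialization and the resulting loss of feature versatility need to be quantified.

Method: A lightweight, trainable neck probes the adaptability of frozen features, an information-theoretic approach quantifies the cost of specialization, and a cross-neck analysis on SAM2 reveals the representational bottleneck introduced at each level of adaptation.

Result: SAM2 performs well on spatially related tasks such as depth estimation but falls clearly behind the generalist Hiera on conceptually distant tasks such as pose estimation and image captioning, indicating a loss of broader semantic information in the specialized encoder; each level of adaptation creates a further representational bottleneck.

Conclusion: The study clarifies the quantitative trade-off between feature universality and specialization, providing a foundation for designing efficient feature coding and adaptation strategies for diverse downstream applications and underscoring the importance of balancing specialization and generality in model design.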


📄 Abstract

The trade-off between general-purpose foundation vision models and their specialized counterparts is critical for efficient feature coding design and is not yet fully understood. We investigate this trade-off by comparing the feature versatility of the general-purpose Hiera encoder against the segmentation-specialized Segment Anything Model 2 (SAM2). Using a lightweight, trainable neck to probe the adaptability of their frozen features, we quantify the information-theoretic cost of specialization. Our results reveal that while SAM2's specialization is highly effective for spatially-related tasks like depth estimation, it comes at a cost. The specialized SAM2 encoder underperforms its generalist predecessor, Hiera, on conceptually distant tasks such as pose estimation and image captioning, demonstrating a measurable loss of broader semantic information. A novel cross-neck analysis on SAM2 reveals that each level of adaptation creates a further representational bottleneck. Our analysis illuminates these trade-offs in feature universality, providing a quantitative foundation for designing efficient feature coding and adaptation strategies for diverse downstream applications.

[54] Towards a Generalizable Fusion Architecture for Multimodal Object Detection

Jad Berjawi, Yoann Dupas, Christophe Cérin

🧩 TL;DR

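This paper proposes FMCAF (Filtered Multi-Modal Cross Attention Fusion), an architecture that enhances the fusion of RGB and infrared image features through frequency-domain filtering and cross-attention, improving performance across multiple multimodal object detection tasks without dataset-specific tuning.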


📘 Detailed Summary

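Motivation: Multimodal object detection improves robustness in challenging conditions by fusing complementary information from multiple sensor modalities, but existing methods are often tailored to specific datasets and lack generality. This paper aims to develop a general preprocessing architecture that effectively fuses RGB and infrared inputs and improves multimodal detection performance without dataset-specific tuning.

Method: FMCAF has two core components: a frequency-domain filtering block (Freq-Filter) that suppresses redundant spectral features, and a cross-attention-based fusion module (MCAF) that improves intermodal feature sharing. The architecture is designed as a preprocessing module that can be combined with existing detectors and focuses on improving the quality and complementarity of input features.

Result: On LLVIP (low-light pedestrian detection) and VEDAI (aerial vehicle detection), FMCAF outperforms traditional concatenation fusion, achieving +13.9% mAP@50 on VEDAI and +1.1% on LLVIP. These results support FMCAF's effectiveness across different multimodal challenges.

Conclusion: FMCAF shows potential as a general foundation for multimodal fusion that adapts to different detection tasks without dataset-specific tuning. It provides a flexible basis for more robust multimodal detection systems and could be extended to other sensor modality combinations in future work.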


📄 Abstract

Multimodal object detection improves robustness in challenging conditions by leveraging complementary cues from multiple sensor modalities. We introduce Filtered Multi-Modal Cross Attention Fusion (FMCAF), a preprocessing architecture designed to enhance the fusion of RGB and infrared (IR) inputs. FMCAF combines a frequency-domain filtering block (Freq-Filter) to suppress redundant spectral features with a cross-attention-based fusion module (MCAF) to improve intermodal feature sharing. Unlike approaches tailored to specific datasets, FMCAF aims for generalizability, improving performance across different multimodal challenges without requiring dataset-specific tuning. On LLVIP (low-light pedestrian detection) and VEDAI (aerial vehicle detection), FMCAF outperforms traditional fusion (concatenation), achieving +13.9% mAP@50 on VEDAI and +1.1% on LLVIP. These results support the potential of FMCAF as a flexible foundation for robust multimodal fusion in future detection pipelines.

[55] Towards Imperceptible Watermarking Via Environment Illumination for Consumer Cameras

Hodaka Kawachi, Tomoya Nakamura, Hiroaki Santo, SaiKiran Kumar Tedla, Trevor Dalton Canham, Yasushi Yagi, Michael S. Brown

🧩 TL;DR

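This paper proposes a method that uses LED environmental lighting to produce visually imperceptible watermarks for consumer cameras. It optimizes the LED light source's spectral profile to be nearly invisible to the human eye yet highly detectable by cameras, enabling watermark extraction at standard frame rates.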


📘 Detailed Summary

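Motivation: There is a need for a technique that embeds imperceptible watermarks in video captured by consumer cameras, to support applications such as privacy protection and content verification; conventional visible light communication methods typically require high frame rates and may be noticeable to the human eye.

Method: The method jointly considers the human visual system's sensitivity to visible spectra, the spectral sensitivity of modern consumer camera sensors, and the ability of narrowband LEDs to generate broadband spectra. It uses spectral modulation rather than intensity modulation to ensure imperceptibility, and optimizes the LED spectral profile so that it is perceived as "white light" (D65 illumination).

Result: The method enables watermark extraction at standard low frame rates (30-60 fps). The information transfer rate is modest, embedding 128 bits in a 10-second video clip, a capacity sufficient for essential metadata.

Conclusion: The study demonstrates the feasibility of visually imperceptible watermarking via environmental illumination, offering a practical solution for privacy protection and content verification while remaining compatible with existing consumer cameras.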


📄 Abstract

This paper introduces a method for using LED-based environmental lighting to produce visually imperceptible watermarks for consumer cameras. Our approach optimizes an LED light source's spectral profile to be minimally visible to the human eye while remaining highly detectable by typical consumer cameras. The method jointly considers the human visual system's sensitivity to visible spectra, modern consumer camera sensors' spectral sensitivity, and narrowband LEDs' ability to generate broadband spectra perceived as "white light" (specifically, D65 illumination). To ensure imperceptibility, we employ spectral modulation rather than intensity modulation. Unlike conventional visible light communication, our approach enables watermark extraction at standard low frame rates (30-60 fps). While the information transfer rate is modest (embedding 128 bits within a 10-second video clip), this capacity is sufficient for essential metadata supporting privacy protection and content verification.
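The stated capacity translates into a data rate and a per-bit frame budget with simple arithmetic. The sketch below only restates the abstract's figures (128 bits, 10 seconds, 30-60 fps); the function name is hypothetical, and nothing here models how the watermark is actually encoded or decoded.

```python
def watermark_budget(payload_bits, clip_seconds, fps):
    """Return (bits per second, frames available per payload bit) for a clip."""
    bits_per_second = payload_bits / clip_seconds
    frames_per_bit = (clip_seconds * fps) / payload_bits
    return bits_per_second, frames_per_bit


for fps in (30, 60):
    rate, frames_per_bit = watermark_budget(128, 10, fps)
    print(f"{fps} fps: {rate} bit/s, {frames_per_bit:.2f} frames per bit")
```

At 12.8 bit/s, each payload bit is spread over roughly 2.3 frames at 30 fps and 4.7 frames at 60 fps, which is consistent with the abstract's description of the rate as modest.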

[56] Context-Aware Pseudo-Label Scoring for Zero-Shot Video Summarization

Yuanli Wu, Long Zhang, Yue Du, Bin Li

🧩 TL;DR

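This paper proposes a rubric-guided, pseudo-labeled prompting framework that converts a small set of ground-truth annotations into high-confidence pseudo labels and aggregates them into structured scoring rubrics, enabling zero-shot video summarization without parameter tuning. On SumMe and TVSum, the method surpasses unsupervised and prior zero-shot baselines and approaches supervised performance.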


📘 Detailed Summary

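Motivation: Supervised methods rely on dense annotations, which brings high labeling costs and limited cross-dataset generalization; unsupervised methods struggle to capture high-level semantics and fine-grained narrative cues; and zero-shot prompting methods are highly sensitive to handcrafted prompt templates and dataset-specific score normalization. This paper aims to overcome these limitations with a stable, interpretable zero-shot video summarization method.

Method: A rubric-guided, pseudo-labeled prompting framework converts a small subset of ground-truth annotations into high-confidence pseudo labels and aggregates them into structured, dataset-adaptive scoring rubrics. During inference, the first and last segments are scored from their descriptions alone, while intermediate segments also use contextual summaries of adjacent scenes to assess narrative progression and redundancy; this contextual prompting lets the LLM balance local salience and global coherence without parameter tuning.

Result: The method achieves F1 scores of 57.58 on SumMe and 63.05 on TVSum, surpassing unsupervised methods and prior zero-shot baselines and approaching supervised performance. The results indicate that the method effectively stabilizes LLM-based scoring.

Conclusion: Rubric-guided pseudo labeling effectively stabilizes LLM-based scoring and establishes a general, interpretable zero-shot paradigm for video summarization. The method shows how structured rubrics and contextual prompting can deliver high-quality, training-free video summarization, and suggests directions for future zero-shot video understanding research.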


📄 Abstract

With the rapid proliferation of video content across social media, surveillance, and education platforms, efficiently summarizing long videos into concise yet semantically faithful surrogates has become increasingly vital. Existing supervised methods achieve strong in-domain accuracy by learning from dense annotations but suffer from high labeling costs and limited cross-dataset generalization, while unsupervised approaches, though label-free, often fail to capture high-level human semantics and fine-grained narrative cues. More recently, zero-shot prompting pipelines have leveraged large language models (LLMs) for training-free video summarization, yet remain highly sensitive to handcrafted prompt templates and dataset-specific score normalization. To overcome these limitations, we introduce a rubric-guided, pseudo-labeled prompting framework that transforms a small subset of ground-truth annotations into high-confidence pseudo labels, which are aggregated into structured, dataset-adaptive scoring rubrics guiding interpretable scene evaluation. During inference, first and last segments are scored based solely on their descriptions, whereas intermediate ones incorporate brief contextual summaries of adjacent scenes to assess narrative progression and redundancy. This contextual prompting enables the LLM to balance local salience and global coherence without parameter tuning. On SumMe and TVSum, our method achieves F1 scores of 57.58 and 63.05, surpassing unsupervised and prior zero-shot baselines while approaching supervised performance. The results demonstrate that rubric-guided pseudo labeling effectively stabilizes LLM-based scoring and establishes a general, interpretable zero-shot paradigm for video summarization.
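The inference-time rule (boundary segments scored from their own description, intermediate segments scored with neighbor context) can be sketched as a prompt builder. This is a hypothetical illustration: the prompt wording, the rubric text, and the function `build_prompts` are assumptions, and raw neighbor descriptions stand in for the brief contextual summaries mentioned in the abstract.

```python
def build_prompts(descriptions, rubric):
    """Build one scoring prompt per segment, adding neighbor context in the middle."""
    prompts = []
    last = len(descriptions) - 1
    for i, description in enumerate(descriptions):
        parts = [f"Rubric: {rubric}", f"Scene: {description}"]
        # Only intermediate segments receive context from adjacent scenes.
        if 0 < i < last:
            parts.append(f"Previous scene: {descriptions[i - 1]}")
            parts.append(f"Next scene: {descriptions[i + 1]}")
        prompts.append("\n".join(parts))
    return prompts


# Illustrative scene descriptions and rubric text.
descriptions = [
    "A chef enters the kitchen.",
    "The chef chops onions.",
    "The dish is served.",
]
prompts = build_prompts(descriptions, rubric="Score 1-10 for importance.")
```

Each prompt would then be sent to an LLM for scoring; that call is omitted because the abstract does not specify the model or interface.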

[57] MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models

Yongshun Zhang, Zhongyi Fan, Yonghang Zhang, Zhangzikang Li, Weifeng Chen, Zhongwei Feng, Chaoyue Wang, Peng Hou, Anxiang Zeng

🧩 TL;DR

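This paper presents a four-pillar framework for optimizing the training of large-scale video generation models, covering data processing, model architecture, training strategy, and infrastructure. The resulting MUG-V 10B model matches recent state-of-the-art video generators overall and surpasses leading open-source baselines on e-commerce video generation tasks.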


📘 Detailed Summary

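Motivation: Training large-scale video generation models is particularly difficult and resource-intensive because of cross-modal text-video alignment, long sequences, and complex spatiotemporal dependencies, which calls for systematic optimization to improve training efficiency and model performance.

Method: The training framework systematically optimizes four pillars: data processing (including data preprocessing and video compression), model architecture (parameter scaling design), training strategy (curriculum-based pretraining and alignment-focused post-training), and infrastructure (a Megatron-Core-based large-scale training implementation), improving training efficiency end to end.

Result: MUG-V 10B matches recent state-of-the-art models in overall video generation quality and, on e-commerce-oriented video generation tasks, surpasses leading open-source baselines in human evaluations; the training framework also delivers significant efficiency gains and performance improvements.

Conclusion: The study shows that systematically optimizing the four pillars of the training framework can significantly improve the efficiency and performance of large-scale video generation. The open-sourced stack gives the community what the authors describe as the first public large-scale video generation training code built on Megatron-Core, with efficient, near-linear multi-node scaling.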


📄 Abstract

In recent years, large-scale generative models for visual content (e.g., images, videos, and 3D objects/scenes) have made remarkable progress. However, training large-scale video generation models remains particularly challenging and resource-intensive due to cross-modal text-video alignment, the long sequences involved, and the complex spatiotemporal dependencies. To address these challenges, we present a training framework that optimizes four pillars: (i) data processing, (ii) model architecture, (iii) training strategy, and (iv) infrastructure for large-scale video generation models. These optimizations delivered significant efficiency gains and performance improvements across all stages of data preprocessing, video compression, parameter scaling, curriculum-based pretraining, and alignment-focused post-training. Our resulting model, MUG-V 10B, matches recent state-of-the-art video generators overall and, on e-commerce-oriented video generation tasks, surpasses leading open-source baselines in human evaluations. More importantly, we open-source the complete stack, including model weights, Megatron-Core-based large-scale training code, and inference pipelines for video generation and enhancement. To our knowledge, this is the first public release of large-scale video generation training code that exploits Megatron-Core to achieve high training efficiency and near-linear multi-node scaling; details are available on our webpage (https://github.com/Shopee-MUG/MUG-V).

[58] Generation then Reconstruction: Accelerating Masked Autoregressive Models via Two-Stage Sampling

Feihong Yan, Peiru Wang, Yao Zhu, Kaiyu Pang, Qingyan Wei, Huiqi Li, Linfeng Zhang

🧩 TL;DR

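This paper proposes GtR (Generation then Reconstruction), a training-free hierarchical sampling strategy that decomposes image generation into structure generation and detail reconstruction, achieving a 3.72x speedup while maintaining generation quality and substantially outperforming existing acceleration methods.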


📘 Detailed Summary

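Motivation: Although masked autoregressive (MAR) models can generate in parallel, their acceleration potential is constrained by the complexity of modeling spatially correlated visual tokens in a single step, so the balance between generation efficiency and quality needs to be addressed.

Method: The GtR hierarchical sampling strategy decomposes generation into two stages: structure generation establishes a global semantic scaffold, and detail reconstruction efficiently completes the remaining tokens. A Frequency-Weighted Token Selection (FTS) mechanism additionally allocates more computation to tokens in image-detail regions, which are located based on the energy of high-frequency information.

Result: Extensive experiments on ImageNet class-conditional and text-to-image generation show that GtR achieves a 3.72x speedup on MAR-H while maintaining comparable quality (FID: 1.59, IS: 304.4 vs. the original 1.59, 299.1), substantially outperforming existing acceleration methods across various model scales and generation tasks.

Conclusion: The study shows that a hierarchical generation strategy can effectively balance generation efficiency and quality, and that frequency-aware token selection can optimize the allocation of computation, offering a new technical path for efficient visual generation.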


📄 Abstract

Masked Autoregressive (MAR) models promise better efficiency in visual generation than autoregressive (AR) models owing to their ability to generate in parallel, yet their acceleration potential remains constrained by the modeling complexity of spatially correlated visual tokens in a single step. To address this limitation, we introduce Generation then Reconstruction (GtR), a training-free hierarchical sampling strategy that decomposes generation into two stages: structure generation establishing global semantic scaffolding, followed by detail reconstruction efficiently completing remaining tokens. Assuming that it is more difficult to create an image from scratch than to complement images based on a basic image framework, GtR is designed to achieve acceleration by computing the reconstruction stage quickly while maintaining the generation quality by computing the generation stage slowly. Moreover, observing that tokens on the details of an image often carry more semantic information than tokens in the salient regions, we further propose Frequency-Weighted Token Selection (FTS) to offer more computation budget to tokens on image details, which are localized based on the energy of high frequency information. Extensive experiments on ImageNet class-conditional and text-to-image generation demonstrate 3.72x speedup on MAR-H while maintaining comparable quality (e.g., FID: 1.59, IS: 304.4 vs. original 1.59, 299.1), substantially outperforming existing acceleration methods across various model scales and generation tasks. Our code will be released at https://github.com/feihongyan1/GtR.

[59] Frugal Federated Learning for Violence Detection: A Comparison of LoRA-Tuned VLMs and Personalized CNNs

Sébastien Thuau, Siba Haidar, Ayush Bajracharya, Rachid Chelouah

🧩 TL;DR

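This study compares two frugal federated learning approaches to violence detection: zero-shot and federated fine-tuning of vision-language models, and personalized training of a compact 3D convolutional neural network, with an emphasis on energy efficiency and environmental metrics.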


📘 Detailed Summary

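Motivation: Deploying violence detection systems in resource-constrained environments raises challenges of energy consumption and computational efficiency, especially with non-IID data, so solutions that combine high performance with energy efficiency need to be explored.

Method: Two complementary strategies are compared: a vision-language model approach using LLaVA-7B with zero-shot inference and federated fine-tuning, and personalized training of a compact 3D convolutional neural network (CNN3D) with 65.8M parameters. Accuracy, calibration, and energy usage are evaluated under non-IID settings.

Result: Both approaches exceed 90% accuracy. CNN3D slightly outperforms LoRA-tuned VLMs in ROC AUC and log loss while using less energy; VLMs remain favorable for contextual reasoning and multimodal inference.

Conclusion: The findings support a hybrid model: lightweight CNNs for routine classification, with selective VLM activation for complex or descriptive scenarios. The resulting framework offers a reproducible baseline for resource-aware AI in video surveillance, with extensions toward real-time, multimodal, and lifecycle-aware systems.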


📄 Abstract

We examine frugal federated learning approaches to violence detection by comparing two complementary strategies: (i) zero-shot and federated fine-tuning of vision-language models (VLMs), and (ii) personalized training of a compact 3D convolutional neural network (CNN3D). Using LLaVA-7B and a 65.8M parameter CNN3D as representative cases, we evaluate accuracy, calibration, and energy usage under realistic non-IID settings. Both approaches exceed 90% accuracy. CNN3D slightly outperforms Low-Rank Adaptation (LoRA)-tuned VLMs in ROC AUC and log loss, while using less energy. VLMs remain favorable for contextual reasoning and multimodal inference. We quantify energy and CO₂ emissions across training and inference, and analyze sustainability trade-offs for deployment. To our knowledge, this is the first comparative study of LoRA-tuned vision-language models and personalized CNNs for federated violence detection, with an emphasis on energy efficiency and environmental metrics. These findings support a hybrid model: lightweight CNNs for routine classification, with selective VLM activation for complex or descriptive scenarios. The resulting framework offers a reproducible baseline for responsible, resource-aware AI in video surveillance, with extensions toward real-time, multimodal, and lifecycle-aware systems.
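For readers unfamiliar with Low-Rank Adaptation, the sketch below shows the generic LoRA update, y = W x + (alpha / r) B (A x), together with the trainable-parameter count it implies for one weight matrix. It is a plain-Python illustration of the general technique, not the paper's training setup: the matrices, rank, scaling factor, and the 4096 x 4096 layer size are all illustrative assumptions.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha):
    """Compute y = W x + (alpha / r) * B (A x), where r is the LoRA rank."""
    rank = len(A)  # A has shape (r, d_in); B has shape (d_out, r)
    base = matvec(W, x)               # frozen pretrained path
    update = matvec(B, matvec(A, x))  # trainable low-rank path
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def trainable_params(d_out, d_in, rank):
    """Return (full fine-tuning, LoRA) trainable parameter counts for one matrix."""
    return d_out * d_in, rank * (d_out + d_in)


# Illustrative values only; none are taken from the paper.
W = [[1.0, 0.0], [0.0, 1.0]]  # frozen pretrained weight
A = [[1.0, 1.0]]              # trainable, rank 1
B = [[1.0], [0.0]]            # trainable
y = lora_forward(W, A, B, x=[1.0, 2.0], alpha=2.0)
full, lora = trainable_params(4096, 4096, rank=8)
```

For the hypothetical 4096 x 4096 layer at rank 8, the adapter trains 65,536 parameters instead of 16,777,216. A small update of this kind is one reason adapter tuning is attractive in federated settings, where clients exchange updates rather than full weights.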

[60] Intelligent Communication Mixture-of-Experts Boosted-Medical Image Segmentation Foundation Model

Xinwei Zhang, Hu Chen, Zhe Yuan, Sukun Tian, Peng Feng

🧩 TL;DR

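This paper proposes IC-MoE, which uses a mixture-of-experts architecture and semantic-guided contrastive learning to address two problems in adaptive fine-tuning of medical image segmentation foundation models, namely insufficient high-level feature representation and disruption of the structural integrity of pretrained weights, and achieves state-of-the-art performance on multiple public datasets.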


📘 Detailed Summary

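Motivation: Existing adaptive fine-tuning methods for medical image segmentation foundation models have two key problems: insufficient representation of high-level features, and a fine-tuning process that disrupts the structural integrity of pretrained weights. These limit model performance and generalization on medical image segmentation tasks.

Method: IC-MoE constructs basic experts, semantic experts, and adaptive experts, and applies a pixel probability adaptive voting strategy that performs expert selection and fusion through label consistency and load balancing. It also proposes a semantic-guided contrastive learning method that addresses weak supervision in contrastive learning. Both components enhance high-level feature representation while preserving the structural integrity of pretrained weights.

Result: Extensive experiments on three public medical image segmentation datasets show that IC-MoE outperforms other state-of-the-art models and generalizes well across diverse medical image segmentation scenarios.

Conclusion: IC-MoE effectively supplements foundation models for medical image segmentation with stronger high-level feature representation and preserved pretrained structural integrity, providing an efficient adaptive fine-tuning solution for medical image segmentation.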


📄 Abstract

Foundation models for medical image segmentation have achieved remarkable performance. Adaptive fine-tuning of natural image segmentation foundation models is crucial for medical image segmentation tasks. However, some limitations exist in existing fine-tuning methods: 1) insufficient representation of high-level features and 2) the fine-tuning process disrupts the structural integrity of pretrained weights. Inspired by these critical problems, we propose an intelligent communication mixture-of-experts boosted-medical image segmentation foundation model, named IC-MoE, with twofold ideas: 1) We construct basic experts, semantic experts, and adaptive experts. Moreover, we implement a pixel probability adaptive voting strategy, which enables expert selection and fusion through label consistency and load balancing. This approach preliminarily enhances the representation capability of high-level features while preserving the structural integrity of pretrained weights. 2) We propose a semantic-guided contrastive learning method to address the issue of weak supervision in contrastive learning. This method further enhances the representation capability of high-level features while preserving the structural integrity of pretrained weights. Extensive experiments across three public medical image segmentation datasets demonstrate that the IC-MoE outperforms other SOTA models. Consequently, the proposed IC-MoE effectively supplements foundational medical image segmentation models with high-level features and pretrained structural integrity. We also validate the superior generalizability of the IC-MoE across diverse medical image segmentation scenarios.

[61] Multilingual Text-to-Image Person Retrieval via Bidirectional Relation Reasoning and Aligning

Min Cao, Xinyu Zhou, Ding Jiang, Bo Du, Mang Ye, Min Zhang

🧩 TL;DR

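This work introduces the first multilingual text-to-image person retrieval (TIPR) task and develops the Bi-IRRA framework, which learns alignment across languages and modalities through bidirectional implicit relation reasoning and multi-dimensional global alignment, achieving new state-of-the-art results on all multilingual TIPR datasets.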


📘 Detailed Summary

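Motivation: Text-to-image person retrieval faces the challenge of modality heterogeneity, and existing methods have limitations: global methods overlook fine-grained cross-modal differences, local methods require prior information for explicit part alignment, and current methods are English-centric, which restricts their use in multilingual contexts.

Method: The Bi-IRRA framework includes a bidirectional implicit relation reasoning module that predicts masked image and text in both directions, implicitly enhancing the modeling of local relations across languages and modalities, together with a multi-dimensional global alignment module that bridges modality heterogeneity. A multilingual TIPR benchmark is also constructed.

Result: The proposed method achieves new state-of-the-art results on all multilingual TIPR datasets, supporting the framework's effectiveness for cross-lingual and cross-modal retrieval.

Conclusion: The work extends TIPR to multilingual settings and, by combining implicit relation reasoning with global alignment, provides a more effective approach to cross-modal learning, opening a new direction for multilingual vision-language understanding.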


📄 Abstract

Text-to-image person retrieval (TIPR) aims to identify the target person using textual descriptions, facing the challenge of modality heterogeneity. Prior works have attempted to address it by developing cross-modal global or local alignment strategies. However, global methods typically overlook fine-grained cross-modal differences, whereas local methods require prior information to explore explicit part alignments. Additionally, current methods are English-centric, restricting their application in multilingual contexts. To alleviate these issues, we pioneer a multilingual TIPR task by developing a multilingual TIPR benchmark, for which we leverage large language models for initial translations and refine them by integrating domain-specific knowledge. Correspondingly, we propose Bi-IRRA: a Bidirectional Implicit Relation Reasoning and Aligning framework to learn alignment across languages and modalities. Within Bi-IRRA, a bidirectional implicit relation reasoning module enables bidirectional prediction of masked image and text, implicitly enhancing the modeling of local relations across languages and modalities, while a multi-dimensional global alignment module is integrated to bridge the modality heterogeneity. The proposed method achieves new state-of-the-art results on all multilingual TIPR datasets. Data and code are available at https://github.com/Flame-Chasers/Bi-IRRA.

[62] MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues

Yaning Pan, Zekun Wang, Qianqian Xie, Yongqian Wen, Yuanxing Zhang, Guohui Zhang, Haoxuan Hu, Zhiyu Pan, Yibing Huang, Zhidong Gan, Yonghong Lin, An Ping, Tianhao Peng, Jiaheng Liu

🧩 TL;DR

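This paper presents MT-Video-Bench, a holistic benchmark for evaluating the video understanding of multimodal large language models in multi-turn dialogues, and reveals significant performance gaps in how existing models handle multi-turn video dialogue.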


📘 Detailed Summary

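Motivation: Existing evaluation benchmarks are largely limited to single-turn question answering and do not reflect the complexity of multi-turn dialogues in real-world scenarios, which limits effective evaluation and development of multimodal large language models for real applications.

Method: The authors build MT-Video-Bench, comprising 987 meticulously curated multi-turn dialogues. It mainly assesses six core competencies along two dimensions, perceptivity and interactivity, and covers practical application areas such as sports analysis and video-based tutoring.

Result: Extensive evaluation of state-of-the-art open-source and closed-source multimodal large language models shows significant performance discrepancies and limitations in handling multi-turn video dialogues, particularly in complex interactive scenarios.

Conclusion: The benchmark exposes key challenges for current multimodal large language models in multi-turn video dialogue understanding and provides an evaluation tool and direction for future research toward more robust video dialogue systems.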


📄 Abstract

The recent development of Multimodal Large Language Models (MLLMs) has significantly advanced AI's ability to understand visual modalities. However, existing evaluation benchmarks remain limited to single-turn question answering, overlooking the complexity of multi-turn dialogues in real-world scenarios. To bridge this gap, we introduce MT-Video-Bench, a holistic video understanding benchmark for evaluating MLLMs in multi-turn dialogues. Specifically, our MT-Video-Bench mainly assesses six core competencies that focus on perceptivity and interactivity, encompassing 987 meticulously curated multi-turn dialogues from diverse domains. These capabilities are rigorously aligned with real-world applications, such as interactive sports analysis and multi-turn video-based intelligent tutoring. With MT-Video-Bench, we extensively evaluate various state-of-the-art open-source and closed-source MLLMs, revealing their significant performance discrepancies and limitations in handling multi-turn video dialogues. The benchmark will be publicly available to foster future research.

[63] Enhanced Motion Forecasting with Plug-and-Play Multimodal Large Language Models

Katie Luo, Jingwei Ji, Tong He, Runsheng Xu, Yichen Xie, Dragomir Anguelov, Mingxing Tan

🧩 TL;DR

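This paper proposes Plug-and-Forecast (PnF), a plug-and-play approach that augments existing motion forecasting models with multimodal large language models. It uses natural language to describe complex scenarios for quick adaptation and significantly improves motion prediction performance without fine-tuning.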


📘 Detailed Summary

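Motivation: Current autonomous driving systems rely on specialized models for perception and motion prediction. These perform reliably under standard conditions but are hard to generalize cost-effectively to diverse real-world scenarios, which remains a significant challenge.

Method: PnF designs prompts to extract structured scene understanding from multimodal large language models (MLLMs) and distills this information into learnable embeddings that augment existing behavior prediction models, making use of the MLLMs' zero-shot reasoning capabilities.

Result: Validation on two state-of-the-art motion forecasting models using the Waymo Open Motion Dataset and the nuScenes Dataset shows consistent performance improvements on both benchmarks.

Conclusion: The method demonstrates the effectiveness of natural language for describing and handling complex driving scenarios and offers a practical route to zero-shot adaptation to targeted behaviors. Because it requires no fine-tuning, it is practical to deploy.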


📄 Abstract

Current autonomous driving systems rely on specialized models for perceiving and predicting motion, which demonstrate reliable performance in standard conditions. However, generalizing cost-effectively to diverse real-world scenarios remains a significant challenge. To address this, we propose Plug-and-Forecast (PnF), a plug-and-play approach that augments existing motion forecasting models with multimodal large language models (MLLMs). PnF builds on the insight that natural language provides a more effective way to describe and handle complex scenarios, enabling quick adaptation to targeted behaviors. We design prompts to extract structured scene understanding from MLLMs and distill this information into learnable embeddings to augment existing behavior prediction models. Our method leverages the zero-shot reasoning capabilities of MLLMs to achieve significant improvements in motion prediction performance, while requiring no fine-tuning -- making it practical to adopt. We validate our approach on two state-of-the-art motion forecasting models using the Waymo Open Motion Dataset and the nuScenes Dataset, demonstrating consistent performance improvements across both benchmarks.

[64] LongInsightBench: A Comprehensive Benchmark for Evaluating Omni-Modal Models on Human-Centric Long-Video Understanding

ZhaoYang Han, Qihan Lin, Hao Liang, Bowen Chen, Zhou Liu, Wentao Zhang

🧩 TL;DR

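This paper introduces LongInsightBench, the first benchmark focused on long-video understanding. It integrates visual, audio, and text modalities to evaluate how well models understand human language, viewpoints, actions, and other contextual elements, and it exposes the difficulty current omni-modal models have with temporal localization and long-range causal inference.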


📘 Detailed Summary

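Motivation: There is currently no benchmark dedicated to long-video understanding, and in particular there is a gap in integrating multimodal information (visual, audio, text). A benchmark is needed to evaluate models' deep understanding of human language, viewpoints, actions, and other contextual elements.

Method: The authors build a dataset of about 1,000 long, information-dense videos covering language-rich content such as lectures, interviews, and vlogs; design six challenging task scenarios, including Intra-Event and Inter-Event tasks; and develop a three-step, semi-automated data quality assurance pipeline to ensure question difficulty and validity.

Result: Experiments show that omni-modal models still struggle with tasks requiring precise temporal localization and long-range causal inference. Extended experiments reveal information loss and processing bias in their multi-modal fusion.

Conclusion: LongInsightBench provides the first comprehensive benchmark for long-video understanding and reveals the limitations of current omni-modal models, especially on complex reasoning tasks, pointing to directions for improving future multimodal models.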


📄 Abstract

We introduce LongInsightBench, the first benchmark designed to assess models' ability to understand long videos, with a focus on human language, viewpoints, actions, and other contextual elements, while integrating visual, audio, and text modalities. Our benchmark excels in three key areas: a) Long-Duration, Information-Dense Videos: We carefully select approximately 1,000 videos from the open-source dataset FineVideo based on duration limit and the information density of both visual and audio modalities, focusing on content like lectures, interviews, and vlogs, which contain rich language elements. b) Diverse and Challenging Task Scenarios: We have designed six challenging task scenarios, including both Intra-Event and Inter-Event Tasks. c) Rigorous and Comprehensive Quality Assurance Pipelines: We have developed a three-step, semi-automated data quality assurance pipeline to ensure the difficulty and validity of the synthesized questions and answer options. Based on LongInsightBench, we designed a series of experiments. Experimental results show that omni-modal models (OLMs) still face challenges in tasks requiring precise temporal localization (T-Loc) and long-range causal inference (CE-Caus). Extended experiments reveal the information loss and processing bias in multi-modal fusion of OLMs. Our dataset and code are available at https://anonymous.4open.science/r/LongInsightBench-910F/.

[65] iDETEX: Empowering MLLMs for Intelligent DETailed EXplainable IQA

Zhaoran Zhao, Xinli Yue, Jianhui Sun, Yuhao Xie, Tao Shao, Liangchao Yao, Fan Xia, Yuetang Deng

🧩 TL;DR

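This work proposes iDETEX, a unified multimodal large language model that simultaneously performs three key tasks (quality grounding, perception, and description) to address the challenge of detailed and explainable image quality assessment. The model achieves state-of-the-art performance on the ViDA-UGC benchmark and ranks first in the ICCV MIPI 2025 Detailed Image Quality Assessment Challenge.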


📘 Detailed Summary

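Motivation: Image quality assessment has progressed from scalar quality prediction to more interpretable, human-aligned evaluation paradigms. This work addresses the emerging challenge of detailed and explainable image quality assessment by handling three heterogeneous subtasks (quality grounding, perception, and description) with a single unified multimodal large language model.

Method: The authors propose iDETEX, a unified multimodal large language model, and design a suite of task-specific offline augmentation modules and a data mixing strategy, complemented by online enhancement strategies that fully exploit multi-sourced supervision, to enable efficient and generalizable training across the three heterogeneous subtasks.

Result: On the large-scale ViDA-UGC benchmark, iDETEX achieves state-of-the-art performance across all subtasks, and it ranks first in the ICCV MIPI 2025 Detailed Image Quality Assessment Challenge, demonstrating its effectiveness and robustness in delivering accurate and interpretable quality assessments.

Conclusion: The study shows the potential of unified multimodal large language models for detailed and explainable image quality assessment: with carefully designed training strategies and architecture, a single model can handle multiple heterogeneous tasks and provide a more comprehensive, human-aligned solution.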


📄 Abstract

Image Quality Assessment (IQA) has progressed from scalar quality prediction to more interpretable, human-aligned evaluation paradigms. In this work, we address the emerging challenge of detailed and explainable IQA by proposing iDETEX, a unified multimodal large language model (MLLM) capable of simultaneously performing three key tasks: quality grounding, perception, and description. To facilitate efficient and generalizable training across these heterogeneous subtasks, we design a suite of task-specific offline augmentation modules and a data mixing strategy. These are further complemented by online enhancement strategies to fully exploit multi-sourced supervision. We validate our approach on the large-scale ViDA-UGC benchmark, where iDETEX achieves state-of-the-art performance across all subtasks. Our model ranks first in the ICCV MIPI 2025 Detailed Image Quality Assessment Challenge, demonstrating its effectiveness and robustness in delivering accurate and interpretable quality assessments.

[66] Exploring The Missing Semantics In Event Modality

Jingqian Wu, Shengpeng Xu, Yunbo Jia, Edmund Y. Lam

🧩 TL;DR

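This paper proposes Semantic-E2VID, a framework that transfers visual semantic knowledge from the frame modality to the event modality, significantly improving the quality of event-to-video reconstruction and the recovery of semantic information. The method outperforms existing state-of-the-art event-to-video reconstruction methods on multiple benchmarks.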


📘 Detailed Summary

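Motivation: Event cameras capture only intensity changes and ignore static objects and backgrounds, so the event modality lacks semantic information. Existing event-to-video reconstruction methods often overlook the crucial role of semantic information, which limits the semantic quality and completeness of reconstructed video.

Method: Semantic-E2VID comprises a cross-modal feature alignment module that transfers visual semantic knowledge from the Segment Anything Model (SAM) to the event encoder, a semantic-aware feature fusion block that integrates the learned semantic features into semantically rich event representations, and a semantic perceptual E2V supervision mechanism that uses SAM-generated categorical labels to help reconstruct semantic details.

Result: Extensive experiments on multiple benchmarks show that Semantic-E2VID significantly enhances frame quality and outperforms state-of-the-art methods on event-to-video reconstruction, indicating that semantic knowledge transfer materially improves reconstruction quality.

Conclusion: The study demonstrates the effectiveness of transferring visual semantic knowledge from the frame modality to the event modality and offers a new semantic-enhancement paradigm for event-based vision that could extend to other event-based vision tasks.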


📄 Abstract

Event cameras offer distinct advantages such as low latency, high dynamic range, and efficient motion capture. However, event-to-video reconstruction (E2V), a fundamental event-based vision task, remains challenging, particularly for reconstructing and recovering semantic information. This is primarily due to the nature of the event camera, as it only captures intensity changes, ignoring static objects and backgrounds, resulting in a lack of semantic information in the captured event modality. Further, semantic information plays a crucial role in video and frame reconstruction, yet is often overlooked by existing E2V approaches. To bridge this gap, we propose Semantic-E2VID, an E2V framework that explores the missing visual semantic knowledge in the event modality and leverages it to enhance event-to-video reconstruction. Specifically, Semantic-E2VID introduces a cross-modal feature alignment (CFA) module to transfer the robust visual semantics from a frame-based vision foundation model, the Segment Anything Model (SAM), to the event encoder, while aligning the high-level features from distinct modalities. To better utilize the learned semantic features, we further propose a semantic-aware feature fusion (SFF) block to integrate learned semantics in the frame modality to form event representations with rich semantics that can be decoded by the event decoder. Further, to facilitate the reconstruction of semantic information, we propose a novel Semantic Perceptual E2V Supervision that helps the model to reconstruct semantic details by leveraging SAM-generated categorical labels. Extensive experiments demonstrate that Semantic-E2VID significantly enhances frame quality, outperforming state-of-the-art E2V methods across multiple benchmarks. The sample code is included in the supplementary material.

[67] Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs

Vaggelis Dorovatas, Soroush Seifi, Gunshi Gupta, Rahaf Aljundi

🧩 TL;DR

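This paper proposes a training-free approach that addresses the efficiency of video large language models in streaming video processing through LLM-informed visual token selection, recurrent processing, and caption-based question answering, substantially improving efficiency while preserving performance.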


📘 Detailed Summary

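Motivation: Video large language models excel at in-context video understanding but face major challenges in streaming scenarios, especially when hour-long videos must be processed online and questions need timely responses; existing methods cannot meet these efficiency requirements.

Method: The approach relies on three key techniques: an LLM-attention-based selection mechanism that identifies the visual tokens important for understanding each short clip; recurrent processing of previously selected tokens to produce temporally coherent video understanding; and a lightweight caption-based question answering scheme for accurate responses.

Result: The method achieves state-of-the-art performance on streaming video benchmarks and can discard about 95% of unimportant visual tokens with minimal performance loss, striking a balance between efficiency and effectiveness.

Conclusion: The study shows that attention-based token selection and recurrent processing can markedly improve the streaming efficiency of video large language models without sacrificing understanding quality, offering a viable solution for real-time long-video analysis.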


📄 Abstract

Video Large Language Models (Video-LLMs) excel at understanding videos in-context, provided they have full access to the video when answering queries. However, these models face challenges in streaming scenarios where hour-long videos must be processed online, and questions need timely responses. In this work, we propose a training-free approach compatible with standard Video-LLMs, leveraging three key concepts: 1) LLM-informed selection of visual tokens to identify those that the LLM has attended to and that contributed to its understanding of each short clip. Our attention-based selection allows us to discard up to ~95% of unimportant visual tokens with minimal performance loss; 2) Recurrent processing of past selected tokens to generate temporally coherent understanding of each processed clip; 3) Caption-based question answering for lightweight and accurate responses. Our method achieves state-of-the-art performance on streaming video benchmarks, striking a balance between efficiency and effectiveness.
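
As an illustration of the attention-based selection idea, the following minimal sketch keeps only the highest-scoring fraction of a clip's visual tokens. It is not the authors' implementation: the abstract does not say how attention is aggregated across layers and heads, so the per-token scores are assumed to be given, and the names `select_tokens` and `keep_ratio` are hypothetical.

```python
from typing import List


def select_tokens(attention: List[float], keep_ratio: float = 0.05) -> List[int]:
    """Return indices of the highest-attention tokens, in original order.

    `attention` holds one (assumed, pre-aggregated) score per visual token.
    `keep_ratio=0.05` mirrors the "discard up to ~95%" figure in the abstract.
    """
    if not attention:
        return []
    # Always keep at least one token so a clip never vanishes entirely.
    k = max(1, round(len(attention) * keep_ratio))
    ranked = sorted(range(len(attention)), key=lambda i: attention[i], reverse=True)
    # Restore temporal/spatial order of the kept tokens.
    return sorted(ranked[:k])


if __name__ == "__main__":
    scores = [0.01 * i for i in range(20)]  # token 19 has the highest score
    print(select_tokens(scores))            # keeps 1 of 20 tokens
```

In a streaming setting, the kept tokens of each clip would be carried forward to the next clip, which is where the recurrent processing described in the abstract comes in.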

[68] Closed-Loop Transfer for Weakly-supervised Affordance Grounding

Jiajin Tang, Zhengxuan Wei, Ge Zheng, Sibei Yang

🧩 TL;DR

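This paper proposes LoopTrans, a novel closed-loop framework for weakly-supervised affordance grounding that not only transfers knowledge from the exocentric to the egocentric view but also transfers it back to enhance exocentric knowledge extraction, yielding consistent gains on image and video benchmarks.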


📘 Detailed Summary

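Motivation: Existing weakly-supervised affordance grounding methods transfer knowledge one-way from exocentric interaction images to egocentric images, which limits their applicability in complex interaction scenarios; the domain gap and the efficiency of knowledge transfer both need to be addressed.

Method: The LoopTrans closed-loop framework includes unified cross-modal localization and denoising knowledge distillation, and uses bidirectional knowledge transfer to bridge the domain gap between object-centered egocentric images and interaction-centered exocentric images.

Result: Experiments show that LoopTrans achieves consistent improvements across all metrics on image and video benchmarks, and handles challenging scenarios in which object interaction regions are fully occluded by the human body.

Conclusion: The study demonstrates the importance of bidirectional knowledge transfer for affordance grounding: the closed-loop framework significantly strengthens knowledge extraction and transfer, offering a new approach to visual understanding in complex interaction scenarios.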


📄 Abstract

Humans can perform previously unexperienced interactions with novel objects simply by observing others engage with them. Weakly-supervised affordance grounding mimics this process by learning to locate object regions that enable actions on egocentric images, using exocentric interaction images with image-level annotations. However, extracting affordance knowledge solely from exocentric images and transferring it one-way to egocentric images limits the applicability of previous works in complex interaction scenarios. Instead, this study introduces LoopTrans, a novel closed-loop framework that not only transfers knowledge from exocentric to egocentric but also transfers back to enhance exocentric knowledge extraction. Within LoopTrans, several innovative mechanisms are introduced, including unified cross-modal localization and denoising knowledge distillation, to bridge domain gaps between object-centered egocentric and interaction-centered exocentric images while enhancing knowledge transfer. Experiments show that LoopTrans achieves consistent improvements across all metrics on image and video benchmarks, even handling challenging scenarios where object interaction regions are fully occluded by the human body.

[69] Monitoring Horses in Stalls: From Object to Event Detection

Dmitrii Galimzianov, Viacheslav Vyshegorodtsev, Ivan Nezhivykh

🧩 TL;DR

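This study presents a prototype vision-based system for automatically monitoring horse behavior, using object detection and multi-object tracking to automate the monitoring of horses and people inside stalls. Combining YOLOv11 and BoT-SORT with spatial-relation reasoning, the system offers a real-time solution for horse welfare monitoring.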


📘 Detailed Summary

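Motivation: Horse behavior monitoring currently relies mainly on human observation, which is labor-intensive and time-consuming and makes timely detection of early health problems difficult. Traditional approaches cannot provide continuous, automated behavior monitoring in stalls, so an intelligent monitoring solution is needed.

Method: The system uses YOLOv11 for object detection and BoT-SORT for multi-object tracking, and infers event states from object trajectories and spatial relations. To support development, the authors built a custom dataset annotated with the help of the foundation models CLIP and GroundingDINO. The system distinguishes five event types and accounts for the camera's blind spots.

Result: Qualitative evaluation shows reliable performance on horse-related events but limitations in detecting people due to data scarcity. The system distinguishes the different event types, and its feasibility was validated in a real stable environment.

Conclusion: The work lays a foundation for real-time behavioral monitoring in equine facilities, with implications for animal welfare and stable management. Future work should address the lack of data for person detection and further improve the system's real-time performance and robustness.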


📄 Abstract

Monitoring the behavior of stalled horses is essential for early detection of health and welfare issues but remains labor-intensive and time-consuming. In this study, we present a prototype vision-based monitoring system that automates the detection and tracking of horses and people inside stables using object detection and multi-object tracking techniques. The system leverages YOLOv11 and BoT-SORT for detection and tracking, while event states are inferred based on object trajectories and spatial relations within the stall. To support development, we constructed a custom dataset annotated with assistance from foundation models CLIP and GroundingDINO. The system distinguishes between five event types and accounts for the camera's blind spots. Qualitative evaluation demonstrated reliable performance for horse-related events, while highlighting limitations in detecting people due to data scarcity. This work provides a foundation for real-time behavioral monitoring in equine facilities, with implications for animal welfare and stable management.
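
To make the step from tracked boxes to event states concrete, here is a minimal sketch that assigns a stall state from the spatial relation between detections and a stall region. It is an assumption-laden illustration: the abstract does not name the five event types or the exact rules, so the state names, the rectangular stall region, and the box-center test are all hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def inside(box: Box, region: Box) -> bool:
    """True if the center of `box` lies within `region`."""
    cx = (box[0] + box[2]) / 2
    cy = (box[1] + box[3]) / 2
    return region[0] <= cx <= region[2] and region[1] <= cy <= region[3]


def stall_state(detections: List[Tuple[str, Box]], stall: Box) -> str:
    """Map one frame's (label, box) detections to a hypothetical stall state."""
    labels = {label for label, box in detections if inside(box, stall)}
    if {"horse", "person"} <= labels:
        return "horse_and_person_in_stall"
    if "horse" in labels:
        return "horse_in_stall"
    if "person" in labels:
        return "person_in_stall"
    return "stall_empty"


if __name__ == "__main__":
    stall = (0.0, 0.0, 10.0, 10.0)
    frame = [("horse", (1.0, 1.0, 5.0, 5.0)), ("person", (20.0, 20.0, 22.0, 24.0))]
    print(stall_state(frame, stall))
```

A real system would apply such a rule per track over time (for example, requiring a state to persist for several frames) and would treat the camera's blind spots separately, as the abstract notes.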

[70] Leveraging AV1 motion vectors for Fast and Dense Feature Matching

Julien Zouein, Hossein Javidnia, François Pitié, Anil Kokaram

🧩 TL;DR

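This study repurposes AV1 motion vectors to produce dense sub-pixel correspondences and short tracks filtered by cosine consistency, serving as a compressed-domain vision front end. On short videos it performs comparably to sequential SIFT while using far less CPU, offering a resource-efficient front end for full vision pipelines.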


📘 Detailed Summary

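Motivation: Traditional vision front ends such as SIFT are computationally expensive, especially on video sequences. This study explores whether compressed-domain correspondences are a viable resource-efficient alternative, addressing the limits of existing methods in computational efficiency and scalability.

Method: The method repurposes motion vectors from the AV1 video coding standard to produce dense sub-pixel correspondences and filters short tracks by cosine consistency. This compressed-domain front end avoids the costly feature extraction and matching of traditional pipelines and extracts geometric information directly from the compressed video data.

Result: On a 117-frame clip, motion-vector matches register all images and reconstruct 0.46-0.62 million 3D points at a reprojection error of 0.51-0.53 px. On short videos the method runs comparably to sequential SIFT with far lower CPU usage and yields denser matches.

Conclusion: Compressed-domain correspondences prove to be a practical, resource-efficient vision front end with clear paths to scaling. The method is a viable alternative in full vision pipelines, particularly where compute is limited or large volumes of video must be processed.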


📄 Abstract

We repurpose AV1 motion vectors to produce dense sub-pixel correspondences and short tracks filtered by cosine consistency. On short videos, this compressed-domain front end runs comparably to sequential SIFT while using far less CPU, and yields denser matches with competitive pairwise geometry. As a small SfM demo on a 117-frame clip, MV matches register all images and reconstruct 0.46-0.62M points at 0.51-0.53 px reprojection error; BA time grows with match density. These results show compressed-domain correspondences are a practical, resource-efficient front end with clear paths to scaling in full pipelines.
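
One plausible reading of "filtered by cosine consistency" is that a short track is kept only if consecutive displacement vectors point in nearly the same direction. The sketch below implements that reading; the abstract does not define the filter, so the criterion, the threshold `min_cos`, and the function names are assumptions.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def cosine(u: Point, v: Point) -> float:
    """Cosine similarity of two 2D vectors; 0.0 if either has zero length."""
    nu, nv = math.hypot(*u), math.hypot(*v)
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return (u[0] * v[0] + u[1] * v[1]) / (nu * nv)


def is_consistent(track: List[Point], min_cos: float = 0.9) -> bool:
    """Keep a track only if successive motion steps agree in direction.

    `track` is a list of (x, y) positions across frames. Tracks with fewer
    than three points have no pair of steps to compare and pass trivially.
    """
    steps = [(x2 - x1, y2 - y1) for (x1, y1), (x2, y2) in zip(track, track[1:])]
    return all(cosine(a, b) >= min_cos for a, b in zip(steps, steps[1:]))


if __name__ == "__main__":
    print(is_consistent([(0.0, 0.0), (1.0, 0.0), (2.0, 0.1)]))  # smooth motion
    print(is_consistent([(0.0, 0.0), (1.0, 0.0), (0.0, 0.0)]))  # reverses direction
```

Note that this variant rejects tracks containing a zero-motion step, since their direction is undefined; a practical filter might skip such steps instead.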

[71] One Dinomaly2 Detect Them All: A Unified Framework for Full-Spectrum Unsupervised Anomaly Detection

Jia Guo, Shuai Lu, Lei Fan, Zelin Li, Donglin Di, Yang Song, Weihang Zhang, Wenbing Zhu, Hong Yan, Fang Chen, Huiqi Li, Hongen Liao

🧩 TL;DR

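Dinomaly2 is presented as the first unified framework for full-spectrum image unsupervised anomaly detection. Through the coordinated design of five simple elements it closes the performance gap in multi-class detection while extending seamlessly across data modalities and task settings, arguing that simplicity is the foundation of true universality.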


📘 Detailed Summary

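Motivation: In unsupervised anomaly detection, multi-class models significantly underperform one-for-one specialized models, and the field has fragmented into specialized methods for specific scenarios, creating deployment barriers and highlighting the need for a unified solution.

Method: Dinomaly2 uses a standard reconstruction-based framework and achieves superior performance through the careful orchestration of five simple elements. This methodological minimalism lets it extend naturally to diverse tasks without modification.

Result: Across 12 unsupervised anomaly detection benchmarks, Dinomaly2 shows full-spectrum superiority over multiple modalities, task settings, and application domains. The multi-class model reaches an unprecedented 99.9% and 99.3% image-level AUROC on MVTec-AD and VisA, respectively, and with only 8 normal samples per class it surpasses previous full-shot models.

Conclusion: The combination of minimalistic design, computational scalability, and universal applicability positions Dinomaly2 as a unified solution for the full spectrum of real-world anomaly detection applications, establishing simplicity as the foundation of true universality.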


📄 Abstract

Unsupervised anomaly detection (UAD) has evolved from building specialized single-class models to unified multi-class models, yet existing multi-class models significantly underperform the most advanced one-for-one counterparts. Moreover, the field has fragmented into specialized methods tailored to specific scenarios (multi-class, 3D, few-shot, etc.), creating deployment barriers and highlighting the need for a unified solution. In this paper, we present Dinomaly2, the first unified framework for full-spectrum image UAD, which bridges the performance gap in multi-class models while seamlessly extending across diverse data modalities and task settings. Guided by the "less is more" philosophy, we demonstrate that the orchestration of five simple elements achieves superior performance in a standard reconstruction-based framework. This methodological minimalism enables natural extension across diverse tasks without modification, establishing that simplicity is the foundation of true universality. Extensive experiments on 12 UAD benchmarks demonstrate Dinomaly2's full-spectrum superiority across multiple modalities (2D, multi-view, RGB-3D, RGB-IR), task settings (single-class, multi-class, inference-unified multi-class, few-shot) and application domains (industrial, biological, outdoor). For example, our multi-class model achieves unprecedented 99.9% and 99.3% image-level (I-) AUROC on MVTec-AD and VisA respectively. For multi-view and multi-modal inspection, Dinomaly2 demonstrates state-of-the-art performance with minimum adaptations. Moreover, using only 8 normal examples per class, our method surpasses previous full-shot models, achieving 98.7% and 97.4% I-AUROC on MVTec-AD and VisA. The combination of minimalistic design, computational scalability, and universal applicability positions Dinomaly2 as a unified solution for the full spectrum of real-world anomaly detection applications.
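
For readers unfamiliar with the reported metric, image-level AUROC is the probability that a randomly chosen anomalous image receives a higher anomaly score than a randomly chosen normal one, with ties counted as half. The sketch below computes it by direct pairwise comparison; the function name and the toy scores are illustrative, not data from the paper.

```python
from typing import List


def auroc(scores: List[float], labels: List[int]) -> float:
    """AUROC via pairwise comparison; label 1 = anomalous, 0 = normal."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one anomalous and one normal sample")
    # Count anomalous/normal pairs ranked correctly; a tie counts as half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # One anomalous image (score 0.2) ranks below one normal image (score 0.3).
    print(auroc([0.9, 0.2, 0.3, 0.1], [1, 1, 0, 0]))
```

The pairwise form is quadratic in the number of images, which is fine for a sanity check; rank-based implementations are preferable for large test sets.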

[72] Towards 3D Objectness Learning in an Open World

Taichi Liu, Zhenyu Wang, Ruofeng Liu, Guang Wang, Desheng Zhang

🧩 TL;DR

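This paper proposes OP3Det, a prompt-free open-world 3D detector that requires no text prompts and fuses 2D semantic priors with 3D geometric priors for generalized 3D object discovery, significantly outperforming existing methods in open-world 3D detection.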


📘 Detailed Summary

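Motivation: Despite progress in 3D object detection and novel category detection, learning generalized 3D objectness remains under-explored. Traditional closed-set 3D detectors struggle to generalize to open-world scenarios, while directly incorporating 3D open-vocabulary models runs into vocabulary expansion and semantic overlap, leaving a gap in open-world 3D objectness learning.

Method: The OP3Det framework draws on the strong generalization and zero-shot capabilities of 2D foundation models, combining 2D semantic priors and 3D geometric priors to generate class-agnostic object proposals. A cross-modal mixture of experts dynamically routes uni-modal and multi-modal features to learn generalized 3D objectness from point clouds and RGB images.

Result: Extensive experiments show that OP3Det surpasses existing open-world 3D detectors by up to 16.0% in AR and improves on closed-world 3D detectors by 13.5%, demonstrating its effectiveness for open-world 3D object discovery.

Conclusion: The study shows the potential of combining 2D foundation models with 3D geometric information for open-world 3D detection and offers a new paradigm for learning generalized 3D objectness, moving 3D scene understanding in a more general direction.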


📄 Abstract

3D object detection and novel category detection have made significant progress in recent years, yet research on learning generalized 3D objectness remains insufficient. In this paper, we delve into learning open-world 3D objectness, which focuses on detecting all objects in a 3D scene, including novel objects unseen during training. Traditional closed-set 3D detectors struggle to generalize to open-world scenarios, while directly incorporating 3D open-vocabulary models for open-world ability struggles with vocabulary expansion and semantic overlap. To achieve generalized 3D object discovery, we propose OP3Det, a class-agnostic Open-World Prompt-free 3D Detector to detect any objects within 3D scenes without relying on hand-crafted text prompts. We introduce the strong generalization and zero-shot capabilities of 2D foundation models, utilizing both 2D semantic priors and 3D geometric priors for class-agnostic proposals to broaden 3D object discovery. Then, by integrating complementary information from point cloud and RGB image in the cross-modal mixture of experts, OP3Det dynamically routes uni-modal and multi-modal features to learn generalized 3D objectness. Extensive experiments demonstrate the extraordinary performance of OP3Det, which significantly surpasses existing open-world 3D detectors by up to 16.0% in AR and achieves a 13.5% improvement compared to closed-world 3D detectors.

[73] SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference

Samir Khaki, Junxian Guo, Jiaming Tang, Shang Yang, Yukang Chen, Konstantinos N. Plataniotis, Yao Lu, Song Han, Zhijian Liu

🧩 TL;DR

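SparseVILA proposes a new paradigm for efficient vision language model inference: it prunes redundant visual tokens during prefill and retrieves query-relevant tokens during decoding, yielding a training-free, architecture-agnostic acceleration framework that significantly speeds up inference while preserving model capability.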


📘 Detailed Summary

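Motivation: The scalability of current vision language models is limited by the rapidly growing number of visual tokens, which dominate inference latency and hinder efficient deployment in applications such as high-resolution image understanding, long-video analysis, and multi-turn conversation.

Method: SparseVILA decouples visual sparsity: query-agnostic token pruning in the prefill stage and query-aware token retrieval in the decoding stage. It is built on an AWQ-optimized inference pipeline and retains most of the visual cache to preserve multi-turn fidelity.

Result: On long-context video tasks it achieves up to 4.0x faster prefilling, 2.5x faster decoding, and a 2.6x end-to-end speedup, while improving accuracy on document-understanding and reasoning tasks.

Conclusion: By decoupling query-agnostic pruning from query-aware retrieval, SparseVILA opens a new direction for efficient multimodal inference, offering a training-free, architecture-agnostic framework that accelerates large VLMs without sacrificing capability.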


📄 Abstract

Vision Language Models (VLMs) have rapidly advanced in integrating visual and textual reasoning, powering applications across high-resolution image understanding, long-video analysis, and multi-turn conversation. However, their scalability remains limited by the growing number of visual tokens that dominate inference latency. We present SparseVILA, a new paradigm for efficient VLM inference that decouples visual sparsity across the prefilling and decoding stages. SparseVILA distributes sparsity across stages by pruning redundant visual tokens during prefill and retrieving only query-relevant tokens during decoding. This decoupled design matches leading prefill pruning methods while preserving multi-turn fidelity by retaining most of the visual cache so that query-aware tokens can be retrieved at each conversation round. Built on an AWQ-optimized inference pipeline, SparseVILA achieves up to 4.0 times faster prefilling, 2.5 times faster decoding, and an overall 2.6 times end-to-end speedup on long-context video tasks -- while improving accuracy on document-understanding and reasoning tasks. By decoupling query-agnostic pruning and query-aware retrieval, SparseVILA establishes a new direction for efficient multimodal inference, offering a training-free, architecture-agnostic framework for accelerating large VLMs without sacrificing capability.

cs.CL [Back]

[74] Fusion-Augmented Large Language Models: Boosting Diagnostic Trustworthiness via Model Consensus

Md Kamrul Siam, Md Jobair Hossain Faruk, Jerry Q. Cheng, Huanying Gu

🧩 TL;DR

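This study proposes a multi-model fusion framework based on ChatGPT and Claude that uses an output-similarity consensus mechanism to improve the accuracy of chest X-ray interpretation, reaching a top accuracy of 91.3% on the CheXpert dataset.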


📘 Detailed Summary

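Motivation: AI-assisted radiological diagnosis currently suffers from limited reliability, especially given the limited accuracy of any single model, which motivates multi-model fusion strategies to improve diagnostic trustworthiness and clinical utility.

Method: The study uses a multi-model fusion framework combining two state-of-the-art large language models, ChatGPT and Claude, with a consensus approach based on a 95% output similarity threshold, and compares image-only prompts against multimodal inputs that pair images with synthetic clinical notes.

Result: On 234 image-only cases, ChatGPT and Claude reached accuracies of 62.8% and 76.9%, respectively, and the consensus approach raised accuracy to 77.6%. On 50 multimodal cases, accuracy was 84% for ChatGPT and 76% for Claude, and consensus accuracy reached 91.3%. Agreement-based fusion outperformed the individual models in both conditions.

Conclusion: The study demonstrates that multimodal inputs and output-level consensus improve the reliability of AI-assisted radiological diagnosis, offering a practical path to reduce diagnostic errors with minimal computational overhead and clear value for clinical translation.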


📄 Abstract

This study presents a novel multi-model fusion framework leveraging two state-of-the-art large language models (LLMs), ChatGPT and Claude, to enhance the reliability of chest X-ray interpretation on the CheXpert dataset. From the full CheXpert corpus of 224,316 chest radiographs, we randomly selected 234 radiologist-annotated studies to evaluate unimodal performance using image-only prompts. In this setting, ChatGPT and Claude achieved diagnostic accuracies of 62.8% and 76.9%, respectively. A similarity-based consensus approach, using a 95% output similarity threshold, improved accuracy to 77.6%. To assess the impact of multimodal inputs, we then generated synthetic clinical notes following the MIMIC-CXR template and evaluated a separate subset of 50 randomly selected cases paired with both images and synthetic text. On this multimodal cohort, performance improved to 84% for ChatGPT and 76% for Claude, while consensus accuracy reached 91.3%. Across both experimental conditions, agreement-based fusion consistently outperformed individual models. These findings highlight the utility of integrating complementary modalities and using output-level consensus to improve the trustworthiness and clinical utility of AI-assisted radiological diagnosis, offering a practical path to reduce diagnostic errors with minimal computational overhead.
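
A similarity-thresholded consensus of this kind can be sketched in a few lines. The abstract does not say how output similarity is measured or what happens when the models disagree, so this illustration makes two assumptions: similarity is the character-level ratio from Python's `difflib.SequenceMatcher`, and disagreement returns `None` so the case can be flagged for review. The function name `consensus` is hypothetical.

```python
from difflib import SequenceMatcher
from typing import Optional


def consensus(answer_a: str, answer_b: str, threshold: float = 0.95) -> Optional[str]:
    """Return the shared answer if two model outputs are similar enough, else None.

    Outputs are lower-cased and whitespace-normalized before comparison.
    The 0.95 default mirrors the 95% threshold mentioned in the abstract.
    """
    a = " ".join(answer_a.lower().split())
    b = " ".join(answer_b.lower().split())
    similarity = SequenceMatcher(None, a, b).ratio()
    return answer_a if similarity >= threshold else None


if __name__ == "__main__":
    print(consensus("Pleural effusion", "pleural  effusion"))  # agreement
    print(consensus("Pleural effusion", "No finding"))         # None: no consensus
```

With a threshold this high, only near-identical outputs count as agreement, so a consensus accuracy would naturally be computed over the subset of cases where the models agree.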

[75] Evaluating Prompting Strategies and Large Language Models in Systematic Literature Review Screening: Relevance and Task-Stage Classification

Binglan Han, Anuradha Mathrani, Teo Susnjak

🧩 TL;DR

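This study systematically quantifies how prompting strategies interact with large language models in automating the screening stage of systematic literature reviews. It finds that CoT-few-shot prompting gives the most reliable precision-recall balance and proposes a cost-aware staged workflow.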


📘 Detailed Summary

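Motivation: The interaction between prompting strategies and large language models in the automated screening stage of systematic literature reviews has not been systematically quantified. This study examines how optimizing prompt-model pairings can improve screening efficiency and accuracy.

Method: The study evaluates six LLMs under five prompt types (zero-shot, few-shot, chain-of-thought, CoT-few-shot, and self-reflection) across relevance classification and six Level-2 tasks, using accuracy, precision, recall, and F1.

Result: Results show pronounced model-prompt interaction effects: CoT-few-shot yields the most reliable precision-recall balance; zero-shot maximizes recall for high-sensitivity passes; and self-reflection underperforms due to over-inclusivity and instability. GPT-4o and DeepSeek provide robust overall performance, while GPT-4o-mini remains competitive at substantially lower cost.

Conclusion: The study recommends a staged workflow: use low-cost models with structured prompts for first-pass screening and escalate only borderline cases to higher-capacity models. The findings highlight LLMs' uneven but promising potential for automating literature screening and provide a comparative benchmark and practical guidance for task-adaptive LLM deployment.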


📄 Abstract

This study quantifies how prompting strategies interact with large language models (LLMs) to automate the screening stage of systematic literature reviews (SLRs). We evaluate six LLMs (GPT-4o, GPT-4o-mini, DeepSeek-Chat-V3, Gemini-2.5-Flash, Claude-3.5-Haiku, Llama-4-Maverick) under five prompt types (zero-shot, few-shot, chain-of-thought (CoT), CoT-few-shot, self-reflection) across relevance classification and six Level-2 tasks, using accuracy, precision, recall, and F1. Results show pronounced model-prompt interaction effects: CoT-few-shot yields the most reliable precision-recall balance; zero-shot maximizes recall for high-sensitivity passes; and self-reflection underperforms due to over-inclusivity and instability across models. GPT-4o and DeepSeek provide robust overall performance, while GPT-4o-mini performs competitively at a substantially lower dollar cost. A cost-performance analysis for relevance classification (per 1,000 abstracts) reveals large absolute differences among model-prompt pairings; GPT-4o-mini remains low-cost across prompts, and structured prompts (CoT/CoT-few-shot) on GPT-4o-mini offer attractive F1 at a small incremental cost. We recommend a staged workflow that (1) deploys low-cost models with structured prompts for first-pass screening and (2) escalates only borderline cases to higher-capacity models. These findings highlight LLMs' uneven but promising potential to automate literature screening. By systematically analyzing prompt-model interactions, we provide a comparative benchmark and practical guidance for task-adaptive LLM deployment.
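
The recommended staged workflow can be sketched as a two-tier screen in which only borderline first-pass scores are escalated. Everything concrete here is an assumption for illustration: the models are stand-in callables returning a relevance probability, and the borderline band (`low`, `high`) and the 0.5 decision cut-off are hypothetical values, not figures from the paper.

```python
from typing import Callable, Dict

Model = Callable[[str], float]  # abstract text -> relevance probability in [0, 1]


def staged_screen(
    abstracts: Dict[str, str],
    cheap_model: Model,
    strong_model: Model,
    low: float = 0.3,
    high: float = 0.7,
) -> Dict[str, bool]:
    """Screen abstracts with a cheap model, escalating only borderline cases."""
    decisions: Dict[str, bool] = {}
    for key, text in abstracts.items():
        score = cheap_model(text)
        if low < score < high:
            # Borderline: pay for the higher-capacity model on this item only.
            score = strong_model(text)
        decisions[key] = score >= 0.5
    return decisions


if __name__ == "__main__":
    cheap = lambda text: 0.9 if "screening" in text else 0.5
    strong = lambda text: 0.2
    papers = {"p1": "LLM screening study", "p2": "unrelated topic"}
    print(staged_screen(papers, cheap, strong))
```

The width of the borderline band controls the cost trade-off: a wider band sends more abstracts to the expensive model.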

[76] EgMM-Corpus: A Multimodal Vision-Language Dataset for Egyptian Culture

Mohamed Gamil, Abdelrahman Elsayed, Abdelrahman Lila, Ahmed Gad, Hesham Abdelgawad, Mohamed Aref, Ahmed Fares

🧩 TL;DR

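This paper introduces EgMM-Corpus, a multimodal dataset dedicated to Egyptian culture with over 3,000 images covering 313 cultural concepts, and uses an evaluation of CLIP on the dataset to expose cultural bias in existing vision-language models.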


📘 Detailed Summary

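Motivation: Multimodal, culturally diverse datasets remain scarce, particularly for the Middle East and Africa, and resources for Egyptian culture are especially lacking, which limits the evaluation and development of vision-language models in cross-cultural settings.

Method: The authors design and run a new data collection pipeline, gathering over 3,000 images covering 313 concepts across landmarks, food, and folklore. Each entry is manually validated for cultural authenticity and multimodal coherence.

Result: In a zero-shot evaluation on EgMM-Corpus, CLIP achieves 21.2% Top-1 accuracy and 36.4% Top-5 accuracy, well below its performance on mainstream datasets.

Conclusion: The results underscore the cultural bias in large-scale vision-language models and the importance of EgMM-Corpus as a benchmark for developing culturally aware models, providing a key resource for cultural diversity in AI systems.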


📄 Abstract

Despite recent advances in AI, multimodal culturally diverse datasets are still limited, particularly for regions in the Middle East and Africa. In this paper, we introduce EgMM-Corpus, a multimodal dataset dedicated to Egyptian culture. By designing and running a new data collection pipeline, we collected over 3,000 images, covering 313 concepts across landmarks, food, and folklore. Each entry in the dataset is manually validated for cultural authenticity and multimodal coherence. EgMM-Corpus aims to provide a reliable resource for evaluating and training vision-language models in an Egyptian cultural context. We further evaluate the zero-shot performance of Contrastive Language-Image Pre-training (CLIP) on EgMM-Corpus, on which it achieves 21.2% Top-1 accuracy and 36.4% Top-5 accuracy in classification. These results underscore the existing cultural bias in large-scale vision-language models and demonstrate the importance of EgMM-Corpus as a benchmark for developing culturally aware models.
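
The Top-1 and Top-5 figures follow the usual definition: a prediction counts as correct if the true concept is among the k highest-scoring candidates. The sketch below shows that computation on toy similarity scores; in a zero-shot CLIP setting each row would hold image-to-text similarities over the candidate concept prompts, but the function name and the numbers here are illustrative only.

```python
from typing import List


def top_k_accuracy(scores: List[List[float]], labels: List[int], k: int) -> float:
    """Fraction of rows whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        top = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        hits += label in top
    return hits / len(labels)


if __name__ == "__main__":
    sims = [
        [0.1, 0.7, 0.2],  # true class 1: ranked first
        [0.5, 0.3, 0.2],  # true class 1: ranked second
        [0.2, 0.3, 0.5],  # true class 0: ranked last
    ]
    truth = [1, 1, 0]
    print(top_k_accuracy(sims, truth, k=1), top_k_accuracy(sims, truth, k=2))
```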

[77] Towards Low-Resource Alignment to Diverse Perspectives with Sparse Feedback

Chu Fei Luo, Samuel Dahan, Xiaodan Zhu

🧩 TL;DR

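This study proposes two methods for pluralistic alignment in a low-resource setting, pluralistic decoding and model steering. With only 50 annotated samples, they improve language models' pluralistic alignment on several high-stakes tasks, reducing false positives and improving distributional alignment to human values.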


📘 Detailed Summary

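Motivation: As language models have a growing impact on society, they need to be aligned with diverse perspectives and reflect nuance in human values. However, the most popular training paradigms assume one optimal answer for every query, leading to generic responses and poor alignment.

Method: The paper proposes two methods for pluralistic alignment in a low-resource setting: pluralistic decoding, which introduces a diversity mechanism to generate multiple response variants, and model steering, which uses a small number of annotated samples to adjust model behavior directly so that it better captures differing viewpoints.

Result: Experiments show that model steering, with only 50 annotated samples, consistently improves over zero-shot and few-shot baselines. The proposed methods decrease false positives in high-stakes tasks such as hate speech detection and misinformation detection and improve distributional alignment to human values on GlobalOpinionQA.

Conclusion: The study underscores the importance of pluralistic alignment and shows that language models can be adapted to different perspectives with lightweight methods, offering a workable path toward more inclusive and nuanced AI systems in value-sensitive applications.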


📄 Abstract

As language models have a greater impact on society, it is important to ensure they are aligned to a diverse range of perspectives and are able to reflect nuance in human values. However, the most popular training paradigms for modern language models often assume there is one optimal answer for every query, leading to generic responses and poor alignment. In this work, we aim to enhance pluralistic alignment of language models in a low-resource setting with two methods: pluralistic decoding and model steering. We empirically demonstrate that model steering offers consistent improvement over zero-shot and few-shot baselines with only 50 annotated samples. Our proposed methods decrease false positives in several high-stakes tasks such as hate speech detection and misinformation detection, and improve the distributional alignment to human values in GlobalOpinionQA. We hope our work highlights the importance of diversity and how language models can be adapted to consider nuanced perspectives.

[78] Probing the Hidden Talent of ASR Foundation Models for L2 English Oral Assessment

Fu-An Chao, Bi-Cheng Yan, Berlin Chen

🧩 TL;DR

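This study explores the potential of the Whisper ASR foundation model for L2 spoken language assessment. By extracting acoustic and linguistic features from its hidden representations and training only a lightweight classifier, it outperforms existing state-of-the-art baselines on the GEPT dataset and shows that the model intrinsically encodes proficiency levels and semantic information.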


📘 Detailed Summary

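Motivation: Prior studies mainly analyze Whisper's transcriptions extrinsically and leave its latent capabilities untapped. This study probes Whisper's hidden representations for L2 spoken language assessment, filling the gap in using a foundation model's internal representations directly for proficiency assessment.

Method: The approach extracts acoustic and linguistic features from Whisper's hidden representations, trains only a lightweight classifier on top of its intermediate and final outputs, and incorporates image and text-prompt information as auxiliary relevance cues to improve performance.

Result: The method achieves strong performance on the GEPT picture-description dataset, outperforming existing cutting-edge baselines, including a multimodal approach, and gains further from image and text-prompt information. In-depth analysis shows that Whisper's embeddings intrinsically encode proficiency-level patterns and semantic aspects of speech.

Conclusion: Even without task-specific fine-tuning, Whisper intrinsically encodes ordinal proficiency patterns and semantic information, highlighting its potential as a strong foundation for spoken language assessment and other spoken language understanding tasks.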


📄 Abstract

In this paper, we explore the untapped potential of Whisper, a well-established automatic speech recognition (ASR) foundation model, in the context of L2 spoken language assessment (SLA). Unlike prior studies that extrinsically analyze transcriptions produced by Whisper, our approach goes a step further to probe its latent capabilities by extracting acoustic and linguistic features from hidden representations. With only a lightweight classifier being trained on top of Whisper's intermediate and final outputs, our method achieves strong performance on the GEPT picture-description dataset, outperforming existing cutting-edge baselines, including a multimodal approach. Furthermore, by incorporating image and text-prompt information as auxiliary relevance cues, we demonstrate additional performance gains. Finally, we conduct an in-depth analysis of Whisper's embeddings, which reveals that, even without task-specific fine-tuning, the model intrinsically encodes both ordinal proficiency patterns and semantic aspects of speech, highlighting its potential as a powerful foundation for SLA and other spoken language understanding tasks.

[79] RAVEN: Robust Advertisement Video Violation Temporal Grounding via Reinforcement Reasoning

Deyi Ji, Yuekui Yang, Haiyang Wu, Shaoping Ma, Tianrun Chen, Lanyun Zhu

🧩 TL;DR

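This paper proposes RAVEN, a framework that combines curriculum reinforcement learning with multimodal large language models to address imprecise temporal grounding, noisy annotations, and limited generalization in advertisement video violation detection. It shows superior performance on both industrial datasets and public benchmarks.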


📘 Detailed Summary

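Motivation: Existing methods for ad video violation detection fall short in precise temporal grounding, handling of noisy annotations, and generalization, making it hard to meet the practical demands of platform compliance.

Method: RAVEN integrates curriculum reinforcement learning with multimodal large language models. It uses a progressive training strategy that combines precisely and coarsely annotated data, applies Group Relative Policy Optimization (GRPO) to develop emergent reasoning abilities, and relies on a multi-level reward mechanism to ensure precise temporal grounding and consistent category prediction.

Result: On industrial datasets and public benchmarks, RAVEN achieves superior performance in violation category accuracy and temporal interval localization. Online A/B testing further validates its practical applicability, with significant improvements in precision and recall, and the model also generalizes strongly.

Conclusion: By jointly designing reinforcement learning with multimodal large language models, RAVEN addresses the key challenges of ad violation detection. Its online deployment validates its practical value, and it offers a new technical path for similar temporal multimodal tasks.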


📄 Abstract

Advertisement (Ad) video violation detection is critical for ensuring platform compliance, but existing methods struggle with precise temporal grounding, noisy annotations, and limited generalization. We propose RAVEN, a novel framework that integrates curriculum reinforcement learning with multimodal large language models (MLLMs) to enhance reasoning and cognitive capabilities for violation detection. RAVEN employs a progressive training strategy, combining precisely and coarsely annotated data, and leverages Group Relative Policy Optimization (GRPO) to develop emergent reasoning abilities without explicit reasoning annotations. A multi-level, sophisticated reward mechanism ensures precise temporal grounding and consistent category prediction. Experiments on industrial datasets and public benchmarks show that RAVEN achieves superior performance in violation category accuracy and temporal interval localization. We also design a pipeline to deploy RAVEN on online Ad services, and online A/B testing further validates its practical applicability, with significant improvements in precision and recall. RAVEN also demonstrates strong generalization, mitigating the catastrophic forgetting issue associated with supervised fine-tuning.
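
As a rough illustration of the two ingredients named above, the sketch below computes a group-relative advantage in the style of GRPO (each sampled response's reward is normalized against the other responses in its group) and a temporal IoU that could serve as one grounding reward. The function names, the example rewards, and the choice of IoU as a reward term are assumptions made for illustration; the abstract does not specify RAVEN's actual reward design.

```python
from statistics import mean, pstdev


def temporal_iou(pred, gold):
    """Intersection-over-union of two (start, end) intervals in seconds.

    Hypothetical grounding reward; RAVEN's real reward terms are not
    described in the abstract.
    """
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for the same ad video.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Responses that score above the group mean receive a positive advantage and those below receive a negative one, so no separate value network is needed to form a baseline.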

[80] MOSAIC: Masked Objective with Selective Adaptation for In-domain Contrastive Learning

Vera Pavlova, Mohammed Makhlouf

🧩 TL;DR

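This paper proposes MOSAIC, a multi-stage framework for domain adaptation of sentence embedding models that jointly optimizes masked language modeling and contrastive learning objectives, learning domain-relevant representations while preserving the original model's robust semantic discrimination.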


📘 Detailed Summary

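Motivation: The study addresses the challenges of adapting large-scale general-domain sentence embedding models to specialized domains, in particular how to learn domain-specific knowledge while retaining the model's original semantic discrimination ability.

Method: MOSAIC uses a multi-stage training strategy that jointly optimizes masked language modeling and contrastive objectives, combined with a selective adaptation mechanism, to learn domain-relevant representations within a unified training pipeline.

Result: In empirical validation on high-resource and low-resource domains, the method improves NDCG@10 by up to 13.4% over strong general-domain baselines, and ablation studies further confirm the importance of each component.

Conclusion: The results show that balanced joint supervision and staged adaptation are critical for domain adaptation, offering an effective solution for specializing sentence embedding models and underscoring the value of joint multi-objective optimization.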


📄 Abstract

We introduce MOSAIC (Masked Objective with Selective Adaptation for In-domain Contrastive learning), a multi-stage framework for domain adaptation of sentence embedding models that incorporates joint domain-specific masked supervision. Our approach addresses the challenges of adapting large-scale general-domain sentence embedding models to specialized domains. By jointly optimizing masked language modeling (MLM) and contrastive objectives within a unified training pipeline, our method enables effective learning of domain-relevant representations while preserving the robust semantic discrimination properties of the original model. We empirically validate our approach on both high-resource and low-resource domains, achieving improvements up to 13.4% in NDCG@10 (Normalized Discounted Cumulative Gain) over strong general-domain baselines. Comprehensive ablation studies further demonstrate the effectiveness of each component, highlighting the importance of balanced joint supervision and staged adaptation.
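
For readers unfamiliar with the reported metric, here is a minimal sketch of NDCG@10 using the common linear-gain form with a log2 discount. It assumes `relevances` holds the graded relevance of every judged document for a query, in the order the retriever ranked them; the paper's exact evaluation setup (gain function, judged pool) is not stated in the abstract.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k ranked items."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

A ranking that places the single relevant document third, for example, scores 0.5 because its gain is discounted by log2(4) = 2 while the ideal ranking has no discount at rank one.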

[81] FinSight: Towards Real-World Financial Deep Research

Jiajie Jin, Yuyao Zhang, Yimeng Xu, Hongjin Qian, Yutao Zhu, Zhicheng Dou

🧩 TL;DR

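This paper proposes FinSight, a novel multi-agent framework for generating high-quality multimodal financial reports. Using a programmable variable space and an iterative vision-enhanced mechanism, it significantly outperforms existing baseline systems in factual accuracy, analytical depth, and presentation quality.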


📘 Detailed Summary

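Motivation: Current AI systems struggle to fully automate professional financial report generation, a process that is both time-consuming and expertise-intensive. This calls for systems that can flexibly collect data, analyze it, and produce high-quality multimodal reports.

Method: FinSight adopts a Code Agent with Variable Memory (CAVM) architecture that unifies external data, designed tools, and agents into a programmable variable space. An Iterative Vision-Enhanced Mechanism progressively refines raw visual outputs into professional financial charts, and a two-stage Writing Framework expands concise Chain-of-Analysis segments into coherent, citation-aware, multimodal reports.

Result: Experiments on a range of company-level and industry-level tasks show that FinSight significantly outperforms all baselines, including leading deep research systems, in factual accuracy, analytical depth, and presentation quality, showing potential for report generation that approaches human-expert quality.

Conclusion: The study demonstrates a viable path toward financial reports that approach human-expert quality, achieving professional-grade financial analysis and visualization through a multi-agent framework and iterative refinement, and providing a useful reference for automated financial analysis systems.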


📄 Abstract

Generating professional financial reports is a labor-intensive and intellectually demanding process that current AI systems struggle to fully automate. To address this challenge, we introduce FinSight (Financial InSight), a novel multi-agent framework for producing high-quality, multimodal financial reports. The foundation of FinSight is the Code Agent with Variable Memory (CAVM) architecture, which unifies external data, designed tools, and agents into a programmable variable space, enabling flexible data collection, analysis, and report generation through executable code. To ensure professional-grade visualization, we propose an Iterative Vision-Enhanced Mechanism that progressively refines raw visual outputs into polished financial charts. Furthermore, a two-stage Writing Framework expands concise Chain-of-Analysis segments into coherent, citation-aware, and multimodal reports, ensuring both analytical depth and structural consistency. Experiments on various company-level and industry-level tasks demonstrate that FinSight significantly outperforms all baselines, including leading deep research systems, in terms of factual accuracy, analytical depth, and presentation quality, demonstrating a clear path toward generating reports that approach human-expert quality.

[82] Does Visual Grounding Enhance the Understanding of Embodied Knowledge in Large Language Models?

Zhihui Yang, Yupei Wang, Kaijie Mo, Zhe Zhao, Renfen Hu

🧩 TL;DR

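This study proposes a multimodal knowledge understanding benchmark grounded in perceptual theory from psychology. Comparing 30 state-of-the-art language models, it finds that vision-language models do not outperform text-only models in understanding embodied knowledge, and that models perform significantly worse in the visual dimension than in other sensory dimensions.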


📘 Detailed Summary

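Motivation: Despite significant progress in multimodal language models, it remains unclear whether visual grounding improves their understanding of embodied knowledge relative to text-only models, so a systematic evaluation of perceptual abilities across sensory modalities is needed.

Method: Based on perceptual theory from psychology, the authors build an embodied knowledge understanding benchmark covering the external senses (visual, auditory, tactile, gustatory, olfactory) and interoception. It assesses perceptual abilities through vector comparison and question-answering tasks with over 1,700 questions.

Result: A comparison of 30 state-of-the-art language models shows that vision-language models do not outperform text-only models on either task, and that models perform significantly worse in the visual dimension than in other sensory dimensions. Vector representations are easily influenced by word form and frequency, and models struggle with questions involving spatial perception and reasoning.

Conclusion: The findings indicate that current language models fall short in integrating embodied knowledge, and that embodied knowledge must be incorporated more effectively to improve their understanding of the physical world.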


📄 Abstract

Despite significant progress in multimodal language models (LMs), it remains unclear whether visual grounding enhances their understanding of embodied knowledge compared to text-only models. To address this question, we propose a novel embodied knowledge understanding benchmark based on the perceptual theory from psychology, encompassing visual, auditory, tactile, gustatory, olfactory external senses, and interoception. The benchmark assesses the models' perceptual abilities across different sensory modalities through vector comparison and question-answering tasks with over 1,700 questions. By comparing 30 state-of-the-art LMs, we surprisingly find that vision-language models (VLMs) do not outperform text-only models in either task. Moreover, the models perform significantly worse in the visual dimension compared to other sensory dimensions. Further analysis reveals that the vector representations are easily influenced by word form and frequency, and the models struggle to answer questions involving spatial perception and reasoning. Our findings underscore the need for more effective integration of embodied knowledge in LMs to enhance their understanding of the physical world.

[83] How News Feels: Understanding Affective Bias in Multilingual Headlines for Human-Centered Media Design

Mohd Ruhul Ameen, Akif Islam, Abu Saleh Musa Miah, Ayesha Siddiqua, Jungpil Shin

🧩 TL;DR

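Through large-scale emotion analysis, this study reveals the dominance of negative emotions in Bengali news and, on that basis, proposes design ideas for a news aggregator that visualizes emotional cues to help readers recognize affective framing bias in news reporting.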


📘 Detailed Summary

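Motivation: News media shape public mood through emotional framing. Negative or emotionally charged headlines tend to attract more attention and spread faster, which pushes outlets toward reporting styles that provoke strong reactions. This study explores that affective bias in Bengali news.

Method: Using zero-shot inference with the Gemma-3 4B model, the authors analyzed 300,000 Bengali news headlines and their content to identify each item's dominant emotion and overall tone.

Result: Negative emotions clearly dominate, particularly anger, fear, and disappointment, and outlets differ significantly in how they emotionally portray similar news events.

Conclusion: Based on the analysis, the authors propose design ideas for a human-centered news aggregator that visualizes emotional cues, helping readers recognize the affective framing hidden in daily news and become more aware of bias in media coverage.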


📄 Abstract

News media often shape the public mood not only by what they report but by how they frame it. The same event can appear calm in one outlet and alarming in another, reflecting subtle emotional bias in reporting. Negative or emotionally charged headlines tend to attract more attention and spread faster, which in turn encourages outlets to frame stories in ways that provoke stronger reactions. This research explores that tendency through large-scale emotion analysis of Bengali news. Using zero-shot inference with Gemma-3 4B, we analyzed 300,000 Bengali news headlines and their content to identify the dominant emotion and overall tone of each. The findings reveal a clear dominance of negative emotions, particularly anger, fear, and disappointment, and significant variation in how similar stories are emotionally portrayed across outlets. Based on these insights, we propose design ideas for a human-centered news aggregator that visualizes emotional cues and helps readers recognize hidden affective framing in daily news.

[84] Addressing Antisocial Behavior in Multi-Party Dialogs Through Multimodal Representation Learning

Hajar Bakarou, Mohamed Sinane El Messoussi, Anaïs Ollagnier

🧩 TL;DR

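This study targets antisocial behavior detection in multi-party conversations on social media and proposes a multimodal fusion approach that achieves the best performance on the French dataset CyberAgressionAdo-Large, handling complex phenomena such as implicit aggression and role transitions particularly well.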


📘 Detailed Summary

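Motivation: Research on antisocial behavior on social media has focused mainly on platforms such as X and Reddit, while multi-party conversational settings remain underexplored because data is scarce. This study aims to fill that gap.

Method: The study benchmarks six text-based and eight graph-based representation-learning methods, analyzes lexical cues and interactional dynamics, and explores multimodal fusion strategies, with a focus on the late fusion model mBERT + WD-SGCN.

Result: Multimodal models outperform unimodal baselines. mBERT + WD-SGCN achieves the top score of 0.718 on abuse detection and competitive scores of 0.286 on peer-group identification and 0.606 on bullying behavior analysis.

Conclusion: The study shows that multimodal fusion can handle complex antisocial behavior phenomena, including implicit aggression, role transitions, and context-dependent hostility, offering a useful technical path for social media safety monitoring.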


📄 Abstract

Antisocial behavior (ASB) on social media, including hate speech, harassment, and cyberbullying, poses growing risks to platform safety and societal well-being. Prior research has focused largely on networks such as X and Reddit, while multi-party conversational settings remain underexplored due to limited data. To address this gap, we use CyberAgressionAdo-Large, a French open-access dataset simulating ASB in multi-party conversations, and evaluate three tasks: abuse detection, bullying behavior analysis, and bullying peer-group identification. We benchmark six text-based and eight graph-based representation-learning methods, analyzing lexical cues, interactional dynamics, and their multimodal fusion. Results show that multimodal models outperform unimodal baselines. The late fusion model mBERT + WD-SGCN achieves the best overall results, with top performance on abuse detection (0.718) and competitive scores on peer-group identification (0.286) and bullying analysis (0.606). Error analysis highlights its effectiveness in handling nuanced ASB phenomena such as implicit aggression, role transitions, and context-dependent hostility.
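
To make "late fusion" concrete, the sketch below combines class probabilities from a text model and a graph model after each has made its own prediction. Weighted averaging is only one common late-fusion rule and is an assumption here; the abstract does not say how mBERT and WD-SGCN outputs are actually combined, and the function names and weights are hypothetical.

```python
def late_fusion(text_probs, graph_probs, text_weight=0.5):
    """Weighted average of per-class probabilities from two models."""
    if len(text_probs) != len(graph_probs):
        raise ValueError("both models must score the same set of classes")
    graph_weight = 1.0 - text_weight
    return [text_weight * t + graph_weight * g
            for t, g in zip(text_probs, graph_probs)]


def predict(fused_probs):
    """Index of the class with the highest fused probability."""
    return max(range(len(fused_probs)), key=fused_probs.__getitem__)


# Hypothetical scores for (non-abusive, abusive) on one message.
fused = late_fusion([0.6, 0.4], [0.2, 0.8])
label = predict(fused)
```

In this toy case the text model alone would label the message non-abusive, while the interaction-graph signal tips the fused decision toward abusive, which is the kind of context-dependent case the abstract's error analysis refers to.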

[85] Towards Mixed-Modal Retrieval for Universal Retrieval-Augmented Generation

Chenghao Zhang, Guanting Dong, Xinyu Yang, Zhicheng Dou

🧩 TL;DR

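This paper proposes Nyx, a unified mixed-modal retriever for Universal Retrieval-Augmented Generation (URAG). By constructing the NyxQA dataset and using a two-stage training framework, it addresses the challenges of mixed-modal retrieval and reasoning.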


📘 Detailed Summary

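Motivation: Existing RAG systems focus mainly on unimodal text retrieval and fall short in real-world scenarios where both queries and documents contain mixed modalities such as text and images. URAG requires retrieving and reasoning over mixed-modal information.

Method: The authors propose Nyx, a unified mixed-modal retriever, and build a four-stage automated pipeline to generate and filter the NyxQA dataset. Training has two stages: pre-training on NyxQA and open-source retrieval datasets, followed by supervised fine-tuning with feedback from downstream vision-language models to align retrieval outputs with generative preferences.

Result: Experiments show that Nyx performs competitively on standard text-only RAG benchmarks and excels in the more general and realistic URAG setting, significantly improving generation quality in vision-language tasks.

Conclusion: The study demonstrates the effectiveness of mixed-modal retrieval for improving vision-language generation, offers a viable solution for real-world multimodal information needs, and opens a new direction for universal retrieval-augmented generation.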


📄 Abstract

Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing large language models (LLMs) by retrieving relevant documents from an external corpus. However, existing RAG systems primarily focus on unimodal text documents, and often fall short in real-world scenarios where both queries and documents may contain mixed modalities (such as text and images). In this paper, we address the challenge of Universal Retrieval-Augmented Generation (URAG), which involves retrieving and reasoning over mixed-modal information to improve vision-language generation. To this end, we propose Nyx, a unified mixed-modal to mixed-modal retriever tailored for URAG scenarios. To mitigate the scarcity of realistic mixed-modal data, we introduce a four-stage automated pipeline for generation and filtering, leveraging web documents to construct NyxQA, a dataset comprising diverse mixed-modal question-answer pairs that better reflect real-world information needs. Building on this high-quality dataset, we adopt a two-stage training framework for Nyx: we first perform pre-training on NyxQA along with a variety of open-source retrieval datasets, followed by supervised fine-tuning using feedback from downstream vision-language models (VLMs) to align retrieval outputs with generative preferences. Experimental results demonstrate that Nyx not only performs competitively on standard text-only RAG benchmarks, but also excels in the more general and realistic URAG setting, significantly improving generation quality in vision-language tasks.

[86] The Atomic Instruction Gap: Instruction-Tuned LLMs Struggle with Simple, Self-Contained Directives

Henry Lim, Kwan Hui Lim

🧩 TL;DR

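This study evaluates 20 instruction-tuned large language models on atomic instruction following. It finds that models show significant bias toward option-label formats and weak instruction adherence, exposing limitations of the current instruction-tuning paradigm.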


📘 Detailed Summary

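Motivation: Although instruction-tuned large language models show strong zero-shot reasoning, their basic ability to execute simple, self-contained instructions remains underexplored, even though it is the foundation of complex instruction following.

Method: On modified MMLU and MMLU-Pro benchmarks, the study systematically varies the option-label format (alphabetic, numeric, Roman) and evaluates 20 IT-LLMs under four paradigms: explicit instructions, no instructions, option contents removed, and three-shot exemplars.

Result: Label-format changes cause large performance shifts (a 30.45% drop for Roman vs. numeric labels). Without instructions, performance drops by up to a further 10.84%. With option contents removed, models fail to beat the random-choice baseline except with numeric labels. Three-shot exemplars do not significantly improve robustness, and larger models are more accurate but still inconsistent in instruction adherence.

Conclusion: The results expose shortcomings of current instruction-tuning paradigms and highlight the need for evaluation methods and training strategies that explicitly target atomic instruction following, in order to strengthen basic instruction understanding and execution.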


📄 Abstract

Instruction-tuned large language models (IT-LLMs) exhibit strong zero-shot reasoning, yet their ability to execute simple, self-contained instructions remains underexplored, despite this being foundational to complex instruction-following. We evaluate 20 IT-LLMs on modified MMLU and MMLU-Pro benchmarks by systematically varying the format of option labels (alphabetic, numeric, Roman) while keeping their meaning identical, under four paradigms: (1) With explicit instructions, label changes cause large performance shifts (e.g., -30.45% for Roman vs. numeric), revealing instruction-format bias. (2) Without instructions, performance drops further (up to -10.84%) and label sensitivity intensifies, underscoring the role of explicit guidance. (3) When option contents are removed, models fail random-choice baselines except with numeric labels, suggesting weak adherence to atomic directives. (4) Three-shot exemplars yield no significant gains in robustness or fidelity, and generation analyses show persistent label errors, especially for non-numeric formats. Across model sizes, larger LLMs achieve higher accuracy but remain inconsistent in instruction adherence. These results expose the insufficiencies of current instruction-tuning paradigms and highlight the need for evaluation methods and training strategies that explicitly target atomic instruction-following.
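
The label manipulation described above can be sketched as follows: the same question and options are rendered with alphabetic, numeric, or Roman labels, so only the label format changes. The function names and the `label. option` layout are assumptions for illustration; the paper's actual prompt templates are not given in the abstract.

```python
import string

# Roman numerals for up to ten options (MMLU-Pro questions can have ten).
ROMAN = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX", "X"]


def make_labels(style, n):
    """Return n option labels in the requested style."""
    if style == "alphabetic":
        return list(string.ascii_uppercase[:n])
    if style == "numeric":
        return [str(i) for i in range(1, n + 1)]
    if style == "roman":
        return ROMAN[:n]
    raise ValueError(f"unknown label style: {style}")


def format_question(question, options, style):
    """Render one multiple-choice item with the chosen label style."""
    labels = make_labels(style, len(options))
    lines = [question] + [f"{label}. {option}"
                          for label, option in zip(labels, options)]
    return "\n".join(lines)
```

Rendering every item three ways and comparing accuracy per style isolates sensitivity to the label format from knowledge of the answer.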

[87] AFRICAPTION: Establishing a New Paradigm for Image Captioning in African Languages

Mardiyyah Oduwole, Prince Mireku, Fatimo Adebanjo, Oluwatosin Olajide, Mahi Aminu Aliyu, Jekaterina Novikova

🧩 TL;DR

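This paper presents AfriCaption, the first multilingual image captioning framework for 20 African languages. With a curated dataset, a dynamic quality-assurance pipeline, and a 0.5B-parameter model, it establishes scalable multimodal AI infrastructure for low-resource languages.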


📘 Detailed Summary

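Motivation: Multimodal AI research is heavily concentrated on high-resource languages, which hinders the democratization of progress in the field. African languages in particular severely lack representative resources and technical support for image captioning.

Method: The authors build a semantically aligned multilingual dataset on Flickr8k using a context-aware selection and translation process; develop a dynamic, context-preserving pipeline that maintains data quality through model ensembling and adaptive substitution; and design the 0.5B-parameter AfriCaption model, which integrates SigLIP and NLLB200 for cross-lingual caption generation.

Result: The framework establishes the first scalable image-captioning resource for low-resource African languages, maintains ongoing data quality through a unified architecture, and provides multimodal AI support for 20 under-represented African languages.

Conclusion: AfriCaption lays the groundwork for truly inclusive multimodal AI, addresses the language diversity gap with a systematic approach, and demonstrates the feasibility of building high-quality multimodal systems for low-resource languages.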


📄 Abstract

Multimodal AI research has overwhelmingly focused on high-resource languages, hindering the democratization of advancements in the field. To address this, we present AfriCaption, a comprehensive framework for multilingual image captioning in 20 African languages. Our contributions are threefold: (i) a curated dataset built on Flickr8k, featuring semantically aligned captions generated via a context-aware selection and translation process; (ii) a dynamic, context-preserving pipeline that ensures ongoing quality through model ensembling and adaptive substitution; and (iii) the AfriCaption model, a 0.5B parameter vision-to-text architecture that integrates SigLIP and NLLB200 for caption generation across under-represented languages. This unified framework ensures ongoing data quality and establishes the first scalable image-captioning resource for under-represented African languages, laying the groundwork for truly inclusive multimodal AI.

[88] BenCao: An Instruction-Tuned Large Language Model for Traditional Chinese Medicine

Jiacheng Xie, Yang Yu, Yibo Chen, Hanyao Zhang, Lening Zhao, Jiaxuan He, Lei Jiang, Xiaoting Tang, Guanghui An, Dong Xu

🧩 TL;DR

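This study develops BenCao, a ChatGPT-based multimodal assistant for Traditional Chinese Medicine (TCM). Built through natural language instruction tuning rather than parameter retraining, it integrates structured knowledge bases, diagnostic data, and expert feedback, and it outperforms general-domain and TCM-domain models on TCM question answering and diagnostic tasks.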


📘 Detailed Summary

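Motivation: TCM relies on holistic reasoning, implicit logic, and multimodal diagnostic cues. Existing TCM-domain LLMs have made progress in text understanding but lack multimodal integration, interpretability, and clinical applicability, so a system is needed that integrates multimodal information and conforms to TCM expert reasoning and ethical norms.

Method: The system incorporates a comprehensive knowledge base of over 1,000 classical and modern texts, a scenario-based instruction framework for diverse interactions, a chain-of-thought simulation mechanism for interpretable reasoning, and a feedback refinement process involving licensed TCM practitioners. It also connects to external APIs for tongue-image classification and multimodal database retrieval.

Result: In evaluations on single-choice question benchmarks and multimodal classification tasks, BenCao achieves higher accuracy than general-domain and TCM-domain models in diagnostics, herb recognition, and constitution classification. The system is deployed as an interactive application on the OpenAI GPTs Store and had been accessed by nearly 1,000 users globally as of October 2025.

Conclusion: The study demonstrates the feasibility of developing a TCM-domain LLM through natural language instruction tuning and multimodal integration, providing a practical framework for aligning generative AI with traditional medical reasoning and a scalable pathway for real-world deployment.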


📄 Abstract

Traditional Chinese Medicine (TCM), with a history spanning over two millennia, plays a role in global healthcare. However, applying large language models (LLMs) to TCM remains challenging due to its reliance on holistic reasoning, implicit logic, and multimodal diagnostic cues. Existing TCM-domain LLMs have made progress in text-based understanding but lack multimodal integration, interpretability, and clinical applicability. To address these limitations, we developed BenCao, a ChatGPT-based multimodal assistant for TCM, integrating structured knowledge bases, diagnostic data, and expert feedback refinement. BenCao was trained through natural language instruction tuning rather than parameter retraining, aligning with expert-level reasoning and ethical norms specific to TCM. The system incorporates a comprehensive knowledge base of over 1,000 classical and modern texts, a scenario-based instruction framework for diverse interactions, a chain-of-thought simulation mechanism for interpretable reasoning, and a feedback refinement process involving licensed TCM practitioners. BenCao connects to external APIs for tongue-image classification and multimodal database retrieval, enabling dynamic access to diagnostic resources. In evaluations across single-choice question benchmarks and multimodal classification tasks, BenCao achieved superior accuracy to general-domain and TCM-domain models, particularly in diagnostics, herb recognition, and constitution classification. The model was deployed as an interactive application on the OpenAI GPTs Store, accessed by nearly 1,000 users globally as of October 2025. This study demonstrates the feasibility of developing a TCM-domain LLM through natural language-based instruction tuning and multimodal integration, offering a practical framework for aligning generative AI with traditional medical reasoning and a scalable pathway for real-world deployment.

[89] PANER: A Paraphrase-Augmented Framework for Low-Resource Named Entity Recognition

Nanda Kumar Rengarajan, Jun Yan, Chun Wang

🧩 TL;DR

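This paper proposes a lightweight few-shot named entity recognition framework that, through a new instruction-tuning template and a strategic data augmentation technique that preserves entity information, achieves performance comparable to state-of-the-art models in low-resource scenarios.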


📘 Detailed Summary

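Motivation: Named entity recognition requires substantial annotated data, yet labels are expensive to obtain in low-resource scenarios. Existing zero-shot and instruction-tuned approaches often fail to generalize to domain-specific entities and do not make effective use of the limited data available.

Method: The framework has two key innovations: a new instruction-tuning template with a simplified output format that combines principles from prior IT approaches and exploits the large context windows of recent LLMs, and a strategic data augmentation technique that paraphrases the surrounding context while preserving entity information, expanding the training data without compromising semantic relationships.

Result: Experiments on benchmark datasets show performance comparable to state-of-the-art models on few-shot and zero-shot tasks, with the few-shot approach reaching an average F1 score of 80.1 on the CrossNER datasets. Models trained with the paraphrasing approach show consistent F1 improvements of up to 17 points over baseline versions.

Conclusion: The work offers a promising solution for groups with limited NER training data and compute, demonstrates the effectiveness of a lightweight framework in low-resource scenarios, and points to a new direction for few-shot NER.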


📄 Abstract

Named Entity Recognition (NER) is a critical task that requires substantial annotated data, making it challenging in low-resource scenarios where label acquisition is expensive. While zero-shot and instruction-tuned approaches have made progress, they often fail to generalize to domain-specific entities and do not effectively utilize limited available data. We present a lightweight few-shot NER framework that addresses these challenges through two key innovations: (1) a new instruction tuning template with a simplified output format that combines principles from prior IT approaches to leverage the large context window of recent state-of-the-art LLMs; (2) a strategic data augmentation technique that preserves entity information while paraphrasing the surrounding context, thereby expanding our training data without compromising semantic relationships. Experiments on benchmark datasets show that our method achieves performance comparable to state-of-the-art models on few-shot and zero-shot tasks, with our few-shot approach attaining an average F1 score of 80.1 on the CrossNER datasets. Models trained with our paraphrasing approach show consistent improvements in F1 scores of up to 17 points over baseline versions, offering a promising solution for groups with limited NER training data and compute power.
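
One way to picture entity-preserving augmentation is as a filter over candidate paraphrases: a paraphrase is kept only if every annotated entity surface form still appears in it, so the original labels can be carried over. This is a simplified sketch under that assumption; the paper's actual paraphrasing and filtering procedure is not described in the abstract, and the function names and example sentences are hypothetical.

```python
def preserves_entities(paraphrase, entities):
    """True if every entity surface form appears verbatim in the paraphrase."""
    return all(surface in paraphrase for surface, _label in entities)


def augment(sentence, entities, candidate_paraphrases):
    """Keep paraphrases that differ from the source and retain all entities."""
    return [p for p in candidate_paraphrases
            if p != sentence and preserves_entities(p, entities)]


# Hypothetical example; in practice the candidates would come from a paraphraser.
sentence = "Marie Curie worked in Paris."
entities = [("Marie Curie", "PER"), ("Paris", "LOC")]
candidates = [
    "In Paris, Marie Curie carried out her work.",
    "She worked in France.",
    "Marie Curie worked in Paris.",
]
kept = augment(sentence, entities, candidates)
```

Only the first candidate survives: the second loses both entities and the third is identical to the source, so neither adds usable training signal.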

cs.AI [Back]

MingSheng Li, Guangze Zhao, Sichen Liu

🧩 TL;DR

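VisuoAlign is a framework for multimodal safety alignment via prompt-guided tree search. It embeds safety constraints into the reasoning process, uses Monte Carlo Tree Search to construct safety-critical prompt trajectories, and introduces prompt-based scaling, significantly improving the robustness of large vision-language models against complex cross-modal threats.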


📘 Detailed Summary

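Motivation: Large vision-language models have made remarkable progress in multimodal perception and generation, but their safety alignment remains a critical challenge. Existing defenses are vulnerable to multimodal jailbreaks: visual inputs introduce new attack surfaces, reasoning chains lack safety supervision, and alignment often degrades under modality fusion.

Method: VisuoAlign embeds safety constraints into the reasoning process through visual-textual interactive prompts, uses Monte Carlo Tree Search (MCTS) to systematically construct diverse safety-critical prompt trajectories, and introduces prompt-based scaling to ensure real-time risk detection and compliant responses.

Result: Extensive experiments show that VisuoAlign proactively exposes risks, enables comprehensive dataset generation, and significantly improves the robustness of large vision-language models against complex cross-modal threats.

Conclusion: The work offers an effective approach to multimodal safety alignment, strengthening safe reasoning through a systematic prompt-guided tree search framework and laying groundwork for the safe deployment of future multimodal systems.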


📄 Abstract

Large Vision-Language Models (LVLMs) have achieved remarkable progress in multimodal perception and generation, yet their safety alignment remains a critical challenge. Existing defenses are vulnerable to multimodal jailbreaks, as visual inputs introduce new attack surfaces, reasoning chains lack safety supervision, and alignment often degrades under modality fusion. To overcome these limitations, we propose VisuoAlign, a framework for multi-modal safety alignment via prompt-guided tree search. VisuoAlign embeds safety constraints into the reasoning process through visual-textual interactive prompts, employs Monte Carlo Tree Search (MCTS) to systematically construct diverse safety-critical prompt trajectories, and introduces prompt-based scaling to ensure real-time risk detection and compliant responses. Extensive experiments demonstrate that VisuoAlign proactively exposes risks, enables comprehensive dataset generation, and significantly improves the robustness of LVLMs against complex cross-modal threats.
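
For orientation, the selection step of a generic Monte Carlo Tree Search is often implemented with the UCT rule, which trades off a node's average value against how rarely it has been visited. The sketch below shows that standard rule only; whether VisuoAlign uses UCT, and how it scores prompt trajectories, is not stated in the abstract, so the names and the exploration constant are assumptions.

```python
import math


def uct_score(total_value, visits, parent_visits, c=1.4):
    """Standard UCT: mean value plus an exploration bonus."""
    if visits == 0:
        return math.inf  # always try unvisited children first
    exploit = total_value / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children, parent_visits, c=1.4):
    """children: list of (name, total_value, visits); returns the chosen name."""
    best = max(children,
               key=lambda ch: uct_score(ch[1], ch[2], parent_visits, c))
    return best[0]
```

In a prompt-search setting each child would correspond to a candidate next prompt, and the value would come from some risk or safety score of the resulting trajectory.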

[91] Reliability of Large Language Model Generated Clinical Reasoning in Assisted Reproductive Technology: Blinded Comparative Evaluation Study

Dou Liu, Ying Long, Sophia Zuoqiu, Di Liu, Kang Li, Yiting Lin, Hanyi Liu, Rong Yin, Tian Tang

🧩 TL;DR

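This study shows that a carefully designed prompting strategy (Selective Few-shot) can generate high-quality clinical chains of thought, significantly outperforming Zero-shot and Random Few-shot approaches, and it proposes a "Dual Principles" framework based on "Gold-Standard Depth" and "Representative Diversity" to address data scarcity in medical AI.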


📘 Detailed Summary

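Motivation: Creating high-quality clinical chains of thought (CoTs) for medical AI is constrained by data scarcity. Large language models can synthesize medical data, but their clinical reliability has not been verified, so the reliability of LLM-generated CoTs must be evaluated and prompting strategies that improve their quality explored.

Method: In a blinded comparative study, senior clinicians in Assisted Reproductive Technology evaluated CoTs generated with three prompting strategies: Zero-shot, Random Few-shot (using shallow examples), and Selective Few-shot (using diverse, high-quality examples). The expert ratings were compared against evaluations from a state-of-the-art AI model (GPT-4o).

Result: The Selective Few-shot strategy significantly outperformed the other strategies on all human evaluation metrics (p < .001), while Random Few-shot offered no significant improvement over the Zero-shot baseline, indicating that low-quality examples are as ineffective as no examples. The AI evaluator failed to discern these critical performance differences. The success of the Selective strategy is attributed to two principles: "Gold-Standard Depth" and "Representative Diversity".

Conclusion: The clinical reliability of synthetic CoTs depends on strategic prompt curation rather than the mere presence of examples. The proposed "Dual Principles" framework provides a foundational methodology for generating trustworthy data at scale, offers a validated solution to the data bottleneck, and confirms the indispensable role of human expertise in evaluating high-stakes clinical AI.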


📄 Abstract

Creating high-quality clinical Chains-of-Thought (CoTs) is crucial for explainable medical Artificial Intelligence (AI) but is constrained by data scarcity. Although Large Language Models (LLMs) can synthesize medical data, their clinical reliability remains unverified. This study evaluates the reliability of LLM-generated CoTs and investigates prompting strategies to enhance their quality. In a blinded comparative study, senior clinicians in Assisted Reproductive Technology (ART) evaluated CoTs generated via three distinct strategies: Zero-shot, Random Few-shot (using shallow examples), and Selective Few-shot (using diverse, high-quality examples). These expert ratings were compared against evaluations from a state-of-the-art AI model (GPT-4o). The Selective Few-shot strategy significantly outperformed other strategies across all human evaluation metrics (p < .001). Critically, the Random Few-shot strategy offered no significant improvement over the Zero-shot baseline, demonstrating that low-quality examples are as ineffective as no examples. The success of the Selective strategy is attributed to two principles: "Gold-Standard Depth" (reasoning quality) and "Representative Diversity" (generalization). Notably, the AI evaluator failed to discern these critical performance differences. The clinical reliability of synthetic CoTs is dictated by strategic prompt curation, not the mere presence of examples. We propose a "Dual Principles" framework as a foundational methodology to generate trustworthy data at scale. This work offers a validated solution to the data bottleneck and confirms the indispensable role of human expertise in evaluating high-stakes clinical AI.

[92] Beyond Fixed Anchors: Precisely Erasing Concepts with Sibling Exclusive Counterparts

Tong Zhang, Ru Zhang, Jianyi Liu, Zhen Yang, Gongshen Liu

🧩 TL;DR

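This paper proposes SELECT, a dynamic anchor selection framework that addresses anchor sensitivity in concept erasure for text-to-image diffusion models. Causal tracing reveals the limitations of fixed anchor strategies, and Sibling Exclusive Concepts are introduced as a superior class of anchors.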


📘 Detailed Summary

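Motivation: Existing concept erasure methods for text-to-image diffusion models typically rely on fixed anchor strategies, which lead to critical issues such as concept re-emergence and erosion. Causal tracing shows that erasure is inherently sensitive to anchor selection, so a smarter anchor selection mechanism is needed.

Method: SELECT is a dynamic anchor selection framework with a novel two-stage evaluation mechanism. It automatically discovers optimal anchors for precise erasure while identifying critical boundary anchors to preserve related concepts, and it defines Sibling Exclusive Concepts as a superior class of anchors.

Result: Extensive evaluations show that SELECT, as a universal anchor solution, adapts efficiently to multiple erasure frameworks and consistently outperforms existing baselines on key performance metrics. Anchor mining for a single concept takes only 4 seconds on average.

Conclusion: The study reveals the key role of anchor selection in concept erasure and demonstrates the advantage of dynamic anchor strategies over fixed ones. SELECT offers an effective solution to anchor sensitivity in concept erasure and provides useful insights for future work.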


📄 Abstract

Existing concept erasure methods for text-to-image diffusion models commonly rely on fixed anchor strategies, which often lead to critical issues such as concept re-emergence and erosion. To address this, we conduct causal tracing to reveal the inherent sensitivity of erasure to anchor selection and define Sibling Exclusive Concepts as a superior class of anchors. Based on this insight, we propose SELECT (Sibling-Exclusive Evaluation for Contextual Targeting), a dynamic anchor selection framework designed to overcome the limitations of fixed anchors. Our framework introduces a novel two-stage evaluation mechanism that automatically discovers optimal anchors for precise erasure while identifying critical boundary anchors to preserve related concepts. Extensive evaluations demonstrate that SELECT, as a universal anchor solution, not only efficiently adapts to multiple erasure frameworks but also consistently outperforms existing baselines across key performance metrics, averaging only 4 seconds for anchor mining of a single concept.

[93] Urban-R1: Reinforced MLLMs Mitigate Geospatial Biases for Urban General Intelligence

Qiongyan Wang, Xingchen Zou, Yutian Jiang, Haomin Wen, Jiaheng Wei, Qingsong Wen, Yuxuan Liang

🧩 TL;DR

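This paper proposes Urban-R1, a reinforcement learning-based post-training framework that mitigates geospatial bias in multimodal large language models for Urban General Intelligence. It applies Group Relative Policy Optimization across geographic groups and uses an urban region profiling task, significantly improving cross-region generalization.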


📘 Detailed Summary

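Motivation: Rapid urbanization intensifies the demand for Urban General Intelligence, but existing urban foundation models built with supervised fine-tuning exhibit persistent geospatial bias, producing regionally skewed predictions and limited generalization. Fair AI systems that can understand complex urban environments are needed.

Method: Urban-R1 is a reinforcement learning-based post-training framework. It uses Group Relative Policy Optimization (GRPO) to optimize reasoning across geographic groups and employs urban region profiling as a proxy task to provide measurable reward signals from multimodal urban data.

Result: Extensive experiments across multiple regions and tasks show that Urban-R1 effectively mitigates geospatial bias and significantly improves cross-region generalization, outperforming both SFT-trained and closed-source models across metrics.

Conclusion: The results suggest that reinforcement learning alignment is a promising path toward equitable and trustworthy urban intelligence, providing a technical framework and validation for building urban AI systems free of geospatial bias.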


📄 Abstract

Rapid urbanization intensifies the demand for Urban General Intelligence (UGI), referring to AI systems that can understand and reason about complex urban environments. Recent studies have built urban foundation models using supervised fine-tuning (SFT) of LLMs and MLLMs, yet these models exhibit persistent geospatial bias, producing regionally skewed predictions and limited generalization. To this end, we propose Urban-R1, a reinforcement learning-based post-training framework that aligns MLLMs with the objectives of UGI. Urban-R1 adopts Group Relative Policy Optimization (GRPO) to optimize reasoning across geographic groups and employs urban region profiling as a proxy task to provide measurable rewards from multimodal urban data. Extensive experiments across diverse regions and tasks show that Urban-R1 effectively mitigates geo-bias and improves cross-region generalization, outperforming both SFT-trained and closed-source models. Our results highlight reinforcement learning alignment as a promising pathway toward equitable and trustworthy urban intelligence.

[94] Ripple Effect Protocol: Coordinating Agent Populations

Ayush Chopra, Aman Sharma, Feroz Ahmad, Luca Muscariello, Vijoy Pandey, Ramesh Raskar

🧩 TL;DR

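This paper proposes the Ripple Effect Protocol (REP), a coordination protocol in which agents share decisions and lightweight sensitivity signals to improve group coordination. Compared with the A2A protocol, it improves coordination accuracy and efficiency by 41% to 100% across multiple domains.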


📘 Detailed Summary

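Motivation: Modern AI agents communicate using protocols such as A2A and ACP, but these mechanisms emphasize communication over coordination. As agent populations grow, this limitation produces brittle collective behavior, where individually capable agents still converge on poor group outcomes.

Method: In REP, agents share not only their decisions but also lightweight sensitivities, that is, signals expressing how their choices would change if key environmental variables shifted. These sensitivities propagate through local networks, enabling groups to align faster and more stably than with agent-centric communication alone.

Result: In benchmarks across three domains, namely supply chain cascades (Beer Game), preference aggregation in sparse networks (Movie Scheduling), and sustainable resource allocation (Fishbanks), REP improves coordination accuracy and efficiency over A2A by 41% to 100%, while flexibly handling multimodal sensitivity signals from LLMs.

Conclusion: By making coordination a protocol-level capability, REP provides scalable infrastructure for the emerging Internet of Agents, enabling agent populations to make collective decisions and allocate resources more efficiently and stably.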


📄 Abstract

Modern AI agents can exchange messages using protocols such as A2A and ACP, yet these mechanisms emphasize communication over coordination. As agent populations grow, this limitation produces brittle collective behavior, where individually smart agents converge on poor group outcomes. We introduce the Ripple Effect Protocol (REP), a coordination protocol in which agents share not only their decisions but also lightweight sensitivities: signals expressing how their choices would change if key environmental variables shifted. These sensitivities ripple through local networks, enabling groups to align faster and more stably than with agent-centric communication alone. We formalize REP's protocol specification, separating required message schemas from optional aggregation rules, and evaluate it across scenarios with varying incentives and network topologies. Benchmarks across three domains, (i) supply chain cascades (Beer Game), (ii) preference aggregation in sparse networks (Movie Scheduling), and (iii) sustainable resource allocation (Fishbanks), show that REP improves coordination accuracy and efficiency over A2A by 41 to 100%, while flexibly handling multimodal sensitivity signals from LLMs. By making coordination a protocol-level capability, REP provides scalable infrastructure for the emerging Internet of Agents.
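
The split between a required message schema and optional aggregation rules can be pictured with a small data structure. Everything below is a hypothetical sketch: the field names, the numeric (linear) form of a sensitivity, and the mean-based aggregation are assumptions for illustration, since the abstract does not give REP's actual schema and notes that sensitivities from LLMs may be multimodal rather than numeric.

```python
from dataclasses import dataclass, field


@dataclass
class RepMessage:
    """Hypothetical REP-style message: a decision plus its sensitivities."""
    agent_id: str
    decision: float  # e.g. an order quantity in a Beer Game-like setting
    # variable name -> assumed change in decision per unit change in it
    sensitivities: dict[str, float] = field(default_factory=dict)


def adjusted_decision(msg, shifts):
    """Decision the sender would make if the given variables shifted."""
    return msg.decision + sum(msg.sensitivities.get(var, 0.0) * delta
                              for var, delta in shifts.items())


def mean_sensitivity(messages, variable):
    """One possible aggregation rule: average neighbors' sensitivities."""
    values = [m.sensitivities[variable]
              for m in messages if variable in m.sensitivities]
    return sum(values) / len(values) if values else 0.0


neighbors = [
    RepMessage("retailer", 10.0, {"demand": 2.0}),
    RepMessage("wholesaler", 12.0, {"demand": 1.0}),
]
```

An agent receiving these messages can anticipate how its neighbors will react to a demand shift before the shift happens, instead of waiting to observe their next decisions.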

[95] Foundation and Large-Scale AI Models in Neuroscience: A Comprehensive Review

Shihao Yang, Xiying Huang, Danilo Bernardo, Jun-En Ding, Andrew Michael, Jingmei Yang, Patrick Kwan, Ashish Raj, Feng Liu

🧩 TL;DR

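This paper reviews the transformative impact of large-scale AI models on neuroscience. It shows how these models use end-to-end learning from raw brain signals to address key computational neuroscience challenges, such as multimodal neural data integration and spatiotemporal pattern interpretation, and it discusses the two-way interaction between neuroscience and AI.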


📘 Detailed Summary

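Motivation: Traditional computational methods have limitations when handling neuroscience data. Major computational neuroscience challenges remain, including multimodal neural data integration, spatiotemporal pattern interpretation, and translational frameworks for clinical deployment, and large-scale AI models offer a new way to address them.

Method: The review covers large-scale AI models that learn end to end, extracting features directly from raw brain signals and neural data, and that incorporate biologically informed architectural constraints to yield more interpretable and computationally efficient models. It spans neuroimaging data processing, brain-computer interfaces, molecular neuroscience, and other areas.

Result: Large-scale AI models show notable results in five major neuroscience domains: neuroimaging and data processing, brain-computer interfaces and neural decoding, molecular neuroscience and genomic modeling, clinical assistance and translational frameworks, and disease-specific applications across neurological and psychiatric disorders. They address key problems such as multimodal data integration and spatiotemporal pattern interpretation.

Conclusion: Large-scale AI models bring a paradigm shift to neuroscience research, but successful deployment requires rigorous evaluation frameworks, effective integration of domain knowledge, and comprehensive ethical guidelines for clinical use. The interaction between neuroscience and AI is also increasingly reciprocal, with biological constraints helping to produce more interpretable models.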


📄 Abstract

The advent of large-scale artificial intelligence (AI) models has a transformative effect on neuroscience research, which represents a paradigm shift from the traditional computational methods through the facilitation of end-to-end learning from raw brain signals and neural data. In this paper, we explore the transformative effects of large-scale AI models on five major neuroscience domains: neuroimaging and data processing, brain-computer interfaces and neural decoding, molecular neuroscience and genomic modeling, clinical assistance and translational frameworks, and disease-specific applications across neurological and psychiatric disorders. These models are demonstrated to address major computational neuroscience challenges, including multimodal neural data integration, spatiotemporal pattern interpretation, and the derivation of translational frameworks for clinical deployment. Moreover, the interaction between neuroscience and AI has become increasingly reciprocal, as biologically informed architectural constraints are now incorporated to develop more interpretable and computationally efficient models. This review highlights both the notable promise of such technologies and key implementation considerations, with particular emphasis on rigorous evaluation frameworks, effective domain knowledge integration, and comprehensive ethical guidelines for clinical use. Finally, a systematic listing of critical neuroscience datasets used to derive and validate large-scale AI models across diverse research applications is provided.

[96] ELMM: Efficient Lightweight Multimodal Large Language Models for Multimodal Knowledge Graph Completion

Wei Huang, Peining Li, Meiyu Liang, Xu Hou, Junping Du, Yingxia Shao, Guanhua Ye, Wu Liu, Kangkang Lu, Yang Yu

🧩 TL;DR

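This paper proposes ELMM, which uses a multi-view visual token compressor and an attention pruning strategy to address redundant image tokens and high computational cost in multimodal knowledge graph completion, substantially improving computational efficiency while preserving performance.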


📘 Detailed Summary

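Motivation: Multimodal knowledge graphs are often incomplete, and existing approaches face two major challenges when applying multimodal large language models to completion: the large number of image tokens introduces semantic noise and modality conflicts, and processing long token inputs is computationally expensive.

Method: The ELMM framework includes a Multi-view Visual Token Compressor based on multi-head attention, which adaptively compresses image tokens from both textual and visual views to reduce redundancy and avoid modality conflicts. It also applies an attention pruning strategy that removes redundant attention layers, with a linear projection to compensate for the performance loss caused by pruning.

Result: Extensive experiments on the FB15k-237-IMG and WN18-IMG benchmarks show that ELMM achieves state-of-the-art performance while substantially improving computational efficiency, establishing a new paradigm for multimodal knowledge graph completion.

Conclusion: The study shows that effective token compression and model pruning can greatly reduce computational overhead while maintaining multimodal knowledge graph completion performance, offering a practical route to efficiency in real-world applications.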


📄 Abstract

Multimodal Knowledge Graphs (MKGs) extend traditional knowledge graphs by incorporating visual and textual modalities, enabling richer and more expressive entity representations. However, existing MKGs often suffer from incompleteness, which hinders their effectiveness in downstream tasks. Therefore, the multimodal knowledge graph completion (MKGC) task is receiving increasing attention. While large language models (LLMs) have shown promise for knowledge graph completion (KGC), their application to the multimodal setting remains underexplored. Moreover, applying Multimodal Large Language Models (MLLMs) to the task of MKGC introduces significant challenges: (1) the large number of image tokens per entity leads to semantic noise and modality conflicts, and (2) processing large token inputs incurs a high computational cost. To address these issues, we propose Efficient Lightweight Multimodal Large Language Models (ELMM) for MKGC. ELMM proposes a Multi-view Visual Token Compressor (MVTC) based on a multi-head attention mechanism, which adaptively compresses image tokens from both textual and visual views, thereby effectively reducing redundancy while retaining necessary information and avoiding modality conflicts. Additionally, we design an attention pruning strategy to remove redundant attention layers from MLLMs, thereby significantly reducing the inference cost. We further introduce a linear projection to compensate for the performance degradation caused by pruning. Extensive experiments on the FB15k-237-IMG and WN18-IMG benchmarks demonstrate that ELMM achieves state-of-the-art performance while substantially improving computational efficiency, establishing a new paradigm for multimodal knowledge graph completion.
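As a rough illustration of attention-based token compression, the sketch below pools many image tokens into a few outputs using single-head scaled dot-product cross-attention. It is a simplification rather than the paper's MVTC: the abstract does not detail the compressor, so the fixed `queries`, the single head, and the pure-Python representation are all assumptions.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compress_tokens(
    queries: list[list[float]], tokens: list[list[float]]
) -> list[list[float]]:
    """Pool len(tokens) input tokens into len(queries) output tokens.

    Each query attends over all tokens (scaled dot-product attention) and
    returns the attention-weighted average of the token vectors.
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    compressed = []
    for query in queries:
        scores = [
            sum(q * t for q, t in zip(query, token)) / scale for token in tokens
        ]
        weights = softmax(scores)
        compressed.append(
            [sum(w * token[i] for w, token in zip(weights, tokens)) for i in range(dim)]
        )
    return compressed
```

With far fewer queries than image tokens, the downstream model sees a much shorter sequence, which is where the compute saving would come from in a scheme of this kind.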

[97] End-to-end Listen, Look, Speak and Act

Siyin Wang, Wenyi Yu, Xianzhao Chen, Xiaohai Tian, Jun Zhang, Lu Lu, Chao Zhang

🧩 TL;DR

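This paper presents ELLSA, which the authors describe as, to their knowledge, the first full-duplex, end-to-end model that simultaneously perceives and generates across vision, text, speech, and action within a single architecture, enabling more natural human-computer interaction.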


📘 Detailed Summary

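Motivation: Human interaction is inherently multimodal and full-duplex. Existing models struggle to perceive and generate at the same time, so they cannot support natural interaction patterns such as fluid turn-taking and interruptions.

Method: The paper proposes the SA-MoE (Self-Attention Mixture-of-Experts) architecture, which routes each modality to specialized experts and fuses them through a unified attention backbone, enabling joint multimodal perception and concurrent generation.

Result: On speech-interaction and robot-manipulation benchmarks, ELLSA matches modality-specific baselines while supporting advanced multimodal and full-duplex behaviors, including dialogue turn-taking, rejection of defective instructions, and speaking while acting.

Conclusion: ELLSA represents a step toward more natural and general interactive intelligence and contributes to the pursuit of artificial general intelligence. Its unified architecture offers a scalable solution for multimodal systems.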


📄 Abstract

Human interaction is inherently multimodal and full-duplex: we listen while watching, speak while acting, and fluidly adapt to turn-taking and interruptions. Realizing these capabilities is essential for building models simulating humans. We present ELLSA (End-to-end Listen, Look, Speak and Act), which, to our knowledge, is the first full-duplex, end-to-end model that simultaneously perceives and generates across vision, text, speech, and action within a single architecture, enabling interaction patterns previously out of reach, yielding more natural, human-like behaviors. At its core is a novel SA-MoE architecture (Self-Attention Mixture-of-Experts) that routes each modality to specialized experts and fuses them through a unified attention backbone. This provides a generalizable solution for joint multimodal perception and concurrent generation, leveraging strong pre-trained components while enabling efficient modality integration and mitigating modality interference. On speech-interaction and robot-manipulation benchmarks, ELLSA matches modality-specific baselines, while uniquely supporting advanced multimodal and full-duplex behaviors such as dialogue and action turn-taking, defective instruction rejection, speaking-while-acting, context-grounded visual question answering, and action barge-ins. We contend that ELLSA represents a step toward more natural and general interactive intelligence, contributing to the broader pursuit of artificial general intelligence. All data, code and model checkpoints will be released upon acceptance.

[98] See or Say Graphs: Agent-Driven Scalable Graph Understanding with Vision-Language Models

Shuo Han, Yukun Cao, Zezhong Ding, Zengyi Gao, S Kevin Zhou, Xike Xie

🧩 TL;DR

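GraphVista is a unified vision-language model framework that addresses scalability and modality coordination in graph understanding through hierarchical organization of graph information and a planning agent for modality coordination. It handles graphs up to 200 times larger than those in existing benchmarks and achieves up to a 4.4-fold quality improvement.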


📘 Detailed Summary

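Motivation: Current vision-language models face a scalability bottleneck in graph understanding because of input-token limits, and they lack an effective mechanism for coordinating textual and visual modalities. Both issues limit their usefulness on large-scale graph data.

Method: GraphVista organizes graph information hierarchically into a lightweight GraphRAG base that retrieves only task-relevant textual descriptions and high-resolution visual subgraphs. It also introduces a planning agent that routes reasoning to the most suitable modality according to task complexity: the text modality for simple property reasoning, and the visual modality, grounded in explicit topology, for local and structurally complex reasoning.

Result: Experiments show that GraphVista scales to graphs up to 200 times larger than those in existing benchmarks and consistently outperforms existing textual, visual, and fusion-based methods across multiple benchmarks, with up to a 4.4-fold quality improvement over state-of-the-art baselines.

Conclusion: The study shows that hierarchical information compression and modality routing can address the scalability and modality-coordination challenges in graph understanding, exploiting the complementary strengths of text and vision to offer a new approach to large-scale graph analysis.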


📄 Abstract

Vision-language models (VLMs) have shown promise in graph understanding, but remain limited by input-token constraints, facing scalability bottlenecks and lacking effective mechanisms to coordinate textual and visual modalities. To address these challenges, we propose GraphVista, a unified framework that enhances both scalability and modality coordination in graph understanding. For scalability, GraphVista organizes graph information hierarchically into a lightweight GraphRAG base, which retrieves only task-relevant textual descriptions and high-resolution visual subgraphs, compressing redundant context while preserving key reasoning elements. For modality coordination, GraphVista introduces a planning agent that routes tasks to the most suitable modality, using the text modality for simple property reasoning and the visual modality for local and structurally complex reasoning grounded in explicit topology. Extensive experiments demonstrate that GraphVista scales to large graphs, up to $200\times$ larger than those used in existing benchmarks, and consistently outperforms existing textual, visual, and fusion-based methods, achieving up to $4.4\times$ quality improvement over the state-of-the-art baselines by fully exploiting the complementary strengths of both modalities.
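The two ideas in the abstract, retrieving only a relevant subgraph and routing each task to a modality, can be sketched in Python as follows. This is a rule-based stand-in, not GraphVista's planning agent or GraphRAG base: the task labels in `SIMPLE_PROPERTY_TASKS`, the k-hop retrieval rule, and both function names are hypothetical.

```python
from collections import deque
from enum import Enum


class Modality(Enum):
    TEXT = "text"
    VISUAL = "visual"


# Hypothetical task labels. The abstract only distinguishes "simple property
# reasoning" from "local and structurally complex reasoning".
SIMPLE_PROPERTY_TASKS = {"node_count", "edge_count", "node_degree"}


def route_task(task_kind: str) -> Modality:
    """Send simple property questions to text, everything else to vision."""
    if task_kind in SIMPLE_PROPERTY_TASKS:
        return Modality.TEXT
    return Modality.VISUAL


def k_hop_subgraph(adjacency: dict[str, set[str]], center: str, hops: int) -> set[str]:
    """Breadth-first search returning all nodes within `hops` edges of `center`."""
    seen = {center}
    frontier = deque([(center, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(node, set()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen
```

Restricting the context to a small neighborhood is one simple way to stay within an input-token budget regardless of how large the full graph is.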

[99] VAGEN: Reinforcing World Model Reasoning for Multi-Turn VLM Agents

Kangrui Wang, Pingyue Zhang, Zihan Wang, Yaning Gao, Linjie Li, Qineng Wang, Hanyang Chen, Chi Wan, Yiping Lu, Zhengyuan Yang, Lijuan Wang, Ranjay Krishna, Jiajun Wu, Li Fei-Fei, Yejin Choi, Manling Li

🧩 TL;DR

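This study uses reinforcement learning to architecturally enforce and reward explicit visual state reasoning in vision-language model agents so that they build internal world models. It reports a 3× improvement across five diverse agent benchmarks, outperforming proprietary reasoning models such as GPT-5, Gemini 2.5 Pro, and Claude 4.5.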


📘 Detailed Summary

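Motivation: A key challenge in training vision-language model agents is the shift from textual states to complex visual observations, which introduces partial observability and demands robust world modeling. The study asks whether VLM agents can construct internal world models through explicit visual state reasoning.

Method: The agent's reasoning process is formulated as a Partially Observable Markov Decision Process and decomposed into two key components, state estimation and transition modeling. The authors design a World Modeling Reward that provides dense, turn-level supervision, and introduce Bi-Level General Advantage Estimation (Bi-Level GAE) for turn-aware credit assignment.

Result: With visual state reasoning, a 3B-parameter model scores 0.82 across five diverse agent benchmarks, a 3× improvement over its untrained counterpart (0.21), and outperforms proprietary reasoning models such as GPT-5 (0.75), Gemini 2.5 Pro (0.67), and Claude 4.5 (0.62).

Conclusion: The best way for an agent to represent its internal beliefs is task-dependent: natural language excels at capturing semantic relationships in general tasks, while structured formats are indispensable for precise manipulation and control. These findings offer design principles for world modeling in VLM agents.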


📄 Abstract

A key challenge in training Vision-Language Model (VLM) agents, compared to Language Model (LLM) agents, lies in the shift from textual states to complex visual observations. This transition introduces partial observability and demands robust world modeling. We ask: Can VLM agents construct internal world models through explicit visual state reasoning? To address this question, we architecturally enforce and reward the agent's reasoning process via reinforcement learning (RL), formulating it as a Partially Observable Markov Decision Process (POMDP). We find that decomposing the agent's reasoning into State Estimation ("what is the current state?") and Transition Modeling ("what comes next?") is critical for success, as demonstrated through five reasoning strategies. Our investigation into how agents represent internal beliefs reveals that the optimal representation is task-dependent: Natural Language excels at capturing semantic relationships in general tasks, while Structured formats are indispensable for precise manipulation and control. Building on these insights, we design a World Modeling Reward that provides dense, turn-level supervision for accurate state prediction, and introduce Bi-Level General Advantage Estimation (Bi-Level GAE) for turn-aware credit assignment. Through this form of visual state reasoning, a 3B-parameter model achieves a score of 0.82 across five diverse agent benchmarks, representing a 3$\times$ improvement over its untrained counterpart (0.21) and outperforming proprietary reasoning models such as GPT-5 (0.75), Gemini 2.5 Pro (0.67) and Claude 4.5 (0.62). All experiments are conducted within our VAGEN framework, a scalable system for training and analyzing multi-turn VLM agents in diverse visual environments. Code and data are publicly available at https://vagen-ai.github.io.
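The abstract does not define Bi-Level GAE, so the sketch below shows only standard, single-level generalized advantage estimation, the technique whose name it extends. It is included as background for the credit-assignment step and should not be read as the paper's method; the function name and default hyperparameters are assumptions.

```python
def generalized_advantages(
    rewards: list[float],
    values: list[float],
    gamma: float = 0.99,
    lam: float = 0.95,
) -> list[float]:
    """Standard GAE: A_t = delta_t + gamma * lam * A_{t+1}.

    `values` must hold one more entry than `rewards`; the last entry is the
    bootstrap value of the state after the final step.
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have len(rewards) + 1 entries")
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error at step t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

A turn-aware variant would presumably apply an accumulation of this kind at more than one granularity, but how the paper does so is not stated in the abstract.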

[100] ToolCritic: Detecting and Correcting Tool-Use Errors in Dialogue Systems

Hassan Hamad, Yingru Xu, Liang Zhao, Wenbo Yan, Narendra Gyanchandani

🧩 TL;DR

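This paper proposes ToolCritic, a diagnostic framework for evaluating and improving large language model behavior in multi-turn, tool-augmented dialogues. By detecting eight specific tool-calling error types and providing targeted feedback, it improves tool-calling accuracy on the SGD dataset by up to 13%, making LLM integration with external tools more robust.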


📘 Detailed Summary

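Motivation: Tool-augmented large language models are increasingly used in real-world applications, but tool-use errors still hinder their reliability. Existing methods fall short in detecting and correcting specific error types in tool calling, especially in multi-turn dialogue, so a systematic diagnostic framework is needed to make LLM interaction with external tools more robust.

Method: ToolCritic systematically defines eight tool-calling error types, including premature invocation, argument misalignment, and misinterpretation of tool outputs, and is trained on a purpose-built synthetic dataset. It detects these errors and provides targeted feedback, which the main LLM, assumed to have strong reasoning capabilities, uses to revise its response.

Result: Experiments on the Schema-Guided Dialogue dataset show that ToolCritic improves tool-calling accuracy by up to 13% over baselines, including zero-shot prompting and self-correction techniques. The framework reduces errors of various kinds during tool calling and improves performance in multi-turn, tool-augmented dialogue.

Conclusion: ToolCritic is a step toward more robust LLM integration with external tools, providing a diagnostic and correction mechanism for real-world dialogue applications. Its systematic error taxonomy and feedback mechanism offer a useful reference for future work on tool-augmented LLMs, particularly on reliability in multi-turn interaction.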


📄 Abstract

Tool-augmented large language models (LLMs) are increasingly employed in real-world applications, but tool usage errors still hinder their reliability. We introduce ToolCritic, a diagnostic framework that evaluates and improves LLM behavior in multi-turn, tool-augmented dialogues. ToolCritic detects eight distinct error types specific to tool-calling (e.g., premature invocation, argument misalignment, and misinterpretation of tool outputs) and provides targeted feedback to the main LLM. The main LLM, assumed to have strong reasoning, task understanding and orchestration capabilities, then revises its response based on ToolCritic's feedback. We systematically define these error categories and construct a synthetic dataset to train ToolCritic. Experimental results on the Schema-Guided Dialogue (SGD) dataset demonstrate that ToolCritic improves tool-calling accuracy by up to 13% over baselines, including zero-shot prompting and self-correction techniques. This represents a promising step toward more robust LLM integration with external tools in real-world dialogue applications.
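The detect-then-revise loop described above can be sketched as follows. The `Drafter` and `Critic` callables stand in for the main LLM and ToolCritic respectively; their signatures, the `Critique` fields, and the `max_revisions` parameter are assumptions made for illustration, not the paper's interface.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Critique:
    error_type: str  # e.g. "premature invocation", one of the types named in the abstract
    feedback: str


# Hypothetical interfaces: a drafter maps (dialogue, optional feedback) to a
# response; a critic maps (dialogue, response) to a Critique or None.
Drafter = Callable[[str, Optional[str]], str]
Critic = Callable[[str, str], Optional[Critique]]


def respond_with_critic(
    dialogue: str, drafter: Drafter, critic: Critic, max_revisions: int = 1
) -> str:
    """Draft a response, then revise it while the critic reports an error."""
    response = drafter(dialogue, None)
    for _ in range(max_revisions):
        critique = critic(dialogue, response)
        if critique is None:
            break  # no tool-use error detected
        response = drafter(dialogue, f"{critique.error_type}: {critique.feedback}")
    return response
```

Keeping the critic separate from the drafter means the same critic could, in principle, be paired with different main models.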

[101] MIRAGE: Agentic Framework for Multimodal Misinformation Detection with Web-Grounded Reasoning

Mir Nafis Sharear Shopnil, Sharad Duwal, Abhishek Tyagi, Adiba Mahbub Proma

🧩 TL;DR

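This paper proposes MIRAGE, an inference-time, model-pluggable agentic framework that detects online misinformation by decomposing multimodal verification into four sequential modules. It reaches an 81.65% F1 score on the MMFakeBench validation set, 7.65 points above the strongest zero-shot baseline.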


📘 Detailed Summary

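Motivation: Misinformation spreads across web platforms through billions of daily multimodal posts that combine text and images, exceeding manual fact-checking capacity. Supervised detection models require domain-specific training data and fail to generalize across different manipulation tactics.

Method: MIRAGE decomposes multimodal verification into four sequential modules: visual veracity assessment detects AI-generated images, cross-modal consistency analysis identifies out-of-context repurposing, retrieval-augmented factual checking grounds claims in web evidence through iterative question generation, and a calibrated judgment module integrates all signals. The framework orchestrates vision-language model reasoning with targeted web retrieval and outputs structured, citation-linked rationales.

Result: On the MMFakeBench validation set (1,000 samples), MIRAGE with GPT-4o-mini achieves 81.65% F1 and 75.1% accuracy, outperforming the strongest zero-shot baseline (GPT-4V with MMD-Agent at 74.0% F1) by 7.65 points, while keeping the false positive rate at 34.3% versus 97.3% for a judge-only baseline. Test set results (5,000 samples) confirm generalization with 81.44% F1 and 75.08% accuracy. Ablation studies show that visual verification contributes 5.18 F1 points and retrieval-augmented reasoning contributes 2.97 F1 points.

Conclusion: The results indicate that decomposed agentic reasoning with web retrieval can match supervised detector performance without domain-specific training, enabling misinformation detection in multimodal settings where labeled data is scarce and offering an automated response to large-scale multimodal misinformation.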


📄 Abstract

Misinformation spreads across web platforms through billions of daily multimodal posts that combine text and images, overwhelming manual fact-checking capacity. Supervised detection models require domain-specific training data and fail to generalize across diverse manipulation tactics. We present MIRAGE, an inference-time, model-pluggable agentic framework that decomposes multimodal verification into four sequential modules: visual veracity assessment detects AI-generated images, cross-modal consistency analysis identifies out-of-context repurposing, retrieval-augmented factual checking grounds claims in web evidence through iterative question generation, and a calibrated judgment module integrates all signals. MIRAGE orchestrates vision-language model reasoning with targeted web retrieval and outputs structured, citation-linked rationales. On the MMFakeBench validation set (1,000 samples), MIRAGE with GPT-4o-mini achieves 81.65% F1 and 75.1% accuracy, outperforming the strongest zero-shot baseline (GPT-4V with MMD-Agent at 74.0% F1) by 7.65 points while maintaining a 34.3% false positive rate versus 97.3% for a judge-only baseline. Test set results (5,000 samples) confirm generalization with 81.44% F1 and 75.08% accuracy. Ablation studies show visual verification contributes 5.18 F1 points and retrieval-augmented reasoning contributes 2.97 points. Our results demonstrate that decomposed agentic reasoning with web retrieval can match supervised detector performance without domain-specific training, enabling misinformation detection across modalities where labeled data remains scarce.
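The sequential decomposition can be sketched as a small pipeline in which each module sees the post and the signals gathered so far, and a final judge integrates them. This is only a structural illustration: the `Post` and `Evidence` types, the module signature, and `run_pipeline` are hypothetical, and the actual modules (VLM calls and web retrieval) are left as pluggable callables.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Post:
    text: str
    image_path: str


@dataclass
class Evidence:
    # Maps a module name to the signal that module produced.
    signals: dict[str, str] = field(default_factory=dict)


# Hypothetical interface: a module reads the post plus earlier signals.
Module = Callable[[Post, Evidence], str]


def run_pipeline(
    post: Post,
    modules: list[tuple[str, Module]],
    judge: Callable[[Evidence], str],
) -> tuple[str, Evidence]:
    """Run the modules in order, then let the judge integrate all signals."""
    evidence = Evidence()
    for name, module in modules:
        evidence.signals[name] = module(post, evidence)
    return judge(evidence), evidence
```

Returning the accumulated `Evidence` alongside the verdict keeps the intermediate signals available for a structured rationale.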

[102] Seeing but Not Believing: Probing the Disconnect Between Visual Attention and Answer Correctness in VLMs

Zhining Liu, Ziyi Chen, Hui Liu, Chen Luo, Xianfeng Tang, Suhang Wang, Joy Zeng, Zhenwei Dai, Zhan Shi, Tianxin Wei, Benoit Dumoulin, Hanghang Tong

🧩 TL;DR

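This study identifies a "seeing but not believing" phenomenon in vision-language models: the model still perceives the visual evidence even when it outputs a wrong answer. It proposes a training-free, inference-time intervention based on selective attention masking that consistently improves accuracy across multiple VLM families.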


📘 Detailed Summary

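Motivation: Although vision-language models perform well on multimodal tasks, they can still fail even when the correct visual evidence is present. The study systematically investigates whether these failures come from not perceiving the evidence or from not using it effectively, in order to understand the gap between perception and reasoning inside VLMs.

Method: By analyzing layer-wise attention dynamics, the authors find that shallow layers focus mainly on text, while deeper layers attend sparsely but reliably to localized evidence regions. Building on this, they propose an inference-time intervention that highlights deep-layer evidence regions through selective attention-based masking and requires no training.

Result: VLMs often still perceive the visual evidence when they output incorrect answers, and this phenomenon is widespread across major VLM families. The proposed intervention consistently improves accuracy on multiple models, including LLaVA, Qwen, Gemma, and InternVL.

Conclusion: VLMs encode reliable evidence internally but under-utilize it, and making these signals explicit can bridge the gap between perception and reasoning. The finding advances the diagnostic understanding of VLMs and points to a new direction for improving their reliability.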


📄 Abstract

Vision-Language Models (VLMs) achieve strong results on multimodal tasks such as visual question answering, yet they can still fail even when the correct visual evidence is present. In this work, we systematically investigate whether these failures arise from not perceiving the evidence or from not leveraging it effectively. By examining layer-wise attention dynamics, we find that shallow layers focus primarily on text, while deeper layers sparsely but reliably attend to localized evidence regions. Surprisingly, VLMs often perceive the visual evidence when outputting incorrect answers, a phenomenon we term "seeing but not believing" that widely exists in major VLM families. Building on this, we introduce an inference-time intervention that highlights deep-layer evidence regions through selective attention-based masking. It requires no training and consistently improves accuracy across multiple families, including LLaVA, Qwen, Gemma, and InternVL. These results show that VLMs encode reliable evidence internally but under-utilize it; making such signals explicit can bridge the gap between perception and reasoning, advancing the diagnostic understanding and reliability of VLMs.
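One simple way to picture attention-based masking is to keep the image patches that receive the most deep-layer attention and blank out the rest. The sketch below does this with a top-k rule; the abstract does not specify the selection rule, so `keep_ratio`, the top-k choice, and the zero fill value are assumptions rather than the paper's procedure.

```python
import math


def evidence_mask(patch_attention: list[float], keep_ratio: float = 0.25) -> list[bool]:
    """Mark the most-attended patches as evidence (True) and the rest as masked."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    keep = max(1, math.ceil(len(patch_attention) * keep_ratio))
    ranked = sorted(
        range(len(patch_attention)), key=lambda i: patch_attention[i], reverse=True
    )
    kept = set(ranked[:keep])
    return [i in kept for i in range(len(patch_attention))]


def apply_mask(patches: list[float], mask: list[bool], fill: float = 0.0) -> list[float]:
    """Replace masked-out patch values with `fill`, leaving evidence patches intact."""
    return [patch if is_evidence else fill for patch, is_evidence in zip(patches, mask)]
```

Because the mask is derived from the model's own attention at inference time, a scheme like this needs no additional training, which matches the training-free framing in the abstract.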