cs.CV [Total: 28]
cs.CL [Total: 3]
cs.AI [Total: 4]

cs.CV

[1] KBE-DME: Dynamic Multimodal Evaluation via Knowledge Enhanced Benchmark Evolution

Junzhe Zhang, Huixuan Zhang, Xiaojun Wan

🧩 TL;DR

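This paper proposes the Knowledge-enhanced Benchmark Evolution (KBE) framework, which represents VQA samples as graphs and integrates multimodal knowledge to turn static benchmarks into controllable, dynamically evolving versions, addressing data contamination and saturation in the evaluation of multimodal large language models.

📘 Detailed Summary

Motivation: Existing static benchmarks are at risk of data contamination and saturation, which leads to inflated or misleading performance evaluations. The rapid progress of multimodal large language models (MLLMs) calls for more reliable evaluation protocols that measure model capability accurately.

Method: Static or dynamic VQA samples are represented with a graph formulation. On top of it, the KBE framework reconstructs and expands questions by re-selecting visual information in the original image and integrating external textual knowledge, which enables difficulty-controllable evaluation.

Result: Extensive experiments show that KBE alleviates the risks of data contamination and saturation and provides a more comprehensive assessment of MLLM capabilities. Benchmark difficulty is controlled by adjusting the degree of question exploration.

Conclusion: KBE offers a dynamic, controllable approach to MLLM evaluation that reflects true model capability more accurately and points to a direction for future evaluation methods.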

📄 Abstract

The rapid progress of multimodal large language models (MLLMs) calls for more reliable evaluation protocols. Existing static benchmarks suffer from the potential risk of data contamination and saturation, leading to inflated or misleading performance evaluations. To address these issues, we first apply a graph formulation to represent a static or dynamic VQA sample. With this formulation, we propose Knowledge-enhanced Benchmark Evolution (KBE), a dynamic multimodal evaluation framework. KBE first analyzes the original static benchmark, then expands it by integrating multimodal knowledge, transforming the static benchmark into a controllable, dynamically evolving version. Crucially, KBE can both reconstruct questions by re-selecting visual information in the original image and expand existing questions with external textual knowledge. It enables difficulty-controllable evaluation by adjusting the degree of question exploration. Extensive experiments demonstrate that KBE alleviates the risks of data contamination and saturation and provides a more comprehensive assessment of MLLM capabilities.
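To make the graph view of a VQA sample concrete, here is a minimal sketch. It assumes (this is not stated in the abstract) that a sample can be stored as relation triples, that a question corresponds to a path through the graph, and that difficulty grows with the number of hops. The helper names `add_triples` and `paths` and the toy triples are hypothetical.

```python
def add_triples(graph, triples):
    """Add (head, relation, tail) triples to an adjacency-list graph."""
    for head, relation, tail in triples:
        graph.setdefault(head, []).append((relation, tail))
        graph.setdefault(tail, [])
    return graph


def paths(graph, start, hops):
    """Return all simple paths with exactly `hops` edges starting at `start`."""
    results = []

    def walk(node, path, seen):
        if len(path) == hops:
            results.append(path)
            return
        for relation, nxt in graph[node]:
            if nxt not in seen:
                walk(nxt, path + [(node, relation, nxt)], seen | {nxt})

    walk(start, [], {start})
    return results


# Toy sample (hypothetical): one fact read from the image,
# one fact taken from an external textual knowledge source.
visual = [("man", "holds", "guitar")]
external = [("guitar", "is_a", "string instrument")]

graph = add_triples({}, visual)
easy = paths(graph, "man", hops=1)      # question answerable from the image alone

graph = add_triples(graph, external)
hard = paths(graph, "man", hops=2)      # question that also needs external knowledge
```

Under these assumptions, raising `hops` is one possible way to read "adjusting the degree of question exploration": longer paths chain more facts into a single question.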

[2] Preventing Shortcuts in Adapter Training via Providing the Shortcuts

Anujraaj Argo Goyal, Guocheng Gordon Qian, Huseyin Coskun, Aarush Gupta, Himmy Tam, Daniil Ostashev, Ju Hu, Dhritiman Sagar, Sergey Tulyakov, Kfir Aberman, Kuan-Chieh Jackson Wang

🧩 TL;DR

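This paper proposes Shortcut-Rerouted Adapter Training, which gives confounding factors a dedicated path during adapter training. This keeps the adapter from entangling the target attribute with incidental factors and improves generation quality, diversity, and prompt adherence.

📘 Detailed Summary

Motivation: Adapter-based training is a key mechanism for extending foundation image generators, but existing methods entangle the target attribute with incidental factors such as pose, expression, and lighting. This spurious correlation limits generalization and obstructs adherence to the input text prompt.

Method: In Shortcut-Rerouted Adapter Training, confounding factors are routed through auxiliary modules such as ControlNet or LoRA, which removes the adapter's incentive to internalize them. The auxiliary modules are removed at inference time.

Result: On tasks such as facial and full-body identity injection, the method improves generation quality, diversity, and prompt adherence, indicating that it decouples the target attribute from incidental factors.

Conclusion: The study points to a general design principle: when seeking disentangled representations, the most effective path may be to establish shortcuts for what should not be learned. This offers a new design pattern for adapter training in the era of large models.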

📄 Abstract

Adapter-based training has emerged as a key mechanism for extending the capabilities of powerful foundation image generators, enabling personalized and stylized text-to-image synthesis. These adapters are typically trained to capture a specific target attribute, such as subject identity, using single-image reconstruction objectives. However, because the input image inevitably contains a mixture of visual factors, adapters are prone to entangle the target attribute with incidental ones, such as pose, expression, and lighting. This spurious correlation problem limits generalization and obstructs the model's ability to adhere to the input text prompt. In this work, we uncover a simple yet effective solution: provide the very shortcuts we wish to eliminate during adapter training. In Shortcut-Rerouted Adapter Training, confounding factors are routed through auxiliary modules, such as ControlNet or LoRA, eliminating the incentive for the adapter to internalize them. The auxiliary modules are then removed during inference. When applied to tasks like facial and full-body identity injection, our approach improves generation quality, diversity, and prompt adherence. These results point to a general design principle in the era of large models: when seeking disentangled representations, the most effective path may be to establish shortcuts for what should NOT be learned.

[3] Head Pursuit: Probing Attention Specialization in Multimodal Transformers

Lorenzo Basile, Valentino Maiorca, Diego Doimo, Francesco Locatello, Alberto Cazzaniga

🧩 TL;DR

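This study reinterprets attention-head analysis through the lens of signal processing, reveals how individual attention heads in text-generative models specialize in semantic and visual attributes, and shows that editing as few as 1% of the heads can reliably suppress or enhance targeted concepts.

📘 Detailed Summary

Motivation: Language and vision-language models perform well across many tasks, but their internal mechanisms remain only partly understood. The study examines how individual attention heads in text-generative models specialize in specific semantic or visual attributes.

Method: Building on an established interpretability method, the study reinterprets the practice of probing intermediate activations with the final decoding layer from a signal-processing perspective. This yields a principled way to analyze multiple samples and rank attention heads by their relevance to target concepts.

Result: Consistent head-level specialization patterns appear in both unimodal and multimodal transformers. Editing as few as 1% of the heads, selected with the method, reliably suppresses or enhances targeted concepts in the model output. The approach is validated on question answering and toxicity mitigation for language, and on image classification and captioning for vision-language tasks.

Conclusion: The findings highlight an interpretable and controllable structure within attention layers and offer simple tools for understanding and editing large-scale generative models.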

📄 Abstract

Language and vision-language models have shown impressive performance across a wide range of tasks, but their internal mechanisms remain only partly understood. In this work, we study how individual attention heads in text-generative models specialize in specific semantic or visual attributes. Building on an established interpretability method, we reinterpret the practice of probing intermediate activations with the final decoding layer through the lens of signal processing. This lets us analyze multiple samples in a principled way and rank attention heads based on their relevance to target concepts. Our results show consistent patterns of specialization at the head level across both unimodal and multimodal transformers. Remarkably, we find that editing as few as 1% of the heads, selected using our method, can reliably suppress or enhance targeted concepts in the model output. We validate our approach on language tasks such as question answering and toxicity mitigation, as well as vision-language tasks including image classification and captioning. Our findings highlight an interpretable and controllable structure within attention layers, offering simple tools for understanding and editing large-scale generative models.
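The idea of ranking heads by probing their outputs with the final decoding layer can be illustrated with a much simpler stand-in than the paper's signal-processing formulation. The sketch below assumes each head's contribution to the residual stream is available as a vector, projects it onto the unembedding vectors of a few concept tokens, averages over samples, and keeps the top fraction of heads. All names and the toy numbers are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def head_scores(head_outputs, unembed, concept_token_ids):
    """Average, over samples, of each head's projection onto the concept tokens.

    head_outputs[sample][head] is that head's output vector (size d);
    unembed[token_id] is the decoding-layer vector for that token (size d).
    """
    num_heads = len(head_outputs[0])
    scores = [0.0] * num_heads
    for sample in head_outputs:
        for h, vec in enumerate(sample):
            scores[h] += sum(dot(vec, unembed[t]) for t in concept_token_ids)
    return [s / len(head_outputs) for s in scores]


def top_heads(scores, fraction=0.01):
    """Indices of the highest-scoring heads; at least one head is returned."""
    k = max(1, round(fraction * len(scores)))
    return sorted(range(len(scores)), key=lambda h: scores[h], reverse=True)[:k]


# Toy setup (hypothetical): 2 samples, 3 heads, hidden size 2, concept = token 1.
unembed = {0: [1.0, 0.0], 1: [0.0, 1.0]}
head_outputs = [
    [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]],
    [[1.0, 0.0], [0.0, 4.0], [0.5, 0.5]],
]
scores = head_scores(head_outputs, unembed, concept_token_ids=[1])
selected = top_heads(scores, fraction=0.01)
```

The selected heads would then be the candidates for editing (for example, scaling their output down to suppress the concept); how the edit is applied in the paper is not specified in the abstract.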

[4] Video-As-Prompt: Unified Semantic Control for Video Generation

Yuxuan Bian, Xin Chen, Zenan Li, Tiancheng Zhi, Shen Sang, Linjie Luo, Qiang Xu

🧩 TL;DR

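This paper proposes Video-As-Prompt (VAP), a new paradigm that uses a reference video as a semantic prompt to guide a frozen Video Diffusion Transformer. It achieves unified, generalizable semantic control for video generation and sets the state of the art among open-source methods.

📘 Detailed Summary

Motivation: Unified semantic control in video generation remains an open challenge. Existing methods either introduce artifacts by enforcing inappropriate pixel-wise priors from structure-based controls, or rely on non-generalizable, condition-specific finetuning or task-specific architectures.

Method: VAP reframes the problem as in-context generation. A reference video serves as a direct semantic prompt that guides a frozen Video Diffusion Transformer through a plug-and-play Mixture-of-Transformers expert. A temporally biased position embedding eliminates spurious mapping priors for robust context retrieval.

Result: As a single unified model, VAP sets a new state of the art for open-source methods, with a 38.7% user preference rate that rivals leading condition-specific commercial models. The authors also built VAP-Data, a dataset of over 100K paired videos, to support the approach.

Conclusion: VAP's strong zero-shot generalization and support for various downstream applications mark a significant step toward general-purpose, controllable video generation. Its plug-and-play architecture prevents catastrophic forgetting, and the dataset provides a resource for future research.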

📄 Abstract

Unified, generalizable semantic control in video generation remains a critical open challenge. Existing methods either introduce artifacts by enforcing inappropriate pixel-wise priors from structure-based controls, or rely on non-generalizable, condition-specific finetuning or task-specific architectures. We introduce Video-As-Prompt (VAP), a new paradigm that reframes this problem as in-context generation. VAP leverages a reference video as a direct semantic prompt, guiding a frozen Video Diffusion Transformer (DiT) via a plug-and-play Mixture-of-Transformers (MoT) expert. This architecture prevents catastrophic forgetting and is guided by a temporally biased position embedding that eliminates spurious mapping priors for robust context retrieval. To power this approach and catalyze future research, we built VAP-Data, the largest dataset for semantic-controlled video generation with over 100K paired videos across 100 semantic conditions. As a single unified model, VAP sets a new state-of-the-art for open-source methods, achieving a 38.7% user preference rate that rivals leading condition-specific commercial models. VAP's strong zero-shot generalization and support for various downstream applications mark a significant advance toward general-purpose, controllable video generation.

[5] Generative Point Tracking with Flow Matching

Mattie Tesfaldet, Adam W. Harley, Konstantinos G. Derpanis, Derek Nowrouzezahrai, Christopher Pal

🧩 TL;DR

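This paper proposes the Generative Point Tracker (GenPT), which uses generative modelling to handle multi-modal uncertainty in video point tracking. It achieves state-of-the-art accuracy on occluded points while remaining competitive on visible points.

📘 Detailed Summary

Motivation: Current state-of-the-art discriminative point trackers can only regress to a mean or mode under uncertainty and fail to capture the multi-modality of trajectories, particularly when the uncertainty comes from occlusions and appearance changes.

Method: GenPT is a generative framework based on flow matching. It combines the iterative refinement of discriminative trackers, a window-dependent prior for cross-window consistency, and a variance schedule tuned for point coordinates. At inference it uses a best-first search over generated samples, guided by the model's own confidence.

Result: On the PointOdyssey, Dynamic Replica, and TAP-Vid benchmarks, GenPT achieves state-of-the-art accuracy on occluded points while remaining competitive with existing discriminative trackers on visible points. A TAP-Vid variant with additional occlusions highlights its ability to capture multi-modality.

Conclusion: Generative modelling can capture the multi-modal uncertainty of point trajectories and improves tracking under occlusion, which suggests a promising direction for video point tracking.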

📄 Abstract

Tracking a point through a video can be a challenging task due to uncertainty arising from visual obfuscations, such as appearance changes and occlusions. Although current state-of-the-art discriminative models excel in regressing long-term point trajectory estimates -- even through occlusions -- they are limited to regressing to a mean (or mode) in the presence of uncertainty, and fail to capture multi-modality. To overcome this limitation, we introduce Generative Point Tracker (GenPT), a generative framework for modelling multi-modal trajectories. GenPT is trained with a novel flow matching formulation that combines the iterative refinement of discriminative trackers, a window-dependent prior for cross-window consistency, and a variance schedule tuned specifically for point coordinates. We show how our model's generative capabilities can be leveraged to improve point trajectory estimates by utilizing a best-first search strategy on generated samples during inference, guided by the model's own confidence of its predictions. Empirically, we evaluate GenPT against the current state of the art on the standard PointOdyssey, Dynamic Replica, and TAP-Vid benchmarks. Further, we introduce a TAP-Vid variant with additional occlusions to assess occluded point tracking performance and highlight our model's ability to capture multi-modality. GenPT is capable of capturing the multi-modality in point trajectories, which translates to state-of-the-art tracking accuracy on occluded points, while maintaining competitive tracking accuracy on visible points compared to extant discriminative point trackers.
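For readers unfamiliar with flow matching, the following is a minimal sketch of the generic linear-path variant, not GenPT's own formulation (which adds a window-dependent prior and a coordinate-specific variance schedule). A training pair consists of a prior sample `x0` and a data sample `x1`; the regression target is the constant velocity `x1 - x0` along the straight path between them, and sampling integrates a velocity field from the prior. Function names are hypothetical.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from x0 (t=0) to x1 (t=1)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for the velocity network on the linear path."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(x0, velocity_fn, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x, dt = list(x0), 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy 2D "point coordinate": prior sample at the origin, true location at (4, -2).
x0, x1 = [0.0, 0.0], [4.0, -2.0]
midpoint = interpolate(x0, x1, 0.5)
v_star = target_velocity(x0, x1)

# With an oracle velocity field the sampler recovers x1; a trained network
# would replace the lambda below.
sample = euler_sample(x0, lambda x, t: v_star, steps=10)
```

Drawing several prior samples and integrating each one yields several candidate trajectories, which is what makes a search over generated samples possible at inference time.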

[6] 3DReasonKnee: Advancing Grounded Reasoning in Medical Vision Language Models

Sraavya Sambara, Sung Eun Kim, Xiaoman Zhang, Luyang Luo, Shreya Johri, Mohammed Baharoon, Du Hyun Ro, Pranav Rajpurkar

🧩 TL;DR

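This study introduces 3DReasonKnee, the first 3D grounded reasoning dataset for medical images. It contains 494k high-quality quintuples derived from 7,970 3D knee MRI volumes and targets the difficulty current vision-language models have in grounding anatomical regions in 3D medical images and reasoning about them step by step.

📘 Detailed Summary

Motivation: Current vision-language models struggle to ground anatomical regions in 3D medical images and reason about them step by step, a key requirement of real-world diagnostic assessment. Existing 3D datasets provide localization labels, but none support this "grounded reasoning" ability, so model outputs cannot be aligned with clinicians' diagnostic workflows, which limits trustworthy clinician-AI collaboration.

Method: Each quintuple in 3DReasonKnee contains a 3D MRI volume, a diagnostic question targeting a specific anatomical region, a 3D bounding box localizing the relevant anatomical structures, clinician-generated diagnostic reasoning steps that detail the 3D reasoning process, and structured severity assessments for the relevant region. Creating the dataset involved over 450 hours of expert clinician time for manual segmentation and reasoning-chain generation.

Result: The authors establish ReasonKnee-Bench to evaluate localization and diagnostic accuracy and benchmark five state-of-the-art VLMs on it as baselines, providing insight into VLM grounding and severity assessment across anatomical regions and diagnostic inquiries.

Conclusion: As a repository of orthopedic surgeons' diagnostic expertise, 3DReasonKnee offers a testbed for advancing multimodal medical AI systems toward 3D, clinically aligned, localized decision-making. Its expert-annotated 3D reasoning pathways support more trustworthy clinician-AI collaboration.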

📄 Abstract

Current Vision-Language Models (VLMs) struggle to ground anatomical regions in 3D medical images and reason about them in a step-by-step manner, a key requirement of real-world diagnostic assessment. This ability is essential for aligning model outputs with the diagnostic workflows clinicians use in practice, enabling trustworthy clinician-AI collaboration. Existing 3D datasets provide localization labels, but none support this "grounded reasoning" ability. To address this gap, we introduce 3DReasonKnee, the first 3D grounded reasoning dataset for medical images, which provides 494k high-quality quintuples derived from 7,970 3D knee MRI volumes. Each quintuple includes: (1) the 3D MRI volume, (2) a diagnostic question targeting a specific anatomical region, (3) a 3D bounding box localizing the relevant anatomical structures, (4) clinician-generated diagnostic reasoning steps that explicitly detail the 3D reasoning process, and (5) structured severity assessments for the relevant anatomical region. The creation and validation of 3DReasonKnee, involving over 450 hours of expert clinician time for manually segmenting MRIs and generating reasoning chains, ensures its superior quality and clinical relevance. We establish ReasonKnee-Bench to evaluate localization and diagnostic accuracy, providing insight into VLM ability to perform grounding and severity assessment across anatomical regions and diagnostic inquiries. We benchmark five state-of-the-art VLMs, providing baseline performance for ReasonKnee-Bench. By providing this unique resource of expert-annotated 3D reasoning pathways, 3DReasonKnee serves as a repository of orthopedic surgeons' diagnostic expertise and offers a vital testbed for advancing multimodal medical AI systems towards 3D, clinically aligned, localized decision-making capabilities. The dataset can be found at: https://huggingface.co/datasets/rajpurkarlab/3DReasonKnee

[7] ZING-3D: Zero-shot Incremental 3D Scene Graphs via Vision-Language Models

Pranav Saxena, Jimmy Chiun

🧩 TL;DR

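ZING-3D is a zero-shot 3D scene graph generation framework that uses pretrained foundation models for open-vocabulary recognition and supports incremental updates and geometric grounding in 3D, making it suitable for downstream robotics applications.

📘 Detailed Summary

Motivation: Existing 3D scene graph generation methods are largely confined to single-view settings, do not support incremental updates as new observations arrive, and lack explicit geometric grounding in 3D space. All three capabilities are essential for embodied scenarios.

Method: The method uses VLM reasoning to generate a rich 2D scene graph and grounds it in 3D using depth information. Nodes represent open-vocabulary objects with features, 3D locations, and semantic context. Edges capture spatial and semantic relations together with inter-object distances.

Result: Experiments on scenes from the Replica and HM3D datasets show that ZING-3D captures spatial and relational knowledge effectively without task-specific training.

Conclusion: The study demonstrates the potential of pretrained foundation models for 3D scene understanding and offers a workable approach to zero-shot 3D scene graph generation for robotics applications.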

📄 Abstract

Understanding and reasoning about complex 3D environments requires structured scene representations that capture not only objects but also their semantic and spatial relationships. While recent works on 3D scene graph generation have leveraged pretrained VLMs without task-specific fine-tuning, they are largely confined to single-view settings, fail to support incremental updates as new observations arrive, and lack explicit geometric grounding in 3D space, all of which are essential for embodied scenarios. In this paper, we propose ZING-3D, a framework that leverages the vast knowledge of pretrained foundation models to enable open-vocabulary recognition and generate a rich semantic representation of the scene in a zero-shot manner while also enabling incremental updates and geometric grounding in 3D space, making it suitable for downstream robotics applications. Our approach leverages VLM reasoning to generate a rich 2D scene graph, which is grounded in 3D using depth information. Nodes represent open-vocabulary objects with features, 3D locations, and semantic context, while edges capture spatial and semantic relations with inter-object distances. Our experiments on scenes from the Replica and HM3D datasets show that ZING-3D is effective at capturing spatial and relational knowledge without the need for task-specific training.
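Grounding a 2D detection in 3D with depth is commonly done by pinhole back-projection. The sketch below assumes a pinhole camera with known intrinsics and a metric depth value at each object's pixel; whether ZING-3D uses exactly this procedure is not stated in the abstract. The intrinsics, pixel coordinates, and the edge dictionary layout are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Intrinsics:
    fx: float  # focal length in pixels (x)
    fy: float  # focal length in pixels (y)
    cx: float  # principal point (x)
    cy: float  # principal point (y)


def backproject(u, v, depth, K):
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame."""
    x = (u - K.cx) * depth / K.fx
    y = (v - K.cy) * depth / K.fy
    return (x, y, depth)


# Hypothetical camera and two detected object centres at 2 m depth.
K = Intrinsics(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
chair = backproject(320.0, 240.0, 2.0, K)
table = backproject(570.0, 240.0, 2.0, K)

# A scene-graph edge carrying a relation label and the inter-object distance.
edge = {
    "subject": "chair",
    "relation": "next to",
    "object": "table",
    "distance_m": math.dist(chair, table),
}
```

For incremental updates, points from later frames would additionally need to be transformed into a shared world frame using the camera pose before nodes are merged; that step is omitted here.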

[8] Knowledge-Driven Vision-Language Model for Plexus Detection in Hirschsprung's Disease

Youssef Megahed, Atallah Madi, Dina El Demellawy, Adrian D. C. Chan

🧩 TL;DR

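This study proposes a multimodal learning framework that integrates expert-derived textual concepts into a vision-language model based on Contrastive Language-Image Pre-training to guide myenteric plexus classification in Hirschsprung's disease. It outperforms conventional CNN models, reaching 83.9% accuracy, 86.6% precision, and 87.6% specificity.

📘 Detailed Summary

Motivation: Diagnosing and treating Hirschsprung's disease requires clearly identifying the regions of the myenteric plexus on tissue slides. Deep learning approaches such as convolutional neural networks perform well on this task, but they are often treated as black boxes, offer little interpretability, and may not match how a physician makes decisions.

Method: Prompts derived from expert sources (such as medical textbooks and papers) are generated by large language models, reviewed by the team, and encoded with QuiltNet. These clinically relevant semantic cues are aligned with visual features and integrated into a CLIP-based vision-language model to guide plexus classification.

Result: The proposed model shows superior discriminative capability across classification metrics and outperforms CNN-based models including VGG-19, ResNet-18, and ResNet-50, with an accuracy of 83.9%, a precision of 86.6%, and a specificity of 87.6%.

Conclusion: The findings highlight the potential of multimodal learning in histopathology and the value of incorporating expert knowledge to obtain more clinically relevant model outputs.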

📄 Abstract

Hirschsprung's disease is defined as the congenital absence of ganglion cells in some segment(s) of the colon. The muscle cannot make coordinated movements to propel stool in that section, most commonly leading to obstruction. The diagnosis and treatment for this disease require a clear identification of different region(s) of the myenteric plexus, where ganglion cells should be present, on the microscopic view of the tissue slide. While deep learning approaches, such as Convolutional Neural Networks, have performed very well in this task, they are often treated as black boxes, with minimal understanding gained from them, and may not conform to how a physician makes decisions. In this study, we propose a novel framework that integrates expert-derived textual concepts into a Contrastive Language-Image Pre-training-based vision-language model to guide plexus classification. Our approach uses prompts derived from expert sources (e.g., medical textbooks and papers), generated by large language models and reviewed by our team before being encoded with QuiltNet, to align clinically relevant semantic cues with visual features. Experimental results show that the proposed model demonstrated superior discriminative capability across different classification metrics, outperforming CNN-based models including VGG-19, ResNet-18, and ResNet-50 and achieving an accuracy of 83.9%, a precision of 86.6%, and a specificity of 87.6%. These findings highlight the potential of multi-modal learning in histopathology and underscore the value of incorporating expert knowledge for more clinically relevant model outputs.
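One common way a CLIP-style model uses text prompts for classification is to compare an image embedding with the embeddings of each class's prompts and apply a softmax over the similarities. The sketch below shows that generic pattern only; the paper's exact integration of expert concepts is not described in the abstract. The embeddings are hand-written toy vectors standing in for encoder outputs, the class labels are hypothetical, and the temperature of 0.07 is an assumed value.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify(image_emb, class_prompt_embs, temperature=0.07):
    """Softmax over the mean image-to-prompt similarity of each class."""
    logits = {
        label: sum(cosine(image_emb, p) for p in embs) / len(embs) / temperature
        for label, embs in class_prompt_embs.items()
    }
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(v - peak) for label, v in logits.items()}
    total = sum(exps.values())
    return {label: v / total for label, v in exps.items()}


# Toy embeddings (hypothetical): two expert-derived prompts per class.
prompts = {
    "plexus": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "non-plexus": [[0.0, 0.2, 0.9], [0.1, 0.1, 0.8]],
}
probs = classify([0.85, 0.15, 0.05], prompts)
prediction = max(probs, key=probs.get)
```

Because each class score is tied to specific textual concepts, the per-prompt similarities can be inspected to see which concept drove a prediction, which is one route to the interpretability the abstract asks for.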

[9] PhysVLM-AVR: Active Visual Reasoning for Multimodal Large Language Models in Physical Environments

Weijie Zhou, Xuantang Xiong, Yi Peng, Manli Tao, Chaoyang Zhao, Honghui Dong, Ming Tang, Jinqiao Wang

🧩 TL;DR

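This paper introduces the Active Visual Reasoning (AVR) task, which extends visual reasoning to partially observable, interactive environments, and develops PhysVLM-AVR, a model that achieves state-of-the-art performance on several benchmarks. It also shows that current multimodal large language models fall short at actively acquiring and integrating information.

📘 Detailed Summary

Motivation: Visual reasoning in multimodal large language models has mainly been studied in static, fully observable settings, whereas real-world information is often incomplete because of occlusion or a limited field of view. Humans gather information through a closed loop of active exploration and interaction, which motivates this work on active reasoning in partially observable environments.

Method: The study introduces the AVR task and CLEVR-AVR, a simulation benchmark that assesses both reasoning correctness and information-gathering efficiency. It builds AVR-152k, a dataset of 152k samples with rich Chain-of-Thought annotations covering uncertainty identification, action-conditioned information gain prediction, and information-maximizing action selection, for training agents in a higher-order Markov Decision Process. PhysVLM-AVR is developed on this basis.

Result: PhysVLM-AVR achieves state-of-the-art performance on CLEVR-AVR and also performs strongly on embodied reasoning and passive visual reasoning tasks. The analysis finds that current embodied MLLMs can detect that information is incomplete but struggle to actively acquire and integrate new information through interaction.

Conclusion: The study exposes a fundamental gap in the active reasoning capabilities of current MLLMs and stresses the importance of closed-loop integration of perception, reasoning, and action in partially observable environments. The AVR framework offers a direction and an evaluation standard for building AI systems closer to human cognition.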

📄 Abstract

Visual reasoning in multimodal large language models (MLLMs) has primarily been studied in static, fully observable settings, limiting their effectiveness in real-world environments where information is often incomplete due to occlusion or limited field of view. Humans, in contrast, actively explore and interact with their environment (moving, examining, and manipulating objects) to gather information through a closed-loop process integrating perception, reasoning, and action. Inspired by this human capability, we introduce the Active Visual Reasoning (AVR) task, extending visual reasoning to partially observable, interactive environments. AVR requires agents to: (1) actively acquire information via sequential physical actions, (2) integrate observations across multiple steps for coherent reasoning, and (3) dynamically adjust decisions based on evolving visual feedback. To rigorously evaluate AVR, we introduce CLEVR-AVR, a simulation benchmark featuring multi-round interactive environments designed to assess both reasoning correctness and information-gathering efficiency. We present AVR-152k, a large-scale dataset that offers rich Chain-of-Thought (CoT) annotations detailing iterative reasoning for uncertainty identification, action-conditioned information gain prediction, and information-maximizing action selection, crucial for training agents in a higher-order Markov Decision Process. Building on this, we develop PhysVLM-AVR, an MLLM achieving state-of-the-art performance on CLEVR-AVR, embodied reasoning (OpenEQA, RoboVQA), and passive visual reasoning (GeoMath, Geometry30K). Our analysis also reveals that current embodied MLLMs, despite detecting information incompleteness, struggle to actively acquire and integrate new information through interaction, highlighting a fundamental gap in active reasoning capabilities.

[10] SafetyPairs: Isolating Safety Critical Image Features with Counterfactual Image Generation

Alec Helbling, Shruti Palaskar, Kundan Krishna, Polo Chau, Leon Gatys, Joseph Yitan Cheng

🧩 TL;DR

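This paper proposes SafetyPairs, a framework that studies image safety systematically by generating counterfactual image pairs that differ only in safety-relevant features. It yields a safety benchmark of over 3,020 SafetyPair images, exposes weaknesses of vision-language models on subtle safety distinctions, and also works as a data augmentation strategy.

📘 Detailed Summary

Motivation: Existing image safety datasets are coarse and ambiguous. They offer only broad safety labels without isolating the specific features that drive safety differences, which makes it hard to distinguish benign from problematic images systematically, especially when a subtle change such as an insulting gesture or symbol drastically alters an image's safety implications.

Method: SafetyPairs is a scalable framework that uses image editing models to make targeted changes to images. Only the features relevant to a given safety policy are changed, which flips the safety label while leaving safety-irrelevant details unchanged. The benchmark spans a diverse taxonomy of 9 safety categories.

Result: The new benchmark serves as a strong source of evaluation data and highlights weaknesses in vision-language models' ability to distinguish subtly different images. The pipeline also serves as a data augmentation strategy that improves the sample efficiency of training lightweight guard models.

Conclusion: The work provides the first systematic resource for studying fine-grained image safety distinctions and reveals the limits of existing models in recognizing safety-critical features. The framework can be used both for model evaluation and as data augmentation for training guard models.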

📄 Abstract

What exactly makes a particular image unsafe? Systematically differentiating between benign and problematic images is a challenging problem, as subtle changes to an image, such as an insulting gesture or symbol, can drastically alter its safety implications. However, existing image safety datasets are coarse and ambiguous, offering only broad safety labels without isolating the specific features that drive these differences. We introduce SafetyPairs, a scalable framework for generating counterfactual pairs of images that differ only in the features relevant to the given safety policy, thus flipping their safety label. By leveraging image editing models, we make targeted changes to images that alter their safety labels while leaving safety-irrelevant details unchanged. Using SafetyPairs, we construct a new safety benchmark, which serves as a powerful source of evaluation data that highlights weaknesses in vision-language models' abilities to distinguish between subtly different images. Beyond evaluation, we find our pipeline serves as an effective data augmentation strategy that improves the sample efficiency of training lightweight guard models. We release a benchmark containing over 3,020 SafetyPair images spanning a diverse taxonomy of 9 safety categories, providing the first systematic resource for studying fine-grained image safety distinctions.

[11] NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation

Longtian Qiu, Shan Ning, Jiaxuan Sun, Xuming He

🧩 TL;DR

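This paper proposes NoisyGRPO, a systematic multimodal reinforcement learning framework that injects controllable noise into visual inputs to enhance exploration and models advantage estimation explicitly within a Bayesian framework. It substantially improves the generalization and robustness of multimodal large language models in Chain-of-Thought reasoning.

📘 Detailed Summary

Motivation: When existing reinforcement learning frameworks are used to improve the general Chain-of-Thought reasoning of multimodal large language models, they often struggle to generalize beyond the training distribution, which limits effectiveness and stability in complex real-world visual scenarios.

Method: NoisyGRPO has two core components. The Noise-Injected Exploration Policy perturbs visual inputs with Gaussian noise to encourage exploration across a wider range of visual scenarios. Bayesian Advantage Estimation formulates advantage estimation as a Bayesian inference problem in which the injected noise level serves as the prior and the observed trajectory reward as the likelihood. Fusing the two sources gives a robust posterior estimate of trajectory advantage.

Result: On standard CoT quality, general capability, and hallucination benchmarks, NoisyGRPO substantially improves generalization and robustness, especially in RL settings with small-scale MLLMs such as Qwen2.5-VL 3B.

Conclusion: Combining noise injection with Bayesian modelling addresses the generalization challenge in multimodal reinforcement learning and offers a technical route for improving the reasoning ability of multimodal large language models.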

📄 Abstract

Reinforcement learning (RL) has shown promise in enhancing the general Chain-of-Thought (CoT) reasoning capabilities of multimodal large language models (MLLMs). However, when applied to improve general CoT reasoning, existing RL frameworks often struggle to generalize beyond the training distribution. To address this, we propose NoisyGRPO, a systematic multimodal RL framework that introduces controllable noise into visual inputs for enhanced exploration and explicitly models the advantage estimation process via a Bayesian framework. Specifically, NoisyGRPO improves RL training by: (1) Noise-Injected Exploration Policy: Perturbing visual inputs with Gaussian noise to encourage exploration across a wider range of visual scenarios; and (2) Bayesian Advantage Estimation: Formulating advantage estimation as a principled Bayesian inference problem, where the injected noise level serves as a prior and the observed trajectory reward as the likelihood. This Bayesian modeling fuses both sources of information to compute a robust posterior estimate of trajectory advantage, effectively guiding MLLMs to prefer visually grounded trajectories over noisy ones. Experiments on standard CoT quality, general capability, and hallucination benchmarks demonstrate that NoisyGRPO substantially improves generalization and robustness, especially in RL settings with small-scale MLLMs such as Qwen2.5-VL 3B. The project page is available at https://artanic30.github.io/project_pages/NoisyGRPO/.
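A minimal sketch of how a noise-level prior and an observed reward could be fused follows. It assumes a Gaussian prior and a Gaussian likelihood, so the posterior mean is a precision-weighted average, and it assumes the prior mean is `1 - noise_level` (cleaner inputs are expected to yield better trajectories). Both choices, the variances, and the function names are illustrative assumptions rather than the paper's formulation. The final group normalization follows the usual GRPO pattern of standardizing scores within a group of trajectories.

```python
from statistics import mean, pstdev


def posterior_mean(prior_mu, prior_var, reward, reward_var):
    """Gaussian conjugate update: precision-weighted average of prior and observation."""
    precision = 1.0 / prior_var + 1.0 / reward_var
    return (prior_mu / prior_var + reward / reward_var) / precision


def group_advantages(noise_levels, rewards, prior_var=0.25, reward_var=0.25):
    """Fuse prior and reward per trajectory, then standardize within the group."""
    post = [
        posterior_mean(1.0 - s, prior_var, r, reward_var)  # assumed prior: 1 - noise level
        for s, r in zip(noise_levels, rewards)
    ]
    mu, sd = mean(post), pstdev(post)
    return [(p - mu) / (sd + 1e-8) for p in post]


# Three trajectories sampled under increasing visual noise (toy values).
noise = [0.0, 0.5, 1.0]
rewards = [1.0, 1.0, 0.0]
adv = group_advantages(noise, rewards)
```

In this toy group the first two trajectories earn the same reward, but the one generated from the cleaner image receives the larger advantage, which matches the stated goal of preferring visually grounded trajectories over noisy ones.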

Lemin Liu, Fangchao Hu, Honghua Jiang, Yaru Chen, Limin Liu, Yongliang Qiao

🧩 TL;DR

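This study proposes CT-CLIP, a multi-branch recognition framework that combines a CNN and a Vision Transformer with CLIP-based multimodal learning to address the recognition challenge posed by the phenotypic heterogeneity of apple leaf diseases in complex orchard environments.

📘 Detailed Summary

Motivation: In complex orchard environments, the phenotypic heterogeneity of apple leaf diseases shows up as large variation among lesions, which challenges traditional multi-scale feature fusion methods. These methods only integrate multi-layer features extracted by convolutional neural networks and do not adequately account for the relationship between local and global features.

Method: The CNN-Transformer-CLIP (CT-CLIP) framework uses a CNN to extract local lesion details and a Vision Transformer to capture global structural relationships. An Adaptive Feature Fusion Module (AFFM) fuses these features dynamically to couple local and global information. A multimodal image-text learning approach built on pretrained CLIP weights aligns visual features with semantic descriptions of the diseases.

Result: CT-CLIP reaches accuracies of 97.38% on a publicly available apple disease dataset and 96.12% on a self-built dataset, outperforming several baseline methods.

Conclusion: CT-CLIP shows strong capability in agricultural disease recognition, improves identification accuracy under complex environmental conditions, and offers a practical solution for automated disease recognition in agriculture.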

📄 Abstract

In complex orchard environments, the phenotypic heterogeneity of different apple leaf diseases, characterized by significant variation among lesions, poses a challenge to traditional multi-scale feature fusion methods. These methods only integrate multi-layer features extracted by convolutional neural networks (CNNs) and fail to adequately account for the relationships between local and global features. Therefore, this study proposes a multi-branch recognition framework named CNN-Transformer-CLIP (CT-CLIP). The framework synergistically employs a CNN to extract local lesion detail features and a Vision Transformer to capture global structural relationships. An Adaptive Feature Fusion Module (AFFM) then dynamically fuses these features, achieving optimal coupling of local and global information and effectively addressing the diversity in lesion morphology and distribution. Additionally, to mitigate interference from complex backgrounds and significantly enhance recognition accuracy under few-shot conditions, this study proposes a multimodal image-text learning approach. By leveraging pre-trained CLIP weights, it achieves deep alignment between visual features and disease semantic descriptions. Experimental results show that CT-CLIP achieves accuracies of 97.38% and 96.12% on a publicly available apple disease dataset and a self-built dataset, respectively, outperforming several baseline methods. The proposed CT-CLIP demonstrates strong capabilities in recognizing agricultural diseases, significantly enhances identification accuracy under complex environmental conditions, and provides an innovative and practical solution for automated disease recognition in agricultural applications.

[13] Towards Physics-informed Spatial Intelligence with Human Priors: An Autonomous Driving Pilot Study

Guanlin Wu, Boyan Su, Yang Zhao, Pu Wang, Yichen Lin, Hao Frank Yang

🧩 TL;DR

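This paper proposes the Spatial Intelligence Grid (SIG), a structured, grid-based representation that explicitly encodes spatial relations for foundation models, and derives SIG-informed evaluation metrics that quantify a model's intrinsic visual-spatial intelligence, separating spatial capability from language priors.

📘 Detailed Summary

Motivation: Integrating and verifying spatial intelligence in foundation models remains challenging. Current practice often proxies visual-spatial intelligence with purely textual prompts and VQA-style scoring, which obscures geometry, invites linguistic shortcuts, and weakens attribution to genuinely spatial skills.

Method: SIG is proposed as a complementary channel to text. It is a structured, grid-based schema that explicitly encodes object layouts, inter-object relations, and physically grounded priors, giving a faithful, compositional representation of scene structure on which the SIG-informed evaluation metrics are built.

Result: In few-shot in-context learning, SIG yields consistently larger, more stable, and more comprehensive gains across all VSI metrics than VQA-only representations, indicating its promise as a data-labeling and training schema for learning VSI. The authors also release SIGBench, a benchmark of 1.4K driving frames.

Conclusion: SIG is a promising data-labeling and training schema for learning visual-spatial intelligence. Its structured representation separates spatial capability from language priors and supports both grid-based machine VSI tasks and attention-driven, human-like VSI tasks in autonomous-driving scenarios.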

📄 Abstract

How to integrate and verify spatial intelligence in foundation models remains an open challenge. Current practice often proxies Visual-Spatial Intelligence (VSI) with purely textual prompts and VQA-style scoring, which obscures geometry, invites linguistic shortcuts, and weakens attribution to genuinely spatial skills. We introduce Spatial Intelligence Grid (SIG): a structured, grid-based schema that explicitly encodes object layouts, inter-object relations, and physically grounded priors. As a complementary channel to text, SIG provides a faithful, compositional representation of scene structure for foundation-model reasoning. Building on SIG, we derive SIG-informed evaluation metrics that quantify a model's intrinsic VSI, which separates spatial capability from language priors. In few-shot in-context learning with state-of-the-art multimodal LLMs (e.g. GPT- and Gemini-family models), SIG yields consistently larger, more stable, and more comprehensive gains across all VSI metrics compared to VQA-only representations, indicating its promise as a data-labeling and training schema for learning VSI. We also release SIGBench, a benchmark of 1.4K driving frames annotated with ground-truth SIG labels and human gaze traces, supporting both grid-based machine VSI tasks and attention-driven, human-like VSI tasks in autonomous-driving scenarios.

[14] TokenCLIP: Token-wise Prompt Learning for Zero-shot Anomaly Detection

Qihang Zhou, Binbin Gao, Guansong Pang, Xin Wang, Jiming Chen, Shibo He

🧩 TL;DR

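This paper proposes TokenCLIP, a token-wise adaptation framework based on optimal transport that dynamically aligns visual tokens with learnable textual subspaces for fine-grained anomaly detection, addressing the inability of existing single-textual-space methods to capture diverse anomaly semantics.

📘 Detailed Summary

Motivation: Existing methods typically rely on a single textual space to align with visual semantics across diverse objects and domains. This indiscriminate alignment prevents the model from capturing varied anomaly semantics accurately, so a finer-grained alignment mechanism is needed.

Method: TokenCLIP expands the token-agnostic textual space into a set of orthogonal subspaces and formulates dynamic alignment as an optimal transport problem that assigns each visual token to a combination of semantically relevant subspaces. A top-k mask then sparsifies the transport plan so that different subspaces specialize in the semantics of different visual regions.

Result: Extensive experiments demonstrate the superiority of TokenCLIP on zero-shot anomaly detection.

Conclusion: The study indicates that token-level dynamic text-visual alignment improves fine-grained anomaly detection and that optimal transport provides an effective adaptive mechanism for this alignment.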

📄 Abstract

Adapting CLIP for anomaly detection on unseen objects has shown strong potential in a zero-shot manner. However, existing methods typically rely on a single textual space to align with visual semantics across diverse objects and domains. The indiscriminate alignment hinders the model from accurately capturing varied anomaly semantics. We propose TokenCLIP, a token-wise adaptation framework that enables dynamic alignment between visual and learnable textual spaces for fine-grained anomaly learning. Rather than mapping all visual tokens to a single, token-agnostic textual space, TokenCLIP aligns each token with a customized textual subspace that represents its visual characteristics. Explicitly assigning a unique learnable textual space to each token is computationally intractable and prone to insufficient optimization. We instead expand the token-agnostic textual space into a set of orthogonal subspaces, and then dynamically assign each token to a subspace combination guided by semantic affinity, which jointly supports customized and efficient token-wise adaptation. To this end, we formulate dynamic alignment as an optimal transport problem, where all visual tokens in an image are transported to textual subspaces based on semantic similarity. The transport constraints of OT ensure sufficient optimization across subspaces and encourage them to focus on different semantics. Solving the problem yields a transport plan that adaptively assigns each token to semantically relevant subspaces. A top-k masking is then applied to sparsify the plan and specialize subspaces for distinct visual regions. Extensive experiments demonstrate the superiority of TokenCLIP.
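The token-to-subspace assignment can be sketched with entropic optimal transport solved by Sinkhorn iterations, followed by a per-token top-k mask. This is a generic illustration under stated assumptions: uniform marginals over tokens and subspaces, cost equal to negative similarity, a regularization strength of 0.5, and no renormalization after masking. None of these details are given in the abstract, and the similarity matrix is a toy example.

```python
import math


def sinkhorn(sim, eps=0.5, iters=200):
    """Entropic OT plan between n tokens (rows) and m subspaces (columns).

    Uses uniform marginals and the Gibbs kernel exp(sim / eps), i.e. cost = -sim.
    """
    n, m = len(sim), len(sim[0])
    kernel = [[math.exp(s / eps) for s in row] for row in sim]
    u, v = [1.0] * n, [1.0] * m
    a, b = 1.0 / n, 1.0 / m
    for _ in range(iters):
        u = [a / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def topk_mask(plan, k):
    """Keep the k largest entries in each token's row and zero out the rest."""
    masked = []
    for row in plan:
        keep = set(sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k])
        masked.append([p if j in keep else 0.0 for j, p in enumerate(row)])
    return masked


# Toy similarities between 4 visual tokens and 3 textual subspaces.
sim = [
    [0.9, 0.2, 0.1],
    [0.8, 0.3, 0.2],
    [0.1, 0.9, 0.3],
    [0.2, 0.1, 0.7],
]
plan = sinkhorn(sim)
sparse_plan = topk_mask(plan, k=2)
```

The column constraint is what forces every subspace to receive a share of the tokens, which corresponds to the abstract's point that the transport constraints ensure sufficient optimization across subspaces.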

[15] GranViT: A Fine-Grained Vision Model With Autoregressive Perception For MLLMs

Guanghao Zheng, Bowen Shi, Mingxing Xu, Ruoyu Sun, Peisen Zhao, Zhibo Zhang, Wenrui Dai, Junni Zou, Hongkai Xiong, Xiaopeng Zhang, Qi Tian

🧩 TL;DR

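This paper proposes GranViT, a Vision Transformer that combines fine-grained feature extraction with semantic alignment to large language models through region-level autoregressive training, addressing the limited fine-grained perception of existing vision encoders.

📘 Detailed Summary

Motivation: Existing vision encoders focus on global image representations and overlook fine-grained regional analysis. The scarcity of fine-grained annotated data and the lack of a dedicated pretraining paradigm limit fine-grained perception in multimodal large language models.

Method: The authors construct Gran-29M, a dataset of 2 million natural and OCR images with over 180 million high-quality region-level annotations, and develop a pretraining-adaptation framework with a self-distillation mechanism. Bounding-box-to-caption regression strengthens the localized visual representation of the vision encoder, and caption-to-bounding-box regression improves the LLM's use of visual features and its localization ability.

Result: Extensive experiments show that GranViT surpasses existing vision encoders, transfers well across different LLMs, and achieves state-of-the-art results on fine-grained recognition, multimodal VQA, and OCR understanding.

Conclusion: Large-scale fine-grained pretraining with region-level autoregressive training strengthens the fine-grained perception of vision encoders and gives multimodal large language models a stronger basis for visual understanding.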

📄 Abstract

Vision encoders are indispensable for enabling the impressive performance of Multi-modal Large Language Models (MLLMs) in vision-language tasks such as visual question answering and reasoning. However, existing vision encoders focus on global image representations but overlook fine-grained regional analysis. They are limited in fine-grained perception due to the scarcity of fine-grained annotated data and the lack of a fine-grained pre-training paradigm. In this paper, we propose GranViT, a novel Vision Transformer that integrates fine-grained feature extraction with semantic alignment to Large Language Models (LLMs) via region-level autoregressive training. We first construct Gran-29M, a dataset comprising 2 million natural and OCR images paired with over 180 million high-quality region-level annotations, to enable large-scale fine-grained pretraining. We then develop a pretraining-adaptation framework along with a self-distillation mechanism to train fine-grained GranViT on Gran-29M. We fully exploit the fine-grained annotations in Gran-29M, using bounding-box-to-caption regression to enhance the localized visual representation of the vision encoder during pretraining and caption-to-bounding-box regression to improve vision feature utilization and localization for the LLM during adaptation. We further incorporate a self-distillation mechanism that imposes explicit localization constraints on the vision encoder to strengthen its regional reasoning capability. Extensive experiments show that GranViT surpasses existing vision encoders and attains strong transferability to varying LLMs. Remarkably, it achieves state-of-the-art results on fine-grained recognition, multimodal VQA, and OCR understanding.

[16] Sample By Step, Optimize By Chunk: Chunk-Level GRPO For Text-to-Image Generation

Yifu Luo, Penghui Du, Bo Li, Sinan Du, Tiantian Zhang, Yongzhe Chang, Kai Wu, Kun Gai, Xueqian Wang

🧩 TL;DR

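This paper proposes Chunk-GRPO, the first chunk-level GRPO method. By shifting the optimization paradigm from the step level to the chunk level, it addresses inaccurate advantage attribution and the neglect of temporal dynamics in generation, achieving better preference alignment and image quality in text-to-image generation.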

📘 Detailed Summary

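Motivation: Existing Group Relative Policy Optimization (GRPO) methods for flow-matching-based text-to-image generation have two key limitations: inaccurate advantage attribution and neglect of the temporal dynamics of generation. Both limit performance gains and practical effectiveness.

Method: Chunk-GRPO groups consecutive steps into coherent 'chunks' that capture the intrinsic temporal dynamics of flow matching, optimizes the policy at the chunk level, and introduces an optional weighted sampling strategy to further improve performance.

Result: Extensive experiments show that Chunk-GRPO achieves superior results in both preference alignment and image quality, supporting the effectiveness and promise of chunk-level optimization for GRPO-based methods.

Conclusion: The study shows that moving the optimization paradigm from the step level to the chunk level can effectively mitigate the limitations of GRPO. Chunk-level optimization offers a new direction for GRPO-based methods and shows clear advantages in text-to-image generation.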

📄 Abstract

Group Relative Policy Optimization (GRPO) has shown strong potential for flow-matching-based text-to-image (T2I) generation, but it faces two key limitations: inaccurate advantage attribution, and the neglect of temporal dynamics of generation. In this work, we argue that shifting the optimization paradigm from the step level to the chunk level can effectively alleviate these issues. Building on this idea, we propose Chunk-GRPO, the first chunk-level GRPO-based approach for T2I generation. The insight is to group consecutive steps into coherent 'chunks' that capture the intrinsic temporal dynamics of flow matching, and to optimize policies at the chunk level. In addition, we introduce an optional weighted sampling strategy to further enhance performance. Extensive experiments show that Chunk-GRPO achieves superior results in both preference alignment and image quality, highlighting the promise of chunk-level optimization for GRPO-based methods.
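
To make the step-versus-chunk distinction concrete, here is a minimal sketch. It computes group-relative advantages in the way GRPO-style methods generally do (reward minus group mean, divided by group standard deviation) and partitions sampling steps into contiguous chunks. The fixed chunk size, the function names, and the toy rewards are illustrative assumptions; the abstract does not say how Chunk-GRPO chooses chunk boundaries or weights them.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def split_into_chunks(num_steps, chunk_size):
    """Partition step indices 0..num_steps-1 into contiguous chunks.

    Fixed-size chunks are an assumption made for illustration only.
    """
    steps = list(range(num_steps))
    return [steps[i:i + chunk_size] for i in range(0, num_steps, chunk_size)]


# Toy example: 4 images sampled for one prompt, 10 sampling steps each.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
chunks = split_into_chunks(num_steps=10, chunk_size=4)
```

A step-level objective would attach each sample's advantage to all 10 steps separately, whereas a chunk-level objective would attach it to each of the 3 chunks. How Chunk-GRPO actually forms the chunk-level policy term is not described in the abstract.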

Bingchen Miao, Rong Wei, Zhiqi Ge, Xiaoquan sun, Shiqi Gao, Jingzhe Zhu, Renhan Wang, Siliang Tang, Jun Xiao, Rui Tang, Juncheng Li

🧩 TL;DR

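This paper proposes SAGE-3D, a new paradigm that upgrades 3D Gaussian Splatting (3DGS) into an executable, semantically and physically aligned environment. Through Object-Centric Semantic Grounding and Physics-Aware Execution Jointing, it addresses the lack of fine-grained semantics and physical executability in 3DGS for vision-language navigation.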

📘 Detailed Summary

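Motivation: 3D Gaussian Splatting offers photorealistic real-time rendering, but it lacks the fine-grained semantic understanding and physical executability that vision-language navigation requires, which limits its potential for narrowing the sim-to-real gap.

Method: The SAGE-3D framework has two components: Object-Centric Semantic Grounding, which adds object-level fine-grained annotations to 3DGS, and Physics-Aware Execution Jointing, which embeds collision objects into 3DGS and builds rich physical interfaces. The authors also release InteriorGS, a dataset of 1K object-annotated 3DGS indoor scenes, and SAGE-Bench, the first 3DGS-based VLN benchmark.

Result: Experiments show that 3DGS scene data is harder to converge on but generalizes strongly, improving baseline performance by 31% on the VLN-CE Unseen task. The accompanying SAGE-Bench benchmark contains 2M VLN data.

Conclusion: The study demonstrates that 3DGS can be extended with semantic and physical alignment, provides a more realistic simulation environment for vision-language navigation, markedly improves the generalization of navigation models, and lays groundwork for more faithful sim-to-real transfer.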

📄 Abstract

3D Gaussian Splatting (3DGS), a 3D representation method with photorealistic real-time rendering capabilities, is regarded as an effective tool for narrowing the sim-to-real gap. However, it lacks fine-grained semantics and physical executability for Visual-Language Navigation (VLN). To address this, we propose SAGE-3D (Semantically and Physically Aligned Gaussian Environments for 3D Navigation), a new paradigm that upgrades 3DGS into an executable, semantically and physically aligned environment. It comprises two components: (1) Object-Centric Semantic Grounding, which adds object-level fine-grained annotations to 3DGS; and (2) Physics-Aware Execution Jointing, which embeds collision objects into 3DGS and constructs rich physical interfaces. We release InteriorGS, containing 1K object-annotated 3DGS indoor scene data, and introduce SAGE-Bench, the first 3DGS-based VLN benchmark with 2M VLN data. Experiments show that 3DGS scene data is more difficult to converge, while exhibiting strong generalizability, improving baseline performance by 31% on the VLN-CE Unseen task. The data and code will be available soon.

[18] FineRS: Fine-grained Reasoning and Segmentation of Small Objects with Reinforcement Learning

Lu Zhang, Jiazuo Yu, Haomiao Xiong, Ping Hu, Yunzhi Zhuge, Huchuan Lu, You He

🧩 TL;DR

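This paper proposes FineRS, a two-stage reinforcement learning framework built on multimodal large language models. Its coarse-to-fine pipeline of global semantic exploration and localized perceptual refinement targets precise understanding and localization of extremely small objects in high-resolution images. The method outperforms existing state-of-the-art approaches on both instruction-guided segmentation and visual reasoning.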

📘 Detailed Summary

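Motivation: Because their input resolution is restricted, multimodal large language models struggle to precisely understand and localize visual details in high-resolution images, especially extremely small objects embedded in cluttered backgrounds.

Method: FineRS adopts a two-stage coarse-to-fine pipeline comprising Global Semantic Exploration (GSE) and Localized Perceptual Refinement (LPR). GSE performs instruction-guided reasoning to produce a textual response and a coarse target region, and LPR refines that region into an accurate bounding box and segmentation mask. A locate-informed retrospective reward feeds LPR's outputs back to optimize GSE for more robust coarse region exploration.

Result: Experiments on FineRS-4k and public datasets show that the method consistently outperforms state-of-the-art MLLM-based approaches on instruction-guided segmentation and visual reasoning. The newly introduced FineRS-4k dataset is designed to evaluate MLLMs on attribute-level reasoning and pixel-level segmentation of subtle, small-scale targets in complex high-resolution scenes.

Conclusion: The study demonstrates that a reinforcement learning framework can effectively couple the reasoning and segmentation abilities of MLLMs, offers a new solution for precisely localizing extremely small objects in high-resolution images, and points to directions for future research on complex visual scene understanding.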

📄 Abstract

Multi-modal Large Language Models (MLLMs) have shown remarkable capabilities across a wide range of vision-language tasks. However, due to the restricted input resolutions, MLLMs face significant challenges in precisely understanding and localizing visual details in high-resolution images -- particularly when dealing with extra-small objects embedded in cluttered contexts. To address this issue, we propose FineRS, a two-stage MLLM-based reinforcement learning framework for jointly reasoning and segmenting extremely small objects within high-resolution scenes. FineRS adopts a coarse-to-fine pipeline comprising Global Semantic Exploration (GSE) and Localized Perceptual Refinement (LPR). Specifically, GSE performs instruction-guided reasoning to generate a textual response and a coarse target region, while LPR refines this region to produce an accurate bounding box and segmentation mask. To couple the two stages, we introduce a locate-informed retrospective reward, where LPR's outputs are used to optimize GSE for more robust coarse region exploration. Additionally, we present FineRS-4k, a new dataset for evaluating MLLMs on attribute-level reasoning and pixel-level segmentation on subtle, small-scale targets in complex high-resolution scenes. Experimental results on FineRS-4k and public datasets demonstrate that our method consistently outperforms state-of-the-art MLLM-based approaches on both instruction-guided segmentation and visual reasoning tasks.

[19] VL-SAE: Interpreting and Enhancing Vision-Language Alignment with a Unified Concept Set

Shufan Shen, Junshu Sun, Qingming Huang, Shuhui Wang

🧩 TL;DR

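This paper proposes VL-SAE, a sparse autoencoder that encodes vision-language representations into hidden-layer activations to make multimodal alignment interpretable. Each neuron corresponds to a concept represented by semantically similar images and texts, yielding a unified interpretation framework.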

📘 Detailed Summary

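Motivation: Current vision-language models perform well at multimodal reasoning, but the interpretability of their alignment component remains underexplored, mainly because it is difficult to map the semantics of multimodal representations onto a unified concept set.

Method: VL-SAE is a sparse autoencoder with a distance-based encoder and two modality-specific decoders. It aligns multimodal representations explicitly via cosine similarity and encourages semantically similar representations to exhibit consistent neuron activations during self-supervised training.

Result: Experiments on multiple vision-language models show that VL-SAE is highly capable of interpreting and enhancing vision-language alignment. The alignment between vision and language representations can be understood by comparing them with concepts, and downstream tasks such as zero-shot image classification and hallucination elimination see performance gains.

Conclusion: The study provides an interpretable analysis framework for multimodal representation alignment. Strengthening vision-language representations through concept-level alignment improves both interpretability and practical performance, and offers a useful approach to making multimodal AI systems more transparent.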

📄 Abstract

The alignment of vision-language representations endows current Vision-Language Models (VLMs) with strong multi-modal reasoning capabilities. However, the interpretability of the alignment component remains uninvestigated due to the difficulty in mapping the semantics of multi-modal representations into a unified concept set. To address this problem, we propose VL-SAE, a sparse autoencoder that encodes vision-language representations into its hidden activations. Each neuron in its hidden layer correlates to a concept represented by semantically similar images and texts, thereby interpreting these representations with a unified concept set. To establish the neuron-concept correlation, we encourage semantically similar representations to exhibit consistent neuron activations during self-supervised training. First, to measure the semantic similarity of multi-modal representations, we perform their alignment in an explicit form based on cosine similarity. Second, we construct the VL-SAE with a distance-based encoder and two modality-specific decoders to ensure the activation consistency of semantically similar representations. Experiments across multiple VLMs (e.g., CLIP, LLaVA) demonstrate the superior capability of VL-SAE in interpreting and enhancing the vision-language alignment. For interpretation, the alignment between vision and language representations can be understood by comparing their semantics with concepts. For enhancement, the alignment can be strengthened by aligning vision-language representations at the concept level, contributing to performance improvements in downstream tasks, including zero-shot image classification and hallucination elimination. Codes are available at https://github.com/ssfgunner/VL-SAE.
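
The cosine similarity that the abstract uses to measure semantic similarity between representations can be sketched as follows. The vectors are toy values and the function name is illustrative; this shows only the similarity measure, not the VL-SAE encoder or decoders.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy embeddings (assumed values, not taken from any model).
image_vec = [3.0, 4.0]
text_vec = [6.0, 8.0]    # same direction as image_vec
other_vec = [-4.0, 3.0]  # orthogonal to image_vec
aligned = cosine_similarity(image_vec, text_vec)
unrelated = cosine_similarity(image_vec, other_vec)
```

Because the measure ignores vector length, `image_vec` and `text_vec` count as fully similar even though their norms differ.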

Yue Feng, Jinwei Hu, Qijia Lu, Jiawei Niu, Li Tan, Shuo Yuan, Ziyi Yan, Yizhen Jia, Qingzhi He, Shiping Ge, Ethan Q. Chen, Wentong Li, Limin Wang, Jie Qin

🧩 TL;DR

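This paper introduces the Multi-modal Untrimmed Video Retrieval task and the MUVR benchmark, which target retrieval of untrimmed videos with multimodal queries on long-video platforms. Using a large-scale dataset of 53K videos and 1,050 multimodal queries, it systematically evaluates the limitations of existing retrieval models and multimodal large language models.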

📘 Detailed Summary

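Motivation: Current video retrieval methods mainly target trimmed short videos and handle untrimmed videos and multimodal queries on long-video platforms poorly. Existing benchmarks lack dedicated evaluation of multimodal query support and untrimmed video retrieval, so they do not reflect real application needs.

Method: MUVR supports video-centric multimodal queries (long text descriptions, video tag prompts, and mask prompts) and adopts a one-to-many retrieval paradigm. It builds multi-level visual correspondence based on core video content across six levels (copy, event, scene, instance, action, and others). It comes in three versions (Base, Filter, QA): Base and Filter evaluate retrieval models, while QA evaluates multimodal large language models.

Result: On a large-scale dataset of 53K untrimmed videos and 1,050 multimodal queries, the authors evaluate 3 state-of-the-art video retrieval models, 6 image-based vision-language models, and 10 multimodal large language models. The results expose the limitations of existing methods on untrimmed videos and multimodal queries, and the weaknesses of multimodal large language models in multi-video understanding and reranking.

Conclusion: The MUVR benchmark systematically exposes key challenges for current video retrieval on untrimmed videos and multimodal queries, provides an evaluation standard and direction for future research, and underlines the importance of multimodal understanding and long-video processing for video retrieval.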

📄 Abstract

We propose the Multi-modal Untrimmed Video Retrieval task, along with a new benchmark (MUVR) to advance video retrieval for long-video platforms. MUVR aims to retrieve untrimmed videos containing relevant segments using multi-modal queries. It has the following features: 1) Practical retrieval paradigm: MUVR supports video-centric multi-modal queries, expressing fine-grained retrieval needs through long text descriptions, video tag prompts, and mask prompts. It adopts a one-to-many retrieval paradigm and focuses on untrimmed videos, tailored for long-video platform applications. 2) Multi-level visual correspondence: To cover common video categories (e.g., news, travel, dance) and precisely define retrieval matching criteria, we construct multi-level visual correspondence based on core video content (e.g., news events, travel locations, dance moves) which users are interested in and want to retrieve. It covers six levels: copy, event, scene, instance, action, and others. 3) Comprehensive evaluation criteria: We develop 3 versions of MUVR (i.e., Base, Filter, QA). MUVR-Base/Filter evaluates retrieval models, while MUVR-QA assesses MLLMs in a question-answering format. We also propose a Reranking Score to evaluate the reranking ability of MLLMs. MUVR consists of 53K untrimmed videos from the video platform Bilibili, with 1,050 multi-modal queries and 84K matches. Extensive evaluations of 3 state-of-the-art video retrieval models, 6 image-based VLMs, and 10 MLLMs are conducted. MUVR reveals the limitations of retrieval methods in processing untrimmed videos and multi-modal queries, as well as MLLMs in multi-video understanding and reranking. Our code and benchmark are available at https://github.com/debby-0527/MUVR.

[21] Bridging the gap to real-world language-grounded visual concept learning

Whie Jung, Semin Kim, Junee Kim, Seunghoon Hong

🧩 TL;DR

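This study proposes a scalable framework for language-grounded visual concept learning that adaptively identifies image-related concept axes and grounds visual concepts along them in real-world scenes, enabling diverse visual editing without predefined concept categories.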

📘 Detailed Summary

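Motivation: Existing language-grounded visual concept learning methods are limited to a few predefined primitive axes (such as color and shape) and are mostly explored on synthetic datasets, so they cannot cover the rich and varied semantic concepts of real-world scenes.

Method: The framework uses a pretrained vision-language model and a universal prompting strategy to adaptively identify diverse image-related concept axes. A universal concept encoder binds visual features to the discovered axes without adding model parameters per concept, and a compositional anchoring objective is optimized so that each axis can be manipulated independently.

Result: On ImageNet, CelebA-HQ, and AFHQ, the framework shows superior editing capability on real-world concepts that are too varied to predefine manually, and it outperforms existing visual concept learning and text-based editing methods in compositional generalization.

Conclusion: The framework removes the restriction to predefined concept axes and enables adaptive learning and editing of diverse visual concepts in real-world scenes, providing a scalable solution for language-grounded visual concept learning with practical value.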

📄 Abstract

Human intelligence effortlessly interprets visual scenes along a rich spectrum of semantic dimensions. However, existing approaches to language-grounded visual concept learning are limited to a few predefined primitive axes, such as color and shape, and are typically explored in synthetic datasets. In this work, we propose a scalable framework that adaptively identifies image-related concept axes and grounds visual concepts along these axes in real-world scenes. Leveraging a pretrained vision-language model and our universal prompting strategy, our framework identifies diverse image-related axes without any prior knowledge. Our universal concept encoder adaptively binds visual features to the discovered axes without introducing additional model parameters for each concept. To ground visual concepts along the discovered axes, we optimize a compositional anchoring objective, which ensures that each axis can be independently manipulated without affecting others. We demonstrate the effectiveness of our framework on subsets of ImageNet, CelebA-HQ, and AFHQ, showcasing superior editing capabilities across diverse real-world concepts that are too varied to be manually predefined. Our method also exhibits strong compositional generalization, outperforming existing visual concept learning and text-based editing methods. The code is available at https://github.com/whieya/Language-grounded-VCL.

[22] MoniTor: Exploiting Large Language Models with Instruction for Online Video Anomaly Detection

Shengtian Yang, Yue Feng, Yingshi Liu, Jingrou Zhang, Jie Qin

🧩 TL;DR

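This paper proposes MoniTor, a memory-based online scoring queue scheme for training-free video anomaly detection. By combining pretrained vision-language models with an LSTM-inspired prediction mechanism, it achieves state-of-the-art performance in real-time video anomaly detection.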

📘 Detailed Summary

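Motivation: Current video anomaly detection research focuses mainly on offline settings, while online video anomaly detection has received little attention because of real-time constraints and computational intensity. This paper aims to address the inherent complexities of online video anomaly detection.

Method: MoniTor feeds streaming input to a vision-language model and adds a prediction mechanism inspired by LSTM networks to model temporal dependencies. It also designs a scoring queue and an anomaly prior that dynamically store recent scores, giving LLMs temporal guidance for distinguishing normal from abnormal behavior.

Result: Evaluations on two large datasets, UCF-Crime and XD-Violence, show that MoniTor outperforms existing state-of-the-art methods without any training and is competitive with weakly supervised methods.

Conclusion: The study demonstrates the effectiveness of pretrained large models for online video anomaly detection. Its memory mechanism and temporal modeling offer a new solution for real-time anomaly detection while remaining training-free.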

📄 Abstract

Video Anomaly Detection (VAD) aims to locate unusual activities or behaviors within videos. Recently, offline VAD has garnered substantial research attention, which has been invigorated by the progress in large language models (LLMs) and vision-language models (VLMs), offering the potential for a more nuanced understanding of anomalies. However, online VAD has seldom received attention due to real-time constraints and computational intensity. In this paper, we introduce a novel Memory-based online scoring queue scheme for Training-free VAD (MoniTor), to address the inherent complexities in online VAD. Specifically, MoniTor applies a streaming input to VLMs, leveraging the capabilities of pre-trained large-scale models. To capture temporal dependencies more effectively, we incorporate a novel prediction mechanism inspired by Long Short-Term Memory (LSTM) networks. This ensures the model can effectively model past states and leverage previous predictions to identify anomalous behaviors. Thereby, it better understands the current frame. Moreover, we design a scoring queue and an anomaly prior to dynamically store recent scores and cover all anomalies in the monitoring scenario, providing guidance for LLMs to distinguish between normal and abnormal behaviors over time. We evaluate MoniTor on two large datasets (i.e., UCF-Crime and XD-Violence) containing various surveillance and real-world scenarios. The results demonstrate that MoniTor outperforms state-of-the-art methods and is competitive with weakly supervised methods without training. Code is available at https://github.com/YsTvT/MoniTor.
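
One simple way to picture a scoring queue that "dynamically stores recent scores" is a fixed-size sliding window. The sketch below is an assumption-laden illustration: the class name, the window length, and the use of a mean as the summary are all hypothetical, and the abstract does not specify how MoniTor's queue is sized or how its contents are passed to the LLM.

```python
from collections import deque


class ScoringQueue:
    """Keep the most recent anomaly scores in a fixed-size window (hypothetical helper)."""

    def __init__(self, maxlen=8):
        # deque with maxlen drops the oldest score automatically when full.
        self.scores = deque(maxlen=maxlen)

    def push(self, score):
        self.scores.append(float(score))

    def recent_mean(self):
        """Average of the stored scores, or 0.0 if nothing has been stored yet."""
        return sum(self.scores) / len(self.scores) if self.scores else 0.0


# Toy stream of per-frame anomaly scores (assumed values).
queue = ScoringQueue(maxlen=3)
for s in [0.1, 0.2, 0.9, 0.8]:
    queue.push(s)
```

After the fourth push the first score has been evicted, so the summary reflects only the latest three frames. That bounded memory is what keeps such a scheme usable in a streaming setting.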

[23] CXR-LanIC: Language-Grounded Interpretable Classifier for Chest X-Ray Diagnosis

Yiming Tang, Wenjia Zhong, Rushi Shah, Dianbo Liu

🧩 TL;DR

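This study proposes CXR-LanIC, a framework that trains transcoder-based sparse autoencoders on a BiomedCLIP diagnostic classifier to decompose medical image representations into interpretable visual patterns, providing transparent, verifiable explanations while maintaining competitive diagnostic accuracy.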

📘 Detailed Summary

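Motivation: Deep learning models have achieved remarkable accuracy in chest X-ray diagnosis, but the black-box nature of their predictions limits widespread clinical adoption. Clinicians need transparent, verifiable explanations to trust automated diagnoses and identify potential failure modes.

Method: The approach trains transcoder-based sparse autoencoders. An ensemble of 100 transcoders trained on multimodal embeddings from the MIMIC-CXR dataset yields approximately 5,000 monosemantic patterns spanning cardiac, pulmonary, pleural, structural, device, and artifact categories.

Result: CXR-LanIC achieves competitive diagnostic accuracy on five key findings. Predictions decompose into 20-50 interpretable patterns, each showing consistent activation across images that share specific radiological features, and attribution can be checked transparently through verifiable activation galleries.

Conclusion: The key innovation is extracting interpretable features from a classifier trained on specific diagnostic objectives rather than from general-purpose embeddings, which ensures that the discovered patterns are directly relevant to clinical decision-making. The study shows that medical AI systems can be both accurate and interpretable, supporting safer clinical deployment through transparent, clinically grounded explanations.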

📄 Abstract

Deep learning models have achieved remarkable accuracy in chest X-ray diagnosis, yet their widespread clinical adoption remains limited by the black-box nature of their predictions. Clinicians require transparent, verifiable explanations to trust automated diagnoses and identify potential failure modes. We introduce CXR-LanIC (Language-Grounded Interpretable Classifier for Chest X-rays), a novel framework that addresses this interpretability challenge through task-aligned pattern discovery. Our approach trains transcoder-based sparse autoencoders on a BiomedCLIP diagnostic classifier to decompose medical image representations into interpretable visual patterns. By training an ensemble of 100 transcoders on multimodal embeddings from the MIMIC-CXR dataset, we discover approximately 5,000 monosemantic patterns spanning cardiac, pulmonary, pleural, structural, device, and artifact categories. Each pattern exhibits consistent activation behavior across images sharing specific radiological features, enabling transparent attribution where predictions decompose into 20-50 interpretable patterns with verifiable activation galleries. CXR-LanIC achieves competitive diagnostic accuracy on five key findings while providing the foundation for natural language explanations through planned large multimodal model annotation. Our key innovation lies in extracting interpretable features from a classifier trained on specific diagnostic objectives rather than general-purpose embeddings, ensuring discovered patterns are directly relevant to clinical decision-making. This demonstrates that medical AI systems can be both accurate and interpretable, supporting safer clinical deployment through transparent, clinically grounded explanations.

[24] Towards a Golden Classifier-Free Guidance Path via Foresight Fixed Point Iterations

Kaibo Wang, Jianda Mao, Tong Wu, Yang Xiang

🧩 TL;DR

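This paper proposes Foresight Guidance (FSG), which reframes conditional guidance as a fixed-point iteration problem to address the efficiency limits of existing Classifier-Free Guidance (CFG) methods. FSG prioritizes longer-interval subproblems in early diffusion stages and uses more iterations there, outperforming state-of-the-art methods in both image quality and computational efficiency.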

📘 Detailed Summary

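Motivation: Existing Classifier-Free Guidance (CFG) methods stem from divergent theoretical interpretations, which limits the design space and obscures key design choices. These methods are shown to be special cases of single-step short-interval iteration, which is theoretically inefficient, so a unified theoretical framework is needed to improve conditional guidance.

Method: The paper offers a unified perspective that reframes conditional guidance as fixed-point iteration, seeking a golden path along which latents produce consistent outputs under conditional and unconditional generation. It introduces Foresight Guidance (FSG), which prioritizes longer-interval subproblems in early diffusion stages with more iterations, going beyond the single-step short-interval iteration of conventional CFG.

Result: Extensive experiments across diverse datasets and model architectures validate FSG's superiority over state-of-the-art methods in both image quality and computational efficiency.

Conclusion: The study offers a new theoretical perspective on conditional guidance and reveals the potential of adaptive design. The unified framework explains the limitations of existing methods and opens new directions for optimizing conditional generative models, advancing text-to-image generation.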

📄 Abstract

Classifier-Free Guidance (CFG) is an essential component of text-to-image diffusion models, and understanding and advancing its operational mechanisms remains a central focus of research. Existing approaches stem from divergent theoretical interpretations, thereby limiting the design space and obscuring key design choices. To address this, we propose a unified perspective that reframes conditional guidance as fixed point iterations, seeking to identify a golden path where latents produce consistent outputs under both conditional and unconditional generation. We demonstrate that CFG and its variants constitute a special case of single-step short-interval iteration, which is theoretically proven to exhibit inefficiency. To this end, we introduce Foresight Guidance (FSG), which prioritizes solving longer-interval subproblems in early diffusion stages with increased iterations. Extensive experiments across diverse datasets and model architectures validate the superiority of FSG over state-of-the-art methods in both image quality and computational efficiency. Our work offers novel perspectives for conditional guidance and unlocks the potential of adaptive design.
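
For reference, the standard CFG combination that this line of work starts from can be written in a few lines. The sketch below shows only that well-known update plus a toy "consistency gap" between the conditional and unconditional predictions. The gap function and all names are illustrative assumptions meant to convey the idea of a path where the two predictions agree; this is not the paper's fixed-point operator or FSG itself.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Standard CFG: extrapolate from the unconditional toward the conditional prediction."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def consistency_gap(eps_uncond, eps_cond):
    """L2 distance between the two predictions (illustrative measure, not from the paper)."""
    return sum((c - u) ** 2 for u, c in zip(eps_uncond, eps_cond)) ** 0.5


# Toy two-dimensional noise predictions (assumed values).
eps_u = [0.0, 1.0]
eps_c = [1.0, 1.0]
guided = cfg_combine(eps_u, eps_c, guidance_scale=2.0)
gap = consistency_gap(eps_u, eps_c)
```

With a guidance scale of 1 the combination reduces to the conditional prediction, and when the gap is zero the guidance scale has no effect at all, which is the sense in which the two generation modes are consistent.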

[25] Foley Control: Aligning a Frozen Latent Text-to-Audio Model to Video

Ciara Rowles, Varun Jampani, Simon Donné, Shimon Vainer, Julian Parker, Zach Evans

🧩 TL;DR

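Foley Control is a lightweight approach to video-guided Foley that keeps pretrained single-modality models frozen and learns only a small cross-attention bridge between them. It connects V-JEPA2 video embeddings to a frozen Stable Audio Open DiT text-to-audio model, enabling efficient multimodal audio generation.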

📘 Detailed Summary

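Motivation: Current multimodal systems usually require end-to-end retraining, which is computationally expensive and not modular. This work aims for a lightweight method that synchronizes video and audio effectively while preserving the performance of the pretrained models and prompt-driven controllability.

Method: The method inserts compact video cross-attention after the existing text cross-attention, so prompts set global semantics while video refines timing and local dynamics. Video tokens are pooled to reduce memory use and stabilize training, and the audio prior is left unchanged so that only the audio-video dependency is learned.

Result: On curated video-audio benchmarks, Foley Control delivers competitive temporal and semantic alignment with far fewer trainable parameters than recent multimodal systems, while preserving prompt-driven controllability and production-friendly modularity.

Conclusion: The study shows that a lightweight bridge can effectively connect frozen single-modality models for efficient multimodal generation. The same bridge design could potentially extend to other audio modalities, suggesting a path for modular multimodal system development.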

📄 Abstract

Foley Control is a lightweight approach to video-guided Foley that keeps pretrained single-modality models frozen and learns only a small cross-attention bridge between them. We connect V-JEPA2 video embeddings to a frozen Stable Audio Open DiT text-to-audio (T2A) model by inserting compact video cross-attention after the model's existing text cross-attention, so prompts set global semantics while video refines timing and local dynamics. The frozen backbones retain strong marginals (video; audio given text) and the bridge learns the audio-video dependency needed for synchronization -- without retraining the audio prior. To cut memory and stabilize training, we pool video tokens before conditioning. On curated video-audio benchmarks, Foley Control delivers competitive temporal and semantic alignment with far fewer trainable parameters than recent multi-modal systems, while preserving prompt-driven controllability and production-friendly modularity (swap/upgrade encoders or the T2A backbone without end-to-end retraining). Although we focus on Video-to-Foley, the same bridge design can potentially extend to other audio modalities (e.g., speech).
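
The memory saving from pooling video tokens before conditioning can be illustrated with plain mean pooling over consecutive tokens. This is a sketch under assumptions: the abstract says tokens are pooled but not how, so the averaging operator, the window size, and the function name are all hypothetical.

```python
def mean_pool_tokens(tokens, window):
    """Average consecutive groups of `window` token vectors; the last group may be shorter."""
    pooled = []
    for start in range(0, len(tokens), window):
        group = tokens[start:start + window]
        dim = len(group[0])
        pooled.append([sum(vec[d] for vec in group) / len(group) for d in range(dim)])
    return pooled


# Five toy video tokens of dimension 2 (assumed values).
video_tokens = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0], [9.0, 8.0]]
pooled = mean_pool_tokens(video_tokens, window=2)
```

Here five tokens shrink to three, so a cross-attention layer that attends over them would see a proportionally shorter key/value sequence.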

[26] S3OD: Towards Generalizable Salient Object Detection with Synthetic Data

Orest Kupyn, Hirokatsu Kataoka, Christian Rupprecht

🧩 TL;DR

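This paper presents a method that substantially improves the generalization of salient object detection through large-scale synthetic data generation and an ambiguity-aware architecture. It introduces S3OD, a dataset of over 139,000 high-resolution images, and achieves 20-50% error reduction in cross-dataset generalization.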

📘 Detailed Summary

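Motivation: Salient object detection is data-bounded: expensive pixel-precise annotations force separate models to be trained for related subtasks such as DIS and HR-SOD, which limits generalization and efficiency.

Method: A multimodal diffusion pipeline extracts labels from diffusion and DINO-v3 features to build S3OD, a dataset of over 139,000 high-resolution images. An iterative generation framework prioritizes challenging categories based on model performance, and a streamlined multi-mask decoder handles the inherent ambiguity of salient object detection.

Result: Models trained only on synthetic data achieve 20-50% error reduction in cross-dataset generalization, and fine-tuned versions reach state-of-the-art performance on DIS and HR-SOD benchmarks.

Conclusion: Combining large-scale synthetic data generation with an ambiguity-aware architecture markedly improves the generalization of salient object detection, offers an effective solution for data-bounded computer vision tasks, and shows the potential of synthetic data for pushing model performance.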

📄 Abstract

Salient object detection exemplifies data-bounded tasks where expensive pixel-precise annotations force separate model training for related subtasks like DIS and HR-SOD. We present a method that dramatically improves generalization through large-scale synthetic data generation and ambiguity-aware architecture. We introduce S3OD, a dataset of over 139,000 high-resolution images created through our multi-modal diffusion pipeline that extracts labels from diffusion and DINO-v3 features. The iterative generation framework prioritizes challenging categories based on model performance. We propose a streamlined multi-mask decoder that naturally handles the inherent ambiguity in salient object detection by predicting multiple valid interpretations. Models trained solely on synthetic data achieve 20-50% error reduction in cross-dataset generalization, while fine-tuned versions reach state-of-the-art performance across DIS and HR-SOD benchmarks.

[27] Modest-Align: Data-Efficient Alignment for Vision-Language Models

Jiaxiang Liu, Yuan Wang, Jiawei Du, Joey Tianyi Zhou, Mingkun Xu, Zuozhu Liu

🧩 TL;DR

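This paper proposes Modest-Align, a lightweight cross-modal alignment framework that mitigates overconfidence in resource-constrained settings through two complementary strategies, random perturbation and embedding smoothing. It achieves performance competitive with CLIP using only about 1% of the training data and 0.17% of the GPU time.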

📘 Detailed Summary

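Motivation: In resource-constrained settings, large-scale cross-modal pretrained models such as CLIP suffer from overconfidence and degraded performance. The main causes are low data quality, weakly correlated image-text pairs, and existing contrastive learning methods that reinforce overconfidence on uncertain samples.

Method: Modest-Align uses two complementary strategies. Random Perturbation introduces controlled noise to simulate uncertainty, and Embedding Smoothing calibrates similarity distributions in the embedding space to reduce overconfidence. Together they improve robustness to noisy and weakly aligned samples.

Result: Experiments on multiple benchmark datasets show that Modest-Align outperforms state-of-the-art methods on retrieval tasks, achieving competitive performance with only about 1% of CLIP's training data and 0.17% of its GPU time.

Conclusion: The method offers a practical and scalable solution for cross-modal alignment in real-world low-resource scenarios, showing that a lightweight calibration mechanism can preserve model performance while greatly reducing compute cost.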

📄 Abstract

Cross-modal alignment aims to map heterogeneous modalities into a shared latent space, as exemplified by models like CLIP, which benefit from large-scale image-text pretraining for strong recognition capabilities. However, when operating in resource-constrained settings with limited or low-quality data, these models often suffer from overconfidence and degraded performance due to the prevalence of ambiguous or weakly correlated image-text pairs. Current contrastive learning approaches, which rely on single positive pairs, further exacerbate this issue by reinforcing overconfidence on uncertain samples. To address these challenges, we propose Modest-Align, a lightweight alignment framework designed for robustness and efficiency. Our approach leverages two complementary strategies -- Random Perturbation, which introduces controlled noise to simulate uncertainty, and Embedding Smoothing, which calibrates similarity distributions in the embedding space. These mechanisms collectively reduce overconfidence and improve performance on noisy or weakly aligned samples. Extensive experiments across multiple benchmark datasets demonstrate that Modest-Align outperforms state-of-the-art methods in retrieval tasks, achieving competitive results with over 100x less training data and 600x less GPU time than CLIP. Our method offers a practical and scalable solution for cross-modal alignment in real-world, low-resource scenarios.

[28] Automated Detection of Visual Attribute Reliance with a Self-Reflective Agent

Christy Li, Josep Lopez Camuñas, Jake Thomas Touchet, Jacob Andreas, Agata Lapedriza, Antonio Torralba, Tamar Rott Shaham

🧩 TL;DR

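This paper proposes an automated framework in which a self-reflective agent systematically generates and tests hypotheses about visual attribute reliance, in order to detect unintended dependencies in trained vision models. The method significantly outperforms non-reflective baselines on a benchmark of 130 models and identifies real-world visual attribute dependencies in state-of-the-art models such as CLIP's vision encoder and YOLOv8.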

📘 Detailed Summary

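Motivation: Vision models performing image recognition may unintentionally rely on specific visual attributes. Such reliance threatens robustness and can lead to overfitting and spurious correlations. Existing approaches lack a systematic, automated way to find these dependencies, which motivates a framework that can discover and verify a model's reliance patterns automatically.

Method: The core of the method is a self-reflective agent that systematically generates and tests hypotheses about the visual attributes a model may rely on. The process is iterative: the agent refines its hypotheses based on experimental outcomes and uses a self-evaluation protocol to check whether its findings accurately explain model behavior. When inconsistencies arise, the agent reflects on its findings and triggers a new cycle of experiments.

Result: On a benchmark of 130 models designed to exhibit diverse visual attribute dependencies across 18 categories, the agent's performance improves consistently with self-reflection and significantly exceeds non-reflective baselines. The method also identifies real-world visual attribute dependencies in state-of-the-art models, including CLIP's vision encoder and the YOLOv8 object detector.

Conclusion: Self-reflection is central to systematically detecting attribute reliance in vision models, and the method provides a scalable automated tool for understanding model decisions. The results indicate that self-reflection markedly improves the accuracy and reliability of dependency detection, opening a new avenue for model auditing and robustness evaluation.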

📄 Abstract

When a vision model performs image recognition, which visual attributes drive its predictions? Detecting unintended reliance on specific visual features is critical for ensuring model robustness, preventing overfitting, and avoiding spurious correlations. We introduce an automated framework for detecting such dependencies in trained vision models. At the core of our method is a self-reflective agent that systematically generates and tests hypotheses about visual attributes that a model may rely on. This process is iterative: the agent refines its hypotheses based on experimental outcomes and uses a self-evaluation protocol to assess whether its findings accurately explain model behavior. When inconsistencies arise, the agent self-reflects over its findings and triggers a new cycle of experimentation. We evaluate our approach on a novel benchmark of 130 models designed to exhibit diverse visual attribute dependencies across 18 categories. Our results show that the agent's performance consistently improves with self-reflection, with a significant performance increase over non-reflective baselines. We further demonstrate that the agent identifies real-world visual attribute dependencies in state-of-the-art models, including CLIP's vision encoder and the YOLOv8 object detector.

cs.CL [Back]

[29] Reasoning's Razor: Reasoning Improves Accuracy but Can Hurt Recall at Critical Operating Points in Safety and Hallucination Detection

Atoosa Chegini, Hamid Kazemi, Garrett Souza, Maria Safi, Yang Song, Samy Bengio, Sinead Williamson, Mehrdad Farajtabar

🧩 TL;DR

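This study presents the first systematic evaluation of reasoning-augmented classification under strict low false positive rate requirements. It finds that reasoning improves overall accuracy but underperforms in strictly precision-sensitive regimes, revealing reasoning as a double-edged tool for such applications.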

📘 Detailed Summary

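Motivation: Reasoning has become a central paradigm for large language models and consistently boosts accuracy across benchmarks, yet its suitability for precision-sensitive tasks remains unclear. This study addresses that gap, focusing on classification under strict low false positive rate requirements.

Method: The study covers two tasks, safety detection and hallucination detection, and evaluates standard LLMs and Large Reasoning Models in both fine-tuned and zero-shot settings. It compares Think On (reasoning enabled) with Think Off (reasoning disabled), and token-based scoring with self-verbalized confidence.

Result: Think On improves overall accuracy but underperforms at low-FPR thresholds, while Think Off dominates in precision-sensitive regimes. Token-based scoring substantially outperforms self-verbalized confidence, and a simple ensemble of the two modes recovers the strengths of each.

Conclusion: Reasoning is a double-edged tool in precision-sensitive applications: it benefits average accuracy but is often ill-suited to deployments that require strict precision. The findings offer practical guidance for model selection.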

📄 Abstract

Reasoning has become a central paradigm for large language models (LLMs), consistently boosting accuracy across diverse benchmarks. Yet its suitability for precision-sensitive tasks remains unclear. We present the first systematic study of reasoning for classification tasks under strict low false positive rate (FPR) regimes. Our analysis covers two tasks--safety detection and hallucination detection--evaluated in both fine-tuned and zero-shot settings, using standard LLMs and Large Reasoning Models (LRMs). Our results reveal a clear trade-off: Think On (reasoning-augmented) generation improves overall accuracy, but underperforms at the low-FPR thresholds essential for practical use. In contrast, Think Off (no reasoning during inference) dominates in these precision-sensitive regimes, with Think On surpassing only when higher FPRs are acceptable. In addition, we find token-based scoring substantially outperforms self-verbalized confidence for precision-sensitive deployments. Finally, a simple ensemble of the two modes recovers the strengths of each. Taken together, our findings position reasoning as a double-edged tool: beneficial for average accuracy, but often ill-suited for applications requiring strict precision.
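
The operating-point view in this abstract rests on measuring the true positive rate (recall) that a classifier can reach while keeping the false positive rate under a strict cap. A minimal sketch of that metric follows. The function name, the toy scores, and the threshold sweep are illustrative assumptions, not the paper's evaluation code.

```python
def tpr_at_fpr(scores, labels, max_fpr):
    """Highest true-positive rate achievable with FPR <= max_fpr.

    scores: higher means "more likely positive"; labels: 1 = positive, 0 = negative.
    Assumes at least one positive and one negative example.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best_tpr = 0.0
    # Try each distinct score as a threshold (predict positive if score >= threshold).
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        if fp / negatives <= max_fpr:
            best_tpr = max(best_tpr, tp / positives)
    return best_tpr


# Toy detector outputs (assumed values): 4 positives and 4 negatives.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
strict = tpr_at_fpr(scores, labels, max_fpr=0.0)
relaxed = tpr_at_fpr(scores, labels, max_fpr=0.25)
```

In this toy data a single high-scoring negative halves the recall available at zero false positives, even though the detector ranks most examples correctly. That gap between strong average accuracy and weak low-FPR recall is the kind of trade-off the abstract describes.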

[30] REMONI: An Autonomous System Integrating Wearables and Multimodal Large Language Models for Enhanced Remote Health Monitoring

Thanh Cong Ho, Farah Kharrat, Abderrazek Abid, Fakhri Karray

🧩 TL;DR

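This study proposes REMONI, a remote health monitoring system that integrates multimodal large language models, the Internet of Things, and wearable devices to autonomously monitor patients' vital signs, detect abnormal conditions, and interact in natural language. The system analyzes patient activity and emotional state in real time and gives healthcare staff an intuitive way to query patient status through an intelligent agent.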

📘 Detailed Summary

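Motivation: Current remote patient monitoring research concentrates on collecting sensor data, visualizing it, and detecting anomalies for specific diseases, leaving a notable gap in human-machine interaction. Existing systems lack natural language understanding of patient activity and emotional changes, and lack intelligent interaction between healthcare staff and the monitoring system.

Method: The system integrates multimodal large language models, the Internet of Things, and wearable devices to automatically and continuously collect vital signs, accelerometer data, and patient video clips. An anomaly detection module includes a fall detection model and algorithms for identifying emergency conditions. Prompt engineering integrates all patient information, and a natural language processing component recognizes patient activity and emotion.

Result: Experiments indicate that the system is implementable and scalable for real-life scenarios. A full-fledged prototype has been developed and is being tested to demonstrate the robustness of its capabilities. The system could reduce the workload of medical professionals and healthcare costs, and a user-friendly web application gives doctors and nurses real-time vital signs and patient status.

Conclusion: REMONI demonstrates the practical value of multimodal large language models in remote health monitoring and offers a new technical paradigm for intelligent medical monitoring. Through automated monitoring and intelligent interaction, it could markedly improve monitoring efficiency and serves as a reference for future intelligent healthcare systems.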

📄 Abstract

With the widespread adoption of wearable devices in our daily lives, the demand for and appeal of remote patient monitoring have significantly increased. Most research in this field has concentrated on collecting sensor data, visualizing it, and analyzing it to detect anomalies in specific diseases such as diabetes, heart disease and depression. However, this domain has a notable gap in the aspect of human-machine interaction. This paper proposes REMONI, an autonomous REmote health MONItoring system that integrates multimodal large language models (MLLMs), the Internet of Things (IoT), and wearable devices. The system automatically and continuously collects vital signs, accelerometer data from a special wearable (such as a smartwatch), and visual data in patient video clips collected from cameras. This data is processed by an anomaly detection module, which includes a fall detection model and algorithms to identify and alert caregivers of the patient's emergency conditions. A distinctive feature of our proposed system is the natural language processing component, developed with MLLMs capable of detecting and recognizing a patient's activity and emotion while responding to healthcare workers' inquiries. Additionally, prompt engineering is employed to integrate all patient information seamlessly. As a result, doctors and nurses can access real-time vital signs and the patient's current state and mood by interacting with an intelligent agent through a user-friendly web application. Our experiments demonstrate that our system is implementable and scalable for real-life scenarios, potentially reducing the workload of medical professionals and healthcare costs. A full-fledged prototype illustrating the functionalities of the system has been developed and is being tested to demonstrate the robustness of its various capabilities.

[31] Document Understanding, Measurement, and Manipulation Using Category Theory

Jared Claypoole, Yunye Gong, Noson S. Yanofsky, Ajay Divakaran

🧩 TL;DR

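This study proposes a category-theoretic approach to analyzing document structure, and develops information-theoretic measures, content summarization and extension techniques, and a self-supervised method using RLVR to improve large pretrained models. By representing a document as a category of question-answer pairs, the approach enables an orthogonalized decomposition of document information and new forms of summarization.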

📘 Detailed Summary

Motivation: 当前文档理解方法缺乏对文档内在结构的数学形式化表示，难以系统性地提取、度量和操作文档信息。本研究旨在通过范畴论建立文档的数学表示框架，解决文档信息分解、摘要生成和内容扩展等核心问题。

Method: 提出将文档建模为问答对范畴的数学表示，开发了信息正交化程序将文档信息分解为不重叠的部分。基于此框架构建了信息度量方法、新型摘要技术，并利用RLVR开发了自监督方法，通过组合性和闭包等一致性约束来改进预训练模型。

Result: 实现了文档信息的系统分解和量化度量，开发了基于率失真分析的摘要评估框架，提出了文档注释扩展的新解决方案。通过大型预训练模型实现了方法的实际应用，并构建了多模态扩展的数学框架。

Conclusion: 范畴论为文档理解提供了强大的数学基础，支持系统化的信息提取和操作。问答对表示和正交化分解为文档分析开辟了新途径，自监督方法展示了利用结构约束改进模型的有效性，为多模态文档处理奠定了基础。

📄 Abstract

We apply category theory to extract multimodal document structure, which leads us to develop information theoretic measures, content summarization and extension, and self-supervised improvement of large pretrained models. We first develop a mathematical representation of a document as a category of question-answer pairs. Second, we develop an orthogonalization procedure to divide the information contained in one or more documents into non-overlapping pieces. The structures extracted in the first and second steps lead us to develop methods to measure and enumerate the information contained in a document. We also build on those steps to develop new summarization techniques, as well as to develop a solution to a new problem, viz. exegesis, resulting in an extension of the original document. Our question-answer pair methodology enables a novel rate distortion analysis of summarization techniques. We implement our techniques using large pretrained models, and we propose a multimodal extension of our overall mathematical framework. Finally, we develop a novel self-supervised method using RLVR to improve large pretrained models using consistency constraints such as composability and closure under certain operations that stem naturally from our category theoretic framework.
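
As a rough illustration of what dividing information into non-overlapping pieces can look like, the sketch below treats each document as a set of atomic question-answer pairs and splits two documents into three disjoint parts. The set-based representation, the `QAPair` type, and the function name `orthogonalize` are assumptions made for illustration only; the abstract does not say how the authors' orthogonalization procedure is actually carried out.

```python
from typing import Dict, FrozenSet, Tuple

QAPair = Tuple[str, str]  # (question, answer); a hypothetical atomic unit


def orthogonalize(
    doc_a: FrozenSet[QAPair], doc_b: FrozenSet[QAPair]
) -> Dict[str, FrozenSet[QAPair]]:
    """Split two QA-pair sets into disjoint pieces: only A, only B, shared."""
    shared = doc_a & doc_b
    return {
        "only_a": doc_a - shared,
        "only_b": doc_b - shared,
        "shared": shared,
    }


# Toy documents, invented for this example.
doc_a = frozenset({("Who wrote it?", "Alice"), ("When?", "2020")})
doc_b = frozenset({("When?", "2020"), ("Where?", "Paris")})

pieces = orthogonalize(doc_a, doc_b)
# Counting pairs in each piece gives a crude size measure for that piece.
sizes = {name: len(piece) for name, piece in pieces.items()}
```

Because the three pieces are disjoint and their union recovers both inputs, counting pairs per piece never double-counts shared content, which is the property the toy is meant to convey.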

cs.AI [Back]

[32] Sketch2BIM: A Multi-Agent Human-AI Collaborative Pipeline to Convert Hand-Drawn Floor Plans to 3D BIM

Abir Khan Ratul, Sanjay Acharjee, Somin Park, Md Nazmus Sakib

🧩 TL;DR

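This study proposes a human-in-the-loop pipeline that converts unscaled, hand-drawn floor plans into semantically consistent 3D BIM models. The workflow uses multimodal large language models within a multi-agent framework, combining perceptual extraction, human feedback, schema validation, and automated BIM script generation.

📘 Detailed Summary

Motivation: Conventional BIM creation depends on specialist expertise and complex software. The study aims to let non-specialists produce high-quality 3D building information models from simple hand-drawn sketches, lowering the barrier to using BIM.

Method: The approach combines a multi-agent framework with multimodal large language models and has four key stages: perceptual extraction, iterative human feedback, schema validation, and automated BIM script generation. Sketches are first refined iteratively into a structured JSON layout, which is then transformed into executable scripts that generate the 3D BIM model.

Result: Experiments on ten different floor plans show that openings such as doors and windows are detected with high reliability in the initial pass, while wall detection starts at about 83% and reaches near-perfect alignment after a few feedback iterations. Precision, recall, and F1 scores stay above 0.83 across all categories, and geometric errors progressively decrease to zero through feedback corrections.

Conclusion: The study shows that MLLM-driven multi-agent reasoning can make BIM creation accessible to both experts and non-experts using only hand-drawn sketches. It offers the construction industry a low-barrier route to digitalization and illustrates the potential of multimodal AI in professional domains.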

📄 Abstract

This study introduces a human-in-the-loop pipeline that converts unscaled, hand-drawn floor plan sketches into semantically consistent 3D BIM models. The workflow leverages multimodal large language models (MLLMs) within a multi-agent framework, combining perceptual extraction, human feedback, schema validation, and automated BIM scripting. Initially, sketches are iteratively refined into a structured JSON layout of walls, doors, and windows. Later, these layouts are transformed into executable scripts that generate 3D BIM models. Experiments on ten diverse floor plans demonstrate strong convergence: openings (doors, windows) are captured with high reliability in the initial pass, while wall detection begins around 83% and achieves near-perfect alignment after a few feedback iterations. Across all categories, precision, recall, and F1 scores remain above 0.83, and geometric errors (RMSE, MAE) progressively decrease to zero through feedback corrections. This study demonstrates how MLLM-driven multi-agent reasoning can make BIM creation accessible to both experts and non-experts using only freehand sketches.
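
The schema validation stage can be pictured with a minimal sketch like the one below, which checks a JSON layout of walls, doors, and windows before it is passed on to script generation. The field names (`id`, `start`, `end`, `wall_id`, `offset`, `width`) and the rule that each opening must reference an existing wall are assumptions for illustration; the abstract does not describe the actual schema.

```python
import json

# Hypothetical layout schema: required keys and their types per element kind.
REQUIRED_FIELDS = {
    "walls": {"id": str, "start": list, "end": list},
    "doors": {"id": str, "wall_id": str, "offset": (int, float), "width": (int, float)},
    "windows": {"id": str, "wall_id": str, "offset": (int, float), "width": (int, float)},
}


def validate_layout(layout: dict) -> list:
    """Return a list of human-readable schema errors (empty if valid)."""
    errors = []
    for kind, fields in REQUIRED_FIELDS.items():
        for i, item in enumerate(layout.get(kind, [])):
            for name, expected in fields.items():
                if name not in item:
                    errors.append(f"{kind}[{i}]: missing '{name}'")
                elif not isinstance(item[name], expected):
                    errors.append(f"{kind}[{i}]: '{name}' has wrong type")
    # Every opening must reference a wall that exists in the layout.
    wall_ids = {w.get("id") for w in layout.get("walls", [])}
    for kind in ("doors", "windows"):
        for i, item in enumerate(layout.get(kind, [])):
            if "wall_id" in item and item["wall_id"] not in wall_ids:
                errors.append(f"{kind}[{i}]: unknown wall_id '{item['wall_id']}'")
    return errors


# Toy layout, invented for this example; the window points at a missing wall.
layout = json.loads("""
{
  "walls": [{"id": "w1", "start": [0, 0], "end": [4, 0]}],
  "doors": [{"id": "d1", "wall_id": "w1", "offset": 1.0, "width": 0.9}],
  "windows": [{"id": "n1", "wall_id": "w9", "offset": 2.5, "width": 1.2}]
}
""")
errors = validate_layout(layout)
```

In a human-in-the-loop setting, a non-empty `errors` list is the kind of signal that could be shown to the user or fed back to the extraction agent for another refinement pass.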

[33] MedAlign: A Synergistic Framework of Multimodal Preference Optimization and Federated Meta-Cognitive Reasoning

Siyong Chen, Jinbo Wen, Jiawen Kang, Tenghui Huang, Xumin Huang, Yuanjia Su, Hudan Pan, Zishao Zhong, Dusit Niyato, Shengli Xie, Dong In Kim

🧩 TL;DR

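This paper proposes MedAlign, a framework that uses multimodal Direct Preference Optimization and a Retrieval-Aware Mixture-of-Experts architecture to address LVLM hallucination in medical visual question answering, while also enabling adaptive reasoning and multi-institutional collaboration. It achieves state-of-the-art performance on three Med-VQA datasets.

📘 Detailed Summary

Motivation: Large vision-language models face three key challenges when deployed for clinical services: a tendency to hallucinate answers not grounded in visual evidence, the inefficiency of fixed-depth reasoning, and the difficulty of multi-institutional collaboration. These challenges limit the practical use of LVLMs in smart healthcare.

Method: The authors propose a multimodal Direct Preference Optimization (mDPO) objective that explicitly aligns preference learning with visual context, and design a Retrieval-Aware Mixture-of-Experts (RA-MoE) architecture that uses image and text similarity to route queries to a specialized, context-augmented LVLM expert. A federated governance mechanism then lets the selected expert perform adaptive Chain-of-Thought reasoning locally, guided by a local meta-cognitive uncertainty estimator.

Result: Extensive experiments on three representative Med-VQA datasets show that MedAlign achieves state-of-the-art performance, outperforming strong retrieval-augmented baselines by up to 11.85% in F1-score while reducing the average reasoning length by 51.60% compared with fixed-depth CoT approaches.

Conclusion: The study shows that visual context alignment, expert routing, and federated governance together can mitigate LVLM hallucination in the medical domain, offers a workable approach to multi-institutional clinical collaboration, and advances the practical deployment of large models in smart healthcare.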

📄 Abstract

Recently, large models have shown significant potential for smart healthcare. However, the deployment of Large Vision-Language Models (LVLMs) for clinical services is currently hindered by three critical challenges: a tendency to hallucinate answers not grounded in visual evidence, the inefficiency of fixed-depth reasoning, and the difficulty of multi-institutional collaboration. To address these challenges, in this paper, we develop MedAlign, a novel framework to ensure visually accurate LVLM responses for Medical Visual Question Answering (Med-VQA). Specifically, we first propose a multimodal Direct Preference Optimization (mDPO) objective to explicitly align preference learning with visual context. We then design a Retrieval-Aware Mixture-of-Experts (RA-MoE) architecture that utilizes image and text similarity to route queries to a specialized and context-augmented LVLM (i.e., an expert), thereby mitigating hallucinations in LVLMs. To achieve adaptive reasoning and facilitate multi-institutional collaboration, we propose a federated governance mechanism, where the selected expert, fine-tuned on clinical datasets based on mDPO, locally performs iterative Chain-of-Thought (CoT) reasoning via the local meta-cognitive uncertainty estimator. Extensive experiments on three representative Med-VQA datasets demonstrate that MedAlign achieves state-of-the-art performance, outperforming strong retrieval-augmented baselines by up to $11.85\%$ in F1-score, and simultaneously reducing the average reasoning length by $51.60\%$ compared with fixed-depth CoT approaches.
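
To make the idea of similarity-based routing concrete, here is a minimal sketch that scores each expert by a weighted sum of image and text cosine similarity and picks the best match. The function `route`, the weight `alpha`, and the representation of an expert as a pair of reference embeddings are all assumptions for illustration; the abstract states only that RA-MoE uses image and text similarity for routing, not how the scores are computed or combined.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def route(img_q, txt_q, experts, alpha=0.5):
    """Pick the expert whose reference embeddings best match the query.

    `experts` maps a name to (image_embedding, text_embedding); `alpha`
    weights image similarity against text similarity. Both are hypothetical.
    """
    scores = {
        name: alpha * cosine(img_q, img_e) + (1 - alpha) * cosine(txt_q, txt_e)
        for name, (img_e, txt_e) in experts.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy 2-D embeddings, invented for this example.
experts = {
    "radiology": (np.array([1.0, 0.0]), np.array([1.0, 0.0])),
    "pathology": (np.array([0.0, 1.0]), np.array([0.0, 1.0])),
}
best, scores = route(np.array([0.9, 0.1]), np.array([0.8, 0.2]), experts)
```

In practice the embeddings would come from image and text encoders, and the chosen expert would receive the retrieved context along with the query.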

[34] Memory-Free Continual Learning with Null Space Adaptation for Zero-Shot Vision-Language Models

Yujin Jo, Taesup Kim

🧩 TL;DR

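This paper proposes NuSA-CL, a lightweight, memory-free continual learning framework. Its null space adaptation strategy lets a pretrained vision-language model learn continually while preserving its zero-shot capabilities, addressing the distribution shifts and novel tasks that arise in real deployments.

📘 Detailed Summary

Motivation: In real-world deployment, pretrained vision-language models face distribution shifts caused by evolving environments and emerging classes. Static zero-shot capabilities are insufficient, so continual learning methods are needed that let the model adapt without catastrophic forgetting.

Method: NuSA-CL uses low-rank adaptation and constrains task-specific weight updates to lie within an approximate null space of the model's current parameters. This minimizes interference with previously acquired knowledge and preserves the zero-shot capabilities of the original model.

Result: Experiments show that the framework preserves zero-shot transfer capabilities and also achieves highly competitive performance on continual learning benchmarks.

Conclusion: NuSA-CL is a practical and scalable continual learning framework for continually evolving zero-shot vision-language models in real-world applications, with low computational and memory overhead.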

📄 Abstract

Pre-trained vision-language models (VLMs), such as CLIP, have demonstrated remarkable zero-shot generalization, enabling deployment in a wide range of real-world tasks without additional task-specific training. However, in real deployment scenarios with evolving environments or emerging classes, these models inevitably face distributional shifts and novel tasks. In such contexts, static zero-shot capabilities are insufficient, and there is a growing need for continual learning methods that allow models to adapt over time while avoiding catastrophic forgetting. We introduce NuSA-CL (Null Space Adaptation for Continual Learning), a lightweight memory-free continual learning framework designed to address this challenge. NuSA-CL employs low-rank adaptation and constrains task-specific weight updates to lie within an approximate null space of the model's current parameters. This strategy minimizes interference with previously acquired knowledge, effectively preserving the zero-shot capabilities of the original model. Unlike methods relying on replay buffers or costly distillation, NuSA-CL imposes minimal computational and memory overhead, making it practical for deployment in resource-constrained, real-world continual learning environments. Experiments show that our framework not only effectively preserves zero-shot transfer capabilities but also achieves highly competitive performance on continual learning benchmarks. These results position NuSA-CL as a practical and scalable solution for continually evolving zero-shot VLMs in real-world applications.
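
One way to picture a null-space-constrained low-rank update is sketched below: the input directions where a weight matrix has its smallest singular values are treated as its approximate null space, and a low-rank update `B @ A` is projected onto them. This is an illustrative reading under stated assumptions, not the authors' procedure; the function `null_space_projector`, the `keep_ratio` parameter, and the choice to project on the input side are all hypothetical.

```python
import numpy as np


def null_space_projector(W: np.ndarray, keep_ratio: float = 0.5) -> np.ndarray:
    """Projector onto the input directions where W has the smallest singular values.

    `keep_ratio` (hypothetical) is the fraction of input directions treated
    as the approximate null space.
    """
    # Full SVD so that the rows of Vt span the whole input space.
    _, _, Vt = np.linalg.svd(W, full_matrices=True)
    k = max(1, int(round(keep_ratio * W.shape[1])))
    V_null = Vt[-k:].T          # (in_dim, k): least significant right singular vectors
    return V_null @ V_null.T    # (in_dim, in_dim) orthogonal projector


rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))     # stand-in for a pretrained weight matrix
B = rng.normal(size=(8, 2))     # low-rank factors with rank 2
A = rng.normal(size=(2, 8))

P = null_space_projector(W, keep_ratio=0.25)
delta = (B @ A) @ P             # low-rank update restricted to the approximate null space

# Inputs along W's dominant right singular vector are left unchanged by the update.
_, _, Vt = np.linalg.svd(W, full_matrices=True)
v_top = Vt[0]
change = np.linalg.norm(delta @ v_top)
```

The point of the construction is that `(W + delta) @ x` equals `W @ x` for any input `x` lying in the dominant subspace, so behavior on directions the pretrained weights rely on most is preserved while the update still has room to act on the remaining directions.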

[35] A Multimodal Benchmark for Framing of Oil & Gas Advertising and Potential Greenwashing Detection

Gaku Morio, Harri Rowlands, Dominik Stammbach, Christopher D. Manning, Peter Henderson

🧩 TL;DR

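This study introduces an expert-annotated multimodal benchmark dataset of video ads, designed to evaluate vision-language models on the analysis of strategic communication in the energy sector. It fills a gap left by existing text-only framing datasets.

📘 Detailed Summary

Motivation: Corporate public relations campaigns sometimes show a mismatch between what companies say and what they do, known as "greenwashing", particularly in the oil and gas industry. Understanding the goals and nature of these campaigns requires analyzing framing, and changes in framing, at scale, but existing datasets are mostly text-only and cannot support multimodal analysis.

Method: The authors build an expert-annotated dataset of video ads from Facebook and YouTube, covering 13 framing types, more than 50 companies or advocacy groups, and 20 countries. The dataset is designed specifically for evaluating vision-language models.

Result: In baseline experiments, GPT-4.1 reaches a 79% F1 score in detecting environmental messages, while the best model reaches only a 46% F1 score in identifying framing around green innovation, which leaves substantial room for improvement in multimodal framing analysis.

Conclusion: The dataset is a resource for research on multimodal analysis of strategic communication in the energy sector. It also exposes challenges for vision-language models, including implicit framing, videos of varying lengths, and implicit cultural backgrounds, which point to directions for future work.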

📄 Abstract

Companies spend large amounts of money on public relations campaigns to project a positive brand image. However, sometimes there is a mismatch between what they say and what they do. Oil & gas companies, for example, are accused of "greenwashing" with imagery of climate-friendly initiatives. Understanding the framing, and changes in framing, at scale can help better understand the goals and nature of public relations campaigns. To address this, we introduce a benchmark dataset of expert-annotated video ads obtained from Facebook and YouTube. The dataset provides annotations for 13 framing types for more than 50 companies or advocacy groups across 20 countries. Our dataset is especially designed for the evaluation of vision-language models (VLMs), distinguishing it from past text-only framing datasets. Baseline experiments show some promising results, while leaving room for improvement for future work: GPT-4.1 can detect environmental messages with 79% F1 score, while our best model only achieves 46% F1 score on identifying framing around green innovation. We also identify challenges that VLMs must address, such as implicit framing, handling videos of various lengths, or implicit cultural backgrounds. Our dataset contributes to research in multimodal analysis of strategic communication in the energy sector.
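
For readers less familiar with the metric quoted above, the F1 score is the harmonic mean of precision and recall. The sketch below computes all three from confusion counts; the counts used in the example are invented for illustration and are not taken from the benchmark.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for one framing label: 40 correct detections,
# 10 false alarms, 10 missed ads.
p, r, f1 = precision_recall_f1(tp=40, fp=10, fn=10)
```

Because F1 penalizes an imbalance between precision and recall, a model that flags nearly every ad as containing a given frame scores poorly even though its recall is high.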

Table of Contents

cs.CV [Back]

[1] KBE-DME: Dynamic Multimodal Evaluation via Knowledge Enhanced Benchmark Evolution

[2] Preventing Shortcuts in Adapter Training via Providing the Shortcuts

[3] Head Pursuit: Probing Attention Specialization in Multimodal Transformers

[4] Video-As-Prompt: Unified Semantic Control for Video Generation

[5] Generative Point Tracking with Flow Matching

[6] 3DReasonKnee: Advancing Grounded Reasoning in Medical Vision Language Models

[7] ZING-3D: Zero-shot Incremental 3D Scene Graphs via Vision-Language Models

[8] Knowledge-Driven Vision-Language Model for Plexus Detection in Hirschsprung's Disease

[9] PhysVLM-AVR: Active Visual Reasoning for Multimodal Large Language Models in Physical Environments

[10] SafetyPairs: Isolating Safety Critical Image Features with Counterfactual Image Generation

[11] NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation

[12] CT-CLIP: A Multi-modal Fusion Framework for Robust Apple Leaf Disease Recognition in Complex Environments

[13] Towards Physics-informed Spatial Intelligence with Human Priors: An Autonomous Driving Pilot Study

[14] TokenCLIP: Token-wise Prompt Learning for Zero-shot Anomaly Detection

[15] GranViT: A Fine-Grained Vision Model With Autoregressive Perception For MLLMs

[16] Sample By Step, Optimize By Chunk: Chunk-Level GRPO For Text-to-Image Generation

[17] Towards Physically Executable 3D Gaussian for Embodied Navigation

[18] FineRS: Fine-grained Reasoning and Segmentation of Small Objects with Reinforcement Learning

[19] VL-SAE: Interpreting and Enhancing Vision-Language Alignment with a Unified Concept Set

[20] MUVR: A Multi-Modal Untrimmed Video Retrieval Benchmark with Multi-Level Visual Correspondence

🧩 TL;DR