cs.CV
[1] Are Neuro-Inspired Multi-Modal Vision-Language Models Resilient to Membership Inference Privacy Leakage?
David Amebley, Sayanton Dibbo
🧩 TL;DR
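This paper proposes a neuroscience-inspired topological regularization framework for strengthening the privacy protection of multi-modal vision-language models against membership inference attacks. Experiments show that it markedly reduces attack success while preserving model performance.
📘 Detailed Summary
Motivation: As multi-modal models are deployed more widely, the risk of privacy leakage grows. Existing research focuses mainly on privacy attacks against unimodal systems, while the vulnerability of multi-modal models is understudied; in particular, the potential of neuroscience-inspired multi-modal models for privacy protection has not been explored.
Method: Introduces a systematic neuroscience-inspired topological regularization framework and evaluates it on three vision-language models (BLIP, PaliGemma 2, and ViT-GPT2) using three benchmark datasets (COCO, CC3M, and NoCaps). The tau > 0 configuration defines the NEURO variant of each model.
Result: On BLIP with the COCO dataset, membership inference attack success against the NEURO variant drops by 24% in mean ROC-AUC while model utility stays similar: MPNet and ROUGE-2 scores indicate that generation quality is not significantly degraded. Extended evaluation on PaliGemma 2 and ViT-GPT2 further confirms the consistency of the results.
Conclusion: Neuro-inspired multi-modal vision-language models are markedly more resilient to privacy threats while retaining model performance. The work offers a new perspective on privacy risks in multi-modal models and empirical evidence for building more secure AI systems.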
📄 Abstract
In the age of agentic AI, the growing deployment of multi-modal models (MMs) has introduced new attack vectors that can leak sensitive training data, causing privacy leakage. This paper investigates a black-box privacy attack, the membership inference attack (MIA), on multi-modal vision-language models (VLMs). State-of-the-art research analyzes privacy attacks primarily on unimodal AI-ML systems, while recent studies indicate MMs can also be vulnerable to privacy attacks. While researchers have demonstrated that biologically inspired neural network representations can improve unimodal model resilience against adversarial attacks, it remains unexplored whether neuro-inspired MMs are resilient against privacy attacks. In this work, we introduce a systematic neuroscience-inspired topological regularization (tau) framework to analyze the resilience of multi-modal VLMs against image-text-based inference privacy attacks. We examine this phenomenon using three VLMs: BLIP, PaliGemma 2, and ViT-GPT2, across three benchmark datasets: COCO, CC3M, and NoCaps. Our experiments compare the resilience of baseline and neuro VLMs (with topological regularization), where the tau > 0 configuration defines the NEURO variant of a VLM. Our results on the BLIP model using the COCO dataset illustrate that MIA attack success in NEURO VLMs drops by 24% in mean ROC-AUC, while achieving similar model utility (similarities between generated and reference captions) in terms of MPNet and ROUGE-2 metrics. This shows neuro VLMs are comparatively more resilient against privacy attacks, while not significantly compromising model utility. Our extensive evaluation with PaliGemma 2 and ViT-GPT2 models, on two additional datasets, CC3M and NoCaps, further validates the consistency of the findings. This work contributes to the growing understanding of privacy risks in MMs and provides evidence on the resilience of neuro VLMs to privacy threats.
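The attack-success metric used above, ROC-AUC, can be illustrated with a minimal sketch. It assumes the attacker outputs one membership score per sample, with higher meaning "more likely a training member"; the function name and the toy scores are hypothetical and are not taken from the paper.

```python
def roc_auc(member_scores, nonmember_scores):
    """ROC-AUC as the probability that a random training member
    receives a higher attack score than a random non-member.
    Ties count as half a win."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


# Made-up attack scores for illustration only (not results from the paper).
toy_auc = roc_auc([0.9, 0.8, 0.7], [0.6, 0.4, 0.75])
```

An AUC of 0.5 means the attack separates members from non-members no better than chance, so a lower AUC for the NEURO variant indicates a less successful attack.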
[2] DeeAD: Dynamic Early Exit of Vision-Language Action for Efficient Autonomous Driving
Haibo HU, Lianming Huang, Nan Guan, Chun Jason Xue
🧩 TL;DR
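This paper proposes DeeAD, a training-free, action-guided early-exit framework that accelerates planning inference in vision-language action models by evaluating the physical feasibility of intermediate trajectories, significantly reducing inference latency while preserving planning quality.
📘 Detailed Summary
Motivation: Vision-language action models unify perception, reasoning, and trajectory generation for autonomous driving, but their deep transformer stacks cause significant inference latency, so a way to accelerate inference without sacrificing planning quality is needed.
Method: DeeAD uses an action-based early-exit strategy: inference terminates when the predicted trajectory aligns with a lightweight planning prior (e.g., navigation or low-precision planning) within a tolerable deviation (<2 m). A multi-hop controller adaptively skips redundant layers based on the change rate of scores.
Result: On the Bench2Drive benchmark, DeeAD achieves up to 28% transformer-layer sparsity and 29% latency reduction while preserving planning quality and safety.
Conclusion: The study shows that an early-exit strategy based on physical feasibility can effectively accelerate VLA inference, offering a practical optimization for real-time autonomous driving systems that integrates into existing models without retraining.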
📄 Abstract
Vision-Language Action (VLA) models unify perception, reasoning, and trajectory generation for autonomous driving, but suffer from significant inference latency due to deep transformer stacks. We present DeeAD, a training-free, action-guided early-exit framework that accelerates VLA planning by evaluating the physical feasibility of intermediate trajectories. Instead of relying on confidence scores, DeeAD terminates inference when predicted trajectories align with lightweight planning priors (e.g., Navigation or Low-precision Planning) within a tolerable deviation (<2m). To improve efficiency, we introduce a multi-hop controller that adaptively skips redundant layers based on the change rate of scores. DeeAD integrates into existing VLA models, such as ORION, without requiring retraining. Experiments on the Bench2Drive benchmark demonstrate up to 28% transformer-layer sparsity and 29% latency reduction, while preserving planning quality and safety.
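The exit rule described above can be sketched roughly as follows. This is an illustration under stated assumptions, not DeeAD's implementation: trajectories are lists of (x, y) waypoints in meters, the deviation measure is the maximum pointwise Euclidean distance (the abstract does not specify it), and the multi-hop layer-skipping controller is omitted. All function names are hypothetical.

```python
import math


def max_deviation(trajectory, prior):
    """Largest pointwise Euclidean distance between two trajectories (assumed metric)."""
    return max(math.dist(p, q) for p, q in zip(trajectory, prior))


def early_exit_layer(layer_trajectories, prior, tolerance_m=2.0):
    """Return the index of the first layer whose decoded trajectory stays
    within `tolerance_m` of the planning prior; fall back to the last layer."""
    for index, trajectory in enumerate(layer_trajectories):
        if max_deviation(trajectory, prior) < tolerance_m:
            return index
    return len(layer_trajectories) - 1


# Toy example: the second intermediate trajectory is already within 2 m of the prior.
prior = [(0.0, 0.0), (0.0, 5.0)]
layers = [
    [(4.0, 0.0), (4.0, 5.0)],
    [(1.0, 0.0), (1.0, 5.0)],
    [(0.0, 0.0), (0.0, 5.0)],
]
exit_at = early_exit_layer(layers, prior)
```

Exiting at an earlier index means the remaining transformer layers are never run, which is where the latency saving would come from.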
[3] CANVAS: A Benchmark for Vision-Language Models on Tool-Based User Interface Design
Daeheon Jeong, Seoyeon Byun, Kihoon Son, Dae Hyun Kim, Juho Kim
🧩 TL;DR
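This paper introduces CANVAS, a benchmark for evaluating vision-language models on tool-based user interface design tasks. It fills a gap in existing evaluation frameworks and shows where the tool-invocation strategies of leading models can still improve.
📘 Detailed Summary
Motivation: No existing benchmark evaluates how well vision-language models perform tool-based user interface design. Although these models show promise for iteratively editing UI designs in design software through tool invocation, this capability has not been systematically evaluated.
Method: CANVAS contains 598 tool-based design tasks sampled from 3.3K mobile UI designs across 30 function-based categories. It covers two task types, design replication and design modification, in which a model updates the design step by step through context-based tool invocations.
Result: Experiments show that leading models exhibit more strategic tool invocation, which improves design quality. The study also identifies common error patterns, giving guidance for future improvement.
Conclusion: The work establishes a standardized framework for evaluating tool-driven UI design ability, highlights the potential of vision-language models as design collaborators, and points to directions for strengthening tool-based design capabilities.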
📄 Abstract
User interface (UI) design is an iterative process in which designers progressively refine their work with design software such as Figma or Sketch. Recent advances in vision language models (VLMs) with tool invocation suggest these models can operate design software to edit a UI design through iteration. Understanding and enhancing this capacity is important, as it highlights VLMs' potential to collaborate with designers within conventional software. However, as no existing benchmark evaluates tool-based design performance, the capacity remains unknown. To address this, we introduce CANVAS, a benchmark for VLMs on tool-based user interface design. Our benchmark contains 598 tool-based design tasks paired with ground-truth references sampled from 3.3K mobile UI designs across 30 function-based categories (e.g., onboarding, messaging). In each task, a VLM updates the design step-by-step through context-based tool invocations (e.g., create a rectangle as a button background), linked to design software. Specifically, CANVAS incorporates two task types: (i) design replication evaluates the ability to reproduce a whole UI screen; (ii) design modification evaluates the ability to modify a specific part of an existing screen. Results suggest that leading models exhibit more strategic tool invocations, improving design quality. Furthermore, we identify common error patterns models exhibit, guiding future work in enhancing tool-based design capabilities.
[4] Text-Guided Semantic Image Encoder
Raghuveer Thirukovalluru, Xiaochuang Han, Bhuwan Dhingra, Emily Dinan, Maha Elbayad
🧩 TL;DR
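This paper proposes the Text-Guided Semantic Image Encoder (TIE), which uses text-conditioned training so that the image encoder in a vision-language model produces image representations conditioned on the input text query, markedly improving multimodal task performance and inference efficiency.
📘 Detailed Summary
Motivation: Image encoders in conventional vision-language models are usually pretrained independently and then aligned with a language model. As a result, the encoder processes images in a task-agnostic way and cannot adapt to the specific downstream task or text query, which limits its ability to capture query-relevant visual features.
Method: Proposes TIE, which is trained with text conditioning so that the image encoder generates representations conditioned on the input text query. The approach keeps the model architecture simple while enabling deep interaction between image representations and the text query.
Result: At the 1B and 3B scales, VLMs equipped with TIE improve by an average of 1.5 and 1.3 points, respectively, across nine image-to-text benchmarks, with gains of up to 6 points on tasks such as DocVQA and InfoVQA. They also reach better performance with only half as many image tiles (tokens), significantly improving inference efficiency.
Conclusion: Text-conditioned training effectively optimizes the encoder to capture key visual features. TIE consistently attends to query-relevant regions, improving interpretability and query-specific grounding, and offers a new approach to building more efficient and adaptive multimodal models.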
📄 Abstract
Image encoders, a fundamental component of vision-language models (VLMs), are typically pretrained independently before being aligned with a language model. This standard paradigm results in encoders that process images agnostically, without regard to the specific downstream task or text query. To address this limitation, we propose the Text-Guided Semantic Image Encoder (TIE), which generates image representations conditioned on the input text query. VLMs equipped with TIE outperform their conventional counterparts by +1.5 and +1.3 points on average across nine image-to-text benchmarks at the 1B and 3B scales, respectively, with gains reaching up to 6 points on tasks such as DocVQA and InfoVQA. Moreover, TIE-based VLMs attain superior performance while utilizing only half as many image tiles (tokens), resulting in notably improved inference efficiency. TIE also generalizes well with generic queries, indicating that text-conditioned training effectively optimizes the encoder to capture key visual features. Qualitative analysis confirms that TIE consistently attends to query-relevant regions, enhancing both interpretability and query-specific grounding.
[5] LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling
Zuhao Yang, Sudong Wang, Kaichen Zhang, Keming Wu, Sicong Leng, Yifan Zhang, Chengwei Qin, Shijian Lu, Xingxuan Li, Lidong Bing
🧩 TL;DR
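LongVT is an end-to-end agentic framework that reasons over long videos through interleaved Multimodal Chain-of-Tool-Thought. It uses the inherent temporal grounding ability of large multimodal models as a native video cropping tool and consistently outperforms strong existing baselines on four challenging long-video understanding benchmarks.
📘 Detailed Summary
Motivation: Large multimodal models are prone to hallucination in long-video reasoning, especially when evidence is sparse and temporally dispersed, and existing methods struggle to locate and use such scattered visual evidence.
Method: Proposes LongVT, which uses the temporal grounding ability of LMMs as a native video cropping tool. A global-to-local reasoning loop progressively zooms in on specific video clips and resamples finer-grained frames until the answer is grounded in the retrieved visual evidence.
Result: LongVT consistently outperforms strong existing baselines on four challenging long-video understanding and reasoning benchmarks. The authors also build the VideoSIAH data suite, which includes 247.9K samples for tool-integrated cold-start supervised fine-tuning and an evaluation benchmark of 1,280 QA pairs.
Conclusion: The study shows that, with tool integration and a multi-stage training strategy, LMMs can handle long-video reasoning effectively, and it contributes a new solution and benchmark data for long-video understanding.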
📄 Abstract
Large multimodal models (LMMs) have shown great potential for video reasoning with textual Chain-of-Thought. However, they remain vulnerable to hallucinations, especially when processing long-form videos where evidence is sparse and temporally dispersed. Inspired by how humans comprehend long videos (first skimming globally, then examining relevant clips for details), we introduce LongVT, an end-to-end agentic framework that enables "Thinking with Long Videos" via interleaved Multimodal Chain-of-Tool-Thought. Specifically, we exploit LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on a specific video clip and resample finer-grained video frames. This global-to-local reasoning loop continues until answers are grounded in retrieved visual evidence. Given the scarcity of fine-grained question-answering (QA) data for the long video reasoning task, we curate and will release a data suite named VideoSIAH to facilitate both training and evaluation. Specifically, our training dataset consists of 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning. Our evaluation benchmark consists of 1,280 QA pairs that are carefully curated through a semi-automatic data pipeline with human-in-the-loop validation. With a meticulously designed three-stage training strategy and extensive empirical validation, LongVT consistently outperforms existing strong baselines across four challenging long-video understanding and reasoning benchmarks. Our code, data, and model checkpoints are publicly available at https://github.com/EvolvingLMMs-Lab/LongVT
[6] Revisiting KRISP: A Lightweight Reproduction and Analysis of Knowledge-Enhanced Vision-Language Models
Souradeep Dutta, Keshav Bulia, Neena S Nair
🧩 TL;DR
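This study presents a lightweight reproduction of KRISP, a knowledge-enhanced visual question answering model. By reducing the parameter count it yields a compact version that can run on edge devices, and systematic ablation studies expose design flaws and practical problems in the original model.
📘 Detailed Summary
Motivation: The original KRISP model is effective but was built for industrial-scale training, is computationally demanding, and is tightly coupled to a large backbone. This study revisits the model, develops a lightweight version with significantly fewer parameters, and examines the scalability and effectiveness of knowledge-enhanced VQA architectures under resource constraints.
Method: Uses a lightweight reproduction strategy that significantly reduces model parameters and analyzes design flaws and practical problems through systematic ablation studies, including a proof of concept on synthetic VQA data and evaluation on the DAQUAR dataset. The model uses a low-parameter configuration and is constrained to the domain of the external knowledge graph to prevent AI hallucinations.
Result: The reproduced model reaches about 75% of the original KRISP's performance. The ablation studies reveal design flaws, real-world pitfalls, and implicit problems not fully covered in the original paper, and the lightweight design lets the model run on edge devices such as smartphones and AR-VR devices for offline visual reasoning.
Conclusion: The study indicates that knowledge-enhanced VQA architectures remain effective under resource constraints. The lightweight reproduction improves deployment flexibility, and its systematic analysis offers insights for designing similar architectures, with practical value for preventing AI hallucinations and for edge computing applications.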
📄 Abstract
Facebook AI Research introduced KRISP [4], which integrates structured external knowledge into pipelines for vision-language reasoning. Despite its effectiveness, the original model was developed for industrial-scale training, is computationally demanding, and is tightly coupled to a large backbone. In this work, we reexamine KRISP from a different angle and offer a lightweight reproduction with significantly fewer parameters. Even though our replicated model reaches only about 75% of the original's performance, the replication process uncovers a number of design flaws, real-world pitfalls, and implicit problems that were not fully covered in the original paper. We offer insights into the scalability and efficacy of knowledge-enhanced VQA architectures under resource constraints through systematic ablation studies, which include a proof-of-concept on synthetic VQA data and evaluation on the DAQUAR dataset. Our model, configured with a low-parameter setup and constrained by the external knowledge graph domain, prevents AI hallucinations and generates outputs solely within that domain. The minimal parameter count allows the model to run on edge devices such as smartphones and AR-VR devices, further improving offline visual reasoning.
[7] SPHINX: A Synthetic Environment for Visual Perception and Reasoning
Md Tanvirul Alam, Saksham Aggarwal, Justin Yang Chae, Nidhi Rastogi
🧩 TL;DR
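This study presents Sphinx, a synthetic environment for evaluating visual perception and reasoning. It procedurally generates puzzles covering a range of cognitive primitives and shows that reinforcement learning with verifiable rewards (RLVR) significantly improves model performance on visual reasoning tasks.
📘 Detailed Summary
Motivation: Visual reasoning lacks benchmark environments that precisely evaluate core cognitive abilities. Existing datasets fall short in task diversity and verifiability, making it hard to systematically measure models on basic abilities such as symmetry detection, geometric transformation, and spatial reasoning.
Method: Develops the Sphinx synthetic environment, which procedurally generates puzzles from motifs, tiles, charts, icons, and geometric primitives, each paired with a verifiable ground-truth solution. The benchmark covers 25 task types and supports large-scale dataset construction; models are trained with RLVR.
Result: Even the state-of-the-art large vision-language model GPT-5 reaches only 51.1% accuracy on Sphinx, well below human performance. RLVR significantly improves accuracy on these tasks and yields gains on external visual reasoning benchmarks.
Conclusion: Sphinx exposes substantial weaknesses of current vision-language models in basic cognitive abilities, and RLVR shows the potential of verifiable rewards for improving multimodal reasoning, offering a training paradigm for building stronger visual reasoning systems.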
📄 Abstract
We present Sphinx, a synthetic environment for visual perception and reasoning that targets core cognitive primitives. Sphinx procedurally generates puzzles using motifs, tiles, charts, icons, and geometric primitives, each paired with verifiable ground-truth solutions, enabling both precise evaluation and large-scale dataset construction. The benchmark covers 25 task types spanning symmetry detection, geometric transformations, spatial reasoning, chart interpretation, and sequence prediction. Evaluating recent large vision-language models (LVLMs) shows that even state-of-the-art GPT-5 attains only 51.1% accuracy, well below human performance. Finally, we demonstrate that reinforcement learning with verifiable rewards (RLVR) substantially improves model accuracy on these tasks and yields gains on external visual reasoning benchmarks, highlighting its promise for advancing multimodal reasoning.
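The idea of procedurally generated puzzles with verifiable ground truth can be illustrated with a toy sketch. It is not Sphinx's generator: the binary-grid puzzle, the function names, and the 0/1 reward are assumptions chosen to show how a symmetry-detection task can carry an answer that is known by construction and checked mechanically.

```python
import random


def make_symmetry_puzzle(size, symmetric, seed):
    """Generate a binary grid whose rows are left-right mirrored (hypothetical task).
    The ground-truth label is known by construction."""
    rng = random.Random(seed)
    half = (size + 1) // 2
    grid = []
    for _ in range(size):
        left = [rng.randint(0, 1) for _ in range(half)]
        # Mirror the left part to complete the row.
        grid.append(left + left[: size - half][::-1])
    if not symmetric:
        # Break the mirror symmetry in the first row.
        grid[0][0] = 1 - grid[0][-1]
    return grid, symmetric


def is_mirror_symmetric(grid):
    """Independent checker used to verify the generator."""
    return all(row == row[::-1] for row in grid)


def verifiable_reward(model_answer, ground_truth):
    """Reward of 1.0 for a correct answer, 0.0 otherwise."""
    return 1.0 if model_answer == ground_truth else 0.0


grid, truth = make_symmetry_puzzle(size=5, symmetric=True, seed=0)
```

Because the label comes from the generator and not from human annotation, the reward can be computed automatically for any number of generated puzzles.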
[8] Training-Free Diffusion Priors for Text-to-Image Generation via Optimization-based Visual Inversion
Samuele Dell'Erba, Andrew D. Bagdanov
🧩 TL;DR
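This study proposes an optimization-based visual inversion method that replaces the prior network in diffusion models without any training. It optimizes a latent visual representation for similarity with the text embedding and adds two new constraint losses to improve generation quality.
📘 Detailed Summary
Motivation: In text-to-image generation, diffusion models rely on computationally expensive prior networks to map text embeddings onto the visual manifold. These priors require large amounts of data and training, so this study explores an alternative that needs no trained prior.
Method: Uses Optimization-based Visual Inversion (OVI): a latent visual representation is initialized from random pseudo-tokens and iteratively optimized to maximize cosine similarity with the text prompt embedding. Two constraints, a Mahalanobis-based loss and a Nearest-Neighbor loss, regularize the optimization.
Result: Experiments on Kandinsky 2.2 show that OVI can replace a conventional prior and also reveal a flaw in current evaluation benchmarks. Constrained OVI improves visual fidelity over the baseline, and the Nearest-Neighbor variant achieves quantitative scores comparable to or higher than the state-of-the-art data-efficient prior.
Conclusion: The study demonstrates that going without a trained prior is feasible, offering a more efficient alternative for diffusion models. It also points out that current evaluation benchmarks need improvement and that optimization-based methods merit further study.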
📄 Abstract
Diffusion models have established the state-of-the-art in text-to-image generation, but their performance often relies on a diffusion prior network to translate text embeddings into the visual manifold for easier decoding. These priors are computationally expensive and require extensive training on massive datasets. In this work, we challenge the necessity of a trained prior at all by employing Optimization-based Visual Inversion (OVI), a training-free and data-free alternative, to replace the need for a prior. OVI initializes a latent visual representation from random pseudo-tokens and iteratively optimizes it to maximize the cosine similarity with input textual prompt embedding. We further propose two novel constraints, a Mahalanobis-based and a Nearest-Neighbor loss, to regularize the OVI optimization process toward the distribution of realistic images. Our experiments, conducted on Kandinsky 2.2, show that OVI can serve as an alternative to traditional priors. More importantly, our analysis reveals a critical flaw in current evaluation benchmarks like T2I-CompBench++, where simply using the text embedding as a prior achieves surprisingly high scores, despite lower perceptual quality. Our constrained OVI methods improve visual fidelity over this baseline, with the Nearest-Neighbor approach proving particularly effective, achieving quantitative scores comparable to or higher than the state-of-the-art data-efficient prior, indicating that the idea merits further investigation. The code will be publicly available upon acceptance.
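The core optimization step, maximizing cosine similarity between a latent vector and a text embedding, can be sketched in plain Python. This is a toy illustration under several assumptions, not the paper's OVI procedure: a single Gaussian vector stands in for the random pseudo-tokens, the vector is re-normalized after every step, the Mahalanobis and Nearest-Neighbor constraints are omitted, and all names and hyperparameters are hypothetical.

```python
import math
import random


def norm(u):
    return math.sqrt(sum(a * a for a in u))


def cosine(u, v):
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))


def optimize_latent(text_emb, steps=100, lr=0.5, seed=0):
    """Gradient ascent on cos(v, text_emb), starting from a random vector."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in text_emb]
    n = norm(v)
    v = [x / n for x in v]  # start on the unit sphere (assumed simplification)
    t_norm = norm(text_emb)
    for _ in range(steps):
        v_norm = norm(v)
        cos = cosine(v, text_emb)
        # Analytic gradient of cosine similarity with respect to v.
        grad = [
            t / (v_norm * t_norm) - cos * x / (v_norm ** 2)
            for x, t in zip(v, text_emb)
        ]
        v = [x + lr * g for x, g in zip(v, grad)]
        n = norm(v)
        v = [x / n for x in v]  # keep the latent at unit length
    return v


text_embedding = [1.0, 2.0, -1.0, 0.5]  # made-up embedding for illustration
latent = optimize_latent(text_embedding)
```

Maximizing similarity alone drives the latent toward the text embedding itself, which is consistent with the abstract's remark that extra constraints are needed to pull the result toward the distribution of realistic images.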
[9] Unsupervised Memorability Modeling from Tip-of-the-Tongue Retrieval Queries
Sree Bhattacharyya, Yaman Kumar Singla, Sudhir Yarram, Somesh Kumar Singh, Harini S, James Z. Wang
🧩 TL;DR
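This study introduces the first large-scale unsupervised dataset for visual memorability, containing over 82,000 videos with recall descriptions. Large vision-language models fine-tuned on it outperform state-of-the-art models such as GPT-4o at recall generation, and a contrastively trained model performs multimodal tip-of-the-tongue retrieval.
📘 Detailed Summary
Motivation: The main obstacle in visual memorability research is the high cost of collecting memorability annotations from humans, which limits dataset diversity and scalability. Existing datasets collect only aggregate memorability scores and miss the nuanced memorability signals in natural, open-ended recall descriptions.
Method: Builds an unsupervised dataset from tip-of-the-tongue (ToT) retrieval queries on online platforms such as Reddit, uses a contrastive training strategy to create the first model capable of multimodal ToT retrieval, and fine-tunes large vision-language models for memorability-related tasks.
Result: On recall generation, large vision-language models fine-tuned on the dataset outperform state-of-the-art models such as GPT-4o at producing open-ended memorability descriptions of visual content. The work also achieves multimodal ToT retrieval.
Conclusion: The dataset and models open a new direction for visual memorability research: the unsupervised approach addresses the data-collection bottleneck and provides tools for understanding human memory and improving content design.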
📄 Abstract
Visual content memorability has intrigued the scientific community for decades, with applications ranging widely, from understanding nuanced aspects of human memory to enhancing content design. A significant challenge in progressing the field lies in the expensive process of collecting memorability annotations from humans. This limits the diversity and scalability of datasets for modeling visual content memorability. Most existing datasets are limited to collecting aggregate memorability scores for visual content, not capturing the nuanced memorability signals present in natural, open-ended recall descriptions. In this work, we introduce the first large-scale unsupervised dataset designed explicitly for modeling visual memorability signals, containing over 82,000 videos, accompanied by descriptive recall data. We leverage tip-of-the-tongue (ToT) retrieval queries from online platforms such as Reddit. We demonstrate that our unsupervised dataset provides rich signals for two memorability-related tasks: recall generation and ToT retrieval. Large vision-language models fine-tuned on our dataset outperform state-of-the-art models such as GPT-4o in generating open-ended memorability descriptions for visual content. We also employ a contrastive training strategy to create the first model capable of performing multimodal ToT retrieval. Our dataset and models present a novel direction, facilitating progress in visual content memorability research.
[10] Test-Time Alignment of Text-to-Image Diffusion Models via Null-Text Embedding Optimisation
Taehoon Kim, Henry Gouk, Timothy Hospedales
🧩 TL;DR
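This paper proposes Null-Text Test-Time Alignment (Null-TTA), which optimizes the unconditional embedding in classifier-free guidance at inference time instead of latent variables. This addresses the under- and over-optimization seen in test-time alignment, keeps alignment on a semantically coherent manifold, and prevents reward hacking.
📘 Detailed Summary
Motivation: Existing test-time alignment methods, which adapt a model to a specific reward during inference, tend to either under-optimize or over-optimize. Over-optimization shows up as reward hacking: the model raises the reward score artificially by exploiting non-semantic noise patterns instead of achieving genuine semantic alignment.
Method: Null-TTA aligns diffusion models by optimizing the unconditional embedding in classifier-free guidance. Because the text embedding space is semantically structured, this optimization keeps alignment on a semantically coherent manifold and steers the generative distribution directly toward the target reward, even without updating model parameters.
Result: Experiments show that Null-TTA achieves state-of-the-art target test-time alignment while maintaining strong cross-reward generalization, supporting semantic-space optimization as an effective and principled paradigm for TTA.
Conclusion: The study establishes semantic-space optimization as an effective paradigm for test-time alignment. By exploiting the semantic structure of the text embedding space, it achieves accurate alignment with the target reward while avoiding reward hacking, providing a new route to inference-time adaptation of diffusion models.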
📄 Abstract
Test-time alignment (TTA) aims to adapt models to specific rewards during inference. However, existing methods tend to either under-optimise or over-optimise (reward hack) the target reward function. We propose Null-Text Test-Time Alignment (Null-TTA), which aligns diffusion models by optimising the unconditional embedding in classifier-free guidance, rather than manipulating latent or noise variables. Due to the structured semantic nature of the text embedding space, this ensures alignment occurs on a semantically coherent manifold and prevents reward hacking (exploiting non-semantic noise patterns to improve the reward). Since the unconditional embedding in classifier-free guidance serves as the anchor for the model's generative distribution, Null-TTA directly steers the model's generative distribution towards the target reward rather than just adjusting the samples, even without updating model parameters. Thanks to these desirable properties, we show that Null-TTA achieves state-of-the-art target test-time alignment while maintaining strong cross-reward generalisation. This establishes semantic-space optimisation as an effective and principled novel paradigm for TTA.
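For context, classifier-free guidance combines an unconditional and a conditional noise prediction at each denoising step. The sketch below shows that standard combination on plain Python lists; the function name is hypothetical, and Null-TTA's reward-driven optimization of the unconditional embedding is not shown.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Standard classifier-free guidance:
    eps = eps_uncond + w * (eps_cond - eps_uncond)."""
    return [
        u + guidance_scale * (c - u)
        for u, c in zip(eps_uncond, eps_cond)
    ]


# Toy noise predictions (made-up numbers).
eps_uncond = [0.0, 1.0]  # prediction given the null-text (unconditional) embedding
eps_cond = [1.0, 3.0]    # prediction given the prompt embedding
guided = cfg_combine(eps_uncond, eps_cond, guidance_scale=2.0)
```

Since the unconditional prediction enters every guided step, changing the embedding that produces it shifts the whole guided trajectory, which is the lever the abstract describes.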
[11] BUSTR: Breast Ultrasound Text Reporting with a Descriptor-Aware Vision-Language Model
Rawa Mohammed, Mina Attin, Bryar Shareef
🧩 TL;DR
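This paper proposes BUSTR, a multitask vision-language framework that generates breast ultrasound reports without paired image-report supervision. It constructs reports from structured descriptors and radiomics features, trains a multi-head Swin encoder with a multitask loss to learn descriptor-aware visual representations, and aligns visual and textual tokens through a dual-level objective.
📘 Detailed Summary
Motivation: Automated report generation for breast ultrasound is limited by the lack of paired image-report datasets and by the risk of hallucinations from large language models. Existing methods rely on supervised learning that needs large amounts of paired data, which is often scarce and costly to obtain in real clinical settings.
Method: BUSTR constructs reports from structured descriptors (e.g., BI-RADS, pathology, histology) and radiomics features. A multi-head Swin encoder is trained with a multitask loss over dataset-specific descriptor sets to learn descriptor-aware visual representations, and visual and textual tokens are aligned with a dual-level objective that combines token-level cross-entropy with a cosine-similarity alignment loss between input and output representations.
Result: On two public breast ultrasound datasets, BrEaST and BUS-BRA, BUSTR consistently improves standard natural language generation metrics and clinical efficacy metrics, particularly for key targets such as BI-RADS category and pathology. Performance is robust across datasets that differ in size and available descriptors.
Conclusion: The results show that a descriptor-aware vision model trained with combined token-level and alignment losses improves both automatic report metrics and clinical efficacy without paired image-report data. The approach offers a new way to generate medical imaging reports without paired supervision, with significant potential clinical value.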
📄 Abstract
Automated radiology report generation (RRG) for breast ultrasound (BUS) is limited by the lack of paired image-report datasets and the risk of hallucinations from large language models. We propose BUSTR, a multitask vision-language framework that generates BUS reports without requiring paired image-report supervision. BUSTR constructs reports from structured descriptors (e.g., BI-RADS, pathology, histology) and radiomics features, learns descriptor-aware visual representations with a multi-head Swin encoder trained using a multitask loss over dataset-specific descriptor sets, and aligns visual and textual tokens via a dual-level objective that combines token-level cross-entropy with a cosine-similarity alignment loss between input and output representations. We evaluate BUSTR on two public BUS datasets, BrEaST and BUS-BRA, which differ in size and available descriptors. Across both datasets, BUSTR consistently improves standard natural language generation metrics and clinical efficacy metrics, particularly for key targets such as BI-RADS category and pathology. Our results show that this descriptor-aware vision model, trained with a combined token-level and alignment loss, improves both automatic report metrics and clinical efficacy without requiring paired image-report data. The source code can be found at https://github.com/AAR-UNLV/BUSTR
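The dual-level objective described above can be sketched as a sum of two terms. This is an assumption-laden illustration, not BUSTR's code: the alignment term is taken to be 1 minus cosine similarity, the two terms are combined with a hypothetical weight `align_weight`, and plain Python lists stand in for tensors.

```python
import math


def cross_entropy(logits, target_index):
    """Token-level cross-entropy computed with a numerically stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target_index]


def cosine_alignment_loss(u, v):
    """1 - cosine similarity between two representations (assumed form)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def dual_level_loss(token_logits, token_targets, input_repr, output_repr,
                    align_weight=1.0):
    """Mean token-level cross-entropy plus a weighted alignment term."""
    ce = sum(
        cross_entropy(logits, target)
        for logits, target in zip(token_logits, token_targets)
    ) / len(token_targets)
    return ce + align_weight * cosine_alignment_loss(input_repr, output_repr)


# Toy values: one token with uniform logits, perfectly aligned representations.
loss = dual_level_loss([[0.0, 0.0]], [0], [1.0, 0.0], [1.0, 0.0])
```

With identical input and output representations the alignment term vanishes, so the toy loss reduces to the cross-entropy of a uniform two-way prediction.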
[12] TrafficLens: Multi-Camera Traffic Video Analysis Using LLMs
Md Adnan Arefeen, Biplob Debnath, Srimat Chakradhar
🧩 TL;DR
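TrafficLens is a tailored algorithm for multi-camera traffic intersections. By processing cameras with overlapping coverage sequentially and intelligently bypassing redundant vision-language model invocations, it cuts video-to-text conversion time by up to 4× while maintaining information accuracy.
📘 Detailed Summary
Motivation: Multi-camera intersections generate large volumes of video. The conventional pipeline converts video to text with a vision-language model and then analyzes the text with a large language model, a time-consuming process that delays timely insight generation and incident investigation.
Method: TrafficLens processes cameras sequentially, exploiting their overlapping coverage areas. It iteratively applies VLMs with varying token limits, uses previous outputs as prompts for subsequent cameras, and bypasses redundant VLM invocations through an object-level similarity detector.
Result: Experiments on real-world datasets show that TrafficLens reduces video-to-text conversion time by up to 4× while maintaining information accuracy.
Conclusion: The study shows that scheduling VLM invocations intelligently and exploiting camera overlap can substantially improve the efficiency of multi-camera traffic video analysis, offering a workable solution for real-time traffic management and incident response.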
📄 Abstract
Traffic cameras are essential in urban areas, playing a crucial role in intelligent transportation systems. Multiple cameras at intersections enhance law enforcement capabilities, traffic management, and pedestrian safety. However, efficiently managing and analyzing multi-camera feeds poses challenges due to the vast amount of data. Analyzing such huge video data requires advanced analytical tools. While Large Language Models (LLMs) like ChatGPT, equipped with retrieval-augmented generation (RAG) systems, excel in text-based tasks, integrating them into traffic video analysis demands converting video data into text using a Vision-Language Model (VLM), which is time-consuming and delays the timely utilization of traffic videos for generating insights and investigating incidents. To address these challenges, we propose TrafficLens, a tailored algorithm for multi-camera traffic intersections. TrafficLens employs a sequential approach, utilizing overlapping coverage areas of cameras. It iteratively applies VLMs with varying token limits, using previous outputs as prompts for subsequent cameras, enabling rapid generation of detailed textual descriptions while reducing processing time. Additionally, TrafficLens intelligently bypasses redundant VLM invocations through an object-level similarity detector. Experimental results with real-world datasets demonstrate that TrafficLens reduces video-to-text conversion time by up to $4\times$ while maintaining information accuracy.
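The bypass idea can be sketched as follows. The abstract does not describe the similarity detector, so this illustration assumes Jaccard similarity over sets of detected object labels and a hypothetical threshold; `vlm_describe` is a stand-in callable, not a real API, and the token-limit scheduling and cross-camera prompting are omitted.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of object labels."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def describe_clips(clip_objects, vlm_describe, threshold=0.8):
    """Call the (expensive) VLM only when the detected objects differ enough
    from the last clip that was actually described; otherwise reuse its text."""
    descriptions = []
    vlm_calls = 0
    last_objects, last_text = None, None
    for objects in clip_objects:
        if last_objects is not None and jaccard(objects, last_objects) >= threshold:
            descriptions.append(last_text)  # redundant clip: skip the VLM
            continue
        last_text = vlm_describe(objects)
        last_objects = objects
        vlm_calls += 1
        descriptions.append(last_text)
    return descriptions, vlm_calls


def fake_vlm(objects):
    """Hypothetical stand-in for a VLM call."""
    return ",".join(sorted(objects))


clips = [{"car", "bus"}, {"car", "bus"}, {"bike"}]
texts, calls = describe_clips(clips, fake_vlm)
```

In this toy run the second clip reuses the first clip's description, so the stand-in VLM is called twice for three clips.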
[13] GuardTrace-VL: Detecting Unsafe Multimodel Reasoning via Iterative Safety Supervision
Yuxiao Xiang, Junchi Chen, Zhenchao Jin, Changtao Miao, Haojie Yuan, Qi Chu, Tao Gong, Nenghai Yu
🧩 TL;DR
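This paper proposes GuardTrace-VL, a vision-aware safety auditor that monitors the full Question-Thinking-Answer pipeline through joint image-text analysis and can detect unsafe content at the reasoning stage. It achieves an F1 score of 93.1% on unsafe reasoning detection, a 13.5% improvement over the previous strongest multimodal safety defense methods.
📘 Detailed Summary
Motivation: When multimodal large reasoning models produce explicit intermediate reasoning on vision-language tasks, the reasoning trace can contain unsafe content even if the final answer is harmless, creating deployment risk. Existing multimodal safety guards mostly evaluate only the input question and the final answer and ignore the intermediate reasoning, so harms such as biased inferences or policy-violating use of visual context go undetected.
Method: Proposes GuardTrace-VL, a vision-aware safety auditor that monitors the full Question-Thinking-Answer pipeline through joint image-text analysis. Builds the GuardTrace dataset, generated with diverse prompting strategies and refined through an MLRM- and human-based voting and verification pipeline. A three-stage progressive training scheme, combined with the data refinement process, lets the model learn nuanced, context-dependent safety preferences for different risk levels.
Result: On the proposed test set covering in-domain and out-of-domain scenarios, GuardTrace-VL achieves an F1 score of 93.1% on unsafe reasoning detection, a 13.5% F1 improvement over the previous strongest multimodal safety defense methods.
Conclusion: The study shows that monitoring the full reasoning pipeline of multimodal reasoning models is essential for comprehensive safety protection, and that joint image-text analysis with progressive training can effectively detect safety risks that arise during reasoning. The work supports safer deployment of multimodal AI systems and underlines the need to monitor intermediate reasoning.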
📄 Abstract
Multimodal large reasoning models (MLRMs) are increasingly deployed for vision-language tasks that produce explicit intermediate rationales. However, reasoning traces can contain unsafe content even when the final answer is non-harmful, creating deployment risks. Existing multimodal safety guards primarily evaluate only the input question and the final answer, neglecting the intermediate reasoning process. This oversight allows undetected harm, such as biased inferences or policy-violating use of visual context, to emerge during reasoning. We introduce GuardTrace-VL, a vision-aware safety auditor that monitors the full Question-Thinking-Answer (QTA) pipeline via joint image-text analysis, enabling detection of unsafe content as it emerges in the reasoning stage. To support training and evaluation, we construct the GuardTrace dataset, which is generated through diverse prompting strategies and refined via an MLRM- and human-based voting and verification pipeline. Furthermore, we propose a three-stage progressive training scheme combined with the data refinement process, enabling the model to learn nuanced and context-dependent safety preferences according to different risk levels. On our proposed test set covering both in-domain and out-of-domain scenarios, the GuardTrace-VL model achieves an F1 score of 93.1% on unsafe reasoning detection tasks, representing a 13.5% improvement in F1 score compared to the previous strongest multimodal safety defense methods. The code will be made publicly available.
[14] AnchorOPT: Towards Optimizing Dynamic Anchors for Adaptive Prompt Learning
Zheng Li, Yibing Song, Xin Zhang, Lei Luo, Xiang Li, Jian Yang
🧩 TL;DR
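This paper proposes AnchorOPT, a dynamic anchor-based prompt learning framework. By learning task-specific anchor values dynamically and using a learnable position matrix, it removes the limitation of static, fixed anchors in existing CLIP prompt learning methods and improves performance across multiple datasets.
📘 Detailed Summary
Motivation: Existing CLIP-based prompt learning methods use static textual tokens as anchors to guide learnable soft tokens. These anchors are static in both value and position and cannot adapt across tasks and training stages, which limits generalization.
Method: AnchorOPT introduces dynamism along two dimensions. Anchor values are learned dynamically from task-specific data instead of using handcrafted explicit textual tokens, and the positional relationship between anchors and soft tokens is adaptively optimized through a learnable position matrix conditioned on the training stage and task context. Training has two stages: anchor tokens are learned first, then frozen and transferred to the second stage, which optimizes the soft tokens and the position matrix.
Result: Extensive experiments show that a simple learnable anchor and position matrix alone achieve performance comparable to or better than some methods that add extra learnable modules or regularization techniques. As a plug-and-play module, AnchorOPT integrates seamlessly into existing frameworks and yields consistent gains across diverse datasets.
Conclusion: The study demonstrates the effectiveness of dynamic anchors in prompt learning. Task-adaptive anchor learning and position optimization markedly improve the generalization of CLIP models, offering a new design approach for prompt learning with broad applicability and extensibility.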
📄 Abstract
Existing prompt learning methods, which are built upon CLIP models, leverage textual tokens as anchors to guide the learnable soft tokens. This guidance improves CLIP's generalization. However, these anchors, static in both value and position, lack cross-task and stage-adaptive flexibility. To address this limitation, we propose AnchorOPT, a dynamic anchor-based prompt learning framework. Specifically, AnchorOPT introduces dynamism in two key dimensions: (i) anchor values eschew handcrafted explicit textual tokens (e.g., "shape", "color"), instead learning dynamically from task-specific data; and (ii) the positional relationship between anchor and soft tokens is no longer fixed but adaptively optimized via a learnable position matrix conditioned on the training stage and task context. Training occurs in two stages: we first learn the anchor tokens, then freeze and transfer them to the second stage for optimization of soft tokens and the position matrix. Extensive experiments demonstrate that using only a simple learnable anchor and position matrix achieves performance comparable to or exceeding some methods incorporating additional learnable modules or regularization techniques. As a plug-and-play module, AnchorOPT integrates seamlessly into existing frameworks, yielding consistent performance gains across diverse datasets. Code is publicly available at https://github.com/zhengli97/ATPrompt.
[15] From Inpainting to Layer Decomposition: Repurposing Generative Inpainting Models for Image Layer Decomposition
Jingxi Chen, Yixiao Zhang, Xiaoye Qian, Zongxia Li, Cornelia Fermuller, Caren Chen, Yiannis Aloimonos
🧩 TL;DR
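This paper proposes a diffusion-based image layer decomposition method that adapts an inpainting model to layer decomposition through lightweight finetuning and adds a multi-modal context fusion module to preserve detail. Trained on a synthetic dataset, it performs strongly on object removal and occlusion recovery.
📘 Detailed Summary
Motivation: Despite major progress in generative models, decomposing a single image into layers remains challenging, mainly because methods and data are limited. Existing approaches struggle to separate foreground objects from the background and to recover occluded content, which limits flexibility in image editing and creative applications.
Method: Observing a strong connection between layer decomposition and inpainting, the paper adapts a diffusion-based inpainting model to layer decomposition with lightweight finetuning. To preserve detail in the latent space, it introduces a multi-modal context fusion module with linear attention complexity. The model is trained entirely on a synthetic dataset built from open-source assets.
Result: The method achieves superior performance on object removal and occlusion recovery, decomposing images into foreground and background layers while preserving high-quality detail. The results indicate that training on synthetic data is effective and yields reliable layered representations for downstream editing.
Conclusion: The study shows that adapting an inpainting model to layer decomposition is feasible and that the proposed multi-modal context fusion module addresses detail preservation in the latent space. The work opens new possibilities for image editing and creative applications and demonstrates the value of synthetic data when real data is limited.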
📄 Abstract
Images can be viewed as layered compositions, foreground objects over background, with potential occlusions. This layered representation enables independent editing of elements, offering greater flexibility for content creation. Despite the progress in large generative models, decomposing a single image into layers remains challenging due to limited methods and data. We observe a strong connection between layer decomposition and in/outpainting tasks, and propose adapting a diffusion-based inpainting model for layer decomposition using lightweight finetuning. To further preserve detail in the latent space, we introduce a novel multi-modal context fusion module with linear attention complexity. Our model is trained purely on a synthetic dataset constructed from open-source assets and achieves superior performance in object removal and occlusion recovery, unlocking new possibilities in downstream editing and creative applications.
[16] Do Reasoning Vision-Language Models Inversely Scale in Test-Time Compute? A Distractor-centric Empirical Analysis
Jiyun Bae, Hyunjong Ok, Sangwoo Mo, Jaeho Lee
🧩 TL;DR
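This study introduces the Idis dataset to systematically examine how visual distractors affect test-time scaling in vision-language models. It finds that visual distractors differ fundamentally from textual ones: inverse scaling persists in both cases, but visual distractors reduce accuracy without increasing reasoning length. An analysis based on attribute counts reveals how distractors, reasoning length, and accuracy interact.
📘 Detailed Summary
Motivation: Prior work shows that textual distractors cause inverse scaling in language models, where longer reasoning is less effective, but how visual distractors affect vision-language models is unclear. This study examines how distractors influence test-time scaling in multimodal settings.
Method: Introduces Idis, a visual question-answering dataset that systematically varies visual distractors along semantic, numerical, and spatial dimensions; studies the interaction between distractors and reasoning by analyzing attribute counts in reasoning traces; and proposes a simple prompting strategy to mitigate bias-driven predictions in reasoning models.
Result: Visual distractors differ fundamentally from textual ones: inverse scaling appears in both cases, but visual distractors significantly reduce accuracy without increasing reasoning length. Similar trends appear on visual bias benchmarks such as Waterbirds, and the proposed prompting strategy effectively reduces bias-driven predictions.
Conclusion: Visual distractors affect vision-language models through a different mechanism than textual distractors. Tracking attribute counts during reasoning clarifies how distractors act, offering insight for understanding and improving the robustness of multimodal models and showing that simple prompting strategies can help mitigate model bias.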
📄 Abstract
How does irrelevant information (i.e., distractors) affect test-time scaling in vision-language models (VLMs)? Prior studies on language models have reported an inverse scaling effect, where textual distractors lead to longer but less effective reasoning. To investigate whether similar phenomena occur in multimodal settings, we introduce Idis (Images with distractors), a visual question-answering dataset that systematically varies distractors along semantic, numerical, and spatial dimensions. Our analyses reveal that visual distractors differ fundamentally from textual ones: although inverse scaling persists, adding visual distractors reduces accuracy without increasing reasoning length. We further show that tracking attribute counts within reasoning traces provides key insights into how distractors, reasoning length, and accuracy interact. Finally, we demonstrate that these trends extend to established visual bias benchmarks such as Waterbirds, and we propose a simple prompting strategy to mitigate bias-driven predictions in reasoning models.
[17] Knowledge Completes the Vision: A Multimodal Entity-aware Retrieval-Augmented Generation Framework for News Image Captioning
Xiaoxing You, Qiang Huang, Lingyu Li, Chi Zhang, Xiaopeng Liu, Min Zhang, Jun Yu
🧩 TL;DR
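This paper presents MERGE, the first multimodal entity-aware retrieval-augmented generation framework for news image captioning. By building an entity-centric multimodal knowledge base and a dynamic retrieval mechanism, it markedly improves the completeness and accuracy of news image captions.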
📘 Detailed Summary
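Motivation: Existing news image captioning methods face three key challenges: incomplete information coverage, weak cross-modal alignment, and suboptimal visual-entity grounding. These problems limit the news value and informativeness of the generated captions.
Method: MERGE builds an entity-centric multimodal knowledge base (EMKB) that integrates textual, visual, and structured knowledge. It improves cross-modal alignment with a multistage hypothesis-caption strategy and strengthens visual-entity matching through dynamic retrieval guided by image content.
Result: On GoodNews and NYTimes800k, MERGE significantly outperforms state-of-the-art methods, with CIDEr gains of +6.84 and +1.16 and named-entity-recognition F1 gains of +4.14 and +2.64, respectively. On the unseen Visual News dataset it gains +20.17 CIDEr and +6.22 F1.
Conclusion: MERGE demonstrates the effectiveness of multimodal entity-aware retrieval-augmented generation for news image captioning. Its generalization and domain adaptability suggest a new technical path for cross-modal content generation, especially in applications that need rich background knowledge.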
📄 Abstract
News image captioning aims to produce journalistically informative descriptions by combining visual content with contextual cues from associated articles. Despite recent advances, existing methods struggle with three key challenges: (1) incomplete information coverage, (2) weak cross-modal alignment, and (3) suboptimal visual-entity grounding. To address these issues, we introduce MERGE, the first Multimodal Entity-aware Retrieval-augmented GEneration framework for news image captioning. MERGE constructs an entity-centric multimodal knowledge base (EMKB) that integrates textual, visual, and structured knowledge, enabling enriched background retrieval. It improves cross-modal alignment through a multistage hypothesis-caption strategy and enhances visual-entity matching via dynamic retrieval guided by image content. Extensive experiments on GoodNews and NYTimes800k show that MERGE significantly outperforms state-of-the-art baselines, with CIDEr gains of +6.84 and +1.16 in caption quality, and F1-score improvements of +4.14 and +2.64 in named entity recognition. Notably, MERGE also generalizes well to the unseen Visual News dataset, achieving +20.17 in CIDEr and +6.22 in F1-score, demonstrating strong robustness and domain adaptability.
[18] CaptionQA: Is Your Caption as Useful as the Image Itself?
Shijia Yang, Yunong Liu, Bohan Zhai, Ximeng Sun, Zicheng Liu, Emad Barsoum, Manling Li, Chenfeng Xu
🧩 TL;DR
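This paper proposes the CaptionQA benchmark, which evaluates generated captions by measuring their utility in downstream tasks. It fills a gap in current evaluation methods, which cannot verify whether a caption can stand in for the image in real tasks.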
📘 Detailed Summary
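Motivation: Current caption evaluation methods have a fundamental flaw: they cannot verify whether a generated caption can effectively replace the original image in real downstream tasks, which limits the practical value of captions in multimodal systems.
Method: The authors build the CaptionQA benchmark, covering four domains (natural images, documents, e-commerce, and embodied AI) with a fine-grained taxonomy of 25 top-level categories and 69 subcategories, and create 33,027 densely annotated multiple-choice questions. An LLM answers the questions using only the caption, which directly measures the caption's utility for downstream tasks.
Result: Evaluation reveals a substantial gap between caption utility and image utility for state-of-the-art multimodal large models. Models that perform similarly on traditional image-QA benchmarks drop by up to 32% in caption utility, exposing the limits of current caption generation.
Conclusion: Caption quality should be judged by practical utility in downstream tasks rather than by traditional metrics. CaptionQA offers a utility-oriented evaluation framework for improving caption generation and can be extended to new domains through its open-source pipeline.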
📄 Abstract
Image captions serve as efficient surrogates for visual content in multimodal systems such as retrieval, recommendation, and multi-step agentic inference pipelines. Yet current evaluation practices miss a fundamental question: Can captions stand in for images in real downstream tasks? We propose a utility-based benchmark, CaptionQA, to evaluate model-generated captions, where caption quality is measured by how well it supports downstream tasks. CaptionQA is an extensible domain-dependent benchmark covering 4 domains--Natural, Document, E-commerce, and Embodied AI--each with fine-grained taxonomies (25 top-level and 69 subcategories) that identify useful information for domain-specific tasks. CaptionQA builds 33,027 densely annotated multiple-choice questions (50.3 per image on average) that explicitly require visual information to answer, providing a comprehensive probe of caption utility. In our evaluation protocol, an LLM answers these questions using captions alone, directly measuring whether captions preserve image-level utility and are utilizable by a downstream LLM. Evaluating state-of-the-art MLLMs reveals substantial gaps between the image and its caption utility. Notably, models nearly identical on traditional image-QA benchmarks drop by up to 32% in caption utility. We release CaptionQA along with an open-source pipeline for extension to new domains. The code is available at https://github.com/bronyayang/CaptionQA.
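The evaluation protocol described above (answer multiple-choice questions from the caption alone, then score accuracy) can be sketched as follows. This is a minimal illustration, not the released CaptionQA pipeline: the item layout, the function names, and the keyword-matching answerer that stands in for the downstream LLM are all assumptions made for the example.

```python
def caption_utility(items, answer_fn):
    """Fraction of multiple-choice questions answered correctly from the caption alone."""
    correct = 0
    for item in items:
        predicted = answer_fn(item["caption"], item["question"], item["choices"])
        correct += int(predicted == item["answer"])
    return correct / len(items)


def keyword_answerer(caption, question, choices):
    """Hypothetical stand-in for the LLM: pick the first choice mentioned in the caption."""
    for index, choice in enumerate(choices):
        if choice.lower() in caption.lower():
            return index
    return 0  # fall back to the first option when the caption gives no evidence


# Two toy items (invented data): the same question asked of a detailed and a vague caption.
items = [
    {"caption": "A red mug on a wooden desk.",
     "question": "What color is the mug?",
     "choices": ["blue", "red", "green"], "answer": 1},
    {"caption": "A mug on a desk.",
     "question": "What color is the mug?",
     "choices": ["blue", "red", "green"], "answer": 1},
]

score = caption_utility(items, keyword_answerer)
print(f"caption utility: {score:.2f}")
```

The vague caption loses the color, so the answerer can only guess; that lost information is what a utility-based score is meant to expose.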
[19] FlowerDance: MeanFlow for Efficient and Refined 3D Dance Generation
Kaixing Yang, Xulong Tang, Ziqiao Peng, Xiangyue Zhang, Puwei Wang, Jun He, Hongyan Liu
🧩 TL;DR
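This paper proposes FlowerDance, an efficient music-to-dance generation method. By combining MeanFlow with physical consistency constraints, it generates high-quality dance motion in only a few sampling steps while achieving significant efficiency gains in inference speed and memory utilization.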
📘 Detailed Summary
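Motivation: Existing music-to-dance generation methods have limited generation efficiency, which leaves insufficient computational headroom for high-fidelity 3D rendering and constrains the expressiveness of 3D characters in real-world applications.
Method: FlowerDance combines MeanFlow with physical consistency constraints to generate high-quality motion. It uses an efficient BiMamba-based architecture with channel-level cross-modal fusion, generates dance in a non-autoregressive manner, and supports motion editing.
Result: Extensive experiments on AIST++ and FineDance show that FlowerDance achieves state-of-the-art results in both motion quality and generation efficiency, with significant gains in inference speed and memory utilization.
Conclusion: The work demonstrates that an efficient dance generation system is feasible and offers a practical solution for virtual reality and digital entertainment applications. Its support for interactive motion editing adds to its practicality.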
📄 Abstract
Music-to-dance generation aims to translate auditory signals into expressive human motion, with broad applications in virtual reality, choreography, and digital entertainment. Despite promising progress, the limited generation efficiency of existing methods leaves insufficient computational headroom for high-fidelity 3D rendering, thereby constraining the expressiveness of 3D characters during real-world applications. Thus, we propose FlowerDance, which not only generates refined motion with physical plausibility and artistic expressiveness, but also achieves significant generation efficiency in terms of inference speed and memory utilization. Specifically, FlowerDance combines MeanFlow with Physical Consistency Constraints, which enables high-quality motion generation with only a few sampling steps. Moreover, FlowerDance leverages a simple but efficient model architecture with a BiMamba-based backbone and Channel-Level Cross-Modal Fusion, which generates dance in an efficient non-autoregressive manner. Meanwhile, FlowerDance supports motion editing, enabling users to interactively refine dance sequences. Extensive experiments on AIST++ and FineDance show that FlowerDance achieves state-of-the-art results in both motion quality and generation efficiency. Code will be released upon acceptance.
[20] LungNoduleAgent: A Collaborative Multi-Agent System for Precision Diagnosis of Lung Nodules
Cheng Yang, Hui Jin, Xinlei Yu, Zhipeng Wang, Yaoqun Liu, Fenglei Fan, Dajiang Lei, Gangyong Jia, Changmiao Wang, Ruiquan Ge
🧩 TL;DR
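This paper presents LungNoduleAgent, a collaborative multi-agent system for analyzing lung CT scans. Its modular diagnostic pipeline markedly improves the accuracy of nodule morphology description and malignancy grading, and it outperforms mainstream models on multiple datasets.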
📘 Detailed Summary
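Motivation: Current lung CT analysis based on multimodal large language models is limited in accurately describing nodule morphology and incorporating medical expertise, which affects reliability in clinical use. The potential of collaborative multi-agent systems in pathology has not been thoroughly explored.
Method: LungNoduleAgent uses a collaborative three-module architecture. The Nodule Spotter coordinates clinical detection models to identify nodules; the Radiologist integrates localized image description techniques to produce comprehensive CT reports; and the Doctor Agent System performs malignancy reasoning from images and CT reports, supported by a pathology knowledge base and a multi-agent framework.
Result: Extensive testing on two private datasets and the public LIDC-IDRI dataset shows that LungNoduleAgent surpasses mainstream vision-language models, agent systems, and advanced expert models, highlighting the importance of region-level semantic alignment and multi-agent collaboration in nodule diagnosis.
Conclusion: The study underscores the effectiveness of modular multi-agent collaboration in medical image analysis. LungNoduleAgent is a promising foundational tool for supporting clinical analysis of lung nodules and shows an advantage in balancing generality and precision.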
📄 Abstract
Diagnosing lung cancer typically involves physicians identifying lung nodules in computed tomography (CT) scans and generating diagnostic reports based on their morphological features and medical expertise. Although advancements have been made in using multimodal large language models for analyzing lung CT scans, challenges remain in accurately describing nodule morphology and incorporating medical expertise. These limitations affect the reliability and effectiveness of these models in clinical settings. Collaborative multi-agent systems offer a promising strategy for achieving a balance between generality and precision in medical applications, yet their potential in pathology has not been thoroughly explored. To bridge these gaps, we introduce LungNoduleAgent, an innovative collaborative multi-agent system specifically designed for analyzing lung CT scans. LungNoduleAgent streamlines the diagnostic process into sequential components, improving precision in describing nodules and grading malignancy through three primary modules. The first module, the Nodule Spotter, coordinates clinical detection models to accurately identify nodules. The second module, the Radiologist, integrates localized image description techniques to produce comprehensive CT reports. Finally, the Doctor Agent System performs malignancy reasoning by using images and CT reports, supported by a pathology knowledge base and a multi-agent system framework. Extensive testing on two private datasets and the public LIDC-IDRI dataset indicates that LungNoduleAgent surpasses mainstream vision-language models, agent systems, and advanced expert models. These results highlight the importance of region-level semantic alignment and multi-agent collaboration in diagnosing nodules. LungNoduleAgent stands out as a promising foundational tool for supporting clinical analyses of lung nodules.
[21] MIRA: Multimodal Iterative Reasoning Agent for Image Editing
Ziyun Zeng, Hang Hua, Jiebo Luo
🧩 TL;DR
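This paper proposes MIRA (Multimodal Iterative Reasoning Agent), a lightweight, plug-and-play multimodal reasoning agent that performs image editing through an iterative perception-reasoning-action loop, markedly improving how well diffusion models follow complex user instructions.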
📘 Detailed Summary
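Motivation: Diffusion-based editing models struggle with complex user instructions, especially those involving compositional relationships, contextual cues, or referring expressions, which leads to edits that drift semantically or fail to reflect the intended change. Existing methods have difficulty interpreting editing requests that carry several layers of meaning.
Method: MIRA is a multimodal reasoning agent that uses an iterative perception-reasoning-action loop to simulate multi-turn human-model interaction, predicting atomic edit instructions step by step and using visual feedback to make decisions. A 150K multimodal tool-use dataset, MIRA-Editing, combined with a two-stage SFT + GRPO training pipeline, enables the model to reason over and execute complex editing instructions.
Result: When paired with open-source image editing models such as Flux.1-Kontext, Step1X-Edit, and Qwen-Image-Edit, MIRA significantly improves semantic consistency and perceptual quality, reaching performance comparable to or exceeding proprietary systems such as GPT-Image and Nano-Banana.
Conclusion: MIRA shows that iterative reasoning is effective for complex image editing and gives diffusion models more accurate multimodal instruction understanding. The approach offers a scalable solution for open-domain image editing and advances interactive human-model editing.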
📄 Abstract
Instruction-guided image editing offers an intuitive way for users to edit images with natural language. However, diffusion-based editing models often struggle to accurately interpret complex user instructions, especially those involving compositional relationships, contextual cues, or referring expressions, leading to edits that drift semantically or fail to reflect the intended changes. We tackle this problem by proposing MIRA (Multimodal Iterative Reasoning Agent), a lightweight, plug-and-play multimodal reasoning agent that performs editing through an iterative perception-reasoning-action loop, effectively simulating multi-turn human-model interaction processes. Instead of issuing a single prompt or static plan, MIRA predicts atomic edit instructions step by step, using visual feedback to make its decisions. Our 150K multimodal tool-use dataset, MIRA-Editing, combined with a two-stage SFT + GRPO training pipeline, enables MIRA to perform reasoning and editing over complex editing instructions. When paired with open-source image editing models such as Flux.1-Kontext, Step1X-Edit, and Qwen-Image-Edit, MIRA significantly improves both semantic consistency and perceptual quality, achieving performance comparable to or exceeding proprietary systems such as GPT-Image and Nano-Banana.
[22] Scaling Foundation Models for Radar Scene Understanding
Pushkal Mishra, Kshitiz Bansal, Dinesh Bharadia
🧩 TL;DR
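This paper presents RadarFM, a radar foundation model that learns unified scene-level representations through structured spatial language supervision, addressing the problem that existing radar methods are task-specific and do not transfer across tasks. Using a structured caption framework and a hash-aware contrastive learning objective, the model achieves fine-grained spatial reasoning.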
📘 Detailed Summary
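Motivation: Existing radar perception methods are task-specific and architecturally fragmented: each downstream task uses a different architecture and training objective, which prevents knowledge transfer across tasks. Although foundation models have made major progress in vision and language understanding, their integration with radar sensing remains underexplored.
Method: The paper makes two key contributions: a structured caption framework that encodes vehicle distributions in native radar coordinates, and a hash-aware contrastive learning objective that quantifies continuous scene similarity rather than binary matching. The authors use the CARLA simulator to generate large-scale, well-annotated radar datasets and propose localization-aware evaluation metrics that go beyond traditional detection measures.
Result: The method is validated on large-scale simulated datasets, and the proposed localization-aware metrics assess spatial accuracy more precisely. Structured language supervision and the contrastive objective let the model learn unified scene-level representations that support transfer across tasks.
Conclusion: RadarFM shows the potential of the foundation-model paradigm for radar perception and provides a unified framework for multi-task radar understanding. Structured spatial language supervision and continuous similarity measurement open a new direction for radar foundation models and may advance reliable autonomous-driving perception in adverse weather.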
📄 Abstract
Radar sensors provide reliable perception across adverse weather, lighting, and long-range conditions. Recent advances in foundation models have transformed visual and language understanding, yet their integration with radar sensing remains largely underexplored. Existing radar approaches are fragmented and task-specific; each downstream task employs distinct architectures and training objectives, preventing transfer across tasks. In this work, we introduce RadarFM: a radar foundation model that learns unified scene-level representations through structured spatial language supervision. We make two key contributions: (1) a structured caption framework that encodes vehicle distributions in native radar coordinates, and (2) a hash-aware contrastive learning objective that quantifies continuous scene similarity rather than binary matching, enabling fine-grained spatial reasoning. Leveraging the CARLA simulator, we generate large-scale, well-annotated radar datasets across diverse driving scenarios. We also propose localization-aware metrics that assess spatial accuracy beyond traditional detection measures.
[23] EM-KD: Distilling Efficient Multimodal Large Language Model with Unbalanced Vision Tokens
Ze Feng, Sen Yang, Boqiang Duan, Wankou Yang, Jingdong Wang
🧩 TL;DR
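This paper proposes EM-KD, a new knowledge distillation paradigm for enhancing efficient multimodal large language models. By resolving the imbalance in vision tokens between teacher and student, it markedly improves both accuracy and efficiency.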
📘 Detailed Summary
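Motivation: Existing efficient multimodal large language models compress vision tokens to reduce resource consumption, but the loss of visual information degrades comprehension. Prior work introduces knowledge distillation to strengthen the student, yet overlooks the differences in fine-grained visual understanding between the efficient student and the original teacher that arise from unbalanced vision tokens.
Method: EM-KD first computes the Manhattan distance between teacher and student vision logits and aligns them in the spatial dimension with the Hungarian matching algorithm. It then applies two distillation strategies. Vision-Language Affinity Distillation computes the affinity matrix between text tokens and the aligned vision tokens and minimizes the smooth L1 distance between the teacher and student affinity matrices. Vision Semantic Distillation uses the reverse KL divergence to measure the difference between the discrete probability distributions of the aligned vision logits over the vocabulary space.
Result: Comprehensive evaluation on diverse benchmarks shows that models trained with EM-KD significantly outperform prior efficient multimodal large language models in both accuracy and efficiency. EM-KD also outperforms previous distillation methods equipped with the same vision token matching strategy.
Conclusion: The study shows that knowledge distillation can effectively strengthen efficient multimodal large language models once the vision token imbalance is addressed. EM-KD offers a new technical path for compressing multimodal models, markedly improving visual understanding while keeping the model lightweight.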
📄 Abstract
Efficient Multimodal Large Language Models (MLLMs) compress vision tokens to reduce resource consumption, but the loss of visual information can degrade comprehension capabilities. Although some prior works introduce Knowledge Distillation to enhance student models, they overlook the fundamental differences in fine-grained vision comprehension caused by unbalanced vision tokens between the efficient student and the vanilla teacher. In this paper, we propose EM-KD, a novel paradigm that enhances Efficient MLLMs with Knowledge Distillation. To overcome the challenge of unbalanced vision tokens, we first calculate the Manhattan distance between the vision logits of teacher and student, and then align them in the spatial dimension with the Hungarian matching algorithm. After alignment, EM-KD introduces two distillation strategies: 1) Vision-Language Affinity Distillation (VLAD) and 2) Vision Semantic Distillation (VSD). Specifically, VLAD calculates the affinity matrix between text tokens and aligned vision tokens, and minimizes the smooth L1 distance between the student and teacher affinity matrices. Considering the semantic richness of vision logits in the final layer, VSD employs the reverse KL divergence to measure the discrete probability distributions of the aligned vision logits over the vocabulary space. Comprehensive evaluation on diverse benchmarks demonstrates that the EM-KD-trained model outperforms prior Efficient MLLMs on both accuracy and efficiency by a large margin, validating its effectiveness. Compared with previous distillation methods, which are equipped with our proposed vision token matching strategy for fair comparison, EM-KD also achieves better performance.
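The alignment step described above (a Manhattan-distance cost between student and teacher vision logits, followed by an optimal one-to-one matching) can be sketched on toy data. This is an illustration under stated assumptions, not the authors' code: the vectors are invented, and the matching is a brute-force search that returns the same minimum-cost assignment the Hungarian algorithm would on small inputs (for realistic sizes a Hungarian implementation such as `scipy.optimize.linear_sum_assignment` would be the natural choice). The `reverse_kl` helper assumes the common convention KL(student || teacher); the abstract does not spell out the direction.

```python
import math
from itertools import permutations


def manhattan(a, b):
    """L1 distance between two logit vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def match_tokens(student, teacher):
    """Assign each student token a distinct teacher token with minimum total L1 cost.

    Brute force over all assignments; only practical for tiny examples.
    """
    best_cost, best_assignment = math.inf, None
    for candidate in permutations(range(len(teacher)), len(student)):
        cost = sum(manhattan(s, teacher[j]) for s, j in zip(student, candidate))
        if cost < best_cost:
            best_cost, best_assignment = cost, candidate
    return list(best_assignment), best_cost


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) over the vocabulary (assumed direction, see lead-in)."""
    p_s, p_t = softmax(student_logits), softmax(teacher_logits)
    return sum(ps * math.log(ps / pt) for ps, pt in zip(p_s, p_t))


# Toy data: the teacher keeps four vision tokens, the compressed student keeps two.
teacher = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0], [1.0, 1.0, 0.0]]
student = [[0.1, 0.0, 1.9], [1.8, 0.2, 0.0]]

assignment, cost = match_tokens(student, teacher)
losses = [reverse_kl(s, teacher[j]) for s, j in zip(student, assignment)]
print(assignment, round(cost, 3), [round(v, 4) for v in losses])
```

After matching, each student token is compared only with its assigned teacher token, which is what makes a per-token distillation loss well defined when the two models keep different numbers of vision tokens.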
[24] Which Layer Causes Distribution Deviation? Entropy-Guided Adaptive Pruning for Diffusion and Flow Models
Changlin Li, Jiawei Zhang, Zeyi Shi, Zongxin Yang, Zhihui Li, Xiaojun Chang
🧩 TL;DR
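This paper proposes EntPruner, an entropy-guided automatic progressive pruning framework designed for diffusion and flow models, which achieves up to 2.22x inference speedup while preserving generation quality.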
📘 Detailed Summary
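Motivation: Large-scale vision generative models, including diffusion and flow models, show significant parameter redundancy when transferred to downstream tasks, and conventional pruning methods struggle to compress them while preserving generation diversity and condition fidelity.
Method: The paper proposes an entropy-guided pruning strategy that uses data-dependent Conditional Entropy Deviation (CED) as a block-level importance metric, and develops a zero-shot adaptive pruning framework that dynamically decides when and how much to prune, avoiding the mode collapse caused by one-shot pruning.
Result: Extensive experiments on DiT and SiT models show that EntPruner achieves up to 2.22x inference speedup while maintaining competitive generation quality on ImageNet and three downstream datasets.
Conclusion: The study shows that pruning generative models requires a dedicated importance-assessment strategy. The entropy-guided approach balances model compression against generation quality and offers a new direction for zero-shot adaptive model compression.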
📄 Abstract
Large-scale vision generative models, including diffusion and flow models, have demonstrated remarkable performance in visual generation tasks. However, transferring these pre-trained models to downstream tasks often results in significant parameter redundancy. In this paper, we propose EntPruner, an entropy-guided automatic progressive pruning framework for diffusion and flow models. First, we introduce entropy-guided pruning, a block-level importance assessment strategy specifically designed for generative models. Unlike discriminative models, generative models require preserving the diversity and condition-fidelity of the output distribution. As the importance of each module can vary significantly across downstream tasks, EntPruner prioritizes pruning of less important blocks using data-dependent Conditional Entropy Deviation (CED) as a guiding metric. CED quantifies how much the distribution diverges from the learned conditional data distribution after removing a block. Second, we propose a zero-shot adaptive pruning framework to automatically determine when and how much to prune during training. This dynamic strategy avoids the pitfalls of one-shot pruning, mitigating mode collapse, and preserving model performance. Extensive experiments on DiT and SiT models demonstrate the effectiveness of EntPruner, achieving up to 2.22$\times$ inference speedup while maintaining competitive generation quality on ImageNet and three downstream datasets.
[25] CtrlVDiff: Controllable Video Generation via Unified Multimodal Video Diffusion
Dianbing Xi, Jiepeng Wang, Yuanzhi Liang, Xi Qiu, Jialun Liu, Hao Pan, Yuchi Huo, Rui Wang, Haibin Huang, Chi Zhang, Xuelong Li
🧩 TL;DR
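This paper presents CtrlVDiff, a unified diffusion model that uses a hybrid modality control strategy to fuse depth, normals, segmentation, edges, and graphics-based intrinsic properties. It addresses the dual challenges of video understanding and controllable generation, enabling precise layer-wise editing while maintaining temporal consistency.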
📘 Detailed Summary
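Motivation: Current video understanding and generation methods based on geometric cues are limited: they specify layout but under-constrain appearance, materials, and illumination, which makes physically meaningful edits such as relighting or material swaps hard to achieve and often causes temporal drift.
Method: CtrlVDiff uses a Hybrid Modality Control Strategy to route and fuse features from depth, normals, segmentation, edges, and graphics-based intrinsics. The authors also build MMVideo, a hybrid real-and-synthetic dataset that provides supervision aligned across modalities.
Result: Across understanding and generation benchmarks, CtrlVDiff delivers superior controllability and fidelity, supports layer-wise edits such as relighting, material adjustment, and object insertion, stays robust when some modalities are missing, and surpasses state-of-the-art baselines.
Conclusion: Adding rich graphics-based modalities provides complementary constraints for video understanding that both resolve ambiguity and enable precise, controllable generation, offering useful insight and direction for unified video understanding and generation frameworks.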
📄 Abstract
We tackle the dual challenges of video understanding and controllable video generation within a unified diffusion framework. Our key insights are two-fold: geometry-only cues (e.g., depth, edges) are insufficient: they specify layout but under-constrain appearance, materials, and illumination, limiting physically meaningful edits such as relighting or material swaps and often causing temporal drift. Enriching the model with additional graphics-based modalities (intrinsics and semantics) provides complementary constraints that both disambiguate understanding and enable precise, predictable control during generation. However, building a single model that uses many heterogeneous cues introduces two core difficulties. Architecturally, the model must accept any subset of modalities, remain robust to missing inputs, and inject control signals without sacrificing temporal consistency. Data-wise, training demands large-scale, temporally aligned supervision that ties real videos to per-pixel multimodal annotations. We then propose CtrlVDiff, a unified diffusion model trained with a Hybrid Modality Control Strategy (HMCS) that routes and fuses features from depth, normals, segmentation, edges, and graphics-based intrinsics (albedo, roughness, metallic), and re-renders videos from any chosen subset with strong temporal coherence. To enable this, we build MMVideo, a hybrid real-and-synthetic dataset aligned across modalities and captions. Across understanding and generation benchmarks, CtrlVDiff delivers superior controllability and fidelity, enabling layer-wise edits (relighting, material adjustment, object insertion) and surpassing state-of-the-art baselines while remaining robust when some modalities are unavailable.
[26] Referring Video Object Segmentation with Cross-Modality Proxy Queries
Baoli Sun, Xinzhu Ma, Ning Wang, Zhihui Wang, Zhiyong Wang
🧩 TL;DR
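This paper proposes ProxyFormer, a novel architecture for referring video object segmentation. It introduces proxy queries that integrate visual and text semantics, addressing the insufficient cross-modal alignment and late integration of textual constraints in existing methods, and achieves state-of-the-art performance on multiple benchmarks.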
📘 Detailed Summary
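Motivation: Existing referring video object segmentation methods based on conditional queries have two main limitations. The conditional queries lack inter-frame dependency and variation modeling, which makes accurate tracking difficult when frames change significantly; and textual constraints are integrated late, which may cause video features to focus on non-referred objects and hurt segmentation accuracy.
Method: ProxyFormer introduces a set of proxy queries that integrate visual and text semantics and facilitate the flow of semantics between them. The proxy queries are progressively updated and propagated across multiple stages of the video feature encoder so that video features stay focused on the object of interest. Cross-modal interaction is decoupled into temporal and spatial dimensions to reduce computational cost, and a Joint Semantic Consistency training strategy aligns the proxy queries with the semantic consensus of the video-text pairs.
Result: Comprehensive experiments on four widely used referring video object segmentation benchmarks show that ProxyFormer outperforms current state-of-the-art methods, validating the approach.
Conclusion: Through dynamically evolving proxy queries, ProxyFormer establishes inter-frame dependencies and improves the accuracy and coherence of object tracking. The work offers a new semantic alignment paradigm for cross-modal video understanding tasks.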
📄 Abstract
Referring video object segmentation (RVOS) is an emerging cross-modality task that aims to generate pixel-level maps of the target objects referred to by given textual expressions. The main concept involves learning an accurate alignment of visual elements and language expressions within a semantic space. Recent approaches address cross-modality alignment through conditional queries, tracking the target object using a query-response based mechanism built upon the transformer structure. However, they exhibit two limitations: (1) these conditional queries lack inter-frame dependency and variation modeling, making accurate target tracking challenging amid significant frame-to-frame variations; and (2) they integrate textual constraints belatedly, which may cause the video features to focus on non-referred objects. Therefore, we propose a novel RVOS architecture called ProxyFormer, which introduces a set of proxy queries to integrate visual and text semantics and facilitate the flow of semantics between them. By progressively updating and propagating proxy queries across multiple stages of the video feature encoder, ProxyFormer ensures that the video features are focused on the object of interest. This dynamic evolution also enables the establishment of inter-frame dependencies, enhancing the accuracy and coherence of object tracking. To mitigate high computational costs, we decouple cross-modality interactions into temporal and spatial dimensions. Additionally, we design a Joint Semantic Consistency (JSC) training strategy to align semantic consensus between the proxy queries and the combined video-text pairs. Comprehensive experiments on four widely used RVOS benchmarks demonstrate the superiority of our ProxyFormer over the state-of-the-art methods.
[27] LLaVA-UHD v3: Progressive Visual Compression for Efficient Native-Resolution Encoding in MLLMs
Shichu Sun, Yichen Zhang, Haolin Song, Zonghao Guo, Chi Chen, Yidan Zhang, Yuan Yao, Zhiyuan Liu, Maosong Sun
🧩 TL;DR
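This paper presents LLaVA-UHD v3, a multimodal large language model built around a progressive visual compression method that markedly reduces computational overhead while preserving performance. The method reconfigures a pretrained ViT into ViT-UHD, which performs competitively with MoonViT on multiple benchmarks while reducing time-to-first-token by 2.4x.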
📘 Detailed Summary
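Motivation: Multimodal large language models increasingly favor global native-resolution visual encoding over slice-based methods, but global encoding, while improving overall capability, incurs greater computational overhead. This work aims to cut that cost substantially while preserving visual encoding performance.
Method: The proposed Progressive Visual Compression method has two key modules: refined patch embedding, which supports flexible patch-size scaling for fine-grained visual modeling, and windowed token compression, which is deployed hierarchically across ViT layers to progressively aggregate local token representations. The method integrates seamlessly into a standard ViT to enable efficient native-resolution encoding.
Result: The converted ViT-UHD performs competitively with MoonViT across extensive benchmarks while reducing time-to-first-token by 2.4x. LLaVA-UHD v3, built on ViT-UHD, performs competitively with Qwen2-VL while further reducing time-to-first-token by 1.9x.
Conclusion: Progressive visual compression effectively balances performance and efficiency in multimodal large language models, and reconfiguring a pretrained ViT can markedly speed up inference while retaining generality. The work provides a technical path and baseline for future efficient multimodal large language models.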
📄 Abstract
Visual encoding followed by token condensing has become the standard architectural paradigm in multi-modal large language models (MLLMs). Many recent MLLMs increasingly favor global native-resolution visual encoding over slice-based methods. To investigate this trend, we systematically compare their behavior on vision-language understanding and attention patterns, revealing that global encoding enhances overall capability but at the expense of greater computational overhead. To address this issue, we present LLaVA-UHD v3, an MLLM centered upon our proposed Progressive Visual Compression (PVC) method, which can be seamlessly integrated into a standard Vision Transformer (ViT) to enable efficient native-resolution encoding. The PVC approach consists of two key modules: (i) refined patch embedding, which supports flexible patch-size scaling for fine-grained visual modeling, and (ii) windowed token compression, hierarchically deployed across ViT layers to progressively aggregate local token representations. Jointly modulated by these two modules, a widely pretrained ViT can be reconfigured into an efficient architecture while largely preserving generality. Evaluated across extensive benchmarks, the transformed ViT, termed ViT-UHD, demonstrates competitive performance with MoonViT while reducing TTFT (time-to-first-token) by 2.4x, when developed within an identical MLLM architecture. Building upon ViT-UHD, LLaVA-UHD v3 also achieves competitive performance with Qwen2-VL, while further reducing TTFT by 1.9x. We will release all code and checkpoints to support future research on efficient MLLMs.
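To make the idea of windowed token compression concrete, the sketch below merges each non-overlapping 2x2 window of a token grid into one token and applies the step twice, so the token count shrinks progressively. The abstract does not say how tokens inside a window are aggregated; mean pooling is used here purely as an assumed stand-in, and the function name and toy grid are invented for the example.

```python
def window_compress(grid, window=2):
    """Merge each non-overlapping window x window block of token vectors by averaging."""
    rows, cols = len(grid), len(grid[0])
    if rows % window or cols % window:
        raise ValueError("grid height and width must be divisible by the window size")
    dim = len(grid[0][0])
    compressed = []
    for r in range(0, rows, window):
        out_row = []
        for c in range(0, cols, window):
            block = [grid[r + dr][c + dc] for dr in range(window) for dc in range(window)]
            out_row.append([sum(tok[d] for tok in block) / len(block) for d in range(dim)])
        compressed.append(out_row)
    return compressed


def count_tokens(grid):
    return sum(len(row) for row in grid)


# A 4x4 grid of one-dimensional toy tokens holding the values 0..15.
grid = [[[float(r * 4 + c)] for c in range(4)] for r in range(4)]

stage1 = window_compress(grid)    # 4x4 -> 2x2, e.g. after an early ViT layer
stage2 = window_compress(stage1)  # 2x2 -> 1x1, e.g. after a later ViT layer

print(count_tokens(grid), count_tokens(stage1), count_tokens(stage2))
```

Each stage cuts the token count by a factor of four, so later layers attend over far fewer tokens than a single compression step at the end of the encoder would allow.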
[28] Progress by Pieces: Test-Time Scaling for Autoregressive Image Generation
Joonhyung Park, Hyeongwon Jang, Joowon Kim, Eunho Yang
🧩 TL;DR
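This paper proposes GridAR, a test-time scaling framework designed for visual autoregressive models. Using grid-partitioned progressive generation and a layout-specified prompt reformulation strategy, it markedly improves generation quality under a limited compute budget. On T2I-CompBench++, GridAR with N=4 outperforms Best-of-N (N=8) by 14.4% while reducing cost by 25.6%.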
📘 Detailed Summary
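Motivation: Test-time scaling for visual autoregressive models faces two key problems: the conventional Best-of-N strategy spends full-length computation on erroneous generation trajectories, and raster-scan decoding lacks a blueprint of the whole canvas, which limits the benefit of scaling. These limits keep visual AR models from realizing the full potential of test-time scaling.
Method: GridAR uses a grid-partitioned progressive generation scheme that generates multiple partial candidates for the same canvas position, prunes infeasible candidates early, and fixes viable ones as anchors to guide subsequent decoding. It also introduces a layout-specified prompt reformulation strategy that inspects partial views to infer a feasible layout; the reformulated prompt then guides image generation to compensate for the missing blueprint.
Result: On T2I-CompBench++, GridAR with N=4 outperforms Best-of-N (N=8) by 14.4% while reducing compute cost by 25.6%. On the PIE-Bench image editing task, it achieves comparable edit quality and a 13.9% gain in semantic preservation over larger-N baselines.
Conclusion: GridAR demonstrates that test-time scaling is effective for visual autoregressive models, using candidate management and layout guidance to overcome the limits of conventional approaches. The framework improves generation quality while markedly lowering compute cost, offering a feasible path toward practical deployment of visual AR models.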
📄 Abstract
Recent visual autoregressive (AR) models have shown promising capabilities in text-to-image generation, operating in a manner similar to large language models. While test-time computation scaling has brought remarkable success in enabling reasoning-enhanced outputs for challenging natural language tasks, its adaptation to visual AR models remains unexplored and poses unique challenges. Naively applying test-time scaling strategies such as Best-of-N can be suboptimal: they consume full-length computation on erroneous generation trajectories, while the raster-scan decoding scheme lacks a blueprint of the entire canvas, limiting scaling benefits as only a few prompt-aligned candidates are generated. To address these, we introduce GridAR, a test-time scaling framework designed to elicit the best possible results from visual AR models. GridAR employs a grid-partitioned progressive generation scheme in which multiple partial candidates for the same position are generated within a canvas, infeasible ones are pruned early, and viable ones are fixed as anchors to guide subsequent decoding. Coupled with this, we present a layout-specified prompt reformulation strategy that inspects partial views to infer a feasible layout for satisfying the prompt. The reformulated prompt then guides subsequent image generation to mitigate the blueprint deficiency. Together, GridAR achieves higher-quality results under limited test-time scaling: with N=4, it even outperforms Best-of-N (N=8) by 14.4% on T2I-CompBench++ while reducing cost by 25.6%. It also generalizes to autoregressive image editing, showing comparable edit quality and a 13.9% gain in semantic preservation on PIE-Bench over larger-N baselines.
[29] Scenes as Tokens: Multi-Scale Normal Distributions Transform Tokenizer for General 3D Vision-Language Understanding
Yutao Tang, Cheng Zhao, Gaurav Mittal, Rohith Kukkala, Rama Chellappa, Cheng Peng, Mei Chen
🧩 TL;DR
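This paper presents NDTokenizer3D, a generalist 3D vision-language model. With a novel three-stage scene tokenization pipeline and a representation based on the multi-scale Normal Distributions Transform, it achieves marked improvements across 3D scene understanding tasks while supporting human interaction.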
📘 Detailed Summary
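Motivation: Current 3D vision-language models face major challenges in tokenizing 3D scenes into holistic scene tokens and in using those tokens across diverse 3D understanding tasks. A general solution is needed that bridges language-level reasoning and 3D spatial understanding.
Method: The method uses a three-stage scene tokenization pipeline built on a multi-scale Normal Distributions Transform (NDT) representation. It first constructs a multi-scale NDT representation from raw high-resolution point clouds, preserving both global context and fine-grained geometric detail. A multi-scale NDT decoder then progressively fuses cross-scale features to produce holistic scene tokens. The decoder is also repurposed as a general interface for human-interactive prompting and segmentation-mask decoding.
Result: NDTokenizer3D achieves marked improvements in 3D referring segmentation, 3D visual question answering, and 3D dense captioning. Its compact, unified design makes it a fine-grained, general-purpose 3D vision-language model.
Conclusion: The study shows that a unified multi-scale representation and tokenization approach can handle diverse 3D scene understanding tasks within a single architecture. It offers a new design paradigm for 3D vision-language models and underlines the importance of integrating scene representation with language models.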
📄 Abstract
Recent advances in 3D vision-language models (VLMs) highlight a strong potential for 3D scene understanding and reasoning. However, effectively tokenizing 3D scenes into holistic scene tokens, and leveraging these tokens across diverse 3D understanding tasks, remain highly challenging. We present NDTokenizer3D, a generalist 3D VLM that performs a wide range of 3D scene understanding tasks while naturally supporting human interactions, thereby bridging language-level reasoning with 3D spatial understanding. The core of our approach is a novel three-stage scene tokenization pipeline built upon a Multi-Scale Normal Distributions Transform (NDT) representation, paired with a Multi-Scale NDT Decoder (MSDec). Specifically, NDTokenizer3D first constructs a multi-scale NDT representation from raw high-resolution point clouds, preserving both global context and fine-grained geometric details. Next, the MSDec progressively fuses cross-scale NDT features, producing holistic scene tokens consumable by LLM endpoints. Beyond tokenization, MSDec is repurposed as a general interface for human-interactive prompting (points, boxes, masks) and segmentation-mask decoding, unifying diverse 3D scene understanding tasks within a single architecture. With this compact and unified design, NDTokenizer3D offers a fine-grained, general-purpose 3D VLM, achieving remarkable improvements in 3D Referring Segmentation, 3D Visual Question Answering, and 3D Dense Captioning.
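For readers unfamiliar with the Normal Distributions Transform, the sketch below shows the basic construction the representation builds on: points are bucketed into voxels, and each occupied voxel is summarized by the mean and covariance of its points. Running it at several voxel sizes gives a simple multi-scale view. This is a generic illustration, not the NDTokenizer3D implementation; the function name, the toy points, and the use of the population covariance (dividing by n, with no minimum point count per cell) are assumptions.

```python
import math
from collections import defaultdict


def ndt_cells(points, voxel_size):
    """Summarize each occupied voxel by the mean and covariance of its points."""
    buckets = defaultdict(list)
    for point in points:
        key = tuple(math.floor(coord / voxel_size) for coord in point)
        buckets[key].append(point)

    cells = {}
    for key, pts in buckets.items():
        n = len(pts)
        mean = [sum(p[i] for p in pts) / n for i in range(3)]
        # Population covariance (divide by n); a real pipeline may use n - 1
        # and skip voxels with too few points.
        cov = [
            [sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in pts) / n for j in range(3)]
            for i in range(3)
        ]
        cells[key] = (mean, cov)
    return cells


# Toy point cloud: two small clusters along the x axis.
points = [(0.1, 0.1, 0.1), (0.3, 0.1, 0.1), (1.2, 0.2, 0.2), (1.4, 0.2, 0.2)]

fine = ndt_cells(points, voxel_size=1.0)    # two voxels, one per cluster
coarse = ndt_cells(points, voxel_size=2.0)  # one voxel covering both clusters

for scale, cells in (("fine", fine), ("coarse", coarse)):
    print(scale, {key: [round(m, 2) for m in mean] for key, (mean, _) in cells.items()})
```

The fine scale keeps the two clusters apart, while the coarse scale merges them into one distribution whose covariance is elongated along x, which is the kind of global-versus-local trade-off a multi-scale representation is meant to capture.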
[30] When Robots Obey the Patch: Universal Transferable Patch Attacks on Vision-Language-Action Models
Hui Lu, Yi Yu, Yiming Yang, Chenyu Yi, Qixin Zhang, Bingquan Shen, Alex C. Kot, Xudong Jiang
🧩 TL;DR
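This paper proposes UPA-RFAS, a universal, transferable adversarial patch attack framework for vision-language-action models. It learns a single physical patch in a shared feature space and achieves cross-model attacks under unknown architectures, finetuned variants, and sim-to-real shifts.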
📘 Detailed Summary
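Motivation: Vision-language-action models are vulnerable to adversarial attacks, but universal and transferable attacks remain underexplored. Most existing patch methods overfit to a single model and fail in black-box settings, which limits how well adversarial risk can be assessed for real robot systems.
Method: UPA-RFAS combines a feature-space objective (with an ℓ1 deviation prior and a repulsive InfoNCE loss), a robustness-augmented two-phase min-max optimization (the inner loop learns invisible sample-wise perturbations and the outer loop optimizes the universal patch against this hardened neighborhood), and two VLA-specific losses (Patch Attention Dominance and Patch Semantic Misalignment) that hijack text-to-vision attention and induce image-text mismatch.
Result: Experiments show that UPA-RFAS transfers consistently across models, tasks, and viewpoints over different VLA models, manipulation suites, and physical executions, exposing a practical patch-based attack surface and establishing a strong baseline for future defense research.
Conclusion: The study reveals the practical vulnerability of VLA models to adversarial patch attacks and shows that a single patch can transfer to unknown models and real environments, providing insight and a baseline method for assessing robot system security and developing more robust defenses.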
📄 Abstract
Vision-Language-Action (VLA) models are vulnerable to adversarial attacks, yet universal and transferable attacks remain underexplored, as most existing patches overfit to a single model and fail in black-box settings. To address this gap, we present a systematic study of universal, transferable adversarial patches against VLA-driven robots under unknown architectures, finetuned variants, and sim-to-real shifts. We introduce UPA-RFAS (Universal Patch Attack via Robust Feature, Attention, and Semantics), a unified framework that learns a single physical patch in a shared feature space while promoting cross-model transfer. UPA-RFAS combines (i) a feature-space objective with an $\ell_1$ deviation prior and repulsive InfoNCE loss to induce transferable representation shifts, (ii) a robustness-augmented two-phase min-max procedure where an inner loop learns invisible sample-wise perturbations and an outer loop optimizes the universal patch against this hardened neighborhood, and (iii) two VLA-specific losses: Patch Attention Dominance to hijack text$\to$vision attention and Patch Semantic Misalignment to induce image-text mismatch without labels. Experiments across diverse VLA models, manipulation suites, and physical executions show that UPA-RFAS consistently transfers across models, tasks, and viewpoints, exposing a practical patch-based attack surface and establishing a strong baseline for future defenses.
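The feature-space objective above builds on the InfoNCE loss. The abstract does not define the repulsive variant, so the sketch below shows only the standard InfoNCE term, computed with cosine similarity and a temperature; the function names, the temperature value, and the toy vectors are assumptions, and the vectors are assumed to be nonzero.

```python
import math


def cosine(a, b):
    """Cosine similarity between two nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(query, positive, negatives, temperature=0.1):
    """Standard InfoNCE: negative log-softmax of the positive pair among all candidates."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    peak = max(logits)  # subtract the maximum for a numerically stable log-sum-exp
    log_denominator = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_denominator - logits[0]


query = [1.0, 0.0]
aligned = info_nce(query, positive=[1.0, 0.0], negatives=[[0.0, 1.0], [-1.0, 0.0]])
misaligned = info_nce(query, positive=[0.0, 1.0], negatives=[[1.0, 0.0], [-1.0, 0.0]])

print(f"aligned: {aligned:.5f}  misaligned: {misaligned:.5f}")
```

The loss is near zero when the query matches its positive and large when a negative is closer. A contrastive attack objective can exploit this by steering which pairs count as similar, though how UPA-RFAS does so is not specified in the text above.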
[31] BotaCLIP: Contrastive Learning for Botany-Aware Representation of Earth Observation Data
Selene Cerna, Sara Si-Moussi, Wilfried Thuiller, Hadrien Hendrikx, Vincent Miele
🧩 TL;DR
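This paper presents BotaCLIP, a lightweight multimodal contrastive learning framework that injects domain expertise into a pretrained Earth Observation foundation model and outperforms both the original model and supervised baselines on ecological modeling tasks.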
📘 Detailed Summary
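Motivation: Foundation models learn rich cross-modal representations but are hard to adapt to domain-specific knowledge. The goal is to inject domain expertise into a pretrained model without retraining from scratch or incurring significant computational cost, particularly in data-scarce ecological modeling settings.
Method: BotaCLIP is a lightweight multimodal contrastive learning framework that aligns high-resolution aerial imagery with botanical relevés and uses a regularization strategy to mitigate catastrophic forgetting, so that the pretrained DOFA Earth Observation foundation model internalizes ecological structure.
Result: On three ecological tasks (plant presence prediction, butterfly occurrence modeling, and soil trophic group abundance estimation), BotaCLIP representations outperform both the DOFA foundation model and supervised baselines, showing consistent improvements.
Conclusion: The work shows that domain-aware adaptation of foundation models can inject expert knowledge in data-scarce settings and enable frugal representation learning, offering a feasible path for customizing foundation models to specific domains.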
📄 Abstract
Foundation models have demonstrated a remarkable ability to learn rich, transferable representations across diverse modalities such as images, text, and audio. In modern machine learning pipelines, these representations often replace raw data as the primary input for downstream tasks. In this paper, we address the challenge of adapting a pre-trained foundation model to inject domain-specific knowledge, without retraining from scratch or incurring significant computational costs. To this end, we introduce BotaCLIP, a lightweight multimodal contrastive framework that adapts a pre-trained Earth Observation foundation model (DOFA) by aligning high-resolution aerial imagery with botanical relevés. Unlike generic embeddings, BotaCLIP internalizes ecological structure through contrastive learning with a regularization strategy that mitigates catastrophic forgetting. Once trained, the resulting embeddings serve as transferable representations for downstream predictors. Motivated by real-world applications in biodiversity modeling, we evaluated BotaCLIP representations in three ecological tasks: plant presence prediction, butterfly occurrence modeling, and soil trophic group abundance estimation. The results showed consistent improvements over those derived from DOFA and supervised baselines. More broadly, this work illustrates how domain-aware adaptation of foundation models can inject expert knowledge into data-scarce settings, enabling frugal representation learning.
[32] Unlocking Zero-shot Potential of Semi-dense Image Matching via Gaussian Splatting
Juncheng Chen, Chao Xu, Yanjun Cao
🧩 TL;DR
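This paper presents MatchGS, the first framework to systematically correct and leverage 3D Gaussian Splatting (3DGS) for robust zero-shot image matching. A geometry-refining data generation pipeline and a 2D-3D representation alignment strategy substantially improve the zero-shot performance of image matchers.
📘 Detailed Summary
Motivation: Learning-based image matching depends critically on large-scale, diverse, and geometrically accurate training data. 3DGS enables photorealistic novel-view synthesis, but its geometric inaccuracies and biased depth rendering prevent robust correspondence labeling.
Method: MatchGS has two core components: a geometrically faithful data generation pipeline that refines 3DGS geometry to produce highly precise correspondence labels, and a 2D-3D representation alignment strategy that infuses the explicit 3D knowledge of 3DGS into 2D matchers, guiding them to learn viewpoint-invariant 3D representations.
Result: The generated correspondences reduce epipolar error by up to 40 times compared with existing datasets, enable supervision under extreme viewpoint changes, and provide self-supervisory signals through Gaussian attributes. State-of-the-art matchers trained solely on MatchGS data achieve zero-shot gains of up to 17.7% on public benchmarks.
Conclusion: With proper geometric refinement, 3DGS can serve as a scalable, high-fidelity, and structurally rich data source, paving the way for a new generation of robust zero-shot image matchers.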
📄 Abstract
Learning-based image matching critically depends on large-scale, diverse, and geometrically accurate training data. 3D Gaussian Splatting (3DGS) enables photorealistic novel-view synthesis and thus is attractive for data generation. However, its geometric inaccuracies and biased depth rendering currently prevent robust correspondence labeling. To address this, we introduce MatchGS, the first framework designed to systematically correct and leverage 3DGS for robust, zero-shot image matching. Our approach is twofold: (1) a geometrically-faithful data generation pipeline that refines 3DGS geometry to produce highly precise correspondence labels, enabling the synthesis of a vast and diverse range of viewpoints without compromising rendering fidelity; and (2) a 2D-3D representation alignment strategy that infuses 3DGS' explicit 3D knowledge into the 2D matcher, guiding 2D semi-dense matchers to learn viewpoint-invariant 3D representations. Our generated ground-truth correspondences reduce the epipolar error by up to 40 times compared to existing datasets, enable supervision under extreme viewpoint changes, and provide self-supervisory signals through Gaussian attributes. Consequently, state-of-the-art matchers trained solely on our data achieve significant zero-shot performance gains on public benchmarks, with improvements of up to 17.7%. Our work demonstrates that with proper geometric refinement, 3DGS can serve as a scalable, high-fidelity, and structurally-rich data source, paving the way for a new generation of robust zero-shot image matchers.
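The epipolar error mentioned in the abstract measures how far a labeled correspondence lies from the epipolar geometry implied by the camera poses. The abstract does not say which variant the authors use, so the following is only a minimal sketch of one common choice, the symmetric point-to-epipolar-line distance. The function names and the toy fundamental matrix are illustrative assumptions.

```python
import math

def epipolar_line(F, x):
    """Epipolar line l = F @ [x, y, 1] as homogeneous coefficients (a, b, c)."""
    xh = (x[0], x[1], 1.0)
    return tuple(sum(F[r][c] * xh[c] for c in range(3)) for r in range(3))

def point_line_distance(line, x):
    """Perpendicular distance from pixel x to the line a*x + b*y + c = 0."""
    a, b, c = line
    return abs(a * x[0] + b * x[1] + c) / math.hypot(a, b)

def symmetric_epipolar_error(F, x1, x2):
    """Mean of dist(x2, F x1) and dist(x1, F^T x2), in pixels."""
    Ft = [[F[c][r] for c in range(3)] for r in range(3)]
    d2 = point_line_distance(epipolar_line(F, x1), x2)
    d1 = point_line_distance(epipolar_line(Ft, x2), x1)
    return 0.5 * (d1 + d2)

# Toy geometry (assumption): identity intrinsics and a pure x-translation,
# so F = [t]_x with t = (1, 0, 0) and epipolar lines are horizontal.
F = [[0.0, 0.0, 0.0],
     [0.0, 0.0, -1.0],
     [0.0, 1.0, 0.0]]

error = symmetric_epipolar_error(F, (10.0, 20.0), (4.0, 23.0))  # 3-pixel row offset
```

In this toy setup the error reduces to the row difference between the two matched pixels, which makes it easy to sanity-check a label generator before moving to real poses.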
[33] The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment
Stefanos Koutoupis, Michaela Areti Zervou, Konstantinos Kontras, Maarten De Vos, Panagiotis Tsakalides, Grigorios Tsagatakis
🧩 TL;DR
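This paper presents Contrastive Fusion (ConFu), a framework that jointly embeds individual modalities and their fused combinations into a unified representation space, capturing higher-order multimodal dependencies while preserving strong pairwise correspondence. It shows competitive performance on synthetic and real-world multimodal benchmarks and supports unified one-to-one and two-to-one retrieval.
📘 Detailed Summary
Motivation: Existing multimodal learning methods mostly operate in pairwise settings, aligning only two modalities at a time. Recent methods that try to capture higher-order interactions often overlook or insufficiently preserve pairwise relationships, which limits their effectiveness on single-modality tasks. Balancing higher-order dependency capture against pairwise correspondence is a key challenge in joint multimodal representation learning.
Method: ConFu extends the traditional pairwise contrastive objective with an additional fused-modality contrastive term that aligns the joint embedding of a modality pair with a third modality. Individual modalities and their fused combinations share one representation space in which modalities are aligned with their fused counterparts, which lets the model capture higher-order dependencies, such as XOR-like relationships, that pairwise alignment alone cannot recover.
Result: On synthetic and real-world multimodal benchmarks, ConFu shows competitive performance on retrieval and classification tasks, exploits cross-modal complementarity, captures higher-order dependencies, and scales with increasing multimodal complexity. It supports unified one-to-one and two-to-one retrieval within a single contrastive framework.
Conclusion: ConFu shows that higher-order interactions and pairwise correspondence can be preserved together in multimodal representation learning, providing a unified framework for handling complex multimodal relationships.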
📄 Abstract
Learning joint representations across multiple modalities remains a central challenge in multimodal machine learning. Prevailing approaches predominantly operate in pairwise settings, aligning two modalities at a time. While some recent methods aim to capture higher-order interactions among multiple modalities, they often overlook or insufficiently preserve pairwise relationships, limiting their effectiveness on single-modality tasks. In this work, we introduce Contrastive Fusion (ConFu), a framework that jointly embeds both individual modalities and their fused combinations into a unified representation space, where modalities and their fused counterparts are aligned. ConFu extends traditional pairwise contrastive objectives with an additional fused-modality contrastive term, encouraging the joint embedding of modality pairs with a third modality. This formulation enables ConFu to capture higher-order dependencies, such as XOR-like relationships, that cannot be recovered through pairwise alignment alone, while still maintaining strong pairwise correspondence. We evaluate ConFu on synthetic and real-world multimodal benchmarks, assessing its ability to exploit cross-modal complementarity, capture higher-order dependencies, and scale with increasing multimodal complexity. Across these settings, ConFu demonstrates competitive performance on retrieval and classification tasks, while supporting unified one-to-one and two-to-one retrieval within a single contrastive framework.
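The objective described in the abstract, pairwise contrastive terms plus a fused-modality term, can be sketched roughly as follows. This is not the authors' implementation: the one-directional InfoNCE, the temperature, the `fused_weight` parameter, and especially the mean-based `fuse` function are assumptions made for illustration. A learned, nonlinear fusion head would be needed to capture XOR-like structure.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]

def info_nce(anchors, targets, temperature=0.1):
    """One-directional InfoNCE: anchors[i] should match targets[i];
    all other rows of `targets` act as negatives."""
    anchors = [normalize(a) for a in anchors]
    targets = [normalize(t) for t in targets]
    loss = 0.0
    for i, a in enumerate(anchors):
        logits = [dot(a, t) / temperature for t in targets]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denom - logits[i]
    return loss / len(anchors)

def fuse(za, zb):
    """Hypothetical fusion: element-wise mean of two modality embeddings."""
    return [[(x + y) / 2.0 for x, y in zip(a, b)] for a, b in zip(za, zb)]

def confu_style_loss(za, zb, zc, fused_weight=1.0):
    """Pairwise terms over (a,b), (a,c), (b,c) plus a fused (a+b) -> c term."""
    pairwise = info_nce(za, zb) + info_nce(za, zc) + info_nce(zb, zc)
    fused = info_nce(fuse(za, zb), zc)
    return pairwise + fused_weight * fused

# Toy batch of two samples with 2-D embeddings per modality.
za = [[1.0, 0.0], [0.0, 1.0]]
zb = [[1.0, 0.0], [0.0, 1.0]]
zc_aligned = [[1.0, 0.0], [0.0, 1.0]]
zc_shuffled = [[0.0, 1.0], [1.0, 0.0]]
```

Keeping the pairwise terms in the sum is what preserves one-to-one retrieval, while the fused term is the part that supports two-to-one queries.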
[34] Co-Training Vision Language Models for Remote Sensing Multi-task Learning
Qingyun Li, Shuran Ma, Junwei Luo, Yi Yu, Yue Zhou, Fengxiang Wang, Xudong Lu, Xiaoxing Wang, Xin He, Yushi Chen, Xue Yang, Junchi Yan
🧩 TL;DR
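This paper presents RSCoVLM, a simple yet flexible vision-language model baseline for remote sensing multi-task learning. With a unified data engine, a dynamic-resolution strategy, and the Zoom-in Chain mechanism, it achieves state-of-the-art performance across multiple remote sensing tasks.
📘 Detailed Summary
Motivation: Transformers perform well on individual remote sensing tasks, but the field lacks a general model that handles multiple tasks in a unified way. Vision-language models show promise for remote sensing image understanding, grounding, and ultra-high-resolution image reasoning, yet they must cope with a complex remote sensing data environment and widely varying image scales.
Method: The authors build a complete data curation engine covering data acquisition, offline processing and integration, and online loading and weighting; design a unified dynamic-resolution strategy for remote sensing images of different scales; introduce the Zoom-in Chain mechanism and its corresponding dataset, LRS-VQA-Zoom, for ultra-high-resolution images; and strengthen the model's object detection capability alongside a new evaluation protocol.
Result: Extensive experiments show that RSCoVLM achieves state-of-the-art performance across diverse tasks, outperforming existing remote sensing vision-language models and even rivaling specialized expert models. All training and evaluation tools, model weights, and datasets are open-sourced to support reproducibility.
Conclusion: The baseline is expected to advance general-purpose remote sensing models and demonstrates the potential of vision-language models for remote sensing multi-task learning. The proposed data engine and resolution strategies address the complexity of remote sensing data and the associated computational burden.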
📄 Abstract
With Transformers achieving outstanding performance on individual remote sensing (RS) tasks, we are now approaching the realization of a unified model that excels across multiple tasks through multi-task learning (MTL). Compared to single-task approaches, MTL methods offer improved generalization, enhanced scalability, and greater practical applicability. Recently, vision language models (VLMs) have achieved promising results in RS image understanding, grounding, and ultra-high-resolution (UHR) image reasoning. Moreover, the unified text-based interface demonstrates significant potential for MTL. Hence, in this work, we present RSCoVLM, a simple yet flexible VLM baseline for RS MTL. First, we create a data curation engine covering data acquisition, offline processing and integration, and online loading and weighting. This data engine effectively addresses the complex RS data environment and generates flexible vision-language conversations. Furthermore, we propose a unified dynamic-resolution strategy to address the diverse image scales inherent in RS imagery. For UHR images, we introduce the Zoom-in Chain mechanism together with its corresponding dataset, LRS-VQA-Zoom. The strategies are flexible and effectively mitigate the computational burden. Additionally, we significantly enhance the model's object detection capability and propose a novel evaluation protocol that ensures fair comparison between VLMs and conventional detection models. Extensive experiments demonstrate that RSCoVLM achieves state-of-the-art performance across diverse tasks, outperforming existing RS VLMs and even rivaling specialized expert models. All training and evaluation tools, model weights, and datasets have been fully open-sourced to support reproducibility. We expect that this baseline will promote further progress toward general-purpose RS models.
[35] SurgMLLMBench: A Multimodal Large Language Model Benchmark Dataset for Surgical Scene Understanding
Tae-Min Choi, Tae Kyeong Jeong, Garam Kim, Jaemin Lee, Yeongyoon Koh, In Cheul Choi, Jae-Ho Chung, Jong Woong Park, Juyoun Park
🧩 TL;DR
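This work presents SurgMLLMBench, a unified multimodal benchmark for developing and evaluating interactive multimodal large language models for surgical scene understanding. It includes the newly collected MAVIS dataset and addresses the limitations of existing surgical datasets.
📘 Detailed Summary
Motivation: Existing surgical datasets mostly adopt a visual question answering (VQA) format, use heterogeneous taxonomies, and lack pixel-level segmentation support, which limits consistent evaluation and practical applicability. A unified multimodal benchmark is therefore needed to advance surgical AI research.
Method: The benchmark integrates pixel-level instrument segmentation masks and structured VQA annotations across laparoscopic, robot-assisted, and micro-surgical domains under a unified taxonomy, enabling comprehensive evaluation beyond traditional VQA tasks and richer visual-conversational interactions.
Result: Extensive baseline experiments show that a single model trained on SurgMLLMBench achieves consistent performance across domains and generalizes effectively to unseen datasets.
Conclusion: SurgMLLMBench will be publicly released as a resource for multimodal surgical AI research, supporting reproducible evaluation and the development of interactive surgical reasoning models.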
📄 Abstract
Recent advances in multimodal large language models (LLMs) have highlighted their potential for medical and surgical applications. However, existing surgical datasets predominantly adopt a Visual Question Answering (VQA) format with heterogeneous taxonomies and lack support for pixel-level segmentation, limiting consistent evaluation and applicability. We present SurgMLLMBench, a unified multimodal benchmark explicitly designed for developing and evaluating interactive multimodal LLMs for surgical scene understanding, including the newly collected Micro-surgical Artificial Vascular anastomosIS (MAVIS) dataset. It integrates pixel-level instrument segmentation masks and structured VQA annotations across laparoscopic, robot-assisted, and micro-surgical domains under a unified taxonomy, enabling comprehensive evaluation beyond traditional VQA tasks and richer visual-conversational interactions. Extensive baseline experiments show that a single model trained on SurgMLLMBench achieves consistent performance across domains and generalizes effectively to unseen datasets. SurgMLLMBench will be publicly released as a robust resource to advance multimodal surgical AI research, supporting reproducible evaluation and development of interactive surgical reasoning models.
[36] Monet: Reasoning in Latent Visual Space Beyond Images and Language
Qixun Wang, Yang Shi, Yifei Wang, Yuanxing Zhang, Pengfei Wan, Kun Gai, Xianghua Ying, Yisen Wang
🧩 TL;DR
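This paper presents Monet, a training framework that enables multimodal large language models to reason in the latent visual space by generating continuous embeddings as intermediate visual thoughts, addressing the limitations of existing methods in abstract visual reasoning.
📘 Detailed Summary
Motivation: Existing visual reasoning methods fall short of human-like abstract visual thinking because their flexibility is fundamentally limited by external tools.
Method: The authors propose a three-stage distillation-based supervised fine-tuning pipeline, construct Monet-SFT-125K, a high-quality text-image interleaved chain-of-thought dataset, and design VLPO (Visual-latent Policy Optimization), a reinforcement learning method that explicitly incorporates latent embeddings into policy gradient updates.
Result: Monet-7B shows consistent gains on real-world perception and reasoning benchmarks and exhibits strong out-of-distribution generalization on challenging abstract visual reasoning tasks.
Conclusion: By empirically analyzing the role of each training component and discussing early unsuccessful attempts, the work offers insights for future developments in visual latent reasoning.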
📄 Abstract
"Thinking with images" has emerged as an effective paradigm for advancing visual reasoning, extending beyond text-only chains of thought by injecting visual evidence into intermediate reasoning steps. However, existing methods fall short of human-like abstract visual thinking, as their flexibility is fundamentally limited by external tools. In this work, we introduce Monet, a training framework that enables multimodal large language models (MLLMs) to reason directly within the latent visual space by generating continuous embeddings that function as intermediate visual thoughts. We identify two core challenges in training MLLMs for latent visual reasoning: high computational cost in latent-vision alignment and insufficient supervision over latent embeddings, and address them with a three-stage distillation-based supervised fine-tuning (SFT) pipeline. We further reveal a limitation of applying GRPO to latent reasoning: it primarily enhances text-based reasoning rather than latent reasoning. To overcome this, we propose VLPO (Visual-latent Policy Optimization), a reinforcement learning method that explicitly incorporates latent embeddings into policy gradient updates. To support SFT, we construct Monet-SFT-125K, a high-quality text-image interleaved CoT dataset containing 125K real-world, chart, OCR, and geometry CoTs. Our model, Monet-7B, shows consistent gains across real-world perception and reasoning benchmarks and exhibits strong out-of-distribution generalization on challenging abstract visual reasoning tasks. We also empirically analyze the role of each training component and discuss our early unsuccessful attempts, providing insights for future developments in visual latent reasoning. Our model, data, and code are available at https://github.com/NOVAglow646/Monet.
[37] Thinking With Bounding Boxes: Enhancing Spatio-Temporal Video Grounding via Reinforcement Fine-Tuning
Xin Gu, Haoji Zhang, Qihang Fan, Jingxuan Niu, Zhipeng Zhang, Libo Zhang, Guang Chen, Fan Chen, Longyin Wen, Sijie Zhu
🧩 TL;DR
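STVG-o1 is the first framework that enables off-the-shelf multimodal large language models (MLLMs) to achieve state-of-the-art spatio-temporal video grounding performance without architectural modifications. It introduces a bounding-box chain-of-thought mechanism and a multi-dimensional reinforcement reward function that substantially improve MLLMs on fine-grained spatio-temporal localization.
📘 Detailed Summary
Motivation: Despite their strong language understanding, MLLMs underperform on spatio-temporal video grounding, mainly because of misaligned training objectives and weak fine-grained region-word alignment in standard visual encoders.
Method: The method introduces a bounding-box chain-of-thought mechanism that explicitly reasons about spatio-temporal locations as an intermediate step before producing the final prediction, together with a multi-dimensional reinforcement reward function consisting of format, consistency, temporal, spatial, and think rewards, which provides geometry-aware supervision through reinforcement fine-tuning.
Result: Evaluated on HCSTVG-v1/v2 and VidSTG, STVG-o1 sets new state-of-the-art results on HCSTVG, outperforming the best task-specific method by 7.3% m_tIoU on HCSTVG-v1, matches specialized models on VidSTG, and surpasses all existing MLLM-based approaches by large margins.
Conclusion: The work establishes MLLMs as viable and powerful backbones for precise spatio-temporal grounding and demonstrates strong open-vocabulary generalization across datasets.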
📄 Abstract
Spatio-temporal video grounding (STVG) requires localizing a target object in untrimmed videos both temporally and spatially from natural language descriptions. Despite their strong language understanding, multimodal large language models (MLLMs) underperform on STVG due to misaligned training objectives and weak fine-grained region-word alignment in standard visual encoders. To address this, we propose STVG-o1, the first framework that enables off-the-shelf MLLMs to achieve state-of-the-art STVG performance without any architectural modifications. Our method introduces a bounding-box chain-of-thought mechanism that explicitly reasons about spatio-temporal locations in an intermediate step before producing the final prediction. We further design a multi-dimensional reinforcement reward function consisting of format, consistency, temporal, spatial, and think rewards, which provides geometry-aware supervision through reinforcement fine-tuning. Evaluated on HCSTVG-v1/v2 and VidSTG, STVG-o1 sets new state-of-the-art results on HCSTVG, outperforming the best task-specific method by 7.3% m_tIoU on HCSTVG-v1, matching specialized models on VidSTG, and surpassing all existing MLLM-based approaches by large margins. It also demonstrates strong open-vocabulary generalization across datasets, establishing MLLMs as viable and powerful backbones for precise spatio-temporal grounding. Our code and models will be released.
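The temporal and spatial rewards, and the m_tIoU metric, rest on interval and box overlap. The abstract does not give the exact reward formulas, so the sketch below shows only the standard IoU building blocks. The function names and the `(start, end)` and `(x1, y1, x2, y2)` conventions are assumptions for illustration.

```python
def temporal_iou(pred, gt):
    """IoU of two time intervals given as (start, end)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def mean_temporal_iou(preds, gts):
    """m_tIoU: temporal IoU averaged over all test samples."""
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(gts)

# One perfect and one completely missed temporal prediction.
score = mean_temporal_iou([(0.0, 10.0), (0.0, 10.0)],
                          [(0.0, 10.0), (20.0, 30.0)])
```

A reinforcement reward could plausibly use these overlaps directly as scalar signals, but how STVG-o1 weights and combines them is not specified in the abstract.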
[38] From Observation to Action: Latent Action-based Primitive Segmentation for VLA Pre-training in Industrial Settings
Jiajie Zhang, Sören Schwertfeger, Alexander Kleiner
🧩 TL;DR
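This paper presents a novel unsupervised framework that unlocks large amounts of unlabeled human demonstration data from continuous industrial video streams for Vision-Language-Action (VLA) model pre-training. Using a lightweight motion tokenizer and an unsupervised action segmenter based on latent action energy, it automatically discovers and organizes semantically coherent action primitives.
📘 Detailed Summary
Motivation: Industrial environments contain large volumes of unlabeled human demonstration video, but automated methods for extracting structured data for VLA pre-training are lacking. Existing approaches require manual annotation or supervised learning and are hard to scale to continuous, industrial-scale video streams, which limits the integration of embodied AI in manufacturing.
Method: The method first trains a lightweight motion tokenizer to encode motion dynamics, then applies an unsupervised action segmenter that uses a novel "Latent Action Energy" metric to discover and segment semantically coherent action primitives. The pipeline outputs segmented video clips together with their corresponding latent action sequences, providing structured data for VLA pre-training.
Result: Evaluations on public benchmarks and a proprietary electric motor assembly dataset show that the method effectively segments key tasks performed by humans at workstations. Clustering and quantitative assessment with a vision-language model confirm the semantic coherence of the discovered action primitives.
Conclusion: According to the authors, this is the first fully automated end-to-end system for extracting and organizing VLA pre-training data from unstructured industrial video, offering a scalable solution for embodied AI integration in manufacturing.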
📄 Abstract
We present a novel unsupervised framework to unlock vast unlabeled human demonstration data from continuous industrial video streams for Vision-Language-Action (VLA) model pre-training. Our method first trains a lightweight motion tokenizer to encode motion dynamics, then employs an unsupervised action segmenter leveraging a novel "Latent Action Energy" metric to discover and segment semantically coherent action primitives. The pipeline outputs both segmented video clips and their corresponding latent action sequences, providing structured data directly suitable for VLA pre-training. Evaluations on public benchmarks and a proprietary electric motor assembly dataset demonstrate effective segmentation of key tasks performed by humans at workstations. Further clustering and quantitative assessment via a Vision-Language Model confirm the semantic coherence of the discovered action primitives. To our knowledge, this is the first fully automated end-to-end system for extracting and organizing VLA pre-training data from unstructured industrial videos, offering a scalable solution for embodied AI integration in manufacturing.
[39] EvRainDrop: HyperGraph-guided Completion for Effective Frame and Event Stream Aggregation
Futian Wang, Fan Zhang, Xiao Wang, Mengqi Wang, Dexing Huang, Jin Tang
🧩 TL;DR
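This paper presents a hypergraph-guided spatio-temporal event stream completion mechanism that connects event tokens across different times and spatial locations via hypergraphs and completes sparse events through contextual message passing, addressing the undersampling caused by the spatial sparsity of event camera data. The method can flexibly incorporate RGB tokens for multi-modal completion and is validated on both single-label and multi-label event classification tasks.
📘 Detailed Summary
Motivation: Event cameras produce asynchronous event streams that are spatially sparse yet temporally dense. Mainstream event representation learning methods typically take event frames, voxels, or tensors as input. Although these methods have made notable progress, they struggle with the undersampling problem caused by spatial sparsity.
Method: The proposed hypergraph-guided completion mechanism connects event tokens across times and spatial locations via hypergraphs and uses contextual message passing to complete the sparse events. RGB tokens can be added as hypergraph nodes, enabling multi-modal hypergraph-based completion. Hypergraph node information from different time steps is then aggregated through self-attention for effective learning and fusion of multi-modal features.
Result: Extensive experiments on single-label and multi-label event classification tasks validate the effectiveness of the proposed framework.
Conclusion: The work offers an effective hypergraph-guided completion mechanism for event camera data that addresses the spatial sparsity of event streams through spatio-temporal context propagation and multi-modal feature fusion.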
📄 Abstract
Event cameras produce asynchronous event streams that are spatially sparse yet temporally dense. Mainstream event representation learning algorithms typically use event frames, voxels, or tensors as input. Although these approaches have achieved notable progress, they struggle to address the undersampling problem caused by spatial sparsity. In this paper, we propose a novel hypergraph-guided spatio-temporal event stream completion mechanism, which connects event tokens across different times and spatial locations via hypergraphs and leverages contextual information message passing to complete these sparse events. The proposed method can flexibly incorporate RGB tokens as nodes in the hypergraph within this completion framework, enabling multi-modal hypergraph-based information completion. Subsequently, we aggregate hypergraph node information across different time steps through self-attention, enabling effective learning and fusion of multi-modal features. Extensive experiments on both single- and multi-label event classification tasks fully validated the effectiveness of our proposed framework. The source code of this paper will be released on https://github.com/Event-AHU/EvRainDrop.
[40] Multimodal Robust Prompt Distillation for 3D Point Cloud Models
Xiang Gu, Liming Lu, Xu Zheng, Anan Du, Yongbin Zhou, Shuchao Pang
🧩 TL;DR
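This paper presents Multimodal Robust Prompt Distillation (MRPD), a framework that aligns a student point cloud model with robust embeddings from vision, 3D, and text encoders to defend efficiently against adversarial attacks. Distillation happens entirely during training, so inference incurs no additional computational cost.
📘 Detailed Summary
Motivation: Existing defenses for 3D point cloud models suffer from two main problems, high computational overhead and poor generalization across attack types, which severely limit their reliability in security-sensitive applications.
Method: MRPD aligns the student point cloud model's features with robust embeddings from three distinct teachers (a vision model processing depth projections, a high-performance 3D model, and a text encoder) and uses a confidence-gated mechanism to dynamically balance the contribution of all input modalities.
Result: Extensive experiments show that MRPD substantially outperforms state-of-the-art defense methods under white-box and black-box attacks, and even achieves better performance on clean data.
Conclusion: By efficiently harnessing multimodal knowledge, the work offers a new and practical paradigm for building robust 3D vision systems.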
📄 Abstract
Adversarial attacks pose a significant threat to learning-based 3D point cloud models, critically undermining their reliability in security-sensitive applications. Existing defense methods often suffer from (1) high computational overhead and (2) poor generalization ability across diverse attack types. To bridge these gaps, we propose a novel yet efficient teacher-student framework, namely Multimodal Robust Prompt Distillation (MRPD), for distilling robust 3D point cloud models. It learns lightweight prompts by aligning the student point cloud model's features with robust embeddings from three distinct teachers: a vision model processing depth projections, a high-performance 3D model, and a text encoder. To ensure reliable knowledge transfer, this distillation is guided by a confidence-gated mechanism that dynamically balances the contribution of all input modalities. Notably, since distillation takes place entirely during the training stage, there is no additional computational cost at inference. Extensive experiments demonstrate that MRPD substantially outperforms state-of-the-art defense methods against a wide range of white-box and black-box attacks, while even achieving better performance on clean data. Our work presents a new, practical paradigm for building robust 3D vision systems by efficiently harnessing multimodal knowledge.
[41] E-M3RF: An Equivariant Multimodal 3D Re-assembly Framework
Adeela Islam, Stefano Fiorini, Manuel Lecha, Theodore Tsesmelis, Stuart James, Pietro Morerio, Alessio Del Bue
🧩 TL;DR
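This paper presents E-M3RF, an equivariant multimodal 3D reassembly framework that combines geometric and color features with SE(3) flow matching to improve fragment reassembly accuracy, particularly when geometric information is insufficient or ambiguous.
📘 Detailed Summary
Motivation: Existing deep learning methods for 3D reassembly rely mainly on geometric features and perform poorly when geometry is insufficient, for example with small, eroded, or symmetric fragments. They also lack physical constraints that prevent overlapping assemblies, which limits their practical effectiveness.
Method: E-M3RF takes point clouds containing both positions and colors as input, extracts geometric features with a rotation-equivariant encoder and color features with a transformer, fuses them into a multimodal representation, and predicts the transformations required for reassembly using SE(3) flow matching.
Result: Experiments on four datasets show that, on the RePAIR dataset, E-M3RF reduces rotation error by 23.1%, translation error by 13.2%, and Chamfer Distance by 18.4% compared with competing methods. The evaluation covers both synthetic and real-world cultural heritage datasets.
Conclusion: Combining multimodal features with equivariant learning helps resolve geometric ambiguity, and SE(3) flow matching provides a strong transformation prediction framework for 3D reassembly, with potential value for applications such as cultural heritage preservation.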
📄 Abstract
3D reassembly is a fundamental geometric problem, and in recent years it has increasingly been tackled by deep learning methods rather than classical optimization. While learning approaches have shown promising results, most still rely primarily on geometric features to assemble a whole from its parts. As a result, methods struggle when geometry alone is insufficient or ambiguous, for example, for small, eroded, or symmetric fragments. Additionally, solutions do not impose physical constraints that explicitly prevent overlapping assemblies. To address these limitations, we introduce E-M3RF, an equivariant multimodal 3D reassembly framework that takes as input the point clouds, containing both point positions and colors of fractured fragments, and predicts the transformations required to reassemble them using SE(3) flow matching. Each fragment is represented by both geometric and color features: i) 3D point positions are encoded as rotation-consistent geometric features using a rotation-equivariant encoder, ii) the colors at each 3D point are encoded with a transformer. The two feature sets are then combined to form a multimodal representation. We experimented on four datasets: two synthetic datasets, Breaking Bad and Fantastic Breaks, and two real-world cultural heritage datasets, RePAIR and Presious, demonstrating that E-M3RF on the RePAIR dataset reduces rotation error by 23.1% and translation error by 13.2%, while Chamfer Distance decreases by 18.4% compared to competing methods.
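The abstract reports rotation error and Chamfer Distance without defining them. A minimal sketch of commonly used definitions follows: the geodesic angle between rotation matrices, and a symmetric Chamfer distance with squared Euclidean terms. Whether the paper uses exactly these variants is an assumption, and the helper `rot_z` exists only to build test rotations.

```python
import math

def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle between two 3x3 rotation matrices, in degrees."""
    # trace(R_pred^T @ R_gt) equals the sum of element-wise products.
    tr = sum(R_pred[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    cos_theta = max(-1.0, min(1.0, (tr - 1.0) / 2.0))  # clamp for safety
    return math.degrees(math.acos(cos_theta))

def chamfer_distance(P, Q):
    """Symmetric Chamfer distance using squared nearest-neighbour distances."""
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    def one_way(A, B):
        return sum(min(sq_dist(a, b) for b in B) for a in A) / len(A)

    return one_way(P, Q) + one_way(Q, P)

def rot_z(deg):
    """Rotation about the z-axis by `deg` degrees (test helper)."""
    t = math.radians(deg)
    c, s = math.cos(t), math.sin(t)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

angle = rotation_error_deg(rot_z(0.0), rot_z(30.0))
cd = chamfer_distance([(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)])
```

The brute-force nearest-neighbour search is quadratic in the number of points, which is fine for a sanity check but would normally be replaced by a spatial index for full fragments.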
[42] Video Generation Models Are Good Latent Reward Models
Xiaoyue Mi, Wenqing Yu, Jiesong Lian, Shibo Jie, Ruizhe Zhong, Zijun Liu, Guozhen Zhang, Zixiang Zhou, Zhiyong Xu, Yuan Zhou, Qinglin Lu, Fan Tang
🧩 TL;DR
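This paper presents Process Reward Feedback Learning (PRFL), a framework that performs preference optimization in latent space. By optimizing directly in the noisy latent space, it addresses the memory consumption and training efficiency problems of reward feedback learning for video generation and significantly improves alignment with human preferences.
📘 Detailed Summary
Motivation: Existing video reward models rely on vision-language models designed for pixel-space inputs, which confines ReFL optimization to near-complete denoising steps after computationally expensive VAE decoding. This pixel-space approach incurs substantial memory overhead and longer training time, and its late-stage optimization lacks early-stage supervision, refining only visual quality rather than fundamental motion dynamics and structural coherence.
Method: PRFL uses pre-trained video generation models for reward modeling in the noisy latent space. These models are designed to process noisy latent representations at arbitrary timesteps and inherently preserve temporal information through their sequential modeling capabilities, which enables gradient backpropagation through the full denoising chain without VAE decoding.
Result: Extensive experiments show that PRFL significantly improves alignment with human preferences while substantially reducing memory consumption and training time compared with RGB ReFL.
Conclusion: Pre-trained video generation models are naturally suited to reward modeling in the noisy latent space, and PRFL offers a more efficient approach to alignment learning for video generation.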
📄 Abstract
Reward feedback learning (ReFL) has proven effective for aligning image generation with human preferences. However, its extension to video generation faces significant challenges. Existing video reward models rely on vision-language models designed for pixel-space inputs, confining ReFL optimization to near-complete denoising steps after computationally expensive VAE decoding. This pixel-space approach incurs substantial memory overhead and increased training time, and its late-stage optimization lacks early-stage supervision, refining only visual quality rather than fundamental motion dynamics and structural coherence. In this work, we show that pre-trained video generation models are naturally suited for reward modeling in the noisy latent space, as they are explicitly designed to process noisy latent representations at arbitrary timesteps and inherently preserve temporal information through their sequential modeling capabilities. Accordingly, we propose Process Reward Feedback Learning (PRFL), a framework that conducts preference optimization entirely in latent space, enabling efficient gradient backpropagation throughout the full denoising chain without VAE decoding. Extensive experiments demonstrate that PRFL significantly improves alignment with human preferences, while achieving substantial reductions in memory consumption and training time compared to RGB ReFL.
[43] Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy
Teng Hu, Zhentao Yu, Guozhen Zhang, Zihan Su, Zhengguang Zhou, Youliang Zhang, Yuan Zhou, Qinglin Lu, Ran Yi
🧩 TL;DR
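This paper presents Harmony, a framework that improves synchronized audio-video generation by addressing three fundamental challenges of the joint diffusion process. It introduces Cross-Task Synergy training, a Global-Local Decoupled Interaction Module, and Synchronization-Enhanced CFG, setting a new state of the art in generation fidelity and fine-grained audio-visual synchronization.
📘 Detailed Summary
Motivation: Open-source models for audio-visual content synthesis struggle with robust audio-video alignment. The problem stems from three fundamental challenges of the joint diffusion process: correspondence drift, inefficient global attention mechanisms, and the intra-modal bias of conventional classifier-free guidance. These challenges hinder stable alignment learning and the capture of fine-grained temporal cues.
Method: Harmony has three key components: a Cross-Task Synergy training paradigm that mitigates correspondence drift using strong supervisory signals from audio-driven video and video-driven audio generation tasks; a Global-Local Decoupled Interaction Module for efficient and precise temporal-style alignment; and Synchronization-Enhanced CFG, which explicitly isolates and amplifies the alignment signal during inference.
Result: Extensive experiments show that Harmony significantly outperforms existing methods in both generation fidelity and fine-grained audio-visual synchronization, establishing a new state of the art.
Conclusion: Mechanistically enforcing audio-visual synchronization addresses the fundamental challenges of the joint diffusion process, and the results underline the value of modeling the alignment signal explicitly.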
📄 Abstract
The synthesis of synchronized audio-visual content is a key challenge in generative AI, with open-source models facing challenges in robust audio-video alignment. Our analysis reveals that this issue is rooted in three fundamental challenges of the joint diffusion process: (1) Correspondence Drift, where concurrently evolving noisy latents impede stable learning of alignment; (2) inefficient global attention mechanisms that fail to capture fine-grained temporal cues; and (3) the intra-modal bias of conventional Classifier-Free Guidance (CFG), which enhances conditionality but not cross-modal synchronization. To overcome these challenges, we introduce Harmony, a novel framework that mechanistically enforces audio-visual synchronization. We first propose a Cross-Task Synergy training paradigm to mitigate drift by leveraging strong supervisory signals from audio-driven video and video-driven audio generation tasks. Then, we design a Global-Local Decoupled Interaction Module for efficient and precise temporal-style alignment. Finally, we present a novel Synchronization-Enhanced CFG (SyncCFG) that explicitly isolates and amplifies the alignment signal during inference. Extensive experiments demonstrate that Harmony establishes a new state-of-the-art, significantly outperforming existing methods in both generation fidelity and, critically, in achieving fine-grained audio-visual synchronization.
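For context on challenge (3), conventional classifier-free guidance combines a conditional and an unconditional noise prediction as shown below. This is the standard CFG formulation only. The exact form of SyncCFG is not given in the abstract and is not reproduced here. The symbols are notation assumed for this note: $\epsilon_\theta$ is the denoiser, $x_t$ the noisy latent, $c$ the condition, $\varnothing$ the null condition, and $w$ the guidance scale.

```latex
\tilde{\epsilon}_\theta(x_t, c)
  = \epsilon_\theta(x_t, \varnothing)
  + w \,\bigl(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\bigr)
```

The guided direction amplifies whatever distinguishes the conditional prediction from the unconditional one. That strengthens adherence to the condition, but nothing in it specifically targets cross-modal timing, which is consistent with the bias the abstract describes.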
cs.CL
[44] Structured Definitions and Segmentations for Legal Reasoning in LLMs: A Study on Indian Legal Data
Mann Khatri, Mirza Yusuf, Rajiv Ratn Shah, Ponnurangam Kumaraguru
🧩 TL;DR
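By reorganizing legal documents according to rhetorical roles, defining legal terms, and emulating the step-by-step reasoning of courts, this study improves large language model performance on zero-shot legal judgment prediction, with F1 gains of roughly 1.5% to 4.36% across three Indian legal datasets.
📘 Detailed Summary
Motivation: Large language models show strong general reasoning but struggle in specialized areas such as law, mainly because they lack domain-specific pretraining. Legal documents are typically long and intricate, making it hard for models to process the full text efficiently. Prior work has focused on in-context approaches to close the knowledge gap but has not fully explored how structured information affects model decisions.
Method: The study analyzes model behavior on legal tasks along three experimental directions: reorganizing documents by rhetorical role to assess how structured information affects long-context processing and model decisions; defining rhetorical roles to familiarize the model with legal terminology; and emulating the step-by-step reasoning of courts regarding rhetorical roles to strengthen model reasoning. All experiments are run in a zero-shot setting on three Indian legal judgment prediction datasets.
Result: Organizing the data or explaining key legal terms significantly improves model performance, with F1 gains over the baseline ranging from a minimum of about 1.5% to a maximum of 4.36%.
Conclusion: Appropriate structuring of information and definitions of domain terminology can improve large language model performance on specialized legal tasks without full domain alignment, offering an efficient route for adapting general-purpose models to specialized domains.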
📄 Abstract
Large Language Models (LLMs), trained on extensive datasets from the web, exhibit remarkable general reasoning skills. Despite this, they often struggle in specialized areas like law, mainly because they lack domain-specific pretraining. The legal field presents unique challenges, as legal documents are generally long and intricate, making it hard for models to process the full text efficiently. Previous studies have examined in-context approaches to address the knowledge gap, boosting model performance in new domains without full domain alignment. In our paper, we analyze model behavior on legal tasks by conducting experiments in three areas: (i) reorganizing documents based on rhetorical roles to assess how structured information affects long context processing and model decisions, (ii) defining rhetorical roles to familiarize the model with legal terminology, and (iii) emulating the step-by-step reasoning of courts regarding rhetorical roles to enhance model reasoning. These experiments are conducted in a zero-shot setting across three Indian legal judgment prediction datasets. Our results reveal that organizing data or explaining key legal terms significantly boosts model performance, with a minimum increase of ~1.5% and a maximum improvement of 4.36% in F1 score compared to the baseline.
[45] Prompt Engineering Techniques for Context-dependent Text-to-SQL in Arabic
Saleh Almohaimeed, May Alsofyani, Saad Almohaimeed, Mansour Al Ghanim, Liqiang Wang
🧩 TL;DR
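This paper presents Ar-SParC, the first Arabic cross-domain, context-dependent text-to-SQL dataset, containing 3,450 sequences of interrelated questions for a total of 10,225 questions with their corresponding SQL queries. It evaluates GPT models with a range of prompt engineering techniques and introduces the GAT corrector, which improves Arabic SQL generation performance.
📘 Detailed Summary
Motivation: Research on text-to-SQL has concentrated on English, with some work in Chinese, while Arabic has no datasets or studies for this task. This gap limits the ability of Arabic-speaking users to converse with databases in natural language without knowing SQL.
Method: The authors build the Ar-SParC dataset and run 40 experiments with GPT-3.5-turbo and GPT-4.5-turbo, applying 10 prompt engineering techniques (four question representation methods and six in-context learning techniques). They also develop a novel approach called the GAT corrector to improve performance.
Result: The GAT corrector improves performance in all 40 experiments, with average gains of 1.9% in execution accuracy (EX) and 1.9% in interaction accuracy (IX) under zero-shot settings, and 1.72% EX and 0.92% IX under in-context learning settings. An ablation study explains its advantage over the earlier GAT verifier.
Conclusion: The work fills the gap in Arabic text-to-SQL research and shows that large language models are effective for Arabic SQL generation. The ablation study indicates that the GAT corrector is particularly well suited to Arabic.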
📄 Abstract
In recent years, the task of cross-domain, context-dependent text-to-SQL has received significant attention. This task enables users with no prior knowledge of SQL to have a conversation with databases using natural language. However, most of the available datasets and research have been conducted in English, along with some work in Chinese. To date, no effort has been made to address this task in the Arabic language. In this paper, we introduce Ar-SParC, the first Arabic cross-domain, context-dependent text-to-SQL dataset. The dataset consists of 3,450 sequences of interrelated questions, each sequence containing an average of approximately three questions, which results in a total of 10,225 questions along with their corresponding SQL queries. We conducted 40 experiments on the Ar-SParC dataset using two large language models, GPT-3.5-turbo and GPT-4.5-turbo, applying 10 different prompt engineering techniques, including four question representation methods and six in-context learning techniques. Furthermore, we developed a novel approach named GAT corrector, which enhanced the performance across all 40 experiments, yielding an average improvement of 1.9% in execution accuracy (EX) and 1.9% in interaction accuracy (IX) under zero-shot settings, and an average increase of 1.72% EX and 0.92% IX under in-context learning settings. Finally, we conducted an ablation study with two more experiments to explain why the GAT corrector outperformed the previous GAT verifier technique, particularly for the Arabic language.
[46] Winning with Less for Low Resource Languages: Advantage of Cross-Lingual English_Persian Argument Mining Model over LLM Augmentation
Ali Jahan, Masood Ghayoomi, Annette Hautli-Janisz
🧩 TL;DR
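This study proposes a cross-lingual approach to argument mining, constructing three training scenarios to tackle argument mining for low-resource languages. Experiments show that a lightweight cross-lingual blended model clearly outperforms resource-intensive augmentation.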
📘 Detailed Summary
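Motivation: Argument mining, a subfield of natural language processing, aims to identify argument components in a text and the relations between them. Existing methods mainly target high-resource languages, while low-resource languages suffer from data scarcity. This study explores cross-lingual approaches to argument mining, with particular attention to the data shortage in low-resource languages such as Persian.
Method: Three training scenarios are constructed: zero-shot transfer (training on English data only), augmented training with synthetic examples generated by large language models, and a cross-lingual model that combines the original English data with manually translated Persian sentences. Evaluation uses the English Microtext corpus and its parallel Persian translation.
Result: The zero-shot transfer model reaches F1 scores of 50.2% on the English test set and 50.7% on the Persian test set. The LLM-augmented model raises performance to 59.2% on English and 69.3% on Persian. The cross-lingual model performs best on the Persian test set, reaching an F1 of 74.8% and clearly outperforming the other methods.
Conclusion: A lightweight cross-lingual blend can considerably outperform resource-intensive augmentation pipelines, offering a practical path for argument mining to overcome data shortage in low-resource languages. The approach demonstrates both the effectiveness and the efficiency of cross-lingual transfer for argument mining.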
📄 Abstract
Argument mining is a subfield of natural language processing that identifies and extracts argument components, such as premises and conclusions, within a text and recognizes the relations between them. It reveals the logical structure of texts for use in tasks like knowledge extraction. This paper applies a cross-lingual approach to argument mining for low-resource languages by constructing three training scenarios. We examine the models on English, as a high-resource language, and Persian, as a low-resource language. To this end, we evaluate the models on the English Microtext corpus (Peldszus and Stede, 2015) and its parallel Persian translation. The learning scenarios are as follows: (i) zero-shot transfer, where the model is trained solely with the English data, (ii) English-only training enhanced by synthetic examples generated by Large Language Models (LLMs), and (iii) a cross-lingual model that combines the original English data with manually translated Persian sentences. The zero-shot transfer model attains F1 scores of 50.2% on the English test set and 50.7% on the Persian test set. The LLM-based augmentation model improves performance to 59.2% on English and 69.3% on Persian. The cross-lingual model, trained on both languages but evaluated solely on the Persian test set, surpasses the LLM-based variant, achieving an F1 of 74.8%. Results indicate that a lightweight cross-lingual blend can considerably outperform the more resource-intensive augmentation pipelines, and that it offers a practical pathway for argument mining to overcome data shortage in low-resource languages.
[47] Semantic Anchors in In-Context Learning: Why Small LLMs Cannot Flip Their Labels
Anantha Padmanaban Krishna Kumar
🧩 TL;DR
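By contrasting natural and inverted demonstrations, this study finds that in-context learning mainly adjusts how inputs project onto pre-trained semantic directions rather than remapping label meanings, supporting a semantic anchor view.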
📘 Detailed Summary
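Motivation: The study addresses a core question: can in-context learning override pre-trained label semantics, or does it only fine-tune on top of existing semantics? It examines how LLMs handle semantics in prompt-based learning.
Method: LLMs are treated as prompt-induced classifiers and compared under natural and inverted demonstrations. Three alignment metrics and a semantic override rate are introduced, and the evaluation covers eight classification tasks and eight open-source LLMs.
Result: With natural demonstrations, ICL improves accuracy while maintaining strong prior alignment. With inverted demonstrations, models cannot learn coherent anti-semantic classifiers, and the semantic override rate stays at zero across the 1-12B parameter range.
Conclusion: In-context learning mainly adjusts how inputs project onto stable semantic directions rather than flexibly remapping label meanings. This clarifies a fundamental limit of few-shot prompting and indicates that overriding label semantics requires interventions beyond ICL.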
📄 Abstract
Can in-context learning (ICL) override pre-trained label semantics, or does it merely refine an existing semantic backbone? We address this question by treating LLMs as prompt-induced classifiers and contrasting their behavior under natural demonstrations (with correct labels) and inverted demonstrations (systematically flipping label meanings). We decompose ICL behavior into three alignment metrics (truth, prior, and prompt alignment) and introduce a semantic override rate, defined as correctness under flipped semantics. Across eight classification tasks and eight open-source LLMs (1-12B parameters), we find consistent evidence for a semantic anchor view. With natural demonstrations, ICL improves accuracy while maintaining strong prior alignment; most correct predictions coincide with zero-shot behavior, even when the prior is weak. With inverted demonstrations, models cannot learn coherent anti-semantic classifiers: prompt alignment increases only by sacrificing accuracy, and semantic override rates remain exactly zero in our few-shot 1-12B setting. Rather than flexibly remapping label meanings, ICL primarily adjusts how inputs project onto stable semantic directions learned during pre-training, clarifying fundamental limits of few-shot prompting and suggesting that overriding label semantics at these scales requires interventions beyond ICL. All code is available at: https://github.com/AnanthaPadmanaban-KrishnaKumar/semantic-anchors-icl.
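The abstract names three alignment metrics but does not spell out their formulas. One plausible reading, sketched below as an assumption rather than the paper's definition, is plain agreement rates: truth alignment compares ICL predictions with gold labels, prior alignment compares them with zero-shot predictions, and prompt alignment compares them with the labels implied by the demonstrations (the flipped labels in the inverted setting). The label set, the `FLIP` mapping, and the data are hypothetical. The semantic override rate is left out because the abstract gives too little detail to pin it down.

```python
from typing import List, Sequence


def agreement(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of positions where two label sequences agree."""
    if len(a) != len(b):
        raise ValueError("label sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a) if a else 0.0


# Hypothetical binary sentiment task with inverted demonstrations.
FLIP = {"positive": "negative", "negative": "positive"}

gold: List[str] = ["positive", "negative", "positive", "negative"]
zero_shot: List[str] = ["positive", "negative", "negative", "negative"]
icl_inverted: List[str] = ["positive", "positive", "negative", "negative"]

# Labels the inverted demonstrations would assign to each input.
prompt_target = [FLIP[y] for y in gold]

truth_alignment = agreement(icl_inverted, gold)
prior_alignment = agreement(icl_inverted, zero_shot)
prompt_alignment = agreement(icl_inverted, prompt_target)

print(truth_alignment, prior_alignment, prompt_alignment)
```

In a binary task with fully flipped labels, truth and prompt alignment computed this way sum to one, which matches the abstract's observation that prompt alignment rises only at the cost of accuracy.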
cs.AI [Back]
[48] AssurAI: Experience with Constructing Korean Socio-cultural Datasets to Discover Potential Risks of Generative AI
Chae-Gyun Lim, Seung-Ho Han, EunYoung Byun, Jeongyun Han, Soohyun Cho, Eojin Joo, Heehyeon Kim, Sieun Kim, Juhoon Lee, Hyunsoo Lee, Dongkun Lee, Jonghwan Hyeon, Yechan Hwang, Young-Jun Lee, Kyeongryul Lee, Minhyeong An, Hyunjun Ahn, Jeongwoo Son, Junho Park, Donggyu Yoon, Taehyung Kim, Jeemin Kim, Dasom Choi, Kwangyoung Lee, Hyunseung Lim, Yeohyun Jung, Jongok Hong, Sooyohn Nam, Joonyoung Park, Sungmin Na, Yubin Choi, Jeanne Choi, Yoojin Hong, Sueun Jang, Youngseok Seo, Somin Park, Seoungung Jo, Wonhye Chae, Yeeun Jo, Eunyoung Kim, Joyce Jiyoung Whang, HwaJung Hong, Joseph Seering, Uichin Lee, Juho Kim, Sunna Choi, Seokyeon Ko, Taeho Kim, Kyunghoon Kim, Myungsik Ha, So Jung Lee, Jemin Hwang, JoonHo Kwak, Ho-Jin Choi
🧩 TL;DR
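This paper presents AssurAI, a quality-controlled Korean multimodal dataset for evaluating the safety of generative AI. The dataset contains 11,480 instances across four modalities (text, image, video, and audio) and targets AI risk factors specific to the Korean socio-cultural context.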
📘 Detailed Summary
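Motivation: Current safety evaluation datasets are predominantly English-centric and fail to capture risks specific to non-English socio-cultural contexts such as Korean; they are also usually limited to text. These limitations hinder the safe and reliable development of generative AI in culturally diverse settings.
Method: A multidisciplinary expert group first defines a taxonomy of 35 distinct AI risk factors covering both universal harms and relevance to the Korean socio-cultural context. The dataset is then built in two phases (expert-led seeding and crowdsourced scaling), combined with triple independent annotation and an iterative expert red-teaming loop to ensure data integrity.
Result: The resulting AssurAI dataset contains 11,480 multimodal instances spanning text, image, video, and audio. A pilot study validates its effectiveness for assessing the safety of recent large language models, providing a reliable benchmark for AI safety evaluation in the Korean community.
Conclusion: AssurAI fills a gap in non-English multimodal AI safety evaluation and supports the development of safer, more reliable generative AI systems for the Korean community. Its multidisciplinary approach, strict quality control, and targeted risk taxonomy offer a reusable framework for AI safety evaluation in other languages and cultures.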
📄 Abstract
The rapid evolution of generative AI necessitates robust safety evaluations. However, current safety datasets are predominantly English-centric, failing to capture specific risks in non-English, socio-cultural contexts such as Korean, and are often limited to the text modality. To address this gap, we introduce AssurAI, a new quality-controlled Korean multimodal dataset for evaluating the safety of generative AI. First, we define a taxonomy of 35 distinct AI risk factors, adapted from established frameworks by a multidisciplinary expert group to cover both universal harms and relevance to the Korean socio-cultural context. Second, leveraging this taxonomy, we construct and release AssurAI, a large-scale Korean multimodal dataset comprising 11,480 instances across text, image, video, and audio. Third, we apply a rigorous quality control process to ensure data integrity, featuring a two-phase construction (i.e., expert-led seeding and crowdsourced scaling), triple independent annotation, and an iterative expert red-teaming loop. Our pilot study validates AssurAI's effectiveness in assessing the safety of recent LLMs. We release AssurAI to the public to facilitate the development of safer and more reliable generative AI systems for the Korean community.
[49] Cross Domain Evaluation of Multimodal Chain-of-Thought Reasoning of different datasets into the Amazon CoT Framework
Nitya Tiwari, Parv Maheshwari, Vidisha Agarwal
🧩 TL;DR
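This study presents a comprehensive analysis of the cross-domain generalization of multimodal chain-of-thought reasoning. It finds that integrating vision features significantly reduces hallucination during rationale generation, but that the effectiveness of chain-of-thought reasoning varies substantially across question types, with commonsense reasoning tasks proving particularly challenging.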
📘 Detailed Summary
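Motivation: Prior work has extended chain-of-thought to multimodal settings and achieved state-of-the-art results on science question answering benchmarks, but how well these methods generalize across diverse domains remains underexplored. This study evaluates multimodal chain-of-thought reasoning on the A-OKVQA, OKVQA, and ChartQA datasets, which require broad commonsense and world knowledge.
Method: The authors implement the two-stage framework proposed by Zhang et al., which separates rationale generation from answer inference and integrates vision features with T5-based language models through a gated fusion mechanism. Systematic ablation studies analyze the contributions of vision features, rationale quality, and architectural choices.
Result: Vision feature integration significantly reduces hallucination in rationale generation, but the effectiveness of chain-of-thought reasoning differs widely across question types, with commonsense reasoning tasks posing particular challenges. The experiments assess the method on multiple benchmark datasets.
Conclusion: The work offers practical insights for researchers implementing multimodal reasoning systems and identifies key directions for improving cross-domain generalization, highlighting the value of tailoring reasoning strategies to different question types.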
📄 Abstract
While recent work has extended CoT to multimodal settings, achieving state-of-the-art results on science question answering benchmarks like ScienceQA, the generalizability of these approaches across diverse domains remains underexplored. This work presents a comprehensive analysis of Multimodal Chain-of-Thought (Multimodal-CoT) reasoning, evaluating its effectiveness on the A-OKVQA, OKVQA and ChartQA datasets, which require broad commonsense and world knowledge beyond scientific reasoning. We implement the two-stage framework proposed by Zhang et al. [3], which separates rationale generation from answer inference and integrates vision features through a gated fusion mechanism with T5-based language models. Through systematic ablation studies, we analyze the contributions of vision features, rationale quality, and architectural choices. Our findings reveal that while vision integration significantly reduces hallucination in rationale generation, the effectiveness of CoT reasoning varies substantially across question types, with commonsense reasoning presenting particular challenges. This work provides practical insights for researchers implementing multimodal reasoning systems and identifies key areas for future improvement in cross-domain generalization.
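The abstract mentions a gated fusion mechanism without giving its form. As a minimal sketch, assuming a generic sigmoid gate that mixes a language vector with an already aligned vision vector of the same size, the fusion could look like the code below. The weight matrices `W_l` and `W_v`, the helper names, and the toy vectors are hypothetical, and the exact formulation in Zhang et al. may differ.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def gated_fusion(h_lang: Vector, h_vis: Vector, W_l: Matrix, W_v: Matrix) -> Vector:
    """Blend language and vision features with a learned element-wise gate.

    gate  = sigmoid(W_l @ h_lang + W_v @ h_vis)
    fused = (1 - gate) * h_lang + gate * h_vis
    """
    a = matvec(W_l, h_lang)
    b = matvec(W_v, h_vis)
    gate = [sigmoid(x + y) for x, y in zip(a, b)]
    return [(1.0 - g) * l + g * v for g, l, v in zip(gate, h_lang, h_vis)]


# Toy example: zero weights give a gate of 0.5, i.e. a plain average.
h_lang = [1.0, 0.0]
h_vis = [0.0, 1.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]
fused = gated_fusion(h_lang, h_vis, zeros, zeros)
print(fused)
```

The gate lets the model decide per dimension how much visual evidence to admit, so a gate near zero falls back to the text-only representation.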
[50] OpenApps: Simulating Environment Variations to Measure UI-Agent Reliability
Karen Ullrich, Jingtong Su, Claudia Shi, Arjun Subramonian, Amir Bar, Ivan Evtimov, Nikolaos Tsilivis, Randall Balestriero, Julia Kempe, Mark Ibrahim
🧩 TL;DR
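This paper presents OpenApps, a lightweight open-source ecosystem for evaluating the reliability of multimodal UI agents across app variations. Agent performance is relatively stable within a fixed app environment, but reliability fluctuates markedly across app variations, with task success rates varying by more than 50%.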
📘 Detailed Summary
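Motivation: Current evaluations of autonomous UI agents rely on fixed environments, usually clones of existing apps, which only reveal whether an agent can complete a task in one specific environment. In deployment, agents may encounter changes in app design and content that affect their ability to complete tasks, and existing evaluations cannot measure reliability across such app variations.
Method: The authors develop OpenApps, a lightweight open-source ecosystem with six apps (messenger, calendar, maps, etc.) that are configurable in both content and appearance. It runs on a single CPU and makes it easy to generate and deploy thousands of versions of each app. More than 10,000 independent evaluations are run to study the reliability of seven leading multimodal agents.
Result: Standard reliability within a fixed app is relatively stable, but reliability measured across app variations fluctuates markedly. Task success rates for many agents fluctuate by more than 50% across app variations; for example, Kimi-VL-3B's average success across all tasks ranges from 63% down to just 4%. Agent behaviors such as looping or hallucinating actions also differ drastically with the environment configuration.
Conclusion: These initial findings underline the importance of measuring reliability along the new dimension of app variations. OpenApps provides a practical tool for assessing the robustness of UI agents in diverse environments and exposes the fragility of current multimodal agents when app design and content change, pointing to directions for future work on agent reliability.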
📄 Abstract
Reliability is key to realizing the promise of autonomous UI-Agents, multimodal agents that directly interact with apps in the same manner as humans, as users must be able to trust an agent to complete a given task. Current evaluations rely on fixed environments, often clones of existing apps, which are limited in that they can only shed light on whether or how often an agent can complete a task within a specific environment. When deployed, however, agents are likely to encounter variations in app design and content that can affect an agent's ability to complete a task. To address this blind spot of measuring agent reliability across app variations, we develop OpenApps, a light-weight open-source ecosystem with six apps (messenger, calendar, maps, etc.) that are configurable in appearance and content. OpenApps requires just a single CPU to run, enabling easy generation and deployment of thousands of versions of each app. Specifically, we run more than 10,000 independent evaluations to study reliability across seven leading multimodal agents. We find that while standard reliability within a fixed app is relatively stable, reliability can vary drastically when measured across app variations. Task success rates for many agents can fluctuate by more than 50% across app variations. For example, Kimi-VL-3B's average success across all tasks fluctuates from 63% to just 4% across app versions. We also find agent behaviors such as looping or hallucinating actions can differ drastically depending on the environment configuration. These initial findings highlight the importance of measuring reliability along this new dimension of app variations. OpenApps is available at https://facebookresearch.github.io/OpenApps/
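To make the idea of reliability across app variations concrete, the sketch below computes a per-variant success rate and the spread between the best and worst variant. It does not use the OpenApps API; the variant names and outcome lists are made-up illustration data, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def success_rate(outcomes: List[bool]) -> float:
    """Fraction of successful runs for one app variant."""
    if not outcomes:
        raise ValueError("need at least one run per variant")
    return sum(outcomes) / len(outcomes)


def reliability_spread(runs: Dict[str, List[bool]]) -> Tuple[float, float, float]:
    """Return (lowest rate, highest rate, spread) across app variants."""
    rates = [success_rate(outcomes) for outcomes in runs.values()]
    return min(rates), max(rates), max(rates) - min(rates)


# Hypothetical task outcomes for one agent on three variants of the same app.
runs_by_variant = {
    "default_theme": [True, True, True, False],
    "dark_theme": [True, False, False, False],
    "dense_content": [True, True, False, False],
}

low, high, spread = reliability_spread(runs_by_variant)
print(f"success ranges from {low:.0%} to {high:.0%} (spread {spread:.0%})")
```

Reporting only the rate on a single fixed variant would hide this spread, which is the blind spot the abstract describes.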
[51] ENACT: Evaluating Embodied Cognition with World Modeling of Egocentric Interaction
Qineng Wang, Wenlong Huang, Yu Zhou, Hang Yin, Tianwei Bao, Jianwen Lyu, Weiyu Liu, Ruohan Zhang, Jiajun Wu, Li Fei-Fei, Manling Li
🧩 TL;DR
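This study introduces the ENACT benchmark, which casts the evaluation of embodied cognition as world modeling from egocentric interaction in a visual question answering format, and reveals a significant gap between frontier vision-language models and humans in embodied cognition.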
📘 Detailed Summary
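Motivation: Embodied cognition holds that intelligence arises from sensorimotor interaction rather than passive observation, yet modern vision-language models are trained largely in a disembodied manner. The study asks whether these models show signs of embodied cognition, addressing a gap in existing evaluation methods.
Method: ENACT is framed as a partially observable Markov decision process whose actions are scene graph changes. It comprises two complementary sequence reordering tasks: forward world modeling (reorder shuffled observations given actions) and inverse world modeling (reorder shuffled actions given observations). A pipeline built on the BEHAVIOR robotics simulation generates 8,972 QA pairs to evaluate models on long-horizon, home-scale activities.
Result: Experiments show a performance gap between frontier vision-language models and humans that widens as the interaction horizon grows. Models perform better on the inverse task than on the forward task and exhibit anthropocentric biases, including a preference for right-handed actions and degraded performance when camera intrinsics or viewpoints deviate from human vision.
Conclusion: Current vision-language models show systematic shortcomings in embodied cognition and are susceptible to anthropocentric biases. ENACT provides a benchmark and direction for developing more robust embodied intelligence, and underscores the need to evaluate the cognitive limits of models trained in a disembodied paradigm.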
📄 Abstract
Embodied cognition argues that intelligence arises from sensorimotor interaction rather than passive observation. It raises an intriguing question: do modern vision-language models (VLMs), trained largely in a disembodied manner, exhibit signs of embodied cognition? We introduce ENACT, a benchmark that casts evaluation of embodied cognition as world modeling from egocentric interaction in a visual question answering (VQA) format. Framed as a partially observable Markov decision process (POMDP) whose actions are scene graph changes, ENACT comprises two complementary sequence reordering tasks: forward world modeling (reorder shuffled observations given actions) and inverse world modeling (reorder shuffled actions given observations). While conceptually simple, solving these tasks implicitly demands capabilities central to embodied cognition: affordance recognition, action-effect reasoning, embodied awareness, and interactive, long-horizon memory from partially observable egocentric input, while avoiding low-level image synthesis that could confound the evaluation. We provide a scalable pipeline that synthesizes QA pairs from robotics simulation (BEHAVIOR) and evaluates models on 8,972 QA pairs spanning long-horizon home-scale activities. Experiments reveal a performance gap between frontier VLMs and humans that widens with interaction horizon. Models consistently perform better on the inverse task than the forward one and exhibit anthropocentric biases, including a preference for right-handed actions and degradation when camera intrinsics or viewpoints deviate from human vision. Website at https://enact-embodied-cognition.github.io/.
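The abstract does not say how a predicted ordering is scored in the reordering tasks. As an illustration only, the sketch below scores a predicted sequence against the gold sequence in two assumed ways: exact match of the whole ordering, and the fraction of item pairs placed in the correct relative order. The function names and the item identifiers are hypothetical and are not taken from ENACT.

```python
from itertools import combinations
from typing import List


def exact_match(pred: List[str], gold: List[str]) -> bool:
    """True only if the whole predicted ordering equals the gold ordering."""
    return pred == gold


def pairwise_accuracy(pred: List[str], gold: List[str]) -> float:
    """Fraction of item pairs whose relative order matches the gold sequence."""
    if sorted(pred) != sorted(gold):
        raise ValueError("prediction must be a permutation of the gold items")
    if len(gold) < 2:
        return 1.0
    position = {item: i for i, item in enumerate(pred)}
    pairs = list(combinations(gold, 2))  # (earlier, later) in gold order
    correct = sum(position[a] < position[b] for a, b in pairs)
    return correct / len(pairs)


# Hypothetical forward world modeling item: reorder four shuffled observations.
gold_order = ["obs_0", "obs_1", "obs_2", "obs_3"]
pred_order = ["obs_0", "obs_2", "obs_1", "obs_3"]

print(exact_match(pred_order, gold_order))
print(pairwise_accuracy(pred_order, gold_order))
```

A pairwise score gives partial credit for nearly correct orderings, which becomes more informative as the interaction horizon, and hence the sequence length, grows.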
[52] OVOD-Agent: A Markov-Bandit Framework for Proactive Visual Reasoning and Self-Evolving Detection
Chujie Wang, Jianyu Lu, Zhiyuan Luo, Xi Chen, Chu He
🧩 TL;DR
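This paper proposes OVOD-Agent, a framework that turns open-vocabulary object detection from passive category matching into proactive visual reasoning and self-evolving detection, improving detection performance through a visual chain-of-thought and a weakly Markovian decision process.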
📘 Detailed Summary
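Motivation: Existing open-vocabulary object detection methods are pretrained on multimodal datasets, yet their inference is still limited to fixed category names. This creates a gap between multimodal training and unimodal inference, and the textual representation space remains underexplored.
Method: OVOD-Agent adopts a visual chain-of-thought paradigm that extends textual optimization into interpretable visual reasoning. Visual context transitions are modeled as a weakly Markovian decision process, a Bandit module generates exploration signals, and Markov transition matrices are combined with Bandit trajectories for self-supervised reward model optimization.
Result: Experiments on COCO and LVIS show that OVOD-Agent delivers consistent improvements across OVOD backbones, particularly on rare categories, confirming the effectiveness of the framework.
Conclusion: The study shows that shifting from passive detection to a proactive reasoning paradigm is effective. The visual chain-of-thought and weakly Markovian decision process offer a new approach to open-vocabulary object detection, with strong potential for rare categories in particular.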
📄 Abstract
Open-Vocabulary Object Detection (OVOD) aims to enable detectors to generalize across categories by leveraging semantic information. Although existing methods are pretrained on large vision-language datasets, their inference is still limited to fixed category names, creating a gap between multimodal training and unimodal inference. Previous work has shown that improving textual representation can significantly enhance OVOD performance, indicating that the textual space is still underexplored. To this end, we propose OVOD-Agent, which transforms passive category matching into proactive visual reasoning and self-evolving detection. Inspired by the Chain-of-Thought (CoT) paradigm, OVOD-Agent extends the textual optimization process into an interpretable Visual-CoT with explicit actions. OVOD's lightweight nature makes LLM-based management unsuitable; instead, we model visual context transitions as a Weakly Markovian Decision Process (w-MDP) over eight state spaces, which naturally represents the agent's state, memory, and interaction dynamics. A Bandit module generates exploration signals under limited supervision, helping the agent focus on uncertain regions and adapt its detection policy. We further integrate Markov transition matrices with Bandit trajectories for self-supervised Reward Model (RM) optimization, forming a closed loop from Bandit exploration to RM learning. Experiments on COCO and LVIS show that OVOD-Agent provides consistent improvements across OVOD backbones, particularly on rare categories, confirming the effectiveness of the proposed framework.
[53] EWE: An Agentic Framework for Extreme Weather Analysis
Zhe Jiang, Jiong Wang, Xiaoyu Yue, Zijie Guo, Wenlong Zhang, Fenghua Ling, Wanli Ouyang, Lei Bai
🧩 TL;DR
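This study presents EWE, the first intelligent agent framework for diagnosing extreme weather. Using knowledge-guided planning, closed-loop reasoning, and a meteorology-specific toolkit, it performs automated diagnostic analysis from raw meteorological data to multimodal visualizations, addressing the automation bottleneck in analyzing the physical mechanisms of extreme weather.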
📘 Detailed Summary
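Motivation: Extreme weather events pose escalating risks to global society, but the prevailing expert-driven, labor-intensive diagnostic paradigm has created a critical analytical bottleneck that stalls scientific progress. While AI for Earth science has advanced notably in prediction, the equally important challenge of automated diagnostic reasoning remains largely unexplored.
Method: EWE emulates expert workflows through knowledge-guided planning, closed-loop reasoning, and a domain-tailored meteorological toolkit. It autonomously produces and interprets multimodal visualizations from raw meteorological data to deliver comprehensive diagnostic analyses.
Result: The study introduces the first benchmark for this emerging field, comprising a curated dataset of 103 high-impact events and a novel step-wise evaluation metric, providing a standardized evaluation framework for automated extreme weather diagnosis.
Conclusion: EWE marks a step toward automated scientific discovery and has the potential to democratize expertise and intellectual resources, particularly for developing nations vulnerable to extreme weather, advancing a shift in how the physical mechanisms of extreme weather are analyzed.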
📄 Abstract
Extreme weather events pose escalating risks to global society, underscoring the urgent need to unravel their underlying physical mechanisms. Yet the prevailing expert-driven, labor-intensive diagnostic paradigm has created a critical analytical bottleneck, stalling scientific progress. While AI for Earth Science has achieved notable advances in prediction, the equally essential challenge of automated diagnostic reasoning remains largely unexplored. We present the Extreme Weather Expert (EWE), the first intelligent agent framework dedicated to this task. EWE emulates expert workflows through knowledge-guided planning, closed-loop reasoning, and a domain-tailored meteorological toolkit. It autonomously produces and interprets multimodal visualizations from raw meteorological data, enabling comprehensive diagnostic analyses. To catalyze progress, we introduce the first benchmark for this emerging field, comprising a curated dataset of 103 high-impact events and a novel step-wise evaluation metric. EWE marks a step toward automated scientific discovery and offers the potential to democratize expertise and intellectual resources, particularly for developing nations vulnerable to extreme weather.
[54] SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition
Peiran Xu, Sudong Wang, Yao Zhu, Jianing Li, Yunjian Zhang
🧩 TL;DR
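This paper proposes a hierarchical spatial cognition framework and the SpatialBench benchmark, providing the first systematic evaluation of spatial reasoning in multimodal large language models. It shows that models do well at the perceptual level but remain limited in symbolic reasoning and planning.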
📘 Detailed Summary
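Motivation: Existing benchmarks often oversimplify spatial cognition into a single-dimensional metric, failing to capture the hierarchical structure and interdependence of spatial abilities. This limits the progress of multimodal large language models in interacting with real-world environments.
Method: The authors propose a hierarchical spatial cognition framework that decomposes spatial intelligence into five progressively complex levels, from basic observation to high-level planning. They build SpatialBench, a large-scale, fine-grained benchmark with 15 tasks aligned with these levels, and introduce a unified, high-level, capability-oriented evaluation metric.
Result: Extensive experiments on multimodal large language models show clear performance stratification across cognitive levels: models are strong in perceptual grounding but remain limited in symbolic reasoning, causal inference, and planning. Human tests show that humans perform selective, goal-directed abstraction, whereas multimodal large language models tend to over-attend to surface details without coherent spatial intent.
Conclusion: The study establishes the first systematic framework for measuring hierarchical spatial cognition in multimodal large language models, lays a foundation for future spatially intelligent systems, and exposes key gaps in current models' high-level spatial reasoning.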
📄 Abstract
Spatial cognition is fundamental to real-world multimodal intelligence, allowing models to effectively interact with the physical environment. While multimodal large language models (MLLMs) have made significant strides, existing benchmarks often oversimplify spatial cognition, reducing it to a single-dimensional metric, which fails to capture the hierarchical structure and interdependence of spatial abilities. To address this gap, we propose a hierarchical spatial cognition framework that decomposes spatial intelligence into five progressively complex levels from basic observation to high-level planning. Building upon this taxonomy, we construct SpatialBench, a large-scale, fine-grained benchmark covering 15 tasks aligned with these cognitive levels. To provide a unified evaluation across heterogeneous tasks, we further introduce a high-level capability-oriented metric that reliably assesses a model's overall spatial reasoning ability. Extensive experiments over massive MLLMs reveal distinct performance stratification across cognitive levels: models exhibit strong perceptual grounding yet remain limited in symbolic reasoning, causal inference, and planning. Additional human tests demonstrate that humans perform selective, goal-directed abstraction, while MLLMs tend to over-attend to surface details without coherent spatial intent. Our work establishes the first systematic framework for measuring hierarchical spatial cognition in MLLMs, laying the foundation for future spatially intelligent systems.