cs.CV
[1] HyperCLOVA X 32B Think
NAVER Cloud HyperCLOVA X Team
🧩 TL;DR
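This report presents HyperCLOVA X 32B Think, a vision-language model designed specifically for the Korean linguistic and cultural context, with particular emphasis on reasoning and agentic ability. It is open-sourced to encourage further research in academia and industry.
📘 Detailed Summary
Motivation: The work addresses the limited reasoning ability of existing models in the Korean linguistic and cultural context, and the lack of vision-language models with agentic support tailored to Korean.
Method: Training has two stages: pre-training with a strong focus on reasoning, followed by post-training to support multimodal understanding, enhanced reasoning, agentic behaviors, and alignment with human preferences, all optimized for the Korean linguistic and cultural context.
Result: Evaluations against comparably sized models show strong performance on Korean text-to-text and vision-to-text benchmarks, as well as on agent-oriented evaluation tasks.
Conclusion: The work shows that tailoring a vision-language model to a specific linguistic and cultural context is feasible. Open-sourcing the model supports broader adoption and gives academia and industry a resource for research on multimodal reasoning and agentic ability.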
📄 Abstract
In this report, we present HyperCLOVA X 32B Think, a vision-language model designed with particular emphasis on reasoning within the Korean linguistic and cultural context, as well as agentic ability. HyperCLOVA X 32B Think is pre-trained with a strong focus on reasoning capabilities and subsequently post-trained to support multimodal understanding, enhanced reasoning, agentic behaviors, and alignment with human preferences. Experimental evaluations against comparably sized models demonstrate that our model achieves strong performance on Korean text-to-text and vision-to-text benchmarks, as well as on agent-oriented evaluation tasks. By open-sourcing HyperCLOVA X 32B Think, we aim to support broader adoption and facilitate further research and innovation across both academic and industrial communities.
[2] VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models
Jianke Zhang, Xiaoyu Chen, Qiuyue Wang, Mingsheng Li, Yanjiang Guo, Yucheng Hu, Jiajun Zhang, Shuai Bai, Junyang Lin, Jianyu Chen
🧩 TL;DR
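This paper systematically studies how the choice and competence of a vision-language model (VLM) affect downstream vision-language-action (VLA) policy performance. It challenges the common assumption that a VLM's general capabilities predict downstream control performance, identifies the visual module as the main bottleneck, and shows that injecting control-relevant supervision improves performance.
📘 Detailed Summary
Motivation: The study addresses a fundamental but rarely systematically examined question: how do VLM choice and competence translate into the performance of downstream VLA policies? The relationship between a VLM's general capabilities and its performance on concrete embodied control tasks is poorly understood, so the authors ask whether standard VLM competence is sufficient for effective embodied control.
Method: The authors propose VLM4VLA, a minimal adaptation pipeline that converts general-purpose VLMs into VLA policies using only a small set of new learnable parameters, enabling fair and efficient comparison. They run extensive empirical studies on multiple downstream tasks across three benchmarks, probe the effect of specific embodied capabilities by fine-tuning VLMs on seven auxiliary embodied tasks, and perform modality-level ablations to locate the performance bottleneck.
Result: VLM initialization consistently beats training from scratch, yet a VLM's general capabilities are poor predictors of downstream task performance. Surprisingly, improving a VLM on specific embodied skills does not guarantee better downstream control. Modality-level ablations identify the visual module, rather than the language component, as the primary bottleneck; injecting control-relevant supervision into the vision encoder yields consistent gains even when the encoder stays frozen during downstream fine-tuning.
Conclusion: The study challenges the common assumption that general VLM capability predicts downstream control performance and indicates that standard VLM competence is necessary but insufficient for effective embodied control. It exposes a persistent domain gap between current VLM pretraining objectives and the requirements of embodied action planning, shows that injecting control-relevant supervision into the vision encoder mitigates this gap, and offers guidance for future VLA model design.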
📄 Abstract
Vision-Language-Action (VLA) models, which integrate pretrained large Vision-Language Models (VLMs) into their policy backbone, are gaining significant attention for their promising generalization capabilities. This paper revisits a fundamental yet seldom systematically studied question: how do VLM choice and competence translate to downstream VLA policy performance? We introduce VLM4VLA, a minimal adaptation pipeline that converts general-purpose VLMs into VLA policies using only a small set of new learnable parameters for fair and efficient comparison. Despite its simplicity, VLM4VLA proves surprisingly competitive with more sophisticated network designs. Through extensive empirical studies on various downstream tasks across three benchmarks, we find that while VLM initialization offers a consistent benefit over training from scratch, a VLM's general capabilities are poor predictors of its downstream task performance. This challenges common assumptions, indicating that standard VLM competence is necessary but insufficient for effective embodied control. We further investigate the impact of specific embodied capabilities by fine-tuning VLMs on seven auxiliary embodied tasks (e.g., embodied QA, visual pointing, depth estimation). Contrary to intuition, improving a VLM's performance on specific embodied skills does not guarantee better downstream control performance. Finally, modality-level ablations identify the visual module in the VLM, rather than the language component, as the primary performance bottleneck. We demonstrate that injecting control-relevant supervision into the vision encoder of the VLM yields consistent gains, even when the encoder remains frozen during downstream fine-tuning. This isolates a persistent domain gap between current VLM pretraining objectives and the requirements of embodied action-planning.
[3] MMErroR: A Benchmark for Erroneous Reasoning in Vision-Language Models
Yang Shi, Yifeng Xie, Minzhe Guo, Liangsi Lu, Mingxuan Huang, Jingchao Wang, Zhihong Zhu, Boyan Xu, Zhiqi Huang
🧩 TL;DR
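This paper presents MMErroR, a multimodal benchmark of 2,013 samples built to evaluate whether vision-language models can detect and classify reasoning errors. It shows that even advanced models still struggle to identify erroneous reasoning.
📘 Detailed Summary
Motivation: As vision-language models improve at multimodal learning, the question arises whether they truly understand the content they process, in particular whether they can detect errors in a reasoning process and identify the error type. Existing benchmarks focus mainly on answer correctness rather than process-level, error-centric evaluation.
Method: The authors build MMErroR, a multimodal benchmark of 2,013 samples, each embedding a single coherent reasoning error and together spanning 24 subdomains across six top-level domains. Evaluation is process-level and error-centric: models must detect incorrect reasoning and classify the error type within both visual and linguistic contexts.
Result: Twenty advanced vision-language models were evaluated. The best model, Gemini-3.0-Pro, classifies the error correctly in only 66.47% of cases, which shows how challenging it is to identify erroneous reasoning. The benchmark's broad coverage and taxonomic richness support a comprehensive and representative evaluation.
Conclusion: The study exposes the limits of current vision-language models in detecting and classifying reasoning errors. Error-identification ability offers useful insight into the real capabilities of multimodal reasoning models, and the benchmark provides a tool and a direction for future model evaluation and development.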
📄 Abstract
Recent advances in Vision-Language Models (VLMs) have improved performance in multi-modal learning, raising the question of whether these models truly understand the content they process. Crucially, can VLMs detect when a reasoning process is wrong and identify its error type? To answer this, we present MMErroR, a multi-modal benchmark of 2,013 samples, each embedding a single coherent reasoning error. These samples span 24 subdomains across six top-level domains, ensuring broad coverage and taxonomic richness. Unlike existing benchmarks that focus on answer correctness, MMErroR targets a process-level, error-centric evaluation that requires models to detect incorrect reasoning and classify the error type within both visual and linguistic contexts. We evaluate 20 advanced VLMs; even the best model (Gemini-3.0-Pro) classifies the error in only 66.47% of cases, underscoring the challenge of identifying erroneous reasoning. Furthermore, the ability to accurately identify errors offers valuable insights into the capabilities of multi-modal reasoning models. Project Page: https://mmerror-benchmark.github.io
[4] RiskCueBench: Benchmarking Anticipatory Reasoning from Early Risk Cues in Video-Language Models
Sha Luo, Yogesh Prabhu, Tim Ossowski, Kaiping Chen, Junjie Hu
🧩 TL;DR
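This work introduces RiskCueBench, a video understanding benchmark that annotates risk signal clips to evaluate how well models anticipate future hazardous events from early visual cues. It reveals a significant gap in current systems' ability to interpret dynamic scenes and predict risk.
📘 Detailed Summary
Motivation: Most existing work on video risk assessment gives models access to the full video sequence, including the accident itself, which substantially lowers task difficulty and does not reflect real-world conditions. The study addresses this limitation by identifying risk signal clips, so that models can be evaluated on anticipating hazardous events from early visual cues.
Method: The authors introduce RiskCueBench, a new video understanding benchmark in which videos are carefully annotated to identify a risk signal clip, defined as the earliest moment that indicates a potential safety concern. Models must predict from early visual signals alone rather than from a full sequence that contains the accident outcome.
Result: Experiments show a significant gap in current systems' ability to interpret dynamic scenes and anticipate future hazardous events from early visual signals. RiskCueBench exposes important challenges for deploying existing video risk prediction models in practice.
Conclusion: The study highlights a key challenge for deploying video risk prediction models: accurately anticipating hazardous events from early visual cues. RiskCueBench offers a more realistic testbed for risk prediction under real-world conditions and pushes video understanding toward more anticipatory risk assessment.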
📄 Abstract
With the rapid growth of video-centered social media, the ability to anticipate risky events from visual data is a promising direction for ensuring public safety and preventing real-world accidents. Prior work has extensively studied supervised video risk assessment across domains such as driving, protests, and natural disasters. However, many existing datasets provide models with access to the full video sequence, including the accident itself, which substantially reduces the difficulty of the task. To better reflect real-world conditions, we introduce a new video understanding benchmark, RiskCueBench, in which videos are carefully annotated to identify a risk signal clip, defined as the earliest moment that indicates a potential safety concern. Experimental results reveal a significant gap in current systems' ability to interpret evolving situations and anticipate future risky events from early visual signals, highlighting important challenges for deploying video risk prediction models in practice.
[5] Eye-Q: A Multilingual Benchmark for Visual Word Puzzle Solving and Image-to-Phrase Reasoning
Ali Najar, Alireza Mirrokni, Arshia Izadyari, Sadegh Mohammadian, Amir Homayoon Sharifizade, Asal Meskin, Mobin Bagherian, Ehsaneddin Asgari
🧩 TL;DR
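This paper presents Eye-Q, a benchmark for evaluating vision-language models on complex visual reasoning. It consists of multilingual visual word puzzles that require deeper reasoning, such as generating and revising hypotheses, rather than surface-level recognition.
📘 Detailed Summary
Motivation: Current vision-language models perform well on standard benchmarks but rely mostly on surface-level recognition rather than deeper reasoning. Existing evaluations can often be solved through literal grounding, OCR shortcuts, or simple retrieval-style matching, and they do not assess complex visual understanding, especially tasks that require discovering implicit visual cues, generating and revising hypotheses, and mapping perceptual evidence to non-literal concepts.
Method: The authors propose Eye-Q, a multilingual benchmark of 1,343 visual word puzzles. Each puzzle shows a conceptually dense scene with a brief description, and the model must infer a specific target word or phrase. The puzzles are intentionally unstructured and cue-implicit, with distractors and contextual relationships that demand selective attention, abstraction, and associative inference. Evaluation uses an open-ended, human-aligned protocol that probes hypothesis formation and revision under lightweight assistance, and the benchmark covers English, Persian, Arabic, and cross-lingual puzzles.
Result: State-of-the-art vision-language models show substantial performance gaps, especially on abstract and cross-lingual puzzles, with a maximum accuracy of only 60.27%. This points to limits in current models' ability to construct and search over appropriate conceptual representations for flexible image-to-phrase inference.
Conclusion: The study reveals substantial limitations of current vision-language models on complex reasoning tasks, particularly in abstract and cross-lingual settings. Eye-Q provides a tool for assessing deeper visual understanding and suggests that future work should focus on how models build and search conceptual representations to support more flexible vision-language reasoning.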
📄 Abstract
Vision-Language Models (VLMs) have achieved strong performance on standard vision-language benchmarks, yet often rely on surface-level recognition rather than deeper reasoning. We propose visual word puzzles as a challenging alternative, as they require discovering implicit visual cues, generating and revising hypotheses, and mapping perceptual evidence to non-literal concepts in ways that are difficult to solve via literal grounding, OCR-heavy shortcuts, or simple retrieval-style matching. We introduce Eye-Q, a multilingual benchmark designed to assess this form of complex visual understanding. Eye-Q contains 1,343 puzzles in which a model observes a conceptually dense scene with a brief description and must infer a specific target word or phrase. The puzzles are intentionally unstructured and cue-implicit, with distractors and contextual relationships that demand selective attention, abstraction, and associative inference. The benchmark spans English, Persian, Arabic, and cross-lingual puzzles. We evaluate state-of-the-art VLMs using an open-ended, human-aligned protocol that probes hypothesis formation and revision under lightweight assistance. Results reveal substantial performance gaps, especially on abstract and cross-lingual puzzles, highlighting limitations in current models' ability to construct and search over appropriate conceptual representations for flexible image-to-phrase inference; maximum accuracy reaches only 60.27%.
[6] GAMBIT: A Gamified Jailbreak Framework for Multimodal Large Language Models
Xiangdong Hu, Yangyang Jiang, Qin Hu, Xiaojun Jia
🧩 TL;DR
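This paper proposes GAMBIT, a multimodal jailbreak attack that builds a gamified scene to drive the model to explore and reconstruct malicious intent on its own, bypassing the safety alignment of reasoning MLLMs and substantially raising attack success rates.
📘 Detailed Summary
Motivation: The safety alignment of multimodal large language models remains fragile under adversarial inputs. Existing attacks mainly increase the complexity of the visual task and do not exploit the model's own reasoning incentives, so they are far less effective on reasoning models than on non-reasoning ones. The authors therefore ask whether influencing a model's cognitive-stage decisions can lead it to complete a jailbreak proactively.
Method: GAMBIT first decomposes and reassembles harmful visual semantics, then constructs a gamified scene that drives the model to explore and reconstruct intent. The resulting structured reasoning chain increases task complexity in both vision and text, turning the model into a game participant whose goal pursuit reduces safety attention and induces it to answer the reconstructed malicious query.
Result: Extensive experiments on popular reasoning and non-reasoning MLLMs show that GAMBIT achieves high attack success rates: 92.13% on Gemini 2.5 Flash, 91.20% on QvQ-MAX, and 85.87% on GPT-4o, significantly outperforming baselines.
Conclusion: The study shows that gamified scene design can influence the cognitive decision process of reasoning MLLMs and bypass safety alignment. It exposes the fragility of current safeguards under structured-reasoning attacks and offers lessons for more robust safety alignment.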
📄 Abstract
Multimodal Large Language Models (MLLMs) have become widely deployed, yet their safety alignment remains fragile under adversarial inputs. Previous work has shown that increasing inference steps can disrupt safety mechanisms and lead MLLMs to generate attacker-desired harmful content. However, most existing attacks focus on increasing the complexity of the modified visual task itself and do not explicitly leverage the model's own reasoning incentives. As a result, they underperform on reasoning models (models with Chain-of-Thought) compared to non-reasoning ones (models without Chain-of-Thought). If a model can think like a human, can we influence its cognitive-stage decisions so that it proactively completes a jailbreak? To validate this idea, we propose GAMBIT (Gamified Adversarial Multimodal Breakout via Instructional Traps), a novel multimodal jailbreak framework that decomposes and reassembles harmful visual semantics, then constructs a gamified scene that drives the model to explore, reconstruct intent, and answer as part of winning the game. The resulting structured reasoning chain increases task complexity in both vision and text, positioning the model as a participant whose goal pursuit reduces safety attention and induces it to answer the reconstructed malicious query. Extensive experiments on popular reasoning and non-reasoning MLLMs demonstrate that GAMBIT achieves high Attack Success Rates (ASR), reaching 92.13% on Gemini 2.5 Flash, 91.20% on QvQ-MAX, and 85.87% on GPT-4o, significantly outperforming baselines.
[7] FROST-Drive: Scalable and Efficient End-to-End Driving with a Frozen Vision Encoder
Zeyu Dong, Yimin Zhu, Yu Wu, Yu Sun
🧩 TL;DR
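This paper proposes FROST-Drive, an end-to-end autonomous driving architecture that keeps the vision encoder of a pretrained vision-language model frozen, transferring its generalization ability directly to the driving task and markedly improving performance in complex scenarios.
📘 Detailed Summary
Motivation: End-to-end driving models struggle to generalize to novel and complex scenarios. The usual practice of fully fine-tuning the vision encoder can make the model specialize too heavily in the training data and limit generalization. The paper questions whether this training paradigm is necessary and explores how to make better use of the broad world knowledge in pretrained vision-language models.
Method: FROST-Drive keeps the weights of the pretrained VLM's vision encoder frozen, transferring its generalized knowledge directly to the driving task. The architecture combines the frozen encoder with a transformer-based adapter for multimodal fusion and a GRU-based decoder for smooth waypoint generation. A custom loss function directly optimizes the Rater Feedback Score (RFS), a metric that prioritizes robust trajectory planning.
Result: Extensive experiments on the Waymo Open E2E Dataset, which is deliberately curated to capture long-tail scenarios, show that the frozen-encoder approach significantly outperforms models that use full fine-tuning. The results provide substantial evidence that preserving the broad knowledge of a capable VLM is more effective than intensive domain-specific adaptation.
Conclusion: The study shows that preserving the broad knowledge of a pretrained vision-language model is a more effective strategy for robust, generalizable driving performance. It challenges the necessity of full fine-tuning and offers a new path toward vision-based models that better handle the complexity of real-world applications.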
📄 Abstract
End-to-end (E2E) models in autonomous driving aim to directly map sensor inputs to control commands, but their ability to generalize to novel and complex scenarios remains a key challenge. The common practice of fully fine-tuning the vision encoder on driving datasets potentially limits its generalization by causing the model to specialize too heavily in the training data. This work challenges the necessity of this training paradigm. We propose FROST-Drive, a novel E2E architecture designed to preserve and leverage the powerful generalization capabilities of a pretrained vision encoder from a Vision-Language Model (VLM). By keeping the encoder's weights frozen, our approach directly transfers the rich, generalized world knowledge from the VLM to the driving task. Our model architecture combines this frozen encoder with a transformer-based adapter for multimodal fusion and a GRU-based decoder for smooth waypoint generation. Furthermore, we introduce a custom loss function designed to directly optimize for Rater Feedback Score (RFS), a metric that prioritizes robust trajectory planning. We conduct extensive experiments on the Waymo Open E2E Dataset, a large-scale dataset deliberately curated to capture long-tail scenarios, demonstrating that our frozen-encoder approach significantly outperforms models that employ full fine-tuning. Our results provide substantial evidence that preserving the broad knowledge of a capable VLM is a more effective strategy for achieving robust, generalizable driving performance than intensive domain-specific adaptation. This offers a new pathway for developing vision-based models that can better handle the complexities of real-world application domains.
[8] ThinkRL-Edit: Thinking in Reinforcement Learning for Reasoning-Centric Image Editing
Hengjia Li, Liming Jiang, Qing Yan, Yizhi Song, Hao Kang, Zichuan Liu, Xin Lu, Boxi Wu, Deng Cai
🧩 TL;DR
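This paper proposes ThinkRL-Edit, a reasoning-centric reinforcement learning framework for image editing. By decoupling visual reasoning from image synthesis and introducing chain-of-thought sampling, it substantially improves instruction-driven image editing on complex reasoning tasks.
📘 Detailed Summary
Motivation: Instruction-driven image editing with multimodal generative models is limited in visual reasoning, which leads to poor results on reasoning-intensive edits. Existing reinforcement learning approaches face three key challenges: limited reasoning exploration, biased reward fusion, and unstable VLM-based instruction rewards.
Method: ThinkRL-Edit decouples visual reasoning from image synthesis and expands reasoning exploration beyond denoising stochasticity. It introduces chain-of-thought reasoning sampling with planning and reflection stages before generation, which prompts the model to explore multiple semantic hypotheses and validate their plausibility. An unbiased chain preference grouping strategy avoids the failures of weighted aggregation, and a binary checklist replaces interval-based VLM scores to give more precise, lower-variance, and interpretable rewards for complex reasoning.
Result: Experiments show that the method significantly outperforms prior work on reasoning-intensive image editing and produces edits that are more faithful to the instruction, visually coherent, and semantically grounded.
Conclusion: The study shows that decoupling reasoning from synthesis, widening the reasoning exploration space, and improving the reward mechanism together improve performance on complex image editing. The framework suggests a new way to apply reinforcement learning to visual generation, particularly where high-level semantic reasoning is required.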
📄 Abstract
Instruction-driven image editing with unified multimodal generative models has advanced rapidly, yet their underlying visual reasoning remains limited, leading to suboptimal performance on reasoning-centric edits. Reinforcement learning (RL) has been investigated for improving the quality of image editing, but it faces three key challenges: (1) limited reasoning exploration confined to denoising stochasticity, (2) biased reward fusion, and (3) unstable VLM-based instruction rewards. In this work, we propose ThinkRL-Edit, a reasoning-centric RL framework that decouples visual reasoning from image synthesis and expands reasoning exploration beyond denoising. To this end, we introduce Chain-of-Thought (CoT)-based reasoning sampling with planning and reflection stages prior to generation in online sampling, compelling the model to explore multiple semantic hypotheses and validate their plausibility before committing to a visual outcome. To avoid the failures of weighted aggregation, we propose an unbiased chain preference grouping strategy across multiple reward dimensions. Moreover, we replace interval-based VLM scores with a binary checklist, yielding more precise, lower-variance, and interpretable rewards for complex reasoning. Experiments show our method significantly outperforms prior work on reasoning-centric image editing, producing instruction-faithful, visually coherent, and semantically grounded edits.
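As a rough illustration of the binary-checklist idea in the abstract above, the sketch below scores an edit as the fraction of yes/no checks it passes. The `checklist_reward` name, the example checklist, and the plain averaging rule are assumptions made for illustration; the abstract does not say how ThinkRL-Edit aggregates its checks.

```python
from typing import Sequence


def checklist_reward(answers: Sequence[bool]) -> float:
    """Return the fraction of binary checks that passed.

    Averaging is a hypothetical aggregation rule chosen for this sketch.
    """
    if not answers:
        raise ValueError("checklist must contain at least one item")
    return sum(1 for passed in answers if passed) / len(answers)


# Hypothetical checklist answers for one edited image: each entry is a
# yes/no judgment that a VLM judge could be asked to make about the edit.
example_answers = [True, True, False, True]
reward = checklist_reward(example_answers)
```

Because each item is a yes/no decision, a failed check can be traced to a specific question, which is one plausible reading of the "interpretable" rewards the abstract mentions.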
[9] CroBIM-U: Uncertainty-Driven Referring Remote Sensing Image Segmentation
Yuzhe Sun, Zhe Dong, Haochen Jiang, Tianzhu Liu, Yanfeng Gu
🧩 TL;DR
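This paper proposes an uncertainty-guided framework for referring remote sensing image segmentation. It explicitly models the spatial distribution of referring uncertainty to guide adaptive inference, addressing the spatial non-uniformity of cross-modal alignment in existing methods and improving robustness and geometric fidelity in complex remote sensing scenes.
📘 Detailed Summary
Motivation: Referring remote sensing image segmentation faces extreme scale variation, dense similar distractors, and intricate boundary structures, which make cross-modal alignment markedly non-uniform across space. Existing methods apply uniform fusion and refinement, which introduces unnecessary linguistic perturbations in visually clear regions while failing to disambiguate confused regions. A method that adapts to the uncertainty of each region is therefore needed.
Method: The framework centers on a plug-and-play Referring Uncertainty Scorer, trained with an online error-consistency supervision strategy to predict the spatial distribution of referential ambiguity. Two plug-and-play modules build on this prior: Uncertainty-Gated Fusion dynamically modulates language injection strength, strengthening constraints in high-uncertainty regions and suppressing noise in low-uncertainty ones; Uncertainty-Driven Local Refinement uses uncertainty-derived soft masks to focus refinement on error-prone boundaries and fine details.
Result: Extensive experiments show that the method works as a unified plug-and-play solution that significantly improves robustness and geometric fidelity in complex remote sensing scenes without altering the backbone architecture. It outperforms existing methods on multiple benchmark datasets and adapts better to scale variation, similar distractors, and complex boundaries.
Conclusion: The study shows the value of explicitly modeling the spatial distribution of referring uncertainty for remote sensing image segmentation, and the framework offers an effective answer to spatial non-uniformity in cross-modal alignment. As a plug-and-play module it is broadly applicable, and it points to uncertainty-aware mechanisms as a way to strengthen adaptive inference in vision-language tasks.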
📄 Abstract
Referring remote sensing image segmentation aims to localize specific targets described by natural language within complex overhead imagery. However, due to extreme scale variations, dense similar distractors, and intricate boundary structures, the reliability of cross-modal alignment exhibits significant spatial non-uniformity. Existing methods typically employ uniform fusion and refinement strategies across the entire image, which often introduces unnecessary linguistic perturbations in visually clear regions while failing to provide sufficient disambiguation in confused areas. To address this, we propose an uncertainty-guided framework that explicitly leverages a pixel-wise referring uncertainty map as a spatial prior to orchestrate adaptive inference. Specifically, we introduce a plug-and-play Referring Uncertainty Scorer (RUS), which is trained via an online error-consistency supervision strategy to interpretably predict the spatial distribution of referential ambiguity. Building on this prior, we design two plug-and-play modules: 1) Uncertainty-Gated Fusion (UGF), which dynamically modulates language injection strength to enhance constraints in high-uncertainty regions while suppressing noise in low-uncertainty ones; and 2) Uncertainty-Driven Local Refinement (UDLR), which utilizes uncertainty-derived soft masks to focus refinement on error-prone boundaries and fine details. Extensive experiments demonstrate that our method functions as a unified, plug-and-play solution that significantly improves robustness and geometric fidelity in complex remote sensing scenes without altering the backbone architecture.
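The gating idea behind UGF can be pictured with a toy example. The sketch below is not the paper's module: it assumes single-channel, flattened per-pixel features and a simple linear gate, and the name `gated_fusion` and the parameter `max_gain` are hypothetical.

```python
from typing import List


def gated_fusion(visual: List[float], language: List[float],
                 uncertainty: List[float], max_gain: float = 1.0) -> List[float]:
    """Inject the language feature into the visual feature, scaled per pixel.

    Assumed form (illustrative only): fused = visual + max_gain * u * language,
    where u is the referring uncertainty clipped to [0, 1].
    """
    if not (len(visual) == len(language) == len(uncertainty)):
        raise ValueError("inputs must have the same length")
    fused = []
    for v, lang_feat, u in zip(visual, language, uncertainty):
        gate = max_gain * min(max(u, 0.0), 1.0)  # no injection when u == 0
        fused.append(v + gate * lang_feat)
    return fused


# Pixel 0 is visually clear (u = 0), pixel 1 is ambiguous (u = 1).
fused = gated_fusion([1.0, 1.0], [2.0, 2.0], [0.0, 1.0])
```

Under this assumed gate, the clear pixel keeps its visual feature untouched while the ambiguous pixel receives the full language signal, which mirrors the behavior the abstract describes qualitatively.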
[10] Understanding Reward Hacking in Text-to-Image Reinforcement Learning
Yunqi Hong, Kuei-Chun Kao, Hengguang Zhou, Cho-Jui Hsieh
🧩 TL;DR
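This paper systematically analyzes reward hacking in reinforcement learning post-training of text-to-image models and proposes a lightweight, adaptive artifact reward model as an effective regularizer, improving the visual realism and alignment of generated images.
📘 Detailed Summary
Motivation: Reward functions used in text-to-image RL post-training are often imperfect proxies for human judgment. Models therefore tend toward reward hacking: producing unrealistic or low-quality images that still score highly, which harms generation quality and human preference alignment.
Method: The paper analyzes how aesthetic/human-preference rewards and prompt-image consistency rewards each contribute to reward hacking, and examines the limits of ensembling multiple rewards. To counter artifact generation, a common failure mode, it proposes a lightweight adaptive artifact reward model trained on a small curated dataset of artifact-free and artifact-containing samples, which can serve as a regularizer in existing RL pipelines.
Result: Experiments show that adding the artifact reward model significantly improves visual realism and reduces reward hacking across multiple text-to-image RL setups, demonstrating the practical value of lightweight reward augmentation as a safeguard.
Conclusion: Reward hacking is a systematic challenge in text-to-image RL post-training, and a lightweight adaptive reward model can act as an effective regularizing remedy. The work offers a new angle on reward design for generative models and stresses the value of dedicated reward components for specific failure modes.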
📄 Abstract
Reinforcement learning (RL) has become a standard approach for post-training large language models and, more recently, for improving image generation models, where reward functions are used to enhance generation quality and human preference alignment. However, existing reward designs are often imperfect proxies for true human judgment, making models prone to reward hacking--producing unrealistic or low-quality images that nevertheless achieve high reward scores. In this work, we systematically analyze reward hacking behaviors in text-to-image (T2I) RL post-training. We investigate how both aesthetic/human preference rewards and prompt-image consistency rewards individually contribute to reward hacking and further show that ensembling multiple rewards can only partially mitigate this issue. Across diverse reward models, we identify a common failure mode: the generation of artifact-prone images. To address this, we propose a lightweight and adaptive artifact reward model, trained on a small curated dataset of artifact-free and artifact-containing samples. This model can be integrated into existing RL pipelines as an effective regularizer for commonly used reward models. Experiments demonstrate that incorporating our artifact reward significantly improves visual realism and reduces reward hacking across multiple T2I RL setups, demonstrating the effectiveness of lightweight reward augmentation as a safeguard against reward hacking.
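One simple way to picture an artifact reward acting as a regularizer is a penalty subtracted from the base reward. The sketch below assumes that additive form; the function name `regularized_reward`, the `weight` parameter, and the example scores are hypothetical, and the abstract does not state how the paper actually combines the two signals.

```python
def regularized_reward(base_reward: float, artifact_prob: float,
                       weight: float = 1.0) -> float:
    """Penalize a base reward by the predicted probability of visual artifacts.

    Assumed combination rule for illustration: base - weight * artifact_prob.
    """
    if not 0.0 <= artifact_prob <= 1.0:
        raise ValueError("artifact_prob must lie in [0, 1]")
    return base_reward - weight * artifact_prob


# Hypothetical scores: the "hacked" image gets a higher base reward but is
# flagged as artifact-prone; the clean image scores lower on the base reward.
hacked = regularized_reward(base_reward=0.9, artifact_prob=0.8)
clean = regularized_reward(base_reward=0.7, artifact_prob=0.05)
```

With these made-up numbers the clean image ends up ranked above the hacked one, which is the effect a regularizer of this kind is meant to have.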
[11] SDCD: Structure-Disrupted Contrastive Decoding for Mitigating Hallucinations in Large Vision-Language Models
Yuxuan Xia, Siheng Wang, Peng Li
🧩 TL;DR
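This paper proposes Structure-Disrupted Contrastive Decoding, a training-free algorithm that calibrates the output distribution with a structure-disrupted visual view, mitigating object hallucination in large vision-language models and improving multimodal understanding.
📘 Detailed Summary
Motivation: Large vision-language models have made notable progress in multimodal understanding and reasoning, but object hallucination remains a critical challenge. Existing work focuses on mitigating language priors or high-level statistical biases and often overlooks the internal complexity of visual encoding. The authors find that visual statistical bias stems from the inherent Bag-of-Patches behavior of vision encoders under weak structural supervision. This bias makes models favor local texture within individual patches over holistic geometric structure, which can induce spurious visual confidence and lead to hallucination.
Method: The proposed training-free algorithm introduces a shuffled, structure-disrupted view and uses it to contrastively calibrate the output distribution. Tokens that keep high confidence under the structure-less view are penalized, which suppresses texture-driven bias. SDCD needs no additional training and acts purely at inference time, contrasting confidence under the original view and the structure-disrupted view to identify and reduce hallucination-prone tokens.
Result: Experiments show that SDCD significantly mitigates hallucination across multiple benchmarks and improves the overall multimodal capabilities of large vision-language models. It generalizes well across model architectures and datasets, supporting the effectiveness of structure-disrupted contrast for suppressing visual statistical bias.
Conclusion: The study identifies statistical bias in visual encoding as an important contributor to object hallucination, in particular the texture preference caused by Bag-of-Patches behavior under weak structural supervision. SDCD offers an efficient, training-free remedy, underlines the importance of geometric structure in visual understanding, and suggests directions for designing and optimizing future multimodal models.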
📄 Abstract
Large Vision-Language Models (LVLMs) demonstrate significant progress in multimodal understanding and reasoning, yet object hallucination remains a critical challenge. While existing research focuses on mitigating language priors or high-level statistical biases, it often overlooks the internal complexities of the visual encoding process. We identify that visual statistical bias, arising from the inherent Bag-of-Patches behavior of Vision Encoders under weak structural supervision, acts as a contributing factor to object hallucinations. Under this bias, models prioritize local texture features within individual patches over holistic geometric structures. This tendency may induce spurious visual confidence and result in hallucinations. To address this, we introduce a training-free algorithm called Structure-Disrupted Contrastive Decoding (SDCD), which performs contrastive calibration of the output distribution by introducing a shuffled structure-disrupted view. By penalizing tokens that maintain high confidence under this structure-less view, SDCD effectively suppresses the texture-driven bias. Experimental results demonstrate that SDCD significantly mitigates hallucinations across multiple benchmarks and enhances the overall multimodal capabilities of LVLMs.
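The contrastive calibration step can be sketched with a formula commonly used in contrastive decoding, (1 + alpha) * log p(original) - alpha * log p(disrupted). This form and the value of `alpha` are assumptions for illustration; the abstract does not give SDCD's exact scoring rule, and the toy logits below are made up.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def contrastive_scores(logits_original: List[float],
                       logits_disrupted: List[float],
                       alpha: float = 1.0) -> List[float]:
    """Score tokens by boosting the original view and penalizing the disrupted one.

    Assumed rule: (1 + alpha) * log p_original - alpha * log p_disrupted.
    """
    lp_o = log_softmax(logits_original)
    lp_d = log_softmax(logits_disrupted)
    return [(1 + alpha) * o - alpha * d for o, d in zip(lp_o, lp_d)]


# Toy two-token vocabulary: index 0 depends on image structure, index 1 stays
# confident even when the patches are shuffled (texture-driven).
original = [2.0, 2.2]
disrupted = [0.0, 2.2]
scores = contrastive_scores(original, disrupted)
best_plain = max(range(2), key=lambda i: original[i])
best_contrastive = max(range(2), key=lambda i: scores[i])
```

In this toy case plain decoding picks the texture-driven token (index 1), while the contrastive scores favor the structure-dependent token (index 0), since the second token loses most of its margin once its confidence under the shuffled view is subtracted.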
[12] EASLT: Emotion-Aware Sign Language Translation
Guobin Tu, Di Weng
🧩 TL;DR
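This paper proposes EASLT, a framework that treats facial affect as a semantic anchor rather than auxiliary information. Its emotion-aware fusion module resolves the semantic ambiguity that arises when sign language translation ignores facial expressions, achieving advanced performance among gloss-free methods.
📘 Detailed Summary
Motivation: Current gloss-free sign language translation methods concentrate on manual signals and neglect non-manual signals, especially the semantic importance of facial expressions. This causes ambiguity when distinct concepts share the same manual articulation. Treating facial affect as a core semantic anchor rather than auxiliary information is needed to overcome this limitation.
Method: EASLT includes a dedicated emotional encoder that captures continuous affective dynamics, and a novel Emotion-Aware Fusion module that adaptively recalibrates spatio-temporal sign features based on affective context. Facial affect representations are thus integrated into translation as semantic anchors to resolve ambiguity.
Result: On the PHOENIX14T and CSL-Daily benchmarks, EASLT achieves advanced performance among gloss-free methods, with BLEU-4 scores of 26.15 and 22.80 and BLEURT scores of 61.0 and 57.8, respectively. Ablation studies confirm that explicitly modeling emotion effectively decouples affective semantics from manual dynamics.
Conclusion: The study shows that treating facial affect as a semantic anchor rather than auxiliary information significantly improves translation fidelity. The emotion-aware fusion mechanism resolves semantic ambiguity in cross-modal translation and suggests a direction for future sign language translation research that integrates multimodal signals.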
📄 Abstract
Sign Language Translation (SLT) is a complex cross-modal task requiring the integration of Manual Signals (MS) and Non-Manual Signals (NMS). While recent gloss-free SLT methods have made strides in translating manual gestures, they frequently overlook the semantic criticality of facial expressions, resulting in ambiguity when distinct concepts share identical manual articulations. To address this, we present EASLT (Emotion-Aware Sign Language Translation), a framework that treats facial affect not as auxiliary information, but as a robust semantic anchor. Unlike methods that relegate facial expressions to a secondary role, EASLT incorporates a dedicated emotional encoder to capture continuous affective dynamics. These representations are integrated via a novel Emotion-Aware Fusion (EAF) module, which adaptively recalibrates spatio-temporal sign features based on affective context to resolve semantic ambiguities. Extensive evaluations on the PHOENIX14T and CSL-Daily benchmarks demonstrate that EASLT establishes advanced performance among gloss-free methods, achieving BLEU-4 scores of 26.15 and 22.80, and BLEURT scores of 61.0 and 57.8, respectively. Ablation studies confirm that explicitly modeling emotion effectively decouples affective semantics from manual dynamics, significantly enhancing translation fidelity. Code is available at https://github.com/TuGuobin/EASLT.
[13] Physics-Constrained Cross-Resolution Enhancement Network for Optics-Guided Thermal UAV Image Super-Resolution
Zhicheng Zhao, Fengjiao Peng, Jinquan Yan, Wei Lu, Chenglong Li, Jin Tang
🧩 TL;DR
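This paper proposes PCNet, a method for thermal UAV image super-resolution. A cross-resolution mutual enhancement module and a physics-driven thermal conduction module preserve high-frequency optical priors while preventing physically inconsistent artifacts, substantially improving thermal super-resolution performance.
📘 Detailed Summary
Motivation: Existing optics-guided thermal UAV image super-resolution methods usually compress optical features to match thermal feature dimensions. This discards high-frequency information that benefits thermal super-resolution and, by ignoring differences in imaging physics between modalities, introduces physically inconsistent artifacts such as texture distortion and edge blurring.
Method: PCNet comprises a Cross-Resolution Mutual Enhancement Module (CRME) and a Physics-Driven Thermal Conduction Module (PDTM). CRME jointly optimizes thermal image super-resolution and optical-to-thermal modality conversion, enabling bidirectional feature interaction across resolutions while preserving high-frequency optical priors. PDTM incorporates two-dimensional heat conduction into the optical guidance process and models spatially varying conduction properties to prevent inconsistent artifacts. A temperature consistency loss additionally enforces regional distribution consistency and boundary gradient smoothness.
Result: Extensive experiments on the VGTSR2.0 and DroneVehicle datasets show that PCNet significantly outperforms state-of-the-art methods in reconstruction quality and on downstream tasks, including semantic segmentation and object detection.
Conclusion: Through physically constrained optical guidance and cross-resolution mutual enhancement, the study addresses information loss and physical inconsistency in thermal super-resolution. It provides a more robust solution for all-weather monitoring and shows practical value on downstream tasks such as semantic segmentation and object detection.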
📄 Abstract
Optics-guided thermal UAV image super-resolution has attracted significant research interest due to its potential in all-weather monitoring applications. However, existing methods typically compress optical features to match thermal feature dimensions for cross-modal alignment and fusion, which not only causes the loss of high-frequency information that is beneficial for thermal super-resolution, but also introduces physically inconsistent artifacts such as texture distortions and edge blurring by overlooking differences in the imaging physics between modalities. To address these challenges, we propose PCNet to achieve cross-resolution mutual enhancement between optical and thermal modalities, while physically constraining the optical guidance process via thermal conduction to enable robust thermal UAV image super-resolution. In particular, we design a Cross-Resolution Mutual Enhancement Module (CRME) to jointly optimize thermal image super-resolution and optical-to-thermal modality conversion, facilitating effective bidirectional feature interaction across resolutions while preserving high-frequency optical priors. Moreover, we propose a Physics-Driven Thermal Conduction Module (PDTM) that incorporates two-dimensional heat conduction into optical guidance, modeling spatially-varying heat conduction properties to prevent inconsistent artifacts. In addition, we introduce a temperature consistency loss that enforces regional distribution consistency and boundary gradient smoothness to ensure generated thermal images align with real-world thermal radiation principles. Extensive experiments on VGTSR2.0 and DroneVehicle datasets demonstrate that PCNet significantly outperforms state-of-the-art methods on both reconstruction quality and downstream tasks including semantic segmentation and object detection.
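For readers unfamiliar with two-dimensional heat conduction, the sketch below performs one explicit finite-difference step of the heat equation on a small grid with a per-pixel diffusivity map. It illustrates the underlying physics only and is not the PDTM module; the function name `heat_step`, unit grid spacing, and fixed boundary values are assumptions of this sketch.

```python
from typing import List

Grid = List[List[float]]


def heat_step(temp: Grid, alpha: Grid, dt: float = 0.1) -> Grid:
    """One explicit step of dT/dt = alpha * laplacian(T) on a unit-spaced grid.

    Boundary cells are left unchanged. The explicit scheme is stable only when
    dt * alpha <= 0.25 for every interior cell.
    """
    rows, cols = len(temp), len(temp[0])
    new = [row[:] for row in temp]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            # 5-point stencil approximation of the Laplacian.
            lap = (temp[i + 1][j] + temp[i - 1][j]
                   + temp[i][j + 1] + temp[i][j - 1] - 4.0 * temp[i][j])
            new[i][j] = temp[i][j] + dt * alpha[i][j] * lap
    return new


# A single hot pixel in the middle of a cold 3x3 patch.
hot_spot = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
unit_alpha = [[1.0] * 3 for _ in range(3)]
after = heat_step(hot_spot, unit_alpha)
```

A spatially varying `alpha` lets heat spread faster in some regions than others, which is the kind of material-dependent behavior a physics-based constraint can encode.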
[14] CSMCIR: CoT-Enhanced Symmetric Alignment with Memory Bank for Composed Image Retrieval
Zhipeng Qian, Zihan Liang, Yufei Ma, Ben Chen, Huangyu Dai, Yiwei Ma, Jiayi Ji, Chenyi Lei, Han Li, Xiaoshuai Sun
🧩 TL;DR
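This paper proposes CSMCIR, a unified representation framework for composed image retrieval. Multi-level chain-of-thought prompting, a symmetric dual-tower architecture, and a dynamic memory bank strategy address the representation space fragmentation of existing methods and achieve state-of-the-art retrieval performance.
📘 Detailed Summary
Motivation: Existing composed image retrieval methods suffer from representation space fragmentation: queries and targets consist of heterogeneous modalities processed by different encoders, so models can only bridge misaligned representation spaces through post-hoc alignment, which fundamentally limits retrieval performance. This architectural asymmetry appears in the feature space as three distinct, well-separated clusters, showing directly how heterogeneous modalities create misaligned representation spaces from initialization.
Method: CSMCIR has three synergistic components. First, a multi-level chain-of-thought prompting strategy guides multimodal large language models to generate discriminative, semantically compatible captions for target images, establishing modal symmetry. Second, a symmetric dual-tower architecture uses the same shared-parameter Q-Former on both query and target sides for cross-modal encoding, ensuring consistent feature representations. Third, the architectural symmetry enables an entropy-based, temporally dynamic memory bank strategy that supplies high-quality negative samples while staying consistent with the evolving model state.
Result: Extensive experiments on four benchmark datasets show that CSMCIR achieves state-of-the-art performance with superior training efficiency. Comprehensive ablation studies further validate each proposed component.
Conclusion: The study addresses representation space fragmentation in composed image retrieval with a unified representation framework and shows the importance of modality-symmetric architecture and dynamic negative sampling. The results indicate that removing the representation mismatch caused by heterogeneous encoders can substantially improve cross-modal retrieval, offering guidance for future multimodal retrieval systems.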
📄 Abstract
Composed Image Retrieval (CIR) enables users to search for target images using both a reference image and manipulation text, offering substantial advantages over single-modality retrieval systems. However, existing CIR methods suffer from representation space fragmentation: queries and targets comprise heterogeneous modalities and are processed by distinct encoders, forcing models to bridge misaligned representation spaces only through post-hoc alignment, which fundamentally limits retrieval performance. This architectural asymmetry manifests as three distinct, well-separated clusters in the feature space, directly demonstrating how heterogeneous modalities create fundamentally misaligned representation spaces from initialization. In this work, we propose CSMCIR, a unified representation framework that achieves efficient query-target alignment through three synergistic components. First, we introduce a Multi-level Chain-of-Thought (MCoT) prompting strategy that guides Multimodal Large Language Models to generate discriminative, semantically compatible captions for target images, establishing modal symmetry. Building upon this, we design a symmetric dual-tower architecture where both query and target sides utilize the identical shared-parameter Q-Former for cross-modal encoding, ensuring consistent feature representations and further reducing the alignment gap. Finally, this architectural symmetry enables an entropy-based, temporally dynamic Memory Bank strategy that provides high-quality negative samples while maintaining consistency with the evolving model state. Extensive experiments on four benchmark datasets demonstrate that our CSMCIR achieves state-of-the-art performance with superior training efficiency. Comprehensive ablation studies further validate the effectiveness of each proposed component.
[15] SpatiaLoc: Leveraging Multi-Level Spatial Enhanced Descriptors for Cross-Modal Localization
Tianyi Shang, Pengjie Xu, Zhaojun Deng, Zhenyu Li, Zhicong Chen, Lijun Wu
🧩 TL;DR
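This paper presents SpatiaLoc, a framework for cross-modal localization from text and point clouds that uses a coarse-to-fine strategy emphasizing spatial relationships at both the instance and global levels. It significantly outperforms existing methods on KITTI360Pose.
📘 Detailed Summary
Motivation: In cross-modal localization, objects often recur across text and point clouds, which makes spatial relationships the most discriminative localization cue. Existing methods do not fully exploit this property, so more effective spatial relationship modeling is needed.
Method: SpatiaLoc follows a two-stage coarse-to-fine strategy. The coarse stage includes a Bezier Enhanced Object Spatial Encoder (BEOSE), which models instance-level spatial relationships with quadratic Bezier curves, and a Frequency Aware Encoder (FAE), which generates global-level spatial representations in the frequency domain. The fine stage uses an Uncertainty Aware Gaussian Fine Localizer (UGFL), which models predictions as Gaussian distributions and regresses 2D positions with an uncertainty-aware loss function.
Result: Extensive experiments on KITTI360Pose show that SpatiaLoc significantly outperforms existing state-of-the-art methods, supporting the effectiveness of its spatial relationship modeling.
Conclusion: The study shows that modeling spatial relationships at both the instance and global levels is essential in cross-modal localization. The coarse-to-fine strategy and uncertainty-aware approach offer an effective solution for natural-language localization in robot navigation and human-robot interaction.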
📄 Abstract
Cross-modal localization using text and point clouds enables robots to localize themselves via natural language descriptions, with applications in autonomous navigation and interaction between humans and robots. In this task, objects often recur across text and point clouds, making spatial relationships the most discriminative cues for localization. Given this characteristic, we present SpatiaLoc, a framework utilizing a coarse-to-fine strategy that emphasizes spatial relationships at both the instance and global levels. In the coarse stage, we introduce a Bezier Enhanced Object Spatial Encoder (BEOSE) that models spatial relationships at the instance level using quadratic Bezier curves. Additionally, a Frequency Aware Encoder (FAE) generates spatial representations in the frequency domain at the global level. In the fine stage, an Uncertainty Aware Gaussian Fine Localizer (UGFL) regresses 2D positions by modeling predictions as Gaussian distributions with a loss function aware of uncertainty. Extensive experiments on KITTI360Pose demonstrate that SpatiaLoc significantly outperforms existing state-of-the-art (SOTA) methods.
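The abstract does not spell out the uncertainty-aware loss used by UGFL. As a rough illustration only, the sketch below shows one common way to regress a 2D position as a Gaussian: the model predicts a mean and a log-variance per axis, and the loss is the negative log-likelihood of the target. The function name `gaussian_nll_2d` and the diagonal-covariance form are assumptions, not the authors' formulation.

```python
import math
from typing import Sequence


def gaussian_nll_2d(mu: Sequence[float],
                    log_var: Sequence[float],
                    target: Sequence[float]) -> float:
    """Negative log-likelihood of `target` under a diagonal 2D Gaussian.

    Assumption: each axis has its own predicted mean and log-variance.
    A large predicted variance down-weights the squared error on that
    axis but is itself penalized through the log-variance term.
    """
    nll = 0.0
    for m, lv, t in zip(mu, log_var, target):
        squared_error = (t - m) ** 2
        nll += 0.5 * (lv + squared_error / math.exp(lv) + math.log(2.0 * math.pi))
    return nll


# A confident but wrong prediction costs more than an uncertain one.
confident = gaussian_nll_2d(mu=(0.0, 0.0), log_var=(0.0, 0.0), target=(3.0, 0.0))
uncertain = gaussian_nll_2d(mu=(0.0, 0.0), log_var=(2.0, 0.0), target=(3.0, 0.0))
```

In this form the network can signal low confidence on ambiguous descriptions instead of being forced toward a single point estimate.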
[16] FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection
Mingyu Ouyang, Kevin Qinghong Lin, Mike Zheng Shou, Hwee Tou Ng
🧩 TL;DR
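This paper proposes FocusUI, an efficient UI grounding framework that selectively retains instruction-relevant visual tokens while preserving positional continuity, substantially reducing the computational overhead of vision-language models on UI grounding tasks while maintaining high accuracy.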
📘 Detailed Summary
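Motivation: Vision-language models produce thousands of visual tokens when processing high-resolution UI screenshots, which incurs significant computational overhead and dilutes attention, whereas humans typically focus only on regions of interest. Efficient UI grounding methods are therefore needed.
Method: FocusUI relies on two key techniques. First, it builds patch-level supervision by fusing an instruction-conditioned score with a rule-based UI-graph score, and selects distinctive, instruction-relevant visual tokens. Second, it introduces the PosPad strategy, which compresses each contiguous sequence of dropped visual tokens into a single special marker placed at the sequence's last index to preserve positional continuity.
Result: On four grounding benchmarks, FocusUI surpasses GUI-specific baselines. On ScreenSpot-Pro, FocusUI-7B improves on GUI-Actor-7B by 3.7%. Even when retaining only 30% of visual tokens, performance drops by only 3.2%, with up to 1.44x faster inference and 17% lower peak GPU memory.
Conclusion: Selective visual token retention combined with preserved positional continuity yields substantial efficiency gains in UI grounding with only a small loss of accuracy. This offers a new direction for efficient vision-language model applications, particularly those that must process high-resolution UI screens.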
📄 Abstract
Vision-Language Models (VLMs) have shown remarkable performance in User Interface (UI) grounding tasks, driven by their ability to process increasingly high-resolution screenshots. However, screenshots are tokenized into thousands of visual tokens (e.g., about 4700 for 2K resolution), incurring significant computational overhead and diluting attention. In contrast, humans typically focus on regions of interest when interacting with UI. In this work, we pioneer the task of efficient UI grounding. Guided by practical analysis of the task's characteristics and challenges, we propose FocusUI, an efficient UI grounding framework that selects patches most relevant to the instruction while preserving positional continuity for precise grounding. FocusUI addresses two key challenges: (1) Eliminating redundant tokens in visual encoding. We construct patch-level supervision by fusing an instruction-conditioned score with a rule-based UI-graph score that down-weights large homogeneous regions to select distinct and instruction-relevant visual tokens. (2) Preserving positional continuity during visual token selection. We find that general visual token pruning methods suffer from severe accuracy degradation on UI grounding tasks due to broken positional information. We introduce a novel PosPad strategy, which compresses each contiguous sequence of dropped visual tokens into a single special marker placed at the sequence's last index to preserve positional continuity. Comprehensive experiments on four grounding benchmarks demonstrate that FocusUI surpasses GUI-specific baselines. On the ScreenSpot-Pro benchmark, FocusUI-7B achieves a performance improvement of 3.7% over GUI-Actor-7B. Even with only 30% visual token retention, FocusUI-7B drops by only 3.2% while achieving up to 1.44x faster inference and 17% lower peak GPU memory.
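The PosPad rule described in the abstract (one marker per contiguous run of dropped tokens, placed at the run's last index) can be sketched on a flat token list. This is a minimal illustration under stated assumptions: the marker string `<pos_pad>`, the function name `pospad`, and the representation of positions as original sequence indices are hypothetical and not taken from the paper.

```python
from typing import List, Tuple

PAD = "<pos_pad>"  # hypothetical marker name, not from the paper


def pospad(tokens: List[str], keep: List[bool]) -> List[Tuple[int, str]]:
    """Return (original_index, token) pairs after token selection.

    Kept tokens retain their original index. Each contiguous run of
    dropped tokens is replaced by a single PAD marker that carries the
    index of the last token in that run.
    """
    if len(tokens) != len(keep):
        raise ValueError("tokens and keep must have the same length")
    out: List[Tuple[int, str]] = []
    for i, (tok, kept) in enumerate(zip(tokens, keep)):
        if kept:
            out.append((i, tok))
            continue
        # A dropped run ends at the sequence end or just before a kept token.
        run_ends_here = (i + 1 == len(tokens)) or keep[i + 1]
        if run_ends_here:
            out.append((i, PAD))
    return out


example = pospad(list("abcdefgh"),
                 [True, False, False, True, True, False, True, False])
```

Because every kept token keeps its original index and each gap is still represented by one entry, the shortened sequence retains a monotone position signal, which is the property the abstract credits for avoiding the accuracy loss seen with general pruning.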
[17] RadDiff: Describing Differences in Radiology Image Sets with Natural Language
Xiaoxian Shen, Yuhui Zhang, Sahithi Ankireddy, Xiaohan Wang, Maya Varma, Henry Guo, Curtis Langlotz, Serena Yeung-Levy
🧩 TL;DR
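This paper presents RadDiff, a multimodal agentic system that performs radiologist-style comparative reasoning to describe clinically meaningful differences between paired radiology studies, and builds the RadDiffBench benchmark to systematically evaluate methods for discovering differences in radiology data.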
📘 Detailed Summary
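Motivation: Understanding how two radiology image sets differ is critical for generating clinical insights and interpreting medical AI systems, yet methods and benchmarks for systematically discovering meaningful differences in radiology data are lacking.
Method: RadDiff builds on the proposer-ranker framework from VisDiff and adds four innovations: medical knowledge injection through domain-adapted vision-language models; multimodal reasoning that integrates images with their clinical reports; iterative hypothesis refinement across multiple reasoning rounds; and targeted visual search that localizes and zooms in on salient regions to capture subtle findings.
Result: On RadDiffBench, which comprises 57 expert-validated radiology study pairs, RadDiff achieves 47% accuracy, and 50% when guided by ground-truth reports, significantly outperforming the general-domain VisDiff baseline. It also shows versatility across diverse clinical tasks, including COVID-19 phenotype comparison, racial subgroup analysis, and discovery of survival-related imaging features.
Conclusion: Together, RadDiff and RadDiffBench provide the first method-and-benchmark foundation for systematically uncovering meaningful differences in radiological data. They show the potential of multimodal agentic systems for comparative radiology analysis and open new avenues for clinical insight generation and the interpretation of medical AI systems.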
📄 Abstract
Understanding how two radiology image sets differ is critical for generating clinical insights and for interpreting medical AI systems. We introduce RadDiff, a multimodal agentic system that performs radiologist-style comparative reasoning to describe clinically meaningful differences between paired radiology studies. RadDiff builds on a proposer-ranker framework from VisDiff, and incorporates four innovations inspired by real diagnostic workflows: (1) medical knowledge injection through domain-adapted vision-language models; (2) multimodal reasoning that integrates images with their clinical reports; (3) iterative hypothesis refinement across multiple reasoning rounds; and (4) targeted visual search that localizes and zooms in on salient regions to capture subtle findings. To evaluate RadDiff, we construct RadDiffBench, a challenging benchmark comprising 57 expert-validated radiology study pairs with ground-truth difference descriptions. On RadDiffBench, RadDiff achieves 47% accuracy, and 50% accuracy when guided by ground-truth reports, significantly outperforming the general-domain VisDiff baseline. We further demonstrate RadDiff's versatility across diverse clinical tasks, including COVID-19 phenotype comparison, racial subgroup analysis, and discovery of survival-related imaging features. Together, RadDiff and RadDiffBench provide the first method-and-benchmark foundation for systematically uncovering meaningful differences in radiological data.
[18] Detecting AI-Generated Images via Distributional Deviations from Real Images
Yakun Niu, Yingjian Chen, Lei Zhang
🧩 TL;DR
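This paper proposes a Masking-based Pre-trained model Fine-Tuning (MPFT) strategy whose Texture-Aware Masking (TAM) mechanism masks textured regions containing generative-model-specific patterns during fine-tuning, improving how well CLIP-ViT generalizes when detecting AI-generated images and significantly outperforming existing methods.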
📘 Detailed Summary
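Motivation: As generative models advance rapidly, the quality of AI-generated images has improved markedly, raising concerns about misinformation and the erosion of public trust. Detecting AI-generated images has become a critical challenge, particularly when generalizing to unseen generative models. Existing methods built on frozen pre-trained CLIP models show promise in generalization but treat the image encoder as a basic feature extractor, failing to fully exploit its potential and lacking an effective ability to separate real from AI-generated images.
Method: The paper first analyzes the frozen CLIP image encoder (CLIP-ViT) in depth and finds that it clusters real images effectively in a high-level, abstract feature space but cannot truly distinguish real images from AI-generated ones. Based on this analysis, it proposes the MPFT strategy, which introduces a Texture-Aware Masking (TAM) mechanism to mask textured regions containing generative-model-specific patterns during fine-tuning. This compels CLIP-ViT to attend to the "distributional deviations" of AI-generated images from real images, which improves generalization.
Result: Extensive experiments on the GenImage and UniversalFakeDetect datasets show that the method, fine-tuned on only a small number of images, significantly outperforms existing approaches, reaching up to 98.2% and 94.6% average accuracy on the two datasets, respectively.
Conclusion: The study reveals both the potential and the limits of CLIP-ViT for AI-generated image detection and proposes an effective fine-tuning strategy to improve its generalization. By masking generative-model-specific patterns, Texture-Aware Masking lets the model learn more robust discriminative features, offering a new technical direction for AI-generated image detection with practical value.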
📄 Abstract
The rapid advancement of generative models has significantly enhanced the quality of AI-generated images, raising concerns about misinformation and the erosion of public trust. Detecting AI-generated images has thus become a critical challenge, particularly in terms of generalizing to unseen generative models. Existing methods using frozen pre-trained CLIP models show promise in generalization but treat the image encoder as a basic feature extractor, failing to fully exploit its potential. In this paper, we perform an in-depth analysis of the frozen CLIP image encoder (CLIP-ViT), revealing that it effectively clusters real images in a high-level, abstract feature space. However, it does not truly possess the ability to distinguish between real and AI-generated images. Based on this analysis, we propose a Masking-based Pre-trained model Fine-Tuning (MPFT) strategy, which introduces a Texture-Aware Masking (TAM) mechanism to mask textured areas containing generative model-specific patterns during fine-tuning. This approach compels CLIP-ViT to attend to the "distributional deviations" from authentic images for AI-generated image detection, thereby achieving enhanced generalization performance. Extensive experiments on the GenImage and UniversalFakeDetect datasets demonstrate that our method, fine-tuned with only a minimal number of images, significantly outperforms existing approaches, achieving up to 98.2% and 94.6% average accuracy on the two datasets, respectively.
[19] Analyzing Reasoning Consistency in Large Multimodal Models under Cross-Modal Conflicts
Zhihao Zhu, Jiafeng Liang, Shixin Jiang, Jinlan Fu, Ming Liu, Guanglu Sun, See-Kiong Ng, Bing Qin
🧩 TL;DR
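This paper proposes Active Visual-Context Refinement, which combines an active visual re-grounding mechanism with an adaptive context refinement strategy to mitigate the textual inertia that large multimodal models exhibit in video reasoning, significantly suppressing hallucination propagation and improving reasoning robustness.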
📘 Detailed Summary
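Motivation: Large multimodal models (LMMs) show strong chain-of-thought capabilities in video reasoning, but the robustness of their reasoning chains is questionable. The study identifies a critical failure mode called textual inertia: once a textual hallucination appears in the thinking process, the model tends to follow the erroneous text blindly and ignore conflicting visual evidence, so the error keeps propagating through the reasoning chain.
Method: The study first proposes the LogicGraph Perturbation Protocol, which structurally injects perturbations into the reasoning chains of diverse LMMs to systematically evaluate their self-reflection capabilities. To mitigate textual inertia, it introduces Active Visual-Context Refinement, a training-free inference paradigm that orchestrates an active visual re-grounding mechanism for fine-grained verification together with an adaptive context refinement strategy that summarizes and denoises the reasoning history.
Result: Existing models successfully self-correct in fewer than 10% of cases under textual inertia and mostly propagate textual errors blindly. Active Visual-Context Refinement significantly suppresses hallucination propagation and improves reasoning robustness across LMMs spanning both native reasoning architectures and prompt-driven paradigms.
Conclusion: The study exposes textual inertia in the reasoning chains of large multimodal models and shows how severe it is, and its training-free inference paradigm offers an effective remedy. It underlines the importance of verifying visual evidence in complex reasoning tasks and provides insights and a methodological basis for building more robust multimodal reasoning systems.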
📄 Abstract
Large Multimodal Models (LMMs) have demonstrated impressive capabilities in video reasoning via Chain-of-Thought (CoT). However, the robustness of their reasoning chains remains questionable. In this paper, we identify a critical failure mode termed textual inertia, where once a textual hallucination occurs in the thinking process, models tend to blindly adhere to the erroneous text while neglecting conflicting visual evidence. To systematically investigate this, we propose the LogicGraph Perturbation Protocol that structurally injects perturbations into the reasoning chains of diverse LMMs spanning both native reasoning architectures and prompt-driven paradigms to evaluate their self-reflection capabilities. The results reveal that models successfully self-correct in less than 10% of cases and predominantly succumb to blind textual error propagation. To mitigate this, we introduce Active Visual-Context Refinement, a training-free inference paradigm which orchestrates an active visual re-grounding mechanism to enforce fine-grained verification coupled with an adaptive context refinement strategy to summarize and denoise the reasoning history. Experiments demonstrate that our approach significantly stifles hallucination propagation and enhances reasoning robustness.
[20] Can LLMs See Without Pixels? Benchmarking Spatial Intelligence from Textual Descriptions
Zhongbin Guo, Zhen Yang, Yushan Li, Xinyue Zhang, Wenyu Gao, Jiacheng Wang, Chengzhi Li, Xiangrui Liu, Ping Jian
🧩 TL;DR
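This study introduces SiT-Bench, a benchmark for evaluating the spatial intelligence of text-only large language models. By converting visual scenes into coordinate-aware textual descriptions, it reveals a significant gap in the spatial reasoning of current LLMs and shows that explicit spatial reasoning significantly improves performance.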
📘 Detailed Summary
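Motivation: Current spatial intelligence research relies mainly on vision-language models, but a key question remains open: does spatial understanding originate from the visual encoder or from the underlying reasoning backbone? This study examines whether large language models can reason spatially without pixel-level input and quantifies their spatial intelligence.
Method: The study proposes SiT-Bench, which contains over 3,800 expert-annotated items across 5 primary categories and 17 subtasks, including egocentric navigation, perspective transformation, and fine-grained robotic manipulation. Single-view and multi-view scenes are converted into high-fidelity, coordinate-aware textual descriptions, which forces LLMs to perform symbolic textual reasoning rather than visual pattern matching.
Result: Evaluation of state-of-the-art LLMs shows that models are proficient at localized semantic tasks but exhibit a significant "spatial gap" in global consistency. Notably, explicit spatial reasoning significantly improves performance, suggesting that LLMs possess latent world-modeling potential.
Conclusion: SiT-Bench serves as a foundational resource for developing spatially grounded LLM backbones for future vision-language models and embodied agents. The results indicate that LLMs have latent world-modeling potential and that explicit spatial reasoning effectively improves performance, pointing toward spatial intelligence systems that do not depend on visual input.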
📄 Abstract
Recent advancements in Spatial Intelligence (SI) have predominantly relied on Vision-Language Models (VLMs), yet a critical question remains: does spatial understanding originate from visual encoders or the fundamental reasoning backbone? Inspired by this question, we introduce SiT-Bench, a novel benchmark designed to evaluate the SI performance of Large Language Models (LLMs) without pixel-level input. It comprises over 3,800 expert-annotated items across five primary categories and 17 subtasks, ranging from egocentric navigation and perspective transformation to fine-grained robotic manipulation. By converting single/multi-view scenes into high-fidelity, coordinate-aware textual descriptions, we challenge LLMs to perform symbolic textual reasoning rather than visual pattern matching. Evaluation results of state-of-the-art (SOTA) LLMs reveal that while models achieve proficiency in localized semantic tasks, a significant "spatial gap" remains in global consistency. Notably, we find that explicit spatial reasoning significantly boosts performance, suggesting that LLMs possess latent world-modeling potential. Our proposed dataset SiT-Bench serves as a foundational resource to foster the development of spatially-grounded LLM backbones for future VLMs and embodied agents. Our code and benchmark will be released at https://github.com/binisalegend/SiT-Bench .
[21] Unveiling Text in Challenging Stone Inscriptions: A Character-Context-Aware Patching Strategy for Binarization
Pratyush Jena, Amal Joseph, Arnav Sharma, Ravi Kiran Sarvadevabhatla
🧩 TL;DR
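This paper proposes a robust, adaptive patching strategy for binarizing historical stone inscriptions. Combined with an Attention U-Net, it significantly improves binarization under low contrast and severe surface degradation and shows zero-shot generalization across scripts.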
📘 Detailed Summary
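Motivation: Stone inscription images are hard to binarize because of poor contrast between etched characters and the stone background, non-uniform surface degradation, distracting artifacts, and highly variable text density and layout. Existing binarization techniques therefore often fail or struggle to isolate coherent character regions.
Method: The paper proposes a robust, adaptive patching strategy for challenging Indic inscriptions. Dynamic sampling and patch selection ensure that the model learns to overcome surface noise and layout irregularities. The resulting patches are used to train an Attention U-Net, whose attention mechanism lets the model focus on subtle structural cues.
Result: Experiments show that the patching mechanism significantly improves binarization performance across both classical and deep learning baselines. Despite being trained only on a single-script Indic dataset, the model generalizes well zero-shot to other Indic and non-Indic scripts, highlighting its robustness and script-agnostic generalization.
Conclusion: By producing clean, structured representations of inscription content, the method lays the foundation for downstream tasks such as script identification, OCR, and historical text analysis. The study also introduces a carefully annotated, pixel-precise dataset of Indic stone inscriptions at the character-fragment level, an important resource for related research.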
📄 Abstract
Binarization is a popular first step towards text extraction in historical artifacts. Stone inscription images pose severe challenges for binarization due to poor contrast between etched characters and the stone background, non-uniform surface degradation, distracting artifacts, and highly variable text density and layouts. These conditions frequently cause existing binarization techniques to fail or to struggle to isolate coherent character regions. Many approaches sub-divide the image into patches to improve text fragment resolution and binarization performance. With this in mind, we present a robust and adaptive patching strategy to binarize challenging Indic inscriptions. The patches from our approach are used to train an Attention U-Net for binarization. The attention mechanism allows the model to focus on subtle structural cues, while our dynamic sampling and patch selection method ensures that the model learns to overcome surface noise and layout irregularities. We also introduce a carefully annotated, pixel-precise dataset of Indic stone inscriptions at the character-fragment level. We demonstrate that our novel patching mechanism significantly boosts binarization performance across classical and deep learning baselines. Despite training only on a single-script Indic dataset, our model exhibits strong zero-shot generalization to other Indic and non-Indic scripts, highlighting its robustness and script-agnostic generalization capabilities. By producing clean, structured representations of inscription content, our method lays the foundation for downstream tasks such as script identification, OCR, and historical text analysis. Project page: https://ihdia.iiit.ac.in/shilalekhya-binarization/
[22] Pixel-Wise Multimodal Contrastive Learning for Remote Sensing Images
Leandro Stival, Ricardo da Silva Torres, Helio Pedrini
🧩 TL;DR
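This study proposes PIMC, a novel multimodal self-supervised method that converts satellite image time series into two-dimensional recurrence plot representations and applies contrastive learning together with remote sensing imagery, significantly improving feature extraction for Earth observation tasks.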
📘 Detailed Summary
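Motivation: Satellites continuously generate massive volumes of Earth observation data, in particular satellite image time series (SITS). Most existing deep learning models are designed to process entire images or complete time series and lack an effective way to encode pixel-level temporal dynamics, which limits their ability to extract meaningful features from SITS for downstream tasks.
Method: The study proposes a multimodal approach based on pixel-wise two-dimensional representations. It first converts pixel-based vegetation index time series into recurrence plots as an alternative to raw pixel values. It then introduces PIxel-wise Multimodal Contrastive (PIMC), a self-supervised method that combines the two-dimensional pixel time series representations with remote sensing imagery (RSI) and trains effective encoders through contrastive learning.
Result: The method is validated on three downstream tasks: pixel-level forecasting and classification on the PASTIS dataset, and land cover classification on the EuroSAT dataset. It outperforms state-of-the-art methods on all tasks. The 2D representations significantly improve feature extraction from SITS, and contrastive learning improves the quality of representations for both pixel time series and RSI.
Conclusion: The study demonstrates the advantages of a multimodal approach to satellite image time series and remote sensing imagery, establishing a robust self-supervised framework that handles both data types, provides more effective feature extraction for Earth observation, and applies broadly across downstream tasks.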
📄 Abstract
Satellites continuously generate massive volumes of data, particularly for Earth observation, including satellite image time series (SITS). However, most deep learning models are designed to process either entire images or complete time series sequences to extract meaningful features for downstream tasks. In this study, we propose a novel multimodal approach that leverages pixel-wise two-dimensional (2D) representations to encode visual property variations from SITS more effectively. Specifically, we generate recurrence plots from pixel-based vegetation index time series (NDVI, EVI, and SAVI) as an alternative to using raw pixel values, creating more informative representations. Additionally, we introduce PIxel-wise Multimodal Contrastive (PIMC), a new multimodal self-supervision approach that produces effective encoders based on two-dimensional pixel time series representations and remote sensing imagery (RSI). To validate our approach, we assess its performance on three downstream tasks: pixel-level forecasting and classification using the PASTIS dataset, and land cover classification on the EuroSAT dataset. Moreover, we compare our results to state-of-the-art (SOTA) methods on all downstream tasks. Our experimental results show that the use of 2D representations significantly enhances feature extraction from SITS, while contrastive learning improves the quality of representations for both pixel time series and RSI. These findings suggest that our multimodal method outperforms existing models in various Earth observation tasks, establishing it as a robust self-supervision framework for processing both SITS and RSI. Code available on
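To make the recurrence-plot idea concrete, the sketch below turns a single pixel's vegetation index series into a binary 2D matrix in which entry (i, j) is 1 when the values at times i and j are within a tolerance. This is the textbook thresholded form for a scalar series. The threshold `eps`, the function name, and the sample NDVI values are illustrative assumptions; the abstract does not say which variant the authors use.

```python
from typing import List, Sequence


def recurrence_plot(series: Sequence[float], eps: float) -> List[List[int]]:
    """Binary recurrence plot of a scalar time series.

    R[i][j] = 1 when |series[i] - series[j]| <= eps, else 0.
    The result is symmetric with ones on the diagonal.
    """
    n = len(series)
    return [
        [1 if abs(series[i] - series[j]) <= eps else 0 for j in range(n)]
        for i in range(n)
    ]


# Hypothetical NDVI values for one pixel over six acquisition dates.
ndvi = [0.20, 0.50, 0.21, 0.80, 0.52, 0.19]
plot = recurrence_plot(ndvi, eps=0.05)
```

The resulting matrix can be treated as a small image, which is what lets an image encoder consume a per-pixel time series alongside ordinary remote sensing imagery.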
[23] VideoMemory: Toward Consistent Video Generation via Memory Integration
Jinsong Zhou, Yihua Du, Xinli Xu, Luozhou Wang, Zijie Zhuang, Yehang Zhang, Shuaibo Li, Xiaojun Hu, Bolan Su, Ying-cong Chen
🧩 TL;DR
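This paper proposes VideoMemory, an entity-centric video generation framework that uses a dynamic memory bank to keep characters, props, and backgrounds consistent across multi-shot narrative videos, addressing the difficulty existing models have in preserving entity identity and appearance in long-form generation.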
📘 Detailed Summary
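Motivation: A central challenge in narrative video generation is keeping the identity and appearance of characters, props, and environments consistent across shots. Existing models can generate high-quality short clips but often fail to maintain a consistent representation of entities when scenes change or when entities reappear after long temporal gaps.
Method: VideoMemory is an entity-centric framework that integrates narrative planning with visual generation through a Dynamic Memory Bank. A multi-agent system decomposes the narrative into shots, retrieves entity representations from memory, and synthesizes keyframes and videos conditioned on the retrieved states. The memory bank stores explicit visual and semantic descriptors for characters, props, and backgrounds, and is updated after each shot to reflect story-driven changes while preserving identity.
Result: The study builds a 54-case multi-shot consistency benchmark covering character-, prop-, and background-persistent scenarios. Extensive experiments show that VideoMemory achieves strong entity-level coherence and high perceptual quality across diverse narrative sequences.
Conclusion: The retrieval-update mechanism enables consistent portrayal of entities across distant shots and supports coherent long-form video generation. The work offers an effective solution for entity consistency in narrative video generation and provides a new evaluation benchmark for the area.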
📄 Abstract
Maintaining consistent characters, props, and environments across multiple shots is a central challenge in narrative video generation. Existing models can produce high-quality short clips but often fail to preserve entity identity and appearance when scenes change or when entities reappear after long temporal gaps. We present VideoMemory, an entity-centric framework that integrates narrative planning with visual generation through a Dynamic Memory Bank. Given a structured script, a multi-agent system decomposes the narrative into shots, retrieves entity representations from memory, and synthesizes keyframes and videos conditioned on these retrieved states. The Dynamic Memory Bank stores explicit visual and semantic descriptors for characters, props, and backgrounds, and is updated after each shot to reflect story-driven changes while preserving identity. This retrieval-update mechanism enables consistent portrayal of entities across distant shots and supports coherent long-form generation. To evaluate this setting, we construct a 54-case multi-shot consistency benchmark covering character-, prop-, and background-persistent scenarios. Extensive experiments show that VideoMemory achieves strong entity-level coherence and high perceptual quality across diverse narrative sequences.
[24] MGPC: Multimodal Network for Generalizable Point Cloud Completion With Modality Dropout and Progressive Decoding
Jiangyuan Liu, Hongxuan Ma, Yuhao Zhao, Zhe Liu, Jian Wang, Wei Zou
🧩 TL;DR
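This paper proposes MGPC, a generalizable multimodal point cloud completion framework that integrates point clouds, RGB images, and text in a unified architecture to address the limited generalization of existing methods to novel objects and real-world scenes. It introduces a modality dropout strategy, a Transformer-based fusion module, and a progressive generator, and validates them on the newly built large-scale dataset MGPC-1M.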
📘 Detailed Summary
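Motivation: Existing learning-based point cloud completion methods, including 3D CNN-based, point-based, and Transformer-based approaches, perform well on synthetic benchmarks. Because of limitations in modality, scalability, and generative capacity, however, their generalization to novel objects and real-world scenes remains limited. This study targets that generalization problem with a framework intended to adapt to diverse real-world conditions.
Method: MGPC integrates three modalities (point clouds, RGB images, and text) in a unified architecture. Its key components are a modality dropout strategy to improve robustness, a Transformer-based fusion module to combine multimodal information, and a progressive generator to strengthen geometric modeling. The study also develops an automatic data generation pipeline and builds MGPC-1M, a large-scale benchmark with over 1,000 categories and one million training pairs.
Result: Extensive experiments on MGPC-1M and in-the-wild data show that the method consistently outperforms prior baselines and generalizes well under real-world conditions, supporting the value of multimodal integration and of the proposed components.
Conclusion: The study shows that multimodal integration matters for the generalization of point cloud completion, and MGPC offers an effective approach to real-world completion. The modality dropout strategy and progressive generator provide useful reference points for future multimodal 3D understanding research, and the large-scale dataset lays groundwork for the field.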
📄 Abstract
Point cloud completion aims to recover complete 3D geometry from partial observations caused by limited viewpoints and occlusions. Existing learning-based works, including 3D Convolutional Neural Network (CNN)-based, point-based, and Transformer-based methods, have achieved strong performance on synthetic benchmarks. However, due to the limitations of modality, scalability, and generative capacity, their generalization to novel objects and real-world scenarios remains challenging. In this paper, we propose MGPC, a generalizable multimodal point cloud completion framework that integrates point clouds, RGB images, and text within a unified architecture. MGPC introduces an innovative modality dropout strategy, a Transformer-based fusion module, and a novel progressive generator to improve robustness, scalability, and geometric modeling capability. We further develop an automatic data generation pipeline and construct MGPC-1M, a large-scale benchmark with over 1,000 categories and one million training pairs. Extensive experiments on MGPC-1M and in-the-wild data demonstrate that the proposed method consistently outperforms prior baselines and exhibits strong generalization under real-world conditions.
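The abstract names a modality dropout strategy without giving details. A common reading is that auxiliary modalities are randomly withheld during training so the model does not become dependent on any one of them. The sketch below illustrates that reading only: the drop probability, the dictionary layout, and the choice to always keep the partial point cloud are assumptions, not the authors' design.

```python
import random
from typing import Any, Dict, Optional, Tuple


def modality_dropout(inputs: Dict[str, Any],
                     p_drop: float,
                     rng: random.Random,
                     always_keep: Tuple[str, ...] = ("points",)
                     ) -> Dict[str, Optional[Any]]:
    """Randomly replace auxiliary modalities with None.

    Assumption: the partial point cloud ("points") is the input being
    completed, so it is never dropped; image and text are each dropped
    independently with probability `p_drop`.
    """
    out: Dict[str, Optional[Any]] = {}
    for name, feature in inputs.items():
        if name in always_keep or rng.random() >= p_drop:
            out[name] = feature
        else:
            out[name] = None  # downstream fusion must handle a missing modality
    return out


sample = {"points": [[0.0, 0.1, 0.2]], "image": "rgb.png", "text": "a wooden chair"}
all_dropped = modality_dropout(sample, p_drop=1.0, rng=random.Random(0))
none_dropped = modality_dropout(sample, p_drop=0.0, rng=random.Random(0))
```

At inference time the same interface lets a caller pass only the modalities that are actually available, which is one plausible reason such a strategy would help with in-the-wild data.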
[25] BREATH-VL: Vision-Language-Guided 6-DoF Bronchoscopy Localization via Semantic-Geometric Fusion
Qingyao Tian, Bingyu Yang, Huai Liao, Xinyan Huang, Junyong Li, Dong Yi, Hongbin Liu
🧩 TL;DR
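This paper proposes BREATH-VL, a hybrid framework that combines the semantic understanding of vision-language models with geometric information from vision-based registration methods for 6-DoF endoscopic camera localization, and achieves state-of-the-art performance on the newly built BREATH dataset.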
📘 Detailed Summary
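Motivation: Applying vision-language models to 6-DoF endoscopic camera localization faces three main challenges: the lack of large-scale, high-quality, densely annotated, localization-oriented vision-language datasets from real medical settings; limited capability for fine-grained pose regression; and high computational latency when extracting temporal features from past frames.
Method: The authors first construct the BREATH dataset, the largest in-vivo endoscopic localization dataset to date, collected in the complex human airway. On this basis they propose BREATH-VL, a hybrid framework that combines semantic cues from vision-language models with geometric information from vision-based registration methods for 6-DoF pose estimation. They also introduce a lightweight context-learning mechanism that encodes motion history as linguistic prompts, enabling efficient temporal reasoning without expensive video-level computation.
Result: Experiments show that the vision-language module delivers robust semantic localization in challenging surgical scenes. BREATH-VL outperforms state-of-the-art vision-only localization methods in both accuracy and generalization, reducing translational error by 25.5% compared with the best baseline while keeping computational latency competitive.
Conclusion: The study demonstrates the complementary strengths of vision-language models and geometric registration in endoscopic localization: the former provide generalizable semantic understanding and the latter precise geometric alignment. The lightweight context-learning mechanism offers an effective route to efficient temporal reasoning and opens new possibilities for medical navigation and localization.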
📄 Abstract
Vision-language models (VLMs) have recently shown remarkable performance in navigation and localization tasks by leveraging large-scale pretraining for semantic understanding. However, applying VLMs to 6-DoF endoscopic camera localization presents several challenges: 1) the lack of large-scale, high-quality, densely annotated, and localization-oriented vision-language datasets in real-world medical settings; 2) limited capability for fine-grained pose regression; and 3) high computational latency when extracting temporal features from past frames. To address these issues, we first construct the BREATH dataset, the largest in-vivo endoscopic localization dataset to date, collected in the complex human airway. Building on this dataset, we propose BREATH-VL, a hybrid framework that integrates semantic cues from VLMs with geometric information from vision-based registration methods for accurate 6-DoF pose estimation. Our motivation lies in the complementary strengths of both approaches: VLMs offer generalizable semantic understanding, while registration methods provide precise geometric alignment. To further enhance the VLM's ability to capture temporal context, we introduce a lightweight context-learning mechanism that encodes motion history as linguistic prompts, enabling efficient temporal reasoning without expensive video-level computation. Extensive experiments demonstrate that the vision-language module delivers robust semantic localization in challenging surgical scenes. Building on this, our BREATH-VL outperforms state-of-the-art vision-only localization methods in both accuracy and generalization, reducing translational error by 25.5% compared with the best-performing baseline, while achieving competitive computational latency.
[26] I2E: From Image Pixels to Actionable Interactive Environments for Text-Guided Image Editing
Jinghan Yu, Junhao Xiao, Chenyu Zhu, Jiaming Li, Jia Li, HanMing Deng, Xirui Wang, Guoli Jia, Jianjun Li, Zhiyuan Ma, Xiang Bai, Bowen Zhou
🧩 TL;DR
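This paper proposes I2E, a novel "Decompose-then-Action" paradigm that recasts image editing as an actionable interaction process within a structured environment, significantly improving performance on complex compositional editing tasks.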
📘 Detailed Summary
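Motivation: Existing text-guided image editing methods rely mainly on an end-to-end pixel-level inpainting paradigm, which is markedly limited on compositional editing tasks that need precise local control and complex multi-object spatial reasoning. The limitations are the implicit coupling of planning and execution, the lack of object-level control granularity, and the reliance on unstructured, pixel-centric modeling.
Method: I2E follows the "Decompose-then-Action" paradigm. A Decomposer first transforms an unstructured image into discrete, manipulable object layers. A physics-aware Vision-Language-Action Agent then parses complex instructions into a series of atomic actions via Chain-of-Thought reasoning, which decouples planning from execution.
Result: Experiments on I2E-Bench and multiple public benchmarks show that I2E significantly outperforms state-of-the-art methods in handling complex compositional instructions, maintaining physical plausibility, and ensuring multi-turn editing stability.
Conclusion: By shifting to structured decomposition and actionable interaction, the study offers a new approach to complex image editing. It emphasizes object-level control, physics-aware reasoning, and the decoupling of planning and execution, pointing toward more interpretable and controllable visual editing systems.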
📄 Abstract
Existing text-guided image editing methods primarily rely on an end-to-end pixel-level inpainting paradigm. Despite its success in simple scenarios, this paradigm still significantly struggles with compositional editing tasks that require precise local control and complex multi-object spatial reasoning. This paradigm is severely limited by 1) the implicit coupling of planning and execution, 2) the lack of object-level control granularity, and 3) the reliance on unstructured, pixel-centric modeling. To address these limitations, we propose I2E, a novel "Decompose-then-Action" paradigm that revisits image editing as an actionable interaction process within a structured environment. I2E utilizes a Decomposer to transform unstructured images into discrete, manipulable object layers and then introduces a physics-aware Vision-Language-Action Agent to parse complex instructions into a series of atomic actions via Chain-of-Thought reasoning. We also construct I2E-Bench, a benchmark designed for multi-instance spatial reasoning and high-precision editing. Experimental results on I2E-Bench and multiple public benchmarks demonstrate that I2E significantly outperforms state-of-the-art methods in handling complex compositional instructions, maintaining physical plausibility, and ensuring multi-turn editing stability.
[27] HemBLIP: A Vision-Language Model for Interpretable Leukemia Cell Morphology Analysis
Julie van Logtestijn, Petru Manescu
🧩 TL;DR
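This study presents HemBLIP, a multimodal vision-language model that generates interpretable morphological descriptions of peripheral blood cells, aiming to address the black-box nature of deep learning models in leukemia diagnosis and to increase clinical trust.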
📘 Detailed Summary
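Motivation: Deep learning models used to assess white blood cell morphology for leukemia diagnosis usually operate as black boxes, which limits clinical trust and adoption. Systems that generate interpretable, morphology-aware descriptions are needed to make diagnosis more transparent.
Method: The study builds a new dataset of 14,000 healthy and leukemic cells paired with expert-derived attribute captions, and adapts a general-purpose vision-language model with two strategies: full fine-tuning and LoRA-based parameter-efficient training. The biomedical foundation model MedGEMMA serves as the comparison baseline.
Result: HemBLIP achieves higher caption quality and morphological accuracy than the MedGEMMA baseline, and LoRA adaptation provides further gains at significantly lower computational cost.
Conclusion: The study shows the promise of vision-language models for transparent and scalable hematological diagnostics and offers a technical path toward clinically trustworthy AI-assisted diagnostic tools. Parameter-efficient fine-tuning is especially well suited to practical deployment in medical settings.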
📄 Abstract
Microscopic evaluation of white blood cell morphology is central to leukemia diagnosis, yet current deep learning models often act as black boxes, limiting clinical trust and adoption. We introduce HemBLIP, a vision-language model designed to generate interpretable, morphology-aware descriptions of peripheral blood cells. Using a newly constructed dataset of 14k healthy and leukemic cells paired with expert-derived attribute captions, we adapt a general-purpose VLM via both full fine-tuning and LoRA-based parameter-efficient training, and benchmark against the biomedical foundation model MedGEMMA. HemBLIP achieves higher caption quality and morphological accuracy, while LoRA adaptation provides further gains with significantly reduced computational cost. These results highlight the promise of vision-language models for transparent and scalable hematological diagnostics.
[28] GeoReason: Aligning Thinking And Answering In Remote Sensing Vision-Language Models Via Logical Consistency Reinforcement Learning
Wenshuai Li, Xiantai Xiang, Zixiao Wen, Guangyao Zhou, Ben Niu, Feng Wang, Lijia Huang, Qiantong Wang, Yuxin Hu
🧩 TL;DR
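This paper presents GeoReason, a framework that addresses logical hallucinations in remote sensing vision-language models by synchronizing internal reasoning with final decisions to improve cognitive reliability in spatial tasks. It uses a two-stage training strategy that combines supervised knowledge initialization with consistency-aware reinforcement learning, significantly improving reasoning reliability and interpretability.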
📘 Detailed Summary
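Motivation: Current remote sensing vision-language models suffer from logical hallucinations in complex spatial tasks: correct answers are derived from flawed reasoning chains or rely on positional shortcuts rather than spatial logic. This decoupling of reasoning and decision undermines reliability in strategic spatial decision-making, so a framework that synchronizes internal thinking with final decisions is needed.
Method: The paper first constructs GeoReason-Bench, a logic-driven dataset of 4,000 reasoning trajectories synthesized from geometric primitives and expert knowledge. It then designs a two-stage training strategy. Stage one, Supervised Knowledge Initialization, equips the model with reasoning syntax and domain expertise. Stage two, Consistency-Aware Reinforcement Learning, uses a novel Logical Consistency Reward that penalizes logical drift via an option permutation strategy, anchoring decisions in verifiable reasoning traces.
Result: Experiments show that GeoReason significantly improves the cognitive reliability and interpretability of remote sensing vision-language models and achieves state-of-the-art performance compared with other advanced methods, supporting the strategy of synchronizing reasoning with decisions.
Conclusion: The study emphasizes the shift from perception-centric recognition toward high-level deductive reasoning. Through logical consistency constraints, GeoReason addresses the reasoning-decision decoupling in remote sensing vision-language models, offering a new paradigm for reliable decisions in complex spatial tasks and a direction for more reliable spatial cognition systems.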
📄 Abstract
The evolution of Remote Sensing Vision-Language Models (RS-VLMs) emphasizes the importance of transitioning from perception-centric recognition toward high-level deductive reasoning to enhance cognitive reliability in complex spatial tasks. However, current models often suffer from logical hallucinations, where correct answers are derived from flawed reasoning chains or rely on positional shortcuts rather than spatial logic. This decoupling undermines reliability in strategic spatial decision-making. To address this, we present GeoReason, a framework designed to synchronize internal thinking with final decisions. We first construct GeoReason-Bench, a logic-driven dataset containing 4,000 reasoning trajectories synthesized from geometric primitives and expert knowledge. We then formulate a two-stage training strategy: (1) Supervised Knowledge Initialization to equip the model with reasoning syntax and domain expertise, and (2) Consistency-Aware Reinforcement Learning to refine deductive reliability. This second stage integrates a novel Logical Consistency Reward, which penalizes logical drift via an option permutation strategy to anchor decisions in verifiable reasoning traces. Experimental results demonstrate that our framework significantly enhances the cognitive reliability and interpretability of RS-VLMs, achieving state-of-the-art performance compared to other advanced methods.
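The abstract does not define the Logical Consistency Reward beyond saying it uses option permutation to penalize logical drift. One plausible, simplified reading is sketched below: re-ask the same multiple-choice question with the options reordered and reward the model for selecting the same option content each time. The function name, the `choose` callback interface, and the reward being a plain agreement fraction are all assumptions made for illustration, not the paper's definition.

```python
import itertools
from typing import Callable, Sequence


def consistency_reward(choose: Callable[[Sequence[str]], int],
                       options: Sequence[str],
                       max_perms: int = 6) -> float:
    """Fraction of option reorderings on which the chosen content is unchanged.

    `choose` stands in for the model: it receives an ordered list of
    options and returns the index it selects. The original ordering
    defines the reference answer and is excluded from the permutations.
    """
    reference = options[choose(options)]
    # permutations() yields the original order first, so skip it.
    perms = list(itertools.islice(itertools.permutations(options), 1, max_perms + 1))
    if not perms:
        return 1.0
    agree = sum(1 for perm in perms if perm[choose(perm)] == reference)
    return agree / len(perms)


options = ["river", "road", "forest"]
content_based = consistency_reward(lambda opts: list(opts).index("river"), options)
position_based = consistency_reward(lambda opts: 0, options)
```

A chooser that tracks content scores 1.0, while one that always picks the first slot scores low, which matches the abstract's concern about answers that rely on positional shortcuts.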
[29] Diffusion-DRF: Differentiable Reward Flow for Video Diffusion Fine-Tuning
Yifan Wang, Yanyu Li, Sergey Tulyakov, Yun Fu, Anil Kag
🧩 TL;DR
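This paper proposes Diffusion-DRF, a differentiable reward flow for fine-tuning video diffusion models. It uses a frozen vision-language model as a training-free critic and backpropagates its feedback directly to improve generation quality, without additional reward models or preference datasets.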
📘 Detailed Summary
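Motivation: Current text-to-video methods based on Direct Preference Optimization rely on non-differentiable signals from human annotations or learned reward models. This makes training label-intensive, bias-prone, and easy to game, often triggering reward hacking and unstable training. A more stable optimization method that does not require training a reward model is needed.
Method: Diffusion-DRF uses a frozen, off-the-shelf vision-language model (VLM) as a training-free critic and backpropagates VLM feedback directly through the diffusion denoising chain, converting logit-level responses into token-aware gradients for optimization. An automated, aspect-structured prompting pipeline obtains reliable multi-dimensional VLM feedback, and gradient checkpointing enables efficient updates through the final denoising steps.
Result: The method improves video quality and semantic alignment while mitigating reward hacking and collapse, without additional reward models or preference datasets. It is model-agnostic and readily generalizes to other diffusion-based generative tasks.
Conclusion: The study shows that a frozen VLM can serve as a training-free critic, giving diffusion model optimization a more stable and efficient alternative that avoids the label dependence and bias of conventional preference optimization and opens a new route for aligning generative models.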
📄 Abstract
Direct Preference Optimization (DPO) has recently improved Text-to-Video (T2V) generation by enhancing visual fidelity and text alignment. However, current methods rely on non-differentiable preference signals from human annotations or learned reward models. This reliance makes training label-intensive, bias-prone, and easy-to-game, which often triggers reward hacking and unstable training. We propose Diffusion-DRF, a differentiable reward flow for fine-tuning video diffusion models using a frozen, off-the-shelf Vision-Language Model (VLM) as a training-free critic. Diffusion-DRF directly backpropagates VLM feedback through the diffusion denoising chain, converting logit-level responses into token-aware gradients for optimization. We propose an automated, aspect-structured prompting pipeline to obtain reliable multi-dimensional VLM feedback, while gradient checkpointing enables efficient updates through the final denoising steps. Diffusion-DRF improves video quality and semantic alignment while mitigating reward hacking and collapse -- without additional reward models or preference datasets. It is model-agnostic and readily generalizes to other diffusion-based generative tasks.
[30] ToTMNet: FFT-Accelerated Toeplitz Temporal Mixing Network for Lightweight Remote Photoplethysmography
Vladimir Frants, Sos Agaian, Karen Panetta
🧩 TL;DR
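This paper proposes ToTMNet, a lightweight remote photoplethysmography (rPPG) architecture that replaces attention with an FFT-accelerated Toeplitz temporal mixing layer, maintaining strong accuracy while substantially reducing computational complexity and parameter count.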
📘 Detailed Summary
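Motivation: Deep learning has made rPPG more robust, but existing methods are often computationally expensive and parameter-heavy, and attention-based temporal modeling scales quadratically with temporal length. A more efficient approach to temporal modeling is needed.
Method: ToTMNet replaces temporal attention with an FFT-accelerated Toeplitz temporal mixing layer, which runs in near-linear time via circulant embedding and FFT-based convolution. The architecture integrates the global Toeplitz temporal operator into a compact gated temporal mixer that combines a local depthwise temporal convolution branch with gated global Toeplitz mixing, enabling efficient long-range temporal filtering with only 63k parameters.
Result: On UBFC-rPPG, ToTMNet reaches a mean absolute error (MAE) of 1.055 bpm with a Pearson correlation of 0.996. In the synthetic-to-real setting (SCAMPS to UBFC-rPPG), it reaches 1.582 bpm MAE with a correlation of 0.994. Ablations confirm that the gating mechanism is important for making effective use of global Toeplitz mixing, especially under domain shift.
Conclusion: Toeplitz-structured temporal mixing is a practical and efficient alternative to attention for rPPG, maintaining strong accuracy at much lower computational cost. Gating matters for cross-domain generalization, and the approach suggests a new direction for lightweight temporal modeling. The study is limited by its use of only two datasets.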
📄 Abstract
Remote photoplethysmography (rPPG) estimates a blood volume pulse (BVP) waveform from facial videos captured by commodity cameras. Although recent deep models improve robustness compared to classical signal-processing approaches, many methods increase computational cost and parameter count, and attention-based temporal modeling introduces quadratic scaling with respect to the temporal length. This paper proposes ToTMNet, a lightweight rPPG architecture that replaces temporal attention with an FFT-accelerated Toeplitz temporal mixing layer. The Toeplitz operator provides a full-sequence temporal receptive field using a linear number of parameters in the clip length and can be applied in near-linear time using circulant embedding and FFT-based convolution. ToTMNet integrates the global Toeplitz temporal operator into a compact gated temporal mixer that combines a local depthwise temporal convolution branch with gated global Toeplitz mixing, enabling efficient long-range temporal filtering while only having 63k parameters. Experiments on two datasets, UBFC-rPPG (real videos) and SCAMPS (synthetic videos), show that ToTMNet achieves strong heart-rate estimation accuracy with a compact design. On UBFC-rPPG intra-dataset evaluation, ToTMNet reaches 1.055 bpm MAE with Pearson correlation 0.996. In a synthetic-to-real setting (SCAMPS to UBFC-rPPG), ToTMNet reaches 1.582 bpm MAE with Pearson correlation 0.994. Ablation results confirm that the gating mechanism is important for effectively using global Toeplitz mixing, especially under domain shift. The main limitation of this preprint study is the use of only two datasets; nevertheless, the results indicate that Toeplitz-structured temporal mixing is a practical and efficient alternative to attention for rPPG.
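The circulant-embedding trick mentioned in the abstract can be sketched independently of the network. An n x n Toeplitz matrix is fully described by its first column and first row (2n - 1 values, hence the linear parameter count), and it can be embedded in a 2n x 2n circulant matrix whose product with a vector is a circular convolution computable by FFT. The code below is a generic NumPy illustration, not ToTMNet's layer; the function names are hypothetical.

```python
import numpy as np


def toeplitz_matvec_fft(first_col, first_row, x):
    """Compute T @ x in O(n log n), where T[i, j] = first_col[i - j] for i >= j
    and first_row[j - i] for i < j (first_row[0] must equal first_col[0])."""
    c = np.asarray(first_col, dtype=float)
    r = np.asarray(first_row, dtype=float)
    x = np.asarray(x, dtype=float)
    n = c.shape[0]
    # First column of the 2n x 2n circulant that contains T in its top-left block:
    # [c_0, ..., c_{n-1}, 0, r_{n-1}, ..., r_1]
    v = np.concatenate([c, [0.0], r[:0:-1]])
    x_padded = np.concatenate([x, np.zeros(n)])
    # Circular convolution via the real FFT, then keep the first n outputs.
    y = np.fft.irfft(np.fft.rfft(v) * np.fft.rfft(x_padded), n=2 * n)
    return y[:n]


def toeplitz_dense(first_col, first_row):
    """Dense O(n^2) reference used only to check the FFT path."""
    n = len(first_col)
    return np.array(
        [[first_col[i - j] if i >= j else first_row[j - i] for j in range(n)]
         for i in range(n)],
        dtype=float,
    )


rng = np.random.default_rng(0)
n = 8
col = rng.standard_normal(n)
row = rng.standard_normal(n)
row[0] = col[0]
signal = rng.standard_normal(n)
fast = toeplitz_matvec_fft(col, row, signal)
slow = toeplitz_dense(col, row) @ signal
```

The two results agree to floating-point tolerance, while the FFT path avoids ever materializing the n x n matrix. This is what allows a full-sequence receptive field without the quadratic cost of attention.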
cs.CL [Back]
[31] Advances and Challenges in Semantic Textual Similarity: A Comprehensive Survey
Lokendra Kumar, Neelesh S. Upadhye, Kannan Piedy
🧩 TL;DR
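This survey of Semantic Textual Similarity (STS) research systematically reviews the field's rapid progress since 2021 across six key areas: transformer-based models, contrastive learning, domain-specific methods, multi-modal methods, graph-based approaches, and knowledge-enhanced techniques.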
📘 Detailed Summary
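Motivation: STS research has expanded rapidly since 2021 but has not been systematically organized. The survey reviews and categorizes recent advances to help researchers and practitioners understand current methods, practical applications, and remaining challenges.
Method: The survey follows a systematic literature review and organizes STS progress into six key areas: transformer-based models (such as FarSSiBERT and DeBERTa-v3), contrastive learning methods (such as AspectCSE), domain-specific solutions (such as CXR-BERT for medical texts and Financial-STS for finance), multi-modal methods, graph-based approaches, and knowledge-enhanced techniques.
Result: The survey finds that recent transformer models such as FarSSiBERT and DeBERTa-v3 achieve notable accuracy, while contrastive methods such as AspectCSE set new benchmarks. Domain-adapted models such as CXR-BERT and Financial-STS show that STS can be effectively customized for specialized fields, and multi-modal, graph-based, and knowledge-integrated models further strengthen semantic understanding and representation.
Conclusion: The survey offers researchers and practitioners insight into current STS methods and highlights emerging trends and future opportunities. By organizing and analyzing these developments, it helps readers navigate a fast-moving field and identifies practical applications and remaining challenges.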
📄 Abstract
Semantic Textual Similarity (STS) research has expanded rapidly since 2021, driven by advances in transformer architectures, contrastive learning, and domain-specific techniques. This survey reviews progress across six key areas: transformer-based models, contrastive learning, domain-focused solutions, multi-modal methods, graph-based approaches, and knowledge-enhanced techniques. Recent transformer models such as FarSSiBERT and DeBERTa-v3 have achieved remarkable accuracy, while contrastive methods like AspectCSE have established new benchmarks. Domain-adapted models, including CXR-BERT for medical texts and Financial-STS for finance, demonstrate how STS can be effectively customized for specialized fields. Moreover, multi-modal, graph-based, and knowledge-integrated models further enhance semantic understanding and representation. By organizing and analyzing these developments, the survey provides valuable insights into current methods, practical applications, and remaining challenges. It aims to guide researchers and practitioners alike in navigating rapid advancements, highlighting emerging trends and future opportunities in the evolving field of STS.
[32] Prompting Underestimates LLM Capability for Time Series Classification
Dan Schumacher, Erfan Nourbakhsh, Rocky Slavin, Anthony Rios
🧩 TL;DR
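This study shows that the poor performance of large language models on time series classification stems from limitations of prompt-based evaluation rather than from a lack of temporal representations in the models themselves. Linear probing shows that LLMs do encode meaningful time series information internally, with performance comparable to specialized time series models.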
📘 Detailed Summary
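Motivation: Prompt-based evaluations show LLMs performing poorly on time series classification, raising doubts about whether they encode meaningful temporal structure. The study asks whether this poor performance reflects insufficient representational capacity or limitations of prompt-based evaluation itself.
Method: The study applies linear probes directly to the LLMs' internal representations and compares them with prompt-based generation. Layer-wise analysis traces how time series information evolves across transformer layers, and the study also examines how visual and multimodal inputs strengthen time series representations.
Result: Zero-shot prompting achieves an average F1 of only 0.15-0.26, close to chance, whereas linear probes raise average F1 to 0.61-0.67. Layer-wise analysis shows that class-discriminative time series information already emerges in early transformer layers and is further amplified by visual and multimodal inputs.
Conclusion: The study reveals a systematic mismatch between prompt-based evaluation and what LLMs represent internally, which leads current evaluations to substantially underestimate the models' time series understanding. The finding bears on how LLMs' temporal understanding should be assessed and points to the need for evaluation methods that better expose what the models actually represent.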
📄 Abstract
Prompt-based evaluations suggest that large language models (LLMs) perform poorly on time series classification, raising doubts about whether they encode meaningful temporal structure. We show that this conclusion reflects limitations of prompt-based generation rather than the model's representational capacity by directly comparing prompt outputs with linear probes over the same internal representations. While zero-shot prompting performs near chance, linear probes improve average F1 from 0.15-0.26 to 0.61-0.67, often matching or exceeding specialized time series models. Layer-wise analyses further show that class-discriminative time series information emerges in early transformer layers and is amplified by visual and multimodal inputs. Together, these results demonstrate a systematic mismatch between what LLMs internally represent and what prompt-based evaluation reveals, leading current evaluations to underestimate their time series understanding.
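A linear probe is simply a linear classifier trained on frozen hidden states. The sketch below assumes the hidden states have already been extracted into a feature matrix and uses closed-form ridge regression onto one-hot labels as a stand-in; the abstract does not say how the authors fit their probes, and the synthetic features here are illustrative only. The names `fit_linear_probe` and `probe_predict` are hypothetical.

```python
import numpy as np


def fit_linear_probe(features, labels, num_classes, ridge=1e-3):
    """Fit a linear map (with a bias column) from frozen features to one-hot labels
    by ridge-regularized least squares. Returns weights of shape (dim + 1, classes)."""
    x = np.hstack([features, np.ones((features.shape[0], 1))])
    y = np.eye(num_classes)[labels]
    gram = x.T @ x + ridge * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ y)


def probe_predict(weights, features):
    """Predict the class with the highest linear score."""
    x = np.hstack([features, np.ones((features.shape[0], 1))])
    return np.argmax(x @ weights, axis=1)


# Synthetic stand-in for layer activations: three well-separated classes.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 20)
features = np.eye(3)[labels] + 0.05 * rng.standard_normal((60, 3))

weights = fit_linear_probe(features, labels, num_classes=3)
accuracy = float(np.mean(probe_predict(weights, features) == labels))
```

In a real study the probe would be fit on a training split and scored on held-out data, once per layer, so that the layer-wise trend in the abstract could be read off directly. The model itself stays frozen throughout, which is why a probe measures what is represented rather than what the model can be prompted to say.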
[33] Mem-Gallery: Benchmarking Multimodal Long-Term Conversational Memory for MLLM Agents
Yuanchen Bei, Tianxin Wei, Xuying Ning, Yanjun Zhao, Zhining Liu, Xiao Lin, Yada Zhu, Hendrik Hamann, Jingrui He, Hanghang Tong
🧩 TL;DR
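This paper introduces Mem-Gallery, a new benchmark for evaluating the long-term memory of multimodal large language model (MLLM) agents in multi-session conversations, together with a systematic evaluation framework that measures performance along three dimensions: memory extraction, memory reasoning, and memory knowledge management.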
📘 Detailed Summary
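Motivation: Existing benchmarks either evaluate multi-session memory in text-only conversations or assess multimodal understanding within localized contexts. None systematically evaluates how multimodal memory is preserved, organized, and evolved across long-term conversational trajectories, which limits progress on long-term memory for MLLM agents in real-world interaction.
Method: The authors build the Mem-Gallery benchmark dataset, which contains high-quality multi-session conversations grounded in both visual and textual information, with long interaction horizons and rich multimodal dependencies. On top of it they propose a systematic evaluation framework that assesses key memory capabilities along three functional dimensions: memory extraction and test-time adaptation, memory reasoning, and memory knowledge management.
Result: Extensive benchmarking across thirteen memory systems yields several key findings: explicit multimodal information retention and memory organization are necessary, limitations in memory reasoning and knowledge management persist, and current models face an efficiency bottleneck. These findings quantify the gap in existing MLLMs' long-term multimodal memory.
Conclusion: The study underscores the importance of MLLM agents that can effectively retain, organize, and evolve multimodal memory. The Mem-Gallery benchmark and evaluation framework give future research an evaluation tool, expose current systems' weaknesses in memory reasoning and knowledge management, and point the way toward better long-term multimodal memory.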
📄 Abstract
Long-term memory is a critical capability for multimodal large language model (MLLM) agents, particularly in conversational settings where information accumulates and evolves over time. However, existing benchmarks either evaluate multi-session memory in text-only conversations or assess multimodal understanding within localized contexts, failing to evaluate how multimodal memory is preserved, organized, and evolved across long-term conversational trajectories. Thus, we introduce Mem-Gallery, a new benchmark for evaluating multimodal long-term conversational memory in MLLM agents. Mem-Gallery features high-quality multi-session conversations grounded in both visual and textual information, with long interaction horizons and rich multimodal dependencies. Building on this dataset, we propose a systematic evaluation framework that assesses key memory capabilities along three functional dimensions: memory extraction and test-time adaptation, memory reasoning, and memory knowledge management. Extensive benchmarking across thirteen memory systems reveals several key findings, highlighting the necessity of explicit multimodal information retention and memory organization, the persistent limitations in memory reasoning and knowledge management, as well as the efficiency bottleneck of current models.
[34] e5-omni: Explicit Cross-modal Alignment for Omni-modal Embeddings
Haonan Chen, Sicheng Gao, Radu Timofte, Tetsuya Sakai, Zhicheng Dou
🧩 TL;DR
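This paper proposes e5-omni, a lightweight explicit alignment recipe that turns off-the-shelf vision-language models into robust omni-modal embedding models. It addresses three common cross-modal alignment issues through temperature calibration, a controllable negative curriculum, and batch whitening.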
📘 Detailed Summary
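Motivation: Current omni-modal embedding models rely heavily on implicit alignment inherited from pretrained vision-language models, which causes three practical problems: similarity scores have modality-dependent sharpness; mixed-modality batches produce an imbalanced hardness distribution among negatives, so many negatives quickly become trivial; and mismatched first- and second-order statistics across modalities make rankings unstable.
Method: e5-omni has three core components: modality-aware temperature calibration to align similarity scales, a controllable negative curriculum with debiasing to focus on confusing negatives while reducing the impact of false negatives, and batch whitening with covariance regularization to better match cross-modal geometry in the shared embedding space.
Result: On the MMEB-V2 and AudioCaps benchmarks, e5-omni delivers consistent gains over strong bi-modal and omni-modal baselines, and the recipe transfers well to other vision-language model backbones, supporting both its effectiveness and its generality.
Conclusion: The study shows that lightweight explicit alignment can effectively address cross-modal inconsistency in omni-modal embeddings, offering a practical way to turn off-the-shelf vision-language models into robust omni-modal retrieval systems. The released model checkpoint gives follow-up work a useful baseline.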
📄 Abstract
Modern information systems often involve different types of items, e.g., a text query, an image, a video clip, or an audio segment. This motivates omni-modal embedding models that map heterogeneous modalities into a shared space for direct comparison. However, most recent omni-modal embeddings still rely heavily on implicit alignment inherited from pretrained vision-language model (VLM) backbones. In practice, this causes three common issues: (i) similarity logits have modality-dependent sharpness, so scores are not on a consistent scale; (ii) in-batch negatives become less effective over time because mixed-modality batches create an imbalanced hardness distribution; as a result, many negatives quickly become trivial and contribute little gradient; and (iii) embeddings across modalities show mismatched first- and second-order statistics, which makes rankings less stable. To tackle these problems, we propose e5-omni, a lightweight explicit alignment recipe that adapts off-the-shelf VLMs into robust omni-modal embedding models. e5-omni combines three simple components: (1) modality-aware temperature calibration to align similarity scales, (2) a controllable negative curriculum with debiasing to focus on confusing negatives while reducing the impact of false negatives, and (3) batch whitening with covariance regularization to better match cross-modal geometry in the shared embedding space. Experiments on MMEB-V2 and AudioCaps show consistent gains over strong bi-modal and omni-modal baselines, and the same recipe also transfers well to other VLM backbones. We release our model checkpoint at https://huggingface.co/Haon-Chen/e5-omni-7B.
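Issue (iii) and component (3) concern first- and second-order statistics, so a small whitening example may help. The sketch below applies plain ZCA whitening to one batch of embeddings, with an `eps * I` term added to the covariance as a simple form of regularization. The abstract does not specify e5-omni's exact formulation, so this is a generic illustration; `batch_whiten` and `eps` are hypothetical names.

```python
import numpy as np


def batch_whiten(embeddings, eps=1e-5):
    """ZCA-whiten a batch: subtract the batch mean (first-order statistic) and
    rescale so the batch covariance (second-order statistic) is close to identity."""
    x = np.asarray(embeddings, dtype=float)
    centered = x - x.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (x.shape[0] - 1)
    cov += eps * np.eye(cov.shape[0])  # regularize so small eigenvalues stay invertible
    eigvals, eigvecs = np.linalg.eigh(cov)
    whitener = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T
    return centered @ whitener


# Synthetic "embeddings" with a non-zero mean and correlated dimensions.
rng = np.random.default_rng(0)
mix = np.array([
    [2.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.5, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.5],
])
raw = rng.standard_normal((200, 4)) @ mix + 3.0
white = batch_whiten(raw)
```

Whitening each modality's batch this way would give text, image, video, and audio embeddings the same mean and covariance before similarities are compared, which is one way to read "better match cross-modal geometry".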
[35] AirNav: A Large-Scale Real-World UAV Vision-and-Language Navigation Dataset with Natural and Diverse Instructions
Hengxing Cai, Yijie Rao, Ligang Huang, Zanyang Zhong, Jinhan Dong, Jingjun Tan, Wenhao Lu, Renxin Zhong
🧩 TL;DR
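This paper presents AirNav, a large-scale UAV vision-and-language navigation benchmark built from real urban aerial data, and AirVLN-R1, a model that combines supervised fine-tuning and reinforcement fine-tuning to improve navigation performance and generalization.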
📘 Detailed Summary
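Motivation: Existing UAV vision-and-language navigation datasets depend on virtual environments, use instructions that lack naturalness, and are limited in scale, which constrains research on and deployment of UAV navigation in real-world settings.
Method: The authors construct the AirNav benchmark from real urban aerial data rather than synthetic environments, with natural and diverse instructions. They also propose AirVLN-R1, a model trained with a combination of supervised fine-tuning and reinforcement fine-tuning to improve performance.
Result: The model's feasibility is preliminarily evaluated through real-world tests. The dataset and code are publicly released, giving follow-up work a reproducible foundation.
Conclusion: With a large-scale benchmark built from real data and a hybrid training strategy, the study offers an effective response to the problems of environmental realism and instruction naturalness in UAV vision-and-language navigation and moves the field toward real-world application.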
📄 Abstract
Existing Unmanned Aerial Vehicle (UAV) Vision-Language Navigation (VLN) datasets face issues such as dependence on virtual environments, lack of naturalness in instructions, and limited scale. To address these challenges, we propose AirNav, a large-scale UAV VLN benchmark constructed from real urban aerial data, rather than synthetic environments, with natural and diverse instructions. Additionally, we introduce AirVLN-R1, which combines Supervised Fine-Tuning and Reinforcement Fine-Tuning to enhance performance and generalization. The feasibility of the model is preliminarily evaluated through real-world tests. Our dataset and code are publicly available.
[36] When Helpers Become Hazards: A Benchmark for Analyzing Multimodal LLM-Powered Safety in Daily Life
Xinyue Lou, Jinan Xu, Jingyi Yin, Xiaolong Wang, Zhaolu Kang, Youwei Liao, Yixuan Wang, Xiangyu Shi, Fengran Mo, Su Yao, Kaiyu Huang
🧩 TL;DR
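This paper introduces SaLAD, a benchmark for evaluating the safety of multimodal large language models. It contains 2,013 real-world image-text samples across 10 common categories and comes with a safety-warning-based evaluation framework that favors informative warnings over generic refusals.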
📘 Detailed Summary
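Motivation: As multimodal large language models become indispensable assistants in daily life, the unsafe content they generate poses a danger to human behavior. Existing work lacks a benchmark that systematically evaluates the safety impact of MLLMs in daily life, in particular one that emphasizes realistic risk exposure, authentic visual inputs, and fine-grained cross-modal reasoning.
Method: The study proposes SaLAD, a multimodal safety benchmark with 2,013 real-world image-text samples across 10 common categories and a balanced design covering both unsafe scenarios and cases of oversensitivity. It ensures that safety risks cannot be inferred from text alone and therefore require cross-modal reasoning. It further proposes a safety-warning-based evaluation framework that encourages models to give clear, informative safety warnings rather than generic refusals.
Result: In an evaluation of 18 MLLMs, the top-performing models achieve a safe response rate of only 57.2% on unsafe queries. Even popular safety alignment methods have limited effect in this setting, revealing how vulnerable current MLLMs are when it comes to identifying dangerous behaviors in daily life. The benchmark dataset is publicly available.
Conclusion: Current multimodal large language models fall well short in recognizing everyday safety risks, and even safety-aligned models struggle with real-world multimodal safety scenarios. The SaLAD benchmark and evaluation framework provide a tool for MLLM safety evaluation and show that finer-grained safety alignment strategies are needed to handle complex cross-modal safety challenges.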
📄 Abstract
As Multimodal Large Language Models (MLLMs) become indispensable assistants in human life, the unsafe content generated by MLLMs poses a danger to human behavior, perpetually overhanging human society like a sword of Damocles. To investigate and evaluate the safety impact of MLLM responses on human behavior in daily life, we introduce SaLAD, a multimodal safety benchmark that contains 2,013 real-world image-text samples across 10 common categories, with a balanced design covering both unsafe scenarios and cases of oversensitivity. It emphasizes realistic risk exposure, authentic visual inputs, and fine-grained cross-modal reasoning, ensuring that safety risks cannot be inferred from text alone. We further propose a safety-warning-based evaluation framework that encourages models to provide clear and informative safety warnings, rather than generic refusals. Results on 18 MLLMs demonstrate that the top-performing models achieve a safe response rate of only 57.2% on unsafe queries. Moreover, even popular safety alignment methods show limited effectiveness in our scenario, revealing the vulnerabilities of current MLLMs in identifying dangerous behaviors in daily life. Our dataset is available at https://github.com/xinyuelou/SaLAD.
[37] Bridging the Discrete-Continuous Gap: Unified Multimodal Generation via Coupled Manifold Discrete Absorbing Diffusion
Yuanfeng Xu, Yuhao Chen, Liang Lin, Guangrun Wang
🧩 TL;DR
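This paper proposes CoM-DAD (Coupled Manifold Discrete Absorbing Diffusion), a novel probabilistic framework that reformulates multimodal generation as a hierarchical dual process decoupling semantic planning from token synthesis, establishing a new paradigm for unified text-image generation.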
📘 Detailed Summary
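Motivation: Generative modeling is split between autoregressive approaches for discrete data (text) and diffusion approaches for continuous data (images), which hinders truly unified multimodal systems. Masked language models offer efficient bidirectional context but lack the generative fidelity of autoregressive models and the semantic continuity of diffusion models, and extending them to multimodal settings brings severe alignment challenges and training instability.
Method: CoM-DAD reformulates multimodal generation as a hierarchical dual process that decouples high-level semantic planning from low-level token synthesis. It first models the semantic manifold with a continuous latent diffusion process, then treats token generation as a discrete absorbing diffusion process regulated by a variable-rate noise schedule and conditioned on the evolving semantic priors. A key innovation is a stochastic mixed-modal transport strategy that aligns different modalities without heavy contrastive dual-encoders.
Result: The method is more stable in training than standard masked modeling and establishes a new paradigm for scalable, unified text-image generation. Experiments validate the framework's effectiveness and robustness on multimodal generation tasks, with notable progress on text-image alignment.
Conclusion: By decoupling semantic planning from token synthesis, CoM-DAD offers a new framework for unified multimodal generation that addresses key limitations of earlier methods in modality alignment and training stability. The work opens a direction toward more efficient and stable cross-modal generation systems.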
📄 Abstract
The bifurcation of generative modeling into autoregressive approaches for discrete data (text) and diffusion approaches for continuous data (images) hinders the development of truly unified multimodal systems. While Masked Language Models (MLMs) offer efficient bidirectional context, they traditionally lack the generative fidelity of autoregressive models and the semantic continuity of diffusion models. Furthermore, extending masked generation to multimodal settings introduces severe alignment challenges and training instability. In this work, we propose CoM-DAD (Coupled Manifold Discrete Absorbing Diffusion), a novel probabilistic framework that reformulates multimodal generation as a hierarchical dual-process. CoM-DAD decouples high-level semantic planning from low-level token synthesis. First, we model the semantic manifold via a continuous latent diffusion process; second, we treat token generation as a discrete absorbing diffusion process, regulated by a Variable-Rate Noise Schedule, conditioned on these evolving semantic priors. Crucially, we introduce a Stochastic Mixed-Modal Transport strategy that aligns disparate modalities without requiring heavy contrastive dual-encoders. Our method demonstrates superior stability over standard masked modeling, establishing a new paradigm for scalable, unified text-image generation.
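For readers unfamiliar with discrete absorbing diffusion, the forward (noising) process can be sketched in a few lines: each token is independently replaced by a mask symbol, and a masked token never reverts. The sketch assumes a simple linear schedule in which the masking probability at time t equals t. CoM-DAD's Variable-Rate Noise Schedule is not specified in the abstract and is not reproduced here; `MASK`, `absorb`, and `absorb_step` are hypothetical names.

```python
import random
from typing import List

MASK = "[MASK]"


def absorb(tokens: List[str], t: float, rng: random.Random) -> List[str]:
    """Sample x_t from x_0: mask each token independently with probability t
    (linear schedule, an assumption made for this sketch)."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def absorb_step(tokens: List[str], t_from: float, t_to: float,
                rng: random.Random) -> List[str]:
    """Advance from time t_from to t_to (0 <= t_from < t_to <= 1, t_from < 1).

    Masked tokens stay masked (the absorbing state). A surviving token is masked
    with probability (t_to - t_from) / (1 - t_from), so that its overall survival
    probability at t_to is 1 - t_to, matching `absorb`.
    """
    rate = (t_to - t_from) / (1.0 - t_from)
    return [tok if tok == MASK or rng.random() >= rate else MASK for tok in tokens]


rng = random.Random(0)
sentence = ["a", "red", "bridge", "over", "water"]
clean = absorb(sentence, 0.0, rng)       # nothing is masked at t = 0
halfway = absorb(sentence, 0.5, rng)     # roughly half the tokens are masked
fully_masked = absorb_step(halfway, 0.5, 1.0, rng)  # everything is masked at t = 1
```

A generative model for this process learns the reverse direction, filling masks back in step by step. In the framework described above, that reverse step would additionally be conditioned on the continuous semantic latent.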
[38] Persona-aware and Explainable Bikeability Assessment: A Vision-Language Model Approach
Yilong Dai, Ziyi Wang, Chenguang Wang, Kexin Zhou, Yiheng Qian, Susu Xu, Xiang Yan
🧩 TL;DR
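This paper proposes a persona-aware vision-language model framework for bikeability assessment. Through theory-grounded persona conditioning, multi-granularity supervised fine-tuning, and AI-enabled data augmentation, it delivers explainable assessment alongside prediction.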
📘 Detailed Summary
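Motivation: Existing perception-based bikeability assessment methods struggle to capture the complexity of road environments and do not adequately account for heterogeneity in users' subjective perceptions, which limits both accuracy and explainability.
Method: The approach makes three core contributions: theory-grounded persona conditioning based on an established cyclist typology, which generates persona-specific explanations via chain-of-thought reasoning; multi-granularity supervised fine-tuning that combines scarce expert-annotated reasoning with abundant user ratings for joint prediction and explainable assessment; and AI-enabled data augmentation that creates controlled paired data to isolate the effects of infrastructure variables.
Result: Using a panoramic image-based crowdsourcing system, the authors collected 12,400 persona-conditioned assessments from 427 cyclists. Experiments show that the framework is competitive at predicting bikeability ratings while uniquely enabling explainable factor attribution.
Conclusion: The study offers a new paradigm for bikeability assessment that combines theoretical grounding with data-driven learning. Beyond improving prediction, it makes the assessment process explainable, giving urban planners deeper decision support.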
📄 Abstract
Bikeability assessment is essential for advancing sustainable urban transportation and creating cyclist-friendly cities, and it requires incorporating users' perceptions of safety and comfort. Yet existing perception-based bikeability assessment approaches face key limitations in capturing the complexity of road environments and adequately accounting for heterogeneity in subjective user perceptions. This paper proposes a persona-aware Vision-Language Model framework for bikeability assessment with three novel contributions: (i) theory-grounded persona conditioning based on established cyclist typology that generates persona-specific explanations via chain-of-thought reasoning; (ii) multi-granularity supervised fine-tuning that combines scarce expert-annotated reasoning with abundant user ratings for joint prediction and explainable assessment; and (iii) AI-enabled data augmentation that creates controlled paired data to isolate infrastructure variable impacts. To test and validate this framework, we developed a panoramic image-based crowdsourcing system and collected 12,400 persona-conditioned assessments from 427 cyclists. Experiment results show that the proposed framework offers competitive bikeability rating prediction while uniquely enabling explainable factor attribution.
cs.AI [Back]
[39] Personalization of Large Foundation Models for Health Interventions
Stefan Konigorski, Johannes E. Vedder, Babajide Alamu Owoyele, İbrahim Özkan
🧩 TL;DR
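This paper examines the limits of large foundation models (LFMs) in personalized medicine. It argues that LFMs cannot replace N-of-1 trials but that the two are complementary: LFMs excel at generating hypotheses from population data, while N-of-1 trials provide individual-level causal validation. It proposes a hybrid framework that combines the strengths of both.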
📘 Detailed Summary
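Motivation: Large foundation models show promise in healthcare AI, but whether they can provide truly personalized treatment recommendations remains an open question. Personalization faces several challenges, including the generalizability paradox (models that perform well in one clinical study perform at chance level in others), the privacy-performance paradox, the scale-specificity paradox, and the automation-empathy paradox. It is also unclear how much causal understanding personalized recommendations require, as opposed to the predictive capacities of LFMs.
Method: The paper proposes a hybrid framework that combines the strengths of LFMs and N-of-1 trials. LFMs use multimodal data to rapidly generate candidate interventions from population patterns, with uncertainty estimates. N-of-1 trials, the gold standard for individual causal inference in personalized medicine, provide within-person causal evidence through crossover self-experiments while preserving privacy. In the framework, LFMs produce ranked intervention candidates that trigger subsequent N-of-1 trials for validation.
Result: The paper argues that LFMs cannot replace N-of-1 trials but that the two are complementary: LFMs excel at rapid hypothesis generation from population patterns, while N-of-1 trials excel at causal validation for a given individual. By clarifying the boundary between prediction and causation and addressing the paradoxes directly, the hybrid framework offers a path to responsible AI integration in personalized medicine.
Conclusion: Personalized medicine requires a clear distinction between predictive capacity and causal understanding, and combining LFMs with N-of-1 trials offers a workable way to address the paradoxes of healthcare AI. Clarifying the boundary between prediction and causation and confronting these paradoxes directly are essential for responsible AI integration. The framework provides conceptual guidance for future healthcare AI system design and places causal validation at the center of personalized treatment.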
📄 Abstract
Large foundation models (LFMs) transform healthcare AI in prevention, diagnostics, and treatment. However, whether LFMs can provide truly personalized treatment recommendations remains an open question. Recent research has revealed multiple challenges for personalization, including the fundamental generalizability paradox: models achieving high accuracy in one clinical study perform at chance level in others, demonstrating that personalization and external validity exist in tension. This exemplifies broader contradictions in AI-driven healthcare: the privacy-performance paradox, scale-specificity paradox, and the automation-empathy paradox. As another challenge, the degree of causal understanding required for personalized recommendations, as opposed to mere predictive capacities of LFMs, remains an open question. N-of-1 trials -- crossover self-experiments and the gold standard for individual causal inference in personalized medicine -- resolve these tensions by providing within-person causal evidence while preserving privacy through local experimentation. Despite their impressive capabilities, this paper argues that LFMs cannot replace N-of-1 trials. We argue that LFMs and N-of-1 trials are complementary: LFMs excel at rapid hypothesis generation from population patterns using multimodal data, while N-of-1 trials excel at causal validation for a given individual. We propose a hybrid framework that combines the strengths of both to enable personalization and navigate the identified paradoxes: LFMs generate ranked intervention candidates with uncertainty estimates, which trigger subsequent N-of-1 trials. Clarifying the boundary between prediction and causation and explicitly addressing the paradoxical tensions are essential for responsible AI integration in personalized medicine.
[40] Current Agents Fail to Leverage World Model as Tool for Foresight
Cheng Qian, Emre Can Acikgoz, Bingxuan Li, Xiusi Chen, Yuji Zhang, Bingxiang He, Qinyu Luo, Dilek Hakkani-Tür, Gokhan Tur, Yunzhu Li, Heng Ji
🧩 TL;DR
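This empirical study finds that current agents built on vision-language models struggle to use generative world models as external simulators to enhance their cognition. The main bottleneck is the agents' inability to decide when to simulate, how to interpret predicted outcomes, and how to integrate foresight into downstream reasoning.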
📘 Detailed Summary
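Motivation: Agents increasingly face tasks that require anticipating future states rather than relying on short-horizon reasoning. Generative world models are seen as a promising remedy, serving as external simulators that let agents foresee outcomes before acting. Whether current agents can actually use such world models as tools to enhance their cognition has not been tested empirically.
Method: The study empirically evaluates, across diverse agentic and visual question answering tasks, how well current agents use generative world models as external simulators. It analyzes how often agents invoke simulation, how they use predicted rollouts, and how performance changes, and applies attribution analysis to identify specific bottlenecks.
Result: Some agents rarely invoke simulation (fewer than 1%), frequently misuse predicted rollouts (approximately 15%), and often show inconsistent or even degraded performance (up to 5%) when simulation is available or enforced. Attribution analysis further indicates that the primary bottleneck is the agents' limited capacity to decide when to simulate, how to interpret predicted outcomes, and how to integrate foresight into downstream reasoning.
Conclusion: Current agents struggle to use world models as tools for cognitive enhancement, which highlights the need for mechanisms that foster calibrated, strategic interaction with world models. The findings point the way toward more reliable anticipatory agent systems and underline the importance of how agents interact with world models.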
📄 Abstract
Agents built on vision-language models increasingly face tasks that demand anticipating future states rather than relying on short-horizon reasoning. Generative world models offer a promising remedy: agents could use them as external simulators to foresee outcomes before acting. This paper empirically examines whether current agents can leverage such world models as tools to enhance their cognition. Across diverse agentic and visual question answering tasks, we observe that some agents rarely invoke simulation (fewer than 1%), frequently misuse predicted rollouts (approximately 15%), and often exhibit inconsistent or even degraded performance (up to 5%) when simulation is available or enforced. Attribution analysis further indicates that the primary bottleneck lies in the agents' capacity to decide when to simulate, how to interpret predicted outcomes, and how to integrate foresight into downstream reasoning. These findings underscore the need for mechanisms that foster calibrated, strategic interaction with world models, paving the way toward more reliable anticipatory cognition in future agent systems.