cs.CV

[1] Dynamic VLM-Guided Negative Prompting for Diffusion Models

Hoyeon Chang, Seungjin Kim, Yoonseok Choi

🧩 TL;DR

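This paper proposes a dynamic negative prompting method based on vision-language models (VLMs). During denoising it generates intermediate image predictions and queries a VLM to produce context-dependent negative prompts, giving more flexible control over image generation than conventional fixed negative prompts.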


📘 Detailed Summary

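Motivation: Negative prompting in diffusion models conventionally uses a fixed negative prompt, which cannot adapt to context changes during generation or adjust negative guidance to the intermediate state. This limits how far image quality and text alignment can be optimized.

Method: The method generates intermediate image predictions at specific denoising steps, then uses a VLM to analyze them and produce context-dependent negative prompts, so negative prompts are generated dynamically and adaptively instead of being preset.

Result: Experiments on several benchmark datasets demonstrate the trade-offs between negative guidance strength and text-image alignment.

Conclusion: Dynamic negative prompting offers finer-grained generation control for diffusion models and shows that querying a VLM for negative prompts during denoising is feasible, pointing toward adaptive image generation methods.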


📄 Abstract

We propose a novel approach for dynamic negative prompting in diffusion models that leverages Vision-Language Models (VLMs) to adaptively generate negative prompts during the denoising process. Unlike traditional negative prompting methods that use fixed negative prompts, our method generates intermediate image predictions at specific denoising steps and queries a VLM to produce contextually appropriate negative prompts. We evaluate our approach on various benchmark datasets and demonstrate the trade-offs between negative guidance strength and text-image alignment.
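The loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `eps_model` and `query_vlm` are hypothetical caller-supplied stubs, the update is a plain deterministic DDIM-style step over contiguous timesteps, and applying the refreshed negative prompt from the next step onward is a guess. The clean-sample estimate and the guidance formula are the standard DDPM and classifier-free guidance expressions.

```python
import math


def predict_x0(x_t, eps_hat, alpha_bar_t):
    """Standard DDPM estimate of the clean sample from noisy x_t and predicted noise."""
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    return [(x - noise * e) / signal for x, e in zip(x_t, eps_hat)]


def guided_eps(eps_pos, eps_neg, scale):
    """Classifier-free guidance with the negative prompt in the unconditional slot."""
    return [n + scale * (p - n) for p, n in zip(eps_pos, eps_neg)]


def dynamic_negative_prompting(x_t, timesteps, query_steps, alpha_bars,
                               eps_model, query_vlm, prompt, scale=7.5):
    """Hypothetical loop: refresh the negative prompt at the chosen steps.

    alpha_bars[t] is the cumulative noise schedule, with alpha_bars[0] == 1.0.
    eps_model(x, t, text) and query_vlm(preview, prompt) are stubs (assumptions).
    """
    negative_prompt = ""   # assumption: start with an empty negative prompt
    used = []              # (timestep, negative prompt used at that step)
    for t in timesteps:    # contiguous, descending, e.g. [3, 2, 1]
        eps_pos = eps_model(x_t, t, prompt)
        eps_neg = eps_model(x_t, t, negative_prompt)
        eps = guided_eps(eps_pos, eps_neg, scale)
        used.append((t, negative_prompt))
        x0 = predict_x0(x_t, eps, alpha_bars[t])
        if t in query_steps:
            # The VLM inspects the intermediate prediction and proposes what to
            # avoid; the new prompt takes effect from the next step (assumption).
            negative_prompt = query_vlm(x0, prompt)
        # Deterministic DDIM-style update to the previous timestep.
        a_prev = alpha_bars[t - 1]
        x_t = [math.sqrt(a_prev) * c + math.sqrt(1.0 - a_prev) * e
               for c, e in zip(x0, eps)]
    return x_t, used
```

In a real pipeline the stubs would wrap a denoiser and a VLM call, and the intermediate prediction would be decoded to an image before being shown to the VLM.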

[2] Security Risk of Misalignment between Text and Image in Multi-modal Model

Xiaosen Wang, Zhijin Ge, Shaokang Wang

🧩 TL;DR

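This paper proposes the PReMA attack, the first to manipulate the output of multi-modal diffusion models solely by crafting adversarial images, exposing the security risk of inadequate text-image alignment in existing diffusion models. The attack is highly effective on image inpainting and style transfer and poses a new threat to applications that use fixed prompts.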


📘 Detailed Summary

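Motivation: Despite notable progress in multi-modal diffusion models, their vulnerability to adversarial inputs remains underexplored. The authors find that text and image modalities are inadequately aligned in existing diffusion models, and that this misalignment creates a significant risk of generating inappropriate content, NSFW content in particular.

Method: The Prompt-Restricted Multi-modal Attack (PReMA) manipulates generated content by modifying the input image while leaving the prompt unchanged. It is the first attack to manipulate model outputs solely through adversarial images, unlike prior methods that mainly craft adversarial prompts.

Result: Comprehensive evaluations on image inpainting and style transfer across various models confirm PReMA's efficacy. It manipulates model outputs effectively and is a particular threat to image-editing applications that operate with fixed prompts.

Conclusion: PReMA exposes a security weakness in the modality alignment of multi-modal diffusion models and a new threat to the integrity of image-editing applications. The work underlines the need for stronger safeguards in multi-modal models, especially where prompts are fixed.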


📄 Abstract

Despite the notable advancements and versatility of multi-modal diffusion models, such as text-to-image models, their susceptibility to adversarial inputs remains underexplored. Contrary to expectations, our investigations reveal that the alignment between textual and image modalities in existing diffusion models is inadequate. This misalignment presents significant risks, especially in the generation of inappropriate or Not-Safe-For-Work (NSFW) content. To this end, we propose a novel attack called Prompt-Restricted Multi-modal Attack (PReMA) to manipulate the generated content by modifying the input image in conjunction with any specified prompt, without altering the prompt itself. PReMA is the first attack that manipulates model outputs by solely creating adversarial images, distinguishing itself from prior methods that primarily generate adversarial prompts to produce NSFW content. Consequently, PReMA poses a novel threat to the integrity of multi-modal diffusion models, particularly in image-editing applications that operate with fixed prompts. Comprehensive evaluations conducted on image inpainting and style transfer tasks across various models confirm the potent efficacy of PReMA.

[3] MIRO: MultI-Reward cOnditioned pretraining improves T2I quality and efficiency

Nicolas Dufour, Lucas Degeorge, Arijit Ghosh, Vicky Kalogeiton, David Picard

🧩 TL;DR

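This paper proposes MIRO, which conditions a text-to-image model on multiple reward models during training so that it learns user preferences directly, markedly improving image quality and speeding up training. It achieves state-of-the-art results on the GenEval compositional benchmark and on user-preference scores.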


📘 Detailed Summary

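Motivation: Current text-to-image models are trained on large uncurated datasets for diverse generation, which does not align well with user preferences. Existing reward-model approaches align to a reward through post-hoc selection of generated images, but this discards informative data and optimizes a single reward, harming diversity, semantic fidelity, and efficiency.

Method: MIRO conditions the model on multiple reward models during training, letting it learn user preferences directly instead of relying on post-processing. This avoids discarding informative data and supports multiple rewards.

Result: MIRO both markedly improves the visual quality of generated images and substantially speeds up training. It achieves state-of-the-art performance on the GenEval compositional benchmark and on user-preference scores (PickAScore, ImageReward, HPSv2).

Conclusion: Integrating multi-reward conditioning into training is more effective than post-hoc selection and preserves generation quality, diversity, and efficiency together. The approach offers a new route to alignment in text-to-image generation and shows the benefit of learning user preferences directly.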


📄 Abstract

Current text-to-image generative models are trained on large uncurated datasets to enable diverse generation capabilities. However, this does not align well with user preferences. Recently, reward models have been specifically designed to perform post-hoc selection of generated images and align them to a reward, typically user preference. Discarding informative data and optimizing for a single reward tend to harm diversity, semantic fidelity, and efficiency. Instead of this post-processing, we propose to condition the model on multiple reward models during training to let the model learn user preferences directly. We show that this not only dramatically improves the visual quality of the generated images but also significantly speeds up training. Our proposed method, called MIRO, achieves state-of-the-art performance on the GenEval compositional benchmark and user-preference scores (PickAScore, ImageReward, HPSv2).

[4] EgoExo-Con: Exploring View-Invariant Video Temporal Understanding

Minjoon Jung, Junbin Xiao, Junghyun Kim, Byoung-Tak Zhang, Angela Yao

🧩 TL;DR

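This paper introduces the EgoExo-Con benchmark for evaluating how consistently Video-LLMs understand temporal content across viewpoints, and View-GRPO, a reinforcement learning framework that improves cross-view consistent reasoning. Existing models show marked deficits in cross-view consistency, and the proposed method improves it more than conventional fine-tuning.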


📘 Detailed Summary

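Motivation: It is unclear whether existing Video-LLMs achieve consistent temporal understanding when videos capture the same event from different viewpoints. The study targets this consistency gap, particularly for events recorded synchronously from egocentric and exocentric views, where a model should reason about time consistently across views.

Method: The EgoExo-Con benchmark contains comprehensively synchronized egocentric and exocentric video pairs with human-refined natural-language queries, and focuses on two tasks: temporal verification and temporal grounding. The proposed View-GRPO reinforcement learning framework strengthens view-specific temporal reasoning while encouraging consistent understanding across views.

Result: The analysis reveals two key limitations of existing Video-LLMs. First, models often fail to remain consistent, scoring far below their single-view performance. Second, naive fine-tuning on synchronized videos of both viewpoints improves consistency but often underperforms models trained on a single view. View-GRPO outperforms naive SFT and GRPO at improving cross-view consistency.

Conclusion: Cross-view consistency of temporal understanding is a significant challenge for Video-LLMs and calls for purpose-built training methods. View-GRPO offers an effective route to consistent multi-view reasoning and highlights the importance of accounting for viewpoint differences in video understanding.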


📄 Abstract

Can Video-LLMs achieve consistent temporal understanding when videos capture the same event from different viewpoints? To study this, we introduce EgoExo-Con (Consistency), a benchmark of comprehensively synchronized egocentric and exocentric video pairs with human-refined queries in natural language. EgoExo-Con emphasizes two temporal understanding tasks: Temporal Verification and Temporal Grounding. It evaluates not only correctness but also consistency across viewpoints. Our analysis reveals two critical limitations of existing Video-LLMs: (1) models often fail to maintain consistency, with results far worse than their single-view performance; (2) when naively finetuned with synchronized videos of both viewpoints, the models show improved consistency but often underperform those trained on a single view. For improvements, we propose View-GRPO, a novel reinforcement learning framework that effectively strengthens view-specific temporal reasoning while encouraging consistent comprehension across viewpoints. Our method demonstrates its superiority over naive SFT and GRPO, especially for improving cross-view consistency. All resources will be made publicly available.
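To make the gap between single-view accuracy and cross-view consistency concrete, here is a small sketch. The abstract does not define the benchmark's exact metric, so this assumes a pair counts as consistent only when the model is correct on both the egocentric and the exocentric video; the function and key names are hypothetical.

```python
def view_consistency(results):
    """Summarize per-view accuracy and both-views-correct accuracy.

    results: list of (ego_correct, exo_correct) booleans, one per synchronized
    video pair. The 'both correct' definition is an assumption for illustration.
    """
    if not results:
        raise ValueError("results must contain at least one video pair")
    n = len(results)
    ego_acc = sum(1 for ego, _ in results if ego) / n
    exo_acc = sum(1 for _, exo in results if exo) / n
    both_acc = sum(1 for ego, exo in results if ego and exo) / n
    return {"ego_acc": ego_acc, "exo_acc": exo_acc, "consistent_acc": both_acc}


# Example: each view alone is right on 3 of 4 pairs, but only 2 of 4 pairs
# are answered correctly from both viewpoints.
scores = view_consistency([(True, True), (True, False), (False, True), (True, True)])
```

Under this definition the consistent score can never exceed the lower of the two single-view accuracies, which is one way a model can look strong per view yet weak across views.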

[5] Generative Image Restoration and Super-Resolution using Physics-Informed Synthetic Data for Scanning Tunneling Microscopy

Nikola L. Kolev, Tommaso Rodani, Neil J. Curson, Taylor J. Z. Stock, Alberto Cazzaniga

🧩 TL;DR

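This study proposes a machine learning approach to image repair and super-resolution for scanning tunnelling microscopy (STM). A physics-informed synthetic data generation pipeline is used to train state-of-the-art flow-matching and diffusion models, which restore images effectively and reduce image acquisition time two- to fourfold.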


📘 Detailed Summary

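Motivation: STM enables atomic-resolution imaging and atom manipulation, but its practical utility is often limited by tip degradation and slow serial data acquisition. Fabrication adds further complexity because the tip is often subjected to large voltages, which can alter the shape of its apex and require it to be conditioned.

Method: Starting from a dataset of only 36 pristine experimental images, a physics-informed synthetic data generation pipeline is used to train several state-of-the-art flow-matching and diffusion models, which are evaluated with quantitative metrics such as the CLIP Maximum Mean Discrepancy (CMMD) score and structural similarity.

Result: The models restore images effectively and, by accurately reconstructing images from sparsely sampled data, reduce image acquisition time two- to fourfold.

Conclusion: By reducing the frequency of tip-conditioning procedures and raising frame rates in existing high-speed STM systems, the framework has the potential to significantly increase STM experimental throughput.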


📄 Abstract

Scanning tunnelling microscopy (STM) enables atomic-resolution imaging and atom manipulation, but its utility is often limited by tip degradation and slow serial data acquisition. Fabrication adds another layer of complexity since the tip is often subjected to large voltages, which may alter the shape of its apex, requiring it to be conditioned. Here, we propose a machine learning (ML) approach for image repair and super-resolution to alleviate both challenges. Using a dataset of only 36 pristine experimental images of Si(001):H, we demonstrate that a physics-informed synthetic data generation pipeline can be used to train several state-of-the-art flow-matching and diffusion models. Quantitative evaluation with metrics such as the CLIP Maximum Mean Discrepancy (CMMD) score and structural similarity demonstrates that our models are able to effectively restore images and offer a two- to fourfold reduction in image acquisition time by accurately reconstructing images from sparsely sampled data. Our framework has the potential to significantly increase STM experimental throughput by offering a route to reducing the frequency of tip-conditioning procedures and to enhancing frame rates in existing high-speed STM systems.

[6] WOD-E2E: Waymo Open Dataset for End-to-End Driving in Challenging Long-tail Scenarios

Runsheng Xu, Hubert Lin, Wonseok Jeon, Hao Feng, Yuliang Zou, Liting Sun, John Gorman, Kate Tolstaya, Sarah Tang, Brandyn White, Ben Sapp, Mingxing Tan, Jyh-Jing Hwang, Drago Anguelov

🧩 TL;DR

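This paper introduces the WOD-E2E dataset, which targets rare long-tail scenarios in autonomous driving, together with the Rater Feedback Score (RFS), a new open-loop evaluation metric based on human rater preferences, to advance research on robust end-to-end driving in complex real-world scenarios.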


📘 Detailed Summary

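Motivation: Current end-to-end driving benchmarks mainly feature nominal scenarios and do not adequately test the true potential of these systems in rare long-tail scenarios. Existing open-loop metrics also fall short in capturing the multi-modal nature of driving and in evaluating long-tail performance.

Method: WOD-E2E contains 4,021 driving segments (about 12 hours) curated for challenging long-tail scenarios with an occurrence frequency below 0.03%. Each segment includes high-level routing information, ego states, and views from 8 surround cameras. The proposed RFS metric is based on rater-annotated trajectory preference labels.

Result: Rater preference labels have been released for the WOD-E2E validation set, while the test set labels were used for the 2025 WOD-E2E Challenge. The dataset and metric provide a standardized benchmark for evaluating end-to-end driving systems in complex long-tail scenarios.

Conclusion: With a dedicated long-tail dataset and a human-preference-based metric, the work lays groundwork for generalizable, robust, and safe end-to-end autonomous driving systems and aims to advance research on complex real-world scenarios.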


📄 Abstract

Vision-based end-to-end (E2E) driving has garnered significant interest in the research community due to its scalability and synergy with multimodal large language models (MLLMs). However, current E2E driving benchmarks primarily feature nominal scenarios, failing to adequately test the true potential of these systems. Furthermore, existing open-loop evaluation metrics often fall short in capturing the multi-modal nature of driving or effectively evaluating performance in long-tail scenarios. To address these gaps, we introduce the Waymo Open Dataset for End-to-End Driving (WOD-E2E). WOD-E2E contains 4,021 driving segments (approximately 12 hours), specifically curated for challenging long-tail scenarios that are rare in daily life, with an occurrence frequency of less than 0.03%. Concretely, each segment in WOD-E2E includes the high-level routing information, ego states, and 360-degree camera views from 8 surrounding cameras. To evaluate the E2E driving performance on these long-tail situations, we propose a novel open-loop evaluation metric: Rater Feedback Score (RFS). Unlike conventional metrics that measure the distance between predicted waypoints and the logs, RFS measures how closely the predicted trajectory matches rater-annotated trajectory preference labels. We have released rater preference labels for all WOD-E2E validation set segments, while the held-out test set labels have been used for the 2025 WOD-E2E Challenge. Through our work, we aim to foster state-of-the-art research into generalizable, robust, and safe end-to-end autonomous driving agents capable of handling complex real-world situations.

[7] SplitFlow: Flow Decomposition for Inversion-Free Text-to-Image Editing

Sung-Hoon Yoon, Minghan Li, Gaspard Beaudouin, Congcong Wen, Muhammad Rafay Azhar, Mengyu Wang

🧩 TL;DR

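This paper proposes an inversion-free image editing framework based on flow decomposition and aggregation. It semantically decomposes the target prompt and adaptively aggregates the resulting sub-flows to address inaccurate inversion and gradient entanglement in rectified flow editing, and it outperforms existing zero-shot editing methods in semantic fidelity and attribute disentanglement.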


📘 Detailed Summary

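Motivation: Rectified flow models excel at image generation but face key limitations in editing: the inversion that maps real images back into the latent space is inaccurate, and gradient entanglement during editing yields outputs that do not faithfully reflect the target prompt. Existing ODE-based methods try to bypass inversion by mapping source and target distributions directly, yet still give suboptimal editing quality.

Method: Building on an inversion-free formulation, the framework semantically decomposes the target prompt into multiple sub-prompts, computes an independent flow for each, and adaptively weights the sub-target velocity fields through a projection and soft-aggregation mechanism. Inspired by gradient conflict resolution in multi-task learning, the mechanism suppresses semantic redundancy while emphasizing distinct directions, preserving both diversity and consistency in the edited output.

Result: Experiments show the method outperforms existing zero-shot editing methods in semantic fidelity and attribute disentanglement. Flow decomposition increases diversity in the target space, while soft aggregation yields semantically aligned outputs with consistent guidance toward the full target prompt.

Conclusion: Semantic decomposition and adaptive flow aggregation can address the limitations of rectified flow models in image editing and offer a new approach to complex edits. The design draws on multi-task learning, illustrating how cross-domain techniques can help generative models.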


📄 Abstract

Rectified flow models have become a de facto standard in image generation due to their stable sampling trajectories and high-fidelity outputs. Despite their strong generative capabilities, they face critical limitations in image editing tasks: inaccurate inversion processes for mapping real images back into the latent space, and gradient entanglement issues during editing often result in outputs that do not faithfully reflect the target prompt. Recent efforts have attempted to directly map source and target distributions via ODE-based approaches without inversion; however, these methods still yield suboptimal editing quality. In this work, we propose a flow decomposition-and-aggregation framework built upon an inversion-free formulation to address these limitations. Specifically, we semantically decompose the target prompt into multiple sub-prompts, compute an independent flow for each, and aggregate them to form a unified editing trajectory. While we empirically observe that decomposing the original flow enhances diversity in the target space, generating semantically aligned outputs still requires consistent guidance toward the full target prompt. To this end, we design a projection and soft-aggregation mechanism for flow, inspired by gradient conflict resolution in multi-task learning. This approach adaptively weights the sub-target velocity fields, suppressing semantic redundancy while emphasizing distinct directions, thereby preserving both diversity and consistency in the final edited output. Experimental results demonstrate that our method outperforms existing zero-shot editing approaches in terms of semantic fidelity and attribute disentanglement. The code is available at https://github.com/Harvard-AI-and-Robotics-Lab/SplitFlow.
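As a rough illustration of what a projection and soft-aggregation step over sub-prompt velocities might look like, the sketch below borrows a PCGrad-style projection from multi-task learning and a softmax weighting by cosine similarity to the full-prompt velocity. Both choices are stand-ins chosen for illustration; the abstract does not give SplitFlow's actual rule, and all function names here are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    norm = math.sqrt(dot(a, a) * dot(b, b))
    return dot(a, b) / norm if norm > 0 else 0.0


def project_out_conflicts(velocities):
    """PCGrad-style step: remove from each velocity its component that points
    against another sub-prompt's (original) velocity."""
    projected = []
    for i, v in enumerate(velocities):
        v = list(v)
        for j, other in enumerate(velocities):
            if i == j:
                continue
            d = dot(v, other)
            norm_sq = dot(other, other)
            if d < 0 and norm_sq > 0:
                v = [x - (d / norm_sq) * y for x, y in zip(v, other)]
        projected.append(v)
    return projected


def soft_aggregate(velocities, reference, temperature=1.0):
    """Softmax-weighted sum; weights favor velocities aligned with `reference`
    (assumed here to be the velocity for the full target prompt)."""
    scores = [cosine(v, reference) / temperature for v in velocities]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    merged = [sum(w * v[k] for w, v in zip(weights, velocities))
              for k in range(len(reference))]
    return merged, weights


# Two conflicting sub-prompt velocities in a toy 2-D latent space.
sub_velocities = project_out_conflicts([[1.0, 0.0], [-1.0, 1.0]])
merged, weights = soft_aggregate(sub_velocities, reference=[0.0, 1.0])
```

A lower `temperature` makes the weighting sharper, concentrating the merged velocity on the sub-prompt direction closest to the full-prompt guidance.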

[8] MV-MLM: Bridging Multi-View Mammography and Language for Breast Cancer Diagnosis and Risk Prediction

Shunjie-Fabian Zheng, Hyeonjun Lee, Thijs Kooi, Ali Diba

🧩 TL;DR

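This paper proposes a novel Multi-View Mammography and Language Model (MV-MLM) that uses paired mammogram images and synthetic radiology reports for cross-modal self-supervised learning, achieving state-of-the-art performance in breast cancer classification and risk prediction.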


📘 Detailed Summary

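Motivation: Current computer-aided diagnosis systems depend on large, finely annotated datasets that are costly and time-consuming to acquire, while vision-language models pre-trained on large image-text pairs offer a promising route to data efficiency in medical imaging.

Method: The method uses multi-view supervision and performs joint visual-textual learning through cross-modal self-supervision on image-text pairs. It learns rich representations from multiple views and their corresponding pseudo-radiology reports, in order to distinguish breast tissues or cancer characteristics (calcification, mass) and use these patterns to predict cancer risk.

Result: Evaluated on private and public datasets, the model achieves state-of-the-art performance in three classification tasks: malignancy classification, subtype classification, and image-based cancer risk prediction. It also shows strong data efficiency, outperforming existing fully supervised and VLM baselines while trained only on synthetic text reports.

Conclusion: The study demonstrates the effectiveness of multi-view cross-modal learning for medical image analysis, offers a practical way to reduce reliance on costly manual annotation, and shows the potential of synthetic text for improving generalization and accuracy.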


📄 Abstract

Large annotated datasets are essential for training robust Computer-Aided Diagnosis (CAD) models for breast cancer detection or risk prediction. However, acquiring such datasets with finely detailed annotations is both costly and time-consuming. Vision-Language Models (VLMs), such as CLIP, which are pre-trained on large image-text pairs, offer a promising solution by enhancing robustness and data efficiency in medical imaging tasks. This paper introduces a novel Multi-View Mammography and Language Model for breast cancer classification and risk prediction, trained on a dataset of paired mammogram images and synthetic radiology reports. Our MV-MLM leverages multi-view supervision to learn rich representations from extensive radiology data by employing cross-modal self-supervision across image-text pairs. This includes multiple views and the corresponding pseudo-radiology reports. We propose a novel joint visual-textual learning strategy to enhance generalization and accuracy over different data types and tasks, to distinguish breast tissues or cancer characteristics (calcification, mass), and to utilize these patterns to understand mammography images and predict cancer risk. We evaluated our method on both private and publicly available datasets, demonstrating that the proposed model achieves state-of-the-art performance in three classification tasks: (1) malignancy classification, (2) subtype classification, and (3) image-based cancer risk prediction. Furthermore, the model exhibits strong data efficiency, outperforming existing fully supervised or VLM baselines while trained on synthetic text reports and without the need for actual radiology reports.

[9] CAVE: Detecting and Explaining Commonsense Anomalies in Visual Environments

Rishika Bhagwatkar, Syrielle Montariol, Angelika Romanou, Beatriz Borges, Irina Rish, Antoine Bosselut

🧩 TL;DR

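This paper introduces CAVE, the first benchmark of real-world visual anomalies. It supports three open-ended tasks (anomaly description, explanation, and justification) and provides a cognitive-science-inspired framework for evaluating anomaly detection and commonsense reasoning in vision-language models.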


📘 Detailed Summary

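Motivation: Anomaly detection in computer vision remains largely limited to industrial defects or synthetically generated anomalies, which fail to capture the richness and unpredictability of real-world anomalies, even though humans naturally identify, reason about, and explain anomalies in their environment.

Method: CAVE introduces a fine-grained annotation framework inspired by cognitive science research, covering visual grounding and the categorization of anomalies by visual manifestation, complexity, severity, and commonness. It supports three open-ended tasks: anomaly description, explanation, and justification.

Result: Experiments show that state-of-the-art VLMs struggle with visual anomaly perception and commonsense reasoning even with advanced prompting strategies, highlighting their limits in understanding real-world anomalies.

Conclusion: As a realistic and cognitively grounded benchmark, CAVE is a valuable resource for advancing anomaly detection and commonsense reasoning in VLMs, and it exposes where current models fall short on real-world anomalies.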


📄 Abstract

Humans can naturally identify, reason about, and explain anomalies in their environment. In computer vision, this long-standing challenge remains limited to industrial defects or unrealistic, synthetically generated anomalies, failing to capture the richness and unpredictability of real-world anomalies. In this work, we introduce CAVE, the first benchmark of real-world visual anomalies. CAVE supports three open-ended tasks: anomaly description, explanation, and justification; with fine-grained annotations for visual grounding and categorizing anomalies based on their visual manifestations, their complexity, severity, and commonness. These annotations draw inspiration from cognitive science research on how humans identify and resolve anomalies, providing a comprehensive framework for evaluating Vision-Language Models (VLMs) in detecting and understanding anomalies. We show that state-of-the-art VLMs struggle with visual anomaly perception and commonsense reasoning, even with advanced prompting strategies. By offering a realistic and cognitively grounded benchmark, CAVE serves as a valuable resource for advancing research in anomaly detection and commonsense reasoning in VLMs.

[10] GLYPH-SR: Can We Achieve Both High-Quality Image Super-Resolution and High-Fidelity Text Recovery via VLM-guided Latent Diffusion Model?

Mingyu Sung, Seungjae Ham, Kangwoo Kim, Yeokyoung Yoon, Sangseok Yun, Il-Min Kim, Jae-Mo Kang

🧩 TL;DR

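GLYPH-SR is a vision-language-guided diffusion framework for scene-text super-resolution. By jointly optimizing text legibility and perceptual quality, it markedly improves OCR performance while maintaining high visual realism.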


📘 Detailed Summary

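Motivation: Super-resolution research has mostly optimized distortion metrics (PSNR/SSIM) or perceptual quality metrics, which are insensitive to character-level errors. As a result, scene text (such as text on signs and product labels) often remains hard for OCR systems to recognize after super-resolution, limiting practical use.

Method: GLYPH-SR uses a Text-SR Fusion ControlNet (TS-ControlNet) guided by OCR data and a ping-pong scheduler that alternates between text-centric and scene-centric guidance. These components are trained on a synthetic corpus while the main SR branch stays frozen, enabling targeted text restoration.

Result: In x4 and x8 super-resolution experiments on SVT, SCUT-CTW1500, and CUTE80, GLYPH-SR improves OCR F1 by up to 15.18 percentage points over diffusion/GAN baselines (SVT x8, OpenOCR) while maintaining competitive MANIQA, CLIP-IQA, and MUSIQ perceptual quality scores.

Conclusion: Super-resolution systems need to optimize text legibility and visual realism together. GLYPH-SR shows that a targeted design can deliver super-resolution that both looks right and reads right, which matters for practical deployment.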


📄 Abstract

Image super-resolution (SR) is fundamental to many vision systems, from surveillance and autonomy to document analysis and retail analytics, because recovering high-frequency details, especially scene-text, enables reliable downstream perception. Scene-text, i.e., text embedded in natural images such as signs, product labels, and storefronts, often carries the most actionable information; when characters are blurred or hallucinated, optical character recognition (OCR) and subsequent decisions fail even if the rest of the image appears sharp. Yet previous SR research has often been tuned to distortion (PSNR/SSIM) or learned perceptual metrics (LPIPS, MANIQA, CLIP-IQA, MUSIQ) that are largely insensitive to character-level errors. Furthermore, studies that do address text SR often focus on simplified benchmarks with isolated characters, overlooking the challenges of text within complex natural scenes. As a result, scene-text is effectively treated as generic texture. For SR to be effective in practical deployments, it is therefore essential to explicitly optimize for both text legibility and perceptual quality. We present GLYPH-SR, a vision-language-guided diffusion framework that aims to achieve both objectives jointly. GLYPH-SR utilizes a Text-SR Fusion ControlNet (TS-ControlNet) guided by OCR data, and a ping-pong scheduler that alternates between text- and scene-centric guidance. To enable targeted text restoration, we train these components on a synthetic corpus while keeping the main SR branch frozen. Across SVT, SCUT-CTW1500, and CUTE80 at x4 and x8, GLYPH-SR improves OCR F1 by up to +15.18 percentage points over diffusion/GAN baselines (SVT x8, OpenOCR) while maintaining competitive MANIQA, CLIP-IQA, and MUSIQ. GLYPH-SR is designed to satisfy both objectives simultaneously, high readability and high visual realism, delivering SR that looks right and reads right.
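The ping-pong scheduler is described only as alternating between text-centric and scene-centric guidance. A minimal sketch of such an alternation is shown below; the `period` and `start` parameters and the mode labels are assumptions made for illustration, not details from the paper.

```python
def ping_pong_schedule(num_steps, period=1, start="text"):
    """Return the guidance mode for each denoising step.

    Modes alternate every `period` steps between "text" and "scene".
    Both parameters are hypothetical knobs, not taken from GLYPH-SR.
    """
    if start not in ("text", "scene"):
        raise ValueError("start must be 'text' or 'scene'")
    if period < 1:
        raise ValueError("period must be at least 1")
    modes = ("text", "scene") if start == "text" else ("scene", "text")
    return [modes[(step // period) % 2] for step in range(num_steps)]


# Alternate every step, or hold each mode for two steps at a time.
every_step = ping_pong_schedule(6)
every_two = ping_pong_schedule(5, period=2)
```

In a sampler, the returned mode at each step would select which conditioning signal (OCR-derived text guidance or scene-level guidance) is fed to the control branch.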

[11] Enhancing Temporal Understanding in Video-LLMs through Stacked Temporal Attention in Vision Encoders

Ali Rasekh, Erfan Bagheri Soula, Omid Daliran, Simon Gottschalk, Mohsen Fayyaz

🧩 TL;DR

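This paper proposes a Video-LLM architecture that introduces stacked temporal attention modules in the vision encoder, markedly improving temporal understanding of video. It improves results on several video question answering benchmarks by up to +5.5%, addressing a key limitation of current Video-LLMs in understanding temporal dynamics.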


📘 Detailed Summary

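Motivation: Multimodal large language models still struggle with video understanding, especially complex temporal dynamics. Experiments show that existing Video-LLM architectures have critical limitations on tasks requiring detailed comprehension of action sequences and temporal progression, and fail to capture inter-frame relationships and how actions evolve.

Method: The proposed Video-LLM architecture introduces stacked temporal attention modules directly within the vision encoder. With temporal attention integrated in the encoder, the model can better capture action progression and inter-frame relationships before visual tokens are passed to the LLM, strengthening temporal reasoning.

Result: The method significantly improves video question answering performance, with gains of up to +5.5% on benchmarks including VITATECS, MVBench, and Video-MME. It outperforms existing models, specifically in action recognition.

Conclusion: By adding temporal structure to the vision encoder, the work addresses a critical gap in video understanding for Video-LLMs. It shows that integrating temporal attention directly in the vision encoder is an effective way to improve temporal reasoning over video.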


📄 Abstract

Despite significant advances in Multimodal Large Language Models (MLLMs), understanding complex temporal dynamics in videos remains a major challenge. Our experiments show that current Video Large Language Model (Video-LLM) architectures have critical limitations in temporal understanding, struggling with tasks that require detailed comprehension of action sequences and temporal progression. In this work, we propose a Video-LLM architecture that introduces stacked temporal attention modules directly within the vision encoder. This design incorporates temporal attention in the vision encoder, enabling the model to better capture the progression of actions and the relationships between frames before passing visual tokens to the LLM. Our results show that this approach significantly improves temporal reasoning and outperforms existing models in video question answering tasks, specifically in action recognition. We improve on benchmarks including VITATECS, MVBench, and Video-MME by up to +5.5%. By enhancing the vision encoder with temporal structure, we address a critical gap in video understanding for Video-LLMs. Project page and code are available at: https://alirasekh.github.io/STAVEQ2/.

[12] Which Way Does Time Flow? A Psychophysics-Grounded Evaluation for Vision-Language Models

Shiho Matta, Lis Kanashiro Pereira, Peitao Han, Fei Cheng, Shigeru Kitazawa

🧩 TL;DR

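This study exposes a fundamental weakness in the temporal reasoning of current vision-language models. Using the new AoT-PsyPhyBENCH benchmark, it finds that most models perform near chance at judging the temporal direction of a video, far behind humans.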


📘 Detailed Summary

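Motivation: Modern VLMs excel at many multimodal tasks, yet their grasp of temporal information in video remains weak and under-evaluated. The study probes this gap by testing the basic ability to judge the direction of time.

Method: The study introduces AoT-PsyPhyBENCH, a psychophysically validated benchmark that tests whether VLMs can infer temporal direction in natural videos, using the same stimuli and behavioral baselines established for humans. It evaluates open-weight and proprietary, reasoning and non-reasoning VLMs.

Result: Most models perform near chance at judging temporal direction. Even the best models lag far behind human accuracy on physically irreversible processes (e.g., free fall, diffusion/explosion) and causal manual actions (division/addition), which humans recognize almost instantly.

Conclusion: The results reveal a fundamental gap in current multimodal systems: they capture rich visual-semantic correlations but lack the inductive biases needed for temporal continuity and causal understanding, which points to what next-generation VLMs need for physical and temporal reasoning.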


📄 Abstract

Modern vision-language models (VLMs) excel at many multimodal tasks, yet their grasp of temporal information in video remains weak and, crucially, under-evaluated. We probe this gap with a deceptively simple but revealing challenge: judging the arrow of time (AoT), that is, whether a short clip is played forward or backward. We introduce AoT-PsyPhyBENCH, a psychophysically validated benchmark that tests whether VLMs can infer temporal direction in natural videos using the same stimuli and behavioral baselines established for humans. Our comprehensive evaluation of open-weight and proprietary, reasoning and non-reasoning VLMs reveals that most models perform near chance, and even the best lag far behind human accuracy on physically irreversible processes (e.g., free fall, diffusion/explosion) and causal manual actions (division/addition) that humans recognize almost instantly. These results highlight a fundamental gap in current multimodal systems: while they capture rich visual-semantic correlations, they lack the inductive biases required for temporal continuity and causal understanding. We release the code and data for AoT-PsyPhyBENCH to encourage further progress in the physical and temporal reasoning capabilities of VLMs.
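The arrow-of-time task reduces to a binary forward/backward judgment, so chance is 50% accuracy on a balanced set. The sketch below shows one way such trials could be built and scored; it is a toy illustration with hypothetical names, not the benchmark's released code, and frames are stood in for by plain Python values.

```python
import random


def make_aot_trials(clips, seed=0):
    """Turn each clip (a sequence of frames) into a labeled trial,
    playing it forward or reversed with equal probability."""
    rng = random.Random(seed)
    trials = []
    for frames in clips:
        if rng.random() < 0.5:
            trials.append((list(frames), "forward"))
        else:
            trials.append((list(reversed(frames)), "backward"))
    return trials


def accuracy(trials, predict):
    """Fraction of trials where predict(frames) matches the true direction."""
    if not trials:
        raise ValueError("no trials to score")
    correct = sum(1 for frames, label in trials if predict(frames) == label)
    return correct / len(trials)


# Toy clips whose "frames" are increasing numbers, so direction is recoverable.
clips = [list(range(start, start + 5)) for start in range(20)]
trials = make_aot_trials(clips)
oracle = accuracy(trials, lambda f: "forward" if f[0] < f[-1] else "backward")
always_forward = accuracy(trials, lambda f: "forward")
```

A constant guesser such as `always_forward` scores whatever share of trials happen to be forward, which sits near 0.5 on a balanced set and is the baseline that near-chance models fail to beat.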

[13] LoCoT2V-Bench: A Benchmark for Long-Form and Complex Text-to-Video Generation

Xiangqing Zheng, Chengyue Wu, Kehai Chen, Min Zhang

🧩 TL;DR

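This paper proposes LoCoT2V-Bench, a benchmark for long video generation under complex input conditions. With a multi-dimensional evaluation framework and new metrics, it systematically assesses current long video generation models, including on abstract dimensions such as narrative coherence and thematic expression.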


📘 Detailed Summary

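Motivation: Text-to-video generation has made impressive progress on short, high-quality clips, but evaluating long-form outputs remains a major challenge. Existing benchmarks mostly rely on simplified prompts and low-level metrics, overlooking fine-grained alignment with prompts and abstract dimensions such as narrative coherence and thematic expression.

Method: Based on real-world videos, the benchmark builds a set of realistic, complex prompts that incorporate elements such as scene transitions and event dynamics. Its multi-dimensional evaluation framework includes newly proposed metrics: event-level alignment, fine-grained temporal consistency, content clarity, and the Human Expectation Realization Degree (HERD), which focuses on abstract attributes such as narrative flow, emotional response, and character development.

Result: A comprehensive evaluation of nine representative long video generation models shows that current methods perform well on basic visual and temporal aspects but struggle with inter-event consistency, fine-grained alignment, and high-level thematic adherence.

Conclusion: LoCoT2V-Bench provides a comprehensive and reliable platform for evaluating long-form, complex text-to-video generation, exposes the limits of current methods in high-level semantic understanding, and highlights key directions for improvement, particularly narrative coherence and thematic expression.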


📄 Abstract

Recently, text-to-video generation has made impressive progress in producing short, high-quality clips, but evaluating long-form outputs remains a major challenge, especially when processing complex prompts. Existing benchmarks mostly rely on simplified prompts and focus on low-level metrics, overlooking fine-grained alignment with prompts and abstract dimensions such as narrative coherence and thematic expression. To address these gaps, we propose LoCoT2V-Bench, a benchmark specifically designed for long video generation (LVG) under complex input conditions. Based on various real-world videos, LoCoT2V-Bench introduces a suite of realistic and complex prompts incorporating elements like scene transitions and event dynamics. Moreover, it constructs a multi-dimensional evaluation framework that includes our newly proposed metrics such as event-level alignment, fine-grained temporal consistency, content clarity, and the Human Expectation Realization Degree (HERD) that focuses on more abstract attributes like narrative flow, emotional response, and character development. Using this framework, we conduct a comprehensive evaluation of nine representative LVG models, finding that while current methods perform well on basic visual and temporal aspects, they struggle with inter-event consistency, fine-grained alignment, and high-level thematic adherence. Overall, LoCoT2V-Bench provides a comprehensive and reliable platform for evaluating long-form complex text-to-video generation and highlights critical directions for future method improvement.

[14] Counteracting Matthew Effect in Self-Improvement of LVLMs through Head-Tail Re-balancing

Xin Guo, Zhiheng Xi, Yiwen Ding, Yitao Zhai, Xiaowei Shi, Xunliang Cai, Tao Gui, Qi Zhang, Xuanjing Huang

🧩 TL;DR

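This study addresses the Matthew effect that arises during self-improvement of large vision-language models (LVLMs). It proposes four strategies from two perspectives, distribution-reshaping and trajectory-resampling, that re-balance optimization between simple and complex reasoning and improve visual reasoning.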


📘 Detailed Summary

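Motivation: During self-improvement, LVLMs exhibit a Matthew effect: the model tends to generate high-quality reasoning trajectories for simple queries but struggles with complex ones. This imbalances optimization, hinders progress on complex reasoning tasks, and ultimately creates a performance bottleneck.

Method: Four efficient strategies are proposed from two perspectives, distribution-reshaping and trajectory-resampling, to re-balance head and tail data during exploration-and-learning self-improvement.

Result: Extensive experiments on Qwen2-VL-7B-Instruct and InternVL2.5-4B show that the methods consistently improve visual reasoning, outperforming vanilla self-improvement by 3.86 points on average.

Conclusion: The study identifies the optimization imbalance in self-improvement and shows that the proposed re-balancing strategies mitigate the Matthew effect, supporting continued improvement of LVLMs on complex reasoning.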


📄 Abstract

Self-improvement has emerged as a mainstream paradigm for advancing the reasoning capabilities of large vision-language models (LVLMs), where models explore and learn from successful trajectories iteratively. However, we identify a critical issue during this process: the model excels at generating high-quality trajectories for simple queries (i.e., head data) but struggles with more complex ones (i.e., tail data). This leads to an imbalanced optimization that drives the model to prioritize simple reasoning skills, while hindering its ability to tackle more complex reasoning tasks. Over iterations, this imbalance becomes increasingly pronounced, a dynamic we term the "Matthew effect", which ultimately hinders further model improvement and leads to performance bottlenecks. To counteract this challenge, we introduce four efficient strategies from two perspectives: distribution-reshaping and trajectory-resampling, to achieve head-tail re-balancing during the exploration-and-learning self-improvement process. Extensive experiments on Qwen2-VL-7B-Instruct and InternVL2.5-4B models across visual reasoning tasks demonstrate that our methods consistently improve visual reasoning capabilities, outperforming vanilla self-improvement by 3.86 points on average.
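The abstract does not spell out the four strategies, so the following is only a generic sketch of head-tail re-balancing by resampling: queries with many successful trajectories (head) are capped, and queries with few (tail) are upsampled with replacement to the same count. The function name and the single `target_per_query` knob are hypothetical.

```python
import random


def rebalance_trajectories(successes, target_per_query, seed=0):
    """Equalize how many successful trajectories each query contributes.

    successes: dict mapping a query to its list of successful trajectories.
    Queries with no success are skipped, since there is nothing to learn from.
    This is an illustrative stand-in, not one of the paper's four strategies.
    """
    rng = random.Random(seed)
    balanced = {}
    for query, trajectories in successes.items():
        if not trajectories:
            continue
        if len(trajectories) >= target_per_query:
            # Head query: keep a random subset so it does not dominate training.
            balanced[query] = rng.sample(trajectories, target_per_query)
        else:
            # Tail query: repeat its rare successes to reach the target count.
            extra = [rng.choice(trajectories)
                     for _ in range(target_per_query - len(trajectories))]
            balanced[query] = list(trajectories) + extra
    return balanced


pool = {
    "easy": ["a", "b", "c", "d", "e", "f"],  # head: many successes
    "hard": ["x"],                            # tail: a single success
    "unsolved": [],                           # no success yet
}
training_set = rebalance_trajectories(pool, target_per_query=3)
```

Without a step like this, each iteration's training data would be dominated by the easy query, which is the compounding imbalance the abstract calls the Matthew effect.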

[15] Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark

Ziyu Guo, Xinyan Chen, Renrui Zhang, Ruichuan An, Yu Qi, Dongzhi Jiang, Xiangtai Li, Manyuan Zhang, Hongsheng Li, Pheng-Ann Heng

🧩 TL;DR

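This study systematically evaluates the leading video generation model Veo-3 and finds that current video models do well on short-horizon spatial coherence and locally consistent dynamics but remain limited in long-horizon causal reasoning and strict geometric constraints, so they are not yet reliable standalone zero-shot reasoners.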


📘 Detailed Summary

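Motivation: Current video generation models produce high-fidelity, temporally coherent videos, suggesting they may encode substantial world knowledge. An important question remains open: can these models serve as zero-shot reasoners in challenging visual reasoning scenarios? The study investigates this empirically.

Method: The study evaluates the leading Veo-3 model across 12 dimensions, including spatial, geometric, physical, temporal, and embodied logic, systematically characterizing its strengths and failure modes, and curates the evaluation data into the MME-CoF benchmark.

Result: Current video models show promising reasoning patterns on short-horizon spatial coherence, fine-grained grounding, and locally consistent dynamics, but remain markedly limited in long-horizon causal reasoning, strict geometric constraints, and abstract logic.

Conclusion: Current video models are not yet reliable as standalone zero-shot reasoners, but they show encouraging signs as complementary visual engines alongside dedicated reasoning models.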


📄 Abstract

Recent video generation models can produce high-fidelity, temporally coherent videos, indicating that they may encode substantial world knowledge. Beyond realistic synthesis, they also exhibit emerging behaviors indicative of visual perception, modeling, and manipulation. Yet, an important question still remains: Are video models ready to serve as zero-shot reasoners in challenging visual reasoning scenarios? In this work, we conduct an empirical study to comprehensively investigate this question, focusing on the leading and popular Veo-3. We evaluate its reasoning behavior across 12 dimensions, including spatial, geometric, physical, temporal, and embodied logic, systematically characterizing both its strengths and failure modes. To standardize this study, we curate the evaluation data into MME-CoF, a compact benchmark that enables in-depth and thorough assessment of Chain-of-Frame (CoF) reasoning. Our findings reveal that while current video models demonstrate promising reasoning patterns on short-horizon spatial coherence, fine-grained grounding, and locally consistent dynamics, they remain limited in long-horizon causal reasoning, strict geometric constraints, and abstract logic. Overall, they are not yet reliable as standalone zero-shot reasoners, but exhibit encouraging signs as complementary visual engines alongside dedicated reasoning models. Project page: https://video-cof.github.io

[16] OracleAgent: A Multimodal Reasoning Agent for Oracle Bone Script Research

Caoshuo Li, Zengmao Ding, Xiaobin Hu, Bang Li, Donghao Luo, Xu Peng, Taisong Jin, Yongge Liu, Shengwei Han, Jing Yang, Xiaoping He, Feng Gao, AndyPian Wu, SevenShu, Chaoyang Wang, Chengjie Wang

🧩 TL;DR

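This paper presents OracleAgent, the first agent system designed for the structured management and retrieval of Oracle Bone Script (OBS) information. By integrating a multimodal knowledge base with large language models, it improves the efficiency and automation of OBS research.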


📘 Detailed Summary

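Motivation: As one of the earliest writing systems, OBS poses two major research challenges: interpretation involves a complex workflow of multiple serial and parallel sub-tasks, and organizing and retrieving OBS information is inefficient, with scholars spending substantial effort searching for, compiling, and managing relevant resources.

Method: OracleAgent integrates multiple LLM-powered OBS analysis tools and flexibly orchestrates them as an agent system. It is backed by a comprehensive domain-specific multimodal knowledge base of over 1.4M single-character rubbing images and 80K interpretation texts, built through a rigorous multi-year process of data collection, cleaning, and expert annotation.

Result: Extensive experiments show that OracleAgent achieves superior performance across a range of multimodal reasoning and generation tasks, surpassing leading mainstream MLLMs (e.g., GPT-4o). A case study shows it can effectively assist domain experts and significantly reduce the time cost of OBS research.

Conclusion: OracleAgent is a significant step toward the practical deployment of OBS-assisted research and automated interpretation systems, and it illustrates the potential of agent systems in specialist domains.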


📄 Abstract

As one of the earliest writing systems, Oracle Bone Script (OBS) preserves the cultural and intellectual heritage of ancient civilizations. However, current OBS research faces two major challenges: (1) the interpretation of OBS involves a complex workflow comprising multiple serial and parallel sub-tasks, and (2) the efficiency of OBS information organization and retrieval remains a critical bottleneck, as scholars often spend substantial effort searching for, compiling, and managing relevant resources. To address these challenges, we present OracleAgent, the first agent system designed for the structured management and retrieval of OBS-related information. OracleAgent seamlessly integrates multiple OBS analysis tools, empowered by large language models (LLMs), and can flexibly orchestrate these components. Additionally, we construct a comprehensive domain-specific multimodal knowledge base for OBS, which is built through a rigorous multi-year process of data collection, cleaning, and expert annotation. The knowledge base comprises over 1.4M single-character rubbing images and 80K interpretation texts. OracleAgent leverages this resource through its multimodal tools to assist experts in retrieving characters, documents, interpretation texts, and rubbing images. Extensive experiments demonstrate that OracleAgent achieves superior performance across a range of multimodal reasoning and generation tasks, surpassing leading multimodal large language models (MLLMs) such as GPT-4o. Furthermore, our case study illustrates that OracleAgent can effectively assist domain experts, significantly reducing the time cost of OBS research. These results highlight OracleAgent as a significant step toward the practical deployment of OBS-assisted research and automated interpretation systems.

[17] CRAG-MM: Multi-modal Multi-turn Comprehensive RAG Benchmark

Jiaqi Wang, Xiao Yang, Kai Sun, Parth Suresh, Sanat Sharma, Adam Czyzewski, Derek Andersen, Surya Appini, Arkav Banerjee, Sajal Choudhary, Shervin Ghasemlou, Ziqiang Guan, Akil Iyer, Haidar Khan, Lingkun Kong, Roy Luo, Tiffany Ma, Zhen Qiao, David Tran, Wenfang Xu, Skyler Yeatman, Chen Zhou, Gunveer Gujral, Yinglong Xia, Shane Moon, Nicolas Scheffer, Nirav Shah, Eun Chang, Yue Liu, Florian Metze, Tammy Stark, Zhaleh Feizollahi, Andrea Jessee, Mangesh Pujari, Ahmed Aly, Babak Damavandi, Rakesh Wanga, Anuj Kumar, Rohit Patel, Wen-tau Yih, Xin Luna Dong

🧩 TL;DR

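This paper presents CRAG-MM, a comprehensive retrieval-augmented generation benchmark for multi-modal, multi-turn conversations. It contains 6.5K (image, question, answer) triplets and 2K multi-turn conversations, is designed for wearable-device scenarios, and fills the gap left by the lack of a comprehensive benchmark in this area.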


📘 Detailed Summary

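Motivation: There is currently no comprehensive benchmark for multi-modal retrieval-augmented generation (MM-RAG) in wearable-device scenarios, and in particular no evaluation framework that reflects real-world challenges, which limits research progress and practical application in this area.

Method: The authors build a benchmark spanning 13 domains with 6.5K (image, question, answer) triplets and 2K multi-turn conversations, including 6.2K egocentric images that mimic captures from wearable devices. The questions cover five types of image-quality issues, six question types, varying entity popularity, differing information dynamism, and different numbers of conversation turns. Three task settings are provided, each with an associated retrieval corpus and APIs.

Result: Straightforward RAG approaches achieve only 32% and 43% truthfulness on single-turn and multi-turn QA, respectively, and state-of-the-art industry solutions reach similar quality (32%/45%). The benchmark has hosted the KDD Cup 2025 competition, where winning solutions improved baseline performance by 28%.

Conclusion: The study exposes a significant performance gap for current MM-RAG methods in wearable-device scenarios and provides an evaluation framework and direction for future research. The benchmark's use in the competition indicates its value for advancing the field.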


📄 Abstract

Wearable devices such as smart glasses are transforming the way people interact with their surroundings, enabling users to seek information regarding entities in their view. Multi-Modal Retrieval-Augmented Generation (MM-RAG) plays a key role in supporting such questions, yet there is still no comprehensive benchmark for this task, especially regarding wearables scenarios. To fill this gap, we present CRAG-MM -- a Comprehensive RAG benchmark for Multi-modal Multi-turn conversations. CRAG-MM contains a diverse set of 6.5K (image, question, answer) triplets and 2K visual-based multi-turn conversations across 13 domains, including 6.2K egocentric images designed to mimic captures from wearable devices. We carefully constructed the questions to reflect real-world scenarios and challenges, including five types of image-quality issues, six question types, varying entity popularity, differing information dynamism, and different conversation turns. We design three tasks: single-source augmentation, multi-source augmentation, and multi-turn conversations -- each paired with an associated retrieval corpus and APIs for both image-KG retrieval and webpage retrieval. Our evaluation shows that straightforward RAG approaches achieve only 32% and 43% truthfulness on CRAG-MM single- and multi-turn QA, respectively, whereas state-of-the-art industry solutions have similar quality (32%/45%), underscoring ample room for improvement. The benchmark has hosted KDD Cup 2025, attracting about 1K participants and 5K submissions, with winning solutions improving baseline performance by 28%, highlighting its early impact on advancing the field.

[18] A-TPT: Angular Diversity Calibration Properties for Test-Time Prompt Tuning of Vision-Language Models

Shihab Aaqil Ahamed, Udaya S. K. P. Miriya Thanthrige, Ranga Rodrigo, Muhammad Haris Khan

🧩 TL;DR

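This paper proposes A-TPT, a framework that introduces angular diversity to improve the calibration of vision-language models during test-time prompt tuning. It spreads normalized textual features uniformly by maximizing the minimum pairwise angular distance between them on the unit hypersphere.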


📘 Detailed Summary

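Motivation: Existing test-time prompt tuning methods mainly maximize average textual feature dispersion or impose orthogonality constraints to encourage angular separation. These methods may not achieve optimal angular separation between class-wise textual features and overlook the role of angular diversity, so the textual features remain poorly dispersed and calibration suffers.

Method: A-TPT achieves angular diversity by encouraging the normalized textual features induced by the learnable prompts to be uniformly distributed on the unit hypersphere. Concretely, it maximizes the minimum pairwise angular distance between features.

Result: Extensive experiments across multiple backbones and datasets show that A-TPT consistently surpasses state-of-the-art TPT methods in reducing the aggregate average calibration error while maintaining comparable accuracy. It shows superior zero-shot calibration under natural distribution shifts and generalizes well to medical datasets.

Conclusion: The results indicate that promoting angular diversity is an effective way to obtain well-dispersed textual features and significantly improves VLM calibration during test-time adaptation, offering a technical path toward more reliable, trustworthy, and safe models.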


📄 Abstract

Test-time prompt tuning (TPT) has emerged as a promising technique for adapting large vision-language models (VLMs) to unseen tasks without relying on labeled data. However, the lack of dispersion between textual features can hurt calibration performance, which raises concerns about VLMs' reliability, trustworthiness, and safety. Current TPT approaches primarily focus on improving prompt calibration by either maximizing average textual feature dispersion or enforcing orthogonality constraints to encourage angular separation. However, these methods may not always have optimal angular separation between class-wise textual features, which implies overlooking the critical role of angular diversity. To address this, we propose A-TPT, a novel TPT framework that introduces angular diversity to encourage uniformity in the distribution of normalized textual features induced by corresponding learnable prompts. This uniformity is achieved by maximizing the minimum pairwise angular distance between features on the unit hypersphere. We show that our approach consistently surpasses state-of-the-art TPT methods in reducing the aggregate average calibration error while maintaining comparable accuracy through extensive experiments with various backbones on different datasets. Notably, our approach exhibits superior zero-shot calibration performance on natural distribution shifts and generalizes well to medical datasets. We provide extensive analyses, including theoretical aspects, to establish the grounding of A-TPT. These results highlight the potency of promoting angular diversity to achieve well-dispersed textual features, significantly improving VLM calibration during test-time adaptation. Our code will be made publicly available.
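
The objective described in this entry, maximizing the minimum pairwise angular distance between normalized textual features, can be illustrated with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the names `min_pairwise_angle` and `angular_diversity_loss` are hypothetical, the features are plain Python lists rather than text-encoder outputs, and how such a term is combined with other TPT losses is not specified here.

```python
import math
from itertools import combinations


def normalize(vector):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def min_pairwise_angle(features):
    """Smallest angle (in radians) between any two normalized features."""
    unit = [normalize(f) for f in features]
    angles = []
    for a, b in combinations(unit, 2):
        cosine = sum(x * y for x, y in zip(a, b))
        cosine = max(-1.0, min(1.0, cosine))  # guard against rounding error
        angles.append(math.acos(cosine))
    return min(angles)


def angular_diversity_loss(features):
    """Hypothetical loss: minimizing it maximizes the minimum pairwise angle."""
    return -min_pairwise_angle(features)


if __name__ == "__main__":
    spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    clustered = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
    print("spread loss:", angular_diversity_loss(spread))
    print("clustered loss:", angular_diversity_loss(clustered))
```

Well-separated features give a lower loss than clustered ones. In a real setup a term like this would presumably be computed on the class-wise text features produced by the learnable prompts and minimized with a differentiable framework; that wiring is outside this sketch.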

[19] Towards Fine-Grained Vision-Language Alignment for Few-Shot Anomaly Detection

Yuanting Fan, Jun Liu, Xiaochen Chen, Bin-Bin Gao, Jian Li, Yong Liu, Jinlong Peng, Chengjie Wang

🧩 TL;DR

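This paper proposes FineGrainedAD, a framework that addresses semantic misalignment in few-shot anomaly detection through multi-level fine-grained semantic captions and a semantic alignment mechanism, achieving superior anomaly localization on the MVTec-AD and VisA datasets.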


📘 Detailed Summary

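Motivation: Existing few-shot anomaly detection methods rely on the generalization ability of pre-trained vision-language models. Lacking detailed textual descriptions, they can only pre-define image-level descriptions to match against visual patch tokens, which causes semantic misalignment between image descriptions and patch-level visual anomalies and leads to sub-optimal localization.

Method: The authors propose the Multi-Level Fine-Grained Semantic Caption (MFSC), which provides multi-level, fine-grained textual descriptions for anomaly detection datasets, and design the FineGrainedAD framework with two components: Multi-Level Learnable Prompt (MLLP) and Multi-Level Semantic Alignment (MLSA). MLLP introduces fine-grained semantics into multi-level learnable prompts through an automatic replacement and concatenation mechanism. MLSA uses a region aggregation strategy and multi-level alignment training to align learnable prompts with their corresponding visual regions.

Result: Experiments show that FineGrainedAD achieves superior overall performance in few-shot settings on the MVTec-AD and VisA datasets, significantly improving the accuracy of anomaly localization.

Conclusion: By introducing multi-level fine-grained semantic captions and a semantic alignment mechanism, the study addresses semantic misalignment in few-shot anomaly detection and offers a new technical path for VLM-based anomaly detection.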


📄 Abstract

Few-shot anomaly detection (FSAD) methods identify anomalous regions with few known normal samples. Most existing methods rely on the generalization ability of pre-trained vision-language models (VLMs) to recognize potentially anomalous regions through feature similarity between text descriptions and images. However, due to the lack of detailed textual descriptions, these methods can only pre-define image-level descriptions to match each visual patch token when identifying potentially anomalous regions, which leads to semantic misalignment between image descriptions and patch-level visual anomalies and results in sub-optimal localization performance. To address these issues, we propose the Multi-Level Fine-Grained Semantic Caption (MFSC), which provides multi-level and fine-grained textual descriptions for existing anomaly detection datasets through an automatic construction pipeline. Based on the MFSC, we propose a novel framework named FineGrainedAD to improve anomaly localization performance, which consists of two components: Multi-Level Learnable Prompt (MLLP) and Multi-Level Semantic Alignment (MLSA). MLLP introduces fine-grained semantics into multi-level learnable prompts through an automatic replacement and concatenation mechanism, while MLSA designs a region aggregation strategy and multi-level alignment training to help learnable prompts better align with corresponding visual regions. Experiments demonstrate that the proposed FineGrainedAD achieves superior overall performance in few-shot settings on the MVTec-AD and VisA datasets.

[20] Representation-Level Counterfactual Calibration for Debiased Zero-Shot Recognition

Pei Peng, MingKun Xie, Hang Hao, Tong Jin, ShengJun Huang

🧩 TL;DR

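This paper proposes a lightweight, representation-level counterfactual method grounded in causal inference. By estimating object and background expectations and synthesizing counterfactual embeddings, it mitigates object-context shortcuts in vision-language models and significantly improves zero-shot reliability without retraining or prompt design.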


📘 Detailed Summary

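Motivation: Object-context shortcuts in vision-language models seriously undermine zero-shot reliability: when test scenes differ from the co-occurrence patterns seen in training, the model is prone to wrong predictions. The study addresses this challenge within a causal inference framework.

Method: The method estimates object and background expectations in CLIP's representation space and synthesizes counterfactual embeddings using diverse alternative contexts sampled from external datasets, batch neighbors, or text descriptions. It then uses Total Direct Effect estimation and simulated intervention to subtract background-only activation while preserving beneficial object-context interactions.

Result: The method significantly improves both worst-group and average accuracy on context-sensitive benchmarks and sets a new zero-shot state of the art without retraining or prompt design, demonstrating its effectiveness at mitigating hallucinated scores.

Conclusion: The study provides a lightweight representation-level counterfactual framework that opens a practical causal avenue for debiased and reliable multimodal reasoning and shows the potential of causal inference for improving VLM robustness.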


📄 Abstract

Object-context shortcuts remain a persistent challenge in vision-language models, undermining zero-shot reliability when test-time scenes differ from familiar training co-occurrences. We recast this issue as a causal inference problem and ask: Would the prediction remain if the object appeared in a different environment? To answer this at inference time, we estimate object and background expectations within CLIP's representation space, and synthesize counterfactual embeddings by recombining object features with diverse alternative contexts sampled from external datasets, batch neighbors, or text-derived descriptions. By estimating the Total Direct Effect and simulating intervention, we further subtract background-only activation, preserving beneficial object-context interactions while mitigating hallucinated scores. Without retraining or prompt design, our method substantially improves both worst-group and average accuracy on context-sensitive benchmarks, establishing a new zero-shot state of the art. Beyond performance, our framework provides a lightweight representation-level counterfactual approach, offering a practical causal avenue for debiased and reliable multimodal reasoning.
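
The idea of subtracting background-only activation from a zero-shot score can be shown with a toy Python sketch. Everything here is an assumption made for illustration: the function `debiased_scores`, the weight `alpha`, the two-dimensional embeddings, and the plain dot-product scoring are hypothetical stand-ins, not the paper's estimator of the Total Direct Effect or its counterfactual sampling procedure.

```python
def dot(a, b):
    """Dot product, standing in for a CLIP-style similarity score."""
    return sum(x * y for x, y in zip(a, b))


def raw_scores(image_emb, class_text_embs):
    """Ordinary zero-shot scores: similarity of the image to each class text."""
    return {name: dot(image_emb, text) for name, text in class_text_embs.items()}


def debiased_scores(image_emb, background_emb, class_text_embs, alpha=1.0):
    """Hypothetical debiasing: remove the score explained by background alone.

    `background_emb` is an assumed estimate of the background-only embedding,
    and `alpha` is an assumed strength parameter.
    """
    return {
        name: dot(image_emb, text) - alpha * dot(background_emb, text)
        for name, text in class_text_embs.items()
    }


def argmax(scores):
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Toy space: dimension 0 ~ "cow-like object", dimension 1 ~ "desert context".
    class_text_embs = {"cow": [1.0, 0.0], "camel": [0.0, 1.0]}
    image_emb = [0.6, 0.8]       # a cow photographed in a desert
    background_emb = [0.0, 0.8]  # assumed background-only estimate
    print("raw prediction:", argmax(raw_scores(image_emb, class_text_embs)))
    print("debiased prediction:",
          argmax(debiased_scores(image_emb, background_emb, class_text_embs)))
```

In this toy case the raw score follows the desert context, while the adjusted score follows the object. Real embeddings are high-dimensional and entangled, so the quality of the background estimate matters far more than this example suggests.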

[21] AdSum: Two-stream Audio-visual Summarization for Automated Video Advertisement Clipping

Wen Xie, Yanjun Zhu, Gijs Overgoor, Yakov Bart, Agata Lapedriza Garcia, Sarah Ostadabbas

🧩 TL;DR

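This study proposes a framework for automated ad clipping based on video summarization. It is the first to frame video clipping as a shot selection problem tailored to advertising, and its two-stream audio-visual fusion model significantly improves ad clipping performance.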


📘 Detailed Summary

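Motivation: Advertisers often need multiple versions of the same ad at different durations for a single campaign. The traditional approach relies on manually selecting and re-editing shots from longer video ads to create shorter versions, which is time-consuming and labor-intensive. Existing general video summarization methods focus mainly on visual content and overlook the key role of audio in advertising.

Method: The authors propose a two-stream audio-visual fusion model that predicts the importance of video frames, where importance is defined as the likelihood of a frame being selected for the firm-produced short ad. The approach frames video clipping as a shot selection problem and is tailored specifically to advertising.

Result: Extensive experiments on the AdSum204 dataset show that the model outperforms state-of-the-art methods on several metrics, including Average Precision, Area Under Curve, Spearman's correlation, and Kendall's coefficient, confirming the importance of audio for ad clipping.

Conclusion: The study highlights the key role of audio in ad clipping. The proposed two-stream audio-visual fusion method offers an effective solution for automated ad production, and the released AdSum204 dataset fills the gap in ad-specific datasets and provides a benchmark for future research.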


📄 Abstract

Advertisers commonly need multiple versions of the same advertisement (ad) at varying durations for a single campaign. The traditional approach involves manually selecting and re-editing shots from longer video ads to create shorter versions, which is labor-intensive and time-consuming. In this paper, we introduce a framework for automated video ad clipping using video summarization techniques. We are the first to frame video clipping as a shot selection problem, tailored specifically for advertising. Unlike existing general video summarization methods that primarily focus on visual content, our approach emphasizes the critical role of audio in advertising. To achieve this, we develop a two-stream audio-visual fusion model that predicts the importance of video frames, where importance is defined as the likelihood of a frame being selected in the firm-produced short ad. To address the lack of ad-specific datasets, we present AdSum204, a novel dataset comprising 102 pairs of 30-second and 15-second ads from real advertising campaigns. Extensive experiments demonstrate that our model outperforms state-of-the-art methods across various metrics, including Average Precision, Area Under Curve, Spearman, and Kendall.
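
One of the rank metrics named in this entry, Kendall's coefficient, compares how predicted frame importance orders frames against a reference ordering. The sketch below computes the simple tau-a variant in plain Python. The abstract does not say which Kendall variant or tie handling the authors use, so treat the choice of tau-a, the function name `kendall_tau_a`, and the sample scores as assumptions.

```python
from itertools import combinations


def kendall_tau_a(predicted, reference):
    """Kendall tau-a: (concordant - discordant) / total pairs.

    Pairs tied in either sequence count as neither concordant nor discordant.
    """
    n = len(predicted)
    if n != len(reference) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (predicted[i] - predicted[j]) * (reference[i] - reference[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


if __name__ == "__main__":
    # Hypothetical per-frame importance scores for four frames.
    predicted = [0.9, 0.2, 0.5, 0.7]
    reference = [1.0, 0.1, 0.6, 0.4]
    print("tau-a:", kendall_tau_a(predicted, reference))
```

A value of 1 means the two rankings agree on every pair of frames and -1 means they disagree on every pair. Library implementations often default to a tie-corrected variant, so numbers may differ from tau-a when scores contain ties.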

[22] Dynamic Context-Aware Scene Reasoning Using Vision-Language Alignment in Zero-Shot Real-World Scenarios

Manjunath Prasad Holenarasipura Rajiv, B. M. Vidyavathi

🧩 TL;DR

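This paper proposes a dynamic context-aware scene reasoning framework that uses vision-language alignment to tackle zero-shot scene understanding in the real world, achieving significant gains in complex, unseen environments.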


📘 Detailed Summary

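Motivation: In real-world environments, AI systems often face unfamiliar scenarios without labeled data. Conventional scene understanding models struggle to generalize to unseen contexts, which limits the deployment of vision applications in dynamic, unstructured settings.

Method: The approach integrates pre-trained vision transformers and large language models to align visual semantics with natural language descriptions and enhance contextual comprehension. A dynamic reasoning module refines predictions by combining global scene cues with object-level interactions, guided by linguistic priors.

Result: On zero-shot benchmarks such as COCO, Visual Genome, and Open Images, the method improves scene understanding accuracy by up to 18% over baseline models in complex, unseen environments. It also performs robustly in ambiguous or cluttered scenes, owing to the synergistic fusion of vision and language.

Conclusion: The framework offers a scalable and interpretable approach to context-aware reasoning, advances zero-shot generalization in dynamic real-world settings, and enables intelligent systems to adapt to new environments without task-specific training.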


📄 Abstract

In real-world environments, AI systems often face unfamiliar scenarios without labeled data, creating a major challenge for conventional scene understanding models. The inability to generalize across unseen contexts limits the deployment of vision-based applications in dynamic, unstructured settings. This work introduces a Dynamic Context-Aware Scene Reasoning framework that leverages Vision-Language Alignment to address zero-shot real-world scenarios. The goal is to enable intelligent systems to infer and adapt to new environments without prior task-specific training. The proposed approach integrates pre-trained vision transformers and large language models to align visual semantics with natural language descriptions, enhancing contextual comprehension. A dynamic reasoning module refines predictions by combining global scene cues and object-level interactions guided by linguistic priors. Extensive experiments on zero-shot benchmarks such as COCO, Visual Genome, and Open Images demonstrate up to 18% improvement in scene understanding accuracy over baseline models in complex and unseen environments. Results also show robust performance in ambiguous or cluttered scenes due to the synergistic fusion of vision and language. This framework offers a scalable and interpretable approach for context-aware reasoning, advancing zero-shot generalization in dynamic real-world settings.

[23] CATCH: A Modular Cross-domain Adaptive Template with Hook

Xinjin Li, Yulie Lu, Jinghan Cao, Yu Ma, Zhenglin Li, Yeyang Zhou

🧩 TL;DR

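This paper proposes CATCH, a plug-and-play framework for cross-domain adaptation in visual question answering. By decoupling visual and linguistic adaptation modules, it significantly improves multi-domain VQA performance without retraining the backbone model.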


📘 Detailed Summary

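Motivation: Existing VQA models perform well on natural images, but their generalization degrades significantly in out-of-domain scenarios such as remote sensing, medical imaging, and math diagrams, mainly because of distribution shift and the lack of effective domain adaptation mechanisms. Conventional approaches rely on per-domain fine-tuning or bespoke pipelines, which are costly, inflexible, and hard to scale to diverse tasks.

Method: CATCH introduces two lightweight modules: a domain classifier that identifies the input image type, and a dual adapter mechanism comprising a Prompt Adapter and a Visual Adapter. Both modules are injected dynamically through a unified hook interface with no retraining of the backbone, decoupling visual feature adjustment from language modulation.

Result: Across four domain-specific VQA benchmarks, CATCH achieves consistent gains without retraining the backbone model, including +2.3 BLEU on MathVQA, +2.6 VQA score on MedVQA-RAD, and +3.1 ROUGE on ChartQA.

Conclusion: CATCH provides a scalable and extensible approach to multi-domain VQA. Its lightweight adapters enable cross-domain generalization and practical deployment across diverse application domains while significantly reducing the cost and complexity of domain adaptation.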


📄 Abstract

Recent advances in Visual Question Answering (VQA) have demonstrated impressive performance in natural image domains, with models like LLaVA leveraging large language models (LLMs) for open-ended reasoning. However, their generalization degrades significantly when transferred to out-of-domain scenarios such as remote sensing, medical imaging, or math diagrams, due to large distributional shifts and the lack of effective domain adaptation mechanisms. Existing approaches typically rely on per-domain fine-tuning or bespoke pipelines, which are costly, inflexible, and not scalable across diverse tasks. In this paper, we propose CATCH, a plug-and-play framework for cross-domain adaptation that improves the generalization of VQA models while requiring minimal changes to their core architecture. Our key idea is to decouple visual and linguistic adaptation by introducing two lightweight modules: a domain classifier to identify the input image type, and a dual adapter mechanism comprising a Prompt Adapter for language modulation and a Visual Adapter for vision feature adjustment. Both modules are dynamically injected via a unified hook interface, requiring no retraining of the backbone model. Experimental results across four domain-specific VQA benchmarks demonstrate that our framework achieves consistent performance gains without retraining the backbone model, including +2.3 BLEU on MathVQA, +2.6 VQA on MedVQA-RAD, and +3.1 ROUGE on ChartQA. These results highlight that CATCH provides a scalable and extensible approach to multi-domain VQA, enabling practical deployment across diverse application domains.

[24] Emu3.5: Native Multimodal Models are World Learners

Yufeng Cui, Honghao Chen, Haoge Deng, Xu Huang, Xinghang Li, Jirong Liu, Yang Liu, Zhuoyan Luo, Jinsheng Wang, Wenxuan Wang, Yueze Wang, Chengyuan Wang, Fan Zhang, Yingli Zhao, Ting Pan, Xianduo Li, Zecheng Hao, Wenxuan Ma, Zhuo Chen, Yulong Ao, Tiejun Huang, Zhongyuan Wang, Xinlong Wang

🧩 TL;DR

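This paper introduces Emu3.5, a large-scale multimodal world model pre-trained end-to-end that natively predicts the next state across vision and language. It introduces Discrete Diffusion Adaptation for roughly 20x faster inference and achieves performance comparable to Gemini 2.5 Flash Image on image generation and editing tasks.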


📘 Detailed Summary

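Motivation: Current multimodal models are limited in interleaved vision-language generation, long-horizon consistency, and inference efficiency. Emu3.5 aims to be a multimodal foundation model that natively handles interleaved vision-language inputs and outputs, has world-modeling capabilities, and runs inference efficiently.

Method: Emu3.5 is pre-trained end-to-end with a unified next-token prediction objective on more than 10 trillion tokens of interleaved vision-language data, then post-trained with large-scale reinforcement learning to enhance multimodal reasoning and generation. The authors also propose Discrete Diffusion Adaptation (DiDA), which converts token-by-token decoding into bidirectional parallel prediction to improve inference efficiency.

Result: Emu3.5 achieves performance comparable to Gemini 2.5 Flash Image on image generation and editing tasks and performs better on interleaved generation tasks, with per-image inference accelerated by about 20x. It also exhibits long-horizon vision-language generation, any-to-image generation, complex text-rich image generation, and spatiotemporally consistent world exploration.

Conclusion: Emu3.5 shows that large-scale pre-training and reinforcement learning can produce a multimodal model with generalizable world-modeling abilities. Its open-source release is intended to support community research in multimodal AI and offers a new technical path for embodied intelligence and open-world interaction.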


📄 Abstract

We introduce Emu3.5, a large-scale multimodal world model that natively predicts the next state across vision and language. Emu3.5 is pre-trained end-to-end with a unified next-token prediction objective on a corpus of vision-language interleaved data containing over 10 trillion tokens, primarily derived from sequential frames and transcripts of internet videos. The model naturally accepts interleaved vision-language inputs and generates interleaved vision-language outputs. Emu3.5 is further post-trained with large-scale reinforcement learning to enhance multimodal reasoning and generation. To improve inference efficiency, we propose Discrete Diffusion Adaptation (DiDA), which converts token-by-token decoding into bidirectional parallel prediction, accelerating per-image inference by about 20x without sacrificing performance. Emu3.5 exhibits strong native multimodal capabilities, including long-horizon vision-language generation, any-to-image (X2I) generation, and complex text-rich image generation. It also exhibits generalizable world-modeling abilities, enabling spatiotemporally consistent world exploration and open-world embodied manipulation across diverse scenarios and tasks. For comparison, Emu3.5 achieves performance comparable to Gemini 2.5 Flash Image (Nano Banana) on image generation and editing tasks and demonstrates superior results on a suite of interleaved generation tasks. We open-source Emu3.5 at https://github.com/baaivision/Emu3.5 to support community research.

[25] All You Need for Object Detection: From Pixels, Points, and Prompts to Next-Gen Fusion and Multimodal LLMs/VLMs in Autonomous Vehicles

Sayed Pedram Haeri Boroujeni, Niloufar Mehrabi, Hazim Alzorgan, Ahmad Sarlak, Mahlagha Fazeli, Abolfazl Razi

🧩 TL;DR

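This survey offers a forward-looking analysis of object detection in autonomous vehicles, focusing on emerging paradigms such as vision-language models, large language models, and generative AI. By systematically reviewing sensor fusion, dataset categorization, and advanced detection methods, it lays out a clear roadmap of current capabilities, open challenges, and future opportunities.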


📘 Detailed Summary

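Motivation: The success of autonomous vehicles depends on reliable object detection in complex, multimodal environments, yet knowledge remains fragmented across multimodal perception, contextual reasoning, and cooperative intelligence. The survey aims to bridge this gap rather than re-examine outdated techniques.

Method: The survey systematically reviews AV sensors (camera, ultrasonic, LiDAR, and radar) and their fusion strategies. It introduces a structured categorization of datasets that goes beyond simple collections, covering ego-vehicle, infrastructure-based, and cooperative datasets. It then analyzes cutting-edge methods, from 2D and 3D detection pipelines to hybrid sensor fusion, with particular attention to transformer-driven approaches powered by Vision Transformers, large and small language models, and vision-language models.

Result: By synthesizing these perspectives, the survey delivers a clear roadmap of current capabilities, open challenges, and future opportunities, highlights the potential of integrating emerging paradigms with conventional sensor fusion, and provides a cross-analysis of data structures and characteristics.

Conclusion: By integrating emerging AI paradigms with conventional perception frameworks, the survey provides a systematic blueprint for AV object detection, identifies key directions toward multimodal cooperative intelligence and context-aware systems, and lays groundwork for future research.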


📄 Abstract

Autonomous Vehicles (AVs) are transforming the future of transportation through advances in intelligent perception, decision-making, and control systems. However, their success is tied to one core capability, reliable object detection in complex and multimodal environments. While recent breakthroughs in Computer Vision (CV) and Artificial Intelligence (AI) have driven remarkable progress, the field still faces a critical challenge as knowledge remains fragmented across multimodal perception, contextual reasoning, and cooperative intelligence. This survey bridges that gap by delivering a forward-looking analysis of object detection in AVs, emphasizing emerging paradigms such as Vision-Language Models (VLMs), Large Language Models (LLMs), and Generative AI rather than re-examining outdated techniques. We begin by systematically reviewing the fundamental spectrum of AV sensors (camera, ultrasonic, LiDAR, and Radar) and their fusion strategies, highlighting not only their capabilities and limitations in dynamic driving environments but also their potential to integrate with recent advances in LLM/VLM-driven perception frameworks. Next, we introduce a structured categorization of AV datasets that moves beyond simple collections, positioning ego-vehicle, infrastructure-based, and cooperative datasets (e.g., V2V, V2I, V2X, I2I), followed by a cross-analysis of data structures and characteristics. Ultimately, we analyze cutting-edge detection methodologies, ranging from 2D and 3D pipelines to hybrid sensor fusion, with particular attention to emerging transformer-driven approaches powered by Vision Transformers (ViTs), Large and Small Language Models (SLMs), and VLMs. By synthesizing these perspectives, our survey delivers a clear roadmap of current capabilities, open challenges, and future opportunities.

[26] SteerVLM: Robust Model Control through Lightweight Activation Steering for Vision Language Models

Anushka Sivakumar, Andrew Zhang, Zaber Hakim, Chris Thomas

🧩 TL;DR

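This paper proposes SteerVLM, a lightweight steering module that dynamically adjusts the activations connecting the language modality with image context in vision-language models, enabling fine-grained inference-time control without modifying model weights. The module requires learning parameters equal to only 0.14% of the original VLM's size and outperforms existing intervention techniques on steering and hallucination mitigation benchmarks.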


📘 Detailed Summary

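Motivation: Current vision-language models offer limited control over output semantics. There is a need for methods that provide fine-grained, inference-time control over complex output semantics while preserving performance on off-target tasks and leaving model weights unchanged. Existing methods typically require pre-extracted static vectors or manual tuning of intervention points and lack dynamic adaptability.

Method: SteerVLM learns from the latent embeddings of paired prompts that encode target and converse behaviors, and dynamically adjusts the activations connecting the language modality with image context. It uses dimension-wise activation modulation and adaptive steering across layers, with no need for pre-extracted static vectors or manual tuning of intervention points. The authors also introduce VNIA, a multimodal dataset created for developing and evaluating VLM steering techniques.

Result: SteerVLM outperforms existing intervention techniques on VLM steering and hallucination mitigation benchmarks while learning parameters equal to only 0.14% of the original VLM's size. It provides fine-grained inference-time control without modifying model weights and preserves performance on off-target tasks.

Conclusion: The study offers a robust solution for multimodal model control through activation engineering and demonstrates that a lightweight steering module can control complex output semantics effectively. SteerVLM opens a new route to precise behavior control in vision-language models, with practical value for real-world applications.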


📄 Abstract

This work introduces SteerVLM, a lightweight steering module designed to guide Vision-Language Models (VLMs) towards outputs that better adhere to desired instructions. Our approach learns from the latent embeddings of paired prompts encoding target and converse behaviors to dynamically adjust activations connecting the language modality with image context. This allows for fine-grained, inference-time control over complex output semantics without modifying model weights while preserving performance on off-target tasks. Our steering module requires learning parameters equal to 0.14% of the original VLM's size. Our steering module gains model control through dimension-wise activation modulation and adaptive steering across layers without requiring pre-extracted static vectors or manual tuning of intervention points. Furthermore, we introduce VNIA (Visual Narrative Intent Alignment), a multimodal dataset specifically created to facilitate the development and evaluation of VLM steering techniques. Our method outperforms existing intervention techniques on steering and hallucination mitigation benchmarks for VLMs and proposes a robust solution for multimodal model control through activation engineering.

[27] ChartAB: A Benchmark for Chart Grounding & Dense Alignment

Aniruddh Bansal, Davit Soselia, Dang Nguyen, Tianyi Zhou

🧩 TL;DR

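This paper introduces the ChartAlign Benchmark (ChartAB) for comprehensively evaluating the fine-grained perception of vision-language models in chart understanding, revealing the limitations of existing models in localizing chart elements, recognizing attributes, and comparing multiple charts.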


📘 Detailed Summary

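Motivation: Existing vision-language models lack accurate perception of details in chart understanding and struggle to extract fine-grained structures from charts. These limitations in chart grounding further hinder their ability to compare and reason over multiple charts.

Method: The authors design the ChartAlign Benchmark, use a JSON template to facilitate the calculation of evaluation metrics tailored to each grounding task, and introduce a novel two-stage inference workflow to evaluate how well models align and compare elements and attributes across charts.

Result: Evaluation of several recent vision-language models reveals their perception biases, weaknesses, robustness, and hallucinations in chart understanding, highlighting fine-grained discrepancies among models.

Conclusion: The findings point to specific skills that current models need to strengthen, provide guidance for improving VLMs on chart understanding, and underline the key role of fine-grained perception in complex visual reasoning.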


📄 Abstract

Charts play an important role in visualization, reasoning, data analysis, and the exchange of ideas among humans. However, existing vision-language models (VLMs) still lack accurate perception of details and struggle to extract fine-grained structures from charts. Such limitations in chart grounding also hinder their ability to compare multiple charts and reason over them. In this paper, we introduce a novel "ChartAlign Benchmark (ChartAB)" to provide a comprehensive evaluation of VLMs in chart grounding tasks, i.e., extracting tabular data, localizing visualization elements, and recognizing various attributes from charts of diverse types and complexities. We design a JSON template to facilitate the calculation of evaluation metrics specifically tailored for each grounding task. By incorporating a novel two-stage inference workflow, the benchmark can further evaluate VLMs' capability to align and compare elements/attributes across two charts. Our analysis of evaluations on several recent VLMs reveals new insights into their perception biases, weaknesses, robustness, and hallucinations in chart understanding. These findings highlight the fine-grained discrepancies among VLMs in chart understanding tasks and point to specific skills that need to be strengthened in current models.

[28] The Quest for Generalizable Motion Generation: Data, Model, and Evaluation

Jing Lin, Ruisi Wang, Junzhe Lu, Ziqi Huang, Guorui Song, Ailing Zeng, Xian Liu, Chen Wei, Wanqi Yin, Qingping Sun, Zhongang Cai, Lei Yang, Ziwei Liu

🧩 TL;DR

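This paper presents a framework that systematically transfers knowledge from video generation to 3D human motion generation. By building the large-scale ViMoGen-228K dataset, developing a flow-matching-based diffusion transformer, and establishing the hierarchical benchmark MBench, it significantly improves the generalization of motion generation.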


📘 Detailed Summary

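Motivation: Existing 3D human motion generation models face a fundamental bottleneck in generalization, whereas the adjacent field of video generation has shown remarkable generalization in modeling human behaviors, offering transferable insights for motion generation.

Method: The authors propose ViMoGen, a flow-matching-based diffusion transformer that unifies priors from MoCap data and video generation models through gated multimodal conditioning. They also develop ViMoGen-light, a distilled variant that removes the dependency on video generation while preserving strong generalization.

Result: Extensive experiments show that the framework significantly outperforms existing approaches in both automatic and human evaluations. The ViMoGen-228K dataset contains 228,000 high-quality motion samples and substantially expands semantic diversity.

Conclusion: The study demonstrates the effectiveness of transferring knowledge from video generation to motion generation and offers a systematic approach to improving the generalization of motion generation models. Future work could explore unified frameworks for cross-modal generation tasks.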


📄 Abstract

Despite recent advances in 3D human motion generation (MoGen) on standard benchmarks, existing models still face a fundamental bottleneck in their generalization capability. In contrast, adjacent generative fields, most notably video generation (ViGen), have demonstrated remarkable generalization in modeling human behaviors, highlighting transferable insights that MoGen can leverage. Motivated by this observation, we present a comprehensive framework that systematically transfers knowledge from ViGen to MoGen across three key pillars: data, modeling, and evaluation. First, we introduce ViMoGen-228K, a large-scale dataset comprising 228,000 high-quality motion samples that integrates high-fidelity optical MoCap data with semantically annotated motions from web videos and synthesized samples generated by state-of-the-art ViGen models. The dataset includes both text-motion pairs and text-video-motion triplets, substantially expanding semantic diversity. Second, we propose ViMoGen, a flow-matching-based diffusion transformer that unifies priors from MoCap data and ViGen models through gated multimodal conditioning. To enhance efficiency, we further develop ViMoGen-light, a distilled variant that eliminates video generation dependencies while preserving strong generalization. Finally, we present MBench, a hierarchical benchmark designed for fine-grained evaluation across motion quality, prompt fidelity, and generalization ability. Extensive experiments show that our framework significantly outperforms existing approaches in both automatic and human evaluations. The code, data, and benchmark will be made publicly available.

[29] Scaling Image Geo-Localization to Continent Level

Philipp Lindenberger, Paul-Edouard Sarlin, Jan Hosang, Matteo Balice, Marc Pollefeys, Simon Lynen, Eduard Trulls

🧩 TL;DR

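This paper proposes a hybrid approach that learns rich feature representations through a proxy classification task and combines them with ground-level and aerial image embeddings to achieve fine-grained geo-localization at continental scale, localizing more than 68% of queries within 200 m in Europe.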


📘 Detailed Summary

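Motivation: Global-scale image geo-localization suffers from the inefficiency and insufficient coverage of standard image retrieval. Existing scalable solutions face a trade-off between coarse-grained classification and the domain gap in cross-view retrieval, so fine-grained localization over a large geographic area remains an open need.

Method: The hybrid approach uses a proxy classification task during training to learn rich feature representations that implicitly encode precise location information, and combines the learned prototypes with aerial image embeddings to increase robustness to the sparsity of ground-level data.

Result: On a dataset covering a large part of Europe, the method localizes more than 68% of queries within 200 m, significantly outperforming existing methods.

Conclusion: The work shows that combining learned features with cross-view embeddings enables fine-grained geo-localization at continental scale and offers a viable approach for large-scale geo-localization systems.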


📄 Abstract

Determining the precise geographic location of an image at a global scale remains an unsolved challenge. Standard image retrieval techniques are inefficient due to the sheer volume of images (>100M) and fail when coverage is insufficient. Scalable solutions, however, involve a trade-off: global classification typically yields coarse results (10+ kilometers), while cross-view retrieval between ground and aerial imagery suffers from a domain gap and has been primarily studied on smaller regions. This paper introduces a hybrid approach that achieves fine-grained geo-localization across a large geographic expanse the size of a continent. We leverage a proxy classification task during training to learn rich feature representations that implicitly encode precise location information. We combine these learned prototypes with embeddings of aerial imagery to increase robustness to the sparsity of ground-level data. This enables direct, fine-grained retrieval over areas spanning multiple countries. Our extensive evaluation demonstrates that our approach can localize more than 68% of queries within 200 m on a dataset covering a large part of Europe. The code is publicly available at https://scaling-geoloc.github.io.
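
The headline number in this entry is a recall-style metric: the share of queries whose predicted location falls within 200 m of the true one. A minimal Python sketch of that metric follows. It assumes great-circle (haversine) distance on a spherical Earth and uses hypothetical names (`haversine_m`, `recall_within`) and made-up coordinates; the abstract does not state how the authors measure distance.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def recall_within(predictions, ground_truth, threshold_m=200.0):
    """Fraction of queries localized within `threshold_m` of the true position."""
    if len(predictions) != len(ground_truth) or not predictions:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(
        haversine_m(p[0], p[1], t[0], t[1]) <= threshold_m
        for p, t in zip(predictions, ground_truth)
    )
    return hits / len(predictions)


if __name__ == "__main__":
    # Made-up coordinates: the first prediction is off by 0.001 degrees of
    # latitude, the second by 0.01 degrees.
    predictions = [(47.3769, 8.5417), (48.8566, 2.3522)]
    ground_truth = [(47.3779, 8.5417), (48.8666, 2.3522)]
    print("recall@200m:", recall_within(predictions, ground_truth))
```

At this scale the spherical approximation is usually adequate, although an ellipsoidal model would be more precise.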

[30] OmniX: From Unified Panoramic Generation and Perception to Graphics-Ready 3D Scenes

Yukun Huang, Jiwen Yu, Yanning Zhou, Jianan Wang, Xintao Wang, Pengfei Wan, Xihui Liu

🧩 TL;DR

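This paper presents OmniX, a framework that repurposes 2D generative priors for panoramic perception and produces graphics-ready 3D scenes suitable for physically based rendering, relighting, and simulation, going beyond conventional 2D lifting methods that focus only on appearance generation.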


📘 Detailed Summary

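Motivation: Existing panorama-based 2D lifting methods focus on appearance generation and ignore the perception of intrinsic properties, so they cannot produce graphics-ready 3D scenes suitable for physically based rendering, relighting, and simulation. This work addresses that limitation and advances panorama-based lifting toward 3D scenes with complete physical properties.

Method: The authors present OmniX, a unified framework built on a lightweight and efficient cross-modal adapter structure that repurposes 2D generative priors for panoramic perception of geometry, textures, and PBR materials. They also construct a large-scale synthetic panorama dataset containing high-quality multimodal panoramas from diverse indoor and outdoor scenes.

Result: Extensive experiments demonstrate the model's effectiveness in panoramic visual perception and graphics-ready 3D scene generation, opening new possibilities for immersive and physically realistic virtual world generation.

Conclusion: The study shows that repurposing 2D generative priors for panoramic perception is feasible and provides a unified framework for generating 3D scenes with complete physical properties, advancing immersive virtual environment generation.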


📄 Abstract

There are two prevalent ways of constructing 3D scenes: procedural generation and 2D lifting. Of the two, panorama-based 2D lifting has emerged as a promising technique, leveraging powerful 2D generative priors to produce immersive, realistic, and diverse 3D environments. In this work, we advance this technique to generate graphics-ready 3D scenes suitable for physically based rendering (PBR), relighting, and simulation. Our key insight is to repurpose 2D generative models for panoramic perception of geometry, textures, and PBR materials. Unlike existing 2D lifting approaches that emphasize appearance generation and ignore the perception of intrinsic properties, we present OmniX, a versatile and unified framework. Based on a lightweight and efficient cross-modal adapter structure, OmniX reuses 2D generative priors for a broad range of panoramic vision tasks, including panoramic perception, generation, and completion. Furthermore, we construct a large-scale synthetic panorama dataset containing high-quality multimodal panoramas from diverse indoor and outdoor scenes. Extensive experiments demonstrate the effectiveness of our model in panoramic visual perception and graphics-ready 3D scene generation, opening new possibilities for immersive and physically realistic virtual world generation.

cs.CL

[31] LISTEN to Your Preferences: An LLM Framework for Multi-Objective Selection

Adam S. Jovine, Tinghan Ye, Francis Bahk, Jingjing Wang, David B. Shmoys, Peter I. Frazier

🧩 TL;DR

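This paper proposes LISTEN, a framework that uses a large language model as a zero-shot preference oracle, guided by natural language, to solve multi-objective selection problems while reducing the cognitive burden of traditional preference elicitation.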


📘 Detailed Summary

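Motivation: Human experts choosing from a large set of options with multiple competing objectives often struggle to formalize their complex, implicit preferences. Preference elicitation is the bottleneck in this process, which calls for a method that can use natural-language guidance directly to simplify complex decisions.

Method: The authors propose the LISTEN framework with two iterative algorithms. LISTEN-U uses the LLM to refine a parametric utility function, while LISTEN-T is a non-parametric method that performs tournament-style selection over small batches of solutions. Both are designed to work within the LLM's context-window and inference-cost limits.

Result: Evaluation on diverse tasks, including flight booking, shopping, and exam scheduling, shows that LISTEN-U excels when preferences are parametrically aligned, a property quantified with a novel concordance metric, while LISTEN-T delivers more robust performance.

Conclusion: This work explores a viable direction for steering complex multi-objective decisions directly with natural language, offering a promising way to reduce the cognitive burden of traditional preference elicitation and opening a path toward practical use of LLMs in decision-support systems.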


📄 Abstract

Human experts often struggle to select the best option from a large set of items with multiple competing objectives, a process bottlenecked by the difficulty of formalizing complex, implicit preferences. To address this, we introduce LISTEN, a framework that leverages a Large Language Model (LLM) as a zero-shot preference oracle, guided only by an expert's high-level priorities in natural language. To operate within LLM constraints like context windows and inference costs, we propose two iterative algorithms: LISTEN-U, which uses the LLM to refine a parametric utility function, and LISTEN-T, a non-parametric method that performs tournament-style selections over small batches of solutions. Evaluated on diverse tasks including flight booking, shopping, and exam scheduling, our results show LISTEN-U excels when preferences are parametrically aligned (a property we measure with a novel concordance metric), while LISTEN-T offers more robust performance. This work explores a promising direction for steering complex multi-objective decisions directly with natural language, reducing the cognitive burden of traditional preference elicitation.
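
To make the tournament idea concrete, here is a minimal sketch of batch-wise elimination, assuming a single-elimination scheme in which each small batch is sent to a preference oracle and only its winner advances. The abstract does not specify LISTEN-T at this level of detail, so the function `tournament_select`, its parameters, and the stand-in oracle `cheapest` are all hypothetical; a real system would replace the oracle with an LLM call that sees the batch and the expert's stated priorities.

```python
import random
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def tournament_select(
    items: Sequence[T],
    pick_best: Callable[[List[T]], T],
    batch_size: int = 4,
    seed: int = 0,
) -> T:
    """Ask the oracle for the winner of each small batch until one item remains."""
    if not items:
        raise ValueError("items must be non-empty")
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2")
    rng = random.Random(seed)
    pool = list(items)
    while len(pool) > 1:
        rng.shuffle(pool)
        # Each batch is small enough to fit an (assumed) context budget.
        pool = [
            pick_best(pool[i:i + batch_size])
            for i in range(0, len(pool), batch_size)
        ]
    return pool[0]


# Stand-in oracle for illustration only: prefer the cheapest flight.
def cheapest(batch):
    return min(batch, key=lambda flight: flight["price"])


prices = [420, 310, 515, 290, 610, 305, 450]
flights = [{"id": i, "price": p} for i, p in enumerate(prices)]
winner = tournament_select(flights, cheapest, batch_size=3)
```

With a consistent oracle such as `cheapest`, the globally preferred item wins every batch it appears in, so it always reaches the end. An LLM oracle may be inconsistent across batches, which is one reason the number of oracle calls per round (roughly the pool size divided by `batch_size`) and the batching order matter in practice.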

[32] Distilling Multilingual Vision-Language Models: When Smaller Models Stay Multilingual

Sukrit Sriratanawilai, Jhayahgrit Thongwat, Romrawin Chumpu, Patomporn Payoungkhamdee, Sarana Nutanong, Peerat Limkonchotiwat

🧩 TL;DR

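This study uses controlled experiments to analyze how knowledge distillation behaves when compressing multilingual vision-language models. Distillation configurations differ markedly in cross-lingual representation consistency and downstream performance stability, exposing design-sensitive trade-offs that aggregate accuracy alone does not reveal.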


📘 Detailed Summary

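Motivation: Vision-language models perform unevenly across languages, and the problem often worsens as model size shrinks. Knowledge distillation in multilingual settings remains underexplored, so a systematic study is needed of how distillation methods affect cross-lingual representation consistency and performance stability under model compression.

Method: The study runs controlled experiments with five knowledge distillation approaches on CLIP and SigLIP2, evaluating them on in-domain retrieval and out-of-domain visual question answering, with a focus on cross-lingual representation consistency and downstream performance stability.

Result: Some distillation configurations preserve or even improve multilingual retrieval robustness despite halving model size, while others fail to maintain cross-task stability, exposing design-sensitive trade-offs that aggregate accuracy alone does not capture.

Conclusion: The choice of distillation method is critical to how well a multilingual vision-language model survives compression. Cross-lingual representation consistency and task stability should be considered together rather than overall accuracy alone, which provides design guidance for multilingual model compression.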


📄 Abstract

Vision-language models (VLMs) exhibit uneven performance across languages, a problem that is often exacerbated when the model size is reduced. While knowledge distillation (KD) demonstrates promising results in transferring knowledge from larger to smaller VLMs, applying KD in multilingual settings is an underexplored area. This paper presents a controlled empirical study of KD behavior across five distillation approaches, isolating their effects on cross-lingual representation consistency and downstream performance stability under model compression. We study five distillation formulations across CLIP and SigLIP2, and evaluate them on in-domain retrieval and out-of-domain visual QA. We find that some configurations preserve or even improve multilingual retrieval robustness despite halving model size, but others fail to maintain cross-task stability, exposing design-sensitive trade-offs that aggregate accuracy alone does not reveal.

[33] MisSynth: Improving MISSCI Logical Fallacies Classification with Synthetic Data

Mykhailo Poliakov, Nadiya Shvai

🧩 TL;DR

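This paper proposes MisSynth, a pipeline that uses retrieval-augmented generation to produce synthetic fallacy samples and then fine-tunes an LLM on them. It substantially improves the ability of large language models to recognize science- and health-related misinformation, achieving an absolute F1-score improvement of over 35% with limited computational resources.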


📘 Detailed Summary

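Motivation: Health-related misinformation is widespread and potentially harmful, and it is especially hard to identify when claims distort or misinterpret scientific findings. Annotated resources are limited, which motivates exploring ways to improve LLM recognition of fallacious arguments under constrained compute.

Method: The authors propose the MisSynth pipeline, which applies retrieval-augmented generation to produce synthetic fallacy samples and then uses those samples to fine-tune a large language model. The method builds on the MISSCI dataset and framework and uses synthetic data to augment the limited annotated resources.

Result: Fine-tuned models achieve substantial accuracy gains over vanilla baselines. The fine-tuned LLaMA 3.1 8B model achieves an absolute F1-score improvement of over 35% on the MISSCI test split compared with its vanilla baseline. The experiments show that synthetic fallacy data can significantly enhance zero-shot LLM classification performance on real-world scientific misinformation tasks.

Conclusion: Introducing synthetic fallacy data effectively improves model performance when annotated resources are limited, and it significantly improves LLM recognition of scientific misinformation even under constrained compute. The code and synthetic dataset have been released.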


📄 Abstract

Health-related misinformation is prevalent and potentially harmful. It is difficult to identify, especially when claims distort or misinterpret scientific findings. We investigate the impact of synthetic data generation and lightweight fine-tuning techniques on the ability of large language models (LLMs) to recognize fallacious arguments using the MISSCI dataset and framework. In this work, we propose MisSynth, a pipeline that applies retrieval-augmented generation (RAG) to produce synthetic fallacy samples, which are then used to fine-tune an LLM. Our results show substantial accuracy gains with fine-tuned models compared to vanilla baselines. For instance, the fine-tuned LLaMA 3.1 8B model achieved an absolute F1-score improvement of over 35% on the MISSCI test split over its vanilla baseline. We demonstrate that introducing synthetic fallacy data to augment limited annotated resources can significantly enhance zero-shot LLM classification performance on real-world scientific misinformation tasks, even with limited computational resources. The code and synthetic dataset are available at https://github.com/mxpoliakov/MisSynth.

[34] Hebrew Diacritics Restoration using Visual Representation

Yair Elboher, Yuval Pinter

🧩 TL;DR

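This paper proposes DIVRIT, a system that frames Hebrew diacritics restoration as a zero-shot classification problem. A visual language model processes undiacritized text as an image, and the system restores diacritics with high accuracy when the candidate set contains the correct form.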


📘 Detailed Summary

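Motivation: Restoring Hebrew diacritics is essential for accurate pronunciation and for disambiguating text. Undiacritized text is highly ambiguous, and existing methods still leave room for more effective solutions to this complex language-processing problem.

Method: DIVRIT frames diacritization as a context-conditioned zero-shot classification problem. It uses a Hebrew visual language model that processes undiacritized text as an image and selects the most appropriate diacritization pattern for each word from a dynamically generated candidate set.

Result: In a comprehensive evaluation, the system performs diacritization effectively without relying on complex, explicit linguistic analysis. It reaches high accuracy in the oracle setting, where the correct diacritized form is guaranteed to be in the candidate set. Architectural enhancements and improved training methods significantly improve the system's generalization.

Conclusion: The study shows that visual representations hold considerable promise for automatic Hebrew diacritization. The approach avoids dependence on complex linguistic analysis and offers a new technical path for vision-based language-processing tasks.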


📄 Abstract

Diacritics restoration in Hebrew is a fundamental task for ensuring accurate word pronunciation and disambiguating textual meaning. Despite the language's high degree of ambiguity when unvocalized, recent machine learning approaches have significantly advanced performance on this task. In this work, we present DIVRIT, a novel system for Hebrew diacritization that frames the task as a zero-shot classification problem. Our approach operates at the word level, selecting the most appropriate diacritization pattern for each undiacritized word from a dynamically generated candidate set, conditioned on the surrounding textual context. A key innovation of DIVRIT is its use of a Hebrew Visual Language Model, which processes undiacritized text as an image, allowing diacritic information to be embedded directly within the input's vector representation. Through a comprehensive evaluation across various configurations, we demonstrate that the system effectively performs diacritization without relying on complex, explicit linguistic analysis. Notably, in an "oracle" setting where the correct diacritized form is guaranteed to be among the provided candidates, DIVRIT achieves a high level of accuracy. Furthermore, strategic architectural enhancements and optimized training methodologies yield significant improvements in the system's overall generalization capabilities. These findings highlight the promising potential of visual representations for accurate and automated Hebrew diacritization.

[35] SlideAgent: Hierarchical Agentic Framework for Multi-Page Visual Document Understanding

Yiqiao Jin, Rachneet Kaur, Zhen Zeng, Sumitra Ganesh, Srijan Kumar

🧩 TL;DR

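This paper proposes SlideAgent, an agentic framework for understanding multi-modal, multi-page, and multi-layout documents. Hierarchical specialized agents support fine-grained reasoning from the global level down to individual elements, and the framework significantly outperforms existing proprietary and open-source models on complex visual document understanding tasks.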


📘 Detailed Summary

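Motivation: Current systems struggle with complex multi-page visual documents, particularly with fine-grained reasoning over elements across pages. Large language models offer opportunities for document understanding, but they have not yet effectively solved the challenges of understanding complex multi-page visual documents.

Method: SlideAgent uses a hierarchy of specialized agents and decomposes reasoning into three specialized levels: global, page, and element. It builds a structured, query-agnostic representation, then selectively activates agents at different levels during inference and integrates their outputs into context-aware answers.

Result: Extensive experiments show that SlideAgent improves overall performance by +7.9 over proprietary models and by +9.8 over open-source models, a significant gain on complex multi-page visual document understanding tasks.

Conclusion: The study demonstrates the effectiveness of a hierarchical, specialized agent framework for complex multi-page visual document understanding, offers a new approach to multimodal document analysis, and shows the advantages of agent architectures for fine-grained cross-page reasoning.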


📄 Abstract

Multi-page visual documents such as manuals, brochures, presentations, and posters convey key information through layout, colors, icons, and cross-slide references. While large language models (LLMs) offer opportunities in document understanding, current systems struggle with complex, multi-page visual documents, particularly in fine-grained reasoning over elements and pages. We introduce SlideAgent, a versatile agentic framework for understanding multi-modal, multi-page, and multi-layout documents, especially slide decks. SlideAgent employs specialized agents and decomposes reasoning into three specialized levels (global, page, and element) to construct a structured, query-agnostic representation that captures both overarching themes and detailed visual or textual cues. During inference, SlideAgent selectively activates specialized agents for multi-level reasoning and integrates their outputs into coherent, context-aware answers. Extensive experiments show that SlideAgent achieves significant improvement over both proprietary (+7.9 overall) and open-source models (+9.8 overall).

cs.AI

[36] MedSAE: Dissecting MedCLIP Representations with Sparse Autoencoders

Riccardo Renzulli, Colas Lepoutre, Enrico Cassano, Marco Grangetto

🧩 TL;DR

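This study applies Medical Sparse Autoencoders (MedSAEs) to the latent space of MedCLIP and proposes an evaluation framework that combines correlation metrics, entropy analyses, and automated neuron naming. The approach improves the mechanistic interpretability of medical vision models, achieving higher monosemanticity and interpretability than raw MedCLIP features on the CheXpert dataset.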


📘 Detailed Summary

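Motivation: Artificial intelligence in healthcare requires models that are both accurate and interpretable. This work aims to advance mechanistic interpretability in medical vision, narrowing the gap between high-performing medical AI and transparency and offering a scalable route toward clinically reliable representations.

Method: The study applies Medical Sparse Autoencoders (MedSAEs) to the latent space of the MedCLIP vision-language model. To quantify interpretability, it proposes an evaluation framework that combines correlation metrics, entropy analyses, and automated neuron naming via the MedGEMMA foundation model.

Result: Experiments on the CheXpert dataset show that MedSAE neurons achieve higher monosemanticity and interpretability than raw MedCLIP features, supporting the effectiveness of the method for improving the transparency of medical vision models.

Conclusion: The study bridges high-performing medical AI and transparency, offers a scalable path toward clinically reliable representations, and advances the mechanistic interpretability of medical vision models, with relevance for clinical applications.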


📄 Abstract

Artificial intelligence in healthcare requires models that are accurate and interpretable. We advance mechanistic interpretability in medical vision by applying Medical Sparse Autoencoders (MedSAEs) to the latent space of MedCLIP, a vision-language model trained on chest radiographs and reports. To quantify interpretability, we propose an evaluation framework that combines correlation metrics, entropy analyses, and automated neuron naming via the MedGEMMA foundation model. Experiments on the CheXpert dataset show that MedSAE neurons achieve higher monosemanticity and interpretability than raw MedCLIP features. Our findings bridge high-performing medical AI and transparency, offering a scalable step toward clinically reliable representations.
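
The abstract does not say how its entropy analysis is defined. One common way entropy can serve as a proxy for monosemanticity, offered here purely as an assumption, is to look at how a neuron's activation mass is spread across class labels: a neuron that fires for a single finding has low entropy, while one that fires equally for many findings has high entropy. The function name and the per-class activation inputs below are hypothetical.

```python
import math
from typing import Sequence


def activation_entropy(class_activations: Sequence[float]) -> float:
    """Shannon entropy (in bits) of a neuron's activation mass across classes.

    class_activations[i] is the neuron's total (non-negative) activation on
    samples carrying label i. Lower entropy means the neuron is more selective.
    """
    if any(a < 0 for a in class_activations):
        raise ValueError("activations must be non-negative")
    total = sum(class_activations)
    if total <= 0:
        raise ValueError("neuron never activates; entropy is undefined")
    probs = [a / total for a in class_activations]
    # Terms with zero probability contribute nothing to the entropy.
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical neurons over four labels.
selective = activation_entropy([9.0, 0.0, 0.0, 0.0])  # fires for one label only
diffuse = activation_entropy([1.0, 1.0, 1.0, 1.0])    # fires equally for all four
```

Under this proxy, the selective neuron scores zero bits and the diffuse neuron scores two bits, the maximum for four classes.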

[37] Unveiling Intrinsic Text Bias in Multimodal Large Language Models through Attention Key-Space Analysis

Xinhan Zheng, Huyu Wu, Xueting Wang, Haiyun Jiang

🧩 TL;DR

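This paper shows that the text preference of multimodal large language models stems from a distribution mismatch among key vectors inside the attention mechanism rather than from external data factors. Visual key vectors are out-of-distribution relative to the text key space, so visual information is systematically underweighted during attention computation.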


📘 Detailed Summary

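Motivation: Multimodal large language models show a pronounced preference for text when processing vision-language data, which limits their ability to reason effectively from visual evidence. Unlike prior studies that attribute this text bias to external factors such as data imbalance or instruction tuning, this study proposes that the bias originates in the model's internal architecture: visual key vectors are out-of-distribution relative to the text key space learned during language-only pretraining.

Method: The study extracts key vectors from LLaVA and Qwen2.5-VL and analyzes their distributional structure with qualitative (t-SNE) and quantitative (Jensen-Shannon divergence) methods. It compares the distributions of visual and textual key vectors in the attention space to test the hypothesis that visual keys are out-of-distribution relative to the text key space.

Result: The results provide direct evidence that visual and textual key vectors occupy markedly distinct subspaces within the attention space. The inter-modal divergence is statistically significant and exceeds intra-modal variation by several orders of magnitude. Visual key vectors receive systematically lower similarity scores during attention computation, which leads to their under-utilization in the context representation.

Conclusion: Text bias arises from an intrinsic misalignment within the attention key space rather than solely from external data factors. The finding matters for multimodal model design: it suggests re-examining the role of attention in multimodal fusion and developing more effective cross-modal alignment methods to improve visual reasoning.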


📄 Abstract

Multimodal large language models (MLLMs) exhibit a pronounced preference for textual inputs when processing vision-language data, limiting their ability to reason effectively from visual evidence. Unlike prior studies that attribute this text bias to external factors such as data imbalance or instruction tuning, we propose that the bias originates from the model's internal architecture. Specifically, we hypothesize that visual key vectors (Visual Keys) are out-of-distribution (OOD) relative to the text key space learned during language-only pretraining. Consequently, these visual keys receive systematically lower similarity scores during attention computation, leading to their under-utilization in the context representation. To validate this hypothesis, we extract key vectors from LLaVA and Qwen2.5-VL and analyze their distributional structures using qualitative (t-SNE) and quantitative (Jensen-Shannon divergence) methods. The results provide direct evidence that visual and textual keys occupy markedly distinct subspaces within the attention space. The inter-modal divergence is statistically significant, exceeding intra-modal variation by several orders of magnitude. These findings reveal that text bias arises from an intrinsic misalignment within the attention key space rather than solely from external data factors.
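
For readers unfamiliar with the quantitative measure mentioned above, the following is a minimal sketch of Jensen-Shannon divergence between two discrete distributions. It assumes the key vectors have already been reduced to normalized histograms over shared bins; the abstract does not describe how the authors estimate these distributions, so that step and the sample histograms are assumptions.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 are skipped."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence in bits: symmetric and bounded in [0, 1]."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    # The mixture m is positive wherever p or q is, so the KL terms are finite.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical histograms over the same bins (each sums to 1).
text_keys = [0.5, 0.5, 0.0, 0.0]
visual_keys = [0.0, 0.0, 0.5, 0.5]

same_modality = js_divergence(text_keys, text_keys)     # identical distributions
cross_modality = js_divergence(text_keys, visual_keys)  # no overlapping mass
```

Identical distributions give a divergence of zero, and distributions with no overlapping mass reach the upper bound of one bit, which is the regime the "distinct subspaces" finding points toward.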