cs.CV

[1] Scaling the Long Video Understanding of Multimodal Large Language Models via Visual Memory Mechanism

Tao Chen, Kun Zhang, Qiong Wu, Xiao Chen, Chao Chang, Xiaoshuai Sun, Yiyi Zhou, Rongrong Ji

🧩 TL;DR

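This paper proposes FlexMem, a training-free method that mimics the memory mechanism humans use when watching video. It enables multimodal large language models to process videos of unbounded length, removing the input-length limit of existing methods.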


📘 Detailed Summary

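Motivation: Long video understanding is a key challenge for multimodal large language models. Existing methods must process all video information at once and have an upper limit on input length, so they cannot handle videos of unbounded length effectively, which limits how far these models can scale in real-world applications.

Method: FlexMem builds on a visual memory mechanism. It treats the visual KV caches as the memory source, uses a dual-pathway compression design for effective memory transfer and writing, and explores several memory reading strategies for different video understanding tasks, including the popular streaming setting.

Result: Experiments cover five long video tasks and one streaming video task. On a single 3090 GPU, FlexMem processes videos of more than 1,000 frames, clearly outperforms existing efficient video understanding methods, and lets the base MLLMs match or even exceed SOTA models such as GPT-4o and Gemini-1.5 Pro on some benchmarks.

Conclusion: The study shows that a training-free method that mimics human memory can effectively address long video understanding. It offers a feasible technical path for multimodal large language models to process videos of unbounded length, with both practical and research value.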


📄 Abstract

Long video understanding is a key challenge that plagues the advancement of Multimodal Large Language Models (MLLMs). In this paper, we study this problem from the perspective of visual memory mechanisms and propose a novel, training-free approach termed Flexible Memory (FlexMem). In principle, FlexMem aims to mimic how humans watch video, i.e., continually watching video content and recalling the most relevant memory fragments to answer the question. In this way, FlexMem can help MLLMs understand videos of infinite length, unlike previous methods that process all video information at once and have an upper limit on input length. Concretely, FlexMem first treats the visual KV caches as the memory source and realizes effective memory transfer and writing via a dual-pathway compression design. FlexMem then explores different memory reading strategies for diverse video understanding tasks, including the popular streaming one. To validate FlexMem, we apply it to two popular video-MLLMs and conduct extensive experiments on five long video tasks and one streaming video task. The experimental results show that on a single 3090 GPU, FlexMem achieves clear improvements over existing efficient video understanding methods and processes more than 1k frames, which also helps the base MLLMs achieve comparable or even better performance than SOTA MLLMs, e.g., GPT-4o and Gemini-1.5 Pro, on some benchmarks.
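
The write-then-recall idea in the abstract can be illustrated with a toy memory bank. This is only a sketch under stated assumptions, not FlexMem's dual-pathway compression or its actual reading strategies: mean pooling stands in for compression, cosine similarity stands in for relevance, and the names `mean_pool` and `read_memory` are hypothetical.

```python
import math

def mean_pool(tokens, keep):
    """Compress a list of token vectors into at most `keep` averaged vectors.

    Assumption: plain average pooling stands in for the paper's compression.
    """
    size = math.ceil(len(tokens) / keep)
    pooled = []
    for start in range(0, len(tokens), size):
        group = tokens[start:start + size]
        pooled.append([sum(col) / len(group) for col in zip(*group)])
    return pooled

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def read_memory(memory, query, top_k=1):
    """Return indices of the top_k fragments whose best vector matches the query."""
    scores = [max(cosine(vec, query) for vec in fragment) for fragment in memory]
    ranked = sorted(range(len(memory)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]

# Toy usage: three clips of 8 identical 3-d tokens each, written one by one.
clips = [[[1.0, 0.0, 0.0]] * 8, [[0.0, 1.0, 0.0]] * 8, [[0.0, 0.0, 1.0]] * 8]
memory = [mean_pool(clip, keep=4) for clip in clips]
recalled = read_memory(memory, query=[0.1, 0.9, 0.0], top_k=1)
```

Because each clip is compressed before it is stored, the memory grows with the number of clips rather than the number of raw tokens, which is the general reason a memory-based design can keep watching without hitting a fixed input limit.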

[2] Adversarial Prompt Injection Attack on Multimodal Large Language Models

Meiwen Ding, Song Xia, Chenqi Kong, Xudong Jiang

🧩 TL;DR

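This paper proposes an imperceptible visual prompt injection attack against closed-source multimodal large language models. It adaptively embeds malicious instructions into the input image and aligns feature representations at both coarse- and fine-grained levels, effectively attacking powerful MLLMs.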


📘 Detailed Summary

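Motivation: Although multimodal large language models are increasingly deployed in real-world applications, their instruction-following behavior leaves them vulnerable to prompt injection attacks. Existing methods rely mainly on textual prompts or human-perceptible visual prompts, whereas this paper studies imperceptible visual prompt injection against powerful closed-source MLLMs, where the adversarial instructions are embedded in the visual modality.

Method: The method adaptively embeds the malicious prompt into the input image via a bounded text overlay to provide semantic guidance, while iteratively optimizing an imperceptible visual perturbation that aligns the feature representation of the attacked image with the malicious visual and textual targets at both coarse- and fine-grained levels. The visual target is instantiated as a text-rendered image and progressively refined during optimization to represent the desired semantics more faithfully and improve transferability.

Result: Extensive experiments on two multimodal understanding tasks across multiple closed-source MLLMs show that the method outperforms existing methods. It achieves imperceptible visual prompt injection against powerful closed-source models, demonstrating its effectiveness and transferability.

Conclusion: The study exposes a security vulnerability of MLLMs to imperceptible visual prompt injection and underscores the importance of multimodal security. The method offers a new perspective for evaluating and improving MLLM robustness, and points to the need for stronger defenses against such attacks.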


📄 Abstract

Although multimodal large language models (MLLMs) are increasingly deployed in real-world applications, their instruction-following behavior leaves them vulnerable to prompt injection attacks. Existing prompt injection methods predominantly rely on textual prompts or perceptible visual prompts that are observable by human users. In this work, we study imperceptible visual prompt injection against powerful closed-source MLLMs, where adversarial instructions are embedded in the visual modality. Our method adaptively embeds the malicious prompt into the input image via a bounded text overlay to provide semantic guidance. Meanwhile, the imperceptible visual perturbation is iteratively optimized to align the feature representation of the attacked image with those of the malicious visual and textual targets at both coarse- and fine-grained levels. Specifically, the visual target is instantiated as a text-rendered image and progressively refined during optimization to more faithfully represent the desired semantics and improve transferability. Extensive experiments on two multimodal understanding tasks across multiple closed-source MLLMs demonstrate the superior performance of our approach compared to existing methods.

[3] Multimodal Models Meet Presentation Attack Detection on ID Documents

Marina Villanueva, Juan M. Espin, Juan E. Tapia

🧩 TL;DR

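This study explores applying multimodal models (such as Paligemma, Llava, and Qwen) to presentation attack detection (PAD) on ID documents, fusing visual features with textual metadata to strengthen detection. The experiments show, however, that these models perform poorly on the ID document PAD task.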


📘 Detailed Summary

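Motivation: Traditional presentation attack detection systems rely only on visual features and struggle to detect sophisticated spoofing attacks. This study aims to address that limitation in ID document security by integrating multimodal information.

Method: The study uses pre-trained multimodal models (including Paligemma, Llava, and Qwen) and combines deep visual embeddings with contextual metadata (e.g., document type, issuer, and date) to build a more comprehensive presentation attack detection system.

Result: Despite the advanced multimodal approach, these models still perform poorly on presentation attack detection for ID documents and fall short of the expected detection accuracy.

Conclusion: The study exposes the limitations of current multimodal models in ID document security and suggests that more specialized architectures or training strategies are needed for this specific presentation attack detection task.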


📄 Abstract

The integration of multimodal models into Presentation Attack Detection (PAD) for ID Documents represents a significant advancement in biometric security. Traditional PAD systems rely solely on visual features, which often fail to detect sophisticated spoofing attacks. This study explores the combination of visual and textual modalities by utilizing pre-trained multimodal models, such as Paligemma, Llava, and Qwen, to enhance the detection of presentation attacks on ID Documents. This approach merges deep visual embeddings with contextual metadata (e.g., document type, issuer, and date). However, experimental results indicate that these models struggle to accurately detect presentation attacks on ID Documents.

[4] CutClaw: Agentic Hours-Long Video Editing via Music Synchronization

Shifang Zhao, Yihan Hu, Ying Shan, Yunchao Wei, Xiaodong Cun

🧩 TL;DR

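This paper presents CutClaw, an autonomous multi-agent framework built on multimodal language models that automatically edits long raw footage into short videos that are synchronized with the music's rhythm and narratively coherent, significantly improving the efficiency and quality of video editing.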


📘 Detailed Summary

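Motivation: Editing video content in alignment with audio has become a form of digital art on today's social media. Manual video editing, however, is time-consuming and repetitive, a long-standing challenge for filmmakers and professional content creators that calls for automated solutions to improve editing efficiency and quality.

Method: CutClaw uses a multi-agent framework. It first applies a hierarchical multimodal decomposition to capture both fine-grained details and global structure across the visual and audio footage. A Playwriter Agent then orchestrates the overall storytelling flow and builds the long-term narrative structure, anchoring visual scenes to musical shifts. Finally, Editor and Reviewer Agents collaboratively optimize the final cut by selecting fine-grained visual content according to rigorous aesthetic and semantic criteria.

Result: Experiments show that CutClaw significantly outperforms state-of-the-art baselines at generating high-quality, rhythm-aligned videos, and can effectively edit hours of raw footage into short videos that are synchronized with music, follow instructions, and are visually appealing.

Conclusion: The study demonstrates the effectiveness of multi-agent frameworks for automated video editing, provides an efficient tool for content creation, and shows the potential of multimodal language models working together on complex creative tasks, pointing the way for future intelligent media production systems.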


📄 Abstract

Editing video content in alignment with audio has become a form of digital, human-made art on today's social media. However, the time-consuming and repetitive nature of manual video editing has long been a challenge for filmmakers and professional content creators alike. In this paper, we introduce CutClaw, an autonomous multi-agent framework that edits hours-long raw footage into meaningful short videos by leveraging multiple Multimodal Language Models (MLLMs) as an agent system. It produces videos that are synchronized with music, follow instructions, and are visually appealing. In detail, our approach begins by employing a hierarchical multimodal decomposition that captures both fine-grained details and global structures across visual and audio footage. Then, to ensure narrative consistency, a Playwriter Agent orchestrates the whole storytelling flow and structures the long-term narrative, anchoring visual scenes to musical shifts. Finally, to construct a short edited video, Editor and Reviewer Agents collaboratively optimize the final cut by selecting fine-grained visual content based on rigorous aesthetic and semantic criteria. We conduct detailed experiments to demonstrate that CutClaw significantly outperforms state-of-the-art baselines in generating high-quality, rhythm-aligned videos. The code is available at: https://github.com/GVCLab/CutClaw.
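
One simple way to picture "anchoring visual scenes to musical shifts" is to snap candidate cut points to the nearest beat. The sketch below is an illustration only and is not taken from the CutClaw code base: it assumes beat timestamps are already available as a sorted list of seconds, and the function name `snap_to_beats` is hypothetical.

```python
import bisect

def snap_to_beats(cut_times, beat_times):
    """Move each cut time to the nearest beat.

    Assumption: `beat_times` is a non-empty list of seconds in ascending order.
    """
    snapped = []
    for t in cut_times:
        # Index of the first beat at or after t.
        i = bisect.bisect_left(beat_times, t)
        # Only the beat just before t and the beat at or after t can be nearest.
        candidates = beat_times[max(i - 1, 0): i + 1]
        snapped.append(min(candidates, key=lambda b: abs(b - t)))
    return snapped

# Toy usage: beats every half second, three rough cut points.
beats = [0.0, 0.5, 1.0, 1.5, 2.0]
cuts = snap_to_beats([0.4, 1.2, 1.9], beats)
```

In an agentic pipeline the rough cut points would come from the narrative plan; a purely geometric snap like this ignores semantics, which is presumably why the paper delegates the final selection to Editor and Reviewer Agents.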

[5] EC-Bench: Enumeration and Counting Benchmark for Ultra-Long Videos

Fumihiko Tsuchiya, Taiki Miyanishi, Mahiro Ukai, Nakamasa Inoue, Shuhei Kurita, Yusuke Iwasawa, Yutaka Matsuo

🧩 TL;DR

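This study introduces EC-Bench, a benchmark for evaluating enumeration, counting, and temporal evidence grounding in long videos, and reveals fundamental limitations of current multimodal large language models in long-range temporal reasoning.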


📘 Detailed Summary

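Motivation: Current video counting research focuses mainly on short clips and evaluates only the final numerical answer. It lacks any evaluation of temporal reasoning over the sparse, diverse events in long videos, and cannot reveal whether models consistently identify relevant instances or understand what should be counted.

Method: The study introduces the EC-Bench benchmark, which contains 152 videos longer than 30 minutes and 1,699 queries paired with explicit evidence spans, and jointly evaluates three tasks: enumeration, counting, and temporal evidence grounding.

Result: Across 22 multimodal large language models, the best model reaches only 29.98% accuracy on enumeration and 23.74% on counting, while human performance reaches 78.57% and 82.97%, respectively. The analysis shows strong relationships between enumeration accuracy, temporal grounding, and counting performance.

Conclusion: The results reveal fundamental limitations of current MLLMs in long-range quantitative video reasoning, establish EC-Bench as a challenging benchmark, and highlight the close relationship between enumeration accuracy, temporal grounding, and counting performance.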


📄 Abstract

Counting in long videos remains a fundamental yet underexplored challenge in computer vision. Real-world recordings often span tens of minutes or longer and contain sparse, diverse events, making long-range temporal reasoning particularly difficult. However, most existing video counting benchmarks focus on short clips and evaluate only the final numerical answer, providing little insight into what should be counted or whether models consistently identify relevant instances across time. We introduce EC-Bench, a benchmark that jointly evaluates enumeration, counting, and temporal evidence grounding in long-form videos. EC-Bench contains 152 videos longer than 30 minutes and 1,699 queries paired with explicit evidence spans. Across 22 multimodal large language models (MLLMs), the best model achieves only 29.98% accuracy on Enumeration and 23.74% on Counting, while human performance reaches 78.57% and 82.97%, respectively. Our analysis reveals strong relationships between enumeration accuracy, temporal grounding, and counting performance. These results highlight fundamental limitations of current MLLMs and establish EC-Bench as a challenging benchmark for long-form quantitative video reasoning.
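
To make the evaluated quantities concrete, the sketch below computes two generic measures: exact-match counting accuracy and temporal IoU between a predicted and a ground-truth evidence span. The abstract does not spell out EC-Bench's scoring formulas, so treat these as common stand-ins rather than the benchmark's official metrics; the function names are hypothetical.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) spans given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def counting_accuracy(preds, golds):
    """Fraction of queries whose predicted count equals the gold count exactly."""
    correct = sum(int(p == g) for p, g in zip(preds, golds))
    return correct / len(golds)

# Toy usage with made-up values (not data from the paper).
iou = temporal_iou((10.0, 20.0), (15.0, 25.0))   # 5 s overlap over a 15 s union
acc = counting_accuracy([3, 5, 2, 4], [3, 4, 2, 1])
```

Scoring the evidence spans alongside the final number is what lets a benchmark distinguish a model that counted the right events from one that reached the right number by chance.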

cs.CL

[6] Calibrated Confidence Expression for Radiology Report Generation

David Bani-Harouni, Chantal Pellegrini, Julian Lüers, Su Hwan Kim, Markus Baalmann, Benedikt Wiestler, Rickmer Braren, Nassir Navab, Matthias Keicher

🧩 TL;DR

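This paper presents ConRad, a reinforcement learning framework for fine-tuning medical large vision-language models to produce calibrated verbalized confidence estimates alongside radiology reports, supporting the safe deployment of AI-assisted report generation in clinical practice.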


📘 Detailed Summary

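Motivation: Safely deploying large vision-language models for radiology report generation requires clinically interpretable confidence indicators that signal when outputs need thorough review, reducing the risk that hallucinated findings influence clinical decisions. Current state-of-the-art language models are often overconfident, and research on calibration in multimodal settings such as radiology report generation is limited, which leaves a research gap.

Method: The paper introduces ConRad, a reinforcement learning framework for fine-tuning medical large vision-language models to produce calibrated verbalized confidence estimates. It studies two settings: a single report-level confidence score and a sentence-level variant that assigns a confidence to each claim. Both are trained with the GRPO algorithm, using reward functions based on the logarithmic scoring rule, which incentivizes truthful self-assessment by penalizing miscalibration and guarantees optimal calibration under reward maximization.

Result: Experiments show that ConRad substantially improves calibration and outperforms competing methods. In a clinical evaluation, ConRad's report-level scores are well aligned with clinicians' judgment. By highlighting full reports or low-confidence statements for targeted review, ConRad can support safer clinical integration of AI-assisted report generation.

Conclusion: The study offers an effective solution for multimodal confidence calibration in radiology report generation, optimizing verbalized confidence estimates through a reinforcement learning framework. ConRad can identify high-risk outputs that need clinical review, promoting the safe deployment of AI-assisted systems in medical settings and providing a practical tool for selective radiologist verification.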


📄 Abstract

Safe deployment of Large Vision-Language Models (LVLMs) in radiology report generation requires not only accurate predictions but also clinically interpretable indicators of when outputs should be thoroughly reviewed, enabling selective radiologist verification and reducing the risk of hallucinated findings influencing clinical decisions. One intuitive approach to this is verbalized confidence, where the model explicitly states its certainty. However, current state-of-the-art language models are often overconfident, and research on calibration in multimodal settings such as radiology report generation is limited. To address this gap, we introduce ConRad (Confidence Calibration for Radiology Reports), a reinforcement learning framework for fine-tuning medical LVLMs to produce calibrated verbalized confidence estimates alongside radiology reports. We study two settings: a single report-level confidence score and a sentence-level variant assigning a confidence to each claim. Both are trained using the GRPO algorithm with reward functions based on the logarithmic scoring rule, which incentivizes truthful self-assessment by penalizing miscalibration and guarantees optimal calibration under reward maximization. Experimentally, ConRad substantially improves calibration and outperforms competing methods. In a clinical evaluation, we show that ConRad's report-level scores are well aligned with clinicians' judgment. By highlighting full reports or low-confidence statements for targeted review, ConRad can support safer clinical integration of AI assistance for report generation.
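
The claim that a logarithmic scoring rule rewards truthful confidence can be checked numerically. The sketch below uses the textbook binary form of the rule, log(c) when the output is judged correct and log(1 - c) otherwise; ConRad's exact reward shaping is not given in the abstract, so the clipping constant and the function names are assumptions.

```python
import math

def log_score_reward(confidence, correct, eps=1e-6):
    """Binary logarithmic scoring rule; confidence is clipped away from 0 and 1."""
    c = min(max(confidence, eps), 1.0 - eps)
    return math.log(c) if correct else math.log(1.0 - c)

def expected_reward(confidence, p_correct):
    """Expected reward when the output is truly correct with probability p_correct."""
    return (p_correct * log_score_reward(confidence, True)
            + (1.0 - p_correct) * log_score_reward(confidence, False))

# Scan stated confidences for a model that is right 70% of the time.
grid = [i / 100 for i in range(1, 100)]
best = max(grid, key=lambda c: expected_reward(c, 0.7))
```

The expected reward p·log(c) + (1 - p)·log(1 - c) is strictly concave in c and peaks at c = p, so a policy that maximizes this reward has no incentive to overstate or understate its confidence.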

cs.AI

[7] Webscraper: Leverage Multimodal Large Language Models for Index-Content Web Scraping

Guan-Lun Huang, Yuh-Jzer Joung

🧩 TL;DR

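This paper presents Webscraper, a framework that uses a multimodal large language model to autonomously navigate dynamic, interactive websites and perform structured data extraction, addressing the limitations of traditional web scrapers on modern dynamic web applications.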


📘 Detailed Summary

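Motivation: Modern web scraping struggles with dynamic, interactive websites. Traditional static HTML parsing is often brittle and requires manual customization for each site, so it cannot effectively handle modern web applications that require user interaction.

Method: Webscraper uses a multimodal large language model to autonomously navigate interactive interfaces, combined with a structured five-stage prompting procedure and a set of custom-built tools. It targets websites that follow the common "index-and-content" architecture and handles dynamic page interactions through specialized tools.

Result: Experiments on six news websites show that the full Webscraper framework, equipped with the guiding prompt and specialized tools, achieves a significant improvement in extraction accuracy over the baseline agent, Anthropic's Computer Use. Its application to e-commerce platforms validates its generalizability.

Conclusion: The study shows the potential of multimodal large language models for automated web data extraction and provides a systematic solution for dynamic, interactive websites. The framework's extensibility and tool-based approach offer a new direction for future web scraping technology.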


📄 Abstract

Modern web scraping struggles with dynamic, interactive websites that require more than static HTML parsing. Current methods are often brittle and require manual customization for each site. To address this, we introduce Webscraper, a framework designed to handle the challenges of modern, dynamic web applications. It leverages a Multimodal Large Language Model (MLLM) to autonomously navigate interactive interfaces, invoke specialized tools, and perform structured data extraction in environments where traditional scrapers are ineffective. Webscraper utilizes a structured five-stage prompting procedure and a set of custom-built tools to navigate and extract data from websites following the common "index-and-content" architecture. Our experiments, conducted on six news websites, demonstrate that the full Webscraper framework, equipped with both our guiding prompt and specialized tools, achieves a significant improvement in extraction accuracy over the baseline agent Anthropic's Computer Use. We also applied the framework to e-commerce platforms to validate its generalizability.
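
For readers unfamiliar with the "index-and-content" layout, the sketch below shows the two-level traversal on static, in-memory pages: collect links from an index page, then extract one field from each content page. It is the plain static-parsing baseline that the abstract calls brittle, not the MLLM-driven Webscraper framework, and the page contents, the `fetch` callback, and the class names are all hypothetical.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects href values of <a> tags on an index page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

class TitleExtractor(HTMLParser):
    """Extracts the text inside <h1> tags on a content page."""
    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_h1:
            self.title += data

def scrape(index_html, fetch):
    """Visit every link on the index page and return one record per content page."""
    collector = LinkCollector()
    collector.feed(index_html)
    records = []
    for url in collector.links:
        extractor = TitleExtractor()
        extractor.feed(fetch(url))
        records.append({"url": url, "title": extractor.title.strip()})
    return records

# Toy usage: a dict stands in for the network.
INDEX = '<ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>'
PAGES = {"/a": "<h1>First story</h1><p>Body text.</p>", "/b": "<h1> Second story </h1>"}
records = scrape(INDEX, PAGES.__getitem__)
```

The hard-coded `a` and `h1` selectors are exactly what breaks when a site renders its index through scripts or hides content behind interaction, which is the gap an agent that navigates the rendered interface is meant to close.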

[8] Xuanwu: Evolving General Multimodal Models into an Industrial-Grade Foundation for Content Ecosystems

Zhiqian Zhang, Xu Zhao, Xiaoqing Xu, Guangdong Liang, Weijia Wang, Xiaolei Lv, Bo Li, Jun Gao

🧩 TL;DR

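This paper presents Xuanwu VL-2B as a case study of developing a general multimodal model into an industrial-grade foundation model for content ecosystems. Its compact 2B-parameter architecture achieves a practical balance among business alignment, visual perception, general capability retention, and deployment cost.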


📘 Detailed Summary

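Motivation: In real-world content moderation and adversarial settings, mainstream multimodal models suffer from degraded generalization and catastrophic forgetting, mainly because of limited fine-grained visual perception and insufficient modeling of long-tail noise. This calls for an industrial-grade foundation model that balances business specialization with general capability retention under a limited parameter budget.

Method: The model adopts a compact InternViT-300M + MLP + Qwen3 1.7B architecture that balances fine-grained visual perception and language-semantic alignment within an approximately 2B-parameter budget. Using a data iteration and curation mechanism, it is trained through a progressive three-stage pipeline (pre-training, mid-training, and post-training) to balance business specialization with general capability retention.

Result: Xuanwu VL-2B achieves an average score of 67.90 across seven OpenCompass multimodal metrics (vs. 64.27 for InternVL 3.5 2B), an average recall of 94.38% over seven independent business moderation tasks, and a weighted overall recall of 82.82% on policy-violating text in challenging adversarial OCR scenarios, outperforming Gemini-2.5-Pro (76.72%).

Conclusion: The study shows that, under a limited parameter budget, a compact architecture and a progressive training strategy let a multimodal model achieve a practical balance among business alignment, visual perception, general capability retention, and deployment cost, offering a feasible path toward industrial-grade foundation models for content ecosystems.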


📄 Abstract

In recent years, multimodal large models have continued to improve on general benchmarks. However, in real-world content moderation and adversarial settings, mainstream models still suffer from degraded generalization and catastrophic forgetting because of limited fine-grained visual perception and insufficient modeling of long-tail noise. In this paper, we present Xuanwu VL-2B as a case study of how general multimodal models can be developed into an industrial-grade foundation model for content ecosystems. The model adopts a compact InternViT-300M + MLP + Qwen3 1.7B architecture, balancing fine-grained visual perception, language-semantic alignment, and deployment cost within an approximately 2B-parameter budget. To balance business specialization with the retention of general capabilities, we developed a data iteration and curation mechanism and trained the model through a progressive three-stage pipeline: pre-training, mid-training, and post-training. Ablation studies and offline business evaluations show that Xuanwu VL-2B achieves an average score of 67.90 across seven OpenCompass multimodal metrics (vs. 64.27 for InternVL 3.5 2B), an average recall of 94.38% over seven independent business moderation tasks, and a weighted overall recall of 82.82% on policy-violating text in challenging adversarial OCR scenarios, outperforming Gemini-2.5-Pro (76.72%). These results show that, under a limited parameter budget, Xuanwu VL-2B achieves a practical balance among business alignment, visual perception, general capability retention, and deployment cost.
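
The "vision encoder + MLP + language model" layout named in the abstract is commonly wired so that an MLP projects visual tokens into the language model's embedding space, after which they are concatenated with the text embeddings. The sketch below illustrates that general pattern with toy dimensions and random weights. It is an assumption about the wiring, not Xuanwu VL-2B's implementation, and every name and size in it is hypothetical.

```python
import random

D_VIS, D_LLM = 4, 6            # toy widths; the real models are far wider
rng = random.Random(0)

def rand_matrix(rows, cols):
    """Random Gaussian matrix as a list of rows."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def linear(x, w, b):
    """Apply y = x @ w + b to a list of row vectors."""
    return [[sum(xi * w[i][j] for i, xi in enumerate(row)) + b[j]
             for j in range(len(b))] for row in x]

def relu(x):
    """Elementwise ReLU (a real projector may use a different activation)."""
    return [[max(v, 0.0) for v in row] for row in x]

def project(visual_tokens, params):
    """Two-layer MLP mapping vision features to the language model width."""
    w1, b1, w2, b2 = params
    return linear(relu(linear(visual_tokens, w1, b1)), w2, b2)

params = (rand_matrix(D_VIS, D_LLM), [0.0] * D_LLM,
          rand_matrix(D_LLM, D_LLM), [0.0] * D_LLM)
visual_tokens = rand_matrix(5, D_VIS)     # stand-in for vision encoder output
text_embeds = rand_matrix(3, D_LLM)       # stand-in for text token embeddings
llm_input = project(visual_tokens, params) + text_embeds   # sequence concatenation
```

Under this layout the projector is small compared with the two backbones it connects, so the parameter budget is dominated by the vision encoder and the language model.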

[9] ATP-Bench: Towards Agentic Tool Planning for MLLM Interleaved Generation

Yinuo Liu, Zi Qian, Heng Zhou, Jiahao Zhang, Yajie Zhang, Zhihang Li, Mengyu Zhou, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang

🧩 TL;DR

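This paper introduces the ATP-Bench benchmark and the MAM evaluation system for systematically evaluating the agentic tool planning ability of multimodal large language models in interleaved text-and-image generation, revealing significant shortcomings of existing models in coherent planning and tool use.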


📘 Detailed Summary

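Motivation: Current interleaved text-and-image generation methods typically treat image generation and retrieval augmentation as mutually exclusive paths, failing to unify factuality with creativity. There is also no systematic evaluation framework for the agentic tool planning paradigm, in which the model acts as a central controller that autonomously decides when, where, and which tools to invoke to produce interleaved responses for visual-critical queries.

Method: The paper introduces ATP-Bench, a benchmark of 7,702 QA pairs (including 1,592 VQA pairs) across eight categories and 25 visual-critical intents, and designs a Multi-Agent MLLM-as-a-Judge (MAM) evaluation system. Independent of end-to-end execution and changing tool backends, MAM evaluates tool-call precision, identifies missed opportunities for tool use, and assesses overall response quality without requiring ground-truth references.

Result: Extensive experiments on 10 state-of-the-art MLLMs show that models struggle with coherent interleaved planning and exhibit significant variations in tool-use behavior, revealing substantial room for improvement in agentic tool planning and providing actionable guidance for advancing interleaved generation.

Conclusion: The study highlights agentic tool planning as the next milestone for interleaved text-and-image generation. ATP-Bench and the MAM system provide a standardized framework for systematically evaluating this paradigm, and the experimental results expose the limitations of current models and give clear guidance for future research, including improving models' planning ability and tool-use strategies.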


📄 Abstract

Interleaved text-and-image generation represents a significant frontier for Multimodal Large Language Models (MLLMs), offering a more intuitive way to convey complex information. Current paradigms rely on either image generation or retrieval augmentation, yet they typically treat the two as mutually exclusive paths, failing to unify factuality with creativity. We argue that the next milestone in this field is Agentic Tool Planning, where the model serves as a central controller that autonomously determines when, where, and which tools to invoke to produce interleaved responses for visual-critical queries. To systematically evaluate this paradigm, we introduce ATP-Bench, a novel benchmark comprising 7,702 QA pairs (including 1,592 VQA pairs) across eight categories and 25 visual-critical intents, featuring human-verified queries and ground truths. Furthermore, to evaluate agentic planning independent of end-to-end execution and changing tool backends, we propose a Multi-Agent MLLM-as-a-Judge (MAM) system. MAM evaluates tool-call precision, identifies missed opportunities for tool use, and assesses overall response quality without requiring ground-truth references. Our extensive experiments on 10 state-of-the-art MLLMs reveal that models struggle with coherent interleaved planning and exhibit significant variations in tool-use behavior, highlighting substantial room for improvement and providing actionable guidance for advancing interleaved generation. Dataset and code are available at https://github.com/Qwen-Applications/ATP-Bench.
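
The two planning measures named in the abstract, tool-call precision and missed opportunities, can be pictured as a simple aggregation over judge verdicts. The sketch below assumes a judge has already labeled each tool call as appropriate or not and counted the places where a tool should have been called but was not. The aggregation formulas and the name `summarize_verdicts` are hypothetical, not the MAM system's actual scoring.

```python
def summarize_verdicts(call_verdicts, missed_opportunities):
    """Aggregate judge output for one response.

    call_verdicts: list of booleans, True if the judge deemed a tool call appropriate.
    missed_opportunities: number of places the judge says a tool call was needed
    but absent.
    """
    total = len(call_verdicts)
    good = sum(call_verdicts)
    precision = good / total if total else 0.0
    needed = good + missed_opportunities
    missed_rate = missed_opportunities / needed if needed else 0.0
    return {"precision": precision, "missed_rate": missed_rate}

# Toy usage with made-up verdicts (not data from the paper).
summary = summarize_verdicts([True, True, False, True], missed_opportunities=1)
```

Reporting the two numbers separately distinguishes a model that calls tools too eagerly (low precision) from one that calls them too rarely (high missed rate), two failure modes that a single quality score would blur together.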