cs.CV

[1] DocLens: A Tool-Augmented Multi-Agent Framework for Long Visual Document Understanding

Dawei Zhu, Rui Meng, Jiefeng Chen, Sujian Li, Tomas Pfister, Jinsung Yoon

🧩 TL;DR

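This paper proposes DocLens, a tool-augmented multi-agent framework that zooms in on evidence like a lens to address the evidence-localization challenge in long visual document understanding. Paired with Gemini-2.5-Pro, it achieves state-of-the-art performance on MMLongBench-Doc and FinRAGBench-V, surpassing even human experts.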


📘 Detailed Summary

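Motivation: Existing vision-language models face a fundamental evidence-localization challenge when processing long visual documents: they struggle to retrieve relevant pages and overlook fine-grained details within visual elements, which limits performance and leads to hallucination.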

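Method: DocLens is a tool-augmented multi-agent framework. It first navigates from the full document to specific visual elements on relevant pages, then uses a sampling-adjudication mechanism to produce a single, reliable answer.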
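
The two stages described under Method can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`navigate_to_pages`, `keyword_overlap`, `sample_and_adjudicate`, `majority_adjudicator`) are hypothetical, the keyword-overlap score is a toy stand-in for page retrieval, and majority voting is swapped in for the adjudication step, whose details the summary does not specify. In a real system each callable would presumably wrap a VLM or tool call.

```python
from collections import Counter
from typing import Callable, List, Sequence


def keyword_overlap(question: str, page: str) -> float:
    """Toy relevance score (assumption): count of shared lowercase words."""
    return float(len(set(question.lower().split()) & set(page.lower().split())))


def navigate_to_pages(question: str,
                      pages: Sequence[str],
                      score: Callable[[str, str], float],
                      top_k: int = 2) -> List[int]:
    """Return the indices of the top_k highest-scoring pages, in page order."""
    ranked = sorted(range(len(pages)),
                    key=lambda i: score(question, pages[i]),
                    reverse=True)
    return sorted(ranked[:top_k])


def majority_adjudicator(candidates: Sequence[str]) -> str:
    """Stand-in adjudicator (assumption): choose the most frequent answer."""
    return Counter(candidates).most_common(1)[0][0]


def sample_and_adjudicate(sampler: Callable[[int], str],
                          adjudicate: Callable[[Sequence[str]], str],
                          n_samples: int = 5) -> str:
    """Draw n_samples candidate answers, then reduce them to a single answer."""
    candidates = [sampler(seed) for seed in range(n_samples)]
    return adjudicate(candidates)


if __name__ == "__main__":
    pages = [
        "Revenue grew in 2023 according to the chart",
        "Team photo and office locations",
        "Table of revenue by region for 2023",
    ]
    question = "What was revenue in 2023"
    print(navigate_to_pages(question, pages, keyword_overlap))

    # Hypothetical noisy answerer: disagrees with itself on some seeds.
    def toy_sampler(seed: int) -> str:
        return "4.5M" if seed % 3 == 0 else "4.2M"

    print(sample_and_adjudicate(toy_sampler, majority_adjudicator))
```

Separating the sampler from the adjudicator keeps the reduction step swappable, so the majority vote used here could be replaced by a model-based judge without touching the sampling loop.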

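Result: DocLens achieves state-of-the-art performance on MMLongBench-Doc and FinRAGBench-V, surpassing human experts, with a particularly clear advantage on vision-centric and unanswerable queries.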

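Conclusion: The study demonstrates the importance of stronger localization for long-document understanding. DocLens shows strong capability on visually dense and complex queries and offers a new design paradigm for document understanding systems.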


📄 Abstract

Comprehending long visual documents, where information is distributed across extensive pages of text and visual elements, is a critical but challenging task for modern Vision-Language Models (VLMs). Existing approaches falter on a fundamental challenge: evidence localization. They struggle to retrieve relevant pages and overlook fine-grained details within visual elements, leading to limited performance and model hallucination. To address this, we propose DocLens, a tool-augmented multi-agent framework that effectively "zooms in" on evidence like a lens. It first navigates from the full document to specific visual elements on relevant pages, then employs a sampling-adjudication mechanism to generate a single, reliable answer. Paired with Gemini-2.5-Pro, DocLens achieves state-of-the-art performance on MMLongBench-Doc and FinRAGBench-V, surpassing even human experts. The framework's superiority is particularly evident on vision-centric and unanswerable queries, demonstrating the power of its enhanced localization capabilities.