cs.CV
[1] ProactiveBench: Benchmarking Proactiveness in Multimodal Large Language Models
Thomas De Min, Subhankar Roy, Stéphane Lathuilière, Elisa Ricci, Massimiliano Mancini
🧩 TL;DR
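This work introduces ProactiveBench, a benchmark that evaluates whether multimodal large language models (MLLMs) proactively ask for help when user intervention is needed. It finds that existing models generally lack proactiveness and that proactiveness does not correlate with model capacity, but that the behavior can be learned through reinforcement-learning fine-tuning.
📘 Detailed Summary
Motivation: Existing MLLMs lack the human-like "proactive" behavior of asking for help when stuck, for example requesting that an obstruction be removed when identifying an occluded object. This ability is essential for effective human-machine collaboration but has not been studied systematically.
Method: The authors build ProactiveBench from seven repurposed datasets covering tasks such as recognizing occluded objects, enhancing image quality, and interpreting coarse sketches. They evaluate 22 MLLMs for proactiveness and explore a simple fine-tuning strategy based on reinforcement learning.
Result: Existing MLLMs generally lack proactiveness, and proactiveness does not correlate with model capacity. Hinting yields only marginal gains, while conversation histories and in-context learning introduce negative biases. Reinforcement-learning fine-tuning suggests that proactiveness can be learned and can generalize to unseen scenarios.
Conclusion: Proactiveness appears to be a learnable capability rather than a fixed property of multimodal models. The publicly released ProactiveBench provides a benchmark for building proactive multimodal models, and future work needs to address how to train models to request help at the right time.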
📄 Abstract
Effective collaboration begins with knowing when to ask for help. For example, when trying to identify an occluded object, a human would ask someone to remove the obstruction. Can MLLMs exhibit a similar "proactive" behavior by requesting simple user interventions? To investigate this, we introduce ProactiveBench, a benchmark built from seven repurposed datasets that tests proactiveness across different tasks such as recognizing occluded objects, enhancing image quality, and interpreting coarse sketches. We evaluate 22 MLLMs on ProactiveBench, showing that (i) they generally lack proactiveness; (ii) proactiveness does not correlate with model capacity; (iii) "hinting" at proactiveness yields only marginal gains. Surprisingly, we found that conversation histories and in-context learning introduce negative biases, hindering performance. Finally, we explore a simple fine-tuning strategy based on reinforcement learning: its results suggest that proactiveness can be learned, even generalizing to unseen scenarios. We publicly release ProactiveBench as a first step toward building proactive multimodal models.
[2] Narrative Aligned Long Form Video Question Answering
Rahul Jain, Keval Doshi, Burak Uzkent, Garin Kessler
🧩 TL;DR
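This paper introduces NA-VQA, a benchmark for evaluating deep temporal and narrative reasoning in long-form videos, and develops Video-NaRA, a framework that explicitly models narrative structure to address the weakness of existing MLLMs in long-range causal reasoning.
📘 Detailed Summary
Motivation: Most existing long-video reasoning benchmarks rely on localized cues and fail to capture narrative reasoning, i.e. the ability to track intentions, connect distant events, and reconstruct causal chains across an entire movie. This limits progress of MLLMs on complex narrative understanding.
Method: NA-VQA contains 88 full-length movies and 4.4K open-ended question-answer pairs, each grounded in evidence spans labeled Short, Medium, or Far. Video-NaRA is a narrative-centric framework that builds event-level chains and stores them in a structured memory for retrieval during reasoning.
Result: State-of-the-art MLLMs perform poorly on questions that require far-range evidence, while Video-NaRA improves long-range reasoning performance by up to 3 percent.
Conclusion: The study highlights the need for explicit narrative modeling. NA-VQA exposes the limits of existing models in long-range causal reasoning, and Video-NaRA offers one route to improving narrative understanding in MLLMs.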
📄 Abstract
Recent progress in multimodal large language models (MLLMs) has led to a surge of benchmarks for long-video reasoning. However, most existing benchmarks rely on localized cues and fail to capture narrative reasoning, the ability to track intentions, connect distant events, and reconstruct causal chains across an entire movie. We introduce NA-VQA, a benchmark designed to evaluate deep temporal and narrative reasoning in long-form videos. NA-VQA contains 88 full-length movies and 4.4K open-ended question-answer pairs, each grounded in multiple evidence spans labeled as Short, Medium, or Far to assess long-range dependencies. By requiring generative, multi-scene answers, NA-VQA tests whether models can integrate dispersed narrative information rather than rely on shallow pattern matching. To address the limitations of existing approaches, we propose Video-NaRA, a narrative-centric framework that builds event-level chains and stores them in a structured memory for retrieval during reasoning. Extensive experiments show that state-of-the-art MLLMs perform poorly on questions requiring far-range evidence, highlighting the need for explicit narrative modeling. Video-NaRA improves long-range reasoning performance by up to 3 percent, demonstrating its effectiveness in handling complex narrative structures. We will release NA-VQA upon publication.
[3] Instruction-Free Tuning of Large Vision Language Models for Medical Instruction Following
Myeongkyun Kang, Soopil Kim, Xiaoxiao Li, Sang Hyun Park
🧩 TL;DR
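This paper proposes an instruction-free tuning approach that uses a momentum proxy instruction and a response shuffling strategy to reduce reliance on handcrafted instructions in the medical domain, enabling efficient domain adaptation of large vision language models (LVLMs).
📘 Detailed Summary
Motivation: Building large-scale, high-quality instruction datasets in the medical domain is particularly hard because annotation requires specialized expert knowledge. This limits the application and tuning efficiency of vision language models in medicine.
Method: The approach fine-tunes on image-description pairs only. It introduces a momentum proxy instruction in place of curated text instructions and applies a response shuffling strategy to reduce the model's over-reliance on previous words.
Result: The method achieves state-of-the-art accuracy on multiple-choice visual question answering across the SKINCON, WBCAtt, CBIS, and MIMIC-CXR datasets, significantly improving the fine-tuning efficiency of LVLMs in the medical domain.
Conclusion: Tuning without explicit instructions is feasible in the medical domain. The momentum proxy instruction preserves the instruction-following capability of the pre-trained model, offering an effective way to adapt vision language models in data-scarce domains.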
📄 Abstract
Large vision language models (LVLMs) have demonstrated impressive performance across a wide range of tasks. These capabilities largely stem from visual instruction tuning, which fine-tunes models on datasets consisting of curated image-instruction-output triplets. However, in the medical domain, constructing large-scale, high-quality instruction datasets is particularly challenging due to the need for specialized expert knowledge. To address this issue, we propose an instruction-free tuning approach that reduces reliance on handcrafted instructions, leveraging only image-description pairs for fine-tuning. Specifically, we introduce a momentum proxy instruction as a replacement for curated text instructions, which preserves the instruction-following capability of the pre-trained LVLM while promoting updates to parameters that remain valid during inference. Consequently, the fine-tuned LVLM can flexibly respond to domain-specific instructions, even though explicit instructions are absent during fine-tuning. Additionally, we incorporate a response shuffling strategy to mitigate the model's over-reliance on previous words, facilitating more effective fine-tuning. Our approach achieves state-of-the-art accuracy on multiple-choice visual question answering tasks across SKINCON, WBCAtt, CBIS, and MIMIC-CXR datasets, significantly enhancing the fine-tuning efficiency of LVLMs in medical domains.
[4] CurveStream: Boosting Streaming Video Understanding in MLLMs via Curvature-Aware Hierarchical Visual Memory Management
Chao Wang, Xudong Tan, Jianjian Cao, Kangcong Li, Tao Chen
🧩 TL;DR
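This paper proposes CurveStream, a training-free, curvature-aware hierarchical visual memory management framework. It targets the Out-of-Memory errors and catastrophic forgetting that the linear explosion of visual tokens causes when MLLMs process streaming video, and uses a geometric insight to make memory management semantically aware.
📘 Detailed Summary
Motivation: MLLMs have achieved significant success in offline video understanding, but in streaming video they are severely limited by the linear explosion of visual tokens, which often leads to Out-of-Memory errors or catastrophic forgetting. Existing visual retention and memory management methods typically rely on uniform sampling, low-level physical metrics, or passive cache eviction. These strategies lack intrinsic semantic awareness and can disrupt contextual coherence and blur transient yet critical semantic transitions.
Method: CurveStream is motivated by the observation that high-curvature regions along continuous feature trajectories closely align with critical global semantic transitions. It evaluates real-time semantic intensity via a Curvature Score and integrates an online K-Sigma dynamic threshold to adaptively route frames into clear and fuzzy memory states under a strict token budget.
Result: Evaluations across diverse temporal scales show that this lightweight framework consistently yields absolute gains of over 10% over the respective baselines (10.69% on StreamingBench and 13.58% on OVOBench), establishing new state-of-the-art results for streaming video perception.
Conclusion: The study demonstrates that semantic-aware memory management grounded in a geometric insight is effective for streaming video understanding. Identifying key semantic transitions via the Curvature Score enables efficient video understanding under limited compute, and underlines the role of semantic awareness in memory management for real-time video processing with MLLMs.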
📄 Abstract
Multimodal Large Language Models have achieved significant success in offline video understanding, yet their application to streaming videos is severely limited by the linear explosion of visual tokens, which often leads to Out-of-Memory (OOM) errors or catastrophic forgetting. Existing visual retention and memory management methods typically rely on uniform sampling, low-level physical metrics, or passive cache eviction. However, these strategies often lack intrinsic semantic awareness, potentially disrupting contextual coherence and blurring transient yet critical semantic transitions. To address these limitations, we propose CurveStream, a training-free, curvature-aware hierarchical visual memory management framework. Our approach is motivated by the key observation that high-curvature regions along continuous feature trajectories closely align with critical global semantic transitions. Based on this geometric insight, CurveStream evaluates real-time semantic intensity via a Curvature Score and integrates an online K-Sigma dynamic threshold to adaptively route frames into clear and fuzzy memory states under a strict token budget. Evaluations across diverse temporal scales confirm that this lightweight framework, CurveStream, consistently yields absolute performance gains of over 10% (e.g., 10.69% on StreamingBench and 13.58% on OVOBench) over respective baselines, establishing new state-of-the-art results for streaming video perception. The code will be released at https://github.com/streamingvideos/CurveStream.
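The routing idea in this abstract can be illustrated with a minimal sketch. The abstract does not define the Curvature Score or the K-Sigma rule, so everything here is an assumption made for illustration: the score is taken to be the turning angle between consecutive displacements of per-frame feature vectors, the threshold is a running mean plus `k` standard deviations of earlier scores, and the names `turning_angle`, `KSigmaRouter`, and `route_stream` are hypothetical. The token budget is not modeled.

```python
import math


def turning_angle(a, b, c):
    """Angle in radians between displacement a->b and b->c (assumed curvature proxy)."""
    u = [y - x for x, y in zip(a, b)]
    v = [y - x for x, y in zip(b, c)]
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # no movement in feature space: treat as zero curvature
    cos = sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)
    return math.acos(max(-1.0, min(1.0, cos)))


class KSigmaRouter:
    """Online threshold: mean + k * std of the scores seen so far (assumed form)."""

    def __init__(self, k=1.0):
        self.k = k
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations (Welford's method)

    def route(self, score):
        # Decide using statistics of earlier scores only, then update them.
        if self.n >= 2:
            std = math.sqrt(self.m2 / self.n)
            label = "clear" if score > self.mean + self.k * std else "fuzzy"
        else:
            label = "fuzzy"  # too little history to flag a transition
        self.n += 1
        delta = score - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (score - self.mean)
        return label


def route_stream(features, k=1.0):
    """Label each frame from the third onward as 'clear' or 'fuzzy'."""
    router = KSigmaRouter(k)
    labels = []
    for t in range(2, len(features)):
        score = turning_angle(features[t - 2], features[t - 1], features[t])
        labels.append(router.route(score))
    return labels
```

In this toy setup, a feature trajectory that moves in a straight line and then turns sharply has only the turning frame routed to the "clear" state, which mirrors the abstract's claim that high-curvature regions mark semantic transitions.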
[5] ParallelVLM: Lossless Video-LLM Acceleration with Visual Alignment Aware Parallel Speculative Decoding
Quan Kong, Yuhao Shen, Yicheng Ji, Huan Li, Cong Wang
🧩 TL;DR
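This paper proposes ParallelVLM, a training-free draft-then-verify speculative decoding framework. Through parallelized stages and an Unbiased Verifier-Guided Pruning strategy, it substantially improves the decoding efficiency of Video-LLMs and addresses the mutual waiting between draft and target models and the limited speedup ratio in long-video settings.
📘 Detailed Summary
Motivation: Video-LLMs perform well on video understanding tasks, but their autoregressive decoding efficiency is constrained by the massive number of video tokens. Existing visual token pruning eases this bottleneck only partially: it still loses information and yields modest decoding acceleration. In long-video settings in particular, mutual waiting between the draft and target models limits the achievable speedup.
Method: ParallelVLM is a training-free draft-then-verify speculative decoding framework with two parallelized stages that maximize hardware utilization. It introduces an Unbiased Verifier-Guided Pruning strategy that removes the positional bias in attention-guided pruning to better align the draft and target models.
Result: ParallelVLM expands the draft window by 1.6 to 1.8 times while maintaining high accepted lengths. Compared with vanilla autoregressive decoding, it achieves a 3.36x speedup on LLaVA-Onevision-72B and a 2.42x speedup on Qwen2.5-VL-32B across video understanding benchmarks.
Conclusion: With its parallelized design and unbiased verification mechanism, ParallelVLM addresses the decoding-efficiency bottleneck of Video-LLMs and provides an efficient speculative decoding solution for long-video understanding. Because it needs no additional training, it is a practical option for real-time use and deployment of Video-LLMs.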
📄 Abstract
Although current Video-LLMs achieve impressive performance in video understanding tasks, their autoregressive decoding efficiency remains constrained by the massive number of video tokens. Visual token pruning can partially ease this bottleneck, yet existing approaches still suffer from information loss and yield only modest acceleration in decoding. In this paper, we propose ParallelVLM, a training-free draft-then-verify speculative decoding framework that overcomes both mutual waiting and limited speedup-ratio problems between draft and target models in long-video settings. ParallelVLM features two parallelized stages that maximize hardware utilization and incorporate an Unbiased Verifier-Guided Pruning strategy to better align the draft and target models by eliminating the positional bias in attention-guided pruning. Extensive experiments demonstrate that ParallelVLM effectively expands the draft window by $1.6\sim1.8\times$ with high accepted lengths, and accelerates various video understanding benchmarks by 3.36$\times$ on LLaVA-Onevision-72B and 2.42$\times$ on Qwen2.5-VL-32B compared with vanilla autoregressive decoding.
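For readers unfamiliar with draft-then-verify decoding, the following is a minimal greedy sketch of the general technique, not of ParallelVLM itself. It omits the parallelized stages and the pruning strategy, and it calls the target model once per token for clarity, whereas a real system verifies a whole draft window in one batched pass. The functions `toy_target` and `toy_draft` are hypothetical stand-ins for real models.

```python
def greedy_decode(target_next, prompt, max_new_tokens):
    """Baseline: one target-model call per generated token."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target_next(out))
    return out


def speculative_decode(target_next, draft_next, prompt, max_new_tokens, window=4):
    """Greedy draft-then-verify: the output matches greedy_decode on the target."""
    out = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(out) < limit:
        # Draft stage: the cheap model proposes up to `window` tokens.
        proposal = []
        ctx = list(out)
        for _ in range(min(window, limit - len(out))):
            tok = draft_next(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # Verify stage: keep the target's token at each position and stop
        # at the first position where the draft disagrees.
        for tok in proposal:
            expected = target_next(out)
            out.append(expected)
            if expected != tok:
                break
    return out


def toy_target(ctx):
    """Hypothetical target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx):
    """Hypothetical draft model: agrees with the target except after a 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10
```

Because every emitted token is the target model's own greedy choice, the result is identical to plain greedy decoding. The benefit in practice comes from how many drafted tokens are accepted per verification pass, which is why the abstract stresses draft-target alignment and accepted length.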
[6] Detached Skip-Links and $R$-Probe: Decoupling Feature Aggregation from Gradient Propagation for MLLM OCR
Ziye Yuan, Ruchang Yao, Chengxin Zheng, Yusheng Zhao, Daxiang Dong, Ming Zhang
🧩 TL;DR
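This paper proposes Detached Skip-Links to address gradient interference in multi-layer feature fusion for MLLMs. The method reuses shallow features in the forward pass while blocking gradients through the skip branch, significantly improving OCR performance while preserving general multimodal capability.
📘 Detailed Summary
Motivation: MLLMs excel at high-level reasoning but do poorly on OCR tasks that need fine-grained visual detail. The main cause identified is that skip connections in multi-layer feature fusion create direct back-propagation paths from high-level semantic objectives to early visual layers. This gradient interference overwrites low-level visual signals and destabilizes training.
Method: Detached Skip-Links is a minimal modification that reuses shallow features in the forward pass while stopping gradients through the skip branch during joint training. This asymmetric design reduces gradient interference without adding learnable parameters. The authors also introduce $R$-Probe, a diagnostic that measures pixel-level reconstructability of projected visual tokens using a shallow decoder initialized from the first quarter of the LLM layers, to assess whether fine-grained information is preserved and usable.
Result: Across multiple ViT backbones and multimodal benchmarks, and at scales up to 7M training samples, the method consistently improves OCR-centric benchmarks and delivers clear gains on general multimodal tasks, with better training stability and convergence.
Conclusion: The study exposes a gradient-interference problem in feature fusion for MLLMs and shows that a simple architectural change addresses it. Detached Skip-Links preserves fine-grained visual information while maintaining high-level semantic capability, which matters for multimodal applications that depend on precise visual detail.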
📄 Abstract
Multimodal large language models (MLLMs) excel at high-level reasoning yet fail on OCR tasks where fine-grained visual details are compromised or misaligned. We identify an overlooked optimization issue in multi-layer feature fusion. Skip pathways introduce direct back-propagation paths from high-level semantic objectives to early visual layers. This mechanism overwrites low-level signals and destabilizes training. To mitigate this gradient interference, we propose Detached Skip-Links, a minimal modification that reuses shallow features in the forward pass while stopping gradients through the skip branch during joint training. This asymmetric design reduces gradient interference, improving stability and convergence without adding learnable parameters. To diagnose whether fine-grained information is preserved and usable by an LLM, we introduce $R$-Probe, which measures pixel-level reconstructability of projected visual tokens using a shallow decoder initialized from the first quarter of the LLM layers. Across multiple ViT backbones and multimodal benchmarks, and at scales up to 7M training samples, our approach consistently improves OCR-centric benchmarks and delivers clear gains on general multimodal tasks.
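The asymmetry described here (same forward pass, different backward pass) can be shown with a hand-derived scalar toy. This is only an illustration under stated assumptions: the one-parameter "layers", the squared-error loss, and the function name `fused_output_and_grad` are all hypothetical, and the gradients are written out by the chain rule rather than computed by an autograd framework. In practice the same effect would come from a stop-gradient operation such as `.detach()` in PyTorch or `jax.lax.stop_gradient` in JAX.

```python
def fused_output_and_grad(x, w1, w2, target, detach_skip):
    """Return (fused output, d loss / d w1) for a two-layer toy with a skip link.

    shallow = w1 * x          early (low-level) feature
    deep    = w2 * shallow    later (high-level) feature
    fused   = deep + shallow  skip link adds the shallow feature back in
    loss    = (fused - target) ** 2
    """
    shallow = w1 * x
    deep = w2 * shallow
    fused = deep + shallow  # the forward pass is the same in both modes
    dloss_dfused = 2.0 * (fused - target)
    # Gradient reaching w1 through the deep branch: d deep / d w1 = w2 * x.
    grad_w1 = dloss_dfused * w2 * x
    if not detach_skip:
        # An ordinary skip link adds a direct path: d shallow / d w1 = x.
        grad_w1 += dloss_dfused * x
    return fused, grad_w1
```

Calling the function with `detach_skip=True` and `detach_skip=False` on the same inputs returns the same fused value but a smaller gradient on `w1` in the detached case, since the direct path from the objective to the early layer has been cut.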
[7] FREAK: A Fine-grained Hallucination Evaluation Benchmark for Advanced MLLMs
Zhihan Yin, Jianxin Liang, Yueqian Wang, Yifeng Yao, Huishuai Zhang, Dongyan Zhao
🧩 TL;DR
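This paper proposes FREAK, a benchmark that uses high-quality photorealistic images with fine-grained counter-commonsense edits to evaluate hallucination in MLLMs at a fine-grained level, revealing severe hallucination in state-of-the-art models on detailed visual perception.
📘 Detailed Summary
Motivation: Existing hallucination benchmarks for MLLMs either use over-simplified tasks that lead to saturated metrics or lack the diversity needed to assess the extent of hallucination in state-of-the-art multimodal models. A more comprehensive evaluation framework is needed to fill this gap.
Method: FREAK pairs high-quality photorealistic images with fine-grained counter-commonsense edits to evaluate hallucination in the detailed visual perception of MLLMs. The authors also curate a controlled subset to indirectly evaluate a model's ability to perceive the target details, and use it to systematically evaluate prevailing Chain-of-Thought (CoT) prompting techniques.
Result: Extensive experiments on FREAK show severe hallucination in state-of-the-art models on detailed visual perception. The systematic evaluation on the controlled subset yields insights into hallucination patterns and model reasoning processes.
Conclusion: The study provides a finer-grained benchmark for hallucination evaluation in MLLMs, exposes the limits of current models in detailed visual understanding, and offers a reference point for future model improvements and hallucination mitigation.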
📄 Abstract
Multimodal Large Language Models (MLLMs) suffer from hallucinations. Existing hallucination evaluation benchmarks are often limited by over-simplified tasks leading to saturated metrics, or insufficient diversity that fails to adequately assess the hallucination extent in state-of-the-art multimodal models. To address this gap, we propose FREAK, a comprehensive multimodal benchmark designed for fine-grained hallucination assessment in MLLMs. Through high-quality photorealistic images featuring fine-grained counter-commonsense edits, FREAK innovatively evaluates hallucination phenomena in detailed visual perception of MLLMs. Extensive experiments on FREAK show severe hallucination issues in SOTA models regarding detailed visual perception. To enable deeper investigation, we curate a controlled subset to indirectly evaluate the model's ability to perceive target detailed information. Through systematic evaluation of prevailing Chain-of-Thought (CoT) prompting techniques within this task, we reveal critical insights regarding hallucination patterns and model reasoning processes.
[8] LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation
Jiazheng Xing, Fei Du, Hangjie Yuan, Pengwei Liu, Hongbin Xu, Hai Ci, Ruigang Niu, Weihua Chen, Fan Wang, Yong Liu
🧩 TL;DR
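This paper proposes LumosX, a framework that advances both data and model design to address face-attribute alignment in multi-subject personalized video generation, achieving fine-grained, identity-consistent, and semantically aligned results.
📘 Detailed Summary
Motivation: Diffusion models have significantly improved text-to-video generation and allow fine-grained control over foreground and background elements. However, existing methods lack explicit mechanisms to ensure face-attribute alignment across subjects, which leads to weak intra-group consistency in multi-subject personalized video generation. Closing this gap requires both explicit modeling strategies and face-attribute-aware data resources.
Method: On the data side, a tailored collection pipeline orchestrates captions and visual cues from independent videos, and MLLMs infer and assign subject-specific dependencies, yielding a comprehensive benchmark with relational priors. On the modeling side, Relational Self-Attention and Relational Cross-Attention intertwine position-aware embeddings with refined attention dynamics to encode explicit subject-attribute dependencies, enforcing intra-group cohesion and increasing the separation between distinct subject clusters.
Result: Comprehensive evaluations on the constructed benchmark show that LumosX achieves state-of-the-art performance in fine-grained, identity-consistent, and semantically aligned multi-subject personalized video generation, with clear improvements in face-attribute alignment and intra-group consistency.
Conclusion: Co-designing data collection and model architecture can effectively address face-attribute alignment in multi-subject video generation. The relational attention mechanisms offer a way to model inter-subject dependencies explicitly, give finer control for personalized content creation, and provide a new benchmark and modeling paradigm for future video generation research.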
📄 Abstract
Recent advances in diffusion models have significantly improved text-to-video generation, enabling personalized content creation with fine-grained control over both foreground and background elements. However, precise face-attribute alignment across subjects remains challenging, as existing methods lack explicit mechanisms to ensure intra-group consistency. Addressing this gap requires both explicit modeling strategies and face-attribute-aware data resources. We therefore propose LumosX, a framework that advances both data and model design. On the data side, a tailored collection pipeline orchestrates captions and visual cues from independent videos, while multimodal large language models (MLLMs) infer and assign subject-specific dependencies. These extracted relational priors impose a finer-grained structure that amplifies the expressive control of personalized video generation and enables the construction of a comprehensive benchmark. On the modeling side, Relational Self-Attention and Relational Cross-Attention intertwine position-aware embeddings with refined attention dynamics to inscribe explicit subject-attribute dependencies, enforcing disciplined intra-group cohesion and amplifying the separation between distinct subject clusters. Comprehensive evaluations on our benchmark demonstrate that LumosX achieves state-of-the-art performance in fine-grained, identity-consistent, and semantically aligned personalized multi-subject video generation. Code and models are available at https://jiazheng-xing.github.io/lumosx-home/.
[9] MedQ-Engine: A Closed-Loop Data Engine for Evolving MLLMs in Medical Image Quality Assessment
Jiyao Liu, Junzhi Ning, Wanying Qu, Lihao Liu, Chenglong Ma, Junjun He, Ningsheng Xu
🧩 TL;DR
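This paper proposes MedQ-Engine, a closed-loop data engine for medical image quality assessment. Through iterative evaluation, data-driven clustering to discover failure prototypes, progressive human-in-the-loop annotation, and quality-assured fine-tuning, it significantly improves MLLM performance on medical image quality assessment.
📘 Detailed Summary
Motivation: Medical image quality assessment is a prerequisite for clinical AI deployment, yet current MLLMs still fall well short of human experts at providing descriptive assessments with clinical reasoning. Improvement faces two obstacles: the high cost of acquiring descriptive annotations, and the inability of one-time data collection to adapt to a model's evolving weaknesses.
Method: MedQ-Engine is a closed-loop data engine. It iteratively evaluates the model and discovers failure prototypes via data-driven clustering, uses these prototypes as retrieval anchors to explore a million-scale image pool, and combines progressive human-in-the-loop annotation with quality-assured fine-tuning to form a self-improving cycle. An entropy-guided routing mechanism minimizes annotation cost, and models are evaluated on two complementary tasks, perception and description.
Result: Experiments across five medical imaging modalities show that MedQ-Engine lifts an 8B-parameter model to surpass GPT-4o by over 13% and narrows the gap with human experts to only 4.34%, using only 10K annotations and achieving more than 4x sample efficiency over random sampling.
Conclusion: The study demonstrates the effectiveness of a closed-loop data engine for medical image quality assessment. Systematic failure-mode discovery and targeted data collection can substantially improve model performance at low annotation cost, providing a scalable framework for the continual improvement of medical AI systems.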
📄 Abstract
Medical image quality assessment (Med-IQA) is a prerequisite for clinical AI deployment, yet multimodal large language models (MLLMs) still fall substantially short of human experts, particularly when required to provide descriptive assessments with clinical reasoning beyond simple quality scores. However, improving them is hindered by the high cost of acquiring descriptive annotations and by the inability of one-time data collection to adapt to the model's evolving weaknesses. To address these challenges, we propose MedQ-Engine, a closed-loop data engine that iteratively evaluates the model to discover failure prototypes via data-driven clustering, explores a million-scale image pool using these prototypes as retrieval anchors with progressive human-in-the-loop annotation, and evolves through quality-assured fine-tuning, forming a self-improving cycle. Models are evaluated on complementary perception and description tasks. An entropy-guided routing mechanism triages annotations to minimize labeling cost. Experiments across five medical imaging modalities show that MedQ-Engine elevates an 8B-parameter model to surpass GPT-4o by over 13% and narrow the gap with human experts to only 4.34%, using only 10K annotations with more than 4x sample efficiency over random sampling.
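The abstract mentions entropy-guided routing without giving details, so the sketch below shows only one plausible reading and is an assumption throughout: samples whose predicted label distribution has high Shannon entropy are sent to a human annotator, and confident ones are accepted automatically. The function names, the two route labels, and the threshold value are hypothetical.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def route_sample(probs, threshold):
    """Assumed rule: uncertain predictions go to a human, confident ones do not."""
    if shannon_entropy(probs) > threshold:
        return "human_review"
    return "auto_accept"
```

Under this reading, a near-uniform prediction such as `[0.5, 0.5]` would be routed to a human with a threshold of `0.5`, while a confident prediction such as `[0.99, 0.01]` would not, which is how a routing step of this kind could reduce labeling cost.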
[10] MedSPOT: A Workflow-Aware Sequential Grounding Benchmark for Clinical GUI
Rozain Shakeel, Abdul Rahman Mohammad Ali, Muneeb Mushtaq, Tausifa Jan Saleem, Tajamul Ashraf
🧩 TL;DR
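This paper introduces MedSPOT, a workflow-aware sequential visual grounding benchmark for clinical GUI environments. It evaluates whether MLLMs can perform reliable visual grounding in medical software, with a focus on sequential reasoning and error propagation in multi-step workflows.
📘 Detailed Summary
Motivation: Existing GUI benchmarks largely focus on isolated, single-step grounding queries and overlook the sequential, workflow-driven reasoning required in real-world medical interfaces, where tasks evolve across independent steps and dynamic interface states. The ability of MLLMs to perform reliable visual grounding in high-stakes clinical software remains underexplored.
Method: MedSPOT models procedural interaction as a sequence of structured spatial decisions. It comprises 216 task-driven videos with 597 annotated keyframes, and each task consists of 2 to 3 interdependent grounding steps. A strict sequential evaluation protocol terminates task assessment at the first incorrect grounding prediction, explicitly measuring error propagation in multi-step workflows.
Result: By capturing interface hierarchies, contextual dependencies, and fine-grained spatial precision under evolving conditions, MedSPOT establishes a realistic and safety-critical benchmark for evaluating multimodal models in clinical GUI environments. It also introduces a comprehensive failure taxonomy (edge bias, small-target errors, no prediction, near miss, far miss, and toolbar confusion) for systematic diagnosis of model behavior.
Conclusion: By shifting evaluation from isolated grounding to workflow-aware sequential reasoning, MedSPOT provides a realistic and safety-critical benchmark for multimodal models in medical software. It underlines the importance of accounting for procedural interaction and error propagation in clinical interfaces and offers an evaluation framework for reliable deployment in medical applications.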
📄 Abstract
Despite the rapid progress of Multimodal Large Language Models (MLLMs), their ability to perform reliable visual grounding in high-stakes clinical software environments remains underexplored. Existing GUI benchmarks largely focus on isolated, single-step grounding queries, overlooking the sequential, workflow-driven reasoning required in real-world medical interfaces, where tasks evolve across independent steps and dynamic interface states. We introduce MedSPOT, a workflow-aware sequential grounding benchmark for clinical GUI environments. Unlike prior benchmarks that treat grounding as a standalone prediction task, MedSPOT models procedural interaction as a sequence of structured spatial decisions. The benchmark comprises 216 task-driven videos with 597 annotated keyframes, in which each task consists of 2 to 3 interdependent grounding steps within realistic medical workflows. This design captures interface hierarchies, contextual dependencies, and fine-grained spatial precision under evolving conditions. To evaluate procedural robustness, we propose a strict sequential evaluation protocol that terminates task assessment upon the first incorrect grounding prediction, explicitly measuring error propagation in multi-step workflows. We further introduce a comprehensive failure taxonomy, including edge bias, small-target errors, no prediction, near miss, far miss, and toolbar confusion, to enable systematic diagnosis of model behavior in clinical GUI settings. By shifting evaluation from isolated grounding to workflow-aware sequential reasoning, MedSPOT establishes a realistic and safety-critical benchmark for assessing multimodal models in medical software environments. Code and data are available at: https://github.com/Tajamul21/MedSPOT.
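The strict sequential protocol described in this abstract can be sketched as follows. The abstract does not say how a grounding prediction is judged correct, so this sketch assumes a predicted click point counts as correct when it falls inside the annotated target box; that criterion and the names `point_in_box` and `evaluate_task` are hypothetical.

```python
def point_in_box(point, box):
    """True if (x, y) lies inside the box (x0, y0, x1, y1), edges included."""
    x, y = point
    x0, y0, x1, y1 = box
    return x0 <= x <= x1 and y0 <= y <= y1


def evaluate_task(predictions, target_boxes):
    """Return (steps_completed, task_success) under strict sequential scoring.

    Evaluation stops at the first incorrect or missing prediction, so later
    steps earn no credit even if their predictions would have been correct.
    """
    completed = 0
    for pred, box in zip(predictions, target_boxes):
        if pred is None or not point_in_box(pred, box):
            break
        completed += 1
    return completed, completed == len(target_boxes)
```

With three steps where only the second prediction misses, this returns one completed step and a failed task, even though the third prediction is inside its box. That is the error-propagation effect the protocol is designed to expose.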
cs.CL
[11] CURE: A Multimodal Benchmark for Clinical Understanding and Retrieval Evaluation
Yannian Gu, Zhongzhen Huang, Linjie Mu, Xizhuo Zhang, Shaoting Zhang, Xiaofan Zhang
🧩 TL;DR
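This paper introduces the CURE benchmark for evaluating the reasoning and retrieval abilities of MLLMs in clinical diagnosis. It reveals a dichotomy: models perform well when given reference evidence but degrade sharply when they must retrieve evidence on their own.
📘 Detailed Summary
Motivation: Existing benchmarks primarily evaluate MLLMs in end-to-end answering scenarios, which makes it impossible to separate a model's foundational multimodal reasoning from its proficiency in retrieving and applying evidence. This limits accurate assessment of clinical diagnostic potential.
Method: The Clinical Understanding and Retrieval Evaluation (CURE) benchmark comprises 500 multimodal clinical cases, each mapped to physician-cited reference literature. It evaluates reasoning and retrieval separately under controlled evidence settings, and assesses state-of-the-art models across distinct evidence-gathering paradigms in both closed-ended and open-ended diagnosis tasks.
Result: When supplied with physician reference evidence, advanced models reach up to 73.4% accuracy on differential diagnosis. When they rely on independent retrieval mechanisms, performance drops to as low as 25.4%, highlighting the dual challenges of integrating multimodal clinical evidence and retrieving precise supporting literature.
Conclusion: The study identifies a core bottleneck for MLLMs in clinical diagnosis: effectively integrating multimodal clinical evidence while retrieving precise supporting literature. CURE provides a systematic evaluation framework for future work and underlines the need to improve reasoning and retrieval together.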
📄 Abstract
Multimodal large language models (MLLMs) demonstrate considerable potential in clinical diagnostics, a domain that inherently requires synthesizing complex visual and textual data alongside consulting authoritative medical literature. However, existing benchmarks primarily evaluate MLLMs in end-to-end answering scenarios. This limits the ability to disentangle a model's foundational multimodal reasoning from its proficiency in evidence retrieval and application. We introduce the Clinical Understanding and Retrieval Evaluation (CURE) benchmark. Comprising $500$ multimodal clinical cases mapped to physician-cited reference literature, CURE evaluates reasoning and retrieval under controlled evidence settings to disentangle their respective contributions. We evaluate state-of-the-art MLLMs across distinct evidence-gathering paradigms in both closed-ended and open-ended diagnosis tasks. Evaluations reveal a stark dichotomy: while advanced models demonstrate clinical reasoning proficiency when supplied with physician reference evidence (achieving up to $73.4\%$ accuracy on differential diagnosis), their performance substantially declines (as low as $25.4\%$) when reliant on independent retrieval mechanisms. This disparity highlights the dual challenges of effectively integrating multimodal clinical evidence and retrieving precise supporting literature. CURE is publicly available at https://github.com/yanniangu/CURE.
[12] DataProphet: Demystifying Supervision Data Generalization in Multimodal LLMs
Xuan Qi, Luxi He, Dan Roth, Xingyu Fu
🧩 TL;DR
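This paper proposes DATAPROPHET, a training-free method for predicting the influence of supervision data on MLLMs. It combines multimodal perplexity, similarity, and data diversity to estimate how much a dataset will contribute to a target benchmark, and clearly outperforms selection based on intuitive task similarity.
📘 Detailed Summary
Motivation: The conventional approach to choosing supervision data for MLLMs is to prioritize datasets that look similar to the target benchmark, but it is unclear whether such intuitive task similarity reliably predicts downstream gains. The study addresses a practical question: can the influence of a training dataset on a target benchmark be estimated before any training is performed?
Method: The authors first conduct an in-depth transfer analysis across 14 vision-language datasets spanning 7 diverse tasks, which shows that intuitive task similarity is unreliable. Building on this finding, they propose DATAPROPHET, a simple and effective training-free metric that combines three factors (multimodal perplexity, similarity, and data diversity) to assess the potential value of supervision data.
Result: DATAPROPHET produces supervision-data rankings that strongly correlate with rankings based on actual post-training performance gains, achieving a Kendall's tau of 86.0%. For data selection, it yields up to 6.9% improvement over uniform selection, 1.4% over a state-of-the-art training-based baseline, and 0.2% above oracle selection based on experimental performance.
Conclusion: Intuitive task similarity is an unreliable predictor for MLLM data selection, and generalization depends more on the specific dataset than on its broad task category. DATAPROPHET offers a practical training-free framework for data selection that improves both the efficiency and the effectiveness of choosing supervision data.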
📄 Abstract
Conventional wisdom for selecting supervision data for multimodal large language models (MLLMs) is to prioritize datasets that appear similar to the target benchmark, such as text-intensive or vision-centric tasks. However, it remains unclear whether such intuitive similarity reliably predicts downstream performance gains. In this work, we take a first step toward answering a practical question: can we estimate the influence of a training dataset on a target benchmark before any training is performed? To investigate this question, we conduct an in-depth analysis of transfer across 14 vision-language datasets spanning 7 diverse tasks. Our results show that intuitive task similarity is an unreliable predictor of transferability, and that generalization depends more on the specific dataset than on its broad task category. Motivated by this finding, we propose DATAPROPHET, a simple and effective training-free metric that combines multimodal perplexity, similarity, and data diversity. Experiments show that DATAPROPHET produces supervision-data rankings that strongly correlate with rankings based on actual post-training performance gains, achieving a Kendall's tau of 86.0%. Moreover, DATAPROPHET enables better supervision-data selection, yielding up to 6.9% improvement over uniform selection, 1.4% over a state-of-the-art training-based baseline, and 0.2% above oracle selection based on experimental performance. Our code and data will be released.
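The ranking agreement reported here is measured with Kendall's tau, which compares how often two rankings order a pair of items the same way. The sketch below implements the basic tau-a variant in plain Python; it assumes no special handling of ties, and it is not the paper's evaluation code. The function name `kendall_tau` is chosen for this illustration. A reported value of 86.0% corresponds to a tau of 0.86 on this scale of -1 to 1.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    xs and ys are scores for the same items, e.g. a predicted usefulness
    score and the measured post-training gain for each dataset.
    Requires at least two items; tied pairs count as neither kind.
    """
    concordant = 0
    discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs
```

Identical orderings give 1.0, fully reversed orderings give -1.0, and swapping one adjacent pair among four items gives (5 - 1) / 6, roughly 0.67.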
[13] Rethinking Ground Truth: A Case Study on Human Label Variation in MLLM Benchmarking
Tomas Ruiz, Tanalp Agustoslu, Carsten Schwemmer
🧩 TL;DR
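This study proposes an MLLM evaluation protocol that accounts for human label variation. It finds that benchmarks based only on consensus labels overstate model capability on ambiguous, subjective tasks, and that incorporating annotator disagreement gives a more realistic assessment.
📘 Detailed Summary
Motivation: Despite rapid progress in large language models, human label variation (systematic differences among annotators' judgments) remains underexplored in benchmarking. The study addresses this gap with an evaluation protocol that explicitly accounts for conditions of human label agreement and disagreement, in order to assess MLLMs more accurately on ambiguous, subjective tasks.
Method: The protocol explicitly distinguishes two conditions: human label agreement and human label disagreement. It is applied to two state-of-the-art MLLM families (Gemma 3 and Qwen 2.5 VL) using non-aggregated human annotations from a social media content classification dataset, which allows model performance to be compared across levels of annotator agreement.
Result: Larger models tend to perform best on high-agreement subsets, yet they often underperform medium-sized models when human disagreement is high. Parameter count alone therefore does not determine sensitivity to ambiguity and subjectivity, and benchmarks based solely on consensus labels can overstate model capability in ambiguous, subjective domains.
Conclusion: Incorporating human label variation yields more realistic and robust assessments of MLLMs, particularly in subjective applications such as content moderation. The findings underline the importance of accounting for annotator disagreement in model evaluation and point to a direction for future benchmark design.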
📄 Abstract
Human Label Variation (HLV), i.e. systematic differences among annotators' judgments, remains underexplored in benchmarks despite rapid progress in large language model (LLM) development. We address this gap by introducing an evaluation protocol for multimodal large language model (MLLM) benchmarking that explicitly accounts for two conditions: (1) human label agreement and (2) disagreement. We apply this protocol to two state-of-the-art MLLM families (Gemma 3, Qwen 2.5 VL) using non-aggregated human annotations from a social media content classification dataset. Across tasks, we find that larger models tend to perform best on high-agreement subsets, yet often underperform medium-sized models when human disagreement is high, indicating that parameter count alone does not determine sensitivity to ambiguity and subjectivity. These results show that benchmarks based solely on consensus labels can overstate model capabilities in such domains and that incorporating human label variation yields more realistic and robust assessments of MLLMs in content moderation pipelines.
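A protocol of this kind can be sketched as a split of the evaluation set by annotator agreement, followed by separate scoring of each subset. The abstract does not specify the split rule or the scoring target, so both are assumptions here: an item counts as "agreement" when the majority label's share reaches `threshold` (unanimous by default), and predictions are scored against the majority label in both subsets. The function names are hypothetical.

```python
from collections import Counter


def majority_share(labels):
    """Return (majority label, fraction of annotators who chose it)."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


def accuracy_by_agreement(items, threshold=1.0):
    """Score predictions separately on agreement and disagreement subsets.

    items: list of (annotator_labels, model_prediction) pairs, using the
    non-aggregated labels of every annotator for each item.
    Returns accuracy per subset, or None for an empty subset.
    """
    hits = {"agreement": 0, "disagreement": 0}
    totals = {"agreement": 0, "disagreement": 0}
    for labels, prediction in items:
        majority, share = majority_share(labels)
        bucket = "agreement" if share >= threshold else "disagreement"
        totals[bucket] += 1
        hits[bucket] += int(prediction == majority)
    return {b: (hits[b] / totals[b] if totals[b] else None) for b in totals}
```

Reporting the two accuracies side by side, instead of one pooled number, is what lets a result like the one in this abstract surface: a model can lead on the agreement subset and still trail on the disagreement subset.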