Table of Contents
- cs.CV [Total: 12]
cs.CV
[1] PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning
Shaoxuan Li, Zhixuan Zhao, Hanze Deng, Zirun Ma, Shulin Tian, Zuyan Liu, Yushi Hu, Haoning Wu, Yuhao Dong, Benlin Liu, Ziwei Liu, Ranjay Krishna
🧩 TL;DR
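This paper introduces PerceptionComp, a manually annotated benchmark for complex, long-horizon, perception-centric video reasoning. It evaluates deep perceptual reasoning that requires multiple temporally separated pieces of visual evidence combined under compositional constraints.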
📘 Detailed Summary
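Motivation: Existing video reasoning benchmarks typically focus on a single moment or on simple reasoning tasks, and do not evaluate complex, long-horizon, perception-centric reasoning. This work addresses that gap with a benchmark that requires multiple temporally separated pieces of visual evidence, compositional constraints, and coverage of several perceptual subtasks (objects, attributes, relations, locations, actions, and events), with the aim of advancing deep perceptual reasoning.
Method: The authors build the PerceptionComp benchmark, which contains 1,114 highly complex questions on 279 videos from diverse domains including city walk tours, indoor villa tours, video games, and extreme outdoor sports. All questions are 100% manually annotated and designed so that no single moment is sufficient to answer them: each requires compositional constraints under conjunctive and sequential logic and draws on skills such as semantic recognition, visual correspondence, temporal reasoning, and spatial reasoning.
Result: Human studies show that PerceptionComp requires substantial test-time thinking and repeated perception steps: when rewatching is disallowed, participant accuracy drops to near chance (18.97%). State-of-the-art multimodal large language models (MLLMs) also perform substantially worse on this benchmark than on existing ones: the best model evaluated, Gemini-3-Flash, reaches only 45.96% accuracy in the five-choice setting, while open-source models remain below 40%.
Conclusion: The results indicate that perception-centric long-horizon video reasoning remains a major bottleneck and that current models perform poorly on this task. PerceptionComp is expected to help drive progress in perceptual reasoning by providing a tool for evaluating and improving models in complex scenarios that require multi-step integration of visual evidence and temporal reasoning.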
📄 Abstract
We introduce PerceptionComp, a manually annotated benchmark for complex, long-horizon, perception-centric video reasoning. PerceptionComp is designed so that no single moment is sufficient: answering each question requires multiple temporally separated pieces of visual evidence and compositional constraints under conjunctive and sequential logic, spanning perceptual subtasks such as objects, attributes, relations, locations, actions, and events, and requiring skills including semantic recognition, visual correspondence, temporal reasoning, and spatial reasoning. The benchmark contains 1,114 highly complex questions on 279 videos from diverse domains including city walk tours, indoor villa tours, video games, and extreme outdoor sports, with 100% manual annotation. Human studies show that PerceptionComp requires substantial test-time thinking and repeated perception steps: participants take much longer than on prior benchmarks, and accuracy drops to near chance (18.97%) when rewatching is disallowed. State-of-the-art MLLMs also perform substantially worse on PerceptionComp than on existing benchmarks: the best model in our evaluation, Gemini-3-Flash, reaches only 45.96% accuracy in the five-choice setting, while open-source models remain below 40%. These results suggest that perception-centric long-horizon video reasoning remains a major bottleneck, and we hope PerceptionComp will help drive progress in perceptual reasoning.
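To put the reported accuracies in context, the short sketch below computes the chance level for a multiple-choice question and the margin of a reported accuracy over it. It assumes uniform random guessing with exactly one correct option among five; the function names are illustrative and do not come from the paper.

```python
def chance_accuracy_pct(num_choices: int) -> float:
    """Expected accuracy (in percent) of uniform random guessing,
    assuming exactly one correct option per question."""
    return 100.0 / num_choices


def margin_over_chance(accuracy_pct: float, num_choices: int) -> float:
    """Percentage points by which a reported accuracy exceeds chance."""
    return accuracy_pct - chance_accuracy_pct(num_choices)


if __name__ == "__main__":
    # Figures taken from the abstract; five-choice setting.
    for label, acc in [("humans, no rewatching", 18.97),
                       ("Gemini-3-Flash", 45.96)]:
        print(label, round(margin_over_chance(acc, 5), 2))
```

Under this assumption the chance level is 20%, so the no-rewatch human accuracy of 18.97% sits about one point below chance, while the best model is roughly 26 points above it.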
[2] ArtHOI: Taming Foundation Models for Monocular 4D Reconstruction of Hand-Articulated-Object Interactions
Zikai Wang, Zhilu Zhang, Yiqing Wang, Hui Li, Wangmeng Zuo
🧩 TL;DR
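This paper presents ArtHOI, an optimization-based framework for reconstructing 4D human-articulated-object interactions from monocular RGB video. It tackles this highly ill-posed problem by integrating and refining priors from multiple foundation models.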
📘 Detailed Summary
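Motivation: Existing human-object interaction methods are largely limited to rigid objects, while 4D reconstruction of articulated objects usually requires pre-scanning the object or even multi-view video. Reconstructing 4D human-articulated-object interactions from a single monocular RGB video remains an unexplored but significant challenge, and recent progress in foundation models offers a new opportunity to address this highly ill-posed problem.
Method: ArtHOI is an optimization-based framework that integrates and refines priors from multiple foundation models. Its core contribution is a set of new methods for resolving the inherent inaccuracies and physical implausibility of those priors. In particular, Adaptive Sampling Refinement (ASR) optimizes the object's metric scale and pose so that its normalized mesh can be grounded in world space, and an MLLM-guided hand-object alignment method uses contact reasoning information as constraints when optimizing the hand-object mesh composition.
Result: For comprehensive evaluation, the authors contribute two new datasets, ArtHOI-RGBD and ArtHOI-Wild. Extensive experiments validate the robustness and effectiveness of ArtHOI across diverse objects and interactions, demonstrating that it can reconstruct 4D human-articulated-object interactions from monocular RGB video.
Conclusion: The study shows how priors from multiple foundation models can be integrated effectively to solve a highly ill-posed 4D reconstruction problem. The proposed methods offer a new approach to reconstructing articulated-object interactions from monocular video, and the new datasets provide a benchmark for the field.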
📄 Abstract
Existing hand-object interaction (HOI) methods are largely limited to rigid objects, while 4D reconstruction methods for articulated objects generally require pre-scanning the object or even multi-view videos. It remains an unexplored but significant challenge to reconstruct 4D human-articulated-object interactions from a single monocular RGB video. Fortunately, recent advancements in foundation models present a new opportunity to address this highly ill-posed problem. To this end, we introduce ArtHOI, an optimization-based framework that integrates and refines priors from multiple foundation models. Our key contribution is a suite of novel methodologies designed to resolve the inherent inaccuracies and physical implausibility of these priors. In particular, we introduce an Adaptive Sampling Refinement (ASR) method to optimize the object's metric scale and pose for grounding its normalized mesh in world space. Furthermore, we propose a Multimodal Large Language Model (MLLM) guided hand-object alignment method, utilizing contact reasoning information as constraints for hand-object mesh composition optimization. To facilitate a comprehensive evaluation, we also contribute two new datasets, ArtHOI-RGBD and ArtHOI-Wild. Extensive experiments validate the robustness and effectiveness of ArtHOI across diverse objects and interactions. Project: https://arthoi-reconstruction.github.io.
[3] Reflect to Inform: Boosting Multimodal Reasoning via Information-Gain-Driven Verification
Shuai Lv, Chang Liu, Feng Tang, Yujie Yuan, Aojun Zhou, Kui Zhang, Xi Yang, Yangqiu Song
🧩 TL;DR
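This paper proposes Visual Re-Examination (VRE), a self-evolving training framework that lets multimodal large language models (MLLMs) autonomously perform visual introspection during long-form generation. It mitigates hallucinations that arise when, as outputs grow longer, models drift away from image evidence and fall back on textual priors.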
📘 Detailed Summary
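Motivation: MLLMs exhibit a systematic weakness in long-form generation: as the output grows, the model progressively drifts away from image evidence and relies on textual priors, which leads to ungrounded reasoning and hallucinations. The authors find that models have a latent capability for late-stage visual verification, but that it is not consistently activated.
Method: VRE is a self-evolving training method that enables the model to perform visual introspection autonomously during reasoning, without additional visual inputs. Rather than distilling visual capabilities from a stronger teacher, VRE has the model generate its own reflection traces and uses information gain to make visual information actionable, yielding iterative self-improvement.
Result: Extensive experiments across multiple multimodal benchmarks show that VRE consistently improves reasoning accuracy and perceptual reliability while substantially reducing hallucinations, with the largest gains in long-chain reasoning settings. The method effectively activates the model's latent visual verification capability.
Conclusion: The study reveals an inherent visual verification potential in MLLMs and proposes a self-evolving training paradigm that requires no external supervision. VRE offers an effective remedy for visual drift in long-form generation and opens a new direction for improving the reliability and interpretability of multimodal models.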
📄 Abstract
Multimodal Large Language Models (MLLMs) achieve strong multimodal reasoning performance, yet we identify a recurring failure mode in long-form generation: as outputs grow longer, models progressively drift away from image evidence and fall back on textual priors, resulting in ungrounded reasoning and hallucinations. Interestingly, based on attention analysis, we find that MLLMs have a latent capability for late-stage visual verification that is present but not consistently activated. Motivated by this observation, we propose Visual Re-Examination (VRE), a self-evolving training framework that enables MLLMs to autonomously perform visual introspection during reasoning without additional visual inputs. Rather than distilling visual capabilities from a stronger teacher, VRE promotes iterative self-improvement by leveraging the model itself to generate reflection traces, making visual information actionable through information gain. Extensive experiments across diverse multimodal benchmarks demonstrate that VRE consistently improves reasoning accuracy and perceptual reliability, while substantially reducing hallucinations, especially in long-chain settings. Code is available at https://github.com/Xiaobu-USTC/VRE.
[4] Reinforcing Structured Chain-of-Thought for Video Understanding
Peiyao Wang, Haotian Xu, Noranart Vesdapunt, Rui Hou, Jingyi Zhang, Haibin Ling, Oleksandr Obiednikov, Ning Zhou, Kah Kuen Fu
🧩 TL;DR
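This paper proposes Summary-Driven Reinforcement Learning (SDRL), a single-stage reinforcement learning framework that addresses thinking drift and weak temporal comprehension in multimodal large language models (MLLMs) for video understanding. It achieves state-of-the-art video question answering performance without supervised fine-tuning.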
📘 Detailed Summary
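Motivation: MLLMs suffer from thinking drift and weak temporal comprehension in video understanding. Existing reinforcement learning methods usually depend on supervised fine-tuning, which requires costly chain-of-thought annotation and multi-stage training; the resulting fixed reasoning paths limit generalization and may introduce bias.
Method: SDRL uses a structured chain-of-thought format, Summarize -> Think -> Answer, and needs no supervised fine-tuning. It integrates two self-supervised mechanisms into the Group Relative Policy Optimization (GRPO) objective: Consistency of Vision Knowledge (CVK) strengthens factual grounding by reducing the KL divergence among generated summaries, and Dynamic Variety of Reasoning (DVR) promotes exploration by dynamically modulating thinking diversity based on group accuracy.
Result: The method achieves state-of-the-art performance on seven public VideoQA datasets. It balances alignment and exploration, supervises both the final answer and the reasoning process, and markedly improves the video understanding ability of MLLMs.
Conclusion: Through its self-supervised mechanisms, SDRL removes the dependence of existing reinforcement learning methods on supervised fine-tuning. It offers a more efficient and flexible training paradigm and opens a new research direction for video understanding with MLLMs.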
📄 Abstract
Multi-modal Large Language Models (MLLMs) show promise in video understanding. However, their reasoning often suffers from thinking drift and weak temporal comprehension, even when enhanced by Reinforcement Learning (RL) techniques like Group Relative Policy Optimization (GRPO). Moreover, existing RL methods usually depend on Supervised Fine-Tuning (SFT), which requires costly Chain-of-Thought (CoT) annotation and multi-stage training, and enforces fixed reasoning paths, limiting MLLMs' ability to generalize and potentially inducing bias. To overcome these limitations, we introduce Summary-Driven Reinforcement Learning (SDRL), a novel single-stage RL framework that obviates the need for SFT by utilizing a Structured CoT format: Summarize -> Think -> Answer. SDRL introduces two self-supervised mechanisms integrated into the GRPO objective: 1) Consistency of Vision Knowledge (CVK) enforces factual grounding by reducing KL divergence among generated summaries; and 2) Dynamic Variety of Reasoning (DVR) promotes exploration by dynamically modulating thinking diversity based on group accuracy. This novel integration effectively balances alignment and exploration, supervising both the final answer and the reasoning process. Our method achieves state-of-the-art performance on seven public VideoQA datasets.
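The two quantities this abstract builds on can be illustrated with a small sketch. The first function is the standard GRPO group-relative advantage (each reward standardized against its sampling group). The second is only an assumed reading of "KL divergence among generated summaries": a mean pairwise KL over per-summary probability distributions. The abstract does not specify how CVK is actually computed, so `mean_pairwise_kl` and its inputs are hypothetical.

```python
import math
from itertools import permutations


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize each reward within its group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def mean_pairwise_kl(distributions):
    """Hypothetical consistency penalty: average KL over all ordered
    pairs of summary distributions (needs at least two distributions)."""
    pairs = list(permutations(distributions, 2))
    return sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)


if __name__ == "__main__":
    # Four sampled responses: two answered correctly (reward 1), two not.
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
    # Toy distributions standing in for three generated summaries.
    print(mean_pairwise_kl([[0.7, 0.3], [0.6, 0.4], [0.2, 0.8]]))
```

A penalty of this form is zero when all summaries induce the same distribution and grows as they disagree, which matches the stated goal of enforcing consistent factual grounding across a group.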
[5] FairLLaVA: Fairness-Aware Parameter-Efficient Fine-Tuning for Large Vision-Language Assistants
Mahesh Bhosale, Abdul Wasi, Shantam Srivastava, Shifa Latif, Tianyu Luan, Mingchen Gao, David Doermann, Xuan Gong
🧩 TL;DR
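This paper proposes FairLLaVA, a parameter-efficient fine-tuning method that reduces group disparities in visual instruction tuning of multimodal large language models (MLLMs) by minimizing the mutual information between target attributes, while preserving overall performance. It improves both fairness and generation quality on medical imaging report generation and visual question answering.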
📘 Detailed Summary
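Motivation: MLLMs are strong at image-conditioned generation but perform unevenly across demographic groups. This fairness risk is especially serious in safety-critical settings such as clinical practice, where it can produce unequal diagnostic narratives and erode trust in AI-assisted decision-making. Prior work has mainly studied fairness in vision-only or language-only models, and its impact on MLLMs remains underexplored.
Method: FairLLaVA is a parameter-efficient fine-tuning method that regularizes the model's representations by minimizing the mutual information between target attributes, making them invariant to demographic characteristics. It can be integrated as a lightweight plug-in, stays efficient through low-rank adapter fine-tuning, and provides an architecture-agnostic approach to fair visual instruction following.
Result: On large-scale chest radiology report generation and dermoscopy visual question answering benchmarks, FairLLaVA consistently reduces inter-group disparities while improving equity-scaled clinical performance and natural language generation quality. The improvements hold across different medical imaging modalities. The code is available on GitHub.
Conclusion: The study demonstrates that parameter-efficient fine-tuning can mitigate fairness problems in MLLMs, giving clinical AI systems a practical way to reduce group bias while preserving overall performance. Because FairLLaVA is architecture-agnostic, it can be applied broadly to visual instruction-following tasks and offers a technical path toward fair multimodal AI.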
📄 Abstract
While powerful in image-conditioned generation, multimodal large language models (MLLMs) can display uneven performance across demographic groups, highlighting fairness risks. In safety-critical clinical settings, such disparities risk producing unequal diagnostic narratives and eroding trust in AI-assisted decision-making. While fairness has been studied extensively in vision-only and language-only models, its impact on MLLMs remains largely underexplored. To address these biases, we introduce FairLLaVA, a parameter-efficient fine-tuning method that mitigates group disparities in visual instruction tuning without compromising overall performance. By minimizing the mutual information between target attributes, FairLLaVA regularizes the model's representations to be demographic-invariant. The method can be incorporated as a lightweight plug-in, maintaining efficiency with low-rank adapter fine-tuning, and provides an architecture-agnostic approach to fair visual instruction following. Extensive experiments on large-scale chest radiology report generation and dermoscopy visual question answering benchmarks show that FairLLaVA consistently reduces inter-group disparities while improving both equity-scaled clinical performance and natural language generation quality across diverse medical imaging modalities. Code can be accessed at https://github.com/bhosalems/FairLLaVA.
[6] Knowledge is Power: Advancing Few-shot Action Recognition with Multimodal Semantics from MLLMs
Jiazheng Xing, Chao Xu, Hangjie Yuan, Mengmeng Wang, Jun Dan, Hangwei Qian, Yong Liu
🧩 TL;DR
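This paper proposes FSAR-LLaVA, the first end-to-end method that uses a multimodal large language model (MLLM) as a multimodal knowledge base to directly enhance few-shot action recognition (FSAR). With a Multimodal Feature-Enhanced Module and Composite Task-Oriented Prototype Construction, it significantly improves FSAR performance.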
📘 Detailed Summary
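Motivation: Current FSAR work mainly generates captions, forming a suboptimal feature->caption->feature pipeline, and performs metric learning only in the visual space. There is no end-to-end method that directly exploits the rich semantic knowledge of MLLMs, which leaves modality fusion insufficient and a distribution gap unaddressed.
Method: The method has three core components: a Multimodal Feature-Enhanced Module that uses the multimodal decoder to extract spatiotemporally and semantically enriched representations; Composite Task-Oriented Prototype Construction, driven by adaptive prompt generation and aligned outputs; and a training-free Multimodal Prototype Matching Metric that adaptively selects the most decisive cues and makes efficient use of the decoupled feature representations.
Result: Extensive experiments on multiple FSAR tasks show that FSAR-LLaVA achieves superior performance with very few trainable parameters and significantly outperforms existing methods, validating the direct use of an MLLM as a knowledge base.
Conclusion: The study shows that MLLMs can serve directly as an effective knowledge base for FSAR. Its end-to-end multimodal feature enhancement and prototype construction offer a new paradigm for few-shot learning and highlight the key role of multimodal semantic knowledge in action recognition.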
📄 Abstract
Multimodal Large Language Models (MLLMs) have propelled the field of few-shot action recognition (FSAR). However, preliminary explorations in this area primarily focus on generating captions to form a suboptimal feature->caption->feature pipeline and adopt metric learning solely within the visual space. In this paper, we propose FSAR-LLaVA, the first end-to-end method to leverage MLLMs (such as Video-LLaVA) as a multimodal knowledge base for directly enhancing FSAR. First, at the feature level, we leverage the MLLM's multimodal decoder to extract spatiotemporally and semantically enriched representations, which are then decoupled and enhanced by our Multimodal Feature-Enhanced Module into distinct visual and textual features that fully exploit their semantic knowledge for FSAR. Next, we leverage the versatility of MLLMs to craft input prompts that flexibly adapt to diverse scenarios, and use their aligned outputs to drive our designed Composite Task-Oriented Prototype Construction, effectively bridging the distribution gap between meta-train and meta-test sets. Finally, to enable multimodal features to guide metric learning jointly, we introduce a training-free Multimodal Prototype Matching Metric that adaptively selects the most decisive cues and efficiently leverages the decoupled feature representations produced by MLLMs. Extensive experiments demonstrate superior performance across various tasks with minimal trainable parameters.
[7] Rethinking Token Pruning for Historical Screenshots in GUI Visual Agents: Semantic, Spatial, and Temporal Perspectives
Daiqiang Li, Zihao Pan, Zeyu Zhang, Ronghao Chen, Huacan Wang, Honggang Chen, Haiyun Jiang
🧩 TL;DR
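This paper presents an empirical study of token pruning for historical screenshots in GUI visual agents. It finds that GUI screenshots have a distinctive foreground-background semantic composition, that random pruning has an inherent advantage in preserving spatial structure, and that GUI agents exhibit a recency effect similar to human cognition. These findings offer new insights and practical guidance for designing efficient GUI visual agents.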
📘 Detailed Summary
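Motivation: GUI visual agents built on multimodal large language models (MLLMs) show strong potential in navigation tasks, but high-resolution GUI screenshots produce a large number of visual tokens, so keeping the complete history is computationally expensive. This motivates a study of how to prune tokens from historical screenshots effectively in GUI scenarios.
Method: The study empirically explores token pruning strategies for GUI scenarios. It partitions screenshots into foreground and background regions with a simple edge-based separation, compares carefully designed pruning strategies against random pruning, and analyzes how to allocate tokens to screenshots at different timestamps, with particular attention to the recency effect in GUI agents.
Result: GUI screenshots have a distinctive foreground-background semantic composition, and background regions effectively capture interface-state transitions and provide auxiliary cues for reasoning. Under the same computational budget, random pruning performs better because it preserves spatial structure more faithfully. GUI agents exhibit a recency effect similar to human cognition: allocating more tokens to recent screenshots and heavily compressing earlier ones significantly reduces computational cost while leaving performance nearly unchanged.
Conclusion: The study provides three key insights for designing efficient GUI visual agents: the foreground-background semantic composition of GUI screenshots carries distinctive value, random pruning has an advantage in preserving spatial structure, and dynamic token allocation based on the recency effect balances performance against computational cost. These findings offer practical guidance for real system design.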
📄 Abstract
In recent years, GUI visual agents built upon Multimodal Large Language Models (MLLMs) have demonstrated strong potential in navigation tasks. However, high-resolution GUI screenshots produce a large number of visual tokens, making the direct preservation of complete historical information computationally expensive. In this paper, we conduct an empirical study on token pruning for historical screenshots in GUI scenarios and distill three practical insights that are crucial for designing effective pruning strategies. First, we observe that GUI screenshots exhibit a distinctive foreground-background semantic composition. To probe this property, we apply a simple edge-based separation to partition screenshots into foreground and background regions. Surprisingly, we find that, contrary to the common assumption that background areas have little semantic value, they effectively capture interface-state transitions, thereby providing auxiliary cues for GUI reasoning. Second, compared with carefully designed pruning strategies, random pruning possesses an inherent advantage in preserving spatial structure, enabling better performance under the same computational budget. Finally, we observe that GUI agents exhibit a recency effect similar to human cognition: by allocating larger token budgets to more recent screenshots and heavily compressing distant ones, we can significantly reduce computational cost while maintaining nearly unchanged performance. These findings offer new insights and practical guidance for the design of efficient GUI visual agents.
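The second and third insights can be combined in a small sketch: split a total token budget across historical screenshots so that recent ones get more, then prune each screenshot's tokens uniformly at random. The geometric `decay` schedule, the function names, and the default values are assumptions made for illustration; the abstract does not state which allocation schedule the authors use.

```python
import random


def recency_budgets(num_screenshots, total_budget, decay=0.5):
    """Split a token budget across screenshots ordered oldest to newest.
    The newest screenshot has weight 1; each step back in time multiplies
    the weight by `decay` (an assumed geometric schedule)."""
    weights = [decay ** (num_screenshots - 1 - i)
               for i in range(num_screenshots)]
    total = sum(weights)
    return [int(total_budget * w / total) for w in weights]


def random_prune(token_indices, keep, seed=0):
    """Keep `keep` tokens chosen uniformly at random, returned in their
    original order so the surviving tokens still follow the screen layout."""
    rng = random.Random(seed)
    keep = min(keep, len(token_indices))
    return sorted(rng.sample(token_indices, keep))


if __name__ == "__main__":
    # Three historical screenshots, each with 576 visual tokens (toy value).
    budgets = recency_budgets(num_screenshots=3, total_budget=700)
    for step, budget in enumerate(budgets):
        kept = random_prune(list(range(576)), budget, seed=step)
        print(f"screenshot {step}: kept {len(kept)} of 576 tokens")
```

Uniform random sampling tends to spread the surviving tokens across the whole screen, which is one plausible reading of why it preserves spatial structure better than strategies that concentrate on a few salient regions.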
[8] Beyond Where to Look: Trajectory-Guided Reinforcement Learning for Multimodal RLVR
Jinda Lu, Junkang Wu, Jinghan Li, Kexin Huang, Shuo Yang, Mingzhu Chen, Jiancan Wu, Kuien Liu, Xiang Wang
🧩 TL;DR
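This paper proposes Trajectory-Guided Reinforcement Learning (TGRL), which uses expert reasoning trajectories from stronger models to guide the policy model in integrating visual evidence into fine-grained reasoning, addressing the disconnect between visual perception and logical reasoning in multimodal large language models (MLLMs).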
📘 Detailed Summary
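Motivation: Current work on Reinforcement Learning with Verifiable Rewards (RLVR) for MLLMs focuses mainly on final-answer correctness and visual grounding. A key bottleneck remains: although models can attend to relevant visual regions, they often fail to incorporate the visual evidence into subsequent reasoning, so the reasoning chain is only weakly tied to visual facts.
Method: TGRL uses expert reasoning trajectories from stronger models to guide the policy model in integrating visual evidence into fine-grained reasoning. It also introduces token-level reweighting and trajectory filtering to keep policy optimization stable and effective.
Result: Extensive experiments on multiple multimodal reasoning benchmarks show that TGRL consistently improves reasoning performance, effectively bridges the gap between visual perception and logical reasoning, and markedly improves the model's use of visual evidence.
Conclusion: The study shows that trajectory-guided reinforcement learning can address insufficient integration of visual evidence in multimodal reasoning. It offers a new direction for improving the reasoning ability of MLLMs and underlines the importance of guiding the fine-grained reasoning process.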
📄 Abstract
Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) for multimodal large language models (MLLMs) have mainly focused on improving final answer correctness and strengthening visual grounding. However, a critical bottleneck remains: although models can attend to relevant visual regions, they often fail to effectively incorporate visual evidence into subsequent reasoning, leading to reasoning chains that are weakly grounded in visual facts. To address this issue, we propose Trajectory-Guided Reinforcement Learning (TGRL), which guides the policy model to integrate visual evidence into fine-grained reasoning processes using expert reasoning trajectories from stronger models. We further introduce token-level reweighting and trajectory filtering to ensure stable and effective policy optimization. Extensive experiments on multiple multimodal reasoning benchmarks demonstrate that TGRL consistently improves reasoning performance and effectively bridges the gap between visual perception and logical reasoning.
[9] CREval: An Automated Interpretable Evaluation for Creative Image Manipulation under Complex Instructions
Chonghuinan Wang, Zihan Chen, Yuxiang Wei, Tianyi Jiang, Xiaohe Wu, Fan Li, Wangmeng Zuo, Hongxun Yao
🧩 TL;DR
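This paper proposes CREval, an automated question-answer (QA)-based evaluation pipeline, together with the CREval-Bench benchmark, for systematically evaluating models on complex creative image editing. The study exposes the limitations of current models on such tasks and provides an automated evaluation framework that agrees closely with human judgments.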
📘 Detailed Summary
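Motivation: Existing evaluation methods for instruction-based multimodal image editing lack a systematic, human-aligned framework; on complex creative editing tasks in particular, evaluation is incomplete and poorly interpretable. Current approaches that rely on opaque scoring by multimodal large language models (MLLMs) cannot adequately assess model performance on creative image manipulation.
Method: The authors propose CREval, a fully automated QA-based evaluation pipeline that overcomes the shortcomings of opaque MLLM scoring. They also introduce CREval-Bench, a benchmark designed specifically for creative image editing under complex instructions, covering three categories and nine creative dimensions with over 800 editing samples and 13K evaluation queries.
Result: A systematic evaluation of a range of state-of-the-art open-source and closed-source models shows that closed-source models generally outperform open-source ones on complex creative tasks, but that all models still struggle to complete such edits effectively. User studies show strong consistency between CREval's automated metrics and human judgments, supporting the reliability of the framework.
Conclusion: CREval provides a reliable foundation for evaluating models on complex creative image editing and exposes the key challenges current models face on creative image manipulation. The study points to the need for stronger models for complex creative editing and gives the field a standardized evaluation framework.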
📄 Abstract
Instruction-based multimodal image manipulation has recently made rapid progress. However, existing evaluation methods lack a systematic and human-aligned framework for assessing model performance on complex and creative editing tasks. To address this gap, we propose CREval, a fully automated question-answer (QA)-based evaluation pipeline that overcomes the incompleteness and poor interpretability of opaque Multimodal Large Language Model (MLLM) scoring. Simultaneously, we introduce CREval-Bench, a comprehensive benchmark specifically designed for creative image manipulation under complex instructions. CREval-Bench covers three categories and nine creative dimensions, comprising over 800 editing samples and 13K evaluation queries. Leveraging this pipeline and benchmark, we systematically evaluate a diverse set of state-of-the-art open-source and closed-source models. The results reveal that while closed-source models generally outperform open-source ones on complex and creative tasks, all models still struggle to complete such edits effectively. In addition, user studies demonstrate strong consistency between CREval's automated metrics and human judgments. Therefore, CREval provides a reliable foundation for evaluating image editing models on complex and creative image manipulation tasks, and highlights key challenges and opportunities for future research.
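One way a QA-based evaluation can stay interpretable is to score an edit as the fraction of targeted evaluation queries it passes, reported per dimension as well as overall. The sketch below illustrates that idea only. It assumes each query yields a binary pass/fail verdict, and the dimension names are placeholders; the abstract does not describe CREval's actual query format or aggregation rule.

```python
from collections import defaultdict


def qa_scores(verdicts):
    """Aggregate binary QA verdicts.
    `verdicts` is a list of (dimension, passed) pairs, one per query.
    Returns (per-dimension pass rate, overall pass rate)."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for dimension, ok in verdicts:
        totals[dimension] += 1
        passed[dimension] += int(ok)
    per_dimension = {d: passed[d] / totals[d] for d in totals}
    overall = sum(passed.values()) / sum(totals.values())
    return per_dimension, overall


if __name__ == "__main__":
    # Hypothetical verdicts for one edited image.
    verdicts = [
        ("instruction_following", True),
        ("instruction_following", False),
        ("visual_consistency", True),
        ("visual_consistency", True),
    ]
    per_dimension, overall = qa_scores(verdicts)
    print(per_dimension, overall)
```

Because every point of the score traces back to a specific failed query, a low score can be inspected directly, which is the property a single opaque scalar rating lacks.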
[10] HandVQA: Diagnosing and Improving Fine-Grained Spatial Reasoning about Hands in Vision-Language Models
MD Khalequzzaman Chowdhury Sayem, Mubarrat Tajoar Chowdhury, Yihalem Yimolal Tiruneh, Muneeb A. Khan, Muhammad Salman Ali, Binod Bhattarai, Seungryul Baek
🧩 TL;DR
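This paper introduces HandVQA, a large-scale diagnostic benchmark for evaluating fine-grained spatial reasoning about hands in vision-language models (VLMs). It exposes systematic limitations in how current models understand complex hand anatomy and shows that fine-tuning significantly improves performance on downstream tasks.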
📘 Detailed Summary
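Motivation: Current VLMs reach near-human performance on general vision-language benchmarks but fall short on fine-grained spatial reasoning, especially when interpreting complex, articulated hand poses. This matters most in demanding settings such as robot-assisted surgery, chip manufacturing, and AR/VR human-AI interaction.
Method: The authors build HandVQA, a large-scale diagnostic benchmark based on high-quality 3D hand datasets (FreiHAND, InterHand2.6M, FPHA). It contains over 1.6M controlled multiple-choice questions that probe spatial relationships between hand joints, such as angles, distances, and relative positions. Several state-of-the-art VLMs are evaluated, with lightweight fine-tuning via LoRA.
Result: The evaluation reveals systematic limitations in current models, including hallucinated finger parts, incorrect geometric interpretations, and poor generalization. The 3D spatial knowledge learned from the benchmark transfers in a zero-shot setting and significantly improves accuracy on novel downstream tasks, such as hand gesture recognition (+10.33%) and hand-object interaction (+2.63%).
Conclusion: HandVQA exposes critical gaps in fine-grained spatial reasoning about hands in current VLMs and offers a validated path to improvement. It shows that learning dedicated 3D spatial knowledge can significantly improve performance in related applications and points the way for future work on fine-grained spatial understanding.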
📄 Abstract
Understanding the fine-grained articulation of human hands is critical in high-stakes settings such as robot-assisted surgery, chip manufacturing, and AR/VR-based human-AI interaction. Despite achieving near-human performance on general vision-language benchmarks, current vision-language models (VLMs) struggle with fine-grained spatial reasoning, especially in interpreting complex and articulated hand poses. We introduce HandVQA, a large-scale diagnostic benchmark designed to evaluate VLMs' understanding of detailed hand anatomy through visual question answering. Built upon high-quality 3D hand datasets (FreiHAND, InterHand2.6M, FPHA), our benchmark includes over 1.6M controlled multiple-choice questions that probe spatial relationships between hand joints, such as angles, distances, and relative positions. We evaluate several state-of-the-art VLMs (LLaVA, DeepSeek, and Qwen-VL) in both base and fine-tuned settings, using lightweight fine-tuning via LoRA. Our findings reveal systematic limitations in current models, including hallucinated finger parts, incorrect geometric interpretations, and poor generalization. HandVQA not only exposes these critical reasoning gaps but also provides a validated path to improvement. We demonstrate that the 3D-grounded spatial knowledge learned from our benchmark transfers in a zero-shot setting, significantly improving model accuracy on novel downstream tasks such as hand gesture recognition (+10.33%) and hand-object interaction (+2.63%).
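As an illustration of how a controlled question about joint angles could be derived from 3D hand annotations, the sketch below computes the angle at a middle joint from three 3D keypoints and maps it to a coarse answer label. The bin edges, label wording, and keypoint coordinates are hypothetical; the abstract does not give the benchmark's actual question templates or thresholds.

```python
import math


def joint_angle_deg(a, b, c):
    """Angle in degrees at joint `b`, formed by segments b->a and b->c.
    Each argument is an (x, y, z) keypoint; points must be distinct."""
    ba = [x - y for x, y in zip(a, b)]
    bc = [x - y for x, y in zip(c, b)]
    dot = sum(x * y for x, y in zip(ba, bc))
    norm = math.sqrt(sum(x * x for x in ba)) * math.sqrt(sum(x * x for x in bc))
    cosine = max(-1.0, min(1.0, dot / norm))  # guard against rounding error
    return math.degrees(math.acos(cosine))


def angle_label(angle_deg):
    """Map an angle to a coarse answer option (assumed bin edges)."""
    if angle_deg < 90.0:
        return "strongly bent"
    if angle_deg < 150.0:
        return "partially bent"
    return "nearly straight"


if __name__ == "__main__":
    # Toy coordinates for three consecutive joints of one finger.
    mcp, pip, dip = (0.0, 0.0, 0.0), (0.0, 4.0, 0.0), (0.0, 7.0, 1.5)
    angle = joint_angle_deg(mcp, pip, dip)
    print(f"Angle at middle joint: {angle:.1f} degrees ({angle_label(angle)})")
```

Deriving the correct option from ground-truth geometry in this way is what makes such questions "controlled": the answer is determined by the annotation rather than by an annotator's impression of the image.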
[11] ClipTTT: CLIP-Guided Test-Time Training Helps LVLMs See Better
Mriganka Nath, Anurag Das, Jiahao Xie, Bernt Schiele
🧩 TL;DR
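This paper proposes ClipTTT, which uses the image-text alignment of a pre-trained CLIP model as a stable guidance signal to adapt large vision-language models (LVLMs) at test time from a single sample, mitigating hallucinations under degraded visual inputs.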
📘 Detailed Summary
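Motivation: LVLMs tend to hallucinate when visual inputs are corrupted at test time. Such corruptions act as additional distribution shifts and significantly amplify hallucination rates in real-world applications. This calls for a method that adapts to degraded conditions on the fly from a single test sample without altering the base model.
Method: CLIP-guided Test-Time Training uses the image-text alignment strength of a pre-trained CLIP model as a stable guidance signal to identify reliable self-supervision targets. This enables rapid adaptation from a single test sample without modifying the base LVLM.
Result: Extensive experiments on standard hallucination benchmarks with 15 common corruption types show that ClipTTT effectively mitigates hallucinations and improves descriptive faithfulness under visual corruptions, significantly lowering hallucination rates in degraded conditions.
Conclusion: The study demonstrates that the cross-modal alignment of pre-trained models such as CLIP is an effective guidance signal. It offers a practical way to make LVLMs more robust to real-world degradations and shows the potential of test-time adaptation for mitigating hallucinations.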
📄 Abstract
Large vision-language models (LVLMs) tend to hallucinate, especially when visual inputs are corrupted at test time. We show that such corruptions act as additional distribution shifts, significantly amplifying hallucination rates in real-world applications. To address this, we propose CLIP-guided Test-Time Training (ClipTTT), a method to adapt LVLMs under degraded conditions on the fly with a single test sample. Specifically, we leverage the image-text alignment strength of a pre-trained CLIP model as a stable guidance signal to identify reliable self-supervision targets, enabling rapid adaptation without altering the base LVLMs. Extensive experiments on standard hallucination benchmarks, with 15 common corruptions, demonstrate that ClipTTT effectively mitigates hallucinations and improves descriptive faithfulness under visual corruptions.
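The idea of using image-text alignment to pick a reliable self-supervision target can be sketched as follows: given an image embedding and embeddings of several candidate captions, score each candidate by cosine similarity and keep the best-aligned one as the target. The embeddings here are toy vectors standing in for the outputs of CLIP's image and text encoders, and `select_target` is a hypothetical helper; the abstract does not describe the actual selection or training procedure.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)


def select_target(image_embedding, caption_embeddings):
    """Return (index of best-aligned caption, all alignment scores)."""
    scores = [cosine_similarity(image_embedding, c)
              for c in caption_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


if __name__ == "__main__":
    # Toy embeddings; a real pipeline would obtain these from CLIP encoders.
    image = [0.9, 0.1, 0.0]
    candidates = [
        [0.0, 1.0, 0.0],   # caption unrelated to the image
        [0.8, 0.2, 0.1],   # caption well aligned with the image
        [0.1, 0.0, 1.0],   # caption describing a hallucinated object
    ]
    best, scores = select_target(image, candidates)
    print("selected candidate:", best)
```

In a test-time training loop, the selected caption would serve as the pseudo-label for a few adaptation steps on that single sample, with the frozen alignment model acting as the external check on which candidate to trust.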
[12] MA-Bench: Towards Fine-grained Micro-Action Understanding
Kun Li, Jihao Gu, Fei Wang, Zhiliang Wu, Hehe Fan, Dan Guo
🧩 TL;DR
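This paper introduces MA-Bench, the first benchmark dedicated to evaluating micro-action understanding in multimodal large language models (MLLMs). It contains 1,000 videos and 12,000 structured question-answer pairs, and is accompanied by MA-Bench-Train, a large-scale training set of 20.5K videos built to improve model performance.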
📘 Detailed Summary
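Motivation: Despite the rapid development of MLLMs, their potential for micro-action understanding, which is vital to human emotion analysis, remains largely unexplored. The main reason is the lack of a dedicated evaluation benchmark, which hinders progress in capturing subtle movements and body-part dynamics.
Method: The authors present MA-Bench, a benchmark of 1,000 videos with a three-tier evaluation architecture that examines micro-action perception, relational comprehension, and interpretive reasoning, for a total of 12,000 structured question-answer pairs. They also build MA-Bench-Train, a large-scale training corpus of 20.5K videos with structured micro-action annotations, for fine-tuning MLLMs.
Result: An evaluation of 23 representative MLLMs shows that existing models face significant challenges in capturing motion granularity and fine-grained body-part dynamics. Qwen3-VL-8B fine-tuned on MA-Bench-Train shows clear improvements on micro-action reasoning and explanation tasks, validating the dedicated training data.
Conclusion: The study establishes a foundational benchmark for advancing MLLMs in understanding subtle micro-actions and human-related behaviors. It exposes the limitations of current models in fine-grained action understanding and shows that purpose-built training data improves performance, providing tools and direction for future micro-action analysis research.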
📄 Abstract
Despite the rapid development of Multimodal Large Language Models (MLLMs), their potential in micro-action understanding, which plays a vital role in human emotion analysis, remains unexplored due to the absence of specialized benchmarks. To tackle this issue, we present MA-Bench, a benchmark comprising 1,000 videos and a three-tier evaluation architecture that progressively examines micro-action perception, relational comprehension, and interpretive reasoning. MA-Bench contains 12,000 structured question-answer pairs, enabling systematic assessment of both recognition accuracy and action interpretation. The results of 23 representative MLLMs reveal significant challenges in capturing motion granularity and fine-grained body-part dynamics. To address these challenges, we further construct MA-Bench-Train, a large-scale training corpus with 20.5K videos annotated with structured micro-action captions for fine-tuning MLLMs. The results of Qwen3-VL-8B fine-tuned on MA-Bench-Train show clear performance improvements across micro-action reasoning and explanation tasks. Our work aims to establish a foundational benchmark for advancing MLLMs in understanding subtle micro-actions and human-related behaviors. Project Page: https://MA-Bench.github.io