Table of Contents
cs.CV
[1] OPENTOUCH: Bringing Full-Hand Touch to Real-World Interaction
Yuxin Ray Song, Jinzhou Li, Rao Fu, Devin Murphy, Kaichen Zhou, Rishi Shiv, Yaqi Li, Haoyu Xiong, Crystal Elaine Owens, Yilun Du, Yiyue Luo, Xianyi Cheng, Antonio Torralba, Wojciech Matusik, Paul Pu Liang
🧩 TL;DR
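This paper presents OpenTouch, the first in-the-wild egocentric full-hand tactile dataset, containing 5.1 hours of synchronized video-touch-pose data and 2,900 curated clips with detailed text annotations. It also establishes retrieval and classification benchmarks that probe how touch grounds perception and action.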
📘 Detailed Summary
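Motivation: The human hand is our primary interface to the physical world, yet existing egocentric perception systems rarely identify when, where, or how forcefully contact occurs. Robust wearable tactile sensors are scarce, and no in-the-wild dataset aligns video with full-hand touch, which limits research on fusing visual perception with physical interaction.
Method: The team built the OpenTouch dataset, comprising 5.1 hours of synchronously captured video, touch, and hand-pose data, and curated 2,900 clips with detailed text descriptions. On top of the dataset they set up retrieval and classification benchmark tasks to study how tactile signals ground perception and action.
Result: Experiments show that tactile signals provide a compact yet powerful cue for grasp understanding, strengthen cross-modal alignment, and can be reliably retrieved from in-the-wild video queries, supporting the practical value of touch for perception tasks.
Conclusion: Releasing the OpenTouch dataset and benchmarks is intended to advance multimodal egocentric perception, embodied learning, and contact-rich robotic manipulation, and gives vision-touch fusion research a data resource and an evaluation framework.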
📄 Abstract
The human hand is our primary interface to the physical world, yet egocentric perception rarely knows when, where, or how forcefully it makes contact. Robust wearable tactile sensors are scarce, and no existing in-the-wild datasets align first-person video with full-hand touch. To bridge the gap between visual perception and physical interaction, we present OpenTouch, the first in-the-wild egocentric full-hand tactile dataset, containing 5.1 hours of synchronized video-touch-pose data and 2,900 curated clips with detailed text annotations. Using OpenTouch, we introduce retrieval and classification benchmarks that probe how touch grounds perception and action. We show that tactile signals provide a compact yet powerful cue for grasp understanding, strengthen cross-modal alignment, and can be reliably retrieved from in-the-wild video queries. By releasing this annotated vision-touch-pose dataset and benchmark, we aim to advance multimodal egocentric perception, embodied learning, and contact-rich robotic manipulation.
[2] GenEval 2: Addressing Benchmark Drift in Text-to-Image Evaluation
Amita Kamath, Kai-Wei Chang, Ranjay Krishna, Luke Zettlemoyer, Yushi Hu, Marjan Ghazvininejad
🧩 TL;DR
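This paper shows that the text-to-image (T2I) benchmark GenEval suffers from benchmark drift: its static judge fails to keep up with newer model capabilities and has diverged substantially from human judgment. To address this, the authors propose a new benchmark, GenEval 2, and an evaluation method, Soft-TIFA, aiming for T2I evaluation that is more reliable and less prone to drift.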
📘 Detailed Summary
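Motivation: Automated T2I evaluation faces benchmark drift: a static benchmark such as GenEval cannot keep pace with rapidly improving models, so its scores gradually diverge from human judgment. The benchmark's validity therefore degrades over time, it no longer reflects the true performance of the latest models, and a benchmarking gap opens up in the T2I field.
Method: The authors first confirm GenEval's drift with a large-scale human study. They then introduce GenEval 2, which improves coverage of primitive visual concepts and raises the degree of compositionality. They also introduce Soft-TIFA, an evaluation method that combines judgments over visual primitives; compared with more holistic judges such as VQAScore, it aligns better with human judgment and is argued to be less likely to drift.
Result: GenEval has drifted far from human judgment, with an absolute error of as much as 17.7% for current models, which suggests the benchmark has been saturated for some time. GenEval 2 is more challenging for current models, and Soft-TIFA aligns better with human judgment.
Conclusion: The work highlights the need for continual audits and improvement of automated evaluation benchmarks for T2I and related tasks, since static benchmarks tend to drift and lose validity over time. GenEval 2 and Soft-TIFA offer one workable response, but avoiding drift is far from guaranteed and requires sustained attention from the community.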
📄 Abstract
Automating Text-to-Image (T2I) model evaluation is challenging; a judge model must be used to score correctness, and test prompts must be selected to be challenging for current T2I models but not the judge. We argue that satisfying these constraints can lead to benchmark drift over time, where the static benchmark judges fail to keep up with newer model capabilities. We show that benchmark drift is a significant problem for GenEval, one of the most popular T2I benchmarks. Although GenEval was well-aligned with human judgment at the time of its release, it has drifted far from human judgment over time -- resulting in an absolute error of as much as 17.7% for current models. This level of drift strongly suggests that GenEval has been saturated for some time, as we verify via a large-scale human study. To help fill this benchmarking gap, we introduce a new benchmark, GenEval 2, with improved coverage of primitive visual concepts and higher degrees of compositionality, which we show is more challenging for current models. We also introduce Soft-TIFA, an evaluation method for GenEval 2 that combines judgments for visual primitives, which we show is more well-aligned with human judgment and argue is less likely to drift from human-alignment over time (as compared to more holistic judges such as VQAScore). Although we hope GenEval 2 will provide a strong benchmark for many years, avoiding benchmark drift is far from guaranteed and our work, more generally, highlights the importance of continual audits and improvement for T2I and related automated model evaluation benchmarks.
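The abstract above says Soft-TIFA "combines judgments for visual primitives" but does not give the aggregation rule. The sketch below is only an illustration of why a soft, per-primitive score can be more informative than an all-or-nothing verdict; the arithmetic-mean aggregation, the function names, and the judge probabilities are all assumptions, not the paper's definition.

```python
from statistics import mean


def hard_score(primitive_probs, threshold=0.5):
    """All-or-nothing verdict: 1.0 only if every primitive passes the threshold."""
    return 1.0 if all(p >= threshold for p in primitive_probs) else 0.0


def soft_score(primitive_probs):
    """Soft aggregate (assumed here to be a plain mean of judge probabilities)."""
    return mean(primitive_probs)


# Hypothetical judge probabilities for the prompt
# "two red cubes to the left of a blue ball".
probs = {
    "cubes present": 0.95,
    "count is two": 0.40,
    "cubes are red": 0.90,
    "left of the ball": 0.75,
}

values = list(probs.values())
print("hard:", hard_score(values))  # fails because one primitive is below 0.5
print("soft:", soft_score(values))  # partial credit, roughly 0.75
```

With the hard rule a single uncertain primitive zeroes the whole image, while the soft rule keeps the partial credit and shows which primitive lowered the score.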
[3] LinkedOut: Linking World Knowledge Representation Out of Video LLM for Next-Generation Video Recommendation
Haichao Zhang, Yao Lu, Lichen Wang, Yunzhe Li, Daiwei Chen, Yunpeng Xu, Yun Fu
🧩 TL;DR
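This paper proposes LinkedOut, a method that extracts world-knowledge-aware representations from video large language models (VLLMs). It targets three challenges in video recommendation (multi-video input support, low-latency inference, and preservation of fine-grained visual detail) and reports state-of-the-art results on standard benchmarks without handcrafted labels.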
📘 Detailed Summary
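Motivation: Deploying VLLMs for downstream tasks such as video recommendation runs into three main limitations: decode-only generation yields high latency for sequential inference, typical interfaces do not support multi-video inputs, and constraining outputs to language discards fine-grained visual details that matter for downstream vision tasks. The authors attribute these limitations to the absence of a representation that preserves pixel-level detail while leveraging world knowledge.
Method: LinkedOut extracts semantically grounded, knowledge-aware tokens from raw frames using a VLLM, guided by promptable queries and optional auxiliary modalities. It introduces a cross-layer knowledge fusion mixture-of-experts (MoE) that selects the appropriate level of abstraction from the rich VLLM features, enabling personalized, interpretable, and low-latency recommendation.
Result: LinkedOut achieves state-of-the-art results on standard benchmarks and, to the authors' knowledge, is the first VLLM-based video recommendation method that operates on raw frames without handcrafted labels. Interpretability studies and ablations confirm the benefits of layer diversity and layer-wise fusion.
Conclusion: The work points to a practical path for leveraging VLLM world-knowledge priors and visual reasoning in downstream vision tasks, and shows the value of cross-layer knowledge fusion for balancing abstraction level against computational cost in deployed video recommendation.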
📄 Abstract
Video Large Language Models (VLLMs) unlock world-knowledge-aware video understanding through pretraining on internet-scale data and have already shown promise on tasks such as movie analysis and video question answering. However, deploying VLLMs for downstream tasks such as video recommendation remains challenging, since real systems require multi-video inputs, lightweight backbones, low-latency sequential inference, and rapid response. In practice, (1) decode-only generation yields high latency for sequential inference, (2) typical interfaces do not support multi-video inputs, and (3) constraining outputs to language discards fine-grained visual details that matter for downstream vision tasks. We argue that these limitations stem from the absence of a representation that preserves pixel-level detail while leveraging world knowledge. We present LinkedOut, a representation that extracts VLLM world knowledge directly from video to enable fast inference, supports multi-video histories, and removes the language bottleneck. LinkedOut extracts semantically grounded, knowledge-aware tokens from raw frames using VLLMs, guided by promptable queries and optional auxiliary modalities. We introduce a cross-layer knowledge fusion MoE that selects the appropriate level of abstraction from the rich VLLM features, enabling personalized, interpretable, and low-latency recommendation. To our knowledge, LinkedOut is the first VLLM-based video recommendation method that operates on raw frames without handcrafted labels, achieving state-of-the-art results on standard benchmarks. Interpretability studies and ablations confirm the benefits of layer diversity and layer-wise fusion, pointing to a practical path that fully leverages VLLM world-knowledge priors and visual reasoning for downstream vision tasks such as recommendation.
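The abstract above describes a cross-layer fusion MoE that picks a level of abstraction from features taken at different VLLM layers, without specifying the gating. As a rough illustration of the general idea only, the sketch below fuses per-layer feature vectors with softmax gate weights; the dense softmax gate, the function names, and the toy numbers are assumptions and should not be read as LinkedOut's architecture.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_layers(layer_feats, gate_logits):
    """Weighted sum of per-layer feature vectors using softmax gate weights.

    layer_feats: list of equal-length vectors, one per (hypothetical) VLLM layer.
    gate_logits: one score per layer, e.g. produced by a small gating network.
    """
    weights = softmax(gate_logits)
    dim = len(layer_feats[0])
    fused = [
        sum(w * feat[d] for w, feat in zip(weights, layer_feats))
        for d in range(dim)
    ]
    return fused, weights


# Toy features from three layers (shallow, middle, deep); values are made up.
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, weights = fuse_layers(feats, gate_logits=[0.0, 0.0, 0.0])
print(weights)  # equal logits give equal weights
print(fused)    # so the fused vector is the plain average of the layers
```

In a learned system the gate logits would depend on the input, so different users or videos could lean on shallower or deeper layers; inspecting the weights is one simple way such a fusion can be made interpretable.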
[4] Flowing from Reasoning to Motion: Learning 3D Hand Trajectory Prediction from Egocentric Human Interaction Videos
Mingfei Chen, Yifan Wang, Zhengqin Li, Homanga Bharadhwaj, Yujin Chen, Chuan Qin, Ziyi Kou, Yuan Tian, Eric Whitmire, Rajinder Sodhi, Hrvoje Benko, Eli Shlizerman, Yue Liu
🧩 TL;DR
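This paper presents the EgoMAN dataset and model for 3D hand trajectory prediction, addressing two problems in prior work: motion decoupled from semantic supervision, and weak links between reasoning and action. A large-scale egocentric dataset and a reasoning-to-motion framework yield accurate, stage-aware trajectory predictions.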
📘 Detailed Summary
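Motivation: Prior work on 3D hand trajectory prediction is constrained in two ways: datasets decouple motion from semantic supervision, and models only weakly link reasoning and action. As a result, predicted trajectories lack semantic understanding and awareness of the interaction stage, which limits their usefulness in real scenes.
Method: The authors first build the EgoMAN dataset, with 219K 6DoF trajectories and 3M structured QA pairs supporting semantic, spatial, and motion reasoning. They then propose the EgoMAN model, a reasoning-to-motion framework that connects vision-language reasoning and motion generation through a trajectory-token interface and is trained progressively to align reasoning with motion dynamics.
Result: The approach produces accurate, interaction-stage-aware 3D hand trajectories and generalizes across real-world scenes. The scale of the EgoMAN dataset (219K trajectories and 3M QA pairs) supplies semantic and motion supervision for training.
Conclusion: Tightly coupling semantic reasoning with motion generation improves the accuracy and practicality of 3D hand trajectory prediction. The EgoMAN framework offers a new approach to egocentric interaction understanding, with progressive training used to align reasoning and motion dynamics.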
📄 Abstract
Prior works on 3D hand trajectory prediction are constrained by datasets that decouple motion from semantic supervision and by models that weakly link reasoning and action. To address these, we first present the EgoMAN dataset, a large-scale egocentric dataset for interaction stage-aware 3D hand trajectory prediction with 219K 6DoF trajectories and 3M structured QA pairs for semantic, spatial, and motion reasoning. We then introduce the EgoMAN model, a reasoning-to-motion framework that links vision-language reasoning and motion generation via a trajectory-token interface. Trained progressively to align reasoning with motion dynamics, our approach yields accurate and stage-aware trajectories with generalization across real-world scenes.
[5] EasyV2V: A High-quality Instruction-based Video Editing Framework
Jinjie Mai, Chaoyang Wang, Guocheng Gordon Qian, Willi Menapace, Sergey Tulyakov, Bernard Ghanem, Peter Wonka, Ashkan Mirzaei
🧩 TL;DR
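This paper presents EasyV2V, a simple and effective framework for instruction-based video editing that combines data composition, a simplified architecture, and a unified control mechanism to reach state-of-the-art video editing results.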
📘 Detailed Summary
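Motivation: Image editing has advanced rapidly, but video editing remains less explored and still faces challenges in consistency, control, and generalization. Existing methods are limited in data diversity, model architecture, and control mechanisms.
Method: EasyV2V is designed along three axes: data, architecture, and control. For data, it composes existing expert models, lifts image edit pairs into video pairs via single-frame supervision, mines dense-captioned clips, and adds transition supervision to build diverse training data. For architecture, it exploits the editing capability of pretrained text-to-video models, using simple sequence concatenation for conditioning and light LoRA fine-tuning. For control, it unifies spatiotemporal control through a single mask mechanism and supports optional reference images.
Result: EasyV2V achieves state-of-the-art video editing results, surpassing concurrent and commercial systems, and accepts flexible input combinations such as video+text, video+mask+text, and video+mask+reference+text.
Conclusion: The study indicates that pretrained text-to-video models already possess editing capability, and that a simplified design combined with diverse data construction and unified control can address the main challenges of video editing, giving a scalable design for future systems.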
📄 Abstract
While image editing has advanced rapidly, video editing remains less explored, facing challenges in consistency, control, and generalization. We study the design space of data, architecture, and control, and introduce EasyV2V, a simple and effective framework for instruction-based video editing. On the data side, we compose existing experts with fast inverses to build diverse video pairs, lift image edit pairs into videos via single-frame supervision and pseudo pairs with shared affine motion, mine dense-captioned clips for video pairs, and add transition supervision to teach how edits unfold. On the model side, we observe that pretrained text-to-video models possess editing capability, motivating a simplified design. Simple sequence concatenation for conditioning with light LoRA fine-tuning suffices to train a strong model. For control, we unify spatiotemporal control via a single mask mechanism and support optional reference images. Overall, EasyV2V works with flexible inputs, e.g., video+text, video+mask+text, video+mask+reference+text, and achieves state-of-the-art video editing results, surpassing concurrent and commercial systems. Project page: https://snap-research.github.io/easyv2v/
[6] Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification
Qihao Liu, Chengzhi Mao, Yaojie Liu, Alan Yuille, Wen-Sheng Chu
🧩 TL;DR
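This paper presents AuditDM, an automated auditing framework for multimodal large language models (MLLMs) that actively discovers and rectifies model failure modes. It trains an auditor model to generate challenging questions and counterfactual images that expose the target models' weaknesses and double as annotation-free data for rectification.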
📘 Detailed Summary
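Motivation: Conventional MLLM evaluation methods lack interpretability and often fail to fully disclose significant capability gaps across models. Existing evaluation is typically passive and static, which makes it hard to discover failure modes systematically and limits diagnosis and improvement.
Method: AuditDM fine-tunes an MLLM as an auditor via reinforcement learning so that it generates challenging questions and counterfactual images that maximize disagreement among target models. Once trained, the auditor uncovers diverse, interpretable exemplars that reveal model weaknesses and serve as annotation-free data for rectification.
Result: Applied to state-of-the-art models such as Gemma-3 and PaliGemma-2, AuditDM discovers more than 20 distinct failure types. Fine-tuning on these discoveries consistently improves all models across 16 benchmarks and enables a 3B model to surpass its 28B counterpart.
Conclusion: The results suggest that as data scaling hits diminishing returns, targeted model auditing offers an effective path to model diagnosis and improvement. AuditDM both discovers weaknesses systematically and generates data for rectifying them.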
📄 Abstract
Conventional evaluation methods for multimodal LLMs (MLLMs) lack interpretability and are often insufficient to fully disclose significant capability gaps across models. To address this, we introduce AuditDM, an automated framework that actively discovers and rectifies MLLM failure modes by auditing their divergence. AuditDM fine-tunes an MLLM as an auditor via reinforcement learning to generate challenging questions and counterfactual images that maximize disagreement among target models. Once trained, the auditor uncovers diverse, interpretable exemplars that reveal model weaknesses and serve as annotation-free data for rectification. When applied to SoTA models like Gemma-3 and PaliGemma-2, AuditDM discovers more than 20 distinct failure types. Fine-tuning on these discoveries consistently improves all models across 16 benchmarks, and enables a 3B model to surpass its 28B counterpart. Our results suggest that as data scaling hits diminishing returns, targeted model auditing offers an effective path to model diagnosis and improvement.
[7] GeoPredict: Leveraging Predictive Kinematics and 3D Gaussian Geometry for Precise VLA Manipulation
Jingjing Qian, Boyao Han, Chen Shi, Lei Xiao, Long Yang, Shaoshuai Shi, Li Jiang
🧩 TL;DR
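This paper proposes GeoPredict, a geometry-aware vision-language-action (VLA) framework that augments a continuous-action policy with predictive kinematic and geometric priors, improving VLA performance on robotic manipulation tasks that require precise 3D reasoning.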
📘 Detailed Summary
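Motivation: VLA models generalize well in robotic manipulation but remain largely reactive and 2D-centric, which makes them unreliable in tasks that require precise 3D reasoning, particularly in geometry-intensive and spatially demanding scenarios.
Method: GeoPredict has two core modules: a trajectory-level module that encodes motion history and predicts multi-step 3D keypoint trajectories of the robot arm, and a predictive 3D Gaussian geometry module that forecasts workspace geometry with track-guided refinement. These modules serve only as training-time supervision, implemented through depth-based rendering; inference needs only lightweight additional query tokens and no 3D decoding.
Result: Experiments on RoboCasa Human-50, LIBERO, and real-world manipulation tasks show that GeoPredict consistently outperforms strong VLA baselines, especially in geometry-intensive and spatially demanding scenarios.
Conclusion: Integrating predictive kinematic and geometric priors into a VLA framework improves 3D reasoning in robotic manipulation while keeping inference-time cost low, and lays groundwork for more complex geometry-aware robotic systems.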
📄 Abstract
Vision-Language-Action (VLA) models achieve strong generalization in robotic manipulation but remain largely reactive and 2D-centric, making them unreliable in tasks that require precise 3D reasoning. We propose GeoPredict, a geometry-aware VLA framework that augments a continuous-action policy with predictive kinematic and geometric priors. GeoPredict introduces a trajectory-level module that encodes motion history and predicts multi-step 3D keypoint trajectories of robot arms, and a predictive 3D Gaussian geometry module that forecasts workspace geometry with track-guided refinement along future keypoint trajectories. These predictive modules serve exclusively as training-time supervision through depth-based rendering, while inference requires only lightweight additional query tokens without invoking any 3D decoding. Experiments on RoboCasa Human-50, LIBERO, and real-world manipulation tasks show that GeoPredict consistently outperforms strong VLA baselines, especially in geometry-intensive and spatially demanding scenarios.
[8] Radiology Report Generation with Layer-Wise Anatomical Attention
Emmanuel D. Muñiz-De-León, Jorge A. Rosales-de-Golferichs, Ana S. Muñoz-Rodríguez, Alejandro I. Trejo-Castro, Eduardo de Avila-Armenta, Antonio Martínez-Torteya
🧩 TL;DR
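This study proposes a compact chest X-ray report generation architecture that produces the Findings section from a single frontal image. It pairs a frozen DINOv3 vision encoder with an enhanced GPT-2 decoder and a layer-wise anatomical attention mechanism, reducing resource requirements while improving generation quality in clinically relevant regions.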
📘 Detailed Summary
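Motivation: State-of-the-art radiology report generation systems such as MAIRA-2 and MedPaLM-M depend on large-scale multimodal training, clinical metadata, and multiple imaging views, which makes them resource-intensive and hard to deploy in most clinical settings. This study aims for a compact architecture that needs only a single frontal image.
Method: A frozen DINOv3 Vision Transformer encoder is combined with a GPT-2 decoder. Layer-wise anatomical attention integrates lung and heart segmentation masks through hierarchical Gaussian smoothing, biasing attention toward clinically relevant regions. The mechanism adds no trainable parameters, and report generation is conditioned on the image alone.
Result: On the MIMIC-CXR dataset, CheXpert Macro-F1 for five key pathologies increased by 168% (0.083→0.238) and Micro-F1 by 146% (0.137→0.337); overall performance across 14 observations improved by 86% (0.170→0.316); and RadGraph F1, which measures structural coherence, rose by 9.7%.
Conclusion: Decoder-level anatomical guidance improves spatial grounding and enhances coherence in clinically relevant regions. Despite its small size and purely image-conditioned design, the model achieves substantial gains, offering a workable option for automatic radiology report generation in resource-constrained settings.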
📄 Abstract
Automatic radiology report generation is a promising application of multimodal deep learning, aiming to reduce reporting workload and improve consistency. However, current state-of-the-art (SOTA) systems - such as Multimodal AI for Radiology Applications (MAIRA-2) and Medical Pathways Language Model-Multimodal (MedPaLM-M) - depend on large-scale multimodal training, clinical metadata, and multiple imaging views, making them resource-intensive and inaccessible for most settings. We introduce a compact image-to-text architecture that generates the Findings section of chest X-ray reports from a single frontal image. The model combines a frozen Self-Distillation with No Labels v3 (DINOv3) Vision Transformer (ViT) encoder with a Generative Pre-trained Transformer 2 (GPT-2) decoder enhanced by layer-wise anatomical attention. This mechanism integrates lung and heart segmentation masks through hierarchical Gaussian smoothing, biasing attention toward clinically relevant regions without adding trainable parameters. Evaluated on the official Medical Information Mart for Intensive Care-Chest X-ray (MIMIC-CXR) dataset using Chest Radiograph Expert (CheXpert) and Radiology Graph (RadGraph) metrics, our approach achieved substantial gains: CheXpert Macro-F1 for five key pathologies increased by 168% (0.083 -> 0.238) and Micro-F1 by 146% (0.137 -> 0.337), while broader performance across 14 observations improved by 86% (0.170 -> 0.316). Structural coherence also improved, with RadGraph F1 rising by 9.7%. Despite its small size and purely image-conditioned design, the model demonstrates that decoder-level anatomical guidance improves spatial grounding and enhances coherence in clinically relevant regions. The source code is publicly available at: https://github.com/devMuniz02/UDEM-CXR-Reporting-Thesis-2025.
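The percentage gains quoted in the abstract above are relative improvements, which are usually computed as (new - old) / old. A minimal sketch for rechecking them from the quoted score pairs; the helper name `relative_gain_pct` is illustrative, and only the score pairs come from the abstract.

```python
def relative_gain_pct(old, new):
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


# Score pairs exactly as quoted in the abstract (before -> after).
pairs = {
    "CheXpert Macro-F1 (5 pathologies)": (0.083, 0.238),
    "CheXpert Micro-F1 (5 pathologies)": (0.137, 0.337),
    "CheXpert F1 (14 observations)": (0.170, 0.316),
}

gains = {name: relative_gain_pct(old, new) for name, (old, new) in pairs.items()}
for name, gain in gains.items():
    print(f"{name}: {gain:.1f}%")
```

As printed, the Micro-F1 and 14-observation pairs reproduce the quoted 146% and 86%, while the Macro-F1 pair 0.083 -> 0.238 works out to about 187% rather than the quoted 168%.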
[9] RePlan: Reasoning-guided Region Planning for Complex Instruction-based Image Editing
Tianyuan Qu, Lei Ke, Xiaohang Zhan, Longxiang Tang, Yuqi Liu, Bohao Peng, Bei Yu, Dong Yu, Jiaya Jia
🧩 TL;DR
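This paper presents RePlan, a region-aligned planning framework for image editing under Instruction-Visual Complexity. It couples a vision-language planner with a diffusion editor to perform precise, parallel multi-region edits when complex instructions meet cluttered scenes.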
📘 Detailed Summary
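Motivation: Existing instruction-based image editing models falter under Instruction-Visual Complexity, where intricate instructions meet cluttered or ambiguous scenes and the model struggles to understand and execute the edit accurately. This limits editing precision and reliability in practice.
Method: RePlan is a plan-then-execute framework with two parts, a vision-language planner and a diffusion editor. The planner decomposes instructions through step-by-step reasoning and explicitly grounds them to target regions; the editor applies changes with a training-free attention-region injection mechanism, enabling precise, parallel multi-region edits without iterative inpainting. To strengthen planning, GRPO-based reinforcement learning is applied using only 1K instruction-only examples, yielding substantial gains in reasoning fidelity and format reliability.
Result: RePlan performs strongly on IV-Edit, a benchmark the authors introduce that focuses on fine-grained grounding and knowledge-intensive edits. Across IV-Complex settings, RePlan consistently outperforms strong baselines trained on far larger datasets, improving regional precision and overall fidelity.
Conclusion: Decomposing complex instructions into step-by-step reasoning, explicitly grounding them to target regions, and applying a training-free attention injection mechanism improves editing precision in complex scenes. The approach offers an effective way to handle Instruction-Visual Complexity and a direction for future instruction-based vision tasks.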
📄 Abstract
Instruction-based image editing enables natural-language control over visual modifications, yet existing models falter under Instruction-Visual Complexity (IV-Complexity), where intricate instructions meet cluttered or ambiguous scenes. We introduce RePlan (Region-aligned Planning), a plan-then-execute framework that couples a vision-language planner with a diffusion editor. The planner decomposes instructions via step-by-step reasoning and explicitly grounds them to target regions; the editor then applies changes using a training-free attention-region injection mechanism, enabling precise, parallel multi-region edits without iterative inpainting. To strengthen planning, we apply GRPO-based reinforcement learning using 1K instruction-only examples, yielding substantial gains in reasoning fidelity and format reliability. We further present IV-Edit, a benchmark focused on fine-grained grounding and knowledge-intensive edits. Across IV-Complex settings, RePlan consistently outperforms strong baselines trained on far larger datasets, improving regional precision and overall fidelity. Our project page: https://replan-iv-edit.github.io
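The abstract above names GRPO-based reinforcement learning but gives no training details. As background, GRPO scores each sampled response relative to the other responses drawn for the same prompt. The sketch below shows that group-relative advantage under stated assumptions: scalar rewards, population standard deviation, and made-up reward values. It is not RePlan's reward design.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps); eps guards
    against division by zero when every sample gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four plans sampled from one editing instruction,
# e.g. combining a format check and a grounding check (both assumed).
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Plans that beat their group's average get a positive advantage and are reinforced, while below-average plans are pushed down; no separate value network is needed because the group itself serves as the baseline.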
[10] Memory-Enhanced SAM3 for Occlusion-Robust Surgical Instrument Segmentation
Valay Bundele, Mehran Hosseinzadeh, Hendrik P. A. Lensch
🧩 TL;DR
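This paper proposes ReMeDI-SAM3, a training-free, memory-enhanced extension of SAM3 that uses relevance-aware memory filtering, a piecewise interpolation scheme, and a feature-based re-identification module to improve the accuracy and occlusion recovery of surgical instrument segmentation in endoscopic videos.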
📘 Detailed Summary
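Motivation: Accurate surgical instrument segmentation in endoscopic videos is crucial for computer-assisted interventions but remains difficult because of frequent occlusions, rapid motion, specular artefacts, and long-term instrument re-entry. SAM3 provides a powerful spatio-temporal framework for video object segmentation, but in surgical scenes its performance is limited by indiscriminate memory updates, fixed memory capacity, and weak identity recovery after occlusions.
Method: ReMeDI-SAM3 extends SAM3 with three components: relevance-aware memory filtering with a dedicated occlusion-aware memory for storing pre-occlusion frames; a piecewise interpolation scheme that expands the effective memory capacity; and a feature-based re-identification module with temporal voting for reliable post-occlusion identity disambiguation. Together they mitigate error accumulation and enable reliable recovery after occlusions.
Result: Zero-shot evaluations on EndoVis17 and EndoVis18 show absolute mcIoU improvements of around 7% and 16%, respectively, over vanilla SAM3, outperforming even prior training-based approaches, all without additional training.
Conclusion: Memory enhancement addresses key challenges of video segmentation in surgical scenes, especially recovery after occlusion. The work shows that training-free improvements can be effective in complex medical video analysis, provides more reliable segmentation for computer-assisted surgery, and may inform memory design in other video understanding tasks.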
📄 Abstract
Accurate surgical instrument segmentation in endoscopic videos is crucial for computer-assisted interventions, yet remains challenging due to frequent occlusions, rapid motion, specular artefacts, and long-term instrument re-entry. While SAM3 provides a powerful spatio-temporal framework for video object segmentation, its performance in surgical scenes is limited by indiscriminate memory updates, fixed memory capacity, and weak identity recovery after occlusions. We propose ReMeDI-SAM3, a training-free memory-enhanced extension of SAM3, that addresses these limitations through three components: (i) relevance-aware memory filtering with a dedicated occlusion-aware memory for storing pre-occlusion frames, (ii) a piecewise interpolation scheme that expands the effective memory capacity, and (iii) a feature-based re-identification module with temporal voting for reliable post-occlusion identity disambiguation. Together, these components mitigate error accumulation and enable reliable recovery after occlusions. Evaluations on EndoVis17 and EndoVis18 under a zero-shot setting show absolute mcIoU improvements of around 7% and 16%, respectively, over vanilla SAM3, outperforming even prior training-based approaches. Project page: https://valaybundele.github.io/remedi-sam3/.
[11] Alchemist: Unlocking Efficiency in Text-to-Image Model Training via Meta-Gradient Data Selection
Kaixin Ding, Yang Zhou, Xi Chen, Miao Yang, Jiarong Ou, Rui Chen, Xin Tao, Hengshuang Zhao
🧩 TL;DR
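This paper proposes Alchemist, a meta-gradient-based data selection framework that automatically selects a high-quality subset from large-scale text-image pairs to improve the training efficiency and visual quality of text-to-image generative models.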
📘 Detailed Summary
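Motivation: The performance of text-to-image generative models is limited by training data quality. Web-crawled and synthetic image datasets often contain low-quality or redundant samples, leading to degraded visual fidelity, unstable training, and inefficient computation. Existing data selection relies on costly manual curation or heuristic scoring based on single-dimensional features, and meta-learning methods have not yet been adapted to image modalities, so an automatic, scalable selection framework is needed.
Method: Alchemist uses a meta-gradient-based framework that learns to assess the influence of each sample by iteratively optimizing the model from a data-centric perspective. It has two key stages, data rating and data pruning. A lightweight rater is first trained to estimate each sample's influence from gradient information, enhanced with multi-granularity perception; the Shift-Gsampling strategy then selects informative subsets for efficient model training.
Result: Experiments on synthetic and web-crawled datasets show that Alchemist consistently improves visual quality and downstream performance. Training on an Alchemist-selected 50% of the data can outperform training on the full dataset.
Conclusion: Alchemist is described as the first automatic, scalable, meta-gradient-based data selection framework for text-to-image model training. The study shows how strongly data selection affects generative model performance and offers a route to more efficient training of large-scale generative models.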
📄 Abstract
Recent advances in Text-to-Image (T2I) generative models, such as Imagen, Stable Diffusion, and FLUX, have led to remarkable improvements in visual quality. However, their performance is fundamentally limited by the quality of training data. Web-crawled and synthetic image datasets often contain low-quality or redundant samples, which lead to degraded visual fidelity, unstable training, and inefficient computation. Hence, effective data selection is crucial for improving data efficiency. Existing approaches rely on costly manual curation or heuristic scoring based on single-dimensional features in Text-to-Image data filtering. Although meta-learning based method has been explored in LLM, there is no adaptation for image modalities. To this end, we propose Alchemist, a meta-gradient-based framework to select a suitable subset from large-scale text-image data pairs. Our approach automatically learns to assess the influence of each sample by iteratively optimizing the model from a data-centric perspective. Alchemist consists of two key stages: data rating and data pruning. We train a lightweight rater to estimate each sample's influence based on gradient information, enhanced with multi-granularity perception. We then use the Shift-Gsampling strategy to select informative subsets for efficient model training. Alchemist is the first automatic, scalable, meta-gradient-based data selection framework for Text-to-Image model training. Experiments on both synthetic and web-crawled datasets demonstrate that Alchemist consistently improves visual quality and downstream performance. Training on an Alchemist-selected 50% of the data can outperform training on the full dataset.
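The abstract above describes a two-stage pipeline: rate every sample, then prune to a subset. The pruning rule it names, Shift-Gsampling, is not specified there, so the sketch below swaps in plain top-fraction selection purely to illustrate the rate-then-prune flow; the function name, sample ids, and rater scores are hypothetical.

```python
def select_top_fraction(scores, fraction=0.5):
    """Keep the highest-scoring fraction of samples.

    scores: dict mapping sample id -> rater score (higher means more useful).
    NOTE: plain top-k by score, used as a stand-in; this is NOT Shift-Gsampling.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


# Hypothetical rater outputs for four text-image pairs.
rater_scores = {"img_a": 0.9, "img_b": 0.1, "img_c": 0.5, "img_d": 0.7}
kept = select_top_fraction(rater_scores, fraction=0.5)
print(kept)  # the two highest-rated samples
```

A strategy that only keeps the very top scores can over-select near-duplicate "easy" samples, which may be one reason a dedicated sampling scheme is used in place of plain top-k.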
[12] MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning
Yuanchen Ju, Yongyuan Liang, Yen-Jen Wang, Nandiraju Gireesh, Yuanliang Ju, Seungjae Lee, Qiao Gu, Elvis Hsieh, Furong Huang, Koushil Sreenath
🧩 TL;DR
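This paper presents MomaGraph, a unified scene representation for embodied agents that integrates spatial-functional relationships and part-level interactive elements. It also introduces MomaGraph-Scenes, the first large-scale task-driven scene graph dataset, and MomaGraph-Bench, a systematic evaluation suite; the MomaGraph-R1 model built on them reaches 71.6% accuracy on the benchmark.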
📘 Detailed Summary
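Motivation: Prior scene graph work often separates spatial and functional relations, treats scenes as static snapshots without object states or temporal updates, and overlooks the information most relevant to the current task. This limits how well mobile manipulators can navigate and manipulate in households, which calls for a compact, semantically rich, unified scene representation.
Method: The paper proposes MomaGraph, a unified scene representation that integrates spatial-functional relationships and part-level interactive elements; contributes MomaGraph-Scenes, the first large-scale dataset of task-driven scene graphs in household environments; develops MomaGraph-Bench, an evaluation suite spanning six reasoning capabilities from high-level planning to fine-grained scene understanding; and trains MomaGraph-R1, a 7B vision-language model trained with reinforcement learning that predicts task-oriented scene graphs and serves as a zero-shot task planner.
Result: MomaGraph-R1 reaches 71.6% accuracy on the benchmark, +11.4% over the best baseline, the state of the art among open-source models. It also generalizes across public benchmarks and transfers effectively to real-robot experiments.
Conclusion: MomaGraph gives embodied agents a unified scene representation that addresses prior limitations in spatial-functional integration, dynamic state modeling, and task relevance. The large-scale dataset and evaluation suite provide infrastructure for scene graph research, and the Graph-then-Plan framework shows that task-oriented scene graphs are effective for zero-shot planning.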
📄 Abstract
Mobile manipulators in households must both navigate and manipulate. This requires a compact, semantically rich scene representation that captures where objects are, how they function, and which parts are actionable. Scene graphs are a natural choice, yet prior work often separates spatial and functional relations, treats scenes as static snapshots without object states or temporal updates, and overlooks information most relevant for accomplishing the current task. To address these limitations, we introduce MomaGraph, a unified scene representation for embodied agents that integrates spatial-functional relationships and part-level interactive elements. However, advancing such a representation requires both suitable data and rigorous evaluation, which have been largely missing. We thus contribute MomaGraph-Scenes, the first large-scale dataset of richly annotated, task-driven scene graphs in household environments, along with MomaGraph-Bench, a systematic evaluation suite spanning six reasoning capabilities from high-level planning to fine-grained scene understanding. Built upon this foundation, we further develop MomaGraph-R1, a 7B vision-language model trained with reinforcement learning on MomaGraph-Scenes. MomaGraph-R1 predicts task-oriented scene graphs and serves as a zero-shot task planner under a Graph-then-Plan framework. Extensive experiments demonstrate that our model achieves state-of-the-art results among open-source models, reaching 71.6% accuracy on the benchmark (+11.4% over the best baseline), while generalizing across public benchmarks and transferring effectively to real-robot experiments.
[13] SFTok: Bridging the Performance Gap in Discrete Tokenizers
Qihang Rao, Borui Zhang, Wenzhao Zheng, Jie Zhou, Jiwen Lu
🧩 TL;DR
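This paper proposes SFTok, a discrete tokenizer for high-resolution image generation. Through self-forcing guided visual reconstruction and a debias-and-fitting training strategy, it resolves the training-inference inconsistency in the multi-step iterative process and substantially improves image reconstruction quality.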
📘 Detailed Summary
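Motivation: Discrete tokenizers align naturally with the autoregressive paradigm but still lag behind continuous tokenizers, which limits their adoption in multimodal systems. Existing methods suffer from a mismatch between training and inference in the multi-step iterative process, which caps reconstruction quality.
Method: SFTok uses a multi-step iterative mechanism for precise reconstruction. Its core contributions are self-forcing guided visual reconstruction and a debias-and-fitting training strategy, which together resolve the training-inference inconsistency in the multi-step process. Images are represented at a high compression rate of only 64 tokens each.
Result: On ImageNet, SFTok achieves state-of-the-art reconstruction quality (rFID = 1.21) and performs strongly in class-to-image generation (gFID = 2.29). It uses only 64 tokens per image and outperforms existing discrete tokenizers.
Conclusion: By resolving the training-inference inconsistency, SFTok markedly improves discrete tokenizer performance and makes it competitive with continuous tokenizers. The work provides a more efficient discrete representation for high-resolution image generation and supports wider use of discrete tokenizers in multimodal systems.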
📄 Abstract
Recent advances in multimodal models highlight the pivotal role of image tokenization in high-resolution image generation. By compressing images into compact latent representations, tokenizers enable generative models to operate in lower-dimensional spaces, thereby improving computational efficiency and reducing complexity. Discrete tokenizers naturally align with the autoregressive paradigm but still lag behind continuous ones, limiting their adoption in multimodal systems. To address this, we propose SFTok, a discrete tokenizer that incorporates a multi-step iterative mechanism for precise reconstruction. By integrating self-forcing guided visual reconstruction and debias-and-fitting training strategy, SFTok resolves the training-inference inconsistency in multi-step process, significantly enhancing image reconstruction quality. At a high compression rate of only 64 tokens per image, SFTok achieves state-of-the-art reconstruction quality on ImageNet (rFID = 1.21) and demonstrates exceptional performance in class-to-image generation tasks (gFID = 2.29).
[14] Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation
Xin Lin, Meixi Song, Dizhe Zhang, Wenxuan Lu, Haodong Li, Bo Du, Ming-Hsuan Yang, Truong Nguyen, Lu Qi
🧩 TL;DR
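This paper presents a panoramic metric depth foundation model. Using a data-in-the-loop paradigm, it builds a large-scale dataset and a tailored framework that generalize across diverse scene distances, and it shows strong zero-shot generalization on multiple benchmarks.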
📘 Detailed Summary
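Motivation: Depth estimation models for panoramic images face domain gaps between indoor and outdoor scenes and between synthetic and real data, and they generalize poorly across diverse distance scales. This work aims to build a panoramic metric depth foundation model that handles different scene distances in a unified way and is robust in complex real-world environments.
Method: Following a data-in-the-loop paradigm, the authors build a large-scale dataset from public datasets, high-quality synthetic data from a UE5 simulator and text-to-image models, and real panoramic images collected from the web. A three-stage pseudo-label curation pipeline reduces domain gaps. The model uses DINOv3-Large as its backbone and adds a plug-and-play range mask head, sharpness-centric optimization, and geometry-centric optimization to improve robustness to varying distances and enforce geometric consistency across views.
Result: The model shows strong performance and zero-shot generalization on multiple benchmarks, including Stanford2D3D, Matterport3D, and Deep360, with robust and stable metric depth predictions in diverse real-world scenes, particularly across different distance scales.
Conclusion: The study demonstrates the effectiveness of the data-in-the-loop paradigm for panoramic depth estimation, achieving cross-domain generalization through careful data construction and model optimization. The framework offers a new foundation model for panoramic visual understanding, and its plug-and-play components and geometric consistency constraints are a useful reference for future work.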
📄 Abstract
In this work, we present a panoramic metric depth foundation model that generalizes across diverse scene distances. We explore a data-in-the-loop paradigm from the view of both data construction and framework design. We collect a large-scale dataset by combining public datasets, high-quality synthetic data from our UE5 simulator and text-to-image models, and real panoramic images from the web. To reduce domain gaps between indoor/outdoor and synthetic/real data, we introduce a three-stage pseudo-label curation pipeline to generate reliable ground truth for unlabeled images. For the model, we adopt DINOv3-Large as the backbone for its strong pre-trained generalization, and introduce a plug-and-play range mask head, sharpness-centric optimization, and geometry-centric optimization to improve robustness to varying distances and enforce geometric consistency across views. Experiments on multiple benchmarks (e.g., Stanford2D3D, Matterport3D, and Deep360) demonstrate strong performance and zero-shot generalization, with particularly robust and stable metric predictions in diverse real-world scenes. The project page can be found at: https://insta360-research-team.github.io/DAP_website/
[15] AdaTooler-V: Adaptive Tool-Use for Images and Videos
Chaoyang Wang, Kaituo Feng, Dongyang Chen, Zhongyu Wang, Zhixun Li, Sicheng Gao, Meng Meng, Xu Zhou, Manyuan Zhang, Yuzhang Shang, Xiangyu Yue
🧩 TL;DR
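This paper presents AdaTooler-V, a multimodal large language model that adaptively decides when to use vision tools. With the AT-GRPO reinforcement learning algorithm and large-scale training datasets, it reduces the overhead of unnecessary tool calls and improves visual reasoning performance.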
📘 Detailed Summary
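Motivation: Existing open-source MLLMs often use tools blindly in visual reasoning, invoking vision tools even when they are unnecessary. This significantly increases inference overhead and degrades model performance, so a method is needed that adaptively decides when tools are truly required.
Method: AdaTooler-V is trained with AT-GRPO, a reinforcement learning algorithm that adaptively adjusts reward scales based on each sample's Tool Benefit Score, encouraging the model to invoke tools only when they provide genuine improvements. Two training datasets are constructed: AdaTooler-V-CoT-100k for the SFT cold start and AdaTooler-V-300k for reinforcement learning, covering single-image, multi-image, and video data.
Result: Experiments across twelve benchmarks demonstrate strong reasoning capability, with AdaTooler-V outperforming existing methods on diverse visual reasoning tasks. Notably, AdaTooler-V-7B reaches 89.8% accuracy on the high-resolution benchmark V*, surpassing the commercial proprietary models GPT-4o and Gemini 1.5 Pro.
Conclusion: The study demonstrates the importance of adaptive tool use in MLLMs: cutting unnecessary tool calls improves both performance and efficiency. AT-GRPO offers a new approach to reinforcement learning for multimodal tasks, and the released code, models, and data should support further work in the area.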
📄 Abstract
Recent advances have shown that multimodal large language models (MLLMs) benefit from multimodal interleaved chain-of-thought (CoT) with vision tool interactions. However, existing open-source models often exhibit blind tool-use reasoning patterns, invoking vision tools even when they are unnecessary, which significantly increases inference overhead and degrades model performance. To this end, we propose AdaTooler-V, an MLLM that performs adaptive tool-use by determining whether a visual problem truly requires tools. First, we introduce AT-GRPO, a reinforcement learning algorithm that adaptively adjusts reward scales based on the Tool Benefit Score of each sample, encouraging the model to invoke tools only when they provide genuine improvements. Moreover, we construct two datasets to support training: AdaTooler-V-CoT-100k for SFT cold start and AdaTooler-V-300k for RL with verifiable rewards across single-image, multi-image, and video data. Experiments across twelve benchmarks demonstrate the strong reasoning capability of AdaTooler-V, outperforming existing methods in diverse visual reasoning tasks. Notably, AdaTooler-V-7B achieves an accuracy of 89.8% on the high-resolution benchmark V*, surpassing the commercial proprietary model GPT-4o and Gemini 1.5 Pro. All code, models, and data are released.
[16] The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text
Hanlin Wang, Hao Ouyang, Qiuyu Wang, Yue Yu, Yihao Meng, Wen Wang, Ka Leong Cheng, Shuailei Ma, Qingyan Bai, Yixuan Li, Cheng Chen, Yanhong Zeng, Xing Zhu, Yujun Shen, Qifeng Chen
🧩 TL;DR
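WorldCanvas is a multimodal framework for promptable world events. By combining text, trajectories, and reference images, it enables rich, user-directed simulation and turns world models from passive predictors into interactive, user-shaped simulators.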
📘 Detailed Summary
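Motivation: Existing approaches are limited: text-only methods lack expressiveness, and existing trajectory-controlled image-to-video methods lack semantic intent and visual grounding. The work aims to fill this gap in multimodal world event generation and to enable richer, user-controllable simulation of complex scenarios such as multi-agent interactions, object entry and exit, reference-guided appearance, and counterintuitive events.
Method: WorldCanvas takes a multimodal approach that combines trajectories (encoding motion, timing, and visibility) with natural language (semantic intent) and reference images (visual grounding of object identity). Trajectories control motion patterns, language guides semantic content, and reference images keep object appearance consistent, so that the framework generates coherent, controllable event sequences.
Result: The generated videos show not only temporal coherence but also emergent consistency, preserving object identity and scene continuity when objects temporarily disappear. The framework supports complex world events, including multi-agent interactions, object entry and exit, reference-guided appearance, and counterintuitive events, going beyond the expressiveness of existing methods.
Conclusion: By supporting expressive world event generation, WorldCanvas advances world models from passive predictors to interactive, user-shaped simulators. The work opens new directions for controllable video generation and multimodal simulation, enables richer and more flexible world modeling, and lays groundwork for future interactive AI systems.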
📄 Abstract
We present WorldCanvas, a framework for promptable world events that enables rich, user-directed simulation by combining text, trajectories, and reference images. Unlike text-only approaches and existing trajectory-controlled image-to-video methods, our multimodal approach combines trajectories -- encoding motion, timing, and visibility -- with natural language for semantic intent and reference images for visual grounding of object identity, enabling the generation of coherent, controllable events that include multi-agent interactions, object entry/exit, reference-guided appearance and counterintuitive events. The resulting videos demonstrate not only temporal coherence but also emergent consistency, preserving object identity and scene despite temporary disappearance. By supporting expressive world events generation, WorldCanvas advances world models from passive predictors to interactive, user-shaped simulators. Our project page is available at: https://worldcanvas.github.io/.
cs.CL
[17] Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image
Yushi Hu, Reyhane Askari-Hemmat, Melissa Hall, Emily Dinan, Luke Zettlemoyer, Marjan Ghazvininejad
🧩 TL;DR
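This work introduces MMRB2, the first comprehensive benchmark for multimodal reward models, covering four tasks: image generation, image editing, interleaved generation, and reasoning. The authors use it to evaluate existing multimodal judges and find a substantial gap between current reward models and human experts.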
📘 Detailed Summary
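Motivation: Reward models are essential for training large language models, but they remain underexplored for omni models that handle interleaved image and text sequences. No comprehensive benchmark exists for multimodal understanding and generation tasks, so the performance of multimodal reward models cannot be measured effectively.
Method: The authors build Multimodal RewardBench 2, which covers four tasks: text-to-image generation, image editing, interleaved generation, and multimodal reasoning. Each task contains 1,000 expert-annotated preference pairs, drawn from the responses of 23 models and agents across 21 source tasks. An ensemble filtering strategy ensures that the preference pairs reflect strong human-expert consensus.
Result: The latest Gemini 3 Pro reaches 75-80% accuracy on the benchmark. GPT-5 and Gemini 2.5 Pro reach 66-75%, well above the widely used GPT-4o at 59%. The best open-source model, Qwen3-VL-32B, performs on par with Gemini 2.5 Flash at 64%. All of these remain far below human experts, who exceed 90% accuracy.
Conclusion: MMRB2 exposes the performance gap between current multimodal reward models and human expert judgment. Performance on the benchmark correlates strongly with downstream task success, and the analysis points to key directions for improving reward models, particularly preference modeling for multimodal understanding and generation tasks.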
📄 Abstract
Reward models (RMs) are essential for training large language models (LLMs), but remain underexplored for omni models that handle interleaved image and text sequences. We introduce Multimodal RewardBench 2 (MMRB2), the first comprehensive benchmark for reward models on multimodal understanding and (interleaved) generation. MMRB2 spans four tasks: text-to-image, image editing, interleaved generation, and multimodal reasoning ("thinking-with-images"), providing 1,000 expert-annotated preference pairs per task from 23 models and agents across 21 source tasks. MMRB2 is designed with: (1) practical but challenging prompts; (2) responses from state-of-the-art models and agents; and (3) preference pairs with strong human-expert consensus, curated via an ensemble filtering strategy. Using MMRB2, we study existing judges for each subtask, including multimodal LLM-as-a-judge and models trained with human preferences. The latest Gemini 3 Pro attains 75-80% accuracy. GPT-5 and Gemini 2.5 Pro reach 66-75% accuracy, compared to >90% for humans, yet surpass the widely used GPT-4o (59%). The best performing open-source model Qwen3-VL-32B achieves similar accuracies as Gemini 2.5 Flash (64%). We also show that MMRB2 performance strongly correlates with downstream task success using Best-of-N sampling and conduct an in-depth analysis that shows key areas to improve the reward models going forward.
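The two quantities this entry relies on, judge accuracy on preference pairs and Best-of-N selection, can be sketched in a few lines of Python. This is an illustrative sketch rather than the MMRB2 evaluation code: the function names, the `(chosen, rejected)` tuple layout, and the toy length-based scorer are assumptions, and a real reward model would also condition on the prompt and any images.

```python
from typing import Callable, Sequence, Tuple

# Assumed interface: a scorer maps a response to a scalar reward.
Scorer = Callable[[str], float]


def pairwise_accuracy(pairs: Sequence[Tuple[str, str]], score: Scorer) -> float:
    """Fraction of (chosen, rejected) pairs where the scorer agrees with the
    human preference, i.e. gives the chosen response the higher reward."""
    if not pairs:
        raise ValueError("need at least one preference pair")
    hits = sum(1 for chosen, rejected in pairs if score(chosen) > score(rejected))
    return hits / len(pairs)


def best_of_n(candidates: Sequence[str], score: Scorer) -> str:
    """Best-of-N sampling: return the candidate with the highest reward."""
    return max(candidates, key=score)


if __name__ == "__main__":
    # Toy scorer for demonstration only: longer responses score higher.
    toy_score: Scorer = lambda response: float(len(response))
    pairs = [("aaa", "a"), ("bb", "b"), ("c", "cc"), ("dddd", "d")]
    print(pairwise_accuracy(pairs, toy_score))        # 3 of 4 pairs agree
    print(best_of_n(["a", "abc", "ab"], toy_score))   # picks "abc"
```

The accuracy figures quoted in the abstract are of the first kind; the reported correlation with downstream success uses the second, with the judge under test acting as the scorer.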