cs.CV

[1] Leveraging Text Guidance for Enhancing Demographic Fairness in Gender Classification

Anoop Krishnan

🧩 TL;DR

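This paper proposes two text-guided methods for improving the fairness of gender classification from facial images: Image Text Matching (ITM) guidance and image-text fusion. Both use semantic information from image captions to improve generalization and reduce demographic bias.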


📘 Detailed Summary

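Motivation: The study addresses fairness problems, in particular demographic bias, in gender classification from facial images. Current methods perform unevenly across gender and racial groups, so new methods are needed that reduce these disparities and make the algorithms fairer.

Method: The paper proposes two core methods. ITM guidance trains the model to learn fine-grained alignments between images and text, yielding enhanced multimodal representations. Image-text fusion combines the two modalities into a comprehensive representation to improve fairness. Neither method requires demographic labels, and both are application agnostic.

Result: Extensive experiments on benchmark datasets show that the proposed methods mitigate bias and improve classification accuracy across gender and racial groups compared with existing methods, and confirm that semantic information helps reduce performance disparities.

Conclusion: The study offers valuable insights for developing fairer facial analysis algorithms and shows the potential of text guidance for improving the fairness of computer vision systems. It also lays groundwork for an interpretable and intuitive training paradigm and contributes to addressing demographic bias in gender classification from facial images.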


📄 Abstract

In the quest for fairness in artificial intelligence, this work presents novel text-guided approaches to enhancing fairness in facial-image-based gender classification algorithms. The core methodology leverages semantic information from image captions during model training to improve generalization. Two key strategies are presented: Image Text Matching (ITM) guidance and image-text fusion. ITM guidance trains the model to discern fine-grained alignments between images and texts to obtain enhanced multimodal representations. Image-text fusion combines both modalities into comprehensive representations for improved fairness. Extensive experiments on benchmark datasets demonstrate that these approaches effectively mitigate bias and improve accuracy across gender-racial groups compared to existing methods. Additionally, the integration of textual guidance underscores an interpretable and intuitive training paradigm for computer vision systems. By scrutinizing the extent to which semantic information reduces disparities, this research offers valuable insights into cultivating more equitable facial analysis algorithms. The proposed methodologies contribute to addressing the pivotal challenge of demographic bias in gender classification from facial images. Furthermore, the technique operates without demographic labels and is application agnostic.

[2] Image Tiling for High-Resolution Reasoning: Balancing Local Detail with Global Context

Anatole Jacquin de Margerie, Alexis Roger, Irina Rish

🧩 TL;DR

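This study reproduces and extends the analysis of the Monkey vision-language model published in CVPR24. It confirms that image tiling is effective for high-resolution image understanding and examines the effect of integrating global context, offering practical insights for high-resolution multimodal modeling.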


📘 Detailed Summary

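Motivation: Reproducibility is a cornerstone of scientific progress, yet complex multimodal models often lack transparent implementation details and accessible training infrastructure. This study presents a detailed reproduction and critical analysis of the Monkey vision-language model to validate its approach to high-resolution image understanding and to explore directions for improvement.

Method: The study reproduces the training pipeline of the Monkey vision-language model using open checkpoints. The model splits large images into tiles to recover fine-grained visual detail while remaining computationally efficient. The study then extends this work by investigating the effect of including global context.

Result: The study confirms the key finding of the original Monkey VLM work, namely that tiling effectively recovers local details, but it also reports deviations in the results whose magnitude depends heavily on task type and tile granularity. The analysis of global-context integration provides practical insights for high-resolution multimodal modeling.

Conclusion: Image tiling is an effective approach to high-resolution visual understanding, but performance is sensitive to task characteristics and tiling parameters, and the integration of global context has a substantial effect. These findings offer practical guidance and research directions for the design of future high-resolution multimodal models.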


📄 Abstract

Reproducibility remains a cornerstone of scientific progress, yet complex multimodal models often lack transparent implementation details and accessible training infrastructure. In this work, we present a detailed reproduction and critical analysis of the Monkey Vision-Language Model (VLM) (Li et al. 2023b) published in CVPR24, a recent approach to high-resolution image understanding via image tiling. The original paper proposed splitting large images into tiles to recover fine-grained visual details while maintaining computational efficiency. Our study replicates this strategy using open checkpoints and reimplements the training pipeline. We confirm the key finding of the original Monkey VLM work, namely that tiling effectively recovers local details. We then extend this work by investigating the effect of including the global context, which provides practical insights for future high-resolution multimodal modeling. However, we also report deviations in the results, with the magnitude of these effects depending heavily on task type and tile granularity.
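
The tiling idea in this abstract can be illustrated with a minimal sketch. It only computes crop boxes for a fixed tile size and appends one full-image box as the global view. The function names, the 448-pixel tile size, and the clipping of edge tiles are assumptions made for illustration; the abstract does not specify how Monkey or the reproduction handles these details.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), hypothetical layout


def tile_boxes(width: int, height: int, tile: int) -> List[Box]:
    """Return crop boxes covering a width x height image with tile x tile windows.

    Edge tiles are clipped to the image border, so they may be smaller.
    """
    if tile <= 0:
        raise ValueError("tile must be positive")
    boxes = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            boxes.append((left, top, min(left + tile, width), min(top + tile, height)))
    return boxes


def views_with_global(width: int, height: int, tile: int) -> List[Box]:
    """Local tiles plus one box spanning the whole image (the global view)."""
    return tile_boxes(width, height, tile) + [(0, 0, width, height)]


if __name__ == "__main__":
    # An 896 x 1344 image with 448-pixel tiles gives a 2 x 3 grid plus the global view.
    for box in views_with_global(896, 1344, 448):
        print(box)
```

In a real pipeline each box would be cropped and resized to the encoder's input resolution; the global view trades detail for context, which is the trade-off the study examines.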

[3] Few-Shot VLM-Based G-Code and HMI Verification in CNC Machining

Yasaman Hashem Pour, Nazanin Mahjourian, Vinh Nguyen

🧩 TL;DR

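This paper proposes a few-shot verification approach based on a vision-language model (VLM) that simultaneously evaluates CNC G-code and the HMI display for errors and safety status, addressing the inability of conventional LLM-based methods to process visual information.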


📘 Detailed Summary

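Motivation: Conventional LLM-based G-code verification focuses mainly on detecting programming errors, but CNC machining requires extensive use and knowledge of the Human-Machine Interface (HMI), which displays machine status and errors. Because LLMs cannot access the vision modality, current methods cannot leverage HMI knowledge, which limits how comprehensive G-code verification can be.

Method: The paper proposes a few-shot VLM verification approach. The input dataset contains paired G-code text and associated HMI screenshots from a 15-slant-PRO lathe, covering both correct and error-prone cases. To enable few-shot learning, the VLM is given a structured JSON schema based on prior heuristic knowledge, and instances of G-code and HMI that either contain errors or are error free serve as few-shot examples to guide the model.

Result: The model was compared with a zero-shot VLM across multiple scenarios of incorrect G-code and HMI errors, using per-slot accuracy as the metric. Few-shot prompting markedly improved HMI error detection and the identification of discrepancies between the G-code and the HMI display, enabling more comprehensive debugging.

Conclusion: The proposed framework is shown to be suitable for verifying the manually generated G-code typically developed in CNC training, providing a more comprehensive verification method for learning CNC machine operation. By incorporating visual information, the approach overcomes the limitations of conventional text-only methods and opens new avenues for industrial training and safety verification.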


📄 Abstract

Manual generation of G-code is important for learning the operation of CNC machines. Prior work in G-code verification uses Large Language Models (LLMs), which primarily examine errors in the written program. However, CNC machining requires extensive use and knowledge of the Human-Machine Interface (HMI), which displays machine status and errors. LLMs currently cannot leverage knowledge of HMIs because they cannot access the vision modality. This paper proposes a few-shot VLM-based verification approach that simultaneously evaluates the G-code and the HMI display for errors and safety status. The input dataset includes paired G-code text and associated HMI screenshots from a 15-slant-PRO lathe, including both correct and error-prone cases. To enable few-shot learning, the VLM is provided with a structured JSON schema based on prior heuristic knowledge. After the prompts are determined, instances of G-code and HMI that either contain errors or are error free are used as few-shot examples to guide the VLM. The model was then compared with a zero-shot VLM across multiple scenarios of incorrect G-code and HMI errors with respect to per-slot accuracy. The results showed that few-shot prompting improved overall detection of HMI errors and of discrepancies with the G-code, enabling more comprehensive debugging. The proposed framework was therefore demonstrated to be suitable for verifying the manually generated G-code that is typically developed in CNC training.
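
The abstract evaluates structured outputs by per-slot accuracy but does not define the metric or list the schema fields. A minimal sketch, assuming each prediction and reference is a flat JSON-like dictionary and that a slot counts as correct only on an exact match; the slot names `gcode_error` and `hmi_alarm` are hypothetical placeholders, not the paper's schema:

```python
from typing import Dict, List


def per_slot_accuracy(predictions: List[Dict[str, str]],
                      references: List[Dict[str, str]]) -> Dict[str, float]:
    """Fraction of examples in which each slot exactly matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    slots = sorted({slot for ref in references for slot in ref})
    accuracy = {}
    for slot in slots:
        # Only score examples whose reference defines this slot;
        # a slot missing from the prediction counts as wrong.
        scored = [(pred.get(slot), ref[slot])
                  for pred, ref in zip(predictions, references) if slot in ref]
        accuracy[slot] = sum(p == r for p, r in scored) / len(scored)
    return accuracy


if __name__ == "__main__":
    # Hypothetical slot names and values, for illustration only.
    references = [
        {"gcode_error": "none", "hmi_alarm": "none"},
        {"gcode_error": "missing_feed_rate", "hmi_alarm": "door_open"},
    ]
    predictions = [
        {"gcode_error": "none", "hmi_alarm": "door_open"},
        {"gcode_error": "missing_feed_rate", "hmi_alarm": "door_open"},
    ]
    print(per_slot_accuracy(predictions, references))
```

Scoring each slot separately shows whether a model fails on the G-code fields, the HMI fields, or both, which a single overall accuracy would hide.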

[4] Synthetic Vasculature and Pathology Enhance Vision-Language Model Reasoning

Chenjun Li, Cheng Wan, Laurin Lux, Alexander Berger, Richard B. Rosen, Martin J. Menten, Johannes C. Paetzold

🧩 TL;DR

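This study proposes the Synthetic Vasculature Reasoning (SVR) framework, which controllably synthesizes retinal vasculature images and corresponding text to address the scarcity of high-quality annotated data for training medical vision-language models. It also builds the OCTA-100K-SVR dataset, which significantly improves diagnostic performance and explanation quality on OCTA images.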


📘 Detailed Summary

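Motivation: Vision-language models hold promise for medical diagnosis, but training them requires large-scale, high-quality image-text pairs. In many specialized domains, such as Optical Coherence Tomography Angiography (OCTA), precise text annotations with pathological detail are scarce or nonexistent, which limits progress in clinical explanation and cross-modal reasoning.

Method: The study proposes the Synthetic Vasculature Reasoning framework, which controllably synthesizes retinal vasculature images with diabetic retinopathy features, including capillary dropout, microaneurysms, neovascularization, and tortuosity, while automatically generating granular reasoning text. On this basis it builds OCTA-100K-SVR, a dataset of 100,000 image-reasoning pairs.

Result: A general-purpose vision-language model trained on OCTA-100K-SVR achieves a zero-shot balanced classification accuracy of 89.67% on real OCTA images, outperforming supervised baselines. Human expert evaluation confirms that the model significantly improves explanation quality and pathology localization on clinical data.

Conclusion: The study shows that generating high-quality training data through controllable synthesis is an effective way to address data scarcity for medical vision-language models. The SVR framework improves model performance and also makes clinical explanations more trustworthy, offering a new data-generation paradigm for developing AI diagnostic systems in specialized medical domains.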


📄 Abstract

Vision-Language Models (VLMs) offer a promising path toward interpretable medical diagnosis by allowing users to ask for clinical explanations alongside predictions and across different modalities. However, training VLMs for detailed reasoning requires large-scale image-text datasets. In many specialized domains, for example in reading Optical Coherence Tomography Angiography (OCTA) images, such precise text with grounded descriptions of pathologies is scarce or even non-existent. To overcome this bottleneck, we introduce Synthetic Vasculature Reasoning (SVR), a framework that controllably synthesizes images and corresponding text: realistic retinal vasculature with Diabetic Retinopathy (DR) features (capillary dropout, microaneurysms, neovascularization, and tortuosity), paired with automatically generated granular reasoning texts. Based on this we curate OCTA-100K-SVR, an OCTA image-reasoning dataset with 100,000 pairs. Our experiments show that a general-purpose VLM (Qwen3-VL-8b) trained on the dataset achieves a zero-shot balanced classification accuracy of 89.67% on real OCTA images, outperforming supervised baselines. Through human expert evaluation we also demonstrate that it significantly enhances explanation quality and pathology localization on clinical data.
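
The headline number above is a balanced classification accuracy. As a reminder of what that metric measures, here is a minimal sketch using the standard definition (the mean of per-class recall). The class names and counts are made up for illustration and are not taken from the paper.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def balanced_accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Mean of per-class recall, so every class counts equally regardless of size."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == pred)
    if not total:
        raise ValueError("no samples given")
    return sum(correct[c] / total[c] for c in total) / len(total)


if __name__ == "__main__":
    # Illustrative, imbalanced toy labels: 2 diseased, 8 healthy.
    y_true = ["dr", "dr"] + ["healthy"] * 8
    y_pred = ["dr", "healthy"] + ["healthy"] * 8
    plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    print("plain accuracy:", plain)
    print("balanced accuracy:", balanced_accuracy(y_true, y_pred))
```

On imbalanced clinical data, plain accuracy can look high even when the minority class is often missed; balanced accuracy penalizes that directly.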

[5] Boosting Skeleton-based Zero-Shot Action Recognition with Training-Free Test-Time Adaptation

Jingmin Zhu, Anqi Zhu, Hossein Rahmani, Jun Liu, Mohammed Bennamoun, Qiuhong Ke

🧩 TL;DR

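This paper presents Skeleton-Cache, the first training-free test-time adaptation framework for skeleton-based zero-shot action recognition. By reformulating inference as a lightweight retrieval process and leveraging the semantic reasoning of large language models, it significantly improves generalization to unseen actions.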


📘 Detailed Summary

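Motivation: Existing skeleton-based zero-shot action recognition methods struggle to adapt to unseen actions at inference time and lack an effective test-time adaptation mechanism, which limits generalization. This study addresses that limitation with the first training-free test-time adaptation framework, aiming to improve recognition of unseen actions.

Method: Skeleton-Cache reformulates inference as a lightweight retrieval process over a non-parametric cache that stores structured skeleton representations combining global and fine-grained local descriptors. It uses the semantic reasoning capabilities of large language models to assign class-specific importance weights that guide the fusion of descriptor-wise predictions, adapting dynamically to unseen actions without additional training or access to training data.

Result: Extensive experiments on NTU RGB+D 60/120 and PKU-MMD II show that Skeleton-Cache consistently improves the performance of various SZAR backbones under both zero-shot and generalized zero-shot settings. The framework markedly strengthens generalization to unseen actions, supporting its effectiveness and robustness.

Conclusion: Skeleton-Cache provides the first training-free test-time adaptation solution for skeleton-based zero-shot action recognition. By combining structured skeleton representations with LLM-guided semantic priors, it adapts effectively to unseen actions and opens a new direction for test-time adaptation research in action recognition.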


📄 Abstract

We introduce Skeleton-Cache, the first training-free test-time adaptation framework for skeleton-based zero-shot action recognition (SZAR), aimed at improving model generalization to unseen actions during inference. Skeleton-Cache reformulates inference as a lightweight retrieval process over a non-parametric cache that stores structured skeleton representations, combining both global and fine-grained local descriptors. To guide the fusion of descriptor-wise predictions, we leverage the semantic reasoning capabilities of large language models (LLMs) to assign class-specific importance weights. By integrating these structured descriptors with LLM-guided semantic priors, Skeleton-Cache dynamically adapts to unseen actions without any additional training or access to training data. Extensive experiments on NTU RGB+D 60/120 and PKU-MMD II demonstrate that Skeleton-Cache consistently boosts the performance of various SZAR backbones under both zero-shot and generalized zero-shot settings. The code is publicly available at https://github.com/Alchemist0754/Skeleton-Cache.
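
The retrieval-and-fusion idea can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the released implementation: the cache layout, the use of the best cosine match per class, the additive fusion rule, and the descriptor names `global` and `legs` are all hypothetical, and the class-specific weights that the paper obtains from an LLM are simply passed in as a dictionary.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def predict(query: Dict[str, Vector],
            cache: Dict[str, List[Tuple[Vector, str]]],
            weights: Dict[str, Dict[str, float]]) -> str:
    """Fuse descriptor-wise retrieval scores using class-specific weights.

    cache maps a descriptor name to (feature, class label) entries;
    weights[label][descriptor] defaults to 1.0 when not given.
    """
    fused: Dict[str, float] = {}
    for descriptor, q in query.items():
        # Best-matching cached entry per class for this descriptor.
        best: Dict[str, float] = {}
        for feature, label in cache.get(descriptor, []):
            best[label] = max(best.get(label, -1.0), cosine(q, feature))
        for label, score in best.items():
            w = weights.get(label, {}).get(descriptor, 1.0)
            fused[label] = fused.get(label, 0.0) + w * score
    if not fused:
        raise ValueError("cache has no entries for the query descriptors")
    return max(fused, key=fused.get)


if __name__ == "__main__":
    cache = {
        "global": [([1.0, 0.0], "wave"), ([0.0, 1.0], "kick")],
        "legs": [([1.0, 0.0], "wave"), ([0.0, 1.0], "kick")],
    }
    query = {"global": [1.0, 0.2], "legs": [0.1, 1.0]}
    print(predict(query, cache, weights={}))
```

No parameters are trained here: changing the weights or adding cache entries changes the prediction, which is the sense in which such a scheme is training-free.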

[6] VDAWorld: World Modelling via VLM-Directed Abstraction and Simulation

Felix O'Mahony, Roberto Cipolla, Ayush Tewari

🧩 TL;DR

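This paper proposes VDAWorld, a framework that distills an image-text pair into a tractable abstract representation and uses a vision-language model as an intelligent agent to coordinate vision tools and physics simulators, yielding a versatile world model that produces high-quality simulations of dynamic scenes.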


📘 Detailed Summary

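Motivation: Generative video models, the leading approach to world modeling, have fundamental limitations: they violate physical and logical rules, lack interactivity, and operate as opaque black boxes that are ill-suited to building structured, queryable worlds. This study aims to overcome these challenges by exploring a new paradigm for world modeling.

Method: The proposed VDAWorld framework distills an image-text pair into a tractable abstract representation optimized for simulation. A vision-language model acts as an intelligent agent that orchestrates the whole process: it autonomously constructs a grounded (2D or 3D) scene representation using a suite of vision tools and selects a compatible physics simulator (e.g., rigid body, fluid) to act on the scene. The framework can then infer latent dynamics from the static scene to predict plausible future states.

Result: Experiments show that combining intelligent abstraction with adaptive simulation yields a versatile world model capable of producing high-quality simulations across a wide range of dynamic scenarios.

Conclusion: The study demonstrates the potential of a new paradigm that combines intelligent abstraction with adaptive physics simulation. It offers an effective path to world models that are structured, queryable, and consistent with physical rules, and it advances the shift from black-box generative models toward interpretable, interactive world representations.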


📄 Abstract

Generative video models, a leading approach to world modeling, face fundamental limitations. They often violate physical and logical rules, lack interactivity, and operate as opaque black boxes ill-suited for building structured, queryable worlds. To overcome these challenges, we propose a new paradigm focused on distilling an image-caption pair into a tractable, abstract representation optimized for simulation. We introduce VDAWorld, a framework where a Vision-Language Model (VLM) acts as an intelligent agent to orchestrate this process. The VLM autonomously constructs a grounded (2D or 3D) scene representation by selecting from a suite of vision tools, and accordingly chooses a compatible physics simulator (e.g., rigid body, fluid) to act upon it. VDAWorld can then infer latent dynamics from the static scene to predict plausible future states. Our experiments show that this combination of intelligent abstraction and adaptive simulation results in a versatile world model capable of producing high-quality simulations across a wide range of dynamic scenarios.

[7] Exploring MLLM-Diffusion Information Transfer with MetaCanvas

Han Lin, Xichen Pan, Ziqi Huang, Ji Hou, Jialiang Wang, Weifeng Chen, Zecheng He, Felix Juefei-Xu, Junzhe Sun, Zhipeng Fan, Ali Thabet, Mohit Bansal, Chu Wang

🧩 TL;DR

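This paper proposes MetaCanvas, a framework that treats multimodal large language models as latent-space planners, letting them reason and plan directly in spatial and spatiotemporal latent spaces and thereby narrowing the gap between multimodal understanding and generation.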


📘 Detailed Summary

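Motivation: In visual generation, current multimodal large language models are typically reduced to global text encoders for diffusion models, leaving their strong reasoning and planning abilities underused. As a result, these models can parse complex layouts and knowledge-intensive scenes yet struggle to generate images or videos with equally precise and structured control, creating a gap between multimodal understanding and generation.

Method: MetaCanvas is a lightweight framework that lets multimodal large language models reason and plan directly in spatial and spatiotemporal latent spaces and interface tightly with diffusion generators. It is implemented on three different diffusion backbones and supports text-to-image generation, text/image-to-video generation, image/video editing, and in-context video generation.

Result: Across six tasks that require precise layouts, robust attribute binding, and reasoning-intensive control, including text-to-image generation, text/image-to-video generation, image/video editing, and in-context video generation, MetaCanvas consistently outperforms global-conditioning baselines.

Conclusion: Treating multimodal large language models as latent-space planners is a promising direction for narrowing the gap between multimodal understanding and generation. By making fuller use of the models' reasoning and planning abilities, MetaCanvas offers a new approach to precise and structured visual generation.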


📄 Abstract

Multimodal learning has rapidly advanced visual understanding, largely via multimodal large language models (MLLMs) that use powerful LLMs as cognitive cores. In visual generation, however, these powerful core models are typically reduced to global text encoders for diffusion models, leaving most of their reasoning and planning ability unused. This creates a gap: current multimodal LLMs can parse complex layouts, attributes, and knowledge-intensive scenes, yet struggle to generate images or videos with equally precise and structured control. We propose MetaCanvas, a lightweight framework that lets MLLMs reason and plan directly in spatial and spatiotemporal latent spaces and interface tightly with diffusion generators. We empirically implement MetaCanvas on three different diffusion backbones and evaluate it across six tasks, including text-to-image generation, text/image-to-video generation, image/video editing, and in-context video generation, each requiring precise layouts, robust attribute binding, and reasoning-intensive control. MetaCanvas consistently outperforms global-conditioning baselines, suggesting that treating MLLMs as latent-space planners is a promising direction for narrowing the gap between multimodal understanding and generation.

[8] HFS: Holistic Query-Aware Frame Selection for Efficient Video Reasoning

Yiqing Yang, Kin-Man Lam

🧩 TL;DR

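This paper proposes an end-to-end trainable, task-adaptive key frame selection framework that improves frame selection for video understanding through dynamic query generation, set-level optimization, and student-teacher mutual learning. It outperforms existing methods on multiple benchmarks and eliminates reliance on static pseudo labels.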


📘 Detailed Summary

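Motivation: Conventional key frame selection has two main limitations. First, top-K selection based on independent scoring cannot optimize the selection as a whole, so the chosen frames tend to be temporally clustered and visually redundant. Second, training lightweight selectors on pseudo labels generated offline by multimodal large language models prevents the supervisory signal from adapting dynamically to task objectives, which limits selector performance.

Method: The paper proposes an end-to-end trainable, task-adaptive frame selection framework. First, a Chain-of-Thought approach guides a small language model to generate task-specific implicit query vectors, which are combined with multimodal features for dynamic frame scoring. Second, a continuous set-level objective incorporating relevance, coverage, and redundancy is optimized differentiably via Gumbel-Softmax to select the best frame combination. Finally, student-teacher mutual learning aligns the frame importance distributions of the student selector (SLM) and the teacher reasoner (MLLM) via KL divergence; combined with a cross-entropy loss, this enables end-to-end optimization and eliminates reliance on static pseudo labels.

Result: Experiments on multiple benchmarks, including Video-MME, LongVideoBench, MLVU, and NExT-QA, show that the method significantly outperforms existing approaches, supporting the effectiveness of the task-adaptive framework for key frame selection.

Conclusion: The study proposes an end-to-end, task-adaptive key frame selection framework that addresses the limitations of conventional methods through dynamic query generation, set-level optimization, and student-teacher mutual learning. It removes the dependence on static pseudo labels and selects frame combinations adaptively according to the task objective, offering a new solution to frame selection in video understanding.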


📄 Abstract

Key frame selection in video understanding presents significant challenges. Traditional top-K selection methods, which score frames independently, often fail to optimize the selection as a whole. This independent scoring frequently results in selecting frames that are temporally clustered and visually redundant. Additionally, training lightweight selectors using pseudo labels generated offline by Multimodal Large Language Models (MLLMs) prevents the supervisory signal from dynamically adapting to task objectives. To address these limitations, we propose an end-to-end trainable, task-adaptive framework for frame selection. A Chain-of-Thought approach guides a Small Language Model (SLM) to generate task-specific implicit query vectors, which are combined with multimodal features to enable dynamic frame scoring. We further define a continuous set-level objective function that incorporates relevance, coverage, and redundancy, enabling differentiable optimization via Gumbel-Softmax to select optimal frame combinations at the set level. Finally, student-teacher mutual learning is employed, where the student selector (SLM) and teacher reasoner (MLLM) are trained to align their frame importance distributions via KL divergence. Combined with cross-entropy loss, this enables end-to-end optimization, eliminating reliance on static pseudo labels. Experiments across various benchmarks, including Video-MME, LongVideoBench, MLVU, and NExT-QA, demonstrate that our method significantly outperforms existing approaches.
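
To see why set-level scoring differs from independent top-K selection, consider a toy sketch. The abstract names relevance, coverage, and redundancy as the ingredients of the objective but gives no formula, so the definitions below, the equal weighting, and the similarity function are all assumptions. The paper optimizes its objective differentiably with Gumbel-Softmax; this sketch swaps in exhaustive search over small sets purely for illustration.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def set_score(frames: Sequence[int], relevance: List[float],
              similarity: List[List[float]],
              lam_cov: float = 1.0, lam_red: float = 1.0) -> float:
    """Toy set-level objective: relevance + coverage - redundancy."""
    rel = sum(relevance[i] for i in frames)
    # Coverage: how well every frame in the video is represented by a selected one.
    n = len(relevance)
    cov = sum(max(similarity[j][i] for i in frames) for j in range(n)) / n
    # Redundancy: mean pairwise similarity among the selected frames.
    pairs = list(combinations(frames, 2))
    red = sum(similarity[i][j] for i, j in pairs) / len(pairs) if pairs else 0.0
    return rel + lam_cov * cov - lam_red * red


def best_set(relevance: List[float], similarity: List[List[float]], k: int) -> Tuple[int, ...]:
    """Exhaustive search over all k-frame subsets (only feasible for tiny inputs)."""
    candidates = combinations(range(len(relevance)), k)
    return max(candidates, key=lambda s: set_score(s, relevance, similarity))


def top_k(relevance: List[float], k: int) -> Tuple[int, ...]:
    """Independent scoring baseline: take the k highest-relevance frames."""
    ranked = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)
    return tuple(sorted(ranked[:k]))


if __name__ == "__main__":
    # Six frames; neighbouring frames are assumed to look alike.
    relevance = [0.9, 0.85, 0.8, 0.1, 0.1, 0.7]
    similarity = [[1.0 / (1 + abs(i - j)) for j in range(6)] for i in range(6)]
    print("top-K:", top_k(relevance, 2))
    print("set-level:", best_set(relevance, similarity, 2))
```

In this toy case top-K picks two adjacent frames, while the set-level score prefers one frame from each end of the clip, which is the clustering problem the abstract describes.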

[9] Vision-Language Models for Infrared Industrial Sensing in Additive Manufacturing Scene Description

Nazanin Mahjourian, Vinh Nguyen

🧩 TL;DR

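This paper proposes VLM-IRIS, a framework that preprocesses infrared images into RGB-compatible representations so that CLIP-based vision-language models can perform zero-shot infrared industrial sensing, achieving high-accuracy workpiece detection in thermal imaging without model retraining.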


📘 Detailed Summary

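Motivation: In manufacturing settings such as low-light conditions or enclosed machines, conventional vision systems struggle, and infrared cameras offer complementary advantages. However, current vision-language foundation models are trained on RGB data and cannot interpret infrared camera data, and supervised AI systems require large labeled datasets, which makes zero-shot frameworks more practical for infrared camera applications.

Method: The proposed VLM-IRIS framework preprocesses infrared images captured by a FLIR Boson sensor into RGB-compatible inputs suitable for CLIP encoders. Specifically, it converts the infrared images to a magma representation and applies centroid prompt ensembling with a CLIP ViT-B/32 encoder, enabling zero-shot prediction on infrared images without retraining the model.

Result: VLM-IRIS demonstrates zero-shot workpiece presence detection on a 3D printer bed, a task well suited to thermal imaging because of the temperature difference between the build plate and the workpieces. Experiments show that the proposed preprocessing and prompt ensembling achieve high accuracy on infrared images without any model retraining or labeled data.

Conclusion: The study shows that, with appropriate preprocessing, the effectiveness of vision-language models can be extended to thermal imaging, providing a practical solution for label-free monitoring. The finding opens a new path for zero-shot infrared sensing in industrial environments and illustrates how foundation models can be adapted to non-RGB data.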


📄 Abstract

Many manufacturing environments operate in low-light conditions or within enclosed machines where conventional vision systems struggle. Infrared cameras provide complementary advantages in such environments. At the same time, supervised AI systems require large labeled datasets, which makes zero-shot learning frameworks more practical for applications involving infrared cameras. Recent advances in vision-language foundation models (VLMs) offer a new path to zero-shot predictions from paired image-text representations. However, current VLMs cannot understand infrared camera data since they are trained on RGB data. This work introduces VLM-IRIS (Vision-Language Models for InfraRed Industrial Sensing), a zero-shot framework that adapts VLMs to infrared data by preprocessing infrared images captured by a FLIR Boson sensor into RGB-compatible inputs suitable for CLIP-based encoders. We demonstrate zero-shot workpiece presence detection on a 3D printer bed where temperature differences between the build plate and workpieces make the task well-suited for thermal imaging. VLM-IRIS converts the infrared images to magma representation and applies centroid prompt ensembling with a CLIP ViT-B/32 encoder to achieve high accuracy on infrared images without any model retraining. These findings demonstrate that the proposed improvements to VLMs can be effectively extended to thermal applications for label-free monitoring.
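
The abstract does not spell out "centroid prompt ensembling". One common reading, assumed here, is to embed several text prompts per class, average the unit-normalized embeddings into one class centroid, and classify an image by cosine similarity to each centroid. The sketch uses hand-written 2-D vectors in place of real CLIP embeddings, and the class names are illustrative.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def normalize(v: Vector) -> Vector:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def centroid(embeddings: List[Vector]) -> Vector:
    """Average unit-normalized prompt embeddings, then renormalize."""
    unit = [normalize(e) for e in embeddings]
    dim = len(unit[0])
    mean = [sum(e[d] for e in unit) / len(unit) for d in range(dim)]
    return normalize(mean)


def classify(image_embedding: Vector,
             class_prompts: Dict[str, List[Vector]]) -> Tuple[str, Dict[str, float]]:
    """Return the class whose prompt centroid is most similar to the image."""
    image = normalize(image_embedding)
    scores = {
        label: sum(a * b for a, b in zip(image, centroid(embs)))
        for label, embs in class_prompts.items()
    }
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Stand-ins for text embeddings of several prompt phrasings per class.
    class_prompts = {
        "workpiece present": [[1.0, 0.1], [0.9, 0.3]],
        "empty bed": [[0.1, 1.0], [0.2, 0.8]],
    }
    label, scores = classify([0.8, 0.3], class_prompts)
    print(label, scores)
```

In a real pipeline the vectors would come from the CLIP text encoder (for prompts) and the CLIP image encoder (for the magma-colorized infrared frame); averaging over prompt phrasings reduces sensitivity to any single wording.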

[10] DentalGPT: Incentivizing Multimodal Complex Reasoning in Dentistry

Zhenyang Cai, Jiaming Zhang, Junjie Zhao, Ziyi Zeng, Yanchao Li, Jingyi Liang, Junying Chen, Yunjin Yang, Jiajun You, Shuzhi Deng, Tongfei Wang, Wanting Chen, Chunxiu Hao, Ruiqi Xie, Zhenwei Wen, Xiangyi Feng, Zou Ting, Jin Zou Lin, Jianquan Li, Guangjun Yu, Liangyi Chen, Junwen Wang, Shan Jiang, Benyou Wang

🧩 TL;DR

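This paper presents DentalGPT, a multimodal large language model specialized for dentistry. Through high-quality domain knowledge injection and reinforcement learning, it addresses the shortcomings of existing MLLMs in capturing dental visual detail and in diagnostic reasoning.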


📘 Detailed Summary

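Motivation: Current multimodal large language models have two main limitations in dentistry: they struggle to capture fine-grained dental visual details, and they lack the reasoning ability needed for precise diagnosis. These limitations hinder reliable interpretation of multimodal data in automated oral healthcare.

Method: The study constructs the largest annotated multimodal dataset for dentistry to date, containing more than 120,000 dental images with detailed descriptions that emphasize diagnostically relevant visual features, and develops DentalGPT with a two-stage training strategy of high-quality domain knowledge injection followed by reinforcement learning.

Result: DentalGPT achieves superior performance on intraoral and panoramic benchmarks and on the dental subsets of medical VQA benchmarks, outperforming many state-of-the-art MLLMs in disease classification and dental visual question answering despite having only 7 billion parameters.

Conclusion: The study shows that high-quality dental data combined with staged adaptation is an effective path to capable, domain-specialized dental MLLMs, and it demonstrates the value of domain-specialized models in medical applications.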


📄 Abstract

Reliable interpretation of multimodal data in dentistry is essential for automated oral healthcare, yet current multimodal large language models (MLLMs) struggle to capture fine-grained dental visual details and lack sufficient reasoning ability for precise diagnosis. To address these limitations, we present DentalGPT, a specialized dental MLLM developed through high-quality domain knowledge injection and reinforcement learning. Specifically, the largest annotated multimodal dataset for dentistry to date was constructed by aggregating over 120k dental images paired with detailed descriptions that highlight diagnostically relevant visual features. Training on this dataset significantly enhances the MLLM's visual understanding of dental conditions, while the subsequent reinforcement learning stage further strengthens its capability for multimodal complex reasoning. Comprehensive evaluations on intraoral and panoramic benchmarks, along with dental subsets of medical VQA benchmarks, show that DentalGPT achieves superior performance in disease classification and dental VQA tasks, outperforming many state-of-the-art MLLMs despite having only 7B parameters. These results demonstrate that high-quality dental data combined with staged adaptation provides an effective pathway for building capable and domain-specialized dental MLLMs.

[11] VGent: Visual Grounding via Modular Design for Disentangling Reasoning and Prediction

Weitai Kang, Jason Kuen, Mengwei Ren, Zijun Wei, Yan Yan, Kangning Liu

🧩 TL;DR

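This paper proposes VGent, a modular encoder-decoder architecture that decouples high-level reasoning from low-level bounding box prediction. It addresses the slow speed and hallucination risk of auto-regressive decoding in visual grounding, as well as the damage that re-aligning an LLM can do to its pretrained reasoning ability, and sets a new state of the art on several benchmarks.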


📘 Detailed Summary

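Motivation: Current visual grounding models have two main problems: auto-regressive decoding with multimodal large language models is slow and risks hallucinations, while re-aligning an LLM with vision features to learn new special tokens may undermine the LLM's pretrained reasoning ability. This study aims to address these limitations with a solution that keeps strong reasoning ability while predicting bounding boxes quickly and accurately.

Method: VGent uses a modular encoder-decoder architecture. A frozen MLLM serves as the encoder to preserve its strong reasoning capabilities, while the decoder takes high-quality boxes proposed by detectors as queries and selects the target boxes by cross-attending to the encoder's hidden states. The paper also introduces QuadThinker, an RL-based training paradigm that strengthens the encoder's multi-target reasoning; mask-aware labels that resolve detection-segmentation ambiguity; and global target recognition, which improves recognition of all targets.

Result: On multi-target visual grounding benchmarks, VGent improves F1 by +20.6% over prior methods, and under visual reference challenges it further improves gIoU by +8.2% and cIoU by +5.8%, while maintaining constant, fast inference latency and setting a new state of the art.

Conclusion: By decoupling high-level reasoning from low-level bounding box prediction, VGent makes full use of recent advances in object detection and MLLMs, avoids the pitfalls of auto-regressive decoding, and supports modular upgrades of both encoder and decoder. It offers an efficient and extensible solution for visual grounding that retains strong reasoning ability and fast inference.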


📄 Abstract

Current visual grounding models are either based on a Multimodal Large Language Model (MLLM) that performs auto-regressive decoding, which is slow and risks hallucinations, or on re-aligning an LLM with vision features to learn new special or object tokens for grounding, which may undermine the LLM's pretrained reasoning ability. In contrast, we propose VGent, a modular encoder-decoder architecture that explicitly disentangles high-level reasoning and low-level bounding box prediction. Specifically, a frozen MLLM serves as the encoder to provide untouched powerful reasoning capabilities, while a decoder takes high-quality boxes proposed by detectors as queries and selects target box(es) by cross-attending to the encoder's hidden states. This design fully leverages advances in both object detection and MLLMs, avoids the pitfalls of auto-regressive decoding, and enables fast inference. Moreover, it supports modular upgrades of both the encoder and decoder to benefit the whole system: we introduce (i) QuadThinker, an RL-based training paradigm for enhancing the multi-target reasoning ability of the encoder; (ii) mask-aware labels for resolving detection-segmentation ambiguity; and (iii) global target recognition to improve the recognition of all the targets, which benefits the selection among augmented proposals. Experiments on multi-target visual grounding benchmarks show that VGent achieves a new state-of-the-art with +20.6% F1 improvement over prior methods, and further boosts gIoU by +8.2% and cIoU by +5.8% under visual reference challenges, while maintaining constant, fast inference latency.

[12] Fast-FoundationStereo: Real-Time Zero-Shot Stereo Matching

Bowen Wen, Shaurya Dewan, Stan Birchfield

🧩 TL;DR

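This paper presents Fast-FoundationStereo, a family of stereo foundation models that reach real-time frame rates while retaining strong zero-shot generalization. Using knowledge distillation, neural architecture search, and structured pruning, it runs more than 10x faster than FoundationStereo.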


📘 Detailed Summary

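Motivation: Stereo foundation models generalize well zero-shot but are too computationally expensive for real-time applications, while efficient stereo architectures sacrifice robustness for speed and require costly per-domain fine-tuning. This paper aims to bridge the gap with stereo models that achieve real-time performance without losing zero-shot generalization.

Method: The paper uses a divide-and-conquer acceleration strategy with three core components: knowledge distillation to compress the hybrid backbone into a single efficient student; blockwise neural architecture search to automatically discover optimal cost filtering designs under latency budgets, reducing search complexity exponentially; and structured pruning to remove redundancy in the iterative refinement module. It also introduces an automatic pseudo-labeling pipeline that curates 1.4 million in-the-wild stereo pairs to supplement synthetic training data and facilitate knowledge distillation.

Result: Fast-FoundationStereo runs more than 10x faster than FoundationStereo while closely matching its zero-shot accuracy, establishing a new state of the art among real-time methods. It is the first model to achieve strong zero-shot generalization at real-time frame rates, resolving the trade-off between speed and generalization in existing methods.

Conclusion: The study shows that a systematic acceleration strategy can bring stereo foundation models to real-time performance without sacrificing zero-shot generalization, offering a practical route to efficient stereo systems. The approach is also a useful reference for accelerating other compute-intensive foundation models and illustrates how knowledge distillation, neural architecture search, and structured pruning work together in model optimization.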


📄 Abstract

Stereo foundation models achieve strong zero-shot generalization but remain computationally prohibitive for real-time applications. Efficient stereo architectures, on the other hand, sacrifice robustness for speed and require costly per-domain fine-tuning. To bridge this gap, we present Fast-FoundationStereo, a family of architectures that achieve, for the first time, strong zero-shot generalization at real-time frame rates. We employ a divide-and-conquer acceleration strategy with three components: (1) knowledge distillation to compress the hybrid backbone into a single efficient student; (2) blockwise neural architecture search for automatically discovering optimal cost filtering designs under latency budgets, reducing search complexity exponentially; and (3) structured pruning for eliminating redundancy in the iterative refinement module. Furthermore, we introduce an automatic pseudo-labeling pipeline used to curate 1.4M in-the-wild stereo pairs to supplement synthetic training data and facilitate knowledge distillation. The resulting model can run over 10x faster than FoundationStereo while closely matching its zero-shot accuracy, thus establishing a new state-of-the-art among real-time methods. Project page: https://nvlabs.github.io/Fast-FoundationStereo/

[13] Learning complete and explainable visual representations from itemized text supervision

Yiwei Lyu, Chenhui Zhao, Soumyanil Banerjee, Shixuan Liu, Akshay Rao, Akhil Kondepudi, Honglak Lee, Todd C. Hollon

🧩 TL;DR

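This paper presents ItemizedCLIP, a framework for learning complete and explainable visual representations from itemized text supervision. Using a cross-attention module and tailored objectives, it achieves substantial gains in zero-shot performance and fine-grained interpretability across domains such as medical imaging and remote sensing.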


📘 Detailed Summary

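Motivation: Many visual domains, especially non-object-centric ones such as medical imaging and remote sensing, contain itemized text annotations: multiple text items describing distinct and semantically independent findings within a single image. This supervision differs from standard multi-caption supervision, where captions are usually redundant or highly overlapping, and existing methods handle the itemized structure poorly.

Method: ItemizedCLIP uses a cross-attention module to produce text item-conditioned visual embeddings, together with a set of tailored objectives that jointly enforce item independence (distinct regions for distinct items) and representation completeness (coverage of all items). The framework is designed specifically around the structure of itemized text supervision.

Result: Across four domains with naturally itemized text supervision (brain MRI, head CT, chest CT, remote sensing) and one synthetically itemized dataset, ItemizedCLIP achieves substantial improvements in zero-shot performance and fine-grained interpretability over baselines. The resulting representations are semantically grounded, item-differentiable, complete, and visually interpretable.

Conclusion: ItemizedCLIP provides an effective framework for itemized text supervision and learns semantically grounded, explainable visual representations. It is particularly suited to domains that need fine-grained explanation, such as medical imaging and remote sensing, and it opens a new direction for handling structured supervision in cross-modal learning.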


📄 Abstract

Training vision models with language supervision enables general and transferable representations. However, many visual domains, especially non-object-centric domains such as medical imaging and remote sensing, contain itemized text annotations: multiple text items describing distinct and semantically independent findings within a single image. Such supervision differs from standard multi-caption supervision, where captions are redundant or highly overlapping. Here, we introduce ItemizedCLIP, a framework for learning complete and explainable visual representations from itemized text supervision. ItemizedCLIP employs a cross-attention module to produce text item-conditioned visual embeddings and a set of tailored objectives that jointly enforce item independence (distinct regions for distinct items) and representation completeness (coverage of all items). Across four domains with naturally itemized text supervision (brain MRI, head CT, chest CT, remote sensing) and one additional synthetically itemized dataset, ItemizedCLIP achieves substantial improvements in zero-shot performance and fine-grained interpretability over baselines. The resulting ItemizedCLIP representations are semantically grounded, item-differentiable, complete, and visually interpretable. Our code is available at https://github.com/MLNeurosurg/ItemizedCLIP.

[14] Multi-task Learning with Extended Temporal Shift Module for Temporal Action Localization

Anh-Kiet Duong, Petra Gomez-Krämer

🧩 TL;DR

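This paper presents a solution to the BinEgo-360 Challenge that extends the Temporal Shift Module (TSM) to temporal action localization and adds multi-task learning and an ensemble strategy, taking first place in the competition.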


📘 Detailed Summary

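Motivation: The study addresses temporal action localization in multi-perspective, multi-modal video, specifically the BinEgo-360 Challenge dataset of panoramic, third-person, and egocentric recordings annotated with fine-grained action classes. The task calls for a localization method that handles this kind of multi-modal temporal data effectively.

Method: The approach extends the Temporal Shift Module (TSM) to temporal action localization by introducing a background class and classifying fixed-length, non-overlapping intervals. A multi-task learning framework jointly optimizes scene classification and temporal action localization, exploiting contextual cues between actions and environments. Finally, a weighted ensemble strategy combines multiple models to improve the robustness and consistency of predictions.

Result: The method ranked first in both the initial and extended rounds of the ICCV 2025 BinEgo-360 Challenge, demonstrating the effectiveness of combining multi-task learning, an efficient backbone, and ensemble learning for temporal action localization, particularly on multi-perspective, multi-modal video.

Conclusion: The study shows that combining a multi-task learning framework, an efficient temporal modeling backbone, and an ensemble strategy can effectively solve temporal action localization in complex multi-perspective video. The approach offers a workable technical path for multi-modal temporal data and highlights the value of contextual information and model ensembling for localization performance.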


📄 Abstract

We present our solution to the BinEgo-360 Challenge at ICCV 2025, which focuses on temporal action localization (TAL) in multi-perspective and multi-modal video settings. The challenge provides a dataset containing panoramic, third-person, and egocentric recordings, annotated with fine-grained action classes. Our approach is built on the Temporal Shift Module (TSM), which we extend to handle TAL by introducing a background class and classifying fixed-length non-overlapping intervals. We employ a multi-task learning framework that jointly optimizes for scene classification and TAL, leveraging contextual cues between actions and environments. Finally, we integrate multiple models through a weighted ensemble strategy, which improves robustness and consistency of predictions. Our method is ranked first in both the initial and extended rounds of the competition, demonstrating the effectiveness of combining multi-task learning, an efficient backbone, and ensemble learning for TAL.
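
Classifying fixed-length, non-overlapping intervals with a background class yields one label per interval, which still has to be turned into action segments with start and end times. The abstract does not describe that step; a minimal sketch, assuming consecutive intervals with the same non-background label are merged into one segment (the label names and the 2-second interval length are illustrative):

```python
from typing import List, Tuple

Segment = Tuple[str, float, float]  # (label, start_seconds, end_seconds)


def intervals_to_segments(labels: List[str], interval_seconds: float,
                          background: str = "background") -> List[Segment]:
    """Merge runs of identical non-background interval labels into segments."""
    segments: List[Segment] = []
    start = 0
    for i in range(1, len(labels) + 1):
        # A run ends at the end of the sequence or when the label changes.
        if i == len(labels) or labels[i] != labels[start]:
            if labels[start] != background:
                segments.append((labels[start],
                                 start * interval_seconds,
                                 i * interval_seconds))
            start = i
    return segments


if __name__ == "__main__":
    predicted = ["background", "pour", "pour", "background", "stir", "stir", "stir"]
    for segment in intervals_to_segments(predicted, interval_seconds=2.0):
        print(segment)
```

The interval length sets the temporal resolution of the resulting boundaries: shorter intervals give finer boundaries but less context per classification.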

[15] AutoRefiner: Improving Autoregressive Video Diffusion Models via Reflective Refinement Over the Stochastic Sampling Path

Zhengyang Yu, Akio Hayakawa, Masato Ishii, Qingtao Yu, Takashi Shibuya, Jing Zhang, Yuki Mitsufuji

🧩 TL;DR

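This paper proposes AutoRefiner, a noise refiner designed for autoregressive video diffusion models. Using pathwise noise refinement and a reflective KV-cache, it significantly improves sample fidelity without updating model parameters.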


📘 Detailed Summary

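Motivation: Autoregressive video diffusion models show promise as scalable alternatives to bidirectional video diffusion models, but their sample fidelity leaves room for improvement. Existing optimization-based inference-time alignment methods are too computationally expensive, and noise refiners from the text-to-image domain cannot be applied directly to video models, so an efficient noise refinement scheme tailored to autoregressive video diffusion models is needed.

Method: The paper proposes the AutoRefiner noise refiner, with two key designs: pathwise noise refinement, which refines noise along the stochastic denoising path, and a reflective KV-cache mechanism, which manages the key-value cache during autoregressive generation. The method works as a plug-in module that modulates sampled noise in a single forward pass and avoids parameter updates.

Result: Experiments show that AutoRefiner effectively improves the sample fidelity of autoregressive video diffusion models and outperforms naively extended text-to-image noise refiners. The method is computationally efficient and can serve as an add-on to existing models, improving generation quality while preserving their suitability for real-time and interactive applications.

Conclusion: The study confirms the need for a noise refiner designed specifically for autoregressive video diffusion models. AutoRefiner addresses the distinctive challenges of video generation through pathwise refinement and cache management, providing an efficient way to improve video quality while keeping the scalability and real-time advantages of autoregressive models.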


📄 Abstract

Autoregressive video diffusion models (AR-VDMs) show strong promise as scalable alternatives to bidirectional VDMs, enabling real-time and interactive applications. Yet there remains room for improvement in their sample fidelity. A promising solution is inference-time alignment, which optimizes the noise space to improve sample fidelity without updating model parameters. However, optimization- or search-based methods are computationally impractical for AR-VDMs. Recent text-to-image (T2I) works address this via feedforward noise refiners that modulate sampled noises in a single forward pass. Can such noise refiners be extended to AR-VDMs? We identify the failure of naively extending T2I noise refiners to AR-VDMs and propose AutoRefiner, a noise refiner tailored for AR-VDMs, with two key designs: pathwise noise refinement and a reflective KV-cache. Experiments demonstrate that AutoRefiner serves as an efficient plug-in for AR-VDMs, effectively enhancing sample fidelity by refining noise along stochastic denoising paths.

[16] SmokeBench: Evaluating Multimodal Large Language Models for Wildfire Smoke Detection

Tianye Qi, Weihao Li, Nick Barnes

🧩 TL;DR

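This paper introduces SmokeBench, a benchmark for evaluating how well multimodal large language models recognize and localize wildfire smoke, and finds that existing models have significant limitations in early-stage smoke localization.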


📘 Detailed Summary

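Motivation: Wildfire smoke is transparent, amorphous, and often confused with clouds, which makes early detection especially difficult. The capabilities of current multimodal large language models in safety-critical wildfire monitoring have not yet been evaluated systematically.

Method: The study introduces the SmokeBench benchmark, which comprises four tasks: smoke classification, tile-based smoke localization, grid-based smoke localization, and smoke detection. It evaluates several multimodal large language models, including Idefics2, Qwen2.5-VL, InternVL3, Unified-IO 2, Grounding DINO, GPT-4o, and Gemini-2.5 Pro.

Result: Some models can classify smoke when it covers a large area, but all models perform poorly at accurate localization, especially in the early stages. Smoke volume is strongly correlated with model performance, whereas contrast has a comparatively minor effect.

Conclusion: The study reveals critical limitations of current multimodal large language models for safety-critical wildfire monitoring and underscores the need for new methods that improve early-stage smoke localization, pointing to an important direction for future model development.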


📄 Abstract

Wildfire smoke is transparent, amorphous, and often visually confounded with clouds, making early-stage detection particularly challenging. In this work, we introduce a benchmark, called SmokeBench, to evaluate the ability of multimodal large language models (MLLMs) to recognize and localize wildfire smoke in images. The benchmark consists of four tasks: (1) smoke classification, (2) tile-based smoke localization, (3) grid-based smoke localization, and (4) smoke detection. We evaluate several MLLMs, including Idefics2, Qwen2.5-VL, InternVL3, Unified-IO 2, Grounding DINO, GPT-4o, and Gemini-2.5 Pro. Our results show that while some models can classify the presence of smoke when it covers a large area, all models struggle with accurate localization, especially in the early stages. Further analysis reveals that smoke volume is strongly correlated with model performance, whereas contrast plays a comparatively minor role. These findings highlight critical limitations of current MLLMs for safety-critical wildfire monitoring and underscore the need for methods that improve early-stage smoke localization.

[17] RoomPilot: Controllable Synthesis of Interactive Indoor Environments via Multimodal Semantic Parsing

Wentang Chen, Shougao Zhang, Yiman Zhang, Tianhao Zhou, Ruihui Li

🧩 TL;DR

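This paper presents RoomPilot, a unified framework for controllable indoor scene generation that parses multimodal inputs, such as text descriptions or CAD floor plans, into an Indoor Domain-Specific Language (IDSL) to generate structured scenes with interaction semantics.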


📘 Detailed Summary

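Motivation: Existing indoor scene generation methods have two main limitations: they support only a narrow range of input modalities, and they rely on stochastic processes that limit controllability. This restricts their usefulness in applications such as game development, architectural visualization, and embodied AI training, and in particular they cannot generate scenes with realistic interactive behavior.

Method: At its core, RoomPilot introduces an Indoor Domain-Specific Language (IDSL) as a shared semantic representation and parses multimodal inputs, such as text descriptions or CAD floor plans, into IDSL. The framework draws on an asset dataset with interaction annotations to generate scenes with realistic object behaviors, avoiding the visually plausible but functionally inert layouts produced by conventional procedural methods.

Result: Extensive experiments validate RoomPilot's strong multimodal understanding, its fine-grained controllability in scene generation, and its superior physical consistency and visual fidelity. The method makes notable progress in generating indoor scenes with interaction semantics and overcomes limitations of conventional methods.

Conclusion: The study shows that a well-designed domain-specific language can serve as a shared semantic representation, enabling coherent, high-quality scene synthesis from any single modality while maintaining interaction semantics. The work marks a significant step toward general-purpose controllable 3D indoor scene generation and provides an effective framework for handling multimodal inputs in a unified way.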


📄 Abstract

Generating controllable and interactive indoor scenes is fundamental to applications in game development, architectural visualization, and embodied AI training. Yet existing approaches either handle a narrow range of input modalities or rely on stochastic processes that hinder controllability. To overcome these limitations, we introduce RoomPilot, a unified framework that parses diverse multi-modal inputs (textual descriptions or CAD floor plans) into an Indoor Domain-Specific Language (IDSL) for structured indoor scene generation. The key insight is that a well-designed IDSL can act as a shared semantic representation, enabling coherent, high-quality scene synthesis from any single modality while maintaining interaction semantics. In contrast to conventional procedural methods that produce visually plausible but functionally inert layouts, RoomPilot leverages a curated dataset of interaction-annotated assets to synthesize environments exhibiting realistic object behaviors. Extensive experiments further validate its strong multi-modal understanding, fine-grained controllability in scene generation, and superior physical consistency and visual fidelity, marking a significant step toward general-purpose controllable 3D indoor scene generation.

[18] Cross-modal Prompting for Balanced Incomplete Multi-modal Emotion Recognition

Wen-Jue He, Xiaofeng Zhu, Zheng Zhang

🧩 TL;DR

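This paper proposes a novel Cross-modal Prompting (ComP) method that addresses the performance gap and modality under-optimization problems in incomplete multi-modal emotion recognition by enhancing modality-specific features and boosting each modality's performance.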


📘 Detailed Summary

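Motivation: Incomplete multi-modal emotion recognition must cope with partially missing multi-modal data. The performance gap and modality under-optimization problems hinder effective multi-modal learning in practice and are aggravated when data is missing, so new methods are needed to enhance modality-specific features and improve overall recognition accuracy.

Method: The Cross-modal Prompting method has three parts: a progressive prompt generation module with a dynamic gradient modulator that produces concise and consistent modality semantic cues; cross-modal knowledge propagation, which uses the delivered prompts to selectively amplify consistent information in modality features and so sharpen the discrimination of modality-specific outputs; and a coordinator that dynamically re-weights modality outputs as a complement to the balance strategy.

Result: Extensive experiments on 4 datasets against 7 state-of-the-art methods under different missing rates validate the effectiveness of the proposed method, which boosts each modality's performance while markedly improving the overall accuracy of incomplete multi-modal emotion recognition.

Conclusion: Through its cross-modal prompting mechanism, the work addresses key challenges in incomplete multi-modal emotion recognition. The proposed progressive prompt generation and dynamic coordination strategies offer a new technical route for handling missing multi-modal data and performance imbalance, with practical application value.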


📄 Abstract

Incomplete multi-modal emotion recognition (IMER) aims to understand human intentions and sentiments by comprehensively exploring partially observed multi-source data. Although multi-modal data is expected to provide richer information, the performance gap and modality under-optimization problems hinder effective multi-modal learning in practice and are exacerbated when data is missing. To address this issue, we devise a novel Cross-modal Prompting (ComP) method, which emphasizes coherent information by enhancing modality-specific features and improves the overall recognition accuracy by boosting each modality's performance. Specifically, a progressive prompt generation module with a dynamic gradient modulator is proposed to produce concise and consistent modality semantic cues. Meanwhile, cross-modal knowledge propagation selectively amplifies the consistent information in modality features with the delivered prompts to enhance the discrimination of the modality-specific output. Additionally, a coordinator is designed to dynamically re-weight the modality outputs as a complement to the balance strategy to improve the model's efficacy. Extensive experiments on 4 datasets with 7 SOTA methods under different missing rates validate the effectiveness of our proposed method.

[19] KeyframeFace: From Text to Expressive Facial Keyframes

Jingchao Wu, Zejian Kang, Haibo Liu, Yuanchen Fei, Xiangru Huang

🧩 TL;DR

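This paper presents the KeyframeFace dataset and an LLM-based text-to-animation framework that use keyframe-level supervision and LLM priors to address the lack of semantic grounding and temporal structure when generating dynamic 3D facial animation from natural language.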


📘 Detailed Summary

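Motivation: Existing datasets and methods focus mainly on speech-driven animation or unstructured expression sequences; they lack semantic grounding and temporal structure and therefore struggle to generate expressive human performance animation. Generating dynamic 3D facial animation from natural language requires understanding both temporal semantics and fine-grained expression changes.

Method: The paper introduces KeyframeFace, a large-scale multimodal dataset containing 2,100 expressive scripts, monocular videos, per-frame ARKit coefficients, contextual backgrounds, complex emotions, and manually defined keyframes, with multi-perspective annotations produced by LLMs and MLLMs. It also proposes the first text-to-animation framework that uses LLM priors for interpretable facial motion synthesis, aligning the semantic understanding of LLMs with the interpretable structure of ARKit coefficients.

Result: The KeyframeFace dataset provides a comprehensive multimodal annotation resource, and the LLM-based framework achieves high-fidelity expressive animation generation. Together they establish a new foundation for interpretable, keyframe-guided, and context-aware text-to-animation. Code and data are publicly available on GitHub.

Conclusion: By combining a dataset with a framework, the study establishes a new paradigm for interpretable facial animation generation that couples the semantic understanding of LLMs with the structured representation of a parametric face model, providing key infrastructure and a methodological basis for future text-driven animation research.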


📄 Abstract

Generating dynamic 3D facial animation from natural language requires understanding both temporally structured semantics and fine-grained expression changes. Existing datasets and methods mainly focus on speech-driven animation or unstructured expression sequences and therefore lack the semantic grounding and temporal structures needed for expressive human performance generation. In this work, we introduce KeyframeFace, a large-scale multimodal dataset designed for text-to-animation research through keyframe-level supervision. KeyframeFace provides 2,100 expressive scripts paired with monocular videos, per-frame ARKit coefficients, contextual backgrounds, complex emotions, manually defined keyframes, and multi-perspective annotations based on ARKit coefficients and images via Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). Beyond the dataset, we propose the first text-to-animation framework that explicitly leverages LLM priors for interpretable facial motion synthesis. This design aligns the semantic understanding capabilities of LLMs with the interpretable structure of ARKit's coefficients, enabling high-fidelity expressive animation. KeyframeFace and our LLM-based framework together establish a new foundation for interpretable, keyframe-guided, and context-aware text-to-animation. Code and data are available at https://github.com/wjc12345123/KeyframeFace.

[20] UFVideo: Towards Unified Fine-Grained Video Cooperative Understanding with Large Language Models

Hewen Pan, Cong Wei, Dashuang Liang, Zepeng Huang, Pengfei Gao, Ziqi Zhou, Lulu Xue, Pengfei Yan, Xiaoming Wei, Minghui Li, Shengshan Hu

🧩 TL;DR

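This paper presents UFVideo, the first video large language model with unified multi-grained cooperative understanding, which uses a unified visual-language guided alignment mechanism to handle video understanding at global, pixel, and temporal scales within a single model.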


📘 Detailed Summary

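Motivation: Existing video large language models are largely limited to specific video understanding tasks; they lack comprehensive, multi-grained video perception and cannot achieve unified video understanding across scales. This work aims to fill that gap.

Method: The authors design a unified visual-language guided alignment mechanism that flexibly handles video understanding tasks at global, pixel, and temporal scales. The model dynamically encodes the visual and text inputs of different tasks and generates a textual response, a temporal localization, or a grounded mask.

Result: The authors construct the UFVideo-Bench benchmark, which contains three collaborative tasks at different scales. Experiments show that UFVideo offers flexibility and advantages over GPT-4o, and results on 9 public benchmarks validate the model's effectiveness on a range of common video understanding tasks.

Conclusion: UFVideo demonstrates the feasibility of unified multi-grained video understanding, showing that a single model can handle video understanding tasks at different scales, and offers valuable insights for future video large language models.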


📄 Abstract

With the advancement of multi-modal Large Language Models (LLMs), Video LLMs have been further developed for both holistic and specialized video understanding. However, existing works are limited to specialized video understanding tasks, failing to achieve comprehensive and multi-grained video perception. To bridge this gap, we introduce UFVideo, the first Video LLM with unified multi-grained cooperative understanding capabilities. Specifically, we design unified visual-language guided alignment to flexibly handle video understanding across global, pixel, and temporal scales within a single model. UFVideo dynamically encodes the visual and text inputs of different tasks and generates a textual response, a temporal localization, or a grounded mask. Additionally, to evaluate challenging multi-grained video understanding tasks, we construct UFVideo-Bench, consisting of three distinct collaborative tasks across these scales, which demonstrates UFVideo's flexibility and advantages over GPT-4o. Furthermore, we validate the effectiveness of our model across 9 public benchmarks covering various common video understanding tasks, providing valuable insights for future Video LLMs.

[21] Task-Specific Distance Correlation Matching for Few-Shot Action Recognition

Fei Long, Yao Zhang, Jiaming Lv, Jiangtao Xie, Peihua Li

🧩 TL;DR

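This paper proposes the TS-FSAR framework, which fine-tunes CLIP efficiently through a visual Ladder Side Network and introduces a task-specific distance correlation matching metric, addressing the difficulty existing few-shot action recognition methods have in modeling nonlinear relationships and in optimizing under limited data.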


📘 Detailed Summary

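Motivation: Existing few-shot action recognition methods have two key limitations. First, existing set matching metrics typically rely on cosine similarity to measure linear inter-frame dependencies and match using only instance-level information, so they fail to capture complex patterns such as nonlinear relationships and overlook task-specific cues. Second, in methods that fine-tune CLIP efficiently through skip-fusion layers, the newly introduced side layers are hard to optimize under limited data.

Method: The TS-FSAR framework has three core components: a visual Ladder Side Network (LSN) for efficient CLIP fine-tuning; a Task-Specific Distance Correlation Matching (TS-DCM) metric that uses α-distance correlation to model both linear and nonlinear inter-frame dependencies and uses a task prototype to enable task-specific matching; and a Guiding LSN with Adapted CLIP (GLAC) module that regularizes the LSN with the adapted frozen CLIP to improve training for α-distance correlation estimation under limited supervision.

Result: Extensive experiments on five widely used benchmarks show that TS-FSAR outperforms prior state-of-the-art methods, validating the effectiveness of the framework for few-shot action recognition.

Conclusion: By combining an efficient fine-tuning strategy with a task-specific nonlinear matching metric, the study markedly improves few-shot action recognition, offers an effective solution to complex pattern modeling and optimization difficulty under limited data, and opens a new direction for applying vision-language models to few-shot learning.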


📄 Abstract

Few-shot action recognition (FSAR) has recently made notable progress through set matching and efficient adaptation of large-scale pre-trained models. However, two key limitations persist. First, existing set matching metrics typically rely on cosine similarity to measure inter-frame linear dependencies and then perform matching with only instance-level information, thus failing to capture more complex patterns such as nonlinear relationships and overlooking task-specific cues. Second, for efficient adaptation of CLIP to FSAR, recent work performing fine-tuning via skip-fusion layers (which we refer to as side layers) has significantly reduced memory cost. However, the newly introduced side layers are often difficult to optimize under limited data conditions. To address these limitations, we propose TS-FSAR, a framework comprising three components: (1) a visual Ladder Side Network (LSN) for efficient CLIP fine-tuning; (2) a metric called Task-Specific Distance Correlation Matching (TS-DCM), which uses $α$-distance correlation to model both linear and nonlinear inter-frame dependencies and leverages a task prototype to enable task-specific matching; and (3) a Guiding LSN with Adapted CLIP (GLAC) module, which regularizes LSN using the adapted frozen CLIP to improve training for better $α$-distance correlation estimation under limited supervision. Extensive experiments on five widely used benchmarks demonstrate that our TS-FSAR yields superior performance compared to prior state-of-the-art methods.
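As a rough illustration of the α-distance correlation that TS-DCM builds on, the sketch below implements the standard sample distance correlation with exponent α in plain Python. It is the textbook statistic, not the paper's TS-DCM metric: the function names, the representation of frames as feature tuples, and the omission of the task prototype are all assumptions made for illustration.

```python
import math


def _centered_distances(points, alpha):
    """Double-centered matrix of pairwise Euclidean distances raised to alpha."""
    n = len(points)
    d = [[math.dist(p, q) ** alpha for q in points] for p in points]
    row_mean = [sum(r) / n for r in d]
    grand_mean = sum(row_mean) / n
    # The distance matrix is symmetric, so column means equal row means.
    return [[d[i][j] - row_mean[i] - row_mean[j] + grand_mean
             for j in range(n)] for i in range(n)]


def distance_correlation(xs, ys, alpha=1.0):
    """Sample alpha-distance correlation between two paired sequences of vectors.

    xs and ys are equal-length lists of tuples (hypothetical per-frame features).
    The exponent must satisfy 0 < alpha < 2.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("xs and ys must be paired and contain at least 2 samples")
    if not 0 < alpha < 2:
        raise ValueError("alpha must lie strictly between 0 and 2")
    n = len(xs)
    a = _centered_distances(xs, alpha)
    b = _centered_distances(ys, alpha)

    def mean_product(u, v):
        return sum(u[i][j] * v[i][j] for i in range(n) for j in range(n)) / (n * n)

    dcov2 = mean_product(a, b)                       # squared distance covariance
    dvar2 = mean_product(a, a) * mean_product(b, b)  # product of squared variances
    if dvar2 <= 0:
        return 0.0
    return math.sqrt(max(dcov2, 0.0) / math.sqrt(dvar2))


xs = [(float(t),) for t in range(-5, 6)]
linear = [(2.0 * x[0] + 1.0,) for x in xs]
quadratic = [(x[0] ** 2,) for x in xs]
dcor_linear = distance_correlation(xs, linear)
dcor_quadratic = distance_correlation(xs, quadratic)
```

Here the quadratic relation has zero Pearson correlation with `xs` because the inputs are symmetric around zero, yet its distance correlation is clearly positive, which is the kind of nonlinear dependency the abstract says cosine-based matching misses.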

[22] The N-Body Problem: Parallel Execution from Single-Person Egocentric Video

Zhifan Zhu, Yifei Huang, Yoichi Sato, Dima Damen

🧩 TL;DR

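This paper introduces the N-Body Problem, which aims to learn parallel execution of human activities from a single egocentric video, and uses a structured prompting strategy to guide a vision-language model to reason about the 3D environment, object usage, and temporal dependencies and produce a viable parallel execution plan.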


📘 Detailed Summary

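Motivation: Humans can intuitively parallelise complex activities, but can a model learn this ability from watching a video of a single person? Given one egocentric video, the N-Body Problem asks how N individuals could hypothetically perform the same set of tasks observed in the video. The goal is to maximise speed-up, but naive assignment of video segments often violates real-world constraints and produces physically impossible scenarios.

Method: The authors formalise the N-Body Problem and propose a suite of metrics that measure performance (speed-up, task coverage) and feasibility (spatial collisions, object conflicts, and causal constraints). They introduce a structured prompting strategy that guides a vision-language model to reason about the 3D environment, object usage, and temporal dependencies to produce a viable parallel execution plan.

Result: On 100 videos from EPIC-Kitchens and HD-EPIC, with N = 2, the method improves action coverage by 45% over a baseline prompt for Gemini 2.5 Pro while reducing collision rates, object conflicts, and causal conflicts by 55%, 45%, and 55%, respectively.

Conclusion: The study shows that learning activity parallelisation from a single egocentric video is feasible and that the proposed structured prompting strategy effectively handles real-world constraints. It offers a new approach for applications such as multi-agent collaboration, robot task planning, and efficiency optimisation, and its formalised evaluation framework provides a benchmark for future research.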


📄 Abstract

Humans can intuitively parallelise complex activities, but can a model learn this from observing a single person? Given one egocentric video, we introduce the N-Body Problem: how N individuals can hypothetically perform the same set of tasks observed in this video. The goal is to maximise speed-up, but naive assignment of video segments to individuals often violates real-world constraints, leading to physically impossible scenarios like two people using the same object or occupying the same space. To address this, we formalise the N-Body Problem and propose a suite of metrics to evaluate both performance (speed-up, task coverage) and feasibility (spatial collisions, object conflicts, and causal constraints). We then introduce a structured prompting strategy that guides a Vision-Language Model (VLM) to reason about the 3D environment, object usage, and temporal dependencies to produce a viable parallel execution. On 100 videos from EPIC-Kitchens and HD-EPIC, our method for N = 2 boosts action coverage by 45% over a baseline prompt for Gemini 2.5 Pro, while simultaneously slashing collision rates, object conflicts, and causal conflicts by 55%, 45%, and 55%, respectively.
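To make the performance and feasibility metrics more concrete, here is a minimal sketch of how speed-up and object conflicts could be computed for a candidate parallel schedule. The abstract does not give the formal definitions, so everything here is an assumption: the `Segment` fields, the makespan-based speed-up, and the rule that a conflict is two different people using the same object during overlapping time intervals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One re-timed video segment assigned to a person (hypothetical schema)."""
    person: int
    start: float  # seconds on the parallel timeline
    end: float
    obj: str      # main object used during the segment


def speed_up(original_duration, schedule):
    """Ratio of the single-person duration to the parallel makespan."""
    makespan = max(s.end for s in schedule) - min(s.start for s in schedule)
    return original_duration / makespan


def object_conflicts(schedule):
    """Pairs of segments where two different people use one object at the same time."""
    conflicts = []
    for i, a in enumerate(schedule):
        for b in schedule[i + 1:]:
            overlaps = a.start < b.end and b.start < a.end
            if a.person != b.person and a.obj == b.obj and overlaps:
                conflicts.append((a, b))
    return conflicts


schedule = [
    Segment(person=0, start=0.0, end=30.0, obj="knife"),
    Segment(person=1, start=0.0, end=20.0, obj="pan"),
    Segment(person=1, start=20.0, end=40.0, obj="knife"),  # overlaps person 0's knife use
]
ratio = speed_up(70.0, schedule)
clashes = object_conflicts(schedule)
```

In this toy schedule the 70-second original is covered in 40 seconds, but the knife is in use by both people between seconds 20 and 30, which is the kind of physically impossible assignment the feasibility metrics are meant to flag.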

[23] FlowDC: Flow-Based Decoupling-Decay for Complex Image Editing

Yilei Jiang, Zhen Wang, Yanghao Wang, Jun Yu, Yueting Zhuang, Jun Xiao, Long Chen

🧩 TL;DR

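This paper proposes FlowDC, which decouples a complex image editing task into multiple sub-editing effects that are superposed in parallel, and decomposes the velocity field and decays its orthogonal component, to balance semantic alignment and source consistency in complex multi-target editing.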


📘 Detailed Summary

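Motivation: Current image editing methods built on pre-trained text-to-image flow matching models perform well on simple edits but struggle with complex edits that involve multiple editing targets. Existing single-round and multi-round editing methods are limited by long-text following and cumulative inconsistency, respectively, and struggle to balance semantic alignment with source consistency.

Method: FlowDC decouples a complex edit into multiple sub-editing effects and superposes them in parallel during the editing process. The authors also observe that the velocity component orthogonal to the editing displacement harms preservation of the source structure, so they decompose the velocity and decay the orthogonal part to better maintain source consistency. In addition, they construct a complex editing benchmark, Complex-PIE-Bench, for evaluation.

Result: On two benchmarks, FlowDC outperforms existing methods. It achieves a better balance between semantic alignment and source consistency in complex editing tasks, and ablation studies examine the contribution of each module design.

Conclusion: By decoupling complex editing tasks and decaying the orthogonal part of the decomposed velocity, FlowDC provides an effective solution for multi-target image editing. Beyond improving complex editing performance, it contributes a new technical approach and an evaluation benchmark for future research on text-guided image editing.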


📄 Abstract

With the surge of pre-trained text-to-image flow matching models, text-based image editing performance has improved remarkably, especially for simple editing that contains only a single editing target. To satisfy rapidly growing editing requirements, complex editing, which contains multiple editing targets, has emerged as a more challenging task. However, current complex editing solutions, single-round and multi-round editing, are limited by long text following and cumulative inconsistency, respectively. Thus, they struggle to strike a balance between semantic alignment and source consistency. In this paper, we propose FlowDC, which decouples the complex editing into multiple sub-editing effects and superposes them in parallel during the editing process. Meanwhile, we observe that the velocity component that is orthogonal to the editing displacement harms preservation of the source structure. Thus, we decompose the velocity and decay the orthogonal part for better source consistency. To evaluate effectiveness in complex editing settings, we construct a complex editing benchmark: Complex-PIE-Bench. On two benchmarks, FlowDC shows superior results compared with existing methods. We also detail the ablations of our module designs.
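The decompose-and-decay step can be pictured with a small vector projection. The sketch below is only an illustration under stated assumptions: velocity and displacement are flat lists of floats, the function name `decay_orthogonal` and the constant `decay` factor are hypothetical, and the abstract does not say how the decay is actually scheduled in FlowDC.

```python
def decay_orthogonal(velocity, displacement, decay=0.5):
    """Keep the velocity component along `displacement`, shrink the rest.

    velocity, displacement: equal-length lists of floats (hypothetical flattened
    latents). decay: factor in [0, 1] applied to the orthogonal component.
    """
    if len(velocity) != len(displacement):
        raise ValueError("velocity and displacement must have the same length")
    norm_sq = sum(d * d for d in displacement)
    if norm_sq == 0:
        # No editing direction is defined, so leave the velocity unchanged.
        return list(velocity)
    scale = sum(v * d for v, d in zip(velocity, displacement)) / norm_sq
    parallel = [scale * d for d in displacement]
    # v = parallel + orthogonal, so orthogonal = v - parallel.
    return [p + decay * (v - p) for v, p in zip(velocity, parallel)]


adjusted = decay_orthogonal([3.0, 4.0], [1.0, 0.0], decay=0.5)
```

With the editing displacement along the first axis, the component `3.0` is kept and the orthogonal component `4.0` is halved, so `adjusted` is `[3.0, 2.0]`.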

[24] VLM2GeoVec: Toward Universal Multimodal Embeddings for Remote Sensing

Emanuel Sánchez Aimar, Gulnaz Zhambulova, Fahad Shahbaz Khan, Yonghao Xu, Michael Felsberg

🧩 TL;DR

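This paper proposes VLM2GeoVec, an instruction-following, single-encoder vision-language model that uses contrastive learning to embed interleaved inputs (images, text, bounding boxes, and geographic coordinates) in a unified vector space, bridging the split between retrieval models and generative models in remote sensing.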


📘 Detailed Summary

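Motivation: Satellite imagery differs fundamentally from natural images: its aerial viewpoint, very high resolution, diverse scale variations, and abundance of small objects demand both region-level spatial reasoning and holistic scene understanding. Current remote-sensing approaches are split between dual-encoder retrieval models, which excel at large-scale cross-modal search but cannot interleave modalities, and generative assistants, which support region-level interpretation but lack scalable retrieval, so a unified solution is needed.

Method: VLM2GeoVec adopts an instruction-following, single-encoder architecture and is trained contrastively to embed interleaved inputs (images, text, bounding boxes, and geographic coordinates) in a unified vector space. This eliminates multi-stage pipelines and task-specific modules. The paper also introduces the RSMEB benchmark, which covers key remote-sensing embedding applications: scene classification, cross-modal search, compositional retrieval, visual question answering, visual grounding and region-level reasoning, and semantic geospatial retrieval.

Result: On RSMEB, VLM2GeoVec achieves 26.6% P@1 on region-caption retrieval (+25 percentage points over dual-encoder baselines), 32.5% P@1 on referring-expression retrieval (+19 percentage points), and 17.8% P@1 on semantic geo-localization retrieval (more than 3× the prior best), while matching or exceeding specialized baselines on conventional tasks such as scene classification and cross-modal retrieval.

Conclusion: VLM2GeoVec unifies scalable retrieval with region-level spatial reasoning, enabling cohesive multimodal analysis in remote sensing. According to the summary, it is the first remote-sensing solution that handles interleaved modality inputs with a unified embedding; it markedly improves performance on region-level tasks and lays a foundation for future intelligent remote-sensing analysis systems.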


📄 Abstract

Satellite imagery differs fundamentally from natural images: its aerial viewpoint, very high resolution, diverse scale variations, and abundance of small objects demand both region-level spatial reasoning and holistic scene understanding. Current remote-sensing approaches remain fragmented between dual-encoder retrieval models, which excel at large-scale cross-modal search but cannot interleave modalities, and generative assistants, which support region-level interpretation but lack scalable retrieval capabilities. We propose VLM2GeoVec, an instruction-following, single-encoder vision-language model trained contrastively to embed interleaved inputs (images, text, bounding boxes, and geographic coordinates) in a unified vector space. Our single encoder interleaves all inputs into one joint embedding trained with a contrastive loss, eliminating multi-stage pipelines and task-specific modules. To evaluate its versatility, we introduce RSMEB, a novel benchmark covering key remote-sensing embedding applications: scene classification; cross-modal search; compositional retrieval; visual-question answering; visual grounding and region-level reasoning; and semantic geospatial retrieval. On RSMEB, it achieves 26.6% P@1 on region-caption retrieval (+25 pp vs. dual-encoder baselines), 32.5% P@1 on referring-expression retrieval (+19 pp), and 17.8% P@1 on semantic geo-localization retrieval (over $3\times$ prior best), while matching or exceeding specialized baselines on conventional tasks such as scene classification and cross-modal retrieval. VLM2GeoVec unifies scalable retrieval with region-level spatial reasoning, enabling cohesive multimodal analysis in remote sensing. We will publicly release the code, checkpoints, and data upon acceptance.
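For readers unfamiliar with the P@1 figures quoted above, the following sketch shows how precision at rank 1 is typically computed for embedding retrieval: each query is matched to its most similar candidate by cosine similarity, and P@1 is the fraction of queries whose top match is the correct one. The helper names and the toy two-dimensional embeddings are assumptions for illustration and have nothing to do with the RSMEB evaluation code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def precision_at_1(query_embs, candidate_embs, gold):
    """Fraction of queries whose highest-scoring candidate is the gold index."""
    hits = 0
    for query, gold_index in zip(query_embs, gold):
        best = max(range(len(candidate_embs)),
                   key=lambda j: cosine(query, candidate_embs[j]))
        hits += int(best == gold_index)
    return hits / len(query_embs)


queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
candidates = [[0.9, 0.1], [0.1, 0.9], [-1.0, 0.0]]
gold = [0, 1, 2]  # the third query's gold candidate points away from it
p_at_1 = precision_at_1(queries, candidates, gold)
```

Two of the three toy queries retrieve their gold candidate first, so `p_at_1` is 2/3. A gain reported in "pp" is a difference between two such percentages, for example 26.6% versus a baseline near 1.6% is about +25 percentage points.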

[25] Reconstruction as a Bridge for Event-Based Visual Question Answering

Hanyue Lou, Jiayi Zhou, Yang Zhang, Boyu Li, Yi Wang, Guangnan Ye, Boxin Shi

🧩 TL;DR

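The study proposes two methods for integrating event camera data with multimodal large language models and introduces EvQA, the first objective, real-world benchmark for event-based MLLMs, achieving state-of-the-art performance in event-based visual understanding.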


📘 Detailed Summary

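Motivation: Integrating event cameras with multimodal large language models promises general scene understanding in challenging visual conditions, but requires balancing the unique advantages of event data against compatibility with frame-based models. An objective, real-world benchmark for evaluating event-based MLLMs has also been lacking.

Method: Two methods are proposed: Frame-based Reconstruction and Tokenization (FRT), and Adaptive Reconstruction and Tokenization (ART), which exploits event sparsity. The authors also design the EvQA benchmark, comprising 1,000 event-Q&A pairs from 22 public datasets, for objective evaluation of event-based MLLMs.

Result: Experiments show that the proposed methods achieve state-of-the-art performance on EvQA, validate the benefit of exploiting event sparsity, and demonstrate the significant potential of MLLMs for event-based visual understanding.

Conclusion: Using reconstruction as a bridge, the study addresses the challenge of making event data compatible with frame-based models. The EvQA benchmark provides an objective standard for evaluating event-based MLLMs and demonstrates the promise of MLLMs in event-based vision.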


📄 Abstract

Integrating event cameras with Multimodal Large Language Models (MLLMs) promises general scene understanding in challenging visual conditions, yet requires navigating a trade-off between preserving the unique advantages of event data and ensuring compatibility with frame-based models. We address this challenge by using reconstruction as a bridge, proposing a straightforward Frame-based Reconstruction and Tokenization (FRT) method and designing an efficient Adaptive Reconstruction and Tokenization (ART) method that leverages event sparsity. For robust evaluation, we introduce EvQA, the first objective, real-world benchmark for event-based MLLMs, comprising 1,000 event-Q&A pairs from 22 public datasets. Our experiments demonstrate that our methods achieve state-of-the-art performance on EvQA, highlighting the significant potential of MLLMs in event-based vision.

[26] Infinity and Beyond: Compositional Alignment in VAR and Diffusion T2I Models

Hossein Shahabadi, Niki Sepasian, Arash Marioriyad, Ali Sharifi-Zarchi, Mahdieh Soleymani Baghshah

🧩 TL;DR

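This study presents the first systematic comparison of visual autoregressive models and diffusion models on compositional alignment. Comprehensive benchmarking shows that Infinity-8B achieves the strongest compositional alignment, and the results establish unified baselines for future T2I model development.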


📘 Detailed Summary

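Motivation: Achieving compositional alignment between textual descriptions and generated images, covering objects, attributes, and spatial relationships, remains a core challenge for modern text-to-image models. Diffusion models have been widely studied, but the compositional behavior of emerging visual autoregressive models has not been adequately examined; this study aims to fill that gap.

Method: The study benchmarks six T2I systems (SDXL, PixArt-α, Flux-Dev, Flux-Schnell, Infinity-2B, and Infinity-8B) across the full T2I-CompBench++ and GenEval suites, focusing on color and attribute binding, spatial relations, numeracy, and complex multi-object prompts.

Result: Across both benchmarks, Infinity-8B achieves the strongest overall compositional alignment, while Infinity-2B matches or exceeds larger diffusion models in several categories, indicating a favorable efficiency-performance trade-off. In contrast, SDXL and PixArt-α show persistent weaknesses in attribute-sensitive and spatial tasks.

Conclusion: The study provides the first systematic comparison of VAR and diffusion approaches to compositional alignment and establishes unified baselines for future T2I model development. The results indicate that visual autoregressive models have a clear advantage on compositional alignment tasks, with the Infinity models in particular striking a good balance between performance and efficiency.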


📄 Abstract

Achieving compositional alignment between textual descriptions and generated images (covering objects, attributes, and spatial relationships) remains a core challenge for modern text-to-image (T2I) models. Although diffusion-based architectures have been widely studied, the compositional behavior of emerging Visual Autoregressive (VAR) models is still largely unexamined. We benchmark six diverse T2I systems (SDXL, PixArt-$α$, Flux-Dev, Flux-Schnell, Infinity-2B, and Infinity-8B) across the full T2I-CompBench++ and GenEval suites, evaluating alignment in color and attribute binding, spatial relations, numeracy, and complex multi-object prompts. Across both benchmarks, Infinity-8B achieves the strongest overall compositional alignment, while Infinity-2B also matches or exceeds larger diffusion models in several categories, highlighting favorable efficiency-performance trade-offs. In contrast, SDXL and PixArt-$α$ show persistent weaknesses in attribute-sensitive and spatial tasks. These results provide the first systematic comparison of VAR and diffusion approaches to compositional alignment and establish unified baselines for the future development of T2I models.

[27] Embodied Image Compression

Chunyi Li, Rui Qing, Jianbo Zhang, Yuan Tian, Xiangyang Zhu, Zicheng Zhang, Xiaohong Liu, Weisi Lin, Guangtao Zhai

🧩 TL;DR

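This paper introduces, for the first time, the scientific problem of Embodied Image Compression. Targeting the communication constraints of embodied agents in real-world environments, it establishes a standardized benchmark, EmbodiedComp, and shows that existing Vision-Language-Action models cannot reliably perform simple manipulation tasks below the embodied bitrate threshold.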


📘 Detailed Summary

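Motivation: With the rapid evolution of machine intelligence, the target of compression has shifted from task-specific virtual models to embodied agents operating in real-world environments. Existing image compression methods cannot satisfy the communication constraints and real-time task execution needs of embodied AI in multi-agent systems, which motivates the authors to pose the problem of Embodied Image Compression.

Method: The paper establishes a standardized benchmark, EmbodiedComp, for systematic evaluation under ultra-low bitrate conditions in a closed-loop setting, and evaluates existing Vision-Language-Action models under compression through extensive empirical studies in both simulated and real-world environments.

Result: The experiments show that existing Vision-Language-Action models cannot reliably perform even simple manipulation tasks when images are compressed below the embodied bitrate threshold, revealing a serious shortfall of current methods in meeting the communication needs of embodied agents.

Conclusion: The authors expect EmbodiedComp to drive the development of domain-specific compression for embodied agents and accelerate real-world deployment of embodied AI, providing an evaluation framework and research direction for visual data handling by embodied agents under constrained communication.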


📄 Abstract

Image Compression for Machines (ICM) has emerged as a pivotal research direction in the field of visual data compression. However, with the rapid evolution of machine intelligence, the target of compression has shifted from task-specific virtual models to Embodied agents operating in real-world environments. To address the communication constraints of Embodied AI in multi-agent systems and ensure real-time task execution, this paper introduces, for the first time, the scientific problem of Embodied Image Compression. We establish a standardized benchmark, EmbodiedComp, to facilitate systematic evaluation under ultra-low bitrate conditions in a closed-loop setting. Through extensive empirical studies in both simulated and real-world settings, we demonstrate that existing Vision-Language-Action models (VLAs) fail to reliably perform even simple manipulation tasks when compressed below the Embodied bitrate threshold. We anticipate that EmbodiedComp will catalyze the development of domain-specific compression tailored for Embodied agents, thereby accelerating the deployment of Embodied AI in the real world.

[28] Kinetic Mining in Context: Few-Shot Action Synthesis via Text-to-Motion Distillation

Luca Cazzola, Ahed Alboody

🧩 TL;DR

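This paper proposes KineMIC, a transfer learning framework that adapts a general text-to-motion generative model to the human activity recognition domain, using CLIP text embeddings to establish semantic correspondences for few-shot action synthesis and markedly improving recognition performance.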


📘 Detailed Summary

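Motivation: The cost of acquiring large annotated motion datasets is a critical bottleneck for skeleton-based human activity recognition. Text-to-motion generative models offer a scalable source of synthetic data, but their training objectives and dataset structures differ substantially from the kinematic precision and class discriminability that activity recognition requires, so generalist T2M models cannot generate motion data suited to recognition classifiers.

Method: KineMIC is a transfer learning framework built on the hypothesis that semantic correspondences in the text encoding space can provide soft supervision for kinematic distillation. A kinetic mining strategy uses CLIP text embeddings to establish correspondences between sparse activity recognition labels and T2M source data, guiding fine-tuning that turns the generalist T2M backbone into a specialized few-shot Action-to-Motion generator.

Result: With HumanML3D as the source T2M dataset and a subset of NTU RGB+D 120 as the target recognition domain, and using only 10 samples per action class, KineMIC generates significantly more coherent motions and, as a data augmentation source, delivers an accuracy improvement of +23.1 percentage points.

Conclusion: The study demonstrates the effectiveness of kinematic distillation through semantic correspondences and provides a practical framework for few-shot action synthesis. It substantially narrows the domain gap between generalist text-to-motion models and the needs of activity recognition, offering a viable data augmentation solution for data-scarce settings.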


📄 Abstract

The acquisition cost for large, annotated motion datasets remains a critical bottleneck for skeletal-based Human Activity Recognition (HAR). Although Text-to-Motion (T2M) generative models offer a compelling, scalable source of synthetic data, their training objectives, which emphasize general artistic motion, and dataset structures fundamentally differ from HAR's requirements for kinematically precise, class-discriminative actions. This disparity creates a significant domain gap, making generalist T2M models ill-equipped for generating motions suitable for HAR classifiers. To address this challenge, we propose KineMIC (Kinetic Mining In Context), a transfer learning framework for few-shot action synthesis. KineMIC adapts a T2M diffusion model to an HAR domain by hypothesizing that semantic correspondences in the text encoding space can provide soft supervision for kinematic distillation. We operationalize this via a kinetic mining strategy that leverages CLIP text embeddings to establish correspondences between sparse HAR labels and T2M source data. This process guides fine-tuning, transforming the generalist T2M backbone into a specialized few-shot Action-to-Motion generator. We validate KineMIC using HumanML3D as the source T2M dataset and a subset of NTU RGB+D 120 as the target HAR domain, randomly selecting just 10 samples per action class. Our approach generates significantly more coherent motions, providing a robust data augmentation source that delivers an accuracy improvement of +23.1 percentage points. Animated illustrations and supplementary materials are available at https://lucazzola.github.io/publications/kinemic.

[29] Cross-modal Context-aware Learning for Visual Prompt Guided Multimodal Image Understanding in Remote Sensing

Xu Zhang, Jiabin Fang, Zhuoming Ding, Jin Yuan, Xuan Liu, Qianjun Zhang, Zhiyong Li

🧩 TL;DR

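This paper proposes CLV-Net, a framework for visual-prompt-guided multimodal understanding of remote sensing images that uses a simple user-supplied visual prompt (a bounding box) to generate correlated segmentation masks and captions, capturing user intent and modeling inter-object relationships.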


📘 Detailed Summary

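Motivation: Existing LLM-based methods for multimodal remote sensing image understanding struggle to steer the model to user-relevant regions when only simple text prompts are available. In addition, many objects in large-scale aerial imagery look highly similar and carry rich inter-object relationships, which further hinders accurate recognition.

Method: CLV-Net's core design includes a Context-Aware Mask Decoder that models and integrates inter-object relationships to strengthen target representations, and a Semantic and Relationship Alignment module comprising a Cross-modal Semantic Consistency Loss, which enhances fine-grained discrimination among visually similar targets, and a Relationship Consistency Loss, which enforces alignment between textual relations and visual interactions.

Result: Comprehensive experiments on two benchmark datasets show that CLV-Net outperforms existing methods and sets new state-of-the-art results. The model effectively captures user intent and produces precise, intention-aligned multimodal outputs.

Conclusion: The study demonstrates the effectiveness of visual-prompt-guided multimodal understanding in remote sensing. Modeling inter-object relationships and cross-modal alignment markedly improves intent awareness and offers a new paradigm for interactive image analysis in complex scenes.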


📄 Abstract

Recent advances in image understanding have enabled methods that leverage large language models for multimodal reasoning in remote sensing. However, existing approaches still struggle to steer models to the user-relevant regions when only simple, generic text prompts are available. Moreover, in large-scale aerial imagery many objects exhibit highly similar visual appearances and carry rich inter-object relationships, which further complicates accurate recognition. To address these challenges, we propose Cross-modal Context-aware Learning for Visual Prompt-Guided Multimodal Image Understanding (CLV-Net). CLV-Net lets users supply a simple visual cue, a bounding box, to indicate a region of interest, and uses that cue to guide the model to generate correlated segmentation masks and captions that faithfully reflect user intent. Central to our design is a Context-Aware Mask Decoder that models and integrates inter-object relationships to strengthen target representations and improve mask quality. In addition, we introduce a Semantic and Relationship Alignment module: a Cross-modal Semantic Consistency Loss enhances fine-grained discrimination among visually similar targets, while a Relationship Consistency Loss enforces alignment between textual relations and visual interactions. Comprehensive experiments on two benchmark datasets show that CLV-Net outperforms existing methods and establishes new state-of-the-art results. The model effectively captures user intent and produces precise, intention-aligned multimodal outputs.

[30] Depth-Copy-Paste: Multimodal and Depth-Aware Compositing for Robust Face Detection

Qiushi Guo

🧩 TL;DR

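This paper proposes Depth Copy Paste, a multimodal, depth-aware data augmentation framework that combines semantic matching, precise segmentation, and depth-guided placement to generate physically consistent and visually realistic training samples for face detection, significantly improving detection performance.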


📘 Detailed Summary

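Motivation: Traditional copy-paste augmentation has clear shortcomings when generating face detection training data, including inaccurate foreground extraction, inconsistent scene geometry, and mismatched background semantics. The resulting composites are unrealistic and physically inconsistent, which limits gains in detector robustness.

Method: The method is a multimodal, depth-aware augmentation framework. It first uses BLIP and CLIP to jointly assess semantic and visual coherence and automatically retrieve the best-matching background images. It then integrates SAM3 for precise segmentation and Depth-Anything to extract the non-occluded visible regions. Finally, a depth-guided sliding-window placement mechanism searches the background depth map for paste locations with optimal depth continuity and scale alignment.

Result: Experiments show that the composites produced by Depth Copy Paste have markedly better depth relationships and visual realism and provide more diverse and realistic training data for downstream face detection, yielding significant performance gains over traditional copy-paste and depth-free augmentation methods.

Conclusion: The study shows that an augmentation framework combining semantic matching, precise segmentation, and depth awareness can generate physically consistent training samples, providing a more effective data augmentation scheme for face detection in complex environments and advancing multimodal data augmentation.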


📄 Abstract

Data augmentation is crucial for improving the robustness of face detection systems, especially under challenging conditions such as occlusion, illumination variation, and complex environments. Traditional copy-paste augmentation often produces unrealistic composites due to inaccurate foreground extraction, inconsistent scene geometry, and mismatched background semantics. To address these limitations, we propose Depth Copy Paste, a multimodal and depth-aware augmentation framework that generates diverse and physically consistent face detection training samples by copying full-body person instances and pasting them into semantically compatible scenes. Our approach first employs BLIP and CLIP to jointly assess semantic and visual coherence, enabling automatic retrieval of the most suitable background images for the given foreground person. To ensure high-quality foreground masks that preserve facial details, we integrate SAM3 for precise segmentation and Depth-Anything to extract only the non-occluded visible person regions, preventing corrupted facial textures from being used in augmentation. For geometric realism, we introduce a depth-guided sliding-window placement mechanism that searches over the background depth map to identify paste locations with optimal depth continuity and scale alignment. The resulting composites exhibit natural depth relationships and improved visual plausibility. Extensive experiments show that Depth Copy Paste provides more diverse and realistic training data, leading to significant performance improvements in downstream face detection tasks compared with traditional copy-paste and depth-free augmentation methods.
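The depth-guided sliding-window search can be sketched as an exhaustive scan over the background depth map. The version below is a simplified illustration, not the paper's procedure: the depth map is a plain list of lists, and the cost (distance of the window's mean depth from a target depth plus the window's depth standard deviation as a stand-in for "depth continuity") is an assumed scoring rule. Scale alignment is not modeled.

```python
import math


def best_paste_location(depth_map, win_h, win_w, target_depth, stride=1):
    """Return (top, left) of the window whose depth best matches `target_depth`.

    depth_map: 2D list of floats (hypothetical background depth estimates).
    The cost favours windows that are flat in depth and close to the target.
    """
    rows, cols = len(depth_map), len(depth_map[0])
    best = None
    for top in range(0, rows - win_h + 1, stride):
        for left in range(0, cols - win_w + 1, stride):
            values = [depth_map[r][c]
                      for r in range(top, top + win_h)
                      for c in range(left, left + win_w)]
            mean = sum(values) / len(values)
            std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
            cost = abs(mean - target_depth) + std
            if best is None or cost < best[0]:
                best = (cost, top, left)
    if best is None:
        raise ValueError("window does not fit inside the depth map")
    return best[1], best[2]


background_depth = [
    [5.0, 5.0, 5.0, 5.0],
    [5.0, 5.0, 5.0, 5.0],
    [5.0, 5.0, 2.0, 2.0],
    [5.0, 5.0, 2.0, 2.0],
]
location = best_paste_location(background_depth, win_h=2, win_w=2, target_depth=2.0)
```

For a foreground person estimated at depth 2.0, the scan selects the flat bottom-right patch at `(2, 2)`, since every other window either sits at depth 5.0 or straddles a depth edge.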

[31] EditMGT: Unleashing Potentials of Masked Generative Transformers in Image Editing

Wei Chow, Linfeng Li, Lingdong Kong, Zefeng Li, Qi Xu, Hang Song, Tian Ye, Xian Wang, Jinbin Bai, Shilin Xu, Xiangtai Li, Junting Pan, Shaoteng Liu, Ran Zhou, Tianshu Yang, Songhua Liu

🧩 TL;DR

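This paper presents EditMGT, the first image editing framework based on Masked Generative Transformers (MGTs). It exploits the localized decoding of MGTs to avoid the unintended changes to non-target regions that diffusion models make during local edits, giving more precise region control.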


📘 Detailed Summary

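Motivation: Diffusion models achieve excellent visual quality in image editing, but their global denoising dynamics inherently conflate local editing targets with the full-image context, causing unintended modifications in non-target regions. The paper explores Masked Generative Transformers as an alternative, using their localized decoding paradigm to explicitly preserve non-relevant regions during editing.

Method: EditMGT first uses the MGT's cross-attention maps as informative localization signals and applies a multi-layer attention consolidation scheme that refines these maps for fine-grained, precise localization. On top of this it introduces region-hold sampling, which restricts token flipping within low-attention areas to suppress spurious edits, confining modifications to the target regions and preserving the surrounding non-target areas. For training, the authors construct CrispEdit-2M, a high-resolution dataset, and adapt a pre-trained text-to-image MGT into an image editing model through attention injection.

Result: Extensive experiments on four standard benchmarks show that, with fewer than 1B parameters, EditMGT achieves comparable similarity performance while editing 6 times faster. It delivers comparable or superior editing quality, with improvements of 3.6% and 17.6% on style change and style transfer tasks, respectively.

Conclusion: The study demonstrates the potential of Masked Generative Transformers for image editing; their localized decoding paradigm is a promising alternative to the global editing behavior of diffusion models. EditMGT shows how MGT attention can deliver precise region control while keeping editing efficient, opening a new direction for locality-aware image editing.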


📄 Abstract

Recent advances in diffusion models (DMs) have achieved exceptional visual quality in image editing tasks. However, the global denoising dynamics of DMs inherently conflate local editing targets with the full-image context, leading to unintended modifications in non-target regions. In this paper, we shift our attention beyond DMs and turn to Masked Generative Transformers (MGTs) as an alternative approach to tackle this challenge. By predicting multiple masked tokens rather than holistic refinement, MGTs exhibit a localized decoding paradigm that endows them with the inherent capacity to explicitly preserve non-relevant regions during the editing process. Building upon this insight, we introduce the first MGT-based image editing framework, termed EditMGT. We first demonstrate that MGT's cross-attention maps provide informative localization signals for localizing edit-relevant regions and devise a multi-layer attention consolidation scheme that refines these maps to achieve fine-grained and precise localization. On top of these adaptive localization results, we introduce region-hold sampling, which restricts token flipping within low-attention areas to suppress spurious edits, thereby confining modifications to the intended target regions and preserving the integrity of surrounding non-target areas. To train EditMGT, we construct CrispEdit-2M, a high-resolution dataset spanning seven diverse editing categories. Without introducing additional parameters, we adapt a pre-trained text-to-image MGT into an image editing model through attention injection. Extensive experiments across four standard benchmarks demonstrate that, with fewer than 1B parameters, our model achieves similarity performance while enabling 6 times faster editing. Moreover, it delivers comparable or superior editing quality, with improvements of 3.6% and 17.6% on style change and style transfer tasks, respectively.
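The gating idea behind region-hold sampling can be shown in a few lines: tokens whose consolidated attention is low keep their source value, and only high-attention tokens may take the newly predicted value. This is a deliberately reduced sketch under assumptions; the function name, the fixed `threshold`, and the flat token lists are hypothetical, and the actual sampler presumably operates inside an iterative masked-token decoding loop that the abstract does not detail.

```python
def region_hold_step(source_tokens, proposed_tokens, attention, threshold=0.5):
    """Accept proposed tokens only where attention marks an edit-relevant region.

    source_tokens, proposed_tokens: lists of token ids for the same positions.
    attention: per-position scores in [0, 1] (hypothetical consolidated map).
    Positions with attention below `threshold` are held at their source token.
    """
    if not (len(source_tokens) == len(proposed_tokens) == len(attention)):
        raise ValueError("all inputs must have the same length")
    return [new if score >= threshold else old
            for old, new, score in zip(source_tokens, proposed_tokens, attention)]


source = [1, 2, 3, 4]
proposed = [9, 8, 7, 6]
attention_map = [0.1, 0.9, 0.6, 0.2]
edited = region_hold_step(source, proposed, attention_map)
```

Only the two middle positions exceed the threshold, so `edited` is `[1, 8, 7, 4]`: the first and last tokens are preserved exactly, which mirrors how the method is described as protecting non-target regions.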

[32] Referring Change Detection in Remote Sensing Imagery

Yilmaz Korkmaz, Jay N. Paranjape, Celso M. de Melo, Vishal M. Patel

🧩 TL;DR

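This paper proposes the Referring Change Detection (RCD) framework, which uses natural language prompts to detect specific classes of change in remote sensing imagery, and introduces RCDNet, a cross-modal fusion network, and RCDGen, a diffusion-based synthetic data generation pipeline, to address the scarcity of annotated data and class imbalance.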


📘 Detailed Summary

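Motivation: Traditional change detection methods cannot distinguish between types of change, while semantic change detection methods rely on rigid class definitions and fixed model architectures. Because the output channels are tightly coupled to the number and type of semantic classes, it is difficult to mix datasets with different label sets or to reuse models across tasks. Existing methods therefore cannot meet users' need to detect specific types of change.

Method: The paper proposes a two-stage framework. The first stage is RCDNet, a cross-modal fusion network for referring change detection that combines natural language understanding with visual analysis. The second stage is RCDGen, a diffusion-based synthetic data generation pipeline that produces realistic post-change images and change maps for a specified category from the pre-change image alone, without relying on semantic segmentation masks.

Result: Experiments on multiple datasets show that the framework enables scalable and targeted change detection. RCDGen significantly lowers the barrier to large-scale data creation and generates high-quality training data, effectively mitigating annotated-data scarcity and class imbalance.

Conclusion: By introducing natural language prompts and synthetic data generation, this work provides a more flexible and scalable solution for remote sensing change detection. The RCD framework lets users specify the exact type of change they are interested in, overcoming the limitations of traditional methods in class definition and model reuse and offering greater adaptability for practical applications.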


📄 Abstract

Change detection in remote sensing imagery is essential for applications such as urban planning, environmental monitoring, and disaster management. Traditional change detection methods typically identify all changes between two temporal images without distinguishing the types of transitions, which can lead to results that may not align with specific user needs. Although semantic change detection methods have attempted to address this by categorizing changes into predefined classes, these methods rely on rigid class definitions and fixed model architectures, making it difficult to mix datasets with different label sets or reuse models across tasks, as the output channels are tightly coupled with the number and type of semantic classes. To overcome these limitations, we introduce Referring Change Detection (RCD), which leverages natural language prompts to detect specific classes of changes in remote sensing images. By integrating language understanding with visual analysis, our approach allows users to specify the exact type of change they are interested in. However, training models for RCD is challenging due to the limited availability of annotated data and severe class imbalance in existing datasets. To address this, we propose a two-stage framework consisting of (I) RCDNet, a cross-modal fusion network designed for referring change detection, and (II) RCDGen, a diffusion-based synthetic data generation pipeline that produces realistic post-change images and change maps for a specified category using only the pre-change image, without relying on semantic segmentation masks and thereby significantly lowering the barrier to scalable data creation. Experiments across multiple datasets show that our framework enables scalable and targeted change detection. Project website is here: https://yilmazkorkmaz1.github.io/RCD.

[33] Reframing Music-Driven 2D Dance Pose Generation as Multi-Channel Image Generation

Yan Zhang, Han Zou, Lincong Feng, Cong Xie, Ruiqi Yu, Zhenpeng Zhan

🧩 TL;DR

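This paper proposes a novel music-to-dance generation method that encodes 2D pose sequences as one-hot images, compresses them with a pretrained image VAE, and models them with a DiT-style backbone to better capture high-variance pose distributions. With a time-shared indexing scheme and a reference-pose conditioning strategy, the method significantly improves the temporal coherence and rhythm alignment of the generated dance poses.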


📘 Detailed Summary

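Motivation: Existing pose-to-video models can translate 2D pose sequences into photorealistic, identity-preserving dance videos, so the key challenge is generating temporally coherent, rhythm-aligned 2D poses from music, especially under complex, high-variance in-the-wild distributions. This work targets the temporal coherence and musical rhythm alignment of pose sequences in music-to-dance generation.

Method: The method reframes music-to-dance generation as a music-token-conditioned multi-channel image synthesis problem: 2D pose sequences are encoded as one-hot images, compressed by a pretrained image VAE, and modeled with a DiT-style backbone. On top of this, it introduces a time-shared temporal indexing scheme that explicitly synchronizes music tokens and pose latents, and a reference-pose conditioning strategy that preserves subject-specific body proportions and on-screen scale while supporting long-horizon segment-and-stitch generation.

Result: On a large in-the-wild 2D dance corpus and the calibrated AIST++2D benchmark, the method outperforms representative music-to-dance methods in pose-space and video-space metrics and in human preference. Ablations validate the contributions of the representation, the temporal indexing, and the reference conditioning, demonstrating the ability to generate high-quality dance poses under complex distributions.

Conclusion: By reframing music-to-dance generation as an image synthesis problem, the work inherits the architectural and training advances of modern text-to-image models and significantly improves the quality and consistency of pose generation. The proposed temporal indexing and reference conditioning strategies offer effective solutions for long-horizon generation and subject preservation, opening a new technical path for music-driven dance generation.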


📄 Abstract

Recent pose-to-video models can translate 2D pose sequences into photorealistic, identity-preserving dance videos, so the key challenge is to generate temporally coherent, rhythm-aligned 2D poses from music, especially under complex, high-variance in-the-wild distributions. We address this by reframing music-to-dance generation as a music-token-conditioned multi-channel image synthesis problem: 2D pose sequences are encoded as one-hot images, compressed by a pretrained image VAE, and modeled with a DiT-style backbone, allowing us to inherit architectural and training advances from modern text-to-image models and better capture high-variance 2D pose distributions. On top of this formulation, we introduce (i) a time-shared temporal indexing scheme that explicitly synchronizes music tokens and pose latents over time and (ii) a reference-pose conditioning strategy that preserves subject-specific body proportions and on-screen scale while enabling long-horizon segment-and-stitch generation. Experiments on a large in-the-wild 2D dance corpus and the calibrated AIST++2D benchmark show consistent improvements over representative music-to-dance methods in pose- and video-space metrics and human preference, and ablations validate the contributions of the representation, temporal indexing, and reference conditioning. See supplementary videos at https://hot-dance.github.io
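To make the "one-hot image" representation more concrete, here is a minimal sketch of one possible encoding for a single frame. The abstract does not specify the layout, so everything here is an assumption: one channel per joint, normalized coordinates in [0, 1), and a single 1 at the quantized joint position. The function names are hypothetical.

```python
def pose_to_one_hot(joints, height, width):
    """Encode joints as one channel each, with a single 1 at the joint's grid cell.

    joints: list of (x, y) pairs with normalized coordinates in [0, 1).
    Returns nested lists indexed as channels[joint][row][col].
    This layout is an illustrative assumption, not the paper's specification.
    """
    channels = []
    for x, y in joints:
        if not (0.0 <= x < 1.0 and 0.0 <= y < 1.0):
            raise ValueError("coordinates must lie in [0, 1)")
        grid = [[0] * width for _ in range(height)]
        grid[int(y * height)][int(x * width)] = 1
        channels.append(grid)
    return channels


def one_hot_to_pose(channels):
    """Decode each channel back to the centre of its active grid cell."""
    joints = []
    for grid in channels:
        height, width = len(grid), len(grid[0])
        for row in range(height):
            for col in range(width):
                if grid[row][col] == 1:
                    joints.append(((col + 0.5) / width, (row + 0.5) / height))
    return joints


channels = pose_to_one_hot([(0.25, 0.5), (0.9, 0.1)], height=4, width=4)
print(one_hot_to_pose(channels))  # [(0.375, 0.625), (0.875, 0.125)]
```

Decoding returns cell centres rather than the original coordinates, which shows the quantization cost of any grid-based encoding: a finer grid reduces the error at the price of larger images.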

[34] SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder

Minglei Shi, Haolin Wang, Borui Zhang, Wenzhao Zheng, Bohan Zeng, Ziyang Yuan, Xiaoshi Wu, Yuanxing Zhang, Huan Yang, Xintao Wang, Pengfei Wan, Kun Gai, Jie Zhou, Jiwen Lu

🧩 TL;DR

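This work proposes SVG-T2I, a framework that extends self-supervised visual foundation model (VFM) representations to text-to-image generation, achieving high-quality image synthesis in the VFM feature space, and open-sources the full training and inference pipeline.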


📘 Detailed Summary

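Motivation: Although visual generation grounded in visual foundation model representations offers a unified path for integrating visual understanding, perception, and generation, training large-scale text-to-image diffusion models entirely within the VFM representation space remains underexplored. This work aims to fill that gap.

Method: The work scales the SVG framework into SVG-T2I, which uses a standard text-to-image diffusion pipeline to synthesize images directly in the VFM feature domain, building visual generation on self-supervised representations.

Result: SVG-T2I reaches 0.75 on GenEval and 85.78 on DPG-Bench, performance competitive with existing methods, which validates the intrinsic representational power of VFM representations for generative tasks.

Conclusion: The work confirms the effectiveness of visual foundation model representations for generative tasks. By open-sourcing the complete project, including the autoencoder, the generation model, the training and inference pipelines, and pretrained weights, it provides an important foundation for research on representation-driven visual generation.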


📄 Abstract

Visual generation grounded in Visual Foundation Model (VFM) representations offers a highly promising unified pathway for integrating visual understanding, perception, and generation. Despite this potential, training large-scale text-to-image diffusion models entirely within the VFM representation space remains largely unexplored. To bridge this gap, we scale the SVG (Self-supervised representations for Visual Generation) framework, proposing SVG-T2I to support high-quality text-to-image synthesis directly in the VFM feature domain. By leveraging a standard text-to-image diffusion pipeline, SVG-T2I achieves competitive performance, reaching 0.75 on GenEval and 85.78 on DPG-Bench. This performance validates the intrinsic representational power of VFMs for generative tasks. We fully open-source the project, including the autoencoder and generation model, together with their training, inference, evaluation pipelines, and pre-trained weights, to facilitate further research in representation-driven visual generation.

[35] MatAnyone 2: Scaling Video Matting via a Learned Quality Evaluator

Peiqing Yang, Shangchen Zhou, Kai Hao, Qingyi Tao

🧩 TL;DR

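This paper proposes a Matting Quality Evaluator (MQE) that requires no ground-truth annotation, uses it to build VMReal, a large-scale real-world video matting dataset, and develops the MatAnyone 2 model, which achieves state-of-the-art performance on synthetic and real-world benchmarks.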


📘 Detailed Summary

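Motivation: Video matting is limited by the scale and realism of existing datasets. Although segmentation data can improve semantic stability, the lack of effective boundary supervision often yields segmentation-like mattes without fine detail, which hinders the development of high-quality video matting models.

Method: The paper proposes a Matting Quality Evaluator (MQE) that needs no ground truth. It assesses the semantic and boundary quality of alpha mattes and produces a pixel-wise evaluation map identifying reliable and erroneous regions. The MQE scales up video matting in two ways: as an online quality-feedback mechanism during training that suppresses erroneous regions, and as an offline selection module for data curation. The authors also introduce a reference-frame training strategy to handle appearance variations in long videos, and build VMReal, a large-scale real-world video matting dataset containing 28K clips and 2.4M frames.

Result: MatAnyone 2, developed with the MQE and the VMReal dataset, achieves state-of-the-art performance on both synthetic and real-world benchmarks, surpassing prior methods on all evaluation metrics. The model handles appearance variations in long videos effectively and produces high-quality mattes with fine detail.

Conclusion: By introducing a ground-truth-free matting quality evaluator and building a large-scale real-world dataset, this work significantly improves the quality and practicality of video matting. The approach provides an effective quality-supervision mechanism for video matting and opens a new route to building high-quality training data, giving it substantial practical value.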


📄 Abstract

Video matting remains limited by the scale and realism of existing datasets. While leveraging segmentation data can enhance semantic stability, the lack of effective boundary supervision often leads to segmentation-like mattes lacking fine details. To this end, we introduce a learned Matting Quality Evaluator (MQE) that assesses semantic and boundary quality of alpha mattes without ground truth. It produces a pixel-wise evaluation map that identifies reliable and erroneous regions, enabling fine-grained quality assessment. The MQE scales up video matting in two ways: (1) as an online matting-quality feedback during training to suppress erroneous regions, providing comprehensive supervision, and (2) as an offline selection module for data curation, improving annotation quality by combining the strengths of leading video and image matting models. This process allows us to build a large-scale real-world video matting dataset, VMReal, containing 28K clips and 2.4M frames. To handle large appearance variations in long videos, we introduce a reference-frame training strategy that incorporates long-range frames beyond the local window for effective training. Our MatAnyone 2 achieves state-of-the-art performance on both synthetic and real-world benchmarks, surpassing prior methods across all metrics.

cs.CL [Back]

[36] MultiScript30k: Leveraging Multilingual Embeddings to Extend Cross Script Parallel Data

Christopher Driggers-Ellis, Detravious Brinkley, Ray Chen, Aashish Dhawan, Daisy Zhe Wang, Christan Grant

🧩 TL;DR

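This paper presents MultiScript30k, a new extension of the Multi30k dataset created by translating the English version of Multi30k into languages in multiple scripts with the NLLB200-3.3B model, to address the restriction of existing multimodal machine translation datasets to a few European languages and the Latin script.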


📘 Detailed Summary

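Motivation: The existing Multi30k dataset supports only four European languages, Czech, English, French, and German, all written in Latin script, which limits multimodal machine translation research on diverse languages. Because the official dataset represents only European languages, research on non-Latin scripts and globally diverse languages has stalled, and previous extensions still cover very few languages, language families, and scripts.

Method: The work proposes the MultiScript30k dataset, created by translating the English version of Multi30k into several global languages with the NLLB200-3.3B translation model. The dataset contains over 30,000 sentences and provides translations of every sentence in Multi30k-En into Arabic, Spanish, Ukrainian, Simplified Chinese, and Traditional Chinese, covering multiple scripts.

Result: Similarity analysis shows that, for all supported languages except Traditional Chinese, the MultiScript30k extension achieves cosine similarity greater than 0.8 and symmetric KL divergence less than 0.000251, comparable to the previous extensions ArEnMulti30k and Multi30k-Uk. COMETKiwi evaluation gives mixed results: ArEnMulti30k and MultiScript30k-Ar score nearly the same, but Multi30k-Uk scores 6.4% higher than MultiScript30k-Uk.

Conclusion: MultiScript30k gives multimodal machine translation research broader language coverage, particularly for non-Latin-script languages, and helps advance the field on diverse languages. Although it trails existing extensions on some languages, the dataset is an important resource for studying multimodal translation of global languages and shows that extending existing datasets with large-scale translation models is feasible.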


📄 Abstract

Multi30k is frequently cited in the multimodal machine translation (MMT) literature, offering parallel text data for training and fine-tuning deep learning models. However, it is limited to four languages: Czech, English, French, and German. This restriction has led many researchers to focus their investigations only on these languages. As a result, MMT research on diverse languages has been stalled because the official Multi30k dataset only represents European languages in Latin scripts. Previous efforts to extend Multi30k exist, but the list of supported languages, represented language families, and scripts is still very short. To address these issues, we propose MultiScript30k, a new Multi30k dataset extension for global languages in various scripts, created by translating the English version of Multi30k (Multi30k-En) using NLLB200-3.3B. The dataset consists of over 30000 sentences and provides translations of all sentences in Multi30k-En into Ar, Es, Uk, Zh_Hans and Zh_Hant. Similarity analysis shows that the MultiScript30k extension consistently achieves greater than 0.8 cosine similarity and symmetric KL divergence less than 0.000251 for all languages supported except Zh_Hant, which is comparable to the previous Multi30k extensions ArEnMulti30k and Multi30k-Uk. COMETKiwi scores reveal mixed assessments of MultiScript30k as a translation of Multi30k-En in comparison to the related work. ArEnMulti30k scores nearly equal MultiScript30k-Ar, but Multi30k-Uk scores 6.4% greater than MultiScript30k-Uk per split.
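For readers unfamiliar with the two similarity measures named in the abstract, the sketch below computes them on toy inputs. It assumes that cosine similarity is taken between embedding vectors and that symmetric KL divergence is the sum of the two directed divergences between discrete distributions; the abstract does not say how the distributions are derived or whether a sum or an average is used, so treat both choices as assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p, q):
    """Sum of both directed divergences (one common symmetrization)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


# Toy embeddings and distributions, chosen only for illustration.
print(cosine_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 2.5]))
print(symmetric_kl([0.5, 0.3, 0.2], [0.45, 0.35, 0.2]))
```

A cosine similarity near 1 together with a symmetric KL divergence near 0 indicates that the two representations are close, which is the pattern the reported thresholds (above 0.8 and below 0.000251) describe.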

[37] Minimal Clips, Maximum Salience: Long Video Summarization via Key Moment Extraction

Galann Pennec, Zhengyuan Liu, Nicholas Asher, Philippe Muller, Nancy F. Chen

🧩 TL;DR

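This paper proposes a multimodal summarization method for long videos in which a lightweight video captioning model and a large language model cooperate to select key video clips, substantially improving summary quality while keeping computational cost low.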


📘 Detailed Summary

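Motivation: As vision-language models process longer videos, important visual information is easily lost across the full context. Tools are also needed that analyze long video content cost-effectively, addressing the information loss and high computational cost of existing long-video summarization methods.

Method: The method divides the video into short clips and uses a lightweight video captioning model to generate a compact visual description of each. These descriptions are passed to a large language model, which selects the K clips containing the most relevant visual information for building a multimodal summary, balancing computational efficiency and summary quality.

Result: Evaluation on the MovieSum dataset shows that summaries produced by the method perform close to those built from the reference clips derived from human annotations (which make up less than 6% of each movie), while capturing more relevant video information than random clip selection, validating the cooperation between the lightweight captioning model and the LLM.

Conclusion: The study shows that, with a carefully designed clip selection mechanism, a small number of key clips suffices to build a complete multimodal summary of a movie. This offers an efficient, practical approach to long-video analysis and demonstrates that lightweight models and large language models can work together.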


📄 Abstract

Vision-Language Models (VLMs) are able to process increasingly long videos. Yet, important visual information is easily lost throughout the entire context and missed by VLMs. Also, it is important to design tools that enable cost-effective analysis of lengthy video content. In this paper, we propose a clip selection method that targets key video moments to be included in a multimodal summary. We divide the video into short clips and generate compact visual descriptions of each using a lightweight video captioning model. These are then passed to a large language model (LLM), which selects the K clips containing the most relevant visual information for a multimodal summary. We evaluate our approach on reference clips for the task, automatically derived from full human-annotated screenplays and summaries in the MovieSum dataset. We further show that these reference clips (less than 6% of the movie) are sufficient to build a complete multimodal summary of the movies in MovieSum. Using our clip selection method, we achieve a summarization performance close to that of these reference clips while capturing substantially more relevant video information than random clip selection. Importantly, we maintain low computational cost by relying on a lightweight captioning model.
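The pipeline in the abstract (split, caption, select K) can be sketched as follows. The captioning model and the LLM are replaced by toy stand-in functions, since the abstract names neither the models nor their interfaces; `toy_caption`, `toy_select`, and the clip length are all hypothetical.

```python
def split_into_clips(duration_s, clip_len_s):
    """Return (start, end) pairs in seconds that cover the whole video."""
    clips = []
    start = 0.0
    while start < duration_s:
        clips.append((start, min(start + clip_len_s, duration_s)))
        start += clip_len_s
    return clips


def select_key_clips(clips, caption_fn, select_fn, k):
    """Caption every clip, let the selector pick k indices, return clips in time order."""
    captions = [caption_fn(clip) for clip in clips]
    chosen = select_fn(captions, k)
    return [clips[i] for i in sorted(chosen)]


def toy_caption(clip):
    """Stand-in for a lightweight captioning model (hypothetical)."""
    start, _ = clip
    return "explosion on the bridge" if start == 20.0 else "people talking"


def toy_select(captions, k):
    """Stand-in for the LLM: rank captions by a keyword match (hypothetical)."""
    scores = ["explosion" in caption for caption in captions]
    ranked = sorted(range(len(captions)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


clips = split_into_clips(45.0, 10.0)
print(select_key_clips(clips, toy_caption, toy_select, k=2))
# [(0.0, 10.0), (20.0, 30.0)]
```

Passing the captioner and the selector as functions keeps the expensive components swappable: only short text captions reach the selector, which is where the low computational cost claimed in the abstract would come from.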

[38] Extending a Parliamentary Corpus with MPs' Tweets: Automatic Annotation and Evaluation Using MultiParTweet

Mevlüt Bagci, Ali Abusaleh, Daniel Baumartz, Giuseppe Abrami, Maxim Konca, Alexander Mehler

🧩 TL;DR

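This work presents MultiParTweet, a multilingual tweet corpus linked to a German parliamentary corpus, and the TTLABTweetCrawler data collection tool. The corpus is automatically annotated with nine text-based models and one vision-language model, and the analysis confirms that the models are mutually predictable.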


📘 Detailed Summary

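Motivation: Social media plays a key role in modern politics, but there is no systematic multilingual corpus linking politicians' online discourse to parliamentary debates, which limits comparative analysis and research on cross-platform political communication.

Method: The work builds the MultiParTweet corpus of 39,546 tweets (including 19,056 media items), linked to the German political corpus GerParCor. Nine text-based models and one vision-language model automatically annotate sentiment, emotion, and topic, and the TTLABTweetCrawler tool was developed for collecting data from X.

Result: Experiments show that the outputs of the different models are mutually predictable and that the vision-language model's annotations agree better with human annotators' preferences. The automatic annotations were evaluated against a manually annotated subset, suggesting that multimodal representations align more closely with human interpretation.

Conclusion: The work provides a standardized resource that integrates automatic text and media annotations, demonstrates an analysis of inter-model predictability, and finds that vision-language models have an advantage in multimodal political discourse analysis, supplying a methodological framework and data infrastructure for research on cross-platform political communication.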


📄 Abstract

Social media serves as a critical medium in modern politics because it both reflects politicians' ideologies and facilitates communication with younger generations. We present MultiParTweet, a multilingual tweet corpus from X that connects politicians' social media discourse with the German political corpus GerParCor, thereby enabling comparative analyses between online communication and parliamentary debates. MultiParTweet contains 39 546 tweets, including 19 056 media items. Furthermore, we used nine text-based models and one vision-language model (VLM) to enrich MultiParTweet with emotion, sentiment, and topic annotations. Moreover, the automated annotations are evaluated against a manually annotated subset. MultiParTweet can be reconstructed using our tool, TTLABTweetCrawler, which provides a framework for collecting data from X. As a methodological demonstration, we examine whether each model can be predicted from the outputs of the remaining models. In summary, we provide MultiParTweet, a resource integrating automatic text and media-based annotations validated with human annotations, and TTLABTweetCrawler, a general-purpose X data collection tool. Our analysis shows that the models are mutually predictable. In addition, VLM-based annotations were preferred by human annotators, suggesting that multimodal representations align more with human interpretation.

cs.AI [Back]

[39] CAPTURE: A Benchmark and Evaluation for LVLMs in CAPTCHA Resolving

Jianyi Zhang, Ziyin Zhou, Xu Ji, Shizhao Liu, Zhangchi Zhao

🧩 TL;DR

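This paper introduces CAPTURE, the first CAPTCHA benchmark for large visual language models (LVLMs). It covers 4 main types and 25 sub-types from 31 vendors and is designed to comprehensively evaluate LVLM performance at solving CAPTCHAs.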


📘 Detailed Summary

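Motivation: Existing benchmarks based on visual CAPTCHAs are limited: benchmarks tailored to the specific goals of earlier studies do not cover all CAPTCHA types, and no benchmark is dedicated to large visual language models, which hinders comprehensive evaluation of LVLMs' CAPTCHA-solving ability.

Method: The work proposes a new benchmark named CAPTURE (CAPTCHA for Testing Under Real-world Experiments). It systematically covers 4 main CAPTCHA types and 25 sub-types collected from 31 different vendors, and offers broad class variety, large-scale data, and labels tailored to LVLMs.

Result: When evaluated on the CAPTURE benchmark, current large visual language models perform poorly at solving CAPTCHAs, confirming the limitations of existing LVLMs on diverse real-world CAPTCHA challenges.

Conclusion: The CAPTURE benchmark fills gaps in earlier work in data comprehensiveness and labeling pertinence, provides a multi-dimensional and thorough evaluation framework for LVLMs' CAPTCHA-solving ability, exposes the performance shortfalls of current models in practical scenarios, and serves as an important reference for future model improvement.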


📄 Abstract

Benefiting from strong and efficient multi-modal alignment strategies, Large Visual Language Models (LVLMs) are able to simulate human visual and reasoning capabilities, such as solving CAPTCHAs. However, existing benchmarks based on visual CAPTCHAs still face limitations. Previous studies, when designing benchmarks and datasets, customized them according to their research objectives. Consequently, these benchmarks cannot comprehensively cover all CAPTCHA types. Notably, there is a dearth of dedicated benchmarks for LVLMs. To address this problem, we introduce, for the first time, a novel CAPTCHA benchmark designed specifically for LVLMs, named CAPTURE (CAPTCHA for Testing Under Real-world Experiments). Our benchmark encompasses 4 main CAPTCHA types and 25 sub-types from 31 vendors. This diversity enables a multi-dimensional and thorough evaluation of LVLM performance. CAPTURE features extensive class variety, large-scale data, and unique LVLM-tailored labels, filling the gaps in previous research in terms of data comprehensiveness and labeling pertinence. When evaluated by this benchmark, current LVLMs demonstrate poor performance in solving CAPTCHAs.