cs.CV

[1] Rethinking Chain-of-Thought Reasoning for Videos

Yiwu Zhong, Zi-Yuan Hu, Yin Li, Liwei Wang

🧩 TL;DR

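This paper proposes an efficient post-training and inference framework for video multimodal large language models. By compressing visual tokens and generating concise reasoning traces, it substantially improves inference efficiency while remaining competitive, and it questions whether conventional chain-of-thought reasoning is necessary for video understanding.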


📘 Detailed Summary

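Motivation: Video multimodal large language models built on chain-of-thought reasoning typically rely on lengthy reasoning chains and large numbers of input visual tokens, which makes them computationally inefficient. A benchmark analysis suggests that concise reasoning combined with a small number of visual tokens may be sufficient for effective video reasoning; the study sets out to test this hypothesis and improve model efficiency.

Method: The study designs and validates an efficient post-training and inference framework that lets a video MLLM operate on compressed visual tokens and generate a concise reasoning trace before answering. The method needs no manual chain-of-thought annotations or supervised fine-tuning and yields efficient end-to-end inference.

Result: The framework substantially improves inference efficiency across multiple benchmarks while keeping performance competitive. With fewer visual tokens and shorter reasoning chains, the model still completes video reasoning tasks effectively, which supports the feasibility of concise reasoning.

Conclusion: The results suggest that long, human-like chain-of-thought reasoning may not be necessary for general video understanding, and that concise reasoning can be both effective and efficient. This finding offers a new direction for lightweight video MLLM design and questions the need for complex reasoning patterns in the video domain.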


📄 Abstract

Chain-of-thought (CoT) reasoning has been highly successful in solving complex tasks in natural language processing, and recent multimodal large language models (MLLMs) have extended this paradigm to video reasoning. However, these models typically build on lengthy reasoning chains and large numbers of input visual tokens. Motivated by empirical observations from our benchmark study, we hypothesize that concise reasoning combined with a reduced set of visual tokens can be sufficient for effective video reasoning. To evaluate this hypothesis, we design and validate an efficient post-training and inference framework that enhances a video MLLM's reasoning capability. Our framework enables models to operate on compressed visual tokens and generate brief reasoning traces prior to answering. The resulting models achieve substantially improved inference efficiency, deliver competitive performance across diverse benchmarks, and avoid reliance on manual CoT annotations or supervised fine-tuning. Collectively, our results suggest that long, human-like CoT reasoning may not be necessary for general video reasoning, and that concise reasoning can be both effective and efficient. Our code will be released at https://github.com/LaVi-Lab/Rethink_CoT_Video.
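
The abstract does not say how the visual tokens are compressed. Purely as an illustration of what "operating on compressed visual tokens" can mean, the sketch below averages runs of consecutive token vectors. The names `pool_tokens` and `group` are hypothetical, and average pooling is an assumed stand-in rather than the paper's method.

```python
from typing import List

Vector = List[float]


def pool_tokens(tokens: List[Vector], group: int) -> List[Vector]:
    """Average each run of `group` consecutive token vectors into one vector.

    Hypothetical helper: average pooling is an assumed stand-in for the
    compression step, which the abstract does not describe.
    """
    if group < 1:
        raise ValueError("group must be >= 1")
    pooled: List[Vector] = []
    for start in range(0, len(tokens), group):
        chunk = tokens[start:start + group]  # the last chunk may be shorter
        dim = len(chunk[0])
        pooled.append([sum(vec[d] for vec in chunk) / len(chunk) for d in range(dim)])
    return pooled


# Five toy 2-D "visual tokens" compressed with a group size of 2.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
compressed = pool_tokens(tokens, group=2)
print(len(tokens), "->", len(compressed))
```

A larger `group` trades spatial or temporal detail for fewer tokens, which is the efficiency lever the abstract refers to.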

[2] MedForget: Hierarchy-Aware Multimodal Unlearning Testbed for Medical AI

Fengli Wu, Vaidehi Patil, Jaehong Yoon, Yue Zhang, Mohit Bansal

🧩 TL;DR

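This paper introduces MedForget, a hierarchy-aware multimodal unlearning testbed for systematically evaluating selective unlearning in medical multimodal large language models, and it exposes the limitations of existing unlearning methods in complex medical hierarchies.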


📘 Detailed Summary

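Motivation: Pretrained multimodal large language models are widely used in medical AI systems, but their training involves sensitive patient data, which raises privacy and compliance challenges under regulations such as HIPAA and GDPR, in particular the "right to be forgotten". The effectiveness of existing unlearning methods in complex medical settings has not been adequately explored, so a systematic evaluation framework is needed to study selective unlearning in hierarchical medical data.

Method: The study introduces the MedForget testbed, which models hospital data as a nested hierarchy (Institution → Patient → Study → Section) and contains 3840 multimodal instances with explicit retain/forget splits. The testbed adds a reconstruction attack that progressively inserts hierarchical context into prompts to test whether unlearning truly deletes hierarchical pathways, and it evaluates four state-of-the-art unlearning methods on three tasks.

Result: Experiments show that existing unlearning methods struggle to achieve complete, hierarchy-aware forgetting while preserving diagnostic performance. Models unlearned at a coarse granularity resist the reconstruction attack strongly, whereas models unlearned at a fine granularity remain vulnerable. The testbed supports fine-grained evaluation across eight organizational levels and reveals the security trade-offs at different unlearning granularities.

Conclusion: The study highlights the complexity of hierarchy-aware unlearning in medical multimodal large language models and provides a practical testbed for building compliant medical AI systems. The results indicate that unlearning completeness must be balanced against model utility, and they lay a foundation for more effective medical data unlearning methods.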


📄 Abstract

Pretrained Multimodal Large Language Models (MLLMs) are increasingly deployed in medical AI systems for clinical reasoning, diagnosis support, and report generation. However, their training on sensitive patient data raises critical privacy and compliance challenges under regulations such as HIPAA and GDPR, which enforce the "right to be forgotten". Unlearning, the process of tuning models to selectively remove the influence of specific training data points, offers a potential solution, yet its effectiveness in complex medical settings remains underexplored. To systematically study this, we introduce MedForget, a Hierarchy-Aware Multimodal Unlearning Testbed with explicit retain and forget splits and evaluation sets containing rephrased variants. MedForget models hospital data as a nested hierarchy (Institution -> Patient -> Study -> Section), enabling fine-grained assessment across eight organizational levels. The benchmark contains 3840 multimodal (image, question, answer) instances, each hierarchy level having a dedicated unlearning target, reflecting distinct unlearning challenges. Experiments with four SOTA unlearning methods on three tasks (generation, classification, cloze) show that existing methods struggle to achieve complete, hierarchy-aware forgetting without reducing diagnostic performance. To test whether unlearning truly deletes hierarchical pathways, we introduce a reconstruction attack that progressively adds hierarchical level context to prompts. Models unlearned at a coarse granularity show strong resistance, while fine-grained unlearning leaves models vulnerable to such reconstruction. MedForget provides a practical, HIPAA-aligned testbed for building compliant medical AI systems.
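
The reconstruction attack is described only as progressively adding hierarchical level context to prompts. A minimal sketch of that idea, assuming a simple text template, is shown below. The function `progressive_prompts`, the `" | "` separator, and the placeholder identifiers are all hypothetical and not taken from the benchmark.

```python
from typing import Dict, List

# Hierarchy levels as named in the abstract, coarsest first.
HIERARCHY = ["Institution", "Patient", "Study", "Section"]


def progressive_prompts(question: str, context: Dict[str, str]) -> List[str]:
    """Return prompts that reveal 0, 1, ..., len(HIERARCHY) levels of context."""
    prompts = []
    for depth in range(len(HIERARCHY) + 1):
        revealed = [f"{level}: {context[level]}" for level in HIERARCHY[:depth]]
        prompts.append(" | ".join(revealed + [question]))
    return prompts


# Placeholder identifiers for illustration only.
context = {
    "Institution": "Hospital-A",
    "Patient": "P-001",
    "Study": "S-01",
    "Section": "Findings",
}
prompts = progressive_prompts("What abnormality is visible?", context)
for prompt in prompts:
    print(prompt)
```

Each successive prompt would be sent to the unlearned model; if answers about the forgotten target reappear as more context is revealed, the hierarchical pathway was not fully removed.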

[3] What Happens When: Learning Temporal Orders of Events in Videos

Daechul Ahn, Yura Choi, Hyeonbeom Choi, Seongwon Cho, San Kim, Jonghyun Choi

🧩 TL;DR

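To address the weak temporal understanding of video large language models, this study proposes the VECTOR benchmark for evaluating how well models identify the temporal order of events, and it develops MECOT, which strengthens temporal awareness through event-level instruction fine-tuning and chain-of-thought prompting.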


📘 Detailed Summary

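Motivation: Video large language models perform well at video understanding, but their ability to accurately capture the temporal order of multiple events has not been adequately explored. The study finds that models still perform well on existing benchmarks even when video frames are shuffled, which suggests that they may answer from prior knowledge of typical scenarios rather than from accurate temporal processing. Dedicated evaluation and improvement of temporal understanding are therefore needed.

Method: The study proposes the VECTOR benchmark to explicitly assess a model's ability to identify the temporal order of events, and it develops MECOT, which has two key components: instruction fine-tuning on detailed, event-level video descriptions, and chain-of-thought prompting at inference time to strengthen temporal awareness.

Result: On VECTOR, a range of video large language models frequently fail to understand event order correctly. MECOT outperforms prior work on VECTOR and also improves performance on existing video benchmarks. The team releases code, models, and datasets for follow-up research.

Conclusion: The study shows that video large language models have significant shortcomings in temporal understanding and that dedicated benchmarks and methods are needed to address them. By combining event-level instruction fine-tuning with chain-of-thought prompting, MECOT strengthens temporal awareness and offers a new approach to temporal reasoning in video understanding models.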


📄 Abstract

Video Large Multimodal Models (VLMMs) have shown impressive performance in video understanding, yet their ability to accurately capture the temporal order of multiple events remains underexplored. Through comprehensive experiments, we observe that models perform very well on existing benchmarks even when video frames are scrambled. This implies that VLMMs may not necessarily rely on accurate sequential processing of visual events, but instead depend on prior knowledge of typical scenarios to answer the question. To benchmark temporal understanding capabilities in VLMMs, we propose VECTOR, designed to explicitly assess a model's ability to identify the temporal order of events. On this benchmark, we observe that various VLMMs often fail to understand the order of events. To address this, we propose MECOT (Multi-Event instruction fine-tuning with Chain-of-Thought), which (1) trains models on detailed, event-by-event video descriptions and (2) uses chain-of-thought prompts at inference to enhance temporal awareness. MECOT outperforms prior art on VECTOR and also improves performance on existing video benchmarks, indicating the effectiveness of improved temporal understanding. We release our code, model and datasets.
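
The frame-scrambling observation can be framed as a simple check: evaluate the same model on ordered and on shuffled frames and compare accuracy. The sketch below assumes a model is any callable taking frames and a question; `shuffle_gap` and both toy models are hypothetical and stand in for a real VLMM and benchmark.

```python
import random
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[List[str], str, str]  # (frames, question, expected answer)
Model = Callable[[Sequence[str], str], str]


def shuffle_gap(model: Model, samples: List[Sample], seed: int = 0) -> Tuple[float, float]:
    """Return (accuracy on ordered frames, accuracy on shuffled frames)."""
    rng = random.Random(seed)
    ordered_hits = 0
    shuffled_hits = 0
    for frames, question, answer in samples:
        shuffled = list(frames)
        rng.shuffle(shuffled)
        ordered_hits += int(model(frames, question) == answer)
        shuffled_hits += int(model(shuffled, question) == answer)
    return ordered_hits / len(samples), shuffled_hits / len(samples)


def order_blind_model(frames: Sequence[str], question: str) -> str:
    """Toy model that answers from scene priors and ignores frame order."""
    return "pour" if "pour" in frames else "unknown"


def order_aware_model(frames: Sequence[str], question: str) -> str:
    """Toy model that answers with whatever it sees first."""
    return frames[0]


# Strings stand in for frame content in this toy example.
samples = [(["pour", "stir", "drink"], "What happens first?", "pour")]
blind_scores = shuffle_gap(order_blind_model, samples)
aware_ordered, aware_shuffled = shuffle_gap(order_aware_model, samples)
```

A model whose two accuracies are nearly equal, like the order-blind toy here, is not using temporal order, which is the failure mode the benchmark is designed to expose.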

[4] Mitigating Bias with Words: Inducing Demographic Ambiguity in Face Recognition Templates by Text Encoding

Tahar Chettaoui, Naser Damer, Fadi Boutros

🧩 TL;DR

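This paper proposes a new strategy called Unified Text-Image Embedding (UTIE), which uses the zero-shot capabilities and cross-modal semantic alignment of vision-language models to induce demographic ambiguity in face embeddings, reducing bias in face recognition systems while maintaining or improving verification accuracy.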


📘 Detailed Summary

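Motivation: Face recognition systems often exhibit demographic bias, largely because demographic-specific information is entangled with identity-relevant features in face embeddings. This entanglement lets demographic attributes overshadow identity cues in the embedding space and produces disparities in verification performance across demographic groups. The problem is especially critical in multicultural smart-city infrastructure where biometrics play a major role.

Method: The paper proposes the Unified Text-Image Embedding (UTIE) strategy, which uses the zero-shot capabilities and cross-modal semantic alignment of vision-language models. It enriches the face embeddings of each group with text-derived demographic features extracted from other demographic groups, inducing demographic ambiguity so that the embedding space emphasizes identity-relevant features and verification performance becomes fairer across groups.

Result: Experiments on RFW and BFW, two widely used benchmarks for assessing face recognition bias, with three vision-language models (CLIP, OpenCLIP, and SigLIP) show that UTIE consistently reduces bias metrics while maintaining, and in many cases improving, face verification accuracy.

Conclusion: The study shows that the cross-modal alignment of vision-language models can be used to induce demographic ambiguity in face embeddings and thereby reduce bias in face recognition systems. The approach is a promising direction for fairer biometric systems, particularly in multicultural urban settings, and it preserves practical utility.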


📄 Abstract

Face recognition (FR) systems are often prone to demographic biases, partially due to the entanglement of demographic-specific information with identity-relevant features in facial embeddings. This bias is extremely critical in large multicultural cities, especially where biometrics play a major role in smart city infrastructure. The entanglement can cause demographic attributes to overshadow identity cues in the embedding space, resulting in disparities in verification performance across different demographic groups. To address this issue, we propose a novel strategy, Unified Text-Image Embedding (UTIE), which aims to induce demographic ambiguity in face embeddings by enriching them with information related to other demographic groups. This encourages face embeddings to emphasize identity-relevant features and thus promotes fairer verification performance across groups. UTIE leverages the zero-shot capabilities and cross-modal semantic alignment of Vision-Language Models (VLMs). Given that VLMs are naturally trained to align visual and textual representations, we enrich the facial embeddings of each demographic group with text-derived demographic features extracted from other demographic groups. This encourages a more neutral representation in terms of demographic attributes. We evaluate UTIE using three VLMs, CLIP, OpenCLIP, and SigLIP, on two widely used benchmarks, RFW and BFW, designed to assess bias in FR. Experimental results show that UTIE consistently reduces bias metrics while maintaining, or even improving in several cases, the face verification accuracy.
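
The abstract states that face embeddings of each group are enriched with text-derived features from other groups, but gives no formula. The sketch below shows one plausible reading under stated assumptions: add the mean text embedding of the other groups, scaled by a weight, and re-normalize. The function `enrich`, the weight `alpha`, and the toy vectors are hypothetical; in practice the vectors would come from a VLM's image and text encoders.

```python
import math
from typing import Dict, List

Vector = List[float]


def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def enrich(face_emb: Vector, group: str, text_embs: Dict[str, Vector], alpha: float = 0.5) -> Vector:
    """Mix a face embedding with the mean text embedding of the OTHER groups.

    Assumed formulation for illustration; the abstract gives no equation.
    """
    others = [emb for name, emb in text_embs.items() if name != group]
    if not others:
        raise ValueError("need at least one other group")
    mean_other = [sum(e[d] for e in others) / len(others) for d in range(len(face_emb))]
    mixed = [f + alpha * m for f, m in zip(face_emb, mean_other)]
    return l2_normalize(mixed)


# Toy 3-D embeddings: one text-derived vector per (hypothetical) group.
text_embs = {"A": [1.0, 0.0, 0.0], "B": [0.0, 1.0, 0.0], "C": [0.0, 0.0, 1.0]}
enriched = enrich([1.0, 0.0, 0.0], group="A", text_embs=text_embs)
```

After enrichment the embedding for group A carries equal components along the B and C directions, which is the kind of demographic ambiguity the method aims for.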

[5] Explainable Fundus Image Curation and Lesion Detection in Diabetic Retinopathy

Anca Mihai, Adrian Groza

🧩 TL;DR

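This paper proposes a quality-control framework for AI training data in diabetic retinopathy diagnosis. It filters images with an explainable feature-based classifier, uses deep-learning-assisted annotation, and computes annotator agreement so that only high-quality data is used for model evaluation and training.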


📘 Detailed Summary

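Motivation: Early diagnosis of diabetic retinopathy is essential for preventing vision loss, but training AI models requires high-quality annotated data. Because retinal structures are complex, image acquisition errors and differences in interpretation among human annotators lead to inconsistent data quality, which affects the reliability and performance of AI models.

Method: The study proposes a three-stage quality-control framework. First, an explainable feature-based classifier filters out inadequate images, with features extracted through image processing and contrastive learning. Next, the images are enhanced and annotated with deep-learning-based assistance. Finally, inter-annotator agreement, computed with derived formulas, determines whether the annotations are usable.

Result: The framework identifies and filters low-quality images so that only data meeting a high standard is used for AI training and evaluation. Deep-learning-assisted annotation improves annotation efficiency, and the annotator agreement computation provides a quantitative measure of data quality.

Conclusion: The study underlines the importance of data quality control in medical AI and offers a systematic approach to building reliable diabetic retinopathy diagnosis systems. The method can be extended to other medical image analysis tasks and may help improve the accuracy and clinical credibility of AI-assisted diagnosis.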


📄 Abstract

Diabetic Retinopathy (DR) affects individuals with long-term diabetes. Without early diagnosis, DR can lead to vision loss. Fundus photography captures the structure of the retina along with abnormalities indicative of the stage of the disease. Artificial Intelligence (AI) can support clinicians in identifying these lesions, reducing manual workload, but models require high-quality annotated datasets. Due to the complexity of retinal structures, errors can occur in image acquisition and in lesion interpretation by manual annotators. We propose a quality-control framework that ensures only high-standard data is used for evaluation and AI training. First, an explainable feature-based classifier is used to filter inadequate images. The features are extracted using both image processing and contrastive learning. Then, the images are enhanced and submitted for annotation with deep-learning-based assistance. Lastly, the agreement between annotators, calculated using derived formulas, determines the usability of the annotations.
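
The abstract does not give the derived agreement formulas. To make the idea of an agreement-based usability check concrete, the sketch below computes Cohen's kappa, a standard chance-corrected agreement statistic, for two annotators. Kappa is swapped in here as an assumption and is not claimed to be the formula the authors derived.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    # Chance agreement: probability both annotators pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy per-image lesion labels (1 = lesion present) from two annotators.
annotator_1 = [1, 1, 0, 0]
annotator_2 = [1, 0, 0, 0]
kappa = cohens_kappa(annotator_1, annotator_2)
print(kappa)
```

A usability rule would then compare the score against a threshold chosen for the task; the threshold itself is not specified in the abstract.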

[6] Towards Lossless Ultimate Vision Token Compression for VLMs

Dehua Zheng, Mouxiao Huang, Borui Jiang, Hailin Hu, Xinghao Chen

🧩 TL;DR

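This paper proposes the LUVC framework, which achieves lossless compression of visual tokens through iterative merging in the visual encoder and a spectrum pruning unit in the LLM, substantially accelerating vision-language model inference while preserving accuracy.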


📘 Detailed Summary

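Motivation: Vision-language models face computational efficiency and latency challenges when processing high-resolution images and videos, mainly because visual token representations are highly redundant. Existing attention- or similarity-based compression algorithms suffer from position bias or class imbalance, which causes significant accuracy loss, and they do not generalize to shallow LLM layers where cross-modal interaction is weaker.

Method: The LUVC framework extends token compression to the visual encoder through an effective iterative merging scheme that is orthogonal in the spatial axes, accelerating computation across the whole VLM. Inside the LLM it integrates a spectrum pruning unit based on a low-pass filter that uses neither attention nor similarity, progressively pruning redundant visual tokens while remaining fully compatible with modern FlashAttention. Visual tokens are compressed systematically until they are completely eliminated at the final LLM layer, so that high-dimensional visual features are gradually fused into the multimodal queries.

Result: Experiments show that LUVC achieves a 2× inference speedup in the language model with negligible accuracy loss. Because it is training-free, it can be deployed immediately across multiple VLMs.

Conclusion: By combining orthogonal spatial compression with spectrum pruning, LUVC addresses visual token redundancy and substantially improves computational efficiency while preserving model accuracy. Its training-free deployment is convenient for practical use and offers a new direction for efficient vision-language model design.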


📄 Abstract

Visual language models encounter challenges in computational efficiency and latency, primarily due to the substantial redundancy in the token representations of high-resolution images and videos. Current attention/similarity-based compression algorithms suffer from either position bias or class imbalance, leading to significant accuracy degradation. They also fail to generalize to shallow LLM layers, which exhibit weaker cross-modal interactions. To address this, we extend token compression to the visual encoder through an effective iterative merging scheme that is orthogonal in spatial axes to accelerate the computation across the entire VLM. Furthermore, we integrate a spectrum pruning unit into the LLM through an attention/similarity-free low-pass filter, which gradually prunes redundant visual tokens and is fully compatible with modern FlashAttention. On this basis, we propose the Lossless Ultimate Vision tokens Compression (LUVC) framework. LUVC systematically compresses visual tokens until complete elimination at the final layer of the LLM, so that the high-dimensional visual features are gradually fused into the multimodal queries. The experiments show that LUVC achieves a 2× inference speedup in the language model with negligible accuracy degradation, and the training-free characteristic enables immediate deployment across multiple VLMs.

[7] A Survey of Body and Face Motion: Datasets, Performance Evaluation Metrics and Generative Techniques

Lownish Rai Sookha, Nikhil Pakhale, Mudasir Ganaie, Abhinav Dhall

🧩 TL;DR

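This paper presents the first comprehensive survey of body and face motion generation for conversational interaction. It systematically reviews the field's core concepts, generative methods, datasets, and evaluation metrics, and it identifies future research directions for improving the realism and expressiveness of avatars.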


📘 Detailed Summary

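Motivation: Recent advances in generative modeling and multimodal learning have made it possible to generate human motion from speech, conversational context, and visual cues, yet generating expressive and coherent face and body dynamics remains challenging. The difficulty stems from the complex interplay between verbal and non-verbal cues and individual personality traits, and no systematic survey has so far covered both body and face motion generation.

Method: The survey systematically reviews the full technical stack of body and face motion generation, covering core concept definitions, motion representation techniques (such as parametric models, meshes, and point clouds), generative approaches (including deep-learning-based generative models and multimodal fusion techniques), and the relevant datasets and evaluation metrics.

Result: As the first comprehensive survey of the area, the paper consolidates the main technical directions and research progress in body and face motion generation, provides a detailed resource list (public datasets, code repositories, and evaluation benchmarks), and establishes a unified framework for comparing the strengths and weaknesses of different methods.

Conclusion: The survey identifies key directions for improving the realism, coherence, and expressiveness of avatars in dyadic interaction, stresses the importance of cross-modal consistency and personalized modeling, and gives researchers a systematic overview and technical roadmap. The related resources are available on the project website.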


📄 Abstract

Body and face motion play an integral role in communication. They convey crucial information about the participants. Advances in generative modeling and multi-modal learning have enabled motion generation from signals such as speech, conversational context and visual cues. However, generating expressive and coherent face and body dynamics remains challenging due to the complex interplay of verbal / non-verbal cues and individual personality traits. This survey reviews body and face motion generation, covering core concepts, representation techniques, generative approaches, datasets and evaluation metrics. We highlight future directions to enhance the realism, coherence and expressiveness of avatars in dyadic settings. To the best of our knowledge, this work is the first comprehensive review to cover both body and face motion. Detailed resources are listed on https://lownish23csz0010.github.io/mogen/.

[8] Prompt-Based Continual Compositional Zero-Shot Learning

Sauda Maryam, Sara Nadeem, Faisal Qureshi, Mohsen Ali

🧩 TL;DR

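This paper proposes PromptCCZSL, the first prompt-based continual compositional zero-shot learning framework. It uses multi-teacher distillation and session-aware compositional prompts to adapt continually to new attribute and object compositions while preventing forgetting, and it substantially outperforms existing methods on the UT-Zappos and C-GQA benchmarks.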


📘 Detailed Summary

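Motivation: The paper addresses continual adaptation in compositional zero-shot learning, where a vision-language model must keep adapting to new attributes, objects, and their compositions while avoiding forgetting of prior knowledge. Unlike classical continual learning, CCZSL is more complex because attributes and objects may recur across sessions while compositions remain unique, which calls for new methods that balance knowledge retention and compositional generalization.

Method: Built on a frozen vision-language model backbone, the PromptCCZSL framework retains prior knowledge through recency-weighted multi-teacher distillation, uses session-aware compositional prompts to fuse multimodal features for new compositions, and learns attribute and object prompts through session-agnostic fusion to maintain global semantic consistency. A Cosine Anchor Loss stabilizes the retention of prior knowledge, an Orthogonal Projection Loss keeps new attribute and object embeddings distinct from earlier ones, and an Intra-Session Diversity Loss promotes variation among current-session embeddings for richer, more discriminative representations.

Result: Extensive experiments on the UT-Zappos and C-GQA benchmarks show that PromptCCZSL achieves substantial improvements on continual compositional zero-shot learning and clearly outperforms prior VLM-based and non-VLM baselines. The work also introduces a comprehensive evaluation protocol that jointly measures catastrophic forgetting and compositional generalization, setting a new benchmark for CCZSL in closed-world settings.

Conclusion: The study provides the first prompt-based solution for continual compositional zero-shot learning. Through its multi-teacher distillation strategy and loss design, it balances knowledge retention against adaptation to new compositions. The framework both improves performance and establishes a more comprehensive evaluation standard, laying groundwork for continual compositional learning in more complex open-world scenarios.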


📄 Abstract

We tackle continual adaptation of vision-language models to new attributes, objects, and their compositions in Compositional Zero-Shot Learning (CZSL), while preventing forgetting of prior knowledge. Unlike classical continual learning where classes are disjoint, CCZSL is more complex as attributes and objects may reoccur across sessions while compositions remain unique. Built on a frozen VLM backbone, we propose the first Prompt-based Continual Compositional Zero-Shot Learning (PromptCCZSL) framework that retains prior knowledge through recency-weighted multi-teacher distillation. It employs session-aware compositional prompts to fuse multimodal features for new compositions, while attribute and object prompts are learned through session-agnostic fusion to maintain global semantic consistency, which is further stabilized by a Cosine Anchor Loss (CAL) to preserve prior knowledge. To enhance adaptation in the current session, an Orthogonal Projection Loss (OPL) ensures that new attribute and object embeddings remain distinct from previous ones, preventing overlap, while an Intra-Session Diversity Loss (IDL) promotes variation among current-session embeddings for richer, more discriminative representations. We also introduce a comprehensive protocol that jointly measures catastrophic forgetting and compositional generalization. Extensive experiments on UT-Zappos and C-GQA benchmarks demonstrate that PromptCCZSL achieves substantial improvements over prior VLM-based and non-VLM baselines, setting a new benchmark for CCZSL in closed-world settings.
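
The abstract names three auxiliary losses (CAL, OPL, IDL) and states their purpose but not their equations. The sketch below gives one plausible cosine-based form for each, strictly as an assumption: the function names and exact formulas are hypothetical and may differ from the paper's definitions.

```python
import math
from itertools import combinations
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_anchor_loss(current: Vector, anchor: Vector) -> float:
    """Assumed CAL form: penalize drift of a prompt embedding from its earlier value."""
    return 1.0 - cosine(current, anchor)


def orthogonal_projection_loss(new_embs: List[Vector], old_embs: List[Vector]) -> float:
    """Assumed OPL form: mean squared cosine between new and previous embeddings."""
    pairs = [(n, o) for n in new_embs for o in old_embs]
    return sum(cosine(n, o) ** 2 for n, o in pairs) / len(pairs)


def intra_session_diversity_loss(embs: List[Vector]) -> float:
    """Assumed IDL form: mean pairwise cosine among current-session embeddings."""
    pairs = list(combinations(embs, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy 2-D embeddings: each loss is 0 in the desired case and 1 in the worst case shown.
cal_same = cosine_anchor_loss([1.0, 0.0], [1.0, 0.0])
opl_orthogonal = orthogonal_projection_loss([[0.0, 1.0]], [[1.0, 0.0]])
idl_identical = intra_session_diversity_loss([[1.0, 0.0], [1.0, 0.0]])
```

Under these assumed forms, minimizing CAL keeps old knowledge in place, minimizing OPL pushes new embeddings away from old ones, and minimizing IDL spreads current-session embeddings apart.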

[9] GLACIA: Instance-Aware Positional Reasoning for Glacial Lake Segmentation via Multimodal Large Language Model

Lalit Maurya, Saurabh Kaushik, Beth Tellman

🧩 TL;DR

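This paper proposes GLACIA, the first framework to combine large language models with segmentation for glacial lake monitoring. It produces accurate segmentation masks together with spatial reasoning outputs to support more intuitive disaster prevention and decision-making.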


📘 Detailed Summary

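Motivation: Existing glacial lake segmentation methods based on convolutional neural networks and Vision Transformers are limited to pixel-level predictions and lack high-level global scene semantics and human-interpretable reasoning, which limits their usefulness for disaster prevention and policy-making.

Method: The paper proposes the GLACIA framework, the first to combine large language models with segmentation capabilities, and builds the Glacial Lake Position Reasoning dataset pipeline, which provides diverse, spatially grounded question-answer pairs to address the lack of instance-aware positional reasoning data in remote sensing.

Result: GLACIA reaches an mIoU of 87.30, surpassing CNN-based methods (78.55-79.01), ViT methods (69.27-81.75), geo-foundation models (76.37-87.10), and reasoning-based segmentation methods (60.12-75.66), the best result among all compared methods.

Conclusion: By supporting natural language interaction, the study enables more efficient and interpretable decision-making and offers a new route to intuitive disaster preparedness and informed policy-making in rapidly changing glacial environments. The code is open-sourced on GitHub.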


📄 Abstract

Glacial lake monitoring bears great significance in mitigating the anticipated risk of Glacial Lake Outburst Floods. However, existing segmentation methods based on convolutional neural networks (CNNs) and Vision Transformers (ViTs) remain constrained to pixel-level predictions, lacking high-level global scene semantics and human-interpretable reasoning. To address this, we introduce GLACIA (Glacial LAke segmentation with Contextual Instance Awareness), the first framework that integrates large language models with segmentation capabilities to produce both accurate segmentation masks and corresponding spatial reasoning outputs. We construct the Glacial Lake Position Reasoning (GLake-Pos) dataset pipeline, which provides diverse, spatially grounded question-answer pairs designed to overcome the lack of instance-aware positional reasoning data in remote sensing. Comparative evaluations demonstrate that GLACIA (mIoU: 87.30) surpasses state-of-the-art methods based on CNNs (mIoU: 78.55 - 79.01), ViTs (mIoU: 69.27 - 81.75), Geo-foundation models (mIoU: 76.37 - 87.10), and reasoning-based segmentation methods (mIoU: 60.12 - 75.66). Our approach enables intuitive disaster preparedness and informed policy-making in the context of rapidly changing glacial environments by facilitating natural language interaction, thereby supporting more efficient and interpretable decision-making. The code is released at https://github.com/lalitmaurya47/GLACIA
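
All comparisons above are reported as mIoU. For readers unfamiliar with the metric, the sketch below computes mean intersection-over-union for small integer label masks, averaging per-class IoU over the classes that appear. The helper `mean_iou` and the toy masks are illustrative and not taken from the GLACIA code; the abstract reports the value as a percentage, whereas this function returns a fraction.

```python
from typing import List

Mask = List[List[int]]


def mean_iou(pred: Mask, target: Mask, num_classes: int = 2) -> float:
    """Mean intersection-over-union across classes present in either mask."""
    ious = []
    for cls in range(num_classes):
        inter = 0
        union = 0
        for pred_row, target_row in zip(pred, target):
            for p, t in zip(pred_row, target_row):
                inter += int(p == cls and t == cls)
                union += int(p == cls or t == cls)
        if union > 0:  # skip classes absent from both masks
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class is present in either mask")
    return sum(ious) / len(ious)


# Toy 2x2 masks: 1 = lake pixel, 0 = background.
pred = [[1, 1], [0, 0]]
target = [[1, 0], [0, 0]]
score = mean_iou(pred, target)
print(round(score, 4))
```

In this toy case the lake class has IoU 1/2 and the background class 2/3, so the mean is 7/12.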

[10] ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors

Liming Kuang, Yordanka Velikova, Mahdi Saleh, Jan-Nico Zaech, Danda Pani Paudel, Benjamin Busam

🧩 TL;DR

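This paper proposes ConceptPose, a training-free and model-free framework for object pose estimation that uses a vision-language model to build open-vocabulary 3D concept maps and achieves state-of-the-art performance on zero-shot relative pose estimation benchmarks.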


📘 Detailed Summary

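Motivation: Traditional object pose estimation methods usually require extensive dataset-specific training, while large-scale vision-language models show remarkable zero-shot capabilities. The study aims to bridge these two areas with a pose estimation method that is both training-free and model-free.

Method: ConceptPose uses a vision-language model to build open-vocabulary 3D concept maps in which each point is tagged with a concept vector derived from saliency maps. By establishing robust 3D-3D correspondences across concept maps, it estimates 6DoF relative pose precisely.

Result: Without any object-specific or dataset-specific training, the method achieves state-of-the-art results on common zero-shot relative pose estimation benchmarks, outperforming existing methods by over 62% in ADD(-S) score, including methods that use extensive dataset-specific training.

Conclusion: The study shows that the zero-shot capabilities of vision-language models can be applied effectively to object pose estimation, offers a new paradigm for training-free pose estimation, demonstrates the potential of cross-modal representations in geometric vision tasks, and opens new possibilities for applications such as robot perception and augmented reality.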


📄 Abstract

Object pose estimation is a fundamental task in computer vision and robotics, yet most methods require extensive, dataset-specific training. Concurrently, large-scale vision-language models show remarkable zero-shot capabilities. In this work, we bridge these two worlds by introducing ConceptPose, a framework for object pose estimation that is both training-free and model-free. ConceptPose leverages a vision-language model (VLM) to create open-vocabulary 3D concept maps, where each point is tagged with a concept vector derived from saliency maps. By establishing robust 3D-3D correspondences across concept maps, our approach allows precise estimation of 6DoF relative pose. Without any object-specific or dataset-specific training, our approach achieves state-of-the-art results on common zero-shot relative pose estimation benchmarks, significantly outperforming existing methods by over 62% in ADD(-S) score, including those that utilize extensive dataset-specific training.
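
The abstract does not explain how correspondences between concept maps are established. One common way to obtain robust matches between two sets of descriptors is mutual nearest-neighbour filtering, sketched below on toy concept vectors. This is an assumed technique for illustration, and `mutual_matches` is a hypothetical helper; the resulting index pairs would typically feed a rigid-alignment step (for example the Kabsch algorithm) to recover the relative pose.

```python
import math
from typing import List, Tuple

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mutual_matches(src: List[Vector], dst: List[Vector]) -> List[Tuple[int, int]]:
    """Keep (i, j) only if dst[j] is src[i]'s best match AND src[i] is dst[j]'s best match."""

    def best(query: Vector, pool: List[Vector]) -> int:
        return max(range(len(pool)), key=lambda k: cosine(query, pool[k]))

    forward = [best(s, dst) for s in src]
    backward = [best(d, src) for d in dst]
    return [(i, j) for i, j in enumerate(forward) if backward[j] == i]


# Toy concept vectors attached to points of two concept maps.
src = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
dst = [[0.0, 0.9], [0.9, 0.1]]
matches = mutual_matches(src, dst)
print(matches)
```

The third source point is dropped because its best match in `dst` prefers a different source point, which is how the mutual check suppresses ambiguous correspondences.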

[11] Representation Calibration and Uncertainty Guidance for Class-Incremental Learning based on Vision Language Model

Jiantao Tan, Peixian Ma, Tong Yu, Wentao Zhang, Ruixuan Wang

🧩 TL;DR

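This paper proposes a class-incremental learning framework based on vision-language models. Using task-specific adapters, a cross-task representation calibration strategy, and an uncertainty-guided inference mechanism, it alleviates cross-task class confusion and improves continual learning performance for image classification.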


📘 Detailed Summary

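Motivation: Current class-incremental learning methods based on vision-language models still struggle to distinguish classes that were learned in different tasks. The resulting cross-task class confusion limits how well a continual learning system can learn new classes while preserving knowledge of old ones.

Method: The framework uses a pre-trained, frozen image encoder and adds task-specific adapters to learn new knowledge. It proposes a cross-task representation calibration strategy based on a mixture of lightweight projectors to better separate all learned classes in a unified feature space, and it develops an inference strategy guided by prediction uncertainty to select the most appropriate image feature for class prediction more accurately.

Result: Extensive experiments on multiple datasets under various settings show that the method outperforms existing approaches, alleviates cross-task class confusion, and improves the accuracy and stability of class-incremental learning for image classification.

Conclusion: The study shows that cross-task representation calibration and uncertainty-guided inference can address class confusion in continual learning with vision-language models, and that combining task-specific adapters with feature-space calibration is effective, providing a useful reference for future continual learning research.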


📄 Abstract

Class-incremental learning requires a learning system to continually learn knowledge of new classes while preserving previously learned knowledge of old classes. However, current state-of-the-art methods based on Vision-Language Models (VLMs) still struggle to differentiate classes across learning tasks. Here, a novel VLM-based continual learning framework for image classification is proposed. In this framework, task-specific adapters are added to the pre-trained and frozen image encoder to learn new knowledge, and a novel cross-task representation calibration strategy based on a mixture of light-weight projectors is used to help better separate all learned classes in a unified feature space, alleviating class confusion across tasks. In addition, a novel inference strategy guided by prediction uncertainty is developed to more accurately select the most appropriate image feature for class prediction. Extensive experiments on multiple datasets under various settings demonstrate the superior performance of our method compared to existing ones.
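
The abstract says inference is guided by prediction uncertainty but does not define the uncertainty measure. A minimal sketch, assuming Shannon entropy of the softmax output as the measure, is shown below: each task-specific adapter yields one set of logits, and the least uncertain one is used for the final prediction. The function `select_by_uncertainty` and the toy logits are hypothetical.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats; higher means more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_by_uncertainty(per_adapter_logits: List[List[float]]) -> Tuple[int, int]:
    """Return (index of least-uncertain adapter, class predicted by that adapter)."""
    probs = [softmax(logits) for logits in per_adapter_logits]
    best_adapter = min(range(len(probs)), key=lambda i: entropy(probs[i]))
    chosen = probs[best_adapter]
    best_class = max(range(len(chosen)), key=lambda c: chosen[c])
    return best_adapter, best_class


# Adapter 0 is undecided (uniform); adapter 1 is confident about class 1.
logits = [[0.5, 0.5, 0.5], [0.1, 3.0, 0.2]]
choice = select_by_uncertainty(logits)
print(choice)
```

Other uncertainty measures, such as the maximum softmax probability, would slot into the same selection step.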

[12] AgentComp: From Agentic Reasoning to Compositional Mastery in Text-to-Image Models

Arman Zarei, Jiacheng Pan, Matthew Gwilliam, Soheil Feizi, Zhenheng Yang

🧩 TL;DR

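This paper proposes AgentComp, a framework that uses the reasoning and tool-use capabilities of large language models to construct compositional datasets autonomously and then fine-tunes text-to-image models with an agentic preference optimization method. It markedly improves compositional generation and achieves state-of-the-art results on benchmarks such as T2I-CompBench.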


📘 Detailed Summary

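Motivation: Text-to-image generative models have made remarkable progress in visual quality but still fall short on compositionality: they struggle to capture object relationships, attribute bindings, and fine-grained details in prompts. The core limitation is that models are not explicitly trained to distinguish compositionally similar prompts and images, so outputs deviate from the intended description in fine-grained details.

Method: The paper proposes AgentComp, which uses the reasoning and tool-use capabilities of large language models equipped with image generation, editing, and visual question answering tools to construct compositional datasets autonomously. On these datasets, an agentic preference optimization method fine-tunes text-to-image models so that they better distinguish compositionally similar samples, strengthening overall compositional generation.

Result: AgentComp achieves state-of-the-art results on compositionality benchmarks such as T2I-CompBench without degrading image quality, a common drawback of earlier methods. It even generalizes to capabilities it was not explicitly trained for, such as text rendering.

Conclusion: The study shows that autonomous dataset construction with large language models, combined with agentic preference optimization, can markedly improve the compositional ability of text-to-image models. The method addresses the core challenge of compositional generation while avoiding image quality loss, offering a promising direction for future work.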


📄 Abstract

Text-to-image generative models have achieved remarkable visual quality but still struggle with compositionality, that is, accurately capturing object relationships, attribute bindings, and fine-grained details in prompts. A key limitation is that models are not explicitly trained to differentiate between compositionally similar prompts and images, resulting in outputs that are close to the intended description yet deviate in fine-grained details. To address this, we propose AgentComp, a framework that explicitly trains models to better differentiate such compositional variations and enhance their reasoning ability. AgentComp leverages the reasoning and tool-use capabilities of large language models equipped with image generation, editing, and VQA tools to autonomously construct compositional datasets. Using these datasets, we apply an agentic preference optimization method to fine-tune text-to-image models, enabling them to better distinguish between compositionally similar samples and resulting in overall stronger compositional generation ability. AgentComp achieves state-of-the-art results on compositionality benchmarks such as T2I-CompBench without compromising image quality (a common drawback in prior approaches), and even generalizes to other capabilities not explicitly trained for, such as text rendering.

[13] Explaining the Unseen: Multimodal Vision-Language Reasoning for Situational Awareness in Underground Mining Disasters

Mizanur Rahman Jewel, Mohamed Elmahallawy, Sanjay Madria, Samuel Frimpong

🧩 TL;DR

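This paper proposes MDSE (Multimodal Disaster Situation Explainer), a novel vision-language framework that automatically generates detailed textual descriptions of underground disaster scenes. Through context-aware cross-attention, segmentation-aware dual-pathway visual encoding, and a resource-efficient Transformer language model, it markedly improves situational awareness in visually degraded environments.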


📘 Detailed Summary

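Motivation: The darkness, dust, and collapses produced by underground mining disasters severely obscure vision and make accurate situational awareness difficult for humans and conventional systems. Existing methods cannot generate accurate, detailed scene descriptions under severe visual degradation, which hinders emergency response decisions.

Method: MDSE has three core components: a context-aware cross-attention mechanism for robust alignment of visual and textual features under severe visual degradation; segmentation-aware dual-pathway visual encoding that fuses global and region-specific embeddings; and a resource-efficient Transformer language model that generates expressive descriptions at minimal compute cost. The work also builds UMD, the first image-caption dataset of real underground disaster scenes.

Result: Extensive experiments on UMD and related benchmarks show that MDSE substantially outperforms state-of-the-art captioning models, producing more accurate and contextually relevant descriptions that capture crucial details in obscured environments and improving situational awareness for underground emergency response.

Conclusion: The study demonstrates the effectiveness of multimodal fusion under harsh visual conditions. The framework gives disaster response systems a practical tool for automatic situation explanation, could be extended to other visually constrained emergency scenarios, and may encourage further datasets and evaluation standards in the area.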


📄 Abstract

Underground mining disasters produce pervasive darkness, dust, and collapses that obscure vision and make situational awareness difficult for humans and conventional systems. To address this, we propose MDSE, Multimodal Disaster Situation Explainer, a novel vision-language framework that automatically generates detailed textual explanations of post-disaster underground scenes. MDSE introduces three innovations: (i) Context-Aware Cross-Attention for robust alignment of visual and textual features even under severe degradation; (ii) Segmentation-aware dual pathway visual encoding that fuses global and region-specific embeddings; and (iii) Resource-Efficient Transformer-Based Language Model for expressive caption generation with minimal compute cost. To support this task, we present the Underground Mine Disaster (UMD) dataset, the first image-caption corpus of real underground disaster scenes, enabling rigorous training and evaluation. Extensive experiments on UMD and related benchmarks show that MDSE substantially outperforms state-of-the-art captioning models, producing more accurate and contextually relevant descriptions that capture crucial details in obscured environments, improving situational awareness for underground emergency response. The code is at https://github.com/mizanJewel/Multimodal-Disaster-Situation-Explainer.

[14] Food Image Generation on Multi-Noun Categories

Xinyue Pan, Yuhao Chen, Jiangpeng He, Fengqing Zhu

🧩 TL;DR

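This paper proposes FoCULR, which addresses semantic misinterpretation and incorrect spatial layouts when generating images for multi-noun food categories by incorporating food domain knowledge and introducing core concepts early in the generation process.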


📘 Detailed Summary

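Motivation: Multi-noun food categories (such as "egg noodle") are prone to semantic misinterpretation during image generation: the model parses the compound name as several separate entities rather than a single concept. The problem is common in real-world datasets such as UEC-256 and arises because the text encoder lacks knowledge of multi-noun categories and misreads multi-noun relationships, which leads to incorrect spatial layouts.

Method: FoCULR relies on two core techniques: incorporating food domain knowledge to improve the model's understanding of multi-noun categories, and introducing core concepts at an early stage of generation to steer the semantic representation. The method targets the text encoder's weakness in understanding multi-noun relationships and improves semantic parsing through domain-specific knowledge injection.

Result: Experiments show that FoCULR improves image generation performance in the food domain. Combining food domain knowledge with early concept introduction reduces semantic errors in multi-noun category generation and improves the spatial layout of generated images.

Conclusion: The study identifies semantic relationships and spatial layout understanding as the core challenges in generating multi-noun food categories, and FoCULR offers an effective approach for domain-specific generation tasks. Future work could extend it to other domains with compound concepts and explore tighter integration of knowledge injection with the generation process.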


📄 Abstract

Generating realistic food images for categories with multiple nouns is surprisingly challenging. For instance, the prompt "egg noodle" may result in images that incorrectly contain both eggs and noodles as separate entities. Multi-noun food categories are common in real-world datasets and account for a large portion of entries in benchmarks such as UEC-256. These compound names often cause generative models to misinterpret the semantics, producing unintended ingredients or objects. This is due to insufficient multi-noun category related knowledge in the text encoder and misinterpretation of multi-noun relationships, leading to incorrect spatial layouts. To overcome these challenges, we propose FoCULR (Food Category Understanding and Layout Refinement) which incorporates food domain knowledge and introduces core concepts early in the generation process. Experimental results demonstrate that the integration of these techniques improves image generation performance in the food domain.

[15] View-on-Graph: Zero-shot 3D Visual Grounding via Vision-Language Reasoning on Scene Graphs

Yuanyuan Liu, Haiyang Mei, Dongyang Zhan, Jiayue Zhao, Dongsheng Zhou, Bo Dong, Xin Yang

🧩 TL;DR

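This paper proposes a new VLM × SI paradigm that externalizes 3D spatial information into a scene graph, letting a vision-language model act as an active agent that retrieves and reasons incrementally, and it achieves state-of-the-art performance in zero-shot 3D visual grounding.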


📘 Detailed Summary

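Motivation: Existing zero-shot 3D visual grounding methods follow a VLM + SI paradigm that converts 3D spatial information into composite inputs (such as renderings from specified views or video sequences with overlaid markers). This yields entangled visual representations that force the VLM to process entire cluttered cues and make it hard to exploit spatial semantic relationships.

Method: The paper proposes a new VLM × SI paradigm. Its View-on-Graph method organizes the scene into a multi-modal, multi-layer scene graph so that the VLM can act as an active agent that selectively accesses the cues it needs while traversing the scene, enabling incremental retrieval and reasoning.

Result: Extensive experiments show that VoG achieves state-of-the-art performance on zero-shot 3D visual grounding, supporting structured scene exploration as a promising strategy for advancing zero-shot 3DVG.

Conclusion: The study shows the advantage of structuring 3D context as a spatially and semantically coherent scene graph rather than as entangled visual inputs. This lowers the VLM's reasoning difficulty, and active exploration of the graph naturally produces interpretable step-by-step traces, giving a transparent and interpretable approach to 3D visual grounding.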


📄 Abstract

3D visual grounding (3DVG) identifies objects in 3D scenes from language descriptions. Existing zero-shot approaches leverage 2D vision-language models (VLMs) by converting 3D spatial information (SI) into forms amenable to VLM processing, typically as composite inputs such as specified view renderings or video sequences with overlaid object markers. However, this VLM + SI paradigm yields entangled visual representations that compel the VLM to process entire cluttered cues, making it hard to exploit spatial semantic relationships effectively. In this work, we propose a new VLM × SI paradigm that externalizes the 3D SI into a form enabling the VLM to incrementally retrieve only what it needs during reasoning. We instantiate this paradigm with a novel View-on-Graph (VoG) method, which organizes the scene into a multi-modal, multi-layer scene graph and allows the VLM to operate as an active agent that selectively accesses necessary cues as it traverses the scene. This design offers two intrinsic advantages: (i) by structuring 3D context into a spatially and semantically coherent scene graph rather than confounding the VLM with densely entangled visual inputs, it lowers the VLM's reasoning difficulty; and (ii) by actively exploring and reasoning over the scene graph, it naturally produces transparent, step-by-step traces for interpretable 3DVG. Extensive experiments show that VoG achieves state-of-the-art zero-shot performance, establishing structured scene exploration as a promising strategy for advancing zero-shot 3DVG.

[16] LongT2IBench: A Benchmark for Evaluating Long Text-to-Image Generation with Graph-structured Annotations

Zhichao Yang, Tianjiao Gu, Jianjie Wang, Feiyu Lin, Xiangfei Sheng, Pengfei Chen, Leida Li

🧩 TL;DR

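This study introduces the LongT2IBench benchmark and the LongT2IExpert evaluator to address fine-grained alignment evaluation for long text-to-image generation, using graph-structured annotations and a hierarchical alignment chain-of-thought method to provide interpretable, quantitative evaluation.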


📘 Detailed Summary

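Motivation: Text-to-image alignment evaluation currently focuses on short prompts and lacks automatic, interpretable evaluation models for long prompts. Existing benchmarks provide only MOS or Likert-scale annotations, which cannot support fine-grained alignment analysis in long-text scenarios and therefore hold back the development of long-text T2I evaluators.

Method: The study first builds LongT2IBench, a benchmark of 14K long text-image pairs, using a Generate-Refine-Qualify annotation protocol that converts long prompts into graph-structured representations of entities, attributes, and relations. It then proposes the LongT2IExpert evaluator, which uses instruction tuning with a hierarchical alignment chain-of-thought so that multimodal large language models can output both quantitative scores and structured interpretations.

Result: Experiments show that LongT2IExpert performs strongly at evaluating and interpreting long text-to-image alignment. The benchmark provides fine-grained graph-structured annotations, and the evaluator outputs both quantitative alignment scores and structured interpretations, improving evaluation in long-prompt scenarios.

Conclusion: The study fills a gap in long text-to-image alignment evaluation. Its graph-structured annotation method and hierarchical alignment chain-of-thought framework offer a new paradigm for interpretable evaluation, and the benchmark and evaluator lay groundwork for quality control and further research in long-text T2I generation.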


📄 Abstract

The increasing popularity of long Text-to-Image (T2I) generation has created an urgent need for automatic and interpretable models that can evaluate the image-text alignment in long prompt scenarios. However, the existing T2I alignment benchmarks predominantly focus on short prompt scenarios and only provide MOS or Likert scale annotations. This inherent limitation hinders the development of long T2I evaluators, particularly in terms of the interpretability of alignment. In this study, we contribute LongT2IBench, which comprises 14K long text-image pairs accompanied by graph-structured human annotations. Given the detail-intensive nature of long prompts, we first design a Generate-Refine-Qualify annotation protocol to convert them into textual graph structures that encompass entities, attributes, and relations. Through this transformation, fine-grained alignment annotations are achieved based on these granular elements. Finally, the graph-structured annotations are converted into alignment scores and interpretations to facilitate the design of T2I evaluation models. Based on LongT2IBench, we further propose LongT2IExpert, a LongT2I evaluator that enables multi-modal large language models (MLLMs) to provide both quantitative scores and structured interpretations through an instruction-tuning process with Hierarchical Alignment Chain-of-Thought (CoT). Extensive experiments and comparisons demonstrate the superiority of the proposed LongT2IExpert in alignment evaluation and interpretation. Data and code have been released at https://welldky.github.io/LongT2IBench-Homepage/.
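
The abstract says graph-structured annotations are converted into alignment scores and interpretations without giving the conversion rule. The sketch below assumes the simplest possible rule, the fraction of annotated graph elements judged aligned, and lists the misaligned elements as a crude interpretation. The data layout, `alignment_score`, and `explain` are hypothetical and not the benchmark's actual schema.

```python
from typing import Dict, List, Tuple

# Assumed layout: element kind -> list of (description, aligned with the image?).
Annotation = Dict[str, List[Tuple[str, bool]]]


def alignment_score(graph: Annotation) -> float:
    """Fraction of annotated entities, attributes, and relations judged aligned."""
    flags = [ok for elements in graph.values() for _, ok in elements]
    if not flags:
        raise ValueError("annotation graph is empty")
    return sum(flags) / len(flags)


def explain(graph: Annotation) -> List[str]:
    """List the misaligned elements as a simple structured interpretation."""
    return [
        f"{kind}: {description}"
        for kind, elements in graph.items()
        for description, ok in elements
        if not ok
    ]


# Toy annotation for a prompt about a brown dog chasing a red ball.
graph = {
    "entities": [("dog", True), ("red ball", True)],
    "attributes": [("dog is brown", True), ("ball is red", False)],
    "relations": [("dog chases ball", False)],
}
score = alignment_score(graph)
misses = explain(graph)
print(score, misses)
```

A weighted variant, for example one that counts relations more heavily than attributes, would follow the same structure.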

[17] Dynamic Facial Expressions Analysis Based Parkinson's Disease Auxiliary Diagnosis

Xiaochen Huang, Xiaochen Bi, Cuihua Lv, Xin Wang, Haoyan Zhang, Wenjing Jiang, Xin Ma, Yibin Li

🧩 TL;DR

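This paper proposes an auxiliary diagnosis method for Parkinson's disease based on dynamic facial expression analysis. By analyzing two characteristic manifestations, reduced facial expressivity and facial rigidity, it achieves a diagnostic accuracy of 93.1% and offers a more convenient, non-invasive diagnostic option for Parkinson's disease.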


📘 Detailed Summary

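Motivation: Parkinson's disease (PD), a common neurodegenerative disorder, seriously affects patients' daily life and social interaction. This study aims to overcome the limitations of traditional diagnostic methods by targeting hypomimia (reduced facial expression), a characteristic clinical symptom of PD, in order to develop a more efficient and accessible auxiliary diagnosis method that improves the diagnostic experience of potential patients.

Method: The study develops a multimodal facial expression analysis network that extracts expression intensity features while patients perform various facial expressions. The network builds on the CLIP architecture, integrating visual and textual features while preserving the temporal dynamics of facial expressions. The extracted expression intensity features are then processed and fed into an LSTM-based classification network for the final PD diagnosis.

Result: The method reaches 93.1% accuracy on the PD diagnosis task, outperforming other in-vitro PD diagnostic approaches. The experiments indicate that analyzing reduced facial expressivity and facial rigidity can effectively distinguish PD patients from healthy individuals.

Conclusion: The proposed dynamic facial expression analysis technique provides a convenient, non-invasive auxiliary diagnosis option for PD and improves the patient's diagnostic experience. Beyond its high diagnostic accuracy, the method opens a new research direction for computer-aided diagnosis of neurodegenerative diseases and could be extended to other neurological disorders that involve abnormal facial expressions.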


📄 Abstract

Parkinson's disease (PD), a prevalent neurodegenerative disorder, significantly affects patients' daily functioning and social interactions. To facilitate a more efficient and accessible diagnostic approach for PD, we propose a dynamic facial expression analysis-based PD auxiliary diagnosis method. This method targets hypomimia, a characteristic clinical symptom of PD, by analyzing two manifestations: reduced facial expressivity and facial rigidity, thereby facilitating the diagnosis process. We develop a multimodal facial expression analysis network to extract expression intensity features during patients' performance of various facial expressions. This network leverages the CLIP architecture to integrate visual and textual features while preserving the temporal dynamics of facial expressions. Subsequently, the expression intensity features are processed and input into an LSTM-based classification network for PD diagnosis. Our method achieves an accuracy of 93.1%, outperforming other in-vitro PD diagnostic approaches. This technique offers a more convenient detection method for potential PD patients, improving their diagnostic experience.

[18] Transformer-Driven Multimodal Fusion for Explainable Suspiciousness Estimation in Visual Surveillance

Kuldeep Singh Yadav, Lalan Kumar

🧩 TL;DR

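This paper introduces USE50k, a large-scale annotated dataset, and DeepUSEvision, a lightweight vision framework for real-time suspiciousness analysis. The framework achieves explainable threat detection through multimodal fusion, establishing a scalable foundation for intelligent surveillance and safety-critical applications.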


📘 Detailed Summary

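Motivation: Suspiciousness estimation is critical for proactive threat detection and public safety in complex environments, but existing methods fall short in dataset scale, computational efficiency, and interpretability. A comprehensive solution is needed that can analyze multiple suspicious cues in real time across diverse, uncontrolled environments.

Method: The proposed DeepUSEvision framework has three core components: a Suspicious Object Detector based on an enhanced YOLOv12 architecture; dual deep convolutional neural networks (DCNN-I and DCNN-II) for facial expression and body-language recognition; and a transformer-based Discriminator Network that adaptively fuses the multimodal outputs to produce an interpretable suspiciousness score.

Result: Experiments show that the framework outperforms state-of-the-art methods in accuracy, robustness, and interpretability. The USE50k dataset contains 65,500 images from diverse, uncontrolled environments such as airports, railway stations, restaurants, and parks, covering cues that include weapons, fire, crowd density, abnormal facial expressions, and unusual body postures.

Conclusion: Together, the USE50k dataset and the DeepUSEvision framework establish a strong and scalable foundation for intelligent surveillance and real-time risk assessment. Through multimodal fusion and an interpretable design, the work offers an effective technical solution for proactive threat detection in safety-critical applications and advances real-time suspiciousness analysis.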


📄 Abstract

Suspiciousness estimation is critical for proactive threat detection and ensuring public safety in complex environments. This work introduces a large-scale annotated dataset, USE50k, along with a computationally efficient vision-based framework for real-time suspiciousness analysis. The USE50k dataset contains 65,500 images captured from diverse and uncontrolled environments, such as airports, railway stations, restaurants, parks, and other public areas, covering a broad spectrum of cues including weapons, fire, crowd density, abnormal facial expressions, and unusual body postures. Building on this dataset, we present DeepUSEvision, a lightweight and modular system integrating three key components, i.e., a Suspicious Object Detector based on an enhanced YOLOv12 architecture, dual Deep Convolutional Neural Networks (DCNN-I and DCNN-II) for facial expression and body-language recognition using image and landmark features, and a transformer-based Discriminator Network that adaptively fuses multimodal outputs to yield an interpretable suspiciousness score. Extensive experiments confirm the superior accuracy, robustness, and interpretability of the proposed framework compared to state-of-the-art approaches. Collectively, the USE50k dataset and the DeepUSEvision framework establish a strong and scalable foundation for intelligent surveillance and real-time risk assessment in safety-critical applications.

[19] TextGuider: Training-Free Guidance for Text Rendering via Attention Alignment

Kanghyun Baek, Sangyub Lee, Jin Young Choi, Jaewoo Song, Daemin Park, Jooyoung Choi, Chaehun Shin, Bohyung Han, Sungroh Yoon

🧩 TL;DR

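This paper proposes TextGuider, a training-free method that addresses text omission in diffusion models by aligning textual content tokens with text regions in the image, achieving state-of-the-art performance on text rendering.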


📘 Detailed Summary

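Motivation: Despite recent progress in diffusion-based text-to-image models, accurate text rendering remains difficult. Existing methods focus mainly on text accuracy and overlook text omission, where the desired text is partially or entirely missing; this critical issue remains under-studied.

Method: The method analyzes the attention patterns of text-related tokens in MM-DiT models and applies latent guidance during the early stage of the denoising process. The guidance is based on two loss functions introduced by the authors that align textual content tokens with text regions in the image, improving text rendering without any training.

Result: TextGuider achieves state-of-the-art performance in test-time text rendering, with significant gains in recall and strong results in OCR accuracy and CLIP score, effectively addressing text omission.

Conclusion: The study shows that analyzing attention patterns and applying early latent guidance can effectively address text omission in diffusion models. It offers a new approach to training-free text rendering improvement and makes text-to-image generation models more practical.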


📄 Abstract

Despite recent advances, diffusion-based text-to-image models still struggle with accurate text rendering. Several studies have proposed fine-tuning or training-free refinement methods for accurate text rendering. However, the critical issue of text omission, where the desired text is partially or entirely missing, remains largely overlooked. In this work, we propose TextGuider, a novel training-free method that encourages accurate and complete text appearance by aligning textual content tokens and text regions in the image. Specifically, we analyze attention patterns in MM-DiT models, particularly for text-related tokens intended to be rendered in the image. Leveraging this observation, we apply latent guidance during the early stage of denoising steps based on two loss functions that we introduce. Our method achieves state-of-the-art performance in test-time text rendering, with significant gains in recall and strong results in OCR accuracy and CLIP score.

[20] Video-QTR: Query-Driven Temporal Reasoning Framework for Lightweight Video Understanding

Xinkui Zhao, Zuxin Wang, Yifan Zhang, Guanjie Cheng, Yueshen Xu, Shuiguang Deng, Chang Liu, Naibo Wang, Jianwei Yin

🧩 TL;DR

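This paper proposes Video-QTR (Query-Driven Temporal Reasoning), a lightweight framework that redefines video understanding as a query-guided reasoning process and allocates perceptual resources dynamically. It substantially reduces the computational burden of long-video understanding, reaching state-of-the-art performance on multiple benchmarks while cutting input frame consumption by up to 73%.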


📘 Detailed Summary

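Motivation: Multimodal large language models have made notable progress in visual language reasoning, but applying them to long-video understanding is computationally intensive. Dense frame encoding produces excessive visual tokens, leading to high memory consumption, redundant computation, and limited scalability. The traditional "process-then-reason" paradigm is therefore inefficient when analyzing visual streams.

Method: Video-QTR redefines video understanding as a query-guided reasoning process. It uses a dynamic perceptual resource allocation mechanism that, based on the semantic intent of the query, creates an adaptive feedback loop between reasoning and perception. This avoids encoding every frame and enables lightweight video understanding.

Result: Extensive experiments on five benchmarks, including MSVD-QA, Activity Net-QA, Movie Chat, and Video MME, show that Video-QTR achieves state-of-the-art performance while reducing input frame consumption by up to 73%, demonstrating its efficiency and scalability.

Conclusion: Query-driven temporal reasoning provides an efficient and scalable solution for video understanding. Through dynamic resource allocation and an adaptive feedback mechanism, it substantially lowers the computational burden and opens a new path for long-video analysis in practical applications.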


📄 Abstract

The rapid development of multimodal large-language models (MLLMs) has significantly expanded the scope of visual language reasoning, enabling unified systems to interpret and describe complex visual content. However, applying these models to long-video understanding remains computationally intensive. Dense frame encoding generates excessive visual tokens, leading to high memory consumption, redundant computation, and limited scalability in real-world applications. This inefficiency highlights a key limitation of the traditional process-then-reason paradigm, which analyzes visual streams exhaustively before semantic reasoning. To address this challenge, we introduce Video-QTR (Query-Driven Temporal Reasoning), a lightweight framework that redefines video comprehension as a query-guided reasoning process. Instead of encoding every frame, Video-QTR dynamically allocates perceptual resources based on the semantic intent of the query, creating an adaptive feedback loop between reasoning and perception. Extensive experiments across five benchmarks: MSVD-QA, Activity Net-QA, Movie Chat, and Video MME demonstrate that Video-QTR achieves state-of-the-art performance while reducing input frame consumption by up to 73%. These results confirm that query-driven temporal reasoning provides an efficient and scalable solution for video understanding.

[21] Detection and Localization of Subdural Hematoma Using Deep Learning on Computed Tomography

Vasiliki Stoumpou, Rohan Kumar, Bernard Burman, Diego Ojeda, Tapan Mehta, Dimitris Bertsimas

🧩 TL;DR

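This study proposes a multimodal deep-learning framework for rapid detection and localization of subdural hematoma. By integrating clinical variables, a 3D convolutional neural network, and an enhanced 2D segmentation model, it achieves high diagnostic accuracy and produces anatomically meaningful localization maps.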


📘 Detailed Summary

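Motivation: Subdural hematoma (SDH) is a common neurosurgical emergency. Existing automated tools focus mainly on detection and offer little interpretability or spatial localization, so there is a need for transparent, high-performing systems that integrate multimodal clinical and imaging information to support real-time decision-making.

Method: The authors develop a multimodal deep-learning framework that integrates structured clinical variables, a 3D convolutional neural network trained on CT volumes, and a transformer-enhanced 2D segmentation model for SDH detection and localization. A greedy ensemble strategy combines the complementary predictors, and the models are trained on 25,315 head CT studies.

Result: Clinical variables alone have limited discriminatory power (AUC 0.75). Convolutional models based on CT volumes and on segmentation-derived maps substantially improve accuracy (AUCs of 0.922 and 0.926, respectively). The multimodal ensemble achieves the best overall performance (AUC 0.9407) and produces anatomically meaningful localization maps consistent with known SDH patterns.

Conclusion: The multimodal, interpretable framework provides rapid and accurate SDH detection and localization, combining high detection performance with transparent, anatomically grounded outputs. Integration into radiology workflows could streamline triage, reduce time to intervention, and improve consistency in SDH management.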


📄 Abstract

Background. Subdural hematoma (SDH) is a common neurosurgical emergency, with increasing incidence in aging populations. Rapid and accurate identification is essential to guide timely intervention, yet existing automated tools focus primarily on detection and provide limited interpretability or spatial localization. There remains a need for transparent, high-performing systems that integrate multimodal clinical and imaging information to support real-time decision-making. Methods. We developed a multimodal deep-learning framework that integrates structured clinical variables, a 3D convolutional neural network trained on CT volumes, and a transformer-enhanced 2D segmentation model for SDH detection and localization. Using 25,315 head CT studies from Hartford HealthCare (2015-2024), of which 3,774 (14.9%) contained clinician-confirmed SDH, tabular models were trained on demographics, comorbidities, medications, and laboratory results. Imaging models were trained to detect SDH and generate voxel-level probability maps. A greedy ensemble strategy combined complementary predictors. Findings. Clinical variables alone provided modest discriminatory power (AUC 0.75). Convolutional models trained on CT volumes and segmentation-derived maps achieved substantially higher accuracy (AUCs 0.922 and 0.926). The multimodal ensemble integrating all components achieved the best overall performance (AUC 0.9407; 95% CI, 0.930-0.951) and produced anatomically meaningful localization maps consistent with known SDH patterns. Interpretation. This multimodal, interpretable framework provides rapid and accurate SDH detection and localization, achieving high detection performance and offering transparent, anatomically grounded outputs. Integration into radiology workflows could streamline triage, reduce time to intervention, and improve consistency in SDH management.
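
The abstract mentions a greedy ensemble strategy without giving details. As a rough illustration only, the sketch below shows one common form of greedy ensembling (forward selection with replacement on a validation set, scored by AUC). The function names, the averaging rule, and the stopping criterion are assumptions made for this example, not the authors' procedure.

```python
from typing import Dict, List, Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC computed by pairwise comparison; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def greedy_ensemble(preds: Dict[str, List[float]],
                    labels: Sequence[int],
                    max_rounds: int = 10) -> List[str]:
    """Hypothetical greedy selection: in each round, add the model whose
    inclusion most improves the AUC of the averaged prediction, and stop
    when no model improves it."""
    chosen: List[str] = []
    total = [0.0] * len(labels)   # running sum of the chosen predictions
    best = 0.0
    for _ in range(max_rounds):
        candidate, candidate_auc = None, best
        for name, p in preds.items():
            avg = [(t + x) / (len(chosen) + 1) for t, x in zip(total, p)]
            score = auc(labels, avg)
            if score > candidate_auc:
                candidate, candidate_auc = name, score
        if candidate is None:
            break
        chosen.append(candidate)
        total = [t + x for t, x in zip(total, preds[candidate])]
        best = candidate_auc
    return chosen
```

A model can be picked more than once, which acts as an implicit weighting of the ensemble members.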

[22] Defect-aware Hybrid Prompt Optimization via Progressive Tuning for Zero-Shot Multi-type Anomaly Detection and Segmentation

Nadeem Nazer, Hongkuan Zhou, Lavdim Halilaj, Ylli Sadikaj, Steffen Staab

🧩 TL;DR

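This paper proposes DAPO, a defect-aware prompt optimization method based on progressive tuning for zero-shot multi-type and binary anomaly detection and segmentation. By aligning anomaly-relevant image features with their corresponding text semantics, it improves anomaly detection performance under distribution shift.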


📘 Detailed Summary

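Motivation: Existing vision language models such as CLIP mainly use high-level semantic information for anomaly detection and often ignore fine-grained anomaly types such as "hole", "cut", and "scratch", which give more specific insight into the nature of an anomaly. Hand-designing prompts for each defect type is time-consuming and prone to human bias, so an automated, defect-aware prompt optimization method is needed.

Method: DAPO is based on progressive tuning. It learns hybrid defect-aware prompts that combine fixed textual anchors with learnable token embeddings, aligning anomaly-relevant image features with their corresponding text semantics. The method is designed for zero-shot multi-type and binary anomaly detection and segmentation, and optimizes the prompt representation under distribution shift.

Result: Experiments on public benchmarks (MPDD, VisA, MVTec-AD, MAD, and Real-IAD) and an internal dataset show that, compared to baseline models, DAPO achieves a 3.7% average improvement in image-level AUROC and average precision under distribution shift, and a 6.5% average improvement in localizing novel anomaly types in zero-shot settings.

Conclusion: By recognizing fine-grained anomaly types, DAPO enriches the semantics of the "abnormal" representation and narrows the gap between coarse anomaly signals and fine-grained defect categories. This helps manufacturers understand the root cause of an anomaly and apply more targeted corrective measures quickly, giving industrial anomaly detection a more precise and interpretable solution.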


📄 Abstract

Recent vision language models (VLMs) like CLIP have demonstrated impressive anomaly detection performance under significant distribution shift by utilizing high-level semantic information through text prompts. However, these models often neglect fine-grained details, such as the specific kind of anomaly (e.g., "hole", "cut", or "scratch"), that could provide more specific insight into the nature of anomalies. We argue that recognizing fine-grained anomaly types 1) enriches the representation of "abnormal" with structured semantics, narrowing the gap between coarse anomaly signals and fine-grained defect categories; 2) enables manufacturers to understand the root causes of the anomaly and implement more targeted and appropriate corrective measures quickly. While incorporating such detailed semantic information is crucial, designing handcrafted prompts for each defect type is both time-consuming and susceptible to human bias. For this reason, we introduce DAPO, a novel approach for Defect-aware Prompt Optimization based on progressive tuning for zero-shot multi-type and binary anomaly detection and segmentation under distribution shifts. Our approach aligns anomaly-relevant image features with their corresponding text semantics by learning hybrid defect-aware prompts with both fixed textual anchors and learnable token embeddings. We conducted experiments on public benchmarks (MPDD, VisA, MVTec-AD, MAD, and Real-IAD) and an internal dataset. The results suggest that compared to the baseline models, DAPO achieves a 3.7% average improvement in AUROC and average precision metrics at the image level under distribution shift, and a 6.5% average improvement in localizing novel anomaly types under zero-shot settings.

[23] Building Reasonable Inference for Vision-Language Models in Blind Image Quality Assessment

Yuan Li, Zitang Sun, Yen-ju Chen, Shin'ya Nishida

🧩 TL;DR

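This study addresses contradictory and unstable predictions in blind image quality assessment based on vision-language models, and proposes a two-stage tuning method that explicitly separates visual perception from quality inference. Experiments show that the method substantially reduces prediction instability and improves performance on several benchmark datasets.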


📘 Detailed Summary

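Motivation: The study targets key problems in blind image quality assessment (BIQA) based on vision-language models (VLMs): the textual descriptions a model generates can contradict its final quality prediction, and predicted scores are unstable during inference, behaviors that are not aligned with human reasoning. The study analyzes the factors behind these contradictory assessments and this instability in order to encourage quality assessment that is closer to human reasoning.

Method: The study first analyzes the relationship between the final quality predictions and the generated visual features, finding that the predictions are not fully grounded in the features and that the logical connection between them is weak. Decoding intermediate VLM layers shows that the model relies heavily on a limited set of candidate tokens, which contributes to prediction instability. To address this, the study proposes a two-stage tuning method: in the first stage the model learns visual features, and in the second it infers quality solely from those features, explicitly separating visual perception from quality inference.

Result: Experiments on SPAQ and KONIQ show that the method reduces prediction instability from 22.00% to 12.39%. Across LIVE, CSIQ, SPAQ, and KONIQ it achieves average gains of 0.3124/0.3507 in SRCC/PLCC over the baseline. Further analysis shows that the method improves both the stability and the reliability of the inference process.

Conclusion: The study shows that explicitly separating the visual perception and quality inference stages can effectively address contradictory assessments and instability of VLMs on BIQA. The approach encourages a quality assessment process closer to human reasoning and offers useful insights and a practical framework for improving VLM-based quality assessment models.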


📄 Abstract

Recent progress in BIQA has been driven by VLMs, whose semantic reasoning abilities suggest that they might extract visual features, generate descriptive text, and infer quality in a human-like manner. However, these models often produce textual descriptions that contradict their final quality predictions, and the predicted scores can change unstably during inference - behaviors not aligned with human reasoning. To understand these issues, we analyze the factors that cause contradictory assessments and instability. We first estimate the relationship between the final quality predictions and the generated visual features, finding that the predictions are not fully grounded in the features and that the logical connection between them is weak. Moreover, decoding intermediate VLM layers shows that the model frequently relies on a limited set of candidate tokens, which contributes to prediction instability. To encourage more human-like reasoning, we introduce a two-stage tuning method that explicitly separates visual perception from quality inference. In the first stage, the model learns visual features; in the second, it infers quality solely from these features. Experiments on SPAQ and KONIQ demonstrate that our approach reduces prediction instability from 22.00% to 12.39% and achieves average gains of 0.3124/0.3507 in SRCC/PLCC across LIVE, CSIQ, SPAQ, and KONIQ compared to the baseline. Further analyses show that our method improves both stability and the reliability of the inference process.
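
The SRCC/PLCC gains reported above are rank and linear correlations between predicted and human quality scores. As a reference sketch (not the authors' evaluation code), both can be computed with the standard library as follows; ties in SRCC are handled with average ranks, which is the usual convention.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """PLCC: Pearson linear correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """SRCC: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))
```

SRCC only checks that predictions order the images as humans do, while PLCC also penalizes a non-linear relationship between predicted and human scores.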

[24] Investigate the Low-level Visual Perception in Vision-Language based Image Quality Assessment

Yuan Li, Zitang Sun, Yen-Ju Chen, Shin'ya Nishida

🧩 TL;DR

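This study reveals that multimodal large language models have deficiencies in perceiving low-level distortions in image quality assessment, and proposes component-wise fine-tuning of the vision encoder to strengthen vision-language alignment, which substantially improves distortion recognition.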


📘 Detailed Summary

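Motivation: Although multimodal large language models (MLLMs) can generate descriptive explanations in image quality assessment, they often fail to reliably detect basic low-level distortions such as blur, noise, and compression, and they produce inconsistent evaluations across repeated inferences. This raises the question of whether these models truly perceive the visual features that matter.

Method: The study introduces a low-level distortion perception task that evaluates a model's ability to classify specific distortion types, uses component-wise analysis to probe the model's structural representation capacity, and computes the semantic distance between visual features and their corresponding semantic tokens. It focuses in particular on component-wise fine-tuning of the vision encoder to strengthen vision-language alignment.

Result: Experiments show that although MLLMs are structurally capable of representing low-level distortions, they tend to overfit training templates, which biases quality scoring. Improving the alignment of the vision encoder raises distortion recognition accuracy from 14.92% to 84.43%, demonstrating the effectiveness of constraints on the vision encoder.

Conclusion: The results indicate that adding dedicated constraints to the vision encoder can strengthen text-explainable visual representations, enabling MLLM-based pipelines to produce more coherent and interpretable reasoning in vision-centric tasks. This points to an important direction for improving vision-language models on low-level vision tasks.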


📄 Abstract

Recent advances in Image Quality Assessment (IQA) have leveraged Multi-modal Large Language Models (MLLMs) to generate descriptive explanations. However, despite their strong visual perception modules, these models often fail to reliably detect basic low-level distortions such as blur, noise, and compression, and may produce inconsistent evaluations across repeated inferences. This raises an essential question: do MLLM-based IQA systems truly perceive the visual features that matter? To examine this issue, we introduce a low-level distortion perception task that requires models to classify specific distortion types. Our component-wise analysis shows that although MLLMs are structurally capable of representing such distortions, they tend to overfit training templates, leading to biases in quality scoring. As a result, critical low-level features are weakened or lost during the vision-language alignment transfer stage. Furthermore, by computing the semantic distance between visual features and corresponding semantic tokens before and after component-wise fine-tuning, we show that improving the alignment of the vision encoder dramatically enhances distortion recognition accuracy, increasing it from 14.92% to 84.43%. Overall, these findings indicate that incorporating dedicated constraints on the vision encoder can strengthen text-explainable visual representations and enable MLLM-based pipelines to produce more coherent and interpretable reasoning in vision-centric tasks.

[25] Content-Adaptive Image Retouching Guided by Attribute-Based Text Representation

Hancheng Zhu, Xinyu Liu, Rui Yao, Kunyang Sun, Leida Li, Abdulmotaleb El Saddik

🧩 TL;DR

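This paper proposes a content-adaptive image retouching method guided by attribute-based text representation (CA-ATP). A content-adaptive curve mapping module captures color diversity within the image, and multi-attribute text representations provide user-friendly style guidance. The method achieves state-of-the-art performance on several public datasets.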


📘 Detailed Summary

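Motivation: Existing image retouching methods mainly rely on a uniform pixel-wise color mapping across the whole image and neglect the inherent color variations induced by image content. This limits their ability to accommodate diverse color distributions and user-defined style preferences, and prevents truly adaptive retouching.

Method: The paper proposes CA-ATP, which has two core modules. The content-adaptive curve mapping module uses a series of basis curves to establish multiple color mapping relationships and learns the corresponding weight maps, enabling content-aware color adjustment based on spatial context. The attribute text prediction module generates text representations from multiple image attributes; these are fused with visual features through a multimodal model to provide user-friendly retouching guidance.

Result: Extensive experiments on several public datasets show that the method achieves state-of-the-art performance in image retouching. It captures the color diversity of image content, so that similar color values receive distinct transformations depending on their spatial context.

Conclusion: The study demonstrates the effectiveness of combining content-adaptive color mapping with attribute-based text representation. It offers a retouching solution that both adapts to diverse color distributions and meets user style preferences, advancing adaptive image enhancement.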


📄 Abstract

Image retouching has received significant attention due to its ability to achieve high-quality visual content. Existing approaches mainly rely on uniform pixel-wise color mapping across entire images, neglecting the inherent color variations induced by image content. This limitation hinders existing approaches from achieving adaptive retouching that accommodates both diverse color distributions and user-defined style preferences. To address these challenges, we propose a novel Content-Adaptive image retouching method guided by Attribute-based Text Representation (CA-ATP). Specifically, we propose a content-adaptive curve mapping module, which leverages a series of basis curves to establish multiple color mapping relationships and learns the corresponding weight maps, enabling content-aware color adjustments. The proposed module can capture color diversity within the image content, allowing similar color values to receive distinct transformations based on their spatial context. In addition, we propose an attribute text prediction module that generates text representations from multiple image attributes, which explicitly represent user-defined style preferences. These attribute-based text representations are subsequently integrated with visual features via a multimodal model, providing user-friendly guidance for image retouching. Extensive experiments on several public datasets demonstrate that our method achieves state-of-the-art performance.
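
To make the curve-mapping idea concrete, the toy sketch below blends several basis curves per pixel using spatial weight maps, so that two pixels with the same input value can map to different outputs. This is only an illustration under stated assumptions: the function name, the gamma-style basis curves in the usage note, and the hand-written weight maps are hypothetical, whereas in the paper the weight maps are learned.

```python
from typing import Callable, List

Curve = Callable[[float], float]   # maps an intensity in [0, 1] to [0, 1]
Image = List[List[float]]          # single channel, values in [0, 1]


def apply_basis_curves(image: Image,
                       curves: List[Curve],
                       weight_maps: List[Image]) -> Image:
    """Per-pixel blend of basis curves: out[r][c] is the weighted average of
    curve_k(image[r][c]) with weights weight_maps[k][r][c].
    Assumes the weights at each pixel are non-negative and not all zero."""
    out: Image = []
    for r, row in enumerate(image):
        out_row = []
        for c, value in enumerate(row):
            weights = [w[r][c] for w in weight_maps]
            blended = sum(wk * curve(value) for wk, curve in zip(weights, curves))
            out_row.append(blended / sum(weights))
        out.append(out_row)
    return out
```

For example, with `curves = [lambda v: v ** 0.5, lambda v: v ** 2]` (a brightening and a darkening curve), an image `[[0.25, 0.25]]`, and weight maps `[[[1.0, 0.0]], [[0.0, 1.0]]]`, the left pixel is brightened and the right pixel is darkened even though both start at the same value.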

[26] IF-Bench: Benchmarking and Enhancing MLLMs for Infrared Images with Generative Visual Prompting

Tao Zhang, Yuyang Hong, Yang Xia, Kun Ding, Zeyu Zhang, Ying Wang, Shiming Xiang, Chunhong Pan

🧩 TL;DR

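This paper introduces IF-Bench, the first high-quality benchmark for evaluating how well multimodal large language models understand infrared images, together with a training-free generative visual prompting method (GenViP) that mitigates domain distribution shift by translating infrared images into semantically and spatially aligned RGB images.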


📘 Detailed Summary

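Motivation: Although multimodal large language models (MLLMs) have made notable progress across many benchmarks, their ability to understand infrared images has not been adequately explored. No high-quality benchmark currently targets infrared image understanding, which limits research progress and applications in this area.

Method: The authors build IF-Bench, which contains 499 images from 23 infrared datasets and 680 carefully curated visual question-answer pairs covering 10 dimensions of image understanding. They also propose a training-free generative visual prompting method (GenViP), which uses advanced image editing models to translate infrared images into semantically and spatially aligned RGB images in order to mitigate domain distribution shift.

Result: The study systematically evaluates more than 40 open-source and closed-source MLLMs, using cyclic evaluation, bilingual assessment, and hybrid judgment strategies to improve the reliability of the results. Experiments show that GenViP yields significant performance improvements across a wide range of MLLMs, and the analysis reveals how model scale, architecture, and inference paradigm affect infrared image understanding.

Conclusion: The work fills the gap in benchmarks for infrared image understanding and provides an important reference for applying MLLMs in the infrared domain. GenViP offers an effective way to handle cross-modal domain adaptation, and the released benchmark and code should support further research in this area.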


📄 Abstract

Recent advances in multimodal large language models (MLLMs) have led to impressive progress across various benchmarks. However, their capability in understanding infrared images remains unexplored. To address this gap, we introduce IF-Bench, the first high-quality benchmark designed for evaluating multimodal understanding of infrared images. IF-Bench consists of 499 images sourced from 23 infrared datasets and 680 carefully curated visual question-answer pairs, covering 10 essential dimensions of image understanding. Based on this benchmark, we systematically evaluate over 40 open-source and closed-source MLLMs, employing cyclic evaluation, bilingual assessment, and hybrid judgment strategies to enhance the reliability of the results. Our analysis reveals how model scale, architecture, and inference paradigms affect infrared image comprehension, providing valuable insights for this area. Furthermore, we propose a training-free generative visual prompting (GenViP) method, which leverages advanced image editing models to translate infrared images into semantically and spatially aligned RGB counterparts, thereby mitigating domain distribution shifts. Extensive experiments demonstrate that our method consistently yields significant performance improvements across a wide range of MLLMs. The benchmark and code are available at https://github.com/casiatao/IF-Bench.

[27] An Automated Tip-and-Cue Framework for Optimized Satellite Tasking and Visual Intelligence

Gil Weissman, Amir Ivry, Israel Cohen

🧩 TL;DR

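This paper presents a fully automated Tip-and-Cue framework for satellite imaging tasking and scheduling. It generates tips from external data sources, optimizes multi-satellite observation plans, and uses AI models to process the imagery and produce structured reports. Its effectiveness is demonstrated in a maritime vessel tracking scenario.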


📘 Detailed Summary

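Motivation: As satellite constellations expand, tasking latency falls, and sensor capabilities diversify, opportunities for automated Earth observation keep growing. Existing systems, however, lack an end-to-end framework that automatically integrates external data sources, optimizes multi-satellite scheduling, and generates actionable insights.

Method: The method is a fully automated Tip-and-Cue framework. Tips come from external data sources or from analysis of prior satellite imagery, and are used to identify and prioritize spatiotemporal targets. The corresponding cues are the imaging tasks formulated in response, which incorporate sensor constraints, timing requirements, and utility functions. The system automatically generates candidate tasks, optimizes multi-satellite scheduling using continuous utility functions, and processes the resulting imagery with AI-based models (including object detectors and vision-language models) to produce structured visual reports that support interpretability and the identification of new insights.

Result: The framework is demonstrated in a maritime vessel tracking scenario, using Automatic Identification System (AIS) data for trajectory prediction, targeted observation, and the generation of actionable outputs. Maritime vessel tracking is a widely researched application that is often used to benchmark new approaches to satellite tasking, forecasting, and analysis, which supports the practical value of the system.

Conclusion: The proposed automated Tip-and-Cue framework integrates multi-source data, optimizes satellite observation scheduling, and generates structured analysis reports. The system is extensible to broader applications such as smart-city monitoring and disaster response, where timely tasking and automated analysis are critical, and it offers a technical path toward more intelligent Earth observation systems.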


📄 Abstract

The proliferation of satellite constellations, coupled with reduced tasking latency and diverse sensor capabilities, has expanded the opportunities for automated Earth observation. This paper introduces a fully automated Tip-and-Cue framework designed for satellite imaging tasking and scheduling. In this context, tips are generated from external data sources or analyses of prior satellite imagery, identifying spatiotemporal targets and prioritizing them for downstream planning. Corresponding cues are the imaging tasks formulated in response, which incorporate sensor constraints, timing requirements, and utility functions. The system autonomously generates candidate tasks, optimizes their scheduling across multiple satellites using continuous utility functions that reflect the expected value of each observation, and processes the resulting imagery using artificial-intelligence-based models, including object detectors and vision-language models. Structured visual reports are generated to support both interpretability and the identification of new insights for downstream tasking. The efficacy of the framework is demonstrated through a maritime vessel tracking scenario, utilizing Automatic Identification System (AIS) data for trajectory prediction, targeted observations, and the generation of actionable outputs. Maritime vessel tracking is a widely researched application, often used to benchmark novel approaches to satellite tasking, forecasting, and analysis. The system is extensible to broader applications such as smart-city monitoring and disaster response, where timely tasking and automated analysis are critical.
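
The abstract does not specify the scheduling optimizer. The sketch below shows a simple utility-greedy baseline for assigning candidate imaging tasks to satellites; it is a stand-in for illustration, not the paper's method. The `Opportunity` record, its fields, and the use of a single scalar utility per opportunity (rather than a continuous utility function of time) are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Opportunity:
    """Hypothetical record: one chance for a satellite to image a task."""
    task: str
    satellite: str
    start: float     # start of the imaging window
    end: float       # end of the imaging window
    utility: float   # expected value of this observation


def greedy_schedule(opportunities: List[Opportunity]) -> List[Opportunity]:
    """Take opportunities in decreasing utility; accept one if its task is
    still unscheduled and its satellite is free during the window."""
    scheduled: List[Opportunity] = []
    done_tasks = set()
    busy: Dict[str, List[Tuple[float, float]]] = {}
    for opp in sorted(opportunities, key=lambda o: o.utility, reverse=True):
        if opp.task in done_tasks:
            continue
        slots = busy.setdefault(opp.satellite, [])
        # Two intervals overlap when each one starts before the other ends.
        if any(opp.start < e and s < opp.end for s, e in slots):
            continue
        slots.append((opp.start, opp.end))
        done_tasks.add(opp.task)
        scheduled.append(opp)
    return scheduled
```

A greedy pass like this is fast but not optimal in general; an exact formulation would treat the assignment as a constrained optimization problem.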

[28] Modality-Specific Enhancement and Complementary Fusion for Semi-Supervised Multi-Modal Brain Tumor Segmentation

Tien-Dat Chung, Ba-Thinh Lam, Thanh-Huy Nguyen, Thien Nguyen, Nguyen Lan Vi Vu, Hoang-Loc Cao, Phat Kim Huynh, Min Xu

🧩 TL;DR

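This paper proposes a novel semi-supervised multi-modal medical image segmentation framework. With a modality-specific enhancement module and a learnable complementary information fusion module, it addresses semantic discrepancies and misalignment across multi-modal MRI sequences and substantially improves segmentation performance with limited labeled data.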


📘 Detailed Summary

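Motivation: Existing semi-supervised learning methods for multi-modal medical image segmentation struggle to exploit the complementary information between modalities, mainly because of semantic discrepancies and misalignment across MRI sequences. This limits how much models can improve when labeled data is scarce.

Method: The paper proposes a semi-supervised multi-modal framework. A Modality-specific Enhancing Module (MEM) strengthens the semantic cues unique to each modality via channel-wise attention, and a learnable Complementary Information Fusion (CIF) module adaptively exchanges complementary knowledge between modalities. The framework is optimized with a hybrid objective that combines a supervised segmentation loss with cross-modal consistency regularization.

Result: Experiments on BraTS 2019 (HGG subset) show that the method outperforms existing semi-supervised and multi-modal baselines under 1%, 5%, and 10% labeled data settings, with significant improvements in both Dice and Sensitivity. Ablation studies further confirm the complementary effects of MEM and CIF in bridging cross-modality discrepancies and improving segmentation robustness.

Conclusion: The study demonstrates that explicitly enhancing modality-specific representations and adaptively fusing cross-modal information is effective for semi-supervised multi-modal medical image segmentation. It offers a new way to handle semantic discrepancies and misalignment between modalities and a promising direction for multi-modal medical image analysis with limited labels.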


📄 Abstract

Semi-supervised learning (SSL) has become a promising direction for medical image segmentation, enabling models to learn from limited labeled data alongside abundant unlabeled samples. However, existing SSL approaches for multi-modal medical imaging often struggle to exploit the complementary information between modalities due to semantic discrepancies and misalignment across MRI sequences. To address this, we propose a novel semi-supervised multi-modal framework that explicitly enhances modality-specific representations and facilitates adaptive cross-modal information fusion. Specifically, we introduce a Modality-specific Enhancing Module (MEM) to strengthen semantic cues unique to each modality via channel-wise attention, and a learnable Complementary Information Fusion (CIF) module to adaptively exchange complementary knowledge between modalities. The overall framework is optimized using a hybrid objective combining supervised segmentation loss and cross-modal consistency regularization on unlabeled data. Extensive experiments on the BraTS 2019 (HGG subset) demonstrate that our method consistently outperforms strong semi-supervised and multi-modal baselines under 1%, 5%, and 10% labeled data settings, achieving significant improvements in both Dice and Sensitivity scores. Ablation studies further confirm the complementary effects of our proposed MEM and CIF in bridging cross-modality discrepancies and improving segmentation robustness under scarce supervision.
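
The hybrid objective described above can be sketched as a supervised term on labeled data plus a weighted consistency term on unlabeled data. The specific choices below (soft Dice loss, mean squared error between two modality branches, and the weight `lam`) are assumptions for illustration; the abstract does not state which losses or weights the authors use.

```python
from typing import Sequence


def dice_loss(pred: Sequence[float], target: Sequence[float],
              eps: float = 1e-6) -> float:
    """Soft Dice loss over flattened foreground probabilities."""
    intersection = sum(p * t for p, t in zip(pred, target))
    denominator = sum(pred) + sum(target)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def consistency_loss(pred_a: Sequence[float], pred_b: Sequence[float]) -> float:
    """Mean squared difference between two modality branches' predictions."""
    return sum((a - b) ** 2 for a, b in zip(pred_a, pred_b)) / len(pred_a)


def hybrid_objective(labeled_pred: Sequence[float],
                     labeled_target: Sequence[float],
                     unlabeled_pred_a: Sequence[float],
                     unlabeled_pred_b: Sequence[float],
                     lam: float = 0.1) -> float:
    """Supervised loss on labeled voxels plus lam times the cross-modal
    consistency penalty on unlabeled voxels (lam is a hypothetical weight)."""
    supervised = dice_loss(labeled_pred, labeled_target)
    unsupervised = consistency_loss(unlabeled_pred_a, unlabeled_pred_b)
    return supervised + lam * unsupervised
```

The consistency term needs no labels, which is what lets the unlabeled scans contribute to training.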

[29] DynaIP: Dynamic Image Prompt Adapter for Scalable Zero-shot Personalized Text-to-Image Generation

Zhizhong Wang, Tianyi Chu, Zeyi Huang, Nanyang Wang, Kehan Li

🧩 TL;DR

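This paper proposes DynaIP (Dynamic Image Prompt Adapter), an advanced plugin for personalized text-to-image generation. Using a dynamic decoupling strategy and a hierarchical mixture-of-experts feature fusion module, it improves concept fidelity, the balance between concept preservation and prompt following, and multi-subject scalability.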


📘 Detailed Summary

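Motivation: Current personalized text-to-image generation methods face three core challenges: the balance between concept preservation and prompt following is hard to strike, fine-grained details of reference images are hard to retain, and scalability to multi-subject personalization is limited. Existing methods struggle to solve all three in the zero-shot setting, which calls for a new adapter design.

Method: Based on the observation that multimodal diffusion transformers exhibit decoupling learning behavior, the authors propose a Dynamic Decoupling Strategy that removes the interference of concept-agnostic information during inference. They also design a Hierarchical Mixture-of-Experts Feature Fusion Module that fully leverages the hierarchical features of the CLIP encoder, giving flexible control over visual granularity and improving fine-grained concept fidelity.

Result: Extensive experiments on single-subject and multi-subject personalized text-to-image generation show that DynaIP outperforms existing methods in concept fidelity, the balance between concept preservation and prompt following, and the scalability of multi-subject compositions.

Conclusion: The study reveals the inherent decoupling learning behavior of multimodal diffusion transformers and proposes effective dynamic decoupling and hierarchical feature fusion mechanisms. It offers a new technical path for personalized text-to-image generation, achieving a better balance between preserving concept details and following the text prompt while making multi-subject composition more flexible.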


📄 Abstract

Personalized Text-to-Image (PT2I) generation aims to produce customized images based on reference images. A prominent interest pertains to the integration of an image prompt adapter to facilitate zero-shot PT2I without test-time fine-tuning. However, current methods grapple with three fundamental challenges: 1. the elusive equilibrium between Concept Preservation (CP) and Prompt Following (PF), 2. the difficulty in retaining fine-grained concept details in reference images, and 3. the restricted scalability to extend to multi-subject personalization. To tackle these challenges, we present Dynamic Image Prompt Adapter (DynaIP), a cutting-edge plugin to enhance the fine-grained concept fidelity, CP-PF balance, and subject scalability of SOTA T2I multimodal diffusion transformers (MM-DiT) for PT2I generation. Our key finding is that MM-DiT inherently exhibits decoupling learning behavior when reference image features are injected into its dual branches via cross-attention. Based on this, we design an innovative Dynamic Decoupling Strategy that removes the interference of concept-agnostic information during inference, significantly enhancing the CP-PF balance and further bolstering the scalability of multi-subject compositions. Moreover, we identify the visual encoder as a key factor affecting fine-grained CP and reveal that the hierarchical features of commonly used CLIP can capture visual information at diverse granularity levels. Therefore, we introduce a novel Hierarchical Mixture-of-Experts Feature Fusion Module to fully leverage the hierarchical features of CLIP, remarkably elevating the fine-grained concept fidelity while also providing flexible control of visual granularity. Extensive experiments across single- and multi-subject PT2I tasks verify that our DynaIP outperforms existing approaches, marking a notable advancement in the field of PT2I generation.

[30] UniUGP: Unifying Understanding, Generation, and Planing For End-to-end Autonomous Driving

Hao Lu, Ziyang Liu, Guangfeng Jiang, Yuanfei Luo, Sheng Chen, Yangang Zhang, Ying-Cong Chen

🧩 TL;DR

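This paper proposes UniUGP, a unified Understanding-Generation-Planning framework that combines scene reasoning, future video generation, and trajectory planning to address the limited knowledge and weak visual dynamic modeling of autonomous driving systems in long-tail scenarios.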


📘 Detailed Summary

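Motivation: Autonomous driving systems perform poorly in long-tail scenarios, mainly because of limited world knowledge and weak visual dynamic modeling. Existing vision-language-action methods cannot use unlabeled videos for visual causal learning, while world-model-based methods lack the reasoning capabilities of large language models.

Method: The paper constructs multiple specialized datasets that provide reasoning and planning annotations for complex scenarios, and proposes UniUGP, a unified Understanding-Generation-Planning framework that synergizes scene reasoning, future video generation, and trajectory planning through a hybrid expert architecture. The framework integrates pre-trained vision-language models and video generation models, using visual dynamics and semantic reasoning to improve planning, and builds these capabilities progressively with a four-stage training strategy.

Result: Experiments show that UniUGP achieves state-of-the-art performance in perception, reasoning, and decision-making, with superior generalization to challenging long-tail scenarios. The system produces interpretable chain-of-thought reasoning, physically consistent trajectories, and coherent future videos.

Conclusion: The study shows the value of integrating visual dynamic modeling and semantic reasoning in a unified framework for improving autonomous driving in complex scenarios. UniUGP's results suggest that combining specialized datasets with a multi-stage training strategy can address long-tail challenges, and point to a new architectural direction for future autonomous driving systems.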


📄 Abstract

Autonomous driving (AD) systems struggle in long-tail scenarios due to limited world knowledge and weak visual dynamic modeling. Existing vision-language-action (VLA)-based methods cannot leverage unlabeled videos for visual causal learning, while world model-based methods lack reasoning capabilities from large language models. In this paper, we construct multiple specialized datasets providing reasoning and planning annotations for complex scenarios. Then, a unified Understanding-Generation-Planning framework, named UniUGP, is proposed to synergize scene reasoning, future video generation, and trajectory planning through a hybrid expert architecture. By integrating pre-trained VLMs and video generation models, UniUGP leverages visual dynamics and semantic reasoning to enhance planning performance. Taking multi-frame observations and language instructions as input, it produces interpretable chain-of-thought reasoning, physically consistent trajectories, and coherent future videos. We introduce a four-stage training strategy that progressively builds these capabilities across multiple existing AD datasets, along with the proposed specialized datasets. Experiments demonstrate state-of-the-art performance in perception, reasoning, and decision-making, with superior generalization to challenging long-tail situations.

[31] Benchmarking Document Parsers on Mathematical Formula Extraction from PDFs

Pius Horn, Janis Keuper

🧩 TL;DR

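This paper introduces a novel benchmarking framework for parsing mathematical formulas from PDFs. Using synthetic PDF documents and an LLM-as-a-judge semantic evaluation method, it systematically evaluates more than 20 PDF parsers and gives practitioners a basis for choosing a parser for downstream applications.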


📘 Detailed Summary

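Motivation: Existing PDF parsing benchmarks either exclude mathematical formulas entirely or lack semantically aware evaluation metrics. This limits the use of academic literature for training large language models and building scientific knowledge bases, so a precise evaluation framework with systematic control over layout, formulas, and content characteristics is needed.

Method: The benchmark is built on synthetically generated PDF documents with exact LaTeX as ground truth. It uses LLM-as-a-judge for semantic formula assessment, combined with a robust two-stage matching pipeline that handles inconsistencies in parser output, and validates the reliability of the evaluation method against human ratings.

Result: Human validation on 250 formula pairs (750 ratings from 30 evaluators) shows that LLM-based evaluation correlates substantially better with human judgment (Pearson r=0.78) than CDM (r=0.34) or text similarity (r≈0). Evaluating more than 20 contemporary PDF parsers (including specialized OCR models, vision-language models, and rule-based approaches) on 100 synthetic documents with more than 2,000 formulas reveals significant performance differences.

Conclusion: The study gives practitioners key insights for selecting PDF parsers for downstream applications and establishes a robust, scalable methodology for reproducible evaluation of PDF formula extraction quality. The code and benchmark data are open-sourced to support further research.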


📄 Abstract

Correctly parsing mathematical formulas from PDFs is critical for training large language models and building scientific knowledge bases from academic literature, yet existing benchmarks either exclude formulas entirely or lack semantically-aware evaluation metrics. We introduce a novel benchmarking framework centered on synthetically generated PDFs with precise LaTeX ground truth, enabling systematic control over layout, formulas, and content characteristics. A key methodological contribution is pioneering LLM-as-a-judge for semantic formula assessment, combined with a robust two-stage matching pipeline that handles parser output inconsistencies. Through human validation on 250 formula pairs (750 ratings from 30 evaluators), we demonstrate that LLM-based evaluation achieves substantially higher correlation with human judgment (Pearson r=0.78) compared to CDM (r=0.34) and text similarity (r≈0). Evaluating 20+ contemporary PDF parsers (including specialized OCR models, vision-language models, and rule-based approaches) across 100 synthetic documents with 2,000+ formulas reveals significant performance disparities. Our findings provide crucial insights for practitioners selecting parsers for downstream applications and establish a robust, scalable methodology that enables reproducible evaluation of PDF formula extraction quality. Code and benchmark data: https://github.com/phorn1/pdf-parse-bench
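
The Pearson correlations quoted above measure how closely an automatic metric tracks human ratings. As a minimal sketch of that comparison, the Python below computes Pearson r with the standard library only. The function name `pearson_r` and the sample scores are illustrative assumptions, not code or data from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, all left unnormalized: the 1/n factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up scores for illustration only (not data from the paper).
human_ratings = [1.0, 2.0, 3.0, 4.0, 5.0]
metric_scores = [0.10, 0.35, 0.50, 0.80, 0.90]
print(round(pearson_r(human_ratings, metric_scores), 3))
```

A value near 1 means the metric ranks and spaces formula pairs much as human raters do, while a value near 0 means the metric carries almost no linear information about human judgment.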

[32] VisualActBench: Can VLMs See and Act like a Human?

Daoan Zhang, Pai Liu, Xiaofei Zhou, Yuan Ge, Guangchen Lan, Jing Bi, Christopher Brinton, Ehsan Hoque, Jiebo Luo

🧩 TL;DR

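This study proposes a new task, Visual Action Reasoning, and the VisualActBench benchmark to evaluate how well vision-language models can proactively reason and act from visual input alone, without text prompts. It reveals a significant gap between current models and human-level reasoning.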


📘 Detailed Summary

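Motivation: Vision-language models have made notable progress in perceiving and describing visual environments, but their ability to proactively reason and act based solely on visual input, without explicit textual prompts, remains underexplored. This limits their practical value in real-world applications.

Method: The study introduces the Visual Action Reasoning task and builds VisualActBench, a large-scale benchmark with 1,074 videos and 3,733 human-annotated actions across four real-world scenarios. Each action is labeled with an Action Prioritization Level (APL) and a proactive-reactive type. The benchmark is used to assess the human-aligned reasoning and value sensitivity of 29 vision-language models.

Result: Frontier models such as GPT4o perform relatively well, but a significant gap to human-level reasoning remains, particularly in generating proactive, high-priority actions. Current models show clear limitations in interpreting complex context, anticipating outcomes, and aligning with human decision-making frameworks.

Conclusion: VisualActBench establishes a comprehensive foundation for assessing and improving the real-world readiness of proactive, vision-centric AI agents. The results highlight the shortcomings of current vision-language models in proactive reasoning and alignment with human values, and provide direction and evaluation criteria for future model development.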


📄 Abstract

Vision-Language Models (VLMs) have achieved impressive progress in perceiving and describing visual environments. However, their ability to proactively reason and act based solely on visual inputs, without explicit textual prompts, remains underexplored. We introduce a new task, Visual Action Reasoning, and propose VisualActBench, a large-scale benchmark comprising 1,074 videos and 3,733 human-annotated actions across four real-world scenarios. Each action is labeled with an Action Prioritization Level (APL) and a proactive-reactive type to assess models' human-aligned reasoning and value sensitivity. We evaluate 29 VLMs on VisualActBench and find that while frontier models like GPT4o demonstrate relatively strong performance, a significant gap remains compared to human-level reasoning, particularly in generating proactive, high-priority actions. Our results highlight limitations in current VLMs' ability to interpret complex context, anticipate outcomes, and align with human decision-making frameworks. VisualActBench establishes a comprehensive foundation for assessing and improving the real-world readiness of proactive, vision-centric AI agents.

[33] ReViSE: Towards Reason-Informed Video Editing in Unified Models with Self-Reflective Learning

Xinyu Liu, Hangjie Yuan, Yujie Wei, Jiazheng Xing, Yujin Han, Jiahao Pan, Yanbiao Ma, Chi-Min Chan, Kang Zhao, Shiwei Zhang, Wenhan Luo, Yike Guo

🧩 TL;DR

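This paper introduces the Reason-Informed Video Editing (RVE) task and the ReViSE framework, which unifies video generation and evaluation through a self-reflective reasoning mechanism and significantly improves reasoning-based video editing.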


📘 Detailed Summary

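Motivation: Existing video unified models are strong at understanding and generation but fall short on reasoning-based visual editing. The main causes are the lack of datasets for training and evaluating reasoning-aware video editing, and a disconnect between the models' reasoning and editing capabilities that prevents rich understanding from guiding the editing process.

Method: The study introduces the Reason-Informed Video Editing (RVE) task and constructs RVE-Bench, a benchmark with two complementary subsets: reasoning-informed video editing and in-context video generation. It proposes the ReViSE framework, which uses a self-reflective reasoning mechanism to unify generation and evaluation in a single architecture, with the internal vision-language model providing intrinsic feedback that refines the generator's reasoning behavior.

Result: Extensive experiments on RVE-Bench show that ReViSE significantly improves editing accuracy and visual fidelity, achieving a 32% improvement in the Overall score over state-of-the-art methods on the reasoning-informed video editing subset.

Conclusion: By integrating reasoning with visual transformation, the study establishes a framework that connects understanding with editing and offers a systematic approach to the reasoning challenges in video editing. It also points to a direction for future video editing models and underlines the importance of reasoning in complex visual tasks.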


📄 Abstract

Video unified models exhibit strong capabilities in understanding and generation, yet they struggle with reason-informed visual editing even when equipped with powerful internal vision-language models (VLMs). We attribute this gap to two factors: 1) existing datasets are inadequate for training and evaluating reasoning-aware video editing, and 2) an inherent disconnect between the models' reasoning and editing capabilities, which prevents the rich understanding from effectively instructing the editing process. Bridging this gap requires an integrated framework that connects reasoning with visual transformation. To this end, we introduce the Reason-Informed Video Editing (RVE) task, which requires reasoning about physical plausibility and causal dynamics during editing. To support systematic evaluation, we construct RVE-Bench, a comprehensive benchmark with two complementary subsets: Reasoning-Informed Video Editing and In-Context Video Generation. These subsets cover diverse reasoning dimensions and real-world editing scenarios. Building upon this foundation, we propose ReViSE, a Self-Reflective Reasoning (SRF) framework that unifies generation and evaluation within a single architecture. The model's internal VLM provides intrinsic feedback by assessing whether the edited video logically satisfies the given instruction. This differential feedback refines the generator's reasoning behavior during training. Extensive experiments on RVE-Bench demonstrate that ReViSE significantly enhances editing accuracy and visual fidelity, achieving a 32% improvement of the Overall score in the reasoning-informed video editing subset over state-of-the-art methods.

cs.CL [Back]

[34] Neurosymbolic Information Extraction from Transactional Documents

Arthur Hemmer, Mickaël Coustaty, Nicola Bartolo, Jean-Marc Ogier

🧩 TL;DR

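This paper proposes a neurosymbolic framework for document information extraction. By integrating symbolic validation methods, it enables more effective zero-shot output and knowledge distillation and significantly improves performance on transactional documents.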


📘 Detailed Summary

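Motivation: The study addresses the challenges of information extraction from transactional documents, in particular how to extract accurately without large amounts of labeled data and how to ensure that extracted results satisfy domain-specific arithmetic constraints and structural requirements.

Method: The approach uses a neurosymbolic framework in which language models generate candidate extractions that are then filtered through syntactic-, task-, and domain-level validation to ensure adherence to domain-specific arithmetic constraints. It also proposes a method for generating high-quality labels for knowledge distillation.

Result: Experiments show significant performance gains on transactional document processing, with clear improvements in both F1-score and accuracy, supporting the effectiveness and practicality of the neurosymbolic validation framework.

Conclusion: The study demonstrates the advantages of neurosymbolic methods for document information extraction, especially the use of symbolic validation to make language model output more reliable and accurate. It offers an effective solution for document processing in low-resource settings and shows the potential of knowledge distillation for improving model performance.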


📄 Abstract

This paper presents a neurosymbolic framework for information extraction from documents, evaluated on transactional documents. We introduce a schema-based approach that integrates symbolic validation methods to enable more effective zero-shot output and knowledge distillation. The methodology uses language models to generate candidate extractions, which are then filtered through syntactic-, task-, and domain-level validation to ensure adherence to domain-specific arithmetic constraints. Our contributions include a comprehensive schema for transactional documents, relabeled datasets, and an approach for generating high-quality labels for knowledge distillation. Experimental results demonstrate significant improvements in $F_1$-scores and accuracy, highlighting the effectiveness of neurosymbolic validation in transactional document processing.
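
To make the domain-level validation step concrete, here is a minimal sketch of filtering candidate extractions with arithmetic constraints. The field names (`line_items`, `subtotal`, `tax`, `total`), the two constraints, and the tolerance are hypothetical choices for illustration; the abstract does not specify the paper's schema or rules.

```python
from decimal import Decimal


def check_totals(extraction, tolerance=Decimal("0.01")):
    """Return the list of violated constraints; an empty list means the candidate passes.

    The schema and both constraints are assumptions made for this sketch.
    """
    errors = []
    line_sum = sum((item["amount"] for item in extraction["line_items"]), Decimal("0"))
    if abs(line_sum - extraction["subtotal"]) > tolerance:
        errors.append("line items do not sum to subtotal")
    if abs(extraction["subtotal"] + extraction["tax"] - extraction["total"]) > tolerance:
        errors.append("subtotal + tax does not equal total")
    return errors


def filter_candidates(candidates):
    """Keep only the candidates that violate no arithmetic constraint."""
    return [c for c in candidates if not check_totals(c)]


# Two hypothetical candidates, as a language model might produce for one receipt.
good = {
    "line_items": [{"amount": Decimal("10.00")}, {"amount": Decimal("5.50")}],
    "subtotal": Decimal("15.50"),
    "tax": Decimal("1.55"),
    "total": Decimal("17.05"),
}
bad = dict(good, total=Decimal("71.05"))  # transposed digits in the total

print(len(filter_candidates([good, bad])))
```

`Decimal` is used instead of `float` so that monetary sums compare without binary rounding noise. Candidates that pass such checks could then serve as higher-confidence labels, which is the role the abstract assigns to validation in knowledge distillation.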

[35] ChronusOmni: Improving Time Awareness of Omni Large Language Models

Yijing Chen, Yihan Wu, Kaisi Guan, Yuchen Ren, Yuyue Wang, Ruihua Song, Liyun Ru

🧩 TL;DR

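This paper proposes ChronusOmni, an omni large language model with enhanced time awareness that targets both explicit and implicit temporal grounding across audio and visual modalities. With unified temporal modeling and reinforcement learning rewards, it achieves state-of-the-art performance on multiple benchmarks.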


📘 Detailed Summary

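Motivation: Existing methods mainly target vision-language scenarios and focus on explicit temporal grounding. They make insufficient use of the audio modality and overlook implicit temporal grounding across modalities, such as identifying what is visually present when a character speaks or determining what is said when a visual event occurs, even though such cross-modal temporal relations are common in real-world scenarios.

Method: First, text-based timestamp tokens are interleaved with visual and audio representations at each time unit, giving unified temporal modeling across modalities. Second, reinforcement learning with specially designed reward functions enforces correct temporal ordering and strengthens fine-grained temporal reasoning. In addition, the authors construct ChronusAV, a temporally accurate, modality-complete, and cross-modally aligned dataset that supports training and evaluation on audiovisual temporal grounding.

Result: Experiments show that ChronusOmni achieves state-of-the-art performance on ChronusAV with more than 30% improvement, and obtains top results on most metrics of other temporal grounding benchmarks, while preserving general video and audio understanding capabilities.

Conclusion: The study underlines the importance of cross-modal time awareness. The proposed unified temporal modeling and reinforcement learning rewards address both explicit and implicit temporal grounding and offer a new approach to temporal understanding in omni large language models, and the high-quality dataset provides a resource for future research.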


📄 Abstract

Time awareness is a fundamental ability of omni large language models, especially for understanding long videos and answering complex questions. Previous approaches mainly target vision-language scenarios and focus on explicit temporal grounding questions, such as identifying when a visual event occurs or determining what event happens at a specific time. However, they often make insufficient use of the audio modality, and overlook implicit temporal grounding across modalities--for example, identifying what is visually present when a character speaks, or determining what is said when a visual event occurs--despite such cross-modal temporal relations being prevalent in real-world scenarios. In this paper, we propose ChronusOmni, an omni large language model designed to enhance temporal awareness for both explicit and implicit audiovisual temporal grounding. First, we interleave text-based timestamp tokens with visual and audio representations at each time unit, enabling unified temporal modeling across modalities. Second, to enforce correct temporal ordering and strengthen fine-grained temporal reasoning, we incorporate reinforcement learning with specially designed reward functions. Moreover, we construct ChronusAV, a temporally-accurate, modality-complete, and cross-modal-aligned dataset to support training and evaluation on the audiovisual temporal grounding task. Experimental results demonstrate that ChronusOmni achieves state-of-the-art performance on ChronusAV with more than 30% improvement and top results on most metrics on other temporal grounding benchmarks. This highlights the strong temporal awareness of our model across modalities, while preserving general video and audio understanding capabilities.
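
The interleaving step described above can be pictured as building one token sequence in which every time unit starts with a timestamp marker. The sketch below is only an illustration of that idea: the `<t=...s>` marker format, the ordering of visual before audio tokens, and the use of strings as stand-ins for embeddings are all assumptions, not details given in the abstract.

```python
def interleave_with_timestamps(visual_units, audio_units, seconds_per_unit=1.0):
    """Build one sequence: [timestamp, visual tokens, audio tokens] per time unit.

    visual_units and audio_units are lists with one entry per time unit, and
    each entry is a list of tokens (strings stand in for embeddings here).
    """
    if len(visual_units) != len(audio_units):
        raise ValueError("both modalities must cover the same number of time units")
    sequence = []
    for index, (visual, audio) in enumerate(zip(visual_units, audio_units)):
        # Hypothetical text timestamp marker for the start of this time unit.
        sequence.append(f"<t={index * seconds_per_unit:.1f}s>")
        sequence.extend(visual)
        sequence.extend(audio)
    return sequence


# Two time units with placeholder tokens.
visual = [["v0_a", "v0_b"], ["v1_a", "v1_b"]]
audio = [["a0"], ["a1"]]
print(interleave_with_timestamps(visual, audio))
```

Because both modalities share the same timestamp marker within a unit, a question such as "what is shown while this sentence is spoken" reduces to looking up tokens that sit behind the same marker.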