Table of Contents

cs.CV

[1] Do Vision-Language Models Measure Up? Benchmarking Visual Measurement Reading with MeasureBench

Fenfen Lin, Yesheng Liu, Haiyu Xu, Chen Yue, Zheqi He, Mingxuan Zhao, Miguel Hu Chen, Jiakang Liu, JG Yao, Xi Yang

🧩 TL;DR

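This paper introduces the MeasureBench benchmark and an extensible data synthesis pipeline for evaluating vision-language models (VLMs) on reading measurement instruments, revealing a fundamental limitation of current VLMs in fine-grained spatial grounding.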


📘 Detailed Summary

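Motivation: Current vision-language models struggle to read measurement instruments, even though the task is relatively easy for humans and requires little domain expertise. Preliminary evaluation shows that VLMs perform poorly on it, so a dedicated benchmark is needed to systematically evaluate and improve their visual measurement reading ability.

Method: The authors develop MeasureBench, a benchmark covering various measurement instruments in both real-world and synthesized images, and build an extensible data synthesis pipeline that procedurally generates gauges with controllable visual appearance, allowing key details such as pointers, scales, fonts, lighting, and background clutter to be varied at scale.

Result: Evaluation shows that even the strongest frontier VLMs generally perform poorly on measurement reading. The main failure mode is indicator localization: models can read digits or labels but misidentify the key positions of pointers or alignments, producing large numeric errors despite plausible textual reasoning. Reinforcement learning experiments on synthetic data show encouraging results but fail to generalize effectively to real-world images.

Conclusion: The study reveals a fundamental limitation of current VLMs in fine-grained spatial grounding and highlights the gap between recognizing numbers and measuring the world. The authors hope the resource will drive future progress on visually grounded numeracy and precise spatial perception in VLMs.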


📄 Abstract

Reading measurement instruments is effortless for humans and requires relatively little domain expertise, yet it remains surprisingly challenging for current vision-language models (VLMs), as we find in a preliminary evaluation. In this work, we introduce MeasureBench, a benchmark on visual measurement reading covering both real-world and synthesized images of various types of measurements, along with an extensible pipeline for data synthesis. Our pipeline procedurally generates a specified type of gauge with controllable visual appearance, enabling scalable variation in key details such as pointers, scales, fonts, lighting, and clutter. Evaluation on popular proprietary and open-weight VLMs shows that even the strongest frontier VLMs struggle with measurement reading in general. A consistent failure mode is indicator localization: models can read digits or labels but misidentify the key positions of pointers or alignments, leading to large numeric errors despite plausible textual reasoning. We have also conducted preliminary experiments with reinforcement learning over synthetic data, and find encouraging results on the in-domain synthetic subset but less promising results on real-world images. Our analysis highlights a fundamental limitation of current VLMs in fine-grained spatial grounding. We hope this resource can support future advances in visually grounded numeracy and precise spatial perception of VLMs, bridging the gap between recognizing numbers and measuring the world.
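
As a rough illustration of why indicator localization matters, the sketch below maps a pointer angle to a reading by linear interpolation along the scale. The function name `gauge_reading` and the dial parameters (a 0 to 100 scale over a 270-degree sweep) are hypothetical and are not taken from MeasureBench; real instruments may also have nonlinear scales.

```python
def gauge_reading(pointer_deg, start_deg, end_deg, min_value, max_value):
    """Map a pointer angle to a scale value by linear interpolation."""
    fraction = (pointer_deg - start_deg) / (end_deg - start_deg)
    return min_value + fraction * (max_value - min_value)


# Hypothetical dial: 0 to 100 units spread over a 270-degree sweep.
true_value = gauge_reading(135.0, 0.0, 270.0, 0.0, 100.0)
misread_value = gauge_reading(145.0, 0.0, 270.0, 0.0, 100.0)

# A 10-degree localization error shifts the reading even if every digit is read correctly.
print(true_value, misread_value - true_value)
```

Under these assumptions, the numeric error depends only on where the pointer is placed, which is consistent with the abstract's point that reading the labels correctly is not sufficient.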

[2] SYNAPSE-Net: A Unified Framework with Lesion-Aware Hierarchical Gating for Robust Segmentation of Heterogeneous Brain Lesions

Md. Mehedi Hassan, Shafqat Alam, Shahriar Ahmed Seam, Maruf Ahmed

🧩 TL;DR

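This paper proposes the Unified Multi-Stream SYNAPSE-Net framework, which integrates multi-stream CNN encoders, a Swin Transformer bottleneck, and a dynamic cross-modal attention fusion mechanism. It achieves state-of-the-art performance on several brain lesion segmentation challenges and offers a robust solution for automated clinical segmentation.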


📘 Detailed Summary

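Motivation: Current deep learning models for segmenting heterogeneous lesions in multi-modal brain MRI generalize poorly and show high performance variance. These specialized 'point solutions' limit clinical reliability, which calls for an adaptive framework that is both generalizable and robust.

Method: The authors propose the Unified Multi-Stream SYNAPSE-Net framework, a hybrid architecture that integrates multi-stream CNN encoders, a Swin Transformer bottleneck for global context modeling, a dynamic cross-modal attention fusion mechanism, and a hierarchical gated decoder for high-fidelity mask reconstruction. It is trained with a variance reduction strategy that combines pathology-specific data augmentation and difficulty-aware sampling.

Result: The model is evaluated on three public challenge datasets. On the MICCAI 2017 WMH Challenge it attains state-of-the-art performance with a DSC of 0.831 and an HD95 of 3.03. On the ISLES 2022 Challenge it achieves the best boundary accuracy, with a statistically significant difference (HD95 of 9.69). On the BraTS 2020 Challenge it reaches the highest DSC for the tumor core region, 0.8651.

Conclusion: The unified adaptive framework achieves state-of-the-art performance across multiple brain pathologies and provides a robust, clinically feasible solution for automated segmentation, demonstrating its effectiveness and generalization on heterogeneous brain lesion segmentation.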


📄 Abstract

Automated segmentation of heterogeneous brain lesions from multi-modal MRI remains a critical challenge in clinical neuroimaging. Current deep learning models are typically specialized 'point solutions' that lack generalization and exhibit high performance variance, limiting their clinical reliability. To address these gaps, we propose the Unified Multi-Stream SYNAPSE-Net, an adaptive framework designed for both generalization and robustness. The framework is built on a novel hybrid architecture integrating multi-stream CNN encoders, a Swin Transformer bottleneck for global context, a dynamic cross-modal attention fusion (CMAF) mechanism, and a hierarchical gated decoder for high-fidelity mask reconstruction. The architecture is trained with a variance reduction strategy that combines pathology-specific data augmentation and a difficulty-aware sampling method. The model was evaluated on three challenging public datasets: the MICCAI 2017 WMH Challenge, the ISLES 2022 Challenge, and the BraTS 2020 Challenge. Our framework attained a state-of-the-art DSC value of 0.831 with an HD95 value of 3.03 on the WMH dataset. For ISLES 2022, it achieved the best boundary accuracy with a statistically significant difference (HD95 value of 9.69). For BraTS 2020, it reached the highest DSC value for the tumor core region (0.8651). These experimental findings suggest that our unified adaptive framework achieves state-of-the-art performance across multiple brain pathologies, providing a robust and clinically feasible solution for automated segmentation. The source code and the pre-trained models are available at https://github.com/mubid-01/SYNAPSE-Net-pre.
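
For readers unfamiliar with the DSC figures quoted above, here is a minimal sketch of the Dice similarity coefficient, DSC = 2|A ∩ B| / (|A| + |B|), on flat binary masks. The function name and the toy masks are illustrative assumptions; this is not the evaluation code used by the challenges or by SYNAPSE-Net.

```python
def dice_coefficient(pred, target):
    """Dice similarity coefficient for two flat binary masks (lists of 0/1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


pred = [1, 1, 0, 0, 1]    # hypothetical predicted lesion voxels
target = [1, 0, 0, 0, 1]  # hypothetical ground-truth voxels
score = dice_coefficient(pred, target)
print(score)
```

DSC rewards overlap, whereas HD95 (not shown) measures boundary distance, which is why the abstract reports both.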

[3] Semantic Frame Aggregation-based Transformer for Live Video Comment Generation

Anam Fatima, Yi Yu, Janak Kapuriya, Julien Lalanne, Jainendra Shukla

🧩 TL;DR

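This paper proposes a novel Semantic Frame Aggregation-based Transformer (SFAT) for live video comment generation. The method aggregates video frames weighted by their semantic relevance, and a large-scale multimodal English video comment dataset is constructed to validate the model.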


📘 Detailed Summary

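Motivation: Existing live video comment generation methods tend to overlook prioritizing the video frames most relevant to ongoing viewer interaction, even though this prioritization is crucial for producing contextually appropriate comments. In addition, existing datasets focus mainly on Chinese-language content with limited video categories, which does not meet the needs of diverse research.

Method: The authors propose the Semantic Frame Aggregation-based Transformer (SFAT), which leverages CLIP's visual-text multimodal knowledge to generate comments, assigns weights to video frames according to their semantic relevance, uses an efficient weighted sum of frames to emphasize informative frames, and employs a comment decoder with cross-attention so that generated comments reflect contextual cues from both chat and video.

Result: A large-scale, diverse multimodal English video comment dataset is constructed, covering 11 video categories and totaling 438 hours and 3.2 million comments. Comparative experiments show that SFAT is more effective than existing methods at generating comments from live video and ongoing dialogue context.

Conclusion: The study underscores the importance of semantically prioritizing video frames in live comment generation. Through multimodal fusion and attention mechanisms, SFAT improves the contextual relevance of generated comments, and the work provides a new technical approach and a benchmark dataset for live interactive systems.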


📄 Abstract

Live commenting on video streams has surged in popularity on platforms like Twitch, enhancing viewer engagement through dynamic interactions. However, automatically generating contextually appropriate comments remains a challenging and exciting task. Video streams can contain a vast amount of data and extraneous content. Existing approaches tend to overlook the important aspect of prioritizing video frames that are most relevant to ongoing viewer interactions. This prioritization is crucial for producing contextually appropriate comments. To address this gap, we introduce a novel Semantic Frame Aggregation-based Transformer (SFAT) model for live video comment generation. This method not only leverages CLIP's visual-text multimodal knowledge to generate comments but also assigns weights to video frames based on their semantic relevance to the ongoing viewer conversation. It employs an efficient weighted sum of frames to emphasize informative frames while focusing less on irrelevant ones. Finally, our comment decoder, with a cross-attention mechanism that attends to each modality, ensures that the generated comment reflects contextual cues from both chats and video. Furthermore, to address the limitations of existing datasets, which predominantly focus on Chinese-language content with limited video categories, we have constructed a large-scale, diverse, multimodal English video comments dataset. Extracted from Twitch, this dataset covers 11 video categories, totaling 438 hours and 3.2 million comments. We demonstrate the effectiveness of our SFAT model by comparing it to existing methods for generating comments from live video and ongoing dialogue contexts.
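
The weighted frame aggregation described above can be sketched as follows. This is a minimal illustration, assuming cosine similarity as the relevance score and a softmax with a hypothetical `temperature` to turn scores into weights; the abstract does not specify SFAT's actual scoring function, and the 2-d vectors stand in for CLIP embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def aggregate_frames(frame_embs, chat_emb, temperature=0.1):
    """Weight frames by similarity to the chat embedding and sum them."""
    scores = [cosine(f, chat_emb) / temperature for f in frame_embs]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frame_embs[0])
    pooled = [sum(w * f[i] for w, f in zip(weights, frame_embs)) for i in range(dim)]
    return weights, pooled


frames = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]  # toy frame embeddings
chat = [1.0, 0.1]                               # toy chat-context embedding
weights, pooled = aggregate_frames(frames, chat)
print(weights, pooled)
```

Frames that align with the chat context dominate the pooled vector, while off-topic frames contribute little.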

[4] MoME: Mixture of Visual Language Medical Experts for Medical Imaging Segmentation

Arghavan Rezvani, Xiangyi Yan, Anthony T. Wu, Kun Han, Pooya Khosravi, Xiaohui Xie

🧩 TL;DR

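This study proposes MoME (Mixture of Visual Language Medical Experts), which applies the Mixture of Experts paradigm widely used in large language models to medical image segmentation, achieving competitive performance in medical image analysis through dynamic expert selection and vision-language fusion.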


📘 Detailed Summary

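Motivation: The study addresses the limited integration of vision-language models in medical image segmentation. It explores how foundation models can be applied effectively to medical image analysis, in particular by using the Mixture of Experts paradigm to boost performance and by incorporating textual information to improve medical image understanding.

Method: MoME adopts a Mixture of Experts architecture whose dynamic expert selection mechanism makes effective use of multi-scale visual features, combined with textual embeddings, to handle the intricacies of medical imagery, yielding a model framework designed specifically for medical vision-language tasks.

Result: On a comprehensive medical image segmentation benchmark comprising 10 datasets and 3,410 CT scans, MoME shows strong performance and achieves competitive precision across multiple datasets, validating the architecture for medical image analysis.

Conclusion: The study demonstrates that the Mixture of Experts paradigm is applicable to medical vision-language tasks, offers a new architectural direction for medical image analysis, and shows the great potential of combining foundation models with medical imaging, moving medical AI toward more robust and intelligent systems.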


📄 Abstract

In this study, we propose MoME, a Mixture of Visual Language Medical Experts, for Medical Image Segmentation. MoME adapts the successful Mixture of Experts (MoE) paradigm, widely used in Large Language Models (LLMs), for medical vision-language tasks. The architecture enables dynamic expert selection by effectively utilizing multi-scale visual features tailored to the intricacies of medical imagery, enriched with textual embeddings. This work explores a novel integration of vision-language models for this domain. Utilizing an assembly of 10 datasets, encompassing 3,410 CT scans, MoME demonstrates strong performance on a comprehensive medical imaging segmentation benchmark. Our approach explores the integration of foundation models for medical imaging, benefiting from the established efficacy of MoE in boosting model performance by incorporating textual information. Demonstrating competitive precision across multiple datasets, MoME explores a novel architecture for achieving robust results in medical image analysis.
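
To make "dynamic expert selection" concrete, the sketch below shows generic top-k Mixture of Experts routing on a scalar input. Top-k gating with renormalized softmax weights is a common MoE design and is assumed here for illustration; the abstract does not state MoME's routing rule, and the experts and gate logits are toy placeholders.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, experts, gate_logits, top_k=2):
    """Combine the top_k experts, weighted by renormalized gate scores."""
    ranked = sorted(range(len(experts)), key=lambda i: gate_logits[i], reverse=True)
    chosen = ranked[:top_k]
    weights = softmax([gate_logits[i] for i in chosen])
    output = sum(w * experts[i](x) for w, i in zip(weights, chosen))
    return output, chosen


# Toy experts; in a real model these would be sub-networks and the logits a learned gate.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x]
output, chosen = moe_forward(3.0, experts, gate_logits=[0.0, 0.0, -5.0])
print(output, chosen)
```

Only the selected experts run, so capacity can grow with the number of experts while per-input compute stays roughly fixed.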

[5] Incremental Human-Object Interaction Detection with Invariant Relation Representation Learning

Yana Wei, Zeen Chi, Chongyu Wang, Yu Wu, Shipeng Yan, Yongfei Liu, Xuming He

🧩 TL;DR

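This paper proposes an exemplar-free incremental relation distillation framework for human-object interaction (HOI) detection in dynamically changing open-world settings. By decoupling object and relation learning, it mitigates catastrophic forgetting and improves robustness to interaction drift and zero-shot HOI combinations.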


📘 Detailed Summary

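Motivation: In open-world environments, human-object interactions evolve continuously, which conventional closed-world HOI detection models struggle to handle. The setting also poses distinct challenges: catastrophic forgetting, interaction drift, and the detection of zero-shot HOI combinations.

Method: The authors propose an incremental relation distillation framework that decouples the learning of objects and relations and introduces two distillation losses for learning invariant relation features across different HOI combinations that share the same relation.

Result: Extensive experiments on the HICO-DET and V-COCO datasets show that the method outperforms state-of-the-art baselines in mitigating forgetting, strengthening robustness to interaction drift, and generalizing to zero-shot HOIs.

Conclusion: The study shows that decoupled learning combined with relation distillation can address the core challenges of incremental HOI detection, providing a viable solution for open-world interaction understanding with practical application value.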


📄 Abstract

In open-world environments, human-object interactions (HOIs) evolve continuously, challenging conventional closed-world HOI detection models. Inspired by humans' ability to progressively acquire knowledge, we explore incremental HOI detection (IHOID) to develop agents capable of discerning human-object relations in such dynamic environments. This setup confronts not only the common issue of catastrophic forgetting in incremental learning but also distinct challenges posed by interaction drift and detecting zero-shot HOI combinations with sequentially arriving data. Therefore, we propose a novel exemplar-free incremental relation distillation (IRD) framework. IRD decouples the learning of objects and relations, and introduces two unique distillation losses for learning invariant relation features across different HOI combinations that share the same relation. Extensive experiments on HICO-DET and V-COCO datasets demonstrate the superiority of our method over state-of-the-art baselines in mitigating forgetting, strengthening robustness against interaction drift, and generalization on zero-shot HOIs. Code is available at https://github.com/weiyana/ContinualHOI.

[6] ZEBRA: Towards Zero-Shot Cross-Subject Generalization for Universal Brain Visual Decoding

Haonan Wang, Jingyu Lu, Hongrui Li, Xiaomeng Li

🧩 TL;DR

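ZEBRA is presented as the first zero-shot brain visual decoding framework. By disentangling the subject-related and semantic-related components of fMRI representations, it achieves cross-subject visual reconstruction without subject-specific adaptation, with performance comparable to fully fine-tuned models on several metrics.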


📘 Detailed Summary

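Motivation: Current fMRI-based visual reconstruction methods rely mainly on subject-specific models or require subject-specific fine-tuning, which limits their scalability and real-world applicability. A zero-shot decoding framework that generalizes to unseen subjects is therefore needed.

Method: ZEBRA builds on the key insight that fMRI representations can be decomposed into subject-related and semantic-related components. It uses adversarial training to explicitly disentangle these components and isolate subject-invariant, semantic-specific representations, enabling cross-subject generalization without additional fMRI data or retraining.

Result: Extensive experiments show that ZEBRA significantly outperforms zero-shot baselines and achieves performance comparable to fully fine-tuned models on several evaluation metrics, validating its effectiveness and generalization ability.

Conclusion: The work represents a scalable and practical step toward universal neural decoding. By disentangling subject-specific and semantic-specific information, it achieves zero-shot brain visual decoding and offers a new technical approach for bridging neuroscience and computer vision.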


📄 Abstract

Recent advances in neural decoding have enabled the reconstruction of visual experiences from brain activity, positioning fMRI-to-image reconstruction as a promising bridge between neuroscience and computer vision. However, current methods predominantly rely on subject-specific models or require subject-specific fine-tuning, limiting their scalability and real-world applicability. In this work, we introduce ZEBRA, the first zero-shot brain visual decoding framework that eliminates the need for subject-specific adaptation. ZEBRA is built on the key insight that fMRI representations can be decomposed into subject-related and semantic-related components. By leveraging adversarial training, our method explicitly disentangles these components to isolate subject-invariant, semantic-specific representations. This disentanglement allows ZEBRA to generalize to unseen subjects without any additional fMRI data or retraining. Extensive experiments show that ZEBRA significantly outperforms zero-shot baselines and achieves performance comparable to fully finetuned models on several metrics. Our work represents a scalable and practical step toward universal neural decoding. Code and model weights are available at: https://github.com/xmed-lab/ZEBRA.

[7] E-MMDiT: Revisiting Multimodal Diffusion Transformer Design for Fast Image Synthesis under Limited Resources

Tong Shen, Jingai Yu, Dong Zhou, Dong Li, Emad Barsoum

🧩 TL;DR

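This paper proposes E-MMDiT, an efficient and lightweight multimodal diffusion model with only 304M parameters. Using token compression, Position Reinforcement, and Alternating Subregion Attention, it achieves fast image generation with low training resource requirements.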


📘 Detailed Summary

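Motivation: Existing diffusion models often require large-scale training data and substantial computational resources to train, or suffer from heavy architectures with high latency, which limits the accessibility and practical use of generative AI models.

Method: The model adopts a highly compressive visual tokenizer to produce a more compact representation, proposes a multi-path compression module to further compress tokens, introduces Position Reinforcement to maintain spatial coherence, designs Alternating Subregion Attention to reduce computational cost, and proposes AdaLN-affine, a lightweight module for computing modulation parameters in transformer blocks.

Result: The 512px generation model, trained in only 1.5 days on a single node of 8 AMD MI300X GPUs with only 25M public data, achieves 0.66 on GenEval, and post-training techniques such as GRPO easily raise this to 0.72.

Conclusion: E-MMDiT provides a strong and practical baseline for future research. Its efficient architecture design substantially reduces computational requirements and contributes to the democratization of generative AI models, allowing more researchers to take part in developing the technology.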


📄 Abstract

Diffusion models have shown strong capabilities in generating high-quality images from text prompts. However, these models often require large-scale training data and significant computational resources to train, or suffer from heavy architectures with high latency. To this end, we propose Efficient Multimodal Diffusion Transformer (E-MMDiT), an efficient and lightweight multimodal diffusion model with only 304M parameters for fast image synthesis requiring low training resources. We provide an easily reproducible baseline with competitive results. Our model for 512px generation, trained with only 25M public data in 1.5 days on a single node of 8 AMD MI300X GPUs, achieves 0.66 on GenEval and easily reaches 0.72 with some post-training techniques such as GRPO. Our design philosophy centers on token reduction, as the computational cost scales significantly with the token count. We adopt a highly compressive visual tokenizer to produce a more compact representation and propose a novel multi-path compression module for further compression of tokens. To enhance our design, we introduce Position Reinforcement, which strengthens positional information to maintain spatial coherence, and Alternating Subregion Attention (ASA), which performs attention within subregions to further reduce computational cost. In addition, we propose AdaLN-affine, an efficient lightweight module for computing modulation parameters in transformer blocks. Our code is available at https://github.com/AMD-AGI/Nitro-E and we hope E-MMDiT serves as a strong and practical baseline for future research and contributes to the democratization of generative AI models.
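
A back-of-the-envelope sketch of why token reduction and subregion attention save compute, assuming standard self-attention whose cost is proportional to the number of query-key pairs and equal-sized subregions. The token count of 1024 and the four subregions are illustrative and are not E-MMDiT's actual configuration.

```python
def attention_pairs(num_tokens, num_subregions=1):
    """Count query-key pairs when attention is restricted to equal subregions."""
    if num_tokens % num_subregions:
        raise ValueError("tokens must divide evenly into subregions")
    per_region = num_tokens // num_subregions
    return num_subregions * per_region * per_region


full = attention_pairs(1024)                     # global attention over all tokens
split = attention_pairs(1024, num_subregions=4)  # attention inside each of 4 subregions
halved_tokens = attention_pairs(512)             # global attention after 2x token compression
print(full, split, halved_tokens)
```

With N tokens and G subregions the pair count falls from N² to N²/G, and halving the token count alone cuts it by a factor of four, which is the quadratic scaling the design philosophy targets.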

[8] Can MLLMs Read the Room? A Multimodal Benchmark for Verifying Truthfulness in Multi-Party Social Interactions

Caixin Kang, Yifei Huang, Liangyang Ouyang, Mingfang Zhang, Yoichi Sato

🧩 TL;DR

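This study proposes the Multimodal Interactive Veracity Assessment task and builds a dataset based on the game Werewolf with synchronized video, text, and ground-truth labels. It evaluates state-of-the-art multimodal large language models (MLLMs) on deception detection and reveals a significant performance gap in their ability to distinguish truth from falsehood.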


📘 Detailed Summary

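Motivation: As AI systems become increasingly integrated into human lives, endowing them with robust social intelligence has become a critical frontier. Automatically detecting deception in dynamic multi-party conversations remains a major challenge, and the capabilities of current MLLMs in this area have not been adequately quantified.

Method: The study introduces the new Multimodal Interactive Veracity Assessment task, builds a novel multimodal dataset from the social deduction game Werewolf with synchronized video, text, and verifiable ground-truth labels, and establishes a comprehensive benchmark for evaluating state-of-the-art MLLMs.

Result: The benchmark reveals a significant performance gap: even powerful models such as GPT-4o struggle to distinguish truth from falsehood reliably. Failure mode analysis indicates that these models fail to ground language in visual social cues effectively and may be overly conservative as a result of over-alignment.

Conclusion: The study stresses the urgent need for new approaches to building more perceptive and trustworthy AI systems. The limitations of MLLMs on deception detection expose shortcomings in their social intelligence and point to the need for better multimodal fusion.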


📄 Abstract

As AI systems become increasingly integrated into human lives, endowing them with robust social intelligence has emerged as a critical frontier. A key aspect of this intelligence is discerning truth from deception, a ubiquitous element of human interaction that is conveyed through a complex interplay of verbal language and non-verbal visual cues. However, automatic deception detection in dynamic, multi-party conversations remains a significant challenge. The recent rise of powerful Multimodal Large Language Models (MLLMs), with their impressive abilities in visual and textual understanding, makes them natural candidates for this task. Yet their capabilities in this crucial domain remain largely unquantified. To address this gap, we introduce a new task, Multimodal Interactive Veracity Assessment (MIVA), and present a novel multimodal dataset derived from the social deduction game Werewolf. This dataset provides synchronized video and text, with verifiable ground-truth labels for every statement. We establish a comprehensive benchmark evaluating state-of-the-art MLLMs, revealing a significant performance gap: even powerful models like GPT-4o struggle to distinguish truth from falsehood reliably. Our analysis of failure modes indicates that these models fail to ground language in visual social cues effectively and may be overly conservative in their alignment, highlighting the urgent need for novel approaches to building more perceptive and trustworthy AI systems.

[9] Generating Accurate and Detailed Captions for High-Resolution Images

Hankyeol Lee, Gawon Seo, Kyounggyu Lee, Dogun Kim, Kyungwoo Song, Jiyoung Jung

🧩 TL;DR

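This paper proposes a novel multi-stage pipeline that integrates vision-language models, large language models, and object detection systems to improve caption quality for high-resolution images. The method produces more detailed and reliable captions while effectively reducing hallucinations.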


📘 Detailed Summary

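Motivation: Vision-language models are typically pre-trained on low-resolution inputs and therefore struggle to generate accurate, detailed captions for high-resolution images, because downscaling such images to standard sizes loses visual detail and omits important objects.

Method: The multi-stage pipeline first generates an initial caption with a vision-language model. A large language model then identifies key objects and predicts objects likely to co-occur with them, and an object detection system verifies these predictions. Newly detected objects that the initial caption did not mention receive region-specific captioning, which enriches caption detail and reduces hallucinations.

Result: Experiments on a curated dataset of high-resolution images show that the pipeline produces more detailed and reliable captions. The improvement is verified through pairwise comparison and quantitative scoring by large multimodal models, and the method also performs well on a hallucination detection benchmark.

Conclusion: The study shows that combining the strengths of multiple AI systems can address the challenges of captioning high-resolution images, offers a new direction for improving vision-language models in complex scenes, and demonstrates the potential of multimodal system integration for reducing hallucinations.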


📄 Abstract

Vision-language models (VLMs) often struggle to generate accurate and detailed captions for high-resolution images since they are typically pre-trained on low-resolution inputs (e.g., 224x224 or 336x336 pixels). Downscaling high-resolution images to these dimensions may result in the loss of visual details and the omission of important objects. To address this limitation, we propose a novel pipeline that integrates vision-language models, large language models (LLMs), and object detection systems to enhance caption quality. Our proposed pipeline refines captions through a novel, multi-stage process. Given a high-resolution image, an initial caption is first generated using a VLM, and key objects in the image are then identified by an LLM. The LLM predicts additional objects likely to co-occur with the identified key objects, and these predictions are verified by object detection systems. Newly detected objects not mentioned in the initial caption undergo focused, region-specific captioning to ensure they are incorporated. This process enriches caption detail while reducing hallucinations by removing references to undetected objects. We evaluate the enhanced captions using pairwise comparison and quantitative scoring from large multimodal models, along with a benchmark for hallucination detection. Experiments on a curated dataset of high-resolution images demonstrate that our pipeline produces more detailed and reliable image captions while effectively minimizing hallucinations.
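
The stages above can be sketched as a single function with injected components. Everything here is a hypothetical scaffold: the function `refine_caption`, its parameters, and the lambda stand-ins replace the VLM, LLM, and detector, none of which are specified in the abstract. The sketch only flags unverified objects rather than rewriting the caption.

```python
def refine_caption(image, caption_fn, key_objects_fn, cooccur_fn, detect_fn, region_caption_fn):
    """Run one pass of the caption-refinement stages with injected components."""
    caption = caption_fn(image)            # stage 1: initial VLM caption
    key_objects = key_objects_fn(caption)  # stage 2: LLM extracts key objects
    predicted = cooccur_fn(key_objects)    # stage 3: LLM predicts co-occurring objects
    candidates = list(dict.fromkeys(key_objects + predicted))  # de-duplicate, keep order
    detected = [name for name in candidates if detect_fn(image, name)]  # stage 4: detector check
    # Mentioned in the caption but not confirmed by the detector: candidates for removal.
    unverified = [name for name in key_objects if name not in detected]
    # Confirmed but absent from the caption: caption each one separately.
    additions = [region_caption_fn(image, name) for name in detected if name not in key_objects]
    return {"caption": caption, "unverified": unverified, "additions": additions}


# Toy stand-ins so the sketch runs without any model.
image = {"objects": {"dog", "ball"}}
result = refine_caption(
    image,
    caption_fn=lambda img: "A dog and a cat on the grass",
    key_objects_fn=lambda text: ["dog", "cat"],
    cooccur_fn=lambda objs: ["ball", "leash"],
    detect_fn=lambda img, name: name in img["objects"],
    region_caption_fn=lambda img, name: f"a {name} near the dog",
)
print(result)
```

In this toy run the undetected "cat" is flagged as a possible hallucination and the detected but unmentioned "ball" gains its own region caption, mirroring the two effects the abstract describes.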

[10] Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum

Zhuoning Guo, Mingxin Li, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Xiaowen Chu

🧩 TL;DR

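This paper proposes a framework built on the co-design of evaluation, data, and modeling to address the structural misalignment in video retrieval. It develops a universal video retrieval benchmark and a general video embedder that achieves state-of-the-art zero-shot generalization.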


📘 Detailed Summary

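Motivation: The current video retrieval paradigm is structurally misaligned: narrow benchmarks incentivize correspondingly limited data and single-task training, which suppresses universal capability, and there is no diagnostic evaluation that defines and demands multi-dimensional generalization.

Method: The authors propose a co-design framework that establishes the Universal Video Retrieval Benchmark (UVRB) as a diagnostic tool, develops a scalable synthesis workflow that generates 1.55 million high-quality data pairs, and designs the Modality Pyramid curriculum to train the General Video Embedder (GVE).

Result: GVE achieves state-of-the-art zero-shot generalization on UVRB. The analysis shows that popular benchmarks are poor predictors of general ability and that partially relevant retrieval is a dominant but overlooked scenario.

Conclusion: The co-designed framework offers a practical path out of the limited scope and toward truly universal video retrieval, exposing the limitations of existing benchmarks and underscoring the importance of partially relevant retrieval.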


📄 Abstract

The prevailing video retrieval paradigm is structurally misaligned, as narrow benchmarks incentivize correspondingly limited data and single-task training. Therefore, universal capability is suppressed due to the absence of a diagnostic evaluation that defines and demands multi-dimensional generalization. To break this cycle, we introduce a framework built on the co-design of evaluation, data, and modeling. First, we establish the Universal Video Retrieval Benchmark (UVRB), a suite of 16 datasets designed not only to measure performance but also to diagnose critical capability gaps across tasks and domains. Second, guided by UVRB's diagnostics, we introduce a scalable synthesis workflow that generates 1.55 million high-quality pairs to populate the semantic space required for universality. Finally, we devise the Modality Pyramid, a curriculum that trains our General Video Embedder (GVE) by explicitly leveraging the latent interconnections within our diverse data. Extensive experiments show GVE achieves state-of-the-art zero-shot generalization on UVRB. In particular, our analysis reveals that popular benchmarks are poor predictors of general ability and that partially relevant retrieval is a dominant but overlooked scenario. Overall, our co-designed framework provides a practical path to escape the limited scope and advance toward truly universal video retrieval.

[11] M^3Detection: Multi-Frame Multi-Level Feature Fusion for Multi-Modal 3D Object Detection with Camera and 4D Imaging Radar

Xiaozhi Li, Huijun Di, Jian Li, Feng Liu, Wei Liang

🧩 TL;DR

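This paper proposes M^3Detection, a unified multi-frame 3D object detection framework that fuses multi-modal data from a camera and 4D imaging radar, achieving state-of-the-art 3D detection performance on the VoD and TJ4DRadSet datasets.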


📘 Detailed Summary

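Motivation: Most existing camera-radar fusion methods are limited to single-frame inputs and capture only part of the scene. This incomplete scene information, together with image degradation and the sparsity of 4D radar, limits overall detection performance. Multi-frame fusion provides richer spatiotemporal information but faces challenges in robust object feature fusion across frames and modalities and in computational cost.

Method: M^3Detection uses intermediate features from the baseline detector and a tracker to generate reference trajectories. In the second stage, a radar-guided global-level inter-object feature aggregation module aligns global features across candidate proposals, and a local-level inter-grid feature aggregation module expands local features along the reference trajectories. Finally, a trajectory-level multi-frame spatiotemporal reasoning module encodes cross-frame interactions.

Result: Extensive experiments on the VoD and TJ4DRadSet datasets show that M^3Detection achieves state-of-the-art 3D detection performance, validating its effectiveness for multi-frame detection with camera and 4D imaging radar.

Conclusion: The study demonstrates the potential of multi-frame camera and 4D imaging radar fusion for 3D object detection. Multi-level feature fusion and spatiotemporal reasoning address the computational efficiency and feature representation problems of multi-modal fusion, offering a viable approach to cost-effective 3D perception.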


📄 Abstract

Recent advances in 4D imaging radar have enabled robust perception in adverse weather, while camera sensors provide dense semantic information. Fusing these complementary modalities has great potential for cost-effective 3D perception. However, most existing camera-radar fusion methods are limited to single-frame inputs, capturing only a partial view of the scene. The incomplete scene information, compounded by image degradation and 4D radar sparsity, hinders overall detection performance. In contrast, multi-frame fusion offers richer spatiotemporal information but faces two challenges: achieving robust and effective object feature fusion across frames and modalities, and mitigating the computational cost of redundant feature extraction. Consequently, we propose M^3Detection, a unified multi-frame 3D object detection framework that performs multi-level feature fusion on multi-modal data from camera and 4D imaging radar. Our framework leverages intermediate features from the baseline detector and employs the tracker to produce reference trajectories, improving computational efficiency and providing richer information for the second stage. In the second stage, we design a global-level inter-object feature aggregation module guided by radar information to align global features across candidate proposals and a local-level inter-grid feature aggregation module that expands local features along the reference trajectories to enhance fine-grained object representation. The aggregated features are then processed by a trajectory-level multi-frame spatiotemporal reasoning module to encode cross-frame interactions and enhance temporal representation. Extensive experiments on the VoD and TJ4DRadSet datasets demonstrate that M^3Detection achieves state-of-the-art 3D detection performance, validating its effectiveness in multi-frame detection with camera-4D imaging radar fusion.

[12] SilhouetteTell: Practical Video Identification Leveraging Blurred Recordings of Video Subtitles

Guanchong Huang, Song Fang

🧩 TL;DR

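This paper proposes SilhouetteTell, a novel video identification attack that infers what a user is watching by analyzing the spatiotemporal features of subtitle silhouettes. It can identify both online and offline videos from a distance of up to 40 meters.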


📘 Detailed Summary

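Motivation: Video identification attacks pose a serious privacy threat, potentially revealing a user's viewing history, hobbies, religious beliefs, political leanings, and other sensitive information. Existing techniques rely mainly on analyzing streaming network traffic, which has limitations.

Method: SilhouetteTell builds on the observation that a subtitle's content determines the silhouette it displays on screen. It combines the spatial information of subtitle silhouettes with the temporal differences between consecutive subtitles into a spatiotemporal feature, and identifies a video through the spatiotemporal correlation between recorded subtitle silhouettes and the video's subtitle file.

Result: Comprehensive experiments on off-the-shelf smartphones confirm that SilhouetteTell is highly effective at inferring video titles and clips under various settings, including from a distance of up to 40 meters.

Conclusion: The study reveals a new privacy threat based on visible subtitle silhouettes: accurate video identification is possible through visual observation alone, without network traffic analysis, which poses a new security challenge for privacy protection on video streaming platforms.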


📄 Abstract

Video identification attacks pose a significant privacy threat that can reveal videos that victims watch, which may disclose their hobbies, religious beliefs, political leanings, sexual orientation, and health status. Also, video watching history can be used for user profiling or advertising and may result in cyberbullying, discrimination, or blackmail. Existing video inference techniques usually depend on analyzing network traffic generated by streaming online videos. In this work, we observe that the content of a subtitle determines its silhouette displayed on the screen, and identifying each subtitle silhouette also yields the temporal difference between two consecutive subtitles. We then propose SilhouetteTell, a novel video identification attack that combines the spatial and time domain information into a spatiotemporal feature of subtitle silhouettes. SilhouetteTell explores the spatiotemporal correlation between recorded subtitle silhouettes of a video and its subtitle file. It can infer both online and offline videos. Comprehensive experiments on off-the-shelf smartphones confirm the high efficacy of SilhouetteTell for inferring video titles and clips under various settings, including from a distance of up to 40 meters.
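
To illustrate only the temporal half of the idea, the sketch below matches an observed sequence of gaps between consecutive subtitles against gap sequences derived from candidate subtitle files. The sliding-window absolute-error score, the function names, and the toy library are assumptions made for illustration; the abstract does not describe SilhouetteTell's actual correlation measure, and the spatial silhouette feature is omitted.

```python
def best_offset(observed_gaps, candidate_gaps):
    """Slide the observed gap sequence over a candidate; return (error, offset)."""
    n = len(observed_gaps)
    best = None
    for offset in range(len(candidate_gaps) - n + 1):
        window = candidate_gaps[offset:offset + n]
        error = sum(abs(a - b) for a, b in zip(observed_gaps, window))
        if best is None or error < best[0]:
            best = (error, offset)
    return best  # None if the candidate is shorter than the observation


def identify_video(observed_gaps, subtitle_library):
    """Return (error, title, offset) for the best-matching candidate, or None."""
    scored = []
    for title, gaps in subtitle_library.items():
        match = best_offset(observed_gaps, gaps)
        if match is not None:
            scored.append((match[0], title, match[1]))
    return min(scored) if scored else None


# Hypothetical gaps (seconds) between consecutive subtitles in two subtitle files.
library = {
    "video_a": [2.0, 3.5, 1.2, 4.0, 2.2],
    "video_b": [1.0, 1.0, 5.0, 2.5, 3.0],
}
observed = [1.25, 3.9, 2.3]  # noisy gaps measured from a blurred recording
result = identify_video(observed, library)
print(result)
```

The returned offset also localizes the clip within the video, which corresponds to inferring "clips" as well as "titles".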

[13] Multi-Modal Feature Fusion for Spatial Morphology Analysis of Traditional Villages via Hierarchical Graph Neural Networks

Jiaxin Zhang, Zehong Zhu, Junye Deng, Yunqin Li, Bowen Wang

🧩 TL;DR

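This paper proposes a hierarchical graph neural network that integrates multi-source data for in-depth analysis of village spatial morphology. Multimodal feature fusion and a joint training strategy yield significant performance gains on classification tasks.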


📘 Detailed Summary

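Motivation: Existing studies mainly analyze village spatial morphology and its influencing factors from a single-disciplinary perspective and rely on qualitative methods. Constrained by a lack of digital infrastructure and insufficient data, they struggle to address the disappearance of village spatial characteristics and the homogenization of landscapes under urbanization.

Method: The proposed Hierarchical Graph Neural Network includes two node types (input nodes and communication nodes) and two edge types (static input edges and dynamic communication edges). It combines Graph Convolutional Networks and Graph Attention Networks to integrate multimodal features efficiently under a two-stage feature update mechanism, and introduces a relational pooling mechanism with a joint training strategy across 17 subtypes.

Result: Experiments show that the method significantly outperforms existing approaches in multimodal fusion and classification. Joint optimization of all subtypes raises mean accuracy/F1 from 0.71/0.83 to 0.82/0.90, with a 6% gain on parcel tasks.

Conclusion: The method provides a scientific basis for exploring village spatial patterns and generative logic. By combining deep learning with multi-source data fusion, it addresses the data shortages and methodological limits of traditional research and offers a new technical approach to the study of village spatial morphology.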


📄 Abstract

Village areas hold significant importance in the study of human-land relationships. However, with the advancement of urbanization, the gradual disappearance of spatial characteristics and the homogenization of landscapes have emerged as prominent issues. Existing studies primarily adopt a single-disciplinary perspective to analyze village spatial morphology and its influencing factors, relying heavily on qualitative analysis methods. These efforts are often constrained by the lack of digital infrastructure and insufficient data. To address the current research limitations, this paper proposes a Hierarchical Graph Neural Network (HGNN) model that integrates multi-source data to conduct an in-depth analysis of village spatial morphology. The framework includes two types of nodes (input nodes and communication nodes) and two types of edges (static input edges and dynamic communication edges). By combining Graph Convolutional Networks (GCN) and Graph Attention Networks (GAT), the proposed model efficiently integrates multimodal features under a two-stage feature update mechanism. Additionally, based on existing principles for classifying village spatial morphology, the paper introduces a relational pooling mechanism and implements a joint training strategy across 17 subtypes. Experimental results demonstrate that this method achieves significant performance improvements over existing approaches in multimodal fusion and classification tasks. Moreover, the proposed joint optimization of all subtypes lifts mean accuracy/F1 from 0.71/0.83 (independent models) to 0.82/0.90, driven by a 6% gain for parcel tasks. Our method provides scientific evidence for exploring village spatial patterns and generative logic.

[14] C-LEAD: Contrastive Learning for Enhanced Adversarial Defense

Suklav Ghosh, Sonal Kumar, Arijit Sur

🧩 TL;DR

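This paper proposes a new approach to adversarial defense based on contrastive learning. Training classification models on both clean and adversarially perturbed images significantly improves the robustness of deep neural networks to adversarial attacks.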


📘 Detailed Summary

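Motivation: Deep neural networks perform well on computer vision tasks but are highly vulnerable to adversarial attacks: small input perturbations can cause incorrect predictions, which hinders the practical deployment of robust deep learning systems.

Method: The method uses a contrastive loss and optimizes the model parameters together with the perturbations, so that the network learns robust representations that are more resistant to adversarial attacks. Clean and adversarially perturbed images are used together during training.

Result: Experiments show that the method significantly improves model robustness against various types of adversarial perturbations, demonstrating the effectiveness of contrastive learning for extracting informative and resilient features.

Conclusion: Contrastive loss helps extract more informative and robust features, opening a new direction for adversarial robustness research in deep learning and indicating that contrastive learning is an effective strategy for improving model security.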


📄 Abstract

Deep neural networks (DNNs) have achieved remarkable success in computer vision tasks such as image classification, segmentation, and object detection. However, they are vulnerable to adversarial attacks, which can cause incorrect predictions with small perturbations in input images. Addressing this issue is crucial for deploying robust deep-learning systems. This paper presents a novel approach that utilizes contrastive learning for adversarial defense, a previously unexplored area. Our method leverages the contrastive loss function to enhance the robustness of classification models by training them with both clean and adversarially perturbed images. By optimizing the model's parameters alongside the perturbations, our approach enables the network to learn robust representations that are less susceptible to adversarial attacks. Experimental results show significant improvements in the model's robustness against various types of adversarial perturbations. This suggests that contrastive loss helps extract more informative and resilient features, contributing to the field of adversarial robustness in deep learning.
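
One way to picture a contrastive objective over clean and adversarial images is an InfoNCE-style loss in which each clean embedding is pulled toward the embedding of its own perturbed copy and pushed away from the perturbed copies of other images. This pairing, the `temperature` value, and the toy embeddings are assumptions for illustration; the abstract does not give C-LEAD's exact loss, and the joint optimization of perturbations is not shown.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(clean_embs, adv_embs, temperature=0.5):
    """Mean InfoNCE loss; the positive for clean[i] is adv[i], all other adv[j] are negatives."""
    clean = [normalize(v) for v in clean_embs]
    adv = [normalize(v) for v in adv_embs]
    losses = []
    for i, anchor in enumerate(clean):
        logits = [sum(a * b for a, b in zip(anchor, other)) / temperature for other in adv]
        peak = max(logits)  # log-sum-exp with max subtraction for stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


clean = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]  # perturbed copies that stay close to their sources
swapped = [[0.1, 0.9], [0.9, 0.1]]  # perturbed copies that drift toward the other image
loss_aligned = info_nce(clean, aligned)
loss_swapped = info_nce(clean, swapped)
print(loss_aligned, loss_swapped)
```

The loss is lower when perturbed images remain near their clean counterparts in embedding space, which is the kind of invariance a robust representation needs.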

[15] Enhancing Spatio-Temporal Zero-shot Action Recognition with Language-driven Description Attributes

Yehna Kim, Young-Eun Kim, Seong-Whan Lee

🧩 TL;DR

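This paper proposes an approach that uses web-crawled descriptions and a large language model to extract keywords, addressing the semantic ambiguity that multi-semantic words cause for vision-language models in zero-shot action recognition. A spatio-temporal interaction module aligns description attributes with video content, and the method achieves strong zero-shot recognition performance on multiple benchmark datasets.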


📘 Detailed Summary

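Motivation: In zero-shot action recognition, vision-language models rely mainly on action class names for semantic context, but multi-semantic words make the intended action concepts ambiguous. Existing methods require large amounts of manually annotated attribute data, which is laborious and costly, so a more efficient way to enrich semantics is needed.

Method: The approach combines web-crawled descriptions with a large language model that extracts relevant keywords, reducing the reliance on manual annotation. A spatio-temporal interaction module focuses on objects and action units to align description attributes with video content. The automated process generates semantically rich attribute data.

Result: In zero-shot experiments, the model reaches accuracies of 81.0%, 53.1%, and 68.9% on UCF-101, HMDB-51, and Kinetics-600, respectively. These results demonstrate the method's adaptability and effectiveness across downstream tasks and a significant improvement in zero-shot action recognition.

Conclusion: The study shows that web resources and large language models can resolve semantic ambiguity in action recognition while reducing the reliance on manual annotation. The spatio-temporal interaction module provides an effective mechanism for aligning video content with semantic descriptions and opens a new path for applying vision-language models to complex action understanding.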


📄 Abstract

Vision-Language Models (VLMs) have demonstrated impressive capabilities in zero-shot action recognition by learning to associate video embeddings with class embeddings. However, a significant challenge arises when relying solely on action classes to provide semantic context, particularly due to the presence of multi-semantic words, which can introduce ambiguity in understanding the intended concepts of actions. To address this issue, we propose an innovative approach that harnesses web-crawled descriptions, leveraging a large language model to extract relevant keywords. This method reduces the need for human annotators and eliminates the laborious manual process of attribute data creation. Additionally, we introduce a spatio-temporal interaction module designed to focus on objects and action units, facilitating alignment between description attributes and video content. In our zero-shot experiments, our model achieves impressive results, attaining accuracies of 81.0%, 53.1%, and 68.9% on UCF-101, HMDB-51, and Kinetics-600, respectively, underscoring the model's adaptability and effectiveness across various downstream tasks.
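
A minimal sketch of zero-shot classification by embedding similarity, with class embeddings enriched by keyword embeddings. Averaging the class-name embedding with its keyword embeddings is an assumption chosen for simplicity, not the paper's interaction module, and the class names and 3-d vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def enrich(class_emb, keyword_embs):
    """Average a class-name embedding with its keyword embeddings."""
    vectors = [class_emb] + keyword_embs
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def zero_shot_predict(video_emb, class_embs):
    """Return the class whose embedding is most similar to the video embedding."""
    return max(class_embs, key=lambda name: cosine(video_emb, class_embs[name]))


# Two "shooting" classes whose names alone yield nearly identical embeddings.
name_only = {"shooting_basketball": [1.0, 0.1, 0.0], "shooting_archery": [1.0, 0.0, 0.1]}
keywords = {"shooting_basketball": [[0.0, 1.0, 0.0]], "shooting_archery": [[0.0, 0.0, 1.0]]}
enriched = {name: enrich(name_only[name], keywords[name]) for name in name_only}

video = [0.2, 0.0, 1.0]  # toy video embedding leaning toward the archery keywords
prediction = zero_shot_predict(video, enriched)
print(prediction)
```

The keyword embeddings pull the two ambiguous classes apart, so the decision rests on the distinguishing attributes rather than the shared word.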

[16] RegionRAG: Region-level Retrieval-Augmented Generation for Visually-Rich Documents

Yinglu Li, Zhiying Lu, Zhihang Liu, Chuanbin Liu, Hongtao Xie

🧩 TL;DR

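RegionRAG is a multimodal retrieval-augmented generation framework that shifts the retrieval granularity from the document level to the region level. By identifying and focusing on relevant visual regions, it reduces interference from irrelevant content and substantially improves retrieval and question-answering performance.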


📘 Detailed Summary

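Motivation: Current multimodal retrieval-augmented generation methods treat the whole document as the basic retrieval unit, which causes two problems. Relevant documents contain many visual regions unrelated to the query, which dilutes the key information, and retrieving multiple documents introduces redundant and irrelevant documents. This redundant context distracts the model and degrades performance.

Method: The paper shifts the retrieval paradigm from the document level to the region level. For training, it designs a hybrid supervision strategy that draws on both labeled and unlabeled data to pinpoint relevant patches. For inference, it proposes a dynamic pipeline that groups salient patches into complete semantic regions, so the generator can focus on concise visual content relevant to the query.

Result: Experiments on six benchmarks show that RegionRAG achieves state-of-the-art performance. R@1 retrieval accuracy improves by 10.02% on average and question-answering accuracy by 3.56%, while using only 71.42% of the visual tokens used by prior methods.

Conclusion: By delegating the identification of relevant regions to the retriever, RegionRAG lets the generator focus on concise, query-relevant visual content. This improves both the efficiency and the accuracy of multimodal retrieval-augmented generation and suggests a direction for fine-grained visual content retrieval.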


📄 Abstract

Multi-modal Retrieval-Augmented Generation (RAG) has become a critical method for empowering LLMs by leveraging candidate visual documents. However, current methods consider the entire document as the basic retrieval unit, introducing substantial irrelevant visual content in two ways: 1) Relevant documents often contain large regions unrelated to the query, diluting the focus on salient information; 2) Retrieving multiple documents to increase recall further introduces redundant and irrelevant documents. These redundant contexts distract the model's attention and further degrade performance. To address this challenge, we propose RegionRAG, a novel framework that shifts the retrieval paradigm from the document level to the region level. During training, we design a hybrid supervision strategy from both labeled data and unlabeled data to pinpoint relevant patches. During inference, we propose a dynamic pipeline that intelligently groups salient patches into complete semantic regions. By delegating the task of identifying relevant regions to the retriever, RegionRAG enables the generator to focus solely on concise visual content relevant to queries, improving both efficiency and accuracy. Experiments on six benchmarks demonstrate that RegionRAG achieves state-of-the-art performance: it improves retrieval accuracy by 10.02% in R@1 on average and increases question answering accuracy by 3.56% while using only 71.42% of the visual tokens used by prior methods. The code will be available at https://github.com/Aeryn666/RegionRAG.
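
The idea of grouping salient patches into regions can be illustrated with a minimal sketch. It is not the RegionRAG pipeline: the fixed `threshold`, the 4-connected grouping rule, and the function name `group_salient_patches` are all assumptions chosen for illustration, and the abstract does not say how the paper forms its regions.

```python
from collections import deque


def group_salient_patches(scores, threshold=0.5):
    """Group 4-connected patches with score >= threshold into bounding boxes.

    scores: 2D list of per-patch relevance scores (rows x cols).
    Returns inclusive boxes as (row_min, col_min, row_max, col_max).
    Hypothetical helper; the thresholding rule is an assumption.
    """
    rows, cols = len(scores), len(scores[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or scores[r][c] < threshold:
                continue
            # Breadth-first search over one connected salient component.
            queue = deque([(r, c)])
            seen[r][c] = True
            r0, c0, r1, c1 = r, c, r, c
            while queue:
                y, x = queue.popleft()
                r0, c0 = min(r0, y), min(c0, x)
                r1, c1 = max(r1, y), max(c1, x)
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and scores[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((r0, c0, r1, c1))
    return boxes
```

Each returned box could then be cropped from the page image and passed to the generator in place of the full document, which is where a saving in visual tokens would come from.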

[17] FOCUS: Efficient Keyframe Selection for Long Video Understanding

Zirui Zhu, Hailun Xu, Yang Luo, Yong Liu, Kanchan Sarkar, Zhenheng Yang, Yang You

🧩 TL;DR

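This paper proposes FOCUS, a training-free, model-agnostic keyframe selection method that formulates keyframe selection as a combinatorial pure-exploration problem in multi-armed bandits. It selects query-relevant frames under a strict token budget and substantially improves long-video understanding.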


📘 Detailed Summary

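Motivation: When multimodal large language models process long videos, the number of visual tokens grows far beyond practical limits. Existing pipelines either sample frames uniformly or select keyframes with retrieval-style scoring, but these keyframe selection methods still rely on pre-filtering to reduce inference cost and can miss the most informative moments.

Method: FOCUS formulates keyframe selection as a combinatorial pure-exploration problem in multi-armed bandits. It treats short temporal clips as arms and uses empirical means and Bernstein confidence radii to identify informative regions while preserving exploration of uncertain ones. A two-stage exploration-exploitation procedure first identifies high-value temporal regions and then selects the top-scoring frames within each region.

Result: On two long-video question-answering benchmarks, FOCUS delivers substantial accuracy improvements while processing less than 2% of video frames. For videos longer than 20 minutes, it achieves an 11.9% accuracy gain on LongVideoBench.

Conclusion: FOCUS is effective as a keyframe selection method and offers a simple, general solution for scalable long-video understanding with MLLMs. Because it is training-free and model-agnostic, it is broadly applicable.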


📄 Abstract

Multimodal large language models (MLLMs) represent images and video frames as visual tokens. Scaling from single images to hour-long videos, however, inflates the token budget far beyond practical limits. Popular pipelines therefore either uniformly subsample or apply keyframe selection with retrieval-style scoring using smaller vision-language models. However, these keyframe selection methods still rely on pre-filtering before selection to reduce the inference cost and can miss the most informative moments. We propose FOCUS, Frame-Optimistic Confidence Upper-bound Selection, a training-free, model-agnostic keyframe selection module that selects query-relevant frames under a strict token budget. FOCUS formulates keyframe selection as a combinatorial pure-exploration (CPE) problem in multi-armed bandits: it treats short temporal clips as arms, and uses empirical means and Bernstein confidence radii to identify informative regions while preserving exploration of uncertain areas. The resulting two-stage exploration-exploitation procedure is reduced from a sequential policy with theoretical guarantees, first identifying high-value temporal regions, then selecting top-scoring frames within each region. On two long-video question-answering benchmarks, FOCUS delivers substantial accuracy improvements while processing less than 2% of video frames. For videos longer than 20 minutes, it achieves an 11.9% gain in accuracy on LongVideoBench, demonstrating its effectiveness as a keyframe selection method and providing a simple and general solution for scalable long-video understanding with MLLMs.
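
To make the "empirical mean plus Bernstein confidence radius" idea concrete, here is a minimal sketch that ranks clips by an optimistic upper bound. It assumes relevance scores in [0, 1] and uses one common empirical Bernstein form; the constants, the `delta` parameter, and the function names are assumptions, and the paper's exact bound and selection procedure may differ.

```python
import math


def bernstein_radius(samples, delta=0.05, value_range=1.0):
    """Empirical Bernstein confidence radius for samples in [0, value_range].

    One common form: sqrt(2 * var * ln(3/delta) / n) + 3 * range * ln(3/delta) / n.
    The constants are an assumption, not taken from the FOCUS paper.
    """
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    log_term = math.log(3.0 / delta)
    return math.sqrt(2.0 * var * log_term / n) + 3.0 * value_range * log_term / n


def rank_clips_by_ucb(clip_scores, delta=0.05):
    """Rank clips (arms) by empirical mean plus confidence radius.

    clip_scores: dict mapping a clip id to the list of frame relevance
    scores sampled from that clip so far (hypothetical input layout).
    """
    ucb = {}
    for clip_id, samples in clip_scores.items():
        mean = sum(samples) / len(samples)
        ucb[clip_id] = mean + bernstein_radius(samples, delta)
    return sorted(ucb, key=ucb.get, reverse=True)
```

A clip that has been sampled only once keeps a wide radius, so it stays near the top of the ranking until more of its frames are scored. This is the sense in which the bound preserves exploration of uncertain regions.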

[18] T3: Test-Time Model Merging in VLMs for Zero-Shot Medical Imaging Analysis

Raza Imam, Hu Wang, Dwarikanath Mahapatra, Mohammad Yaqub

🧩 TL;DR

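This paper proposes T^3 (Test-Time Task adaptive merging), a framework that merges models dynamically using the Jensen-Shannon divergence. It addresses the trade-off between pretrained and expert models under modality shift in medical imaging and achieves state-of-the-art performance while remaining efficient.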


📘 Detailed Summary

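Motivation: Vision-language models in medical imaging face a duality: pretrained networks are broadly robust but lack modality-specific characteristics, while fine-tuned expert models are accurate in distribution but falter under modality shift. Existing model-merging techniques were designed for natural-image benchmarks, and their static interpolation fails to deliver consistent gains across diverse medical modalities, which limits reliability in clinical tasks.

Method: T^3 is a backpropagation-free framework that derives a per-sample interpolation coefficient from the Jensen-Shannon divergence between the two models' output distributions. It preserves local precision when the models agree and defers to the generalist model's robustness under drift. To reduce inference cost further, a batch-wise extension, T^3_B, computes one merging coefficient across a batch of samples.

Result: Under a cross-evaluation protocol spanning in-domain, base-to-novel, and corruption settings across four modalities, T^3 sets a new state of the art in Top-1 accuracy and error reduction, outperforming strong baselines while remaining computationally efficient.

Conclusion: The work opens a path toward adaptive deployment of medical vision-language models in clinical settings, demonstrates the effectiveness of dynamic model merging for medical image analysis, and provides a standardized evaluation protocol as a benchmark for follow-up research.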


📄 Abstract

In medical imaging, vision-language models face a critical duality: pretrained networks offer broad robustness but lack subtle, modality-specific characteristics, while fine-tuned expert models achieve high in-distribution accuracy yet falter under modality shift. Existing model-merging techniques, designed for natural-image benchmarks, are simple and efficient but fail to deliver consistent gains across diverse medical modalities; their static interpolation limits reliability in varied clinical tasks. To address this, we introduce Test-Time Task adaptive merging (T^3), a backpropagation-free framework that computes per-sample interpolation coefficients via the Jensen-Shannon divergence between the two models' output distributions. T^3 dynamically preserves local precision when models agree and defers to generalist robustness under drift. To overcome the inference costs of sample-wise merging, we further propose a batch-wise extension, T^3_B, that computes a merging coefficient across a batch of samples, dramatically reducing the computational bottleneck. Recognizing the lack of a standardized medical-merging benchmark, we present a rigorous cross-evaluation protocol spanning in-domain, base-to-novel, and corruptions across four modalities. Empirically, T^3 sets a new state of the art in Top-1 accuracy and error reduction, outperforming strong baselines while maintaining efficiency, paving the way for adaptive MVLM deployment in clinical settings. Our code is available at https://github.com/Razaimam45/TCube.
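
A divergence-driven interpolation coefficient can be sketched as follows. The Jensen-Shannon divergence itself is standard; everything else is an assumption made for illustration: dividing by ln 2 to obtain a weight in [0, 1], using that weight for the pretrained model, and the helper names. The abstract does not give the paper's actual mapping from divergence to coefficient.

```python
import math


def kl(p, q):
    """KL divergence KL(p || q) in nats; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in nats, bounded above by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def merge_coefficient(p_pretrained, p_expert):
    """Weight on the pretrained (generalist) model, in [0, 1].

    Assumed mapping: normalized JS divergence. Agreement gives 0 (keep the
    expert); strong disagreement gives a value near 1 (defer to the generalist).
    """
    return js_divergence(p_pretrained, p_expert) / math.log(2)


def interpolate(theta_pretrained, theta_expert, lam):
    """Linear interpolation of two flat parameter lists."""
    return [lam * a + (1 - lam) * b
            for a, b in zip(theta_pretrained, theta_expert)]
```

Under this assumed mapping, a batch-wise variant would average the coefficient over the samples in a batch and interpolate once, rather than once per sample.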

[19] Generative Semantic Coding for Ultra-Low Bitrate Visual Communication and Analysis

Weiming Chen, Yijia Wang, Zhihan Zhu, Zhihai He

🧩 TL;DR

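This paper proposes an ultra-low bitrate visual communication method that integrates image generation with deep image compression. A joint text and coding latent guides rectified flow models to reconstruct the visual scene precisely, matching the image reconstruction quality and vision analysis accuracy of existing methods at a much lower bandwidth.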


📘 Detailed Summary

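Motivation: The work targets ultra-low bitrate visual communication in scenarios with very low communication bandwidth, such as deep space exploration, battlefield intelligence, and robot navigation in complex environments. Existing text-to-image generation models achieve only a semantic-level approximation of the visual scene, which falls short of the precision required for visual communication, remote vision analysis, and human interaction.

Method: The method integrates image generation with deep image compression, using a joint text and coding latent to guide rectified flow models toward precise generation of the visual scene. Both the semantic text description and the coding latent are encoded and transmitted to the decoder at a very low bitrate.

Result: Experiments show that the method achieves the same image reconstruction quality and vision analysis accuracy as existing methods while using much less bandwidth.

Conclusion: The study shows that combining generative models with compression enables precise visual scene reconstruction at ultra-low bitrates, offering a technical route for visual communication systems in extreme communication environments.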


📄 Abstract

We consider the problem of ultra-low bit rate visual communication for remote vision analysis, human interactions and control in challenging scenarios with very low communication bandwidth, such as deep space exploration, battlefield intelligence, and robot navigation in complex environments. In this paper, we ask the following important question: can we accurately reconstruct the visual scene using only a very small portion of the bit rate in existing coding methods while not sacrificing the accuracy of vision analysis and performance of human interactions? Existing text-to-image generation models offer a new approach for ultra-low bitrate image description. However, they can only achieve a semantic-level approximation of the visual scene, which is far from sufficient for visual communication, remote vision analysis, and human interactions. To address this important issue, we propose to seamlessly integrate image generation with deep image compression, using a joint text and coding latent to guide the rectified flow models for precise generation of the visual scene. The semantic text description and coding latent are both encoded and transmitted to the decoder at a very small bit rate. Experimental results demonstrate that our method can achieve the same image reconstruction quality and vision analysis accuracy as existing methods while using much less bandwidth. The code will be released upon paper acceptance.

[20] Fine-Tuning Open Video Generators for Cinematic Scene Synthesis: A Small-Data Pipeline with LoRA and Wan2.1 I2V

Meftun Akarsu, Kerem Catay, Sedat Bin Vedat, Enes Kutay Yarkan, Ilke Senturk, Arda Sar, Dafne Eksioglu

🧩 TL;DR

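This paper proposes a two-stage fine-tuning pipeline that adapts open-source video diffusion transformers to film and television production. LoRA modules decouple visual style learning from motion generation, so domain transfer completes within hours on a single GPU and yields stylistically consistent 720p cinematic scenes.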


📘 Detailed Summary

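Motivation: Existing video generation models adapt inefficiently to film and television production and struggle with style consistency. A practical solution is needed that learns a specific visual style quickly from a small dataset while preserving temporal consistency.

Method: The pipeline has two stages. In the first, LoRA modules are integrated into the cross-attention layers of the Wan2.1 I2V-14B model, and its visual representations are adapted using a dataset of short clips from the television film El Turco. In the second, the fine-tuned model generates stylistically consistent keyframes, which the video decoder expands into coherent 720p sequences. Lightweight parallelization and sequence partitioning accelerate inference.

Result: Quantitative and qualitative evaluations using FVD, CLIP-SIM, and LPIPS, together with an expert user study, show measurable improvements in cinematic fidelity and temporal stability over the base model, and the inference acceleration introduces no quality degradation.

Conclusion: The study shows that efficient domain adaptation by decoupling visual style learning from motion generation is feasible for film and television production. The complete training and inference pipeline offers a reproducible solution for adaptation across cinematic domains and supports the use of open-source video generators in real production settings.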


📄 Abstract

We present a practical pipeline for fine-tuning open-source video diffusion transformers to synthesize cinematic scenes for television and film production from small datasets. The proposed two-stage process decouples visual style learning from motion generation. In the first stage, Low-Rank Adaptation (LoRA) modules are integrated into the cross-attention layers of the Wan2.1 I2V-14B model to adapt its visual representations using a compact dataset of short clips from Ay Yapim's historical television film El Turco. This enables efficient domain transfer within hours on a single GPU. In the second stage, the fine-tuned model produces stylistically consistent keyframes that preserve costume, lighting, and color grading, which are then temporally expanded into coherent 720p sequences through the model's video decoder. We further apply lightweight parallelization and sequence partitioning strategies to accelerate inference without quality degradation. Quantitative and qualitative evaluations using FVD, CLIP-SIM, and LPIPS metrics, supported by a small expert user study, demonstrate measurable improvements in cinematic fidelity and temporal stability over the base model. The complete training and inference pipeline is released to support reproducibility and adaptation across cinematic domains.
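
The low-rank update behind LoRA can be shown on a single linear layer. This is a generic sketch of the standard formulation, output = W x + (alpha / r) B A x with W frozen, written with plain Python lists. It is not the Wan2.1 integration, and the function names and tiny dimensions are assumptions for illustration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(weight, lora_a, lora_b, x, alpha=1.0):
    """Forward pass of a linear layer with a LoRA adapter.

    weight: frozen d_out x d_in matrix.
    lora_a: trainable r x d_in matrix; lora_b: trainable d_out x r matrix.
    The update is scaled by alpha / r, as in the usual LoRA formulation.
    """
    rank = len(lora_a)
    base = matvec(weight, x)
    update = matvec(lora_b, matvec(lora_a, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]
```

When `lora_b` is all zeros, which is the usual initialization, the adapted layer reproduces the base model exactly, so fine-tuning starts from the pretrained behavior and only the small `lora_a` and `lora_b` matrices are trained.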

[21] Mitigating Semantic Collapse in Partially Relevant Video Retrieval

WonJun Moon, MinSeok Jung, Gilhan Park, Tae-Young Kim, Cheol-Ho Cho, Woojin Jun, Jae-Pil Heo

🧩 TL;DR

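This paper proposes a framework that addresses semantic collapse in partially relevant video retrieval. Text Correlation Preservation Learning and Cross-Branch Video Alignment prevent collapse in the query and video embedding spaces and substantially improve retrieval performance.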


📘 Detailed Summary

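Motivation: Existing methods treat every annotated text-video pair as a positive and all others as negatives, ignoring the rich semantic variation within a single video and across videos. As a result, embeddings of queries and video segments for distinct events in the same video collapse together, while semantically similar queries and segments from different videos are pushed apart. This limits retrieval performance on multi-event videos.

Method: Text Correlation Preservation Learning preserves the semantic relationships among text queries that the foundation model encodes. Cross-Branch Video Alignment (CBVA), a contrastive alignment method, addresses collapse in video embeddings by disentangling hierarchical video representations across temporal scales. Order-preserving token merging and adaptive CBVA further strengthen alignment by producing video segments that are internally coherent yet mutually distinctive.

Result: Extensive experiments on PRVR benchmarks show that the framework prevents semantic collapse and substantially improves retrieval accuracy.

Conclusion: The work offers an effective remedy for semantic collapse in partially relevant video retrieval. Preserving semantic relationships and aligning across scales yields better embeddings and points the way toward better retrieval on multi-event videos.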


📄 Abstract

Partially Relevant Video Retrieval (PRVR) seeks videos where only part of the content matches a text query. Existing methods treat every annotated text-video pair as a positive and all others as negatives, ignoring the rich semantic variation both within a single video and across different videos. Consequently, embeddings of both queries and their corresponding video-clip segments for distinct events within the same video collapse together, while embeddings of semantically similar queries and segments from different videos are driven apart. This limits retrieval performance when videos contain multiple, diverse events. This paper addresses these problems, termed semantic collapse, in both the text and video embedding spaces. We first introduce Text Correlation Preservation Learning, which preserves the semantic relationships encoded by the foundation model across text queries. To address collapse in video embeddings, we propose Cross-Branch Video Alignment (CBVA), a contrastive alignment method that disentangles hierarchical video representations across temporal scales. Subsequently, we introduce order-preserving token merging and adaptive CBVA to enhance alignment by producing video segments that are internally coherent yet mutually distinctive. Extensive experiments on PRVR benchmarks demonstrate that our framework effectively prevents semantic collapse and substantially improves retrieval accuracy.

[22] RzenEmbed: Towards Comprehensive Multimodal Retrieval

Weijian Jian, Yajun Zhang, Dawei Liang, Chunyu Xie, Yixiao He, Dawei Leng, Yuhui Yin

🧩 TL;DR

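This paper proposes RzenEmbed, a unified framework for multimodal embedding learning. With a two-stage training strategy and an enhanced InfoNCE loss, it achieves state-of-the-art performance on the MMEB benchmark and performs especially well on video and visual document retrieval.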


📘 Detailed Summary

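Motivation: Existing embedding methods based on multimodal large language models focus mainly on natural images and offer limited support for other important visual modalities such as videos and visual documents, which restricts the generality and performance of multimodal retrieval.

Method: Training has two stages. The first focuses on foundational text and multimodal retrieval. The second introduces an improved InfoNCE loss with a hardness-weighted mechanism and a strategy that mitigates false negatives. A learnable temperature parameter and model souping provide further gains.

Result: RzenEmbed sets a new state of the art on the MMEB benchmark. It achieves the best overall score and outperforms all prior work on the challenging video and visual document retrieval tasks.

Conclusion: The study shows that a carefully designed training strategy and loss function can markedly improve the discriminative power and instruction-following ability of multimodal embedding models, offering an effective approach to unified multimodal representation learning.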


📄 Abstract

The rapid advancement of Multimodal Large Language Models (MLLMs) has extended CLIP-based frameworks to produce powerful, universal embeddings for retrieval tasks. However, existing methods primarily focus on natural images, offering limited support for other crucial visual modalities such as videos and visual documents. To bridge this gap, we introduce RzenEmbed, a unified framework to learn embeddings across a diverse set of modalities, including text, images, videos, and visual documents. We employ a novel two-stage training strategy to learn discriminative representations. The first stage focuses on foundational text and multimodal retrieval. In the second stage, we introduce an improved InfoNCE loss, incorporating two key enhancements. Firstly, a hardness-weighted mechanism guides the model to prioritize challenging samples by assigning them higher weights within each batch. Secondly, we implement an approach to mitigate the impact of false negatives and alleviate data noise. This strategy not only enhances the model's discriminative power but also improves its instruction-following capabilities. We further boost performance with a learnable temperature parameter and model souping. RzenEmbed sets a new state-of-the-art on the MMEB benchmark. It not only achieves the best overall score but also outperforms all prior work on the challenging video and visual document retrieval tasks. Our models are available at https://huggingface.co/qihoo360/RzenEmbed.
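
One way to picture a hardness-weighted InfoNCE loss with false-negative filtering is the single-query sketch below. The abstract does not give the loss, so the specific choices here are hypothetical: softmax weights over negatives controlled by `beta`, a `false_neg_margin` rule that drops suspiciously similar negatives, and the function name.

```python
import math


def hardness_weighted_info_nce(pos_sim, neg_sims, temperature=0.05,
                               beta=1.0, false_neg_margin=0.1):
    """InfoNCE for one query with re-weighted negatives (illustrative only).

    pos_sim: similarity between the query and its positive.
    neg_sims: similarities between the query and in-batch negatives.
    """
    # Assumed false-negative rule: discard negatives that score clearly
    # higher than the positive, since they may be unlabeled positives.
    kept = [s for s in neg_sims if s <= pos_sim + false_neg_margin]
    if not kept:
        return 0.0
    # Assumed hardness weights: softmax over beta * similarity, rescaled so
    # the mean weight is 1. beta = 0 recovers the standard InfoNCE loss.
    raw = [math.exp(beta * s) for s in kept]
    scale = len(kept) / sum(raw)
    weights = [r * scale for r in raw]
    pos_term = math.exp(pos_sim / temperature)
    neg_term = sum(w * math.exp(s / temperature)
                   for w, s in zip(weights, kept))
    return -math.log(pos_term / (pos_term + neg_term))
```

Raising `beta` shifts weight toward the negatives closest to the query, which increases the loss for queries that have hard negatives and therefore their share of the gradient.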

[23] Context-Gated Cross-Modal Perception with Visual Mamba for PET-CT Lung Tumor Segmentation

Elena Mulero Ayllón, Linlin Shen, Pierangelo Veltri, Fabrizia Gelardi, Arturo Chiti, Paolo Soda, Matteo Tortora

🧩 TL;DR

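This study proposes vMambaX, a lightweight multimodal framework built on the Visual Mamba architecture. A Context-Gated Cross-Modal Perception Module integrates PET and CT scan images for efficient and scalable lung tumor segmentation.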


📘 Detailed Summary

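Motivation: Accurate lung tumor segmentation is vital for improving diagnosis and treatment planning, and effectively combining the anatomical and functional information from PET and CT remains a major challenge.

Method: vMambaX is built on the Visual Mamba architecture and integrates a Context-Gated Cross-Modal Perception Module that adaptively enhances inter-modality feature interaction, emphasizing informative regions while suppressing noise.

Result: On the PCLT20K dataset, the model outperforms baseline models while maintaining lower computational complexity.

Conclusion: The results highlight the effectiveness of adaptive cross-modal gating for multimodal tumor segmentation and show the potential of vMambaX as an efficient and scalable framework for advanced lung cancer analysis.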


📄 Abstract

Accurate lung tumor segmentation is vital for improving diagnosis and treatment planning, and effectively combining anatomical and functional information from PET and CT remains a major challenge. In this study, we propose vMambaX, a lightweight multimodal framework integrating PET and CT scan images through a Context-Gated Cross-Modal Perception Module (CGM). Built on the Visual Mamba architecture, vMambaX adaptively enhances inter-modality feature interaction, emphasizing informative regions while suppressing noise. Evaluated on the PCLT20K dataset, the model outperforms baseline models while maintaining lower computational complexity. These results highlight the effectiveness of adaptive cross-modal gating for multimodal tumor segmentation and demonstrate the potential of vMambaX as an efficient and scalable framework for advanced lung cancer analysis. The code is available at https://github.com/arco-group/vMambaX.

[24] Modality Alignment across Trees on Heterogeneous Hyperbolic Manifolds

Wu Wei, Xiaomeng Fan, Yuwei Wu, Zhi Gao, Pengxiang Li, Yunde Jia, Mehrtash Harandi

🧩 TL;DR

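This paper proposes Alignment across Trees, which constructs and matches tree-like hierarchical features for images and text to address modality asymmetry in vision-language models. The method embeds the heterogeneous feature trees in hyperbolic manifolds and learns an intermediate manifold for alignment, outperforming existing baselines on several open-set classification tasks.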


📘 Detailed Summary

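Motivation: Modality alignment in existing vision-language models is asymmetric: text is represented with hierarchical features while each image is represented with a single feature, so cross-modal information is integrated poorly. This mismatch limits the model's ability to capture complex semantic hierarchies, especially in vision-language tasks that require fine-grained understanding.

Method: A semantic-aware visual feature extraction framework applies cross-attention to features from intermediate Transformer layers to extract visual features with coarse-to-fine semantics. The feature trees of the two modalities are embedded in hyperbolic manifolds with different curvatures, and an intermediate manifold for alignment is learned by minimizing a KL distance between the heterogeneous manifolds. The existence and uniqueness of the optimal intermediate manifold are proved.

Result: On taxonomic open-set classification tasks across multiple image datasets, the method consistently outperforms strong baselines in few-shot and cross-domain settings. Cross-tree alignment brings consistent accuracy gains in scenarios that require hierarchical semantic understanding, supporting the effectiveness of alignment across heterogeneous hyperbolic manifolds.

Conclusion: The study shows the importance of hierarchical feature alignment for vision-language models and that hyperbolic geometry is an effective framework for modeling the hierarchical structure of heterogeneous modalities. The theoretically grounded manifold alignment opens a direction for complex cross-modal tasks and could extend to other applications that need fine-grained semantic alignment.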


📄 Abstract

Modality alignment is critical for vision-language models (VLMs) to effectively integrate information across modalities. However, existing methods extract hierarchical features from text while representing each image with a single feature, leading to asymmetric and suboptimal alignment. To address this, we propose Alignment across Trees, a method that constructs and aligns tree-like hierarchical features for both image and text modalities. Specifically, we introduce a semantic-aware visual feature extraction framework that applies a cross-attention mechanism to visual class tokens from intermediate Transformer layers, guided by textual cues to extract visual features with coarse-to-fine semantics. We then embed the feature trees of the two modalities into hyperbolic manifolds with distinct curvatures to effectively model their hierarchical structures. To align across the heterogeneous hyperbolic manifolds with different curvatures, we formulate a KL distance measure between distributions on heterogeneous manifolds, and learn an intermediate manifold for manifold alignment by minimizing the distance. We prove the existence and uniqueness of the optimal intermediate manifold. Experiments on taxonomic open-set classification tasks across multiple image datasets demonstrate that our method consistently outperforms strong baselines under few-shot and cross-domain settings.
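
For readers unfamiliar with hyperbolic embeddings, the sketch below computes geodesic distance in a Poincaré ball of curvature -c, so the same pair of points can be measured under two different curvatures. The choice of the Poincaré ball model is an assumption, since the abstract does not say which model of hyperbolic space the paper uses, and the function name is illustrative.

```python
import math


def poincare_distance(x, y, c=1.0):
    """Geodesic distance in the Poincare ball of curvature -c (c > 0).

    Points must satisfy c * ||x||^2 < 1. Standard closed form:
    d = arcosh(1 + 2c ||x - y||^2 / ((1 - c||x||^2)(1 - c||y||^2))) / sqrt(c).
    """
    sq_x = sum(a * a for a in x)
    sq_y = sum(b * b for b in y)
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    denom = (1.0 - c * sq_x) * (1.0 - c * sq_y)
    arg = 1.0 + 2.0 * c * sq_diff / denom
    # Guard against rounding that could push the argument just below 1.
    return math.acosh(max(1.0, arg)) / math.sqrt(c)
```

Distances grow rapidly as points approach the boundary of the ball, which is why tree-like hierarchies, whose node counts grow exponentially with depth, embed with low distortion in this geometry.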

[25] Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning

Yuhong Liu, Beichen Zhang, Yuhang Zang, Yuhang Cao, Long Xing, Xiaoyi Dong, Haodong Duan, Dahua Lin, Jiaqi Wang

🧩 TL;DR

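This paper proposes Spatial-SSRL, a self-supervised reinforcement learning paradigm that derives verifiable signals automatically from ordinary RGB or RGB-D images to strengthen the spatial understanding of large vision-language models, achieving clear gains without human annotation.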


📘 Detailed Summary

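Motivation: Spatial understanding remains a weakness of large vision-language models. Existing supervised fine-tuning and reinforcement learning methods depend on costly annotated data, specialized tools, or constrained environments, which limits their scalability and practicality.

Method: Spatial-SSRL automatically constructs five self-supervised tasks that capture 2D and 3D spatial structure: shuffled patch reordering, flipped patch recognition, cropped patch inpainting, regional depth ordering, and relative 3D position prediction. These tasks provide ground-truth answers that are easy to verify and require no human or LVLM annotation.

Result: On seven spatial understanding benchmarks spanning image and video settings, Spatial-SSRL delivers average accuracy gains of 4.63% (3B) and 3.89% (7B) over the Qwen2.5-VL baselines, improving spatial reasoning while preserving general visual capabilities.

Conclusion: The study shows that simple, intrinsic supervision enables verifiable-reward reinforcement learning at scale and provides a practical route to stronger spatial intelligence in LVLMs.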


📄 Abstract

Spatial understanding remains a weakness of Large Vision-Language Models (LVLMs). Existing supervised fine-tuning (SFT) and recent reinforcement learning with verifiable rewards (RLVR) pipelines depend on costly supervision, specialized tools, or constrained environments that limit scale. We introduce Spatial-SSRL, a self-supervised RL paradigm that derives verifiable signals directly from ordinary RGB or RGB-D images. Spatial-SSRL automatically formulates five pretext tasks that capture 2D and 3D spatial structure: shuffled patch reordering, flipped patch recognition, cropped patch inpainting, regional depth ordering, and relative 3D position prediction. These tasks provide ground-truth answers that are easy to verify and require no human or LVLM annotation. Training on our tasks substantially improves spatial reasoning while preserving general visual capabilities. On seven spatial understanding benchmarks in both image and video settings, Spatial-SSRL delivers average accuracy gains of 4.63% (3B) and 3.89% (7B) over the Qwen2.5-VL baselines. Our results show that simple, intrinsic supervision enables RLVR at scale and provides a practical route to stronger spatial intelligence in LVLMs.
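
The shuffled patch reordering task shows why such pretext tasks are cheap to verify: the permutation used to build the input is itself the ground truth. The sketch below works on a 2D list standing in for an image. The grid size, the answer format, the exact-match reward, and the function names are assumptions for illustration, not the paper's implementation.

```python
import random


def make_shuffled_patch_task(image, grid=2, seed=0):
    """Split an image into grid x grid patches and shuffle them.

    image: 2D list (H x W) with H and W divisible by grid.
    Returns (shuffled_patches, order), where order[k] is the original
    row-major index of the patch shown at slot k. This is the answer key.
    """
    h, w = len(image), len(image[0])
    ph, pw = h // grid, w // grid
    patches = [
        [row[c * pw:(c + 1) * pw] for row in image[r * ph:(r + 1) * ph]]
        for r in range(grid) for c in range(grid)
    ]
    order = list(range(len(patches)))
    random.Random(seed).shuffle(order)
    shuffled = [patches[i] for i in order]
    return shuffled, order


def reward(predicted_order, true_order):
    """Binary verifiable reward: 1.0 only for an exact match."""
    return 1.0 if list(predicted_order) == list(true_order) else 0.0
```

Because the answer key comes from the task construction itself, any unlabeled image can be turned into a training example with a checkable reward.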

[26] Sketch-to-Layout: Sketch-Guided Multimodal Layout Generation

Riccardo Brioschi, Aleksandr Alekseev, Emanuele Nevali, Berkay Döner, Omar El Malki, Blagoj Mitrevski, Leandro Kieliger, Mark Collier, Andrii Maksai, Jesse Berent, Claudiu Musat, Efi Kokiopoulou

🧩 TL;DR

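This paper proposes a graphic layout generation method constrained by user sketches. A multimodal transformer turns hand-drawn sketches into high-quality layouts, avoiding the complexity of conventional constraint specifications, and outperforms existing constraint-based methods on several public datasets.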


📘 Detailed Summary

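Motivation: In current graphic layout generation research, user constraints usually require complex specifications, which reduces usability. Existing constraint-guided methods can generate layouts that satisfy the requirements, but the complexity of the constraints limits practical use.

Method: A multimodal transformer takes the user-provided sketch and the content assets as input and generates a high-quality layout directly. Because collecting sketch training data from human annotators is very costly, the paper introduces an efficient method for synthetically generating training sketches at scale.

Result: Experiments on three public datasets, PubLayNet, DocLayNet, and SlidesVQA, show that the method outperforms state-of-the-art constraint-based methods. The authors also release about 200k synthetically generated sketches to support future research.

Conclusion: Sketch-to-layout generation offers a more intuitive form of user interaction for graphic layout design. The work shows that sketches are effective constraints and that synthetic data generation can address data scarcity, opening a direction for related research.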


📄 Abstract

Graphic layout generation is a growing research area focusing on generating aesthetically pleasing layouts ranging from poster designs to documents. While recent research has explored ways to incorporate user constraints to guide the layout generation, these constraints often require complex specifications which reduce usability. We introduce an innovative approach exploiting user-provided sketches as intuitive constraints and we demonstrate empirically the effectiveness of this new guidance method, establishing the sketch-to-layout problem as a promising research direction, which is currently under-explored. To tackle the sketch-to-layout problem, we propose a multimodal transformer-based solution using the sketch and the content assets as inputs to produce high quality layouts. Since collecting sketch training data from human annotators to train our model is very costly, we introduce a novel and efficient method to synthetically generate training sketches at scale. We train and evaluate our model on three publicly available datasets: PubLayNet, DocLayNet and SlidesVQA, demonstrating that it outperforms state-of-the-art constraint-based methods, while offering a more intuitive design experience. In order to facilitate future sketch-to-layout research, we release O(200k) synthetically-generated sketches for the public datasets above. The datasets are available at https://github.com/google-deepmind/sketch_to_layout.

[27] DeblurSDI: Blind Image Deblurring Using Self-diffusion

Yanlong Yang, Guanxiong Luo

🧩 TL;DR

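This paper proposes DeblurSDI, a zero-shot, self-supervised framework for blind image deblurring based on self-diffusion. Without any pre-training, it recovers both the sharp image and the blur kernel through an iterative reverse diffusion process.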


📘 Detailed Summary

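Motivation: Traditional blind deblurring methods rely on handcrafted priors, while modern deep learning approaches require extensive pre-training on large external datasets, which limits their adaptability to real-world scenarios. An adaptive solution that needs no pre-training is therefore desirable.

Method: DeblurSDI formulates blind deblurring as an iterative reverse self-diffusion process that starts from pure noise. Two randomly initialized neural networks are optimized continuously to refine the sharp image and the blur kernel, guided by an objective that combines data consistency with a sparsity-promoting L1 norm on the kernel. A noise scheduling mechanism stabilizes the optimization.

Result: Extensive experiments show that DeblurSDI achieves superior performance even in highly degraded scenarios, recovering sharp images and accurate blur kernels, and that it is robust to variations in blur kernel size.

Conclusion: The study shows that a self-diffusion framework can learn an instance-specific prior dynamically, offering a new paradigm for zero-shot blind image restoration and illustrating the potential of self-supervised methods that need no external training data for inverse problems.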


📄 Abstract

Blind image deconvolution is a challenging ill-posed inverse problem, where both the latent sharp image and the blur kernel are unknown. Traditional methods often rely on handcrafted priors, while modern deep learning approaches typically require extensive pre-training on large external datasets, limiting their adaptability to real-world scenarios. In this work, we propose DeblurSDI, a zero-shot, self-supervised framework based on self-diffusion (SDI) that requires no prior training. DeblurSDI formulates blind deconvolution as an iterative reverse self-diffusion process that starts from pure noise and progressively refines the solution. At each step, two randomly-initialized neural networks are optimized continuously to refine the sharp image and the blur kernel. The optimization is guided by an objective function combining data consistency with a sparsity-promoting L1-norm for the kernel. A key innovation is our noise scheduling mechanism, which stabilizes the optimization and provides remarkable robustness to variations in blur kernel size. These allow DeblurSDI to dynamically learn an instance-specific prior tailored to the input image. Extensive experiments demonstrate that DeblurSDI consistently achieves superior performance, recovering sharp images and accurate kernels even in highly degraded scenarios.
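
The objective described above, data consistency plus an L1 penalty on the kernel, can be written down for a 1D signal as a minimal sketch. The squared-error data term, the `sparsity_weight` value, and the 1D "valid" convolution are assumptions for illustration; the abstract does not give the exact form or weighting used in DeblurSDI.

```python
def blur_objective(image, kernel, observed, sparsity_weight=0.01):
    """Data consistency plus L1 kernel sparsity for a 1D 'valid' convolution.

    image: current sharp-signal estimate; kernel: current blur-kernel estimate;
    observed: the blurry measurement, of length len(image) - len(kernel) + 1.
    """
    k = len(kernel)
    n_out = len(image) - k + 1
    if len(observed) != n_out:
        raise ValueError("observed must have length len(image) - len(kernel) + 1")
    # Convolve: out[i] = sum_j kernel[j] * image[i + k - 1 - j].
    blurred = [
        sum(kernel[j] * image[i + k - 1 - j] for j in range(k))
        for i in range(n_out)
    ]
    data_term = sum((b - o) ** 2 for b, o in zip(blurred, observed))
    l1_term = sum(abs(v) for v in kernel)
    return data_term + sparsity_weight * l1_term
```

In a setup like the one the abstract describes, `image` and `kernel` would be the outputs of the two networks at the current step, and this scalar would be minimized with respect to their parameters.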

[28] VessShape: Few-shot 2D blood vessel segmentation by leveraging shape priors from synthetic images

Cesar H. Comin, Wesley N. Galvão

🧩 TL;DR

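This paper proposes VessShape, which generates large-scale synthetic vessel datasets to instill a shape bias in segmentation models, addressing data scarcity and cross-modality generalization in blood vessel segmentation. The approach performs well in both few-shot and zero-shot settings.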


📘 Detailed Summary

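Motivation: Semantic segmentation of blood vessels faces two main challenges: large annotated datasets are scarce, and models generalize poorly across imaging modalities. Convolutional neural networks tend to learn texture features rather than shape features, which limits their performance in new domains.

Method: VessShape procedurally generates large-scale 2D synthetic datasets that contain tubular geometries combined with a wide variety of foreground and background textures, pushing models to learn shape cues rather than textures. The aim is to build a strong shape bias into segmentation models.

Result: A model pre-trained on VessShape achieves strong few-shot segmentation performance on two real-world datasets, needing only four to ten samples for fine-tuning. It also shows notable zero-shot capability, segmenting vessels in unseen domains without target-specific training.

Conclusion: Pre-training with a strong shape bias is an effective strategy for overcoming data scarcity and improving the generalization of vessel segmentation models. Approaches based on geometric priors are a promising answer to domain adaptation in medical image analysis.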


📄 Abstract

Semantic segmentation of blood vessels is an important task in medical image analysis, but its progress is often hindered by the scarcity of large annotated datasets and the poor generalization of models across different imaging modalities. A key aspect is the tendency of Convolutional Neural Networks (CNNs) to learn texture-based features, which limits their performance when applied to new domains with different visual characteristics. We hypothesize that leveraging geometric priors of vessel shapes, such as their tubular and branching nature, can lead to more robust and data-efficient models. To investigate this, we introduce VessShape, a methodology for generating large-scale 2D synthetic datasets designed to instill a shape bias in segmentation models. VessShape images contain procedurally generated tubular geometries combined with a wide variety of foreground and background textures, encouraging models to learn shape cues rather than textures. We demonstrate that a model pre-trained on VessShape images achieves strong few-shot segmentation performance on two real-world datasets from different domains, requiring only four to ten samples for fine-tuning. Furthermore, the model exhibits notable zero-shot capabilities, effectively segmenting vessels in unseen domains without any target-specific training. Our results indicate that pre-training with a strong shape bias can be an effective strategy to overcome data scarcity and improve model generalization in blood vessel segmentation.

[29] From Pixels to Paths: A Multi-Agent Framework for Editable Scientific Illustration

Jianwen Sun, Fanrui Zhang, Yukang Feng, Chuanhao Li, Zizhen Li, Jiaxin Ai, Yifan Chang, Yu Dai, Kaipeng Zhang

🧩 TL;DR

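This paper proposes VisPainter, a multi-agent framework built on the model context protocol that generates scientific illustrations with element-level control, and introduces VisBench, a benchmark that evaluates scientific illustration quality along seven dimensions.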


📘 Detailed Summary

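Motivation: Scientific illustration generation currently has two major limitations. Image generation models output rasterized images without semantic structure, so individual visual components cannot be edited. Code-based generation methods offer element-level control but are cumbersome and unintuitive. Neither approach meets the needs for efficiency, intuitiveness, and iterative modification in scientific creation.

Method: VisPainter is a multi-agent framework with three specialized modules, a Manager, a Designer, and a Toolbox, which cooperate through the model context protocol to produce diagrams compatible with standard vector graphics software. This modular, role-based design represents and manipulates each element explicitly, enabling true element-level control.

Result: The paper introduces VisBench, which evaluates scientific illustration quality with seven metrics covering four aspects: content, layout, visual perception, and interaction cost. Extensive ablation experiments verify the rationality of the architecture, and various vision-language models are ranked fairly and compared in detail.

Conclusion: The multi-agent framework balances element-level control with intuitive manipulation in scientific illustration generation, and the proposed benchmark provides a systematic measure of illustration quality, giving follow-up research a reliable evaluation method.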


📄 Abstract

Scientific illustrations demand both high information density and post-editability. However, current generative models have two major limitations. First, image generation models output rasterized images lacking semantic structure, making it impossible to access, edit, or rearrange independent visual components in the images. Second, code-based generation methods (TikZ or SVG), although providing element-level control, force users into the cumbersome cycle of "writing-compiling-reviewing" and lack the intuitiveness of direct manipulation. Neither approach meets the needs for efficiency, intuitiveness, and iterative modification in scientific creation. To bridge this gap, we introduce VisPainter, a multi-agent framework for scientific illustration built upon the model context protocol. VisPainter orchestrates three specialized modules (a Manager, a Designer, and a Toolbox) to collaboratively produce diagrams compatible with standard vector graphics software. This modular, role-based design allows each element to be explicitly represented and manipulated, enabling true element-level control: any element can be added or modified later. To systematically evaluate the quality of scientific illustrations, we introduce VisBench, a benchmark with seven-dimensional evaluation metrics. It assesses high-information-density scientific illustrations from four aspects: content, layout, visual perception, and interaction cost. We conducted extensive ablation experiments to verify the rationality of our architecture and the reliability of our evaluation methods. Finally, we evaluated various vision-language models, presenting fair and credible model rankings along with detailed comparisons of their respective capabilities. Additionally, we isolated and quantified the impacts of role division, step control, and description on the quality of illustrations.

[30] PETAR: Localized Findings Generation with Mask-Aware Vision-Language Modeling for PET Automated Reporting

Danyal Maqbool, Changhee Lee, Zachary Huemann, Samuel D. Church, Matthew E. Larson, Scott B. Perlman, Tomas A. Romero, Joshua D. Warner, Meghan Lubner, Xin Tie, Jameson Merkow, Junjie Hu, Steve Y. Cho, Tyler J. Bradshaw

🧩 TL;DR

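This study proposes PETAR-4B, a 3D mask-aware vision-language model that integrates PET, CT, and lesion contours for spatially grounded report generation. It substantially improves PET/CT report generation quality and advances 3D medical vision-language understanding.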


📘 Detailed Summary

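Motivation: Medical applications of vision-language models remain largely limited to 2D imaging. 3D PET/CT poses additional challenges: large volumetric data, small and dispersed lesions, and lengthy radiology reports. Vision-language models that can handle 3D medical images are therefore needed.

Method: The authors build a large-scale dataset of more than 11,000 lesion-level descriptions paired with 3D segmentations, extracted with a hybrid rule-based and large language model pipeline. On this dataset they propose PETAR-4B, which integrates PET, CT, and lesion contours to combine global contextual reasoning with fine-grained lesion awareness.

Result: Comprehensive automated and human evaluations show that PETAR substantially improves PET/CT report generation quality and produces clinically coherent, spatially localized findings.

Conclusion: The study demonstrates the effectiveness of 3D mask-aware vision-language models in medical image analysis, offers a new approach to complex 3D medical data, and advances multimodal medical AI with potential clinical value.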


📄 Abstract

Recent advances in vision-language models (VLMs) have enabled impressive multimodal reasoning, yet most medical applications remain limited to 2D imaging. In this work, we extend VLMs to 3D positron emission tomography and computed tomography (PET/CT), a domain characterized by large volumetric data, small and dispersed lesions, and lengthy radiology reports. We introduce a large-scale dataset comprising over 11,000 lesion-level descriptions paired with 3D segmentations from more than 5,000 PET/CT exams, extracted via a hybrid rule-based and large language model (LLM) pipeline. Building upon this dataset, we propose PETAR-4B, a 3D mask-aware vision-language model that integrates PET, CT, and lesion contours for spatially grounded report generation. PETAR bridges global contextual reasoning with fine-grained lesion awareness, producing clinically coherent and localized findings. Comprehensive automated and human evaluations demonstrate that PETAR substantially improves PET/CT report generation quality, advancing 3D medical vision-language understanding.

[31] Referee: Reference-aware Audiovisual Deepfake Detection

Hyemin Boo, Eunsang Lee, Jiyoung Lee

🧩 TL;DR

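This paper proposes Referee, a novel reference-aware audiovisual deepfake detection method that leverages speaker-specific cues from a one-shot example and achieves strong generalization to unseen forgeries through cross-modal identity verification.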


📘 Detailed Summary

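Motivation: Existing audiovisual deepfake detectors struggle to generalize to unseen forgeries, especially deepfakes produced by advanced generative models. Traditional methods rely mainly on detecting spatiotemporal artifacts and lack effective verification of identity consistency.

Method: Referee extracts speaker-specific cues from one-shot reference content, matches and aligns identity-related queries from the reference and target content into cross-modal feature representations, and thereby jointly reasons about audiovisual synchrony and identity consistency.

Result: Extensive experiments on FakeAVCeleb, FaceForensics++, and KoDF show that Referee achieves state-of-the-art performance under cross-dataset and cross-language evaluation protocols, significantly outperforming existing methods.

Conclusion: The results underline the importance of cross-modal identity verification for future deepfake detection. The method offers an effective response to the threat of ever-evolving generative models and demonstrates the potential of the reference-aware detection paradigm.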


📄 Abstract

Since deepfakes generated by advanced generative models have rapidly posed serious threats, existing audiovisual deepfake detection approaches struggle to generalize to unseen forgeries. We propose a novel reference-aware audiovisual deepfake detection method, called Referee. Speaker-specific cues from only one-shot examples are leveraged to detect manipulations beyond spatiotemporal artifacts. By matching and aligning identity-related queries from reference and target content into cross-modal features, Referee jointly reasons about audiovisual synchrony and identity consistency. Extensive experiments on FakeAVCeleb, FaceForensics++, and KoDF demonstrate that Referee achieves state-of-the-art performance on cross-dataset and cross-language evaluation protocols. Experimental results highlight the importance of cross-modal identity verification for future deepfake detection. The code is available at https://github.com/ewha-mmai/referee.

[32] NAUTILUS: A Large Multimodal Model for Underwater Scene Understanding

Wei Xu, Cheng Wang, Dingkang Liang, Zongchuang Zhao, Xingyu Jiang, Peng Zhang, Xiang Bai

🧩 TL;DR

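This work presents NAUTILUS, an underwater large multimodal model. By building the NautData dataset of 1.45 million image-text pairs and introducing a vision feature enhancement module based on physical priors, it significantly improves performance on underwater scene understanding tasks.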


📘 Detailed Summary

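Motivation: Large-scale underwater multi-task instruction-tuning datasets are lacking, and underwater image degradation severely interferes with underwater scene understanding, hindering progress in automated underwater exploration.

Method: The authors construct NautData, a dataset of 1.45 million image-text pairs supporting eight underwater scene understanding tasks, and propose a plug-and-play vision feature enhancement module based on physical priors from underwater imaging models. The module is integrated into the LLaVA-1.5 and Qwen2.5-VL baselines to build NAUTILUS.

Result: Experiments on NautData and public underwater datasets show that the vision feature enhancement module improves both baselines on the majority of supported tasks, establishing the superiority of NAUTILUS in underwater scene understanding.

Conclusion: The study provides the first large-scale multi-task dataset and an effective feature enhancement method for underwater scene understanding, significantly advancing automated underwater exploration, with substantial practical value.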


📄 Abstract

Underwater exploration offers critical insights into our planet and attracts increasing attention for its broader applications in resource exploration, national security, etc. We study the underwater scene understanding methods, which aim to achieve automated underwater exploration. The underwater scene understanding task demands multi-task perceptions from multiple granularities. However, the absence of large-scale underwater multi-task instruction-tuning datasets hinders the progress of this research. To bridge this gap, we construct NautData, a dataset containing 1.45 M image-text pairs supporting eight underwater scene understanding tasks. It enables the development and thorough evaluation of the underwater scene understanding models. Underwater image degradation is a widely recognized challenge that interferes with underwater tasks. To improve the robustness of underwater scene understanding, we introduce physical priors derived from underwater imaging models and propose a plug-and-play vision feature enhancement (VFE) module, which explicitly restores clear underwater information. We integrate this module into renowned baselines LLaVA-1.5 and Qwen2.5-VL and build our underwater LMM, NAUTILUS. Experiments conducted on the NautData and public underwater datasets demonstrate the effectiveness of the VFE module, consistently improving the performance of both baselines on the majority of supported tasks, thus ensuring the superiority of NAUTILUS in the underwater scene understanding area. Data and models are available at https://github.com/H-EmbodVis/NAUTILUS.

[33] ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning

Jiawei Gu, Yunzhuo Hao, Huichen Will Wang, Linjie Li, Michael Qizhe Shieh, Yejin Choi, Ranjay Krishna, Yu Cheng

🧩 TL;DR

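This paper presents ThinkMorph, a unified text-image reasoning model that achieves significant gains on vision-centric tasks by generating complementary multimodal chains of thought. Fine-tuned on 24K high-quality interleaved reasoning traces, the model exhibits emergent multimodal intelligence.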


📘 Detailed Summary

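Motivation: Multimodal reasoning requires iterative coordination between language and vision, yet it remains unclear what constitutes a meaningful interleaved chain of thought. The paper argues that text and image thoughts should act as complementary rather than isomorphic modalities that jointly advance reasoning.

Method: The authors build ThinkMorph, a unified model fine-tuned on 24K high-quality interleaved reasoning traces covering tasks with varying visual engagement. The model learns to generate progressive text-image reasoning steps that concretely manipulate visual content while maintaining coherent verbal logic.

Result: The model achieves an average gain of 34.7% over the base model on vision-centric benchmarks and matches or surpasses larger and proprietary vision-language models on out-of-domain tasks. It exhibits emergent multimodal intelligence, including unseen visual manipulation skills and adaptive switching between reasoning modes.

Conclusion: These findings suggest promising directions for characterizing the emergent capabilities of unified multimodal reasoning models, show that diversified multimodal thoughts enable better test-time scaling, and highlight the key role of complementary modality coordination in multimodal reasoning.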


📄 Abstract

Multimodal reasoning requires iterative coordination between language and vision, yet it remains unclear what constitutes a meaningful interleaved chain of thought. We posit that text and image thoughts should function as complementary, rather than isomorphic, modalities that mutually advance reasoning. Guided by this principle, we build ThinkMorph, a unified model fine-tuned on 24K high-quality interleaved reasoning traces spanning tasks with varying visual engagement. ThinkMorph learns to generate progressive text-image reasoning steps that concretely manipulate visual content while maintaining coherent verbal logic. It delivers large gains on vision-centric benchmarks (averaging 34.7% over the base model) and generalizes to out-of-domain tasks, matching or surpassing larger and proprietary VLMs. Beyond performance, ThinkMorph exhibits emergent multimodal intelligence, including unseen visual manipulation skills, adaptive switching between reasoning modes, and better test-time scaling through diversified multimodal thoughts. These findings suggest promising directions for characterizing the emergent capabilities of unified models for multimodal reasoning.

[34] Image Hashing via Cross-View Code Alignment in the Age of Foundation Models

Ilyass Moummad, Kawtar Zaher, Hervé Goëau, Alexis Joly

🧩 TL;DR

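This paper proposes CroVCA (Cross-View Code Alignment), a simple and unified principle for learning binary hash codes that enforces code consistency across semantically aligned views with a single binary cross-entropy loss, and reaches state-of-the-art results on multiple benchmarks in just 5 training epochs.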


📘 Detailed Summary

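Motivation: Efficient large-scale retrieval needs representations that are both compact and discriminative. Foundation models provide powerful visual and multimodal embeddings, but nearest neighbor search in these high-dimensional spaces is computationally expensive, while existing hashing methods often rely on complex pipelines, multi-term objectives, and specialized designs, with long training times and little unification.

Method: The CroVCA framework enforces code alignment with a single binary cross-entropy loss and uses coding-rate maximization as an anti-collapse regularizer to promote balanced and diverse codes. HashCoder, a lightweight MLP hashing network, enforces balanced codes through a final batch normalization layer and can serve as a probing head on frozen embeddings or adapt encoders efficiently via LoRA fine-tuning.

Result: Across benchmarks, CroVCA reaches state-of-the-art results in just 5 training epochs and performs especially well at 16 bits: unsupervised hashing on COCO completes in under 2 minutes and supervised hashing on ImageNet100 in about 3 minutes on a single GPU, highlighting its training efficiency.

Conclusion: CroVCA demonstrates the effectiveness of a simple, unified principle for hash learning, achieving efficient, adaptable, and broadly applicable binary code learning with a minimal design. It offers a practical solution for large-scale retrieval systems and shows the important role of coding-rate regularization in preventing code collapse.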


📄 Abstract

Efficient large-scale retrieval requires representations that are both compact and discriminative. Foundation models provide powerful visual and multimodal embeddings, but nearest neighbor search in these high-dimensional spaces is computationally expensive. Hashing offers an efficient alternative by enabling fast Hamming distance search with binary codes, yet existing approaches often rely on complex pipelines, multi-term objectives, designs specialized for a single learning paradigm, and long training times. We introduce CroVCA (Cross-View Code Alignment), a simple and unified principle for learning binary codes that remain consistent across semantically aligned views. A single binary cross-entropy loss enforces alignment, while coding-rate maximization serves as an anti-collapse regularizer to promote balanced and diverse codes. To implement this, we design HashCoder, a lightweight MLP hashing network with a final batch normalization layer to enforce balanced codes. HashCoder can be used as a probing head on frozen embeddings or to adapt encoders efficiently via LoRA fine-tuning. Across benchmarks, CroVCA achieves state-of-the-art results in just 5 training epochs. At 16 bits, it performs particularly well: for instance, unsupervised hashing on COCO completes in under 2 minutes and supervised hashing on ImageNet100 in about 3 minutes on a single GPU. These results highlight CroVCA's efficiency, adaptability, and broad applicability.
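
The two ideas named in this abstract, Hamming-distance search over binary codes and a binary cross-entropy term that aligns the codes of two views, can be illustrated with a minimal pure-Python sketch. The abstract does not give the exact loss, so the form below (bits of one view used as fixed targets for the other) is an assumption, and the coding-rate regularizer and the HashCoder network are omitted. All function names are hypothetical.

```python
import math


def binarize(logits):
    """Threshold real-valued logits at zero to obtain a binary code."""
    return [1 if z > 0 else 0 for z in logits]


def hamming(a, b):
    """Count the positions where two equal-length binary codes differ."""
    return sum(x != y for x, y in zip(a, b))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_alignment(logits_view1, logits_view2, eps=1e-12):
    """Mean binary cross-entropy between the bit probabilities of view 1
    and the hard bits of view 2 (assumed form, not the paper's exact loss)."""
    targets = binarize(logits_view2)
    total = 0.0
    for z, t in zip(logits_view1, targets):
        p = sigmoid(z)
        total -= t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
    return total / len(targets)


def nearest(query, database):
    """Index of the database code closest to the query in Hamming distance."""
    return min(range(len(database)), key=lambda i: hamming(query, database[i]))
```

When two views agree on the sign of every logit, the alignment term is small; when they disagree, it grows, which is the behavior an alignment objective of this kind relies on.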

[35] ANCHOR: Integrating Adversarial Training with Hard-mined Supervised Contrastive Learning for Robust Representation Learning

Samarup Bhattacharya, Anubhab Bhattacharya, Abir Chakraborty

🧩 TL;DR

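This paper proposes ANCHOR, a framework that combines adversarial training with supervised contrastive learning and explicit hard-sample mining so that models learn more structured and robust representations and can better defend against adversarial attacks.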


📘 Detailed Summary

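Motivation: Neural networks gain strong pattern-recognition ability by following gradients, but this also leaves them vulnerable to adversarial attacks, in which small perturbations imperceptible to the human eye change the model's prediction, exposing the model's reliance on fragile gradient cues.

Method: ANCHOR uses supervised contrastive learning with explicit hard positive mining so that the embeddings of original images, their augmentations, and their adversarially perturbed versions cluster with same-class images in the embedding space while staying separated from other classes, steering the model toward stable, meaningful patterns rather than fragile gradient cues.

Result: On CIFAR-10, the method achieves impressive clean and robust accuracy under the PGD-20 attack, outperforming standard adversarial training, which indicates that combining adversarial guidance with hard-mined contrastive supervision helps narrow the gap between accuracy and robustness.

Conclusion: The study shows that combining adversarial training with contrastive learning helps models learn more structured and robust representations, offering a new direction for building safer deep learning systems, with explicit hard-sample mining further strengthening robustness.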


📄 Abstract

Neural networks have changed the way machines interpret the world. At their core, they learn by following gradients, adjusting their parameters step by step until they identify the most discriminant patterns in the data. This process gives them their strength, yet it also opens the door to a hidden flaw. The very gradients that help a model learn can also be used to produce small, imperceptible tweaks that cause the model to completely alter its decision. Such tweaks are called adversarial attacks. These attacks exploit this vulnerability by adding tiny, imperceptible changes to images that, while leaving them identical to the human eye, cause the model to make wrong predictions. In this work, we propose Adversarially-trained Contrastive Hard-mining for Optimized Robustness (ANCHOR), a framework that leverages the power of supervised contrastive learning with explicit hard positive mining to enable the model to learn representations for images such that the embeddings for the images, their augmentations, and their perturbed versions cluster together in the embedding space along with those for other images of the same class while being separated from images of other classes. This alignment helps the model focus on stable, meaningful patterns rather than fragile gradient cues. On CIFAR-10, our approach achieves impressive results for both clean and robust accuracy under PGD-20 (epsilon = 0.031), outperforming standard adversarial training methods. Our results indicate that combining adversarial guidance with hard-mined contrastive supervision helps models learn more structured and robust representations, narrowing the gap between accuracy and robustness.
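
As a rough illustration of the supervised contrastive objective and hard positive mining described here, the sketch below computes a standard supervised contrastive loss for one anchor and picks its hardest positive, taken to be the same-class sample with the lowest cosine similarity. This is an assumption about what "hard positive" means; the abstract does not specify ANCHOR's mining rule or how adversarial examples are generated, and all names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def hardest_positive(i, embeddings, labels):
    """Index of the same-class sample least similar to anchor i, or None."""
    positives = [j for j in range(len(labels)) if j != i and labels[j] == labels[i]]
    if not positives:
        return None
    return min(positives, key=lambda j: cosine(embeddings[i], embeddings[j]))


def supcon_loss(i, embeddings, labels, temperature=0.1):
    """Supervised contrastive loss for anchor i, averaged over its positives."""
    others = [j for j in range(len(labels)) if j != i]
    positives = [j for j in others if labels[j] == labels[i]]
    if not positives:
        return 0.0
    logits = {j: cosine(embeddings[i], embeddings[j]) / temperature for j in others}
    # Log-sum-exp over all non-anchor samples, shifted for numerical stability.
    peak = max(logits.values())
    log_denom = peak + math.log(sum(math.exp(v - peak) for v in logits.values()))
    return -sum(logits[j] - log_denom for j in positives) / len(positives)
```

In a training loop of this style, the embeddings list would hold clean, augmented, and adversarially perturbed versions of each image under the same label, so that all three are pulled toward their class cluster.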

[36] Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model

John Won, Kyungmin Lee, Huiwon Jang, Dongyoung Kim, Jinwoo Shin

🧩 TL;DR

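This paper proposes DUST, a world-modeling approach that augments vision-language-action models with a dual-stream diffusion architecture. Through modality decoupling and asynchronous sampling, it addresses the challenge of joint multimodal prediction and achieves significant gains on both simulated and real-world robot tasks.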


📘 Detailed Summary

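Motivation: Current world-modeling methods for augmenting vision-language-action models struggle with modality differences when jointly predicting next-state observations and action sequences: the vision and action modalities differ fundamentally in representation and dynamics, which limits joint multimodal modeling.

Method: The authors propose a dual-stream diffusion transformer architecture that explicitly maintains separate modality streams while supporting cross-modal knowledge sharing; introduce independent per-modality noise perturbations and a decoupled flow-matching loss, avoiding the need for a unified latent space; and develop a joint sampling method that supports test-time scaling, letting action and vision tokens evolve asynchronously.

Result: On the RoboCasa and GR-1 simulated benchmarks, DUST improves over baselines by up to 6%, with test-time scaling adding a further 2-5%; on real-world tasks with the Franka Research 3, success rates improve by 13%; and pre-training on action-free videos from BridgeV2 yields significant transfer gains on RoboCasa.

Conclusion: DUST resolves the modality conflict in multimodal world modeling through modality decoupling and demonstrates the feasibility of bidirectional joint-distribution learning; its test-time scaling strategy offers flexible control at inference; and its potential for large-scale pre-training suggests the framework suits scaling vision-language-action models for practical use.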


📄 Abstract

Recently, augmenting Vision-Language-Action models (VLAs) with world modeling has shown promise in improving robotic policy learning. However, it remains challenging to jointly predict next-state observations and action sequences because of the inherent difference between the two modalities. To address this, we propose DUal-STream diffusion (DUST), a world-model augmented VLA framework that handles the modality conflict and enhances the performance of VLAs across diverse tasks. Specifically, we propose a multimodal diffusion transformer architecture that explicitly maintains separate modality streams while still enabling cross-modal knowledge sharing. In addition, we introduce independent noise perturbations for each modality and a decoupled flow-matching loss. This design enables the model to learn the joint distribution in a bidirectional manner while avoiding the need for a unified latent space. Based on the decoupling of modalities during training, we also introduce a joint sampling method that supports test-time scaling, where action and vision tokens evolve asynchronously at different rates. Through experiments on simulated benchmarks such as RoboCasa and GR-1, DUST achieves up to 6% gains over baseline methods, while our test-time scaling approach provides an additional 2-5% boost. On real-world tasks with the Franka Research 3, DUST improves success rates by 13%, confirming its effectiveness beyond simulation. Furthermore, pre-training on action-free videos from BridgeV2 yields significant transfer gains on RoboCasa, underscoring DUST's potential for large-scale VLA pretraining.
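
The idea of independent noise perturbations per modality with a decoupled flow-matching loss can be sketched as follows. This is a toy illustration, not DUST's implementation: it assumes the common linear-interpolation (rectified-flow style) convention in which the noisy sample is (1 - t) * noise + t * data and the target velocity is data - noise, and it treats `model` as a hypothetical callable returning one velocity prediction per stream. The loss weights are also assumptions.

```python
import random


def interpolate(noise, data, t):
    """Point on the straight path from noise (t=0) to data (t=1)."""
    return [(1 - t) * n + t * d for n, d in zip(noise, data)]


def target_velocity(noise, data):
    """Constant velocity of the straight path from noise to data."""
    return [d - n for n, d in zip(noise, data)]


def mse(pred, target):
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


def decoupled_fm_loss(model, action, vision, rng, w_action=1.0, w_vision=1.0):
    """Flow-matching loss with a separately sampled timestep and noise
    vector for each modality, summed as two independent terms."""
    t_a, t_v = rng.random(), rng.random()  # independent timesteps per stream
    noise_a = [rng.gauss(0.0, 1.0) for _ in action]
    noise_v = [rng.gauss(0.0, 1.0) for _ in vision]
    x_a = interpolate(noise_a, action, t_a)
    x_v = interpolate(noise_v, vision, t_v)
    # The model sees both noisy streams and both timesteps (hypothetical API).
    v_a, v_v = model(x_a, t_a, x_v, t_v)
    loss_a = mse(v_a, target_velocity(noise_a, action))
    loss_v = mse(v_v, target_velocity(noise_v, vision))
    return w_action * loss_a + w_vision * loss_v
```

Because each stream has its own timestep, the model is trained on pairs where one modality is nearly clean and the other is nearly pure noise, which is what allows the two streams to be denoised at different rates at sampling time.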

[37] NegoCollab: A Common Representation Negotiation Approach for Heterogeneous Collaborative Perception

Congzhang Shao, Quan Yuan, Guiyang Luo, Yue Hu, Danni Wang, Yilin Liu, Rui Pan, Bo Chen, Jinglin Li

🧩 TL;DR

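This paper proposes NegoCollab, a heterogeneous collaborative perception method based on a negotiated common representation. It introduces a negotiator during training to derive the common representation from the local representations of each modality's agent, effectively reducing the inherent domain gap with the various local representations.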


📘 Detailed Summary

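Motivation: Heterogeneous collaborative perception faces the major challenge of immutable heterogeneity: participating agents may use different, fixed perception models, which creates domain gaps in the intermediate features shared among agents and degrades collaborative performance. Existing methods designate a specific agent's representation as the common representation, making proper alignment difficult for agents with a significant domain discrepancy from that agent.

Method: NegoCollab introduces a negotiator during training to derive the common representation from the local representations of each modality's agent, and uses a sender-receiver pair to transform features between the local and common representation spaces. In addition to the distribution alignment loss, it adds structural and pragmatic alignment losses to supervise training, so that the knowledge in the common representation is fully distilled into the sender.

Result: The method effectively reduces the inherent domain gap with the various local representations, achieves better feature alignment through a common representation containing multimodal information, and improves heterogeneous collaborative perception performance.

Conclusion: By negotiating a common representation, NegoCollab addresses domain alignment in heterogeneous collaborative perception and offers an effective solution for collaborative perception among multimodal heterogeneous agents, with the advantage of low training cost.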


📄 Abstract

Collaborative perception improves task performance by expanding the perception range through information sharing among agents. Immutable heterogeneity poses a significant challenge in collaborative perception, as participating agents may employ different and fixed perception models. This leads to domain gaps in the intermediate features shared among agents, consequently degrading collaborative performance. Aligning the features of all agents to a common representation can eliminate domain gaps with low training cost. However, in existing methods, the common representation is designated as the representation of a specific agent, making it difficult for agents with significant domain discrepancies from this specific agent to achieve proper alignment. This paper proposes NegoCollab, a heterogeneous collaboration method based on the negotiated common representation. It introduces a negotiator during training to derive the common representation from the local representations of each modality's agent, effectively reducing the inherent domain gap with the various local representations. In NegoCollab, the mutual transformation of features between the local representation space and the common representation space is achieved by a pair of sender and receiver. To better align local representations to the common representation containing multimodal information, we introduce structural alignment loss and pragmatic alignment loss in addition to the distribution alignment loss to supervise the training. This enables the knowledge in the common representation to be fully distilled into the sender.

cs.CL

[38] MemeArena: Automating Context-Aware Unbiased Evaluation of Harmfulness Understanding for Multimodal Large Language Models

Zixin Chen, Hongzhan Lin, Kaixin Li, Ziyang Luo, Yayue Deng, Jing Ma

🧩 TL;DR

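This paper proposes MemeArena, an agent-based arena-style evaluation framework for context-aware, unbiased evaluation of how well multimodal large language models (mLLMs) understand multimodal harmful content. By simulating diverse interpretive contexts and integrating different perspectives to reach consensus, the framework effectively reduces evaluation bias.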


📘 Detailed Summary

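Motivation: Existing evaluation approaches focus mainly on mLLMs' detection accuracy in binary classification tasks, which often fails to reflect the in-depth interpretive nuance of harmfulness across different contexts. The proliferation of memes on social media makes it urgent for multimodal large language models to understand multimodal harmful content effectively.

Method: MemeArena simulates diverse interpretive contexts to construct evaluation tasks that elicit perspective-specific analyses from the models. By integrating different viewpoints and reaching consensus among evaluators, it enables fair and unbiased comparison of the models' ability to interpret multimodal harmful content.

Result: Extensive experiments show that the framework effectively reduces the evaluation bias of judge agents and that its judgments align closely with human preferences, offering valuable insights into reliable and comprehensive mLLM evaluation for multimodal harmfulness understanding.

Conclusion: The study offers a new methodological perspective on reliable and comprehensive model evaluation for multimodal harmfulness understanding and demonstrates the effectiveness of consensus-based arena-style evaluation in reducing bias and improving evaluation accuracy.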


📄 Abstract

The proliferation of memes on social media necessitates the capabilities of multimodal Large Language Models (mLLMs) to effectively understand multimodal harmfulness. Existing evaluation approaches predominantly focus on mLLMs' detection accuracy for binary classification tasks, which often fail to reflect the in-depth interpretive nuance of harmfulness across diverse contexts. In this paper, we propose MemeArena, an agent-based arena-style evaluation framework that provides a context-aware and unbiased assessment for mLLMs' understanding of multimodal harmfulness. Specifically, MemeArena simulates diverse interpretive contexts to formulate evaluation tasks that elicit perspective-specific analyses from mLLMs. By integrating varied viewpoints and reaching consensus among evaluators, it enables fair and unbiased comparisons of mLLMs' abilities to interpret multimodal harmfulness. Extensive experiments demonstrate that our framework effectively reduces the evaluation biases of judge agents, with judgment results closely aligning with human preferences, offering valuable insights into reliable and comprehensive mLLM evaluations in multimodal harmfulness understanding. Our code and data are publicly available at https://github.com/Lbotirx/MemeArena.

[39] Patient-Centered Summarization Framework for AI Clinical Summarization: A Mixed-Methods Design

Maria Lizarazo Jimenez, Ana Gabriela Claros, Kieran Green, David Toro-Tobon, Felipe Larios, Sheena Asthana, Camila Wenczenovicz, Kerly Guevara Maldonado, Luis Vilatuna-Andrango, Cristina Proano-Velez, Satya Sai Sri Bandi, Shubhangi Bagewadi, Megan E. Branda, Misk Al Zahidy, Saturnino Luz, Mirella Lapata, Juan P. Brito, Oscar J. Ponce-Ponte

🧩 TL;DR

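This study proposes Patient-Centered Summaries (PCS) as a new standard for clinical summarization and develops, through a mixed-methods design, a framework to evaluate how well open-source large language models capture patient values. Current models approach human experts in completeness and fluency but still lag in correctness and patient-centeredness.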


📘 Detailed Summary

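Motivation: Clinical summaries generated by current large language models focus mainly on patients' biological information and overlook key information such as their preferences, values, wishes, and concerns, so they cannot support truly patient-centered care. A new standard for AI clinical summarization is therefore needed.

Method: The study uses a mixed-methods design. Semi-structured interviews with patients and clinicians in the United Kingdom first identify the personal and contextual information clinical summaries should contain; annotation guidelines are then developed and eight clinicians create gold-standard PCS for 88 atrial fibrillation consultations; finally, five open-source LLMs generate summaries with zero-shot and few-shot prompting.

Result: Patients emphasized lifestyle routines, social support, recent stressors, and care values, while clinicians sought concise functional, psychosocial, and emotional context. Mistral-8B achieved the best zero-shot ROUGE-L (0.189), and Llama-3.1-8B performed best in the few-shot setting (ROUGE-L 0.206, BERTScore 0.683). Models matched experts in completeness and fluency, but human PCS were better in correctness and patient-centeredness.

Conclusion: The study suggests that current open-source large language models are close to human level at generating patient-centered clinical summaries but still need improvement in accurately capturing patient values. This points to an important direction for more comprehensive AI-assisted clinical documentation and underscores the need to integrate the patient perspective into medical AI systems.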


📄 Abstract

Large Language Models (LLMs) are increasingly demonstrating the potential to reach human-level performance in generating clinical summaries from patient-clinician conversations. However, these summaries often focus on patients' biology rather than their preferences, values, wishes, and concerns. To achieve patient-centered care, we propose a new standard for Artificial Intelligence (AI) clinical summarization tasks: Patient-Centered Summaries (PCS). Our objective was to develop a framework to generate PCS that capture patient values and ensure clinical utility and to assess whether current open-source LLMs can achieve human-level performance in this task. We used a mixed-methods process. Two Patient and Public Involvement groups (10 patients and 8 clinicians) in the United Kingdom participated in semi-structured interviews exploring what personal and contextual information should be included in clinical summaries and how it should be structured for clinical use. Findings informed annotation guidelines used by eight clinicians to create gold-standard PCS from 88 atrial fibrillation consultations. Sixteen consultations were used to refine a prompt aligned with the guidelines. Five open-source LLMs (Llama-3.2-3B, Llama-3.1-8B, Mistral-8B, Gemma-3-4B, and Qwen3-8B) generated summaries for 72 consultations using zero-shot and few-shot prompting, evaluated with ROUGE-L, BERTScore, and qualitative metrics. Patients emphasized lifestyle routines, social support, recent stressors, and care values. Clinicians sought concise functional, psychosocial, and emotional context. The best zero-shot performance was achieved by Mistral-8B (ROUGE-L 0.189) and Llama-3.1-8B (BERTScore 0.673); the best few-shot by Llama-3.1-8B (ROUGE-L 0.206, BERTScore 0.683). Completeness and fluency were similar between experts and models, while correctness and patient-centeredness favored human PCS.
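
For readers unfamiliar with ROUGE-L, one of the automatic metrics reported above, the following minimal sketch computes its F-measure from the longest common subsequence (LCS) of candidate and reference tokens. Lowercasing and whitespace tokenization are simplifying assumptions made here; the abstract does not state which ROUGE implementation or settings the study used, so scores from this sketch are not comparable to the reported numbers.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """ROUGE-L F-measure between two strings (whitespace tokens, lowercased)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)
```

Because the metric rewards only in-order token overlap, a summary can be clinically correct yet score low if it paraphrases the reference, which is one reason the study pairs it with BERTScore and qualitative ratings.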

cs.AI

[40] Cognition Envelopes for Bounded AI Reasoning in Autonomous UAS Operations

Pedro Antonio Alarcón Granadeno, Arturo Miguel Bernal Russell, Sofia Nelson, Demetrius Hernandez, Maureen Petterson, Michael Murphy, Walter J. Scheirer, Jane Cleland-Huang

🧩 TL;DR

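This paper introduces the concept of Cognition Envelopes, which establish reasoning boundaries for cyber-physical systems that rely on foundation models, constraining AI-generated decisions to counter new types of errors such as hallucinations, overgeneralizations, and context misalignments while complementing meta-cognition and traditional safety envelopes.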


📘 Detailed Summary

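Motivation: As cyber-physical systems increasingly rely on foundation models such as LLMs and VLMs to increase autonomy, these models introduce new types of errors such as hallucinations, overgeneralizations, and context misalignments, leading to incorrect and flawed decisions. Effective constraint mechanisms are needed to keep such systems safe.

Method: The paper proposes Cognition Envelopes, which constrain AI-generated decisions by establishing reasoning boundaries. The approach complements meta-cognition and traditional safety envelopes, and provides systematic processes and practical guidelines for defining, validating, and assuring Cognition Envelopes.

Result: The study establishes a theoretical framework for Cognition Envelopes, clarifies their role as reasoning boundaries, and develops corresponding systematic processes to support their deployment and use in real systems, offering a structured solution for handling the new types of errors introduced by foundation models.

Conclusion: Cognition Envelopes provide an important framework for addressing the safety challenges that foundation models introduce into cyber-physical systems. The work stresses the need for systematic processes and practical guidelines and points the way for future AI safety assurance, with particular value for handling model hallucinations and reasoning errors.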


📄 Abstract

Cyber-physical systems increasingly rely on Foundational Models such as Large Language Models (LLMs) and Vision-Language Models (VLMs) to increase autonomy through enhanced perception, inference, and planning. However, these models also introduce new types of errors, such as hallucinations, overgeneralizations, and context misalignments, resulting in incorrect and flawed decisions. To address this, we introduce the concept of Cognition Envelopes, designed to establish reasoning boundaries that constrain AI-generated decisions while complementing the use of meta-cognition and traditional safety envelopes. As with safety envelopes, Cognition Envelopes require practical guidelines and systematic processes for their definition, validation, and assurance.

[41] CombiGraph-Vis: A Curated Multimodal Olympiad Benchmark for Discrete Mathematical Reasoning

Hamed Mahdavi, Pouria Mahdavinia, Alireza Farhadi, Pegah Mohammadipour, Samira Malek, Majid Daliri, Pedram Mohammadipour, Alireza Hashemi, Amir Khasahmadi, Vasant Honavar

🧩 TL;DR

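This study evaluates how well advanced large language models can grade mathematical proofs and proposes a multi-step grading approach based on agentic workflows, which significantly improves agreement with human grades and the accuracy of partial-credit handling.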


📘 Detailed Summary

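Motivation: As advanced large language models make significant progress on Olympiad mathematics problems, this study assesses their ability to grade proofs, including detecting errors, judging severity, and assigning fair scores beyond binary correctness, in order to address the calibration gaps existing models show when assigning partial credit.

Method: The study uses a corpus of 90 Gemini 2.5 Pro-generated solutions graded on a 1-4 scale with detailed error annotations, together with MathArena solution sets for IMO/USAMO 2025 scored on a 0-7 scale. It proposes agentic workflows that extract and analyze reference solutions to automatically derive problem-specific rubrics for a multi-step grading process.

Result: Experiments show that models can reliably flag incorrect solutions, including subtly incorrect ones, but exhibit calibration gaps when assigning partial credit. The proposed agentic workflows achieve higher agreement with human grades on both the annotated corpus and MathArena, and handle partial credit more consistently across metrics.

Conclusion: The study reveals both the potential and the limitations of advanced large language models for proof grading. The proposed agentic workflows offer an effective way to address partial-credit calibration and lay groundwork for future automated mathematics grading systems; all code, data, and prompts/logs are released to support follow-up research.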


📄 Abstract

State-of-the-art (SOTA) LLMs have progressed from struggling on proof-based Olympiad problems to solving most of the IMO 2025 problems, with leading systems reportedly handling 5 of 6 problems. Given this progress, we assess how well these models can grade proofs: detecting errors, judging their severity, and assigning fair scores beyond binary correctness. We study proof-analysis capabilities using a corpus of 90 Gemini 2.5 Pro-generated solutions that we grade on a 1-4 scale with detailed error annotations, and on MathArena solution sets for IMO/USAMO 2025 scored on a 0-7 scale. Our analysis shows that models can reliably flag incorrect (including subtly incorrect) solutions but exhibit calibration gaps in how partial credit is assigned. To address this, we introduce agentic workflows that extract and analyze reference solutions and automatically derive problem-specific rubrics for a multi-step grading process. We instantiate and compare different design choices for the grading workflows, and evaluate their trade-offs. Across our annotated corpus and MathArena, our proposed workflows achieve higher agreement with human grades and more consistent handling of partial credit across metrics. We release all code, data, and prompts/logs to facilitate future research.

[42] GUI-Rise: Structured Reasoning and History Summarization for GUI Navigation

Tao Liu, Chongyu Wang, Rongjie Li, Yingchen Yu, Xuming He, Bai Song

🧩 TL;DR

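This paper presents GUI-Rise, a reasoning-enhanced GUI navigation framework that systematically integrates structured reasoning, action prediction, and history summarization to address the limitations of multimodal large language models in cross-domain generalization and history utilization, achieving state-of-the-art performance on standard benchmarks.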


📘 Detailed Summary

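Motivation: GUI navigation agents built on multimodal large language models currently suffer from weak cross-domain generalization and inefficient use of history, so a framework that systematically integrates reasoning and history management is needed to improve navigation performance.

Method: The authors propose a reasoning-enhanced framework whose structured reasoning component generates chain-of-thought analyses combining progress estimation and decision reasoning. The GUI-Rise agent is trained through supervised fine-tuning on pseudo-labeled trajectories and GRPO-based reinforcement learning, using specialized rewards that include a history-aware objective.

Result: On standard benchmarks, the method achieves state-of-the-art performance under identical training data conditions, with particularly strong results in cross-domain scenarios, validating the framework's ability to maintain robust reasoning and generalization across diverse GUI navigation tasks.

Conclusion: The study demonstrates the importance of systematically integrating structured reasoning and history management for GUI navigation agents, and provides an effective framework and training method for multimodal agents in complex interactive environments.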


📄 Abstract

While Multimodal Large Language Models (MLLMs) have advanced GUI navigation agents, current approaches face limitations in cross-domain generalization and effective history utilization. We present a reasoning-enhanced framework that systematically integrates structured reasoning, action prediction, and history summarization. The structured reasoning component generates coherent Chain-of-Thought analyses combining progress estimation and decision reasoning, which inform both immediate action predictions and compact history summaries for future steps. Based on this framework, we train a GUI agent, GUI-Rise, through supervised fine-tuning on pseudo-labeled trajectories and reinforcement learning with Group Relative Policy Optimization (GRPO). This framework employs specialized rewards, including a history-aware objective, directly linking summary quality to subsequent action performance. Comprehensive evaluations on standard benchmarks demonstrate state-of-the-art results under identical training data conditions, with particularly strong performance in out-of-domain scenarios. These findings validate our framework's ability to maintain robust reasoning and generalization across diverse GUI navigation tasks. Code is available at https://leon022.github.io/GUI-Rise.
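
GRPO, the reinforcement learning method named above, scores each sampled response relative to the other responses drawn for the same prompt instead of using a learned value function. A minimal sketch of that group-relative advantage computation follows. The use of the population standard deviation and the `eps` stabilizer are choices made here for illustration, and the example rewards are placeholders, not GUI-Rise's actual reward terms.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its group.

    `rewards` holds one scalar reward per response sampled for the same prompt.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Example: four sampled action sequences for one GUI step, two of them rewarded.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat the group average receive positive advantages and the rest negative ones; when every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.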

[43] ToolScope: An Agentic Framework for Vision-Guided and Long-Horizon Tool Use

Mengjie Deng, Guanting Dong, Zhicheng Dou

🧩 TL;DR

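This paper presents ToolScope, a framework that unifies global planning with local multimodal perception to address the challenge of letting multimodal large language models use external tools flexibly and efficiently during reasoning, achieving significant gains on multiple VQA benchmarks.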


📘 Detailed Summary

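Motivation: Although large language models have shown remarkable ability to reason collaboratively by autonomously integrating external tools, the inherent complexity and diversity of multimodal information means that enabling multimodal large language models to use external tools flexibly and efficiently during reasoning remains an underexplored challenge.

Method: ToolScope has three core components: the Global Navigator acts as a "telescope" that provides high-level strategic guidance; the Agentic Executor iteratively augments the MLLM's local perception by integrating the external tools Search, Code, and Perceive; and the Response Synthesizer consolidates the reasoning process into a coherent, user-friendly output. The dedicated Perceive tool is designed to mitigate visual context degradation in long-horizon VQA tasks.

Result: On four VQA benchmarks across different domains (VQA 2.0, ScienceQA, MAT-Search, and MathVista), ToolScope shows strong generalization, with an average performance improvement of up to +6.69% across all datasets.

Conclusion: The study shows that a framework unifying global planning with local multimodal perception can effectively improve MLLM tool use, offering a new design approach for multimodal reasoning systems, with particular value for complex long-horizon visual question answering.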


📄 Abstract

Recently, large language models (LLMs) have demonstrated remarkable problem-solving capabilities by autonomously integrating with external tools for collaborative reasoning. However, due to the inherently complex and diverse nature of multimodal information, enabling multimodal large language models (MLLMs) to flexibly and efficiently utilize external tools during reasoning remains an underexplored challenge. In this work, we introduce ToolScope, an agentic framework designed to unify global planning with local multimodal perception, adopting a specialized Perceive tool to mitigate visual context degradation in long-horizon VQA tasks. ToolScope comprises three primary components: the Global Navigator, the Agentic Executor, and the Response Synthesizer. The Global Navigator functions as a "telescope", offering high-level strategic guidance. The Agentic Executor operates iteratively to augment the MLLM with local perception through the integration of external tools: Search, Code, and Perceive. Finally, the Response Synthesizer consolidates and organizes the reasoning process into a coherent, user-friendly output. We evaluate ToolScope on four VQA benchmarks across diverse domains, including VQA 2.0, ScienceQA, MAT-Search and MathVista. It demonstrates strong generalization capabilities, achieving an average performance improvement of up to +6.69% across all datasets.

[44] GeoFM: Enhancing Geometric Reasoning of MLLMs via Synthetic Data Generation through Formal Language

Yuhao Zhang, Dingxin Hu, Tinghao Yu, Hao Liu, Yiting Liu

🧩 TL;DR

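This paper proposes GeoFM, which generates high-quality geometric data using formal languages and a symbolic engine, significantly improving the geometric reasoning of multimodal large language models and surpassing GPT-4o and a leading open-source model on the MathVista and GeoQA benchmarks.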


📘 Detailed Summary

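Motivation: Multimodal large language models are held back in mathematical geometric reasoning by the scarcity of high-quality geometric data, and existing methods for synthesizing geometric data suffer from limited diversity, noise, and generated images that deviate significantly from authentic geometric diagrams.

Method: GeoFM uses formal languages to explore combinations of conditions within metric space and ensures the correctness of generated geometric problems through a symbolic engine, producing high-fidelity geometric data that differs from the original problems.

Result: Experiments show that the model trained on GeoFM's synthetic data surpasses GPT-4o by 18.7% on geometry problem-solving tasks in MathVista and by 16.5% on GeoQA, and exceeds a leading open-source model by 5.7% on MathVista and 2.7% on GeoQA.

Conclusion: Through formal languages and symbolic verification, GeoFM addresses the diversity and fidelity problems of geometric data synthesis, provides reliable data support for improving the geometric reasoning of multimodal large language models, and shows the strong potential of synthetic data for specialized-domain tasks.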


📄 Abstract

Multi-modal Large Language Models (MLLMs) have gained significant attention in both academia and industry for their capabilities in handling multi-modal tasks. However, these models face challenges in mathematical geometric reasoning due to the scarcity of high-quality geometric data. To address this issue, synthetic geometric data has become an essential strategy. Current methods for generating synthetic geometric data involve rephrasing or expanding existing problems and utilizing predefined rules and templates to create geometric images and problems. However, these approaches often produce data that lacks diversity or is prone to noise. Additionally, the geometric images synthesized by existing methods tend to exhibit limited variation and deviate significantly from authentic geometric diagrams. To overcome these limitations, we propose GeoFM, a novel method for synthesizing geometric data. GeoFM uses formal languages to explore combinations of conditions within metric space, generating high-fidelity geometric problems that differ from the originals while ensuring correctness through a symbolic engine. Experimental results show that our synthetic data significantly outperforms existing methods. The model trained with our data surpasses the proprietary GPT-4o model by 18.7% on geometry problem-solving tasks in MathVista and by 16.5% on GeoQA. Additionally, it exceeds the performance of a leading open-source model by 5.7% on MathVista and by 2.7% on GeoQA.

[45] Visual Backdoor Attacks on MLLM Embodied Decision Making via Contrastive Trigger Learning

Qiusi Zhan, Hyeonjeong Ha, Rui Yang, Sirui Xu, Hanyang Chen, Liang-Yan Gui, Yu-Xiong Wang, Huan Zhang, Heng Ji, Daniel Kang

🧩 TL;DR

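BEAT is the first visual backdoor attack framework targeting embodied agents driven by multimodal large language models. It uses objects in the environment as triggers and, once a trigger appears, persistently executes an attacker-specified multi-step policy, reaching attack success rates of up to 80% while preserving benign task performance.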


📘 Detailed Summary

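Motivation: Embodied agents driven by multimodal large language models can perceive, reason, and plan task-oriented actions directly from visual inputs, but they also open a new attack surface: visual backdoor attacks, in which the agent behaves normally until a visual trigger appears in the scene and then persistently executes an attacker-specified multi-step policy.

Method: BEAT constructs a training set spanning diverse scenes, tasks, and trigger placements to expose the agent to trigger variability, and uses a two-stage training scheme: supervised fine-tuning first, followed by the novel Contrastive Trigger Learning, which formulates trigger discrimination as preference learning between trigger-present and trigger-free inputs and explicitly sharpens decision boundaries to ensure precise backdoor activation.

Result: Across various embodied agent benchmarks and multimodal large language models, BEAT achieves attack success rates of up to 80% while maintaining strong benign task performance, and generalizes reliably to out-of-distribution trigger placements. Compared with naive supervised fine-tuning, Contrastive Trigger Learning improves backdoor activation accuracy by up to 39% under limited backdoor data.

Conclusion: These findings expose a critical yet unexplored security risk in embodied agents driven by multimodal large language models and underscore the need for robust defenses before real-world deployment to keep such systems secure.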


📄 Abstract

Multimodal large language models (MLLMs) have advanced embodied agents by enabling direct perception, reasoning, and planning of task-oriented actions from visual inputs. However, such vision-driven embodied agents open a new attack surface: visual backdoor attacks, where the agent behaves normally until a visual trigger appears in the scene, then persistently executes an attacker-specified multi-step policy. We introduce BEAT, the first framework to inject such visual backdoors into MLLM-based embodied agents using objects in the environments as triggers. Unlike textual triggers, object triggers exhibit wide variation across viewpoints and lighting, making them difficult to implant reliably. BEAT addresses this challenge by (1) constructing a training set that spans diverse scenes, tasks, and trigger placements to expose agents to trigger variability, and (2) introducing a two-stage training scheme that first applies supervised fine-tuning (SFT) and then our novel Contrastive Trigger Learning (CTL). CTL formulates trigger discrimination as preference learning between trigger-present and trigger-free inputs, explicitly sharpening the decision boundaries to ensure precise backdoor activation. Across various embodied agent benchmarks and MLLMs, BEAT achieves attack success rates up to 80%, while maintaining strong benign task performance, and generalizes reliably to out-of-distribution trigger placements. Notably, compared to naive SFT, CTL boosts backdoor activation accuracy up to 39% under limited backdoor data. These findings expose a critical yet unexplored security risk in MLLM-based embodied agents, underscoring the need for robust defenses before real-world deployment.