cs.CV
[1] SafeR-CLIP: Mitigating NSFW Content in Vision-Language Models While Preserving Pre-Trained Knowledge
Adeel Yousaf, Joseph Fioresi, James Beetham, Amrit Singh Bedi, Mubarak Shah
🧩 TL;DR
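This paper proposes SaFeR-CLIP, a proximity-aware fine-tuning framework that redirects unsafe concepts to their semantically closest safe alternatives to minimize representational change, reconciling the trade-off between safety and generalization in vision-language models.
📘 Detailed Summary
Motivation: Fine-tuning vision-language models such as CLIP for safety often causes a significant drop in generalization performance. This trade-off stems from rigid alignment strategies that force unsafe concepts toward a single, predefined safe target, disrupting the semantic structure the model has learned.
Method: The authors propose a proximity-aware approach that redirects unsafe concepts to their semantically closest safe alternatives to minimize representational change. They develop the SaFeR-CLIP fine-tuning framework to apply this principle of minimal intervention, and contribute the NSFW-Caps benchmark for evaluating safety under distributional shift.
Result: SaFeR-CLIP reconciles safety and performance, recovering up to 8.0% in zero-shot accuracy over prior methods while maintaining robust safety.
Conclusion: Respecting the geometry of pretrained representations is key to achieving safety without sacrificing performance. The proximity-aware approach offers a more effective path to safe fine-tuning of vision-language models, and the principle of minimal intervention could apply to other settings that must balance safety and performance.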
📄 Abstract
Improving the safety of vision-language models like CLIP via fine-tuning often comes at a steep price, causing significant drops in their generalization performance. We find this trade-off stems from rigid alignment strategies that force unsafe concepts toward single, predefined safe targets, disrupting the model's learned semantic structure. To address this, we propose a proximity-aware approach: redirecting unsafe concepts to their semantically closest safe alternatives to minimize representational change. We introduce SaFeR-CLIP, a fine-tuning framework that applies this principle of minimal intervention. SaFeR-CLIP successfully reconciles safety and performance, recovering up to 8.0% in zero-shot accuracy over prior methods while maintaining robust safety. To support more rigorous evaluation, we also contribute NSFW-Caps, a new benchmark of 1,000 highly-aligned pairs for testing safety under distributional shift. Our work shows that respecting the geometry of pretrained representations is key to achieving safety without sacrificing performance.
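The proximity-aware idea in the abstract (redirect each unsafe concept to its semantically closest safe alternative) can be illustrated with a nearest-neighbor lookup in embedding space. The following is a minimal sketch, not the authors' implementation: the function names, the use of cosine similarity as the closeness measure, and the toy 2-D embeddings are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest_safe_target(unsafe_vec, safe_vecs):
    """Return the index of the safe embedding most similar to unsafe_vec.

    Hypothetical helper: a real system would use text or image embeddings
    from the model being fine-tuned rather than hand-written vectors.
    """
    sims = [cosine(unsafe_vec, s) for s in safe_vecs]
    return max(range(len(safe_vecs)), key=sims.__getitem__)

# Toy 2-D embeddings, made up for illustration only.
safe = [[1.0, 0.0], [0.0, 1.0]]
unsafe = [0.9, 0.3]
target_index = nearest_safe_target(unsafe, safe)  # picks the closer safe vector
```

Choosing the closest safe target, rather than one fixed target for every unsafe concept, is what keeps the required shift in representation small.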
[2] Towards Unified Vision Language Models for Forest Ecological Analysis in Earth Observation
Xizhe Xue, Xiao Xiang Zhu
🧩 TL;DR
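This paper presents REO-Instruct, the first unified benchmark for Earth Observation designed for both descriptive and regression tasks, filling the gap between multimodal perception and measurable biophysical variables.
📘 Detailed Summary
Motivation: The potential of vision-language models for scientific regression remains largely unexplored. Existing Earth Observation datasets focus on semantic understanding tasks such as captioning and classification, and lack benchmarks that align multimodal perception with measurable biophysical variables.
Method: The authors build the REO-Instruct dataset, which establishes a cognitively interpretable logic chain in forest ecological scenarios, integrates co-registered Sentinel-2 and ALOS-2 imagery, and provides structured textual annotations generated and validated through a hybrid human-AI pipeline.
Result: A comprehensive evaluation of generic vision-language models shows that current models struggle with numeric reasoning, highlighting a key challenge for scientific vision-language models.
Conclusion: REO-Instruct offers a standardized foundation for developing and assessing next-generation geospatial models capable of both description and scientific inference, and exposes the limitations of current models on scientific regression tasks.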
📄 Abstract
Recent progress in vision language models (VLMs) has enabled remarkable perception and reasoning capabilities, yet their potential for scientific regression in Earth Observation (EO) remains largely unexplored. Existing EO datasets mainly emphasize semantic understanding tasks such as captioning or classification, lacking benchmarks that align multimodal perception with measurable biophysical variables. To fill this gap, we present REO-Instruct, the first unified benchmark designed for both descriptive and regression tasks in EO. REO-Instruct establishes a cognitively interpretable logic chain in a forest ecological scenario (human activity, land-cover classification, ecological patch counting, above-ground biomass (AGB) regression), bridging qualitative understanding and quantitative prediction. The dataset integrates co-registered Sentinel-2 and ALOS-2 imagery with structured textual annotations generated and validated through a hybrid human-AI pipeline. Comprehensive evaluation protocols and baseline results across generic VLMs reveal that current models struggle with numeric reasoning, highlighting an essential challenge for scientific VLMs. REO-Instruct offers a standardized foundation for developing and assessing next-generation geospatial models capable of both description and scientific inference. The project page is publicly available at https://github.com/zhu-xlab/REO-Instruct.
[3] BOP-ASK: Object-Interaction Reasoning for Vision-Language Models
Vineet Bhat, Sungsu Kim, Valts Blukis, Greg Heinrich, Prashanth Krishnamurthy, Ramesh Karri, Stan Birchfield, Farshad Khorrami, Jonathan Tremblay
🧩 TL;DR
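This paper presents BOP-ASK, a large-scale dataset and benchmark for object-interaction reasoning. It uses 6D object pose data to generate fine-grained spatial-relationship annotations, and models trained on it improve on tasks such as precise 3D localization, physical compatibility, and multi-step spatial planning.
📘 Detailed Summary
Motivation: Vision-language models perform well on spatial reasoning benchmarks, but these evaluations mask critical weaknesses in understanding object interactions. Current benchmarks mainly test high-level spatial relationships and ignore the fine-grained spatial understanding that real-world applications need, including precise 3D localization, physical compatibility between objects, object affordances, and multi-step spatial planning.
Method: The authors build a data generation pipeline on 6D object poses from the BOP datasets, deriving fine-grained annotations such as grasp poses, referred object poses, path planning trajectories, relative spatial and depth relationships, and object-to-object relationships. The resulting dataset contains over 150k images and 33M question-answer pairs spanning six tasks, four of them novel.
Result: Models trained on BOP-ASK outperform baselines and exhibit emergent capabilities such as precise object and grasp pose estimation, trajectory planning, and fine-grained object-centric spatial reasoning in cluttered environments. The authors also contribute the BOP-ASK-core test benchmark and the BOP-ASK-lab out-of-distribution benchmark for evaluation.
Conclusion: The work fills an evaluation gap in fine-grained object-interaction reasoning for vision-language models and shows that 6D-pose-based data generation can improve spatial understanding. It provides training resources and an evaluation framework for real-world applications, and the authors plan to release the datasets and generation pipeline publicly.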
📄 Abstract
Vision Language Models (VLMs) have achieved impressive performance on spatial reasoning benchmarks, yet these evaluations mask critical weaknesses in understanding object interactions. Current benchmarks test high-level relationships ('left of', 'behind', etc.) but ignore the fine-grained spatial understanding needed for real-world applications: precise 3D localization, physical compatibility between objects, object affordances, and multi-step spatial planning. In this work, we present BOP-ASK, a novel large-scale dataset for object-interaction reasoning for both training and benchmarking. Our data generation pipeline leverages 6D object poses from the Benchmark for Object Pose Estimation (BOP) datasets, from which we derive fine-grained annotations such as grasp poses, referred object poses, path planning trajectories, relative spatial and depth relationships, and object-to-object relationships. BOP-ASK comprises over 150k images and 33M question-answer pairs spanning six tasks (four novel), providing a rich resource for training and evaluating VLMs. We evaluate proprietary and open-source VLMs, and conduct human evaluations on BOP-ASK-core, a contributed test benchmark. We also release BOP-ASK-lab, an out-of-distribution benchmark with images not sourced from BOP, enabling testing of generalization. Our experiments demonstrate that models trained on BOP-ASK outperform baselines and exhibit emergent capabilities such as precise object and grasp pose estimation, trajectory planning, and fine-grained object-centric spatial reasoning in cluttered environments. We will publicly release our datasets and dataset generation pipeline.
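To make the idea of deriving relative spatial and depth relationships from 6D poses concrete, here is a minimal sketch that labels one object relative to another using only the translation part of each pose. It is not the authors' pipeline. It assumes camera-frame coordinates with x pointing right and z pointing away from the camera (a common convention, but an assumption here), and the function name and relation labels are hypothetical.

```python
def spatial_relations(t_a, t_b):
    """Describe object A relative to object B.

    t_a, t_b: (x, y, z) translations in the camera frame, assuming
    x grows to the right and z grows with distance from the camera.
    Ties fall into the second label of each pair in this simple sketch.
    """
    horizontal = "left of" if t_a[0] < t_b[0] else "right of"
    depth = "behind" if t_a[2] > t_b[2] else "in front of"
    return [horizontal, depth]

# Made-up positions in metres, for illustration only.
mug = (-0.10, 0.00, 0.80)
bowl = (0.20, 0.00, 0.50)
relations = spatial_relations(mug, bowl)  # mug relative to bowl
```

A full pipeline would also use object rotations and extents, which this sketch ignores.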
[4] Align & Invert: Solving Inverse Problems with Diffusion and Flow-based Models via Representational Alignment
Loukas Sfountouris, Giannis Daras, Paris Giampouras
🧩 TL;DR
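This paper proposes a representation alignment regularizer that aligns the internal representations of a diffusion model with those of a pretrained self-supervised encoder at inference time, improving reconstruction quality and efficiency when solving inverse problems. The method consistently improves reconstruction across inverse-problem tasks while reducing the number of required discretization steps.
📘 Detailed Summary
Motivation: Existing inverse-problem solvers that use pretrained generative models as priors lack an effective mechanism for steering the model's internal representations toward target features. Representation alignment has been shown to improve convergence and sample quality when training generative models, but how to use it to guide reconstruction during inverse-problem inference remains underexplored.
Method: The authors propose representation alignment (REPA) regularization, which at inference time enforces alignment between the internal representations of a diffusion or flow-based generative model and those of a pretrained self-supervised visual encoder. Encoders such as DINOv2 supply approximate target features. Theoretical analysis relates the REPA regularization to a divergence measure in the DINOv2 embedding space and shows how REPA updates steer the model's internal representations toward those of the clean image.
Result: Extensive experiments on super-resolution, box inpainting, Gaussian deblurring, and motion deblurring show that the method consistently improves reconstruction quality. It also provides substantial efficiency gains by reducing the number of required discretization steps without compromising the performance of the underlying solver. Both reconstruction fidelity and perceptual realism are substantially enhanced.
Conclusion: Representation alignment provides a strong inductive bias for inverse problems, improving perceptual fidelity by steering internal representations toward target features. The approach is general and can be integrated into multiple state-of-the-art inverse-problem solvers, opening a route to guiding generative-model inference with pretrained representations.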
📄 Abstract
Enforcing alignment between the internal representations of diffusion or flow-based generative models and those of pretrained self-supervised encoders has recently been shown to provide a powerful inductive bias, improving both convergence and sample quality. In this work, we extend this idea to inverse problems, where pretrained generative models are employed as priors. We propose applying representation alignment (REPA) between diffusion or flow-based models and a pretrained self-supervised visual encoder, such as DINOv2, to guide the reconstruction process at inference time. Although ground-truth signals are unavailable in inverse problems, we show that aligning model representations with approximate target features can substantially enhance reconstruction fidelity and perceptual realism. We provide theoretical results showing (a) the relation between the REPA regularization and a divergence measure in the DINOv2 embedding space, and (b) how REPA updates steer the model's internal representations toward those of the clean image. These results offer insights into the role of REPA in improving perceptual fidelity. Finally, we demonstrate the generality of our approach by integrating it into multiple state-of-the-art inverse problem solvers. Extensive experiments on super-resolution, box inpainting, Gaussian deblurring, and motion deblurring confirm that our method consistently improves reconstruction quality across tasks, while also providing substantial efficiency gains by reducing the number of required discretization steps without compromising the performance of the underlying solver.
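As a rough illustration of what an alignment term between model features and encoder features can look like, the sketch below computes the negative mean patch-wise cosine similarity between two sets of feature vectors. This is one common way to express representation alignment and is an assumption here; the abstract does not spell out the exact regularizer, and the function names and toy features are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def alignment_penalty(model_feats, target_feats):
    """Negative mean cosine similarity over corresponding patches.

    model_feats: per-patch features from the generative model, assumed to be
    already projected into the encoder's embedding space.
    target_feats: per-patch features from the pretrained encoder, computed on
    an approximate target since the clean image is unavailable.
    Lower values mean better alignment; -1.0 is perfect alignment.
    """
    sims = [cosine(m, t) for m, t in zip(model_feats, target_feats)]
    return -sum(sims) / len(sims)

# Toy features for two patches, made up for illustration only.
aligned = alignment_penalty([[1.0, 0.0], [0.0, 2.0]], [[2.0, 0.0], [0.0, 1.0]])
misaligned = alignment_penalty([[1.0, 0.0], [0.0, 2.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In an inference-time setting, a term like this would be added to the solver's objective and its gradient used to nudge the intermediate sample.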
[5] OmniGround: A Comprehensive Spatio-Temporal Grounding Benchmark for Real-World Complex Scenarios
Hong Gao, Jingyu Wu, Xiangkai Xu, Kangni Xie, Yunchen Zhang, Bin Zhong, Xurui Gao, Min-Ling Zhang
🧩 TL;DR
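This paper introduces the OmniGround benchmark and the PG-TAF framework to address category bias and weak handling of complex queries in spatio-temporal video grounding. PG-TAF uses a two-stage decomposition and achieves notable gains across multiple benchmarks.
📘 Detailed Summary
Motivation: In spatio-temporal video grounding, a significant gap remains between current multimodal large language models and real-world demands. The authors attribute this to the limited scope of existing benchmarks, which leads models to exhibit category bias, oversimplified reasoning, and poor linguistic robustness.
Method: The authors introduce OmniGround, a comprehensive benchmark with 3,475 videos spanning 81 categories, annotated with a Forward-Backward-Refinement pipeline that combines multi-directional tracking with intelligent error correction. They also develop PG-TAF, a training-free two-stage framework that decomposes STVG into high-level temporal grounding and fine-grained spatio-temporal propagation.
Result: Performance drops by 10.4% on average in complex real-world scenes, particularly with small or occluded objects and intricate spatial relations. PG-TAF improves m_tIoU and m_vIoU on OmniGround by 25.6% and 35.6% respectively, with consistent gains across four benchmarks.
Conclusion: The study exposes the limitations of existing STVG models in complex real-world scenes. The DeepSTG evaluation framework and the PG-TAF method offer effective routes to improving spatio-temporal video grounding, and underline the importance of dataset quality and task decomposition.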
📄 Abstract
Spatio-Temporal Video Grounding (STVG) aims to localize target objects in videos based on natural language descriptions. Despite recent advances in Multimodal Large Language Models, a significant gap remains between current models and real-world demands involving diverse objects and complex queries. We attribute this to limited benchmark scope, causing models to exhibit category bias, oversimplified reasoning, and poor linguistic robustness. To address these limitations, we introduce OmniGround, a comprehensive benchmark with 3,475 videos spanning 81 categories and complex real-world queries. We propose the Forward-Backward-Refinement annotation pipeline that combines multi-directional tracking with intelligent error correction for high-quality labels. We further introduce DeepSTG, a systematic evaluation framework quantifying dataset quality across four complementary dimensions beyond superficial statistics. Evaluations reveal an average performance drop of 10.4% on complex real-world scenes, particularly with small/occluded objects and intricate spatial relations. Motivated by these findings, we propose PG-TAF, a training-free two-stage framework decomposing STVG into high-level temporal grounding and fine-grained spatio-temporal propagation. Experiments demonstrate PG-TAF achieves 25.6% and 35.6% improvements in m_tIoU and m_vIoU on OmniGround with consistent gains across four benchmarks.
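The abstract reports m_tIoU without defining it. In the STVG literature this usually denotes the temporal intersection-over-union between predicted and ground-truth segments, averaged over samples; the sketch below assumes that reading. The function names and example intervals are hypothetical.

```python
def temporal_iou(pred, gt):
    """IoU of two time intervals given as (start, end) in seconds."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0

def mean_temporal_iou(preds, gts):
    """Average temporal IoU over paired predictions and ground truths."""
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)

# Made-up segments for illustration: one partial overlap, one miss.
predictions = [(0.0, 10.0), (20.0, 25.0)]
ground_truth = [(5.0, 15.0), (30.0, 40.0)]
m_tiou = mean_temporal_iou(predictions, ground_truth)
```

m_vIoU is typically the spatial counterpart, averaging box IoU over the frames of the temporal union, and is not covered by this sketch.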
[6] R-AVST: Empowering Video-LLMs with Fine-Grained Spatio-Temporal Reasoning in Complex Audio-Visual Scenarios
Lu Zhu, Tiantian Geng, Yangye Chen, Teng Wang, Ping Lu, Feng Zheng
🧩 TL;DR
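This work introduces the R-AVST dataset and the AVST-Zero model. The former is presented as the first dataset designed for real-world audio-visual spatio-temporal reasoning; the latter is a reinforcement-learning-based model that directly optimizes behavior through multi-dimensional rewards, without intermediate supervision.
📘 Detailed Summary
Motivation: Multimodal large language models have advanced rapidly on video understanding, but current research focuses on simple video scenarios that do not reflect the complexity and diversity of audio-visual events in real-world videos. This gap limits model performance on complex audio-visual spatio-temporal reasoning.
Method: The authors first build the R-AVST dataset with a pipeline of LLM-based key object extraction, automatic spatial annotation, and manual quality inspection, yielding over 5K untrimmed videos with 27K objects across 100 types of audio-visual events. They define three core tasks and generate more than 8K high-quality, evenly distributed question-answer pairs for benchmarking. On this basis they propose AVST-Zero, a reinforcement-learning-based model that directly optimizes behavior through carefully designed multi-dimensional rewards and avoids intermediate supervision.
Result: Experiments validate the effectiveness of R-AVST in advancing audio-visual spatio-temporal reasoning, and AVST-Zero demonstrates competitive performance compared to existing models.
Conclusion: To the authors' knowledge, R-AVST is the first dataset designed for real-world audio-visual spatio-temporal reasoning, and AVST-Zero offers a novel perspective for tackling future challenges in this domain. The work establishes a benchmark and a solution framework for complex multimodal reasoning.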
📄 Abstract
Recently, rapid advancements have been made in multimodal large language models (MLLMs), especially in video understanding tasks. However, current research focuses on simple video scenarios, failing to reflect the complex and diverse nature of real-world audio-visual events in videos. To bridge this gap, we first introduce R-AVST, a dataset for audio-visual reasoning featuring fine-grained spatio-temporal annotations. In constructing this, we design a pipeline consisting of LLM-based key object extraction, automatic spatial annotation, and manual quality inspection, resulting in over 5K untrimmed videos with 27K objects across 100 types of audio-visual events. Building on this dataset, we define three core tasks for spatio-temporal reasoning in audio-visual scenes and generate more than 8K high-quality, evenly distributed question-answer pairs to effectively benchmark model performance. To further enhance reasoning, we propose AVST-Zero, a reinforcement learning-based model that avoids intermediate supervision, directly optimizing behavior via carefully designed multi-dimensional rewards. Extensive experiments validate the effectiveness of our R-AVST in advancing audio-visual spatio-temporal reasoning, upon which AVST-Zero demonstrates competitive performance compared to existing models. To the best of our knowledge, R-AVST is the first dataset designed for real-world audio-visual spatio-temporal reasoning, and AVST-Zero offers a novel perspective for tackling future challenges in this domain.
[7] The Finer the Better: Towards Granular-aware Open-set Domain Generalization
Yunyun Wang, Zheng Duan, Xinyue Liao, Ke-Jia Chen, Songcan Chen
🧩 TL;DR
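This work proposes the SeeCLIP framework, which uses fine-grained semantic enhancement to address the dilemma between structural risk and open-space risk in open-set domain generalization, improving recognition when the model encounters unknown classes that are visually similar to known ones.
📘 Detailed Summary
Motivation: Open-set domain generalization faces a trade-off between the structural risk of known classes and the open-space risk from unknown classes. Existing methods tend to be over-confident on "hard unknowns" that share fine-grained visual similarities with known classes.
Method: A semantic-aware prompt enhancement module decomposes images into discriminative semantic tokens, enabling fine-grained vision-language alignment beyond coarse category labels. Duplex contrastive learning positions unknown prompts with two complementary objectives: repulsion to maintain separability from known classes, and cohesion to preserve semantic proximity. A semantic-guided diffusion module synthesizes pseudo-unknown samples by perturbing the extracted semantic tokens.
Result: Extensive experiments across five benchmarks show consistent improvements of 3% in accuracy and 5% in H-score over state-of-the-art methods.
Conclusion: The study shows that fine-grained semantic enhancement is effective at easing the risk dilemma in open-set domain generalization, offers a new way to handle visually similar unknown classes, and advances robust models for open environments.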
📄 Abstract
Open-Set Domain Generalization (OSDG) tackles the realistic scenario where deployed models encounter both domain shifts and novel object categories. Despite impressive progress with vision-language models like CLIP, existing methods still fall into the dilemma between the structural risk of known classes and the open-space risk from unknown classes, and easily suffer from over-confidence, especially when distinguishing "hard unknowns" that share fine-grained visual similarities with known classes. To this end, we propose a Semantic-enhanced CLIP (SeeCLIP) framework that explicitly addresses this dilemma through fine-grained semantic enhancement. In SeeCLIP, we propose a semantic-aware prompt enhancement module to decompose images into discriminative semantic tokens, enabling nuanced vision-language alignment beyond coarse category labels. To position unknown prompts effectively, we introduce duplex contrastive learning with complementary objectives, that is, repulsion to maintain separability from known classes, and cohesion to preserve semantic proximity. Further, our semantic-guided diffusion module synthesizes pseudo-unknowns by perturbing extracted semantic tokens, generating challenging samples that are visually similar to known classes yet exhibit key local differences. These hard negatives force the model to learn finer decision boundaries. Extensive experiments across five benchmarks demonstrate consistent improvements of 3% accuracy and 5% H-score over state-of-the-art methods.
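The abstract reports H-score without defining it. In open-set recognition work it is commonly the harmonic mean of accuracy on known classes and accuracy on unknown classes, which penalizes a model that does well on one group at the expense of the other. The sketch below assumes that definition; the function name and example numbers are hypothetical.

```python
def h_score(acc_known, acc_unknown):
    """Harmonic mean of known-class and unknown-class accuracy (both in [0, 1])."""
    total = acc_known + acc_unknown
    if total == 0:
        return 0.0
    return 2 * acc_known * acc_unknown / total

# Made-up accuracies: a balanced model versus one that rejects nothing.
balanced = h_score(0.8, 0.8)
never_rejects = h_score(1.0, 0.0)
```

Because the harmonic mean collapses to zero when either accuracy is zero, an over-confident model that never predicts "unknown" scores poorly even with perfect known-class accuracy.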
[8] Q-REAL: Towards Realism and Plausibility Evaluation for AI-Generated Content
Shushi Wang, Zicheng Zhang, Chunyi Li, Wei Wang, Liya Ma, Fengjiao Chen, Xiaoyu Li, Xuezhi Cao, Guangtao Zhai, Xiaohong Liu
🧩 TL;DR
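This paper introduces the Q-Real dataset and benchmark for fine-grained evaluation of realism and plausibility in AI-generated images. It uses multimodal large language models for judgment and grounding-with-reasoning tasks, and designs a fine-tuning framework to strengthen model capabilities.
📘 Detailed Summary
Motivation: Existing quality assessment methods for AI-generated content usually provide only a single quality score, which is too coarse to give targeted guidance for optimizing generative models. Fine-grained evaluation is especially lacking along the two key dimensions of realism and plausibility.
Method: The authors build the Q-Real dataset of 3,088 text-to-image generations, annotating the locations of major entities together with judgment questions and attribution descriptions along the realism and plausibility dimensions. They design Q-Real Bench to evaluate multimodal large language models on judgment and grounding with reasoning, and develop a dedicated fine-tuning framework to enhance model capabilities.
Result: Experiments indicate that the dataset is of high quality and significance and that the benchmark is comprehensive. The fine-tuning framework improves the performance of multiple multimodal large language models on fine-grained image quality assessment.
Conclusion: The work provides a standardized dataset and benchmark for fine-grained quality assessment of AI-generated images and supports the development of unified generation-understanding models. Fine-grained evaluation by multimodal large language models offers useful guidance for optimizing generative models.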
📄 Abstract
Quality assessment of AI-generated content is crucial for evaluating model capability and guiding model optimization. However, most existing quality assessment datasets and models provide only a single quality score, which is too coarse to offer targeted guidance for improving generative models. In current applications of AI-generated images, realism and plausibility are two critical dimensions, and with the emergence of unified generation-understanding models, fine-grained evaluation along these dimensions becomes especially effective for improving generative performance. Therefore, we introduce Q-Real, a novel dataset for fine-grained evaluation of realism and plausibility in AI-generated images. Q-Real consists of 3,088 images generated by popular text-to-image models. For each image, we annotate the locations of major entities and provide a set of judgment questions and attribution descriptions for these along the dimensions of realism and plausibility. Considering that recent advances in multi-modal large language models (MLLMs) enable fine-grained evaluation of AI-generated images, we construct Q-Real Bench to evaluate them on two tasks: judgment and grounding with reasoning. Finally, to enhance MLLM capabilities, we design a fine-tuning framework and conduct experiments on multiple MLLMs using our dataset. Experimental results demonstrate the high quality and significance of our dataset and the comprehensiveness of the benchmark. Dataset and code will be released upon publication.
[9] RacketVision: A Multiple Racket Sports Benchmark for Unified Ball and Racket Analysis
Linfeng Dong, Yuchen Yang, Hao Wu, Wei Wang, Yuenan Hou, Zhihang Zhong, Xiao Sun
🧩 TL;DR
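RacketVision presents the first dataset with large-scale, fine-grained racket pose annotations, covering table tennis, tennis, and badminton. A cross-attention mechanism fuses racket pose features effectively and improves trajectory prediction, making the dataset a resource for dynamic object tracking and multimodal research in sports analytics.
📘 Detailed Summary
Motivation: Sports analytics lacks large-scale, fine-grained racket pose annotations, which limits research on complex human-object interactions, in particular the interconnected tasks of ball tracking, articulated racket pose estimation, and predictive trajectory forecasting in racket sports.
Method: The authors build the first dataset with large-scale, fine-grained racket pose annotations across three racket sports, and use a CrossAttention mechanism for multimodal feature fusion, rather than simple feature concatenation, to combine racket pose information with ball position data.
Result: The evaluation shows that naively concatenating racket pose features degrades performance, whereas the CrossAttention mechanism makes effective use of racket pose information and surpasses strong unimodal baselines on trajectory prediction, confirming the importance of the fusion strategy.
Conclusion: RacketVision provides a versatile resource and a starting point for research in dynamic object tracking, conditional motion forecasting, and multimodal analysis in sports, and demonstrates the key role of cross-attention in multimodal fusion.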
📄 Abstract
We introduce RacketVision, a novel dataset and benchmark for advancing computer vision in sports analytics, covering table tennis, tennis, and badminton. The dataset is the first to provide large-scale, fine-grained annotations for racket pose alongside traditional ball positions, enabling research into complex human-object interactions. It is designed to tackle three interconnected tasks: fine-grained ball tracking, articulated racket pose estimation, and predictive ball trajectory forecasting. Our evaluation of established baselines reveals a critical insight for multi-modal fusion: while naively concatenating racket pose features degrades performance, a CrossAttention mechanism is essential to unlock their value, leading to trajectory prediction results that surpass strong unimodal baselines. RacketVision provides a versatile resource and a strong starting point for future research in dynamic object tracking, conditional motion forecasting, and multimodal analysis in sports. Project page at https://github.com/OrcustD/RacketVision
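The contrast between concatenation and cross-attention can be made concrete with a small example. The sketch below implements single-head scaled dot-product attention for one query vector, in which a ball feature (the query) attends over several racket pose features (keys and values). It omits learned projection matrices and is not the authors' architecture; the function names and toy features are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector.

    query: feature of one modality (e.g. ball position embedding).
    keys, values: features of the other modality (e.g. racket keypoints).
    Returns a weighted average of the values.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# Toy features, made up for illustration: two identical keys give equal weights.
ball = [1.0, 0.0]
racket_keys = [[1.0, 0.0], [1.0, 0.0]]
racket_values = [[2.0], [4.0]]
fused = cross_attend(ball, racket_keys, racket_values)
```

Unlike concatenation, the attention weights depend on the query, so the model can down-weight racket features that are irrelevant to the current ball state.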
[10] UniModel: A Visual-Only Framework for Unified Multimodal Understanding and Generation
Chi Zhang, Jiepeng Wang, Youming Wang, Yuanzhi Liang, Xiaoyan Yang, Zuoxin Li, Haibin Huang, Xuelong Li
🧩 TL;DR
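This paper presents UniModel, a unified generative model that supports both visual understanding and visual generation within a single pixel-to-pixel diffusion framework. By mapping text and images into a shared visual space, it gives a fully vision-native formulation of multimodal learning.
📘 Detailed Summary
Motivation: Multimodal learning is currently fragmented along three axes: the model, the tasks, and the representations. Representation differences between modalities make systems complex and hard to unify across understanding and generation, which calls for a framework that removes modality discrepancies and handles a range of vision-language tasks uniformly.
Method: The approach uses a unified pixel-to-pixel diffusion framework in which textual prompts are rendered as painted text images and all inputs and outputs are treated as RGB pixels. A Unified Diffusion Transformer trained with rectified flow serves as the shared backbone, with lightweight task embeddings specifying the mapping direction, so that the model learns bidirectional mappings between natural images and painted text images.
Result: Experiments on text-to-image synthesis and image-to-text understanding show strong cross-modal alignment and emergent controllability, such as cycle-consistent image-caption-image loops, supporting the effectiveness of the unified framework.
Conclusion: Unifying model, tasks, and representations in a single visual space is a promising paradigm for general-purpose multimodal intelligence. The fully vision-native approach points to simpler multimodal systems and shows the potential of treating captioning and generation as two directions of one visual translation process.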
📄 Abstract
We present UniModel, a unified generative model that jointly supports visual understanding and visual generation within a single pixel-to-pixel diffusion framework. Our goal is to achieve unification along three axes: the model, the tasks, and the representations. At the representation level, we eliminate modality discrepancies by mapping both text and images into a shared visual space: textual prompts are rendered as painted text images on a clean canvas, and all inputs and outputs are treated purely as RGB pixels. This yields a fully vision-native formulation of multimodal learning. At the task level, a broad range of vision-language problems are cast as pixel-to-pixel transformations in this visual space. For understanding tasks, the model takes an RGB image and produces a painted text image that visually encodes the semantic prediction. For generation tasks, painted text images serve as visual conditions that guide realistic and semantically aligned image synthesis. Captioning and text-to-image generation thus become different directions of the same underlying visual translation process. At the model level, we instantiate a single Unified Diffusion Transformer trained with rectified flow in pixel space. A shared backbone jointly learns bidirectional mappings between natural images and painted text images, with lightweight task embeddings to specify the desired direction. Experiments on text-to-image synthesis and image-to-text understanding demonstrate strong cross-modal alignment and emergent controllability such as cycle-consistent image-caption-image loops. Our initial exploration suggests that unifying model, tasks, and representations in a single visual space is a promising paradigm for general-purpose multimodal intelligence.
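The abstract states that the model is trained with rectified flow in pixel space. In its standard form, rectified flow trains a network to predict the constant velocity along the straight line between a source sample and a target sample. The sketch below builds one such training pair; it shows the generic formulation only, and how UniModel chooses source and target (for example noise versus image) is not stated in the abstract. The function name and toy vectors are hypothetical.

```python
def rectified_flow_pair(x0, x1, t):
    """Build a rectified-flow training pair at time t in [0, 1].

    x0: source sample (flattened pixels), x1: target sample.
    Returns the interpolated point x_t = (1 - t) * x0 + t * x1 and the
    velocity target x1 - x0 that a network would be trained to predict.
    """
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity

# Toy two-pixel samples, made up for illustration only.
source = [0.0, 2.0]
target = [1.0, 0.0]
x_half, v = rectified_flow_pair(source, target, 0.5)
```

At sampling time, integrating the learned velocity from t = 0 to t = 1 moves a source sample toward the target distribution.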
[11] Vision Language Models are Confused Tourists
Patrick Amadeus Irawan, Ikhlasul Akmal Hanif, Muhammad Dehan Al Kautsar, Genta Indra Winata, Fajri Koto, Alham Fikri Aji
🧩 TL;DR
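This paper introduces the ConfusedTourist evaluation suite, which reveals that vision-language models are highly vulnerable to mixed cultural cues: even simple image-stacking perturbations cause large performance drops, underscoring the need for culturally robust multimodal understanding.
📘 Detailed Summary
Motivation: Cultural evaluations of vision-language models mostly use scenes with a single cultural concept and overlook real-world cases where multiple unrelated cultural cues coexist. This limitation prevents a full assessment of model stability in multicultural settings.
Method: The authors develop ConfusedTourist, a cultural adversarial robustness suite that uses image stacking and image-generation-based perturbations to test the stability of vision-language models against perturbed geographical and cultural cues.
Result: Accuracy drops heavily under simple image-stacking perturbations and worsens further with the image-generation-based variant. Interpretability analyses show that the failures stem from systematic attention shifts toward distracting cues, which divert the model from its intended focus.
Conclusion: Mixing visual cultural concepts can substantially impair even state-of-the-art vision-language models, which underscores the urgent need for more culturally robust multimodal understanding.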
📄 Abstract
Although the cultural dimension has been one of the key aspects in evaluating Vision-Language Models (VLMs), their ability to remain stable across diverse cultural inputs remains largely untested, despite being crucial to support diversity and multicultural societies. Existing evaluations often rely on benchmarks featuring only a singular cultural concept per image, overlooking scenarios where multiple, potentially unrelated cultural cues coexist. To address this gap, we introduce ConfusedTourist, a novel cultural adversarial robustness suite designed to assess VLMs' stability against perturbed geographical cues. Our experiments reveal a critical vulnerability, where accuracy drops heavily under simple image-stacking perturbations and even worsens with its image-generation-based variant. Interpretability analyses further show that these failures stem from systematic attention shifts toward distracting cues, diverting the model from its intended focus. These findings highlight a critical challenge: visual cultural concept mixing can substantially impair even state-of-the-art VLMs, underscoring the urgent need for more culturally robust multimodal understanding.
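The abstract does not describe how the image-stacking perturbation is constructed. One plausible reading is that a distractor image carrying an unrelated cultural cue is placed next to the original image. The sketch below illustrates that reading with images represented as nested lists of pixel values; the function name, the axis options, and the toy images are assumptions.

```python
def stack_images(original, distractor, axis="horizontal"):
    """Place a distractor image beside or below the original.

    Images are lists of rows, each row a list of pixel values.
    Horizontal stacking needs equal heights; vertical needs equal widths.
    """
    if axis == "horizontal":
        if len(original) != len(distractor):
            raise ValueError("images must have the same height")
        return [row_a + row_b for row_a, row_b in zip(original, distractor)]
    if axis == "vertical":
        if len(original[0]) != len(distractor[0]):
            raise ValueError("images must have the same width")
        return original + distractor
    raise ValueError("axis must be 'horizontal' or 'vertical'")

# Toy 2x2 grayscale images, made up for illustration only.
landmark = [[1, 1], [1, 1]]
unrelated = [[9, 9], [9, 9]]
side_by_side = stack_images(landmark, unrelated)
top_and_bottom = stack_images(landmark, unrelated, axis="vertical")
```

A robustness test would then compare the model's answer on the original image with its answer on the stacked one.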
[12] OmniPT: Unleashing the Potential of Large Vision Language Models for Pedestrian Tracking and Understanding
Teng Fu, Mengyang Zhao, Ke Niu, Kaixin Peng, Bin Li
🧩 TL;DR
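This paper proposes OmniPT, a unified pedestrian tracking framework that can track based on a reference and generate semantic understanding of tracked objects. With a multi-stage training recipe it outperforms previous methods on tracking benchmarks.
📘 Detailed Summary
Motivation: LVLMs perform well on image-level tasks but still lag on instance-level tasks such as visual grounding and object detection. Meanwhile, pedestrian tracking has seen new tasks that combine object tracking with natural language understanding. These tasks require models to understand tracked objects at an advanced semantic level, which is where LVLMs excel.
Method: The OmniPT framework uses a multi-stage RL-Mid Training-SFT-RL recipe. An initial RL phase teaches the model to output bounding boxes in a fixed format, a mid-training phase uses a large number of pedestrian-related datasets, supervised fine-tuning follows on several pedestrian tracking datasets, and a final RL phase improves tracking performance and instruction following.
Result: Experiments on tracking benchmarks show that the proposed method performs better than previous methods.
Conclusion: The work shows how to cast tracking as a task that foundation models can perform and how to make the model output formatted answers, offering a workable way to combine LVLMs with instance-level tracking and bringing semantic understanding and object tracking closer together.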
📄 Abstract
LVLMs have been shown to perform excellently in image-level tasks such as VQA and captioning. However, in many instance-level tasks, such as visual grounding and object detection, LVLMs still show performance gaps compared to previous expert models. Meanwhile, although pedestrian tracking is a classical task, there have been a number of new topics in combining object tracking and natural language, such as Referring MOT, Cross-view Referring MOT, and Semantic MOT. These tasks emphasize that models should understand the tracked object at an advanced semantic level, which is exactly where LVLMs excel. In this paper, we propose a new unified Pedestrian Tracking framework, namely OmniPT, which can track, track based on a reference, and interactively generate semantic understanding of tracked objects. We address two issues: how to model the tracking task as a task that foundation models can perform, and how to make the model output formatted answers. To this end, we implement a training phase consisting of RL-Mid Training-SFT-RL. Based on the pre-trained weights of the LVLM, we first perform a simple RL phase to enable the model to output a fixed and supervisable bounding box format. Subsequently, we conduct a mid-training phase using a large number of pedestrian-related datasets. Finally, we perform supervised fine-tuning on several pedestrian tracking datasets, and then carry out another RL phase to improve the model's tracking performance and enhance its ability to follow instructions. We conduct experiments on tracking benchmarks, and the experimental results demonstrate that the proposed method performs better than previous methods.
[13] DeltaDeno: Zero-Shot Anomaly Generation via Delta-Denoising Attribution
Chaoran Xu, Chengkan Lv, Qiyu Chen, Yunkang Cao, Feng Zhang, Zhengtao Zhang
🧩 TL;DR
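This paper proposes Delta-Denoising (DeltaDeno), a training-free, zero-shot anomaly generation method that localizes and edits defects by contrasting the denoising behavior of two diffusion branches under a shared schedule, avoiding the overfitting that comes with scarce anomaly samples.
📘 Detailed Summary
Motivation: Existing anomaly generation methods usually need a few anomalous samples for fine-tuning, which contradicts the scarcity of anomalies in practice and tends to overfit category priors. This paper targets zero-shot anomaly generation with no real anomaly samples and no training.
Method: DeltaDeno contrasts two diffusion branches driven by a minimal prompt pair under a shared schedule. It accumulates the per-step denoising differences into an image-specific localization map and derives a mask that guides latent inpainting, preserving the surrounding context while generating realistic local defects. The method also performs token-level prompt refinement to align shared content and strengthen anomaly tokens, and applies a spatial attention bias restricted to anomaly tokens in the predicted region.
Result: Experiments on public datasets show that DeltaDeno achieves strong generation quality and realism, with consistent gains in downstream detection performance.
Conclusion: The study demonstrates that zero-shot anomaly generation without real anomaly samples is feasible. Contrasting diffusion branches and localizing the differences yields high-quality defect generation and provides an effective source of augmented data for downstream tasks such as anomaly detection.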
📄 Abstract
Anomaly generation is often framed as few-shot fine-tuning with anomalous samples, which contradicts the scarcity that motivates generation and tends to overfit category priors. We tackle the setting where no real anomaly samples or training are available. We propose Delta-Denoising (DeltaDeno), a training-free zero-shot anomaly generation method that localizes and edits defects by contrasting two diffusion branches driven by a minimal prompt pair under a shared schedule. By accumulating per-step denoising deltas into an image-specific localization map, we obtain a mask to guide the latent inpainting during later diffusion steps and preserve the surrounding context while generating realistic local defects. To improve stability and control, DeltaDeno performs token-level prompt refinement that aligns shared content and strengthens anomaly tokens, and applies a spatial attention bias restricted to anomaly tokens in the predicted region. Experiments on public datasets show that DeltaDeno achieves great generation, realism and consistent gains in downstream detection performance. Code will be made publicly available.
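The step of accumulating per-step denoising deltas into a localization map and turning it into a mask can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes each step yields a 2-D grid of differences between the two branches, sums their absolute values, normalizes by the peak, and thresholds at 0.5. The normalization, the threshold value, and the function name are all hypothetical.

```python
def localization_mask(deltas_per_step, threshold=0.5):
    """Accumulate per-step difference grids and threshold them into a mask.

    deltas_per_step: list of 2-D grids (lists of rows), one per diffusion
    step, holding the difference between the two branches' predictions.
    Returns a binary grid with 1 where the accumulated, peak-normalized
    difference is at least `threshold`.
    """
    height, width = len(deltas_per_step[0]), len(deltas_per_step[0][0])
    accumulated = [[0.0] * width for _ in range(height)]
    for grid in deltas_per_step:
        for i in range(height):
            for j in range(width):
                accumulated[i][j] += abs(grid[i][j])
    peak = max(max(row) for row in accumulated)
    if peak == 0:
        return [[0] * width for _ in range(height)]
    return [[1 if value / peak >= threshold else 0 for value in row]
            for row in accumulated]

# Toy 2x2 difference grids for two steps, made up for illustration only.
steps = [
    [[0.0, 1.0], [0.0, 0.2]],
    [[0.0, 1.0], [0.1, 0.2]],
]
mask = localization_mask(steps)
```

The resulting mask would restrict inpainting to the region where the two prompts disagree most, leaving the rest of the image untouched.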
[14] Planning with Sketch-Guided Verification for Physics-Aware Video Generation
Yidong Huang, Zun Wang, Han Lin, Dong-Ki Kim, Shayegan Omidshafiei, Jaehong Yoon, Yue Zhang, Mohit Bansal
🧩 TL;DR
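This paper proposes SketchVerify, a training-free, sketch-verification-based motion planning framework. A test-time sampling and verification loop produces dynamically coherent motion trajectories, improving the motion quality and physical realism of generated video while remaining efficient.
📘 Detailed Summary
Motivation: Existing video generation methods mostly rely on single-shot motion plans, which are typically limited to simple motions, or on iterative refinement that requires multiple calls to the video generator and incurs high computational cost. Both limit quality and efficiency for complex dynamic scenes.
Method: The method predicts multiple candidate motion trajectories and ranks them with a vision-language verifier that jointly evaluates semantic alignment and physical plausibility. Each trajectory is rendered as a lightweight video sketch by compositing objects over a static background, which avoids expensive repeated diffusion-based synthesis. The plan is refined iteratively until a satisfactory one is identified.
Result: Experiments on WorldModelBench and PhyWorldBench show that the method significantly improves motion quality, physical realism, and long-term consistency over competitive baselines while being substantially more efficient. The ablation study shows that increasing the number of trajectory candidates consistently improves overall performance.
Conclusion: The study shows that a test-time verification loop is effective at improving motion planning quality, and that lightweight sketch rendering cuts computational cost while maintaining performance, giving an efficient motion planning approach for high-quality video generation.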
📄 Abstract
Recent video generation approaches increasingly rely on planning intermediate control signals such as object trajectories to improve temporal coherence and motion fidelity. However, these methods mostly employ single-shot plans that are typically limited to simple motions, or iterative refinement that requires multiple calls to the video generator, incurring high computational cost. To overcome these limitations, we propose SketchVerify, a training-free, sketch-verification-based planning framework that improves motion planning quality with more dynamically coherent trajectories (i.e., physically plausible and instruction-consistent motions) prior to full video generation by introducing a test-time sampling and verification loop. Given a prompt and a reference image, our method predicts multiple candidate motion plans and ranks them using a vision-language verifier that jointly evaluates semantic alignment with the instruction and physical plausibility. To efficiently score candidate motion plans, we render each trajectory as a lightweight video sketch by compositing objects over a static background, which bypasses the need for expensive, repeated diffusion-based synthesis while achieving comparable performance. We iteratively refine the motion plan until a satisfactory one is identified, which is then passed to the trajectory-conditioned generator for final synthesis. Experiments on WorldModelBench and PhyWorldBench demonstrate that our method significantly improves motion quality, physical realism, and long-term consistency compared to competitive baselines while being substantially more efficient. Our ablation study further shows that scaling up the number of trajectory candidates consistently enhances overall performance.
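The test-time sampling and verification loop described above can be sketched as a generic propose-score-refine procedure. The code below is a minimal illustration, not the authors' system: `propose` stands in for the motion planner, `score` stands in for the vision-language verifier applied to a rendered sketch, and the candidate count, round limit, and acceptance threshold are hypothetical parameters.

```python
def plan_with_verification(propose, score, num_candidates=4,
                           max_rounds=3, accept=0.8):
    """Sample candidate plans, score them, and refine until one is accepted.

    propose(n, best_so_far): hypothetical callable returning n candidate
        plans, optionally conditioned on the best plan found so far.
    score(plan): hypothetical callable returning a verifier score, where
        higher means more instruction-consistent and physically plausible.
    Returns the best plan and its score.
    """
    best_plan, best_score = None, float("-inf")
    for _ in range(max_rounds):
        for plan in propose(num_candidates, best_plan):
            plan_score = score(plan)
            if plan_score > best_score:
                best_plan, best_score = plan, plan_score
        if best_score >= accept:
            break  # a satisfactory plan was found; stop refining
    return best_plan, best_score
```

A usage example with stand-in callables: passing `propose=lambda n, prev: [(prev or 0) + k + 1 for k in range(n)]`, `score=lambda plan: plan`, and `accept=8` returns `(8, 8)` after two rounds. Only the accepted plan would then be sent to the expensive video generator.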
[15] ReBrain: Brain MRI Reconstruction from Sparse CT Slice via Retrieval-Augmented Diffusion
Junming Liu, Yifei Sun, Weihua Cheng, Yujin Kang, Yirong Chen, Ding Wang, Guosun Zeng
🧩 TL;DR
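This paper proposes ReBrain, a retrieval-augmented diffusion framework for reconstructing full brain MRI from sparse CT scans. It retrieves similar CT slices as references and combines them with a Brownian Bridge diffusion model to address the sparse-volume reconstruction problem caused by low-dose CT protocols.
📘 Detailed Summary
Motivation: Brain MRI is crucial for disease diagnosis, but some patients cannot undergo MRI because of physical or clinical constraints. Methods that synthesize MRI from CT face highly sparse CT volumes and poor through-plane resolution under low-dose protocols, which makes accurate reconstruction of the full brain MRI volume particularly challenging.
Method: ReBrain first uses a Brownian Bridge Diffusion Model to synthesize 2D MRI slices. At the same time, a fine-tuned retrieval model retrieves structurally and pathologically similar CT slices from a prior database to serve as references. A ControlNet branch incorporates the retrieved slices to guide the generation of intermediate MRI slices and ensure structural continuity, and spherical linear interpolation provides supplementary guidance in the rare cases where retrieval fails.
Result: Extensive experiments on SynthRAD2023 and BraTS show that ReBrain achieves state-of-the-art performance in cross-modal reconstruction under sparse conditions.
Conclusion: The study shows that a retrieval-augmented diffusion framework is effective for sparse medical image reconstruction, offers a workable alternative for patients who cannot undergo MRI, and opens a new technical route for cross-modal medical image synthesis.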
📄 Abstract
Magnetic Resonance Imaging (MRI) plays a crucial role in brain disease diagnosis, but it is not always feasible for certain patients due to physical or clinical constraints. Recent studies attempt to synthesize MRI from Computed Tomography (CT) scans; however, low-dose protocols often result in highly sparse CT volumes with poor through-plane resolution, making accurate reconstruction of the full brain MRI volume particularly challenging. To address this, we propose ReBrain, a retrieval-augmented diffusion framework for brain MRI reconstruction. Given any 3D CT scan with limited slices, we first employ a Brownian Bridge Diffusion Model (BBDM) to synthesize MRI slices along the 2D dimension. Simultaneously, we retrieve structurally and pathologically similar CT slices from a comprehensive prior database via a fine-tuned retrieval model. These retrieved slices are used as references, incorporated through a ControlNet branch to guide the generation of intermediate MRI slices and ensure structural continuity. We further account for rare retrieval failures when the database lacks suitable references and apply spherical linear interpolation to provide supplementary guidance. Extensive experiments on SynthRAD2023 and BraTS demonstrate that ReBrain achieves state-of-the-art performance in cross-modal reconstruction under sparse conditions.
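Spherical linear interpolation (slerp), mentioned above as the fallback when retrieval fails, interpolates between two vectors along the arc between them rather than along a straight line. The sketch below gives the standard formula for non-zero vectors. What ReBrain actually interpolates (for example latents of neighboring slices) is not stated in the abstract, so the inputs here are generic vectors and the function name is hypothetical.

```python
import math

def slerp(a, b, t):
    """Spherical linear interpolation between non-zero vectors a and b.

    t = 0 returns a, t = 1 returns b. Falls back to linear interpolation
    when the vectors are nearly parallel or anti-parallel, where the
    spherical formula becomes numerically unstable.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cos_omega = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < 1e-8:
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    weight_a = math.sin((1 - t) * omega) / sin_omega
    weight_b = math.sin(t * omega) / sin_omega
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]

# Halfway between two orthogonal unit vectors, made up for illustration.
midpoint = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
```

Unlike linear interpolation, the midpoint of two unit vectors stays on the unit circle, which is why slerp is often preferred for interpolating Gaussian-like latents.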
[16] MultiPriv: Benchmarking Individual-Level Privacy Reasoning in Vision-Language Models
Xiongtao Sun, Hui Li, Jiaming Zhang, Yujie Yang, Kaili Liu, Ruxin Feng, Wen Jun Tan, Wei Yang Bryan Lim
🧩 TL;DR
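This paper proposes MultiPriv, the first benchmark to systematically evaluate individual-level privacy reasoning risks in vision-language models. By building a bilingual multimodal dataset and a Privacy Perception and Reasoning framework, it reveals systemic vulnerabilities of existing VLMs in privacy reasoning.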
📘 Detailed Summary
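Motivation: Modern vision-language models exhibit sophisticated reasoning, which escalates privacy risks from simple attribute perception to individual-level linkage. Existing privacy benchmarks are structurally unable to address this new threat: they mainly evaluate privacy perception and fail to address the more critical risk of privacy reasoning, that is, a VLM's ability to infer and link distributed information to construct individual profiles.
Method: Proposes the Privacy Perception and Reasoning (PPR) framework and constructs a novel bilingual multimodal dataset whose core component is a set of synthetic individual profiles in which identifiers are meticulously linked to sensitive attributes. The dataset supports nine challenging tasks that evaluate the full PPR spectrum, from attribute detection to cross-image re-identification and chained inference.
Result: A large-scale evaluation of more than 50 foundational and commercial VLMs finds that many VLMs carry significant, unmeasured reasoning-based privacy risks; that perception-level metrics are poor predictors of these reasoning risks, revealing a critical evaluation gap; and that existing safety alignment methods are inconsistent and ineffective against such reasoning-based attacks.
Conclusion: MultiPriv exposes systemic vulnerabilities of VLMs in privacy reasoning and provides the framework needed to develop robust, privacy-preserving VLMs, underscoring the need to go beyond traditional privacy-perception evaluation to address emerging reasoning threats.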
📄 Abstract
Modern Vision-Language Models (VLMs) demonstrate sophisticated reasoning, escalating privacy risks beyond simple attribute perception to individual-level linkage. Current privacy benchmarks are structurally insufficient for this new threat, as they primarily evaluate privacy perception while failing to address the more critical risk of privacy reasoning: a VLM's ability to infer and link distributed information to construct individual profiles. To address this critical gap, we propose MultiPriv, the first benchmark designed to systematically evaluate individual-level privacy reasoning in VLMs. We introduce the Privacy Perception and Reasoning (PPR) framework and construct a novel, bilingual multimodal dataset to support it. The dataset uniquely features a core component of synthetic individual profiles where identifiers (e.g., faces, names) are meticulously linked to sensitive attributes. This design enables nine challenging tasks evaluating the full PPR spectrum, from attribute detection to cross-image re-identification and chained inference. We conduct a large-scale evaluation of over 50 foundational and commercial VLMs. Our analysis reveals: (1) Many VLMs possess significant, unmeasured reasoning-based privacy risks. (2) Perception-level metrics are poor predictors of these reasoning risks, revealing a critical evaluation gap. (3) Existing safety alignments are inconsistent and ineffective against such reasoning-based attacks. MultiPriv exposes systemic vulnerabilities and provides the necessary framework for developing robust, privacy-preserving VLMs.
[17] FingerCap: Fine-grained Finger-level Hand Motion Captioning
Xin Shen, Rui Zhu, Lei Shen, Xinyu Wang, Kaihao Zhang, Tianqing Zhu, Shuchen Wu, Chenxi Miao, Weikang Li, Yang Li, Deguo Xia, Jizhou Huang, Xin Yu
🧩 TL;DR
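This paper proposes the FingerCap task for generating fine-grained, finger-level descriptions of hand motion, and builds FingerCap-40K, a dataset of 40K paired videos and captions. To address the temporal-sparsity bottleneck of Video-MLLMs, the authors propose FiGOP, which combines RGB keyframes with hand keypoints to recover fine temporal cues.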
📘 Detailed Summary
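Motivation: Visual perception and embodied intelligence currently lack the ability to understand fine-grained, finger-level hand motion. Because existing video multimodal large language models sample frames sparsely in time, they cannot capture the subtle, high-frequency dynamics of finger motion, which is a fundamental bottleneck for finger-level reasoning.
Method: Proposes FiGOP, which pairs each RGB keyframe with the subsequent hand keypoints up to the next keyframe. A lightweight temporal encoder converts the keypoints into motion embeddings and integrates them with the RGB features. The method adapts the classic GOP concept to finger motion and recovers fine temporal cues without increasing RGB density.
Result: Experiments on FingerCap-40K show that strong open- and closed-source Video-MLLMs still struggle with finger-level reasoning, whereas the FiGOP-augmented model achieves consistent gains under both HandJudge evaluation and human studies.
Conclusion: FiGOP offers a compute-friendly way to handle the temporal sparsity of Video-MLLMs. It is particularly suited to fine-grained hand-motion understanding tasks that require capturing subtle, high-frequency dynamics, and it opens a new research direction for hand-action analysis.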
📄 Abstract
Understanding fine-grained human hand motion is fundamental to visual perception, embodied intelligence, and multimodal communication. In this work, we propose Fine-grained Finger-level Hand Motion Captioning (FingerCap), which aims to generate textual descriptions that capture detailed finger-level semantics of hand actions. To support this task, we curate FingerCap-40K, a large-scale corpus of 40K paired hand-motion videos and captions spanning two complementary sources: concise instruction-style finger motions and diverse, naturalistic hand-object interactions. To enable effective evaluation, we employ HandJudge, an LLM-based rubric that measures finger-level correctness and motion completeness. Temporal sparsity remains a fundamental bottleneck for current Video-MLLMs, since sparse RGB sampling is insufficient to capture the subtle, high-frequency dynamics underlying fine finger motions. As a simple and compute-friendly remedy, we introduce FiGOP (Finger Group-of-Pictures), which pairs each RGB keyframe with subsequent hand keypoints until the next keyframe. A lightweight temporal encoder converts the keypoints into motion embeddings and integrates them with RGB features. FiGOP adapts the classic GOP concept to finger motion, recovering fine temporal cues without increasing RGB density. Experiments on FingerCap-40K show that strong open- and closed-source Video-MLLMs still struggle with finger-level reasoning, while our FiGOP-augmented model yields consistent gains under HandJudge and human studies.
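The keyframe-to-keypoint pairing that FiGOP describes can be sketched as a simple index grouping. The uniform keyframe stride and the function name `build_figop_groups` are assumptions made for illustration; the abstract does not say how keyframes are chosen or whether the keyframe's own keypoints belong to the group.

```python
def build_figop_groups(num_frames, keyframe_stride):
    """Group frame indices into (keyframe, keypoint_frames) pairs.

    Hypothetical sketch: every `keyframe_stride`-th frame is treated as an
    RGB keyframe, and the frames that follow it (up to, but excluding, the
    next keyframe) contribute only hand keypoints.
    """
    groups = []
    for start in range(0, num_frames, keyframe_stride):
        end = min(start + keyframe_stride, num_frames)
        groups.append((start, list(range(start + 1, end))))
    return groups
```

For a 10-frame clip with a stride of 4, only three RGB frames are encoded, while every remaining frame still contributes motion information through its keypoints.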
[18] Intervene-All-Paths: Unified Mitigation of LVLM Hallucinations across Alignment Formats
Jiaye Qian, Ge Zheng, Yuchen Zhu, Sibei Yang
🧩 TL;DR
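This paper proposes a comprehensive intervention framework aligned with the causal Transformer architecture of large vision-language models (LVLMs), reducing hallucination by analyzing the interplay among three pathways: image-to-input-text, image-to-output-text, and text-to-text. It finds that LVLM hallucinations arise from multi-path interactions and, for the first time, that the pathway a model relies on depends on the question-answer alignment format. Based on this, it proposes methods to identify and intervene on critical hallucination heads for discriminative and generative formats.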
📘 Detailed Summary
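Motivation: Although large vision-language models (LVLMs) perform well across a wide range of tasks, they remain prone to hallucination. This study aims to address the root causes of hallucination in LVLMs, examine how different causal pathways contribute to it, and develop effective interventions to reduce it.
Method: Proposes a comprehensive intervention framework aligned with the Transformer's causal architecture and analyzes the interplay among the image-to-input-text, image-to-output-text, and text-to-text pathways. It finds for the first time that LVLMs rely on different pathways depending on the question-answer alignment format, and proposes simple yet effective methods to identify and intervene on critical hallucination heads, tailored to discriminative and generative formats.
Result: Experiments on multiple benchmarks show that the method consistently reduces hallucination across alignment types. Intervening on critical hallucination heads significantly lowers hallucination under different question-answer formats, demonstrating the effectiveness and generality of the approach.
Conclusion: The study reveals the multi-path nature of LVLM hallucination: hallucination is not the result of a single causal path but the product of several interacting pathways. This finding offers a new perspective on how LVLMs work, and the proposed intervention framework provides a systematic way to reduce hallucination, which matters for the reliability and practicality of LVLMs.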
📄 Abstract
Despite their impressive performance across a wide range of tasks, Large Vision-Language Models (LVLMs) remain prone to hallucination. In this study, we propose a comprehensive intervention framework aligned with the transformer's causal architecture in LVLMs, integrating the effects of different intervention paths on hallucination. We find that hallucinations in LVLMs do not arise from a single causal path, but rather from the interplay among image-to-input-text, image-to-output-text, and text-to-text pathways. For the first time, we also find that LVLMs rely on different pathways depending on the question-answer alignment format. Building on these insights, we propose simple yet effective methods to identify and intervene on critical hallucination heads within each pathway, tailored to discriminative and generative formats. Experiments across multiple benchmarks demonstrate that our approach consistently reduces hallucinations across diverse alignment types.
[19] Neighbor GRPO: Contrastive ODE Policy Optimization Aligns Flow Models
Dailan He, Guanlin Feng, Xingtong Ge, Yazhe Niu, Yi Zhang, Bingqi Ma, Guanglu Song, Yu Liu, Hongsheng Li
🧩 TL;DR
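This paper proposes Neighbor GRPO, an algorithm that generates diverse trajectories by perturbing the initial noise conditions of the ODE and thus bypasses SDE conversion entirely, resolving the inefficient credit assignment and high-order solver incompatibility of SDE-based GRPO in flow matching models.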
📘 Detailed Summary
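Motivation: Existing SDE-based GRPO methods are hard to apply to modern flow matching models because the deterministic sampling paradigm of those models conflicts with GRPO's need for stochasticity. Current solutions introduce randomness by converting ODEs into SDEs, but this suffers from inefficient credit assignment and incompatibility with high-order solvers, limiting training efficiency and sampling quality.
Method: The paper first reinterprets existing SDE-based GRPO methods from a distance optimization perspective, revealing that their underlying mechanism is a form of contrastive learning. Building on this, it proposes Neighbor GRPO, which generates diverse candidate trajectories by perturbing the initial noise conditions of the ODE and optimizes the model with a softmax distance-based surrogate leaping policy. The method also introduces symmetric anchor sampling for computational efficiency and group-wise quasi-norm reweighting to address reward flattening.
Result: Extensive experiments show that Neighbor GRPO significantly outperforms its SDE-based counterparts in training cost, convergence speed, and generation quality. The method fully preserves the advantages of deterministic ODE sampling, including efficiency and compatibility with high-order solvers, while achieving better alignment.
Conclusion: Neighbor GRPO provides a more effective approach to aligning flow matching models. It overcomes the limitations of SDE-based methods and establishes a theoretical connection between the distance-based objective and policy gradient optimization, opening a new route to human-preference alignment for deterministic generative models.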
📄 Abstract
Group Relative Policy Optimization (GRPO) has shown promise in aligning image and video generative models with human preferences. However, applying it to modern flow matching models is challenging because of its deterministic sampling paradigm. Current methods address this issue by converting Ordinary Differential Equations (ODEs) to Stochastic Differential Equations (SDEs), which introduce stochasticity. However, this SDE-based GRPO suffers from issues of inefficient credit assignment and incompatibility with high-order solvers for fewer-step sampling. In this paper, we first reinterpret existing SDE-based GRPO methods from a distance optimization perspective, revealing their underlying mechanism as a form of contrastive learning. Based on this insight, we propose Neighbor GRPO, a novel alignment algorithm that completely bypasses the need for SDEs. Neighbor GRPO generates a diverse set of candidate trajectories by perturbing the initial noise conditions of the ODE and optimizes the model using a softmax distance-based surrogate leaping policy. We establish a theoretical connection between this distance-based objective and policy gradient optimization, rigorously integrating our approach into the GRPO framework. Our method fully preserves the advantages of deterministic ODE sampling, including efficiency and compatibility with high-order solvers. We further introduce symmetric anchor sampling for computational efficiency and group-wise quasi-norm reweighting to address reward flattening. Extensive experiments demonstrate that Neighbor GRPO significantly outperforms SDE-based counterparts in terms of training cost, convergence speed, and generation quality.
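One way to picture a softmax distance-based surrogate policy is sketched below. The abstract does not give the exact formulation, so the squared Euclidean distance, the temperature parameter, and both function names are assumptions; only the group-relative advantage normalization follows standard GRPO practice.

```python
import math

def softmax_distance_policy(output, candidates, temperature=1.0):
    """Surrogate probabilities over candidates from negative squared distance.

    Assumed form: p_i is proportional to exp(-||output - c_i||^2 / temperature),
    so candidates closer to the current model output get more probability.
    """
    logits = [
        -sum((o - c) ** 2 for o, c in zip(output, cand)) / temperature
        for cand in candidates
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style advantages: rewards normalized within the group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]
```

Under these assumptions, raising the log-probability of high-advantage candidates pulls the deterministic ODE output toward well-rewarded neighbours and away from poorly rewarded ones, which matches the contrastive reading given in the abstract.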
[20] Where Culture Fades: Revealing the Cultural Gap in Text-to-Image Generation
Chuancheng Shi, Shangze Li, Shiming Guo, Simiao Xie, Wenhua Wu, Jingtong Dou, Chao Wu, Canran Xiao, Cong Wang, Zifeng Cheng, Fei Shen, Tat-Seng Chua
🧩 TL;DR
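This paper proposes a method for improving cultural consistency in multilingual text-to-image generation. By localizing culture-sensitive neurons and designing two alignment strategies, it significantly improves cross-lingual cultural consistency while preserving generation quality.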
📘 Detailed Summary
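Motivation: Current multilingual text-to-image models often produce culturally neutral or English-biased results under cross-lingual prompts. The cause is that the cultural connotations carried by language are insufficiently activated, not that the models lack cultural knowledge.
Method: Proposes a probing method that localizes culture-sensitive signals to specific neurons in a small number of fixed layers, and designs two complementary alignment strategies: inference-time cultural activation, which requires no backbone fine-tuning, and layer-targeted cultural enhancement, which updates only the culturally relevant layers.
Result: Experiments on the CultureBench benchmark show consistent improvements in cultural consistency over strong baselines, while generation fidelity and diversity are preserved.
Conclusion: The study shows that cultural knowledge already exists in the model but is under-activated. Targeted activation of culture-related neurons effectively improves cross-lingual cultural consistency and offers a new approach to cultural alignment in multimodal generative models.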
📄 Abstract
Multilingual text-to-image (T2I) models have advanced rapidly in terms of visual realism and semantic alignment, and are now widely utilized. Yet outputs vary across cultural contexts: because language carries cultural connotations, images synthesized from multilingual prompts should preserve cross-lingual cultural consistency. We conduct a comprehensive analysis showing that current T2I models often produce culturally neutral or English-biased results under multilingual prompts. Analyses of two representative models indicate that the issue stems not from missing cultural knowledge but from insufficient activation of culture-related representations. We propose a probing method that localizes culture-sensitive signals to a small set of neurons in a few fixed layers. Guided by this finding, we introduce two complementary alignment strategies: (1) inference-time cultural activation that amplifies the identified neurons without backbone fine-tuning; and (2) layer-targeted cultural enhancement that updates only culturally relevant layers. Experiments on our CultureBench demonstrate consistent improvements over strong baselines in cultural consistency while preserving fidelity and diversity.
[21] DReX: Pure Vision Fusion of Self-Supervised and Convolutional Representations for Image Complexity Prediction
Jonathan Skaza, Parsa Madinei, Ziqi Wen, Miguel Eckstein
🧩 TL;DR
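This paper proposes DReX, a vision-only model for image complexity prediction that achieves state-of-the-art performance by fusing self-supervised and convolutional representations, indicating that visual features alone suffice for human-aligned complexity prediction.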
📘 Detailed Summary
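Motivation: Current image complexity prediction methods mostly rely on multimodal models that combine visual and linguistic information, but it is unclear whether language is necessary for this task. This study explores whether complexity can be predicted from visual features alone.
Method: Proposes DReX, which uses a learnable attention mechanism to fuse multi-scale hierarchical features from ResNet-50 with semantically rich representations from DINOv3 ViT-S/16, capturing both low-level texture patterns and high-level semantic structure.
Result: DReX reaches a state-of-the-art Pearson correlation of 0.9581 on the IC9600 benchmark, surpassing previous methods including multimodal ones while using approximately 21.5x fewer parameters, and it generalizes robustly across multiple datasets and metrics.
Conclusion: The study shows that, when properly fused, self-supervised transformers and supervised deep convolutional networks offer complementary and synergistic benefits for this task, and that visual features alone suffice for human-aligned complexity prediction, with implications for both computer vision and cognitive science.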
📄 Abstract
Visual complexity prediction is a fundamental problem in computer vision with applications in image compression, retrieval, and classification. Understanding what makes humans perceive an image as complex is also a long-standing question in cognitive science. Recent approaches have leveraged multimodal models that combine visual and linguistic representations, but it remains unclear whether language information is necessary for this task. We propose DReX (DINO-ResNet Fusion), a vision-only model that fuses self-supervised and convolutional representations through a learnable attention mechanism to predict image complexity. Our architecture integrates multi-scale hierarchical features from ResNet-50 with semantically rich representations from DINOv3 ViT-S/16, enabling the model to capture both low-level texture patterns and high-level semantic structure. DReX achieves state-of-the-art performance on the IC9600 benchmark (Pearson r = 0.9581), surpassing previous methods--including those trained on multimodal image-text data--while using approximately 21.5x fewer learnable parameters. Furthermore, DReX generalizes robustly across multiple datasets and metrics, achieving superior results on Pearson and Spearman correlation, Root Mean Square Error (RMSE), and Mean Absolute Error (MAE). Ablation and attention analyses confirm that DReX leverages complementary cues from both backbones, with the DINOv3 [CLS] token enhancing sensitivity to visual complexity. Our findings suggest that visual features alone can be sufficient for human-aligned complexity prediction and that, when properly fused, self-supervised transformers and supervised deep convolutional neural networks offer complementary and synergistic benefits for this task.
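The evaluation metrics named in the abstract (Pearson correlation, RMSE, MAE) have standard definitions, sketched here in plain Python for reference. This is not the authors' evaluation code; Spearman correlation is omitted for brevity, and the function names are illustrative.

```python
import math

def pearson_r(pred, target):
    """Pearson correlation between predicted and human complexity scores."""
    n = len(pred)
    mean_p, mean_t = sum(pred) / n, sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    std_p = math.sqrt(sum((p - mean_p) ** 2 for p in pred))
    std_t = math.sqrt(sum((t - mean_t) ** 2 for t in target))
    return cov / (std_p * std_t)

def rmse(pred, target):
    """Root mean square error."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))

def mae(pred, target):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)
```

Pearson r measures linear agreement and is insensitive to a constant offset or scale in the predictions, which is why it is usually reported alongside error metrics such as RMSE and MAE.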
[22] MuM: Multi-View Masked Image Modeling for 3D Vision
David Nordström, Johan Edstedt, Fredrik Kahl, Georg Bökman
🧩 TL;DR
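This paper proposes MuM, which extends masked autoencoding to arbitrarily many views of a scene in order to learn feature representations tailored to 3D vision tasks. The method outperforms state-of-the-art visual encoders such as DINOv3 and CroCo v2 on multiple downstream tasks.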
📘 Detailed Summary
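Motivation: Current self-supervised learning methods are optimized mainly for semantic understanding rather than geometric reasoning. Existing methods such as CroCo are tailored to 3D understanding but are limited in scalability and burdened by complexity. This study aims to develop a feature learning method designed specifically for 3D vision tasks, addressing the weak geometric reasoning of existing approaches.
Method: Proposes MuM, which extends masked autoencoding to arbitrarily many views. By applying a uniform masking strategy to all views and using a lightweight decoder with inter-frame attention, it achieves an architecture that is simpler and more scalable than CroCo.
Result: Across downstream evaluations, including feedforward reconstruction, dense image matching, and relative pose estimation, MuM outperforms the current state-of-the-art visual encoders DINOv3 and CroCo v2.
Conclusion: The study shows that multi-view masked autoencoding designed specifically for 3D vision can effectively learn geometry-aware feature representations, providing a new and effective solution for 3D understanding tasks with advantages in both scalability and performance.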
📄 Abstract
Self-supervised learning on images seeks to extract meaningful visual representations from unlabeled data. When scaled to large datasets, this paradigm has achieved state-of-the-art performance and the resulting trained models such as DINOv3 have seen widespread adoption. However, most prior efforts are optimized for semantic understanding rather than geometric reasoning. One important exception is Cross-View Completion, CroCo, which is a form of masked autoencoding (MAE) tailored for 3D understanding. In this work, we continue on the path proposed by CroCo and focus on learning features tailored for 3D vision. In a nutshell, we extend MAE to arbitrarily many views of the same scene. By uniformly masking all views and employing a lightweight decoder with inter-frame attention, our approach is inherently simpler and more scalable than CroCo. We evaluate the resulting model, MuM, extensively on downstream tasks including feedforward reconstruction, dense image matching and relative pose estimation, finding that it outperforms the state-of-the-art visual encoders DINOv3 and CroCo v2.
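Uniform masking across views can be sketched as sampling an independent random patch mask, with the same ratio, for every view. The 0.75 mask ratio, the seed handling, and the function name are assumptions for illustration; the abstract states only that all views are masked uniformly.

```python
import random

def sample_multiview_masks(num_views, num_patches, mask_ratio=0.75, seed=0):
    """Return one boolean patch mask per view (True means masked).

    Hypothetical sketch: every view is treated identically, with the same
    number of masked patches drawn at random, so no view plays the role of
    an unmasked reference.
    """
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    masks = []
    for _ in range(num_views):
        masked = set(rng.sample(range(num_patches), num_masked))
        masks.append([i in masked for i in range(num_patches)])
    return masks
```

Because no view is special, the same routine works for any number of views, which is one plausible reading of why the abstract calls the design simpler and more scalable than a two-view scheme.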
[23] VLM-Augmented Degradation Modeling for Image Restoration Under Adverse Weather Conditions
Qianyi Shao, Yuanfan Zhang, Renxiang Xiao, Liang Hu
🧩 TL;DR
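This paper proposes a unified memory-enhanced vision-language recovery model that combines chain-of-thought inference from a vision-language model with an implicit memory bank to restore images under various adverse weather conditions, significantly improving restoration accuracy while maintaining computational efficiency.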
📘 Detailed Summary
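Motivation: Reliable visual perception under adverse weather is essential yet highly challenging for autonomous driving and outdoor robotics. Existing methods struggle to handle the complex degradation patterns that arise from different degradation levels and multiple weather conditions.
Method: MVLR uses a lightweight encoder-decoder backbone together with a vision-language model that performs chain-of-thought inference to encode weather degradation priors, and an implicit memory bank that stores continuous latent representations of degradation patterns. Dynamic cross-attention adaptively fuses multi-scale visual features with the retrieved degradation prototypes.
Result: Extensive experiments on four adverse-weather benchmarks show that MVLR outperforms single-branch and Mixture-of-Experts baselines in both peak signal-to-noise ratio and structural similarity, striking a good balance between model compactness and expressiveness.
Conclusion: The study shows that combining vision-language reasoning with a memory mechanism effectively improves image restoration under adverse weather, provides a practical solution for real-time deployment, and demonstrates the feasibility of efficient visual perception in diverse outdoor conditions.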
📄 Abstract
Reliable visual perception under adverse weather conditions, such as rain, haze, snow, or a mixture of them, is desirable yet challenging for autonomous driving and outdoor robots. In this paper, we propose a unified Memory-Enhanced Visual-Language Recovery (MVLR) model that restores images from different degradation levels under various weather conditions. MVLR couples a lightweight encoder-decoder backbone with a Visual-Language Model (VLM) and an Implicit Memory Bank (IMB). The VLM performs chain-of-thought inference to encode weather degradation priors and the IMB stores continuous latent representations of degradation patterns. The VLM-generated priors query the IMB to retrieve fine-grained degradation prototypes. These prototypes are then adaptively fused with multi-scale visual features via dynamic cross-attention mechanisms, enhancing restoration accuracy while maintaining computational efficiency. Extensive experiments on four severe-weather benchmarks show that MVLR surpasses single-branch and Mixture-of-Experts baselines in terms of Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM). These results indicate that MVLR offers a practical balance between model compactness and expressiveness for real-time deployment in diverse outdoor conditions.
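PSNR, one of the two reported metrics, follows directly from the mean squared error between the reference and restored images. The sketch below uses flat lists of pixel values and an assumed 8-bit peak of 255; it illustrates the standard formula and is not the paper's evaluation code.

```python
import math

def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher values mean the restored image is closer to the reference. SSIM, the other reported metric, additionally compares local luminance, contrast, and structure, and is usually taken from an image-processing library rather than written by hand.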
[24] REMSA: An LLM Agent for Foundation Model Selection in Remote Sensing
Binger Chen, Tacettin Emre Bök, Behnood Rasti, Volker Markl, Begüm Demir
🧩 TL;DR
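This paper presents the RSFM Database (RS-FMD) and the REMSA agent. The former is a structured resource covering more than 150 remote sensing foundation models; the latter is the first LLM-based system for automated remote sensing foundation model selection, addressing the model selection problem through natural language queries.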
📘 Detailed Summary
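Motivation: The wide adoption of remote sensing foundation models (RSFMs) is hampered by the difficulty of model selection, which stems from scattered documentation, heterogeneous formats, and varied deployment constraints. There is no systematic selection tool to help researchers and practitioners quickly find a model suited to a specific task.
Method: Builds RS-FMD, a structured database of more than 150 RSFMs spanning multiple data modalities, resolutions, and learning paradigms; develops the REMSA agent, which interprets user requirements, resolves missing constraints, ranks candidate models using in-context learning, and provides transparent justifications; and establishes a benchmark of 75 expert-verified query scenarios yielding 900 configurations under an expert-centered evaluation protocol.
Result: Under the expert-centered evaluation protocol, REMSA significantly outperforms several baselines, including naive agents, dense retrieval, and unstructured RAG-based LLMs. It runs on publicly available metadata only and does not access private or sensitive data.
Conclusion: The study provides the first systematic solution for remote sensing foundation model selection. Combining a structured database with an LLM agent significantly improves the efficiency and accuracy of model selection and lays the groundwork for standardized, automated model use in remote sensing.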
📄 Abstract
Foundation Models (FMs) are increasingly used in remote sensing (RS) for tasks such as environmental monitoring, disaster assessment, and land-use mapping. These models include unimodal vision encoders trained on a single data modality and multimodal architectures trained on combinations of SAR, multispectral, hyperspectral, and image-text data. They support diverse RS tasks including semantic segmentation, image classification, change detection, and visual question answering. However, selecting an appropriate remote sensing foundation model (RSFM) remains difficult due to scattered documentation, heterogeneous formats, and varied deployment constraints. We introduce the RSFM Database (RS-FMD), a structured resource covering over 150 RSFMs spanning multiple data modalities, resolutions, and learning paradigms. Built on RS-FMD, we present REMSA, the first LLM-based agent for automated RSFM selection from natural language queries. REMSA interprets user requirements, resolves missing constraints, ranks candidate models using in-context learning, and provides transparent justifications. We also propose a benchmark of 75 expert-verified RS query scenarios, producing 900 configurations under an expert-centered evaluation protocol. REMSA outperforms several baselines, including naive agents, dense retrieval, and unstructured RAG-based LLMs. It operates entirely on publicly available metadata and does not access private or sensitive data.
[25] PathAgent: Toward Interpretable Analysis of Whole-slide Pathology Images via Large Language Model-based Agentic Reasoning
Jingyun Chen, Linghan Cai, Zhikang Wang, Yi Huang, Songhan Jiang, Shenjin Huang, Hongpeng Wang, Yongbing Zhang
🧩 TL;DR
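This paper proposes PathAgent, a training-free agent framework based on large language models that analyzes whole-slide images transparently by emulating a pathologist's stepwise reasoning. Its Navigator, Perceptor, and Executor modules work together to produce interpretable decision trajectories, and it shows strong zero-shot generalization across multiple datasets.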
📘 Detailed Summary
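Motivation: Existing computational pipelines for whole-slide images lack an explicit reasoning trajectory, so their predictions are opaque and hard to justify. They cannot emulate the iterative, evidence-driven reasoning in which pathologists dynamically zoom, refocus, and self-correct.
Method: PathAgent is a training-free, LLM-based agent framework with three core modules: the Navigator iteratively and precisely locates significant micro-regions, the Perceptor extracts morphological visual cues, and the Executor integrates these findings into a continuously evolving natural language trajectory, forming an explicit chain-of-thought.
Result: Evaluation on five challenging datasets shows that PathAgent has strong zero-shot generalization, surpassing task-specific baselines in both open-ended and constrained visual question answering. A collaborative evaluation with human pathologists confirms its potential as a transparent and clinically grounded diagnostic assistant.
Conclusion: By emulating the reflective, stepwise analysis of human experts, PathAgent provides a fully interpretable prediction framework for computational pathology. Its training-free design combined with strong generalization makes it a promising tool for transparent decision support in clinical diagnosis.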
📄 Abstract
Analyzing whole-slide images (WSIs) requires an iterative, evidence-driven reasoning process that parallels how pathologists dynamically zoom, refocus, and self-correct while collecting the evidence. However, existing computational pipelines often lack this explicit reasoning trajectory, resulting in inherently opaque and unjustifiable predictions. To bridge this gap, we present PathAgent, a training-free, large language model (LLM)-based agent framework that emulates the reflective, stepwise analytical approach of human experts. PathAgent can autonomously explore WSI, iteratively and precisely locating significant micro-regions using the Navigator module, extracting morphology visual cues using the Perceptor, and integrating these findings into the continuously evolving natural language trajectories in the Executor. The entire sequence of observations and decisions forms an explicit chain-of-thought, yielding fully interpretable predictions. Evaluated across five challenging datasets, PathAgent exhibits strong zero-shot generalization, surpassing task-specific baselines in both open-ended and constrained visual question-answering tasks. Moreover, a collaborative evaluation with human pathologists confirms PathAgent's promise as a transparent and clinically grounded diagnostic assistant.
[26] Sparse Reasoning is Enough: Biological-Inspired Framework for Video Anomaly Detection with Large Pre-trained Models
He Huang, Zixuan Hu, Dongxiao Li, Yao Xiao, Ling-Yu Duan
🧩 TL;DR
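This paper proposes ReCoVAD, a framework that emulates the dual-pathway mechanism of the human nervous system to process frames selectively. It reaches state-of-the-art training-free performance in video anomaly detection while processing only a small fraction of frames, significantly reducing computational cost.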
📘 Detailed Summary
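Motivation: Existing video anomaly detection methods built on large pre-trained models usually rely on dense frame-level inference, which incurs high computational cost and latency. This study asks whether sparse reasoning is sufficient for effective anomaly detection when powerful pre-trained models are available.
Method: ReCoVAD uses a dual-pathway architecture. The Reflex pathway uses a lightweight CLIP module to fuse visual features with prototype prompts and queries a dynamic memory for fast response. The Conscious pathway uses a medium-scale vision-language model to generate textual event descriptions and refined anomaly scores, while an integrated large language model periodically reviews the descriptions to identify unseen anomalies and refine the prototypes.
Result: Experiments show that ReCoVAD achieves state-of-the-art training-free performance while processing only 28.55% and 16.04% of the frames on the UCF-Crime and XD-Violence datasets, respectively, making it markedly more computationally efficient than previous methods.
Conclusion: The study shows that, with the support of large pre-trained models, sparse reasoning is sufficient for effective video anomaly detection. It offers an efficient solution for real-time applications, and the dual-pathway mechanism provides an architectural template for other video understanding tasks.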
📄 Abstract
Video anomaly detection (VAD) plays a vital role in real-world applications such as security surveillance, autonomous driving, and industrial monitoring. Recent advances in large pre-trained models have opened new opportunities for training-free VAD by leveraging rich prior knowledge and general reasoning capabilities. However, existing studies typically rely on dense frame-level inference, incurring high computational costs and latency. This raises a fundamental question: Is dense reasoning truly necessary when using powerful pre-trained models in VAD systems? To answer this, we propose ReCoVAD, a novel framework inspired by the dual reflex and conscious pathways of the human nervous system, enabling selective frame processing to reduce redundant computation. ReCoVAD consists of two core pathways: (i) a Reflex pathway that uses a lightweight CLIP-based module to fuse visual features with prototype prompts and produce decision vectors, which query a dynamic memory of past frames and anomaly scores for fast response; and (ii) a Conscious pathway that employs a medium-scale vision-language model to generate textual event descriptions and refined anomaly scores for novel frames. It continuously updates the memory and prototype prompts, while an integrated large language model periodically reviews accumulated descriptions to identify unseen anomalies, correct errors, and refine prototypes. Extensive experiments show that ReCoVAD achieves state-of-the-art training-free performance while processing only 28.55% and 16.04% of the frames used by previous methods on the UCF-Crime and XD-Violence datasets, demonstrating that sparse reasoning is sufficient for effective large-model-based VAD.
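The selective routing between the two pathways can be pictured as a cache lookup: a cheap decision vector is compared against a memory of past frames, and only frames that look novel are sent to the expensive model. The cosine-similarity test, the 0.9 threshold, and all names below are assumptions for illustration; the abstract does not specify how novelty is decided.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def sparse_anomaly_scores(decision_vectors, slow_scorer, threshold=0.9):
    """Score frames, calling `slow_scorer` only on frames unlike any in memory.

    Hypothetical sketch: `slow_scorer` stands in for the Conscious pathway;
    familiar frames reuse the score of their nearest memory entry.
    """
    memory = []  # list of (decision_vector, anomaly_score)
    scores, slow_calls = [], 0
    for vec in decision_vectors:
        best = max(memory, key=lambda item: cosine(vec, item[0]), default=None)
        if best is not None and cosine(vec, best[0]) >= threshold:
            scores.append(best[1])  # reflex: reuse the remembered score
        else:
            score = slow_scorer(vec)  # conscious: expensive inference
            slow_calls += 1
            memory.append((vec, score))
            scores.append(score)
    return scores, slow_calls
```

In this sketch the fraction `slow_calls / len(decision_vectors)` plays the role of the processed-frame percentage reported in the abstract.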
[27] Bridging Visual Affective Gap: Borrowing Textual Knowledge by Learning from Noisy Image-Text Pairs
Daiqing Wu, Dongbao Yang, Yu Zhou, Can Ma
🧩 TL;DR
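This paper proposes Partitioned Adaptive Contrastive Learning (PACL), which borrows emotional knowledge from a pre-trained textual model to enhance the emotional perception of visual models, thereby bridging the "affective gap" in visual emotion recognition.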
📘 Detailed Summary
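Motivation: Visual emotion recognition suffers from an "affective gap": the factual-level features of pre-trained visual models have no direct association with emotional categories, which limits how well pre-trained knowledge transfers to emotion recognition. In contrast, the textual modality has explicit emotional expression and high information density, which eliminates this gap.
Method: Proposes Partitioned Adaptive Contrastive Learning (PACL), which focuses on the factual and emotional connections between images and texts in social media data. It separates different types of samples, designs a distinct contrastive learning strategy for each type, and dynamically constructs positive and negative pairs to fully exploit the potential of noisy samples.
Result: Comprehensive experiments show that bridging the "affective gap" significantly improves the performance of various pre-trained visual models on downstream emotion-related tasks, validating the proposed method.
Conclusion: The study shows that borrowing emotional knowledge from the textual modality effectively enhances the emotional perception of visual models and offers a new approach to cross-modal emotion recognition. PACL also provides an effective technical framework for handling noisy social media data.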
📄 Abstract
Visual emotion recognition (VER) is a longstanding field that has garnered increasing attention with the advancement of deep neural networks. Although recent studies have achieved notable improvements by leveraging the knowledge embedded within pre-trained visual models, the lack of direct association between factual-level features and emotional categories, called the "affective gap", limits the applicability of pre-training knowledge for VER tasks. In contrast, the explicit emotional expression and high information density in textual modality eliminate the "affective gap". Therefore, we propose borrowing the knowledge from the pre-trained textual model to enhance the emotional perception of pre-trained visual models. We focus on the factual and emotional connections between images and texts in noisy social media data, and propose Partitioned Adaptive Contrastive Learning (PACL) to leverage these connections. Specifically, we separate different types of samples and devise distinct contrastive learning strategies for each type. By dynamically constructing negative and positive pairs, we fully exploit the potential of noisy samples. Through comprehensive experiments, we demonstrate that bridging the "affective gap" significantly improves the performance of various pre-trained visual models in downstream emotion-related tasks. Our code is released at https://github.com/wdqqdw/PACL.
[28] ChainV: Atomic Visual Hints Make Multimodal Reasoning Shorter and Better
Yuan Zhang, Ming Lu, Junwen Pan, Tao Huang, Kuan Cheng, Qi She, Shanghang Zhang
🧩 TL;DR
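ChainV is a multimodal reasoning framework that dynamically integrates visual hints. Through visual patch selection and attention-intensity analysis, it makes reasoning shorter and more accurate, significantly improving reasoning accuracy and efficiency on math-intensive benchmarks.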
📘 Detailed Summary
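Motivation: Current multimodal reasoning models exhibit redundant self-reflection when generating long reasoning chains, and training-free CoT compression methods that rely on static visual references bring only limited gains for multimodal reasoning. A reasoning framework that dynamically integrates visual hints is therefore needed.
Method: ChainV first performs a coarse visual patch selection based on the previous reasoning step, then identifies the most representative atomic visual hint according to the averaged attention intensity. It introduces a consistency-based evaluation mechanism to assess the reliability of the chosen hint, and finally incorporates the pixel coordinates of the selected hint and its reliability into the thinking process through a Bernoulli stochastic process.
Result: On the MathVista benchmark, ChainV achieves a 2.3% accuracy improvement on MIMO-VL-RL while reducing inference latency by 51.4% and shortening output token length by 24.5%. It performs especially well on math-intensive tasks that require multi-step symbolic reasoning.
Conclusion: The study shows that dynamic integration of visual hints effectively improves the efficiency and accuracy of multimodal reasoning and offers a new way to reduce redundant self-reflection. It could be extended to more complex multimodal reasoning scenarios in the future.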
📄 Abstract
Recent advances in multimodal reasoning models have demonstrated impressive capabilities across text and vision. However, even leading models exhibit redundant self-reflection when generating lengthy reasoning chains. While training-free CoT compression methods have emerged in the LLM domain, they rely on static visual references and thus provide limited gains for multimodal reasoning. Therefore, we propose ChainV, a framework that dynamically integrates visual hints into the reasoning process, thereby making multimodal reasoning shorter and better. Specifically, ChainV first performs a coarse visual patch selection based on the previous reasoning step, then refines it by identifying the most representative atomic visual hint according to the averaged attention intensity. Additionally, ChainV introduces a consistency-based evaluation mechanism to assess the reliability of the chosen hint, guiding the model to adaptively adjust its level of self-reflection. Eventually, the pixel coordinates of the selected visual hint and its reliability are incorporated into thinking with a Bernoulli stochastic process. Experiments indicate that our method significantly improves reasoning accuracy and efficiency, especially on math-intensive benchmarks where visual hints are crucial for multi-step symbolic reasoning. For example, ChainV achieves a 2.3% improvement on MathVista within MIMO-VL-RL, while reducing inference latency by 51.4% and shortening output token length by 24.5%.
[29] A Multi-Stage Optimization Framework for Deploying Learned Image Compression on FPGAs
Jiaxun Fang, Li Chen
🧩 TL;DR
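This study proposes a complete multi-stage optimization framework that converts high-performance floating-point image compression models into efficient, hardware-friendly integer implementations. Using dynamic range-aware quantization, progressive mixed-precision search, and channel pruning, it delivers an FPGA image compression system that is efficient while retaining excellent rate-distortion performance.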
📘 Detailed Summary
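Motivation: Deep learning-based image compression models achieve state-of-the-art rate-distortion performance, but deploying them on resource-constrained FPGAs remains a major challenge. The main problems are quantization-induced performance degradation and the difficulty of balancing hardware resource limits against model complexity.
Method: Proposes a dynamic range-aware quantization method that uses statistically calibrated activation clipping and a novel weight regularization scheme to counteract extreme data outliers and large dynamic ranges; develops a progressive mixed-precision search algorithm that assigns optimal, non-uniform bit-widths to each layer; and designs a channel pruning method adapted to GDN layers to remove model redundancy.
Result: The foundational DRAQ method reduces the BD-rate overhead of a GDN-based model from 30% to 6.3%. The subsequent hardware-aware optimizations further reduce computational complexity by over 20% with negligible impact on rate-distortion performance, and the final model surpasses existing FPGA-based image compression implementations in both efficiency and quality.
Conclusion: The study demonstrates that systematic quantization, precision optimization, and pruning can significantly reduce the hardware implementation complexity of deep learning image compression models while preserving high performance, offering a viable path to efficient AI deployment on resource-constrained devices.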
📄 Abstract
Deep learning-based image compression (LIC) has achieved state-of-the-art rate-distortion (RD) performance, yet deploying these models on resource-constrained FPGAs remains a major challenge. This work presents a complete, multi-stage optimization framework to bridge the gap between high-performance floating-point models and efficient, hardware-friendly integer-based implementations. First, we address the fundamental problem of quantization-induced performance degradation. We propose a Dynamic Range-Aware Quantization (DRAQ) method that uses statistically-calibrated activation clipping and a novel weight regularization scheme to counteract the effects of extreme data outliers and large dynamic ranges, successfully creating a high-fidelity 8-bit integer model. Second, building on this robust foundation, we introduce two hardware-aware optimization techniques tailored for FPGAs. A progressive mixed-precision search algorithm exploits FPGA flexibility to assign optimal, non-uniform bit-widths to each layer, minimizing complexity while preserving performance. Concurrently, a channel pruning method, adapted to work with the Generalized Divisive Normalization (GDN) layers common in LIC, removes model redundancy by eliminating inactive channels. Our comprehensive experiments show that the foundational DRAQ method reduces the BD-rate overhead of a GDN-based model from 30% to 6.3%. The subsequent hardware-aware optimizations further reduce computational complexity by over 20% with negligible impact on RD performance, yielding a final model that is both state-of-the-art in efficiency and superior in quality to existing FPGA-based LIC implementations.
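Statistically calibrated clipping before 8-bit quantization can be illustrated as follows. The abstract does not describe DRAQ's actual calibration rule, so the percentile-based clip, the symmetric int8 range, and all function names here are assumptions chosen to show why clipping helps when outliers dominate the dynamic range.

```python
def calibrate_clip(samples, percentile=99.9):
    """Pick a clipping magnitude from calibration data (assumed rule).

    Using a high percentile of |x| instead of the maximum keeps rare
    outliers from stretching the quantization range.
    """
    magnitudes = sorted(abs(s) for s in samples)
    index = min(len(magnitudes) - 1, int(len(magnitudes) * percentile / 100.0))
    return magnitudes[index]

def quantize_int8(values, clip):
    """Symmetric 8-bit quantization with saturation at +/- clip."""
    if clip <= 0:
        raise ValueError("clip must be positive")
    scale = clip / 127.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Map integers back to approximate real values."""
    return [q * scale for q in quantized]
```

With a calibration set of values in [0, 99] plus one outlier at 1000, a 99th-percentile clip yields a step of about 0.78 instead of about 7.9, so typical activations keep roughly ten times finer resolution at the cost of saturating the outlier.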
[30] FireScope: Wildfire Risk Prediction with a Chain-of-Thought Oracle
Mario Markov, Stefan Maria Ailuro, Luc Van Gool, Konrad Schindler, Danda Pani Paudel
🧩 TL;DR
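This paper presents the FireScope-Bench dataset and the FireScope framework, described as the first framework to show that language-based reasoning can improve visual generation. Through multimodal understanding and causal reasoning, it significantly improves the generalization and interpretability of cross-continental wildfire risk prediction.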
📘 Detailed Summary
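Motivation: Existing wildfire risk prediction methods lack causal reasoning and multimodal understanding, which leads to poor generalization and prevents reliable use across continents, limiting the reliability of real-world deployment.
Method: Proposes FireScope, a VLM-based reasoning-to-generation framework that combines reinforcement learning with visual supervision to predict risk rasters along with complementary reasoning traces, and builds FireScope-Bench, a large-scale dataset based on Sentinel-2 imagery and climate data.
Result: When trained in the USA and tested in Europe, FireScope achieves substantial performance gains. Expert feedback and automated analysis confirm that its reasoning traces are faithful and semantically meaningful, supporting the effectiveness of cross-continental generalization.
Conclusion: The study shows that language-based reasoning can ground raster prediction models, improving both generalization and interpretability. It lays a foundation for reasoning-driven, interpretable spatial modeling and opens a research direction in which language reasoning improves generalization in visual generation.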
📄 Abstract
Predicting wildfire risk is a reasoning-intensive spatial problem that requires the integration of visual, climatic, and geographic factors to infer continuous risk maps. Existing methods lack the causal reasoning and multimodal understanding required for reliable generalization. We introduce FireScope-Bench, a large-scale dataset and benchmark that couples Sentinel-2 imagery and climate data with expert-defined risk rasters across the USA, and real wildfire events in Europe for cross-continental evaluation. Building on this dataset, we propose FireScope, a VLM-based reasoning-to-generation framework that learns from both reinforcement learning and visual supervision to predict risk rasters with complementary reasoning traces. When trained in the USA and tested in Europe, FireScope achieves substantial performance gains, while expert feedback and automated analysis confirm that its reasoning traces are faithful and semantically meaningful. Our findings demonstrate that reasoning can ground raster prediction models, improving both generalization and interpretability. To our knowledge, this is the first framework to (1) demonstrate that language-based reasoning can improve generalization in visual generation, (2) propose a high-resolution wildfire risk model that can be applied across continents, and (3) enable systematic studies of robust cross-continental generalization for multimodal fire risk models. We believe that FireScope-Bench has the potential to serve as a foundation for advancing reasoning-driven, interpretable and generalizable spatial modeling. Data and source code will be made publicly available.
[31] Investigating self-supervised representations for audio-visual deepfake detection
Dragos-Alexandru Boldisor, Stefan Smeu, Dan Oneata, Elisabeta Oneata
🧩 TL;DR
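This study systematically evaluates the potential of self-supervised representations for audio-visual deepfake detection. It finds that these features capture meaningful, complementary deepfake-relevant information, but that cross-dataset generalization remains limited.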
📘 Detailed Summary
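Motivation: Self-supervised representations excel at vision and speech tasks, but their potential for audio-visual deepfake detection is underexplored. Prior work either uses these features in isolation or buries them inside complex architectures, so a systematic evaluation is missing.
Method: Systematically evaluates self-supervised features across modalities (audio, video, multimodal) and domains (lip movements, generic visual content) along three dimensions: detection effectiveness, interpretability of the encoded information, and cross-modal complementarity.
Result: Most self-supervised features capture deepfake-relevant information, and this information is complementary. Models primarily attend to semantically meaningful regions rather than spurious artifacts. However, none generalizes reliably across datasets, a failure that likely stems from dataset characteristics rather than from the features latching onto superficial patterns.
Conclusion: Self-supervised representations show both promise and fundamental challenges for deepfake detection: they learn meaningful patterns, but robust cross-domain performance remains elusive, which points to a direction for future research.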
📄 Abstract
Self-supervised representations excel at many vision and speech tasks, but their potential for audio-visual deepfake detection remains underexplored. Unlike prior work that uses these features in isolation or buried within complex architectures, we systematically evaluate them across modalities (audio, video, multimodal) and domains (lip movements, generic visual content). We assess three key dimensions: detection effectiveness, interpretability of encoded information, and cross-modal complementarity. We find that most self-supervised features capture deepfake-relevant information, and that this information is complementary. Moreover, models primarily attend to semantically meaningful regions rather than spurious artifacts. Yet none generalize reliably across datasets. This generalization failure likely stems from dataset characteristics, not from the features themselves latching onto superficial patterns. These results expose both the promise and fundamental challenges of self-supervised representations for deepfake detection: while they learn meaningful patterns, achieving robust cross-domain performance remains elusive.
[32] Navigating in the Dark: A Multimodal Framework and Dataset for Nighttime Traffic Sign Recognition
Aditya Mishra, Akshay Agarwal, Haroon Lone
🧩 TL;DR
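This paper introduces INTSD, a nighttime traffic sign dataset, and LENS-Net, a framework that combines an adaptive image enhancement detector with a multimodal CLIP-GCNN classifier to address the poor illumination and visual noise that hamper nighttime traffic sign recognition. The method surpasses existing frameworks on both detection and classification, offering a nighttime vision solution for autonomous driving and intelligent transportation systems.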
📘 Detailed Summary
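Motivation: Nighttime traffic sign recognition is hampered by visual noise and a scarcity of public datasets. Existing methods are not robust under low illumination and fail to exploit complementary multimodal cues. Overcoming these limits calls for a large-scale dataset dedicated to nighttime conditions and an end-to-end framework that handles illumination correction and semantic understanding together.
Method: First builds the INTSD dataset, which contains nighttime images of 41 traffic sign classes collected across India. Then proposes LENS-Net, which integrates an adaptive image enhancement detector for joint illumination correction and sign localization, followed by a structured multimodal CLIP-GCNN classifier that uses cross-modal attention and graph-based reasoning for robust recognition.
Result: LENS-Net surpasses existing state-of-the-art frameworks on both detection and classification, and ablation studies confirm the effectiveness of its key components. The framework is evaluated extensively on INTSD, which covers varied lighting and weather conditions.
Conclusion: The work provides a large-scale Indian dataset and an end-to-end solution for nighttime traffic sign recognition, and highlights the importance of multimodal fusion and adaptive image enhancement in low-light scenes for autonomous driving systems that must operate reliably at night.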
📄 Abstract
Traffic signboards are vital for road safety and intelligent transportation systems, enabling navigation and autonomous driving. Yet, recognizing traffic signs at night remains challenging due to visual noise and scarcity of public nighttime datasets. Despite advances in vision architectures, existing methods struggle with robustness under low illumination and fail to leverage complementary multimodal cues effectively. To overcome these limitations, firstly, we introduce INTSD, a large-scale dataset comprising street-level night-time images of traffic signboards collected across diverse regions of India. The dataset spans 41 traffic signboard classes captured under varying lighting and weather conditions, providing a comprehensive benchmark for both detection and classification tasks. To benchmark INTSD for night-time sign recognition, we conduct extensive evaluations using state-of-the-art detection and classification models. Secondly, we propose LENS-Net, which integrates an adaptive image enhancement detector for joint illumination correction and sign localization, followed by a structured multimodal CLIP-GCNN classifier that leverages cross-modal attention and graph-based reasoning for robust and semantically consistent recognition. Our method surpasses existing frameworks, with ablation studies confirming the effectiveness of its key components. The dataset and code for LENS-Net are publicly available for research.
[33] VLA-4D: Embedding 4D Awareness into Vision-Language-Action Models for SpatioTemporally Coherent Robotic Manipulation
Hanyu Zhou, Chuanhao Ma, Gim Hee Lee
🧩 TL;DR
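This paper proposes VLA-4D, a general vision-language-action model with 4D awareness. By introducing a 4D-aware visual representation and a spatiotemporal action representation, it addresses the challenge of spatiotemporally coherent robotic manipulation.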
📘 Detailed Summary
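Motivation: Existing vision-language-action models struggle with spatiotemporally coherent manipulation. Current methods embed 3D positions into visual representations to improve spatial precision, but they have difficulty achieving temporally coherent control over action execution.
Method: Proposes two key designs. (1) A 4D-aware visual representation: extract visual features, embed 1D time into 3D positions to form 4D embeddings, and fuse them into a unified visual representation via cross-attention. (2) A spatiotemporal action representation: extend conventional spatial action representations with temporal information to enable spatiotemporal planning, and align the multimodal representations into the LLM for spatiotemporal action prediction.
Result: Extensive experiments verify the method's superiority across different robotic manipulation tasks. The designed visual and action representations jointly make manipulation spatially smooth and temporally coherent.
Conclusion: A unified framework achieves spatiotemporal coherence in robotic manipulation, the VLA dataset extended with temporal action annotations supports model fine-tuning, and the approach offers a new solution for fine-grained manipulation control.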
📄 Abstract
Vision-language-action (VLA) models show potential for general robotic tasks, but spatiotemporally coherent manipulation, which requires fine-grained representations, remains challenging for them. Typically, existing methods embed 3D positions into visual representations to enhance the spatial precision of actions. However, these methods struggle to achieve temporally coherent control over action execution. In this work, we propose VLA-4D, a general VLA model with 4D awareness for spatiotemporally coherent robotic manipulation. Our model is guided by two key designs: 1) 4D-aware visual representation. We extract visual features, embed 1D time into 3D positions for 4D embeddings, and fuse them into a unified visual representation via a cross-attention mechanism. 2) Spatiotemporal action representation. We extend conventional spatial action representations with temporal information to enable spatiotemporal planning, and align the multimodal representations into the LLM for spatiotemporal action prediction. Within this unified framework, the designed visual and action representations jointly make robotic manipulation spatially smooth and temporally coherent. In addition, we extend the VLA dataset with temporal action annotations for fine-tuning our model. Extensive experiments have been conducted to verify the superiority of our method across different tasks of robotic manipulation.
[34] Continual Alignment for SAM: Rethinking Foundation Models for Medical Image Segmentation in Continual Learning
Jiayi Wang, Wei Dai, Haoyu Wang, Sihan Yang, Haixia Bi, Jian Sun
🧩 TL;DR
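This paper proposes CA-SAM, which uses a lightweight Alignment Layer and a continual learning strategy to address computational efficiency and catastrophic forgetting in medical image segmentation while preserving SAM's zero-shot priors, achieving state-of-the-art performance on nine medical segmentation datasets.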
📘 Detailed Summary
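Motivation: In medical image segmentation, heterogeneous privacy policies across institutions make joint training infeasible. The Segment Anything Model (SAM) offers strong zero-shot priors, but its large parameter count and computational overhead limit practical deployment, so computational efficiency and performance need to be balanced.
Method: Introduces the Alignment Layer, a lightweight plug-and-play module that aligns encoder-decoder feature distributions to adapt SAM efficiently to specific medical images. Building on it, the CA-SAM continual learning strategy automatically adapts the appropriate Alignment Layer to mitigate catastrophic forgetting, while leveraging SAM's zero-shot priors to keep strong performance on unseen datasets.
Result: Experiments on nine medical segmentation datasets under a continual-learning scenario show that CA-SAM achieves state-of-the-art performance, maintaining high accuracy while reducing computational overhead.
Conclusion: The SAM paradigm is highly promising once computational efficiency and performance are balanced. The Alignment Layer and the CA-SAM strategy offer an effective approach to continual learning in medical image segmentation, adapting to changing data streams without sacrificing performance.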
📄 Abstract
In medical image segmentation, heterogeneous privacy policies across institutions often make joint training on pooled datasets infeasible, motivating continual image segmentation: learning from data streams without catastrophic forgetting. While the Segment Anything Model (SAM) offers strong zero-shot priors and has been widely fine-tuned across downstream tasks, its large parameter count and computational overhead challenge practical deployment. This paper demonstrates that the SAM paradigm is highly promising once its computational efficiency and performance can be balanced. To this end, we introduce the Alignment Layer, a lightweight, plug-and-play module which aligns encoder-decoder feature distributions to efficiently adapt SAM to specific medical images, improving accuracy while reducing computation. Building on SAM and the Alignment Layer, we then propose Continual Alignment for SAM (CA-SAM), a continual learning strategy that automatically adapts the appropriate Alignment Layer to mitigate catastrophic forgetting, while leveraging SAM's zero-shot priors to preserve strong performance on unseen medical datasets. In experiments across nine medical segmentation datasets under a continual-learning scenario, CA-SAM achieves state-of-the-art performance. Our code, models and datasets will be released on https://github.com/azzzzyo/Continual-Alignment-for-SAM.
[35] Scaling Self-Supervised and Cross-Modal Pretraining for Volumetric CT Transformers
Cris Claessens, Christiaan Viviers, Giacomo D'Amicantonio, Egor Bondarev, Fons van der Sommen
🧩 TL;DR
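SPECTRE is a Transformer-based foundation model for 3D CT. By combining self-supervised learning with vision-language pretraining, it achieves state-of-the-art performance on multiple CT benchmarks and shows that high-performing, general-purpose CT representations can be obtained without private data.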
📘 Detailed Summary
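Motivation: Volumetric CT poses unique challenges, including extreme token scaling, geometric anisotropy, and weak or noisy clinical supervision, which make standard Transformer and contrastive learning recipes ineffective out of the box.
Method: SPECTRE uses a scalable 3D Vision Transformer architecture that jointly optimizes a local Transformer for high-resolution volumetric feature extraction and a global Transformer for whole-scan context modeling. Pretraining combines DINO-style self-distillation with SigLIP-based vision-language alignment on paired radiology reports.
Result: Across multiple CT benchmarks, SPECTRE consistently outperforms prior CT foundation models in both zero-shot and fine-tuned settings.
Conclusion: High-performing, generalizable CT representations can be learned entirely from openly available datasets, without relying on private data. SPECTRE offers a scalable, open, fully Transformer-based foundation model for 3D medical imaging.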
📄 Abstract
We introduce SPECTRE, a fully transformer-based foundation model for volumetric computed tomography (CT). Our Self-Supervised & Cross-Modal Pretraining for CT Representation Extraction (SPECTRE) approach utilizes scalable 3D Vision Transformer architectures and modern self-supervised and vision-language pretraining strategies to learn general-purpose CT representations. Volumetric CT poses unique challenges, such as extreme token scaling, geometric anisotropy, and weak or noisy clinical supervision, that make standard transformer and contrastive learning recipes ineffective out of the box. The framework jointly optimizes a local transformer for high-resolution volumetric feature extraction and a global transformer for whole-scan context modeling, making large-scale 3D attention computationally tractable. Notably, SPECTRE is trained exclusively on openly available CT datasets, demonstrating that high-performing, generalizable representations can be achieved without relying on private data. Pretraining combines DINO-style self-distillation with SigLIP-based vision-language alignment using paired radiology reports, yielding features that are both geometrically consistent and clinically meaningful. Across multiple CT benchmarks, SPECTRE consistently outperforms prior CT foundation models in both zero-shot and fine-tuned settings, establishing SPECTRE as a scalable, open, and fully transformer-based foundation model for 3D medical imaging.
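The SigLIP-style alignment mentioned above can be illustrated with a minimal sketch of the pairwise sigmoid loss, in which every image-text pair in a batch is scored independently as a match or a non-match. The abstract does not give SPECTRE's exact formulation, so this follows the generic SigLIP recipe; the function names, the temperature and bias values, and the toy embeddings are assumptions for illustration, and embeddings are assumed to be L2-normalized.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def sigmoid_pairwise_loss(image_emb, text_emb, temperature=10.0, bias=-10.0):
    """Generic SigLIP-style loss over a batch of paired embeddings.

    image_emb[i] and text_emb[i] are assumed to be a matching pair
    (for example a CT volume and its report). temperature and bias are
    illustrative fixed values; in practice they are usually learned.
    """
    n = len(image_emb)
    total = 0.0
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(image_emb[i], text_emb[j]))
            logit = temperature * dot + bias
            label = 1.0 if i == j else -1.0  # +1 for matched pairs, -1 otherwise
            total += log_sigmoid(label * logit)
    return -total / n

# Toy usage: correctly paired embeddings should give a lower loss than swapped ones.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = sigmoid_pairwise_loss(images, [[1.0, 0.0], [0.0, 1.0]])
swapped = sigmoid_pairwise_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

Because each pair contributes its own binary term, the loss needs no softmax normalization over the batch, which is one reason this family of losses is often chosen when batches are small or memory is tight.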
[36] A Little More Like This: Text-to-Image Retrieval with Vision-Language Models Using Relevance Feedback
Bulat Khaertdinov, Mirela Popa, Nava Tintarev
🧩 TL;DR
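This paper proposes a relevance feedback mechanism that improves the retrieval performance of vision-language models at inference time, without fine-tuning. It covers four feedback strategies and improves retrieval on multiple benchmarks, with larger gains for smaller models.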
📘 Detailed Summary
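Motivation: Large vision-language models support visual search with natural language queries, but improving their performance usually requires fine-tuning or scaling to larger variants. This work explores relevance feedback at inference time as a way to improve retrieval without retraining or enlarging the model.
Method: Evaluates four relevance feedback strategies: classical pseudo-relevance feedback (PRF), which refines the query embedding from top-ranked results; generative relevance feedback (GRF), which refines the query with synthetic captions; an attentive feedback summarizer (AFS), a custom transformer-based model that integrates multimodal fine-grained features from relevant items; and explicit feedback simulated with ground-truth captions as an upper-bound baseline.
Result: On Flickr30k and COCO, GRF, AFS, and explicit feedback improve MRR@5 over retrieval with no feedback by 3-5% for smaller VLMs and 1-3% for larger ones. In iterative, multi-turn retrieval, AFS is more robust than GRF and mitigates query drift, similarly to explicit feedback.
Conclusion: Relevance feedback consistently improves retrieval across VLMs of different sizes and opens opportunities for interactive, adaptive visual search. Because it is model-agnostic, it can serve as an alternative to fine-tuning or be used together with fine-tuned models.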
📄 Abstract
Large vision-language models (VLMs) enable intuitive visual search using natural language queries. However, improving their performance often requires fine-tuning and scaling to larger model variants. In this work, we propose a mechanism inspired by traditional text-based search to improve retrieval performance at inference time: relevance feedback. While relevance feedback can serve as an alternative to fine-tuning, its model-agnostic design also enables use with fine-tuned VLMs. Specifically, we introduce and evaluate four feedback strategies for VLM-based retrieval. First, we revise classical pseudo-relevance feedback (PRF), which refines query embeddings based on top-ranked results. To address its limitations, we propose generative relevance feedback (GRF), which uses synthetic captions for query refinement. Furthermore, we introduce an attentive feedback summarizer (AFS), a custom transformer-based model that integrates multimodal fine-grained features from relevant items. Finally, we simulate explicit feedback using ground-truth captions as an upper-bound baseline. Experiments on Flickr30k and COCO with the VLM backbones show that GRF, AFS, and explicit feedback improve retrieval performance by 3-5% in MRR@5 for smaller VLMs, and 1-3% for larger ones, compared to retrieval with no feedback. Moreover, AFS, similarly to explicit feedback, mitigates query drift and is more robust than GRF in iterative, multi-turn retrieval settings. Our findings demonstrate that relevance feedback can consistently enhance retrieval across VLMs and open up opportunities for interactive and adaptive visual search.
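As a rough illustration of pseudo-relevance feedback on embeddings, the sketch below applies a Rocchio-style update that moves the query toward the centroid of its top-ranked results, and also shows how a reciprocal rank at a cutoff can be computed. This is a generic textbook variant, not the revised PRF from the paper; the function names, the weights `alpha` and `beta`, and the toy vectors are assumptions.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cosine(a, b):
    """Cosine similarity of two vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))

def rank(query, gallery):
    """Gallery indices sorted by decreasing similarity to the query."""
    return sorted(range(len(gallery)),
                  key=lambda i: cosine(query, gallery[i]), reverse=True)

def pseudo_relevance_feedback(query, gallery, k=2, alpha=1.0, beta=0.5):
    """Rocchio-style update: treat the top-k results as relevant and pull
    the query toward their centroid. Assumes a non-empty gallery."""
    top = rank(query, gallery)[:k]
    centroid = [sum(gallery[i][d] for i in top) / len(top)
                for d in range(len(query))]
    return normalize([alpha * q + beta * c for q, c in zip(query, centroid)])

def reciprocal_rank_at_k(ranking, relevant, k=5):
    """1 / position of the relevant item if it appears in the top k, else 0."""
    for pos, idx in enumerate(ranking[:k], start=1):
        if idx == relevant:
            return 1.0 / pos
    return 0.0

# Toy usage with 2-D stand-ins for text and image embeddings.
gallery = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
query = [1.0, 0.0]
ranking = rank(query, gallery)
refined = pseudo_relevance_feedback(query, gallery, k=2)
```

Averaging `reciprocal_rank_at_k` over all queries gives MRR@5. Since the top-k items are only assumed to be relevant, a wrong first ranking can pull the query off target, which is the query drift the abstract refers to.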
[37] SpatialGeo: Boosting Spatial Reasoning in Multimodal LLMs via Geometry-Semantics Fusion
Jiajie Guo, Qingpeng Zhu, Jin Zeng, Xiaolong Wu, Changyong He, Weida Wang
🧩 TL;DR
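This paper proposes SpatialGeo, a new vision encoder based on hierarchical fusion of geometry and semantics features. It generates spatially aware visual embeddings that strengthen the spatial reasoning of multimodal large language models, improving accuracy on spatial reasoning tasks while lowering memory cost.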
📘 Detailed Summary
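Motivation: Current multimodal large language models have limited ability to interpret and reason about spatial arrangements in 3D space. The problem is traced to the lossy embeddings of existing vision encoders such as CLIP, which are restricted to instance-level semantic features and therefore leave spatial relations ambiguous.
Method: Proposes SpatialGeo, a vision encoder based on hierarchical fusion of geometry and semantics features. A hierarchical adapter complements CLIP's semantic features with geometry features from self-supervised learning, random feature dropping prevents the trivial solution of relying only on the CLIP encoder, and a pretrained LLaVA model is used for efficient training.
Result: SpatialGeo improves accuracy on spatial reasoning tasks, raising the performance of state-of-the-art models by at least 8.0% on SpatialRGPT-Bench while reducing memory cost during inference by approximately 50%.
Conclusion: Fusing geometry and semantics features is key to improving the spatial awareness of multimodal large language models. The proposed hierarchical fusion offers an effective route to better spatial reasoning while balancing performance gains against computational cost.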
📄 Abstract
Multimodal large language models (MLLMs) have achieved significant progress in image and language tasks due to the strong reasoning capability of large language models (LLMs). Nevertheless, most MLLMs suffer from limited spatial reasoning ability to interpret and infer spatial arrangements in three-dimensional space. In this work, we propose a novel vision encoder based on hierarchical fusion of geometry and semantics features, generating spatial-aware visual embedding and boosting the spatial grounding capability of MLLMs. Specifically, we first unveil that the spatial ambiguity shortcoming stems from the lossy embedding of the vision encoder utilized in most existing MLLMs (e.g., CLIP), restricted to instance-level semantic features. This motivates us to complement CLIP with the geometry features from vision-only self-supervised learning via a hierarchical adapter, enhancing the spatial awareness in the proposed SpatialGeo. The network is efficiently trained using a pretrained LLaVA model and optimized with random feature dropping to avoid trivial solutions relying solely on the CLIP encoder. Experimental results show that SpatialGeo improves the accuracy in spatial reasoning tasks, enhancing state-of-the-art models by at least 8.0% in SpatialRGPT-Bench with approximately 50% less memory cost during inference. The source code is available via https://ricky-plus.github.io/SpatialGeoPages/.
[38] Loomis Painter: Reconstructing the Painting Process
Markus Pobitzer, Chang Liu, Chenyi Zhuang, Teng Long, Bin Ren, Nicu Sebe
🧩 TL;DR
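This paper proposes a unified framework for multi-media painting process generation. A semantics-driven style control mechanism and cross-medium style augmentation enable consistent texture evolution and process transfer across styles. A reverse-painting training strategy keeps the generated process smooth and aligned with human creative workflows, and the framework is validated on a large-scale dataset of real painting processes.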
📘 Detailed Summary
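Motivation: Existing painting tutorial videos lack interactivity and personalization, while current generative models struggle to generalize across media and often show temporal or structural inconsistencies, so they cannot faithfully reproduce human creative workflows. This limits how well learners can master the techniques of different painting media and styles.
Method: Proposes a unified framework for multi-media painting process generation. A semantics-driven style control mechanism embeds multiple media into the conditional space of a diffusion model, combined with cross-medium style augmentation. A reverse-painting training strategy ensures smooth, human-aligned generation, and a large-scale dataset of real painting processes is built for training and evaluation.
Result: The method achieves strong results on cross-media consistency, temporal coherence, and final-image fidelity, as measured by LPIPS, DINO, and CLIP metrics. The proposed Perceptual Distance Profile (PDP) curve quantitatively models the creative sequence of composition, color blocking, and detail refinement, mirroring human artistic progression.
Conclusion: The work provides an interactive, personalized learning tool for art education and addresses cross-media painting process generation with a single framework. The PDP curve offers a new way to analyze the creative process quantitatively, and the reverse-painting training strategy keeps generation consistent with how humans paint.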
📄 Abstract
Step-by-step painting tutorials are vital for learning artistic techniques, but existing video resources (e.g., YouTube) lack interactivity and personalization. While recent generative models have advanced artistic image synthesis, they struggle to generalize across media and often show temporal or structural inconsistencies, hindering faithful reproduction of human creative workflows. To address this, we propose a unified framework for multi-media painting process generation with a semantics-driven style control mechanism that embeds multiple media into a diffusion model's conditional space and uses cross-medium style augmentation. This enables consistent texture evolution and process transfer across styles. A reverse-painting training strategy further ensures smooth, human-aligned generation. We also build a large-scale dataset of real painting processes and evaluate cross-media consistency, temporal coherence, and final-image fidelity, achieving strong results on LPIPS, DINO, and CLIP metrics. Finally, our Perceptual Distance Profile (PDP) curve quantitatively models the creative sequence, i.e., composition, color blocking, and detail refinement, mirroring human artistic progression.
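One plausible reading of a perceptual distance profile is a curve that records how far each intermediate frame is from the finished painting. The abstract does not define the PDP precisely, so the sketch below is an assumption: it measures every frame against the final frame and uses mean absolute pixel difference as a stand-in for a learned perceptual metric such as LPIPS. All names and the toy frames are hypothetical.

```python
def mean_abs_distance(a, b):
    """Stand-in distance between two flattened images of equal length.
    A perceptual metric (for example LPIPS) would be used in practice."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def perceptual_distance_profile(frames, distance=mean_abs_distance):
    """Distance of every frame in a painting sequence to the final frame.

    The curve is expected to fall toward zero as the painting progresses;
    how quickly it falls hints at the current stage (composition,
    color blocking, detail refinement).
    """
    final = frames[-1]
    return [distance(frame, final) for frame in frames]

# Toy usage: four "frames" of a 4-pixel canvas that fills in over time.
frames = [
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
]
profile = perceptual_distance_profile(frames)
```

Passing a different `distance` callable is enough to swap in another metric, which makes it easy to compare how a generated sequence and a real recording approach their final images.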
[39] UAM: A Unified Attention-Mamba Backbone of Multimodal Framework for Tumor Cell Classification
Taixi Chen, Jingyun Chen, Nancy Guo
🧩 TL;DR
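This paper proposes a Unified Attention-Mamba (UAM) backbone designed for classifying cell-level radiomics features and extends it into a multimodal framework that achieves state-of-the-art performance on both cell classification and tumor segmentation. The work addresses the largely unexplored area of cell-level radiomics analysis and offers a unified, extensible multimodal foundation for radiomics-driven cancer diagnosis.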
📘 Detailed Summary
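Motivation: Most existing studies focus on slide-level or patch-level tumor classification, leaving cell-level radiomics analysis largely unexplored, and no backbone has been designed specifically for radiomics data. Cell-level radiomics features give fine-grained insight into tumor phenotypes, but this potential has not been exploited.
Method: Inspired by the success of the Mamba architecture in vision and language, proposes a Unified Attention-Mamba (UAM) backbone for cell-level classification from radiomics features. Unlike earlier hybrids that integrate Attention and Mamba modules in fixed proportions, UAM combines their capabilities flexibly within a single architecture, removing the need for manual ratio tuning and improving encoding capability. Two UAM variants are developed, and a multimodal UAM framework built on the backbone jointly performs cell-level classification and image segmentation.
Result: UAM achieves state-of-the-art performance on both tasks on public benchmarks, surpassing leading image-based foundation models. Cell classification accuracy rises from 74% to 78% (n = 349,882 cells) and tumor segmentation precision from 75% to 80% (n = 406 patches).
Conclusion: UAM shows promise as a unified, extensible multimodal foundation for radiomics-driven cancer diagnosis. Its unified design removes manual ratio tuning while improving encoding capability and diagnostic accuracy, and it offers a new technical route for cell-level radiomics analysis.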
📄 Abstract
Cell-level radiomics features provide fine-grained insights into tumor phenotypes and have the potential to significantly enhance diagnostic accuracy on hematoxylin and eosin (H&E) images. By capturing micro-level morphological and intensity patterns, these features support more precise tumor identification and improve AI interpretability by highlighting diagnostically relevant cells for pathologist review. However, most existing studies focus on slide-level or patch-level tumor classification, leaving cell-level radiomics analysis largely unexplored. Moreover, there is currently no dedicated backbone specifically designed for radiomics data. Inspired by the recent success of the Mamba architecture in vision and language domains, we introduce a Unified Attention-Mamba (UAM) backbone for cell-level classification using radiomics features. Unlike previous hybrid approaches that integrate Attention and Mamba modules in fixed proportions, our unified design flexibly combines their capabilities within a single cohesive architecture, eliminating the need for manual ratio tuning and improving encoding capability. We develop two UAM variants to comprehensively evaluate the benefits of this unified structure. Building on this backbone, we further propose a multimodal UAM framework that jointly performs cell-level classification and image segmentation. Experimental results demonstrate that UAM achieves state-of-the-art performance across both tasks on public benchmarks, surpassing leading image-based foundation models. It improves cell classification accuracy from 74% to 78% (n = 349,882 cells), and tumor segmentation precision from 75% to 80% (n = 406 patches). These findings highlight the effectiveness and promise of UAM as a unified and extensible multimodal foundation for radiomics-driven cancer diagnosis.
[40] ATAC: Augmentation-Based Test-Time Adversarial Correction for CLIP
Linxiang Su, András Balogh
🧩 TL;DR
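This paper proposes ATAC, an augmentation-based test-time adversarial correction method. It computes augmentation-induced drift vectors in CLIP's embedding space to infer a semantic recovery direction, substantially improving CLIP's adversarial robustness in zero-shot image-text matching.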
📘 Detailed Summary
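Motivation: Despite its success in zero-shot image-text matching, CLIP is highly vulnerable to adversarial perturbations on images. Adversarial fine-tuning is prohibitively costly, and existing test-time defenses still offer limited robustness.
Method: ATAC operates directly in CLIP's embedding space. It computes augmentation-induced drift vectors to infer a semantic recovery direction and corrects the embedding based on the angular consistency of these latent drifts.
Result: Across a wide range of benchmarks, ATAC consistently achieves high robustness, surpassing previous state-of-the-art methods by nearly 50% on average with minimal computational overhead. It retains state-of-the-art robustness in unconventional and extreme settings and achieves nontrivial robustness against adaptive attacks.
Conclusion: ATAC demonstrates a new paradigm for test-time adversarial defense in CLIP's embedding space, showing that inferring a semantic recovery direction can improve robustness efficiently and suggesting a path toward lightweight adversarial defenses.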
📄 Abstract
Despite its remarkable success in zero-shot image-text matching, CLIP remains highly vulnerable to adversarial perturbations on images. As adversarial fine-tuning is prohibitively costly, recent works explore various test-time defense strategies; however, these approaches still exhibit limited robustness. In this work, we revisit this problem and propose a simple yet effective strategy: Augmentation-based Test-time Adversarial Correction (ATAC). Our method operates directly in the embedding space of CLIP, calculating augmentation-induced drift vectors to infer a semantic recovery direction and correcting the embedding based on the angular consistency of these latent drifts. Across a wide range of benchmarks, ATAC consistently achieves remarkably high robustness, surpassing that of previous state-of-the-art methods by nearly 50% on average, all while requiring minimal computational overhead. Furthermore, ATAC retains state-of-the-art robustness in unconventional and extreme settings and even achieves nontrivial robustness against adaptive attacks. Our results demonstrate that ATAC is an efficient method in a novel paradigm for test-time adversarial defenses in the embedding space of CLIP.
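The idea of drift vectors and angular consistency can be made concrete with a small sketch. The abstract does not specify ATAC's actual correction rule, so everything below is an assumption: drifts are the unit-normalized differences between augmented-view embeddings and the original embedding, consistency is the length of their mean (1.0 when all drifts agree in direction), and the embedding is shifted along the mean drift only when consistency exceeds a hypothetical threshold `tau`.

```python
import math

def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))

def unit(v):
    """Unit-length copy of v (all zeros if v has zero length)."""
    n = norm(v)
    return [x / n for x in v] if n > 0 else [0.0 for _ in v]

def correct_embedding(embedding, augmented_embeddings, tau=0.8, alpha=1.0):
    """Hypothetical drift-based correction (not the paper's exact rule).

    embedding: embedding of the possibly attacked image.
    augmented_embeddings: embeddings of augmented views of that image.
    Returns (embedding_to_use, consistency).
    """
    drifts = [unit([a - e for a, e in zip(aug, embedding)])
              for aug in augmented_embeddings]
    mean_drift = [sum(d[k] for d in drifts) / len(drifts)
                  for k in range(len(embedding))]
    consistency = norm(mean_drift)  # near 1.0 when drifts share a direction
    if consistency < tau:
        return list(embedding), consistency  # drifts disagree: leave as-is
    shifted = [e + alpha * m for e, m in zip(embedding, mean_drift)]
    return unit(shifted), consistency

# Toy usage: two augmented views drift the same way, so a correction is applied.
corrected, agree = correct_embedding([1.0, 0.0], [[0.5, 0.5], [0.6, 0.4]])
# Opposite drifts cancel out, so the embedding is returned unchanged.
unchanged, disagree = correct_embedding([1.0, 0.0], [[2.0, 0.0], [0.0, 0.0]])
```

The intuition behind such a rule is that augmentations of a clean image tend to scatter its embedding in no particular direction, while augmentations of an adversarial image may push it consistently back toward the clean region; whether ATAC uses exactly this test is not stated in the abstract.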
[41] MCMoE: Completing Missing Modalities with Mixture of Experts for Incomplete Multimodal Action Quality Assessment
Huangbiao Xu, Huanqi Wu, Xiao Ke, Junyi Wu, Rui Xu, Jinglin Xu
🧩 TL;DR
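This paper proposes MCMoE, a missing-modality completion framework that uses a mixture of experts to unify unimodal and joint representation learning, addressing the performance degradation that missing modalities cause in multimodal action quality assessment.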
📘 Detailed Summary
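Motivation: In multimodal action quality assessment, some modalities are often unavailable at inference time. Existing methods stop working when any modality is missing, and the interruption of cross-modal interactions causes catastrophic performance drops, which limits robustness in practice.
Method: Proposes an adaptive gated modality generator that dynamically fuses the available information to reconstruct missing modalities. Modality experts learn unimodal knowledge, and the knowledge of all experts is dynamically mixed to extract cross-modal joint representations. The mixture of experts further refines and complements the reconstructed modalities.
Result: Extensive experiments on three public AQA benchmarks show that MCMoE achieves state-of-the-art results in both complete and incomplete multimodal learning, confirming its effectiveness when modalities are missing.
Conclusion: The work offers an effective solution to missing modalities in multimodal learning. It unifies unimodal and joint representation learning in single-stage training and serves as a reference for building robust multimodal assessment systems.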
📄 Abstract
Multimodal Action Quality Assessment (AQA) has recently emerged as a promising paradigm. By leveraging complementary information across shared contextual cues, it enhances the discriminative evaluation of subtle intra-class variations in highly similar action sequences. However, partial modalities are frequently unavailable at the inference stage in reality. The absence of any modality often renders existing multimodal models inoperable. Furthermore, it triggers catastrophic performance degradation due to interruptions in cross-modal interactions. To address this issue, we propose a novel Missing Completion Framework with Mixture of Experts (MCMoE) that unifies unimodal and joint representation learning in single-stage training. Specifically, we propose an adaptive gated modality generator that dynamically fuses available information to reconstruct missing modalities. We then design modality experts to learn unimodal knowledge and dynamically mix the knowledge of all experts to extract cross-modal joint representations. With a mixture of experts, missing modalities are further refined and complemented. Finally, in the training phase, we mine the complete multimodal features and unimodal expert knowledge to guide modality generation and generation-based joint representation extraction. Extensive experiments demonstrate that our MCMoE achieves state-of-the-art results in both complete and incomplete multimodal learning on three public AQA benchmarks. Code is available at https://github.com/XuHuangbiao/MCMoE.
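The dynamic mixing of expert knowledge described above can be sketched as a softmax-gated weighted sum of per-expert features. This is the generic mixture-of-experts pattern rather than MCMoE's actual architecture; in a real model the gate scores would come from a learned gating network, and the names and toy values here are assumptions.

```python
import math

def softmax(scores):
    """Convert raw gate scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mix_experts(expert_outputs, gate_scores):
    """Weighted sum of expert feature vectors using softmax gate weights.

    expert_outputs: one feature vector per modality expert.
    gate_scores: one raw score per expert (hypothetical gating output).
    """
    weights = softmax(gate_scores)
    dim = len(expert_outputs[0])
    return [sum(w * out[d] for w, out in zip(weights, expert_outputs))
            for d in range(dim)]

# Toy usage: two experts (say, video and audio) with equal gate scores.
balanced = mix_experts([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
# A much larger score for the first expert makes its features dominate.
skewed = mix_experts([[1.0, 0.0], [0.0, 1.0]], [100.0, 0.0])
```

When a modality is missing, a gate of this kind can lower the weight of the expert fed with reconstructed input, so a poor reconstruction has less influence on the joint representation.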
[42] MMT-ARD: Multimodal Multi-Teacher Adversarial Distillation for Robust Vision-Language Models
Yuqi Li, Junhao Dong, Chuanguang Yang, Shiping Wen, Piotr Koniusz, Tingwen Huang, Yingli Tian, Yew-Soon Ong
🧩 TL;DR
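This paper proposes MMT-ARD, a multimodal multi-teacher adversarial robust distillation framework. A dual-teacher knowledge fusion architecture and a dynamic weight allocation strategy improve both the adversarial robustness and the training efficiency of vision-language models.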
📘 Detailed Summary
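Motivation: Traditional single-teacher adversarial knowledge distillation suffers from limited knowledge diversity, slow convergence, and difficulty balancing robustness and accuracy, which limits how reliably vision-language models can be deployed in safety-critical applications.
Method: Proposes a multimodal multi-teacher adversarial robust distillation framework. A dual-teacher knowledge fusion architecture jointly optimizes clean feature preservation and robust feature enhancement, a dynamic weight allocation strategy based on teacher confidence focuses training on harder adversarial examples, and an adaptive sigmoid-based weighting function balances the strength of knowledge transfer across modalities.
Result: On ImageNet and zero-shot benchmarks, the ViT-B-32 model gains +4.32% robust accuracy and +3.5% zero-shot accuracy, with a 2.3x increase in training efficiency over traditional single-teacher methods.
Conclusion: MMT-ARD addresses knowledge fusion and balancing in multi-teacher distillation and shows that teacher collaboration is an effective, scalable way to improve the adversarial robustness of multimodal large models for safety-critical applications.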
📄 Abstract
Vision-Language Models (VLMs) are increasingly deployed in safety-critical applications, making their adversarial robustness a crucial concern. While adversarial knowledge distillation has shown promise in transferring robustness from teacher to student models, traditional single-teacher approaches suffer from limited knowledge diversity, slow convergence, and difficulty in balancing robustness and accuracy. To address these challenges, we propose MMT-ARD: a Multimodal Multi-Teacher Adversarial Robust Distillation framework. Our key innovation is a dual-teacher knowledge fusion architecture that collaboratively optimizes clean feature preservation and robust feature enhancement. To better handle challenging adversarial examples, we introduce a dynamic weight allocation strategy based on teacher confidence, enabling adaptive focus on harder samples. Moreover, to mitigate bias among teachers, we design an adaptive sigmoid-based weighting function that balances the strength of knowledge transfer across modalities. Extensive experiments on ImageNet and zero-shot benchmarks demonstrate that MMT-ARD improves robust accuracy by +4.32% and zero-shot accuracy by +3.5% on the ViT-B-32 model, while achieving a 2.3x increase in training efficiency over traditional single-teacher methods. These results highlight the effectiveness and scalability of MMT-ARD in enhancing the adversarial robustness of multimodal large models. Our code is available at https://github.com/itsnotacie/MMT-ARD.
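A confidence-based, sigmoid-shaped weighting between two teachers might look like the sketch below. The abstract names the ingredients but not the formula, so this is an assumed form: the robust teacher's weight is a sigmoid of the confidence gap between the two teachers, and the student's loss is the weighted sum of its distillation losses against each teacher. The function names and the `sharpness` parameter are hypothetical.

```python
import math

def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

def teacher_weights(conf_clean, conf_robust, sharpness=5.0):
    """Split one unit of weight between a clean and a robust teacher.

    conf_clean, conf_robust: each teacher's confidence in the true label,
    in [0, 1]. sharpness controls how strongly the weights react to the
    confidence gap (an illustrative hyperparameter).
    """
    w_robust = sigmoid(sharpness * (conf_robust - conf_clean))
    return 1.0 - w_robust, w_robust

def distillation_loss(loss_clean, loss_robust, conf_clean, conf_robust,
                      sharpness=5.0):
    """Weighted sum of the student's losses against the two teachers."""
    w_clean, w_robust = teacher_weights(conf_clean, conf_robust, sharpness)
    return w_clean * loss_clean + w_robust * loss_robust

# Toy usage: equal confidence gives equal weights.
even = teacher_weights(0.5, 0.5)
# On an adversarial example the robust teacher is more confident and gets more weight.
tilted = teacher_weights(0.2, 0.9)
combined = distillation_loss(2.0, 4.0, 0.5, 0.5)
```

Because the weights are smooth in the confidence gap, neither teacher is switched off abruptly, which fits the stated goal of balancing clean feature preservation against robust feature enhancement.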
[43] Improving Multimodal Distillation for 3D Semantic Segmentation under Domain Shift
Björn Michele, Alexandre Boulch, Gilles Puy, Tuan-Hung Vu, Renaud Marlet, Nicolas Courty
🧩 TL;DR
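This study systematically explores how to use vision foundation models for unsupervised domain adaptation in semantic segmentation of lidar point clouds. It finds that the lidar backbone architecture is key to generalization and that a single backbone can be pretrained once and applied to many domain shifts. The resulting pipeline achieves state-of-the-art results in four challenging settings.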
📘 Detailed Summary
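Motivation: Lidar semantic segmentation networks trained under full supervision on one type of lidar do not generalize well to unseen lidars, leaving a clear performance gap under domain shift. The study asks how vision foundation models, whose features are robust across domains, can best be used to improve domain adaptation for lidar point cloud segmentation.
Method: Building on unsupervised image-to-lidar knowledge distillation, the study systematically examines recipes for exploiting vision foundation models in lidar domain adaptation. Key findings: the lidar backbone architecture is key to generalization on the target domain; a single backbone can be pretrained once and used for many domain shifts; and the best results come from keeping the pretrained backbone frozen and training an MLP head for segmentation.
Result: The resulting pipeline achieves state-of-the-art performance in four widely recognized and challenging domain adaptation settings, reducing the performance gap between lidar types.
Conclusion: The lidar backbone architecture is a key factor in domain adaptation performance, and one-time pretraining can serve multiple domains. Freezing the pretrained backbone and training a lightweight segmentation head worked best, giving practical design guidance for domain adaptation in lidar semantic segmentation.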
📄 Abstract
Semantic segmentation networks trained under full supervision for one type of lidar fail to generalize to unseen lidars without intervention. To reduce the performance gap under domain shifts, a recent trend is to leverage vision foundation models (VFMs) providing robust features across domains. In this work, we conduct an exhaustive study to identify recipes for exploiting VFMs in unsupervised domain adaptation for semantic segmentation of lidar point clouds. Building upon unsupervised image-to-lidar knowledge distillation, our study reveals that: (1) the architecture of the lidar backbone is key to maximizing the generalization performance on a target domain; (2) it is possible to pretrain a single backbone once and for all, and use it to address many domain shifts; (3) best results are obtained by keeping the pretrained backbone frozen and training an MLP head for semantic segmentation. The resulting pipeline achieves state-of-the-art results in four widely-recognized and challenging settings. The code will be available at: https://github.com/valeoai/muddos.
[44] Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models
Mark Endo, Serena Yeung-Levy
🧩 TL;DR
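This paper proposes Extract+Think, an approach for efficient multimodal models that combines visual extraction tuning with step-by-step reasoning to counter the disproportionate effect of LLM downscaling on visual capabilities, improving efficiency while preserving performance.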
📘 Detailed Summary
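Motivation: Scaling up multimodal models has brought large gains, but practical use calls for smaller, more efficient systems. The study analyzes what happens when intelligence is downscaled in multimodal models, in particular how reducing LLM capacity affects multimodal capabilities, and seeks to address the outsized damage that LLM downscaling does to visual capabilities.
Method: Proposes Extract+Think, which has two components: visual extraction tuning, which explicitly trains the model to extract instruction-relevant visual details consistently across tasks, and step-by-step reasoning, which generates answers from the extracted details. The approach targets the visual bottleneck caused by LLM downscaling.
Result: LLM downscaling affects visual capabilities disproportionately, more than abilities inherited from the LLM. When the effect on perception is isolated, performance still drops sharply, often matching or exceeding the impact on reasoning. Visual extraction tuning alleviates this bottleneck, and the combined approach sets a new standard for efficiency and performance.
Conclusion: Visual capabilities in multimodal models are sensitive to LLM capacity. Extract+Think offers a new approach to building efficient multimodal systems, highlights the role of visual information extraction in multimodal reasoning, and gives direction for designing lightweight multimodal models.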
📄 Abstract
Scaling up multimodal models has enabled remarkable advances in visual understanding and reasoning, but practical demands call for smaller, efficient systems. In this work, we conduct a principled analysis of downscaling intelligence in multimodal models, examining how reduced large language model (LLM) capacity affects multimodal capabilities. Our initial findings reveal an interesting trend: LLM downscaling disproportionately affects visual capabilities, rather than abilities inherited from the LLM. We then examine whether this drop mainly reflects the expected decline in visual reasoning or a more fundamental loss of perceptual abilities. Isolating the effect of LLM downscaling on perception, we find performance still drops sharply, often matching or exceeding the impact on reasoning. To address this bottleneck, we introduce visual extraction tuning, which explicitly trains the model to extract instruction-relevant visual details consistently across tasks. With these extracted visual details, we then apply step-by-step reasoning to generate answers. Together, these components form our Extract+Think approach, setting a new standard for efficiency and performance in this space.
[45] Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination
Yolo Yunlong Tang, Daiki Shimada, Hang Hua, Chao Huang, Jing Bi, Rogerio Feris, Chenliang Xu
🧩 TL;DR
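This paper proposes Video-R4, a video reasoning LMM built on visual rumination: it iteratively selects frames, zooms into key regions, and re-encodes pixels to improve understanding of text-rich videos. The method achieves state-of-the-art results on M4-ViteVQA, generalizes to other question answering tasks, and shows that iterative rumination is an effective paradigm for pixel-grounded multimodal reasoning.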
📘 Detailed Summary
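Motivation: Most video QA models rely on single-pass perception over fixed frames, which leads to hallucinations and failures on fine-grained evidence and makes it hard to handle transient textual cues that need repeated inspection. Humans watching text-rich videos pause, zoom in, and re-read critical regions, an ability current models lack.
Method: Proposes visual rumination, which iteratively selects frames, zooms into informative regions, re-encodes the retrieved pixels, and updates the reasoning state. Two datasets with executable rumination trajectories are constructed: Video-R4-CoT-17k for supervised practice and Video-R4-RL-30k for reinforcement learning. A multi-stage rumination learning framework progressively fine-tunes a 7B LMM via SFT and GRPO-based RL to learn atomic and mixing visual operations.
Result: Video-R4-7B achieves state-of-the-art results on the M4-ViteVQA benchmark and generalizes to multi-page document QA, slides QA, and generic video QA, supporting the effectiveness of iterative rumination for pixel-grounded multimodal reasoning.
Conclusion: Iterative visual rumination is an effective paradigm for pixel-grounded multimodal reasoning and improves understanding of text-rich videos. It addresses the limits of existing models on fine-grained evidence and applies across a range of multimodal tasks.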
📄 Abstract
Understanding text-rich videos requires reading small, transient textual cues that often demand repeated inspection. Yet most video QA models rely on single-pass perception over fixed frames, leading to hallucinations and failures on fine-grained evidence. Inspired by how humans pause, zoom, and re-read critical regions, we introduce Video-R4 (Reinforcing Text-Rich Video Reasoning with Visual Rumination), a video reasoning LMM that performs visual rumination: iteratively selecting frames, zooming into informative regions, re-encoding retrieved pixels, and updating its reasoning state. We construct two datasets with executable rumination trajectories: Video-R4-CoT-17k for supervised practice and Video-R4-RL-30k for reinforcement learning. We propose a multi-stage rumination learning framework that progressively finetunes a 7B LMM to learn atomic and mixing visual operations via SFT and GRPO-based RL. Video-R4-7B achieves state-of-the-art results on M4-ViteVQA and further generalizes to multi-page document QA, slides QA, and generic video QA, demonstrating that iterative rumination is an effective paradigm for pixel-grounded multimodal reasoning.
[46] Native 3D Editing with Full Attention
Weiwei Cai, Shuangkang Fang, Weicai Ye, Xin Dong, Yunhan Yang, Xuanyang Zhang, Wei Cheng, Yanpei Cao, Gang Yu, Tao Chen
🧩 TL;DR
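This paper proposes a native 3D editing framework that manipulates 3D representations directly in a single feed-forward pass, avoiding the slow speed of optimization-based methods and the inconsistent geometry of methods built on multi-view 2D editing. Using a large-scale multimodal dataset and a new 3D token concatenation strategy, it achieves state-of-the-art generation quality, 3D consistency, and instruction fidelity.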
📘 Detailed Summary
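Motivation: Existing instruction-guided 3D editing methods face critical limitations: optimization-based approaches are prohibitively slow, while feed-forward approaches that rely on multi-view 2D editing often produce inconsistent geometry and degraded visual quality. These limits hold back wider use of 3D content creation and call for editing methods that are efficient and 3D-consistent.
Method: Builds a large-scale multimodal dataset for instruction-guided 3D editing covering addition, deletion, and modification tasks, curated so that edited objects follow the instructed changes while unedited regions stay consistent with the source object. On this dataset, two conditioning strategies are explored: a conventional cross-attention mechanism and a new 3D token concatenation approach, which proves more parameter-efficient and performs better.
Result: Extensive evaluations show that the method outperforms existing 2D-lifting approaches in generation quality, 3D consistency, and instruction fidelity, setting a new benchmark. 3D token concatenation is more parameter-efficient and achieves better performance than cross-attention.
Conclusion: Native 3D editing offers clear advantages in efficiency and consistency and provides a more practical tool for 3D content creation. The strength of 3D token concatenation informs the design of future 3D generative models, and the large-scale multimodal dataset provides a basis for further work in the field.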
📄 Abstract
Instruction-guided 3D editing is a rapidly emerging field with the potential to broaden access to 3D content creation. However, existing methods face critical limitations: optimization-based approaches are prohibitively slow, while feed-forward approaches relying on multi-view 2D editing often suffer from inconsistent geometry and degraded visual quality. To address these issues, we propose a novel native 3D editing framework that directly manipulates 3D representations in a single, efficient feed-forward pass. Specifically, we create a large-scale, multi-modal dataset for instruction-guided 3D editing, covering diverse addition, deletion, and modification tasks. This dataset is meticulously curated to ensure that edited objects faithfully adhere to the instructional changes while preserving the consistency of unedited regions with the source object. Building upon this dataset, we explore two distinct conditioning strategies for our model: a conventional cross-attention mechanism and a novel 3D token concatenation approach. Our results demonstrate that token concatenation is more parameter-efficient and achieves superior performance. Extensive evaluations show that our method outperforms existing 2D-lifting approaches, setting a new benchmark in generation quality, 3D consistency, and instruction fidelity.
cs.CL [Back]
[47] Towards Hyper-Efficient RAG Systems in VecDBs: Distributed Parallel Multi-Resolution Vector Search
Dong Liu, Yanxuan Yu
🧩 TL;DR
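This paper proposes Semantic Pyramid Indexing (SPI), a multi-resolution indexing framework for vector databases that accelerates and improves retrieval-augmented generation through query-adaptive resolution control. SPI substantially speeds up retrieval while maintaining semantic coverage, and it is compatible with existing vector database infrastructure.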
📘 Detailed Summary
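Motivation: Existing vector database retrieval pipelines rely on flat or single-resolution index structures that cannot adapt to the varying semantic granularity required by diverse user queries, leading to suboptimal trade-offs between retrieval speed and contextual relevance. Conventional hierarchical methods require offline tuning or separate model training, which limits deployment flexibility.
Method: SPI builds a semantic pyramid over document embeddings and uses a lightweight classifier to select the optimal resolution level for each query dynamically, enabling progressive coarse-to-fine retrieval. It is implemented as a plugin compatible with FAISS and Qdrant backends and requires no offline tuning or additional model training.
Result: On MS MARCO, Natural Questions, and multimodal retrieval benchmarks, SPI achieves up to a 5.7× retrieval speedup and a 1.8× memory efficiency gain while improving end-to-end QA F1 by up to 2.5 points. A theoretical analysis provides guarantees on retrieval quality and latency bounds, and ablation studies validate the contribution of each component.
Conclusion: Through adaptive multi-resolution indexing, SPI addresses the key trade-off between retrieval efficiency and quality in RAG systems, and its compatibility with existing vector database infrastructure makes it directly deployable in production. The work offers a scalable and efficient solution for large-scale knowledge retrieval systems.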
📄 Abstract
Retrieval-Augmented Generation (RAG) systems have become a dominant approach to augment large language models (LLMs) with external knowledge. However, existing vector database (VecDB) retrieval pipelines rely on flat or single-resolution indexing structures, which cannot adapt to the varying semantic granularity required by diverse user queries. This limitation leads to suboptimal trade-offs between retrieval speed and contextual relevance. To address this, we propose Semantic Pyramid Indexing (SPI), a novel multi-resolution vector indexing framework that introduces query-adaptive resolution control for RAG in VecDBs. Unlike existing hierarchical methods that require offline tuning or separate model training, SPI constructs a semantic pyramid over document embeddings and dynamically selects the optimal resolution level per query through a lightweight classifier. This adaptive approach enables progressive retrieval from coarse-to-fine representations, significantly accelerating search while maintaining semantic coverage. We implement SPI as a plugin for both FAISS and Qdrant backends and evaluate it across multiple RAG tasks including MS MARCO, Natural Questions, and multimodal retrieval benchmarks. SPI achieves up to 5.7× retrieval speedup and 1.8× memory efficiency gain while improving end-to-end QA F1 scores by up to 2.5 points compared to strong baselines. Our theoretical analysis provides guarantees on retrieval quality and latency bounds, while extensive ablation studies validate the contribution of each component. The framework's compatibility with existing VecDB infrastructures makes it readily deployable in production RAG systems. Code is available at https://github.com/FastLM/SPI_VecDB.
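To make the coarse-to-fine idea concrete, here is a minimal two-level sketch in plain Python. It is not the SPI implementation and does not use the FAISS or Qdrant APIs: the cluster assignments are assumed to be given, `choose_probe_count` is a hypothetical stand-in for the lightweight classifier mentioned in the abstract, and the search itself is an ordinary centroid-then-members lookup (similar in spirit to an inverted-file index).

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_pyramid(docs, assignments):
    """Build a two-level index: coarse centroids over fine document vectors.

    docs: {doc_id: vector}; assignments: {doc_id: cluster_id} (assumed given).
    """
    clusters = {}
    for doc_id, cluster_id in assignments.items():
        clusters.setdefault(cluster_id, []).append(doc_id)
    centroids = {
        cid: [sum(col) / len(ids) for col in zip(*(docs[d] for d in ids))]
        for cid, ids in clusters.items()
    }
    return clusters, centroids


def choose_probe_count(query_text, max_probes):
    """Hypothetical rule standing in for a learned classifier:
    longer queries are allowed to probe more coarse clusters."""
    return min(max_probes, 1 + len(query_text.split()) // 8)


def coarse_to_fine_search(query_vec, docs, clusters, centroids, n_probe, k):
    """Rank clusters by centroid similarity, then rank documents inside
    the n_probe best clusters and return the top k document ids."""
    ranked = sorted(centroids, key=lambda c: cosine(query_vec, centroids[c]), reverse=True)
    candidates = [d for c in ranked[:n_probe] for d in clusters[c]]
    candidates.sort(key=lambda d: cosine(query_vec, docs[d]), reverse=True)
    return candidates[:k]


docs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.1, 0.9]}
clusters, centroids = build_pyramid(docs, {"a": 0, "b": 0, "c": 1, "d": 1})
top = coarse_to_fine_search([1.0, 0.05], docs, clusters, centroids, n_probe=1, k=2)
```

With `n_probe=1` only the documents in the closest cluster are scored, which is where the speedup in this kind of scheme comes from; raising `n_probe` trades speed for coverage.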
[48] Ellipsoid-Based Decision Boundaries for Open Intent Classification
Yuetian Zou, Hanlei Zhang, Hua Xu, Songze Li, Long Xiao
🧩 TL;DR
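This paper proposes EliDecide, which addresses textual open intent classification by introducing learnable ellipsoid decision boundaries. The method uses supervised contrastive learning to obtain a discriminative feature space and optimizes the ellipsoid boundaries with a dual loss function, reaching state-of-the-art performance on multiple benchmarks.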
📘 Detailed Summary
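Motivation: Existing adaptive decision boundary methods assume that known classes follow isotropic distributions, which restricts boundaries to balls and ignores how variance differs across feature directions. This simplification limits the detection of unknown intents in complex real-world scenarios and motivates a more flexible boundary representation.
Method: First, supervised contrastive learning is used to obtain a discriminative feature space for known samples. Second, learnable matrices parameterize an ellipsoid boundary for each known class, offering more flexibility than spherical boundaries defined only by a center and a radius. Third, the boundaries are optimized with a newly designed dual loss function that balances empirical risk and open-space risk, expanding boundaries to cover known samples while contracting them against synthesized pseudo-open samples.
Result: The method achieves state-of-the-art performance on multiple text intent benchmarks and is further validated on a question classification dataset. The flexibility of the ellipsoid boundaries yields superior open intent detection and shows strong potential for generalization across diverse, complex open-world scenarios.
Conclusion: Compared with conventional spherical boundaries, ellipsoid decision boundaries model the true distribution of known classes more accurately and substantially improve open intent detection. The method offers a new technical path for text classification in complex open-world settings.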
📄 Abstract
Textual open intent classification is crucial for real-world dialogue systems, enabling robust detection of unknown user intents without prior knowledge and contributing to the robustness of the system. While adaptive decision boundary methods have shown great potential by eliminating manual threshold tuning, existing approaches assume isotropic distributions of known classes, restricting boundaries to balls and overlooking distributional variance along different directions. To address this limitation, we propose EliDecide, a novel method that learns ellipsoid decision boundaries with varying scales along different feature directions. First, we employ supervised contrastive learning to obtain a discriminative feature space for known samples. Second, we apply learnable matrices to parameterize ellipsoids as the boundaries of each known class, offering greater flexibility than spherical boundaries defined solely by centers and radii. Third, we optimize the boundaries via a newly designed dual loss function that balances empirical and open-space risks: expanding boundaries to cover known samples while contracting them against synthesized pseudo-open samples. Our method achieves state-of-the-art performance on multiple text intent benchmarks and also on a question classification dataset. The flexibility of the ellipsoids demonstrates superior open intent detection capability and strong potential for generalization to more text classification tasks in diverse complex open-world scenarios.
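The difference between a ball and an ellipsoid boundary can be illustrated with a small membership test. This is a sketch under stated assumptions, not the EliDecide implementation: the parameterization (a point x is inside when ||A(x - c)||² ≤ 1 for a per-class matrix A and center c) is one common way to write an ellipsoid, and the class names, centers, and matrices below are made up for illustration. In the paper the matrices are learned; here they are fixed by hand.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def ellipsoid_score(x, center, matrix):
    """Squared norm of matrix @ (x - center); values <= 1 mean 'inside'."""
    diff = [xi - ci for xi, ci in zip(x, center)]
    return sum(t * t for t in matvec(matrix, diff))


def classify_open_intent(x, classes):
    """Return the known class whose boundary fits x best, or 'open'
    when x lies outside every class boundary.

    classes: {label: (center, matrix)}  (hypothetical hand-set values)
    """
    best_label, best_score = None, float("inf")
    for label, (center, matrix) in classes.items():
        score = ellipsoid_score(x, center, matrix)
        if score < best_score:
            best_label, best_score = label, score
    return best_label if best_score <= 1.0 else "open"


# Ellipsoid with semi-axis 2.0 along the first feature and 0.5 along the second.
ellipsoid_classes = {"book_flight": ([0.0, 0.0], [[0.5, 0.0], [0.0, 2.0]])}
# Unit ball around the same center (identity matrix), for comparison.
ball_classes = {"book_flight": ([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])}

inside_ellipsoid = classify_open_intent([1.5, 0.0], ellipsoid_classes)
outside_ellipsoid = classify_open_intent([0.0, 1.5], ellipsoid_classes)
outside_ball = classify_open_intent([1.5, 0.0], ball_classes)
```

The point `[1.5, 0.0]` is accepted by the ellipsoid but rejected by the unit ball, which is the kind of direction-dependent tolerance the abstract attributes to ellipsoid boundaries.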
[49] PEPPER: Perception-Guided Perturbation for Robust Backdoor Defense in Text-to-Image Diffusion Models
Oscar Chew, Po-Yi Lu, Jayden Lin, Kuan-Hao Huang, Hsuan-Tien Lin
🧩 TL;DR
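This paper proposes PEPPER, a defense that uses a semantic rewriting strategy to disrupt backdoor triggers in text-to-image diffusion models, substantially reducing attack success while preserving generation quality. It can be combined with existing defenses to provide stronger and more general robustness than any single method.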
📘 Detailed Summary
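Motivation: Text-to-image diffusion models are vulnerable to backdoor attacks, in which a specific trigger in the input prompt can steer the model toward harmful or unintended content. Existing defenses struggle to counter such attacks while preserving generation quality.
Method: PEPPER uses a perception-guided perturbation strategy that rewrites the input caption into one that is semantically distant yet visually similar, while adding unobtrusive elements. This rewriting disrupts the trigger embedded in the input prompt and dilutes the influence of trigger tokens.
Result: Experiments show that PEPPER is particularly effective against text-encoder-based attacks, substantially reducing attack success while preserving generation quality. Combined with existing defenses, it provides stronger and more general robustness than any single method.
Conclusion: PEPPER shows that semantic rewriting can effectively defend text-to-image diffusion models against backdoor attacks, offering a new approach to securing multimodal generative models. Its modular design lets it work alongside existing defenses, which makes it practical to deploy.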
📄 Abstract
Recent studies show that text-to-image (T2I) diffusion models are vulnerable to backdoor attacks, where a trigger in the input prompt can steer generation toward harmful or unintended content. To address this, we introduce PEPPER (PErcePtion Guided PERturbation), a backdoor defense that rewrites the caption into a semantically distant yet visually similar caption while adding unobtrusive elements. With this rewriting strategy, PEPPER disrupts the trigger embedded in the input prompt, dilutes the influence of trigger tokens, and thereby achieves enhanced robustness. Experiments show that PEPPER is particularly effective against text-encoder-based attacks, substantially reducing attack success while preserving generation quality. Beyond this, PEPPER can be paired with existing defenses, yielding consistently stronger and more generalizable robustness than any standalone method. Our code will be released on GitHub.
[50] Do Vision-Language Models Understand Visual Persuasiveness?
Gyuwon Park
🧩 TL;DR
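This study builds a dataset for visual persuasiveness judgment and introduces a taxonomy of visual persuasive factors. It finds that vision-language models show a recall-oriented bias in understanding visual persuasion and rely mainly on high-level semantic alignment rather than low- and mid-level features, and that object-grounded rationales substantially improve model performance.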
📘 Detailed Summary
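Motivation: Although vision-language models have made notable progress in multimodal reasoning, it remains unclear whether they truly understand visual persuasion, that is, how visual cues shape human attitudes and decisions. The study probes how well VLMs understand the mechanisms of visual persuasion.
Method: The study builds a high-consensus dataset for binary persuasiveness judgment, proposes a taxonomy of Visual Persuasive Factors covering low-level perceptual, mid-level compositional, and high-level semantic cues, and explores persuasion-relevant reasoning strategies such as cognitive steering and knowledge injection.
Result: Empirical analysis shows that VLMs have a recall-oriented bias (they over-predict high persuasiveness) and weak discriminative power for low- and mid-level features. High-level semantic alignment between the message and object presence is the strongest predictor of human judgment, and concise, object-grounded rationales significantly improve precision and F1.
Conclusion: The core limitation of VLMs lies not in recognizing persuasive objects but in linking those objects to communicative intent. High-level semantic understanding dominates visual persuasiveness judgment, and targeted reasoning strategies can compensate for model weaknesses.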
📄 Abstract
Recent advances in vision-language models (VLMs) have enabled impressive multi-modal reasoning and understanding. Yet, whether these models truly grasp visual persuasion, that is, how visual cues shape human attitudes and decisions, remains unclear. To probe this question, we construct a high-consensus dataset for binary persuasiveness judgment and introduce the taxonomy of Visual Persuasive Factors (VPFs), encompassing low-level perceptual, mid-level compositional, and high-level semantic cues. We also explore cognitive steering and knowledge injection strategies for persuasion-relevant reasoning. Empirical analysis across VLMs reveals a recall-oriented bias (models over-predict high persuasiveness) and weak discriminative power for low/mid-level features. In contrast, high-level semantic alignment between message and object presence emerges as the strongest predictor of human judgment. Among intervention strategies, simple instruction or unguided reasoning scaffolds yield marginal or negative effects, whereas concise, object-grounded rationales significantly improve precision and F1 scores. These results indicate that VLMs' core limitation lies not in recognizing persuasive objects but in linking them to communicative intent.
[51] Learning to Compress: Unlocking the Potential of Large Language Models for Text Representation
Yeqin Zhang, Yizheng Zhao, Chen Hu, Binxing Jiao, Daxin Jiang, Ruihang Miao, Cam-Tu Nguyen
🧩 TL;DR
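This paper proposes LLM2Comp, which improves the text representation ability of LLMs by using context compression as a pretext task. Compared with methods based on token-level prediction, it performs better across a range of tasks and is more data-efficient.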
📘 Detailed Summary
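Motivation: Existing large language models are optimized mainly for causal modeling and next-token prediction, which makes it hard for them to produce high-quality holistic text representations. Prior work has introduced pretext tasks to adapt LLMs for text representation, but most rely on token-level prediction objectives such as masked next-token prediction, which are limited in capturing holistic semantics.
Method: The study explores context compression as a pretext task for unsupervised adaptation of LLMs. During compression pre-training, the model learns to generate compact memory tokens that replace the full context for downstream sequence prediction. A carefully designed compression objective, further refined with contrastive learning, yields the text representation model LLM2Comp.
Result: Experiments show that a well-designed compression objective significantly improves LLM-based text representations, outperforming models trained with token-level pretext tasks. LLM2Comp outperforms contemporary LLM-based text encoders on a wide range of text encoding tasks while being more sample-efficient and requiring significantly less training data.
Conclusion: Context compression as a pretext task offers a new and effective route to LLM text representation and captures document-level semantics better than conventional token-level methods. The approach improves both performance and data efficiency, opening a new direction for building efficient text representation models.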
📄 Abstract
Text representation plays a critical role in tasks like clustering, retrieval, and other downstream applications. With the emergence of large language models (LLMs), there is increasing interest in harnessing their capabilities for this purpose. However, most of the LLMs are inherently causal and optimized for next-token prediction, making them suboptimal for producing holistic representations. To address this, recent studies introduced pretext tasks to adapt LLMs for text representation. Most of these tasks, however, rely on token-level prediction objectives, such as the masked next-token prediction (MNTP) used in LLM2Vec. In this work, we explore the untapped potential of context compression as a pretext task for unsupervised adaptation of LLMs. During compression pre-training, the model learns to generate compact memory tokens, which substitute the whole context for downstream sequence prediction. Experiments demonstrate that a well-designed compression objective can significantly enhance LLM-based text representations, outperforming models trained with token-level pretext tasks. Further improvements through contrastive learning produce a strong representation model (LLM2Comp) that outperforms contemporary LLM-based text encoders on a wide range of tasks while being more sample-efficient, requiring significantly less training data.
[52] Lost in Translation and Noise: A Deep Dive into the Failure Modes of VLMs on Real-World Tables
Anshul Singh, Rohan Chaudhary, Gagneet Singh, Abhay Kumary
🧩 TL;DR
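This paper presents MirageTVQA, a benchmark for evaluating the table reasoning ability of vision-language models under multilingual and visually noisy conditions. It reveals severe performance degradation and language bias in existing models in realistic scenarios.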
📘 Detailed Summary
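Motivation: Existing tabular QA datasets such as WikiTableQuestions and FinQA are mostly English and present tables in a digitally perfect format. They do not reflect the multilingual and visually noisy conditions of the real world, which creates a significant gap between research and practice.
Method: The authors build the MirageTVQA benchmark with nearly 60,000 QA pairs across 24 languages and introduce visual noise that mimics scanned documents. It is designed specifically to evaluate vision-language models under multilingual and visually imperfect conditions.
Result: Evaluation of leading vision-language models shows that performance of the best models drops by more than 35% under visual noise, and that there is a consistent English-first bias in which reasoning ability fails to transfer to other languages.
Conclusion: MirageTVQA provides a benchmark for measuring and driving progress toward more robust vision-language models for table reasoning. It exposes key limitations of current models in real-world use and underlines the importance of handling multilingual input and visual noise.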
📄 Abstract
The impressive performance of VLMs is largely measured on benchmarks that fail to capture the complexities of real-world scenarios. Existing datasets for tabular QA, such as WikiTableQuestions and FinQA, are overwhelmingly monolingual (English) and present tables in a digitally perfect, clean format. This creates a significant gap between research and practice. To address this, we present MirageTVQA, a new benchmark designed to evaluate VLMs on these exact dimensions. Featuring nearly 60,000 QA pairs across 24 languages, MirageTVQA challenges models with tables that are not only multilingual but also visually imperfect, incorporating realistic noise to mimic scanned documents. Our evaluation of the leading VLMs reveals two primary failure points: a severe degradation in performance (over 35% drop for the best models) when faced with visual noise and a consistent English-first bias where reasoning abilities fail to transfer to other languages. MirageTVQA provides a benchmark for measuring and driving progress towards more robust VLM models for table reasoning. The dataset and the code are available at: https://github.com/anshulsc/MirageTVQA.
[53] Large Language Models for Sentiment Analysis to Detect Social Challenges: A Use Case with South African Languages
Koena Ronny Mabokela, Tim Schlippe, Matthias Wölfel
🧩 TL;DR
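This study analyzes the zero-shot sentiment analysis performance of the LLMs GPT-3.5, GPT-4, LlaMa 2, PaLM 2, and Dolly 2 on South African social media posts in English, Sepedi, and Setswana. It finds that fusing the outputs of multiple models brings sentiment classification errors below 1%, providing a reliable basis for detecting social challenges.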
📘 Detailed Summary
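Motivation: There is no prior work on LLM-based sentiment analysis of multilingual South African social media posts, particularly for detecting social challenges. Sentiment analysis systems for multilingual communities could help government departments identify and address social issues more precisely.
Method: The study evaluates the zero-shot sentiment analysis ability of GPT-3.5, GPT-4, LlaMa 2, PaLM 2, and Dolly 2 on 10 emerging topics that fall within the jurisdiction of 10 South African government departments, covering English, Sepedi, and Setswana, and applies a strategy that fuses the outputs of multiple models.
Result: The results show large performance differences across LLMs, topics, and languages. Fusing the outputs of multiple models greatly improves sentiment classification and brings the classification error below 1%.
Conclusion: The study indicates that fusing LLM outputs can provide reliable sentiment analysis for multilingual social media, allowing government departments to detect social challenges and plan actions based on sentiment results for specific topics and language groups.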
📄 Abstract
Sentiment analysis can aid in understanding people's opinions and emotions on social issues. In multilingual communities, sentiment analysis systems can be used to quickly identify social challenges in social media posts, enabling government departments to detect and address these issues more precisely and effectively. Recently, large language models (LLMs) have become available to the general public, and initial analyses have shown that they exhibit impressive zero-shot sentiment analysis abilities in English. However, no work has investigated leveraging LLMs for sentiment analysis on social media posts in South African languages to detect social challenges. Consequently, in this work, we analyse the zero-shot performance of the state-of-the-art LLMs GPT-3.5, GPT-4, LlaMa 2, PaLM 2, and Dolly 2 to investigate the sentiment polarities of the 10 most emerging topics in English, Sepedi and Setswana social media posts that fall within the jurisdictional areas of 10 South African government departments. Our results demonstrate that there are large differences between the various LLMs, topics, and languages. In addition, we show that a fusion of the outcomes of different LLMs provides large gains in sentiment classification performance, with sentiment classification errors below 1%. Consequently, it is now feasible to provide systems that generate reliable information about sentiment analysis to detect social challenges and draw conclusions about possible needs for actions on specific topics and within different language groups.
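The abstract does not say how the outcomes of the different LLMs are fused. One simple possibility, shown here purely as an illustrative assumption, is a majority vote over the labels each model returns for a post. The model names in the example dictionary and the tie-break order are hypothetical choices, not taken from the paper.

```python
from collections import Counter


def fuse_labels(predictions, tie_break_order=("negative", "neutral", "positive")):
    """Majority vote over per-model sentiment labels.

    predictions: {model_name: label}. When several labels tie for the
    most votes, the first tied label in tie_break_order is returned
    (an arbitrary, assumed convention).
    """
    counts = Counter(predictions.values())
    top = max(counts.values())
    tied = [label for label, count in counts.items() if count == top]
    if len(tied) == 1:
        return tied[0]
    for label in tie_break_order:
        if label in tied:
            return label
    return tied[0]  # labels outside tie_break_order: fall back to first seen


clear_vote = fuse_labels({"model_a": "positive", "model_b": "positive", "model_c": "negative"})
tied_vote = fuse_labels({"model_a": "positive", "model_b": "negative"})
```

Weighted voting, or selecting a model per topic and language, would be equally plausible fusion schemes; the sketch only shows the general shape of combining several zero-shot outputs into one label.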
[54] Don't Learn, Ground: A Case for Natural Language Inference with Visual Grounding
Daniil Ignatev, Ayman Santeer, Albert Gatt, Denis Paperno
🧩 TL;DR
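This paper proposes a zero-shot natural language inference method that uses text-to-image models to generate visual representations of premises and performs inference by comparing them with textual hypotheses. The method reaches high accuracy without task-specific fine-tuning and is robust to textual biases and surface heuristics.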
📘 Detailed Summary
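Motivation: Current natural language inference methods rely mainly on the text modality and are susceptible to textual biases and surface heuristics. The study explores grounding language in visual contexts through multimodal representations in order to build more robust natural language understanding systems.
Method: The method uses text-to-image models to generate visual representations of premises and evaluates two inference techniques: cosine similarity and visual question answering. A controlled adversarial dataset is also designed to validate robustness.
Result: The method achieves high accuracy without task-specific fine-tuning and resists textual biases and surface heuristics. Validation on the adversarial dataset further supports its robustness.
Conclusion: The study suggests that using the visual modality as a meaning representation is a promising direction for robust natural language understanding, and that multimodal grounding can mitigate the limitations of text-only methods.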
📄 Abstract
We propose a zero-shot method for Natural Language Inference (NLI) that leverages multimodal representations by grounding language in visual contexts. Our approach generates visual representations of premises using text-to-image models and performs inference by comparing these representations with textual hypotheses. We evaluate two inference techniques: cosine similarity and visual question answering. Our method achieves high accuracy without task-specific fine-tuning, demonstrating robustness against textual biases and surface heuristics. Additionally, we design a controlled adversarial dataset to validate the robustness of our approach. Our findings suggest that leveraging visual modality as a meaning representation provides a promising direction for robust natural language understanding.
[55] Beyond Multiple Choice: A Hybrid Framework for Unifying Robust Evaluation and Verifiable Reasoning Training
Yesheng Liu, Hao Li, Haiyu Xu, Baoqi Pei, Jiahao Wang, Mingxuan Zhao, Jingshu Zheng, Zheqi He, JG Yao, Bowen Qin, Xi Yang, Jiajun Zhang
🧩 TL;DR
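This study proposes ReVeL, a framework that converts multiple-choice questions into open-form questions to address the evaluation bias caused by option leakage, using LLM-based rewriting and verification to make model training and evaluation more reliable.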
📘 Detailed Summary
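Motivation: Multiple-choice question answering (MCQA), when used to evaluate and reinforcement fine-tune multimodal language models, suffers from option leakage. This makes accuracy metrics unreliable, encourages guessing behavior, and fails to reflect real model capability.
Method: Proposes the ReVeL framework, which uses an LLM to rewrite MCQA items as open-form questions, applies different rewriting and verification schemes depending on the answer type, and fine-tunes Qwen2.5-VL models with GRPO.
Result: On multiple-choice benchmarks, models trained on ReVeL-OpenQA match MCQA accuracy while improving OpenQA accuracy by about six percentage points. ReVeL also reveals up to 20 percentage points of score inflation in MCQA benchmarks.
Conclusion: ReVeL provides better data efficiency and more robust reward signals, improves judging accuracy, and reduces cost and latency, offering a new paradigm for reliable training and evaluation of multimodal language models.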
📄 Abstract
Multiple-choice question answering (MCQA) has been a popular format for evaluating and reinforcement fine-tuning (RFT) of modern multimodal language models. Its constrained output format allows for simplified, deterministic automatic verification. However, we find that the options may leak exploitable signals, which makes the accuracy metrics unreliable for indicating real capabilities and encourages explicit or implicit answer guessing behaviors during RFT. We propose ReVeL (Rewrite and Verify by LLM), a framework that rewrites multiple-choice questions into open-form questions while keeping answers verifiable whenever possible. The framework categorizes questions according to different answer types and applies different rewriting and verification schemes to each. When applied for RFT, we converted 20k MCQA examples and used GRPO to fine-tune Qwen2.5-VL models. Models trained on ReVeL-OpenQA match MCQA accuracy on multiple-choice benchmarks and improve OpenQA accuracy by about six percentage points, indicating better data efficiency and more robust reward signals than MCQA-based training. When used for evaluation, ReVeL also reveals up to 20 percentage points of score inflation in MCQA benchmarks (relative to OpenQA), improves judging accuracy, and reduces both cost and latency. We will release code and data publicly.
[56] SMILE: A Composite Lexical-Semantic Metric for Question-Answering Evaluation
Shrikant Kendre, Austin Xu, Honglu Zhou, Michael Ryoo, Shafiq Joty, Juan Carlos Niebles
🧩 TL;DR
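This paper proposes the SMILE evaluation metric, which combines sentence-level semantic understanding, keyword-level semantic understanding, and exact keyword matching. It balances lexical precision and semantic relevance on textual and visual question answering tasks and clearly outperforms traditional evaluation methods.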
📘 Detailed Summary
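Motivation: Traditional metrics for textual and visual question answering, such as ROUGE, METEOR, and Exact Match, focus on n-gram-based lexical similarity and often miss the deeper semantic understanding that accurate assessment needs. Measures such as BERTScore and MoverScore use contextual embeddings to address this, but they lack flexibility in balancing sentence-level and keyword-level semantics and ignore lexical similarity, which remains important. LLM-based evaluators are powerful but come with high cost, bias, inconsistency, and hallucinations.
Method: The paper proposes SMILE, a composite approach that combines sentence-level semantic understanding, keyword-level semantic understanding, and simple keyword matching. By integrating semantic information at different levels, it balances lexical precision and semantic relevance to give a comprehensive evaluation. SMILE is designed to be computationally lightweight, avoiding the cost and reliability issues of LLM-based evaluators.
Result: Extensive benchmarks on text, image, and video QA tasks show that SMILE correlates highly with human judgments. The metric performs well across evaluation tasks, which supports its effectiveness in balancing lexical and semantic evaluation, and it is computationally efficient while giving reliable and consistent results.
Conclusion: SMILE bridges the gap between lexical and semantic evaluation, giving textual and visual question answering a more comprehensive and accurate evaluation framework. It shows the value of combining semantic information at different levels, and its lightweight design is an advantage in practical use.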
📄 Abstract
Traditional evaluation metrics for textual and visual question answering, like ROUGE, METEOR, and Exact Match (EM), focus heavily on n-gram based lexical similarity, often missing the deeper semantic understanding needed for accurate assessment. While measures like BERTScore and MoverScore leverage contextual embeddings to address this limitation, they lack flexibility in balancing sentence-level and keyword-level semantics and ignore lexical similarity, which remains important. Large Language Model (LLM) based evaluators, though powerful, come with drawbacks like high costs, bias, inconsistency, and hallucinations. To address these issues, we introduce SMILE: Semantic Metric Integrating Lexical Exactness, a novel approach that combines sentence-level semantic understanding with keyword-level semantic understanding and easy keyword matching. This composite method balances lexical precision and semantic relevance, offering a comprehensive evaluation. Extensive benchmarks across text, image, and video QA tasks show SMILE is highly correlated with human judgments and computationally lightweight, bridging the gap between lexical and semantic evaluation.
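The abstract names three ingredients but not how they are combined, so the following is only a toy sketch of what a composite lexical-semantic score could look like. Everything specific here is an assumption: the weighted sum, the weights, and especially `trigram_embed`, which is a character-trigram stand-in used so the example runs without a model. A real metric of this kind would use learned sentence and word embeddings instead.

```python
import math
from collections import Counter


def trigram_embed(text):
    """Toy stand-in for an embedding model: character-trigram counts."""
    padded = f" {text.lower().strip()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def sentence_score(prediction, reference):
    """Sentence-level similarity between prediction and reference."""
    return cosine(trigram_embed(prediction), trigram_embed(reference))


def keyword_semantic_score(prediction, keywords):
    """For each keyword, take its best match among prediction tokens; average."""
    tokens = prediction.lower().split()
    if not keywords or not tokens:
        return 0.0
    best = [max(cosine(trigram_embed(k), trigram_embed(t)) for t in tokens) for k in keywords]
    return sum(best) / len(best)


def exact_keyword_score(prediction, keywords):
    """Fraction of keywords that appear verbatim as prediction tokens."""
    if not keywords:
        return 0.0
    tokens = set(prediction.lower().split())
    return sum(k.lower() in tokens for k in keywords) / len(keywords)


def composite_score(prediction, reference, keywords, weights=(0.4, 0.3, 0.3)):
    """Weighted sum of the three components (weights are assumed values)."""
    w_sentence, w_keyword, w_exact = weights
    return (w_sentence * sentence_score(prediction, reference)
            + w_keyword * keyword_semantic_score(prediction, keywords)
            + w_exact * exact_keyword_score(prediction, keywords))


match = composite_score("a cat sat", "a cat sat", ["cat"])
mismatch = composite_score("the dog ran", "a cat sat", ["cat"])
```

Keeping the exact-match term separate from the two similarity terms is what lets such a score penalize an answer that is fluent and on-topic but omits the required keyword.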
cs.AI [Back]
[57] Patient-level Information Extraction by Consistent Integration of Textual and Tabular Evidence with Bayesian Networks
Paloma Rabaey, Adrick Tench, Stefan Heytens, Thomas Demeester
🧩 TL;DR
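This paper proposes a multi-modal method for patient-level information extraction that combines structured electronic health records with clinical text, using virtual evidence augmented with a consistency node to obtain an interpretable probabilistic fusion that handles missing information and resolves contradictions in the data.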
📘 Detailed Summary
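Motivation: Much of the key information in electronic health records lives in unstructured text, and existing methods struggle to exploit the complementary information in structured tabular features and clinical notes. A method is needed that fuses the modalities transparently and handles missing and contradictory information.
Method: Proposes a multi-modal patient-level information extraction framework that processes structured EHR features with an expert-informed Bayesian network, analyzes clinical notes with neural text classifiers, and fuses their predictions probabilistically through virtual evidence augmented with a consistency node, which improves calibration.
Result: Experiments on the simulated SimSUM benchmark show that, compared with virtual evidence alone, the method improves the calibration of the final predictions, letting the Bayesian network better adjust the neural classifier's output to handle missing information and resolve contradictions between tabular and text data.
Conclusion: The method offers an interpretable multi-modal fusion scheme for clinical decision support systems. The consistency node makes the model more robust and reliable on complex medical data and supports transparent feature-based modeling in high-risk medical applications.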
📄 Abstract
Electronic health records (EHRs) form an invaluable resource for training clinical decision support systems. To leverage the potential of such systems in high-risk applications, we need large, structured tabular datasets on which we can build transparent feature-based models. While part of the EHR already contains structured information (e.g. diagnosis codes, medications, and lab results), much of the information is contained within unstructured text (e.g. discharge summaries and nursing notes). In this work, we propose a method for multi-modal patient-level information extraction that leverages both the tabular features available in the patient's EHR (using an expert-informed Bayesian network) as well as clinical notes describing the patient's symptoms (using neural text classifiers). We propose the use of virtual evidence augmented with a consistency node to provide an interpretable, probabilistic fusion of the models' predictions. The consistency node improves the calibration of the final predictions compared to virtual evidence alone, allowing the Bayesian network to better adjust the neural classifier's output to handle missing information and resolve contradictions between the tabular and text data. We show the potential of our method on the SimSUM dataset, a simulated benchmark linking tabular EHRs with clinical notes through expert knowledge.
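As a small worked example of the virtual evidence idea, the sketch below updates a belief over one binary symptom. It assumes the Bayesian network has already produced a prior for the symptom from the tabular features and that the text classifier's output is treated as a likelihood vector; the posterior is then the normalized product of the two, which is the standard soft-evidence update. The numbers and state names are made up, and the paper's consistency node is not modeled here.

```python
def apply_virtual_evidence(prior, likelihood):
    """Combine a prior over states with soft (virtual) evidence.

    prior: {state: P(state | tabular evidence)}
    likelihood: {state: weight the text classifier assigns to that state}
    Returns the normalized posterior, proportional to prior * likelihood.
    """
    unnormalized = {state: prior[state] * likelihood[state] for state in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("prior and virtual evidence are incompatible")
    return {state: value / total for state, value in unnormalized.items()}


# Illustrative values: the tabular data makes the symptom unlikely,
# while the clinical note strongly suggests it is present.
prior = {"present": 0.2, "absent": 0.8}
text_evidence = {"present": 0.9, "absent": 0.1}
posterior = apply_virtual_evidence(prior, text_evidence)
```

Here the two sources disagree, and the posterior for "present" lands at roughly 0.69: the note shifts the belief substantially without overriding the tabular prior completely.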