cs.CV [Total: 34]
cs.CL [Total: 7]
cs.AI [Total: 6]

cs.CV

[1] GAZE: Governance-Aware pre-annotation for Zero-shot World Model Environments

Leela Krishna, Mengyang Zhao, Saicharithreddy Pasula, Harshit Rajgarhia, Abhishek Mukherji

🧩 TL;DR

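This paper presents GAZE, a production-grade pipeline that automatically converts raw long-form video into rich supervision for world-model training. Multimodal pre-annotation and structured outputs improve both annotation efficiency and annotation quality.

📘 Detailed Summary

Motivation: Training robust world models requires large-scale, precisely labeled multimodal datasets, but manual annotation is slow and expensive, which bottlenecks the supply of high-quality training data.

Method: The system runs in three stages. It first normalizes proprietary 360-degree video formats into standard views and shards them for parallel processing. It then applies a suite of AI models for dense multimodal pre-annotation, covering scene understanding, object tracking, audio transcription, and PII/NSFW/minor detection. Finally, it consolidates these signals into a structured output specification for rapid human validation.

Result: The GAZE workflow saves about 19 minutes per review hour and cuts human review volume by more than 80% through conservative auto-skipping of low-salience segments. It also increases label density and consistency and integrates privacy safeguards and chain-of-custody metadata.

Conclusion: The method offers a scalable blueprint for producing high-quality world-model training data. It maintains throughput and governance requirements while directly generating high-fidelity, privacy-aware datasets suited to learning cross-modal dynamics and action-conditioned prediction.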

📄 Abstract

Training robust world models requires large-scale, precisely labeled multimodal datasets, a process historically bottlenecked by slow and expensive manual annotation. We present a production-tested GAZE pipeline that automates the conversion of raw, long-form video into rich, task-ready supervision for world-model training. Our system (i) normalizes proprietary 360-degree formats into standard views and shards them for parallel processing; (ii) applies a suite of AI models (scene understanding, object tracking, audio transcription, PII/NSFW/minor detection) for dense, multimodal pre-annotation; and (iii) consolidates signals into a structured output specification for rapid human validation. The GAZE workflow demonstrably yields efficiency gains (~19 minutes saved per review hour) and reduces human review volume by >80% through conservative auto-skipping of low-salience segments. By increasing label density and consistency while integrating privacy safeguards and chain-of-custody metadata, our method generates high-fidelity, privacy-aware datasets directly consumable for learning cross-modal dynamics and action-conditioned prediction. We detail our orchestration, model choices, and data dictionary to provide a scalable blueprint for generating high-quality world model training data without sacrificing throughput or governance.
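
The conservative auto-skip step can be pictured as a threshold filter over per-segment salience scores. The sketch below is illustrative only: the `Segment` type, the `salience` score, the `flagged` field, and the 0.2 threshold are all assumptions and are not taken from the paper. "Conservative" is modeled here by always routing flagged segments to human review.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    segment_id: str
    salience: float        # hypothetical score in [0, 1]
    flagged: bool = False  # hypothetical PII/NSFW/minor flag


def route_segments(segments, skip_below=0.2):
    """Split segments into (review, skipped) lists.

    A segment is skipped only if it is unflagged AND its salience is
    below the threshold; everything else goes to human review.
    """
    review, skipped = [], []
    for seg in segments:
        if seg.flagged or seg.salience >= skip_below:
            review.append(seg)
        else:
            skipped.append(seg)
    return review, skipped


def review_reduction(segments, skip_below=0.2):
    """Fraction of segments removed from the human review queue."""
    if not segments:
        return 0.0
    _, skipped = route_segments(segments, skip_below)
    return len(skipped) / len(segments)
```

Lowering `skip_below` makes the filter more conservative: fewer segments are skipped and more reach a reviewer.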

[2] DeLeaker: Dynamic Inference-Time Reweighting For Semantic Leakage Mitigation in Text-to-Image Models

Mor Ventura, Michael Toker, Or Patashnik, Yonatan Belinkov, Roi Reichart

🧩 TL;DR

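This paper proposes DeLeaker, a lightweight, optimization-free inference-time method that mitigates semantic leakage in text-to-image models by intervening directly on the model's attention maps. It suppresses unintended feature transfer between entities while preserving image quality.

📘 Detailed Summary

Motivation: Text-to-image models have advanced rapidly but remain vulnerable to semantic leakage, the unintended transfer of semantically related features between distinct entities. Existing mitigation strategies are usually optimization-based or depend on external inputs, which limits them.

Method: During the diffusion process, DeLeaker dynamically reweights attention maps to suppress excessive cross-entity interactions while strengthening the identity of each entity. The method needs no optimization and no external information.

Result: Experiments show that DeLeaker outperforms all baselines, even those given external information, and mitigates semantic leakage without compromising fidelity or quality. The authors also build SLIM, the first dataset dedicated to evaluating semantic leakage.

Conclusion: The results underscore the value of attention control and show that direct intervention on attention is effective against semantic leakage, paving the way for more semantically precise text-to-image models.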

📄 Abstract

Text-to-Image (T2I) models have advanced rapidly, yet they remain vulnerable to semantic leakage, the unintended transfer of semantically related features between distinct entities. Existing mitigation strategies are often optimization-based or dependent on external inputs. We introduce DeLeaker, a lightweight, optimization-free inference-time approach that mitigates leakage by directly intervening on the model's attention maps. Throughout the diffusion process, DeLeaker dynamically reweights attention maps to suppress excessive cross-entity interactions while strengthening the identity of each entity. To support systematic evaluation, we introduce SLIM (Semantic Leakage in IMages), the first dataset dedicated to semantic leakage, comprising 1,130 human-verified samples spanning diverse scenarios, together with a novel automatic evaluation framework. Experiments demonstrate that DeLeaker consistently outperforms all baselines, even when they are provided with external information, achieving effective leakage mitigation without compromising fidelity or quality. These results underscore the value of attention control and pave the way for more semantically precise T2I models.
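
To make "reweighting attention maps" concrete, here is a minimal sketch for a single image token's attention over the prompt tokens. The abstract does not give the reweighting rule, so the fixed `suppress` and `boost` factors, the token-index sets, and the renormalization step are assumptions for illustration, not DeLeaker's actual procedure.

```python
def reweight_attention(attn, own_tokens, other_tokens, suppress=0.5, boost=1.5):
    """Reweight one attention row and renormalize it to sum to 1.

    attn         -- attention weights over prompt tokens (non-negative)
    own_tokens   -- indices of tokens describing the entity this image
                    token belongs to (hypothetical assignment)
    other_tokens -- indices of tokens describing a different entity
    """
    scaled = []
    for idx, weight in enumerate(attn):
        if idx in other_tokens:
            weight *= suppress   # damp cross-entity interaction
        elif idx in own_tokens:
            weight *= boost      # strengthen the entity's own identity
        scaled.append(weight)
    total = sum(scaled)
    if total == 0:
        return list(attn)        # nothing to renormalize
    return [w / total for w in scaled]
```

In a real diffusion model the same idea would be applied per layer and per denoising step to whole attention tensors; how the factors are chosen dynamically is exactly what the paper contributes and is not reproduced here.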

[3] UrbanVerse: Scaling Urban Simulation by Watching City-Tour Videos

Mingxuan Liu, Honglin He, Elisa Ricci, Wayne Wu, Bolei Zhou

🧩 TL;DR

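This paper proposes UrbanVerse, a data-driven real-to-sim system that converts crowd-sourced city-tour videos into physics-aware, interactive simulation scenes for training urban embodied AI agents. The system includes a repository of 100k+ annotated 3D urban assets and an automatic scene-generation pipeline, and it improves navigation policies both in simulation and in zero-shot real-world transfer.

📘 Detailed Summary

Motivation: Existing human-crafted or procedurally generated simulation scenes either lack scalability or fail to capture real-world complexity. They cannot keep up with the growing need to train urban embodied AI agents that navigate chaotic city streets to provide last-mile connectivity.

Method: UrbanVerse has two core components: UrbanVerse-100K, a repository of 100k+ 3D urban assets annotated with semantic and physical attributes, and UrbanVerse-Gen, an automatic pipeline that extracts scene layouts from video and instantiates metric-scale 3D simulations from retrieved assets. The system runs in IsaacSim.

Result: Experiments show that UrbanVerse scenes preserve real-world semantics and layouts and reach human-evaluated realism comparable to manually crafted scenes. In navigation, policies trained in UrbanVerse exhibit scaling power laws and strong generalization, improving success by +6.3% in simulation and +30.1% in zero-shot sim-to-real transfer, and completing a 300 m real-world mission with only two interventions.

Conclusion: UrbanVerse shows that a data-driven real-to-sim approach can address both the scalability and the realism of urban simulation scenes, providing diverse, high-quality urban environments for training embodied agents and improving the real-world deployment performance of navigation policies.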

📄 Abstract

Urban embodied AI agents, ranging from delivery robots to quadrupeds, are increasingly populating our cities, navigating chaotic streets to provide last-mile connectivity. Training such agents requires diverse, high-fidelity urban environments to scale, yet existing human-crafted or procedurally generated simulation scenes either lack scalability or fail to capture real-world complexity. We introduce UrbanVerse, a data-driven real-to-sim system that converts crowd-sourced city-tour videos into physics-aware, interactive simulation scenes. UrbanVerse consists of: (i) UrbanVerse-100K, a repository of 100k+ annotated urban 3D assets with semantic and physical attributes, and (ii) UrbanVerse-Gen, an automatic pipeline that extracts scene layouts from video and instantiates metric-scale 3D simulations using retrieved assets. Running in IsaacSim, UrbanVerse offers 160 high-quality constructed scenes from 24 countries, along with a curated benchmark of 10 artist-designed test scenes. Experiments show that UrbanVerse scenes preserve real-world semantics and layouts, achieving human-evaluated realism comparable to manually crafted scenes. In urban navigation, policies trained in UrbanVerse exhibit scaling power laws and strong generalization, improving success by +6.3% in simulation and +30.1% in zero-shot sim-to-real transfer compared to prior methods, accomplishing a 300 m real-world mission with only two interventions.

Mattia Segu, Marta Tintore Gazulla, Yongqin Xian, Luc Van Gool, Federico Tombari

🧩 TL;DR

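MOBIUS is a family of foundation models for universal instance segmentation. With a bottleneck pixel decoder, a language-guided uncertainty calibration loss, and a unified training strategy, it cuts computational cost substantially while keeping state-of-the-art performance, enabling Pareto-optimal deployment from high-end accelerators to mobile devices.

📘 Detailed Summary

Motivation: Existing foundation models achieve state-of-the-art in-domain and zero-shot performance in panoptic segmentation and object detection, but their high computational cost limits deployment on resource-constrained platforms. The trade-off between efficiency and performance needs to be addressed.

Method: The authors propose a bottleneck pixel decoder for efficient multi-scale and multi-modal fusion, design a language-guided uncertainty calibration loss that enables adaptive decoder pruning, and adopt a streamlined, unified training strategy to reduce training and inference demands.

Result: MOBIUS reduces pixel decoder and transformer decoder FLOPs by up to 55% and 75%, respectively, and maintains state-of-the-art performance with only a third of the training iterations. It sets a new benchmark for efficient segmentation on both high-performance computing platforms and mobile devices.

Conclusion: The study shows that foundation models can gain substantial computational efficiency without sacrificing performance, offering a practical route to instance segmentation in resource-constrained settings and to deployment on edge devices.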

📄 Abstract

Scaling up model size and training data has advanced foundation models for instance-level perception, achieving state-of-the-art in-domain and zero-shot performance across object detection and segmentation. However, their high computational cost limits adoption on resource-constrained platforms. We first examine the limitations of existing architectures in enabling efficient edge deployment without compromising performance. We then introduce MOBIUS, a family of foundation models for universal instance segmentation, designed for Pareto-optimal downscaling to support deployment across devices ranging from high-end accelerators to mobile hardware. To reduce training and inference demands, we propose: (i) a bottleneck pixel decoder for efficient multi-scale and multi-modal fusion, (ii) a language-guided uncertainty calibration loss for adaptive decoder pruning, and (iii) a streamlined, unified training strategy. Unlike efficient baselines that trade accuracy for reduced complexity, MOBIUS reduces pixel and transformer decoder FLOPs by up to 55% and 75%, respectively, while maintaining state-of-the-art performance in just a third of the training iterations. MOBIUS establishes a new benchmark for efficient segmentation on both high-performance computing platforms and mobile devices.

[5] Composition-Grounded Instruction Synthesis for Visual Reasoning

Xinyi Gu, Jiayuan Mao, Zhang-Wei Hong, Zhuoran Yu, Pengyuan Li, Dhiraj Joshi, Rogerio Feris, Zexue He

🧩 TL;DR

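This paper proposes COGS, a framework that decomposes seed questions into perception and reasoning factors and systematically recombines them, generating large numbers of synthetic question-answer pairs from a small seed set. It substantially improves the reasoning of multimodal large language models on artificial images such as charts.

📘 Detailed Summary

Motivation: Pretrained multimodal large language models perform well on diverse multimodal tasks but remain limited in reasoning for artificial image domains where annotations are hard to collect, such as charts, rendered documents, and webpages. These domains are abundant in practice yet lack large-scale human-annotated reasoning datasets.

Method: COGS decomposes each seed question into primitive perception and reasoning factors, then systematically recomposes them with new images to generate many synthetic question-answer pairs. Each generated question comes with subquestions and intermediate answers, which supports reinforcement learning with factor-level process rewards.

Result: Experiments on chart reasoning show that COGS substantially improves performance on unseen questions, with the largest gains on reasoning-heavy and compositional questions. Training with a factor-level mixture of different seed data transfers better across multiple datasets, suggesting that COGS induces generalizable capabilities rather than dataset-specific overfitting.

Conclusion: COGS can efficiently extend the reasoning abilities of multimodal large language models from a small amount of seed data, and the approach extends beyond charts to other domains such as webpages, offering an effective option for multimodal reasoning where data is scarce.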

📄 Abstract

Pretrained multi-modal large language models (MLLMs) demonstrate strong performance on diverse multimodal tasks, but remain limited in reasoning capabilities for domains where annotations are difficult to collect. In this work, we focus on artificial image domains such as charts, rendered documents, and webpages, which are abundant in practice yet lack large-scale human annotated reasoning datasets. We introduce COGS (COmposition-Grounded instruction Synthesis), a data-efficient framework for equipping MLLMs with advanced reasoning abilities from a small set of seed questions. The key idea is to decompose each seed question into primitive perception and reasoning factors, which can then be systematically recomposed with new images to generate large collections of synthetic question-answer pairs. Each generated question is paired with subquestions and intermediate answers, enabling reinforcement learning with factor-level process rewards. Experiments on chart reasoning show that COGS substantially improves performance on unseen questions, with the largest gains on reasoning-heavy and compositional questions. Moreover, training with a factor-level mixture of different seed data yields better transfer across multiple datasets, suggesting that COGS induces generalizable capabilities rather than dataset-specific overfitting. We further demonstrate that the framework extends beyond charts to other domains such as webpages.
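
Because each synthetic question carries subquestions with intermediate answers, a reward can credit partial progress rather than only the final answer. The following is a minimal sketch of such a factor-level process reward. The exact-match comparison, the equal weighting of steps, and the `final_weight` parameter are assumptions made for illustration and are not the reward defined in the paper.

```python
def factor_level_reward(pred_steps, gold_steps, pred_final, gold_final,
                        final_weight=0.5):
    """Blend per-subquestion correctness with final-answer correctness.

    pred_steps / gold_steps -- intermediate answers, one per subquestion
    final_weight            -- hypothetical share given to the final answer
    """
    if len(pred_steps) != len(gold_steps):
        raise ValueError("prediction must answer every subquestion")
    if gold_steps:
        matches = sum(p == g for p, g in zip(pred_steps, gold_steps))
        step_score = matches / len(gold_steps)
    else:
        step_score = 0.0
    final_score = 1.0 if pred_final == gold_final else 0.0
    return (1.0 - final_weight) * step_score + final_weight * final_score
```

For example, a response that gets one of two intermediate answers right but the final answer wrong scores 0.25 under the default weighting, instead of the 0 that an outcome-only reward would give.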

[6] Comprehensive language-image pre-training for 3D medical image understanding

Tassilo Wald, Ibrahim Ethem Hamamci, Yuan Gao, Sam Bond-Taylor, Harshita Sharma, Maximilian Ilse, Cynthia Lo, Olesya Melnichenko, Noel C. F. Codella, Maria Teodora Wetscherek, Klaus H. Maier-Hein, Panagiotis Korfiatis, Valentina Salvatelli, Javier Alvarez-Valle, Fernando Pérez-García

🧩 TL;DR

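This paper presents the COLIPRI encoder family, which addresses data scarcity in 3D medical vision-language pre-training by adding a report generation objective and combining vision-language pre-training with vision-only pre-training. It achieves state-of-the-art performance on several tasks.

📘 Detailed Summary

Motivation: Current 3D medical vision-language encoders are limited by data availability, which restricts their ability to support radiologists in tasks such as retrieving cases with similar abnormalities and predicting the likelihood of abnormality.

Method: The authors counter the lack of data by injecting additional inductive biases: a report generation objective, and the pairing of vision-language pre-training with vision-only pre-training. This lets the model use both image-only and paired image-text 3D datasets. Combined with best practices from 3D medical imaging, this yields the COLIPRI encoder family.

Result: COLIPRI encoders achieve state-of-the-art performance in report generation, classification probing, and zero-shot classification, and remain competitive for semantic segmentation.

Conclusion: Adding inductive biases and combining different kinds of pre-training data effectively eases data scarcity in 3D medical vision-language pre-training, offering a practical path to stronger medical image analysis tools under limited data.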

📄 Abstract

Vision-language pre-training, i.e., aligning images with paired text, is a powerful paradigm to create encoders that can be directly used for tasks such as classification and retrieval, and for downstream tasks such as segmentation and report generation. In the 3D medical image domain, these capabilities allow vision-language encoders (VLEs) to support radiologists by retrieving patients with similar abnormalities or predicting likelihoods of abnormality. While the methodology holds promise, data availability limits the capabilities of current 3D VLEs. In this paper, we alleviate the lack of data by injecting additional inductive biases: introducing a report generation objective and pairing vision-language pre-training with vision-only pre-training. This allows us to leverage both image-only and paired image-text 3D datasets, increasing the total amount of data to which our model is exposed. Through these additional inductive biases, paired with best practices of the 3D medical imaging domain, we develop the Comprehensive Language-image Pre-training (COLIPRI) encoder family. Our COLIPRI encoders achieve state-of-the-art performance in report generation, classification probing, and zero-shot classification, and remain competitive for semantic segmentation.

[7] Directional Reasoning Injection for Fine-Tuning MLLMs

Chao Huang, Zeliang Zhang, Jiang Liu, Ximeng Sun, Jialian Wu, Xiaodong Yu, Ze Wang, Chenliang Xu, Emad Barsoum, Zicheng Liu

🧩 TL;DR

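This paper proposes DRIFT, which strengthens the reasoning of multimodal large language models by transferring reasoning knowledge in gradient space. It needs neither large-scale supervised fine-tuning nor reinforcement learning and yields clear gains on several multimodal reasoning benchmarks.

📘 Detailed Summary

Motivation: The reasoning ability of multimodal large language models often lags behind that of text-only models. Existing methods rely on resource-intensive large-scale supervised fine-tuning or reinforcement learning, while naive model merging is unstable across model families and degrades some models.

Method: DRIFT precomputes a reasoning prior as the parameter-space difference between the reasoning variant and the multimodal variant, then uses this prior to bias gradients during multimodal fine-tuning. Reasoning knowledge is thus transferred in gradient space while multimodal alignment stays stable.

Result: Extensive experiments on multimodal reasoning benchmarks, including MathVista and MathVerse, show that DRIFT consistently improves reasoning performance over naive merging and supervised fine-tuning, and matches or surpasses training-heavy methods at a fraction of the cost.

Conclusion: DRIFT is a lightweight, efficient way to enhance reasoning. It demonstrates that knowledge transfer in gradient space is feasible and keeps the simplicity of a standard supervised fine-tuning pipeline.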

📄 Abstract

Multimodal large language models (MLLMs) are rapidly advancing, yet their reasoning ability often lags behind that of strong text-only counterparts. Existing methods to bridge this gap rely on supervised fine-tuning over large-scale multimodal reasoning data or reinforcement learning, both of which are resource-intensive. A promising alternative is model merging, which interpolates parameters between reasoning-enhanced LLMs and multimodal variants. However, our analysis shows that naive merging is not always a "free lunch": its effectiveness varies drastically across model families, with some (e.g., LLaVA, Idefics) benefiting while others (e.g., Qwen) suffer performance degradation. To address this, we propose Directional Reasoning Injection for Fine-Tuning (DRIFT) MLLMs, a lightweight method that transfers reasoning knowledge in the gradient space, without destabilizing multimodal alignment. DRIFT precomputes a reasoning prior as the parameter-space difference between reasoning and multimodal variants, then uses it to bias gradients during multimodal fine-tuning. This approach preserves the simplicity of standard supervised fine-tuning pipelines while enabling efficient reasoning transfer. Extensive experiments on multimodal reasoning benchmarks, including MathVista and MathVerse, demonstrate that DRIFT consistently improves reasoning performance over naive merging and supervised fine-tuning, while matching or surpassing training-heavy methods at a fraction of the cost.
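
The two steps described above (precompute a prior, then bias gradients) can be sketched on flat parameter vectors. Only the definition of the prior as a parameter difference comes from the abstract. The specific biasing rule below (adding a normalized prior direction with a fixed `strength`) is an assumption chosen for illustration; the paper's actual rule may differ.

```python
import math


def reasoning_prior(theta_reasoning, theta_multimodal):
    """Parameter-space difference between reasoning and multimodal variants."""
    return [r - m for r, m in zip(theta_reasoning, theta_multimodal)]


def biased_gradient(grad, prior, strength=0.1):
    """Bias a task gradient toward the reasoning direction (assumed rule).

    Gradient descent subtracts the gradient, so subtracting the unit prior
    here nudges the update toward the reasoning variant's weights.
    """
    norm = math.sqrt(sum(p * p for p in prior))
    if norm == 0:
        return list(grad)
    return [g - strength * p / norm for g, p in zip(grad, prior)]


def sgd_step(theta, grad, lr):
    """Plain SGD update, unchanged from a standard fine-tuning loop."""
    return [t - lr * g for t, g in zip(theta, grad)]
```

The point of the design is that the training loop itself stays a standard supervised fine-tuning loop; only the gradient passed to the optimizer changes.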

[8] TGT: Text-Grounded Trajectories for Locally Controlled Video Generation

Guofeng Zhang, Angtian Wang, Jacob Zhiyuan Fang, Liming Jiang, Haotian Yang, Bo Liu, Yiding Yang, Guang Chen, Longyin Wen, Alan Yuille, Chongyang Ma

🧩 TL;DR

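This paper proposes Text-Grounded Trajectories (TGT), a framework that controls video generation by pairing point trajectories with localized text descriptions. It addresses the limited precision of existing methods in multi-object scenes and the ambiguous correspondence between trajectories and entities, and it outperforms prior methods in visual quality, text alignment, and motion controllability.

📘 Detailed Summary

Motivation: Existing text-to-video methods have limited control over the subject composition of generated scenes. In complex multi-object settings, localized text control based on bounding boxes or segmentation masks lacks precision, and the correspondence between individual trajectories and visual entities becomes unclear as the number of controllable objects grows.

Method: TGT uses Location-Aware Cross-Attention (LACA) to integrate trajectory and localized text signals, and a dual classifier-free guidance (dual-CFG) scheme to modulate local and global text guidance separately. The authors build a data processing pipeline that produces trajectories with localized descriptions of tracked entities and annotate two million high-quality video clips to train TGT.

Result: Extensive experiments show that TGT achieves higher visual quality, more accurate text alignment, and better motion controllability than prior approaches. Point trajectories serve as intuitive motion handles, and pairing each trajectory with text controls both appearance and motion.

Conclusion: By combining trajectories with localized text descriptions, TGT improves the precision and flexibility of control in multi-object video generation and shows that trajectories are an intuitive and practical motion handle.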

📄 Abstract

Text-to-video generation has advanced rapidly in visual fidelity, whereas standard methods still have limited ability to control the subject composition of generated scenes. Prior work shows that adding localized text control signals, such as bounding boxes or segmentation masks, can help. However, these methods struggle in complex scenarios and degrade in multi-object settings, offering limited precision and lacking a clear correspondence between individual trajectories and visual entities as the number of controllable objects increases. We introduce Text-Grounded Trajectories (TGT), a framework that conditions video generation on trajectories paired with localized text descriptions. We propose Location-Aware Cross-Attention (LACA) to integrate these signals and adopt a dual-CFG scheme to separately modulate local and global text guidance. In addition, we develop a data processing pipeline that produces trajectories with localized descriptions of tracked entities, and we annotate two million high quality video clips to train TGT. Together, these components enable TGT to use point trajectories as intuitive motion handles, pairing each trajectory with text to control both appearance and motion. Extensive experiments show that TGT achieves higher visual quality, more accurate text alignment, and improved motion controllability compared with prior approaches. Website: https://textgroundedtraj.github.io.
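
A dual-CFG scheme implies two guidance weights, one for global text and one for local text. The abstract does not state the formula, so the composition below is an assumption: it is one common way to nest two classifier-free guidance terms, with `eps_uncond`, `eps_global`, and `eps_local` standing for hypothetical noise predictions under no text, global text only, and global plus local text.

```python
def dual_cfg(eps_uncond, eps_global, eps_local, w_global, w_local):
    """Combine three noise predictions with two guidance weights.

    Assumed form:
        eps = eps_uncond
              + w_global * (eps_global - eps_uncond)
              + w_local  * (eps_local  - eps_global)
    """
    return [
        u + w_global * (g - u) + w_local * (l - g)
        for u, g, l in zip(eps_uncond, eps_global, eps_local)
    ]
```

With `w_global = 1` and `w_local = 1` the expression reduces to `eps_local`, the fully conditioned prediction; setting `w_local = 0` removes the local term and leaves ordinary single-weight guidance on the global prompt.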

Xingrui Wang, Jiang Liu, Chao Huang, Xiaodong Yu, Ze Wang, Ximeng Sun, Jialian Wu, Alan Yuille, Emad Barsoum, Zicheng Liu

🧩 TL;DR

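This paper presents XModBench, a large-scale tri-modal benchmark designed to evaluate the cross-modal consistency of omni-modal large language models. It reveals significant limitations of current models in modality-invariant reasoning.

📘 Detailed Summary

Motivation: Existing benchmarks mainly evaluate general cross-modal question answering and cannot tell whether omni-modal models achieve modality-invariant reasoning or exhibit modality-specific biases. A dedicated diagnostic tool is needed to evaluate cross-modal consistency systematically.

Method: The authors build XModBench, comprising 60,828 multiple-choice questions that span five task families and systematically cover all six modality compositions in question-answer pairs. This supports fine-grained diagnosis of modality-invariant reasoning, modality disparity, and directional imbalance.

Result: Even the strongest model, Gemini 2.5 Pro, scores below 60% accuracy on spatial and temporal reasoning, shows persistent modality disparity (performance is substantially lower with audio than with text), and exhibits systematic directional imbalance (lower consistency when vision serves as context than when text does).

Conclusion: Current omni-modal large language models remain far from truly modality-invariant reasoning. XModBench can serve as a fundamental diagnostic tool for evaluating and improving cross-modal competence.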

📄 Abstract

Omni-modal large language models (OLLMs) aim to unify audio, vision, and text understanding within a single framework. While existing benchmarks primarily evaluate general cross-modal question-answering ability, it remains unclear whether OLLMs achieve modality-invariant reasoning or exhibit modality-specific biases. We introduce XModBench, a large-scale tri-modal benchmark explicitly designed to measure cross-modal consistency. XModBench comprises 60,828 multiple-choice questions spanning five task families and systematically covers all six modality compositions in question-answer pairs, enabling fine-grained diagnosis of an OLLM's modality-invariant reasoning, modality disparity, and directional imbalance. Experiments show that even the strongest model, Gemini 2.5 Pro, (i) struggles with spatial and temporal reasoning, achieving less than 60% accuracy, (ii) reveals persistent modality disparities, with performance dropping substantially when the same semantic content is conveyed through audio rather than text, and (iii) shows systematic directional imbalance, exhibiting lower consistency when vision serves as context compared to text. These findings indicate that current OLLMs remain far from truly modality-invariant reasoning and position XModBench as a fundamental diagnostic tool for evaluating and improving cross-modal competence. All data and evaluation tools will be available at https://xingruiwang.github.io/projects/XModBench/.
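
The three diagnostics (per-configuration accuracy, modality disparity, and consistency) can be illustrated on a table of per-question correctness. This is a sketch under assumed definitions: the configuration names, the "all configurations agree" notion of consistency, and the data layout are hypothetical and not the benchmark's official metrics.

```python
def accuracy(results, config):
    """Share of questions answered correctly under one modality config."""
    return sum(1 for r in results if r[config]) / len(results)


def disparity(results, config_a, config_b):
    """Accuracy gap between two configs that carry the same semantics."""
    return accuracy(results, config_a) - accuracy(results, config_b)


def consistency(results, configs):
    """Share of questions with the same outcome under every config."""
    same = sum(1 for r in results if len({r[c] for c in configs}) == 1)
    return same / len(results)


# Each row is one question; keys are hypothetical "context->candidates"
# modality configurations, values say whether the model was correct.
example_results = [
    {"text->vision": True, "audio->vision": True},
    {"text->vision": True, "audio->vision": False},
    {"text->vision": False, "audio->vision": False},
    {"text->vision": True, "audio->vision": False},
]
```

A modality-invariant model would show disparity near zero and consistency near one across all six compositions; swapping context and candidate modalities in the config names gives the directional comparison.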

[10] Train a Unified Multimodal Data Quality Classifier with Synthetic Data

Weizhi Wang, Rongmei Lin, Shiyang Li, Colin Lockard, Ritesh Sarkhel, Sanket Lokegaonkar, Jingbo Shang, Xifeng Yan, Nasser Zalmout, Xian Li

🧩 TL;DR

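This study proposes UniFilter, a unified multimodal data quality classifier for selecting high-quality image-text caption data and interleaved document data. Multimodal large language models pre-trained on UniFilter-filtered data show clearly stronger zero-shot reasoning and in-context learning.

📘 Detailed Summary

Motivation: Multimodal large language models are pre-trained on a mixture of image-text caption data and interleaved document data, but high-quality data filtering is under-explored, especially quality assessment for image-text interleaved documents.

Method: The authors train an efficient multimodal large language model as a unified data quality classifier. A semi-synthetic approach takes raw images and generates corresponding text at four quality levels, creating sample-score pairs for both caption and interleaved document data to train UniFilter.

Result: UniFilter is applied to curate high-quality data from the DataComp caption dataset and the OBELICS image-text interleaved dataset. MLLMs pre-trained on the filtered data show clearly stronger zero-shot reasoning and in-context learning than those trained on baseline-filtered data, and perform better on multiple benchmarks after visual supervised fine-tuning.

Conclusion: High-quality multimodal pre-training data brings clear downstream gains. Data filtered by UniFilter strengthens MLLM capabilities, and the release gives the community a reproducible filtering method and high-quality dataset resources.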

📄 Abstract

Multimodal Large Language Models (MLLMs) are continually pre-trained on a mixture of image-text caption data and interleaved document data, while high-quality data filtering for image-text interleaved document data is under-explored. We propose to train an efficient MLLM as a Unified Multimodal Data Quality Classifier to Filter both high-quality image-text caption and interleaved data (UniFilter). To address the challenge of collecting diverse labeled multimodal data, we introduce a semi-synthetic approach that leverages readily available raw images and generates corresponding text across four quality levels. This method enables efficient creation of sample-score pairs for both caption and interleaved document data to train UniFilter. We apply UniFilter to curate high-quality caption data from the DataComp caption dataset and interleaved data from the OBELICS image-text interleaved dataset. MLLMs pre-trained on the filtered data demonstrate significantly enhanced capabilities compared to those trained on baseline-filtered data, achieving stronger zero-shot reasoning and in-context learning capabilities. After visual supervised fine-tuning, these UniFilter-induced MLLMs achieve stronger performance on various benchmarks, highlighting the downstream benefits of high-quality multimodal pre-training. We release the synthetic training data used for training UniFilter, the UniFilter model checkpoints, and the high-quality interleaved document subset OBELICS-HQ, curated by UniFilter, to the community for reproduction and further development.

[11] Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models

Xingrui Wang, Wufei Ma, Tiezheng Zhang, Celso M de Melo, Jieneng Chen, Alan Yuille

🧩 TL;DR

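This paper presents Spatial457, a scalable and unbiased synthetic dataset for comprehensively evaluating large multimodal models on 6D spatial reasoning. It reveals significant limitations of existing models in 3D spatial reasoning.

📘 Detailed Summary

Motivation: Large multimodal models perform well at visual scene understanding and reasoning, but their capacity for complex, precise 3-dimensional spatial reasoning remains uncertain. Existing benchmarks focus mainly on 2D spatial understanding and lack a comprehensive framework for evaluating 6D spatial reasoning.

Method: The authors develop the Spatial457 synthetic dataset around 4 key spatial reasoning capabilities: multi-object recognition, 2D location, 3D location, and 3D orientation. A cascading evaluation structure defines 7 question types across 5 difficulty levels, from basic single-object recognition to newly proposed complex 6D spatial reasoning tasks.

Result: On PulseCheck457, the performance of large multimodal models generally declines as task complexity increases, particularly on 3D reasoning and 6D spatial tasks. The Relative Performance Dropping Rate quantifies key weaknesses in 3D reasoning, and the evaluation also uncovers prediction biases across different attributes.

Conclusion: The study exposes significant limitations of current large multimodal models in 3D spatial reasoning. The dataset and evaluation framework provide a benchmark for future model development, and the prediction bias patterns also appear in real-world image settings, underlining the need to improve 3D spatial understanding.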

📄 Abstract

Although large multimodal models (LMMs) have demonstrated remarkable capabilities in visual scene interpretation and reasoning, their capacity for complex and precise 3-dimensional spatial reasoning remains uncertain. Existing benchmarks focus predominantly on 2D spatial understanding and lack a framework to comprehensively evaluate 6D spatial reasoning across varying complexities. To address this limitation, we present Spatial457, a scalable and unbiased synthetic dataset designed with 4 key capabilities for spatial reasoning: multi-object recognition, 2D location, 3D location, and 3D orientation. We develop a cascading evaluation structure, constructing 7 question types across 5 difficulty levels that range from basic single object recognition to our newly proposed complex 6D spatial reasoning tasks. We evaluated various large multimodal models (LMMs) on PulseCheck457, observing a general decline in performance as task complexity increases, particularly in 3D reasoning and 6D spatial tasks. To quantify these challenges, we introduce the Relative Performance Dropping Rate (RPDR), highlighting key weaknesses in 3D reasoning capabilities. Leveraging the unbiased attribute design of our dataset, we also uncover prediction biases across different attributes, with similar patterns observed in real-world image settings. The code and data are released at https://github.com/XingruiWang/Spatial457.
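
The abstract names the Relative Performance Dropping Rate (RPDR) without defining it. A natural reading, assumed here and not confirmed by the source, is the accuracy lost when moving from an easier level to the next harder one, relative to the easier level's accuracy. The level names and accuracies in the usage example are made up.

```python
def relative_drop(acc_easier, acc_harder):
    """Assumed RPDR: fractional accuracy loss between adjacent levels."""
    if acc_easier <= 0:
        raise ValueError("easier-level accuracy must be positive")
    return (acc_easier - acc_harder) / acc_easier


def drops_along_cascade(accuracies):
    """Relative drop for each consecutive pair in a difficulty cascade."""
    return [
        relative_drop(prev, curr)
        for prev, curr in zip(accuracies, accuracies[1:])
    ]


# Hypothetical accuracies for three cascade levels (easy -> hard).
example_cascade = [0.8, 0.6, 0.3]
```

Under this reading, a large relative drop at the step where 3D information is first required would point to that capability as the weak link, independent of how high the absolute accuracy was at the previous level.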

[12] Hyperparameter Optimization and Reproducibility in Deep Learning Model Training

Usman Afzaal, Ziyu Su, Usama Sajjad, Hao Lu, Mostafa Rezapour, Metin Nafi Gurcan, Muhammad Khalid Khan Niazi

🧩 TL;DR

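This study systematically evaluates reproducibility challenges in training histopathology foundation models. By training a CLIP model on the QUILT-1M dataset, it identifies how key hyperparameters and augmentation strategies affect downstream performance and offers practical reproducibility guidelines for digital pathology.

📘 Detailed Summary

Motivation: Training foundation models for histopathology faces serious reproducibility problems. The main obstacles are software randomness, hardware non-determinism, and inconsistent hyperparameter reporting, which hinder reliable comparison and validation of results.

Method: The authors train a CLIP model on QUILT-1M and systematically evaluate the impact of different hyperparameter settings and data augmentation strategies on three downstream histopathology datasets: PatchCamelyon, LC25000-Lung, and LC25000-Colon.

Result: RandomResizedCrop values in the 0.7-0.8 range perform best, distributed training without local loss improves stability, and learning rates below 5.0e-5 degrade performance on all datasets. The LC25000 (Colon) dataset provides the most reproducible benchmark.

Conclusion: Reproducibility in computational pathology depends not only on transparent documentation but also on carefully chosen experimental configurations. The findings provide practical rules and configuration advice for developing reproducible foundation models for digital pathology.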

📄 Abstract

Reproducibility remains a critical challenge in foundation model training for histopathology, often hindered by software randomness, hardware non-determinism, and inconsistent hyperparameter reporting. To investigate these issues, we trained a CLIP model on the QUILT-1M dataset and systematically evaluated the impact of different hyperparameter settings and augmentation strategies across three downstream histopathology datasets (PatchCamelyon, LC25000-Lung, and LC25000-Colon). Despite variability across runs, we identified clear trends: RandomResizedCrop values of 0.7-0.8 outperformed more aggressive (0.6) or conservative (0.9) settings, distributed training without local loss improved stability, and learning rates below 5.0e-5 consistently degraded performance across all datasets. The LC25000 (Colon) dataset consistently provided the most reproducible benchmark. These findings highlight that reproducibility in computational pathology depends not only on transparent documentation but also on carefully chosen experimental configurations, and we provide practical rules to guide future efforts in developing reproducible foundation models for digital pathology.

[13] Shakti-VLMs: Scalable Vision-Language Models for Enterprise AI

Syed Abdul Gaffar Shakhadri, Kruthika KR, Kartik Basavaraj Angadi

🧩 TL;DR

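Shakti VLM is a family of vision-language models with 1B and 4B parameters. Through architectural innovations and an efficient training strategy, it reaches competitive multimodal performance with less training data, offering an efficient solution for large-scale enterprise applications.

📘 Detailed Summary

Motivation: The work targets data efficiency in multimodal learning. Current vision-language models usually depend on massive training data for strong performance, whereas Shakti explores how model design and training strategy can make multimodal learning data-efficient.

Method: Shakti VLM introduces QK-Normalization to improve attention stability, uses hybrid normalization techniques, enhances positional encoding, and applies a three-stage training strategy to optimize learning efficiency.

Result: Evaluations show that Shakti-VLM-1B and Shakti-VLM-4B perform strongly in document understanding, visual reasoning, OCR extraction, and general multimodal reasoning, reaching competitive performance with fewer training tokens.

Conclusion: High performance can come from model design and training strategy rather than sheer data volume. Shakti provides an efficient solution for enterprise-scale multimodal tasks and highlights the role of architectural innovation in multimodal learning.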

📄 Abstract

We introduce Shakti VLM, a family of vision-language models with 1B and 4B parameters designed to address data efficiency challenges in multimodal learning. While recent VLMs achieve strong performance through extensive training data, Shakti models leverage architectural innovations to attain competitive results with fewer tokens. Key advancements include QK-Normalization for attention stability, hybrid normalization techniques, and enhanced positional encoding. A three-stage training strategy further optimizes learning efficiency. Evaluations show that Shakti-VLM-1B and Shakti-VLM-4B excel in document understanding, visual reasoning, OCR extraction, and general multimodal reasoning. Our results highlight that high performance can be achieved through model design and training strategy rather than sheer data volume, making Shakti an efficient solution for enterprise-scale multimodal tasks.
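
QK-Normalization generally means normalizing queries and keys before the attention dot product so that logit magnitudes stay bounded. The sketch below shows the idea for a single query using L2 normalization and a fixed `scale`; the choice of L2 (rather than LayerNorm or RMSNorm) and the scale value are assumptions, since the abstract does not describe Shakti's exact variant.

```python
import math


def l2_normalize(vec, eps=1e-6):
    """Scale a vector to unit length (eps guards against zero vectors)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


def qk_norm_attention_weights(query, keys, scale=10.0):
    """Softmax attention weights computed from normalized q and k.

    After normalization each logit is scale * cosine(query, key), so it
    lies in [-scale, scale] regardless of how large the raw activations are.
    """
    q = l2_normalize(query)
    logits = [
        scale * sum(a * b for a, b in zip(q, l2_normalize(k)))
        for k in keys
    ]
    peak = max(logits)                       # subtract max for stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Because only directions matter after normalization, multiplying the query by 1000 leaves the attention weights unchanged, which is the stability property the technique is used for.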

[14] CARDIUM: Congenital Anomaly Recognition with Diagnostic Images and Unified Medical records

Daniela Vega, Hannah V. Ceballos, Javier S. Vera, Santiago Rodriguez, Alejandra Perez, Angela Castillo, Maria Escobar, Dario Londoño, Luis A. Sarmiento, Camila I. Castro, Nadiezhda Rodriguez, Juan C. Briceño, Pablo Arbeláez

🧩 TL;DR

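This paper introduces CARDIUM, the first publicly available multimodal dataset for prenatal diagnosis of fetal congenital heart disease, and develops a multimodal transformer architecture with cross-attention that substantially improves diagnostic performance.

📘 Detailed Summary

Motivation: Prenatal diagnosis of congenital heart disease suffers from scarce, low-quality data and poor integration of multiple information sources. Existing AI models are limited by data imbalance and single-modality input, and the absence of a public multimodal dataset has held back research.

Method: The CARDIUM dataset combines fetal ultrasound and echocardiographic images with maternal clinical records. The authors design a multimodal transformer architecture that uses a cross-attention mechanism to fuse feature representations from image and tabular data.

Result: On the CARDIUM dataset, the multimodal method achieves an F1 score of 79.8 ± 4.8%, improving detection by 11% and 50% over image-only and tabular-only approaches, respectively.

Conclusion: By providing the first public multimodal dataset and an effective multimodal fusion method, the work opens a new route for prenatal congenital heart disease diagnosis. The public release of dataset and code should encourage further research in the field.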

📄 Abstract

Prenatal diagnosis of Congenital Heart Diseases (CHDs) holds great potential for Artificial Intelligence (AI)-driven solutions. However, collecting high-quality diagnostic data remains difficult due to the rarity of these conditions, resulting in imbalanced and low-quality datasets that hinder model performance. Moreover, no public efforts have been made to integrate multiple sources of information, such as imaging and clinical data, further limiting the ability of AI models to support and enhance clinical decision-making. To overcome these challenges, we introduce the Congenital Anomaly Recognition with Diagnostic Images and Unified Medical records (CARDIUM) dataset, the first publicly available multimodal dataset consolidating fetal ultrasound and echocardiographic images along with maternal clinical records for prenatal CHD detection. Furthermore, we propose a robust multimodal transformer architecture that incorporates a cross-attention mechanism to fuse feature representations from image and tabular data, improving CHD detection by 11% and 50% over image and tabular single-modality approaches, respectively, and achieving an F1 score of 79.8 $\pm$ 4.8% on the CARDIUM dataset. We will publicly release our dataset and code to encourage further research in this unexplored field. Our dataset and code are available at https://github.com/BCVUniandes/Cardium, and at the project website https://bcv-uniandes.github.io/CardiumPage/

[15] The Face of Persuasion: Analyzing Bias and Generating Culture-Aware Ads

Aysan Aghazadeh, Adriana Kovashka

🧩 TL;DR

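This study investigates potential demographic bias when text-to-image models are used to customize advertisements. It analyzes demographic bias across ad topics, compares the persuasiveness of ads that are identical except for the gender or race of the people portrayed, and explores a technique for targeting ads to specific countries.

📘 Detailed Summary

Motivation: Text-to-image models are appealing for customizing visual advertisements and targeting specific populations, but they carry a risk of demographic bias. The study examines how this bias appears in ads and how it affects persuasiveness.

Method: The authors analyze demographic bias across different ad topics, run controlled comparisons of the persuasiveness of otherwise identical ads that differ only in the gender or race of the people shown, and develop a technique for targeting ads to specific countries.

Result: According to the summary, ads generated by text-to-image models show notable demographic bias, different ad topics show different bias patterns, and changing the gender or race of the people in otherwise identical ads changes the persuasiveness judged by models.

Conclusion: Text-to-image models exhibit systematic demographic bias in advertising applications, which may lead to unfair ad outcomes. The study stresses the importance of fairness and bias mitigation in AI-driven advertising systems.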

📄 Abstract

Text-to-image models are appealing for customizing visual advertisements and targeting specific populations. We investigate this potential by examining the demographic bias within ads for different ad topics, and the disparate level of persuasiveness (judged by models) of ads that are identical except for gender/race of the people portrayed. We also experiment with a technique to target ads for specific countries. The code is available at https://github.com/aysanaghazadeh/FaceOfPersuasion

[16] DriveGen3D: Boosting Feed-Forward Driving Scene Generation with Efficient Video Diffusion

Weijie Wang, Jiagang Zhu, Zeyu Zhang, Xiaofeng Wang, Zheng Zhu, Guosheng Zhao, Chaojun Ni, Haoxiao Wang, Guan Huang, Xinze Chen, Yukun Zhou, Wenkang Qin, Duochao Shi, Haoyun Li, Guanghong Jia, Jiwen Lu

🧩 TL;DR

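DriveGen3D is a framework that combines accelerated long-term video generation with large-scale dynamic scene reconstruction to generate high-quality, highly controllable dynamic 3D driving scenes. It addresses the limits of existing methods in computational efficiency, temporal consistency, and 3D representation.

📘 Detailed Summary

Motivation: Current driving scene synthesis methods have clear limitations. They either cannot generate long sequences because of prohibitive computational demands, focus only on long video synthesis without a 3D representation, or are restricted to static single-scene reconstruction. This work bridges the gap by integrating accelerated long-term video generation with large-scale dynamic scene reconstruction under multimodal conditional control.

Method: DriveGen3D uses a unified pipeline with two specialized components. FastDrive-DiT is an efficient video diffusion transformer for high-resolution, temporally coherent video synthesis guided by text and Bird's-Eye-View (BEV) layouts. FastRecon3D is a feed-forward reconstruction module that rapidly builds 3D Gaussian representations across time and ensures spatial-temporal consistency.

Result: The framework generates extended driving videos in real time (up to 424×800 at 12 FPS) together with the corresponding dynamic 3D scenes, reaching SSIM of 0.811 and PSNR of 22.84 on novel view synthesis while remaining parameter-efficient.

Conclusion: The work shows that accelerated video generation and dynamic 3D reconstruction can be integrated through specialized components, providing an efficient solution for autonomous driving simulation and virtual environment creation and demonstrating the value of multimodal conditional control for complex scene generation.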

📄 Abstract

We present DriveGen3D, a novel framework for generating high-quality and highly controllable dynamic 3D driving scenes that addresses critical limitations in existing methodologies. Current approaches to driving scene synthesis either suffer from prohibitive computational demands for extended temporal generation, focus exclusively on prolonged video synthesis without 3D representation, or restrict themselves to static single-scene reconstruction. Our work bridges this methodological gap by integrating accelerated long-term video generation with large-scale dynamic scene reconstruction through multimodal conditional control. DriveGen3D introduces a unified pipeline consisting of two specialized components: FastDrive-DiT, an efficient video diffusion transformer for high-resolution, temporally coherent video synthesis under text and Bird's-Eye-View (BEV) layout guidance; and FastRecon3D, a feed-forward reconstruction module that rapidly builds 3D Gaussian representations across time, ensuring spatial-temporal consistency. Together, these components enable real-time generation of extended driving videos (up to $424\times800$ at 12 FPS) and corresponding dynamic 3D scenes, achieving SSIM of 0.811 and PSNR of 22.84 on novel view synthesis, all while maintaining parameter efficiency.
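
For readers less familiar with the reported metric, PSNR is defined from the mean squared error between a rendered and a reference image as 10·log10(MAX² / MSE). The helper below is a small reference implementation on flat pixel lists with values assumed to lie in [0, 1]; it illustrates the standard metric only and is not the paper's evaluation code.

```python
import math


def mean_squared_error(image_a, image_b):
    """MSE between two equally sized flat lists of pixel values."""
    if len(image_a) != len(image_b) or not image_a:
        raise ValueError("images must be non-empty and the same size")
    return sum((a - b) ** 2 for a, b in zip(image_a, image_b)) / len(image_a)


def psnr(image_a, image_b, max_value=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    mse = mean_squared_error(image_a, image_b)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_for_psnr(psnr_db, max_value=1.0):
    """Invert the PSNR formula to recover the implied MSE."""
    return max_value ** 2 / (10.0 ** (psnr_db / 10.0))
```

As a rough guide, if the reported 22.84 is in dB on a [0, 1] pixel scale (an assumption), `mse_for_psnr(22.84)` gives an implied mean squared error of roughly 0.005.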

[17] Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models

Shuang Liang, Zhihao Xu, Jialing Tao, Hui Xue, Xiting Wang

🧩 TL;DR

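This paper proposes the Learning to Detect (LoD) framework, which accurately detects unknown jailbreak attacks by shifting the focus from attack-specific learning to task-specific learning. It comprises a Multi-modal Safety Concept Activation Vector module for safety-oriented representation learning and a Safety Pattern Auto-Encoder module for unsupervised attack classification.

📘 Detailed Summary

Motivation: Despite extensive alignment efforts, Large Vision-Language Models (LVLMs) remain vulnerable to jailbreak attacks, which pose serious safety risks. Existing detection methods either learn attack-specific parameters, which limits generalization to unseen attacks, or rely on heuristic principles, which limits accuracy and efficiency.

Method: The authors propose Learning to Detect (LoD), a general framework consisting of a Multi-modal Safety Concept Activation Vector module for safety-oriented representation learning and a Safety Pattern Auto-Encoder module for unsupervised attack classification, moving from attack-specific to task-specific learning.

Result: Extensive experiments show that the method achieves consistently higher detection AUROC on diverse unknown attacks while improving detection efficiency, and that it performs strongly across multiple benchmarks.

Conclusion: The study demonstrates the effectiveness of task-specific learning for jailbreak detection, offers a new approach to building safer LVLMs, and could be extended to detecting other safety threats.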

📄 Abstract

Despite extensive alignment efforts, Large Vision-Language Models (LVLMs) remain vulnerable to jailbreak attacks, posing serious safety risks. To address this, existing detection methods either learn attack-specific parameters, which hinders generalization to unseen attacks, or rely on heuristically sound principles, which limit accuracy and efficiency. To overcome these limitations, we propose Learning to Detect (LoD), a general framework that accurately detects unknown jailbreak attacks by shifting the focus from attack-specific learning to task-specific learning. This framework includes a Multi-modal Safety Concept Activation Vector module for safety-oriented representation learning and a Safety Pattern Auto-Encoder module for unsupervised attack classification. Extensive experiments show that our method achieves consistently higher detection AUROC on diverse unknown attacks while improving efficiency. The code is available at https://anonymous.4open.science/r/Learning-to-Detect-51CB.

Hanrong Ye, Chao-Han Huck Yang, Arushi Goel, Wei Huang, Ligeng Zhu, Yuanhang Su, Sean Lin, An-Chieh Cheng, Zhen Wan, Jinchuan Tian, Yuming Lou, Dong Yang, Zhijian Liu, Yukang Chen, Ambrish Dantrey, Ehsan Jahangiri, Sreyan Ghosh, Daguang Xu, Ehsan Hosseini-Asl, Danial Mohseni Taheri, Vidya Murali, Sifei Liu, Jason Lu, Oluwatobi Olabiyi, Frank Wang, Rafael Valle, Bryan Catanzaro, Andrew Tao, Song Han, Jan Kautz, Hongxu Yin, Pavlo Molchanov

🧩 TL;DR

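This work presents OmniVinci, a strong open-source omni-modal LLM that, through new architecture designs and data curation, substantially outperforms existing models on cross-modal understanding while using 6 times fewer training tokens.

📘 Detailed Summary

Motivation: Advancing machine intelligence requires the ability to perceive across multiple modalities, much as humans sense the world through several senses. The work aims to build a strong open-source omni-modal LLM that addresses key challenges in multimodal alignment and understanding.

Method: Three key architectural innovations are proposed: OmniAlignNet, which strengthens alignment between vision and audio embeddings in a shared omni-modal latent space; Temporal Embedding Grouping, which captures relative temporal alignment between vision and audio signals; and Constrained Rotary Time Embedding, which encodes absolute temporal information in omni-modal embeddings. The authors also develop a curation and synthesis pipeline that generates 24M single-modal and omni-modal conversations.

Result: OmniVinci outperforms Qwen2.5-Omni by 19.05 points on DailyOmni (cross-modal understanding), 1.7 points on MMAR (audio), and 3.9 points on Video-MME (vision), while using only 0.2T training tokens, a 6-fold reduction from Qwen2.5-Omni's 1.2T.

Conclusion: The study finds that modalities reinforce one another in both perception and reasoning, demonstrates omni-modal advantages in downstream applications such as robotics, medical AI, and smart factories, and offers insights for building more efficient multimodal systems.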

📄 Abstract

Advancing machine intelligence requires developing the ability to perceive across multiple modalities, much as humans sense the world. We introduce OmniVinci, an initiative to build a strong, open-source, omni-modal LLM. We carefully study the design choices across model architecture and data curation. For model architecture, we present three key innovations: (i) OmniAlignNet for strengthening alignment between vision and audio embeddings in a shared omni-modal latent space; (ii) Temporal Embedding Grouping for capturing relative temporal alignment between vision and audio signals; and (iii) Constrained Rotary Time Embedding for encoding absolute temporal information in omni-modal embeddings. We introduce a curation and synthesis pipeline that generates 24M single-modal and omni-modal conversations. We find that modalities reinforce one another in both perception and reasoning. Our model, OmniVinci, outperforms Qwen2.5-Omni with +19.05 on DailyOmni (cross-modal understanding), +1.7 on MMAR (audio), and +3.9 on Video-MME (vision), while using just 0.2T training tokens - a 6 times reduction compared to Qwen2.5-Omni's 1.2T. We finally demonstrate omni-modal advantages in downstream applications spanning robotics, medical AI, and smart factory.

[19] Semantic4Safety: Causal Insights from Zero-shot Street View Imagery Segmentation for Urban Road Safety

Huan Chen, Ting Han, Siyu Chen, Zhihao Guo, Yiping Chen, Meiliu Wu

🧩 TL;DR

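This study proposes Semantic4Safety, a framework that extracts interpretable streetscape indicators from street-view imagery via zero-shot semantic segmentation and uses causal inference to quantify their causal effects on different accident types, providing a data-driven tool for urban road safety planning.

📘 Detailed Summary

Motivation: The study addresses two core challenges in street-view imagery analysis: how to construct street-level indicators that capture accident-related features, and how to quantify the causal impact of those indicators on different accident types. Existing methods fall short in extracting interpretable safety indicators from street-view data and in establishing causal links.

Method: Semantic4Safety applies zero-shot semantic segmentation to street-view imagery to derive 11 interpretable streetscape indicators and integrates road type as contextual information. An XGBoost multi-class classifier is combined with SHAP to analyze feature contributions, and Generalized Propensity Score (GPS) weighting with Average Treatment Effect (ATE) estimation is used to control confounding and quantify causal effects.

Result: The analysis uncovers heterogeneous, accident-type-specific causal patterns: features capturing scene complexity, exposure, and roadway geometry dominate predictive power; larger drivable area and emergency space reduce risk, whereas excessive visual openness can increase it. The analysis of approximately 30,000 accident records in Austin validates the approach.

Conclusion: By combining predictive modeling with causal inference, Semantic4Safety supports targeted interventions and high-risk corridor diagnosis and provides a scalable, data-driven tool for urban road safety planning. The framework identifies the key factors behind different accident types, giving an evidence base for targeted safety interventions.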

📄 Abstract

Street-view imagery (SVI) offers a fine-grained lens on traffic risk, yet two fundamental challenges persist: (1) how to construct street-level indicators that capture accident-related features, and (2) how to quantify their causal impacts across different accident types. To address these challenges, we propose Semantic4Safety, a framework that applies zero-shot semantic segmentation to SVIs to derive 11 interpretable streetscape indicators, and integrates road type as contextual information to analyze approximately 30,000 accident records in Austin. Specifically, we train an eXtreme Gradient Boosting (XGBoost) multi-class classifier and use Shapley Additive Explanations (SHAP) to interpret both global and local feature contributions, and then apply Generalized Propensity Score (GPS) weighting and Average Treatment Effect (ATE) estimation to control confounding and quantify causal effects. Results uncover heterogeneous, accident-type-specific causal patterns: features capturing scene complexity, exposure, and roadway geometry dominate predictive power; larger drivable area and emergency space reduce risk, whereas excessive visual openness can increase it. By bridging predictive modeling with causal inference, Semantic4Safety supports targeted interventions and high-risk corridor diagnosis, offering a scalable, data-informed tool for urban road safety planning.

Jinghao Huang, Yaxiong Chen, Ganchao Liu

🧩 TL;DR

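This paper is the first to systematically propose and study the drone video-text retrieval task. It introduces Multi-Semantic Adaptive Mining (MSAM), which uses fine-grained interaction and cross-modal feature fusion to substantially improve semantic retrieval of drone videos.

📘 Detailed Summary

Motivation: With the rapid development of drone technology, the volume of video data is growing sharply, creating an urgent need for efficient semantic retrieval. Existing cross-modal methods are designed mainly for ground-level views and cannot effectively model the characteristics of drone videos: overhead perspectives, strong structural homogeneity, and diverse semantic expressions of target combinations. Retrieval mechanisms dedicated to drone scenarios are therefore needed.

Method: MSAM includes a multi-semantic adaptive learning mechanism that models dynamic changes between frames and extracts semantics from specific scene regions to deepen understanding of drone video content. It integrates an adaptive semantic construction module, a distribution-driven semantic learning term, and a diversity semantic term, and introduces a cross-modal interactive feature fusion pooling mechanism to reduce interference from complex backgrounds.

Result: Extensive experiments on two self-constructed drone video-text datasets show that MSAM outperforms other existing methods on drone video-text retrieval, confirming the effectiveness of the approach.

Conclusion: The study provides a dedicated solution for semantic retrieval of drone videos and shows that retrieval mechanisms tailored to drone video characteristics are necessary. Through deep modality interaction and background noise suppression, MSAM opens a new research direction in drone video analysis; the code and datasets will be released publicly.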

📄 Abstract

With the advancement of drone technology, the volume of video data increases rapidly, creating an urgent need for efficient semantic retrieval. We are the first to systematically propose and study the drone video-text retrieval (DVTR) task. Drone videos feature overhead perspectives, strong structural homogeneity, and diverse semantic expressions of target combinations, which challenge existing cross-modal methods designed for ground-level views in effectively modeling their characteristics. Therefore, dedicated retrieval mechanisms tailored for drone scenarios are necessary. To address this issue, we propose a novel approach called Multi-Semantic Adaptive Mining (MSAM). MSAM introduces a multi-semantic adaptive learning mechanism, which incorporates dynamic changes between frames and extracts rich semantic information from specific scene regions, thereby enhancing the deep understanding and reasoning of drone video content. This method relies on fine-grained interactions between words and drone video frames, integrating an adaptive semantic construction module, a distribution-driven semantic learning term and a diversity semantic term to deepen the interaction between text and drone video modalities and improve the robustness of feature representation. To reduce the interference of complex backgrounds in drone videos, we introduce a cross-modal interactive feature fusion pooling mechanism that focuses on feature extraction and matching in target regions, minimizing noise effects. Extensive experiments on two self-constructed drone video-text datasets show that MSAM outperforms other existing methods in the drone video-text retrieval task. The source code and dataset will be made publicly available.

[21] Exploring Conditions for Diffusion models in Robotic Control

Heeseong Shin, Byeongho Heo, Dongyoon Han, Seungryong Kim, Taekyung Kim

🧩 TL;DR

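This study proposes ORCA, which adapts a pre-trained text-to-image diffusion model with learnable task prompts and visual prompts to provide task-adaptive visual representations for robotic control, without fine-tuning the model itself. The method achieves state-of-the-art performance on various robotic control benchmarks.

📘 Detailed Summary

Motivation: Pre-trained visual representations are usually kept frozen during imitation learning, which makes them task-agnostic. Although textual conditions succeed in other vision domains, applying them directly to robotic control yields minimal or even negative gains, which the authors attribute to the domain gap between the diffusion model's training data and robotic control environments.

Method: ORCA introduces learnable task prompts that adapt to the control environment and visual prompts that capture fine-grained, frame-specific details. These newly devised conditions facilitate task-adaptive representations without fine-tuning the pre-trained diffusion model.

Result: The method achieves state-of-the-art performance on various robotic control benchmarks, significantly surpassing prior methods. The experiments indicate that task-adaptive representations effectively improve control performance.

Conclusion: The study shows the importance of conditions that account for the specific, dynamic visual information required for control, and bridges the domain gap between pre-trained diffusion models and robotic control environments. Learnable prompts offer an effective way to use large pre-trained models for robotic control.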

📄 Abstract

While pre-trained visual representations have significantly advanced imitation learning, they are often task-agnostic as they remain frozen during policy learning. In this work, we explore leveraging pre-trained text-to-image diffusion models to obtain task-adaptive visual representations for robotic control, without fine-tuning the model itself. However, we find that naively applying textual conditions - a successful strategy in other vision domains - yields minimal or even negative gains in control tasks. We attribute this to the domain gap between the diffusion model's training data and robotic control environments, leading us to argue for conditions that consider the specific, dynamic visual information required for control. To this end, we propose ORCA, which introduces learnable task prompts that adapt to the control environment and visual prompts that capture fine-grained, frame-specific details. Through facilitating task-adaptive representations with our newly devised conditions, our approach achieves state-of-the-art performance on various robotic control benchmarks, significantly surpassing prior methods.

[22] ClapperText: A Benchmark for Text Recognition in Low-Resource Archival Documents

Tingyu Lin, Marco Peer, Florian Kleber, Robert Sablatnig

🧩 TL;DR

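This paper presents ClapperText, a benchmark dataset for handwritten and printed text recognition in visually degraded, low-resource settings. Derived from clapperboards in World War II-era archival video, it contains 9,813 annotated frames and 94,573 word-level text instances, offering a realistic and culturally grounded resource for historical document analysis.

📘 Detailed Summary

Motivation: The work addresses the challenge of recognizing visually degraded text in historical archive analysis. Clapperboard text recognition in particular faces motion blur, handwriting variation, exposure fluctuations, and cluttered backgrounds, mirroring the broader challenge in historical document analysis of structured content appearing in degraded, non-standard forms.

Method: The authors build the ClapperText dataset from 127 World War II-era archival video segments, providing 9,813 annotated frames and 94,573 word-level text instances, of which 67% are handwritten and 1,566 are partially occluded. Each instance is annotated with transcription, semantic category, text type, and occlusion status, with rotated bounding boxes represented as 4-point polygons to support spatially precise OCR applications.

Result: Using a consistent per-video evaluation protocol, six representative recognition models and seven detection models are benchmarked under zero-shot and fine-tuned conditions. Despite the small training set (only 18 videos), fine-tuning yields substantial performance gains, highlighting ClapperText's suitability for few-shot learning scenarios.

Conclusion: ClapperText offers a realistic and culturally grounded resource for advancing robust OCR and document understanding in low-resource archival contexts. The study underlines the importance of fine-tuning in historical document analysis and provides a standardized evaluation framework for developing models that handle visually degraded text.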

📄 Abstract

This paper presents ClapperText, a benchmark dataset for handwritten and printed text recognition in visually degraded and low-resource settings. The dataset is derived from 127 World War II-era archival video segments containing clapperboards that record structured production metadata such as date, location, and camera-operator identity. ClapperText includes 9,813 annotated frames and 94,573 word-level text instances, 67% of which are handwritten and 1,566 are partially occluded. Each instance includes transcription, semantic category, text type, and occlusion status, with annotations available as rotated bounding boxes represented as 4-point polygons to support spatially precise OCR applications. Recognizing clapperboard text poses significant challenges, including motion blur, handwriting variation, exposure fluctuations, and cluttered backgrounds, mirroring broader challenges in historical document analysis where structured content appears in degraded, non-standard forms. We provide both full-frame annotations and cropped word images to support downstream tasks. Using a consistent per-video evaluation protocol, we benchmark six representative recognition and seven detection models under zero-shot and fine-tuned conditions. Despite the small training set (18 videos), fine-tuning leads to substantial performance gains, highlighting ClapperText's suitability for few-shot learning scenarios. The dataset offers a realistic and culturally grounded resource for advancing robust OCR and document understanding in low-resource archival contexts. The dataset and evaluation code are available at https://github.com/linty5/ClapperText.

[23] Lightweight CycleGAN Models for Cross-Modality Image Transformation and Experimental Quality Assessment in Fluorescence Microscopy

Mohammad Soltaninezhad, Yashar Rouzbahani, Jhonatan Contreras, Rohan Chippalkatti, Daniel Kwaku Abankwa, Christian Eggeling, Thomas Bocklitz

🧩 TL;DR

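This paper proposes a lightweight CycleGAN for modality transfer in fluorescence microscopy. A fixed-channel strategy cuts the parameter count from 41.8 million to about 9,000 while giving faster training and lower memory usage. The model also serves as a diagnostic tool for experimental and labeling quality, identifying imaging problems by comparing generated images with experimental ones.

📘 Detailed Summary

Motivation: The study addresses the challenge of unpaired datasets in fluorescence microscopy modality transfer while reducing the computational cost and environmental impact of deep learning models. Conventional approaches have too many parameters and train inefficiently, so a lighter and more efficient solution is needed.

Method: A lightweight CycleGAN architecture replaces the traditional channel-doubling strategy in the U-Net-based generator with a fixed-channel approach. This design greatly reduces the number of model parameters while preserving modality transfer performance. The model is also used as a diagnostic tool to assess experimental image quality.

Result: Model parameters drop from 41.8 million to about 9,000, with faster training and lower memory usage. When trained on high-quality images, the model learns the characteristics of optimal imaging, and deviations between its generated outputs and experimental images can reveal issues such as photobleaching, artifacts, or inaccurate labeling.

Conclusion: The lightweight CycleGAN not only achieves efficient modality transfer but also establishes a new use as an experimental validation tool. It provides a practical way to verify image fidelity in microscopy workflows and shows the potential of deep learning models for quality control of scientific instruments.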

📄 Abstract

Lightweight deep learning models offer substantial reductions in computational cost and environmental impact, making them crucial for scientific applications. We present a lightweight CycleGAN for modality transfer in fluorescence microscopy (confocal to super-resolution STED/deconvolved STED), addressing the common challenge of unpaired datasets. By replacing the traditional channel-doubling strategy in the U-Net-based generator with a fixed channel approach, we drastically reduce trainable parameters from 41.8 million to approximately nine thousand, achieving superior performance with faster training and lower memory usage. We also introduce the GAN as a diagnostic tool for experimental and labeling quality. When trained on high-quality images, the GAN learns the characteristics of optimal imaging; deviations between its generated outputs and new experimental images can reveal issues such as photobleaching, artifacts, or inaccurate labeling. This establishes the model as a practical tool for validating experimental accuracy and image fidelity in microscopy workflows.
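
To see why a fixed channel width shrinks a generator so sharply, it helps to count convolution weights. The sketch below compares a hypothetical four-layer encoder under channel doubling with the same depth at a fixed width. The layer widths (64 to 512 versus a constant 8), the 3×3 kernels, and the single input channel are assumptions chosen for illustration; they are not the paper's architecture and do not reproduce its 41.8 million and roughly nine thousand figures.

```python
def conv_params(c_in, c_out, kernel=3):
    """Weights plus biases of one 2D convolution layer."""
    return c_in * c_out * kernel * kernel + c_out


def encoder_params(channels, in_channels=1):
    """Total parameters of a plain stack of convolutions with the given output widths."""
    total = 0
    previous = in_channels
    for width in channels:
        total += conv_params(previous, width)
        previous = width
    return total


# Hypothetical widths: doubling at every stage versus a constant width.
doubling = encoder_params([64, 128, 256, 512])
fixed = encoder_params([8, 8, 8, 8])
print(doubling, fixed)  # 1549824 1832
```

Because each layer's cost scales with the product of its input and output widths, doubling the width at every stage roughly quadruples the per-layer cost, whereas a fixed width keeps it constant.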

Zhen Sun, Lei Tan, Yunhang Shen, Chengmao Cai, Xing Sun, Pingyang Dai, Liujuan Cao, Rongrong Ji

🧩 TL;DR

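This paper proposes FlexiReID, a flexible multimodal person re-identification framework that supports seven retrieval modes across four modalities (RGB, infrared, sketch, and text) and achieves strong performance through an adaptive mixture-of-experts mechanism and a cross-modal query fusion module.

📘 Detailed Summary

Motivation: Existing methods focus on limited cross-modal settings and cannot support arbitrary query-retrieval combinations, which severely limits flexibility in practical deployment.

Method: FlexiReID introduces an adaptive mixture-of-experts (MoE) mechanism to dynamically integrate features from different modalities and a cross-modal query fusion module to enhance multimodal feature extraction.

Result: Extensive experiments on CIRS-PEDES, a unified dataset constructed by the authors, show that FlexiReID achieves state-of-the-art performance and generalizes well in complex scenarios.

Conclusion: The study shows the importance of a flexible framework supporting multiple modality combinations in person re-identification and offers an effective solution for complex multimodal retrieval needs in real applications.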

📄 Abstract

Multimodal person re-identification (Re-ID) aims to match pedestrian images across different modalities. However, most existing methods focus on limited cross-modal settings and fail to support arbitrary query-retrieval combinations, hindering practical deployment. We propose FlexiReID, a flexible framework that supports seven retrieval modes across four modalities: RGB, infrared, sketches, and text. FlexiReID introduces an adaptive mixture-of-experts (MoE) mechanism to dynamically integrate diverse modality features and a cross-modal query fusion module to enhance multimodal feature extraction. To facilitate comprehensive evaluation, we construct CIRS-PEDES, a unified dataset extending four popular Re-ID datasets to include all four modalities. Extensive experiments demonstrate that FlexiReID achieves state-of-the-art performance and offers strong generalization in complex scenarios.

[25] Valeo Near-Field: a novel dataset for pedestrian intent detection

Antonyo Musabini, Rachid Benmokhtar, Jagdish Bhanushali, Victor Galizzi, Bertrand Luvison, Xavier Perrotton

🧩 TL;DR

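This paper presents a new multimodal dataset for detecting pedestrian intent, comprising fisheye camera, lidar, ultrasonic sensor, and motion capture data, intended as a benchmarking resource for intelligent-vehicle perception algorithms in near-field scenarios.

📘 Detailed Summary

Motivation: The study addresses the challenge of accurately detecting pedestrian intent for intelligent vehicles in near-field scenarios, particularly under real-world conditions such as sensor occlusions, dynamic environments, and hardware constraints, where existing datasets lack synchronized multimodal data and detailed 3D pose annotations.

Method: The authors build a comprehensive dataset of synchronized multimodal data, including fisheye camera video, lidar scans, ultrasonic sensor readings, and motion capture-based 3D body poses, with detailed annotations of 3D body joint positions and accurate 3D pedestrian positions.

Result: A portion of the dataset is released together with a comprehensive benchmark suite featuring evaluation metrics for accuracy, efficiency, and scalability on embedded systems, along with baseline performance metrics obtained with custom neural network architectures.

Conclusion: The dataset is a unique resource for pedestrian detection, 3D pose estimation, and 4D trajectory and intention prediction. It gives researchers a foundation for developing advanced algorithms, and the authors suggest future research directions to encourage adoption and enhancement of the dataset.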

📄 Abstract

This paper presents a novel dataset aimed at detecting pedestrians' intentions as they approach an ego-vehicle. The dataset comprises synchronized multi-modal data, including fisheye camera feeds, lidar laser scans, ultrasonic sensor readings, and motion capture-based 3D body poses, collected across diverse real-world scenarios. Key contributions include detailed annotations of 3D body joint positions synchronized with fisheye camera images, as well as accurate 3D pedestrian positions extracted from lidar data, facilitating robust benchmarking for perception algorithms. We release a portion of the dataset along with a comprehensive benchmark suite, featuring evaluation metrics for accuracy, efficiency, and scalability on embedded systems. By addressing real-world challenges such as sensor occlusions, dynamic environments, and hardware constraints, this dataset offers a unique resource for developing and evaluating state-of-the-art algorithms in pedestrian detection, 3D pose estimation and 4D trajectory and intention prediction. Additionally, we provide baseline performance metrics using custom neural network architectures and suggest future research directions to encourage the adoption and enhancement of the dataset. This work aims to serve as a foundation for researchers seeking to advance the capabilities of intelligent vehicles in near-field scenarios.

[26] Quantized FCA: Efficient Zero-Shot Texture Anomaly Detection

Andrei-Timotei Ardelean, Patrick Rückbeil, Tim Weyrich

🧩 TL;DR

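This study proposes QFCA, a real-time zero-shot anomaly localization method that uses a quantized feature correspondence analysis algorithm to obtain a 10x speedup while maintaining high accuracy in texture anomaly detection.

📘 Detailed Summary

Motivation: Existing zero-shot anomaly localization methods run too slowly for texture anomaly detection, making them hard to deploy in real-world scenarios such as production line monitoring. This work targets that deployment bottleneck.

Method: QFCA implements a quantized version of the feature correspondence analysis (FCA) algorithm by adapting the patch statistics comparison to work on histograms of quantized values, and adds a feature preprocessing step based on principal component analysis that enhances the contrast between normal and anomalous features.

Result: The method obtains a 10x speedup with little to no loss in accuracy, improves detection precision on complex textures, and compares favorably in a thorough evaluation against existing methods.

Conclusion: QFCA shows that a quantization strategy can substantially improve the runtime efficiency of anomaly detection without sacrificing accuracy, providing a viable path for deploying zero-shot anomaly localization in real-time applications.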

📄 Abstract

Zero-shot anomaly localization is a rising field in computer vision research, with important progress in recent years. This work focuses on the problem of detecting and localizing anomalies in textures, where anomalies can be defined as the regions that deviate from the overall statistics, violating the stationarity assumption. The main limitation of existing methods is their high running time, making them impractical for deployment in real-world scenarios, such as assembly line monitoring. We propose a real-time method, named QFCA, which implements a quantized version of the feature correspondence analysis (FCA) algorithm. By carefully adapting the patch statistics comparison to work on histograms of quantized values, we obtain a 10x speedup with little to no loss in accuracy. Moreover, we introduce a feature preprocessing step based on principal component analysis, which enhances the contrast between normal and anomalous features, improving the detection precision on complex textures. Our method is thoroughly evaluated against prior art, comparing favorably with existing methods. Project page: https://reality.tf.fau.de/pub/ardelean2025quantized.html
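
The idea of comparing patch statistics through histograms of quantized values can be illustrated on a one-dimensional toy signal. The sketch below is a simplified stand-in, not the FCA or QFCA algorithm: it quantizes scalar features into a few bins and scores each sliding patch by the L1 distance between its histogram and the histogram of the whole signal. All names (`quantize`, `histogram`, `patch_scores`) and the choice of L1 distance are assumptions made for this example.

```python
def quantize(values, num_bins, lo, hi):
    """Map each value in [lo, hi] to an integer bin index in [0, num_bins - 1]."""
    width = (hi - lo) / num_bins
    return [min(num_bins - 1, max(0, int((v - lo) / width))) for v in values]


def histogram(codes, num_bins):
    """Normalised histogram of integer codes."""
    counts = [0] * num_bins
    for code in codes:
        counts[code] += 1
    return [c / len(codes) for c in counts]


def patch_scores(values, patch, num_bins=4, lo=0.0, hi=1.0):
    """L1 distance between each sliding patch histogram and the global histogram."""
    codes = quantize(values, num_bins, lo, hi)
    reference = histogram(codes, num_bins)
    scores = []
    for start in range(len(codes) - patch + 1):
        local = histogram(codes[start:start + patch], num_bins)
        scores.append(sum(abs(a - b) for a, b in zip(local, reference)))
    return scores


# A mostly uniform "texture" with a short anomalous run at positions 8 and 9.
signal = [0.1] * 8 + [0.9] * 2 + [0.1] * 8
scores = patch_scores(signal, patch=2)
```

On this toy signal the highest score falls on the patch covering the anomalous run. Working on small integer histograms rather than raw feature vectors is what makes this kind of comparison cheap.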

[27] Towards Label-Free Brain Tumor Segmentation: Unsupervised Learning with Multimodal MRI

Gerard Comas-Quiles, Carles Garcia-Cabrera, Julia Dietlmeier, Noel E. O'Connor, Ferran Marques

🧩 TL;DR

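This study proposes an unsupervised anomaly detection method for brain tumor segmentation based on a Multimodal Vision Transformer Autoencoder (MViT-AE). The model is trained only on healthy brain MRIs and detects and localizes tumors through reconstruction error maps, achieving clinically meaningful performance on the BraTS-GoAT 2025 dataset.

📘 Detailed Summary

Motivation: The study addresses the limited, costly, and inconsistent annotated data available for brain tumor segmentation. It offers unsupervised anomaly detection as a complement to supervised learning, targeting the scalability bottleneck in neuroimaging workflows.

Method: The proposed MViT-AE uses a multimodal early-late fusion strategy to integrate complementary information across MRI sequences, and a post-processing pipeline that incorporates the Segment Anything Model (SAM) refines the predicted tumor contours.

Result: On the test set, the lesion-wise Dice Similarity Coefficient is 0.437 (Whole Tumor), 0.316 (Tumor Core), and 0.350 (Enhancing Tumor); on the validation set, the anomaly detection rate reaches 89.4%, indicating clinically meaningful tumor localization.

Conclusion: The results indicate that transformer-based unsupervised models have potential as scalable, label-efficient tools for neuro-oncological imaging. Challenges remain in detecting small or non-enhancing lesions, but the approach is a promising direction for unsupervised medical image analysis.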

📄 Abstract

Unsupervised anomaly detection (UAD) presents a complementary alternative to supervised learning for brain tumor segmentation in magnetic resonance imaging (MRI), particularly when annotated datasets are limited, costly, or inconsistent. In this work, we propose a novel Multimodal Vision Transformer Autoencoder (MViT-AE) trained exclusively on healthy brain MRIs to detect and localize tumors via reconstruction-based error maps. This unsupervised paradigm enables segmentation without reliance on manual labels, addressing a key scalability bottleneck in neuroimaging workflows. Our method is evaluated in the BraTS-GoAT 2025 Lighthouse dataset, which includes various types of tumors such as gliomas, meningiomas, and pediatric brain tumors. To enhance performance, we introduce a multimodal early-late fusion strategy that leverages complementary information across multiple MRI sequences, and a post-processing pipeline that integrates the Segment Anything Model (SAM) to refine predicted tumor contours. Despite the known challenges of UAD, particularly in detecting small or non-enhancing lesions, our method achieves clinically meaningful tumor localization, with lesion-wise Dice Similarity Coefficient of 0.437 (Whole Tumor), 0.316 (Tumor Core), and 0.350 (Enhancing Tumor) on the test set, and an anomaly Detection Rate of 89.4% on the validation set. These findings highlight the potential of transformer-based unsupervised models to serve as scalable, label-efficient tools for neuro-oncological imaging.
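
Reconstruction-based detection and the Dice score can both be shown in a few lines. The sketch below thresholds the absolute difference between an input and its reconstruction to get a binary anomaly mask, then scores that mask against a reference with the plain voxel-wise Dice coefficient. The flat-list image format, the fixed threshold, and the function names are assumptions for illustration; the lesion-wise Dice reported above is computed per lesion and is not reproduced here.

```python
def anomaly_mask(image, reconstruction, threshold):
    """Mark voxels whose reconstruction error exceeds the threshold."""
    return [1 if abs(x - r) > threshold else 0 for x, r in zip(image, reconstruction)]


def dice(pred, target):
    """Voxel-wise Dice coefficient between two binary masks (lists of 0/1)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    size = sum(pred) + sum(target)
    # Two empty masks agree perfectly by convention.
    return 1.0 if size == 0 else 2.0 * intersection / size
```

An autoencoder trained only on healthy scans tends to reconstruct healthy tissue well and tumors poorly, so large errors serve as the anomaly signal.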

[28] DGME-T: Directional Grid Motion Encoding for Transformer-Based Historical Camera Movement Classification

Tingyu Lin, Armin Dadras, Florian Kleber, Robert Sablatnig

🧩 TL;DR

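This paper proposes DGME-T, a lightweight extension to the Video Swin Transformer that injects directional grid motion encoding to improve camera movement classification in archival film. The method improves classification accuracy and F1 score on both modern and historical footage.

📘 Detailed Summary

Motivation: Camera movement classification models trained on contemporary, high-quality video degrade on archival film, where noise, missing frames, and low contrast obscure motion cues, so a more robust classification method is needed.

Method: The authors assemble a unified benchmark that consolidates two modern corpora into four canonical classes and restructures the HISTORIAN collection into five balanced categories. DGME-T builds on the Video Swin Transformer and injects optical-flow-based directional grid motion encoding through a learnable, normalised late-fusion layer.

Result: DGME-T raises the backbone's top-1 accuracy from 81.78% to 86.14% and macro F1 from 82.08% to 87.81% on modern clips. On World War II archival footage, accuracy rises from 83.43% to 84.62% and macro F1 from 81.72% to 82.63%. A cross-domain study shows that intermediate fine-tuning on modern data improves historical performance by more than five percentage points.

Conclusion: The study shows that structured motion priors and transformer representations are complementary, and that even a small, carefully calibrated motion head can substantially enhance robustness in degraded film analysis. Combining directional motion encoding with a transformer architecture is an effective approach to archival film analysis.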

📄 Abstract

Camera movement classification (CMC) models trained on contemporary, high-quality footage often degrade when applied to archival film, where noise, missing frames, and low contrast obscure motion cues. We bridge this gap by assembling a unified benchmark that consolidates two modern corpora into four canonical classes and restructures the HISTORIAN collection into five balanced categories. Building on this benchmark, we introduce DGME-T, a lightweight extension to the Video Swin Transformer that injects directional grid motion encoding, derived from optical flow, via a learnable and normalised late-fusion layer. DGME-T raises the backbone's top-1 accuracy from 81.78% to 86.14% and its macro F1 from 82.08% to 87.81% on modern clips, while still improving the demanding World-War-II footage from 83.43% to 84.62% accuracy and from 81.72% to 82.63% macro F1. A cross-domain study further shows that an intermediate fine-tuning stage on modern data increases historical performance by more than five percentage points. These results demonstrate that structured motion priors and transformer representations are complementary and that even a small, carefully calibrated motion head can substantially enhance robustness in degraded film analysis. Related resources are available at https://github.com/linty5/DGME-T.
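
One plausible reading of a directional grid motion encoding is a per-cell histogram of optical-flow directions. The sketch below follows that reading: it splits a flow field into a grid, accumulates flow magnitude into direction bins per cell, and normalises each cell. This is an assumption-laden illustration, not the DGME-T implementation; the grid size, the number of direction bins, and the magnitude weighting are all hypothetical choices.

```python
import math


def directional_grid_encoding(flow, grid=2, num_dirs=4):
    """Encode a flow field as per-cell direction histograms.

    flow is an H x W nested list of (dx, dy) vectors. The result is a flat
    list of grid * grid * num_dirs values, each cell normalised to sum to 1
    (or left at zero when the cell has no motion).
    """
    height, width = len(flow), len(flow[0])
    cells = [[0.0] * num_dirs for _ in range(grid * grid)]
    bin_width = 2.0 * math.pi / num_dirs
    for y in range(height):
        for x in range(width):
            dx, dy = flow[y][x]
            magnitude = math.hypot(dx, dy)
            if magnitude == 0.0:
                continue
            angle = math.atan2(dy, dx) % (2.0 * math.pi)
            # Bins are centred on 0, bin_width, 2 * bin_width, ...
            direction = int((angle + bin_width / 2.0) / bin_width) % num_dirs
            cell = (y * grid // height) * grid + (x * grid // width)
            cells[cell][direction] += magnitude
    encoding = []
    for cell in cells:
        total = sum(cell)
        encoding.extend(v / total if total else 0.0 for v in cell)
    return encoding
```

For a uniform rightward flow, every cell puts all of its mass in the first direction bin, which is the kind of signature a pan would leave; a zoom would instead spread mass across different bins in different cells.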

[29] NDM: A Noise-driven Detection and Mitigation Framework against Implicit Sexual Intentions in Text-to-Image Generation

Yitong Sun, Yao Huang, Ruochen Zhang, Huanran Chen, Shouwei Ruan, Ranjie Duan, Xingxing Wei

🧩 TL;DR

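This study proposes NDM, the first noise-driven detection and mitigation framework, which detects and mitigates implicit malicious intent in text-to-image generation while preserving the model's original generative capabilities. Using noise separability and an adaptive negative guidance mechanism, it outperforms existing state-of-the-art methods on both natural and adversarial datasets.

📘 Detailed Summary

Motivation: Text-to-image diffusion models have impressive generative capabilities but remain vulnerable to implicit sexual prompts, subtle cues that are often disguised as benign terms yet can unexpectedly trigger inappropriate content. Existing detection methods are designed mainly for explicit content and struggle to identify implicit cues, while fine-tuning methods, though effective, degrade generative quality and create an undesirable trade-off.

Method: Two key innovations are proposed. First, a noise-based detection method exploits the separability of early-stage predicted noise to identify malicious content accurately and efficiently. Second, a noise-enhanced adaptive negative guidance mechanism optimizes the initial noise by suppressing attention on prominent regions, making adaptive negative guidance more effective at mitigating sexual content.

Result: Experiments on natural and adversarial datasets show that NDM outperforms existing state-of-the-art methods, including SLD, UCE, and RECE, in both detection and mitigation.

Conclusion: The study demonstrates the effectiveness of noise-driven methods for detecting and mitigating implicit malicious intent, offers a new direction for safeguarding text-to-image generation, and preserves generative quality, addressing the safety-quality trade-off of existing methods.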

📄 Abstract

Despite the impressive generative capabilities of text-to-image (T2I) diffusion models, they remain vulnerable to generating inappropriate content, especially when confronted with implicit sexual prompts. Unlike explicit harmful prompts, these subtle cues, often disguised as seemingly benign terms, can unexpectedly trigger sexual content due to underlying model biases, raising significant ethical concerns. However, existing detection methods are primarily designed to identify explicit sexual content and therefore struggle to detect these implicit cues. Fine-tuning approaches, while effective to some extent, risk degrading the model's generative quality, creating an undesirable trade-off. To address this, we propose NDM, the first noise-driven detection and mitigation framework, which detects and mitigates implicit malicious intent in T2I generation while preserving the model's original generative capabilities. Specifically, we introduce two key innovations: first, we leverage the separability of early-stage predicted noise to develop a noise-based detection method that identifies malicious content with high accuracy and efficiency; second, we propose a noise-enhanced adaptive negative guidance mechanism that optimizes the initial noise by suppressing the prominent region's attention, thereby enhancing the effectiveness of adaptive negative guidance for sexual mitigation. Experimentally, we validate NDM on both natural and adversarial datasets, demonstrating its superior performance over existing SOTA methods, including SLD, UCE, and RECE. Code and resources are available at https://github.com/lorraine021/NDM.

[30] UniMedVL: Unifying Medical Multimodal Understanding And Generation Through Observation-Knowledge-Analysis

Junzhi Ning, Wei Li, Cheng Tang, Jiashi Lin, Chenglong Ma, Chaoyang Zhang, Jiyao Liu, Ying Chen, Shujian Gao, Lihao Liu, Yuandong Pu, Huihui Xu, Chenhui Gou, Ziyan Huang, Yi Xin, Qi Qin, Zhongying Deng, Diping Song, Bin Fu, Guang Yang, Yuanfeng Ji, Tianbin Li, Yanzhou Su, Jin Ye, Shixiang Tang, Ming Hu, Junjun He

🧩 TL;DR

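This paper proposes UniMedVL, the first medical unified multimodal model, which handles both medical image understanding and generation within a single architecture through the Observation-Knowledge-Analysis framework. It achieves superior performance on five understanding benchmarks and matches specialized models in generation quality across eight medical imaging modalities.

📘 Detailed Summary

Motivation: Existing medical AI systems break the unified process: medical image understanding models can interpret images but cannot generate visual outputs, while medical image generation models can synthesize images but cannot provide textual explanations. This leads to gaps in data representation, feature integration, and task-level multimodal capabilities.

Method: A multi-level framework based on the Observation-Knowledge-Analysis (OKA) paradigm is proposed. It includes constructing UniMed-5M, a dataset of over 5.6M samples, for foundational observation; Progressive Curriculum Learning, which systematically introduces medical multimodal knowledge; and UniMedVL, a unified multimodal model that handles image understanding and generation tasks within a single architecture.

Result: UniMedVL achieves superior performance on five medical image understanding benchmarks while matching specialized models in generation quality across eight medical imaging modalities. The unified architecture enables bidirectional knowledge sharing, with generation tasks enhancing visual understanding features.

Conclusion: Integrating traditionally separate capabilities within a single medical framework unlocks improvements across diverse medical vision-language tasks. Bidirectional knowledge sharing shows that generation and understanding tasks can reinforce each other, offering a more unified multimodal approach for medical diagnostic applications.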

📄 Abstract

Medical diagnostic applications require models that can process multimodal medical inputs (images, patient histories, lab results) and generate diverse outputs including both textual reports and visual content (annotations, segmentation masks, and images). Despite this need, existing medical AI systems disrupt this unified process: medical image understanding models interpret images but cannot generate visual outputs, while medical image generation models synthesize images but cannot provide textual explanations. This leads to gaps in data representation, feature integration, and task-level multimodal capabilities. To this end, we propose a multi-level framework that draws inspiration from diagnostic workflows through the Observation-Knowledge-Analysis (OKA) paradigm. Specifically, at the observation level, we construct UniMed-5M, a dataset comprising over 5.6M samples that reformat diverse unimodal data into multimodal pairs for foundational observation. At the knowledge level, we propose Progressive Curriculum Learning that systematically introduces medical multimodal knowledge. At the analysis level, we introduce UniMedVL, the first medical unified multimodal model for the simultaneous analysis of image understanding and generation tasks within a single architecture. UniMedVL achieves superior performance on five medical image understanding benchmarks, while matching specialized models in generation quality across eight medical imaging modalities. Crucially, our unified architecture enables bidirectional knowledge sharing: generation tasks enhance visual understanding features, demonstrating that integrating traditionally separate capabilities within a single medical framework unlocks improvements across diverse medical vision-language tasks. Code is available at https://github.com/uni-medical/UniMedVL.

[31] QSilk: Micrograin Stabilization and Adaptive Quantile Clipping for Detail-Friendly Latent Diffusion

Denis Rychkovskiy

🧩 TL;DR

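QSilk is a lightweight stabilization layer for latent diffusion models that uses a micro clamp and adaptive quantile clipping to improve high-frequency detail fidelity and suppress rare activation spikes without any training, yielding cleaner, sharper results.

📘 Detailed Summary

Motivation: The study targets insufficient high-frequency fidelity and rare activation spikes in latent diffusion models, problems that are most visible at low step counts and ultra-high resolutions and that affect the quality and stability of generated images.

Method: QSilk combines two techniques: a per-sample micro clamp that gently limits extreme values without washing out texture, and Adaptive Quantile Clip (AQClip), which adapts the allowed value range per region. AQClip can run in two modes, one based on local structure statistics and one guided by attention entropy.

Result: Integrated into the CADE 2.5 rendering pipeline, QSilk yields cleaner, sharper results at low step counts and ultra-high resolutions with negligible overhead. It gives consistent qualitative improvements across SD/SDXL backbones and works together with CFG/Rescale, allowing slightly higher guidance without artifacts.

Conclusion: QSilk is a practical solution that needs no training or fine-tuning and improves the stability and output quality of latent diffusion models. Its lightweight design and minimal user controls make it easy to integrate into existing workflows.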

📄 Abstract

We present QSilk, a lightweight, always-on stabilization layer for latent diffusion that improves high-frequency fidelity while suppressing rare activation spikes. QSilk combines (i) a per-sample micro clamp that gently limits extreme values without washing out texture, and (ii) Adaptive Quantile Clip (AQClip), which adapts the allowed value corridor per region. AQClip can operate in a proxy mode using local structure statistics or in an attention entropy guided mode (model confidence). Integrated into the CADE 2.5 rendering pipeline, QSilk yields cleaner, sharper results at low step counts and ultra-high resolutions with negligible overhead. It requires no training or fine-tuning and exposes minimal user controls. We report consistent qualitative improvements across SD/SDXL backbones and show synergy with CFG/Rescale, enabling slightly higher guidance without artifacts.
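
The basic operation behind quantile clipping is easy to sketch. The code below clamps a list of values to the corridor between a low and a high quantile of those same values, which caps rare spikes while leaving the bulk of the distribution untouched. It is a global, fixed-quantile simplification and does not reproduce AQClip's per-region adaptivity or its attention-entropy mode; the function names and the default quantiles (0.01 and 0.99) are assumptions for this example.

```python
def quantile(sorted_values, q):
    """Linearly interpolated quantile of an already sorted list, with q in [0, 1]."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1.0 - fraction) + sorted_values[upper] * fraction


def quantile_clip(values, q_low=0.01, q_high=0.99):
    """Clamp every value to the [q_low, q_high] quantile corridor of the input."""
    ordered = sorted(values)
    low = quantile(ordered, q_low)
    high = quantile(ordered, q_high)
    return [min(high, max(low, v)) for v in values]


# One hundred ordinary values plus a single large spike.
latents = [float(i) for i in range(100)] + [1000.0]
clipped = quantile_clip(latents)
```

After clipping, the spike is pulled down to roughly the 99th-percentile value while mid-range entries are unchanged. An adaptive variant would compute the corridor separately for each region instead of once for the whole tensor.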

[32] Neuro-Symbolic Spatial Reasoning in Segmentation

Jiayi Lin, Jiabo Huang, Shaogang Gong

🧩 TL;DR

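This paper proposes RelateSeg, the first method to explore neuro-symbolic spatial reasoning in open-vocabulary semantic segmentation. It imposes explicit spatial relational constraints through first-order logic formulas within a neural network architecture and achieves state-of-the-art performance across four benchmark datasets.

📘 Detailed Summary

Motivation: Current open-vocabulary semantic segmentation methods based on vision-language model correlation lack an understanding of the spatial relations between objects in a scene, which limits generalization to unseen and unlabelled objects.

Method: Relational Segmentor (RelateSeg) imposes explicit spatial relational constraints through first-order logic formulas within a neural network architecture. It automatically extracts spatial relations and encodes them as logic formulas; each pixel predicts both a semantic category and a spatial pseudo category; and fuzzy logic relaxation makes the constraints learnable end to end in a deep network.

Result: RelateSeg achieves state-of-the-art average mIoU across four benchmark datasets, with clear advantages on images containing multiple categories, while introducing only a single auxiliary loss function and no additional parameters. This validates the effectiveness of neuro-symbolic spatial reasoning in open-vocabulary semantic segmentation.

Conclusion: The study shows that neuro-symbolic spatial reasoning can improve open-vocabulary semantic segmentation, achieving spatial-relationally consistent segmentation through explicit spatial relational constraints and suggesting a research direction for combining symbolic reasoning with deep learning.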

📄 Abstract

Open-Vocabulary Semantic Segmentation (OVSS) assigns pixel-level labels from an open set of categories, requiring generalization to unseen and unlabelled objects. Using vision-language models (VLMs) to correlate local image patches with potential unseen object categories suffers from a lack of understanding of spatial relations of objects in a scene. To solve this problem, we introduce neuro-symbolic (NeSy) spatial reasoning in OVSS. In contrast to contemporary VLM correlation-based approaches, we propose Relational Segmentor (RelateSeg) to impose explicit spatial relational constraints by first-order logic (FOL) formulated in a neural network architecture. This is the first attempt to explore NeSy spatial reasoning in OVSS. Specifically, RelateSeg automatically extracts spatial relations and encodes them as first-order logic formulas using our proposed pseudo categories. Each pixel learns to predict both a semantic category (e.g., "cat") and a spatial pseudo category (e.g., "right of person") simultaneously, enforcing relational constraints (e.g., a "cat" pixel must lie to the right of a "person"). Finally, these logic constraints are formulated in a deep network architecture by fuzzy logic relaxation, enabling end-to-end learning of spatial-relationally consistent segmentation. RelateSeg achieves state-of-the-art performance in terms of average mIoU across four benchmark datasets and particularly shows clear advantages on images containing multiple categories, with the cost of only introducing a single auxiliary loss function and no additional parameters, validating the effectiveness of NeSy spatial reasoning in OVSS.
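
Fuzzy logic relaxation turns a hard rule such as "a cat pixel implies a right-of-person pixel" into a differentiable penalty on predicted probabilities. The sketch below uses the Reichenbach implication, 1 - a + a·b, as the relaxed truth value and averages its complement over pixels as an auxiliary loss. The abstract does not say which fuzzy operators RelateSeg uses, so the choice of Reichenbach and the function names here are assumptions for illustration only.

```python
def implication(a, b):
    """Reichenbach fuzzy implication: truth degree of 'a implies b' for a, b in [0, 1]."""
    return 1.0 - a + a * b


def constraint_loss(p_semantic, p_relation):
    """Mean violation of 'semantic class implies spatial pseudo class' over pixels.

    p_semantic[i] is the predicted probability that pixel i is, say, "cat";
    p_relation[i] is the probability that it is "right of person".
    """
    truths = [implication(a, b) for a, b in zip(p_semantic, p_relation)]
    return 1.0 - sum(truths) / len(truths)
```

The loss is zero when every confident semantic prediction is matched by a confident relation prediction, and it grows as pixels claim the semantic class without satisfying the relation. Because it is a smooth function of the probabilities, it can be added to a segmentation objective as a single auxiliary term.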

[33] BLIP3o-NEXT: Next Frontier of Native Image Generation

Jiuhai Chen, Le Xue, Zhiyang Xu, Xichen Pan, Shusheng Yang, Can Qin, An Yan, Honglu Zhou, Zeyuan Chen, Lifu Huang, Tianyi Zhou, Junnan Li, Silvio Savarese, Caiming Xiong, Ran Xu

🧩 TL;DR

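BLIP3o-NEXT is a fully open-source foundation model for image generation that unifies text-to-image generation and image editing in a hybrid autoregressive + diffusion architecture, outperforming existing models on multiple benchmarks.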

📘 Detailed Summary

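Motivation: This work addresses the challenge of delivering both high-quality text-to-image generation and image editing within a single unified architecture, and explores the frontier of native image generation.

Method: BLIP3o-NEXT uses a hybrid autoregressive + diffusion architecture. An autoregressive model first generates discrete image tokens conditioned on multimodal inputs, and its hidden states then serve as conditioning signals for a diffusion model that renders high-fidelity images. This combines the reasoning strength of autoregressive models with the fine-detail rendering of diffusion models.

Result: Across extensive evaluations on text-to-image and image-editing benchmarks, BLIP3o-NEXT outperforms existing models on both generation and editing tasks.

Conclusion: The study identifies four key insights: most architectural choices perform comparably, so scaling efficiency matters most; reinforcement learning can push the frontier of native image generation; image editing can be substantially improved through post-training and a data engine; and data quality and scale remain the decisive factors bounding model performance.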

📄 Abstract

We present BLIP3o-NEXT, a fully open-source foundation model in the BLIP3 series that advances the next frontier of native image generation. BLIP3o-NEXT unifies text-to-image generation and image editing within a single architecture, demonstrating strong image generation and image editing capabilities. In developing the state-of-the-art native image generation model, we identify four key insights: (1) Most architectural choices yield comparable performance; an architecture can be deemed effective provided it scales efficiently and supports fast inference; (2) The successful application of reinforcement learning can further push the frontier of native image generation; (3) Image editing still remains a challenging task, yet instruction following and the consistency between generated and reference images can be significantly enhanced through post-training and a data engine; (4) Data quality and scale continue to be decisive factors that determine the upper bound of model performance. Building upon these insights, BLIP3o-NEXT leverages an Autoregressive + Diffusion architecture in which an autoregressive model first generates discrete image tokens conditioned on multimodal inputs, whose hidden states are then used as conditioning signals for a diffusion model to generate high-fidelity images. This architecture integrates the reasoning strength and instruction following of autoregressive models with the fine-detail rendering ability of diffusion models, achieving a new level of coherence and realism. Extensive evaluations on various text-to-image and image-editing benchmarks show that BLIP3o-NEXT achieves superior performance over existing models.

[34] BiomedXPro: Prompt Optimization for Explainable Diagnosis with Biomedical Vision Language Models

Kaushitha Silva, Mansitha Eashwara, Sanduni Ubayasiri, Ruwan Tennakoon, Damayanthi Herath

🧩 TL;DR

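This paper proposes BiomedXPro, an evolutionary framework that uses a large language model as both a biomedical knowledge extractor and an adaptive optimizer to automatically generate a diverse ensemble of interpretable natural-language prompt pairs for disease diagnosis. It outperforms existing prompt-tuning methods on multiple biomedical benchmarks, especially in data-scarce few-shot settings.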

📘 Detailed Summary

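Motivation: Clinical adoption of biomedical vision-language models is limited because existing prompt optimization techniques produce either uninterpretable latent vectors or a single textual prompt. This lack of transparency, together with the failure to capture the multi-faceted nature of clinical diagnosis, limits their trustworthiness in high-stakes medical settings.

Method: BiomedXPro adopts an evolutionary framework in which a large language model acts as both biomedical knowledge extractor and adaptive optimizer, automatically generating a diverse ensemble of interpretable natural-language prompt pairs. The ensemble strategy ensures that the generated prompts capture multiple facets of clinical diagnosis.

Result: On multiple biomedical benchmarks, BiomedXPro consistently outperforms state-of-the-art prompt-tuning methods, particularly in data-scarce few-shot settings. Analysis shows strong semantic alignment between the discovered prompts and statistically significant clinical features, giving the model's performance a verifiable conceptual basis.

Conclusion: By generating a diverse ensemble of interpretable prompts, BiomedXPro provides a verifiable basis for model predictions, a critical step toward more trustworthy and clinically aligned AI systems. The approach improves model transparency and trustworthiness in medical decision-making.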

📄 Abstract

The clinical adoption of biomedical vision-language models is hindered by prompt optimization techniques that produce either uninterpretable latent vectors or single textual prompts. This lack of transparency and failure to capture the multi-faceted nature of clinical diagnosis, which relies on integrating diverse observations, limits their trustworthiness in high-stakes settings. To address this, we introduce BiomedXPro, an evolutionary framework that leverages a large language model as both a biomedical knowledge extractor and an adaptive optimizer to automatically generate a diverse ensemble of interpretable, natural-language prompt pairs for disease diagnosis. Experiments on multiple biomedical benchmarks show that BiomedXPro consistently outperforms state-of-the-art prompt-tuning methods, particularly in data-scarce few-shot settings. Furthermore, our analysis demonstrates a strong semantic alignment between the discovered prompts and statistically significant clinical features, grounding the model's performance in verifiable concepts. By producing a diverse ensemble of interpretable prompts, BiomedXPro provides a verifiable basis for model predictions, representing a critical step toward the development of more trustworthy and clinically-aligned AI systems.

cs.CL [Back]

[35] Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding

Sensen Gao, Shanshan Zhao, Xu Jiang, Lunhao Duan, Yong Xien Chng, Qing-Guo Chen, Weihua Luo, Kaifu Zhang, Jia-Wang Bian, Mingming Gong

🧩 TL;DR

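This paper systematically surveys multimodal retrieval-augmented generation for document understanding, proposes a taxonomy based on domain, retrieval modality, and granularity, and summarizes key advances such as graph structures and agentic frameworks, providing a roadmap for the future of document AI.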

📘 Detailed Summary

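Motivation: Current document understanding approaches face key limitations: OCR-based pipelines lose structural detail, while native multimodal large language models struggle with context modeling. Retrieval-augmented generation helps ground models in external data, but the multimodal nature of documents calls for a more advanced paradigm that supports holistic retrieval and reasoning over text, tables, charts, and layout.

Method: The paper proposes a taxonomy based on domain, retrieval modality, and granularity, reviews advances involving graph structures and agentic frameworks, and summarizes key datasets, benchmarks, and applications, focusing on how multimodal RAG is systematically realized for document understanding.

Result: The survey summarizes the key datasets, benchmarks, and practical applications of multimodal RAG for document understanding, identifies the challenges current methods face in efficiency, fine-grained representation, and robustness, and provides a systematic framework for evaluating and comparing approaches.

Conclusion: Multimodal RAG offers document AI a new paradigm of holistic retrieval and reasoning. Through its systematic review, the paper highlights open challenges in efficiency, fine-grained representation, and robustness, and lays out a roadmap for future research.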

📄 Abstract

Document understanding is critical for applications from financial analysis to scientific discovery. Current approaches, whether OCR-based pipelines feeding Large Language Models (LLMs) or native Multimodal LLMs (MLLMs), face key limitations: the former loses structural detail, while the latter struggles with context modeling. Retrieval-Augmented Generation (RAG) helps ground models in external data, but documents' multimodal nature, i.e., combining text, tables, charts, and layout, demands a more advanced paradigm: Multimodal RAG. This approach enables holistic retrieval and reasoning across all modalities, unlocking comprehensive document intelligence. Recognizing its importance, this paper presents a systematic survey of Multimodal RAG for document understanding. We propose a taxonomy based on domain, retrieval modality, and granularity, and review advances involving graph structures and agentic frameworks. We also summarize key datasets, benchmarks, and applications, and highlight open challenges in efficiency, fine-grained representation, and robustness, providing a roadmap for future progress in document AI.

[36] Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing

Baode Wang, Biao Wu, Weizhen Li, Meng Fang, Zuming Huang, Jun Huang, Haozhe Wang, Yanjie Liang, Ling Chen, Wei Chu, Yuan Qi

🧩 TL;DR

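This paper proposes the LayoutRL reinforcement learning framework and the Infinity-Parser model, which optimize document layout understanding through a composite reward and achieve state-of-the-art document parsing performance on multiple benchmarks.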

📘 Detailed Summary

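Motivation: Existing supervised fine-tuning methods struggle to generalize across diverse document types when parsing scanned documents, performing poorly on out-of-distribution data in particular. The scarcity of high-quality training data for layout-aware parsing further compounds the problem.

Method: The authors propose LayoutRL, a reinforcement learning framework that optimizes layout understanding with a composite reward integrating normalized edit distance, paragraph count accuracy, and reading order preservation, and construct the Infinity-Doc-400K dataset to train the Infinity-Parser vision-language model.

Result: Extensive evaluations on OmniDocBench, olmOCR-Bench, PubTabNet, and FinTabNet show that Infinity-Parser achieves state-of-the-art performance across document types, languages, and structural complexities, substantially outperforming both specialized document parsing systems and general-purpose vision-language models.

Conclusion: The study demonstrates the effectiveness of reinforcement learning for document parsing. A composite reward combined with a high-quality training dataset yields strong cross-domain generalization, and the work provides reproducible infrastructure for document parsing research.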

📄 Abstract

Document parsing from scanned images into structured formats remains a significant challenge due to its complexly intertwined elements such as text paragraphs, figures, formulas, and tables. Existing supervised fine-tuning methods often struggle to generalize across diverse document types, leading to poor performance, particularly on out-of-distribution data. This issue is further exacerbated by the limited availability of high-quality training data for layout-aware parsing tasks. To address these challenges, we introduce LayoutRL, a reinforcement learning framework that optimizes layout understanding through composite rewards integrating normalized edit distance, paragraph count accuracy, and reading order preservation. To support this training, we construct the Infinity-Doc-400K dataset, which we use to train Infinity-Parser, a vision-language model demonstrating robust generalization across various domains. Extensive evaluations on benchmarks including OmniDocBench, olmOCR-Bench, PubTabNet, and FinTabNet show that Infinity-Parser consistently achieves state-of-the-art performance across a broad range of document types, languages, and structural complexities, substantially outperforming both specialized document parsing systems and general-purpose vision-language models. We will release our code, dataset, and model to facilitate reproducible research in document parsing.
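
The composite reward described above can be sketched in a few lines. The abstract names the three ingredients (normalized edit distance, paragraph count accuracy, reading order preservation) but not their exact definitions or weights, so the formulas, the equal weighting, and all function names below are assumptions for illustration only.

```python
# Illustrative sketch only: a composite layout reward in the spirit of LayoutRL.
# Equal weighting and the exact form of each term are assumptions.


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_reward(pred: str, ref: str) -> float:
    """1 minus the edit distance normalized by the longer string."""
    longest = max(len(pred), len(ref))
    return 1.0 if longest == 0 else 1.0 - edit_distance(pred, ref) / longest


def count_reward(pred_paras: list[str], ref_paras: list[str]) -> float:
    """Penalize a mismatch in the number of paragraphs."""
    denom = max(len(pred_paras), len(ref_paras), 1)
    return 1.0 - abs(len(pred_paras) - len(ref_paras)) / denom


def order_reward(pred_paras: list[str], ref_paras: list[str]) -> float:
    """Fraction of exactly matched paragraph pairs that keep reference order."""
    pos = [pred_paras.index(p) for p in ref_paras if p in pred_paras]
    pairs = [(i, j) for i in range(len(pos)) for j in range(i + 1, len(pos))]
    if not pairs:
        return 1.0
    return sum(pos[i] < pos[j] for i, j in pairs) / len(pairs)


def composite_reward(pred_paras: list[str], ref_paras: list[str]) -> float:
    """Unweighted mean of the three terms (the weighting is an assumption)."""
    text = edit_reward("\n".join(pred_paras), "\n".join(ref_paras))
    return (text + count_reward(pred_paras, ref_paras)
            + order_reward(pred_paras, ref_paras)) / 3.0


# Hypothetical parse outputs: the second swaps two paragraphs.
reference = ["Title", "First paragraph.", "Second paragraph."]
swapped = ["Title", "Second paragraph.", "First paragraph."]
```

Here `composite_reward(reference, reference)` is maximal, while `composite_reward(swapped, reference)` is lower because both the text term and the order term drop. Matching paragraphs by exact string equality is a simplification; a real system would need fuzzy alignment.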

[37] Fine-Tuning MedGemma for Clinical Captioning to Enhance Multimodal RAG over Malaysia CPGs

Lee Qi Zun, Mohamad Zulhilmi Bin Abdul Halim, Goh Man Fye

🧩 TL;DR

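This study proposes and validates a framework for specializing the MedGemma model to generate high-fidelity medical image captions that serve as superior queries for retrieval-augmented generation systems, substantially improving the performance of multimodal clinical decision support.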

📘 Detailed Summary

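Motivation: Retrieval-augmented generation systems struggle with image-based medical queries because captions from general-purpose vision-language models lack clinical specificity and factual grounding, limiting their ability to deliver fact-based guidance from the Malaysian Clinical Practice Guidelines.

Method: A knowledge distillation pipeline creates a synthetic dataset spanning dermatology, fundus, and chest radiography, and MedGemma is fine-tuned with the parameter-efficient QLoRA method. A dual evaluation framework measures classification accuracy and, using the RAGAS framework, caption faithfulness, relevancy, and correctness.

Result: The fine-tuned model shows substantial gains in classification performance, and the RAGAS evaluation confirms significant improvements in caption faithfulness and correctness, validating the model's ability to produce reliable, factually grounded descriptions.

Conclusion: The study establishes a robust pipeline for specializing medical vision-language models and validates the resulting model as a high-quality query generator, laying the groundwork for stronger multimodal retrieval-augmented generation in evidence-based clinical decision support.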

📄 Abstract

Retrieval-Augmented Generation systems are essential for providing fact-based guidance from Malaysian Clinical Practice Guidelines. However, their effectiveness with image-based queries is limited, as general Vision-Language Model captions often lack clinical specificity and factual grounding. This study proposes and validates a framework to specialize the MedGemma model for generating high-fidelity captions that serve as superior queries. To overcome data scarcity, we employ a knowledge distillation pipeline to create a synthetic dataset across dermatology, fundus, and chest radiography domains, and fine-tune MedGemma using the parameter-efficient QLoRA method. Performance was rigorously assessed through a dual framework measuring both classification accuracy and, via a novel application of the RAGAS framework, caption faithfulness, relevancy, and correctness. The fine-tuned model demonstrated substantial improvements in classification performance, while RAGAS evaluation confirmed significant gains in caption faithfulness and correctness, validating the model's ability to produce reliable, factually grounded descriptions. This work establishes a robust pipeline for specializing medical VLMs and validates the resulting model as a high-quality query generator, laying the groundwork for enhancing multimodal RAG systems in evidence-based clinical decision support.

[38] When Seeing Is not Enough: Revealing the Limits of Active Reasoning in MLLMs

Hongcheng Liu, Pingjie Wang, Yuhao Wang, Siqu Ou, Yanfeng Wang, Yu Wang

🧩 TL;DR

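This study proposes the GuessBench benchmark to evaluate multimodal large language models on active reasoning tasks. It finds that existing models are far weaker at actively acquiring missing evidence and making iterative decisions than at passive reasoning, revealing an important research gap in multimodal active reasoning.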

📘 Detailed Summary

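Motivation: Existing evaluations of multimodal large language models focus mainly on passive inference, in which models reason step by step under complete information. This departs from real-world use, where information is often incomplete, so it is necessary to study whether models can actively acquire missing evidence and make iterative decisions under incomplete information.

Method: The study proposes GuessBench, a benchmark with both perception-oriented and knowledge-oriented images. Models must select a target image from a candidate pool without task-specific priors, actively acquiring missing evidence and iteratively refining their decisions, which allows their active reasoning ability to be evaluated.

Result: An evaluation of 20 advanced multimodal large language models finds that performance on active reasoning lags far behind performance in passive settings. Ablation studies show that perceptual enhancements benefit smaller models, whereas thinking-oriented methods bring consistent gains across model sizes.

Conclusion: Fine-grained perception and timely decision-making are the main challenges for active reasoning. Perceptual enhancements and thinking-oriented methods are promising directions for future research, and the field has substantial room for improvement.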

📄 Abstract

Multimodal large language models (MLLMs) have shown strong capabilities across a broad range of benchmarks. However, most existing evaluations focus on passive inference, where models perform step-by-step reasoning under complete information. This setup is misaligned with real-world use, where seeing is not enough. This raises a fundamental question: Can MLLMs actively acquire missing evidence under incomplete information? To bridge this gap, we require MLLMs to actively acquire missing evidence and iteratively refine decisions under incomplete information, by selecting a target image from a candidate pool without task-specific priors. To support systematic study, we propose GuessBench, a benchmark with both perception-oriented and knowledge-oriented images for evaluating active reasoning in MLLMs. We evaluate 20 advanced MLLMs and find that performance on active reasoning lags far behind that in passive settings, indicating substantial room for improvement. Further analysis identifies fine-grained perception and timely decision-making as key challenges. Ablation studies show that perceptual enhancements benefit smaller models, whereas thinking-oriented methods provide consistent gains across model sizes. These results suggest promising directions for future research on multimodal active reasoning.

[39] MCA: Modality Composition Awareness for Robust Composed Multimodal Retrieval

Qiyu Wu, Shuyang Cui, Satoshi Hayakawa, Wei-Yao Wang, Hiromi Wakaki, Yuki Mitsufuji

🧩 TL;DR

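This paper proposes a modality composition awareness framework that uses a preference loss and a composition regularization objective to mitigate modality shortcuts in unified encoders for multimodal retrieval, substantially improving retrieval robustness under distribution shift.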

📘 Detailed Summary

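Motivation: Although multimodal large language models used as unified encoders offer flexibility in composed retrieval, unified encoders trained with conventional contrastive learning tend to learn modality shortcuts, which leads to poor robustness under distribution shift.

Method: The proposed modality composition awareness framework comprises a preference loss, which forces multimodal embeddings to outperform their unimodal counterparts, and a composition regularization objective, which aligns multimodal embeddings with prototypes composed from their unimodal parts. Together they explicitly model the structural relationship between the composed representation and its unimodal counterparts.

Result: Experiments on multiple benchmarks show notable gains on out-of-distribution retrieval, validating modality composition awareness as an effective principle for robust composed multimodal retrieval when MLLMs serve as the unified encoder.

Conclusion: Modality composition awareness is a key mechanism for improving the robustness of unified encoders in multimodal retrieval. It points to an important direction for multimodal representation learning, particularly for handling distribution shift and modality interaction.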

📄 Abstract

Multimodal retrieval, which seeks to retrieve relevant content across modalities such as text or image, supports applications from AI search to content production. Whereas separate-encoder approaches like CLIP successfully align modality-specific embeddings with contrastive learning, recent multimodal large language models (MLLMs) enable a unified encoder that directly processes composed inputs. Although such encoders are flexible and advanced, we find that unified encoders trained with conventional contrastive learning are prone to learning modality shortcuts, leading to poor robustness under distribution shifts. We propose a modality composition awareness framework to mitigate this issue. Concretely, a preference loss forces multimodal embeddings to outperform their unimodal counterparts, while a composition regularization objective aligns multimodal embeddings with prototypes composed from their unimodal parts. These objectives explicitly model structural relationships between the composed representation and its unimodal counterparts. Experiments on various benchmarks show gains in out-of-distribution retrieval, highlighting modality composition awareness as an effective principle for robust composed multimodal retrieval when utilizing MLLMs as the unified encoder.
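
The two objectives named in the abstract can be made concrete with a toy sketch. The abstract gives no formulas, so the hinge form of the preference loss, the averaged prototype, the margin value, and every name below are assumptions for illustration rather than the paper's definitions.

```python
# Illustrative sketch only: hypothetical forms of a preference loss and a
# composition regularizer. Vectors are plain lists and assumed non-zero.
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def preference_loss(mm, text, image, target, margin=0.1):
    """Hinge penalty whenever a unimodal embedding scores within `margin`
    of (or above) the multimodal embedding against the retrieval target."""
    s_mm = cosine(mm, target)
    return sum(max(0.0, margin - (s_mm - cosine(u, target)))
               for u in (text, image))


def composition_regularizer(mm, text, image):
    """Pull the multimodal embedding toward a prototype composed from its
    unimodal parts (here simply their element-wise mean, an assumption)."""
    prototype = [(a + b) / 2.0 for a, b in zip(text, image)]
    return 1.0 - cosine(mm, prototype)


# Hypothetical 2-d embeddings.
target = [1.0, 0.0]
text_emb = [0.0, 1.0]
image_emb = [0.6, 0.8]
good_mm = [1.0, 0.0]   # beats both unimodal embeddings: no penalty
bad_mm = [0.0, 1.0]    # ignores the image: penalized
```

With these values `preference_loss(good_mm, text_emb, image_emb, target)` is zero, while the same call with `bad_mm` is positive, which is the behavior a shortcut-discouraging objective is meant to have.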

[40] Leveraging LLMs for Context-Aware Implicit Textual and Multimodal Hate Speech Detection

Joshua Wolfe Brook, Ilia Markov

🧩 TL;DR

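This study proposes a novel approach to hate speech detection that uses large language models as dynamic knowledge bases to generate background context and incorporates it into the classifier input. Experiments show that both the contextual information and the method of incorporating it are key to performance, with gains of up to 3 and 6 F1 points in the textual and multimodal settings respectively.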

📘 Detailed Summary

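Motivation: Current hate speech detection methods struggle with implicit hate speech and multimodal content and make little effective use of background context. This study explores how LLM-generated context can strengthen hate speech detection systems, particularly for implicit expressions and multimodal content that require deeper understanding.

Method: Two context generation strategies are examined: generation based on named entities and generation based on full-text prompting. Four methods of incorporating context are compared: text concatenation, embedding concatenation, a hierarchical transformer-based fusion, and LLM-driven text enhancement. Experiments use the textual Latent Hatred dataset of implicit hate speech and the multimodal MAMI dataset of misogynous memes.

Result: Both the contextual information and the incorporation method prove key to performance. From the zero-context baseline to the best system, gains reach up to 3 and 6 F1 points in the textual and multimodal settings respectively. Embedding concatenation performs best among the incorporation strategies, showing that effective incorporation of context can substantially improve detection accuracy.

Conclusion: Using LLMs to generate background context effectively improves hate speech detection, particularly for implicit expressions and multimodal content. The choice of incorporation method has a marked effect on final performance, with embedding concatenation performing best, which provides a useful reference for future context-based hate speech detection research.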

📄 Abstract

This research introduces a novel approach to textual and multimodal Hate Speech Detection (HSD), using Large Language Models (LLMs) as dynamic knowledge bases to generate background context and incorporate it into the input of HSD classifiers. Two context generation strategies are examined: one focused on named entities and the other on full-text prompting. Four methods of incorporating context into the classifier input are compared: text concatenation, embedding concatenation, a hierarchical transformer-based fusion, and LLM-driven text enhancement. Experiments are conducted on the textual Latent Hatred dataset of implicit hate speech and applied in a multimodal setting on the MAMI dataset of misogynous memes. Results suggest that both the contextual information and the method by which it is incorporated are key, with gains of up to 3 and 6 F1 points on textual and multimodal setups respectively, from a zero-context baseline to the highest-performing system, based on embedding concatenation.
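
To make the difference between two of the incorporation methods concrete, the sketch below contrasts text concatenation with embedding concatenation. The hashed bag-of-words `toy_embed` function is a stand-in invented for this example, not the encoder used in the paper, and the `[SEP]` separator and example strings are likewise assumptions.

```python
# Illustrative sketch only: text concatenation vs. embedding concatenation.
# `toy_embed` is a hypothetical stand-in for a real sentence encoder.
import zlib


def toy_embed(text: str, dim: int = 8) -> list[float]:
    """Hashed bag-of-words counts; deterministic thanks to CRC32."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    return vec


def text_concatenation(post: str, context: str, dim: int = 8) -> list[float]:
    """Join post and context as text, then encode once: one `dim`-sized vector."""
    return toy_embed(post + " [SEP] " + context, dim)


def embedding_concatenation(post: str, context: str, dim: int = 8) -> list[float]:
    """Encode post and context separately, then join the two vectors, so the
    classifier sees each source in its own slice of a 2 * dim input."""
    return toy_embed(post, dim) + toy_embed(context, dim)


# Hypothetical inputs: a post and LLM-generated background context.
post = "they are at it again"
context = "background on the group"
features = embedding_concatenation(post, context)
```

With embedding concatenation the first `dim` entries depend only on the post and the rest only on the generated context, so a downstream classifier can weight the two sources differently. With text concatenation the two are mixed before encoding.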

[41] SpeechLLMs for Large-scale Contextualized Zero-shot Slot Filling

Kadri Hacioglu, Manjunath K E, Andreas Stolcke

🧩 TL;DR

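By establishing an empirical upper bound for slot filling, this study identifies the performance, robustness, and generalization gaps of speech-based large language models and proposes improvements to training data, architecture, and training strategies that substantially improve model performance.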

📘 Detailed Summary

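Motivation: Slot filling has traditionally been implemented as a cascade of speech recognition followed by natural language understanding. Emerging speech-based large language models offer a unified generative alternative, but in practice they still fall short in performance, robustness, and generalization, so the gaps to the upper bound need to be identified and narrowed systematically.

Method: The study establishes an empirical performance upper bound for slot filling, systematically analyzes the limitations of speechLLMs in data, architecture, and training strategy, and proposes targeted improvements: better construction of training data, architectural adjustments, and improved training strategies.

Result: Experiments show that each proposed improvement substantially raises performance and narrows the gap to the empirical upper bound, while also exposing practical challenges and providing empirical guidance for using these emerging models.

Conclusion: Beyond improving speechLLM performance on slot filling, the study provides a systematic improvement framework and empirical guidance, pointing the way for unified generative solutions to speech understanding tasks.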

📄 Abstract

Slot filling is a crucial subtask in spoken language understanding (SLU), traditionally implemented as a cascade of speech recognition followed by one or more natural language understanding (NLU) components. The recent advent of speech-based large language models (speechLLMs), which integrate speech and textual foundation models, has opened new avenues for achieving speech understanding tasks in a more unified, generative, and instruction-following manner while promising data and compute efficiency with zero-shot abilities, generalizing to unseen slot labels. We address the slot-filling task by creating an empirical upper bound for the task, identifying performance, robustness, and generalization gaps, and proposing improvements to the training data, architecture, and training strategies to narrow the gap with the upper bound result. We show that each of these measures improves performance substantially, while highlighting practical challenges and providing empirical guidance and insights for harnessing these emerging models.

cs.AI [Back]

[42] AUGUSTUS: An LLM-Driven Multimodal Agent System with Contextualized User Memory

Jitesh Jain, Shubham Maheshwari, Ning Yu, Wen-mei Hwu, Humphrey Shi

🧩 TL;DR

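This paper presents AUGUSTUS, a multimodal agent system grounded in cognitive-science theories of human memory. It performs concept-driven retrieval over a graph-structured multimodal contextual memory, runs 3.5 times faster than a traditional multimodal RAG approach on ImageNet classification, and outperforms MemGPT on the MSC benchmark.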

📘 Detailed Summary

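Motivation: Existing agent systems based on retrieval-augmented generation focus on storing text and overlook multimodal signals, whereas human memory is inherently multimodal. This motivates a multimodal agent system aligned with theories of human memory from cognitive science.

Method: The system runs four stages in a loop: encode (understand the inputs), store in memory (save important information), retrieve (search memory for relevant context), and act (perform the task). Instead of a conventional vector database, information is conceptualized into semantic tags, which are associated with their context and stored in a graph-structured multimodal contextual memory for efficient concept-driven retrieval.

Result: On ImageNet classification the system outperforms the traditional multimodal RAG approach while being 3.5 times faster, and on the MSC benchmark it outperforms MemGPT, validating the effectiveness and efficiency of the approach.

Conclusion: Incorporating principles of human memory from cognitive science into the design of multimodal agent systems can substantially improve performance. The graph-structured semantic-tag memory offers a more efficient solution for multimodal retrieval and suggests a new direction for memory architectures in agent systems.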

📄 Abstract

Riding on the success of LLMs with retrieval-augmented generation (RAG), there has been a growing interest in augmenting agent systems with external memory databases. However, the existing systems focus on storing text information in their memory, ignoring the importance of multimodal signals. Motivated by the multimodal nature of human memory, we present AUGUSTUS, a multimodal agent system aligned with the ideas of human memory in cognitive science. Technically, our system consists of 4 stages connected in a loop: (i) encode: understanding the inputs; (ii) store in memory: saving important information; (iii) retrieve: searching for relevant context from memory; and (iv) act: perform the task. Unlike existing systems that use vector databases, we propose conceptualizing information into semantic tags and associating the tags with their context to store them in a graph-structured multimodal contextual memory for efficient concept-driven retrieval. Our system outperforms the traditional multimodal RAG approach while being 3.5 times faster for ImageNet classification and outperforming MemGPT on the MSC benchmark.
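
The idea of storing semantic tags linked to their contexts, rather than vectors, can be sketched with a tiny in-memory graph. The class name, its methods, and the overlap-count ranking are assumptions made for this example; the abstract does not describe AUGUSTUS's actual data structures or retrieval scoring.

```python
# Illustrative sketch only: a toy tag-to-context graph for concept-driven
# retrieval. Names and ranking rule are hypothetical, not AUGUSTUS's API.
from collections import defaultdict


class TagGraphMemory:
    """Bipartite graph linking semantic tags to stored context records."""

    def __init__(self) -> None:
        self.tag_to_contexts: dict[str, set[str]] = defaultdict(set)
        self.contexts: dict[str, str] = {}

    def store(self, context_id: str, payload: str, tags: list[str]) -> None:
        """Save a context (e.g. an image caption or dialogue turn) under its tags."""
        self.contexts[context_id] = payload
        for tag in tags:
            self.tag_to_contexts[tag.lower()].add(context_id)

    def retrieve(self, query_tags: list[str], top_k: int = 3) -> list[str]:
        """Rank contexts by how many query tags they share (ties by id)."""
        hits: dict[str, int] = defaultdict(int)
        for tag in query_tags:
            for context_id in self.tag_to_contexts.get(tag.lower(), ()):
                hits[context_id] += 1
        ranked = sorted(hits, key=lambda cid: (-hits[cid], cid))
        return ranked[:top_k]


# Hypothetical usage with one image and two dialogue contexts.
memory = TagGraphMemory()
memory.store("img1", "photo of the user's dog on a beach", ["dog", "beach"])
memory.store("chat7", "user mentioned a vet appointment", ["dog", "vet"])
memory.store("chat9", "user asked about cat food", ["cat"])
result = memory.retrieve(["Dog", "beach"])
```

Retrieval follows tag edges directly instead of scanning all stored items, which is one plausible reason a concept-driven memory can be cheaper than nearest-neighbor search over embeddings.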

[43] WebGen-V Bench: Structured Representation for Enhancing Visual Design in LLM-based Web Generation and Evaluation

Kuang-Da Wang, Zhao Wang, Yotaro Shimose, Wei-Yao Wang, Shingo Takamatsu

🧩 TL;DR

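WebGen-V introduces a new benchmark and framework for instruction-to-HTML generation that improves both data quality and evaluation granularity through three key innovations: an unbounded, extensible agentic crawling framework, a structured section-wise data representation, and a section-level multimodal evaluation protocol.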

📘 Detailed Summary

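Motivation: Instruction-to-HTML generation currently suffers from insufficient data quality and coarse evaluation granularity. More realistic data collection and finer-grained multimodal evaluation are needed to improve the accuracy and practicality of webpage generation.

Method: WebGen-V introduces three core techniques: an unbounded, extensible agentic crawling framework that continuously collects real-world webpages; a structured section-wise data representation that integrates metadata, localized UI screenshots, and JSON-formatted text and image assets; and a section-level multimodal evaluation protocol that aligns text, layout, and visual components for fine-grained assessment.

Result: Experiments with state-of-the-art large language models and ablation studies validate the effectiveness of the structured data and section-wise evaluation, demonstrate the contribution of each component, and realize a unified pipeline from real-world data acquisition to structured multimodal assessment.

Conclusion: To the authors' knowledge, WebGen-V is the first instruction-to-HTML generation work to enable high-granularity agentic crawling and evaluation. It offers a complete pipeline from data acquisition to multimodal assessment that substantially improves the quality of generated webpages and the precision of their evaluation.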

📄 Abstract

Motivated by recent advances in leveraging LLMs for coding and multimodal understanding, we present WebGen-V, a new benchmark and framework for instruction-to-HTML generation that enhances both data quality and evaluation granularity. WebGen-V contributes three key innovations: (1) an unbounded and extensible agentic crawling framework that continuously collects real-world webpages and can be leveraged to augment existing benchmarks; (2) a structured, section-wise data representation that integrates metadata, localized UI screenshots, and JSON-formatted text and image assets, with explicit alignment between content, layout, and visual components for detailed multimodal supervision; and (3) a section-level multimodal evaluation protocol aligning text, layout, and visuals for high-granularity assessment. Experiments with state-of-the-art LLMs and ablation studies validate the effectiveness of our structured data and section-wise evaluation, as well as the contribution of each component. To the best of our knowledge, WebGen-V is the first work to enable high-granularity agentic crawling and evaluation for instruction-to-HTML generation, providing a unified pipeline from real-world data acquisition and webpage generation to structured multimodal assessment.

[44] VERITAS: Leveraging Vision Priors and Expert Fusion to Improve Multimodal Data

Tingqiao Xu, Ziru Zeng, Jiayu Chen

🧩 TL;DR

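This paper proposes VERITAS, a pipeline that integrates vision priors with statistical fusion of multiple large multimodal models to systematically improve the quality of supervised fine-tuning data, substantially reducing factual errors and hallucinations.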

📘 Detailed Summary

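Motivation: Current methods for enhancing supervised fine-tuning data for large multimodal models suffer from inadequate visual perception, leading to frequent factual errors and hallucinations. A systematic solution is needed that integrates visual information effectively and ensures data accuracy.

Method: VERITAS extracts structured vision priors with the visual recognition model RAM++ and the OCR system PP-OCRv4, has three state-of-the-art LMMs evaluate the answers, and statistically fuses their scores into a high-confidence consensus score that serves as ground truth. A lightweight critic model is trained on this consensus via Group Relative Policy Optimization, and the highest-scoring candidate answer is selected as the refined result.

Result: Across six multimodal benchmarks, models fine-tuned on VERITAS-processed data consistently outperform those trained on raw data, especially on text-rich and fine-grained reasoning tasks. The critic model matches the capability of state-of-the-art LMMs while being significantly more efficient.

Conclusion: VERITAS demonstrates that systematically integrating vision priors with multi-model consensus improves SFT data quality, offers a reproducible solution for multimodal data optimization, and shows that a lightweight critic model can gain efficiency without sacrificing performance.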

📄 Abstract

The quality of supervised fine-tuning (SFT) data is crucial for the performance of large multimodal models (LMMs), yet current data enhancement methods often suffer from factual errors and hallucinations due to inadequate visual perception. To address this challenge, we propose VERITAS, a pipeline that systematically integrates vision priors and multiple state-of-the-art LMMs with statistical methods to enhance SFT data quality. VERITAS leverages visual recognition models (RAM++) and OCR systems (PP-OCRv4) to extract structured vision priors, which are combined with images, questions, and answers. Three LMMs (GPT-4o, Gemini-2.5-Pro, Doubao-1.5-pro) evaluate the original answers, providing critique rationales and scores that are statistically fused into a high-confidence consensus score serving as ground truth. Using this consensus, we train a lightweight critic model via Group Relative Policy Optimization (GRPO), enhancing reasoning capabilities efficiently. Each LMM then refines the original answers based on the critiques, generating new candidate answers; we select the highest-scoring one as the final refined answer. Experiments across six multimodal benchmarks demonstrate that models fine-tuned with data processed by VERITAS consistently outperform those using raw data, particularly in text-rich and fine-grained reasoning tasks. Our critic model exhibits enhanced capability comparable to state-of-the-art LMMs while being significantly more efficient. We release our pipeline, datasets, and model checkpoints to advance research in multimodal data optimization.
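
The score-fusion and selection steps can be illustrated with a minimal sketch. The abstract only says the three scores are "statistically fused", so using the median with a spread-based confidence flag, the threshold value, and the function names are all assumptions for illustration, not the method VERITAS actually applies.

```python
# Illustrative sketch only: one possible way to fuse three judge scores into a
# consensus and pick the best refined answer. The median-plus-spread rule is an
# assumption; the paper's statistical method is not specified in the abstract.
from statistics import median, pstdev


def fuse_scores(scores: list[float], max_spread: float = 1.0) -> tuple[float, bool]:
    """Return (consensus score, whether the judges agree closely enough)."""
    consensus = median(scores)
    is_confident = pstdev(scores) <= max_spread
    return consensus, is_confident


def pick_refined_answer(candidates: dict[str, float]) -> str:
    """Select the candidate answer with the highest critic score."""
    return max(candidates, key=candidates.get)


# Hypothetical scores from three judge models on a 0-10 scale.
agreeing = fuse_scores([8, 9, 8])      # judges agree: usable as a label
disagreeing = fuse_scores([2, 9, 5])   # judges disagree: low confidence

# Hypothetical critic scores for three refined candidate answers.
best = pick_refined_answer({"answer_a": 6.5, "answer_b": 8.0, "answer_c": 7.0})
```

The median is robust to a single outlying judge, and the spread check offers a simple way to keep only consensus labels that the judges broadly agree on.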

[45] Towards Flash Thinking via Decoupled Advantage Policy Optimization

Zezhong Tan, Hang Gao, Xinhong Ma, Feng Zhang, Ziqiang Dong

🧩 TL;DR

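This paper proposes DEPO, a novel reinforcement learning framework that reduces inefficient reasoning in large reasoning models. Through three core components (an advantage decoupled algorithm, a difficulty-aware length penalty, and advantage clipping), it substantially reduces response length and computational cost while maintaining or improving accuracy.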

📘 Detailed Summary

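Motivation: Although existing reinforcement learning algorithms substantially improve the accuracy of large reasoning models, the models still produce overly long responses and overthink, which increases inference latency and wastes computation, especially on simple tasks. These problems limit efficiency and scalability in practical applications.

Method: DEPO comprises three core components: an advantage decoupled algorithm that guides the model to reduce inefficient tokens, a difficulty-aware length penalty that lowers overall response length, and an advantage clipping method that prevents bias in policy optimization. The method is validated on the DeepSeek-Distill-Qwen model series.

Result: On DeepSeek-Distill-Qwen-7B and DeepSeek-Distill-Qwen-1.5B, DEPO reduces sequence length by 39% and cuts excessive reasoning paths in inefficient tokens. Despite the lower computational cost, overall accuracy still exceeds that of the base model.

Conclusion: DEPO offers an effective remedy for inefficient reasoning in large reasoning models, showing that inference efficiency can be substantially improved while preserving model performance. The work is a useful reference for future efficient reasoning models, particularly in resource-constrained settings.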

📄 Abstract

Recent Large Reasoning Models (LRMs) have achieved remarkable performance in solving complex problems via supervised fine-tuning (SFT) and reinforcement learning (RL). Although existing RL algorithms significantly enhance model accuracy, they still suffer from excessively lengthy responses and overthinking issues, resulting in increased inference latency and computational consumption, especially for simple tasks that require minimal reasoning. To address this, we propose a novel RL framework, DEPO, to reduce inefficient reasoning in models. Our method mainly consists of three core components: (1) an innovative advantage decoupled algorithm to guide the model to reduce inefficient tokens; (2) a difficulty-aware length penalty to lower the overall length of model responses; (3) an advantage clipping method to prevent bias in policy optimization. In our experiments, applied to DeepSeek-Distill-Qwen-7B and DeepSeek-Distill-Qwen-1.5B as base models, DEPO achieves a significant reduction in sequence length by 39% and reduces excessive reasoning paths in inefficient tokens, while outperforming the base model in overall accuracy.
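
Two of the three components, the difficulty-aware length penalty and advantage clipping, lend themselves to a toy illustration. The abstract gives no equations, so everything below is hypothetical: the use of the group's accuracy as an "easiness" proxy, the penalty coefficient, the mean-baseline advantage, and the function names are assumptions, and the advantage decoupling step is not modeled at all.

```python
# Illustrative sketch only: hypothetical forms of a difficulty-aware length
# penalty and advantage clipping. These are not DEPO's actual equations.


def length_penalized_rewards(correct: list[int], lengths: list[int],
                             coeff: float = 0.5) -> list[float]:
    """Correctness reward minus a length penalty that grows with easiness.

    Easiness is approximated by the share of sampled responses that are
    correct, so easy prompts are pushed harder toward short answers."""
    easiness = sum(correct) / len(correct)
    longest = max(lengths)
    return [float(c) - coeff * easiness * (n / longest)
            for c, n in zip(correct, lengths)]


def clipped_advantages(rewards: list[float], clip: float = 1.0) -> list[float]:
    """Mean-baseline advantages clipped to [-clip, clip]."""
    mean = sum(rewards) / len(rewards)
    return [max(-clip, min(clip, r - mean)) for r in rewards]


# Hypothetical group of four sampled responses to one prompt.
lengths = [100, 200, 400, 400]
easy_rewards = length_penalized_rewards([1, 1, 1, 1], lengths)  # all correct
hard_rewards = length_penalized_rewards([0, 0, 0, 0], lengths)  # none correct
advantages = clipped_advantages([5.0, 0.0, 0.0, 0.0])
```

On the easy prompt the shortest correct response earns the highest reward, while on the hard prompt no length penalty is applied, so long reasoning is not discouraged where it may be needed. Clipping keeps a single outlying reward from dominating the update.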

[46] Hypergraph Contrastive Sensor Fusion for Multimodal Fault Diagnosis in Induction Motors

Usman Ali, Ali Zia, Waqas Ali, Umer Ramzan, Abdul Rehman, Muhammad Tayyab Chaudhry, Wei Xiang

🧩 TL;DR

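This paper proposes the Multimodal Hypergraph Contrastive Attention Network (MM-HCAN), the first unified framework to integrate contrastive learning into a hypergraph topology for multimodal sensor fusion. It enables robust multi-fault diagnosis of induction motors and achieves up to 99.82% accuracy on three real-world benchmarks.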

📘 Detailed Summary

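Motivation: Conventional induction motor fault diagnosis methods struggle to capture complex multimodal signal relationships, are usually limited to unimodal data or a single fault type, and degrade under noisy or cross-domain conditions, falling short of the requirements of industrial safety and operational continuity.

Method: MM-HCAN integrates contrastive learning into a hypergraph topology designed specifically for multimodal sensor fusion. It jointly models intra- and inter-modal dependencies, enhances generalisation beyond Euclidean embedding spaces, and supports simultaneous diagnosis of bearing, stator, and rotor faults.

Result: On three real-world benchmarks, MM-HCAN achieves up to 99.82% accuracy with strong cross-domain generalisation and robustness to noise. An ablation study validates the contribution of each component.

Conclusion: MM-HCAN provides a scalable and robust solution for comprehensive multi-fault diagnosis, supporting predictive maintenance and extended asset longevity in industrial environments, and shows clear advantages in multimodal sensor fusion and cross-domain generalisation.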

📄 Abstract

Reliable induction motor (IM) fault diagnosis is vital for industrial safety and operational continuity, mitigating costly unplanned downtime. Conventional approaches often struggle to capture complex multimodal signal relationships, are constrained to unimodal data or single fault types, and exhibit performance degradation under noisy or cross-domain conditions. This paper proposes the Multimodal Hypergraph Contrastive Attention Network (MM-HCAN), a unified framework for robust fault diagnosis. To the best of our knowledge, MM-HCAN is the first to integrate contrastive learning within a hypergraph topology specifically designed for multimodal sensor fusion, enabling the joint modelling of intra- and inter-modal dependencies and enhancing generalisation beyond Euclidean embedding spaces. The model facilitates simultaneous diagnosis of bearing, stator, and rotor faults, addressing the engineering need for consolidated diagnostic capabilities. Evaluated on three real-world benchmarks, MM-HCAN achieves up to 99.82% accuracy with strong cross-domain generalisation and resilience to noise, demonstrating its suitability for real-world deployment. An ablation study validates the contribution of each component. MM-HCAN provides a scalable and robust solution for comprehensive multi-fault diagnosis, supporting predictive maintenance and extended asset longevity in industrial environments.

[47] Towards Relaxed Multimodal Inputs for Gait-based Parkinson's Disease Assessment

Minlin Zeng, Zhipeng Zhou, Yang Qiu, Zhiqi Shen

🧩 TL;DR

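This paper presents TRIP, the first Parkinson's disease assessment system to formulate multimodal learning as a multi-objective optimization problem, removing the dependence of conventional multimodal methods on synchronized and complete modalities during training and inference.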

📘 Detailed Summary

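Motivation: Current multimodal methods for Parkinson's disease assessment have two main limitations: all modalities must be available and synchronized during training, and all modalities must be present at inference. These limitations severely hinder clinical use.

Method: The proposed TRIP framework formulates multimodal learning as a multi-objective optimization problem, which also handles modality collapse during multimodal information fusion, and uses a margin-based class rebalancing strategy to mitigate imbalance within individual modalities.

Result: Experiments on three public datasets show that TRIP outperforms the best baselines by 16.48, 6.89, and 11.55 percentage points in the asynchronous setting and by 4.86 and 2.30 percentage points in the synchronous setting, achieving state-of-the-art performance.

Conclusion: The study demonstrates the effectiveness of a multi-objective optimization framework for multimodal medical data analysis, offering a flexible and robust solution for incomplete or asynchronous modalities with significant clinical value.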

📄 Abstract

Parkinson's disease assessment has garnered growing interest in recent years, particularly with the advent of sensor data and machine learning techniques. Among these, multimodal approaches have demonstrated strong performance by effectively integrating complementary information from various data sources. However, two major limitations hinder their practical application: (1) the need to synchronize all modalities during training, and (2) the dependence on all modalities during inference. To address these issues, we propose the first Parkinson's assessment system that formulates multimodal learning as a multi-objective optimization (MOO) problem. This not only allows for more flexible modality requirements during both training and inference, but also handles the modality collapse issue during multimodal information fusion. In addition, to mitigate the imbalance within individual modalities, we introduce a margin-based class rebalancing strategy to enhance category learning. We conduct extensive experiments on three public datasets under both synchronous and asynchronous settings. The results show that our framework, Towards Relaxed InPuts (TRIP), achieves state-of-the-art performance, outperforming the best baselines by 16.48, 6.89, and 11.55 percentage points in the asynchronous setting, and by 4.86 and 2.30 percentage points in the synchronous setting, highlighting its effectiveness and adaptability.

Table of Contents

cs.CV [Back]

[1] GAZE:Governance-Aware pre-annotation for Zero-shot World Model Environments
[2] DeLeaker: Dynamic Inference-Time Reweighting For Semantic Leakage Mitigation in Text-to-Image Models
[3] UrbanVerse: Scaling Urban Simulation by Watching City-Tour Videos
[4] MOBIUS: Big-to-Mobile Universal Instance Segmentation via Multi-modal Bottleneck Fusion and Calibrated Decoder Pruning
[5] Composition-Grounded Instruction Synthesis for Visual Reasoning
[6] Comprehensive language-image pre-training for 3D medical image understanding
[7] Directional Reasoning Injection for Fine-Tuning MLLMs
[8] TGT: Text-Grounded Trajectories for Locally Controlled Video Generation
[9] XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models
[10] Train a Unified Multimodal Data Quality Classifier with Synthetic Data
[11] Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models
[12] Hyperparameter Optimization and Reproducibility in Deep Learning Model Training
[13] Shakti-VLMs: Scalable Vision-Language Models for Enterprise AI
[14] CARDIUM: Congenital Anomaly Recognition with Diagnostic Images and Unified Medical records
[15] The Face of Persuasion: Analyzing Bias and Generating Culture-Aware Ads
[16] DriveGen3D: Boosting Feed-Forward Driving Scene Generation with Efficient Video Diffusion
[17] Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models
[18] OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM
[19] Semantic4Safety: Causal Insights from Zero-shot Street View Imagery Segmentation for Urban Road Safety

[20] MSAM: Multi-Semantic Adaptive Mining for Cross-Modal Drone Video-Text Retrieval

🧩 TL;DR