Table of Contents

cs.CV

[1] MetaVoxel: Joint Diffusion Modeling of Imaging and Clinical Metadata

Yihao Liu, Chenyu Gao, Lianrui Zuo, Michael E. Kim, Brian D. Boyd, Lisa L. Barnes, Walter A. Kukull, Lori L. Beason-Held, Susan M. Resnick, Timothy J. Hohman, Warren D. Taylor, Bennett A. Landman

🧩 TL;DR

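This paper presents MetaVoxel, a generative joint diffusion modeling framework that models the joint distribution of medical imaging data and clinical metadata with a single diffusion process, enabling unified multi-task handling and flexible zero-shot inference.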


📘 Detailed Summary

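Motivation: Current deep learning methods typically train conditional distribution models for one specific predictive direction, so each task needs its own model, which limits flexibility and unification. This work addresses the fragmentation of task-specific models in medical AI by modeling the joint distribution so that multiple tasks can be handled in a unified way.

Method: MetaVoxel adopts a generative joint diffusion modeling framework that learns a single diffusion process spanning all variables to model the joint distribution of imaging data and clinical metadata. It supports flexible zero-shot inference from arbitrary subsets of inputs without task-specific retraining.

Result: On more than 10,000 T1-weighted MRI scans paired with clinical metadata from nine datasets, a single MetaVoxel model performs image generation, age estimation, and sex prediction, with performance comparable to established task-specific baselines. Additional experiments demonstrate its flexible inference capabilities.

Conclusion: Joint multimodal diffusion modeling is a promising direction for unifying medical AI models and supporting broader clinical application. Handling multiple tasks with one model reduces deployment complexity and makes clinical use more flexible.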


📄 Abstract

Modern deep learning methods have achieved impressive results across tasks ranging from disease classification and continuous biomarker estimation to realistic medical image generation. Most of these approaches are trained to model conditional distributions defined by a specific predictive direction with a specific set of input variables. We introduce MetaVoxel, a generative joint diffusion modeling framework that models the joint distribution over imaging data and clinical metadata by learning a single diffusion process spanning all variables. By capturing the joint distribution, MetaVoxel unifies tasks that traditionally require separate conditional models and supports flexible zero-shot inference using arbitrary subsets of inputs without task-specific retraining. Using more than 10,000 T1-weighted MRI scans paired with clinical metadata from nine datasets, we show that a single MetaVoxel model can perform image generation, age estimation, and sex prediction, achieving performance comparable to established task-specific baselines. Additional experiments highlight its capabilities for flexible inference. Together, these findings demonstrate that joint multimodal diffusion offers a promising direction for unifying medical AI models and enabling broader clinical applicability.

[2] Independent Density Estimation

Jiahao Liu

🧩 TL;DR

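This paper proposes Independent Density Estimation (IDE), a method that addresses the weak compositional generalization of large-scale vision-language models by learning the connection between individual words in a sentence and image features.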


📘 Detailed Summary

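Motivation: Although large-scale vision-language models have achieved remarkable results in areas such as image captioning and conditional image generation, they still struggle to achieve human-like compositional generalization. This work targets that challenge.

Method: The paper proposes Independent Density Estimation (IDE), which learns the connection between individual words in a sentence and the corresponding features in an image. Two models are built on it: the first takes fully disentangled visual representations as input, and the second uses a Variational Auto-Encoder to obtain partially disentangled features from raw images. An entropy-based compositional inference method is also proposed to combine the predictions for each word in the sentence.

Result: Evaluations on multiple datasets show that the proposed models generalize better to unseen compositions than existing models, supporting the effectiveness of IDE for compositional generalization.

Conclusion: IDE offers an effective route to compositional generalization in vision-language models. By establishing independent connections between words and image features and using an entropy-based inference mechanism, it markedly improves generalization to novel compositions and lays groundwork for more complex compositional reasoning tasks.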


📄 Abstract

Large-scale Vision-Language models have achieved remarkable results in various domains, such as image captioning and conditioned image generation. Nevertheless, these models still encounter difficulties in achieving human-like compositional generalization. In this study, we propose a new method called Independent Density Estimation (IDE) to tackle this challenge. IDE aims to learn the connection between individual words in a sentence and the corresponding features in an image, enabling compositional generalization. We build two models based on the philosophy of IDE. The first one utilizes fully disentangled visual representations as input, and the second leverages a Variational Auto-Encoder to obtain partially disentangled features from raw images. Additionally, we propose an entropy-based compositional inference method to combine predictions of each word in the sentence. Our models exhibit superior generalization to unseen compositions compared to current models when evaluated on various datasets.

[3] Latent Chain-of-Thought World Modeling for End-to-End Driving

Shuhan Tan, Kashyap Chitta, Yuxiao Chen, Ran Tian, Yurong You, Yan Wang, Wenjie Luo, Yulong Cao, Philipp Krahenbuhl, Marco Pavone, Boris Ivanovic

🧩 TL;DR

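This paper presents Latent-CoT-Drive (LCDrive), a vision-language-action model for autonomous driving that performs chain-of-thought reasoning in a latent language. By unifying reasoning and decision making in an action-aligned latent space, it achieves more efficient inference and better driving performance than natural-language reasoning approaches.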


📘 Detailed Summary

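Motivation: Existing vision-language-action models for autonomous driving mainly express chain-of-thought reasoning in natural language, but text may not be the most efficient representation for reasoning. More effective reasoning representations are needed to improve driving performance and safety.

Method: LCDrive performs chain-of-thought reasoning in a latent language, unifying reasoning and decision making in an action-aligned latent space. The model interleaves action-proposal tokens (which use the same vocabulary as the output actions) with world model tokens (which are grounded in a learned latent world model and express the future outcomes of those actions). Reasoning is strengthened through cold-start supervision followed by closed-loop reinforcement learning post-training.

Result: On a large-scale end-to-end driving benchmark, LCDrive achieves faster inference, better trajectory quality, and larger gains from interactive reinforcement learning than both non-reasoning and text-reasoning baselines.

Conclusion: The study indicates that a latent language is better suited than natural language to chain-of-thought reasoning in autonomous driving. Unifying reasoning and decision making in an action-aligned latent space yields more efficient inference and better driving decisions, and points to a new direction for designing reasoning representations in VLA models.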


📄 Abstract

Recent Vision-Language-Action (VLA) models for autonomous driving explore inference-time reasoning as a way to improve driving performance and safety in challenging scenarios. Most prior work uses natural language to express chain-of-thought (CoT) reasoning before producing driving actions. However, text may not be the most efficient representation for reasoning. In this work, we present Latent-CoT-Drive (LCDrive): a model that expresses CoT in a latent language that captures possible outcomes of the driving actions being considered. Our approach unifies CoT reasoning and decision making by representing both in an action-aligned latent space. Instead of natural language, the model reasons by interleaving (1) action-proposal tokens, which use the same vocabulary as the model's output actions; and (2) world model tokens, which are grounded in a learned latent world model and express future outcomes of these actions. We cold start latent CoT by supervising the model's action proposals and world model tokens based on ground-truth future rollouts of the scene. We then post-train with closed-loop reinforcement learning to strengthen reasoning capabilities. On a large-scale end-to-end driving benchmark, LCDrive achieves faster inference, better trajectory quality, and larger improvements from interactive reinforcement learning compared to both non-reasoning and text-reasoning baselines.

[4] Solving Semi-Supervised Few-Shot Learning from an Auto-Annotation Perspective

Tian Liu, Anwesha Basu, James Caverlee, Shu Kong

🧩 TL;DR

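This paper proposes SWIFT, which uses stage-wise finetuning and temperature tuning to address the flat confidence distributions that vision-language models produce in semi-supervised few-shot learning. It substantially improves the use of unlabeled data and outperforms existing FSL and SSL methods by about 5 accuracy points across multiple benchmarks.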


📘 Detailed Summary

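Motivation: Semi-supervised few-shot learning (SSFSL) aims to learn a model from a few labeled and many unlabeled examples, serving real-world applications such as "auto-annotation". Despite the availability of powerful open-source vision-language models (VLMs) and their pretraining data, most SSFSL work neglects these resources, whereas the related area of few-shot learning has already used them to boost performance. The authors find that directly applying conventional SSL methods to finetune a VLM significantly underperforms FSL baselines, mainly because the softmax probability distributions produced by VLMs are too "flat", which leads to low utilization of unlabeled data and weak supervision signals.

Method: The paper proposes two simple but effective techniques, classifier initialization and temperature tuning, which jointly raise the confidence of pseudo-labels, improving the utilization of unlabeled data and strengthening supervision signals. Building on these, it proposes Stage-Wise Finetuning with Temperature Tuning (SWIFT), which enables existing SSL methods to finetune a VLM effectively on limited labeled data, abundant unlabeled data, and task-relevant but noisy data retrieved from the VLM's pretraining set. SWIFT finetunes in stages, progressively improving the model on the target task.

Result: Extensive experiments on five SSFSL benchmarks show that SWIFT significantly outperforms recent FSL and SSL methods, by about 5 accuracy points on average. Notably, SWIFT even rivals supervised learning, which finetunes the VLM with ground-truth labels. The experiments confirm that classifier initialization and temperature tuning resolve the flat confidence distributions of VLMs and that SWIFT makes good use of unlabeled and noisy data.

Conclusion: The study shows that, with appropriate technical adjustments, vision-language models can make effective use of unlabeled data in semi-supervised few-shot learning and overcome their inherently flat confidence distributions. SWIFT offers a practical solution for auto-annotation in real applications and shows the potential of combining open-source VLM resources with semi-supervised learning frameworks. Future work could explore more sophisticated finetuning strategies and noisy-data handling to push SSFSL performance further.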


📄 Abstract

Semi-supervised few-shot learning (SSFSL) formulates real-world applications like "auto-annotation", as it aims to learn a model over a few labeled and abundant unlabeled examples to annotate the unlabeled ones. Despite the availability of powerful open-source Vision-Language Models (VLMs) and their pretraining data, the SSFSL literature largely neglects these open-source resources. In contrast, the related area of few-shot learning (FSL) has already exploited them to boost performance. Arguably, to achieve auto-annotation in the real world, SSFSL should leverage such open-source resources. To this end, we start by applying established SSL methods to finetune a VLM. Counterintuitively, they significantly underperform FSL baselines. Our in-depth analysis reveals the root cause: VLMs produce rather "flat" distributions of softmax probabilities. This results in zero utilization of unlabeled data and weak supervision signals. We address this issue with embarrassingly simple techniques: classifier initialization and temperature tuning. They jointly increase the confidence scores of pseudo-labels, improving the utilization rate of unlabeled data and strengthening supervision signals. Building on this, we propose Stage-Wise Finetuning with Temperature Tuning (SWIFT), which enables existing SSL methods to effectively finetune a VLM on limited labeled data, abundant unlabeled data, and task-relevant but noisy data retrieved from the VLM's pretraining set. Extensive experiments on five SSFSL benchmarks show that SWIFT outperforms recent FSL and SSL methods by about 5 accuracy points. SWIFT even rivals supervised learning, which finetunes VLMs with the unlabeled data being labeled with ground truth!
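
The link between flat softmax distributions and unused unlabeled data can be illustrated with a small sketch. It assumes a FixMatch-style rule in which an unlabeled example contributes only when its top pseudo-label confidence clears a threshold; the logits, the threshold of 0.8, and the temperature values are made up for illustration and are not taken from the paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a smaller temperature sharpens the distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def utilization_rate(batch_logits, temperature, threshold):
    """Fraction of unlabeled examples whose top confidence passes the threshold."""
    kept = sum(
        1 for logits in batch_logits
        if max(softmax(logits, temperature)) >= threshold
    )
    return kept / len(batch_logits)

# Hypothetical cosine-similarity logits for three unlabeled images over four classes.
unlabeled_logits = [
    [0.31, 0.22, 0.18, 0.20],
    [0.15, 0.35, 0.21, 0.19],
    [0.25, 0.24, 0.26, 0.23],
]

flat_rate = utilization_rate(unlabeled_logits, temperature=1.0, threshold=0.8)
sharp_rate = utilization_rate(unlabeled_logits, temperature=0.01, threshold=0.8)
print(flat_rate, sharp_rate)
```

With these made-up logits, no example clears the threshold at temperature 1.0, while two of the three do at temperature 0.01; the third stays below the threshold because its logits are nearly tied.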

[5] VLM-NCD: Novel Class Discovery with Vision-Based Large Language Models

Yuetong Su, Baoguo Wei, Xinyu Wang, Xu Li, Lixin Li

🧩 TL;DR

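This paper proposes LLM-NCD, a multimodal novel class discovery framework that fuses visual-textual semantics with prototype-guided clustering to overcome the limited feature discriminability and long-tail sensitivity of existing methods, achieving up to a 25.3% improvement in unknown-class accuracy on CIFAR-100.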


📘 Detailed Summary

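Motivation: Novel class discovery aims to use prior knowledge of known classes to classify and discover unknown classes in unlabeled data. Existing image-based methods rely mainly on visual features and suffer from insufficient feature discriminability and long-tailed data distributions, a bottleneck that needs to be broken.

Method: The paper proposes LLM-NCD, a multimodal framework that improves feature representations by fusing visual-textual semantics with prototype-guided clustering. Its key innovations are jointly optimizing the image and text features of known classes to model cluster centers and semantic prototypes, and a dual-phase discovery mechanism that dynamically separates known from novel samples via semantic affinity thresholds and adaptive clustering.

Result: Experiments on CIFAR-100 show that the method improves accuracy on unknown classes by up to 25.3% over existing methods. Notably, it shows distinctive robustness to long-tail distributions, which the authors describe as a first in the novel class discovery literature.

Conclusion: The study demonstrates the effectiveness of multimodal semantic fusion for novel class discovery. In particular, joint visual-textual optimization and adaptive clustering markedly improve feature discriminability and robustness to imbalanced data distributions, suggesting a new direction for the design of novel class discovery methods.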


📄 Abstract

Novel Class Discovery (NCD) aims to utilise prior knowledge of known classes to classify and discover unknown classes from unlabelled data. Existing NCD methods for images primarily rely on visual features, which suffer from limitations such as insufficient feature discriminability and the long-tail distribution of data. We propose LLM-NCD, a multimodal framework that breaks this bottleneck by fusing visual-textual semantics and prototype-guided clustering. Our key innovation lies in modelling cluster centres and semantic prototypes of known classes by jointly optimising known-class image and text features, and in a dual-phase discovery mechanism that dynamically separates known and novel samples via semantic affinity thresholds and adaptive clustering. Experiments on the CIFAR-100 dataset show that, compared to current methods, this method achieves up to a 25.3% improvement in accuracy for unknown classes. Notably, our method shows unique resilience to long-tail distributions, a first in the NCD literature.
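
The first phase of such a discovery mechanism, separating known from novel samples by a semantic affinity threshold, might look roughly like the sketch below. It assumes that affinity is the cosine similarity between a sample feature and each known-class prototype; the function names, the toy 2-D features, and the threshold of 0.8 are hypothetical, and the adaptive clustering of the novel samples is not shown.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def split_known_novel(features, prototypes, threshold=0.8):
    """Assign a sample to its closest known class if the affinity is high enough;
    otherwise flag it as novel so a later clustering step can group it."""
    known, novel = {}, []
    for index, feature in enumerate(features):
        best_class, best_affinity = max(
            ((name, cosine(feature, proto)) for name, proto in prototypes.items()),
            key=lambda pair: pair[1],
        )
        if best_affinity >= threshold:
            known[index] = best_class
        else:
            novel.append(index)
    return known, novel

# Toy prototypes for two known classes and three unlabeled samples.
prototypes = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
features = [[0.9, 0.1], [0.1, 0.95], [0.7, 0.7]]

known, novel = split_known_novel(features, prototypes, threshold=0.8)
print(known, novel)
```

In this toy setup the third sample sits halfway between both prototypes, so its best affinity falls below the threshold and it is routed to the novel set.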

[6] Efficient-VLN: A Training-Efficient Vision-Language Navigation Model

Duo Zheng, Shijia Huang, Yanyang Li, Liwei Wang

🧩 TL;DR

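This paper proposes Efficient-VLN, a training-efficient vision-language navigation model. It designs a progressive memory and a learnable recursive memory to reduce the quadratic computational burden of long observation histories, and introduces a dynamic mixed policy to balance the exploration-efficiency trade-off, reaching state-of-the-art performance at a much lower training cost.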


📘 Detailed Summary

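Motivation: Multimodal large language models show promise for vision-language navigation, but their practical development is severely hindered by substantial training overhead. The authors identify two key sources of this overhead: the quadratic computational burden of processing long-horizon historical observations as massive token sequences, and the exploration-efficiency trade-off in DAgger data aggregation, where more exploration yields effective error-recovery trajectories but lengthens trajectories for both training and inference.

Method: To address these challenges, the authors propose Efficient-VLN with two efficient memory mechanisms: a progressive memory that dynamically allocates more tokens to recent observations, and a learnable recursive memory that uses the key-value cache of learnable tokens as the memory state. A dynamic mixed policy is also introduced to balance the exploration-efficiency trade-off and improve how trajectories are collected during training.

Result: Efficient-VLN achieves a 64.2% success rate on R2R-CE and a 67.0% success rate on RxR-CE, both state of the art. Critically, the model consumes only 282 H800 GPU hours, a substantial reduction in training overhead compared with state-of-the-art methods.

Conclusion: The study shows that efficient memory mechanisms and a balanced exploration policy can maintain or even improve vision-language navigation performance while sharply reducing training overhead. This offers a feasible route to deploying multimodal large language models in navigation tasks and a useful reference for designing efficient VLN models.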


📄 Abstract

Multimodal large language models (MLLMs) have shown promising potential in Vision-Language Navigation (VLN). However, their practical development is severely hindered by the substantial training overhead. We recognize two key issues that contribute to the overhead: (1) the quadratic computational burden from processing long-horizon historical observations as massive sequences of tokens, and (2) the exploration-efficiency trade-off in DAgger, i.e., a data aggregation process of collecting agent-explored trajectories. While more exploration yields effective error-recovery trajectories for handling test-time distribution shifts, it comes at the cost of longer trajectory lengths for both training and inference. To address these challenges, we propose Efficient-VLN, a training-efficient VLN model. Specifically, to mitigate the token processing burden, we design two efficient memory mechanisms: a progressive memory that dynamically allocates more tokens to recent observations, and a learnable recursive memory that utilizes the key-value cache of learnable tokens as the memory state. Moreover, we introduce a dynamic mixed policy to balance the exploration-efficiency trade-off. Extensive experiments show that Efficient-VLN achieves state-of-the-art performance on R2R-CE (64.2% SR) and RxR-CE (67.0% SR). Critically, our model consumes merely 282 H800 GPU hours, demonstrating a dramatic reduction in training overhead compared to state-of-the-art methods.
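
The idea behind a progressive memory, spending more tokens on recent observations than on old ones, can be sketched as a per-frame token budget. The abstract does not give the allocation rule, so the geometric decay, the floor of 4 tokens, and the cap of 64 tokens below are assumptions chosen only to make the effect visible.

```python
def progressive_token_budget(num_frames, max_tokens=64, min_tokens=4, decay=0.5):
    """Tokens kept per observation, oldest first.

    The newest frame keeps max_tokens; each step further into the past
    multiplies the budget by `decay`, never dropping below min_tokens.
    """
    budgets = []
    for age in range(num_frames - 1, -1, -1):  # age 0 is the most recent frame
        tokens = max(min_tokens, int(max_tokens * decay ** age))
        budgets.append(tokens)
    return budgets

budgets = progressive_token_budget(num_frames=6)
uniform_total = 6 * 64  # every frame kept at full resolution
print(budgets, sum(budgets), uniform_total)
```

Under these assumed settings, six frames cost 128 tokens instead of 384, which matters because self-attention cost grows quadratically with sequence length.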

[7] DualProtoSeg: Simple and Efficient Design with Text- and Image-Guided Prototype Learning for Weakly Supervised Histopathology Image Segmentation

Anh M. Vu, Khang P. Le, Trang T. K. Vo, Ha Thach, Huy Hung Nguyen, David Yang, Han H. Huynh, Quynh Nguyen, Tuan M. Pham, Tuan-Anh Le, Minh H. N. Le, Thanh-Huy Nguyen, Akash Awasthi, Chandra Mohan, Zhu Han, Hien Van Nguyen

🧩 TL;DR

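This paper proposes a prototype-driven framework for weakly supervised semantic segmentation in histopathology that combines vision-language alignment with learnable prompt tuning to improve region discovery, surpassing existing state-of-the-art methods on the BCSS-WSSS benchmark.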


📘 Detailed Summary

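Motivation: Weakly supervised semantic segmentation in histopathology seeks to reduce annotation cost by learning from image-level labels, but it is limited by inter-class homogeneity, intra-class heterogeneity, and the region-shrinkage effect of CAM-based supervision. More effective region discovery is needed.

Method: The method is a prototype-driven framework that integrates CoOp-style learnable prompt tuning to generate text-based prototypes and combines them with learnable image prototypes, forming a dual-modal prototype bank that captures both semantic and appearance cues. To mitigate oversmoothing in ViT representations, it adds a multi-scale pyramid module that improves spatial precision and localization quality.

Result: Experiments on the BCSS-WSSS benchmark show that the method surpasses existing state-of-the-art methods. Detailed analyses show the benefits of text description diversity, context length, and the complementary behavior of text and image prototypes, supporting the effectiveness of dual-modal prototype learning.

Conclusion: The results indicate that jointly leveraging textual semantics and visual prototype learning is effective for weakly supervised semantic segmentation in digital pathology. The dual-modal prototype bank and multi-scale enhancement address limitations of existing methods and offer a promising way to reduce annotation cost for medical images.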


📄 Abstract

Weakly supervised semantic segmentation (WSSS) in histopathology seeks to reduce annotation cost by learning from image-level labels, yet it remains limited by inter-class homogeneity, intra-class heterogeneity, and the region-shrinkage effect of CAM-based supervision. We propose a simple and effective prototype-driven framework that leverages vision-language alignment to improve region discovery under weak supervision. Our method integrates CoOp-style learnable prompt tuning to generate text-based prototypes and combines them with learnable image prototypes, forming a dual-modal prototype bank that captures both semantic and appearance cues. To address oversmoothing in ViT representations, we incorporate a multi-scale pyramid module that enhances spatial precision and improves localization quality. Experiments on the BCSS-WSSS benchmark show that our approach surpasses existing state-of-the-art methods, and detailed analyses demonstrate the benefits of text description diversity, context length, and the complementary behavior of text and image prototypes. These results highlight the effectiveness of jointly leveraging textual semantics and visual prototype learning for WSSS in digital pathology.

[8] ConStruct: Structural Distillation of Foundation Models for Prototype-Based Weakly Supervised Histopathology Segmentation

Khang Le, Ha Thach, Anh M. Vu, Trang T. K. Vo, Han H. Huynh, David Yang, Minh H. N. Le, Thanh-Huy Nguyen, Akash Awasthi, Chandra Mohan, Zhu Han, Hien Van Nguyen

🧩 TL;DR

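This paper proposes a prototype learning framework for weakly supervised semantic segmentation of histopathology images that integrates morphology-aware representations from CONCH, multi-scale structural cues from SegFormer, and text-guided semantic alignment. It generates high-quality pseudo-masks without pixel-level annotations and improves localization completeness.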


📘 Detailed Summary

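Motivation: Weakly supervised semantic segmentation in histopathology relies heavily on classification backbones, but these models usually localize only the most discriminative regions and struggle to capture the full spatial extent of tissue structures. Vision-language models such as CONCH provide rich semantic alignment and morphology-aware representations, while modern segmentation backbones such as SegFormer preserve fine-grained spatial cues. Combining these complementary strengths remains challenging under weak supervision and without dense annotations.

Method: The method is a prototype learning framework that integrates morphology-aware representations from CONCH, multi-scale structural cues from SegFormer, and text-guided semantic alignment to produce prototypes that are both semantically discriminative and spatially coherent. It includes a text-guided prototype initialization that uses pathology descriptions to generate more complete and semantically accurate pseudo-masks, and a structural distillation mechanism that transfers spatial knowledge from SegFormer into prototype learning to preserve fine-grained morphological patterns and local tissue boundaries.

Result: Experiments on the BCSS-WSSS dataset show that the framework outperforms existing weakly supervised semantic segmentation methods. It generates high-quality pseudo-masks without pixel-level annotations, improves localization completeness, and enhances semantic consistency across tissue types. It stays computationally efficient by using frozen foundation model backbones and lightweight trainable adapters.

Conclusion: The study shows how to combine the semantic alignment of vision-language models with the spatial structure information of segmentation backbones, offering a new approach to weakly supervised histopathology image analysis. With text-guided prototype initialization and structural distillation, the method produces more complete and semantically accurate pseudo-masks and offers a promising way to reduce annotation dependence in medical image analysis.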


📄 Abstract

Weakly supervised semantic segmentation (WSSS) in histopathology relies heavily on classification backbones, yet these models often localize only the most discriminative regions and struggle to capture the full spatial extent of tissue structures. Vision-language models such as CONCH offer rich semantic alignment and morphology-aware representations, while modern segmentation backbones like SegFormer preserve fine-grained spatial cues. However, combining these complementary strengths remains challenging, especially under weak supervision and without dense annotations. We propose a prototype learning framework for WSSS in histopathological images that integrates morphology-aware representations from CONCH, multi-scale structural cues from SegFormer, and text-guided semantic alignment to produce prototypes that are simultaneously semantically discriminative and spatially coherent. To effectively leverage these heterogeneous sources, we introduce text-guided prototype initialization that incorporates pathology descriptions to generate more complete and semantically accurate pseudo-masks. A structural distillation mechanism transfers spatial knowledge from SegFormer to preserve fine-grained morphological patterns and local tissue boundaries during prototype learning. Our approach produces high-quality pseudo masks without pixel-level annotations, improves localization completeness, and enhances semantic consistency across tissue types. Experiments on BCSS-WSSS datasets demonstrate that our prototype learning framework outperforms existing WSSS methods while remaining computationally efficient through frozen foundation model backbones and lightweight trainable adapters.

[9] EchoingPixels: Cross-Modal Adaptive Token Reduction for Efficient Audio-Visual LLMs

Chao Gong, Depeng Wang, Zhipeng Wei, Ya Guo, Huijia Zhu, Jingjing Chen

🧩 TL;DR

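This paper proposes EchoingPixels, a framework that performs adaptive token reduction on the joint audio-visual stream through a Cross-Modal Semantic Sieve (CS2) module, and designs a Synchronization-Augmented RoPE (Sync-RoPE) to preserve critical temporal relationships among the sparsely selected tokens, significantly reducing computational overhead.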


📘 Detailed Summary

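Motivation: Audio-visual large language models face heavy computational overhead from massive numbers of audio and video tokens. Existing unimodal token reduction methods cannot exploit audio-visual cross-modal synergies, and because audio and video have distinct and dynamic information densities, static per-modality budgets work poorly. How to reduce tokens on a joint audio-visual stream remains an unaddressed bottleneck.

Method: The paper proposes the EchoingPixels framework, whose core is the Cross-Modal Semantic Sieve (CS2) module. Through early audio-visual interaction, CS2 co-attends to the joint multimodal stream and reduces tokens from the entire combined pool of audio-visual tokens rather than using fixed per-modality budgets. This single-pool approach allows the token budget to be allocated adaptively and salient tokens to be identified dynamically. A Synchronization-Augmented RoPE (Sync-RoPE) is co-designed to preserve critical temporal relationships among the sparsely selected tokens.

Result: Extensive experiments show that EchoingPixels matches strong baselines using only 5-20% of the original tokens, with a 2-3x speedup and memory reduction, indicating that the framework substantially reduces computational overhead while preserving performance.

Conclusion: The study fills the gap in token reduction for joint audio-visual streams. The proposed cross-modal joint reduction is more effective than compressing each modality independently, and adaptive budget allocation better handles the dynamic information densities of different modalities, offering a useful technical path toward efficient multimodal large language models.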


📄 Abstract

Audio-Visual Large Language Models (AV-LLMs) face prohibitive computational overhead from massive audio and video tokens. Token reduction, while extensively explored for video-only LLMs, is insufficient for the audio-visual domain, as these unimodal methods cannot leverage audio-visual cross-modal synergies. Furthermore, the distinct and dynamic information densities of audio and video render static budgets per modality suboptimal. How to perform token reduction on a joint audio-visual stream thus remains an unaddressed bottleneck. To fill this gap, we introduce EchoingPixels, a framework inspired by the coexistence and interaction of visuals and sound in real-world scenes. The core of our framework is the Cross-Modal Semantic Sieve (CS2), a module enabling early audio-visual interaction. Instead of compressing modalities independently, CS2 co-attends to the joint multimodal stream and reduces tokens from an entire combined pool of audio-visual tokens rather than using fixed budgets per modality. This single-pool approach allows it to adaptively allocate the token budget across both modalities and dynamically identify salient tokens in concert. To ensure this aggressive reduction preserves the vital temporal modeling capability, we co-design a Synchronization-Augmented RoPE (Sync-RoPE) to maintain critical temporal relationships for the sparsely selected tokens. Extensive experiments demonstrate that EchoingPixels achieves performance comparable to strong baselines using only 5-20% of the original tokens, with a 2-3x speedup and memory reduction.
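
The contrast between fixed per-modality budgets and a single shared pool can be sketched with plain top-k selection. The saliency scores below are made-up numbers standing in for whatever CS2 computes through co-attention, and both function names are hypothetical; the sketch shows only how a shared pool lets the budget shift toward the more informative modality.

```python
def top_indices(scores, k):
    """Indices of the k highest scores, returned in temporal (index) order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def fixed_budget_select(audio_scores, video_scores, audio_budget, video_budget):
    """Baseline: each modality keeps its own fixed number of tokens."""
    return {
        "audio": top_indices(audio_scores, audio_budget),
        "video": top_indices(video_scores, video_budget),
    }

def single_pool_select(audio_scores, video_scores, total_budget):
    """Shared pool: tokens from both modalities compete for one budget."""
    pool = [("audio", i, s) for i, s in enumerate(audio_scores)]
    pool += [("video", i, s) for i, s in enumerate(video_scores)]
    kept = sorted(pool, key=lambda item: item[2], reverse=True)[:total_budget]
    selected = {"audio": [], "video": []}
    for modality, index, _ in kept:
        selected[modality].append(index)
    return {modality: sorted(ix) for modality, ix in selected.items()}

# Hypothetical saliency scores for a clip whose audio is more informative.
audio_scores = [0.9, 0.8, 0.1, 0.15]
video_scores = [0.2, 0.7, 0.3, 0.05]

fixed = fixed_budget_select(audio_scores, video_scores, audio_budget=1, video_budget=2)
pooled = single_pool_select(audio_scores, video_scores, total_budget=3)
print(fixed, pooled)
```

With the same total of three tokens, the fixed split is forced to keep a weak video token, while the shared pool keeps both strong audio tokens and the single strong video token.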

[10] Zero-shot Adaptation of Stable Diffusion via Plug-in Hierarchical Degradation Representation for Real-World Super-Resolution

Yi-Cheng Liao, Shyang-En Weng, Yu-Syuan Xu, Chi-Wei Hsiao, Wei-Chen Chiu, Ching-Chun Huang

🧩 TL;DR

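This paper proposes HD-CLIP (Hierarchical Degradation CLIP), which decomposes a low-quality image into a semantic embedding and an ordinal degradation embedding to give diffusion models richer guidance, addressing unknown and complex degradations in real-world image super-resolution. As a plug-and-play module, it significantly improves detail fidelity and perceptual realism.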


📘 Detailed Summary

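Motivation: Real-world image super-resolution must cope with unknown and complex degradation factors. Existing methods often assume a known degradation severity and rely on CLIP text encoders that cannot capture numerical severity, which limits their generalization. Diffusion models need richer and more informative guidance to handle diverse and coupled degradations.

Method: The paper proposes HD-CLIP (Hierarchical Degradation CLIP), which decomposes a low-quality image into a semantic embedding and an ordinal degradation embedding; the latter captures ordered relationships and allows interpolation across unseen degradation levels. The embeddings are integrated into diffusion models through classifier-free guidance (CFG) and classifier-free projection guidance (CFPG), using semantic cues to guide generative restoration and degradation cues to suppress undesired hallucinations and artifacts.

Result: As a plug-and-play module, HD-CLIP integrates seamlessly into various super-resolution frameworks without training and significantly improves detail fidelity and perceptual realism across diverse real-world datasets. It improves the quality of generative restoration while reducing undesired artifacts.

Conclusion: HD-CLIP provides a more effective guidance mechanism for real-world image super-resolution through hierarchical representation learning, and its plug-and-play nature makes it broadly applicable. The work shows the importance of combining semantic and ordinal degradation information in complex degradation scenarios and opens a new direction for generative image restoration.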


📄 Abstract

Real-World Image Super-Resolution (Real-ISR) aims to recover high-quality images from low-quality inputs degraded by unknown and complex real-world factors. Real-world scenarios involve diverse and coupled degradations, making it necessary to provide diffusion models with richer and more informative guidance. However, existing methods often assume known degradation severity and rely on CLIP text encoders that cannot capture numerical severity, limiting their generalization ability. To address this, we propose HD-CLIP (Hierarchical Degradation CLIP), which decomposes a low-quality image into a semantic embedding and an ordinal degradation embedding that captures ordered relationships and allows interpolation across unseen levels. Furthermore, we integrate it into diffusion models via classifier-free guidance (CFG) and our proposed classifier-free projection guidance (CFPG). HD-CLIP leverages semantic cues to guide generative restoration while using degradation cues to suppress undesired hallucinations and artifacts. As a plug-and-play module, HD-CLIP can be seamlessly integrated into various super-resolution frameworks without training, significantly improving detail fidelity and perceptual realism across diverse real-world datasets.
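
For readers unfamiliar with classifier-free guidance, the standard CFG update combines an unconditional and a conditional noise prediction, extrapolating from the former toward the latter. The sketch below shows only that standard formula on plain lists; the abstract does not define CFPG, so it is not sketched, and the sample values and guidance scale are illustrative.

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Standard CFG: eps = eps_uncond + w * (eps_cond - eps_uncond).

    A scale of 0 returns the unconditional prediction, 1 returns the
    conditional one, and values above 1 push further along the condition.
    """
    return [
        uncond + guidance_scale * (cond - uncond)
        for uncond, cond in zip(eps_uncond, eps_cond)
    ]

# Illustrative noise predictions for a two-element latent.
eps_uncond = [0.0, 1.0]
eps_cond = [0.5, 2.0]

guided = classifier_free_guidance(eps_uncond, eps_cond, guidance_scale=2.0)
print(guided)
```

In a setting like HD-CLIP, the conditional prediction would be the one that sees the semantic and degradation embeddings, though how the paper wires them in is not stated in the abstract.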

[11] CoSPlan: Corrective Sequential Planning via Scene Graph Incremental Updates

Shresth Grover, Priyank Pathak, Akash Kumar, Vibhav Vineet, Yogesh S Rawat

🧩 TL;DR

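This paper proposes the Corrective Sequential Planning benchmark (CoSPlan) for evaluating large-scale vision-language models on error-prone visual sequential planning tasks, together with a training-free method, Scene Graph Incremental updates (SGI), that significantly improves model performance.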


📘 Detailed Summary

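Motivation: Large-scale vision-language models excel at complex reasoning but remain largely unexplored in visual sequential planning, that is, executing multi-step actions toward a goal. Practical sequential planning often involves non-optimal (erroneous) steps, which challenges models to detect and correct them.

Method: The paper proposes the Corrective Sequential Planning benchmark (CoSPlan) to evaluate vision-language models on error-prone visual sequential planning tasks in four domains: maze navigation, block rearrangement, image reconstruction, and object reorganization. To address the models' weak performance, it proposes a training-free method, Scene Graph Incremental updates (SGI), which introduces intermediate reasoning steps between the initial and goal states.

Result: Experiments show that even with advanced reasoning techniques such as Chain-of-Thought and scene graphs, state-of-the-art vision-language models (such as Intern-VLM and Qwen2) perform poorly on CoSPlan and fail to use contextual cues to reach the goal. SGI helps models reason about sequences, yielding an average performance gain of 5.2%, and generalizes to traditional planning tasks such as Plan-Bench and VQA.

Conclusion: The study exposes the limitations of current vision-language models in corrective sequential planning, and the proposed SGI method improves performance by introducing intermediate reasoning steps. The work makes corrective sequential planning more reliable and offers a methodology for applying vision-language models to complex planning tasks, with practical value and broad applicability.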


📄 Abstract

Large-scale Vision-Language Models (VLMs) exhibit impressive complex reasoning capabilities but remain largely unexplored in visual sequential planning, i.e., executing multi-step actions towards a goal. Additionally, practical sequential planning often involves non-optimal (erroneous) steps, challenging VLMs to detect and correct such steps. We propose the Corrective Sequential Planning Benchmark (CoSPlan) to evaluate VLMs in error-prone, vision-based sequential planning tasks across 4 domains: maze navigation, block rearrangement, image reconstruction, and object reorganization. CoSPlan assesses two key abilities: Error Detection (identifying non-optimal actions) and Step Completion (correcting and completing action sequences to reach the goal). Despite using state-of-the-art reasoning techniques such as Chain-of-Thought and Scene Graphs, VLMs (e.g., Intern-VLM and Qwen2) struggle on CoSPlan, failing to leverage contextual cues to reach goals. Addressing this, we propose a novel training-free method, Scene Graph Incremental updates (SGI), which introduces intermediate reasoning steps between the initial and goal states. SGI helps VLMs reason about sequences, yielding an average performance gain of 5.2%. In addition to enhancing reliability in corrective sequential planning, SGI generalizes to traditional planning tasks such as Plan-Bench and VQA.

[12] Topology-Agnostic Animal Motion Generation from Text Prompt

Keyi Chen, Mingze Sun, Zhenyu Liu, Zhangquan Chen, Ruqi Huang

🧩 TL;DR

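This paper presents OmniZoo, a large-scale animal motion dataset, and a unified generative framework, addressing the inability of existing motion generation methods to handle arbitrary skeletal topologies. Through a topology-aware skeleton embedding module, the framework enables text-driven cross-species motion generation and style transfer.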


📘 Detailed Summary

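Motivation: Most current motion generation methods rely on fixed skeletal templates and cannot generalize to skeletons with different or perturbed topologies. Existing methods face two core limitations: the lack of large-scale heterogeneous animal motion data, and the lack of a unified generative framework that jointly models arbitrary skeletal topologies and textual conditions.

Method: The paper introduces OmniZoo, a large-scale animal motion dataset spanning 140 species and 32,979 sequences, with multimodal annotations. Building on it, the authors propose a generalized autoregressive motion generation framework that produces text-driven motion for arbitrary skeletal topologies. Its core is a Topology-aware Skeleton Embedding Module that encodes the geometric and structural properties of any skeleton into a shared token space, enabling seamless fusion with textual semantics.

Result: Given a text prompt and a target skeleton, the method generates temporally coherent, physically plausible, and semantically aligned motions. It further enables cross-species motion style transfer, showing that the framework generalizes across heterogeneous skeletal topologies.

Conclusion: The study provides a unified framework for cross-species motion generation that removes the dependence on fixed skeletal templates. The OmniZoo dataset and the topology-aware embedding mechanism open a new way to handle arbitrary skeletal topologies, with broad potential in animation, robotics, and virtual environments.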


📄 Abstract

Motion generation is fundamental to computer animation and widely used across entertainment, robotics, and virtual environments. While recent methods achieve impressive results, most rely on fixed skeletal templates, which prevent them from generalizing to skeletons with different or perturbed topologies. We address the core limitation of current motion generation methods - the combined lack of large-scale heterogeneous animal motion data and unified generative frameworks capable of jointly modeling arbitrary skeletal topologies and textual conditions. To this end, we introduce OmniZoo, a large-scale animal motion dataset spanning 140 species and 32,979 sequences, enriched with multimodal annotations. Building on OmniZoo, we propose a generalized autoregressive motion generation framework capable of producing text-driven motions for arbitrary skeletal topologies. Central to our model is a Topology-aware Skeleton Embedding Module that encodes geometric and structural properties of any skeleton into a shared token space, enabling seamless fusion with textual semantics. Given a text prompt and a target skeleton, our method generates temporally coherent, physically plausible, and semantically aligned motions, and further enables cross-species motion style transfer.

[13] Tool-Augmented Spatiotemporal Reasoning for Streamlining Video Question Answering Task

Sunqi Fan, Jiashuo Cui, Meng-Hao Guo, Shuojin Yang

🧩 TL;DR

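This paper equips a multimodal large language model with a comprehensive and extensible video toolkit and designs a Spatiotemporal Reasoning Framework (STAR) to strengthen spatiotemporal reasoning on complex video question answering tasks, significantly improving performance on multiple benchmarks.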


📘 Detailed Summary

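Motivation: On complex, reasoning-intensive video question answering tasks, existing multimodal large language models struggle to model spatial relationships within video frames while also understanding the causal dynamics of temporal evolution, which limits their perception and reasoning in dynamic real-world scenarios.

Method: The work equips an MLLM with a comprehensive and extensible video toolkit to enhance its spatiotemporal reasoning and to keep the quantity and diversity of tools in balance. It also proposes the Spatiotemporal Reasoning Framework (STAR), which strategically schedules temporal and spatial tools to progressively localize the key area in the video, giving better control over the tool invocation sequence and avoiding toolchain shortcut issues.

Result: The STAR framework enhances GPT-4o with lightweight tools, achieving an 8.2% gain on VideoMME and a 4.6% gain on LongVideoBench, and markedly improving performance on complex video question answering tasks.

Conclusion: The proposed video toolkit and STAR framework are an important step toward autonomous, intelligent video analysis assistants. By strengthening the spatiotemporal reasoning of multimodal large language models, they provide an effective methodology and tool support for complex video understanding tasks.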


📄 Abstract

The Video Question Answering (VideoQA) task serves as a critical playground for evaluating whether foundation models can effectively perceive, understand, and reason about dynamic real-world scenarios. However, existing Multimodal Large Language Models (MLLMs) struggle to simultaneously model spatial relationships within video frames and understand the causal dynamics of temporal evolution on complex and reasoning-intensive VideoQA tasks. In this work, we equip MLLMs with a comprehensive and extensible Video Toolkit to enhance their spatiotemporal reasoning capabilities and ensure harmony between the quantity and diversity of tools. To better control the tool invocation sequence and avoid toolchain shortcut issues, we propose a Spatiotemporal Reasoning Framework (STAR) that strategically schedules temporal and spatial tools, thereby progressively localizing the key area in the video. Our STAR framework enhances GPT-4o using lightweight tools, achieving an 8.2% gain on VideoMME and 4.6% on LongVideoBench. We believe that our proposed Video Toolkit and STAR framework make an important step towards building autonomous and intelligent video analysis assistants. The code is publicly available at https://github.com/fansunqi/VideoTool.

[14] Visual Funnel: Resolving Contextual Blindness in Multimodal Large Language Models

Woojun Jung, Jaehoon Go, Mingyu Jeon, Sunjae Yoon, Junyeong Kim

🧩 TL;DR

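This paper proposes Visual Funnel, which resolves Contextual Blindness in multimodal large language models by constructing a hierarchical visual portfolio. The method is training-free and significantly improves the model's perception of fine-grained visual details.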


📘 Detailed Summary

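Motivation: Multimodal large language models reason well but often fail to perceive fine-grained visual details, which limits their use in precision-demanding tasks. Methods that crop salient regions offer a partial solution but introduce a critical limitation, "Contextual Blindness": a structural disconnect between high-fidelity details and the global context, even when all necessary visual information is present. This limitation stems not from a lack of information "quantity" but from a lack of "structural diversity" in the model's input.

Method: The paper proposes Visual Funnel, a training-free, two-step approach. It first performs Contextual Anchoring to identify the region of interest in a single forward pass. It then constructs an Entropy-Scaled Portfolio that preserves hierarchical context, from focal detail to broader surroundings, by dynamically determining crop sizes from attention entropy and refining crop centers. This structured multi-scale cropping addresses the structural disconnect between detail and context found in conventional approaches.

Result: Experiments show that Visual Funnel significantly outperforms naive single-crop and unstructured multi-crop baselines. The results further show that simply adding more unstructured crops brings limited or even detrimental effects, confirming that the hierarchical structure of the portfolio is key to resolving Contextual Blindness. The method performs strongly on multiple precision-demanding visual tasks.

Conclusion: The study finds that the root cause of Contextual Blindness in multimodal large language models is a lack of structural diversity in the input rather than a lack of information quantity. Visual Funnel addresses this with a hierarchical visual representation, offering a new approach to improving fine-grained visual perception. Because it requires no training, it is broadly applicable and provides a useful reference for future multimodal model design.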


📄 Abstract

Multimodal Large Language Models (MLLMs) demonstrate impressive reasoning capabilities, but often fail to perceive fine-grained visual details, limiting their applicability in precision-demanding tasks. While methods that crop salient regions of an image offer a partial solution, we identify a critical limitation they introduce: "Contextual Blindness". This failure occurs due to a structural disconnect between high-fidelity details (from the crop) and the broader global context (from the original image), even when all necessary visual information is present. We argue that this limitation stems not from a lack of information "Quantity", but from a lack of "Structural Diversity" in the model's input. To resolve this, we propose Visual Funnel, a training-free, two-step approach. Visual Funnel first performs Contextual Anchoring to identify the region of interest in a single forward pass. It then constructs an Entropy-Scaled Portfolio that preserves the hierarchical context - ranging from focal detail to broader surroundings - by dynamically determining crop sizes based on attention entropy and refining crop centers. Through extensive experiments, we demonstrate that Visual Funnel significantly outperforms naive single-crop and unstructured multi-crop baselines. Our results further validate that simply adding more unstructured crops provides limited or even detrimental benefits, confirming that the hierarchical structure of our portfolio is key to resolving Contextual Blindness.
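
One way to picture an entropy-scaled portfolio is to let the spread of the attention map set the size of the tightest crop and then widen step by step to the full image. The mapping below is an assumption made for illustration (focused attention gives a small first crop, diffuse attention a large one); the abstract does not state the actual rule, and `min_frac`, `levels`, and both function names are hypothetical.

```python
import math

def normalized_entropy(attention):
    """Shannon entropy of an attention map, scaled to the range [0, 1]."""
    total = sum(attention)
    probs = [a / total for a in attention if a > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(attention)) if len(attention) > 1 else 0.0

def crop_portfolio(attention, image_size, min_frac=0.2, levels=3):
    """Side lengths (in pixels) of nested crops, tightest first, ending at the full image."""
    spread = normalized_entropy(attention)
    base = min_frac + (1.0 - min_frac) * spread  # assumed: diffuse attention -> larger crop
    steps = max(levels - 1, 1)
    return [
        round((base + (1.0 - base) * level / steps) * image_size)
        for level in range(levels)
    ]

# Toy attention maps over four patches.
focused = [1.0, 0.0, 0.0, 0.0]
diffuse = [0.25, 0.25, 0.25, 0.25]

focused_sizes = crop_portfolio(focused, image_size=1000)
diffuse_sizes = crop_portfolio(diffuse, image_size=1000)
print(focused_sizes, diffuse_sizes)
```

Under this assumed rule, sharply focused attention yields a funnel of progressively wider crops, while fully diffuse attention collapses the portfolio to the whole image.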

[15] Point to Span: Zero-Shot Moment Retrieval for Navigating Unseen Hour-Long Videos

Mingyu Jeon, Jisoo Yang, Sungjin Han, Jinkwon Hwang, Sunjae Yoon, Jonghee Kim, Junyeoung Kim

🧩 TL;DR

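This paper proposes P2S (Point-to-Span), the first training-free zero-shot framework for long video moment retrieval. Using adaptive span generation and query decomposition, it performs efficient temporal grounding in hour-long videos and significantly outperforms supervised methods.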


📘 Detailed Summary

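Motivation: Zero-shot long video moment retrieval faces two problems. Conventional supervised methods scale poorly and generalize weakly, while existing zero-shot methods suffer from both candidate explosion in the search phase and reliance on costly vision-language model verification in the refine phase, resulting in heavy computational overhead and low efficiency.

Method: P2S has two key innovations. An Adaptive Span Generator prevents candidate explosion in the search phase by dynamically adjusting candidate spans, and Query Decomposition refines candidates without relying on costly vision-language model verification by breaking complex queries into manageable sub-tasks.

Result: On the MAD dataset, P2S outperforms supervised state-of-the-art methods by 3.7% on R5@0.1, becoming the first zero-shot framework capable of temporal grounding in hour-long videos, while significantly reducing computational overhead and improving retrieval efficiency.

Conclusion: The study demonstrates that training-free methods are feasible for temporal grounding in long videos and offers an efficient new paradigm for video understanding. Its adaptive mechanism and query decomposition strategy suggest ways to ease the computational bottleneck in large-scale video processing.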


📄 Abstract

Zero-shot Long Video Moment Retrieval (ZLVMR) is the task of identifying temporal segments in hour-long videos using a natural language query without task-specific training. The core technical challenge of LVMR stems from the computational infeasibility of processing entire lengthy videos in a single pass. This limitation has established a 'Search-then-Refine' approach, where candidates are rapidly narrowed down and only those portions are analyzed, as the dominant paradigm for LVMR. However, existing approaches to this paradigm face severe limitations. Conventional supervised learning suffers from limited scalability and poor generalization, despite substantial resource consumption. Yet existing zero-shot methods also fail, facing a dual challenge: (1) their heuristic strategies cause a 'search' phase candidate explosion, and (2) the 'refine' phase, which is vulnerable to semantic discrepancy, requires high-cost VLMs for verification, incurring significant computational overhead. We propose Point-to-Span (P2S), a novel training-free framework that overcomes the inefficient 'search' and costly 'refine' phases with two key innovations: an 'Adaptive Span Generator' to prevent the search phase candidate explosion, and 'Query Decomposition' to refine candidates without relying on high-cost VLM verification. To our knowledge, P2S is the first zero-shot framework capable of temporal grounding in hour-long videos, outperforming supervised state-of-the-art methods by a significant margin (e.g., +3.7% on R5@0.1 on MAD).
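
The reported metric, R5@0.1, is read here as recall at the top 5 predictions with a temporal IoU threshold of 0.1, which is the usual convention in moment retrieval; the abstract does not spell it out, so treat that reading as an assumption. A minimal sketch of the computation, with made-up segments in seconds:

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) segments."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0

def recall_at_k(ranked_preds_per_query, ground_truths, k=5, iou_threshold=0.1):
    """Share of queries with at least one top-k prediction above the IoU threshold."""
    hits = 0
    for preds, gt in zip(ranked_preds_per_query, ground_truths):
        if any(temporal_iou(pred, gt) >= iou_threshold for pred in preds[:k]):
            hits += 1
    return hits / len(ground_truths)

# Two hypothetical queries with ranked predicted spans and one ground-truth span each.
ranked_preds = [
    [(100.0, 110.0), (0.0, 10.0)],  # second prediction overlaps the ground truth
    [(0.0, 10.0), (20.0, 30.0)],    # no prediction overlaps the ground truth
]
ground_truths = [(5.0, 15.0), (50.0, 60.0)]

r5 = recall_at_k(ranked_preds, ground_truths, k=5, iou_threshold=0.1)
r1 = recall_at_k(ranked_preds, ground_truths, k=1, iou_threshold=0.1)
print(r5, r1)
```

The loose 0.1 threshold rewards landing anywhere near the right moment, which suits hour-long videos where coarse localization is already difficult.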

[16] RaLiFlow: Scene Flow Estimation with 4D Radar and LiDAR Point Clouds

Jingyun Fu, Zhiyu Xiang, Na Zhao

🧩 TL;DR

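This study proposes RaLiFlow, the first framework for joint scene flow estimation with 4D millimeter-wave radar and LiDAR. It builds the first radar-LiDAR scene flow dataset and introduces a dynamic-aware bidirectional cross-modal fusion module, substantially improving multimodal scene flow estimation.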


📘 Detailed Summary

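Motivation: Current multimodal fusion methods mainly combine images with LiDAR point clouds, while the fusion of 4D millimeter-wave radar and LiDAR remains unexplored. Radar is cheaper, more robust to weather, and can measure point-wise velocity, but its data is noisy, low-resolution, and sparse, and there is currently no radar-LiDAR dataset built specifically for scene flow estimation.

Method: The study builds the first radar-LiDAR scene flow dataset from a public real-world automotive dataset and proposes an effective preprocessing strategy for radar denoising and scene flow label generation. The core framework, RaLiFlow, uses a Dynamic-aware Bidirectional Cross-modal Fusion (DBCF) module that integrates dynamic cues from radar into a local cross-attention mechanism to propagate contextual information across modalities. A carefully designed set of loss functions mitigates the effect of unreliable radar data and strengthens instance-level consistency.

Result: Extensive experiments on the repurposed scene flow dataset show that the method significantly outperforms existing LiDAR-based and radar-based single-modal methods. The DBCF module and the loss design improve scene flow accuracy in dynamic foreground regions and allow the two modalities to complement each other effectively.

Conclusion: The study demonstrates the potential of fusing 4D millimeter-wave radar with LiDAR for scene flow estimation and offers a new technical path for multimodal perception systems. The dataset and framework lay a foundation for follow-up research and show that a carefully designed fusion strategy can overcome the inherent weaknesses of radar data while exploiting its dynamic information.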


📄 Abstract

Recent multimodal fusion methods, integrating images with LiDAR point clouds, have shown promise in scene flow estimation. However, the fusion of 4D millimeter wave radar and LiDAR remains unexplored. Unlike LiDAR, radar is cheaper, more robust in various weather conditions and can detect point-wise velocity, making it a valuable complement to LiDAR. However, radar inputs pose challenges due to noise, low resolution, and sparsity. Moreover, there is currently no dataset that combines LiDAR and radar data specifically for scene flow estimation. To address this gap, we construct a Radar-LiDAR scene flow dataset based on a public real-world automotive dataset. We propose an effective preprocessing strategy for radar denoising and scene flow label generation, deriving more reliable flow ground truth for radar points out of the object boundaries. Additionally, we introduce RaLiFlow, the first joint scene flow learning framework for 4D radar and LiDAR, which achieves effective radar-LiDAR fusion through a novel Dynamic-aware Bidirectional Cross-modal Fusion (DBCF) module and a carefully designed set of loss functions. The DBCF module integrates dynamic cues from radar into the local cross-attention mechanism, enabling the propagation of contextual information across modalities. Meanwhile, the proposed loss functions mitigate the adverse effects of unreliable radar data during training and enhance the instance-level consistency in scene flow predictions from both modalities, particularly for dynamic foreground areas. Extensive experiments on the repurposed scene flow dataset demonstrate that our method outperforms existing LiDAR-based and radar-based single-modal methods by a significant margin.

[17] Self-Supervised Contrastive Embedding Adaptation for Endoscopic Image Matching

Alberto Rota, Elena De Momi

🧩 TL;DR

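This study proposes a new deep learning pipeline for feature matching in endoscopic images. A self-supervised optimization framework enhances a DINOv2 backbone, achieving higher matching precision and lower epipolar error than existing methods in surgical scenes.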


📘 Detailed Summary

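Motivation: In minimally invasive surgery, precise pixel-level correspondences between endoscopic images are essential for 3D reconstruction, camera tracking, and scene understanding. The surgical domain, however, poses distinct challenges such as weak perspective cues, non-Lambertian tissue reflections, and complex deformable anatomy. These degrade the performance of conventional computer vision techniques and of deep learning models trained on natural scenes, so targeted domain adaptation is needed.

Method: The study proposes a new deep learning pipeline with a self-supervised optimization framework based on novel-view synthesis. The framework generates ground-truth inlier correspondences that are used for triplet mining in contrastive learning. The DINOv2 backbone is augmented with an additional Transformer layer, optimized specifically to produce embeddings that can be matched directly through cosine similarity thresholding.

Result: Experiments on the SCARED dataset show that the pipeline surpasses existing state-of-the-art methods in both matching precision and epipolar error, establishing more accurate feature correspondences and providing a reliable basis for higher-level computer vision applications in surgical endoscopy.

Conclusion: Through self-supervised learning and domain-specific optimization, the framework markedly improves the robustness and accuracy of endoscopic feature matching. It contributes to applications such as surgical navigation, augmented reality integration, and 3D reconstruction, and advances computer vision for surgical endoscopy.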


📄 Abstract

Accurate spatial understanding is essential for image-guided surgery, augmented reality integration and context awareness. In minimally invasive procedures, where visual input is the sole intraoperative modality, establishing precise pixel-level correspondences between endoscopic frames is critical for 3D reconstruction, camera tracking, and scene interpretation. However, the surgical domain presents distinct challenges: weak perspective cues, non-Lambertian tissue reflections, and complex, deformable anatomy degrade the performance of conventional computer vision techniques. While Deep Learning models have shown strong performance in natural scenes, their features are not inherently suited for fine-grained matching in surgical images and require targeted adaptation to meet the demands of this domain. This research presents a novel Deep Learning pipeline for establishing feature correspondences in endoscopic image pairs, alongside a self-supervised optimization framework for model training. The proposed methodology leverages a novel-view synthesis pipeline to generate ground-truth inlier correspondences, subsequently utilized for mining triplets within a contrastive learning paradigm. Through this self-supervised approach, we augment the DINOv2 backbone with an additional Transformer layer, specifically optimized to produce embeddings that facilitate direct matching through cosine similarity thresholding. Experimental evaluation demonstrates that our pipeline surpasses state-of-the-art methodologies on the SCARED dataset, with improved matching precision and lower epipolar error compared to related work. The proposed framework constitutes a valuable contribution toward enabling more accurate high-level computer vision applications in surgical endoscopy.
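
As an illustration of matching by cosine similarity thresholding, here is a minimal sketch in plain Python. The mutual-nearest-neighbour check, the function names, and the threshold value of 0.9 are assumptions made for this example; the abstract does not specify how the authors select matches beyond thresholding.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def match_by_cosine(desc_a, desc_b, threshold=0.9):
    """Return (i, j, score) for mutual nearest neighbours whose similarity is >= threshold.

    desc_a, desc_b: lists of descriptor vectors from the two images (hypothetical inputs).
    The mutual check is an assumption of this sketch, not a detail taken from the paper.
    """
    if not desc_a or not desc_b:
        return []
    sim = [[cosine(a, b) for b in desc_b] for a in desc_a]
    matches = []
    for i, row in enumerate(sim):
        j = max(range(len(row)), key=row.__getitem__)            # best column for row i
        best_i = max(range(len(sim)), key=lambda k: sim[k][j])   # best row for that column
        if best_i == i and row[j] >= threshold:
            matches.append((i, j, row[j]))
    return matches
```

Raising the threshold trades recall for precision: fewer correspondences survive, but those that do are more likely to be inliers.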

[18] Towards Fine-Grained Recognition with Large Visual Language Models: Benchmark and Optimization Strategies

Cong Pang, Hongtao Yu, Zixuan Chen, Lewei Lu, Xin Lou

🧩 TL;DR

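This paper introduces the FROW benchmark for evaluating large vision-language models on fine-grained recognition, and proposes an optimization strategy covering both data construction and the training process that markedly improves fine-grained recognition performance.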


📘 Detailed Summary

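Motivation: Existing benchmarks for large vision-language models focus mainly on reasoning tasks and often neglect fine-grained recognition, which is crucial in practical application scenarios. Closing this gap requires a benchmark and optimization methods dedicated to fine-grained recognition.

Method: The paper introduces the FROW benchmark and designs an optimization strategy from two perspectives: data construction and the training process. Data construction covers mosaic data, which combines multiple short-answer responses, and open-world data, generated from real-world questions and answers using GPT-4o. In the training process, fine-grained data is incorporated into the pre-training phase to improve model performance.

Result: Experiments show that mosaic data improves category recognition accuracy by 1%, while open-world data raises FROW benchmark accuracy by 10%-20% and content accuracy by 6%-12%. Incorporating fine-grained data into the pre-training phase improves category recognition accuracy by up to 10%.

Conclusion: The study underlines the importance of fine-grained recognition for practical use of large vision-language models. The FROW benchmark and the proposed optimization strategy provide an effective framework for evaluating and improving fine-grained visual understanding, and advance vision-language models on more detailed visual tasks.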


📄 Abstract

Large Vision Language Models (LVLMs) have made remarkable progress, enabling sophisticated vision-language interaction and dialogue applications. However, existing benchmarks primarily focus on reasoning tasks, often neglecting fine-grained recognition, which is crucial for practical application scenarios. To address this gap, we introduce the Fine-grained Recognition Open World (FROW) benchmark, designed for detailed evaluation of LVLMs with GPT-4o. On the basis of that, we propose a novel optimization strategy from two perspectives: data construction and training process, to improve the performance of LVLMs. Our dataset includes mosaic data, which combines multiple short-answer responses, and open-world data, generated from real-world questions and answers using GPT-4o, creating a comprehensive framework for evaluating fine-grained recognition in LVLMs. Experiments show that mosaic data improves category recognition accuracy by 1% and open-world data boosts FROW benchmark accuracy by 10%-20% and content accuracy by 6%-12%. Meanwhile, incorporating fine-grained data into the pre-training phase can improve the model's category recognition accuracy by up to 10%. The benchmark will be available at https://github.com/pc-inno/FROW.

[19] MultiHateLoc: Towards Temporal Localisation of Multimodal Hate Content in Online Videos

Qiyue Sun, Tailin Chen, Yinghui Zhang, Yuchen Zhang, Jiangbei Yue, Jianbo Jiao, Zeyu Fu

🧩 TL;DR

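This study proposes MultiHateLoc, the first framework for weakly-supervised temporal localisation of multimodal hate speech. Using modality-aware encoders, dynamic cross-modal fusion, and a modality-aware MIL objective, it produces fine-grained frame-level predictions from video-level labels alone.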


📘 Detailed Summary

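Motivation: With the rapid growth of video content on platforms such as TikTok and YouTube, multimodal hate speech is spreading more widely, with harmful cues appearing subtly and asynchronously across visual, acoustic, and textual streams. Existing research focuses mainly on video-level classification, while the practically crucial task of temporal localisation (identifying when hateful segments occur) remains largely unaddressed. This is especially true under weak supervision, where only video-level labels are available and static fusion or classification-based architectures struggle to capture cross-modal and temporal dynamics.

Method: The MultiHateLoc framework has three core components. First, modality-aware temporal encoders model heterogeneous sequential patterns and include a tailored text-based preprocessing module for feature enhancement. Second, a dynamic cross-modal fusion mechanism adaptively emphasises the most informative modality at each moment, and a cross-modal contrastive alignment strategy enhances multimodal feature consistency. Third, a modality-aware multiple instance learning (MIL) objective identifies discriminative segments under video-level supervision, yielding fine-grained, interpretable frame-level predictions from coarse labels only.

Result: Experiments on the HateMM and MultiHateClip datasets show that MultiHateLoc achieves state-of-the-art performance on the temporal localisation task. Although it relies solely on coarse labels, the method produces fine-grained frame-level predictions, supporting its effectiveness for weakly-supervised multimodal hate speech detection.

Conclusion: The study is the first to systematically address weakly-supervised temporal localisation of multimodal hate speech. Its modality-aware design and dynamic fusion mechanism capture cross-modal and temporal dynamics effectively. The work offers a practical technique for content moderation on social media platforms and opens a new research direction in multimodal temporal analysis, particularly fine-grained content understanding under weak supervision.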


📄 Abstract

The rapid growth of video content on platforms such as TikTok and YouTube has intensified the spread of multimodal hate speech, where harmful cues emerge subtly and asynchronously across visual, acoustic, and textual streams. Existing research primarily focuses on video-level classification, leaving the practically crucial task of temporal localisation, identifying when hateful segments occur, largely unaddressed. This challenge is even more noticeable under weak supervision, where only video-level labels are available, and static fusion or classification-based architectures struggle to capture cross-modal and temporal dynamics. To address these challenges, we propose MultiHateLoc, the first framework designed for weakly-supervised multimodal hate localisation. MultiHateLoc incorporates (1) modality-aware temporal encoders to model heterogeneous sequential patterns, including a tailored text-based preprocessing module for feature enhancement; (2) dynamic cross-modal fusion to adaptively emphasise the most informative modality at each moment and a cross-modal contrastive alignment strategy to enhance multimodal feature consistency; (3) a modality-aware MIL objective to identify discriminative segments under video-level supervision. Despite relying solely on coarse labels, MultiHateLoc produces fine-grained, interpretable frame-level predictions. Experiments on HateMM and MultiHateClip show that our method achieves state-of-the-art performance in the localisation task.
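
To make the idea of learning frame-level scores from video-level labels concrete, the following is a minimal sketch of generic top-k MIL pooling in plain Python. It is not the paper's modality-aware MIL objective, whose details are not given in the abstract; the function names and the choice of k are assumptions for illustration.

```python
import math

def topk_mil_score(frame_scores, k=3):
    """Video-level score as the mean of the k highest frame-level scores.

    Generic top-k MIL pooling; a stand-in for illustration, not the paper's objective.
    """
    if not frame_scores:
        raise ValueError("need at least one frame score")
    k = max(1, min(k, len(frame_scores)))
    top = sorted(frame_scores, reverse=True)[:k]
    return sum(top) / k

def binary_cross_entropy(p, label, eps=1e-7):
    """BCE between a predicted probability p and a 0/1 video-level label."""
    p = min(max(p, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))
```

During training only the video-level loss is computed, so gradients reach just the frames selected by the pooling step. At inference the per-frame scores themselves can be thresholded to localise segments.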

[20] TransLocNet: Cross-Modal Attention for Aerial-Ground Vehicle Localization with Contrastive Learning

Phu Pham, Damon Conover, Aniket Bera

🧩 TL;DR

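This paper proposes TransLocNet, an aerial-ground localization framework based on cross-modal attention. By fusing LiDAR geometry with the semantic context of aerial imagery, it substantially improves cross-view, cross-modal localization accuracy.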


📘 Detailed Summary

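Motivation: Aerial-ground localization is difficult mainly because of the large viewpoint difference and modality gap between ground-level LiDAR scans and aerial imagery. Existing methods struggle to align these two heterogeneous data sources.

Method: TransLocNet uses a cross-modal attention framework. LiDAR scans are projected into a bird's-eye-view representation and aligned with aerial features through bidirectional attention. A likelihood map decoder then outputs spatial probability distributions over position and orientation, and a contrastive learning module strengthens alignment of the cross-modal embedding space.

Result: Experiments on the CARLA and KITTI datasets show that TransLocNet significantly outperforms state-of-the-art baselines, reducing localization error by up to 63% and achieving sub-meter position accuracy and sub-degree orientation accuracy, with robust behaviour in both synthetic and real-world scenes.

Conclusion: The study demonstrates the effectiveness of cross-modal attention for fusing LiDAR geometry with aerial semantic context. It provides a general and robust solution for aerial-ground localization and generalizes well across synthetic and real-world settings.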


📄 Abstract

Aerial-ground localization is difficult due to large viewpoint and modality gaps between ground-level LiDAR and overhead imagery. We propose TransLocNet, a cross-modal attention framework that fuses LiDAR geometry with aerial semantic context. LiDAR scans are projected into a bird's-eye-view representation and aligned with aerial features through bidirectional attention, followed by a likelihood map decoder that outputs spatial probability distributions over position and orientation. A contrastive learning module enforces a shared embedding space to improve cross-modal alignment. Experiments on CARLA and KITTI show that TransLocNet outperforms state-of-the-art baselines, reducing localization error by up to 63% and achieving sub-meter, sub-degree accuracy. These results demonstrate that TransLocNet provides robust and generalizable aerial-ground localization in both synthetic and real-world settings.
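
A likelihood map of the kind described above can be read out as a pose estimate in a few lines. The sketch below, in plain Python, normalises a grid of decoder logits with a softmax and takes the probability-weighted mean cell. The grid layout and function names are assumptions for illustration, and orientation is omitted for brevity; the abstract does not describe the authors' actual read-out.

```python
import math

def softmax2d(logits):
    """Normalise a 2D grid of logits into a probability map that sums to 1."""
    peak = max(v for row in logits for v in row)  # subtract the max for numerical stability
    exps = [[math.exp(v - peak) for v in row] for row in logits]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]

def expected_cell(prob):
    """Probability-weighted mean (row, col) of the map, a soft alternative to argmax."""
    r = sum(i * p for i, row in enumerate(prob) for p in row)
    c = sum(j * p for row in prob for j, p in enumerate(row))
    return r, c
```

Multiplying the expected cell by the map's ground resolution (metres per cell) would convert it to a metric position; that resolution is dataset-specific and not stated here.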

Yuchen Feng, Zhenyu Zhang, Naibin Gu, Yilong Chen, Peng Fu, Zheng Lin, Shuohuan Wang, Yu Sun, Hua Wu, Weiping Wang, Haifeng Wang

🧩 TL;DR

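This paper proposes Blink, a dynamic visual token resolution framework inspired by human blinking. It selectively allocates computation to salient regions to strengthen the visual perception of multimodal large language models, achieving efficient multi-scale visual understanding within a single forward pass.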


📘 Detailed Summary

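Motivation: Multimodal large language models have made notable progress on vision-language tasks, yet their visual perception remains limited. Humans, by contrast, perceive complex scenes efficiently through a sequential "blink-like" process of dynamically scanning and focusing on salient regions. The paper investigates whether MLLMs exhibit similar behavior and builds a more efficient visual perception mechanism on that basis.

Method: The Blink framework has two core modules: saliency-guided scanning and dynamic token resolution. It first estimates the saliency of visual tokens in each layer from the attention map, then extends important tokens through a plug-and-play token super-resolution module, and drops the extended tokens in the next layer once they lose focus. This dynamic mechanism balances broad exploration with fine-grained focus, enhancing visual perception adaptively and efficiently.

Result: Extensive experiments validate Blink and show that it enhances the visual perception and multimodal understanding of multimodal large language models. A pilot analysis finds that MLLMs naturally attend to different visual regions across layers and that selectively allocating more computation to salient tokens improves visual perception.

Conclusion: The study shows that a dynamic compute-allocation strategy inspired by human visual perception is effective in multimodal models, and suggests a route to vision-language models that are more efficient and closer to human perception. Because Blink is plug-and-play, it is easy to integrate into existing MLLMs, balancing computational efficiency and perceptual accuracy.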


📄 Abstract

Multimodal large language models (MLLMs) have achieved remarkable progress on various vision-language tasks, yet their visual perception remains limited. Humans, in comparison, perceive complex scenes efficiently by dynamically scanning and focusing on salient regions in a sequential "blink-like" process. Motivated by this strategy, we first investigate whether MLLMs exhibit similar behavior. Our pilot analysis reveals that MLLMs naturally attend to different visual regions across layers and that selectively allocating more computation to salient tokens can enhance visual perception. Building on this insight, we propose Blink, a dynamic visual token resolution framework that emulates the human-inspired process within a single forward pass. Specifically, Blink includes two modules: saliency-guided scanning and dynamic token resolution. It first estimates the saliency of visual tokens in each layer based on the attention map, and extends important tokens through a plug-and-play token super-resolution (TokenSR) module. In the next layer, it drops the extended tokens when they lose focus. This dynamic mechanism balances broad exploration and fine-grained focus, thereby enhancing visual perception adaptively and efficiently. Extensive experiments validate Blink, demonstrating its effectiveness in enhancing visual perception and multimodal understanding.

[22] Beyond Pixels: A Training-Free, Text-to-Text Framework for Remote Sensing Image Retrieval

J. Xiao, Y. Guo, X. Zi, K. Thiyagarajan, C. Moreira, M. Prasad

🧩 TL;DR

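This paper proposes TRSLLaVA, a fully training-free method for semantic retrieval of remote sensing images. By introducing the RSRT dataset and recasting cross-modal retrieval as a text-to-text matching problem, it achieves performance competitive with supervised methods in the zero-shot setting.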


📘 Detailed Summary

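Motivation: Semantic retrieval of remote sensing images is fundamentally challenged by the "semantic gap", the discrepancy between low-level visual features and high-level human concepts. Existing methods based on vision-language models usually require costly domain-specific training, and there is no benchmark dataset for evaluating the practical utility of VLM-generated text in zero-shot retrieval.

Method: The paper proposes TRSLLaVA, a fully training-free, text-only retrieval reference. It introduces the RSRT dataset, which provides multiple structured captions per image, and recasts cross-modal retrieval as a text-to-text matching problem. Rich text descriptions serve as queries and are matched against VLM-generated image captions in a unified text embedding space, so no model training or fine-tuning is involved.

Result: Experiments on the RSITMD and RSICD benchmarks show that the training-free method reaches a mean recall of 42.62% on RSITMD, nearly double the 23.86% of the standard zero-shot CLIP baseline, and surpasses several top supervised models. In the zero-shot setting the method is competitive with state-of-the-art supervised models.

Conclusion: The study indicates that high-quality semantic representation through structured text offers a powerful and cost-effective paradigm for remote sensing image retrieval. It supports the effectiveness of training-free text matching for cross-modal retrieval and opens a new route to zero-shot semantic understanding in remote sensing.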


📄 Abstract

Semantic retrieval of remote sensing (RS) images is a critical task fundamentally challenged by the "semantic gap", the discrepancy between a model's low-level visual features and high-level human concepts. While large Vision-Language Models (VLMs) offer a promising path to bridge this gap, existing methods often rely on costly, domain-specific training, and there is a lack of benchmarks to evaluate the practical utility of VLM-generated text in a zero-shot retrieval context. To address this research gap, we introduce the Remote Sensing Rich Text (RSRT) dataset, a new benchmark featuring multiple structured captions per image. Based on this dataset, we propose a fully training-free, text-only retrieval reference called TRSLLaVA. Our methodology reformulates cross-modal retrieval as a text-to-text (T2T) matching problem, leveraging rich text descriptions as queries against a database of VLM-generated captions within a unified textual embedding space. This approach completely bypasses model training or fine-tuning. Experiments on the RSITMD and RSICD benchmarks show our training-free method is highly competitive with state-of-the-art supervised models. For instance, on RSITMD, our method achieves a mean Recall of 42.62%, nearly doubling the 23.86% of the standard zero-shot CLIP baseline and surpassing several top supervised models. This validates that high-quality semantic representation through structured text provides a powerful and cost-effective paradigm for remote sensing image retrieval.
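
For readers unfamiliar with the metric, the sketch below shows one common way to compute Recall@K and an averaged "mean recall" in plain Python. The set of K values, the function names, and the input layout are assumptions; the abstract does not state which K values or retrieval directions enter the reported mean Recall.

```python
def recall_at_k(rankings, ground_truth, k):
    """Percentage of queries with at least one relevant item in the top k.

    rankings: one ranked list of item ids per query.
    ground_truth: one set of relevant item ids per query (same order as rankings).
    """
    if not rankings or len(rankings) != len(ground_truth):
        raise ValueError("rankings and ground_truth must be non-empty and aligned")
    hits = sum(1 for ranked, relevant in zip(rankings, ground_truth)
               if relevant & set(ranked[:k]))
    return 100.0 * hits / len(rankings)

def mean_recall(rankings, ground_truth, ks=(1, 5, 10)):
    """Average of Recall@K over the chosen K values (the K set is an assumption)."""
    return sum(recall_at_k(rankings, ground_truth, k) for k in ks) / len(ks)
```

In a text-to-text setting, each ranked list would come from sorting caption embeddings by similarity to the query embedding.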

[23] Grounding Everything in Tokens for Multimodal Large Language Models

Xiangxuan Ren, Zhongdao Wang, Liping Hou, Pin Tang, Guoqing Wang, Chao Ma

🧩 TL;DR

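This paper proposes GETok, a spatial representation method for multimodal large language models. By introducing learnable grid tokens and offset tokens, it embeds spatial relationships directly into tokens and substantially improves object localization in 2D space without modifying the autoregressive Transformer architecture.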


📘 Detailed Summary

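Motivation: Multimodal large language models have made significant progress in visual understanding and reasoning, but their autoregressive Transformer architecture requires input images to be tokenized, which limits their ability to localize objects accurately in 2D image space. The core research question is how to improve sequential language tokens so that objects can be grounded better in 2D space.

Method: The paper proposes GETok, a spatial representation method that integrates a specialized vocabulary of learnable tokens into MLLMs. GETok first uses grid tokens to partition the image plane into structured spatial anchors, then uses offset tokens for precise, iterative refinement of localization predictions. Because spatial relationships are embedded directly into tokens, the method enables native 2D spatial reasoning without modifying the autoregressive architecture.

Result: Extensive experiments show that GETok outperforms state-of-the-art methods on various referring tasks in both supervised fine-tuning and reinforcement learning settings, and substantially improves the object localization accuracy of multimodal large language models in 2D space.

Conclusion: By encoding spatial relationships directly into language tokens, GETok gives multimodal large language models an effective 2D spatial representation. The method keeps the existing architecture intact and opens a new research direction for spatial reasoning in vision-language models.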


📄 Abstract

Multimodal large language models (MLLMs) have made significant advancements in vision understanding and reasoning. However, the autoregressive Transformer architecture used by MLLMs requires tokenization of input images, which limits their ability to accurately ground objects within the 2D image space. This raises an important question: how can sequential language tokens be improved to better ground objects in 2D space for MLLMs? To address this, we present a spatial representation method for grounding objects, namely GETok, that integrates a specialized vocabulary of learnable tokens into MLLMs. GETok first uses grid tokens to partition the image plane into structured spatial anchors, and then exploits offset tokens to enable precise and iterative refinement of localization predictions. By embedding spatial relationships directly into tokens, GETok significantly advances MLLMs in native 2D space reasoning without modifying the autoregressive architecture. Extensive experiments demonstrate that GETok achieves superior performance over the state-of-the-art methods across various referring tasks in both supervised fine-tuning and reinforcement learning settings.
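
The coarse-anchor-plus-refinement idea can be illustrated with a small, purely hypothetical encoding in plain Python: a point is mapped to a grid cell (the anchor) and a quantised offset inside that cell. The grid size, the number of offset bins, and the function names are assumptions for this sketch; GETok's actual tokens are learnable and its vocabulary layout is not given in the abstract.

```python
def encode_point(x, y, grid=8, offset_bins=4):
    """Map normalised (x, y) in [0, 1) to (cell index, (offset_x_bin, offset_y_bin)).

    Hypothetical discretisation for illustration only; not GETok's learned vocabulary.
    """
    if not (0.0 <= x < 1.0 and 0.0 <= y < 1.0):
        raise ValueError("coordinates must lie in [0, 1)")
    col, row = int(x * grid), int(y * grid)
    frac_x, frac_y = x * grid - col, y * grid - row   # position inside the cell, in [0, 1)
    return row * grid + col, (int(frac_x * offset_bins), int(frac_y * offset_bins))

def decode_point(cell, offset, grid=8, offset_bins=4):
    """Invert encode_point, returning the centre of the selected offset bin."""
    row, col = divmod(cell, grid)
    off_x, off_y = offset
    x = (col + (off_x + 0.5) / offset_bins) / grid
    y = (row + (off_y + 0.5) / offset_bins) / grid
    return x, y
```

With these settings the round-trip error per axis is at most 1 / (2 * grid * offset_bins), so the anchor alone gives a coarse location and each additional level of offset narrows it further.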

[24] Data-Efficient American Sign Language Recognition via Few-Shot Prototypical Networks

Meher Md Saad

🧩 TL;DR

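This paper proposes a few-shot prototypical network framework built on a skeleton-based encoder to address data scarcity and long-tail distribution in isolated sign language recognition. Using a metric learning paradigm, it reaches 43.75% Top-1 accuracy on the WLASL dataset, clearly outperforming conventional classification.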


📘 Detailed Summary

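Motivation: Isolated sign language recognition is fundamentally constrained by data scarcity and the long-tail distribution of sign vocabulary, since collecting enough examples for thousands of unique signs is prohibitively expensive. Conventional classification approaches perform poorly under these conditions, overfitting to frequent classes and failing to generalize to rare ones, so a different learning paradigm is needed.

Method: The paper proposes a few-shot prototypical network framework adapted to a skeleton-based encoder. Episodic training learns a semantic metric space in which signs are classified by their proximity to dynamic class prototypes. The method combines a Spatiotemporal Graph Convolutional Network with a new Multi-Scale Temporal Aggregation module to capture both rapid and fluid motion dynamics.

Result: On the WLASL dataset, the model achieves 43.75% Top-1 and 77.10% Top-5 accuracy on the test set. This is more than 13% higher than a standard classification baseline sharing the same backbone architecture, supporting the effectiveness of prototypical training under data scarcity. Without fine-tuning, the model also reaches nearly 30% accuracy on the unseen SignASL dataset, showing strong zero-shot generalization.

Conclusion: The study demonstrates the advantage of the metric learning paradigm for data-scarce sign language recognition, with the prototypical network framework coping effectively with long-tail distributions. The model's zero-shot generalization offers a scalable path to recognizing large sign vocabularies with limited data and supports the feasibility of deploying sign language recognition systems in practice.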


📄 Abstract

Isolated Sign Language Recognition (ISLR) is critical for bridging the communication gap between the Deaf and Hard-of-Hearing (DHH) community and the hearing world. However, robust ISLR is fundamentally constrained by data scarcity and the long-tail distribution of sign vocabulary, where gathering sufficient examples for thousands of unique signs is prohibitively expensive. Standard classification approaches struggle under these conditions, often overfitting to frequent classes while failing to generalize to rare ones. To address this bottleneck, we propose a Few-Shot Prototypical Network framework adapted for a skeleton-based encoder. Unlike traditional classifiers that learn fixed decision boundaries, our approach utilizes episodic training to learn a semantic metric space where signs are classified based on their proximity to dynamic class prototypes. We integrate a Spatiotemporal Graph Convolutional Network (ST-GCN) with a novel Multi-Scale Temporal Aggregation (MSTA) module to capture both rapid and fluid motion dynamics. Experimental results on the WLASL dataset demonstrate the superiority of this metric learning paradigm: our model achieves 43.75% Top-1 and 77.10% Top-5 accuracy on the test set. Crucially, this outperforms a standard classification baseline sharing the identical backbone architecture by over 13%, indicating that the prototypical training strategy is effective in data-scarce settings where standard classification fails. Furthermore, the model exhibits strong zero-shot generalization, achieving nearly 30% accuracy on the unseen SignASL dataset without fine-tuning, offering a scalable pathway for recognizing extensive sign vocabularies with limited data.
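
The classification rule of a prototypical network can be sketched in a few lines of plain Python: each class prototype is the mean of its support embeddings, and a query takes the label of the nearest prototype. The embeddings here are toy 2D vectors standing in for encoder outputs, and the function names and labels are hypothetical; the ST-GCN encoder and MSTA module are not modelled.

```python
def build_prototypes(support):
    """support: dict mapping label -> list of embedding vectors. Returns label -> mean vector."""
    protos = {}
    for label, vectors in support.items():
        dim = len(vectors[0])
        protos[label] = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    return protos

def classify(query, protos):
    """Assign the label of the prototype closest to the query in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(protos, key=lambda label: sq_dist(query, protos[label]))
```

Because prototypes are recomputed from whatever support examples are available, new signs can be added at inference time without retraining a fixed classification head, which is consistent with the transfer to an unseen dataset reported above.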

[25] Unleashing Degradation-Carrying Features in Symmetric U-Net: Simpler and Stronger Baselines for All-in-One Image Restoration

Wenlong Jiao, Heyang Lee, Ping Wang, Pengfei Zhu, Qinghua Hu, Dongwei Ren

🧩 TL;DR

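This paper proposes a symmetric U-Net architecture (SymUNet) for all-in-one image restoration. It shows that well-crafted feature extraction already encodes degradation information and that a symmetric design can release these cues effectively, achieving state-of-the-art performance at lower computational cost.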


📘 Detailed Summary

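Motivation: All-in-one image restoration aims to handle multiple degradation types (such as noise, blur, and adverse weather) in a unified way, but existing methods increasingly rely on complex architectures (such as Mixture-of-Experts and diffusion models) and elaborate degradation prompt strategies, which makes them computationally expensive and complicated to implement.

Method: The paper proposes a symmetric U-Net architecture (SymUNet). By aligning feature scales across the encoder and decoder and enabling streamlined cross-scale propagation, the symmetric design robustly preserves intrinsic degradation signals, so that simple additive fusion in the skip connections is sufficient for high performance. A semantic enhanced variant, SE-SymUNet, further injects semantics from frozen CLIP features through simple cross-attention to explicitly amplify degradation priors.

Result: Extensive experiments on several benchmark datasets validate the proposed methods. SymUNet achieves better results than existing approaches while reducing computational cost, and SE-SymUNet improves performance further through semantic enhancement. Both baselines establish a simpler and stronger foundation for future work on all-in-one image restoration.

Conclusion: The study shows that well-crafted feature extraction already encodes degradation information and that a symmetric U-Net is sufficient to release these cues, without complex architectures or elaborate degradation prompt strategies. The work provides a simpler and stronger foundation for all-in-one image restoration, indicating that symmetric design and feature-scale alignment can simplify the architecture considerably while maintaining high performance.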


📄 Abstract

All-in-one image restoration aims to handle diverse degradations (e.g., noise, blur, adverse weather) within a unified framework, yet existing methods increasingly rely on complex architectures (e.g., Mixture-of-Experts, diffusion models) and elaborate degradation prompt strategies. In this work, we reveal a critical insight: well-crafted feature extraction inherently encodes degradation-carrying information, and a symmetric U-Net architecture is sufficient to unleash these cues effectively. By aligning feature scales across encoder-decoder and enabling streamlined cross-scale propagation, our symmetric design preserves intrinsic degradation signals robustly, rendering simple additive fusion in skip connections sufficient for state-of-the-art performance. Our primary baseline, SymUNet, is built on this symmetric U-Net and achieves better results across benchmark datasets than existing approaches while reducing computational cost. We further propose a semantic enhanced variant, SE-SymUNet, which integrates direct semantic injection from frozen CLIP features via simple cross-attention to explicitly amplify degradation priors. Extensive experiments on several benchmarks validate the superiority of our methods. Both baselines SymUNet and SE-SymUNet establish simpler and stronger foundations for future advancements in all-in-one image restoration. The source code is available at https://github.com/WenlongJiao/SymUNet.

[26] Salient Object Detection in Complex Weather Conditions via Noise Indicators

Quan Chen, Xiaokai Yang, Tingyu Wang, Rongfeng Lu, Xichun Sheng, Yaoqi Sun, Chenggang Yan

🧩 TL;DR

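This paper proposes a salient object detection framework for complex weather conditions. A Noise Indicator Fusion Module embeds weather-aware priors, markedly improving segmentation accuracy under weather-induced noise while remaining compatible with mainstream decoders.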


📘 Detailed Summary

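Motivation: Most existing salient object detection methods assume low-noise visual conditions and overlook how weather-induced noise degrades segmentation accuracy in real-world scenes. Robust solutions for complex weather conditions are lacking.

Method: The paper proposes a salient object detection framework for diverse weather conditions, consisting of a specific encoder and a replaceable decoder. A one-hot vector serves as a noise indicator representing different weather types, and a Noise Indicator Fusion Module (NIFM) takes semantic features and the noise indicator as dual inputs. The module embeds weather-aware priors through adaptive feature modulation while keeping the encoder compatible with mainstream SOD decoders.

Result: Extensive experiments on the WXSOD dataset cover different training data scales (100%, 50%, and 30% of the full training set), three encoders, and seven decoder configurations. The results show that the proposed SOD framework, particularly the NIFM-enhanced specific encoder, markedly improves segmentation accuracy under complex weather conditions compared with a vanilla encoder.

Conclusion: The Noise Indicator Fusion Module effectively addresses the impact of weather noise on salient object detection. The framework is compatible and extensible, and offers a robust solution for visual tasks in complex real-world environments.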


📄 Abstract

Salient object detection (SOD), a foundational task in computer vision, has advanced from single-modal to multi-modal paradigms to enhance generalization. However, most existing SOD methods assume low-noise visual conditions, overlooking the degradation of segmentation accuracy caused by weather-induced noise in real-world scenarios. In this paper, we propose a SOD framework tailored for diverse weather conditions, encompassing a specific encoder and a replaceable decoder. To enable handling of varying weather noises, we introduce a one-hot vector as a noise indicator to represent different weather types and design a Noise Indicator Fusion Module (NIFM). The NIFM takes both semantic features and the noise indicator as dual inputs and is inserted between consecutive stages of the encoder to embed weather-aware priors via adaptive feature modulation. Critically, the proposed specific encoder retains compatibility with mainstream SOD decoders. Extensive experiments are conducted on the WXSOD dataset under varying training data scales (100%, 50%, 30% of the full training set), three encoder and seven decoder configurations. Results show that the proposed SOD framework (particularly the NIFM-enhanced specific encoder) improves segmentation accuracy under complex weather conditions compared to a vanilla encoder.

[27] Track and Caption Any Motion: Query-Free Motion Discovery and Description in Videos

Bishoy Galoaa, Sarah Ostadabbas

🧩 TL;DR

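This paper proposes TCAM (Track and Caption Any Motion), a motion-centric video understanding framework that autonomously discovers and describes motion patterns in videos, spatially grounding multiple motion activities and describing them in natural language without user queries.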


📘 Detailed Summary

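Motivation: When understanding videos under challenging conditions such as occlusion, camouflage, or rapid movement, motion dynamics matter more than static appearance. Existing methods usually need user queries to guide understanding and cannot autonomously discover and describe multiple motion activities.

Method: TCAM autonomously observes a video, identifies multiple motion activities, and spatially grounds each natural language description to its corresponding trajectory through a motion-field attention mechanism. It aligns motion patterns with contrastive vision-language representations. Unified training combines global video-text alignment with fine-grained spatial correspondence, and multi-head cross-attention enables query-free discovery of multiple motion expressions.

Result: On the MeViS benchmark, TCAM achieves 58.4% video-to-text retrieval accuracy and 64.9 JF for spatial grounding, and discovers 4.8 relevant expressions per video with 84.7% precision, demonstrating strong cross-task generalization.

Conclusion: The study shows that motion patterns, when aligned with contrastive vision-language representations, provide strong semantic signals for recognizing and describing actions. TCAM's unified training framework opens a new route to autonomous video understanding, discovering and describing multiple motion activities without user queries.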


📄 Abstract

We propose Track and Caption Any Motion (TCAM), a motion-centric framework for automatic video understanding that discovers and describes motion patterns without user queries. Understanding videos in challenging conditions like occlusion, camouflage, or rapid movement often depends more on motion dynamics than static appearance. TCAM autonomously observes a video, identifies multiple motion activities, and spatially grounds each natural language description to its corresponding trajectory through a motion-field attention mechanism. Our key insight is that motion patterns, when aligned with contrastive vision-language representations, provide powerful semantic signals for recognizing and describing actions. Through unified training that combines global video-text alignment with fine-grained spatial correspondence, TCAM enables query-free discovery of multiple motion expressions via multi-head cross-attention. On the MeViS benchmark, TCAM achieves 58.4% video-to-text retrieval, 64.9 JF for spatial grounding, and discovers 4.8 relevant expressions per video with 84.7% precision, demonstrating strong cross-task generalization.

[28] Lang2Motion: Bridging Language and Motion through Joint Embedding Spaces

Bishoy Galoaa, Xiangyu Bai, Sarah Ostadabbas

🧩 TL;DR

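This paper presents Lang2Motion, a framework for language-guided point trajectory generation that aligns motion manifolds with joint embedding spaces. It extracts point trajectories of arbitrary objects from real-world videos, clearly outperforms video-based methods on text-to-trajectory retrieval, and transfers effectively across motion domains.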


📘 Detailed Summary

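Motivation: Prior work focuses mainly on human motion or video synthesis and lacks methods for generating explicit motion trajectories of arbitrary objects from language descriptions. This study aims to fill that gap by extracting point trajectories from real-world videos and building a language-guided motion generation framework that supports a broader range of applications.

Method: Lang2Motion uses a Transformer-based auto-encoder and extracts motion trajectories from real videos via point tracking. The framework learns trajectory representations through dual supervision: textual motion descriptions and rendered trajectory visualizations. Both are mapped through CLIP's frozen encoders into a joint embedding space, semantically aligning the motion manifold with the language space.

Result: Lang2Motion achieves 34.2% Recall@1 on text-to-trajectory retrieval, 12.5 points higher than video-based methods. It improves motion accuracy by 33-52% compared to video generation baselines (12.4 ADE vs 18.3-25.3). Although trained only on diverse object motions, it reaches 88.3% Top-1 accuracy on human action recognition, showing effective transfer across domains.

Conclusion: The study shows that CLIP-aligned trajectory representations can effectively connect language and motion spaces, supporting applications such as style transfer, semantic interpolation, and latent-space editing. Representations learned from limited object motion data generalize to different motion domains, offering a new technical path for language-guided motion generation.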


📄 Abstract

We present Lang2Motion, a framework for language-guided point trajectory generation by aligning motion manifolds with joint embedding spaces. Unlike prior work focusing on human motion or video synthesis, we generate explicit trajectories for arbitrary objects using motion extracted from real-world videos via point tracking. Our transformer-based auto-encoder learns trajectory representations through dual supervision: textual motion descriptions and rendered trajectory visualizations, both mapped through CLIP's frozen encoders. Lang2Motion achieves 34.2% Recall@1 on text-to-trajectory retrieval, outperforming video-based methods by 12.5 points, and improves motion accuracy by 33-52% (12.4 ADE vs 18.3-25.3) compared to video generation baselines. We demonstrate 88.3% Top-1 accuracy on human action recognition despite training only on diverse object motions, showing effective transfer across motion domains. Lang2Motion supports style transfer, semantic interpolation, and latent-space editing through CLIP-aligned trajectory representations.
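
The ADE figure quoted above refers to average displacement error. A minimal plain-Python sketch of the usual definition follows: the mean Euclidean distance between predicted and reference points at matching time steps. The function name and input layout are assumptions, and the units (for example pixels) depend on the evaluation protocol, which the abstract does not specify.

```python
import math

def average_displacement_error(predicted, reference):
    """Mean Euclidean distance between corresponding points of two trajectories.

    predicted, reference: equal-length sequences of (x, y) points, one per time step.
    """
    if not predicted or len(predicted) != len(reference):
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum(math.dist(p, r) for p, r in zip(predicted, reference)) / len(predicted)
```

Lower is better: a value of 0 means the predicted trajectory coincides with the reference at every time step.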

[29] TriDF: Evaluating Perception, Detection, and Hallucination for Interpretable DeepFake Detection

Jian-Yu Jiang-Lin, Kang-Yang Huang, Ling Zou, Ling Lo, Sheng-Ping Yang, Yu-Wen Tseng, Kun-Hsiang Lin, Chia-Ling Chen, Yu-Ting Ta, Yan-Tsung Wang, Po-Ching Chen, Hongxia Xie, Hong-Han Shuai, Wen-Huang Cheng

🧩 TL;DR

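This paper introduces TriDF, a comprehensive benchmark for interpretable DeepFake detection. It covers 16 forgery types across image, video, and audio modalities, evaluates three key aspects (perception, detection, and hallucination), and reveals how these aspects depend on one another.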


📘 Detailed Summary

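Motivation: Advances in generative modeling have made it increasingly easy to fabricate realistic portrayals of individuals, posing serious threats to security, communication, and public trust. Detecting such person-driven manipulations requires systems that not only distinguish altered content from authentic media but also provide clear and reliable reasoning. A unified framework for understanding the interaction between detection accuracy, evidence identification, and explanation reliability is currently missing.

Method: The paper introduces the TriDF benchmark, which contains high-quality forgeries from advanced synthesis models and covers 16 DeepFake types across image, video, and audio modalities. The benchmark evaluates three key aspects: Perception, which uses human-annotated evidence to measure a model's ability to identify fine-grained manipulation artifacts; Detection, which assesses classification performance across diverse forgery families and generators; and Hallucination, which quantifies the reliability of model-generated explanations.

Result: Experiments on state-of-the-art multimodal large language models show that accurate perception is essential for reliable detection, but hallucination can severely disrupt decision-making. The study reveals the interdependence of perception, detection, and hallucination, and provides an empirical basis for understanding how detection accuracy, evidence identification, and explanation reliability interact.

Conclusion: TriDF provides a unified framework for understanding the interaction between detection accuracy, evidence identification, and explanation reliability, and lays a foundation for trustworthy systems that address real-world synthetic media threats. The benchmark highlights the importance of accurate perception for reliable detection and the damaging effect of hallucination on decisions, underlining the need to optimize all three aspects together in DeepFake detection systems.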


📄 Abstract

Advances in generative modeling have made it increasingly easy to fabricate realistic portrayals of individuals, creating serious risks for security, communication, and public trust. Detecting such person-driven manipulations requires systems that not only distinguish altered content from authentic media but also provide clear and reliable reasoning. In this paper, we introduce TriDF, a comprehensive benchmark for interpretable DeepFake detection. TriDF contains high-quality forgeries from advanced synthesis models, covering 16 DeepFake types across image, video, and audio modalities. The benchmark evaluates three key aspects: Perception, which measures the ability of a model to identify fine-grained manipulation artifacts using human-annotated evidence; Detection, which assesses classification performance across diverse forgery families and generators; and Hallucination, which quantifies the reliability of model-generated explanations. Experiments on state-of-the-art multimodal large language models show that accurate perception is essential for reliable detection, but hallucination can severely disrupt decision-making, revealing the interdependence of these three aspects. TriDF provides a unified framework for understanding the interaction between detection accuracy, evidence identification, and explanation reliability, offering a foundation for building trustworthy systems that address real-world synthetic media threats.

[30] XDen-1K: A Density Field Dataset of Real-World Objects

Jingxuan Zhang, Tianqi Yu, Yatu Zhang, Jinze Wu, Kaixin Yao, Jingyang Liu, Yuyao Zhang, Jiayuan Gu, Jingyi Yu

🧩 TL;DR

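This study introduces XDen-1K, the first large-scale multimodal dataset for real-world physical property estimation, and develops an optimization framework that reconstructs high-fidelity volumetric density fields from sparse X-ray views, markedly improving center-of-mass estimation and robotic manipulation performance.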


📘 Detailed Summary

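Motivation: Current models focus mainly on an object's surface geometry and appearance and neglect internal physical properties such as volumetric density. These properties are essential for predicting an object's center of mass, stability, and interaction dynamics, and the main bottleneck has been the lack of large-scale real-world data.

Method: The study introduces the XDen-1K dataset, which contains 1,000 real-world objects with high-resolution 3D geometric models, part-level annotations, and biplanar X-ray scans. It also proposes a new optimization framework that recovers an object's volumetric density field from sparse X-ray views, and adds X-ray images as a conditioning signal to an existing segmentation network.

Result: Experiments show that using the dataset improves the accuracy of center-of-mass estimation and raises the success rate of robotic manipulation tasks. Volumetric segmentation is achieved with X-ray conditioning, supporting the dataset's practical value for physically grounded visual inference.

Conclusion: XDen-1K is intended to serve as a foundational resource and a challenging new benchmark for physically grounded visual inference and embodied AI, advancing the understanding of objects' internal physical properties and providing key data for applications such as robotic manipulation and physical simulation.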


📄 Abstract

A deep understanding of the physical world is a central goal for embodied AI and realistic simulation. While current models excel at capturing an object's surface geometry and appearance, they largely neglect its internal physical properties. This omission is critical, as properties like volumetric density are fundamental for predicting an object's center of mass, stability, and interaction dynamics in applications ranging from robotic manipulation to physical simulation. The primary bottleneck has been the absence of large-scale, real-world data. To bridge this gap, we introduce XDen-1K, the first large-scale, multi-modal dataset designed for real-world physical property estimation, with a particular focus on volumetric density. The core of this dataset consists of 1,000 real-world objects across 148 categories, for which we provide comprehensive multi-modal data, including a high-resolution 3D geometric model with part-level annotations and a corresponding set of real-world biplanar X-ray scans. Building upon this data, we introduce a novel optimization framework that recovers a high-fidelity volumetric density field of each object from its sparse X-ray views. To demonstrate its practical value, we add X-ray images as a conditioning signal to an existing segmentation network and perform volumetric segmentation. Furthermore, we conduct experiments on downstream robotics tasks. The results show that leveraging the dataset can effectively improve the accuracy of center-of-mass estimation and the success rate of robotic manipulation. We believe XDen-1K will serve as a foundational resource and a challenging new benchmark, catalyzing future research in physically grounded visual inference and embodied AI.
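
The link between a volumetric density field and the center of mass follows from the standard density-weighted mean, r_com = (sum_i rho_i * r_i) / (sum_i rho_i). The plain-Python sketch below applies it to a voxel grid; the nested-list layout `density[z][y][x]`, the voxel-centre convention, and the function name are assumptions for illustration and do not reflect the dataset's actual file format.

```python
def center_of_mass(density, voxel_size=1.0):
    """Density-weighted mean position (x, y, z) of a voxel grid.

    density: nested lists indexed as density[z][y][x] with non-negative values
    (hypothetical layout). Positions are taken at voxel centres.
    """
    total = mx = my = mz = 0.0
    for z, plane in enumerate(density):
        for y, row in enumerate(plane):
            for x, rho in enumerate(row):
                total += rho
                mx += rho * (x + 0.5)
                my += rho * (y + 0.5)
                mz += rho * (z + 0.5)
    if total <= 0.0:
        raise ValueError("density field has no mass")
    return (mx / total * voxel_size, my / total * voxel_size, mz / total * voxel_size)
```

With uniform density this reduces to the geometric centroid, which is what a surface-only model would implicitly assume; a non-uniform field shifts the result toward the denser parts.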

[31] Geo6DPose: Fast Zero-Shot 6D Object Pose Estimation via Geometry-Filtered Feature Matching

Javier Villena Toro, Mehdi Tarkian

🧩 TL;DR

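This paper presents Geo6DPose, a lightweight, fully local, training-free pipeline for zero-shot 6D object pose estimation. It trades model scale for geometric reliability, achieving sub-second inference while matching the average recall of much larger baselines.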


📘 Detailed Summary

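Motivation: Current zero-shot 6D object pose estimation relies mainly on large-scale models and cloud-based inference, which brings high latency, high energy consumption, and deployment risks related to connectivity, cost, and data governance. These conflict with the constraints of real-world robotics, where compute is limited and on-device inference is often required.

Method: The method combines foundation-model visual features with a geometric filtering strategy. It computes similarity maps between onboarded template DINO descriptors and scene patches, then establishes mutual correspondences by projecting scene patch centers to 3D and template descriptors to the object model coordinate system. Poses are recovered via correspondence-driven RANSAC and ranked with a weighted geometric alignment metric that jointly accounts for reprojection consistency and spatial support, which improves robustness to noise, clutter, and partial visibility.

Result: Geo6DPose achieves sub-second inference on a single commodity GPU while matching the average recall of significantly larger zero-shot baselines (53.7 AR, 1.08 FPS). It needs no training, fine-tuning, or network access, and stays compatible with evolving foundation backbones.

Conclusion: By trading model scale for geometric reliability, the work advances fully local 6D perception for practical robotic deployment. It offers a lightweight, training-free, on-device-friendly alternative that addresses the latency, energy, and deployment-risk limitations of large cloud-based methods.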


📄 Abstract

Recent progress in zero-shot 6D object pose estimation has been driven largely by large-scale models and cloud-based inference. However, these approaches often introduce high latency, elevated energy consumption, and deployment risks related to connectivity, cost, and data governance, factors that conflict with the practical constraints of real-world robotics, where compute is limited and on-device inference is frequently required. We introduce Geo6DPose, a lightweight, fully local, and training-free pipeline for zero-shot 6D pose estimation that trades model scale for geometric reliability. Our method combines foundation model visual features with a geometric filtering strategy: similarity maps are computed between onboarded template DINO descriptors and scene patches, and mutual correspondences are established by projecting scene patch centers to 3D and template descriptors to the object model coordinate system. Final poses are recovered via correspondence-driven RANSAC and ranked using a weighted geometric alignment metric that jointly accounts for reprojection consistency and spatial support, improving robustness to noise, clutter, and partial visibility. Geo6DPose achieves sub-second inference on a single commodity GPU while matching the average recall of significantly larger zero-shot baselines (53.7 AR, 1.08 FPS). It requires no training, fine-tuning, or network access, and remains compatible with evolving foundation backbones, advancing practical, fully local 6D perception for robotic deployment.
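
The similarity-map and mutual-correspondence step described in the abstract can be illustrated with a minimal sketch. It assumes descriptors are already extracted as NumPy arrays and uses cosine similarity with mutual nearest-neighbour filtering; the function name `mutual_matches` is hypothetical, and the paper's actual matching, 3D projection, and RANSAC stages are not reproduced here.

```python
import numpy as np

def mutual_matches(template_desc, scene_desc):
    """Return (template_idx, scene_idx) pairs that are mutual nearest neighbours.

    Hypothetical helper for illustration; not the Geo6DPose implementation.
    """
    # L2-normalise so that the dot product equals cosine similarity.
    t = template_desc / np.linalg.norm(template_desc, axis=1, keepdims=True)
    s = scene_desc / np.linalg.norm(scene_desc, axis=1, keepdims=True)
    sim = t @ s.T                        # similarity map, shape (n_template, n_scene)
    best_scene = sim.argmax(axis=1)      # best scene patch for each template descriptor
    best_template = sim.argmax(axis=0)   # best template descriptor for each scene patch
    t_idx = np.arange(sim.shape[0])
    # Keep a pair only if each side picks the other as its best match.
    keep = best_template[best_scene] == t_idx
    return np.stack([t_idx[keep], best_scene[keep]], axis=1)

template = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]])
scene = np.array([[0.9, 0.1], [0.0, 2.0]])
pairs = mutual_matches(template, scene)
```

In a full pipeline, the surviving pairs would then be lifted to 3D and passed to a RANSAC-style pose solver; the mutual check simply discards one-sided matches before that stage.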

[32] Sharp Monocular View Synthesis in Less Than a Second

Lars Mescheder, Wei Dong, Shiwei Li, Xuyang Bai, Marcel Santos, Peiyun Hu, Bruno Lecouat, Mingmin Zhen, Amaël Delaunoy, Tian Fang, Yanghai Tsin, Stephan R. Richter, Vladlen Koltun

🧩 TL;DR

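SHARP is an approach to photorealistic view synthesis from a single image. A single feedforward pass through a neural network regresses the parameters of a 3D Gaussian representation in less than a second on a standard GPU, and the result can be rendered in real time to produce high-resolution photorealistic images.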


📘 Detailed Summary

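Motivation: The work targets photorealistic view synthesis from a single image, which calls for a fast, high-quality 3D scene representation with metric scale. It aims to overcome the limits of existing methods in synthesis time, quality, and generalization.

Method: SHARP uses a single feedforward pass through a neural network to regress, directly from one input image, the parameters of a 3D Gaussian representation of the depicted scene. The representation is metric, with absolute scale, and supports metric camera movements. The pass takes less than a second on a standard GPU.

Result: SHARP shows robust zero-shot generalization across datasets and sets a new state of the art on multiple datasets, reducing LPIPS by 25-34% and DISTS by 21-43% while lowering synthesis time by three orders of magnitude. The resulting representation supports real-time, high-resolution photorealistic rendering.

Conclusion: SHARP demonstrates that a metric 3D Gaussian representation can be generated quickly from a single image, giving an efficient route to real-time photorealistic view synthesis. Its gains in quality and generalization point to practical value in computer vision and graphics applications.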


📄 Abstract

We present SHARP, an approach to photorealistic view synthesis from a single image. Given a single photograph, SHARP regresses the parameters of a 3D Gaussian representation of the depicted scene. This is done in less than a second on a standard GPU via a single feedforward pass through a neural network. The 3D Gaussian representation produced by SHARP can then be rendered in real time, yielding high-resolution photorealistic images for nearby views. The representation is metric, with absolute scale, supporting metric camera movements. Experimental results demonstrate that SHARP delivers robust zero-shot generalization across datasets. It sets a new state of the art on multiple datasets, reducing LPIPS by 25-34% and DISTS by 21-43% versus the best prior model, while lowering the synthesis time by three orders of magnitude. Code and weights are provided at https://github.com/apple/ml-sharp

[33] Video Depth Propagation

Luigi Piccinelli, Thiemo Wandel, Christos Sakaridis, Wim Abbeloos, Luc Van Gool

🧩 TL;DR

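This paper presents VeloDepth, an efficient and robust online video depth estimation method. By exploiting spatiotemporal priors and deep feature propagation, it delivers strong temporal consistency and competitive accuracy while remaining real-time.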


📘 Detailed Summary

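Motivation: Existing video depth estimation methods have clear limitations: frame-by-frame monocular models produce temporally inconsistent and inaccurate results, while computationally demanding temporal models are unsuitable for real-time use. These limitations restrict general applicability and performance in practical settings.

Method: VeloDepth is an efficient online video depth estimation pipeline that leverages spatiotemporal priors from previous depth predictions and performs deep feature propagation. Its novel Propagation Module refines and propagates depth features and predictions using flow-based warping coupled with learned residual corrections. The design structurally enforces temporal consistency, yielding stable depth predictions across consecutive frames.

Result: Comprehensive zero-shot evaluation on multiple benchmarks shows that VeloDepth reaches state-of-the-art temporal consistency with competitive accuracy, and that its inference is significantly faster than that of existing video-based depth estimators.

Conclusion: VeloDepth offers a practical, efficient, and accurate solution for real-time depth estimation across diverse perception tasks. By balancing temporal consistency, accuracy, and computational cost, it addresses key limitations of existing methods in practical use.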


📄 Abstract

Depth estimation in videos is essential for visual perception in real-world applications. However, existing methods either rely on simple frame-by-frame monocular models, leading to temporal inconsistencies and inaccuracies, or use computationally demanding temporal modeling, unsuitable for real-time applications. These limitations significantly restrict general applicability and performance in practical settings. To address this, we propose VeloDepth, an efficient and robust online video depth estimation pipeline that effectively leverages spatiotemporal priors from previous depth predictions and performs deep feature propagation. Our method introduces a novel Propagation Module that refines and propagates depth features and predictions using flow-based warping coupled with learned residual corrections. In addition, our design structurally enforces temporal consistency, resulting in stable depth predictions across consecutive frames with improved efficiency. Comprehensive zero-shot evaluation on multiple benchmarks demonstrates the state-of-the-art temporal consistency and competitive accuracy of VeloDepth, alongside its significantly faster inference compared to existing video-based depth estimators. VeloDepth thus provides a practical, efficient, and accurate solution for real-time depth estimation suitable for diverse perception tasks. Code and models are available at https://github.com/lpiccinelli-eth/velodepth
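
The idea of propagating a previous prediction by flow-based warping and then applying a residual correction can be sketched in a few lines. This is only an illustration under stated assumptions: the flow is taken to be a backward flow (for each current-frame pixel, the offset to its location in the previous frame), sampling is nearest-neighbour rather than bilinear, and the `residual` array stands in for the learned correction. The names `warp_backward` and `propagate` are hypothetical and do not come from the VeloDepth code.

```python
import numpy as np

def warp_backward(prev, flow):
    """Sample `prev` at (x + flow_x, y + flow_y) for every current-frame pixel."""
    h, w = prev.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Nearest-neighbour sampling, clamped to the image border (an assumption).
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return prev[src_y, src_x]

def propagate(prev_depth, flow, residual):
    """Warp the previous depth map, then add a correction term."""
    return warp_backward(prev_depth, flow) + residual

prev_depth = np.arange(12, dtype=float).reshape(3, 4)
flow = np.zeros((3, 4, 2))
flow[..., 0] = 1.0                      # every pixel came from one column to the right
current = propagate(prev_depth, flow, np.zeros((3, 4)))
```

Because the current estimate is built from the warped previous one, frame-to-frame changes are limited to what the flow and the residual explain, which is one way to read the abstract's claim that temporal consistency is enforced structurally.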

[34] LDP: Parameter-Efficient Fine-Tuning of Multimodal LLM for Medical Report Generation

Tianyu Zhou, Junyi Tang, Zehui Li, Dahong Qian, Suncheng Xiang

🧩 TL;DR

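This work proposes LDP, a framework that uses a multimodal large language model to generate professional colonic polyp diagnosis reports. By curating the MMEndo dataset and applying parameter-efficient fine-tuning, it improves the accuracy and clinical usefulness of the reports.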


📘 Detailed Summary

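Motivation: Traditional automated reporting for colonoscopic polyp diagnosis suffers from inconsistencies and hallucinations, mainly because high-quality multimodal medical data is scarce. This undermines reliability and scalability in clinical use.

Method: The LDP framework first curates MMEndo, an expert-annotated multimodal endoscopic dataset. It then fine-tunes a Qwen2-VL-7B backbone with LoRA for parameter efficiency and aligns it with clinical standards via Direct Preference Optimization (DPO).

Result: LDP outperforms existing baselines on both automated metrics and clinical expert evaluation, achieving a Physician Score of 7.2/10 and cutting training compute cost by 833x compared to full fine-tuning. Additional validation on the IU-XRay dataset supports its robustness.

Conclusion: The work offers a scalable, clinically viable path for primary healthcare. Combining a multimodal large language model with parameter-efficient fine-tuning raises both the level of automation and the clinical usefulness of medical image diagnosis.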


📄 Abstract

Colonoscopic polyp diagnosis is pivotal for early colorectal cancer detection, yet traditional automated reporting suffers from inconsistencies and hallucinations due to the scarcity of high-quality multimodal medical data. To bridge this gap, we propose LDP, a novel framework leveraging multimodal large language models (MLLMs) for professional polyp diagnosis report generation. Specifically, we curate MMEndo, a multimodal endoscopic dataset comprising expert-annotated colonoscopy image-text pairs. We fine-tune the Qwen2-VL-7B backbone using Parameter-Efficient Fine-Tuning (LoRA) and align it with clinical standards via Direct Preference Optimization (DPO). Extensive experiments show that our LDP outperforms existing baselines on both automated metrics and rigorous clinical expert evaluations (achieving a Physician Score of 7.2/10), significantly reducing training computational costs by 833x compared to full fine-tuning. The proposed solution offers a scalable, clinically viable path for primary healthcare, with additional validation on the IU-XRay dataset confirming its robustness.
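
For readers unfamiliar with the alignment step, the standard DPO objective for a single preference pair can be written out directly. The sketch below assumes the summed log-probabilities of the preferred and rejected reports under the policy and the frozen reference model are already available; the function name `dpo_loss` and the default `beta=0.1` are illustrative assumptions, not values reported by the paper.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair: -log(sigmoid(beta * margin))."""
    # How much more the policy prefers the chosen answer than the reference does.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # Numerically stable softplus(-z), which equals -log(sigmoid(z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

loss_at_init = dpo_loss(-5.0, -7.0, -5.0, -7.0)      # policy equals reference
loss_improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)     # policy favours the chosen report
```

When the policy matches the reference, the margin is zero and the loss equals log 2; the loss falls as the policy shifts probability toward the preferred report relative to the reference.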

[35] From Macro to Micro: Benchmarking Microscopic Spatial Intelligence on Molecules via Vision-Language Models

Zongzhao Li, Xiangzhe Kong, Jiahui Su, Zongyang Ma, Mingze Li, Songyou Li, Yuelin Zhang, Yu Rong, Tingyang Xu, Deli Zhao, Wenbing Huang

🧩 TL;DR

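This paper introduces the concept of Microscopic Spatial Intelligence (MiSI) and builds MiSI-Bench, a benchmark framework with over 163,000 question-answer pairs and 587,000 images, to evaluate the microscopic spatial reasoning of vision-language models (VLMs). Current state-of-the-art VLMs perform well below human level on the benchmark, although a fine-tuned 7B model surpasses humans on spatial transformation tasks.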


📘 Detailed Summary

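Motivation: The work addresses the lack of evaluation for VLMs in microscopic spatial intelligence, the ability to perceive and reason about the spatial relationships of invisible microscopic entities, which is fundamental to scientific discovery. No systematic benchmark currently assesses VLMs in this core area of scientific reasoning, which hinders progress toward scientific AGI.

Method: The study proposes MiSI-Bench, a systematic benchmark framework with over 163,000 question-answer pairs and 587,000 images derived from approximately 4,000 molecular structures. It covers nine complementary tasks that range from elementary spatial transformations to complex relational identifications, providing a comprehensive testbed for VLM microscopic spatial reasoning.

Result: Current state-of-the-art VLMs perform significantly below human level on MiSI-Bench. A fine-tuned 7B model shows substantial potential and even surpasses humans on spatial transformation tasks, but its poor performance on scientifically grounded tasks such as hydrogen bond recognition indicates that explicit domain knowledge needs to be integrated.

Conclusion: Current VLMs have a marked capability gap in microscopic spatial reasoning, especially on tasks that require scientific domain knowledge. Fine-tuned small models can exceed human performance on specific tasks, but progress toward scientific AGI requires combining explicit domain knowledge with model capability. The benchmark provides a tool for evaluating and advancing VLMs in scientific discovery.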


📄 Abstract

This paper introduces the concept of Microscopic Spatial Intelligence (MiSI), the capability to perceive and reason about the spatial relationships of invisible microscopic entities, which is fundamental to scientific discovery. To assess the potential of Vision-Language Models (VLMs) in this domain, we propose a systematic benchmark framework MiSI-Bench. This framework features over 163,000 question-answer pairs and 587,000 images derived from approximately 4,000 molecular structures, covering nine complementary tasks that evaluate abilities ranging from elementary spatial transformations to complex relational identifications. Experimental results reveal that current state-of-the-art VLMs perform significantly below human level on this benchmark. However, a fine-tuned 7B model demonstrates substantial potential, even surpassing humans in spatial transformation tasks, while its poor performance in scientifically-grounded tasks like hydrogen bond recognition underscores the necessity of integrating explicit domain knowledge for progress toward scientific AGI. The datasets are available at https://huggingface.co/datasets/zongzhao/MiSI-bench.

[36] MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos

Kehong Gong, Zhengyu Wen, Weixia He, Mingxi Xu, Qi Wang, Ning Zhang, Zhengyu Li, Dongze Lian, Wei Zhao, Xiaoyu He, Mingyuan Zhang

🧩 TL;DR

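This paper presents MoCapAnything, a reference-guided, factorized framework for Category-Agnostic Motion Capture (CAMoCap). It reconstructs rotation-based animation from a monocular video and an arbitrary rigged 3D asset, enabling cross-species motion retargeting.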


📘 Detailed Summary

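Motivation: Most existing motion capture pipelines are tied to a specific species or template and do not generalize. The paper formalizes Category-Agnostic Motion Capture (CAMoCap): given a monocular video and an arbitrary rigged 3D asset as a prompt, reconstruct a rotation-based animation (for example, in BVH format) that directly drives that asset.

Method: MoCapAnything is a reference-guided, factorized framework with three learnable modules and a lightweight IK stage. A Reference Prompt Encoder extracts per-joint queries from the asset's skeleton, mesh, and rendered images. A Video Feature Extractor computes dense visual descriptors and reconstructs a coarse 4D deforming mesh to bridge video and joint space. A Unified Motion Decoder fuses these cues into temporally coherent trajectories, and constraint-aware inverse kinematics then recovers asset-specific rotations.

Result: Experiments on in-domain benchmarks and in-the-wild videos show that MoCapAnything produces high-quality skeletal animations and exhibits meaningful cross-species retargeting. The authors also curate the Truebones Zoo dataset with 1038 motion clips, each providing a standardized skeleton-mesh-render triad.

Conclusion: The work enables scalable, prompt-driven 3D motion capture for arbitrary assets, moving beyond the species restrictions of traditional motion capture. The framework supports cross-species retargeting across heterogeneous rigs and gives content creators a more flexible tool.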


📄 Abstract

Motion capture now underpins content creation far beyond digital humans, yet most existing pipelines remain species- or template-specific. We formalize this gap as Category-Agnostic Motion Capture (CAMoCap): given a monocular video and an arbitrary rigged 3D asset as a prompt, the goal is to reconstruct a rotation-based animation such as BVH that directly drives the specific asset. We present MoCapAnything, a reference-guided, factorized framework that first predicts 3D joint trajectories and then recovers asset-specific rotations via constraint-aware inverse kinematics. The system contains three learnable modules and a lightweight IK stage: (1) a Reference Prompt Encoder that extracts per-joint queries from the asset's skeleton, mesh, and rendered images; (2) a Video Feature Extractor that computes dense visual descriptors and reconstructs a coarse 4D deforming mesh to bridge the gap between video and joint space; and (3) a Unified Motion Decoder that fuses these cues to produce temporally coherent trajectories. We also curate Truebones Zoo with 1038 motion clips, each providing a standardized skeleton-mesh-render triad. Experiments on both in-domain benchmarks and in-the-wild videos show that MoCapAnything delivers high-quality skeletal animations and exhibits meaningful cross-species retargeting across heterogeneous rigs, enabling scalable, prompt-driven 3D motion capture for arbitrary assets. Project page: https://animotionlab.github.io/MoCapAnything/

[37] PubTables-v2: A new large-scale dataset for full-page and multi-page table extraction

Brandon Smock, Valerie Faucon-Morin, Max Sokolov, Libin Liang, Tayyibah Khanam, Maury Courtland

🧩 TL;DR

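This paper introduces PubTables-v2, a large-scale dataset for table extraction in visual document understanding, and uses it to build POTATR, an image-to-graph extension of the Table Transformer for comprehensive page-level table extraction.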


📘 Detailed Summary

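Motivation: Table extraction is a key challenge in visual document understanding. Traditional approaches detect tables first and then recognize their structure, while recent vision-language-model approaches extract tables directly in context. Progress on the latter has been hard to demonstrate for lack of annotated data, so a large-scale dataset is needed for currently challenging tasks such as multi-page table structure recognition.

Method: The authors create PubTables-v2, a large-scale dataset that supports several table extraction tasks. On top of it they build the Page-Object Table Transformer (POTATR), an extension of the Table Transformer that uses an image-to-graph architecture for comprehensive page-level table extraction.

Result: PubTables-v2 is the first large-scale benchmark for multi-page table structure recognition. Evaluating domain-specialized vision-language models on it highlights current progress, and the dataset is used to demonstrate POTATR on page-level table extraction.

Conclusion: The dataset addresses data scarcity in table extraction, and POTATR offers a new approach to page-level extraction. The planned release of data, code, and trained models is intended to support further work on table extraction in visual document understanding.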


📄 Abstract

Table extraction (TE) is a key challenge in visual document understanding. Traditional approaches detect tables first, then recognize their structure. Recently, interest has surged in developing methods, such as vision-language models (VLMs), that can extract tables directly in their full page or document context. However, progress has been difficult to demonstrate due to a lack of annotated data. To address this, we create a new large-scale dataset, PubTables-v2. PubTables-v2 supports a number of current challenging table extraction tasks. Notably, it is the first large-scale benchmark for multi-page table structure recognition. We demonstrate its usefulness by evaluating domain-specialized VLMs on these tasks and highlighting current progress. Finally, we use PubTables-v2 to create the Page-Object Table Transformer (POTATR), an image-to-graph extension of the Table Transformer to comprehensive page-level TE. Data, code, and trained models will be released.

[38] DuetSVG: Unified Multimodal SVG Generation with Internal Visual Guidance

Peiying Zhang, Nanxuan Zhao, Matthew Fisher, Yiran Xu, Jing Liao, Difan Liu

🧩 TL;DR

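This paper presents DuetSVG, a unified multimodal model that jointly generates image tokens and SVG tokens. It targets the geometric incoherence and weak visual appeal that arise when existing VLM-based SVG generators decode without any visual signal.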


📘 Detailed Summary

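Motivation: Existing VLM-based SVG generation methods produce only text and have no visual signal during decoding. As a result they struggle with complex semantics and often fail to produce visually appealing, geometrically coherent SVGs.

Method: DuetSVG uses a unified multimodal architecture that jointly generates image tokens and the corresponding SVG tokens end to end. It is trained on both image and SVG datasets. At inference, a novel test-time scaling strategy uses the model's own visual predictions as guidance to improve SVG decoding quality.

Result: Extensive experiments show that the method outperforms existing approaches across a range of applications, producing visually faithful, semantically aligned, and syntactically clean SVGs, with clear gains in handling complex semantics and geometric coherence.

Conclusion: The study shows that jointly generating visual and SVG tokens is effective, and that test-time scaling with internal visual guidance is a useful idea for multimodal generation. It establishes a unified multimodal framework for high-quality SVG generation with relevance to visual content creation and graphic design.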


📄 Abstract

Recent vision-language model (VLM)-based approaches have achieved impressive results on SVG generation. However, because they generate only text and lack visual signals during decoding, they often struggle with complex semantics and fail to produce visually appealing or geometrically coherent SVGs. We introduce DuetSVG, a unified multimodal model that jointly generates image tokens and corresponding SVG tokens in an end-to-end manner. DuetSVG is trained on both image and SVG datasets. At inference, we apply a novel test-time scaling strategy that leverages the model's native visual predictions as guidance to improve SVG decoding quality. Extensive experiments show that our method outperforms existing methods, producing visually faithful, semantically aligned, and syntactically clean SVGs across a wide range of applications.

[39] VL-JEPA: Joint Embedding Predictive Architecture for Vision-language

Delong Chen, Mustafa Shukor, Theo Moutakanni, Willy Chung, Jade Yu, Tejaswi Kasarla, Allen Bolourchi, Yann LeCun, Pascale Fung

🧩 TL;DR

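This paper presents VL-JEPA, a vision-language model built on a Joint Embedding Predictive Architecture. It predicts continuous text embeddings instead of autoregressively generating tokens, achieving stronger performance with 50% fewer trainable parameters and supporting selective decoding that substantially lowers inference cost.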


📘 Detailed Summary

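Motivation: Classical vision-language models generate tokens autoregressively, which is computationally expensive and sensitive to surface-level linguistic variability. The work addresses this by predicting in an abstract representation space, so the model can focus on task-relevant semantics, depend less on surface form, and need fewer parameters.

Method: VL-JEPA uses a joint embedding predictive architecture and predicts in a continuous embedding space rather than a discrete token space. The model learns to predict continuous embeddings of the target text, and a lightweight text decoder is invoked at inference only when text is actually needed. This abstracts away surface-level linguistic variability and enables selective decoding, which reduces the number of decoding operations.

Result: In a strictly controlled comparison against standard token-space VLM training with the same vision encoder and training data, VL-JEPA achieves stronger performance with 50% fewer trainable parameters. Selective decoding reduces decoding operations by 2.85x while maintaining similar performance. On eight video classification and eight video retrieval datasets, VL-JEPA's average performance surpasses CLIP, SigLIP2, and Perception Encoder. On four VQA datasets, VL-JEPA reaches performance comparable to classical VLMs with only 1.6B parameters.

Conclusion: VL-JEPA shows the benefits of predicting in an abstract representation space: better efficiency and broader task coverage. The architecture natively supports open-vocabulary classification, text-to-video retrieval, and discriminative VQA without architectural changes. Predicting semantic embeddings rather than surface tokens can deliver strong performance at much lower parameter and compute cost, pointing to a new direction for efficient vision-language modeling.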


📄 Abstract

We introduce VL-JEPA, a vision-language model built on a Joint Embedding Predictive Architecture (JEPA). Instead of autoregressively generating tokens as in classical VLMs, VL-JEPA predicts continuous embeddings of the target texts. By learning in an abstract representation space, the model focuses on task-relevant semantics while abstracting away surface-level linguistic variability. In a strictly controlled comparison against standard token-space VLM training with the same vision encoder and training data, VL-JEPA achieves stronger performance while having 50% fewer trainable parameters. At inference time, a lightweight text decoder is invoked only when needed to translate VL-JEPA's predicted embeddings into text. We show that VL-JEPA natively supports selective decoding that reduces the number of decoding operations by 2.85x while maintaining similar performance compared to non-adaptive uniform decoding. Beyond generation, VL-JEPA's embedding space naturally supports open-vocabulary classification, text-to-video retrieval, and discriminative VQA without any architecture modification. On eight video classification and eight video retrieval datasets, the average performance of VL-JEPA surpasses that of CLIP, SigLIP2, and Perception Encoder. At the same time, the model achieves performance comparable to classical VLMs (InstructBLIP, QwenVL) on four VQA datasets (GQA, TallyQA, POPE, and POPEv2), despite having only 1.6B parameters.

[40] MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation

Henghui Ding, Chang Liu, Shuting He, Kaining Ying, Xudong Jiang, Chen Change Loy, Yu-Gang Jiang

🧩 TL;DR

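This paper presents MeViS, a large-scale multi-modal dataset for segmenting and tracking objects in video based on language descriptions of their motion. It addresses the limited attention existing datasets give to motion, and proposes LMPM++, which sets new state-of-the-art results on several tasks.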


📘 Detailed Summary

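Motivation: Existing referring video segmentation datasets tend to focus on salient objects and use language rich in static attributes, so the target can often be identified from a single frame. They underemphasize the role of motion in both video and language. Exploring whether motion expressions and motion reasoning cues can drive pixel-level video understanding requires a dataset built specifically around motion expressions.

Method: The paper introduces the MeViS dataset, which contains 33,072 human-annotated motion expressions (text and audio) covering 8,171 objects in 2,006 videos of complex scenes. It benchmarks 15 existing methods across the 4 tasks MeViS supports: 6 referring video object segmentation (RVOS) methods, 3 audio-guided video object segmentation (AVOS) methods, 2 referring multi-object tracking (RMOT) methods, and 4 video captioning methods. It further proposes LMPM++ for the RVOS, AVOS, and RMOT tasks.

Result: The benchmarks expose weaknesses and limitations of existing methods on motion-expression-guided video understanding. LMPM++ achieves new state-of-the-art results on RVOS, AVOS, and RMOT. With its large set of motion-expression annotations in complex scenes, MeViS provides a platform for developing algorithms in this setting.

Conclusion: The study underlines the importance of motion expressions in video understanding and shows where existing methods fall short on motion reasoning. MeViS is a resource for developing more effective motion-expression-guided algorithms, and the success of LMPM++ suggests that exploiting motion information improves performance. The work pushes multi-modal video understanding toward finer-grained motion awareness.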


📄 Abstract

This paper proposes a large-scale multi-modal dataset for referring motion expression video segmentation, focusing on segmenting and tracking target objects in videos based on language description of objects' motions. Existing referring video segmentation datasets often focus on salient objects and use language expressions rich in static attributes, potentially allowing the target object to be identified in a single frame. Such datasets underemphasize the role of motion in both videos and languages. To explore the feasibility of using motion expressions and motion reasoning clues for pixel-level video understanding, we introduce MeViS, a dataset containing 33,072 human-annotated motion expressions in both text and audio, covering 8,171 objects in 2,006 videos of complex scenarios. We benchmark 15 existing methods across 4 tasks supported by MeViS, including 6 referring video object segmentation (RVOS) methods, 3 audio-guided video object segmentation (AVOS) methods, 2 referring multi-object tracking (RMOT) methods, and 4 video captioning methods for the newly introduced referring motion expression generation (RMEG) task. The results demonstrate weaknesses and limitations of existing methods in addressing motion expression-guided video understanding. We further analyze the challenges and propose an approach LMPM++ for RVOS/AVOS/RMOT that achieves new state-of-the-art results. Our dataset provides a platform that facilitates the development of motion expression-guided video understanding algorithms in complex video scenes. The proposed MeViS dataset and the method's source code are publicly available at https://henghuiding.com/MeViS/

cs.CL [Back]

[41] Workflow is All You Need: Escaping the "Statistical Smoothing Trap" via High-Entropy Information Foraging and Adversarial Pacing

Zhongjie Jiang

🧩 TL;DR

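This study proposes the DeepNews framework, which models the cognitive process of seasoned financial journalists to tackle the "impossible trinity" of long-form generation in vertical domains: achieving low hallucination, deep logical coherence, and personalized expression at the same time.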


📘 Detailed Summary

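Motivation: Current large language models face an "impossible trinity" in long-form generation for vertical domains: it is hard to achieve low hallucination, deep logical coherence, and personalized expression simultaneously. The paper attributes this to existing generative paradigms falling into the "Statistical Smoothing Trap", which overlooks the high-entropy information acquisition and structured cognitive processes that expert-level writing requires.

Method: The DeepNews framework is an agentic workflow that models the cognitive process of seasoned financial journalists. It has three core modules: a dual-granularity retrieval mechanism grounded in information foraging theory, which enforces a 10:1 saturated information input ratio to reduce hallucinated output; schema-guided strategic planning, which uses domain expert knowledge bases (narrative schemas) and Atomic Blocks to build a robust logical skeleton; and adversarial constraint prompting, which uses tactics such as Rhythm Break and Logic Fog to disrupt the probabilistic smoothness inherent in model-generated text.

Result: Experiments reveal a pronounced Knowledge Cliff in deep financial reporting: content truthfulness collapses when the retrieved context falls below 15,000 characters, while high-redundancy input exceeding 30,000 characters stabilizes the Hallucination-Free Rate above 85%. In an ecological validity blind test with a top-tier Chinese technology media outlet, the DeepNews system, built on the previous-generation model DeepSeek-V3-0324, achieved a 25% submission acceptance rate, compared with 0% for zero-shot generation by the state-of-the-art model GPT-5.

Conclusion: The study indicates that explicitly modeling expert cognitive processes, rather than relying on statistical smoothing, can address the "impossible trinity" of long-form generation in vertical domains. The results support the potential of agentic workflows grounded in cognitive science for improving professional content generation, and suggest a methodological direction for domain-specific LLM applications.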


📄 Abstract

Central to long-form text generation in vertical domains is the "impossible trinity" confronting current large language models (LLMs): the simultaneous achievement of low hallucination, deep logical coherence, and personalized expression. This study establishes that this bottleneck arises from existing generative paradigms succumbing to the Statistical Smoothing Trap, a phenomenon that overlooks the high-entropy information acquisition and structured cognitive processes integral to expert-level writing. To address this limitation, we propose the DeepNews Framework, an agentic workflow that explicitly models the implicit cognitive processes of seasoned financial journalists. The framework integrates three core modules: first, a dual-granularity retrieval mechanism grounded in information foraging theory, which enforces a 10:1 saturated information input ratio to mitigate hallucinatory outputs; second, schema-guided strategic planning, a process leveraging domain expert knowledge bases (narrative schemas) and Atomic Blocks to forge a robust logical skeleton; third, adversarial constraint prompting, a technique deploying tactics including Rhythm Break and Logic Fog to disrupt the probabilistic smoothness inherent in model-generated text. Experiments delineate a salient Knowledge Cliff in deep financial reporting: content truthfulness collapses when retrieved context falls below 15,000 characters, while a high-redundancy input exceeding 30,000 characters stabilizes the Hallucination-Free Rate (HFR) above 85%. In an ecological validity blind test conducted with a top-tier Chinese technology media outlet, the DeepNews system, built on a previous-generation model (DeepSeek-V3-0324), achieved a 25% submission acceptance rate, significantly outperforming the 0% acceptance rate of zero-shot generation by a state-of-the-art (SOTA) model (GPT-5).

[42] Multilingual VLM Training: Adapting an English-Trained VLM to French

Jules Lahmi, Alexis Roger

🧩 TL;DR

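This paper studies the challenges of adapting an English-trained vision-language model (VLM) to other languages. Comparing several adaptation methods on performance and computational cost, it finds that dataset translation is the main bottleneck for multilingual VLM performance.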


📘 Detailed Summary

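Motivation: Progress in vision-language models (VLMs) remains largely limited to English, which reduces accessibility for non-English speakers. VLM capabilities therefore need to be extended to a broader range of languages, and the paper examines the challenges of adapting an English-trained VLM to other languages.

Method: The study compares three adaptation methods: a translation-based pipeline, LoRA fine-tuning, and a two-stage fine-tuning strategy that separates vision adaptation from language adaptation. Evaluation combines standard multimodal benchmarks translated into the target language with manual assessment by native-speaker experts.

Result: Dataset translation is the main bottleneck for multilingual VLM performance, with data quality limiting the effectiveness of both training and evaluation. The adaptation methods also differ markedly in performance and computational cost.

Conclusion: Future work should focus on collecting native-language datasets and improving translation strategies. Data quality, rather than model architecture, is the key constraint on multilingual VLM performance.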


📄 Abstract

Artificial intelligence has made great progress in recent years, particularly in the development of Vision-Language Models (VLMs) that understand both visual and textual data. However, these advancements remain largely limited to English, reducing their accessibility for non-English speakers. It is essential to extend these capabilities to a broader range of languages. This paper explores the challenges of adapting an English-trained VLM to different languages. To this end, we explore and compare different methods in terms of performance and computational cost. We consider a translation-based pipeline, LoRA finetuning, and a two-stage finetuning strategy that separates vision adaptation from language adaptation. To evaluate these methods, we use a combination of standard multimodal benchmarks translated into the target language and manual assessments by native experts. The results reveal that dataset translation remains a major bottleneck in multilingual VLM performance, with data quality limiting the effectiveness of training and evaluation. These findings suggest that future efforts should focus on native-language dataset collection and improved translation strategies.

[43] Semantic Reconstruction of Adversarial Plagiarism: A Context-Aware Framework for Detecting and Restoring "Tortured Phrases" in Scientific Literature

Agniva Maiti, Prajwal Panth, Suresh Chandra Satapathy

🧩 TL;DR

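This paper proposes SRAP (Semantic Reconstruction of Adversarial Plagiarism), a framework that detects "tortured phrases" produced by automated paraphrasing tools and also recovers the original terminology. It addresses the weak detection of novel obfuscations and the lack of source attribution in existing methods.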


📘 Detailed Summary

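Motivation: The integrity and reliability of scientific literature are seriously threatened by adversarial text generation, in particular the use of automated paraphrasing tools to mask plagiarism. These tools produce "tortured phrases", statistically improbable synonyms (for example, "counterfeit consciousness" for "artificial intelligence") that preserve local grammar while obscuring the original source. Existing detection methods depend heavily on static blocklists or general-domain language models, which suffer from high false-negative rates on novel obfuscations and cannot determine the source of the plagiarized content.

Method: The paper proposes the SRAP framework with a two-stage architecture. Stage one performs statistical anomaly detection with a domain-specific masked language model, using token-level pseudo-perplexity. Stage two performs source-based semantic reconstruction, combining dense vector retrieval with sentence-level alignment. Concretely, it uses SciBERT for domain-specific modeling, FAISS for vector retrieval, and SBERT for sentence alignment.

Result: On a parallel corpus of adversarial scientific text, zero-shot baselines fail completely (0.00% restoration accuracy), while the proposed retrieval-augmented approach reaches 23.67% restoration accuracy. The study also finds that static decision boundaries are necessary for robust detection in jargon-heavy scientific text, because dynamic thresholding fails under high variance.

Conclusion: By linking obfuscated expressions back to their most probable source documents, SRAP enables forensic analysis. The study indicates that combining statistical anomaly detection with semantic reconstruction is a workable response to adversarial plagiarism in scientific literature, and it highlights the importance of domain-specific modeling and static decision boundaries.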


📄 Abstract

The integrity and reliability of scientific literature are facing a serious threat from adversarial text generation techniques, specifically from the use of automated paraphrasing tools to mask plagiarism. These tools generate "tortured phrases", statistically improbable synonyms (e.g., "counterfeit consciousness" for "artificial intelligence"), that preserve the local grammar while obscuring the original source. Most existing detection methods depend heavily on static blocklists or general-domain language models, which suffer from high false-negative rates for novel obfuscations and cannot determine the source of the plagiarized content. In this paper, we propose Semantic Reconstruction of Adversarial Plagiarism (SRAP), a framework designed not only to detect these anomalies but to mathematically recover the original terminology. We use a two-stage architecture: (1) statistical anomaly detection with a domain-specific masked language model (SciBERT) using token-level pseudo-perplexity, and (2) source-based semantic reconstruction using dense vector retrieval (FAISS) and sentence-level alignment (SBERT). Experiments on a parallel corpus of adversarial scientific text show that while zero-shot baselines fail completely (0.00 percent restoration accuracy), our retrieval-augmented approach achieves 23.67 percent restoration accuracy, significantly outperforming baseline methods. We also show that static decision boundaries are necessary for robust detection in jargon-heavy scientific text, since dynamic thresholding fails under high variance. SRAP enables forensic analysis by linking obfuscated expressions back to their most probable source documents.
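
The first stage can be made concrete with a small sketch of pseudo-perplexity and a static per-token threshold. It assumes the masked language model has already produced, for each token, the log-probability of that token when it alone is masked; the hard-coded numbers below are made-up illustrative values rather than SciBERT outputs, and the function names and the threshold of -6.0 are hypothetical.

```python
import math

def pseudo_perplexity(token_logprobs):
    """exp of the negative mean masked-token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def flag_tokens(tokens, token_logprobs, threshold=-6.0):
    """Flag tokens whose masked log-probability falls below a fixed threshold."""
    return [tok for tok, lp in zip(tokens, token_logprobs) if lp < threshold]

# Illustrative values only: a tortured phrase tends to receive very low
# probability from a domain-specific masked language model.
tokens = ["counterfeit", "consciousness", "is", "useful"]
logprobs = [-9.2, -8.5, -0.3, -1.1]
suspects = flag_tokens(tokens, logprobs)
sentence_pppl = pseudo_perplexity(logprobs)
```

A fixed threshold corresponds to the static decision boundary the abstract argues for; a threshold derived from each sentence's own mean and variance would be the dynamic alternative that the authors report failing under high variance.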

[44] Decoding Student Minds: Leveraging Conversational Agents for Psychological and Learning Analysis

Nour El Houda Ben Chaabene, Hamza Hammami, Laid Kahloul

🧩 TL;DR

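This paper presents a psychologically aware conversational agent that combines an LLM, a knowledge-graph-enhanced BERT, and a bidirectional LSTM with attention to classify students' cognitive and affective states in real time, with the aim of improving both learning performance and emotional well-being.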


📘 Detailed Summary

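Motivation: Existing educational conversational agents are usually limited to a single function, either tutoring or emotional support, and cannot address students' cognitive and affective needs together. The study aims to fill this gap with an integrated system that recognizes both kinds of state in real time.

Method: The system uses a multimodal fusion architecture that combines large language models, a knowledge-graph-enhanced BERT for semantic understanding, and a bidirectional LSTM with attention for temporal modeling. It integrates textual semantics, prosodic speech features, and temporal behavioral trends to infer students' engagement, stress level, and conceptual understanding in real time.

Result: A pilot study with university students showed improved motivation, reduced stress, and moderate academic gains compared with baseline methods, supporting the use of multimodal fusion in educational conversational agents.

Conclusion: Integrating semantic reasoning, multimodal fusion, and temporal modeling can support adaptive, student-centered educational interventions. The approach is a promising direction for more comprehensive, personalized educational support and underlines the importance of attending to cognitive and affective dimensions together.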


📄 Abstract

This paper presents a psychologically-aware conversational agent designed to enhance both learning performance and emotional well-being in educational settings. The system combines Large Language Models (LLMs), a knowledge graph-enhanced BERT (KG-BERT), and a bidirectional Long Short-Term Memory (LSTM) with attention to classify students' cognitive and affective states in real time. Unlike prior chatbots limited to either tutoring or affective support, our approach leverages multimodal data, including textual semantics, prosodic speech features, and temporal behavioral trends, to infer engagement, stress, and conceptual understanding. A pilot study with university students demonstrated improved motivation, reduced stress, and moderate academic gains compared to baseline methods. These results underline the promise of integrating semantic reasoning, multimodal fusion, and temporal modeling to support adaptive, student-centered educational interventions.

[45] AgriGPT-Omni: A Unified Speech-Vision-Text Framework for Multilingual Agricultural Intelligence

Bo Yang, Lanfei Feng, Yunkui Chen, Yu Zhang, Jianyu Zhang, Xiao Xu, Nueraili Aierken, Shijian Li

🧩 TL;DR

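This paper presents AgriGPT-Omni, an agricultural omni-modal framework that integrates speech, vision, and text. It addresses three gaps in agricultural AI: the lack of multilingual speech data, the absence of a unified multimodal architecture, and the shortage of comprehensive evaluation benchmarks.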


📘 Detailed Summary

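Motivation: Despite rapid advances in multimodal large language models, agricultural applications remain constrained by the lack of multilingual speech data, unified multimodal architectures, and comprehensive evaluation benchmarks, which hinders broad adoption and reproducible research.

Method: The study proposes the AgriGPT-Omni framework. It first builds a scalable data synthesis and collection pipeline that converts agricultural texts and images into training data. It then trains the model with a three-stage paradigm: textual knowledge injection, progressive multimodal alignment, and GRPO-based reinforcement learning. Finally, it proposes AgriBench-Omni-2K, a tri-modal benchmark.

Result: AgriGPT-Omni significantly outperforms general-purpose baselines on multilingual and multimodal reasoning and on real-world speech understanding. The work also produces the largest agricultural speech dataset to date, with 492K synthetic and 1.4K real speech samples across six languages.

Conclusion: With a unified agricultural omni-modal framework and a comprehensive benchmark, the study promotes reproducible research, inclusive agricultural intelligence, and sustainable AI development for low-resource regions, and provides a systematic solution and evaluation standard for multimodal AI in agriculture.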


📄 Abstract

Despite rapid advances in multimodal large language models, agricultural applications remain constrained by the lack of multilingual speech data, unified multimodal architectures, and comprehensive evaluation benchmarks. To address these challenges, we present AgriGPT-Omni, an agricultural omni-framework that integrates speech, vision, and text in a unified framework. First, we construct a scalable data synthesis and collection pipeline that converts agricultural texts and images into training data, resulting in the largest agricultural speech dataset to date, including 492K synthetic and 1.4K real speech samples across six languages. Second, based on this, we train the first agricultural omni-model via a three-stage paradigm: textual knowledge injection, progressive multimodal alignment, and GRPO-based reinforcement learning, enabling unified reasoning across languages and modalities. Third, we propose AgriBench-Omni-2K, the first tri-modal benchmark for agriculture, covering diverse speech-vision-text tasks and multilingual slices, with standardized protocols and reproducible tools. Experiments show that AgriGPT-Omni significantly outperforms general-purpose baselines on multilingual and multimodal reasoning as well as real-world speech understanding. All models, data, benchmarks, and code will be released to promote reproducible research, inclusive agricultural intelligence, and sustainable AI development for low-resource regions.

[46] The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality

Aileen Cheng, Alon Jacovi, Amir Globerson, Ben Golan, Charles Kwong, Chris Alberti, Connie Tao, Eyal Ben-David, Gaurav Singh Tomar, Lukas Haas, Yonatan Bitton, Adam Bloniarz, Aijun Bai, Andrew Wang, Anfal Siddiqui, Arturo Bajuelos Castillo, Aviel Atias, Chang Liu, Corey Fry, Daniel Balle, Deepanway Ghosal, Doron Kukliansky, Dror Marcus, Elena Gribovskaya, Eran Ofek, Honglei Zhuang, Itay Laish, Jan Ackermann, Lily Wang, Meg Risdal, Megan Barnes, Michael Fink, Mohamed Amin, Moran Ambar, Natan Potikha, Nikita Gupta, Nitzan Katz, Noam Velan, Ofir Roval, Ori Ram, Polina Zablotskaia, Prathamesh Bang, Priyanka Agrawal, Rakesh Ghiya, Sanjay Ganapathy, Simon Baumgartner, Sofia Erell, Sushant Prakash, Thibault Sellam, Vikram Rao, Xuanhui Wang, Yaroslav Akulov, Yulong Yang, Zhen Yang, Zhixin Lai, Zhongru Wu, Anca Dragan, Avinatan Hassidim, Fernando Pereira, Slav Petrov, Srinivasan Venkatachary, Tulsee Doshi, Yossi Matias, Sasha Goldshtein, Dipanjan Das

🧩 TL;DR

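This paper introduces the FACTS Leaderboard, an online evaluation suite and set of benchmarks that comprehensively evaluates how well language models generate factually accurate text across diverse scenarios, providing a holistic measure of factuality through four sub-leaderboards.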


📘 Detailed Summary

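Motivation: Comprehensive benchmarks for evaluating the factual accuracy of language models are lacking. Existing approaches tend to focus on a single aspect and cannot give an overall picture of a model's factuality across scenarios, so an evaluation framework is needed that measures a model's ability to generate factually accurate text in a variety of realistic settings.

Method: The suite comprises four specialized sub-leaderboards: FACTS Multimodal evaluates the factuality of responses to image-based questions; FACTS Parametric evaluates world knowledge stored in model parameters; FACTS Search evaluates factuality in information-seeking scenarios; and FACTS Grounding (v2) evaluates whether long-form responses are grounded in provided documents. Each sub-leaderboard uses automated judge models to score responses, and the final suite score is the average of the four components.

Result: The work delivers an actively maintained online evaluation platform with public and private splits to protect evaluation integrity. It provides a comprehensive framework for assessing language model factuality, scores models in a standardized way across scenarios using automated judge models, and establishes a benchmark that can be updated over time.

Conclusion: The FACTS Leaderboard offers a comprehensive and balanced framework for evaluating language model factuality. Its multi-dimensional evaluation reflects model reliability in practical use more accurately, and its ongoing maintenance is expected to advance factuality research and give model developers a useful evaluation standard.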


📄 Abstract

We introduce The FACTS Leaderboard, an online leaderboard suite and associated set of benchmarks that comprehensively evaluates the ability of language models to generate factually accurate text across diverse scenarios. The suite provides a holistic measure of factuality by aggregating the performance of models on four distinct sub-leaderboards: (1) FACTS Multimodal, which measures the factuality of responses to image-based questions; (2) FACTS Parametric, which assesses models' world knowledge by answering closed-book factoid questions from internal parameters; (3) FACTS Search, which evaluates factuality in information-seeking scenarios, where the model must use a search API; and (4) FACTS Grounding (v2), which evaluates whether long-form responses are grounded in provided documents, featuring significantly improved judge models. Each sub-leaderboard employs automated judge models to score model responses, and the final suite score is an average of the four components, designed to provide a robust and balanced assessment of a model's overall factuality. The FACTS Leaderboard Suite will be actively maintained, containing both public and private splits to allow for external participation while guarding its integrity. It can be found at https://www.kaggle.com/benchmarks/google/facts .
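The aggregation described in the abstract can be sketched in a few lines. This is a minimal illustration that assumes an unweighted mean; the key names and the example values are hypothetical and do not come from the leaderboard.

```python
from statistics import mean

# Hypothetical keys for the four sub-leaderboards named in the abstract.
SUB_LEADERBOARDS = ("multimodal", "parametric", "search", "grounding_v2")


def suite_score(scores: dict[str, float]) -> float:
    """Return the unweighted mean of the four sub-leaderboard scores.

    Assumption: the abstract says only "an average"; equal weighting is
    the simplest reading and is used here for illustration.
    """
    missing = [name for name in SUB_LEADERBOARDS if name not in scores]
    if missing:
        raise ValueError(f"missing sub-leaderboard scores: {missing}")
    return mean(scores[name] for name in SUB_LEADERBOARDS)


# Illustrative values only, not real leaderboard results.
example = {"multimodal": 0.5, "parametric": 0.7, "search": 0.6, "grounding_v2": 0.8}
print(round(suite_score(example), 4))
```

Requiring all four scores, instead of averaging whatever is present, keeps a model from looking stronger simply because one component was skipped.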

cs.AI [Back]

[47] Echo-CoPilot: A Multi-View, Multi-Task Agent for Echocardiography Interpretation and Reporting

Moein Heidari, Mohammad Amin Roohi, Armin Khosravi, Ilker Hacihaliloglu

🧩 TL;DR

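This paper presents Echo-CoPilot, a multi-view, multi-task agent that uses a large language model to orchestrate a suite of specialized echocardiography tools and produce a clinically coherent, comprehensive echocardiography assessment. It achieves 50.8% accuracy on the MIMIC-EchoQA benchmark, outperforming general-purpose and biomedical video vision-language models.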


📘 Detailed Summary

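Motivation: Echocardiography is essential to cardiovascular care, but full-study interpretation remains a cognitively demanding, multi-view task performed manually. Existing echocardiography foundation models perform well on individual perceptual subtasks such as view classification, segmentation, or disease prediction, yet they typically operate in isolation and cannot provide a unified, clinically coherent assessment.

Method: Echo-CoPilot uses an LLM-based agent architecture that orchestrates a suite of specialized echocardiography tools within a ReAct-style loop. The agent decomposes clinician queries, invokes tools for view recognition, cardiac structure segmentation, measurement and disease prediction, and report generation, and integrates their outputs into guideline-aware answers and narrative summaries.

Result: On the public MIMIC-EchoQA benchmark, Echo-CoPilot achieves 50.8% accuracy, outperforming general-purpose and biomedical video vision-language models. Qualitative analysis shows that the agent uses quantitative measurements and physiologic context to resolve challenging cases near clinical decision thresholds, such as borderline left ventricular hypertrophy or pericardial effusion severity.

Conclusion: The study demonstrates the potential of LLM-based agents to orchestrate specialized medical tools and deliver a clinically coherent, comprehensive echocardiography assessment. The approach offers a new paradigm for multi-view, multi-task integration in medical imaging and may improve the practicality and accuracy of clinical decision support systems.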


📄 Abstract

Echocardiography is central to contemporary cardiovascular care, but full-study interpretation remains a cognitively demanding, multi-view task that is still performed manually. While recent foundation models for echocardiography can achieve strong performance on individual perceptual subtasks such as view classification, segmentation, or disease prediction, they typically operate in isolation and do not provide a unified, clinically coherent assessment. In this work, we introduce Echo-CoPilot, a multi-view, multi-task agent that uses a large language model to orchestrate a suite of specialized echocardiography tools. Within a ReAct-style loop, the agent decomposes clinician queries, invokes tools for view recognition, cardiac structure segmentation, measurement and disease prediction, and report synthesis, and integrates their outputs into guideline-aware answers and narrative summaries. We evaluate Echo-CoPilot on the public MIMIC-EchoQA benchmark, where it achieves an accuracy of 50.8%, outperforming both general-purpose and biomedical video vision-language models. Qualitative analyses further show that the agent leverages quantitative measurements and physiologic context to resolve challenging cases near clinical decision thresholds, such as borderline left ventricular hypertrophy or pericardial effusion severity. The code will be released upon acceptance of the paper.
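A ReAct-style loop of the kind the abstract mentions alternates between choosing an action and observing a tool result until the agent is ready to answer. The sketch below is a generic illustration, not the authors' implementation: the tool names, their placeholder outputs, and `scripted_policy` (a stand-in for the LLM) are all hypothetical.

```python
from typing import Callable

# Hypothetical stand-ins for specialized tools; real tools would run models
# on echo videos. The returned strings are made-up placeholders.
TOOLS: dict[str, Callable[[str], str]] = {
    "view_recognition": lambda study: "parasternal long-axis",
    "measurement": lambda study: "LV wall thickness 1.2 cm",
}


def scripted_policy(query: str, observations: list[str]) -> tuple[str, str]:
    """Stand-in for the LLM: decide the next step from the observations so far."""
    plan = ["view_recognition", "measurement"]
    if len(observations) < len(plan):
        return "call", plan[len(observations)]
    return "answer", "; ".join(observations)


def react_loop(query: str, study: str, policy=scripted_policy, max_steps: int = 5) -> str:
    """Alternate between acting (tool calls) and observing until an answer is given."""
    observations: list[str] = []
    for _ in range(max_steps):
        kind, payload = policy(query, observations)
        if kind == "answer":
            return payload
        # Record the tool output so the next decision can condition on it.
        observations.append(f"{payload}: {TOOLS[payload](study)}")
    return "; ".join(observations)  # step budget exhausted


print(react_loop("Is there left ventricular hypertrophy?", "study-001"))
```

The `max_steps` bound is a common safeguard in agent loops so that a policy that never answers cannot run indefinitely.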

[48] Exploring LLMs for Scientific Information Extraction Using The SciEx Framework

Sha Li, Ayush Sadekar, Nathan Self, Yiqi Su, Lars Andersland, Mira Chaplin, Annabel Zhang, Hyoju Yang, James B Henderson, Krista Wigginton, Linsey Marr, T. M. Murali, Naren Ramakrishnan

🧩 TL;DR

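This paper presents SciEx, a modular and composable framework that addresses the challenges of extracting information from scientific literature, including long-context documents, multimodal content, and standardizing fine-grained information across multiple publications.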


📘 Detailed Summary

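Motivation: Existing LLM methods and tools face several challenges with scientific literature: long-context documents, difficulty handling multimodal content, and standardizing inconsistent fine-grained information from multiple publications into a unified format. Re-architecting or fine-tuning existing systems becomes especially difficult when the desired data schema or extraction ontology changes rapidly.

Method: SciEx uses a modular and composable design that decouples key components such as PDF parsing, multimodal retrieval, information extraction, and aggregation. This design supports on-demand data extraction while allowing flexible integration of new models, prompting strategies, and reasoning mechanisms, which improves extensibility and adaptability.

Result: The study evaluates the accuracy and consistency of SciEx's fine-grained extraction on datasets spanning three scientific topics. The evaluation shows the framework's performance and also yields practical insights into the strengths and limitations of current LLM-based extraction pipelines.

Conclusion: The study underscores the importance of modular design for the challenges of scientific literature extraction and provides a practical framework for building adaptable extraction systems. The results reveal the practical capability boundaries of current LLM pipelines and point to an extensible architecture for future systems.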


📄 Abstract

Large language models (LLMs) are increasingly touted as powerful tools for automating scientific information extraction. However, existing methods and tools often struggle with the realities of scientific literature: long-context documents, multi-modal content, and reconciling varied and inconsistent fine-grained information across multiple publications into standardized formats. These challenges are further compounded when the desired data schema or extraction ontology changes rapidly, making it difficult to re-architect or fine-tune existing systems. We present SciEx, a modular and composable framework that decouples key components including PDF parsing, multi-modal retrieval, extraction, and aggregation. This design streamlines on-demand data extraction while enabling extensibility and flexible integration of new models, prompting strategies, and reasoning mechanisms. We evaluate SciEx on datasets spanning three scientific topics for its ability to extract fine-grained information accurately and consistently. Our findings provide practical insights into both the strengths and limitations of current LLM-based pipelines.

[49] SimWorld-Robotics: Synthesizing Photorealistic and Dynamic Urban Environments for Multimodal Robot Navigation and Collaboration

Yan Zhuang, Jiawei Ren, Xiaokang Ye, Jianzhi Shen, Ruixuan Zhang, Tianai Yue, Muhammad Faayez, Xuhong He, Ziqiao Ma, Lianhui Qin, Zhiting Hu, Tianmin Shu

🧩 TL;DR

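This paper presents SimWorld-Robotics (SWR), a simulation platform for large-scale, photorealistic urban environments built on Unreal Engine 5, together with two challenging robot benchmark tasks for comprehensively evaluating multimodal perception, reasoning, and planning in realistic urban scenes.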


📘 Detailed Summary

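Motivation: Foundation model research on generalist robots has focused mainly on indoor household scenarios. Simulation platforms and benchmarks for large-scale, photorealistic urban environments are lacking, so key capabilities such as multimodal perception, spatial reasoning, safe navigation, and multi-robot collaboration cannot be tested comprehensively in complex urban scenes.

Method: SWR is built on Unreal Engine 5 and procedurally generates an unlimited number of photorealistic urban scenes with dynamic elements such as pedestrians and traffic systems; it also supports multi-robot control and communication. Two benchmarks are built on it: a multimodal instruction-following task, in which a robot follows vision-language navigation instructions to reach a destination amid pedestrians and traffic, and a multi-agent search task, in which two robots communicate to cooperatively locate and meet each other.

Result: Experiments show that state-of-the-art models, including vision-language models, perform poorly on SWR tasks and lack the robust perception, reasoning, and planning needed in complex urban environments. The platform surpasses prior urban simulators in realism, complexity, and scalability and, for the first time, comprehensively evaluates multimodal instruction grounding, 3D spatial reasoning, safe long-range navigation, multi-robot collaboration, and grounded communication in urban settings.

Conclusion: SWR fills a gap in robot simulation for urban environments and provides an important tool for evaluating and developing generalist robot systems aimed at real-world use. The study exposes the limitations of current advanced models in urban scenes, stresses the need for robots with robust perception and reasoning, and offers challenging benchmarks and simulation environments for future research.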


📄 Abstract

Recent advances in foundation models have shown promising results in developing generalist robots that can perform diverse tasks in open-ended scenarios given multimodal inputs. However, current work has mainly focused on indoor, household scenarios. In this work, we present SimWorld-Robotics (SWR), a simulation platform for embodied AI in large-scale, photorealistic urban environments. Built on Unreal Engine 5, SWR procedurally generates unlimited photorealistic urban scenes populated with dynamic elements such as pedestrians and traffic systems, surpassing prior urban simulations in realism, complexity, and scalability. It also supports multi-robot control and communication. With these key features, we build two challenging robot benchmarks: (1) a multimodal instruction-following task, where a robot must follow vision-language navigation instructions to reach a destination in the presence of pedestrians and traffic; and (2) a multi-agent search task, where two robots must communicate to cooperatively locate and meet each other. Unlike existing benchmarks, these two new benchmarks comprehensively evaluate a wide range of critical robot capacities in realistic scenarios, including (1) multimodal instruction grounding, (2) 3D spatial reasoning in large environments, (3) safe, long-range navigation with people and traffic, (4) multi-robot collaboration, and (5) grounded communication. Our experimental results demonstrate that state-of-the-art models, including vision-language models (VLMs), struggle with our tasks, lacking the robust perception, reasoning, and planning abilities necessary for urban environments.

[50] Investigating The Functional Roles of Attention Heads in Vision Language Models: Evidence for Reasoning Modules

Yanbei Jiang, Xueqi Ma, Shu Liu, Sarah Monazam Erfani, Tongliang Liu, James Bailey, Jey Han Lau, Krista A. Ehinger

🧩 TL;DR

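This paper proposes a novel interpretability framework for systematically analyzing the internal mechanisms of vision-language models (VLMs), focusing on the functional roles of attention heads in multimodal reasoning, and uses intervention experiments to verify that these functional heads are critical to model performance.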


📘 Detailed Summary

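Motivation: Although VLMs excel on multimodal benchmarks, their internal mechanisms remain largely a black box. A systematic understanding of how the models work internally, particularly the functional roles of attention heads in multimodal reasoning, is missing.

Method: The study introduces the CogVision dataset, which decomposes complex multimodal questions into step-by-step subquestions that simulate human reasoning, with each subquestion tied to a specific perceptual or cognitive function. A probing-based methodology identifies the attention heads that specialize in these functions and characterizes them as functional heads.

Result: The analysis finds that functional heads are universally sparse across VLM families, vary in number and distribution by function, and mediate interactions and hierarchical organization. Intervention experiments show that removing functional heads degrades performance while emphasizing them improves accuracy, confirming their critical role in multimodal reasoning.

Conclusion: The findings offer new insight into the cognitive organization of VLMs, reveal functional specialization of attention heads that parallels human perception and reasoning, and suggest promising directions for designing models with more human-aligned perceptual and reasoning abilities.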


📄 Abstract

Despite excelling on multimodal benchmarks, vision-language models (VLMs) largely remain a black box. In this paper, we propose a novel interpretability framework to systematically analyze the internal mechanisms of VLMs, focusing on the functional roles of attention heads in multimodal reasoning. To this end, we introduce CogVision, a dataset that decomposes complex multimodal questions into step-by-step subquestions designed to simulate human reasoning through a chain-of-thought paradigm, with each subquestion associated with specific receptive or cognitive functions such as high-level visual reception and inference. Using a probing-based methodology, we identify attention heads that specialize in these functions and characterize them as functional heads. Our analysis across diverse VLM families reveals that these functional heads are universally sparse, vary in number and distribution across functions, and mediate interactions and hierarchical organization. Furthermore, intervention experiments demonstrate their critical role in multimodal reasoning: removing functional heads leads to performance degradation, while emphasizing them enhances accuracy. These findings provide new insights into the cognitive organization of VLMs and suggest promising directions for designing models with more human-aligned perceptual and reasoning abilities.

[51] User-Feedback-Driven Continual Adaptation for Vision-and-Language Navigation

Yongqiang Yu, Xuhui Li, Hazza Mahmood, Jinxing Zhou, Haodong Hong, Longtao Jiang, Zhiqiang Xu, Qi Wu, Xiaojun Chang

🧩 TL;DR

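This paper proposes a user-feedback-driven adaptation framework for General Scene Adaptation in vision-and-language navigation. By systematically integrating human interaction and a memory-bank warm-start mechanism, it significantly improves navigation performance and deployment stability.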


📘 Detailed Summary

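Motivation: Current General Scene Adaptation frameworks for vision-and-language navigation rely mainly on unsupervised adaptation from environmental exposure and exclude user feedback entirely, even though it is a natural and valuable supervision signal. This limits how much adaptation quality can improve when moving from static benchmarks to real deployment.

Method: The paper proposes a user-feedback-driven adaptation framework that systematically integrates human interaction into continual learning. It converts user feedback (navigation instructions and corrective signals) into high-quality, environment-aligned training data, and uses a memory-bank warm-start mechanism to reuse previously acquired environmental knowledge, mitigating cold-start degradation and ensuring stable redeployment.

Result: Experiments on the GSA-R2R benchmark show that the method consistently surpasses strong baselines such as GR-DUET and significantly improves navigation success and path efficiency. The memory-bank warm start stabilizes early navigation and reduces performance drops after updates, and the method is robust and general under both continual and hybrid adaptation settings.

Conclusion: The study demonstrates the key value of user feedback for adaptation in vision-and-language navigation. The proposed framework enables efficient and realistic adaptation, offers insights for designing continual learning systems in real deployment scenarios, and shows sustained improvement across different deployment conditions.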


📄 Abstract

Vision-and-Language Navigation (VLN) requires agents to navigate complex environments by following natural-language instructions. General Scene Adaptation for VLN (GSA-VLN) shifts the focus from zero-shot generalization to continual, environment-specific adaptation, narrowing the gap between static benchmarks and real-world deployment. However, current GSA-VLN frameworks exclude user feedback, relying solely on unsupervised adaptation from repeated environmental exposure. In practice, user feedback offers natural and valuable supervision that can significantly enhance adaptation quality. We introduce a user-feedback-driven adaptation framework that extends GSA-VLN by systematically integrating human interactions into continual learning. Our approach converts user feedback (navigation instructions and corrective signals) into high-quality, environment-aligned training data, enabling efficient and realistic adaptation. A memory-bank warm-start mechanism further reuses previously acquired environmental knowledge, mitigating cold-start degradation and ensuring stable redeployment. Experiments on the GSA-R2R benchmark show that our method consistently surpasses strong baselines such as GR-DUET, improving navigation success and path efficiency. The memory-bank warm start stabilizes early navigation and reduces performance drops after updates. Results under both continual and hybrid adaptation settings confirm the robustness and generality of our framework, demonstrating sustained improvement across diverse deployment conditions.

[52] Boosting RL-Based Visual Reasoning with Selective Adversarial Entropy Intervention

Yang Yu, Zhuangzhuang Chen, Siqi Wang, Lanqing Li, Xiaomeng Li

🧩 TL;DR

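This paper proposes Selective-adversarial Entropy Intervention (SaEI), which distorts the visual input with a token-selective adversarial objective derived from the entropy of sampled responses, thereby increasing policy entropy and improving the reasoning ability of vision-language models.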


📘 Detailed Summary

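Motivation: Existing RL-based fine-tuning methods for vision-language models usually intervene in entropy only by controlling the updates of specific tokens. They overlook entropy intervention during the RL sampling stage, which limits the diversity of policy exploration and the potential performance gains.

Method: The method has two core components. Entropy-guided adversarial sampling (EgAS) treats the entropy of sampled responses as an adversarial objective and generates adversarial gradients to attack the visual input. Token-selective entropy computation (TsEC) maximizes the effectiveness of the adversarial attack without distorting the factual knowledge within the vision-language model.

Result: Extensive experiments on in-domain and out-of-domain datasets show that the method significantly strengthens policy exploration through entropy intervention and effectively improves the reasoning performance of vision-language models, demonstrating the importance of entropy intervention at the sampling stage.

Conclusion: The study reveals the key role of entropy intervention during RL sampling in improving VLM reasoning. The proposed selective adversarial entropy intervention offers an effective way to increase exploration diversity while preserving the model's factual knowledge.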


📄 Abstract

Recently, reinforcement learning (RL) has become a common choice in enhancing the reasoning capabilities of vision-language models (VLMs). Considering existing RL-based finetuning methods, entropy intervention turns out to be an effective way to benefit exploratory ability, thereby improving policy performance. Notably, most existing studies intervene in entropy by simply controlling the update of specific tokens during policy optimization of RL. They ignore the entropy intervention during the RL sampling that can boost the performance of GRPO by improving the diversity of responses. In this paper, we propose Selective-adversarial Entropy Intervention, namely SaEI, which enhances policy entropy by distorting the visual input with the token-selective adversarial objective coming from the entropy of sampled responses. Specifically, we first propose entropy-guided adversarial sampling (EgAS) that formulates the entropy of sampled responses as an adversarial objective. Then, the corresponding adversarial gradient can be used to attack the visual input for producing adversarial samples, allowing the policy model to explore a larger answer space during RL sampling. Then, we propose token-selective entropy computation (TsEC) to maximize the effectiveness of adversarial attack in EgAS without distorting factual knowledge within VLMs. Extensive experiments on both in-domain and out-of-domain datasets show that our proposed method can greatly improve policy exploration via entropy intervention, to boost reasoning capabilities. Code will be released once the paper is accepted.
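The quantity at the center of the method, the entropy of a sampled response restricted to selected tokens, can be illustrated as follows. This is only a sketch of the objective's value: the selection mask is a hypothetical input (the abstract does not state the selection rule), and the gradient-based attack on the visual input is omitted.

```python
import math


def token_entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def selective_entropy_objective(token_probs: list[list[float]], selected: list[bool]) -> float:
    """Mean entropy over the selected token positions of one sampled response.

    Assumption: `selected` marks positions allowed to contribute; how the
    paper chooses them is not specified in the abstract.
    """
    entropies = [token_entropy(p) for p, keep in zip(token_probs, selected) if keep]
    return sum(entropies) / len(entropies) if entropies else 0.0


# Two positions: one uniform over four tokens, one fully confident.
response = [[0.25, 0.25, 0.25, 0.25], [1.0, 0.0, 0.0, 0.0]]
print(selective_entropy_objective(response, [True, False]))
```

In an actual attack, this scalar would be computed from model logits with an autodiff framework so that its gradient with respect to the image pixels is available.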

[53] Targeted Data Protection for Diffusion Model by Matching Training Trajectory

Hojun Lee, Mijin Koo, Yeji Song, Nojun Kwak

🧩 TL;DR

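This paper proposes TAFAP (Trajectory Alignment via Fine-tuning with Adversarial Perturbations), the first method to achieve effective targeted data protection by controlling the entire training trajectory. It significantly outperforms existing snapshot-matching approaches and achieves simultaneous control over identity and visual patterns in diffusion models.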


📘 Detailed Summary

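Motivation: Personalized fine-tuning of diffusion models is increasingly accessible but raises serious concerns about unauthorized data use and privacy infringement. Existing protection methods are limited to passively degrading image quality and cannot achieve stable control. Targeted data protection offers a promising paradigm of active redirection, but existing attempts rely on snapshot matching, ignore the complete learning dynamics, and therefore suffer from poor controllability.

Method: TAFAP uses trajectory alignment, fine-tuning with adversarial perturbations to control the entire training trajectory. Unlike snapshot-based methods, TAFAP draws on dataset distillation and uses trajectory matching to enforce persistent, verifiable transformations throughout fine-tuning. By controlling the training dynamics rather than a single state, it keeps the protective effect from being diluted as training progresses.

Result: Extensive experiments show that TAFAP achieves the first successful targeted transformation in diffusion models, with simultaneous control over identity and visual patterns. It significantly outperforms existing TDP attempts, achieving robust redirection toward target concepts while maintaining high image quality. The results indicate that TAFAP provides a verifiable protective effect and a consistent transformation throughout training.

Conclusion: This work enables verifiable safeguards and provides a new framework for controlling and tracing alterations in diffusion model outputs. Through trajectory alignment, TAFAP addresses the limitations of existing targeted data protection methods, offers a more effective technical route for privacy protection and authorization control, and advances research on the safety and controllability of diffusion models.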


📄 Abstract

Recent advancements in diffusion models have made fine-tuning text-to-image models for personalization increasingly accessible, but have also raised significant concerns regarding unauthorized data usage and privacy infringement. Current protection methods are limited to passively degrading image quality, failing to achieve stable control. While Targeted Data Protection (TDP) offers a promising paradigm for active redirection toward user-specified target concepts, existing TDP attempts suffer from poor controllability due to snapshot-matching approaches that fail to account for complete learning dynamics. We introduce TAFAP (Trajectory Alignment via Fine-tuning with Adversarial Perturbations), the first method to successfully achieve effective TDP by controlling the entire training trajectory. Unlike snapshot-based methods whose protective influence is easily diluted as training progresses, TAFAP employs trajectory-matching inspired by dataset distillation to enforce persistent, verifiable transformations throughout fine-tuning. We validate our method through extensive experiments, demonstrating the first successful targeted transformation in diffusion models with simultaneous control over both identity and visual patterns. TAFAP significantly outperforms existing TDP attempts, achieving robust redirection toward target concepts while maintaining high image quality. This work enables verifiable safeguards and provides a new framework for controlling and tracing alterations in diffusion model outputs.
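For readers unfamiliar with trajectory matching, the sketch below shows the normalized parameter-distance loss commonly used in the dataset distillation literature. It is shown as background only: whether TAFAP uses this exact form is an assumption, and the flat parameter lists are a simplification of real model weights.

```python
def trajectory_matching_loss(
    student_end: list[float],
    expert_start: list[float],
    expert_end: list[float],
) -> float:
    """Squared distance from the student's end point to the expert's end point,
    normalized by how far the expert itself moved over the same segment.

    All three arguments are flattened parameter vectors (an illustrative
    simplification); the student is assumed to start from `expert_start`.
    """
    numerator = sum((s - e) ** 2 for s, e in zip(student_end, expert_end))
    denominator = sum((a - e) ** 2 for a, e in zip(expert_start, expert_end))
    if denominator == 0:
        raise ValueError("expert segment has zero length")
    return numerator / denominator


# A student that did not move scores 1.0; one that reached the expert's end point scores 0.0.
print(trajectory_matching_loss([0.0, 0.0], [0.0, 0.0], [1.0, 2.0]))
print(trajectory_matching_loss([1.0, 2.0], [0.0, 0.0], [1.0, 2.0]))
```

The normalization keeps the loss comparable across segments where the expert parameters move by different amounts.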

[54] Zero-shot 3D Map Generation with LLM Agents: A Dual-Agent Architecture for Procedural Content Generation

Lim Chien Her, Ming Yan, Yunshu Bai, Ruihao Li, Hao Zhang

🧩 TL;DR

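This paper proposes a training-free LLM agent architecture for zero-shot configuration of procedural content generation (PCG) parameters. Two agents, an Actor and a Critic, collaborate to iteratively refine parameter configurations, providing a semantic bridge from natural-language instructions to precise parameter specifications.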


📘 Detailed Summary

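Motivation: PCG can algorithmically create complex, customizable virtual worlds, but controlling it requires precise configuration of opaque technical parameters. Off-the-shelf LLMs often fail to bridge the semantic gap between abstract user instructions and strict parameter specifications, which limits natural-language interfaces for PCG tools.

Method: The paper proposes a training-free architecture built on an Actor-Critic dual-agent framework, in which the Actor agent produces the initial parameter configuration and the Critic agent performs iterative reasoning and refinement. By autonomously reasoning over tool parameters and refining the configuration step by step, the system progressively aligns with human design preferences without task-specific fine-tuning.

Result: Experiments validate the method on a variety of 3D map generation tasks and establish a new benchmark for instruction following in PCG. The method outperforms single-agent baselines and produces diverse, structurally valid environments from natural-language descriptions, showing that off-the-shelf LLMs can be repurposed as generalized agents for arbitrary PCG tools.

Conclusion: The study shows that off-the-shelf LLMs can master parameter configuration of complex software through an agent architecture rather than model training. Shifting the burden from model training to architectural reasoning yields a scalable framework for natural-language control of PCG tools and advances zero-shot configuration methods.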


📄 Abstract

Procedural Content Generation (PCG) offers scalable methods for algorithmically creating complex, customizable worlds. However, controlling these pipelines requires the precise configuration of opaque technical parameters. We propose a training-free architecture that utilizes LLM agents for zero-shot PCG parameter configuration. While Large Language Models (LLMs) promise a natural language interface for PCG tools, off-the-shelf models often fail to bridge the semantic gap between abstract user instructions and strict parameter specifications. Our system pairs an Actor agent with a Critic agent, enabling an iterative workflow where the system autonomously reasons over tool parameters and refines configurations to progressively align with human design preferences. We validate this approach on the generation of various 3D maps, establishing a new benchmark for instruction-following in PCG. Experiments demonstrate that our approach outperforms single-agent baselines, producing diverse and structurally valid environments from natural language descriptions. These results demonstrate that off-the-shelf LLMs can be effectively repurposed as generalized agents for arbitrary PCG tools. By shifting the burden from model training to architectural reasoning, our method offers a scalable framework for mastering complex software without task-specific fine-tuning.
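The iterative Actor and Critic workflow can be pictured as a propose-and-review loop. The sketch below is a toy illustration under stated assumptions: both agents are replaced by deterministic functions, and the parameter names (`map_size`, `height_scale`) and the keyword rules are hypothetical, not taken from the paper or any PCG tool.

```python
def actor(instruction: str, feedback: dict, params: dict) -> dict:
    """Stand-in for the Actor LLM: apply the Critic's suggested corrections."""
    updated = dict(params)
    updated.update(feedback)
    return updated


def critic(instruction: str, params: dict) -> dict:
    """Stand-in for the Critic LLM: return corrections, or {} when satisfied."""
    corrections = {}
    if "large" in instruction and params.get("map_size", 0) < 256:
        corrections["map_size"] = 256
    if "mountain" in instruction and params.get("height_scale", 0.0) < 0.8:
        corrections["height_scale"] = 0.8
    return corrections


def configure(instruction: str, max_rounds: int = 4) -> dict:
    """Refine a parameter configuration until the Critic has no objections."""
    params = {"map_size": 64, "height_scale": 0.2}  # hypothetical tool defaults
    feedback: dict = {}
    for _ in range(max_rounds):
        params = actor(instruction, feedback, params)
        feedback = critic(instruction, params)
        if not feedback:
            break
    return params


print(configure("a large mountain map"))
```

Separating proposal from review means the reviewing step can be swapped or tightened without touching how configurations are produced, which is one plausible reason to prefer two agents over one.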

[55] Refinement Contrastive Learning of Cell-Gene Associations for Unsupervised Cell Type Identification

Liang Peng, Haopeng Liu, Yixuan Ye, Cheng Liu, Wenjun Shen, Si Wu, Hau-San Wong

🧩 TL;DR

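This paper proposes scRCL, a refinement contrastive learning framework that improves unsupervised cell type identification in single-cell omics by explicitly incorporating cell-gene interactions, addressing a key limitation of existing methods that ignore cell-gene associations.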


📘 Detailed Summary

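Motivation: Most current single-cell clustering methods focus on intrinsic cellular structure and overlook the key role of cell-gene associations, which limits their ability to distinguish closely related cell types. A method that effectively exploits cell-gene interactions is therefore needed to improve cell type identification.

Method: The paper proposes the refinement contrastive learning framework scRCL, which includes two contrastive distribution alignment components that reveal reliable intrinsic cellular structure, plus a refinement module that integrates gene-correlation structure learning and enhances cell embeddings by capturing underlying cell-gene associations.

Result: Extensive experiments on several single-cell RNA-seq and spatial transcriptomics benchmark datasets show that the method consistently outperforms state-of-the-art baselines in cell type identification accuracy. Downstream biological analyses further confirm that the recovered cell populations exhibit coherent gene-expression signatures.

Conclusion: The study demonstrates the importance of explicitly incorporating cell-gene interactions for unsupervised cell type identification. The proposed framework improves identification accuracy and biological relevance, giving single-cell omics research a more reliable tool for discovering cell populations.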


📄 Abstract

Unsupervised cell type identification is crucial for uncovering and characterizing heterogeneous populations in single-cell omics studies. Although a range of clustering methods have been developed, most focus exclusively on intrinsic cellular structure and ignore the pivotal role of cell-gene associations, which limits their ability to distinguish closely related cell types. To this end, we propose a Refinement Contrastive Learning framework (scRCL) that explicitly incorporates cell-gene interactions to derive more informative representations. Specifically, we introduce two contrastive distribution alignment components that reveal reliable intrinsic cellular structures by effectively exploiting cell-cell structural relationships. Additionally, we develop a refinement module that integrates gene-correlation structure learning to enhance cell embeddings by capturing underlying cell-gene associations. This module strengthens connections between cells and their associated genes, refining the representation learning to exploit biologically meaningful relationships. Extensive experiments on several single-cell RNA-seq and spatial transcriptomics benchmark datasets demonstrate that our method consistently outperforms state-of-the-art baselines in cell-type identification accuracy. Moreover, downstream biological analyses confirm that the recovered cell populations exhibit coherent gene-expression signatures, further validating the biological relevance of our approach. The code is available at https://github.com/THPengL/scRCL.

[56] CAPTAIN: Semantic Feature Injection for Memorization Mitigation in Text-to-Image Diffusion Models

Tong Zhang, Carlos Hinojosa, Bernard Ghanem

🧩 TL;DR

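This paper proposes CAPTAIN, a training-free framework that mitigates memorization in diffusion models by directly modifying latent features during denoising, substantially reducing replication of training samples while preserving prompt alignment.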


📘 Detailed Summary

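Motivation: Diffusion models can unintentionally reproduce training samples during generation, raising privacy and copyright concerns. Existing inference-time mitigation methods, such as manipulating classifier-free guidance or perturbing prompt embeddings, often struggle to reduce memorization while keeping the output aligned with the conditioning prompt.

Method: CAPTAIN applies frequency-based noise initialization to reduce the tendency to replicate memorized patterns early in denoising, identifies the optimal denoising timesteps for feature injection, and localizes memorized regions. It then injects semantically aligned features from non-memorized reference images into the localized latent regions, suppressing memorization while preserving prompt fidelity and visual quality.

Result: Experiments show that, compared with CFG-based baselines, CAPTAIN substantially reduces memorization while maintaining strong alignment with the intended prompt and good generation quality.

Conclusion: The study shows that directly manipulating latent features, rather than modifying the guidance mechanism, can effectively mitigate memorization in diffusion models without sacrificing prompt alignment, offering a new technical route for deploying generative models that respect privacy and copyright.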


📄 Abstract

Diffusion models can unintentionally reproduce training examples, raising privacy and copyright concerns as these systems are increasingly deployed at scale. Existing inference-time mitigation methods typically manipulate classifier-free guidance (CFG) or perturb prompt embeddings; however, they often struggle to reduce memorization without compromising alignment with the conditioning prompt. We introduce CAPTAIN, a training-free framework that mitigates memorization by directly modifying latent features during denoising. CAPTAIN first applies frequency-based noise initialization to reduce the tendency to replicate memorized patterns early in the denoising process. It then identifies the optimal denoising timesteps for feature injection and localizes memorized regions. Finally, CAPTAIN injects semantically aligned features from non-memorized reference images into localized latent regions, suppressing memorization while preserving prompt fidelity and visual quality. Our experiments show that CAPTAIN achieves substantial reductions in memorization compared to CFG-based baselines while maintaining strong alignment with the intended prompt.

[57] Enhancing Radiology Report Generation and Visual Grounding using Reinforcement Learning

Benjamin Gundersen, Nicolas Deperrois, Samuel Ruiperez-Campillo, Thomas M. Sutter, Julia E. Vogt, Michael Moor, Farhad Nooralahzadeh, Michael Krauthammer

🧩 TL;DR

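This study examines the roles of reinforcement learning (RL) and explicit reasoning ("thinking") in chest X-ray vision-language models (VLMs). RL substantially improves report generation and visual grounding, whereas explicit reasoning brings no additional gain. The resulting RadVLM models reach state-of-the-art performance on both tasks.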


📘 Detailed Summary

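Motivation: Many medical VLMs rely solely on supervised fine-tuning (SFT), which optimizes next-token prediction without evaluating answer quality. RL can incorporate task-specific feedback, and combining it with explicit intermediate reasoning has shown clear gains on math and coding tasks. This study investigates the effects of RL and explicit reasoning in a chest X-ray VLM.

Method: The study first performs large-scale SFT on chest X-ray data to build an updated RadVLM based on Qwen3-VL, then uses a cold-start SFT stage to give the model basic reasoning ability. It next applies Group Relative Policy Optimization (GRPO) with clinically grounded, task-specific rewards for report generation and visual grounding, and runs matched RL experiments on domain-specific and general-domain Qwen3-VL variants, with and without explicit reasoning.

Result: Strong SFT remains crucial for base performance, but RL provides additional gains on both report generation and visual grounding, whereas explicit reasoning does not further improve results. Under a unified evaluation pipeline, the RL-optimized RadVLM models outperform their baseline counterparts and reach state-of-the-art performance on both tasks.

Conclusion: The study confirms that clinically aligned RL is a strong complement to SFT in medical VLMs, and that explicit reasoning does not show the expected advantage for chest X-ray interpretation. RL optimization improves practical performance in medical image analysis and offers a new technical route for optimizing medical AI models.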


📄 Abstract

Recent advances in vision-language models (VLMs) have improved Chest X-ray (CXR) interpretation in multiple aspects. However, many medical VLMs rely solely on supervised fine-tuning (SFT), which optimizes next-token prediction without evaluating answer quality. In contrast, reinforcement learning (RL) can incorporate task-specific feedback, and its combination with explicit intermediate reasoning ("thinking") has demonstrated substantial gains on verifiable math and coding tasks. To investigate the effects of RL and thinking in a CXR VLM, we perform large-scale SFT on CXR data to build an updated RadVLM based on Qwen3-VL, followed by a cold-start SFT stage that equips the model with basic thinking ability. We then apply Group Relative Policy Optimization (GRPO) with clinically grounded, task-specific rewards for report generation and visual grounding, and run matched RL experiments on both domain-specific and general-domain Qwen3-VL variants, with and without thinking. Across these settings, we find that while strong SFT remains crucial for high base performance, RL provides additional gains on both tasks, whereas explicit thinking does not appear to further improve results. Under a unified evaluation pipeline, the RL-optimized RadVLM models outperform their baseline counterparts and reach state-of-the-art performance on both report generation and grounding, highlighting clinically aligned RL as a powerful complement to SFT for medical VLMs.
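GRPO scores each sampled response relative to the other responses drawn for the same prompt. A minimal sketch of that group-relative advantage follows; the reward values are illustrative, and using the population standard deviation with a small `eps` is an assumption, since implementations differ on these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Standardize rewards within one group of responses to the same prompt.

    Assumption: population standard deviation plus a small eps for stability;
    the paper's exact normalization is not given in the abstract.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical report-generation rewards for one chest X-ray prompt.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print([round(a, 3) for a in advantages])
```

Because the baseline is the group mean, no separate value network is needed, which is the main practical appeal of the group-relative formulation.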