Table of Contents
cs.CV
[1] Semantic-Consistent Bidirectional Contrastive Hashing for Noisy Multi-Label Cross-Modal Retrieval
Likang Peng, Chao Su, Wenyuan Wu, Yuan Sun, Dezhong Peng, Xi Peng, Xu Wang
🧩 TL;DR
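This paper proposes Semantic-Consistent Bidirectional Contrastive Hashing (SCBCH), a framework that addresses label noise and semantic overlap in multi-label cross-modal retrieval through a cross-modal semantic-consistent classification module and a bidirectional soft contrastive hashing module.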
📘 Detailed Summary
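Motivation: Existing cross-modal hashing methods rely heavily on fully annotated datasets, yet label noise is prevalent in real-world multi-label settings and significantly degrades retrieval performance. Existing methods also tend to overlook the partial semantic overlap inherent in multi-label data, which limits their robustness and generalization.
Method: SCBCH comprises two complementary modules. Cross-modal Semantic-Consistent Classification (CSCC) uses cross-modal semantic consistency to estimate sample reliability and reduce the influence of noisy labels. Bidirectional Soft Contrastive Hashing (BSCH) dynamically generates soft contrastive sample pairs from multi-label semantic overlap, enabling adaptive contrastive learning between semantically similar and dissimilar samples across modalities.
Result: Extensive experiments on four widely used cross-modal retrieval benchmarks validate the method's effectiveness and robustness; it consistently outperforms state-of-the-art methods under noisy multi-label conditions.
Conclusion: The study shows that combining cross-modal semantic consistency with adaptive contrastive learning improves cross-modal retrieval under noisy multi-label conditions, offering a new way to handle imperfectly annotated real-world data.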
📄 Abstract
Cross-modal hashing (CMH) facilitates efficient retrieval across different modalities (e.g., image and text) by encoding data into compact binary representations. While recent methods have achieved remarkable performance, they often rely heavily on fully annotated datasets, which are costly and labor-intensive to obtain. In real-world scenarios, particularly in multi-label datasets, label noise is prevalent and severely degrades retrieval performance. Moreover, existing CMH approaches typically overlook the partial semantic overlaps inherent in multi-label data, limiting their robustness and generalization. To tackle these challenges, we propose a novel framework named Semantic-Consistent Bidirectional Contrastive Hashing (SCBCH). The framework comprises two complementary modules: (1) Cross-modal Semantic-Consistent Classification (CSCC), which leverages cross-modal semantic consistency to estimate sample reliability and reduce the impact of noisy labels; (2) Bidirectional Soft Contrastive Hashing (BSCH), which dynamically generates soft contrastive sample pairs based on multi-label semantic overlap, enabling adaptive contrastive learning between semantically similar and dissimilar samples across modalities. Extensive experiments on four widely-used cross-modal retrieval benchmarks validate the effectiveness and robustness of our method, consistently outperforming state-of-the-art approaches under noisy multi-label conditions.
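The idea of soft contrastive pairs driven by multi-label overlap can be illustrated with a small sketch. The abstract does not state how BSCH measures overlap, so the Jaccard index used here is an assumption, and the function names `label_overlap` and `soft_pair_targets` are hypothetical.

```python
def label_overlap(a, b):
    """Jaccard overlap between two multi-hot label vectors (lists of 0/1).

    Assumption: Jaccard is only one plausible overlap measure; the paper
    may define semantic overlap differently.
    """
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union if union else 0.0


def soft_pair_targets(image_labels, text_labels):
    """Soft similarity targets: rows index images, columns index texts."""
    return [[label_overlap(i, t) for t in text_labels] for i in image_labels]


image_labels = [[1, 1, 0], [0, 0, 1]]
text_labels = [[1, 0, 0], [0, 0, 1]]
targets = soft_pair_targets(image_labels, text_labels)
print(targets)
```

A pair sharing one of two labels gets a target of 0.5 rather than a hard 0 or 1, which is the kind of graded supervision a soft contrastive objective can consume.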
[2] Cancer-Net PCa-MultiSeg: Multimodal Enhancement of Prostate Cancer Lesion Segmentation Using Synthetic Correlated Diffusion Imaging
Jarett Dewbury, Chi-en Amy Tai, Alexander Wong
🧩 TL;DR
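This paper proposes synthetic correlated diffusion imaging (CDI^s) as an enhancement to standard diffusion imaging protocols. Evaluated on six state-of-the-art segmentation architectures, CDI^s reliably improves prostate cancer lesion segmentation, with relative improvements of up to 72.5%, and can be deployed in clinical workflows without additional scan time.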
📘 Detailed Summary
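Motivation: Current deep learning methods for prostate cancer lesion segmentation perform poorly on large patient cohorts, with Dice scores of 0.32 or lower, which highlights the need for more effective imaging enhancement strategies.
Method: The study uses synthetic correlated diffusion imaging (CDI^s) as an enhancement to standard diffusion imaging protocols and systematically evaluates six state-of-the-art segmentation architectures on co-registered CDI^s, diffusion-weighted imaging (DWI), and apparent diffusion coefficient (ADC) sequences from 200 patients.
Result: Integrating CDI^s reliably improves or preserves segmentation performance in 94% of the evaluated configurations, and individual architectures achieve statistically significant relative improvements of up to 72.5% over baseline modalities. The CDI^s + DWI combination yields significant improvements in half of the evaluated architectures with no instance of degradation.
Conclusion: Because CDI^s is derived from existing DWI acquisitions, it can be deployed clinically right away without additional scan time or architectural changes, establishing a validated and practical integration pathway for prostate cancer lesion segmentation.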
📄 Abstract
Current deep learning approaches for prostate cancer lesion segmentation achieve limited performance, with Dice scores of 0.32 or lower in large patient cohorts. To address this limitation, we investigate synthetic correlated diffusion imaging (CDI$^s$) as an enhancement to standard diffusion-based protocols. We conduct a comprehensive evaluation across six state-of-the-art segmentation architectures using 200 patients with co-registered CDI$^s$, diffusion-weighted imaging (DWI) and apparent diffusion coefficient (ADC) sequences. We demonstrate that CDI$^s$ integration reliably enhances or preserves segmentation performance in 94% of evaluated configurations, with individual architectures achieving up to 72.5% statistically significant relative improvement over baseline modalities. CDI$^s$ + DWI emerges as the safest enhancement pathway, achieving significant improvements in half of evaluated architectures with zero instances of degradation. Since CDI$^s$ derives from existing DWI acquisitions without requiring additional scan time or architectural modifications, it enables immediate deployment in clinical workflows. Our results establish validated integration pathways for CDI$^s$ as a practical drop-in enhancement for PCa lesion segmentation tasks across diverse deep learning architectures.
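The two quantities the abstract reports, the Dice score and the relative improvement over a baseline, can be computed as in the sketch below. The masks and the scores 0.2 and 0.345 are made-up illustrative values, not results from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient for two flat binary masks (lists of 0/1)."""
    inter = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention: two empty masks count as a perfect match.
    return 2.0 * inter / total if total else 1.0


def relative_improvement(baseline, enhanced):
    """Relative improvement of `enhanced` over `baseline`, in percent."""
    return 100.0 * (enhanced - baseline) / baseline


pred = [1, 1, 0, 0]
target = [1, 0, 0, 0]
score = dice_score(pred, target)          # 2 * 1 / (2 + 1)
gain = relative_improvement(0.2, 0.345)   # hypothetical Dice values
print(score, gain)
```

With a baseline Dice as low as the abstract describes, a modest absolute gain translates into a large relative improvement, which is worth keeping in mind when reading the 72.5% figure.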
[3] Visual Bridge: Universal Visual Perception Representations Generating
Yilin Gao, Shuguang Dou, Junzhou Li, Zhiheng Yu, Yin Li, Dongsheng Jiang, Shugong Xu
🧩 TL;DR
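This paper proposes a universal visual perception framework based on flow matching that generates diverse visual representations across multiple tasks, moving beyond the single-task limitation of conventional diffusion models in vision and achieving competitive zero-shot and fine-tuned performance.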
📘 Detailed Summary
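Motivation: Diffusion models have achieved notable success on isolated computer vision tasks such as text-to-image generation, depth estimation, and optical flow, but the "single-task-single-model" paradigm severely limits their generalizability and scalability in multi-task scenarios. Inspired by the cross-domain generalization of large language models, this work targets universality in multi-task visual perception.
Method: The method casts visual representation generation as a universal flow-matching problem from image patch tokens to task-specific representations, rather than as an independent generation or regression problem. Using a strong self-supervised foundation model as the anchor and introducing a multi-scale, circular task embedding mechanism, it learns a universal velocity field that bridges heterogeneous tasks and supports efficient, flexible representation transfer.
Result: Extensive experiments on classification, detection, segmentation, depth estimation, and image-text retrieval show competitive performance in both zero-shot and fine-tuned settings, outperforming prior generalist models and several specialist models. Ablation studies further validate the framework's robustness, scalability, and generalization.
Conclusion: The work is a significant step toward general-purpose visual perception and provides a solid foundation for future research on universal vision modeling. It demonstrates that cross-task visual representation generation via flow matching is feasible and opens a direction toward more general vision systems.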
📄 Abstract
Recent advances in diffusion models have achieved remarkable success in isolated computer vision tasks such as text-to-image generation, depth estimation, and optical flow. However, these models are often restricted by a "single-task-single-model" paradigm, severely limiting their generalizability and scalability in multi-task scenarios. Motivated by the cross-domain generalization ability of large language models, we propose a universal visual perception framework based on flow matching that can generate diverse visual representations across multiple tasks. Our approach formulates the process as a universal flow-matching problem from image patch tokens to task-specific representations rather than an independent generation or regression problem. By leveraging a strong self-supervised foundation model as the anchor and introducing a multi-scale, circular task embedding mechanism, our method learns a universal velocity field to bridge the gap between heterogeneous tasks, supporting efficient and flexible representation transfer. Extensive experiments on classification, detection, segmentation, depth estimation, and image-text retrieval demonstrate that our model achieves competitive performance in both zero-shot and fine-tuned settings, outperforming prior generalist and several specialist models. Ablation studies further validate the robustness, scalability, and generalization of our framework. Our work marks a significant step towards general-purpose visual perception, providing a solid foundation for future research in universal vision modeling.
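To make the flow-matching formulation concrete, here is a minimal sketch using the common straight-line probability path, where the regression target for the velocity field is simply the difference between the end point and the start point. Whether the paper uses this exact path is an assumption; `patch_token` and `task_repr` are toy stand-ins for a source token and a task-specific representation.

```python
def interpolate(x0, x1, t):
    """Point on the straight-line path from x0 (t=0) to x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity a model would regress onto for the straight-line path."""
    return [b - a for a, b in zip(x0, x1)]


def euler_integrate(x0, velocity_fn, steps=10):
    """Transport x0 along a velocity field with fixed-step Euler updates."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


patch_token = [0.0, 1.0]   # toy source token
task_repr = [2.0, -1.0]    # toy task-specific representation
v = target_velocity(patch_token, task_repr)
midpoint = interpolate(patch_token, task_repr, 0.5)
# Integrating the ideal (constant) velocity carries the token to the target.
result = euler_integrate(patch_token, lambda x, t: v)
print(midpoint, result)
```

In training, a network conditioned on the task embedding would replace the lambda and be fit to `target_velocity` at sampled points along the path.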
[4] Exploring the Underwater World Segmentation without Extra Training
Bingyu Li, Tao Huo, Da Zhang, Zhiyuan Zhao, Junyu Gao, Xuelong Li
🧩 TL;DR
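This paper introduces AquaOV255, the first large-scale, fine-grained underwater open-vocabulary segmentation dataset, and the UOVSBench benchmark, together with Earth2Ocean, a training-free land-to-ocean transfer framework that substantially improves underwater open-vocabulary segmentation through a geometric-guided visual mask generator and a category-visual semantic alignment module.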
📘 Detailed Summary
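Motivation: Existing segmentation datasets and models are largely limited to terrestrial scenes, and large-scale fine-grained segmentation resources for underwater environments are lacking. This limits the accuracy of marine biodiversity monitoring and ecological assessment, and calls for a dedicated underwater open-vocabulary segmentation benchmark and an effective transfer method.
Method: The Earth2Ocean framework has two core components. The Geometric-guided Visual Mask Generator (GMG) refines visual features with self-similarity geometric priors to strengthen local structure perception. The Category-visual Semantic Alignment (CSA) module enhances text embeddings through multimodal large language model reasoning and scene-aware template construction. Together they transfer terrestrial vision-language models to the underwater domain without additional underwater training.
Result: Extensive experiments on UOVSBench show that Earth2Ocean achieves significant performance gains while maintaining efficient inference. The benchmark combines AquaOV255, which contains 255 categories and more than 20,000 images, with five additional underwater datasets to provide a comprehensive evaluation platform for underwater open-vocabulary segmentation.
Conclusion: The study shows that, with a carefully designed transfer framework, models pretrained on terrestrial data can adapt to underwater segmentation without additional training. This opens a new path for marine visual analysis, and the dataset and benchmark provide a foundation for future underwater computer vision research.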
📄 Abstract
Accurate segmentation of marine organisms is vital for biodiversity monitoring and ecological assessment, yet existing datasets and models remain largely limited to terrestrial scenes. To bridge this gap, we introduce AquaOV255, the first large-scale and fine-grained underwater segmentation dataset containing 255 categories and over 20K images, covering diverse categories for open-vocabulary (OV) evaluation. Furthermore, we establish the first underwater OV segmentation benchmark, UOVSBench, by integrating AquaOV255 with five additional underwater datasets to enable comprehensive evaluation. Alongside, we present Earth2Ocean, a training-free OV segmentation framework that transfers terrestrial vision-language models (VLMs) to underwater domains without any additional underwater training. Earth2Ocean consists of two core components: a Geometric-guided Visual Mask Generator (GMG) that refines visual features via self-similarity geometric priors for local structure perception, and a Category-visual Semantic Alignment (CSA) module that enhances text embeddings through multimodal large language model reasoning and scene-aware template construction. Extensive experiments on the UOVSBench benchmark demonstrate that Earth2Ocean achieves significant performance improvement on average while maintaining efficient inference.
[5] Federated CLIP for Resource-Efficient Heterogeneous Medical Image Classification
Yihang Wu, Ahmad Chaddad
🧩 TL;DR
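This paper proposes FedMedCLIP, a CLIP-based federated learning method that uses a masked feature adaptation module and a local classifier design to deliver high performance at low resource cost for medical image classification.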
📘 Detailed Summary
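Motivation: Deep learning models perform well in medical imaging, but conventional training requires access to source data, which raises privacy risks. Federated learning offers a decentralized alternative, yet data heterogeneity and resource costs limit its deployment, especially with vision-language models.
Method: The FedMedCLIP framework uses a masked feature adaptation module (FAM) as the communication module to reduce communication load, and freezes the CLIP encoders to reduce computational overhead. A masked MLP serves as a private local classifier that adapts to each client's task, and an adaptive KL-divergence-based distillation regularizer enables mutual learning between the two modules.
Result: Experiments on four public medical datasets show that the method outperforms the second-best baseline by 8% on ISIC2019 while being far more resource-efficient, running 120× faster than FedAVG.
Conclusion: The study demonstrates the feasibility of CLIP-based federated learning for medical image classification. Through modular design and model compression, it substantially reduces communication and computation costs while maintaining performance, offering a practical solution for privacy-preserving medical AI.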
📄 Abstract
Despite the remarkable performance of deep models in medical imaging, they still require source data for training, which limits their potential in light of privacy concerns. Federated learning (FL), as a decentralized learning framework that trains a shared model with multiple hospitals (a.k.a., FL clients), provides a feasible solution. However, data heterogeneity and resource costs hinder the deployment of FL models, especially when using vision language models (VLM). To address these challenges, we propose a novel contrastive language-image pre-training (CLIP) based FL approach for medical image classification (FedMedCLIP). Specifically, we introduce a masked feature adaptation module (FAM) as a communication module to reduce the communication load while freezing the CLIP encoders to reduce the computational overhead. Furthermore, we propose a masked multi-layer perceptron (MLP) as a private local classifier to adapt to the client tasks. Moreover, we design an adaptive Kullback-Leibler (KL) divergence-based distillation regularization method to enable mutual learning between FAM and MLP. Finally, we incorporate model compression to transmit the FAM parameters while using ensemble predictions for classification. Extensive experiments on four publicly available medical datasets demonstrate that our model provides feasible performance (e.g., 8% higher compared to second-best baseline on ISIC2019) with reasonable resource cost (e.g., 120$\times$ faster than FedAVG).
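The mutual-learning regularizer between the FAM and the local MLP is described as KL-divergence-based distillation. A minimal sketch of a symmetric KL term between two sets of logits follows; the fixed `weight` argument is a placeholder assumption, since the abstract does not say how the adaptive weighting is computed.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def mutual_distillation(logits_a, logits_b, weight=1.0):
    """Symmetric KL between two heads, so each one learns from the other.

    Assumption: `weight` is a fixed scalar here; the paper's adaptive
    scheme is not specified in the abstract.
    """
    p, q = softmax(logits_a), softmax(logits_b)
    return weight * (kl_divergence(p, q) + kl_divergence(q, p))


agree = mutual_distillation([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
disagree = mutual_distillation([3.0, 0.0, 0.0], [0.0, 3.0, 0.0])
print(agree, disagree)
```

The term vanishes when both heads predict the same distribution and grows as they diverge, which is what lets it act as a regularizer pulling the shared and private modules together.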
[6] Laytrol: Preserving Pretrained Knowledge in Layout Control for Multimodal Diffusion Transformers
Sida Huang, Siqi Huang, Ping Luo, Hongyuan Zhang
🧩 TL;DR
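This paper proposes the Layout Control (Laytrol) network and the Layout Synthesis (LaySyn) dataset, which address the loss of visual quality and stylistic consistency in layout-to-image generation with diffusion models by inheriting pretrained parameters and using a dedicated initialization scheme.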
📘 Detailed Summary
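Motivation: Existing layout-to-image methods usually introduce the layout condition through adapter modules, but the generated images often have low visual quality and a style inconsistent with the base model. This indicates a loss of pretrained knowledge and a distribution shift that needs to be addressed.
Method: The Layout Control (Laytrol) network inherits its parameters from MM-DiT to preserve the base model's pretrained knowledge. A dedicated initialization scheme initializes the layout encoder as a pure text encoder and sets the outputs of the layout control network to zero. Object-level Rotary Position Embedding is applied to the layout tokens to provide coarse positional information.
Result: Qualitative and quantitative experiments demonstrate the method's effectiveness, with generated images performing well in both visual quality and spatial consistency.
Conclusion: By inheriting pretrained parameters and using a carefully designed initialization scheme, the method mitigates the distribution shift, provides an effective way to improve spatial controllability in diffusion models, and shows that precise layout control is possible while preserving pretrained knowledge.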
📄 Abstract
With the development of diffusion models, enhancing spatial controllability in text-to-image generation has become a vital challenge. As a representative task for addressing this challenge, layout-to-image generation aims to generate images that are spatially consistent with the given layout condition. Existing layout-to-image methods typically introduce the layout condition by integrating adapter modules into the base generative model. However, the generated images often exhibit low visual quality and stylistic inconsistency with the base model, indicating a loss of pretrained knowledge. To alleviate this issue, we construct the Layout Synthesis (LaySyn) dataset, which leverages images synthesized by the base model itself to mitigate the distribution shift from the pretraining data. Moreover, we propose the Layout Control (Laytrol) Network, in which parameters are inherited from MM-DiT to preserve the pretrained knowledge of the base model. To effectively activate the copied parameters and avoid disturbance from unstable control conditions, we adopt a dedicated initialization scheme for Laytrol. In this scheme, the layout encoder is initialized as a pure text encoder to ensure that its output tokens remain within the data domain of MM-DiT. Meanwhile, the outputs of the layout control network are initialized to zero. In addition, we apply Object-level Rotary Position Embedding to the layout tokens to provide coarse positional information. Qualitative and quantitative experiments demonstrate the effectiveness of our method.
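The effect of initializing the control network's outputs to zero can be seen in a toy residual block: at the start of training the control branch contributes nothing, so the base model's behavior is unchanged. The sketch below is a generic illustration of that principle, not Laytrol's actual architecture; `controlled_block` and the tensor shapes are hypothetical.

```python
def linear(x, weight, bias):
    """Dense layer: weight is a list of rows, one per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def controlled_block(base_hidden, control_hidden, out_weight, out_bias):
    """Add the projected control signal to the base model's hidden state."""
    residual = linear(control_hidden, out_weight, out_bias)
    return [h + r for h, r in zip(base_hidden, residual)]


dim = 3
zero_weight = [[0.0] * dim for _ in range(dim)]  # zero-initialized projection
zero_bias = [0.0] * dim

base = [0.5, -1.0, 2.0]
control = [9.0, 9.0, 9.0]  # arbitrary, possibly unstable control features
out = controlled_block(base, control, zero_weight, zero_bias)
print(out)
```

Because the projection starts at zero, even a noisy control signal cannot disturb the pretrained output until gradients move the projection weights away from zero.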
[7] Libra-MIL: Multimodal Prototypes Stereoscopic Infused with Task-specific Language Priors for Few-shot Whole Slide Image Classification
Zhenfeng Zhuang, Fangyu Zhou, Liansheng Wang
🧩 TL;DR
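This paper proposes a multimodal prototype-based multi-instance learning method that builds task-specific pathological entity prototypes and a bidirectional interaction mechanism to address the computational challenges and label sparsity of whole slide image modeling in computational pathology, showing strong generalization on multiple cancer datasets.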
📘 Detailed Summary
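Motivation: The high computational cost of whole slide images in computational pathology makes multi-instance learning (MIL) necessary. Pathology tasks usually provide only bag-level labels, while instance-level descriptions generated by large language models are often biased because they lack fine-grained medical knowledge. Existing vision-language MIL methods also tend to use unidirectional guidance, which limits cross-modal synergy.
Method: The proposed multimodal prototype-based MIL method uses a frozen large language model to generate task-specific pathological entity descriptions that are learned as text prototypes, while the vision branch learns instance-level prototypes to reduce reliance on redundant data. In the fusion stage, a Stereoscopic Optimal Transport (SOT) algorithm based on a similarity metric promotes semantic alignment in a higher-dimensional space.
Result: Few-shot classification and explainability experiments on three distinct cancer datasets show that the method generalizes better than competing approaches on pathology image analysis tasks.
Conclusion: Constructing task-specific pathological entity prototypes is crucial for learning generalizable features and improving interpretability. The bidirectional interaction mechanism, built on a balanced information compression scheme, promotes cross-modal synergy and offers a new perspective on multimodal learning in computational pathology.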
📄 Abstract
While Large Language Models (LLMs) are emerging as a promising direction in computational pathology, the substantial computational cost of giga-pixel Whole Slide Images (WSIs) necessitates the use of Multi-Instance Learning (MIL) to enable effective modeling. A key challenge is that pathological tasks typically provide only bag-level labels, while instance-level descriptions generated by LLMs often suffer from bias due to a lack of fine-grained medical knowledge. To address this, we propose that constructing task-specific pathological entity prototypes is crucial for learning generalizable features and enhancing model interpretability. Furthermore, existing vision-language MIL methods often employ unidirectional guidance, limiting cross-modal synergy. In this paper, we introduce a novel approach, Multimodal Prototype-based Multi-Instance Learning, that promotes bidirectional interaction through a balanced information compression scheme. Specifically, we leverage a frozen LLM to generate task-specific pathological entity descriptions, which are learned as text prototypes. Concurrently, the vision branch learns instance-level prototypes to mitigate the model's reliance on redundant data. For the fusion stage, we employ the Stereoscopic Optimal Transport (SOT) algorithm, which is based on a similarity metric, thereby facilitating broader semantic alignment in a higher-dimensional space. We conduct few-shot classification and explainability experiments on three distinct cancer datasets, and the results demonstrate the superior generalization capabilities of our proposed method.
[8] Multi-Modal Assistance for Unsupervised Domain Adaptation on Point Cloud 3D Object Detection
Shenao Zhao, Pengpeng Liang, Zhoufan Yang
🧩 TL;DR
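This paper proposes MMAssist, which improves LiDAR-based 3D unsupervised domain adaptation for object detection through multi-modal assistance, using image and text features as bridges for cross-domain feature alignment and outperforming existing methods on three mainstream 3D detection datasets.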
📘 Detailed Summary
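Motivation: Point clouds and images are usually collected together, yet the potential of image data in 3D unsupervised domain adaptation training remains underexplored. Existing methods based on the teacher-student architecture and pseudo labels rarely use multi-modal information to improve adaptation.
Method: MMAssist projects 3D bounding boxes onto images to obtain 2D boxes, extracts image features with a pretrained vision backbone, and uses a large vision-language model to generate text descriptions whose features are obtained with a text encoder. During training in the source and target domains, 3D features are aligned with these image and text features and fused with learned weights. An off-the-shelf 2D detector is also used to improve pseudo-label quality.
Result: On three domain adaptation tasks across three popular 3D object detection datasets, the method achieves notable gains over state-of-the-art methods, demonstrating the value of multi-modal assistance in 3D unsupervised domain adaptation.
Conclusion: The study shows that image and text information can serve as effective bridges for cross-domain alignment of 3D features, offering a new multi-modal fusion paradigm for 3D unsupervised domain adaptation. Finer-grained multi-modal interaction mechanisms are left for future work.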
📄 Abstract
Unsupervised domain adaptation for LiDAR-based 3D object detection (3D UDA) based on the teacher-student architecture with pseudo labels has achieved notable improvements in recent years. Although it is quite popular to collect point clouds and images simultaneously, little attention has been paid to the usefulness of image data in 3D UDA when training the models. In this paper, we propose an approach named MMAssist that improves the performance of 3D UDA with multi-modal assistance. A method is designed to align 3D features between the source domain and the target domain by using image and text features as bridges. More specifically, we project the ground truth labels or pseudo labels to the images to get a set of 2D bounding boxes. For each 2D box, we extract its image feature from a pre-trained vision backbone. A large vision-language model (LVLM) is adopted to extract the box's text description, and a pre-trained text encoder is used to obtain its text feature. During the training of the model in the source domain and the student model in the target domain, we align the 3D features of the predicted boxes with their corresponding image and text features, and the 3D features and the aligned features are fused with learned weights for the final prediction. The features between the student branch and the teacher branch in the target domain are aligned as well. To enhance the pseudo labels, we use an off-the-shelf 2D object detector to generate 2D bounding boxes from images and estimate their corresponding 3D boxes with the aid of the point cloud, and these 3D boxes are combined with the pseudo labels generated by the teacher model. Experimental results show that our approach achieves promising performance compared with state-of-the-art methods in three domain adaptation tasks on three popular 3D object detection datasets. The code is available at https://github.com/liangp/MMAssist.
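The step of projecting 3D labels onto the image to obtain 2D boxes can be sketched with a plain pinhole camera model. This assumes the box corners are already expressed in the camera frame and ignores lens distortion and image-boundary clipping; the intrinsics `fx`, `fy`, `cx`, `cy` and the helper names are illustrative, not taken from the MMAssist code.

```python
def project_point(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame 3D point to pixel coordinates."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def box3d_to_box2d(corners, fx, fy, cx, cy):
    """Axis-aligned 2D box enclosing the projected corners of a 3D box.

    Simplification: corners behind the camera (z <= 0) are dropped rather
    than clipped against the near plane.
    """
    pts = [project_point(c, fx, fy, cx, cy) for c in corners if c[2] > 0]
    if not pts:
        return None
    us = [u for u, _ in pts]
    vs = [v for _, v in pts]
    return (min(us), min(vs), max(us), max(vs))


# Eight corners of a 2 m cube centered 5 m in front of the camera.
corners = [(x, y, z) for x in (-1.0, 1.0) for y in (-1.0, 1.0) for z in (4.0, 6.0)]
box2d = box3d_to_box2d(corners, fx=100.0, fy=100.0, cx=50.0, cy=50.0)
print(box2d)
```

The resulting 2D box is what would be cropped and passed to the vision backbone and the vision-language model to obtain the image and text features.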
[9] Knowledge-Guided Textual Reasoning for Explainable Video Anomaly Detection via LLMs
Hari Lee
🧩 TL;DR
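This paper proposes TbVAD, a text-based explainable video anomaly detection framework that performs anomaly detection and explanation entirely in the textual domain, representing video semantics through language to enable interpretable, knowledge-driven reasoning.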
📘 Detailed Summary
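Motivation: Conventional weakly supervised video anomaly detection models rely on explicit visual features and lack interpretability. TbVAD addresses this with a language-driven framework that supports interpretable, knowledge-grounded reasoning about anomalies.
Method: TbVAD has three stages. It first converts video content into fine-grained captions with a vision-language model, then organizes the captions into four semantic slots (action, object, context, environment) to build structured knowledge, and finally generates slot-wise explanations that reveal which semantic factors contribute most to the anomaly decision.
Result: Evaluation on two public benchmarks, UCF-Crime and XD-Violence, shows that textual knowledge reasoning delivers interpretable and reliable anomaly detection for real-world surveillance scenarios.
Conclusion: The study shows that language-based video representations enable explainable anomaly detection, provide a knowledge-driven reasoning framework for surveillance systems, and open a direction for text-driven video understanding.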
📄 Abstract
We introduce Text-based Explainable Video Anomaly Detection (TbVAD), a language-driven framework for weakly supervised video anomaly detection that performs anomaly detection and explanation entirely within the textual domain. Unlike conventional WSVAD models that rely on explicit visual features, TbVAD represents video semantics through language, enabling interpretable and knowledge-grounded reasoning. The framework operates in three stages: (1) transforming video content into fine-grained captions using a vision-language model, (2) constructing structured knowledge by organizing the captions into four semantic slots (action, object, context, environment), and (3) generating slot-wise explanations that reveal which semantic factors contribute most to the anomaly decision. We evaluate TbVAD on two public benchmarks, UCF-Crime and XD-Violence, demonstrating that textual knowledge reasoning provides interpretable and reliable anomaly detection for real-world surveillance scenarios.
[10] ChexFract: From General to Specialized - Enhancing Fracture Description Generation
Nikolay Nechaev, Evgeniia Przhezdzetskaia, Dmitry Umerenkov, Dmitry V. Dylov
🧩 TL;DR
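To address the inadequate description of rare fracture pathology in chest X-ray reports, this study develops specialized vision-language models for fracture detection and description, significantly improves the accuracy of fracture descriptions, and releases the best model publicly to support research on reporting rare pathologies.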
📘 Detailed Summary
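Motivation: General-purpose vision-language models perform well overall on chest X-ray report generation but fall short when describing rare yet clinically important pathologies such as fractures, which limits their reliability in clinical practice.
Method: Fracture-specific vision-language models are trained using encoders from MAIRA-2 and CheXagent, specializing the models to better recognize and describe fracture features.
Result: The specialized fracture models significantly outperform general-purpose models at generating accurate fracture descriptions. Analysis by fracture type, location, and age reveals specific strengths and limitations of current vision-language model architectures.
Conclusion: Specialized models are valuable for reporting rare pathologies. Current architectures vary in how well they describe different fracture characteristics and need further optimization for clinical use. The publicly released model is intended to advance research in this area.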
📄 Abstract
Generating accurate and clinically meaningful radiology reports from chest X-ray images remains a significant challenge in medical AI. While recent vision-language models achieve strong results in general radiology report generation, they often fail to adequately describe rare but clinically important pathologies like fractures. This work addresses this gap by developing specialized models for fracture pathology detection and description. We train fracture-specific vision-language models with encoders from MAIRA-2 and CheXagent, demonstrating significant improvements over general-purpose models in generating accurate fracture descriptions. Analysis of model outputs by fracture type, location, and age reveals distinct strengths and limitations of current vision-language model architectures. We publicly release our best-performing fracture-reporting model, facilitating future research in accurate reporting of rare pathologies.
[11] Multi-modal Deepfake Detection and Localization with FPN-Transformer
Chende Zheng, Ruiqi Suo, Zhoulin Ji, Jingyi Deng, Fangbin Yi, Chenhao Lin, Chao Shen
🧩 TL;DR
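This paper proposes a multi-modal deepfake detection and localization framework based on a Feature Pyramid-Transformer. Using cross-modal feature fusion and temporal boundary regression, it achieves a score of 0.7535 on the IJCAI'25 DDL-AV benchmark and offers a new approach to generalized deepfake detection.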
📘 Detailed Summary
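Motivation: Current unimodal deepfake detection methods cannot exploit cross-modal correlations and struggle to localize forged segments precisely, which limits them against sophisticated, fine-grained synthetic media. A framework is needed that achieves both cross-modal generalization and temporal localization.
Method: The method extracts hierarchical temporal features with pretrained self-supervised models, builds a multi-scale feature pyramid with R-TLM blocks that use localized attention, and applies a dual-branch prediction head that simultaneously predicts forgery probabilities and refines the temporal offsets of manipulated segments, achieving frame-level localization precision.
Result: On the test set of the IJCAI'25 DDL-AV benchmark the method achieves a final score of 0.7535, and the experiments show that the framework is effective for cross-modal deepfake detection and localization in challenging environments.
Conclusion: The study confirms the importance of multi-modal feature fusion and temporal boundary regression for deepfake detection, offers a new solution for generalized deepfake detection, and shows the potential of cross-modal analysis against sophisticated synthetic media.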
📄 Abstract
The rapid advancement of generative adversarial networks (GANs) and diffusion models has enabled the creation of highly realistic deepfake content, posing significant threats to digital trust across audio-visual domains. While unimodal detection methods have shown progress in identifying synthetic media, their inability to leverage cross-modal correlations and precisely localize forged segments limits their practicality against sophisticated, fine-grained manipulations. To address this, we introduce a multi-modal deepfake detection and localization framework based on a Feature Pyramid-Transformer (FPN-Transformer), addressing critical gaps in cross-modal generalization and temporal boundary regression. The proposed approach utilizes pre-trained self-supervised models (WavLM for audio, CLIP for video) to extract hierarchical temporal features. A multi-scale feature pyramid is constructed through R-TLM blocks with localized attention mechanisms, enabling joint analysis of cross-context temporal dependencies. The dual-branch prediction head simultaneously predicts forgery probabilities and refines temporal offsets of manipulated segments, achieving frame-level localization precision. We evaluate our approach on the test set of the IJCAI'25 DDL-AV benchmark, achieving a final score of 0.7535 for cross-modal deepfake detection and localization in challenging environments. Experimental results confirm the effectiveness of our approach and provide a novel way for generalized deepfake detection. Our code is available at https://github.com/Zig-HS/MM-DDL
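A dual-branch head that outputs a forgery probability and a pair of temporal offsets per time step can be decoded into segments roughly as follows. The decoding rule here (threshold the probability, then subtract and add the offsets around each time step) is a common convention in temporal localization and is assumed rather than taken from the paper; `decode_segments`, `stride`, and `threshold` are hypothetical names.

```python
def decode_segments(probs, offsets, stride=1.0, threshold=0.5):
    """Turn per-step predictions into (start, end, score) segments.

    probs:   forgery probability at each time step
    offsets: (left, right) distances to the segment boundaries, in steps
    stride:  seconds covered by one time step at this pyramid level
    """
    segments = []
    for t, (p, (left, right)) in enumerate(zip(probs, offsets)):
        if p >= threshold:
            center = t * stride
            segments.append((center - left * stride, center + right * stride, p))
    return segments


probs = [0.1, 0.9, 0.2]
offsets = [(0.0, 0.0), (1.0, 2.0), (0.0, 0.0)]
segments = decode_segments(probs, offsets, stride=0.5)
print(segments)
```

In a multi-scale pyramid, each level would use its own stride, and the candidate segments from all levels would typically be merged with a non-maximum suppression step.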
[12] WEDepth: Efficient Adaptation of World Knowledge for Monocular Depth Estimation
Gongshu Wang, Zhirui Wang, Kan Yang
🧩 TL;DR
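WEDepth is a monocular depth estimation method that leaves the structure and pretrained weights of the vision foundation model (VFM) unchanged, using the VFM as a multi-level feature enhancer to elicit and exploit its inherent priors, and achieves state-of-the-art performance on NYU-Depth v2 and KITTI.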
📘 Detailed Summary
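Motivation: Monocular depth estimation is highly challenging because reconstructing a 3D scene from a single 2D image is inherently ill-posed. Modern vision foundation models pretrained on large, diverse datasets show remarkable world understanding, but how to exploit these priors for depth estimation without modifying the model structure or pretrained weights remains an open question.
Method: The method uses the vision foundation model as a multi-level feature enhancer that systematically injects prior knowledge at different representation levels. This structure-preserving design elicits the VFM's inherent priors while avoiding any change to its architecture or pretrained weights.
Result: Experiments on NYU-Depth v2 and KITTI show that WEDepth sets a new state of the art, with results competitive with both diffusion-based methods that require multiple forward passes and methods pretrained on relative depth. It also shows strong zero-shot transfer across diverse scenarios.
Conclusion: The study shows that the inherent priors of vision foundation models transfer effectively to monocular depth estimation without structural modification, offering a new route for zero-shot depth estimation and cross-domain transfer and highlighting the potential of pretrained vision models for geometric understanding.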
📄 Abstract
Monocular depth estimation (MDE) is widely applicable but remains highly challenging due to the inherently ill-posed nature of reconstructing 3D scenes from single 2D images. Modern Vision Foundation Models (VFMs), pre-trained on large-scale diverse datasets, exhibit remarkable world understanding capabilities that benefit various vision tasks. Recent studies have demonstrated significant improvements in MDE through fine-tuning these VFMs. Inspired by these developments, we propose WEDepth, a novel approach that adapts VFMs for MDE without modifying their structures and pretrained weights, while effectively eliciting and leveraging their inherent priors. Our method employs the VFM as a multi-level feature enhancer, systematically injecting prior knowledge at different representation levels. Experiments on NYU-Depth v2 and KITTI datasets show that WEDepth establishes new state-of-the-art (SOTA) performance, achieving competitive results compared to both diffusion-based approaches (which require multiple forward passes) and methods pre-trained on relative depth. Furthermore, we demonstrate that our method exhibits strong zero-shot transfer capability across diverse scenarios.
[13] Toward the Frontiers of Reliable Diffusion Sampling via Adversarial Sinkhorn Attention Guidance
Kwanyoung Kim
🧩 TL;DR
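This paper proposes Adversarial Sinkhorn Attention Guidance (ASAG), a new diffusion guidance method grounded in optimal transport theory that improves generation quality by deliberately disrupting the transport cost in self-attention layers. It yields consistent improvements in text-to-image generation and improves controllability and fidelity in downstream applications such as IP-Adapter and ControlNet.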
📘 Detailed Summary
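Motivation: Existing diffusion guidance methods such as classifier-free guidance (CFG) typically improve a target output by deliberately degrading another output, often the unconditional one, with heuristic perturbation functions. These methods lack a principled foundation and depend on hand-designed distortions. This work aims to provide a more principled guidance framework for the attention mechanism.
Method: ASAG reinterprets attention scores in diffusion models from an optimal transport perspective and, via the Sinkhorn algorithm, injects an adversarial cost into self-attention layers to reduce pixel-wise similarity between queries and keys. This deliberate degradation weakens misleading attention alignments and improves both conditional and unconditional sample quality. The method is lightweight, plug-and-play, and requires no model retraining.
Result: ASAG delivers consistent quality improvements in text-to-image diffusion and improves controllability and fidelity in downstream applications such as IP-Adapter and ControlNet, improving the reliability and quality of generated samples.
Conclusion: ASAG gives diffusion guidance a foundation in optimal transport and shows that deliberately disrupting the attention transport cost can improve generation quality. Beyond improving reliability, it points to further research on the theoretical interpretation and optimization of attention mechanisms in diffusion models.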
📄 Abstract
Diffusion models have demonstrated strong generative performance when using guidance methods such as classifier-free guidance (CFG), which enhance output quality by modifying the sampling trajectory. These methods typically improve a target output by intentionally degrading another, often the unconditional output, using heuristic perturbation functions such as identity mixing or blurred conditions. However, these approaches lack a principled foundation and rely on manually designed distortions. In this work, we propose Adversarial Sinkhorn Attention Guidance (ASAG), a novel method that reinterprets attention scores in diffusion models through the lens of optimal transport and intentionally disrupts the transport cost via the Sinkhorn algorithm. Instead of naively corrupting the attention mechanism, ASAG injects an adversarial cost within self-attention layers to reduce pixel-wise similarity between queries and keys. This deliberate degradation weakens misleading attention alignments and leads to improved conditional and unconditional sample quality. ASAG shows consistent improvements in text-to-image diffusion, and enhances controllability and fidelity in downstream applications such as IP-Adapter and ControlNet. The method is lightweight, plug-and-play, and improves reliability without requiring any model retraining.
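For readers unfamiliar with the Sinkhorn algorithm, the sketch below shows the standard entropic optimal transport iteration that turns a cost matrix into a transport plan whose rows and columns each sum to one. It illustrates only the generic algorithm; how ASAG constructs its adversarial cost from queries and keys is not specified in the abstract and is not reproduced here. The toy cost matrix and the `epsilon` value are assumptions.

```python
import math


def sinkhorn(cost, epsilon=1.0, iterations=50):
    """Entropic OT plan for a square cost matrix with unit marginals.

    Alternately rescales rows and columns of the Gibbs kernel
    exp(-cost / epsilon) until both sets of marginals are (nearly) one.
    """
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    n, m = len(kernel), len(kernel[0])
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iterations):
        u = [1.0 / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [1.0 / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy cost between three queries and three keys (lower cost = more similar).
cost = [[0.0, 1.0, 2.0],
        [1.0, 0.0, 1.0],
        [2.0, 1.0, 0.0]]
plan = sinkhorn(cost)
row_sums = [sum(row) for row in plan]
col_sums = [sum(plan[i][j] for i in range(3)) for j in range(3)]
print(row_sums, col_sums)
```

Ordinary softmax attention normalizes rows only; the Sinkhorn view adds column normalization, which is what makes a transport-cost interpretation of attention available in the first place.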
[14] CLIP is All You Need for Human-like Semantic Representations in Stable Diffusion
Cameron Braunstein, Mariya Toneva, Eddy Ilg
🧩 TL;DR
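This paper finds that the semantic understanding of latent diffusion models such as Stable Diffusion in text-to-image generation comes mainly from the CLIP text encoder rather than from the reverse diffusion process, which instead acts primarily as a visual decoder.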
📘 Detailed Summary
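Motivation: Latent diffusion models achieve state-of-the-art results in text-to-image generation, but how well they understand the semantics of the images they generate has not been studied in depth. In particular, it is unclear whether their internal representations contain semantic information that is meaningful to humans.
Method: The authors probe Stable Diffusion with simple regression layers that predict semantic attributes of objects, compare the predictions against human annotations, and contrast the relative contributions of CLIP text encoding and the reverse diffusion process to semantic representation.
Result: The probing success is attributable mainly to CLIP's text encoding rather than to the reverse diffusion process. Decoding accuracy differs markedly across groups of semantic attributes, and attributes become harder to distinguish from one another as reverse diffusion proceeds, indicating that CLIP holds the strongest semantic representation.
Conclusion: The separately trained CLIP vision-language model determines the human-like semantic representation, while the diffusion process mainly serves as a visual decoder. This clarifies where semantic capability resides in text-to-image models.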
📄 Abstract
Latent diffusion models such as Stable Diffusion achieve state-of-the-art results on text-to-image generation tasks. However, the extent to which these models have a semantic understanding of the images they generate is not well understood. In this work, we investigate whether the internal representations used by these models during text-to-image generation contain semantic information that is meaningful to humans. To do so, we perform probing on Stable Diffusion with simple regression layers that predict semantic attributes for objects and evaluate these predictions against human annotations. Surprisingly, we find that this success can actually be attributed to the text encoding occurring in CLIP rather than the reverse diffusion process. We demonstrate that groups of specific semantic attributes have markedly different decoding accuracy than the average, and are thus represented to different degrees. Finally, we show that attributes become more difficult to disambiguate from one another during the inverse diffusion process, further demonstrating the strongest semantic representation of object attributes in CLIP. We conclude that the separately trained CLIP vision-language model is what determines the human-like semantic representation, and that the diffusion process instead takes the role of a visual decoder.
[15] OTSNet: A Neurocognitive-Inspired Observation-Thinking-Spelling Pipeline for Scene Text Recognition
Lixu Sun, Nurmemet Yolwas, Wushour Silamu
🧩 TL;DR
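This paper proposes OTSNet, a neurocognitive-inspired three-stage network that unifies scene text recognition through an Observation-Thinking-Spelling pipeline and sets new state-of-the-art results on multiple benchmarks.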
📘 Detailed Summary
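Motivation: In existing scene text recognition frameworks, decoupled visual-linguistic optimization amplifies error propagation through cross-modal misalignment. Visual encoders show attention bias toward background distractors, and decoders suffer from spatial misalignment when parsing geometrically deformed text. Together these factors degrade recognition accuracy on irregular patterns.
Method: OTSNet has three stages: a Dual Attention Macaron Encoder that refines visual features through differential attention maps; a Position-Aware Module and Semantic Quantizer that integrate spatial context with glyph-level semantic abstraction via adaptive sampling; and a Multi-Modal Collaborative Verifier that performs self-correction through cross-modal fusion of visual, semantic, and character-level features.
Result: The model reaches 83.5% average accuracy on the challenging Union14M-L benchmark and 79.1% on the heavily occluded OST dataset, setting new records in 9 of 14 evaluation scenarios.
Conclusion: The study demonstrates the effectiveness of a neurocognitive-inspired hierarchical pipeline for scene text recognition. Unified cross-modal alignment and self-correction substantially improve robustness in complex scenes and offer a new perspective for multimodal fusion research.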
📄 Abstract
Scene Text Recognition (STR) remains challenging due to real-world complexities, where decoupled visual-linguistic optimization in existing frameworks amplifies error propagation through cross-modal misalignment. Visual encoders exhibit attention bias toward background distractors, while decoders suffer from spatial misalignment when parsing geometrically deformed text, collectively degrading recognition accuracy for irregular patterns. Inspired by the hierarchical cognitive processes in human visual perception, we propose OTSNet, a novel three-stage network embodying a neurocognitive-inspired Observation-Thinking-Spelling pipeline for unified STR modeling. The architecture comprises three core components: (1) a Dual Attention Macaron Encoder (DAME) that refines visual features through differential attention maps to suppress irrelevant regions and enhance discriminative focus; (2) a Position-Aware Module (PAM) and Semantic Quantizer (SQ) that jointly integrate spatial context with glyph-level semantic abstraction via adaptive sampling; and (3) a Multi-Modal Collaborative Verifier (MMCV) that enforces self-correction through cross-modal fusion of visual, semantic, and character-level features. Extensive experiments demonstrate that OTSNet achieves state-of-the-art performance, attaining 83.5% average accuracy on the challenging Union14M-L benchmark and 79.1% on the heavily occluded OST dataset, establishing new records across 9 out of 14 evaluation scenarios.
[16] PEOD: A Pixel-Aligned Event-RGB Benchmark for Object Detection under Challenging Conditions
Luoping Cui, Hanqing Liu, Mingjie Liu, Endian Lin, Donghong Jiang, Yuhao Wang, Chuang Zhu
🧩 TL;DR
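This paper presents PEOD, the first large-scale, pixel-aligned, high-resolution Event-RGB dataset for object detection under challenging conditions. It contains 130+ spatiotemporally aligned sequences and 340k manually annotated boxes, with 57% of the data captured under low light, overexposure, or high-speed motion.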
📘 Detailed Summary
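Motivation: Existing Event-RGB datasets cover extreme conditions only sparsely and have low spatial resolution (≤640×480), so they cannot comprehensively evaluate detectors in challenging scenarios. This holds back the development of robust object detection in complex environments.
Method: The PEOD dataset contains more than 130 spatiotemporally aligned sequences and 340,000 manually annotated bounding boxes, with 57% of the data captured under low light, overexposure, or high-speed motion. Fourteen methods are benchmarked across three input configurations: event-based, RGB-based, and Event-RGB fusion.
Result: On the full test set and the normal subset, fusion-based models perform best. On the illumination-challenge subset, the top event-based model outperforms all fusion models, while fusion models still outperform RGB-based models, which indicates the limits of existing fusion methods when the frame modality is severely degraded.
Conclusion: PEOD establishes a realistic, high-quality benchmark for multimodal perception, exposes the limitations of existing fusion methods under severe frame degradation, and supports future research on robust object detection under challenging conditions.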
📄 Abstract
Robust object detection for challenging scenarios increasingly relies on event cameras, yet existing Event-RGB datasets remain constrained by sparse coverage of extreme conditions and low spatial resolution (<= 640 x 480), which prevents comprehensive evaluation of detectors under challenging scenarios. To address these limitations, we propose PEOD, the first large-scale, pixel-aligned, high-resolution (1280 x 720) Event-RGB dataset for object detection under challenging conditions. PEOD contains 130+ spatiotemporally aligned sequences and 340k manual bounding boxes, with 57% of the data captured under low-light, overexposure, and high-speed motion conditions. Furthermore, we benchmark 14 methods across three input configurations (Event-based, RGB-based, and Event-RGB fusion) on PEOD. On the full test set and the normal subset, fusion-based models achieve excellent performance. However, on the illumination-challenge subset, the top event-based model outperforms all fusion models, while fusion models still outperform their RGB-based counterparts, indicating the limits of existing fusion methods when the frame modality is severely degraded. PEOD establishes a realistic, high-quality benchmark for multimodal perception and facilitates future research.
[17] Boomda: Balanced Multi-objective Optimization for Multimodal Domain Adaptation
Jun Sun, Xinxin Zhang, Simin Hong, Jian Zhu, Xiang Gao
🧩 TL;DR
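This paper proposes Boomda, which uses multi-objective optimization for heterogeneous multimodal domain adaptation. It balances the domain shifts of different modalities and enables efficient domain adaptation in multimodal settings that lack annotated data.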
📘 Detailed Summary
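Motivation: Multimodal learning suffers from scarce annotated data, yet existing unsupervised domain adaptation methods mainly target unimodal settings and remain less explored in multimodal ones. When different modalities exhibit different degrees of domain shift between the source and target domains, heterogeneous multimodal domain adaptation has to be addressed.
Method: The information bottleneck method is first used to learn a representation for each modality independently, and the source and target domains are then aligned in the representation space via correlation alignment. The problem is formulated as a multi-objective optimization task that seeks a Pareto-optimal solution. Model-specific properties reduce it to a quadratic programming problem, from which a closed-form solution is derived, yielding Boomda, an efficient modality-balanced multimodal domain adaptation algorithm.
Result: Extensive empirical results show that the method is effective: Boomda outperforms competing schemes on multiple benchmarks in multimodal domain adaptation tasks.
Conclusion: The study provides an effective modality-balanced solution for multimodal domain adaptation. Its multi-objective optimization framework handles heterogeneous domain shifts across modalities and opens a new research direction for multimodal learning with scarce annotations.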
📄 Abstract
Multimodal learning, while contributing to numerous success stories across various fields, faces the challenge of prohibitively expensive manual annotation. To address the scarcity of annotated data, a popular solution is unsupervised domain adaptation, which has been extensively studied in unimodal settings yet remains less explored in multimodal settings. In this paper, we investigate heterogeneous multimodal domain adaptation, where the primary challenge is the varying domain shifts of different modalities from the source to the target domain. We first introduce the information bottleneck method to learn representations for each modality independently, and then match the source and target domains in the representation space with correlation alignment. To balance the domain alignment of all modalities, we formulate the problem as a multi-objective task, aiming for a Pareto optimal solution. By exploiting the properties specific to our model, the problem can be simplified to a quadratic programming problem. Further approximation yields a closed-form solution, leading to an efficient modality-balanced multimodal domain adaptation algorithm. The proposed method features Balanced multi-objective optimization for multimodal domain adaptation, termed Boomda. Extensive empirical results showcase the effectiveness of the proposed approach and demonstrate that Boomda outperforms the competing schemes. The code is available at: https://github.com/sunjunaimer/Boomda.git.
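The correlation-alignment step and the modality balancing described in the abstract can be illustrated with a minimal sketch. It assumes the common CORAL formulation (squared Frobenius distance between source and target feature covariances, scaled by 1/(4d^2)); the abstract does not state which exact form Boomda uses. The function names are hypothetical, and the balancing weights are passed in by hand because the paper's closed-form weights are not given in the abstract.

```python
def covariance(rows):
    """Sample covariance matrix (d x d) of a list of d-dimensional feature rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def coral_loss(source, target):
    """Squared Frobenius distance between the two covariances, scaled by 1/(4 d^2)."""
    d = len(source[0])
    cs, ct = covariance(source), covariance(target)
    sq = sum((cs[i][j] - ct[i][j]) ** 2 for i in range(d) for j in range(d))
    return sq / (4 * d * d)


def balanced_objective(per_modality_losses, weights):
    """Weighted sum of per-modality alignment losses.

    The weights are placeholders (assumption): they only need to be
    non-negative and sum to 1, as in a standard scalarized multi-objective loss.
    """
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    return sum(w * l for w, l in zip(weights, per_modality_losses))
```

In this sketch each modality contributes one `coral_loss` value, and a modality with a larger domain shift would simply show a larger loss; how the weights should react to that is exactly what the paper's multi-objective formulation decides.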
[18] Non-Aligned Reference Image Quality Assessment for Novel View Synthesis
Abhijay Ghildyal, Rajesh Sureddi, Nabajeet Barman, Saman Zadtootaghaj, Alan Bovik
🧩 TL;DR
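This paper proposes a non-aligned reference image quality assessment framework for novel view synthesis images. Trained with contrastive learning on synthetic distortions, it outperforms existing methods when the reference is not aligned.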
📘 Detailed Summary
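Motivation: Assessing the perceptual quality of novel view synthesis images is a key challenge, especially without pixel-aligned ground-truth references: full-reference IQA methods fail under misalignment, while no-reference methods generalize poorly.
Method: The authors construct a large-scale image dataset with synthetic distortions targeting Temporal Regions of Interest and adopt a contrastive learning framework that combines LoRA-enhanced DINOv2 embeddings with supervision from existing IQA methods. Training uses only synthetic distortions to avoid overfitting to specific real NVS samples.
Result: The model surpasses state-of-the-art full-reference, no-reference, and non-aligned reference IQA methods in the non-aligned setting, performs robustly with both aligned and non-aligned references, and correlates strongly with the collected subjective ratings.
Conclusion: The study shows that a non-aligned reference IQA framework trained on synthetic distortions is effective and offers a new paradigm for NVS quality assessment. A user study confirms the strong correlation between model predictions and human preferences and provides insights for future research.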
📄 Abstract
Evaluating the perceptual quality of Novel View Synthesis (NVS) images remains a key challenge, particularly in the absence of pixel-aligned ground truth references. Full-Reference Image Quality Assessment (FR-IQA) methods fail under misalignment, while No-Reference (NR-IQA) methods struggle with generalization. In this work, we introduce a Non-Aligned Reference (NAR-IQA) framework tailored for NVS, where it is assumed that the reference view shares partial scene content but lacks pixel-level alignment. We constructed a large-scale image dataset containing synthetic distortions targeting Temporal Regions of Interest (TROI) to train our NAR-IQA model. Our model is built on a contrastive learning framework that incorporates LoRA-enhanced DINOv2 embeddings and is guided by supervision from existing IQA methods. We train exclusively on synthetically generated distortions, deliberately avoiding overfitting to specific real NVS samples and thereby enhancing the model's generalization capability. Our model outperforms state-of-the-art FR-IQA, NR-IQA, and NAR-IQA methods, achieving robust performance on both aligned and non-aligned references. We also conducted a novel user study to gather data on human preferences when viewing non-aligned references in NVS. We find strong correlation between our proposed quality prediction model and the collected subjective ratings. For dataset and code, please visit our project page: https://stootaghaj.github.io/nova-project/
[19] LandSegmenter: Towards a Flexible Foundation Model for Land Use and Land Cover Mapping
Chenying Liu, Wei Huang, Xiao Xiang Zhu
🧩 TL;DR
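This paper proposes LandSegmenter, a dedicated foundation model framework for land use and land cover (LULC) mapping. It uses weak supervision to address the scarcity of annotated data in remote sensing and achieves strong performance in zero-shot and transfer learning settings.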
📘 Detailed Summary
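Motivation: Current LULC models are typically developed for a specific modality and a fixed class taxonomy, which limits their generalization and broader applicability. Task-agnostic foundation models require fine-tuning, whereas task-specific foundation models depend on large amounts of labeled data, which is costly and impractical in remote sensing.
Method: LandSegmenter is a three-stage framework. At the input level, it builds LAS, a large-scale, multi-modal, multi-source dataset that uses weak labels from existing LULC products. At the model level, it integrates a remote-sensing-specific adapter for cross-modal feature extraction and a text encoder to enhance semantic awareness. At the output level, it applies a class-wise confidence-guided fusion strategy to mitigate semantic omissions.
Result: Evaluated on six precisely annotated LULC datasets, extensive transfer learning and zero-shot experiments show that LandSegmenter achieves competitive or superior performance, particularly in zero-shot settings when transferred to unseen datasets.
Conclusion: The study demonstrates the effectiveness of the framework and the utility of weak supervision for building task-specific foundation models. It offers a feasible path toward general-purpose models in remote sensing, reducing annotation cost and improving generalization.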
📄 Abstract
Land Use and Land Cover (LULC) mapping is a fundamental task in Earth Observation (EO). However, current LULC models are typically developed for a specific modality and a fixed class taxonomy, limiting their generalizability and broader applicability. Recent advances in foundation models (FMs) offer promising opportunities for building universal models. Yet, task-agnostic FMs often require fine-tuning for downstream applications, whereas task-specific FMs rely on massive amounts of labeled data for training, which is costly and impractical in the remote sensing (RS) domain. To address these challenges, we propose LandSegmenter, an LULC FM framework that resolves three-stage challenges at the input, model, and output levels. From the input side, to alleviate the heavy demand on labeled data for FM training, we introduce LAnd Segment (LAS), a large-scale, multi-modal, multi-source dataset built primarily with globally sampled weak labels from existing LULC products. LAS provides a scalable, cost-effective alternative to manual annotation, enabling large-scale FM training across diverse LULC domains. For model architecture, LandSegmenter integrates an RS-specific adapter for cross-modal feature extraction and a text encoder for semantic awareness enhancement. At the output stage, we introduce a class-wise confidence-guided fusion strategy to mitigate semantic omissions and further improve LandSegmenter's zero-shot performance. We evaluate LandSegmenter on six precisely annotated LULC datasets spanning diverse modalities and class taxonomies. Extensive transfer learning and zero-shot experiments demonstrate that LandSegmenter achieves competitive or superior performance, particularly in zero-shot settings when transferred to unseen datasets. These results highlight the efficacy of our proposed framework and the utility of weak supervision for building task-specific FMs.
[20] Multi-Granularity Mutual Refinement Network for Zero-Shot Learning
Ning Wang, Long Yu, Cong Hua, Guangming Zhu, Lin Mei, Syed Afaq Ali Shah, Mohammed Bennamoun, Liang Zhang
🧩 TL;DR
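This paper proposes the Multi-Granularity Mutual Refinement Network (Mg-MRN), which refines discriminative and transferable visual features through decoupled multi-granularity feature learning and cross-granularity feature interaction. It addresses the neglect of intrinsic interactions between local region features in zero-shot learning.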
📘 Detailed Summary
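Motivation: Existing zero-shot learning methods usually associate global visual features with semantic information or align local visual region features with their corresponding attributes, but they often overlook the intrinsic interactions between local region features. These interactions can further improve the acquisition of transferable and explicit visual features.
Method: A multi-granularity feature extraction module learns region-level discriminative features through decoupled region feature mining. A cross-granularity feature fusion module then strengthens the intrinsic interactions between region features of different granularities, integrating region representations from adjacent hierarchies to make the representation at each granularity level more discriminative.
Result: Extensive experiments on three popular zero-shot learning benchmark datasets demonstrate the superiority and competitiveness of Mg-MRN and a marked improvement in zero-shot recognition performance.
Conclusion: The study highlights the importance of interactions between local region features in zero-shot learning. The multi-granularity mutual refinement framework offers an effective way to learn more discriminative and transferable visual features and improves zero-shot recognition performance.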
📄 Abstract
Zero-shot learning (ZSL) aims to recognize unseen classes with zero samples by transferring semantic knowledge from seen classes. Current approaches typically correlate global visual features with semantic information (i.e., attributes) or align local visual region features with corresponding attributes to enhance visual-semantic interactions. Although effective, these methods often overlook the intrinsic interactions between local region features, which can further improve the acquisition of transferable and explicit visual features. In this paper, we propose a network named Multi-Granularity Mutual Refinement Network (Mg-MRN), which refines discriminative and transferable visual features by learning decoupled multi-granularity features and cross-granularity feature interactions. Specifically, we design a multi-granularity feature extraction module to learn region-level discriminative features through decoupled region feature mining. Then, a cross-granularity feature fusion module strengthens the inherent interactions between region features of varying granularities. This module enhances the discriminability of representations at each granularity level by integrating region representations from adjacent hierarchies, further improving ZSL recognition performance. Extensive experiments on three popular ZSL benchmark datasets demonstrate the superiority and competitiveness of our proposed Mg-MRN method. Our code is available at https://github.com/NingWang2049/Mg-MRN.
[21] Distributed Zero-Shot Learning for Visual Recognition
Zhi Chen, Yadan Luo, Zi Huang, Jingjing Li, Sen Wang, Xin Yu
🧩 TL;DR
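This paper proposes DistZSL, a distributed zero-shot learning framework. Using a cross-node attribute regularizer and a global attribute-to-visual consensus mechanism, it exploits distributed data to learn a model for unseen classes and achieves state-of-the-art performance in learning from distributed data.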
📘 Detailed Summary
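Motivation: The study addresses data heterogeneity in zero-shot learning under distributed settings. Conventional methods struggle to make full use of data scattered across nodes to learn an effective model for unseen classes, and when data distributions are inconsistent, the visual-to-attribute mapping is prone to bias.
Method: The DistZSL framework has two key components. The cross-node attribute regularizer keeps the distances between attribute features similar across nodes, stabilizing the overall attribute feature space. The global attribute-to-visual consensus mechanism enforces that the bilateral mapping between attribute and visual feature distributions is consistent across nodes, mitigating the biased V2A mappings learned by individual nodes.
Result: Extensive experiments show that DistZSL outperforms existing state-of-the-art methods in learning from distributed data and substantially improves zero-shot learning across nodes, confirming that the framework handles data heterogeneity effectively.
Conclusion: The study shows that suitable regularization and consensus mechanisms can address the data heterogeneity challenge in distributed zero-shot learning, offering a new technical route for knowledge transfer in distributed settings with practical value.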
📄 Abstract
In this paper, we propose a Distributed Zero-Shot Learning (DistZSL) framework that can fully exploit decentralized data to learn an effective model for unseen classes. Considering the data heterogeneity issues across distributed nodes, we introduce two key components to ensure the effective learning of DistZSL: a cross-node attribute regularizer and a global attribute-to-visual consensus. Our proposed cross-node attribute regularizer enforces the distances between attribute features to be similar across different nodes. In this manner, the overall attribute feature space remains stable during learning, which facilitates the establishment of visual-to-attribute (V2A) relationships. Then, we introduce the global attribute-to-visual consensus to mitigate biased V2A mappings learned from individual nodes. Specifically, we enforce the bilateral mapping between the attribute and visual feature distributions to be consistent across different nodes. Thus, the learned consistent V2A mapping can significantly enhance zero-shot learning across different nodes. Extensive experiments demonstrate that DistZSL achieves superior performance to the state-of-the-art in learning from distributed data.
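One way to read the cross-node attribute regularizer is as a penalty on the mismatch between the pairwise-distance structure of attribute features held by two nodes. The sketch below is only an illustration under that reading: the function names are hypothetical, and the choice of Euclidean distance and a mean squared difference is an assumption, not the paper's stated loss.

```python
import math


def pairwise_distances(features):
    """Euclidean distance between every pair of class attribute features."""
    return [[math.dist(a, b) for b in features] for a in features]


def cross_node_distance_penalty(node_a, node_b):
    """Mean squared difference between two nodes' pairwise-distance matrices.

    Both inputs list one attribute feature per class, in the same class order
    (assumption). The penalty is zero when the two nodes agree on how far
    apart every pair of classes is, even if their features differ by a shift.
    """
    da, db = pairwise_distances(node_a), pairwise_distances(node_b)
    k = len(da)
    return sum((da[i][j] - db[i][j]) ** 2
               for i in range(k) for j in range(k)) / (k * k)
```

Because only distances are compared, two nodes whose attribute features are translated copies of each other incur no penalty, while a node that stretches or collapses the space does.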
[22] Remodeling Semantic Relationships in Vision-Language Fine-Tuning
Xiangyang Wu, Liu Liu, Baosheng Yu, Jiayan Qiu, Zhenwei Shi
🧩 TL;DR
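This paper proposes a multimodal alignment and fusion method enhanced by both semantics and relationships. By extracting multilevel semantic features, learning semantic groupings, and using inheritable cross-attention, it substantially improves vision-language foundation models. The method outperforms all existing methods on two downstream tasks, visual question answering and image captioning.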
📘 Detailed Summary
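Motivation: Existing vision-language fine-tuning methods usually overlook the semantic relationships within an image that the textual context highlights, which leads to poor vision-language alignment. Ignoring these relationships limits the performance of multimodal foundation models, so new methods are needed that exploit both semantic and relational information.
Method: The method first extracts multilevel semantic features from different vision encoders to capture richer visual cues about relationships. It then learns to project the visual features into related semantic groups. Finally, it fuses visual and textual features with inheritable cross-attention, globally removing redundant visual relationships by discarding vision-language feature pairs with low correlation.
Result: Evaluations on eight foundation models and two downstream tasks show that the method outperforms all existing methods on visual question answering and image captioning, demonstrating its advantage in multimodal alignment and fusion.
Conclusion: The study shows that using semantic and relational information together is important for improving multimodal foundation models. The method offers a new technical route for vision-language alignment and points toward stronger multimodal understanding systems.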
📄 Abstract
Vision-language fine-tuning has emerged as an efficient paradigm for constructing multimodal foundation models. While textual context often highlights semantic relationships within an image, existing fine-tuning methods typically overlook this information when aligning vision and language, thus leading to suboptimal performance. Toward solving this problem, we propose a method that can improve multimodal alignment and fusion based on both semantics and relationships. Specifically, we first extract multilevel semantic features from different vision encoders to capture more visual cues of the relationships. Then, we learn to project the vision features to group related semantics, which are more likely to have relationships among them. Finally, we fuse the visual features with the textual features by using inheritable cross-attention, where we globally remove the redundant visual relationships by discarding visual-language feature pairs with low correlation. We evaluate our proposed method on eight foundation models and two downstream tasks, visual question answering and image captioning, and show that it outperforms all existing methods.
[23] VLMDiff: Leveraging Vision-Language Models for Multi-Class Anomaly Detection with Diffusion
Samet Hicsonmez, Abd El Rahman Shabayek, Djamila Aouada
🧩 TL;DR
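This paper proposes VLMDiff, a novel unsupervised multi-class visual anomaly detection framework. It integrates a latent diffusion model with a vision-language model and uses VLM-generated image descriptions as additional conditioning to improve anomaly localization and detection. The method clearly outperforms existing diffusion-based methods on the Real-IAD and COCO-AD datasets and performs multi-class anomaly detection without manual annotation or per-class model training.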
📘 Detailed Summary
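Motivation: Current diffusion-based visual anomaly detection methods rely mainly on synthetic noise generation, which limits their generalization and requires per-class model training, severely hindering scalability. Detecting visual anomalies in diverse, multi-class real-world images is challenging, and existing methods struggle to localize and detect multi-class anomalies effectively without manual annotation.
Method: VLMDiff integrates a latent diffusion model (LDM) with a vision-language model (VLM). A pre-trained VLM with a simple prompt extracts detailed image descriptions that serve as additional conditioning for LDM training. Descriptions of normal images are obtained from the VLM without manual annotation or additional training, and they condition the diffusion model to learn a robust normal-image feature representation for multi-class anomaly detection.
Result: The method improves the pixel-level Per-Region-Overlap metric by up to 25 points on the Real-IAD dataset and by 8 points on the COCO-AD dataset, clearly surpassing current state-of-the-art diffusion-based methods. The results indicate that it is competitive for multi-class visual anomaly detection.
Conclusion: The study shows that combining a VLM with an LDM addresses the challenges of multi-class visual anomaly detection and scales without per-class training. It demonstrates the potential of using semantic conditioning from a pre-trained vision-language model to strengthen the representations learned by a diffusion model, offering a new technical route for unsupervised anomaly detection.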
📄 Abstract
Detecting visual anomalies in diverse, multi-class real-world images is a significant challenge. We introduce VLMDiff, a novel unsupervised multi-class visual anomaly detection framework. It integrates a Latent Diffusion Model (LDM) with a Vision-Language Model (VLM) for enhanced anomaly localization and detection. Specifically, a pre-trained VLM with a simple prompt extracts detailed image descriptions, serving as additional conditioning for LDM training. Current diffusion-based methods rely on synthetic noise generation, limiting their generalization and requiring per-class model training, which hinders scalability. VLMDiff, however, leverages VLMs to obtain normal captions without manual annotations or additional training. These descriptions condition the diffusion model, learning a robust normal image feature representation for multi-class anomaly detection. Our method achieves competitive performance, improving the pixel-level Per-Region-Overlap (PRO) metric by up to 25 points on the Real-IAD dataset and 8 points on the COCO-AD dataset, outperforming state-of-the-art diffusion-based approaches. Code is available at https://github.com/giddyyupp/VLMDiff.
[24] NERVE: Neighbourhood & Entropy-guided Random-walk for training free open-Vocabulary sEgmentation
Kunal Mahatha, Jose Dolz, Christian Desrosiers
🧩 TL;DR
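This paper proposes NERVE, a strong baseline for training-free open-vocabulary semantic segmentation. By integrating global and local information, refining affinities with a random walk, and selecting attention maps through entropy-based uncertainty, it achieves state-of-the-art zero-shot segmentation performance on 7 popular benchmarks.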
📘 Detailed Summary
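Motivation: Existing training-free open-vocabulary semantic segmentation methods have several limitations: computationally expensive affinity refinement strategies, ineffective fusion of transformer attention maps caused by equal weighting or fixed-size Gaussian kernels, and the enforcement of isotropic neighborhoods.
Method: NERVE integrates global and fine-grained local information by exploiting the neighbourhood structure from the self-attention layer of a stable diffusion model. It refines the affinity with a random walk rather than fixed-size Gaussian kernels, and it uses entropy-based uncertainty to select the most relevant attention maps, without conventional post-processing techniques.
Result: In experiments on 7 popular semantic segmentation benchmarks, the method achieves overall state-of-the-art zero-shot segmentation performance, providing an effective approach to open-vocabulary semantic segmentation.
Conclusion: The study shows that combining global and local information, random-walk refinement, and uncertainty-based selection yields an efficient training-free open-vocabulary segmentation method that handles objects of arbitrary shape and offers a new technical route for the field.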
📄 Abstract
Despite recent advances in Open-Vocabulary Semantic Segmentation (OVSS), existing training-free methods face several limitations: computationally expensive affinity refinement strategies, and ineffective fusion of transformer attention maps due to equal weighting or reliance on fixed-size Gaussian kernels to reinforce local spatial smoothness, which enforces isotropic neighborhoods. We propose a strong baseline for training-free OVSS termed NERVE (Neighbourhood & Entropy-guided Random-walk for open-Vocabulary sEgmentation), which uniquely integrates global and fine-grained local information, exploiting the neighbourhood structure from the self-attention layer of a stable diffusion model. We also introduce a stochastic random walk for refining the affinity rather than relying on fixed-size Gaussian kernels for local context. This spatial diffusion process encourages propagation across connected and semantically related areas, enabling it to effectively delineate objects with arbitrary shapes. Whereas most existing approaches treat self-attention maps from different transformer heads or layers equally, our method uses entropy-based uncertainty to select the most relevant maps. Notably, our method does not require any conventional post-processing techniques like Conditional Random Fields (CRF) or Pixel-Adaptive Mask Refinement (PAMR). Experiments are performed on 7 popular semantic segmentation benchmarks, yielding an overall state-of-the-art zero-shot segmentation performance, providing an effective approach to open-vocabulary semantic segmentation.
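The two mechanisms named in the abstract, random-walk affinity refinement and entropy-based map selection, can be sketched on toy data as follows. This is an illustration rather than the authors' implementation: the restart term, the number of steps, and the rule "lower entropy means more relevant" are all assumptions, and every function name is hypothetical.

```python
import math


def row_normalize(affinity):
    """Turn a non-negative affinity matrix (no all-zero rows) into a
    row-stochastic transition matrix."""
    return [[v / sum(row) for v in row] for row in affinity]


def random_walk(scores, affinity, steps=3, restart=0.5):
    """Propagate per-pixel class scores along the affinity graph.

    Each step mixes the initial scores (restart) with scores diffused from
    neighbours, so mass only spreads between connected pixels.
    """
    p = row_normalize(affinity)
    current = list(scores)
    for _ in range(steps):
        spread = [sum(p[i][j] * current[j] for j in range(len(current)))
                  for i in range(len(current))]
        current = [restart * s0 + (1 - restart) * s
                   for s0, s in zip(scores, spread)]
    return current


def entropy(attention):
    """Shannon entropy of one attention map after normalizing it to sum to 1."""
    total = sum(attention)
    return -sum((a / total) * math.log(a / total) for a in attention if a > 0)


def select_maps(attention_maps, keep):
    """Keep the `keep` lowest-entropy (most peaked) attention maps (assumption)."""
    return sorted(attention_maps, key=entropy)[:keep]
```

With an affinity matrix that links pixels 0 and 1 but isolates pixel 2, a score placed on pixel 0 reaches pixel 1 and never leaks to pixel 2, which is the shape-following behaviour the abstract contrasts with fixed-size Gaussian kernels.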
[25] UI2Code$^\text{N}$: A Visual Language Model for Test-Time Scalable Interactive UI-to-Code Generation
Zhen Yang, Wenyi Hong, Mingde Xu, Xinyue Fan, Weihan Wang, Jiele Cheng, Xiaotao Gu, Jie Tang
🧩 TL;DR
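This paper presents UI2Code$^\text{N}$, a visual language model trained through staged pretraining, fine-tuning, and reinforcement learning that realizes a new interactive UI-to-code generation paradigm. It sets a new state of the art among open-source models on UI-to-code and UI polishing benchmarks, with performance comparable to leading closed-source models.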
📘 Detailed Summary
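Motivation: Current UI programming approaches face two key limitations: multimodal coding capabilities remain underdeveloped, and single-turn paradigms make little use of iterative visual feedback. The paper addresses these with an interactive UI-to-code paradigm that better reflects real-world workflows and raises the upper bound of achievable performance.
Method: The visual language model UI2Code$^\text{N}$ is trained through staged pretraining, fine-tuning, and reinforcement learning. It unifies three key capabilities (UI-to-code generation, UI editing, and UI polishing), and the authors explore test-time scaling for interactive generation to make systematic use of multi-turn feedback.
Result: On UI-to-code and UI polishing benchmarks, UI2Code$^\text{N}$ establishes a new state of the art among open-source models and performs comparably to leading closed-source models such as Claude-4-Sonnet and GPT-5.
Conclusion: The interactive UI-to-code paradigm substantially improves multimodal coding performance. By unifying generation, editing, and polishing with multi-turn feedback, it offers a more effective solution for real UI development workflows and shows the value of test-time scaling in interactive generation.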
📄 Abstract
User interface (UI) programming is a core yet highly complex part of modern software development. Recent advances in visual language models (VLMs) highlight the potential of automatic UI coding, but current approaches face two key limitations: multimodal coding capabilities remain underdeveloped, and single-turn paradigms make little use of iterative visual feedback. We address these challenges with an interactive UI-to-code paradigm that better reflects real-world workflows and raises the upper bound of achievable performance. Under this paradigm, we present UI2Code$^\text{N}$, a visual language model trained through staged pretraining, fine-tuning, and reinforcement learning to achieve foundational improvements in multimodal coding. The model unifies three key capabilities: UI-to-code generation, UI editing, and UI polishing. We further explore test-time scaling for interactive generation, enabling systematic use of multi-turn feedback. Experiments on UI-to-code and UI polishing benchmarks show that UI2Code$^\text{N}$ establishes a new state of the art among open-source models and achieves performance comparable to leading closed-source models such as Claude-4-Sonnet and GPT-5. Our code and models are available at https://github.com/zai-org/UI2Code_N.
[26] ImagebindDC: Compressing Multi-modal Data with Imagebind-based Condensation
Yue Min, Shaobo Wang, Jiaze Li, Tianle Niu, Junxin Fan, Yongliang Miao, Lijin Yang, Linfeng Zhang
🧩 TL;DR
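This paper presents ImageBindDC, a novel data condensation framework that operates in the unified feature space of ImageBind. It uses a characteristic function loss for precise statistical alignment and achieves state-of-the-art performance in cross-modal data condensation. On the NYU-v2 dataset, training on only 5 condensed datapoints per class yields lossless performance comparable to training on the full dataset.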
📘 Detailed Summary
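Motivation: Conventional data condensation techniques work well in unimodal settings but often fail in multimodal ones, because intricate inter-modal dependencies are hard to preserve. Existing methods cannot effectively capture the semantic associations and statistical properties across modalities.
Method: ImageBindDC operates in the unified feature space of ImageBind and uses a characteristic function loss that achieves exact infinite moment matching in the Fourier domain. It enforces distributional consistency at three levels: uni-modal alignment matches the statistical properties within each modality; cross-modal alignment preserves pairwise semantics through hybrid real-synthetic data pairs; and joint-modal alignment captures the complete multivariate data structure by aligning the joint distribution of real data pairs with that of their synthetic counterparts.
Result: On the NYU-v2 dataset, a model trained on only 5 condensed datapoints per class achieves lossless performance comparable to training on the full dataset, an 8.2% absolute improvement over the previous best method with more than 4× less condensation time, setting a new state of the art.
Conclusion: Through a unified feature space and a characteristic function loss, ImageBindDC addresses key challenges in multimodal data condensation and shows that efficient condensation is feasible while preserving cross-modal dependencies. It opens a new route to efficient training in multimodal machine learning with practical value.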
📄 Abstract
Data condensation techniques aim to synthesize a compact dataset from a larger one to enable efficient model training, yet while successful in unimodal settings, they often fail in multimodal scenarios where preserving intricate inter-modal dependencies is crucial. To address this, we introduce ImageBindDC, a novel data condensation framework operating within the unified feature space of ImageBind. Our approach moves beyond conventional distribution-matching by employing a powerful Characteristic Function (CF) loss, which operates in the Fourier domain to facilitate a more precise statistical alignment via exact infinite moment matching. We design our objective to enforce three critical levels of distributional consistency: (i) uni-modal alignment, which matches the statistical properties of synthetic and real data within each modality; (ii) cross-modal alignment, which preserves pairwise semantics by matching the distributions of hybrid real-synthetic data pairs; and (iii) joint-modal alignment, which captures the complete multivariate data structure by aligning the joint distribution of real data pairs with their synthetic counterparts. Extensive experiments highlight the effectiveness of ImageBindDC: on the NYU-v2 dataset, a model trained on just 5 condensed datapoints per class achieves lossless performance comparable to one trained on the full dataset, setting a new state of the art with an 8.2% absolute improvement over the previous best method and more than 4× less condensation time.
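A characteristic function loss compares two distributions through their Fourier-domain signatures instead of a handful of moments. The sketch below shows the general idea on plain feature vectors: it estimates the empirical characteristic function of the real and synthetic sets at randomly sampled frequencies and averages the squared gap. The Gaussian frequency sampling, the number of frequencies, and the function names are assumptions for illustration; the abstract does not specify how ImageBindDC samples frequencies or weights the three alignment levels.

```python
import cmath
import random


def empirical_cf(samples, freq):
    """Empirical characteristic function: mean of exp(i * <freq, x>) over samples."""
    return sum(cmath.exp(1j * sum(f * x for f, x in zip(freq, s)))
               for s in samples) / len(samples)


def cf_loss(real, synthetic, num_freqs=64, seed=0):
    """Mean squared gap between two empirical CFs at random Gaussian frequencies.

    A fixed seed keeps the sampled frequencies reproducible (assumption made
    for this sketch; a training loop would typically resample them).
    """
    rng = random.Random(seed)
    dim = len(real[0])
    total = 0.0
    for _ in range(num_freqs):
        freq = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        gap = empirical_cf(real, freq) - empirical_cf(synthetic, freq)
        total += abs(gap) ** 2
    return total / num_freqs
```

Under the same assumptions, a joint-modal variant could be obtained by concatenating each sample's paired embeddings into one vector before calling `cf_loss`, so that the loss sees the joint distribution rather than each modality separately.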
[27] Evaluating Gemini LLM in Food Image-Based Recipe and Nutrition Description with EfficientNet-B4 Visual Backbone
Rizal Khoirul Anam
🧩 TL;DR
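This study presents a decoupled multimodal food recognition pipeline. By systematically evaluating combinations of visual backbones and generative large language models, it shows that the perceptual accuracy of the visual front end is the bottleneck for the system's overall utility.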
📘 Detailed Summary
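Motivation: The spread of digital food applications calls for reliable methods for automated nutritional analysis and culinary guidance. The study aims to address cultural bias in public datasets and to evaluate the trade-off between visual classification accuracy and the quality of generated output.
Method: A decoupled multimodal pipeline integrates a specialized visual backbone (EfficientNet-B4) with a generative large language model (Gemini LLM). A formal analysis framework, Semantic Error Propagation (SEP), is introduced to assess how classification errors cascade into the generated output.
Result: EfficientNet-B4 offers the best balance of accuracy and efficiency at 89.0% Top-1 accuracy, and Gemini reaches 9.2/10 on factual accuracy, but overall system performance is fundamentally limited by the perceptual accuracy of the visual front end.
Conclusion: The study identifies high semantic similarity as the system's most critical failure mode and shows that the classification accuracy of the visual module is the key bottleneck for the utility of multimodal food recognition systems, offering insights for future improvement.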
📄 Abstract
The proliferation of digital food applications necessitates robust methods for automated nutritional analysis and culinary guidance. This paper presents a comprehensive comparative evaluation of a decoupled, multimodal pipeline for food recognition. We evaluate a system integrating a specialized visual backbone (EfficientNet-B4) with a powerful generative large language model (Google's Gemini LLM). The core objective is to evaluate the trade-offs between visual classification accuracy, model efficiency, and the quality of generative output (nutritional data and recipes). We benchmark this pipeline against alternative vision backbones (VGG-16, ResNet-50, YOLOv8) and a lightweight LLM (Gemma). We introduce a formalization for "Semantic Error Propagation" (SEP) to analyze how classification inaccuracies from the visual module cascade into the generative output. Our analysis is grounded in a new Custom Chinese Food Dataset (CCFD) developed to address cultural bias in public datasets. Experimental results demonstrate that while EfficientNet-B4 (89.0% Top-1 Acc.) provides the best balance of accuracy and efficiency, and Gemini (9.2/10 Factual Accuracy) provides superior generative quality, the system's overall utility is fundamentally bottlenecked by the visual front-end's perceptive accuracy. We conduct a detailed per-class analysis, identifying high semantic similarity as the most critical failure mode.
[28] Text-based Aerial-Ground Person Retrieval
Xinyu Zhou, Yu Wu, Jiayao Ma, Wenhao Wang, Min Cao, Mang Ye
🧩 TL;DR
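This paper introduces the Text-based Aerial-Ground Person Retrieval (TAG-PR) task. By building the TAG-PEDES dataset and developing the TAG-CLIP retrieval framework, it tackles person image retrieval across heterogeneous views, extending conventional ground-view retrieval with the more practically valuable aerial view.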
📘 Detailed Summary
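Motivation: Traditional Text-based Person Retrieval (T-PR) considers only ground-view images and is therefore limited to a single viewpoint. TAG-PR adds aerial views, which has greater practical value but poses unique challenges because of the large viewpoint discrepancy across images.
Method: The TAG-CLIP retrieval framework uses a hierarchically routed mixture-of-experts module to learn view-specific and view-agnostic features, together with a viewpoint decoupling strategy that separates view-specific features for better cross-modal alignment. The TAG-PEDES dataset is built with a diversified text generation paradigm to ensure robustness under view heterogeneity.
Result: TAG-CLIP is evaluated on the proposed TAG-PEDES dataset and on existing T-PR benchmarks, and the experiments show that the framework handles person retrieval across heterogeneous views effectively.
Conclusion: The study provides a new task definition and solution for cross-view person retrieval. Its viewpoint decoupling and feature learning strategies mitigate the challenges of viewpoint discrepancy and lay a foundation for applications such as aerial-ground collaborative surveillance and search and rescue.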
📄 Abstract
This work introduces Text-based Aerial-Ground Person Retrieval (TAG-PR), which aims to retrieve person images from heterogeneous aerial and ground views with textual descriptions. Unlike traditional Text-based Person Retrieval (T-PR), which focuses solely on ground-view images, TAG-PR introduces greater practical significance and presents unique challenges due to the large viewpoint discrepancy across images. To support this task, we contribute: (1) TAG-PEDES dataset, constructed from public benchmarks with automatically generated textual descriptions, enhanced by a diversified text generation paradigm to ensure robustness under view heterogeneity; and (2) TAG-CLIP, a novel retrieval framework that addresses view heterogeneity through a hierarchically-routed mixture of experts module to learn view-specific and view-agnostic features and a viewpoint decoupling strategy to decouple view-specific features for better cross-modal alignment. We evaluate the effectiveness of TAG-CLIP on both the proposed TAG-PEDES dataset and existing T-PR benchmarks. The dataset and code are available at https://github.com/Flame-Chasers/TAG-PR.
[29] Anatomy-VLM: A Fine-grained Vision-Language Model for Medical Interpretation
Difei Gu, Yunhe Gao, Mu Zhou, Dimitris Metaxas
🧩 TL;DR
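This paper proposes Anatomy-VLM, a fine-grained vision-language model that integrates multi-scale medical information and anatomical structure localization to reach expert-level disease diagnosis in radiology. The model performs well on both in-distribution and out-of-distribution datasets and supports zero-shot anatomy-wise interpretation.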
📘 Detailed Summary
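Motivation: Mainstream vision-language models treat medical images as holistic entities and overlook the fine-grained image details that are vital for disease diagnosis. Radiologists diagnose by using prior medical knowledge to identify key anatomical structures as regions of interest, whereas existing models lack this fine-grained analysis capability.
Method: Anatomy-VLM uses a model encoder to localize key anatomical features in medical images, enriches these regions with structured knowledge for context-aware interpretation, and aligns multi-scale medical information to generate clinically interpretable disease predictions. The model follows a fine-grained vision-language modeling approach that combines anatomical localization with knowledge enhancement.
Result: Anatomy-VLM achieves strong performance on both in-distribution and out-of-distribution datasets, and downstream image segmentation tasks confirm that its fine-grained alignment captures anatomical and pathology-related knowledge. The model encoder supports zero-shot anatomy-wise interpretation, showing strong clinical interpretation capability.
Conclusion: The study shows that integrating fine-grained anatomical information with multi-scale alignment is essential for medical vision-language models. Anatomy-VLM's expert-level diagnostic capability supports the value of human-centric workflows in medical AI and offers a new paradigm for clinically interpretable AI diagnosis.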
📄 Abstract
Accurate disease interpretation from radiology remains challenging due to imaging heterogeneity. Achieving expert-level diagnostic decisions requires integration of subtle image features with clinical knowledge. Yet major vision-language models (VLMs) treat images as holistic entities and overlook fine-grained image details that are vital for disease diagnosis. Clinicians analyze images by utilizing their prior medical knowledge and identifying anatomical structures as important regions of interest (ROIs). Inspired by this human-centric workflow, we introduce Anatomy-VLM, a fine-grained vision-language model that incorporates multi-scale information. First, we design a model encoder to localize key anatomical features from entire medical images. Second, these regions are enriched with structured knowledge for contextually-aware interpretation. Finally, the model encoder aligns multi-scale medical information to generate clinically-interpretable disease prediction. Anatomy-VLM achieves outstanding performance on both in- and out-of-distribution datasets. We also validate the performance of Anatomy-VLM on downstream image segmentation tasks, suggesting that its fine-grained alignment captures anatomical and pathology-related knowledge. Furthermore, Anatomy-VLM's encoder facilitates zero-shot anatomy-wise interpretation, demonstrating its strong expert-level clinical interpretation capabilities.
[30] Large Sign Language Models: Toward 3D American Sign Language Translation
Sen Zhang, Xiaoxiao He, Di Liu, Zhaoyang Xia, Mingyu Zhao, Chaowei Tan, Vivian Li, Bo Liu, Dimitris N. Metaxas, Mubbasir Kapadia
🧩 TL;DR
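This paper presents Large Sign Language Models (LSLM), a new framework that uses large language models as the backbone to translate 3D American Sign Language, which can improve virtual communication for hearing-impaired people. By processing 3D sign language data directly, the framework captures rich spatial, gestural, and depth information and achieves more accurate and robust sign language translation.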
📘 Detailed Summary
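Motivation: Existing sign language recognition methods rely mainly on 2D video and cannot fully exploit the rich spatial, gestural, and depth information in 3D scenes, which limits translation accuracy and robustness. The study addresses this limitation and also explores integrating complex multimodal languages into the processing capabilities of LLMs, moving beyond purely text-based input to broaden their understanding of human communication.
Method: The study proposes an LLM-based 3D sign language translation framework that directly uses 3D sign language data to capture spatial, gestural, and depth information. Two translation modes are investigated: direct translation from 3D gesture features to text, and an instruction-guided setting in which translations are modulated by external prompts, offering greater flexibility.
Result: The approach achieves more accurate and robust 3D sign language translation and makes digital communication more accessible to the hearing-impaired community. Integrating complex multimodal languages into LLMs broadens the model's understanding of forms of human communication.
Conclusion: The work lays a foundation for inclusive multimodal intelligent systems that understand diverse forms of language. It shows the potential of LLMs for processing complex, embodied multimodal languages and opens new research directions for broader multimodal understanding and generation tasks.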
📄 Abstract
We present Large Sign Language Models (LSLM), a novel framework for translating 3D American Sign Language (ASL) by leveraging Large Language Models (LLMs) as the backbone, which can benefit hearing-impaired individuals' virtual communication. Unlike existing sign language recognition methods that rely on 2D video, our approach directly utilizes 3D sign language data to capture rich spatial, gestural, and depth information in 3D scenes. This enables more accurate and resilient translation, enhancing digital communication accessibility for the hearing-impaired community. Beyond the task of ASL translation, our work explores the integration of complex, embodied multimodal languages into the processing capabilities of LLMs, moving beyond purely text-based inputs to broaden their understanding of human communication. We investigate both direct translation from 3D gesture features to text and an instruction-guided setting where translations can be modulated by external prompts, offering greater flexibility. This work provides a foundational step toward inclusive, multimodal intelligent systems capable of understanding diverse forms of language.
[31] Top2Ground: A Height-Aware Dual Conditioning Diffusion Model for Robust Aerial-to-Ground View Generation
Jae Joong Lee, Bedrich Benes
🧩 TL;DR
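This paper proposes Top2Ground, a novel diffusion-based method that generates photorealistic ground-view images directly from aerial images without relying on intermediate representations such as depth maps or 3D voxels. It improves SSIM by 7.3% on average across three benchmark datasets and shows strong generalization.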
📘 Detailed Summary
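Motivation: Generating ground-level images from aerial views is challenging because of extreme viewpoint disparity, occlusions, and a limited field of view. Existing methods usually rely on intermediate representations such as depth maps or 3D voxels, which limits generation quality and efficiency.
Method: Top2Ground uses a diffusion model whose denoising process is conditioned on a joint representation of VAE-encoded spatial features (derived from aerial RGB images and an estimated height map) and CLIP semantic embeddings, so that generation is geometrically constrained by the scene's 3D structure and semantically consistent with its content.
Result: Evaluations on three diverse datasets, CVUSA, CVACT, and Auto Arborist, show a 7.3% average improvement in SSIM. Top2Ground handles both wide and narrow fields of view robustly, confirming its strong generalization.
Conclusion: Top2Ground demonstrates that direct end-to-end generation is effective: it produces geometrically accurate and semantically consistent ground-level images without intermediate representations, offering a new approach to cross-view image synthesis with practical potential.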
📄 Abstract
Generating ground-level images from aerial views is a challenging task due to extreme viewpoint disparity, occlusions, and a limited field of view. We introduce Top2Ground, a novel diffusion-based method that directly generates photorealistic ground-view images from aerial input images without relying on intermediate representations such as depth maps or 3D voxels. Specifically, we condition the denoising process on a joint representation of VAE-encoded spatial features (derived from aerial RGB images and an estimated height map) and CLIP-based semantic embeddings. This design ensures the generation is both geometrically constrained by the scene's 3D structure and semantically consistent with its content. We evaluate Top2Ground on three diverse datasets: CVUSA, CVACT, and Auto Arborist. Our approach shows a 7.3% average improvement in SSIM across the three benchmark datasets and robustly handles both wide and narrow fields of view, highlighting its strong generalization capabilities.
[32] SENCA-st: Integrating Spatial Transcriptomics and Histopathology with Cross Attention Shared Encoder for Region Identification in Cancer Pathology
Shanaka Liyanaarachchi, Chathurya Wijethunga, Shihab Aaquil Ahamed, Akthas Absar, Ranga Rodrigo
🧩 TL;DR
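This paper proposes SENCA-st, an architecture that uses a shared encoder and neighborhood cross-attention to integrate histopathology images with spatial transcriptomics data, outperforming existing methods at detecting tumor heterogeneity and tumor microenvironment regions. The model preserves the features of both modalities and pays particular attention to regions that are structurally similar but functionally different.
📘 Detailed Summary
Motivation: Current histopathology and spatial transcriptomics region segmentation methods fall into one of two extremes: they either rely too heavily on spatial transcriptomics and treat histopathology features as mere auxiliary input, or they use vanilla contrastive learning, which makes histopathology image features dominate and loses functional information. As a result, the model either gets lost in the noise of spatial transcriptomics or is over-smoothed and loses essential information.
Method: The proposed SENCA-st architecture combines a shared encoder with a neighborhood cross-attention mechanism. It preserves the features of both modalities and, importantly, uses cross-attention to emphasize regions that are structurally similar in histopathology but functionally different in spatial transcriptomics.
Result: Experiments show that the model performs strongly at detecting tumor heterogeneity and tumor microenvironment regions, surpassing state-of-the-art methods. These regions are clinically important, particularly for research on cancer drug resistance.
Conclusion: SENCA-st addresses the balance problem in multimodal data integration and provides a more effective tool for tumor heterogeneity analysis. The study underlines the importance of keeping features balanced when integrating structural and functional information, and points to new directions for cancer research and precision medicine.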
📄 Abstract
Spatial transcriptomics is an emerging field that enables the identification of functional regions based on the spatial distribution of gene expression. Integrating the functional information in transcriptomic data with structural data from histopathology images is an active research area, with applications in identifying tumor substructures associated with cancer drug resistance. Current histopathology and spatial transcriptomics region segmentation methods fall into one of two extremes: they either make spatial transcriptomics prominent, using histopathology features only to assist in processing spatial transcriptomics data, or use vanilla contrastive learning, which makes histopathology images prominent because it promotes only common features and loses functional information. In both extremes, the model either gets lost in the noise of spatial transcriptomics or is overly smoothed, losing essential information. Thus, we propose our novel architecture SENCA-st (Shared Encoder with Neighborhood Cross Attention), which preserves the features of both modalities. More importantly, it uses cross-attention to emphasize regions that are structurally similar in histopathology but functionally different in spatial transcriptomics. We demonstrate that our model surpasses state-of-the-art methods in detecting tumor heterogeneity and tumor micro-environment regions, a clinically crucial aspect.
[33] Re-coding for Uncertainties: Edge-awareness Semantic Concordance for Resilient Event-RGB Segmentation
Nan Bao, Yifan Zhao, Lin Zhu, Jia Li
🧩 TL;DR
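This paper proposes the Edge-awareness Semantic Concordance framework, which mines edge features from the event and RGB modalities for resilient fusion and addresses heterogeneous feature mismatch in multimodal semantic segmentation under extreme conditions. The method outperforms prior work on both synthetic and real-world datasets, with a 2.55% mIoU gain on the DERS-XS dataset.
📘 Detailed Summary
Motivation: Under extreme conditions (e.g., insufficient light or fierce camera motion), existing semantic segmentation methods lose a significant amount of RGB information, which severely damages segmentation results. Prior work has used the high-speed, high-dynamic-range event modality as a complement, but event and RGB data are naturally heterogeneous, which causes feature-level mismatch and suboptimal optimization in existing multimodal methods.
Method: The Edge-awareness Semantic Concordance framework has two core modules: Edge-awareness Latent Re-coding, and Re-coded Consolidation and Uncertainty Optimization. The first realigns event-RGB features into a unified semantic space under the guidance of the re-coded distribution, and uses a pre-established edge dictionary as clues to transfer event-RGB distributions into re-coded features. The second uses the re-coded edge features and uncertainty indicators to resolve heterogeneous event-RGB fusion under extreme conditions.
Result: Experiments on two synthetic datasets and one real-world event-RGB semantic segmentation dataset show that the method outperforms the state of the art by 2.55% mIoU on the proposed DERS-XS dataset and is more resilient under spatial occlusion.
Conclusion: The study shows that mining edge features from multiple modalities is an effective way to unify heterogeneous features, offering a new approach to robust semantic segmentation under extreme conditions. The proposed framework effectively handles heterogeneous fusion of the event and RGB modalities and has significant practical value.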
📄 Abstract
Semantic segmentation has achieved great success in ideal conditions. However, when facing extreme conditions (e.g., insufficient light, fierce camera motion), most existing methods suffer from significant loss of RGB information, which severely damages segmentation results. Several studies exploit the high-speed, high-dynamic-range event modality as a complement, but event and RGB data are naturally heterogeneous, which leads to feature-level mismatch and inferior optimization in existing multi-modality methods. Unlike these studies, we delve into the edge cues of both modalities for resilient fusion and propose a novel Edge-awareness Semantic Concordance framework to unify the multi-modality heterogeneous features with latent edge cues. In this framework, we first propose Edge-awareness Latent Re-coding, which obtains uncertainty indicators while realigning event-RGB features into a unified semantic space guided by the re-coded distribution, and transfers event-RGB distributions into re-coded features by utilizing a pre-established edge dictionary as clues. We then propose Re-coded Consolidation and Uncertainty Optimization, which utilize re-coded edge features and uncertainty indicators to solve the heterogeneous event-RGB fusion issues under extreme conditions. We establish two synthetic datasets and one real-world event-RGB semantic segmentation dataset for extreme scenario comparisons. Experimental results show that our method outperforms the state of the art by 2.55% mIoU on our proposed DERS-XS, and possesses superior resilience under spatial occlusion. Our code and datasets are publicly available at https://github.com/iCVTEAM/ESC.
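The mIoU figure reported above is mean Intersection-over-Union. As a rough illustration only (this is not the authors' evaluation code), mIoU over flattened label maps could be computed as in the sketch below. The function name `mean_iou`, the toy labels, and the choice to skip classes absent from both maps are assumptions made for this example.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that occur in pred or target (flattened label lists)."""
    ious = []
    for c in range(num_classes):
        # Pixels where both maps agree on class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either map predicts class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # assumption: ignore classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with three classes and six "pixels".
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(mean_iou(pred, target, num_classes=3))
```

Benchmark toolkits often accumulate a confusion matrix over the whole dataset before dividing, rather than averaging per image; which convention the paper uses is not stated in the abstract.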
[34] VideoChain: A Transformer-Based Framework for Multi-hop Video Question Generation
Arpan Phukan, Anupam Pandey, Deepjyoti Bodo, Asif Ekbal
🧩 TL;DR
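This paper proposes VideoChain, described as the first framework for Multi-hop Video Question Generation (MVQG), which generates questions that require reasoning across multiple temporally separated video segments. The authors build the large-scale MVQ-60 dataset from TVQA+ and report strong generation performance.
📘 Detailed Summary
Motivation: Multi-hop question generation has so far been confined to text, while video question generation is limited to zero-hop questions over single segments. No existing method generates multi-hop video questions that require complex reasoning across multiple temporally separated segments, which limits comprehensive evaluation of video understanding.
Method: VideoChain has a modular architecture built on a modified BART backbone enhanced with video embeddings, capturing both textual and visual dependencies. The large-scale MVQ-60 dataset is constructed automatically from TVQA+ by merging zero-hop QA pairs, which ensures scalability and diversity.
Result: VideoChain performs well on standard generation metrics: ROUGE-L (0.6454), ROUGE-1 (0.6854), BLEU-1 (0.6711), BERTScore-F1 (0.7967), and semantic similarity (0.8110), demonstrating its ability to generate coherent, contextually grounded questions that require reasoning.
Conclusion: The study demonstrates the feasibility of multi-hop video question generation. VideoChain offers a more comprehensive evaluation tool for video understanding, and its modular design lays a foundation for future video reasoning tasks and moves video question answering toward more complex reasoning.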
📄 Abstract
Multi-hop Question Generation (QG) effectively evaluates reasoning but remains confined to text; Video Question Generation (VideoQG) is limited to zero-hop questions over single segments. To address this, we introduce VideoChain, a novel Multi-hop Video Question Generation (MVQG) framework designed to generate questions that require reasoning across multiple, temporally separated video segments. VideoChain features a modular architecture built on a modified BART backbone enhanced with video embeddings, capturing textual and visual dependencies. Using the TVQA+ dataset, we automatically construct the large-scale MVQ-60 dataset by merging zero-hop QA pairs, ensuring scalability and diversity. Evaluations show VideoChain's strong performance across standard generation metrics: ROUGE-L (0.6454), ROUGE-1 (0.6854), BLEU-1 (0.6711), BERTScore-F1 (0.7967), and semantic similarity (0.8110). These results highlight the model's ability to generate coherent, contextually grounded, and reasoning-intensive questions.
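For readers unfamiliar with the ROUGE-L score quoted above, the sketch below shows one common way to compute it: an F1 score based on the longest common subsequence (LCS) of tokens. The abstract does not say which ROUGE-L variant or tokenizer was used, so the plain F1 form, the whitespace tokenization, and the function names `lcs_length` and `rouge_l_f1` are all assumptions for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def rouge_l_f1(candidate, reference):
    """LCS-based F1 between a generated question and a reference question."""
    cand = candidate.lower().split()  # assumption: whitespace tokenization
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


generated = "Who opened the door after the call"
reference = "Who opened the door before the call"
print(round(rouge_l_f1(generated, reference), 4))
```

Because the LCS respects word order but allows gaps, a single substituted word in the middle of a question lowers the score only slightly.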
[35] Compression then Matching: An Efficient Pre-training Paradigm for Multimodal Embedding
Da Li, Yuxiao Luo, Keping Bi, Jiafeng Guo, Wei Yuan, Biao Yang, Yan Wang, Fan Yang, Tingting Gao, Guorui Zhou
🧩 TL;DR
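This paper proposes CoMa, a compressed pre-training phase that serves as a warm-up for contrastive learning and turns a vision-language model into a competitive embedding model. It needs only a small amount of pre-training data to improve both efficiency and effectiveness, and achieves state-of-the-art results among models of comparable size on the MMEB benchmark.
📘 Detailed Summary
Motivation: Current vision-language models are adapted through large-scale contrastive learning that optimizes two complementary objectives at once, but the authors argue that comprehensive understanding of the input and discriminative features for downstream tasks can be decoupled. Existing approaches need large amounts of data for contrastive learning, which limits efficiency and adaptability, so a more efficient pre-training strategy is needed.
Method: CoMa is a compressed pre-training phase used as a warm-up for contrastive learning. It converts a vision-language model into an embedding model by decoupling two objectives, preserving semantic content and emphasizing discriminative features, and achieves effective optimization with only a small amount of pre-training data.
Result: On the MMEB benchmark, CoMa achieves state-of-the-art performance among vision-language models of comparable size, showing gains in both efficiency and effectiveness and competitive embedding quality from only a small amount of pre-training data.
Conclusion: The study shows that the embedding ability of vision-language models can be improved by decoupling pre-training objectives, and that a compressed pre-training phase makes contrastive learning more efficient. This offers a new optimization path for multimodal representation learning and suggests that better performance does not necessarily require large-scale data.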
📄 Abstract
Vision-language models advance multimodal representation learning by acquiring transferable semantic embeddings, thereby substantially enhancing performance across a range of vision-language tasks, including cross-modal retrieval, clustering, and classification. An effective embedding is expected to comprehensively preserve the semantic content of the input while simultaneously emphasizing features that are discriminative for downstream tasks. Recent approaches demonstrate that VLMs can be adapted into competitive embedding models via large-scale contrastive learning, enabling the simultaneous optimization of two complementary objectives. We argue that the two aforementioned objectives can be decoupled: a comprehensive understanding of the input facilitates the embedding model in achieving superior performance in downstream tasks via contrastive learning. In this paper, we propose CoMa, a compressed pre-training phase, which serves as a warm-up stage for contrastive learning. Experiments demonstrate that with only a small amount of pre-training data, we can transform a VLM into a competitive embedding model. CoMa achieves new state-of-the-art results among VLMs of comparable size on the MMEB, realizing optimization in both efficiency and effectiveness.
[36] UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist
Zhengyang Liang, Daoan Zhang, Huichi Zhou, Rui Huang, Bobo Li, Yuechen Zhang, Shengqiong Wu, Xiaohan Wang, Jiebo Luo, Lizi Liao, Hao Fei
🧩 TL;DR
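This paper proposes UniVA, an open-source, omni-capable multi-agent framework that unifies video understanding, segmentation, editing, and generation into cohesive workflows. A Plan-and-Act dual-agent architecture and a hierarchical multi-level memory enable interactive and self-reflective video creation.
📘 Detailed Summary
Motivation: Specialized AI models excel at isolated video tasks, but real-world applications need complex, iterative workflows that combine these capabilities, and there is currently no comprehensive framework that unifies video understanding, segmentation, editing, and generation.
Method: UniVA uses a Plan-and-Act dual-agent architecture: a planner agent interprets user intentions and decomposes them into structured video-processing steps, and executor agents carry out these steps through modular MCP tool servers. A hierarchical multi-level memory (global knowledge, task context, and user-specific preferences) sustains long-horizon reasoning and contextual continuity.
Result: The framework supports iterative, any-conditioned video workflows. The authors also introduce UniVA-Bench, a benchmark suite of multi-step video tasks spanning understanding, editing, segmentation, and generation, to rigorously evaluate such agentic video systems.
Conclusion: The open-source release of UniVA and UniVA-Bench aims to advance research on interactive, agentic, and general-purpose video intelligence and to provide a foundation for next-generation multimodal AI systems, addressing complex workflows that are hard to achieve with single-purpose models or monolithic video-language models.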
📄 Abstract
While specialized AI models excel at isolated video tasks like generation or understanding, real-world applications demand complex, iterative workflows that combine these capabilities. To bridge this gap, we introduce UniVA, an open-source, omni-capable multi-agent framework for next-generation video generalists that unifies video understanding, segmentation, editing, and generation into cohesive workflows. UniVA employs a Plan-and-Act dual-agent architecture that drives a highly automated and proactive workflow: a planner agent interprets user intentions and decomposes them into structured video-processing steps, while executor agents execute these through modular, MCP-based tool servers (for analysis, generation, editing, tracking, etc.). Through a hierarchical multi-level memory (global knowledge, task context, and user-specific preferences), UniVA sustains long-horizon reasoning, contextual continuity, and inter-agent communication, enabling interactive and self-reflective video creation with full traceability. This design enables iterative and any-conditioned video workflows (e.g., text/image/video-conditioned generation $\rightarrow$ multi-round editing $\rightarrow$ object segmentation $\rightarrow$ compositional synthesis) that were previously cumbersome to achieve with single-purpose models or monolithic video-language models. We also introduce UniVA-Bench, a benchmark suite of multi-step video tasks spanning understanding, editing, segmentation, and generation, to rigorously evaluate such agentic video systems. Both UniVA and UniVA-Bench are fully open-sourced, aiming to catalyze research on interactive, agentic, and general-purpose video intelligence for the next generation of multimodal AI systems. (https://univa.online/)
[37] 3D4D: An Interactive, Editable, 4D World Model via 3D Video Generation
Yunhong He, Zhengqing Yuan, Zhengzhong Tu, Yanfang Ye, Lichao Sun
🧩 TL;DR
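This paper proposes 3D4D, an interactive 4D visualization framework that integrates WebGL with Supersplat rendering to turn static images and text into coherent 4D scenes, with efficient real-time multimodal interaction. The framework supports adaptive, user-driven exploration of complex 4D environments.
📘 Detailed Summary
Motivation: Current 4D visualization systems are limited in real-time interaction efficiency and user-driven exploration, and in particular lack effective multimodal interaction for complex 4D environments. The work targets the technical challenge of converting static content into dynamic 4D scenes and the performance bottleneck of rendering large-scale 4D data in real time.
Method: The framework integrates WebGL with Supersplat rendering, is built from four core processing modules, and introduces a foveated rendering strategy to optimize the allocation of computing resources. A multimodal input pipeline provides end-to-end conversion from static images and text to coherent 4D scenes.
Result: According to the summary, the 3D4D framework achieves efficient real-time multimodal interaction and significantly improves 4D scene rendering performance. The foveated rendering strategy reduces computational overhead while maintaining visual quality, supporting adaptive exploration of complex 4D environments.
Conclusion: The study shows that integrating WebGL with Supersplat rendering is effective for 4D visualization and offers a new technical path for interactive 4D content creation. The extensible architecture lays a foundation for future multimodal 4D interaction systems with broad application prospects.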
📄 Abstract
We introduce 3D4D, an interactive 4D visualization framework that integrates WebGL with Supersplat rendering. It transforms static images and text into coherent 4D scenes through four core modules and employs a foveated rendering strategy for efficient, real-time multi-modal interaction. This framework enables adaptive, user-driven exploration of complex 4D environments. The project page and code are available at https://yunhonghe1021.github.io/NOVA/.
cs.CL [Back]
[38] Motif 2 12.7B technical report
Junghwan Lim, Sungmin Lee, Dongseok Kim, Taehyun Kim, Eunhwan Park, Jeesoo Lee, Jeongdoo Lee, Junhyeok Lee, Wai Ting Cheung, Dahye Choi, Jaeheui Her, Jaeyeon Huh, Hanbin Jung, Changjin Kang, Beomgyu Kim, Minjae Kim, Taewhan Kim, Youngrok Kim, Hyukjin Kweon, Haesol Lee, Kungyu Lee, Dongpin Oh, Yeongjae Park, Bokki Ryu, Dongjoo Weon
🧩 TL;DR
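Motif-2-12.7B is a new open-weight foundation model that pushes the efficiency frontier of large language models through architectural innovation and system-level optimization, delivering scalable language understanding and robust instruction generalization under constrained compute budgets.
📘 Detailed Summary
Motivation: The work addresses the efficiency of large language models under limited compute, improving performance by optimizing the architecture and the training system while keeping benchmark results competitive with those of much larger models.
Method: The model builds on Motif-2.6B and integrates Grouped Differential Attention, which disentangles signal and noise-control attention pathways. It is pre-trained on 5.5 trillion tokens with a curriculum-driven data scheduler, and uses the MuonClip optimizer together with custom high-performance kernels, including fused PolyNorm activations and the Parallel Muon algorithm.
Result: Motif-2-12.7B is competitive across diverse benchmarks, showing that thoughtful architectural scaling and optimized training design can rival the capabilities of much larger models, while achieving significant throughput and memory efficiency gains in distributed environments.
Conclusion: Combining architectural innovation with system optimization can deliver competitive performance without a large increase in model size, which offers useful guidance for building efficient large language models in resource-constrained settings.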
📄 Abstract
We introduce Motif-2-12.7B, a new open-weight foundation model that pushes the efficiency frontier of large language models by combining architectural innovation with system-level optimization. Designed for scalable language understanding and robust instruction generalization under constrained compute budgets, Motif-2-12.7B builds upon Motif-2.6B with the integration of Grouped Differential Attention (GDA), which improves representational efficiency by disentangling signal and noise-control attention pathways. The model is pre-trained on 5.5 trillion tokens spanning diverse linguistic, mathematical, scientific, and programming domains using a curriculum-driven data scheduler that gradually changes the data composition ratio. The training system leverages the MuonClip optimizer alongside custom high-performance kernels, including fused PolyNorm activations and the Parallel Muon algorithm, yielding significant throughput and memory efficiency gains in large-scale distributed environments. Post-training employs a three-stage supervised fine-tuning pipeline that successively enhances general instruction adherence, compositional understanding, and linguistic precision. Motif-2-12.7B demonstrates competitive performance across diverse benchmarks, showing that thoughtful architectural scaling and optimized training design can rival the capabilities of much larger models.
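The abstract mentions a curriculum-driven data scheduler that gradually changes the data composition ratio, without giving the schedule itself. As a purely hypothetical sketch of the general idea, the code below linearly interpolates between a starting and an ending mixture over training steps. The linear schedule, the domain names, the ratios, and the function name `mixture_at` are all assumptions and are not taken from the report.

```python
def mixture_at(step, total_steps, start, end):
    """Sampling ratios per domain at a given step (hypothetical linear curriculum)."""
    # Training progress clamped to [0, 1].
    t = min(max(step / total_steps, 0.0), 1.0)
    mix = {name: (1 - t) * start[name] + t * end[name] for name in start}
    # Renormalize so the ratios always sum to 1.
    total = sum(mix.values())
    return {name: value / total for name, value in mix.items()}


# Illustrative ratios only: shift weight from web text toward code and math.
start_mix = {"web": 0.70, "code": 0.20, "math": 0.10}
end_mix = {"web": 0.40, "code": 0.35, "math": 0.25}

for step in (0, 500, 1000):
    print(step, mixture_at(step, total_steps=1000, start=start_mix, end=end_mix))
```

A data loader would then sample each batch's source domain according to the returned ratios; staged or non-linear schedules would fit the same interface.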
[39] State of the Art in Text Classification for South Slavic Languages: Fine-Tuning or Prompting?
Taja Kuzman Pungeršek, Peter Rupnik, Ivan Porupski, Vuk Dinić, Nikola Ljubešić
🧩 TL;DR
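This study evaluates large language models on text classification tasks across several South Slavic languages. LLMs perform strongly in the zero-shot setting and are comparable to fine-tuned BERT-like models, but they have significant limitations in output stability, inference speed, and computational cost.
📘 Detailed Summary
Motivation: With the rise of instruction-tuned decoder-only models, text classification has increasingly moved toward zero-shot and few-shot prompting. However, the performance of LLMs on text classification, particularly for less-resourced languages, remains under-explored, and this study aims to fill that gap.
Method: The study compares openly available fine-tuned BERT-like models with open-source and closed-source LLMs on South Slavic languages, covering three tasks in three domains: sentiment classification in parliamentary speeches, topic classification in news articles and parliamentary speeches, and genre identification in web texts.
Result: LLMs show strong zero-shot performance, often matching or surpassing fine-tuned BERT-like models, and in the zero-shot setting they perform comparably in South Slavic languages and English. Key drawbacks are less predictable outputs, significantly slower inference, and higher computational costs.
Conclusion: Although LLMs perform well at zero-shot text classification, their unstable outputs, slow inference, and high computational cost mean that fine-tuned BERT-like models remain the more practical choice for large-scale automatic text annotation. This gives useful guidance for work on less-resourced languages.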
📄 Abstract
Until recently, fine-tuned BERT-like models provided state-of-the-art performance on text classification tasks. With the rise of instruction-tuned decoder-only models, commonly known as large language models (LLMs), the field has increasingly moved toward zero-shot and few-shot prompting. However, the performance of LLMs on text classification, particularly on less-resourced languages, remains under-explored. In this paper, we evaluate the performance of current language models on text classification tasks across several South Slavic languages. We compare openly available fine-tuned BERT-like models with a selection of open-source and closed-source LLMs across three tasks in three domains: sentiment classification in parliamentary speeches, topic classification in news articles and parliamentary speeches, and genre identification in web texts. Our results show that LLMs demonstrate strong zero-shot performance, often matching or surpassing fine-tuned BERT-like models. Moreover, when used in a zero-shot setup, LLMs perform comparably in South Slavic languages and English. However, we also point out key drawbacks of LLMs, including less predictable outputs, significantly slower inference, and higher computational costs. Due to these limitations, fine-tuned BERT-like models remain a more practical choice for large-scale automatic text annotation.
[40] VocalBench-zh: Decomposing and Benchmarking the Speech Conversational Abilities in Mandarin Context
Heyang Liu, Ziyang Cheng, Yuhao Wang, Hongcheng Liu, Yiqi Li, Ronghua Wu, Qunshan Gu, Yanfeng Wang, Yu Wang
🧩 TL;DR
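This paper proposes VocalBench-zh, described as the first evaluation suite for speech interaction in Mandarin contexts. It contains 10 subsets and over 10K high-quality instances covering 12 user-oriented characteristics, and an evaluation of 14 mainstream models reveals challenges common to current approaches.
📘 Detailed Summary
Motivation: As multimodal large language models develop, Mandarin, one of the most widely spoken languages in the world, is supported by most models. However, the lack of comprehensive speech-to-speech benchmarks impedes systematic evaluation by developers and fair model comparison by users, and this scarcity of evaluation resources holds back progress on Mandarin speech interaction systems.
Method: The authors build VocalBench-zh, an ability-level divided evaluation suite for Mandarin with 10 carefully designed subsets and over 10,000 high-quality instances covering 12 user-oriented characteristics, providing a comprehensive evaluation standard for Mandarin speech interaction.
Result: An evaluation of 14 mainstream models shows that current technical routes share common challenges. The results expose the limitations of existing models and highlight the need for new design ideas for next-generation speech interaction systems.
Conclusion: The study underlines the importance of evaluation benchmarks built specifically for Mandarin contexts and provides insights for next-generation speech interaction systems. Public availability of the evaluation code and datasets is expected to support further research and fair comparison.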
📄 Abstract
The development of multi-modal large language models (LLMs) has led to intelligent approaches capable of speech interaction. As one of the most widely spoken languages globally, Mandarin is supported by most models to enhance their applicability and reach. However, the scarcity of comprehensive speech-to-speech (S2S) benchmarks in Mandarin contexts impedes systematic evaluation for developers and hinders fair model comparison for users. In this work, we propose VocalBench-zh, an ability-level divided evaluation suite adapted to the Mandarin context, consisting of 10 well-crafted subsets and over 10K high-quality instances, covering 12 user-oriented characteristics. The evaluation experiment on 14 mainstream models reveals the common challenges of current approaches and highlights the need for new insights into next-generation speech interactive systems. The evaluation code and datasets will be available at https://github.com/SJTU-OmniAgent/VocalBench-zh.
[41] REFLEX: Reference-Free Evaluation of Log Summarization via Large Language Model Judgment
Priyanka Mudgal
🧩 TL;DR
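This paper proposes REFLEX, a reference-free evaluation metric for log summarization based on large language model judgment. It addresses the reliance of traditional metrics on lexical overlap and the lack of high-quality reference summaries, offering a scalable alternative for evaluating log summaries.
📘 Detailed Summary
Motivation: Evaluating log summarization systems is difficult, mainly because high-quality reference summaries are lacking and existing metrics such as ROUGE and BLEU depend on surface-level lexical overlap and cannot accurately reflect summary quality.
Method: REFLEX uses large language models as zero-shot evaluators to assess summary quality along dimensions such as relevance, informativeness, and coherence, without gold-standard references or human annotations, enabling automated reference-free evaluation.
Result: Experiments show that REFLEX produces stable, interpretable, and fine-grained evaluations across multiple log summarization datasets and distinguishes the output quality of different models more effectively than traditional metrics.
Conclusion: REFLEX provides a scalable evaluation solution for real-world settings where reference data is scarce or unavailable, advancing evaluation methods for log summarization with clear practical value.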
📄 Abstract
Evaluating log summarization systems is challenging due to the lack of high-quality reference summaries and the limitations of existing metrics like ROUGE and BLEU, which depend on surface-level lexical overlap. We introduce REFLEX, a reference-free evaluation metric for log summarization based on large language model (LLM) judgment. REFLEX uses LLMs as zero-shot evaluators to assess summary quality along dimensions such as relevance, informativeness, and coherence, without requiring gold-standard references or human annotations. We show that REFLEX produces stable, interpretable, and fine-grained evaluations across multiple log summarization datasets, and more effectively distinguishes model outputs than traditional metrics. REFLEX provides a scalable alternative for evaluating log summaries in real-world settings where reference data is scarce or unavailable.
cs.AI [Back]
[42] Versatile and Risk-Sensitive Cardiac Diagnosis via Graph-Based ECG Signal Representation
Yue Wang, Yuyang Xu, Renjun Hu, Fanqi Shen, Hanyun Jiang, Jun Wang, Jintai Chen, Danny Z. Chen, Jian Wu, Haochao Ying
🧩 TL;DR
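This paper proposes VARS, which uses a graph-based representation to model heterogeneous ECG signals uniformly, addressing the limitations of conventional deep learning ECG diagnosis in handling diverse signal configurations and detecting risk signals. The method surpasses state-of-the-art models on multiple datasets and substantially improves the identification of risk signals.
📘 Detailed Summary
Motivation: Deep learning methods for ECG diagnosis face two main hurdles: a lack of versatility in handling diverse signal configurations, and inadequate detection of risk signals caused by sample imbalance. These limitations hinder their adoption in clinical practice.
Method: VARS converts ECG signals into versatile graph structures that capture critical diagnostic features regardless of signal diversity in lead count, sampling frequency, and duration. It combines denoising reconstruction with contrastive learning to preserve raw ECG information while highlighting pathognomonic patterns.
Result: Rigorous evaluation on three distinct ECG datasets shows that VARS consistently surpasses state-of-the-art models on all of them and substantially improves the identification of risk signals. It also offers interpretability by pinpointing the exact waveforms that lead to specific model outputs.
Conclusion: VARS is a promising tool for comprehensive cardiac health assessment. Its graph-based representation improves diagnostic accuracy and strengthens clinical decision support, offers a new way to handle ECG signal diversity, and shows broad potential in medical AI.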
📄 Abstract
Despite the rapid advancements of electrocardiogram (ECG) signal diagnosis and analysis methods through deep learning, two major hurdles still limit their clinical adoption: the lack of versatility in processing ECG signals with diverse configurations, and the inadequate detection of risk signals due to sample imbalances. Addressing these challenges, we introduce VersAtile and Risk-Sensitive cardiac diagnosis (VARS), an innovative approach that employs a graph-based representation to uniformly model heterogeneous ECG signals. VARS stands out by transforming ECG signals into versatile graph structures that capture critical diagnostic features, irrespective of signal diversity in the lead count, sampling frequency, and duration. This graph-centric formulation also enhances diagnostic sensitivity, enabling precise localization and identification of abnormal ECG patterns that often elude standard analysis methods. To facilitate representation transformation, our approach integrates denoising reconstruction with contrastive learning to preserve raw ECG information while highlighting pathognomonic patterns. We rigorously evaluate the efficacy of VARS on three distinct ECG datasets, encompassing a range of structural variations. The results demonstrate that VARS not only consistently surpasses existing state-of-the-art models across all these datasets but also exhibits substantial improvement in identifying risk signals. Additionally, VARS offers interpretability by pinpointing the exact waveforms that lead to specific model outputs, thereby assisting clinicians in making informed decisions. These findings suggest that our VARS will likely emerge as an invaluable tool for comprehensive cardiac health assessment.
[43] National Institute on Aging PREPARE Challenge: Early Detection of Cognitive Impairment Using Speech - The SpeechCARE Solution
Maryam Zolnoori, Hossein Azadmaleki, Yasaman Haghbin, Ali Zolnour, Mohammad Javad Momeni Nezhad, Sina Rashidi, Mehdi Naserian, Elyas Esmaeili, Sepehr Karimi Arpanahi
🧩 TL;DR
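This paper proposes SpeechCARE, a new framework for detecting cognitive impairment based on multimodal speech processing. A Mixture-of-Experts-inspired architecture fuses acoustic, linguistic, and demographic features, achieving strong classification performance for early detection of Alzheimer's disease and related dementias.
📘 Detailed Summary
Motivation: Alzheimer's disease and related dementias affect one in five adults over 60, yet more than half of individuals with cognitive decline remain undiagnosed. Existing speech-based assessments have limited performance and generalizability, and conventional pipelines built on hand-crafted features or general-purpose audio classifiers struggle to capture the subtle speech changes associated with cognitive impairment.
Method: SpeechCARE is a multimodal speech processing pipeline that uses pretrained multilingual acoustic and linguistic transformer models to capture speech cues related to cognitive impairment. Inspired by the Mixture of Experts paradigm, it uses a dynamic fusion architecture that weights transformer-based acoustic, linguistic, and demographic inputs. It also includes robust preprocessing (automatic transcription, LLM-based anomaly detection, and task identification) and a SHAP-based explainability module.
Result: SpeechCARE achieved AUC = 0.88 and F1 = 0.72 for classifying cognitively healthy, mild cognitive impairment (MCI), and Alzheimer's disease individuals, and AUC = 0.90 and F1 = 0.62 for MCI detection. Bias analysis showed minimal disparities except for adults over 80, which were mitigated with oversampling and weighted loss.
Conclusion: The study demonstrates the effectiveness of multimodal speech analysis for detecting cognitive impairment. The dynamic fusion architecture supports the integration of additional modalities and improves robustness across tasks. Future work includes deployment in real-world care settings, EHR-integrated explainability, and a focus on underrepresented populations in New York City.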
📄 Abstract
Alzheimer's disease and related dementias (ADRD) affect one in five adults over 60, yet more than half of individuals with cognitive decline remain undiagnosed. Speech-based assessments show promise for early detection, as phonetic motor planning deficits alter acoustic features (e.g., pitch, tone), while memory and language impairments lead to syntactic and semantic errors. However, conventional speech-processing pipelines with hand-crafted features or general-purpose audio classifiers often exhibit limited performance and generalizability. To address these limitations, we introduce SpeechCARE, a multimodal speech processing pipeline that leverages pretrained, multilingual acoustic and linguistic transformer models to capture subtle speech-related cues associated with cognitive impairment. Inspired by the Mixture of Experts (MoE) paradigm, SpeechCARE employs a dynamic fusion architecture that weights transformer-based acoustic, linguistic, and demographic inputs, allowing integration of additional modalities (e.g., social factors, imaging) and enhancing robustness across diverse tasks. Its robust preprocessing includes automatic transcription, large language model (LLM)-based anomaly detection, and task identification. A SHAP-based explainability module and LLM reasoning highlight each modality's contribution to decision-making. SpeechCARE achieved AUC = 0.88 and F1 = 0.72 for classifying cognitively healthy, MCI, and AD individuals, with AUC = 0.90 and F1 = 0.62 for MCI detection. Bias analysis showed minimal disparities, except for adults over 80. Mitigation techniques included oversampling and weighted loss. Future work includes deployment in real-world care settings (e.g., VNS Health, Columbia ADRC) and EHR-integrated explainability for underrepresented populations in New York City.
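The dynamic fusion described above weights acoustic, linguistic, and demographic inputs. The abstract does not specify the gating mechanism, so the following is only a minimal sketch of a generic softmax-gated weighted sum in the spirit of Mixture-of-Experts gating. The function names `softmax` and `fuse`, the toy embeddings, and the fixed gate logits are assumptions; in a trained model the logits would come from a learned gating network rather than being hard-coded.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(embeddings, gate_logits):
    """Weighted sum of same-sized modality embeddings; returns (fused, weights)."""
    weights = softmax(gate_logits)
    dim = len(embeddings[0])
    fused = [sum(w * emb[i] for w, emb in zip(weights, embeddings)) for i in range(dim)]
    return fused, weights


# Toy 3-dimensional embeddings for three modalities (illustrative values only).
acoustic = [0.2, 0.9, 0.1]
linguistic = [0.7, 0.3, 0.5]
demographic = [0.0, 0.1, 1.0]

fused, weights = fuse([acoustic, linguistic, demographic], gate_logits=[2.0, 1.0, 0.0])
print("weights:", [round(w, 3) for w in weights])
print("fused:", [round(v, 3) for v in fused])
```

Because the weights sum to one, adding a further modality only requires one more embedding and one more gate logit, which is one way to read the abstract's claim that additional modalities can be integrated.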
[44] oboro: Text-to-Image Synthesis on Limited Data using Flow-based Diffusion Transformer with MMH Attention
Ryusuke Mizutani, Kazuaki Matano, Tsugumi Kadowaki, Haruki Tenya, Layris, nuigurumi, Koki Hashimoto, Yu Tanaka
🧩 TL;DR
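This work develops "oboro:", described as Japan's first open-source, commercially oriented image generation model. It is trained entirely from scratch using only copyright-cleared images and aims to address challenges such as labor shortages in Japan's anime production industry.
📘 Detailed Summary
Motivation: The project aims to address challenges facing Japan's anime production industry, such as labor shortages, by developing an image generation model from scratch to support the domestic creative industry. It is Japan's first open-source image generation AI project aimed at commercial use.
Method: The authors develop "oboro:", a new image generation model whose architecture is designed for limited datasets and can generate high-quality images from a small amount of training data. All training images are copyright-cleared to ensure compliance.
Result: The "oboro:" foundation model weights and inference code have been developed and publicly released. This is Japan's first fully domestically developed open-source, commercially oriented image generation AI, providing an important technical resource for the domestic AI ecosystem.
Conclusion: By keeping its development process transparent, the project contributes to Japan's AI research community and promotes the domestic AI development ecosystem, marking a milestone for Japan in generative AI.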
📄 Abstract
This project was conducted as a 2nd-term adopted project of the "Post-5G Information and Communication System Infrastructure Enhancement R&D Project Development of Competitive Generative AI Foundation Models (GENIAC)," a business of the Ministry of Economy, Trade and Industry (METI) and the New Energy and Industrial Technology Development Organization (NEDO). To address challenges such as labor shortages in Japan's anime production industry, this project aims to develop an image generation model from scratch. This report details the technical specifications of the developed image generation model, "oboro:." We have developed "oboro:," a new image generation model built from scratch, using only copyright-cleared images for training. A key characteristic is its architecture, designed to generate high-quality images even from limited datasets. The foundation model weights and inference code are publicly available alongside this report. This project marks the first release of an open-source, commercially-oriented image generation AI fully developed in Japan. AiHUB originated from the OSS community; by maintaining transparency in our development process, we aim to contribute to Japan's AI researcher and engineer community and promote the domestic AI development ecosystem.
[45] An Efficient Training Pipeline for Reasoning Graphical User Interface Agents
Georgios Pantazopoulos, Eda B. Özyiğit
🧩 TL;DR
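This paper proposes an efficient training pipeline for visual grounding that combines model-based data filtering with parameter-efficient fine-tuning. It curates 12K high-quality instances from 4.8M synthetic examples and matches or surpasses larger baselines on several benchmarks.
📘 Detailed Summary
Motivation: Existing visual grounding methods rely heavily on massive, noisy synthetic datasets, which limits the reasoning ability of Graphical User Interface agents. More efficient data curation and training strategies are needed to improve model performance.
Method: A model-based data filtering procedure first identifies challenging cases, removes misaligned examples, and then selects a diverse set of multimodal instances. A 3B-parameter vision-language model is trained on the filtered data under three regimes: supervised fine-tuning, chain-of-thought-augmented fine-tuning, and reinforcement learning via Group Relative Policy Optimization.
Result: On benchmarks such as ScreenSpot, Multimodal-Mind2Web, and AndroidControl, models trained with the filtered data and lightweight training strategies match or surpass larger baselines, demonstrating the effectiveness of efficient data curation.
Conclusion: Principled data curation and robust adaptation can rival large-scale training and yield compact yet capable multimodal reasoning agents, offering a practical option for deployment in resource-constrained settings.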
📄 Abstract
Visual grounding is the task of localising image regions from natural language queries and is critical for reasoning-capable Graphical User Interface agents. Many existing methods rely on massive, noisy synthetic datasets. This work introduces an efficient training pipeline that combines model-based data filtering with parameter-efficient fine-tuning. From 4.8M synthetic examples, 12K clean and diverse instances are curated by first identifying challenging cases, removing misaligned examples, and then selecting a diverse set of multimodal instances. On this data, a 3B-parameter Vision-Language Model is trained under three regimes: supervised fine-tuning, chain-of-thought-augmented fine-tuning, and reinforcement learning via Group Relative Policy Optimization. Models trained with the filtered data and lightweight training strategies match or surpass larger baselines on benchmarks such as ScreenSpot, Multimodal-Mind2Web, and AndroidControl. These results demonstrate that principled data curation and robust adaptation can rival large-scale training, enabling compact yet capable multimodal reasoning agents.
[46] Where and What Matters: Sensitivity-Aware Task Vectors for Many-Shot Multimodal In-Context Learning
Ziyu Ma, Chenhui Gou, Yiming Hu, Yong Wang, Xiangxiang Chu, Bohan Zhuang, Jianfei Cai
🧩 TL;DR
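This paper proposes a Sensitivity-aware Task Vector insertion framework (STV) that uses structural patterns in activation deltas to decide where to insert task vectors and what to insert, addressing the context length limits and inference cost of many-shot in-context learning in multimodal models.
📘 Detailed Summary
Motivation: Large multimodal models face limited context length and high inference cost in many-shot in-context learning. Existing task-vector-based methods either overlook the importance of the insertion location or struggle to determine suitable values for each location.
Method: STV identifies sensitive locations by analyzing the structural patterns of activation deltas across query-context pairs, builds a pre-clustered activation bank for each location, and uses reinforcement learning to select the most suitable value to insert.
Result: Evaluation across several multimodal models (e.g., Qwen-VL, Idefics-2) and tasks (e.g., VizWiz, OK-VQA) shows that STV consistently improves on previous task-vector methods and generalizes well.
Conclusion: Structural patterns in activation deltas provide a reliable cue for task vector insertion. By combining sensitivity analysis with reinforcement learning, STV achieves a more effective task vector insertion strategy for multimodal in-context learning.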
📄 Abstract
Large Multimodal Models (LMMs) have shown promising in-context learning (ICL) capabilities, but scaling to many-shot settings remains difficult due to limited context length and high inference cost. To address these challenges, task-vector-based methods have been explored, which insert compact representations of many-shot in-context demonstrations into model activations. However, existing task-vector-based methods either overlook the importance of where to insert task vectors or struggle to determine suitable values for each location. To this end, we propose a novel Sensitivity-aware Task Vector insertion framework (STV) to figure out where and what to insert. Our key insight is that activation deltas across query-context pairs exhibit consistent structural patterns, providing a reliable cue for insertion. Based on the identified sensitivity-aware locations, we construct a pre-clustered activation bank for each location by clustering the activation values, and then apply reinforcement learning to choose the most suitable one to insert. We evaluate STV across a range of multimodal models (e.g., Qwen-VL, Idefics-2) and tasks (e.g., VizWiz, OK-VQA), demonstrating its effectiveness and showing consistent improvements over previous task-vector-based methods with strong generalization.
[47] FaithAct: Faithfulness Planning and Acting in MLLMs
Junxian Li, Xinyue Xu, Sai Ma, Sichao Li
🧩 TL;DR
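This paper proposes the FaithEval evaluation framework and the FaithAct planning-and-acting framework. By distinguishing behavioral faithfulness from perceptual faithfulness and enforcing evidential grounding during reasoning, it substantially improves the faithfulness of multimodal reasoning without reducing task accuracy.
📘 Detailed Summary
Motivation: Large language models suffer from unfaithfulness: they often produce plausible but ungrounded reasoning chains that diverge from perceptual evidence or from the final conclusion. The alignment between the reasoning process and the input evidence therefore needs to be addressed.
Method: FaithEval quantifies step-level and chain-level faithfulness by evaluating whether each claimed object is visually supported by the image. FaithAct is a faithfulness-first planning and acting framework that enforces evidential grounding at every reasoning step.
Result: Across multiple reasoning benchmarks, FaithAct improves perceptual faithfulness by up to 26% over prompt-based and tool-augmented baselines without degrading task accuracy, and yields more stable reasoning trajectories.
Conclusion: Treating faithfulness as a guiding principle both mitigates hallucination and leads to more stable reasoning trajectories, establishing a unified framework for evaluating and enforcing faithfulness in multimodal reasoning.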
📄 Abstract
Unfaithfulness remains a persistent challenge for large language models (LLMs), which often produce plausible yet ungrounded reasoning chains that diverge from perceptual evidence or final conclusions. We distinguish between behavioral faithfulness (alignment between reasoning and output) and perceptual faithfulness (alignment between reasoning and input), and introduce FaithEval for quantifying step-level and chain-level faithfulness by evaluating whether each claimed object is visually supported by the image. Building on these insights, we propose FaithAct, a faithfulness-first planning and acting framework that enforces evidential grounding at every reasoning step. Experiments across multiple reasoning benchmarks demonstrate that FaithAct improves perceptual faithfulness by up to 26% without degrading task accuracy compared to prompt-based and tool-augmented baselines. Our analysis shows that treating faithfulness as a guiding principle not only mitigates hallucination but also leads to more stable reasoning trajectories. This work thereby establishes a unified framework for both evaluating and enforcing faithfulness in multimodal reasoning.
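The abstract says FaithEval scores step-level and chain-level faithfulness by checking whether each claimed object is visually supported by the image, but it does not give the formulas. The sketch below is one plausible reading, not the authors' metric: a step's score is the fraction of its claimed objects found among the objects detected in the image, and the chain score is the mean over steps. The function names, the treatment of steps with no object claims, and the toy data are assumptions.

```python
def step_faithfulness(claimed_objects, visible_objects):
    """Fraction of objects claimed in one reasoning step that are visually supported."""
    if not claimed_objects:
        return 1.0  # assumption: a step that claims no objects counts as faithful
    supported = sum(1 for obj in claimed_objects if obj in visible_objects)
    return supported / len(claimed_objects)


def chain_faithfulness(steps, visible_objects):
    """Mean step-level score over a reasoning chain (assumed aggregation)."""
    scores = [step_faithfulness(step, visible_objects) for step in steps]
    return sum(scores) / len(scores) if scores else 1.0


# Hypothetical detector output and the objects mentioned in each reasoning step.
visible = {"dog", "ball", "grass"}
chain = [
    ["dog", "ball"],     # both supported
    ["dog", "frisbee"],  # "frisbee" is not in the image
    [],                  # step with no object claims
]

print([step_faithfulness(step, visible) for step in chain])
print(round(chain_faithfulness(chain, visible), 4))
```

A real pipeline would need an object detector or grounding model to produce the visible set and a parser to extract claimed objects from each reasoning step; both are outside the scope of this sketch.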
[48] Simulating the Visual World with Artificial Intelligence: A Roadmap
Jingtong Yue, Ziqi Huang, Zhaoxi Chen, Xintao Wang, Pengfei Wan, Ziwei Liu
🧩 TL;DR
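This paper systematically surveys the evolution of video generation toward video foundation models. It conceptualizes modern video foundation models as the combination of an implicit world model and a video renderer, with the aim of building virtual environments that are physically plausible and interactive.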
📘 Detailed Summary
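Motivation: Video generation is shifting from producing visually appealing clips to building virtual environments that support interaction and maintain physical plausibility. This points toward video foundation models that act not only as visual generators but also as implicit world models capable of simulating physical dynamics, agent-environment interactions, and task planning.
Method: The survey conceptualizes modern video foundation models as the combination of two core components: an implicit world model and a video renderer. The world model encodes structured knowledge about the world, including physical laws, interaction dynamics, and agent behavior, and serves as a latent simulation engine. The video renderer turns this latent simulation into realistic visual observations.
Result: The survey traces four generations of video generation, with core capabilities advancing at each generation, culminating in a world model with intrinsic physical plausibility, real-time multimodal interaction, and planning across multiple spatiotemporal scales. It also examines applications in robotics, autonomous driving, and interactive gaming.
Conclusion: The development of video foundation models marks a shift toward systems that function as implicit world models, able to simulate complex physical and interactive dynamics. The survey offers insights into the design principles and open challenges of next-generation world models, including the role of agent intelligence in shaping and evaluating these systems.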
📄 Abstract
The landscape of video generation is shifting, from a focus on generating visually appealing clips to building virtual environments that support interaction and maintain physical plausibility. These developments point toward the emergence of video foundation models that function not only as visual generators but also as implicit world models, models that simulate the physical dynamics, agent-environment interactions, and task planning that govern real or imagined worlds. This survey provides a systematic overview of this evolution, conceptualizing modern video foundation models as the combination of two core components: an implicit world model and a video renderer. The world model encodes structured knowledge about the world, including physical laws, interaction dynamics, and agent behavior. It serves as a latent simulation engine that enables coherent visual reasoning, long-term temporal consistency, and goal-driven planning. The video renderer transforms this latent simulation into realistic visual observations, effectively producing videos as a "window" into the simulated world. We trace the progression of video generation through four generations, in which the core capabilities advance step by step, ultimately culminating in a world model, built upon a video generation model, that embodies intrinsic physical plausibility, real-time multimodal interaction, and planning capabilities spanning multiple spatiotemporal scales. For each generation, we define its core characteristics, highlight representative works, and examine their application domains such as robotics, autonomous driving, and interactive gaming. Finally, we discuss open challenges and design principles for next-generation world models, including the role of agent intelligence in shaping and evaluating these systems. An up-to-date list of related works is maintained at this link.