cs.CV [Total: 31]
cs.CL [Total: 7]
cs.AI [Total: 3]

cs.CV

[1] Skin-R1: Toward Trustworthy Clinical Reasoning for Dermatological Diagnosis

Zehao Liu, Wejieying Ren, Jipeng Zhang, Tianxiang Zhao, Jingxi Zhu, Xiaoting Li, Vasant G. Honavar

🧩 TL;DR

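This paper proposes SkinR1, a dermatological vision-language model that combines textbook-based reasoning with reinforcement learning. It builds hierarchy-aware diagnostic trajectories and uses supervised fine-tuning to address data heterogeneity and the lack of reasoning supervision, achieving superior diagnostic performance on multiple dermatology datasets.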

📘 Detailed Summary

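Motivation: The trustworthiness and clinical utility of current vision-language models for dermatological diagnosis are limited by three main factors: data heterogeneity, which leaves diagnostic labels and clinical concept annotations inconsistent across datasets; the absence of grounded diagnostic rationales, which makes reliable reasoning supervision scarce; and poor generalization, since models trained on small, densely annotated datasets struggle to transfer to large, sparsely annotated data.

Method: SkinR1 uses a unified, end-to-end framework. First, a textbook-based reasoning generator synthesizes high-fidelity, hierarchy-aware, differential-diagnosis (DDx)-informed trajectories that provide expert-level supervision. Second, these trajectories are used for supervised fine-tuning (SFT), which gives the model grounded reasoning ability. Third, a new reinforcement learning (RL) paradigm that incorporates the disease hierarchy transfers these reasoning patterns to large-scale, sparsely annotated data.

Result: Extensive experiments on multiple dermatology datasets show that SkinR1 achieves superior diagnostic accuracy. The ablation study confirms the importance of the reasoning foundation established by SFT and supports the method's effectiveness in handling data sparsity and generalization.

Conclusion: The study shows that combining deep textbook-based reasoning with reinforcement learning can address data heterogeneity and missing reasoning supervision in dermatological diagnosis. It suggests that a grounded reasoning foundation and hierarchy-aware transfer are key ingredients for reliable medical AI.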

📄 Abstract

The emergence of vision-language models (VLMs) has opened new possibilities for clinical reasoning and has shown promising performance in dermatological diagnosis. However, their trustworthiness and clinical utility are often limited by three major factors: (1) Data heterogeneity, where diverse datasets lack consistent diagnostic labels and clinical concept annotations; (2) Absence of grounded diagnostic rationales, leading to a scarcity of reliable reasoning supervision; and (3) Limited scalability and generalization, as models trained on small, densely annotated datasets struggle to transfer nuanced reasoning to large, sparsely-annotated ones. To address these limitations, we propose SkinR1, a novel dermatological VLM that combines deep, textbook-based reasoning with the broad generalization capabilities of reinforcement learning (RL). SkinR1 systematically resolves the key challenges through a unified, end-to-end framework. First, we design a textbook-based reasoning generator that synthesizes high-fidelity, hierarchy-aware, and differential-diagnosis (DDx)-informed trajectories, providing reliable expert-level supervision. Second, we leverage the constructed trajectories for supervised fine-tuning (SFT) empowering the model with grounded reasoning ability. Third, we develop a novel RL paradigm that, by incorporating the hierarchical structure of diseases, effectively transfers these grounded reasoning patterns to large-scale, sparse data. Extensive experiments on multiple dermatology datasets demonstrate that SkinR1 achieves superior diagnostic accuracy. The ablation study demonstrates the importance of the reasoning foundation instilled by SFT.
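The abstract does not specify how the disease hierarchy enters the RL reward. As a rough illustration only, the sketch below gives partial credit when a predicted diagnosis shares ancestors with the target in a taxonomy. The `PARENT` table, the function names, and the scoring rule are all assumptions made for this example, not the SkinR1 reward.

```python
# Hypothetical toy taxonomy (child -> parent); not taken from the paper.
PARENT = {
    "melanoma": "malignant melanocytic",
    "malignant melanocytic": "melanocytic",
    "nevus": "benign melanocytic",
    "benign melanocytic": "melanocytic",
    "melanocytic": "skin lesion",
    "psoriasis": "inflammatory",
    "inflammatory": "skin lesion",
}


def path_from_root(label):
    """Return the list [root, ..., label] by following PARENT links upward."""
    path = [label]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path[::-1]


def hierarchical_reward(predicted, target):
    """Assumed rule: fraction of the target's root-to-leaf path matched by the prediction."""
    pred_path = path_from_root(predicted)
    gold_path = path_from_root(target)
    shared = 0
    for p, g in zip(pred_path, gold_path):
        if p != g:
            break
        shared += 1
    return shared / len(gold_path)
```

Under this assumed rule an exact match scores 1.0, a sibling-branch error such as `nevus` for `melanoma` scores 0.5, and a label outside the taxonomy scores 0.0.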

[2] GeoSceneGraph: Geometric Scene Graph Diffusion Model for Text-guided 3D Indoor Scene Synthesis

Antonio Ruiz, Tao Wu, Andrew Melnik, Qing Cheng, Xuqin Wang, Lu Liu, Yongliang Wang, Yanfeng Zhang, Helge Ritter

🧩 TL;DR

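GeoSceneGraph synthesizes 3D indoor scenes from text prompts by leveraging scene graph structure and geometric symmetries. It generates coherent 3D scenes without predefined relationship classes and performs comparably to methods that rely on ground-truth relationship annotations.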

📘 Detailed Summary

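Motivation: Existing 3D scene generation methods have two main problems: approaches based on vision-language models are hard to deploy on resource-constrained devices, while generative approaches trained from scratch often ignore the inherent graph structure of indoor scenes, which limits scene coherence and realism. Meanwhile, existing scene-graph methods either require a user-provided semantic graph or rely on ground-truth relationship annotations, limiting their capacity to capture varied object interactions.

Method: The model is built on equivariant graph neural networks (EGNNs) and introduces a simple, effective strategy for conditioning them on text features, so that scene synthesis can exploit the graph structure and geometric symmetries of 3D scenes without predefined relationship classes. Ablation studies validate the proposed EGNN text-conditioning design.

Result: Although it uses no ground-truth relationship annotations, GeoSceneGraph achieves performance comparable to methods that do, showing that high-quality 3D scenes can be generated without predefined relationship classes.

Conclusion: The study shows that exploiting scene graph structure and geometric symmetries enables high-quality 3D scene generation without predefined relationship classes. It offers a viable route to 3D scene synthesis on resource-constrained devices and extends EGNNs to complex conditioning modalities such as text.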

📄 Abstract

Methods that synthesize indoor 3D scenes from text prompts have wide-ranging applications in film production, interior design, video games, virtual reality, and synthetic data generation for training embodied agents. Existing approaches typically either train generative models from scratch or leverage vision-language models (VLMs). While VLMs achieve strong performance, particularly for complex or open-ended prompts, smaller task-specific models remain necessary for deployment on resource-constrained devices such as extended reality (XR) glasses or mobile phones. However, many generative approaches that train from scratch overlook the inherent graph structure of indoor scenes, which can limit scene coherence and realism. Conversely, methods that incorporate scene graphs either demand a user-provided semantic graph, which is generally inconvenient and restrictive, or rely on ground-truth relationship annotations, limiting their capacity to capture more varied object interactions. To address these challenges, we introduce GeoSceneGraph, a method that synthesizes 3D scenes from text prompts by leveraging the graph structure and geometric symmetries of 3D scenes, without relying on predefined relationship classes. Despite not using ground-truth relationships, GeoSceneGraph achieves performance comparable to methods that do. Our model is built on equivariant graph neural networks (EGNNs), but existing EGNN approaches are typically limited to low-dimensional conditioning and are not designed to handle complex modalities such as text. We propose a simple and effective strategy for conditioning EGNNs on text features, and we validate our design through ablation studies.

[3] EGSA-PT:Edge-Guided Spatial Attention with Progressive Training for Monocular Depth Estimation and Segmentation of Transparent Objects

Gbenga Omotara, Ramy Farag, Seyed Mohamad Ali Tousi, G. N. DeSouza

🧩 TL;DR

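This paper proposes an edge-guided spatial attention fusion mechanism and a multi-modal progressive training strategy to counter negative cross-task interactions in multi-task learning for transparent object perception, significantly improving depth estimation accuracy on the Syn-TODD and ClearPose benchmarks.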

📘 Detailed Summary

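Motivation: Transparent object perception is a major challenge in computer vision because transparency confounds both depth estimation and semantic segmentation. Prior work has explored multi-task learning frameworks to improve robustness, but negative cross-task interactions often hinder performance, which calls for a fusion mechanism that mitigates these destructive interactions.

Method: The paper proposes Edge-Guided Spatial Attention (EGSA), a fusion mechanism that reduces destructive interactions by incorporating boundary information into the fusion of semantic and geometric features. It also develops a multi-modal progressive training strategy that transitions from edges extracted from RGB images to edges extracted from predicted depth images, so the system bootstraps learning from the rich textures in RGB images and then shifts to the more relevant geometric content in depth maps.

Result: On the Syn-TODD and ClearPose benchmarks, EGSA consistently improves depth estimation accuracy over the current state-of-the-art method, MODEST, while maintaining competitive segmentation performance, with the largest gains in transparent regions. The progressive training strategy removes the need for ground-truth depth at training time.

Conclusion: Edge-guided fusion is shown to be a robust way to improve transparent object perception, and the multi-modal progressive training strategy makes effective use of the complementary strengths of different modalities, offering a new technical route and training paradigm for the task.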

📄 Abstract

Transparent object perception remains a major challenge in computer vision research, as transparency confounds both depth estimation and semantic segmentation. Recent work has explored multi-task learning frameworks to improve robustness, yet negative cross-task interactions often hinder performance. In this work, we introduce Edge-Guided Spatial Attention (EGSA), a fusion mechanism designed to mitigate destructive interactions by incorporating boundary information into the fusion between semantic and geometric features. On both Syn-TODD and ClearPose benchmarks, EGSA consistently improved depth accuracy over the current state of the art method (MODEST), while preserving competitive segmentation performance, with the largest improvements appearing in transparent regions. Besides our fusion design, our second contribution is a multi-modal progressive training strategy, where learning transitions from edges derived from RGB images to edges derived from predicted depth images. This approach allows the system to bootstrap learning from the rich textures contained in RGB images, and then switch to more relevant geometric content in depth maps, while it eliminates the need for ground-truth depth at training time. Together, these contributions highlight edge-guided fusion as a robust approach capable of improving transparent object perception.
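The progressive strategy described above moves the edge source from RGB images to predicted depth maps during training. The exact schedule is not given in the abstract, so the sketch below assumes a linear ramp between two hypothetical epochs (`switch_start`, `switch_end`); a hard switch corresponds to setting both to the same value.

```python
def depth_edge_weight(epoch, switch_start, switch_end):
    """Weight on depth-derived edges: 0 up to switch_start, 1 from switch_end, linear in between."""
    if epoch <= switch_start:
        return 0.0
    if epoch >= switch_end:
        return 1.0
    return (epoch - switch_start) / (switch_end - switch_start)


def blend_edges(rgb_edges, depth_edges, epoch, switch_start=10, switch_end=20):
    """Blend two edge maps (lists of rows of floats) according to the assumed schedule."""
    w = depth_edge_weight(epoch, switch_start, switch_end)
    return [
        [(1.0 - w) * r + w * d for r, d in zip(rgb_row, depth_row)]
        for rgb_row, depth_row in zip(rgb_edges, depth_edges)
    ]
```

Early in training the blended map equals the RGB edge map, and after `switch_end` it equals the depth edge map, which mirrors the bootstrapping idea without claiming to reproduce the authors' schedule.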

[4] Learning Depth from Past Selves: Self-Evolution Contrast for Robust Depth Estimation

Jing Cao, Kui Jiang, Shenyi Li, Xiaocheng Feng, Yong Huang

🧩 TL;DR

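This paper proposes SEC-Depth, a self-evolution contrastive learning framework that uses intermediate parameters produced during training to build temporally evolving latency models, substantially improving the robustness of self-supervised depth estimation under adverse weather. The method adapts its learning objectives without manual intervention and performs strongly in zero-shot evaluations.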

📘 Detailed Summary

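Motivation: Existing self-supervised depth estimation methods degrade markedly under adverse weather such as rain and fog, where reduced visibility severely impairs depth prediction. This degradation is especially critical in autonomous driving and robotics, which need depth estimators that remain robust in complex environmental conditions.

Method: The paper proposes the self-evolution contrastive learning framework SEC-Depth, with a dynamic update strategy for latency models that captures optimization states across training stages. It introduces a self-evolution contrastive loss (SECL) that treats the outputs of historical latency models as negative samples, adaptively adjusting learning objectives and implicitly sensing the severity of weather degradation, which reduces the need for manual intervention.

Result: Experiments show that the method integrates seamlessly into a variety of baseline models and significantly enhances robustness in zero-shot evaluations. The contrastive mechanism mitigates performance loss under adverse weather and improves the reliability of depth estimation without additional annotated data.

Conclusion: Self-evolution contrastive learning offers an effective way to improve the robustness of self-supervised depth estimation, adapting to adverse weather through temporal modeling and a contrastive mechanism. The framework is general, may extend to other vision tasks, and offers a new technical route for perception systems in complex environments.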

📄 Abstract

Self-supervised depth estimation has gained significant attention in autonomous driving and robotics. However, existing methods exhibit substantial performance degradation under adverse weather conditions such as rain and fog, where reduced visibility critically impairs depth prediction. To address this issue, we propose a novel self-evolution contrastive learning framework called SEC-Depth for robust self-supervised depth estimation. Our approach leverages intermediate parameters generated during training to construct temporally evolving latency models. Using these, we design a self-evolution contrastive scheme to mitigate performance loss under challenging conditions. Concretely, we first design a dynamic update strategy for the latency models to capture optimization states across training stages. To leverage the latency models effectively, we introduce a self-evolution contrastive loss (SECL) that treats outputs from historical latency models as negative samples. This mechanism adaptively adjusts learning objectives while implicitly sensing weather degradation severity, reducing the need for manual intervention. Experiments show that our method integrates seamlessly into diverse baseline models and significantly enhances robustness in zero-shot evaluations.
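To make the idea of using historical model outputs as negatives concrete, here is a minimal sketch. It is not the SECL formulation: the ratio-style loss, the choice of a reference prediction as the positive, and the fixed-size snapshot history are all assumptions made for illustration.

```python
from collections import deque


def mean_abs_diff(a, b):
    """Mean absolute difference between two equally sized depth vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def contrastive_ratio_loss(pred, positive, negatives, eps=1e-8):
    """Assumed loss: small when pred is close to the positive and far from the negatives."""
    pos = mean_abs_diff(pred, positive)
    neg = sum(mean_abs_diff(pred, n) for n in negatives) / len(negatives)
    return pos / (neg + eps)


# Hypothetical bounded history of outputs from earlier training snapshots.
history = deque(maxlen=3)
for old_output in ([0.0, 0.0], [0.5, 1.0], [0.8, 1.6], [0.9, 1.9]):
    history.append(old_output)  # the oldest entry is dropped once maxlen is reached

# The positive here is an assumed reference prediction, e.g. on a clear-weather image.
loss = contrastive_ratio_loss([1.0, 2.0], [1.0, 2.0], list(history))
```

Because older snapshots tend to give worse outputs, pushing the current prediction away from them while pulling it toward the reference is one plausible reading of the mechanism; how the paper actually weights or selects snapshots is not stated in the abstract.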

[5] FarSLIP: Discovering Effective CLIP Adaptation for Fine-Grained Remote Sensing Understanding

Zhenshi Li, Weikang Yu, Dilxat Muhtar, Xueliang Zhang, Pengfeng Xiao, Pedram Ghamisi, Xiao Xiang Zhu

🧩 TL;DR

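This paper proposes the FarSLIP framework. It builds MGRS-200k, the first multi-granularity remote sensing image-text dataset, and uses patch-to-patch distillation together with CLS token-based region-category alignment to substantially improve fine-grained vision-language alignment in remote sensing, setting a new state of the art on multiple tasks.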

📘 Detailed Summary

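Motivation: Current remote sensing (RS)-specific CLIP variants still inherit limited spatial awareness, for two key reasons: existing RS image-text datasets generate global captions from object-level labels, leaving the original object-level supervision underutilized; and although region-text alignment methods succeed in the general domain, applying them directly to RS data often degrades performance.

Method: The authors construct MGRS-200k, the first multi-granularity RS image-text dataset, which provides rich object-level textual supervision. They propose FarSLIP, which aligns local and global visual cues through patch-to-patch distillation rather than the commonly used patch-to-CLS self-distillation, and exploits region-text supervision through simple CLS token-based region-category alignment rather than explicit patch-level alignment.

Result: FarSLIP substantially improves fine-grained vision-language alignment in the RS domain. It sets a new state of the art on RS open-vocabulary semantic segmentation and also on image-level tasks such as zero-shot classification and image-text retrieval.

Conclusion: The study shows that current explicit region-text alignment methods underperform because they severely degrade CLIP's semantic coherence, whereas patch-to-patch distillation combined with CLS token alignment preserves semantic coherence while improving feature discriminability, providing an effective solution for fine-grained vision-language understanding in remote sensing.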

📄 Abstract

As CLIP's global alignment limits its ability to capture fine-grained details, recent efforts have focused on enhancing its region-text alignment. However, current remote sensing (RS)-specific CLIP variants still inherit this limited spatial awareness. We identify two key limitations behind this: (1) current RS image-text datasets generate global captions from object-level labels, leaving the original object-level supervision underutilized; (2) despite the success of region-text alignment methods in general domain, their direct application to RS data often leads to performance degradation. To address these, we construct the first multi-granularity RS image-text dataset, MGRS-200k, featuring rich object-level textual supervision for RS region-category alignment. We further investigate existing fine-grained CLIP tuning strategies and find that current explicit region-text alignment methods, whether in a direct or indirect way, underperform due to severe degradation of CLIP's semantic coherence. Building on these, we propose FarSLIP, a Fine-grained Aligned RS Language-Image Pretraining framework. Rather than the commonly used patch-to-CLS self-distillation, FarSLIP employs patch-to-patch distillation to align local and global visual cues, which improves feature discriminability while preserving semantic coherence. Additionally, to effectively utilize region-text supervision, it employs simple CLS token-based region-category alignment rather than explicit patch-level alignment, further enhancing spatial awareness. FarSLIP features improved fine-grained vision-language alignment in RS domain and sets a new state of the art not only on RS open-vocabulary semantic segmentation, but also on image-level tasks such as zero-shot classification and image-text retrieval. Our dataset, code, and models are available at https://github.com/NJU-LHRS/FarSLIP.

[6] Physics-Based Benchmarking Metrics for Multimodal Synthetic Images

Kishor Datta Gupta, Marufa Kamal, Md. Mahfuzur Rahman, Fahad Rahman, Mohd Ariful Haque, Sunzida Siddique

🧩 TL;DR

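This paper proposes PCMDE, a physics-constrained evaluation metric for multimodal data that combines the reasoning ability of large language models, knowledge mapping, and vision-language models to overcome the limitations of existing metrics in semantic and structural accuracy.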

📘 Detailed Summary

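Motivation: Current state-of-the-art evaluation measures such as BLEU, CIDEr, VQA score, SigLIP-2, and CLIPScore often fail to capture semantic or structural accuracy in domain-specific or context-dependent scenarios.

Method: The method uses a three-stage architecture: first, spatial and semantic multimodal features are extracted through object detection and vision-language models; second, confidence-weighted component fusion enables adaptive component-level validation; third, large language models perform physics-guided reasoning to enforce structural and relational constraints.

Result: The proposed PCMDE metric is designed to assess semantic and structural accuracy in domain-specific scenarios, addressing the limitations of traditional metrics on complex multimodal data.

Conclusion: The study illustrates the potential of a multimodal evaluation framework that combines large language model reasoning with physical constraints, offering a new approach to assessing semantic and structural accuracy in complex scenarios.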

📄 Abstract

Current state-of-the-art measures such as BLEU, CIDEr, VQA score, SigLIP-2, and CLIPScore often fail to capture semantic or structural accuracy, especially in domain-specific or context-dependent scenarios. To overcome these limitations, this paper proposes a Physics-Constrained Multimodal Data Evaluation (PCMDE) metric that combines large language models with reasoning, knowledge-based mapping, and vision-language models. The architecture comprises three main stages: (1) extraction of spatial and semantic multimodal features through object detection and VLMs; (2) Confidence-Weighted Component Fusion for adaptive component-level validation; and (3) physics-guided reasoning using large language models to enforce structural and relational constraints (e.g., alignment, position, consistency).
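The abstract names a Confidence-Weighted Component Fusion stage without giving its formula. One simple reading, shown below purely as an assumption, is a confidence-weighted mean of per-component validation scores; the function name and the `(score, confidence)` input format are hypothetical.

```python
def confidence_weighted_score(components):
    """Fuse per-component scores, weighting each by its detection confidence.

    components: list of (score, confidence) pairs, both assumed to lie in [0, 1].
    Returns 0.0 when there are no components or all confidences are zero.
    """
    total_conf = sum(conf for _, conf in components)
    if total_conf == 0:
        return 0.0
    return sum(score * conf for score, conf in components) / total_conf
```

With this rule, a component detected with low confidence contributes little to the final score, so an uncertain detection cannot dominate the evaluation.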

[7] CPSL: Representing Volumetric Video via Content-Promoted Scene Layers

Kaiyuan Hu, Yili Jin, Junhua Liu, Xize Duan, Hong Kang, Xue Liu

🧩 TL;DR

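This paper proposes Content-Promoted Scene Layers (CPSL), a compact 2.5D video representation that decomposes frames into layers based on depth and content saliency, enabling efficient conversion of conventional 2D content into immersive media while significantly reducing storage and rendering costs.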

📘 Detailed Summary

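Motivation: Existing volumetric video representations, from explicit point clouds to implicit neural fields, are costly in capture, computation, and rendering, which limits their scalability for on-demand video and their feasibility for real-time communication. A more efficient solution is needed to bridge conventional 2D content and immersive experiences.

Method: Guided by per-frame depth and content saliency, CPSL decomposes each video frame into a small set of geometry-consistent layers equipped with soft alpha bands and an edge-depth cache. It achieves parallax-corrected novel-view synthesis through depth-weighted warping and front-to-back alpha compositing, and maintains temporal coherence with motion-guided propagation and per-layer encoding.

Result: Across multiple benchmarks, CPSL achieves better perceptual quality and boundary fidelity than layer-based and neural-field baselines while reducing storage and rendering costs severalfold, and it supports real-time playback with standard video codecs.

Conclusion: CPSL offers a practical path from 2D video to scalable 2.5D immersive media, delivering the perceptual benefits of volumetric video through lightweight, 2D-encodable assets while avoiding expensive 3D reconstruction.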

📄 Abstract

Volumetric video enables immersive and interactive visual experiences by supporting free viewpoint exploration and realistic motion parallax. However, existing volumetric representations, from explicit point clouds to implicit neural fields, remain costly in capture, computation, and rendering, which limits their scalability for on-demand video and reduces their feasibility for real-time communication. To bridge this gap, we propose Content-Promoted Scene Layers (CPSL), a compact 2.5D video representation that brings the perceptual benefits of volumetric video to conventional 2D content. Guided by per-frame depth and content saliency, CPSL decomposes each frame into a small set of geometry-consistent layers equipped with soft alpha bands and an edge-depth cache that jointly preserve occlusion ordering and boundary continuity. These lightweight, 2D-encodable assets enable parallax-corrected novel-view synthesis via depth-weighted warping and front-to-back alpha compositing, bypassing expensive 3D reconstruction. Temporally, CPSL maintains inter-frame coherence using motion-guided propagation and per-layer encoding, supporting real-time playback with standard video codecs. Across multiple benchmarks, CPSL achieves superior perceptual quality and boundary fidelity compared with layer-based and neural-field baselines while reducing storage and rendering cost severalfold. Our approach offers a practical path from 2D video to scalable 2.5D immersive media.
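Front-to-back alpha compositing, which the abstract mentions as part of novel-view synthesis, is a standard technique: each layer contributes its color scaled by its own alpha and by the transmittance left over from the layers in front of it. The sketch below shows the per-pixel rule for non-premultiplied colors; it is a generic illustration, not CPSL's renderer, and the layer format is an assumption.

```python
def composite_front_to_back(layers):
    """Composite one pixel from layers ordered nearest first.

    layers: list of (color, alpha) pairs, where color is an (r, g, b) tuple of
    non-premultiplied values and alpha lies in [0, 1].
    Returns (composited_color, accumulated_alpha).
    """
    out = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light still passing through the layers seen so far
    for color, alpha in layers:
        weight = transmittance * alpha
        for k in range(3):
            out[k] += weight * color[k]
        transmittance *= 1.0 - alpha
        if transmittance <= 0.0:
            break  # everything behind a fully opaque stack is hidden
    return tuple(out), 1.0 - transmittance
```

The early exit is one reason front-to-back order is attractive for real-time playback: once the accumulated alpha reaches 1, the remaining layers need not be touched.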

[8] Unsupervised Discovery of Long-Term Spatiotemporal Periodic Workflows in Human Activities

Fan Yang, Quanting Xie, Atsunori Moteki, Shoichi Masui, Shan Jiang, Yonatan Bisk, Graham Neubig

🧩 TL;DR

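This paper introduces the first multimodal human activity benchmark for long-term periodic workflows, comprising 580 sequences, together with a lightweight, training-free baseline. Across three tasks aligned with real-world applications, the baseline substantially outperforms existing unsupervised methods and zero-shot LLM-based approaches.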

📘 Detailed Summary

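Motivation: Existing research focuses mainly on short-term periodic activities with simple structures and high-contrast patterns, while long-term periodic workflows with low-contrast patterns remain largely unexplored. This paper aims to fill that gap by providing the first systematic benchmark for analyzing long-term periodic workflows.

Method: The authors present a benchmark of 580 multimodal human activity sequences that supports three evaluation tasks: unsupervised periodic workflow detection, task completion tracking, and procedural anomaly detection. They also develop a lightweight, training-free baseline for modeling diverse periodic workflow patterns.

Result: Experiments show that the benchmark poses significant challenges to both existing unsupervised periodic detection methods and zero-shot approaches based on large language models; that the proposed baseline outperforms competing methods by a substantial margin on all evaluation tasks; and that, in real-world applications, the baseline offers deployment advantages on par with traditional supervised workflow detection approaches while requiring no annotation or retraining.

Conclusion: The study establishes the first systematic benchmark for long-term periodic workflow analysis and demonstrates the effectiveness of a lightweight, training-free method. It offers a practical solution for monitoring periodic activities in domains such as manufacturing and sports, and shows potential to replace traditional supervised methods in real-world settings.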

📄 Abstract

Periodic human activities with implicit workflows are common in manufacturing, sports, and daily life. While short-term periodic activities -- characterized by simple structures and high-contrast patterns -- have been widely studied, long-term periodic workflows with low-contrast patterns remain largely underexplored. To bridge this gap, we introduce the first benchmark comprising 580 multimodal human activity sequences featuring long-term periodic workflows. The benchmark supports three evaluation tasks aligned with real-world applications: unsupervised periodic workflow detection, task completion tracking, and procedural anomaly detection. We also propose a lightweight, training-free baseline for modeling diverse periodic workflow patterns. Experiments show that: (i) our benchmark presents significant challenges to both unsupervised periodic detection methods and zero-shot approaches based on powerful large language models (LLMs); (ii) our baseline outperforms competing methods by a substantial margin in all evaluation tasks; and (iii) in real-world applications, our baseline demonstrates deployment advantages on par with traditional supervised workflow detection approaches, eliminating the need for annotation and retraining. Our project page is https://sites.google.com/view/periodicworkflow.

[9] CKDA: Cross-modality Knowledge Disentanglement and Alignment for Visible-Infrared Lifelong Person Re-identification

Zhenyu Cui, Jiahuan Zhou, Yuxin Peng

🧩 TL;DR

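This paper proposes CKDA, a cross-modality knowledge disentanglement and alignment method that explicitly separates modality-specific and modality-common knowledge and preserves both in a balanced way, addressing collaborative forgetting in visible-infrared lifelong person re-identification. Its effectiveness and superiority are verified on four benchmark datasets.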

📘 Detailed Summary

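Motivation: Existing visible-infrared lifelong person re-identification methods typically use cross-modal knowledge distillation to alleviate catastrophic forgetting of old knowledge. However, they ignore the mutual interference between acquiring modality-specific knowledge and preventing the forgetting of modality-common knowledge, so that conflicting knowledge leads to collaborative forgetting.

Method: The proposed method, CKDA, includes a Modality-Common Prompting (MCP) module and a Modality-Specific Prompting (MSP) module that explicitly disentangle and purify the discriminative information shared across modalities and the information specific to each, avoiding interference between the two kinds of knowledge. A Cross-modal Knowledge Alignment (CKA) module then aligns the disentangled new knowledge with the old knowledge in a balanced manner, using dual-modality prototypes in two mutually independent inter- and intra-modality feature spaces.

Result: Extensive experiments on four benchmark datasets verify the effectiveness and superiority of CKDA over state-of-the-art methods.

Conclusion: The study shows that explicitly separating modality-specific and modality-common knowledge and preserving both in a balanced way can effectively address collaborative forgetting in cross-modal lifelong learning, offering a new technical direction for multimodal lifelong learning.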

📄 Abstract

Lifelong person Re-IDentification (LReID) aims to match the same person employing continuously collected individual data from different scenarios. To achieve continuous all-day person matching across day and night, Visible-Infrared Lifelong person Re-IDentification (VI-LReID) focuses on sequential training on data from visible and infrared modalities and pursues average performance over all data. To this end, existing methods typically exploit cross-modal knowledge distillation to alleviate the catastrophic forgetting of old knowledge. However, these methods ignore the mutual interference of modality-specific knowledge acquisition and modality-common knowledge anti-forgetting, where conflicting knowledge leads to collaborative forgetting. To address the above problems, this paper proposes a Cross-modality Knowledge Disentanglement and Alignment method, called CKDA, which explicitly separates and preserves modality-specific knowledge and modality-common knowledge in a balanced way. Specifically, a Modality-Common Prompting (MCP) module and a Modality-Specific Prompting (MSP) module are proposed to explicitly disentangle and purify discriminative information that coexists and is specific to different modalities, avoiding the mutual interference between both knowledge. In addition, a Cross-modal Knowledge Alignment (CKA) module is designed to further align the disentangled new knowledge with the old one in two mutually independent inter- and intra-modality feature spaces based on dual-modality prototypes in a balanced manner. Extensive experiments on four benchmark datasets verify the effectiveness and superiority of our CKDA against state-of-the-art methods. The source code of this paper is available at https://github.com/PKU-ICST-MIPL/CKDA-AAAI2026.

[10] HV-Attack: Hierarchical Visual Attack for Multimodal Retrieval Augmented Generation

Linyin Luo, Yujuan Ding, Yunshan Ma, Wenqi Fan, Hanjiang Lai

🧩 TL;DR

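This paper proposes a hierarchical visual attack on multimodal retrieval-augmented generation (MRAG) systems. By adding only imperceptible perturbations to the user's image input, it disrupts the cooperation between retriever and generator and significantly degrades system performance. The attack works by misaligning and disrupting the two inputs of the generator.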

📘 Detailed Summary

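Motivation: Existing research focuses mainly on knowledge poisoning attacks against MRAG systems, which require manipulating the retriever's knowledge base. This paper explores a different setting: attacking an MRAG system solely by adding visual perturbations to the user's image input, without manipulating any other component. The setting is more challenging because fine-tuned retrievers and large generators are robust, and the effect of a visual perturbation may be weakened as it propagates through the RAG chain.

Method: The paper proposes a hierarchical visual attack that confuses generation by misaligning and disrupting the two inputs of the MRAG generator, namely the multimodal query and the augmented knowledge. A two-stage strategy produces misaligned augmented knowledge: the perturbation first breaks the cross-modal alignment of the retriever's image input and then disrupts the multimodal semantic alignment, so that the retriever recalls irrelevant knowledge from the original database.

Result: Extensive experiments were conducted on two widely used MRAG datasets, OK-VQA and InfoSeek, using CLIP-based retrievers and two large multimodal models, BLIP-2 and LLaVA, as generators. The results show that the visual attack significantly degrades both the retrieval and the generation performance of MRAG systems.

Conclusion: The study reveals that MRAG systems are vulnerable to purely visual perturbations: even without manipulating the knowledge base, small perturbations at the image input can effectively disrupt the whole system. This offers a new perspective on MRAG security and underscores the need for defenses against such visual attacks.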

📄 Abstract

Advanced multimodal Retrieval-Augmented Generation (MRAG) techniques have been widely applied to enhance the capabilities of Large Multimodal Models (LMMs), but they also bring along novel safety issues. Existing adversarial research has revealed the vulnerability of MRAG systems to knowledge poisoning attacks, which fool the retriever into recalling injected poisoned contents. However, our work considers a different setting: visual attack of MRAG by solely adding imperceptible perturbations at the image inputs of users, without manipulating any other components. This is challenging due to the robustness of fine-tuned retrievers and large-scale generators, and the effect of visual perturbation may be further weakened by propagation through the RAG chain. We propose a novel Hierarchical Visual Attack that misaligns and disrupts the two inputs (the multimodal query and the augmented knowledge) of MRAG's generator to confuse its generation. We further design a hierarchical two-stage strategy to obtain misaligned augmented knowledge. We disrupt the image input of the retriever to make it recall irrelevant knowledge from the original database, by optimizing the perturbation which first breaks the cross-modal alignment and then disrupts the multimodal semantic alignment. We conduct extensive experiments on two widely-used MRAG datasets: OK-VQA and InfoSeek. We use CLIP-based retrievers and two LMMs BLIP-2 and LLaVA as generators. Results demonstrate the effectiveness of our visual attack on MRAG through the significant decrease in both retrieval and generation performance.

[11] Evaluating Multimodal Large Language Models on Vertically Written Japanese Text

Keito Sasagawa, Shuhei Kurita, Daisuke Kawahara

🧩 TL;DR

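This study evaluates how well multimodal large language models read vertically written Japanese text, finds that existing models perform poorly on vertical text, and shows that fine-tuning on a synthetic OCR dataset significantly improves their handling of vertical Japanese.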

📘 Detailed Summary

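Motivation: Multimodal large language models are increasingly applied to visual document understanding and are expected to handle documents in many languages, including Japanese. Because some Japanese documents are written vertically, support for vertical text is essential, yet research on vertically written Japanese remains limited and the ability of existing models in this setting has not been thoroughly evaluated.

Method: The study generates a synthetic Japanese OCR dataset by rendering Japanese text into images in both horizontal and vertical writing directions, and uses it for model fine-tuning and evaluation. It also creates an evaluation dataset from real-world document images containing vertically written Japanese text.

Result: Experiments show that existing multimodal large language models perform clearly worse on vertically written Japanese text than on horizontally written text. Training on the synthetic Japanese OCR dataset significantly improves models that previously could not handle vertical writing, confirming that targeted training improves vertical text understanding.

Conclusion: The study exposes the limitations of current multimodal large language models on vertically written Japanese text and shows that targeted training can overcome them. The synthetic dataset and evaluation method provide a basis for improving multilingual document understanding, particularly for text in non-standard writing directions.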

📄 Abstract

Multimodal Large Language Models (MLLMs) have seen rapid advances in recent years and are now being applied to visual document understanding tasks. They are expected to process a wide range of document images across languages, including Japanese. Understanding documents from images requires models to read what is written in them. Since some Japanese documents are written vertically, support for vertical writing is essential. However, research specifically focused on vertically written Japanese text remains limited. In this study, we evaluate the reading capability of existing MLLMs on vertically written Japanese text. First, we generate a synthetic Japanese OCR dataset by rendering Japanese texts into images, and use it for both model fine-tuning and evaluation. This dataset includes Japanese text in both horizontal and vertical writing. We also create an evaluation dataset sourced from real-world document images containing vertically written Japanese text. Using these datasets, we demonstrate that existing MLLMs perform worse on vertically written Japanese text than on horizontally written Japanese text. Furthermore, we show that training MLLMs on our synthesized Japanese OCR dataset improves the performance of models that previously could not handle vertical writing. The datasets and code are publicly available at https://github.com/llm-jp/eval_vertical_ja.

[12] TiCAL:Typicality-Based Consistency-Aware Learning for Multimodal Emotion Recognition

Wen Yin, Siyu Zhan, Cencen Liu, Xin Hu, Guiduo Duan, Xiurui Xie, Yuan-Fang Li, Tao He

🧩 TL;DR

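This paper proposes the TiCAL framework, which addresses inter-modal emotion conflicts in multimodal emotion recognition through typicality estimation and a consistency-aware mechanism. It learns fine-grained emotion representations in hyperbolic space and significantly improves recognition on inconsistent samples.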

📘 Detailed Summary

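Motivation: Existing multimodal emotion recognition methods rely mainly on unified emotion labels for supervision and overlook a key challenge: inter-modal emotion conflicts, in which different modalities within the same sample express divergent emotional tendencies. This limits model performance in complex real-world scenarios.

Method: The paper proposes TiCAL, a typicality-based, consistency-aware multimodal emotion recognition framework. It dynamically assesses the consistency of each training sample using pseudo unimodal emotion labels and a typicality estimate, embeds features in hyperbolic space to capture fine-grained distinctions among emotion categories, and incorporates the consistency estimates into the learning process.

Result: Extensive experiments on benchmark datasets such as CMU-MOSEI and MER2023 validate the effectiveness of TiCAL, particularly on samples with high modality inconsistency, with an improvement of about 2.6% over the state-of-the-art DMD method.

Conclusion: The study highlights the importance of handling inter-modal emotion conflicts for multimodal emotion recognition. The typicality-guided, consistency-aware framework offers a new way to handle complex multimodal interactions, and the hyperbolic embedding further improves the discriminability of emotion representations.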

📄 Abstract

Multimodal Emotion Recognition (MER) aims to accurately identify human emotional states by integrating heterogeneous modalities such as visual, auditory, and textual data. Existing approaches predominantly rely on unified emotion labels to supervise model training, often overlooking a critical challenge: inter-modal emotion conflicts, wherein different modalities within the same sample may express divergent emotional tendencies. In this work, we address this overlooked issue by proposing a novel framework, Typicality-based Consistent-aware Multimodal Emotion Recognition (TiCAL), inspired by the stage-wise nature of human emotion perception. TiCAL dynamically assesses the consistency of each training sample by leveraging pseudo unimodal emotion labels alongside a typicality estimation. To further enhance emotion representation, we embed features in a hyperbolic space, enabling the capture of fine-grained distinctions among emotional categories. By incorporating consistency estimates into the learning process, our method improves model performance, particularly on samples exhibiting high modality inconsistency. Extensive experiments on benchmark datasets, e.g., CMU-MOSEI and MER2023, validate the effectiveness of TiCAL in mitigating inter-modal emotional conflicts and enhancing overall recognition accuracy, e.g., with an improvement of about 2.6% over the state-of-the-art DMD.
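The abstract says features are embedded in a hyperbolic space but does not name the model of hyperbolic geometry. Assuming the commonly used Poincaré ball, the distance between two points $u$ and $v$ with norms below 1 is $d(u, v) = \operatorname{arcosh}\left(1 + \frac{2\lVert u - v\rVert^2}{(1 - \lVert u\rVert^2)(1 - \lVert v\rVert^2)}\right)$. The sketch below computes it; the function name is hypothetical and the choice of model is an assumption, not a detail from the paper.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincare ball."""
    def squared_norm(x):
        return sum(t * t for t in x)

    diff = squared_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - squared_norm(u)) * (1.0 - squared_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)
```

Distances grow rapidly as points approach the boundary of the ball, which is why hyperbolic embeddings are often used when fine-grained or hierarchical distinctions between categories matter.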

[13] A Comprehensive Study on Visual Token Redundancy for Discrete Diffusion-based Multimodal Large Language Models

Duo Li, Zuhao Yang, Xiaoqin Zhang, Ling Shao, Shijian Lu

🧩 TL;DR

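This study systematically analyzes how visual token redundancy evolves in discrete diffusion-based multimodal large language models (dMLLMs) and how it affects efficiency. It identifies effective visual token pruning strategies for different architectures and tasks, offering a new perspective on efficiency optimization for dMLLMs.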

📘 Detailed Summary

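Motivation: Existing dMLLMs incur significant computational overhead at inference because of full-sequence attention in every denoising step. Earlier studies tackle this from a modality-agnostic perspective but mostly overlook modality-specific visual token redundancy. This study aims to fill that gap.

Method: The authors conduct a comprehensive study of how visual token redundancy evolves across dMLLM architectures and tasks, and analyze how visual token pruning affects model responses and efficiency, with particular attention to the different behavior of from-scratch dMLLMs and AR-to-diffusion dMLLMs.

Result: Visual redundancy emerges only in from-scratch dMLLMs handling long-answer tasks. Visual token pruning introduces non-negligible information loss, and only from-scratch dMLLMs can progressively recover the lost information during late denoising steps. Layer-skipping is effective for AR-to-diffusion dMLLMs, whereas progressive or late-step pruning is more effective for from-scratch dMLLMs.

Conclusion: The study offers a new perspective on dMLLM efficiency optimization based on an analysis of visual token redundancy, and shows that different architectures call for different optimization strategies, improving the applicability of dMLLMs across multimodal understanding tasks.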

📄 Abstract

Discrete diffusion-based multimodal large language models (dMLLMs) have emerged as a promising alternative to autoregressive MLLMs thanks to their advantages in parallel decoding and bidirectional context modeling, but most existing dMLLMs incur significant computational overhead during inference due to the full-sequence attention computation in each denoising step. Pioneering studies attempt to resolve this issue from a modality-agnostic perspective via key-value cache optimization or efficient sampling, but most of them overlook modality-specific visual token redundancy. In this work, we conduct a comprehensive study on how visual token redundancy evolves with different dMLLM architectures and tasks and how visual token pruning affects dMLLM responses and efficiency. Specifically, our study reveals that visual redundancy emerges only in from-scratch dMLLMs while handling long-answer tasks. In addition, we validate that visual token pruning introduces non-negligible information loss in dMLLMs and only from-scratch dMLLMs can recover the lost information progressively during late denoising steps. Furthermore, our study shows that layer-skipping is promising for accelerating AR-to-diffusion dMLLMs, whereas progressive or late-step pruning is more effective for from-scratch dMLLMs. Overall, this work offers a new perspective on efficiency optimization for dMLLMs, greatly advancing their applicability across various multimodal understanding tasks.

[14] Unbiased Semantic Decoding with Vision Foundation Models for Few-shot Segmentation

Jin Wang, Bingfeng Zhang, Jian Pang, Weifeng Liu, Baodi Liu, Honglong Chen

🧩 TL;DR

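This paper proposes an unbiased semantic decoding strategy integrated with SAM. It extracts target information from both the support and query sets and uses semantic guidance from CLIP to make consistent predictions, addressing the decoding bias caused by SAM's reliance on accurate prompts in few-shot segmentation.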

📘 Detailed Summary

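Motivation: Current SAM-based few-shot segmentation methods rely mainly on prompts extracted from the support set, which is insufficient to activate SAM's generalization ability and tends to bias the decoding process when adapting to unknown classes. Existing methods do not make full use of query-set information to guide segmentation.

Method: The paper proposes an Unbiased Semantic Decoding (USD) strategy with two feature enhancement strategies: a global supplement at the image level that provides a generalized category indication, and a local guidance at the pixel level that provides target location information. It also proposes a learnable visual-text target prompt generator that produces target-focused prompt embeddings by letting target text embeddings interact with CLIP visual features.

Result: Without re-training the vision foundation models, the method uses semantically discriminative features to draw attention to the target region, aiming at more accurate and consistent few-shot segmentation. CLIP's semantic alignment capability enriches the original SAM features.

Conclusion: The study demonstrates the potential of combining SAM and CLIP: a semantically guided, unbiased decoding strategy addresses bias in few-shot segmentation, offers a new direction for applying vision foundation models to few-shot learning, and underlines the value of using support and query information together.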

📄 Abstract

Few-shot segmentation has garnered significant attention. Many recent approaches attempt to introduce the Segment Anything Model (SAM) to handle this task. With SAM's strong generalization ability and rich object-specific extraction ability, such a solution shows great potential in few-shot segmentation. However, the decoding process of SAM relies heavily on accurate and explicit prompts, so previous approaches mainly focus on extracting prompts from the support set. This is insufficient to activate the generalization ability of SAM, and the design easily results in a biased decoding process when adapting to unknown classes. In this work, we propose an Unbiased Semantic Decoding (USD) strategy integrated with SAM, which extracts target information from both the support and query sets simultaneously to perform consistent predictions guided by the semantics of the Contrastive Language-Image Pre-training (CLIP) model. Specifically, to enhance the unbiased semantic discrimination of SAM, we design two feature enhancement strategies that leverage the semantic alignment capability of CLIP to enrich the original SAM features: a global supplement at the image level, which provides a generalized category indication from the support image, and a local guidance at the pixel level, which provides a useful target location from the query image. In addition, to generate target-focused prompt embeddings, we propose a learnable visual-text target prompt generator that lets target text embeddings interact with CLIP visual features. Without requiring re-training of the vision foundation models, the semantically discriminative features draw attention to the target region under the guidance of prompts rich in target information.

[15] Multimodal Continual Instruction Tuning with Dynamic Gradient Guidance

Songze Li, Mingyu Gao, Tonghua Su, Xu-Yao Zhang, Zhongjie Wang

🧩 TL;DR

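This paper proposes a new multimodal continual instruction tuning method that frames catastrophic forgetting as a problem of missing gradients from old tasks. It approximates those gradients using the geometry of the parameter space, effectively mitigating catastrophic forgetting while keeping the model architecture compact.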

📘 Detailed Summary

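Motivation: Multimodal continual instruction tuning faces the serious challenge of catastrophic forgetting, where learning new tasks degrades performance on earlier ones. Existing methods usually require model expansion or large replay buffers, which limits practical use. This paper addresses the problem by looking at the nature of catastrophic forgetting from a new angle.

Method: The method treats catastrophic forgetting as a problem of missing old-task gradients and approximates them using the geometry of the parameter space: the directional vector between the current parameters and the previously optimal parameters serves as gradient guidance. The approximated gradient can be combined with real gradients from a limited replay buffer, and a Bernoulli sampling strategy dynamically balances model stability and plasticity.

Result: Extensive experiments on multimodal continual instruction tuning datasets show that the method achieves state-of-the-art performance without model expansion, effectively mitigating catastrophic forgetting while keeping a compact architecture and balancing stability and plasticity.

Conclusion: The study offers a new view of catastrophic forgetting as a missing-gradient problem rather than simple parameter overwriting. The geometric gradient approximation opens a new technical route for continual learning, shows that effective continual learning is possible without increasing model complexity, and supports compact model deployment in practice.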

📄 Abstract

Multimodal continual instruction tuning enables multimodal large language models to sequentially adapt to new tasks while building upon previously acquired knowledge. However, this continual learning paradigm faces the significant challenge of catastrophic forgetting, where learning new tasks leads to performance degradation on previous ones. In this paper, we introduce a novel insight into catastrophic forgetting by conceptualizing it as a problem of missing gradients from old tasks during new task learning. Our approach approximates these missing gradients by leveraging the geometric properties of the parameter space, specifically using the directional vector between current parameters and previously optimal parameters as gradient guidance. This approximated gradient can be further integrated with real gradients from a limited replay buffer and regulated by a Bernoulli sampling strategy that dynamically balances model stability and plasticity. Extensive experiments on multimodal continual instruction tuning datasets demonstrate that our method achieves state-of-the-art performance without model expansion, effectively mitigating catastrophic forgetting while maintaining a compact architecture.
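The gradient-guidance idea in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `guided_gradient`, the `scale` factor, and the rule that a single Bernoulli draw switches the old-task term on or off are all assumptions. The only part taken from the abstract is that the direction between the current parameters and the previously optimal parameters stands in for the missing old-task gradient. Under a locally quadratic old-task loss with an identity Hessian, that direction is exactly the old-task gradient.

```python
import numpy as np

def guided_gradient(theta, theta_prev, new_grad, replay_grad=None,
                    p=0.5, scale=1.0, rng=None):
    """Return an update direction for gradient descent on the new task.

    theta       : current parameters
    theta_prev  : parameters that were optimal for the previous tasks
    new_grad    : gradient of the new-task loss at theta
    replay_grad : optional real old-task gradient from a small replay buffer
    p           : Bernoulli probability of applying the old-task term (assumed rule)
    """
    rng = np.random.default_rng() if rng is None else rng
    grad = new_grad.copy()
    if rng.random() < p:  # stability step: add the approximated old-task gradient
        # Descending along (theta - theta_prev) pulls theta back toward theta_prev.
        grad = grad + scale * (theta - theta_prev)
        if replay_grad is not None:
            grad = grad + replay_grad
    return grad  # otherwise a pure plasticity step
```

With `p=1.0` every step is pulled toward the old optimum, and with `p=0.0` the update reduces to plain new-task training, so `p` is the knob that trades stability against plasticity in this sketch.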

[16] MMCM: Multimodality-aware Metric using Clustering-based Modes for Probabilistic Human Motion Prediction

Kyotaro Tokoro, Hiromu Taketsugu, Norimichi Ukita

🧩 TL;DR

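This paper proposes MMCM, a new multimodality-aware evaluation metric for human motion prediction. It partitions motions into clustering-based modes to assess both the coverage and the validity of predicted motions, addressing the inability of existing metrics to distinguish motion modes and their neglect of motion validity.

📘 Detailed Summary

Motivation: Existing metrics for human motion prediction have two main problems: they simply reward widely distributed motions even when those motions fall in a single mode and are kinematically invalid, and they cannot separate the coverage of multimodal predictions from their quality. This paper addresses these shortcomings with a metric that evaluates coverage and validity together.

Method: MMCM first clusters the motion space into several modes, treating each cluster as one motion mode. It then collects possible future motions from a motion dataset to identify the valid modes. Finally, it uses the distribution over modes to explicitly evaluate whether predicted motions are spread across multiple modes while remaining kinematically valid.

Result: Experiments show that the proposed clustering yields sensible mode definitions and that MMCM accurately scores multimodal predictions. The metric distinguishes the coverage and the motion quality of different prediction methods.

Conclusion: MMCM provides a more complete evaluation framework for multimodal human motion prediction and underlines the importance of considering coverage and validity together. It offers a more reliable basis for developing and comparing future motion prediction methods.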

📄 Abstract

This paper proposes a novel metric for Human Motion Prediction (HMP). Since a single past sequence can lead to multiple possible futures, a probabilistic HMP method predicts such multiple motions. While a single motion predicted by a deterministic method is evaluated only with the difference from its ground truth motion, multiple predicted motions should also be evaluated based on their distribution. For this evaluation, this paper focuses on the following two criteria. (a) Coverage: motions should be distributed among multiple motion modes to cover diverse possibilities. (b) Validity: motions should be kinematically valid as future motions observable from a given past motion. However, existing metrics simply appreciate widely distributed motions even if these motions are observed in a single mode and kinematically invalid. To resolve these disadvantages, this paper proposes a Multimodality-aware Metric using Clustering-based Modes (MMCM). For (a) coverage, MMCM divides a motion space into several clusters, each of which is regarded as a mode. These modes are used to explicitly evaluate whether predicted motions are distributed among multiple modes. For (b) validity, MMCM identifies valid modes by collecting possible future motions from a motion dataset. Our experiments validate that our clustering yields sensible mode definitions and that MMCM accurately scores multimodal predictions. Code: https://github.com/placerkyo/MMCM
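The two criteria above, coverage and validity, can be illustrated with a simplified mode-based score. This sketch is not the MMCM formula, which the abstract does not spell out. It assumes cluster centers are already available (for example from k-means over a motion dataset), that a mode is valid if at least one dataset future falls in it, and that motions are flattened to fixed-length vectors. The names `assign_modes` and `coverage_and_validity` are hypothetical.

```python
import numpy as np

def assign_modes(motions, centers):
    """Map each motion (row of an (N, D) array) to the index of its nearest center."""
    dists = np.linalg.norm(motions[:, None, :] - centers[None, :, :], axis=-1)
    return dists.argmin(axis=1)

def coverage_and_validity(predictions, possible_futures, centers):
    """Return (coverage, validity) for a set of predicted motions.

    coverage : fraction of valid modes hit by at least one prediction
    validity : fraction of predictions that fall into a valid mode
    """
    valid_modes = set(assign_modes(possible_futures, centers).tolist())
    pred_modes = assign_modes(predictions, centers).tolist()
    hit = {m for m in pred_modes if m in valid_modes}
    coverage = len(hit) / len(valid_modes)
    validity = float(np.mean([m in valid_modes for m in pred_modes]))
    return coverage, validity
```

A predictor that scatters samples widely but lands in invalid modes gets low validity here, and one that produces many variations inside a single mode gets low coverage, which is the distinction the abstract says existing metrics miss.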

[17] Instruction-Guided Lesion Segmentation for Chest X-rays with Automatically Generated Large-Scale Dataset

Geon Choi, Hangyul Yoon, Hyunju Shin, Hyunki Park, Sang Hoon Seo, Eunho Yang, Edward Choi

🧩 TL;DR

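This paper introduces instruction-guided lesion segmentation, builds MIMIC-ILS, the first large-scale instruction-answer dataset for chest X-rays, and develops ROSALIA, a vision-language model that segments multiple lesion types from simple user instructions, making chest X-ray lesion segmentation more practical and accessible.

📘 Detailed Summary

Motivation: Current chest X-ray lesion segmentation models are limited by a small number of target labels and by their reliance on detailed, expert-level text inputs, which hinders adoption in clinical practice. These limits make it hard for models to handle diverse lesion types and make them unfriendly to non-expert users.

Method: The study introduces the instruction-guided lesion segmentation paradigm and constructs MIMIC-ILS, the first large-scale instruction-answer dataset for chest X-ray lesion segmentation. The dataset contains 1.1M instruction-answer pairs derived from 192K images and 91K unique segmentation masks, covering seven major lesion types. ROSALIA, a vision-language model built on this dataset, segments multiple lesion types according to user instructions and provides textual explanations.

Result: ROSALIA achieves high segmentation and text-generation accuracy on the newly proposed task, supporting the effectiveness of the pipeline. MIMIC-ILS is shown to be a foundational resource for pixel-level chest X-ray lesion grounding.

Conclusion: By combining the instruction-guided segmentation paradigm with a large-scale dataset, the work lowers the barrier to using chest X-ray lesion segmentation and opens a new direction for medical image analysis. MIMIC-ILS is positioned as a foundational resource for the field, and ROSALIA demonstrates the potential of vision-language models for medical image understanding.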

📄 Abstract

The applicability of current lesion segmentation models for chest X-rays (CXRs) has been limited both by a small number of target labels and the reliance on long, detailed expert-level text inputs, creating a barrier to practical use. To address these limitations, we introduce a new paradigm: instruction-guided lesion segmentation (ILS), which is designed to segment diverse lesion types based on simple, user-friendly instructions. Under this paradigm, we construct MIMIC-ILS, the first large-scale instruction-answer dataset for CXR lesion segmentation, using our fully automated multimodal pipeline that generates annotations from chest X-ray images and their corresponding reports. MIMIC-ILS contains 1.1M instruction-answer pairs derived from 192K images and 91K unique segmentation masks, covering seven major lesion types. To empirically demonstrate its utility, we introduce ROSALIA, a vision-language model fine-tuned on MIMIC-ILS. ROSALIA can segment diverse lesions and provide textual explanations in response to user instructions. The model achieves high segmentation and textual accuracy in our newly proposed task, highlighting the effectiveness of our pipeline and the value of MIMIC-ILS as a foundational resource for pixel-level CXR lesion grounding.

[18] Insert In Style: A Zero-Shot Generative Framework for Harmonious Cross-Domain Object Composition

Raghu Vamsi Chittersu, Yuvraj Singh Rathore, Pranav Adlinge, Kunal Swami

🧩 TL;DR

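This paper proposes Insert In Style, the first zero-shot generative framework that inserts real-world objects into stylized scenes with high quality. It disentangles the representations of identity, style, and composition through a multi-stage training protocol and a specialized masked-attention architecture.

📘 Detailed Summary

Motivation: Reference-based object composition methods perform poorly when inserting real-world objects into stylized domains. Existing methods split into practical "blenders" that lack generative fidelity and "generators" that require impractical per-subject online finetuning, leaving the problem under-explored.

Method: The core contribution is a unified framework with two key innovations: a multi-stage training protocol that disentangles the representations of identity, style, and composition, and a specialized masked-attention architecture that enforces this disentanglement precisely during generation. The model is trained on a 100k-sample dataset curated by a new data pipeline.

Result: The framework achieves state-of-the-art performance on a new public benchmark for stylized composition, significantly outperforming existing methods on both identity and style metrics, a result strongly corroborated by user studies. The model is truly zero-shot and requires no text prompts.

Conclusion: The approach prevents the concept interference common in general-purpose, unified-attention models, provides a practical and high-fidelity solution for stylized object composition, and opens a new direction for generative AI in cross-domain object insertion.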

📄 Abstract

Reference-based object composition methods fail when inserting real-world objects into stylized domains. This under-explored problem is currently split between practical "blenders" that lack generative fidelity and "generators" that require impractical, per-subject online finetuning. In this work, we introduce Insert In Style, the first zero-shot generative framework that is both practical and high-fidelity. Our core contribution is a unified framework with two key innovations: (i) a novel multi-stage training protocol that disentangles representations for identity, style, and composition, and (ii) a specialized masked-attention architecture that surgically enforces this disentanglement during generation. This approach prevents the concept interference common in general-purpose, unified-attention models. Our framework is trained on a new 100k sample dataset, curated from a novel data pipeline. This pipeline couples large-scale generation with a rigorous, two-stage filtering process to ensure both high-fidelity semantic identity and style coherence. Unlike prior work, our model is truly zero-shot and requires no text prompts. We also introduce a new public benchmark for stylized composition. We demonstrate state-of-the-art performance, significantly outperforming existing methods on both identity and style metrics, a result strongly corroborated by user studies.

Qing Wang, Chong-Wah Ngo, Ee-Peng Lim

🧩 TL;DR

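This paper uses causal theory to model bias in cross-modal retrieval between recipes and food images and proposes a causal intervention based on backdoor adjustment, reaching an oracle retrieval performance of MedR=1 on the Recipe1M dataset.

📘 Detailed Summary

Motivation: Existing methods treat a recipe as a text source describing the visual appearance of a dish and learn representations on that basis. Because of the cooking process, dish presentation, and image-capturing conditions, a food image may not capture every detail in a recipe equally. Representation learning therefore tends to capture the dominant visual-text alignment while overlooking the subtle variations that determine retrieval relevance, producing a bias that misleads image-recipe similarity judgments.

Method: Using causal theory, the paper models ingredients as a confounder source and alleviates the bias with a simple backdoor adjustment. Through causal intervention, it reformulates the conventional food-to-recipe retrieval model with an additional term that removes the potential bias in similarity judgment. It also proposes a plug-and-play neural module, which is essentially a multi-label ingredient classifier used for debiasing.

Result: On the Recipe1M dataset, the paper shows empirically that the oracle retrieval performance reaches MedR=1 at test set sizes of 1K, 10K, and even 50K, and it reports new state-of-the-art search performance on Recipe1M.

Conclusion: The causal view reveals the role of ingredients as a confounder source in cross-modal representation learning. Causal intervention removes bias in recipe-image retrieval, gives cross-modal retrieval a theory-informed solution, and illustrates the potential of causal theory at the intersection of computer vision and natural language processing.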

📄 Abstract

This paper addresses the challenges of learning representations for recipes and food images in the cross-modal retrieval problem. Because the relationship between a recipe and its cooked dish is one of cause and effect, treating a recipe as a text source that describes the visual appearance of a dish for representation learning, as existing approaches do, creates a bias that misleads image-and-recipe similarity judgments. Specifically, a food image may not equally capture every detail in a recipe, due to factors such as the cooking process, dish presentation, and image-capturing conditions. Current representation learning tends to capture the dominant visual-text alignment while overlooking subtle variations that determine retrieval relevance. In this paper, we model such bias in cross-modal representation learning using causal theory. The causal view of this problem suggests that ingredients are one of the confounder sources and that a simple backdoor adjustment can alleviate the bias. By causal intervention, we reformulate the conventional model for food-to-recipe retrieval with an additional term to remove the potential bias in similarity judgment. Based on this theory-informed formulation, we empirically show that the oracle performance of retrieval on the Recipe1M dataset is MedR=1 across testing data sizes of 1K, 10K, and even 50K. We also propose a plug-and-play neural module, which is essentially a multi-label ingredient classifier for debiasing. New state-of-the-art search performance is reported on the Recipe1M dataset.
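The MedR figure quoted above is the median rank of the correct item over all queries, so MedR=1 means the true recipe is ranked first for at least half of the image queries. A minimal sketch of the computation follows. It assumes a square similarity matrix in which query i matches candidate i, and it counts only strictly higher scores as outranking the match; the tie rule and the name `median_rank` are assumptions, not details from the paper.

```python
import numpy as np

def median_rank(similarity):
    """Median rank (MedR) of the true match in image-to-recipe retrieval.

    similarity[i, j] is the score between image query i and recipe j;
    the correct recipe for query i is assumed to be recipe i.
    """
    sim = np.asarray(similarity, dtype=float)
    true_scores = np.diag(sim)[:, None]
    # Rank = 1 + number of candidates scored strictly higher than the true match.
    ranks = 1 + (sim > true_scores).sum(axis=1)
    return float(np.median(ranks))
```

Because larger candidate pools make it easier for a distractor to outscore the true match, holding MedR=1 from 1K up to 50K candidates is a stronger statement than MedR=1 at 1K alone.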

[20] Taming Generative Synthetic Data for X-ray Prohibited Item Detection

Jialong Sun, Hongguang Zhu, Weizhe Liu, Yunda Sun, Renshuai Tao, Yunchao Wei

🧩 TL;DR

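This paper proposes Xsyn, a one-stage X-ray security image synthesis method based on text-to-image generation. Using cross-attention refinement and background occlusion modeling, it is the first to synthesize high-quality X-ray security images without extra labor cost, and it improves prohibited item detection across multiple datasets and detectors.

📘 Detailed Summary

Motivation: Current X-ray security image synthesis methods mostly follow a two-stage pipeline that requires manual foreground extraction, which adds labor cost and is inefficient. Addressing data insufficiency while improving synthesis efficiency calls for a one-stage method that needs no manual intervention.

Method: Xsyn is a one-stage synthesis framework built on a text-to-image diffusion model. Its Cross-Attention Refinement strategy uses the diffusion model's cross-attention map to refine bounding box annotations, and its Background Occlusion Modeling strategy explicitly models background occlusion in the latent space to increase imaging complexity.

Result: Experiments show that Xsyn outperforms previous methods by 1.2% mAP and that its synthetic images improve prohibited item detection across various X-ray security datasets and detectors, supporting the method's effectiveness and generalization.

Conclusion: The work is the first to achieve high-quality X-ray security image synthesis without extra labor cost, offering an efficient data augmentation solution for security inspection. The cross-attention refinement and background occlusion modeling strategies are important for improving the quality of the synthetic images.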

📄 Abstract

Training prohibited item detection models requires a large amount of X-ray security images, but collecting and annotating these images is time-consuming and laborious. To address data insufficiency, X-ray security image synthesis methods composite images to scale up datasets. However, previous methods primarily follow a two-stage pipeline, where they implement labor-intensive foreground extraction in the first stage and then composite images in the second stage. Such a pipeline introduces inevitable extra labor cost and is not efficient. In this paper, we propose a one-stage X-ray security image synthesis pipeline (Xsyn) based on text-to-image generation, which incorporates two effective strategies to improve the usability of synthetic images. The Cross-Attention Refinement (CAR) strategy leverages the cross-attention map from the diffusion model to refine the bounding box annotation. The Background Occlusion Modeling (BOM) strategy explicitly models background occlusion in the latent space to enhance imaging complexity. To the best of our knowledge, compared with previous methods, Xsyn is the first to achieve high-quality X-ray security image synthesis without extra labor cost. Experiments demonstrate that our method outperforms all previous methods with 1.2% mAP improvement, and the synthetic images generated by our method are beneficial to improve prohibited item detection performance across various X-ray security datasets and detectors. Code is available at https://github.com/pILLOW-1/Xsyn/.

[21] Text2Loc++: Generalizing 3D Point Cloud Localization from Natural Language

Yan Xia, Letian Shi, Yilin Di, Joao F. Henriques, Daniel Cremers

🧩 TL;DR

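This paper proposes Text2Loc++, a neural network for cross-modal alignment between 3D point clouds and complex natural language descriptions, which localizes positions in urban environments from text through a coarse-to-fine pipeline. It outperforms existing methods on the KITTI360Pose dataset by up to 15% and generalizes well to complex linguistic expressions and diverse urban scenes.

📘 Detailed Summary

Motivation: The work addresses the localization of 3D point cloud submaps from complex and diverse natural language descriptions. Existing methods are limited when handling complex language and varied urban scenes, so more effective cross-modal alignment is needed to bridge the semantic gap between language and point clouds.

Method: Text2Loc++ uses a coarse-to-fine localization pipeline. In the global place recognition stage, it combines a pretrained language model with a hierarchical Transformer for sentence-level semantics, uses an attention-based point cloud encoder for spatial understanding, and proposes Masked Instance Training to filter out non-aligned objects. It also introduces Modality-aware Hierarchical Contrastive Learning with cross-modal, submap-level, text-level, and instance-level losses. In the fine localization stage, it removes explicit text-instance matching entirely and uses a lightweight framework based on Prototype-based Map Cloning and a Cascaded Cross-Attention Transformer.

Result: Extensive experiments on the KITTI360Pose dataset show that Text2Loc++ outperforms existing methods by up to 15% and generalizes well to a new dataset, handling complex linguistic expressions and diverse urban scenes. The work also introduces a city-scale dataset covering both color and non-color point clouds, with location descriptions organized into three levels of linguistic complexity to support benchmarking.

Conclusion: The study shows that a coarse-to-fine pipeline combined with new cross-modal alignment techniques can match language and point clouds semantically, providing a new solution for natural-language-based 3D localization. The method shows clear advantages on complex language and diverse scenes and lays groundwork for future cross-modal localization research.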

📄 Abstract

We tackle the problem of localizing 3D point cloud submaps using complex and diverse natural language descriptions, and present Text2Loc++, a novel neural network designed for effective cross-modal alignment between language and point clouds in a coarse-to-fine localization pipeline. To support benchmarking, we introduce a new city-scale dataset covering both color and non-color point clouds from diverse urban scenes, and organize location descriptions into three levels of linguistic complexity. In the global place recognition stage, Text2Loc++ combines a pretrained language model with a Hierarchical Transformer with Max pooling (HTM) for sentence-level semantics, and employs an attention-based point cloud encoder for spatial understanding. We further propose Masked Instance Training (MIT) to filter out non-aligned objects and improve multimodal robustness. To enhance the embedding space, we introduce Modality-aware Hierarchical Contrastive Learning (MHCL), incorporating cross-modal, submap-, text-, and instance-level losses. In the fine localization stage, we completely remove explicit text-instance matching and design a lightweight yet powerful framework based on Prototype-based Map Cloning (PMC) and a Cascaded Cross-Attention Transformer (CCAT). Extensive experiments on the KITTI360Pose dataset show that Text2Loc++ outperforms existing methods by up to 15%. In addition, the proposed model exhibits robust generalization when evaluated on the new dataset, effectively handling complex linguistic expressions and a wide variety of urban environments. The code and dataset will be made publicly available.

[22] Adapt-As-You-Walk Through the Clouds: Training-Free Online Test-Time Adaptation of 3D Vision-Language Foundation Models

Mehran Tamjidi, Hamidreza Dastmalchi, Mohammadreza Alimoradijazi, Ali Cheraghian, Aijun An, Morteza Saberi

🧩 TL;DR

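This paper proposes Uni-Adapter, a training-free online test-time adaptation strategy based on dynamic prototype learning, which addresses the performance drop of 3D vision-language foundation models on noisy, incomplete, or distribution-shifted data.

📘 Detailed Summary

Motivation: 3D vision-language foundation models show strong generalization and zero-shot recognition in open-world point cloud tasks, but in practice they often underperform when data are noisy, incomplete, or drawn from a distribution different from the training data. The performance drop caused by distribution shift needs to be addressed.

Method: The paper proposes a training-free online test-time adaptation strategy based on dynamic prototype learning. It defines a 3D cache that stores class-specific cluster centers as prototypes and updates them continuously to capture intra-class variability in heterogeneous data distributions. A graph-based label smoothing module captures inter-prototype similarities to enforce label consistency among similar prototypes, and entropy-weighted aggregation unifies the predictions of the original 3D vision-language foundation model and the refined 3D cache.

Result: Without retraining, Uni-Adapter mitigates distribution shift and achieves state-of-the-art performance on multiple 3D benchmarks. Relative to the source 3D vision-language foundation models, it improves ModelNet-40C by 10.55%, ScanObjectNN-C by 8.26%, and ShapeNet-C by 4.49%.

Conclusion: The study shows that dynamic prototype learning and a cache mechanism can adapt 3D vision-language foundation models to distribution changes at test time, offering a feasible training-free route to robustness in deployment and showing the potential to maintain performance in heterogeneous data environments.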

📄 Abstract

3D Vision-Language Foundation Models (VLFMs) have shown strong generalization and zero-shot recognition capabilities in open-world point cloud processing tasks. However, these models often underperform in practical scenarios where data are noisy, incomplete, or drawn from a different distribution than the training data. To address this, we propose Uni-Adapter, a novel training-free online test-time adaptation (TTA) strategy for 3D VLFMs based on dynamic prototype learning. We define a 3D cache to store class-specific cluster centers as prototypes, which are continuously updated to capture intra-class variability in heterogeneous data distributions. These dynamic prototypes serve as anchors for cache-based logit computation via similarity scoring. Simultaneously, a graph-based label smoothing module captures inter-prototype similarities to enforce label consistency among similar prototypes. Finally, we unify predictions from the original 3D VLFM and the refined 3D cache using entropy-weighted aggregation for reliable adaptation. Without retraining, Uni-Adapter effectively mitigates distribution shifts, achieving state-of-the-art performance on diverse 3D benchmarks over different 3D VLFMs, improving ModelNet-40C by 10.55%, ScanObjectNN-C by 8.26%, and ShapeNet-C by 4.49% over the source 3D VLFMs.
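The final fusion step, entropy-weighted aggregation of the model's predictions and the cache's predictions, can be sketched as below. The abstract does not give the exact weighting, so this is one common form and an assumption: each source is weighted by `exp(-entropy)` of its own softmax distribution, so the more confident source contributes more. The function names are hypothetical.

```python
import numpy as np

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy(p, eps=1e-12):
    return -(p * np.log(p + eps)).sum(axis=-1)

def entropy_weighted_fusion(model_logits, cache_logits):
    """Fuse two (batch, classes) logit arrays into one probability array."""
    p_model = softmax(np.atleast_2d(np.asarray(model_logits, dtype=float)))
    p_cache = softmax(np.atleast_2d(np.asarray(cache_logits, dtype=float)))
    # Lower entropy means higher confidence, hence a larger weight.
    w_model = np.exp(-entropy(p_model))
    w_cache = np.exp(-entropy(p_cache))
    total = w_model + w_cache
    return (w_model / total)[:, None] * p_model + (w_cache / total)[:, None] * p_cache
```

When the cache is still nearly empty and its logits are close to uniform, its entropy is high and the fused prediction stays close to the original model, which is a reasonable behavior early in an online test stream.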

[23] A Multimodal Transformer Approach for UAV Detection and Aerial Object Recognition Using Radar, Audio, and Video Data

Mauro Larrat, Claudomiro Sales

🧩 TL;DR

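This study proposes a multimodal Transformer model that fuses radar, visible-band video, infrared video, and audio data to achieve state-of-the-art UAV detection and aerial object recognition, providing a high-accuracy, real-time solution for monitoring complex airspace.

📘 Detailed Summary

Motivation: Current UAV detection and aerial object recognition systems are limited by single-modality approaches and cannot exploit the complementary information in multiple data sources. Robust systems that fuse multimodal data are needed to improve detection accuracy and reliability in complex airspace.

Method: The study designs a multimodal Transformer architecture that uses self-attention to fuse features from four modalities (radar, RGB video, infrared video, and audio) and learn comprehensive, complementary, and highly discriminative representations for classification.

Result: On an independent test set, the model achieves 0.9812 accuracy, 0.9873 recall, 0.9787 precision, 0.9826 F1-score, and 0.9954 specificity. Computational analysis reports 1.09 GFLOPs, 1.22 million parameters, and an inference speed of 41.11 FPS. Precision and recall are particularly high when distinguishing drones from other aerial objects.

Conclusion: The study supports the effectiveness of multimodal data fusion through a Transformer architecture, offers a highly accurate and resilient solution for UAV detection and monitoring, advances aerial object classification, and demonstrates suitability for real-time applications.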

📄 Abstract

Unmanned aerial vehicle (UAV) detection and aerial object recognition are critical for modern surveillance and security, prompting a need for robust systems that overcome limitations of single-modality approaches. This research addresses these challenges by designing and rigorously evaluating a novel multimodal Transformer model that integrates diverse data streams: radar, visual band video (RGB), infrared (IR) video, and audio. The architecture effectively fuses distinct features from each modality, leveraging the Transformer's self-attention mechanisms to learn comprehensive, complementary, and highly discriminative representations for classification. The model demonstrated exceptional performance on an independent test set, achieving macro-averaged metrics of 0.9812 accuracy, 0.9873 recall, 0.9787 precision, 0.9826 F1-score, and 0.9954 specificity. Notably, it exhibited particularly high precision and recall in distinguishing drones from other aerial objects. Furthermore, computational analysis confirmed its efficiency, with 1.09 GFLOPs, 1.22 million parameters, and an inference speed of 41.11 FPS, highlighting its suitability for real-time applications. This study presents a significant advancement in aerial object classification, validating the efficacy of multimodal data fusion via a Transformer architecture for achieving state-of-the-art performance, thereby offering a highly accurate and resilient solution for UAV detection and monitoring in complex airspace.
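For readers checking the reported numbers, macro-averaged precision, recall, F1-score, and specificity are conventionally computed per class from a confusion matrix and then averaged with equal class weight. The sketch below shows that standard computation. It is not taken from the paper, the function name `macro_metrics` is hypothetical, and it assumes every class is both present and predicted at least once so that no denominator is zero.

```python
import numpy as np

def macro_metrics(confusion):
    """Macro-averaged metrics from a confusion matrix.

    confusion[i, j] = number of samples with true class i predicted as class j.
    """
    cm = np.asarray(confusion, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp   # predicted as the class but belonging to another
    fn = cm.sum(axis=1) - tp   # belonging to the class but predicted as another
    tn = cm.sum() - tp - fp - fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)
    return {
        "precision": float(precision.mean()),
        "recall": float(recall.mean()),
        "f1": float(f1.mean()),
        "specificity": float(specificity.mean()),
    }
```

Because macro averaging weights classes equally, a rare class such as a specific aerial object type affects these figures as much as a frequent one.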

[24] What Your Features Reveal: Data-Efficient Black-Box Feature Inversion Attack for Split DNNs

Zhihan Ren, Lijun He, Jiaxi Liang, Xinzhu Fu, Haixia Bi, Fan Li

🧩 TL;DR

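This paper proposes FIA-Flow, a framework that reconstructs high-fidelity images from intermediate features using a latent feature space alignment module and deterministic inversion flow matching, revealing a more severe privacy leakage risk in Split DNNs than previously recognized.

📘 Detailed Summary

Motivation: Split DNNs empower edge devices by offloading intensive computation to a cloud server, but this paradigm exposes privacy vulnerabilities because intermediate features can be exploited to reconstruct private inputs through feature inversion attacks. Existing FIA methods usually yield limited reconstruction quality, which makes it hard to assess the true extent of privacy leakage.

Method: The paper designs a Latent Feature Space Alignment Module to bridge the semantic gap between the intermediate feature space and the latent space, and develops Deterministic Inversion Flow Matching, which projects off-manifold features onto the target manifold with one-step inference. This decoupled design simplifies learning and enables effective training with few image-feature pairs.

Result: Experiments show that FIA-Flow achieves more faithful and semantically aligned feature inversion across various models (AlexNet, ResNet, Swin Transformer, DINO, and YOLO11) and layers. The paper also proposes two metrics based on a large vision-language model to quantify privacy leakage.

Conclusion: The study reveals a more severe privacy threat in Split DNNs than previously recognized. The framework allows privacy leakage risk to be assessed more accurately and informs the secure design of split computing systems.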

📄 Abstract

Split DNNs enable edge devices by offloading intensive computation to a cloud server, but this paradigm exposes privacy vulnerabilities, as the intermediate features can be exploited to reconstruct the private inputs via Feature Inversion Attack (FIA). Existing FIA methods often produce limited reconstruction quality, making it difficult to assess the true extent of privacy leakage. To reveal the privacy risk of the leaked features, we introduce FIA-Flow, a black-box FIA framework that achieves high-fidelity image reconstruction from intermediate features. To exploit the semantic information within intermediate features, we design a Latent Feature Space Alignment Module (LFSAM) to bridge the semantic gap between the intermediate feature space and the latent space. Furthermore, to rectify distributional mismatch, we develop Deterministic Inversion Flow Matching (DIFM), which projects off-manifold features onto the target manifold with one-step inference. This decoupled design simplifies learning and enables effective training with few image-feature pairs. To quantify privacy leakage from a human perspective, we also propose two metrics based on a large vision-language model. Experiments show that FIA-Flow achieves more faithful and semantically aligned feature inversion across various models (AlexNet, ResNet, Swin Transformer, DINO, and YOLO11) and layers, revealing a more severe privacy threat in Split DNNs than previously recognized.

[25] Zero-Shot Open-Vocabulary Human Motion Grounding with Test-Time Training

Yunjiao Zhou, Xinyan Chen, Junlang Qian, Lihua Xie, Jianfei Yang

🧩 TL;DR

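This paper proposes ZOMG, a zero-shot, open-vocabulary motion understanding framework that segments motion sequences into semantically aligned sub-actions without any annotations or fine-tuning, improving mAP by 8.7% on the HumanML3D benchmark.

📘 Detailed Summary

Motivation: Existing methods rely on dense supervision with predefined action classes, which is infeasible in open-vocabulary, real-world settings. A zero-shot method is needed that can decompose complex human activities into fine-grained, semantically aligned sub-actions.

Method: ZOMG integrates language semantic partition and soft masking optimization. It uses large language models to decompose instructions into ordered sub-action units and learns instance-specific temporal masks that focus on key frames, maintaining intra-segment continuity and inter-segment separation without altering the pretrained encoder.

Result: Experiments on three motion-language datasets show state-of-the-art effectiveness and efficiency in motion grounding, with a +8.7% mAP improvement over prior methods on the HumanML3D benchmark and significant gains in downstream retrieval.

Conclusion: The study establishes a new paradigm for annotation-free motion understanding, shows that a zero-shot open-vocabulary approach works for decomposing complex human activities, and offers a practical solution for behavior analysis, embodied AI, and virtual reality.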

📄 Abstract

Understanding complex human activities demands the ability to decompose motion into fine-grained, semantically aligned sub-actions. This motion grounding process is crucial for behavior analysis, embodied AI, and virtual reality. Yet most existing methods rely on dense supervision with predefined action classes, which is infeasible in open-vocabulary, real-world settings. In this paper, we propose ZOMG, a zero-shot, open-vocabulary framework that segments motion sequences into semantically meaningful sub-actions without requiring any annotations or fine-tuning. Technically, ZOMG integrates (1) language semantic partition, which leverages large language models to decompose instructions into ordered sub-action units, and (2) soft masking optimization, which learns instance-specific temporal masks to focus on frames critical to sub-actions, while maintaining intra-segment continuity and enforcing inter-segment separation, all without altering the pretrained encoder. Experiments on three motion-language datasets demonstrate state-of-the-art motion grounding effectiveness and efficiency, outperforming prior methods by +8.7% mAP on the HumanML3D benchmark. Significant improvements are also observed in downstream retrieval, establishing a new paradigm for annotation-free motion understanding.

[26] D4C: Data-free Quantization for Contrastive Language-Image Pre-training Models

Wenlun Zhang, Yunshan Zhong, Zihao Ding, Xinyu Li, Kentaro Yoshioka

🧩 TL;DR

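This paper proposes D4C, the first data-free quantization framework designed specifically for CLIP models. By generating semantically rich and structurally diverse pseudo images, it addresses the severe performance drop that occurs when existing DFQ methods are applied directly to CLIP.

📘 Detailed Summary

Motivation: Data-free quantization is valuable in privacy-sensitive scenarios, but existing methods mainly target unimodal models, and their use on vision-language models such as CLIP remains underexplored. The study finds that applying existing DFQ techniques directly to CLIP causes substantial performance degradation, mainly because the synthesized samples have insufficient semantic content and low intra-image diversity.

Method: D4C has three core components. Prompt-Guided Semantic Injection uses text prompts to align generated images with real-world semantics. Structural Contrastive Generation uses foreground-background contrastive synthesis to reproduce the compositional structure of natural images. Perturbation-Aware Enhancement applies controlled perturbations to improve sample diversity and robustness. Together, these components ensure that the synthesized images are both semantically rich and structurally diverse.

Result: Experiments support the effectiveness of D4C, with significant improvements across bit-widths and models. Under the W4A8 setting, CLIP ResNet-50 and ViT-B/32 gain 12.4% and 18.9% Top-1 accuracy on CIFAR-10, 6.8% and 19.7% on CIFAR-100, and 1.4% and 5.7% on ImageNet-1K zero-shot classification, respectively.

Conclusion: D4C addresses the key challenges of data-free quantization for CLIP, narrowing the performance gap through semantically rich and structurally diverse pseudo-image synthesis. The framework offers a practical route to privacy-preserving compression of vision-language models and shows that data-free quantization is feasible in multimodal settings.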

📄 Abstract

Data-Free Quantization (DFQ) offers a practical solution for model compression without requiring access to real data, making it particularly attractive in privacy-sensitive scenarios. While DFQ has shown promise for unimodal models, its extension to Vision-Language Models such as Contrastive Language-Image Pre-training (CLIP) models remains underexplored. In this work, we reveal that directly applying existing DFQ techniques to CLIP results in substantial performance degradation due to two key limitations: insufficient semantic content and low intra-image diversity in synthesized samples. To tackle these challenges, we propose D4C, the first DFQ framework tailored for CLIP. D4C synthesizes semantically rich and structurally diverse pseudo images through three key components: (1) Prompt-Guided Semantic Injection aligns generated images with real-world semantics using text prompts; (2) Structural Contrastive Generation reproduces compositional structures of natural images by leveraging foreground-background contrastive synthesis; and (3) Perturbation-Aware Enhancement applies controlled perturbations to improve sample diversity and robustness. These components jointly empower D4C to synthesize images that are both semantically informative and structurally diverse, effectively bridging the performance gap of DFQ on CLIP. Extensive experiments validate the effectiveness of D4C, showing significant performance improvements on various bit-widths and models. For example, under the W4A8 setting with CLIP ResNet-50 and ViT-B/32, D4C achieves Top-1 accuracy improvement of 12.4% and 18.9% on CIFAR-10, 6.8% and 19.7% on CIFAR-100, and 1.4% and 5.7% on ImageNet-1K in zero-shot classification, respectively.
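The W4A8 setting mentioned above means 4-bit weights and 8-bit activations. The sketch below shows what a bit-width does to a tensor, using symmetric per-tensor uniform quantization followed by dequantization (often called fake quantization). It is a generic illustration only: the abstract does not describe the quantizer D4C is paired with, and the function name `fake_quantize` is hypothetical.

```python
import numpy as np

def fake_quantize(x, num_bits):
    """Quantize x to signed num_bits integers, then map back to floats."""
    qmax = 2 ** (num_bits - 1) - 1            # 7 for 4 bits, 127 for 8 bits
    max_abs = np.abs(x).max()
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

rng = np.random.default_rng(0)
weights = rng.normal(size=1000)
w4 = fake_quantize(weights, 4)       # "W4": at most 16 distinct values
a8 = fake_quantize(weights, 8)       # "A8" precision, applied here to the same data
```

In data-free quantization, the synthesized images stand in for real calibration data when choosing ranges such as `max_abs` for activations, which is why their semantic content and diversity matter.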

[27] Representation Space Constrained Learning with Modality Decoupling for Multimodal Object Detection

YiKang Shao, Tao Shi

🧩 TL;DR

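This paper presents a systematic theoretical analysis of fusion degradation in multimodal object detection and proposes RSC-MD to address gradient suppression and modality imbalance, achieving state-of-the-art performance on multiple benchmark datasets.

📘 Detailed Summary

Motivation: Most current research on multimodal object detection focuses on improving modality fusion strategies but neglects fusion degradation and lacks a theoretical analysis of its root causes. This paper fills that gap with a systematic study of fusion degradation in multimodal detection.

Method: The paper proposes Representation Space Constrained Learning with Modality Decoupling, consisting of an RSC module and an MD module. The RSC module amplifies the suppressed gradients, and the MD module removes inter-modality coupling interference and modality imbalance, enabling comprehensive optimization of each modality-specific backbone.

Result: Extensive experiments on the FLIR, LLVIP, M3FD, and MFAD datasets show that the method alleviates fusion degradation and reaches state-of-the-art performance on multiple benchmarks.

Conclusion: The study identifies two key optimization deficiencies behind fusion degradation in multimodal detection: under-optimization caused by severely suppressed gradients in the unimodal backbones, and imbalanced gradient suppression caused by differences in modality quality. RSC-MD offers an effective way to address both.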

📄 Abstract

Multimodal object detection has attracted significant attention in both academia and industry for its enhanced robustness. Although numerous studies have focused on improving modality fusion strategies, most neglect fusion degradation, and none provide a theoretical analysis of its underlying causes. To fill this gap, this paper presents a systematic theoretical investigation of fusion degradation in multimodal detection and identifies two key optimization deficiencies: (1) the gradients of unimodal branch backbones are severely suppressed under multimodal architectures, resulting in under-optimization of the unimodal branches; (2) disparities in modality quality cause weaker modalities to experience stronger gradient suppression, which in turn results in imbalanced modality learning. To address these issues, this paper proposes a Representation Space Constrained Learning with Modality Decoupling (RSC-MD) method, which consists of two modules. The RSC module and the MD module are designed to respectively amplify the suppressed gradients and eliminate inter-modality coupling interference as well as modality imbalance, thereby enabling the comprehensive optimization of each modality-specific backbone. Extensive experiments conducted on the FLIR, LLVIP, M3FD, and MFAD datasets demonstrate that the proposed method effectively alleviates fusion degradation and achieves state-of-the-art performance across multiple benchmarks. The code and training procedures will be released at https://github.com/yikangshao/RSC-MD.

Dabin Jeong, Amirhossein Vahidi, Ciro Ramírez-Suástegui, Marie Moullet, Kevin Ly, Mohammad Vali Sanian, Sebastian Birk, Yinshui Chang, Adam Boxall, Daniyal Jafree, Lloyd Steele, Vijaya Baskar MS, Muzlifah Haniffa, Mohammad Lotfollahi

🧩 TL;DR

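This paper proposes Sigmma, a framework that learns hierarchical representations of HE images and spatial transcriptome profiles through multi-scale contrastive alignment, addressing the way single-scale alignment in existing methods overlooks the spatial organization of cellular structures. It improves gene-expression prediction by 9.78% and cross-modal retrieval by 26.93% on average.

📘 Detailed Summary

Motivation: Existing computational methods usually align HE image tiles with their corresponding spatial transcriptome profiles at a single scale, overlooking fine-grained cellular structures and their spatial organization. This limits how well models capture cell-cell interactions in the tissue microenvironment.

Method: Sigmma is a multi-modal contrastive alignment framework. Multi-scale contrastive alignment keeps the representations learned at different scales coherent across modalities. The framework also represents cell interactions as a graph and integrates inter- and intra-subgraph relationships to capture cell-cell interactions from fine to coarse.

Result: Experiments show that the representations learned by Sigmma capture cross-modal correspondences better, with average improvements of 9.78% in gene-expression prediction and 26.93% in cross-modal retrieval. Downstream analyses show that it learns meaningful multi-tissue organization.

Conclusion: The study demonstrates the importance of multi-scale contrastive alignment in computational pathology. Sigmma captures hierarchical cell interactions in the tissue microenvironment and offers a new technical route for understanding disease mechanisms and developing precision medicine.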

📄 Abstract

Recent advances in computational pathology have leveraged vision-language models to learn joint representations of Hematoxylin and Eosin (HE) images with spatial transcriptomic (ST) profiles. However, existing approaches typically align HE tiles with their corresponding ST profiles at a single scale, overlooking fine-grained cellular structures and their spatial organization. To address this, we propose Sigmma, a multi-modal contrastive alignment framework for learning hierarchical representations of HE images and spatial transcriptome profiles across multiple scales. Sigmma introduces multi-scale contrastive alignment, ensuring that representations learned at different scales remain coherent across modalities. Furthermore, by representing cell interactions as a graph and integrating inter- and intra-subgraph relationships, our approach effectively captures cell-cell interactions, ranging from fine to coarse, within the tissue microenvironment. We demonstrate that Sigmma learns representations that better capture cross-modal correspondences, leading to average improvements of 9.78% in the gene-expression prediction task and 26.93% in the cross-modal retrieval task across datasets. We further show that it learns meaningful multi-tissue organization in downstream analyses.
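Multi-scale contrastive alignment can be illustrated with a generic CLIP-style symmetric InfoNCE loss that is evaluated at each scale and then averaged. This is a sketch under assumptions, not Sigmma's actual objective: the abstract does not give the loss, and the temperature value, the equal weighting of scales, and the names `contrastive_loss` and `multi_scale_loss` are all illustrative. Row i of the image embeddings is assumed to pair with row i of the transcriptome embeddings.

```python
import numpy as np

def log_softmax(x, axis):
    z = x - x.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def contrastive_loss(img_emb, st_emb, temperature=0.07):
    """Symmetric InfoNCE between paired image and transcriptome embeddings."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    st = st_emb / np.linalg.norm(st_emb, axis=1, keepdims=True)
    logits = img @ st.T / temperature          # cosine similarities, sharpened
    img_to_st = -np.diag(log_softmax(logits, axis=1)).mean()
    st_to_img = -np.diag(log_softmax(logits, axis=0)).mean()
    return 0.5 * (img_to_st + st_to_img)

def multi_scale_loss(pairs_by_scale):
    """Average the contrastive loss over (image, transcriptome) pairs, one per scale."""
    return float(np.mean([contrastive_loss(i, s) for i, s in pairs_by_scale]))
```

A caller would pass one embedding pair per scale, for example cell-level, neighborhood-level, and tile-level embeddings, so that every scale is pulled into agreement across the two modalities.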

[29] Multi-Text Guided Few-Shot Semantic Segmentation

Qiang Jiao, Bin Yan, Yi Yang, Mengrui Shi, Qiang Zhang

🧩 TL;DR

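This paper proposes MTGNet, a dual-branch framework that fuses diverse text prompts to improve few-shot semantic segmentation. It addresses the inability of a single text description to capture the semantic diversity of complex categories and achieves notable gains on the PASCAL-5i and COCO-20i benchmarks.

📘 Detailed Summary

Motivation: Existing CLIP-based few-shot semantic segmentation methods usually rely on a single text prompt (such as "a photo of class"). This often leads to incomplete activation of target regions, because a single description cannot fully capture the semantic diversity of complex categories. These methods also lack explicit cross-modal interaction and are vulnerable to noisy support features, which further degrades the quality of the visual prior.

Method: MTGNet is a dual-branch framework with three key modules. The Multi-Textual Prior Refinement module suppresses interference and aggregates complementary semantic cues to enhance foreground activation. The Text Anchor Feature Fusion module uses multi-text embeddings as semantic anchors to transfer discriminative local prototypes from support images to query images. The Foreground Confidence-Weighted Attention module uses the internal self-similarity of support foreground features to make the visual prior more robust, adaptively down-weighting inconsistent regions.

Result: Extensive experiments on standard FSS benchmarks support the effectiveness of MTGNet. In the 1-shot setting it reaches 76.8% mIoU on PASCAL-5i and 57.4% mIoU on COCO-20i, with notable improvements on folds that exhibit high intra-class variation.

Conclusion: The study shows that fusing diverse text prompts improves the quality of the textual prior, that multi-text embeddings used as semantic anchors improve cross-modal semantic consistency, and that an internal self-similarity mechanism makes the visual prior markedly more robust. Together these offer an effective way to handle intra-class variation and insufficient semantic coverage in few-shot semantic segmentation.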

📄 Abstract

Recent CLIP-based few-shot semantic segmentation methods introduce class-level textual priors to assist segmentation by typically using a single prompt (e.g., a photo of class). However, these approaches often result in incomplete activation of target regions, as a single textual description cannot fully capture the semantic diversity of complex categories. Moreover, they lack explicit cross-modal interaction and are vulnerable to noisy support features, further degrading visual prior quality. To address these issues, we propose the Multi-Text Guided Few-Shot Semantic Segmentation Network (MTGNet), a dual-branch framework that enhances segmentation performance by fusing diverse textual prompts to refine textual priors and guide the cross-modal optimization of visual priors. Specifically, we design a Multi-Textual Prior Refinement (MTPR) module that suppresses interference and aggregates complementary semantic cues to enhance foreground activation and expand semantic coverage for structurally complex objects. We introduce a Text Anchor Feature Fusion (TAFF) module, which leverages multi-text embeddings as semantic anchors to facilitate the transfer of discriminative local prototypes from support images to query images, thereby improving semantic consistency and alleviating intra-class variations. Furthermore, a Foreground Confidence-Weighted Attention (FCWA) module is presented to enhance visual prior robustness by leveraging internal self-similarity within support foreground features. It adaptively down-weights inconsistent regions and effectively suppresses interference in the query segmentation process. Extensive experiments on standard FSS benchmarks validate the effectiveness of MTGNet. In the 1-shot setting, it achieves 76.8% mIoU on PASCAL-5i and 57.4% on COCO-20i, with notable improvements in folds exhibiting high intra-class variations.

[30] AVATAAR: Agentic Video Answering via Temporal Adaptive Alignment and Reasoning

Urjitkumar Patel, Fang-Chun Yeh, Chinmay Gondhalekar

🧩 TL;DR

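This paper presents AVATAAR, a modular and interpretable video question answering framework that combines global and local video context with a Pre Retrieval Thinking Agent and a Rethink Module, substantially improving long-video understanding.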

📘 Detailed Summary

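Motivation: As video content becomes more prevalent, understanding and answering questions about long videos is increasingly important. Existing large vision language models still struggle with complex queries that require both comprehensive understanding and detailed analysis, and are particularly limited on nuanced queries.

Method: AVATAAR uses a modular design that combines global and local video context and includes a Pre Retrieval Thinking Agent and a Rethink Module. It creates a persistent global summary and establishes a feedback loop between the two modules, so the system can refine its retrieval strategy based on partial answers and emulate human iterative reasoning.

Result: On the CinePile benchmark, AVATAAR improves significantly over the baseline, with relative gains of +5.6% in temporal reasoning, +5% in technical queries, +8% in theme-based questions, and +8.2% in narrative comprehension. Experiments confirm that every module contributes positively to overall performance and that the feedback loop is crucial for adaptability.

Conclusion: AVATAAR offers a scalable solution for long-form video question answering that combines accuracy, interpretability, and extensibility. Its modular design and feedback mechanism provide an effective framework for complex video understanding tasks.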

📄 Abstract

With the increasing prevalence of video content, effectively understanding and answering questions about long form videos has become essential for numerous applications. Although large vision language models (LVLMs) have enhanced performance, they often face challenges with nuanced queries that demand both a comprehensive understanding and detailed analysis. To overcome these obstacles, we introduce AVATAAR, a modular and interpretable framework that combines global and local video context, along with a Pre Retrieval Thinking Agent and a Rethink Module. AVATAAR creates a persistent global summary and establishes a feedback loop between the Rethink Module and the Pre Retrieval Thinking Agent, allowing the system to refine its retrieval strategies based on partial answers and replicate human-like iterative reasoning. On the CinePile benchmark, AVATAAR demonstrates significant improvements over a baseline, achieving relative gains of +5.6% in temporal reasoning, +5% in technical queries, +8% in theme-based questions, and +8.2% in narrative comprehension. Our experiments confirm that each module contributes positively to the overall performance, with the feedback loop being crucial for adaptability. These findings highlight AVATAAR's effectiveness in enhancing video understanding capabilities. Ultimately, AVATAAR presents a scalable solution for long-form Video Question Answering (QA), merging accuracy, interpretability, and extensibility.
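The feedback loop described in the abstract can be sketched as a generic control loop. This is a minimal illustration, not AVATAAR's implementation: the four callables (`plan`, `retrieve`, `answer`, `rethink`), the `max_rounds` cap, and the toy clip store are hypothetical stand-ins for the planning agent, the retriever, the answering model, and the Rethink Module.

```python
def answer_with_feedback(question, plan, retrieve, answer, rethink, max_rounds=3):
    """Plan a query, retrieve evidence, draft an answer, and refine until confident."""
    feedback, draft, history = None, None, []
    for _ in range(max_rounds):
        query = plan(question, feedback)          # pre-retrieval thinking
        evidence = retrieve(query)                # local context lookup
        draft, confident = answer(question, evidence)
        history.append((query, draft))
        if confident:
            break
        feedback = rethink(question, draft, evidence)  # partial answer guides the next query
    return draft, history

# Toy stand-ins (assumptions for illustration only).
clips = {
    "garden": "Anna and Ben argue about the letter.",
    "kitchen": "Anna hides the letter in a drawer.",
}

def plan(question, feedback):
    return feedback or "garden"   # first guess, then whatever rethink suggested

def retrieve(query):
    return clips.get(query, "")

def answer(question, evidence):
    return evidence, "hides" in evidence   # "confident" only if the evidence is on point

def rethink(question, draft, evidence):
    return "kitchen"              # a real module would derive this from the partial answer

final, history = answer_with_feedback(
    "Where does Anna hide the letter?", plan, retrieve, answer, rethink
)
```

Here the first retrieval returns an off-target clip, the rethink step redirects the query, and the second round produces a confident answer.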

Miruna-Alexandra Gafencu, Yordanka Velikova, Nassir Navab, Mohammad Farid Azampour

🧩 TL;DR

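This study proposes a novel multi-modal deep learning method that completes occluded anatomical structures in 3D ultrasound by using complementary information from a single X-ray image, substantially mitigating ultrasound's limited visualization of bone structures in spine imaging.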

📘 Detailed Summary

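Motivation: Ultrasound offers radiation-free, real-time visualization for spinal surgery, but acoustic shadowing caused by bone prevents it from showing anatomical structures such as vertebral bodies completely, which limits its value for navigation during spinal procedures.

Method: The approach uses a multi-modal deep learning framework. It generates paired training data consisting of 2D lateral vertebral views that simulate X-ray scans and 3D partial vertebra representations that mimic the limited visibility of ultrasound imaging, and it integrates morphological information from both imaging modalities.

Result: The method significantly improves vertebral reconstruction over existing 3D ultrasound vertebral completion methods (p < 0.001). In phantom studies it achieves a more accurate and complete volumetric visualization of the lumbar spine, without requiring registration with preoperative modalities such as CT.

Conclusion: Integrating a single X-ray projection mitigates a key limitation of ultrasound imaging while preserving its strengths as the primary imaging modality, offering a promising technical path toward future clinical translation.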

📄 Abstract

Ultrasound offers a radiation-free, cost-effective solution for real-time visualization of spinal landmarks, paraspinal soft tissues and neurovascular structures, making it valuable for intraoperative guidance during spinal procedures. However, ultrasound suffers from inherent limitations in visualizing complete vertebral anatomy, in particular vertebral bodies, due to acoustic shadowing effects caused by bone. In this work, we present a novel multi-modal deep learning method for completing occluded anatomical structures in 3D ultrasound by leveraging complementary information from a single X-ray image. To enable training, we generate paired training data consisting of: (1) 2D lateral vertebral views that simulate X-ray scans, and (2) 3D partial vertebrae representations that mimic the limited visibility and occlusions encountered during ultrasound spine imaging. Our method integrates morphological information from both imaging modalities and demonstrates significant improvements in vertebral reconstruction (p < 0.001) compared to the state of the art in 3D ultrasound vertebral completion. We perform phantom studies as an initial step to future clinical translation, and achieve a more accurate, complete volumetric lumbar spine visualization overlaid on the ultrasound scan without the need for registration with preoperative modalities such as computed tomography. This demonstrates that integrating a single X-ray projection mitigates ultrasound's key limitation while preserving its strengths as the primary imaging modality. Code and data can be found at https://github.com/miruna20/US-X-Complete

cs.CL

[32] Hierarchical Token Prepending: Enhancing Information Flow in Decoder-based LLM Embeddings

Xueying Ding, Xingyue Huang, Mingxuan Ju, Liam Collins, Yozen Liu, Leman Akoglu, Neil Shah, Tong Zhao

🧩 TL;DR

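This paper proposes Hierarchical Token Prepending (HTP), which introduces block-level summary tokens and multiple pathways for backward information flow to address the information-flow restriction imposed by causal attention in large language models, improving the quality of long-document embeddings and retrieval performance.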

📘 Detailed Summary

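Motivation: Large language models produce powerful text embeddings, but their causal attention mechanism restricts the flow of information from later tokens to earlier ones, which degrades representation quality. Existing methods mitigate this by prepending a single summary token to the input, but that over-compresses information on long documents and hurts performance.

Method: HTP addresses the information-flow bottleneck with two mechanisms. At the attention level, it partitions the input into blocks and prepends block-level summary tokens to subsequent blocks, creating multiple pathways for backward information flow. At the readout level, it replaces last-token pooling with mean pooling, a choice supported by theoretical analysis. The method is architecture-agnostic and simple to implement.

Result: HTP delivers consistent gains across 11 retrieval datasets and 30 general embedding benchmarks, especially in long-context settings. It improves both zero-shot and fine-tuned models, offering a scalable route to better long-document embeddings.

Conclusion: Hierarchical token prepending relieves the information-flow restriction in causal-attention models, showing that multi-path information propagation and an appropriate pooling strategy matter for long-document embedding quality. The method is a simple and effective way to improve the representational ability of language models.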

📄 Abstract

Large language models produce powerful text embeddings, but their causal attention mechanism restricts the flow of information from later to earlier tokens, degrading representation quality. While recent methods attempt to solve this by prepending a single summary token, they over-compress information, hence harming performance on long documents. We propose Hierarchical Token Prepending (HTP), a method that resolves two critical bottlenecks. To mitigate attention-level compression, HTP partitions the input into blocks and prepends block-level summary tokens to subsequent blocks, creating multiple pathways for backward information flow. To address readout-level over-squashing, we replace last-token pooling with mean-pooling, a choice supported by theoretical analysis. HTP achieves consistent performance gains across 11 retrieval datasets and 30 general embedding benchmarks, especially in long-context settings. As a simple, architecture-agnostic method, HTP enhances both zero-shot and finetuned models, offering a scalable route to superior long-document embeddings.
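The readout change mentioned in the abstract, replacing last-token pooling with mean pooling, can be shown on toy hidden states. This sketch covers only the pooling step, not HTP's block-level summary tokens; the list-of-lists representation, the padding mask, and the function names are assumptions for illustration.

```python
def last_token_pool(hidden, mask):
    """Embedding = hidden state of the last non-padding token."""
    last = max(i for i, m in enumerate(mask) if m)
    return list(hidden[last])

def mean_pool(hidden, mask):
    """Embedding = average of the hidden states of all non-padding tokens."""
    kept = [h for h, m in zip(hidden, mask) if m]
    dim = len(kept[0])
    return [sum(h[d] for h in kept) / len(kept) for d in range(dim)]

# Four positions with 2-dimensional hidden states; the last position is padding.
hidden = [[1.0, 0.0], [0.0, 2.0], [2.0, 1.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

last_emb = last_token_pool(hidden, mask)
mean_emb = mean_pool(hidden, mask)
```

Last-token pooling asks a single position to carry the whole sequence, while mean pooling lets every non-padding position contribute directly to the embedding.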

[33] HinTel-AlignBench: A Framework and Benchmark for Hindi-Telugu with English-Aligned Samples

Rishikant Chigrupaatii, Ponnada Sai Tulasi Kanishka, Lalit Chandra Routhu, Martin Patel Sama Supratheek Reddy, Divyam Gupta, Dasari Srikar, Krishna Teja Kuchimanchi, Rajiv Misra, Rohun Tripathi

🧩 TL;DR

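This study proposes a scalable framework for evaluating multilingual vision-language models in Indian languages and builds the HinTel-AlignBench benchmark, revealing a performance gap between Indian languages and English.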

📘 Detailed Summary

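Motivation: Current evaluations of multilingual vision-language models have four main limitations: reliance on unverified automatic translations, narrow task and domain coverage, limited sample sizes, and a lack of cultural and natively sourced question-answering data. These limitations hinder progress toward equitable AI for low-resource languages.

Method: The authors propose a semi-automated dataset creation framework that combines back-translation, filtering, and human verification; build the most comprehensive vision-language benchmark for Hindi and Telugu, comprising adapted English datasets and native, novel Indic datasets; and carry out a detailed performance analysis of several state-of-the-art open-weight and closed-source vision-language models.

Result: Across all models, performance in the Indian languages regresses relative to English on 4 of 5 tasks, by an average of 8.3 points in Hindi and 5.5 points in Telugu. The study also categorizes common failure modes to highlight concrete areas for improvement in multilingual multimodal understanding.

Conclusion: The study reveals a significant performance gap between Indian languages and English in multilingual vision-language models, underlines the importance of building more equitable AI systems, and provides a failure-mode categorization and a benchmarking framework for improving multilingual multimodal understanding.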

📄 Abstract

With nearly 1.5 billion people and more than 120 major languages, India represents one of the most diverse regions in the world. As multilingual Vision-Language Models (VLMs) gain prominence, robust evaluation methodologies are essential to drive progress toward equitable AI for low-resource languages. Current multilingual VLM evaluations suffer from four major limitations: reliance on unverified auto-translations, narrow task/domain coverage, limited sample sizes, and lack of cultural and natively sourced Question-Answering (QA). To address these gaps, we present a scalable framework to evaluate VLMs in Indian languages and compare it with performance in English. Using the framework, we generate HinTel-AlignBench, a benchmark that draws from diverse sources in Hindi and Telugu with English-aligned samples. Our contributions are threefold: (1) a semi-automated dataset creation framework combining back-translation, filtering, and human verification; (2) the most comprehensive vision-language benchmark for Hindi and Telugu, including adapted English datasets (VQAv2, RealWorldQA, CLEVR-Math) and native novel Indic datasets (JEE for STEM, VAANI for cultural grounding) with approximately 4,000 QA pairs per language; and (3) a detailed performance analysis of various State-of-the-Art (SOTA) open-weight and closed-source VLMs. We find that performance regresses in Indian languages relative to English on 4 out of 5 tasks across all the models, with an average regression of 8.3 points in Hindi and 5.5 points in Telugu. We categorize common failure modes to highlight concrete areas of improvement in multilingual multimodal understanding.
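A back-translation filter of the kind the abstract mentions can be sketched as a round-trip check that routes low-scoring samples to human verification. This is an assumed workflow, not the authors' pipeline: the similarity measure (`difflib.SequenceMatcher`), the 0.8 threshold, and the toy dictionary translators with placeholder target strings are all hypothetical.

```python
from difflib import SequenceMatcher

def round_trip_score(source, back_translated):
    """Character-level similarity in [0, 1] between a source and its back-translation."""
    return SequenceMatcher(None, source.lower(), back_translated.lower()).ratio()

def triage(samples, translate, back_translate, threshold=0.8):
    """Split samples into auto-accepted pairs and pairs routed to human review."""
    accepted, review = [], []
    for text in samples:
        target = translate(text)
        score = round_trip_score(text, back_translate(target))
        (accepted if score >= threshold else review).append((text, target, score))
    return accepted, review

# Toy translators: placeholders stand in for real target-language strings.
forward = {"How many cubes are red?": "TGT_1", "What is the boy holding?": "TGT_2"}
backward = {"TGT_1": "How many cubes are red?", "TGT_2": "What does he carry?"}

accepted, review = triage(list(forward), forward.get, backward.get)
```

The first question survives the round trip unchanged and is accepted; the second drifts in wording and is queued for a human annotator.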

[34] OEMA: Ontology-Enhanced Multi-Agent Collaboration Framework for Zero-Shot Clinical Named Entity Recognition

Xinli Tao, Xin Dong, Xuezhong Zhou

🧩 TL;DR

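This paper proposes OEMA, a zero-shot clinical named entity recognition framework based on multi-agent collaboration: a self-annotator generates examples, a discriminator filters them using SNOMED CT, and a predictor uses entity descriptions for inference. It achieves state-of-the-art exact-match performance on the MTSamples and VAERS datasets.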

📘 Detailed Summary

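Motivation: Clinical named entity recognition depends on large amounts of annotated data, and supervised models such as CRF and BioClinicalBERT are costly to annotate for. Existing zero-shot methods fall short in the granularity of example selection and in integrating prompts with self-improvement.

Method: OEMA has three core components: a self-annotator that generates candidate examples, a discriminator that filters them against the SNOMED CT ontology, and a predictor that uses entity descriptions for accurate inference. Together the agents perform zero-shot clinical NER.

Result: On the MTSamples and VAERS datasets, OEMA achieves state-of-the-art exact-match performance. Under the related-match metric it matches supervised BioClinicalBERT and surpasses CRF, approaching the performance of supervised methods.

Conclusion: Through ontology-guided reasoning and multi-agent collaboration, OEMA addresses key challenges in zero-shot NER and achieves near-supervised performance, making it a promising approach for clinical NLP applications and medical information extraction.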

📄 Abstract

Clinical named entity recognition (NER) is crucial for extracting information from electronic health records (EHRs), but supervised models like CRF and BioClinicalBERT require costly annotated data. While zero-shot NER with large language models (LLMs) reduces this dependency, it struggles with example selection granularity and integrating prompts with self-improvement. To address this, we propose OEMA, a zero-shot clinical NER framework using multi-agent collaboration. OEMA's three components are: a self-annotator generating examples, a discriminator filtering them via SNOMED CT, and a predictor using entity descriptions for accurate inference. On MTSamples and VAERS datasets, OEMA achieves state-of-the-art exact-match performance. Under related-match, it matches supervised BioClinicalBERT and surpasses CRF. OEMA addresses key zero-shot NER challenges through ontology-guided reasoning and multi-agent collaboration, achieving near-supervised performance and showing promise for clinical NLP applications.
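The discriminator and predictor stages can be illustrated with a toy filter and prompt builder. This is a sketch under loud assumptions, not OEMA's implementation: the ontology here is a three-word set standing in for SNOMED CT, the filtering rule (keep an example only if every entity is a known term) is invented for illustration, and the prompt layout and function names are hypothetical.

```python
# Toy stand-in for an ontology lookup; not real SNOMED CT content.
ONTOLOGY_TERMS = {"fever", "headache", "ibuprofen"}

def discriminate(candidates, ontology_terms):
    """Keep self-annotated examples whose every entity is a known ontology term."""
    return [
        (text, entities)
        for text, entities in candidates
        if entities and all(e.lower() in ontology_terms for e in entities)
    ]

def build_prompt(examples, entity_descriptions, query):
    """Assemble a predictor prompt from entity descriptions, kept examples, and the query."""
    lines = [f"{label}: {desc}" for label, desc in sorted(entity_descriptions.items())]
    lines += [f"Text: {t}\nEntities: {', '.join(e)}" for t, e in examples]
    lines.append(f"Text: {query}\nEntities:")
    return "\n".join(lines)

# Candidates as a self-annotator might produce them (toy data).
candidates = [
    ("Patient reports fever and headache.", ["fever", "headache"]),
    ("Patient felt weird vibes after lunch.", ["weird vibes"]),
]
kept = discriminate(candidates, ONTOLOGY_TERMS)
prompt = build_prompt(
    kept,
    {"PROBLEM": "a symptom, sign, or diagnosis"},
    "Took ibuprofen for a headache.",
)
```

Only the first candidate passes the filter, so the prompt sent to the predictor contains one vetted example followed by the query.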

[35] Context Cascade Compression: Exploring the Upper Limits of Text Compression

Fanfan Liu, Haibo Qiu

🧩 TL;DR

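This paper proposes Context Cascade Compression (C3), which cascades large language models of different sizes to achieve high-ratio text compression. At a 20x compression ratio it reaches 98% decoding accuracy, well above the existing optical character compression approach.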

📘 Detailed Summary

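Motivation: Long-context tasks with million-token inputs place heavy computational and memory demands on large language models. The optical compression approach of DeepSeek-OCR has performance limits, so the authors explore the upper limits of text compression as a way to handle long contexts.

Method: The C3 cascade uses a small LLM as the first stage to compress a long context into a small set of latent tokens (for example, 32 or 64 in length), and a large LLM as the second stage to perform the decoding task on the compressed context, forming a pure-text pipeline.

Result: At a 20x compression ratio the model reaches 98% decoding accuracy, compared with about 60% for DeepSeek-OCR. When the ratio is raised to 40x, accuracy stays at around 93%, demonstrating that high-ratio compression is feasible.

Conclusion: In context compression, C3 shows better performance and feasibility than optical character compression. Because its pure-text pipeline ignores visual factors such as layout and color, it also offers a reference point for the upper bound on compression ratios in optical character compression, OCR, and related fields.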

📄 Abstract

Million-level token inputs in long-context tasks pose significant computational and memory challenges for Large Language Models (LLMs). Recently, DeepSeek-OCR conducted research into the feasibility of Contexts Optical Compression and achieved preliminary results. Inspired by this, we introduce Context Cascade Compression (C3) to explore the upper limits of text compression. Our method cascades two LLMs of different sizes to handle the compression and decoding tasks. Specifically, a small LLM, acting as the first stage, performs text compression by condensing a long context into a set of latent tokens (e.g., 32 or 64 in length), achieving a high ratio of text tokens to latent tokens. A large LLM, as the second stage, then executes the decoding task on this compressed context. Experiments show that at a 20x compression ratio (where the number of text tokens is 20 times the number of latent tokens), our model achieves 98% decoding accuracy, compared to approximately 60% for DeepSeek-OCR. When we further increase the compression ratio to 40x, the accuracy is maintained at around 93%. This indicates that in the domain of context compression, C3 Compression demonstrates superior performance and feasibility over optical character compression. C3 uses a simpler, pure-text pipeline that ignores factors like layout, color, and information loss from a visual encoder. This also suggests a potential upper bound for compression ratios in future work on optical character compression, OCR, and related fields. Code and model weights are publicly accessible at https://github.com/liufanfanlff/C3-Context-Cascade-Compression
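The abstract defines the compression ratio as the number of text tokens divided by the number of latent tokens, so the context length covered at each setting follows by multiplication. The helper names below are hypothetical, and the abstract does not say which latent length was paired with which ratio, so the sketch simply enumerates the combinations.

```python
def compression_ratio(text_tokens, latent_tokens):
    """Ratio of text tokens to latent tokens, as defined in the abstract."""
    return text_tokens / latent_tokens

def text_tokens_covered(latent_tokens, ratio):
    """Number of text tokens represented by a latent sequence at a given ratio."""
    return latent_tokens * ratio

# Every combination of the latent lengths and ratios mentioned in the abstract.
coverage = {
    (latent, ratio): text_tokens_covered(latent, ratio)
    for latent in (32, 64)
    for ratio in (20, 40)
}
```

At 20x, for example, 32 latent tokens stand in for 640 text tokens and 64 latent tokens for 1,280.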

[36] The Empowerment of Science of Science by Large Language Models: New Tools and Methods

Guoqiang Liang, Jingqian Gong, Mengxuan Li, Gege Lin, Shuo Zhang

🧩 TL;DR

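This paper systematically reviews the core technologies behind large language models and, from a scientometric perspective, discusses potential applications of LLMs in scientific evaluation, research front detection, and knowledge graph construction.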

📘 Detailed Summary

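Motivation: Large language models have shown exceptional capabilities in natural language understanding, image recognition, and multimodal tasks, and have become a central issue in the global technological race. This study aims to review, from a user's perspective, the core technologies that support LLMs and to explore their potential in scientometrics.

Method: The study is a systematic review covering the core LLM technologies of prompt engineering, knowledge-enhanced retrieval augmented generation, fine-tuning, pretraining, and tool learning. It also traces the history of the Science of Science and proposes an AI-agent-based model for scientific evaluation.

Result: The analysis lays out several possible applications of LLMs in scientometrics, including new methods for research front detection and knowledge graph construction, offering a fresh perspective on scientific evaluation.

Conclusion: The study opens new directions for applying LLMs in scientometrics, proposes an AI-agent-driven model for scientific evaluation, and highlights the potential of LLMs to change scientific discovery and knowledge management.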

📄 Abstract

Large language models (LLMs) have exhibited exceptional capabilities in natural language understanding and generation, image recognition, and multimodal tasks, charting a course towards AGI and emerging as a central issue in the global technological race. This manuscript conducts a comprehensive review of the core technologies that support LLMs from a user standpoint, including prompt engineering, knowledge-enhanced retrieval augmented generation, fine tuning, pretraining, and tool learning. Additionally, it traces the historical development of Science of Science (SciSci) and presents a forward looking perspective on the potential applications of LLMs within the scientometric domain. Furthermore, it discusses the prospect of an AI agent based model for scientific evaluation, and presents new research fronts detection and knowledge graph building methods with LLMs.

[37] A Compliance-Preserving Retrieval System for Aircraft MRO Task Search

Byungho Jo

🧩 TL;DR

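This study presents a compliance-preserving retrieval system that combines LLM reranking with semantic search while keeping the existing certified viewers in place. It cuts the time aircraft maintenance technicians spend looking up manuals from 6-15 minutes to 18 seconds per task, a 95% reduction.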

📘 Detailed Summary

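Motivation: Aircraft maintenance technicians spend up to 30% of their work time searching manuals during maintenance, repair, and overhaul (MRO) operations, a documented efficiency bottleneck. Because every procedure must be traceable to certified sources, retrieval has to become faster without violating strict regulatory constraints.

Method: The system applies LLM reranking and semantic search, constructs revision-robust embeddings from ATA chapter hierarchies, and uses vision-language parsing to structure certified content, so technicians can preview ranked tasks and open verified procedures in their existing viewers.

Result: Evaluation on 49k synthetic queries yields retrieval accuracy above 90%. Bilingual controlled studies with 10 licensed AMTs show a 90.9% top-10 success rate and reduce lookup time from 6-15 minutes to 18 seconds per task, a 95% reduction.

Conclusion: The study shows that semantic retrieval can operate within strict regulatory constraints and meaningfully reduce operational workload in real-world multilingual MRO workflows, providing concrete evidence for AI applications in compliance-bound environments.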

📄 Abstract

Aircraft Maintenance Technicians (AMTs) spend up to 30% of work time searching manuals, a documented efficiency bottleneck in MRO operations where every procedure must be traceable to certified sources. We present a compliance-preserving retrieval system that adapts LLM reranking and semantic search to aviation MRO environments by operating alongside, rather than replacing, certified legacy viewers. The system constructs revision-robust embeddings from ATA chapter hierarchies and uses vision-language parsing to structure certified content, allowing technicians to preview ranked tasks and access verified procedures in existing viewers. Evaluation on 49k synthetic queries achieves >90% retrieval accuracy, while bilingual controlled studies with 10 licensed AMTs demonstrate 90.9% top-10 success rate and 95% reduction in lookup time, from 6-15 minutes to 18 seconds per task. These gains provide concrete evidence that semantic retrieval can operate within strict regulatory constraints and meaningfully reduce operational workload in real-world multilingual MRO workflows.
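The two headline metrics can be reproduced with a few lines of arithmetic. The abstract does not state how the 95% figure was computed, so the sketch below only checks the relative reduction at both ends of the reported 6-15 minute range; the function names and the toy ranked lists are hypothetical.

```python
def reduction(before_s, after_s):
    """Relative time saved when a lookup drops from before_s to after_s seconds."""
    return (before_s - after_s) / before_s

def top_k_success(ranked_lists, targets, k=10):
    """Fraction of queries whose target task appears in the first k results."""
    hits = sum(1 for ranked, target in zip(ranked_lists, targets) if target in ranked[:k])
    return hits / len(targets)

low_end = reduction(6 * 60, 18)    # 6 minutes down to 18 seconds
high_end = reduction(15 * 60, 18)  # 15 minutes down to 18 seconds

# Toy retrieval results: the second query misses its target.
ranked = [["T3", "T1", "T7"], ["T2", "T9", "T4"]]
rate = top_k_success(ranked, ["T1", "T5"], k=3)
```

Under these assumptions the 6-minute end gives a 95% reduction and the 15-minute end gives 98%, so the reported 95% corresponds to the lower bound of the range.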

[38] Multimodal Evaluation of Russian-language Architectures

Artem Chervyakov, Ulyana Isaeva, Anton Emelyanov, Artem Safin, Maria Tikhonova, Alexander Kharitonov, Yulia Lyakh, Petr Surovtsev, Denis Shevelev, Vildan Saburov, Vasily Konovalov, Elisei Rykov, Ivan Sviridov, Amina Miftakhova, Ilseyar Alimova, Alexander Panchenko, Alexander Kapitanov, Alena Fenogenova

🧩 TL;DR

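This paper introduces Mera Multi, an open evaluation framework for Russian-language multimodal large language models. It fills the gap in Russian multimodal benchmarks with 18 newly constructed evaluation tasks and a universal taxonomy of multimodal abilities.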

📘 Detailed Summary

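Motivation: Multimodal large language models are advancing rapidly in scale and capability, but their intelligence, limitations, and risks remain insufficiently understood. Russian in particular lacks multimodal benchmarks, which hinders systematic evaluation and comparison of Russian multimodal models.

Method: The authors build an open, instruction-based multimodal evaluation framework covering text, image, audio, and video. It comprises 18 newly constructed evaluation tasks with unified prompts and metrics, along with a methodology for preventing benchmark leakage that includes watermarking and licenses for private sets.

Result: The benchmark provides baseline results for both closed-source and open-source models. All datasets were created from scratch with attention to Russian cultural and linguistic specificity.

Conclusion: Although the current work focuses on Russian, the benchmark construction methodology is replicable and can guide the construction of multimodal benchmarks for typologically diverse languages, particularly within the Slavic language family.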

📄 Abstract

Multimodal large language models (MLLMs) are currently at the center of research attention, showing rapid progress in scale and capabilities, yet their intelligence, limitations, and risks remain insufficiently understood. To address these issues, particularly in the context of the Russian language, where no multimodal benchmarks currently exist, we introduce Mera Multi, an open multimodal evaluation framework for Russian-spoken architectures. The benchmark is instruction-based and encompasses default text, image, audio, and video modalities, comprising 18 newly constructed evaluation tasks for both general-purpose models and modality-specific architectures (image-to-text, video-to-text, and audio-to-text). Our contributions include: (i) a universal taxonomy of multimodal abilities; (ii) 18 datasets created entirely from scratch with attention to Russian cultural and linguistic specificity, unified prompts, and metrics; (iii) baseline results for both closed-source and open-source models; (iv) a methodology for preventing benchmark leakage, including watermarking and licenses for private sets. While our current focus is on Russian, the proposed benchmark provides a replicable methodology for constructing multimodal benchmarks in typologically diverse languages, particularly within the Slavic language family.

cs.AI

[39] Octopus: Agentic Multimodal Reasoning with Six-Capability Orchestration

Yifu Guo, Zishan Xu, Zhiyuan Yao, Yuquan Lu, Jiaye Lin, Sen Hu, Zhenheng Tang, Yingchao Li, Huacan Wang, Ronghao Chen

🧩 TL;DR

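This paper proposes Octopus: Agentic Multimodal Reasoning with Six-Capability Orchestration, a new paradigm that autonomously explores reasoning pathways and dynamically selects the most appropriate capability, addressing the difficulty existing methods have in adapting to dynamically changing capability requirements. Experiments show that Octopus achieves the best performance on the vast majority of tasks in Octopus-Bench, highlighting the crucial role of capability coordination in agentic multimodal reasoning.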

📘 Detailed Summary

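Motivation: Existing multimodal reasoning models have fundamental architectural limitations: they lack the human-like ability to autonomously explore diverse reasoning pathways, including direct inference, tool-driven visual exploration, programmatic visual manipulation, and intrinsic visual imagination. They typically cover only a subset of the thinking abilities that humans bring to such tasks, and so struggle to adapt to the dynamically changing capability requirements of real-world tasks.

Method: The authors define six core capabilities essential for multimodal reasoning and organize a comprehensive evaluation benchmark, Octopus-Bench, around them. During reasoning, Octopus explores autonomously and dynamically selects the most appropriate capability based on the current state, coordinating multiple reasoning pathways.

Result: Octopus achieves the best performance on the vast majority of tasks in Octopus-Bench, supporting the benefit of dynamic capability selection and autonomous exploration.

Conclusion: The study underlines the crucial role of capability coordination in agentic multimodal reasoning. The six-capability orchestration paradigm offers a way past the limitations of existing methods and suggests that coordinating diverse, human-like thinking abilities is important for building stronger multimodal reasoning systems.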

📄 Abstract

Existing multimodal reasoning models and frameworks suffer from fundamental architectural limitations: most lack the human-like ability to autonomously explore diverse reasoning pathways, whether in direct inference, tool-driven visual exploration, programmatic visual manipulation, or intrinsic visual imagination. Consequently, they struggle to adapt to dynamically changing capability requirements in real-world tasks. Meanwhile, humans exhibit a complementary set of thinking abilities when addressing such tasks, whereas existing methods typically cover only a subset of these dimensions. Inspired by this, we propose Octopus: Agentic Multimodal Reasoning with Six-Capability Orchestration, a new paradigm for multimodal agentic reasoning. We define six core capabilities essential for multimodal reasoning and organize a comprehensive evaluation benchmark, Octopus-Bench, accordingly. Octopus is capable of autonomously exploring during reasoning and dynamically selecting the most appropriate capability based on the current state. Experimental results show that Octopus achieves the best performance on the vast majority of tasks in Octopus-Bench, highlighting the crucial role of capability coordination in agentic multimodal reasoning.

[40] IPR-1: Interactive Physical Reasoner

Mingyu Zhang, Lifeng Zhuo, Tianxi Tan, Guocan Xie, Xian Nie, Yan Li, Renjie Zhao, Zizhu He, Ziyu Wang, Jiting Cai, Yong-Lu Li

🧩 TL;DR

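This paper proposes IPR (Interactive Physical Reasoner), which uses world-model rollouts to score and reinforce a vision-language model's policy, and introduces PhysCode, a physics-centric action code. Pretrained on 1,000+ games, it achieves robust physical reasoning, matching GPT-5 overall and surpassing it on Curiosity.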

📘 Detailed Summary

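Motivation: Current agents show complementary failures in physical reasoning: VLM/VLA agents can reason but lack look-ahead in interactive settings, while world models can imagine but imitate visual patterns rather than analyze physics and causality. The study asks whether an agent can acquire human-like reasoning through interaction and keep improving with more experience.

Method: IPR uses world-model rollouts to score and reinforce a VLM's policy, and introduces PhysCode, a physics-centric action code that aligns semantic intent with dynamics to provide a shared action space for prediction and reasoning. It is pretrained on 1,000+ heterogeneous games with diverse physical and causal mechanisms in a Game-to-Unseen (G2U) setting.

Result: IPR performs robustly at the three levels of Survival, Curiosity, and Utility, matches GPT-5 overall, and surpasses it on Curiosity. Performance improves with more training games and interaction steps, and the model transfers zero-shot to unseen games.

Conclusion: The results support physics-centric interactive learning as a path to steadily improving physical reasoning, and indicate that pairing a world model with a VLM can offset the limitations each has on its own.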

📄 Abstract

Humans learn by observing, interacting with environments, and internalizing physics and causality. Here, we aim to ask whether an agent can similarly acquire human-like reasoning from interaction and keep improving with more experience. We study this in a Game-to-Unseen (G2U) setting, curating 1,000+ heterogeneous games with diverse physical and causal mechanisms, and evaluate at three human-like levels: Survival, Curiosity, Utility, from primitive intuition to goal-driven reasoning. Our analysis reveals complementary failures: VLM/VLA agents reason but lack look-ahead in interactive settings, while world models imagine but imitate visual patterns rather than analyze physics and causality. We therefore propose IPR (Interactive Physical Reasoner), using world-model rollouts to score and reinforce a VLM's policy, and introduce PhysCode, a physics-centric action code aligning semantic intent with dynamics to provide a shared action space for prediction and reasoning. Pretrained on 1,000+ games, our IPR performs robustly on three levels, matches GPT-5 overall, and surpasses it on Curiosity. We find that performance improves with more training games and interaction steps, and that the model also zero-shot transfers to unseen games. These results support physics-centric interaction as a path to steadily improving physical reasoning.
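Scoring a policy's candidate actions with imagined rollouts can be shown on a one-dimensional toy world. This is a generic look-ahead sketch, not IPR's training procedure or PhysCode: the world model, reward, fixed policy prior, and horizon are hypothetical, and the scores are only used to pick an action here rather than to update the policy.

```python
def rollout_return(world_model, reward, state, first_action, policy, horizon=3):
    """Imagine `horizon` steps: take first_action, then follow the policy's top choice."""
    total, action = 0.0, first_action
    for _ in range(horizon):
        state = world_model(state, action)   # imagined next state
        total += reward(state)
        prefs = policy(state)
        action = max(prefs, key=prefs.get)
    return total

def pick_action(world_model, reward, state, policy, horizon=3):
    """Score every candidate action by its imagined return and choose the best one."""
    scores = {
        a: rollout_return(world_model, reward, state, a, policy, horizon)
        for a in policy(state)
    }
    return max(scores, key=scores.get), scores

# Toy world: positions on a line, goal at 3, and a policy that prefers standing still.
GOAL = 3
world_model = lambda s, a: s + a
reward = lambda s: -abs(GOAL - s)
policy = lambda s: {-1: 0.2, 0: 0.5, 1: 0.3}

best, scores = pick_action(world_model, reward, 0, policy)
```

The policy's own top choice is to stay put, but the imagined rollouts rank moving toward the goal highest; in a learning setup such scores could serve as the signal that reinforces the policy.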

[41] Know Your Intent: An Autonomous Multi-Perspective LLM Agent Framework for DeFi User Transaction Intent Mining

Qian'ang Mao, Yuxuan Zhang, Jiaman Chen, Wenjun Zhou, Jiaqi Yan

🧩 TL;DR

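This paper proposes the Transaction Intent Mining (TIM) framework, which infers user intent with a DeFi intent taxonomy built on grounded theory and a multi-agent large language model system. It significantly outperforms existing methods and offers a more reliable way to understand DeFi user motivations.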

📘 Detailed Summary

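Motivation: As decentralized finance develops, understanding the user intent behind DeFi transactions is crucial yet challenging because of complex smart contract interactions, multifaceted on-chain and off-chain factors, and opaque hex logs. Existing methods lack deep semantic insight.

Method: TIM uses a DeFi intent taxonomy built on grounded theory and a multi-agent LLM system. A Meta-Level Planner dynamically coordinates domain experts to decompose multi-perspective intent analysis into solvable subtasks, Question Solvers handle those subtasks with multi-modal on-chain and off-chain data, and a Cognitive Evaluator mitigates LLM hallucinations and ensures verifiability.

Result: Experiments show that TIM significantly outperforms machine learning models, single LLMs, and single-agent baselines on intent inference. The paper also analyzes core challenges in intent inference.

Conclusion: The work offers a more reliable way to understand DeFi user motivations and provides context-aware explanations for complex blockchain activity, which can improve the transparency and interpretability of the DeFi ecosystem.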

📄 Abstract

As Decentralized Finance (DeFi) develops, understanding user intent behind DeFi transactions is crucial yet challenging due to complex smart contract interactions, multifaceted on-/off-chain factors, and opaque hex logs. Existing methods lack deep semantic insight. To address this, we propose the Transaction Intent Mining (TIM) framework. TIM leverages a DeFi intent taxonomy built on grounded theory and a multi-agent Large Language Model (LLM) system to robustly infer user intents. A Meta-Level Planner dynamically coordinates domain experts to decompose multiple perspective-specific intent analyses into solvable subtasks. Question Solvers handle the tasks with multi-modal on/off-chain data, while a Cognitive Evaluator mitigates LLM hallucinations and ensures verifiability. Experiments show that TIM significantly outperforms machine learning models, single LLMs, and single Agent baselines. We also analyze core challenges in intent inference. This work helps provide a more reliable understanding of user motivations in DeFi, offering context-aware explanations for complex blockchain activity.

Table of Contents

cs.CV

[1] Skin-R1: Toward Trustworthy Clinical Reasoning for Dermatological Diagnosis

[2] GeoSceneGraph: Geometric Scene Graph Diffusion Model for Text-guided 3D Indoor Scene Synthesis

[3] EGSA-PT:Edge-Guided Spatial Attention with Progressive Training for Monocular Depth Estimation and Segmentation of Transparent Objects

[4] Learning Depth from Past Selves: Self-Evolution Contrast for Robust Depth Estimation

[5] FarSLIP: Discovering Effective CLIP Adaptation for Fine-Grained Remote Sensing Understanding

[6] Physics-Based Benchmarking Metrics for Multimodal Synthetic Images

[7] CPSL: Representing Volumetric Video via Content-Promoted Scene Layers

[8] Unsupervised Discovery of Long-Term Spatiotemporal Periodic Workflows in Human Activities

[9] CKDA: Cross-modality Knowledge Disentanglement and Alignment for Visible-Infrared Lifelong Person Re-identification

[10] HV-Attack: Hierarchical Visual Attack for Multimodal Retrieval Augmented Generation

[11] Evaluating Multimodal Large Language Models on Vertically Written Japanese Text

[12] TiCAL:Typicality-Based Consistency-Aware Learning for Multimodal Emotion Recognition

[13] A Comprehensive Study on Visual Token Redundancy for Discrete Diffusion-based Multimodal Large Language Models

[14] Unbiased Semantic Decoding with Vision Foundation Models for Few-shot Segmentation

[15] Multimodal Continual Instruction Tuning with Dynamic Gradient Guidance

[16] MMCM: Multimodality-aware Metric using Clustering-based Modes for Probabilistic Human Motion Prediction

[17] Instruction-Guided Lesion Segmentation for Chest X-rays with Automatically Generated Large-Scale Dataset

[18] Insert In Style: A Zero-Shot Generative Framework for Harmonious Cross-Domain Object Composition

[19] Towards Unbiased Cross-Modal Representation Learning for Food Image-to-Recipe Retrieval

[20] Taming Generative Synthetic Data for X-ray Prohibited Item Detection

🧩 TL;DR