cs.CV
[1] Expert-Guided Explainable Few-Shot Learning with Active Sample Selection for Medical Image Analysis
Longwei Wang, Ifrat Ikhtear Uddin, KC Santosh
🧩 TL;DR
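This paper proposes a dual-framework solution, Expert-Guided Explainable Few-Shot Learning (EGxFSL) and Explainability-Guided Active Learning (xGAL), that jointly addresses data scarcity and limited model interpretability in medical image analysis, improving performance on several medical imaging datasets.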
📘 Detailed Summary
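Motivation: Medical image analysis faces two key challenges: scarce labeled data and limited model interpretability, both of which hinder clinical AI deployment. Few-shot learning eases the data constraint, but its predictions lack transparency; active learning optimizes data acquisition but ignores the interpretability of the samples it acquires.
Method: The paper proposes a dual-framework solution: EGxFSL and xGAL. EGxFSL uses radiologist-defined regions of interest as spatial supervision through a Grad-CAM-based Dice loss, optimized jointly with prototypical classification to yield interpretable few-shot learning. xGAL introduces an iterative sample acquisition strategy that prioritizes both predictive uncertainty and attention misalignment, forming a closed loop in which explainability guides both training and sample selection.
Result: On BraTS (MRI), VinDr-CXR (chest X-ray), and SIIM-COVID-19 (chest X-ray), the method reaches accuracies of 92%, 76%, and 62%, respectively, consistently outperforming non-guided baselines on all datasets. Under severe data constraints, xGAL reaches 76% accuracy with only 680 samples, versus 57% for random sampling. Grad-CAM visualizations show that guided models focus on diagnostically relevant regions, and validation on breast ultrasound confirms cross-modality applicability.
Conclusion: The study demonstrates the value of integrating explainability mechanisms into few-shot and active learning frameworks, creating a synergistic system in which explainability both increases model transparency and guides training and sample selection. The approach generalizes well across several medical imaging modalities and offers a more reliable and transparent route to clinical AI deployment.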
📄 Abstract
Medical image analysis faces two critical challenges: scarcity of labeled data and lack of model interpretability, both hindering clinical AI deployment. Few-shot learning (FSL) addresses data limitations but lacks transparency in predictions. Active learning (AL) methods optimize data acquisition but overlook interpretability of acquired samples. We propose a dual-framework solution: Expert-Guided Explainable Few-Shot Learning (EGxFSL) and Explainability-Guided AL (xGAL). EGxFSL integrates radiologist-defined regions-of-interest as spatial supervision via Grad-CAM-based Dice loss, jointly optimized with prototypical classification for interpretable few-shot learning. xGAL introduces iterative sample acquisition prioritizing both predictive uncertainty and attention misalignment, creating a closed-loop framework where explainability guides training and sample selection synergistically. On the BraTS (MRI), VinDr-CXR (chest X-ray), and SIIM-COVID-19 (chest X-ray) datasets, we achieve accuracies of 92%, 76%, and 62%, respectively, consistently outperforming non-guided baselines across all datasets. Under severe data constraints, xGAL achieves 76% accuracy with only 680 samples versus 57% for random sampling. Grad-CAM visualizations demonstrate guided models focus on diagnostically relevant regions, with generalization validated on breast ultrasound confirming cross-modality applicability.
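The joint objective described in the abstract above (a Dice term between an attention map and an expert ROI mask, added to a prototypical classification loss) can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names, the weighting factor `lam`, the use of squared Euclidean distance, and the flat-list representation of the attention map are all assumptions, and the Grad-CAM map itself is taken as given.

```python
import math

def dice_loss(attention, roi_mask, eps=1e-6):
    """Soft Dice loss between an attention map in [0, 1] and a binary ROI mask.

    Both inputs are flat lists of equal length (assumed representation).
    """
    intersection = sum(a * m for a, m in zip(attention, roi_mask))
    total = sum(attention) + sum(roi_mask)
    return 1.0 - (2.0 * intersection + eps) / (total + eps)

def class_prototypes(support_embeddings, support_labels):
    """Each class prototype is the mean embedding of that class's support samples."""
    sums, counts = {}, {}
    for emb, label in zip(support_embeddings, support_labels):
        acc = sums.setdefault(label, [0.0] * len(emb))
        for i, v in enumerate(emb):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in acc] for c, acc in sums.items()}

def proto_cross_entropy(query, true_label, protos):
    """Cross-entropy over a softmax of negative squared distances to the prototypes."""
    logits = {c: -sum((q - p) ** 2 for q, p in zip(query, proto))
              for c, proto in protos.items()}
    m = max(logits.values())
    log_z = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return log_z - logits[true_label]

def joint_loss(query, true_label, protos, attention, roi_mask, lam=1.0):
    """Classification loss plus lam times the attention-alignment (Dice) loss.

    `lam` is a hypothetical weighting factor; the abstract does not state one.
    """
    return proto_cross_entropy(query, true_label, protos) + lam * dice_loss(attention, roi_mask)
```

The Dice term approaches 0 when the attention map coincides with the ROI and approaches 1 when the two do not overlap, so minimizing the sum pushes the model toward correct predictions that also attend to the annotated region.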
[2] MIAR: Modality Interaction and Alignment Representation Fuison for Multimodal Emotion
Jichao Zhu, Jun Yu
🧩 TL;DR
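This paper proposes Modality Interaction and Alignment Representation (MIAR), a new multimodal emotion recognition method that uses feature interaction and contrastive learning to address distribution gaps between modalities and their unequal contributions, achieving state-of-the-art performance on the CMU-MOSI and CMU-MOSEI benchmarks.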
📘 Detailed Summary
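Motivation: Existing multimodal emotion recognition methods focus mainly on modality fusion. They do not adequately address the significant distributional differences between modalities, do not account for each modality's differing contribution to the task, and lack robust generalization across diverse textual model features, all of which limits performance in multimodal settings.
Method: The proposed MIAR network integrates contextual features from different modalities through a feature interaction mechanism, producing feature tokens that act as global representations and capture how each modality extracts information from the others. Contrastive learning and normalization strategies then align the representation spaces of the different modalities.
Result: Experiments on the CMU-MOSI and CMU-MOSEI benchmarks show that MIAR outperforms existing state-of-the-art multimodal emotion recognition methods, supporting its effectiveness in handling modality distribution gaps and improving generalization.
Conclusion: The study highlights the importance of accounting for distributional and contribution differences between modalities in multimodal emotion recognition. The proposed interaction-and-alignment framework offers an effective way to handle modality heterogeneity and suggests a direction for future work on multimodal representation learning.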
📄 Abstract
Multimodal Emotion Recognition (MER) aims to perceive human emotions through three modalities: language, vision, and audio. Previous methods primarily focused on modal fusion without adequately addressing significant distributional differences among modalities or considering their varying contributions to the task. They also lacked robust generalization capabilities across diverse textual model features, thus limiting performance in multimodal scenarios. Therefore, we propose a novel approach called Modality Interaction and Alignment Representation (MIAR). This network integrates contextual features across different modalities through feature interaction, generating four feature tokens that serve as global representations of how each modality extracts information from the others. MIAR aligns different modalities using contrastive learning and normalization strategies. We conduct experiments on two benchmarks, CMU-MOSI and CMU-MOSEI, and the results demonstrate that MIAR outperforms state-of-the-art MER methods.
[3] Multimodal Sentiment Analysis based on Multi-channel and Symmetric Mutual Promotion Feature Fusion
Wangyuan Zhu, Jun Yu
🧩 TL;DR
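This paper proposes a symmetric mutual promotion feature fusion method for multimodal sentiment analysis. It extracts multi-channel features to strengthen intra-modal representations and uses symmetric cross-attention to promote information exchange between modalities, addressing the insufficient feature extraction and incomplete modality fusion of existing methods.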
📘 Detailed Summary
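Motivation: Multimodal sentiment analysis faces two main challenges. First, the features extracted from single-modality data are limited and not rich enough. Second, most existing work attends only to the consistency of feature information across modalities and neglects the differences between features, which leads to inadequate fusion. This work aims to address both limitations and improve multimodal sentiment analysis performance.
Method: The method first extracts multi-channel features for more comprehensive feature information, using dual-channel features in the visual and auditory modalities to strengthen intra-modal representations. It then proposes a symmetric mutual promotion (SMP) inter-modal fusion method that combines symmetric cross-modal attention with self-attention: cross-modal attention captures useful information from the other modalities, while self-attention models contextual information, promoting the exchange of useful information and strengthening inter-modal interaction. Finally, intra-modal features and inter-modal fused features are integrated, exploiting the complementarity of inter-modal information while accounting for feature differences.
Result: Experiments on two benchmark datasets demonstrate the effectiveness and superiority of the proposed method, supporting both the multi-channel feature extraction and the symmetric mutual promotion fusion strategy.
Conclusion: By strengthening intra-modal representations and promoting inter-modal information exchange, the study addresses feature extraction and fusion problems in multimodal sentiment analysis. The symmetric mutual promotion fusion method considers inter-modal consistency and also exploits the complementarity of feature differences, offering a new technical approach with value for human-computer interaction and affective computing.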
📄 Abstract
Multimodal sentiment analysis is a key technology in the fields of human-computer interaction and affective computing. Accurately recognizing human emotional states is crucial for facilitating smooth communication between humans and machines. Despite some progress in multimodal sentiment analysis research, numerous challenges remain. The first challenge is the limited and insufficiently rich features extracted from single modality data. Secondly, most studies focus only on the consistency of inter-modal feature information, neglecting the differences between features, resulting in inadequate feature information fusion. In this paper, we first extract multi-channel features to obtain more comprehensive feature information. We employ dual-channel features in both the visual and auditory modalities to enhance intra-modal feature representation. Secondly, we propose a symmetric mutual promotion (SMP) inter-modal feature fusion method. This method combines symmetric cross-modal attention mechanisms and self-attention mechanisms, where the cross-modal attention mechanism captures useful information from other modalities, and the self-attention mechanism models contextual information. This approach promotes the exchange of useful information between modalities, thereby strengthening inter-modal interactions. Furthermore, we integrate intra-modal features and inter-modal fused features, fully leveraging the complementarity of inter-modal feature information while considering feature information differences. Experiments conducted on two benchmark datasets demonstrate the effectiveness and superiority of our proposed method.
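To make the symmetric cross-modal attention idea above concrete, the following plain-Python sketch lets each of two modalities query the other and then applies self-attention to the result. It is only an illustration under simplifying assumptions: there are no learned projection matrices, only two modalities are shown, and the names `attend` and `symmetric_mutual_promotion` are hypothetical rather than taken from the paper.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention; each argument is a list of vectors.

    Queries and keys must share a dimension; the output has one vector per query.
    """
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out

def symmetric_mutual_promotion(feats_a, feats_b):
    """Cross-attend in both directions, then self-attend for context.

    Assumption: identity projections stand in for the learned Q/K/V layers.
    """
    a_from_b = attend(feats_a, feats_b, feats_b)  # modality A gathers information from B
    b_from_a = attend(feats_b, feats_a, feats_a)  # modality B gathers information from A
    a_ctx = attend(a_from_b, a_from_b, a_from_b)  # self-attention over A's fused sequence
    b_ctx = attend(b_from_a, b_from_a, b_from_a)  # self-attention over B's fused sequence
    return a_ctx, b_ctx
```

The two directions use the same operation with the roles of the modalities swapped, which is what makes the scheme symmetric; each output sequence keeps the length of the modality that issued the queries.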
[4] Watch Wider and Think Deeper: Collaborative Cross-modal Chain-of-Thought for Complex Visual Reasoning
Wenting Lu, Didi Zhu, Tao Shen, Donglin Zhu, Ayong Ye, Chao Wu
🧩 TL;DR
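This paper proposes the CoCoT (Collaborative Cross-modal Thought) framework, which uses dynamic multi-region grounding and relation-aware reasoning to overcome the limitations of existing cross-modal chain-of-thought methods in vision-language reasoning, substantially improving complex visual reasoning performance.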
📘 Detailed Summary
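Motivation: Existing cross-modal chain-of-thought methods have two key limitations: over-reliance on a single coarse-grained image region, and semantic fragmentation between successive reasoning steps. Both limit the seamless integration of visual and linguistic cues in multimodal reasoning.
Method: The CoCoT framework rests on two core innovations. Dynamic Multi-Region Grounding adaptively detects the image regions most relevant to the question, and Relation-Aware Reasoning enables multi-region collaboration by iteratively aligning visual cues into a coherent, logical chain of thought. The authors also construct the CoCoT-70K dataset, comprising 74,691 high-quality samples with multi-region annotations and structured reasoning chains.
Result: Experiments show that CoCoT significantly strengthens complex visual reasoning, with average accuracy improvements of 15.4% on LLaVA-1.5 and 4.0% on Qwen2-VL across six challenging benchmarks.
Conclusion: The study demonstrates the effectiveness of dynamic multi-region grounding and relation-aware reasoning in cross-modal chain-of-thought, providing a finer-grained vision-language alignment mechanism for multimodal reasoning. The data and code have been released for community use.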
📄 Abstract
Multi-modal reasoning requires the seamless integration of visual and linguistic cues, yet existing Chain-of-Thought methods suffer from two critical limitations in cross-modal scenarios: (1) over-reliance on single coarse-grained image regions, and (2) semantic fragmentation between successive reasoning steps. To address these issues, we propose the CoCoT (Collaborative Cross-modal Thought) framework, built upon two key innovations: a) Dynamic Multi-Region Grounding to adaptively detect the most relevant image regions based on the question, and b) Relation-Aware Reasoning to enable multi-region collaboration by iteratively aligning visual cues to form a coherent and logical chain of thought. Through this approach, we construct the CoCoT-70K dataset, comprising 74,691 high-quality samples with multi-region annotations and structured reasoning chains. Extensive experiments demonstrate that CoCoT significantly enhances complex visual reasoning, achieving an average accuracy improvement of 15.4% on LLaVA-1.5 and 4.0% on Qwen2-VL across six challenging benchmarks. The data and code are available at: https://github.com/deer-echo/CoCoT.
[5] Understanding Pure Textual Reasoning for Blind Image Quality Assessment
Yuan Li, Shin'ya Nishida
🧩 TL;DR
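This study analyzes the role of textual reasoning in blind image quality assessment from an information-flow perspective. By comparing three paradigms for learning the image-text-score relationship, it shows how much textual information contributes to quality prediction and where there is room for optimization.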
📘 Detailed Summary
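Motivation: Although textual reasoning is now widely used in blind image quality assessment (BIQA), it remains unclear how textual information contributes to quality prediction and to what extent text can represent score-related image content. This study addresses these questions from an information-flow perspective.
Method: The study compares existing BIQA models with three paradigms designed to learn the image-text-score relationship, namely Chain-of-Thought, Self-Consistency, and Autoencoder, and analyzes the differences between them from an information-flow perspective.
Result: Experiments show that the score prediction performance of the existing model drops significantly when only textual information is used. The Chain-of-Thought paradigm brings little improvement in BIQA performance, whereas the Self-Consistency paradigm significantly narrows the gap between image- and text-conditioned predictions, reducing the PLCC/SRCC difference to 0.02/0.03. The Autoencoder paradigm is less effective at closing the image-text gap but reveals a direction for further optimization.
Conclusion: These findings offer insights for improving textual reasoning in BIQA and in high-level vision tasks. In particular, the effectiveness of the Self-Consistency paradigm in bridging the gap between image and text representations points to a direction for future work.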
📄 Abstract
Textual reasoning has recently been widely adopted in Blind Image Quality Assessment (BIQA). However, it remains unclear how textual information contributes to quality prediction and to what extent text can represent the score-related image contents. This work addresses these questions from an information-flow perspective by comparing existing BIQA models with three paradigms designed to learn the image-text-score relationship: Chain-of-Thought, Self-Consistency, and Autoencoder. Our experiments show that the score prediction performance of the existing model significantly drops when only textual information is used for prediction. Whereas the Chain-of-Thought paradigm introduces little improvement in BIQA performance, the Self-Consistency paradigm significantly reduces the gap between image- and text-conditioned predictions, narrowing the PLCC/SRCC difference to 0.02/0.03. The Autoencoder-like paradigm is less effective in closing the image-text gap, yet it reveals a direction for further optimization. These findings provide insights into how to improve the textual reasoning for BIQA and high-level vision tasks.
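The PLCC/SRCC gap quoted above refers to two standard correlation measures between predicted and ground-truth quality scores: the Pearson linear correlation coefficient and the Spearman rank correlation coefficient. As a reminder of how such a gap could be computed, here is a small stdlib-only Python sketch. The helper names are illustrative, and the way image- and text-conditioned predictions are passed in is an assumption about the evaluation setup, not a description of the authors' code.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC) of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def modality_gap(image_preds, text_preds, mos):
    """Absolute PLCC and SRCC differences between image- and text-conditioned predictions."""
    plcc_gap = abs(pearson(image_preds, mos) - pearson(text_preds, mos))
    srcc_gap = abs(spearman(image_preds, mos) - spearman(text_preds, mos))
    return plcc_gap, srcc_gap
```

PLCC measures linear agreement with the ground-truth scores, while SRCC only checks whether the predicted ordering of images matches, so a monotonic but nonlinear predictor can score a perfect SRCC with a lower PLCC.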
[6] Evaluating the Diagnostic Classification Ability of Multimodal Large Language Models: Insights from the Osteoarthritis Initiative
Li Wang, Xi Chen, XiangWen Deng, HuaHui Yi, ZeKun Jiang, Kang Li, Jian Li
🧩 TL;DR
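This study evaluates multimodal large language models on knee osteoarthritis radiograph classification and finds that full MLLM architectures perform poorly on this specific medical image classification task, while optimizing the vision encoder and carefully curating the dataset matter more.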
📘 Detailed Summary
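Motivation: Multimodal large language models perform well on medical visual question answering and report generation, but these generation and explanation abilities do not reliably transfer to disease-specific classification. The study evaluates MLLMs on knee osteoarthritis radiograph classification, a task that is underrepresented in benchmarks even though the condition affects hundreds of millions of people worldwide, and examines how each component contributes to diagnostic accuracy.
Method: Systematic ablation studies manipulate the vision encoder, the connector, and the large language model under a range of training strategies. The comparison includes LoRA fine-tuning on a small, class-balanced dataset (500 images) versus a large, class-imbalanced dataset (5,778 images), measuring each component's effect on classification accuracy.
Result: In the classification task, a trained vision encoder alone can outperform full MLLM pipelines in accuracy, and fine-tuning the LLM brings no meaningful improvement over prompt-based guidance. LoRA fine-tuning on the small balanced dataset outperforms training on the large imbalanced one, indicating that data balance and quality can matter more than raw scale.
Conclusion: The results suggest that, for domain-specific medical classification, LLMs are better suited as interpreters and report generators than as primary classifiers. The MLLM architecture appears less suitable for diagnostic classification tasks that demand high certainty, and the authors recommend prioritizing vision encoder optimization and careful dataset curation when building clinically applicable systems.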
📄 Abstract
Multimodal large language models (MLLMs) show promising performance on medical visual question answering (VQA) and report generation, but these generation and explanation abilities do not reliably transfer to disease-specific classification. We evaluated MLLM architectures on knee osteoarthritis (OA) radiograph classification, which remains underrepresented in existing medical MLLM benchmarks, even though knee OA affects an estimated 300 to 400 million people worldwide. Through systematic ablation studies manipulating the vision encoder, the connector, and the large language model (LLM) across diverse training strategies, we measured each component's contribution to diagnostic accuracy. In our classification task, a trained vision encoder alone could outperform full MLLM pipelines in classification accuracy and fine-tuning the LLM provided no meaningful improvement over prompt-based guidance. And LoRA fine-tuning on a small, class-balanced dataset (500 images) gave better results than training on a much larger but class-imbalanced set (5,778 images), indicating that data balance and quality can matter more than raw scale for this task. These findings suggest that for domain-specific medical classification, LLMs are more effective as interpreters and report generators rather than as primary classifiers. Therefore, the MLLM architecture appears less suitable for medical image diagnostic classification tasks that demand high certainty. We recommend prioritizing vision encoder optimization and careful dataset curation when developing clinically applicable systems.
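The finding above, that a small class-balanced set can beat a much larger imbalanced one, presupposes some way of drawing a balanced subset from an imbalanced pool. One simple possibility is sketched below in Python; the function name, the fixed seed, and the equal-count-per-class policy are assumptions made for illustration, since the abstract does not describe how the 500-image set was assembled.

```python
import random
from collections import defaultdict

def balanced_subset(labels, per_class, seed=0):
    """Return sorted indices of a class-balanced subset.

    Draws the same number of samples from every class, capped by the
    size of the rarest class so that no class is oversampled.
    """
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    n = min(per_class, min(len(indices) for indices in by_class.values()))
    chosen = []
    for label in sorted(by_class):
        chosen.extend(rng.sample(by_class[label], n))
    return sorted(chosen)
```

For example, calling `balanced_subset(labels, per_class=100)` on a five-class label list would yield at most 500 indices, fewer if any class has under 100 examples.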
[7] PatchAlign3D: Local Feature Alignment for Dense 3D Shape understanding
Souhail Hadgi, Bingchen Gong, Ramana Sundararaman, Emery Pierson, Lei Li, Peter Wonka, Maks Ovsjanikov
🧩 TL;DR
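This paper proposes an encoder-only 3D model that produces language-aligned patch-level features directly from point clouds, enabling zero-shot 3D part segmentation without multi-view rendering and significantly outperforming existing rendering-based methods.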
📘 Detailed Summary
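Motivation: Current foundation models for 3D shapes perform well on global tasks but transfer poorly to local, part-level reasoning. Existing approaches rely on multi-view renderings and text queries, which require expensive inference and heavy LLM prompt engineering and fail to exploit the inherent 3D geometry of shapes.
Method: The method uses an encoder-only 3D model trained in two stages: first, dense 2D features from visual encoders such as DINOv2 are distilled into 3D patches; then these patch embeddings are aligned with part-level text embeddings through a multi-positive contrastive objective. The model is a point cloud transformer encoder trained on part-annotated 3D shapes produced by existing data engines.
Result: The 3D encoder achieves zero-shot 3D part segmentation with a single forward pass and no test-time multi-view rendering, significantly outperforming previous rendering-based and feed-forward approaches on several 3D part segmentation benchmarks.
Conclusion: The work shows that language-aligned 3D features can be learned directly from point clouds, offering a more efficient, geometry-aware solution for 3D understanding that reduces dependence on external rendering and complex prompt engineering and moves 3D foundation models toward local reasoning.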
📄 Abstract
Current foundation models for 3D shapes excel at global tasks (retrieval, classification) but transfer poorly to local part-level reasoning. Recent approaches leverage vision and language foundation models to directly solve dense tasks through multi-view renderings and text queries. While promising, these pipelines require expensive inference over multiple renderings, depend heavily on large language-model (LLM) prompt engineering for captions, and fail to exploit the inherent 3D geometry of shapes. We address this gap by introducing an encoder-only 3D model that produces language-aligned patch-level features directly from point clouds. Our pre-training approach builds on existing data engines that generate part-annotated 3D shapes by pairing multi-view SAM regions with VLM captioning. Using this data, we train a point cloud transformer encoder in two stages: (1) distillation of dense 2D features from visual encoders such as DINOv2 into 3D patches, and (2) alignment of these patch embeddings with part-level text embeddings through a multi-positive contrastive objective. Our 3D encoder achieves zero-shot 3D part segmentation with fast single-pass inference without any test-time multi-view rendering, while significantly outperforming previous rendering-based and feed-forward approaches across several 3D part segmentation benchmarks. Project website: https://souhail-hadgi.github.io/patchalign3dsite/
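The multi-positive contrastive objective mentioned in the second training stage can be illustrated with one common formulation, in which a patch embedding may have several matching text embeddings and the loss averages the negative log-probability over all of them. The Python sketch below follows that formulation; it is an assumption about the general shape of such a loss, not the paper's exact objective, and the temperature value and function names are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def multi_positive_loss(patch, text_embeds, positive_ids, temperature=0.07):
    """Average negative log-softmax over all positive text embeddings of one patch.

    patch: embedding of a single 3D patch.
    text_embeds: candidate part-level text embeddings.
    positive_ids: indices of the text embeddings that describe this patch.
    """
    p = l2_normalize(patch)
    logits = [sum(a * b for a, b in zip(p, l2_normalize(t))) / temperature
              for t in text_embeds]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return sum(log_z - logits[i] for i in positive_ids) / len(positive_ids)
```

With a single positive this reduces to the usual InfoNCE loss; allowing several positives is useful when more than one caption legitimately describes the same part.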
[8] MovieRecapsQA: A Multimodal Open-Ended Video Question-Answering Benchmark
Shaden Shaar, Bradon Thymes, Sirawut Chaixanien, Claire Cardie, Bharath Hariharan
🧩 TL;DR
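This paper introduces MovieRecapsQA, a new open-ended multimodal video question-answering benchmark built from movie recap videos. It supplies explicit textual context through synchronized visual and textual modalities in order to evaluate models on complex multimodal reasoning.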
📘 Detailed Summary
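Motivation: Existing video QA benchmarks struggle to capture the multimodal reasoning that real-world videos such as movies require, and most are not open-ended because free-form answers are hard to evaluate. The study aims to fill this gap with an open-ended multimodal video QA benchmark that tests whether models can integrate visual and dialogue cues to answer complex questions.
Method: The benchmark is built from movie recap videos, a distinctive type of YouTube content, using their synchronized visual (recap video) and textual (recap summary) modalities. The recap summaries are used to generate roughly 8.2K question-answer pairs aligned with movie subtitles, along with the "facts" needed to verify an answer, enabling reference-free evaluation. The benchmark offers videos of multiple lengths (recap segments, movie segments) and question categorizations (by modality and type) to support fine-grained analysis.
Result: Seven state-of-the-art multimodal large language models are evaluated, with four findings: 1) visual-only questions are the most challenging; 2) models tend to rely on text whenever textual input is available; 3) extracting factually accurate information from video content remains difficult for all models; and 4) proprietary and open-source models perform comparably on video-dependent questions. According to the authors, this is the first open-ended video QA benchmark that supplies explicit textual context for the input.
Conclusion: MovieRecapsQA exposes key limitations of current multimodal models in video understanding, particularly weak visual reasoning and over-reliance on textual input. It provides a new tool for evaluating complex multimodal reasoning and points to directions for improvement, including better extraction of purely visual information and more balanced multimodal fusion.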
📄 Abstract
Understanding real-world videos such as movies requires integrating visual and dialogue cues to answer complex questions. Yet existing VideoQA benchmarks struggle to capture this multimodal reasoning and are largely not open-ended, given the difficulty of evaluating free-form answers. In this paper, we introduce a novel open-ended multi-modal VideoQA benchmark, MovieRecapsQA, created using movie recap videos--a distinctive type of YouTube content that summarizes a film by presenting its key events through synchronized visual (recap video) and textual (recap summary) modalities. Using the recap summary, we generate approximately 8.2K question-answer (QA) pairs (aligned with movie-subtitles) and provide the necessary "facts" needed to verify an answer in a reference-free manner. To our knowledge, this is the first open-ended VideoQA benchmark that supplies explicit textual context of the input (video and/or text), which we use for evaluation. Our benchmark provides videos of multiple lengths (i.e., recap-segments, movie-segments) and categorizations of questions (by modality and type) to enable fine-grained analysis. We evaluate the performance of seven state-of-the-art MLLMs using our benchmark and observe that: 1) visual-only questions remain the most challenging; 2) models default to textual inputs whenever available; 3) extracting factually accurate information from video content is still difficult for all models; and 4) proprietary and open-source models perform comparably on video-dependent questions.
[9] Unveiling and Bridging the Functional Perception Gap in MLLMs: Atomic Visual Alignment and Hierarchical Evaluation via PET-Bench
Zanting Ye, Xiaolong Niu, Xuanbin Wu, Xu Han, Shengyuan Liu, Jing Hao, Zhihao Peng, Hao Sun, Jieqin Lv, Fanghu Wang, Yanchao Huang, Hubing Wu, Yixuan Yuan, Habib Zaidi, Arman Rahmim, Yefeng Zheng, Lijun Lu
🧩 TL;DR
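This study reveals a functional perception gap in multimodal large language models for functional imaging and proposes the PET-Bench benchmark together with Atomic Visual Alignment, which turns chain-of-thought from a source of hallucination into a reliable reasoning tool and improves diagnostic accuracy by up to 14.83%.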
📘 Detailed Summary
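Motivation: Current multimodal large language models perform well at abnormality detection and report generation for anatomical modalities, but their capability in functional imaging is largely unexplored. The study aims to identify and quantify a fundamental functional perception gap: existing vision encoders cannot decode functional tracer biodistribution independently of morphological priors. Positron Emission Tomography (PET) is used as the representative modality for studying this disconnect.
Method: The study introduces PET-Bench, the first large-scale functional imaging benchmark, comprising 52,308 hierarchical QA pairs from 9,732 multi-site, multi-tracer PET studies. After an extensive evaluation of 19 state-of-the-art MLLMs, it proposes Atomic Visual Alignment (AVA), a simple fine-tuning strategy that requires the model to master low-level functional perception before high-level diagnostic reasoning.
Result: The evaluation reveals a critical safety hazard, the Chain-of-Thought hallucination trap: in PET, standard CoT prompting paradoxically decouples language generation from visual evidence, producing clinically fluent but factually ungrounded diagnoses. AVA effectively bridges the perception gap, turning CoT from a source of hallucination into a robust inference tool and improving diagnostic accuracy by up to 14.83%.
Conclusion: The study highlights the distinctive challenges MLLMs face in functional imaging, especially the hallucination risk introduced by CoT prompting. By putting low-level visual understanding ahead of high-level reasoning, AVA offers an effective way to address this safety issue and informs the design of reliable multimodal systems for medical image analysis.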
📄 Abstract
While Multimodal Large Language Models (MLLMs) have demonstrated remarkable proficiency in tasks such as abnormality detection and report generation for anatomical modalities, their capability in functional imaging remains largely unexplored. In this work, we identify and quantify a fundamental functional perception gap: the inability of current vision encoders to decode functional tracer biodistribution independent of morphological priors. Identifying Positron Emission Tomography (PET) as the quintessential modality to investigate this disconnect, we introduce PET-Bench, the first large-scale functional imaging benchmark comprising 52,308 hierarchical QA pairs from 9,732 multi-site, multi-tracer PET studies. Extensive evaluation of 19 state-of-the-art MLLMs reveals a critical safety hazard termed the Chain-of-Thought (CoT) hallucination trap. We observe that standard CoT prompting, widely considered to enhance reasoning, paradoxically decouples linguistic generation from visual evidence in PET, producing clinically fluent but factually ungrounded diagnoses. To resolve this, we propose Atomic Visual Alignment (AVA), a simple fine-tuning strategy that enforces the mastery of low-level functional perception prior to high-level diagnostic reasoning. Our results demonstrate that AVA effectively bridges the perception gap, transforming CoT from a source of hallucination into a robust inference tool and improving diagnostic accuracy by up to 14.83%. Code and data are available at https://github.com/yezanting/PET-Bench.
[10] Towards Zero-Shot Point Cloud Registration Across Diverse Scales, Scenes, and Sensor Setups
Hyungtae Lim, Minkyun Seo, Luca Carlone, Jaesik Park
🧩 TL;DR
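This paper presents BUFFER-X, a training-free point cloud registration framework with zero-shot generalization. It addresses the weak cross-domain generalization of existing methods through geometry-guided hyperparameter estimation, distribution-aware sampling, and patch-level coordinate normalization, and is validated on 12 datasets.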
📘 Detailed Summary
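Motivation: Current deep learning-based point cloud registration methods are limited in zero-shot generalization and often need dataset-specific hyperparameter tuning or retraining for new environments. The study identifies three key problems: fixed user-defined parameters do not adapt to varying scales, learned keypoint detectors transfer poorly across domains, and absolute coordinates amplify scale mismatches between datasets.
Method: BUFFER-X has three core components: geometry-guided automatic hyperparameter estimation, distribution-aware farthest point sampling in place of learned detectors, and patch-level coordinate normalization to ensure scale consistency. It uses hierarchical multi-scale matching to extract correspondences across local, middle, and global receptive fields. A more efficient variant, BUFFER-X-Lite, cuts computation time by 43% through early exit strategies and fast pose solvers.
Result: On a comprehensive benchmark of 12 datasets spanning object-scale, indoor, and outdoor scenes, including cross-sensor registration between heterogeneous LiDAR configurations, the method generalizes effectively without manual tuning or prior knowledge of the test domains. BUFFER-X-Lite reduces total computation time by 43% relative to BUFFER-X while preserving accuracy.
Conclusion: The study demonstrates that training-free, zero-shot point cloud registration is feasible when geometry-guided methods replace learned components and manual parameter tuning. BUFFER-X offers a general solution for cross-domain registration, and the lightweight BUFFER-X-Lite further improves practical efficiency, supporting perception systems in areas such as autonomous driving and robotics.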
📄 Abstract
Some deep learning-based point cloud registration methods struggle with zero-shot generalization, often requiring dataset-specific hyperparameter tuning or retraining for new environments. We identify three critical limitations: (a) fixed user-defined parameters (e.g., voxel size, search radius) that fail to generalize across varying scales, (b) learned keypoint detectors exhibit poor cross-domain transferability, and (c) absolute coordinates amplify scale mismatches between datasets. To address these three issues, we present BUFFER-X, a training-free registration framework that achieves zero-shot generalization through: (a) geometric bootstrapping for automatic hyperparameter estimation, (b) distribution-aware farthest point sampling to replace learned detectors, and (c) patch-level coordinate normalization to ensure scale consistency. Our approach employs hierarchical multi-scale matching to extract correspondences across local, middle, and global receptive fields, enabling robust registration in diverse environments. For efficiency-critical applications, we introduce BUFFER-X-Lite, which reduces total computation time by 43% (relative to BUFFER-X) through early exit strategies and fast pose solvers while preserving accuracy. We evaluate on a comprehensive benchmark comprising 12 datasets spanning object-scale, indoor, and outdoor scenes, including cross-sensor registration between heterogeneous LiDAR configurations. Results demonstrate that our approach generalizes effectively without manual tuning or prior knowledge of test domains. Code: https://github.com/MIT-SPARK/BUFFER-X.
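Two of the ingredients named above, farthest point sampling in place of a learned keypoint detector and patch-level coordinate normalization, are simple enough to sketch. The Python code below shows plain greedy farthest point sampling (without the distribution-aware weighting the abstract mentions, whose details are not given here) and a normalization that expresses a patch relative to its center and radius. Function names and parameters are illustrative assumptions, not the BUFFER-X API.

```python
import math

def farthest_point_sampling(points, k, start=0):
    """Greedy FPS: repeatedly pick the point farthest from the selected set.

    points: list of coordinate tuples; returns the indices of k selected points.
    """
    selected = [start]
    # Distance from every point to its nearest already-selected point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return selected

def normalize_patch(points, center, radius):
    """Keep points within `radius` of `center`, then shift and scale to the unit ball."""
    return [tuple((c - o) / radius for c, o in zip(p, center))
            for p in points if math.dist(p, center) <= radius]
```

Because each patch is rescaled by its own radius, the same local geometry captured at two different metric scales maps to the same normalized coordinates, which is the scale-consistency property the abstract attributes to patch-level normalization.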
[11] AnyDepth: Depth Estimation Made Easy
Zeyu Ren, Zeyu Zhang, Wukai Li, Qingxiang Liu, Hao Tang
🧩 TL;DR
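This paper proposes a lightweight, data-centric framework for zero-shot monocular depth estimation. It uses DINOv3 as the visual encoder, designs a Simple Depth Transformer decoder with roughly 85%-89% fewer parameters, improves training quality through a quality-based filtering strategy, and surpasses DPT in accuracy on five benchmarks.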
📘 Detailed Summary
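Motivation: Current monocular depth estimation methods depend on large-scale datasets and complex decoders, which limits their efficiency and generalization. The paper addresses this with a lightweight, data-centric framework for zero-shot depth estimation that balances model design and data quality.
Method: The method uses DINOv3 as the visual encoder to obtain high-quality dense features and designs the Simple Depth Transformer (SDT), a compact transformer-based decoder. SDT uses a single-path feature fusion and upsampling process to reduce the computational overhead of cross-scale feature fusion. A quality-based filtering strategy removes harmful samples, reducing dataset size while improving training quality.
Result: Extensive experiments on five benchmarks show that the framework surpasses DPT in accuracy, while the SDT decoder has approximately 85%-89% fewer parameters than DPT, achieving higher accuracy at lower computational cost.
Conclusion: The work underlines the importance of balancing model design and data quality for efficient, generalizable zero-shot depth estimation. The lightweight framework shows that performance can be maintained or improved while reducing parameters and computation, making it suitable for resource-constrained applications.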
📄 Abstract
Monocular depth estimation aims to recover the depth information of 3D scenes from 2D images. Recent work has made significant progress, but its reliance on large-scale datasets and complex decoders has limited its efficiency and generalization ability. In this paper, we propose a lightweight and data-centric framework for zero-shot monocular depth estimation. We first adopt DINOv3 as the visual encoder to obtain high-quality dense features. Secondly, to address the inherent drawbacks of the complex structure of the DPT, we design the Simple Depth Transformer (SDT), a compact transformer-based decoder. Compared to the DPT, it uses a single-path feature fusion and upsampling process to reduce the computational overhead of cross-scale feature fusion, achieving higher accuracy while reducing the number of parameters by approximately 85%-89%. Furthermore, we propose a quality-based filtering strategy to filter out harmful samples, thereby reducing dataset size while improving overall training quality. Extensive experiments on five benchmarks demonstrate that our framework surpasses the DPT in accuracy. This work highlights the importance of balancing model design and data quality for achieving efficient and generalizable zero-shot depth estimation. Code: https://github.com/AIGeeksGroup/AnyDepth. Website: https://aigeeksgroup.github.io/AnyDepth.
[12] ClearAIR: A Human-Visual-Perception-Inspired All-in-One Image Restoration
Xu Zhang, Huan Zhang, Guoli Wang, Qian Zhang, Lefei Zhang
🧩 TL;DR
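This paper proposes ClearAIR, an all-in-one image restoration framework inspired by human visual perception. Its hierarchical coarse-to-fine restoration strategy addresses the oversmoothing and artifacts of existing methods and achieves superior performance on both synthetic and real-world datasets.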
📘 Detailed Summary
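Motivation: Existing all-in-one image restoration methods rely heavily on degradation-specific representations, which often leads to oversmoothing and artifacts and makes complex, composite degradations hard to handle accurately. A more effective framework is needed to improve restoration quality.
Method: ClearAIR follows a hierarchical coarse-to-fine strategy. First, an image quality assessment model based on a multimodal large language model provides a global evaluation. Second, a region awareness and task recognition pipeline combines semantic cross-attention with a degradation-aware module for local restoration. Finally, a self-supervised internal clue reuse mechanism mines the intrinsic information of the image to recover fine details.
Result: Experiments show that ClearAIR achieves superior performance across diverse synthetic and real-world datasets and handles complex composite degradations more effectively than existing methods.
Conclusion: The study demonstrates the effectiveness of a hierarchical restoration strategy inspired by human visual perception for all-in-one image restoration. Cross-modal understanding and internal clue reuse offer a new way to handle complex degradations and a more robust solution for practical restoration tasks.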
📄 Abstract
All-in-One Image Restoration (AiOIR) has advanced significantly, offering promising solutions for complex real-world degradations. However, most existing approaches rely heavily on degradation-specific representations, often resulting in oversmoothing and artifacts. To address this, we propose ClearAIR, a novel AiOIR framework inspired by Human Visual Perception (HVP) and designed with a hierarchical, coarse-to-fine restoration strategy. First, leveraging the global priority of early HVP, we employ a Multimodal Large Language Model (MLLM)-based Image Quality Assessment (IQA) model for overall evaluation. Unlike conventional IQA, our method integrates cross-modal understanding to more accurately characterize complex, composite degradations. Building upon this overall assessment, we then introduce a region awareness and task recognition pipeline. A semantic cross-attention, leveraging semantic guidance unit, first produces coarse semantic prompts. Guided by this regional context, a degradation-aware module implicitly captures region-specific degradation characteristics, enabling more precise local restoration. Finally, to recover fine details, we propose an internal clue reuse mechanism. It operates in a self-supervised manner to mine and leverage the intrinsic information of the image itself, substantially enhancing detail restoration. Experimental results show that ClearAIR achieves superior performance across diverse synthetic and real-world datasets.
[13] AbductiveMLLM: Boosting Visual Abductive Reasoning Within MLLMs
Boyu Chang, Qi Wang, Xi Guo, Zhixiong Nan, Yazhou Yao, Tianfei Zhou
🧩 TL;DR
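This paper proposes AbductiveMLLM, a visual abductive reasoning framework inspired by human cognition. By combining verbal and pictorial reasoning mechanisms, it substantially improves multimodal large language models on visual abductive reasoning and reaches state-of-the-art results on standard benchmarks.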
📘 Detailed Summary
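Motivation: Current multimodal large language models show strong general-purpose multimodal reasoning but still fall well short of humans on visual abductive reasoning. This task requires inferring the most likely explanation from partial visual observations, which poses a higher-level cognitive challenge for AI systems.
Method: AbductiveMLLM has two synergistic components. The REASONER works in the verbal domain: it first uses a blind LLM to explore a broad space of possible explanations, then prunes visually incongruent hypotheses based on cross-modal causal alignment, and feeds the remaining hypotheses into the MLLM as targeted priors. The IMAGINER emulates human pictorial thinking: it conditions a text-to-image diffusion model on both the input video and the REASONER's output embeddings to generate visual scenes corresponding to the verbal explanations. The two components are trained jointly end to end.
Result: On standard visual abductive reasoning benchmarks, AbductiveMLLM achieves state-of-the-art performance, consistently outperforming traditional solutions and advanced MLLMs.
Conclusion: By mimicking the interplay between verbal and pictorial abduction in human cognition, the study improves the abductive reasoning ability of MLLMs. This suggests that dual-mode reasoning is an effective way to strengthen higher-level cognitive functions in AI systems and offers a direction for future multimodal reasoning research.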
📄 Abstract
Visual abductive reasoning (VAR) is a challenging task that requires AI systems to infer the most likely explanation for incomplete visual observations. While recent MLLMs develop strong general-purpose multimodal reasoning capabilities, they fall short in abductive inference, as compared to human beings. To bridge this gap, we draw inspiration from the interplay between verbal and pictorial abduction in human cognition, and propose to strengthen abduction of MLLMs by mimicking such dual-mode behavior. Concretely, we introduce AbductiveMLLM comprising two synergistic components: REASONER and IMAGINER. The REASONER operates in the verbal domain. It first explores a broad space of possible explanations using a blind LLM and then prunes visually incongruent hypotheses based on cross-modal causal alignment. The remaining hypotheses are introduced into the MLLM as targeted priors, steering its reasoning toward causally coherent explanations. The IMAGINER, on the other hand, further guides MLLMs by emulating human-like pictorial thinking. It conditions a text-to-image diffusion model on both the input video and the REASONER's output embeddings to "imagine" plausible visual scenes that correspond to verbal explanation, thereby enriching MLLMs' contextual grounding. The two components are trained jointly in an end-to-end manner. Experiments on standard VAR benchmarks show that AbductiveMLLM achieves state-of-the-art performance, consistently outperforming traditional solutions and advanced MLLMs.
[14] EarthVL: A Progressive Earth Vision-Language Understanding and Generation Framework
Junjue Wang, Yanfei Zhong, Zihang Chen, Zhuo Zheng, Ailong Ma, Liangpei Zhang
🧩 TL;DR
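This paper proposes a progressive Earth vision-language understanding and generation framework, comprising the multi-task dataset EarthVLSet and the semantic-guided network EarthVLNet. It targets the lack of object-relational reasoning in Earth vision and moves progressively from semantic segmentation to relational reasoning to comprehensive scene understanding.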
📘 Detailed Summary
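Motivation: Earth vision has reached milestones in remote sensing object recognition but has not explored object-relational reasoning, which limits comprehensive scene understanding. Existing methods do not fully integrate image, mask, and text information, and applications such as city planning need a more complete framework for understanding relations between geographic objects.
Method: The study proposes a progressive Earth vision-language understanding and generation framework consisting of the multi-task dataset EarthVLSet (10.9k sub-meter resolution remote sensing images, land-cover masks, and 761.5k textual pairs) and the semantic-guided network EarthVLNet. EarthVLNet is object-centric and proceeds in stages: land-cover semantic segmentation first produces object semantics; an object-aware large language model guided by pixel-wise semantics then performs relational reasoning and knowledge summarization; and for optimization, a numerical difference loss dynamically adds difference penalties to account for the differing statistics of objects.
Result: EarthVLNet shows superior performance on three benchmarks (semantic segmentation, multiple-choice VQA, and open-ended VQA). The experiments yield three findings: 1) segmentation features consistently improve VQA performance, even in cross-dataset scenarios; 2) multiple-choice tasks are more sensitive to the vision encoder than to the language decoder; and 3) open-ended tasks need both advanced vision encoders and advanced language decoders for best performance.
Conclusion: The study provides a useful benchmark connecting "image-mask-text" and advances geographic applications of Earth vision. The dataset and method help fill the gap in object-relational reasoning and support more comprehensive scene understanding for applications such as city planning. The work also highlights the importance of segmentation features for VQA and the differing sensitivity of task types to model components.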
📄 Abstract
Earth vision has achieved milestones in geospatial object recognition but lacks exploration in object-relational reasoning, limiting comprehensive scene understanding. To address this, a progressive Earth vision-language understanding and generation framework is proposed, including a multi-task dataset (EarthVLSet) and a semantic-guided network (EarthVLNet). Focusing on city planning applications, EarthVLSet includes 10.9k sub-meter resolution remote sensing images, land-cover masks, and 761.5k textual pairs involving both multiple-choice and open-ended visual question answering (VQA) tasks. In an object-centric way, EarthVLNet is proposed to progressively achieve semantic segmentation, relational reasoning, and comprehensive understanding. The first stage involves land-cover segmentation to generate object semantics for VQA guidance. Guided by pixel-wise semantics, the object awareness based large language model (LLM) performs relational reasoning and knowledge summarization to generate the required answers. As for optimization, the numerical difference loss is proposed to dynamically add difference penalties, addressing the various objects' statistics. Three benchmarks, including semantic segmentation, multiple-choice, and open-ended VQA demonstrated the superiorities of EarthVLNet, yielding three future directions: 1) segmentation features consistently enhance VQA performance even in cross-dataset scenarios; 2) multiple-choice tasks show greater sensitivity to the vision encoder than to the language decoder; and 3) open-ended tasks necessitate advanced vision encoders and language decoders for an optimal performance. We believe this dataset and method will provide a beneficial benchmark that connects "image-mask-text", advancing geographical applications for Earth vision.
[15] Topology-aware Pathological Consistency Matching for Weakly-Paired IHC Virtual Staining
Mingzhou Jiang, Jiaying Zhou, Nan Zeng, Mickael Li, Qijie Tang, Chao He, Huazhu Fu, Honghui He
🧩 TL;DR
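This paper proposes a novel topology-aware framework for H&E-to-IHC virtual staining. Through Topology-aware Consistency Matching and Topology-constrained Pathological Matching, it addresses the weak-pairing problem caused by spatial misalignment and local deformation between adjacent slides, improving the generation quality and clinical relevance of virtual staining.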
📘 Detailed Summary
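Motivation: Immunohistochemical (IHC) staining is essential in the clinical examination of cancers, but compared with the commonly used H&E staining it is complex, time-consuming, and expensive, which limits its clinical use. Virtual staining converts H&E images into IHC images, but using adjacent slides as ground truth yields weakly paired data with spatial misalignment and local deformations, which hinders effective supervised learning.
Method: The paper proposes a topology-aware framework with two core mechanisms. Topology-aware Consistency Matching (TACM) uses graph contrastive learning and topological perturbations to learn robust matching patterns and ensure structural consistency. Topology-constrained Pathological Matching (TCPM) aligns pathological positive regions based on node importance to strengthen pathological consistency. The framework is designed specifically to cope with spatial misalignment and local deformation.
Result: Extensive experiments on two benchmark datasets across four staining tasks show that the method outperforms state-of-the-art approaches, with higher generation quality and clinical relevance.
Conclusion: The study demonstrates the effectiveness of the topology-aware framework for H&E-to-IHC virtual staining: by handling spatial misalignment and local deformation, it improves generation quality. The method offers a cost-effective alternative for clinical practice and a new technical approach to weakly paired medical image data.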
📄 Abstract
Immunohistochemical (IHC) staining provides crucial molecular characterization of tissue samples and plays an indispensable role in the clinical examination and diagnosis of cancers. However, compared with the commonly used Hematoxylin and Eosin (H&E) staining, IHC staining involves complex procedures and is both time-consuming and expensive, which limits its widespread clinical use. Virtual staining converts H&E images to IHC images, offering a cost-effective alternative to clinical IHC staining. Nevertheless, using adjacent slides as ground truth often results in weakly-paired data with spatial misalignment and local deformations, hindering effective supervised learning. To address these challenges, we propose a novel topology-aware framework for H&E-to-IHC virtual staining. Specifically, we introduce a Topology-aware Consistency Matching (TACM) mechanism that employs graph contrastive learning and topological perturbations to learn robust matching patterns despite spatial misalignments, ensuring structural consistency. Furthermore, we propose a Topology-constrained Pathological Matching (TCPM) mechanism that aligns pathological positive regions based on node importance to enhance pathological consistency. Extensive experiments on two benchmarks across four staining tasks demonstrate that our method outperforms state-of-the-art approaches, achieving superior generation quality with higher clinical relevance.
[16] TA-Prompting: Enhancing Video Large Language Models for Dense Video Captioning via Temporal Anchors
Wei-Yuan Cheng, Kai-Po Chang, Chi-Pin Huang, Fu-En Yang, Yu-Chiang Frank Wang
🧩 TL;DR
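This paper proposes TA-Prompting, which strengthens the event localization ability of video large language models through temporal anchors and, combined with an event coherent sampling strategy, improves performance on dense video captioning and temporal understanding tasks.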
📘 Detailed Summary
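Motivation: Existing LLM-based video understanding methods struggle to identify precise event boundaries in untrimmed videos, so the generated captions lack accurate temporal grounding, which limits the practical effectiveness of dense video captioning.
Method: The TA-Prompting framework learns precise event localization through temporal anchors and prompts the VideoLLM to perform temporal-aware video event understanding. At inference time it introduces an event coherent sampling strategy that selects event captions based on coherence across temporal events and cross-modal similarity with the video.
Result: Extensive experiments on benchmark datasets show that TA-Prompting outperforms existing state-of-the-art VideoLLMs on dense video captioning and on temporal understanding tasks, including moment retrieval and TemporalQA.
Conclusion: The study shows the importance of the temporal anchor mechanism for improving event localization in VideoLLMs. The event coherent sampling strategy handles caption generation for an arbitrary number of events in a video, offering a new technical path for dense video understanding.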
📄 Abstract
Dense video captioning aims to interpret and describe all temporally localized events throughout an input video. Recent state-of-the-art methods leverage large language models (LLMs) to provide detailed moment descriptions for video data. However, existing VideoLLMs still struggle to identify precise event boundaries in untrimmed videos, so the generated captions are not properly grounded. In this paper, we propose TA-Prompting, which enhances VideoLLMs via Temporal Anchors that learn to precisely localize events and prompt the VideoLLMs to perform temporal-aware video event understanding. During inference, in order to properly determine the output caption sequence from an arbitrary number of events presented within a video, we introduce an event coherent sampling strategy to select event captions with sufficient coherence across temporal events and cross-modal similarity with the given video. Through extensive experiments on benchmark datasets, we show that TA-Prompting compares favorably with state-of-the-art VideoLLMs, yielding superior performance on dense video captioning and temporal understanding tasks including moment retrieval and TemporalQA.
[17] SketchThinker-R1: Towards Efficient Sketch-Style Reasoning in Large Multimodal Models
Ruiyang Zhang, Dongzhan Zhou, Zhedong Zheng
🧩 TL;DR
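This paper proposes SketchThinker-R1, which incentivizes large multimodal models to adopt sketch-style reasoning, cutting reasoning token cost by over 64% while preserving answer accuracy.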
📘 Detailed Summary
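Motivation: Large multimodal models widely adopt step-by-step reasoning, which is empirically effective but incurs substantial computational overhead, including higher token costs and longer response times, and so undermines inference efficiency. Humans, by contrast, often use sketch-style reasoning: a concise, goal-directed cognitive process that prioritizes salient information and enables efficient problem-solving.
Method: The method has three main stages. The Sketch-Mode Cold Start stage converts standard long reasoning processes into sketch-style reasoning and fine-tunes the base multimodal model. Next, a SketchJudge reward model is trained to explicitly evaluate the model's thinking process and assign higher scores to sketch-style reasoning. Finally, Sketch-Thinking Reinforcement Learning under SketchJudge supervision further generalizes the sketch-style reasoning ability.
Result: Evaluation on four benchmarks shows that SketchThinker-R1 reduces reasoning token cost by over 64% without compromising final answer accuracy. Qualitative analysis further shows that sketch-style reasoning focuses more on key cues during problem solving.
Conclusion: The study shows that it is feasible to optimize the reasoning process of large multimodal models by imitating human cognitive efficiency. Sketch-style reasoning improves computational efficiency while maintaining performance, suggesting a new direction for the design of efficient multimodal reasoning systems.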
📄 Abstract
Despite the empirical success of extensive, step-by-step reasoning in large multimodal models, long reasoning processes inevitably incur substantial computational overhead, in the form of higher token costs and increased response time, which undermines inference efficiency. In contrast, humans often employ sketch-style reasoning: a concise, goal-directed cognitive process that prioritizes salient information and enables efficient problem-solving. Inspired by this cognitive efficiency, we propose SketchThinker-R1, which incentivizes sketch-style reasoning ability in large multimodal models. Our method consists of three primary stages. In the Sketch-Mode Cold Start stage, we convert standard long reasoning processes into sketch-style reasoning and fine-tune the base multimodal model, instilling an initial sketch-style reasoning capability. Next, we train the SketchJudge Reward Model, which explicitly evaluates the thinking process of the model and assigns higher scores to sketch-style reasoning. Finally, we conduct Sketch-Thinking Reinforcement Learning under the supervision of SketchJudge to further generalize the sketch-style reasoning ability. Experimental evaluation on four benchmarks reveals that SketchThinker-R1 achieves over 64% reduction in reasoning token cost without compromising final answer accuracy. Qualitative analysis further shows that sketch-style reasoning focuses more on key cues during problem solving.
[18] DCG ReID: Disentangling Collaboration and Guidance Fusion Representations for Multi-modal Vehicle Re-Identification
Aihua Zheng, Ya Gao, Shihao Li, Chenglong Li, Jin Tang
🧩 TL;DR
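This paper proposes DCG-ReID, which disentangles the fusion requirements of modal data with heterogeneous quality distributions. It designs a dynamic confidence-based disentangling weighting mechanism and two scenario-specific fusion strategies to address the conflicting fusion requirements of data with balanced and unbalanced quality distributions in multi-modal vehicle re-identification.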
📘 Detailed Summary
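Motivation: Multi-modal vehicle re-identification faces uncertainty in the modality quality distribution. Inherent discrepancies across the RGB, near-infrared, and thermal-infrared modalities give data with balanced and unbalanced quality distributions distinct, conflicting fusion requirements. Existing methods process all multi-modal data in a single fusion model, overlooking the different needs of the two data types and making it difficult to decouple the conflict between intra-class consistency and inter-modal heterogeneity.
Method: DCG-ReID first designs the Dynamic Confidence-based Disentangling Weighting (DCDW) mechanism, which dynamically reweights the contributions of the three modalities via interaction-derived modal confidence to build a disentangled fusion framework. On top of this, it develops two scenario-specific fusion strategies. For balanced quality distributions, the Collaboration Fusion Module (CFM) mines pairwise consensus features to capture shared discriminative information. For unbalanced distributions, the Guidance Fusion Module (GFM) differentially amplifies modal discriminative disparities to reinforce the advantages of the dominant modality and guide auxiliary modalities to mine complementary discriminative information.
Result: Extensive experiments on three multi-modal ReID benchmarks (WMVeID863, MSVR310, RGBNT100) validate the effectiveness of the method. The results indicate that it handles the uncertainty of modality quality distribution in multi-modal vehicle re-identification and improves recognition performance.
Conclusion: The study highlights how differences in modality quality distribution affect the choice of fusion strategy in multi-modal vehicle re-identification. The proposed disentangled collaboration-and-guidance fusion framework addresses the conflicting fusion requirements of balanced and unbalanced data, offers a new perspective on heterogeneous data fusion in multi-modal vision tasks, and has the potential to extend to other multi-modal recognition tasks.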
📄 Abstract
Multi-modal vehicle Re-Identification (ReID) aims to leverage complementary information from RGB, Near Infrared (NIR), and Thermal Infrared (TIR) modalities to retrieve the same vehicle. The challenges of multi-modal vehicle ReID arise from the uncertainty of modality quality distribution induced by inherent discrepancies across modalities, resulting in distinct conflicting fusion requirements for data with balanced and unbalanced quality distributions. Existing methods handle all multi-modal data within a single fusion model, overlooking the different needs of the two data types and making it difficult to decouple the conflict between intra-class consistency and inter-modal heterogeneity. To this end, we propose Disentangle Collaboration and Guidance Fusion Representations for Multi-modal Vehicle ReID (DCG-ReID). Specifically, to disentangle heterogeneous quality-distributed modal data without mutual interference, we first design the Dynamic Confidence-based Disentangling Weighting (DCDW) mechanism: dynamically reweighting three-modal contributions via interaction-derived modal confidence to build a disentangled fusion framework. Building on DCDW, we develop two scenario-specific fusion strategies: (1) for balanced quality distributions, Collaboration Fusion Module (CFM) mines pairwise consensus features to capture shared discriminative information and boost intra-class consistency; (2) for unbalanced distributions, Guidance Fusion Module (GFM) implements differential amplification of modal discriminative disparities to reinforce dominant modality advantages, guide auxiliary modalities to mine complementary discriminative info, and mitigate inter-modal divergence to boost multi-modal joint decision performance. Extensive experiments on three multi-modal ReID benchmarks (WMVeID863, MSVR310, RGBNT100) validate the effectiveness of our method. Code will be released upon acceptance.
[19] DGA-Net: Enhancing SAM with Depth Prompting and Graph-Anchor Guidance for Camouflaged Object Detection
Yuetong Li, Qing Zhang, Yilin Zhao, Gongyang Li, Zeming Liu
🧩 TL;DR
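This paper proposes DGA-Net, a camouflaged object detection framework that adapts the Segment Anything Model (SAM) through a novel "depth prompting" paradigm, using cross-modal graph enhancement and anchor-guided refinement modules to exploit depth cues and improve detection performance.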
📘 Detailed Summary
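Motivation: Existing camouflaged object detection methods rely mainly on sparse prompts (e.g., points or boxes) and do not fully exploit depth cues. This paper addresses how to integrate RGB semantic information with depth geometric information so as to construct dense depth prompts and propagate them through the whole network, improving the accuracy and consistency of camouflaged object detection.
Method: The DGA-Net framework has two core modules. The Cross-modal Graph Enhancement (CGE) module fuses RGB semantics and depth geometry within a heterogeneous graph to produce a unified guidance signal. The Anchor-Guided Refinement (AGR) module creates a global anchor and establishes non-local pathways that broadcast the guidance signal from deep to shallow layers, mitigating information decay in the feature hierarchy.
Result: Quantitative and qualitative results show that DGA-Net outperforms existing state-of-the-art methods on camouflaged object detection, supporting the effectiveness of the depth prompting paradigm and of the proposed modules in improving segmentation accuracy and consistency.
Conclusion: The study shows the potential of integrating multimodal information through a dense depth prompting mechanism and offers a new way to adapt SAM to specific vision tasks. The proposed graph enhancement and anchor guidance methods address information decay in feature hierarchies and are a useful reference for object detection in complex scenes.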
📄 Abstract
To fully exploit depth cues in Camouflaged Object Detection (COD), we present DGA-Net, a specialized framework that adapts the Segment Anything Model (SAM) via a novel "depth prompting" paradigm. Distinguished from existing approaches that primarily rely on sparse prompts (e.g., points or boxes), our method introduces a holistic mechanism for constructing and propagating dense depth prompts. Specifically, we propose a Cross-modal Graph Enhancement (CGE) module that synthesizes RGB semantics and depth geometry within a heterogeneous graph to form a unified guidance signal. Furthermore, we design an Anchor-Guided Refinement (AGR) module. To counteract the inherent information decay in feature hierarchies, AGR forges a global anchor and establishes direct non-local pathways to broadcast this guidance from deep to shallow layers, ensuring precise and consistent segmentation. Quantitative and qualitative experimental results demonstrate that our proposed DGA-Net outperforms the state-of-the-art COD methods.
[20] PrismVAU: Prompt-Refined Inference System for Multimodal Video Anomaly Understanding
Iñaki Erregue, Kamal Nasrollahi, Sergio Escalera
🧩 TL;DR
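This paper proposes PrismVAU, a lightweight real-time video anomaly understanding system that uses a single off-the-shelf multimodal large language model for anomaly scoring, explanation, and prompt optimization, achieving competitive detection performance and interpretable anomaly explanations without instruction tuning or external modules.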
📘 Detailed Summary
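Motivation: Existing video anomaly understanding methods usually rely on fine-tuned multimodal large language models or external modules such as video captioners, which introduce costly annotations, complex training pipelines, and high inference overhead. A lightweight, efficient real-time solution is therefore needed.
Method: PrismVAU operates in two complementary stages: a coarse anomaly scoring module that computes frame-level anomaly scores via similarity to textual anchors, and an MLLM-based refinement module that contextualizes anomalies through system and user prompts. Both the textual anchors and the prompts are optimized with a weakly supervised Automatic Prompt Engineering (APE) framework.
Result: Extensive experiments on standard video anomaly detection benchmarks show that PrismVAU achieves competitive detection performance and interpretable anomaly explanations without instruction tuning, frame-level annotations, external modules, or dense processing.
Conclusion: The study provides an efficient, practical solution for video anomaly understanding. It achieves real-time performance with a single off-the-shelf MLLM, reduces annotation and computation costs, and offers a new approach to designing lightweight anomaly understanding systems for real-world applications.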
📄 Abstract
Video Anomaly Understanding (VAU) extends traditional Video Anomaly Detection (VAD) by not only localizing anomalies but also describing and reasoning about their context. Existing VAU approaches often rely on fine-tuned multimodal large language models (MLLMs) or external modules such as video captioners, which introduce costly annotations, complex training pipelines, and high inference overhead. In this work, we introduce PrismVAU, a lightweight yet effective system for real-time VAU that leverages a single off-the-shelf MLLM for anomaly scoring, explanation, and prompt optimization. PrismVAU operates in two complementary stages: (1) a coarse anomaly scoring module that computes frame-level anomaly scores via similarity to textual anchors, and (2) an MLLM-based refinement module that contextualizes anomalies through system and user prompts. Both textual anchors and prompts are optimized with a weakly supervised Automatic Prompt Engineering (APE) framework. Extensive experiments on standard VAD benchmarks demonstrate that PrismVAU delivers competitive detection performance and interpretable anomaly explanations -- without relying on instruction tuning, frame-level annotations, and external modules or dense processing -- making it an efficient and practical solution for real-world applications.
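The coarse scoring stage described above (frame-level scores from similarity to textual anchors) can be illustrated with a minimal sketch. The abstract does not give the exact scoring rule, so everything here is an assumption for illustration: the split into "normal" and "anomalous" anchor sets, the softmax with a `temperature` parameter, and the toy 3-dimensional vectors standing in for real image and text encoder outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def anomaly_score(frame_emb, normal_anchors, anomalous_anchors, temperature=0.1):
    """Share of softmax mass the frame puts on the anomalous anchors.

    Hypothetical scoring rule: the source only states that scores come
    from similarity to textual anchors.
    """
    anchors = normal_anchors + anomalous_anchors
    sims = [cosine(frame_emb, a) / temperature for a in anchors]
    probs = softmax(sims)
    return sum(probs[len(normal_anchors):])


# Toy embeddings; a real system would embed frames and anchor texts
# with a shared vision-language encoder.
normal_anchors = [[1.0, 0.0, 0.0]]
anomalous_anchors = [[0.0, 1.0, 0.0]]
frames = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]
scores = [anomaly_score(f, normal_anchors, anomalous_anchors) for f in frames]
```

In this toy setup the first frame lies close to the normal anchor and scores low, while the second lies close to the anomalous anchor and scores high; only frames above some threshold would then be passed to the more expensive MLLM refinement stage.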
[21] LAMS-Edit: Latent and Attention Mixing with Schedulers for Improved Content Preservation in Diffusion-Based Image and Style Editing
Wingwa Fu, Takayuki Okatani
🧩 TL;DR
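This paper proposes the LAMS-Edit framework, which mixes latent representations and attention maps during denoising, with schedulers controlling the interpolation weights, to balance content preservation against edit application in text-to-image editing.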
📘 Detailed Summary
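Motivation: Diffusion-based text-to-image editing faces two main challenges: balancing content preservation with edit application, and handling real-image editing. Existing methods are limited in applying edit instructions effectively while preserving the original image content, especially for real images.
Method: The core of LAMS-Edit is Latent and Attention Mixing with Schedulers (LAMS), which uses intermediate states from the inversion process during edited image generation. At each generation step, the latent representations and attention maps from the inversion process and the edit generation process are combined by weighted interpolation, with the weights controlled by a scheduler. The framework integrates with Prompt-to-Prompt, supports precise editing with region masks, and enables style transfer via LoRA.
Result: Extensive experiments show that LAMS-Edit effectively balances content preservation and edit application. The method performs well on real-image editing, applying edit instructions precisely while preserving key content of the original image, and it supports region-mask editing and style transfer.
Conclusion: LAMS-Edit provides an extensible editing framework that, by exploiting intermediate states of the inversion process, improves the quality and controllability of text-to-image editing. It offers an effective way to balance content preservation and edit application and a new technical direction for diffusion-based image editing research.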
📄 Abstract
Text-to-Image editing using diffusion models faces challenges in balancing content preservation with edit application and handling real-image editing. To address these, we propose LAMS-Edit, leveraging intermediate states from the inversion process--an essential step in real-image editing--during edited image generation. Specifically, latent representations and attention maps from both processes are combined at each step using weighted interpolation, controlled by a scheduler. This technique, Latent and Attention Mixing with Schedulers (LAMS), integrates with Prompt-to-Prompt (P2P) to form LAMS-Edit--an extensible framework that supports precise editing with region masks and enables style transfer via LoRA. Extensive experiments demonstrate that LAMS-Edit effectively balances content preservation and edit application.
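The per-step mixing described above can be sketched as a scheduled linear interpolation. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear scheduler, the function names `linear_schedule` and `mix`, and the flat two-element lists standing in for latents or attention maps are all hypothetical.

```python
def linear_schedule(step, num_steps, start=1.0, end=0.0):
    """Weight on the inversion-side state at a given step.

    Hypothetical linear scheduler: the abstract only says that a
    scheduler controls the interpolation weight.
    """
    if num_steps == 1:
        return start
    t = step / (num_steps - 1)
    return start + (end - start) * t


def mix(inversion_state, edit_state, weight):
    """Element-wise weighted interpolation of two equally shaped flat lists."""
    return [weight * a + (1.0 - weight) * b
            for a, b in zip(inversion_state, edit_state)]


num_steps = 5
inv_latent = [1.0, 1.0]   # stand-in for a latent saved during inversion
edit_latent = [0.0, 2.0]  # stand-in for the latent of the edit branch
mixed = [mix(inv_latent, edit_latent, linear_schedule(s, num_steps))
         for s in range(num_steps)]
```

With this schedule the early steps follow the inversion state (preserving content) and the late steps follow the edit branch (applying the edit); the same `mix` call could be applied to attention maps with a separate schedule.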
[22] Towards Faithful Reasoning in Comics for Small MLLMs
Chengcheng Feng, Haojie Yin, Yucheng Jin, Kaizhu Huang
🧩 TL;DR
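This study proposes a novel comic reasoning framework to address the performance degradation of conventional Chain-of-Thought prompting on comic visual question answering. Through modular reasoning generation and reinforcement fine-tuning, it enables small multimodal large language models to achieve substantial gains on abstract visual reasoning tasks.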
📘 Detailed Summary
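Motivation: Comic-based visual question answering (CVQA) poses distinct challenges to multimodal large language models, involving symbolic abstraction, narrative logic, and humor understanding. The study finds that conventional Chain-of-Thought prompting actually degrades performance on CVQA, especially in small models. The main problems are state entanglement, spurious transitions, and exploration inefficiency, which are particularly pronounced in resource-constrained small models.
Method: The study proposes a novel comic reasoning framework that combines modular Chain-of-Thought generation with GRPO-based reinforcement fine-tuning and a novel structured reward. The framework is tailored to small multimodal large language models and aims to produce more faithful and transferable reasoning chains. The approach is also extended to a broader class of humor-centric and abstract visual reasoning tasks, including meme understanding and editorial cartoon interpretation.
Result: Across five challenging benchmarks, the proposed 3B-parameter model outperforms existing state-of-the-art methods, and plug-in experiments yield an additional average improvement of 12.1% across different multimodal large language models. The method performs well on comic VQA and also generalizes to the broader humor-centric and abstract visual reasoning tasks.
Conclusion: The study reveals the limitations of conventional Chain-of-Thought prompting in complex visual reasoning tasks, particularly its negative effect on small models, and proposes an effective remedy. The comic reasoning framework offers a new technical path for abstract visual understanding in resource-constrained settings and informs the design of multimodal reasoning systems.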
📄 Abstract
Comic-based visual question answering (CVQA) poses distinct challenges to multimodal large language models (MLLMs) due to its reliance on symbolic abstraction, narrative logic, and humor, which differ from conventional VQA tasks. Although Chain-of-Thought (CoT) prompting is widely used to enhance MLLM reasoning, surprisingly, its direct application to CVQA often degrades performance, especially in small-scale models. Our theoretical and empirical analyses reveal that standard CoT in CVQA suffers from state entanglement, spurious transitions, and exploration inefficiency, with small models particularly vulnerable in resource-constrained settings. To address these issues, we propose a novel comic reasoning framework, designed to produce more faithful and transferable reasoning chains in small MLLMs. Specifically, our framework combines modular CoT generation with GRPO-based reinforcement fine-tuning and a novel structured reward. Beyond comic VQA, we further evaluate our approach on a broader class of humor-centric and abstract visual reasoning tasks, including meme understanding and editorial cartoon interpretation. Across five challenging benchmarks, our 3B model outperforms state-of-the-art methods, and plug-in experiments yield an additional average improvement of 12.1% across different MLLMs.
[23] IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation
Yankai Jiang, Qiaoru Li, Binlu Xu, Haoran Sun, Chao Ding, Junting Dong, Yuxiang Cai, Xuhong Zhang, Jianwei Yin
🧩 TL;DR
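This paper proposes IBISAgent, a novel agentic multimodal large language model that reformulates medical image segmentation as a vision-centric, multi-step decision-making process, producing high-quality segmentation masks through iterative reasoning and text-based click actions without modifying the model architecture.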
📘 Detailed Summary
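Motivation: Existing medical multimodal large language models face two major challenges in pixel-level understanding. First, they introduce implicit segmentation tokens and require simultaneous fine-tuning of both the MLLM and external pixel decoders, which increases the risk of catastrophic forgetting and limits generalization. Second, most methods rely on single-pass reasoning and cannot iteratively refine segmentation results, leading to suboptimal performance.
Method: IBISAgent reformulates segmentation as a vision-centric, multi-step decision-making process in which the MLLM generates interleaved reasoning and text-based click actions, invokes segmentation tools, and produces high-quality masks without architectural modifications. A two-stage training framework, consisting of cold-start supervised fine-tuning followed by agentic reinforcement learning with tailored, fine-grained rewards, enhances the model's robustness in complex medical referring and reasoning segmentation tasks.
Result: Extensive experiments show that IBISAgent consistently outperforms both closed-source and open-source state-of-the-art methods, showing strong robustness and accuracy on complex medical referring and reasoning segmentation tasks and supporting the effectiveness of its multi-step iterative reasoning.
Conclusion: The study shows that reformulating segmentation as a multi-step decision-making process is feasible. An agentic MLLM achieves high-quality segmentation without architectural modifications, brings pixel-level visual reasoning to medical image analysis, and provides an effective framework for iterative refinement in complex medical segmentation tasks.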
📄 Abstract
Recent research on medical MLLMs has gradually shifted its focus from image-level understanding to fine-grained, pixel-level comprehension. Although segmentation serves as the foundation for pixel-level understanding, existing approaches face two major challenges. First, they introduce implicit segmentation tokens and require simultaneous fine-tuning of both the MLLM and external pixel decoders, which increases the risk of catastrophic forgetting and limits generalization to out-of-domain scenarios. Second, most methods rely on single-pass reasoning and lack the capability to iteratively refine segmentation results, leading to suboptimal performance. To overcome these limitations, we propose a novel agentic MLLM, named IBISAgent, that reformulates segmentation as a vision-centric, multi-step decision-making process. IBISAgent enables MLLMs to generate interleaved reasoning and text-based click actions, invoke segmentation tools, and produce high-quality masks without architectural modifications. By iteratively performing multi-step visual reasoning on masked image features, IBISAgent naturally supports mask refinement and promotes the development of pixel-level visual reasoning capabilities. We further design a two-stage training framework consisting of cold-start supervised fine-tuning and agentic reinforcement learning with tailored, fine-grained rewards, enhancing the model's robustness in complex medical referring and reasoning segmentation tasks. Extensive experiments demonstrate that IBISAgent consistently outperforms both closed-source and open-source SOTA methods. All datasets, code, and trained models will be released publicly.
[24] Text-Guided Layer Fusion Mitigates Hallucination in Multimodal LLMs
Chenchen Lin, Sanbao Su, Rachel Luo, Yuxiao Chen, Yan Wang, Marco Pavone, Fei Miao
🧩 TL;DR
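This paper proposes TGIF (Text-Guided Inter-layer Fusion), a lightweight module that predicts a query-dependent fusion of visual features to strengthen the visual grounding of multimodal large language models and reduce hallucination.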
📘 Detailed Summary
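Motivation: Current multimodal large language models typically use only a single late-layer feature from a frozen vision encoder, leaving the encoder's rich hierarchy of visual information under-utilized. As a result, models are prone to visually ungrounded hallucinations, relying on language priors rather than image evidence. Existing multi-layer fusion methods partially address this limitation but remain static, applying the same layer mixture regardless of the query.
Method: TGIF treats the layers of the vision encoder as depth-wise "experts" and uses a lightweight module to predict a prompt-dependent fusion of visual features. It follows the principle of direct external fusion, requires no updates to the vision encoder, and adds minimal computational overhead. Integrated into LLaVA-1.5-7B, TGIF provides query-conditioned, hierarchy-aware feature fusion.
Result: With TGIF integrated into LLaVA-1.5-7B, performance improves consistently on hallucination, OCR, and VQA benchmarks, while performance on ScienceQA, GQA, and MMBench is preserved or improved. These results suggest the method strengthens visual grounding and reduces hallucination.
Conclusion: The study suggests that query-conditioned, hierarchy-aware fusion is an effective way to strengthen visual grounding and reduce hallucination in modern multimodal large language models. By exploiting the vision encoder's layer hierarchy through a lightweight module, the method points to a new direction for multimodal model design and indicates that dynamic feature selection has advantages over static fusion.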
📄 Abstract
Multimodal large language models (MLLMs) typically rely on a single late-layer feature from a frozen vision encoder, leaving the encoder's rich hierarchy of visual cues under-utilized. MLLMs still suffer from visually ungrounded hallucinations, often relying on language priors rather than image evidence. While many prior mitigation strategies operate on the text side, they leave the visual representation unchanged and do not exploit the rich hierarchy of features encoded across vision layers. Existing multi-layer fusion methods partially address this limitation but remain static, applying the same layer mixture regardless of the query. In this work, we introduce TGIF (Text-Guided Inter-layer Fusion), a lightweight module that treats encoder layers as depth-wise "experts" and predicts a prompt-dependent fusion of visual features. TGIF follows the principle of direct external fusion, requires no vision-encoder updates, and adds minimal overhead. Integrated into LLaVA-1.5-7B, TGIF provides consistent improvements across hallucination, OCR, and VQA benchmarks, while preserving or improving performance on ScienceQA, GQA, and MMBench. These results suggest that query-conditioned, hierarchy-aware fusion is an effective way to strengthen visual grounding and reduce hallucination in modern MLLMs.
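The contrast between static and prompt-dependent layer fusion can be made concrete with a small sketch. It assumes a mixture-of-experts style router: a single linear map from a prompt embedding to one logit per encoder layer, followed by a softmax. The abstract does not specify TGIF's router, so `route`, `fuse`, `router_weights`, and the toy 2-dimensional features are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route(prompt_emb, router_weights):
    """One weight per encoder layer, predicted from the prompt embedding.

    Hypothetical router: each row of router_weights scores one layer.
    """
    logits = [sum(p * w for p, w in zip(prompt_emb, row))
              for row in router_weights]
    return softmax(logits)


def fuse(layer_features, weights):
    """Weighted sum of per-layer features (the frozen encoder is untouched)."""
    dim = len(layer_features[0])
    return [sum(w * feat[i] for w, feat in zip(weights, layer_features))
            for i in range(dim)]


# Three encoder layers, each contributing a toy 2-d feature.
layer_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
router_weights = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]

weights_a = route([1.0, 0.0], router_weights)  # prompt that favours layer 0
weights_b = route([0.0, 1.0], router_weights)  # prompt that favours layer 1
fused_a = fuse(layer_features, weights_a)
fused_b = fuse(layer_features, weights_b)
```

A static fusion scheme would use one fixed `weights` vector for every query; here two different prompt embeddings yield two different mixtures over the same frozen features.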
[25] Unified Thinker: A General Reasoning Modular Core for Image Generation
Sashuai Zhou, Qiang Zhou, Jijin Hu, Hanqing Yang, Yue Cao, Junpeng Ma, Yinchao Ma, Jun Song, Tiezheng Ge, Cheng Yu, Bo Zheng, Zhou Zhao
🧩 TL;DR
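This paper proposes Unified Thinker, a task-agnostic reasoning architecture that decouples a dedicated reasoning module from the image generator, improving generative models' ability to follow logic-intensive instructions and narrowing the gap between reasoning and execution.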
📘 Detailed Summary
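Motivation: Despite impressive progress in high-fidelity image synthesis, generative models still struggle with logic-intensive instruction following, exposing a persistent reasoning-execution gap. Meanwhile, closed-source systems perform well at reasoning-driven image generation, highlighting a clear shortfall in current open-source models. The paper argues that closing this gap requires not merely better visual generators but executable reasoning: decomposing high-level intents into concrete, verifiable plans that directly steer the generation process.
Method: The paper proposes Unified Thinker, a task-agnostic reasoning architecture for general image generation, designed as a unified planning core that can plug into diverse generators and workflows. It decouples a dedicated reasoning module from the image generator, enabling modular upgrades of reasoning without retraining the entire generative model. Training has two stages: first a structured planning interface is built for the reasoning module, then reinforcement learning grounds its policy in pixel-level feedback, encouraging plans that optimize visual correctness over textual plausibility.
Result: Extensive experiments on text-to-image generation and image editing show that Unified Thinker substantially improves image reasoning and generation quality. The method performs well on logic-intensive instruction following and narrows the gap between open-source models and closed-source systems in reasoning-driven image generation.
Conclusion: The study shows that decoupling reasoning from generation and training with reinforcement learning on pixel-level feedback can substantially improve generative models on complex logical tasks. The modular architecture offers a flexible framework for upgrading the reasoning ability of future generative models and underscores the role of executable reasoning in closing the reasoning-execution gap.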
📄 Abstract
Despite impressive progress in high-fidelity image synthesis, generative models still struggle with logic-intensive instruction following, exposing a persistent reasoning--execution gap. Meanwhile, closed-source systems (e.g., Nano Banana) have demonstrated strong reasoning-driven image generation, highlighting a substantial gap to current open-source models. We argue that closing this gap requires not merely better visual generators, but executable reasoning: decomposing high-level intents into grounded, verifiable plans that directly steer the generative process. To this end, we propose Unified Thinker, a task-agnostic reasoning architecture for general image generation, designed as a unified planning core that can plug into diverse generators and workflows. Unified Thinker decouples a dedicated Thinker from the image Generator, enabling modular upgrades of reasoning without retraining the entire generative model. We further introduce a two-stage training paradigm: we first build a structured planning interface for the Thinker, then apply reinforcement learning to ground its policy in pixel-level feedback, encouraging plans that optimize visual correctness over textual plausibility. Extensive experiments on text-to-image generation and image editing show that Unified Thinker substantially improves image reasoning and generation quality.
[26] ReCCur: A Recursive Corner-Case Curation Framework for Robust Vision-Language Understanding in Open and Edge Scenarios
Yihan Wei, Shenghai Yuan, Tianchen Deng, Boyang Lou, Enwen Hu
🧩 TL;DR
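This paper proposes ReCCur, a low-compute recursive corner-case curation framework that converts noisy web imagery into auditable fine-grained labels via a multi-agent pipeline, providing a practical basis for downstream training and evaluation under resource constraints.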
📘 Detailed Summary
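Motivation: Corner cases are rare or extreme scenarios that drive real-world failures, but they are difficult to curate at scale: web data are noisy, labels are brittle, and edge deployments preclude large-scale retraining. Existing approaches face difficult data acquisition, unstable annotation quality, and high compute requirements.
Method: ReCCur uses a three-stage recursive pipeline. First, large-scale data acquisition and filtering expands a domain vocabulary with a vision-language model (VLM) and enforces tri-modal consistency checks. Second, mixture-of-experts knowledge distillation uses complementary encoders for kNN voting and uncertainty sampling. Third, region-evidence VLM adversarial labeling pairs a proposer with a validator to produce explainable labels.
Result: On realistic corner-case scenarios (e.g., flooded-car inspection), ReCCur runs on consumer-grade GPUs and steadily improves data purity and separability while requiring minimal human supervision. It handles noisy web imagery and produces high-quality, auditable labels.
Conclusion: ReCCur offers a practical solution for corner-case data curation under resource constraints. Its recursive multi-agent framework reduces the need for human supervision while maintaining annotation quality, and provides a reliable data basis for downstream training and evaluation. Code and dataset will be released.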
📄 Abstract
Corner cases are rare or extreme scenarios that drive real-world failures, but they are difficult to curate at scale: web data are noisy, labels are brittle, and edge deployments preclude large retraining. We present ReCCur (Recursive Corner-Case Curation), a low-compute framework that converts noisy web imagery into auditable fine-grained labels via a multi-agent recursive pipeline. First, large-scale data acquisition and filtering expands a domain vocabulary with a vision-language model (VLM), crawls the web, and enforces tri-modal (image, description, keyword) consistency with light human spot checks to yield refined candidates. Next, mixture-of-experts knowledge distillation uses complementary encoders (e.g., CLIP, DINOv2, BEiT) for kNN voting with dual-confidence activation and uncertainty sampling, converging to a high-precision set. Finally, region-evidence VLM adversarial labeling pairs a proposer (multi-granularity regions and semantic cues) with a validator (global and local chained consistency) to produce explainable labels and close the loop. On realistic corner-case scenarios (e.g., flooded-car inspection), ReCCur runs on consumer-grade GPUs, steadily improves purity and separability, and requires minimal human supervision, providing a practical substrate for downstream training and evaluation under resource constraints. Code and dataset will be released.
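The second stage (kNN voting across complementary encoders, with low-confidence samples set aside) can be sketched as follows. The abstract does not define "dual-confidence activation", so this sketch reads it as two thresholds, one on within-encoder vote share and one on cross-encoder agreement; that reading, the names `knn_vote` and `curate`, the thresholds, and the toy embeddings are all assumptions.

```python
from collections import Counter


def knn_vote(query, labeled, k):
    """Majority label and vote share among the k nearest labeled points."""
    nearest = sorted(
        labeled,
        key=lambda item: sum((q - x) ** 2 for q, x in zip(query, item[0])),
    )[:k]
    label, count = Counter(lbl for _, lbl in nearest).most_common(1)[0]
    return label, count / k


def curate(query_views, labeled_views, k=3, min_vote=0.6, min_agree=1.0):
    """Accept a label only if encoders agree and their votes are confident.

    query_views and labeled_views hold one embedding space per encoder.
    Returns the label, or None to mark the sample as uncertain.
    """
    votes = [knn_vote(q, labeled, k)
             for q, labeled in zip(query_views, labeled_views)]
    top_label, agree_count = Counter(lbl for lbl, _ in votes).most_common(1)[0]
    agreement = agree_count / len(votes)
    mean_vote = sum(f for lbl, f in votes if lbl == top_label) / agree_count
    if agreement >= min_agree and mean_vote >= min_vote:
        return top_label
    return None


# Two toy "encoders" with different embedding spaces for the same labeled set.
labeled_a = [([0.0, 0.0], "flooded"), ([0.1, 0.0], "flooded"),
             ([1.0, 1.0], "dry"), ([0.9, 1.0], "dry")]
labeled_b = [([0.0], "flooded"), ([0.2], "flooded"),
             ([1.0], "dry"), ([0.8], "dry")]

accepted = curate([[0.05, 0.0], [0.1]], [labeled_a, labeled_b])
rejected = curate([[0.05, 0.0], [0.9]], [labeled_a, labeled_b])
```

The first query is labeled "flooded" because both encoders agree; the second returns `None` because the encoders disagree, which is the kind of sample an uncertainty-sampling step would route to further checking.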
[27] AnatomiX, an Anatomy-Aware Grounded Multimodal Large Language Model for Chest X-Ray Interpretation
Anees Ur Rehman Hashmi, Numan Saeed, Christoph Lippert
🧩 TL;DR
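This paper proposes AnatomiX, a multitask multimodal large language model designed for anatomically grounded chest X-ray interpretation. Its two-stage approach improves anatomical reasoning, with performance gains of over 25% on multiple benchmarks.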
📘 Detailed Summary
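Motivation: Multimodal medical large language models have made progress in chest X-ray interpretation but still face challenges in spatial reasoning and anatomical understanding. Existing grounding techniques often fail to establish a true anatomical correspondence, resulting in incorrect anatomical understanding in the medical domain; the paper targets this gap.
Method: Inspired by the radiological workflow, AnatomiX adopts a two-stage approach: it first identifies anatomical structures and extracts their features, then leverages a large language model to perform diverse downstream tasks, including phrase grounding, report generation, visual question answering, and image understanding.
Result: Extensive experiments across multiple benchmarks show that AnatomiX achieves superior anatomical reasoning, with over 25% improvement over existing approaches on anatomy grounding, phrase grounding, grounded diagnosis, and grounded captioning. Code and the pretrained model are publicly available.
Conclusion: Through an anatomically grounded multitask multimodal large language model, the study improves the anatomical accuracy of medical image interpretation, provides a more reliable anatomical reasoning framework for medical AI systems, and moves medical image understanding toward more precise anatomical correspondence.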
📄 Abstract
Multimodal medical large language models have shown impressive progress in chest X-ray interpretation but continue to face challenges in spatial reasoning and anatomical understanding. Although existing grounding techniques improve overall performance, they often fail to establish a true anatomical correspondence, resulting in incorrect anatomical understanding in the medical domain. To address this gap, we introduce AnatomiX, a multitask multimodal large language model explicitly designed for anatomically grounded chest X-ray interpretation. Inspired by the radiological workflow, AnatomiX adopts a two-stage approach: first, it identifies anatomical structures and extracts their features, and then leverages a large language model to perform diverse downstream tasks such as phrase grounding, report generation, visual question answering, and image understanding. Extensive experiments across multiple benchmarks demonstrate that AnatomiX achieves superior anatomical reasoning and delivers over 25% improvement in performance on anatomy grounding, phrase grounding, grounded diagnosis, and grounded captioning tasks compared to existing approaches. Code and pretrained model are available at https://github.com/aneesurhashmi/anatomix
[28] Understanding Multi-Agent Reasoning with Large Language Models for Cartoon VQA
Tong Wu, Thanet Markchom
🧩 TL;DR
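This paper proposes a multi-agent LLM framework for visual question answering on cartoon imagery, in which three specialised agents work together to handle the challenges posed by visual abstraction and narrative context, with a systematic evaluation on the Pororo and Simpsons datasets.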
📘 Detailed Summary
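Motivation: Visual question answering on cartoon imagery poses distinct challenges, including exaggerated visual abstraction and narrative-driven context, which standard large language models trained on natural images do not handle adequately. A dedicated approach is therefore needed for cartoon VQA.
Method: The paper proposes a multi-agent LLM framework with three specialised agents: a visual agent that handles visual abstraction features, a language agent that handles textual information, and a critic agent that performs the combined reasoning. The three agents collaborate to support structured reasoning by integrating visual cues and narrative context.
Result: The framework was systematically evaluated on two cartoon VQA datasets, Pororo and Simpsons. The experimental results give a detailed analysis of how each agent contributes to the final prediction, offering a deeper understanding of LLM-based multi-agent behaviour in cartoon VQA and multimodal inference.
Conclusion: The study demonstrates the effectiveness of a multi-agent framework on the distinct challenges of cartoon imagery, provides insight into the behaviour of LLM-based multi-agent systems in complex visual reasoning tasks, and points to directions for future multimodal AI systems targeting domain-specific visual abstraction.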
📄 Abstract
Visual Question Answering (VQA) for stylised cartoon imagery presents challenges, such as interpreting exaggerated visual abstraction and narrative-driven context, which are not adequately addressed by standard large language models (LLMs) trained on natural images. To investigate this issue, a multi-agent LLM framework is introduced, specifically designed for VQA tasks in cartoon imagery. The proposed architecture consists of three specialised agents: visual agent, language agent and critic agent, which work collaboratively to support structured reasoning by integrating visual cues and narrative context. The framework was systematically evaluated on two cartoon-based VQA datasets: Pororo and Simpsons. Experimental results provide a detailed analysis of how each agent contributes to the final prediction, offering a deeper understanding of LLM-based multi-agent behaviour in cartoon VQA and multimodal inference.
[29] UniCorn: Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision
Ruiyan Han, Zhen Fang, XinYu Sun, Yuchen Ma, Ziheng Wang, Yu Zeng, Zehui Chen, Lin Chen, Wenxuan Huang, Wei-Jie Xu, Yi Cao, Feng Zhao
🧩 TL;DR
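This paper proposes UniCorn, a framework that partitions a unified multimodal model into three collaborative roles (Proposer, Solver, and Judge) and trains it via self-play, addressing the Conduction Aphasia between multimodal understanding and generation and improving text-to-image generation quality.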
📘 Detailed Summary
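Motivation: Unified multimodal models have achieved remarkable success in cross-modal comprehension, but a significant gap remains in their ability to use that internal knowledge for high-quality generation. This discrepancy between understanding and generation is formalized as Conduction Aphasia: the model interprets multimodal inputs accurately but struggles to translate that understanding into faithful, controllable synthesis.
Method: UniCorn is a self-improvement framework that needs no external data or teacher supervision. It partitions a single unified multimodal model into three collaborative roles (Proposer, Solver, and Judge), generates high-quality interactions via self-play, and uses cognitive pattern reconstruction to distill latent understanding into explicit generative signals. The paper also introduces UniCycle, a benchmark based on a text-to-image-to-text reconstruction loop, to validate the restoration of multimodal coherence.
Result: Experiments show that UniCorn achieves comprehensive and substantial improvements over the base model across six general image generation benchmarks. It reaches state-of-the-art performance on TIIF (73.8), DPG (86.8), CompBench (88.5), and UniCycle, with further gains of +5.0 on WISE and +6.5 on OneIG, indicating that the method enhances text-to-image generation while maintaining robust comprehension.
Conclusion: The study demonstrates the scalability of fully self-supervised refinement for unified multimodal intelligence. The self-improvement framework narrows the gap between multimodal understanding and generation and offers a new way to improve the generative ability of unified multimodal models while preserving their comprehension.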
📄 Abstract
While Unified Multimodal Models (UMMs) have achieved remarkable success in cross-modal comprehension, a significant gap persists in their ability to leverage such internal knowledge for high-quality generation. We formalize this discrepancy as Conduction Aphasia, a phenomenon where models accurately interpret multimodal inputs but struggle to translate that understanding into faithful and controllable synthesis. To address this, we propose UniCorn, a simple yet elegant self-improvement framework that eliminates the need for external data or teacher supervision. By partitioning a single UMM into three collaborative roles (Proposer, Solver, and Judge), UniCorn generates high-quality interactions via self-play and employs cognitive pattern reconstruction to distill latent understanding into explicit generative signals. To validate the restoration of multimodal coherence, we introduce UniCycle, a cycle-consistency benchmark based on a Text to Image to Text reconstruction loop. Extensive experiments demonstrate that UniCorn achieves comprehensive and substantial improvements over the base model across six general image generation benchmarks. Notably, it achieves SOTA performance on TIIF (73.8), DPG (86.8), CompBench (88.5), and UniCycle while further delivering substantial gains of +5.0 on WISE and +6.5 on OneIG. These results highlight that our method significantly enhances T2I generation while maintaining robust comprehension, demonstrating the scalability of fully self-supervised refinement for unified multimodal intelligence.
[30] LTX-2: Efficient Joint Audio-Visual Foundation Model
Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko, Avishai Berkowitz, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, Eitan Richardson, Guy Shiran, Itay Chachy, Jonathan Chetboun, Michael Finkelson, Michael Kupchick, Nir Zabari, Nitzan Guetta, Noa Kotler, Ofir Bibi, Ori Gordon, Poriya Panet, Roi Benita, Shahar Armon, Victor Kulikov, Yaron Inger, Yonatan Shiftan, Zeev Melumian, Zeev Farbman
🧩 TL;DR
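This paper proposes LTX-2, an open-source foundation model that generates high-quality, temporally synchronized audiovisual content in a unified manner, addressing the lack of audio generation in existing text-to-video diffusion models.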
📘 Detailed Summary
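Motivation: Current text-to-video diffusion models can generate compelling video sequences but lack audio generation, so they cannot provide the semantic, emotional, and atmospheric cues that audio carries, which limits the immersion and completeness of the generated content.
Method: LTX-2 uses an asymmetric dual-stream transformer with a 14B-parameter video stream and a 5B-parameter audio stream, coupled through bidirectional audio-video cross-attention layers with temporal positional embeddings and cross-modality AdaLN for shared timestep conditioning. It employs a multilingual text encoder for broader prompt understanding and introduces a modality-aware classifier-free guidance (modality-CFG) mechanism to improve audiovisual alignment and controllability.
Result: In evaluations, the model achieves the best audiovisual quality and prompt adherence among open-source systems, while delivering results comparable to proprietary models at a fraction of their computational cost and inference time. Beyond generating speech, LTX-2 produces rich, coherent audio tracks with natural background and foley elements.
Conclusion: LTX-2 demonstrates the feasibility of a unified audiovisual generation model whose asymmetric architecture enables efficient training and inference while maintaining high-quality output. All model weights and code are publicly released, improving accessibility and supporting further work in the field.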
📄 Abstract
Recent text-to-video diffusion models can generate compelling video sequences, yet they remain silent -- missing the semantic, emotional, and atmospheric cues that audio provides. We introduce LTX-2, an open-source foundational model capable of generating high-quality, temporally synchronized audiovisual content in a unified manner. LTX-2 consists of an asymmetric dual-stream transformer with a 14B-parameter video stream and a 5B-parameter audio stream, coupled through bidirectional audio-video cross-attention layers with temporal positional embeddings and cross-modality AdaLN for shared timestep conditioning. This architecture enables efficient training and inference of a unified audiovisual model while allocating more capacity for video generation than audio generation. We employ a multilingual text encoder for broader prompt understanding and introduce a modality-aware classifier-free guidance (modality-CFG) mechanism for improved audiovisual alignment and controllability. Beyond generating speech, LTX-2 produces rich, coherent audio tracks that follow the characters, environment, style, and emotion of each scene -- complete with natural background and foley elements. In our evaluations, the model achieves state-of-the-art audiovisual quality and prompt adherence among open-source systems, while delivering results comparable to proprietary models at a fraction of their computational cost and inference time. All model weights and code are publicly released.
[31] A Versatile Multimodal Agent for Multimedia Content Generation
Daoan Zhang, Wenlin Yao, Xiaoyang Wang, Yebowen Hu, Jiebo Luo, Dong Yu
🧩 TL;DR
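This paper proposes MultiMedia-Agent, an agent-based multimodal content generation system that introduces skill acquisition theory and a two-stage correlation strategy to automate end-to-end generation of complex multimedia content, producing better multimedia content than existing models.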
📘 Detailed Summary
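Motivation: Most current AIGC models can only serve as individual components within specific application scenarios and cannot complete tasks end-to-end in real-world applications. They are especially limited in handling multimodal inputs and outputs, whereas editing experts in practice work with a wide variety of image and video inputs and produce multimodal outputs that include audio, text, and other elements.
Method: The paper proposes the MultiMedia-Agent system, which includes a data generation pipeline, a tool library for content creation, and a set of metrics for evaluating preference alignment. It introduces skill acquisition theory to model training data curation and agent training, designs a two-stage correlation strategy for plan optimization (self-correlation and model preference correlation), and trains the agent with a three-stage approach consisting of base plan fine-tuning, success plan fine-tuning, and preference optimization.
Result: Comparison experiments show that the proposed approaches are effective and that MultiMedia-Agent generates better multimedia content than the novel models it is compared against, supporting the usefulness of agent systems for complex content generation tasks.
Conclusion: Agent-based systems offer a viable approach to complex multimedia content generation. By combining multimodal processing with an end-to-end automated workflow, the system addresses limitations of current AIGC models in practical applications and points to a direction for future intelligent content creation systems.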
📄 Abstract
With the advancement of AIGC (AI-generated content) technologies, an increasing number of generative models are revolutionizing fields such as video editing, music generation, and even film production. However, due to the limitations of current AIGC models, most models can only serve as individual components within specific application scenarios and are not capable of completing tasks end-to-end in real-world applications. In real-world applications, editing experts often work with a wide variety of image and video inputs, producing multimodal outputs -- a video typically includes audio, text, and other elements. This level of integration across multiple modalities is something current models are unable to achieve effectively. However, the rise of agent-based systems has made it possible to use AI tools to tackle complex content generation tasks. To handle these complex scenarios, we propose MultiMedia-Agent, an agent designed to automate complex content creation. Our agent system includes a data generation pipeline, a tool library for content creation, and a set of metrics for evaluating preference alignment. Notably, we introduce skill acquisition theory to model training data curation and agent training. We design a two-stage correlation strategy for plan optimization, including self-correlation and model preference correlation. Additionally, we use the generated plans to train MultiMedia-Agent via a three-stage approach comprising base/success plan fine-tuning and preference optimization. The comparison results demonstrate that our approaches are effective and that MultiMedia-Agent can generate better multimedia content compared to novel models.
cs.CL
[32] Adversarial Question Answering Robustness: A Multi-Level Error Analysis and Mitigation Study
Agniv Roy Choudhury, Vignesh Ponselvan Rajasingh
🧩 TL;DR
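This study systematically evaluates the robustness of Transformer models on adversarial QA, identifies the main failure modes through multi-level error analysis, and proposes targeted mitigation strategies built on entity-aware contrastive learning, achieving near-parity between adversarial and clean performance on ELECTRA-base.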
📘 Detailed Summary
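Motivation: Although QA systems perform strongly on standard benchmarks such as SQuAD, they remain vulnerable to adversarial examples. This study investigates the robustness of Transformer models on the AddSent adversarial dataset and examines how model scale and targeted mitigation strategies affect adversarial vulnerability.
Method: The study runs systematic experiments across model scales from ELECTRA-small to ELECTRA-base. It performs multi-level error analysis using five complementary categorization schemes to identify the main failure modes, systematically evaluates adversarial fine-tuning ratios, and implements three targeted mitigation strategies. The central one is entity-aware contrastive learning, a contrastive learning framework guided by named entity recognition (NER).
Result: A mix of 80% clean and 20% adversarial data is found to be the optimal fine-tuning ratio. Scaling from ELECTRA-small to ELECTRA-base eliminates the robustness-accuracy trade-off, with substantial improvements on both clean and adversarial data. Entity-aware contrastive learning performs best: 89.89% Exact Match (EM) on AddSent and 90.73% EM on SQuAD, closing 94.9% of the adversarial gap.
Conclusion: Targeted mitigation can achieve near-parity between clean and adversarial performance, and scaling the model can eliminate the robustness-accuracy trade-off. According to the authors, this is the first work to combine comprehensive linguistic error analysis with NER-guided contrastive learning for adversarial QA, offering an effective framework for building more robust QA systems.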
📄 Abstract
Question answering (QA) systems achieve impressive performance on standard benchmarks like SQuAD, but remain vulnerable to adversarial examples. This project investigates the adversarial robustness of transformer models on the AddSent adversarial dataset through systematic experimentation across model scales and targeted mitigation strategies. We perform comprehensive multi-level error analysis using five complementary categorization schemes, identifying negation confusion and entity substitution as the primary failure modes. Through systematic evaluation of adversarial fine-tuning ratios, we identify 80% clean + 20% adversarial data as optimal. Data augmentation experiments reveal a capacity bottleneck in small models. Scaling from ELECTRA-small (14M parameters) to ELECTRA-base (110M parameters) eliminates the robustness-accuracy trade-off, achieving substantial improvements on both clean and adversarial data. We implement three targeted mitigation strategies, with Entity-Aware contrastive learning achieving best performance: 89.89% AddSent Exact Match (EM) and 90.73% SQuAD EM, representing 94.9% closure of the adversarial gap. To our knowledge, this is the first work integrating comprehensive linguistic error analysis with Named Entity Recognition (NER)-guided contrastive learning for adversarial QA, demonstrating that targeted mitigation can achieve near-parity between clean and adversarial performance.
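Illustrative sketch (not taken from the paper): the Exact Match figures above are SQuAD-style string matches after answer normalization. The snippet below shows one common way to compute EM, plus a hypothetical `gap_closure` helper. The formula in `gap_closure` (one minus the ratio of the gap after mitigation to the gap before) is an assumption, since the abstract does not define how the 94.9% figure is computed.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation, drop English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gold) for gold in gold_answers)


def em_score(predictions: list[str], references: list[list[str]]) -> float:
    """Percentage of predictions that exactly match one of their references."""
    hits = sum(exact_match(p, refs) for p, refs in zip(predictions, references))
    return 100.0 * hits / len(predictions)


def gap_closure(gap_before: float, gap_after: float) -> float:
    """Hypothetical definition: fraction of the clean-vs-adversarial gap removed."""
    return 1.0 - gap_after / gap_before
```

With the numbers reported in the abstract, the remaining gap after mitigation is 90.73 - 89.89 = 0.84 EM points. The baseline gap is not given in the abstract, so it is left as an input here.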
[33] Revisiting Data Compression with Language Modeling
Chen-Han Tsai
🧩 TL;DR
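This study explores the use of large language models for data compression. By optimizing configuration, it achieves an adjusted compression rate of about 18% on the enwik9 dataset, a new state of the art, and shows that LLMs remain competitive when compressing non-natural-text sequences.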
📘 Detailed Summary
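Motivation: Although prior work has shown that large language models are promising for compressing text and multimodal data, several practical challenges still prevent them from replacing existing data compression algorithms. This study explores how to lower the adjusted compression rate of LLMs used as data compressors and addresses key questions for practical deployment.
Method: The study explores several ways to lower the adjusted compression rate of LLMs, focusing on configuration strategies that require no additional model training. It pays particular attention to compressing non-English data, code data, and byte stream sequences, and evaluates compression performance under different configurations.
Result: On the enwik9 dataset, the study achieves an adjusted compression rate of about 18%, a new state of the art, without additional model training. Experiments show that LLMs excel at compressing data in text-dominant domains and, with appropriate configuration, remain competitive on non-natural-text sequences.
Conclusion: Large language models show considerable potential for data compression, especially in text-dominant domains. The study shows that, with appropriate configuration, LLMs can effectively compress non-natural-text sequences, which supports the feasibility of replacing traditional compression algorithms, although practical deployment still needs further optimization.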
📄 Abstract
In this report, we investigate the potential use of large language models (LLMs) in the task of data compression. Previous works have demonstrated promising results in applying LLMs to compressing not only text, but also a wide range of multimodal data. Despite the favorable performance achieved, there still remain several practical questions that pose a challenge to replacing existing data compression algorithms with LLMs. In this work, we explore different methods to achieve a lower adjusted compression rate using LLMs as data compressors. In comparison to previous works, we were able to achieve a new state-of-the-art (SOTA) adjusted compression rate of around 18% on the enwik9 dataset without additional model training. Furthermore, we explore the use of LLMs in compressing non-English data, code data, and byte stream sequences. We show that while LLMs excel in compressing data in text-dominant domains, their ability to compress non-natural text sequences remains competitive if configured in the right way.
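Illustrative sketch (not taken from the report): when a language model drives an entropy coder such as an arithmetic coder, the compressed size is close to the sum of -log2 p over the probabilities the model assigned to the actual symbols. The snippet computes that ideal code length and a compression rate. The optional `model_bytes` term reflects one common way to define an adjusted rate (counting the model's size as part of the compressed output); whether the report uses exactly this definition is an assumption, and the function names are hypothetical.

```python
import math


def ideal_code_length_bits(token_probs: list[float]) -> float:
    """Bits an ideal entropy coder needs, given the probability the model
    assigned to each symbol that actually occurred."""
    return sum(-math.log2(p) for p in token_probs)


def compression_rate(compressed_bytes: float, raw_bytes: float,
                     model_bytes: float = 0.0) -> float:
    """Compressed size divided by raw size; lower is better.
    Passing model_bytes > 0 gives an adjusted rate (assumed definition)."""
    return (compressed_bytes + model_bytes) / raw_bytes
```

For example, three symbols predicted with probabilities 0.5, 0.25, and 0.25 cost 1 + 2 + 2 = 5 bits in total.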
[34] Linear Script Representations in Speech Foundation Models Enable Zero-Shot Transliteration
Ryan Soh-Eun Shim, Kwanghee Choi, Kalvin Chang, Ming-Hao Hsu, Florian Eichin, Zhizheng Wu, Alane Suhr, Michael A. Hedderich, David Harwath, David R. Mortensen, Barbara Plank
🧩 TL;DR
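This study proposes a post-hoc method that modifies the activations of multilingual speech models at inference time to directly control the script of speech recognition output, addressing the output non-determinism that arises when regional varieties of a language are written in different scripts.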
📘 Detailed Summary
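Motivation: Multilingual speech foundation models such as Whisper are trained on web-scale data, but different regional varieties of the same language are often written in different scripts. As a result, the script of the speech recognition output is non-deterministic, which creates problems for downstream applications.
Method: The study finds that script information is linearly encoded in the activation space of multilingual speech models. Adding a script-specific vector to the activations at inference time gives direct control over the output script, and the method also supports unconventional language-script pairings.
Result: Experiments show that the method effectively induces script changes, including unconventional language-script pairings (such as Italian in Cyrillic and Japanese in Latin script), with competitive performance across all Whisper model sizes.
Conclusion: The study reveals that script information is linearly encoded in multilingual speech models and provides a post-hoc way to control output script without retraining, opening a path toward script standardization and customization in speech recognition systems.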
📄 Abstract
Multilingual speech foundation models such as Whisper are trained on web-scale data, where data for each language consists of a myriad of regional varieties. However, different regional varieties often employ different scripts to write the same language, rendering speech recognition output also subject to non-determinism in the output script. To mitigate this problem, we show that script is linearly encoded in the activation space of multilingual speech models, and that modifying activations at inference time enables direct control over output script. We find the addition of such script vectors to activations at test time can induce script change even in unconventional language-script pairings (e.g. Italian in Cyrillic and Japanese in Latin script). We apply this approach to inducing post-hoc control over the script of speech recognition output, where we observe competitive performance across all model sizes of Whisper.
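Illustrative sketch (not taken from the paper): the abstract says script vectors are added to activations at test time but does not say how the vectors are obtained. The snippet assumes a difference-of-means construction, a common choice for linear steering directions, and uses plain Python lists as stand-ins for hidden states. The function names and the `alpha` scaling factor are hypothetical.

```python
def mean_vector(rows: list[list[float]]) -> list[float]:
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def script_vector(acts_target: list[list[float]],
                  acts_source: list[list[float]]) -> list[float]:
    """Assumed construction: mean activation on target-script examples
    minus mean activation on source-script examples."""
    mean_target = mean_vector(acts_target)
    mean_source = mean_vector(acts_source)
    return [t - s for t, s in zip(mean_target, mean_source)]


def steer(hidden: list[float], direction: list[float],
          alpha: float = 1.0) -> list[float]:
    """Add the scaled script vector to one hidden state at inference time."""
    return [h + alpha * d for h, d in zip(hidden, direction)]
```

In a real model the addition would be applied to a chosen layer's activations during decoding, for example through a forward hook; which layer and scale work best is an empirical question the sketch does not address.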
[35] MMFormalizer: Multimodal Autoformalization in the Wild
Jing Xiong, Qi Han, Yunta Hsieh, Hui Shen, Huajian Xin, Chaofan Tao, Chenyang Zhao, Hengyuan Zhang, Taiqiang Wu, Zhen Zhang, Haochen Wang, Zhongwei Wan, Lingpeng Kong, Ngai Wong
🧩 TL;DR
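This paper proposes MMFormalizer, a multimodal autoformalization method that uses adaptive entity grounding to combine natural language mathematics with visual elements, translating physical-world problems into formal statements, and builds the PhyX-AF benchmark for evaluation.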
📘 Detailed Summary
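Motivation: The core challenge for autoformalization lies in the multimodal nature of the physical world, where physical reasoning requires inferring hidden constraints (such as mass or energy) from visual elements. Existing methods are mostly confined to text and cannot handle the multimodal information found in real-world mathematics and physics.
Method: MMFormalizer extends autoformalization beyond text through adaptive grounding. It recursively constructs formal propositions from perceptually grounded primitives using recursive grounding and axiom composition, and uses adaptive recursive termination to ensure that every abstraction is supported by visual evidence and anchored in dimensional or axiomatic grounding.
Result: On PhyX-AF, a new benchmark of 115 samples, frontier models such as GPT-5 and Gemini-3-Pro achieve the highest compile and semantic accuracy. GPT-5 stands out in physical reasoning, while geometry remains the most challenging domain.
Conclusion: MMFormalizer provides a scalable framework for unified multimodal autoformalization that bridges perception and formal reasoning. According to the authors, it is the first multimodal autoformalization method capable of handling classical mechanics (derived from the Hamiltonian), relativity, quantum mechanics, and thermodynamics.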
📄 Abstract
Autoformalization, which translates natural language mathematics into formal statements to enable machine reasoning, faces fundamental challenges in the wild due to the multimodal nature of the physical world, where physics requires inferring hidden constraints (e.g., mass or energy) from visual elements. To address this, we propose MMFormalizer, which extends autoformalization beyond text by integrating adaptive grounding with entities from real-world mathematical and physical domains. MMFormalizer recursively constructs formal propositions from perceptually grounded primitives through recursive grounding and axiom composition, with adaptive recursive termination ensuring that every abstraction is supported by visual evidence and anchored in dimensional or axiomatic grounding. We evaluate MMFormalizer on a new benchmark, PhyX-AF, comprising 115 curated samples from MathVerse, PhyX, Synthetic Geometry, and Analytic Geometry, covering diverse multimodal autoformalization tasks. Results show that frontier models such as GPT-5 and Gemini-3-Pro achieve the highest compile and semantic accuracy, with GPT-5 excelling in physical reasoning, while geometry remains the most challenging domain. Overall, MMFormalizer provides a scalable framework for unified multimodal autoformalization, bridging perception and formal reasoning. To the best of our knowledge, this is the first multimodal autoformalization method capable of handling classical mechanics (derived from the Hamiltonian), as well as relativity, quantum mechanics, and thermodynamics. More details are available on our project page: MMFormalizer.github.io
[36] TWIST: Training-free and Label-free Short Text Clustering through Iterative Vector Updating with LLMs
I-Fan Lin, Faegheh Hasibi, Suzan Verberne
🧩 TL;DR
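This paper proposes a training-free and label-free method for short text clustering that works on top of any existing embedder. It clusters intents through iterative vector updating with LLM guidance and achieves results comparable or superior to state-of-the-art methods in commercial scenarios.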
📘 Detailed Summary
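Motivation: In customer-facing chatbot settings, companies must handle large volumes of user utterances and cluster them by intent. In commercial environments, labeled data is typically unavailable and the number of clusters is unknown. Existing methods usually require contrastive learning or prior knowledge, so they do not fit this label-free, low-resource setting without priors.
Method: The method is based on iterative vector updating: it first constructs sparse vectors from representative texts, then refines them iteratively under the guidance of a large language model. The whole process needs no training or labels, does not assume a known number of clusters or any label information, and can be applied on top of any embedder.
Result: Experiments show that the method matches or outperforms state-of-the-art methods that use contrastive learning across diverse datasets. It is also shown to be model agnostic, working with different embedders and relatively small LLMs, and to scale to large datasets while reducing the computational cost of the LLM.
Conclusion: The method performs well in low-resource, adaptable settings, and its scalability makes it better aligned with real commercial scenarios than existing clustering methods. It offers a practical solution for intent discovery without labeled data and demonstrates the effectiveness of LLM guidance in unsupervised clustering.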
📄 Abstract
In this paper, we propose a training-free and label-free method for short text clustering that can be used on top of any existing embedder. In the context of customer-facing chatbots, companies are dealing with large amounts of user utterances that need to be clustered according to their intent. In these commercial settings, no labeled data is typically available, and the number of clusters is not known. Our method is based on iterative vector updating: it constructs sparse vectors based on representative texts, and then iteratively refines them through LLM guidance. Our method achieves comparable or superior results to state-of-the-art methods that use contrastive learning, but without assuming prior knowledge of clusters or labels. Experiments on diverse datasets and smaller LLMs show that our method is model agnostic and can be applied to any embedder, with relatively small LLMs, and different clustering methods. We also show that our method scales to large datasets, reducing the computational cost of the LLM. These low-resource, adaptable settings and the scalability of our method make it more aligned with real-world scenarios than existing clustering methods.
[37] Limited Linguistic Diversity in Embodied AI Datasets
Selma Wanna, Agnes Luhtaru, Jonathan Salfity, Ryan Barron, Juston Moore, Cynthia Matuszek, Mitch Pryor
🧩 TL;DR
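This study presents a systematic audit of widely used Vision-Language-Action (VLA) datasets, quantifying the lexical variety, duplication, semantic similarity, and syntactic complexity of their instruction language, and finds that instructions in current datasets are highly repetitive with limited structural variation.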
📘 Detailed Summary
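Motivation: Language plays a critical role in Vision-Language-Action models, yet the actual linguistic characteristics of the datasets used to train and evaluate these systems are not systematically documented. Researchers have limited knowledge of what kinds of instructions the datasets contain and how linguistically varied they are, which limits the language understanding and generalization of the resulting models.
Method: The study conducts a systematic dataset audit, quantitatively analyzing several widely used VLA corpora. It evaluates instruction language along complementary dimensions, including lexical variety, duplication and overlap, semantic similarity, and syntactic complexity, to characterize the linguistic distribution of each dataset with multiple metrics.
Result: The analysis shows that many datasets rely on highly repetitive, template-like instructions with limited structural variation, yielding a narrow distribution of instruction forms. The language signal in these datasets is strongly patterned and lacks the diversity and complexity needed to support broad understanding of natural language instructions.
Conclusion: The study provides descriptive documentation of the language characteristics of current VLA training and evaluation data. It is intended to support more detailed dataset reporting, more principled dataset selection, and targeted augmentation strategies that broaden language coverage, providing a data foundation for VLA models with greater linguistic diversity and generalization.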
📄 Abstract
Language plays a critical role in Vision-Language-Action (VLA) models, yet the linguistic characteristics of the datasets used to train and evaluate these systems remain poorly documented. In this work, we present a systematic dataset audit of several widely used VLA corpora, aiming to characterize what kinds of instructions these datasets actually contain and how much linguistic variety they provide. We quantify instruction language along complementary dimensions, including lexical variety, duplication and overlap, semantic similarity, and syntactic complexity. Our analysis shows that many datasets rely on highly repetitive, template-like commands with limited structural variation, yielding a narrow distribution of instruction forms. We position these findings as descriptive documentation of the language signal available in current VLA training and evaluation data, intended to support more detailed dataset reporting, more principled dataset selection, and targeted curation or augmentation strategies that broaden language coverage.
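Illustrative sketch (not taken from the paper): two simple statistics of the kind such an audit might report. The abstract names the dimensions (lexical variety, duplication) but not the exact metrics, so the type-token ratio and exact-duplicate fraction below are assumptions, as are the whitespace tokenization and the example instructions.

```python
from collections import Counter


def type_token_ratio(instructions: list[str]) -> float:
    """Distinct lowercase tokens divided by total tokens (whitespace split)."""
    tokens = [tok for text in instructions for tok in text.lower().split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def duplicate_fraction(instructions: list[str]) -> float:
    """Share of instructions that repeat an earlier one exactly
    (after lowercasing and trimming)."""
    counts = Counter(text.strip().lower() for text in instructions)
    repeated = sum(c - 1 for c in counts.values())
    return repeated / len(instructions) if instructions else 0.0


# Hypothetical template-like instructions, for illustration only.
examples = [
    "pick up the cup",
    "pick up the cup",
    "pick up the block",
    "open the drawer",
]
print(type_token_ratio(examples))    # 7 distinct tokens out of 15
print(duplicate_fraction(examples))  # 1 exact repeat out of 4
```

A low type-token ratio together with a high duplicate fraction is the pattern one would expect from the template-like commands described above.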
[38] Multi-RADS Synthetic Radiology Report Dataset and Head-to-Head Benchmarking of 41 Open-Weight and Proprietary Language Models
Kartik Bose, Abhinandan Kumar, Raghuraman Soundararajan, Priya Mudgil, Samonee Ralmilay, Niharika Dutta, Manphool Singhal, Arun Kumar, Saugata Sen, Anurima Patra, Priya Ghosh, Abanti Das, Amit Gupta, Ashish Verma, Dipin Sudhakaran, Ekta Dhamija, Himangi Unde, Ishan Kumar, Krithika Rangarajan, Prerna Garg, Rachel Sequeira, Sudhin Shylendran, Taruna Yadav, Tej Pal, Pankaj Gupta
🧩 TL;DR
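This study creates RXL-RADSet, a radiologist-verified synthetic multi-RADS benchmark dataset, and systematically compares open-weight small language models with a proprietary model on assigning RADS categories to radiology reports, finding that small language models in the 20-32B parameter range can approach proprietary-model performance under guided prompting.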
📘 Detailed Summary
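Motivation: Reporting and Data Systems (RADS) standardize radiology risk communication, but automatically assigning RADS categories from narrative reports is difficult because of guideline complexity, output-format constraints, limited benchmarking across RADS frameworks and model sizes, and the lack of a verified multi-RADS benchmark dataset.
Method: The study creates the RXL-RADSet benchmark, which contains 1,600 synthetic radiology reports covering 10 RADS and multiple imaging modalities. Reports were generated by LLMs from scenario plans in simulated radiologist styles and underwent two-stage radiologist verification. The study evaluates 41 quantized small language models (12 families, 0.135-32B parameters) and GPT-5.2 under a fixed guided prompt. Primary endpoints are validity and accuracy; a secondary analysis compares guided and zero-shot prompting.
Result: Under guided prompting, GPT-5.2 achieves 99.8% validity and 81.1% accuracy. Pooled small language models achieve 96.8% validity and 61.1% accuracy, with top models in the 20-32B range reaching about 99% validity and mid-to-high 70% accuracy. Performance scales with model size (inflection between <1B and ≥10B) and declines with RADS complexity, mainly because of classification difficulty rather than invalid outputs. Guided prompting improves both validity (99.2% vs 96.7%) and accuracy (78.5% vs 69.6%) over zero-shot prompting.
Conclusion: RXL-RADSet provides a verified benchmark for automated multi-RADS assignment. Large small language models (20-32B) can approach proprietary-model performance under guided prompting, but gaps remain for higher-complexity schemes. Model scale and prompt design both matter for performance, and the results offer empirical support for deploying efficient small language models in clinical settings.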
📄 Abstract
Background: Reporting and Data Systems (RADS) standardize radiology risk communication but automated RADS assignment from narrative reports is challenging because of guideline complexity, output-format constraints, and limited benchmarking across RADS frameworks and model sizes. Purpose: To create RXL-RADSet, a radiologist-verified synthetic multi-RADS benchmark, and compare validity and accuracy of open-weight small language models (SLMs) with a proprietary model for RADS assignment. Materials and Methods: RXL-RADSet contains 1,600 synthetic radiology reports across 10 RADS (BI-RADS, CAD-RADS, GB-RADS, LI-RADS, Lung-RADS, NI-RADS, O-RADS, PI-RADS, TI-RADS, VI-RADS) and multiple modalities. Reports were generated by LLMs using scenario plans and simulated radiologist styles and underwent two-stage radiologist verification. We evaluated 41 quantized SLMs (12 families, 0.135-32B parameters) and GPT-5.2 under a fixed guided prompt. Primary endpoints were validity and accuracy; a secondary analysis compared guided versus zero-shot prompting. Results: Under guided prompting GPT-5.2 achieved 99.8% validity and 81.1% accuracy (1,600 predictions). Pooled SLMs (65,600 predictions) achieved 96.8% validity and 61.1% accuracy; top SLMs in the 20-32B range reached ~99% validity and mid-to-high 70% accuracy. Performance scaled with model size (inflection between <1B and >=10B) and declined with RADS complexity primarily due to classification difficulty rather than invalid outputs. Guided prompting improved validity (99.2% vs 96.7%) and accuracy (78.5% vs 69.6%) compared with zero-shot. Conclusion: RXL-RADSet provides a radiologist-verified multi-RADS benchmark; large SLMs (20-32B) can approach proprietary-model performance under guided prompting, but gaps remain for higher-complexity schemes.
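Illustrative sketch (not taken from the paper): the two primary endpoints can be read as simple proportions. The snippet assumes validity is the share of outputs that fall in the allowed category set for the relevant RADS, and accuracy is the share of all predictions that match the reference label, so an invalid output counts as incorrect. Both definitions, the function name, and the example categories are assumptions.

```python
def validity_and_accuracy(predictions: list[str], labels: list[str],
                          allowed: set[str]) -> tuple[float, float]:
    """Return (validity, accuracy) as fractions of all predictions."""
    n = len(predictions)
    valid = sum(p in allowed for p in predictions)
    correct = sum(p == y for p, y in zip(predictions, labels))
    return valid / n, correct / n


# Hypothetical five-category scheme and model outputs, for illustration only.
allowed_categories = {"1", "2", "3", "4", "5"}
preds = ["2", "4", "unknown", "3"]
gold = ["2", "5", "3", "3"]
validity, accuracy = validity_and_accuracy(preds, gold, allowed_categories)
print(validity, accuracy)  # 3 of 4 outputs are valid, 2 of 4 are correct
```

The pooled count in the abstract is consistent with every model scoring every report once: 41 models times 1,600 reports gives 65,600 predictions.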
cs.AI
[39] Learning from Prompt itself: the Hierarchical Attribution Prompt Optimization
Dongyu Chen, Jian Ma, Xianpeng Zhang, Lei Zhang, Haonan Lu, Chen Chen, Chuangchuang Wang, Kai Tang
🧩 TL;DR
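This study proposes the Hierarchical Attribution Prompt Optimization (HAPO) framework, which uses a dynamic attribution mechanism and semantic-unit optimization to address prompt drift and reduced interpretability in existing prompt optimization methods, achieving more efficient prompt optimization across a range of vision-language tasks.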
📘 Detailed Summary
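Motivation: Current prompt optimization methods have two main problems. The first is prompt drift, where a new prompt fixes previously failing tasks but hurts performance on previously successful ones. The second is that generating prompts from scratch reduces interpretability. The study aims to address both with a structured prompt optimization framework that keeps performance consistent and preserves interpretability.
Method: HAPO introduces three innovations: a dynamic attribution mechanism that targets error patterns in the training data and prompting history; semantic-unit optimization, which edits functional prompt segments rather than the whole prompt; and multimodal-friendly progression, which supports both end-to-end LLM and LLM-MLLM workflows. The framework is hierarchical and improves prompt design systematically.
Result: In applications such as single/multi-image QA (e.g., OCRV2) and complex task analysis (e.g., BBH), HAPO shows improved optimization efficiency and outperforms comparable automated prompt optimization methods. The results indicate that the framework substantially reduces prompt drift while keeping prompts interpretable, and establishes an extensible paradigm for prompt engineering.
Conclusion: HAPO offers an extensible paradigm for scalable prompt engineering, balancing performance gains against interpretability through structured optimization. The study highlights the importance of dynamic attribution of error patterns and semantic-unit editing in prompt optimization, and lays groundwork for future multimodal prompt engineering systems.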
📄 Abstract
Optimization is fundamental across numerous disciplines, typically following an iterative process of refining an initial solution to enhance performance. This principle is equally critical in prompt engineering, where designing effective prompts for large language models constitutes a complex optimization challenge. A structured optimization approach requires automated or semi-automated procedures to develop improved prompts, thereby reducing manual effort, improving performance, and yielding an interpretable process. However, current prompt optimization methods often induce prompt drift, where new prompts fix prior failures but impair performance on previously successful tasks. Additionally, generating prompts from scratch can compromise interpretability. To address these limitations, this study proposes the Hierarchical Attribution Prompt Optimization (HAPO) framework, which introduces three innovations: (1) a dynamic attribution mechanism targeting error patterns in training data and prompting history, (2) semantic-unit optimization for editing functional prompt segments, and (3) multimodal-friendly progression supporting both end-to-end LLM and LLM-MLLM workflows. Applied in contexts like single/multi-image QA (e.g., OCRV2) and complex task analysis (e.g., BBH), HAPO demonstrates enhanced optimization efficiency, outperforming comparable automated prompt optimization methods and establishing an extensible paradigm for scalable prompt engineering.
[40] M3MAD-Bench: Are Multi-Agent Debates Really Effective Across Domains and Modalities?
Ao Li, Jinghui Zhang, Luyu Li, Yuxiang Duan, Lang Gao, Mingcai Chen, Weijun Qin, Shaopeng Li, Fengxian Ji, Ning Liu, Lizhen Cui, Xiuying Chen, Yuntao Du
🧩 TL;DR
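This paper presents M3MAD-Bench, a unified and extensible benchmark for evaluating Multi-Agent Debate (MAD) methods across multi-domain tasks, multimodal inputs, and multi-dimensional metrics, addressing the fragmentation and single-modality limits of existing evaluations.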
📘 Detailed Summary
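Motivation: Existing research on multi-agent debate has two fundamental limitations: evaluations are conducted under fragmented and inconsistent settings, which hinders fair comparison, and they are largely restricted to single-modality scenarios with text-only inputs, leaving multimodal inputs uncovered.
Method: M3MAD-Bench establishes standardized protocols over five core task domains (Knowledge, Mathematics, Medicine, Natural Sciences, and Complex Reasoning) and systematically covers both pure-text and vision-language datasets, enabling controlled cross-modality comparison. The benchmark evaluates MAD methods on nine base models spanning different architectures, scales, and modality capabilities, and includes efficiency-oriented metrics such as token consumption and inference time.
Result: Extensive experiments yield systematic insights into the effectiveness, robustness, and efficiency of MAD in text-only and multimodal scenarios. The benchmark gives a holistic view of performance-cost trade-offs and reveals the relative strengths and limitations of MAD methods under different settings.
Conclusion: M3MAD-Bench offers a reliable foundation for future research on standardized MAD evaluation. Its unified evaluation framework enables fair comparison and provides systematic performance insights for developing multimodal multi-agent debate systems.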
📄 Abstract
As an agent-level reasoning and coordination paradigm, Multi-Agent Debate (MAD) orchestrates multiple agents through structured debate to improve answer quality and support complex reasoning. However, existing research on MAD suffers from two fundamental limitations: evaluations are conducted under fragmented and inconsistent settings, hindering fair comparison, and are largely restricted to single-modality scenarios that rely on textual inputs only. To address these gaps, we introduce M3MAD-Bench, a unified and extensible benchmark for evaluating MAD methods across Multi-domain tasks, Multi-modal inputs, and Multi-dimensional metrics. M3MAD-Bench establishes standardized protocols over five core task domains: Knowledge, Mathematics, Medicine, Natural Sciences, and Complex Reasoning, and systematically covers both pure text and vision-language datasets, enabling controlled cross-modality comparison. We evaluate MAD methods on nine base models spanning different architectures, scales, and modality capabilities. Beyond accuracy, M3MAD-Bench incorporates efficiency-oriented metrics such as token consumption and inference time, providing a holistic view of performance--cost trade-offs. Extensive experiments yield systematic insights into the effectiveness, robustness, and efficiency of MAD across text-only and multimodal scenarios. We believe M3MAD-Bench offers a reliable foundation for future research on standardized MAD evaluation. The code is available at http://github.com/liaolea/M3MAD-Bench.
[41] Rationale-Grounded In-Context Learning for Time Series Reasoning with Multimodal Large Language Models
Qingxiang Liu, Zhiqing Cui, Xiaoliang Luo, Yuqian Wu, Zhuoyang Jiang, Huaiyu Wan, Sheng Sun, Lvchun Wang, Wei Yu, Yuxuan Liang
🧩 TL;DR
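This paper proposes RationaleTS, which addresses time series reasoning through rationale-grounded in-context learning. Rationales act as guiding reasoning units rather than post-hoc explanations, improving the performance of multimodal large language models on time series reasoning.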
📘 Detailed Summary
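Motivation: Existing multimodal large language models underperform on time series reasoning because they lack rationale priors that connect temporal observations to downstream outcomes, which leads them to rely on superficial pattern matching rather than principled reasoning.
Method: The method first induces label-conditioned rationales, which are reasoning paths from observable evidence to potential outcomes. It then uses a hybrid retrieval mechanism that balances temporal patterns and semantic context to retrieve relevant rationale priors, which are used for in-context inference on new samples.
Result: Extensive experiments on time series reasoning tasks in three domains demonstrate the effectiveness and efficiency of RationaleTS and show clear gains in time series reasoning performance.
Conclusion: The study shows that treating rationales as guiding reasoning units rather than post-hoc explanations can substantially improve time series reasoning, offering a new methodological framework for applying multimodal large language models to time series analysis. The authors plan to release their code for reproduction.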
📄 Abstract
The underperformance of existing multimodal large language models for time series reasoning lies in the absence of rationale priors that connect temporal observations to their downstream outcomes, which leads models to rely on superficial pattern matching rather than principled reasoning. We therefore propose rationale-grounded in-context learning for time series reasoning, where rationales work as guiding reasoning units rather than post-hoc explanations, and develop the RationaleTS method. Specifically, we first induce label-conditioned rationales, composed of reasoning paths from observable evidence to the potential outcomes. Then, we design a hybrid retrieval mechanism that balances temporal patterns and semantic contexts to retrieve correlated rationale priors for the final in-context inference on new samples. We conduct extensive experiments to demonstrate the effectiveness and efficiency of our proposed RationaleTS on three-domain time series reasoning tasks. We will release our code for reproduction.
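Illustrative sketch (not taken from the paper): the abstract says the hybrid retrieval balances temporal patterns and semantic contexts but gives no formula. The snippet assumes the simplest reading, a weighted sum of two cosine similarities followed by top-k selection. The weight `lam`, the dictionary keys, and the toy vectors are all hypothetical.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def hybrid_retrieve(query_ts: list[float], query_sem: list[float],
                    bank: list[dict], lam: float = 0.5, k: int = 2) -> list[str]:
    """Rank stored rationales by lam * temporal similarity plus
    (1 - lam) * semantic similarity, and return the top k."""
    scored = []
    for entry in bank:
        score = (lam * cosine(query_ts, entry["ts"])
                 + (1.0 - lam) * cosine(query_sem, entry["sem"]))
        scored.append((score, entry["rationale"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [rationale for _, rationale in scored[:k]]


# Toy rationale bank with 2-D embeddings, for illustration only.
bank = [
    {"ts": [1.0, 0.0], "sem": [0.0, 1.0], "rationale": "A"},
    {"ts": [1.0, 0.0], "sem": [1.0, 0.0], "rationale": "B"},
    {"ts": [0.0, 1.0], "sem": [0.0, 1.0], "rationale": "C"},
]
print(hybrid_retrieve([1.0, 0.0], [1.0, 0.0], bank))  # B matches on both views, A on one
```

The retrieved rationales would then be placed in the prompt as in-context examples; setting `lam` to 0 or 1 reduces the retrieval to purely semantic or purely temporal matching.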