cs.CV
[1] UniFit: Towards Universal Virtual Try-on with MLLM-Guided Semantic Alignment
Wei Zhang, Yeying Jin, Xin Li, Yan Zhang, Xiaofeng Cong, Cong Wang, Fengcai Qiao, Zhichao Lian
🧩 TL;DR
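This paper proposes UniFit, a universal virtual try-on framework driven by a multimodal large language model (MLLM). Using a semantic alignment module and a progressive training strategy, it addresses the semantic gap between text instructions and reference images and the scarcity of data for complex scenarios, and achieves state-of-the-art performance across a range of virtual try-on tasks.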
📘 Detailed Summary
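Motivation: Current multi-task virtual try-on frameworks guided by text instructions face two key challenges: a semantic gap between text instructions and reference images, and data scarcity in complex scenarios. Both limit the development and use of universal virtual try-on systems.
Method: UniFit introduces an MLLM-guided semantic alignment module that integrates multimodal inputs through a set of learnable queries and applies a semantic alignment loss to capture cross-modal semantic relationships. It also adopts a two-stage progressive training strategy with a self-synthesis pipeline to learn complex tasks from limited data.
Result: Extensive experiments show that UniFit supports a wide range of virtual try-on tasks, including multi-garment and model-to-model try-on, and reaches state-of-the-art performance, validating the effectiveness and generality of the proposed method.
Conclusion: The study demonstrates that MLLMs are effective for virtual try-on. Through semantic alignment and progressive learning, it offers a new technical route to universal virtual try-on systems and a useful reference for research on multimodal generation.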
📄 Abstract
Image-based virtual try-on (VTON) aims to synthesize photorealistic images of a person wearing specified garments. Despite significant progress, building a universal VTON framework that can flexibly handle diverse and complex tasks remains a major challenge. Recent methods explore multi-task VTON frameworks guided by textual instructions, yet they still face two key limitations: (1) semantic gap between text instructions and reference images, and (2) data scarcity in complex scenarios. To address these challenges, we propose UniFit, a universal VTON framework driven by a Multimodal Large Language Model (MLLM). Specifically, we introduce an MLLM-Guided Semantic Alignment Module (MGSA), which integrates multimodal inputs using an MLLM and a set of learnable queries. By imposing a semantic alignment loss, MGSA captures cross-modal semantic relationships and provides coherent and explicit semantic guidance for the generative process, thereby reducing the semantic gap. Moreover, by devising a two-stage progressive training strategy with a self-synthesis pipeline, UniFit is able to learn complex tasks from limited data. Extensive experiments show that UniFit not only supports a wide range of VTON tasks, including multi-garment and model-to-model try-on, but also achieves state-of-the-art performance. The source code and pretrained models are available at https://github.com/zwplus/UniFit.
[2] Box6D: Zero-shot Category-level 6D Pose Estimation of Warehouse Boxes
Yintao Ma, Sajjad Pakdamansavoji, Amir Rasouli, Tongtong Cao
🧩 TL;DR
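This paper proposes Box6D, a category-level 6D pose estimation method for storage boxes in warehouse settings. It infers box dimensions with a fast binary search and estimates pose from a category CAD template, cutting inference time by about 76% while keeping competitive accuracy.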
📘 Detailed Summary
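Motivation: Existing 6D pose estimation methods have limitations in industrial settings: model-based methods need exact CAD models and generalize poorly, model-free methods tend to fail under challenging conditions, and category-level methods are overly general and ignore environment and object priors. This work targets efficient, accurate pose estimation of storage boxes under clutter and occlusion for warehouse automation.
Method: From a single RGB-D observation, Box6D infers box dimensions via a fast binary search and estimates pose with a category CAD template instead of instance-specific models. A depth-based plausibility filter and an early-stopping strategy reject implausible hypotheses and reduce computational cost.
Result: Evaluations on real-world warehouse scenarios and public benchmarks show that the method delivers competitive or superior 6D pose accuracy while reducing inference time by about 76%.
Conclusion: Box6D shows the value of exploiting category priors and environment constraints in a specific industrial setting, offers a practical solution for real-time object manipulation in warehouse automation, and illustrates the potential of category-level methods to balance flexibility and accuracy.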
📄 Abstract
Accurate and efficient 6D pose estimation of novel objects under clutter and occlusion is critical for robotic manipulation across warehouse automation, bin picking, logistics, and e-commerce fulfillment. There are three main approaches in this domain. Model-based methods assume an exact CAD model at inference but require high-resolution meshes and transfer poorly to new environments. Model-free methods that rely on a few reference images or videos are more flexible but often fail under challenging conditions. Category-level approaches aim to balance flexibility and accuracy, but many are overly general and ignore environment and object priors, limiting their practicality in industrial settings. To this end, we propose Box6D, a category-level 6D pose estimation method tailored for storage boxes in the warehouse context. From a single RGB-D observation, Box6D infers the dimensions of the boxes via a fast binary search and estimates poses using a category CAD template rather than instance-specific models. Using a depth-based plausibility filter and early-stopping strategy, Box6D then rejects implausible hypotheses, lowering computational cost. We conduct evaluations on real-world storage scenarios and public benchmarks, and show that our approach delivers competitive or superior 6D pose precision while reducing inference time by approximately 76%.
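The abstract does not spell out how the binary search over box dimensions works, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes a hypothetical `is_plausible(size)` check (for example, a depth-based test that a template scaled to `size` does not contradict the observation) that is monotone: true for every size up to the real extent and false beyond it. The function name, the bounds, and the tolerance are illustrative and do not come from the paper.

```python
def search_dimension(is_plausible, lo, hi, tol=1e-3):
    """Return (approximately) the largest size in [lo, hi] that passes is_plausible.

    Assumption: is_plausible is monotone, with is_plausible(lo) true and
    is_plausible(hi) false, so the boundary lies somewhere in between.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if is_plausible(mid):
            lo = mid   # mid still fits the observation: boundary is above mid
        else:
            hi = mid   # mid is implausible: boundary is below mid
    return lo
```

Each iteration halves the search interval, so the number of plausibility checks grows only logarithmically with the required precision, which is one plausible reason such a search would be cheap compared with scoring a dense grid of candidate sizes.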
[3] Can MLLMs Read the Room? A Multimodal Benchmark for Assessing Deception in Multi-Party Social Interactions
Caixin Kang, Yifei Huang, Liangyang Ouyang, Mingfang Zhang, Ruicong Liu, Yoichi Sato
🧩 TL;DR
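This paper introduces the Multimodal Interactive Deception Assessment task and dataset, exposes marked weaknesses of state-of-the-art multimodal large language models (MLLMs) at detecting deception in complex social interactions, and develops a Social Chain-of-Thought reasoning pipeline and a Dynamic Social Epistemic Memory module to improve their social reasoning.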
📘 Detailed Summary
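Motivation: Despite their advanced reasoning capabilities, MLLMs clearly lack a core component of human intelligence: the ability to "read the room" and assess deception in complex social interactions. To quantify this shortcoming rigorously, the study targets models' inability to understand multimodal social cues and to model what others know, believe, or intend.
Method: The study introduces a new task, Multimodal Interactive Deception Assessment, and builds a new multimodal dataset with synchronized video and text and verifiable ground-truth labels. To tackle the task, it designs a Social Chain-of-Thought reasoning pipeline and a Dynamic Social Epistemic Memory module, which together form a framework for improving social reasoning.
Result: A comprehensive benchmark of 12 state-of-the-art open- and closed-source MLLMs reveals a significant performance gap: even strong models such as GPT-4o struggle to distinguish truth from falsehood reliably. The proposed framework improves performance on this challenging task and points to a new path toward MLLMs with genuinely human-like social reasoning.
Conclusion: The study exposes fundamental deficits of MLLMs in grounding language in multimodal social cues and in modeling others' mental states, underscoring the urgent need for more perceptive and trustworthy AI systems. The proposed approach is a promising direction for addressing this challenge.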
📄 Abstract
Despite their advanced reasoning capabilities, state-of-the-art Multimodal Large Language Models (MLLMs) demonstrably lack a core component of human intelligence: the ability to 'read the room' and assess deception in complex social interactions. To rigorously quantify this failure, we introduce a new task, Multimodal Interactive Deception Assessment (MIDA), and present a novel multimodal dataset providing synchronized video and text with verifiable ground-truth labels for every statement. We establish a comprehensive benchmark evaluating 12 state-of-the-art open- and closed-source MLLMs, revealing a significant performance gap: even powerful models like GPT-4o struggle to distinguish truth from falsehood reliably. Our analysis of failure modes indicates that these models fail to effectively ground language in multimodal social cues and lack the ability to model what others know, believe, or intend, highlighting the urgent need for novel approaches to building more perceptive and trustworthy AI systems. To take a step forward, we design a Social Chain-of-Thought (SoCoT) reasoning pipeline and a Dynamic Social Epistemic Memory (DSEM) module. Our framework yields performance improvement on this challenging task, demonstrating a promising new path toward building MLLMs capable of genuine human-like social reasoning.
[4] Boosting Medical Visual Understanding From Multi-Granular Language Learning
Zihan Li, Yiqing Wang, Sina Farsiu, Paul Kinahan
🧩 TL;DR
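This paper proposes the Multi-Granular Language Learning (MGLL) framework, which uses structured multi-label supervision and a cross-granularity consistency constraint to overcome CLIP's alignment limitations in multi-label, multi-granularity settings such as medical imaging.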
📘 Detailed Summary
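Motivation: Existing Contrastive Language-Image Pretraining (CLIP) methods focus on single-label, single-granularity vision-text alignment. In complex domains such as medical imaging, an image often corresponds to several high-level labels and to annotations at different granularities, so single-granularity alignment limits model effectiveness.
Method: MGLL uses structured multi-label supervision, integrates textual descriptions across granularities, and introduces soft-label supervision with point-wise constraints to strengthen alignment. It uses a smooth KL divergence to enforce cross-granularity consistency and remains computationally efficient as a plug-and-play module.
Result: After pretraining on the constructed large-scale multi-granular datasets, MGLL outperforms other state-of-the-art methods on multiple downstream tasks, showing its strength in multi-label and cross-granularity alignment.
Conclusion: MGLL offers an effective multi-granularity alignment solution for multimodal learning. It is particularly suited to domains with complex label structures, such as medical imaging, and opens a new direction for applying vision-language models in multi-label settings.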
📄 Abstract
Recent advances in image-text pretraining have significantly enhanced visual understanding by aligning visual and textual representations. Contrastive Language-Image Pretraining (CLIP) has played a pivotal role in multimodal learning. However, its focus on single-label, single-granularity alignment limits its effectiveness in complex domains such as medical imaging, where images often correspond to multiple high-level labels (e.g., disease categories) across different annotation granularities (e.g., diagnostic description, clinical explanation). To address this, we propose Multi-Granular Language Learning (MGLL), a contrastive learning framework designed to improve both multi-label and cross-granularity alignment. MGLL leverages structured multi-label supervision, integrates textual descriptions across granularities, and introduces soft-label supervision with point-wise constraints to enhance alignment. MGLL employs smooth Kullback-Leibler (KL) divergence to ensure cross-granularity consistency while maintaining computational efficiency as a plug-and-play module for vision-language models. Pretrained on our constructed large-scale multi-granular datasets and evaluated across multiple datasets, MGLL outperforms other state-of-the-art methods in downstream tasks. The code is available at https://github.com/HUANGLIZI/MGLL.
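The abstract mentions a smooth KL divergence for cross-granularity consistency without defining it. As a rough illustration only, the sketch below computes a KL divergence between two similarity distributions after mixing each with a uniform distribution, which is one common way to keep the divergence finite. The smoothing scheme, the `eps` value, and the function names are assumptions, not the MGLL formulation.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn similarity logits into a probability distribution."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def smoothed_kl(p, q, eps=1e-3):
    """KL(p || q) after mixing both distributions with a uniform one.

    The uniform mixing (an assumed smoothing choice) guarantees that no
    probability is exactly zero, so the logarithm is always defined.
    """
    n = len(p)
    p_s = [(1.0 - eps) * value + eps / n for value in p]
    q_s = [(1.0 - eps) * value + eps / n for value in q]
    return sum(a * math.log(a / b) for a, b in zip(p_s, q_s))
```

In a consistency term of this kind, `p` and `q` would be the image-to-text distributions obtained at two annotation granularities, and the divergence would be added to the contrastive loss so that the two granularities agree.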
[5] InfoCLIP: Bridging Vision-Language Pretraining and Open-Vocabulary Semantic Segmentation via Information-Theoretic Alignment Transfer
Muyao Yuan, Yuanhong Zhang, Weizhan Zhang, Lan Ma, Yuan Gao, Jiangyong Ying, Yudeng Xin
🧩 TL;DR
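This paper proposes InfoCLIP, an information-theoretic method that uses mutual information to guide the transfer of alignment knowledge from pretrained CLIP to open-vocabulary semantic segmentation, addressing the overfitting and degraded modality alignment caused by conventional fine-tuning.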
📘 Detailed Summary
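Motivation: Fine-tuning CLIP for segmentation on a limited set of seen categories leads to overfitting and damages the vision-language alignment learned in pretraining. A fine-tuning strategy that stabilizes modality alignment is needed.
Method: The paper proposes two new objectives based on mutual information: compressing the pixel-text alignment of pretrained CLIP to reduce noise, and maximizing the mutual information between the alignment knowledge of pretrained CLIP and that of the fine-tuned model, so as to transfer compact local semantic relations suited to segmentation.
Result: Extensive evaluations on multiple benchmarks confirm that InfoCLIP improves CLIP fine-tuning for open-vocabulary semantic segmentation and show its adaptability and superiority in asymmetric transfer.
Conclusion: The information-theoretic view provides a stable and effective knowledge-transfer framework for CLIP fine-tuning. InfoCLIP improves segmentation performance while preserving the pretrained alignment, offering a new approach to asymmetric modality transfer.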
📄 Abstract
Recently, the strong generalization ability of CLIP has facilitated open-vocabulary semantic segmentation, which labels pixels using arbitrary text. However, existing methods that fine-tune CLIP for segmentation on limited seen categories often lead to overfitting and degrade the pretrained vision-language alignment. To stabilize modality alignment during fine-tuning, we propose InfoCLIP, which leverages an information-theoretic perspective to transfer alignment knowledge from pretrained CLIP to the segmentation task. Specifically, this transfer is guided by two novel objectives grounded in mutual information. First, we compress the pixel-text modality alignment from pretrained CLIP to reduce noise arising from its coarse-grained local semantic representations learned under image-text supervision. Second, we maximize the mutual information between the alignment knowledge of pretrained CLIP and the fine-tuned model to transfer compact local semantic relations suited for the segmentation task. Extensive evaluations across various benchmarks validate the effectiveness of InfoCLIP in enhancing CLIP fine-tuning for open-vocabulary semantic segmentation, demonstrating its adaptability and superiority in asymmetric transfer.
[6] Fairness in Multi-modal Medical Diagnosis with Demonstration Selection
Dawei Li, Zijian Gu, Peng Wang, Chuhan Song, Zhen Tan, Mohan Zhang, Tianlong Chen, Yu Tian, Song Wang
🧩 TL;DR
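This paper proposes Fairness-Aware Demonstration Selection (FADS), which uses clustering-based sampling to build demographically balanced and semantically relevant demonstrations, giving multimodal large language models a fairness improvement that requires no fine-tuning. Across several medical imaging benchmarks, it consistently reduces gender-, race-, and ethnicity-related performance disparities while maintaining strong accuracy.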
📘 Detailed Summary
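Motivation: Multimodal large language models show strong potential for medical image reasoning, but fairness across demographic groups remains a major concern. Existing debiasing methods usually rely on large labeled datasets or fine-tuning, which is impractical for foundation-scale models. The paper explores in-context learning as a lightweight, tuning-free alternative for improving fairness.
Method: The study finds that conventional demonstration selection strategies cannot ensure fairness because the selected exemplars are demographically imbalanced. It therefore proposes FADS, which builds demographically balanced and semantically relevant demonstrations through clustering-based sampling. The method needs no model fine-tuning and works directly within the in-context learning framework.
Result: Experiments on multiple medical imaging benchmarks show that FADS consistently reduces gender-, race-, and ethnicity-related performance disparities while maintaining strong accuracy, supporting in-context learning as a route to better fairness.
Conclusion: The results highlight fairness-aware in-context learning as a scalable, data-efficient path to fair medical image reasoning, one that improves the fairness of large foundation models without large labeled datasets or compute-intensive fine-tuning.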
📄 Abstract
Multimodal large language models (MLLMs) have shown strong potential for medical image reasoning, yet fairness across demographic groups remains a major concern. Existing debiasing methods often rely on large labeled datasets or fine-tuning, which are impractical for foundation-scale models. We explore In-Context Learning (ICL) as a lightweight, tuning-free alternative for improving fairness. Through systematic analysis, we find that conventional demonstration selection (DS) strategies fail to ensure fairness due to demographic imbalance in selected exemplars. To address this, we propose Fairness-Aware Demonstration Selection (FADS), which builds demographically balanced and semantically relevant demonstrations via clustering-based sampling. Experiments on multiple medical imaging benchmarks show that FADS consistently reduces gender-, race-, and ethnicity-related disparities while maintaining strong accuracy, offering an efficient and scalable path toward fair medical image reasoning. These results highlight the potential of fairness-aware in-context learning as a scalable and data-efficient solution for equitable medical image reasoning.
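To make the idea of demographically balanced demonstration selection concrete, here is a minimal sketch under stated assumptions. It replaces the clustering-based sampling described in the abstract with a simpler per-group nearest-neighbour ranking, so it is not the FADS algorithm itself. The record fields (`id`, `group`, `emb`) and the function names are hypothetical.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def balanced_select(query_emb, pool, per_group):
    """Pick the `per_group` most query-similar examples from every group.

    Selecting the same number of demonstrations per demographic group keeps
    the prompt balanced, while ranking by similarity keeps it relevant.
    """
    by_group = defaultdict(list)
    for example in pool:
        by_group[example["group"]].append(example)

    chosen = []
    for group in sorted(by_group):  # sorted for a deterministic order
        ranked = sorted(
            by_group[group],
            key=lambda example: cosine(query_emb, example["emb"]),
            reverse=True,
        )
        chosen.extend(ranked[:per_group])
    return chosen
```

A plain top-k retrieval over the whole pool would return whichever group dominates the neighbourhood of the query; fixing the quota per group is what removes that imbalance in this sketch.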
[7] TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding
Boshen Xu, Zihan Xiao, Jiaze Li, Jianzhong Ju, Zhenbo Luo, Jian Luan, Qin Jin
🧩 TL;DR
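This paper proposes TimeViper, a hybrid vision-language model that combines Mamba and Transformer architectures with a new token compression mechanism, TransV, to efficiently understand long videos of more than 10,000 frames.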
📘 Detailed Summary
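Motivation: Long video understanding poses two key challenges: it needs an efficient model architecture and an effective mechanism for handling long temporal context. Existing methods are computationally inefficient on very long videos and carry redundant vision tokens, which limits their ability to process hour-long videos.
Method: TimeViper uses a hybrid Mamba-Transformer backbone that combines the efficiency of state-space models with the expressivity of attention. To address the vision token redundancy caused by vision-to-text information aggregation, it proposes TransV, a token information transfer module that transfers and compresses vision tokens into instruction tokens while preserving multimodal understanding.
Result: Experiments show that TimeViper competes with state-of-the-art models on multiple benchmarks while greatly extending the number of frames it can handle, processing hour-long videos of more than 10,000 frames. An analysis of attention behavior in the Mamba and Transformer layers offers new insight into the interpretability of hybrid models.
Conclusion: The work is an initial exploration of developing, interpreting, and compressing hybrid Mamba-Transformer architectures. It reveals token redundancy in the vision-to-text information flow, provides a workable solution for efficient long video understanding, and lays groundwork for future research on hybrid architectures.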
📄 Abstract
We introduce TimeViper, a hybrid vision-language model designed to tackle challenges of long video understanding. Processing long videos demands both an efficient model architecture and an effective mechanism for handling extended temporal contexts. To this end, TimeViper adopts a hybrid Mamba-Transformer backbone that combines the efficiency of state-space models with the expressivity of attention mechanisms. Through this hybrid design, we reveal the vision-to-text information aggregation phenomenon, where information progressively flows from vision tokens to text tokens across increasing LLM depth, resulting in severe vision token redundancy. Motivated by this observation, we propose TransV, a token information transfer module that transfers and compresses vision tokens into instruction tokens while maintaining multimodal understanding capabilities. This design enables TimeViper to process hour-long videos exceeding 10,000 frames. Extensive experiments across multiple benchmarks demonstrate that TimeViper competes with state-of-the-art models while extending frame numbers. We further analyze attention behaviors of both Mamba and Transformer layers, offering new insights into hybrid model interpretability. This work represents an initial step towards developing, interpreting, and compressing hybrid Mamba-Transformer architectures.
[8] Mixture of Ranks with Degradation-Aware Routing for One-Step Real-World Image Super-Resolution
Xiao He, Zhijun Tu, Kun Cheng, Mingrui Zhu, Jie Hu, Nannan Wang, Xinbo Gao
🧩 TL;DR
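This paper proposes a Mixture-of-Ranks (MoR) architecture for one-step image super-resolution, bringing sparse Mixture-of-Experts (MoE) into real-world image super-resolution. A fine-grained expert partitioning strategy and a degradation-aware routing mechanism let it adapt to complex degradation patterns while remaining computationally efficient.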
📘 Detailed Summary
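Motivation: Existing real-world image super-resolution methods that fine-tune diffusion models with LoRA cannot adaptively capture the heterogeneous characteristics of complex real-world degraded samples, nor share knowledge between inputs under the same computational budget. This limits their ability to adapt to diverse degradation patterns.
Method: The Mixture-of-Ranks architecture treats each rank in LoRA as an independent expert, using fine-grained expert partitioning for flexible knowledge recombination. A degradation estimation module based on CLIP embeddings and predefined positive-negative text pairs computes relative degradation scores that dynamically guide expert activation. Zero-expert slots and a degradation-aware load-balancing loss adjust the number of active experts according to degradation severity.
Result: Comprehensive experiments confirm the framework's effectiveness and state-of-the-art performance, with strong reconstruction quality and computational efficiency on real-world image super-resolution and adaptive handling of samples with differing degradation complexity.
Conclusion: The study shows the potential of sparse Mixture-of-Experts for image super-resolution. Fine-grained expert partitioning and dynamic routing model complex degradation patterns effectively and suggest a new approach to efficient image reconstruction under limited compute.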
📄 Abstract
The demonstrated success of sparsely-gated Mixture-of-Experts (MoE) architectures, exemplified by models such as DeepSeek and Grok, has motivated researchers to investigate their adaptation to diverse domains. In real-world image super-resolution (Real-ISR), existing approaches mainly rely on fine-tuning pre-trained diffusion models through Low-Rank Adaptation (LoRA) modules to reconstruct high-resolution (HR) images. However, these dense Real-ISR models are limited in their ability to adaptively capture the heterogeneous characteristics of complex real-world degraded samples or enable knowledge sharing between inputs under equivalent computational budgets. To address this, we investigate the integration of sparse MoE into Real-ISR and propose a Mixture-of-Ranks (MoR) architecture for single-step image super-resolution. We introduce a fine-grained expert partitioning strategy that treats each rank in LoRA as an independent expert. This design enables flexible knowledge recombination while isolating fixed-position ranks as shared experts to preserve common-sense features and minimize routing redundancy. Furthermore, we develop a degradation estimation module leveraging CLIP embeddings and predefined positive-negative text pairs to compute relative degradation scores, dynamically guiding expert activation. To better accommodate varying sample complexities, we incorporate zero-expert slots and propose a degradation-aware load-balancing loss, which dynamically adjusts the number of active experts based on degradation severity, ensuring optimal computational resource allocation. Comprehensive experiments validate our framework's effectiveness and state-of-the-art performance.
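The idea of treating each LoRA rank as an expert can be sketched as a gated low-rank update, delta = B · diag(g) · A · x, where the gate vector g switches individual ranks on or off. The code below is a minimal illustration with plain lists. The top-k routing rule, the `scores` input (which in the paper would come from the degradation-aware router), and the function names are assumptions; zero-expert slots and the load-balancing loss are left out.

```python
def route(scores, shared, k):
    """Build a 0/1 gate per rank: shared ranks are always on, and the k
    highest-scoring remaining ranks are switched on as routed experts."""
    routed = [i for i in range(len(scores)) if i not in shared]
    top = sorted(routed, key=lambda i: scores[i], reverse=True)[:k]
    active = set(shared) | set(top)
    return [1.0 if i in active else 0.0 for i in range(len(scores))]


def lora_delta(A, B, x, gates):
    """Gated low-rank update B @ diag(gates) @ A @ x.

    A has one row per rank (shape r x d_in); B has one column per rank
    (shape d_out x r). A gate of 0 removes that rank's contribution.
    """
    projections = [sum(a * v for a, v in zip(row, x)) for row in A]
    return [
        sum(B[o][r] * gates[r] * projections[r] for r in range(len(A)))
        for o in range(len(B))
    ]
```

With all gates set to 1 this reduces to an ordinary LoRA update, which is why a standard LoRA module can be read as the dense special case of this gated form.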
[9] SurvAgent: Hierarchical CoT-Enhanced Case Banking and Dichotomy-Based Multi-Agent System for Multimodal Survival Prediction
Guolin Huang, Wenting Chen, Jiaqi Yang, Xinheng Lyu, Xiaoling Luo, Sen Yang, Xiaohan Xing, Linlin Shen
🧩 TL;DR
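SurvAgent is the first hierarchical chain-of-thought-enhanced multi-agent system for multimodal survival prediction. By building a case bank and reasoning with multiple expert agents, it outperforms conventional methods, proprietary large language models, and medical agents on TCGA datasets, establishing a new paradigm for explainable AI-driven survival prediction in precision oncology.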
📘 Detailed Summary
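Motivation: Existing survival analysis methods lack the transparency needed for clinical adoption, and current pathology agents have three key limitations for survival prediction: they cannot integrate multimodal data, they explore regions of interest ineffectively, and they do not learn from the experience of historical cases.
Method: SurvAgent has a two-stage architecture. The first stage analyzes pathology images hierarchically through low-magnification screening, cross-modal similarity-aware patch mining, and confidence-aware patch mining, and performs gene-stratified analysis over six functional gene categories, producing structured reports with chain-of-thought reasoning. The second stage retrieves similar cases with retrieval-augmented generation and combines the multimodal reports with expert predictions through progressive interval refinement.
Result: Extensive experiments on five TCGA cohorts show that SurvAgent outperforms conventional methods, proprietary large language models, and medical agents on multimodal survival prediction.
Conclusion: The study establishes a new paradigm for explainable AI-driven survival prediction in precision oncology. Its hierarchical chain-of-thought-enhanced multi-agent system addresses the limitations of existing methods in transparency, multimodal integration, and experiential learning.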
📄 Abstract
Survival analysis is critical for cancer prognosis and treatment planning, yet existing methods lack the transparency essential for clinical adoption. While recent pathology agents have demonstrated explainability in diagnostic tasks, they face three limitations for survival prediction: inability to integrate multimodal data, ineffective region-of-interest exploration, and failure to leverage experiential learning from historical cases. We introduce SurvAgent, the first hierarchical chain-of-thought (CoT)-enhanced multi-agent system for multimodal survival prediction. SurvAgent consists of two stages: (1) WSI-Gene CoT-Enhanced Case Bank Construction employs hierarchical analysis through Low-Magnification Screening, Cross-Modal Similarity-Aware Patch Mining, and Confidence-Aware Patch Mining for pathology images, while Gene-Stratified analysis processes six functional gene categories. Both generate structured reports with CoT reasoning, storing complete analytical processes for experiential learning. (2) Dichotomy-Based Multi-Expert Agent Inference retrieves similar cases via RAG and integrates multimodal reports with expert predictions through progressive interval refinement. Extensive experiments on five TCGA cohorts demonstrate SurvAgent's superiority over conventional methods, proprietary MLLMs, and medical agents, establishing a new paradigm for explainable AI-driven survival prediction in precision oncology.
[10] Crossmodal learning for Crop Canopy Trait Estimation
Timilehin T. Ayanlade, Anirudha Powadi, Talukder Z. Jubery, Baskar Ganapathysubramanian, Soumik Sarkar
🧩 TL;DR
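This study proposes a cross-modal learning strategy that enriches high-resolution satellite imagery with UAV-level visual detail for crop canopy trait estimation. On several downstream tasks, including yield and nitrogen prediction, the generated UAV-like representations consistently outperform real satellite imagery.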
📘 Detailed Summary
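Motivation: In current agricultural monitoring, satellite missions are limited by spatial resolution and cannot meet the needs of modern farming systems focused on micro-plot management, while UAVs perform well but cover limited areas. The study aims to close the gap between satellite and UAV sensing by using cross-modal learning to enrich the visual detail of satellite imagery.
Method: The proposed cross-modal learning strategy uses a dataset of approximately co-registered satellite-UAV image pairs of 84 hybrid maize varieties collected at five distinct locations in the U.S. Corn Belt, and trains a model to learn fine-grained spectral-spatial correspondences between the sensing modalities.
Result: The UAV-like representations generated from satellite inputs consistently outperform real satellite imagery on multiple downstream tasks, with marked gains on yield prediction and nitrogen prediction.
Conclusion: Cross-modal correspondence learning has the potential to bridge satellite and UAV sensing in agricultural monitoring. It offers an effective route to high-resolution agricultural remote sensing, yielding fine-grained crop monitoring data without dense UAV deployment.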
📄 Abstract
Recent advances in plant phenotyping have driven widespread adoption of multi-sensor platforms for collecting crop canopy reflectance data. This includes the collection of heterogeneous data across multiple platforms, with Unmanned Aerial Vehicles (UAVs) seeing significant usage due to their high performance in crop monitoring, forecasting, and prediction tasks. Similarly, satellite missions have been shown to be effective for agriculturally relevant tasks. In contrast to UAVs, such missions are bound to the limitation of spatial resolution, which hinders their effectiveness for modern farming systems focused on micro-plot management. In this work, we propose a cross-modal learning strategy that enriches high-resolution satellite imagery with UAV-level visual detail for crop canopy trait estimation. Using a dataset of approximately co-registered satellite-UAV image pairs collected from replicated plots of 84 hybrid maize varieties across five distinct locations in the U.S. Corn Belt, we train a model that learns fine-grained spectral-spatial correspondences between sensing modalities. Results show that the generated UAV-like representations from satellite inputs consistently outperform real satellite imagery on multiple downstream tasks, including yield and nitrogen prediction, demonstrating the potential of cross-modal correspondence learning to bridge the gap between satellite and UAV sensing in agricultural monitoring.
[11] Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation
Ziyu Guo, Renrui Zhang, Hongyu Li, Manyuan Zhang, Xinyan Chen, Sifan Wang, Yan Feng, Peng Pei, Pheng-Ann Heng
🧩 TL;DR
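This paper proposes Thinking-while-Generating (TwiG), the first framework in which textual reasoning is interleaved with, and co-evolves alongside, the visual generation process. This dynamic interaction yields more context-aware and semantically rich visual outputs.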
📘 Detailed Summary
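Motivation: Existing visual generation methods introduce textual reasoning only before generation, as pre-planning, or after it, as post-refinement. They lack on-the-fly multimodal interaction during generation, which limits the dynamic coordination between generated content and semantic understanding.
Method: TwiG interleaves textual reasoning as visual content is progressively generated, both guiding the local regions about to be generated and reflecting on content already synthesized. Three strategies are studied: zero-shot prompting, supervised fine-tuning on the TwiG-50K dataset, and reinforcement learning with a customized TwiG-GRPO strategy.
Result: The framework achieves dynamic co-evolution of visual generation and textual reasoning and produces more context-aware, semantically rich outputs. Each of the three strategies offers distinct insight into the dynamics of interleaved reasoning.
Conclusion: The work opens a new direction for interleaving textual reasoning to improve visual generation, shows the value of on-the-fly multimodal interaction during generation, and is intended to encourage further exploration.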
📄 Abstract
Recent advances in visual generation have increasingly explored the integration of reasoning capabilities. They incorporate textual reasoning, i.e., thinking, either before (as pre-planning) or after (as post-refinement) the generation process, yet they lack on-the-fly multimodal interaction during the generation itself. In this preliminary study, we introduce Thinking-while-Generating (TwiG), the first interleaved framework that enables co-evolving textual reasoning throughout the visual generation process. As visual content is progressively generated, textual reasoning is interleaved to both guide upcoming local regions and reflect on previously synthesized ones. This dynamic interplay produces more context-aware and semantically rich visual outputs. To unveil the potential of this framework, we investigate three candidate strategies: zero-shot prompting, supervised fine-tuning (SFT) on our curated TwiG-50K dataset, and reinforcement learning (RL) via a customized TwiG-GRPO strategy, each offering unique insights into the dynamics of interleaved reasoning. We hope this work inspires further research into interleaving textual reasoning for enhanced visual generation. Code will be released at: https://github.com/ZiyuGuo99/Thinking-while-Generating.
[12] T2T-VICL: Unlocking the Boundaries of Cross-Task Visual In-Context Learning via Implicit Text-Driven VLMs
Shao-Jun Xia, Huixin Zhang, Zhengzhong Tu
🧩 TL;DR
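This paper proposes the T2T-VICL framework, which uses text prompt generation and perceptual score-based reasoning to make vision-language models effective at cross-task visual in-context learning for the first time, moving beyond the task boundaries of conventional visual in-context learning.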
📘 Detailed Summary
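Motivation: Visual in-context learning is currently confined mostly to same-task settings. Whether a unified vision-language model can still perform effective visual in-context learning when the visual prompt and the target image come from different vision tasks is a key open question.
Method: The paper proposes T2T-VICL, a fully collaborative pipeline. It designs a mechanism for generating and selecting text prompts that implicitly describe the differences between distinct low-level vision tasks, builds the first cross-task VICL dataset, and develops a new inference framework that combines perceptual score-based reasoning with traditional evaluation metrics.
Result: The method achieves top-tier results in nine cross-task scenarios and second-tier performance in ten more, substantially extending what vision-language models can do in cross-task visual in-context learning.
Conclusion: The study demonstrates the potential of vision-language models for cross-task visual in-context learning, offers a new paradigm for handling diverse vision tasks with a unified model, and moves multimodal learning in a more general direction.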
📄 Abstract
In large language models (LLMs), in-context learning (ICL) refers to performing new tasks by conditioning on small demonstrations provided in the input context. Recent advances in visual in-context learning (VICL) demonstrate promising capabilities for solving downstream tasks with unified vision-language models (VLMs). When the visual prompt and the target images originate from different visual tasks, can VLMs still enable VICL? In this paper, we propose a fully collaborative pipeline, i.e., T2T-VICL, for VLMs to investigate the potential of cross-task VICL. Fundamentally, we design a mechanism to generate and select text prompts that best implicitly describe the differences between two distinct low-level vision tasks, and construct the first cross-task VICL dataset. Building upon this, we propose a novel inference framework that combines perceptual score-based reasoning with traditional evaluation metrics to perform cross-task VICL. Our approach achieves top-tier results across nine cross-task scenarios and second-tier performance in ten additional scenarios, unlocking the boundaries of cross-task VICL within VLMs.
[13] How Noise Benefits AI-generated Image Detection
Jiazhen Yan, Ziqiang Li, Fan Wang, Kai Zeng, Zhangjie Fu
🧩 TL;DR
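This paper proposes PiN-CLIP, which jointly trains a noise generator and a detection network and injects positive-incentive noise into the feature space to suppress shortcut-sensitive directions and amplify stable forensic cues. It achieves state-of-the-art AI-generated image detection on an open-world dataset synthesized by 42 generative models.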
📘 Detailed Summary
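Motivation: Rapid progress in generative models has made real and synthetic images increasingly hard to tell apart. Despite extensive work on detecting AI-generated images, out-of-distribution generalization remains a persistent challenge. The authors attribute this weakness to spurious shortcuts exploited during training and observe that small feature-space perturbations can mitigate shortcut dominance.
Method: Positive-Incentive Noise for CLIP (PiN-CLIP) jointly trains a noise generator and a detection network under a variational positive-incentive principle. Positive-incentive noise is constructed in the feature space through cross-attention fusion of visual and categorical semantic features. During optimization, the noise is injected into the feature space to fine-tune the visual encoder, suppressing shortcut-sensitive directions while amplifying stable forensic cues.
Result: In comparative experiments on an open-world dataset of synthetic images from 42 distinct generative models, the method sets a new state of the art, with a notable improvement of 5.4 percentage points in average accuracy over existing approaches.
Conclusion: Feature-space perturbation effectively mitigates shortcut learning in AI-generated image detection and extracts more robust, generalizable artifact representations. It offers a controllable answer to the out-of-distribution generalization challenge and shows that positive-incentive noise can improve detector generalization.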
📄 Abstract
The rapid advancement of generative models has made real and synthetic images increasingly indistinguishable. Although extensive efforts have been devoted to detecting AI-generated images, out-of-distribution generalization remains a persistent challenge. We trace this weakness to spurious shortcuts exploited during training and we also observe that small feature-space perturbations can mitigate shortcut dominance. To address this problem in a more controllable manner, we propose the Positive-Incentive Noise for CLIP (PiN-CLIP), which jointly trains a noise generator and a detection network under a variational positive-incentive principle. Specifically, we construct positive-incentive noise in the feature space via cross-attention fusion of visual and categorical semantic features. During optimization, the noise is injected into the feature space to fine-tune the visual encoder, suppressing shortcut-sensitive directions while amplifying stable forensic cues, thereby enabling the extraction of more robust and generalized artifact representations. Comparative experiments are conducted on an open-world dataset comprising synthetic images generated by 42 distinct generative models. Our method achieves new state-of-the-art performance, with notable improvements of 5.4 in average accuracy over existing approaches.
[14] Reasoning Guided Embeddings: Leveraging MLLM Reasoning for Improved Multimodal Retrieval
Chunxu Liu, Jiyuan Yang, Ruopeng Gao, Yuhan Zhu, Feng Zhu, Rui Zhao, Limin Wang
🧩 TL;DR
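This paper proposes Reasoning Guided Embeddings, which explicitly integrates the generative reasoning ability of multimodal large language models (MLLMs) into embedding extraction, using structured rationale generation and contrastive training to improve multimodal representation quality.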
📘 Detailed Summary
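Motivation: Existing methods treat embedding extraction from MLLMs as a direct encoding step and overlook the potential of their generative reasoning ability to improve representation quality. The paper explores how to incorporate explicit reasoning into the embedding process.
Method: Reasoning Guided Embeddings first has the model generate a structured rationale conditioned on the instruction, then extracts the representation after the reasoning has unfolded. The generative reasoning process is coupled with contrastive training to strengthen the context-conditional inference signals in the embedding.
Result: On the MMEB benchmark, reasoning-guided conditioning improves multimodal retrieval performance by 4.9% over the non-reasoning baseline, confirming that explicit reasoning can enhance embedding quality.
Conclusion: The generative reasoning ability of MLLMs can be used explicitly to improve embeddings. Integrating the reasoning process is a new direction for multimodal representation learning and for better downstream performance.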
📄 Abstract
Multimodal embeddings are widely used in downstream tasks such as multimodal retrieval, enabling alignment of interleaved modalities in a shared representation space. While recent studies show that Multimodal Large Language Models (MLLMs) can serve as strong embedding extractors, existing approaches treat embedding extraction as a direct encoding step, overlooking the fact that MLLMs possess the generative capability for reasoning that could be leveraged to enhance representation quality. In this work, we explore how to explicitly incorporate reasoning into the embedding process. To this end, we propose Reasoning Guided Embeddings (RGE), which preserves the generative rationale process of MLLMs and couples it with contrastive training. Our method first enables the model to perform structured rationale generation conditioned on the instruction, and then extracts representations after reasoning has unfolded. This simple design enhances the context-conditional inference signals within the embedding, leading to improved multimodal representation quality. Experiments on the MMEB benchmark show that reasoning-guided conditioning improves multimodal retrieval performance by 4.9% over the non-reasoning baseline, confirming that explicit reasoning can effectively enhance embedding quality.
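The abstract does not state which contrastive objective RGE uses. A common choice for retrieval embeddings is the InfoNCE loss, sketched below as an illustration only. Here `query` stands for an embedding extracted after the rationale has been generated, and `candidates` holds one positive target plus negatives; the temperature value and all names are assumptions.

```python
import math


def info_nce(query, candidates, positive_index, temperature=0.05):
    """InfoNCE loss: negative log-softmax of the positive candidate.

    Similarities are cosine similarities divided by a temperature. The loss
    is small when the query is closer to the positive than to the negatives.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def normalize(v):
        norm = math.sqrt(dot(v, v))
        return [x / norm for x in v]

    q = normalize(query)
    logits = [dot(q, normalize(c)) / temperature for c in candidates]
    peak = max(logits)  # log-sum-exp with max subtraction for stability
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[positive_index]
```

Because the loss depends only on the final embedding, the same objective can be applied whether or not a rationale precedes the extraction step, which is what makes a reasoning versus non-reasoning comparison straightforward.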
[15] Pluggable Pruning with Contiguous Layer Distillation for Diffusion Transformers
Jian Ma, Qirong Peng, Xujie Zhu, Peixing Xie, Chen Chen, Haonan Lu
🧩 TL;DR
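This paper proposes PPCL, a pluggable structured pruning method designed for Diffusion Transformers. Using contiguous layer distillation, it cuts parameters by 50% while preserving image generation quality, making it suitable for deployment in resource-constrained environments.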
📘 Detailed Summary
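Motivation: Diffusion Transformers perform very well at image generation, but their large parameter counts make them computationally expensive and hinder deployment in resource-constrained environments. An efficient model compression method is needed to remove this bottleneck.
Method: The Pluggable Pruning with Contiguous Layer Distillation framework first identifies redundant layer intervals using linear probing and first-order differential trend analysis, then applies a plug-and-play teacher-student alternating distillation scheme that integrates depth-wise and width-wise pruning within a single training phase.
Result: Experiments on Multi-Modal Diffusion Transformer architectures show that PPCL reduces parameters by 50% relative to the full model with less than 3% degradation in key objective metrics, reaching higher compression ratios while maintaining high-quality image generation.
Conclusion: The method gives Diffusion Transformers a flexible and efficient compression scheme that supports multiple pruning configurations without per-configuration retraining, making deployment in resource-constrained environments considerably more feasible.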
📄 Abstract
Diffusion Transformers (DiTs) have shown exceptional performance in image generation, yet their large parameter counts incur high computational costs, impeding deployment in resource-constrained settings. To address this, we propose Pluggable Pruning with Contiguous Layer Distillation (PPCL), a flexible structured pruning framework specifically designed for DiT architectures. First, we identify redundant layer intervals through a linear probing mechanism combined with the first-order differential trend analysis of similarity metrics. Subsequently, we propose a plug-and-play teacher-student alternating distillation scheme tailored to integrate depth-wise and width-wise pruning within a single training phase. This distillation framework enables flexible knowledge transfer across diverse pruning ratios, eliminating the need for per-configuration retraining. Extensive experiments on multiple Multi-Modal Diffusion Transformer architecture models demonstrate that PPCL achieves a 50% reduction in parameter count compared to the full model, with less than 3% degradation in key objective metrics. Notably, our method maintains high-quality image generation capabilities while achieving higher compression ratios, rendering it well-suited for resource-constrained environments. The open-source code and checkpoints for PPCL can be found at the following link: https://github.com/OPPO-Mente-Lab/Qwen-Image-Pruning.
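The abstract describes locating redundant layer intervals from the first-order differential trend of similarity metrics, without giving the exact rule. One simple reading, sketched below purely as an assumption, is to take a per-layer similarity score (for example, from a linear probe), compute differences between consecutive layers, and report contiguous runs where the score barely changes. The threshold and function name are hypothetical.

```python
def flat_intervals(similarity, threshold):
    """Return (first_layer, last_layer) index pairs, both inclusive, of
    contiguous runs in which the similarity changes by less than
    `threshold` between consecutive layers.

    diffs[i] compares layer i with layer i + 1, so a run of small diffs
    from index s to index e covers layers s through e + 1.
    """
    diffs = [b - a for a, b in zip(similarity, similarity[1:])]
    intervals = []
    start = None
    for i, d in enumerate(diffs):
        if abs(d) < threshold:
            if start is None:
                start = i            # a flat run begins at layer i
        elif start is not None:
            intervals.append((start, i))  # run ended at layer i
            start = None
    if start is not None:
        intervals.append((start, len(diffs)))  # run reaches the last layer
    return intervals
```

Layers inside such an interval contribute little change to the probed representation, which is the intuition behind treating a contiguous block of them as a candidate for removal or distillation into fewer layers.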
[16] Video2Layout: Recall and Reconstruct Metric-Grounded Cognitive Map for Spatial Reasoning
Yibin Huang, Wang Xu, Wanyue Zhang, Helu Zhi, Jingjing Huang, Yangbin Xu, Yangang Sun, Conghui Zhu, Tiejun Zhao
🧩 TL;DR
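This paper proposes Video2Layout, a framework that reconstructs metric-grounded spatial layouts from video. It uses continuous object boundary coordinates to quantify physical distances between objects and object sizes, overcoming the limits of grid-based cognitive map methods in fine-grained spatial reasoning.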
📘 Detailed Summary
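Motivation: Existing grid-based cognitive map methods rely on discretized raster representations, which limit fine-grained spatial reasoning. They cannot precisely quantify physical distances between objects or object sizes, so descriptions of spatial relationships remain inherently ambiguous.
Method: Video2Layout uses continuous object boundary coordinates to quantify inter-object physical distances and object sizes. It has two core stages: a supervised fine-tuning stage that builds a high-quality dataset from the AI2THOR simulator to learn the mapping from visual inputs to precise boundary coordinates, and a reinforcement fine-tuning stage that further improves real-world generalization.
Result: On QVS-Bench and mainstream spatial reasoning benchmarks, the V2LO-7B model improves by an average of 4.92% over the model trained on grid maps, supporting the proposed method. The study also systematically analyzes the relationship between cognitive map accuracy and the number of input images.
Conclusion: Continuous boundary coordinate representations markedly improve the spatial reasoning of multimodal large language models and reduce the ambiguity of describing spatial relationships in natural language, offering an effective route to more precise spatial cognition systems. The proposed QVS-Bench serves as a diagnostic benchmark for analyzing the underlying mechanisms.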
📄 Abstract
Spatial intelligence is a critical frontier for Multimodal Large Language Models (MLLMs), empowering them to comprehend the physical world. Drawing inspiration from human perception mechanisms, existing studies attempt to construct a coherent spatial understanding via grid-based cognitive maps from multi-frame visual inputs. However, current grid-based map methods rely on discretized raster representations, which limit the model's ability in fine-grained spatial reasoning. To overcome this limitation, we propose Video2Layout, a framework for reconstructing metric-grounded spatial layouts from video. The framework employs continuous object boundary coordinates to quantify inter-object physical distances and object size. This empowers the model with quantitative spatial computation capabilities, effectively alleviating the inherent ambiguity when describing spatial relationships in natural language. Specifically, our method comprises two core stages. First, in the supervised fine-tuning stage, we construct a high-quality dataset from the AI2THOR simulator, which enables the model to learn the mapping from visual inputs to precise boundary coordinates. Subsequently, a reinforcement fine-tuning stage further enhances the model's real-world generalization capabilities. To systematically evaluate the correlation between cognitive map accuracy and image quantity, as well as how the quantity of image inputs affects spatial reasoning accuracy, we introduce QVS-Bench, a diagnostic benchmark designed to analyze the relevant mechanisms. Evaluated on QVS-Bench and mainstream spatial reasoning benchmarks, our model, V2LO-7B, achieves an average improvement of 4.92% over the model trained on grid maps, validating the superiority of our method. Our code is available at https://github.com/ybrrraway/Video2Layout.
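To show why continuous boundary coordinates permit quantitative spatial computation, the sketch below derives object size and inter-object distance from axis-aligned boundaries. It assumes a top-down layout in metres with each object given as `(x_min, y_min, x_max, y_max)`; this representation and the function names are illustrative assumptions, not the output format of Video2Layout.

```python
import math


def size(box):
    """Width and depth of an object from its boundary coordinates."""
    x_min, y_min, x_max, y_max = box
    return (x_max - x_min, y_max - y_min)


def gap_distance(a, b):
    """Shortest distance between the boundaries of two axis-aligned boxes.

    The per-axis gap is zero when the boxes overlap along that axis, so
    touching or overlapping objects get a distance of 0.
    """
    dx = max(0.0, b[0] - a[2], a[0] - b[2])
    dy = max(0.0, b[1] - a[3], a[1] - b[3])
    return math.hypot(dx, dy)
```

A grid map can only say that two objects occupy cells some number of steps apart, whereas coordinates like these give the distance directly, which is the kind of computation a question such as "how far is the sofa from the table?" requires.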
[17] An Image Is Worth Ten Thousand Words: Verbose-Text Induction Attacks on VLMs
Zhi Luo, Zenghui Yuan, Wenqi Wei, Daizong Liu, Pan Zhou
🧩 TL;DR
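This paper proposes a novel verbose-text induction attack that uses a two-stage framework to inject imperceptible adversarial perturbations into benign images, substantially increasing the number of output tokens a vision-language model generates and thereby providing an effective way to assess deployment-efficiency problems.
📘 Detailed Summary
Motivation: With the remarkable success of vision-language models (VLMs) on multimodal tasks, their deployment efficiency has become an increasingly prominent concern; in particular, the number of tokens consumed during generation has become a key evaluation metric. Existing methods only prolong the output implicitly by delaying the EOS token and lack an explicit optimization objective that directly maximizes output token length, which leaves them short on stability and controllability.
Method: A two-stage framework is proposed. The first stage, adversarial prompt search, uses reinforcement learning to automatically identify adversarial prompts that induce the LLM component of a VLM to produce verbose outputs. The second stage, vision-aligned perturbation optimization, crafts adversarial examples on input images by maximizing the similarity between the visual embeddings of the perturbed image and the adversarial prompt, yielding malicious images that trigger verbose text generation.
Result: Comprehensive experiments on four popular VLMs show that the method has clear advantages in effectiveness, efficiency, and generalization, and that it substantially increases the number of output tokens, confirming that the attack is a practical threat.
Conclusion: The study reveals how vulnerable VLMs are to carefully designed adversarial attacks, provides a reference point for evaluating model security and deployment efficiency, underscores the need for thorough security testing before deployment, and points toward the design of future defenses.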
📄 Abstract
With the remarkable success of Vision-Language Models (VLMs) on multimodal tasks, concerns regarding their deployment efficiency have become increasingly prominent. In particular, the number of tokens consumed during the generation process has emerged as a key evaluation metric. Prior studies have shown that specific inputs can induce VLMs to generate lengthy outputs with low information density, which significantly increases energy consumption, latency, and token costs. However, existing methods simply delay the occurrence of the EOS token to implicitly prolong output, and fail to directly maximize the output token length as an explicit optimization objective, lacking stability and controllability. To address these limitations, this paper proposes a novel verbose-text induction attack (VTIA) to inject imperceptible adversarial perturbations into benign images via a two-stage framework, which identifies the most malicious prompt embeddings for optimizing and maximizing the output token count of the perturbed images. Specifically, we first perform adversarial prompt search, employing reinforcement learning strategies to automatically identify adversarial prompts capable of inducing the LLM component within VLMs to produce verbose outputs. We then conduct vision-aligned perturbation optimization to craft adversarial examples on input images, maximizing the similarity between the perturbed image's visual embeddings and those of the adversarial prompt, thereby constructing malicious images that trigger verbose text generation. Comprehensive experiments on four popular VLMs demonstrate that our method achieves significant advantages in terms of effectiveness, efficiency, and generalization capability.
[18] EvoVLA: Self-Evolving Vision-Language-Action Model
Zeting Liu, Zida Yang, Zeyu Zhang, Hao Tang
🧩 TL;DR
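EvoVLA is a self-supervised vision-language-action (VLA) framework that addresses stage hallucination in multi-stage robotic manipulation through three components: a stage-aligned reward, pose-based object exploration, and long-horizon memory. It substantially improves task success rate and sample efficiency.
📘 Detailed Summary
Motivation: Current VLA models suffer from stage hallucination in long-horizon robotic manipulation: agents exploit coarse evaluation signals to shortcut multi-step tasks, reporting high progress without truly completing them. This limits how well the models work on complex manipulation tasks in practice.
Method: EvoVLA has three key components. Stage-Aligned Reward uses triplet contrastive learning with Gemini-generated hard negatives to prevent visual shortcuts. Pose-Based Object Exploration grounds curiosity in the relative object-gripper pose rather than raw pixels. Long-Horizon Memory uses selective context retention and gated fusion to stabilize intrinsic shaping during extended rollouts.
Result: On the Discoverse-L long-horizon manipulation benchmark, EvoVLA raises average task success by 10.2 percentage points to 69.2%, achieves 1.5x better sample efficiency, and reduces stage hallucination from 38.5% to 14.8%. In real-robot deployment it reaches an average success rate of 54.6%, 11 percentage points above OpenVLA-OFT.
Conclusion: Through self-supervised learning, EvoVLA effectively addresses stage hallucination in VLA models and achieves effective sim-to-real transfer with strong generalization, offering a reliable approach to long-horizon robotic manipulation and advancing the practical use of VLA models.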
📄 Abstract
Long-horizon robotic manipulation remains challenging for Vision-Language-Action (VLA) models despite recent progress in zero-shot generalization and simulation-to-real-world transfer. Current VLA models suffer from stage hallucination, where agents exploit coarse evaluation signals to shortcut multi-step tasks, reporting high progress without truly completing them. We present EvoVLA, a self-supervised VLA framework that addresses this issue through three complementary components: Stage-Aligned Reward (SAR), which uses triplet contrastive learning with Gemini-generated hard negatives to prevent visual shortcuts; Pose-Based Object Exploration (POE), which grounds curiosity in relative object-gripper pose instead of raw pixels; and Long-Horizon Memory, which uses selective context retention and gated fusion to stabilize intrinsic shaping during extended rollouts. Extensive evaluations on Discoverse-L, a long-horizon manipulation benchmark with three multi-stage tasks, show that EvoVLA improves average task success by 10.2 percentage points over the strongest baseline (OpenVLA-OFT), reaching 69.2 percent. EvoVLA also achieves one-and-a-half times better sample efficiency and reduces stage hallucination from 38.5 percent to 14.8 percent. Real-world deployment on physical robots reaches an average success rate of 54.6 percent across four manipulation tasks, outperforming OpenVLA-OFT by 11 points, demonstrating effective sim-to-real transfer and strong generalization. Code: https://github.com/AIGeeksGroup/EvoVLA. Website: https://aigeeksgroup.github.io/EvoVLA.
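As a rough illustration of the triplet contrastive idea behind the Stage-Aligned Reward, the sketch below implements a standard triplet margin loss on plain Python vectors. The abstract does not give EvoVLA's exact loss, so the Euclidean distance, the margin values, and the toy embeddings are assumptions chosen only to show how a hard negative produces a non-zero penalty.

```python
import math

def l2_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet margin loss: the positive should be closer to the
    anchor than the negative by at least `margin`."""
    return max(0.0, l2_distance(anchor, positive)
               - l2_distance(anchor, negative) + margin)

# Toy embeddings (made up): the anchor is an observation of a stage, the
# positive matches that stage, the negatives describe a wrong stage.
anchor = [0.0, 0.0]
positive = [0.0, 0.0]
easy_negative = [3.0, 4.0]   # far from the anchor, already satisfies the margin
hard_negative = [0.0, 0.5]   # near miss, violates the margin

print(triplet_loss(anchor, positive, easy_negative, margin=1.0))
print(triplet_loss(anchor, positive, hard_negative, margin=1.0))
```

Easy negatives contribute no loss once they are far enough away, which is why hard negatives that look almost like the correct stage carry most of the training signal.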
[19] Target Refocusing via Attention Redistribution for Open-Vocabulary Semantic Segmentation: An Explainability Perspective
Jiahao Li, Yang Lu, Yachao Zhang, Yong Xie, Fangyong Wang, Yuan Xie, Yanyun Qu
🧩 TL;DR
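This paper proposes RF-CLIP, which improves CLIP's dense prediction ability for open-vocabulary semantic segmentation by emulating human distraction-refocusing behavior. The method achieves state-of-the-art performance on eight benchmarks while maintaining high inference efficiency.
📘 Detailed Summary
Motivation: Existing open-vocabulary semantic segmentation methods obtain good results by leveraging CLIP's vision-language alignment, but they rarely examine the performance boundaries of CLIP for dense prediction from the perspective of interpretability mechanisms. The study finds that CLIP exhibits a phenomenon analogous to human distraction, diverting substantial attention from target regions to irrelevant tokens.
Method: ReFocusing CLIP (RF-CLIP) is a training-free approach that identifies and filters the distraction tokens produced by dimension-specific over-activation and redirects attention back to target regions, refining the granularity of CLIP's multimodal alignment. The approach emulates the human distraction-refocusing mechanism.
Result: RF-CLIP achieves state-of-the-art performance on all eight benchmarks while maintaining high inference efficiency. The analysis shows that filtering distraction tokens improves CLIP's dense prediction performance.
Conclusion: The study reveals the distraction phenomenon in CLIP's internal mechanisms and proposes an effective attention-refocusing strategy. The method improves open-vocabulary semantic segmentation and offers a new perspective on how vision-language models work internally, with both theoretical and practical relevance.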
📄 Abstract
Open-vocabulary semantic segmentation (OVSS) employs pixel-level vision-language alignment to associate category-related prompts with corresponding pixels. A key challenge is enhancing the multimodal dense prediction capability, specifically this pixel-level multimodal alignment. Although existing methods achieve promising results by leveraging CLIP's vision-language alignment, they rarely investigate the performance boundaries of CLIP for dense prediction from an interpretability mechanisms perspective. In this work, we systematically investigate CLIP's internal mechanisms and identify a critical phenomenon: analogous to human distraction, CLIP diverts significant attention resources from target regions to irrelevant tokens. Our analysis reveals that these tokens arise from dimension-specific over-activation; filtering them enhances CLIP's dense prediction performance. Consequently, we propose ReFocusing CLIP (RF-CLIP), a training-free approach that emulates human distraction-refocusing behavior to redirect attention from distraction tokens back to target regions, thereby refining CLIP's multimodal alignment granularity. Our method achieves SOTA performance on eight benchmarks while maintaining high inference efficiency.
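The refocusing step described above can be sketched in a few lines. This is a simplified illustration, not RF-CLIP itself: the rule used to flag distraction tokens (any feature dimension whose magnitude exceeds a fixed `threshold`) and the renormalization of a single attention row are assumptions made for the example, and all values are made up.

```python
def find_distraction_tokens(features, threshold):
    """Flag tokens that have at least one over-activated feature dimension.
    The fixed threshold is a hypothetical detection rule."""
    return [i for i, vec in enumerate(features)
            if max(abs(x) for x in vec) > threshold]

def refocus(attn_row, distraction):
    """Zero the attention paid to distraction tokens and renormalize the rest."""
    blocked = set(distraction)
    masked = [0.0 if i in blocked else w for i, w in enumerate(attn_row)]
    total = sum(masked)
    if total == 0.0:
        return masked  # nothing left to renormalize
    return [w / total for w in masked]

# Toy token features: token 1 has one extreme dimension.
features = [[0.1, 0.2], [9.0, 0.1], [0.3, 0.2]]
attn_row = [0.2, 0.6, 0.2]  # token 1 absorbs most of the attention

distraction = find_distraction_tokens(features, threshold=5.0)
print(distraction)
print(refocus(attn_row, distraction))
```

Because the step only masks and rescales existing attention weights, it needs no gradient updates, which is consistent with the training-free framing in the abstract.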
[20] Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight
Yi Yang, Xueqi Li, Yiyang Chen, Jin Song, Yihan Wang, Zipeng Xiao, Jiadi Su, You Qiaoben, Pengfei Liu, Zhijie Deng
🧩 TL;DR
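This paper proposes Mantis, a framework that decouples visual foresight prediction from the backbone by combining meta queries with a diffusion Transformer head. It learns the latent actions in visual trajectories while preserving language understanding, substantially improving the performance of vision-language-action models.
📘 Detailed Summary
Motivation: Having existing VLA models directly predict high-dimensional visual states spreads model capacity thin and incurs high training cost, while compressing visual states into compact supervisory signals creates an information bottleneck. These methods also tend to lack comprehension and reasoning ability because they neglect language supervision.
Method: Mantis uses Disentangled Visual Foresight: visual foresight prediction is separated from the backbone and handled by meta queries together with a diffusion Transformer (DiT) head. The current visual state is fed to the DiT through a residual connection, and a simple next-state prediction objective lets the meta queries automatically capture the latent actions that describe the visual trajectory, which in turn aids the learning of explicit actions.
Result: After fine-tuning, Mantis reaches a 96.7% success rate on the LIBERO benchmark, surpassing strong baselines and converging quickly. Real-world evaluations show that Mantis outperforms π0.5, a leading open-source VLA model, in instruction following, generalization to unseen instructions, and reasoning.
Conclusion: Disentangled visual foresight lightens the load on the VLA backbone so that it can retain comprehension and reasoning through language supervision. Pretrained on human manipulation videos, robot demonstrations, and image-text pairs, the method shows strong performance and generalization and suggests a new direction for VLA models.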
📄 Abstract
Recent advances in Vision-Language-Action (VLA) models demonstrate that visual signals can effectively complement sparse action supervision. However, letting a VLA directly predict high-dimensional visual states can spread model capacity thin and incur prohibitive training cost, while compressing visual states into more compact supervisory signals inevitably incurs information bottlenecks. Moreover, existing methods often suffer from poor comprehension and reasoning capabilities due to the neglect of language supervision. This paper introduces Mantis, a novel framework featuring Disentangled Visual Foresight (DVF) to tackle these issues. Specifically, Mantis decouples visual foresight prediction from the backbone with the combination of meta queries and a diffusion Transformer (DiT) head. With the current visual state provided to the DiT via a residual connection, a simple next-state prediction objective enables the meta queries to automatically capture the latent actions that delineate the visual trajectory, and hence boost the learning of explicit actions. The disentanglement reduces the burden of the VLA backbone, enabling it to maintain comprehension and reasoning capabilities through language supervision. Empirically, pretrained on human manipulation videos, robot demonstrations, and image-text pairs, Mantis achieves a 96.7% success rate on the LIBERO benchmark after fine-tuning, surpassing powerful baselines while exhibiting high convergence speed. Real-world evaluations show that Mantis outperforms $π_{0.5}$, a leading open-source VLA model, particularly in instruction-following capability, generalization to unseen instructions, and reasoning ability. Code and weights are released to support the open-source community.
[21] Domain-Shared Learning and Gradual Alignment for Unsupervised Domain Adaptation Visible-Infrared Person Re-Identification
Nianchang Huang, Yi Xu, Ruida Xi, Ruida Xi, Qiang Zhang
🧩 TL;DR
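This paper proposes DSLGA, a novel two-stage model for unsupervised domain adaptation visible-infrared person re-identification. A domain-shared learning strategy and a gradual alignment strategy handle cross-domain and cross-modality discrepancies, and the model significantly outperforms existing methods under multiple settings.
📘 Detailed Summary
Motivation: Existing visible-infrared person re-identification (VI-ReID) algorithms perform well on public datasets but struggle in real applications because public data differ from real-world data. The paper studies unsupervised domain adaptation VI-ReID, aiming to transfer knowledge learned from public data to real-world data without sacrificing accuracy and without annotating new samples.
Method: DSLGA is a two-stage model. In the first stage, pre-training, a domain-shared learning strategy exploits information shared between the source and target domains to mitigate the ineffective pre-training caused by inter-domain modality discrepancies. In the second stage, fine-tuning, a gradual alignment strategy uses cluster-to-holistic alignment to handle the cross-modality alignment challenges between visible and infrared data caused by large intra-domain modality discrepancies.
Result: Extensive experiments show that the method significantly outperforms existing domain adaptation methods for VI-ReID under various settings, and even some supervised methods. The authors also construct CMDA-XD, a new UDA-VI-ReID testing method for training and testing different UDA-VI-ReID models.
Conclusion: The study offers an effective approach to domain adaptation for VI-ReID in practice. Its two stages address the cross-domain and cross-modality challenges separately, showing that high-performance VI-ReID is feasible in an unsupervised setting and providing a useful reference for real deployment.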
📄 Abstract
Recently, Visible-Infrared person Re-Identification (VI-ReID) has achieved remarkable performance on public datasets. However, due to the discrepancies between public datasets and real-world data, most existing VI-ReID algorithms struggle in real-life applications. To address this, we take the initiative to investigate Unsupervised Domain Adaptation Visible-Infrared person Re-Identification (UDA-VI-ReID), aiming to transfer the knowledge learned from the public data to real-world data without compromising accuracy or requiring the annotation of new samples. Specifically, we first analyze two basic challenges in UDA-VI-ReID, i.e., inter-domain modality discrepancies and intra-domain modality discrepancies. Then, we design a novel two-stage model, i.e., Domain-Shared Learning and Gradual Alignment (DSLGA), to handle these discrepancies. In the first pre-training stage, DSLGA introduces a Domain-Shared Learning Strategy (DSLS) to mitigate ineffective pre-training caused by inter-domain modality discrepancies via exploiting shared information between the source and target domains. In the second fine-tuning stage, DSLGA designs a Gradual Alignment Strategy (GAS) to handle the cross-modality alignment challenges between visible and infrared data caused by the large intra-domain modality discrepancies through cluster-to-holistic alignment. Finally, a new UDA-VI-ReID testing method, i.e., CMDA-XD, is constructed for training and testing different UDA-VI-ReID models. Extensive experiments demonstrate that our method significantly outperforms existing domain adaptation methods for VI-ReID and even some supervised methods under various settings.
[22] When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models
Yuping Yan, Yuhan Xie, Yinxin Zhang, Lingjuan Lyu, Yaochu Jin
🧩 TL;DR
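This paper presents VLA-Fool, a comprehensive study of the multimodal adversarial robustness of embodied vision-language-action models under both white-box and black-box settings. It shows that even minor multimodal perturbations cause significant behavioral deviations, demonstrating the fragility of embodied multimodal alignment.
📘 Detailed Summary
Motivation: Although vision-language-action (VLA) models have made remarkable progress in embodied environments, their adversarial robustness remains largely unexplored under realistic multimodal and black-box conditions. Existing studies focus mainly on single-modality perturbations and overlook the cross-modal misalignment that fundamentally affects embodied reasoning and decision-making.
Method: VLA-Fool unifies three levels of multimodal adversarial attacks: textual perturbations through gradient-based and prompt-based manipulations, visual perturbations via patch and noise distortions, and cross-modal misalignment attacks that intentionally disrupt the semantic correspondence between perception and instruction. The study also incorporates a VLA-aware semantic space into linguistic prompts, developing the first automatically crafted and semantically guided prompting framework.
Result: Experiments on the LIBERO benchmark with a fine-tuned OpenVLA model show that even minor multimodal perturbations cause significant behavioral deviations, demonstrating the fragility of embodied multimodal alignment.
Conclusion: The study exposes how vulnerable embodied VLA models are to multimodal adversarial attacks, stresses the importance of cross-modal robustness in real deployments, and offers insights for designing safer embodied AI systems.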
📄 Abstract
Vision-Language-Action models (VLAs) have recently demonstrated remarkable progress in embodied environments, enabling robots to perceive, reason, and act through unified multimodal understanding. Despite their impressive capabilities, the adversarial robustness of these systems remains largely unexplored, especially under realistic multimodal and black-box conditions. Existing studies mainly focus on single-modality perturbations and overlook the cross-modal misalignment that fundamentally affects embodied reasoning and decision-making. In this paper, we introduce VLA-Fool, a comprehensive study of multimodal adversarial robustness in embodied VLA models under both white-box and black-box settings. VLA-Fool unifies three levels of multimodal adversarial attacks: (1) textual perturbations through gradient-based and prompt-based manipulations, (2) visual perturbations via patch and noise distortions, and (3) cross-modal misalignment attacks that intentionally disrupt the semantic correspondence between perception and instruction. We further incorporate a VLA-aware semantic space into linguistic prompts, developing the first automatically crafted and semantically guided prompting framework. Experiments on the LIBERO benchmark using a fine-tuned OpenVLA model reveal that even minor multimodal perturbations can cause significant behavioral deviations, demonstrating the fragility of embodied multimodal alignment.
[23] SwiTrack: Tri-State Switch for Cross-Modal Object Tracking
Boyue Xu, Ruichao Hou, Tongwei Ren, Dongming Zhou, Gangshan Wu, Jinde Cao
🧩 TL;DR
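This paper proposes SwiTrack, a novel state-switching framework that redefines cross-modal object tracking by deploying three specialized streams. It achieves state-of-the-art performance in RGB-NIR tracking while running in real time.
📘 Detailed Summary
Motivation: Existing cross-modal object tracking methods typically connect parallel RGB and NIR branches to a shared backbone. This limits the comprehensive extraction of distinctive modality-specific features and fails to address object drift, especially when inputs are unreliable.
Method: SwiTrack uses a three-stream architecture. RGB frames are processed by the visual encoder. NIR frames are refined by a NIR gated adapter coupled with the visual encoder to calibrate shared latent-space features. For invalid modalities, a consistency trajectory prediction module uses spatio-temporal cues to estimate target motion. Dynamic template reconstruction and a similarity alignment loss further reinforce feature consistency.
Result: On the latest benchmarks, the tracker achieves state-of-the-art performance, improving precision rate and success rate by 7.2% and 4.3% respectively, while tracking in real time at 65 frames per second.
Conclusion: The study shows that a dedicated state-switching framework with modality-specific feature extraction can mitigate object drift in cross-modal tracking, offers a new approach to scenarios with dynamic modality switching, and demonstrates that tracking accuracy can be improved substantially while keeping real-time speed.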
📄 Abstract
Cross-modal object tracking (CMOT) is an emerging task that maintains target consistency while the video stream switches between different modalities, with only one modality available in each frame, mostly focusing on RGB-Near Infrared (RGB-NIR) tracking. Existing methods typically connect parallel RGB and NIR branches to a shared backbone, which limits the comprehensive extraction of distinctive modality-specific features and fails to address the issue of object drift, especially in the presence of unreliable inputs. In this paper, we propose SwiTrack, a novel state-switching framework that redefines CMOT through the deployment of three specialized streams. Specifically, RGB frames are processed by the visual encoder, while NIR frames undergo refinement via a NIR gated adapter coupled with the visual encoder to progressively calibrate shared latent space features, thereby yielding more robust cross-modal representations. For invalid modalities, a consistency trajectory prediction module leverages spatio-temporal cues to estimate target movement, ensuring robust tracking and mitigating drift. Additionally, we incorporate dynamic template reconstruction to iteratively update template features and employ a similarity alignment loss to reinforce feature consistency. Experimental results on the latest benchmarks demonstrate that our tracker achieves state-of-the-art performance, improving precision rate and success rate by 7.2% and 4.3%, respectively, while maintaining real-time tracking at 65 frames per second. Code and models are available at https://github.com/xuboyue1999/SwiTrack.git.
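The tri-state switch can be pictured as a per-frame dispatch. The sketch below is a schematic only: the stream names are labels, the observed centre stands in for the output of the real encoders, and the constant-velocity extrapolation is an assumed placeholder for the consistency trajectory prediction module, whose actual design the abstract does not specify.

```python
def predict_constant_velocity(history):
    """Extrapolate the next target centre from the last two centres
    (a simple stand-in for a learned trajectory predictor)."""
    (x0, y0), (x1, y1) = history[-2], history[-1]
    return (2 * x1 - x0, 2 * y1 - y0)

def route_frame(modality, observed_centre, history):
    """Pick one of three streams for the current frame and update the track.
    Returns (stream_name, centre)."""
    if modality == "rgb":
        stream, centre = "visual_encoder", observed_centre
    elif modality == "nir":
        stream, centre = "nir_adapter+visual_encoder", observed_centre
    else:  # unreliable or invalid input: fall back to motion cues only
        stream, centre = "trajectory_prediction", predict_constant_velocity(history)
    history.append(centre)
    return stream, centre

# Toy track: two known centres, then an invalid frame, then an RGB frame.
history = [(0.0, 0.0), (1.0, 2.0)]
print(route_frame("invalid", None, history))
print(route_frame("rgb", (2.5, 4.5), history))
```

Routing invalid frames to a motion-only stream means a corrupted observation never overwrites the track, which is the intuition behind using the third state to limit drift.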
[24] CAMS: Towards Compositional Zero-Shot Learning via Gated Cross-Attention and Multi-Space Disentanglement
Pan Yang, Cheng Deng, Jing Yang, Han Zhao, Yun Liu, Yuling Chen, Xiaoli Ruan, Yanping Chen
🧩 TL;DR
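This paper proposes CAMS, which uses gated cross-attention and multi-space disentanglement to achieve fine-grained disentanglement of attribute and object semantics within the CLIP framework, substantially improving the generalization of compositional zero-shot learning to unseen compositions.
📘 Detailed Summary
Motivation: Existing CLIP-based compositional zero-shot learning methods rely mainly on the global semantic representation obtained from the image encoder. This representation has limited capacity and cannot fully disentangle attributes from objects, which limits generalization to unseen attribute-object compositions.
Method: CAMS designs a gated cross-attention mechanism that uses a set of latent units to capture fine-grained semantic features from the high-level image encoding blocks of CLIP while adaptively suppressing background and other irrelevant information. It then performs multi-space disentanglement, separating attribute and object semantics in multidimensional spaces.
Result: Experiments on three popular benchmarks (MIT-States, UT-Zappos, and C-GQA) show that CAMS achieves state-of-the-art performance in both closed-world and open-world settings.
Conclusion: The study shows that fine-grained semantic feature extraction combined with multi-space disentanglement within the CLIP framework effectively improves the generalization of compositional zero-shot learning, offering a new technical route for semantic understanding in vision-language models.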
📄 Abstract
Compositional zero-shot learning (CZSL) aims to learn the concepts of attributes and objects in seen compositions and to recognize their unseen compositions. Most Contrastive Language-Image Pre-training (CLIP)-based CZSL methods focus on disentangling attributes and objects by leveraging the global semantic representation obtained from the image encoder. However, this representation has limited representational capacity and does not allow for complete disentanglement of the two. To this end, we propose CAMS, which aims to extract semantic features from visual features and perform semantic disentanglement in multidimensional spaces, thereby improving generalization over unseen attribute-object compositions. Specifically, CAMS designs a Gated Cross-Attention that captures fine-grained semantic features from the high-level image encoding blocks of CLIP through a set of latent units, while adaptively suppressing background and other irrelevant information. Subsequently, it conducts Multi-Space Disentanglement to achieve disentanglement of attribute and object semantics. Experiments on three popular benchmarks (MIT-States, UT-Zappos, and C-GQA) demonstrate that CAMS achieves state-of-the-art performance in both closed-world and open-world settings. The code is available at https://github.com/ybyangjing/CAMS.
[25] Beyond Visual Cues: Leveraging General Semantics as Support for Few-Shot Segmentation
Jin Wang, Bingfeng Zhang, Jian Pang, Mengyu Liu, Honglong Chen, Weifeng Liu
🧩 TL;DR
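This paper proposes a language-driven attribute generalization architecture that builds a robust support strategy from language descriptions of the target class rather than from support images, achieving new state-of-the-art performance in few-shot segmentation. Its core idea is to use large language models to generate multi-attribute descriptions and align them across modalities to provide unbiased meta guidance.
📘 Detailed Summary
Motivation: Existing few-shot segmentation methods mainly mine references from support images as meta guidance. Because of intra-class variation in visual representations, the meta information extracted from support images cannot give accurate guidance for untrained classes. The paper argues that references from support images may not be essential; what matters is providing unbiased meta guidance for both trained and untrained classes.
Method: The proposed Language-Driven Attribute Generalization architecture contains a multi-attribute enhancement module and a multi-modal attribute alignment module. The former uses large language models to generate multiple detailed attribute descriptions of the target class and builds refined visual-text prior guidance through multi-modal matching. The latter uses cross-modal interaction to address the text-vision modality shift so that attribute text can better improve visual feature representation.
Result: Experiments show that the proposed method outperforms existing approaches by a clear margin and achieves new state-of-the-art performance in few-shot segmentation, with higher segmentation accuracy on multiple benchmark datasets.
Conclusion: The study shows that language attribute descriptions of the target class, rather than support images, can provide more robust and unbiased meta guidance, opening a new direction for few-shot learning. The multi-modal attribute alignment mechanism addresses the text-vision modality shift and demonstrates the potential of language-driven methods for vision tasks.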
📄 Abstract
Few-shot segmentation (FSS) aims to segment novel classes under the guidance of limited support samples by a meta-learning paradigm. Existing methods mainly mine references from support images as meta guidance. However, due to intra-class variations among visual representations, the meta information extracted from support images cannot produce accurate guidance to segment untrained classes. In this paper, we argue that the references from support images may not be essential; the key to the support role is to provide unbiased meta guidance for both trained and untrained classes. We then introduce a Language-Driven Attribute Generalization (LDAG) architecture that utilizes inherent target property language descriptions to build a robust support strategy. Specifically, to obtain an unbiased support representation, we design a Multi-attribute Enhancement (MaE) module, which produces multiple detailed attribute descriptions of the target class through Large Language Models (LLMs), and then builds refined visual-text prior guidance utilizing multi-modal matching. Meanwhile, because the text-vision modal shift makes it difficult for attribute text to promote visual feature representation, we design a Multi-modal Attribute Alignment (MaA) module to achieve cross-modal interaction between attribute texts and visual features. Experiments show that our proposed method outperforms existing approaches by a clear margin and achieves new state-of-the-art performance. The code will be released.
[26] StreetView-Waste: A Multi-Task Dataset for Urban Waste Management
Diogo J. Paulo, João Martins, Hugo Proença, João C. Neves
🧩 TL;DR
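This paper presents StreetView-Waste, a comprehensive dataset focused on monitoring urban waste containers. It supports three key tasks (waste container detection, tracking, and overflow segmentation), and the proposed heuristic tracking method and geometric-prior segmentation framework substantially improve on the baselines.
📘 Detailed Summary
Motivation: Existing litter detection datasets focus mainly on general litter recognition and lack dedicated annotations for monitoring waste containers, particularly in images captured from garbage trucks. They are also mostly captured in static environments, which limits their value for real-world urban logistics.
Method: The paper introduces the StreetView-Waste dataset and establishes baselines for the three tasks by benchmarking state-of-the-art models in object detection, tracking, and segmentation. It also proposes two complementary strategies: a heuristic-based method to improve waste container tracking, and a model-agnostic framework that uses geometric priors to refine litter segmentation.
Result: Fine-tuned object detectors perform reasonably on waste container detection, but baseline tracking methods struggle to estimate the number of containers. The proposed heuristics reduce the mean absolute counting error by 79.6%, and the geometry-aware strategy improves segmentation mAP@0.5 by 27% on lightweight models, demonstrating the value of multimodal inputs for this task.
Conclusion: StreetView-Waste provides a challenging benchmark for research on real-world perception systems for urban waste management and shows that combining domain knowledge with geometric priors can markedly improve waste monitoring, laying groundwork for smart-city waste management systems.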
📄 Abstract
Urban waste management remains a critical challenge for the development of smart cities. Despite the growing number of litter detection datasets, the problem of monitoring overflowing waste containers, particularly from images captured by garbage trucks, has received little attention. While existing datasets are valuable, they often lack annotations for specific container tracking or are captured in static, decontextualized environments, limiting their utility for real-world logistics. To address this gap, we present StreetView-Waste, a comprehensive dataset of urban scenes featuring litter and waste containers. The dataset supports three key evaluation tasks: (1) waste container detection, (2) waste container tracking, and (3) waste overflow segmentation. Alongside the dataset, we provide baselines for each task by benchmarking state-of-the-art models in object detection, tracking, and segmentation. Additionally, we enhance baseline performance by proposing two complementary strategies: a heuristic-based method for improved waste container tracking and a model-agnostic framework that leverages geometric priors to refine litter segmentation. Our experimental results show that while fine-tuned object detectors achieve reasonable performance in detecting waste containers, baseline tracking methods struggle to accurately estimate their number; however, our proposed heuristics reduce the mean absolute counting error by 79.6%. Similarly, while segmenting amorphous litter is challenging, our geometry-aware strategy improves segmentation mAP@0.5 by 27% on lightweight models, demonstrating the value of multimodal inputs for this task. Ultimately, StreetView-Waste provides a challenging benchmark to encourage research into real-world perception systems for urban waste management.
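For readers unfamiliar with the counting metric, the sketch below shows how a mean absolute counting error and its relative reduction are computed. The per-sequence counts are invented for illustration and are not from StreetView-Waste; only the arithmetic mirrors how a figure such as "reduces the error by 79.6%" would be derived.

```python
def mean_absolute_error(true_counts, predicted_counts):
    """Average absolute difference between true and predicted container counts."""
    pairs = list(zip(true_counts, predicted_counts))
    return sum(abs(t - p) for t, p in pairs) / len(pairs)

def relative_reduction(baseline_error, improved_error):
    """Fraction by which the error drops relative to the baseline."""
    return 1.0 - improved_error / baseline_error

# Made-up container counts for four video sequences.
truth = [3, 5, 2, 4]
baseline = [6, 9, 4, 5]   # a tracker that fragments tracks tends to over-count
improved = [3, 6, 2, 5]   # counts after a hypothetical de-duplication heuristic

mae_base = mean_absolute_error(truth, baseline)
mae_new = mean_absolute_error(truth, improved)
print(mae_base, mae_new, relative_reduction(mae_base, mae_new))
```

Note that the reduction is relative to the baseline error, so it says nothing about the absolute size of the remaining error; both numbers are needed to judge a tracker.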
[27] VLA-Pruner: Temporal-Aware Dual-Level Visual Token Pruning for Efficient Vision-Language-Action Inference
Ziyan Liu, Yeqiu Chen, Hongyi Cai, Tao Lin, Shuo Yang, Zheng Liu, Bo Zhao
🧩 TL;DR
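This paper proposes VLA-Pruner, a dual-level token pruning method designed for vision-language-action models. By combining semantic relevance and action importance criteria, it substantially speeds up VLA inference while maintaining high performance.
📘 Detailed Summary
Motivation: Existing token pruning methods for vision-language models select tokens based only on semantic salience metrics. They overlook the dual-system nature of VLA models, which combine high-level semantic understanding with low-level action execution. As a result, they bias retention toward semantic cues, discard information that is critical for action generation, and significantly degrade VLA performance.
Method: VLA-Pruner adopts a dual-level importance criterion: vision-language prefill attention measures semantic-level relevance, and action decode attention, estimated via temporal smoothing, measures action-level importance. On this basis it proposes an adaptive dual-level token selection strategy that keeps a compact, informative set of visual tokens under a given compute budget.
Result: Experiments show that VLA-Pruner achieves state-of-the-art performance across multiple VLA architectures and diverse robotic tasks, substantially improving inference efficiency while maintaining model accuracy.
Conclusion: The study highlights that token pruning for VLA models must account for their dual-system nature. The proposed dual-level method offers an effective route to efficient VLA deployment and supports real-time use of future embodied AI systems.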
📄 Abstract
Vision-Language-Action (VLA) models have shown great promise for embodied AI, yet the heavy computational cost of processing continuous visual streams severely limits their real-time deployment. Token pruning (keeping salient visual tokens and dropping redundant ones) has emerged as an effective approach for accelerating Vision-Language Models (VLMs), offering a solution for efficient VLA. However, these VLM-specific token pruning methods select tokens based solely on semantic salience metrics (e.g., prefill attention), while overlooking the VLA's intrinsic dual-system nature of high-level semantic understanding and low-level action execution. Consequently, these methods bias token retention toward semantic cues, discard critical information for action generation, and significantly degrade VLA performance. To bridge this gap, we propose VLA-Pruner, a versatile plug-and-play VLA-specific token pruning method that aligns with the dual-system nature of VLA models and exploits the temporal continuity in robot manipulation. Specifically, VLA-Pruner adopts a dual-level importance criterion for visual token retention: vision-language prefill attention for semantic-level relevance and action decode attention, estimated via temporal smoothing, for action-level importance. Based on this criterion, VLA-Pruner proposes a novel dual-level token selection strategy that adaptively preserves a compact, informative set of visual tokens for both semantic understanding and action execution under a given compute budget. Experiments show that VLA-Pruner achieves state-of-the-art performance across multiple VLA architectures and diverse robotic tasks.
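The dual-level criterion can be sketched with two small helpers. Everything specific here is an assumption for illustration: the exponential moving average used for temporal smoothing, the rule of giving half the budget to semantic scores and filling the rest by action scores, and all score values. The abstract does not describe VLA-Pruner's actual selection rule at this level of detail.

```python
def ema_update(previous, current, alpha=0.5):
    """Temporally smooth per-token action attention with an exponential moving average."""
    return [alpha * c + (1 - alpha) * p for p, c in zip(previous, current)]

def top_k(scores, k):
    """Indices of the k highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

def select_tokens(semantic, action, budget):
    """Keep `budget` tokens: half by semantic relevance, the rest by action importance."""
    keep = set(top_k(semantic, budget // 2))
    for i in top_k(action, len(action)):
        if len(keep) >= budget:
            break
        keep.add(i)  # tokens already kept for semantics are not double counted
    return sorted(keep)

# Made-up scores for six visual tokens.
semantic = [0.9, 0.1, 0.8, 0.05, 0.2, 0.0]   # prefill attention
action = [0.0, 0.7, 0.1, 0.6, 0.05, 0.3]     # smoothed action-decode attention

print(select_tokens(semantic, action, budget=4))
print(ema_update([0.0, 1.0], [1.0, 0.0], alpha=0.5))
```

In this toy case tokens 1 and 3 score low semantically but high for action, so a purely semantic pruner would drop them; the second pass over action scores is what retains them.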
[28] Supervised Contrastive Learning for Few-Shot AI-Generated Image Detection and Attribution
Jaime Álvarez Urueña, David Camacho, Javier Huertas Tato
🧩 TL;DR
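This paper proposes a novel two-stage detection framework that combines supervised contrastive learning with few-shot learning to address the generalization challenge in synthetic image detection, substantially improving detection performance with only a small number of samples from unseen generators.
📘 Detailed Summary
Motivation: Rapid progress in generative AI makes synthetic images increasingly hard to distinguish from authentic content, which poses a major challenge to digital media integrity. Traditional detection methods rely on periodic retraining, and the fast release cycle of new generative models makes that approach computationally infeasible and operationally impractical. Robust detection systems are therefore needed that can adapt to the evolving generative AI landscape without exhaustive retraining.
Method: The framework has two stages. The first stage uses a vision deep learning model trained via supervised contrastive learning to extract discriminative embeddings from input images; the model is trained on a strategically partitioned subset of the available generators, with specific architectures withheld to rigorously test cross-generator generalization. The second stage applies a k-nearest neighbors classifier to the learned embedding space, trained in a few-shot paradigm with limited samples from previously unseen test generators.
Result: With only 150 images per class in the few-shot regime, the framework achieves an average detection accuracy of 91.3%, a 5.2 percentage point improvement over existing approaches. For source attribution in an open-set classification context, it improves AUC and OSCR by 14.70% and 4.27% respectively, markedly improving robustness and scalability.
Conclusion: The study demonstrates the effectiveness of combining supervised contrastive learning with few-shot learning and marks progress toward robust, scalable forensic attribution systems that adapt to the evolving generative AI landscape without exhaustive retraining, offering a practical answer to the generalization challenge in synthetic image detection.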
📄 Abstract
The rapid advancement of generative artificial intelligence has enabled the creation of synthetic images that are increasingly indistinguishable from authentic content, posing significant challenges for digital media integrity. This problem is compounded by the accelerated release cycle of novel generative models, which renders traditional detection approaches (reliant on periodic retraining) computationally infeasible and operationally impractical. This work proposes a novel two-stage detection framework designed to address the generalization challenge inherent in synthetic image detection. The first stage employs a vision deep learning model trained via supervised contrastive learning to extract discriminative embeddings from input imagery. Critically, this model was trained on a strategically partitioned subset of available generators, with specific architectures withheld from training to rigorously ablate cross-generator generalization capabilities. The second stage utilizes a k-nearest neighbors (k-NN) classifier operating on the learned embedding space, trained in a few-shot learning paradigm incorporating limited samples from previously unseen test generators. With merely 150 images per class in the few-shot learning regime, which are easily obtainable from current generation models, the proposed framework achieves an average detection accuracy of 91.3%, representing a 5.2 percentage point improvement over existing approaches. For the source attribution task, the proposed approach obtains improvements of 14.70% and 4.27% in AUC and OSCR respectively in an open-set classification context, marking a significant advancement toward robust, scalable forensic attribution systems capable of adapting to the evolving generative AI landscape without requiring exhaustive retraining protocols.
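The second stage, a k-NN classifier over a learned embedding space, is simple enough to sketch directly. The example assumes embeddings have already been produced by some contrastively trained encoder; the 2-D vectors, the labels, the cosine similarity measure, and `k=3` are illustrative assumptions rather than details taken from the paper.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def knn_predict(query, support, k=3):
    """Majority vote among the k support embeddings most similar to the query."""
    ranked = sorted(support, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Made-up few-shot support set: (embedding, label) pairs.
support = [
    ([1.0, 0.0], "real"),
    ([0.9, 0.1], "real"),
    ([0.0, 1.0], "synthetic"),
    ([0.1, 0.9], "synthetic"),
    ([0.2, 0.8], "synthetic"),
]

print(knn_predict([0.05, 0.95], support))
print(knn_predict([1.0, 0.05], support))
```

Supporting a new generator in this setup only means appending its few-shot embeddings to `support`; the encoder itself is left untouched, which is the property that avoids full retraining.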
[29] LLaVA$^3$: Representing 3D Scenes like a Cubist Painter to Boost 3D Scene Understanding of VLMs
Doriand Petit, Steve Bourgeois, Vincent Gay-Bellile, Florian Chabot, Loïc Barthe
🧩 TL;DR
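LLaVA³ is a new method that improves the 3D scene understanding of vision-language models without any fine-tuning. Using only multi-view 2D images, it describes the 3D scene through Cubism-inspired omnidirectional visual representations, and it outperforms previous 2D-based VLM solutions on 3D visual question answering and 3D language grounding.
📘 Detailed Summary
Motivation: The main obstacle to 3D scene understanding in multi-modal language models is the scarcity of 3D training data, in contrast to the abundance of 2D datasets used for vision-language models. This limits how well the models understand three-dimensional space.
Method: Inspired by Cubist painters, the method derives an omnidirectional visual representation of each object from an intermediate multi-view 3D reconstruction and uses these representations to describe the 3D scene to the vision-language model. The whole process requires no fine-tuning.
Result: Extensive experiments on 3D visual question answering and 3D language grounding show that the method outperforms previous 2D-based VLM solutions.
Conclusion: The study shows that 3D scene understanding can be improved using only multi-view 2D images, offering a way around the scarcity of 3D data and opening a research direction in which 2D data is used to strengthen 3D understanding.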
📄 Abstract
Developing a multi-modal language model capable of understanding 3D scenes remains challenging due to the limited availability of 3D training data, in contrast to the abundance of 2D datasets used for vision-language models (VLMs). As an alternative, we introduce LLaVA$^3$ (pronounced LLaVA-Cube), a novel method that improves the 3D scene understanding capabilities of VLMs using only multi-view 2D images and without any fine-tuning. Inspired by Cubist painters, who represented multiple viewpoints of a 3D object within a single picture, we propose to describe the 3D scene for the VLM through omnidirectional visual representations of each object. These representations are derived from an intermediate multi-view 3D reconstruction of the scene. Extensive experiments on 3D VQA and 3D language grounding show that our approach outperforms previous 2D-based VLM solutions.
[30] NutriScreener: Retrieval-Augmented Multi-Pose Graph Attention Network for Malnourishment Screening
Misaal Khan, Mayank Vatsa, Kuldeep Singh, Richa Singh
🧩 TL;DR
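This paper presents NutriScreener, a retrieval-augmented multi-pose graph attention network that combines CLIP visual embeddings, class-boosted knowledge retrieval, and context awareness to provide reliable malnutrition detection and anthropometric prediction from children's images, with a particular focus on low-resource settings.
📘 Detailed Summary
Motivation: Child malnutrition remains a global crisis. Existing screening methods are labor-intensive and hard to scale, which hinders early intervention, so there is a pressing need for automated screening tools that address generalizability and class imbalance at the same time.
Method: The method uses a retrieval-augmented multi-pose graph attention network that integrates CLIP visual embeddings, a class-boosted knowledge retrieval mechanism, and a context-awareness module. Multi-pose information fusion and demographically matched knowledge bases improve the model's robustness.
Result: In a clinical study, doctors rated the system 4.3/5 for accuracy and 4.6/5 for efficiency. Trained and tested on 2,141 children from AnthroVision, the model achieves 0.79 recall and 0.82 AUC with significantly lower anthropometric RMSEs. Cross-dataset tests show up to a 25% recall gain and up to a 3.5 cm RMSE reduction when demographically matched knowledge bases are used.
Conclusion: NutriScreener offers a scalable and accurate solution for early malnutrition detection in low-resource environments. Its retrieval-augmented architecture and context awareness support reliable performance in unconstrained pediatric settings, which makes it relevant for practical deployment.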
📄 Abstract
Child malnutrition remains a global crisis, yet existing screening methods are laborious and poorly scalable, hindering early intervention. In this work, we present NutriScreener, a retrieval-augmented, multi-pose graph attention network that combines CLIP-based visual embeddings, class-boosted knowledge retrieval, and context awareness to enable robust malnutrition detection and anthropometric prediction from children's images, simultaneously addressing generalizability and class imbalance. In a clinical study, doctors rated it 4.3/5 for accuracy and 4.6/5 for efficiency, confirming its deployment readiness in low-resource settings. Trained and tested on 2,141 children from AnthroVision and additionally evaluated on diverse cross-continent populations, including ARAN and an in-house collected CampusPose dataset, it achieves 0.79 recall, 0.82 AUC, and significantly lower anthropometric RMSEs, demonstrating reliable measurement in unconstrained pediatric settings. Cross-dataset results show up to 25% recall gain and up to 3.5 cm RMSE reduction using demographically matched knowledge bases. NutriScreener offers a scalable and accurate solution for early malnutrition detection in low-resource environments.
[31] BoxingVI: A Multi-Modal Benchmark for Boxing Action Recognition and Localization
Rahul Kumar, Vipul Baghel, Sudhanshu Singh, Bikash Kumar Badatya, Shivam Yadav, Babji Srinivasan, Ravi Hegde
🧩 TL;DR
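This work presents a comprehensive video dataset for punch detection and classification in boxing. It contains 6,915 high-quality punch clips covering six punch types and aims to address the scarcity of action recognition datasets for dynamic, unconstrained environments.
📘 Detailed Summary
Motivation: The main bottleneck for computer-vision analysis of combat sports is the lack of robust datasets, caused by the dynamic, unstructured nature of the actions and by variation in recording environments. Research on real-time vision-based action recognition, especially in low-resource and unconstrained environments, needs dedicated datasets.
Method: The authors extracted 6,915 high-quality punch clips from 20 publicly available YouTube sparring sessions involving 18 different athletes. Each clip was manually segmented and labeled to ensure precise temporal boundaries and class consistency, and the clips cover a wide range of motion styles, camera angles, and athlete physiques.
Result: The resulting dataset contains 6,915 annotated clips across six punch types, providing a rich benchmark that can support research on boxing movement analysis, automated coaching, and performance assessment.
Conclusion: The dataset fills a data gap in boxing action recognition. By providing a benchmark with diverse punch examples, it is expected to accelerate research on movement analysis, automated coaching, and performance assessment in boxing and related domains, particularly in low-resource and unconstrained environments.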
📄 Abstract
Accurate analysis of combat sports using computer vision has gained traction in recent years, yet the development of robust datasets remains a major bottleneck due to the dynamic, unstructured nature of actions and variations in recording environments. In this work, we present a comprehensive, well-annotated video dataset tailored for punch detection and classification in boxing. The dataset comprises 6,915 high-quality punch clips categorized into six distinct punch types, extracted from 20 publicly available YouTube sparring sessions and involving 18 different athletes. Each clip is manually segmented and labeled to ensure precise temporal boundaries and class consistency, capturing a wide range of motion styles, camera angles, and athlete physiques. This dataset is specifically curated to support research in real-time vision-based action recognition, especially in low-resource and unconstrained environments. By providing a rich benchmark with diverse punch examples, this contribution aims to accelerate progress in movement analysis, automated coaching, and performance assessment within boxing and related domains.
[32] Contrastive vision-language learning with paraphrasing and negation
Kwun Ho Ngan, Saman Sadeghi Afgeh, Joe Townsend, Artur d'Avila Garcez
🧩 TL;DR
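This paper proposes SemCLIP, which combines paraphrasing and negation handling to strengthen the semantic alignment of vision-language models. It preserves CLIP's performance while markedly improving robustness to negated text, raising image-retrieval accuracy on the CC-Neg benchmark from 68.1% to 78.1%.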
📘 Detailed Summary
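Motivation: Contrastive vision-language models show mixed performance on negated and paraphrased text, because negation changes meaning radically with minimal lexical change, while paraphrasing can produce very different wording with the same meaning. This is a significant challenge for improving the evaluation results and semantic alignment of these models.
Method: The paper evaluates the combination of paraphrasing and negation, proposes a new CLIP contrastive loss that accounts for both, and trains CLIP-like models on LLM-generated triples of original, paraphrased, and negated captions. The resulting approach, SemCLIP, moves paraphrased captions towards the original image embeddings while pushing negated captions further away in embedding space.
Result: On the CC-Neg benchmark, using an original-over-negation image-retrieval accuracy metric, SemCLIP improves accuracy from 68.1% to 78.1%. Results on the Sugarcrepe++ benchmark are mixed compared with CLIP but generally better than models trained with negated captions. In downstream zero-shot classification, SemCLIP pre-trained on Sugarcrepe++ outperforms CLIP on all tested tasks.
Conclusion: The results indicate that SemCLIP can achieve significant robustness to semantic transformations, and that training with both paraphrases and negations is a promising way to improve the semantic alignment of vision-language models.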
📄 Abstract
Contrastive vision-language models continue to be the dominant approach for image and text retrieval. Contrastive Language-Image Pre-training (CLIP) trains two neural networks in a contrastive manner to align their image and text embeddings in a shared latent space. Recent results evaluating CLIP on negated or paraphrased text have shown mixed performance because negation changes meaning radically with minimal lexical changes, while paraphrasing can create very different textual expressions with the same intended meaning. This poses a significant challenge for improving the evaluation results and alignment of vision-language models. To address this challenge, this paper evaluates the combination of paraphrasing and negation, proposes a new CLIP contrastive loss function accounting for both paraphrasing and negation, and applies LLM-generated training triples consisting of original, paraphrased and negated textual captions to CLIP-like training models. The approach, called SemCLIP, is shown to move paraphrased captions towards the original image embeddings while pushing negated captions further away in embedding space. Empirically, SemCLIP is shown to be capable of preserving CLIP's performance while considerably increasing the distances to negated captions. On the CC-Neg benchmark, using an original-over-negation image-retrieval accuracy metric, SemCLIP improves accuracy from 68.1% to 78.1%. Although results are mixed when compared with CLIP on the Sugarcrepe++ benchmark, SemCLIP's performance is generally better than that of the models trained with negated captions. This robustness to negation extends to downstream zero-shot classification tasks, where SemCLIP pre-trained on Sugarcrepe++ performs better than CLIP on all tested downstream tasks. These results indicate that SemCLIP can achieve significant robustness to semantic transformations.
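The abstract does not give the SemCLIP loss itself, so the following is only a minimal sketch of the idea under stated assumptions: one image embedding is scored against an original, a paraphrased, and a negated caption, the first two are treated as positives and the third as a negative. The name `triple_loss`, the temperature value, and the toy embeddings are illustrative and not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def triple_loss(image, original, paraphrase, negated, temperature=0.07):
    """Negative log of the softmax mass on the two positive captions (assumed form)."""
    s_orig, s_para, s_neg = (cosine(image, c) / temperature
                             for c in (original, paraphrase, negated))
    m = max(s_orig, s_para, s_neg)  # subtract the max for numerical stability
    pos = math.exp(s_orig - m) + math.exp(s_para - m)
    neg = math.exp(s_neg - m)
    return -math.log(pos / (pos + neg))

# Toy 2-D embeddings (hypothetical values).
image = [1.0, 0.0]
original = [1.0, 0.0]
paraphrase = [0.9, 0.1]
loss_far = triple_loss(image, original, paraphrase, negated=[-1.0, 0.0])
loss_near = triple_loss(image, original, paraphrase, negated=[1.0, 0.05])
```

In this toy setting the loss is smaller when the negated caption sits far from the image embedding, which is the behaviour the abstract describes qualitatively.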
[33] Dataset Distillation for Pre-Trained Self-Supervised Vision Models
George Cazenavette, Antonio Torralba, Vincent Sitzmann
🧩 TL;DR
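This paper proposes Linear Gradient Matching, a dataset distillation method for pre-trained vision models. It optimizes synthetic images so that the gradients they induce in a linear classifier match those produced by real data, and the resulting synthetic data outperform all real-image baselines.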
📘 Detailed Summary
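Motivation: Existing dataset distillation methods focus on training randomly initialized models, whereas state-of-the-art vision approaches increasingly build on large, pre-trained self-supervised models rather than training from scratch. This motivates studying how to distill datasets that optimally train linear probes on top of such pre-trained models.
Method: The proposed Linear Gradient Matching optimizes synthetic images so that, when passed through a pre-trained feature extractor, they induce gradients in the linear classifier similar to those produced by real data.
Result: The synthetic data outperform all real-image baselines and generalize across pre-trained vision models. For example, a dataset distilled with a DINO backbone can train a competitive linear CLIP probe. The distilled datasets are especially effective for fine-grained classification.
Conclusion: The distilled datasets also serve as a valuable tool for model interpretability, predicting how similar two models' embedding spaces are under the platonic representation hypothesis and whether a model is sensitive to spurious correlations in adversarial datasets.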
📄 Abstract
The task of dataset distillation aims to find a small set of synthetic images such that training a model on them reproduces the performance of the same model trained on a much larger dataset of real samples. Existing distillation methods focus on synthesizing datasets that enable training randomly initialized models. In contrast, state-of-the-art vision approaches are increasingly building on large, pre-trained self-supervised models rather than training from scratch. In this paper, we investigate the problem of distilling datasets that enable us to optimally train linear probes on top of such large, pre-trained vision models. We introduce a method of dataset distillation for this task called Linear Gradient Matching that optimizes the synthetic images such that, when passed through a pre-trained feature extractor, they induce gradients in the linear classifier similar to those produced by the real data. Our method yields synthetic data that outperform all real-image baselines and, remarkably, generalize across pre-trained vision models, enabling us, for instance, to train a linear CLIP probe that performs competitively using a dataset distilled via a DINO backbone. Further, we show that our distilled datasets are exceptionally effective for fine-grained classification and provide a valuable tool for model interpretability, predicting, among other things, how similar two models' embedding spaces are under the platonic representation hypothesis or whether a model is sensitive to spurious correlations in adversarial datasets.
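As a rough illustration of the matching objective described above, the sketch below computes the gradient of a softmax cross-entropy loss with respect to a linear classifier's weights for two batches of pre-extracted features and compares them. Cosine distance is an assumption here (the abstract only says the gradients should be "similar"), the helper names are hypothetical, and the actual method optimizes image pixels through a frozen feature extractor, which this sketch omits.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear_probe_gradient(weights, features, labels):
    """Flattened gradient of the mean cross-entropy w.r.t. the linear classifier weights."""
    n_classes, dim = len(weights), len(weights[0])
    grad = [[0.0] * dim for _ in range(n_classes)]
    for feat, label in zip(features, labels):
        probs = softmax([sum(w * x for w, x in zip(row, feat)) for row in weights])
        for c in range(n_classes):
            err = probs[c] - (1.0 if c == label else 0.0)
            for d in range(dim):
                grad[c][d] += err * feat[d] / len(features)
    return [g for row in grad for g in row]

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Toy classifier and features (hypothetical values standing in for extractor outputs).
W = [[0.1, -0.2], [0.0, 0.3]]
real_feats, real_labels = [[1.0, 0.0], [0.0, 1.0]], [0, 1]
g_real = linear_probe_gradient(W, real_feats, real_labels)
match = cosine_distance(g_real, linear_probe_gradient(W, real_feats, real_labels))
mismatch = cosine_distance(g_real, linear_probe_gradient(W, real_feats, [1, 0]))
```

A distillation loop would minimize a distance of this kind with respect to the synthetic images; here `match` is near zero because the two batches are identical, while swapping the labels yields a large distance.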
[34] Lite Any Stereo: Efficient Zero-Shot Stereo Matching
Junpeng Jing, Weixun Luo, Ye Mao, Krystian Mikolajczyk
🧩 TL;DR
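This paper proposes Lite Any Stereo, a stereo depth estimation framework that achieves strong zero-shot generalization while remaining highly efficient: it matches or exceeds the accuracy of state-of-the-art non-prior-based methods at less than 1% of their computational cost.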
📘 Detailed Summary
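Motivation: Recent stereo matching research has focused on accuracy, often at the cost of significantly larger models, and efficient models have traditionally been regarded as incapable of zero-shot generalization because of their limited capacity. This paper addresses the trade-off between efficiency and generalization.
Method: The authors design a compact yet expressive backbone to ensure scalability, build a carefully crafted hybrid cost aggregation module, and propose a three-stage training strategy on million-scale data to bridge the sim-to-real gap.
Result: The model ranks 1st across four widely used real-world benchmarks, with accuracy comparable to or exceeding state-of-the-art non-prior-based methods at less than 1% of their computational cost, setting a new standard for efficient stereo matching.
Conclusion: The study shows that an ultra-light model can deliver strong generalization, challenging the view that efficient models cannot have zero-shot ability, and offers a practical route to real-time stereo depth estimation.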
📄 Abstract
Recent advances in stereo matching have focused on accuracy, often at the cost of significantly increased model size. Traditionally, the community has regarded efficient models as incapable of zero-shot ability due to their limited capacity. In this paper, we introduce Lite Any Stereo, a stereo depth estimation framework that achieves strong zero-shot generalization while remaining highly efficient. To this end, we design a compact yet expressive backbone to ensure scalability, along with a carefully crafted hybrid cost aggregation module. We further propose a three-stage training strategy on million-scale data to effectively bridge the sim-to-real gap. Together, these components demonstrate that an ultra-light model can deliver strong generalization, ranking 1st across four widely used real-world benchmarks. Remarkably, our model attains accuracy comparable to or exceeding state-of-the-art non-prior-based accurate methods while requiring less than 1% computational cost, setting a new standard for efficient stereo matching.
[35] SAM2S: Segment Anything in Surgical Videos via Semantic Long-term Tracking
Haofeng Liu, Ziyue Wang, Sudhanshu Mishra, Mingqi Gao, Guanyi Qin, Chang Han Low, Alex Y. W. Kong, Yueming Jin
🧩 TL;DR
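This work proposes SAM2S, a foundation model that enhances SAM2 for interactive video object segmentation in surgical videos. By constructing the SA-SV benchmark and introducing a diverse memory mechanism, temporal semantic learning, and ambiguity-resilient learning, it substantially improves long-term tracking and zero-shot generalization in surgical scenes.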
📘 Detailed Summary
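Motivation: Surgical video segmentation is crucial for computer-assisted surgery, but existing interactive video object segmentation models such as SAM2 struggle in surgical scenarios because of the domain gap and limited long-term tracking, so a solution tailored to surgery is needed.
Method: The authors construct the SA-SV benchmark and propose SAM2S, which has three key components: DiveMem, a diverse memory mechanism for robust long-term tracking; temporal semantic learning for instrument understanding; and ambiguity-resilient learning to mitigate annotation inconsistencies across multi-source datasets.
Result: Fine-tuning on SA-SV improves SAM2 by 12.99 average J&F. SAM2S further raises performance to 80.42 average J&F, surpassing vanilla and fine-tuned SAM2 by 17.10 and 4.11 points respectively, while maintaining 68 FPS real-time inference and strong zero-shot generalization.
Conclusion: The study demonstrates the effectiveness of an interactive video segmentation method designed specifically for surgical scenes. SAM2S substantially improves long-term tracking and zero-shot generalization while keeping real-time performance, providing a foundation for computer-assisted surgery systems.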
📄 Abstract
Surgical video segmentation is crucial for computer-assisted surgery, enabling precise localization and tracking of instruments and tissues. Interactive Video Object Segmentation (iVOS) models such as Segment Anything Model 2 (SAM2) provide prompt-based flexibility beyond methods with predefined categories, but face challenges in surgical scenarios due to the domain gap and limited long-term tracking. To address these limitations, we construct SA-SV, the largest surgical iVOS benchmark with instance-level spatio-temporal annotations (masklets) spanning eight procedure types (61k frames, 1.6k masklets), enabling comprehensive development and evaluation for long-term tracking and zero-shot generalization. Building on SA-SV, we propose SAM2S, a foundation model enhancing SAM2 for Surgical iVOS through: (1) DiveMem, a trainable diverse memory mechanism for robust long-term tracking; (2) temporal semantic learning for instrument understanding; and (3) ambiguity-resilient learning to mitigate annotation inconsistencies across multi-source datasets. Extensive experiments demonstrate that fine-tuning on SA-SV enables substantial performance gains, with SAM2 improving by 12.99 average J&F over vanilla SAM2. SAM2S further advances performance to 80.42 average J&F, surpassing vanilla and fine-tuned SAM2 by 17.10 and 4.11 points respectively, while maintaining 68 FPS real-time inference and strong zero-shot generalization. Code and dataset will be released at https://jinlab-imvr.github.io/SAM2S.
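The three reported J&F figures can be cross-checked with simple arithmetic. The sketch below derives the implied scores of vanilla and fine-tuned SAM2 from the stated SAM2S score and margins; the two derived values are inferences from the abstract's numbers, not figures the abstract states directly.

```python
# Stated in the abstract.
sam2s_jf = 80.42
margin_over_vanilla = 17.10
margin_over_finetuned = 4.11

# Implied by the stated margins (derived, not quoted).
vanilla_jf = sam2s_jf - margin_over_vanilla
finetuned_jf = sam2s_jf - margin_over_finetuned

# Gain from fine-tuning SAM2 on SA-SV, to compare with the stated 12.99.
finetune_gain = finetuned_jf - vanilla_jf
print(round(vanilla_jf, 2), round(finetuned_jf, 2), round(finetune_gain, 2))
```

The derived fine-tuning gain agrees with the 12.99 points quoted for fine-tuned SAM2, so the three numbers are mutually consistent.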
[36] V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models
Yang Luo, Xuanlei Zhao, Baijiong Lin, Lingting Zhu, Liyao Tang, Yuqi Liu, Ying-Cong Chen, Shengju Qian, Xin Wang, Yang You
🧩 TL;DR
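This paper introduces V-ReasonBench, a benchmark for systematically evaluating the reasoning abilities of video models across four key dimensions (structured problem-solving, spatial cognition, pattern-based inference, and physical dynamics), providing a unified framework for reliable evaluation of video reasoning.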
📘 Detailed Summary
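Motivation: Generative video models such as Veo-3 have shown surprising zero-shot reasoning abilities, creating a growing need for systematic and reliable evaluation, yet a standardized benchmark dedicated to video reasoning has been missing.
Method: The benchmark is built from both synthetic and real-world image sequences and provides diverse, answer-verifiable tasks that are reproducible, scalable, and unambiguous. Six state-of-the-art video models are evaluated across the four reasoning dimensions.
Result: The evaluations reveal clear differences among video models in structured, spatial, pattern-based, and physical reasoning. The authors also compare video models with strong image models, analyze common hallucination behaviors, and study how video duration affects Chain-of-Frames reasoning.
Conclusion: V-ReasonBench offers a unified and reproducible framework for measuring video reasoning and aims to support the development of models with more reliable, human-aligned reasoning skills, while exposing the specific strengths and limitations of current video models in each dimension.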
📄 Abstract
Recent progress in generative video models, such as Veo-3, has shown surprising zero-shot reasoning abilities, creating a growing need for systematic and reliable evaluation. We introduce V-ReasonBench, a benchmark designed to assess video reasoning across four key dimensions: structured problem-solving, spatial cognition, pattern-based inference, and physical dynamics. The benchmark is built from both synthetic and real-world image sequences and provides a diverse set of answer-verifiable tasks that are reproducible, scalable, and unambiguous. Evaluations of six state-of-the-art video models reveal clear dimension-wise differences, with strong variation in structured, spatial, pattern-based, and physical reasoning. We further compare video models with strong image models, analyze common hallucination behaviors, and study how video duration affects Chain-of-Frames reasoning. Overall, V-ReasonBench offers a unified and reproducible framework for measuring video reasoning and aims to support the development of models with more reliable, human-aligned reasoning skills.
[37] Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO
Junhao Cheng, Liang Hou, Xin Tao, Jing Liao
🧩 TL;DR
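This paper proposes VANS, which uses reinforcement learning to align a vision-language model with a video diffusion model for video next-event prediction, shifting answers from text descriptions to generated video and achieving state-of-the-art performance on procedural and predictive benchmarks.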
📘 Detailed Summary
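Motivation: Language models have become impactful in many real-world applications, while video generation remains largely confined to entertainment. Video can demonstrate physical-world information that is hard to convey through language alone. The paper identifies an underused opportunity to treat video as a new answer modality for next-event prediction and formalizes the task of Video-Next-Event Prediction (VNEP), shifting from telling to showing to unlock more intuitive and customized answers for procedural learning and creative exploration.
Method: The paper introduces VANS, which uses reinforcement learning to align a Vision-Language Model (VLM) with a Video Diffusion Model (VDM) for VNEP. Its core is the proposed Joint-GRPO, which orchestrates the VLM and VDM to function as a unit. Driven by a shared reward on their respective outputs, it optimizes the VLM to produce captions that are both accurate and easy to visualize, while guiding the VDM to generate videos faithful to these captions and to the input visual context. To support this learning, the authors build a dedicated dataset, VANS-Data-100K.
Result: Experiments on procedural and predictive benchmarks show that VANS achieves state-of-the-art performance in both video event prediction and visualization. The model understands multimodal input, performs instruction-conditioned reasoning, and generates videos with visual and semantic consistency.
Conclusion: The study demonstrates the potential of video as an answer modality for next-event prediction, where moving from text descriptions to generated video gives more intuitive and customized answers. Aligning the multimodal components with reinforcement learning offers an effective solution to VNEP and opens new avenues for procedural learning and creative exploration. The code has been released.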
📄 Abstract
While language models have become impactful in many real-world applications, video generation remains largely confined to entertainment. Motivated by video's inherent capacity to demonstrate physical-world information that is difficult to convey through language alone (e.g., imagine teaching someone to tie a tie using only text), we identify an underutilized opportunity to extend video as a new answer modality for Next-Event Prediction (NEP), formalized as Video-Next-Event Prediction (VNEP). While the established NEP task takes a video with a procedural or predictive question as input to predict the next event in text, VNEP requires dynamic video responses. This shift from telling to showing unlocks more intuitive and customized answers for procedural learning and creative exploration. However, this task remains challenging for existing models, as it demands an understanding of multimodal input, instruction-conditioned reasoning, and the generation of video with visual and semantic consistency. To address this, we introduce VANS, a model that leverages reinforcement learning to align a Vision-Language Model (VLM) with a Video Diffusion Model (VDM) for VNEP. The core of VANS is our proposed Joint-GRPO, which orchestrates the VLM and VDM to function as a unit. Driven by a shared reward on their respective outputs, it optimizes the VLM to produce captions that are both accurate and easy to visualize, while guiding the VDM to generate videos that are faithful to these captions and the input visual context. To enable this learning, we craft VANS-Data-100K, a dedicated dataset for the VNEP task. Experiments on procedural and predictive benchmarks demonstrate that VANS achieves state-of-the-art performance in both video event prediction and visualization. Code is released at https://github.com/KlingTeam/VANS.
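The abstract does not spell out Joint-GRPO, but GRPO-style methods generally score a group of sampled outputs and standardize the rewards within that group. The sketch below shows only that generic step, assuming one scalar shared reward per sampled (caption, video) pair; the reward values and the function name are hypothetical, and the population standard deviation is one of several normalizations used in practice.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one sampled group, as in GRPO-style training."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical shared rewards for four sampled (caption, video) pairs.
shared_rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(shared_rewards)
```

Samples with above-average reward receive positive advantages and the rest negative ones, so the update favours outputs that beat their own group rather than an absolute threshold.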
[38] Learning to Think Fast and Slow for Visual Language Models
Chenyu Lin, Cheng Chi, Jinlin Wu, Sharon Li, Kaiyang Zhou
🧩 TL;DR
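This paper proposes a simple reinforcement learning approach that lets visual language models switch automatically between fast and slow thinking modes according to task difficulty, keeping performance high while substantially improving computational efficiency.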
📘 Detailed Summary
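Motivation: Existing reasoning-oriented visual language models mainly pursue lengthy reasoning chains, which leads to excessive computational cost. Human cognition, by contrast, allocates effort according to problem complexity, and this two-system thinking mechanism has not yet been realized effectively in VLMs.
Method: The approach has two stages. In the first, data are labeled as requiring fast or slow thinking based on the model's output length. In the second, the model is trained with GRPO together with the thinking-mode labels to develop dual-mode thinking.
Result: The resulting model, DualMindVLM, significantly outperforms the base model and performs on par with state-of-the-art visual reasoning models while maintaining exceptionally high token efficiency.
Conclusion: The study shows that a simple reinforcement learning approach can emulate a two-system thinking mechanism, offering a route to more efficient visual reasoning systems and highlighting the value of adaptive switching between thinking modes.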
📄 Abstract
When confronted with complex problems, we tend to think slowly; conversely, for simple questions, we think quickly. Such a two-system thinking mechanism allows us to efficiently allocate cognitive resources, enabling quick decision-making for straightforward issues while reserving deeper analytical thinking for more intricate challenges. However, existing reasoning-oriented visual language models (VLMs), whether trained with explicit chain-of-thought annotations or rule-based RL rewards, mainly pursue lengthy, detailed reasoning chains, which often lead to excessive computational costs. In this work, we propose a simple RL approach, which enables VLMs to automatically switch between fast and slow thinking modes depending on task difficulty. The approach consists of two stages: in the first stage, we label data as either requiring fast thinking or slow thinking based on the model output length, which is inspired by the observation that pre-trained VLMs typically produce answers of varying lengths for different types of questions; in the second stage, we train the model using GRPO along with the thinking mode labels to develop dual-mode thinking. Despite its simplicity, our model, named DualMindVLM, significantly outperforms the base model and achieves performance on par with state-of-the-art visual reasoning models, while maintaining exceptionally high token efficiency.
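The first stage labels each training sample by the length of the base model's answer. The abstract does not say how the cut-off is chosen, so the sketch below assumes a median threshold purely for illustration; the function name and token counts are hypothetical.

```python
import statistics

def label_thinking_modes(output_lengths):
    """Label each sample 'fast' or 'slow' by comparing its output length to the median."""
    threshold = statistics.median(output_lengths)
    return ["slow" if n > threshold else "fast" for n in output_lengths]

# Hypothetical token counts of base-model answers to five questions.
lengths = [12, 340, 25, 512, 40]
modes = label_thinking_modes(lengths)
```

Labels of this kind would then accompany the samples during the second, GRPO-based stage.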
[39] EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards
Omkat Thawakar, Shravan Venkatraman, Ritesh Thawkar, Abdelrahman Shaker, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Fahad Khan
🧩 TL;DR
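This paper proposes EvoLMM, a purely unsupervised self-evolving framework that needs no human annotations or reward models. Two cooperative agents, a Proposer and a Solver, let a large multimodal model improve itself, yielding consistent gains on several multimodal math-reasoning benchmarks.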
📘 Detailed Summary
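Motivation: Training pipelines for large multimodal models still depend on human-curated data or externally verified reward models, which limits autonomy and scalability. This motivates a purely unsupervised training method that needs no annotated data or reward distillation.
Method: EvoLMM instantiates two cooperative agents from a single backbone model: a Proposer that generates diverse, image-grounded questions, and a Solver that answers them through internal consistency. Learning proceeds through a continuous self-rewarding process without relying on ground-truth labels or human judgments.
Result: With Qwen2.5-VL as the base model, EvoLMM yields consistent gains of up to about 3% on multimodal math-reasoning benchmarks including ChartQA, MathVista, and MathVision, using only raw training images.
Conclusion: The authors hope this simple yet effective approach will serve as a solid baseline for research on self-improving large multimodal models, showing that model capability can keep improving without external supervision.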
📄 Abstract
Recent advances in large multimodal models (LMMs) have enabled impressive reasoning and perception abilities, yet most existing training pipelines still depend on human-curated data or externally verified reward models, limiting their autonomy and scalability. In this work, we strive to improve LMM reasoning capabilities in a purely unsupervised fashion (without any annotated data or reward distillation). To this end, we propose a self-evolving framework, named EvoLMM, that instantiates two cooperative agents from a single backbone model: a Proposer, which generates diverse, image-grounded questions, and a Solver, which solves them through internal consistency, where learning proceeds through a continuous self-rewarding process. This dynamic feedback encourages both the generation of informative queries and the refinement of structured reasoning without relying on ground-truth or human judgments. When using the popular Qwen2.5-VL as the base model, our EvoLMM yields consistent gains of up to ~3% on multimodal math-reasoning benchmarks, including ChartQA, MathVista, and MathVision, using only raw training images. We hope our simple yet effective approach will serve as a solid baseline easing future research in self-improving LMMs in a fully unsupervised fashion. Our code and models are available at https://github.com/mbzuai-oryx/EvoLMM.
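One simple way to picture a continuous, label-free reward based on internal consistency is the fraction of sampled Solver answers that agree with the most common answer. This is an assumed stand-in for illustration, not the reward defined in the paper; the function name and sample answers are hypothetical.

```python
from collections import Counter

def consistency_reward(answers):
    """Fraction of sampled answers that agree with the most common answer (in [0, 1])."""
    _, votes = Counter(answers).most_common(1)[0]
    return votes / len(answers)

# Four hypothetical Solver samples for one Proposer question.
reward = consistency_reward(["12", "12", "15", "12"])
print(reward)
```

Unlike a binary correct/incorrect signal, a reward of this kind varies smoothly with how strongly the samples agree and needs no ground-truth answer.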
cs.CL [Back]
[40] Learning Tractable Distributions Of Language Model Continuations
Gwen Yidou-Weng, Ian Li, Anji Liu, Oliver Broadrick, Guy Van den Broeck, Benjie Wang
🧩 TL;DR
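This paper proposes Learning to Look Ahead (LTLA), which pairs the base language model with a fixed tractable surrogate model to approximate future-dependent constraints in controlled text generation. It computes continuation probabilities for all candidate tokens with a single batched HMM update, improving constraint satisfaction while keeping inference efficient.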
📘 Detailed Summary
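Motivation: Existing controlled generation methods approximate future-dependent constraints with tractable surrogates such as HMMs, but these surrogates are often weakly context aware, which reduces query quality. Adding neural context naively runs into two efficiency pitfalls: rescoring the prefix for every candidate token, and predicting fresh surrogate parameters for each prefix, which prevents reuse of computation.
Method: LTLA is a hybrid approach that uses the same base language model for rich prefix encoding and a fixed tractable surrogate to compute exact continuation probabilities. A single batched HMM update accounts for all candidate tokens at once, and only the surrogate's latent state prior is conditioned on the LM's hidden representations, while the surrogate decoder stays fixed so that computation can be reused across prefixes.
Result: LTLA attains higher conditional likelihood than an unconditional HMM, approximates continuation distributions for vision-language models (where a standalone HMM cannot encode visual context), and improves constraint satisfaction at comparable fluency on controlled-generation tasks, with minimal inference overhead.
Conclusion: LTLA shows that combining a neural model with a tractable surrogate can improve controlled generation while keeping inference efficient. It offers an effective hybrid paradigm for handling future-dependent constraints and extends to settings such as vision-language models.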
📄 Abstract
Controlled language generation conditions text on sequence-level constraints (for example, syntax, style, or safety). These constraints may depend on future tokens, which makes directly conditioning an autoregressive language model (LM) generally intractable. Prior work uses tractable surrogates such as hidden Markov models (HMMs) to approximate the distribution over continuations and adjust the model's next-token logits at decoding time. However, we find that these surrogates are often weakly context aware, which reduces query quality. We propose Learning to Look Ahead (LTLA), a hybrid approach that pairs the same base language model for rich prefix encoding with a fixed tractable surrogate model that computes exact continuation probabilities. Two efficiency pitfalls arise when adding neural context: (i) naively rescoring the prefix with every candidate next token requires a sweep over the entire vocabulary at each step, and (ii) predicting fresh surrogate parameters for each prefix, although tractable at a single step, forces recomputation of future probabilities for every new prefix and eliminates reuse. LTLA avoids both by using a single batched HMM update to account for all next-token candidates at once, and by conditioning only the surrogate's latent state prior on the LM's hidden representations while keeping the surrogate decoder fixed, so computations can be reused across prefixes. Empirically, LTLA attains higher conditional likelihood than an unconditional HMM, approximates continuation distributions for vision-language models where a standalone HMM cannot encode visual context, and improves constraint satisfaction at comparable fluency on controlled-generation tasks, with minimal inference overhead.
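To make the look-ahead idea concrete, the sketch below reweights an LM's next-token distribution by an HMM estimate of the probability that a constraint will hold, computed for every candidate token from one shared, token-independent look-ahead vector. It is a simplified illustration, not the LTLA algorithm: it assumes the constraint depends only on tokens after the candidate, it takes the latent-state belief as a given input (in LTLA that prior is conditioned on the LM's hidden representations), and all names and toy numbers are hypothetical.

```python
def constrained_next_token_probs(lm_probs, belief, emission, transition, future_ok):
    """Reweight LM next-token probabilities by an HMM estimate of P(constraint | prefix, token).

    belief[z]        : P(hidden state z | prefix), the latent-state prior
    emission[z][v]   : P(token v | state z)
    transition[z][y] : P(next state y | state z)
    future_ok[y]     : P(constraint holds on the continuation | next state y)
    """
    n_states, vocab = len(belief), len(lm_probs)
    # Token-independent look-ahead, computed once and shared by all candidates.
    lookahead = [sum(transition[z][y] * future_ok[y] for y in range(n_states))
                 for z in range(n_states)]
    weights = []
    for v in range(vocab):
        marginal = sum(belief[z] * emission[z][v] for z in range(n_states))
        joint = sum(belief[z] * emission[z][v] * lookahead[z] for z in range(n_states))
        p_constraint = joint / marginal if marginal > 0 else 0.0
        weights.append(lm_probs[v] * p_constraint)
    total = sum(weights)
    if total == 0:  # constraint unreachable under the surrogate: fall back to the LM
        return list(lm_probs)
    return [w / total for w in weights]

# Toy HMM with two states and two tokens (hypothetical numbers).
probs = constrained_next_token_probs(
    lm_probs=[0.5, 0.5],
    belief=[0.5, 0.5],
    emission=[[0.9, 0.1], [0.1, 0.9]],
    transition=[[1.0, 0.0], [0.0, 1.0]],
    future_ok=[1.0, 0.0],
)
```

With an array library the per-token loop becomes a single matrix-vector product over the vocabulary, which is the sense in which one batched update can cover all candidates.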
[41] Comparison of Text-Based and Image-Based Retrieval in Multimodal Retrieval Augmented Generation Large Language Model Systems
Elias Lumer, Alex Cardenas, Matt Melich, Myles Mason, Sara Dieter, Vamse Kumar Subbiah, Pradeep Honaganahalli Basavaraju, Roberto Hernandez
🧩 TL;DR
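This study compares two retrieval approaches for multimodal RAG and finds that direct multimodal embedding retrieval significantly outperforms text retrieval based on LLM summaries, with a 13% absolute (32% relative) improvement in mAP@5 on a financial earnings call question-answering benchmark, showing the importance of preserving visual context for multimodal retrieval.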
📘 Detailed Summary
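Motivation: Existing multimodal retrieval-augmented generation systems rely on LLM summarization to convert images into text, which loses visual detail and contextual information and hurts downstream retrieval and question answering, especially for financial documents containing charts, diagrams, and tables.
Method: The study compares two multimodal RAG retrieval approaches, text-based chunk retrieval and direct multimodal embedding retrieval, across 6 LLMs and 2 multimodal embedding models, using a newly created financial earnings call benchmark of 40 question-answer pairs.
Result: Direct multimodal embedding retrieval significantly outperforms the LLM-summary-based approach, with a 13% absolute and 32% relative improvement in mAP@5 and an 11% absolute and 20% relative improvement in nDCG@5. LLM-as-a-judge pairwise comparisons also show that it produces more accurate and factually consistent answers.
Conclusion: LLM summarization introduces information loss during preprocessing, whereas direct multimodal embeddings preserve visual context for retrieval and inference. This offers guidance for designing multimodal RAG systems and underlines the value of native visual representations for understanding complex documents.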
📄 Abstract
Recent advancements in Retrieval-Augmented Generation (RAG) have enabled Large Language Models (LLMs) to access multimodal knowledge bases containing both text and visual information such as charts, diagrams, and tables in financial documents. However, existing multimodal RAG systems rely on LLM-based summarization to convert images into text during preprocessing, storing only text representations in vector databases, which causes loss of contextual information and visual details critical for downstream retrieval and question answering. To address this limitation, we present a comprehensive comparative analysis of two retrieval approaches for multimodal RAG systems: text-based chunk retrieval (where images are summarized into text before embedding) and direct multimodal embedding retrieval (where images are stored natively in the vector space). We evaluate both approaches across 6 LLMs and two multimodal embedding models on a newly created financial earnings call benchmark comprising 40 question-answer pairs, each paired with 2 documents (1 image and 1 text chunk). Experimental results demonstrate that direct multimodal embedding retrieval significantly outperforms LLM-summary-based approaches, achieving absolute improvements of 13% in mean average precision (mAP@5) and 11% in normalized discounted cumulative gain (nDCG@5). These gains correspond to relative improvements of 32% in mAP@5 and 20% in nDCG@5, providing stronger evidence of their practical impact. We additionally find that direct multimodal retrieval produces more accurate and factually consistent answers as measured by LLM-as-a-judge pairwise comparisons. We demonstrate that LLM summarization introduces information loss during preprocessing, whereas direct multimodal embeddings preserve visual context for retrieval and inference.
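For readers unfamiliar with the two reported metrics, the sketch below computes average precision at k and nDCG at k for a single query with binary relevance. Definitions of AP@k vary between papers; this version normalizes by min(number of relevant documents, k), which is an assumption rather than the paper's stated choice. mAP@5 would be the mean of the first function over all queries.

```python
import math

def average_precision_at_k(ranked_relevance, num_relevant, k=5):
    """AP@k for one query; ranked_relevance is a list of 0/1 flags in rank order."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked_relevance[:k], start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this rank
    denom = min(num_relevant, k)
    return total / denom if denom else 0.0

def ndcg_at_k(ranked_relevance, num_relevant, k=5):
    """nDCG@k with binary gains and a log2 rank discount."""
    dcg = sum(rel / math.log2(rank + 1)
              for rank, rel in enumerate(ranked_relevance[:k], start=1))
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(num_relevant, k) + 1))
    return dcg / ideal if ideal else 0.0

# One query with two relevant documents (e.g. 1 image and 1 text chunk),
# retrieved at ranks 1 and 3.
ranking = [1, 0, 1, 0, 0]
ap = average_precision_at_k(ranking, num_relevant=2)
ndcg = ndcg_at_k(ranking, num_relevant=2)
```

If the 13-point absolute gain and the 32% relative gain in mAP@5 refer to the same baseline, that baseline would sit near 0.13 / 0.32 ≈ 0.41; the abstract does not state it, so this is only an inference.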
[42] Nemotron Elastic: Towards Efficient Many-in-One Reasoning LLMs
Ali Taghibakhshi, Sharath Turuvekere Sreenivas, Saurav Muralidharan, Ruisi Cai, Marcin Chochowski, Ameya Sunil Mahabaleshwarkar, Yoshi Suhara, Oluwatobi Olabiyi, Daniel Korzekwa, Mostofa Patwary, Mohammad Shoeybi, Jan Kautz, Bryan Catanzaro, Ashwath Aithal, Nima Tajbakhsh, Pavlo Molchanov
🧩 TL;DR
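This paper presents Nemotron Elastic, a framework that embeds multiple nested submodels within a single parent model to build reasoning-oriented LLMs at several scales efficiently, achieving over 360x cost reduction compared with training a model family from scratch.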
📘 Detailed Summary
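Motivation: Training a family of large language models for different scales and deployment targets is prohibitively expensive because each size needs its own training run. Even with compression through pruning and knowledge distillation, each compressed model still costs hundreds of billions of training tokens.
Method: The framework, which supports hybrid Mamba-Attention architectures, combines an end-to-end trained router with a two-stage training curriculum so that multiple nested submodels share weights with the parent and can be extracted zero-shot. It also introduces group-aware SSM elastification, heterogeneous MLP elastification, normalized MSE-based layer importance, and knowledge distillation.
Result: Applied to the Nemotron Nano V2 12B model, the method produces a 9B and a 6B model simultaneously using only 110B training tokens. This is over 360x cheaper than training the model family from scratch and around 7x cheaper than state-of-the-art compression techniques, and each nested model performs on par with or better than the state of the art in accuracy.
Conclusion: Besides sharply reducing the cost of building models at multiple scales, the method yields a many-in-one reasoning model whose deployment memory stays constant as the family grows, offering a new route to building and deploying LLMs for different budgets.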
📄 Abstract
Training a family of large language models targeting multiple scales and deployment objectives is prohibitively expensive, requiring separate training runs for each different size. Recent work on model compression through pruning and knowledge distillation has reduced this cost; however, this process still incurs hundreds of billions of tokens worth of training cost per compressed model. In this paper, we present Nemotron Elastic, a framework for building reasoning-oriented LLMs, including hybrid Mamba-Attention architectures, that embed multiple nested submodels within a single parent model, each optimized for different deployment configurations and budgets. Each of these submodels shares weights with the parent model and can be extracted zero-shot during deployment without additional training or fine-tuning. We enable this functionality through an end-to-end trained router, tightly coupled to a two-stage training curriculum designed specifically for reasoning models. We additionally introduce group-aware SSM elastification that preserves Mamba's structural constraints, heterogeneous MLP elastification, normalized MSE-based layer importance for improved depth selection, and knowledge distillation enabling simultaneous multi-budget optimization. We apply Nemotron Elastic to the Nemotron Nano V2 12B model, simultaneously producing a 9B and a 6B model using only 110B training tokens; this results in over 360x cost reduction compared to training model families from scratch, and around 7x compared to SoTA compression techniques. Each of the nested models performs on par or better than the SoTA in accuracy. Moreover, unlike other compression methods, the nested capability of our approach allows having a many-in-one reasoning model that has constant deployment memory against the number of models in the family.
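The stated ratios allow a back-of-the-envelope estimate of the token budgets being compared. The sketch below assumes that "cost" is measured in training tokens and treats "over 360x" and "around 7x" as exact multipliers, so the derived numbers are rough inferences rather than figures reported in the abstract.

```python
# Stated in the abstract.
elastic_tokens = 110e9   # tokens used to obtain the 9B and 6B models together
vs_scratch = 360         # "over 360x" cost reduction vs. training from scratch
vs_compression = 7       # "around 7x" vs. state-of-the-art compression
num_models = 2

# Implied budgets (derived, assuming cost is proportional to tokens).
implied_scratch_tokens = elastic_tokens * vs_scratch
implied_compression_tokens = elastic_tokens * vs_compression
per_compressed_model = implied_compression_tokens / num_models
print(implied_scratch_tokens, implied_compression_tokens, per_compressed_model)
```

Under these assumptions the compression baseline comes to a few hundred billion tokens per model, which matches the abstract's remark that compression costs "hundreds of billions of tokens" per compressed model.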
cs.AI [Back]
[43] Chain of Summaries: Summarization Through Iterative Questioning
William Brach, Lukas Galke Poech
🧩 TL;DR
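This paper proposes Chain of Summaries (CoS), which uses an iterative refinement process inspired by Hegel's dialectical method to generate information-dense, general-purpose summaries that make web content easier for LLMs to process. It outperforms zero-shot LLM baselines and specialized summarization methods on several benchmarks while requiring substantially fewer tokens.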
📘 Detailed Summary
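Motivation: LLMs increasingly rely on external web content, but much of it comes in LLM-unfriendly formats and is limited by context length, so it is digested inefficiently. A method is needed that turns web content into general-purpose summaries LLMs can process efficiently.
Method: Inspired by Hegel's dialectical method, CoS starts from an initial summary (thesis), identifies its limitations through questioning (antithesis), and iteratively refines it into a general-purpose summary (synthesis). The result is an information-dense plain-text repository that can satisfy current information needs and anticipate future ones.
Result: Experiments on TriviaQA, TruthfulQA, and SQUAD show that CoS outperforms zero-shot LLM baselines by up to 66% and specialized summarization methods such as BRIO and PEGASUS by up to 27%. CoS-generated summaries yield higher Q&A performance than the source content while requiring substantially fewer tokens and being agnostic to the downstream LLM.
Conclusion: CoS gives website maintainers a practical way to make their content more accessible to LLMs while retaining possibilities for human oversight. The iteratively refined, information-dense summaries improve current task performance and anticipate future information needs.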
📄 Abstract
Large Language Models (LLMs) are increasingly using external web content. However, much of this content is not easily digestible by LLMs due to LLM-unfriendly formats and limitations of context length. To address this issue, we propose a method for generating general-purpose, information-dense summaries that act as plain-text repositories of web content. Inspired by Hegel's dialectical method, our approach, denoted as Chain of Summaries (CoS), iteratively refines an initial summary (thesis) by identifying its limitations through questioning (antithesis), leading to a general-purpose summary (synthesis) that can satisfy current and anticipate future information needs. Experiments on the TriviaQA, TruthfulQA, and SQUAD datasets demonstrate that CoS outperforms zero-shot LLM baselines by up to 66% and specialized summarization methods such as BRIO and PEGASUS by up to 27%. CoS-generated summaries yield higher Q&A performance compared to the source content, while requiring substantially fewer tokens and being agnostic to the specific downstream LLM. CoS thus represents an appealing option for website maintainers to make their content more accessible for LLMs, while retaining possibilities for human oversight.
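The thesis, antithesis, synthesis loop described above can be sketched as a small control loop. Everything here is an illustrative assumption: the four callables (`summarize`, `ask`, `can_answer`, `refine`) are hypothetical stand-ins for LLM calls, the stopping rule is invented, and the toy lambdas only exist so the sketch runs without a model.

```python
def chain_of_summaries(document, summarize, ask, can_answer, refine, max_rounds=3):
    """Iteratively refine a summary until it answers the questions raised about the document."""
    summary = summarize(document)  # thesis: initial summary
    for _ in range(max_rounds):
        # antithesis: questions the current summary cannot answer
        gaps = [q for q in ask(document) if not can_answer(summary, q)]
        if not gaps:
            break
        summary = refine(document, summary, gaps)  # synthesis: revised summary
    return summary

# Toy stand-ins: a "document" is a dict of facts and a "question" is a key.
doc = {"who": "a", "what": "b", "when": "c"}
summary = chain_of_summaries(
    doc,
    summarize=lambda d: {k: d[k] for k in list(d)[:1]},
    ask=lambda d: list(d),
    can_answer=lambda s, q: q in s,
    refine=lambda d, s, gaps: {**s, **{q: d[q] for q in gaps}},
)
```

In a real setting each callable would wrap a prompt to a language model, and the questions would be generated rather than read off the document.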
[44] Majority Rules: LLM Ensemble is a Winning Approach for Content Categorization
Ariel Kamen, Yakov Kamen
🧩 TL;DR
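This study proposes eLLM, an LLM-based ensemble framework for unstructured text categorization. By integrating multiple models it addresses common weaknesses of individual systems, improves F1-score by up to 65% over the strongest single model, and approaches human-expert-level performance.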
📘 Detailed Summary
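Motivation: Individual LLMs used for text categorization suffer from inconsistency, hallucination, category inflation, and misclassification, which limit the reliability and accuracy of categorization systems. The study uses an ensemble to overcome the performance plateau that single models hit when compressing semantically rich text into sparse categorical representations.
Method: The ensemble LLM framework (eLLM) formalizes the ensemble process through a mathematical model of collective decision-making and establishes principled aggregation criteria. Using the IAB hierarchical taxonomy, ten state-of-the-art LLMs are evaluated under identical zero-shot conditions on 8,660 human-annotated samples, and a diverse consortium of models is used for ensemble categorization.
Result: eLLM improves F1-score by up to 65% over the strongest single model and achieves near human-expert-level performance. Individual models plateau because of semantic compression, while eLLM improves both robustness and accuracy.
Conclusion: eLLM offers a scalable and reliable solution for taxonomy-based text categorization that may significantly reduce dependence on human expert labeling, and it demonstrates that ensembling is effective at overcoming the limitations of individual LLMs.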
📄 Abstract
This study introduces an ensemble framework for unstructured text categorization using large language models (LLMs). By integrating multiple models, the ensemble large language model (eLLM) framework addresses common weaknesses of individual systems, including inconsistency, hallucination, category inflation, and misclassification. The eLLM approach yields a substantial performance improvement of up to 65% in F1-score over the strongest single model. We formalize the ensemble process through a mathematical model of collective decision-making and establish principled aggregation criteria. Using the Interactive Advertising Bureau (IAB) hierarchical taxonomy, we evaluate ten state-of-the-art LLMs under identical zero-shot conditions on a human-annotated corpus of 8,660 samples. Results show that individual models plateau in performance due to the compression of semantically rich text into sparse categorical representations, while eLLM improves both robustness and accuracy. With a diverse consortium of models, eLLM achieves near human-expert-level performance, offering a scalable and reliable solution for taxonomy-based classification that may significantly reduce dependence on human expert labeling.
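The abstract does not give eLLM's aggregation criteria, so the following is only a minimal sketch of the simplest rule the title suggests: a strict majority vote over the category labels returned by several models, abstaining when no label wins. The function name, the agreement threshold, and the example labels are hypothetical.

```python
from collections import Counter

def ensemble_label(predictions, min_agreement=0.5):
    """Return the label backed by more than min_agreement of the models, else None."""
    label, votes = Counter(predictions).most_common(1)[0]
    return label if votes / len(predictions) > min_agreement else None

# Hypothetical category predictions from three models for one text.
agreed = ensemble_label(["Sports", "Sports", "News"])
undecided = ensemble_label(["Sports", "News", "Travel"])
```

Abstaining on disagreement is one way an ensemble can reduce inconsistent or inflated categories; a hierarchical taxonomy such as IAB would likely need votes to be aggregated per level, which this sketch does not attempt.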
[45] Step-Audio-R1 Technical Report
Fei Tian, Xiangyu Tony Zhang, Yuxin Zhang, Haoyang Zhang, Yuxin Li, Daijiao Liu, Yayue Deng, Donghang Wu, Jun Chen, Liang Zhao, Chengyuan Yao, Hexin Liu, Eng Siong Chng, Xuerui Yang, Xiangyu Zhang, Daxin Jiang, Gang Yu
🧩 TL;DR
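This paper introduces Step-Audio-R1, the first audio reasoning model to successfully unlock reasoning capabilities in the audio domain. Through a Modality-Grounded Reasoning Distillation framework, it lets audio intelligence benefit from deliberate thinking, surpassing Gemini 2.5 Pro and achieving performance comparable to Gemini 3 Pro on audio understanding and reasoning benchmarks.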
📘 Detailed Summary
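Motivation: Audio language models show a perplexing behavior: they consistently perform better with minimal or no reasoning, which raises the question of whether audio intelligence can truly benefit from deliberate thinking. Reasoning models have succeeded in text and vision, but the value of reasoning in the audio domain has not been established.
Method: The proposed Modality-Grounded Reasoning Distillation (MGRD) framework teaches Step-Audio-R1 to generate audio-relevant reasoning chains that are grounded in acoustic features rather than hallucinating disconnected deliberations.
Result: Step-Audio-R1 shows strong audio reasoning capabilities on comprehensive audio understanding and reasoning benchmarks spanning speech, environmental sounds, and music, surpassing Gemini 2.5 Pro and achieving performance comparable to the state-of-the-art Gemini 3 Pro.
Conclusion: The results show that reasoning is a capability that transfers across modalities when appropriately anchored, turning extended deliberation from a liability into an asset for audio intelligence. The work opens pathways toward multimodal reasoning systems that think deeply across all sensory modalities.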
📄 Abstract
Recent advances in reasoning models have demonstrated remarkable success in text and vision domains through extended chain-of-thought deliberation. However, a perplexing phenomenon persists in audio language models: they consistently perform better with minimal or no reasoning, raising a fundamental question - can audio intelligence truly benefit from deliberate thinking? We introduce Step-Audio-R1, the first audio reasoning model that successfully unlocks reasoning capabilities in the audio domain. Through our proposed Modality-Grounded Reasoning Distillation (MGRD) framework, Step-Audio-R1 learns to generate audio-relevant reasoning chains that genuinely ground themselves in acoustic features rather than hallucinating disconnected deliberations. Our model exhibits strong audio reasoning capabilities, surpassing Gemini 2.5 Pro and achieving performance comparable to the state-of-the-art Gemini 3 Pro across comprehensive audio understanding and reasoning benchmarks spanning speech, environmental sounds, and music. These results demonstrate that reasoning is a transferable capability across modalities when appropriately anchored, transforming extended deliberation from a liability into a powerful asset for audio intelligence. By establishing the first successful audio reasoning model, Step-Audio-R1 opens new pathways toward building truly multimodal reasoning systems that think deeply across all sensory modalities.
[46] Automated Hazard Detection in Construction Sites Using Large Language and Vision-Language Models
Islem Sahraoui
🧩 TL;DR
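This study proposes a multimodal AI framework that combines text and image analysis to enhance construction-site safety monitoring, and uses two case studies to evaluate how well large language models and vision-language models identify safety hazards automatically.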
📘 Detailed Summary
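Motivation: In safety-critical environments such as construction sites, accident data usually exists in several formats (written reports, inspection records, and site imagery), which makes it hard for traditional approaches to identify hazards holistically. An intelligent framework that can integrate multimodal data is needed.
Method: The study proposes a multimodal AI framework. The first case study uses GPT-4o and GPT-4o mini to extract structured insights from 28,000 OSHA accident reports. The second case study uses the lightweight open-source vision-language models Molmo 7B and Qwen2 VL 2B, evaluating rule-level safety violation detection with natural language prompts on the ConstructionSite10k dataset.
Result: Despite their smaller size, Molmo 7B and Qwen2 VL 2B showed competitive performance in certain prompt configurations, supporting the feasibility of low-resource multimodal systems for rule-aware safety monitoring and serving as a cost-aware benchmark against proprietary models.
Conclusion: The study indicates that lightweight multimodal AI systems can support construction safety monitoring, offering a workable option for automated hazard identification in resource-constrained settings and underlining the practical value of open-source models in specific application scenarios.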
📄 Abstract
This thesis explores a multimodal AI framework for enhancing construction safety through the combined analysis of textual and visual data. In safety-critical environments such as construction sites, accident data often exists in multiple formats, such as written reports, inspection records, and site imagery, making it challenging to synthesize hazards using traditional approaches. To address this, this thesis proposes a multimodal AI framework that combines text and image analysis to assist in identifying safety hazards on construction sites. Two case studies were conducted to evaluate the capabilities of large language models (LLMs) and vision-language models (VLMs) for automated hazard identification. The first case study introduces a hybrid pipeline that utilizes GPT-4o and GPT-4o mini to extract structured insights from a dataset of 28,000 OSHA accident reports (2000-2025). The second case study extends this investigation using Molmo 7B and Qwen2 VL 2B, lightweight, open-source VLMs. Using the public ConstructionSite10k dataset, the performance of the two models was evaluated on rule-level safety violation detection using natural language prompts. This experiment served as a cost-aware benchmark against proprietary models and allowed testing at scale with ground-truth labels. Despite their smaller size, Molmo 7B and Qwen2 VL 2B showed competitive performance in certain prompt configurations, reinforcing the feasibility of low-resource multimodal systems for rule-aware safety monitoring.
[47] OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe
Kaichen Zhang, Keming Wu, Zuhao Yang, Kairui Hu, Bin Wang, Ziwei Liu, Xingxuan Li, Lidong Bing
🧩 TL;DR
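This paper presents OpenMMReasoner, a fully transparent two-stage training recipe for multimodal reasoning that builds reproducible reasoning capabilities through supervised fine-tuning and reinforcement learning, achieving an 11.6% improvement over the Qwen2.5-VL-7B-Instruct baseline across nine benchmarks.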
📘 Detailed Summary
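Motivation: Multimodal reasoning currently lacks transparent and reproducible data curation and training strategies, which is a major barrier to scalable research. A systematic training recipe is needed to advance multimodal reasoning capabilities.
Method: The recipe has two stages. The supervised fine-tuning stage constructs an 874K-sample cold-start dataset with rigorous step-by-step validation. The reinforcement learning stage uses a 74K-sample dataset spanning multiple domains to further sharpen and stabilize reasoning ability.
Result: Across nine multimodal reasoning benchmarks, the method achieves an 11.6% improvement over the Qwen2.5-VL-7B-Instruct baseline, demonstrating the effectiveness of the training recipe and the critical role of data quality.
Conclusion: The study highlights how strongly data quality and training design shape multimodal reasoning performance and establishes a solid empirical foundation for future large-scale multimodal reasoning research. All code, pipelines, and data are open-sourced.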
📄 Abstract
Recent advancements in large reasoning models have fueled growing interest in extending such capabilities to multimodal domains. However, despite notable progress in visual reasoning, the lack of transparent and reproducible data curation and training strategies remains a major barrier to scalable research. In this work, we introduce OpenMMReasoner, a fully transparent two-stage recipe for multimodal reasoning spanning supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct an 874K-sample cold-start dataset with rigorous step-by-step validation, providing a strong foundation for reasoning capabilities. The subsequent RL stage leverages a 74K-sample dataset across diverse domains to further sharpen and stabilize these abilities, resulting in a more robust and efficient learning process. Extensive evaluations demonstrate that our training recipe not only surpasses strong baselines but also highlights the critical role of data quality and training design in shaping multimodal reasoning performance. Notably, our method achieves an 11.6% improvement over the Qwen2.5-VL-7B-Instruct baseline across nine multimodal reasoning benchmarks, establishing a solid empirical foundation for future large-scale multimodal reasoning research. We have open-sourced all our code, pipeline, and data at https://github.com/EvolvingLMMs-Lab/OpenMMReasoner.
[48] Spatial Reasoning in Multimodal Large Language Models: A Survey of Tasks, Benchmarks and Methods
Weichen Liu, Qiyao Xue, Haoming Wang, Xiangyu Yin, Boyuan Yang, Wei Gao
🧩 TL;DR
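This survey proposes a taxonomy of spatial intelligence built from a cognitive perspective. It organizes spatial reasoning tasks by reasoning complexity, maps existing benchmarks onto the taxonomy, and reveals critical gaps between current multimodal large language models and human spatial reasoning.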
📘 Detailed Summary
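Motivation: Existing surveys usually categorize the spatial reasoning abilities of multimodal large language models by input modality, yet spatial ability is not determined by input format alone. This survey aims to build a more principled taxonomy from a cognitive perspective so that spatial reasoning can be better assessed and compared across tasks.
Method: The survey proposes a cognitively grounded taxonomy of spatial intelligence that organizes tasks by reasoning complexity and links them to several cognitive functions. It maps existing benchmarks in text-only, vision-language, and embodied settings onto the taxonomy, reviews evaluation metrics and methodologies for spatial reasoning, and examines both training-based and reasoning-based improvement methods.
Result: The cognitive taxonomy enables more principled cross-task comparisons and reveals critical gaps between current model capabilities and human-like reasoning. The analysis also clarifies the respective strengths of training-based and reasoning-based methods and shows that their mechanisms are complementary.
Conclusion: The cognitive taxonomy gives spatial reasoning research a new analytical framework that helps identify research gaps and future directions. Because training-based and reasoning-based methods are complementary, together they offer a practical route to improving the spatial intelligence of multimodal large language models.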
📄 Abstract
Spatial reasoning, which requires the ability to perceive and manipulate spatial relationships in the 3D world, is a fundamental aspect of human intelligence, yet remains a persistent challenge for multimodal large language models (MLLMs). While existing surveys often categorize recent progress based on input modality (e.g., text, image, video, or 3D), we argue that spatial ability is not solely determined by the input format. Instead, our survey introduces a taxonomy that organizes spatial intelligence from a cognitive perspective and divides tasks in terms of reasoning complexity, linking them to several cognitive functions. We map existing benchmarks across text-only, vision-language, and embodied settings onto this taxonomy, and review evaluation metrics and methodologies for assessing spatial reasoning ability. This cognitive perspective enables more principled cross-task comparisons and reveals critical gaps between current model capabilities and human-like reasoning. In addition, we analyze methods for improving spatial ability, spanning both training-based and reasoning-based approaches. This dual-perspective analysis clarifies their respective strengths and uncovers complementary mechanisms. By surveying tasks, benchmarks, and recent advances, we aim to provide new researchers with a comprehensive understanding of the field and actionable directions for future research.
[49] TOFA: Training-Free One-Shot Federated Adaptation for Vision-Language Models
Li Zhang, Zhongxuan Han, XiaoHua Feng, Jiaming Zhang, Yuyuan Li, Linbo Jiang, Jianan Lin, Chaochao Chen
🧩 TL;DR
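This paper proposes TOFA, a training-free one-shot federated adaptation framework for vision-language models. Using a hierarchical Bayesian model and a text-prompt alignment mechanism, it achieves efficient multimodal federated adaptation in a single communication round, addressing both data heterogeneity and communication cost.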
📘 Detailed Summary
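Motivation: Existing federated adaptation methods for vision-language models rely on iterative training, which brings high communication costs and greater susceptibility to attacks. Current one-shot methods, in turn, face three challenges in federated settings: insufficient use of multimodal information, no dedicated strategy for handling data heterogeneity, and a need for additional training resources.
Method: TOFA runs a visual and a textual pipeline in parallel. The visual pipeline uses a hierarchical Bayesian model to learn personalized, class-specific prototype distributions. The textual pipeline evaluates and globally aligns the locally generated text prompts for robustness. An adaptive weight calibration mechanism combines the two to balance personalization and robustness.
Result: Extensive experiments on 9 datasets show that TOFA is effective across a variety of federated settings. The method needs no additional training resources on either the client or the server side and achieves efficient one-shot federated adaptation.
Conclusion: TOFA demonstrates that training-free one-shot federated adaptation is feasible for vision-language models, offers an effective multimodal approach to data heterogeneity, and opens a new direction for lightweight federated learning.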
📄 Abstract
Efficient and lightweight adaptation of pre-trained Vision-Language Models (VLMs) to downstream tasks through collaborative interactions between local clients and a central server is a rapidly emerging research topic in federated learning. Existing adaptation algorithms are typically trained iteratively, which incurs significant communication costs and increases the susceptibility to potential attacks. Motivated by the one-shot federated training techniques that reduce client-server exchanges to a single round, developing a lightweight one-shot federated VLM adaptation method to alleviate these issues is particularly attractive. However, current one-shot approaches face certain challenges in adapting VLMs within federated settings: (1) insufficient exploitation of the rich multimodal information inherent in VLMs; (2) lack of specialized adaptation strategies to systematically handle the severe data heterogeneity; and (3) the need for additional training resources on the client or server side. To bridge these gaps, we propose a novel Training-free One-shot Federated Adaptation framework for VLMs, named TOFA. To fully leverage the generalizable multimodal features in pre-trained VLMs, TOFA employs both visual and textual pipelines to extract task-relevant representations. In the visual pipeline, a hierarchical Bayesian model learns personalized, class-specific prototype distributions. For the textual pipeline, TOFA evaluates and globally aligns the generated local text prompts for robustness. An adaptive weight calibration mechanism is also introduced to combine predictions from both modalities, balancing personalization and robustness to handle data heterogeneity. Our method is training-free and does not rely on additional training resources on either the client or server side. Extensive experiments across 9 datasets in various federated settings demonstrate the effectiveness of the proposed TOFA method.
[50] Uncertainty-Resilient Multimodal Learning via Consistency-Guided Cross-Modal Transfer
Hyo-Jeong Jang
🧩 TL;DR
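This thesis proposes an uncertainty-resilient multimodal learning framework based on consistency-guided cross-modal transfer. By projecting heterogeneous modalities into a shared latent space, it mitigates modality gaps and uncovers structural relations that support uncertainty estimation, improving model stability and robustness under noisy and incomplete supervision.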
📘 Detailed Summary
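Motivation: Multimodal learning systems face substantial uncertainty from noisy data, low-quality labels, and heterogeneous modality characteristics. The problem is especially acute in human-computer interaction settings, where data quality, semantic reliability, and annotation consistency vary across users and recording conditions, limiting model reliability and adaptability.
Method: The thesis adopts consistency-guided cross-modal transfer, using cross-modal semantic consistency as the basis for robust representation learning. Projecting heterogeneous modalities into a shared latent space mitigates modality gaps and uncovers structural relations that support uncertainty estimation and stable feature learning. The thesis also investigates strategies to enhance semantic robustness, improve data efficiency, and reduce the impact of noise.
Result: On multimodal affect-recognition benchmarks, consistency-guided cross-modal transfer significantly improves model stability, discriminative ability, and robustness to noisy or incomplete supervision. Latent space analyses further show that the framework captures reliable cross-modal structure even under challenging conditions.
Conclusion: By integrating uncertainty modeling, semantic alignment, and data-efficient supervision, the thesis offers a unified perspective on resilient multimodal learning and practical insights for developing reliable and adaptive brain-computer interface systems, with cross-modal consistency at the core of building robust learning systems.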
📄 Abstract
Multimodal learning systems often face substantial uncertainty due to noisy data, low-quality labels, and heterogeneous modality characteristics. These issues become especially critical in human-computer interaction settings, where data quality, semantic reliability, and annotation consistency vary across users and recording conditions. This thesis tackles these challenges by exploring uncertainty-resilient multimodal learning through consistency-guided cross-modal transfer. The central idea is to use cross-modal semantic consistency as a basis for robust representation learning. By projecting heterogeneous modalities into a shared latent space, the proposed framework mitigates modality gaps and uncovers structural relations that support uncertainty estimation and stable feature learning. Building on this foundation, the thesis investigates strategies to enhance semantic robustness, improve data efficiency, and reduce the impact of noise and imperfect supervision without relying on large, high-quality annotations. Experiments on multimodal affect-recognition benchmarks demonstrate that consistency-guided cross-modal transfer significantly improves model stability, discriminative ability, and robustness to noisy or incomplete supervision. Latent space analyses further show that the framework captures reliable cross-modal structure even under challenging conditions. Overall, this thesis offers a unified perspective on resilient multimodal learning by integrating uncertainty modeling, semantic alignment, and data-efficient supervision, providing practical insights for developing reliable and adaptive brain-computer interface systems.
[51] IMACT-CXR - An Interactive Multi-Agent Conversational Tutoring System for Chest X-Ray Interpretation
Tuan-Anh Le, Anh Mai Vu, David Yang, Akash Awasthi, Hien Van Nguyen
🧩 TL;DR
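IMACT-CXR is an AutoGen-based interactive multi-agent tutoring system for chest X-ray interpretation that gives medical trainees personalized guidance by combining spatial annotation, gaze analysis, knowledge retrieval, and image-grounded reasoning. The system delivers responsive tutoring flows and precise control over answer leakage, and preliminary evaluation shows improved localization and diagnostic reasoning.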
📘 Detailed Summary
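Motivation: The work addresses the lack of comprehensive interactive tutoring systems for medical imaging education. Traditional approaches cannot process a trainee's spatial annotations, gaze data, and text observations at the same time, which makes personalized real-time feedback and knowledge reinforcement difficult. Existing systems are limited in integrating multiple input modalities, preventing premature answer disclosure, and adapting instruction to skill mastery.
Method: The system uses an AutoGen-based multi-agent architecture that simultaneously ingests learner bounding boxes, gaze samples, and free-text observations. Specialized agents evaluate localization quality, generate Socratic coaching, retrieve PubMed evidence, suggest similar cases from REFLACX, and trigger NV-Reason-CXR-3B for vision-language reasoning when mastery remains low or the learner explicitly asks. The system also integrates a lung-lobe segmentation module derived from a TensorFlow U-Net for anatomically aware gaze feedback, and Bayesian Knowledge Tracing for skill mastery estimation.
Result: Preliminary evaluation shows improved localization and diagnostic reasoning compared to baselines. The system demonstrates responsive tutoring flows with bounded latency and precise control over answer leakage. It integrates with the REFLACX dataset for real DICOM cases and is extensible toward live residency deployment.
Conclusion: IMACT-CXR illustrates how a multi-agent architecture can support medical imaging education: by integrating multiple input modalities with adaptive tutoring strategies, it provides personalized real-time guidance. The system has potential for deployment in clinical training environments and lays groundwork for more sophisticated interactive tutoring systems.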
📄 Abstract
IMACT-CXR is an interactive multi-agent conversational tutor that helps trainees interpret chest X-rays by unifying spatial annotation, gaze analysis, knowledge retrieval, and image-grounded reasoning in a single AutoGen-based workflow. The tutor simultaneously ingests learner bounding boxes, gaze samples, and free-text observations. Specialized agents evaluate localization quality, generate Socratic coaching, retrieve PubMed evidence, suggest similar cases from REFLACX, and trigger NV-Reason-CXR-3B for vision-language reasoning when mastery remains low or the learner explicitly asks. Bayesian Knowledge Tracing (BKT) maintains skill-specific mastery estimates that drive both knowledge reinforcement and case similarity retrieval. A lung-lobe segmentation module derived from a TensorFlow U-Net enables anatomically aware gaze feedback, and safety prompts prevent premature disclosure of ground-truth labels. We describe the system architecture, implementation highlights, and integration with the REFLACX dataset for real DICOM cases. IMACT-CXR demonstrates responsive tutoring flows with bounded latency, precise control over answer leakage, and extensibility toward live residency deployment. Preliminary evaluation shows improved localization and diagnostic reasoning compared to baselines.
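The abstract names Bayesian Knowledge Tracing but does not give its parameters or update rule as implemented in IMACT-CXR. As a rough illustration only, the textbook BKT update can be sketched as below; the function name and the slip, guess, and learn probabilities are assumed values chosen for the example, not figures from the paper.

```python
def bkt_update(p_mastery, correct, p_slip=0.1, p_guess=0.2, p_learn=0.15):
    """One standard Bayesian Knowledge Tracing step for a single skill.

    p_mastery: prior probability that the learner has mastered the skill.
    correct:   whether the latest observed response was correct.
    p_slip, p_guess, p_learn are illustrative defaults (assumptions).
    """
    if correct:
        # P(mastered | correct): a mastered learner answers correctly unless they slip.
        numerator = p_mastery * (1 - p_slip)
        evidence = numerator + (1 - p_mastery) * p_guess
    else:
        # P(mastered | incorrect): a mastered learner errs only by slipping.
        numerator = p_mastery * p_slip
        evidence = numerator + (1 - p_mastery) * (1 - p_guess)
    posterior = numerator / evidence
    # Learning transition: an unmastered skill may become mastered after practice.
    return posterior + (1 - posterior) * p_learn


# Mastery estimate rises after a correct answer and falls after an incorrect one.
after_correct = bkt_update(0.3, correct=True)
after_wrong = bkt_update(0.3, correct=False)
print(round(after_correct, 3), round(after_wrong, 3))
```

A tutor could keep one such estimate per skill and, as the abstract describes, use a low value as a signal to reinforce knowledge or retrieve similar cases; how IMACT-CXR wires this in is not specified in the text.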
[52] Sensorium Arc: AI Agent System for Oceanic Data Exploration and Interactive Eco-Art
Noah Bissell, Ethan Paley, Joshua Harrison, Juliano Calil, Myungin Lee
🧩 TL;DR
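Sensorium Arc is a real-time multimodal interactive AI agent system that personifies the ocean as a poetic narrator. Built on a modular multi-agent system and a retrieval-augmented large language model framework, it supports natural spoken conversation from the ocean's perspective and dynamically triggers data visualizations and audiovisual playback.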
📘 Detailed Summary
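Motivation: The project addresses the abstraction of environmental data by reimagining ocean data as a living narrative rather than an abstract dataset, and explores the potential of conversational AI agents to mediate affective, intuitive access to high-dimensional environmental data.
Method: The system combines a modular multi-agent system with a retrieval-augmented large language model framework. Through keyword detection and semantic parsing, it dynamically triggers data visualizations and audiovisual playback based on time, location, and thematic cues drawn from the dialogue.
Result: The system supports natural spoken conversations with AI agents that embody the ocean's perspective, generating responses that blend scientific insight with ecological poetics and illustrating how multimodal interaction can turn environmental data into narrative.
Conclusion: The project proposes a new paradigm for human-machine-ecosystem interaction, demonstrates the potential of conversational AI agents to mediate affective access to environmental data, and points to a new direction for combining eco-aesthetics with artificial intelligence.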
📄 Abstract
Sensorium Arc (AI reflects on climate) is a real-time multimodal interactive AI agent system that personifies the ocean as a poetic speaker and guides users through immersive explorations of complex marine data. Built on a modular multi-agent system and retrieval-augmented large language model (LLM) framework, Sensorium enables natural spoken conversations with AI agents that embody the ocean's perspective, generating responses that blend scientific insight with ecological poetics. Through keyword detection and semantic parsing, the system dynamically triggers data visualizations and audiovisual playback based on time, location, and thematic cues drawn from the dialogue. Developed in collaboration with the Center for the Study of the Force Majeure and inspired by the eco-aesthetic philosophy of Newton Harrison, Sensorium Arc reimagines ocean data not as an abstract dataset but as a living narrative. The project demonstrates the potential of conversational AI agents to mediate affective, intuitive access to high-dimensional environmental data and proposes a new paradigm for human-machine-ecosystem interaction.
[53] MUSEKG: A Knowledge Graph Over Museum Collections
Jinhao Li, Jianzhong Qi, Soyeon Caren Han, Eun-Jung Holden
🧩 TL;DR
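This paper presents MuseKG, an end-to-end knowledge-graph framework that unifies structured and unstructured museum data through symbolic-neural integration. On real museum collections it surpasses large-language-model zero-shot, few-shot, and SPARQL prompt baselines, highlighting the importance of symbolic grounding for interpretable cultural heritage reasoning.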
📘 Detailed Summary
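Motivation: Digital transformation in the cultural heritage sector has produced vast yet fragmented artefact data, and existing museum information systems struggle to integrate heterogeneous metadata, unstructured documents, and multimodal artefacts into a coherent and queryable form.
Method: MuseKG constructs a typed property graph linking objects, people, organisations, and visual or textual labels, and supports natural language queries. It unifies structured and unstructured museum data through symbolic-neural integration.
Result: Evaluations on real museum collections show robust performance across queries over attributes, relations, and related entities, surpassing large-language-model zero-shot, few-shot, and SPARQL prompt baselines.
Conclusion: The results underline the importance of symbolic grounding for interpretable and scalable cultural heritage reasoning, pave the way for web-scale integration of digital heritage knowledge, and show the advantages of a hybrid symbolic-neural approach for managing complex heritage data.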
📄 Abstract
Digital transformation in the cultural heritage sector has produced vast yet fragmented collections of artefact data. Existing frameworks for museum information systems struggle to integrate heterogeneous metadata, unstructured documents, and multimodal artefacts into a coherent and queryable form. We present MuseKG, an end-to-end knowledge-graph framework that unifies structured and unstructured museum data through symbolic-neural integration. MuseKG constructs a typed property graph linking objects, people, organisations, and visual or textual labels, and supports natural language queries. Evaluations on real museum collections demonstrate robust performance across queries over attributes, relations, and related entities, surpassing large-language-model zero-shot, few-shot and SPARQL prompt baselines. The results highlight the importance of symbolic grounding for interpretable and scalable cultural heritage reasoning, and pave the way for web-scale integration of digital heritage knowledge.
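The abstract describes a typed property graph linking objects, people, organisations, and labels, without giving a schema. The sketch below shows one minimal way such a graph could be held in memory and queried for related entities. The class names, relation names, and records are all hypothetical and invented for illustration; they are not MuseKG's actual data model or API.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    node_type: str                      # e.g. "Object", "Person", "Organisation", "Label"
    props: dict = field(default_factory=dict)


class PropertyGraph:
    """Tiny in-memory typed property graph (illustrative only)."""

    def __init__(self):
        self.nodes = {}                 # node_id -> Node
        self.edges = []                 # (source_id, relation, target_id)

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_edge(self, source_id, relation, target_id):
        # Reject edges whose endpoints are unknown, so the graph stays consistent.
        if source_id not in self.nodes or target_id not in self.nodes:
            raise KeyError("both endpoints must be added before linking them")
        self.edges.append((source_id, relation, target_id))

    def related(self, source_id, relation=None, node_type=None):
        """Return one-hop neighbours, optionally filtered by relation and node type."""
        hits = []
        for src, rel, dst in self.edges:
            if src != source_id:
                continue
            if relation is not None and rel != relation:
                continue
            target = self.nodes[dst]
            if node_type is not None and target.node_type != node_type:
                continue
            hits.append(target)
        return hits


# Hypothetical records, invented for this example.
graph = PropertyGraph()
graph.add_node(Node("obj-1", "Object", {"title": "Bronze vessel"}))
graph.add_node(Node("per-1", "Person", {"name": "A. Collector"}))
graph.add_node(Node("lab-1", "Label", {"text": "ritual ware"}))
graph.add_edge("obj-1", "donated_by", "per-1")
graph.add_edge("obj-1", "has_label", "lab-1")

donors = graph.related("obj-1", relation="donated_by", node_type="Person")
print([n.props["name"] for n in donors])
```

Attribute queries read `props` on a node, relation queries filter on the edge label, and related-entity queries follow edges, which loosely mirrors the three query families the abstract evaluates.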
[54] FOOTPASS: A Multi-Modal Multi-Agent Tactical Context Dataset for Play-by-Play Action Spotting in Soccer Broadcast Videos
Jeremie Ochin, Raphael Chekroun, Bogdan Stanciulescu, Sotiris Manitsaris
🧩 TL;DR
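This paper introduces the FOOTPASS dataset, the first benchmark for play-by-play action spotting over entire soccer matches in a multi-modal, multi-agent tactical context. By combining computer-vision outputs with prior knowledge of soccer tactics, it supports more reliable automatic extraction of play-by-play data streams.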
📘 Detailed Summary
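Motivation: Current soccer video understanding methods remain insufficient for constructing reliable play-by-play data and are typically used to assist annotation rather than fully automate it. Meanwhile, research on tactical modeling, trajectory forecasting, and performance analysis all depends on game-state and play-by-play data, which motivates using tactical knowledge as a prior to support computer-vision-based predictions.
Method: The paper frames play-by-play action spotting in a multi-modal, multi-agent tactical context, combining outputs from computer-vision tasks (e.g., tracking, identification) with prior knowledge of soccer, including tactical regularities over long time horizons, to generate reliable play-by-play data streams.
Result: FOOTPASS is established as the first benchmark for play-by-play action spotting over entire soccer matches. It supports the development of player-centric action spotting methods and provides a foundation for data-driven sports analytics.
Conclusion: The work makes the case for integrating tactical knowledge as a prior into computer-vision predictions, offering a route to more automated and reliable extraction of soccer play-by-play data, which is an essential input for data-driven sports analytics.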
📄 Abstract
Soccer video understanding has motivated the creation of datasets for tasks such as temporal action localization, spatiotemporal action detection (STAD), or multiobject tracking (MOT). The annotation of structured sequences of events (who does what, when, and where) used for soccer analytics requires a holistic approach that integrates both STAD and MOT. However, current action recognition methods remain insufficient for constructing reliable play-by-play data and are typically used to assist rather than fully automate annotation. Parallel research has advanced tactical modeling, trajectory forecasting, and performance analysis, all grounded in game-state and play-by-play data. This motivates leveraging tactical knowledge as a prior to support computer-vision-based predictions, enabling more automated and reliable extraction of play-by-play data. We introduce Footovision Play-by-Play Action Spotting in Soccer Dataset (FOOTPASS), the first benchmark for play-by-play action spotting over entire soccer matches in a multi-modal, multi-agent tactical context. It enables the development of methods for player-centric action spotting that exploit both outputs from computer-vision tasks (e.g., tracking, identification) and prior knowledge of soccer, including its tactical regularities over long time horizons, to generate reliable play-by-play data streams. These streams form an essential input for data-driven sports analytics.
[55] ChemLabs on ChemO: A Multi-Agent System for Multimodal Reasoning on IChO 2025
Xu Qiang, Shengyuan Bai, Leqing Chen, Zijing Liu, Yu Li
🧩 TL;DR
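This paper introduces the ChemO benchmark and the ChemLabs multi-agent framework to tackle AI reasoning at the level of the chemistry olympiad. Using Assessment-Equivalent Reformulation and Structured Visual Enhancement, the best configuration scores 93.6 out of 100, surpassing an estimated human gold medal threshold.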
📘 Detailed Summary
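Motivation: Current AI reasoning benchmarks focus mainly on mathematics and physics, while chemistry, with its unique multimodal symbolic language, has remained an open challenge. Olympiad-level chemistry problems require handling complex visual outputs together with symbolic reasoning, which existing methods struggle to do.
Method: The paper introduces the ChemO benchmark, which includes Assessment-Equivalent Reformulation (AER) to convert problems requiring visual outputs into computationally tractable formats, and Structured Visual Enhancement (SVE) to disentangle a model's visual perception from its chemical reasoning. It also develops ChemLabs, a hierarchical multi-agent framework that mimics human expert collaboration through specialized agents for problem decomposition, perception, reasoning, and auditing.
Result: Experiments on state-of-the-art multimodal models show that combining SVE with the multi-agent system yields large performance gains. The top configuration scores 93.6 out of 100, surpassing an estimated human gold medal threshold and setting a new state of the art in automated chemical problem-solving.
Conclusion: The study shows that, with dedicated benchmark design and a collaborative multi-agent framework, AI systems can handle complex chemical reasoning tasks. SVE serves as a diagnostic tool for separating distinct capabilities and offers useful guidance for developing future multimodal reasoning systems.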
📄 Abstract
Olympiad-level benchmarks in mathematics and physics are crucial testbeds for advanced AI reasoning, but chemistry, with its unique multimodal symbolic language, has remained an open challenge. We introduce ChemO, a new benchmark built from the International Chemistry Olympiad (IChO) 2025. ChemO features two key innovations for automated assessment: Assessment-Equivalent Reformulation (AER), which converts problems requiring visual outputs (e.g., drawing molecules) into computationally tractable formats, and Structured Visual Enhancement (SVE), a diagnostic mechanism to disentangle a model's visual perception capabilities from its core chemical reasoning. To tackle this benchmark, we propose ChemLabs, a hierarchical multi-agent framework that mimics human expert collaboration through specialized agents for problem decomposition, perception, reasoning, and auditing. Experiments on state-of-the-art multimodal models demonstrate that combining SVE with our multi-agent system yields dramatic performance gains. Our top configuration achieves a score of 93.6 out of 100, surpassing an estimated human gold medal threshold and establishing a new state-of-the-art in automated chemical problem-solving. ChemO Dataset: https://huggingface.co/datasets/IDEA-AI4SCI/ChemO
[56] FlipVQA-Miner: Cross-Page Visual Question-Answer Mining from Textbooks
Zhen Hao Wong, Jingwen Deng, Hao Liang, Runming He, Chengyu Shen, Wentao Zhang
🧩 TL;DR
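This paper proposes an automated pipeline that combines layout-aware OCR with LLM-based semantic parsing to extract high-quality QA and visual-QA pairs from educational documents, offering a scalable, real-world alternative to synthetic data for LLM training. The method produces accurate, aligned, and low-noise supervision data, mitigating the hallucination and limited diversity that synthetic data introduces.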
📘 Detailed Summary
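Motivation: The development of large language models depends heavily on high-quality supervised data, yet existing instruction-tuning and reinforcement learning datasets are costly to curate and often rely on synthetic samples, which introduce hallucination and limit diversity. Textbooks and exercise materials contain abundant high-quality, human-authored QA content, but it remains underexploited because raw PDFs are difficult to transform into AI-ready supervision.
Method: The paper proposes an automated pipeline that combines layout-aware OCR with LLM-based semantic parsing to extract well-formed QA and visual-QA pairs from educational documents. Modern OCR and vision-language models first parse the document structure, and an LLM then performs semantic alignment so that the output meets training requirements.
Result: Experiments across diverse document types show that the method produces accurate, aligned, and low-noise QA and VQA pairs. The extracted data supports reasoning-oriented LLM training and makes scalable use of real-world educational content practical.
Conclusion: The study shows that real educational content is a viable source of supervision and a practical alternative to synthetic data generation for improving LLM training. The automated pipeline makes large-scale use of high-quality human-authored content possible, which can help improve reasoning and reduce hallucination. All code and data-processing pipelines are open-sourced.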
📄 Abstract
The development of Large Language Models (LLMs) increasingly depends on high-quality supervised data, yet existing instruction-tuning and RL datasets remain costly to curate and often rely on synthetic samples that introduce hallucination and limited diversity. At the same time, textbooks and exercise materials contain abundant, high-quality human-authored Question-Answer (QA) content that remains underexploited due to the difficulty of transforming raw PDFs into AI-ready supervision. Although modern OCR and vision-language models can accurately parse document structure, their outputs lack the semantic alignment required for training. We propose an automated pipeline that extracts well-formed QA and visual-QA (VQA) pairs from educational documents by combining layout-aware OCR with LLM-based semantic parsing. Experiments across diverse document types show that the method produces accurate, aligned, and low-noise QA/VQA pairs. This approach enables scalable use of real-world educational content and provides a practical alternative to synthetic data generation for improving reasoning-oriented LLM training. All code and data-processing pipelines are open-sourced at https://github.com/OpenDCAI/DataFlow.
[57] Pharos-ESG: A Framework for Multimodal Parsing, Contextual Narration, and Hierarchical Labeling of ESG Report
Yan Chen, Yu Zou, Jialei Zeng, Haoran You, Xiaorui Zhou, Aixi Zhong
🧩 TL;DR
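This paper proposes Pharos-ESG, a framework that transforms ESG reports into structured representations through multimodal parsing, contextual narration, and hierarchical labeling, and releases Aurora-ESG, the first large-scale public dataset of ESG reports. Pharos-ESG consistently outperforms existing systems at ESG document analysis.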
📘 Detailed Summary
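Motivation: ESG reports are the core medium for assessing corporate ESG performance, but they pose significant challenges for large-scale understanding: slide-like irregular layouts produce a chaotic reading order, and lengthy, weakly structured content leaves the hierarchy implicit.
Method: The framework integrates a reading-order modeling module based on layout flow, hierarchy-aware segmentation guided by table-of-contents anchors, and a multimodal aggregation pipeline that contextually transforms visual elements into coherent natural language. It further enriches its outputs with ESG, GRI, and sentiment labels.
Result: Extensive experiments on annotated benchmarks show that Pharos-ESG consistently outperforms both dedicated document parsing systems and general-purpose multimodal models. The authors also release Aurora-ESG, the first large-scale public dataset of ESG reports, spanning Mainland China, Hong Kong, and U.S. markets.
Conclusion: Through unified structured representations and fine-grained layout and semantic annotations, the work better supports ESG integration in financial governance and decision-making and improves the accuracy and practicality of ESG document analysis.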
📄 Abstract
Environmental, Social, and Governance (ESG) principles are reshaping the foundations of global financial governance, transforming capital allocation architectures, regulatory frameworks, and systemic risk coordination mechanisms. However, as the core medium for assessing corporate ESG performance, the ESG reports present significant challenges for large-scale understanding, due to chaotic reading order from slide-like irregular layouts and implicit hierarchies arising from lengthy, weakly structured content. To address these challenges, we propose Pharos-ESG, a unified framework that transforms ESG reports into structured representations through multimodal parsing, contextual narration, and hierarchical labeling. It integrates a reading-order modeling module based on layout flow, hierarchy-aware segmentation guided by table-of-contents anchors, and a multimodal aggregation pipeline that contextually transforms visual elements into coherent natural language. The framework further enriches its outputs with ESG, GRI, and sentiment labels, yielding annotations aligned with the analytical demands of financial research. Extensive experiments on annotated benchmarks demonstrate that Pharos-ESG consistently outperforms both dedicated document parsing systems and general-purpose multimodal models. In addition, we release Aurora-ESG, the first large-scale public dataset of ESG reports, spanning Mainland China, Hong Kong, and U.S. markets, featuring unified structured representations of multimodal content, enriched with fine-grained layout and semantic annotations to better support ESG integration in financial governance and decision-making.
[58] Utilizing Large Language Models for Zero-Shot Medical Ontology Extension from Clinical Notes
Guanchen Wu, Yuzhang Xie, Huanwei Wu, Zhe He, Hui Shao, Xiao Hu, Carl Yang
🧩 TL;DR
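This paper proposes CLOZE, a framework that uses large language models to automatically extract medical entities from clinical notes and integrate them into hierarchical medical ontologies, extending ontologies without training data while preserving privacy. Working zero-shot, it identifies disease-related concepts and captures complex hierarchical relationships, offering a scalable solution for biomedical research and clinical informatics.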
📘 Detailed Summary
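Motivation: Existing medical ontologies are limited in coverage and utility. Clinical notes, as unstructured documents rich in detailed patient observations, offer valuable context-specific insights, yet their potential for ontology extension is underused and directly leveraging them for this purpose remains largely unexplored. A method is needed that can automatically extract medical entities and integrate them into hierarchical medical ontologies.
Method: CLOZE builds on pre-trained large language models, using their strong language understanding and extensive biomedical knowledge to automatically extract medical entities from clinical notes and integrate them into hierarchical medical ontologies. The framework is zero-shot, requiring no additional training or labeled data, which keeps it cost-efficient, and it protects patient privacy through automated removal of protected health information.
Result: Experimental results show that CLOZE accurately identifies disease-related concepts and captures complex hierarchical relationships, providing an accurate, scalable, and privacy-preserving ontology extension framework with strong potential to support a wide range of downstream applications in biomedical research and clinical informatics.
Conclusion: CLOZE demonstrates that extending medical ontologies automatically from clinical notes with large language models is feasible and provides a new technical route for zero-shot ontology extension, supporting the development of biomedical knowledge representation and clinical decision support systems.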
📄 Abstract
Integrating novel medical concepts and relationships into existing ontologies can significantly enhance their coverage and utility for both biomedical research and clinical applications. Clinical notes, as unstructured documents rich with detailed patient observations, offer valuable context-specific insights and represent a promising yet underutilized source for ontology extension. Despite this potential, directly leveraging clinical notes for ontology extension remains largely unexplored. To address this gap, we propose CLOZE, a novel framework that uses large language models (LLMs) to automatically extract medical entities from clinical notes and integrate them into hierarchical medical ontologies. By capitalizing on the strong language understanding and extensive biomedical knowledge of pre-trained LLMs, CLOZE effectively identifies disease-related concepts and captures complex hierarchical relationships. The zero-shot framework requires no additional training or labeled data, making it a cost-efficient solution. Furthermore, CLOZE ensures patient privacy through automated removal of protected health information (PHI). Experimental results demonstrate that CLOZE provides an accurate, scalable, and privacy-preserving ontology extension framework, with strong potential to support a wide range of downstream applications in biomedical research and clinical informatics.
[59] You Only Forward Once: An Efficient Compositional Judging Paradigm
Tianlong Zhang, Hongwei Xue, Shilin Yan, Di Wu, Chen Xu, Yunyun Yang
🧩 TL;DR
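YOFO is a template-conditioned judging method for multimodal large language models that verifies all structured requirements in parallel within a single forward pass. It resolves the trade-off between the speed of generative judging and fine-grained understanding, achieving orders-of-magnitude speedups while preserving interpretability.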
📘 Detailed Summary
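Motivation: Existing MLLM judging approaches face a fundamental trade-off. Adapting an MLLM to output a single score is misaligned with its generative nature and limits fine-grained requirement understanding, whereas autoregressively generating judging analyses is too slow for high-throughput settings. The authors observe that judgment reduces to verifying whether inputs satisfy a set of structured requirements, and aim to resolve this tension between efficiency and precision.
Method: YOFO is built on an autoregressive model and is template-conditioned. In one inference step, it produces a binary yes/no decision for every structured requirement in parallel by reading the logits of the final token associated with each requirement. The method supports dependency-aware analysis, in which subsequent judgments are conditioned on previous ones, and can benefit from post-hoc chain-of-thought.
Result: Extensive experiments show that YOFO achieves state-of-the-art results on standard recommendation datasets while delivering orders-of-magnitude speedups and preserving interpretability. The experiments also confirm the additional benefits of dependency-aware analysis and post-hoc chain-of-thought.
Conclusion: YOFO shows that reframing judging as structured requirement verification is effective and provides an efficient solution for high-throughput MLLM judging. Its results suggest that a carefully designed inference strategy can greatly improve efficiency while retaining the strengths of generative models, opening a path toward practical deployment of MLLM judging systems.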
📄 Abstract
Multimodal large language models (MLLMs) show strong potential as judges. However, existing approaches face a fundamental trade-off: adapting MLLMs to output a single score misaligns with the generative nature of MLLMs and limits fine-grained requirement understanding, whereas autoregressively generating judging analyses is prohibitively slow in high-throughput settings. Observing that judgment reduces to verifying whether inputs satisfy a set of structured requirements, we propose YOFO, a template-conditioned method that judges all requirements in a single forward pass. Built on an autoregressive model, YOFO accepts a structured requirement template and, in one inference step, produces a binary yes/no decision for each requirement by reading the logits of the final token associated with that requirement. This design yields orders-of-magnitude speedups while preserving interpretability. Extensive experiments show that YOFO not only achieves state-of-the-art results on standard recommendation datasets, but also supports dependency-aware analysis, where subsequent judgments are conditioned on previous ones, and further benefits from post-hoc CoT.
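The readout step described in the abstract (one yes/no decision per requirement, taken from the logits at that requirement's final token) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the restriction to a two-way softmax over a single "yes" and a single "no" vocabulary entry, and the 0.5 threshold are all assumptions, and the logits here are toy numbers rather than the output of a real model.

```python
import math


def judge_requirements(logits, requirement_positions, yes_id, no_id, threshold=0.5):
    """Turn per-position logits from ONE forward pass into yes/no decisions.

    logits:                list of per-position vocabulary logit lists.
    requirement_positions: maps each requirement name to the index of the
                           final token associated with it in the template.
    yes_id, no_id:         vocabulary indices of the "yes" and "no" tokens
                           (assumed to be single tokens for this sketch).
    """
    decisions = {}
    for name, position in requirement_positions.items():
        row = logits[position]
        # Softmax restricted to {yes, no} equals a sigmoid of the logit difference.
        p_yes = 1.0 / (1.0 + math.exp(row[no_id] - row[yes_id]))
        decisions[name] = (p_yes >= threshold, p_yes)
    return decisions


# Toy logits over a 3-token vocabulary: index 0 = "yes", 1 = "no", 2 = other.
toy_logits = [
    [0.0, 0.0, 0.0],    # position 0: template text, not read
    [2.0, -1.0, 0.0],   # position 1: final token of requirement "color_matches"
    [-1.0, 3.0, 0.0],   # position 2: final token of requirement "size_matches"
]
result = judge_requirements(
    toy_logits, {"color_matches": 1, "size_matches": 2}, yes_id=0, no_id=1
)
for name, (verdict, p_yes) in result.items():
    print(name, verdict, round(p_yes, 3))
```

Because every requirement is read from the same set of logits, the cost is one forward pass regardless of how many requirements the template holds, which is the source of the speedup the abstract reports.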
[60] Bridging VLMs and Embodied Intelligence with Deliberate Practice Policy Optimization
Yi Zhang, Che Liu, Xiancong Ren, Hanchu Ni, Yingji Zhang, Shuai Zhang, Zeyuan Ding, Jiayu Hu, Haozhe Shan, Junbo Qi, Yan Bai, Dengjie Li, Jiachen Luo, Yidong Wang, Yong Dai, Zenglin Xu, Bin Shen, Qifan Wang, Jian Tang, Xiaozhu Ju
🧩 TL;DR
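This paper proposes Deliberate Practice Policy Optimization (DPPO), a metacognitive training framework that dynamically alternates between supervised fine-tuning and reinforcement learning to address the data bottleneck and algorithmic inefficiency in embodied intelligence. The resulting Pelican-VL 1.0 model achieves a 20.3% performance improvement over its base model and surpasses open-source models at the 100B-parameter scale.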
📘 Detailed Summary
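Motivation: The work targets two primary challenges facing universal embodied intelligence systems: the critical embodied data bottleneck (real-world data is scarce and expensive) and the algorithmic inefficiency of existing methods (they are resource-prohibitive). These limitations hinder the efficient construction of versatile embodied agents.
Method: The core method is Deliberate Practice Policy Optimization (DPPO), a metacognitive "Metaloop" training framework that dynamically alternates between supervised fine-tuning (competence expansion) and reinforcement learning (skill refinement). The framework identifies weaknesses automatically and allocates resources to them, and is designed to maximize learning efficiency from sparse, finite data. Theoretically, DPPO can be formalised as a unified preference-learning framework.
Result: Pelican-VL 1.0, a vision-language embodied model trained with DPPO, achieves a 20.3% performance improvement over the base model and surpasses open-source models at the 100B-parameter scale by 10.6%. The framework alleviates the data and resource bottleneck.
Conclusion: The work provides the first systematic framework for alleviating the data and resource bottleneck in embodied intelligence, enabling the community to build versatile embodied agents efficiently. The authors are open-sourcing both the models and the code.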
📄 Abstract
Developing a universal and versatile embodied intelligence system presents two primary challenges: the critical embodied data bottleneck, where real-world data is scarce and expensive, and the algorithmic inefficiency of existing methods, which are resource-prohibitive. To address these limitations, we introduce Deliberate Practice Policy Optimization (DPPO), a metacognitive "Metaloop" training framework that dynamically alternates between supervised fine-tuning (competence expansion) and reinforcement learning (skill refinement). This enables automatic weakness identification and targeted resource allocation, specifically designed to maximize learning efficiency from sparse, finite data. Theoretically, DPPO can be formalised as a unified preference-learning framework. Empirically, training a vision-language embodied model with DPPO, referred to as Pelican-VL 1.0, yields a 20.3% performance improvement over the base model and surpasses open-source models at the 100B-parameter scale by 10.6%. We are open-sourcing both the models and code, providing the first systematic framework that alleviates the data and resource bottleneck and enables the community to build versatile embodied agents efficiently.