cs.CV
[1] SO-Bench: A Structural Output Evaluation of Multimodal LLMs
Di Feng, Kaixin Ma, Feng Nan, Haofeng Chen, Bohan Zhai, David Griffiths, Mingfei Gao, Zhe Gan, Eshan Verma, Yinfei Yang, Zhifeng Chen, Afshin Dehghan
🧩 TL;DR
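This work introduces SO-Bench, a benchmark for systematically evaluating how well multimodal large language models generate structured output from visual inputs. It fills the gap left by the absence of a systematic benchmark in this area, and training experiments substantially improve the models' structured output performance.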
📘 Detailed Summary
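Motivation: Despite progress in structured generation for text, no benchmark systematically evaluates schema-grounded information extraction and reasoning over visual inputs by multimodal LLMs, even though MLLMs deployed in the real world must produce outputs that are both correct and conformant to predefined data schemas.
Method: The study designs the SO-Bench benchmark, which covers four visual domains (UI screens, natural images, documents, and charts). It is built from over 6,500 diverse JSON schemas and 1,800 image-schema pairs with human-verified quality, and is accompanied by benchmarking and training experiments aimed at improving structured output capability.
Result: Benchmarking open-source and frontier proprietary models reveals persistent gaps in predicting accurate, schema-compliant outputs, highlighting the need for better multimodal structured reasoning. The training experiments substantially improve the model's structured output capability.
Conclusion: The study highlights the shortcomings of multimodal LLMs in generating structured output from visual inputs. SO-Bench provides a standard for systematic evaluation, and the training experiments show that targeted training can substantially improve performance, pointing to a direction for future work on multimodal structured reasoning.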
📄 Abstract
Multimodal large language models (MLLMs) are increasingly deployed in real-world, agentic settings where outputs must not only be correct but also conform to predefined data schemas. Despite recent progress in structured generation in the textual domain, there is still no benchmark that systematically evaluates schema-grounded information extraction and reasoning over visual inputs. In this work, we conduct a comprehensive study of the visual structural output capabilities of MLLMs with our carefully designed SO-Bench benchmark. Covering four visual domains (UI screens, natural images, documents, and charts), SO-Bench is built from over 6.5K diverse JSON schemas and 1.8K curated image-schema pairs with human-verified quality. Benchmarking experiments on open-source and frontier proprietary models reveal persistent gaps in predicting accurate, schema-compliant outputs, highlighting the need for better multimodal structured reasoning. Beyond benchmarking, we further conduct training experiments that substantially improve the model's structured output capability. We plan to make the benchmark available to the community.
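To make "schema-compliant output" concrete, the following is a minimal sketch of a compliance check, assuming a tiny subset of JSON Schema (`type`, `required`, `properties`, `items`). The function names and the example schema are hypothetical illustrations and are not taken from SO-Bench, whose actual evaluation protocol is not described in this abstract.

```python
import json

# Map JSON Schema type names to Python types (a small, illustrative subset).
_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "array": list,
    "object": dict,
}


def conforms(value, schema):
    """Return True if `value` satisfies a tiny subset of JSON Schema."""
    expected = schema.get("type")
    if expected is not None:
        # bool is a subclass of int in Python, so reject it for numeric types.
        if expected in ("number", "integer") and isinstance(value, bool):
            return False
        if not isinstance(value, _TYPES[expected]):
            return False
    if isinstance(value, dict):
        # Every required key must be present.
        if any(key not in value for key in schema.get("required", [])):
            return False
        props = schema.get("properties", {})
        return all(conforms(v, props[k]) for k, v in value.items() if k in props)
    if isinstance(value, list) and "items" in schema:
        return all(conforms(item, schema["items"]) for item in value)
    return True


def is_schema_compliant(model_output, schema):
    """Parse raw model text as JSON and check it against the schema."""
    try:
        parsed = json.loads(model_output)
    except json.JSONDecodeError:
        return False
    return conforms(parsed, schema)


# Hypothetical schema for a UI screen: a title plus a list of button labels.
schema = {
    "type": "object",
    "required": ["title", "buttons"],
    "properties": {
        "title": {"type": "string"},
        "buttons": {"type": "array", "items": {"type": "string"}},
    },
}
```

A check like this only covers compliance. Whether the extracted values are correct for the image is a separate question that needs ground-truth comparison.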
[2] PathReasoning: A multimodal reasoning agent for query-based ROI navigation on whole-slide images
Kunpeng Zhang, Hanwen Xu, Sheng Wang
🧩 TL;DR
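This paper presents PathReasoning, a multimodal reasoning agent that navigates whole-slide images (WSIs) through iterative reasoning and refinement. It turns each slide into a sequence of question-guided views, efficiently locating diagnostically relevant regions without dense pixel-level annotations.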
📘 Detailed Summary
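Motivation: Whole-slide images (WSIs) offer a comprehensive view of cancer, but their sheer size (up to 10 billion pixels) makes navigating to the relevant regions for diverse clinical inspections challenging and time-consuming. Existing methods struggle to locate diagnostically relevant regions efficiently, whereas pathologists typically navigate WSIs through a combination of sampling, reasoning, and self-reflection, which inspired this work.
Method: PathReasoning is a multimodal reasoning agent that navigates WSIs iteratively over multiple rounds of reasoning and refinement. Starting from randomly sampled candidate regions, it reviews the current selection through self-reflection, reasons about the correspondence between visual observations and the clinical question, and finally proposes new regions to explore. By building a reasoning chain that gradually directs attention to diagnostically relevant regions, it turns the whole slide into a sequence of question-guided views.
Result: PathReasoning outperforms strong ROI-selection methods by 6.7% and 3.1% AUROC on subtyping and longitudinal analysis tasks, respectively. The high-quality ROIs further support accurate report generation for breast cancer, outperforming standard GPT-4o by 10% in accuracy. The method finds informative ROIs efficiently within a fixed number of steps, without dense pixel-level annotations.
Conclusion: PathReasoning prioritizes question-specific regions and builds interpretable reasoning chains, supporting efficient slide review, consistent diagnostic interpretation, comprehensive reporting, and evidence traceability in digital pathology. By mimicking pathologists' cognitive process, it provides an effective navigation framework for large-scale WSI analysis and improves diagnostic efficiency and accuracy.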
📄 Abstract
Deciphering the tumor microenvironment from Whole Slide Images (WSIs) is intriguing as it is key to cancer diagnosis, prognosis, and treatment response. While these gigapixel images offer a comprehensive portrait of cancer, their extremely large size, which can exceed 10 billion pixels, makes it challenging and time-consuming to navigate to the corresponding regions to support diverse clinical inspection. Inspired by pathologists, who navigate WSIs with a combination of sampling, reasoning, and self-reflection, we propose "PathReasoning", a multi-modal reasoning agent that iteratively navigates across WSIs through multiple rounds of reasoning and refinement. Specifically, starting with randomly sampled candidate regions, PathReasoning reviews current selections with self-reflection, reasons over the correspondence between visual observations and clinical questions, and concludes by proposing new regions to explore. Across rounds, PathReasoning builds a reasoning chain that gradually directs attention to diagnostically relevant areas. PathReasoning turns each whole slide into a sequence of question-guided views, allowing the model to efficiently find informative ROIs within a fixed number of steps, without the need for dense pixel-level annotations. PathReasoning substantially outperforms strong ROI-selection approaches by 6.7% and 3.1% AUROC on subtyping and longitudinal analysis tasks. The high-quality ROIs further support accurate report generation on breast cancer, significantly outperforming the standard GPT-4o by 10% in accuracy. PathReasoning prioritizes question-specific regions and constructs interpretable reasoning chains, supporting efficient slide review, consistent diagnostic interpretations, comprehensive reporting, and evidence traceability in digital pathology.
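The sample, reflect, and propose loop described above can be illustrated with a toy search over a grid of patches. This is only a sketch under strong assumptions: `relevance` is a hypothetical stand-in for the agent's judgement of how well a patch matches the clinical question, and proposing the four neighbouring patches is a simplification chosen for illustration, not the paper's proposal mechanism.

```python
import random


def navigate(relevance, grid_size, rounds=4, n_candidates=4, seed=0):
    """Toy question-guided ROI search over a grid_size x grid_size patch grid.

    `relevance(x, y)` is a hypothetical scoring function (an assumption of
    this sketch); a real agent would reason over image content instead.
    """
    rng = random.Random(seed)
    # Start from randomly sampled candidate regions.
    candidates = [(rng.randrange(grid_size), rng.randrange(grid_size))
                  for _ in range(n_candidates)]
    chain = []  # one entry per round: the best region found so far
    for _ in range(rounds):
        # "Reflection": rank the current selections by relevance.
        candidates.sort(key=lambda p: relevance(*p), reverse=True)
        best = candidates[0]
        chain.append(best)
        # Propose new regions next to the best one, clipped to the grid.
        x, y = best
        neighbours = [(min(max(x + dx, 0), grid_size - 1),
                       min(max(y + dy, 0), grid_size - 1))
                      for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))]
        # Keep the current best so the chain never gets worse.
        candidates = [best] + neighbours
    return chain
```

The number of rounds is fixed in advance, mirroring the abstract's point that informative ROIs are found within a fixed number of steps.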
[3] Interpretable Multimodal Cancer Prototyping with Whole Slide Images and Incompletely Paired Genomics
Yupei Zhang, Yating Huang, Wanming Hu, Lequan Yu, Hujun Yin, Chao Li
🧩 TL;DR
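This paper proposes a flexible multimodal prototyping framework that integrates whole slide images with incomplete genomic data for precision oncology. Through its key components (biological prototyping, multiview alignment, bipartite fusion, and semantic genomics imputation), the framework addresses the challenges of intra-modal representation quality and inter-modal integration.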
📘 Detailed Summary
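Motivation: Multimodal approaches that integrate histology and genomics hold great potential for precision oncology, but phenotypic and genotypic heterogeneity limits the quality of intra-modal representations and hinders effective inter-modal integration. In addition, most existing methods overlook real-world clinical scenarios in which genomic data may be partially missing or entirely unavailable, which limits their use in practice.
Method: The proposed flexible multimodal prototyping framework has four key components: 1) biological prototyping using text prompting and prototype-wise weighting; 2) multiview alignment through sample-wise and distribution-wise alignment; 3) bipartite fusion that captures both shared and modality-specific information for multimodal fusion; and 4) semantic genomics imputation to handle missing data. The framework is designed specifically to integrate whole slide images with incomplete genomic data.
Result: Extensive experiments show that the proposed method is consistently superior to other state-of-the-art methods on multiple downstream tasks. It remains robust and effective in clinical scenarios where genomic data are incomplete or missing, supporting its practical value in real-world medical settings.
Conclusion: The study offers a flexible and robust solution for multimodal data integration in precision oncology, particularly suited to real-world clinical scenarios with incomplete genomic data. Through prototyping, multiview alignment, and missing-data handling, the framework opens a new route for multimodal medical data analysis with promising clinical applications.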
📄 Abstract
Multimodal approaches that integrate histology and genomics hold strong potential for precision oncology. However, phenotypic and genotypic heterogeneity limits the quality of intra-modal representations and hinders effective inter-modal integration. Furthermore, most existing methods overlook real-world clinical scenarios where genomics may be partially missing or entirely unavailable. We propose a flexible multimodal prototyping framework to integrate whole slide images and incomplete genomics for precision oncology. Our approach has four key components: 1) Biological Prototyping using text prompting and prototype-wise weighting; 2) Multiview Alignment through sample- and distribution-wise alignments; 3) Bipartite Fusion to capture both shared and modality-specific information for multimodal fusion; and 4) Semantic Genomics Imputation to handle missing data. Extensive experiments demonstrate the consistent superiority of the proposed method compared to other state-of-the-art approaches on multiple downstream tasks. The code is available at https://github.com/helenypzhang/Interpretable-Multimodal-Prototyping.
[4] WalkCLIP: Multimodal Learning for Urban Walkability Prediction
Shilong Xiang, JangHyeon Lee, Min Namgung, Yao-Yi Chiang
🧩 TL;DR
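This paper presents WalkCLIP, a multimodal framework that predicts urban walkability by integrating satellite imagery, street view imagery, and population dynamics data. It overcomes the limitations of single-source approaches and outperforms baseline methods in the Minneapolis-Saint Paul area.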
📘 Detailed Summary
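Motivation: Traditional walkability assessments rely on surveys and field audits, which are costly and hard to scale. Existing studies use single-source approaches based on satellite imagery, street view imagery, or population indicators, each with its own limitation: satellite data lack the pedestrian perspective, street view imagery lacks spatial context, and population dynamics data do not reflect the visual form of the environment. These complementary viewpoints therefore need to be integrated to assess the walking environment comprehensively.
Method: WalkCLIP is a multimodal framework that first learns walkability-aware vision-language representations from GPT-4o-generated image captions, then refines these representations with a spatial aggregation module that incorporates neighborhood context, and finally fuses the resulting features with representations from a population dynamics foundation model to predict urban walkability.
Result: Evaluated at 4,660 locations in the Minneapolis-Saint Paul area, WalkCLIP outperforms unimodal and multimodal baselines in both predictive accuracy and spatial alignment. The results show that integrating visual and behavioral signals yields reliable predictions of walking environment quality.
Conclusion: The study shows that integrating complementary multimodal data sources, including visual and behavioral signals, allows a more comprehensive and reliable assessment of urban walkability. WalkCLIP offers a new approach to scalable urban environment assessment, demonstrates the potential of multimodal learning in urban analytics, and provides data support for urban planning, public health, and sustainable development.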
📄 Abstract
Urban walkability is a cornerstone of public health, sustainability, and quality of life. Traditional walkability assessments rely on surveys and field audits, which are costly and difficult to scale. Recent studies have used satellite imagery, street view imagery, or population indicators to estimate walkability, but these single-source approaches capture only one dimension of the walking environment. Satellite data describe the built environment from above, but overlook the pedestrian perspective. Street view imagery captures conditions at the ground level, but lacks broader spatial context. Population dynamics reveal patterns of human activity but not the visual form of the environment. We introduce WalkCLIP, a multimodal framework that integrates these complementary viewpoints to predict urban walkability. WalkCLIP learns walkability-aware vision-language representations from GPT-4o generated image captions, refines these representations with a spatial aggregation module that incorporates neighborhood context, and fuses the resulting features with representations from a population dynamics foundation model. Evaluated at 4,660 locations throughout Minneapolis-Saint Paul, WalkCLIP outperforms unimodal and multimodal baselines in both predictive accuracy and spatial alignment. These results show that the integration of visual and behavioral signals yields reliable predictions of the walking environment.
[5] PAT3D: Physics-Augmented Text-to-3D Scene Generation
Guying Lin, Kemeng Huang, Michael Liu, Ruihan Gao, Hanke Chen, Lyuhao Chen, Beijia Lu, Taku Komura, Yuan Liu, Jun-Yan Zhu, Minchen Li
🧩 TL;DR
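This paper presents PAT3D, the first physics-augmented text-to-3D scene generation framework that combines vision-language models with physics simulation, producing 3D scenes that are physically plausible, simulation-ready, and intersection-free.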
📘 Detailed Summary
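Motivation: Existing text-to-3D scene generation methods usually ignore physical plausibility, so the generated scenes suffer from object interpenetration and physical instability and cannot be used directly in downstream simulation tasks. This work aims to close that gap by integrating physics simulation to generate physically plausible, simulation-ready 3D scenes.
Method: Given a text prompt, PAT3D first generates 3D objects and infers their spatial relations, organizes them into a hierarchical scene tree, and then converts the tree into initial conditions for simulation. A differentiable rigid-body simulator ensures realistic object interactions under gravity and drives the scene to a static equilibrium without interpenetration. A simulation-in-the-loop optimization procedure further improves semantic consistency with the input prompt while guaranteeing physical stability and non-intersection.
Result: Experiments show that PAT3D substantially outperforms existing methods in physical plausibility, semantic consistency, and visual quality. Beyond high-quality generation, it produces simulation-ready 3D scenes that can be used directly in downstream tasks such as scene editing and robotic manipulation.
Conclusion: By integrating physics simulation into the text-to-3D generation pipeline, PAT3D establishes a new paradigm of physics-augmented scene generation. It addresses the physical-plausibility limitations of existing methods and offers a new route to creating simulation-ready 3D content, with significant practical value.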
📄 Abstract
We introduce PAT3D, the first physics-augmented text-to-3D scene generation framework that integrates vision-language models with physics-based simulation to produce physically plausible, simulation-ready, and intersection-free 3D scenes. Given a text prompt, PAT3D generates 3D objects, infers their spatial relations, and organizes them into a hierarchical scene tree, which is then converted into initial conditions for simulation. A differentiable rigid-body simulator ensures realistic object interactions under gravity, driving the scene toward static equilibrium without interpenetrations. To further enhance scene quality, we introduce a simulation-in-the-loop optimization procedure that guarantees physical stability and non-intersection, while improving semantic consistency with the input prompt. Experiments demonstrate that PAT3D substantially outperforms prior approaches in physical plausibility, semantic consistency, and visual quality. Beyond high-quality generation, PAT3D uniquely enables simulation-ready 3D scenes for downstream tasks such as scene editing and robotic manipulation. Code and data will be released upon acceptance.
[6] DialBench: Towards Accurate Reading Recognition of Pointer Meter using Large Foundation Models
Futian Wang, Chaoliu Weng, Xiao Wang, Zhen Chen, Zhicheng Zhao, Jin Tang
🧩 TL;DR
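This paper presents RPM-10K, a large-scale benchmark dataset for pointer meter reading recognition, and MRLM, a novel vision-language model built on physical relation injection that aligns perception with physical reasoning by encoding geometric and causal relationships.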
📘 Detailed Summary
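Motivation: Accurate reading recognition of pointer meters in smart power systems faces challenges such as reflections, occlusions, dynamic viewing angles, and pointers lying too close to scale markings, and existing methods are fragile under these conditions. The field also lacks large-scale datasets to support the development of robust algorithms, which limits progress and adoption.
Method: The paper first builds RPM-10K, a large-scale benchmark dataset of 10,730 meter images that reflects the key challenges of real applications. On top of this dataset, it proposes MRLM, a vision-language model based on physical relation injection. MRLM explicitly encodes the geometric and causal relationships between the pointer and the scale and, through cross-attentional fusion and adaptive expert selection, aligns perception with physical reasoning, interpreting dial configurations from a world-model perspective and generating precise numeric readings.
Result: Extensive experiments on the newly proposed benchmark validate the effectiveness of the framework. The results indicate that MRLM copes with complex scenes involving reflections, occlusions, and dynamic viewing angles and achieves notable gains in pointer meter reading recognition, supporting the value of physical relation injection.
Conclusion: The study fills the gap in large-scale datasets for pointer meter reading recognition and, more importantly, proposes a new paradigm for explicitly encoding physical relations into vision-language models. By aligning perception with physical reasoning, the approach provides a more robust solution for meter reading in complex scenes and offers guidance for future smart power systems and other industrial vision applications.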
📄 Abstract
The precise reading recognition of pointer meters plays a key role in smart power systems, but existing approaches remain fragile due to challenges such as reflections, occlusions, dynamic viewing angles, and overlap between thin pointers and scale markings. Up to now, this area has lacked large-scale datasets to support the development of robust algorithms. To address these challenges, this paper first presents a new large-scale benchmark dataset for dial reading, termed RPM-10K, which contains 10,730 meter images that fully reflect the aforementioned key challenges. Built upon the dataset, we propose a novel vision-language model for pointer meter reading recognition, termed MRLM, based on physical relation injection. Instead of exhaustively learning image-level correlations, MRLM explicitly encodes the geometric and causal relationships between the pointer and the scale, aligning perception with physical reasoning in the spirit of world-model perspectives. Through cross-attentional fusion and adaptive expert selection, the model learns to interpret dial configurations and generate precise numeric readings. Extensive experiments on the newly proposed benchmark dataset validate the effectiveness of our proposed framework. Both the dataset and source code will be released at https://github.com/Event-AHU/DialBench
[7] PPBoost: Progressive Prompt Boosting for Text-Driven Medical Image Segmentation
Xuchen Li, Hengrui Gu, Mohan Zhang, Qin Liu, Zhen Tan, Xinyuan Zhu, Huixue Zhou, Tianlong Chen, Kaixiong Zhou
🧩 TL;DR
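This paper proposes PPBoost, a framework that converts weak text prompts into strong spatial visual prompts and substantially improves medical image segmentation under zero-shot conditions. It uses a vision-language model to generate initial pseudo bounding boxes, applies uncertainty-aware filtering and pseudo-labeled detector training, and ultimately strengthens the localization ability of existing segmentation models.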
📘 Detailed Summary
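Motivation: In medical image segmentation, text-prompted foundation models are intuitive but lack spatial precision and are vulnerable to domain shift, while visual-prompted models perform well but require precise bounding-box annotations that are costly and hard to obtain in clinical practice. This work aims to resolve the tension between the limited spatial precision of text-prompted models and the demanding annotation requirements of visual-prompted models.
Method: PPBoost adopts a progressive prompt-boosting strategy. It first uses a vision-language model to generate initial pseudo bounding boxes from textual descriptions and filters unreliable predictions with an uncertainty-aware criterion. The retained image-bbox pairs are used to train a pseudo-labeled detector that produces high-quality bounding boxes. At inference, the boxes are further refined by appropriately expanding them to cover the target anatomical structures, and the enhanced, spatially grounded bbox prompts then guide existing segmentation models to generate dense masks.
Result: Across three datasets spanning diverse modalities and anatomies, PPBoost consistently outperforms text-prompted and visual-prompted baselines on Dice and Normalized Surface Distance, and surpasses few-shot segmentation models without using labeled data. The framework generalizes to multiple typical visual segmentation backbones and shows strong cross-domain adaptability.
Conclusion: PPBoost converts weak text signals into strong spatial visual prompts and achieves substantial gains in medical image segmentation under a strict zero-shot regime. The study provides an effective framework for combining the strengths of text and visual prompts, shows that high-quality segmentation from weak supervisory signals is feasible in clinical settings lacking precise annotations, and opens a new direction for prompt engineering in medical image analysis.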
📄 Abstract
Text-prompted foundation models for medical image segmentation offer an intuitive way to delineate anatomical structures from natural language queries, but their predictions often lack spatial precision and degrade under domain shift. In contrast, visual-prompted models achieve strong segmentation performance across diverse modalities by leveraging the spatial cues of precise bounding-box (bbox) prompts to guide the segmentation of target lesions. However, it is costly and challenging to obtain precise visual prompts in clinical practice. We propose PPBoost (Progressive Prompt-Boosting), a framework that bridges these limitations by transforming weak text-derived signals into strong, spatially grounded visual prompts, operating under a strict zero-shot regime with no image- or pixel-level segmentation labels. PPBoost first uses a vision-language model to produce initial pseudo-bboxes conditioned on the textual object descriptions and applies an uncertainty-aware criterion to filter unreliable predictions. The retained image-bbox pairs are then leveraged to train a pseudo-labeled detector, producing high-quality bboxes for the query images. During inference, PPBoost further refines the generated bboxes by appropriately expanding them to tightly cover the target anatomical structures. The enhanced, spatially grounded bbox prompts guide existing segmentation models to generate final dense masks, effectively amplifying weak text cues into strong spatial guidance. Across three datasets spanning diverse modalities and anatomies, PPBoost consistently improves Dice and Normalized Surface Distance over text- and visual-prompted baselines and, notably, surpasses few-shot segmentation models without using labeled data. PPBoost can generalize to multiple typical visual segmentation model backbones.
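Two of the pieces named above, bbox expansion at inference and the Dice metric, are simple enough to sketch. The expansion rule below (grow each side by a fixed fraction of the box size, then clip to the image) and the `ratio` parameter are assumptions for illustration; the abstract only says the boxes are "appropriately" expanded. The Dice function follows the standard definition 2|A∩B| / (|A| + |B|).

```python
def expand_bbox(bbox, ratio, width, height):
    """Grow an (x0, y0, x1, y1) box by `ratio` of its size per side, clipped to the image.

    The fixed-ratio rule is an assumption of this sketch.
    """
    x0, y0, x1, y1 = bbox
    dx = (x1 - x0) * ratio
    dy = (y1 - y0) * ratio
    return (max(0.0, x0 - dx), max(0.0, y0 - dy),
            min(float(width), x1 + dx), min(float(height), y1 + dy))


def dice(pred, target):
    """Dice coefficient between two binary masks given as nested lists of 0/1."""
    inter = sum(p & t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    total = sum(map(sum, pred)) + sum(map(sum, target))
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * inter / total
```

For example, a prediction covering two pixels against a one-pixel target that it fully contains scores 2·1 / (2 + 1) = 2/3.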
[8] Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?
Apratim Bhattacharyya, Bicheng Xu, Sanjay Haresh, Reza Pourreza, Litian Liu, Sunny Panchal, Pulkit Madan, Leonid Sigal, Roland Memisevic
🧩 TL;DR
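This paper introduces the Qualcomm Interactive Cooking benchmark dataset and the LiveMamba model to address the weakness of multimodal LLMs at live, interactive step-by-step guidance, providing the first dedicated benchmark and a strong baseline for live, situated coaching.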
📘 Detailed Summary
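Motivation: Current multimodal LLMs have improved conversational abilities but lack the key capability of providing live, interactive step-by-step guidance. This requires a model not only to deliver instructions but also to detect successful execution, identify mistakes, and alert the user in time, all while responding asynchronously to a live video stream.
Method: The study introduces Qualcomm Interactive Cooking, a benchmark dataset built on CaptainCook4D that includes user mistakes made during task execution. It features densely annotated, timed instructions and feedback messages, in particular mistake alerts precisely timestamped to their visual occurrence in the video. The study also proposes LiveMamba, a streaming multimodal LLM designed for interactive instructional guidance.
Result: State-of-the-art multimodal LLMs are evaluated on the Qualcomm Interactive Cooking benchmark, and LiveMamba is shown to be a strong baseline for live interactive guidance. The benchmark is the first testbed dedicated to evaluating live, situated coaching.
Conclusion: This work provides the first dedicated benchmark and a strong baseline for developing and evaluating live, situated coaching systems, advancing multimodal LLMs toward live interactive guidance and laying groundwork for a key capability of future AI assistants.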
📄 Abstract
Multi-modal Large Language Models (LLMs) have advanced conversational abilities but struggle with providing live, interactive step-by-step guidance, a key capability for future AI assistants. Effective guidance requires not only delivering instructions but also detecting their successful execution, as well as identifying and alerting users to mistakes, all of which has to happen in real time. This requires models that are not turn-based but can react asynchronously to a video stream, as well as video data showing users performing tasks, including mistakes and their corrections. To this end, we introduce Qualcomm Interactive Cooking, a new benchmark and dataset built upon CaptainCook4D, which contains user mistakes during task execution. Our dataset and benchmark feature densely annotated, timed instructions and feedback messages, specifically including mistake alerts precisely timestamped to their visual occurrence in the video. We evaluate state-of-the-art multi-modal LLMs on the Qualcomm Interactive Cooking benchmark and introduce LiveMamba, a streaming multi-modal LLM designed for interactive instructional guidance. This work provides the first dedicated benchmark and a strong baseline for developing and evaluating live, situated coaching.
[9] MedEyes: Learning Dynamic Visual Focus for Medical Progressive Diagnosis
Chunzheng Zhu, Yangfang Lin, Shen Chen, Yijun Wang, Jianxin Lin
🧩 TL;DR
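This paper proposes MedEyes, a novel reinforcement learning framework that dynamically models clinician-style diagnostic reasoning to address the tendency of existing vision-language models to produce superficially coherent but clinically inaccurate reasoning paths in medical diagnosis. It achieves an average improvement of 8.5% across multiple medical VQA benchmarks.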
📘 Detailed Summary
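Motivation: Accurate medical diagnosis often involves progressive visual focusing and iterative reasoning, as is characteristic of clinical workflows. However, existing vision-language models trained with reinforcement learning with verifiable rewards follow a purely on-policy learning paradigm, which tends to reinforce superficially coherent but clinically inaccurate reasoning paths and fails to model clinician-style diagnostic reasoning.
Method: MedEyes incorporates off-policy expert guidance, converting expert visual search trajectories into structured external behavioral signals that steer the model toward clinically aligned visual reasoning. A Gaze-guided Reasoning Navigator emulates the diagnostic process with a dual-mode exploration strategy: scanning for systematic abnormality localization and drilling for detailed regional analysis. A Confidence Value Sampler uses nucleus sampling and adaptive termination to create diverse yet credible exploration paths. Finally, a dual-stream GRPO optimization framework decouples on-policy and off-policy learning signals, mitigating reward assimilation and entropy collapse.
Result: Experiments show that MedEyes achieves an average improvement of 8.5% across multiple medical visual question answering benchmarks, supporting the framework's potential for building interpretable medical AI systems. In particular, the expert-guided visual reasoning mechanism improves diagnostic accuracy and clinical relevance.
Conclusion: The study shows that integrating expert visual search trajectories and structured behavioral signals can guide AI models toward clinically aligned diagnostic reasoning. MedEyes improves performance on medical VQA tasks and offers a new technical route to interpretable medical AI systems that fit clinical workflows, balancing expert imitation with autonomous discovery.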
📄 Abstract
Accurate medical diagnosis often involves progressive visual focusing and iterative reasoning, characteristics commonly observed in clinical workflows. While recent vision-language models demonstrate promising chain-of-thought (CoT) reasoning capabilities via reinforcement learning with verifiable rewards (RLVR), their purely on-policy learning paradigm tends to reinforce superficially coherent but clinically inaccurate reasoning paths. We propose MedEyes, a novel reinforcement learning framework that dynamically models clinician-style diagnostic reasoning by progressively attending to and interpreting relevant medical image regions. By incorporating off-policy expert guidance, MedEyes converts expert visual search trajectories into structured external behavioral signals, guiding the model toward clinically aligned visual reasoning. We design the Gaze-guided Reasoning Navigator (GRN) to emulate the diagnostic process through a dual-mode exploration strategy, scanning for systematic abnormality localization and drilling for detailed regional analysis. To balance expert imitation and autonomous discovery, we introduce the Confidence Value Sampler (CVS), which employs nucleus sampling and adaptive termination to create diverse yet credible exploration paths. Finally, the dual-stream GRPO optimization framework decouples on-policy and off-policy learning signals, mitigating reward assimilation and entropy collapse. Experiments demonstrate that MedEyes achieves an average performance improvement of +8.5% across multiple medical VQA benchmarks, validating MedEyes's potential in building interpretable medical AI systems.
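Nucleus (top-p) sampling, which the Confidence Value Sampler is said to employ, can be sketched in a few lines. This is the generic technique only, assuming a plain list of probabilities over candidate actions; how MedEyes combines it with confidence values and adaptive termination is not specified in the abstract and is not modelled here.

```python
import random


def nucleus_sample(probs, top_p=0.9, rng=random):
    """Sample an index from the smallest set of outcomes whose mass reaches top_p."""
    # Rank outcomes from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for i in ranked:
        nucleus.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Draw proportionally to probability within the nucleus
    # (equivalent to renormalising by the nucleus mass).
    r = rng.random() * mass
    for i in nucleus:
        r -= probs[i]
        if r <= 0:
            return i
    return nucleus[-1]
```

Lower `top_p` values restrict sampling to the most probable options, while values near 1 allow more diverse exploration paths.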
[10] Intra-Class Probabilistic Embeddings for Uncertainty Estimation in Vision-Language Models
Zhenxiang Lin, Maryam Haghighat, Will Browne, Dimity Miller
🧩 TL;DR
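This paper proposes ICPE, a training-free, post-hoc uncertainty estimation method that improves the reliability of contrastive vision-language models. It detects erroneous predictions by measuring the consistency of visual features within a class and significantly outperforms existing baselines.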
📘 Detailed Summary
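Motivation: Vision-language models such as CLIP perform well at open-vocabulary classification but tend to assign high confidence scores to misclassifications, which limits their reliability in safety-critical applications and calls for an effective error detection mechanism.
Method: The paper proposes a training-free, post-hoc uncertainty estimation method that combines feature projection with multivariate Gaussians to create class-specific probabilistic embeddings and measures visual feature consistency within a class. The method is VLM-agnostic and requires no fine-tuning.
Result: Extensive experiments on ImageNet, Flowers102, Food101, EuroSAT, and DTD show state-of-the-art error detection performance, significantly outperforming both deterministic and probabilistic VLM baselines, while working effectively with as few as 10 training images per class.
Conclusion: The method is a simple and effective post-hoc solution that strengthens the robustness of vision-language models under distribution shift and provides a practical tool for reliable deployment in safety-critical applications, while remaining model-agnostic and training-free.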
📄 Abstract
Vision-language models (VLMs), such as CLIP, have gained popularity for their strong open vocabulary classification performance, but they are prone to assigning high confidence scores to misclassifications, limiting their reliability in safety-critical applications. We introduce a training-free, post-hoc uncertainty estimation method for contrastive VLMs that can be used to detect erroneous predictions. The key to our approach is to measure visual feature consistency within a class, using feature projection combined with multivariate Gaussians to create class-specific probabilistic embeddings. Our method is VLM-agnostic, requires no fine-tuning, demonstrates robustness to distribution shift, and works effectively with as few as 10 training images per class. Extensive experiments on ImageNet, Flowers102, Food101, EuroSAT and DTD show state-of-the-art error detection performance, significantly outperforming both deterministic and probabilistic VLM baselines. Code is available at https://github.com/zhenxianglin/ICPE.
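The idea of scoring a prediction by how typical the image feature is for its predicted class can be illustrated with a deliberately simplified sketch. It assumes a diagonal-covariance Gaussian per class and skips the feature projection step, so it is a simplification rather than the ICPE method; the function names are hypothetical.

```python
import math


def fit_diag_gaussian(features, eps=1e-6):
    """Per-dimension mean and variance of a list of equal-length feature vectors.

    A diagonal covariance is an assumption of this sketch; the abstract
    describes multivariate Gaussians combined with feature projection.
    """
    n, dim = len(features), len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dim)]
    # eps keeps the variance strictly positive for tiny support sets.
    var = [sum((f[d] - mean[d]) ** 2 for f in features) / n + eps
           for d in range(dim)]
    return mean, var


def neg_log_likelihood(x, mean, var):
    """Negative log-density of x under a diagonal Gaussian; higher means less typical."""
    return 0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                     for xi, m, v in zip(x, mean, var))
```

In use, one would fit a Gaussian per class from a handful of support features (the abstract mentions as few as 10 images per class) and flag predictions whose feature has a high negative log-likelihood under the predicted class.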
[11] Layover or Direct Flight: Rethinking Audio-Guided Image Segmentation
Joel Alberto Santos, Zongwei Wu, Xavier Alameda-Pineda, Radu Timofte
🧩 TL;DR
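This paper proposes grounding objects directly from audio, challenging the conventional audio-visual alignment paradigm that relies on text transcription. Experiments show that the direct approach can even outperform transcription-based methods in some cases and is more robust to linguistic variability.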
📘 Detailed Summary
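Motivation: Current transcription-based approaches to audio-visual object grounding typically transcribe speech to text, extract keywords, and then ground the object with pretrained text-vision models. The authors question the efficiency and robustness of this pipeline and explore whether direct audio-visual alignment is possible without relying on text.
Method: The study simplifies the task by focusing on object grounding from single-word spoken instructions. It introduces a new audio-based grounding dataset covering a wide variety of objects and diverse human accents, then adapts and benchmarks several models from related audio-visual fields to explore the feasibility of direct audio grounding.
Result: Experiments show that grounding objects directly from audio is not only feasible but in some cases even outperforms transcription-based methods, with stronger robustness to linguistic variability in particular. This challenges the conventional paradigm of relying on text as an intermediate representation.
Conclusion: The study encourages renewed interest in direct audio grounding and paves the way for more robust and efficient multimodal understanding systems. It indicates that bypassing text transcription can yield more direct and robust audio-visual alignment, especially when handling diverse accents and language variants.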
📄 Abstract
Understanding human instructions is essential for enabling smooth human-robot interaction. In this work, we focus on object grounding, i.e., localizing an object of interest in a visual scene (e.g., an image) based on verbal human instructions. Despite recent progress, a dominant research trend relies on using text as an intermediate representation. These approaches typically transcribe speech to text, extract relevant object keywords, and perform grounding using models pretrained on large text-vision datasets. However, we question both the efficiency and robustness of such transcription-based pipelines. Specifically, we ask: Can we achieve direct audio-visual alignment without relying on text? To explore this possibility, we simplify the task by focusing on grounding from single-word spoken instructions. We introduce a new audio-based grounding dataset that covers a wide variety of objects and diverse human accents. We then adapt and benchmark several models from closely related audio-visual fields. Our results demonstrate that direct grounding from audio is not only feasible but, in some cases, even outperforms transcription-based methods, especially in terms of robustness to linguistic variability. Our findings encourage a renewed interest in direct audio grounding and pave the way for more robust and efficient multimodal understanding systems.
[12] ICM-SR: Image-Conditioned Manifold Regularization for Image Super-Resolution
Junoh Kang, Donghun Ryu, Bohyung Han
🧩 TL;DR
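This paper proposes image-conditioned manifold regularization (ICM), which regularizes the output using structural information from colormaps and Canny edges. It addresses the conceptual misalignment and flawed generative prior of existing real-world image super-resolution methods based on text-conditioned diffusion models.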
📘 Detailed Summary
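Motivation: Existing real-world image super-resolution methods usually exploit the generative prior of text-to-image diffusion models, but defaulting to a text-conditioned manifold has two key limitations. Conceptually, it is misaligned with the task, since super-resolution should generate high-quality images directly tied to the low-quality inputs. Practically, the teacher model often produces images with color distortions and blurred edges, indicating a flawed generative prior.
Method: The paper proposes image-conditioned manifold regularization (ICM), which regularizes the output toward a manifold conditioned on sparse yet essential structural information, specifically a combination of colormap and Canny edges. This avoids the instability of conditioning directly on dense raw images while providing a task-aligned, stable regularization signal.
Result: Experiments confirm that the proposed regularization significantly improves super-resolution performance, particularly in perceptual quality, demonstrating its effectiveness for real-world applications. The authors commit to releasing the source code for reproducibility.
Conclusion: The study emphasizes the importance of choosing a suitable regularizing manifold for the task. By conditioning on sparse structural information rather than dense raw images, ICM maintains numerical stability and achieves conceptual alignment, providing a more effective regularization framework for diffusion-based real-world image super-resolution.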
📄 Abstract
Real world image super-resolution (Real-ISR) often leverages the powerful generative priors of text-to-image diffusion models by regularizing the output to lie on their learned manifold. However, existing methods often overlook the importance of the regularizing manifold, typically defaulting to a text-conditioned manifold. This approach suffers from two key limitations. Conceptually, it is misaligned with the Real-ISR task, which is to generate high quality (HQ) images directly tied to the low quality (LQ) images. Practically, the teacher model often reconstructs images with color distortions and blurred edges, indicating a flawed generative prior for this task. To correct these flaws and ensure conceptual alignment, a more suitable manifold must incorporate information from the images. While the most straightforward approach is to condition directly on the raw input images, their high information densities make the regularization process numerically unstable. To resolve this, we propose image-conditioned manifold regularization (ICM), a method that regularizes the output towards a manifold conditioned on the sparse yet essential structural information: a combination of colormap and Canny edges. ICM provides a task-aligned and stable regularization signal, thereby avoiding the instability of dense-conditioning and enhancing the final super-resolution quality. Our experiments confirm that the proposed regularization significantly enhances super-resolution performance, particularly in perceptual quality, demonstrating its effectiveness for real-world applications. We will release the source code of our work for reproducibility.
[13] OralGPT-Omni: A Versatile Dental Multimodal Large Language Model
Jing Hao, Yuci Liang, Lizhuo Lin, Yuxuan Fan, Wenkai Zhou, Kaixin Guo, Zanting Ye, Yanpeng Sun, Xinyu Zhang, Yanqi Yang, Qiankun Li, Hao Tang, James Kit-Hon Tsoi, Linlin Shen, Kuo Feng Hung
🧩 TL;DR
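This paper presents OralGPT-Omni, the first multimodal LLM specialized for dentistry. By building the TRACE-CoT clinical reasoning dataset and using a four-stage training paradigm, it substantially improves dental image analysis and understanding and achieves strong results on the newly proposed MMOral-Uni benchmark.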
📘 Detailed Summary
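Motivation: Although multimodal LLMs show great potential across many medical specialties, dentistry remains underexplored, mainly because of limited domain-specific data, scarce dental expert annotations, insufficient modality-specific modeling, and reliability challenges. A comprehensive and trustworthy analysis system dedicated to dentistry is therefore needed.
Method: The study constructs TRACE-CoT, a clinical reasoning dataset that mirrors dental radiologists' decision-making processes to explicitly capture dentists' diagnostic reasoning. It proposes a four-stage training paradigm that, combined with this reasoning supervision, strengthens the model's understanding and analysis of dental images. It also creates MMOral-Uni, the first unified multimodal benchmark for dental image analysis, comprising 2,809 open-ended question-answer pairs spanning five modalities and five tasks.
Result: OralGPT-Omni achieves an overall score of 51.84 on the MMOral-Uni benchmark and 45.31 on the MMOral-OPG benchmark, substantially outperforming GPT-5 and demonstrating strong performance on dental image analysis tasks.
Conclusion: The study advances intelligent dentistry and paves the way for future progress in dental image analysis. With a purpose-built model, a clinical reasoning dataset, and a comprehensive evaluation benchmark, it offers a systematic solution to the challenges of multimodal analysis in dentistry.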
📄 Abstract
Multimodal Large Language Models (MLLMs) have exhibited immense potential across numerous medical specialties; yet dentistry remains underexplored, in part due to limited domain-specific data, scarce dental expert annotations, insufficient modality-specific modeling, and challenges in reliability. In this paper, we present OralGPT-Omni, the first dental-specialized MLLM designed for comprehensive and trustworthy analysis across diverse dental imaging modalities and clinical tasks. To explicitly capture dentists' diagnostic reasoning, we construct TRACE-CoT, a clinically grounded chain-of-thought dataset that mirrors dental radiologists' decision-making processes. This reasoning supervision, combined with our proposed four-stage training paradigm, substantially strengthens the model's capacity for dental image understanding and analysis. In parallel, we introduce MMOral-Uni, the first unified multimodal benchmark for dental image analysis. It comprises 2,809 open-ended question-answer pairs spanning five modalities and five tasks, offering a comprehensive evaluation suite for MLLMs in digital dentistry. OralGPT-Omni achieves an overall score of 51.84 on the MMOral-Uni benchmark and 45.31 on the MMOral-OPG benchmark, dramatically outperforming GPT-5. Our work promotes intelligent dentistry and paves the way for future advances in dental image analysis. All code, benchmark, and models will be made publicly available.
[14] MRI-Based Brain Age Estimation with Supervised Contrastive Learning of Continuous Representation
Simon Joseph Clément Crête, Marta Kersten-Oertel, Yiming Xiao
🧩 TL;DR
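This study is the first to apply supervised contrastive learning with the Rank-N-Contrast loss to brain age estimation from T1w structural MRI. The proposed method better captures the continuous nature of neuromorphological change, and the brain age gap is examined as a potential biomarker in neurodegenerative diseases.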
📘 Detailed Summary
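Motivation: Existing deep-learning-based brain age estimation methods often fail to capture the continuous nature of neuromorphological changes, which can lead to suboptimal feature representations and results. Factors such as neurodegenerative diseases accelerate brain aging, and measuring this phenomenon accurately matters for clinical biomarker applications.
Method: The study applies, for the first time, supervised contrastive learning with the Rank-N-Contrast loss to brain age estimation from T1w structural MRI. It uses ResNet as the backbone and Grad-RAM to visually explain the regression results and improve interpretability.
Result: On a dataset with limited training samples, the proposed method achieves a mean absolute error of 4.27 years and an R² of 0.93, significantly outperforming conventional deep regression with the same ResNet backbone and performing comparably to or better than state-of-the-art methods trained on much larger datasets. Grad-RAM visualizations show that the RNC loss captures more nuanced age-related features than conventional regression.
Conclusion: The method improves the accuracy of brain age estimation and, through brain age gap analysis, reveals a correlation between the brain age gap and disease severity in patients with Alzheimer's disease and Parkinson's disease, demonstrating its potential clinical value as a biomarker for neurodegenerative disorders.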
📄 Abstract
MRI-based brain age estimation models aim to assess a subject's biological brain age based on information such as neuroanatomical features. Various factors, including neurodegenerative diseases, can accelerate brain aging, and measuring this phenomenon could serve as a potential biomarker for clinical applications. While deep learning (DL)-based regression has recently attracted major attention, existing approaches often fail to capture the continuous nature of neuromorphological changes, potentially resulting in sub-optimal feature representations and results. To address this, we propose, for the first time, to use supervised contrastive learning with the recent Rank-N-Contrast (RNC) loss to estimate brain age from widely used T1w structural MRI, and we leverage Grad-RAM to visually explain regression results. Experiments show that our proposed method achieves a mean absolute error (MAE) of 4.27 years and an $R^2$ of 0.93 with a limited dataset of training samples, significantly outperforming conventional deep regression with the same ResNet backbone while performing better than or comparably to state-of-the-art methods trained on significantly larger data. Furthermore, Grad-RAM revealed more nuanced features related to age regression with the RNC loss than with conventional deep regression. As an exploratory study, we employed the proposed method to estimate the gap between the biological and chronological brain ages in Alzheimer's disease and Parkinson's disease patients, and revealed a correlation between the brain age gap and disease severity, demonstrating its potential as a biomarker in neurodegenerative disorders.
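The two reported metrics and the brain age gap have standard definitions, which the following minimal sketch spells out. The function names are illustrative and the sample ages in the usage note are made up. The sign convention for the gap (predicted minus chronological) is the common one but is an assumption here.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted ages, in years."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def brain_age_gap(predicted_age, chronological_age):
    """Positive values mean the brain looks older than the subject's actual age."""
    return predicted_age - chronological_age
```

With made-up ages `[20, 40, 60, 80]` and predictions `[22, 38, 63, 79]`, the MAE is 2.0 years and R² is 1 - 18/2000 = 0.991.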
[15] MoE3D: Mixture of Experts meets Multi-Modal 3D Understanding
Yu Li, Yuenan Hou, Yingmei Wei, Xinge Zhu, Yuexin Ma, Wenqi Shao, Yanming Guo
🧩 TL;DR
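This paper proposes MoE3D, a Mixture-of-Experts (MoE) framework for multi-modal 3D understanding that deploys specialized expert networks to handle different modalities or cross-modal interactions, substantially improving multi-modal fusion performance.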
📘 Detailed Summary
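Motivation: Existing multi-modal fusion methods usually employ a single dense fusion network, which struggles with the significant heterogeneity and complexity across modalities and leads to suboptimal performance. A more effective multi-modal learning framework is needed to overcome this limitation.
Method: MoE3D integrates Mixture of Experts (MoE) into multi-modal learning, deploying a set of specialized expert networks that each handle a specific modality or mode of cross-modal interaction. An MoE-based transformer is designed to better exploit the complementary information in visual features, an information aggregation module enhances fusion performance, Top-1 gating ensures efficient processing, and a progressive pre-training strategy leverages semantic and 2D priors for good initialization.
Result: MoE3D achieves competitive performance on four prevalent 3D understanding tasks and surpasses the best competing method by 6.1 mIoU on the Multi3DRefer benchmark, demonstrating its effectiveness for multi-modal 3D understanding.
Conclusion: The study shows that a mixture-of-experts architecture can handle multi-modal heterogeneity effectively and that progressive pre-training helps exploit prior knowledge, providing a new, efficient fusion framework for multi-modal 3D understanding with broad application potential.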
📄 Abstract
Multi-modal 3D understanding is a fundamental task in computer vision. Previous multi-modal fusion methods typically employ a single, dense fusion network, which struggles to handle the significant heterogeneity and complexity across modalities, leading to suboptimal performance. In this paper, we propose MoE3D, which integrates Mixture of Experts (MoE) into the multi-modal learning framework. At its core, we deploy a set of specialized "expert" networks, each adept at processing a specific modality or a mode of cross-modal interaction. Specifically, the MoE-based transformer is designed to better utilize the complementary information hidden in the visual features. An information aggregation module is put forward to further enhance the fusion performance. Top-1 gating is employed so that a single expert within the expert group processes each feature, ensuring high efficiency. We further propose a progressive pre-training strategy to better leverage the semantic and 2D priors, thus equipping the network with a good initialization. Our MoE3D achieves competitive performance across four prevalent 3D understanding tasks. Notably, our MoE3D surpasses the top-performing counterpart by 6.1 mIoU on Multi3DRefer.
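Top-1 gating, as mentioned above, can be sketched generically: a linear gate scores every expert, and only the highest-scoring expert processes the feature. The linear gate, the scaling of the output by the gate probability, and all names below are assumptions following common MoE practice, not details confirmed by the abstract.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top1_moe(x, gate_weights, experts):
    """Route feature vector x to the single highest-scoring expert.

    gate_weights: one weight row per expert (a linear gate, assumed here).
    experts: list of callables mapping a feature vector to a feature vector.
    """
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    probs = softmax(logits)
    k = max(range(len(probs)), key=probs.__getitem__)
    # Only expert k runs; scaling by its gate probability is a common
    # convention and an assumption of this sketch.
    return k, [probs[k] * v for v in experts[k](x)]
```

Because only one expert is evaluated per feature, compute stays roughly constant as more experts are added, which is the efficiency argument for Top-1 routing.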
[16] HyperST: Hierarchical Hyperbolic Learning for Spatial Transcriptomics Prediction
Chen Zhang, Yilu An, Ying Chen, Hao Li, Xitong Ling, Lihao Liu, Junjun He, Yuxiang Lin, Zihui Wang, Rongshan Yu
🧩 TL;DR
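This paper proposes HyperST, a framework that models the inherent hierarchy of spatial transcriptomics data in hyperbolic space to perform multi-level cross-modal prediction of gene expression from histopathology images, significantly improving prediction performance.
📘 Detailed Summary
Motivation: Existing methods focus mainly on spot-level image-gene matching and fail to exploit the full hierarchical structure of spatial transcriptomics data, especially on the gene expression side. In addition, gene expression profiles carry more molecular detail that may lack salient visual correlates in histology images, creating an information asymmetry between modalities that requires sophisticated representation learning to bridge.
Method: HyperST models the data's inherent hierarchy in hyperbolic space. First, Multi-Level Representation Extractors capture spot-level and niche-level representations from each modality, providing context-aware information beyond individual spot-level image-gene pairs. Second, a Hierarchical Hyperbolic Alignment module unifies these representations, performing spatial alignment while hierarchically structuring the image and gene embeddings; this alignment strategy enriches the image representations with molecular semantics.
Result: HyperST achieves state-of-the-art performance on four public datasets from different tissues and significantly improves cross-modal prediction, paving the way for more scalable and accurate spatial transcriptomics prediction.
Conclusion: By modeling the inherent hierarchy of spatial transcriptomics data and exploiting the geometry of hyperbolic space, the study bridges the modality gap between histopathology images and gene expression, providing a new framework for cost-effective spatial transcriptomics prediction and moving the field toward greater scalability and accuracy.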
📄 Abstract
Spatial Transcriptomics (ST) merges the benefits of pathology images and gene expression, linking molecular profiles with tissue structure to analyze spot-level function comprehensively. Predicting gene expression from histology images is a cost-effective alternative to expensive ST technologies. However, existing methods mainly focus on spot-level image-to-gene matching but fail to leverage the full hierarchical structure of ST data, especially on the gene expression side, leading to incomplete image-gene alignment. Moreover, a challenge arises from the inherent information asymmetry: gene expression profiles contain more molecular details that may lack salient visual correlates in histological images, demanding a sophisticated representation learning approach to bridge this modality gap. We propose HyperST, a framework for ST prediction that learns multi-level image-gene representations by modeling the data's inherent hierarchy within hyperbolic space, a natural geometric setting for such structures. First, we design Multi-Level Representation Extractors to capture both spot-level and niche-level representations from each modality, providing context-aware information beyond individual spot-level image-gene pairs. Second, a Hierarchical Hyperbolic Alignment module is introduced to unify these representations, performing spatial alignment while hierarchically structuring image and gene embeddings. This alignment strategy enriches the image representations with molecular semantics, significantly improving cross-modal prediction. HyperST achieves state-of-the-art performance on four public datasets from different tissues, paving the way for more scalable and accurate spatial transcriptomics prediction.
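A short sketch may help show why hyperbolic space suits hierarchies. The abstract does not say which model of hyperbolic space HyperST uses; the Poincaré ball below is an assumption chosen because its distance has a simple closed form, and the sample points are made up.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball."""
    sq_norm = lambda x: sum(c * c for c in x)
    diff = sq_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - sq_norm(u)) * (1.0 - sq_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)

# Points near the origin behave almost as in Euclidean space ...
near_origin = poincare_distance([0.1, 0.0], [0.0, 0.1])
# ... while points near the boundary are far apart despite the same angular layout.
near_boundary = poincare_distance([0.9, 0.0], [0.0, 0.9])
# Distance from the origin to radius 0.5 equals 2 * artanh(0.5) = ln(3).
radial = poincare_distance([0.0, 0.0], [0.5, 0.0])
```

Distances grow rapidly toward the boundary, so coarse (niche-level) representations can sit near the origin while fine (spot-level) ones spread out near the edge, which is one common reading of why hyperbolic geometry fits tree-like structure.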
[17] PROMPTMINER: Black-Box Prompt Stealing against Text-to-Image Generative Models via Reinforcement Learning and Fuzz Optimization
Mingzhe Li, Renhao Zhang, Zhiyang Wen, Siqi Pan, Bruno Castro da Silva, Juan Zhai, Shiqing Ma
🧩 TL;DR
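This paper proposes PROMPTMINER, a black-box prompt stealing framework for text-to-image generative models. Its two-phase approach, reinforcement learning-based optimization followed by fuzzing-driven search, effectively recovers the original text prompt from images produced by unknown generators and achieves state-of-the-art performance across multiple datasets and diffusion models.
📘 Detailed Summary
Motivation: As text-to-image generative models advance, carefully crafted prompts have become valuable digital assets that face security and intellectual-property risks, notably prompt stealing attacks. Existing methods typically require white-box gradient access, need large-scale labeled datasets for supervised training, or rely solely on image captioning without explicit optimization, which limits their practicality and adaptability.
Method: PROMPTMINER is a two-phase black-box prompt stealing framework: the first phase uses reinforcement learning-based optimization to reconstruct the image's primary subject, and the second uses a fuzzing-driven search to recover stylistic modifiers. The framework needs neither gradient access nor large-scale labeled data, and decoupling subject recovery from style recovery improves efficiency and accuracy.
Result: Experiments across multiple datasets and diffusion models show that PROMPTMINER reaches CLIP similarity of up to 0.958 and SBERT textual alignment of up to 0.751, surpassing all baselines. On in-the-wild images from unknown generators, its CLIP similarity exceeds the strongest baseline by 7.5%, indicating better generalization. The framework also remains robust under defensive perturbations.
Conclusion: PROMPTMINER demonstrates the practical feasibility of black-box prompt stealing and offers insights for prompt security and intellectual-property protection. Beyond detecting malicious attacks, the framework can support beneficial applications such as data attribution, model provenance analysis, and watermark validation, laying groundwork for security evaluation and auditing tools for generative models.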
📄 Abstract
Text-to-image (T2I) generative models such as Stable Diffusion and FLUX can synthesize realistic, high-quality images directly from textual prompts. The resulting image quality depends critically on well-crafted prompts that specify both subjects and stylistic modifiers, which have become valuable digital assets. However, the rising value and ubiquity of high-quality prompts expose them to security and intellectual-property risks. One key threat is the prompt stealing attack, i.e., the task of recovering the textual prompt that generated a given image. Prompt stealing enables unauthorized extraction and reuse of carefully engineered prompts, yet it can also support beneficial applications such as data attribution, model provenance analysis, and watermarking validation. Existing approaches often assume white-box gradient access, require large-scale labeled datasets for supervised training, or rely solely on captioning without explicit optimization, limiting their practicality and adaptability. To address these challenges, we propose PROMPTMINER, a black-box prompt stealing framework that decouples the task into two phases: (1) a reinforcement learning-based optimization phase to reconstruct the primary subject, and (2) a fuzzing-driven search phase to recover stylistic modifiers. Experiments across multiple datasets and diffusion backbones demonstrate that PROMPTMINER achieves superior results, with CLIP similarity up to 0.958 and textual alignment with SBERT up to 0.751, surpassing all baselines. Even when applied to in-the-wild images with unknown generators, it outperforms the strongest baseline by 7.5 percent in CLIP similarity, demonstrating better generalization. Finally, PROMPTMINER maintains strong performance under defensive perturbations, highlighting remarkable robustness. Code: https://github.com/aaFrostnova/PromptMiner
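The second phase, a search over stylistic modifiers guided only by a black-box score, can be illustrated with a toy greedy loop. Everything here is hypothetical: the function names, the candidate list, and especially `toy_score`, which stands in for the image-similarity signal (such as CLIP similarity between the target image and a regenerated one) that a real system would query. It is not PROMPTMINER's actual fuzzing procedure.

```python
def fuzz_modifiers(subject, candidates, score, max_rounds=3):
    """Greedily append the modifier that most improves a black-box score."""
    prompt = [subject]
    best = score(prompt)
    for _ in range(max_rounds):
        trials = [(score(prompt + [m]), m) for m in candidates if m not in prompt]
        if not trials:
            break
        top_score, top_mod = max(trials)
        if top_score <= best:
            break  # no candidate helps any more
        prompt.append(top_mod)
        best = top_score
    return ", ".join(prompt), best

# Hypothetical stand-in for the hidden prompt behind the target image.
hidden = {"a castle", "oil painting", "golden hour"}

def toy_score(parts):
    """Jaccard overlap with the hidden prompt; a real scorer would compare images."""
    parts = set(parts)
    return len(parts & hidden) / len(parts | hidden)

recovered, final_score = fuzz_modifiers(
    "a castle", ["oil painting", "anime", "golden hour", "low poly"], toy_score
)
```

The loop only ever calls `score`, so it needs no gradients and no labeled data, which mirrors the black-box constraint described in the abstract.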
[18] From Compound Figures to Composite Understanding: Developing a Multi-Modal LLM from Biomedical Literature with Medical Multiple-Image Benchmarking and Validation
Zhen Chen, Yihang Fu, Gabriel Madera, Mauro Giuffre, Serina Applebaum, Hyunjae Kim, Hua Xu, Qingyu Chen
🧩 TL;DR
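This study proposes M3LLM, a medical multi-modal large language model capable of composite multi-image understanding, which uses compound figures from biomedical literature to address the scarcity of training data for multi-image analysis in medical MLLMs.
📘 Detailed Summary
Motivation: Existing medical MLLMs are largely confined to single-image understanding and cannot meet the diagnostic needs of clinical practice, which often requires synthesizing information across images from multiple modalities or time points. The lack of large-scale, high-quality annotated multi-image training data has hindered the development of such models.
Method: The authors use license-permissive compound figures from biomedical literature as a data source and design a five-stage, context-aware instruction generation paradigm built on a divide-and-conquer strategy that decomposes multi-image analysis into manageable sub-tasks. M3LLM is built by parsing over 237,000 compound figures together with their contextual text.
Result: M3LLM significantly outperforms both general-purpose and specialized medical MLLMs in multi-image, single-image, text-only, and multi-choice scenarios, performs strongly on the PMC-MI-Bench benchmark, and generalizes well to longitudinal chest X-ray analysis on the MIMIC dataset.
Conclusion: The work establishes a scalable and efficient paradigm for developing medical MLLMs, enabling models to learn the complex spatial, temporal, and cross-modal relationships in compound figures, bridging biomedical literature and real-world clinical applications, and advancing medical AI toward composite reasoning.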
📄 Abstract
Multi-modal large language models (MLLMs) have shown promise in advancing healthcare. However, most existing models remain confined to single-image understanding, which greatly limits their applicability in clinical workflows. In practice, medical diagnosis and progression often require synthesizing information across multiple images from different modalities or time points. The development of medical MLLMs capable of such multi-image understanding has been hindered by the lack of large-scale, high-quality annotated training data. To address this limitation, we propose a novel framework that leverages license-permissive compound images in biomedical literature, as a rich yet underutilized data source for multi-image analysis. Specifically, we design a five-stage, context-aware instruction generation paradigm underpinned by a divide-and-conquer strategy. By decomposing multi-image analysis into manageable sub-tasks, this paradigm empowers MLLMs to move beyond single-panel analysis and provide a composite understanding by learning the complex spatial, temporal, and cross-modal relationships inherent in these compound figures. By parsing over 237,000 compound figures and their contextual text for instruction generation, we develop M3LLM, a medical multi-image multi-modal large language model. For benchmarking, we construct PMC-MI-Bench for composite understanding, manually validated by medical experts. Extensive experiments show that M3LLM significantly outperforms both general-purpose and specialized medical MLLMs across multi-image, single-image, text-only, and multi-choice scenarios. Notably, M3LLM exhibits strong generalization to longitudinal chest X-ray analysis using the MIMIC dataset. This work establishes a scalable and efficient paradigm for developing medical MLLMs capable of composite reasoning, bridging the gap between biomedical literature and real-world clinical applications.
[19] GA2-CLIP: Generic Attribute Anchor for Efficient Prompt Tuning in Video-Language Models
Bin Wang, Ruotong Hu, Wenqian Wang, Wentong Li, Mingliang Gao, Runmin Cong, Wei Zhang
🧩 TL;DR
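This paper proposes a plug-and-play coupling prompt learning framework that introduces externally supervised prompts to mitigate the semantic space narrowing that occurs when vision-language models are fine-tuned on video tasks, significantly improving generalization to unseen classes.
📘 Detailed Summary
Motivation: Fine-tuning vision-language models on video tasks impairs their generalization to unseen classes. Conventional methods mitigate this forgetting effect by regularizing the gap between hand-crafted prompts and soft prompts, but doing so weakens the learning ability of the soft prompts, so the narrowing of the semantic space during fine-tuning still needs to be addressed.
Method: The coupling prompt learning framework introduces prompts pre-trained on other datasets as hard prompt tokens in the textual prompt; these are concatenated with soft prompt tokens and coupled through a learnable mapping layer, forming a competitive prompting mechanism that keeps the semantic space from overfitting to the supervised categories. In addition, carefully designed irrelevant video sets and negative prompts serve as generic attribute anchors that maintain the generic relevance of attributes in the pre-trained semantic space.
Result: Experiments on video tasks show that the method significantly outperforms state-of-the-art prompt tuning approaches on generalization benchmarks, particularly on base-to-new class prediction, effectively improving generalization to unseen classes.
Conclusion: Through the competitive prompting mechanism and generic attribute anchors, the study mitigates semantic space narrowing when fine-tuning vision-language models on video tasks, provides an effective way to improve generalization, and highlights the importance of externally supervised prompts and attribute-preserving mechanisms in prompt learning.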
📄 Abstract
Visual and textual soft prompt tuning can effectively improve the adaptability of Vision-Language Models (VLMs) in downstream tasks. However, fine-tuning on video tasks impairs the model's generalization ability to unseen classes. Existing methods attempt to mitigate this forgetting effect by regularizing the gap between hand-crafted prompts and soft prompts, but this also weakens the learning ability of soft prompts. To address this challenge, we propose a plug-and-play coupling prompt learning framework to optimize the generalization performance of V-L models in video tasks, with the core motivation of mitigating semantic space narrowing during fine-tuning by introducing an externally supervised prompt. Specifically, for textual prompts, we introduce pre-trained prompts from other datasets as hard prompt tokens. These are concatenated with soft prompt tokens and coupled via a learnable mapping layer. This competitive prompting approach prevents the semantic space from overfitting to supervised categories. In addition, we introduce a set of well-designed irrelevant video sets and negative prompts as generic attribute anchors to maintain the generic relevance of the attributes in the pre-trained semantic space, thus preserving the generalization ability. Experiments on video tasks demonstrate that our method significantly outperforms state-of-the-art prompt tuning approaches across generalization benchmarks, particularly on base-to-new class prediction.
[20] ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering
Alberto Compagnoni, Marco Morini, Sara Sarto, Federico Cocchi, Davide Caffagni, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
🧩 TL;DR
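This paper proposes ReAG, a novel reasoning-augmented multimodal retrieval-augmented generation method that combines coarse- and fine-grained retrieval with a critic model that filters irrelevant passages, significantly improving performance on knowledge-intensive visual question answering.
📘 Detailed Summary
Motivation: Current multimodal large language models perform poorly on domain-specific or knowledge-intensive queries, while existing retrieval-based VQA methods suffer from low retrieval precision, noisy passages, and limited reasoning, so a more effective retrieval-augmented method is needed to supply high-quality context.
Method: ReAG combines coarse- and fine-grained retrieval, introduces a critic model that filters irrelevant passages to ensure high-quality context, and follows a multi-stage training strategy in which reinforcement learning strengthens reasoning over retrieved content while supervised fine-tuning serves only as a cold start.
Result: Extensive experiments on Encyclopedic-VQA and InfoSeek show that ReAG significantly outperforms prior methods, improving answer accuracy and providing interpretable reasoning grounded in retrieved evidence.
Conclusion: The study shows that reasoning-augmented retrieval-augmented generation can effectively address missing knowledge in knowledge-intensive multimodal tasks, offers a new approach for applying MLLMs in specialized domains, and underscores the value of combining high-quality retrieved context with reinforcement learning-based reasoning.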
📄 Abstract
Multimodal Large Language Models (MLLMs) have shown impressive capabilities in jointly understanding text, images, and videos, often evaluated via Visual Question Answering (VQA). However, even state-of-the-art MLLMs struggle with domain-specific or knowledge-intensive queries, where relevant information is underrepresented in pre-training data. Knowledge-based VQA (KB-VQA) addresses this by retrieving external documents to condition answer generation, but current retrieval-augmented approaches suffer from low precision, noisy passages, and limited reasoning. To address this, we propose ReAG, a novel Reasoning-Augmented Multimodal RAG approach that combines coarse- and fine-grained retrieval with a critic model that filters irrelevant passages, ensuring high-quality additional context. The model follows a multi-stage training strategy leveraging reinforcement learning to enhance reasoning over retrieved content, while supervised fine-tuning serves only as a cold start. Extensive experiments on Encyclopedic-VQA and InfoSeek demonstrate that ReAG significantly outperforms prior methods, improving answer accuracy and providing interpretable reasoning grounded in retrieved evidence. Our source code is publicly available at: https://github.com/aimagelab/ReAG.
[21] DualVLA: Building a Generalizable Embodied Agent via Partial Decoupling of Reasoning and Action
Zhen Fang, Zhuoyang Liu, Jiaming Liu, Hao Chen, Yu Zeng, Shiting Huang, Zehui Chen, Lin Chen, Shanghang Zhang, Feng Zhao
🧩 TL;DR
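This paper proposes DualVLA to address action degeneration in vision-language-action models. Using dual-layer data pruning and a dual-teacher adaptive distillation strategy, it strengthens action performance while preserving reasoning ability, and it introduces VLA Score for fine-grained evaluation.
📘 Detailed Summary
Motivation: Building generalist vision-language-action models currently suffers from action degeneration: when a specialist VLA model is fine-tuned on a mixture that includes multimodal data to restore reasoning ability, its action performance drops markedly relative to the specialist model, limiting the practical effectiveness of generalist VLA models.
Method: DualVLA uses a dual-layer data pruning method to remove redundant embodied reasoning data so that it does not harm action learning, and a dual-teacher adaptive distillation strategy that assigns different supervision signals to different data domains while maintaining reasoning ability. The authors also propose the VLA Score metric, which decouples VLA capability into four dimensions (reasoning, intention, action, and alignment) for fine-grained evaluation.
Result: Experiments show that DualVLA reaches an average success rate of 61.0% in SimplerEnv and an average score of 65.4 across eight competitive multimodal benchmarks, striking a better balance between precise action execution and multimodal understanding and substantially mitigating action degeneration.
Conclusion: The study identifies the root cause of action degeneration in generalist VLA models and proposes an effective post-training strategy for balancing reasoning ability and action performance, contributing methodology for building more practical embodied agents; the VLA Score framework also gives future work a more comprehensive evaluation standard.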
📄 Abstract
To build a generalizable Vision-Language-Action (VLA) model with strong reasoning ability, a common strategy is to first train a specialist VLA on robot demonstrations to acquire reliable manipulation skills, and then incorporate mixed annotated robot data together with multimodal data to restore broader reasoning capabilities. However, we observe that the resulting reasoning VLA often suffers from degraded action performance compared to the specialist model before fine-tuning, a phenomenon we refer to as action degeneration. To address this issue, we propose DualVLA, which enhances action performance through carefully designed post-training while still preserving reasoning capability. We first introduce a dual-layer data pruning method that removes redundant embodied reasoning, preventing it from adversely influencing action learning. To further strengthen action generation, we design a dual-teacher adaptive distillation strategy that assigns different supervision signals to different data domains while maintaining reasoning ability. To fill the evaluation gap for generalist VLAs, we also propose VLA Score, which decouples VLA capability into reasoning, intention, action, and alignment dimensions for a more fine-grained assessment. Experiments show that DualVLA achieves an average success rate of 61.0 in SimplerEnv and an average score of 65.4 across eight competitive multimodal benchmarks, demonstrating a stronger balance between precise action execution and multimodal understanding. Project Website: https://costaliya.github.io/DualVLA/.
[22] Partially Shared Concept Bottleneck Models
Delong Zhao, Qiang Huang, Di Yan, Yiqun Sun, Jun Yu
🧩 TL;DR
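This paper proposes PS-CBM (Partially Shared Concept Bottleneck Model), which uses a multimodal concept generator, a partially shared concept strategy, and a Concept-Efficient Accuracy metric to address three challenges of conventional concept bottleneck models: visual grounding, concept redundancy, and balancing accuracy against compactness.
📘 Detailed Summary
Motivation: Although existing methods use large language models and vision-language models to generate concepts automatically, concept bottleneck models still face three fundamental challenges: poor visual grounding, severe concept redundancy, and the absence of principled metrics for balancing predictive accuracy and concept compactness.
Method: The PS-CBM framework has three core components: a multimodal concept generator that integrates LLM-derived semantics with exemplar-based visual cues; a Partially Shared Concept Strategy that merges concepts based on activation patterns to balance specificity and compactness; and Concept-Efficient Accuracy (CEA), a post-hoc metric that jointly captures predictive accuracy and concept compactness.
Result: Extensive experiments on 11 diverse datasets show that PS-CBM consistently outperforms state-of-the-art concept bottleneck models, improving classification accuracy by 1.0%-7.4% and CEA by 2.0%-9.5% while requiring significantly fewer concepts.
Conclusion: By addressing key limitations of concept bottleneck models, PS-CBM achieves strong interpretability while maintaining high accuracy, provides a systematic framework for balancing model performance and concept compactness, and advances explainable AI.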
📄 Abstract
Concept Bottleneck Models (CBMs) enhance interpretability by introducing a layer of human-understandable concepts between inputs and predictions. While recent methods automate concept generation using Large Language Models (LLMs) and Vision-Language Models (VLMs), they still face three fundamental challenges: poor visual grounding, concept redundancy, and the absence of principled metrics to balance predictive accuracy and concept compactness. We introduce PS-CBM, a Partially Shared CBM framework that addresses these limitations through three core components: (1) a multimodal concept generator that integrates LLM-derived semantics with exemplar-based visual cues; (2) a Partially Shared Concept Strategy that merges concepts based on activation patterns to balance specificity and compactness; and (3) Concept-Efficient Accuracy (CEA), a post-hoc metric that jointly captures both predictive accuracy and concept compactness. Extensive experiments on eleven diverse datasets show that PS-CBM consistently outperforms state-of-the-art CBMs, improving classification accuracy by 1.0%-7.4% and CEA by 2.0%-9.5%, while requiring significantly fewer concepts. These results underscore PS-CBM's effectiveness in achieving both high accuracy and strong interpretability.
[23] Guiding the Inner Eye: A Framework for Hierarchical and Flexible Visual Grounded Reasoning
Zhaoyang Wei, Wenchao Ding, Yanchao Hao, Xi Chen
🧩 TL;DR
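This paper proposes GRiP (Guided Reasoning and Perception), a two-stage training framework that resolves the stability-flexibility dilemma in visual grounded reasoning. Built on Qwen2.5-VL-7B, it delivers significant performance gains and reaches state-of-the-art results among open-source models on several challenging benchmarks.
📘 Detailed Summary
Motivation: Current approaches to "thinking with images" in multimodal AI face a fundamental dilemma: end-to-end reinforcement learning (RL) is unstable, while supervised fine-tuning (SFT) lacks cognitive flexibility. The resulting models either struggle to learn or cannot handle the reasoning demands of complex real-world scenes, so a better balance between stability and flexibility is needed.
Method: GRiP uses a two-stage training approach whose core innovation is a cognitive-enhanced RL stage with two key components: a Salience-Weighted IoU Reward that incentivizes the model to localize mission-critical objects ahead of irrelevant distractors, and a Multi-Heuristic Reward that promotes cognitive flexibility by rewarding diverse yet logically valid reasoning pathways. The framework is initialized from Qwen2.5-VL-7B.
Result: GRiP shows significant gains on multiple challenging benchmarks and achieves state-of-the-art results among open-source models on the highly challenging TreeBench and V* Bench, demonstrating its effectiveness in complex visual reasoning and the benefit of guiding the model's perceptual focus and logical pathways.
Conclusion: The study shows that moving beyond simplistic rewards and guiding models with cognitively inspired signals for what to see and how to think is crucial for unlocking the next level of multimodal intelligence. GRiP's results validate explicitly guiding perceptual focus and logical pathways as a way to cultivate robust, flexible visual grounded reasoning.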
📄 Abstract
Models capable of "thinking with images" by dynamically grounding their reasoning in visual evidence represent a major leap in multimodal AI. However, replicating and advancing this ability is non-trivial, with current methods often trapped between the instability of end-to-end reinforcement learning (RL) and the rigidity of supervised fine-tuning (SFT). This leads to models that either struggle to learn or lack the cognitive flexibility required for complex, real-world scenes. To navigate this dilemma, we introduce GRiP (Guided Reasoning and Perception), a novel two-stage training framework that cultivates robust and flexible visual grounded reasoning by explicitly guiding the model's perceptual focus and logical pathways. GRiP's core lies in its cognitive-enhanced RL stage, which features two key innovations: (1) a Salience-Weighted IoU Reward that incentivizes the model to prioritize the localization of mission-critical objects over trivial distractors, and (2) a Multi-Heuristic Reward that encourages cognitive flexibility by rewarding diverse yet logically valid reasoning pathways. Initialized from the Qwen2.5-VL-7B model, GRiP demonstrates significant performance gains across multiple challenging benchmarks. It achieves state-of-the-art results among open-source models on the highly challenging TreeBench and V* Bench, proving its effectiveness in complex visual reasoning. Our work demonstrates that moving beyond simplistic rewards and instead guiding models with cognitively-inspired signals for what to see and how to think is crucial for unlocking the next level of multimodal intelligence. The code will be made publicly available.
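One plausible reading of a salience-weighted IoU reward is a weighted average of per-object IoU, with larger weights on task-critical objects. The abstract does not give the formula, so the weighting scheme, the box format `(x1, y1, x2, y2)`, and the example weights below are all assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def salience_weighted_iou(preds, targets, weights):
    """Weighted mean IoU; higher weights mark mission-critical objects."""
    total = sum(weights)
    return sum(w * iou(p, t) for p, t, w in zip(preds, targets, weights)) / total

# The critical object is localized perfectly, the distractor is missed entirely.
preds = [(0, 0, 2, 2), (5, 5, 6, 6)]
targets = [(0, 0, 2, 2), (8, 8, 9, 9)]
reward = salience_weighted_iou(preds, targets, weights=[0.9, 0.1])
plain_mean = salience_weighted_iou(preds, targets, weights=[1.0, 1.0])
```

With these toy numbers the weighted reward stays high while the unweighted mean drops to one half, showing how such a reward would favor getting the important object right over covering every distractor.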
[24] Enhanced Graph Convolutional Network with Chebyshev Spectral Graph and Graph Attention for Autism Spectrum Disorder Classification
Adnan Ferdous Ashrafi, Hasanul Kabir
🧩 TL;DR
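This paper proposes a graph convolutional network that combines Chebyshev spectral graph convolution with graph attention networks to improve autism spectrum disorder classification from multimodal neuroimaging and phenotypic data, achieving a test accuracy of 74.82% and an AUC of 0.82 on the ABIDE I dataset.
📘 Detailed Summary
Motivation: Autism spectrum disorder is a complex neurodevelopmental disorder with marked heterogeneity in symptom presentation and neurological underpinnings, which makes early, objective diagnosis extremely difficult. Existing methods are limited in handling multimodal neuroimaging data and in capturing relationships between individuals, so more effective classification models are needed to improve diagnostic accuracy.
Method: The proposed graph convolutional network integrates Chebyshev spectral graph convolution and graph attention networks (GAT) to process multimodal data. A multi-branch architecture processes resting-state functional MRI, structural MRI, and phenotypic variables separately and then fuses them by concatenation. A population graph is built from site-based similarity; Chebyshev polynomial filters provide localized spectral learning, while GAT layers enhance node representations through attention-weighted aggregation of neighbor information. The model is trained with stratified five-fold cross-validation, with an input dimension of 5,206 features per individual.
Result: Extensive experiments on the ABIDE I dataset, which includes 870 patients, show that the proposed model achieves a test accuracy of 74.82% and an AUC of 0.82 on the entire dataset. This surpasses several state-of-the-art baselines, including conventional GCNs, autoencoder-based deep neural networks, and multimodal CNNs.
Conclusion: The study demonstrates the effectiveness of a multimodal graph convolutional network that integrates Chebyshev spectral graph convolution and graph attention for autism diagnosis. Building the graph from site-based similarity helps capture relationships between individuals, and the multimodal fusion strategy exploits the complementarity of neuroimaging and phenotypic information. The work offers a promising computational framework for objective diagnosis of neurodevelopmental disorders and illustrates the potential of graph neural networks in medical image analysis.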
📄 Abstract
Autism Spectrum Disorder (ASD) is a complex neurodevelopmental disorder marked by variation in symptom presentation and neurological underpinnings, making early and objective diagnosis extremely difficult. This paper presents a Graph Convolutional Network (GCN) model, incorporating Chebyshev Spectral Graph Convolution and Graph Attention Networks (GAT), to increase the classification accuracy of ASD using multimodal neuroimaging and phenotypic data. Using the ABIDE I dataset, which contains resting-state functional MRI (rs-fMRI), structural MRI (sMRI), and phenotypic variables from 870 patients, the model employs a multi-branch architecture that processes each modality individually before merging them via concatenation. Graph structure is encoded using site-based similarity to generate a population graph, which helps capture relationships across individuals. Chebyshev polynomial filters provide localized spectral learning with lower computational complexity, whereas GAT layers enhance node representations through attention-weighted aggregation of neighboring information. The proposed model is trained using stratified five-fold cross-validation with a total input dimension of 5,206 features per individual. Extensive experiments demonstrate the model's superiority: it achieves a test accuracy of 74.82% and an AUC of 0.82 on the entire dataset, surpassing multiple state-of-the-art baselines, including conventional GCNs, autoencoder-based deep neural networks, and multimodal CNNs.
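The phrase "localized spectral learning" can be illustrated with the standard Chebyshev recurrence on a tiny graph. This sketch is not the paper's model: it filters a single scalar signal, uses hand-picked coefficients `theta` in place of learned weights, and assumes the common approximation that the largest eigenvalue of the normalized Laplacian is about 2, so the scaled Laplacian reduces to the negated normalized adjacency.

```python
def matvec(matrix, vector):
    """Dense matrix-vector product on plain lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def scaled_laplacian(adj):
    """L~ = L - I = -D^(-1/2) A D^(-1/2), assuming lambda_max is about 2."""
    deg = [sum(row) for row in adj]
    n = len(adj)
    return [
        [-adj[i][j] / (deg[i] * deg[j]) ** 0.5 if adj[i][j] else 0.0 for j in range(n)]
        for i in range(n)
    ]

def chebyshev_filter(adj, x, theta):
    """Apply sum_k theta[k] * T_k(L~) x using T_k = 2 L~ T_(k-1) - T_(k-2)."""
    lap = scaled_laplacian(adj)
    t_prev, t_curr = x, matvec(lap, x)
    out = [theta[0] * v for v in t_prev]
    if len(theta) > 1:
        out = [o + theta[1] * v for o, v in zip(out, t_curr)]
    for k in range(2, len(theta)):
        t_next = [2.0 * a - b for a, b in zip(matvec(lap, t_curr), t_prev)]
        out = [o + theta[k] * v for o, v in zip(out, t_next)]
        t_prev, t_curr = t_curr, t_next
    return out

# Path graph 0 - 1 - 2 with a unit impulse on node 0.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
impulse = [1.0, 0.0, 0.0]
one_hop = chebyshev_filter(adj, impulse, [0.0, 1.0])       # order-1 term only
two_hop = chebyshev_filter(adj, impulse, [0.0, 0.0, 1.0])  # order-2 term only
```

The order-1 term leaves node 2 untouched while the order-2 term reaches it, which is the sense in which a degree-K Chebyshev filter is localized to K hops without requiring an eigendecomposition.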
[25] MTR-VP: Towards End-to-End Trajectory Planning through Context-Driven Image Encoding and Multiple Trajectory Prediction
Maitrayee Keskar, Mohan Trivedi, Ross Greer
🧩 TL;DR
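This paper proposes MTR-VP, a vision-based trajectory planning method for autonomous driving that uses a ViT encoder to learn image-based context embeddings in place of conventional map features and applies a motion prediction framework to multimodal trajectory planning.
📘 Detailed Summary
Motivation: The study aims to reduce the dependence of autonomous driving trajectory planning on conventional map features, exploring how visual information can replace map features to produce useful scene context embeddings and examining the role of multimodal trajectory prediction in planning performance.
Method: The MTR-VP architecture uses a ViT encoder to process raw images and past kinematic states, producing context embeddings similar to those of the MTR encoder. Cross-attention combines the intent with the context embeddings, replacing the learnable intention queries in the MTR decoder. The method is evaluated on the Waymo End-to-End Driving Dataset, with ablation studies analyzing the effect of image inputs and multiple trajectory outputs.
Result: The experiments indicate that transformer-based methods are of limited effectiveness at combining visual and kinematic features: even when intention embeddings are augmented with foundation models such as CLIP and DINOv2, they fail to produce useful scene context embeddings. However, predicting a distribution over multiple future trajectories rather than a single trajectory significantly improves planning performance.
Conclusion: The study exposes the difficulty of fusing visual and kinematic features, indicating that current transformer architectures are limited in cross-modal information integration, and confirms the importance of multimodal trajectory prediction for planning performance, providing insights for the design of future vision-based autonomous driving planners.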
📄 Abstract
We present a method for trajectory planning for autonomous driving that learns image-based context embeddings aligned with motion prediction frameworks and planning-based intention input. Within our method, a ViT encoder takes raw images and past kinematic state as input and is trained to produce context embeddings, drawing inspiration from those generated by the recent MTR (Motion Transformer) encoder, effectively substituting map-based features with learned visual representations. MTR provides a strong foundation for multimodal trajectory prediction by localizing agent intent and refining motion iteratively via motion query pairs. We name our approach MTR-VP (Motion Transformer for Vision-based Planning); instead of the learnable intention queries used in the MTR decoder, we use cross attention on the intent and the context embeddings, which reflect a combination of information encoded from the driving scene and past vehicle states. We evaluate our method on the Waymo End-to-End Driving Dataset, which requires predicting the agent's future 5-second trajectory in bird's-eye-view coordinates using prior camera images, agent pose history, and routing goals. We analyze our architecture using ablation studies, removing input images and multiple trajectory output. Our results suggest that transformer-based methods for combining visual features with kinematic features, such as past trajectory features, do not effectively combine the two modes to produce useful scene context embeddings, even when intention embeddings are augmented with foundation-model representations of scene context from CLIP and DINOv2. However, predicting a distribution over multiple futures instead of a single future trajectory boosts planning performance.
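The cross attention between the intent and the context embeddings can be sketched as single-head scaled dot-product attention, with the intent as the query and the context embeddings as keys and values. The vectors below are made-up two-dimensional toys, and the real model would add learned projections and multiple heads; this only shows the mechanism.

```python
import math

def cross_attention(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    dim = len(values[0])
    attended = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return attended, weights

# Hypothetical intent embedding attending over two context embeddings.
intent = [1.0, 0.0]
context_keys = [[4.0, 0.0], [0.0, 4.0]]
context_values = [[1.0, 0.0], [0.0, 1.0]]

attended, weights = cross_attention(intent, context_keys, context_values)
```

The intent vector pulls most of its output from the context entry it aligns with, so the planner's goal decides which parts of the encoded scene matter, in contrast to fixed learnable queries.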
[26] Controllable 3D Object Generation with Single Image Prompt
Jaeseok Lee, Jaekoo Lee
🧩 TL;DR
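This paper proposes a 3D object generation method that needs no textual inversion; using an off-the-shelf image adapter and a depth-conditioned warmup strategy, it improves 3D consistency and controllability while preserving generation quality.
📘 Detailed Summary
Motivation: Current textual inversion-based 3D object generation methods have two main problems: they require additional training time, and they offer little control over the generation process. Although existing text-to-image generative models use textual inversion to learn the concept or style of a target object, this approach limits precise user control over conditions such as depth and pose.
Method: The paper proposes two methods: first, an off-the-shelf image adapter generates 3D objects directly without textual inversion, giving enhanced control over conditions such as depth, pose, and text; second, a depth-conditioned warmup strategy is introduced specifically to improve the 3D consistency of the results.
Result: Experiments show qualitative and quantitative performance comparable to existing textual inversion-based methods, with improved 3D consistency. A user study further confirms that the method outperforms existing alternatives in both matching the input image and maintaining 3D consistency, validating its effectiveness.
Conclusion: The study offers a more efficient and controllable solution for 3D object generation: removing the need for textual inversion and strengthening conditional control improves 3D consistency while preserving generation quality, suggesting a new technical path for future 3D content creation tools.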
📄 Abstract
Recently, diffusion models have demonstrated impressive generative capabilities, producing images with remarkable fidelity. In particular, existing methods for 3D object generation, one of the fastest-growing areas in computer vision, predominantly use text-to-image diffusion models with textual inversion, which trains a pseudo text prompt to describe the given image. In practice, various text-to-image generative models employ textual inversion to learn concepts or styles of a target object in the pseudo text prompt embedding space, thereby generating sophisticated outputs. However, textual inversion requires additional training time and offers limited controllability. To tackle these issues, we propose two innovative methods: (1) an off-the-shelf image adapter that generates 3D objects without textual inversion, offering enhanced control over conditions such as depth, pose, and text; and (2) a depth-conditioned warmup strategy to enhance 3D consistency. In our experiments, our method shows qualitatively and quantitatively comparable performance and improved 3D consistency relative to existing textual inversion-based alternatives. Furthermore, we conduct a user study to assess (i) how well results match the input image and (ii) whether 3D consistency is maintained. User study results show that our model outperforms the alternatives, validating the effectiveness of our approaches. Our code is available in the GitHub repository: https://github.com/Seooooooogi/Control3D_IP/
[27] TTSnap: Test-Time Scaling of Diffusion Models via Noise-Aware Pruning
Qingtao Yu, Changlin Song, Minghao Sun, Zhengyang Yu, Vinay Kumar Verma, Soumya Roy, Sumit Negi, Hongdong Li, Dylan Campbell
🧩 TL;DR
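This paper proposes TTSnap, a framework for test-time scaling of text-to-image diffusion models via noise-aware pruning. It trains noise-aware reward models through self-distillation to align rewards for intermediate estimates with those of the final clean images, so low-quality candidates can be pruned efficiently without full denoising.
📘 Detailed Summary
Motivation: Current test-time scaling methods for text-to-image diffusion models that search over multiple noise seeds face a computational bottleneck: each candidate must be fully denoised before its reward can be computed, which severely limits the number of samples that can be explored under a fixed compute budget. An efficient method is needed that can identify and prune low-quality candidates before denoising completes.
Method: TTSnap (test-time scaling with noise-aware pruning) trains noise-aware reward models via self-distillation to align reward predictions for intermediate estimates with those for the final clean images, adopts a curriculum training strategy that progressively shifts from the clean-image domain to the noisy-image domain to stabilize learning, and introduces a new metric measuring reward alignment and compute budget utilization.
Result: Experiments show that TTSnap improves performance by over 16% compared with existing methods, enabling more efficient test-time scaling, and that it provides orthogonal gains when combined with post-training techniques and local test-time optimization, substantially improving compute budget utilization.
Conclusion: The study demonstrates the effectiveness of noise-aware reward models for test-time scaling, points to a new direction for efficient diffusion model inference, shows the benefit of self-distillation and curriculum training for cross-domain reward alignment, and remains compatible with other optimization techniques.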
📄 Abstract
A prominent approach to test-time scaling for text-to-image diffusion models formulates the problem as a search over multiple noise seeds, selecting the one that maximizes a certain image-reward function. The effectiveness of this strategy heavily depends on the number and diversity of noise seeds explored. However, verifying each candidate is computationally expensive, because each must be fully denoised before a reward can be computed. This severely limits the number of samples that can be explored under a fixed budget. We propose test-time scaling with noise-aware pruning (TTSnap), a framework that prunes low-quality candidates without fully denoising them. The key challenge is that reward models are learned in the clean image domain, and the ranking of rewards predicted for intermediate estimates is often inconsistent with the ranking predicted for clean images. To overcome this, we train noise-aware reward models via self-distillation to align the reward for intermediate estimates with that of the final clean images. To stabilize learning across different noise levels, we adopt a curriculum training strategy that progressively shifts the data domain from clean images to noisy images. In addition, we introduce a new metric that measures reward alignment and computational budget utilization. Experiments demonstrate that our approach improves performance by over 16% compared with existing methods, enabling more efficient and effective test-time scaling. It also provides orthogonal gains when combined with post-training techniques and local test-time optimization. Code: https://github.com/TerrysLearning/TTSnap/.
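The budget argument behind pruning can be made concrete with a small simulation. This sketch assumes a successive-halving schedule and a reward predictor that is perfectly aligned with the final reward at every step; both are simplifications for illustration, and the function names, checkpoints, and quality table are hypothetical rather than taken from TTSnap.

```python
def pruned_search(seeds, reward_at, checkpoints, total_steps, keep_frac=0.5):
    """Denoise candidates in stages, dropping the weakest at each checkpoint."""
    alive, spent, prev = list(seeds), 0, 0
    for step in checkpoints:
        spent += len(alive) * (step - prev)  # denoising steps used by survivors
        prev = step
        alive.sort(key=lambda s: reward_at(s, step), reverse=True)
        alive = alive[: max(1, int(len(alive) * keep_frac))]
    spent += len(alive) * (total_steps - prev)  # finish the remaining candidates
    best = max(alive, key=lambda s: reward_at(s, total_steps))
    return best, spent

# Hypothetical final quality for each noise seed.
quality = {0: 0.2, 1: 0.9, 2: 0.4, 3: 0.1, 4: 0.7, 5: 0.3, 6: 0.5, 7: 0.6}
# Idealized noise-aware reward: agrees with the final quality at every step.
toy_reward = lambda seed, step: quality[seed]

best_seed, cost = pruned_search(range(8), toy_reward, checkpoints=[10, 25], total_steps=50)
full_cost = len(quality) * 50  # cost of fully denoising every seed
```

Under these assumptions the search still finds the best seed while spending 190 denoising steps instead of 400. In practice the saving depends on how well intermediate rewards track final ones, which is the alignment problem the self-distilled reward models are meant to solve.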
[28] Semantic Anchoring for Robust Personalization in Text-to-Image Diffusion Models
Seoyun Yang, Gihoon Kim, Taesup Kim
🧩 TL;DR
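This paper proposes a semantic anchoring-based personalization method for text-to-image diffusion models that guides adaptation by anchoring new concepts to their corresponding distributions, learning user-specific visual concepts while preserving the pretrained semantic prior and achieving stable personalized generation.
📘 Detailed Summary
Motivation: Personalizing text-to-image diffusion models poses a key challenge: when the model focuses on subject fidelity it tends to overfit the few reference images, whereas emphasizing prior preservation prevents it from learning new personalized attributes. A balance is needed between learning new visual concepts and preserving the pretrained semantic prior.
Method: The method guides personalization through semantic anchoring, reformulating personalization as learning a rare concept under the guidance of its frequent counterpart. This anchoring encourages the model to adapt to new concepts in a stable and controlled manner, expanding the pretrained distribution while preserving its semantic structure.
Result: Experiments show that the method achieves stable adaptation and outperforms baseline methods in both subject fidelity and text-image alignment; extensive ablation studies further demonstrate the robustness and effectiveness of the anchoring strategy.
Conclusion: Through its semantic anchoring framework, the study addresses the trade-off between overfitting and prior preservation in diffusion model personalization, offering a new approach to stable, controllable concept adaptation that expands the pretrained distribution toward personalized regions while keeping its semantic structure consistent.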
📄 Abstract
Text-to-image diffusion models have achieved remarkable progress in generating diverse and realistic images from textual descriptions. However, they still struggle with personalization, which requires adapting a pretrained model to depict user-specific subjects from only a few reference images. The key challenge lies in learning a new visual concept from a limited number of reference images while preserving the pretrained semantic prior that maintains text-image alignment. When the model focuses on subject fidelity, it tends to overfit the limited reference images and fails to leverage the pretrained distribution. Conversely, emphasizing prior preservation maintains semantic consistency but prevents the model from learning new personalized attributes. Building on these observations, we propose a personalization process based on semantic anchoring, which guides adaptation by grounding new concepts in their corresponding distributions. We therefore reformulate personalization as the process of learning a rare concept guided by its frequent counterpart through semantic anchoring. This anchoring encourages the model to adapt new concepts in a stable and controlled manner, expanding the pretrained distribution toward personalized regions while preserving its semantic structure. As a result, the proposed method achieves stable adaptation and consistent improvements in both subject fidelity and text-image alignment compared to baseline methods. Extensive experiments and ablation studies further demonstrate the robustness and effectiveness of the proposed anchoring strategy.
[29] UMind-VL: A Generalist Ultrasound Vision-Language Model for Unified Grounded Perception and Comprehensive Interpretation
Dengbo Chen, Ziwei Zhao, Kexin Zhang, Shishuang Zhao, Junjie Hou, Yaqian Wang, Nianxi Liao, Anlan Sun, Fei Gao, Jia Ding, Yuhang Liu, Dong Wang
🧩 TL;DR
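This study proposes UMind-VL, a unified ultrasound foundation model that aims to bridge the gap between low-level ultrasound perception (e.g., segmentation, localization) and high-level clinical interpretation (e.g., diagnosis, reasoning), combining pixel-level structural understanding with complex clinical reasoning in a single framework.
📘 Detailed Summary
Motivation: Despite significant progress in medical foundation models, the ultrasound domain lacks a comprehensive solution that connects low-level Ultrasound Grounded Perception (e.g., segmentation, localization) with high-level Ultrasound Comprehensive Interpretation (e.g., diagnosis, reasoning); existing methods cannot effectively integrate pixel-level structural understanding with clinical reasoning.
Method: The authors first build UMind-DS, a large-scale multimodal dataset of 1.2 million ultrasound image-text pairs covering 16 anatomical regions, with pixel-level annotations and clinician-validated rationales. Architecturally, UMind-VL introduces a lightweight Dynamic Convolutional Mask Decoder that generates masks via dynamic kernels conditioned on LLM outputs; combined with task-specific tokens, this unifies segmentation, detection, geometric measurement, and diagnosis within a single framework.
Result: Extensive evaluation shows that UMind-VL significantly outperforms existing generalist multimodal models on segmentation, detection, keypoint localization, and diagnostic reasoning benchmarks, matches or exceeds state-of-the-art specialist models, and maintains strong generalization.
Conclusion: The study demonstrates the feasibility of a unified ultrasound foundation model: by integrating pixel-level perception with clinical reasoning, it offers a comprehensive solution for ultrasound AI, shows that a single model can perform strongly across diverse ultrasound tasks, and serves as a reference for future medical multimodal models.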
📄 Abstract
Despite significant strides in medical foundation models, the ultrasound domain lacks a comprehensive solution capable of bridging low-level Ultrasound Grounded Perception (e.g., segmentation, localization) and high-level Ultrasound Comprehensive Interpretation (e.g., diagnosis, reasoning). To bridge this gap, we propose UMind-VL, a unified foundation model designed to synergize pixel-level structural understanding with complex clinical reasoning. We first introduce UMind-DS, a large-scale multimodal dataset comprising 1.2 million ultrasound image-text pairs across 16 anatomical regions, enriching standard data with pixel-level annotations and clinician-validated rationales. Architecturally, UMind-VL incorporates a lightweight Dynamic Convolutional Mask Decoder that generates masks via dynamic kernels conditioned on LLM outputs. This design, combined with task-specific tokens, unifies segmentation, detection, geometric measurement, and diagnosis tasks within a single framework. Extensive evaluations demonstrate that UMind-VL significantly outperforms existing generalist multimodal models and achieves performance on par with, or superior to, state-of-the-art specialist models across segmentation, detection, keypoint localization, and diagnostic reasoning benchmarks, while maintaining strong generalization ability. We demonstrate the capability of UMind-VL in Figure 1.
[30] Asking like Socrates: Socrates helps VLMs understand remote sensing images
Run Shao, Ziyu Li, Zhaoyang Zhang, Linrui Xu, Xinran He, Hongyuan Yuan, Bolei He, Yongxing Dai, Yiming Yan, Yijun Chen, Wang Guo, Haifeng Li
🧩 TL;DR
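This paper addresses pseudo reasoning in remote sensing visual question answering by proposing RS-EoT (Remote Sensing Evidence-of-Thought), a language-driven, iterative visual evidence-seeking paradigm. Combined with the SocraticAgent multi-agent system and a progressive reinforcement learning strategy, it substantially improves the accuracy and interpretability of remote sensing visual reasoning.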
📘 Detailed Summary
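Motivation: Current remote sensing multimodal reasoning models commonly exhibit pseudo reasoning: they narrate a reasoning process rather than genuinely reasoning from visual evidence. The authors attribute this to the Glance Effect, in which a single, coarse perception of large-scale remote sensing imagery yields incomplete understanding, so that reasoning rests on linguistic self-consistency instead of visual evidence.
Method: The paper proposes RS-EoT, a language-driven, iterative visual evidence-seeking paradigm. To instill it, the authors design SocraticAgent, a multi-agent system that synthesizes reasoning traces through alternating cycles of reasoning and visual inspection. To strengthen and generalize these patterns, they propose a two-stage progressive RL strategy: RL on fine-grained Grounding tasks to enhance RS-EoT capabilities, followed by RL on RS VQA tasks to generalize to broader understanding scenarios.
Result: Experiments show that RS-EoT achieves state-of-the-art performance on multiple RS VQA and Grounding benchmarks. Analyses reveal clear iterative cycles of reasoning and evidence seeking, confirming that RS-EoT mitigates the Glance Effect and enables genuinely evidence-grounded reasoning.
Conclusion: The study exposes pseudo reasoning in remote sensing multimodal reasoning and its root cause, and the proposed RS-EoT paradigm addresses it through iterative evidence seeking. The approach improves performance and makes the model more interpretable, offering a new research direction and a practical framework for remote sensing visual reasoning.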
📄 Abstract
Recent multimodal reasoning models, inspired by DeepSeek-R1, have significantly advanced vision-language systems. However, in remote sensing (RS) tasks, we observe widespread pseudo reasoning: models narrate the process of reasoning rather than genuinely reason toward the correct answer based on visual evidence. We attribute this to the Glance Effect, where a single, coarse perception of large-scale RS imagery results in incomplete understanding and reasoning based on linguistic self-consistency instead of visual evidence. To address this, we propose RS-EoT (Remote Sensing Evidence-of-Thought), a language-driven, iterative visual evidence-seeking paradigm. To instill this paradigm, we propose SocraticAgent, a self-play multi-agent system that synthesizes reasoning traces via alternating cycles of reasoning and visual inspection. To enhance and generalize these patterns, we propose a two-stage progressive RL strategy: first, RL on fine-grained Grounding tasks to enhance RS-EoT capabilities, followed by RL on RS VQA to generalize to broader understanding scenarios. Experiments show RS-EoT achieves state-of-the-art performance on multiple RS VQA and grounding benchmarks. Analyses reveal clear iterative cycles of reasoning and evidence seeking, confirming RS-EoT mitigates the Glance Effect and enables genuine evidence-grounded reasoning. Our code, data, and models are available at https://geox-lab.github.io/Asking_like_Socrates
[31] Match-and-Fuse: Consistent Generation from Unstructured Image Sets
Kate Feingold, Omri Kaduri, Tali Dekel
🧩 TL;DR
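This paper proposes Match-and-Fuse, a zero-shot, training-free method for consistent generation of unstructured image sets, i.e., collections that share a common visual element but differ in viewpoint, time of capture, and surrounding content, enabling set-to-set generation.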
📘 Detailed Summary
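Motivation: Existing methods mainly operate on individual images or densely sampled videos and cannot effectively handle generation for unstructured image sets, whose images share a common visual element but differ in viewpoint, time, and other content. A set-to-set generation framework that preserves cross-image consistency is needed.
Method: The task is modeled as a graph in which each node corresponds to an image and each edge triggers joint generation of an image pair. Local consistency and global coherence are achieved by fusing internal features across image pairs, guided by dense input correspondences, without masks or manual supervision. The method also exploits an emergent prior in text-to-image models that encourages coherent generation when multiple views share a single canvas.
Result: Match-and-Fuse achieves state-of-the-art consistency and visual quality and can create new content from image collections: given a source set, it produces a high-quality new set while preserving cross-image consistency of shared content.
Conclusion: The work provides a unified framework for generation over unstructured image sets, achieving local consistency and global coherence through graph modeling and feature fusion. It opens new avenues for content creation from image collections and illustrates the potential of zero-shot methods on complex generation tasks.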
📄 Abstract
We present Match-and-Fuse - a zero-shot, training-free method for consistent controlled generation of unstructured image sets - collections that share a common visual element, yet differ in viewpoint, time of capture, and surrounding content. Unlike existing methods that operate on individual images or densely sampled videos, our framework performs set-to-set generation: given a source set and user prompts, it produces a new set that preserves cross-image consistency of shared content. Our key idea is to model the task as a graph, where each node corresponds to an image and each edge triggers a joint generation of image pairs. This formulation consolidates all pairwise generations into a unified framework, enforcing their local consistency while ensuring global coherence across the entire set. This is achieved by fusing internal features across image pairs, guided by dense input correspondences, without requiring masks or manual supervision. It also allows us to leverage an emergent prior in text-to-image models that encourages coherent generation when multiple views share a single canvas. Match-and-Fuse achieves state-of-the-art consistency and visual quality, and unlocks new capabilities for content creation from image collections.
[32] DocVAL: Validated Chain-of-Thought Distillation for Grounded Document VQA
Ahmad Mohammadshirazi, Pinaki Prasad Guha Neogi, Dheeraj Kulshrestha, Rajiv Ramnath
🧩 TL;DR
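This paper proposes DocVAL, a validated chain-of-thought distillation framework that transfers the spatial reasoning ability of a large teacher model into a deployable student vision-language model, balancing accuracy and efficiency on DocVQA.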
📘 Detailed Summary
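Motivation: Current document VQA systems face a sharp accuracy-efficiency trade-off: large teacher models have strong spatial grounding but are too expensive to deploy, while compact student models suffer substantial drops in localization performance. This deployment bottleneck needs to be addressed.
Method: DocVAL has three core components: teacher supervision with validation-time text detection to filter and denoise training signals; a multi-module validator that enforces answer correctness and geometric consistency and provides pixel-level error feedback; and a two-stage student training scheme that first learns from validated CoT traces and then undergoes iterative refinement driven by validator feedback.
Result: The student (Gemma-3 12B) achieves 91.4% ANLS and 82.4% mAP on DocVQA as a pure VLM that needs no text detection or OCR at inference. Ablations show that validated feedback contributes a 6.3 mAP gain and iterative refinement a 9.7 mAP improvement.
Conclusion: The study demonstrates that validated distillation is effective for transferring spatial reasoning ability. The released 95k high-quality, validator-verified CoT traces are intended to advance spatial reasoning research in document understanding, and the framework offers an efficient option for practical deployment.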
📄 Abstract
Document visual question answering (DocVQA) requires models to jointly reason over textual content and spatial layout, yet current systems exhibit a sharp accuracy-efficiency trade-off: large teacher models achieve strong grounding but are too expensive for deployment, while compact students suffer substantial drops in localization performance. We propose DocVAL, a validated chain-of-thought distillation framework that transfers the spatial reasoning ability of a large teacher into a deployable student VLM through three key components: (1) teacher supervision with validation-time text detection to filter and denoise training signals, (2) a multi-module validator (VAL) that enforces answer correctness and geometric consistency while producing fine-grained, pixel-level error feedback, and (3) a two-stage student training scheme that first learns from validated CoT traces and then undergoes iterative refinement driven by VAL feedback. Our student (Gemma-3 12B) achieves 91.4% ANLS and 82.4% mAP on DocVQA as a pure VLM requiring no text detection or OCR at inference. Extensive ablations demonstrate that validated feedback contributes 6.3 mAP gain and iterative refinement accounts for 9.7 mAP improvement. We release 95k high-quality, validator-verified CoT traces to advance spatial reasoning research in document understanding.
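For readers unfamiliar with the ANLS score reported above, here is a minimal sketch of Average Normalized Levenshtein Similarity as it is commonly defined for DocVQA-style evaluation: each prediction is scored against its best-matching reference as 1 minus the normalized edit distance, and any score whose normalized distance reaches the threshold (0.5 by convention) is set to zero. The lowercasing and whitespace stripping are assumptions about normalization; the paper's exact evaluation script is not given in the abstract.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def anls(predictions, references, tau=0.5):
    """predictions: list of str; references: list of lists of accepted answers."""
    scores = []
    for pred, refs in zip(predictions, references):
        p = pred.strip().lower()  # normalization choice is an assumption
        best = 0.0
        for ref in refs:
            r = ref.strip().lower()
            denom = max(len(p), len(r))
            nl = levenshtein(p, r) / denom if denom else 0.0
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    print(anls(["Paris", "abcd"], [["paris"], ["abce", "zzzz"]]))
```

The threshold means near-misses such as small OCR-like slips earn partial credit, while answers that differ in half or more of their characters earn none.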
[33] Structure is Supervision: Multiview Masked Autoencoders for Radiology
Sonia Laguna, Andrea Agostini, Alain Ryser, Samuel Ruiperez-Campillo, Irene Cannistraci, Moritz Vandenhirtz, Stephan Mandt, Nicolas Deperrois, Farhad Nooralahzadeh, Michael Krauthammer, Thomas M. Sutter, Julia E. Vogt
🧩 TL;DR
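This work proposes two self-supervised learning frameworks, MVMAE and MVMAE-V2T, which use the natural multi-view organization of radiology studies and radiology reports to learn view-invariant, disease-relevant medical image representations, outperforming supervised and vision-language baselines on several large-scale public datasets.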
📘 Detailed Summary
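Motivation: Building robust medical machine learning systems requires pretraining strategies that exploit the intrinsic structure of clinical data, in particular the natural multi-view organization of radiology studies. The question is how to turn this structure into an effective self-supervisory signal for learning view-invariant, disease-relevant representations.
Method: The Multiview Masked Autoencoder (MVMAE) combines masked image reconstruction with cross-view alignment, turning the clinical redundancy across projections into a self-supervisory signal. MVMAE-V2T extends it by incorporating radiology reports as an auxiliary text-based learning signal to enhance semantic grounding while preserving fully vision-based inference.
Result: On downstream disease classification across three large-scale public datasets (MIMIC-CXR, CheXpert, and PadChest), MVMAE consistently outperforms supervised and vision-language baselines. MVMAE-V2T provides additional gains, particularly in low-label regimes, where structured textual supervision is most beneficial.
Conclusion: The results establish structural and textual supervision as complementary paths toward scalable, clinically grounded medical foundation models, and show that exploiting the multi-view organization of radiology studies together with radiology reports improves medical image representation learning, especially when labeled data is scarce.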
📄 Abstract
Building robust medical machine learning systems requires pretraining strategies that exploit the intrinsic structure present in clinical data. We introduce Multiview Masked Autoencoder (MVMAE), a self-supervised framework that leverages the natural multi-view organization of radiology studies to learn view-invariant and disease-relevant representations. MVMAE combines masked image reconstruction with cross-view alignment, transforming clinical redundancy across projections into a powerful self-supervisory signal. We further extend this approach with MVMAE-V2T, which incorporates radiology reports as an auxiliary text-based learning signal to enhance semantic grounding while preserving fully vision-based inference. Evaluated on a downstream disease classification task on three large-scale public datasets, MIMIC-CXR, CheXpert, and PadChest, MVMAE consistently outperforms supervised and vision-language baselines. Furthermore, MVMAE-V2T provides additional gains, particularly in low-label regimes where structured textual supervision is most beneficial. Together, these results establish the importance of structural and textual supervision as complementary paths toward scalable, clinically grounded medical foundation models.
[34] CoT4AD: A Vision-Language-Action Model with Explicit Chain-of-Thought Reasoning for Autonomous Driving
Zhaohui Wang, Tengbo Yu, Hao Tang
🧩 TL;DR
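This paper proposes CoT4AD, a new Vision-Language-Action framework for autonomous driving that introduces chain-of-thought reasoning to strengthen the numerical and causal reasoning of vision-language models, achieving state-of-the-art performance in complex driving scenarios.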
📘 Detailed Summary
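Motivation: Existing Vision-Language-Action models for autonomous driving suffer from limited numerical reasoning ability and overly simplified input-output mappings, which hinder performance in complex driving scenarios that require step-by-step causal reasoning. Stronger reasoning is needed for decision-making in dynamic environments.
Method: CoT4AD integrates visual observations and language instructions to perform semantic reasoning, scene understanding, and trajectory planning. During training, it explicitly models a perception-question-prediction-action chain of thought to align the reasoning space with the action space across multiple driving tasks. During inference, it performs implicit CoT reasoning to enable consistent numerical reasoning and robust decision-making in dynamic environments.
Result: Extensive experiments on real-world and simulated benchmarks, including nuScenes and Bench2Drive, show that CoT4AD achieves state-of-the-art performance in both open-loop and closed-loop evaluations, supporting the effectiveness of CoT reasoning for improving numerical and causal reasoning in driving models.
Conclusion: The study indicates that chain-of-thought reasoning can markedly improve the reasoning ability and decision robustness of Vision-Language-Action models in autonomous driving. It offers a new framework design for end-to-end driving in complex dynamic environments and highlights the importance of aligning the reasoning space across tasks.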
📄 Abstract
Vision-Language-Action (VLA) models have recently attracted growing attention in end-to-end autonomous driving for their strong reasoning capabilities and rich world knowledge. However, existing VLAs often suffer from limited numerical reasoning ability and overly simplified input-output mappings, which hinder their performance in complex driving scenarios requiring step-by-step causal reasoning. To address these challenges, we propose CoT4AD, a novel VLA framework that introduces Chain-of-Thought (CoT) reasoning for autonomous driving to enhance both numerical and causal reasoning in Vision-Language Models (VLMs). CoT4AD integrates visual observations and language instructions to perform semantic reasoning, scene understanding, and trajectory planning. During training, it explicitly models a perception-question-prediction-action CoT to align the reasoning space with the action space across multiple driving tasks. During inference, it performs implicit CoT reasoning to enable consistent numerical reasoning and robust decision-making in dynamic environments. Extensive experiments on both real-world and simulated benchmarks, including nuScenes and Bench2Drive, demonstrate that CoT4AD achieves state-of-the-art performance in both open-loop and closed-loop evaluations. Code will be released upon paper acceptance.
[35] Unexplored flaws in multiple-choice VQA evaluations
Fabio Rosenthal, Sebastian Schmidt, Thorsten Graf, Thorsten Bagodonat, Stephan Günnemann, Leo Schwinn
🧩 TL;DR
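This study shows that multiple-choice VQA evaluations of multimodal large language models are highly sensitive to prompt formatting: even semantically neutral format changes cause performance fluctuations, and existing bias mitigation strategies fail to address these newly identified biases.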
📘 Detailed Summary
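Motivation: Although prior work has shown that multiple-choice VQA benchmarks are sensitive to answer choice order, this study points to further, unexplored prompt-formatting biases that call the reliability of current MLLM evaluations into question and that need to be identified and analyzed systematically.
Method: Through a large-scale study, the authors identify three key variation factors in prompt formatting and measure their effect on MLLM performance across seven MLLMs, five VQA datasets, and 48 distinct prompt format variations.
Result: Multiple-choice VQA is highly sensitive to minor prompt format changes, even semantically neutral ones. This sensitivity persists independently of known order biases and of the model's confidence in the correct answer, and existing bias mitigation strategies fail to address the newly identified biases.
Conclusion: The study exposes systematic biases in current MLLM evaluation: small changes in prompt format can significantly affect results. Future work needs more robust evaluation methods and a re-examination of the reliability of existing benchmarks in order to build fairer and more dependable evaluation frameworks.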
📄 Abstract
Multimodal Large Language Models (MLLMs) demonstrate strong capabilities in handling image-text inputs. A common way to assess this ability is through multiple-choice Visual Question Answering (VQA). Earlier works have already revealed that these benchmarks are sensitive to answer choice order, a limitation that can be mitigated through careful design. Yet, we highlight additional, unexplored biases in prompt formatting that question the reliability of current MLLM evaluations. Specifically, we identify three key variation factors in prompt formatting and analyze their impact through a large-scale study involving seven MLLMs and five VQA datasets, spanning 48 distinct prompt format variations. Our findings reveal that multiple-choice VQA is highly sensitive to minor prompt format changes, even when these changes are semantically neutral. We further demonstrate that these biases persist independently of known order biases or the MLLM's confidence in the correct answer. Finally, we demonstrate that existing bias mitigation strategies fail to address these newly identified biases.
[36] Revisiting the Necessity of Lengthy Chain-of-Thought in Vision-centric Reasoning Generalization
Yifan Du, Kun Zhou, Yingqian Min, Yue Ling, Wayne Xin Zhao, Youbin Wu
🧩 TL;DR
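This study systematically evaluates how different chain-of-thought designs affect the generalizable visual reasoning ability of vision-language models. It finds that concise CoT containing only the essential grounding steps performs best on visual reasoning tasks such as maze solving, revealing a "short is long" effect.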
📘 Detailed Summary
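Motivation: Although CoT data, especially long or visual CoT, is widely used to supervise intermediate reasoning, it remains unclear how different CoT designs affect the acquisition of generalizable visual reasoning in VLMs, and which designs truly support it.
Method: The study uses a controlled maze-solving benchmark in which the reasoning rules are fully visual, difficulty can be tuned by grid size, and all intermediate steps can be generated automatically. Using Qwen2.5-VL-7B under a standard SFT-then-RL pipeline, it compares three representative CoT formats: Language CoT, Grounding CoT (with spatial coordinate trajectories), and Visual CoT (with image manipulations).
Result: Visual and longer CoT mainly accelerate convergence but do not raise the final performance ceiling; concise CoT containing only essential grounding steps outperforms longer traces; and CoT that retains only the minimal grounding results generalizes best across maze sizes. These findings are further validated on other vision-centric tasks.
Conclusion: The study reveals a "short is long" effect in CoT design: concise CoT that contains only the necessary grounding information best promotes generalizable visual reasoning. This gives practical guidance for constructing more generalizable SFT datasets for visual reasoning.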
📄 Abstract
We study how different Chain-of-Thought (CoT) designs affect the acquisition of the generalizable visual reasoning ability in vision-language models (VLMs). While CoT data, especially long or visual CoT such as "think with image", has been widely used to supervise intermediate reasoning, it remains unclear why specific CoT designs help and which ones truly support generalizable reasoning. To systematically evaluate this, we focus on a controlled maze-solving benchmark where reasoning rules are fully visual, difficulty can be tuned by grid size, and all the intermediate steps can be automatically generated. Using Qwen2.5-VL-7B under a standard SFT-then-RL pipeline, we compare three representative CoT formats: Language CoT, Grounding CoT (with spatial coordinate trajectories), and Visual CoT (with image manipulations). Our experiments reveal that visual and longer CoT mainly accelerate convergence but do not lift the final performance ceiling; concise CoT containing only essential grounding steps outperforms longer traces; and, strikingly, CoT retaining only the minimal grounding results generalizes best across different maze sizes. We further validate these insights on other vision-centric tasks. These findings highlight a "short is long" effect and provide practical guidance for constructing more generalizable SFT datasets for visual reasoning.
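To make the "Grounding CoT (with spatial coordinate trajectories)" format more concrete, the sketch below shows one way such intermediate steps could be generated automatically for a grid maze: a breadth-first search finds a shortest path, and the path is serialized as a coordinate trace. The grid encoding, the function names `solve_maze` and `grounding_cot`, and the trace format are illustrative assumptions; the paper's actual data generator is not described in the abstract.

```python
from collections import deque


def solve_maze(grid, start, goal):
    """Breadth-first search on a grid maze.

    grid:  list of equal-length strings, '#' marks a wall.
    start, goal: (row, col) tuples.
    Returns the shortest path as a list of (row, col), or None.
    """
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:  # walk parents back to the start
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            nxt = (nr, nc)
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] != '#' and nxt not in parent):
                parent[nxt] = cell
                queue.append(nxt)
    return None


def grounding_cot(path):
    """Serialize a path as a coordinate trajectory (hypothetical format)."""
    return " -> ".join(f"({r},{c})" for r, c in path)


if __name__ == "__main__":
    maze = ["..#",
            ".##",
            "..."]
    print(grounding_cot(solve_maze(maze, (0, 0), (2, 2))))
```

A shorter supervision target in the spirit of the paper's "minimal grounding results" could keep only the final coordinate list and drop any accompanying narration.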
[37] INSIGHT: An Interpretable Neural Vision-Language Framework for Reasoning of Generative Artifacts
Anshul Bagaria
🧩 TL;DR
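This paper proposes INSIGHT, a unified multimodal framework for detecting and explaining AI-generated images that remains robust at extremely low resolutions and provides transparent explanations, improving the trustworthiness of generated-image forensics.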
📘 Detailed Summary
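Motivation: Current deepfake detection systems degrade sharply under real-world conditions such as severe downsampling, compression, and cross-domain distribution shifts. Most detectors also operate as opaque classifiers that cannot explain why an image is flagged as synthetic, which undermines trust and hinders adoption in high-stakes settings.
Method: INSIGHT combines hierarchical super-resolution, which amplifies subtle forensic cues without inducing misleading artifacts; Grad-CAM driven multi-scale localization, which reveals spatial regions indicative of generative patterns; and CLIP-guided semantic alignment, which maps visual anomalies to human-interpretable descriptors. A vision-language model is then prompted with a structured ReAct + Chain-of-Thought protocol to produce consistent, fine-grained explanations, which are verified through a dual-stage G-Eval + LLM-as-a-judge pipeline to minimize hallucinations and ensure factuality.
Result: Across diverse domains, including animals, vehicles, and abstract synthetic scenes, INSIGHT substantially improves detection robustness and explanation quality under extreme degradation, outperforming prior detectors and black-box VLM baselines, including at extremely low resolutions (16x16 to 64x64).
Conclusion: The study offers a practical path toward transparent, reliable forensics for AI-generated images and positions INSIGHT as a step forward in trustworthy multimodal content verification. It underlines the role of interpretability in generated-image detection and lays groundwork for adoption in high-stakes applications.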
📄 Abstract
The growing realism of AI-generated images produced by recent GAN and diffusion models has intensified concerns over the reliability of visual media. Yet, despite notable progress in deepfake detection, current forensic systems degrade sharply under real-world conditions such as severe downsampling, compression, and cross-domain distribution shifts. Moreover, most detectors operate as opaque classifiers, offering little insight into why an image is flagged as synthetic, undermining trust and hindering adoption in high-stakes settings. We introduce INSIGHT (Interpretable Neural Semantic and Image-based Generative-forensic Hallucination Tracing), a unified multimodal framework for robust detection and transparent explanation of AI-generated images, even at extremely low resolutions (16x16 - 64x64). INSIGHT combines hierarchical super-resolution for amplifying subtle forensic cues without inducing misleading artifacts, Grad-CAM driven multi-scale localization to reveal spatial regions indicative of generative patterns, and CLIP-guided semantic alignment to map visual anomalies to human-interpretable descriptors. A vision-language model is then prompted using a structured ReAct + Chain-of-Thought protocol to produce consistent, fine-grained explanations, verified through a dual-stage G-Eval + LLM-as-a-judge pipeline to minimize hallucinations and ensure factuality. Across diverse domains, including animals, vehicles, and abstract synthetic scenes, INSIGHT substantially improves both detection robustness and explanation quality under extreme degradation, outperforming prior detectors and black-box VLM baselines. Our results highlight a practical path toward transparent, reliable AI-generated image forensics and establish INSIGHT as a step forward in trustworthy multimodal content verification.
[38] HarmoCLIP: Harmonizing Global and Regional Representations in Contrastive Vision-Language Models
Haoxi Zeng, Haoxuan Li, Yi Bin, Pengpeng Zeng, Xing Xu, Yang Yang, Heng Tao Shen
🧩 TL;DR
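This paper proposes HarmoCLIP, a framework that addresses the trade-off between global and local representations in CLIP. By introducing fine-grained semantic supervision and a region-language alignment strategy, it strengthens local semantic understanding while preserving global consistency.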
📘 Detailed Summary
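Motivation: CLIP generalizes well, but its fine-grained semantic understanding is limited by the lack of region-level supervision. Existing methods that try to mitigate this often disrupt global alignment, so that improving local perception degrades global coherence, a persistent trade-off.
Method: HarmoCLIP first identifies the absence of direct alignment between local textual and visual semantics as the fundamental cause of the trade-off. It then introduces an explicit fine-grained semantic supervision term that directly aligns textual segments with their corresponding visual regions, bridging the image region space and the textual space. To further strengthen local representations, it adds a Region-Language Alignment supervision strategy that promotes fine-grained semantic learning without compromising global semantic consistency.
Result: Extensive experiments show that HarmoCLIP achieves state-of-the-art performance on the global task of retrieval (improvement of up to 69.78%) and a 3.2% improvement in Top-1 accuracy on the region task of bounding-box classification. It consistently outperforms prior approaches and offers a balanced, efficient, plug-and-play solution to the global-local trade-off in CLIP.
Conclusion: The study identifies the root cause of the global-local representation trade-off in CLIP and proposes an effective way to harmonize the two. HarmoCLIP improves fine-grained semantic understanding while keeping global alignment intact, offering a plug-and-play design for fine-grained understanding in vision-language models.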
📄 Abstract
Contrastive Language-Image Pre-training (CLIP) has demonstrated remarkable generalization ability and strong performance across a wide range of vision-language tasks. However, due to the lack of region-level supervision, CLIP exhibits limited fine-grained semantic understanding. Although several methods attempt to mitigate this issue, they unintentionally disrupt the global alignment, resulting in a persistent trade-off where improving local perception simultaneously degrades global coherence. In this paper, we propose HarmoCLIP, a novel framework designed to harmonize global and regional representations within CLIP. We first identify that the absence of direct alignment between local textual and visual semantics is the fundamental cause of the trade-off. To address this, HarmoCLIP introduces an explicit fine-grained semantic supervision term that directly aligns textual segments with their corresponding visual regions, effectively bridging the image region space and the textual space. To further strengthen the representation capability at the local level, our method introduces a novel Region-Language Alignment supervision strategy that promotes fine-grained semantic learning without compromising global semantic consistency. Extensive experiments demonstrate that HarmoCLIP achieves state-of-the-art performance on the global task of retrieval (improvement up to 69.78%) and yields a substantial 3.2% improvement in Top-1 accuracy on the region task of bounding-box classification, consistently outperforming prior approaches while providing a balanced, efficient, and plug-and-play solution to the global-local trade-off in CLIP. Code is available at https://github.com/Erosist/HarmoCLIP.
[39] UAV-MM3D: A Large-Scale Synthetic Benchmark for 3D Perception of Unmanned Aerial Vehicles with Multi-Modal Data
Longkun Zou, Jiale Wang, Rongqin Liang, Hai Wu, Ke Chen, Yaowei Wang
🧩 TL;DR
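This paper presents UAV-MM3D, a high-fidelity multimodal synthetic dataset for low-altitude UAV perception and motion understanding. It contains 400K synchronized frames spanning multiple scenes, weather conditions, and sensor modalities, and comes with baseline models to support research.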
📘 Detailed Summary
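Motivation: Accurate perception of UAVs in complex low-altitude environments is critical for airspace security and intelligent systems. Real-world UAV data collection is constrained by airspace regulations, privacy concerns, and environmental variability, and manual annotation of 3D poses and cross-modal correspondences is time-consuming and costly, so existing datasets fall short of what reliable solutions require.
Method: The authors build the UAV-MM3D synthetic dataset of 400K synchronized frames covering diverse scenes (urban areas, suburbs, forests, coastal regions) and weather conditions (clear, cloudy, rainy, foggy), with multiple UAV models (micro, small, medium-sized) and five modalities: RGB, IR, LiDAR, Radar, and DVS (Dynamic Vision Sensor). Each frame provides 2D/3D bounding boxes, 6-DoF poses, and instance-level annotations. They also propose LGFusionNet, a LiDAR-guided multimodal fusion baseline, and a dedicated UAV trajectory prediction baseline.
Result: UAV-MM3D supports core UAV tasks such as 3D detection, pose estimation, target tracking, and short-term trajectory forecasting. With its controllable simulation environment, comprehensive scenario coverage, and rich annotations, it provides a public benchmark for research on 3D perception of UAVs.
Conclusion: By building a large-scale, high-quality multimodal synthetic dataset, the work addresses the difficulty of acquiring real data for UAV perception research. It supplies a data resource and benchmark for low-altitude UAV perception and motion understanding, and illustrates the potential of synthetic data for working around real-world data constraints.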
📄 Abstract
Accurate perception of UAVs in complex low-altitude environments is critical for airspace security and related intelligent systems. Developing reliable solutions requires large-scale, accurately annotated, and multimodal data. However, real-world UAV data collection faces inherent constraints due to airspace regulations, privacy concerns, and environmental variability, while manual annotation of 3D poses and cross-modal correspondences is time-consuming and costly. To overcome these challenges, we introduce UAV-MM3D, a high-fidelity multimodal synthetic dataset for low-altitude UAV perception and motion understanding. It comprises 400K synchronized frames across diverse scenes (urban areas, suburbs, forests, coastal regions) and weather conditions (clear, cloudy, rainy, foggy), featuring multiple UAV models (micro, small, medium-sized) and five modalities - RGB, IR, LiDAR, Radar, and DVS (Dynamic Vision Sensor). Each frame provides 2D/3D bounding boxes, 6-DoF poses, and instance-level annotations, enabling core tasks related to UAVs such as 3D detection, pose estimation, target tracking, and short-term trajectory forecasting. We further propose LGFusionNet, a LiDAR-guided multimodal fusion baseline, and a dedicated UAV trajectory prediction baseline to facilitate benchmarking. With its controllable simulation environment, comprehensive scenario coverage, and rich annotations, UAV-MM3D offers a public benchmark for advancing 3D perception of UAVs.
[40] All Centers Are at most a Few Tokens Apart: Knowledge Distillation with Domain Invariant Prompt Tuning
Amir Mohammad Ezzati, Alireza Malekhosseini, Armin Khosravi, Mohammad Hossein Rohban
🧩 TL;DR
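This paper proposes Domain Invariant Prompt Tuning for domain generalization in computational pathology: it distills knowledge from a pathology vision-language model and learns domain-invariant prompts to improve generalization across heterogeneous clinical data.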
📘 Detailed Summary
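Motivation: In computational pathology, variations in staining protocols, scanner devices, and imaging settings cause significant domain shifts. Existing vision-language models are sensitive to prompt variations in the zero-shot setting, and pathology images lack the semantic descriptors available for natural images, which makes domain-specific prompts hard to define. A data-driven approach to learning domain-invariant prompts is therefore needed.
Method: Domain Invariant Prompt Tuning (DIPT) learns multiple input tokens for each domain. The tokens are trained separately per domain and then averaged across domains to produce domain-invariant prompts. The student model distills knowledge from PLIP's text encoder using the prompts learned by DIPT, aligning visual features with domain-invariant embeddings.
Result: On domain generalization with histopathology datasets, the method yields a significant improvement in average F1-score over existing state-of-the-art knowledge distillation approaches, indicating better generalization across heterogeneous data sources.
Conclusion: The work offers a route to deploying robust computational pathology models on real-world clinical problems. Learning domain-invariant prompts improves cross-center generalization and suggests a new way to handle domain shift in medical image analysis.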
📄 Abstract
Domain generalization is critical in computational pathology (CPath) due to inherent domain shifts caused by variations in staining protocols, scanner devices, and imaging settings across clinical centers. Vision-language models (VLMs), such as PLIP, a pathology-tuned CLIP trained on image-text pairs across diverse domains, serve as strong knowledge distillation sources. However, their zero-shot performance with predefined prompts remains limited due to sensitivity to prompt variations. Moreover, unlike natural images, histopathology centers lack semantic descriptors (e.g., 'sketch'), making it difficult to define domain-specific prompts for clinical centers. This requires a data-driven approach for learning domain-specific and ultimately class-generic continuous prompts. We propose Domain Invariant Prompt Tuning (DIPT) for the knowledge distillation process, a novel step that learns multiple input tokens for each domain. These tokens are trained separately for each domain and are averaged across domains, leading to domain-invariant prompts. Our student model then distills knowledge from PLIP's text encoder by leveraging the prompts learned by DIPT. This leads to alignment of visual features with domain-invariant embeddings, enhancing generalization by training on multiple domains. Our method adds a significant improvement in average F1-score to existing state-of-the-art (SOTA) knowledge distillation approaches in domain generalization with histopathology datasets. This work paves the way for deploying robust CPath models in real-world clinical problems with heterogeneous data sources.
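The averaging step described above ("trained separately for each domain and ... averaged across domains") can be illustrated with a small sketch. It assumes each domain has already learned the same number of prompt tokens with the same embedding dimension; the function name `average_prompts`, the dictionary layout, and the toy values are hypothetical and only show the element-wise mean, not the authors' training code.

```python
def average_prompts(domain_prompts):
    """Average learned prompt tokens across domains.

    domain_prompts: dict mapping a domain name to a list of tokens,
    where each token is a list of floats. All domains are assumed to
    have the same number of tokens and the same embedding dimension.
    Returns the domain-invariant prompt as a list of tokens.
    """
    domains = list(domain_prompts)
    n_tokens = len(domain_prompts[domains[0]])
    dim = len(domain_prompts[domains[0]][0])
    invariant = []
    for t in range(n_tokens):
        # Element-wise mean of token t over all domains.
        token = [
            sum(domain_prompts[d][t][k] for d in domains) / len(domains)
            for k in range(dim)
        ]
        invariant.append(token)
    return invariant


if __name__ == "__main__":
    prompts = {
        "center_a": [[1.0, 2.0], [0.0, 0.0]],   # toy values
        "center_b": [[3.0, 4.0], [2.0, -2.0]],
    }
    print(average_prompts(prompts))
```

In a full pipeline the averaged tokens would be fed to the text encoder in place of a hand-written prompt, so no per-center description is needed at distillation time.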
[41] SkeletonAgent: An Agentic Interaction Framework for Skeleton-based Action Recognition
Hongda Liu, Yunfan Liu, Changlu Wang, Yunlong Wang, Zhenan Sun
🧩 TL;DR
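This paper proposes SkeletonAgent, a framework that bridges a skeleton-based action recognition model and a large language model through two cooperative agents (Questioner and Selector), so that the LLM supplies more discriminative semantic priors and similar actions are distinguished more reliably.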
📘 Detailed Summary
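Motivation: Existing skeleton-based action recognition methods use LLMs to provide semantic priors, but the LLM typically runs in isolation from the recognition model and receives no performance feedback. As a result, it fails to deliver the discriminative cues needed to distinguish similar actions, which limits further gains.
Method: SkeletonAgent comprises two cooperative agents. The Questioner identifies the most frequently confused action classes and supplies them to the LLM as context for more targeted guidance. The Selector parses the LLM's response, extracts precise joint-level constraints, and feeds them back to the recognizer, enabling finer-grained cross-modal alignment.
Result: Comprehensive evaluations on five benchmarks (NTU RGB+D, NTU RGB+D 120, Kinetics-Skeleton, FineGYM, and UAV-Human) show that SkeletonAgent consistently outperforms state-of-the-art methods.
Conclusion: The study shows that bridging the recognition model and the LLM through cooperative agents lets the LLM provide more discriminative semantic guidance. It offers a new cross-modal interaction paradigm for skeleton-based action recognition and indicates that extracting fine-grained, joint-level constraints is important for performance.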
📄 Abstract
Recent advances in skeleton-based action recognition increasingly leverage semantic priors from Large Language Models (LLMs) to enrich skeletal representations. However, the LLM is typically queried in isolation from the recognition model and receives no performance feedback. As a result, it often fails to deliver the targeted discriminative cues critical to distinguish similar actions. To overcome these limitations, we propose SkeletonAgent, a novel framework that bridges the recognition model and the LLM through two cooperative agents, i.e., Questioner and Selector. Specifically, the Questioner identifies the most frequently confused classes and supplies them to the LLM as context for more targeted guidance. Conversely, the Selector parses the LLM's response to extract precise joint-level constraints and feeds them back to the recognizer, enabling finer-grained cross-modal alignment. Comprehensive evaluations on five benchmarks, including NTU RGB+D, NTU RGB+D 120, Kinetics-Skeleton, FineGYM, and UAV-Human, demonstrate that SkeletonAgent consistently outperforms state-of-the-art benchmark methods. The code is available at https://github.com/firework8/SkeletonAgent.
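One plausible reading of the Questioner's first step, identifying "the most frequently confused classes", is to rank the off-diagonal entries of the recognizer's confusion matrix. The sketch below does exactly that and builds a question for the LLM from each confused pair. The function names, the question wording, and the toy matrix are assumptions for illustration; the released SkeletonAgent code may implement this differently.

```python
def most_confused_pairs(confusion, labels, top_k=3):
    """Rank (true, predicted) class pairs by misclassification count.

    confusion[i][j] counts samples of true class i predicted as class j.
    Returns up to top_k tuples (true_label, predicted_label, count).
    """
    pairs = []
    for i, row in enumerate(confusion):
        for j, count in enumerate(row):
            if i != j and count > 0:  # off-diagonal entries are errors
                pairs.append((count, labels[i], labels[j]))
    pairs.sort(key=lambda p: -p[0])  # stable sort, largest count first
    return [(true, pred, count) for count, true, pred in pairs[:top_k]]


def build_question(true_label, predicted_label):
    """Hypothetical prompt asking an LLM for discriminative joint cues."""
    return (f"Which body joints and motions best distinguish "
            f"'{true_label}' from '{predicted_label}'?")


if __name__ == "__main__":
    labels = ["reading", "writing", "clapping"]
    confusion = [[50, 8, 0],
                 [3, 40, 1],
                 [0, 2, 60]]
    for true, pred, count in most_confused_pairs(confusion, labels, top_k=2):
        print(count, build_question(true, pred))
```

Feeding only the top confused pairs to the LLM keeps the query targeted, which matches the abstract's point that generic, feedback-free queries miss the cues that matter.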
[42] Leveraging Textual Compositional Reasoning for Robust Change Captioning
Kyu Ri Park, Jiyoung Park, Seong Tae Kim, Hong Joo Lee, Jung Uk Kim
🧩 TL;DR
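This paper proposes CORTEX, a framework that enhances change captioning by integrating textual cues from vision-language models. It captures both pixel-level differences and scene-level textual knowledge, addressing the difficulty existing methods have in recognizing subtle but meaningful changes when relying on visual features alone.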
📘 Detailed Summary
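Motivation: Existing change captioning methods rely mainly on visual features. Because these cannot explicitly represent structured information such as object relationships and compositional semantics, they often miss subtle but meaningful changes, which limits accurate understanding and description of complex scene changes.
Method: CORTEX consists of three key modules: an Image-level Change Detector that identifies low-level visual differences between paired images; a Reasoning-aware Text Extraction (RTE) module that uses VLMs to generate compositional reasoning descriptions implicit in visual features; and an Image-Text Dual Alignment (ITDA) module that aligns visual and textual features for fine-grained relational reasoning, enabling joint reasoning over both.
Result: The abstract reports no specific performance figures. It states that CORTEX integrates pixel-level differences with scene-level textual knowledge, uses VLMs to extract richer image-text signals that reveal underlying compositional reasoning, and thereby captures changes that are ambiguous from visual features alone.
Conclusion: The study emphasizes the value of textual cues for change understanding. By combining complementary information from the visual and textual modalities, CORTEX offers a new approach to recognizing subtle changes in change captioning and points toward applying multimodal reasoning to visual change analysis.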
📄 Abstract
Change captioning aims to describe changes between a pair of images. However, existing works rely on visual features alone, which often fail to capture subtle but meaningful changes because they lack the ability to represent explicitly structured information such as object relationships and compositional semantics. To alleviate this, we present CORTEX (COmpositional Reasoning-aware TEXt-guided), a novel framework that integrates complementary textual cues to enhance change understanding. In addition to capturing cues from pixel-level differences, CORTEX utilizes scene-level textual knowledge provided by Vision Language Models (VLMs) to extract richer image-text signals that reveal underlying compositional reasoning. CORTEX consists of three key modules: (i) an Image-level Change Detector that identifies low-level visual differences between paired images, (ii) a Reasoning-aware Text Extraction (RTE) module that uses VLMs to generate compositional reasoning descriptions implicit in visual features, and (iii) an Image-Text Dual Alignment (ITDA) module that aligns visual and textual features for fine-grained relational reasoning. This enables CORTEX to reason over visual and textual features and capture changes that are otherwise ambiguous in visual features alone.
[43] ABounD: Adversarial Boundary-Driven Few-Shot Learning for Multi-Class Anomaly Detection
Runzhi Deng, Yundi Hu, Xinshuang Zhang, Zhao Wang, Xixi Liu, Wang-Zhou Dai, Caifeng Shan, Fang Zhao
🧩 TL;DR
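This paper proposes ABounD, an adversarial boundary-driven framework for few-shot multi-class industrial anomaly detection. A Dynamic Concept Fusion module generates class-adaptive prompts, and Adversarial Boundary Forging shapes a more precise decision boundary, achieving state-of-the-art performance on the MVTec-AD and VisA datasets.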
📘 Detailed Summary
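Motivation: In few-shot multi-class industrial anomaly detection, data scarcity blurs the boundary between normal and abnormal states. This ambiguity makes subtle defects hard to detect and can cause atypical normal samples to be rejected. Existing vision-language models need to balance category adaptivity against discriminative power.
Method: ABounD integrates semantic concept learning with decision boundary shaping. The Dynamic Concept Fusion (DCF) module produces class-adaptive prompts by fusing generalizable priors with class-specific cues. The Adversarial Boundary Forging (ABF) module generates boundary-level fence features via PGD-style perturbations to sculpt a precise decision boundary. Training is single-stage under a Concept-Boundary Loss, in which ABF provides the main supervisory signal and semantic-spatial regularizers stabilize the optimization.
Result: Experiments on the MVTec-AD and VisA datasets show state-of-the-art performance in few-shot multi-class anomaly detection. The resulting decision boundary closely follows the normal data distribution while preserving flexibility and robust semantic alignment.
Conclusion: The study shows that combining semantic concept learning with decision boundary shaping can address boundary ambiguity in few-shot anomaly detection. Adversarial Boundary Forging provides an effective supervisory signal for learning a precise boundary, and the unified framework suggests a way to balance category adaptivity and discriminative power in industrial anomaly detection.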
📄 Abstract
Few-shot multi-class industrial anomaly detection remains a challenging task. Vision-language models need to be both category-adaptive and sharply discriminative, yet data scarcity often blurs the boundary between normal and abnormal states. This ambiguity leads to missed subtle defects and the rejection of atypical normal samples. We propose ABounD, an Adversarial Boundary-Driven few-shot learning for multi-class anomaly detection, which is a unified learning framework that integrates semantic concept learning with decision boundary shaping. The Dynamic Concept Fusion (DCF) module produces class-adaptive prompts by fusing generalizable priors with class-specific cues, conditioned on image features. Meanwhile, Adversarial Boundary Forging (ABF) sculpts a more precise decision margin by generating boundary-level fence features via PGD-style perturbations. Training is conducted in a single stage under a Concept-Boundary Loss, where ABF provides the main supervisory signal and semantic-spatial regularizers stabilize the optimization. This synergy yields a decision boundary that closely follows normal data while preserving flexibility and robust semantic alignment. Experiments on MVTec-AD and VisA datasets demonstrate state-of-the-art performance in the task of few-shot multi-class anomaly detection.
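The abstract says only that ABF generates fence features "via PGD-style perturbations", so the following is a generic projected gradient sketch rather than the authors' module. It assumes a signed-gradient ascent step on some anomaly score and an L-infinity ball of radius `eps` around the original feature; the function names, the step sizes, and the toy linear score are all illustrative assumptions.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def pgd_perturb(feature, grad_fn, eps=0.1, step=0.03, n_steps=5):
    """Push a feature toward higher score while staying within eps of it.

    feature: list of floats (e.g., a normal-sample feature vector).
    grad_fn: callable returning the gradient of a score w.r.t. the
             current point (supplied by the caller; hypothetical here).
    """
    x = list(feature)
    for _ in range(n_steps):
        g = grad_fn(x)
        # Ascent step along the sign of the gradient.
        x = [xi + step * sign(gi) for xi, gi in zip(x, g)]
        # Projection: clip each coordinate back into the L-inf ball.
        x = [min(max(xi, f - eps), f + eps) for xi, f in zip(x, feature)]
    return x


if __name__ == "__main__":
    w = [2.0, -1.0]                      # toy linear score: w . x
    fence = pgd_perturb([0.0, 1.0], grad_fn=lambda x: w)
    print(fence)
```

The projection step is what keeps the perturbed feature near the normal sample, so it can act as a boundary-level example instead of an arbitrary outlier.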
[44] Ovis-Image Technical Report
Guo-Hua Wang, Liangfu Cao, Tianyu Cui, Minghao Fu, Xiaohao Chen, Pengxin Zhan, Jianshan Zhao, Lan Li, Bowen Fu, Jiaqi Liu, Qing-Guo Chen
🧩 TL;DR
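This paper presents Ovis-Image, a 7B-parameter text-to-image model optimized for high-quality text rendering. It runs efficiently under stringent computational constraints, matches the text rendering performance of much larger open models, and remains deployable on a single GPU.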
📘 Detailed Summary
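Motivation: Frontier text rendering models are typically very large or depend on closed-source systems, which makes deployment expensive and high-quality text generation hard to achieve with limited compute. The work aims to build a compact, efficient text-to-image model optimized for text rendering, narrowing the gap between frontier text rendering and practical deployment.
Method: Building on the earlier Ovis-U1 framework, Ovis-Image integrates a diffusion-based visual decoder with the stronger Ovis 2.5 multimodal backbone. It uses a text-centric training pipeline that combines large-scale pre-training with carefully tailored post-training refinements.
Result: Despite its compact architecture, Ovis-Image achieves text rendering performance on par with significantly larger open models such as Qwen-Image and approaches closed-source systems such as Seedream and GPT4o. The model can be deployed on a single high-end GPU with moderate memory.
Conclusion: The results indicate that combining a strong multimodal backbone with a carefully designed, text-focused training recipe is sufficient for reliable bilingual text rendering, without resorting to oversized or proprietary models. This suggests a path to efficient, deployable text rendering models and lowers the barrier to high-quality text generation.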
📄 Abstract
We introduce Ovis-Image, a 7B text-to-image model specifically optimized for high-quality text rendering, designed to operate efficiently under stringent computational constraints. Built upon our previous Ovis-U1 framework, Ovis-Image integrates a diffusion-based visual decoder with the stronger Ovis 2.5 multimodal backbone, leveraging a text-centric training pipeline that combines large-scale pre-training with carefully tailored post-training refinements. Despite its compact architecture, Ovis-Image achieves text rendering performance on par with significantly larger open models such as Qwen-Image and approaches closed-source systems like Seedream and GPT4o. Crucially, the model remains deployable on a single high-end GPU with moderate memory, narrowing the gap between frontier-level text rendering and practical deployment. Our results indicate that combining a strong multimodal backbone with a carefully designed, text-focused training recipe is sufficient to achieve reliable bilingual text rendering without resorting to oversized or proprietary models.
[45] Do You See What I Say? Generalizable Deepfake Detection based on Visual Speech Recognition
Maheswar Bora, Tashvik Dhamija, Shukesh Reddy, Baptiste Chopin, Pranav Balaji, Abhijit Das, Antitza Dantcheva
🧩 TL;DR
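This paper proposes FauxNet, a deepfake detection network built on pre-trained visual speech recognition features. It consistently outperforms existing methods in the zero-shot detection setting and can identify the generation source of a video. The work also releases the Authentica datasets, containing about 38,000 real and fake videos.
📘 Detailed Summary
Motivation: Rapid progress in deepfake generation has produced highly realistic generated images, videos, and audio, and has raised serious concerns about the misuse of manipulated media. To mitigate such misuse, robust and reliable deepfake detection is urgently needed, particularly zero-shot (i.e., generalizable) detection, which is the key challenge.
Method: The paper proposes FauxNet, a novel deepfake detection network based on pre-trained visual speech recognition (VSR) features. By extracting temporal VSR features from videos, the system identifies and separates real videos from manipulated ones. The work also introduces new datasets, Authentica-Vox and Authentica-HDTF, comprising about 38,000 real and fake videos in total, with the latter created using six recent deepfake generation techniques.
Result: FauxNet consistently outperforms the state of the art in the zero-shot detection setting and can also perform attribution, distinguishing the generation technique from which a video originates. Extensive analysis and results on the Authentica datasets and FaceForensics++ demonstrate FauxNet's superiority. The Authentica datasets will be made publicly available to support further research.
Conclusion: The study shows that deepfake detection based on visual speech recognition features has a clear advantage in the zero-shot setting: it detects fake videos and also identifies their generation source. The released Authentica datasets provide a new benchmark resource for deepfake detection research and advance generalizable detection.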
📄 Abstract
Deepfake generation has witnessed remarkable progress, contributing to highly realistic generated images, videos, and audio. While technically intriguing, such progress has raised serious concerns related to the misuse of manipulated media. To mitigate such misuse, robust and reliable deepfake detection is urgently needed. Towards this, we propose a novel network, FauxNet, which is based on pre-trained Visual Speech Recognition (VSR) features. By extracting temporal VSR features from videos, we identify and segregate real videos from manipulated ones. The holy grail in this context is zero-shot detection, i.e., generalizable detection, which we focus on in this work. FauxNet consistently outperforms the state of the art in this setting. In addition, FauxNet is able to perform attribution, i.e., to distinguish between the generation techniques from which the videos stem. Finally, we propose new datasets, referred to as Authentica-Vox and Authentica-HDTF, comprising about 38,000 real and fake videos in total, the latter created with six recent deepfake generation techniques. We provide extensive analysis and results on the Authentica datasets and FaceForensics++, demonstrating the superiority of FauxNet. The Authentica datasets will be made publicly available.
[46] From Illusion to Intention: Visual Rationale Learning for Vision-Language Reasoning
Changpeng Wang, Haozhe Wang, Xi Chen, Junhan Liu, Taofeng Xue, Chong Peng, Donglian Qi, Fangzhen Lin, Yunfeng Yan
🧩 TL;DR
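This paper proposes the Visual Rationale Learning (ViRL) paradigm, which reframes visual actions as core reasoning primitives rather than optional tools. Models are trained with end-to-end reinforcement learning to reason from visual evidence, achieving state-of-the-art results on perception, hallucination, and reasoning benchmarks.
📘 Detailed Summary
Motivation: Current vision-language reasoning frameworks treat visual actions as optional tools, leaving model reasoning without visual grounding and giving rise to the "illusion of thinking with images": models appear to rely on visual evidence but actually depend on context-agnostic actions that neither refine perception nor guide reasoning toward the correct answer.
Method: The paper proposes the Visual Rationale Learning (ViRL) paradigm, which reframes visual actions as the core reasoning primitives of visual rationalization (the visual analogue of textual Chain-of-Thought). ViRL has three key components: process supervision with ground-truth rationales, objective alignment via step-level reward shaping, and a fine-grained credit assignment mechanism that distinguishes correct, redundant, and erroneous actions.
Result: Trained purely with end-to-end reinforcement learning, ViRL achieves state-of-the-art performance on multiple benchmarks spanning perception, hallucination, and reasoning, validating the effectiveness of the visual rationalization paradigm for improving model transparency and verifiability.
Conclusion: The work establishes visual rationalization as a task-agnostic, process-grounded paradigm and offers a new direction for building transparent, verifiable, and trustworthy vision-language models. It ensures that each visual action contributes meaningfully to the reasoning chain, so that models "get the right answer for the right visual reason".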
📄 Abstract
Recent advances in vision-language reasoning underscore the importance of thinking with images, where models actively ground their reasoning in visual evidence. Yet, prevailing frameworks treat visual actions as optional tools, boosting metrics but leaving reasoning ungrounded and crops ineffective. This gap gives rise to the illusion of thinking with images: models seem visually grounded but rely on context-agnostic actions that neither refine perception nor guide reasoning toward correct answers. We address this problem by reframing visual actions as core reasoning primitives rather than optional tools, which we term visual rationalization, the visual analogue of textual Chain-of-Thought. Building on this insight, we propose Visual Rationale Learning (ViRL), an end-to-end paradigm that grounds training in the visual rationale itself. ViRL integrates (1) Process Supervision with ground-truth rationales, (2) Objective Alignment via step-level reward shaping, and (3) Fine-Grained Credit Assignment to distinguish correct, redundant, and erroneous actions. By ensuring each action contributes meaningfully to the reasoning chain, ViRL enables models to "get the right answer for the right visual reason". Trained purely with end-to-end RL, ViRL achieves state-of-the-art results across benchmarks spanning perception, hallucination, and reasoning. This work establishes visual rationalization as a task-agnostic, process-grounded paradigm for building transparent, verifiable, and trustworthy vision-language models.
[47] Beyond Real versus Fake Towards Intent-Aware Video Analysis
Saurabh Atreya, Nabyl Quignon, Baptiste Chopin, Abhijit Das, Antitza Dantcheva
🧩 TL;DR
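This paper introduces the IntentHQ benchmark, shifting the deepfake detection paradigm from authenticity verification to intent analysis. Multimodal models identify 23 fine-grained intent categories behind videos, providing a new framework for understanding the motives behind manipulated videos.
📘 Detailed Summary
Motivation: With the rapid advance of generative models, deepfake videos are increasingly realistic and pose serious societal and security risks. Existing detection methods focus only on distinguishing real from fake videos and fail to address a fundamental question: what is the intent behind a manipulated video? This work aims to fill that gap by shifting the research paradigm from authenticity verification to contextual understanding of videos.
Method: The study introduces the IntentHQ benchmark, consisting of 5168 carefully collected and annotated videos covering 23 fine-grained intent categories. It uses supervised and self-supervised multimodal models that integrate spatio-temporal video features, audio processing, and text analysis to infer the underlying motivations and goals behind videos. The proposed model is optimized to differentiate between a wide range of intent categories.
Result: The study builds the IntentHQ benchmark dataset of 5168 videos annotated with 23 intent categories, including "Financial fraud", "Indirect marketing", "Political propaganda", and "Fear mongering". Multimodal models perform intent recognition on these videos, providing a new evaluation framework and benchmark for human-centered intent analysis.
Conclusion: The work moves deepfake detection from simple authenticity judgment to intent understanding, offering a more comprehensive framework for understanding the societal impact of manipulated videos. The IntentHQ benchmark provides a dataset and evaluation standard for future research and pushes video content analysis toward finer-grained intent recognition, with significant practical value.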
📄 Abstract
The rapid advancement of generative models has led to increasingly realistic deepfake videos, posing significant societal and security risks. While existing detection methods focus on distinguishing real from fake videos, such approaches fail to address a fundamental question: What is the intent behind a manipulated video? Towards addressing this question, we introduce IntentHQ: a new benchmark for human-centered intent analysis, shifting the paradigm from authenticity verification to contextual understanding of videos. IntentHQ consists of 5168 videos that have been meticulously collected and annotated with 23 fine-grained intent-categories, including "Financial fraud", "Indirect marketing", "Political propaganda", as well as "Fear mongering". We perform intent recognition with supervised and self-supervised multi-modality models that integrate spatio-temporal video features, audio processing, and text analysis to infer underlying motivations and goals behind videos. Our proposed model is streamlined to differentiate between a wide range of intent-categories.
[48] SpaceMind: Camera-Guided Modality Fusion for Spatial Reasoning in Vision-Language Models
Ruosen Zhao, Zhikang Zhang, Jialei Xu, Jiahao Chang, Dong Chen, Lingyun Li, Weijian Sun, Zizhuang Wei
🧩 TL;DR
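This paper proposes SpaceMind, a multimodal large language model designed for spatial reasoning that achieves 3D spatial understanding from RGB inputs alone. Through a camera-guided modality fusion module, it treats the camera representation as an active guiding modality rather than passive metadata, significantly improving the spatial reasoning of vision-language models.
📘 Detailed Summary
Motivation: Existing large vision-language models perform well at multimodal understanding but still fall short on 3D spatial reasoning such as distance estimation, size comparison, and cross-view consistency. Existing 3D-aware methods either depend on auxiliary 3D information or enhance RGB-only models only through shallow feature fusion, without actively exploiting the camera representation.
Method: SpaceMind adopts a dual-encoder architecture, integrating VGGT as the spatial understanding encoder and InternViT as the 2D visual encoder. Its core innovation is a lightweight Camera-Guided Modality Fusion module placed before the language model to replace shallow fusion. The module applies camera-conditioned biasing to spatial tokens, assigns query-independent weights reflecting their geometric importance, and uses the camera embedding to gate the fused representation.
Result: SpaceMind sets new state-of-the-art results on the VSI-Bench, SQA3D, and SPBench benchmarks. It surpasses both open-source and proprietary systems by large margins on VSI-Bench and SPBench and also achieves state-of-the-art performance on SQA3D, demonstrating the effectiveness of camera-guided modality fusion.
Conclusion: The study shows that camera-guided modality fusion is an effective and practical inductive bias for giving vision-language models genuinely spatially grounded intelligence. The method achieves strong 3D spatial reasoning from RGB input alone and points to an important direction for future research; the authors will release code and model checkpoints to support follow-up work.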
📄 Abstract
Large vision-language models (VLMs) show strong multimodal understanding but still struggle with 3D spatial reasoning, such as distance estimation, size comparison, and cross-view consistency. Existing 3D-aware methods either depend on auxiliary 3D information or enhance RGB-only VLMs with geometry encoders through shallow feature fusion. We propose SpaceMind, a multimodal large language model explicitly designed for spatial reasoning solely from RGB inputs. The model adopts a dual-encoder architecture, integrating VGGT as a spatial understanding encoder and InternViT as a 2D visual encoder. The key idea is to treat the camera representation as an active guiding modality rather than passive metadata. Specifically, SpaceMind introduces a lightweight Camera-Guided Modality Fusion module before the language model to replace shallow fusion. It applies camera-conditioned biasing to spatial tokens, assigns query-independent weights reflecting their geometric importance, and uses the camera embedding to gate the fused representation. Empirically, SpaceMind establishes new state-of-the-art results on VSI-Bench, SQA3D and SPBench, surpassing both open and proprietary systems on VSI-Bench and SPBench by large margins and achieving state-of-the-art performance on SQA3D. These results demonstrate that camera-guided modality fusion is an effective and practical inductive bias for equipping VLMs with genuinely spatially grounded intelligence. We will release code and model checkpoints to support future research.
[49] RoadSceneBench: A Lightweight Benchmark for Mid-Level Road Scene Understanding
Xiyan Liu, Han Wang, Yuhu Wang, Junjie Cai, Zhe Cao, Jianzhong Yang, Zhen Lu
🧩 TL;DR
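This paper proposes the RoadSceneBench benchmark and the HRRP-T training framework to address the lack of road topology and dynamic scene structure reasoning in existing autonomous driving benchmarks. By emphasizing relational understanding and structural consistency, it improves the reasoning reliability of vision-language models in complex road environments.
📘 Detailed Summary
Motivation: Existing autonomous driving benchmarks mainly target perception tasks such as detection or segmentation and overlook the capabilities needed to reason about road topology and dynamic scene structure. This limits models' ability to understand the mid-level road semantics that link low-level perception to high-level planning, which are essential for reliable autonomous driving and digital map construction.
Method: The paper presents RoadSceneBench, a lightweight yet information-rich benchmark designed to evaluate visual reasoning in complex road environments. It also proposes Hierarchical Relational Reward Propagation with Temporal Consistency (HRRP-T), a training framework in which adaptive reward signals promote spatial coherence and semantic alignment, enabling vision-language models to perform geometry-aware and temporally consistent reasoning.
Result: Extensive experiments show that the proposed method achieves state-of-the-art performance across diverse road configurations. RoadSceneBench provides a compact yet powerful foundation for studying mid-level road semantics and fostering structure-aware autonomous perception, and the dataset is publicly available on GitHub.
Conclusion: The study highlights the importance of relational understanding and structural consistency in autonomous driving reasoning. The proposed benchmark and training framework let models move beyond static recognition toward geometry-aware and temporally consistent reasoning, providing a foundation for structure-aware autonomous perception research and advancing the use of vision-language models in complex road scenes.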
📄 Abstract
Understanding mid-level road semantics, which capture the structural and contextual cues that link low-level perception to high-level planning, is essential for reliable autonomous driving and digital map construction. However, existing benchmarks primarily target perception tasks such as detection or segmentation, overlooking the reasoning capabilities required to infer road topology and dynamic scene structure. To address this gap, we present RoadSceneBench, a lightweight yet information-rich benchmark designed to evaluate and advance visual reasoning in complex road environments. Unlike large-scale perception datasets, RoadSceneBench emphasizes relational understanding and structural consistency, encouraging models to capture the underlying logic of real-world road scenes. Furthermore, to enhance reasoning reliability, we propose Hierarchical Relational Reward Propagation with Temporal Consistency (HRRP-T), a training framework for Vision-Language Models (VLMs) in which reward signals adaptively promote spatial coherence and semantic alignment throughout the reasoning process. This paradigm enables models to move beyond static recognition toward geometry-aware and temporally consistent reasoning. Extensive experiments demonstrate that our method achieves state-of-the-art performance across diverse road configurations. RoadSceneBench thus provides a compact yet powerful foundation for studying mid-level road semantics and fostering structure-aware autonomous perception. Our dataset is available at https://github.com/XiyanLiu/RoadSceneBench.
[50] Hybrid, Unified and Iterative: A Novel Framework for Text-based Person Anomaly Retrieval
Tien-Huy Nguyen, Huu-Loc Tran, Huu-Phong Phan-Nguyen, Quang-Vinh Dinh
🧩 TL;DR
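This paper proposes a Local-Global Hybrid Perspective module for text-based person anomaly retrieval, combining a vision-language model with a unified image-text model, and significantly improves fine-grained feature extraction through an iterative ensemble strategy and a feature selection algorithm.
📘 Detailed Summary
Motivation: Most existing text-based person anomaly retrieval methods rely on complex deep-learning techniques and suffer from insufficient fine-grained feature extraction, which raises the question of how to optimize the model to obtain finer-grained feature representations.
Method: The paper proposes a Local-Global Hybrid Perspective (LHP) module integrated with a vision-language model, develops a Unified Image-Text (UIT) model that combines ITC, ITM, MLM, and MIM objective losses, designs a novel iterative ensemble strategy in place of conventional parallel ensembling, and introduces a feature selection algorithm guided by the LHP model.
Result: The method achieves state-of-the-art performance on the PAB dataset, improving over prior work by 9.70% in R@1, 1.77% in R@5, and 1.01% in R@10, which demonstrates its effectiveness.
Conclusion: The study shows that a local-global hybrid perspective combining fine-grained and coarse-grained features can significantly improve text-based person anomaly retrieval, and that the iterative ensemble strategy and model-guided feature selection algorithm offer new optimization directions for related tasks.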
📄 Abstract
Text-based person anomaly retrieval has emerged as a challenging task, with most existing approaches relying on complex deep-learning techniques. This raises a research question: how can the model be optimized to capture finer-grained features? To address this, we propose a Local-Global Hybrid Perspective (LHP) module integrated with a Vision-Language Model (VLM), designed to explore the effectiveness of incorporating fine-grained features alongside coarse-grained features. Additionally, we investigate a Unified Image-Text (UIT) model that combines multiple objective loss functions, including Image-Text Contrastive (ITC), Image-Text Matching (ITM), Masked Language Modeling (MLM), and Masked Image Modeling (MIM) losses. Beyond this, we propose a novel iterative ensemble strategy that combines model results iteratively rather than simultaneously, as other ensemble methods do. To take advantage of the superior performance of the LHP model, we introduce a novel feature selection algorithm based on its guidance, which helps improve the model's performance. Extensive experiments demonstrate that our method achieves state-of-the-art (SOTA) performance on the PAB dataset, with improvements over previous work of 9.70% in R@1, 1.77% in R@5, and 1.01% in R@10.
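For readers unfamiliar with the R@1, R@5, and R@10 figures quoted above, the following is a minimal sketch of how Recall@K is commonly computed for retrieval with one correct gallery item per query. The function names and the toy rankings are assumptions for illustration and are not taken from the paper's evaluation code.

```python
# Hypothetical sketch: Recall@K for text-to-image retrieval,
# assuming exactly one relevant gallery item per text query.

def hit_at_k(ranked_ids, relevant_id, k):
    """1.0 if the relevant item appears in the top-k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

def recall_at_k(rankings, targets, k):
    """Average hit rate over all queries."""
    hits = [hit_at_k(r, t, k) for r, t in zip(rankings, targets)]
    return sum(hits) / len(hits)

# Three toy queries; each ranking lists gallery ids from best to worst match.
rankings = [["a", "b", "c"], ["c", "a", "b"], ["b", "c", "a"]]
targets = ["a", "a", "a"]

r_at_1 = recall_at_k(rankings, targets, 1)
r_at_2 = recall_at_k(rankings, targets, 2)
r_at_3 = recall_at_k(rankings, targets, 3)
```

In this toy case the correct item is ranked first, second, and third respectively, so Recall@K grows from one third to one as K increases.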
[51] REVEAL: Reasoning-enhanced Forensic Evidence Analysis for Explainable AI-generated Image Detection
Huangsen Cao, Qin Mei, Zhiheng Li, Yuxi Li, Ying Zhang, Chen Li, Zhimeng Zhang, Xin Ding, Yongwei Wang, Jing Lyu, Fei Wu
🧩 TL;DR
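This paper proposes the REVEAL-Bench benchmark and the REVEAL framework, the first multimodal AI-generated image detection approach based on chain-of-evidence reasoning. Through an expert-model-driven reinforcement learning mechanism, it improves detection accuracy while producing verifiable, explanatory reasoning.
📘 Detailed Summary
Motivation: With the rapid development of generative models, AI-generated images are increasingly hard to distinguish from real ones, posing a severe threat to information integrity. Existing explainable forensic methods rely mainly on post-hoc rationalization or visual discrimination and lack a verifiable chain of evidence, so their explanations lack causal grounding and generalize poorly. A verifiable, chain-of-evidence-based forensic framework is urgently needed.
Method: The study introduces REVEAL-Bench, the first reasoning-enhanced multimodal benchmark for AI-generated image detection built around a chain of evidence, with step-by-step reasoning traces and evidential justifications generated from multiple lightweight expert models. Building on it, the REVEAL framework uses expert-grounded reinforcement learning with a reward mechanism designed to jointly optimize detection accuracy, explanation fidelity, and logical coherence grounded in explicit forensic evidence.
Result: Extensive experiments show that REVEAL significantly improves detection accuracy, explanation fidelity, and robust cross-model generalization, setting a new state of the art for explainable image forensics. The method produces fine-grained, interpretable, and verifiable reasoning chains while achieving strong detection performance.
Conclusion: By introducing a chain-of-evidence-based, reasoning-enhanced framework, the study addresses the key limitation that existing explainable forensic methods lack verifiability and offers a new paradigm for AI-generated image detection. REVEAL improves detection performance and, more importantly, establishes a verifiable explanation mechanism, laying a foundation for trustworthy forensic systems and advancing explainable AI in image forensics.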
📄 Abstract
With the rapid advancement of generative models, visually realistic AI-generated images have become increasingly difficult to distinguish from authentic ones, posing severe threats to social trust and information integrity. Consequently, there is an urgent need for efficient and truly explainable image forensic methods. Recent detection paradigms have shifted towards explainable forensics. However, state-of-the-art approaches primarily rely on post-hoc rationalizations or visual discrimination, lacking a verifiable chain of evidence. This reliance on surface-level pattern matching limits the generation of causally grounded explanations and often results in poor generalization. To bridge this critical gap, we introduce REVEAL-Bench, the first reasoning-enhanced multimodal benchmark for AI-generated image detection that is explicitly structured around a chain-of-evidence derived from multiple lightweight expert models, then records step-by-step reasoning traces and evidential justifications. Building upon this dataset, we propose REVEAL (Reasoning-enhanced Forensic Evidence Analysis), an effective and explainable forensic framework that integrates detection with a novel expert-grounded reinforcement learning. Our reward mechanism is specially tailored to jointly optimize detection accuracy, explanation fidelity, and logical coherence grounded in explicit forensic evidence, enabling REVEAL to produce fine-grained, interpretable, and verifiable reasoning chains alongside its detection outcomes. Extensive experimental results demonstrate that REVEAL significantly enhances detection accuracy, explanation fidelity, and robust cross-model generalization, benchmarking a new state of the art for explainable image forensics.
[52] Text Condition Embedded Regression Network for Automated Dental Abutment Design
Mianjie Zheng, Xinquan Yang, Xuguang Li, Xiaoling Luo, Xuefen Liu, Kun Tang, He Meng, Linlin Shen
🧩 TL;DR
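This paper proposes a text condition embedded dental implant abutment design framework (TCEAD). By introducing a text-guided localization module and self-supervised pre-training, it automates dental abutment design and significantly improves localization accuracy and design efficiency.
📘 Detailed Summary
Motivation: Conventional dental implant abutment design is time-consuming and labor-intensive, and long-term use of an ill-fitting abutment can lead to complications such as peri-implantitis. Existing methods lack an automated solution and localize the abutment area imprecisely, so an intelligent method that can quickly localize the area and design a well-fitting abutment is needed.
Method: The paper proposes the text condition embedded abutment design framework (TCEAD), which extends the self-supervised learning framework of the mesh masked autoencoder (MeshMAE) with a text-guided localization (TGL) module. The framework introduces a description of the abutment area through the CLIP text encoder so that the network can quickly locate it, and pre-trains the encoder on oral scan data to strengthen local fine-grained feature extraction.
Result: Experiments on a large abutment design dataset show that TCEAD improves Intersection over Union (IoU) by 0.8%-12.85% over other mainstream methods, confirming its strong performance in automated dental abutment design. The text-guided localization module effectively improves localization accuracy for the abutment area.
Conclusion: TCEAD demonstrates the effectiveness of text-guided localization and self-supervised learning for dental abutment design and offers a new approach to automated dental restoration design. The method improves design efficiency and, through precise localization, reduces the risk of complications, giving it potential for clinical translation.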
📄 Abstract
The abutment is an important part of artificial dental implants, and its design process is time-consuming and labor-intensive. Long-term use of inappropriate dental implant abutments may result in implant complications, including peri-implantitis. Using artificial intelligence to assist dental implant abutment design can improve the efficiency of abutment design and enhance abutment adaptability. In this paper, we propose a text condition embedded abutment design framework (TCEAD), a novel automated abutment design solution. The proposed study extends the self-supervised learning framework of the mesh mask autoencoder (MeshMAE) by introducing a text-guided localization (TGL) module to facilitate abutment area localization. As the parameter determination of the abutment is heavily dependent on local fine-grained features (the width and height of the implant and the distance to the opposing tooth), we pre-train the encoder using oral scan data to improve the model's feature extraction ability. Moreover, considering that the abutment area is only a small part of the oral scan data, we design a TGL module that introduces the description of the abutment area through the text encoder of Contrastive Language-Image Pre-training (CLIP), enabling the network to quickly locate the abutment area. We validate the performance of TCEAD on a large abutment design dataset. Extensive experiments demonstrate that TCEAD achieves an Intersection over Union (IoU) improvement of 0.8%-12.85% over other mainstream methods, underscoring its potential in automated dental abutment design.
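The IoU metric cited above can be illustrated with a small sketch. The abstract does not say how IoU is computed on oral scan data, so the version below simply assumes that a prediction and its ground truth are each represented as a set of selected element indices (for example mesh faces); the function name and the toy index sets are illustrative.

```python
# Hypothetical sketch: Intersection over Union between two index sets.
# Representing regions as sets of mesh-face indices is an assumption.

def iou(pred_indices, target_indices):
    """|pred ∩ target| / |pred ∪ target|; defined as 1.0 if both are empty."""
    pred, target = set(pred_indices), set(target_indices)
    union = pred | target
    if not union:
        return 1.0
    return len(pred & target) / len(union)

predicted_region = [1, 2, 3, 4]
true_region = [3, 4, 5, 6]
score = iou(predicted_region, true_region)  # 2 shared of 6 total indices
```

Two shared indices out of six distinct ones give an IoU of one third; a perfect prediction gives 1.0 and disjoint regions give 0.0.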
[53] AnoRefiner: Anomaly-Aware Group-Wise Refinement for Zero-Shot Industrial Anomaly Detection
Dayou Huang, Feng Xue, Xurui Li, Yu Zhou
🧩 TL;DR
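This paper proposes AnoRefiner, a plug-in anomaly-aware refiner that exploits the complementary spatial cues in anomaly score maps to lift the patch-level anomaly maps of zero-shot industrial anomaly detection to pixel-level precision, significantly improving detection performance.
📘 Detailed Summary
Motivation: Zero-shot industrial anomaly detection methods typically produce only coarse, patch-level anomaly maps. Existing approaches attempt refinement with synthetic anomaly data but still struggle to recover fine-grained anomalies accurately, mainly because of the gap between synthetic and real anomalies. The study observes that anomaly score maps provide complementary spatial cues absent from image features, a fact previously overlooked.
Method: The paper proposes the AnoRefiner framework, which includes an anomaly refinement decoder (ARD) that progressively enhances image features using anomaly score maps, reducing reliance on synthetic anomaly data. It also proposes a progressive group-wise test-time training (PGT) strategy that trains the ARD on each product group for use in refining the next group, while staying compatible with any ZSAD method.
Result: Experiments on the MVTec AD and VisA datasets show that AnoRefiner improves the pixel-level AP of various ZSAD models by up to 5.2%, an improvement that is directly visible in many visualizations and confirms the method's effectiveness.
Conclusion: The study reveals the value of anomaly score maps as complementary spatial cues. AnoRefiner provides an effective pixel-level refinement solution for industrial anomaly detection, and its modular design lets it integrate seamlessly with existing ZSAD methods, giving it practical potential.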
📄 Abstract
Zero-shot industrial anomaly detection (ZSAD) methods typically yield coarse anomaly maps because vision transformers (ViTs) extract only patch-level features. To solve this, recent solutions attempt to predict finer anomalies using features from ZSAD, but they still struggle to recover fine-grained anomalies without missed detections, mainly due to the gap between randomly synthesized training anomalies and real ones. We observe that anomaly score maps provide precisely the complementary spatial cues that are largely absent from ZSAD's image features, a fact overlooked before. Inspired by this, we propose an anomaly-aware refiner (AnoRefiner) that can be plugged into most ZSAD models and improve patch-level anomaly maps to the pixel level. First, we design an anomaly refinement decoder (ARD) that progressively enhances image features using anomaly score maps, reducing the reliance on synthetic anomaly data. Second, motivated by the mass production paradigm, we propose a progressive group-wise test-time training (PGT) strategy that trains ARD in each product group for the refinement process in the next group, while staying compatible with any ZSAD method. Experiments on the MVTec AD and VisA datasets show that AnoRefiner boosts various ZSAD models by up to a 5.2% gain in pixel-AP metrics, which can also be directly observed in many visualizations. The code will be available at https://github.com/HUST-SLOW/AnoRefiner.
[54] MG-Nav: Dual-Scale Visual Navigation via Sparse Spatial Memory
Bo Wang, Jiehong Lin, Chenzhi Liu, Xinting Hu, Yifei Yu, Tianjia Liu, Zhongrui Wang, Xiaojuan Qi
🧩 TL;DR
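This paper proposes MG-Nav (Memory-Guided Navigation), a dual-scale framework for zero-shot visual navigation. By unifying global memory-guided planning with local geometry-enhanced control, it achieves robust navigation under dynamic rearrangements and unseen scene conditions.
📘 Detailed Summary
Motivation: Current visual navigation methods face challenges in long-horizon planning, adaptation to dynamic environments, and generalization to unseen scenes, especially in the zero-shot setting, where global path planning and precise local control must be handled together.
Method: The MG-Nav framework uses a Sparse Spatial Memory Graph (SMG) as a compact, region-centric memory representation. Globally, it plans paths via image-to-instance hybrid retrieval; locally, a navigation foundation policy executes in point-goal mode; and a VGGT-adapter geometric module aligns observation and goal features in a shared 3D-aware space.
Result: On the HM3D Instance-Image-Goal and MP3D Image-Goal benchmarks, MG-Nav achieves state-of-the-art zero-shot performance and remains robust under dynamic rearrangements and unseen scene conditions, demonstrating the effectiveness of the dual-scale framework.
Conclusion: The study demonstrates the advantages of a framework that unifies global memory-guided planning with local geometry-enhanced control and provides a scalable solution for visual navigation systems, with particular value for long-horizon tasks and changing environments.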
📄 Abstract
We present MG-Nav (Memory-Guided Navigation), a dual-scale framework for zero-shot visual navigation that unifies global memory-guided planning with local geometry-enhanced control. At its core is the Sparse Spatial Memory Graph (SMG), a compact, region-centric memory where each node aggregates multi-view keyframe and object semantics, capturing both appearance and spatial structure while preserving viewpoint diversity. At the global level, the agent is localized on SMG and a goal-conditioned node path is planned via an image-to-instance hybrid retrieval, producing a sequence of reachable waypoints for long-horizon guidance. At the local level, a navigation foundation policy executes these waypoints in point-goal mode with obstacle-aware control, and switches to image-goal mode when navigating from the final node towards the visual target. To further enhance viewpoint alignment and goal recognition, we introduce VGGT-adapter, a lightweight geometric module built on the pre-trained VGGT model, which aligns observation and goal features in a shared 3D-aware space. MG-Nav operates global planning and local control at different frequencies, using periodic re-localization to correct errors. Experiments on HM3D Instance-Image-Goal and MP3D Image-Goal benchmarks demonstrate that MG-Nav achieves state-of-the-art zero-shot performance and remains robust under dynamic rearrangements and unseen scene conditions.
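The global stage described above, planning a node path over a memory graph to obtain a waypoint sequence, can be pictured with a generic shortest-path search. The abstract does not specify the planner, so the sketch below substitutes plain Dijkstra over a small hand-written graph; the function name, node names, and edge costs are all hypothetical, and localization and goal retrieval are assumed to have already produced the start and goal nodes.

```python
# Hypothetical sketch: shortest node path over a sparse region graph.
# Dijkstra is a stand-in; MG-Nav's actual planner is not specified here.
import heapq

def plan_node_path(graph, start, goal):
    """graph maps node -> {neighbor: cost}. Returns a list of nodes or None."""
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    visited = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue
        visited.add(node)
        if node == goal:
            break
        for nbr, cost in graph.get(node, {}).items():
            nd = d + cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                prev[nbr] = node
                heapq.heappush(heap, (nd, nbr))
    if goal not in dist:
        return None  # goal is unreachable from start
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]

# Toy memory graph with symmetric edges between regions.
memory_graph = {
    "hall": {"kitchen": 2.0, "living": 1.0},
    "living": {"hall": 1.0, "kitchen": 0.5},
    "kitchen": {"hall": 2.0, "living": 0.5, "pantry": 1.0},
    "pantry": {"kitchen": 1.0},
    "attic": {},
}
waypoints = plan_node_path(memory_graph, "hall", "pantry")
```

Each node on the returned path would then be handed to the local policy as a waypoint; periodic re-localization, as the abstract notes, would simply re-run the search from the newly estimated start node.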
[55] REASONEDIT: Towards Reasoning-Enhanced Image Editing Models
Fukun Yin, Shiyu Liu, Yucheng Han, Zhibo Wang, Peng Xing, Rui Wang, Wei Cheng, Yingming Wang, Aojie Li, Zixin Yin, Pengtao Chen, Xiangyu Zhang, Daxin Jiang, Xianfang Zeng, Gang Yu
🧩 TL;DR
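This paper proposes a reasoning-based image editing framework that improves editing performance by unlocking the reasoning capabilities of multimodal large language models (MLLMs). The framework uses a thinking-editing-reflection loop, significantly improving editing accuracy and instruction understanding.
📘 Detailed Summary
Motivation: Current image editing models typically couple an MLLM encoder with a diffusion decoder, but the MLLM stays frozen during training, which limits the use of its reasoning capabilities. This paper explores whether unlocking MLLM reasoning can push the performance of image editing models further, particularly in understanding abstract instructions and improving editing accuracy.
Method: The paper proposes a reasoning-based image editing framework built on a thinking-editing-reflection loop. The thinking mechanism uses the MLLM's world knowledge to interpret abstract instructions, while the reflection mechanism reviews editing results, automatically corrects unintended manipulations, and determines the stopping round. The framework thus explores two reasoning mechanisms: thinking to enhance instruction understanding and reflection to improve editing accuracy.
Result: Experiments show that the reasoning approach yields significant gains on multiple benchmarks. When the DiT is initialized from Step1X-Edit, it improves by 4.3% on ImgEdit, 4.7% on GEdit, and 8.2% on Kris. When integrated with Qwen-Image-Edit, it also outperforms previous open-source methods on the GEdit and Kris benchmarks.
Conclusion: The study shows that unlocking the reasoning capabilities of multimodal large language models can significantly improve image editing models. The thinking-editing-reflection loop provides an effective reasoning framework for image editing that handles abstract instructions better and improves editing accuracy, offering a new direction for the design of future MLLM-based image editing systems.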
📄 Abstract
Recent advances in image editing models have shown remarkable progress. A common architectural design couples a multimodal large language model (MLLM) encoder with a diffusion decoder, as seen in systems such as Step1X-Edit and Qwen-Image-Edit, where the MLLM encodes both the reference image and the instruction but remains frozen during training. In this work, we demonstrate that unlocking the reasoning capabilities of the MLLM can further push the boundaries of editing models. Specifically, we explore two reasoning mechanisms, thinking and reflection, which enhance instruction understanding and editing accuracy. Based on that, our proposed framework enables image editing in a thinking-editing-reflection loop: the thinking mechanism leverages the world knowledge of the MLLM to interpret abstract instructions, while the reflection reviews editing results, automatically corrects unintended manipulations, and identifies the stopping round. Extensive experiments demonstrate that our reasoning approach achieves significant performance gains, with improvements on ImgEdit (+4.3%), GEdit (+4.7%), and Kris (+8.2%) when initializing our DiT from Step1X-Edit (ReasonEdit-S), and also outperforms previous open-source methods on both GEdit and Kris when integrated with Qwen-Image-Edit (ReasonEdit-Q).
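The control flow of a thinking-editing-reflection loop can be sketched as follows. This is only a schematic under stated assumptions: `think`, `edit`, and `reflect` are hypothetical caller-supplied callables, the `max_rounds` cap is an added safeguard, and the toy stand-ins operate on strings rather than images. None of it reflects the ReasonEdit code or model interfaces.

```python
# Hypothetical sketch of a thinking-editing-reflection loop.
# All callables and the string-based "images" are illustrative stand-ins.

def reason_edit_loop(image, instruction, think, edit, reflect, max_rounds=3):
    """Think once, then alternate editing and reflection until reflection
    signals completion or max_rounds is reached."""
    plan = think(image, instruction)       # interpret the abstract instruction
    current = image
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        current = edit(current, plan)      # apply the current plan
        # Reflection reviews the result, may revise the plan, and decides
        # whether this is the stopping round.
        done, plan = reflect(image, current, instruction, plan)
        if done:
            break
    return current, rounds

def toy_think(image, instruction):
    return "apply: " + instruction

def toy_edit(image, plan):
    return image + "+edit"

def toy_reflect(original, edited, instruction, plan):
    # Toy rule: declare success after two edits, keep the plan unchanged.
    return edited.count("+edit") >= 2, plan

result, rounds_used = reason_edit_loop(
    "photo", "make it snowy", toy_think, toy_edit, toy_reflect
)
```

With the toy reflection rule the loop stops after the second round; in a real system the reflection step would be a model call that inspects the edited image.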
[56] GeoZero: Incentivizing Reasoning from Scratch on Geospatial Scenes
Di Wang, Shunyu Liu, Wentao Jiang, Fengxiang Wang, Yi Liu, Xiaolei Qin, Zhiming Luo, Chaoyang Zhou, Haonan Guo, Jing Zhang, Bo Du, Dacheng Tao, Liangpei Zhang
🧩 TL;DR
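This paper proposes GeoZero, a framework that enables multimodal large language models to perform geospatial reasoning without predefined chain-of-thought supervision, eliciting deep reasoning through supervised fine-tuning and reinforcement learning stages.
📘 Detailed Summary
Motivation: Current remote sensing MLLMs typically enhance reasoning through cold-start training on carefully curated chain-of-thought data. This incurs high annotation costs and introduces human biases that may limit the diversity of model reasoning, motivating geospatial reasoning methods that need no predefined chain-of-thought supervision.
Method: The GeoZero framework constructs two datasets: GeoZero-Instruct lets the model acquire preliminary geospatial knowledge through supervised fine-tuning, and GeoZero-Hard stimulates deep reasoning in the subsequent reinforcement learning stage. It also introduces Answer-Anchored Group Relative Policy Optimization, which regularizes the reasoning process with the model's own answers to encourage diverse yet accurate thinking.
Result: Extensive experiments on multiple remote sensing vision-language benchmarks show that GeoZero surpasses existing state-of-the-art methods and fosters general emergent reasoning capabilities across diverse geospatial tasks, demonstrating that predefined chain-of-thought supervision is not required.
Conclusion: The study demonstrates the feasibility of a geospatial reasoning framework that needs no human-annotated chain-of-thought data: reinforcement learning guided by the model's own answers can elicit diverse and accurate reasoning, offering a new reasoning paradigm free of chain-of-thought supervision for remote sensing MLLMs.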
📄 Abstract
Multimodal large language models (MLLMs) have undergone rapid development in advancing geospatial scene understanding. Recent studies have sought to enhance the reasoning capabilities of remote sensing MLLMs, typically through cold-start training with elaborately curated chain-of-thought (CoT) data. However, this approach not only incurs substantial annotation costs but also introduces human biases that may limit the diversity of model reasoning. To address these challenges, we propose GeoZero, a framework that enables MLLMs to perform geospatial reasoning without any predefined CoT supervision. Specifically, we construct two datasets, GeoZero-Instruct and GeoZero-Hard. GeoZero-Instruct allows the model to acquire preliminary geospatial knowledge through supervised fine-tuning, while GeoZero-Hard stimulates deep reasoning during the subsequent reinforcement learning stage. Furthermore, we introduce Answer-Anchored Group Relative Policy Optimization (A²GRPO), where the reasoning process is regularized by the model's own answers, encouraging diverse yet accurate thinking. Extensive experiments on multiple remote sensing vision-language benchmarks demonstrate that GeoZero not only surpasses existing state-of-the-art methods but also fosters universal emergent reasoning capabilities across diverse geospatial tasks. Code, data, and models will be publicly available at https://github.com/MiliLab/GeoZero.
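As background for the "Group Relative" part of the optimizer's name, the sketch below shows the group-normalized advantage commonly used in GRPO-style training: several responses are sampled for the same prompt and each reward is standardized against the group. It deliberately omits the answer-anchoring regularizer, whose form the abstract does not give. The function name, the use of the population standard deviation, and the `eps` constant are assumptions.

```python
# Hypothetical sketch: group-relative advantages as in generic GRPO.
# The answer-anchored term of A²GRPO is NOT modeled here.
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the mean and std of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std (an assumption)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled responses to one prompt, rewarded 1 if correct, else 0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Correct responses receive positive advantages and incorrect ones negative; when every response in a group earns the same reward, all advantages are zero and the group contributes no learning signal.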
[57] Architecture Decoupling Is Not All You Need For Unified Multimodal Model
Dian Zheng, Manyuan Zhang, Hongyu Li, Kai Zou, Hongbo Liu, Ziyu Guo, Kaituo Feng, Yexin Liu, Ying Luo, Yan Feng, Peng Pei, Xunliang Cai, Hongsheng Li
🧩 TL;DR
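This paper proposes an Attention Interaction Alignment (AIA) loss that mitigates the conflict between understanding and generation tasks in multimodal models without model decoupling, improving performance while preserving the interleaved generation ability of unified models.
📘 Detailed Summary
Motivation: Unified multimodal models face conflicting objectives between understanding and generation tasks. Existing methods alleviate the conflict through varying degrees of model decoupling (e.g., dual image encoders, MOE/MOT architectures, or a frozen MLLM), but excessive decoupling harms the model's interleaved generation ability and defeats the purpose of a unified model. This study explores how to mitigate task conflict without model decoupling.
Method: The study first analyzes how model decoupling alleviates conflict by examining the models' cross-modal attention behavior, and finds that decoupling essentially drives models toward task-specific multimodal interaction patterns. Based on this observation, it proposes the Attention Interaction Alignment (AIA) loss, which explicitly learns task-specific multimodal interaction patterns during training. To test the generality of the AIA loss, it is applied to Emu3 and Janus-Pro, during the SFT and post-training stages respectively.
Result: The AIA loss refines cross-modal attention patterns and improves both generation and understanding performance. Experiments show that, without introducing additional complex structures, the method effectively mitigates task conflict, preserves the model's interleaved generation ability, and improves multiple evaluation metrics.
Conclusion: The study reveals that model decoupling mitigates task conflict by steering the model toward task-specific interaction patterns. The proposed AIA loss offers an alternative that requires no decoupling, suggesting a new training paradigm for unified multimodal models that improves multi-task performance while keeping the model unified.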
📄 Abstract
Unified multimodal models for image generation and understanding represent a significant step toward AGI and have attracted widespread attention from researchers. The main challenge lies in establishing an optimal training paradigm, given the inherently conflicting targets of understanding and generation tasks. To alleviate these conflicts and pursue higher performance, many researchers adopt varying degrees of model decoupling (e.g., double image encoders, MOE/MOT architectures, or a frozen MLLM). However, excessive model decoupling can lead to the loss of interleaved generation ability, undermining the original intent of unified models. In this work, we aim to explore how to mitigate task conflicts without resorting to model decoupling. First, we analyze why decoupling alleviates conflicts by studying the cross-modal attention behavior of models. We observe that model decoupling essentially drives models toward task-specific multimodal interaction patterns, as seen in Qwen-VL and HunyuanImage, and that the more thorough the decoupling, the more consistent the behavior becomes. Motivated by this observation, we propose the Attention Interaction Alignment (AIA) loss, which explicitly learns task-specific multimodal interaction patterns during training. To demonstrate the generalizability of our AIA loss, we apply it to Emu3 and Janus-Pro during the SFT and post-training stages, respectively. Without bells and whistles, AIA not only refines cross-modal attention patterns but also boosts both generation and understanding performance.
[58] VaMP: Variational Multi-Modal Prompt Learning for Vision-Language Models
Silin Cheng, Kai Han
🧩 TL;DR
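This paper proposes VaMP, a variational multi-modal prompt learning framework that generates instance-conditioned prompts by sampling from a learned posterior distribution. It enables sample-specific, uncertainty-aware multi-modal representation learning and achieves state-of-the-art performance on few-shot and domain generalization benchmarks.
📘 Detailed Summary
Motivation: Existing multi-modal prompt learning methods typically rely on fixed, shared prompts and deterministic parameters, which limits their ability to capture instance-level variation or to model uncertainty across diverse tasks and domains. This holds back the adaptation of vision-language models to downstream tasks under limited supervision.
Method: The paper proposes the Variational Multi-Modal Prompt Learning (VaMP) framework, which generates instance-conditioned prompts by sampling from a learned posterior distribution, enabling sample-specific, uncertainty-aware prompt tuning. It introduces a class-aware prior derived from the instance representation and class prototype to strengthen the integration of local and global semantics, formulates prompt tuning as variational inference over latent prompt representations, and trains end-to-end through reparameterized sampling.
Result: On few-shot learning and domain generalization benchmarks, VaMP achieves state-of-the-art performance. The results indicate that modeling uncertainty and task structure brings clear benefits for multi-modal representation learning.
Conclusion: The study shows that instance-specific, uncertainty-aware prompt generation via variational inference can effectively improve the adaptation of vision-language models under limited supervision, providing a new theoretical framework and practical method for prompt tuning in multi-modal representation learning.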
📄 Abstract
Vision-language models (VLMs), such as CLIP, have shown strong generalization under zero-shot settings, yet adapting them to downstream tasks with limited supervision remains a significant challenge. Existing multi-modal prompt learning methods typically rely on fixed, shared prompts and deterministic parameters, which limits their ability to capture instance-level variation or model uncertainty across diverse tasks and domains. To tackle this issue, we propose a novel Variational Multi-Modal Prompt Learning (VaMP) framework that enables sample-specific, uncertainty-aware prompt tuning in multi-modal representation learning. VaMP generates instance-conditioned prompts by sampling from a learned posterior distribution, allowing the model to personalize its behavior based on input content. To further enhance the integration of local and global semantics, we introduce a class-aware prior derived from the instance representation and class prototype. Building upon these, we formulate prompt tuning as variational inference over latent prompt representations and train the entire framework end-to-end through reparameterized sampling. Experiments on few-shot and domain generalization benchmarks show that VaMP achieves state-of-the-art performance, highlighting the benefits of modeling both uncertainty and task structure in our method. Project page: https://visual-ai.github.io/vamp
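The two ingredients named above, reparameterized sampling from a learned posterior and a prior that the posterior is compared against, can be sketched with diagonal Gaussians. The abstract does not state the distribution family, so the Gaussian form, the function names, and the toy parameter values are all assumptions; the KL formula itself is the standard closed form for diagonal Gaussians.

```python
# Hypothetical sketch: reparameterized sampling and KL(q || p) for
# diagonal Gaussians. The Gaussian assumption is ours, not the paper's.
import math
import random

def sample_latent_prompt(mu, log_var, rng):
    """z = mu + sigma * eps with eps ~ N(0, 1), so z is a deterministic
    function of (mu, log_var) given the noise."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_diag_gaussians(mu_q, log_var_q, mu_p, log_var_p):
    """Closed-form KL divergence between posterior q and prior p."""
    kl = 0.0
    for mq, lq, mp, lp in zip(mu_q, log_var_q, mu_p, log_var_p):
        kl += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return kl

rng = random.Random(0)
posterior_mu, posterior_log_var = [0.5, -0.2], [-40.0, -40.0]  # near-zero variance
z = sample_latent_prompt(posterior_mu, posterior_log_var, rng)

kl_same = kl_diag_gaussians([0.0], [0.0], [0.0], [0.0])
kl_shifted = kl_diag_gaussians([1.0], [0.0], [0.0], [0.0])
```

In a variational objective of this kind, the task loss is typically evaluated on the sampled `z` and the KL term pulls the posterior toward the prior; how VaMP weights or structures these terms is not stated in the abstract.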
[59] Decoupled DMD: CFG Augmentation as the Spear, Distribution Matching as the Shield
Dongyang Liu, Peng Gao, David Liu, Ruoyi Du, Zhen Li, Qilong Wu, Xin Jin, Sihan Cao, Shifeng Zhang, Hongsheng Li, Steven Hoi
🧩 TL;DR
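This paper challenges the conventional understanding of the distribution matching mechanism in diffusion model distillation. By decomposing the DMD training objective, it shows that in complex tasks such as text-to-image generation, CFG augmentation rather than distribution matching is the main driver of few-step distillation, and it proposes improvements such as decoupled noise schedules.
📘 Detailed Summary
Motivation: The study challenges the conventional understanding of where the performance of Distribution Matching Distillation and its variants comes from. The conventional view attributes their strong performance to matching the student model's output distribution to that of a pre-trained teacher, but the paper's analysis finds that this understanding may be inaccurate, particularly for complex tasks that require CFG.
Method: Through a rigorous mathematical decomposition of the DMD training objective, the paper identifies CFG Augmentation as a previously overlooked core component and systematically analyzes the distinct roles of the distribution matching term and the CFG augmentation term. It further validates this decoupling by introducing simpler non-parametric constraints or GAN-based objectives as alternative regularizers, and proposes improvements such as decoupling the noise schedules of the engine and the regularizer.
Result: The results show that the CFG augmentation term is the core "engine" of few-step distillation, while the distribution matching term mainly acts as a "regularizer" that ensures training stability and reduces artifacts. Experiments show that the distribution matching term is not the only effective regularizer: simpler alternatives can serve a similar function with different trade-offs. The improvements derived from this understanding have been adopted by the Z-Image project to develop a top-tier 8-step image generation model.
Conclusion: The study provides a more systematic and in-depth understanding of diffusion model distillation, showing that CFG augmentation rather than distribution matching is the main driver of few-step distillation in complex tasks. This decoupled analysis gives a more principled framework for the distillation process, lets researchers analyze the properties of the two components more systematically, and provides a theoretical basis for improving distillation methods, ultimately advancing more efficient image generation models.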
📄 Abstract
Diffusion model distillation has emerged as a powerful technique for creating efficient few-step and single-step generators. Among these, Distribution Matching Distillation (DMD) and its variants stand out for their impressive performance, which is widely attributed to their core mechanism of matching the student's output distribution to that of a pre-trained teacher model. In this work, we challenge this conventional understanding. Through a rigorous decomposition of the DMD training objective, we reveal that in complex tasks like text-to-image generation, where CFG is typically required for desirable few-step performance, the primary driver of few-step distillation is not distribution matching, but a previously overlooked component we identify as CFG Augmentation (CA). We demonstrate that this term acts as the core "engine" of distillation, while the Distribution Matching (DM) term functions as a "regularizer" that ensures training stability and mitigates artifacts. We further validate this decoupling by demonstrating that while the DM term is a highly effective regularizer, it is not unique; simpler non-parametric constraints or GAN-based objectives can serve the same stabilizing function, albeit with different trade-offs. This decoupling of labor motivates a more principled analysis of the properties of both terms, leading to a more systematic and in-depth understanding. This new understanding further enables us to propose principled modifications to the distillation process, such as decoupling the noise schedules for the engine and the regularizer, leading to further performance gains. Notably, our method has been adopted by the Z-Image (https://github.com/Tongyi-MAI/Z-Image) project to develop a top-tier 8-step image generation model, empirically validating the generalization and robustness of our findings.
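The decomposition into a DM term and a CA term can be illustrated with a small numerical check. This is only a sketch under an assumption the abstract does not state: that the teacher's guided score takes the usual classifier-free guidance form s_cfg = s_uncond + w (s_cond - s_uncond). Under that assumption, the difference between the guided teacher score and the student ("fake") score splits exactly into a distribution-matching part and a guidance-dependent part. The function names, the toy vectors, and the sign and weighting conventions are illustrative and may differ from the paper's.

```python
import random

def cfg_score(s_cond, s_uncond, w):
    """Standard classifier-free guidance: s_uncond + w * (s_cond - s_uncond)."""
    return [u + w * (c - u) for c, u in zip(s_cond, s_uncond)]

def split_update(s_cond, s_uncond, s_fake, w):
    """Split (cfg_score - s_fake) into two terms (names are illustrative).

    dm: conditional teacher score minus student score (distribution matching).
    ca: (w - 1) * (s_cond - s_uncond), the part that exists only because of CFG.
    """
    dm = [c - f for c, f in zip(s_cond, s_fake)]
    ca = [(w - 1.0) * (c - u) for c, u in zip(s_cond, s_uncond)]
    return dm, ca

# Toy score vectors standing in for network outputs (purely hypothetical data).
random.seed(0)
dim, w = 4, 5.0
s_cond = [random.gauss(0, 1) for _ in range(dim)]
s_uncond = [random.gauss(0, 1) for _ in range(dim)]
s_fake = [random.gauss(0, 1) for _ in range(dim)]

full = [g - f for g, f in zip(cfg_score(s_cond, s_uncond, w), s_fake)]
dm, ca = split_update(s_cond, s_uncond, s_fake, w)
recombined = [d + a for d, a in zip(dm, ca)]

# The two terms add back up to the full guided update direction.
max_err = max(abs(x - y) for x, y in zip(full, recombined))
```

With w = 1 the `ca` term vanishes and only the `dm` term remains, which is one way to see why the guidance-dependent term can be treated as a separate component.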
[60] Ar2Can: An Architect and an Artist Leveraging a Canvas for Multi-Human Generation
Shubhankar Borse, Phuc Pham, Farzad Farhadzadeh, Seokeon Choi, Phong Ha Nguyen, Anh Tuan Tran, Sungrack Yun, Munawar Hayat, Fatih Porikli
🧩 TL;DR
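This paper presents Ar2Can, a two-stage framework for multi-human generation that decouples spatial planning from identity rendering. It targets the tendency of existing text-to-image models to duplicate faces, merge identities, or miscount people in multi-human scenes.
📘 Detailed Summary
Motivation: Despite recent advances in text-to-image generation, existing models remain unreliable on multi-human scenes, often duplicating faces, merging identities, or miscounting individuals. This is the technical bottleneck the work addresses.
Method: Ar2Can uses two stages. The Architect module predicts a structured layout that specifies where each person should appear. The diffusion-based Artist module then synthesizes photorealistic images, guided by a spatially grounded face matching reward that combines Hungarian spatial alignment with ArcFace identity similarity. Group Relative Policy Optimization (GRPO) with compositional rewards optimizes count accuracy, image quality, and identity matching.
Result: On MultiHuman-Testbench, Ar2Can substantially improves count accuracy and identity preservation while maintaining high perceptual quality. Notably, it achieves these results using primarily synthetic data, without requiring real multi-human images.
Conclusion: Decoupling spatial planning from identity rendering effectively addresses the reliability problems of multi-human generation, and training on synthetic data proves effective, offering a new technical route for complex scene generation.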
📄 Abstract
Despite recent advances in text-to-image generation, existing models consistently fail to produce reliable multi-human scenes, often duplicating faces, merging identities, or miscounting individuals. We present Ar2Can, a novel two-stage framework that disentangles spatial planning from identity rendering for multi-human generation. The Architect module predicts structured layouts, specifying where each person should appear. The Artist module then synthesizes photorealistic images, guided by a spatially-grounded face matching reward that combines Hungarian spatial alignment with ArcFace identity similarity. This approach ensures faces are rendered at correct locations and faithfully preserve reference identities. We develop two Architect variants, seamlessly integrated with our diffusion-based Artist model and optimized via Group Relative Policy Optimization (GRPO) using compositional rewards for count accuracy, image quality, and identity matching. Evaluated on the MultiHuman-Testbench, Ar2Can achieves substantial improvements in both count accuracy and identity preservation, while maintaining high perceptual quality. Notably, our method achieves these results using primarily synthetic data, without requiring real multi-human images.
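A spatially grounded face matching reward of the kind described above can be sketched as follows. This is a hedged illustration, not the authors' implementation: the abstract does not say how spatial alignment and identity similarity are combined, so the sketch first assigns detected faces to layout slots by minimum total center distance and then averages cosine similarity over the matched pairs. For brevity it searches all permutations instead of running the Hungarian algorithm (both return an optimal assignment, but the brute-force search is only practical for a handful of faces), and plain vectors stand in for ArcFace embeddings. All function and parameter names are hypothetical.

```python
from itertools import permutations
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def center_distance(p, q):
    """Euclidean distance between two (x, y) face centers."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def face_matching_reward(layout_centers, ref_embeddings, det_centers, det_embeddings):
    """Match detected faces to layout slots spatially, then score identity.

    Returns 0.0 on a count mismatch (an assumed convention, not from the paper).
    """
    n = len(layout_centers)
    if n == 0 or len(det_centers) != n:
        return 0.0
    # Brute-force stand-in for Hungarian matching: minimize total center distance.
    best = min(
        permutations(range(n)),
        key=lambda perm: sum(
            center_distance(layout_centers[i], det_centers[perm[i]]) for i in range(n)
        ),
    )
    sims = [cosine(ref_embeddings[i], det_embeddings[best[i]]) for i in range(n)]
    return sum(sims) / n

# Two planned slots (left, right) with one reference identity each.
layout = [(0.25, 0.5), (0.75, 0.5)]
refs = [[1.0, 0.0], [0.0, 1.0]]
# Detections listed in a different order, but each identity is in its planned slot.
reward_ok = face_matching_reward(layout, refs, [(0.8, 0.5), (0.2, 0.5)],
                                 [[0.0, 1.0], [1.0, 0.0]])
# Same positions, but the two identities are swapped between slots.
reward_swapped = face_matching_reward(layout, refs, [(0.8, 0.5), (0.2, 0.5)],
                                      [[1.0, 0.0], [0.0, 1.0]])
```

Because matching is done on position first, a face with the right identity in the wrong place earns no credit, which is the behavior a "spatially grounded" reward is meant to enforce.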
[61] Alzheimer's Disease Prediction Using EffNetViTLoRA and BiLSTM with Multimodal Longitudinal MRI Data
Mahdieh Behjat Khatooni, Mohsen Soryani
🧩 TL;DR
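This study proposes a generalized, end-to-end deep learning model for Alzheimer's disease prediction. It combines convolutional neural networks and Vision Transformers to capture local spatial features and global contextual dependencies in MRI scans, and uses a bidirectional LSTM to process longitudinal data, yielding high-accuracy prediction of conversion from mild cognitive impairment to Alzheimer's disease.
📘 Detailed Summary
Motivation: Alzheimer's disease is an irreversible neurodegenerative disorder, so early prediction is critical for timely intervention. Mild cognitive impairment (MCI) is a transitional stage between normal cognition and Alzheimer's disease, and predicting its progression is difficult because not all MCI patients convert. Existing methods have limitations in separating stable MCI from progressive MCI.
Method: The study proposes a hybrid architecture that integrates convolutional neural networks and Vision Transformers to capture local spatial features and global contextual dependencies from magnetic resonance imaging scans. To incorporate temporal progression, a bidirectional long short-term memory network processes MRI features from four consecutive timepoints together with other non-image biomarkers and predicts each subject's cognitive status at month 48.
Result: The multimodal model achieves an average progression prediction accuracy of 95.05% between stable and progressive MCI, which the authors report as exceeding existing Alzheimer's disease prediction studies and as state-of-the-art for longitudinal prediction.
Conclusion: The study shows that combining spatial and temporal modeling is effective for early detection of Alzheimer's disease. The hybrid architecture captures local detail and global context in imaging data together with temporal dynamics, and the authors present it as a multimodal deep learning framework that can support early clinical intervention.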
📄 Abstract
Alzheimer's disease (AD) is a prevalent neurodegenerative disorder that progressively impairs memory, decision-making, and overall cognitive function. As AD is irreversible, early prediction is critical for timely intervention and management. Mild Cognitive Impairment (MCI), a transitional stage between cognitively normal (CN) aging and AD, plays a significant role in early AD diagnosis. However, predicting MCI progression remains a significant challenge, as not all individuals with MCI convert to AD. MCI subjects are categorized into stable MCI (sMCI) and progressive MCI (pMCI) based on conversion status. In this study, we propose a generalized, end-to-end deep learning model for AD prediction using MCI cases from the Alzheimer's Disease Neuroimaging Initiative (ADNI). Our hybrid architecture integrates Convolutional Neural Networks and Vision Transformers to capture both local spatial features and global contextual dependencies from Magnetic Resonance Imaging (MRI) scans. To incorporate temporal progression, we further employ Bidirectional Long Short-Term Memory (BiLSTM) networks to process features extracted from four consecutive MRI timepoints along with other non-image biomarkers, predicting each subject's cognitive status at month 48. Our multimodal model achieved an average progression prediction accuracy of 95.05% between sMCI and pMCI, outperforming existing studies in AD prediction. This work demonstrates state-of-the-art performance in longitudinal AD prediction and highlights the effectiveness of combining spatial and temporal modeling for the early detection of Alzheimer's disease.
[62] World in a Frame: Understanding Culture Mixing as a New Challenge for Vision-Language Models
Eunsu Kim, Junyeong Park, Na Min An, Junseong Kim, Hitesh Laxmichand Patel, Jiho Jin, Julia Kruk, Amit Agarwal, Srikant Panda, Fenal Ashokbhai Ilasariya, Hyunjung Shim, Alice Oh
🧩 TL;DR
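This paper studies how large vision-language models perceive culture mixing scenes. It builds the CultureMix benchmark, finds that current models fail to keep cultural identities consistent, and shows that supervised fine-tuning improves robustness in culturally mixed settings.
📘 Detailed Summary
Motivation: In globalized visual scenes, elements from different cultures often appear together, yet how large vision-language models (LVLMs) perceive such culture mixing scenes remains underexplored. The study systematically analyzes LVLM behavior when items from multiple cultures co-occur, filling a gap in the evaluation of culture mixing.
Method: The authors construct CultureMix, a benchmark of 23k diffusion-generated, human-verified culture mixing images covering four subtasks: food-only, food+food, food+background, and food+food+background. They evaluate 10 LVLMs and explore three robustness strategies, including supervised fine-tuning on a diverse culture mixing dataset.
Result: Existing models consistently fail to preserve individual cultural identities in mixed settings and rely strongly on background: accuracy drops by 14% when cultural backgrounds are added to the food-only baseline. Models also give inconsistent predictions for identical foods in different contexts. Supervised fine-tuning substantially improves consistency and reduces background sensitivity.
Conclusion: Culture mixing is a critical challenge for LVLMs and deserves more attention if models are to operate reliably in culturally diverse real-world environments. The effectiveness of supervised fine-tuning offers a practical route to better cultural perception and underlines the importance of accounting for cultural complexity when building inclusive AI systems.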
📄 Abstract
In a globalized world, cultural elements from diverse origins frequently appear together within a single visual scene. We refer to these as culture mixing scenarios, yet how Large Vision-Language Models (LVLMs) perceive them remains underexplored. We investigate culture mixing as a critical challenge for LVLMs and examine how current models behave when cultural items from multiple regions appear together. To systematically analyze these behaviors, we construct CultureMix, a food Visual Question Answering (VQA) benchmark with 23k diffusion-generated, human-verified culture mixing images across four subtasks: (1) food-only, (2) food+food, (3) food+background, and (4) food+food+background. Evaluating 10 LVLMs, we find consistent failures to preserve individual cultural identities in mixed settings. Models show strong background reliance, with accuracy dropping 14% when cultural backgrounds are added to food-only baselines, and they produce inconsistent predictions for identical foods across different contexts. To address these limitations, we explore three robustness strategies. We find that supervised fine-tuning using a diverse culture mixing dataset substantially improves model consistency and reduces background sensitivity. We call for increased attention to culture mixing scenarios as a critical step toward developing LVLMs capable of operating reliably in culturally diverse real-world environments.
[63] From Pixels to Feelings: Aligning MLLMs with Human Cognitive Perception of Images
Yiming Chen, Junlin Han, Tianyi Bai, Shengbang Tong, Filippos Kokkinos, Philip Torr
🧩 TL;DR
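This paper introduces CogIP-Bench, a benchmark for evaluating how well multimodal large language models align with human perception of image cognitive properties such as memorability, humor, and aesthetics. It also develops a post-training method that markedly improves this alignment and shows that the alignment transfers to downstream creative tasks.
📘 Detailed Summary
Motivation: Current multimodal large language models are good at recognizing objects and describing scenes but struggle to understand how an image feels to a human observer. The gap with human perception is most visible for cognitive properties such as memorability, humor, aesthetic appeal, and emotional evocativeness.
Method: The study introduces CogIP-Bench to systematically evaluate MLLMs on image cognitive properties and develops a post-training phase that strengthens alignment with human perception. The cognitively aligned model is then integrated into an image generation pipeline to guide synthesis.
Result: Current models are poorly aligned with human perception of these properties, but post-training effectively narrows the gap and significantly improves agreement with human judgments. The learned alignment is transferable: it can guide an image generation pipeline toward images that are more memorable or more visually appealing.
Conclusion: The work provides a benchmark for measuring human-like perception, a post-training pipeline for enhancing it, and evidence that this cognitive alignment enables more human-centric AI.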
📄 Abstract
While Multimodal Large Language Models (MLLMs) are adept at answering what is in an image (identifying objects and describing scenes), they often lack the ability to understand how an image feels to a human observer. This gap is most evident when considering subjective cognitive properties, such as what makes an image memorable, funny, aesthetically pleasing, or emotionally evocative. To systematically address this challenge, we introduce CogIP-Bench, a comprehensive benchmark for evaluating MLLMs on such image cognitive properties. Our evaluation reveals a significant gap: current models are poorly aligned with human perception of these nuanced properties. We then demonstrate that a post-training phase can effectively bridge this gap, significantly enhancing the model's alignment with human judgments. Furthermore, we show that this learned cognitive alignment is not merely predictive but also transferable to downstream creative tasks. By integrating our cognitively-aligned MLLM into an image generation pipeline, we can guide the synthesis process to produce images that better embody desired traits, such as being more memorable or visually appealing. Our work provides a benchmark to measure this human-like perception, a post-training pipeline to enhance it, and a demonstration that this alignment unlocks more human-centric AI.
[64] Some Modalities are More Equal Than Others: Decoding and Architecting Multimodal Integration in MLLMs
Tianle Chen, Chaitanya Chakka, Arjun Reddy Akula, Xavier Thomas, Deepti Ghadiyaram
🧩 TL;DR
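The study introduces MMA-Bench, a benchmark for evaluating the robustness of multimodal large language models to contradicting modalities, and develops a modality alignment tuning strategy that makes cross-modal reasoning more reliable.
📘 Detailed Summary
Motivation: Despite remarkable progress in multimodal large language models, their robustness to contradicting modalities remains a fundamental open question. Current models lack robust multimodal reasoning when audio-visual pairs are misaligned or when simple misleading text is present.
Method: The study introduces MMA-Bench, which comprises videos and tasks that probe a model's reliance on specific modalities. It analyzes model brittleness with black-box and white-box interpretability techniques and proposes a modality alignment tuning strategy that teaches the model when to prioritize, leverage, or ignore specific modality cues.
Result: Current open- and closed-source MLLMs are brittle under misaligned audio-visual pairs and simple misleading text. The proposed alignment tuning yields markedly stronger multimodal grounding and more reliable cross-modal reasoning.
Conclusion: The work provides interpretability tools and a clear path toward MLLMs with intrinsically reliable cross-modal reasoning, with modality alignment tuning as an effective means to that end. Code and dataset will be made publicly available.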
📄 Abstract
Despite remarkable advancements in Multimodal Large Language Models (MLLMs), a fundamental question remains: are MLLMs robust to contradicting modalities? To rigorously study this, we introduce MMA-Bench comprising videos and tasks that probe a model's reliance on specific modalities. Using black-box and white-box interpretability techniques, we provide a critical analysis of the brittleness of both open- and closed-sourced MLLMs. We show that current MLLMs struggle under misaligned audio-visual pairs and simple misleading text, thereby lacking robust multi-modal reasoning. Building on these findings, we propose a modality alignment tuning strategy to teach the model when to prioritize, leverage, or ignore specific modality cues. Through extensive experiments and analysis, we show that our alignment tuning yields demonstrably stronger multimodal grounding. This work provides both interpretability tools and a clear path toward developing MLLMs with intrinsically reliable cross-modal reasoning. Code and dataset will be publicly available.
[65] Breaking the Visual Shortcuts in Multimodal Knowledge-Based Visual Question Answering
Dosung Lee, Sangwon Jung, Boyoung Kim, Minyoung Kim, Sungyeon Kim, Junyoung Sung, Paul Hongsuck Seo
🧩 TL;DR
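The study exposes the "visual shortcut" problem in existing multimodal knowledge-based visual question answering benchmarks and proposes the RETINA benchmark and the MIMIR method to address it. RETINA is built automatically with an LLM-driven pipeline, contains 120k training samples and a 2k human-curated test set, and is designed to remove visual shortcuts. MIMIR enriches document embeddings by augmenting them with images of multiple related entities.
📘 Detailed Summary
Motivation: Existing multimodal knowledge-based VQA benchmarks suffer from "visual shortcuts": the query image typically matches the primary subject entity of the target document, so models can reach comparable results from visual cues alone without truly using multimodal knowledge. This design flaw prevents the benchmarks from measuring multimodal reasoning accurately and hides models' reliance on the shortcut.
Method: The study makes two contributions. First, it builds the RETINA benchmark with an automated LLM-driven pipeline, comprising 120k training samples and a 2k human-curated test set. Its queries reference secondary subjects (related entities) and are paired with images of those related entities, which removes the visual shortcut. Second, it proposes MIMIR, which enriches document embeddings by augmenting images of multiple related entities. Unlike prior work that uses a single image per document, MIMIR can handle the multi-entity relations in RETINA.
Result: Existing models show significantly degraded performance on RETINA, confirming their reliance on visual shortcuts. MIMIR handles RETINA effectively, including the multi-entity relations that prior methods cannot. The experiments validate both the limitations of existing benchmarks and the effectiveness of RETINA and MIMIR.
Conclusion: The study reveals an important flaw in benchmark design for multimodal knowledge-based VQA. RETINA provides a more reliable testbed for evaluating genuine multimodal reasoning, and MIMIR shows that multi-image augmentation is effective for handling complex entity relations. The work underlines the importance of removing shortcuts in benchmark design and supports the development of more robust multimodal models.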
📄 Abstract
Existing Multimodal Knowledge-Based Visual Question Answering (MKB-VQA) benchmarks suffer from "visual shortcuts", as the query image typically matches the primary subject entity of the target document. We demonstrate that models can exploit these shortcuts, achieving comparable results using visual cues alone. To address this, we introduce the Relational Entity Text-Image kNowledge Augmented (RETINA) benchmark, automatically constructed using an LLM-driven pipeline and consisting of 120k training samples and a 2k human-curated test set. RETINA contains queries referencing secondary subjects (i.e. related entities) and pairs them with images of these related entities, removing the visual shortcut. When evaluated on RETINA, existing models show significantly degraded performance, confirming their reliance on the shortcut. Furthermore, we propose Multi-Image MultImodal Retriever (MIMIR), which enriches document embeddings by augmenting images of multiple related entities, effectively handling RETINA, unlike prior work that uses only a single image per document. Our experiments validate the limitations of existing benchmarks and demonstrate the effectiveness of RETINA and MIMIR. A project page is available.
[66] Resolving Evidence Sparsity: Agentic Context Engineering for Long-Document Understanding
Keliang Liu, Zizhi Chen, Mingcheng Li, Jingqun Tang, Dingkang Yang, Lihua Zhang
🧩 TL;DR
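This paper proposes SLEUTH, a multi-agent framework for long-document understanding that uses a hierarchical refinement process to identify key multimodal clues and filter out redundant information. Paired with advanced vision-language models, it achieves state-of-the-art results on multiple benchmarks.
📘 Detailed Summary
Motivation: Vision-language models work well on single-page document tasks, but their performance degrades on long documents, where clues are scattered across pages and modalities and redundancy in lengthy inputs impairs the model's judgment. Retrieval-augmented generation filters for question-relevant content, yet the retrieved results still contain substantial redundancy.
Method: SLEUTH is a model-agnostic, scalable multi-agent framework that orchestrates a retriever and four collaborative agents in a coarse-to-fine process. It identifies key textual and visual clues in the retrieved pages, filters for salient visual evidence such as tables and charts, analyzes the query to devise a reasoning strategy, and finally synthesizes a distilled, evidence-dense multimodal context from which the final prediction is generated.
Result: Paired with advanced VLM backbones, SLEUTH consistently improves performance on multiple long-document benchmarks and achieves state-of-the-art results. Ablation studies verify the contribution of each module and confirm the benefit of the hierarchical refinement paradigm.
Conclusion: The study shows that a collaborative multi-agent framework is effective for long-document understanding, in particular by handling scattered multimodal clues and redundancy through hierarchical refinement. It offers a scalable, model-agnostic solution and supports the value of agent collaboration in complex multimodal tasks.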
📄 Abstract
Document understanding is a long-standing practical task. Vision Language Models (VLMs) have gradually become a primary approach in this domain, demonstrating effective performance on single-page tasks. However, their effectiveness diminishes when handling long documents. In such scenarios, clues are often scattered across multiple pages and modalities, and redundancy from lengthy inputs can impair the model's judgment. While retrieval-augmented generation mitigates this issue by filtering for question-relevant content, the retrieved results still contain substantial redundancy. To address these limitations, we propose SLEUTH, a multi-agent framework. Concretely, SLEUTH orchestrates a retriever and four collaborative agents in a coarse-to-fine process. The framework identifies key textual and visual clues within the retrieved pages, filters for salient visual evidence such as tables and charts, and analyzes the query to devise a reasoning strategy. It ultimately synthesizes a distilled, evidence-dense multimodal context to generate the final prediction. SLEUTH is model-agnostic and scalable. When paired with advanced VLM backbones, it consistently improves performance on multiple long-document benchmarks, achieving state-of-the-art results. Ablation studies verify each module's effectiveness and confirm the benefits of our hierarchical refinement paradigm.
[67] CoordSpeaker: Exploiting Gesture Captioning for Coordinated Caption-Empowered Co-Speech Gesture Generation
Fengyi Fang, Sicheng Yang, Wenming Yang
🧩 TL;DR
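This paper presents CoordSpeaker, a framework that uses gesture captioning and multi-granularity control to address the missing semantic prior and the lack of coordinated cross-modal control in co-speech gesture generation, producing high-quality, semantically consistent gestures.
📘 Detailed Summary
Motivation: Existing co-speech gesture generation methods face two key challenges. The first is a semantic prior gap caused by the lack of descriptive text annotations in gesture datasets. The second is the difficulty of achieving coordinated multimodal control over gesture generation, in particular the inability to produce text-driven non-spontaneous gestures such as bowing while talking.
Method: The approach has two core components. A novel gesture captioning framework uses a motion-language model to generate descriptive captions at multiple granularities, bridging the semantic prior gap. A conditional latent diffusion model with a unified cross-dataset motion representation and a hierarchically controlled denoiser then enables highly controllable, coordinated gesture generation.
Result: Extensive experiments show that the method produces high-quality gestures that are rhythmically synchronized with speech and semantically coherent with arbitrary captions, outperforming existing approaches with higher efficiency.
Conclusion: The study is the first to explore gesture understanding and captioning as a way to close the semantic gap in gesture generation. It offers a new perspective of bidirectional gesture-text mapping and opens a direction for controllable co-speech gesture synthesis.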
📄 Abstract
Co-speech gesture generation has significantly advanced human-computer interaction, yet speaker movements remain constrained due to the omission of text-driven non-spontaneous gestures (e.g., bowing while talking). Existing methods face two key challenges: 1) the semantic prior gap due to the lack of descriptive text annotations in gesture datasets, and 2) the difficulty in achieving coordinated multimodal control over gesture generation. To address these challenges, this paper introduces CoordSpeaker, a comprehensive framework that enables coordinated caption-empowered co-speech gesture synthesis. Our approach first bridges the semantic prior gap through a novel gesture captioning framework, leveraging a motion-language model to generate descriptive captions at multiple granularities. Building upon this, we propose a conditional latent diffusion model with unified cross-dataset motion representation and a hierarchically controlled denoiser to achieve highly controlled, coordinated gesture generation. CoordSpeaker pioneers the first exploration of gesture understanding and captioning to tackle the semantic gap in gesture generation while offering a novel perspective of bidirectional gesture-text mapping. Extensive experiments demonstrate that our method produces high-quality gestures that are both rhythmically synchronized with speeches and semantically coherent with arbitrary captions, achieving superior performance with higher efficiency compared to existing approaches.
[68] CNN-Based Framework for Pedestrian Age and Gender Classification Using Far-View Surveillance in Mixed-Traffic Intersections
Shisir Shahriar Arif, Md. Muhtashim Shahrier, Nazmul Haque, Md Asif Raihan, Md. Hadiuzzaman
🧩 TL;DR
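This study proposes a deep learning framework that uses convolutional neural networks to classify pedestrian age group and gender in real time from far-view intersection surveillance footage, without facial recognition or high-resolution imagery, offering a scalable solution for pedestrian safety monitoring in mixed-traffic environments.
📘 Detailed Summary
Motivation: Pedestrian safety is a pressing concern at congested urban intersections in low- and middle-income countries, where traffic is multimodal and infrastructure often lacks formal control. Demographic factors such as age and gender significantly influence pedestrian vulnerability, yet real-time monitoring systems rarely capture them, so conventional traffic data lacks key demographic insight.
Method: The study proposes a deep learning framework that classifies pedestrian age group and gender from far-view intersection footage using CNNs. Classification is framed as a unified six-class problem that distinguishes adult, teenager, and child pedestrians for both males and females, based on full-body visual cues rather than facial recognition. Two CNN architectures are implemented: ResNet50, a deep convolutional network pretrained on ImageNet, and a custom lightweight CNN optimized for computational efficiency. Eight model variants explore combinations of pooling strategies and optimizers.
Result: ResNet50 with max pooling and SGD achieves the highest accuracy (86.19%), while the custom CNN performs comparably (84.15%) with fewer parameters and faster training. The efficient design enables real-time inference on standard surveillance feeds. Data were collected at three high-risk intersections in Dhaka, Bangladesh, demonstrating applicability in a real mixed-traffic environment.
Conclusion: The framework gives practitioners a scalable, cost-effective tool for monitoring pedestrian demographics at intersections with existing camera infrastructure. Its outputs can inform intersection design, signal timing, and targeted safety interventions for vulnerable groups such as children or the elderly. By supplying demographic insight often missing from conventional traffic data, it supports more inclusive, data-driven planning in mixed-traffic environments.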
📄 Abstract
Pedestrian safety remains a pressing concern in congested urban intersections, particularly in low- and middle-income countries where traffic is multimodal, and infrastructure often lacks formal control. Demographic factors like age and gender significantly influence pedestrian vulnerability, yet real-time monitoring systems rarely capture this information. To address this gap, this study proposes a deep learning framework that classifies pedestrian age group and gender from far-view intersection footage using convolutional neural networks (CNNs), without relying on facial recognition or high-resolution imagery. The classification is structured as a unified six-class problem, distinguishing adult, teenager, and child pedestrians for both males and females, based on full-body visual cues. Video data was collected from three high-risk intersections in Dhaka, Bangladesh. Two CNN architectures were implemented: ResNet50, a deep convolutional neural network pretrained on ImageNet, and a custom lightweight CNN optimized for computational efficiency. Eight model variants explored combinations of pooling strategies and optimizers. ResNet50 with Max Pooling and SGD achieved the highest accuracy (86.19%), while the custom CNN performed comparably (84.15%) with fewer parameters and faster training. The model's efficient design enables real-time inference on standard surveillance feeds. For practitioners, this system provides a scalable, cost-effective tool to monitor pedestrian demographics at intersections using existing camera infrastructure. Its outputs can shape intersection design, optimize signal timing, and enable targeted safety interventions for vulnerable groups such as children or the elderly. By offering demographic insights often missing in conventional traffic data, the framework supports more inclusive, data-driven planning in mixed-traffic environments.
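The unified six-class formulation can be made concrete with a small label-encoding sketch. The abstract states only that three age groups are crossed with two genders; the class ordering, the index values, and the helper names below are assumptions made for illustration.

```python
from itertools import product

AGE_GROUPS = ("adult", "teenager", "child")
GENDERS = ("male", "female")

# Every (age_group, gender) pair becomes one class: 3 x 2 = 6 classes.
# The ordering here is arbitrary and not taken from the paper.
CLASSES = list(product(AGE_GROUPS, GENDERS))
CLASS_TO_INDEX = {pair: i for i, pair in enumerate(CLASSES)}

def encode(age_group, gender):
    """Map an (age_group, gender) annotation to a single class index."""
    return CLASS_TO_INDEX[(age_group, gender)]

def decode(index):
    """Map a predicted class index back to (age_group, gender)."""
    return CLASSES[index]

label = encode("child", "female")
recovered = decode(label)
```

Treating the pair as one label lets a single softmax head predict both attributes jointly, at the cost of not sharing statistics across classes that have the same age group or the same gender.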
[69] DM$^3$T: Harmonizing Modalities via Diffusion for Multi-Object Tracking
Weiran Li, Yeqiang Liu, Yijie Wei, Mina Han, Qiannan Guo, Zhenbo Li
🧩 TL;DR
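This paper proposes DM³T, a diffusion-based multimodal multi-object tracking framework that fuses visible and thermal infrared modalities through iterative feature alignment, reaching 41.7 HOTA on the VT-MOT benchmark.
📘 Detailed Summary
Motivation: Effectively fusing visible and thermal infrared information is difficult in multimodal multi-object tracking. Simple strategies such as concatenation or addition often fail to bridge the non-linear distribution gap between the feature representations, which can cause modality conflicts and degrade tracking accuracy, so a more capable fusion method is needed.
Method: DM³T reformulates multimodal fusion as an iterative feature alignment process. A Cross-Modal Diffusion Fusion module performs iterative cross-modal harmonization in which features from both modalities guide each other and are projected onto a shared, consistent feature manifold. A plug-and-play Diffusion Refiner enhances the unified representation, and a Hierarchical Tracker adaptively handles confidence estimation.
Result: Extensive experiments on the VT-MOT benchmark show that DM³T achieves 41.7 HOTA, a 1.54% relative improvement over existing state-of-the-art methods. The framework unifies object detection, state estimation, and data association into an online tracker without complex post-processing.
Conclusion: The study shows that the iterative refinement idea behind diffusion models is effective for multimodal fusion. The framework learns complementary information and achieves deeper fusion than conventional methods, producing more accurate and temporally coherent object trajectories for robust autonomous driving systems.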
📄 Abstract
Multi-object tracking (MOT) is a fundamental task in computer vision with critical applications in autonomous driving and robotics. Multimodal MOT that integrates visible light and thermal infrared information is particularly essential for robust autonomous driving systems. However, effectively fusing these heterogeneous modalities is challenging. Simple strategies like concatenation or addition often fail to bridge the significant non-linear distribution gap between their feature representations, which can lead to modality conflicts and degrade tracking accuracy. Drawing inspiration from the connection between multimodal MOT and the iterative refinement in diffusion models, this paper proposes DM$^3$T, a novel framework that reformulates multimodal fusion as an iterative feature alignment process to generate accurate and temporally coherent object trajectories. Our approach performs iterative cross-modal harmonization through a proposed Cross-Modal Diffusion Fusion (C-MDF) module. In this process, features from both modalities provide mutual guidance, iteratively projecting them onto a shared, consistent feature manifold. This enables the learning of complementary information and achieves deeper fusion compared to conventional methods. Additionally, we introduce a plug-and-play Diffusion Refiner (DR) to enhance and refine the unified feature representation. To further improve tracking robustness, we design a Hierarchical Tracker that adaptively handles confidence estimation. DM$^3$T unifies object detection, state estimation, and data association into a comprehensive online tracking framework without complex post-processing. Extensive experiments on the VT-MOT benchmark demonstrate that our method achieves 41.7 HOTA, representing a 1.54% relative improvement over existing state-of-the-art methods. The code and models are available at https://vranlee.github.io/DM-3-T/.
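Since the abstract reports a relative rather than an absolute gain, a short calculation helps put the 41.7 HOTA figure in context. The sketch below assumes that "1.54% relative improvement" means new = baseline × (1 + 0.0154); the baseline value it derives is implied by that reading and is not a number quoted in the abstract.

```python
def implied_baseline(new_score, relative_gain_percent):
    """Baseline b such that new_score = b * (1 + relative_gain_percent / 100)."""
    return new_score / (1.0 + relative_gain_percent / 100.0)

def relative_gain_percent(new_score, baseline):
    """Relative improvement of new_score over baseline, in percent."""
    return 100.0 * (new_score - baseline) / baseline

# Under the stated assumption, the previous best would sit near 41.07 HOTA,
# i.e. an absolute gain of roughly 0.63 HOTA points.
baseline_hota = implied_baseline(41.7, 1.54)
absolute_gain = 41.7 - baseline_hota
```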
[70] From Points to Clouds: Learning Robust Semantic Distributions for Multi-modal Prompts
Weiran Li, Yeqiang Liu, Yijie Wei, Mina Han, Xin Liu, Zhenbo Li
🧩 TL;DR
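This paper proposes the Points-to-Clouds (P2C) framework, which recasts multimodal prompt learning as a dynamic denoising task and extends the learning of a static point representation to the learning of a semantic cloud distribution, improving the generalization of vision-language models to novel classes.
📘 Detailed Summary
Motivation: Current multimodal prompt learning methods are limited by optimizing a single, static point representation. This paradigm is inherently brittle, overfits to base classes, and generalizes poorly to novel or ambiguous categories. The study challenges the point paradigm and argues that robust generalization requires learning a semantic cloud, that is, a distribution over the embedding space.
Method: Inspired by diffusion models, P2C reframes prompt learning as a dynamic denoising task built on a dual denoising mechanism. A Dynamic Prompt Denoising mechanism perturbs text prompts with sophisticated, annealed noise to learn a smoother semantic landscape. An auxiliary V-L Mapper denoising loss re-tasks the mapper as a denoising autoencoder, forcing it to reconstruct clean visual prompts from noisy text inputs and thereby ensuring robust cross-modal alignment.
Result: Extensive experiments on 11 datasets show that P2C consistently outperforms strong baselines. On the base-to-novel generalization benchmark it achieves a harmonic mean of 79.7%, a relative improvement of 1.4% over the baseline.
Conclusion: The study demonstrates the benefit of moving multimodal prompt learning from static point representations to semantic cloud distributions. By learning a smoother semantic landscape through denoising, P2C improves performance on unseen classes and suggests a direction for future multimodal learning research.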
📄 Abstract
Multimodal Prompt Learning (MPL) has emerged as a pivotal technique for adapting large-scale Visual Language Models (VLMs). However, current MPL methods are fundamentally limited by their optimization of a single, static point representation. This paradigm is inherently brittle, leads to overfitting on base classes, and generalizes poorly to novel or ambiguous categories. We challenge this point paradigm, proposing that robust generalization requires learning a semantic cloud (i.e., a distribution over the embedding space). To achieve this, we introduce Points-to-Clouds (P2C), a novel framework inspired by diffusion models that reframes prompt learning as a dynamic denoising task. At the core of P2C is a dual denoising mechanism: a Dynamic Prompt Denoising (DPD) mechanism perturbs text prompts with sophisticated, annealed noise to learn a smoother semantic landscape, while an auxiliary V-L Mapper denoising loss re-tasks the mapper as a denoising autoencoder. This forces the mapper to reconstruct clean visual prompts from noisy text inputs, ensuring robust cross-modal alignment. Extensive experiments across 11 datasets demonstrate that P2C consistently outperforms strong baselines. On the base-to-novel generalization benchmark, our method achieves a Harmonic Mean of 79.7%, representing a relative improvement of 1.4% over the baseline. The code and models are available at https://vranlee.github.io/P2C/.
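The harmonic mean reported for base-to-novel generalization is, as the metric is commonly defined in this protocol, the harmonic mean of base-class and novel-class accuracy. The abstract does not spell out the formula, so the sketch below rests on that common definition, and the example accuracies are made-up numbers chosen only to show how the metric behaves.

```python
def harmonic_mean(base_acc, novel_acc):
    """Harmonic mean H = 2 * B * N / (B + N) of base and novel accuracy."""
    if base_acc + novel_acc == 0:
        return 0.0
    return 2.0 * base_acc * novel_acc / (base_acc + novel_acc)

# Hypothetical accuracies (not from the paper): strong on base, weak on novel.
h_unbalanced = harmonic_mean(80.0, 60.0)
# Same arithmetic mean (70.0), but balanced between base and novel classes.
h_balanced = harmonic_mean(70.0, 70.0)
```

The harmonic mean penalizes imbalance: both pairs average to 70.0 arithmetically, yet the unbalanced pair scores lower. This is why the metric suits a setting where overfitting to base classes is the main failure mode.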
[71] See, Rank, and Filter: Important Word-Aware Clip Filtering via Scene Understanding for Moment Retrieval and Highlight Detection
YuEun Lee, Jung Uk Kim
🧩 TL;DR
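This paper proposes a new approach to video moment retrieval and highlight detection that identifies the important words in a query to enable fine-grained clip filtering, outperforming existing methods. It incorporates scene understanding from multimodal large language models and introduces a feature enhancement module and a ranking-based filtering module.
📘 Detailed Summary
Motivation: Existing moment retrieval and highlight detection methods treat the entire text query and the video clips as a black box and overlook the importance of individual words, which hinders contextual understanding of the video. The paper addresses this by identifying the important words in the query to filter clips at a finer granularity.
Method: The method integrates image-text scene understanding through multimodal large language models to strengthen the semantic understanding of video clips. Its core components are a feature enhancement module that captures the important words in the query and a ranking-based filtering module that iteratively refines video clips according to their relevance to those words.
Result: Extensive experiments show that the method significantly outperforms existing state-of-the-art methods on both moment retrieval and highlight detection. The experiments also validate the proposed modules and the value of word-level attention for video understanding.
Conclusion: The study highlights the importance of word-level granularity in queries for video understanding and offers a new technical route for moment retrieval and highlight detection. The fine-grained filtering framework shows that multimodal large language models combine effectively with purpose-built filtering modules.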
📄 Abstract
Video moment retrieval (MR) and highlight detection (HD) with natural language queries aim to localize relevant moments and key highlights in video clips. However, existing methods overlook the importance of individual words, treating the entire text query and video clips as a black box, which hinders contextual understanding. In this paper, we propose a novel approach that enables fine-grained clip filtering by identifying and prioritizing important words in the query. Our method integrates image-text scene understanding through Multimodal Large Language Models (MLLMs) and enhances the semantic understanding of video clips. We introduce a feature enhancement module (FEM) to capture important words from the query and a ranking-based filtering module (RFM) to iteratively refine video clips based on their relevance to these important words. Extensive experiments demonstrate that our approach significantly outperforms existing state-of-the-art methods, achieving superior performance in both MR and HD tasks. Our code is available at: https://github.com/VisualAIKHU/SRF.
[72] Contrastive Heliophysical Image Pretraining for Solar Dynamics Observatory Records
Shiyu Shen, Zhe Gao, Taifeng Chai, Yang Huang, Bin Pan
🧩 TL;DR
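This paper introduces SolarCHIP, a family of contrastively pretrained visual backbones designed for multi-instrument Solar Dynamics Observatory observations. Multi-granularity contrastive learning addresses modality differences, inter-class similarity, and intra-class variability in solar image analysis, and the models achieve state-of-the-art results on cross-modal translation and flare classification.
📘 Detailed Summary
Motivation: Existing deep learning approaches to solar image analysis usually train task-specific encoders from scratch or rely on natural-image pretraining that ignores the distinctive properties of solar data. Solar Dynamics Observatory data poses three key challenges: multimodal sensing across the AIA and HMI instruments, weak inter-class separability due to slow temporal evolution, and strong intra-class variability with sparse activity signals.
Method: SolarCHIP uses a multi-granularity contrastive pretraining framework that jointly aligns features at three levels: global class tokens across co-temporal AIA-HMI pairs to enhance temporal discrimination, local patch tokens at fixed spatial indices to enforce position-consistent, modality-invariant features, and intra-sample patches across different spatial locations to preserve fine-grained spatial structure. Both CNN-based and Vision Transformer-based autoencoders are trained.
Result: SolarCHIP achieves state-of-the-art performance on two downstream tasks, cross-modal translation between HMI and AIA passbands via ControlNet and full-disk flare classification, with particularly strong gains in low-resource settings where labeled data is limited. Ablation studies confirm that each contrastive component contributes essential discriminative capacity at a different granularity.
Conclusion: By releasing pretrained weights and training code, the study gives the heliophysics community a practical, plug-and-play feature extractor that reduces computational requirements, improves label efficiency, and provides a reusable foundation for diverse solar imaging applications. The multi-granularity contrastive approach addresses the modality alignment and feature learning challenges specific to solar imagery.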
📄 Abstract
Deep learning has revolutionized solar image analysis, yet most approaches train task-specific encoders from scratch or rely on natural-image pretraining that ignores the unique characteristics of Solar Dynamics Observatory (SDO) data. We introduce SolarCHIP, a family of contrastively pretrained visual backbones tailored to multi-instrument SDO observations. SolarCHIP addresses three key challenges in solar imaging: multimodal sensing across AIA and HMI instruments, weak inter-class separability due to slow temporal evolution, and strong intra-class variability with sparse activity signals. Our pretraining framework employs a multi-granularity contrastive objective that jointly aligns (1) global class tokens across co-temporal AIA-HMI pairs to enhance temporal discrimination, (2) local patch tokens at fixed spatial indices to enforce position-consistent, modality-invariant features, and (3) intra-sample patches across different spatial locations to preserve fine-grained spatial structure. We train both CNN- and Vision Transformer-based autoencoders and demonstrate their effectiveness on two downstream tasks: cross-modal translation between HMI and AIA passbands via ControlNet, and full-disk flare classification. Experimental results show that SolarCHIP achieves state-of-the-art performance across both tasks, with particularly strong gains in low-resource settings where labeled data is limited. Ablation studies confirm that each contrastive component contributes essential discriminative capacity at different granularities. By publicly releasing pretrained weights and training code, we provide the heliophysics community with a practical, plug-and-play feature extractor that reduces computational requirements, improves label efficiency, and establishes a reusable foundation for diverse solar imaging applications.
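The first of the three alignment levels, pairing global tokens from co-temporal AIA and HMI observations, can be sketched with a generic contrastive loss. The abstract does not give the exact objective, so the symmetric InfoNCE form, the temperature value, and the toy embeddings below are assumptions used for illustration; the patch-level terms are omitted.

```python
import math

def normalize(v):
    """Scale a non-zero vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def info_nce(anchors, positives, temperature=0.1):
    """Mean cross-entropy where anchors[i] should match positives[i]
    against every other entry of positives (in-batch negatives)."""
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)

def symmetric_info_nce(x, y, temperature=0.1):
    """Average of the x-to-y and y-to-x losses."""
    return 0.5 * (info_nce(x, y, temperature) + info_nce(y, x, temperature))

# Toy global tokens for two timestamps (hypothetical values, not real SDO features).
aia_tokens = [[1.0, 0.0], [0.0, 1.0]]
hmi_aligned = [[1.0, 0.0], [0.0, 1.0]]   # same timestamp order as aia_tokens
hmi_shuffled = [[0.0, 1.0], [1.0, 0.0]]  # timestamps mismatched

loss_aligned = symmetric_info_nce(aia_tokens, hmi_aligned)
loss_shuffled = symmetric_info_nce(aia_tokens, hmi_shuffled)
```

The loss is low when each AIA token is closest to the HMI token from the same timestamp and high otherwise, which is the sense in which such an objective encourages temporal discrimination.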
[73] HMR3D: Hierarchical Multimodal Representation for 3D Scene Understanding with Large Vision-Language Model
Chen Li, Eric Peh, Basura Fernando
🧩 TL;DR
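This paper proposes a novel hierarchical multimodal representation for 3D scene reasoning that aligns explicitly with large vision-language models at the input space, using multi-view images and text descriptions that contain 3D coordinates to improve performance.
📘 Detailed Summary
Motivation: Existing VLM-based approaches to 3D scene understanding typically align 3D scene features with the VLM's embedding space. This implicit alignment often yields suboptimal performance because 3D data is scarce and spatial relationships in 3D environments are inherently complex.
Method: The paper proposes a hierarchical multimodal representation that aligns with the VLM explicitly at the input space. It uses multi-view images (a top-down view and four directional views) and text descriptions that reference the 3D coordinates of detected objects to capture spatial relationships. A hierarchical feature representation aggregates patch-level image features into view-level and scene-level representations, so the model can reason over both local and global scene context.
Result: Experiments on situated 3D Q&A and general 3D Q&A benchmarks demonstrate the effectiveness of the approach, supporting explicit input-space alignment and hierarchical multimodal representation.
Conclusion: The study shows that explicit input-space alignment combined with a hierarchical multimodal representation improves 3D scene reasoning, and it underlines the value of combining multi-view visual information with coordinate-bearing text descriptions for capturing complex 3D spatial relationships.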
📄 Abstract
Recent advances in large vision-language models (VLMs) have shown significant promise for 3D scene understanding. Existing VLM-based approaches typically align 3D scene features with the VLM's embedding space. However, this implicit alignment often yields suboptimal performance due to the scarcity of 3D data and the inherent complexity of spatial relationships in 3D environments. To address these limitations, we propose a novel hierarchical multimodal representation for 3D scene reasoning that explicitly aligns with VLMs at the input space by leveraging both multi-view images and text descriptions. The text descriptions capture spatial relationships by referencing the 3D coordinates of detected objects, while the multi-view images include a top-down perspective and four directional views (forward, left, right, and backward), ensuring comprehensive scene coverage. Additionally, we introduce a hierarchical feature representation that aggregates patch-level image features into view-level and scene-level representations, enabling the model to reason over both local and global scene context. Experimental results on both situated 3D Q&A and general 3D Q&A benchmarks demonstrate the effectiveness of our approach.
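The patch-to-view-to-scene aggregation described in this abstract can be illustrated with a minimal sketch. The abstract does not state which aggregation operator is used, so plain mean pooling is an assumption here, and the names `mean_pool` and `aggregate_scene` are hypothetical.

```python
from statistics import fmean


def mean_pool(vectors):
    """Average a list of equal-length feature vectors element-wise."""
    return [fmean(column) for column in zip(*vectors)]


def aggregate_scene(views):
    """Build view-level and scene-level features from patch-level features.

    `views` maps a view name to its list of patch feature vectors.
    Mean pooling is an assumption; the paper may use a learned aggregator.
    """
    view_level = {name: mean_pool(patches) for name, patches in views.items()}
    scene_level = mean_pool(list(view_level.values()))
    return view_level, scene_level


# Toy input: two views, two patches each, 2-dimensional features.
views = {
    "top_down": [[1.0, 0.0], [3.0, 2.0]],
    "forward": [[0.0, 4.0], [2.0, 0.0]],
}
view_level, scene_level = aggregate_scene(views)
```

The point of the hierarchy is that a downstream model can attend to any of the three levels: individual patches for local detail, one vector per view, and one vector for the whole scene.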
[74] MultiBanana: A Challenging Benchmark for Multi-Reference Text-to-Image Generation
Yuta Oshima, Daiki Miyake, Kohsei Matsutani, Yusuke Iwasawa, Masahiro Suzuki, Yutaka Matsuo, Hiroki Furuta
🧩 TL;DR
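This paper introduces the MultiBanana benchmark dataset, which systematically evaluates the performance limits of multi-reference image generation and editing models. It covers five challenging multi-reference scenarios to expose the strengths and limitations of existing models.
📘 Detailed Summary
Motivation: Existing text-to-image models can already perform multi-reference generation and editing, but current benchmark datasets focus mainly on single-reference or few-reference settings, so they cannot systematically measure progress or weaknesses under different multi-reference conditions. Existing task definitions are also vague, usually limited to simple axes such as "what to edit" or "how many references", and fail to capture the intrinsic difficulty of multi-reference settings.
Method: MultiBanana probes the edge of model capability by covering five multi-reference-specific problems at scale: varying the number of references, domain mismatch among references (e.g., photo vs. anime), scale mismatch between reference and target scenes, references containing rare concepts (e.g., a red banana), and multilingual textual references for rendering. The benchmark aims to establish a standardized basis for fair comparison in multi-reference image generation.
Result: Analysis of a variety of text-to-image models reveals where they perform well, their typical failure modes, and areas for improvement. The experiments show that existing models differ markedly across multi-reference conditions, with clear limitations in handling domain mismatch and rare concepts.
Conclusion: MultiBanana provides a systematic evaluation framework for multi-reference image generation and points to directions for improvement by identifying specific model weaknesses. As an open benchmark, it is intended to push the boundaries of the field and establish a standardized basis for fair comparison.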
📄 Abstract
Recent text-to-image generation models have acquired the ability of multi-reference generation and editing: the ability to inherit the appearance of subjects from multiple reference images and re-render them under new contexts. However, existing benchmark datasets often focus on generation with a single or a few reference images, which prevents us from measuring how model performance advances, or pointing out model weaknesses, under different multi-reference conditions. In addition, their task definitions are still vague, typically limited to axes such as "what to edit" or "how many references are given", and therefore fail to capture the intrinsic difficulty of multi-reference settings. To address this gap, we introduce MultiBanana, which is carefully designed to assess the edge of model capabilities by widely covering multi-reference-specific problems at scale: (1) varying the number of references, (2) domain mismatch among references (e.g., photo vs. anime), (3) scale mismatch between reference and target scenes, (4) references containing rare concepts (e.g., a red banana), and (5) multilingual textual references for rendering. Our analysis among a variety of text-to-image models reveals their superior performance, typical failure modes, and areas for improvement. MultiBanana will be released as an open benchmark to push the boundaries and establish a standardized basis for fair comparison in multi-reference image generation. Our data and code are available at https://github.com/matsuolab/multibanana .
[75] MrGS: Multi-modal Radiance Fields with 3D Gaussian Splatting for RGB-Thermal Novel View Synthesis
Minseong Kweon, Janghyun Kim, Ukcheol Shin, Jinsun Park
🧩 TL;DR
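This paper proposes MrGS, a multi-modal radiance field based on 3D Gaussian Splatting that reconstructs RGB and thermal infrared 3D scenes simultaneously. It models heat conduction and radiation through physical priors, maintaining high fidelity while reducing the number of Gaussians.
📘 Detailed Summary
Motivation: NeRF- and 3DGS-based methods have made considerable progress in RGB scene reconstruction, but multi-modal rendering that incorporates thermal infrared imagery remains underexplored. Existing methods tend to neglect distinctive thermal characteristics such as heat conduction and the Lambertian property, which limits the accuracy and physical plausibility of multi-modal scene reconstruction.
Method: MrGS separates RGB- and thermal-related information from a single appearance feature through orthogonal feature extraction, and uses view-dependent or view-independent embedding strategies depending on each modality's degree of Lambertian reflectance. It integrates two physical principles. Fourier's law of heat conduction, applied before alpha blending, models the intensity interpolation caused by thermal conduction between neighboring Gaussians. The Stefan-Boltzmann law combined with the inverse-square law yields a depth-aware thermal radiation map that imposes additional geometric constraints on thermal rendering.
Result: Experiments show that MrGS achieves high-fidelity RGB-T scene reconstruction while reducing the number of Gaussians required. The method models heat conduction and radiation without sacrificing reconstruction quality, supporting the usefulness of physical priors in multi-modal rendering.
Conclusion: The study demonstrates that physical laws can be integrated into multi-modal radiance fields, offering a new approach to RGB-T scene reconstruction. By combining physical constraints on heat conduction and radiation, MrGS improves thermal reconstruction accuracy and reduces model complexity.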
📄 Abstract
Recent advances in Neural Radiance Fields (NeRFs) and 3D Gaussian Splatting (3DGS) have achieved considerable performance in RGB scene reconstruction. However, multi-modal rendering that incorporates thermal infrared imagery remains largely underexplored. Existing approaches tend to neglect distinctive thermal characteristics, such as heat conduction and the Lambertian property. In this study, we introduce MrGS, a multi-modal radiance field based on 3DGS that simultaneously reconstructs both RGB and thermal 3D scenes. Specifically, MrGS derives RGB- and thermal-related information from a single appearance feature through orthogonal feature extraction and employs view-dependent or view-independent embedding strategies depending on the degree of Lambertian reflectance exhibited by each modality. Furthermore, we leverage two physics-based principles to effectively model thermal-domain phenomena. First, we integrate Fourier's law of heat conduction prior to alpha blending to model intensity interpolation caused by thermal conduction between neighboring Gaussians. Second, we apply the Stefan-Boltzmann law and the inverse-square law to formulate a depth-aware thermal radiation map that imposes additional geometric constraints on thermal rendering. Experimental results demonstrate that the proposed MrGS achieves high-fidelity RGB-T scene reconstruction while reducing the number of Gaussians.
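To make the second physics prior concrete, the sketch below combines the Stefan-Boltzmann law (emitted power per unit area scales with the fourth power of absolute temperature) with inverse-square falloff over depth. It is only an illustration of how the two laws interact. The function names and the unit-free "relative intensity" are assumptions, not the formulation of the depth-aware thermal radiation map used in MrGS.

```python
import math

# Stefan-Boltzmann constant in W m^-2 K^-4.
STEFAN_BOLTZMANN = 5.670374419e-8


def radiant_exitance(temperature_k, emissivity=1.0):
    """Power emitted per unit area by a grey body (Stefan-Boltzmann law)."""
    return emissivity * STEFAN_BOLTZMANN * temperature_k ** 4


def relative_thermal_intensity(temperature_k, depth_m, emissivity=1.0):
    """Hypothetical depth-aware intensity: exitance scaled by 1 / depth^2.

    Geometric constants (such as 4*pi) are dropped, so only ratios between
    values are meaningful.
    """
    return radiant_exitance(temperature_k, emissivity) / depth_m ** 2


near = relative_thermal_intensity(300.0, 1.0)
far = relative_thermal_intensity(300.0, 2.0)
hot = relative_thermal_intensity(600.0, 1.0)
```

Doubling the depth divides the received intensity by four, while doubling the temperature multiplies it by sixteen, which is why depth and thermal intensity constrain each other.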
[76] JarvisEvo: Towards a Self-Evolving Photo Editing Agent with Synergistic Editor-Evaluator Optimization
Yunlong Lin, Linqing Wang, Kunjie Lin, Zixu Lin, Kaixiong Gong, Wenbo Li, Bin Lin, Zhenxi Li, Shiyi Zhang, Yuyang Peng, Wenxun Dai, Xinghao Ding, Chunyu Wang, Qinglin Lu
🧩 TL;DR
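This paper proposes JarvisEvo, a unified image editing agent that addresses instruction hallucination and reward hacking through interleaved multimodal chain-of-thought reasoning and a synergistic editor-evaluator policy optimization framework, substantially improving editing accuracy and content fidelity.
📘 Detailed Summary
Motivation: Existing agent-based editing models face two key challenges. The first is instruction hallucination: text-only chain-of-thought reasoning cannot fully prevent factual errors because of its inherent information bottleneck. The second is reward hacking: dynamic policy optimization against a static reward model lets the agent exploit flaws in the reward function, which degrades editing quality.
Method: JarvisEvo uses an interleaved multimodal chain-of-thought (iMCoT) reasoning mechanism to improve instruction following and editing quality, and a synergistic editor-evaluator policy optimization (SEPO) framework that enables self-improvement without external rewards and mitigates reward hacking. It supports both global and local fine-grained editing through integration with Adobe Lightroom.
Result: On ArtEdit-Bench, JarvisEvo outperforms Nano-Banana by an average of 18.95% on preservative editing metrics, including a 44.96% improvement in pixel-level content fidelity, markedly improving editing accuracy and content consistency.
Conclusion: The study indicates that combining multimodal reasoning with synergistic policy optimization can address core challenges in agent-based editing, offering a new paradigm for building more reliable, self-improving editing systems with practical value for creative AI tools.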
📄 Abstract
Agent-based editing models have substantially advanced interactive experiences, processing quality, and creative flexibility. However, two critical challenges persist: (1) instruction hallucination: text-only chain-of-thought (CoT) reasoning cannot fully prevent factual errors due to inherent information bottlenecks; (2) reward hacking: dynamic policy optimization against static reward models allows agents to exploit flaws in reward functions. To address these issues, we propose JarvisEvo, a unified image editing agent that emulates an expert human designer by iteratively editing, selecting appropriate tools, evaluating results, and reflecting on its own decisions to refine outcomes. JarvisEvo offers three key advantages: (1) an interleaved multimodal chain-of-thought (iMCoT) reasoning mechanism that enhances instruction following and editing quality; (2) a synergistic editor-evaluator policy optimization (SEPO) framework that enables self-improvement without external rewards, effectively mitigating reward hacking; and (3) support for both global and local fine-grained editing through seamless integration of Adobe Lightroom. On ArtEdit-Bench, JarvisEvo outperforms Nano-Banana by an average of 18.95% on preservative editing metrics, including a substantial 44.96% improvement in pixel-level content fidelity.
[77] Buffer replay enhances the robustness of multimodal learning under missing-modality
Hongye Zhu, Xuan Liu, Yanwen Ba, Jingye Xue, Shigeng Zhang
🧩 TL;DR
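This paper proposes REplay Prompting (REP), which addresses the performance degradation caused by missing modalities in multimodal models by building modality-wise feature buffers, decoupling private and shared features, and using a dynamic initialization mechanism. It delivers lightweight, robust improvements across a range of missing-modality scenarios.
📘 Detailed Summary
Motivation: Missing modalities typically cause significant performance degradation in multimodal models. Existing methods are either computationally expensive or rely only on adjacent-layer features, overlooking long-distance contextual information that may offer additional error tolerance when modalities are missing.
Method: REP has three core components. It builds modality-wise feature buffers via a residual bypass to cache early-layer representations and replay them in deeper layers. It uses a private-shared feature decoupling strategy, in which private buffers preserve modality-specific signals and shared buffers encode cross-modal semantics. It also designs a task-aware dynamic initialization mechanism that configures the buffers according to the missing-modality condition, improving stability and generalization.
Result: On vision-language, vision-language-audio, and temporal multimodal benchmarks, REP outperforms existing methods in both single- and multi-modality missing scenarios while adding only negligible parameter overhead, demonstrating that it is both effective and lightweight in challenging missing-modality settings.
Conclusion: REP offers a lightweight and effective paradigm for robust multimodal learning. Its feature replay and decoupling mechanisms mitigate information loss in deep networks and provide a practical approach to handling incomplete multimodal data in real-world settings.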
📄 Abstract
Missing modalities consistently lead to significant performance degradation in multimodal models. Existing approaches either synthesize missing modalities at high computational cost or apply prompt-based fine-tuning that relies only on adjacent-layer features and overlooks long-distance contextual information, which may offer additional tolerance to errors when one or more modalities are missing. To address this, we introduce REplay Prompting (REP): (1) construct modality-wise feature buffers via a residual bypass to cache early-layer representations and replay them in deeper layers, mitigating information loss as network depth increases; (2) employ a private-shared feature decoupling strategy, where private buffers preserve modality-specific signals and shared buffers encode cross-modal semantics; and (3) design a task-aware dynamic initialization mechanism to configure these buffers differently, improving stability and generalization under diverse missing-modality conditions. Experiments on vision-language, vision-language-audio, and temporal multimodal benchmarks demonstrate that REP consistently outperforms prior methods under both single- and multi-modality missing scenarios, while introducing only negligible parameter overhead. These results establish REP as a lightweight and effective paradigm for robust multimodal learning in challenging missing-modality environments.
[78] NumeriKontrol: Adding Numeric Control to Diffusion Transformers for Instruction-based Image Editing
Zhenyu Xu, Xiaoqi Shen, Haotian Nan, Xinyu Zhang
🧩 TL;DR
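This paper proposes NumeriKontrol, a framework that uses continuous scalar values to give instruction-based image editing precise intensity control, addressing the inability of text instructions to control fine-grained edit strength. The framework uses a plug-in numeric adapter, supports zero-shot multi-condition editing, and achieves accurate, continuous, and stable scale control across diverse attribute editing scenarios.
📘 Detailed Summary
Motivation: Instruction-based image editing enables intuitive manipulation through natural language commands, but text instructions alone often lack precise control over edit intensity and make fine-grained attribute adjustment difficult. Existing methods fall short on intensity control, so a solution offering continuous, precise scale control is needed.
Method: NumeriKontrol encodes numeric editing scales with a Numeric Adapter and injects them into diffusion models in a plug-and-play manner. Its task-separated design supports zero-shot multi-condition editing, letting users specify multiple instructions in any order. To provide high-quality supervision, the authors synthesize precise training data from reliable sources such as high-fidelity rendering engines and DSLR cameras, building the Common Attribute Transform (CAT) dataset, which covers diverse attribute manipulations.
Result: Extensive experiments show that NumeriKontrol achieves accurate, continuous, and stable scale control across a wide range of attribute editing scenarios. It can serve as a simple yet powerful interactive editing studio in which users make fine adjustments through continuous scalar values.
Conclusion: By introducing continuous scalar control, NumeriKontrol advances instruction-based image editing toward precise, scalable, and user-controllable image manipulation, making edit intensity control more intuitive and precise and pointing to a direction for future controllable generative models.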
📄 Abstract
Instruction-based image editing enables intuitive manipulation through natural language commands. However, text instructions alone often lack the precision required for fine-grained control over edit intensity. We introduce NumeriKontrol, a framework that allows users to precisely adjust image attributes using continuous scalar values with common units. NumeriKontrol encodes numeric editing scales via an effective Numeric Adapter and injects them into diffusion models in a plug-and-play manner. Thanks to a task-separated design, our approach supports zero-shot multi-condition editing, allowing users to specify multiple instructions in any order. To provide high-quality supervision, we synthesize precise training data from reliable sources, including high-fidelity rendering engines and DSLR cameras. Our Common Attribute Transform (CAT) dataset covers diverse attribute manipulations with accurate ground-truth scales, enabling NumeriKontrol to function as a simple yet powerful interactive editing studio. Extensive experiments show that NumeriKontrol delivers accurate, continuous, and stable scale control across a wide range of attribute editing scenarios. These contributions advance instruction-based image editing by enabling precise, scalable, and user-controllable image manipulation.
[79] MathSight: A Benchmark Exploring Have Vision-Language Models Really Seen in University-Level Mathematical Reasoning?
Yuandong Wang, Yao Cui, Yuxin Zhao, Zhen Yang, Yangfu Zhu, Zhenzhou Shao
🧩 TL;DR
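This study presents the MathSight benchmark for quantifying how much visual information actually contributes to multimodal mathematical reasoning in vision-language models. It finds that the role of visual information diminishes markedly as problem difficulty increases, to the point that a text-only model outperforms multimodal variants.
📘 Detailed Summary
Motivation: Vision-language models have made impressive progress in multimodal mathematical reasoning, but how much visual information actually contributes to reasoning remains unclear. Existing benchmarks report strong overall performance but rarely isolate the role of the image modality, so they cannot determine whether models genuinely use visual understanding or merely depend on linguistic priors.
Method: MathSight is a university-level multimodal mathematical reasoning benchmark designed to disentangle and quantify the effect of visual input. Each problem includes multiple visual variants (original, hand-drawn, and photo-captured) plus a text-only condition for controlled comparison, enabling a systematic assessment of the role of visual information.
Result: Experiments on state-of-the-art vision-language models reveal a consistent trend: the contribution of visual information decreases as problem difficulty increases. Notably, Qwen3-VL without any image input surpasses both its multimodal variants and GPT-5, indicating that current models rely on visual information only to a limited extent.
Conclusion: The study indicates that current vision-language models do not make full use of visual information in multimodal mathematical reasoning and instead lean heavily on linguistic priors. MathSight provides a tool for evaluating and advancing genuinely vision-grounded reasoning, and underscores the need for future models to integrate visual understanding more effectively.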
📄 Abstract
Recent advances in Vision-Language Models (VLMs) have achieved impressive progress in multimodal mathematical reasoning. Yet, how much visual information truly contributes to reasoning remains unclear. Existing benchmarks report strong overall performance but seldom isolate the role of the image modality, leaving open whether VLMs genuinely leverage visual understanding or merely depend on linguistic priors. To address this, we present MathSight, a university-level multimodal mathematical reasoning benchmark designed to disentangle and quantify the effect of visual input. Each problem includes multiple visual variants -- original, hand-drawn, photo-captured -- and a text-only condition for controlled comparison. Experiments on state-of-the-art VLMs reveal a consistent trend: the contribution of visual information diminishes with increasing problem difficulty. Remarkably, Qwen3-VL without any image input surpasses both its multimodal variants and GPT-5, underscoring the need for benchmarks like MathSight to advance genuine vision-grounded reasoning in future models.
[80] Learning to Refuse: Refusal-Aware Reinforcement Fine-Tuning for Hard-Irrelevant Queries in Video Temporal Grounding
Jin-Seop Lee, SungJoon Lee, SeongJun Jung, Boyang Li, Jee-Hyong Lee
🧩 TL;DR
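This paper proposes Refusal-Aware Reinforcement Fine-Tuning (RA-RFT) to refuse hard-irrelevant queries in video temporal grounding. The method builds on the Group Relative Policy Optimization framework and integrates four reward objectives, and the authors also construct the HI-VTG dataset of hard-irrelevant queries.
📘 Detailed Summary
Motivation: Existing video temporal grounding models assume the query is always relevant to the video, so they predict a target segment even when it is not. Recent methods can reject only queries that are entirely unrelated and cannot handle hard-irrelevant queries that are semantically similar but not actually relevant, which limits reliability in practice.
Method: RA-RFT is based on the Group Relative Policy Optimization framework and integrates four reward objectives (format, refuse-IoU, explain, and query correction) to improve relevance discrimination and fine-grained semantic reasoning. To support training, the authors construct the HI-VTG dataset, which contains hard-irrelevant queries and their refusal answers.
Result: The method is shown to be effective across several relevance-aware video temporal grounding scenarios, including hard-irrelevant VTG, simply-shuffled RA-VTG, and human-annotated RA-VTG settings. Experiments also show that it scales to various LVLM-based VTG models and performs well at refusing hard-irrelevant queries.
Conclusion: The study offers an effective approach to refusing hard-irrelevant queries in video temporal grounding. Its reinforcement learning framework and multi-objective reward design improve the model's semantic understanding, the dataset provides a benchmark for related research, and the method scales across models.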
📄 Abstract
Video Temporal Grounding (VTG) aims to localize a temporal segment in a video corresponding to a natural language query. However, existing VTG models assume that a relevant segment always exists, causing them to always predict a target segment even when the query is irrelevant to the video. While recent approaches attempt to handle irrelevant queries, they can only reject those that are entirely unrelated to the video and still fail to handle hard-irrelevant queries that are semantically similar but not actually relevant. To address this, we propose Refusal-Aware Reinforcement Fine-Tuning (RA-RFT) to effectively refuse hard-irrelevant queries in VTG. Our method is based on the Group Relative Policy Optimization (GRPO) framework and integrates four reward objectives (format, refuse-IoU, explain, and query correction) to improve both relevance discrimination and fine-grained semantic reasoning. In addition, to effectively support RA-RFT, we construct a Hard-Irrelevant VTG (HI-VTG) dataset, which includes hard-irrelevant queries and their refusal answers. We demonstrate the effectiveness of our method across various relevance-aware VTG scenarios, including hard-irrelevant VTG, simply-shuffled RA-VTG, and human-annotated RA-VTG settings. We also show that the proposed method is scalable by applying it to various LVLM-based VTG models. Our code is available at https://github.com/JINSUBY/RA-RFT.
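A reward that mixes refusal with temporal IoU could be sketched as follows. The abstract names a refuse-IoU reward but does not define it, so the rule below is a guess made purely for illustration: full reward for a correct refusal, the segment IoU when the query is relevant, and zero otherwise. The function names and the use of `None` to mean "refuse" or "no relevant segment" are likewise assumptions.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) time segments in seconds."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def refuse_iou_reward(pred, gt):
    """Hypothetical refusal-aware reward (not the paper's definition).

    `pred` is a (start, end) segment, or None when the model refuses.
    `gt` is a (start, end) segment, or None when the query is irrelevant.
    """
    if gt is None:
        # Irrelevant query: only a refusal is rewarded.
        return 1.0 if pred is None else 0.0
    if pred is None:
        # Relevant query but the model refused.
        return 0.0
    return temporal_iou(pred, gt)


correct_refusal = refuse_iou_reward(None, None)
wrong_prediction = refuse_iou_reward((0.0, 10.0), None)
partial_overlap = refuse_iou_reward((0.0, 10.0), (5.0, 15.0))
```

Under this rule a model that always predicts a segment earns nothing on irrelevant queries, which is the behavior the abstract identifies as the failure of existing VTG models.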
[81] PowerCLIP: Powerset Alignment for Contrastive Pre-Training
Masaki Kawamura, Nakamasa Inoue, Rintaro Yanagi, Hirokatsu Kataoka, Rio Yokota
🧩 TL;DR
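This paper proposes PowerCLIP, a contrastive vision-language pre-training framework based on powerset alignment. It strengthens fine-grained compositional understanding by optimizing the alignment between powersets of image regions and textual parse trees, improving zero-shot classification and retrieval performance.
📘 Detailed Summary
Motivation: Contrastive vision-language pre-training frameworks such as CLIP are limited in fine-grained compositional understanding, and in particular struggle to capture compositional semantics that span multiple image regions, which restricts deep understanding of complex visual scenes.
Method: PowerCLIP uses powerset alignment, optimizing region-to-phrase alignment by minimizing a loss defined between subsets of image regions and textual parse trees. To handle the exponential cost of powerset construction, it introduces efficient non-linear aggregators that reduce complexity from O(2^M) to O(M) while approximating the exact loss value with arbitrary precision.
Result: Experiments show that PowerCLIP outperforms state-of-the-art methods in zero-shot classification and retrieval tasks, demonstrating its compositionality and robustness, especially in complex visual scenes that require fine-grained semantic understanding.
Conclusion: Through powerset alignment, PowerCLIP addresses the modeling of multi-region compositional semantics and offers a new technical path for vision-language understanding. Its efficient non-linear aggregators provide a practical way to deal with combinatorial explosion.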
📄 Abstract
Contrastive vision-language pre-training frameworks such as CLIP have demonstrated impressive zero-shot performance across a range of vision-language tasks. Recent studies have shown that aligning individual text tokens with specific image patches or regions enhances fine-grained compositional understanding. However, it remains challenging to capture compositional semantics that span multiple image regions. To address this limitation, we propose PowerCLIP, a novel contrastive pre-training framework enhanced by powerset alignment, which exhaustively optimizes region-to-phrase alignments by minimizing the loss defined between powersets of image regions and textual parse trees. Since the naive powerset construction incurs exponential computational cost due to the combinatorial explosion in the number of region subsets, we introduce efficient non-linear aggregators (NLAs) that reduce complexity from O(2^M) to O(M) with respect to the number of regions M, while approximating the exact loss value with arbitrary precision. Our extensive experiments demonstrate that PowerCLIP outperforms state-of-the-art methods in zero-shot classification and retrieval tasks, underscoring the compositionality and robustness of our approach. Our code will be made publicly available.
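Why a quantity defined over all 2^M region subsets can sometimes be computed in O(M) is easiest to see on a toy aggregate. The sketch below uses the classical identity that the sum over every subset of the product of its elements equals the product of (1 + s_i). This identity is only an analogy chosen for illustration; it is not the non-linear aggregator proposed in PowerCLIP, and both function names are hypothetical.

```python
from itertools import combinations
from math import isclose, prod


def powerset_sum_naive(scores):
    """Sum of the product of every subset's scores: O(2^M) terms."""
    total = 0.0
    for size in range(len(scores) + 1):
        for subset in combinations(scores, size):
            total += prod(subset)  # the empty subset contributes 1
    return total


def powerset_sum_linear(scores):
    """Same value in O(M), using sum over subsets = product of (1 + s_i)."""
    return prod(1.0 + s for s in scores)


region_scores = [0.5, 0.2, 0.1]
naive = powerset_sum_naive(region_scores)
fast = powerset_sum_linear(region_scores)
```

The naive version enumerates 8 subsets for 3 regions and would enumerate over a million for 20, while the factorized version always takes one pass over the regions.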
[82] Zero-Shot Multi-Criteria Visual Quality Inspection for Semi-Controlled Industrial Environments via Real-Time 3D Digital Twin Simulation
Jose Moises Araya-Martinez, Gautham Mohan, Kenichi Hayakawa Bolaños, Roberto Mendieta, Sarvenaz Sardari, Jens Lambrecht, Jörg Krüger
🧩 TL;DR
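This paper proposes a pose-agnostic, zero-shot quality inspection framework that compares real scenes against real-time Digital Twins in RGB-D space, enabling efficient defect detection in semi-controlled industrial environments. The approach relies on object detection and pose estimation of known CAD models, and lays groundwork for generalizable, low-data defect detection in dynamic manufacturing settings.
📘 Detailed Summary
Motivation: In modern industrial environments, early-stage visual quality inspection is vital for achieving zero-defect manufacturing and minimizing production waste, but the complexity of robust visual inspection systems and their extensive data requirements hinder adoption in semi-controlled industrial settings. Methods are therefore needed that enable generalizable, low-data defect detection in dynamic manufacturing environments.
Method: The framework semantically describes industrial scenes through object detection and pose estimation of known CAD models, enabling efficient rendering of real-time Digital Twins in RGB-D space. It provides an extensible, hierarchical annotation strategy for multi-criteria defect detection that unifies pose labelling with logical and structural defect annotations, and it benchmarks tools for real-time multimodal RGB-D Digital Twin creation while tracking computational resource consumption.
Result: In an automotive use case involving quality inspection of an axial flux motor under semi-controlled industrial conditions, the framework reaches intersection-over-union scores of up to 63.3% against ground-truth masks, even when using simple distance measurements.
Conclusion: The study lays the groundwork for future research on generalizable, low-data defect detection in dynamic manufacturing settings. By comparing real-time Digital Twins with real scenes, it addresses the complexity and data requirements of visual quality inspection in semi-controlled industrial environments, and the results suggest that even relatively simple measurements can detect defects under challenging industrial conditions.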
📄 Abstract
Early-stage visual quality inspection is vital for achieving Zero-Defect Manufacturing and minimizing production waste in modern industrial environments. However, the complexity of robust visual inspection systems and their extensive data requirements hinder widespread adoption in semi-controlled industrial settings. In this context, we propose a pose-agnostic, zero-shot quality inspection framework that compares real scenes against real-time Digital Twins (DT) in the RGB-D space. Our approach enables efficient real-time DT rendering by semantically describing industrial scenes through object detection and pose estimation of known Computer-Aided Design models. We benchmark tools for real-time, multimodal RGB-D DT creation while tracking consumption of computational resources. Additionally, we provide an extensible and hierarchical annotation strategy for multi-criteria defect detection, unifying pose labelling with logical and structural defect annotations. Based on an automotive use case featuring the quality inspection of an axial flux motor, we demonstrate the effectiveness of our framework. Our results demonstrate detection performance, achieving intersection-over-union (IoU) scores of up to 63.3% compared to ground-truth masks, even when using simple distance measurements under semi-controlled industrial conditions. Our findings lay the groundwork for future research on generalizable, low-data defect detection methods in dynamic manufacturing settings.
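The evaluation idea in this abstract, comparing a real scene to its rendered Digital Twin with a simple distance measure and scoring the result by IoU, might look roughly like the sketch below. The per-pixel depth threshold, the toy depth maps, and the function names are assumptions made for illustration; the paper's actual distance measure is not specified in the abstract.

```python
def defect_mask(real_depth, twin_depth, threshold):
    """Mark pixels where the real depth deviates from the twin's depth."""
    return [
        [abs(real - twin) > threshold for real, twin in zip(real_row, twin_row)]
        for real_row, twin_row in zip(real_depth, twin_depth)
    ]


def mask_iou(pred, gt):
    """Intersection-over-union of two boolean masks of equal shape."""
    pairs = [(p, g) for p_row, g_row in zip(pred, gt) for p, g in zip(p_row, g_row)]
    intersection = sum(p and g for p, g in pairs)
    union = sum(p or g for p, g in pairs)
    return intersection / union if union else 1.0


# Toy 2x3 depth maps in metres: the twin is a flat surface at 1.0 m.
real = [[1.0, 1.0, 1.5], [1.0, 1.6, 1.6]]
twin = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
ground_truth = [[False, False, True], [False, False, True]]

predicted = defect_mask(real, twin, threshold=0.2)
score = mask_iou(predicted, ground_truth)
```

Here three pixels are flagged and two of them match the ground truth, so the IoU is 2/3.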
[83] Unlocking Multilingual Reasoning Capability of LLMs and LVLMs through Representation Engineering
Qiming Li, Xiaocheng Feng, Yixuan Ma, Zekai Ye, Ruihan Chen, Xiachong Feng, Bing Qin
🧩 TL;DR
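This paper proposes MRRE, a training-free, inference-time method for enhancing multilingual reasoning. It uses representation engineering to sequentially inject two precomputed vectors during inference, improving reasoning performance in low-resource languages while preserving input-output language consistency.
📘 Detailed Summary
Motivation: Large language models and vision-language models reason markedly better in English than in low-resource languages, raising fairness concerns in multilingual applications. Existing methods either rely on costly multilingual training or prompt with external translation tools; both are resource-intensive and sensitive to translation quality.
Method: MRRE is an inference-time method that requires no training data. It sequentially injects two precomputed vectors at specific layers during inference: cross-lingual reasoning enhancement vectors steer non-English reasoning representations toward the English space to unlock multilingual reasoning, and target-language output anchoring vectors restore the distribution of the target language to preserve input-output language consistency.
Result: Across four reasoning benchmarks on six advanced LLMs and LVLMs, MRRE improves non-English reasoning by an average of 5.48% and by up to 7.54% on low-resource languages (Thai and Swahili), while improving input-output language consistency by 3.78%.
Conclusion: MRRE offers an efficient, training-free approach to enhancing multilingual reasoning. Through representation engineering it narrows the performance gap on low-resource languages, providing a practical route to fairer multilingual AI systems while avoiding the high cost and translation-quality dependence of conventional methods.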
📄 Abstract
Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) demonstrate strong reasoning capabilities, yet their performance in English significantly outperforms that in low-resource languages, raising fairness concerns in multilingual applications. Existing approaches either rely on costly multilingual training or employ prompting with external translation tools, both of which are resource-intensive and sensitive to translation quality. To address these limitations, we propose a training-free inference-time method to enhance Multilingual Reasoning capabilities via Representation Engineering (MRRE) without using any additional training data or tools. MRRE sequentially injects two precomputed vectors at specific layers during inference processing: cross-lingual reasoning enhancement vectors, which steer non-English reasoning representations toward English space to unlock multilingual reasoning, and target-language output anchoring vectors, which restore the distribution of the target language to preserve input-output language consistency. Comprehensive experiments across six advanced LLMs and LVLMs on four reasoning benchmarks demonstrate that MRRE consistently enhances non-English reasoning by an average gain of 5.48% and up to 7.54% in low-resource languages (Thai and Swahili), while improving input-output language consistency by 3.78%.
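The sequential injection described above can be pictured with a toy forward pass in which a vector is added to the hidden state after chosen layers. Everything specific in this sketch is an assumption: the layer indices, the vector values, the stand-in layers, and the function names are hypothetical, and the abstract does not say how MRRE computes its two vectors.

```python
def add_vectors(hidden, delta):
    """Element-wise sum of a hidden state and a steering vector."""
    return [h + d for h, d in zip(hidden, delta)]


def forward_with_steering(hidden, layers, injections):
    """Run `layers` in order, adding injections[i] after layer i if present."""
    for index, layer in enumerate(layers):
        hidden = layer(hidden)
        if index in injections:
            hidden = add_vectors(hidden, injections[index])
    return hidden


def double(hidden):
    """Stand-in for a transformer layer: scales every feature by two."""
    return [2.0 * value for value in hidden]


layers = [double, double, double]
injections = {
    0: [1.0, 0.0],   # hypothetical reasoning-enhancement vector, early layer
    2: [0.0, -1.0],  # hypothetical output-anchoring vector, late layer
}
steered = forward_with_steering([1.0, 1.0], layers, injections)
plain = forward_with_steering([1.0, 1.0], layers, {})
```

Because the vectors are precomputed and simply added at inference time, no model weights change, which is what makes this style of method training-free.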
[84] UniGeoSeg: Towards Unified Open-World Segmentation for Geospatial Scenes
Shuo Ni, Di Wang, He Chen, Haonan Guo, Ning Zhang, Jing Zhang
🧩 TL;DR
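This paper introduces GeoSeg-1M, the first million-scale dataset for remote sensing instruction-driven segmentation, and builds the unified UniGeoSeg framework, which achieves state-of-the-art performance on remote sensing instruction segmentation tasks along with strong zero-shot generalization.
📘 Detailed Summary
Motivation: Existing remote sensing instruction-driven segmentation methods suffer from fragmented task formulations and limited instruction data, which hinder effective understanding and generalization. A large-scale dataset and a unified framework are needed to address these challenges.
Method: The authors propose an automatic mask filtering and instruction generation pipeline that synthesizes referring, interactive, and reasoning segmentation instructions from multiple public datasets, producing GeoSeg-1M with 590K images and 1.1M triplets. They further design UniGeoSeg, a unified framework incorporating task-aware text enhancement, latent knowledge memory, and a progressive training strategy to facilitate multi-task learning.
Result: UniGeoSeg achieves state-of-the-art performance on GeoSeg-Bench and multiple public benchmarks while showing strong zero-shot generalization, supporting the effectiveness of the large-scale dataset and unified framework.
Conclusion: By building a large-scale dataset and a unified framework, the study advances remote sensing instruction segmentation, provides a benchmark for evaluating contextual understanding and reasoning in complex geospatial scenes, and lays a foundation for general-purpose remote sensing vision systems.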
📄 Abstract
Instruction-driven segmentation in remote sensing generates masks from guidance, offering great potential for accessible and generalizable applications. However, existing methods suffer from fragmented task formulations and limited instruction data, hindering effective understanding and generalization. To address these issues, we introduce GeoSeg-1M, the first million-scale dataset for remote sensing instruction-driven segmentation, constructed via an automatic mask filtering and instruction generation pipeline that synthesizes referring, interactive, and reasoning segmentation instructions from multiple public datasets. GeoSeg-1M contains 590K images, 117 categories, and 1.1M image-mask-instruction triplets. Building upon this foundation, we further curate GeoSeg-Bench, a challenging benchmark designed to evaluate contextual understanding and reasoning capabilities across diverse instruction-driven tasks and complex geospatial scenes. Furthermore, we present UniGeoSeg, a unified framework that serves as a strong baseline, incorporating task-aware text enhancement, latent knowledge memory, and a progressive training strategy to facilitate multi-task learning. Extensive experiments demonstrate the state-of-the-art performance of UniGeoSeg across GeoSeg-Bench and diverse public benchmarks, while exhibiting strong zero-shot generalization. Datasets and source code were released at https://github.com/MiliLab/UniGeoSeg.
[85] DEAL-300K: Diffusion-based Editing Area Localization with a 300K-Scale Dataset and Frequency-Prompted Baseline
Rui Zhang, Hongxia Wang, Hangqing Liu, Yang Zhou, Qiang Zeng
🧩 TL;DR
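This paper presents DEAL-300K, a large-scale benchmark for diffusion-based editing localization with more than 300,000 annotated images, and develops a localization framework based on a visual foundation model and multi-frequency prompt tuning, providing a practical foundation for detecting diffusion-edited regions.
📘 Detailed Summary
Motivation: Diffusion-based image editing enables semantic-level image manipulation, but it also produces realistic local forgeries that are hard to localize. Existing benchmarks focus mainly on binary detection of generated images or localization of manually edited regions and do not reflect how diffusion edits blend smoothly into the original content, so a large-scale dataset dedicated to diffusion-edit localization is missing.
Method: DEAL-300K is built using a multimodal large language model to generate editing instructions, a mask-free diffusion editor to produce manipulated images, and an active-learning change detection pipeline to obtain pixel-level annotations. On top of it, the authors propose a localization framework that combines a frozen visual foundation model with multi-frequency prompt tuning to capture both semantic and frequency-domain cues of edited regions.
Result: Trained on DEAL-300K, the method reaches a pixel-level F1 score of 82.56% on the test split and 80.97% on the external CoCoGlide benchmark, providing a strong baseline for diffusion-edit localization research.
Conclusion: The study fills the dataset gap in diffusion-edit localization. The proposed framework captures semantic and frequency-domain cues of edited regions, giving future DIML research a practical dataset and baseline and helping improve the detection of diffusion-forged content.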
📄 Abstract
Diffusion-based image editing has made semantic-level image manipulation easy for general users, but it also enables realistic local forgeries that are hard to localize. Existing benchmarks mainly focus on the binary detection of generated images or the localization of manually edited regions and do not reflect the properties of diffusion-based edits, which often blend smoothly into the original content. We present the Diffusion-Based Image Editing Area Localization Dataset (DEAL-300K), a large-scale dataset for diffusion-based image manipulation localization (DIML) with more than 300,000 annotated images. We build DEAL-300K by using a multi-modal large language model to generate editing instructions, a mask-free diffusion editor to produce manipulated images, and an active-learning change detection pipeline to obtain pixel-level annotations. On top of this dataset, we propose a localization framework that uses a frozen Visual Foundation Model (VFM) together with Multi Frequency Prompt Tuning (MFPT) to capture both semantic and frequency-domain cues of edited regions. Trained on DEAL-300K, our method reaches a pixel-level F1 score of 82.56% on our test split and 80.97% on the external CoCoGlide benchmark, providing strong baselines and a practical foundation for future DIML research. The dataset can be accessed via https://github.com/ymhzyj/DEAL-300K.
[86] VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction
Sinan Du, Jiahao Guo, Bo Li, Shuhao Cui, Zhengzhuo Xu, Yifu Luo, Yongxian Wei, Kun Gai, Xinggang Wang, Kai Wu, Chun Yuan
🧩 TL;DR
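This paper proposes VQRAE, a vector-quantized representation autoencoder that is the first to explore producing, within a unified tokenizer, both continuous semantic features for image understanding and discrete tokens for visual generation, addressing the key challenge of unifying representations for multimodal understanding, generation, and reconstruction.
📘 Detailed Summary
Motivation: A key challenge in building unified models is unifying the representations for multimodal understanding, generation, and reconstruction in a single tokenizer. Prior work mostly follows a dual-encoder paradigm, for example using separate encoders for understanding and generation, or balancing semantic representations and low-level features with a contrastive loss, and lacks a truly unified representation.
Method: VQRAE builds on pretrained vision foundation models with a symmetric ViT decoder and a two-stage training strategy. The first stage freezes the encoder and learns a high-dimensional semantic VQ codebook with a pixel reconstruction objective; the second stage jointly optimizes the encoder under self-distillation constraints. The resulting tokenizer produces continuous semantic features for understanding and discrete tokens for generation.
Result: VQRAE is competitive on several visual understanding, generation, and reconstruction benchmarks. Its semantic VQ codebook reaches 100% utilization at a dimension of 1536 and shows good scaling behavior in the autoregressive paradigm. The method retains multimodal understanding ability while supporting fine-grained reconstruction and generation compatibility.
Conclusion: The study reveals the importance of high-dimensional codebooks when quantizing semantic encoders, in contrast to the low-dimensional codebooks commonly used in image reconstruction. VQRAE offers a new paradigm for building unified multimodal models, and its discrete tokens are well suited to autoregressive generation.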
📄 Abstract
Unifying the representations for multimodal understanding, generation, and reconstruction in a single tokenizer remains a key challenge in building unified models. Previous research predominantly attempts to address this in a dual-encoder paradigm, e.g., using separate encoders for understanding and generation, or balancing semantic representations and low-level features with a contrastive loss. In this paper, we propose VQRAE, a Vector Quantization version of Representation AutoEncoders, which is the first to explore a unified representation that produces continuous semantic features for image understanding and discrete tokens for visual generation within a single tokenizer. Specifically, we build upon pretrained vision foundation models with a symmetric ViT decoder and adopt a two-stage training strategy: the first stage freezes the encoder and learns a high-dimensional semantic VQ codebook with a pixel reconstruction objective; the second stage jointly optimizes the encoder with self-distillation constraints. This design incurs negligible loss of semantic information, which maintains the ability of multimodal understanding, while yielding discrete tokens that are compatible with generation and fine-grained reconstruction. In addition, we identify an intriguing property: quantizing semantic encoders relies on a high-dimensional codebook, in contrast to the previous common practice of low-dimensional codebooks in image reconstruction. The semantic VQ codebook can achieve a 100% utilization ratio at a dimension of 1536. VQRAE presents competitive performance on several benchmarks of visual understanding, generation and reconstruction, with a promising scaling property in the autoregressive paradigm owing to its discrete nature.
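For readers unfamiliar with the two quantities this abstract relies on, the sketch below shows generic vector quantization (each feature is replaced by the index of its nearest codebook entry) and the codebook utilization ratio (the fraction of entries that are ever selected). This is standard VQ, not VQRAE's training procedure; the toy 2-dimensional codebook and the function names are assumptions.

```python
def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    indices = []
    for feature in features:
        distances = [
            sum((a - b) ** 2 for a, b in zip(feature, code)) for code in codebook
        ]
        indices.append(distances.index(min(distances)))
    return indices


def utilization_ratio(indices, codebook_size):
    """Fraction of codebook entries selected at least once."""
    return len(set(indices)) / codebook_size


codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
features = [[0.1, 0.0], [0.9, 1.2], [0.2, 0.1]]

tokens = quantize(features, codebook)
usage = utilization_ratio(tokens, len(codebook))
```

In this toy case the third entry is never selected, so utilization is 2/3; the 100% figure reported in the abstract means every entry of the codebook is in use.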
[87] MANTA: Physics-Informed Generalized Underwater Object Tracking
Suhas Srinath, Hemang Jamadagni, Aditya Chadrasekar, Prathosh AP
🧩 TL;DR
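This paper proposes MANTA, a physics-informed framework for underwater object tracking that couples representation learning with tracking design. Using a dual-positive contrastive learning strategy and a physics-informed secondary association algorithm, it improves the robustness and accuracy of underwater tracking.
📘 Detailed Summary
Motivation: Underwater object tracking suffers from physical degradations such as wavelength-dependent attenuation and scattering, which severely distort target appearance across depths and water conditions. Trackers trained on terrestrial data generalize poorly to these physics-driven degradations, so a tracking framework designed specifically for underwater environments is needed.
Method: MANTA uses a dual-positive contrastive learning strategy that couples temporal consistency with Beer-Lambert augmentations, yielding features robust to both temporal and underwater distortions. It further introduces a multi-stage pipeline that augments motion-based tracking with a physics-informed secondary association algorithm, integrating geometric consistency and appearance similarity to handle occlusion and drift. It also proposes Center-Scale Consistency and Geometric Alignment Score as metrics of geometric fidelity.
Result: Experiments on four underwater benchmarks (WebUOT-1M, UOT32, UTB180, UWCOT220) show that MANTA achieves state-of-the-art performance, improving Success AUC by up to 6%, while ensuring stable long-term generalized underwater tracking and efficient runtime.
Conclusion: The study indicates that integrating physics-informed mechanisms into a tracking framework can address the distinctive challenges of underwater environments. The dual-positive contrastive strategy and physics-informed association algorithm offer a new direction for underwater vision tasks, and the proposed geometric metrics give a more complete measure of underwater tracking performance.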
📄 Abstract
Underwater object tracking is challenging due to wavelength dependent attenuation and scattering, which severely distort appearance across depths and water conditions. Existing trackers trained on terrestrial data fail to generalize to these physics-driven degradations. We present MANTA, a physics-informed framework integrating representation learning with tracking design for underwater scenarios. We propose a dual-positive contrastive learning strategy coupling temporal consistency with Beer-Lambert augmentations to yield features robust to both temporal and underwater distortions. We further introduce a multi-stage pipeline augmenting motion-based tracking with a physics-informed secondary association algorithm that integrates geometric consistency and appearance similarity for re-identification under occlusion and drift. To complement standard IoU metrics, we propose Center-Scale Consistency (CSC) and Geometric Alignment Score (GAS) to assess geometric fidelity. Experiments on four underwater benchmarks (WebUOT-1M, UOT32, UTB180, UWCOT220) show that MANTA achieves state-of-the-art performance, improving Success AUC by up to 6 percent, while ensuring stable long-term generalized underwater tracking and efficient runtime.
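A Beer-Lambert augmentation of the kind mentioned above attenuates each color channel exponentially with distance, I = I0 · exp(−β · d), with a channel-specific coefficient β. The sketch below applies that law to a single RGB pixel. The coefficient values are made up for illustration (chosen only so that red fades fastest, as it does in water), and the function name is hypothetical; MANTA's actual augmentation parameters are not given in the abstract.

```python
import math


def beer_lambert_attenuate(pixel_rgb, depth_m, beta=(0.6, 0.1, 0.05)):
    """Attenuate an RGB pixel over `depth_m` metres of water.

    `beta` holds per-channel attenuation coefficients in 1/m. The defaults
    are illustrative values, not measured water properties.
    """
    return tuple(
        channel * math.exp(-coefficient * depth_m)
        for channel, coefficient in zip(pixel_rgb, beta)
    )


white = (1.0, 1.0, 1.0)
at_surface = beer_lambert_attenuate(white, 0.0)
at_five_metres = beer_lambert_attenuate(white, 5.0)
```

Applying the same transform at several simulated depths to one frame gives physically motivated positive pairs for contrastive learning, which is the role the abstract assigns to these augmentations.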
[88] DisMo: Disentangled Motion Representations for Open-World Motion Transfer
Thomas Ressler-Antal, Frank Fundel, Malek Ben Alaya, Stefan Andreas Baumann, Felix Krause, Ming Gui, Björn Ommer
🧩 TL;DR
本文提出DisMo,一种通过图像空间重建目标从原始视频数据中学习抽象运动表示的新范式,该表示与静态信息解耦,支持开放世界运动迁移,并能与现有视频生成器结合。
📘 Detailed Summary
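Motivation: Current text-to-video and image-to-video models lack an explicit representation that separates motion from content, which limits their use in content creation. Existing methods trade off motion fidelity against prompt adherence, tending either to overfit to the source structure or to drift from the described action, so a generic representation that disentangles motion semantics from appearance is needed.
Method: DisMo learns abstract motion representations directly from raw video data via an image-space reconstruction objective. The representation is independent of static information such as appearance, object identity, or pose, and supports open-world motion transfer without requiring object correspondences. Lightweight adapters allow it to be combined with any existing video generator and to benefit from future advances in video models.
Result: The method proves effective across a diverse set of motion transfer tasks, and the learned representations perform well on downstream motion understanding tasks: on benchmarks including Something-Something v2 and Jester, zero-shot action classification consistently outperforms state-of-the-art video representation models such as V-JEPA.
Conclusion: DisMo offers a generic motion representation that disentangles motion semantics from appearance and supports accurate motion transfer across semantically unrelated entities. The framework is flexible, integrates with existing video generators, opens new possibilities for content creation and motion understanding, and demonstrates the practical value of the learned representations in downstream tasks.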
📄 Abstract
Recent advances in text-to-video (T2V) and image-to-video (I2V) models have enabled the creation of visually compelling and dynamic videos from simple textual descriptions or initial frames. However, these models often fail to provide an explicit representation of motion separate from content, limiting their applicability for content creators. To address this gap, we propose DisMo, a novel paradigm for learning abstract motion representations directly from raw video data via an image-space reconstruction objective. Our representation is generic and independent of static information such as appearance, object identity, or pose. This enables open-world motion transfer, allowing motion to be transferred across semantically unrelated entities without requiring object correspondences, even between vastly different categories. Unlike prior methods, which trade off motion fidelity and prompt adherence and either overfit to source structure or drift from the described action, our approach disentangles motion semantics from appearance, enabling accurate transfer and faithful conditioning. Furthermore, our motion representation can be combined with any existing video generator via lightweight adapters, allowing us to effortlessly benefit from future advancements in video models. We demonstrate the effectiveness of our method through a diverse set of motion transfer tasks. Finally, we show that the learned representations are well-suited for downstream motion understanding tasks, consistently outperforming state-of-the-art video representation models such as V-JEPA in zero-shot action classification on benchmarks including Something-Something v2 and Jester. Project page: https://compvis.github.io/DisMo
[89] Visual Generation Tuning
Jiahao Guo, Sinan Du, Jingfeng Yao, Wenyu Liu, Bo Li, Haoxiang Cao, Kun Gai, Chun Yuan, Kai Wu, Xinggang Wang
🧩 TL;DR
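This paper proposes VGT (Visual Generation Tuning), a paradigm that efficiently tunes pretrained vision-language models to unlock their visual generation potential. It significantly reduces alignment costs, accelerates continuous-space autoregressive modeling, and opens a new path toward next-generation unified multimodal foundation models.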
📘 Detailed Summary
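Motivation: Large vision-language models acquire sophisticated, language-aligned visual representations through extensive pretraining, but whether these representations, optimized for multimodal understanding, also hold potential for visual generation remains underexplored. This work aims to stimulate the latent visual generation capability of any vision-language model.
Method: VGT discards the entangled pixel-level VAEs designed for diffusion transformers and builds VGT-AE by aligning the semantic encoder of a pretrained VLM with the latent representations of a pixel decoder. This enables efficient visual generation tuning, significantly reduces alignment costs, and speeds up the convergence of continuous-space autoregressive modeling by 20x.
Result: In image reconstruction, VGT reaches 26.67 PSNR and 0.50 rFID at a 28x compression ratio, outperforming specialized VAEs. In visual generation, it achieves state-of-the-art results among autoregressive models, with 0.77 on GenEval and 78.73 on DPG-Bench, and shows significant scaling promise.
Conclusion: VGT opens a new avenue toward next-generation unified multimodal foundation models. It can endow any VLM trained for multimodal understanding with visual generation capability, showing the potential of unifying understanding and generation in a single model.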
📄 Abstract
Large Vision Language Models (VLMs) effectively bridge the modality gap through extensive pretraining, acquiring sophisticated visual representations aligned with language. However, it remains underexplored whether these representations, optimized for multimodal understanding tasks, harbor an inherent potential for visual generation. In this paper, we propose VGT, Visual Generation Tuning, a novel paradigm designed to stimulate the underlying visual generation capabilities of any vision-language model. By performing efficient visual generation tuning on well-pretrained VLMs, we significantly mitigate the alignment costs and accelerate the convergence of autoregressive modeling in the continuous space (20x speedup). Specifically, we dismiss the entangled pixel-level VAEs designed for diffusion transformers and formulate VGT-AE by aligning the semantic encoders from pretrained VLMs with the latent representations of pixel decoders. In image reconstruction tasks, we achieve 26.67 PSNR and 0.50 rFID at a 28x compression ratio, outperforming specialized VAEs; in visual generation tasks, we achieve state-of-the-art outcomes among autoregressive models, 0.77 on GenEval and 78.73 on DPG-Bench. Furthermore, our proposed VGT shows significant scaling promise and is versatile enough to endow any VLM trained for multimodal understanding with visual generation capabilities, which paves a new avenue for exploring next-generation unified multimodal foundation models. Models and code are available at https://github.com/hustvl/VGT.
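For readers unfamiliar with the reconstruction metric quoted above, PSNR is defined as 10·log10(MAX² / MSE). The sketch below is a generic implementation of that standard formula, not code from the VGT repository; the function name and the default 8-bit peak value of 255 are assumptions.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized flat pixel sequences."""
    if len(reference) != len(reconstruction) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error over all pixel values.
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Under the 8-bit assumption, a PSNR of 26.67 dB corresponds to a root-mean-square error of roughly 12 grey levels per pixel. rFID is a distribution-level metric that requires a pretrained feature extractor and is not sketched here.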
[90] AnyTalker: Scaling Multi-Person Talking Video Generation with Interactivity Refinement
Zhizhou Zhong, Yicheng Ji, Zhe Kong, Yiying Liu, Jiarui Wang, Jiasun Feng, Lupeng Liu, Xiangyi Wang, Yanjia Li, Yuqing She, Ying Qin, Huan Li, Shuiyang Mao, Wei Liu, Wenhan Luo
🧩 TL;DR
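AnyTalker is a multi-person generation framework that uses an extensible multi-stream processing architecture and an identity-aware attention mechanism to generate audio-driven multi-person talking videos, striking a favorable balance between data cost and identity scalability.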
📘 Detailed Summary
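Motivation: Audio-driven multi-person talking video generation faces two main challenges: collecting diverse multi-person data is expensive, and driving multiple identities with coherent interactivity is difficult. Existing methods handle these challenges poorly, so a more efficient and scalable solution is needed.
Method: AnyTalker adopts an extensible multi-stream processing architecture. It extends the attention block of the Diffusion Transformer with a novel identity-aware attention mechanism that iteratively processes identity-audio pairs, allowing an arbitrary number of drivable identities. The training pipeline relies solely on single-person videos to learn multi-person speaking patterns and uses only a few real multi-person clips to refine interactivity.
Result: Extensive experiments show that AnyTalker achieves strong lip synchronization, visual quality, and natural interactivity. The work also contributes a targeted metric and dataset for evaluating the naturalness and interactivity of generated multi-person videos.
Conclusion: The study shows that, with a suitable architecture and efficient use of data, high-quality multi-person video generation is possible without large-scale multi-person datasets. The identity-aware attention mechanism offers a scalable solution for multi-person generation and a useful reference for future work on multi-person interactive video generation.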
📄 Abstract
Recently, multi-person video generation has started to gain prominence. While a few preliminary works have explored audio-driven multi-person talking video generation, they often face challenges due to the high costs of diverse multi-person data collection and the difficulty of driving multiple identities with coherent interactivity. To address these challenges, we propose AnyTalker, a multi-person generation framework that features an extensible multi-stream processing architecture. Specifically, we extend Diffusion Transformer's attention block with a novel identity-aware attention mechanism that iteratively processes identity-audio pairs, allowing arbitrary scaling of drivable identities. Besides, training multi-person generative models demands massive multi-person data. Our proposed training pipeline depends solely on single-person videos to learn multi-person speaking patterns and refines interactivity with only a few real multi-person clips. Furthermore, we contribute a targeted metric and dataset designed to evaluate the naturalness and interactivity of the generated multi-person videos. Extensive experiments demonstrate that AnyTalker achieves remarkable lip synchronization, visual quality, and natural interactivity, striking a favorable balance between data costs and identity scalability.
[91] Video-CoM: Interactive Video Reasoning via Chain of Manipulations
Hanoona Rasheed, Mohammed Zumri, Muhammad Maaz, Ming-Hsuan Yang, Fahad Shahbaz Khan, Salman Khan
🧩 TL;DR
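This paper proposes Video-CoM, a new paradigm for interactive video reasoning in which a Chain of Manipulations lets the model actively interact with the video while reasoning. It improves average performance by 3.6% across nine video reasoning benchmarks while requiring only a small amount of training data.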
📘 Detailed Summary
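Motivation: Existing multimodal large language models take a passive approach to video understanding, treating visual input as static context. This creates a semantic bottleneck: models cannot rewatch, refocus, or verify evidence, so they manage only shallow visual reasoning on tasks that require fine-grained spatio-temporal understanding.
Method: The paper proposes an interactive video reasoning paradigm that turns video into an active cognitive workspace. The Video-CoM model performs iterative visual actions through a Chain of Manipulations to gather and refine evidence. The authors construct Video-CoM-Instruct, an instruction-tuning dataset of 18K samples, and optimize the manipulation policy with reinforcement learning, using reasoning-aware Group Relative Policy Optimization (GRPO) with step-level reasoning rewards.
Result: Video-CoM achieves strong results on nine video reasoning benchmarks, improving average performance by 3.6% over recent state-of-the-art models. It is trained on only 25K SFT and 3K GRPO video samples, significantly fewer than comparable large-scale models. Ablation studies show that reasoning-aware rewards improve both accuracy and interpretability.
Conclusion: The interactive video reasoning paradigm moves beyond the limits of passive video understanding by letting models interact with visual content for deeper reasoning. Combining the Chain of Manipulations with reasoning-aware rewards offers a new approach to fine-grained spatio-temporal understanding, and the framework's data efficiency opens a new direction for video understanding research.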
📄 Abstract
Recent multimodal large language models (MLLMs) have advanced video understanding, yet most still "think about videos", i.e., once a video is encoded, reasoning unfolds entirely in text, treating visual input as a static context. This passive paradigm creates a semantic bottleneck: models cannot rewatch, refocus, or verify evidence, leading to shallow visual reasoning on tasks requiring fine-grained spatio-temporal understanding. In this work, we introduce Interactive Video Reasoning, a new paradigm that transforms video into an active cognitive workspace, enabling models to "think with videos". Our model, Video-CoM, reasons through a Chain of Manipulations (CoM), performing iterative visual actions to gather and refine evidence. To support this behavior, we construct Video-CoM-Instruct, an 18K instruction-tuning dataset curated for multi-step manipulation reasoning. Beyond supervised learning, we further optimize the manipulation policy via reinforcement learning with reasoning-aware Group Relative Policy Optimization (GRPO). Unlike prior work that relies solely on sparse answer rewards, our method introduces step-level reasoning rewards, guiding the model toward grounded and consistent reasoning. Video-CoM achieves strong results across nine video reasoning benchmarks, improving average performance by 3.6 percent over recent state-of-the-art models, while training on only 25K SFT and 3K GRPO video samples, significantly fewer than comparable large-scale models. Ablation studies demonstrate that reasoning-aware rewards improve both accuracy and interpretability. Code: https://github.com/mbzuai-oryx/Video-CoM
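The core of GRPO is that each rollout's reward is normalized against the other rollouts sampled for the same prompt, so no learned value function is needed. The sketch below shows only that normalization step. The way the sparse answer reward and the step-level reasoning reward are mixed (`combined_reward` and its `step_weight`) is a hypothetical choice for illustration; the abstract does not specify how Video-CoM combines them.

```python
from statistics import mean, pstdev


def combined_reward(answer_correct, step_scores, step_weight=0.5):
    """Hypothetical mix of a sparse answer reward and a dense step-level reward."""
    answer_reward = 1.0 if answer_correct else 0.0
    step_reward = mean(step_scores) if step_scores else 0.0
    return answer_reward + step_weight * step_reward


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps keeps the division defined when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Rollouts that beat the group average receive a positive advantage and are reinforced; rollouts below it are pushed down. A dense step-level term matters here because, with answer-only rewards, a group in which every rollout is wrong (or every one is right) yields all-zero advantages and no learning signal.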
[92] Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models
Muhammad Maaz, Hanoona Rasheed, Fahad Shahbaz Khan, Salman Khan
🧩 TL;DR
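This paper proposes Video-R2, a model that uses reinforcement learning to strengthen temporal alignment and logical consistency in video reasoning, addressing the logical inconsistency and weak visual grounding of multimodal large language models when reasoning over dynamic visual content.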
📘 Detailed Summary
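Motivation: Current multimodal large language models have a serious problem when reasoning over dynamic visual content: the reasoning traces they generate often look plausible but are logically inconsistent or poorly supported by visual evidence. The study formalizes these issues with two diagnostic metrics, Think Answer Consistency (TAC) and Video Attention Score (VAS), and finds that existing models rely heavily on linguistic priors rather than visual content.
Method: The paper proposes a reinforcement learning framework that combines timestamp-aware supervised fine-tuning with Group Relative Policy Optimization (GRPO) guided by a novel Temporal Alignment Reward (TAR). This two-stage post-training strategy steers the model toward temporally aligned and causally coherent video reasoning, strengthening visual grounding and logical consistency.
Result: Across 11 video reasoning benchmarks, Video-R2 shows marked improvements in TAC, VAS, and accuracy. The results indicate that better temporal alignment and reasoning consistency translate directly into more accurate and trustworthy video understanding.
Conclusion: The study shows that optimizing temporal alignment and reasoning consistency through reinforcement learning effectively improves the accuracy and trustworthiness of video reasoning. The method points to a new optimization direction for video understanding in multimodal large language models and underlines the importance of visual grounding and logical consistency. The code, dataset, and model will be open-sourced.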
📄 Abstract
Reasoning over dynamic visual content remains a central challenge for multimodal large language models. Recent thinking models generate explicit reasoning traces for interpretability; however, their reasoning often appears convincing while being logically inconsistent or weakly grounded in visual evidence. We identify and formalize these issues through two diagnostic metrics: Think Answer Consistency (TAC), which measures the alignment between reasoning and answers, and Video Attention Score (VAS), which captures the extent to which reasoning depends on visual versus textual cues. Analysis across 11 video reasoning benchmarks shows that current models rely heavily on linguistic priors rather than visual content. To address this, we propose a reinforcement learning approach that enhances both temporal precision and reasoning consistency. Our approach combines timestamp-aware supervised fine-tuning with Group Relative Policy Optimization (GRPO) guided by a novel Temporal Alignment Reward (TAR). This dual-step post-training stage encourages temporally aligned and causally coherent video reasoning. The resulting model, Video-R2, achieves consistently higher TAC, VAS, and accuracy across multiple benchmarks, demonstrating that improvements in temporal alignment and reasoning coherence lead to more accurate and trustworthy video understanding. Our code, dataset, and model will be open-sourced.
cs.CL [Back]
[93] Insight-A: Attribution-aware for Multimodal Misinformation Detection
Junjie Wu, Yumeng Fu, Chen Gong, Guohong Fu
🧩 TL;DR
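This paper proposes Insight-A, a framework that explores the attribution capability of multimodal large language models (MLLMs) to detect AIGC-generated multimodal misinformation. It not only identifies false content but also attributes it to specific forgery sources.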
📘 Detailed Summary
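Motivation: AIGC technology has become a major source of multimodal misinformation on social media, posing a serious threat to societal safety. Existing approaches detect misinformation by applying standard prompting to MLLMs but ignore attribution, so they cannot trace forgery sources or generation patterns.
Method: Insight-A uses a hierarchical reasoning pipeline with two core components. Cross-attribution prompting (CAP) models the complex correlations between perception and reasoning and attributes misinformation to forgery traces based on generation patterns. Automatic attribution-debiased prompting (ADP) reduces the subjectivity of human-annotated prompts and adapts MLLMs to the task. An image captioning (IC) module additionally extracts visual details to strengthen cross-modal consistency checking.
Result: Extensive experiments demonstrate the superiority of the method in detecting AIGC-generated multimodal misinformation. The framework offers a new paradigm for misinformation detection, with notable progress in attribution analysis and cross-modal consistency verification.
Conclusion: The study offers a new paradigm for multimodal misinformation detection in the AIGC era and stresses the importance of attribution analysis. By combining detection with forgery-source tracing, Insight-A improves detection accuracy and provides a new perspective on how misinformation is generated and spread.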
📄 Abstract
AI-generated content (AIGC) technology has emerged as a prevalent means of creating multimodal misinformation on social media platforms, posing unprecedented threats to societal safety. However, standard prompting, which leverages multimodal large language models (MLLMs) to identify emerging misinformation, ignores misinformation attribution. To this end, we present Insight-A, which explores attribution with MLLM insights for detecting multimodal misinformation. Insight-A makes two contributions: (I) it attributes misinformation to forgery sources, and (II) it provides an effective pipeline with hierarchical reasoning that detects distortions across modalities. Specifically, to attribute misinformation to forgery traces based on generation patterns, we devise cross-attribution prompting (CAP) to model the sophisticated correlations between perception and reasoning. Meanwhile, to reduce the subjectivity of human-annotated prompts, automatic attribution-debiased prompting (ADP) is used for task adaptation on MLLMs. Additionally, we design image captioning (IC) to extract visual details for enhancing cross-modal consistency checking. Extensive experiments demonstrate the superiority of our proposal and provide a new paradigm for multimodal misinformation detection in the era of AIGC.
[94] An Optimized Machine Learning Classifier for Detecting Fake Reviews Using Extracted Features
Shabbir Anees, Anshuman, Ayush Chaurasia, Prathmesh Bogar
🧩 TL;DR
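This study proposes an advanced machine-learning system for detecting AI-generated fake reviews with high accuracy. By combining text preprocessing, multi-modal feature extraction, Harris Hawks Optimization for feature selection, and a stacking ensemble classifier, it reaches 95.40% accuracy on a public dataset.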
📘 Detailed Summary
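Motivation: Fake reviews undermine the credibility of online shopping platforms. AI-generated reviews now appear mixed in with human-written ones, making it hard for consumers to tell them apart, and this new kind of computer-generated review challenges existing detection methods.
Method: The method applies advanced text preprocessing, extracts rich features from reviews through multi-modal feature extraction, uses Harris Hawks Optimization (HHO) for feature selection to reduce dimensionality, and builds a stacking ensemble classifier for the final decision.
Result: On a public dataset of 40,432 original and computer-generated reviews, HHO selected the 1,368 most relevant features from an initial 13,539, an 89.9% dimensionality reduction. The final stacking model achieved 95.40% accuracy, 92.81% precision, 95.01% recall, and a 93.90% F1 score.
Conclusion: The study shows that combining ensemble learning with bio-inspired optimization is an effective way to detect machine-generated text. Because large-scale review analytics usually run on cloud platforms, privacy-preserving techniques such as differential privacy and secure outsourcing are needed to protect user data.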
📄 Abstract
It is well known that fraudulent reviews cast doubt on the legitimacy and dependability of online purchases. The most recent development that leaves customers in the dark is the appearance of computer-generated (CG) reviews alongside human-written ones. In this work, we present an advanced machine-learning-based system that analyses reviews produced by AI with remarkable precision. Our method integrates advanced text preprocessing, multi-modal feature extraction, Harris Hawks Optimization (HHO) for feature selection, and a stacking ensemble classifier. We implemented this methodology on a public dataset of 40,432 Original (OR) and Computer-Generated (CG) reviews. From an initial set of 13,539 features, HHO selected the 1,368 most relevant features, achieving an 89.9% dimensionality reduction. Our final stacking model achieved 95.40% accuracy, 92.81% precision, 95.01% recall, and a 93.90% F1-Score, which demonstrates that the combination of ensemble learning and bio-inspired optimisation is an effective method for machine-generated text recognition. Because large-scale review analytics commonly run on cloud platforms, privacy-preserving techniques such as differential approaches and secure outsourcing are essential to protect user data in these systems.
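The reported figures can be checked for internal consistency with two standard formulas: F1 is the harmonic mean of precision and recall, and the dimensionality reduction is one minus the ratio of selected to initial features. The helper names below are illustrative only and are not from the paper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both as fractions in [0, 1])."""
    return 2 * precision * recall / (precision + recall)


def dimensionality_reduction(n_before, n_after):
    """Fraction of features removed by feature selection."""
    return 1 - n_after / n_before


# Values taken from the abstract above.
reported_f1 = 100 * f1_score(0.9281, 0.9501)
reported_reduction = 100 * dimensionality_reduction(13539, 1368)
```

With the abstract's numbers, `reported_f1` rounds to 93.90 and `reported_reduction` rounds to 89.9, matching the stated F1-Score and dimensionality reduction.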
[95] CrossCheck-Bench: Diagnosing Compositional Failures in Multimodal Conflict Resolution
Baoliang Tian, Yuxuan Si, Jilong Wang, Lingyao Li, Zhongyuan Bao, Zineng Zhou, Tao Wang, Sixu Li, Ziyao Xu, Mingze Wang, Zhouzhuo Zhang, Zhihao Wang, Yike Yun, Ke Tian, Ning Yang, Minghui Qiu
🧩 TL;DR
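This paper presents CrossCheck-Bench, a diagnostic benchmark for evaluating how well multimodal large language models detect and resolve cross-modal inconsistencies. It reveals a marked performance drop in existing models on logical contradiction detection.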
📘 Detailed Summary
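Motivation: Multimodal large language models are trained and evaluated mainly on aligned image-text pairs, so their ability to detect and resolve real-world cross-modal inconsistencies remains largely unexplored. In open-domain applications, visual and textual cues often conflict, requiring structured reasoning that goes beyond surface-level alignment.
Method: The study introduces CrossCheck-Bench, a diagnostic benchmark with a hierarchical task framework that covers three levels of reasoning complexity and defines seven atomic capabilities needed to resolve cross-modal inconsistencies. The benchmark contains 15k question-answer pairs sourced from real-world artifacts with synthetically injected contradictions. It was built through a multi-stage annotation pipeline involving more than 450 expert hours to ensure semantic validity and calibrated difficulty.
Result: Evaluating 13 state-of-the-art vision-language models reveals a consistent performance drop as tasks shift from perceptual matching to logical contradiction detection. Most models do well on isolated entity recognition but fail when multiple clues must be synthesized for conflict reasoning. Capability-level analysis further reveals uneven skill acquisition, especially on tasks requiring multi-step inference or rule-based validation. Conventional prompting strategies such as Chain-of-Thought and Set-of-Mark yield only marginal gains, whereas methods that interleave symbolic reasoning with grounded visual processing achieve more stable improvements.
Conclusion: The results highlight a persistent bottleneck in multimodal reasoning and show that existing models fall well short in cross-modal verification. They point to new directions for building models capable of robust cross-modal verification, in particular methods that effectively integrate symbolic reasoning with grounded visual processing.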
📄 Abstract
Multimodal Large Language Models are primarily trained and evaluated on aligned image-text pairs, which leaves their ability to detect and resolve real-world inconsistencies largely unexplored. In open-domain applications visual and textual cues often conflict, requiring models to perform structured reasoning beyond surface-level alignment. We introduce CrossCheck-Bench, a diagnostic benchmark for evaluating contradiction detection in multimodal inputs. The benchmark adopts a hierarchical task framework covering three levels of reasoning complexity and defines seven atomic capabilities essential for resolving cross-modal inconsistencies. CrossCheck-Bench includes 15k question-answer pairs sourced from real-world artifacts with synthetically injected contradictions. The dataset is constructed through a multi-stage annotation pipeline involving more than 450 expert hours to ensure semantic validity and calibrated difficulty across perception, integration, and reasoning. We evaluate 13 state-of-the-art vision-language models and observe a consistent performance drop as tasks shift from perceptual matching to logical contradiction detection. Most models perform well on isolated entity recognition but fail when multiple clues must be synthesized for conflict reasoning. Capability-level analysis further reveals uneven skill acquisition, especially in tasks requiring multi-step inference or rule-based validation. Additional probing shows that conventional prompting strategies such as Chain-of-Thought and Set-of-Mark yield only marginal gains. By contrast, methods that interleave symbolic reasoning with grounded visual processing achieve more stable improvements. These results highlight a persistent bottleneck in multimodal reasoning and suggest new directions for building models capable of robust cross-modal verification.
[96] Affective Multimodal Agents with Proactive Knowledge Grounding for Emotionally Aligned Marketing Dialogue
Lin Yu, Xiaofei Han, Yifei Kang, Chiung-Yi Tseng, Danyang Zhang, Ziqian Bi, Zhimo Han
🧩 TL;DR
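This paper proposes AffectMind, a multimodal affective dialogue agent that uses proactive reasoning and dynamic knowledge grounding to sustain emotionally aligned, persuasive interactions, addressing the merely reactive behavior and limited emotional consistency of existing LLMs in emotionally rich, goal-oriented settings.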
📘 Detailed Summary
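Motivation: Although large language models have enabled fluent dialogue systems, most remain passively reactive in emotionally rich, goal-oriented settings such as marketing conversations. They lack emotional alignment and proactive persuasion, which limits their effectiveness as commercial multimodal agents.
Method: AffectMind has three core components: a Proactive Knowledge Grounding Network (PKGN) that continuously updates factual and affective context from text, vision, and prosody; an Emotion-Intent Alignment Model (EIAM) that jointly models user emotion and purchase intent to adapt persuasion strategies; and a Reinforced Discourse Loop (RDL) that optimizes emotional coherence and engagement using reinforcement signals from user responses.
Result: Experiments on two newly curated marketing dialogue datasets, MM-ConvMarket and AffectPromo, show that AffectMind significantly outperforms strong LLM-based baselines in emotional consistency (+26%), persuasive success rate (+19%), and long-term user engagement (+23%).
Conclusion: The study identifies emotion-grounded proactivity as a key capability for commercial multimodal agents. Multimodal affective modeling and proactive reasoning can markedly improve the emotional alignment and persuasiveness of dialogue systems, pointing to an important direction for emotionally intelligent dialogue systems.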
📄 Abstract
Recent advances in large language models (LLMs) have enabled fluent dialogue systems, but most remain reactive and struggle in emotionally rich, goal-oriented settings such as marketing conversations. To address this limitation, we propose AffectMind, a multimodal affective dialogue agent that performs proactive reasoning and dynamic knowledge grounding to sustain emotionally aligned and persuasive interactions. AffectMind combines three components: a Proactive Knowledge Grounding Network (PKGN) that continuously updates factual and affective context from text, vision, and prosody; an Emotion-Intent Alignment Model (EIAM) that jointly models user emotion and purchase intent to adapt persuasion strategies; and a Reinforced Discourse Loop (RDL) that optimizes emotional coherence and engagement via reinforcement signals from user responses. Experiments on two newly curated marketing dialogue datasets, MM-ConvMarket and AffectPromo, show that AffectMind outperforms strong LLM-based baselines in emotional consistency (+26%), persuasive success rate (+19%), and long-term user engagement (+23%), highlighting emotion-grounded proactivity as a key capability for commercial multimodal agents.
[97] HUMORCHAIN: Theory-Guided Multi-Stage Reasoning for Interpretable Multimodal Humor Generation
Jiajun Zhang, Shijia Luo, Ruikang Zhang, Qi Su
🧩 TL;DR
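This paper proposes HUMORCHAIN, a theory-guided multi-stage reasoning framework that is the first to explicitly embed cognitive structures from humor theories into multimodal humor generation. Its structured reasoning process, running from visual understanding to humor creation, significantly improves the quality of AI-generated humor and human preference for it.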
📘 Detailed Summary
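Motivation: Existing data-driven approaches lack explicit modeling or theoretical grounding of humor. They often produce literal descriptions that miss humor's underlying cognitive mechanisms, yielding image captions that are fluent but lack genuine humor or cognitive depth. Multimodal humor has become a prevalent form of online communication, especially among Gen Z, which highlights the need for AI systems that integrate visual understanding with humorous language generation.
Method: HUMORCHAIN (HUmor-guided Multi-step Orchestrated Reasoning Chain for Image Captioning) is a theory-guided multi-stage reasoning framework that integrates visual semantic parsing, humor- and psychology-based reasoning, and a fine-tuned discriminator for humor evaluation, forming an interpretable and controllable cognitive reasoning chain. According to the authors, it is the first framework to explicitly embed cognitive structures from humor theories into multimodal humor generation, enabling a structured reasoning process from visual understanding to humor creation.
Result: Experiments on the Meme-Image-No-Text, Oogiri-GO, and OxfordTVG-HIC datasets show that HUMORCHAIN outperforms state-of-the-art baselines in human humor preference, Elo/BT scores, and semantic diversity. These results indicate that theory-driven structured reasoning enables large language models to generate humor aligned with human perception.
Conclusion: The study demonstrates the effectiveness of explicitly embedding cognitive-theoretic structures into multimodal humor generation and gives AI humor generation a theory-guided, structured method. The interpretability and controllability of HUMORCHAIN lay a foundation for applying cognitive reasoning chains to a wider range of creative tasks, and suggest that theory-driven structured methods can markedly improve the quality and human alignment of generated content.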
📄 Abstract
Humor, as both a creative human activity and a social binding mechanism, has long posed a major challenge for AI generation. Although producing humor requires complex cognitive reasoning and social understanding, theories of humor suggest that it follows learnable patterns and structures, making it theoretically possible for generative models to acquire them implicitly. In recent years, multimodal humor has become a prevalent form of online communication, especially among Gen Z, highlighting the need for AI systems capable of integrating visual understanding with humorous language generation. However, existing data-driven approaches lack explicit modeling or theoretical grounding of humor and often produce literal descriptions that fail to capture its underlying cognitive mechanisms, so the generated image descriptions are fluent but lack genuine humor or cognitive depth. To address this limitation, we propose HUMORCHAIN (HUmor-guided Multi-step Orchestrated Reasoning Chain for Image Captioning), a theory-guided multi-stage reasoning framework. It integrates visual semantic parsing, humor- and psychology-based reasoning, and a fine-tuned discriminator for humor evaluation, forming an interpretable and controllable cognitive reasoning chain. To the best of our knowledge, this is the first work to explicitly embed cognitive structures from humor theories into multimodal humor generation, enabling a structured reasoning process from visual understanding to humor creation. Experiments on the Meme-Image-No-Text, Oogiri-GO, and OxfordTVG-HIC datasets show that HUMORCHAIN outperforms state-of-the-art baselines in human humor preference, Elo/BT scores, and semantic diversity, demonstrating that theory-driven structured reasoning enables large language models to generate humor aligned with human perception.
[98] Closing the Performance Gap Between AI and Radiologists in Chest X-Ray Reporting
Harshita Sharma, Maxwell C. Reynolds, Valentina Salvatelli, Anne-Marie G. Sykes, Kelly K. Horst, Anton Schwaighofer, Maximilian Ilse, Olesya Melnichenko, Sam Bond-Taylor, Fernando Pérez-García, Vamshi K. Mugu, Alex Chan, Ceylan Colak, Shelby A. Swartz, Motassem B. Nashawaty, Austin J. Gonzalez, Heather A. Ouellette, Selnur B. Erdal, Beth A. Schueler, Maria T. Wetscherek, Noel Codella, Mohit Jain, Shruthi Bannur, Kenza Bouzid, Daniel C. Castro, Stephanie Hyland, Panos Korfiatis, Ashish Khandelwal, Javier Alvarez-Valle
🧩 TL;DR
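This paper introduces MAIRA-X, a multimodal AI model for longitudinal chest X-ray report generation. In clinical evaluation it significantly improves AI-generated reports in lexical quality, clinical correctness, and lines-and-tubes (L&T) reporting, and a first-of-its-kind retrospective user study validates its clinical utility.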
📘 Detailed Summary
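Motivation: With expanded screening guidelines, more complex cases, and a shortage of radiologists, AI-assisted report generation could reduce radiologists' workload while maintaining diagnostic accuracy. In chest X-ray reporting, describing pathological findings and interpreting lines and tubes (L&T) is demanding and repetitive, especially at high patient volumes, which calls for an integrated AI system that handles both clinical findings and L&T reporting.
Method: The study develops MAIRA-X, a multimodal AI model for longitudinal chest X-ray report generation that covers both clinical findings and L&T reporting. The model is trained on a large-scale, multi-site, longitudinal dataset from Mayo Clinic comprising 3.1 million studies (6 million images from 806k patients). The study also develops a novel L&T-specific metrics framework for assessing reporting accuracy on attributes such as type, longitudinal change, and placement, and conducts a first-of-its-kind retrospective user evaluation study.
Result: On three holdout test sets and the public MIMIC-CXR dataset, MAIRA-X significantly outperforms the state of the art in lexical quality, clinical correctness, and L&T-related elements. In the user study, nine radiologists of varying experience blindly reviewed 600 studies. Critical error rates were comparable (3.0% for original reports vs. 4.6% for AI-generated reports), as were acceptable-sentence rates (97.8% vs. 97.4%), a significant improvement over prior user studies.
Conclusion: The results suggest that MAIRA-X can effectively assist radiologists, particularly in high-volume clinical settings. The model reduces workload while maintaining clinical safety, provides strong evidence for the practical clinical use of AI-assisted radiology reporting, and marks an important step toward clinical deployment in this field.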
📄 Abstract
AI-assisted report generation offers the opportunity to reduce radiologists' workload stemming from expanded screening guidelines, complex cases and workforce shortages, while maintaining diagnostic accuracy. In addition to describing pathological findings in chest X-ray reports, interpreting lines and tubes (L&T) is demanding and repetitive for radiologists, especially with high patient volumes. We introduce MAIRA-X, a clinically evaluated multimodal AI model for longitudinal chest X-ray (CXR) report generation, that encompasses both clinical findings and L&T reporting. Developed using a large-scale, multi-site, longitudinal dataset of 3.1 million studies (comprising 6 million images from 806k patients) from Mayo Clinic, MAIRA-X was evaluated on three holdout datasets and the public MIMIC-CXR dataset, where it significantly improved AI-generated reports over the state of the art on lexical quality, clinical correctness, and L&T-related elements. A novel L&T-specific metrics framework was developed to assess accuracy in reporting attributes such as type, longitudinal change and placement. A first-of-its-kind retrospective user evaluation study was conducted with nine radiologists of varying experience, who blindly reviewed 600 studies from distinct subjects. The user study found comparable rates of critical errors (3.0% for original vs. 4.6% for AI-generated reports) and a similar rate of acceptable sentences (97.8% for original vs. 97.4% for AI-generated reports), marking a significant improvement over prior user studies with larger gaps and higher error rates. Our results suggest that MAIRA-X can effectively assist radiologists, particularly in high-volume clinical settings.
[99] Decoding inner speech with an end-to-end brain-to-text neural interface
Yizi Zhang, Linyang He, Chaofei Fan, Tingkai Liu, Han Yu, Trung Le, Jingyuan Li, Scott Linderman, Lea Duncker, Francis R Willett, Nima Mesgarani, Liam Paninski
🧩 TL;DR
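This paper proposes an end-to-end Brain-to-Text (BIT) framework that translates neural activity directly into coherent sentences with a single differentiable neural network, substantially improving speech brain-computer interface decoding and enabling representation transfer across tasks and species.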
📘 Detailed Summary
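Motivation: Most current speech brain-computer interfaces use a cascaded framework that first decodes phonemes and then assembles sentences with an n-gram language model. This architecture prevents joint optimization of all stages and limits further performance gains.
Method: The core of the approach is a neural encoder pretrained across tasks and species, whose representations transfer to both attempted and imagined speech. The encoder is integrated end-to-end with audio large language models and trained with contrastive learning for cross-modal alignment, forming the complete Brain-to-Text framework.
Result: In a cascaded setting, the pretrained encoder sets a new state of the art on the Brain-to-Text '24 and '25 benchmarks. Integrated end-to-end, BIT reduces the word error rate of the prior end-to-end method from 24.69% to 10.22%. The study also finds that small-scale audio LLMs markedly improve end-to-end decoding.
Conclusion: Beyond record-setting performance, the study achieves cross-task generalization by aligning attempted and imagined speech embeddings. It opens a path toward integrating large, diverse neural datasets and building an end-to-end decoding framework that supports seamless, differentiable optimization.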
📄 Abstract
Speech brain-computer interfaces (BCIs) aim to restore communication for people with paralysis by translating neural activity into text. Most systems use cascaded frameworks that decode phonemes before assembling sentences with an n-gram language model (LM), preventing joint optimization of all stages simultaneously. Here, we introduce an end-to-end Brain-to-Text (BIT) framework that translates neural activity into coherent sentences using a single differentiable neural network. Central to our approach is a cross-task, cross-species pretrained neural encoder, whose representations transfer to both attempted and imagined speech. In a cascaded setting with an n-gram LM, the pretrained encoder establishes a new state-of-the-art (SOTA) on the Brain-to-Text '24 and '25 benchmarks. Integrated end-to-end with audio large language models (LLMs) and trained with contrastive learning for cross-modal alignment, BIT reduces the word error rate (WER) of the prior end-to-end method from 24.69% to 10.22%. Notably, we find that small-scale audio LLMs markedly improve end-to-end decoding. Beyond record-setting performance, BIT aligns attempted and imagined speech embeddings to enable cross-task generalization. Altogether, our approach advances the integration of large, diverse neural datasets, paving the way for an end-to-end decoding framework that supports seamless, differentiable optimization.
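Word error rate, the metric quoted above, is the word-level edit distance between the decoded sentence and the reference, divided by the number of reference words. The sketch below is a generic implementation of that standard definition, not the benchmark's official scoring script; the function names and the whitespace tokenization are assumptions.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the previous reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before, after):
    """Relative improvement when an error rate drops from `before` to `after`."""
    return (before - after) / before
```

Applied to the figures in the abstract, `relative_reduction(24.69, 10.22)` is about 0.586, so the drop from 24.69% to 10.22% WER is a relative error reduction of roughly 59%.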
[100] A Multiscale Geometric Method for Capturing Relational Topic Alignment
Conrad D. Hougen, Karl T. Pazdernik, Alfred O. Hero
🧩 TL;DR
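This paper proposes a geometric method that integrates multimodal text and co-author network data to build a hierarchical topic dendrogram, effectively identifying rare topics in scientific literature and visualizing smooth topic drift over time.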
📘 Detailed Summary
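Motivation: Topic models built on dense transformer embeddings have clear limitations on scientific literature: they tend to miss rare topics and therefore lack smooth temporal alignment when tracking how research interests evolve within co-author communities, and they are particularly weak at surfacing overlooked niche topics.
Method: The method uses a geometric framework to integrate multimodal text and co-author network data, building a hierarchical topic dendrogram with Hellinger distances and Ward's linkage. It captures both local and global structure and supports multiscale learning across semantic and temporal dimensions.
Result: Experiments show that the method effectively identifies rare-topic structure and visualizes smooth topic drift over time, demonstrating the strength of interpretable bag-of-words models paired with principled geometric alignment, notably for capturing novel and underrepresented topics in scientific literature.
Conclusion: The study underlines the value of combining interpretable bag-of-words models with geometric alignment. It provides an effective tool for tracking how research interests evolve in scientific communities, with clear application potential for identifying rare topics and visualizing topic drift, and opens a new direction for topic modeling over integrated multimodal data.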
📄 Abstract
Interpretable topic modeling is essential for tracking how research interests evolve within co-author communities. In scientific corpora, where novelty is prized, identifying underrepresented niche topics is particularly important. However, contemporary models built from dense transformer embeddings tend to miss rare topics and therefore also fail to capture smooth temporal alignment. We propose a geometric method that integrates multimodal text and co-author network data, using Hellinger distances and Ward's linkage to construct a hierarchical topic dendrogram. This approach captures both local and global structure, supporting multiscale learning across semantic and temporal dimensions. Our method effectively identifies rare-topic structure and visualizes smooth topic drift over time. Experiments highlight the strength of interpretable bag-of-words models when paired with principled geometric alignment.
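The Hellinger distance used above compares two probability distributions, here topic-word distributions from a bag-of-words model. A minimal sketch of the distance and a pairwise distance matrix follows; the function names are illustrative, and how the authors combine text and co-author data before clustering is not specified in the abstract.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions over the same vocabulary."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)  # lies in [0, 1] for valid distributions


def pairwise_hellinger(topics):
    """Symmetric matrix of Hellinger distances between topic-word distributions."""
    n = len(topics)
    return [[hellinger(topics[i], topics[j]) for j in range(n)] for i in range(n)]
```

One reason this pairing is natural: the Hellinger distance equals the Euclidean distance between the element-wise square roots of the two distributions, divided by √2. Ward's linkage is defined in terms of Euclidean geometry, so it can be applied to square-rooted topic vectors while remaining consistent with Hellinger distances. Whether the authors implement it this way is an assumption.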
[101] DELTA: Language Diffusion-based EEG-to-Text Architecture
Mingyu Jeon, Hyobin Kim
🧩 TL;DR
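This paper proposes DELTA, which combines a Residual Vector Quantization (RVQ) EEG tokenizer with a masked language diffusion model to address noise, subject variability, and error accumulation in autoregressive decoding for EEG-to-text, significantly improving semantic alignment on the ZuCo dataset.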
📘 Detailed Summary
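Motivation: EEG-to-text is challenging because of high-dimensional noise, subject variability, and error accumulation in autoregressive decoding. Existing methods are limited in semantic alignment and reliability, so more effective EEG representations and text generation frameworks are needed.
Method: DELTA uses residual vector quantization to discretize continuous EEG signals into multi-layer tokens, reducing noise and individual differences, and pairs it with a masked language diffusion model that reconstructs sentences through non-sequential denoising, avoiding the error accumulation of conventional autoregressive methods.
Result: On ZuCo, DELTA improves semantic alignment by up to 5.37 points over autoregressive baselines, reaching BLEU-1 21.9 and ROUGE-1 F 17.2 under word-level conditions, and enables reliable text generation from small EEG-text datasets.
Conclusion: The study demonstrates the effectiveness of combining discretized representations with diffusion models for EEG-to-text, offers a new direction for building scalable multimodal EEG-language models, and shows that reliable text generation is achievable with limited data.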
📄 Abstract
Electroencephalogram (EEG)-to-text remains challenging due to high-dimensional noise, subject variability, and error accumulation in autoregressive decoding. We introduce DELTA, which pairs a Residual Vector Quantization (RVQ) EEG tokenizer with a masked language diffusion model (LLaDA). RVQ discretizes continuous EEG into multi-layer tokens to reduce noise and individual differences, while LLaDA reconstructs sentences via non-sequential denoising. On ZuCo, DELTA improves semantic alignment by up to 5.37 points over autoregressive baselines, achieving BLEU-1 21.9 and ROUGE-1 F 17.2 under word-level conditions. These results enable reliable text generation from small EEG-text datasets and point toward scalable multimodal EEG-language models.
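Residual vector quantization turns a continuous vector into a short sequence of discrete tokens, one per layer: each layer picks the nearest codeword for whatever the previous layers left unexplained. The sketch below illustrates only this encode/decode mechanism with fixed, hand-supplied codebooks. It is not DELTA's tokenizer: in practice codebooks are learned, and the function names are hypothetical.

```python
def nearest(codebook, vector):
    """Index of the codeword closest to `vector` in squared Euclidean distance."""
    def dist(code):
        return sum((c - v) ** 2 for c, v in zip(code, vector))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(vector, codebooks):
    """Quantize `vector` layer by layer; each layer encodes the remaining residual."""
    residual = list(vector)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        # Subtract the chosen codeword so the next layer sees what is left over.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def rvq_decode(tokens, codebooks):
    """Reconstruct an approximation by summing the selected codeword from each layer."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out
```

For example, with a coarse first codebook `[[0.0, 0.0], [1.0, 1.0]]` and a finer second codebook `[[0.0, 0.0], [0.25, -0.25]]`, the vector `[1.2, 0.8]` encodes to tokens `[1, 1]` and decodes to `[1.25, 0.75]`. Coarse layers capture the dominant structure and later layers refine it, which is one intuition for why multi-layer tokens can suppress noise and subject-specific variation.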
[102] fMRI-LM: Towards a Universal Foundation Model for Language-Aligned fMRI Understanding
Yuxiang Wei, Yanteng Zhang, Xi Xiao, Chengxuan Qian, Tianyang Wang, Vince D. Calhoun
🧩 TL;DR
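This paper presents fMRI-LM, a foundation model that connects functional MRI with language models. A three-stage framework provides unified modeling of brain activity and semantic cognition, and the model shows strong zero-shot and few-shot performance on multiple benchmarks.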
📘 Detailed Summary
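Motivation: Multimodal large language models have advanced unified reasoning across images, audio, and video, but extending this capability to brain imaging remains largely unexplored. Bridging the gap is essential for linking neural activity with semantic cognition and for developing cross-modal brain representations.
Method: The method follows a three-stage framework. Stage 1 learns a neural tokenizer that maps fMRI into discrete tokens in a language-consistent space. Stage 2 adapts a pretrained LLM to jointly model fMRI tokens and text, treating brain activity as a sequence that can be predicted and described. Stage 3 performs multi-task, multi-paradigm instruction tuning to give the model high-level semantic understanding. A large descriptive corpus is constructed to compensate for the lack of natural fMRI-text pairs.
Result: fMRI-LM achieves strong zero-shot and few-shot performance on multiple benchmarks and adapts efficiently with parameter-efficient tuning (LoRA), establishing a scalable pathway toward a language-aligned, universal model for structural and semantic understanding of fMRI.
Conclusion: The study provides a scalable foundation-model framework for connecting brain activity with language understanding. By converting fMRI into language-consistent representations, it enables cross-modal semantic modeling of the brain, opens a new direction at the intersection of neuroscience and AI, and supports diverse downstream applications.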
📄 Abstract
Recent advances in multimodal large language models (LLMs) have enabled unified reasoning across images, audio, and video, but extending such capability to brain imaging remains largely unexplored. Bridging this gap is essential to link neural activity with semantic cognition and to develop cross-modal brain representations. To this end, we present fMRI-LM, a foundational model that bridges functional MRI (fMRI) and language through a three-stage framework. In Stage 1, we learn a neural tokenizer that maps fMRI into discrete tokens embedded in a language-consistent space. In Stage 2, a pretrained LLM is adapted to jointly model fMRI tokens and text, treating brain activity as a sequence that can be temporally predicted and linguistically described. To overcome the lack of natural fMRI-text pairs, we construct a large descriptive corpus that translates diverse imaging-based features into structured textual descriptors, capturing the low-level organization of fMRI signals. In Stage 3, we perform multi-task, multi-paradigm instruction tuning to endow fMRI-LM with high-level semantic understanding, supporting diverse downstream applications. Across various benchmarks, fMRI-LM achieves strong zero-shot and few-shot performance, and adapts efficiently with parameter-efficient tuning (LoRA), establishing a scalable pathway toward a language-aligned, universal model for structural and semantic understanding of fMRI.
[103] A Comparative Study of LLM Prompting and Fine-Tuning for Cross-genre Authorship Attribution on Chinese Lyrics
Yuxin Li, Lorraine Xu, Meng Fan Wang
🧩 TL;DR
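To address the lack of public datasets for authorship attribution of Chinese lyrics, this study creates the first balanced cross-genre dataset and develops a domain-specific model. Comparing the fine-tuned model with a zero-shot LLM baseline shows that genre sensitivity has a key impact on attribution accuracy.
📘 Detailed Summary
Motivation: Authorship attribution for Chinese lyrics faces two core problems: clean, public datasets are lacking, and the effectiveness of existing methods under cross-genre evaluation is unclear. The study aims to fill this gap by creating the first balanced Chinese lyrics dataset and systematically evaluating a domain-specific model across genres.
Method: The work has two parts. First, it creates a balanced dataset of Chinese lyrics spanning multiple genres. Second, it develops and fine-tunes a domain-specific model and compares it with zero-shot inference using the DeepSeek LLM. The experiments test two central hypotheses: whether the fine-tuned model outperforms the zero-shot baseline, and whether performance is genre-dependent.
Result: The experiments strongly support the second hypothesis: structured genres (e.g., Folklore & Tradition) yield significantly higher attribution accuracy than more abstract genres (e.g., Love & Romance). The first hypothesis receives only partial support: fine-tuning improves robustness and generalization on Test1 (real-world data and difficult genres) but offers limited or ambiguous gains on the synthetically augmented Test2. The design limitations of Test2 (label imbalance, shallow lexical differences, and narrow genre sampling) may obscure the true effect of fine-tuning.
Conclusion: The study establishes the first benchmark for cross-genre Chinese lyric attribution, highlights the importance of genre-sensitive evaluation, and provides a public dataset and analytical framework for future research. Its main recommendations are to enlarge and diversify test sets, reduce reliance on token-level data augmentation, balance author representation across genres, and explore domain-adaptive pretraining as a way to improve attribution performance.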
📄 Abstract
We propose a novel study on authorship attribution for Chinese lyrics, a domain where clean, public datasets are sorely lacking. Our contributions are twofold: (1) we create a new, balanced dataset of Chinese lyrics spanning multiple genres, and (2) we develop and fine-tune a domain-specific model, comparing its performance against zero-shot inference using the DeepSeek LLM. We test two central hypotheses. First, we hypothesize that a fine-tuned model will outperform a zero-shot LLM baseline. Second, we hypothesize that performance is genre-dependent. Our experiments strongly confirm Hypothesis 2: structured genres (e.g. Folklore & Tradition) yield significantly higher attribution accuracy than more abstract genres (e.g. Love & Romance). Hypothesis 1 receives only partial support: fine-tuning improves robustness and generalization in Test1 (real-world data and difficult genres), but offers limited or ambiguous gains in Test2, a smaller, synthetically-augmented set. We show that the design limitations of Test2 (e.g., label imbalance, shallow lexical differences, and narrow genre sampling) can obscure the true effectiveness of fine-tuning. Our work establishes the first benchmark for cross-genre Chinese lyric attribution, highlights the importance of genre-sensitive evaluation, and provides a public dataset and analytical framework for future research. We conclude with recommendations: enlarge and diversify test sets, reduce reliance on token-level data augmentation, balance author representation across genres, and investigate domain-adaptive pretraining as a pathway for improved attribution performance.
[104] ResearchArcade: Graph Interface for Academic Tasks
Jingjun Xu, Chongshan Lin, Haofei Yu, Tao Feng, Jiaxuan You
🧩 TL;DR
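This paper presents ResearchArcade, a unified graph-based data interface that connects multiple academic data sources and supports diverse machine learning tasks, with the aim of accelerating knowledge discovery. The framework organizes academic data in a multimodal graph structure and is validated on six academic tasks.
📘 Detailed Summary
Motivation: As researchers increasingly use machine learning to assist research tasks, academia produces diverse data sources but lacks a unified interface for developing machine learning models across them. This fragmentation limits how well models can support the full research process and slows knowledge discovery.
Method: ResearchArcade uses a graph-based interface design, organizing data from different sources, including academic corpora from ArXiv and peer reviews from OpenReview, in a coherent multi-table format with graph structures. The framework integrates multimodal information (text, figures, tables), preserves temporal evolution at both the manuscript and community levels, and unifies diverse academic task definitions to support models with different input requirements.
Result: Experiments on six academic tasks show that combining cross-source and multimodal information supports a broader range of tasks, and that incorporating graph structures consistently improves performance over baseline methods. These results support the effectiveness of ResearchArcade in integrating heterogeneous academic data and its positive effect on task performance.
Conclusion: ResearchArcade demonstrates the feasibility of a unified academic data interface. Its graph structure and multimodal integration give academic machine learning more comprehensive data support. The framework has the potential to accelerate research and lays a foundation for more capable research-assistance tools.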
📄 Abstract
Academic research generates diverse data sources, and as researchers increasingly use machine learning to assist research tasks, a crucial question arises: Can we build a unified data interface to support the development of machine learning models for various academic tasks? Models trained on such a unified interface can better support human researchers throughout the research process, eventually accelerating knowledge discovery. In this work, we introduce ResearchArcade, a graph-based interface that connects multiple academic data sources, unifies task definitions, and supports a wide range of base models to address key academic challenges. ResearchArcade utilizes a coherent multi-table format with graph structures to organize data from different sources, including academic corpora from ArXiv and peer reviews from OpenReview, while capturing information with multiple modalities, such as text, figures, and tables. ResearchArcade also preserves temporal evolution at both the manuscript and community levels, supporting the study of paper revisions as well as broader research trends over time. Additionally, ResearchArcade unifies diverse academic task definitions and supports various models with distinct input requirements. Our experiments across six academic tasks demonstrate that combining cross-source and multi-modal information enables a broader range of tasks, while incorporating graph structures consistently improves performance over baseline methods. This highlights the effectiveness of ResearchArcade and its potential to advance research progress.
[105] Bridging the Modality Gap by Similarity Standardization with Pseudo-Positive Samples
Shuhei Yamashita, Daiki Shirafuji, Tatsuhiko Saito
🧩 TL;DR
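This paper proposes a similarity standardization method built on pseudo data to address the modality gap in cross-modal retrieval. The method standardizes similarity scores using modality-specific statistics and significantly improves retrieval performance on multimodal QA benchmarks.
📘 Detailed Summary
Motivation: When a database contains both text and images, cross-modal retrieval suffers from the modality gap: similarity scores differ in scale across modalities, which hinders accurate retrieval. Most existing work relies on fine-tuning with manually labeled data, whereas this paper aims to address the problem with an unsupervised approach.
Method: The paper proposes similarity standardization. It first computes the mean and variance of the similarity scores between each query and its paired text or image data. Using these modality-specific statistics, it standardizes all similarity scores onto a common scale that is comparable across modalities. The statistics are computed from pseudo pairs, which are built by retrieving the text and image candidates with the highest cosine similarity to each query.
Result: Experiments on seven vision-language models and two multimodal QA benchmarks (MMQA and WebQA) show that the method significantly improves retrieval performance. When the query and the target data belong to different modalities, it achieves average Recall@20 gains of 64% on MMQA and 28% on WebQA. Compared with E5-V, which addresses the modality gap through image captioning, the method bridges the gap more effectively.
Conclusion: The study shows that similarity standardization built on pseudo data can address the modality gap in cross-modal retrieval without manually labeled data. The method offers an efficient unsupervised option for multimodal retrieval systems and suggests new directions for cross-modal alignment research.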
📄 Abstract
Advances in vision-language models (VLMs) have enabled effective cross-modality retrieval. However, when both text and images exist in the database, similarity scores would differ in scale by modality. This phenomenon, known as the modality gap, hinders accurate retrieval. Most existing studies address this issue with manually labeled data, e.g., by fine-tuning VLMs on them. In this work, we propose a similarity standardization approach with pseudo data construction. We first compute the mean and variance of the similarity scores between each query and its paired data in text or image modality. Using these modality-specific statistics, we standardize all similarity scores to compare on a common scale across modalities. These statistics are calculated from pseudo pairs, which are constructed by retrieving the text and image candidates with the highest cosine similarity to each query. We evaluate our method across seven VLMs using two multi-modal QA benchmarks (MMQA and WebQA), where each question requires retrieving either text or image data. Our experimental results show that our method significantly improves retrieval performance, achieving average Recall@20 gains of 64% on MMQA and 28% on WebQA when the query and the target data belong to different modalities. Compared to E5-V, which addresses the modality gap through image captioning, we confirm that our method more effectively bridges the modality gap.
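The standardization step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the use of one set of statistics per modality, and the toy scores are all hypothetical. The pseudo-pair scores stand in for the similarities between queries and their top-1 retrieved candidates in each modality.

```python
from statistics import mean, pstdev


def modality_stats(pseudo_scores):
    """Mean and standard deviation of pseudo-pair similarity scores
    for one modality (hypothetical helper)."""
    mu = mean(pseudo_scores)
    sigma = pstdev(pseudo_scores)
    if sigma == 0:
        sigma = 1.0  # avoid division by zero on degenerate input
    return mu, sigma


def standardize(score, mu, sigma):
    """Map a raw similarity score onto a modality-independent scale."""
    return (score - mu) / sigma


def rank_mixed(candidates, stats):
    """Rank (item_id, modality, raw_score) candidates by standardized score."""
    scored = [
        (item_id, standardize(raw, *stats[modality]))
        for item_id, modality, raw in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy numbers (assumed): text similarities sit on a higher scale than image ones.
stats = {
    "text": modality_stats([0.80, 0.85, 0.90]),
    "image": modality_stats([0.25, 0.30, 0.35]),
}
candidates = [("t1", "text", 0.82), ("i1", "image", 0.34)]
ranking = rank_mixed(candidates, stats)
print(ranking)  # the image candidate now ranks above the text candidate
```

On raw scores the text candidate would win simply because text similarities are larger overall. After standardization each candidate is judged relative to what a strong match looks like in its own modality.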
[106] Focused Chain-of-Thought: Efficient LLM Reasoning via Structured Input Information
Lukas Struppek, Dominik Hintersdorf, Hannah Struppek, Daniel Neider, Kristian Kersting
🧩 TL;DR
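This paper proposes Focused Chain-of-Thought, a training-free, input-centric method that separates information extraction from reasoning. It reduces generated tokens by 2-3x while maintaining accuracy, improving the reasoning efficiency of large language models.
📘 Detailed Summary
Motivation: Current large language models achieve strong reasoning by generating detailed chain-of-thought traces, but this often leads to excessive token use and high inference latency. Existing efficiency methods mainly rely on model-centric interventions, such as reinforcement learning or supervised fine-tuning, to reduce verbosity. This study instead explores a training-free, input-centric approach to the efficiency bottleneck.
Method: Inspired by cognitive psychology, the paper proposes Focused Chain-of-Thought (F-CoT), which separates information extraction from reasoning. The method first extracts the essential information from a query and organizes it into a concise, structured context, then guides the model to reason only over that context. Preventing attention to irrelevant details naturally yields shorter reasoning paths.
Result: On arithmetic word problems, F-CoT reduces generated tokens by 2-3x while maintaining accuracy comparable to standard zero-shot chain-of-thought. The results indicate that the method improves efficiency substantially without sacrificing reasoning performance and support structured input as a lever for efficient reasoning.
Conclusion: The study shows that structured input is a simple yet effective lever for improving LLM reasoning efficiency. F-CoT illustrates the potential of training-free, input-centric methods and offers a new direction for reducing inference latency and compute cost while preserving reasoning accuracy.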
📄 Abstract
Recent large language models achieve strong reasoning performance by generating detailed chain-of-thought traces, but this often leads to excessive token use and high inference latency. Existing efficiency approaches typically focus on model-centric interventions, such as reinforcement learning or supervised fine-tuning, to reduce verbosity. In contrast, we propose a training-free, input-centric approach. Inspired by cognitive psychology, we introduce Focused Chain-of-Thought (F-CoT), which separates information extraction from the reasoning process. F-CoT first organizes the essential information from a query into a concise, structured context and then guides the model to reason exclusively over this context. By preventing attention to irrelevant details, F-CoT naturally produces shorter reasoning paths. On arithmetic word problems, F-CoT reduces generated tokens by 2-3x while maintaining accuracy comparable to standard zero-shot CoT. These results highlight structured input as a simple yet effective lever for more efficient LLM reasoning.
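The two-stage idea (extract first, then reason only over the extracted context) might look like the sketch below. The prompt wording, the `focused_cot` function, and the `llm` callable are assumptions made for illustration; the abstract does not specify the actual prompts or context format.

```python
from typing import Callable

# Hypothetical prompt wording, for illustration only; not taken from the paper.
EXTRACT_PROMPT = (
    "List only the facts and the question needed to solve the problem below, "
    "as short labeled lines. Do not solve it.\n\nProblem: {question}"
)
REASON_PROMPT = (
    "Using only the structured context below, reason step by step and give "
    "the final answer.\n\nContext:\n{context}"
)


def focused_cot(question: str, llm: Callable[[str], str]) -> str:
    """Two calls to a text-in, text-out model (assumed interface).

    Stage 1 condenses the query into a structured context.
    Stage 2 reasons over that context; the original query is not repeated.
    """
    context = llm(EXTRACT_PROMPT.format(question=question))
    return llm(REASON_PROMPT.format(context=context))
```

Because the second prompt contains only the condensed context, any distracting detail in the original wording never reaches the reasoning step, which is the mechanism the abstract credits for shorter reasoning paths.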
[107] Improving LLM-based Ontology Matching with fine-tuning on synthetic data
Guilherme Sousa, Rinaldo Lima, Cassia Trojahn
🧩 TL;DR
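This paper proposes a strategy that combines automatic dataset generation with fine-tuning so that large language models can perform ontology matching effectively. Using search space reduction and synthetic data generation, the method improves LLM ontology alignment performance in a zero-shot setting.
📘 Detailed Summary
Motivation: The study addresses two key challenges for LLMs in ontology matching: the scarcity of reference alignments for training, and how to use LLMs to process ontology modules directly and generate the corresponding alignments. Existing methods perform poorly in zero-shot settings, and fine-tuning strategies dedicated to ontology matching are lacking.
Method: The approach has three core components. First, a search space reduction technique selects relevant subsets from the source and target ontologies, which are used to construct prompts automatically. Second, a novel LLM-based method generates a synthetic dataset, creating a corpus of ontology submodule pairs and their reference alignments. Third, a dedicated fine-tuning strategy trains the LLM on the synthetic data to adapt it to ontology matching.
Result: Evaluation on the Conference, Geolink, Enslaved, Taxon, and Hydrography datasets from the OAEI complex track shows that the LLM fine-tuned on synthetic data outperforms the non-fine-tuned base model. The experiments support the method's effectiveness for ontology matching across different domains.
Conclusion: The main contribution is a strategy that combines automatic dataset generation with fine-tuning to address data scarcity for LLM-based ontology matching. The method suggests new ways to apply LLMs in knowledge representation and semantic integration and illustrates the potential of synthetic data for adapting models to domain-specific tasks.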
📄 Abstract
Large Language Models (LLMs) are increasingly being integrated into various components of Ontology Matching pipelines. This paper investigates the capability of LLMs to perform ontology matching directly on ontology modules and generate the corresponding alignments. It further explores how a dedicated fine-tuning strategy can enhance the model's matching performance in a zero-shot setting. The proposed method incorporates a search space reduction technique to select relevant subsets from both source and target ontologies, which are then used to automatically construct prompts. Recognizing the scarcity of reference alignments for training, a novel LLM-based approach is introduced for generating a synthetic dataset. This process creates a corpus of ontology submodule pairs and their corresponding reference alignments, specifically designed to fine-tune an LLM for the ontology matching task. The proposed approach was evaluated on the Conference, Geolink, Enslaved, Taxon, and Hydrography datasets from the OAEI complex track. The results demonstrate that the LLM fine-tuned on the synthetically generated data exhibits superior performance compared to the non-fine-tuned base model. The key contribution is a strategy that combines automatic dataset generation with fine-tuning to effectively adapt LLMs for ontology matching tasks.
[108] Visual Puns from Idioms: An Iterative LLM-T2IM-MLLM Framework
Kelaiti Xiao, Liang Yang, Dongyu Zhang, Paerhati Tulajiang, Hongfei Lin
🧩 TL;DR
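This paper proposes an iterative framework for automatically generating and evaluating visual puns. By coordinating a large language model, a text-to-image model, and a multimodal large language model, it synthesizes and interprets idiom-based visual puns and builds a corresponding benchmark dataset.
📘 Detailed Summary
Motivation: The study tackles automatic generation and evaluation of idiom-based visual puns. It explores how to coordinate AI models of different modalities to create images that convey both an idiom's literal meaning and its figurative meaning, and it builds a benchmark dataset for evaluating model performance.
Method: The study proposes an iterative framework in which a large language model, a text-to-image model, and a multimodal large language model work together. The system follows a four-step loop: generate a detailed visual prompt, synthesize an image, infer the idiom from the image, and refine the prompt until recognition succeeds or a step limit is reached. Using 1,000 idioms as inputs, the authors build a dataset of visual pun images.
Result: Experiments cover 10 LLMs, 10 MLLMs, and one text-to-image model. MLLM choice is the primary performance driver: GPT achieves the highest accuracy, Gemini follows, and the best open-source model, Gemma, is competitive with some closed models. Among LLMs, Claude performs best on prompt generation.
Conclusion: The study demonstrates the potential of coordinated multimodal AI systems for creative content generation, provides a new benchmark for vision-language understanding, and reveals performance differences among models on a complex multimodal task, offering a reference for designing future cross-modal generation and understanding systems.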
📄 Abstract
We study idiom-based visual puns--images that align an idiom's literal and figurative meanings--and present an iterative framework that coordinates a large language model (LLM), a text-to-image model (T2IM), and a multimodal LLM (MLLM) for automatic generation and evaluation. Given an idiom, the system iteratively (i) generates detailed visual prompts, (ii) synthesizes an image, (iii) infers the idiom from the image, and (iv) refines the prompt until recognition succeeds or a step limit is reached. Using 1,000 idioms as inputs, we synthesize a corresponding dataset of visual pun images with paired prompts, enabling benchmarking of both generation and understanding. Experiments across 10 LLMs, 10 MLLMs, and one T2IM (Qwen-Image) show that MLLM choice is the primary performance driver: GPT achieves the highest accuracies, Gemini follows, and the best open-source MLLM (Gemma) is competitive with some closed models. On the LLM side, Claude attains the strongest average performance for prompt generation.
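The four-step loop in the abstract can be sketched as plain control flow. Everything concrete here is an assumption for illustration: the three callables stand in for the LLM, T2IM, and MLLM, their signatures are hypothetical, and exact string matching is a simplification of however the paper decides that recognition succeeded.

```python
def generate_visual_pun(idiom, llm, t2im, mllm, max_steps=5):
    """Iterate prompt -> image -> guess until the idiom is recognized.

    llm(idiom, feedback) -> prompt text   (hypothetical interface)
    t2im(prompt)         -> image object  (hypothetical interface)
    mllm(image)          -> guessed idiom (hypothetical interface)
    """
    if max_steps < 1:
        raise ValueError("max_steps must be at least 1")
    feedback = None
    for step in range(1, max_steps + 1):
        prompt = llm(idiom, feedback)   # (i) generate or refine the visual prompt
        image = t2im(prompt)            # (ii) synthesize an image
        guess = mllm(image)             # (iii) infer the idiom from the image
        if guess.strip().lower() == idiom.strip().lower():
            return {"success": True, "steps": step, "prompt": prompt, "image": image}
        feedback = guess                # (iv) the wrong guess drives refinement
    return {"success": False, "steps": max_steps, "prompt": prompt, "image": image}
```

Keeping the three models behind plain callables mirrors the experimental setup described above, where different LLMs and MLLMs are swapped in while the loop stays fixed.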
[109] Mind Reading or Misreading? LLMs on the Big Five Personality Test
Francesco Di Cursi, Chiara Boldrini, Marco Conti, Andrea Passarella
🧩 TL;DR
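This study evaluates how well large language models can predict personality from text in a zero-shot setting. It finds that current off-the-shelf LLMs are not yet reliable for binary Five Factor Model personality prediction, and that prompt design and the choice of evaluation metrics are critical for interpreting results.
📘 Detailed Summary
Motivation: The study examines whether current LLMs are suitable for automatic personality prediction. It explores their ability to predict personality traits from text in a zero-shot setting, with particular attention to how models, datasets, and prompting strategies affect performance, filling a gap in the systematic evaluation of LLM-based personality prediction.
Method: Five LLMs (including GPT-4 and lightweight open-source alternatives) are evaluated on three heterogeneous datasets (Essays, MyPersonality, Pandora) with two prompting strategies: a minimal prompt and an enriched prompt containing linguistic and psychological cues. All experiments perform zero-shot prediction under the binary Five Factor Model.
Result: Enriched prompts reduce invalid outputs and improve class balance but introduce a systematic bias toward predicting trait presence. Openness and Agreeableness are relatively easier to detect, while Extraversion and Neuroticism remain challenging. Open-source models sometimes approach GPT-4 and prior benchmarks, but no configuration yields consistently reliable predictions in the zero-shot binary setting. Aggregate metrics such as accuracy and macro-F1 mask significant asymmetries, whereas per-class recall offers clearer diagnostic value.
Conclusion: Current off-the-shelf LLMs are not yet suitable for automatic personality prediction, and prompt design, trait framing, and evaluation metrics must be carefully coordinated to obtain interpretable results. The study stresses caution when using LLMs for personality prediction and points to the need for finer-grained evaluation and more specialized model design to improve reliability.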
📄 Abstract
We evaluate large language models (LLMs) for automatic personality prediction from text under the binary Five Factor Model (BIG5). Five models -- including GPT-4 and lightweight open-source alternatives -- are tested across three heterogeneous datasets (Essays, MyPersonality, Pandora) and two prompting strategies (minimal vs. enriched with linguistic and psychological cues). Enriched prompts reduce invalid outputs and improve class balance, but also introduce a systematic bias toward predicting trait presence. Performance varies substantially: Openness and Agreeableness are relatively easier to detect, while Extraversion and Neuroticism remain challenging. Although open-source models sometimes approach GPT-4 and prior benchmarks, no configuration yields consistently reliable predictions in zero-shot binary settings. Moreover, aggregate metrics such as accuracy and macro-F1 mask significant asymmetries, with per-class recall offering clearer diagnostic value. These findings show that current out-of-the-box LLMs are not yet suitable for APPT, and that careful coordination of prompt design, trait framing, and evaluation metrics is essential for interpretable results.
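The point about aggregate metrics hiding asymmetries can be seen in a small worked example. The labels below are made up for illustration and are not data from the paper: a classifier that always predicts "trait present" looks acceptable on accuracy while its recall on the "trait absent" class is zero.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred):
    """Recall computed separately for each true label."""
    recalls = {}
    for label in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == label]
        recalls[label] = sum(y_pred[i] == label for i in idx) / len(idx)
    return recalls


# Toy data (assumed): 6 texts with the trait, 4 without.
y_true = [1] * 6 + [0] * 4
# A model biased toward predicting trait presence.
y_pred = [1] * 10

print(accuracy(y_true, y_pred))          # 0.6
print(per_class_recall(y_true, y_pred))  # {0: 0.0, 1: 1.0}
```

Reporting recall per class makes the bias toward trait presence visible immediately, which is the diagnostic value the abstract attributes to it.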
[110] Tourism Question Answer System in Indian Language using Domain-Adapted Foundation Models
Praveen Gatla, Anushka, Nikita Kanwar, Gouri Sahoo, Rajesh Kumar Mundotiya
🧩 TL;DR
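This paper builds the first extractive question-answering baseline for the Hindi tourism domain, focusing on the cultural setting of Varanasi. It constructs a dataset and fine-tunes foundation models such as BERT and RoBERTa, including with LoRA, balancing parameter efficiency and performance in a low-resource setting.
📘 Detailed Summary
Motivation: The study addresses the lack of language-specific QA resources for Hindi tourism, in particular for Varanasi, a destination with rich cultural meaning (such as the Bhakti-Bhaav devotional ethos), where existing systems cannot handle cultural nuance and domain-specific queries.
Method: The authors build a dataset of 7,715 Hindi QA pairs and generate 27,455 augmented pairs with Llama zero-shot prompting. They propose a framework based on foundation models such as BERT and RoBERTa, apply two fine-tuning strategies (supervised fine-tuning and Low-Rank Adaptation), and evaluate several pretrained language model variants, including Hindi-BERT.
Result: LoRA-based fine-tuning achieves an F1 score of 85.3% while reducing trainable parameters by 98%, balancing efficiency and accuracy. RoBERTa with supervised fine-tuning outperforms BERT variants at capturing the contextual nuance of culturally embedded terms (e.g., Aarti, Kund). The F1, BLEU, and ROUGE-L metrics reveal trade-offs between answer precision and linguistic fluency.
Conclusion: The study establishes a foundational baseline for Hindi tourism QA, emphasizes the role of LoRA in low-resource settings, and argues that the tourism domain needs culturally contextualized NLP frameworks. It shows that parameter-efficient fine-tuning can retain performance while sharply reducing compute requirements, offering a workable approach for similar low-resource, domain-specific applications.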
📄 Abstract
This article presents the first comprehensive study on designing a baseline extractive question-answering (QA) system for the Hindi tourism domain, with a specialized focus on Varanasi, a cultural and spiritual hub renowned for its Bhakti-Bhaav (devotional ethos). Targeting ten tourism-centric subdomains (Ganga Aarti, Cruise, Food Court, Public Toilet, Kund, Museum, General, Ashram, Temple, and Travel), the work addresses the absence of language-specific QA resources in Hindi for culturally nuanced applications. In this paper, a dataset comprising 7,715 Hindi QA pairs pertaining to Varanasi tourism was constructed and subsequently augmented with 27,455 pairs generated via Llama zero-shot prompting. We propose a framework leveraging foundation models, BERT and RoBERTa, fine-tuned using Supervised Fine-Tuning (SFT) and Low-Rank Adaptation (LoRA), to optimize parameter efficiency and task performance. Multiple variants of BERT, including pre-trained language models (e.g., Hindi-BERT), are evaluated to assess their suitability for low-resource domain-specific QA. The evaluation metrics (F1, BLEU, and ROUGE-L) highlight trade-offs between answer precision and linguistic fluency. Experiments demonstrate that LoRA-based fine-tuning achieves competitive performance (85.3% F1) while reducing trainable parameters by 98% compared to SFT, striking a balance between efficiency and accuracy. Comparative analysis across models reveals that RoBERTa with SFT outperforms BERT variants in capturing contextual nuances, particularly for culturally embedded terms (e.g., Aarti, Kund). This work establishes a foundational baseline for Hindi tourism QA systems, emphasizing the role of LoRA in low-resource settings and underscoring the need for culturally contextualized NLP frameworks in the tourism domain.
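To see why LoRA cuts trainable parameters so sharply, consider the arithmetic for a single weight matrix. LoRA freezes a d x k weight and trains two low-rank factors of shapes d x r and r x k, so the trainable count drops from d*k to r*(d + k). The dimensions and rank below are illustrative assumptions (a 768 x 768 projection with rank 8), not the configuration used in the paper, and the resulting percentage is for this one matrix only.

```python
def full_params(d: int, k: int) -> int:
    """Trainable parameters when the whole d x k weight matrix is updated."""
    return d * k


def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters for LoRA factors B (d x r) and A (r x k)."""
    return r * (d + k)


d, k, r = 768, 768, 8  # assumed sizes for illustration
full = full_params(d, k)
lora = lora_params(d, k, r)
reduction = 1 - lora / full

print(full, lora)          # 589824 12288
print(f"{reduction:.1%}")  # 97.9%
```

The overall reduction reported for a full model also depends on which matrices receive adapters and which other parameters stay trainable, so this sketch only shows the order of magnitude.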
[111] Optimizing Multimodal Language Models through Attention-based Interpretability
Alexander Sergeev, Evgeny Kotelnikov
🧩 TL;DR
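This paper proposes an attention-based interpretability method for multimodal language models. It identifies important attention heads by analyzing attention scores on key image objects and uses this information to guide parameter-efficient fine-tuning, substantially influencing image understanding while fine-tuning only about 0.01% of the parameters.
📘 Detailed Summary
Motivation: Multimodal language models are powerful but hard to interpret, which makes it difficult to determine which components are most effective to train when balancing efficiency and performance in parameter-efficient fine-tuning. Existing methods lack a systematic analysis of how a model's internal attention relates to key objects in an image.
Method: The paper proposes an attention-based interpretability method that analyzes attention scores relative to image tokens to identify attention heads that focus on key image objects. It computes a Head Impact (HI) score to quantify how strongly each head attends to key objects and uses this information to select model components for parameter-efficient fine-tuning.
Result: The method is validated on multimodal language models with 2-3 billion parameters. Fine-tuning the layers with the highest HI scores produces the largest shifts in metrics compared with the pretrained model and with fine-tuning randomly selected or lowest-HI layers. Fine-tuning only about 0.01% of the parameters can substantially influence image understanding.
Conclusion: The method provides an interpretability tool for multimodal language models, supports selecting fine-tuning components through attention analysis, and offers a data-driven component selection strategy for parameter-efficient fine-tuning. The authors also create a new dataset containing images, key object masks, and textual descriptions to support further research.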
📄 Abstract
Modern large language models are becoming multimodal, analyzing various data formats like text and images. While fine-tuning is effective for adapting these multimodal language models (MLMs) to downstream tasks, full fine-tuning is computationally expensive. Parameter-Efficient Fine-Tuning (PEFT) methods address this by training only a small portion of model weights. However, MLMs are difficult to interpret, making it challenging to identify which components are most effective for training to balance efficiency and performance. We propose an attention-based interpretability method for MLMs by analyzing attention scores relative to image tokens. The core idea is to identify attention heads that focus on image key objects. We utilize this information to select optimal model components for PEFT in multimodal models. Our contributions include a method for identifying attention heads associated with image key objects, its application to PEFT for image captioning, and the creation of a new dataset containing images, key object masks, and their textual descriptions. We conducted experiments on MLMs with 2-3 billion parameters to validate the method's effectiveness. By calculating Head Impact (HI) scores, we quantify an attention head's focus on key objects, indicating its significance in image understanding. Our fine-tuning experiments demonstrate that adapting layers with the highest HI scores leads to the most significant shifts in metrics compared to pre-trained, randomly selected, or lowest-HI-score layers. This indicates that fine-tuning a small percentage (around 0.01%) of parameters in these crucial layers can substantially influence image understanding capabilities.
cs.AI
[112] Evaluating Strategies for Synthesizing Clinical Notes for Medical Multimodal AI
Niccolo Marini, Zhaohui Liang, Sivaramakrishnan Rajaraman, Zhiyun Xue, Sameer Antani
🧩 TL;DR
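This study explores strategies for using large language models to generate synthetic clinical text for biomedical multimodal learning. By optimizing prompt design and the inclusion of medical metadata, it improves performance on skin lesion classification and cross-modal retrieval.
📘 Detailed Summary
Motivation: Biomedical multimodal learning is constrained by the scarcity of large heterogeneous data. In dermatology in particular, skin lesion datasets typically contain only images with minimal metadata, which limits the benefit of multimodal integration. LLMs can generate textual descriptions of images, but they are not specifically trained for the medical domain and carry a risk of hallucination in clinically relevant contexts, so effective strategies for generating synthetic clinical text need to be studied.
Method: The study investigates strategies for synthetic clinical text generation, focusing on prompt design and the inclusion of medical metadata. It generates synthetic clinical notes by combining prompt engineering with domain-specific metadata, integrates image and text representations in multimodal architectures, and evaluates performance on classification and cross-modal retrieval.
Result: Experiments on several heterogeneous dermatology datasets show that synthetic clinical notes improve classification performance, particularly under domain shift. The synthetic text also unlocks cross-modal retrieval, a downstream task that is not explicitly optimized during training. The results support the value of careful prompt design and metadata inclusion for multimodal learning.
Conclusion: The study shows that, with carefully designed prompts and medical metadata, LLMs can generate synthetic clinical text that strengthens biomedical multimodal learning. The approach improves the robustness and generalization of classification, enables cross-modal retrieval, and offers a way to mitigate data scarcity in medical AI applications.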
📄 Abstract
Multimodal (MM) learning is emerging as a promising paradigm in biomedical artificial intelligence (AI) applications, integrating complementary modalities, which highlight different aspects of patient health. The scarcity of large heterogeneous biomedical MM data has restrained the development of robust models for medical AI applications. In the dermatology domain, for instance, skin lesion datasets typically include only images linked to minimal metadata describing the condition, thereby limiting the benefits of MM data integration for reliable and generalizable predictions. Recent advances in Large Language Models (LLMs) enable the synthesis of textual descriptions of image findings, potentially allowing the combination of image and text representations. However, LLMs are not specifically trained for use in the medical domain, and their naive inclusion has raised concerns about the risk of hallucinations in clinically relevant contexts. This work investigates strategies for generating synthetic textual clinical notes, in terms of prompt design and medical metadata inclusion, and evaluates their impact on MM architectures toward enhancing performance in classification and cross-modal retrieval tasks. Experiments across several heterogeneous dermatology datasets demonstrate that synthetic clinical notes not only enhance classification performance, particularly under domain shift, but also unlock cross-modal retrieval capabilities, a downstream task that is not explicitly optimized during training.
[113] Hybrid Stackelberg Game and Diffusion-based Auction for Two-tier Agentic AI Task Offloading in Internet of Agents
Yue Zhong, Yongju Tong, Jiawen Kang, Minghui Dai, Hong-Ning Dai, Zhou Su, Dusit Niyato
🧩 TL;DR
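This paper proposes a two-tier optimization scheme for the Internet of Agents that uses a Stackelberg game and a Double Dutch Auction to offload compute-intensive tasks from wireless agents, and solves the model with a diffusion-based deep reinforcement learning algorithm.
📘 Detailed Summary
Motivation: In the Internet of Agents, wireless agents have limited onboard resources and must offload compute-intensive AI services to nearby servers. Existing schemes do not effectively address coordinated offloading when mobile agents face dynamic connectivity limits and fixed agents become overloaded.
Method: The paper proposes a two-tier optimization approach. The first tier uses a multi-leader multi-follower Stackelberg game in which mobile agents and fixed agents act as leaders that set resource prices and wireless agents act as followers that determine task offloading ratios. The second tier introduces a Double Dutch Auction in which overloaded fixed agents act as buyers requesting resources and aerial agents act as sellers providing them. A diffusion-based deep reinforcement learning algorithm solves the model.
Result: Numerical results indicate that the proposed scheme is superior at facilitating task offloading and can coordinate resource allocation among mobile, fixed, and aerial agents, improving overall system performance.
Conclusion: The study offers a framework for resource optimization in the Internet of Agents. Its layered game and auction mechanisms enable coordinated offloading among heterogeneous agents and provide theoretical and algorithmic support for deploying large-scale interconnected intelligent systems.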
📄 Abstract
The Internet of Agents (IoA) is rapidly gaining prominence as a foundational architecture for interconnected intelligent systems, designed to facilitate seamless discovery, communication, and collaborative reasoning among a vast network of Artificial Intelligence (AI) agents. Powered by Large Language and Vision-Language Models, IoA enables the development of interactive, rational agents capable of complex cooperation, moving far beyond traditional isolated models. IoA involves physical entities, i.e., Wireless Agents (WAs) with limited onboard resources, which need to offload their compute-intensive agentic AI services to nearby servers. Such servers can be Mobile Agents (MAs), e.g., vehicle agents, or Fixed Agents (FAs), e.g., end-side units agents. Given their fixed geographical locations and stable connectivity, FAs can serve as reliable communication gateways and task aggregation points. This stability allows them to effectively coordinate with and offload to an Aerial Agent (AA) tier, which has an advantage not affordable for highly mobile MAs with dynamic connectivity limitations. As such, we propose a two-tier optimization approach. The first tier employs a multi-leader multi-follower Stackelberg game. In the game, MAs and FAs act as the leaders who set resource prices. WAs are the followers to determine task offloading ratios. However, when FAs become overloaded, they can further offload tasks to available aerial resources. Therefore, the second tier introduces a Double Dutch Auction model where overloaded FAs act as the buyers to request resources, and AAs serve as the sellers for resource provision. We then develop a diffusion-based Deep Reinforcement Learning algorithm to solve the model. Numerical results demonstrate the superiority of our proposed scheme in facilitating task offloading.
[114] WearVQA: A Visual Question Answering Benchmark for Wearables in Egocentric Authentic Real-world scenarios
Eun Chang, Zhuangqun Huang, Yiwei Liao, Sagar Ravi Bhavsar, Amogh Param, Tammy Stark, Adel Ahmadyan, Xiao Yang, Jiaqi Wang, Ahsan Abdullah, Giang Nguyen, Akil Iyer, David Hall, Elissa Li, Shane Moon, Nicolas Scheffer, Kirmani Ahmed, Babak Damavandi, Rakesh Wanga, Anuj Kumar, Rohit Patel, Xin Luna Dong
🧩 TL;DR
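This paper presents WearVQA, the first benchmark designed specifically to evaluate the visual question answering ability of multimodal AI assistants on wearable devices. It contains 2,520 carefully curated image-question-answer triplets covering wearable-specific visual challenges and realistic use cases.
📘 Detailed Summary
Motivation: Existing VQA benchmarks focus on high-quality, third-person imagery and do not reflect the distinct challenges of egocentric interaction on wearables, where visual input may be occluded, poorly lit, unzoomed, or blurry. An evaluation framework for realistic wearable use cases is also missing.
Method: The team builds WearVQA with 2,520 image-question-answer triplets spanning 7 image domains (both text-centric and general scenes), 10 cognitive task types (from basic recognition to various forms of reasoning), and 6 common wearable-specific image quality issues, paired with an LLM-as-a-judge evaluation framework with 96% labeling accuracy.
Result: Open-source and proprietary multimodal LLMs reach only 24-52% QA accuracy on WearVQA, with substantial drops on lower-quality images and reasoning-heavy tasks, indicating clear weaknesses in handling wearable-specific visual challenges.
Conclusion: As a comprehensive and challenging benchmark, WearVQA exposes the limits of current multimodal AI systems in realistic wearable scenarios and provides an evaluation tool to guide the development of more robust, practical wearable AI systems.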
📄 Abstract
We introduce WearVQA, the first benchmark specifically designed to evaluate the Visual Question Answering (VQA) capabilities of multimodal AI assistants on wearable devices like smart glasses. Unlike prior benchmarks that focus on high-quality, third-person imagery, WearVQA reflects the unique challenges of egocentric interaction, where visual inputs may be occluded, poorly lit, unzoomed, or blurry, and questions are grounded in realistic wearable use cases. The benchmark comprises 2,520 carefully curated image-question-answer triplets, spanning 7 diverse image domains including both text-centric and general scenes, 10 cognitive task types ranging from basic recognition to various forms of reasoning, and 6 common wearables-specific image quality issues. All questions are designed to be answerable using only the visual input and common sense. WearVQA is paired with a rigorous LLM-as-a-judge evaluation framework with 96% labeling accuracy. Open-source and proprietary multimodal LLMs achieved QA accuracy of only 24-52% on WearVQA, with substantial drops on lower-quality images and reasoning-heavy tasks. These observations position WearVQA as a comprehensive and challenging benchmark for guiding technical advancement towards robust, real-world multimodal wearable AI systems.
[115] Training High-Level Schedulers with Execution-Feedback Reinforcement Learning for Long-Horizon GUI Automation
Zehao Deng, Tianjie Ju, Zheng Wu, Zhuosheng Zhang, Gongshen Liu
🧩 TL;DR
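This paper proposes the Coordinator-Executor-State Tracker (CES) multi-agent framework, which trains high-level scheduling models to address responsibility coupling and state management in long-horizon GUI agent tasks, significantly improving the long-horizon capability of various low-level Executors.
📘 Detailed Summary
Motivation: GUI agents driven by large vision-language models face two main challenges on long-horizon tasks. Single-agent models struggle to balance high-level planning with low-level execution, leading to responsibility coupling and capability conflicts. Agents also lack awareness of task state, which causes progress loss over long horizons.
Method: The paper proposes a staged execution-feedback reinforcement learning algorithm that trains high-level scheduling models rather than a unified policy model. It builds the CES multi-agent framework, comprising a Coordinator responsible for strategic planning and task decomposition and a State Tracker responsible for context compression and information management. The framework can be integrated with any low-level Executor model.
Result: Experiments on long-horizon task benchmarks show that CES significantly strengthens the system's planning and state management. Analysis confirms that the trained high-level scheduling module generalizes and works as a plug-and-play module that significantly improves the long-horizon capability of various Executors.
Conclusion: By decoupling high-level scheduling from low-level execution, CES addresses responsibility coupling in long-horizon GUI tasks. The study shows the advantages of a modular multi-agent architecture for complex tasks and offers a new paradigm for general GUI agent design.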
📄 Abstract
The rapid development of large vision-language models (VLMs) has greatly advanced research on GUI agents. However, GUI agents still face significant challenges in handling long-horizon tasks. First, single-agent models struggle to balance high-level capabilities and low-level execution capabilities, facing prevalent issues of responsibility coupling and capability conflicts. Second, agents lack awareness of the task state, leading to progress loss in long-horizon tasks. To address these challenges, we propose a staged execution-feedback reinforcement learning algorithm. Unlike training a unified policy model, we focus on training high-level scheduling models. Specifically, we propose and train two agents: a Coordinator, responsible for strategic planning and task decomposition; and a State Tracker, responsible for context compression and information management to maintain the task's state and coherence. Based on this, we built the Coordinator-Executor-State Tracker (CES) multi-agent framework, which can be integrated with any low-level Executor model, assisting the Executor in solving long-horizon tasks through task scheduling and state management. Experiments on long-horizon task benchmarks demonstrate that CES significantly enhances the system's planning and state management capabilities. Furthermore, analysis confirms that our trained high-level scheduling module is a generalizable, plug-and-play module that significantly enhances the long-horizon capabilities of various Executors. Code is available at https://github.com/hehehahi4/CES.
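The division of labor among the three roles can be sketched as a simple control loop. This is only an illustration of the interaction pattern described in the abstract, not the released CES code: the function name, the callable signatures, the string-valued state, and the convention that the Coordinator returns `None` when the goal is done are all assumptions.

```python
def run_ces(goal, coordinator, executor, state_tracker, max_steps=20):
    """Coordinator plans, Executor acts, State Tracker compresses history.

    coordinator(goal, state)                     -> next subtask, or None when done
    executor(subtask)                            -> observation from the GUI
    state_tracker(state, subtask, observation)   -> updated compact state
    (All three interfaces are hypothetical.)
    """
    state = ""  # compact task state kept instead of the full interaction history
    for step in range(max_steps):
        subtask = coordinator(goal, state)
        if subtask is None:
            return state, step
        observation = executor(subtask)
        state = state_tracker(state, subtask, observation)
    return state, max_steps
```

Because the Executor only ever sees one subtask at a time, it can be swapped for a different low-level model without touching the planning or state logic, which is the plug-and-play property the abstract emphasizes.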
[116] Structured Extraction from Business Process Diagrams Using Vision-Language Models
Pritam Deka, Barry Devereux
🧩 TL;DR
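This paper proposes a pipeline that uses vision-language models to extract structured JSON representations directly from BPMN diagram images, without source model files or textual annotations, and uses OCR-based text enrichment to improve model performance.
📘 Detailed Summary
Motivation: BPMN is a widely adopted standard for business process modeling, and its diagrams are often exchanged as images. Existing methods rely mainly on XML representations for computational analysis, so there is no effective way to parse a diagram when the original source file is unavailable.
Method: The approach builds a pipeline combining vision-language models with optical character recognition. VLMs extract structured JSON representations directly from BPMN images, OCR enriches the textual content, and the generated element lists are evaluated against ground truth derived from the source XML files.
Result: The experiments benchmark multiple VLMs and find that OCR text enrichment improves the performance of several models. Extensive statistical analyses of OCR enrichment methods and prompt ablations clarify how each affects model performance.
Conclusion: The study shows that extracting structured representations directly from BPMN images is feasible, offers a practical option for business process analysis when source files are unavailable, and clarifies the role of text enrichment in vision-language model applications.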
📄 Abstract
Business Process Model and Notation (BPMN) is a widely adopted standard for representing complex business workflows. While BPMN diagrams are often exchanged as visual images, existing methods primarily rely on XML representations for computational analysis. In this work, we present a pipeline that leverages Vision-Language Models (VLMs) to extract structured JSON representations of BPMN diagrams directly from images, without requiring source model files or textual annotations. We also incorporate optical character recognition (OCR) for textual enrichment and evaluate the generated element lists against ground truth data derived from the source XML files. Our approach enables robust component extraction in scenarios where original source files are unavailable. We benchmark multiple VLMs and observe performance improvements in several models when OCR is used for text enrichment. In addition, we conducted extensive statistical analyses of OCR-based enrichment methods and prompt ablation studies, providing a clearer understanding of their impact on model performance.
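Evaluating a generated element list against ground truth, as described above, can be illustrated with set-based precision, recall, and F1. The JSON layout (`elements` with `type` and `name` fields), the case-insensitive name matching, and the sample diagram are assumptions for this sketch; the abstract does not specify the schema or the matching rule the authors use.

```python
import json


def element_set(doc):
    """Normalize a parsed diagram into a set of (type, name) pairs."""
    return {(e["type"], e["name"].strip().lower()) for e in doc["elements"]}


def precision_recall_f1(predicted, gold):
    """Set-overlap metrics between predicted and ground-truth elements."""
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical VLM output and ground truth derived from the source XML.
vlm_json = json.loads("""{"elements": [
    {"type": "startEvent", "name": "Order received"},
    {"type": "task", "name": "Check stock"},
    {"type": "task", "name": "Ship order"}
]}""")
gold_json = json.loads("""{"elements": [
    {"type": "startEvent", "name": "Order received"},
    {"type": "task", "name": "check stock"},
    {"type": "exclusiveGateway", "name": "In stock?"},
    {"type": "endEvent", "name": "Order closed"}
]}""")

p, r, f1 = precision_recall_f1(element_set(vlm_json), element_set(gold_json))
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.667 0.5 0.571
```

Exact matching on names is strict; a pipeline that adds OCR text would mostly move this kind of score by correcting labels the vision model misread.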
[117] Agentic AI Framework for Individuals with Disabilities and Neurodivergence: A Multi-Agent System for Healthy Eating, Daily Routines, and Inclusive Well-Being
Salman Jan, Toqeer Ali Syed, Gohar Ali, Ali Akarma, Mohammad Riyaz Belgaum, Ahmad Ali
🧩 TL;DR
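This paper proposes an agentic AI framework for people with disabilities and neurodivergence. A multi-layer architecture with four cooperating specialized agents provides personalized, adaptive, and transparent support for healthy living.
📘 Detailed Summary
Motivation: Traditional assistive systems fall short on inclusiveness, personalization, and accessibility and struggle to meet the complex needs of people with disabilities and neurodivergence. The study aims to develop an agentic AI system that helps them lead healthier lives with more regular daily routines.
Method: The design uses a three-layer architecture: an Application and Interface Layer, an Agents Layer, and a Data Source Layer. At its core, a hybrid reasoning engine synchronizes four specialized agents (Meal Planner, Reminder, Food Guidance, and Monitoring), which interact autonomously with real-time feedback through a Blackboard/Event Bus. The framework integrates multimodal data sources including electronic health records, nutritional databases, wearable sensors, and smart kitchen IoT devices.
Result: The framework is designed to deliver adaptive, transparent, and inclusive support: explainable AI modules explain decisions, a policy-controlled layer enforces data safety and compliance, clinician dashboards support shared supervision, and multimedia user interfaces provide real-time interactive feedback.
Conclusion: The study brings together multi-agent reasoning, multimodal interfaces, and human-centered design. It goes beyond traditional assistive systems and offers a framework for promoting autonomy, health, and digital equity among people with disabilities and neurodivergence.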
📄 Abstract
The paper presents a detailed Agentic Artificial Intelligence (AI) model that would enable people with disabilities and neurodivergence to lead healthier lives and have more regular days. The system will use a multi-layer structure; it will include an Application and Interface Layer, an Agents Layer, and a Data Source Layer to provide adaptive, transparent, and inclusive support. Fundamentally, a hybrid reasoning engine will synchronize four special-purpose agents, which include: a personalized-nutrition-based, called a Meal Planner Agent; an adaptive-scheduling-based, called a Reminder Agent; interactive assistance during grocery shopping and cooking, called a Food Guidance Agent; and a continuous-intake-and-physiological-tracking, called a Monitoring Agent. All the agents interact through a central communicative system called the Blackboard/Event Bus, which allows autonomous interaction and real-time feedback loops with multimedia user interfaces. Privacy-sensitive data sources, including electronic health records (EHRs), nutritional databases, wearable sensors, and smart kitchen Internet of Things, are also included in the framework and placed into a policy-controlled layer, which ensures data safety and compliance with consent. Collaborative care and clinician dashboards allow common supervision, and explainable artificial intelligence (XAI) modules give brief explanations of why a decision was made, making users responsible and reliant. The proposed agentic AI framework is an extension beyond traditional assistive systems since it incorporates inclusiveness, personalization, and accessibility at all levels. It displays the intersection of multi-agent reasoning, multi-modal interfaces, and human-centered design that will enable the development of autonomy, health, and digital equity among people with disabilities and neurodivergence.
[118] TIM-PRM: Verifying multimodal reasoning with Tool-Integrated PRM
Peng Kuang, Xiangxiang Wang, Wentao Liu, Jian Dong, Kaidi Xu, Haohan Wang
🧩 TL;DR
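This paper proposes TIM-PRM (Tool-Integrated Multimodal Process Reward Model), an agentic framework that turns verification from a passive classification task into an active, tool-augmented investigation. Its Independent Question Asking mechanism eliminates confirmation bias and significantly improves the reliability and interpretability of multimodal mathematical reasoning.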
📘 Detailed Summary
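Motivation: Multimodal large language models perform well at mathematical reasoning but remain vulnerable to visual hallucinations and logical inconsistencies, which outcome-based supervision fails to mitigate. Existing process reward models typically act as scalar scorers or generative critics and tend toward sycophancy, blindly validating flawed hypotheses instead of grounding them in visual reality. A new verification framework is therefore needed to eliminate confirmation bias.
Method: The paper proposes the TIM-PRM framework, which turns verification from a passive classification task into an active, tool-augmented investigation. The model is trained to plan verification strategies explicitly and uses an Independent Question Asking mechanism to query evidence through external tools, decoupling verification from the reasoning context to eliminate confirmation bias. The method is instantiated by curating a high-quality dataset of tool-integrated verification trajectories, yielding an 8B-parameter model.
Result: Extensive experiments on VisualProcessBench show that the 8B-parameter TIM-PRM model surpasses existing open-source multimodal PRMs and significantly outperforms much larger models such as Qwen2.5-72B and InternVL-78B. Beyond performance, it offers interpretable insight into the verification process, showing that a small model with a tool-augmented verification strategy can outperform larger ones.
Conclusion: By recasting verification as an active investigation, TIM-PRM addresses confirmation bias in multimodal reasoning. The work shows the importance of tool-augmented verification strategies and that a small model with an appropriate design can outperform larger ones. It points toward more reliable and interpretable multimodal reasoning systems and highlights the value of active evidence gathering in process verification.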
📄 Abstract
Multimodal Large Language Models (MLLMs) have achieved impressive performances in mathematical reasoning, yet they remain vulnerable to visual hallucinations and logical inconsistencies that standard outcome-based supervision fails to mitigate. While Process Reward Models (PRMs) promise step-by-step verification, current approaches typically operate as scalar scorers or generative critics that suffer from sycophancy, blindly validating the flawed hypotheses rather than grounding them in visual reality. To bridge this gap, we introduce TIM-PRM (Tool-Integrated Multimodal PRM), a novel agentic framework that transforms verification from a passive classification task into an active, tool-augmented investigation. TIM-PRM is trained to explicitly plan verification strategies and utilizes a mechanism of Independent Question Asking to query evidence via external tools, effectively decoupling verification from the reasoning context to eliminate confirmation bias. We instantiate this method by curating a high-quality dataset of tool-integrated verification trajectories. Extensive experiments on VisualProcessBench demonstrate that our 8B parameter model surpasses existing open-source multimodal PRMs, significantly outperforming much larger models like Qwen2.5-72B and InternVL-78B, while offering interpretable insights into the verification process.
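The idea of decoupling verification from the reasoning context can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's implementation: the `Check` structure, `verify_step`, and `toy_tool` are hypothetical names, the checks are supplied by hand (in TIM-PRM the model plans them), and a dictionary lookup stands in for a real visual tool. The point it shows is that the tool sees only a self-contained question, never the step text or the claimed value, so the evidence cannot be swayed by the hypothesis being verified.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Check:
    question: str  # self-contained question about the image
    claimed: str   # value the reasoning step asserts


def verify_step(checks: List[Check], tool: Callable[[str], str]) -> dict:
    """Mark a step correct only if every independently gathered answer matches its claim."""
    evidence = []
    for check in checks:
        # The tool receives only the question, not the step text or claimed value.
        observed = tool(check.question)
        evidence.append({
            "question": check.question,
            "claimed": check.claimed,
            "observed": observed,
            "match": observed == check.claimed,
        })
    return {"correct": all(e["match"] for e in evidence), "evidence": evidence}


def toy_tool(question: str) -> str:
    """Hypothetical stand-in for a visual question-answering tool."""
    facts = {
        "What is the length of side AB?": "5",
        "What is the length of side BC?": "12",
    }
    return facts.get(question, "unknown")


# A reasoning step that reads AB correctly but misreads BC as 13.
step_checks = [
    Check("What is the length of side AB?", "5"),
    Check("What is the length of side BC?", "13"),
]
result = verify_step(step_checks, toy_tool)
print(result["correct"])
for item in result["evidence"]:
    print(item["question"], "claimed", item["claimed"], "observed", item["observed"])
```

The returned `evidence` list is what makes the verdict inspectable: each rejected step carries the question asked and the mismatch that caused the rejection.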
[119] MindPower: Enabling Theory-of-Mind Reasoning in VLM-based Embodied Agents
Ruoxuan Zhang, Qiyun Zheng, Zhiyu Zhou, Ziqi Liao, Siyu Wu, Jian-Yu Jiang-Lin, Bin Wen, Hongxia Xie, Jianlong Fu, Wen-Huang Cheng
🧩 TL;DR
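This paper proposes MindPower, a robot-centric framework that integrates perception, mental reasoning, decision making, and action. By introducing the Mind-Reward optimization objective, it significantly improves the ability of vision-language embodied agents to make decisions and generate actions based on Theory-of-Mind reasoning.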
📘 Detailed Summary
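Motivation: Current vision-language embodied agents lack Theory-of-Mind-based decision making, and existing benchmarks focus only on human mental states while ignoring the agent's own perspective, which leads to incoherent decision and action generation. This work aims to address that limitation.
Method: MindPower is a robot-centric framework with four modules: perception, mental reasoning, decision making, and action. It first perceives the environment and human states, then performs Theory-of-Mind reasoning to model the mental states of both self and others, and finally generates decisions and actions from the inferred states. The Mind-Reward optimization objective encourages vision-language models to produce consistent mental reasoning and behavior.
Result: Experiments show that MindPower outperforms GPT-4o by 12.77% in decision making and by 12.49% in action generation, a significant gain for embodied agents that rely on Theory-of-Mind reasoning.
Conclusion: The work shows that jointly modeling the mental states of self and others matters for coherent decision making in embodied agents. The Mind-Reward objective provides an effective mechanism for consistent reasoning and behavior generation in vision-language models and points to a direction for developing Theory-of-Mind capabilities in embodied AI.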
📄 Abstract
Theory of Mind (ToM) refers to the ability to infer others' mental states, such as beliefs, desires, and intentions. Current vision-language embodied agents lack ToM-based decision-making, and existing benchmarks focus solely on human mental states while ignoring the agent's own perspective, hindering coherent decision and action generation. To address this, we propose MindPower, a Robot-Centric framework integrating Perception, Mental Reasoning, Decision Making and Action. Given multimodal inputs, MindPower first perceives the environment and human states, then performs ToM Reasoning to model both self and others, and finally generates decisions and actions guided by inferred mental states. Furthermore, we introduce Mind-Reward, a novel optimization objective that encourages VLMs to produce consistent ToM Reasoning and behavior. Our model outperforms GPT-4o by 12.77% in decision making and 12.49% in action generation.
[120] AgriCoT: A Chain-of-Thought Benchmark for Evaluating Reasoning in Vision-Language Models for Agriculture
Yibin Wen, Qingmei Li, Zi Ye, Jiarui Zhang, Jing Wu, Zurong Mai, Shuohong Lou, Yuhang Chen, Henglian Huang, Xiaoya Fan, Yang Zhang, Lingyuan Zhao, Haohuan Fu, Huang Jianxi, Juepeng Zheng
🧩 TL;DR
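This paper proposes AgriCoT, a visual question answering dataset with Chain-of-Thought reasoning designed to evaluate the reasoning capabilities of vision-language models in agriculture. It reveals a notable shortfall in the reasoning ability of current models in complex agricultural scenarios.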
📘 Detailed Summary
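Motivation: Existing visual question answering datasets are used to evaluate vision-language models, but they often fail to assess the critical reasoning and problem-solving skills required in complex agricultural contexts. This limits how well models can be judged for practical applications such as precision farming, crop monitoring, and pest detection.
Method: The authors built the AgriCoT dataset of 4,535 carefully curated samples that incorporate Chain-of-Thought reasoning. It is designed to evaluate the logical reasoning and problem-solving abilities of vision-language models in zero-shot settings. They evaluated 26 representative models, both proprietary and open-source.
Result: Although some proprietary models answer questions well, all models show a notable gap in reasoning capability. This underscores the importance of Chain-of-Thought reasoning for more precise evaluation and exposes the limits of current vision-language models on complex agricultural reasoning tasks.
Conclusion: The work stresses the need to incorporate Chain-of-Thought reasoning into vision-language model evaluation, provides a more precise benchmark for agricultural AI, and indicates that future model development should focus on logical reasoning and complex problem solving.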
📄 Abstract
Recent advancements in Vision-Language Models (VLMs) have significantly transformed various industries. In agriculture, these dual-modal capabilities offer promising applications such as precision farming, crop monitoring, pest detection, and environmental sustainability. While several Visual Question Answering (VQA) datasets and benchmarks have been developed to evaluate VLM performance, they often fail to adequately assess the critical reasoning and problem-solving skills required in complex agricultural contexts. To address this gap, we introduce AgriCoT, a VQA dataset that incorporates Chain-of-Thought (CoT) reasoning, specifically designed to evaluate the reasoning capabilities of VLMs. With 4,535 carefully curated samples, AgriCoT offers a comprehensive and robust evaluation of reasoning abilities for VLMs, particularly in zero-shot scenarios, by focusing on their capacity to engage in logical reasoning and effective problem-solving. Our evaluations, conducted with 26 representative VLMs, including both proprietary and open-source models, reveal that while some proprietary models excel at answering questions, there is a notable and significant gap in their reasoning capabilities. This underscores the importance of incorporating CoT for more precise and effective assessments. Our dataset is available at https://huggingface.co/datasets/wenyb/AgriCoT.
[121] Adapting Like Humans: A Metacognitive Agent with Test-time Reasoning
Yang Li, Zhiyuan He, Yuxuan Huang, Zhuhanling Xiao, Chao Yu, Meng Fang, Kun Shao, Jun Wang
🧩 TL;DR
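This paper proposes the metacognitive test-time reasoning (MCTR) framework, which gives vision-language models the ability to update themselves metacognitively so that they can learn, adapt, and improve at test time, narrowing the gap between models and humans in adapting to new tasks.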
📘 Detailed Summary
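Motivation: Current vision-language models show strong perceptual reasoning but often struggle to adapt efficiently to novel tasks at test time. Humans, in contrast, draw on a metacognitive model with memory and refine their strategies continuously through metacognitive control when facing new challenges. This work aims to bridge that gap.
Method: MCTR comprises meta-level and object-level VLM reasoning modules, each with a dedicated memory system for hierarchical adaptive reasoning. The meta-reasoning module incrementally builds a structured memory by discovering task-relevant rules, environmental patterns, and action-outcome relationships from test-time observations and storing them as natural language descriptions. The action-reasoning module determines optimal actions through context-aware perception and strategic reasoning, dynamically retrieving and integrating knowledge from memory, and continuously updates its policy through the proposed metacognitive test-time reinforcement learning.
Result: Evaluated on 45 Atari games (33 seen, 12 unseen), MCTR shows robust test-time adaptation and achieves 9/12 top-1 results on unseen games compared with baselines. Ablations, learning dynamics, and case studies reveal the complementary contributions of the two components and show meta-reasoning evolving toward human-like adaptation strategies.
Conclusion: The work shows that equipping vision-language models with metacognitive abilities significantly improves their adaptation at test time. Through hierarchical memory systems and adaptive learning, the framework lets models refine their strategies continuously as humans do, pointing toward more adaptive and general AI systems.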
📄 Abstract
Recent Vision-Language Models (VLMs) exhibit strong perceptual reasoning abilities, yet they often struggle to adapt efficiently when encountering novel tasks at test time. In contrast, humans leverage the metacognitive model with memory, enabling continuous strategy refinement through metacognitive control when faced with new challenges. To bridge this gap, we propose metacognitive test-time reasoning (MCTR), a framework that equips models with the ability to learn, adapt, and improve during test time through metacognitive self-updating. Inspired by the dual structure of human metacognition, MCTR comprises meta-level and object-level VLM reasoning modules, each equipped with dedicated memory systems for hierarchical adaptive reasoning. Specifically, MCTR consists of (1) a meta-reasoning module which incrementally builds a structured memory by discovering and storing task-relevant rules, environmental patterns, and action-outcome relationships from test-time observations as natural language descriptions; and (2) an action-reasoning module that determines optimal actions through context-aware perception and strategic reasoning by dynamically retrieving and integrating knowledge from memory. The action-reasoning module continuously updates its policy through proposed metacognitive test-time reinforcement learning, adapting as knowledge memory evolves. We evaluate MCTR on 45 Atari games (33 seen, 12 unseen). MCTR demonstrates robust test-time adaptation, achieving 9/12 top-1 results on unseen games compared with baselines. Analyses through ablations, learning dynamics, and case studies reveal the complementary contributions of both components and show meta-reasoning evolving toward human-like adaptation strategies.
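The two-module loop described in this abstract can be sketched in a few lines. This is a toy under loud assumptions: `RuleMemory`, `meta_reason`, and `act` are hypothetical names, string templates and word overlap stand in for the two VLM modules and for retrieval, and the metacognitive test-time reinforcement learning update is omitted entirely. It shows only the data flow: outcomes become natural-language rules, and later actions are chosen by retrieving those rules.

```python
from typing import List


class RuleMemory:
    """Meta-level memory: natural-language rules discovered at test time."""

    def __init__(self) -> None:
        self.rules: List[str] = []

    def add(self, rule: str) -> None:
        if rule not in self.rules:
            self.rules.append(rule)

    def retrieve(self, observation: str, k: int = 2) -> List[str]:
        # Toy relevance score: number of words shared with the observation.
        obs_words = set(observation.lower().split())
        scored = [(len(obs_words & set(r.lower().split())), r) for r in self.rules]
        scored = [s for s in scored if s[0] > 0]
        scored.sort(key=lambda s: s[0], reverse=True)
        return [rule for _, rule in scored[:k]]


def meta_reason(memory: RuleMemory, observation: str, action: str, reward: float) -> None:
    """Stand-in for the meta-reasoning module: store an action-outcome rule."""
    verdict = "helps" if reward > 0 else "hurts"
    memory.add(f"when {observation} then {action} {verdict}")


def act(memory: RuleMemory, observation: str, actions: List[str]) -> str:
    """Stand-in for the action-reasoning module: prefer actions that memory says help."""
    rules = memory.retrieve(observation)
    for action in actions:
        if any(f"then {action} helps" in rule for rule in rules):
            return action
    for action in actions:
        if not any(f"then {action} hurts" in rule for rule in rules):
            return action
    return actions[0]


memory = RuleMemory()
meta_reason(memory, "ball approaching left", "LEFT", reward=1.0)
meta_reason(memory, "ball approaching left", "RIGHT", reward=-1.0)
choice = act(memory, "ball approaching left", ["RIGHT", "LEFT", "NOOP"])
print(choice)
```

With an empty memory `act` falls back to the first action, so behavior changes only as rules accumulate, which mirrors the abstract's point that the policy adapts as the knowledge memory evolves.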
[122] OctoMed: Data Recipes for State-of-the-Art Multimodal Medical Reasoning
Timothy Ossowski, Sheng Zhang, Qianchu Liu, Guanghui Qin, Reuben Tan, Tristan Naumann, Junjie Hu, Hoifung Poon
🧩 TL;DR
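This work proposes a data recipe for medical large language models that uses structured reasoning traces for supervised fine-tuning. On a dataset of over 8 million examples and 6.8 billion response tokens, it achieves state-of-the-art performance among open-source models across diverse medical benchmark tasks.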
📘 Detailed Summary
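Motivation: High-quality, carefully curated data is a cornerstone of training medical large language models and directly affects generalization and robustness to unseen clinical tasks. This work investigates training and data curation strategies for building a robust multimodal reasoning model in the medical domain, with particular attention to how data recipes can improve reasoning.
Method: The work focuses on supervised fine-tuning and explores data recipes that leverage structured reasoning traces. It curates a high-quality, diverse training dataset with structured reasoning traces of varying lengths, which lets the model self-calibrate its reasoning trajectory length to the downstream task without explicit supervision.
Result: Using the proposed data recipe, the authors scale experiments to a dataset of over 8 million examples and 6.8 billion response tokens and achieve state-of-the-art performance among open-source models across diverse out-of-distribution medical benchmark tasks. The results indicate that curating training data with varying structured reasoning trace lengths enables the fine-tuned model to self-calibrate its reasoning trajectory length to the downstream task.
Conclusion: The work offers key insights into data curation for medical large language models and shows the importance of structured reasoning traces for robustness. It lays groundwork for robust medical vision-language reasoning systems and outlines next steps, including the use of data recipes to obtain self-calibrated reasoning.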
📄 Abstract
High-quality and carefully curated data is a cornerstone of training medical large language models, as it directly impacts both generalization and robustness to unseen clinical tasks. We investigate strategies for training and data curation to develop a robust multimodal reasoning model in the medical domain. Our work focuses on supervised fine-tuning (SFT) and explores data recipes that leverage structured reasoning traces. Using our proposed data recipe, we scale experiments to a dataset of over 8 million examples and 6.8 billion response tokens, achieving state-of-the-art performance among open-source models across diverse out-of-distribution medical benchmark tasks. Our results further indicate that curating a high-quality, diverse training dataset with varying structured reasoning trace lengths enables the fine-tuned model to self-calibrate its reasoning trajectory lengths based on the downstream task, without explicit supervision. We present key insights, describe the data curation strategy, and outline next steps toward developing robust medical vision-language reasoning systems.
[123] Multi-Modal Scene Graph with Kolmogorov-Arnold Experts for Audio-Visual Question Answering
Zijian Fu, Changsheng Lv, Mengshi Qi, Huadong Ma
🧩 TL;DR
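This paper proposes SHRIKE, a model that introduces a multi-modal scene graph to explicitly model objects and their relationships in audio-visual scenes and designs a Mixture of Experts based on Kolmogorov-Arnold Networks. It achieves state-of-the-art audio-visual question answering performance on the MUSIC-AVQA benchmark.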
📘 Detailed Summary
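Motivation: Existing audio-visual question answering methods fail to capture the structural information in video and model multi-modal features at insufficient granularity. As a result, they struggle to identify question-relevant cues in complex audio-visual content, which limits how well they can mimic human reasoning.
Method: The paper proposes a new multi-modal scene graph that explicitly models the objects in an audio-visual scene and their relationships, forming a visually grounded, structured representation of the scene. It also designs a Mixture of Experts based on Kolmogorov-Arnold Networks to increase the expressive power of the temporal integration stage, enabling fine-grained modeling of cross-modal interactions within the question-aware fused audio-visual representation.
Result: On the MUSIC-AVQA and MUSIC-AVQA v2 benchmarks, SHRIKE achieves state-of-the-art performance, supporting the effectiveness of the multi-modal scene graph and the KAN-based MoE in improving temporal reasoning for audio-visual question answering.
Conclusion: Through structured scene representation and fine-grained modeling of cross-modal interactions, the work significantly improves the reasoning ability of audio-visual question answering systems and offers a new technical framework for multi-modal understanding. The planned public release of code and model checkpoints should support further progress in the field.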
📄 Abstract
In this paper, we propose a novel Multi-Modal Scene Graph with Kolmogorov-Arnold Expert Network for Audio-Visual Question Answering (SHRIKE). The task aims to mimic human reasoning by extracting and fusing information from audio-visual scenes, with the main challenge being the identification of question-relevant cues from the complex audio-visual content. Existing methods fail to capture the structural information within video, and suffer from insufficient fine-grained modeling of multi-modal features. To address these issues, we are the first to introduce a new multi-modal scene graph that explicitly models the objects and their relationships as a visually grounded, structured representation of the audio-visual scene. Furthermore, we design a Kolmogorov-Arnold Network (KAN)-based Mixture of Experts (MoE) to enhance the expressive power of the temporal integration stage. This enables more fine-grained modeling of cross-modal interactions within the question-aware fused audio-visual representation, capturing richer and more nuanced patterns and improving temporal reasoning performance. We evaluate the model on the established MUSIC-AVQA and MUSIC-AVQA v2 benchmarks, where it achieves state-of-the-art performance. Code and model checkpoints will be publicly released.