Table of Contents

cs.CV

[1] SAIL-RL: Guiding MLLMs in When and How to Think via Dual-Reward RL Tuning

Fangxun Shu, Yongjie Ye, Yue Liao, Zijian Kang, Weijie Yin, Jiacong Wang, Xiao Liang, Shuicheng Yan, Chao Feng

🧩 TL;DR

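SAIL-RL is a reinforcement learning post-training framework that strengthens the reasoning of multimodal large language models by teaching them when and how to think. It uses a dual reward system that evaluates reasoning quality and adaptively decides whether deep reasoning or a direct answer is appropriate.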


📘 Detailed Summary

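Motivation: Existing methods are limited by outcome-only supervision, which rewards correct answers without ensuring that the reasoning behind them is sound, and by uniform thinking strategies, which lead to overthinking on simple tasks and underthinking on complex ones.

Method: SAIL-RL proposes a dual reward system. The Thinking Reward evaluates reasoning quality through factual grounding, logical coherence, and answer consistency; the Judging Reward adaptively determines whether deep reasoning or direct answering is needed. The framework is applied as RL-based post-training.

Result: Experiments on SAIL-VL2 show that SAIL-RL improves reasoning and multimodal understanding benchmarks at both the 4B and 8B scales, is competitive with commercial closed-source models such as GPT-4o, and substantially reduces hallucinations.

Conclusion: SAIL-RL offers a principled framework for building more reliable and adaptive multimodal large language models. By teaching models when and how to think, it addresses the limitations of existing methods in both reasoning quality and efficiency.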


📄 Abstract

We introduce SAIL-RL, a reinforcement learning (RL) post-training framework that enhances the reasoning capabilities of multimodal large language models (MLLMs) by teaching them when and how to think. Existing approaches are limited by outcome-only supervision, which rewards correct answers without ensuring sound reasoning, and by uniform thinking strategies, which often lead to overthinking on simple tasks and underthinking on complex ones. SAIL-RL addresses these challenges with a dual reward system: the Thinking Reward, which evaluates reasoning quality through factual grounding, logical coherence, and answer consistency, and the Judging Reward, which adaptively determines whether deep reasoning or direct answering is appropriate. Experiments on the state-of-the-art SAIL-VL2 show that SAIL-RL improves reasoning and multimodal understanding benchmarks at both 4B and 8B scales, achieving competitive performance against commercial closed-source models such as GPT-4o, and substantially reduces hallucinations, establishing it as a principled framework for building more reliable and adaptive MLLMs. The code will be available at https://github.com/BytedanceDouyinContent/SAIL-RL.
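The dual reward described above can be illustrated with a minimal sketch. The abstract does not say how the two rewards are scored or combined, so everything here is an assumption made for illustration: the `Rollout` fields, the equal-weight mean inside `thinking_reward`, the binary `judging_reward`, and the weights `w_think` and `w_judge` are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """One sampled response plus judge scores (all fields are hypothetical)."""
    answer_correct: bool       # outcome check against the reference answer
    used_thinking: bool        # did the model emit a reasoning trace?
    needs_thinking: bool       # assumed label: does the task call for deep reasoning?
    factual_grounding: float   # judge score in [0, 1]
    logical_coherence: float   # judge score in [0, 1]
    answer_consistency: float  # judge score in [0, 1]


def thinking_reward(r: Rollout) -> float:
    """Mean of the three reasoning-quality scores; 0 when there is no trace to score."""
    if not r.used_thinking:
        return 0.0
    return (r.factual_grounding + r.logical_coherence + r.answer_consistency) / 3.0


def judging_reward(r: Rollout) -> float:
    """1 when the think-or-answer decision matches the assumed task label, else 0."""
    return 1.0 if r.used_thinking == r.needs_thinking else 0.0


def total_reward(r: Rollout, w_think: float = 0.5, w_judge: float = 0.5) -> float:
    """Outcome reward plus weighted thinking and judging terms (weights are illustrative)."""
    outcome = 1.0 if r.answer_correct else 0.0
    return outcome + w_think * thinking_reward(r) + w_judge * judging_reward(r)
```

Under these assumptions, a response that reasons at length on a task labeled as simple still earns the outcome and thinking terms but loses the judging term, which is one way the "when to think" signal could be separated from the "how to think" signal.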

[2] iFlyBot-VLA Technical Report

Yuan Zhang, Chenyu Xue, Wenjie Xu, Chao Ji, Jiajia Wu, Jia Pan

🧩 TL;DR

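This report presents iFlyBot-VLA, a large-scale Vision-Language-Action model that uses a dual-level action representation and a mixed training strategy to achieve strong performance on robotic manipulation tasks.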


📘 Detailed Summary

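Motivation: Current vision-language-action models for robotic manipulation do not adequately align the action representation space with the vision-language representation space. This calls for an action representation framework that captures both high-level intent and low-level dynamics.

Method: The report proposes a dual-level action representation framework that combines a latent action model with structured discrete action tokens. Latent actions, learned from cross-embodiment manipulation data, capture implicit high-level intentions; discrete action tokens, obtained through frequency-domain transformations, encode explicit low-level dynamics. A mixed training strategy combines robot trajectory data with general QA datasets.

Result: The model shows superior performance on the LIBERO Franka benchmark, and real-world evaluations show competitive success rates across diverse and challenging manipulation tasks, supporting the framework's effectiveness and generalization.

Conclusion: The work shows that a dual-level action representation and a mixed training strategy can align the language, vision, and action representation spaces, offering a new paradigm for vision-language-action models in robotic manipulation. The authors plan to open-source part of their self-constructed dataset to support community research.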


📄 Abstract

We introduce iFlyBot-VLA, a large-scale Vision-Language-Action (VLA) model trained under a novel framework. The main contributions are listed as follows: (1) a latent action model thoroughly trained on large-scale human and robotic manipulation videos; (2) a dual-level action representation framework that jointly supervises both the Vision-Language Model (VLM) and the action expert during training; (3) a mixed training strategy that combines robot trajectory data with general QA and spatial QA datasets, effectively enhancing the 3D perceptual and reasoning capabilities of the VLM backbone. Specifically, the VLM is trained to predict two complementary forms of actions: latent actions, derived from our latent action model pretrained on cross-embodiment manipulation data, which capture implicit high-level intentions; and structured discrete action tokens, obtained through frequency-domain transformations of continuous control signals, which encode explicit low-level dynamics. This dual supervision aligns the representation spaces of language, vision, and action, enabling the VLM to directly contribute to action generation. Experimental results on the LIBERO Franka benchmark demonstrate the superiority of our framework, while real-world evaluations further show that iFlyBot-VLA achieves competitive success rates across diverse and challenging manipulation tasks. Furthermore, we plan to open-source a portion of our self-constructed dataset to support future research in the community.
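To make "frequency-domain transformations of continuous control signals" concrete, here is a minimal sketch of one way such tokens could be produced. The abstract does not name the transform or the quantizer, so the discrete cosine transform (DCT), the uniform quantization step `step`, and the function names are assumptions chosen for illustration; the report's actual tokenizer may differ.

```python
import math


def _scale(k: int, n: int) -> float:
    """Orthonormal DCT scaling factor for coefficient index k."""
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)


def dct2(signal):
    """Orthonormal DCT-II of a 1-D sequence (one control dimension over time)."""
    n = len(signal)
    return [
        _scale(k, n) * sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                           for i, x in enumerate(signal))
        for k in range(n)
    ]


def idct2(coeffs):
    """Inverse of dct2 (orthonormal DCT-III)."""
    n = len(coeffs)
    return [
        sum(_scale(k, n) * c * math.cos(math.pi * (i + 0.5) * k / n)
            for k, c in enumerate(coeffs))
        for i in range(n)
    ]


def tokenize(signal, step=0.05):
    """Map a continuous trajectory to integer tokens by quantizing DCT coefficients."""
    return [round(c / step) for c in dct2(signal)]


def detokenize(tokens, step=0.05):
    """Approximately recover the trajectory from integer tokens."""
    return idct2([t * step for t in tokens])
```

For a smooth trajectory most of the energy sits in the low-frequency coefficients, so many high-frequency tokens quantize to zero; that compaction is a common reason for tokenizing in the frequency domain rather than per time step.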

[3] CoCoVa: Chain of Continuous Vision-Language Thought for Latent Space Reasoning

Jizheng Ma, Xiaofei Zhou, Yanlong Song, Han Yan

🧩 TL;DR

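This paper proposes CoCoVa, a framework that introduces continuous cross-modal reasoning to bridge the gap between the rich, high-dimensional nature of visual perception and the discrete reasoning space of language models, achieving more efficient and accurate reasoning across a range of vision-language tasks.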


📘 Detailed Summary

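Motivation: Current vision-language models are confined to reasoning in the discrete space of linguistic tokens and cannot fully express the tacit thought processes of human cognition that resist verbalization, which limits their understanding of rich, high-dimensional visual information.

Method: CoCoVa uses an iterative reasoning cycle in which a novel Latent Q-Former acts as a dynamic reasoning engine, iteratively refining a chain of latent thought vectors through cross-modal fusion. A dynamic token selection mechanism focuses on salient visual regions, and a multi-task objective combining contrastive learning and diffusion-based reconstruction keeps the latent representations aligned across modalities.

Result: Experiments show that CoCoVa outperforms strong baselines in both accuracy and token efficiency. With a 1.5B backbone it matches or surpasses larger 7B-9B models on almost all benchmarks, and when scaled to a 7B backbone it remains competitive with state-of-the-art models.

Conclusion: The study shows that the learned latent space captures interpretable, structured reasoning patterns, demonstrating CoCoVa's potential to bridge the representational gap between discrete language processing and continuous visual understanding and opening a new direction for vision-language reasoning.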


📄 Abstract

In human cognition, there exist numerous thought processes that are tacit and beyond verbal expression, enabling us to understand and interact with the world in multiple ways. However, contemporary Vision-Language Models (VLMs) remain constrained to reasoning within the discrete and rigid space of linguistic tokens, thereby bottlenecking the rich, high-dimensional nature of visual perception. To bridge this gap, we propose CoCoVa (Chain of Continuous Vision-Language Thought), a novel framework for vision-language models that leverages continuous cross-modal reasoning for diverse vision-language tasks. The core of CoCoVa is an iterative reasoning cycle, where a novel Latent Q-Former (LQ-Former) acts as a dynamic reasoning engine, iteratively refining a chain of latent thought vectors through cross-modal fusion. To focus this process, a token selection mechanism dynamically identifies salient visual regions, mimicking attentional focus. To ensure these latent thoughts remain grounded, we train the model with a multi-task objective that combines contrastive learning and diffusion-based reconstruction, enforcing alignment between latent representations and both visual and textual modalities. Evaluations show CoCoVa improves accuracy and token efficiency over strong baselines. With a 1.5B backbone, it competes with or surpasses larger 7B-9B models on almost all benchmarks. When scaled to 7B LLM backbones, it remains competitive with state-of-the-art models. Qualitative analysis validates that the learned latent space captures interpretable and structured reasoning patterns, highlighting the potential of CoCoVa to bridge the representational gap between discrete language processing and the continuous nature of visual understanding.

[4] Towards Selection of Large Multimodal Models as Engines for Burned-in Protected Health Information Detection in Medical Images

Tuan Truong, Guillermo Jimenez Perez, Pedro Osorio, Matthias Lenga

🧩 TL;DR

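This study systematically evaluates large multimodal models (LMMs) on detecting Protected Health Information (PHI) burned into medical images. It finds that LMMs beat traditional methods on OCR accuracy but yield limited gains in overall PHI detection, and it offers model selection and deployment recommendations for different application scenarios.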


📘 Detailed Summary

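Motivation: Traditional PHI detection in medical images relies mainly on OCR models combined with named entity recognition, which has performance bottlenecks. Emerging large multimodal models open new opportunities for text extraction and semantic analysis, so their practical effectiveness needs systematic evaluation.

Method: The study benchmarks three prominent LMMs (GPT-4o, Gemini 2.5 Flash, and Qwen 2.5 7B) under two pipeline configurations: one that uses the LMM for text analysis only, and one that uses it for both OCR and semantic analysis. Results are compared against the conventional EasyOCR approach.

Result: LMMs achieve markedly better OCR accuracy than conventional models (WER: 0.03-0.05, CER: 0.02-0.03), but this improvement does not consistently translate into higher overall PHI detection accuracy. The strongest gains appear on test cases with complex imprint patterns; for clearly legible, high-contrast text regions, the different pipeline configurations perform similarly.

Conclusion: The study exposes the strengths and limits of LMMs for PHI detection in medical images, provides empirical grounds for model selection under different operational constraints, and proposes a deployment strategy built on scalable, modular infrastructure, with practical relevance for medical privacy protection.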


📄 Abstract

The detection of Protected Health Information (PHI) in medical imaging is critical for safeguarding patient privacy and ensuring compliance with regulatory frameworks. Traditional detection methodologies predominantly utilize Optical Character Recognition (OCR) models in conjunction with named entity recognition. However, recent advancements in Large Multimodal Models (LMMs) present new opportunities for enhanced text extraction and semantic analysis. In this study, we systematically benchmark three prominent closed and open-source LMMs, namely GPT-4o, Gemini 2.5 Flash, and Qwen 2.5 7B, utilizing two distinct pipeline configurations: one dedicated to text analysis alone and another integrating both OCR and semantic analysis. Our results indicate that LMMs exhibit superior OCR efficacy (WER: 0.03-0.05, CER: 0.02-0.03) compared to conventional models like EasyOCR. However, this improvement in OCR performance does not consistently correlate with enhanced overall PHI detection accuracy. The strongest performance gains are observed on test cases with complex imprint patterns. In scenarios where text regions are clearly legible with sufficient contrast, and strong LMMs are employed for text analysis after OCR, different pipeline configurations yield similar results. Furthermore, we provide empirically grounded recommendations for LMM selection tailored to specific operational constraints and propose a deployment strategy that leverages scalable and modular infrastructure.
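The WER and CER figures quoted above are word and character error rates. As a reminder of what those numbers measure, here is a minimal sketch using the standard definition: Levenshtein edit distance divided by the reference length. The paper's exact text normalization (casing, punctuation, whitespace) is not given in the abstract, so plain whitespace splitting is an assumption, and the example strings in the usage note are invented.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference item
                            curr[j - 1] + 1,      # insert a hypothesis item
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over the reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

For example, `wer("patient name john doe", "patient name jon doe")` is 0.25 (one wrong word out of four), while `cer` on the same pair is 1/21, since a single character is missing from a 21-character reference. A low CER with imperfect PHI detection is consistent with the abstract's point that OCR quality alone does not determine downstream accuracy.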

[5] DetectiumFire: A Comprehensive Multi-modal Dataset Bridging Vision and Language for Fire Understanding

Zixuan Liu, Siavash H. Khajavi, Guangkai Jiang

🧩 TL;DR

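This paper presents DetectiumFire, a large-scale multimodal fire dataset with 22.5K high-resolution images and 2.5K real-world videos. It targets the lack of high-quality annotated data in the fire domain and aims to advance fire-related AI research.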


📘 Detailed Summary

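Motivation: Multimodal models perform well on image generation and reasoning, but applying them to the fire domain remains difficult, mainly because there are no publicly available datasets with high-quality fire-domain annotations. This limits the development and use of fire-related AI.

Method: The authors build DetectiumFire, comprising 22.5K high-resolution fire images and 2.5K real-world fire videos that cover a wide range of fire types, environments, and risk levels. Annotations include traditional computer vision labels (such as bounding boxes) and detailed textual prompts describing each scene, supporting applications such as synthetic data generation and fire risk reasoning.

Result: DetectiumFire clearly improves on existing benchmarks in scale, diversity, and data quality, reducing redundancy and broadening coverage of real-world scenarios. Its utility is validated on object detection, diffusion-based image generation, and vision-language reasoning.

Conclusion: The dataset has the potential to advance fire-related research and the development of intelligent safety systems. Its public release should encourage broader exploration of fire understanding in the AI community and provide data support for fire prevention and emergency response.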


📄 Abstract

Recent advances in multi-modal models have demonstrated strong performance in tasks such as image generation and reasoning. However, applying these models to the fire domain remains challenging due to the lack of publicly available datasets with high-quality fire domain annotations. To address this gap, we introduce DetectiumFire, a large-scale, multi-modal dataset comprising 22.5k high-resolution fire-related images and 2.5k real-world fire-related videos covering a wide range of fire types, environments, and risk levels. The data are annotated with both traditional computer vision labels (e.g., bounding boxes) and detailed textual prompts describing the scene, enabling applications such as synthetic data generation and fire risk reasoning. DetectiumFire offers clear advantages over existing benchmarks in scale, diversity, and data quality, significantly reducing redundancy and enhancing coverage of real-world scenarios. We validate the utility of DetectiumFire across multiple tasks, including object detection, diffusion-based image generation, and vision-language reasoning. Our results highlight the potential of this dataset to advance fire-related research and support the development of intelligent safety systems. We release DetectiumFire to promote broader exploration of fire understanding in the AI community. The dataset is available at https://kaggle.com/datasets/38b79c344bdfc55d1eed3d22fbaa9c31fad45e27edbbe9e3c529d6e5c4f93890

[6] Text-VQA Aug: Pipelined Harnessing of Large Multimodal Models for Automated Synthesis

Soham Joshi, Shwet Kamal Mishra, Viswanath Gopalakrishnan

🧩 TL;DR

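This study proposes the first end-to-end pipeline for automatically synthesizing text-VQA datasets. By integrating OCR detection, region identification, caption generation, and question generation models, it builds a large-scale text-VQA dataset of about 72K QA pairs.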


📘 Detailed Summary

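Motivation: Visual question answering over scene text currently requires extensive human annotation, which is tedious and challenging. An end-to-end pipeline that automatically synthesizes QA pairs from the scene text in an image is needed.

Method: The method integrates multiple models and algorithms, covering OCR detection and recognition, region-of-interest detection, caption generation, and question generation, and streamlines them into a single automated pipeline for synthesizing and validating QA pairs.

Result: The pipeline produces a large-scale text-VQA dataset of about 72K QA pairs over about 44K images. To the best of the authors' knowledge, it is the first complete pipeline that automatically synthesizes and validates a large-scale text-VQA dataset.

Conclusion: The study demonstrates that foundation models and mature OCR technology can be combined into an automated text-VQA synthesis pipeline, offering an efficient and scalable way to build datasets for large-scale vision-language tasks.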


📄 Abstract

Creation of large-scale databases for Visual Question Answering tasks pertaining to the text data in a scene (text-VQA) involves skilful human annotation, which is tedious and challenging. With the advent of foundation models that handle vision and language modalities, and with the maturity of OCR systems, it is the need of the hour to establish an end-to-end pipeline that can synthesize Question-Answer (QA) pairs based on scene-text from a given image. We propose a pipeline for automated synthesis for text-VQA dataset that can produce faithful QA pairs, and which scales up with the availability of scene text data. Our proposed method harnesses the capabilities of multiple models and algorithms involving OCR detection and recognition (text spotting), region of interest (ROI) detection, caption generation, and question generation. These components are streamlined into a cohesive pipeline to automate the synthesis and validation of QA pairs. To the best of our knowledge, this is the first pipeline proposed to automatically synthesize and validate a large-scale text-VQA dataset comprising around 72K QA pairs based on around 44K images.

[7] UniChange: Unifying Change Detection with Multimodal Large Language Model

Xu Zhang, Danyang Li, Xiaohang Dong, Tianhao Wu, Hualong Yu, Jianye Wang, Qicheng Li, Xiang Li

🧩 TL;DR

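This paper presents UniChange, the first unified change detection framework based on a multimodal large language model. Using special tokens and text prompts, it unifies binary and semantic change detection and achieves state-of-the-art performance on several public benchmarks.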


📘 Detailed Summary

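Motivation: Current change detection models have a critical limitation: they typically acquire limited knowledge from a single type of annotated data and cannot leverage diverse binary change detection (BCD) and semantic change detection (SCD) datasets at the same time, which leads to poor generalization and limited versatility.

Method: UniChange draws on the language priors and unification capabilities of multimodal large language models, combining generative language abilities with specialized change detection functionality. It unifies BCD and SCD through three special tokens, [T1], [T2], and [CHANGE], and uses text prompts to guide the identification of change categories, removing the dependence on predefined classification heads.

Result: Experiments on four public benchmarks (WHU-CD, S2Looking, LEVIR-CD+, and SECOND) show state-of-the-art performance, with IoU scores of 90.41, 53.04, 78.87, and 57.62 respectively, surpassing all previous methods.

Conclusion: The study shows that multimodal large language models open new possibilities for a unified change detection framework. UniChange's design lets it acquire knowledge from multi-source datasets even when their class definitions conflict, pointing to a new research direction for change detection.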


📄 Abstract

Change detection (CD) is a fundamental task for monitoring and analyzing land cover dynamics. While recent high performance models and high quality datasets have significantly advanced the field, a critical limitation persists. Current models typically acquire limited knowledge from single-type annotated data and cannot concurrently leverage diverse binary change detection (BCD) and semantic change detection (SCD) datasets. This constraint leads to poor generalization and limited versatility. The recent advancements in Multimodal Large Language Models (MLLMs) introduce new possibilities for a unified CD framework. We leverage the language priors and unification capabilities of MLLMs to develop UniChange, the first MLLM-based unified change detection model. UniChange integrates generative language abilities with specialized CD functionalities. Our model successfully unifies both BCD and SCD tasks through the introduction of three special tokens: [T1], [T2], and [CHANGE]. Furthermore, UniChange utilizes text prompts to guide the identification of change categories, eliminating the reliance on predefined classification heads. This design allows UniChange to effectively acquire knowledge from multi-source datasets, even when their class definitions conflict. Experiments on four public benchmarks (WHU-CD, S2Looking, LEVIR-CD+, and SECOND) demonstrate SOTA performance, achieving IoU scores of 90.41, 53.04, 78.87, and 57.62, respectively, surpassing all previous methods. The code is available at https://github.com/Erxucomeon/UniChange.
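The IoU scores reported above measure the overlap between predicted and ground-truth change regions. The following minimal sketch computes IoU for a single pair of binary masks. How each benchmark aggregates the metric (per image, over the whole test set, or per class for SECOND) is not stated in the abstract, so the single-mask form and the convention of returning 1.0 for two empty masks are assumptions.

```python
def binary_iou(pred, target):
    """Intersection over union of two same-shaped binary masks (nested lists of 0/1)."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Convention assumed here: two empty masks agree perfectly.
    return intersection / union if union else 1.0
```

As a usage example, `binary_iou([[1, 1, 0], [0, 1, 0]], [[1, 0, 0], [0, 1, 1]])` gives 0.5: two pixels are marked as changed in both masks and four are marked in at least one. The scores in the abstract appear to be the same kind of ratio expressed as a percentage.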

[8] Pinpointing Trigger Moment for Grounded Video QA: Enhancing Spatio-temporal Grounding in Multimodal Large Language Models

Jinhwan Seo, Yoonki Cho, Junhyug Noh, Sung-eui Yoon

🧩 TL;DR

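This report presents a three-stage framework for Grounded Video Question Answering (GVQA). By introducing the concept of a trigger moment, it substantially improves spatio-temporal grounding and tracking, reaching a HOTA score of 0.4968 on the GVQA task.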


📘 Detailed Summary

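Motivation: The GVQA task requires robust multimodal models that can reason over complex video content, ground the resulting answers visually, and track the referenced objects over time. Existing methods fall short on these capabilities.

Method: The approach decomposes GVQA into a three-stage pipeline: video reasoning and QA, spatio-temporal grounding, and tracking. The key contribution is a trigger moment, derived from the proposed CORTEX prompt, which pinpoints the single frame in which the target object is most visible and uses it as a robust anchor for grounding and tracking.

Result: The framework achieves a HOTA score of 0.4968 on the GVQA task, a significant improvement over the previous year's winning score of 0.2704.

Conclusion: Through task decomposition and the trigger moment, the work offers an effective solution for complex video understanding and shows the value of multi-stage methods and precise moment localization for visual grounding performance.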


📄 Abstract

In this technical report, we introduce a framework to address the Grounded Video Question Answering (GVQA) task for the ICCV 2025 Perception Test Challenge. The GVQA task demands robust multimodal models capable of complex reasoning over video content, grounding the resulting answers visually, and tracking the referenced objects temporally. To achieve this capability, our proposed approach decomposes the GVQA task into a three-stage pipeline: (1) Video Reasoning & QA, (2) Spatio-temporal Grounding and (3) Tracking. Our key contribution is the introduction of a trigger moment, derived from our proposed CORTEX prompt, which pinpoints the single most visible frame of a target object to serve as a robust anchor for grounding and tracking. With this approach, we achieve a HOTA score of 0.4968, which marks a significant improvement over the previous year's winning score of 0.2704 on the GVQA task.

[9] VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation

Kevin Qinghong Lin, Yuhao Zheng, Hangyu Ran, Dantong Zhu, Dongxing Mao, Linjie Li, Philip Torr, Alex Jinpeng Wang

🧩 TL;DR

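This paper introduces VCode, a benchmark that reframes multimodal understanding as SVG code generation, and VCoder, an agentic framework that improves vision-language models on visual-centric coding through iterative revision and visual tools.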


📘 Detailed Summary

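Motivation: Research on code as a medium for reasoning and action has focused mainly on language-centric tasks such as program synthesis and debugging, leaving visual-centric coding underexplored. In particular, there is no compact, interpretable visual representation that preserves symbolic meaning for downstream reasoning.

Method: The paper proposes the VCode benchmark, which recasts multimodal understanding as SVG code generation, and the VCoder agentic framework with two core components: a revision-based mechanism that iteratively analyzes discrepancies, and a visual tool strategy in which detectors and parsers supply structured visual cues.

Result: Experiments show that frontier vision-language models struggle to generate faithful SVGs. VCoder delivers a 12.3-point overall gain over the top-performing Claude-4-Opus. Human studies show that both humans and models perform worse on rendered SVGs, yet their consistency points to the promise of symbolic visual representation.

Conclusion: The study reveals a persistent gap between language-centric and visual-centric coding, demonstrates the value of symbolic visual representation for reasoning, and provides a new evaluation paradigm and direction for multimodal understanding. Professional knowledge and 3D reasoning in particular need further work.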


📄 Abstract

Code has emerged as a precise and executable medium for reasoning and action in the agent era. Yet, progress has largely focused on language-centric tasks such as program synthesis and debugging, leaving visual-centric coding underexplored. Inspired by how humans reason over sketches, we advocate SVG code as a compact, interpretable, and executable visual representation. We introduce VCode, a benchmark that reframes multimodal understanding as code generation: given an image, a model must produce SVG that preserves symbolic meaning for downstream reasoning. VCode covers three domains - general commonsense (MM-Vet), professional disciplines (MMMU), and visual-centric perception (CV-Bench). To assess symbolic fidelity, we propose CodeVQA, a novel evaluation protocol in which a policy model answers questions over rendered SVGs; correct answers indicate faithful symbolic preservation. Empirically, frontier VLMs struggle to generate faithful SVGs, revealing a persistent gap between language-centric and visual-centric coding. To close this gap, we introduce VCoder, an agentic framework that augments VLMs along two axes: (i) Thinking with Revision, which iteratively analyzes discrepancies and refines SVG code; and (ii) Acting with Visual Tools, where detectors and parsers supply structured cues such as objects, shapes, and text beyond the model's intrinsic capacity. Across benchmarks, frontier VLMs with strong reasoning capabilities score well overall yet remain limited in professional knowledge and 3D reasoning. VCoder delivers a 12.3-point overall gain over the top-performing Claude-4-Opus. Human studies show that both humans and VLMs perform worse on rendered SVGs; their consistency reveals the promise of symbolic visual representation. The benchmark and code are available at https://github.com/CSU-JPG/VCode.

[10] Language-Enhanced Generative Modeling for PET Synthesis from MRI and Blood Biomarkers

Zhengjie Zhang, Xiaoxie Mao, Qihao Guo, Shaoting Zhang, Qi Huang, Mu Zhou, Fang Xie, Mianxin Liu

🧩 TL;DR

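This study develops an LLM-enhanced generative model that synthesizes amyloid PET images from blood biomarkers and MRI scans, offering a lower-cost and more accessible alternative for Alzheimer's disease diagnosis.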


📘 Detailed Summary

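Motivation: Alzheimer's disease diagnosis relies heavily on amyloid positron emission tomography, which is costly and has limited accessibility. The study explores whether the spatial patterns of amyloid PET can be predicted from blood biomarkers and MRI scans.

Method: The authors collected amyloid PET images, T1-weighted MRI scans, and blood biomarker data from 566 participants, developed a language-enhanced generative model driven by a large language model and multimodal information fusion to synthesize PET images, and built a fully automated diagnostic pipeline for evaluation.

Result: The synthetic PET images closely resemble real PET scans in structural detail, with a structural similarity index of 0.920±0.003 and a regional pattern correlation of 0.955±0.007. Diagnoses based on synthetic PET agree with real PET-based diagnoses at an accuracy of 0.80. The synthetic PET model reaches an AUC of 0.78 for Alzheimer's disease diagnosis, outperforming the T1-based and blood-biomarker-based models.

Conclusion: The language-enhanced generative model synthesizes realistic PET images, increases the utility of MRI and blood biomarkers for assessing amyloid spatial patterns, and improves the Alzheimer's disease diagnostic workflow, opening a path toward low-cost, scalable diagnosis of neurodegenerative disease.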


📄 Abstract

Background: Alzheimer's disease (AD) diagnosis heavily relies on amyloid-beta positron emission tomography (Abeta-PET), which is limited by high cost and limited accessibility. This study explores whether Abeta-PET spatial patterns can be predicted from blood-based biomarkers (BBMs) and MRI scans. Methods: We collected Abeta-PET images, T1-weighted MRI scans, and BBMs from 566 participants. A language-enhanced generative model, driven by a large language model (LLM) and multimodal information fusion, was developed to synthesize PET images. Synthesized images were evaluated for image quality, diagnostic consistency, and clinical applicability within a fully automated diagnostic pipeline. Findings: The synthetic PET images closely resemble real PET scans in both structural details (SSIM = 0.920 +/- 0.003) and regional patterns (Pearson's r = 0.955 +/- 0.007). Diagnostic outcomes using synthetic PET show high agreement with real PET-based diagnoses (accuracy = 0.80). Using synthetic PET, we developed a fully automatic AD diagnostic pipeline integrating PET synthesis and classification. The synthetic PET-based model (AUC = 0.78) outperforms T1-based (AUC = 0.68) and BBM-based (AUC = 0.73) models, while combining synthetic PET and BBMs further improved performance (AUC = 0.79). Ablation analysis supports the advantages of LLM integration and prompt engineering. Interpretation: Our language-enhanced generative model synthesizes realistic PET images, enhancing the utility of MRI and BBMs for Abeta spatial pattern assessment and improving the diagnostic workflow for Alzheimer's disease.
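The regional-pattern agreement above is reported as Pearson's r. Here is a minimal sketch of that statistic using its standard definition, assuming the inputs are per-region values taken from the real and synthetic scans. The abstract does not specify the regions or the uptake measure, so the variable names and the numbers in the usage note are invented for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have the same length, at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / math.sqrt(var_x * var_y)
```

A call such as `pearson_r(real_regional, synthetic_regional)`, with hypothetical per-region lists like `[1.10, 1.45, 1.32, 0.98]` for each scan, returns a value near 1 when the synthetic image reproduces the relative pattern across regions. Because r is invariant to scale and offset, a high value does not by itself show that absolute uptake values match.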

[11] Collaborative Attention and Consistent-Guided Fusion of MRI and PET for Alzheimer's Disease Diagnosis

Delin Ma, Menghui Zhou, Jun Qi, Yun Yang, Po Yang

🧩 TL;DR

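This study proposes a Collaborative Attention and Consistent-Guided Fusion framework for Alzheimer's disease diagnosis that integrates MRI and PET neuroimaging and outperforms existing fusion methods on the ADNI dataset. The framework compensates for missing modality information with a learnable parameter representation block and explicitly aligns cross-modal latent distributions through a consistency-guided mechanism.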


📘 Detailed Summary

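Motivation: Current multimodal neuroimaging fusion methods emphasize cross-modal complementarity but overlook the diagnostic importance of modality-specific features. In addition, the inherent distributional differences between modalities often produce biased and noisy representations that degrade classification performance. The study aims to address these challenges and improve early diagnosis of Alzheimer's disease.

Method: The proposed framework includes a learnable parameter representation block that compensates for missing modality information, a shared encoder and modality-independent encoders that preserve both shared and specific representations, and a consistency-guided mechanism that explicitly aligns latent distributions across modalities. The model integrates multi-scale complementary features from MRI and PET data.

Result: Experiments on the ADNI dataset show that the method achieves better diagnostic performance than existing fusion strategies, supporting the framework's effectiveness for Alzheimer's disease classification. The model gains accuracy by accounting for both cross-modal complementarity and modality-specific features.

Conclusion: The study highlights the importance of preserving both shared and modality-specific representations and of explicitly aligning cross-modal distributions. The framework offers a new technical route for multimodal neuroimaging analysis, with clinical relevance for early diagnosis of Alzheimer's disease.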


📄 Abstract

Alzheimer's disease (AD) is the most prevalent form of dementia, and its early diagnosis is essential for slowing disease progression. Recent studies on multimodal neuroimaging fusion using MRI and PET have achieved promising results by integrating multi-scale complementary features. However, most existing approaches primarily emphasize cross-modal complementarity while overlooking the diagnostic importance of modality-specific features. In addition, the inherent distributional differences between modalities often lead to biased and noisy representations, degrading classification performance. To address these challenges, we propose a Collaborative Attention and Consistent-Guided Fusion framework for MRI and PET based AD diagnosis. The proposed model introduces a learnable parameter representation (LPR) block to compensate for missing modality information, followed by a shared encoder and modality-independent encoders to preserve both shared and specific representations. Furthermore, a consistency-guided mechanism is employed to explicitly align the latent distributions across modalities. Experimental results on the ADNI dataset demonstrate that our method achieves superior diagnostic performance compared with existing fusion strategies.

[12] Medical Report Generation: A Hierarchical Task Structure-Based Cross-Modal Causal Intervention Framework

Yucheng Song, Yifan Ge, Junhao Li, Zhining Liao, Zhifang Liao

🧩 TL;DR

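This paper proposes the HTSC-CIF framework, which uses hierarchical task decomposition to address three core challenges in medical report generation: insufficient understanding of domain knowledge, poor alignment of text-visual entity embeddings, and spurious correlations caused by cross-modal bias. It significantly outperforms existing state-of-the-art methods.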


📘 Detailed Summary

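Motivation: Medical report generation models face three main challenges in lesion description: insufficient understanding of domain knowledge, poor alignment of text-visual entity embeddings, and spurious correlations from cross-modal biases. Existing work addresses only one challenge at a time rather than all three together.

Method: HTSC-CIF uses hierarchical task decomposition. The low-level task aligns medical entity features with spatial locations to strengthen the visual encoder's domain knowledge; the mid-level task uses Prefix Language Modeling and Masked Image Modeling to improve cross-modal alignment through mutual guidance; the high-level task applies a cross-modal causal intervention module based on front-door intervention to reduce confounders and improve interpretability.

Result: Extensive experiments confirm the effectiveness of HTSC-CIF, which significantly outperforms existing state-of-the-art methods on medical report generation.

Conclusion: The study shows that hierarchical task decomposition is effective for the multi-challenge problem of medical report generation and offers a new framework design for cross-modal medical AI systems, with value for clinical application.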


📄 Abstract

Medical Report Generation (MRG) is a key part of modern medical diagnostics, as it automatically generates reports from radiological images to reduce radiologists' burden. However, reliable MRG models for lesion description face three main challenges: insufficient domain knowledge understanding, poor text-visual entity embedding alignment, and spurious correlations from cross-modal biases. Previous work only addresses single challenges, while this paper tackles all three via a novel hierarchical task decomposition approach, proposing the HTSC-CIF framework. HTSC-CIF classifies the three challenges into low-, mid-, and high-level tasks: 1) Low-level: align medical entity features with spatial locations to enhance domain knowledge for visual encoders; 2) Mid-level: use Prefix Language Modeling (text) and Masked Image Modeling (images) to boost cross-modal alignment via mutual guidance; 3) High-level: a cross-modal causal intervention module (via front-door intervention) to reduce confounders and improve interpretability. Extensive experiments confirm HTSC-CIF's effectiveness, significantly outperforming state-of-the-art (SOTA) MRG methods. Code will be made public upon paper acceptance.
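For readers unfamiliar with front-door intervention, the standard front-door adjustment from causal inference is reproduced below as background. It is the textbook formula, not the paper's specific module: the abstract does not say what plays the role of the mediator in HTSC-CIF, so the symbols X (input), M (mediator), and Y (generated report) are assumed labels.

```latex
% Front-door adjustment: the effect of X on Y is identified through a mediator M,
% even when an unobserved confounder influences both X and Y.
P\bigl(Y \mid \mathrm{do}(X = x)\bigr)
  = \sum_{m} P(M = m \mid X = x)
    \sum_{x'} P\bigl(Y \mid M = m, X = x'\bigr)\, P(X = x')
```

The inner sum averages over all inputs x' weighted by their prior probability, which removes the confounded dependence between X and Y; the outer sum then propagates the intervention through the mediator.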

[13] RxnCaption: Reformulating Reaction Diagram Parsing as Visual Prompt Guided Captioning

Jiahe Song, Chuang Wang, Bowen Jiang, Yinfan Wang, Hao Zheng, Xingjian Wei, Chengjin Liu, Junyuan Gao, Yubin Wang, Lijun Wu, Jiang Wu, Qian Yu, Conghui He

🧩 TL;DR

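This study proposes RxnCaption, a framework that recasts chemical reaction diagram parsing as an image captioning problem. Using the BIVP strategy and the MolYOLO molecular detector, it markedly improves structural extraction quality, and it contributes the large-scale RxnCaption-11k dataset, achieving state-of-the-art performance on multiple metrics.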


📘 Detailed Summary

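Motivation: Existing chemical reaction data often exist as images within papers, so they are not machine-readable and cannot be used to train machine learning models, which limits the use of AI in chemistry research.

Method: RxnCaption reformulates the traditional coordinate-prediction-driven parsing process as an image captioning problem. Its BIVP strategy uses the state-of-the-art molecular detector MolYOLO to pre-draw molecular bounding boxes and indices onto the input image, turning downstream parsing into a natural-language description problem.

Result: Extensive experiments show that the BIVP strategy significantly improves structural extraction quality while simplifying model design. The RxnCaption-11k dataset is an order of magnitude larger than prior real-world literature benchmarks, and RxnCaption-VL achieves state-of-the-art performance on multiple metrics.

Conclusion: The method, dataset, and models are expected to advance structured information extraction from chemical literature and to catalyze broader AI applications in chemistry. Data, models, and code will be released on GitHub.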


📄 Abstract

Large-scale chemical reaction datasets are crucial for AI research in chemistry. However, existing chemical reaction data often exist as images within papers, making them not machine-readable and unusable for training machine learning models. In response to this challenge, we propose the RxnCaption framework for the task of chemical Reaction Diagram Parsing (RxnDP). Our framework reformulates the traditional coordinate prediction driven parsing process into an image captioning problem, which Large Vision-Language Models (LVLMs) handle naturally. We introduce a strategy termed "BBox and Index as Visual Prompt" (BIVP), which uses our state-of-the-art molecular detector, MolYOLO, to pre-draw molecular bounding boxes and indices directly onto the input image. This turns the downstream parsing into a natural-language description problem. Extensive experiments show that the BIVP strategy significantly improves structural extraction quality while simplifying model design. We further construct the RxnCaption-11k dataset, an order of magnitude larger than prior real-world literature benchmarks, with a balanced test subset across four layout archetypes. Experiments demonstrate that RxnCaption-VL achieves state-of-the-art performance on multiple metrics. We believe our method, dataset, and models will advance structured information extraction from chemical literature and catalyze broader AI applications in chemistry. We will release data, models, and code on GitHub.

[14] ChartM$^3$: A Multi-Stage Code-Driven Pipeline for Constructing Multi-Dimensional and Multi-Step Visual Reasoning Data in Chart Comprehension

Duo Xu, Hao Cheng, Xin Lin, Zhen Xie, Hao Wang

🧩 TL;DR

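This study proposes an automated multi-stage code-driven pipeline for systematically generating ChartM³, a dataset for complex chart understanding. The dataset significantly improves the chart reasoning performance of multimodal large language models, enabling smaller models to match larger ones on complex chart comprehension.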


📘 Detailed Summary

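Motivation: Current research on multimodal large language models covers complex chart scenarios only narrowly and gives limited support to computation-intensive reasoning tasks, falling short of the advanced visual recognition and reasoning that real-world applications require.

Method: The automated multi-stage code-driven pipeline integrates retrieval-augmented generation to retrieve professional chart templates and uses chain-of-thought strategies to generate reasoning code that simulates real data distributions, driving chart rendering and question-related statistical computation. Model-based evaluation improves chart diversity and data quality.

Result: The resulting ChartM³ dataset contains 38K charts and 142K Q&A pairs. Supervised fine-tuning and reinforcement learning experiments show that it significantly improves reasoning capabilities and cross-domain generalization, enabling smaller models to perform comparably to larger-scale models on complex chart comprehension.

Conclusion: The study demonstrates that automated data generation pipelines are effective for building high-quality multimodal datasets, provides a practical way to improve model performance on complex visual reasoning, and shows how strongly data quality affects model capability.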


📄 Abstract

Complex chart understanding tasks demand advanced visual recognition and reasoning capabilities from multimodal large language models (MLLMs). However, current research provides limited coverage of complex chart scenarios and computation-intensive reasoning tasks prevalent in real-world applications. This study proposes an automated multi-stage code-driven pipeline for systematically generating visual reasoning datasets to address these limitations. The pipeline integrates retrieval-augmented generation (RAG) to retrieve professional chart templates and employs chain-of-thought (CoT) strategies to generate reasoning codes that simulate real data distributions, thereby driving chart rendering and question-related statistical computations. Through model-based evaluation, the pipeline enhances chart diversity and data quality. Using this framework, we construct ChartM$^3$, a multi-dimensional and multi-step dataset containing 38K charts and 142K Q&A pairs for training, along with 2,871 high-quality evaluation samples for enabling practical performance assessment. Supervised fine-tuning (SFT) and reinforcement learning (RL) experiments demonstrate that our dataset significantly improves reasoning capabilities and cross-domain generalization performance, enabling smaller models to achieve performance comparable to larger-scale models in complex chart comprehension.

[15] A Cognitive Process-Inspired Architecture for Subject-Agnostic Brain Visual Decoding

Jingyu Lu, Haonan Wang, Qixiang Zhang, Xiaomeng Li

🧩 TL;DR

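This paper proposes VCFlow, a novel hierarchical decoding framework that explicitly models the ventral-dorsal architecture of the human visual system to learn multi-dimensional representations. It reconstructs continuous visual experiences without subject-specific training, sacrificing only 7% accuracy on average while generating each reconstructed video in 10 seconds.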


📘 Detailed Summary

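Motivation: Subject-agnostic brain decoding aims to reconstruct continuous visual experiences from fMRI without subject-specific training and holds great potential for clinical applications. The direction remains underexplored because of the difficulty of cross-subject generalization and the complexity of brain signals.

Method: The Visual Cortex Flow Architecture (VCFlow) learns multi-dimensional representations by disentangling and leveraging features from the early visual cortex, the ventral stream, and the dorsal stream. A feature-level contrastive learning strategy strengthens the extraction of subject-invariant semantic representations.

Result: Unlike conventional pipelines that need more than 12 hours of per-subject data and heavy computation, VCFlow sacrifices only 7% accuracy on average and generates each reconstructed video in 10 seconds without any retraining.

Conclusion: VCFlow offers a fast and clinically scalable solution. By explicitly modeling the hierarchy of the visual system, it achieves cross-subject brain signal decoding and opens a path toward clinical brain-computer interface applications.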


📄 Abstract

Subject-agnostic brain decoding, which aims to reconstruct continuous visual experiences from fMRI without subject-specific training, holds great potential for clinical applications. However, this direction remains underexplored due to challenges in cross-subject generalization and the complex nature of brain signals. In this work, we propose Visual Cortex Flow Architecture (VCFlow), a novel hierarchical decoding framework that explicitly models the ventral-dorsal architecture of the human visual system to learn multi-dimensional representations. By disentangling and leveraging features from early visual cortex, ventral, and dorsal streams, VCFlow captures diverse and complementary cognitive information essential for visual reconstruction. Furthermore, we introduce a feature-level contrastive learning strategy to enhance the extraction of subject-invariant semantic representations, thereby enhancing subject-agnostic applicability to previously unseen subjects. Unlike conventional pipelines that need more than 12 hours of per-subject data and heavy computation, VCFlow sacrifices only 7% accuracy on average yet generates each reconstructed video in 10 seconds without any retraining, offering a fast and clinically scalable solution. The source code will be released upon acceptance of the paper.

[16] From the Laboratory to Real-World Application: Evaluating Zero-Shot Scene Interpretation on Edge Devices for Mobile Robotics

Nicolas Schuler, Lea Dewald, Nick Baldig, Jürgen Graf

🧩 TL;DR

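This study evaluates small vision-language models that can be deployed on edge devices for scene interpretation and action recognition, focusing on the trade-off between accuracy and inference time in mobile robotics.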


📘 Detailed Summary

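Motivation: Although large language models and vision-language models have made remarkable progress in video understanding, scene interpretation, and commonsense reasoning, their computational complexity poses challenges for edge devices and mobile robotics, especially regarding the trade-off between accuracy and inference time.

Method: The study proposes an evaluation pipeline that examines small vision-language models deployable on edge devices for scene interpretation and action recognition, and tests it on a diverse dataset of real-world cityscape, on-campus, and indoor scenarios.

Result: The experimental evaluation discusses the potential of these small models on edge devices, with particular emphasis on their challenges, weaknesses, inherent biases, and the practical use of the information gained.

Conclusion: The study offers insights into deploying vision-language models in edge computing environments, showing both the practical feasibility and the limitations of small models in mobile robotics, and provides guidance for designing perception systems under resource constraints.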


📄 Abstract

Video Understanding, Scene Interpretation and Commonsense Reasoning are highly challenging tasks enabling the interpretation of visual information, allowing agents to perceive, interact with and make rational decisions in their environment. Large Language Models (LLMs) and Visual Language Models (VLMs) have shown remarkable advancements in these areas in recent years, enabling domain-specific applications as well as zero-shot open vocabulary tasks, combining multiple domains. However, the required computational complexity poses challenges for their application on edge devices and in the context of Mobile Robotics, especially considering the trade-off between accuracy and inference time. In this paper, we investigate the capabilities of state-of-the-art VLMs for the task of Scene Interpretation and Action Recognition, with special regard to small VLMs capable of being deployed to edge devices in the context of Mobile Robotics. The proposed pipeline is evaluated on a diverse dataset consisting of various real-world cityscape, on-campus and indoor scenarios. The experimental evaluation discusses the potential of these small models on edge devices, with particular emphasis on challenges, weaknesses, inherent model biases and the application of the gained information. Supplementary material is provided via the following repository: https://datahub.rz.rptu.de/hstr-csrl-public/publications/scene-interpretation-on-edge-devices/

[17] TAUE: Training-free Noise Transplant and Cultivation Diffusion Model

Daichi Nagai, Ryugo Morita, Shunsuke Kitada, Hitoshi Iyatomi

🧩 TL;DR

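This paper proposes TAUE, a training-free framework for layer-wise image generation. It uses noise transplantation and cultivation to keep the foreground, background, and composite layers semantically and structurally coherent, addressing the limits of existing methods in layer-wise control.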


📘 Detailed Summary

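Motivation: Current text-to-image diffusion models output only a single, flattened image, which is a critical bottleneck for professional applications that need layer-wise control. Existing solutions either rely on fine-tuning with large, inaccessible datasets or are training-free yet limited to generating isolated foreground elements, so they fail to produce a complete and coherent scene.

Method: The authors propose Noise Transplantation and Cultivation, which extracts intermediate latent representations from the foreground and composite generation processes and transplants them into the initial noise for subsequent layers. This keeps the foreground, background, and composite layers semantically and structurally coherent and yields consistent layer-wise outputs without fine-tuning or auxiliary datasets.

Result: Extensive experiments show that the training-free method achieves performance comparable to fine-tuned methods, improving layer-wise consistency while maintaining high image quality and fidelity. It removes costly training and dataset requirements and unlocks new downstream applications such as complex compositional editing.

Conclusion: TAUE removes costly training and dataset requirements and paves the way for more accessible and controllable generative workflows. It shows that professional-grade layer-wise image generation is feasible without training, opening new possibilities for practical uses of generative AI.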


📄 Abstract

Despite the remarkable success of text-to-image diffusion models, their output of a single, flattened image remains a critical bottleneck for professional applications requiring layer-wise control. Existing solutions either rely on fine-tuning with large, inaccessible datasets or are training-free yet limited to generating isolated foreground elements, failing to produce a complete and coherent scene. To address this, we introduce the Training-free Noise Transplantation and Cultivation Diffusion Model (TAUE), a novel framework for zero-shot, layer-wise image generation. Our core technique, Noise Transplantation and Cultivation (NTC), extracts intermediate latent representations from both foreground and composite generation processes, transplanting them into the initial noise for subsequent layers. This ensures semantic and structural coherence across foreground, background, and composite layers, enabling consistent, multi-layered outputs without requiring fine-tuning or auxiliary datasets. Extensive experiments show that our training-free method achieves performance comparable to fine-tuned methods, enhancing layer-wise consistency while maintaining high image quality and fidelity. TAUE not only eliminates costly training and dataset requirements but also unlocks novel downstream applications, such as complex compositional editing, paving the way for more accessible and controllable generative workflows.

[18] LLEXICORP: End-user Explainability of Convolutional Neural Networks

Vojtěch Kůr, Adam Bajger, Adam Kukučka, Marek Hradil, Vít Musil, Tomáš Brázdil

🧩 TL;DR

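This paper proposes LLEXICORP, a modular pipeline that couples concept relevance propagation with a multimodal large language model. It automatically assigns descriptive names to concept prototypes and generates natural-language explanations, significantly lowering the barrier to interpreting deep neural networks.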


📘 Detailed Summary

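Motivation: Current concept relevance propagation (CRP) workflows are largely manual: experts must inspect activation images to name the discovered concepts and synthesize verbose explanations from relevance maps, which limits the accessibility and scalability of the explanations.

Method: The approach couples CRP with a multimodal large language model, uses carefully crafted prompts to teach the language model the semantics of CRP, and enforces a separation between the naming and explanation tasks. The resulting text can be tailored to different audiences.

Result: A qualitative evaluation on a VGG16 model with ImageNet images shows that the method automatically generates descriptive concept names and intuitive natural-language explanations, translating quantitative relevance distributions into accessible narratives.

Conclusion: Integrating concept-based attribution methods with large language models can significantly lower the barrier to interpreting deep neural networks and pave the way for more transparent AI systems. The generated explanations can be pitched at a level of detail suited to each audience's technical background.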


📄 Abstract

Convolutional neural networks (CNNs) underpin many modern computer vision systems. With applications ranging from common to critical areas, a need to explain and understand the model and its decisions (XAI) emerged. Prior works suggest that in the top layers of CNNs, the individual channels can be attributed to classifying human-understandable concepts. Concept relevance propagation (CRP) methods can backtrack predictions to these channels and find images that most activate these channels. However, current CRP workflows are largely manual: experts must inspect activation images to name the discovered concepts and must synthesize verbose explanations from relevance maps, limiting the accessibility of the explanations and their scalability. To address these issues, we introduce Large Language model EXplaIns COncept Relevance Propagation (LLEXICORP), a modular pipeline that couples CRP with a multimodal large language model. Our approach automatically assigns descriptive names to concept prototypes and generates natural-language explanations that translate quantitative relevance distributions into intuitive narratives. To ensure faithfulness, we craft prompts that teach the language model the semantics of CRP through examples and enforce a separation between naming and explanation tasks. The resulting text can be tailored to different audiences, offering low-level technical descriptions for experts and high-level summaries for non-technical stakeholders. We qualitatively evaluate our method on various images from ImageNet on a VGG16 model. Our findings suggest that integrating concept-based attribution methods with large language models can significantly lower the barrier to interpreting deep neural networks, paving the way for more transparent AI systems.

[19] Adapting General-Purpose Foundation Models for X-ray Ptychography in Low-Data Regimes

Robinson Umeike, Neil Getty, Yin Xiangyu, Yi Jiang

🧩 TL;DR

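This paper introduces PtychoBench, a benchmark that systematically compares two domain adaptation strategies, supervised fine-tuning (SFT) and in-context learning (ICL), for automating scientific microscopy workflows. The optimal strategy depends on the task modality: SFT and ICL are complementary on the visual task, while ICL performs better on the textual task.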


📘 Detailed Summary

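Motivation: Automating advanced microscopy workflows is a key goal, and foundation models such as LLMs and VLMs show great potential. Adapting these general-purpose models to specialized scientific tasks is critical, yet the optimal domain adaptation strategy is unclear.

Method: The paper introduces PtychoBench, a multi-modal, multi-task benchmark, and systematically compares two specialization strategies, SFT and ICL, in a data-scarce regime. VLMs are evaluated on a visual artifact detection task and LLMs on a textual parameter recommendation task.

Result: The optimal specialization pathway is task-dependent. On the visual task, SFT and ICL are highly complementary: a fine-tuned model guided by context-aware examples achieves the highest mean performance (Micro-F1 of 0.728). On the textual task, ICL on a large base model is the superior strategy, reaching a peak Micro-F1 of 0.847 and outperforming a powerful "super-expert" SFT model (0-shot Micro-F1 of 0.839).

Conclusion: The results offer key observations for AI in science: the optimal specialization path in the benchmark depends on the task modality, which gives a clear framework for developing more effective science-based agentic systems. The study also confirms the superiority of context-aware prompting and identifies a consistent contextual interference phenomenon in fine-tuned models.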


📄 Abstract

The automation of workflows in advanced microscopy is a key goal where foundation models like Large Language Models (LLMs) and Vision-Language Models (VLMs) show great potential. However, adapting these general-purpose models for specialized scientific tasks is critical, and the optimal domain adaptation strategy is often unclear. To address this, we introduce PtychoBench, a new multi-modal, multi-task benchmark for ptychographic analysis. Using this benchmark, we systematically compare two specialization strategies: Supervised Fine-Tuning (SFT) and In-Context Learning (ICL). We evaluate these strategies on a visual artifact detection task with VLMs and a textual parameter recommendation task with LLMs in a data-scarce regime. Our findings reveal that the optimal specialization pathway is task-dependent. For the visual task, SFT and ICL are highly complementary, with a fine-tuned model guided by context-aware examples achieving the highest mean performance (Micro-F1 of 0.728). Conversely, for the textual task, ICL on a large base model is the superior strategy, reaching a peak Micro-F1 of 0.847 and outperforming a powerful "super-expert" SFT model (0-shot Micro-F1 of 0.839). We also confirm the superiority of context-aware prompting and identify a consistent contextual interference phenomenon in fine-tuned models. These results, benchmarked against strong baselines including GPT-4o and a DINOv3-based classifier, offer key observations for AI in science: the optimal specialization path in our benchmark is dependent on the task modality, offering a clear framework for developing more effective science-based agentic systems.
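The Micro-F1 scores quoted above pool true positives, false positives, and false negatives over all samples before computing F1. The sketch below shows that computation under the assumption that each sample carries a set of labels (multi-label); the function name and the example labels are hypothetical and are not taken from PtychoBench.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for multi-label data.

    y_true and y_pred are equal-length lists of label sets, one set per sample.
    Counts are pooled across all samples before F1 is computed.
    """
    tp = fp = fn = 0
    for true_labels, pred_labels in zip(y_true, y_pred):
        tp += len(true_labels & pred_labels)   # predicted and correct
        fp += len(pred_labels - true_labels)   # predicted but wrong
        fn += len(true_labels - pred_labels)   # missed labels
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical artifact labels for two samples.
truth = [{"blur", "grid"}, {"noise"}]
preds = [{"blur"}, {"noise", "grid"}]
score = micro_f1(truth, preds)
```

Because counts are pooled, frequent labels weigh more than rare ones, which is the usual reason to report Micro-F1 next to a macro average.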

[20] Keeping it Local, Tiny and Real: Automated Report Generation on Edge Computing Devices for Mechatronic-Based Cognitive Systems

Nicolas Schuler, Lea Dewald, Jürgen Graf

🧩 TL;DR

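This paper proposes an automated report generation pipeline for mobile robotics that relies solely on local models running on edge devices. It preserves user privacy, removes the need for external services, and generates natural-language reports from multi-modal sensor data.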


📘 Detailed Summary

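Motivation: Deep learning is increasingly used in hardware-based cognitive systems, and critical tasks such as autonomous driving and service robotics require evaluating large amounts of heterogeneous data. Automated report generation can play a crucial role in facilitating the evaluation and acceptance of such systems across domains, but existing approaches often depend on external services and carry privacy risks.

Method: The paper proposes an automated report generation pipeline based on multi-modal sensors that relies entirely on local models deployed on edge computing devices, with no dependence on external cloud services. It uses data from various sensors to generate natural-language reports while preserving the privacy of all actors involved.

Result: The pipeline is evaluated on a diverse dataset spanning indoor, outdoor, and urban environments, with both quantitative and qualitative results. Generated example reports and supplementary materials are available through a public repository and support the effectiveness of the approach.

Conclusion: The study shows that automated report generation running entirely on edge devices is feasible. It gives mobile robotic systems a privacy-preserving evaluation solution with potential for use in critical applications, and serves as a reference for future local AI systems.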


📄 Abstract

Recent advancements in Deep Learning enable hardware-based cognitive systems, that is, mechatronic systems in general and robotics in particular with integrated Artificial Intelligence, to interact with dynamic and unstructured environments. While the results are impressive, the application of such systems to critical tasks like autonomous driving as well as service and care robotics necessitates the evaluation of large amounts of heterogeneous data. Automated report generation for Mobile Robotics can play a crucial role in facilitating the evaluation and acceptance of such systems in various domains. In this paper, we propose a pipeline for generating automated reports in natural language utilizing various multi-modal sensors that solely relies on local models capable of being deployed on edge computing devices, thus preserving the privacy of all actors involved and eliminating the need for external services. In particular, we evaluate our implementation on a diverse dataset spanning multiple domains including indoor, outdoor and urban environments, providing quantitative as well as qualitative evaluation results. Various generated example reports and other supplementary materials are available via a public repository.

[21] Seeing Across Time and Views: Multi-Temporal Cross-View Learning for Robust Video Person Re-Identification

Md Rashidunnabi, Kailash A. Hambarde, Vasco Lopes, Joao C. Neves, Hugo Proenca

🧩 TL;DR

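This paper proposes MTF-CVReID, a framework that augments a ViT-B/16 backbone with seven parameter-efficient modules. It substantially improves cross-view video person re-identification and achieves state-of-the-art performance while maintaining real-time efficiency.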


📘 Detailed Summary

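Motivation: Cross-view video person re-identification faces extreme viewpoint shifts, scale disparities, and temporal inconsistencies. Existing methods perform poorly in scenarios such as aerial-ground surveillance, so a robust framework that addresses these problems together is needed.

Method: On top of a ViT-B/16 backbone, the framework introduces seven complementary modules: Cross-Stream Feature Normalization to correct camera and view biases; Multi-Resolution Feature Harmonization for scale stabilization; an Identity-Aware Memory Module to reinforce identity traits; Temporal Dynamics Modeling for motion-aware encoding; Inter-View Feature Alignment for perspective-invariant representations; Hierarchical Temporal Pattern Learning to capture multi-scale temporal regularities; and Multi-View Identity Consistency Learning, which enforces cross-view identity coherence through contrastive learning.

Result: MTF-CVReID adds only about 2 million parameters and 0.7 GFLOPs, maintains real-time efficiency (189 FPS), achieves state-of-the-art performance on the AG-VPReID benchmark across all altitude levels, and generalizes strongly across datasets to G2A-VReID and MARS.

Conclusion: Carefully designed adapter-based modules can substantially enhance cross-view robustness and temporal consistency without sacrificing computational efficiency. This demonstrates that parameter-efficient methods work for complex vision tasks and offers a viable option for practical deployment.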


📄 Abstract

Video-based person re-identification (ReID) in cross-view domains (for example, aerial-ground surveillance) remains an open problem because of extreme viewpoint shifts, scale disparities, and temporal inconsistencies. To address these challenges, we propose MTF-CVReID, a parameter-efficient framework that introduces seven complementary modules over a ViT-B/16 backbone. Specifically, we include: (1) Cross-Stream Feature Normalization (CSFN) to correct camera and view biases; (2) Multi-Resolution Feature Harmonization (MRFH) for scale stabilization across altitudes; (3) Identity-Aware Memory Module (IAMM) to reinforce persistent identity traits; (4) Temporal Dynamics Modeling (TDM) for motion-aware short-term temporal encoding; (5) Inter-View Feature Alignment (IVFA) for perspective-invariant representation alignment; (6) Hierarchical Temporal Pattern Learning (HTPL) to capture multi-scale temporal regularities; and (7) Multi-View Identity Consistency Learning (MVICL) that enforces cross-view identity coherence using a contrastive learning paradigm. Despite adding only about 2 million parameters and 0.7 GFLOPs over the baseline, MTF-CVReID maintains real-time efficiency (189 FPS) and achieves state-of-the-art performance on the AG-VPReID benchmark across all altitude levels, with strong cross-dataset generalization to G2A-VReID and MARS datasets. These results show that carefully designed adapter-based modules can substantially enhance cross-view robustness and temporal consistency without compromising computational efficiency. The source code is available at https://github.com/MdRashidunnabi/MTF-CVReID

[22] Zero-Shot Multi-Animal Tracking in the Wild

Jan Frederik Meier, Timo Lüddecke

🧩 TL;DR

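This paper proposes a zero-shot multi-animal tracking framework built on vision foundation models. By combining a Grounding Dino object detector with the SAM 2 tracker, it can be applied to new datasets without retraining or hyperparameter tuning.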


📘 Detailed Summary

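Motivation: Multi-animal tracking is crucial for understanding animal ecology and behavior. Because of variations in habitat, motion patterns, and species appearance, traditional approaches typically require extensive model fine-tuning and heuristic design for each application scenario, which limits their generality and efficiency.

Method: The approach combines a Grounding Dino object detector with the Segment Anything Model 2 (SAM 2) tracker and carefully designed heuristics, yielding a zero-shot tracking framework that needs no retraining or hyperparameter adaptation.

Result: Evaluations on ChimpAct, Bird Flock Tracking, AnimalTrack, and a subset of GMOT-40 show strong and consistent tracking performance across diverse species and environments.

Conclusion: The study demonstrates the potential of vision foundation models for zero-shot multi-animal tracking and offers ecology and animal behavior research a general and efficient solution that reduces the reliance on scenario-specific tuning.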


📄 Abstract

Multi-animal tracking is crucial for understanding animal ecology and behavior. However, it remains a challenging task due to variations in habitat, motion patterns, and species appearance. Traditional approaches typically require extensive model fine-tuning and heuristic design for each application scenario. In this work, we explore the potential of recent vision foundation models for zero-shot multi-animal tracking. By combining a Grounding Dino object detector with the Segment Anything Model 2 (SAM 2) tracker and carefully designed heuristics, we develop a tracking framework that can be applied to new datasets without any retraining or hyperparameter adaptation. Evaluations on ChimpAct, Bird Flock Tracking, AnimalTrack, and a subset of GMOT-40 demonstrate strong and consistent performance across diverse species and environments. The code is available at https://github.com/ecker-lab/SAM2-Animal-Tracking.

[23] Can Visual Input Be Compressed? A Visual Token Compression Benchmark for Large Multimodal Models

Tianfan Peng, Yuntao Du, Pengzhou Ji, Shijie Dong, Kailin Jiang, Mingchuan Ma, Yijun Tian, Jinhe Bi, Qian Li, Wei Du, Feng Xiao, Lizhen Cui

🧩 TL;DR

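This paper presents UniPruneBench, a unified and extensible benchmark for visual token pruning in large multimodal models. Its standardized evaluation protocol reveals several key findings: random pruning is a strong baseline, no method is consistently best, and the pruning ratio dominates performance degradation.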


📘 Detailed Summary

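Motivation: Large multimodal models suffer from severe inference inefficiency because image encoders introduce a large number of visual tokens. The evaluation of existing token compression methods is fragmented and inconsistent, and no unified benchmark exists for systematically comparing pruning algorithms.

Method: UniPruneBench provides standardized evaluation protocols across six ability dimensions and ten datasets, covering ten representative compression algorithms and three families of LMMs (LLaVA-v1.5, Intern-VL3, and Qwen2.5-VL). Beyond task accuracy, it incorporates system-level metrics such as runtime and prefilling latency.

Result: Random pruning is a surprisingly strong baseline; no single method consistently outperforms the others across scenarios; pruning sensitivity varies significantly across tasks, with OCR the most vulnerable; and the pruning ratio is the dominant factor governing performance degradation.

Conclusion: UniPruneBench provides a reliable foundation for future research on efficient multimodal modeling. It shows that the choice of pruning method needs to be tuned to the task at hand and underlines the importance of controlling the pruning ratio.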


📄 Abstract

Large multimodal models (LMMs) often suffer from severe inference inefficiency due to the large number of visual tokens introduced by image encoders. While recent token compression methods, such as pruning and merging, have shown promise in reducing redundancy, their evaluation remains fragmented and inconsistent. In this work, we present UniPruneBench, a unified and extensible benchmark for visual token pruning in multimodal LLMs. UniPruneBench provides standardized protocols across six ability dimensions and ten datasets, covering ten representative compression algorithms and three families of LMMs (LLaVA-v1.5, Intern-VL3, and Qwen2.5-VL). Beyond task accuracy, it incorporates system-level metrics such as runtime and prefilling latency to provide a holistic view. Our experiments uncover several key findings: (1) random pruning is a surprisingly strong baseline, (2) no single method consistently outperforms others across scenarios, (3) pruning sensitivity varies significantly across tasks, with OCR being most vulnerable, and (4) pruning ratio is the dominant factor governing performance degradation. We believe UniPruneBench will serve as a reliable foundation for future research on efficient multimodal modeling.
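To make the random-pruning baseline concrete, the sketch below keeps a uniformly random subset of visual tokens at a given pruning ratio while preserving their original order. It is an illustration only, not UniPruneBench code: the function name, the `seed` parameter, and the choice to always keep at least one token are assumptions.

```python
import random


def random_prune(tokens, prune_ratio, seed=0):
    """Drop roughly `prune_ratio` of the tokens uniformly at random.

    The surviving tokens keep their original relative order.
    """
    if not 0.0 <= prune_ratio < 1.0:
        raise ValueError("prune_ratio must be in [0, 1)")
    if not tokens:
        return []
    # Assumption: keep at least one token so the model still sees some visual input.
    n_keep = max(1, round(len(tokens) * (1.0 - prune_ratio)))
    rng = random.Random(seed)
    keep_idx = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in keep_idx]


# Hypothetical example: 576 visual tokens, 75% of them pruned.
visual_tokens = list(range(576))
kept = random_prune(visual_tokens, prune_ratio=0.75)
```

Any learned pruning method can be compared against this baseline at the same ratio, which isolates the effect of the selection rule from the effect of the ratio itself.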

[24] Modality-Transition Representation Learning for Visible-Infrared Person Re-Identification

Chao Yuan, Zanwu Liu, Guiwei Zhang, Haoxuan Xu, Yujian Zhao, Guanglin Niu, Bo Li

🧩 TL;DR

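This paper proposes a visible-infrared person re-identification framework based on Modality-Transition Representation Learning (MTRL). It generates an intermediate image that acts as a modality transmitter, aligns cross-modal features effectively without additional parameters, and significantly outperforms existing state-of-the-art methods on three typical datasets.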


📘 Detailed Summary

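Motivation: Existing visible-infrared person re-identification methods rely mainly on intermediate representations to align cross-modal features. These representations are usually obtained by generating intermediate images or by fusing intermediate features, which adds parameters and reduces interpretability, and they do not make full use of the information in the intermediate features.

Method: The paper proposes a Modality-Transition Representation Learning framework that generates an intermediate image as a transmitter from the visible to the infrared modality. The image is fully aligned with the original visible image and similar to the infrared modality. Training uses a modality-transition contrastive loss and a modality-query regularization loss, which align cross-modal features effectively without additional parameters.

Result: Extensive experiments on three typical visible-infrared person re-identification datasets show that the model significantly and consistently outperforms existing state-of-the-art methods while keeping the same inference speed as the backbone.

Conclusion: The study shows that carefully designed modality-transition representation learning can address feature alignment between the visible and infrared modalities and improve performance without increasing model complexity, offering a new and effective approach to cross-modal person re-identification.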


📄 Abstract

Visible-infrared person re-identification (VI-ReID) can associate pedestrian images across visible and infrared modalities in practical scenarios where background illumination changes. However, a substantial gap inherently exists between these two modalities. Besides, existing methods primarily rely on intermediate representations to align cross-modal features of the same person. These intermediate representations are usually created either by generating intermediate images (a kind of data enhancement) or by fusing intermediate features (which adds parameters and lacks interpretability), and they do not make good use of the intermediate features. Thus, we propose a novel VI-ReID framework via Modality-Transition Representation Learning (MTRL), which uses a generated middle image as a transmitter from the visible to the infrared modality; this image is fully aligned with the original visible image and similar to the infrared modality. We then train with a modality-transition contrastive loss and a modality-query regularization loss, which align the cross-modal features more effectively. Notably, our proposed framework does not need any additional parameters, so it achieves the same inference speed as the backbone while improving its performance on the VI-ReID task. Extensive experimental results illustrate that our model significantly and consistently outperforms existing SOTAs on three typical VI-ReID datasets.

[25] Dynamic Reflections: Probing Video Representations with Text Alignment

Tyler Zhu, Tengda Han, Leonidas Guibas, Viorica Pătrăucean, Maks Ovsjanikov

🧩 TL;DR

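This paper presents the first comprehensive study of video-text representation alignment. It proposes parametric test-time scaling laws that capture cross-modal alignment behavior, reveals a correlation between semantic alignment and downstream task performance, and offers a zero-shot way to probe representation power for spatio-temporal data.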


📘 Detailed Summary

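Motivation: Although significant progress has been made in aligning images with text, the temporal nature of video data remains largely unexplored in this context. The study aims to fill this gap in video-text representation alignment and to systematically probe the capabilities of modern video and language encoders.

Method: The study proposes parametric test-time scaling laws that capture cross-modal alignment behavior, systematically analyzes how the richness of visual data (static images vs. multi-frame videos) and text data (single caption vs. a collection) affects alignment, and examines the link between semantic alignment and downstream task performance.

Result: Cross-modal alignment depends strongly on the richness of the visual and text data provided at test time. The proposed scaling laws show remarkable predictive power against empirical observations, and the results suggest that strong alignment against text encoders may be linked to general-purpose video representation and understanding.

Conclusion: The study establishes video-text alignment as an informative zero-shot way to probe representation power for spatio-temporal data. It offers insights into the representation power of video encoders and the mechanics of cross-modal alignment, and provides a challenging test-bed for evaluating the temporal reasoning of vision and language models.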


📄 Abstract

The alignment of representations from different modalities has recently been shown to provide insights on the structural similarities and downstream capabilities of different encoders across diverse data types. While significant progress has been made in aligning images with text, the temporal nature of video data remains largely unexplored in this context. In this work, we conduct the first comprehensive study of video-text representation alignment, probing the capabilities of modern video and language encoders. Our findings reveal several key insights. First, we demonstrate that cross-modal alignment highly depends on the richness of both visual (static images vs. multi-frame videos) and text (single caption vs. a collection) data provided at test time, especially when using state-of-the-art video encoders. We propose parametric test-time scaling laws that capture this behavior and show remarkable predictive power against empirical observations. Secondly, we investigate the correlation between semantic alignment and performance on both semantic and non-semantic downstream tasks, providing initial evidence that strong alignment against text encoders may be linked to general-purpose video representation and understanding. Finally, we correlate temporal reasoning with cross-modal alignment providing a challenging test-bed for vision and language models. Overall, our work introduces video-text alignment as an informative zero-shot way to probe the representation power of different encoders for spatio-temporal data. Project page can be found at https://video-prh.github.io/

[26] When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought

Yiyang Zhou, Haoqin Tu, Zijun Wang, Zeyu Wang, Niklas Muennighoff, Fan Nie, Yejin Choi, James Zou, Chaorui Deng, Shen Yan, Haoqi Fan, Cihang Xie, Huaxiu Yao, Qinghao Ye

🧩 TL;DR

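MIRA is a new benchmark designed to evaluate models in scenarios where generating intermediate visual images is essential for successful reasoning, highlighting the key role of visual thinking in complex reasoning.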


📘 Detailed Summary

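Motivation: The study addresses the limitation that traditional chain-of-thought methods rely solely on text. It examines model performance on complex tasks that require generating and using intermediate visual images (such as sketches, structural diagrams, or path drawings) to guide reasoning, a setup that closely mirrors how humans solve complex problems through "drawing to think".

Method: The MIRA benchmark contains 546 multimodal problems annotated with intermediate visual images and final answers. It proposes a unified evaluation protocol that spans three levels of evaluation input: direct input with image and question only, text-only CoT input with image and thinking prompts, and Visual-CoT input with both annotated image clues and textual thinking prompts. It also reports pass@k and majority voting accuracies under different k settings to probe the upper bound of model capacity.

Result: Existing multimodal large language models perform poorly when relying solely on textual prompts. When intermediate visual cues are provided, performance improves consistently, with an average relative gain of 33.7% across all models and tasks. Probing the upper bound by expanding the search space and designing textual prompts aligned with Visual-CoT yields only limited improvements compared to the Visual-CoT setting.

Conclusion: The results underscore the critical role of imagined visual information in enabling successful reasoning on MIRA. Visual thinking is essential for tasks that involve complex structures, spatial relationships, or reasoning steps that are difficult to express through language alone, and the benchmark gives multimodal reasoning research a new direction and test-bed.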


📄 Abstract

We propose MIRA, a new benchmark designed to evaluate models in scenarios where generating intermediate visual images is essential for successful reasoning. Unlike traditional CoT methods that rely solely on text, tasks in MIRA require models to generate and utilize intermediate images - such as sketches, structural diagrams, or path drawings - to guide their reasoning process. This setup closely mirrors how humans solve complex problems through "drawing to think". To solve this, MIRA focuses on tasks that are intrinsically challenging and involve complex structures, spatial relationships, or reasoning steps that are difficult to express through language alone. To ensure that our evaluation data is of high-quality, we include 546 multimodal problems, annotated with intermediate visual images and final answers. We also propose a unified evaluation protocol for MIRA that spans three levels of evaluation input: direct input with image and question only, text-only CoT input with image and thinking prompts, and Visual-CoT input with both annotated image clues and textual thinking prompts. To probe the upper bound of model capacity on our benchmark, we also report pass@k and majority voting accuracies under different k settings. Experimental results show that existing multimodal large language models, including strongest private models as well as strong open-weight models, perform poorly when relying solely on textual prompts. However, when intermediate visual cues are provided, model performance improves consistently, yielding an average relative gain of 33.7% across all models and tasks. We also probe the upper bound by expanding the search space and designing textual prompts aligned with Visual-CoT, but both yield only limited improvements compared to our Visual-CoT setting. These results underscore the critical role of imagined visual information in enabling successful reasoning on MIRA.
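The abstract reports pass@k and majority-voting accuracy but does not spell out how they are computed. The sketch below uses the widely used combinatorial pass@k estimator, 1 - C(n-c, k) / C(n, k) for n samples of which c are correct, and a plain most-frequent-answer vote. Whether MIRA uses exactly this estimator is an assumption, and the function names are hypothetical.

```python
from collections import Counter
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k samples drawn from n is correct,
    given that c of the n samples are correct."""
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


# Hypothetical problem: 10 sampled answers, 3 of them correct.
p1 = pass_at_k(n=10, c=3, k=1)
voted = majority_vote(["B", "A", "B", "C"])
```

pass@k rewards a model that is right at least once in k tries, while majority voting rewards a model whose most common answer is right, so the two numbers bound model capacity from different sides.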

cs.CL [Back]

[27] Let Multimodal Embedders Learn When to Augment Query via Adaptive Query Augmentation

Wongyu Kim, Hochang Lee, Sanghak Lee, Yoonsung Kim, Jaehyun Park

🧩 TL;DR

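This paper proposes M-Solomon, a universal multimodal embedder that adaptively decides when to augment a query. It addresses the latency and performance degradation caused by existing LLM-based embedders that augment every query, and it achieves notable gains in multimodal settings.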


📘 Detailed Summary

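Motivation: Current LLM-based embedders augment every query, which leads to substantial embedding latency, and augmentation can hurt performance for some queries. Previous methods have also not been explored in multimodal environments. These limitations motivate a multimodal embedding method that adaptively decides when to augment.

Method: The approach first divides the training queries into two groups at the dataset level, those that require augmentation and those that do not, and uses a powerful multimodal LLM to generate appropriate augmentations for the queries that require them. Through adaptive query augmentation, the model then learns to generate a synthetic augmentation with the prefix /augment only for queries that need it and the simple string /embed for the others.

Result: M-Solomon surpasses the baseline without augmentation by a large margin and also outperforms the baseline that always augments, while providing much faster embedding latency.

Conclusion: The study shows that adaptive query augmentation balances retrieval performance against computational cost and gives multimodal retrieval systems a smarter way to handle queries. Selective augmentation proves more effective than augmenting every query, which informs the design of future multimodal embedders.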


📄 Abstract

Query augmentation makes queries more meaningful by appending further information to the queries to find relevant documents. Current studies have proposed Large Language Model (LLM)-based embedders, which learn representation for embedding and generation for query augmentation in a multi-task manner by leveraging the generative capabilities of LLM. During inference, these jointly trained embedders have conducted query augmentation followed by embedding, showing effective results. However, augmenting every query leads to substantial embedding latency and query augmentation can be detrimental to performance for some queries. Also, previous methods have not been explored in multimodal environments. To tackle these problems, we propose M-Solomon, a universal multimodal embedder that can adaptively determine when to augment queries. Our approach first divides the queries of the training datasets into two groups at the dataset level. One includes queries that require augmentation and the other includes queries that do not. Then, we introduce a synthesis process that generates appropriate augmentations for queries that require them by leveraging a powerful Multimodal LLM (MLLM). Next, we present adaptive query augmentation. Through this step, M-Solomon can conduct query augmentation only when necessary by learning to generate synthetic augmentations with the prefix /augment for queries that demand them and to generate the simple string /embed for others. Experimental results showed that M-Solomon not only surpassed the baseline without augmentation by a large margin but also outperformed the baseline that always used augmentation, providing much faster embedding latency.
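The /augment versus /embed behavior can be pictured as a small routing step between generation and embedding. The sketch below is a minimal illustration, not M-Solomon's implementation: `build_embedding_input` is a hypothetical helper, and appending the augmentation to the query with a space is an assumption based on the abstract's description of augmentation as appending further information.

```python
AUGMENT_PREFIX = "/augment"
EMBED_TOKEN = "/embed"


def build_embedding_input(query, generated):
    """Decide what text to embed from the model's generated output.

    If the output starts with /augment, the rest is treated as the synthetic
    augmentation and appended to the query. Otherwise (for example /embed)
    the query is embedded unchanged.
    """
    generated = generated.strip()
    if generated.startswith(AUGMENT_PREFIX):
        augmentation = generated[len(AUGMENT_PREFIX):].strip()
        return f"{query} {augmentation}".strip()
    return query


# Hypothetical outputs for two queries.
augmented = build_embedding_input("red shoes", "/augment running sneakers in red")
plain = build_embedding_input("red shoes", EMBED_TOKEN)
```

In this picture the latency saving comes from the /embed branch: generating a single short string is much cheaper than generating a full augmentation for every query.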

[28] LiveSecBench: A Dynamic and Culturally-Relevant AI Safety Benchmark for LLMs in Chinese Context

Yudong Li, Zhongliang Yang, Kejiang Chen, Wenxuan Wang, Tianxin Zhang, Sifang Wan, Kecheng Wang, Haitian Li, Xu Wang, Lefan Cheng, Youdan Yang, Baocheng Chen, Ziyu Liu, Yufei Sun, Liyan Wu, Wenya Wen, Xingchi Gu, Peiru Yang

🧩 TL;DR

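This study proposes LiveSecBench, a dynamic and continuously updated safety benchmark for Chinese-language LLM application scenarios. It evaluates models across six critical safety dimensions and has so far evaluated 18 models.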


📘 Detailed Summary

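Motivation: There is currently no dynamic safety benchmark designed specifically for the Chinese language and LLM application scenarios. A safety evaluation system is needed that can be updated continuously and reflects Chinese legal and social frameworks.

Method: The LiveSecBench benchmark covers six critical dimensions: Legality, Ethics, Factuality, Privacy, Adversarial Robustness, and Reasoning Safety. A dynamic update schedule incorporates new threat vectors, such as Text-to-Image Generation Safety and Agentic Safety.

Result: LiveSecBench (v251030) has evaluated 18 LLMs, providing a landscape of AI safety in the Chinese-language context. The leaderboard is publicly accessible.

Conclusion: The benchmark provides a standardized framework for evaluating the safety of Chinese-language LLMs. Its dynamic updates keep the evaluation current, making it a useful reference for AI safety research and model development.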


📄 Abstract

In this work, we propose LiveSecBench, a dynamic and continuously updated safety benchmark specifically for Chinese-language LLM application scenarios. LiveSecBench evaluates models across six critical dimensions (Legality, Ethics, Factuality, Privacy, Adversarial Robustness, and Reasoning Safety) rooted in the Chinese legal and social frameworks. This benchmark maintains relevance through a dynamic update schedule that incorporates new threat vectors, such as the planned inclusion of Text-to-Image Generation Safety and Agentic Safety in the next update. For now, LiveSecBench (v251030) has evaluated 18 LLMs, providing a landscape of AI safety in the context of Chinese language. The leaderboard is publicly accessible at https://livesecbench.intokentech.cn/.

[29] Beyond Single Embeddings: Capturing Diverse Targets with Multi-Query Retrieval

Hung-Ting Chen, Xiang Liu, Shauli Ravfogel, Eunsol Choi

🧩 TL;DR

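This paper proposes the Autoregressive Multi-Embedding Retriever (AMER), which generates multiple query vectors to capture the multimodal distribution of relevant documents and outperforms traditional single-vector retrievers on several datasets.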


📘 Detailed Summary

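Motivation: Existing text retrievers typically generate a single query vector to retrieve relevant documents, yet the conditional distribution of relevant documents may be multimodal, representing different interpretations of the query. The study finds that existing retrievers perform worse as the distance between target document embeddings grows, which limits their ability to handle complex queries.

Method: The paper proposes a new retriever architecture, the Autoregressive Multi-Embedding Retriever (AMER). The model autoregressively generates multiple query vectors, and all predicted query vectors are used to retrieve documents from the corpus. This lets it capture a multimodal target distribution more effectively.

Result: On synthetic vectorized data, the method captures multiple target distributions perfectly and performs 4x better than a single-embedding model. On real-world multi-answer retrieval datasets, AMER shows relative gains of 4% and 21% over single-embedding baselines on the two datasets evaluated. The gains are larger on the subset of data where the target document embeddings are less similar to each other.

Conclusion: The study demonstrates the potential of multi-query vector retrievers and opens a new direction for handling the multimodal distributions of complex queries. The approach is especially effective when the target document embeddings differ widely, which informs the design of future retrieval systems.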


📄 Abstract

Most text retrievers generate one query vector to retrieve relevant documents. Yet, the conditional distribution of relevant documents for the query may be multimodal, e.g., representing different interpretations of the query. We first quantify the limitations of existing retrievers. All retrievers we evaluate struggle more as the distance between target document embeddings grows. To address this limitation, we develop a new retriever architecture, Autoregressive Multi-Embedding Retriever (AMER). Our model autoregressively generates multiple query vectors, and all the predicted query vectors are used to retrieve documents from the corpus. We show that on the synthetic vectorized data, the proposed method could capture multiple target distributions perfectly, showing 4x better performance than a single-embedding model. We also fine-tune our model on real-world multi-answer retrieval datasets and evaluate in-domain. AMER presents 4% and 21% relative gains over single-embedding baselines on two datasets we evaluate on. Furthermore, we consistently observe larger gains on the subset of the dataset where the embeddings of the target documents are less similar to each other. We demonstrate the potential of using a multi-query vector retriever and open up a new direction for future work.
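A toy example shows why several query vectors can help when the targets lie far apart. The sketch below scores each document by its best dot product over all query vectors and returns the top-k. This max-over-queries merge rule is an assumption (the abstract only says that all predicted vectors are used for retrieval), and the 2-D vectors and document ids are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def multi_query_retrieve(query_vecs, doc_vecs, top_k):
    """Rank documents by their best score over all query vectors.

    doc_vecs maps a document id to its embedding. Ties are broken by id.
    """
    best = {
        doc_id: max(dot(q, d) for q in query_vecs)
        for doc_id, d in doc_vecs.items()
    }
    ranked = sorted(best, key=lambda doc_id: (-best[doc_id], doc_id))
    return ranked[:top_k]


# Two relevant targets (d1, d2) point in different directions; d3 sits between them.
docs = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [0.7, 0.7]}
single = multi_query_retrieve([[0.7, 0.7]], docs, top_k=2)
multi = multi_query_retrieve([[1.0, 0.0], [0.0, 1.0]], docs, top_k=2)
```

With a single averaged query vector the in-between document d3 ranks first and one target is pushed out of the top 2, whereas two query vectors retrieve both targets.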

cs.AI [Back]

[30] Human-AI Co-Embodied Intelligence for Scientific Experimentation and Manufacturing

Xinyi Lin, Yuyang Zhang, Yuanhang Gan, Juntao Chen, Hao Shen, Yichun He, Lijun Li, Ze Yuan, Shuang Wang, Chaohao Wang, Rui Zhang, Na Li, Jia Liu

🧩 TL;DR

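This paper proposes human-AI co-embodied intelligence, a new paradigm that unites human users, agentic AI, and wearable hardware into an integrated system for real-world experimentation and intelligent manufacturing. The APEX system demonstrates the coupling of agentic reasoning with physical execution and achieves context-aware reasoning that exceeds general multimodal large language models.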


📘 Detailed Summary

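Motivation: Scientific experimentation and manufacturing rely on complex, multi-step procedures that demand continuous human expertise for precise execution and decision-making. Despite advances in machine learning and automation, conventional models remain confined to virtual domains, while real-world experimentation and manufacturing still depend on human supervision and expertise. This gap between machine intelligence and physical execution limits the reproducibility, scalability, and accessibility of scientific and manufacturing workflows.

Method: The paper proposes the human-AI co-embodied intelligence paradigm, which unites human users, agentic AI, and wearable hardware into an integrated system. It develops the APEX system, which couples agentic reasoning with physical execution through mixed reality. The system observes and interprets human actions, aligns them with standard operating procedures, provides 3D visual guidance, and analyzes every step.

Result: Implemented in a cleanroom for flexible electronics fabrication, APEX achieves context-aware reasoning with accuracy exceeding general multimodal large language models, corrects errors in real time, and transfers expertise to beginners. The system makes experimentation and manufacturing autonomous, traceable, interpretable, and scalable.

Conclusion: The work establishes a new class of agentic-physical-human intelligence that extends agentic reasoning beyond computation into the physical domain. The paradigm turns scientific research and manufacturing into autonomous, traceable, interpretable, and scalable processes and opens a path for human-AI collaborative intelligence in the physical world.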


📄 Abstract

Scientific experimentation and manufacturing rely on complex, multi-step procedures that demand continuous human expertise for precise execution and decision-making. Despite advances in machine learning and automation, conventional models remain confined to virtual domains, while real-world experimentation and manufacturing still rely on human supervision and expertise. This gap between machine intelligence and physical execution limits reproducibility, scalability, and accessibility across scientific and manufacturing workflows. Here, we introduce human-AI co-embodied intelligence, a new form of physical AI that unites human users, agentic AI, and wearable hardware into an integrated system for real-world experimentation and intelligent manufacturing. In this paradigm, humans provide precise execution and control, while agentic AI contributes memory, contextual reasoning, adaptive planning, and real-time feedback. The wearable interface continuously captures the experimental and manufacturing processes and facilitates seamless communication between humans and AI for corrective guidance and interpretable collaboration. As a demonstration, we present the Agentic-Physical Experimentation (APEX) system, coupling agentic reasoning with physical execution through mixed reality. APEX observes and interprets human actions, aligns them with standard operating procedures, provides 3D visual guidance, and analyzes every step. Implemented in a cleanroom for flexible electronics fabrication, the APEX system achieves context-aware reasoning with accuracy exceeding general multimodal large language models, corrects errors in real time, and transfers expertise to beginners. These results establish a new class of agentic-physical-human intelligence that extends agentic reasoning beyond computation into the physical domain, transforming scientific research and manufacturing into autonomous, traceable, interpretable, and scalable processes.

[31] When Modalities Conflict: How Unimodal Reasoning Uncertainty Governs Preference Dynamics in MLLMs

Zhuoran Zhang, Tengyue Wang, Xilin Gong, Yang Shi, Haotian Wang, Di Wang, Lijie Hu

🧩 TL;DR

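This study proposes a new framework that decomposes modality-following behavior in multimodal large language models, showing how two fundamental factors, relative reasoning uncertainty and inherent modality preference, jointly determine a model's decisions under conflicting information. By constructing a controllable dataset and using entropy as a fine-grained uncertainty measure, it finds a universal law: the probability of following a modality decreases monotonically as its relative uncertainty increases.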


📘 Detailed Summary

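Motivation: Prior work measures modality following in multimodal large language models only with coarse dataset-level statistics, overlooking the influence of the model's confidence in unimodal reasoning. This simplification cannot reveal the mechanism by which modality conflicts are resolved, in particular how the model decides when different modalities provide contradictory information.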

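Method: The study constructs a controllable dataset that systematically varies the reasoning difficulty of the visual and textual inputs, and uses entropy as a fine-grained uncertainty metric. Probing layer-wise predictions reveals an internal oscillation mechanism in the region near the balance point.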

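Result: The experiments uncover a universal law: the probability of following a modality decreases monotonically as its relative uncertainty increases. At the balance point, the model tends to follow both modalities with comparable probability, which provides a practical indicator of inherent modality preference. Layer-wise analysis shows that in ambiguous regions near the balance point, the model oscillates between modalities across layers.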

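Conclusion: Relative uncertainty and inherent preference are the two governing principles of modality following, giving both a quantitative framework and mechanistic insight into how multimodal large language models resolve conflicting information. The result offers a more principled and less confounded way to characterize modality bias than traditional macro-level ratios, disentangling it from unimodal capabilities and dataset artifacts.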


📄 Abstract

Multimodal large language models (MLLMs) must resolve conflicts when different modalities provide contradictory information, a process we term modality following. Prior work measured this behavior only with coarse dataset-level statistics, overlooking the influence of the model's confidence in unimodal reasoning. In this paper, we introduce a new framework that decomposes modality following into two fundamental factors: relative reasoning uncertainty (the case-specific confidence gap between unimodal predictions) and inherent modality preference (a model's stable bias when uncertainties are balanced). To validate this framework, we construct a controllable dataset that systematically varies the reasoning difficulty of visual and textual inputs. Using entropy as a fine-grained uncertainty metric, we uncover a universal law: the probability of following a modality decreases monotonically as its relative uncertainty increases. The relative difficulty level at which the model tends to follow both modalities with comparable probability, which we call the balance point, serves as a practical indicator of the model's inherent preference. Unlike traditional macro-level ratios, this measure offers a more principled and less confounded way to characterize modality bias, disentangling it from unimodal capabilities and dataset artifacts. Further, by probing layer-wise predictions, we reveal the internal mechanism of oscillation: in ambiguous regions near the balance point, models vacillate between modalities across layers, explaining externally observed indecision. Together, these findings establish relative uncertainty and inherent preference as the two governing principles of modality following, offering both a quantitative framework and mechanistic insight into how MLLMs resolve conflicting information.
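
As a rough illustration of the entropy-based uncertainty measure, the sketch below computes the Shannon entropy of each unimodal prediction and a signed gap between the two. The function names and the normalized-difference form of `relative_uncertainty` are assumptions made for this example; the abstract does not give the paper's exact definition.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def relative_uncertainty(vision_probs, text_probs):
    """Hypothetical signed gap between unimodal uncertainties.

    Positive: the vision-only prediction is less certain than the text-only one.
    Negative: the text-only prediction is less certain.
    The normalized-difference form is an assumption, not the paper's formula.
    """
    h_vision = entropy(vision_probs)
    h_text = entropy(text_probs)
    total = h_vision + h_text
    return (h_vision - h_text) / total if total > 0 else 0.0


# Vision is undecided between two answers; text is fairly confident.
gap = relative_uncertainty([0.5, 0.5], [0.9, 0.1])
print(gap > 0)  # True: vision is the more uncertain modality here
```

Under the law reported in the abstract, a model would be expected to follow the vision input less often as this gap grows; a gap near zero corresponds to the balanced condition in which inherent preference becomes visible.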

[32] Chronic Kidney Disease Prognosis Prediction Using Transformer

Yohan Lee, DongGyun Kang, SeHoon Park, Sa-Yoon Park, Kwangsoo Kim

🧩 TL;DR

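This study proposes ProQ-BERT, a Transformer-based framework for predicting chronic kidney disease progression that integrates multi-modal electronic health record data and achieves ROC-AUC up to 0.995 for short-term prediction.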


📘 Detailed Summary

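Motivation: Chronic kidney disease affects nearly 10% of the global population and often progresses to end-stage renal failure, so accurate prognosis prediction is vital for timely intervention and resource optimization. Existing methods are limited in their use of multi-modal electronic health record data for accurate prediction.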

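Method: ProQ-BERT is a Transformer-based framework that integrates demographic, clinical, and laboratory data, uses quantization-based tokenization for continuous lab values, and uses attention mechanisms for interpretability. The model is pretrained with masked language modeling and fine-tuned for binary classification tasks that predict progression from stage 3a to stage 5.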

Result: Evaluated on a cohort of 91,816 patients, the model consistently outperforms CEHR-BERT, achieving ROC-AUC up to 0.995 and PR-AUC up to 0.989 for short-term prediction.

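Conclusion: The results show the effectiveness of Transformer architectures and temporal design choices in clinical prognosis modeling and offer a promising direction for personalized chronic kidney disease care, underscoring the importance of multi-modal data integration and attention mechanisms in clinical prediction tasks.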


📄 Abstract

Chronic Kidney Disease (CKD) affects nearly 10% of the global population and often progresses to end-stage renal failure. Accurate prognosis prediction is vital for timely interventions and resource optimization. We present a transformer-based framework for predicting CKD progression using multi-modal electronic health records (EHR) from the Seoul National University Hospital OMOP Common Data Model. Our approach (ProQ-BERT) integrates demographic, clinical, and laboratory data, employing quantization-based tokenization for continuous lab values and attention mechanisms for interpretability. The model was pretrained with masked language modeling and fine-tuned for binary classification tasks predicting progression from stage 3a to stage 5 across varying follow-up and assessment periods. Evaluated on a cohort of 91,816 patients, our model consistently outperformed CEHR-BERT, achieving ROC-AUC up to 0.995 and PR-AUC up to 0.989 for short-term prediction. These results highlight the effectiveness of transformer architectures and temporal design choices in clinical prognosis modeling, offering a promising direction for personalized CKD care.
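
To make quantization-based tokenization concrete, here is a minimal sketch that maps a continuous lab value to a discrete token by binning it against quantile edges fitted on reference data. The abstract does not specify the binning scheme, so the quantile approach, the function names, and the `eGFR_Q1`-style token format are all assumptions for illustration.

```python
from bisect import bisect_right


def fit_bin_edges(values, n_bins):
    """Return n_bins - 1 interior bin edges taken at empirical quantiles."""
    ordered = sorted(values)
    return [ordered[(len(ordered) * k) // n_bins] for k in range(1, n_bins)]


def tokenize_lab(name, value, edges):
    """Map a continuous lab value to a token such as 'eGFR_Q1' (hypothetical format)."""
    return f"{name}_Q{bisect_right(edges, value)}"


# Made-up reference measurements, used only to fit the bin edges.
reference = [10, 20, 30, 40, 50, 60, 70, 80]
edges = fit_bin_edges(reference, n_bins=4)

print(edges)                             # [30, 50, 70]
print(tokenize_lab("eGFR", 45, edges))   # eGFR_Q1
print(tokenize_lab("eGFR", 75, edges))   # eGFR_Q3
```

Discretizing values this way lets a BERT-style model treat lab results as ordinary vocabulary items alongside diagnosis and medication codes, at the cost of losing resolution within each bin.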

[33] Agentic AI for Mobile Network RAN Management and Optimization

Jorge Pellejero, Luis A. Hernández Gómez, Luis Mendo Tomás, Zoraida Frias Barroso

🧩 TL;DR

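This paper proposes an Agentic AI framework for 5G and 6G networks that combines Large AI Models with core design patterns to enable autonomous optimization decisions in the radio access network, addressing the lack of a systematic framework in this area.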


📘 Detailed Summary

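Motivation: The complexity of 5G and upcoming 6G networks makes manual optimization ineffective. Agentic AI is advancing rapidly, yet it lacks a universally accepted definition and an established framework to guide its use for automating decisions in dynamic radio access network environments.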

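Method: The paper proposes Agentic AI systems built on Large AI Models, using the core design patterns of reflection, planning, tool use, and multi-agent collaboration, combined with time-series analytics and KPI-driven autonomous decision-making to orchestrate intelligent behavior.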

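Result: A practical 5G RAN case study shows how time-series analytics and LAM-driven agents collaborate for KPI-based autonomous decision-making, demonstrating the feasibility of Agentic AI for network optimization.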

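Conclusion: Agentic AI offers a new paradigm for 5G/6G network automation. By integrating multimodal perception, planning, memory, and reasoning, it can autonomously decompose network goals, retain context, and adapt dynamically, pointing the way toward more intelligent future networks.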


📄 Abstract

Agentic AI represents a new paradigm for automating complex systems by using Large AI Models (LAMs) to provide human-level cognitive abilities with multimodal perception, planning, memory, and reasoning capabilities. This will lead to a new generation of AI systems that autonomously decompose goals, retain context over time, learn continuously, operate across tools and environments, and adapt dynamically. The complexity of 5G and upcoming 6G networks renders manual optimization ineffective, pointing to Agentic AI as a method for automating decisions in dynamic RAN environments. However, despite its rapid advances, there is neither an established framework outlining the foundational components and operational principles of Agentic AI systems nor a universally accepted definition. This paper contributes to ongoing research on Agentic AI in 5G and 6G networks by outlining its core concepts and then proposing a practical use case that applies Agentic principles to RAN optimization. We first introduce Agentic AI, tracing its evolution from classical agents and discussing the progress from workflows and simple AI agents to Agentic AI. Core design patterns (reflection, planning, tool use, and multi-agent collaboration) are then described to illustrate how intelligent behaviors are orchestrated. These theoretical concepts are grounded in the context of mobile networks, with a focus on RAN management and optimization. A practical 5G RAN case study shows how time-series analytics and LAM-driven agents collaborate for KPI-based autonomous decision-making.

[34] When One Modality Sabotages the Others: A Diagnostic Lens on Multimodal Reasoning

Chenyu Zhang, Minsol Kim, Shohreh Ghorbani, Jingyao Wu, Rosalind Picard, Patricia Maes, Paul Pu Liang

🧩 TL;DR

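This paper introduces modality sabotage, a diagnostic failure mode, and develops a lightweight, model-agnostic evaluation layer that exposes the reasoning dynamics of each modality in multimodal large language models, treating each modality as an agent to identify contributors and saboteurs.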


📘 Detailed Summary

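Motivation: Despite the rapid growth of multimodal large language models, their reasoning remains opaque: it is hard to tell which modality drives a prediction, how conflicts between modalities are resolved, or when one modality dominates the decision. The paper addresses this lack of transparency and diagnostic capability in multimodal reasoning.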

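Method: The paper proposes a lightweight, model-agnostic evaluation layer that treats each modality as an agent, producing candidate labels and a brief self-assessment used for auditing. A simple fusion mechanism aggregates these outputs, exposing contributors (modalities supporting correct outcomes) and saboteurs (modalities that mislead the decision).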

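Result: In a case study on multimodal emotion recognition benchmarks, the diagnostic layer reveals systematic reliability profiles, providing insight into whether failures arise from dataset artifacts or model limitations.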

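Conclusion: The framework offers a diagnostic scaffold for multimodal reasoning, supporting principled auditing of fusion dynamics and informing possible interventions, and it helps explain decision processes and failure mechanisms in multimodal models.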


📄 Abstract

Despite rapid growth in multimodal large language models (MLLMs), their reasoning traces remain opaque: it is often unclear which modality drives a prediction, how conflicts are resolved, or when one stream dominates. In this paper, we introduce modality sabotage, a diagnostic failure mode in which a high-confidence unimodal error overrides other evidence and misleads the fused result. To analyze such dynamics, we propose a lightweight, model-agnostic evaluation layer that treats each modality as an agent, producing candidate labels and a brief self-assessment used for auditing. A simple fusion mechanism aggregates these outputs, exposing contributors (modalities supporting correct outcomes) and saboteurs (modalities that mislead). Applying our diagnostic layer in a case study on multimodal emotion recognition benchmarks with foundation models revealed systematic reliability profiles, providing insight into whether failures may arise from dataset artifacts or model limitations. More broadly, our framework offers a diagnostic scaffold for multimodal reasoning, supporting principled auditing of fusion dynamics and informing possible interventions.
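
The auditing idea can be sketched in a few lines: each modality agent reports a candidate label with a self-assessed confidence, a simple fusion picks the final label, and an audit step names contributors and saboteurs. The confidence-weighted vote and the rule "a saboteur is a modality whose wrong label became the fused output" are assumptions chosen for this example; the abstract says only that the fusion mechanism is simple.

```python
from dataclasses import dataclass


@dataclass
class ModalityVote:
    modality: str
    label: str
    confidence: float  # self-assessed, in [0, 1]


def fuse(votes):
    """Confidence-weighted vote: return the label with the largest total confidence."""
    totals = {}
    for vote in votes:
        totals[vote.label] = totals.get(vote.label, 0.0) + vote.confidence
    return max(totals, key=totals.get)


def audit(votes, fused_label, true_label):
    """Return (contributors, saboteurs) for one example.

    Contributors voted for the correct label. Saboteurs are reported only when
    the fused result is wrong: they are the modalities whose label won the fusion.
    """
    contributors = [v.modality for v in votes if v.label == true_label]
    saboteurs = []
    if fused_label != true_label:
        saboteurs = [v.modality for v in votes if v.label == fused_label]
    return contributors, saboteurs


# Made-up example: one overconfident video vote overrides two correct votes.
votes = [
    ModalityVote("text", "happy", 0.4),
    ModalityVote("audio", "happy", 0.3),
    ModalityVote("video", "angry", 0.95),
]
fused = fuse(votes)
print(fused)                         # angry
print(audit(votes, fused, "happy"))  # (['text', 'audio'], ['video'])
```

Aggregating these per-example audits over a benchmark would give a per-modality reliability profile of the kind the abstract describes.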

[35] Agent-Omni: Test-Time Multimodal Reasoning via Model Coordination for Understanding Anything

Huawei Lin, Yunzhi Shi, Tong Geng, Weijie Zhao, Wei Wang, Ravender Pal Singh

🧩 TL;DR

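This paper proposes Agent-Omni, a framework that coordinates existing foundation models through a master-agent system to enable multimodal reasoning without retraining, achieving state-of-the-art performance on a range of benchmarks.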


📘 Detailed Summary

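Motivation: Current multimodal large language models are limited to fixed modality combinations and require costly fine-tuning with large aligned datasets. Building fully omni-capable models that integrate text, images, audio, and video remains impractical and lacks robust reasoning support.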

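Method: The framework uses a master-agent system: the master agent interprets user intent, delegates subtasks to modality-specific agents, and integrates their outputs into a coherent response, enabling flexible multimodal reasoning without retraining.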

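Result: Extensive experiments across text, image, audio, video, and omni benchmarks show that Agent-Omni consistently achieves state-of-the-art performance, particularly on tasks requiring complex cross-modal reasoning.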

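Conclusion: The agent-based design enables seamless integration of specialized foundation models, ensuring adaptability to diverse inputs while maintaining transparency and interpretability. The framework is modular and easily extensible, allowing future improvements as stronger models become available.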


📄 Abstract

Multimodal large language models (MLLMs) have shown strong capabilities but remain limited to fixed modality pairs and require costly fine-tuning with large aligned datasets. Building fully omni-capable models that can integrate text, images, audio, and video remains impractical and lacks robust reasoning support. In this paper, we propose an Agent-Omni framework that coordinates existing foundation models through a master-agent system, enabling flexible multimodal reasoning without retraining. The master agent interprets user intent, delegates subtasks to modality-specific agents, and integrates their outputs into coherent responses. Extensive experiments across text, image, audio, video, and omni benchmarks show that Agent-Omni consistently achieves state-of-the-art performance, particularly on tasks requiring complex cross-modal reasoning. Its agent-based design enables seamless integration of specialized foundation models, ensuring adaptability to diverse inputs while maintaining transparency and interpretability. In addition, the framework is modular and easily extensible, allowing future improvements as stronger models become available.
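
The delegate-and-integrate flow can be sketched as a small dispatcher. Everything here is a stand-in: the stub agents replace real foundation models, the master agent forwards the user query unchanged instead of interpreting intent with a language model, and the names `make_stub_agent` and `master_agent` are hypothetical, not taken from Agent-Omni.

```python
from typing import Callable, Dict

# An agent takes (subtask, payload) and returns a textual finding.
Agent = Callable[[str, str], str]


def make_stub_agent(modality: str) -> Agent:
    """Stand-in for a modality-specific foundation model (hypothetical)."""
    def agent(subtask: str, payload: str) -> str:
        return f"[{modality}] {subtask}: {payload}"
    return agent


def master_agent(query: str, inputs: Dict[str, str], agents: Dict[str, Agent]) -> str:
    """Delegate each input to its modality agent and merge the findings.

    Inputs whose modality has no registered agent are skipped. A real master
    agent would also plan subtasks and synthesize the final answer.
    """
    findings = []
    for modality, payload in inputs.items():
        if modality in agents:
            findings.append(agents[modality](query, payload))
    return "\n".join(findings)


agents = {name: make_stub_agent(name) for name in ("image", "audio")}
answer = master_agent(
    "What happens?",
    {"image": "frame.png", "audio": "clip.wav", "video": "scene.mp4"},
    agents,
)
print(answer)
```

Because agents are looked up in a plain dictionary, supporting a new modality or swapping in a stronger model means registering another callable, which mirrors the modularity the abstract emphasizes.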