cs.CV
[1] Unbiased Visual Reasoning with Controlled Visual Inputs
Zhaonan Li, Shijie Lu, Fei Wang, Jacob Dineen, Xiao Ye, Zhikun Xu, Siyi Liu, Young Min Cho, Bangzheng Li, Daniel Chang, Kenny Nguyen, Qizheng Yang, Muhao Chen, Ben Zhou
🧩 TL;DR
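This paper proposes VISTA, a modular framework that decouples visual perception from reasoning. An information bottleneck restricts the vision-language model to short, objective perception queries, while a text-only large language model handles decomposition, planning, and aggregation. This reduces reliance on spurious correlations and makes visual reasoning more robust.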
📘 Detailed Summary
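Motivation: End-to-end vision-language models often answer visual questions by exploiting spurious correlations rather than causal visual evidence, and they become more shortcut-prone after fine-tuning. The resulting reasoning is biased and unreliable, which calls for a framework that separates perception from reasoning to improve robustness.
Method: VISTA uses a modular design that decouples perception from reasoning through an information bottleneck. A frozen VLM sensor is restricted to short, objective perception queries, while a text-only LLM reasoner decomposes the question, plans queries, and aggregates visual facts in natural language. Unbiased visual reasoning is then trained with reinforcement learning in the resulting reward-aligned environment.
Result: On the SpuriVerse benchmark, VISTA significantly improves robustness (+16.29% with Qwen-2.5-VL-7B and +6.77% with Llama-3.2-Vision-11B) while remaining competitive on MMVP and a balanced SeedBench subset. It transfers across unseen VLM sensors and can recognize and recover from perception failures. Human analysis shows that its reasoning traces are more neutral and more explicitly grounded in visual evidence.
Conclusion: By explicitly separating perception from reasoning, VISTA reduces the reliance of vision-language systems on spurious correlations and offers an interpretable, robust framework for visual reasoning. It transfers across different VLM sensors and handles perception failures, pointing to a new direction for building more reliable multimodal systems.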
📄 Abstract
End-to-end Vision-language Models (VLMs) often answer visual questions by exploiting spurious correlations instead of causal visual evidence, and can become more shortcut-prone when fine-tuned. We introduce VISTA (Visual-Information Separation for Text-based Analysis), a modular framework that decouples perception from reasoning via an explicit information bottleneck. A frozen VLM sensor is restricted to short, objective perception queries, while a text-only LLM reasoner decomposes each question, plans queries, and aggregates visual facts in natural language. This controlled interface defines a reward-aligned environment for training unbiased visual reasoning with reinforcement learning. Instantiated with Qwen2.5-VL and Llama3.2-Vision sensors, and trained with GRPO from only 641 curated multi-step questions, VISTA significantly improves robustness to real-world spurious correlations on SpuriVerse (+16.29% with Qwen-2.5-VL-7B and +6.77% with Llama-3.2-Vision-11B), while remaining competitive on MMVP and a balanced SeedBench subset. VISTA transfers robustly across unseen VLM sensors and is able to recognize and recover from VLM perception failures. Human analysis further shows that VISTA's reasoning traces are more neutral, less reliant on spurious attributes, and more explicitly grounded in visual evidence than end-to-end VLM baselines.
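The sensor/reasoner split described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the word cap `MAX_ANSWER_WORDS`, and the toy planner, sensor, and aggregator are all assumptions made for illustration. The point is only that the reasoner never sees the image; it sees short text answers to the queries it planned.

```python
from typing import Callable, List, Tuple

MAX_ANSWER_WORDS = 12  # hypothetical cap that enforces the "short answer" bottleneck


def bottlenecked_sensor(sensor: Callable[[str], str], query: str) -> str:
    """Ask the frozen sensor one perception query and truncate its reply."""
    words = sensor(query).split()
    return " ".join(words[:MAX_ANSWER_WORDS])


def answer_question(
    question: str,
    plan_queries: Callable[[str], List[str]],
    sensor: Callable[[str], str],
    aggregate: Callable[[str, List[Tuple[str, str]]], str],
) -> str:
    """Reasoner loop: decompose the question, query the sensor, aggregate text facts."""
    facts = [(q, bottlenecked_sensor(sensor, q)) for q in plan_queries(question)]
    return aggregate(question, facts)


# Toy stand-ins for the text-only LLM reasoner and the frozen VLM sensor.
def toy_plan(question: str) -> List[str]:
    return ["What color is the traffic light?", "Is a pedestrian in the crosswalk?"]


def toy_sensor(query: str) -> str:
    canned = {
        "What color is the traffic light?": "red",
        "Is a pedestrian in the crosswalk?": "no",
    }
    return canned[query]


def toy_aggregate(question: str, facts: List[Tuple[str, str]]) -> str:
    light = dict(facts)["What color is the traffic light?"]
    return "stop" if light == "red" else "go"


result = answer_question("Should the car stop or go?", toy_plan, toy_sensor, toy_aggregate)
```

In a real system the three callables would wrap model calls, and the reward for reinforcement learning would be computed on the reasoner's final answer.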
[2] SAMM2D: Scale-Aware Multi-Modal 2D Dual-Encoder for High-Sensitivity Intracranial Aneurysm Screening
Antara Titikhsha, Divyanshu Tak
🧩 TL;DR
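This paper proposes SAMM2D, a dual-encoder framework for intracranial aneurysm detection that reaches an AUC of 0.686 on the RSNA dataset, a 32% improvement over the clinical baseline. It also finds that data augmentation degrades performance when paired with a strong pretrained backbone, challenging the assumption in medical image analysis that "more augmentation is always better".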
📘 Detailed Summary
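Motivation: Aneurysm detection is essential for preventing fatal hemorrhages, but it is hampered by subtle aneurysm morphology, extreme class imbalance, and scarce annotated data. The effectiveness of existing data augmentation strategies in low-data medical imaging settings is also open to question.
Method: SAMM2D is a dual-encoder framework built on an ImageNet-pretrained backbone. The authors run a systematic ablation across six augmentation regimes, use Grad-CAM visualizations to check which regions the model attends to, and calibrate the decision threshold to optimize sensitivity.
Result: SAMM2D reaches an AUC of 0.686 on the RSNA intracranial aneurysm dataset, 32% above the clinical baseline. The unaugmented baseline outperforms all augmented variants by 1.75-2.23 percentage points (p < 0.01). With a calibrated threshold the model reaches 95% sensitivity, above average radiologist performance. Grad-CAM shows that 85% of true positives attend to relevant vascular regions (62% IoU with expert annotations).
Conclusion: The study challenges the assumption that data augmentation is always beneficial in medical image analysis. Strong pretrained features appear to be robust enough already, and additional augmentation disrupts the learned feature manifold. Future medical imaging workflows may benefit more from strong pretraining than from complex augmentation pipelines. In screening applications, the model is projected to save about $13.9M per 1,000 patients.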
📄 Abstract
Effective aneurysm detection is essential to avert life-threatening hemorrhages, but it remains challenging due to the subtle morphology of the aneurysm, pronounced class imbalance, and the scarcity of annotated data. We introduce SAMM2D, a dual-encoder framework that achieves an AUC of 0.686 on the RSNA intracranial aneurysm dataset, an improvement of 32% over the clinical baseline. In a comprehensive ablation across six augmentation regimes, we made a striking discovery: any form of data augmentation degraded performance when coupled with a strong pretrained backbone. Our unaugmented baseline model outperformed all augmented variants by 1.75-2.23 percentage points (p < 0.01), overturning the assumption that "more augmentation is always better" in low-data medical settings. We hypothesize that ImageNet-pretrained features already capture robust invariances, rendering additional augmentations both redundant and disruptive to the learned feature manifold. By calibrating the decision threshold, SAMM2D reaches 95% sensitivity, surpassing average radiologist performance, and translates to a projected $13.9M in savings per 1,000 patients in screening applications. Grad-CAM visualizations confirm that 85% of true positives attend to relevant vascular regions (62% IoU with expert annotations), demonstrating the model's clinically meaningful focus. Our results suggest that future medical imaging workflows could benefit more from strong pretraining than from increasingly complex augmentation pipelines.
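The abstract does not say how the decision threshold is calibrated. One common approach, sketched here under that caveat, is to pick the highest score threshold on a validation set that still keeps sensitivity at or above the target. The function names and the toy scores are illustrative assumptions, not values from the paper.

```python
import math
from typing import Sequence


def threshold_for_sensitivity(
    scores: Sequence[float], labels: Sequence[int], target: float = 0.95
) -> float:
    """Highest threshold whose sensitivity on (scores, labels) is >= target."""
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not positives:
        raise ValueError("need at least one positive example")
    # Number of positives that must score at or above the threshold.
    # The small epsilon guards against float round-off in target * n.
    needed = max(1, math.ceil(target * len(positives) - 1e-9))
    return positives[needed - 1]


def sensitivity(scores: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """Fraction of positives predicted positive at the given threshold."""
    true_pos = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    return true_pos / sum(labels)


# Toy validation set: ten positives followed by three negatives.
val_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.15, 0.1, 0.05, 0.35, 0.12]
val_labels = [1] * 10 + [0] * 3

thr = threshold_for_sensitivity(val_scores, val_labels, target=0.9)
sens = sensitivity(val_scores, val_labels, thr)
```

Lowering the threshold this way trades specificity for sensitivity, so the false-positive rate at the chosen threshold should be reported alongside it.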
[3] HookMIL: Revisiting Context Modeling in Multiple Instance Learning for Computational Pathology
Xitong Ling, Minxi Ouyang, Xiaoxiao Li, Jiawen Li, Ying Chen, Yuxuan Sun, Xinrui Chen, Tian Guan, Xiaoping Liu, Yonghong He
🧩 TL;DR
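This paper proposes HookMIL, an efficient, context-aware multiple instance learning framework for weakly supervised analysis of whole-slide images in computational pathology. Learnable hook tokens provide structured contextual aggregation, improving performance while keeping complexity linear.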
📘 Detailed Summary
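Motivation: Traditional multiple instance learning methods often lose crucial contextual information when processing whole-slide images. Transformer-based variants are more expressive but suffer from quadratic complexity and redundant computation. A solution is needed that preserves context and remains computationally efficient.
Method: HookMIL uses learnable hook tokens for structured contextual aggregation and supports three initializations: key-patch visual features, text embeddings from vision-language pathology models, and spatially grounded features from spatial transcriptomics-vision models. Hook tokens interact with instances through bidirectional attention with linear complexity. A Hook Diversity Loss promotes specialization, and a hook-to-hook communication mechanism refines contextual interactions.
Result: Extensive experiments on four public pathology datasets show that HookMIL achieves state-of-the-art performance with improved computational efficiency and interpretability. The multimodal initialization strategy accelerates convergence and improves representation quality.
Conclusion: With its hook token mechanism, HookMIL addresses both the loss of context in traditional MIL methods and the computational inefficiency of Transformer variants, giving computational pathology an efficient and interpretable approach to weakly supervised learning. Its multimodal initialization shows the benefit of incorporating textual and spatial priors.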
📄 Abstract
Multiple Instance Learning (MIL) has enabled weakly supervised analysis of whole-slide images (WSIs) in computational pathology. However, traditional MIL approaches often lose crucial contextual information, while transformer-based variants, though more expressive, suffer from quadratic complexity and redundant computations. To address these limitations, we propose HookMIL, a context-aware and computationally efficient MIL framework that leverages compact, learnable hook tokens for structured contextual aggregation. These tokens can be initialized from (i) key-patch visual features, (ii) text embeddings from vision-language pathology models, and (iii) spatially grounded features from spatial transcriptomics-vision models. This multimodal initialization enables hook tokens to incorporate rich textual and spatial priors, accelerating convergence and enhancing representation quality. During training, hook tokens interact with instances through bidirectional attention with linear complexity. To further promote specialization, we introduce a Hook Diversity Loss that encourages each token to focus on distinct histopathological patterns. Additionally, a hook-to-hook communication mechanism refines contextual interactions while minimizing redundancy. Extensive experiments on four public pathology datasets demonstrate that HookMIL achieves state-of-the-art performance, with improved computational efficiency and interpretability. Code is available at https://github.com/lingxitong/HookMIL.
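To see why a small set of hook tokens gives linear cost, consider the following pure-Python sketch. It is an assumption-laden illustration, not the released HookMIL code: it omits learned projections, and the diversity penalty (mean squared cosine similarity between hooks) is just one plausible form of a loss that pushes tokens apart. With N instances and K hooks, each attention pass scores N x K pairs rather than N x N.

```python
import math
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries: List[Vector], keys: List[Vector]) -> List[Vector]:
    """Each query becomes a softmax-weighted average of the keys.

    Cost is O(len(queries) * len(keys)).
    """
    dim = len(keys[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in keys])
        out.append([sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)])
    return out


def hook_round_trip(
    instances: List[Vector], hooks: List[Vector]
) -> Tuple[List[Vector], List[Vector]]:
    """Bidirectional pass: hooks gather from instances, then instances read hooks."""
    new_hooks = attend(hooks, instances)          # K x N scores
    new_instances = attend(instances, new_hooks)  # N x K scores
    return new_instances, new_hooks


def diversity_penalty(hooks: List[Vector]) -> float:
    """Mean squared cosine similarity over distinct hook pairs (needs >= 2 hooks)."""
    total, pairs = 0.0, 0
    for i in range(len(hooks)):
        for j in range(i + 1, len(hooks)):
            norm = math.sqrt(dot(hooks[i], hooks[i])) * math.sqrt(dot(hooks[j], hooks[j]))
            cos = dot(hooks[i], hooks[j]) / norm
            total += cos * cos
            pairs += 1
    return total / pairs


instances = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hooks = [[1.0, 0.0], [0.0, 1.0]]
new_instances, new_hooks = hook_round_trip(instances, hooks)
```

The penalty is 0 for mutually orthogonal hooks and 1 for parallel ones, so adding it to the training loss discourages redundant tokens.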
[4] Tiny-YOLOSAM: Fast Hybrid Image Segmentation
Kenneth Xu, Songhan Wu
🧩 TL;DR
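This paper proposes Tiny-YOLOSAM, a fast hybrid pipeline for full-scene segmentation. By combining box prompts generated by a YOLO detector with sparse point sampling, it makes TinySAM substantially more efficient while keeping segmentation quality high.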
📘 Detailed Summary
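Motivation: The Segment Anything Model (SAM) and its lightweight variant TinySAM are too computationally expensive for latency-critical settings. TinySAM preserves strong zero-shot mask quality through distillation, but its "segment-everything" mode still requires hundreds of prompts and runs slowly in practice, which limits its usefulness in real-time applications.
Method: The authors first replicate TinySAM on COCO val2017 to establish a reliable experimental baseline. They then propose the Tiny-YOLOSAM hybrid pipeline, which uses a recent YOLO detector (YOLOv12) to generate box prompts for salient foreground objects and adds sparse point prompts only in regions that the YOLO-guided masks do not cover, avoiding dense "segment-everything" prompting.
Result: On COCO val2017, the hybrid system substantially improves class-agnostic coverage (AR from 16.4% to 77.1%, mIoU from 19.2% to 67.8%) while cutting end-to-end runtime from 49.20 s/image to 10.39 s/image (a 4.7x speedup) on an Apple M1 Pro CPU, balancing efficiency and segmentation quality.
Conclusion: Detector-guided prompting combined with targeted sparse sampling is an effective alternative to dense "segment-everything" prompting for practical full-scene segmentation. It offers a practical design pattern for real-time segmentation systems that balances computational cost against segmentation quality.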
📄 Abstract
The Segment Anything Model (SAM) enables promptable, high-quality segmentation but is often too computationally expensive for latency-critical settings. TinySAM is a lightweight, distilled SAM variant that preserves strong zero-shot mask quality, yet its "segment-everything" mode still requires hundreds of prompts and remains slow in practice. We first replicate TinySAM on COCO val2017 using official checkpoints, matching the reported AP within 0.03%, establishing a reliable experimental baseline. Building on this, we propose Tiny-YOLOSAM, a fast hybrid pipeline that uses a recent YOLO detector (YOLOv12) to generate box prompts for TinySAM on salient foreground objects, and supplements uncovered regions with sparse point prompts sampled only where YOLO-guided masks provide no coverage. On COCO val2017, the hybrid system substantially improves class-agnostic coverage (AR from 16.4% to 77.1%, mIoU from 19.2% to 67.8%) while reducing end-to-end runtime from 49.20s/image to 10.39s/image (4.7x) on an Apple M1 Pro CPU. These results suggest detector-guided prompting combined with targeted sparse sampling as an effective alternative to dense "segment-everything" prompting for practical full-scene segmentation.
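The sparse-sampling step can be sketched as follows. This is a simplified illustration rather than the authors' pipeline: covered regions are approximated by axis-aligned boxes instead of the actual YOLO-guided masks, and the grid stride is a hypothetical parameter. The last lines check the reported speedup arithmetic.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), with x1 and y1 exclusive


def sparse_point_prompts(
    width: int, height: int, covered: List[Box], stride: int
) -> List[Tuple[int, int]]:
    """Return grid points that fall outside every already-covered region."""
    points = []
    for y in range(stride // 2, height, stride):
        for x in range(stride // 2, width, stride):
            inside = any(x0 <= x < x1 and y0 <= y < y1 for x0, y0, x1, y1 in covered)
            if not inside:
                points.append((x, y))
    return points


# Toy 8x8 image whose left half is already covered by a detector-guided mask.
points = sparse_point_prompts(8, 8, covered=[(0, 0, 4, 8)], stride=4)

# Speedup implied by the reported runtimes (seconds per image).
speedup = round(49.20 / 10.39, 1)
```

Only the two grid points in the uncovered right half are kept, so the segmenter is prompted far less often than with a dense grid.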
[5] Quadrant Segmentation VLM with Few-Shot Adaptation and OCT Learning-based Explainability Methods for Diabetic Retinopathy
Shivum Telang
🧩 TL;DR
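This paper proposes a novel multimodal explainability model that uses a vision-language model with few-shot learning. By analyzing lesion distributions within retinal quadrants, it mimics an ophthalmologist's diagnostic reasoning and provides a more comprehensive, explainable diagnostic tool for diabetic retinopathy.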
📘 Detailed Summary
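Motivation: Diabetic retinopathy (DR) diagnosis currently faces two main problems. First, existing AI models rely on lesion segmentation for interpretability, but manually annotating lesions is impractical for clinicians, who need a model that explains the reasoning behind a classification rather than one that only highlights lesion locations. Second, most existing models are unimodal and rely on a single imaging modality for explainability, which limits their effectiveness. What is needed is a multimodal, quantitative detection system that identifies individual DR lesions and describes them in natural language.
Method: The approach uses a VLM-based multimodal explainability model with few-shot learning, mimicking an ophthalmologist's diagnostic reasoning by analyzing lesion distributions within retinal quadrants. The model generates paired Grad-CAM heatmaps that show individual neuron weights across OCT and fundus images and highlight the regions contributing to DR severity classification. The study uses a dataset of 3,000 fundus images and 1,000 OCT images.
Result: The method addresses key limitations of current DR diagnostics and provides a more comprehensive diagnostic tool through multimodal analysis. The model produces heatmaps that show which retinal regions contribute to the classification decision, giving clinicians intuitive, explainable diagnostic support and improving transparency and trust.
Conclusion: The proposed multimodal explainability model is a practical, comprehensive tool for diabetic retinopathy diagnosis that can improve patient outcomes. By mimicking an ophthalmologist's reasoning and providing visual explanations, it has broad potential in screening, treatment, and research settings.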
📄 Abstract
Diabetic Retinopathy (DR) is a leading cause of vision loss worldwide, requiring early detection to preserve sight. Limited access to physicians often leaves DR undiagnosed. To address this, AI models utilize lesion segmentation for interpretability; however, manually annotating lesions is impractical for clinicians. Physicians require a model that explains the reasoning for classifications rather than just highlighting lesion locations. Furthermore, current models are one-dimensional, relying on a single imaging modality for explainability and achieving limited effectiveness. In contrast, a quantitative-detection system that identifies individual DR lesions in natural language would overcome these limitations, enabling diverse applications in screening, treatment, and research settings. To address this issue, this paper presents a novel multimodal explainability model utilizing a VLM with few-shot learning, which mimics an ophthalmologist's reasoning by analyzing lesion distributions within retinal quadrants for fundus images. The model generates paired Grad-CAM heatmaps, showcasing individual neuron weights across both OCT and fundus images, which visually highlight the regions contributing to DR severity classification. Using a dataset of 3,000 fundus images and 1,000 OCT images, this innovative methodology addresses key limitations in current DR diagnostics, offering a practical and comprehensive tool for improving patient outcomes.
[6] VLM-PAR: A Vision Language Model for Pedestrian Attribute Recognition
Abdellah Zakaria Sellam, Salah Eddine Bekhouche, Fadi Dornaika, Cosimo Distante, Abdenour Hadid
🧩 TL;DR
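This paper proposes VLM-PAR, a modular vision-language framework built on frozen SigLIP 2 multilingual encoders. Cross-modal feature refinement significantly improves pedestrian attribute recognition and sets a new state of the art on the severely imbalanced PA100K benchmark.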
📘 Detailed Summary
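Motivation: Pedestrian attribute recognition faces severe class imbalance, complex dependencies between attributes, and domain shift. Existing methods handle these problems poorly, so a more effective cross-modal learning framework is needed.
Method: VLM-PAR is built on frozen SigLIP 2 multilingual encoders. A compact cross-attention fusion mechanism refines the visual features and aligns image and prompt embeddings, yielding a modular vision-language architecture.
Result: The method achieves a significant accuracy improvement and a new state of the art on the highly imbalanced PA100K benchmark, and also delivers significant gains in mean accuracy on the PETA and Market-1501 benchmarks.
Conclusion: Combining large-scale vision-language pretraining with targeted cross-modal refinement effectively addresses the imbalance and generalization challenges in pedestrian attribute recognition, and suggests a way to apply cross-modal learning to fine-grained visual tasks.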
📄 Abstract
Pedestrian Attribute Recognition (PAR) involves predicting fine-grained attributes such as clothing color, gender, and accessories from pedestrian imagery, yet is hindered by severe class imbalance, intricate attribute co-dependencies, and domain shifts. We introduce VLM-PAR, a modular vision-language framework built on frozen SigLIP 2 multilingual encoders. VLM-PAR aligns image and prompt embeddings by refining visual features through a compact cross-attention fusion. It achieves a significant accuracy improvement on the highly imbalanced PA100K benchmark, setting a new state of the art, while also delivering significant gains in mean accuracy on the PETA and Market-1501 benchmarks. These results underscore the efficacy of integrating large-scale vision-language pretraining with targeted cross-modal refinement to overcome imbalance and generalization challenges in PAR.
[7] Towards Signboard-Oriented Visual Question Answering: ViSignVQA Dataset, Method and Benchmark
Hieu Minh Nguyen, Tam Le-Thanh Dang, Kiet Van Nguyen
🧩 TL;DR
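This paper presents ViSignVQA, the first large-scale visual question answering dataset for understanding Vietnamese signboard text, with 10,762 images and 25,573 question-answer pairs. It also improves existing VQA methods by integrating OCR and language models, significantly improving scene-text understanding in a low-resource language.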
📘 Detailed Summary
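Motivation: Understanding signboard text in natural scenes is essential for real-world visual question answering, but it remains underexplored in low-resource languages. Vietnamese in particular lacks a dedicated signboard-oriented VQA dataset, and existing methods struggle with the bilingual text, informal phrasing, and visual elements found on Vietnamese signboards.
Method: The authors build the ViSignVQA dataset, which captures the diverse linguistic, cultural, and visual characteristics of Vietnamese signboards, and take two main approaches. First, they integrate the Vietnamese OCR model SwinTextSpotter and the pretrained language model ViT5 into state-of-the-art VQA models such as BLIP-2, LaTr, PreSTU, and SaL. Second, they propose a multi-agent VQA framework that combines perception and reasoning agents and uses GPT-4 for collaborative reasoning.
Result: Appending OCR text to the question significantly improves performance, with F1-score improvements of up to 209%. The multi-agent framework reaches 75.98% accuracy via majority voting. The results confirm the key role of OCR-enhanced context in signboard text understanding and establish the first benchmark for Vietnamese scene-text VQA.
Conclusion: The study underscores the importance of domain-specific resources for improving text-based VQA in low-resource languages. ViSignVQA captures real-world scene-text characteristics, supports the development and evaluation of OCR-integrated VQA models in Vietnamese, and provides a benchmark and resource for multilingual scene-text understanding.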
📄 Abstract
Understanding signboard text in natural scenes is essential for real-world applications of Visual Question Answering (VQA), yet remains underexplored, particularly in low-resource languages. We introduce ViSignVQA, the first large-scale Vietnamese dataset designed for signboard-oriented VQA, which comprises 10,762 images and 25,573 question-answer pairs. The dataset captures the diverse linguistic, cultural, and visual characteristics of Vietnamese signboards, including bilingual text, informal phrasing, and visual elements such as color and layout. To benchmark this task, we adapted state-of-the-art VQA models (e.g., BLIP-2, LaTr, PreSTU, and SaL) by integrating a Vietnamese OCR model (SwinTextSpotter) and a Vietnamese pretrained language model (ViT5). The experimental results highlight the significant role of the OCR-enhanced context, with F1-score improvements of up to 209% when the OCR text is appended to questions. Additionally, we propose a multi-agent VQA framework combining perception and reasoning agents with GPT-4, achieving 75.98% accuracy via majority voting. Our study presents the first large-scale multimodal dataset for Vietnamese signboard understanding. This underscores the importance of domain-specific resources in enhancing text-based VQA for low-resource languages. ViSignVQA serves as a benchmark capturing real-world scene text characteristics and supporting the development and evaluation of OCR-integrated VQA models in Vietnamese.
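Two ingredients from the abstract, appending OCR text to the question and majority voting over agent answers, are simple enough to sketch. The `[OCR]` separator, the normalization rule, and the function names are illustrative assumptions; the paper's actual prompt format and voting details are not given here.

```python
from collections import Counter
from typing import List


def build_prompt(question: str, ocr_tokens: List[str]) -> str:
    """Append recognized signboard text to the question (separator is hypothetical)."""
    if not ocr_tokens:
        return question
    return f"{question} [OCR] {' '.join(ocr_tokens)}"


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer after light normalization.

    Ties resolve to the answer that appeared first.
    """
    normalized = [a.strip().lower() for a in answers]
    return Counter(normalized).most_common(1)[0][0]


prompt = build_prompt("What is the shop name?", ["PHO", "HOA"])
final_answer = majority_vote(["Pho Hoa", "pho hoa ", "Pho Hao"])
```

In practice the normalization step matters for Vietnamese, where diacritics and casing vary across agent outputs; a stricter or looser rule would change which answers count as agreeing.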
[8] VideoScaffold: Elastic-Scale Visual Hierarchies for Streaming Video Understanding in MLLMs
Naishan Zheng, Jie Huang, Qingpei Guo, Feng Zhao
🧩 TL;DR
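This paper presents VideoScaffold, a dynamic representation framework for streaming video understanding. It adaptively adjusts event granularity while preserving fine-grained visual semantics, addressing the fragmented or over-compressed outputs that existing static methods produce on continuous video streams.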
📘 Detailed Summary
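Motivation: Video understanding with multimodal large language models faces two main challenges: heavy redundancy across frames and the need for temporally coherent representations. Static strategies such as sparse sampling, frame compression, and clustering are optimized for offline settings. Applied to continuous video streams, they produce fragmented or over-compressed outputs and cannot adapt to the dynamics of streaming video.
Method: VideoScaffold has two core components. Elastic-Scale Event Segmentation (EES) uses prediction-guided segmentation to dynamically refine event boundaries, and Hierarchical Event Consolidation (HEC) progressively aggregates semantically related segments into multi-level abstractions. Together they let the system move smoothly from fine-grained frame understanding to abstract event reasoning as the video stream unfolds.
Result: Extensive experiments on offline and streaming video understanding benchmarks show that VideoScaffold achieves state-of-the-art performance. The framework is modular and plug-and-play, and it extends existing image-based multimodal large language models to continuous video understanding.
Conclusion: By dynamically adjusting event granularity, VideoScaffold offers an effective approach to streaming video understanding. It improves performance while keeping a modular design, and provides a general route for extending image-based models to video.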
📄 Abstract
Understanding long videos with multimodal large language models (MLLMs) remains challenging due to the heavy redundancy across frames and the need for temporally coherent representations. Existing static strategies, such as sparse sampling, frame compression, and clustering, are optimized for offline settings and often produce fragmented or over-compressed outputs when applied to continuous video streams. We present VideoScaffold, a dynamic representation framework designed for streaming video understanding. It adaptively adjusts event granularity according to video duration while preserving fine-grained visual semantics. VideoScaffold introduces two key components: Elastic-Scale Event Segmentation (EES), which performs prediction-guided segmentation to dynamically refine event boundaries, and Hierarchical Event Consolidation (HEC), which progressively aggregates semantically related segments into multi-level abstractions. Working in concert, EES and HEC enable VideoScaffold to transition smoothly from fine-grained frame understanding to abstract event reasoning as the video stream unfolds. Extensive experiments across both offline and streaming video understanding benchmarks demonstrate that VideoScaffold achieves state-of-the-art performance. The framework is modular and plug-and-play, seamlessly extending existing image-based MLLMs to continuous video comprehension. The code is available at https://github.com/zheng980629/VideoScaffold.
[9] Meta-information Guided Cross-domain Synergistic Diffusion Model for Low-dose PET Reconstruction
Mengxiao Geng, Ran Hong, Xiaoling Xu, Bingxuan Li, Qiegen Liu
🧩 TL;DR
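This paper proposes a meta-information guided cross-domain synergistic diffusion model (MiG-DM). By integrating cross-modal priors with projection-domain physical structure, it significantly improves low-dose PET image quality and the preservation of physiological details.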
📘 Detailed Summary
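Motivation: Low-dose PET imaging is crucial for reducing patient radiation exposure, but it suffers from noise interference, reduced contrast, and difficulty preserving physiological details. Existing methods often neglect projection-domain physics knowledge and patient-specific meta-information, both of which are critical for mining functional-semantic correlations.
Method: MiG-DM includes a meta-information encoding module that turns clinical parameters into semantic prompts, taking into account patient characteristics, dose-related information, and semi-quantitative parameters. This enables cross-modal alignment between textual meta-information and image reconstruction. The cross-domain architecture combines projection-domain and image-domain processing. In the projection domain, a specialized sinogram adapter captures global physical structures through convolution operations that are equivalent to global image-domain filtering.
Result: Experiments on the UDPET public dataset and on clinical datasets with varying dose levels show that MiG-DM outperforms state-of-the-art methods in enhancing PET image quality and preserving physiological details, supporting the value of cross-domain synergy and meta-information guidance.
Conclusion: The study shows the importance of integrating patient meta-information and cross-domain physical priors for low-dose PET reconstruction. It offers a new cross-modal synergistic framework for medical image reconstruction with potential for clinical use.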
📄 Abstract
Low-dose PET imaging is crucial for reducing patient radiation exposure but faces challenges like noise interference, reduced contrast, and difficulty in preserving physiological details. Existing methods often neglect both projection-domain physics knowledge and patient-specific meta-information, which are critical for functional-semantic correlation mining. In this study, we introduce a meta-information guided cross-domain synergistic diffusion model (MiG-DM) that integrates comprehensive cross-modal priors to generate high-quality PET images. Specifically, a meta-information encoding module transforms clinical parameters into semantic prompts by considering patient characteristics, dose-related information, and semi-quantitative parameters, enabling cross-modal alignment between textual meta-information and image reconstruction. Additionally, the cross-domain architecture combines projection-domain and image-domain processing. In the projection domain, a specialized sinogram adapter captures global physical structures through convolution operations equivalent to global image-domain filtering. Experiments on the UDPET public dataset and clinical datasets with varying dose levels demonstrate that MiG-DM outperforms state-of-the-art methods in enhancing PET image quality and preserving physiological details.
[10] Human-Aligned Generative Perception: Bridging Psychophysics and Generative Models
Antara Titikhsha, Om Kulkarni, Dharun Muthaiah
🧩 TL;DR
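This paper proposes using lightweight discriminators as external guidance signals. A Human Perception Embedding (HPE) teacher model brings geometric understanding into text-to-image diffusion models, enabling separate control of geometry and style and markedly improving the semantic alignment of generated images.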
📘 Detailed Summary
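Motivation: Text-to-image diffusion models generate highly detailed textures, but they rely mainly on surface appearance and struggle to follow strict geometric constraints, especially when those constraints conflict with the style implied by the text prompt. This reflects a semantic gap between human perception and current generative models, and raises the question of how to introduce geometric understanding without specialized training.
Method: The authors use lightweight, off-the-shelf discriminators as external guidance signals. They build a Human Perception Embedding (HPE) teacher trained on the THINGS triplet dataset, which captures human sensitivity to object shape. Injecting gradients from this teacher into the latent diffusion process allows geometry and style to be separated in a controllable way. The approach is evaluated on three architectures: the U-Net-based Stable Diffusion v1.5, the flow-matching model SiT-XL/2, and the diffusion transformer PixArt-Σ.
Result: Flow models tend to drift back toward their default trajectories without continuous guidance. The study demonstrates zero-shot transfer of complex three-dimensional shapes, such as an Eames chair, onto conflicting materials such as pink metal. Compared with unguided baselines, guided generation improves semantic alignment by about 80%.
Conclusion: Small teacher models can reliably guide large generative systems, enabling stronger geometric control and broadening the creative range of text-to-image synthesis. The approach offers a way to resolve conflicts between geometry and style without specialized training, and shows the potential of external guidance signals for improving the semantic understanding of generative models.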
📄 Abstract
Text-to-image diffusion models generate highly detailed textures, yet they often rely on surface appearance and fail to follow strict geometric constraints, particularly when those constraints conflict with the style implied by the text prompt. This reflects a broader semantic gap between human perception and current generative models. We investigate whether geometric understanding can be introduced without specialized training by using lightweight, off-the-shelf discriminators as external guidance signals. We propose a Human Perception Embedding (HPE) teacher trained on the THINGS triplet dataset, which captures human sensitivity to object shape. By injecting gradients from this teacher into the latent diffusion process, we show that geometry and style can be separated in a controllable manner. We evaluate this approach across three architectures: Stable Diffusion v1.5 with a U-Net backbone, the flow-matching model SiT-XL/2, and the diffusion transformer PixArt-Σ. Our experiments reveal that flow models tend to drift back toward their default trajectories without continuous guidance, and we demonstrate zero-shot transfer of complex three-dimensional shapes, such as an Eames chair, onto conflicting materials such as pink metal. This guided generation improves semantic alignment by about 80 percent compared to unguided baselines. Overall, our results show that small teacher models can reliably guide large generative systems, enabling stronger geometric control and broadening the creative range of text-to-image synthesis.
[11] The Illusion of Clinical Reasoning: A Benchmark Reveals the Pervasive Gap in Vision-Language Models for Clinical Competency
Dingyu Wang, Zimu Yuan, Jiajun Liu, Shanggui Liu, Nan Zhou, Tianxing Xu, Di Huang, Dong Jiang
🧩 TL;DR
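This study develops the Bones and Joints (B&J) Benchmark to evaluate the real clinical reasoning ability of AI models in orthopedics and sports medicine. It finds that current models perform markedly worse on open-ended tasks that require multimodal integration, indicating that they are not yet clinically competent.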
📘 Detailed Summary
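Motivation: Current benchmarks, based on medical licensing exams or curated vignettes, fail to capture the integrated, multimodal reasoning that real-world patient care requires. The rapid integration of foundation models into clinical practice calls for a rigorous evaluation of their true clinical reasoning capabilities rather than narrow examination success.
Method: The authors develop the Bones and Joints (B&J) Benchmark, comprising 1,245 questions derived from real-world patient cases in orthopedics and sports medicine. It assesses models on 7 tasks that mirror the clinical reasoning pathway, including knowledge recall, text and image interpretation, diagnosis generation, treatment planning, and rationale provision. Eleven vision-language models and six large language models are evaluated against expert-derived ground truth.
Result: There is a pronounced performance gap between task types. State-of-the-art models exceed 90% accuracy on structured multiple-choice questions, but accuracy scarcely reaches 60% on open-ended tasks requiring multimodal integration. Vision-language models show substantial limitations in interpreting medical images and frequently exhibit severe text-driven hallucinations, ignoring contradictory visual evidence. Models fine-tuned specifically for medical applications show no consistent advantage over general-purpose models.
Conclusion: Current AI models are not yet clinically competent for complex multimodal reasoning, and their safe deployment should be limited to supportive, text-based roles. Progress on core clinical tasks awaits fundamental breakthroughs in multimodal integration and visual understanding. Medical-specific fine-tuning has not produced a systematic advantage.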
📄 Abstract
Background: The rapid integration of foundation models into clinical practice and public health necessitates a rigorous evaluation of their true clinical reasoning capabilities beyond narrow examination success. Current benchmarks, typically based on medical licensing exams or curated vignettes, fail to capture the integrated, multimodal reasoning essential for real-world patient care. Methods: We developed the Bones and Joints (B&J) Benchmark, a comprehensive evaluation framework comprising 1,245 questions derived from real-world patient cases in orthopedics and sports medicine. This benchmark assesses models across 7 tasks that mirror the clinical reasoning pathway, including knowledge recall, text and image interpretation, diagnosis generation, treatment planning, and rationale provision. We evaluated eleven vision-language models (VLMs) and six large language models (LLMs), comparing their performance against expert-derived ground truth. Results: Our results demonstrate a pronounced performance gap between task types. While state-of-the-art models achieved high accuracy, exceeding 90%, on structured multiple-choice questions, their performance markedly declined on open-ended tasks requiring multimodal integration, with accuracy scarcely reaching 60%. VLMs demonstrated substantial limitations in interpreting medical images and frequently exhibited severe text-driven hallucinations, often ignoring contradictory visual evidence. Notably, models specifically fine-tuned for medical applications showed no consistent advantage over general-purpose counterparts. Conclusions: Current artificial intelligence models are not yet clinically competent for complex, multimodal reasoning. Their safe deployment should currently be limited to supportive, text-based roles. Future advancement in core clinical tasks awaits fundamental breakthroughs in multimodal integration and visual understanding.
[12] FETAL-GAUGE: A Benchmark for Assessing Vision-Language Models in Fetal Ultrasound
Hussain Alasmawi, Numan Saeed, Mohammad Yaqub
🧩 TL;DR
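This paper presents Fetal-Gauge, the first large-scale benchmark designed specifically to evaluate vision-language models on fetal ultrasound imaging. It contains over 42,000 images and 93,000 question-answer pairs and reveals a substantial performance gap for current state-of-the-art models on clinical tasks.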
📘 Detailed Summary
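Motivation: A global shortage of trained sonographers severely hinders prenatal fetal health monitoring. Vision-language models show promise for ultrasound interpretation but lack a standardized evaluation benchmark, mainly because fetal ultrasound is a challenging, operator-dependent modality with few publicly available datasets.
Method: The authors build the Fetal-Gauge benchmark, with over 42,000 fetal ultrasound images and 93,000 question-answer pairs. It covers anatomical plane identification, visual grounding of anatomical structures, fetal orientation assessment, clinical view conformity, and clinical diagnosis. They systematically evaluate several state-of-the-art vision-language models, including general-purpose and medical-specific models.
Result: Current vision-language models perform poorly on fetal ultrasound interpretation. The best model reaches only 55% accuracy, far below clinical requirements, exposing serious limitations of existing models in medical image understanding.
Conclusion: The study highlights the urgent need for domain-adapted architectures and specialized training approaches. It establishes a rigorous foundation for advancing multimodal deep learning in prenatal care and offers a path toward addressing global healthcare accessibility challenges. The benchmark will be made publicly available once the paper is accepted.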
📄 Abstract
The growing demand for prenatal ultrasound imaging has intensified a global shortage of trained sonographers, creating barriers to essential fetal health monitoring. Deep learning has the potential to enhance sonographers' efficiency and support the training of new practitioners. Vision-Language Models (VLMs) are particularly promising for ultrasound interpretation, as they can jointly process images and text to perform multiple clinical tasks within a single framework. However, despite the expansion of VLMs, no standardized benchmark exists to evaluate their performance in fetal ultrasound imaging. This gap is primarily due to the modality's challenging nature, operator dependency, and the limited public availability of datasets. To address this gap, we present Fetal-Gauge, the first and largest visual question answering benchmark specifically designed to evaluate VLMs across various fetal ultrasound tasks. Our benchmark comprises over 42,000 images and 93,000 question-answer pairs, spanning anatomical plane identification, visual grounding of anatomical structures, fetal orientation assessment, clinical view conformity, and clinical diagnosis. We systematically evaluate several state-of-the-art VLMs, including general-purpose and medical-specific models, and reveal a substantial performance gap: the best-performing model achieves only 55% accuracy, far below clinical requirements. Our analysis identifies critical limitations of current VLMs in fetal ultrasound interpretation, highlighting the urgent need for domain-adapted architectures and specialized training approaches. Fetal-Gauge establishes a rigorous foundation for advancing multimodal deep learning in prenatal care and provides a pathway toward addressing global healthcare accessibility challenges. Our benchmark will be made publicly available once the paper is accepted.
[13] A Three-Level Alignment Framework for Large-Scale 3D Retrieval and Controlled 4D Generation
Philip Xu, David Elizondo, Raouf Hamzaoui
🧩 TL;DR
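This paper presents Uni4D, a unified framework for large-scale open-vocabulary 3D retrieval and controllable 4D generation, based on a structured three-level alignment across text, 3D models, and images. It improves dynamic multimodal understanding.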
📘 Detailed Summary
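Motivation: Multimodal alignment in 3D retrieval and 4D generation is currently inadequate. In particular, semantic consistency across text, 3D models, and images is hard to guarantee, which limits the practical effectiveness of large-scale open-vocabulary retrieval and controllable dynamic content generation.
Method: Uni4D is built on the Align3D 130 dataset. It uses a 3D-text multi-head attention and search model to improve semantic alignment for text-to-3D retrieval, and strengthens cross-modal alignment through three components: precise text-to-3D retrieval, multi-view 3D-to-image alignment, and image-to-text alignment for generating temporally consistent 4D assets.
Result: Experiments show that Uni4D achieves high-quality 3D retrieval and controllable 4D generation and advances dynamic multimodal understanding, supporting the effectiveness of structured three-level alignment.
Conclusion: The study shows the importance of structured multimodal alignment for 3D retrieval and 4D generation, provides a practical framework for dynamic content creation, and suggests new technical paths and applications for future multimodal understanding systems.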
📄 Abstract
We introduce Uni4D, a unified framework for large-scale open-vocabulary 3D retrieval and controlled 4D generation based on structured three-level alignment across text, 3D models, and image modalities. Built upon the Align3D 130 dataset, Uni4D employs a 3D-text multi-head attention and search model to optimize text-to-3D retrieval through improved semantic alignment. The framework further strengthens cross-modal alignment through three components: precise text-to-3D retrieval, multi-view 3D-to-image alignment, and image-to-text alignment for generating temporally consistent 4D assets. Experimental results demonstrate that Uni4D achieves high-quality 3D retrieval and controllable 4D generation, advancing dynamic multimodal understanding and practical applications.
[14] PortionNet: Distilling 3D Geometric Knowledge for Food Nutrition Estimation
Darrin Bright, Rakshith Raj, Kanchan Keisham
🧩 TL;DR
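This paper proposes PortionNet, a novel cross-modal knowledge distillation framework that learns geometric features from point clouds so that single-image food nutrition estimation needs only RGB images, enabling pseudo-3D reasoning without a depth sensor.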
📘 Detailed Summary
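Motivation: Estimating food nutrition from a single image is challenging because 3D information is lost. Depth-based methods provide reliable geometry, but they require depth sensors that most smartphones lack, which limits practical use.
Method: PortionNet is a cross-modal knowledge distillation framework with a dual-mode training strategy. During training it learns geometric features from point clouds, while inference requires only RGB images. A lightweight adapter network mimics point cloud representations, enabling pseudo-3D reasoning without specialized hardware.
Result: PortionNet achieves state-of-the-art performance on MetaFood3D, outperforming all previous methods in both volume and energy estimation. Cross-dataset evaluation on SimpleFood45 further demonstrates strong generalization in energy estimation.
Conclusion: The study shows that knowledge distillation can transfer 3D geometric information to 2D images, offering a practical approach to food nutrition estimation on mobile devices that achieves accurate pseudo-3D reasoning without extra hardware.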
📄 Abstract
Accurate food nutrition estimation from single images is challenging due to the loss of 3D information. While depth-based methods provide reliable geometry, they remain inaccessible on most smartphones because of depth-sensor requirements. To overcome this challenge, we propose PortionNet, a novel cross-modal knowledge distillation framework that learns geometric features from point clouds during training while requiring only RGB images at inference. Our approach employs a dual-mode training strategy where a lightweight adapter network mimics point cloud representations, enabling pseudo-3D reasoning without any specialized hardware requirements. PortionNet achieves state-of-the-art performance on MetaFood3D, outperforming all previous methods in both volume and energy estimation. Cross-dataset evaluation on SimpleFood45 further demonstrates strong generalization in energy estimation.
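A training objective of the kind the abstract describes can be sketched as a task loss plus a feature-mimicry term. This is an illustrative assumption, not the paper's actual loss: the L1 regression term, the mean-squared-error mimicry term, and the weight `distill_weight` are all hypothetical choices.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def training_loss(
    predicted: float,
    target: float,
    adapter_features: Sequence[float],
    point_cloud_features: Sequence[float],
    distill_weight: float = 0.5,
) -> float:
    """Task loss plus a term pulling adapter features toward point-cloud features.

    At inference time only the RGB branch and adapter run, so the
    point-cloud features are needed during training alone.
    """
    task = abs(predicted - target)  # L1 regression on volume or energy
    distill = mse(adapter_features, point_cloud_features)
    return task + distill_weight * distill


loss = training_loss(
    predicted=10.0,
    target=12.0,
    adapter_features=[1.0, 2.0],
    point_cloud_features=[1.0, 4.0],
)
```

Here the task term contributes 2.0 and the mimicry term contributes 0.5 x 2.0, so the adapter is penalized both for a wrong estimate and for drifting from the geometric teacher.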
[15] VideoZoomer: Reinforcement-Learned Temporal Focusing for Long Video Reasoning
Yang Ding, Yizhen Zhang, Xin Lai, Ruihang Chu, Yujiu Yang
🧩 TL;DR
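This paper presents VideoZoomer, a novel agentic framework that lets multimodal large language models control their visual focus dynamically during reasoning. Starting from a low-frame-rate overview and progressively fetching high-frame-rate clips at self-chosen moments, it addresses the limited context window in long video understanding.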
📘 Detailed Summary
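Motivation: Multimodal large language models have made remarkable progress on vision-language tasks, but long video understanding is still constrained by the limited context window. Existing methods typically rely on uniform frame sampling or static pre-selection, which can overlook critical evidence and cannot correct an initial selection error during reasoning. A solution is needed that adjusts visual focus dynamically.
Method: VideoZoomer is an agentic framework in which MLLMs control their visual focus through multi-turn interaction. It starts from a coarse low-frame-rate overview and invokes a temporal zoom tool to obtain high-frame-rate clips at autonomously chosen moments, progressively gathering fine-grained evidence. Training has two stages: cold-start supervised fine-tuning on a curated dataset of distilled exemplar and reflection trajectories, followed by reinforcement learning to further refine the agentic policy.
Result: The 7B model exhibits diverse and complex reasoning patterns and performs strongly across a broad set of long video understanding and reasoning benchmarks. It consistently surpasses existing open-source models and even rivals proprietary systems on challenging tasks, while achieving higher efficiency under reduced frame budgets.
Conclusion: Through dynamic control of visual focus, VideoZoomer addresses the context limitation in long video understanding and shows the potential of agentic frameworks for complex multimodal reasoning. It improves both performance and computational efficiency and suggests a direction for future video understanding systems.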
📄 Abstract
Multimodal Large Language Models (MLLMs) have achieved remarkable progress in vision-language tasks yet remain limited in long video understanding due to the limited context window. Consequently, prevailing approaches tend to rely on uniform frame sampling or static pre-selection, which might overlook critical evidence and are unable to correct an initial selection error during the reasoning process. To overcome these limitations, we propose VideoZoomer, a novel agentic framework that enables MLLMs to dynamically control their visual focus during reasoning. Starting from a coarse low-frame-rate overview, VideoZoomer invokes a temporal zoom tool to obtain high-frame-rate clips at autonomously chosen moments, thereby progressively gathering fine-grained evidence in a multi-turn interactive manner. Accordingly, we adopt a two-stage training strategy: a cold-start supervised fine-tuning phase on a curated dataset of distilled exemplar and reflection trajectories, followed by reinforcement learning to further refine the agentic policy. Extensive experiments demonstrate that our 7B model delivers diverse and complex reasoning patterns, yielding strong performance across a broad set of long video understanding and reasoning benchmarks. These emergent capabilities allow it to consistently surpass existing open-source models and even rival proprietary systems on challenging tasks, while achieving superior efficiency under reduced frame budgets.
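The overview-then-zoom pattern can be illustrated with a small timestamp sampler. This is a sketch under stated assumptions, not the paper's tool interface: the function names, the window parameterization (center and half-width in seconds), and the frame rates are all hypothetical.

```python
from typing import List


def sample_timestamps(start: float, end: float, fps: float) -> List[float]:
    """Timestamps in seconds sampled at `fps` over the half-open window [start, end)."""
    count = int((end - start) * fps)
    return [start + i / fps for i in range(count)]


def temporal_zoom(center: float, half_width: float, fps: float, duration: float) -> List[float]:
    """High-frame-rate timestamps around `center`, clamped to the video bounds."""
    start = max(0.0, center - half_width)
    end = min(duration, center + half_width)
    return sample_timestamps(start, end, fps)


DURATION = 16.0  # seconds, toy video length

# Coarse pass: one frame every 2 s over the whole video.
overview = sample_timestamps(0.0, DURATION, fps=0.5)

# Zoom pass: 4 frames per second in a 2 s window around t = 9 s.
clip = temporal_zoom(center=9.0, half_width=1.0, fps=4.0, duration=DURATION)

frames_used = len(overview) + len(clip)
frames_dense = int(DURATION * 4.0)  # cost of sampling the whole video at 4 fps
```

In this toy case the model inspects 16 frames instead of the 64 a dense 4 fps pass would need, which is the kind of frame-budget saving the abstract refers to. In the actual framework the zoom location is chosen by the model during reasoning.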
[16] Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone
Jiacheng Ye, Shansan Gong, Jiahui Gao, Junming Fan, Shuang Wu, Wei Bi, Haoli Bai, Lifeng Shang, Lingpeng Kong
🧩 TL;DR
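This paper presents Dream-VL and Dream-VLA, a vision-language model and a vision-language-action model built on diffusion large language models. They match the performance of comparable autoregressive models while showing clear advantages on visual planning and robotic control tasks.
📘 Detailed Summary
Motivation: Autoregressive large vision-language models are limited by their sequential generation in complex visual planning and dynamic robotic control. This work explores building vision-language models on diffusion large language models (dLLMs) to overcome these limits, using the inherent bidirectionality of diffusion models to improve performance on vision-language-action tasks.
Method: The work introduces Dream-VL, an open diffusion-based vision-language model, and builds on it to develop Dream-VLA, a dLLM-based vision-language-action model obtained through continuous pre-training on open robotic datasets. The natively bidirectional diffusion backbone supports action chunking and parallel generation.
Result: Dream-VL is comparable to autoregressive VLMs trained on open data across multiple benchmarks and shows greater potential on visual planning tasks. Dream-VLA reaches a 97.2% average success rate on LIBERO and overall averages of 71.4% on SimplerEnv-Bridge and 60.5% on SimplerEnv-Fractal, surpassing leading models such as π₀ and GR00T-N1, and converges significantly faster in downstream fine-tuning.
Conclusion: Diffusion-based vision-language models surpass autoregressive baselines on visual planning and robotic control tasks. The bidirectional nature of the diffusion backbone makes it a superior foundation for vision-language-action tasks, with faster convergence and better performance, and points to a new research direction for complex visual reasoning and robotic control.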
📄 Abstract
While autoregressive Large Vision-Language Models (VLMs) have achieved remarkable success, their sequential generation often limits their efficacy in complex visual planning and dynamic robotic control. In this work, we investigate the potential of constructing Vision-Language Models upon diffusion-based large language models (dLLMs) to overcome these limitations. We introduce Dream-VL, an open diffusion-based VLM (dVLM) that achieves state-of-the-art performance among previous dVLMs. Dream-VL is comparable to top-tier AR-based VLMs trained on open data on various benchmarks but exhibits superior potential when applied to visual planning tasks. Building upon Dream-VL, we introduce Dream-VLA, a dLLM-based Vision-Language-Action model (dVLA) developed through continuous pre-training on open robotic datasets. We demonstrate that the natively bidirectional nature of this diffusion backbone serves as a superior foundation for VLA tasks, inherently suited for action chunking and parallel generation, leading to significantly faster convergence in downstream fine-tuning. Dream-VLA achieves top-tier performance of 97.2% average success rate on LIBERO, 71.4% overall average on SimplerEnv-Bridge, and 60.5% overall average on SimplerEnv-Fractal, surpassing leading models such as $π_0$ and GR00T-N1. We also validate that dVLMs surpass AR baselines on downstream tasks across different training objectives. We release both Dream-VL and Dream-VLA to facilitate further research in the community.
[17] The Multi-View Paradigm Shift in MRI Radiomics: Predicting MGMT Methylation in Glioblastoma
Mariya Miteva, Maria Nisheva-Pavlova
🧩 TL;DR
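This study proposes a multi-view latent representation learning framework based on variational autoencoders that integrates complementary radiomic features from the T1Gd and FLAIR MRI sequences of glioblastoma for non-invasive prediction of MGMT promoter methylation.
📘 Detailed Summary
Motivation: MGMT promoter methylation in glioblastoma carries important prognostic and therapeutic significance, but conventional unimodal and early-fusion radiomics approaches suffer from high feature redundancy and incomplete modeling of modality-specific information, which limits non-invasive inference of molecular characteristics from medical imaging.
Method: The study proposes a VAE-based multi-view latent representation learning framework that encodes the radiomic features of the T1Gd and FLAIR MRI sequences through independent probabilistic encoders and fuses them in a compact latent space, preserving modality-specific structure while enabling effective multimodal integration.
Result: The resulting latent embeddings are applied to MGMT promoter methylation classification. Compared with conventional approaches, the framework integrates multimodal information more effectively and reduces feature redundancy, improving the accuracy of molecular characteristic prediction.
Conclusion: The study demonstrates the effectiveness of multimodal fusion in latent space and offers a new approach to non-invasive inference of molecular characteristics in radiogenomics. It is particularly suited to clinical prediction tasks that need to integrate complementary imaging information.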
📄 Abstract
Non-invasive inference of molecular tumor characteristics from medical imaging is a central goal of radiogenomics, particularly in glioblastoma (GBM), where O6-methylguanine-DNA methyltransferase (MGMT) promoter methylation carries important prognostic and therapeutic significance. Although radiomics-based machine learning methods have shown promise for this task, conventional unimodal and early-fusion approaches are often limited by high feature redundancy and an incomplete modeling of modality-specific information. In this work, we introduce a multi-view latent representation learning framework based on variational autoencoders (VAE) to integrate complementary radiomic features derived from post-contrast T1-weighted (T1Gd) and Fluid-Attenuated Inversion Recovery (FLAIR) magnetic resonance imaging (MRI). By encoding each modality through an independent probabilistic encoder and performing fusion in a compact latent space, the proposed approach preserves modality-specific structure while enabling effective multimodal integration. The resulting latent embeddings are subsequently used for MGMT promoter methylation classification.
[18] VULCAN: Tool-Augmented Multi Agents for Iterative 3D Object Arrangement
Zhengfei Kuang, Rui Lin, Long Zhao, Gordon Wetzstein, Saining Xie, Sanghyun Woo
🧩 TL;DR
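This paper proposes a 3D scene manipulation framework based on multimodal large language models. By introducing an MCP-based API, a suite of specialized visual tools, and a collaborative multi-agent architecture, it addresses three key challenges MLLMs face in complex 3D object arrangement: weak visual grounding, insufficient scene understanding, and error-prone iterative updates.
📘 Detailed Summary
Motivation: Although multimodal large language models have made remarkable progress on 2D vision-language tasks, their application to complex 3D scene manipulation remains underexplored. This work targets three key challenges in 3D object arrangement: weak visual grounding, which makes it hard to link programmatic edits with precise 3D outcomes; insufficient 3D scene understanding; and an iterative update process that is error-prone and hard to manage.
Method: To address weak visual grounding, the authors introduce an MCP-based API that shifts interaction from brittle raw code manipulation to more robust function-level updates. A suite of specialized visual tools strengthens the MLLM's 3D scene understanding by analyzing scene state, gathering spatial information, and validating action outcomes, forming a perceptual feedback loop. A collaborative multi-agent framework assigns dedicated roles for planning, execution, and verification to manage the iterative, error-prone updates and handle multi-step instructions robustly.
Result: The method is evaluated on a diverse set of 25 complex object arrangement tasks, where it significantly outperforms existing baselines. The system handles multi-step instructions effectively and recovers from intermediate errors.
Conclusion: The study offers a systematic solution for applying MLLMs to 3D scene manipulation, removing key technical obstacles through API abstraction, perceptual augmentation, and multi-agent collaboration. Its results support combining language models with specialized tools and structured workflows for complex spatial tasks and provide a reference for future 3D-aware AI systems.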
📄 Abstract
Despite the remarkable progress of Multimodal Large Language Models (MLLMs) in 2D vision-language tasks, their application to complex 3D scene manipulation remains underexplored. In this paper, we bridge this critical gap by tackling three key challenges in the 3D object arrangement task using MLLMs. First, to address the weak visual grounding of MLLMs, which struggle to link programmatic edits with precise 3D outcomes, we introduce an MCP-based API. This shifts the interaction from brittle raw code manipulation to more robust, function-level updates. Second, we augment the MLLM's 3D scene understanding with a suite of specialized visual tools to analyze scene state, gather spatial information, and validate action outcomes. This perceptual feedback loop is critical for closing the gap between language-based updates and precise 3D-aware manipulation. Third, to manage the iterative, error-prone updates, we propose a collaborative multi-agent framework with designated roles for planning, execution, and verification. This decomposition allows the system to robustly handle multi-step instructions and recover from intermediate errors. We demonstrate the effectiveness of our approach on a diverse set of 25 complex object arrangement tasks, where it significantly outperforms existing baselines. Website: vulcan-3d.github.io
[19] Self-Evaluation Unlocks Any-Step Text-to-Image Generation
Xin Yu, Xiaojuan Qi, Zhengqi Li, Kai Zhang, Richard Zhang, Zhe Lin, Eli Shechtman, Tianyu Wang, Yotam Nitzan
🧩 TL;DR
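This paper proposes the Self-Evaluating Model (Self-E), a text-to-image generation method that is trained from scratch and supports inference with any number of steps. It combines local flow-matching learning with a novel self-evaluation mechanism and achieves high-quality generation from few steps to many steps without a pretrained teacher model.
📘 Detailed Summary
Motivation: Traditional diffusion and flow models rely mainly on local supervision and usually need many inference steps to produce high-quality results, while distillation methods enable few-step generation but require a pretrained teacher. No existing unified framework is trained from scratch, supports any-step inference, and maintains high performance across step counts; this paper aims to fill that gap.
Method: Self-E is trained from scratch, combining the data-driven learning of a flow-matching model with a novel self-evaluation mechanism. The model evaluates its own generated samples using its current score estimates, acting as a dynamic self-teacher, and thereby achieves both instantaneous local learning and self-driven global matching. The approach neither depends on a pretrained teacher nor is confined to the traditional local-supervision paradigm, which allows a single unified model to support any-step inference.
Result: Extensive experiments on large-scale text-to-image benchmarks show that Self-E excels at few-step generation while remaining competitive with state-of-the-art flow-matching models at 50 steps. Performance improves monotonically as the number of inference steps increases, so a single unified model provides both ultra-fast few-step generation and high-quality long-trajectory sampling. According to the authors, it is the first from-scratch, any-step text-to-image model.
Conclusion: By combining local learning with self-evaluation, Self-E bridges the paradigm gap between traditional flow models and distillation methods and offers a unified framework for efficient and scalable generation. It shows that training an any-step model from scratch is feasible and suggests a new direction for generative model design, particularly for balancing training efficiency and inference flexibility.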
📄 Abstract
We introduce the Self-Evaluating Model (Self-E), a novel, from-scratch training approach for text-to-image generation that supports any-step inference. Self-E learns from data similarly to a Flow Matching model, while simultaneously employing a novel self-evaluation mechanism: it evaluates its own generated samples using its current score estimates, effectively serving as a dynamic self-teacher. Unlike traditional diffusion or flow models, it does not rely solely on local supervision, which typically necessitates many inference steps. Unlike distillation-based approaches, it does not require a pretrained teacher. This combination of instantaneous local learning and self-driven global matching bridges the gap between the two paradigms, enabling the training of a high-quality text-to-image model from scratch that excels even at very low step counts. Extensive experiments on large-scale text-to-image benchmarks show that Self-E not only excels in few-step generation, but is also competitive with state-of-the-art Flow Matching models at 50 steps. We further find that its performance improves monotonically as inference steps increase, enabling both ultra-fast few-step generation and high-quality long-trajectory sampling within a single unified model. To our knowledge, Self-E is the first from-scratch, any-step text-to-image model, offering a unified framework for efficient and scalable generation.
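To make "any-step inference" concrete, the sketch below integrates a flow-style velocity field with a plain Euler sampler whose step count is a free parameter. It is an illustration only: the scalar toy velocity field and all names are assumptions, and nothing here reproduces Self-E's self-evaluation training, which the abstract does not specify in detail.

```python
def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with `num_steps` Euler steps."""
    x = x0
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt          # t stays strictly below 1
        x = x + dt * velocity(x, t)
    return x


def toy_velocity(target):
    """Toy field for a straight noise-to-data path ending at `target` (illustrative)."""
    def v(x, t):
        return (target - x) / (1.0 - t)
    return v


if __name__ == "__main__":
    v = toy_velocity(target=3.0)
    for steps in (1, 2, 8, 50):
        print(steps, euler_sample(v, x0=-1.0, num_steps=steps))
```

For this idealized straight-line field every step count lands on the target, which is the behavior an any-step model aims to approximate; a real learned field is only locally accurate, which is why standard flow models usually need many steps.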
[20] iOSPointMapper: RealTime Pedestrian and Accessibility Mapping with Mobile AI
Himanshu Naidu, Yuxiang Zhang, Sachin Mehta, Anat Caspi
🧩 TL;DR
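This paper introduces iOSPointMapper, a mobile application for real-time, privacy-conscious sidewalk mapping on iPhones and iPads. It uses on-device semantic segmentation and LiDAR-based depth estimation, among other techniques, to detect and localize pedestrian-relevant features such as traffic signs and traffic lights.
📘 Detailed Summary
Motivation: Current approaches to sidewalk data collection are costly, fragmented, and difficult to scale. The lack of accurate, up-to-date data to support accessible and inclusive pedestrian infrastructure hinders its planning and improvement.
Method: The system uses on-device semantic segmentation, LiDAR-based depth estimation, and fused GPS/IMU data to detect and localize sidewalk-relevant features such as traffic signs, traffic lights, and poles. It includes a user-guided annotation interface for validating system outputs, and the collected data is anonymized and transmitted to the Transportation Data Exchange Initiative (TDEI).
Result: Detailed evaluations of the system's feature detection and spatial mapping performance reveal the application's potential for enhanced pedestrian mapping. The results indicate that the system can identify and localize pedestrian-relevant infrastructure effectively, offering a feasible route to large-scale data collection.
Conclusion: iOSPointMapper provides a scalable, user-centered approach to closing critical gaps in pedestrian data. Its privacy-conscious design and its integration with multimodal transportation datasets offer a new data collection paradigm for urban planning and transportation management.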
📄 Abstract
Accurate, up-to-date sidewalk data is essential for building accessible and inclusive pedestrian infrastructure, yet current approaches to data collection are often costly, fragmented, and difficult to scale. We introduce iOSPointMapper, a mobile application that enables real-time, privacy-conscious sidewalk mapping on the ground, using recent-generation iPhones and iPads. The system leverages on-device semantic segmentation, LiDAR-based depth estimation, and fused GPS/IMU data to detect and localize sidewalk-relevant features such as traffic signs, traffic lights and poles. To ensure transparency and improve data quality, iOSPointMapper incorporates a user-guided annotation interface for validating system outputs before submission. Collected data is anonymized and transmitted to the Transportation Data Exchange Initiative (TDEI), where it integrates seamlessly with broader multimodal transportation datasets. Detailed evaluations of the system's feature detection and spatial mapping performance reveal the application's potential for enhanced pedestrian mapping. Together, these capabilities offer a scalable and user-centered approach to closing critical data gaps in pedestrian
[21] EmoCtrl: Controllable Emotional Image Content Generation
Jingyuan Yang, Weibin Luo, Hui Huang
🧩 TL;DR
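This paper proposes EmoCtrl, a framework for controllable emotional image content generation that expresses a target emotion while staying faithful to the content description, addressing the lack of emotional awareness in existing text-to-image models.
📘 Detailed Summary
Motivation: Existing text-to-image models ensure content consistency but lack emotional awareness, while emotion-driven models produce affective results at the cost of content distortion. Neither achieves content fidelity and emotional expression together, which motivates a generation method that controls both content and emotion.
Method: EmoCtrl includes textual and visual emotion enhancement modules that enrich affective expression through descriptive semantics and perceptual cues. It is supported by a dataset annotated with content, emotion, and affective prompts that links abstract emotions to visual cues, and the learned emotion tokens exhibit complementary effects.
Result: Quantitative and qualitative experiments show that EmoCtrl outperforms existing methods in content fidelity and expressive emotion control. Ablations and visualizations confirm the complementary effects of the learned emotion tokens, user studies confirm strong alignment with human preference, and the framework generalizes well to creative applications.
Conclusion: EmoCtrl addresses the joint control of content and emotion. The learned emotion tokens are robust and adaptable, which opens a direction for emotion-aware image generation and shows practical value in creative applications.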
📄 Abstract
An image conveys meaning through both its visual content and emotional tone, jointly shaping human perception. We introduce Controllable Emotional Image Content Generation (C-EICG), which aims to generate images that remain faithful to a given content description while expressing a target emotion. Existing text-to-image models ensure content consistency but lack emotional awareness, whereas emotion-driven models generate affective results at the cost of content distortion. To address this gap, we propose EmoCtrl, supported by a dataset annotated with content, emotion, and affective prompts, bridging abstract emotions to visual cues. EmoCtrl incorporates textual and visual emotion enhancement modules that enrich affective expression via descriptive semantics and perceptual cues. The learned emotion tokens exhibit complementary effects, as demonstrated through ablations and visualizations. Quantitative and qualitative experiments demonstrate that EmoCtrl achieves faithful content and expressive emotion control, outperforming existing methods across multiple aspects. User studies confirm EmoCtrl's strong alignment with human preference. Moreover, EmoCtrl generalizes well to creative applications, further demonstrating the robustness and adaptability of the learned emotion tokens.
[22] SAM 3D for 3D Object Reconstruction from Remote Sensing Images
Junsheng Yao, Lichao Mou, Qingyu Li
🧩 TL;DR
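This paper presents the first systematic evaluation of SAM 3D, a general-purpose image-to-3D foundation model, on monocular remote sensing building reconstruction. It finds that SAM 3D produces more coherent roof geometry and sharper boundaries than TRELLIS and demonstrates potential for urban scene modeling through a segment-reconstruct-compose pipeline.
📘 Detailed Summary
Motivation: Monocular 3D building reconstruction from remote sensing imagery is essential for scalable urban modeling, but existing methods often require task-specific architectures and intensive supervision, and the performance of general-purpose foundation models in this domain has not been systematically evaluated.
Method: The study systematically evaluates SAM 3D for monocular remote sensing building reconstruction on samples from the NYC Urban Dataset, benchmarking against TRELLIS with Frechet Inception Distance (FID) and CLIP-based Maximum Mean Discrepancy (CMMD) as metrics, and extends SAM 3D to urban scene reconstruction through a segment-reconstruct-compose pipeline.
Result: Experiments show that SAM 3D produces more coherent roof geometry and sharper boundaries than TRELLIS on monocular remote sensing building reconstruction, and the extended pipeline demonstrates its potential for urban scene modeling.
Conclusion: The study provides practical guidance for deploying foundation models in urban 3D reconstruction, shows that general-purpose foundation models are competitive on this domain-specific task, and identifies the integration of scene-level structural priors as future work.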
📄 Abstract
Monocular 3D building reconstruction from remote sensing imagery is essential for scalable urban modeling, yet existing methods often require task-specific architectures and intensive supervision. This paper presents the first systematic evaluation of SAM 3D, a general-purpose image-to-3D foundation model, for monocular remote sensing building reconstruction. We benchmark SAM 3D against TRELLIS on samples from the NYC Urban Dataset, employing Frechet Inception Distance (FID) and CLIP-based Maximum Mean Discrepancy (CMMD) as evaluation metrics. Experimental results demonstrate that SAM 3D produces more coherent roof geometry and sharper boundaries compared to TRELLIS. We further extend SAM 3D to urban scene reconstruction through a segment-reconstruct-compose pipeline, demonstrating its potential for urban scene modeling. We also analyze practical limitations and discuss future research directions. These findings provide practical guidance for deploying foundation models in urban 3D reconstruction and motivate future integration of scene-level structural priors.
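For readers unfamiliar with the first metric, the standard definition of FID is given below for reference. It fits a Gaussian to the Inception features of each image set, with mean and covariance $(\mu_r, \Sigma_r)$ for the reference set and $(\mu_g, \Sigma_g)$ for the generated set; these symbols are introduced here for illustration. How the paper renders 3D outputs into images before computing the metric is not stated in the abstract.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower values indicate that the two feature distributions are closer; the score is zero when both means and covariances coincide.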
[23] Tracking by Predicting 3-D Gaussians Over Time
Tanish Baranwal, Himanshu Gaurav Singh, Jathushan Rajasegaran, Jitendra Malik
🧩 TL;DR
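This paper proposes Video-GMAE, a self-supervised video representation learning method that encodes a video as a set of Gaussian splats moving over time. Tracking ability emerges naturally during pretraining, and the method surpasses existing self-supervised approaches on multiple benchmarks.
📘 Detailed Summary
Motivation: Existing video representation learning methods lack a reasonable inductive bias about the nature of video, namely that 2D videos are usually consistent projections of a dynamic 3D scene. This paper introduces a Gaussian representation to capture the spatio-temporal consistency of video and investigates whether tracking emerges naturally under such an architecture.
Method: The paper proposes Video Gaussian Masked Autoencoders (Video-GMAE), a self-supervised representation learning method that encodes an image sequence into a set of Gaussian splats moving over time. The Gaussian representation enforces the inductive bias that video is a projection of a dynamic 3D scene, and the network is pretrained with masked autoencoding under this architecture.
Result: Experiments show that tracking emerges naturally when pretraining with this architecture. Mapping the trajectories of the learned Gaussians onto the image plane yields zero-shot tracking performance comparable to the state of the art. With small-scale fine-tuning, the models achieve a 34.6% improvement on Kinetics and a 13.1% improvement on Kubric, surpassing existing self-supervised video approaches.
Conclusion: The study shows that introducing a 3D-scene inductive bias through a Gaussian representation lets tracking emerge during self-supervised pretraining, offering a new representation learning paradigm for video understanding with strong zero-shot tracking and further gains after fine-tuning.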
📄 Abstract
We propose Video Gaussian Masked Autoencoders (Video-GMAE), a self-supervised approach for representation learning that encodes a sequence of images into a set of Gaussian splats moving over time. Representing a video as a set of Gaussians enforces a reasonable inductive bias: that 2-D videos are often consistent projections of a dynamic 3-D scene. We find that tracking emerges when pretraining a network with this architecture. Mapping the trajectory of the learnt Gaussians onto the image plane gives zero-shot tracking performance comparable to state-of-the-art. With small-scale finetuning, our models achieve 34.6% improvement on Kinetics, and 13.1% on Kubric datasets, surpassing existing self-supervised video approaches. The project page and code are publicly available at https://videogmae.org/ and https://github.com/tekotan/video-gmae.
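The step of mapping Gaussian trajectories onto the image plane can be pictured with a pinhole projection of each Gaussian's 3D center at every time step. This is a minimal sketch, assuming a simple pinhole camera with a single focal length and known principal point; the function names and camera parameters are hypothetical, and the paper's actual camera model and Gaussian parameterization may differ.

```python
def project(point, focal, center):
    """Project a 3D point (camera coordinates) to pixel coordinates with a pinhole model."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    return (focal * x / z + center[0], focal * y / z + center[1])


def track(centers_over_time, focal=100.0, center=(64.0, 64.0)):
    """Turn one Gaussian's 3D centers over time into a 2D point track."""
    return [project(p, focal, center) for p in centers_over_time]


if __name__ == "__main__":
    # One Gaussian that moves right, then away from the camera.
    print(track([(0, 0, 2), (1, 0, 2), (1, 0, 4)]))
```

Each Gaussian then yields one 2D trajectory, and a query pixel could be tracked by following the Gaussian that covers it in the first frame.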
[24] SCAFusion: A Multimodal 3D Detection Framework for Small Object Detection in Lunar Surface Exploration
Xin Chen, Kang Luo, Yangyi Xiao, Hesheng Wang
🧩 TL;DR
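This paper proposes SCAFusion, a 3D object detection model designed for lunar robotic missions. With a Cognitive Adapter, a Contrastive Alignment Module, and a Section-aware Coordinate Attention mechanism, it significantly improves detection of small, irregular objects in lunar environments.
📘 Detailed Summary
Motivation: Existing multimodal 3D perception methods designed for terrestrial autonomous driving underperform in off-world environments such as the Moon, mainly because of poor feature alignment, limited multimodal synergy, and weak small-object detection. They cannot meet the need for reliable detection of small, irregular objects such as meteor fragments and rocks in autonomous lunar surface navigation.
Method: Built on the BEVFusion framework, SCAFusion integrates four key components: a Cognitive Adapter for efficient camera backbone tuning, a Contrastive Alignment Module that improves camera-LiDAR feature consistency, a Camera Auxiliary Training Branch that strengthens visual representation, and a Section-aware Coordinate Attention mechanism designed specifically to improve detection of small, irregular targets.
Result: On the nuScenes validation set, the model achieves 69.7% mAP and 72.1% NDS with a negligible increase in parameters and computation, improving the baseline by 5.0% and 2.7%, respectively. In simulated lunar environments built on Isaac Sim, SCAFusion reaches 90.93% mAP, outperforming the baseline by 11.5%, with notable gains on small meteor-like obstacles.
Conclusion: Through targeted multimodal feature alignment and small-object detection enhancements, SCAFusion offers an effective 3D perception solution for lunar robotic missions. Its modular design may inform perception for autonomous systems in other harsh environments and underlines the value of specialized architectures for off-world perception.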
📄 Abstract
Reliable and precise detection of small and irregular objects, such as meteor fragments and rocks, is critical for autonomous navigation and operation in lunar surface exploration. Existing multimodal 3D perception methods designed for terrestrial autonomous driving often underperform in off-world environments due to poor feature alignment, limited multimodal synergy, and weak small-object detection. This paper presents SCAFusion, a multimodal 3D object detection model tailored for lunar robotic missions. Built upon the BEVFusion framework, SCAFusion integrates a Cognitive Adapter for efficient camera backbone tuning, a Contrastive Alignment Module to enhance camera-LiDAR feature consistency, a Camera Auxiliary Training Branch to strengthen visual representation, and, most importantly, a Section-aware Coordinate Attention mechanism explicitly designed to boost the detection performance of small, irregular targets. With a negligible increase in parameters and computation, our model achieves 69.7% mAP and 72.1% NDS on the nuScenes validation set, improving the baseline by 5.0% and 2.7%, respectively. In simulated lunar environments built on Isaac Sim, SCAFusion achieves 90.93% mAP, outperforming the baseline by 11.5%, with notable gains in detecting small meteor-like obstacles.
[25] DreamOmni3: Scribble-based Editing and Generation
Bin Xia, Bohao Peng, Jiyang Liu, Sitong Wu, Jingyao Li, Junjia Huang, Xu Zhao, Yitong Wang, Ruihang Chu, Bei Yu, Jiaya Jia
🧩 TL;DR
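This paper proposes DreamOmni3, a unified generation and editing framework that adds scribbles as an extra input modality to overcome the limits of text prompts in precise localization and in expressing visual detail, enabling flexible creation that combines text, images, and freehand sketches.
📘 Detailed Summary
Motivation: Current unified generation and editing models rely mainly on text prompts for instruction-based editing and generation, but language struggles to capture the user's intended edit locations and fine-grained visual details, which limits flexible creation on graphical user interfaces.
Method: The work defines two tasks, scribble-based editing and scribble-based generation, and designs a data synthesis pipeline to build training data. It adopts a joint input scheme that feeds both the original image and the scribbled source image into the model, uses different colors to distinguish regions, and applies the same index and position encodings to both images for precise localization.
Result: Experiments show that DreamOmni3 achieves outstanding performance on scribble-based editing and generation. The authors establish comprehensive benchmarks to support further research and state that models and code will be publicly released.
Conclusion: The study demonstrates the effectiveness of scribbles as an additional input modality in unified generation and editing models. The joint input scheme addresses localization in complex editing scenarios and offers a new technical path for flexible creation on graphical user interfaces.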
📄 Abstract
Recently, unified generation and editing models have achieved remarkable success. These models rely mainly on text prompts for instruction-based editing and generation, but language often fails to capture users' intended edit locations and fine-grained visual details. To this end, we propose two tasks, scribble-based editing and scribble-based generation, which enable more flexible creation on a graphical user interface (GUI) by combining user text, images, and freehand sketches. We introduce DreamOmni3, tackling two challenges: data creation and framework design. Our data synthesis pipeline includes two parts: scribble-based editing and generation. For scribble-based editing, we define four tasks: scribble and instruction-based editing, scribble and multimodal instruction-based editing, image fusion, and doodle editing. Based on the DreamOmni2 dataset, we extract editable regions and overlay hand-drawn boxes, circles, doodles, or cropped images to construct training data. For scribble-based generation, we define three tasks: scribble and instruction-based generation, scribble and multimodal instruction-based generation, and doodle generation, following similar data creation pipelines. For the framework, instead of using binary masks, which struggle with complex edits involving multiple scribbles, images, and instructions, we propose a joint input scheme that feeds both the original and scribbled source images into the model, using different colors to distinguish regions and simplify processing. By applying the same index and position encodings to both images, the model can precisely localize scribbled regions while maintaining accurate editing. Finally, we establish comprehensive benchmarks for these tasks to promote further research. Experimental results demonstrate that DreamOmni3 achieves outstanding performance, and models and code will be publicly released.
[26] CoAgent: Collaborative Planning and Consistency Agent for Coherent Video Generation
Qinglin Zeng, Kaitong Cai, Ruiqi Chen, Qinhan Lv, Keze Wang
🧩 TL;DR
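This paper proposes CoAgent, a collaborative closed-loop framework for coherent video generation that uses a plan-synthesize-verify pipeline to address narrative coherence and visual consistency in open-domain video generation, significantly improving long-form video quality.
📘 Detailed Summary
Motivation: Open-domain video generation faces core challenges in narrative coherence and visual consistency. Existing text-to-video models often treat each shot independently, which causes identity drift, scene inconsistency, and unstable temporal structure, so a systematic approach is needed to maintain entity consistency and narrative flow across shots.
Method: CoAgent follows a collaborative, closed-loop plan-synthesize-verify pipeline. A Storyboard Planner decomposes the input into structured shot-level plans; a Global Context Manager maintains entity-level memory to keep appearance consistent; a Synthesis Module generates each shot under the guidance of a Visual Consistency Controller; a Verifier Agent evaluates intermediate results with vision-language reasoning and triggers selective regeneration; and a pacing-aware editor finally refines temporal rhythm and transitions.
Result: Extensive experiments show that CoAgent significantly improves coherence, visual consistency, and narrative quality in long-form video generation, with strong results in preserving entity identity, scene consistency, and stable temporal structure.
Conclusion: The study shows that a systematic plan-synthesize-verify pipeline with entity-level memory can address consistency problems in video generation. It offers a framework direction for long narrative video generation and highlights the importance of closed-loop collaboration and cross-shot context maintenance in complex generation tasks.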
📄 Abstract
Maintaining narrative coherence and visual consistency remains a central challenge in open-domain video generation. Existing text-to-video models often treat each shot independently, resulting in identity drift, scene inconsistency, and unstable temporal structure. We propose CoAgent, a collaborative and closed-loop framework for coherent video generation that formulates the process as a plan-synthesize-verify pipeline. Given a user prompt, style reference, and pacing constraints, a Storyboard Planner decomposes the input into structured shot-level plans with explicit entities, spatial relations, and temporal cues. A Global Context Manager maintains entity-level memory to preserve appearance and identity consistency across shots. Each shot is then generated by a Synthesis Module under the guidance of a Visual Consistency Controller, while a Verifier Agent evaluates intermediate results using vision-language reasoning and triggers selective regeneration when inconsistencies are detected. Finally, a pacing-aware editor refines temporal rhythm and transitions to match the desired narrative flow. Extensive experiments demonstrate that CoAgent significantly improves coherence, visual consistency, and narrative quality in long-form video generation.
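The control flow of a plan-synthesize-verify loop with entity-level memory and selective regeneration might look roughly like the sketch below. Everything in it is a hypothetical stand-in: the function names, the dictionary-based shot format, and the toy synthesizer and verifier are assumptions for illustration and do not represent CoAgent's actual modules.

```python
def run_pipeline(plans, synthesize, verify, max_retries=2):
    """Generate one shot per plan, regenerating only the shots that fail verification."""
    memory = {}   # entity name -> reference look from its first accepted shot
    shots = []
    for plan in plans:
        for attempt in range(max_retries + 1):
            shot = synthesize(plan, memory, attempt)
            if verify(shot, memory):
                break   # accepted; if every retry fails, the last attempt is kept
        for entity, look in shot["looks"].items():
            memory.setdefault(entity, look)
        shots.append(shot)
    return shots


# Toy stand-ins: the first attempt ignores memory and drifts; retries reuse it.
def toy_synthesize(plan, memory, attempt):
    looks = {}
    for entity in plan["entities"]:
        fresh = f"{entity}-v{plan['id']}"
        looks[entity] = fresh if attempt == 0 else memory.get(entity, fresh)
    return {"id": plan["id"], "attempt": attempt, "looks": looks}


def toy_verify(shot, memory):
    """Pass only if every known entity matches its remembered look."""
    return all(memory.get(e, look) == look for e, look in shot["looks"].items())


if __name__ == "__main__":
    plans = [{"id": 1, "entities": ["fox"]}, {"id": 2, "entities": ["fox", "owl"]}]
    for shot in run_pipeline(plans, toy_synthesize, toy_verify):
        print(shot)
```

In this toy run the second shot is rejected once because the fox's look drifted, then regenerated using the remembered look, while the first shot is never touched again; that is the sense in which regeneration is selective.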
[27] Self-Rewarded Multimodal Coherent Reasoning Across Diverse Visual Domains
Jesen Zhang, Ningyuan Liu, Kaitong Cai, Sidi Liu, Jing Yang, Ziliang Chen, Xiaofei Sun, Keze Wang
🧩 TL;DR
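This paper proposes SR-MCR, a lightweight, label-free framework for aligning the reasoning of multimodal large language models. It exploits intrinsic process signals in the model's own outputs to make reasoning more reliable and better grounded in visual evidence.
📘 Detailed Summary
Motivation: Current multimodal large language models often produce fluent but unreliable intermediate steps, with weak step-to-step coherence and insufficient visual grounding, largely because existing alignment approaches supervise only the final answer and ignore the reliability of the intermediate reasoning process.
Method: SR-MCR integrates five self-referential cues (semantic alignment, lexical fidelity, non-redundancy, visual grounding, and step consistency) into a normalized, reliability-weighted reward that provides fine-grained process-level guidance. It uses a critic-free GRPO objective with a confidence-aware cooling mechanism to stabilize training and suppress trivial or overconfident generations.
Result: Built on Qwen2.5-VL, SR-MCR improves both answer accuracy and reasoning coherence across a broad set of visual benchmarks. Among open-source models of comparable size, SR-MCR-7B achieves state-of-the-art performance with an average accuracy of 81.4%. Ablation studies confirm the independent contributions of each reward term and the cooling module.
Conclusion: The study shows that aligning reasoning with the model's intrinsic process signals is effective and efficient, improving the reliability and accuracy of multimodal reasoning without additional human annotation. Fine-grained process-level supervision and stabilization mechanisms are important for reliable reasoning generation.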
📄 Abstract
Multimodal LLMs often produce fluent yet unreliable reasoning, exhibiting weak step-to-step coherence and insufficient visual grounding, largely because existing alignment approaches supervise only the final answer while ignoring the reliability of the intermediate reasoning process. We introduce SR-MCR, a lightweight and label-free framework that aligns reasoning by exploiting intrinsic process signals derived directly from model outputs. Five self-referential cues -- semantic alignment, lexical fidelity, non-redundancy, visual grounding, and step consistency -- are integrated into a normalized, reliability-weighted reward that provides fine-grained process-level guidance. A critic-free GRPO objective, enhanced with a confidence-aware cooling mechanism, further stabilizes training and suppresses trivial or overly confident generations. Built on Qwen2.5-VL, SR-MCR improves both answer accuracy and reasoning coherence across a broad set of visual benchmarks; among open-source models of comparable size, SR-MCR-7B achieves state-of-the-art performance with an average accuracy of 81.4%. Ablation studies confirm the independent contributions of each reward term and the cooling module.
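The two ingredients named in the abstract, a normalized weighted reward over five cues and a critic-free group-relative objective, can be sketched as follows. The cue names come from the abstract, but the weights, the scoring functions that would produce each cue, and the use of population standard deviation are assumptions; the confidence-aware cooling mechanism is not modeled because the abstract does not define it.

```python
from statistics import mean, pstdev

CUES = ("semantic_alignment", "lexical_fidelity", "non_redundancy",
        "visual_grounding", "step_consistency")


def process_reward(scores, weights):
    """Weighted average of the five cue scores, normalized by the total weight."""
    total = sum(weights[c] for c in CUES)
    return sum(weights[c] * scores[c] for c in CUES) / total


def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    uniform = {c: 1.0 for c in CUES}
    group = [
        {c: 0.2 for c in CUES},
        {c: 0.5 for c in CUES},
        {c: 0.8 for c in CUES},
    ]
    rewards = [process_reward(s, uniform) for s in group]
    print(rewards, group_advantages(rewards))
```

Because advantages are computed relative to the other responses sampled for the same prompt, no learned value model (critic) is needed, which is what "critic-free" refers to.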
[28] Visual Autoregressive Modelling for Monocular Depth Estimation
Amir El-Ghoussani, André Kaup, Nassir Navab, Gustavo Carneiro, Vasileios Belagiannis
🧩 TL;DR
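This paper proposes a monocular depth estimation method based on visual autoregressive (VAR) priors as an alternative to diffusion-based approaches. It adapts a large-scale text-to-image VAR model, introduces a scale-wise conditional upsampling mechanism, and achieves competitive performance on indoor and outdoor datasets.
📘 Detailed Summary
Motivation: Diffusion models dominate current generative approaches to depth estimation. This work explores autoregressive priors as an alternative family of geometry-aware generative models, with attention to data scalability and adaptability to 3D vision tasks.
Method: The method builds on visual autoregressive priors by adapting a large-scale text-to-image VAR model, introducing a scale-wise conditional upsampling mechanism with classifier-free guidance. Fine-tuning requires only 74K synthetic samples, and inference runs in ten fixed autoregressive stages.
Result: The method achieves state-of-the-art performance on indoor benchmarks under constrained training conditions and strong performance on outdoor datasets, reaching competitive results with only a small amount of synthetic fine-tuning data.
Conclusion: The study establishes autoregressive priors as a complementary family of geometry-aware generative models for depth estimation, highlighting their advantages in data scalability and adaptability to 3D vision tasks and offering a new technical path for monocular depth estimation.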
📄 Abstract
We propose a monocular depth estimation method based on visual autoregressive (VAR) priors, offering an alternative to diffusion-based approaches. Our method adapts a large-scale text-to-image VAR model and introduces a scale-wise conditional upsampling mechanism with classifier-free guidance. Our approach performs inference in ten fixed autoregressive stages, requiring only 74K synthetic samples for fine-tuning, and achieves competitive results. We report state-of-the-art performance in indoor benchmarks under constrained training conditions, and strong performance when applied to outdoor datasets. This work establishes autoregressive priors as a complementary family of geometry-aware generative models for depth estimation, highlighting advantages in data scalability, and adaptability to 3D vision tasks. Code available at "https://github.com/AmirMaEl/VAR-Depth".
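Classifier-free guidance, mentioned above, is commonly implemented by extrapolating from an unconditional prediction toward a conditional one. The sketch below shows that standard combination on token logits; applying it per scale of a VAR model, and the guidance scale values shown, are assumptions rather than details taken from the paper.

```python
def cfg_logits(cond, uncond, scale):
    """Standard classifier-free guidance: uncond + scale * (cond - uncond), elementwise."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    cond, uncond = [2.0, 0.0], [1.0, 1.0]
    for scale in (0.0, 1.0, 3.0):
        print(scale, cfg_logits(cond, uncond, scale))
```

A scale of 1 recovers the conditional prediction, 0 recovers the unconditional one, and values above 1 push the output further in the direction the conditioning suggests.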
[29] CritiFusion: Semantic Critique and Spectral Alignment for Faithful Text-to-Image Generation
ZhenQi Chen, TsaiChing Ni, YuanFu Yang
🧩 TL;DR
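This paper proposes CritiFusion, an inference-time framework that improves the semantic alignment of text-to-image generation without additional training, using a multimodal semantic critique mechanism and frequency-domain refinement to improve prompt consistency and detail.
📘 Detailed Summary
Motivation: Current text-to-image diffusion models achieve high visual fidelity but often struggle with semantic alignment on complex prompts, producing images whose content is inconsistent with the prompt's intent, which limits their reliability and usefulness in practice.
Method: CritiFusion has two core components. The CritiCore module uses a vision-language model and multiple large language models to enrich the prompt context and produce high-level semantic feedback that guides the diffusion process toward the prompt's intent. SpecFusion merges intermediate generation states in the spectral domain, injecting coarse structural information while preserving high-frequency details. The framework is a plug-in refinement stage compatible with existing diffusion backbones and requires no additional training.
Result: On standard benchmarks, the method notably improves human-aligned metrics of text-to-image correspondence and visual quality, consistently raises human preference scores and aesthetic evaluations, and achieves results on par with state-of-the-art reward optimization approaches. Qualitative results further show superior detail, realism, and prompt fidelity.
Conclusion: The study shows that semantic critique combined with spectral alignment improves the semantic consistency of text-to-image generation. As an inference-time framework, CritiFusion is compatible with existing backbones and practical to apply, offering a way to improve alignment with complex prompts without the cost of additional model training.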
📄 Abstract
Recent text-to-image diffusion models have achieved remarkable visual fidelity but often struggle with semantic alignment to complex prompts. We introduce CritiFusion, a novel inference-time framework that integrates a multimodal semantic critique mechanism with frequency-domain refinement to improve text-to-image consistency and detail. The proposed CritiCore module leverages a vision-language model and multiple large language models to enrich the prompt context and produce high-level semantic feedback, guiding the diffusion process to better align generated content with the prompt's intent. Additionally, SpecFusion merges intermediate generation states in the spectral domain, injecting coarse structural information while preserving high-frequency details. No additional model training is required. CritiFusion serves as a plug-in refinement stage compatible with existing diffusion backbones. Experiments on standard benchmarks show that our method notably improves human-aligned metrics of text-to-image correspondence and visual quality. CritiFusion consistently boosts performance on human preference scores and aesthetic evaluations, achieving results on par with state-of-the-art reward optimization approaches. Qualitative results further demonstrate superior detail, realism, and prompt fidelity, indicating the effectiveness of our semantic critique and spectral alignment strategy.
[30] Multimodal Diffeomorphic Registration with Neural ODEs and Structural Descriptors
Salvador Rodriguez-Sanz, Monica Hernandez
🧩 TL;DR
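This paper proposes a multimodal diffeomorphic registration method based on Neural Ordinary Differential Equations. Using continuous-depth networks and structural descriptors, it performs accurate multimodal image registration without training data and maintains low error while remaining robust to the level of explicit regularization.
📘 Detailed Summary
Motivation: Conventional nonrigid registration algorithms trade off accuracy, computational complexity, and regularization, and usually assume intensity correlation between anatomically homologous regions of an image pair, which limits them in multimodal settings. Existing learning-based methods need large amounts of training data and degrade on modalities unseen during training, so an instance-specific framework is needed.
Method: The method exploits continuous-depth networks in the Neural ODE paradigm together with structural descriptors, which serve as modality-agnostic metric models that exploit self-similarities on parameterized neighborhood geometries. Three variants are proposed, integrating image-based or feature-based structural descriptors and nonstructural image similarities computed by local mutual information.
Result: Extensive evaluations on experiments formed from combinations of scan datasets show qualitative and quantitative results that surpass state-of-the-art baselines for large or small deformations. The method maintains low error across varying levels of explicit regularization, suits registration at different scales, and is more efficient than other methods aimed at large-deformation registration.
Conclusion: The study demonstrates the effectiveness of an instance-specific Neural ODE framework for multimodal registration, achieving high performance without large training sets and offering a new solution for multimodal registration in medical image analysis. Its robustness to regularization and adaptability across scales are valuable in practice.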
📄 Abstract
This work proposes a multimodal diffeomorphic registration method using Neural Ordinary Differential Equations (Neural ODEs). Nonrigid registration algorithms exhibit tradeoffs between their accuracy, the computational complexity of their deformation model, and its proper regularization. In addition, they assume intensity correlation in anatomically homologous regions of interest among image pairs, limiting their applicability to the monomodal setting. Unlike learning-based models, we propose an instance-specific framework that is not subject to high scan requirements for training and does not suffer performance degradation at inference time on modalities unseen during training. Our method exploits the potential of continuous-depth networks in the Neural ODE paradigm with structural descriptors, widely adopted as modality-agnostic metric models which exploit self-similarities on parameterized neighborhood geometries. We propose three different variants that integrate image-based or feature-based structural descriptors and nonstructural image similarities computed by local mutual information. We conduct extensive evaluations on experiments formed from combinations of scan datasets and show qualitative and quantitative results that surpass state-of-the-art baselines suited to large or small deformations, as well as baselines specific to multimodal registration. Lastly, we also demonstrate the robustness of the proposed framework to varying levels of explicit regularization while maintaining low error, its suitability for registration at varying scales, and its efficiency with respect to other methods targeting large-deformation registration.
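Mutual information is a natural similarity for multimodal registration because it depends on the statistical relationship between intensities, not on their values matching. The sketch below computes it from the joint histogram of two already-quantized patches; treating a "local" estimate as MI over a small patch, and the list-based patch format, are assumptions made for illustration, not the paper's implementation.

```python
from collections import Counter
from math import log


def mutual_information(a, b):
    """MI (in nats) between two equally long sequences of quantized intensities."""
    if len(a) != len(b) or not a:
        raise ValueError("patches must be non-empty and the same length")
    n = len(a)
    joint = Counter(zip(a, b))
    pa, pb = Counter(a), Counter(b)
    # sum over observed pairs of p(x,y) * log(p(x,y) / (p(x) p(y)))
    return sum((c / n) * log(c * n / (pa[x] * pb[y])) for (x, y), c in joint.items())


if __name__ == "__main__":
    patch = [0, 0, 1, 1]
    remapped = [5, 5, 9, 9]      # same structure, different intensity mapping
    unrelated = [0, 1, 0, 1]
    print(mutual_information(patch, remapped), mutual_information(patch, unrelated))
```

The remapped patch scores log 2 even though no intensity values agree, while the unrelated patch scores zero, which is the property that makes the measure usable across imaging modalities.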
[31] SwinTF3D: A Lightweight Multimodal Fusion Approach for Text-Guided 3D Medical Image Segmentation
Hasan Faraz Khan, Noor Fatima, Muzammil Behzad
🧩 TL;DR
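This paper proposes SwinTF3D, a lightweight multimodal fusion approach that unifies visual and linguistic representations for text-guided 3D medical image segmentation, addressing the lack of semantic understanding and domain adaptability in existing methods.
📘 Detailed Summary
Motivation: Existing 3D medical image segmentation frameworks rely mainly on visual learning from large annotated datasets. They lack semantic understanding, adapt poorly to new domains and clinical tasks, and cannot handle flexible, user-defined segmentation objectives, which limits their practical value in clinical settings.
Method: SwinTF3D uses a transformer-based visual encoder to extract volumetric features and integrates them with a compact text encoder through an efficient fusion mechanism, so that the model understands natural-language prompts and aligns semantic cues with the corresponding spatial structures while keeping computational overhead low.
Result: Extensive experiments on the BTCV dataset show that SwinTF3D achieves competitive Dice and IoU scores across multiple organs, generalizes well to unseen data, and offers significant efficiency gains over conventional transformer-based segmentation networks.
Conclusion: By bridging visual perception and linguistic understanding, the study establishes a practical and interpretable paradigm for interactive, text-driven 3D medical image segmentation and opens prospects for more adaptive and resource-efficient solutions in clinical imaging.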
📄 Abstract
The recent integration of artificial intelligence into medical imaging has driven remarkable advances in automated organ segmentation. However, most existing 3D segmentation frameworks rely exclusively on visual learning from large annotated datasets, restricting their adaptability to new domains and clinical tasks. The lack of semantic understanding in these models makes them ineffective in addressing flexible, user-defined segmentation objectives. To overcome these limitations, we propose SwinTF3D, a lightweight multimodal fusion approach that unifies visual and linguistic representations for text-guided 3D medical image segmentation. The model employs a transformer-based visual encoder to extract volumetric features and integrates them with a compact text encoder via an efficient fusion mechanism. This design allows the system to understand natural-language prompts and correctly align semantic cues with their corresponding spatial structures in medical volumes, while producing accurate, context-aware segmentation results with low computational overhead. Extensive experiments on the BTCV dataset demonstrate that SwinTF3D achieves competitive Dice and IoU scores across multiple organs, despite its compact architecture. The model generalizes well to unseen data and offers significant efficiency gains compared to conventional transformer-based segmentation networks. Bridging visual perception with linguistic understanding, SwinTF3D establishes a practical and interpretable paradigm for interactive, text-driven 3D medical image segmentation, opening perspectives for more adaptive and resource-efficient solutions in clinical imaging.
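The two reported metrics have standard definitions on binary masks: Dice is twice the overlap divided by the sum of the mask sizes, and IoU is the overlap divided by the union. A minimal sketch on flattened masks follows; the flattened-list input and the convention of returning 1.0 when both masks are empty are choices made here, not details from the paper.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two flattened binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    p_sum = sum(1 for p in pred if p)
    t_sum = sum(1 for t in target if t)
    union = p_sum + t_sum - inter
    if union == 0:
        return 1.0, 1.0   # both masks empty: treated as a perfect match (convention)
    return 2 * inter / (p_sum + t_sum), inter / union


if __name__ == "__main__":
    print(dice_and_iou([1, 1, 0, 0], [1, 0, 1, 0]))
```

For multi-organ evaluation, one would typically compute the pair per organ label and then average across organs.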
[32] TrimTokenator-LC: Towards Adaptive Visual Token Pruning for Large Multimodal Models with Long Contexts
Hao Zhang, Mengsi Lyu, Bo Huang, Yulong Ao, Yonghua Lin
🧩 TL;DR
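This paper proposes an adaptive visual token pruning method for long-context, multi-image settings. It decomposes redundancy into intra-image and inter-image components to drive dynamic budget allocation, substantially reducing the number of visual tokens while maintaining performance.
📘 Detailed Summary
Motivation: Large multimodal models typically encode images as token sequences, and the growing number of visual tokens greatly increases inference cost. Existing visual token pruning methods often overlook long-context inputs that contain multiple images; this paper targets that specific setting.
Method: The method uses a two-stage adaptive pruning strategy. The intra-image stage allocates each image a content-aware token budget and greedily selects its most representative tokens. The inter-image stage performs global diversity filtering to form a candidate pool, then applies a Pareto selection procedure that balances diversity with text alignment. Redundancy is decomposed into two components, intra-image diversity and inter-image variation, which jointly guide dynamic budget allocation.
Result: Extensive experiments show that the method maintains strong performance in long-context settings while significantly reducing the number of visual tokens. In complex multi-image scenarios it balances computational efficiency against model performance, supporting the adaptive pruning strategy.
Conclusion: The work offers a systematic solution for visual token pruning in long-context, multi-image settings, balancing efficiency and performance through redundancy decomposition and dynamic budget allocation. The method applies to current large multimodal models and provides a reference point for designing efficient multimodal inference systems.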
📄 Abstract
Large Multimodal Models (LMMs) have proven effective on various tasks. They typically encode visual inputs into sequences of tokens, which are then concatenated with textual tokens and jointly processed by the language model. However, the growing number of visual tokens greatly increases inference cost. Visual token pruning has emerged as a promising solution. However, existing methods often overlook scenarios involving long context inputs with multiple images. In this paper, we analyze the challenges of visual token pruning in long context, multi-image settings and introduce an adaptive pruning method tailored for such scenarios. We decompose redundancy into intra-image and inter-image components and quantify them through intra-image diversity and inter-image variation, which jointly guide dynamic budget allocation. Our approach consists of two stages. The intra-image stage allocates each image a content-aware token budget and greedily selects its most representative tokens. The inter-image stage performs global diversity filtering to form a candidate pool and then applies a Pareto selection procedure that balances diversity with text alignment. Extensive experiments show that our approach maintains strong performance in long context settings while significantly cutting down the number of visual tokens.
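The abstract does not spell out the Pareto selection procedure. As a generic illustration of the idea, the sketch below keeps the candidates that no other candidate beats on both diversity and text alignment. The scores, the tuple layout, and the function name `pareto_front` are hypothetical; the paper's actual scoring and selection rule may differ.

```python
def pareto_front(candidates):
    """Return ids of non-dominated candidates.

    candidates: list of (token_id, diversity, alignment); higher is better on both axes.
    A candidate is dominated if another one is at least as good on both axes
    and strictly better on at least one.
    """
    front = []
    for i, (token_id, div, align) in enumerate(candidates):
        dominated = any(
            (d2 >= div and a2 >= align) and (d2 > div or a2 > align)
            for j, (_, d2, a2) in enumerate(candidates)
            if j != i
        )
        if not dominated:
            front.append(token_id)
    return front


# Hypothetical scores: "d" is worse than "b" on both axes, so it is pruned.
pool = [("a", 0.9, 0.1), ("b", 0.5, 0.5), ("c", 0.1, 0.9), ("d", 0.4, 0.4)]
kept = pareto_front(pool)
```

This brute-force filter is quadratic in the pool size, which is acceptable for a sketch but would likely need a sort-based variant for large candidate pools.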
[33] OpenGround: Active Cognition-based Reasoning for Open-World 3D Visual Grounding
Wenyuan Huang, Zhao Wang, Zhou Wei, Ting Huang, Fang Zhao, Jian Yang, Zhenyu Zhang
🧩 TL;DR
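This paper proposes OpenGround, a zero-shot framework for open-world 3D visual grounding. Its Active Cognition-based Reasoning module overcomes the limits of a pre-defined Object Lookup Table, enabling grounding of undefined or unforeseen targets.
📘 Detailed Summary
Motivation: Existing 3D visual grounding methods rely on a pre-defined Object Lookup Table (OLT) to query vision-language models for reasoning about object locations. This limits their use in scenarios with undefined or unforeseen targets, so target grounding in open-world scenes remains an open problem.
Method: The core of OpenGround is the Active Cognition-based Reasoning (ACR) module. It mimics human perception of the target through a cognitive task chain, actively reasons about contextually relevant objects, and extends the VLM's cognitive scope through a dynamically updated OLT, supporting both pre-defined and open-world categories.
Result: Experiments show that OpenGround achieves competitive performance on Nr3D, state-of-the-art results on ScanRefer, and a substantial 17.6% improvement on the proposed OpenTarget dataset, which contains over 7000 object-description pairs for open-world evaluation.
Conclusion: The work removes the dependence of conventional 3D visual grounding on pre-defined categories. Active cognition-based reasoning enables zero-shot grounding in open-world scenes, offers a practical route to locating unknown targets in real applications, and comes with a dedicated open-world evaluation benchmark.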
📄 Abstract
3D visual grounding aims to locate objects based on natural language descriptions in 3D scenes. Existing methods rely on a pre-defined Object Lookup Table (OLT) to query Visual Language Models (VLMs) for reasoning about object locations, which limits the applications in scenarios with undefined or unforeseen targets. To address this problem, we present OpenGround, a novel zero-shot framework for open-world 3D visual grounding. Central to OpenGround is the Active Cognition-based Reasoning (ACR) module, which is designed to overcome the fundamental limitation of pre-defined OLTs by progressively augmenting the cognitive scope of VLMs. The ACR module performs human-like perception of the target via a cognitive task chain and actively reasons about contextually relevant objects, thereby extending VLM cognition through a dynamically updated OLT. This allows OpenGround to function with both pre-defined and open-world categories. We also propose a new dataset named OpenTarget, which contains over 7000 object-description pairs to evaluate our method in open-world scenarios. Extensive experiments demonstrate that OpenGround achieves competitive performance on Nr3D, state-of-the-art on ScanRefer, and delivers a substantial 17.6% improvement on OpenTarget. Project Page at this https URL.
[34] Plug In, Grade Right: Psychology-Inspired AGIQA
Zhicheng Liao, Baoliang Chen, Hanwei Zhu, Lingyu Zhu, Shiqi Wang, Weisi Lin
🧩 TL;DR
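This paper proposes a quality grading module (AGQG) based on an arithmetic Graded Response Model (GRM) to mitigate semantic drift in AGIQA. The module models image ability and difficulty levels to produce a unimodal, interpretable quality distribution, and delivers plug-and-play gains across AGIQA frameworks.
📘 Detailed Summary
Motivation: Existing AGIQA models assess image quality by measuring the similarity between image embeddings and text embeddings of multi-grade quality descriptions. These similarity distributions often have multiple peaks: an image embedding can be highly similar to both the "excellent" and "poor" descriptions while deviating from "good". This "semantic drift", a semantic inconsistency between text embeddings and their intended descriptions, undermines the reliability of text-image shared-space learning.
Method: Inspired by psychometrics, the paper proposes an improved Graded Response Model for AGIQA that interprets image quality as an image's ability to meet different quality grades. On this basis it designs a two-branch quality grading module: one branch estimates image ability and the other constructs multiple difficulty levels. To keep the difficulty levels monotonic, difficulty generation is modeled arithmetically, which enforces a unimodal and interpretable quality distribution. The result is the Arithmetic GRM based Quality Grading (AGQG) module.
Result: AGQG is plug-and-play and consistently improves performance when integrated into various state-of-the-art AGIQA frameworks. It also generalizes well to both natural and screen content image quality assessment, suggesting its potential as a key component of future IQA models.
Conclusion: By bringing the psychometric Graded Response Model into AGIQA, the work addresses semantic drift. The AGQG module improves existing models and generalizes across different types of image quality assessment, offering both a reusable component and a theoretical framing for future IQA model design.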
📄 Abstract
Existing AGIQA models typically estimate image quality by measuring and aggregating the similarities between image embeddings and text embeddings derived from multi-grade quality descriptions. Although effective, we observe that such similarity distributions across grades usually exhibit multimodal patterns. For instance, an image embedding may show high similarity to both "excellent" and "poor" grade descriptions while deviating from the "good" one. We refer to this phenomenon as "semantic drift", where semantic inconsistencies between text embeddings and their intended descriptions undermine the reliability of text-image shared-space learning. To mitigate this issue, we draw inspiration from psychometrics and propose an improved Graded Response Model (GRM) for AGIQA. The GRM is a classical assessment model that categorizes a subject's ability across grades using test items with various difficulty levels. This paradigm aligns remarkably well with human quality rating, where image quality can be interpreted as an image's ability to meet various quality grades. Building on this philosophy, we design a two-branch quality grading module: one branch estimates image ability while the other constructs multiple difficulty levels. To ensure monotonicity in difficulty levels, we further model difficulty generation in an arithmetic manner, which inherently enforces a unimodal and interpretable quality distribution. Our Arithmetic GRM based Quality Grading (AGQG) module enjoys a plug-and-play advantage, consistently improving performance when integrated into various state-of-the-art AGIQA frameworks. Moreover, it also generalizes effectively to both natural and screen content image quality assessment, revealing its potential as a key component in future IQA models.
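To make the grading idea concrete, here is a minimal sketch of a textbook logistic Graded Response Model whose difficulty thresholds form an arithmetic sequence, so they are monotonic by construction. This is an illustration under stated assumptions, not the paper's AGQG module: the function name, the `first_difficulty` and `step` parameterization, and the shared `discrimination` value are all hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def grade_distribution(ability, first_difficulty, step, num_grades, discrimination=1.0):
    """Probability of each of num_grades ordered grades under a logistic GRM.

    Thresholds are b_k = first_difficulty + k * step with step > 0, so they are
    strictly increasing (the 'arithmetic' construction assumed in this sketch).
    """
    if step <= 0 or num_grades < 2:
        raise ValueError("need step > 0 and at least two grades")
    thresholds = [first_difficulty + k * step for k in range(num_grades - 1)]
    # cumulative[k] = P(grade >= k); decreasing because thresholds increase.
    cumulative = [1.0]
    cumulative += [sigmoid(discrimination * (ability - b)) for b in thresholds]
    cumulative.append(0.0)
    return [cumulative[k] - cumulative[k + 1] for k in range(num_grades)]


# Four grades with thresholds -2, 0, 2 and an image of ability 0.
probs = grade_distribution(ability=0.0, first_difficulty=-2.0, step=2.0, num_grades=4)
```

In this example the distribution is symmetric with its mass on the two middle grades. Monotonic thresholds guarantee non-negative grade probabilities that sum to one; how the paper obtains unimodality in general depends on details the abstract does not give.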
[35] Parallel Diffusion Solver via Residual Dirichlet Policy Optimization
Ruoyu Wang, Ziyu Li, Beier Zhu, Liangyu Yuan, Hanwang Zhang, Xun Yang, Xiaojun Chang, Chi Zhang
🧩 TL;DR
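This paper proposes EPD-Solver, a novel ODE solver that reduces truncation error by incorporating parallel gradient evaluations, substantially improving diffusion model generation quality while preserving low-latency sampling. It also introduces a parameter-efficient reinforcement learning fine-tuning scheme and can serve as a plugin to improve existing samplers.
📘 Detailed Summary
Motivation: Diffusion models achieve state-of-the-art generative performance but suffer from high sampling latency because denoising is sequential. Existing solver-based acceleration methods often degrade image quality significantly under a low-latency budget, mainly because truncation errors accumulate when the solver fails to capture high-curvature trajectory segments.
Method: The Ensemble Parallel Direction solver (EPD-Solver) is a novel ODE solver that mitigates truncation error by incorporating multiple parallel gradient evaluations in each step. Motivated by the geometric insight that sampling trajectories are largely confined to a low-dimensional manifold, it uses the Mean Value Theorem for vector-valued functions to approximate the integral solution more accurately. Because the additional gradient computations are independent, they can be fully parallelized, preserving low-latency sampling. The authors also introduce a two-stage optimization framework: a small set of learnable parameters is first optimized with a distillation-based approach, and a parameter-efficient RL fine-tuning scheme then reformulates the solver as a stochastic Dirichlet policy.
Result: EPD-Solver substantially reduces truncation error and improves generation quality while keeping sampling latency low. Unlike traditional methods that fine-tune the massive backbone, the RL approach operates only within the low-dimensional solver space, which mitigates reward hacking and improves performance on complex text-to-image generation. The method is flexible and can serve as a plugin (EPD-Plugin) to improve existing ODE samplers.
Conclusion: EPD-Solver addresses truncation error in diffusion model acceleration through parallel gradient evaluations and a geometric insight, without giving up low latency. The parameter-efficient RL fine-tuning avoids the reward hacking seen with traditional approaches and gives a more stable optimization path for complex generation tasks. Its plugin form adds practical value for efficient diffusion sampling.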
📄 Abstract
Diffusion models (DMs) have achieved state-of-the-art generative performance but suffer from high sampling latency due to their sequential denoising nature. Existing solver-based acceleration methods often face significant image quality degradation under a low-latency budget, primarily due to accumulated truncation errors arising from the inability to capture high-curvature trajectory segments. In this paper, we propose the Ensemble Parallel Direction solver (dubbed as EPD-Solver), a novel ODE solver that mitigates these errors by incorporating multiple parallel gradient evaluations in each step. Motivated by the geometric insight that sampling trajectories are largely confined to a low-dimensional manifold, EPD-Solver leverages the Mean Value Theorem for vector-valued functions to approximate the integral solution more accurately. Importantly, since the additional gradient computations are independent, they can be fully parallelized, preserving low-latency sampling nature. We introduce a two-stage optimization framework. Initially, EPD-Solver optimizes a small set of learnable parameters via a distillation-based approach. We further propose a parameter-efficient Reinforcement Learning (RL) fine-tuning scheme that reformulates the solver as a stochastic Dirichlet policy. Unlike traditional methods that fine-tune the massive backbone, our RL approach operates strictly within the low-dimensional solver space, effectively mitigating reward hacking while enhancing performance in complex text-to-image (T2I) generation tasks. In addition, our method is flexible and can serve as a plugin (EPD-Plugin) to improve existing ODE samplers.
[36] An Architecture-Led Hybrid Report on Body Language Detection Project
Thomson Tong, Diba Darooneh
🧩 TL;DR
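This report gives an architecture-led analysis of two modern vision-language models (Qwen2.5-VL-7B-Instruct and Llama-4-Scout-17B-16E-Instruct) and explains how their architectural properties map to a practical video-to-artifact pipeline, highlighting the key distinction between structured-output validation and semantic correctness.
📘 Detailed Summary
Motivation: The report addresses the practical challenges of generating and validating structured output when vision-language models are used for video analysis. It focuses on mapping architectural properties to a working video pipeline and on the gap between syntactic validity and semantic correctness in structured outputs.
Method: The report analyzes the architecture of both VLMs (visual tokenization, Transformer attention, instruction following) and describes the video-to-artifact pipeline: sample video frames, prompt a VLM to detect visible people and produce pixel-space bounding boxes with attributes (emotion by default), validate the output structure against a predefined schema, and optionally render an annotated video.
Result: The analysis identifies key system constraints: structured outputs can be syntactically valid yet semantically incorrect, schema validation checks structure rather than geometric correctness, person identifiers are frame-local under the current prompting contract, and interactive single-frame analysis returns free-form text rather than schema-enforced JSON.
Conclusion: These distinctions matter for writing defensible claims, designing robust interfaces, and planning evaluation. VLM applications need to separate the structural and semantic levels of output validation explicitly.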
📄 Abstract
This report provides an architecture-led analysis of two modern vision-language models (VLMs), Qwen2.5-VL-7B-Instruct and Llama-4-Scout-17B-16E-Instruct, and explains how their architectural properties map to a practical video-to-artifact pipeline implemented in the BodyLanguageDetection repository [1]. The system samples video frames, prompts a VLM to detect visible people and generate pixel-space bounding boxes with prompt-conditioned attributes (emotion by default), validates output structure using a predefined schema, and optionally renders an annotated video. We first summarize the shared multimodal foundation (visual tokenization, Transformer attention, and instruction following), then describe each architecture at a level sufficient to justify engineering choices without speculative internals. Finally, we connect model behavior to system constraints: structured outputs can be syntactically valid while semantically incorrect, schema validation is structural (not geometric correctness), person identifiers are frame-local in the current prompting contract, and interactive single-frame analysis returns free-form text rather than schema-enforced JSON. These distinctions are critical for writing defensible claims, designing robust interfaces, and planning evaluation.
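The distinction between structural and geometric validity can be made concrete with a small sketch. The field names `bbox` and `emotion`, the `[x1, y1, x2, y2]` box layout, and both function names are hypothetical; the repository's actual schema is not shown in the abstract.

```python
def validate_structure(detections):
    """Structural check only: a list of dicts with a 4-number 'bbox' and a string 'emotion'."""
    if not isinstance(detections, list):
        return False
    for det in detections:
        if not isinstance(det, dict):
            return False
        box = det.get("bbox")
        if not (isinstance(box, list) and len(box) == 4):
            return False
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in box):
            return False
        if not isinstance(det.get("emotion"), str):
            return False
    return True


def validate_geometry(detections, width, height):
    """Geometric check: each box must be non-empty and lie inside the frame."""
    for det in detections:
        x1, y1, x2, y2 = det["bbox"]
        if not (0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height):
            return False
    return True


# Well-formed output whose box corners are inverted: structure passes, geometry fails.
output = [{"bbox": [500, 400, 100, 50], "emotion": "neutral"}]
structurally_ok = validate_structure(output)
geometrically_ok = validate_geometry(output, width=640, height=480)
```

Neither check says anything about whether the box actually contains a person or whether the emotion label is right; that semantic level needs ground-truth evaluation.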
[37] VPTracker: Global Vision-Language Tracking via Visual Prompt and MLLM
Jingchao Wang, Kaiwen Zhou, Zhijian Wu, Kunhua Ji, Dingjiang Huang, Yefeng Zheng
🧩 TL;DR
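This paper proposes VPTracker, the first global vision-language tracking framework based on multimodal large language models. A location-aware visual prompting mechanism keeps the benefits of global search while suppressing visual distractions, improving tracking robustness and target disambiguation.
📘 Detailed Summary
Motivation: Existing vision-language tracking methods are typically limited to local search. They fail easily under viewpoint changes, occlusions, and rapid target movement, and they lack the global reasoning needed to handle such cases.
Method: The paper proposes VPTracker, a global tracking framework built on multimodal large language models, together with a location-aware visual prompting mechanism. The mechanism builds a region-level prompt from the target's previous location so that the model prioritizes region-level recognition and falls back to global inference only when necessary.
Result: Extensive experiments show that the method significantly improves tracking stability and target disambiguation in challenging scenarios and suppresses interference from visually or semantically similar objects, opening a new avenue for applying MLLMs to visual tracking.
Conclusion: The work demonstrates the potential of multimodal large language models for visual tracking. Its hybrid strategy, combining global reasoning with a local prior, offers a new way around the limitations of conventional trackers in complex scenes and a basis for follow-up research.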
📄 Abstract
Vision-Language Tracking aims to continuously localize objects described by a visual template and a language description. Existing methods, however, are typically limited to local search, making them prone to failures under viewpoint changes, occlusions, and rapid target movements. In this work, we introduce the first global tracking framework based on Multimodal Large Language Models (VPTracker), exploiting their powerful semantic reasoning to locate targets across the entire image space. While global search improves robustness and reduces drift, it also introduces distractions from visually or semantically similar objects. To address this, we propose a location-aware visual prompting mechanism that incorporates spatial priors into the MLLM. Specifically, we construct a region-level prompt based on the target's previous location, enabling the model to prioritize region-level recognition and resort to global inference only when necessary. This design retains the advantages of global tracking while effectively suppressing interference from distracting visual content. Extensive experiments show that our approach significantly enhances tracking stability and target disambiguation under challenging scenarios, opening a new avenue for integrating MLLMs into visual tracking. Code is available at https://github.com/jcwang0602/VPTracker.
[38] Anomaly Detection by Effectively Leveraging Synthetic Images
Sungho Kang, Hyunkyu Park, Yeonho Lee, Hanbyul Lee, Mijoo Jeong, YeongHyeon Park, Injae Lee, Juneho Yi
🧩 TL;DR
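This paper proposes a novel industrial anomaly detection framework that combines a pre-trained text-guided image translation model with an image retrieval model to generate high-quality synthetic defect images efficiently, and uses a two-stage training strategy to improve detection performance.
📘 Detailed Summary
Motivation: Industrial anomaly detection suffers from a scarcity of real defect images, and existing synthesis methods involve a clear trade-off: rule-based synthesis is cheap but unrealistic, while generative-model-based synthesis is high quality but costly. The paper aims for a solution that generates high-quality synthetic defect images efficiently and uses them effectively to improve detection.
Method: The framework uses a pre-trained text-guided image-to-image translation model and an image retrieval model to generate synthetic defect images efficiently. The retrieval model scores the similarity of generated images to real normal images and filters out irrelevant outputs to raise generation quality. A two-stage training strategy is also introduced: the model is first pre-trained on a large volume of rule-based synthetic images and then fine-tuned on a smaller set of high-quality images.
Result: Experiments on the MVTec AD dataset demonstrate the effectiveness of the approach. The framework significantly reduces data collection cost while improving anomaly detection performance.
Conclusion: The work offers a way to balance cost and quality in industrial anomaly detection. Synthesis with filtering, combined with staged training, gives an efficient solution for data-scarce settings with clear industrial relevance.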
📄 Abstract
Anomaly detection plays a vital role in industrial manufacturing. Due to the scarcity of real defect images, unsupervised approaches that rely solely on normal images have been extensively studied. Recently, diffusion-based generative models brought attention to training data synthesis as an alternative solution. In this work, we focus on a strategy to effectively leverage synthetic images to maximize the anomaly detection performance. Previous synthesis strategies are broadly categorized into two groups, presenting a clear trade-off. Rule-based synthesis, such as injecting noise or pasting patches, is cost-effective but often fails to produce realistic defect images. On the other hand, generative model-based synthesis can create high-quality defect images but requires substantial cost. To address this problem, we propose a novel framework that leverages a pre-trained text-guided image-to-image translation model and image retrieval model to efficiently generate synthetic defect images. Specifically, the image retrieval model assesses the similarity of the generated images to real normal images and filters out irrelevant outputs, thereby enhancing the quality and relevance of the generated defect images. To effectively leverage synthetic images, we also introduce a two stage training strategy. In this strategy, the model is first pre-trained on a large volume of images from rule-based synthesis and then fine-tuned on a smaller set of high-quality images. This method significantly reduces the cost for data collection while improving the anomaly detection performance. Experiments on the MVTec AD dataset demonstrate the effectiveness of our approach.
[39] Medical Scene Reconstruction and Segmentation based on 3D Gaussian Representation
Bin Liu, Wenyan Tian, Huangxin Fu, Zizheng Li, Zhifen He, Bo Li
🧩 TL;DR
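This paper proposes an efficient 3D reconstruction method for medical images based on 3D Gaussian and tri-plane representations. It significantly improves structural continuity and semantic consistency under sparse slicing while retaining the Gaussian representation's advantages in efficient rendering and geometric representation.
📘 Detailed Summary
Motivation: Traditional 3D reconstruction of medical images is computationally expensive and, with sparse slices, prone to structural discontinuities and loss of detail. This makes clinical accuracy requirements hard to meet and calls for a more efficient method that preserves structural continuity.
Method: The method combines 3D Gaussian and tri-plane representations. It keeps the efficient rendering and geometric representation of Gaussians and uses the tri-plane representation to strengthen structural continuity and semantic consistency under sparse slicing.
Result: Experiments on multimodal medical datasets such as ultrasound and MRI show that the method generates high-quality, anatomically coherent, and semantically stable medical images under sparse data conditions, while significantly improving reconstruction efficiency.
Conclusion: The method provides an efficient and reliable approach to 3D visualization and clinical analysis of medical images, and is particularly suited to high-quality reconstruction from sparse slices.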
📄 Abstract
3D reconstruction of medical images is a key technology in medical image analysis and clinical diagnosis, providing structural visualization support for disease assessment and surgical planning. Traditional methods are computationally expensive and prone to structural discontinuities and loss of detail in sparse slices, making it difficult to meet clinical accuracy requirements. To address these challenges, we propose an efficient 3D reconstruction method based on 3D Gaussian and tri-plane representations. This method not only maintains the advantages of Gaussian representation in efficient rendering and geometric representation but also significantly enhances structural continuity and semantic consistency under sparse slicing conditions. Experimental results on multimodal medical datasets such as US and MRI show that our proposed method can generate high-quality, anatomically coherent, and semantically stable medical images under sparse data conditions, while significantly improving reconstruction efficiency. This provides an efficient and reliable new approach for 3D visualization and clinical analysis of medical images.
[40] ViLaCD-R1: A Vision-Language Framework for Semantic Change Detection in Remote Sensing
Xingwei Ma, Shiyang Feng, Bo Zhang, Bin Wang
🧩 TL;DR
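This paper proposes ViLaCD-R1, a two-stage framework for remote sensing change detection. By combining a Multi-Image Reasoner with a Mask-Guided Decoder, it improves semantic change recognition, suppresses non-semantic perturbations, and achieves state-of-the-art results on multiple benchmarks.
📘 Detailed Summary
Motivation: In remote sensing change detection, traditional pixel-based operators and encoder-decoder networks capture high-level semantics poorly and are sensitive to non-semantic perturbations. Existing multimodal and vision-language approaches still suffer from inaccurate spatial localization, imprecise pixel-level boundary delineation, and limited interpretability.
Method: ViLaCD-R1 is a two-stage framework comprising a Multi-Image Reasoner and a Mask-Guided Decoder. The vision-language model is first trained with supervised fine-tuning and reinforcement learning on block-level dual-temporal inference tasks, taking dual-temporal image patches as input and outputting a coarse change mask. The decoder then integrates dual-temporal image features with the coarse mask to predict a precise binary change map.
Result: Comprehensive evaluation on multiple remote sensing change detection benchmarks shows that ViLaCD-R1 substantially improves recognition and localization of true semantic changes, robustly suppresses non-semantic variations, and achieves state-of-the-art accuracy in complex real-world scenarios.
Conclusion: The work shows that a two-stage framework pairing a vision-language model with a specialized decoder is effective for remote sensing change detection. It offers a pattern for complex multi-image inference tasks, with higher sensitivity to semantic change and greater robustness to non-semantic perturbations.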
📄 Abstract
Remote sensing change detection (RSCD), a complex multi-image inference task, traditionally uses pixel-based operators or encoder-decoder networks that inadequately capture high-level semantics and are vulnerable to non-semantic perturbations. Although recent multimodal and vision-language model (VLM)-based approaches enhance semantic understanding of change regions by incorporating textual descriptions, they still suffer from challenges such as inaccurate spatial localization, imprecise pixel-level boundary delineation, and limited interpretability. To address these issues, we propose ViLaCD-R1, a two-stage framework comprising a Multi-Image Reasoner (MIR) and a Mask-Guided Decoder (MGD). Specifically, the VLM is trained through supervised fine-tuning (SFT) and reinforcement learning (RL) on block-level dual-temporal inference tasks, taking dual-temporal image patches as input and outputting a coarse change mask. Then, the decoder integrates dual-temporal image features with this coarse mask to predict a precise binary change map. Comprehensive evaluations on multiple RSCD benchmarks demonstrate that ViLaCD-R1 substantially improves true semantic change recognition and localization, robustly suppresses non-semantic variations, and achieves state-of-the-art accuracy in complex real-world scenarios.
[41] Depth Anything in $360^\circ$: Towards Scale Invariance in the Wild
Hualie Jiang, Ziyang Song, Zhiqiang Lou, Rui Xu, Minglang Tan
🧩 TL;DR
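This paper proposes DA360, a panoramic depth estimation method. By learning a shift parameter from the ViT backbone and integrating circular padding, it substantially improves zero-shot panoramic depth estimation, with large relative error reductions on both indoor and outdoor benchmarks.
📘 Detailed Summary
Motivation: Panoramic depth estimation has been studied extensively indoors, but its zero-shot generalization to open-world domains lags far behind that of perspective images, which benefit from abundant training data. Transferring capabilities from the perspective domain is therefore an attractive way to close the gap.
Method: DA360 is a panoramic-adapted version of Depth Anything V2. Its key innovation is learning a shift parameter from the ViT backbone, which turns the model's scale- and shift-invariant output into a scale-invariant estimate that directly yields well-formed 3D point clouds. Circular padding is integrated into the DPT decoder to eliminate seam artifacts and produce spatially coherent depth maps that respect spherical continuity.
Result: Evaluated on standard indoor benchmarks and the newly curated outdoor dataset Metropolis, DA360 reduces relative depth error by over 50% indoors and over 10% outdoors compared with its base model. It also significantly outperforms existing panoramic depth estimation methods, with about 30% relative error improvement over PanDA across all three test datasets, setting a new state of the art for zero-shot panoramic depth estimation.
Conclusion: Learning a shift parameter from the ViT backbone and integrating circular padding effectively transfers perspective depth estimation capability to the panoramic domain and improves zero-shot performance, giving robotics and AR/VR applications a more complete way to capture the structure of the environment.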
📄 Abstract
Panoramic depth estimation provides a comprehensive solution for capturing complete $360^\circ$ environmental structural information, offering significant benefits for robotics and AR/VR applications. However, while extensively studied in indoor settings, its zero-shot generalization to open-world domains lags far behind perspective images, which benefit from abundant training data. This disparity makes transferring capabilities from the perspective domain an attractive solution. To bridge this gap, we present Depth Anything in $360^\circ$ (DA360), a panoramic-adapted version of Depth Anything V2. Our key innovation involves learning a shift parameter from the ViT backbone, transforming the model's scale- and shift-invariant output into a scale-invariant estimate that directly yields well-formed 3D point clouds. This is complemented by integrating circular padding into the DPT decoder to eliminate seam artifacts, ensuring spatially coherent depth maps that respect spherical continuity. Evaluated on standard indoor benchmarks and our newly curated outdoor dataset, Metropolis, DA360 shows substantial gains over its base model, achieving over 50% and 10% relative depth error reduction on indoor and outdoor benchmarks, respectively. Furthermore, DA360 significantly outperforms robust panoramic depth estimation methods, achieving about 30% relative error improvement compared to PanDA across all three test datasets and establishing new state-of-the-art performance for zero-shot panoramic depth estimation.
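Circular padding is what lets a convolution see across the left and right edges of an equirectangular panorama, which are adjacent on the sphere. A minimal sketch on a nested-list feature map follows; the function name and the list-of-rows layout are assumptions for illustration, and a real decoder would apply the same wrap-around to tensors before each convolution.

```python
def circular_pad_width(rows, pad):
    """Pad each row on the left and right by wrapping values around the horizontal axis."""
    if pad < 0:
        raise ValueError("pad must be non-negative")
    if any(pad > len(row) for row in rows):
        raise ValueError("pad cannot exceed the row width")
    if pad == 0:
        # row[-0:] would return the whole row, so handle zero padding explicitly.
        return [list(row) for row in rows]
    return [list(row[-pad:]) + list(row) + list(row[:pad]) for row in rows]


# A one-row "panorama": the rightmost value wraps to the left border and vice versa.
padded = circular_pad_width([[1, 2, 3, 4]], pad=1)
```

With zero padding instead, the two borders would see artificial zeros, which is one plausible source of the seam artifacts the abstract mentions.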
[42] MedGemma vs GPT-4: Open-Source and Proprietary Zero-shot Medical Disease Classification from Images
Md. Sazzadul Islam Prottasha, Nabil Walid Rafi
🧩 TL;DR
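This study compares the specialized open-source model MedGemma with the large multimodal model GPT-4 on medical image diagnosis. The domain-fine-tuned MedGemma significantly outperforms GPT-4 across six disease classification tasks, underlining the importance of domain adaptation for clinical use.
📘 Detailed Summary
Motivation: Multimodal large language models offer a new paradigm for medical image analysis, but how models with different architectures compare on clinical diagnostic tasks is unclear. The study compares the specialized open-source agent MedGemma with the proprietary large multimodal model GPT-4 on disease diagnosis and assesses the effect of domain-specific fine-tuning.
Method: Two different AI architectures are compared: the specialized open-source MedGemma-4b-it, fine-tuned with Low-Rank Adaptation (LoRA), and the untuned proprietary multimodal model GPT-4. Both are evaluated on diagnosing six different diseases, with quantitative analysis through confusion matrices and classification reports.
Result: The fine-tuned MedGemma-4b-it reaches a mean test accuracy of 80.37%, compared with 69.58% for GPT-4. MedGemma shows higher sensitivity on high-stakes clinical tasks such as cancer and pneumonia detection, and the quantitative analysis covers performance across all disease categories.
Conclusion: The results indicate that domain-specific fine-tuning is essential for minimizing hallucinations in clinical deployment, and they position MedGemma as a tool for complex, evidence-based medical reasoning. In medical AI applications, a specialized fine-tuned model can outperform a general-purpose large model.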
📄 Abstract
Multimodal Large Language Models (LLMs) introduce an emerging paradigm for medical imaging by interpreting scans through the lens of extensive clinical knowledge, offering a transformative approach to disease classification. This study presents a critical comparison between two fundamentally different AI architectures: the specialized open-source agent MedGemma and the proprietary large multimodal model GPT-4 for diagnosing six different diseases. The MedGemma-4b-it model, fine-tuned using Low-Rank Adaptation (LoRA), demonstrated superior diagnostic capability by achieving a mean test accuracy of 80.37% compared to 69.58% for the untuned GPT-4. Furthermore, MedGemma exhibited notably higher sensitivity in high-stakes clinical tasks, such as cancer and pneumonia detection. Quantitative analysis via confusion matrices and classification reports provides comprehensive insights into model performance across all categories. These results emphasize that domain-specific fine-tuning is essential for minimizing hallucinations in clinical implementation, positioning MedGemma as a sophisticated tool for complex, evidence-based medical reasoning.
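The abstract reports accuracy and sensitivity derived from confusion matrices. A minimal sketch of how both follow from a confusion matrix is given below. The matrix values are invented for illustration and are not the paper's data; the function names are likewise hypothetical.

```python
def accuracy(confusion):
    """Overall accuracy; confusion[i][j] counts samples of true class i predicted as class j."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total if total else float("nan")


def per_class_sensitivity(confusion):
    """Sensitivity (recall) per class: correct predictions divided by true samples of that class."""
    result = []
    for i, row in enumerate(confusion):
        row_total = sum(row)
        result.append(row[i] / row_total if row_total else float("nan"))
    return result


# Invented two-class example: rows are true labels, columns are predictions.
matrix = [
    [8, 2],  # true class 0: 8 correct, 2 missed
    [1, 9],  # true class 1: 9 correct, 1 missed
]
overall = accuracy(matrix)
sensitivities = per_class_sensitivity(matrix)
```

Reporting sensitivity per class matters in the clinical setting the abstract describes, because a high mean accuracy can hide a low detection rate for a rare but serious condition.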
[43] MUSON: A Reasoning-oriented Multimodal Dataset for Socially Compliant Navigation in Urban Environments
Zhuonan Liu, Xinyu Zhang, Zishuo Wang, Tomohito Kawabata, Xuesu Xiao, Ling Xiao
🧩 TL;DR
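This paper presents MUSON, a multimodal dataset for short-horizon social navigation. Its structured five-step Chain-of-Thought annotation addresses the lack of explicit reasoning supervision and the long-tailed action distributions of existing datasets, providing an effective benchmark for socially compliant navigation.
📘 Detailed Summary
Motivation: Existing social navigation datasets often lack explicit reasoning supervision and exhibit highly long-tailed action distributions, which limits models' ability to learn safety-critical behaviors. A dataset with structured reasoning annotations and a balanced action space is needed.
Method: MUSON uses a structured five-step Chain-of-Thought annotation consisting of perception, prediction, reasoning, action, and explanation. It explicitly models static physical constraints, uses a rationally balanced discrete action space, and collects multimodal navigation data across diverse indoor and outdoor campus scenes.
Result: Benchmarking multiple state-of-the-art small vision-language models on MUSON, Qwen2.5-VL-3B achieves the highest decision accuracy of 0.8625, indicating that the dataset can effectively evaluate socially compliant navigation models. Compared with SNEI, MUSON provides more consistent reasoning, action, and explanation annotations.
Conclusion: Through structured Chain-of-Thought annotation and a balanced action space, MUSON provides high-quality supervision and a reusable benchmark for social navigation research, supporting navigation systems with safety-aware and interpretable decisions. The dataset is publicly available on HuggingFace.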
📄 Abstract
Socially compliant navigation requires structured reasoning over dynamic pedestrians and physical constraints to ensure safe and interpretable decisions. However, existing social navigation datasets often lack explicit reasoning supervision and exhibit highly long-tailed action distributions, limiting models' ability to learn safety-critical behaviors. To address these issues, we introduce MUSON, a multimodal dataset for short-horizon social navigation collected across diverse indoor and outdoor campus scenes. MUSON adopts a structured five-step Chain-of-Thought annotation consisting of perception, prediction, reasoning, action, and explanation, with explicit modeling of static physical constraints and a rationally balanced discrete action space. Compared to SNEI, MUSON provides consistent reasoning, action, and explanation. Benchmarking multiple state-of-the-art Small Vision Language Models on MUSON shows that Qwen2.5-VL-3B achieves the highest decision accuracy of 0.8625, demonstrating that MUSON serves as an effective and reusable benchmark for socially compliant navigation. The dataset is publicly available at https://huggingface.co/datasets/MARSLab/MUSON
[44] M-ErasureBench: A Comprehensive Multimodal Evaluation Benchmark for Concept Erasure in Diffusion Models
Ju-Hsuan Weng, Jia-Wei Liao, Cheng-Fu Chou, Jun-Cheng Chen
🧩 TL;DR
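This paper presents M-ErasureBench, the first multimodal evaluation framework for concept erasure that goes beyond text prompts, and develops the IRECE module to improve inference-time robustness, substantially lowering the concept reproduction rate.
📘 Detailed Summary
Motivation: Existing concept erasure methods focus mainly on text prompts and overlook other input modalities that are increasingly important in real applications such as image editing and personalized generation. These modalities can become attack surfaces through which erased concepts re-emerge, so robustness under multimodal inputs needs systematic evaluation.
Method: M-ErasureBench systematically benchmarks concept erasure across three input modalities: text prompts, learned embeddings, and inverted latents. With white-box and black-box access for the latter two, this gives five evaluation scenarios. The paper also develops IRECE, a plug-and-play module that localizes target concepts via cross-attention and perturbs the associated latents during denoising to improve robustness.
Result: Existing methods perform well against text prompts but largely fail under learned embeddings and inverted latents, with the Concept Reproduction Rate (CRR) exceeding 90% in the white-box setting. IRECE restores robustness, reducing CRR by up to 40% in the most challenging white-box latent inversion scenario while preserving visual quality.
Conclusion: The work exposes the fragility of current concept erasure methods under multimodal attacks. M-ErasureBench provides the first comprehensive benchmark beyond text prompts, IRECE offers a practical safeguard for building more reliable protective generative models, and future concept erasure research should account for multimodal robustness.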
📄 Abstract
Text-to-image diffusion models may generate harmful or copyrighted content, motivating research on concept erasure. However, existing approaches primarily focus on erasing concepts from text prompts, overlooking other input modalities that are increasingly critical in real-world applications such as image editing and personalized generation. These modalities can become attack surfaces, where erased concepts re-emerge despite defenses. To bridge this gap, we introduce M-ErasureBench, a novel multimodal evaluation framework that systematically benchmarks concept erasure methods across three input modalities: text prompts, learned embeddings, and inverted latents. For the latter two, we evaluate both white-box and black-box access, yielding five evaluation scenarios. Our analysis shows that existing methods achieve strong erasure performance against text prompts but largely fail under learned embeddings and inverted latents, with Concept Reproduction Rate (CRR) exceeding 90% in the white-box setting. To address these vulnerabilities, we propose IRECE (Inference-time Robustness Enhancement for Concept Erasure), a plug-and-play module that localizes target concepts via cross-attention and perturbs the associated latents during denoising. Experiments demonstrate that IRECE consistently restores robustness, reducing CRR by up to 40% under the most challenging white-box latent inversion scenario, while preserving visual quality. To the best of our knowledge, M-ErasureBench provides the first comprehensive benchmark of concept erasure beyond text prompts. Together with IRECE, our benchmark offers practical safeguards for building more reliable protective generative models.
[45] CoFi-Dec: Hallucination-Resistant Decoding via Coarse-to-Fine Generative Feedback in Large Vision-Language Models
Zongsheng Cao, Yangfan He, Anran Liu, Jun Xie, Feng Chen, Zepeng Wang
🧩 TL;DR
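This paper proposes CoFi-Dec, a training-free decoding framework that mitigates hallucinations in large vision-language models by combining generative self-feedback with coarse-to-fine visual conditioning. A Wasserstein-based fusion mechanism unifies multi-level visual hypotheses, improving output reliability on six hallucination benchmarks.
📘 Detailed Summary
Motivation: Large vision-language models have made strong progress in multimodal understanding and generation but still tend to produce hallucinated content that is inconsistent with the visual input, limiting their reliability in real applications. Existing methods struggle to maintain high-level semantic consistency and fine-grained visual grounding at the same time, which motivates a training-free decoding strategy that mitigates hallucination.
Method: Inspired by the human visual process from global scene perception to detailed inspection, CoFi-Dec first generates two intermediate textual responses conditioned on coarse- and fine-grained views of the original image. A text-to-image model turns these responses into synthetic images, forming multi-level visual hypotheses that enrich grounding cues. To unify predictions from these visual conditions, a Wasserstein-based fusion mechanism aligns their predictive distributions into a geometrically consistent decoding trajectory.
Result: Extensive experiments on six hallucination-focused benchmarks show that CoFi-Dec substantially reduces both entity-level and semantic-level hallucinations and outperforms existing decoding strategies. The framework is model-agnostic, requires no additional training, and applies to a wide range of LVLMs.
Conclusion: By combining generative self-feedback with multi-level visual conditioning, CoFi-Dec provides an effective training-free approach to hallucination in vision-language models. The Wasserstein-based fusion reconciles high-level semantic consistency with fine-grained visual grounding, yielding more robust and faithful outputs, and the framework's model-agnostic, training-free design makes it broadly applicable.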
📄 Abstract
Large Vision-Language Models (LVLMs) have achieved impressive progress in multi-modal understanding and generation. However, they still tend to produce hallucinated content that is inconsistent with the visual input, which limits their reliability in real-world applications. We propose CoFi-Dec, a training-free decoding framework that mitigates hallucinations by integrating generative self-feedback with coarse-to-fine visual conditioning. Inspired by the human visual process from global scene perception to detailed inspection, CoFi-Dec first generates two intermediate textual responses conditioned on coarse- and fine-grained views of the original image. These responses are then transformed into synthetic images using a text-to-image model, forming multi-level visual hypotheses that enrich grounding cues. To unify the predictions from these multiple visual conditions, we introduce a Wasserstein-based fusion mechanism that aligns their predictive distributions into a geometrically consistent decoding trajectory. This principled fusion reconciles high-level semantic consistency with fine-grained visual grounding, leading to more robust and faithful outputs. Extensive experiments on six hallucination-focused benchmarks show that CoFi-Dec substantially reduces both entity-level and semantic-level hallucinations, outperforming existing decoding strategies. The framework is model-agnostic, requires no additional training, and can be seamlessly applied to a wide range of LVLMs. The implementation is available at https://github.com/AI-Researcher-Team/CoFi-Dec.
[46] JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation
Kai Liu, Jungang Li, Yuchong Sun, Shengqiong Wu, Jianzhang Gao, Daoan Zhang, Wei Zhang, Sheng Jin, Sicheng Yu, Geng Zhan, Jiayi Ji, Fan Zhou, Liang Zheng, Shuicheng Yan, Hao Fei, Tat-Seng Chua
🧩 TL;DR
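This paper presents JavisGPT, the first unified multimodal large language model for joint audio-video comprehension and generation. With a concise encoder-LLM-decoder architecture, a three-stage training pipeline, and a high-quality instruction dataset, it outperforms existing MLLMs on complex, temporally synchronized tasks.
📘 Detailed Summary
Motivation: Existing multimodal large language models are limited in joint audio-video comprehension and generation, especially on complex tasks that require spatio-temporal synchronization. A unified framework that achieves temporally coherent audio-video understanding and generation is lacking.
Method: JavisGPT uses an encoder-LLM-decoder architecture with a SyncFusion module for spatio-temporal audio-video fusion and synchrony-aware learnable queries that bridge to a pretrained JAV-DiT generator. Training proceeds in three stages: multimodal pretraining, audio-video fine-tuning, and large-scale instruction tuning. The authors also build JavisInst-Omni, an instruction dataset with over 200K GPT-4o-curated audio-video-text dialogues.
Result: On joint audio-video comprehension and generation benchmarks, JavisGPT outperforms existing multimodal large language models, particularly in complex and temporally synchronized settings.
Conclusion: JavisGPT offers a unified solution for joint audio-video comprehension and generation. Its architecture and training strategy deliver temporally coherent multimodal capability, which is especially relevant to applications that need precise spatio-temporal alignment.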
📄 Abstract
This paper presents JavisGPT, the first unified multimodal large language model (MLLM) for Joint Audio-Video (JAV) comprehension and generation. JavisGPT adopts a concise encoder-LLM-decoder architecture, featuring a SyncFusion module for spatio-temporal audio-video fusion and synchrony-aware learnable queries to bridge a pretrained JAV-DiT generator. This design enables temporally coherent video-audio understanding and generation from multimodal instructions. We design an effective three-stage training pipeline consisting of multimodal pretraining, audio-video fine-tuning, and large-scale instruction-tuning, to progressively build multimodal comprehension and generation from existing vision-language models. To support this, we further construct JavisInst-Omni, a high-quality instruction dataset with over 200K GPT-4o-curated audio-video-text dialogues that span diverse and multi-level comprehension and generation scenarios. Extensive experiments on JAV comprehension and generation benchmarks show that JavisGPT outperforms existing MLLMs, particularly in complex and temporally synchronized settings.
[47] PathFound: An Agentic Multimodal Model Activating Evidence-seeking Pathological Diagnosis
Shengyi Hua, Jianfeng Wu, Tianle Shen, Kangzhe Hu, Zhongzhen Huang, Shujuan Ni, Zhihong Zhang, Yuan Li, Zhe Wang, Xiaofan Zhang
🧩 TL;DR
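This paper presents PathFound, an agentic multimodal model for pathological diagnosis. It combines pathological visual foundation models, vision-language models, and a reasoning model trained with reinforcement learning to support an evidence-seeking inference paradigm of proactive information acquisition and diagnosis refinement.
📘 Detailed Summary
Motivation: Most current pathological foundation models use a static inference paradigm: whole-slide images are processed once to produce a prediction, with no reassessment or targeted evidence acquisition when the diagnosis is ambiguous. This contrasts with clinical workflows, which refine hypotheses through repeated observation and further examination.
Method: PathFound integrates pathological visual foundation models, vision-language models, and a reasoning model trained with reinforcement learning. It performs proactive information acquisition and diagnosis refinement across three stages (initial diagnosis, evidence seeking, and final decision), giving a dynamic inference process that resembles the clinical workflow.
Result: Adopting the evidence-seeking strategy consistently improves diagnostic accuracy across several large multimodal models. PathFound achieves state-of-the-art diagnostic performance across diverse clinical scenarios and shows strong potential to discover subtle details such as nuclear features and local invasion.
Conclusion: Evidence-seeking workflows are effective in computational pathology. PathFound's agentic inference paradigm provides a dynamic assessment framework closer to clinical practice and shows the value of proactive information acquisition for improving diagnostic accuracy and finding subtle pathological features.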
📄 Abstract
Recent pathological foundation models have substantially advanced visual representation learning and multimodal interaction. However, most models still rely on a static inference paradigm in which whole-slide images are processed once to produce predictions, without reassessment or targeted evidence acquisition under ambiguous diagnoses. This contrasts with clinical diagnostic workflows that refine hypotheses through repeated slide observations and further examination requests. We propose PathFound, an agentic multimodal model designed to support evidence-seeking inference in pathological diagnosis. PathFound integrates the power of pathological visual foundation models, vision-language models, and reasoning models trained with reinforcement learning to perform proactive information acquisition and diagnosis refinement by progressing through the initial diagnosis, evidence-seeking, and final decision stages. Across several large multimodal models, adopting this strategy consistently improves diagnostic accuracy, indicating the effectiveness of evidence-seeking workflows in computational pathology. Among these models, PathFound achieves state-of-the-art diagnostic performance across diverse clinical scenarios and demonstrates strong potential to discover subtle details, such as nuclear features and local invasions.
[48] ColaVLA: Leveraging Cognitive Latent Reasoning for Hierarchical Parallel Trajectory Planning in Autonomous Driving
Qihang Peng, Xuesong Chen, Chenye Yang, Shaoshuai Shi, Hongsheng Li
🧩 TL;DR
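This paper proposes ColaVLA, a unified vision-language-action framework that moves reasoning from text into a unified latent space and pairs it with a hierarchical parallel trajectory decoder, addressing the continuous-control mismatch, latency, and real-time deployment challenges of VLM-based planners.
📘 Detailed Summary
Motivation: Current VLM-based autonomous driving planners face three key challenges: a mismatch between discrete text reasoning and continuous control, high latency from autoregressive chain-of-thought decoding, and inefficient or non-causal planners that limit real-time deployment. Traditional modular pipelines separate perception, prediction, and planning, and end-to-end systems learn them jointly, but VLM-based planners still fall short on efficiency and real-time performance.
Method: ColaVLA has two core components, a Cognitive Latent Reasoner and a Hierarchical Parallel Planner. The reasoner compresses scene understanding into compact, decision-oriented meta-action embeddings through ego-adaptive selection and only two VLM forward passes. The planner then generates multi-scale, causality-consistent trajectories in a single forward pass.
Result: Experiments on the nuScenes benchmark show that ColaVLA achieves state-of-the-art performance in both open-loop and closed-loop settings, with favorable efficiency and robustness. The framework preserves the generalization and interpretability of VLMs while enabling efficient, accurate, and safe trajectory generation, which improves the feasibility of real-time deployment.
Conclusion: By combining unified latent-space reasoning with hierarchical parallel decoding, ColaVLA addresses the key challenges of VLM-based planners and offers a trajectory-generation paradigm that keeps the strengths of VLMs while meeting real-time requirements. The work shows that high-level semantic reasoning can be coupled effectively with low-level control actions.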
📄 Abstract
Autonomous driving requires generating safe and reliable trajectories from complex multimodal inputs. Traditional modular pipelines separate perception, prediction, and planning, while recent end-to-end (E2E) systems learn them jointly. Vision-language models (VLMs) further enrich this paradigm by introducing cross-modal priors and commonsense reasoning, yet current VLM-based planners face three key challenges: (i) a mismatch between discrete text reasoning and continuous control, (ii) high latency from autoregressive chain-of-thought decoding, and (iii) inefficient or non-causal planners that limit real-time deployment. We propose ColaVLA, a unified vision-language-action framework that transfers reasoning from text to a unified latent space and couples it with a hierarchical, parallel trajectory decoder. The Cognitive Latent Reasoner compresses scene understanding into compact, decision-oriented meta-action embeddings through ego-adaptive selection and only two VLM forward passes. The Hierarchical Parallel Planner then generates multi-scale, causality-consistent trajectories in a single forward pass. Together, these components preserve the generalization and interpretability of VLMs while enabling efficient, accurate and safe trajectory generation. Experiments on the nuScenes benchmark show that ColaVLA achieves state-of-the-art performance in both open-loop and closed-loop settings with favorable efficiency and robustness.
[49] RxnBench: A Multimodal Benchmark for Evaluating Large Language Models on Chemical Reaction Understanding from Scientific Literature
Hanzheng Li, Xi Fang, Yixuan Li, Chaozheng Huang, Junjie Wang, Xi Wang, Hongzhe Bai, Bojun Hao, Shenyu Lin, Huiqi Liang, Linfeng Zhang, Guolin Ke
🧩 TL;DR
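This paper introduces RxnBench, a multi-tiered benchmark for evaluating how well multimodal large language models understand chemical reactions in the scientific literature. It reveals a clear capability gap in deep chemical reasoning.
📘 Detailed Summary
Motivation: Multimodal LLMs are promising for chemistry, but their ability to understand the dense, graphical language of reactions in authentic scientific literature remains underexplored, especially for complex reaction mechanisms and cross-modal information integration.
Method: RxnBench comprises two tasks. Single-Figure QA evaluates fine-grained visual perception and mechanistic reasoning with 1,525 questions derived from 305 curated reaction schemes. Full-Document QA challenges models to synthesize text, schemes, and tables from 108 scientific articles, which requires cross-modal integration.
Result: The evaluation exposes a critical capability gap: models extract explicit text well but struggle with deep chemical logic and precise structural recognition. Models with inference-time reasoning significantly outperform standard architectures, yet no model reaches 50% accuracy on Full-Document QA.
Conclusion: The results point to an urgent need for domain-specific visual encoders and stronger reasoning engines to advance autonomous AI chemists. The limitations of current models on chemical literature highlight the importance of cross-modal integration and domain knowledge.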
📄 Abstract
The integration of Multimodal Large Language Models (MLLMs) into chemistry promises to revolutionize scientific discovery, yet their ability to comprehend the dense, graphical language of reactions within authentic literature remains underexplored. Here, we introduce RxnBench, a multi-tiered benchmark designed to rigorously evaluate MLLMs on chemical reaction understanding from scientific PDFs. RxnBench comprises two tasks: Single-Figure QA (SF-QA), which tests fine-grained visual perception and mechanistic reasoning using 1,525 questions derived from 305 curated reaction schemes, and Full-Document QA (FD-QA), which challenges models to synthesize information from 108 articles, requiring cross-modal integration of text, schemes, and tables. Our evaluation of MLLMs reveals a critical capability gap: while models excel at extracting explicit text, they struggle with deep chemical logic and precise structural recognition. Notably, models with inference-time reasoning significantly outperform standard architectures, yet none achieve 50% accuracy on FD-QA. These findings underscore the urgent need for domain-specific visual encoders and stronger reasoning engines to advance autonomous AI chemists.
[50] CLIP-Joint-Detect: End-to-End Joint Training of Object Detectors with Contrastive Vision-Language Supervision
Behnam Raoufi, Hossein Sharify, Mohamad Mahdee Ramezanee, Khosrow Hajsadeghi, Saeed Bagheri Shouraki
🧩 TL;DR
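This paper proposes CLIP-Joint-Detect, a simple, detector-agnostic framework that integrates CLIP-style contrastive vision-language supervision through end-to-end joint training, improving object detection performance while preserving real-time inference speed.
📘 Detailed Summary
Motivation: Conventional object detectors rely on cross-entropy classification, which is vulnerable to class imbalance and label noise. A more robust supervision mechanism is needed to improve detection performance.
Method: CLIP-Joint-Detect adds a lightweight parallel head that projects region or grid features into the CLIP embedding space and aligns them with learnable class-specific text embeddings through an InfoNCE contrastive loss and an auxiliary cross-entropy term. All standard detection losses are optimized at the same time, and the approach applies to both two-stage and one-stage architectures.
Result: The method is validated on Pascal VOC 2007+2012 with Faster R-CNN and on MS COCO 2017 with a modern YOLO detector (YOLOv11). Both settings show consistent and substantial improvements while preserving real-time inference speed. Ablations show that joint optimization with learnable text embeddings markedly enhances closed-set detection performance.
Conclusion: Integrating contrastive vision-language supervision through end-to-end joint training is an effective and general way to strengthen detectors. Joint optimization with learnable text embeddings improves detection across architectures and datasets and suggests a new supervision paradigm for detector design.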
📄 Abstract
Conventional object detectors rely on cross-entropy classification, which can be vulnerable to class imbalance and label noise. We propose CLIP-Joint-Detect, a simple and detector-agnostic framework that integrates CLIP-style contrastive vision-language supervision through end-to-end joint training. A lightweight parallel head projects region or grid features into the CLIP embedding space and aligns them with learnable class-specific text embeddings via InfoNCE contrastive loss and an auxiliary cross-entropy term, while all standard detection losses are optimized simultaneously. The approach applies seamlessly to both two-stage and one-stage architectures. We validate it on Pascal VOC 2007+2012 using Faster R-CNN and on the large-scale MS COCO 2017 benchmark using modern YOLO detectors (YOLOv11), achieving consistent and substantial improvements while preserving real-time inference speed. Extensive experiments and ablations demonstrate that joint optimization with learnable text embeddings markedly enhances closed-set detection performance across diverse architectures and datasets.
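As a rough illustration of the alignment term described in the abstract above, the following sketch computes an InfoNCE-style loss between projected region features and class text embeddings. It is a minimal pure-Python sketch and not the authors' implementation: the function names, the temperature of 0.07, and the use of the class embeddings as the contrast set are assumptions. In a real detector the text embeddings would be learnable parameters updated jointly with the detection losses.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def info_nce_loss(region_feats, labels, class_text_embs, temperature=0.07):
    """Mean InfoNCE loss pulling each region feature toward its class text embedding.

    region_feats: list of feature vectors already projected into the embedding space
    labels: ground-truth class index for each region
    class_text_embs: one embedding per class (hypothetical stand-in for learnable ones)
    """
    text = [l2_normalize(t) for t in class_text_embs]
    total = 0.0
    for feat, label in zip(region_feats, labels):
        f = l2_normalize(feat)
        # Cosine similarity to every class embedding, scaled by the temperature.
        logits = [sum(a * b for a, b in zip(f, t)) / temperature for t in text]
        # Numerically stable log-sum-exp over all classes.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denom - logits[label]
    return total / len(region_feats)
```

With the class embeddings as the only negatives, this reduces to a softmax cross-entropy over temperature-scaled cosine similarities, which is why it can sit next to the usual detection losses without changing the detector's outputs.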
[51] RealCamo: Boosting Real Camouflage Synthesis with Layout Controls and Textual-Visual Guidance
Chunyuan Chen, Yunuo Cai, Shujuan Li, Weiyun Liang, Bin Wang, Jing Xu
🧩 TL;DR
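This paper proposes RealCamo, a unified out-painting-based framework for generating realistic camouflaged images. It introduces layout controls and a multimodal textual-visual condition to improve semantic coherence and visual fidelity.
📘 Detailed Summary
Motivation: Existing camouflaged image generation methods have clear shortcomings: generated images either lack effective camouflage because visual similarity is weak, or have cluttered backgrounds that are semantically inconsistent with the foreground target. Both leave a large gap to real camouflaged imagery.
Method: RealCamo is a unified out-painting framework. It adds layout controls that regulate global image structure to improve semantic coherence, and it builds a multimodal textual-visual condition that combines a unified fine-grained textual task description with texture-oriented background retrieval to jointly guide generation.
Result: Extensive experiments and visualizations demonstrate the framework's effectiveness. The authors also introduce a background-foreground distribution divergence metric to quantify camouflage quality in generated images.
Conclusion: Through layout control and multimodal conditioning, the work improves the semantic coherence and visual realism of generated camouflaged images and provides a higher-quality way to generate training data for camouflaged object detection.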
📄 Abstract
Camouflaged image generation (CIG) has recently emerged as an efficient alternative for acquiring high-quality training data for camouflaged object detection (COD). However, existing CIG methods still suffer from a substantial gap to real camouflaged imagery: generated images either lack sufficient camouflage due to weak visual similarity, or exhibit cluttered backgrounds that are semantically inconsistent with foreground targets. To address these limitations, we propose RealCamo, a unified out-painting based framework for realistic camouflaged image generation. RealCamo explicitly introduces additional layout controls to regulate global image structure, thereby improving semantic coherence between foreground objects and generated backgrounds. Moreover, we construct a multi-modal textual-visual condition by combining a unified fine-grained textual task description with texture-oriented background retrieval, which jointly guides the generation process to enhance visual fidelity and realism. To quantitatively assess camouflage quality, we further introduce a background-foreground distribution divergence metric that measures the effectiveness of camouflage in generated images. Extensive experiments and visualizations demonstrate the effectiveness of our proposed framework.
[52] PoseStreamer: A Multi-modal Framework for 6DoF Pose Estimation of Unseen Moving Objects
Huiming Yang, Linglin Liao, Fei Ding, Sibo Wang, Zijian Zeng
🧩 TL;DR
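This paper presents PoseStreamer, a robust multimodal 6DoF pose estimation framework designed for high-speed motion. It combines an adaptive pose memory queue, an object-centric 2D tracker, and a ray pose filter to improve event-camera pose estimation under high-speed, low-light conditions.
📘 Detailed Summary
Motivation: Existing 6DoF pose estimation methods perform poorly in high-speed scenes, especially in low light, where RGB cameras suffer from motion blur. Event cameras offer high temporal resolution, but current methods still fail to exploit it fully for fast-moving objects.
Method: PoseStreamer has three core components. The Adaptive Pose Memory Queue uses historical orientation cues to keep estimates temporally consistent. The Object-centric 2D Tracker supplies strong 2D priors to raise 3D center recall. The Ray Pose Filter performs geometric refinement along camera rays. The authors also build MoCapCube6D, a multimodal dataset for benchmarking performance under rapid motion.
Result: Extensive experiments show that PoseStreamer achieves superior accuracy in high-speed scenes and, as a template-free framework, generalizes well to unseen moving objects. Its effectiveness is validated on the MoCapCube6D dataset.
Conclusion: The work shows that combining a temporal consistency mechanism, 2D tracking priors, and geometric refinement addresses 6DoF pose estimation under high-speed motion. It provides a practical framework for event cameras in dynamic scenes and contributes a new multimodal benchmark dataset for fast motion.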
📄 Abstract
Six-degree-of-freedom (6DoF) pose estimation for novel objects is a critical task in computer vision, yet it faces significant challenges in high-speed and low-light scenarios where standard RGB cameras suffer from motion blur. While event cameras offer a promising solution due to their high temporal resolution, current 6DoF pose estimation methods typically yield suboptimal performance when objects move at high speed. To address this gap, we propose PoseStreamer, a robust multi-modal 6DoF pose estimation framework designed specifically for high-speed motion scenarios. Our approach integrates three core components: an Adaptive Pose Memory Queue that utilizes historical orientation cues for temporal consistency, an Object-centric 2D Tracker that provides strong 2D priors to boost 3D center recall, and a Ray Pose Filter for geometric refinement along camera rays. Furthermore, we introduce MoCapCube6D, a novel multi-modal dataset constructed to benchmark performance under rapid motion. Extensive experiments demonstrate that PoseStreamer not only achieves superior accuracy in high-speed motion scenarios, but also exhibits strong generalizability as a template-free framework for unseen moving objects.
[53] Spatial-aware Symmetric Alignment for Text-guided Medical Image Segmentation
Linglin Liao, Qichuan Geng, Yu Liu
🧩 TL;DR
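This paper proposes the Spatial-aware Symmetric Alignment (SSA) framework. Using a symmetric optimal transport alignment mechanism and a composite directional guidance strategy, it improves the handling of hybrid medical text (locational, descriptive, and diagnostic information) in medical image segmentation and achieves state-of-the-art performance on public benchmarks.
📘 Detailed Summary
Motivation: Current text-guided medical image segmentation methods have two key bottlenecks. First, they struggle to process diagnostic and descriptive text together, which makes it hard to identify lesions and associate them with image regions. Second, they focus on lesion description and fail to capture positional constraints, which leads to critical deviations: given the text "in the left lower lung", the segmentation may wrongly cover both sides of the lung.
Method: SSA has two core techniques. The symmetric optimal transport alignment mechanism establishes bidirectional, fine-grained multimodal correspondences, strengthening the association between image regions and multiple relevant expressions. The composite directional guidance strategy builds region-level guidance masks that explicitly introduce the spatial constraints stated in the text, so that hybrid medical text containing locational, descriptive, and diagnostic information can be handled.
Result: Extensive experiments on public benchmarks show that SSA achieves state-of-the-art performance, particularly when segmenting lesions characterized by spatial relational constraints.
Conclusion: Through symmetric alignment and spatially constrained guidance, the work addresses the challenge of hybrid text in medical image segmentation and offers a new technical route for text-guided medical image analysis, with particular value in clinical applications that need precise spatial localization.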
📄 Abstract
Text-guided medical image segmentation has shown considerable promise, with rich clinical text serving as an effective supplement for scarce data. However, current methods have two key bottlenecks. On one hand, they struggle to process diagnostic and descriptive texts simultaneously, making it difficult to identify lesions and establish associations with image regions. On the other hand, existing approaches focus on lesion description and fail to capture positional constraints, leading to critical deviations. Specifically, with the text "in the left lower lung", the segmentation results may incorrectly cover both sides of the lung. To address these limitations, we propose the Spatial-aware Symmetric Alignment (SSA) framework to enhance the capacity for referring to hybrid medical texts consisting of locational, descriptive, and diagnostic information. Specifically, we propose a symmetric optimal transport alignment mechanism to strengthen the associations between image regions and multiple relevant expressions, which establishes bi-directional fine-grained multimodal correspondences. In addition, we devise a composite directional guidance strategy that explicitly introduces spatial constraints in the text by constructing region-level guidance masks. Extensive experiments on public benchmarks demonstrate that SSA achieves state-of-the-art (SOTA) performance, particularly in accurately segmenting lesions characterized by spatial relational constraints.
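The optimal transport alignment mentioned above can be illustrated with a generic entropic-regularized solver. The sketch below runs plain Sinkhorn iterations on a cost matrix between image regions (rows) and text expressions (columns). It is an assumption-laden illustration, not the SSA mechanism itself: the uniform marginals, the `eps` value, and the function name are hypothetical, and the paper's "symmetric" formulation is not reproduced here.

```python
import math

def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropic optimal transport plan between rows (regions) and columns (expressions).

    cost[i][j] is the dissimilarity between region i and expression j.
    Uniform marginals are assumed on both sides.
    """
    n, m = len(cost), len(cost[0])
    # Gibbs kernel: low cost maps to high affinity.
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    a = [1.0 / n] * n  # mass available at each region
    b = [1.0 / m] * m  # mass required by each expression
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
```

Each entry of the returned plan can be read as a soft correspondence weight, so one region may be matched to several relevant expressions and vice versa.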
[54] Reverse Personalization
Han-Wei Kung, Tuomas Varanka, Nicu Sebe
🧩 TL;DR
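This paper proposes a reverse personalization framework for face anonymization based on conditional diffusion inversion. It manipulates images directly without text prompts and achieves a state-of-the-art balance between identity removal, attribute preservation, and image quality.
📘 Detailed Summary
Motivation: Existing prompt-based methods for removing or modifying identity-specific features either rely on the subject being well represented in the pretrained model or require fine-tuning for each identity, so they generalize poorly to subjects outside the training data. Existing anonymization methods also lack control over facial attributes.
Method: The method introduces a reverse personalization framework that uses conditional diffusion inversion to manipulate images directly without text prompts. An identity-guided conditioning branch lets it generalize to subjects outside the model's training data, and the framework supports attribute-controllable anonymization.
Result: Experiments show a state-of-the-art balance between identity removal, attribute preservation, and image quality. The method handles subjects outside the training data while keeping facial attributes controllable.
Conclusion: The work provides a prompt-free framework for face anonymization. Conditional diffusion inversion and the identity-guided conditioning branch give it the ability to generalize to unseen subjects, offering a practical way to keep image quality and attribute control in privacy-preserving applications.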
📄 Abstract
Recent text-to-image diffusion models have demonstrated remarkable generation of realistic facial images conditioned on textual prompts and human identities, enabling the creation of personalized facial imagery. However, existing prompt-based methods for removing or modifying identity-specific features either rely on the subject being well-represented in the pre-trained model or require model fine-tuning for specific identities. In this work, we analyze the identity generation process and introduce a reverse personalization framework for face anonymization. Our approach leverages conditional diffusion inversion, allowing direct manipulation of images without using text prompts. To generalize beyond subjects in the model's training data, we incorporate an identity-guided conditioning branch. Unlike prior anonymization methods, which lack control over facial attributes, our framework supports attribute-controllable anonymization. We demonstrate that our method achieves a state-of-the-art balance between identity removal, attribute preservation, and image quality. Source code and data are available at https://github.com/hanweikung/reverse-personalization .
[55] With Great Context Comes Great Prediction Power: Classifying Objects via Geo-Semantic Scene Graphs
Ciprian Constantinescu, Marius Leordeanu
🧩 TL;DR
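This paper proposes a context-aware object classification framework built on a Geo-Semantic Contextual Graph (GSCG). It combines depth estimation with panoptic segmentation to build a structured scene representation, which substantially improves object recognition and makes the reasoning process interpretable.
📘 Detailed Summary
Motivation: Existing object recognition systems usually operate on isolated image regions and ignore the contextual information humans rely on, such as spatial relationships, material properties, and object co-occurrence. This limits performance and interpretability.
Method: The method first builds a GSCG from a single monocular image by integrating a metric depth estimator with a unified panoptic and material segmentation model. Objects are encoded as nodes with geometric, chromatic, and material attributes, and spatial relationships are encoded as edges. A specialized graph-based classifier then aggregates features from the target object, its immediate neighbors, and the global scene to predict the class.
Result: On COCO 2017, the context-aware model reaches 73.4% classification accuracy, far above context-agnostic versions (as low as 38.4%). It also clearly surpasses fine-tuned ResNet models (at most 53.5%) and the state-of-the-art multimodal LLM Llama 4 Scout (at most 42.3%).
Conclusion: The study shows the advantage of explicit, structured context representations for object recognition. The GSCG framework provides both strong context-aware classification and an interpretable reasoning process.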
📄 Abstract
Humans effortlessly identify objects by leveraging a rich understanding of the surrounding scene, including spatial relationships, material properties, and the co-occurrence of other objects. In contrast, most computational object recognition systems operate on isolated image regions, devoid of meaning in isolation, thus ignoring this vital contextual information. This paper argues for the critical role of context and introduces a novel framework for contextual object classification. We first construct a Geo-Semantic Contextual Graph (GSCG) from a single monocular image. This rich, structured representation is built by integrating a metric depth estimator with a unified panoptic and material segmentation model. The GSCG encodes objects as nodes with detailed geometric, chromatic, and material attributes, and their spatial relationships as edges. This explicit graph structure makes the model's reasoning process inherently interpretable. We then propose a specialized graph-based classifier that aggregates features from a target object, its immediate neighbors, and the global scene context to predict its class. Through extensive ablation studies, we demonstrate that our context-aware model achieves a classification accuracy of 73.4%, dramatically outperforming context-agnostic versions (as low as 38.4%). Furthermore, our GSCG-based approach significantly surpasses strong baselines, including fine-tuned ResNet models (max 53.5%) and a state-of-the-art multimodal Large Language Model (LLM), Llama 4 Scout, which, even when given the full image alongside a detailed description of objects, maxes out at 42.3%. These results on COCO 2017 train/val splits highlight the superiority of explicitly structured and interpretable context for object recognition tasks.
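The aggregation step described in the abstract above (target object, immediate neighbors, global scene) can be sketched as a simple feature concatenation over a graph. This is a minimal illustration under stated assumptions: the `SceneGraph` class, the mean pooling, and the helper names are hypothetical and do not reflect the paper's actual classifier, which would feed such a vector into a learned model.

```python
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    """Toy stand-in for a GSCG: per-object features plus adjacency lists."""
    node_feats: dict                            # node id -> list of floats
    edges: dict = field(default_factory=dict)   # node id -> list of neighbor ids

def mean(vectors, dim):
    """Element-wise mean of equal-length vectors; zeros if there are none."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]

def context_vector(graph, target):
    """Concatenate target features, mean neighbor features, and mean scene features."""
    own = graph.node_feats[target]
    dim = len(own)
    neighbors = [graph.node_feats[n] for n in graph.edges.get(target, [])]
    scene = list(graph.node_feats.values())
    return own + mean(neighbors, dim) + mean(scene, dim)
```

Dropping the second and third blocks of the vector gives a context-agnostic input, which mirrors the kind of ablation the abstract reports.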
[56] Toward Stable Semi-Supervised Remote Sensing Segmentation via Co-Guidance and Co-Fusion
Yi Zhou, Xuechao Zou, Shun Zhang, Kai Li, Shiying Wang, Jingming Chen, Congyan Lang, Tengfei Cao, Pin Tao, Yuanchun Shi
🧩 TL;DR
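This paper proposes Co2S, a stable semi-supervised framework for semantic segmentation of remote sensing images. By synergistically fusing priors from vision-language models and self-supervised models, it mitigates pseudo-label drift and achieves leading performance on multiple datasets.
📘 Detailed Summary
Motivation: Semi-supervised semantic segmentation of remote sensing images reduces the burden of exhaustive annotation, but it suffers from pseudo-label drift, where confirmation bias causes errors to accumulate during training. This limits its practical effectiveness and stability.
Method: The method builds a heterogeneous dual-student architecture from two distinct ViT-based vision foundation models, initialized with pretrained CLIP and DINOv3, to reduce error accumulation and pseudo-label drift. An explicit-implicit semantic co-guidance mechanism uses text embeddings and learnable queries to provide explicit and implicit class-level guidance, respectively, improving semantic consistency. A global-local feature collaborative fusion strategy combines the global context captured by CLIP with the local details produced by DINOv3.
Result: Extensive experiments on six popular datasets show that the method consistently achieves leading performance across partition protocols and diverse scenarios, outperforming existing semi-supervised segmentation methods.
Conclusion: Fusing priors from different vision foundation models can mitigate pseudo-label drift in semi-supervised remote sensing segmentation. The work suggests a route to more stable semi-supervised segmentation systems and shows the potential of heterogeneous model fusion for improving semantic consistency.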
📄 Abstract
Semi-supervised remote sensing (RS) image semantic segmentation offers a promising solution to alleviate the burden of exhaustive annotation, yet it fundamentally struggles with pseudo-label drift, a phenomenon where confirmation bias leads to the accumulation of errors during training. In this work, we propose Co2S, a stable semi-supervised RS segmentation framework that synergistically fuses priors from vision-language models and self-supervised models. Specifically, we construct a heterogeneous dual-student architecture comprising two distinct ViT-based vision foundation models initialized with pretrained CLIP and DINOv3 to mitigate error accumulation and pseudo-label drift. To effectively incorporate these distinct priors, an explicit-implicit semantic co-guidance mechanism is introduced that utilizes text embeddings and learnable queries to provide explicit and implicit class-level guidance, respectively, thereby jointly enhancing semantic consistency. Furthermore, a global-local feature collaborative fusion strategy is developed to effectively fuse the global contextual information captured by CLIP with the local details produced by DINOv3, enabling the model to generate highly precise segmentation results. Extensive experiments on six popular datasets demonstrate the superiority of the proposed method, which consistently achieves leading performance across various partition protocols and diverse scenarios. Project page is available at https://xavierjiezou.github.io/Co2S/.
[57] Video-BrowseComp: Benchmarking Agentic Video Research on Open Web
Zhengyang Liang, Yan Shu, Xiangrui Liu, Minghao Qin, Kaixin Liang, Paolo Rota, Nicu Sebe, Zheng Liu, Lizi Liao
🧩 TL;DR
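This paper introduces Video-BrowseComp, the first benchmark for video research on the open web. It aims to move agents from passive video perception toward proactive video reasoning and verification.
📘 Detailed Summary
Motivation: Autonomous agents are redefining information seeking, shifting from passive retrieval to proactive open-web research. Existing video benchmarks mainly test passive perception and do not evaluate agentic video research: actively exploring video timelines, cross-referencing dispersed evidence, and verifying claims against the open web.
Method: The authors propose Video-BrowseComp, a benchmark of 210 questions designed for open-web agentic video reasoning. It enforces a mandatory dependency on temporal visual evidence, so answers cannot be obtained through text search alone and require navigating video timelines to verify external claims.
Result: Evaluation of state-of-the-art models reveals a critical bottleneck: even advanced search-augmented models such as GPT-5.1 (w/ Search) reach only 15.24% accuracy. Analysis shows that these models rely largely on textual proxies. They do well in metadata-rich domains but collapse in metadata-sparse, dynamic environments where visual grounding is essential.
Conclusion: As the first open-web video research benchmark, Video-BrowseComp pushes the field from passive perception toward proactive video reasoning. It exposes serious limitations of current models in dynamic video settings that require visually grounded verification and provides an evaluation framework for future agentic video research.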
📄 Abstract
The evolution of autonomous agents is redefining information seeking, transitioning from passive retrieval to proactive, open-ended web research. However, while textual and static multimodal agents have seen rapid progress, a significant modality gap remains in processing the web's most dynamic modality: video. Existing video benchmarks predominantly focus on passive perception, feeding curated clips to models without requiring external retrieval. They fail to evaluate agentic video research, which necessitates actively interrogating video timelines, cross-referencing dispersed evidence, and verifying claims against the open web. To bridge this gap, we present Video-BrowseComp, a challenging benchmark comprising 210 questions tailored for open-web agentic video reasoning. Unlike prior benchmarks, Video-BrowseComp enforces a mandatory dependency on temporal visual evidence, ensuring that answers cannot be derived solely through text search but require navigating video timelines to verify external claims. Our evaluation of state-of-the-art models reveals a critical bottleneck: even advanced search-augmented models like GPT-5.1 (w/ Search) achieve only 15.24% accuracy. Our analysis reveals that these models largely rely on textual proxies, excelling in metadata-rich domains (e.g., TV shows with plot summaries) but collapsing in metadata-sparse, dynamic environments (e.g., sports, gameplay) where visual grounding is essential. As the first open-web video research benchmark, Video-BrowseComp advances the field beyond passive perception toward proactive video reasoning.
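To put the headline number in perspective, the short sketch below converts the reported accuracy into an approximate count of correct answers. It assumes, which the abstract does not state explicitly, that accuracy is the fraction of the 210 questions answered correctly; the function names are illustrative.

```python
def correct_count(accuracy_pct, n_questions):
    """Nearest whole number of correct answers implied by a percentage accuracy."""
    return round(accuracy_pct / 100 * n_questions)

def accuracy_pct(n_correct, n_questions):
    """Accuracy as a percentage, rounded to two decimals."""
    return round(n_correct / n_questions * 100, 2)

# Under the assumption above, 15.24% of 210 questions is about 32 correct answers.
implied = correct_count(15.24, 210)
```

Under that assumption, the best reported model answers roughly one question in seven.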
[58] Domain-Shift Immunity in Deep Deformable Registration via Local Feature Representations
Mingzhen Shao, Sarang Joshi
🧩 TL;DR
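This paper shows that learning-based deformable image registration models have an inherent immunity to domain shift. It introduces UniReg, a universal registration framework that decouples feature extraction from deformation estimation, to verify that local feature consistency is the key driver of robustness.
📘 Detailed Summary
Motivation: Deep learning has surpassed traditional optimization-based methods in deformable image registration, but learned models are widely believed to be sensitive to domain shift. Robustness is usually pursued through large, diverse training sets, without an explanation of the underlying mechanism. This study asks whether learning-based deformable registration models are inherently immune to domain shift and why.
Method: UniReg decouples feature extraction from deformation estimation. It uses fixed, pretrained feature extractors to obtain local features and a UNet-based deformation network to estimate the deformation, which isolates the role of local feature representations in cross-domain robustness. The study also analyzes why conventional CNN models fail under modality shift and traces the source of dataset-induced bias to early convolutional layers.
Result: Although trained on a single dataset, UniReg shows robust cross-domain and multi-modal performance comparable to optimization-based methods. The analysis shows that failures of conventional CNN models under modality shift originate from dataset-induced biases in early convolutional layers, while models based on local feature representations are inherently robust to domain shift.
Conclusion: Domain-shift immunity is an inherent property of deep deformable registration models and arises from their reliance on local feature representations rather than global appearance. The finding motivates backbone designs that preserve domain-invariant local features and challenges the view that learned models need large, diverse training data to be robust.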
📄 Abstract
Deep learning has advanced deformable image registration, surpassing traditional optimization-based methods in both accuracy and efficiency. However, learning-based models are widely believed to be sensitive to domain shift, with robustness typically pursued through large and diverse training datasets, without explaining the underlying mechanisms. In this work, we show that domain-shift immunity is an inherent property of deep deformable registration models, arising from their reliance on local feature representations rather than global appearance for deformation estimation. To isolate and validate this mechanism, we introduce UniReg, a universal registration framework that decouples feature extraction from deformation estimation using fixed, pre-trained feature extractors and a UNet-based deformation network. Despite training on a single dataset, UniReg exhibits robust cross-domain and multi-modal performance comparable to optimization-based methods. Our analysis further reveals that failures of conventional CNN-based models under modality shift originate from dataset-induced biases in early convolutional layers. These findings identify local feature consistency as the key driver of robustness in learning-based deformable registration and motivate backbone designs that preserve domain-invariant local features.
[59] REVEALER: Reinforcement-Guided Visual Reasoning for Element-Level Text-Image Alignment Evaluation
Fulin Shi, Wenyi Xiao, Bin Chen, Liang Din, Leilei Gan
🧩 TL;DR
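This paper presents REVEALER, a unified framework based on reinforcement-guided visual reasoning for fine-grained, element-level alignment evaluation in text-to-image generation. Its structured paradigm yields interpretable evaluations and outperforms existing methods.
📘 Detailed Summary
Motivation: Existing evaluation methods for text-to-image models rely mainly on coarse-grained metrics or static QA pipelines. They lack fine-grained interpretability and struggle to reflect human preferences, so a framework for element-level alignment evaluation is needed.
Method: REVEALER adopts a structured "grounding-reasoning-conclusion" paradigm that lets a multimodal LLM explicitly localize semantic elements and derive interpretable alignment judgments. The model is optimized with Group Relative Policy Optimization (GRPO) using a composite reward that covers structural format, grounding accuracy, and alignment fidelity.
Result: Extensive experiments on four benchmarks (EvalMuse-40K, RichHF, MHaluBench, and GenAI-Bench) show state-of-the-art performance. REVEALER consistently outperforms strong proprietary models and supervised baselines and is more efficient at inference than existing iterative visual reasoning methods.
Conclusion: The work provides a unified, interpretable framework for text-to-image alignment evaluation. Reinforcement-guided visual reasoning achieves fine-grained evaluation while keeping inference efficient.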
📄 Abstract
Evaluating the alignment between textual prompts and generated images is critical for ensuring the reliability and usability of text-to-image (T2I) models. However, most existing evaluation methods rely on coarse-grained metrics or static QA pipelines, which lack fine-grained interpretability and struggle to reflect human preferences. To address this, we propose REVEALER, a unified framework for element-level alignment evaluation based on reinforcement-guided visual reasoning. Adopting a structured "grounding-reasoning-conclusion" paradigm, our method enables Multimodal Large Language Models (MLLMs) to explicitly localize semantic elements and derive interpretable alignment judgments. We optimize the model via Group Relative Policy Optimization (GRPO) using a composite reward function that incorporates structural format, grounding accuracy, and alignment fidelity. Extensive experiments across four benchmarks (EvalMuse-40K, RichHF, MHaluBench, and GenAI-Bench) demonstrate that REVEALER achieves state-of-the-art performance. Our approach consistently outperforms both strong proprietary models and supervised baselines while demonstrating superior inference efficiency compared to existing iterative visual reasoning methods.
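The GRPO training signal described above can be sketched in two small steps: score each sampled response with a composite reward, then normalize rewards within the group of responses sampled for the same prompt. The sketch is illustrative only. The weights, the use of an IoU-style grounding score, and the function names are assumptions; the abstract names the three reward components but not how they are combined.

```python
import statistics

def composite_reward(format_ok, grounding_score, alignment_correct,
                     weights=(0.2, 0.3, 0.5)):
    """Weighted sum of the three reward components (weights are hypothetical).

    format_ok: whether the output follows the required structure
    grounding_score: localization quality in [0, 1], e.g. an IoU
    alignment_correct: whether the alignment judgment matches the label
    """
    w_format, w_ground, w_align = weights
    return (w_format * float(format_ok)
            + w_ground * grounding_score
            + w_align * float(alignment_correct))

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of responses to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Responses that score above their group's mean receive positive advantages and are reinforced, which removes the need for a separate learned value function.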
[60] GaussianDWM: 3D Gaussian Driving World Model for Unified Scene Understanding and Multi-Modal Generation
Tianchen Deng, Xuefeng Chen, Yi Chen, Qu Chen, Yuyao Xu, Lijin Yang, Le Xu, Yu Zhang, Bo Zhang, Wuxiong Huang, Hesheng Wang
🧩 TL;DR
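This paper proposes a unified driving world model framework based on a 3D Gaussian scene representation. By aligning text directly with the 3D scene, it supports both 3D scene understanding and multi-modal scene generation and achieves state-of-the-art performance on the nuScenes and NuInteract datasets.
📘 Detailed Summary
Motivation: Existing driving world models lack 3D scene understanding: they can only generate content conditioned on input data and cannot interpret or reason about the driving environment. Methods that represent 3D spatial information with point clouds or BEV features also fail to align text accurately with the underlying 3D scene.
Method: The framework embeds rich linguistic features into each Gaussian primitive to achieve early modality alignment. A task-aware, language-guided sampling strategy removes redundant 3D Gaussians and injects accurate, compact 3D tokens into the LLM. A dual-condition multi-modal generation model combines a high-level language condition with a low-level image condition to jointly guide generation.
Result: Comprehensive studies on the nuScenes and NuInteract datasets validate the framework, which achieves state-of-the-art performance. The code will be released publicly on GitHub.
Conclusion: By aligning text directly with the 3D scene through a 3D Gaussian representation, the work gives driving world models a unified framework with both understanding and generation capabilities. Early modality alignment and dual-condition generation offer a new technical route for multi-modal autonomous driving systems.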
📄 Abstract
Driving World Models (DWMs) have been developing rapidly with the advances of generative models. However, existing DWMs lack 3D scene understanding capabilities and can only generate content conditioned on input data, without the ability to interpret or reason about the driving environment. Moreover, current approaches that represent 3D spatial information with point cloud or BEV features do not accurately align textual information with the underlying 3D scene. To address these limitations, we propose a novel unified DWM framework based on 3D Gaussian scene representation, which enables both 3D scene understanding and multi-modal scene generation, while also enabling contextual enrichment for understanding and generation tasks. Our approach directly aligns textual information with the 3D scene by embedding rich linguistic features into each Gaussian primitive, thereby achieving early modality alignment. In addition, we design a novel task-aware language-guided sampling strategy that removes redundant 3D Gaussians and injects accurate and compact 3D tokens into the LLM. Furthermore, we design a dual-condition multi-modal generation model, where the information captured by our vision-language model is leveraged as a high-level language condition in combination with a low-level image condition, jointly guiding the multi-modal generation process. We conduct comprehensive studies on the nuScenes and NuInteract datasets to validate the effectiveness of our framework. Our method achieves state-of-the-art performance. We will release the code publicly on GitHub: https://github.com/dtc111111/GaussianDWM.
[61] MM-UAVBench: How Well Do Multimodal Large Language Models See, Think, and Plan in Low-Altitude UAV Scenarios?
Shiqi Dai, Zizhi Ma, Zhicong Luo, Xuesong Yang, Yibin Huang, Wanyue Zhang, Chi Chen, Zonghao Guo, Wang Xu, Yufei Sun, Maosong Sun
🧩 TL;DR
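This paper introduces MM-UAVBench, a comprehensive benchmark for multimodal large language models in low-altitude UAV scenarios. It systematically evaluates MLLMs along three core dimensions (perception, cognition, and planning), filling a gap left by existing benchmarks.
📘 Detailed Summary
Motivation: The potential of multimodal LLMs in UAV-dominated low-altitude applications remains largely unexplored. Existing MLLM benchmarks rarely cover the unique challenges of low-altitude scenes, and UAV-related evaluations focus on specific tasks such as localization or navigation without a unified assessment of general intelligence.
Method: MM-UAVBench evaluates MLLMs along three capability dimensions: perception, cognition, and planning. It contains 19 sub-tasks and over 5.7K manually annotated questions, all derived from real-world UAV data collected from public datasets.
Result: Extensive experiments on 16 open-source and proprietary MLLMs show that current models struggle with the complex visual and cognitive demands of low-altitude scenes. The analysis also identifies key bottlenecks, such as spatial bias and multi-view understanding, that hinder effective deployment in UAV scenarios.
Conclusion: The study exposes the limitations of current MLLMs in low-altitude UAV applications and identifies the technical bottlenecks that block practical deployment. The authors hope MM-UAVBench will foster research on robust and reliable MLLMs for real-world UAV intelligence.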
📄 Abstract
While Multimodal Large Language Models (MLLMs) have exhibited remarkable general intelligence across diverse domains, their potential in low-altitude applications dominated by Unmanned Aerial Vehicles (UAVs) remains largely underexplored. Existing MLLM benchmarks rarely cover the unique challenges of low-altitude scenarios, while UAV-related evaluations mainly focus on specific tasks such as localization or navigation, without a unified evaluation of MLLMs' general intelligence. To bridge this gap, we present MM-UAVBench, a comprehensive benchmark that systematically evaluates MLLMs across three core capability dimensions (perception, cognition, and planning) in low-altitude UAV scenarios. MM-UAVBench comprises 19 sub-tasks with over 5.7K manually annotated questions, all derived from real-world UAV data collected from public datasets. Extensive experiments on 16 open-source and proprietary MLLMs reveal that current models struggle to adapt to the complex visual and cognitive demands of low-altitude scenarios. Our analyses further uncover critical bottlenecks such as spatial bias and multi-view understanding that hinder the effective deployment of MLLMs in UAV scenarios. We hope MM-UAVBench will foster future research on robust and reliable MLLMs for real-world UAV intelligence.
[62] Multimodal Interpretation of Remote Sensing Images: Dynamic Resolution Input Strategy and Multi-scale Vision-Language Alignment Mechanism
Siyu Zhang, Ying Chen, Lianlei Shan, Runhe Qiu
🧩 TL;DR
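This study proposes a vision-language model framework that combines a dynamic resolution input strategy with a multi-scale vision-language alignment mechanism. It targets two weaknesses of multimodal fusion for remote sensing images: fixed resolutions fail to balance efficiency and detail, and single-scale alignment lacks semantic hierarchy. The framework significantly improves semantic understanding and computational efficiency.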
📘 Detailed Summary
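Motivation: Existing multimodal fusion methods for remote sensing images have two deficiencies: fixed resolutions cannot balance efficiency and detail, and single-scale alignment lacks semantic hierarchy. These limit the accuracy of surface information extraction and overall system performance, so a new technical framework is needed.
Method: The study proposes a vision-language model framework with two key innovations. The dynamic resolution input strategy uses a coarse-to-fine approach to allocate computational resources adaptively. The multi-scale vision-language alignment mechanism builds a three-tier alignment architecture covering the object, local-region, and global levels, systematically capturing cross-modal semantic consistency.
Result: Experiments on the RS-GPT4V dataset show that the framework significantly improves semantic understanding accuracy and computational efficiency in image captioning and cross-modal retrieval, outperforming conventional methods on metrics such as BLEU-4, CIDEr, and R@10.
Conclusion: The framework offers a new approach to building efficient and robust multimodal remote sensing systems, laying a theoretical foundation and providing technical guidance for the engineering application of intelligent remote sensing interpretation.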
📄 Abstract
Multimodal fusion of remote sensing images serves as a core technology for overcoming the limitations of single-source data and improving the accuracy of surface information extraction, which exhibits significant application value in fields such as environmental monitoring and urban planning. To address the deficiencies of existing methods, including the failure of fixed resolutions to balance efficiency and detail, as well as the lack of semantic hierarchy in single-scale alignment, this study proposes a Vision-language Model (VLM) framework integrated with two key innovations: the Dynamic Resolution Input Strategy (DRIS) and the Multi-scale Vision-language Alignment Mechanism (MS-VLAM). Specifically, the DRIS adopts a coarse-to-fine approach to adaptively allocate computational resources according to the complexity of image content, thereby preserving key fine-grained features while reducing redundant computational overhead. The MS-VLAM constructs a three-tier alignment mechanism covering object, local-region and global levels, which systematically captures cross-modal semantic consistency and alleviates issues of semantic misalignment and granularity imbalance. Experimental results on the RS-GPT4V dataset demonstrate that the proposed framework significantly improves the accuracy of semantic understanding and computational efficiency in tasks including image captioning and cross-modal retrieval. Compared with conventional methods, it achieves superior performance in evaluation metrics such as BLEU-4 and CIDEr for image captioning, as well as R@10 for cross-modal retrieval. This technical framework provides a novel approach for constructing efficient and robust multimodal remote sensing systems, laying a theoretical foundation and offering technical guidance for the engineering application of intelligent remote sensing interpretation.
[63] ASemConsist: Adaptive Semantic Feature Control for Training-Free Identity-Consistent Generation
Shin seong Kim, Minjung Shin, Hyunin Cho, Youngjung Uh
🧩 TL;DR
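This paper proposes the ASemconsist framework, which uses selective text embedding modification and an adaptive feature-sharing strategy to resolve the trade-off between character identity consistency and per-image prompt alignment when text-to-image diffusion models generate image sequences. It also introduces a unified evaluation protocol, CQS.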
📘 Detailed Summary
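Motivation: Text-to-image diffusion models face a core challenge when generating image sequences: keeping character identity consistent while ensuring each image aligns with its own text prompt. Existing methods struggle to balance the two goals, yielding either weak identity consistency or poor per-image prompt alignment.
Method: The ASemconsist framework has three key components. First, selective text embedding modification provides explicit semantic control. Second, based on an analysis of padding embeddings in FLUX, the authors repurpose padding embeddings as semantic containers. Third, an adaptive feature-sharing strategy automatically evaluates textual ambiguity and applies constraints only to ambiguous identity prompts.
Result: The framework achieves state-of-the-art performance on both identity consistency and per-image prompt alignment, overcoming the trade-off of prior methods. The authors also propose a unified evaluation protocol, the Consistency Quality Score (CQS), which integrates identity preservation and per-image text alignment into a single metric and explicitly captures performance imbalances between the two.
Conclusion: Through its semantic control strategy and adaptive constraint mechanism, the study offers an effective solution for sequence generation with text-to-image diffusion models. The CQS protocol gives future work a more comprehensive evaluation standard, and the framework's success suggests that fine-grained embedding manipulation and targeted constraint allocation can markedly improve the balance between identity consistency and prompt alignment.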
📄 Abstract
Recent text-to-image diffusion models have significantly improved visual quality and text alignment. However, generating a sequence of images while preserving consistent character identity across diverse scene descriptions remains a challenging task. Existing methods often struggle with a trade-off between maintaining identity consistency and ensuring per-image prompt alignment. In this paper, we introduce a novel framework, ASemconsist, that addresses this challenge through selective text embedding modification, enabling explicit semantic control over character identity without sacrificing prompt alignment. Furthermore, based on our analysis of padding embeddings in FLUX, we propose a semantic control strategy that repurposes padding embeddings as semantic containers. Additionally, we introduce an adaptive feature-sharing strategy that automatically evaluates textual ambiguity and applies constraints only to the ambiguous identity prompt. Finally, we propose a unified evaluation protocol, the Consistency Quality Score (CQS), which integrates identity preservation and per-image text alignment into a single comprehensive metric, explicitly capturing performance imbalances between the two metrics. Our framework achieves state-of-the-art performance, effectively overcoming prior trade-offs. Project page: https://minjung-s.github.io/asemconsist
[64] Multi-Track Multimodal Learning on iMiGUE: Micro-Gesture and Emotion Recognition
Arman Martirosyan, Shahane Tigranyan, Maria Razzhivina, Artak Aslanyan, Nazgul Salikhova, Ilya Makarov, Andrey Savchenko, Aram Avetisyan
🧩 TL;DR
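This paper presents two multimodal frameworks, one for micro-gesture recognition and one for behavior-based emotion prediction. It explores the complementary strengths of RGB video and 3D pose representations on the iMiGUE dataset and took 2nd place in the behavior-based emotion prediction task of the MiGA 2025 Challenge.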
📘 Detailed Summary
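Motivation: Micro-gesture recognition and behavior-based emotion prediction are both highly challenging tasks that require modeling subtle, fine-grained human behavior, primarily from video and skeletal pose data. The study addresses both problems on the iMiGUE dataset and explores how well multimodal representations capture subtle spatio-temporal patterns.
Method: For micro-gesture classification, MViTv2-S extracts video embeddings and 2s-AGCN extracts skeletal embeddings, which a Cross-Modal Token Fusion module integrates to combine spatial and pose information. For emotion recognition, SwinFace extracts facial embeddings and MViTv2-S extracts contextual embeddings, which an InterFusion module fuses to capture emotional expressions and body gestures.
Result: Experiments on the iMiGUE dataset show strong performance and accuracy in the behavior-based emotion prediction task. The method placed 2nd in that task of the MiGA 2025 Challenge, supporting the effectiveness of the multimodal fusion approach.
Conclusion: The study demonstrates the effectiveness of multimodal frameworks for fine-grained human behavior analysis, in particular fusing video, skeletal, facial, and contextual information to improve micro-gesture recognition and emotion prediction. Cross-modal fusion is a promising direction for modeling subtle human behavior and a basis for future fine-grained behavior analysis research.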
📄 Abstract
Micro-gesture recognition and behavior-based emotion prediction are both highly challenging tasks that require modeling subtle, fine-grained human behaviors, primarily leveraging video and skeletal pose data. In this work, we present two multimodal frameworks designed to tackle both problems on the iMiGUE dataset. For micro-gesture classification, we explore the complementary strengths of RGB and 3D pose-based representations to capture nuanced spatio-temporal patterns. To comprehensively represent gestures, video and skeletal embeddings are extracted using MViTv2-S and 2s-AGCN, respectively. Then, they are integrated through a Cross-Modal Token Fusion module to combine spatial and pose information. For emotion recognition, our framework extends to behavior-based emotion prediction, a binary classification task identifying emotional states based on visual cues. We leverage facial and contextual embeddings extracted using SwinFace and MViTv2-S models and fuse them through an InterFusion module designed to capture emotional expressions and body gestures. Experiments conducted on the iMiGUE dataset, within the scope of the MiGA 2025 Challenge, demonstrate the robust performance and accuracy of our method in the behavior-based emotion prediction task, where our approach secured 2nd place.
[65] Visual Language Hypothesis
Xiu Li
🧩 TL;DR
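The study analyzes visual representation learning from a structural and topological perspective. It hypothesizes that visual understanding requires a semantic language and derives that the visual observation space must have a fiber-bundle structure in which semantics correspond to the quotient base space, and that semantic invariance requires a non-homeomorphic, discriminative target.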
📘 Detailed Summary
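Motivation: The study aims to understand visual representation learning from a structural and topological perspective, starting from one core hypothesis: visual understanding requires a semantic language in which many perceptual observations correspond to a small number of discrete semantic states. Combined with widely assumed premises on transferability and abstraction in representation learning, it examines how the visual observation space is organized, and in particular how semantic invariance arises from topological structure.
Method: The study analyzes visual representation learning in a topological and geometric framework and proposes that the visual observation space has a fiber-bundle structure, in which nuisance variation populates the fibers and semantics correspond to the quotient base space $X/G$. Two theoretical results follow. First, the semantic quotient is not a submanifold of $X$ and cannot be obtained through smooth deformation. Second, approximating the quotient places structural demands on the model architecture: it needs a representation mechanism that supports topology change, namely an expand-and-snap process that first expands the manifold geometrically to separate structure and then collapses it into discrete semantic regions.
Result: The derivation shows that the semantic quotient $X/G$ is not a submanifold of $X$ and cannot be obtained through smooth deformation, so semantic invariance requires a non-homeomorphic, discriminative target, such as the explicit semantic equivalence supplied by supervised labels, cross-instance identification, or multimodal alignment. Approximating the quotient also constrains the architecture: semantic abstraction needs not only an external semantic target but also a representation mechanism that supports topology change, that is, the expand-and-snap process.
Conclusion: The framework provides a topological lens consistent with empirical regularities observed in large-scale discriminative and multimodal models and with classical principles of statistical learning theory. The results are interpretive rather than prescriptive. They offer a theoretical framework for the structural properties of visual representation learning and stress that semantic invariance requires discriminative targets and a capacity for topology change, which has implications for model design and training strategy.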
📄 Abstract
We study visual representation learning from a structural and topological perspective. We begin from a single hypothesis: that visual understanding presupposes a semantic language for vision, in which many perceptual observations correspond to a small number of discrete semantic states. Together with widely assumed premises on transferability and abstraction in representation learning, this hypothesis implies that the visual observation space must be organized in a fiber-bundle-like structure, where nuisance variation populates fibers and semantics correspond to a quotient base space. From this structure we derive two theoretical consequences. First, the semantic quotient $X/G$ is not a submanifold of $X$ and cannot be obtained through smooth deformation alone; semantic invariance requires a non-homeomorphic, discriminative target, for example, supervision via labels, cross-instance identification, or multimodal alignment that supplies explicit semantic equivalence. Second, we show that approximating the quotient also places structural demands on the model architecture. Semantic abstraction requires not only an external semantic target, but a representation mechanism capable of supporting topology change: an expand-and-snap process in which the manifold is first geometrically expanded to separate structure and then collapsed to form discrete semantic regions. We emphasize that these results are interpretive rather than prescriptive: the framework provides a topological lens that aligns with empirical regularities observed in large-scale discriminative and multimodal models, and with classical principles in statistical learning theory.
[66] CountGD++: Generalized Prompting for Open-World Counting
Niki Amini-Naieni, Andrew Zisserman
🧩 TL;DR
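This paper presents CountGD++, an enhanced multimodal open-world counting method that broadens prompt flexibility in how the target object is specified. It adds the ability to say what not to count, automates exemplar annotation through pseudo-exemplars, and accepts visual examples from natural and synthetic external images, significantly improving counting accuracy and generalization.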
📘 Detailed Summary
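Motivation: Existing automatic counting methods are limited in how the target object can be specified: visual examples must be manually annotated inside the image, there is no way to specify what not to count, and prompting is inflexible. These limits reduce the flexibility, accuracy, and practical reach of counting methods.
Method: The study introduces several new capabilities. It extends the prompt so that objects not to be counted can be described with text and/or visual examples; it proposes "pseudo-exemplars", which automate the annotation of visual examples at inference; it extends counting models to accept visual examples from natural and synthetic external images; and it uses CountGD++ as a vision expert agent for an LLM.
Result: CountGD++ achieves significant improvements in accuracy, efficiency, and generalization across multiple datasets. The expanded prompt flexibility advances multimodal open-world counting in both accuracy and adaptability. The code is open-sourced on GitHub.
Conclusion: By strengthening how the target object can be specified, the study significantly extends the capabilities of multimodal open-world counting. The method addresses the limitations of existing techniques and offers a new paradigm for integrating visual counting with large language models, advancing open-world visual understanding.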
📄 Abstract
The flexibility and accuracy of methods for automatically counting objects in images and videos are limited by the way the object can be specified. While existing methods allow users to describe the target object with text and visual examples, the visual examples must be manually annotated inside the image, and there is no way to specify what not to count. To address these gaps, we introduce novel capabilities that expand how the target object can be specified. Specifically, we extend the prompt to enable what not to count to be described with text and/or visual examples, introduce the concept of 'pseudo-exemplars' that automate the annotation of visual examples at inference, and extend counting models to accept visual examples from both natural and synthetic external images. We also use our new counting model, CountGD++, as a vision expert agent for an LLM. Together, these contributions expand the prompt flexibility of multi-modal open-world counting and lead to significant improvements in accuracy, efficiency, and generalization across multiple datasets. Code is available at https://github.com/niki-amini-naieni/CountGDPlusPlus.
[67] SpatialMosaic: A Multiview VLM Dataset for Partial Visibility
Kanghee Lee, Injae Lee, Minseok Kwak, Kwonyoung Ryu, Jungi Hong, Jaesik Park
🧩 TL;DR
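This paper proposes a scalable data generation and annotation pipeline for multi-view spatial reasoning and uses it to build SpatialMosaic, an instruction-tuning dataset of 2M QA pairs. It also develops SpatialMosaicVLM, a hybrid framework that integrates 3D reconstruction models as geometry encoders within vision-language models to strengthen spatial reasoning in challenging real-world scenes.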
📘 Detailed Summary
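Motivation: Existing multimodal large language models usually rely on pre-constructed 3D representations or off-the-shelf reconstruction pipelines for 3D scene understanding, which limits scalability and real-world applicability. Recent work explores learning spatial reasoning directly from multi-view images, but conditions common in real environments, such as partial visibility, occlusion, and low overlap, which require spatial reasoning from fragmented visual cues, remain under-explored.
Method: The paper proposes a scalable multi-view data generation and annotation pipeline that constructs realistic spatial reasoning QA pairs, producing the SpatialMosaic instruction-tuning dataset with 2M QA pairs. It also introduces SpatialMosaic-Bench, a benchmark of 1M QA pairs across 6 tasks for evaluating multi-view spatial reasoning under realistic, challenging scenarios. In addition, it presents SpatialMosaicVLM, a hybrid framework that integrates 3D reconstruction models as geometry encoders within vision-language models for robust spatial reasoning.
Result: Extensive experiments show that the proposed dataset and VQA tasks effectively improve spatial reasoning under challenging multi-view conditions, validating the data generation pipeline's ability to construct realistic and diverse QA pairs. SpatialMosaic-Bench covers 6 distinct tasks and provides a comprehensive framework for evaluating multi-view spatial reasoning.
Conclusion: Through a scalable data generation pipeline and a hybrid model architecture, the study offers an effective way to strengthen the spatial reasoning of multimodal large language models under challenging real-world conditions. The dataset and benchmark are resources for future research, and integrating 3D reconstruction models into vision-language models shows the potential for robust spatial reasoning without explicit 3D reconstruction.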
📄 Abstract
The rapid progress of Multimodal Large Language Models (MLLMs) has unlocked the potential for enhanced 3D scene understanding and spatial reasoning. However, existing approaches often rely on pre-constructed 3D representations or off-the-shelf reconstruction pipelines, which constrain scalability and real-world applicability. A recent line of work explores learning spatial reasoning directly from multi-view images, enabling Vision-Language Models (VLMs) to understand 3D scenes without explicit 3D reconstructions. Nevertheless, key challenges that frequently arise in real-world environments, such as partial visibility, occlusion, and low-overlap conditions that require spatial reasoning from fragmented visual cues, remain under-explored. To address these limitations, we propose a scalable multi-view data generation and annotation pipeline that constructs realistic spatial reasoning QAs, resulting in SpatialMosaic, a comprehensive instruction-tuning dataset featuring 2M QA pairs. We further introduce SpatialMosaic-Bench, a challenging benchmark for evaluating multi-view spatial reasoning under realistic and challenging scenarios, consisting of 1M QA pairs across 6 tasks. In addition, we present SpatialMosaicVLM, a hybrid framework that integrates 3D reconstruction models as geometry encoders within VLMs for robust spatial reasoning. Extensive experiments demonstrate that our proposed dataset and VQA tasks effectively enhance spatial reasoning under challenging multi-view conditions, validating the effectiveness of our data generation pipeline in constructing realistic and diverse QA pairs. Code and dataset will be available soon.
[68] Bridging Cognitive Gap: Hierarchical Description Learning for Artistic Image Aesthetics Assessment
Henglin Liu, Nisha Huang, Chang Liu, Jiangpeng Yan, Huijuan Huang, Jixuan Ying, Tong-Yee Lee, Pengfei Wan, Xiangyang Ji
🧩 TL;DR
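The study proposes the RAD dataset and the ArtQuant framework to address data scarcity and model fragmentation in aesthetic quality assessment. With a large-scale, multi-dimensional dataset generated iteratively and an assessment framework that uses an LLM decoder, it achieves state-of-the-art performance while significantly reducing training cost.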
📘 Detailed Summary
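Motivation: The study targets two key challenges in aesthetic quality assessment: data scarcity and imbalance, and model fragmentation. Because manual annotation is expensive, existing datasets focus too heavily on visual perception and neglect deeper cognitive and emotional dimensions. Meanwhile, existing visual networks isolate aesthetic attributes with multi-branch encoders, and contrastive-learning-based multimodal methods struggle to process long-form textual descriptions.
Method: The study makes two core contributions. First, an iterative pipeline generates the Refined Aesthetic Description (RAD) dataset, a large-scale (70k), multi-dimensional structured dataset that avoids heavy annotation costs and is easy to scale. Second, the ArtQuant aesthetics assessment framework couples otherwise isolated aesthetic dimensions through joint description generation and models long-text semantics better with the help of LLM decoders. Theoretical analysis confirms the symbiosis: RAD's semantic adequacy and the generation paradigm jointly minimize prediction entropy.
Result: The approach achieves state-of-the-art performance on several datasets while requiring only 33% of conventional training epochs, narrowing the cognitive gap between artistic images and aesthetic judgment. The theoretical analysis gives the framework a mathematical grounding by confirming that data semantic adequacy and the model's generation paradigm jointly minimize prediction entropy.
Conclusion: With the RAD dataset and the ArtQuant framework, the study addresses data scarcity and model fragmentation in aesthetics assessment and lays groundwork for a human-aligned quantitative evaluation system for AIGC. It shows a symbiosis between the data generation paradigm and the model architecture and provides a scalable dataset and an efficient assessment framework for future research.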
📄 Abstract
The aesthetic quality assessment task is crucial for developing a human-aligned quantitative evaluation system for AIGC. However, its inherently complex nature, spanning visual perception, cognition, and emotion, poses fundamental challenges. Although aesthetic descriptions offer a viable representation of this complexity, two critical challenges persist: (1) data scarcity and imbalance: existing datasets focus too heavily on visual perception and neglect deeper dimensions because manual annotation is expensive; and (2) model fragmentation: current visual networks isolate aesthetic attributes with multi-branch encoders, while multimodal methods represented by contrastive learning struggle to effectively process long-form textual descriptions. To resolve challenge (1), we first present the Refined Aesthetic Description (RAD) dataset, a large-scale (70k), multi-dimensional structured dataset that is generated via an iterative pipeline without heavy annotation costs and is easy to scale. To address challenge (2), we propose ArtQuant, an aesthetics assessment framework for artistic images which not only couples isolated aesthetic dimensions through joint description generation, but also better models long-text semantics with the help of LLM decoders. In addition, theoretical analysis confirms this symbiosis: RAD's semantic adequacy (data) and generation paradigm (model) collectively minimize prediction entropy, providing mathematical grounding for the framework. Our approach achieves state-of-the-art performance on several datasets while requiring only 33% of conventional training epochs, narrowing the cognitive gap between artistic images and aesthetic judgment. We will release both code and dataset to support future research.
[69] Direct Diffusion Score Preference Optimization via Stepwise Contrastive Policy-Pair Supervision
Dohyun Kim, Seungwoo Lyu, Seung Wook Kim, Paul Hongsuck Seo
🧩 TL;DR
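This paper proposes Direct Diffusion Score Preference Optimization (DDSPO), a preference alignment method for diffusion models that needs no human annotation. It applies score-level supervision from winning and losing policies directly along the denoising trajectory to improve text-image alignment and visual quality.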
📘 Detailed Summary
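Motivation: Diffusion models perform well in text-to-image generation but often fail to fully align with nuanced user intent or to maintain consistent aesthetic quality. Existing preference-based training methods such as Diffusion Direct Preference Optimization rely on costly and potentially noisy human-labeled datasets, which limits their scalability and practicality.
Method: The paper proposes Direct Diffusion Score Preference Optimization (DDSPO), which derives per-timestep supervision directly from winning and losing policies when such policies are available. Unlike prior methods that operate only on final samples, DDSPO provides dense, transition-level supervision across the denoising trajectory. In practice, it avoids labeled data by generating preference signals automatically with a pretrained reference model, contrasting the model's outputs when conditioned on the original prompt versus semantically degraded variants.
Result: Experiments show that DDSPO improves text-image alignment and visual quality, outperforming or matching existing preference-based methods while requiring significantly less supervision. It achieves effective score-space preference supervision without explicit reward modeling or manual annotation.
Conclusion: DDSPO is a practical and efficient preference alignment framework for diffusion models. With dense supervision along the denoising trajectory and automatically generated preference signals, it removes the dependence on human annotation found in existing methods, opening a direction for fine-grained control and quality improvement of diffusion models while remaining scalable.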
📄 Abstract
Diffusion models have achieved impressive results in generative tasks such as text-to-image synthesis, yet they often struggle to fully align outputs with nuanced user intent and maintain consistent aesthetic quality. Existing preference-based training methods like Diffusion Direct Preference Optimization help address these issues but rely on costly and potentially noisy human-labeled datasets. In this work, we introduce Direct Diffusion Score Preference Optimization (DDSPO), which directly derives per-timestep supervision from winning and losing policies when such policies are available. Unlike prior methods that operate solely on final samples, DDSPO provides dense, transition-level signals across the denoising trajectory. In practice, we avoid reliance on labeled data by automatically generating preference signals using a pretrained reference model: we contrast its outputs when conditioned on original prompts versus semantically degraded variants. This practical strategy enables effective score-space preference supervision without explicit reward modeling or manual annotations. Empirical results demonstrate that DDSPO improves text-image alignment and visual quality, outperforming or matching existing preference-based methods while requiring significantly less supervision. Our implementation is available at: https://dohyun-as.github.io/DDSPO
[70] Towards Integrating Uncertainty for Domain-Agnostic Segmentation
Jesse Brouwers, Xiaoyan Xing, Alexander Timans
🧩 TL;DR
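This study proposes UncertSAM, a benchmark for evaluating the uncertainty quantification of segmentation foundation models under challenging conditions. It finds that a post-hoc Laplace approximation yields uncertainty that tracks segmentation error, suggesting a direction for improving model robustness.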
📘 Detailed Summary
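Motivation: Segmentation foundation models such as the SAM family perform strongly in zero-shot settings but remain fragile under domain shift or in limited-knowledge domains. The study investigates whether uncertainty quantification can mitigate these challenges and improve generalization in a domain-agnostic manner.
Method: The study curates UncertSAM, a benchmark of eight datasets designed to stress-test SAM under challenging segmentation conditions such as shadows, transparency, and camouflage. It also evaluates a suite of lightweight post-hoc uncertainty estimation methods and makes a preliminary assessment of an uncertainty-guided prediction refinement step.
Result: Among the evaluated methods, a last-layer Laplace approximation yields estimates that correlate well with segmentation errors, indicating a meaningful uncertainty signal. The gains from the refinement step are still preliminary, but the results support the potential value of uncertainty quantification in segmentation models.
Conclusion: The findings indicate that incorporating uncertainty into segmentation models can support robust, domain-agnostic performance. UncertSAM provides a standardized testbed for follow-up work, and uncertainty quantification is a promising route to more reliable foundation models in complex scenes.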
📄 Abstract
Foundation models for segmentation such as the Segment Anything Model (SAM) family exhibit strong zero-shot performance, but remain vulnerable in shifted or limited-knowledge domains. This work investigates whether uncertainty quantification can mitigate such challenges and enhance model generalisability in a domain-agnostic manner. To this end, we (1) curate UncertSAM, a benchmark comprising eight datasets designed to stress-test SAM under challenging segmentation conditions including shadows, transparency, and camouflage; (2) evaluate a suite of lightweight, post-hoc uncertainty estimation methods; and (3) assess a preliminary uncertainty-guided prediction refinement step. Among evaluated approaches, a last-layer Laplace approximation yields uncertainty estimates that correlate well with segmentation errors, indicating a meaningful signal. While refinement benefits are preliminary, our findings underscore the potential of incorporating uncertainty into segmentation models to support robust, domain-agnostic performance. Our benchmark and code are made publicly available.
[71] Automated river gauge plate reading using a hybrid object detection and generative AI framework in the Limpopo River Basin
Kayathri Vigneswaran, Hugo Retief, Jai Clifford Holmes, Mariangel Garcia Andarcia, Hansaka Tennakoon
🧩 TL;DR
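This study proposes a hybrid framework that combines vision-based waterline detection, YOLOv8 pose scale extraction, and large multimodal language models (GPT-4o and Gemini 2.0 Flash) to read river gauge plates automatically, offering a scalable solution for hydrological monitoring.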
📘 Detailed Summary
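Motivation: Accurate, continuous water level monitoring is essential for flood forecasting, water resource management, and ecological protection, but traditional hydrological observation is limited by manual measurement errors and environmental constraints. Automated solutions are needed to overcome these limits.
Method: The study proposes a hybrid framework with sequential stages of image preprocessing, annotation, waterline detection, scale gap estimation, and numeric reading extraction. It integrates vision-based waterline detection, YOLOv8 pose scale extraction, and multimodal large language models such as GPT-4o and Gemini 2.0 Flash.
Result: Waterline detection achieved a precision of 94.24% and an F1 score of 83.64%, and scale gap detection provided accurate geometric calibration for subsequent reading extraction. Adding scale gap metadata substantially improved the LLMs' predictive performance: Gemini Stage 2 reached a mean absolute error of 5.43 cm, a root mean square error of 8.58 cm, and an R-squared of 0.84 under optimal image conditions.
Conclusion: The results show that LLMs are sensitive to image quality, with degraded images producing higher errors, and underline the importance of combining geometric metadata with multimodal AI for robust water level estimation. The approach offers a scalable, efficient, and reliable solution for real-time river gauge digitization and improved water resource management.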
📄 Abstract
Accurate and continuous monitoring of river water levels is essential for flood forecasting, water resource management, and ecological protection. Traditional hydrological observation methods are often limited by manual measurement errors and environmental constraints. This study presents a hybrid framework integrating vision-based waterline detection, YOLOv8 pose scale extraction, and large multimodal language models (GPT-4o and Gemini 2.0 Flash) for automated river gauge plate reading. The methodology involves sequential stages of image preprocessing, annotation, waterline detection, scale gap estimation, and numeric reading extraction. Experiments demonstrate that waterline detection achieved a high precision of 94.24% and an F1 score of 83.64%, while scale gap detection provided accurate geometric calibration for subsequent reading extraction. Incorporating scale gap metadata substantially improved the predictive performance of LLMs, with Gemini Stage 2 achieving the highest accuracy, with a mean absolute error of 5.43 cm, root mean square error of 8.58 cm, and R-squared of 0.84 under optimal image conditions. Results highlight the sensitivity of LLMs to image quality, with degraded images producing higher errors, and underscore the importance of combining geometric metadata with multimodal artificial intelligence for robust water level estimation. Overall, the proposed approach offers a scalable, efficient, and reliable solution for automated hydrological monitoring, demonstrating potential for real-time river gauge digitization and improved water resource management.
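The three error metrics quoted in this abstract (mean absolute error, root mean square error, and R-squared) have standard definitions, sketched below in plain Python. The gauge readings are made-up illustrative values in centimetres, not data from the paper, and the function names are hypothetical.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error: average magnitude of the reading errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean square error: penalizes large errors more than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical gauge readings in cm (illustrative only).
observed = [100.0, 120.0, 140.0, 160.0]
predicted = [102.0, 118.0, 143.0, 157.0]

print(mae(observed, predicted))        # 2.5
print(rmse(observed, predicted))       # sqrt(6.5), about 2.55
print(r_squared(observed, predicted))  # 1 - 26/2000 = 0.987
```

RMSE is never smaller than MAE on the same data, which matches the ordering of the reported 8.58 cm and 5.43 cm.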
[72] TV-RAG: A Temporal-aware and Semantic Entropy-Weighted Framework for Long Video Retrieval and Understanding
Zongsheng Cao, Yangfan He, Anran Liu, Feng Chen, Zepeng Wang, Jun Xie
🧩 TL;DR
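This paper presents TV-RAG, a training-free architecture that couples temporal alignment with entropy-guided semantics to improve the long-video reasoning of large video language models, surpassing most leading baselines on several long-video benchmarks.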
📘 Detailed Summary
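Motivation: Large video language models face two main limitations on long videos. Their temporal windows are narrow, so they miss fine-grained semantic shifts that unfold over long durations. And mainstream text-based retrieval pipelines rely chiefly on surface-level lexical overlap, ignoring the rich temporal interdependence among visual, audio, and subtitle channels.
Method: TV-RAG has two core mechanisms. A time-decay retrieval module injects explicit temporal offsets into the similarity computation so that text queries are ranked according to their true multimedia context. An entropy-weighted key-frame sampler selects evenly spaced, information-dense frames, reducing redundancy while preserving representativeness. The framework can be attached to any large video language model without re-training or fine-tuning.
Result: TV-RAG consistently surpasses most leading baselines on established long-video benchmarks, including Video-MME, MLVU, and LongVideoBench. It offers a lightweight, budget-friendly upgrade path that improves existing models without re-training.
Conclusion: The study confirms that a dual-level reasoning routine combining temporal alignment with entropy-guided semantics addresses key challenges in long-video understanding. The framework gives existing large video language models a re-training-free enhancement and opens a new direction for long-video processing in multimedia AI research.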
📄 Abstract
Large Video Language Models (LVLMs) have rapidly emerged as the focus of multimedia AI research. Nonetheless, when confronted with lengthy videos, these models struggle: their temporal windows are narrow, and they fail to notice fine-grained semantic shifts that unfold over extended durations. Moreover, mainstream text-based retrieval pipelines, which rely chiefly on surface-level lexical overlap, ignore the rich temporal interdependence among visual, audio, and subtitle channels. To mitigate these limitations, we propose TV-RAG, a training-free architecture that couples temporal alignment with entropy-guided semantics to improve long-video reasoning. The framework contributes two main mechanisms: (i) a time-decay retrieval module that injects explicit temporal offsets into the similarity computation, thereby ranking text queries according to their true multimedia context; and (ii) an entropy-weighted key-frame sampler that selects evenly spaced, information-dense frames, reducing redundancy while preserving representativeness. By weaving these temporal and semantic signals together, TV-RAG realises a dual-level reasoning routine that can be grafted onto any LVLM without re-training or fine-tuning. The resulting system offers a lightweight, budget-friendly upgrade path and consistently surpasses most leading baselines across established long-video benchmarks such as Video-MME, MLVU, and LongVideoBench, confirming the effectiveness of our model. The code can be found at https://github.com/AI-Researcher-Team/TV-RAG.
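To make the idea of injecting a temporal offset into a similarity score concrete, here is a minimal sketch. It is not the TV-RAG implementation: the abstract does not give the formula, so the exponential decay, the `decay_rate` parameter, the notion of a query anchor time, and all function names are assumptions chosen for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def time_decay_score(query_vec, chunk_vec, query_time, chunk_time, decay_rate=0.01):
    """Semantic similarity damped by temporal distance (assumed exponential form)."""
    offset = abs(chunk_time - query_time)  # seconds between query anchor and chunk
    return cosine(query_vec, chunk_vec) * math.exp(-decay_rate * offset)

def rank_chunks(query_vec, query_time, chunks, decay_rate=0.01):
    """Return (score, timestamp) pairs sorted from best to worst."""
    scored = [
        (time_decay_score(query_vec, vec, query_time, t, decay_rate), t)
        for vec, t in chunks
    ]
    return sorted(scored, reverse=True)

# Hypothetical subtitle chunks: (embedding, timestamp in seconds).
chunks = [
    ([1.0, 0.0], 10.0),   # relevant and close in time
    ([1.0, 0.0], 300.0),  # equally relevant but far away
    ([0.0, 1.0], 12.0),   # close in time but irrelevant
]
ranked = rank_chunks([1.0, 0.0], query_time=12.0, chunks=chunks)
print(ranked)
```

In this toy setup the chunk at 10 s ranks first: it matches the query as well as the chunk at 300 s but sits much closer to the assumed query anchor.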
[73] PurifyGen: A Risk-Discrimination and Semantic-Purification Model for Safe Text-to-Image Generation
Zongsheng Cao, Yangfan He, Anran Liu, Jun Xie, Feng Chen, Zepeng Wang
🧩 TL;DR
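This paper presents PurifyGen, a training-free, dual-stage prompt purification method for safe text-to-image generation. It assesses prompt-token safety by computing a complementary semantic distance and removes harmful semantic components with a dual-space transformation, while leaving the original model weights unchanged.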
📘 Detailed Summary
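Motivation: Diffusion models have improved text-to-image generation quality but also raise the risk of generating unsafe content. Traditional safety methods such as text blacklisting or harmful-content classification have significant drawbacks: they are easily circumvented or require large datasets and extra training. The study aims to overcome these challenges with a training-free approach to safe generation.
Method: PurifyGen uses a dual-stage prompt purification strategy. First, it evaluates the safety of each prompt token by computing its complementary semantic distance, which measures the semantic proximity between the token and concept embeddings from predefined toxic and clean lists. Second, for risky prompts it applies a dual-space transformation: toxic-aligned embeddings are projected into the null space of the toxic concept matrix to remove harmful semantic components and are simultaneously aligned into the range space of clean concepts. A token-wise strategy selectively replaces only the risky token embeddings, minimizing disruption to safe content.
Result: Extensive testing shows that PurifyGen surpasses existing methods on five datasets, significantly reducing unsafe content, and performs comparably to training-dependent approaches. It generalizes strongly to unseen prompts and models while preserving the original model's weights and generation quality.
Conclusion: PurifyGen is a theoretically grounded, training-free, plug-and-play solution that enables safe text-to-image generation while keeping the model's original weights. Its dual-space transformation both removes harmful semantics and reinforces safe ones, offering a new technical route to the safe deployment of diffusion models.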
📄 Abstract
Recent advances in diffusion models have notably enhanced text-to-image (T2I) generation quality, but they also raise the risk of generating unsafe content. Traditional safety methods like text blacklisting or harmful content classification have significant drawbacks: they can be easily circumvented or require extensive datasets and extra training. To overcome these challenges, we introduce PurifyGen, a novel, training-free approach for safe T2I generation that retains the model's original weights. PurifyGen introduces a dual-stage strategy for prompt purification. First, we evaluate the safety of each token in a prompt by computing its complementary semantic distance, which measures the semantic proximity between the prompt tokens and concept embeddings from predefined toxic and clean lists. This enables fine-grained prompt classification without explicit keyword matching or retraining. Tokens closer to toxic concepts are flagged as risky. Second, for risky prompts, we apply a dual-space transformation: we project toxic-aligned embeddings into the null space of the toxic concept matrix, effectively removing harmful semantic components, and simultaneously align them into the range space of clean concepts. This dual alignment purifies risky prompts by both subtracting unsafe semantics and reinforcing safe ones, while retaining the original intent and coherence. We further define a token-wise strategy to selectively replace only risky token embeddings, ensuring minimal disruption to safe content. PurifyGen offers a plug-and-play solution with theoretical grounding and strong generalization to unseen prompts and models. Extensive testing shows that PurifyGen surpasses current methods in reducing unsafe content across five datasets and competes well with training-dependent approaches. The code is available at https://github.com/AI-Researcher-Team/PurifyGen.
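The two linear-algebra operations named in the dual-space transformation can be illustrated with a small pure-Python sketch. This is a generic illustration of null-space and range-space projection, not PurifyGen's code: the abstract does not say how the two projections are combined or weighted, so they are shown as separate steps, and the function names and 3-dimensional toy embeddings are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors that are linearly dependent
            basis.append([wi / norm for wi in w])
    return basis

def project_onto_span(x, vectors):
    """Orthogonal projection of x onto the span (range space) of `vectors`."""
    out = [0.0] * len(x)
    for b in orthonormal_basis(vectors):
        c = dot(x, b)
        out = [oi + c * bi for oi, bi in zip(out, b)]
    return out

def remove_toxic(embedding, toxic_concepts):
    """Project onto the null space of the matrix whose rows are toxic concepts."""
    toxic_part = project_onto_span(embedding, toxic_concepts)
    return [e - t for e, t in zip(embedding, toxic_part)]

def align_to_clean(embedding, clean_concepts):
    """Keep only the component lying in the span of the clean concepts."""
    return project_onto_span(embedding, clean_concepts)

# Toy 3-D embeddings (illustrative values only).
toxic = [[1.0, 0.0, 0.0]]
clean = [[0.0, 1.0, 0.0]]
token = [3.0, 4.0, 5.0]

detoxified = remove_toxic(token, toxic)     # [0.0, 4.0, 5.0]
aligned = align_to_clean(detoxified, clean) # [0.0, 4.0, 0.0]
print(detoxified, aligned)
```

After `remove_toxic`, the embedding has zero inner product with every toxic concept vector, which is what membership in the null space of the toxic concept matrix means.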
[74] ThinkGen: Generalized Thinking for Visual Generation
Siyu Jiao, Yiheng Lin, Yujie Zhong, Qi She, Wei Zhou, Xiaohan Lan, Zilong Huang, Fei Yu, Yingchen Yu, Yunqing Zhao, Yao Zhao, Yunchao Wei
🧩 TL;DR
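This paper presents ThinkGen, the first visual generation framework driven by chain-of-thought reasoning. With a decoupled architecture of a multimodal large language model and a diffusion transformer, trained with a separable GRPO paradigm, it achieves robust state-of-the-art performance across generation scenarios.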
📘 Detailed Summary
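Motivation: Chain-of-thought reasoning in multimodal large language models performs well on complex understanding tasks, but its extension to generation tasks is still nascent and limited by scenario-specific mechanisms that hinder generalization and adaptation. A general framework is needed that uses chain-of-thought reasoning uniformly across diverse visual generation tasks.
Method: ThinkGen uses a decoupled architecture comprising a pretrained multimodal large language model and a Diffusion Transformer (DiT): the MLLM generates tailored instructions from user intent, and the DiT produces high-quality images guided by those instructions. The paper further proposes a separable GRPO training paradigm that alternates reinforcement learning between the MLLM and DiT modules and supports joint training across diverse datasets.
Result: Extensive experiments show that ThinkGen achieves robust, state-of-the-art performance on multiple generation benchmarks. Its decoupled architecture and SepGRPO training paradigm enable chain-of-thought reasoning across a wide range of generation scenarios. The code is open-sourced on GitHub.
Conclusion: The study shows the potential of chain-of-thought reasoning for visual generation. The decoupled architecture and alternating reinforcement learning paradigm give multimodal generation systems a flexible and scalable solution and point toward future general-purpose generative models.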
📄 Abstract
Recent progress in Multimodal Large Language Models (MLLMs) demonstrates that Chain-of-Thought (CoT) reasoning enables systematic solutions to complex understanding tasks. However, its extension to generation tasks remains nascent and limited by scenario-specific mechanisms that hinder generalization and adaptation. In this work, we present ThinkGen, the first think-driven visual generation framework that explicitly leverages MLLM's CoT reasoning in various generation scenarios. ThinkGen employs a decoupled architecture comprising a pretrained MLLM and a Diffusion Transformer (DiT), wherein the MLLM generates tailored instructions based on user intent, and DiT produces high-quality images guided by these instructions. We further propose a separable GRPO-based training paradigm (SepGRPO), alternating reinforcement learning between the MLLM and DiT modules. This flexible design enables joint training across diverse datasets, facilitating effective CoT reasoning for a wide range of generative scenarios. Extensive experiments demonstrate that ThinkGen achieves robust, state-of-the-art performance across multiple generation benchmarks. Code is available: https://github.com/jiaosiyuu/ThinkGen
[75] ProGuard: Towards Proactive Multimodal Safeguard
Shaohan Yu, Lijun Li, Chenyang Si, Lu Sheng, Jing Shao
🧩 TL;DR
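This paper proposes ProGuard, a vision-language proactive guard trained with reinforcement learning to identify and describe out-of-distribution safety risks without the model adjustments that traditional reactive approaches require, achieving significant gains in multimodal safety moderation.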
📘 Detailed Summary
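Motivation: The rapid development of generative models keeps producing new multimodal safety risks, exposing the limits of existing defenses. Traditional reactive approaches require model adjustments and cannot handle out-of-distribution safety risks effectively, so a proactive guard mechanism is needed.
Method: The study builds a modality-balanced dataset of 87K samples annotated under a hierarchical multimodal safety taxonomy and trains a vision-language base model purely through reinforcement learning for efficient reasoning. It introduces an out-of-distribution safety category inference task and augments the RL objective with a synonym-bank-based similarity reward that encourages the model to generate concise descriptions for unseen unsafe categories.
Result: ProGuard performs comparably to closed-source large models on binary safety classification and substantially outperforms existing open-source guard models on unsafe content categorization. It improves OOD risk detection by 52.6% and OOD risk description by 64.8%, showing strong proactive moderation ability.
Conclusion: The study demonstrates that a proactive guard trained with reinforcement learning is effective at identifying multimodal safety risks and offers a new paradigm for emerging threats. Its hierarchical taxonomy and modality-balanced dataset help reduce modality bias and ensure consistent moderation across text, image, and text-image inputs.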
📄 Abstract
The rapid evolution of generative models has led to a continuous emergence of multimodal safety risks, exposing the limitations of existing defense methods. To address these challenges, we propose ProGuard, a vision-language proactive guard that identifies and describes out-of-distribution (OOD) safety risks without the need for model adjustments required by traditional reactive approaches. We first construct a modality-balanced dataset of 87K samples, each annotated with both binary safety labels and risk categories under a hierarchical multimodal safety taxonomy, effectively mitigating modality bias and ensuring consistent moderation across text, image, and text-image inputs. Based on this dataset, we train our vision-language base model purely through reinforcement learning (RL) to achieve efficient and concise reasoning. To approximate proactive safety scenarios in a controlled setting, we further introduce an OOD safety category inference task and augment the RL objective with a synonym-bank-based similarity reward that encourages the model to generate concise descriptions for unseen unsafe categories. Experimental results show that ProGuard achieves performance comparable to closed-source large models on binary safety classification and substantially outperforms existing open-source guard models on unsafe content categorization. Most notably, ProGuard delivers a strong proactive moderation ability, improving OOD risk detection by 52.6% and OOD risk description by 64.8%.
[76] LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation
Ethan Chern, Zhulin Hu, Bohao Tang, Jiadi Su, Steffi Chern, Zhijie Deng, Pengfei Liu
🧩 TL;DR
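This paper proposes a real-time distillation method for multimodal-conditioned video diffusion models. An improved distillation recipe resolves the visual artifacts that existing methods produce under multimodal conditioning, and the resulting LiveTalk real-time interactive avatar system cuts response latency from 1-2 minutes to real-time generation.
📘 Detailed Summary
Motivation: Real-time video generation is essential for building general-purpose multimodal interactive AI systems, but existing diffusion models denoise all video frames simultaneously through an iterative process with bidirectional attention, which prevents real-time interaction. Existing distillation methods focus mainly on text-to-video generation and suffer from visual artifacts (flickering, black frames, quality degradation) under multimodal conditioning, leaving human-AI interaction unnatural and inefficient.
Method: The paper proposes an improved distillation recipe for real-time interactive video diffusion conditioned on text, image, and audio, focusing on the quality of the condition inputs and on the initialization and schedule of the on-policy optimization. Integrating audio language models and the long-form video inference technique Anchor-Heavy Identity Sinks yields LiveTalk, a real-time multimodal interactive avatar system.
Result: On multimodal-conditioned avatar video generation benchmarks including HDTF, AVSpeech, and CelebV-HQ, the distilled model matches the visual quality of full-step bidirectional baselines at 20x lower inference cost and latency. System-level evaluation shows LiveTalk outperforms state-of-the-art models such as Sora2 and Veo3 in multi-turn video coherence and content quality, while reducing response latency from 1-2 minutes to real-time generation.
Conclusion: By improving distillation under multimodal conditioning, the work achieves high-quality real-time video generation with markedly lower inference latency and cost. LiveTalk demonstrates that real-time multimodal interaction is feasible and provides a technical foundation for practical general-purpose multimodal interactive systems.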
📄 Abstract
Real-time video generation via diffusion is essential for building general-purpose multimodal interactive AI systems. However, the simultaneous denoising of all video frames with bidirectional attention via an iterative process in diffusion models prevents real-time interaction. While existing distillation methods can make the model autoregressive and reduce sampling steps to mitigate this, they focus primarily on text-to-video generation, leaving the human-AI interaction unnatural and less efficient. This paper targets real-time interactive video diffusion conditioned on a multimodal context, including text, image, and audio, to bridge the gap. Given the observation that the leading on-policy distillation approach Self Forcing encounters challenges (visual artifacts like flickering, black frames, and quality degradation) with multimodal conditioning, we investigate an improved distillation recipe with emphasis on the quality of condition inputs as well as the initialization and schedule for the on-policy optimization. On benchmarks for multimodal-conditioned (audio, image, and text) avatar video generation including HDTF, AVSpeech, and CelebV-HQ, our distilled model matches the visual quality of the full-step, bidirectional baselines of similar or larger size with 20x less inference cost and latency. Further, we integrate our model with audio language models and long-form video inference technique Anchor-Heavy Identity Sinks to build LiveTalk, a real-time multimodal interactive avatar system. System-level evaluation on our curated multi-turn interaction benchmark shows LiveTalk outperforms state-of-the-art models (Sora2, Veo3) in multi-turn video coherence and content quality, while reducing response latency from 1 to 2 minutes to real-time generation, enabling seamless human-AI multimodal interaction.
[77] Same or Not? Enhancing Visual Perception in Vision-Language Models
Damiano Marsili, Aditya Mehta, Ryan Y. Lin, Georgia Gkioxari
🧩 TL;DR
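This paper introduces TWIN, a large-scale dataset of 561,000 image-pair queries designed to strengthen the fine-grained perception of vision-language models. Models are trained to decide whether two visually similar images depict the same object, which pushes them to attend to subtle visual cues.
📘 Detailed Summary
Motivation: Current vision-language models excel at broad visual understanding but remain coarse-grained, exhibit visual biases, and miss subtle visual details. Existing training corpora reinforce this limitation by emphasizing general recognition over fine-grained perception.
Method: The authors introduce TWIN, a dataset of 561,000 image-pair queries that ask whether two visually similar images depict the same object, encouraging attention to subtle visual cues. The dataset covers everyday objects across varied contexts, viewpoints, and appearances. They also propose FGVQA, a benchmark suite of 12,000 queries, to quantify gains in fine-grained recognition.
Result: Vision-language models fine-tuned on TWIN improve notably on fine-grained recognition, by up to 19.3%, including on unseen domains such as art, animals, plants, and landmarks. The gains do not compromise performance on general VQA benchmarks, and performance improves as the dataset scales.
Conclusion: TWIN is an effective addition to open-source VLM training corpora and can improve the perceptual precision of future models. The analysis shows that data scale is key to performance, and the work offers a systematic way to strengthen fine-grained perception in VLMs.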
📄 Abstract
Vision-language models (VLMs) excel at broad visual understanding but remain coarse-grained, exhibit visual biases, and miss subtle visual details. Existing training corpora reinforce this limitation by emphasizing general recognition ("Is it a cat or a dog?") over fine-grained perception. To address this, we introduce a new training corpus and task designed to enhance the perceptual abilities of VLMs. TWIN is a large-scale dataset of 561,000 image-pair queries that task models to determine whether two visually similar images depict the same object, encouraging attention to nuanced visual cues. The dataset spans a diverse range of everyday objects across contexts, viewpoints, and appearances. Fine-tuning VLMs on TWIN yields notable gains in fine-grained recognition, even on unseen domains such as art, animals, plants, and landmarks. To quantify these gains, we introduce FGVQA, a benchmark suite of 12,000 queries that repurposes fine-grained recognition and retrieval datasets from multiple domains. While existing VLMs struggle on FGVQA, when fine-tuned on TWIN they improve by up to 19.3%, without compromising performance on general VQA benchmarks. Finally, our TWIN dataset scales favorably with object annotations, and our analysis shows that scale is key to performance. We envision TWIN as a drop-in addition to open-source VLM training corpora, advancing perceptual precision of future models. Project webpage: https://glab-caltech.github.io/twin/
[78] Scalable Residual Feature Aggregation Framework with Hybrid Metaheuristic Optimization for Robust Early Pancreatic Neoplasm Detection in Multimodal CT Imaging
Janani Annur Thiruvengadam, Kiran Mayee Nabigaru, Anusha Kovi
🧩 TL;DR
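This study proposes a Scalable Residual Feature Aggregation (SRFA) framework for early detection of pancreatic tumors. Its multi-stage pipeline and hybrid model reach 96.23% accuracy, significantly outperforming traditional CNNs and transformer-based models.
📘 Detailed Summary
Motivation: Early detection of pancreatic tumors is a major clinical challenge, mainly because tumors on CT scans usually show minimal contrast margins and anatomy varies widely across patients. These complexities call for an effective, scalable system that enhances the salience of subtle visual cues and generalizes well on multimodal imaging data.
Method: The SRFA framework uses a preprocessing pipeline and MAGRes-UNet segmentation to make pancreatic structures more visible, DenseNet-121 with residual feature storage for feature extraction, and a hybrid HHO-BA metaheuristic feature selection strategy to refine the feature subset. Classification uses a hybrid model that combines the attention mechanism of the Vision Transformer with the representational efficiency of EfficientNet-B3. A dual optimization mechanism combining SSA and GWO then fine-tunes hyperparameters to reduce overfitting.
Result: The proposed SRFA framework reaches 96.23% accuracy, a 95.58% F1-score, and 94.83% specificity on pancreatic tumor detection, clearly outperforming traditional convolutional neural networks and contemporary transformer-based models.
Conclusion: The study demonstrates the potential of SRFA as an effective tool for early pancreatic tumor detection. Its multi-stage pipeline and hybrid model design handle the complexity and variability of CT scans, and the results point to a complementary benefit from combining metaheuristic optimization with attention mechanisms in medical diagnosis.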
📄 Abstract
Early detection of pancreatic neoplasms is a major clinical challenge, largely because tumors on CT scans tend to present with minimal contrast margins and anatomy varies widely across patients. Addressing these complexities requires an effective and scalable system that enhances the salience of subtle visual cues and generalizes well on multimodal imaging data. This study proposes a Scalable Residual Feature Aggregation (SRFA) framework to meet these requirements. The framework integrates a preprocessing pipeline followed by segmentation with MAGRes-UNet, which makes pancreatic structures more visible and isolates regions of interest. DenseNet-121 with residual feature storage is used for feature extraction, allowing deep hierarchical features to be aggregated without loss of feature properties. A hybrid HHO-BA metaheuristic feature selection strategy then refines the feature subset. For classification, the system uses a new hybrid model that combines the global attention capability of the Vision Transformer (ViT) with the high representational efficiency of EfficientNet-B3. A dual optimization mechanism incorporating SSA and GWO fine-tunes hyperparameters to improve robustness and reduce overfitting. Experimental results show a significant performance improvement: the proposed model reaches 96.23% accuracy, a 95.58% F1-score, and 94.83% specificity, significantly outperforming traditional CNNs and contemporary transformer-based models. These results highlight the potential of the SRFA framework as a useful tool for the early detection of pancreatic tumors.
[79] OmniAgent: Audio-Guided Active Perception Agent for Omnimodal Audio-Video Understanding
Keda Tao, Wenjie Du, Bohan Yu, Weiqiang Wang, Jian Liu, Huan Wang
🧩 TL;DR
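This paper proposes OmniAgent, a fully audio-guided active perception agent that dynamically orchestrates specialized tools for fine-grained audio-visual reasoning, surpassing leading open-source and proprietary models by 10%-20% accuracy on several benchmarks.
📘 Detailed Summary
Motivation: Omnimodal large language models have made progress in unifying audio and visual modalities, but they lack fine-grained cross-modal understanding and struggle with multimodal alignment. Existing methods also rely on rigid, static workflows and dense frame captioning, which limits active perception.
Method: OmniAgent uses dynamic planning to orchestrate tool calls autonomously and on demand. Its coarse-to-fine audio-guided perception paradigm uses audio cues to localize temporal events and guide subsequent reasoning, shifting from passive response generation to active multimodal inquiry.
Result: Extensive evaluation on three audio-video understanding benchmarks shows that OmniAgent achieves state-of-the-art performance, surpassing leading open-source and proprietary models by 10%-20% accuracy.
Conclusion: The work demonstrates a shift from passive response generation to active multimodal inquiry. Dynamic tool orchestration and audio-guided perception yield finer-grained cross-modal understanding and suggest a direction for more capable multimodal systems.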
📄 Abstract
Omnimodal large language models have made significant strides in unifying audio and visual modalities; however, they often lack the fine-grained cross-modal understanding and have difficulty with multimodal alignment. To address these limitations, we introduce OmniAgent, a fully audio-guided active perception agent that dynamically orchestrates specialized tools to achieve more fine-grained audio-visual reasoning. Unlike previous works that rely on rigid, static workflows and dense frame-captioning, this paper demonstrates a paradigm shift from passive response generation to active multimodal inquiry. OmniAgent employs dynamic planning to autonomously orchestrate tool invocation on demand, strategically concentrating perceptual attention on task-relevant cues. Central to our approach is a novel coarse-to-fine audio-guided perception paradigm, which leverages audio cues to localize temporal events and guide subsequent reasoning. Extensive empirical evaluations on three audio-video understanding benchmarks demonstrate that OmniAgent achieves state-of-the-art performance, surpassing leading open-source and proprietary models by substantial margins of 10% - 20% accuracy.
[80] Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation
Shaocong Xu, Songlin Wei, Qizhe Wei, Zheng Geng, Hong Li, Licheng Shen, Qianpu Sun, Shu Han, Bin Ma, Bohan Li, Chongjie Ye, Yuhang Zheng, Nan Wang, Saining Zhang, Hao Zhao
🧩 TL;DR
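This paper proposes DKT, which exploits the generative priors of video diffusion models to tackle transparent object perception. It delivers label-free, zero-shot depth and normal estimation and achieves state-of-the-art results on transparent and reflective scenes.
📘 Detailed Summary
Motivation: Transparent objects are hard for perception systems because refraction, reflection, and transmission break the assumptions behind stereo, time-of-flight cameras, and purely discriminative monocular depth estimation, leading to holes and temporally unstable depth estimates.
Method: The authors build TransPhy3D, a synthetic video dataset of 11k transparent/reflective scene sequences rendered with Blender/Cycles, using physically based ray tracing and OptiX denoising to produce RGB, depth, and normal maps. Starting from a large video diffusion model, they learn a video-to-video translator for depth and normals via lightweight LoRA adapters, concatenating RGB and noisy depth latents in the DiT backbone and co-training on TransPhy3D and existing synthetic datasets.
Result: DKT achieves zero-shot state-of-the-art performance on real and synthetic video benchmarks involving transparency, including ClearPose, DREDS, and TransPhy3D-Test. It improves accuracy and temporal consistency over strong image and video baselines, and its normal variant sets the best video normal estimation results on ClearPose. A compact 1.3B version runs at about 0.17 s/frame, and integrating DKT into a grasping system raises success rates on translucent, reflective, and diffuse surfaces.
Conclusion: The results support the core claim that "diffusion knows transparency": generative video priors can be turned, efficiently and without labels, into robust and temporally consistent perception for challenging real-world manipulation. The approach shows that the optical rules internalized by generative models can be used for transparent object perception.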
📄 Abstract
Transparent objects remain notoriously hard for perception systems: refraction, reflection and transmission break the assumptions behind stereo, ToF and purely discriminative monocular depth, causing holes and temporally unstable estimates. Our key observation is that modern video diffusion models already synthesize convincing transparent phenomena, suggesting they have internalized the optical rules. We build TransPhy3D, a synthetic video corpus of transparent/reflective scenes: 11k sequences rendered with Blender/Cycles. Scenes are assembled from a curated bank of category-rich static assets and shape-rich procedural assets paired with glass/plastic/metal materials. We render RGB + depth + normals with physically based ray tracing and OptiX denoising. Starting from a large video diffusion model, we learn a video-to-video translator for depth (and normals) via lightweight LoRA adapters. During training we concatenate RGB and (noisy) depth latents in the DiT backbone and co-train on TransPhy3D and existing frame-wise synthetic datasets, yielding temporally consistent predictions for arbitrary-length input videos. The resulting model, DKT, achieves zero-shot SOTA on real and synthetic video benchmarks involving transparency: ClearPose, DREDS (CatKnown/CatNovel), and TransPhy3D-Test. It improves accuracy and temporal consistency over strong image/video baselines, and a normal variant sets the best video normal estimation results on ClearPose. A compact 1.3B version runs at ~0.17 s/frame. Integrated into a grasping stack, DKT's depth boosts success rates across translucent, reflective and diffuse surfaces, outperforming prior estimators. Together, these results support a broader claim: "Diffusion knows transparency." Generative video priors can be repurposed, efficiently and label-free, into robust, temporally coherent perception for challenging real-world manipulation.
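For readers unfamiliar with the "lightweight LoRA adapters" mentioned in the abstract, the sketch below shows the standard low-rank update rule, W' = W + (alpha / r) * B A, in plain Python. It is a generic illustration of LoRA, not DKT's implementation; the helper names and the toy 2x2 weights are assumptions chosen for readability.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def lora_weight(w, a, b, alpha):
    """Return the adapted weight W + (alpha / r) * (B @ A).

    w: d_out x d_in frozen base weight
    a: r x d_in trainable down-projection
    b: d_out x r trainable up-projection
    """
    r = len(a)  # adapter rank
    scale = alpha / r
    delta = matmul(b, a)  # low-rank update, shape d_out x d_in
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy example: a rank-1 adapter on a 2x2 identity weight.
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[0.5, 0.5]]
b = [[1.0], [0.0]]
print(lora_weight(w, a, b, alpha=2.0))
```

Only `a` and `b` are trained while `w` stays frozen, so the number of trainable parameters scales with the rank r rather than with the full weight matrix. With `b` initialized to zeros, the adapted weight equals the base weight at the start of training.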
cs.CL
[81] Open-Source Multimodal Moxin Models with Moxin-VLM and Moxin-VLA
Pu Zhao, Xuan Shen, Zhenglun Kong, Yixin Shen, Sung-En Chang, Arash Akbari, Timothy Rupprecht, Lei Lu, Enfu Nan, Changdi Yang, Yumei He, Weiyan Shi, Xingchen Xu, Yu Huang, Wei Jiang, Wei Wang, Yue Chen, Yong He, Yanzhi Wang
🧩 TL;DR
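This paper introduces Moxin 7B, a fully open-source LLM built in accordance with the Model Openness Framework. It goes beyond sharing model weights to provide full transparency in training, datasets, and implementation details, and comes with three variants targeting vision-language, vision-language-action, and Chinese capabilities.
📘 Detailed Summary
Motivation: The LLM landscape is split between proprietary models (such as GPT-4) and open-source models (such as LLaMA and Mistral). Open-source models have helped popularize LLMs because they are easy to customize and deploy, but most share only model weights and lack full transparency about training, datasets, and implementation details, which limits community collaboration and the health of the open-source ecosystem.
Method: The paper presents Moxin 7B, built in accordance with the Model Openness Framework with full transparency in training, datasets, and implementation details. Three specialized variants are developed on top of it: Moxin-VLM for vision-language tasks, Moxin-VLA for vision-language-action tasks, and Moxin-Chinese for Chinese capabilities. All models are trained with an open-source framework and open data.
Result: Experiments show that the Moxin models perform strongly across various evaluations, with evaluation details and quantitative metrics given in the paper. This supports the view that a fully open and transparent approach need not sacrifice performance. The authors release all models along with the available data and code to support community use and reproduction.
Conclusion: Moxin 7B is a step toward a fully transparent open-source LLM ecosystem that goes beyond weight sharing. Full training transparency supports a more inclusive and collaborative research environment, and the variants show how an open model can be adapted and extended across task domains.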
📄 Abstract
Recently, Large Language Models (LLMs) have undergone a significant transformation, marked by a rapid rise in both their popularity and capabilities. Leading this evolution are proprietary LLMs like GPT-4 and GPT-o1, which have captured widespread attention in the AI community due to their remarkable performance and versatility. Simultaneously, open-source LLMs, such as LLaMA and Mistral, have contributed greatly to the ever-increasing popularity of LLMs because they are easy to customize and deploy across diverse applications. Moxin 7B is introduced as a fully open-source LLM developed in accordance with the Model Openness Framework, which moves beyond the simple sharing of model weights to embrace complete transparency in training, datasets, and implementation details, thus fostering a more inclusive and collaborative research environment that can sustain a healthy open-source ecosystem. To further equip Moxin with various capabilities in different tasks, we develop three variants based on Moxin, including Moxin-VLM, Moxin-VLA, and Moxin-Chinese, which target vision-language, vision-language-action, and Chinese capabilities, respectively. Experiments show that our models achieve superior performance in various evaluations. We adopt an open-source framework and open data for training. We release our models, along with the available data and code to derive these models.
[82] ManchuTTS: Towards High-Quality Manchu Speech Synthesis via Flow Matching and Hierarchical Text Representation
Suhua Wang, Zifan Wang, Xiaoxin Sun, D. J. Wang, Zhanbo Liu, Xin Li
🧩 TL;DR
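This paper proposes ManchuTTS, a speech synthesis approach for the endangered Manchu language. A three-tier text representation and a cross-modal hierarchical attention mechanism address Manchu's data scarcity and phonological agglutination, enabling high-quality synthesis under low-resource conditions.
📘 Detailed Summary
Motivation: As an endangered language, Manchu poses unique challenges for speech synthesis, including severe data scarcity and strong phonological agglutination. Existing methods handle these properties poorly, so a solution tailored to Manchu's linguistic characteristics is needed.
Method: The method designs a three-tier text representation (phoneme, syllable, prosodic) and a cross-modal hierarchical attention mechanism for multi-granular alignment. It integrates deep convolutional networks with a flow-matching Transformer for efficient non-autoregressive generation, and introduces a hierarchical contrastive loss to guide structured acoustic-linguistic correspondence. The authors also construct the first Manchu TTS dataset and apply data augmentation to cope with low-resource constraints.
Result: ManchuTTS attains a MOS of 4.52 using a 5.2-hour training subset, significantly outperforming all baseline models. Ablations confirm that hierarchical guidance improves agglutinative word pronunciation accuracy by 31% and prosodic naturalness by 27%. The training subset is drawn from the authors' 6.24-hour annotated corpus.
Conclusion: The study shows that hierarchical representations and attention mechanisms designed around a language's specific properties are effective for low-resource speech synthesis. The framework can inform work on other endangered or agglutinative languages, and the dataset is a valuable resource for Manchu computational linguistics.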
📄 Abstract
As an endangered language, Manchu presents unique challenges for speech synthesis, including severe data scarcity and strong phonological agglutination. This paper proposes ManchuTTS (Manchu Text to Speech), a novel approach tailored to Manchu's linguistic characteristics. To handle agglutination, the method designs a three-tier text representation (phoneme, syllable, prosodic) and a cross-modal hierarchical attention mechanism for multi-granular alignment. The synthesis model integrates deep convolutional networks with a flow-matching Transformer, enabling efficient, non-autoregressive generation. The method further introduces a hierarchical contrastive loss to guide structured acoustic-linguistic correspondence. To address low-resource constraints, we construct the first Manchu TTS dataset and employ a data augmentation strategy. Experiments demonstrate that ManchuTTS attains a MOS of 4.52 using a 5.2-hour training subset derived from our full 6.24-hour annotated corpus, outperforming all baseline models by a notable margin. Ablations confirm that hierarchical guidance improves agglutinative word pronunciation accuracy (AWPA) by 31% and prosodic naturalness by 27%.
[83] Structured Prompting and LLM Ensembling for Multimodal Conversational Aspect-based Sentiment Analysis
Zhiqiang Gao, Shihao Gao, Zixing Zhang, Yihao Guo, Hongyu Chen, Jing Han
🧩 TL;DR
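This paper presents a two-stage approach to aspect-based sentiment analysis in multimodal conversations, addressing the two subtasks of the MCABSA Challenge: extracting sentiment sextuples with a structured prompting pipeline, and detecting sentiment flips and their triggers by ensembling multiple LLMs.
📘 Detailed Summary
Motivation: Understanding sentiment in multimodal conversations is a key challenge for building emotionally intelligent AI systems. The MCABSA Challenge poses two complex subtasks: extracting a complete sentiment sextuple (holder, target, aspect, opinion, sentiment, and rationale) from multi-speaker dialogues, and detecting dynamic sentiment shifts together with their underlying triggers.
Method: For Subtask-I, the authors design a structured prompting pipeline that guides large language models to extract sentiment components sequentially with refined contextual understanding. For Subtask-II, they ensemble three different LLMs, using their complementary strengths to identify sentiment transitions and their triggers robustly.
Result: The system achieves a 47.38% average score on Subtask-I and a 74.12% exact match F1 on Subtask-II, showing that step-wise refinement and ensembling are effective for multimodal sentiment analysis.
Conclusion: Structured prompting pipelines and model ensembling can handle complex multimodal sentiment analysis tasks effectively, offering a practical approach for emotionally intelligent AI systems and showing the potential of LLMs for fine-grained sentiment understanding.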
📄 Abstract
Understanding sentiment in multimodal conversations is a complex yet crucial challenge toward building emotionally intelligent AI systems. The Multimodal Conversational Aspect-based Sentiment Analysis (MCABSA) Challenge invited participants to tackle two demanding subtasks: (1) extracting a comprehensive sentiment sextuple, including holder, target, aspect, opinion, sentiment, and rationale from multi-speaker dialogues, and (2) detecting sentiment flipping, which detects dynamic sentiment shifts and their underlying triggers. For Subtask-I, in the present paper, we designed a structured prompting pipeline that guided large language models (LLMs) to sequentially extract sentiment components with refined contextual understanding. For Subtask-II, we further leveraged the complementary strengths of three LLMs through ensembling to robustly identify sentiment transitions and their triggers. Our system achieved a 47.38% average score on Subtask-I and a 74.12% exact match F1 on Subtask-II, showing the effectiveness of step-wise refinement and ensemble strategies in rich, multimodal sentiment analysis tasks.
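The abstract does not say how the three LLMs are combined for Subtask-II. One common choice is majority voting over their predictions, sketched below under that assumption; the function name, the label strings, and the tie-breaking rule (fall back to the earliest model in the list) are all hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent prediction.

    Ties are broken in favor of the earliest entry in the list, so callers
    can order models by trust. This rule is an illustrative assumption.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    top = max(counts.values())
    for prediction in predictions:
        if counts[prediction] == top:
            return prediction


# Hypothetical outputs of three LLMs for one sentiment-flip query.
votes = ["positive->negative", "positive->negative", "no_flip"]
print(majority_vote(votes))
```

With three voters, any label chosen by two models wins outright; the tie-breaking rule only matters when all three disagree.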
[84] Text-Routed Sparse Mixture-of-Experts Model with Explanation and Temporal Alignment for Multi-Modal Sentiment Analysis
Dongning Rao, Yunbiao Zeng, Zhihua Jiang, Jujian Lv
🧩 TL;DR
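This paper proposes TEXT, a multimodal sentiment analysis model that generates explanations with a multimodal large language model and adds a temporal alignment block to align and fuse cross-modal representations, achieving state-of-the-art performance on several datasets.
📘 Detailed Summary
Motivation: Although many methods have been proposed to handle subtle emotions across modalities, the potential of explanations and temporal alignment remains underexplored, which limits how well models capture cross-modal sentiment interactions.
Method: TEXT first generates explanations for multimodal sentiment analysis with a multimodal large language model. It then aligns audio and video representations through a temporality-oriented neural network block that merges the benefits of Mamba and temporal cross-attention. Finally, it applies a text-routed sparse mixture-of-experts with gate fusion.
Result: TEXT achieves the best performance on all four datasets among all tested models, including three recently proposed approaches and three multimodal large language models, and leads on at least four of the six evaluation metrics. For example, on CH-SIMS it lowers the mean absolute error to 0.353, a 13.5% reduction compared with recent approaches.
Conclusion: Combining explanation generation with temporal alignment substantially improves multimodal sentiment analysis. The temporal alignment block offers a new approach to cross-modal representation learning, and the text-routed sparse mixture-of-experts provides an efficient architecture for multimodal fusion.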
📄 Abstract
Human-interaction-involved applications underscore the need for Multi-modal Sentiment Analysis (MSA). Although many approaches have been proposed to address the subtle emotions in different modalities, the power of explanations and temporal alignments is still underexplored. Thus, this paper proposes the Text-routed sparse mixture-of-Experts model with eXplanation and Temporal alignment for MSA (TEXT). TEXT first augments explanations for MSA via Multi-modal Large Language Models (MLLM), and then aligns the representations of audio and video through a novel temporality-oriented neural network block. TEXT aligns different modalities with explanations and facilitates a new text-routed sparse mixture-of-experts with gate fusion. Our temporal alignment block merges the benefits of Mamba and temporal cross-attention. As a result, TEXT achieves the best performance across four datasets among all tested models, including three recently proposed approaches and three MLLMs. TEXT wins on at least four of the six metrics. For example, TEXT decreases the mean absolute error to 0.353 on the CH-SIMS dataset, which signifies a 13.5% decrement compared with recently proposed approaches.
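As a rough sanity check on the CH-SIMS figure, the reference error implied by the abstract can be back-computed. This assumes the 13.5% decrement is a relative reduction measured against a single reference MAE; the abstract does not state the baseline value, so the result below is an inference, not a reported number.

```latex
\[
\mathrm{MAE}_{\text{ref}}
  \approx \frac{\mathrm{MAE}_{\text{TEXT}}}{1 - 0.135}
  = \frac{0.353}{0.865}
  \approx 0.408
\]
```

Under that assumption, the absolute improvement is about 0.055 MAE.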
[85] LENS: LLM-Enabled Narrative Synthesis for Mental Health by Aligning Multimodal Sensing with Language Models
Wenxuan Xu, Arvind Pillai, Subigya Nepal, Amanda C Collins, Daniel M Mackin, Michael V Heinz, Tess Z Griffin, Nicholas C Jacobson, Andrew Campbell
🧩 TL;DR
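This paper proposes LENS, a framework that aligns multimodal health sensing data with language models to generate clinically grounded mental-health narratives, addressing the difficulty of turning sensor time series into natural language.
📘 Detailed Summary
Motivation: Multimodal health sensing offers rich behavioral signals for mental health assessment, but translating numerical time-series measurements into natural language remains challenging. Current LLMs cannot natively ingest long-duration sensor streams, and paired sensor-text datasets are scarce, which limits sensing-based clinical narrative generation.
Method: LENS first builds a large-scale dataset by transforming Ecological Momentary Assessment responses related to depression and anxiety symptoms into natural-language descriptions, yielding over 100,000 sensor-text QA pairs from 258 participants. To integrate time series natively, the authors train a patch-level encoder that projects raw sensor signals directly into the LLM's representation space.
Result: LENS outperforms strong baselines on standard NLP metrics and on task-specific measures of symptom-severity accuracy. A user study with 13 mental-health professionals further indicates that the narratives LENS produces are comprehensive and clinically meaningful.
Conclusion: The approach advances LLMs as interfaces for health sensing and provides a scalable path toward models that reason over raw behavioral signals and support downstream clinical decision-making. It shows the potential of aligning multimodal sensing data with language models for sensor-based clinical narrative generation.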
📄 Abstract
Multimodal health sensing offers rich behavioral signals for assessing mental health, yet translating these numerical time-series measurements into natural language remains challenging. Current LLMs cannot natively ingest long-duration sensor streams, and paired sensor-text datasets are scarce. To address these challenges, we introduce LENS, a framework that aligns multimodal sensing data with language models to generate clinically grounded mental-health narratives. LENS first constructs a large-scale dataset by transforming Ecological Momentary Assessment (EMA) responses related to depression and anxiety symptoms into natural-language descriptions, yielding over 100,000 sensor-text QA pairs from 258 participants. To enable native time-series integration, we train a patch-level encoder that projects raw sensor signals directly into an LLM's representation space. Our results show that LENS outperforms strong baselines on standard NLP metrics and task-specific measures of symptom-severity accuracy. A user study with 13 mental-health professionals further indicates that LENS-produced narratives are comprehensive and clinically meaningful. Ultimately, our approach advances LLMs as interfaces for health sensing, providing a scalable path toward models that can reason over raw behavioral signals and support downstream clinical decision-making.
[86] AI Meets Brain: Memory Systems from Cognitive Neuroscience to Autonomous Agents
Jiafeng Liang, Hao Li, Chang Li, Jiaqi Zhou, Shixin Jiang, Zekun Wang, Changkai Ji, Zhihao Zhu, Runxuan Liu, Tao Ren, Jinlan Fu, See-Kiong Ng, Xia Liang, Ming Liu, Bing Qin
🧩 TL;DR
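This survey systematically synthesizes memory research across cognitive neuroscience and LLM-driven agents, aiming to bridge interdisciplinary barriers and inform more effective memory workflows for autonomous agents. It compares biological and artificial memory systems, covers memory security and evaluation benchmarks, and points to multimodal memory systems and skill acquisition as future directions.
📘 Detailed Summary
Motivation: Research on autonomous agents increasingly draws on cognitive neuroscience to design efficient memory workflows, but interdisciplinary barriers make it hard to assimilate the essence of human memory mechanisms. The survey aims to close this gap by systematically synthesizing interdisciplinary knowledge of memory and giving agent memory design a firmer theoretical footing.
Method: The survey first clarifies the definition and function of memory along a progression from cognitive neuroscience through LLMs to agents. It then compares memory taxonomy, storage mechanisms, and the complete management lifecycle from biological and artificial perspectives. It goes on to review mainstream benchmarks for evaluating agent memory and examines memory security from the dual perspectives of attack and defense.
Result: The survey provides a systematic framework connecting memory knowledge from cognitive neuroscience with LLM agents and a comparative analysis spanning biological and artificial memory. It identifies the mainstream benchmarks for memory evaluation and offers a dual attack-and-defense view of memory security, giving agent memory designers a broad interdisciplinary reference.
Conclusion: The work supplies an interdisciplinary foundation for agent memory system design and argues for deeper integration of human memory mechanisms into AI systems. It outlines future research directions, with a focus on multimodal memory systems and skill acquisition.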
📄 Abstract
Memory serves as the pivotal nexus bridging past and future, providing both humans and AI systems with invaluable concepts and experience to navigate complex tasks. Recent research on autonomous agents has increasingly focused on designing efficient memory workflows by drawing on cognitive neuroscience. However, constrained by interdisciplinary barriers, existing works struggle to assimilate the essence of human memory mechanisms. To bridge this gap, we systematically synthesize interdisciplinary knowledge of memory, connecting insights from cognitive neuroscience with LLM-driven agents. Specifically, we first elucidate the definition and function of memory along a progressive trajectory from cognitive neuroscience through LLMs to agents. We then provide a comparative analysis of memory taxonomy, storage mechanisms, and the complete management lifecycle from both biological and artificial perspectives. Subsequently, we review the mainstream benchmarks for evaluating agent memory. Additionally, we explore memory security from the dual perspectives of attack and defense. Finally, we envision future research directions, with a focus on multimodal memory systems and skill acquisition.
[87] UniHetero: Could Generation Enhance Understanding for Vision-Language-Model at Large Data Scale?
Fengjiao Chen, Minhao Jing, Weitao Lu, Yan Feng, Xiaoyu Li, Xuezhi Cao
🧩 TL;DR
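This paper presents UniHetero, a unified model used to study systematically, under large-scale pretraining (>200M samples), whether visual generation enhances visual understanding. It finds that generation improves understanding only when the model generates semantics rather than pixels.
📘 Detailed Summary
Motivation: Vision-language large models are moving toward unifying visual understanding and visual generation, but whether and how generation enhances understanding at large data scale remains under-explored. The paper sets out to study this question systematically.
Method: The authors analyze UniHetero, a unified model with a concise structure, under large-scale pretraining on more than 200 million samples, with particular attention to how well autoregression on input embeddings captures visual details.
Result: The experiments yield three key observations: generating semantics rather than pixels improves understanding; generation shows a better data-scaling trend and higher data utilization; and autoregression on input embeddings effectively captures visual details.
Conclusion: The study identifies a key design principle for unified generation-and-understanding models: semantic-level generation benefits understanding more than pixel-level generation. This offers guidance for the design of future unified architectures.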
📄 Abstract
Vision-language large models are moving toward the unification of visual understanding and visual generation tasks. However, whether generation can enhance understanding is still under-explored at large data scale. In this work, we analyze a unified model with a concise structure, UniHetero, under large-scale pretraining (>200M samples). Our key observations are: (1) Generation can improve understanding, but only if the model generates semantics, not pixels. (2) Generation reveals a superior data-scaling trend and higher data utilization. (3) Autoregression on input embeddings is effective at capturing visual details.
[88] Instruction-Following Evaluation of Large Vision-Language Models
Daiki Shiono, Shumpei Miyawaki, Ryota Tanaka, Jun Suzuki
🧩 TL;DR
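This study quantitatively shows that large vision-language models lose instruction-following ability after visual instruction tuning. By building training datasets that include output-format instructions, it demonstrates that explicitly specifying the output format mitigates this degradation.
📘 Detailed Summary
Motivation: After LLMs are integrated with vision capabilities and tuned on visual instructions, large vision-language models (LVLMs) often show a decline in instruction-following ability and fail to follow task instructions as expected, which hinders reliable deployment in practice.
Method: The study first quantitatively evaluates how LVLMs' instruction-following ability changes before and after fine-tuning. It then constructs new training datasets that highlight whether the output format is explicitly specified, and investigates how explicitly indicating the output format during fine-tuning affects instruction following.
Result: The evaluation confirms that LVLMs' instruction-following ability declines after fine-tuning on commonly used datasets. Models trained on datasets that include output-format instructions tend to follow instructions more accurately than models trained without them.
Conclusion: Including samples with explicit output-format instructions during visual instruction tuning may help mitigate the decline in instruction-following ability. This offers practical guidance for improving LVLM training and underlines the role of output-format specification in preserving the language abilities of the underlying model.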
📄 Abstract
Following the initial flourishing of large language models (LLMs), there has been a surge in proposed large vision-language models (LVLMs) that integrate LLMs with vision capabilities. However, it has been observed that LVLMs, after tuning to visual instruction using commonly used training datasets, often fail to exhibit the instruction-following ability that was present in the LLM before integration, leading to results in which they do not follow task instructions as expected. This study quantitatively demonstrates that LVLMs' instruction-following ability declines after fine-tuning and analyzes its underlying causes. In particular, we constructed new training datasets highlighting whether the output format is specified. Then, we investigated how explicitly indicating the output format during fine-tuning affects LVLMs' instruction-following ability. Our quantitative evaluation confirmed that LVLMs' instruction-following ability declines after fine-tuning with commonly used datasets. Furthermore, we found that LVLMs trained with datasets, including instructions on output format, tend to follow instructions more accurately than models that do not. These findings suggest that including samples with instructions on output format during (visual) instruction tuning may help mitigate the decline in instruction-following abilities.
cs.AI
[89] GamiBench: Evaluating Spatial Reasoning and 2D-to-3D Planning Capabilities of MLLMs with Origami Folding Tasks
Ryan Spencer, Roey Yaari, Ritvik Vemavarapu, Joyce Yang, Steven Ngo, Utkarsh Sharma
🧩 TL;DR
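This study introduces GamiBench, a benchmark for evaluating spatial reasoning and 2D-to-3D planning in multimodal large language models. Its origami-inspired folding tasks reveal significant weaknesses in the spatial understanding of current leading models.
📘 Detailed Summary
Motivation: Multimodal large language models are proficient in perception and instruction following but still struggle with spatial reasoning, that is, mentally tracking and manipulating objects across multiple views and over time. Most existing benchmarks focus on static images or final outputs and do not account for the sequential, viewpoint-dependent nature of spatial reasoning, so a dedicated evaluation framework is needed.
Method: GamiBench includes 186 regular and 186 impossible 2D crease patterns paired with their corresponding 3D folded shapes, produced from six distinct viewpoints. It defines three visual question-answering tasks: predicting 3D fold configurations, distinguishing valid viewpoints, and detecting impossible patterns. Unlike benchmarks that assess only final predictions, GamiBench evaluates the whole reasoning process and introduces two new diagnostic metrics, viewpoint consistency and impossible fold selection rate.
Result: Even leading models such as GPT-5 and Gemini-2.5-Pro struggle on single-step spatial understanding. The systematic evaluation exposes significant limitations in the spatial reasoning of current multimodal large language models, particularly on complex folding tasks.
Conclusion: GamiBench establishes a standardized framework for evaluating geometric understanding and spatial reasoning in multimodal large language models and exposes core deficiencies in current models' spatial cognition. It provides a comprehensive evaluation tool and points to directions for improving spatial reasoning.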
📄 Abstract
Multimodal large language models (MLLMs) are proficient in perception and instruction-following, but they still struggle with spatial reasoning: the ability to mentally track and manipulate objects across multiple views and over time. Spatial reasoning is a key component of human intelligence, but most existing benchmarks focus on static images or final outputs, failing to account for the sequential and viewpoint-dependent nature of this skill. To close this gap, we introduce GamiBench, a benchmark designed to evaluate spatial reasoning and 2D-to-3D planning in MLLMs through origami-inspired folding tasks. GamiBench includes 186 regular and 186 impossible 2D crease patterns paired with their corresponding 3D folded shapes, produced from six distinct viewpoints across three visual question-answering (VQA) tasks: predicting 3D fold configurations, distinguishing valid viewpoints, and detecting impossible patterns. Unlike previous benchmarks that assess only final predictions, GamiBench holistically evaluates the entire reasoning process--measuring cross-view consistency, physical feasibility through impossible-fold detection, and interpretation of intermediate folding steps. It further introduces new diagnostic metrics--viewpoint consistency (VC) and impossible fold selection rate (IFSR)--to measure how well models handle folds of varying complexity. Our experiments show that even leading models such as GPT-5 and Gemini-2.5-Pro struggle on single-step spatial understanding. These contributions establish a standardized framework for evaluating geometric understanding and spatial reasoning in MLLMs. Dataset and code: https://github.com/stvngo/GamiBench.
[90] SciEvalKit: An Open-source Evaluation Toolkit for Scientific General Intelligence
Yiheng Wang, Yixin Chen, Shuo Li, Yifan Zhou, Bo Liu, Hengjian Gao, Jiakang Yuan, Jia Bu, Wanghan Xu, Yuhao Zhou, Xiangyu Zhao, Zhiwang Zhou, Fengxiang Wang, Haodong Duan, Songyang Zhang, Jun Yao, Han Deng, Yizhou Wang, Jiabei Xiao, Jiaqi Liu, Encheng Su, Yujie Liu, Weida Wang, Junchi Yao, Shenghe Zheng, Haoran Sun, Runmin Ma, Xiangchao Yan, Bo Zhang, Dongzhan Zhou, Shufei Zhang, Peng Ye, Xiaosong Wang, Shixiang Tang, Wenlong Zhang, Lei Bai
🧩 TL;DR
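This paper introduces SciEvalKit, a unified benchmarking toolkit for evaluating AI models across a broad range of scientific disciplines and task capabilities. It aims to provide standardized yet customizable evaluation infrastructure for scientific foundation models and intelligent agents.
📘 Detailed Summary
Motivation: There is no unified evaluation platform focused on the core competencies of scientific intelligence, and general-purpose evaluation tools do not reflect the diversity and complexity of real scientific challenges. A standardized benchmarking framework is needed to assess models' scientific abilities across multiple domains.
Method: SciEvalKit assembles expert-grade scientific benchmarks covering core competencies: scientific multimodal perception, reasoning, and understanding, symbolic reasoning, code generation, hypothesis generation, and knowledge understanding. It supports six major scientific domains, spanning from physics and chemistry to astronomy and materials science, and provides a flexible, extensible evaluation pipeline with batch evaluation and custom model and dataset integration.
Result: The toolkit provides transparent, reproducible, and comparable evaluation results. Its benchmarks are curated from real-world, domain-specific datasets so that tasks reflect authentic scientific challenges. It establishes standardized infrastructure that bridges capability-based evaluation and disciplinary diversity, and it is open-sourced to foster community-driven progress in AI4Science.
Conclusion: By evaluating the core competencies of scientific intelligence in a unified way, SciEvalKit offers a standardized benchmarking framework for the next generation of scientific foundation models and intelligent agents. Its extensible architecture and open-source release are intended to support community collaboration in AI4Science and to fill a gap in scientific AI evaluation infrastructure.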
📄 Abstract
We introduce SciEvalKit, a unified benchmarking toolkit designed to evaluate AI models for science across a broad range of scientific disciplines and task capabilities. Unlike general-purpose evaluation platforms, SciEvalKit focuses on the core competencies of scientific intelligence, including Scientific Multimodal Perception, Scientific Multimodal Reasoning, Scientific Multimodal Understanding, Scientific Symbolic Reasoning, Scientific Code Generation, Science Hypothesis Generation and Scientific Knowledge Understanding. It supports six major scientific domains, spanning from physics and chemistry to astronomy and materials science. SciEvalKit builds a foundation of expert-grade scientific benchmarks, curated from real-world, domain-specific datasets, ensuring that tasks reflect authentic scientific challenges. The toolkit features a flexible, extensible evaluation pipeline that enables batch evaluation across models and datasets, supports custom model and dataset integration, and provides transparent, reproducible, and comparable results. By bridging capability-based evaluation and disciplinary diversity, SciEvalKit offers a standardized yet customizable infrastructure to benchmark the next generation of scientific foundation models and intelligent agents. The toolkit is open-sourced and actively maintained to foster community-driven development and progress in AI4Science.
[91] Learning Multi-Modal Mobility Dynamics for Generalized Next Location Recommendation
Junshu Dai, Yu Wang, Tongya Zheng, Wei Ji, Qinghong Guo, Ji Cao, Jie Song, Canghong Jin, Mingli Song
🧩 TL;DR
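This study proposes M³ob, a mobility prediction method enhanced with multi-modal spatial-temporal knowledge. By constructing a unified spatial-temporal relational graph and designing a cross-modal alignment mechanism, it significantly improves the performance and generalization of location recommendation in both normal and abnormal scenarios.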
📘 Detailed Summary
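Motivation: Existing human mobility prediction methods generalize poorly. Unimodal approaches are constrained by data sparsity and inherent biases, while multi-modal methods struggle to capture mobility dynamics because of the semantic gap between static multi-modal representations and spatial-temporal dynamics.
Method: The method first uses a large language model (LLM)-enhanced spatial-temporal knowledge graph (STKG) to capture functional semantics and spatial-temporal knowledge, and builds a unified spatial-temporal relational graph (STRG) for multi-modal representation. It then designs a gating mechanism to fuse the spatial-temporal graph representations of different modalities, and proposes an STKG-guided cross-modal alignment that injects spatial-temporal dynamic knowledge into the static image modality.
Result: Extensive experiments on six public datasets show that the method achieves consistent improvements in normal scenarios and exhibits significant generalization ability in abnormal scenarios, supporting the use of multi-modal spatial-temporal knowledge to characterize mobility dynamics.
Conclusion: With a mobility prediction framework enhanced by multi-modal spatial-temporal knowledge, the study addresses the limitations of existing methods in generalization and dynamic representation. It offers a more robust and adaptable solution for applications such as location recommendation and points to a new direction for multi-modal spatial-temporal data analysis.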
📄 Abstract
The precise prediction of human mobility has produced significant socioeconomic impacts, such as location recommendations and evacuation suggestions. However, existing methods suffer from limited generalization capability: unimodal approaches are constrained by data sparsity and inherent biases, while multi-modal methods struggle to effectively capture mobility dynamics caused by the semantic gap between static multi-modal representation and spatial-temporal dynamics. Therefore, we leverage multi-modal spatial-temporal knowledge to characterize mobility dynamics for the location recommendation task, dubbed Multi-Modal Mobility (M³ob). First, we construct a unified spatial-temporal relational graph (STRG) for multi-modal representation, by leveraging the functional semantics and spatial-temporal knowledge captured by the large language model (LLM)-enhanced spatial-temporal knowledge graph (STKG). Second, we design a gating mechanism to fuse spatial-temporal graph representations of different modalities, and propose an STKG-guided cross-modal alignment to inject spatial-temporal dynamic knowledge into the static image modality. Extensive experiments on six public datasets show that our proposed method not only achieves consistent improvements in normal scenarios but also exhibits significant generalization ability in abnormal scenarios.
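The abstract mentions a gating mechanism that fuses representations from different modalities but does not give its formula. The sketch below shows one common form of gated fusion, in which a sigmoid gate decides per dimension how much of each modality to keep. The function name `gated_fusion`, the per-dimension weight pairs, and the two-modality setup are assumptions made for illustration and are not taken from the paper.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(text_vec, image_vec, gate_weights, gate_bias):
    """Fuse two equal-length vectors with an element-wise sigmoid gate.

    Hypothetical parameterisation: gate_weights[i] is a pair
    (w_text, w_image) and gate_bias[i] a scalar for dimension i.
    """
    sizes = {len(text_vec), len(image_vec), len(gate_weights), len(gate_bias)}
    if len(sizes) != 1:
        raise ValueError("all inputs must have the same length")

    fused = []
    for t, v, (w_t, w_v), b in zip(text_vec, image_vec, gate_weights, gate_bias):
        g = sigmoid(w_t * t + w_v * v + b)  # gate value in (0, 1)
        fused.append(g * t + (1.0 - g) * v)  # convex mix of the two modalities
    return fused
```

With all gate weights and biases at zero the gate is 0.5 everywhere, so the output is the plain average of the two vectors. A large positive bias pushes the gate toward 1 and the output toward the first modality.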
[92] HiSciBench: A Hierarchical Multi-disciplinary Benchmark for Scientific Intelligence from Reading to Discovery
Yaping Zhang, Qixuan Zhang, Xingquan Zhang, Zhiyuan Chen, Wenwen Zhuang, Yupu Liang, Lu Xiang, Yang Zhao, Jiajun Zhang, Yu Zhou, Chengqing Zong
🧩 TL;DR
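This paper proposes HiSciBench, a hierarchical benchmark for scientific intelligence that evaluates foundation models across the complete scientific workflow. It covers five levels, from basic scientific literacy to scientific discovery, and addresses the fragmentation and narrow task focus of existing benchmarks.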
📘 Detailed Summary
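Motivation: Existing benchmarks for scientific intelligence are fragmented. Most focus on narrow tasks and fail to reflect the hierarchical and multi-disciplinary nature of real scientific inquiry, so they cannot fully assess foundation models across the complete scientific workflow. A comprehensive benchmark is therefore needed that systematically evaluates models along the whole path from basic understanding to creative discovery.
Method: The study proposes the hierarchical benchmark HiSciBench with five levels: Scientific Literacy, Literature Parsing, Literature-based Question Answering, Literature Review Generation, and Scientific Discovery. It spans six disciplines (mathematics, physics, chemistry, biology, geography, and astronomy) and contains 8,735 carefully curated instances. It supports multimodal inputs including text, equations, figures, and tables, as well as cross-lingual evaluation, within an integrated, dependency-aware evaluation framework.
Result: Comprehensive evaluation of leading models, including GPT-5 and DeepSeek-R1, reveals substantial performance gaps. Models reach up to 69% accuracy on basic literacy tasks, but performance drops sharply to 25% on discovery-level challenges, exposing serious shortcomings in advanced scientific reasoning. The results also demonstrate the benchmark's usefulness for cross-disciplinary and multimodal evaluation.
Conclusion: HiSciBench sets a new standard for evaluating scientific intelligence. It provides a detailed diagnosis of model capabilities at different stages of scientific reasoning, exposes the limits of current models in advanced scientific discovery, and offers actionable insights for developing more capable and more reliable models. The benchmark will be publicly released to support future research.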
📄 Abstract
The rapid advancement of large language models (LLMs) and multimodal foundation models has sparked growing interest in their potential for scientific research. However, scientific intelligence encompasses a broad spectrum of abilities ranging from understanding fundamental knowledge to conducting creative discovery, and existing benchmarks remain fragmented. Most focus on narrow tasks and fail to reflect the hierarchical and multi-disciplinary nature of real scientific inquiry. We introduce HiSciBench, a hierarchical benchmark designed to evaluate foundation models across five levels that mirror the complete scientific workflow: Scientific Literacy (L1), Literature Parsing (L2), Literature-based Question Answering (L3), Literature Review Generation (L4), and Scientific Discovery (L5). HiSciBench contains 8,735 carefully curated instances spanning six major scientific disciplines, including mathematics, physics, chemistry, biology, geography, and astronomy, and supports multimodal inputs including text, equations, figures, and tables, as well as cross-lingual evaluation. Unlike prior benchmarks that assess isolated abilities, HiSciBench provides an integrated, dependency-aware framework that enables detailed diagnosis of model capabilities across different stages of scientific reasoning. Comprehensive evaluations of leading models, including GPT-5, DeepSeek-R1, and several multimodal systems, reveal substantial performance gaps: while models achieve up to 69% accuracy on basic literacy tasks, performance declines sharply to 25% on discovery-level challenges. HiSciBench establishes a new standard for evaluating scientific intelligence and offers actionable insights for developing models that are not only more capable but also more reliable. The benchmark will be publicly released to facilitate future research.
[93] Geometric Structural Knowledge Graph Foundation Model
Ling Xin, Mojtaba Nayyeri, Zahra Makki Nayeri, Steffen Staab
🧩 TL;DR
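This paper proposes Gamma, a new foundation model for knowledge graph reasoning. It introduces multi-head geometric attention that uses several algebraic transformations to increase relational expressiveness, and it outperforms existing methods in zero-shot inductive link prediction.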
📘 Detailed Summary
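Motivation: Existing knowledge graph foundation models such as Ultra rely on a single relational transformation (for example, element-wise multiplication) in message passing. This design limits expressiveness and makes it hard to capture the different relational and structural patterns found across diverse graphs, so a more flexible relation modelling mechanism is needed.
Method: Gamma uses multi-head geometric attention with several parallel algebraic transformations, including real, complex, split-complex, and dual number based transformations, each modelling different relational structures. A relation-conditioned attention fusion mechanism adaptively fuses them at link level, using lightweight gating with entropy regularization so that the model emphasizes the most appropriate relational bias for each triple pattern.
Result: Comprehensive experiments on 56 diverse knowledge graphs show that Gamma consistently outperforms Ultra in zero-shot inductive link prediction, with a 5.5% improvement in mean reciprocal rank on the inductive benchmarks and a 4.4% improvement across all benchmarks, indicating the benefit of complementary geometric representations.
Conclusion: The study shows that multi-head geometric attention can substantially increase the expressiveness of knowledge graph foundation models. Combining several algebraic transformations goes beyond the expressiveness of any single space and provides a more flexible and powerful modelling framework for knowledge graph reasoning.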
📄 Abstract
Structural knowledge graph foundation models aim to generalize reasoning to completely new graphs with unseen entities and relations. A key limitation of existing approaches like Ultra is their reliance on a single relational transformation (e.g., element-wise multiplication) in message passing, which can constrain expressiveness and fail to capture diverse relational and structural patterns exhibited on diverse graphs. In this paper, we propose Gamma, a novel foundation model that introduces multi-head geometric attention to knowledge graph reasoning. Gamma replaces the single relational transformation with multiple parallel ones, including real, complex, split-complex, and dual number based transformations, each designed to model different relational structures. A relational conditioned attention fusion mechanism then adaptively fuses them at link level via a lightweight gating with entropy regularization, allowing the model to robustly emphasize the most appropriate relational bias for each triple pattern. We present a full formalization of these algebraic message functions and discuss how their combination increases expressiveness beyond any single space. Comprehensive experiments on 56 diverse knowledge graphs demonstrate that Gamma consistently outperforms Ultra in zero-shot inductive link prediction, with a 5.5% improvement in mean reciprocal rank on the inductive benchmarks and a 4.4% improvement across all benchmarks, highlighting benefits from complementary geometric representations.
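To make the four number systems named in the abstract concrete, the sketch below multiplies two-component values under real, complex, split-complex, and dual number rules and mixes the results with softmax gate weights. It only illustrates how the algebras differ. The function names, the two-component representation, and the simple softmax mix are assumptions. The paper's actual message functions and fusion mechanism are not given in the abstract.

```python
import math


def real_mul(x, r):
    """Element-wise real product on a pair of components."""
    (a, b), (c, d) = x, r
    return (a * c, b * d)


def complex_mul(x, r):
    """(a + b i)(c + d i) with i*i = -1."""
    (a, b), (c, d) = x, r
    return (a * c - b * d, a * d + b * c)


def split_complex_mul(x, r):
    """(a + b j)(c + d j) with j*j = +1."""
    (a, b), (c, d) = x, r
    return (a * c + b * d, a * d + b * c)


def dual_mul(x, r):
    """(a + b e)(c + d e) with e*e = 0."""
    (a, b), (c, d) = x, r
    return (a * c, a * d + b * c)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_heads(x, r, gate_scores):
    """Mix the four products with softmax weights (illustrative gate)."""
    heads = [real_mul(x, r), complex_mul(x, r),
             split_complex_mul(x, r), dual_mul(x, r)]
    weights = softmax(gate_scores)
    fused = tuple(sum(w * h[k] for w, h in zip(weights, heads)) for k in range(2))
    return fused, weights


def gate_entropy(weights):
    """Shannon entropy of the gate weights; an entropy regularizer would act on this."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)
```

The only difference between the complex, split-complex, and dual products is the sign or absence of the `b * d` term in the first component, which is what gives each head a different relational bias.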
[94] Multimodal Fact-Checking: An Agent-based Approach
Danni Xu, Shaojing Fan, Xuanang Cheng, Mohan Kankanhalli
🧩 TL;DR
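This paper presents the RW-Post dataset and the AgentFact framework to address key bottlenecks in multimodal misinformation detection. RW-Post is a high-quality, explainable dataset with detailed reasoning and verifiable evidence, while AgentFact is an agent-based collaborative framework that emulates the human verification workflow and substantially improves the accuracy and interpretability of multimodal fact-checking.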
📘 Detailed Summary
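Motivation: The rapid spread of multimodal misinformation poses a serious challenge for automated fact-checking systems. Existing approaches, including large vision language models and deep multimodal fusion methods, often fall short because of limited reasoning and shallow use of evidence. A key bottleneck is the lack of dedicated datasets that provide complete real-world multimodal misinformation instances together with annotated reasoning processes and verifiable evidence.
Method: The study introduces RW-Post, a high-quality and explainable dataset for real-world multimodal fact-checking that aligns real-world multimodal claims with their original social media posts, preserving the rich context in which the claims were made. Building on this dataset, it proposes AgentFact, an agent-based multimodal fact-checking framework with five specialized agents that handle strategy planning, high-quality evidence retrieval, visual analysis, reasoning, and explanation generation. The agents are orchestrated through an iterative workflow that alternates between evidence searching and task-aware evidence filtering and reasoning.
Result: Extensive experiments show that the combination of RW-Post and AgentFact substantially improves both the accuracy and the interpretability of multimodal fact-checking. By emulating the human verification workflow, the framework supports strategic decision-making and systematic evidence analysis, and performs strongly on real-world multimodal misinformation detection.
Conclusion: By providing a high-quality dataset and an agent-based collaborative framework, the study advances multimodal fact-checking. RW-Post addresses data scarcity and the lack of explainability, while AgentFact demonstrates the effectiveness of agent collaboration on complex multimodal reasoning tasks and suggests a direction for the design of future automated fact-checking systems.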
📄 Abstract
The rapid spread of multimodal misinformation poses a growing challenge for automated fact-checking systems. Existing approaches, including large vision language models (LVLMs) and deep multimodal fusion methods, often fall short due to limited reasoning and shallow evidence utilization. A key bottleneck is the lack of dedicated datasets that provide complete real-world multimodal misinformation instances accompanied by annotated reasoning processes and verifiable evidence. To address this limitation, we introduce RW-Post, a high-quality and explainable dataset for real-world multimodal fact-checking. RW-Post aligns real-world multimodal claims with their original social media posts, preserving the rich contextual information in which the claims are made. In addition, the dataset includes detailed reasoning and explicitly linked evidence, which are derived from human written fact-checking articles via a large language model assisted extraction pipeline, enabling comprehensive verification and explanation. Building upon RW-Post, we propose AgentFact, an agent-based multimodal fact-checking framework designed to emulate the human verification workflow. AgentFact consists of five specialized agents that collaboratively handle key fact-checking subtasks, including strategy planning, high-quality evidence retrieval, visual analysis, reasoning, and explanation generation. These agents are orchestrated through an iterative workflow that alternates between evidence searching and task-aware evidence filtering and reasoning, facilitating strategic decision-making and systematic evidence analysis. Extensive experimental results demonstrate that the synergy between RW-Post and AgentFact substantially improves both the accuracy and interpretability of multimodal fact-checking.
[95] Problems With Large Language Models for Learner Modelling: Why LLMs Alone Fall Short for Responsible Tutoring in K-12 Education
Danial Hooshyar, Yeongwook Yang, Gustav Šíř, Tommi Kärkkäinen, Raija Hämäläinen, Mutlu Cukurova, Roger Azevedo
🧩 TL;DR
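Through an empirical comparison in K-12 settings, this study finds that a deep knowledge tracing (DKT) model predicts learners' evolving knowledge states markedly better than a large language model (LLM). Even a fine-tuned LLM does not match DKT's accuracy and temporal coherence, indicating that LLMs alone cannot replace traditional learner modelling.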
📘 Detailed Summary
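Motivation: The rapid rise of LLM-based tutors in K-12 education has fostered a misconception that generative models can replace traditional learner modelling for adaptive instruction. The EU AI Act classifies K-12 education as a high-risk domain requiring responsible design, so this study systematically evaluates the accuracy, reliability, and temporal coherence of LLMs when assessing learners' evolving knowledge.
Method: The study compares a deep knowledge tracing (DKT) model with a widely used LLM, evaluated zero-shot and fine-tuned, on a large open-access dataset. Performance is assessed with quantitative metrics (such as AUC for next-step correctness prediction) and qualitative analysis (such as the temporal coherence of multi-skill mastery estimates).
Result: DKT achieves the highest discrimination performance on next-step correctness prediction (AUC = 0.83) and consistently outperforms the LLM across settings. Fine-tuning improves the LLM's AUC by approximately 8% over the zero-shot baseline, but it remains 6% below DKT and produces higher error rates early in the sequence. Temporal analyses show that DKT maintains stable, directionally correct mastery updates, whereas the LLM variants exhibit substantial temporal weaknesses, including inconsistent and wrong-direction updates.
Conclusion: The findings suggest that LLMs alone are unlikely to match the effectiveness of established intelligent tutoring systems, and that responsible tutoring requires hybrid frameworks that incorporate learner modelling. Even after compute-intensive fine-tuning (nearly 198 hours of high-compute training), the LLM does not overcome its limitations in assessing knowledge states, while DKT is better in both computational cost and performance.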
📄 Abstract
The rapid rise of large language model (LLM)-based tutors in K-12 education has fostered a misconception that generative models can replace traditional learner modelling for adaptive instruction. This is especially problematic in K-12 settings, which the EU AI Act classifies as a high-risk domain requiring responsible design. Motivated by these concerns, this study synthesises evidence on limitations of LLM-based tutors and empirically investigates one critical issue: the accuracy, reliability, and temporal coherence of assessing learners' evolving knowledge over time. We compare a deep knowledge tracing (DKT) model with a widely used LLM, evaluated zero-shot and fine-tuned, using a large open-access dataset. Results show that DKT achieves the highest discrimination performance (AUC = 0.83) on next-step correctness prediction and consistently outperforms the LLM across settings. Although fine-tuning improves the LLM's AUC by approximately 8% over the zero-shot baseline, it remains 6% below DKT and produces higher early-sequence errors, where incorrect predictions are most harmful for adaptive support. Temporal analyses further reveal that DKT maintains stable, directionally correct mastery updates, whereas LLM variants exhibit substantial temporal weaknesses, including inconsistent and wrong-direction updates. These limitations persist despite the fine-tuned LLM requiring nearly 198 hours of high-compute training, far exceeding the computational demands of DKT. Our qualitative analysis of multi-skill mastery estimation further shows that, even after fine-tuning, the LLM produced inconsistent mastery trajectories, while DKT maintained smooth and coherent updates. Overall, the findings suggest that LLMs alone are unlikely to match the effectiveness of established intelligent tutoring systems, and that responsible tutoring requires hybrid frameworks that incorporate learner modelling.
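The headline metric here is AUC on next-step correctness prediction. As a reminder of what that number measures, the sketch below computes AUC directly from its ranking definition: the probability that a randomly chosen correct response receives a higher predicted score than a randomly chosen incorrect one, with ties counted as one half. The function name and the toy data are illustrative and are not taken from the paper.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 if the learner answered the next step correctly, else 0.
    scores: the model's predicted probability of a correct answer.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # correctly ranked pair
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Toy example: three of the four positive/negative pairs are ranked correctly.
print(auc([1, 1, 0, 0], [0.9, 0.3, 0.6, 0.1]))  # prints 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives a scale for reading the reported 0.83.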
[96] Benchmark Success, Clinical Failure: When Reinforcement Learning Optimizes for Benchmarks, Not Patients
Armin Berger, Manuela Bergau, Helen Schneider, Saad Ahmad, Tom Anglim Lagones, Gianluca Brugnara, Martha Foltyn-Dumitru, Kai Schlamp, Philipp Vollmuth, Rafet Sifa
🧩 TL;DR
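This study presents ChexReason, a vision-language model trained with a resource-constrained R1-style methodology (SFT followed by GRPO). On medical imaging reasoning tasks it reveals a fundamental tension between the reinforcement learning paradigm and generalization: GRPO recovers in-distribution performance but degrades cross-dataset transferability.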
📘 Detailed Summary
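Motivation: Reinforcement learning has advanced reasoning in large language models, but its resource-constrained application to medical imaging remains underexplored. In particular, the effect of RL optimization on a model's ability to generalize across institutions has not been studied in depth.
Method: The study uses an R1-style methodology: supervised fine-tuning (SFT) followed by reinforcement learning with GRPO. The whole training process uses only 2,000 SFT samples, 1,000 RL samples, and a single A100 GPU to build the ChexReason vision-language model.
Result: GRPO substantially improves in-distribution performance (a 23% improvement on CheXpert, reaching macro-F1 = 0.346) but degrades cross-dataset transferability (a 19% drop on NIH). The same pattern appears in the high-resource model NV-Reason-CXR-3B, suggesting the issue stems from the RL paradigm rather than from model scale.
Conclusion: The study identifies a generalization paradox in RL optimization. The SFT checkpoint uniquely improves cross-institution performance, indicating that teacher-guided reasoning captures more institution-agnostic features. Structured reasoning scaffolds benefit general-purpose vision-language models but offer minimal gain for medically pre-trained models. For clinical deployment that requires robustness across populations, curated supervised fine-tuning may outperform aggressive RL optimization.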
📄 Abstract
Recent Reinforcement Learning (RL) advances for Large Language Models (LLMs) have improved reasoning tasks, yet their resource-constrained application to medical imaging remains underexplored. We introduce ChexReason, a vision-language model trained via R1-style methodology (SFT followed by GRPO) using only 2,000 SFT samples, 1,000 RL samples, and a single A100 GPU. Evaluations on CheXpert and NIH benchmarks reveal a fundamental tension: GRPO recovers in-distribution performance (23% improvement on CheXpert, macro-F1 = 0.346) but degrades cross-dataset transferability (19% drop on NIH). This mirrors high-resource models like NV-Reason-CXR-3B, suggesting the issue stems from the RL paradigm rather than scale. We identify a generalization paradox where the SFT checkpoint uniquely improves on NIH before optimization, indicating teacher-guided reasoning captures more institution-agnostic features. Furthermore, cross-model comparisons show structured reasoning scaffolds benefit general-purpose VLMs but offer minimal gain for medically pre-trained models. Consequently, curated supervised fine-tuning may outperform aggressive RL for clinical deployment requiring robustness across diverse populations.
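The abstract relies on GRPO without restating it. As commonly described, GRPO scores each sampled response relative to the other responses drawn for the same prompt, instead of using a learned value baseline. The sketch below shows that group-relative normalization only. The function name, the use of the population standard deviation, and the `eps` constant are assumptions, and the reward design used for ChexReason is not given in the abstract.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one prompt's sampled responses.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    that beat their siblings get positive advantages and the rest negative.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; a choice made for this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one prompt, rewarded 1 if correct and 0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

When every response in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal. With only 1,000 RL samples, that is one reason the choice of reward and prompts matters.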
[97] TCEval: Using Thermal Comfort to Assess Cognitive and Perceptual Abilities of AI
Jingming Li
🧩 TL;DR
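This paper proposes TCEval, the first framework that uses thermal comfort scenarios to assess core cognitive capacities of AI (cross-modal reasoning, causal association, and adaptive decision-making). LLM agents simulate human thermal comfort perception and decision-making, giving a more ecologically valid Cognitive Turing Test for AI systems.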
📘 Detailed Summary
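Motivation: A critical gap exists in task-specific LLM benchmarks. Thermal comfort is a complex interplay of environmental factors and personal perception that involves sensory integration and adaptive decision-making, which makes it a suitable paradigm for evaluating the real-world cognitive capabilities of AI systems. An ecologically valid framework for assessing these core capacities is therefore needed.
Method: TCEval initializes LLM agents with virtual personality attributes, guides them to generate clothing insulation selections and thermal comfort feedback, and validates the outputs against the ASHRAE Global Database and the Chinese Thermal Comfort Database. It assesses three core cognitive capacities: cross-modal reasoning, causal association, and adaptive decision-making.
Result: Experiments on four LLMs show that agent feedback has limited exact alignment with human data, but directional consistency improves significantly with a 1 PMV tolerance. Statistical tests indicate that LLM-generated PMV distributions diverge markedly from human data, and agents perform near-randomly in discrete thermal comfort classification.
Conclusion: The results support the feasibility of TCEval as an ecologically valid Cognitive Turing Test for AI. Current LLMs show foundational cross-modal reasoning ability but lack a precise causal understanding of the nonlinear relationships between thermal comfort variables. The framework complements traditional benchmarks by shifting the focus of AI evaluation from abstract task proficiency to embodied, context-aware perception and decision-making, with insights for human-centric applications such as smart buildings.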
📄 Abstract
A critical gap exists in LLM task-specific benchmarks. Thermal comfort, a sophisticated interplay of environmental factors and personal perceptions involving sensory integration and adaptive decision-making, serves as an ideal paradigm for evaluating real-world cognitive capabilities of AI systems. To address this, we propose TCEval, the first evaluation framework that assesses three core cognitive capacities of AI, cross-modal reasoning, causal association, and adaptive decision-making, by leveraging thermal comfort scenarios and large language model (LLM) agents. The methodology involves initializing LLM agents with virtual personality attributes, guiding them to generate clothing insulation selections and thermal comfort feedback, and validating outputs against the ASHRAE Global Database and Chinese Thermal Comfort Database. Experiments on four LLMs show that while agent feedback has limited exact alignment with humans, directional consistency improves significantly with a 1 PMV tolerance. Statistical tests reveal that LLM-generated PMV distributions diverge markedly from human data, and agents perform near-randomly in discrete thermal comfort classification. These results confirm the feasibility of TCEval as an ecologically valid Cognitive Turing Test for AI, demonstrating that current LLMs possess foundational cross-modal reasoning ability but lack precise causal understanding of the nonlinear relationships between variables in thermal comfort. TCEval complements traditional benchmarks, shifting AI evaluation focus from abstract task proficiency to embodied, context-aware perception and decision-making, offering valuable insights for advancing AI in human-centric applications like smart buildings.
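The abstract contrasts exact alignment with agreement under a 1 PMV tolerance but does not define either metric. The sketch below shows one plausible reading: compare agent and human votes on the same thermal sensation scale and report both the exact-match rate and the share of votes that fall within the tolerance. The function name, the tolerance rule, and the example votes are assumptions and are not taken from the paper.

```python
def agreement_rates(agent_votes, human_votes, tolerance=1.0):
    """Return (exact_rate, within_tolerance_rate) for paired votes.

    Votes are assumed to lie on the same numeric scale, for example a
    seven-point thermal sensation scale from -3 (cold) to +3 (hot).
    """
    if len(agent_votes) != len(human_votes) or not agent_votes:
        raise ValueError("vote lists must be non-empty and of equal length")

    pairs = list(zip(agent_votes, human_votes))
    exact = sum(1 for a, h in pairs if a == h) / len(pairs)
    within = sum(1 for a, h in pairs if abs(a - h) <= tolerance) / len(pairs)
    return exact, within


# Hypothetical votes: two exact matches, one off by 1, one off by 2.
print(agreement_rates([0, 1, -1, 2], [0, 0, 1, 2]))  # prints (0.5, 0.75)
```

Under this reading, a large gap between the two rates would mean the agents often land near the human vote without matching it exactly, which is consistent with the pattern the abstract reports.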
[98] Agentic Physical AI toward a Domain-Specific Foundation Model for Nuclear Reactor Control
Yoonpyo Lee, Kazuma Kobayashi, Sai Puppala, Sajedul Talukder, Seid Koric, Souvik Chakraborty, Syed Bahauddin Alam
🧩 TL;DR
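This paper proposes an Agentic Physical AI paradigm for physical systems. It builds a compact language model as a domain-specific foundation model whose policy optimization is driven by physics-based validation rather than perceptual inference, marking a shift from imitation learning to execution guarantees.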
📘 Detailed Summary
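Motivation: General-purpose foundation models face a fundamental barrier in controlling physical systems. Even frontier vision-language models achieve only 50-53% accuracy on basic quantitative physics tasks, behaving as approximate guessers that preserve semantic plausibility while violating physical constraints. This input unfaithfulness is a structural limitation rather than a scaling deficiency: perception-centric architectures optimize parameter-space imitation, whereas safety-critical control demands outcome-space guarantees over executed actions.
Method: The study proposes a fundamentally different pathway toward domain-specific foundation models, introducing compact language models that operate as Agentic Physical AI, in which policy optimization is driven by physics-based validation rather than perceptual inference. A 360-million-parameter model is trained on synthetic reactor control scenarios, with the dataset scaled from 10^3 to 10^5 examples. The model autonomously rejects approximately 70% of the training distribution and concentrates 95% of runtime execution on a single-bank strategy.
Result: The approach induces a sharp phase transition that is absent in general-purpose models. Small-scale systems exhibit high-variance imitation with catastrophic tail risk, whereas large-scale models undergo a variance collapse of more than 500x, stabilizing execution-level behavior. The learned representations transfer across distinct physics and continuous input modalities without architectural modification, and the model shows a strong strategy preference despite balanced exposure to four actuation families.
Conclusion: The study shows that a physics-validation-driven Agentic AI paradigm can overcome the structural limitations of general-purpose foundation models at the control interface, moving from perceptual imitation to execution guarantees. The compact model achieves stable control through variance collapse and strategy concentration, offering a verifiable AI approach for safety-critical physical systems and a new direction for domain-specific foundation models.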
📄 Abstract
The prevailing paradigm in AI for physical systems, scaling general-purpose foundation models toward universal multimodal reasoning, confronts a fundamental barrier at the control interface. Recent benchmarks show that even frontier vision-language models achieve only 50-53% accuracy on basic quantitative physics tasks, behaving as approximate guessers that preserve semantic plausibility while violating physical constraints. This input unfaithfulness is not a scaling deficiency but a structural limitation. Perception-centric architectures optimize parameter-space imitation, whereas safety-critical control demands outcome-space guarantees over executed actions. Here, we present a fundamentally different pathway toward domain-specific foundation models by introducing compact language models operating as Agentic Physical AI, in which policy optimization is driven by physics-based validation rather than perceptual inference. We train a 360-million-parameter model on synthetic reactor control scenarios, scaling the dataset from 10^3 to 10^5 examples. This induces a sharp phase transition absent in general-purpose models. Small-scale systems exhibit high-variance imitation with catastrophic tail risk, while large-scale models undergo variance collapse exceeding 500x reduction, stabilizing execution-level behavior. Despite balanced exposure to four actuation families, the model autonomously rejects approximately 70% of the training distribution and concentrates 95% of runtime execution on a single-bank strategy. Learned representations transfer across distinct physics and continuous input modalities without architectural modification.
[99] MindWatcher: Toward Smarter Multimodal Tool-Integrated Reasoning
Jiawei Chen, Xintian Shen, Lihao Zheng, Zhenwei Shao, Hongyuan Zhang, Pengfei Yu, Xudong Rao, Ning Mao, Xiaobo Liu, Lian Wen, Chaoqun Du, Feng Gu, Wei He, Qizhen Li, Shanshan Li, Zide Liu, Jing Luo, Lifu Mu, Xuhao Pan, Chang Ren, Haoyi Sun, Qian Wang, Wei Wang, Hongfu Yang, Jiqing Zhan, Chunpeng Zhou, Zheng Zhou, Hao Ma, Tao Wei, Pan Zhou, Wei Chen
🧩 TL;DR
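This paper presents MindWatcher, a tool-integrated reasoning agent that combines interleaved thinking with multimodal chain-of-thought reasoning. It autonomously decides whether and how to invoke diverse tools, and with a small model it matches or exceeds the performance of larger models.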
📘 Detailed Summary
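Motivation: Traditional workflow-based agents exhibit limited intelligence on real-world problems that require tool invocation, and existing tool-integrated reasoning agents still fall short in autonomous reasoning and multi-step interaction with the environment. A more flexible mechanism for switching between thinking and tool calling, together with stronger multimodal processing, is needed.
Method: MindWatcher adopts an interleaved thinking paradigm that lets the model switch between thinking and tool calling at any intermediate stage, and integrates multimodal chain-of-thought reasoning so it can manipulate images during reasoning. The work implements automated data auditing and evaluation pipelines, manually curates high-quality training datasets, and constructs the MindWatcher-Evaluate Bench benchmark. The agent is equipped with a comprehensive suite of auxiliary reasoning tools and a large-scale, high-quality local image retrieval database covering eight categories.
Result: Experiments show that MindWatcher matches or exceeds the performance of larger or more recent models through superior tool invocation. They also uncover key insights for agent training, such as the genetic inheritance phenomenon in agentic reinforcement learning. Despite its small size, the model shows robust object recognition thanks to the local image retrieval database.
Conclusion: The study demonstrates the effectiveness of combining interleaved thinking with multimodal chain-of-thought reasoning and offers a new paradigm for designing tool-integrated reasoning agents. It also reveals important phenomena in agent training, provides practical guidance for designing more efficient training infrastructure, and advances the ability of small models to handle complex tasks through tool invocation.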
📄 Abstract
Traditional workflow-based agents exhibit limited intelligence when addressing real-world problems requiring tool invocation. Tool-integrated reasoning (TIR) agents capable of autonomous reasoning and tool invocation are rapidly emerging as a powerful approach for complex decision-making tasks involving multi-step interactions with external environments. In this work, we introduce MindWatcher, a TIR agent integrating interleaved thinking and multimodal chain-of-thought (CoT) reasoning. MindWatcher can autonomously decide whether and how to invoke diverse tools and coordinate their use, without relying on human prompts or workflows. The interleaved thinking paradigm enables the model to switch between thinking and tool calling at any intermediate stage, while its multimodal CoT capability allows manipulation of images during reasoning to yield more precise search results. We implement automated data auditing and evaluation pipelines, complemented by manually curated high-quality datasets for training, and we construct a benchmark, called MindWatcher-Evaluate Bench (MWE-Bench), to evaluate its performance. MindWatcher is equipped with a comprehensive suite of auxiliary reasoning tools, enabling it to address broad-domain multimodal problems. A large-scale, high-quality local image retrieval database, covering eight categories including cars, animals, and plants, endows the model with robust object recognition despite its small size. Finally, we design a more efficient training infrastructure for MindWatcher, enhancing training speed and hardware utilization. Experiments not only demonstrate that MindWatcher matches or exceeds the performance of larger or more recent models through superior tool invocation, but also uncover critical insights for agent training, such as the genetic inheritance phenomenon in agentic RL.
[100] AKG kernel Agent: A Multi-Agent Framework for Cross-Platform Kernel Synthesis
Jinye Du, Quan Yuan, Zuyao Zhang, Yanzhi Yi, Jiahui Hu, Wangyi Chen, Yiyang Zhu, Qishui Zheng, Wenxiang Zou, Xiangyu Chang, Zuohe Zheng, Zichun Ye, Chao Liu, Shanni Li, Renwei Zhang, Yiping Deng, Xinwei Hu, Xuefeng Jin, Jie Zhao
🧩 TL;DR
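This paper presents AKG kernel agent, a multi-agent system that automates kernel generation, migration, and performance tuning to address the bottleneck of developing high-performance computation kernels for modern AI models. It achieves an average speedup of 1.46× on the KernelBench benchmark.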
📘 Detailed Summary
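Motivation: Modern AI models increasingly demand high-performance computation kernels. The growing complexity of LLMs, multimodal architectures, and recommendation systems, combined with techniques such as sparsity and quantization, creates significant computational challenges. Frequent hardware updates and diverse chip architectures complicate matters further, since each platform needs tailored kernel implementations. Manual optimization cannot keep pace with these demands and has become a critical bottleneck in AI system development.
Method: The paper proposes AKG kernel agent, a multi-agent system that automates kernel generation, migration, and performance tuning. It is designed to support multiple domain-specific languages (DSLs), including Triton, TileLang, CPP, and CUDA-C, so that it can target different hardware backends while maintaining correctness and portability. Its modular design allows rapid integration of new DSLs and hardware targets.
Result: Evaluated on KernelBench using the Triton DSL across GPU and NPU backends, AKG kernel agent achieves an average speedup of 1.46× over PyTorch Eager baseline implementations, demonstrating its effectiveness in accelerating kernel development for modern AI workloads.
Conclusion: The study shows that automating kernel development with LLM code generation is feasible and offers a practical way to relieve the computational bottleneck in AI system development. The modular design, with support for multiple DSLs and hardware backends, lays the groundwork for extension to new hardware architectures and may substantially speed up the deployment and optimization of AI models.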
📄 Abstract
Modern AI models demand high-performance computation kernels. The growing complexity of LLMs, multimodal architectures, and recommendation systems, combined with techniques like sparsity and quantization, creates significant computational challenges. Moreover, frequent hardware updates and diverse chip architectures further complicate this landscape, requiring tailored kernel implementations for each platform. However, manual optimization cannot keep pace with these demands, creating a critical bottleneck in AI system development. Recent advances in LLM code generation capabilities have opened new possibilities for automating kernel development. In this work, we propose AKG kernel agent (AI-driven Kernel Generator), a multi-agent system that automates kernel generation, migration, and performance tuning. AKG kernel agent is designed to support multiple domain-specific languages (DSLs), including Triton, TileLang, CPP, and CUDA-C, enabling it to target different hardware backends while maintaining correctness and portability. The system's modular design allows rapid integration of new DSLs and hardware targets. When evaluated on KernelBench using Triton DSL across GPU and NPU backends, AKG kernel agent achieves an average speedup of 1.46× over PyTorch Eager baseline implementations, demonstrating its effectiveness in accelerating kernel development for modern AI workloads.