cs.CV [Total: 31]
cs.CL [Total: 4]
cs.AI [Total: 4]

cs.CV

[1] VL4Gaze: Unleashing Vision-Language Models for Gaze Following

Shijing Wang, Chaoqun Cui, Yaping Huang, Hyung Jin Chang, Yihua Cheng

🧩 TL;DR

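This paper introduces VL4Gaze, the first large-scale benchmark for evaluating and training vision-language models on gaze understanding. It contains 489K automatically generated question-answer pairs covering four complementary gaze understanding tasks, and experiments show that large-scale VLMs struggle to reliably infer gaze semantics and spatial localization without task-specific supervision.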

📘 Detailed Summary

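Motivation: Human gaze provides key cues for understanding attention, intention, and social interaction in visual scenes, yet gaze understanding remains underexplored in current vision-language models. No benchmark systematically evaluates or trains models for gaze understanding, so it is unknown whether this ability emerges naturally from general-purpose vision-language pre-training.

Method: The study introduces the VL4Gaze benchmark, which contains 489K automatically generated question-answer pairs across 124K images and casts gaze understanding as a unified visual question answering problem through four complementary tasks: gaze object description, gaze direction description, gaze point location, and ambiguous question recognition. Commercial and open-source VLMs are evaluated under in-context learning and fine-tuning settings.

Result: Even large-scale vision-language models struggle to reliably infer gaze semantics and spatial localization without task-specific supervision. Training on VL4Gaze yields substantial and consistent improvements on all tasks, underscoring the importance of targeted multi-task supervision for developing gaze understanding in VLMs.

Conclusion: The study highlights gaze understanding as a key capability for vision-language models and shows that general-purpose pre-training alone is not enough to acquire it; targeted multi-task supervision is required. The release of VL4Gaze will support further research in this direction and provides a tool for building more comprehensive visual understanding systems.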

📄 Abstract

Human gaze provides essential cues for interpreting attention, intention, and social interaction in visual scenes, yet gaze understanding remains largely unexplored in current vision-language models (VLMs). While recent VLMs achieve strong scene-level reasoning across a range of visual tasks, there exists no benchmark that systematically evaluates or trains them for gaze interpretation, leaving open the question of whether gaze understanding can emerge from general-purpose vision-language pre-training. To address this gap, we introduce VL4Gaze, the first large-scale benchmark designed to investigate, evaluate, and unlock the potential of VLMs for gaze understanding. VL4Gaze contains 489K automatically generated question-answer pairs across 124K images and formulates gaze understanding as a unified VQA problem through four complementary tasks: (1) gaze object description, (2) gaze direction description, (3) gaze point location, and (4) ambiguous question recognition. We comprehensively evaluate both commercial and open-source VLMs under in-context learning and fine-tuning settings. The results show that even large-scale VLMs struggle to reliably infer gaze semantics and spatial localization without task-specific supervision. In contrast, training on VL4Gaze brings substantial and consistent improvements across all tasks, highlighting the importance of targeted multi-task supervision for developing gaze understanding capabilities in VLMs. We will release the dataset and code to support further research and development in this direction.

[2] NULLBUS: Multimodal Mixed-Supervision for Breast Ultrasound Segmentation via Nullable Global-Local Prompts

Raja Mallina, Bryar Shareef

🧩 TL;DR

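This paper proposes NullBUS, a multimodal mixed-supervision framework for breast ultrasound segmentation. It introduces a nullable prompt mechanism to handle missing text metadata, allowing a single model to train on data with and without prompts, and achieves state-of-the-art performance on three public datasets.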

📘 Detailed Summary

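Motivation: In breast ultrasound (BUS) segmentation, prompt-based methods improve performance when text or spatial prompts are available, but many public BUS datasets lack reliable metadata or reports. This restricts training to small multimodal subsets and reduces robustness, so segmentation under mixed prompt availability needs to be addressed.

Method: NullBUS is a multimodal mixed-supervision framework built on nullable prompts, implemented as learnable null embeddings with presence masks. The model falls back to image-only evidence when metadata are missing and uses the text when it is present, so a single model learns from images with and without prompts.

Result: Evaluated on a unified pool of three public BUS datasets, NullBUS achieves a mean IoU of 0.8568 and a mean Dice score of 0.9103, reaching state-of-the-art performance under mixed prompt availability.

Conclusion: The study shows that nullable prompts are an effective way to handle missing metadata and offers a practical mixed-supervision solution for multimodal medical image segmentation, one that makes full use of limited multimodal data while keeping the model robust in real clinical settings.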

📄 Abstract

Breast ultrasound (BUS) segmentation provides lesion boundaries essential for computer-aided diagnosis and treatment planning. While promptable methods can improve segmentation performance and tumor delineation when text or spatial prompts are available, many public BUS datasets lack reliable metadata or reports, constraining training to small multimodal subsets and reducing robustness. We propose NullBUS, a multimodal mixed-supervision framework that learns from images with and without prompts in a single model. To handle missing text, we introduce nullable prompts, implemented as learnable null embeddings with presence masks, enabling fallback to image-only evidence when metadata are absent and the use of text when present. Evaluated on a unified pool of three public BUS datasets, NullBUS achieves a mean IoU of 0.8568 and a mean Dice of 0.9103, demonstrating state-of-the-art performance under mixed prompt availability.
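
The nullable-prompt idea described above can be illustrated with a minimal sketch. It assumes a toy 4-dimensional embedding and a presence flag per sample; `null_embedding`, `resolve_prompt`, and all values are hypothetical names and numbers chosen for illustration, not the authors' implementation. In a real model the null embedding would be a trainable parameter.

```python
EMBED_DIM = 4

# Hypothetical null embedding; a real model would learn these values.
null_embedding = [0.01, -0.02, 0.03, 0.0]

def resolve_prompt(text_embedding, present):
    """Select the text embedding when metadata is present, else the null embedding."""
    mask = 1.0 if present else 0.0  # presence mask for this sample
    source = text_embedding if text_embedding is not None else [0.0] * EMBED_DIM
    return [mask * t + (1.0 - mask) * n for t, n in zip(source, null_embedding)]

batch = [
    ([0.5, 0.1, -0.3, 0.2], True),  # image that comes with a text prompt
    (None, False),                  # image without any metadata
]
prompts = [resolve_prompt(emb, present) for emb, present in batch]
```

Writing the selection as a masked blend rather than an `if` branch mirrors how a batched tensor implementation would typically mix prompted and unprompted samples in one forward pass.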

[3] Transductive Visual Programming: Evolving Tool Libraries from Experience for Spatial Reasoning

Shengguang Wu, Xiaohan Wang, Yuhui Zhang, Hao Zhu, Serena Yeung-Levy

🧩 TL;DR

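This paper proposes Transductive Visual Programming (TVP), a framework that improves visual programming systems by building tools from its own experience rather than from speculation. TVP accumulates experience while solving 3D spatial reasoning problems, abstracts reusable tools from it, and achieves state-of-the-art performance on Omni3D-Bench.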

📘 Detailed Summary

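Motivation: Existing visual programming methods are limited on 3D spatial reasoning tasks: they either rely on fixed toolsets or induce tools speculatively before solving any problems, which produces suboptimal programs and poor utilization of the induced tools. This prevents visual programming systems from handling complex spatial reasoning tasks that require precise geometric calculations.

Method: TVP first solves problems with basic tools and accumulates the resulting solutions in an Example Library. It then abstracts recurring patterns from these programs into reusable higher-level tools, building an evolving Tool Library. This experience-driven approach to tool creation lets the system tackle new problems with increasingly powerful tools learned from experience.

Result: On Omni3D-Bench, TVP achieves state-of-the-art performance, outperforming GPT-4o by 22% and the previous best visual programming system by 11%. Transductively learned tools are used as core program dependencies 5x more often than inductively created ones, indicating more effective tool discovery and reuse. The evolved tools also generalize well to unseen spatial tasks from the SpatialScore-Hard collection, achieving superior performance without any test-set-specific modification.

Conclusion: The study establishes experience-driven transductive tool creation as a powerful paradigm for building self-evolving visual programming agents that handle challenging spatial reasoning tasks. By building tools from its own experience rather than from speculation, TVP markedly improves tool discovery and reuse and opens a new direction for applying visual programming systems to complex geometric reasoning.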

📄 Abstract

Spatial reasoning in 3D scenes requires precise geometric calculations that challenge vision-language models. Visual programming addresses this by decomposing problems into steps calling specialized tools, yet existing methods rely on either fixed toolsets or speculative tool induction before solving problems, resulting in suboptimal programs and poor utilization of induced tools. We present Transductive Visual Programming (TVP), a novel framework that builds new tools from its own experience rather than speculation. TVP first solves problems using basic tools while accumulating experiential solutions into an Example Library, then abstracts recurring patterns from these programs into reusable higher-level tools for an evolving Tool Library. This allows TVP to tackle new problems with increasingly powerful tools learned from experience. On Omni3D-Bench, TVP achieves state-of-the-art performance, outperforming GPT-4o by 22% and the previous best visual programming system by 11%. Our transductively learned tools are used 5x more frequently as core program dependencies than inductively created ones, demonstrating more effective tool discovery and reuse. The evolved tools also show strong generalization to unseen spatial tasks, achieving superior performance on benchmarks from the SpatialScore-Hard collection without any test-set-specific modification. Our work establishes experience-driven transductive tool creation as a powerful paradigm for building self-evolving visual programming agents that effectively tackle challenging spatial reasoning tasks. We release our code at https://transductive-visualprogram.github.io/.
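
As a rough illustration of abstracting recurring patterns from an Example Library into a Tool Library, the sketch below counts consecutive tool-call pairs that recur across solved programs and promotes them to composite tools. This is a deliberately simplified stand-in: the abstract does not say how TVP performs the abstraction, and the tool names, the pair-counting rule, and the `min_count` threshold are all assumptions made for illustration.

```python
from collections import Counter

def recurring_patterns(programs, length=2, min_count=2):
    """Return tool-call subsequences that appear in at least min_count programs."""
    counts = Counter()
    for program in programs:
        seen = set()
        for i in range(len(program) - length + 1):
            seen.add(tuple(program[i:i + length]))
        counts.update(seen)  # count each pattern at most once per program
    return [pattern for pattern, count in counts.items() if count >= min_count]

# Hypothetical Example Library: each solved problem is a list of basic tool calls.
example_library = [
    ["detect", "estimate_depth", "compute_distance"],
    ["detect", "estimate_depth", "compare_height"],
    ["detect", "crop", "classify"],
]

# Hypothetical Tool Library: composite tools keyed by a generated name.
tool_library = {"+".join(p): p for p in recurring_patterns(example_library)}
```

Only `detect` followed by `estimate_depth` occurs in two programs here, so it is the single pattern promoted to a composite tool.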

[4] Input-Adaptive Visual Preprocessing for Efficient Fast Vision-Language Model Inference

Putu Indah Githa Cahyani, Komang David Dananjaya Suartana, Novanto Yudistira

🧩 TL;DR

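This paper proposes an adaptive visual preprocessing method that reduces visual redundancy by dynamically adjusting input resolution and spatial coverage, substantially improving the deployment efficiency of vision-language models. The method integrates with FastVLM without architectural changes or retraining.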

📘 Detailed Summary

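Motivation: Vision-language models incur high inference latency and computational cost when processing high-resolution visual inputs. Existing pipelines rely on static visual preprocessing, which causes redundant computation on visually simple inputs, so a more efficient deployment strategy is needed.

Method: The proposed adaptive visual preprocessing method combines content-aware image analysis, adaptive resolution selection, and content-aware cropping to adjust input resolution and spatial coverage dynamically before vision encoding. It is integrated with FastVLM and requires neither architectural changes nor retraining.

Result: In an inference-only evaluation on a subset of the DocVQA dataset, adaptive preprocessing reduces per-image inference time by over 50%, lowers mean full generation time, and consistently reduces visual token count by more than 55% compared with the baseline pipeline.

Conclusion: Input-aware preprocessing is an effective, lightweight strategy for improving the deployment efficiency of vision-language models. The results show the value of adapting to input content for reducing redundant computation and offer a practical option for real-world deployment.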

📄 Abstract

Vision-Language Models (VLMs) have demonstrated strong performance on multimodal reasoning tasks, but their deployment remains challenging due to high inference latency and computational cost, particularly when processing high-resolution visual inputs. While recent architectures such as FastVLM improve efficiency through optimized vision encoders, existing pipelines still rely on static visual preprocessing, leading to redundant computation for visually simple inputs. In this work, we propose an adaptive visual preprocessing method that dynamically adjusts input resolution and spatial coverage based on image content characteristics. The proposed approach combines content-aware image analysis, adaptive resolution selection, and content-aware cropping to reduce visual redundancy prior to vision encoding. Importantly, the method is integrated with FastVLM without modifying its architecture or requiring retraining. We evaluate the proposed method on a subset of the DocVQA dataset in an inference-only setting, focusing on efficiency-oriented metrics. Experimental results show that adaptive preprocessing reduces per-image inference time by over 50%, lowers mean full generation time, and achieves a consistent reduction of more than 55% in visual token count compared to the baseline pipeline. These findings demonstrate that input-aware preprocessing is an effective and lightweight strategy for improving deployment-oriented efficiency of vision-language models. To facilitate reproducibility, our implementation is provided as a fork of the FastVLM repository, incorporating the files for the proposed method, and is available at https://github.com/kmdavidds/mlfastlm.
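
To make adaptive resolution selection concrete, here is a minimal sketch that scores an image by local pixel variation and maps the score to an input size. The detail metric, the `TIERS` thresholds, and the side lengths are assumptions for illustration only; the abstract does not specify which content characteristics or resolutions the method actually uses.

```python
def detail_score(gray):
    """Mean absolute difference between neighbouring pixels of a grayscale image."""
    height, width = len(gray), len(gray[0])
    total, count = 0, 0
    for y in range(height):
        for x in range(width):
            if x + 1 < width:
                total += abs(gray[y][x] - gray[y][x + 1])
                count += 1
            if y + 1 < height:
                total += abs(gray[y][x] - gray[y + 1][x])
                count += 1
    return total / count if count else 0.0

# Hypothetical tiers: (maximum detail score, square input side in pixels).
TIERS = [(5.0, 256), (20.0, 512), (float("inf"), 1024)]

def select_resolution(gray):
    """Pick the smallest resolution whose tier covers the image's detail score."""
    score = detail_score(gray)
    for max_score, side in TIERS:
        if score <= max_score:
            return side
    return TIERS[-1][1]

flat = [[128] * 8 for _ in range(8)]                                   # uniform image
checker = [[255 * ((x + y) % 2) for x in range(8)] for y in range(8)]  # dense detail
```

A uniform image falls into the lowest tier and a checkerboard into the highest, which is the behaviour that saves visual tokens on visually simple inputs.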

[5] ALIVE: An Avatar-Lecture Interactive Video Engine with Content-Aware Retrieval for Real-Time Interaction

Md Zabirul Islam, Md Motaleb Hossen Manik, Ge Wang

🧩 TL;DR

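ALIVE is a locally deployed interactive video learning engine that turns passive lecture viewing into a dynamic learning experience through neural avatars, content-aware retrieval, and real-time multimodal interaction, substantially increasing the pedagogical value of recorded lectures.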

📘 Detailed Summary

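Motivation: Traditional lecture videos lack a mechanism for real-time clarification, so learners must search externally when they are confused. Existing interactive learning systems typically lack lecture awareness, depend on cloud services, or fail to integrate retrieval and avatar-delivered explanations into a unified, privacy-preserving pipeline.

Method: ALIVE runs entirely on local hardware and integrates three core components: an avatar-delivered lecture generated through ASR transcription, LLM refinement, and neural talking-head synthesis; a content-aware retrieval mechanism that combines semantic similarity with timestamp alignment; and a real-time multimodal interaction system in which students ask questions by text or voice and receive text or avatar-delivered responses.

Result: A demonstration on a complete medical imaging course, evaluated for retrieval accuracy, latency, and user experience, shows that ALIVE provides accurate, content-aware, and engaging real-time learning support.

Conclusion: ALIVE shows how multimodal AI, combined with content-aware retrieval and local deployment, can significantly enhance the pedagogical value of recorded lectures. It offers an extensible path toward next-generation interactive learning environments while addressing privacy and real-time responsiveness.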

📄 Abstract

Traditional lecture videos offer flexibility but lack mechanisms for real-time clarification, forcing learners to search externally when confusion arises. Recent advances in large language models and neural avatars provide new opportunities for interactive learning, yet existing systems typically lack lecture awareness, rely on cloud-based services, or fail to integrate retrieval and avatar-delivered explanations in a unified, privacy-preserving pipeline. We present ALIVE, an Avatar-Lecture Interactive Video Engine that transforms passive lecture viewing into a dynamic, real-time learning experience. ALIVE operates fully on local hardware and integrates (1) Avatar-delivered lecture generated through ASR transcription, LLM refinement, and neural talking-head synthesis; (2) A content-aware retrieval mechanism that combines semantic similarity with timestamp alignment to surface contextually relevant lecture segments; and (3) Real-time multimodal interaction, enabling students to pause the lecture, ask questions through text or voice, and receive grounded explanations either as text or as avatar-delivered responses. To maintain responsiveness, ALIVE employs lightweight embedding models, FAISS-based retrieval, and segmented avatar synthesis with progressive preloading. We demonstrate the system on a complete medical imaging course, evaluate its retrieval accuracy, latency characteristics, and user experience, and show that ALIVE provides accurate, content-aware, and engaging real-time support. ALIVE illustrates how multimodal AI, when combined with content-aware retrieval and local deployment, can significantly enhance the pedagogical value of recorded lectures, offering an extensible pathway toward next-generation interactive learning environments.
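
One way to picture retrieval that combines semantic similarity with timestamp alignment is a weighted score over lecture segments, sketched below. The blending weight `alpha`, the exponential time decay, the segment ids, and the toy vectors are all assumptions; the sketch also uses brute-force cosine similarity in place of the FAISS index the abstract mentions, and is not the authors' scoring rule.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def temporal_weight(segment, paused_at, scale=60.0):
    """1.0 if the pause falls inside the segment, decaying with distance in seconds."""
    start, end = segment["start"], segment["end"]
    if start <= paused_at <= end:
        return 1.0
    gap = min(abs(paused_at - start), abs(paused_at - end))
    return math.exp(-gap / scale)

def retrieve(query_vec, segments, paused_at, alpha=0.7, top_k=2):
    """Rank segments by a blend of semantic similarity and timestamp proximity."""
    scored = []
    for seg in segments:
        score = (alpha * cosine(query_vec, seg["vec"])
                 + (1 - alpha) * temporal_weight(seg, paused_at))
        scored.append((score, seg["id"]))
    scored.sort(reverse=True)
    return [seg_id for _, seg_id in scored[:top_k]]

# Hypothetical lecture segments with toy 3-dimensional embeddings.
segments = [
    {"id": "intro", "start": 0, "end": 120, "vec": [1.0, 0.0, 0.0]},
    {"id": "ct-basics", "start": 120, "end": 300, "vec": [0.0, 1.0, 0.0]},
    {"id": "mri-basics", "start": 300, "end": 480, "vec": [0.0, 0.9, 0.1]},
]
result = retrieve([0.0, 1.0, 0.0], segments, paused_at=150)
```

With the lecture paused at 150 seconds, the segment being watched ranks first, and a semantically close but later segment ranks second.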

Tingfeng Xian, Wenlve Zhou, Zhiheng Zhou, Zhelin Li

🧩 TL;DR

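This paper proposes Domain Representation Injection (DRI), a novel parameter-efficient fine-tuning strategy for cross-modality ship re-identification. The method optimizes in the feature space rather than the weight space, injecting domain-specific representations into a frozen vision foundation model through lightweight learnable modules, and significantly improves cross-modality ship re-identification performance.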

📘 Detailed Summary

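Motivation: Cross-Modality Ship Re-Identification (CMS Re-ID) is critical for all-day, all-weather maritime target tracking but faces significant modality discrepancies. Mainstream solutions rely on explicit modality alignment strategies, which depend heavily on constructing large-scale paired datasets for pre-training, and existing generic parameter-efficient fine-tuning methods operate in the weight space and perform poorly on limited-capacity models.

Method: Grounded in the Platonic Representation Hypothesis, the paper proposes Domain Representation Injection (DRI), a novel parameter-efficient fine-tuning strategy that shifts the optimization perspective from the weight space to the feature space. The vision foundation model is kept fully frozen to preserve general knowledge, while a lightweight, learnable Offset Encoder extracts domain-specific representations rich in modality and identity attributes from the raw inputs. A Modulator adaptively transforms these representations according to the contextual information of intermediate features at different layers, and the results are injected into the intermediate layers via additive fusion, dynamically reshaping the feature distribution for the downstream task without altering the pre-trained weights.

Result: Extensive experiments demonstrate the superiority of the method, which achieves state-of-the-art performance with minimal trainable parameters. On the HOSS-ReID dataset it reaches 57.9% and 60.5% mAP using only 1.54M and 7.05M parameters, respectively, significantly outperforming existing methods.

Conclusion: The study shows the advantage of parameter-efficient fine-tuning in the feature space rather than the weight space and offers an effective new paradigm for cross-modality vision tasks. Domain Representation Injection adapts dynamically to a specific domain through lightweight modules without damaging the pre-trained knowledge of the vision foundation model, providing a practical solution for cross-modality re-identification in resource-constrained settings.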

📄 Abstract

Cross-Modality Ship Re-Identification (CMS Re-ID) is critical for achieving all-day and all-weather maritime target tracking, yet it is fundamentally challenged by significant modality discrepancies. Mainstream solutions typically rely on explicit modality alignment strategies; however, this paradigm heavily depends on constructing large-scale paired datasets for pre-training. To address this, grounded in the Platonic Representation Hypothesis, we explore the potential of Vision Foundation Models (VFMs) in bridging modality gaps. Recognizing the suboptimal performance of existing generic Parameter-Efficient Fine-Tuning (PEFT) methods that operate within the weight space, particularly on limited-capacity models, we shift the optimization perspective to the feature space and propose a novel PEFT strategy termed Domain Representation Injection (DRI). Specifically, while keeping the VFM fully frozen to maximize the preservation of general knowledge, we design a lightweight, learnable Offset Encoder to extract domain-specific representations rich in modality and identity attributes from raw inputs. Guided by the contextual information of intermediate features at different layers, a Modulator adaptively transforms these representations. Subsequently, they are injected into the intermediate layers via additive fusion, dynamically reshaping the feature distribution to adapt to the downstream task without altering the VFM's pre-trained weights. Extensive experimental results demonstrate the superiority of our method, achieving State-of-the-Art (SOTA) performance with minimal trainable parameters. For instance, on the HOSS-ReID dataset, we attain 57.9% and 60.5% mAP using only 1.54M and 7.05M parameters, respectively. The code is available at https://github.com/TingfengXian/DRI.
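
The feature-space injection described above can be sketched with toy two-dimensional layers. The example assumes a linear Offset Encoder and a sigmoid-gated Modulator; these forms, the weight values, and the names `offset_encoder`, `modulator_gates`, and `forward` are hypothetical choices for illustration, not the design in the authors' repository.

```python
import math

def linear(weights, x):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

# Frozen backbone: two toy layers whose weights are never updated.
FROZEN_LAYERS = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.5, -0.5]],
]

# Trainable parts (hypothetical): an offset encoder and one gate vector per layer.
offset_encoder = [[0.1, 0.0], [0.0, 0.1]]
modulator_gates = [[0.5, -0.5], [1.0, 1.0]]

def modulate(domain_repr, hidden, gate_weights):
    """Scale the domain representation by a gate computed from the layer's features."""
    context = sum(g * h for g, h in zip(gate_weights, hidden))
    scale = 1.0 / (1.0 + math.exp(-context))  # sigmoid gate in (0, 1)
    return [scale * d for d in domain_repr]

def forward(x, inject=True):
    domain_repr = linear(offset_encoder, x)  # extracted from the raw input
    hidden = x
    for layer, gate in zip(FROZEN_LAYERS, modulator_gates):
        hidden = linear(layer, hidden)       # frozen computation
        if inject:
            offset = modulate(domain_repr, hidden, gate)
            hidden = [h + o for h, o in zip(hidden, offset)]  # additive fusion
    return hidden

baseline = forward([1.0, 2.0], inject=False)
injected = forward([1.0, 2.0], inject=True)
```

The backbone weights are only read, never written, so adaptation comes entirely from the small encoder and gates that shift the intermediate features.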

[7] DGSAN: Dual-Graph Spatiotemporal Attention Network for Pulmonary Nodule Malignancy Prediction

Xiao Yu, Zhaojie Fang, Guanyu Zhou, Yin Shen, Huoling Luo, Ye Li, Ahmed Elazab, Xiang Wan, Ruiquan Ge, Changmiao Wang

🧩 TL;DR

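This paper proposes a Dual-Graph Spatiotemporal Attention Network (DGSAN) that uses temporal variations and multimodal data to improve the accuracy of pulmonary nodule classification, significantly outperforming existing methods while remaining computationally efficient.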

📘 Detailed Summary

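Motivation: Lung cancer is the leading cause of cancer-related deaths worldwide, and early detection and diagnosis are essential for improving patient survival. Although previous work has integrated multimodal and multi-temporal information, its fusion methods are limited to inefficient vector concatenation and simple mutual attention, so a more effective approach to multimodal information fusion is needed.

Method: The approach develops a Global-Local Feature Encoder to better capture the local, global, and fused characteristics of pulmonary nodules, uses a Dual-Graph Construction method to organize multimodal features into inter-modal and intra-modal graphs, and introduces a Hierarchical Cross-Modal Graph Fusion Module to refine feature integration. The study also compiles a new multimodal dataset, NLST-cmst, to support related research.

Result: Extensive experiments on the NLST-cmst and CSTL-derived datasets show that DGSAN significantly outperforms state-of-the-art methods on pulmonary nodule classification while being highly computationally efficient, validating the proposed method.

Conclusion: The study demonstrates the strong performance of the Dual-Graph Spatiotemporal Attention Network on pulmonary nodule classification and provides an effective fusion framework for multimodal medical image analysis. The method improves diagnostic accuracy while preserving computational efficiency, offering promising technical support for early clinical detection of lung cancer.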

📄 Abstract

Lung cancer continues to be the leading cause of cancer-related deaths globally. Early detection and diagnosis of pulmonary nodules are essential for improving patient survival rates. Although previous research has integrated multimodal and multi-temporal information, outperforming single modality and single time point, the fusion methods are limited to inefficient vector concatenation and simple mutual attention, highlighting the need for more effective multimodal information fusion. To address these challenges, we introduce a Dual-Graph Spatiotemporal Attention Network, which leverages temporal variations and multimodal data to enhance the accuracy of predictions. Our methodology involves developing a Global-Local Feature Encoder to better capture the local, global, and fused characteristics of pulmonary nodules. Additionally, a Dual-Graph Construction method organizes multimodal features into inter-modal and intra-modal graphs. Furthermore, a Hierarchical Cross-Modal Graph Fusion Module is introduced to refine feature integration. We also compiled a novel multimodal dataset named the NLST-cmst dataset as a comprehensive source of support for related research. Our extensive experiments, conducted on both the NLST-cmst and curated CSTL-derived datasets, demonstrate that our DGSAN significantly outperforms state-of-the-art methods in classifying pulmonary nodules with exceptional computational efficiency.

[8] Benchmarking and Enhancing VLM for Compressed Image Understanding

Zifu Zhang, Tongda Xu, Siqi Li, Shengxi Li, Yue Zhang, Mai Xu, Yan Wang

🧩 TL;DR

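This paper establishes the first comprehensive benchmark for evaluating how well vision-language models handle compressed images and proposes a universal adaptor that improves VLM performance by 10%-30% across multiple compression formats and bitrates.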

📘 Detailed Summary

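Motivation: Existing vision-language models mainly process high-bitrate compressed images, and their ability to understand low-bitrate compressed images has not been adequately explored. As VLM applications grow rapidly, efficient compression of image inputs is becoming increasingly important, so VLM performance on compressed images needs to be evaluated systematically and improved.

Method: The study first builds a comprehensive benchmark of more than one million compressed images, covering several widely used image codecs and a diverse set of tasks. It then analyzes the source of the performance gap, dividing it into information loss during compression and generalization failure of the VLM. Finally, it proposes a universal VLM adaptor that improves the model's handling of images compressed by existing codecs.

Result: A single adaptor improves VLM performance by 10%-30% on images compressed with different codecs and bitrates. The benchmark exposes specific performance gaps in how VLMs handle compressed images, and analysis with concrete visual examples shows that only the generalization gap can be mitigated.

Conclusion: The study offers useful insight into the performance gap between VLMs and compressed images, and the proposed benchmark and enhancement method help to bridge it. The results indicate that a suitably designed adaptor can markedly improve VLM robustness to compressed images, providing a workable approach for image compression in practical applications.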

📄 Abstract

With the rapid development of Vision-Language Models (VLMs) and the growing demand for their applications, efficient compression of image inputs has become increasingly important. Existing VLMs predominantly digest and understand high-bitrate compressed images, while their ability to interpret low-bitrate compressed images has yet to be explored. In this paper, we introduce the first comprehensive benchmark to evaluate the ability of VLMs on compressed images, covering widely used image codecs and a diverse set of tasks, and encompassing over one million compressed images. Next, we analyze the source of the performance gap by categorizing it into a) information loss during compression and b) generalization failure of the VLM. We visualize these gaps with concrete examples and identify that, for compressed images, only the generalization gap can be mitigated. Finally, we propose a universal VLM adaptor to enhance model performance on images compressed by existing codecs. Consequently, we demonstrate that a single adaptor can improve VLM performance across images with varying codecs and bitrates by 10%-30%. We believe that our benchmark and enhancement method provide valuable insights and contribute toward bridging the gap between VLMs and compressed images.

[9] PanoGrounder: Bridging 2D and 3D with Panoramic Scene Representations for VLM-based 3D Visual Grounding

Seongmin Jung, Seongho Choi, Gunwoo Jeon, Minsu Cho, Jongwoo Lim

🧩 TL;DR

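This paper proposes PanoGrounder, a generalizable 3D visual grounding framework that couples multi-modal panoramic representations with pretrained 2D vision-language models for strong vision-language reasoning and achieves state-of-the-art performance on multiple benchmarks.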

📘 Detailed Summary

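Motivation: 3D visual grounding is a key bridge from vision-language perception to robotics, but traditional supervised models generalize poorly because 3D vision-language datasets are scarce and their reasoning capabilities are limited compared with modern vision-language models. This restricts their use in real-world scenarios.

Method: PanoGrounder uses multi-modal panoramic representations as an intermediate representation between 2D and 3D: panoramic renderings augmented with 3D semantic and geometric features can be fed directly to vision-language models. A three-stage pipeline places a compact set of panoramic viewpoints based on scene layout and geometry, grounds the text query on each panoramic rendering with a vision-language model, and fuses the per-view predictions into a single 3D bounding box via lifting.

Result: The method achieves state-of-the-art performance on the ScanRefer and Nr3D benchmarks and generalizes well to unseen 3D datasets and text rephrasings, validating the combination of panoramic representations with pretrained vision-language models.

Conclusion: The study shows that panoramic representations are an effective bridge between 2D and 3D. By drawing on the strong reasoning capabilities of pretrained vision-language models, the approach markedly improves the generalization of 3D visual grounding and offers a new solution for vision-language understanding in robotics.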

📄 Abstract

3D Visual Grounding (3DVG) is a critical bridge from vision-language perception to robotics, requiring both language understanding and 3D scene reasoning. Traditional supervised models leverage explicit 3D geometry but exhibit limited generalization, owing to the scarcity of 3D vision-language datasets and the limited reasoning capabilities compared to modern vision-language models (VLMs). We propose PanoGrounder, a generalizable 3DVG framework that couples multi-modal panoramic representation with pretrained 2D VLMs for strong vision-language reasoning. Panoramic renderings, augmented with 3D semantic and geometric features, serve as an intermediate representation between 2D and 3D, and offer two major benefits: (i) they can be directly fed to VLMs with minimal adaptation and (ii) they retain long-range object-to-object relations thanks to their 360-degree field of view. We devise a three-stage pipeline that places a compact set of panoramic viewpoints considering the scene layout and geometry, grounds a text query on each panoramic rendering with a VLM, and fuses per-view predictions into a single 3D bounding box via lifting. Our approach achieves state-of-the-art results on ScanRefer and Nr3D, and demonstrates superior generalization to unseen 3D datasets and text rephrasings.

[10] Self-supervised Multiplex Consensus Mamba for General Image Fusion

Yingying Wang, Rongjin Zhuang, Hui Zheng, Xuanhua He, Ke Cao, Xiaotong Tu, Xinghao Ding

🧩 TL;DR

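This study proposes SMC-Mamba, a Self-supervised Multiplex Consensus Mamba framework for general image fusion. Through modality-agnostic feature enhancement, multiplex consensus cross-modal integration, and bi-level self-supervised contrastive learning, it outperforms existing state-of-the-art algorithms on a range of fusion tasks and downstream vision tasks.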

📘 Detailed Summary

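Motivation: General image fusion must handle many tasks at once and improve performance without increasing complexity, whereas existing task-specific methods focus mainly on integrating inter-modal information and cannot meet this need. The study aims to develop a general image fusion framework that adapts to multiple fusion tasks and improves downstream task performance.

Method: The SMC-Mamba framework includes a Modality-Agnostic Feature Enhancement module that preserves fine details through adaptive gating and strengthens global representations via spatial-channel and frequency-rotational scanning. A Multiplex Consensus Cross-modal Mamba module lets experts collaborate dynamically and reach a consensus to integrate complementary multimodal information efficiently, and its cross-modal scanning further strengthens feature interaction across modalities. A bi-level self-supervised contrastive learning loss preserves high-frequency information while improving downstream task performance.

Result: On infrared-visible, medical, multi-focus, and multi-exposure fusion tasks, the method significantly outperforms existing state-of-the-art algorithms in fusion quality. It also performs better on downstream vision tasks, validating the effectiveness and generality of the framework.

Conclusion: Through modality-agnostic feature enhancement, a multiplex consensus mechanism, and self-supervised learning, SMC-Mamba integrates multimodal information efficiently and offers a new approach to general image fusion. It significantly improves fusion quality and downstream task performance while maintaining computational efficiency, giving it broad application potential.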

📄 Abstract

Image fusion integrates complementary information from different modalities to generate high-quality fused images, thereby enhancing downstream tasks such as object detection and semantic segmentation. Unlike task-specific techniques that primarily focus on consolidating inter-modal information, general image fusion needs to address a wide range of tasks while improving performance without increasing complexity. To achieve this, we propose SMC-Mamba, a Self-supervised Multiplex Consensus Mamba framework for general image fusion. Specifically, the Modality-Agnostic Feature Enhancement (MAFE) module preserves fine details through adaptive gating and enhances global representations via spatial-channel and frequency-rotational scanning. The Multiplex Consensus Cross-modal Mamba (MCCM) module enables dynamic collaboration among experts, reaching a consensus to efficiently integrate complementary information from multiple modalities. The cross-modal scanning within MCCM further strengthens feature interactions across modalities, facilitating seamless integration of critical information from both sources. Additionally, we introduce a Bi-level Self-supervised Contrastive Learning Loss (BSCL), which preserves high-frequency information without increasing computational overhead while simultaneously boosting performance in downstream tasks. Extensive experiments demonstrate that our approach outperforms state-of-the-art (SOTA) image fusion algorithms in tasks such as infrared-visible, medical, multi-focus, and multi-exposure fusion, as well as downstream visual tasks.

[11] TGC-Net: A Structure-Aware and Semantically-Aligned Framework for Text-Guided Medical Image Segmentation

Gaoren Lin, Huangxuan Zhao, Yuan Xiong, Lefei Zhang, Bo Du, Wentao Zhu

🧩 TL;DR

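This paper proposes TGC-Net, a CLIP-based framework for text-guided medical segmentation that uses parameter-efficient, task-specific adaptations to address the shortcomings of existing methods in preserving fine-grained structure, modeling complex clinical descriptions, and aligning domain semantics.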

📘 Detailed Summary

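Motivation: Existing text-guided medical segmentation methods usually rely on unaligned image and text encoders and therefore need complex multimodal fusion modules. Applying CLIP directly to medical imaging faces three main problems: insufficient preservation of fine-grained anatomical structures, inadequate modeling of complex clinical descriptions, and domain-specific semantic misalignment.

Method: TGC-Net adopts parameter-efficient, task-specific adaptations built on three core components: a Semantic-Structural Synergy Encoder (SSE) that augments CLIP's ViT with a CNN branch for multi-scale structural refinement, a Domain-Augmented Text Encoder (DATE) that injects medical knowledge derived from large language models, and a Vision-Language Calibration Module (VLCM) that refines cross-modal correspondence in a unified feature space.

Result: Experiments on five datasets spanning chest X-ray and thoracic CT modalities show that TGC-Net achieves state-of-the-art performance with substantially fewer trainable parameters, including notable Dice gains on challenging benchmarks.

Conclusion: The study demonstrates that a CLIP-based medical segmentation framework with parameter-efficient adaptation is feasible and provides an effective solution for multimodal medical image analysis that balances a lightweight model with high performance.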

📄 Abstract

Text-guided medical segmentation enhances segmentation accuracy by utilizing clinical reports as auxiliary information. However, existing methods typically rely on unaligned image and text encoders, which necessitate complex interaction modules for multimodal fusion. While CLIP provides a pre-aligned multimodal feature space, its direct application to medical imaging is limited by three main issues: insufficient preservation of fine-grained anatomical structures, inadequate modeling of complex clinical descriptions, and domain-specific semantic misalignment. To tackle these challenges, we propose TGC-Net, a CLIP-based framework focusing on parameter-efficient, task-specific adaptations. Specifically, it incorporates a Semantic-Structural Synergy Encoder (SSE) that augments CLIP's ViT with a CNN branch for multi-scale structural refinement, a Domain-Augmented Text Encoder (DATE) that injects large-language-model-derived medical knowledge, and a Vision-Language Calibration Module (VLCM) that refines cross-modal correspondence in a unified feature space. Experiments on five datasets across chest X-ray and thoracic CT modalities demonstrate that TGC-Net achieves state-of-the-art performance with substantially fewer trainable parameters, including notable Dice gains on challenging benchmarks.

[12] Leveraging Lightweight Entity Extraction for Scalable Event-Based Image Retrieval

Dao Sy Duy Minh, Huynh Trung Kiet, Nguyen Lam Phu Quy, Phu-Hoa Pham, Tran Chi Nguyen

🧩 TL;DR

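This paper proposes a lightweight two-stage image retrieval pipeline that combines event-centric entity extraction with BM25 filtering and BEiT-3 reranking, significantly improving image-text retrieval in complex real-world scenarios.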

📘 Detailed Summary

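Motivation: Real-world image-text retrieval is challenged by vague or context-dependent queries, linguistic variability, and the need for scalability. Existing methods perform poorly in complex real-world scenarios, so a more effective solution is needed.

Method: The paper proposes a lightweight two-stage retrieval pipeline. The first stage uses BM25 over event-centric extracted entities for efficient candidate filtering, and the second stage applies BEiT-3 models to capture deep multimodal semantics and rerank the results.

Result: On the OpenEvents v1 benchmark, the method achieves a mean average precision of 0.559, substantially outperforming existing baselines and demonstrating the effectiveness of combining event-guided filtering with long-text vision-language modeling.

Conclusion: Combining event-guided filtering with deep multimodal semantic modeling enables accurate and efficient image retrieval in complex real-world scenarios and offers an effective solution for applications such as search engines and media archiving.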

📄 Abstract

Retrieving images from natural language descriptions is a core task at the intersection of computer vision and natural language processing, with wide-ranging applications in search engines, media archiving, and digital content management. However, real-world image-text retrieval remains challenging due to vague or context-dependent queries, linguistic variability, and the need for scalable solutions. In this work, we propose a lightweight two-stage retrieval pipeline that leverages event-centric entity extraction to incorporate temporal and contextual signals from real-world captions. The first stage performs efficient candidate filtering using BM25 based on salient entities, while the second stage applies BEiT-3 models to capture deep multimodal semantics and rerank the results. Evaluated on the OpenEvents v1 benchmark, our method achieves a mean average precision of 0.559, substantially outperforming prior baselines. These results highlight the effectiveness of combining event-guided filtering with long-text vision-language modeling for accurate and efficient retrieval in complex, real-world scenarios. Our code is available at https://github.com/PhamPhuHoa-23/Event-Based-Image-Retrieval
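
The two-stage flow of BM25 filtering followed by reranking can be sketched as follows. The BM25 function is a standard Okapi formulation written from scratch, while the reranker is a caller-supplied stand-in for BEiT-3: the `rerank_scores` values, the captions, the pre-extracted `entities` list, and the `keep` cutoff are hypothetical, and nothing here is taken from the authors' repository.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document against the query terms."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def two_stage_retrieve(entities, captions, rerank_fn, keep=2):
    """Stage 1: keep the top BM25 candidates. Stage 2: reorder them with rerank_fn."""
    docs = [caption.lower().split() for caption in captions]
    scores = bm25_scores([e.lower() for e in entities], docs)
    candidates = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:keep]
    return sorted(candidates, key=rerank_fn, reverse=True)

captions = [
    "flood rescue in jakarta",
    "jakarta marathon finish line",
    "wildfire evacuation in california",
]
entities = ["Jakarta", "flood"]           # assumed output of entity extraction
rerank_scores = {0: 0.2, 1: 0.9, 2: 0.5}  # placeholder for multimodal model scores
result = two_stage_retrieve(entities, captions, rerank_scores.get)
```

The third caption never reaches the reranker because it shares no entity with the query, which is where the pipeline saves the cost of the heavier model.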

[13] FluencyVE: Marrying Temporal-Aware Mamba with Bypass Attention for Video Editing

Mingshu Cai, Yixuan Li, Osamu Yoshie, Yuya Ieiri

🧩 TL;DR

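This paper proposes FluencyVE, a simple yet effective one-shot video editing method based on the Mamba architecture. It integrates a linear time-series module into pretrained Stable Diffusion models in place of the conventional temporal attention layer, preserving generative capability while significantly reducing computational cost.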

📘 Detailed Summary

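Motivation: Large-scale text-to-image diffusion models have achieved unprecedented success in image generation and editing, but extending them to video editing remains challenging. Existing video editing methods adapt pretrained text-to-image models by adding temporal attention mechanisms, yet they still suffer from temporal inconsistency and high computational overhead, so a more effective solution is needed.

Method: FluencyVE integrates Mamba, a linear time-series module, into a video editing framework based on pretrained Stable Diffusion models, replacing the conventional temporal attention layer. It replaces the query and key weight matrices in the causal attention with low-rank approximation matrices and uses a weighted averaging technique during training to update the attention scores, preserving the generative power of the text-to-image model while reducing the computational burden.

Result: Experiments and analyses show promising results in editing various attributes, subjects, and scenes in real-world videos. The method significantly reduces computational overhead while maintaining generation quality, achieving global frame-level attention without the high cost of conventional attention mechanisms.

Conclusion: The study demonstrates the effectiveness of the Mamba architecture for video editing and offers a more computationally efficient alternative for diffusion-based video editing. By replacing conventional attention with a linear time-series module, the method addresses temporal inconsistency while preserving generative capability, pointing to a new technical direction for future video editing research.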

📄 Abstract

Large-scale text-to-image diffusion models have achieved unprecedented success in image generation and editing. However, extending this success to video editing remains challenging. Recent video editing efforts have adapted pretrained text-to-image models by adding temporal attention mechanisms to handle video tasks. Unfortunately, these methods continue to suffer from temporal inconsistency issues and high computational overheads. In this study, we propose FluencyVE, which is a simple yet effective one-shot video editing approach. FluencyVE integrates the linear time-series module, Mamba, into a video editing model based on pretrained Stable Diffusion models, replacing the temporal attention layer. This enables global frame-level attention while reducing the computational costs. In addition, we employ low-rank approximation matrices to replace the query and key weight matrices in the causal attention, and use a weighted averaging technique during training to update the attention scores. This approach significantly preserves the generative power of the text-to-image model while effectively reducing the computational burden. Experiments and analyses demonstrate promising results in editing various attributes, subjects, and locations in real-world videos.
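
To show why replacing a weight matrix with low-rank factors reduces cost, the sketch below compares parameter counts and builds a toy rank-1 projection from two thin factors. The dimensions (`dim=320`, `rank=8`), the factor values, and the name `w_query` are assumptions for illustration; the abstract does not state the ranks or the factorization FluencyVE uses.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def param_counts(dim, rank):
    """Parameters in a full dim x dim matrix versus two rank-limited factors."""
    return dim * dim, 2 * dim * rank

# Toy rank-1 factorization of a 4 x 4 projection (illustrative values).
left = [[1.0], [2.0], [0.5], [-1.0]]  # 4 x 1 factor
right = [[0.5, 0.0, 1.0, 2.0]]        # 1 x 4 factor
w_query = matmul(left, right)         # 4 x 4 matrix of rank 1

full, low = param_counts(dim=320, rank=8)
```

Under these assumed sizes the factored form stores 5,120 values instead of 102,400, a 20x reduction for that projection.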

[14] Multi-Attribute guided Thermal Face Image Translation based on Latent Diffusion Model

Mingshu Cai, Osamu Yoshie, Yuya Ieiri

🧩 TL;DR

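This paper proposes a latent diffusion-based approach to infrared face recognition that generates high-quality visible-light face images while preserving key identity features, addressing feature loss in infrared-to-visible cross-modal face recognition.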

📘 Detailed Summary

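Motivation: Modern surveillance systems rely on multi-wavelength sensors for infrared face recognition at night, but most models are trained on visible-light datasets, so performance drops sharply on infrared inputs. Existing heterogeneous face recognition methods suffer from model and modality discrepancies that cause distortion and feature loss in generated images, so a new method is needed that generates high-quality visible-light images while preserving identity features.

Method: The paper proposes a latent diffusion-based model that generates high-quality visible-light face images from thermal infrared inputs. A multi-attribute classifier extracts key facial attributes from visible-light images to mitigate feature loss during infrared-to-visible image restoration. The paper also proposes the Self-attn Mamba module, which enhances global modeling of cross-modal features and significantly improves inference speed.

Result: Experiments on two benchmark datasets show that the method achieves state-of-the-art performance in both image quality and identity preservation. The Self-attn Mamba module improves global modeling of cross-modal features and significantly speeds up inference, validating the proposed approach.

Conclusion: The study provides an effective solution for heterogeneous face recognition: by combining a latent diffusion model with a multi-attribute classifier, it addresses feature preservation in infrared-to-visible translation. The Self-attn Mamba module offers a new approach to cross-modal feature modeling that balances performance and efficiency, laying a technical foundation for night-time face recognition in practical surveillance applications.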

📄 Abstract

Modern surveillance systems increasingly rely on multi-wavelength sensors and deep neural networks to recognize faces in infrared images captured at night. However, most facial recognition models are trained on visible light datasets, leading to substantial performance degradation on infrared inputs due to significant domain shifts. Early feature-based methods for infrared face recognition proved ineffective, prompting researchers to adopt generative approaches that convert infrared images into visible light images for improved recognition. This paradigm, known as Heterogeneous Face Recognition (HFR), faces challenges such as model and modality discrepancies, leading to distortion and feature loss in generated images. To address these limitations, this paper introduces a novel latent diffusion-based model designed to generate high-quality visible face images from thermal inputs while preserving critical identity features. A multi-attribute classifier is incorporated to extract key facial attributes from visible images, mitigating feature loss during infrared-to-visible image restoration. Additionally, we propose the Self-attn Mamba module, which enhances global modeling of cross-modal features and significantly improves inference speed. Experimental results on two benchmark datasets demonstrate the superiority of our approach, achieving state-of-the-art performance in both image quality and identity preservation.

[15] Beyond Pixel Simulation: Pathology Image Generation via Diagnostic Semantic Tokens and Prototype Control

Minghao Han, YiChen Liu, Yizhou Liu, Zizhi Chen, Jingqun Tang, Xuecheng Wu, Dingkang Yang, Lihua Zhang

🧩 TL;DR

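This paper proposes UniPath, a semantics-driven pathology image generation framework that uses mature diagnostic understanding to enable controllable generation, bridging the gap between understanding and generation in computational pathology.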

📘 Detailed Summary

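Motivation: Understanding and generation have developed unevenly in computational pathology: understanding models have reached diagnostic-level competence, while generative models still largely simulate pixels. Three coupled challenges hold the field back: large, high-quality image-text corpora are scarce; precise, fine-grained semantic control is lacking, which forces reliance on non-semantic cues; and terminological heterogeneity, where the same diagnostic concept is phrased in different ways, impedes reliable text-conditioned generation.

Method: UniPath adopts a Multi-Stream Control architecture: a Raw-Text stream; a High-Level Semantics stream that uses learnable queries to a frozen pathology MLLM to extract paraphrase-robust Diagnostic Semantic Tokens and to expand prompts into diagnosis-aware attribute bundles; and a Prototype stream that provides component-level morphological control through a prototype bank. The study also builds a 2.65M image-text corpus and a finely annotated, high-quality 68K subset to alleviate data scarcity.

Result: UniPath achieves state-of-the-art performance under a four-tier evaluation hierarchy tailored to pathology, with a Patho-FID of 80.9 (51% better than the second-best method) and fine-grained semantic control reaching 98.7% of that of real images. The curated datasets, complete source code, and pre-trained model weights are to be made publicly available.

Conclusion: With its semantics-driven generation framework, UniPath bridges the gap between understanding and generation in pathology and achieves precise semantic control. The study provides a new generative paradigm for computational pathology, and its open data and model resources should advance the field, particularly for diagnostic assistance and educational applications.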

📄 Abstract

In computational pathology, understanding and generation have evolved along disparate paths: advanced understanding models already exhibit diagnostic-level competence, whereas generative models largely simulate pixels. Progress remains hindered by three coupled factors: the scarcity of large, high-quality image-text corpora; the lack of precise, fine-grained semantic control, which forces reliance on non-semantic cues; and terminological heterogeneity, where diverse phrasings for the same diagnostic concept impede reliable text conditioning. We introduce UniPath, a semantics-driven pathology image generation framework that leverages mature diagnostic understanding to enable controllable generation. UniPath implements Multi-Stream Control: a Raw-Text stream; a High-Level Semantics stream that uses learnable queries to a frozen pathology MLLM to distill paraphrase-robust Diagnostic Semantic Tokens and to expand prompts into diagnosis-aware attribute bundles; and a Prototype stream that affords component-level morphological control via a prototype bank. On the data front, we curate a 2.65M image-text corpus and a finely annotated, high-quality 68K subset to alleviate data scarcity. For a comprehensive assessment, we establish a four-tier evaluation hierarchy tailored to pathology. Extensive experiments demonstrate UniPath's SOTA performance, including a Patho-FID of 80.9 (51% better than the second-best) and fine-grained semantic control achieving 98.7% of the real-image. The meticulously curated datasets, complete source code, and pre-trained model weights developed in this study will be made openly accessible to the public.

[16] Multimodal Skeleton-Based Action Representation Learning via Decomposition and Composition

Hongsong Wang, Heng Fei, Bingxuan Dai, Jie Gui

🧩 TL;DR

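This paper proposes Decomposition and Composition, a self-supervised multimodal skeleton-based action representation learning framework. Its decomposition and composition strategies balance efficiency and performance in multimodal action understanding, striking an excellent balance between computational cost and model performance on several benchmark datasets.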

📘 Detailed Summary

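Motivation: The central challenge in multimodal human action understanding is to exploit the complementarity among modalities while keeping the model efficient. Existing methods fall short on one side or the other: simple late fusion improves performance at a substantial computational cost, whereas early fusion is efficient but struggles to reach strong performance. The trade-off between efficiency and effectiveness therefore remains unresolved.

Method: The paper proposes a self-supervised multimodal skeleton-based action representation learning framework named Decomposition and Composition. The Decomposition strategy splits the fused multimodal features into distinct unimodal features and aligns them with their ground-truth unimodal counterparts. The Composition strategy integrates multiple unimodal features and uses them as self-supervised guidance to improve the learning of multimodal representations.

Result: Extensive experiments on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD II datasets show that the proposed method strikes an excellent balance between computational cost and model performance, achieving strong performance while remaining computationally efficient.

Conclusion: The study offers an effective answer to the efficiency-performance trade-off in multimodal action understanding, realizing self-supervised multimodal representation learning through decomposition and composition. It suggests that careful feature decomposition and composition can improve model performance without sacrificing efficiency, which matters for practical applications.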

📄 Abstract

Multimodal human action understanding is a significant problem in computer vision, with the central challenge being the effective utilization of the complementarity among diverse modalities while maintaining model efficiency. However, most existing methods rely on simple late fusion to enhance performance, which results in substantial computational overhead. Although early fusion with a shared backbone for all modalities is efficient, it struggles to achieve excellent performance. To address the dilemma of balancing efficiency and effectiveness, we introduce a self-supervised multimodal skeleton-based action representation learning framework, named Decomposition and Composition. The Decomposition strategy meticulously decomposes the fused multimodal features into distinct unimodal features, subsequently aligning them with their respective ground truth unimodal counterparts. On the other hand, the Composition strategy integrates multiple unimodal features, leveraging them as self-supervised guidance to enhance the learning of multimodal representations. Extensive experiments on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD II datasets demonstrate that the proposed method strikes an excellent balance between computational cost and model performance.

[17] UniPR-3D: Towards Universal Visual Place Recognition with Visual Geometry Grounded Transformer

Tianchen Deng, Xun Chen, Ziming Li, Hongming Shen, Danwei Wang, Javier Civera, Hesheng Wang

🧩 TL;DR

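This paper proposes UniPR-3D, the first visual place recognition architecture that effectively integrates multi-view information. By combining 3D geometric information with 2D texture features, it achieves state-of-the-art performance on multiple benchmarks.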

📘 Detailed Summary

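Motivation: Visual place recognition is traditionally formulated as a single-image retrieval task. Multi-view approaches offer clear advantages but remain relatively underexplored, and existing methods generalize poorly across diverse environments. A new architecture is needed that integrates multi-view information effectively and improves generalization.

Method: UniPR-3D builds on a VGGT backbone capable of encoding multi-view 3D representations, which is adapted with purpose-built feature aggregators and fine-tuned for place recognition. The method jointly leverages the 3D tokens and intermediate 2D tokens produced by VGGT, with dedicated 2D and 3D aggregation modules designed around their distinct characteristics. It also combines single- and multi-frame aggregation schemes with a variable-length sequence retrieval strategy.

Result: Experiments show that UniPR-3D sets a new state of the art on multiple benchmarks, outperforming existing single- and multi-view baselines and demonstrating the effectiveness of geometry-grounded tokens for visual place recognition.

Conclusion: The study shows that integrating multi-view 3D representations with 2D texture features can substantially improve visual place recognition, and that geometry-grounded feature representations are valuable for reasoning across viewpoints. It offers a new architectural direction for future multimodal place recognition research.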

📄 Abstract

Visual Place Recognition (VPR) has been traditionally formulated as a single-image retrieval task. Using multiple views offers clear advantages, yet this setting remains relatively underexplored and existing methods often struggle to generalize across diverse environments. In this work we introduce UniPR-3D, the first VPR architecture that effectively integrates information from multiple views. UniPR-3D builds on a VGGT backbone capable of encoding multi-view 3D representations, which we adapt by designing feature aggregators and fine-tune for the place recognition task. To construct our descriptor, we jointly leverage the 3D tokens and intermediate 2D tokens produced by VGGT. Based on their distinct characteristics, we design dedicated aggregation modules for 2D and 3D features, allowing our descriptor to capture fine-grained texture cues while also reasoning across viewpoints. To further enhance generalization, we incorporate both single- and multi-frame aggregation schemes, along with a variable-length sequence retrieval strategy. Our experiments show that UniPR-3D sets a new state of the art, outperforming both single- and multi-view baselines and highlighting the effectiveness of geometry-grounded tokens for VPR. Our code and models will be made publicly available on GitHub at https://github.com/dtc111111/UniPR-3D.

[18] T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation

Zhe Cao, Tao Wang, Jiaming Wang, Yanghai Wang, Yuanxing Zhang, Jialu Chen, Miao Deng, Jiahao Wang, Yubin Guo, Chenxi Liao, Yize Zhang, Zhaoxiang Zhang, Jiaheng Liu

🧩 TL;DR

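This paper proposes T2AV-Compass, a unified benchmark for comprehensive evaluation of text-to-audio-video generation systems. It comprises 500 diverse, complex prompts and a dual-level evaluation framework, and it reveals significant shortcomings of current models in cross-modal alignment and realism.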

📘 Detailed Summary

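Motivation: Evaluation of text-to-audio-video generation systems is fragmented. It typically relies on unimodal metrics or narrowly scoped benchmarks that fail to capture cross-modal alignment, instruction following, and perceptual realism under complex prompts, which limits systematic progress and accurate assessment of model capabilities.

Method: The study proposes the T2AV-Compass benchmark, which contains 500 diverse and complex prompts built through a taxonomy-driven pipeline to ensure semantic richness and physical plausibility. It also introduces a dual-level evaluation framework that combines objective signal-level metrics for video quality, audio quality, and cross-modal alignment with a subjective MLLM-as-a-Judge protocol for instruction following and realism assessment.

Result: An extensive evaluation of 11 representative T2AV systems shows that even the strongest models fall substantially short of human-level realism and cross-modal consistency, with persistent failures in audio realism, fine-grained synchronization, and instruction following.

Conclusion: The results indicate significant room for improvement in future models and highlight the value of T2AV-Compass as a challenging, diagnostic testbed for advancing text-to-audio-video generation.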

📄 Abstract

Text-to-Audio-Video (T2AV) generation aims to synthesize temporally coherent video and semantically synchronized audio from natural language, yet its evaluation remains fragmented, often relying on unimodal metrics or narrowly scoped benchmarks that fail to capture cross-modal alignment, instruction following, and perceptual realism under complex prompts. To address this limitation, we present T2AV-Compass, a unified benchmark for comprehensive evaluation of T2AV systems, consisting of 500 diverse and complex prompts constructed via a taxonomy-driven pipeline to ensure semantic richness and physical plausibility. In addition, T2AV-Compass introduces a dual-level evaluation framework that integrates objective signal-level metrics for video quality, audio quality, and cross-modal alignment with a subjective MLLM-as-a-Judge protocol for instruction following and realism assessment. Extensive evaluation of 11 representative T2AV systems reveals that even the strongest models fall substantially short of human-level realism and cross-modal consistency, with persistent failures in areas such as audio realism, fine-grained synchronization, and instruction following. These results indicate significant room for improvement in future models and highlight the value of T2AV-Compass as a challenging and diagnostic testbed for advancing text-to-audio-video generation.

[19] UniRec-0.1B: Unified Text and Formula Recognition with 0.1B Parameters

Yongkun Du, Zhineng Chen, Yazhen Xie, Weikang Bai, Hao Feng, Wei Shi, Yuchen Su, Can Huang, Yu-Gang Jiang

🧩 TL;DR

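This paper proposes UniRec-0.1B, a lightweight unified recognition model with only 0.1B parameters that efficiently recognizes both text and formulas, addressing the heavy compute demands that make existing vision-language models hard to deploy in practice.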

📘 Detailed Summary

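Motivation: Current vision-language models can recognize text and formulas in a unified way, but their large parameter counts and compute demands restrict deployment in real applications. A lightweight yet capable unified recognition model is needed to support recognition at multiple levels, including characters, words, lines, paragraphs, and documents.

Method: The authors first build UniRec40M, a large-scale dataset of 40 million text, formula, and mixed text-formula samples. To address two challenges for a lightweight unified expert model, namely structural variability across hierarchies and semantic entanglement between textual and formulaic content, they propose hierarchical supervision training that explicitly guides structural comprehension and a semantic-decoupled tokenizer that separates text and formula representations.

Result: On a comprehensive evaluation benchmark covering Chinese and English documents from multiple domains and at multiple levels, as well as on public benchmarks, UniRec-0.1B outperforms both general-purpose vision-language models and leading document parsing expert models while achieving a 2-9× speedup, validating its effectiveness and efficiency.

Conclusion: The study shows that careful dataset construction, hierarchical supervision training, and a semantic-decoupled tokenizer make it possible to build a unified recognition model that is both lightweight and strong. UniRec-0.1B offers an efficient, practical solution for document parsing systems that balances performance with computational cost.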

📄 Abstract

Text and formulas constitute the core informational components of many documents. Accurately and efficiently recognizing both is crucial for developing robust and generalizable document parsing systems. Recently, vision-language models (VLMs) have achieved impressive unified recognition of text and formulas. However, they are large and computationally demanding, restricting their usage in many applications. In this paper, we propose UniRec-0.1B, a unified recognition model with only 0.1B parameters. It is capable of performing text and formula recognition at multiple levels, including characters, words, lines, paragraphs, and documents. To implement this task, we first establish UniRec40M, a large-scale dataset comprising 40 million text, formula, and mixed text-formula samples, enabling the training of a powerful yet lightweight model. Secondly, we identify two challenges when building such a lightweight but unified expert model: structural variability across hierarchies and semantic entanglement between textual and formulaic content. To tackle these, we introduce a hierarchical supervision training scheme that explicitly guides structural comprehension, and a semantic-decoupled tokenizer that separates text and formula representations. Finally, we develop a comprehensive evaluation benchmark covering Chinese and English documents from multiple domains and with multiple levels. Experimental results on this and public benchmarks demonstrate that UniRec-0.1B outperforms both general-purpose VLMs and leading document parsing expert models, while achieving a 2-9× speedup, validating its effectiveness and efficiency. Codebase and Dataset: https://github.com/Topdu/OpenOCR.

[20] FreeInpaint: Tuning-free Prompt Alignment and Visual Rationality Enhancement in Image Inpainting

Chao Gong, Dong Li, Yingwei Pan, Jingjing Chen, Ting Yao, Tao Mei

🧩 TL;DR

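This paper proposes FreeInpaint, a plug-and-play, tuning-free method for text-guided image inpainting that directly optimizes diffusion latents during inference to make generated images more faithful to the text prompt while preserving visual rationality.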

📘 Detailed Summary

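Motivation: The core challenge in text-guided image inpainting is to align the inpainted region accurately with the user's prompt while keeping a high degree of visual fidelity. Existing inpainting methods built on pre-trained text-to-image diffusion models produce visually convincing results, but they still struggle to maintain prompt alignment and visual rationality at the same time.

Method: FreeInpaint is a plug-and-play, tuning-free approach that directly optimizes the diffusion latents during inference. It introduces a prior-guided noise optimization method that steers model attention toward valid inpainting regions by optimizing the initial noise, and a composite guidance objective tailored to inpainting that directs the denoising process by optimizing the intermediate latents at each step.

Result: Extensive experiments across various inpainting diffusion models and evaluation metrics demonstrate the effectiveness and robustness of FreeInpaint. The method improves both prompt alignment and visual rationality, producing inpainted results that are more faithful to the user's text prompt and visually plausible.

Conclusion: The study provides an effective training-free framework for improving text-guided image inpainting, balancing prompt alignment and visual rationality by optimizing latents during the diffusion process. It offers a new optimization perspective for applying diffusion models to image editing and has practical plug-and-play value.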

📄 Abstract

Text-guided image inpainting endeavors to generate new content within specified regions of images using textual prompts from users. The primary challenge is to accurately align the inpainted areas with the user-provided prompts while maintaining a high degree of visual fidelity. While existing inpainting methods have produced visually convincing results by leveraging the pre-trained text-to-image diffusion models, they still struggle to uphold both prompt alignment and visual rationality simultaneously. In this work, we introduce FreeInpaint, a plug-and-play tuning-free approach that directly optimizes the diffusion latents on the fly during inference to improve the faithfulness of the generated images. Technically, we introduce a prior-guided noise optimization method that steers model attention towards valid inpainting regions by optimizing the initial noise. Furthermore, we meticulously design a composite guidance objective tailored specifically for the inpainting task. This objective efficiently directs the denoising process, enhancing prompt alignment and visual rationality by optimizing intermediate latents at each step. Through extensive experiments involving various inpainting diffusion models and evaluation metrics, we demonstrate the effectiveness and robustness of our proposed FreeInpaint.

[21] MarineEval: Assessing the Marine Intelligence of Vision-Language Models

Yuk-Kwan Wong, Tuan-An To, Jipeng Zhang, Ziqiang Zheng, Sai-Kit Yeung

🧩 TL;DR

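This study builds MarineEval, the first large-scale marine vision-language benchmark, with 2,000 image-based question-answer pairs for evaluating existing VLMs in the marine domain. It finds that existing models fall significantly short on questions that require domain expertise.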

📘 Detailed Summary

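Motivation: Although vision-language models have achieved notable success across many fields, prior work has not systematically evaluated whether VLMs can act as domain experts that accurately answer marine questions requiring specialized knowledge. The marine domain poses particular challenges and requirements, so the effectiveness and limits of existing models in this domain need to be assessed.

Method: The study constructs MarineEval, the first large-scale marine VLM dataset and benchmark, containing 2,000 image-based question-answer pairs. The dataset is designed for diversity and coverage, spanning 7 task dimensions and 20 capacity dimensions. Domain requirements are integrated into data construction and verified by marine domain experts.

Result: A comprehensive benchmark of 17 existing VLMs on MarineEval shows that existing models cannot effectively answer domain-specific questions and that substantial room for improvement remains. Models perform poorly on questions requiring marine expertise, exposing the limitations of current VLMs in specialized domains.

Conclusion: The study shows that existing vision-language models have a significant capability gap in the marine domain and need dedicated optimization for domain knowledge. The benchmark provides an evaluation tool for future research on improving VLMs in specialized domains and underscores the importance of domain adaptation and expert knowledge integration.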

📄 Abstract

We have witnessed promising progress led by large language models (LLMs) and further vision language models (VLMs) in handling various queries as a general-purpose assistant. VLMs, as a bridge to connect the visual world and language corpus, receive both visual content and various text-only user instructions to generate corresponding responses. Though great success has been achieved by VLMs in various fields, in this work, we ask whether the existing VLMs can act as domain experts, accurately answering marine questions, which require significant domain expertise and address special domain challenges/requirements. To comprehensively evaluate the effectiveness and explore the boundary of existing VLMs, we construct the first large-scale marine VLM dataset and benchmark called MarineEval, with 2,000 image-based question-answering pairs. During our dataset construction, we ensure the diversity and coverage of the constructed data: 7 task dimensions and 20 capacity dimensions. The domain requirements are specially integrated into the data construction and further verified by the corresponding marine domain experts. We comprehensively benchmark 17 existing VLMs on our MarineEval and also investigate the limitations of existing models in answering marine research questions. The experimental results reveal that existing VLMs cannot effectively answer the domain-specific questions, and there is still substantial room for further performance improvement. We hope our new benchmark and observations will facilitate future research. Project Page: http://marineeval.hkustvgd.com/

[22] ORCA: Object Recognition and Comprehension for Archiving Marine Species

Yuk-Kwan Wong, Haixin Liang, Zeyu Ma, Yiwei Chen, Ziqiang Zheng, Rinaldi Gotama, Pascal Sebastian, Lauren D. Sparks, Sai-Kit Yeung

🧩 TL;DR

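This paper presents ORCA, a multi-modal benchmark dataset for marine visual understanding with 14,647 images and 42,217 bounding box annotations. It targets the limited training data and unsystematic task formulation in the marine domain and provides a standardized evaluation framework for marine ecosystem monitoring.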

📘 Detailed Summary

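Motivation: Marine visual understanding is essential for monitoring and protecting marine ecosystems, but progress is hindered by limited training data and the lack of a systematic task formulation. Existing approaches do not align domain-specific marine challenges with well-defined computer vision tasks, which limits effective model application.

Method: The team builds the ORCA multi-modal benchmark, comprising 14,647 images from 478 species with 42,217 bounding box annotations and 22,321 expert-verified instance captions. The dataset provides fine-grained visual and textual annotations that capture morphology-oriented attributes across diverse marine species. Eighteen state-of-the-art models are evaluated on three tasks: object detection (closed-set and open-vocabulary), instance captioning, and visual grounding.

Result: The experiments highlight key challenges in marine visual understanding, including species diversity, morphological overlap, and specialized domain demands. Existing models perform poorly on these tasks, underscoring the difficulty of the domain, and the benchmark gives a systematic assessment of model performance.

Conclusion: ORCA establishes a comprehensive benchmark for marine visual understanding and advances research in the area. Its multi-modal annotations and systematic task formulation help align marine-specific challenges with computer vision methods, providing infrastructure for automatic and scalable marine biological surveys.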

📄 Abstract

Marine visual understanding is essential for monitoring and protecting marine ecosystems, enabling automatic and scalable biological surveys. However, progress is hindered by limited training data and the lack of a systematic task formulation that aligns domain-specific marine challenges with well-defined computer vision tasks, thereby limiting effective model application. To address this gap, we present ORCA, a multi-modal benchmark for marine research comprising 14,647 images from 478 species, with 42,217 bounding box annotations and 22,321 expert-verified instance captions. The dataset provides fine-grained visual and textual annotations that capture morphology-oriented attributes across diverse marine species. To catalyze methodological advances, we evaluate 18 state-of-the-art models on three tasks: object detection (closed-set and open-vocabulary), instance captioning, and visual grounding. Results highlight key challenges, including species diversity, morphological overlap, and specialized domain demands, underscoring the difficulty of marine understanding. ORCA thus establishes a comprehensive benchmark to advance research in the marine domain. Project Page: http://orca.hkustvgd.com/.

[23] VisRes Bench: On Evaluating the Visual Reasoning Capabilities of VLMs

Brigitta Malagurski Törtei, Yasser Dahou, Ngoc Dung Huynh, Wamiq Reyaz Para, Phúc H. Lê Khac, Ankit Singh, Sofian Chaybouti, Sanath Narayan

🧩 TL;DR

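This paper introduces VisRes Bench, a benchmark for evaluating the visual reasoning of vision-language models in naturalistic settings without language supervision. It reveals clear limitations of current models in perceptual and relational reasoning.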

📘 Detailed Summary

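Motivation: Vision-language models have made remarkable progress on tasks such as visual question answering and image captioning, yet it remains unclear whether they perform visual reasoning or rely mainly on linguistic priors. No existing benchmark systematically evaluates visual reasoning in naturalistic settings without contextual language supervision, which hinders deeper analysis of models' abstract visual understanding.

Method: The team designs VisRes Bench with three reasoning levels of increasing complexity. Level 1 tests perceptual completion and global image matching under perturbations such as blur, texture changes, occlusion, and rotation. Level 2 evaluates rule-based inference over a single attribute (e.g., color, count, orientation). Level 3 targets compositional reasoning that requires integrating multiple visual attributes. The benchmark contains more than 19,000 controlled task images and is designed to isolate distinct reasoning abilities.

Result: Under subtle perceptual perturbations, state-of-the-art vision-language models perform near random, revealing limited abstraction beyond pattern recognition. Across the three levels of complexity, the analysis uncovers clear limitations in perceptual and relational visual reasoning.

Conclusion: VisRes provides a unified framework for advancing abstract visual reasoning in multimodal research. The results indicate that current vision-language models rely largely on pattern recognition rather than genuine visual reasoning, pointing to the need for stronger perceptual and relational reasoning in future model designs.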

📄 Abstract

Vision-Language Models (VLMs) have achieved remarkable progress across tasks such as visual question answering and image captioning. Yet, the extent to which these models perform visual reasoning as opposed to relying on linguistic priors remains unclear. To address this, we introduce VisRes Bench, a benchmark designed to study visual reasoning in naturalistic settings without contextual language supervision. Analyzing model behavior across three levels of complexity, we uncover clear limitations in perceptual and relational visual reasoning capacities. VisRes isolates distinct reasoning abilities across its levels. Level 1 probes perceptual completion and global image matching under perturbations such as blur, texture changes, occlusion, and rotation; Level 2 tests rule-based inference over a single attribute (e.g., color, count, orientation); and Level 3 targets compositional reasoning that requires integrating multiple visual attributes. Across more than 19,000 controlled task images, we find that state-of-the-art VLMs perform near random under subtle perceptual perturbations, revealing limited abstraction beyond pattern recognition. We conclude by discussing how VisRes provides a unified framework for advancing abstract visual reasoning in multimodal research.

[24] Human Motion Estimation with Everyday Wearables

Siqi Zhu, Yixuan Li, Junfu Li, Qi Wu, Zan Wang, Haozhe Ma, Wei Liang

🧩 TL;DR

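This paper presents EveryWear, a lightweight human motion capture approach based on everyday wearables. It estimates full-body motion from a smartphone, smartwatch, earbuds, and camera-equipped smart glasses without explicit calibration, and it introduces Ego-Elec, a real-world dataset, to support research in this direction.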

📘 Detailed Summary

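Motivation: Existing on-body device-based methods for full-body motion estimation suffer from poor wearability, expensive hardware, and cumbersome calibration, which hinder their adoption in daily life. A lighter, more practical solution that needs no explicit calibration is required.

Method: The approach uses a multimodal teacher-student framework that integrates visual cues from egocentric cameras with inertial signals from consumer devices. It relies entirely on everyday wearables: a smartphone, smartwatch, earbuds, and smart glasses equipped with one forward-facing and two downward-facing cameras, with no explicit calibration before use.

Result: Experiments show that the method outperforms baseline models, validating its effectiveness for practical full-body motion estimation. Training directly on real-world rather than synthetic data removes the sim-to-real gap that constrains prior work.

Conclusion: The study demonstrates that practical human motion capture with everyday wearables is feasible. The Ego-Elec real-world dataset provides a benchmark for related research, and the approach offers a more convenient, lower-cost option for applications such as XR interaction.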

📄 Abstract

While on-body device-based human motion estimation is crucial for applications such as XR interaction, existing methods often suffer from poor wearability, expensive hardware, and cumbersome calibration, which hinder their adoption in daily life. To address these challenges, we present EveryWear, a lightweight and practical human motion capture approach based entirely on everyday wearables: a smartphone, smartwatch, earbuds, and smart glasses equipped with one forward-facing and two downward-facing cameras, requiring no explicit calibration before use. We introduce Ego-Elec, a 9-hour real-world dataset covering 56 daily activities across 17 diverse indoor and outdoor environments, with ground-truth 3D annotations provided by the motion capture (MoCap), to facilitate robust research and benchmarking in this direction. Our approach employs a multimodal teacher-student framework that integrates visual cues from egocentric cameras with inertial signals from consumer devices. By training directly on real-world data rather than synthetic data, our model effectively eliminates the sim-to-real gap that constrains prior work. Experiments demonstrate that our method outperforms baseline models, validating its effectiveness for practical full-body motion estimation.

[25] Latent Implicit Visual Reasoning

Kelvin Li, Chuyi Shang, Leonid Karlinsky, Rogerio Feris, Trevor Darrell, Roei Herzig

🧩 TL;DR

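This paper proposes a task-agnostic mechanism that lets large multimodal models discover and use visual reasoning tokens on their own, without explicit supervision, improving performance on vision-centric tasks and achieving state-of-the-art results.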

📘 Detailed Summary

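Motivation: Current large multimodal models rely on text as their core reasoning modality and are limited on reasoning tasks that are predominantly visual. Existing approaches supervise intermediate visual steps with helper images, depth maps, or image crops, but these strategies impose restrictive priors on what "useful" visual abstractions look like, add annotation cost, and struggle to generalize across tasks.

Method: The paper proposes a task-agnostic mechanism that trains large multimodal models to discover and use visual reasoning tokens without explicit supervision. These tokens attend globally and re-encode the image in a task-adaptive way, enabling the model to extract relevant visual information without hand-crafted supervision.

Result: The approach outperforms direct fine-tuning and achieves state-of-the-art results on a diverse range of vision-centric tasks, including those where intermediate abstractions are hard to specify. It also generalizes to multi-task instruction tuning.

Conclusion: The study shows that a task-agnostic mechanism for discovering visual reasoning tokens helps large multimodal models move beyond text-centric reasoning and perform better on visual reasoning tasks. It points toward more balanced multimodal reasoning systems with less reliance on hand-crafted supervision and better cross-task adaptability.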

📄 Abstract

While Large Multimodal Models (LMMs) have made significant progress, they remain largely text-centric, relying on language as their core reasoning modality. As a result, they are limited in their ability to handle reasoning tasks that are predominantly visual. Recent approaches have sought to address this by supervising intermediate visual steps with helper images, depth maps, or image crops. However, these strategies impose restrictive priors on what "useful" visual abstractions look like, add heavy annotation costs, and struggle to generalize across tasks. To address this critical limitation, we propose a task-agnostic mechanism that trains LMMs to discover and use visual reasoning tokens without explicit supervision. These tokens attend globally and re-encode the image in a task-adaptive way, enabling the model to extract relevant visual information without hand-crafted supervision. Our approach outperforms direct fine-tuning and achieves state-of-the-art results on a diverse range of vision-centric tasks -- including those where intermediate abstractions are hard to specify -- while also generalizing to multi-task instruction tuning.

[26] SegMo: Segment-aligned Text to 3D Human Motion Generation

Bowen Dang, Lin Wu, Xiaohang Yang, Zheng Yuan, Zhixiang Chen

🧩 TL;DR

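This paper proposes SegMo, a framework for text-driven 3D human motion generation based on fine-grained alignment between text and motion segments. It decomposes textual descriptions and motion sequences into semantically coherent segments and uses contrastive learning to establish fine-grained cross-modal correspondence.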

📘 Detailed Summary

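Motivation: Existing methods align textual descriptions with human motion mainly at the sequence level, neglecting the internal semantic structure of each modality. Yet both motion descriptions and motion sequences decompose naturally into smaller, semantically coherent segments, which can serve as atomic alignment units for finer-grained correspondence.

Method: SegMo consists of three core modules. Text Segment Extraction decomposes complex textual descriptions into temporally ordered phrases, each representing a simple atomic action. Motion Segment Extraction partitions complete motion sequences into corresponding motion segments. Fine-grained Text-Motion Alignment aligns text and motion segments with contrastive learning, learning a shared embedding space.

Result: On two widely used datasets, including HumanML3D, SegMo improves on a strong baseline, reaching a TOP 1 score of 0.553 on the HumanML3D test set. Thanks to the learned shared embedding space for text and motion segments, the method also applies to retrieval-style tasks such as motion grounding and motion-to-text retrieval.

Conclusion: The study shows that fine-grained, segment-level alignment can improve the quality and semantic consistency of text-driven motion generation. Beyond better generation, the method yields transferable cross-modal representations useful for multimodal alignment and retrieval tasks.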

📄 Abstract

Generating 3D human motions from textual descriptions is an important research problem with broad applications in video games, virtual reality, and augmented reality. Recent methods align the textual description with human motion at the sequence level, neglecting the internal semantic structure of modalities. However, both motion descriptions and motion sequences can be naturally decomposed into smaller and semantically coherent segments, which can serve as atomic alignment units to achieve finer-grained correspondence. Motivated by this, we propose SegMo, a novel Segment-aligned text-conditioned human Motion generation framework to achieve fine-grained text-motion alignment. Our framework consists of three modules: (1) Text Segment Extraction, which decomposes complex textual descriptions into temporally ordered phrases, each representing a simple atomic action; (2) Motion Segment Extraction, which partitions complete motion sequences into corresponding motion segments; and (3) Fine-grained Text-Motion Alignment, which aligns text and motion segments with contrastive learning. Extensive experiments demonstrate that SegMo improves the strong baseline on two widely used datasets, achieving an improved TOP 1 score of 0.553 on the HumanML3D test set. Moreover, thanks to the learned shared embedding space for text and motion segments, SegMo can also be applied to retrieval-style tasks such as motion grounding and motion-to-text retrieval.
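
As a rough illustration of what contrastive alignment between text and motion segments can look like, the sketch below computes a symmetric InfoNCE loss over paired segment embeddings. This is an assumption for illustration only: the abstract says the alignment uses contrastive learning but does not specify the loss, the temperature, or the encoders, and the function names here are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(text_emb, motion_emb, temperature=0.07):
    """Symmetric InfoNCE loss; text_emb[i] and motion_emb[i] form a matched pair.

    The loss choice and temperature are illustrative assumptions, not
    details taken from the SegMo abstract.
    """
    t = [l2_normalize(v) for v in text_emb]
    m = [l2_normalize(v) for v in motion_emb]
    n = len(t)
    # Cosine-similarity logits between every text and every motion segment.
    logits = [
        [sum(a * b for a, b in zip(t[i], m[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_on_diagonal(rows):
        total = 0.0
        for i, row in enumerate(rows):
            peak = max(row)  # subtract the max for numerical stability
            log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    columns = [list(col) for col in zip(*logits)]
    # Average the text-to-motion and motion-to-text directions.
    return 0.5 * (cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(columns))
```

With correctly paired embeddings the loss is close to zero, and it grows when the pairing is shuffled, which is the signal that pulls matched text and motion segments together in the shared embedding space.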

[27] DreaMontage: Arbitrary Frame-Guided One-Shot Video Generation

Jiawei Liu, Junqiao Li, Jiangfan Deng, Gen Li, Siyu Zhou, Zetao Fang, Shanshan Lao, Zengde Deng, Jianing Zhu, Tingting Ma, Jiayi Li, Yunqiu Wang, Qian He, Xinglong Wu

🧩 TL;DR

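This paper proposes DreaMontage, a framework that combines a lightweight intermediate-conditioning mechanism, a Visual Expression fine-tuning strategy, and segment-wise autoregressive inference to generate seamless, expressive, long-duration one-shot videos from diverse user inputs, addressing the weak visual smoothness and temporal coherence of existing video generation methods.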

📘 Detailed Summary

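Motivation: The one-shot technique has distinct aesthetic value in filmmaking, but realizing it in practice is often limited by prohibitive costs and complex real-world constraints. Existing video generation models typically rely on naive clip concatenation, which struggles to maintain visual smoothness and temporal coherence and cannot meet the demands of high-quality one-shot video generation.

Method: DreaMontage addresses the challenge along three dimensions. First, it integrates a lightweight intermediate-conditioning mechanism into the DiT architecture and uses an Adaptive Tuning strategy that leverages base training data to achieve robust arbitrary-frame control. Second, it improves visual fidelity and cinematic expressiveness with a curated high-quality dataset and a Visual Expression SFT stage, and it applies a Tailored DPO scheme to improve subject motion rationality and transition smoothness. Third, it uses a memory-efficient Segment-wise Auto-Regressive (SAR) inference strategy to support long sequences.

Result: Extensive experiments show that the approach produces visually striking and seamlessly coherent one-shot effects while remaining computationally efficient. The framework improves the success rate and usability of generated content, letting users turn fragmented visual materials into vivid, cohesive one-shot cinematic experiences.

Conclusion: The study offers a comprehensive solution for high-quality one-shot video generation. Its conditioning mechanism, visual expression optimization, and efficient inference strategy overcome the long-sequence coherence limits of existing video generation techniques and open new possibilities for virtual filmmaking and creative content generation.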

📄 Abstract

The "one-shot" technique represents a distinct and sophisticated aesthetic in filmmaking. However, its practical realization is often hindered by prohibitive costs and complex real-world constraints. Although emerging video generation models offer a virtual alternative, existing approaches typically rely on naive clip concatenation, which frequently fails to maintain visual smoothness and temporal coherence. In this paper, we introduce DreaMontage, a comprehensive framework designed for arbitrary frame-guided generation, capable of synthesizing seamless, expressive, and long-duration one-shot videos from diverse user-provided inputs. To achieve this, we address the challenge through three primary dimensions. (i) We integrate a lightweight intermediate-conditioning mechanism into the DiT architecture. By employing an Adaptive Tuning strategy that effectively leverages base training data, we unlock robust arbitrary-frame control capabilities. (ii) To enhance visual fidelity and cinematic expressiveness, we curate a high-quality dataset and implement a Visual Expression SFT stage. In addressing critical issues such as subject motion rationality and transition smoothness, we apply a Tailored DPO scheme, which significantly improves the success rate and usability of the generated content. (iii) To facilitate the production of extended sequences, we design a Segment-wise Auto-Regressive (SAR) inference strategy that operates in a memory-efficient manner. Extensive experiments demonstrate that our approach achieves visually striking and seamlessly coherent one-shot effects while maintaining computational efficiency, empowering users to transform fragmented visual materials into vivid, cohesive one-shot cinematic experiences.

[28] AnyAD: Unified Any-Modality Anomaly Detection in Incomplete Multi-Sequence MRI

Changwei Wu, Yifei Chen, Yuxin Du, Mingxuan Liu, Jinying Zong, Beining Wu, Jie Dong, Feiwei Qin, Yunkang Cao, Qiyuan Tian

🧩 TL;DR

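This study proposes a unified Any-Modality anomaly detection framework that performs robust brain anomaly detection and localization under arbitrary MRI modality availability, using feature distribution alignment and Intrinsic Normal Prototypes to overcome the limited clinical scalability of existing methods.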

📘 Detailed Summary

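Motivation: Anomaly detection in brain MRI is challenged by the scarcity of annotated abnormal cases and the frequent absence of key imaging modalities in clinical workflows. Existing single-class or multi-class anomaly detection models typically rely on fixed modality configurations, require repeated training, or fail to generalize to unseen modality combinations, which limits their clinical scalability.

Method: The framework integrates a dual-pathway DINOv2 encoder with a feature distribution alignment mechanism that statistically aligns incomplete-modality features with full-modality representations for stable inference. It introduces an Intrinsic Normal Prototypes (INPs) extractor and an INP-guided decoder that reconstruct only normal anatomical patterns while naturally amplifying abnormal deviations. Through randomized modality masking and indirect feature completion during training, the model learns to adapt to all modality configurations without re-training.

Result: Extensive experiments on BraTS2018, MU-Glioma-Post, and Pretreat-MetsToBrain-Masks show that the approach consistently surpasses state-of-the-art industrial and medical anomaly detection baselines across 7 modality combinations and generalizes better, demonstrating its effectiveness under real-world, imperfect modality conditions.

Conclusion: The study establishes a scalable paradigm for multimodal medical anomaly detection that handles incomplete modality conditions in real clinical settings. Its unified any-modality framework addresses the limitations of existing methods and offers a more adaptable and robust option for clinical deployment.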

📄 Abstract

Reliable anomaly detection in brain MRI remains challenging due to the scarcity of annotated abnormal cases and the frequent absence of key imaging modalities in real clinical workflows. Existing single-class or multi-class anomaly detection (AD) models typically rely on fixed modality configurations, require repetitive training, or fail to generalize to unseen modality combinations, limiting their clinical scalability. In this work, we present a unified Any-Modality AD framework that performs robust anomaly detection and localization under arbitrary MRI modality availability. The framework integrates a dual-pathway DINOv2 encoder with a feature distribution alignment mechanism that statistically aligns incomplete-modality features with full-modality representations, enabling stable inference even with severe modality dropout. To further enhance semantic consistency, we introduce an Intrinsic Normal Prototypes (INPs) extractor and an INP-guided decoder that reconstruct only normal anatomical patterns while naturally amplifying abnormal deviations. Through randomized modality masking and indirect feature completion during training, the model learns to adapt to all modality configurations without re-training. Extensive experiments on BraTS2018, MU-Glioma-Post, and Pretreat-MetsToBrain-Masks demonstrate that our approach consistently surpasses state-of-the-art industrial and medical AD baselines across 7 modality combinations, achieving superior generalization. This study establishes a scalable paradigm for multimodal medical AD under real-world, imperfect modality conditions. Our source code is available at https://github.com/wuchangw/AnyAD.
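
To make the idea of randomized modality masking concrete, here is a minimal sketch that samples a non-empty subset of available MRI sequences and blanks out the rest. Everything specific in it is an assumption for illustration: the sequence names, the zero-fill placeholder, and the uniform sampling over subsets are not stated in the abstract. With three sequences there are 2^3 - 1 = 7 non-empty subsets, which matches the count of 7 modality combinations quoted above, though whether the paper counts combinations this way is also an assumption.

```python
import random
from itertools import combinations

# Hypothetical sequence names, chosen only for illustration.
MODALITIES = ("T1", "T2", "FLAIR")


def modality_subsets(modalities):
    """Enumerate every non-empty subset of the given modalities."""
    return [
        subset
        for size in range(1, len(modalities) + 1)
        for subset in combinations(modalities, size)
    ]


def mask_modalities(sample, keep, fill=0.0):
    """Return a copy of `sample` where modalities outside `keep` are blanked."""
    return {
        name: list(values) if name in keep else [fill] * len(values)
        for name, values in sample.items()
    }


def random_masked_sample(sample, rng):
    """Pick one non-empty modality subset uniformly at random and apply it."""
    keep = rng.choice(modality_subsets(tuple(sample)))
    return keep, mask_modalities(sample, keep)
```

Applying `random_masked_sample` to each training example exposes a model to every availability pattern over time, which is one plausible way to avoid retraining per configuration.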

[29] Surgical Scene Segmentation using a Spike-Driven Video Transformer with Real-Time Potential

Shihao Zou, Jingjing Li, Wei Ji, Jincai Huang, Kai Wang, Guo Dan, Weixin Si, Yi Pan

🧩 TL;DR

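This paper proposes SpikeSurgSeg, the first spike-driven video Transformer framework for surgical scene segmentation, with real-time potential on non-GPU platforms. Spiking neural networks cut computational overhead substantially while keeping accuracy comparable to ANN models.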

📘 Detailed Summary

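Motivation: Modern surgical systems rely on intelligent scene understanding to improve intra-operative safety. Existing deep learning models, particularly large-scale foundation models, segment accurately but demand substantial computation and power, making real-time deployment in resource-constrained surgical environments difficult. Spiking neural networks are a promising efficient paradigm, but their performance is constrained by the scarcity of labeled surgical data and the inherently sparse nature of surgical video representations.

Method: The paper proposes SpikeSurgSeg, the first spike-driven video Transformer architecture for surgical scene segmentation. To address limited surgical annotations, it introduces a surgical-scene masked autoencoding pretraining strategy that enables robust spatiotemporal representation learning via layer-wise tube masking. On top of the pretrained backbone, a lightweight spike-driven segmentation head produces temporally consistent predictions while preserving the low-latency characteristics of spiking neural networks.

Result: Extensive experiments on EndoVis18 and the in-house SurgBleed dataset show that SpikeSurgSeg achieves mIoU comparable to state-of-the-art ANN-based models while reducing inference latency by at least 8×. It also delivers over 20× acceleration relative to most foundation-model baselines, underscoring its potential for time-critical surgical scene segmentation.

Conclusion: The study demonstrates the practical feasibility of spiking neural networks in resource-constrained surgical environments, achieving a large latency reduction while maintaining segmentation accuracy through its pretraining strategy and architecture. It points to a direction for real-time surgical intelligence systems, especially where low-power, low-latency deployment is required.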

📄 Abstract

Modern surgical systems increasingly rely on intelligent scene understanding to provide timely situational awareness for enhanced intra-operative safety. Within this pipeline, surgical scene segmentation plays a central role in accurately perceiving operative events. Although recent deep learning models, particularly large-scale foundation models, achieve remarkable segmentation accuracy, their substantial computational demands and power consumption hinder real-time deployment in resource-constrained surgical environments. To address this limitation, we explore emerging spiking neural networks (SNNs) as a promising paradigm for highly efficient surgical intelligence. However, their performance is still constrained by the scarcity of labeled surgical data and the inherently sparse nature of surgical video representations. To this end, we propose SpikeSurgSeg, the first spike-driven video Transformer framework tailored for surgical scene segmentation with real-time potential on non-GPU platforms. To address the limited availability of surgical annotations, we introduce a surgical-scene masked autoencoding pretraining strategy for SNNs that enables robust spatiotemporal representation learning via layer-wise tube masking. Building on this pretrained backbone, we further adopt a lightweight spike-driven segmentation head that produces temporally consistent predictions while preserving the low-latency characteristics of SNNs. Extensive experiments on EndoVis18 and our in-house SurgBleed dataset demonstrate that SpikeSurgSeg achieves mIoU comparable to SOTA ANN-based models while reducing inference latency by at least 8×. Notably, it delivers over 20× acceleration relative to most foundation-model baselines, underscoring its potential for time-critical surgical scene segmentation.
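
For readers unfamiliar with tube masking, the sketch below shows the basic idea as it is commonly used in masked video autoencoding: one spatial patch mask is sampled and then repeated across every frame, so a masked patch stays hidden for the whole clip. This is a generic illustration only. The abstract does not describe how the "layer-wise" variant works, so that part is not modeled here, and the function name, grid sizes, and mask ratio are hypothetical.

```python
import random


def tube_mask(num_frames, grid_h, grid_w, mask_ratio, rng):
    """Build a [frames][rows][cols] boolean mask where True marks a hidden patch.

    The same spatial pattern is reused for every frame, which is what makes
    the mask a set of "tubes" through time. Parameters are illustrative.
    """
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))
    masked = set(rng.sample(range(num_patches), num_masked))
    frame_mask = [
        [(row * grid_w + col) in masked for col in range(grid_w)]
        for row in range(grid_h)
    ]
    # Copy the pattern per frame so callers can modify frames independently.
    return [[row[:] for row in frame_mask] for _ in range(num_frames)]
```

Because the hidden patches are identical across frames, a model cannot reconstruct them by copying from a neighboring frame, which is the usual motivation for tube masking over per-frame random masking.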

[30] Streaming Video Instruction Tuning

Jiaer Xia, Peixian Chen, Mengdan Zhang, Xing Sun, Kaiyang Zhou

🧩 TL;DR

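This paper presents Streamo, a real-time streaming video LLM that serves as a general-purpose interactive assistant and performs a broad range of streaming video tasks, including real-time narration, action understanding, event captioning, temporal event grounding, and time-sensitive question answering.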

📘 Detailed Summary

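Motivation: Existing online video models focus narrowly on tasks such as question answering or captioning. There is no general-purpose interactive assistant that handles diverse real-time video tasks, which limits the potential of video understanding in continuous video streams.

Method: The study constructs Streamo-Instruct-465K, a large-scale instruction-following dataset tailored for streaming video understanding. It covers diverse temporal contexts and multi-task supervision, and an end-to-end training pipeline enables unified training across heterogeneous streaming tasks.

Result: Experiments show that Streamo exhibits strong temporal reasoning, responsive interaction, and broad generalization across a variety of streaming benchmarks, bridging the gap between offline video perception models and real-time multimodal assistants.

Conclusion: Streamo is a step toward unified, intelligent video understanding in continuous video streams. It demonstrates that training on a large-scale instruction dataset can yield a general-purpose streaming video assistant and points to a direction for real-time multimodal interactive systems.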

📄 Abstract

We present Streamo, a real-time streaming video LLM that serves as a general-purpose interactive assistant. Unlike existing online video models that focus narrowly on question answering or captioning, Streamo performs a broad spectrum of streaming video tasks, including real-time narration, action understanding, event captioning, temporal event grounding, and time-sensitive question answering. To develop such versatility, we construct Streamo-Instruct-465K, a large-scale instruction-following dataset tailored for streaming video understanding. The dataset covers diverse temporal contexts and multi-task supervision, enabling unified training across heterogeneous streaming tasks. After training end-to-end on the instruction-following dataset through a streamlined pipeline, Streamo exhibits strong temporal reasoning, responsive interaction, and broad generalization across a variety of streaming benchmarks. Extensive experiments show that Streamo bridges the gap between offline video perception models and real-time multimodal assistants, making a step toward unified, intelligent video understanding in continuous video streams.

Li-Zhong Szu-Tu, Ting-Lin Wu, Chia-Jui Chang, He Syu, Yu-Lun Liu

🧩 TL;DR

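This study exposes a significant popularity bias in state-of-the-art vision-language models, which achieve up to 34% higher accuracy on famous buildings than on ordinary ones, indicating reliance on memorization rather than generalizable understanding. To study this, the authors build YearGuessr, the largest open benchmark for the task, and propose popularity-aware evaluation metrics to quantify the bias.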

📘 Detailed Summary

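Motivation: State-of-the-art vision-language models show a significant popularity bias, with up to 34% higher accuracy on famous buildings than on ordinary ones, which suggests over-reliance on memorization rather than generalizable understanding. The bias exposes a critical flaw in model reasoning and calls for a systematic way to quantify it.

Method: The authors build the YearGuessr dataset of 55,546 building images from 157 countries, annotated with continuous ordinal construction-year labels (1001-2024), GPS data, and page-view counts as a proxy for popularity. They frame construction-year prediction as ordinal regression and introduce popularity-aware interval accuracy metrics to quantify the bias. More than 30 models are evaluated, including the proposed YearCLIP model.

Result: Vision-language models excel on popular, memorized items but struggle significantly on unrecognized subjects. Accuracy on famous buildings is up to 34% higher than on ordinary ones, confirming heavy reliance on memorization rather than generalizable understanding. The authors describe YearGuessr as the largest open benchmark for this task.

Conclusion: The study exposes a critical flaw in the reasoning of vision-language models: over-reliance on memorization rather than understanding. The YearGuessr benchmark and the popularity-aware metrics give a systematic framework for assessing generalization. The finding matters for building more robust, less biased multimodal models and underlines the need for reasoning methods that do not depend on memorization.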

📄 Abstract

We expose a significant popularity bias in state-of-the-art vision-language models (VLMs), which achieve up to 34% higher accuracy on famous buildings compared to ordinary ones, indicating a reliance on memorization over generalizable understanding. To systematically investigate this, we introduce the largest open benchmark for this task: the YearGuessr dataset, a collection of 55,546 building images with multi-modal attributes from 157 countries, annotated with continuous ordinal labels of their construction year (1001-2024), GPS data, and page-view counts as a proxy for popularity. Using this dataset, we frame the construction year prediction task as ordinal regression and introduce popularity-aware interval accuracy metrics to quantify this bias. Our resulting benchmark of 30+ models, including our YearCLIP model, confirms that VLMs excel on popular, memorized items but struggle significantly with unrecognized subjects, exposing a critical flaw in their reasoning capabilities. Project page: https://sytwu.github.io/BeyondMemo/
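
The abstract names popularity-aware interval accuracy metrics but does not define them. The sketch below shows one plausible reading, offered only as an assumption: a prediction counts as correct if it falls within a tolerance of the true construction year, and accuracy is reported separately for popular and ordinary buildings, split by page views. The function names, the 5-year tolerance, the 1,000-view threshold, and the sample records are all invented for illustration and do not come from the paper.

```python
def interval_accuracy(pairs, tolerance):
    """Fraction of (predicted_year, true_year) pairs within +/- tolerance years."""
    hits = sum(abs(predicted - actual) <= tolerance for predicted, actual in pairs)
    return hits / len(pairs)


def popularity_gap(records, tolerance=5, view_threshold=1000):
    """Compare interval accuracy on popular vs. ordinary buildings.

    records: iterable of (predicted_year, true_year, page_views).
    The threshold and tolerance are hypothetical defaults.
    Both buckets must be non-empty.
    """
    popular = [(p, t) for p, t, views in records if views >= view_threshold]
    ordinary = [(p, t) for p, t, views in records if views < view_threshold]
    acc_popular = interval_accuracy(popular, tolerance)
    acc_ordinary = interval_accuracy(ordinary, tolerance)
    return acc_popular, acc_ordinary, acc_popular - acc_ordinary


# Made-up records, not data from the paper.
sample = [
    (1900, 1902, 5000),
    (1850, 1850, 8000),
    (1700, 1760, 10),
    (1990, 1993, 20),
]
print(popularity_gap(sample))
```

A positive gap on real data would be the kind of signal the abstract attributes to memorization of famous buildings.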

cs.CL [Back]

[32] Adversarial Training for Failure-Sensitive User Simulation in Mental Health Dialogue Optimization

Ziyi Zhu, Olivier Tieleman, Caitlin A. Stamatis, Luka Smyth, Thomas D. Hull, Daniel R. Cahn, Matteo Malgaroli

🧩 TL;DR

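This paper presents an adversarial training framework that iteratively improves user simulator realism through a competitive dynamic between a generator (the user simulator) and a discriminator. Applied to evaluating mental health support dialogue systems, it markedly improves the simulator's ability to expose system failure modes.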

📘 Detailed Summary

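Motivation: Evaluating task-oriented dialogue systems requires realistic user simulation, yet existing simulators struggle to replicate human behavior accurately. An effective simulator should expose the failure modes of the system under evaluation, and mental health support chatbots in particular need reliable, cost-effective evaluation methods.

Method: The approach uses an adversarial training framework built on a competitive dynamic between a generator (the user simulator) and a discriminator, improving simulator realism over successive adversarial iterations. It is applied to mental health support chatbots, and fine-tuned simulators are compared against zero-shot base models.

Result: Fine-tuned simulators dramatically outperform zero-shot base models at surfacing system issues, and adversarial training further improves diversity, distributional alignment, and predictive validity. Simulated and real failure occurrence rates are strongly correlated, and discriminator accuracy drops sharply after three adversarial iterations, suggesting improved realism.

Conclusion: Adversarial training is a promising approach for creating realistic user simulators, especially in mental health support dialogue. It enables rapid, reliable, and cost-effective system evaluation before deployment while maintaining low distributional divergence of failure modes.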

📄 Abstract

Realistic user simulation is crucial for training and evaluating task-oriented dialogue (TOD) systems, yet creating simulators that accurately replicate human behavior remains challenging. A key property of effective simulators is their ability to expose failure modes of the systems they evaluate. We present an adversarial training framework that iteratively improves user simulator realism through a competitive dynamic between a generator (user simulator) and a discriminator. Applied to mental health support chatbots, our approach demonstrates that fine-tuned simulators dramatically outperform zero-shot base models at surfacing system issues, and adversarial training further enhances diversity, distributional alignment, and predictive validity. The resulting simulator achieves a strong correlation between simulated and real failure occurrence rates across diverse chatbot configurations while maintaining low distributional divergence of failure modes. Discriminator accuracy decreases drastically after three adversarial iterations, suggesting improved realism. These results provide evidence that adversarial training is a promising approach for creating realistic user simulators in mental health support TOD domains, enabling rapid, reliable, and cost-effective system evaluation before deployment.

Zhongren Dong, Haotian Guo, Weixiang Xu, Huan Zhao, Zixing Zhang

🧩 TL;DR

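This study proposes FEND, a foundation-model-based multi-modal framework for assessing neuropsychiatric disorders, covering Alzheimer's disease, depression, and autism spectrum disorder. It integrates speech and text modalities and is evaluated systematically on multi-lingual datasets, giving the field a unified benchmark and analysis framework.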

📘 Detailed Summary

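Motivation: Neuropsychiatric disorders such as Alzheimer's disease, depression, and autism spectrum disorder show linguistic and acoustic abnormalities that can serve as biomarkers for early detection. Existing methods, however, face weak multi-lingual generalization, the absence of a unified evaluation framework, and dataset heterogeneity, so a systematic multi-modal evaluation framework is needed.

Method: The study proposes FEND, a comprehensive multi-modal framework that integrates speech and text modalities. It is evaluated systematically on 13 multi-lingual datasets spanning English, Chinese, Greek, French, and Dutch, with a focus on how multi-modal fusion performs across the different disorder detection tasks.

Result: Multi-modal fusion excels in Alzheimer's disease and depression detection but underperforms in autism spectrum disorder detection because of dataset heterogeneity. Modality imbalance is a prevalent issue, with multi-modal fusion failing to surpass the best mono-modal models. Cross-corpus experiments show robust performance when task and language are consistent, but noticeable degradation in multi-lingual and task-heterogeneous settings.

Conclusion: By providing extensive benchmarks and a detailed analysis of performance-influencing factors, FEND advances automated, lifespan-inclusive, and multi-lingual neuropsychiatric disorder assessment. The framework offers a standardized tool for fair comparison and reproducible research, and the authors encourage researchers to adopt it for systematic evaluation.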

📄 Abstract

Neuropsychiatric disorders, such as Alzheimer's disease (AD), depression, and autism spectrum disorder (ASD), are characterized by linguistic and acoustic abnormalities, offering potential biomarkers for early detection. Despite the promise of multi-modal approaches, challenges like multi-lingual generalization and the absence of a unified evaluation framework persist. To address these gaps, we propose FEND (Foundation model-based Evaluation of Neuropsychiatric Disorders), a comprehensive multi-modal framework integrating speech and text modalities for detecting AD, depression, and ASD across the lifespan. Leveraging 13 multi-lingual datasets spanning English, Chinese, Greek, French, and Dutch, we systematically evaluate multi-modal fusion performance. Our results show that multi-modal fusion excels in AD and depression detection but underperforms in ASD due to dataset heterogeneity. We also identify modality imbalance as a prevalent issue, where multi-modal fusion fails to surpass the best mono-modal models. Cross-corpus experiments reveal robust performance in task- and language-consistent scenarios but noticeable degradation in multi-lingual and task-heterogeneous settings. By providing extensive benchmarks and a detailed analysis of performance-influencing factors, FEND advances the field of automated, lifespan-inclusive, and multi-lingual neuropsychiatric disorder assessment. We encourage researchers to adopt the FEND framework for fair comparisons and reproducible research.

[34] MultiMind at SemEval-2025 Task 7: Crosslingual Fact-Checked Claim Retrieval via Multi-Source Alignment

Mohammad Mahdi Abootorabi, Alireza Ghahramani Kure, Mohammadali Mohammadkhani, Sina Elahimanesh, Mohammad Ali Ali Panah

🧩 TL;DR

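This paper presents TriAligner, a new approach to multilingual and crosslingual fact-checked claim retrieval that uses a dual-encoder architecture with contrastive learning and reports significant performance gains on SemEval-2025 Task 7.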

📘 Detailed Summary

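Motivation: In an era where misinformation spreads rapidly, effective fact-checking is increasingly critical. This work addresses the retrieval of fact-checked claims in multilingual and crosslingual settings, filling gaps in how existing methods handle multilingual alignment and the integration of information across modalities.

Method: The paper proposes TriAligner, which combines a dual-encoder architecture with contrastive learning and incorporates both native-language text and English translations across different modalities. It applies efficient data preprocessing and augmentation using large language models, and uses hard negative sampling to improve representation learning.

Result: On monolingual and crosslingual benchmarks, the method significantly outperforms baselines in retrieval accuracy and fact-checking performance, demonstrating its effectiveness and robustness in multilingual settings.

Conclusion: The study shows that combining multi-modal translation information with contrastive learning is effective for multilingual fact-checking, offers a new direction for building more robust crosslingual retrieval systems, and underlines the importance of data augmentation and hard negative sampling in representation learning.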

📄 Abstract

This paper presents our system for SemEval-2025 Task 7: Multilingual and Crosslingual Fact-Checked Claim Retrieval. In an era where misinformation spreads rapidly, effective fact-checking is increasingly critical. We introduce TriAligner, a novel approach that leverages a dual-encoder architecture with contrastive learning and incorporates both native and English translations across different modalities. Our method effectively retrieves claims across multiple languages by learning the relative importance of different sources in alignment. To enhance robustness, we employ efficient data preprocessing and augmentation using large language models while incorporating hard negative sampling to improve representation learning. We evaluate our approach on monolingual and crosslingual benchmarks, demonstrating significant improvements in retrieval accuracy and fact-checking performance over baselines.
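
To make the contrastive objective with hard negatives concrete, the sketch below implements a generic InfoNCE-style loss over embedding vectors plus a simple hard-negative selector. It is a minimal illustration under stated assumptions, not TriAligner's actual loss: the abstract does not give the formula, and the function names, temperature value, and toy vectors are all hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mine_hard_negatives(query, candidates, k):
    """Pick the k non-matching candidates most similar to the query."""
    return sorted(candidates, key=lambda c: dot(query, c), reverse=True)[:k]


def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE loss: negative log-softmax of the positive among all candidates."""
    scaled = [dot(query, positive) / temperature]
    scaled += [dot(query, negative) / temperature for negative in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    peak = max(scaled)
    log_denominator = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return log_denominator - scaled[0]


post = [1.0, 0.0]            # embedding of a social-media post (toy values)
matching_claim = [1.0, 0.0]  # its fact-checked claim
pool = [[-1.0, 0.0], [0.9, 0.436], [0.0, 1.0]]  # non-matching claims
hard = mine_hard_negatives(post, pool, k=1)
print(info_nce(post, matching_claim, hard))
```

Negatives that sit close to the query contribute a larger loss than distant ones, which is why training on mined hard negatives tends to sharpen the learned representation.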

[35] Your Reasoning Benchmark May Not Test Reasoning: Revealing Perception Bottleneck in Abstract Reasoning Benchmarks

Xinhe Wang, Jin Huang, Xingjian Zhang, Tianhao Wang, Jiaqi W. Ma

🧩 TL;DR

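This study challenges the prevailing interpretation of AI reasoning benchmarks. It argues that the performance gap on benchmarks such as ARC stems mainly from limitations in visual perception rather than deficiencies in inductive reasoning, and tests this hypothesis with a two-stage experimental pipeline that separates perception from reasoning.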

📘 Detailed Summary

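Motivation: The performance gap on reasoning benchmarks such as ARC is commonly attributed to deficiencies in machine reasoning. This study challenges that interpretation, hypothesizing that the gap arises primarily from limitations in visual perception rather than shortcomings in inductive reasoning, and aims to untangle this confound in the evaluation of machine intelligence.

Method: The study introduces a two-stage experimental pipeline that explicitly separates perception and reasoning. In the perception stage, each image is independently converted into a natural-language description. In the reasoning stage, a model induces and applies rules using these descriptions. This design prevents leakage of cross-image inductive signals and isolates the effect of perception bottlenecks on the reasoning evaluation.

Result: Experiments on three ARC-style datasets, Mini-ARC, ACRE, and Bongard-LOGO, show that perception capability is the dominant factor behind the observed performance gap. Manual inspection of reasoning traces in the VLM outputs further reveals that approximately 80% of model failures stem from perception errors rather than reasoning deficiencies.

Conclusion: ARC-style benchmarks conflate perceptual and reasoning challenges, and the observed performance gaps may overstate deficiencies in machine reasoning. The study stresses the need for evaluation protocols that disentangle perception from reasoning in order to assess progress in machine intelligence more accurately.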

📄 Abstract

Reasoning benchmarks such as the Abstraction and Reasoning Corpus (ARC) and ARC-AGI are widely used to assess progress in artificial intelligence and are often interpreted as probes of core, so-called "fluid" reasoning abilities. Despite their apparent simplicity for humans, these tasks remain challenging for frontier vision-language models (VLMs), a gap commonly attributed to deficiencies in machine reasoning. We challenge this interpretation and hypothesize that the gap arises primarily from limitations in visual perception rather than from shortcomings in inductive reasoning. To verify this hypothesis, we introduce a two-stage experimental pipeline that explicitly separates perception and reasoning. In the perception stage, each image is independently converted into a natural-language description, while in the reasoning stage a model induces and applies rules using these descriptions. This design prevents leakage of cross-image inductive signals and isolates reasoning from perception bottlenecks. Across three ARC-style datasets, Mini-ARC, ACRE, and Bongard-LOGO, we show that the perception capability is the dominant factor underlying the observed performance gap by comparing the two-stage pipeline against standard end-to-end one-stage evaluation. Manual inspection of reasoning traces in the VLM outputs further reveals that approximately 80 percent of model failures stem from perception errors. Together, these results demonstrate that ARC-style benchmarks conflate perceptual and reasoning challenges and that observed performance gaps may overstate deficiencies in machine reasoning. Our findings underscore the need for evaluation protocols that disentangle perception from reasoning when assessing progress in machine intelligence.
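
The two-stage design can be sketched as a small harness. This is an illustrative sketch, not the authors' code: `two_stage_solve`, `describe_grid`, and `toy_reason` are hypothetical names, and the toy describer and reasoner stand in for the VLM calls that the paper would use. The key property is that the describer sees one image at a time and the reasoner sees only text.

```python
def two_stage_solve(train_pairs, test_input, describe, reason):
    """Run perception and reasoning as separate stages.

    describe: maps ONE image to a text description (no access to other images).
    reason:   maps described training pairs plus the test description to an answer.
    """
    # Stage 1 (perception): each image is described independently.
    described_pairs = [(describe(x), describe(y)) for x, y in train_pairs]
    test_description = describe(test_input)
    # Stage 2 (reasoning): rule induction operates on text only.
    return reason(described_pairs, test_description)


def describe_grid(grid):
    """Toy stand-in for a perception model: count non-empty cells."""
    count = sum(cell != 0 for row in grid for cell in row)
    return f"{count} filled cells"


def toy_reason(described_pairs, test_description):
    """Toy stand-in for a reasoning model: infer a constant count offset."""
    offsets = {
        int(out.split()[0]) - int(inp.split()[0]) for inp, out in described_pairs
    }
    if len(offsets) != 1:
        return "unknown"
    (offset,) = offsets
    return f"{int(test_description.split()[0]) + offset} filled cells"


pairs = [([[1, 0]], [[1, 1]]), ([[0, 0], [1, 0]], [[1, 0], [1, 0]])]
print(two_stage_solve(pairs, [[1, 1], [0, 0]], describe_grid, toy_reason))
```

Swapping in a stronger or weaker describer while holding the reasoner fixed is one way such a harness could attribute errors to perception rather than reasoning.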

cs.AI [Back]

[36] MegaRAG: Multimodal Knowledge Graph-Based Retrieval Augmented Generation

Chi-Hsiang Hsiao, Yi-Cheng Wang, Tzung-Sheng Lin, Yi-Ren Yeh, Chu-Song Chen

🧩 TL;DR

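This paper proposes a multimodal knowledge graph-based retrieval-augmented generation method that incorporates visual cues into knowledge graph construction, retrieval, and answer generation, enabling cross-modal reasoning for better understanding of complex content.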

📘 Detailed Summary

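Motivation: Existing retrieval-augmented generation methods are constrained by limited context windows when handling long-form, domain-specific content, which makes deep reasoning difficult. Knowledge graph-based methods, meanwhile, are restricted to text-only inputs and cannot use the complementary insights offered by other modalities such as vision, which hinders full understanding of visual documents.

Method: The method builds a multimodal knowledge graph, incorporating visual cues into graph construction and using this multimodal information during both retrieval and answer generation, which provides a cross-modal reasoning mechanism for understanding complex content.

Result: Experiments show that the method outperforms existing retrieval-augmented generation approaches on both global and fine-grained question answering tasks, across textual and multimodal corpora, supporting the effectiveness of a cross-modal knowledge graph.

Conclusion: The study shows the value of integrating visual information into the knowledge graph structure, offers a new solution for handling complex multimodal content, and demonstrates the potential of cross-modal reasoning for improving content understanding, pointing to a direction for future multimodal AI systems.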

📄 Abstract

Retrieval-augmented generation (RAG) enables large language models (LLMs) to dynamically access external information, which is powerful for answering questions over previously unseen documents. Nonetheless, RAG systems struggle with high-level conceptual understanding and holistic comprehension due to limited context windows, which constrain their ability to perform deep reasoning over long-form, domain-specific content such as full-length books. To solve this problem, knowledge graphs (KGs) have been leveraged to provide entity-centric structure and hierarchical summaries, offering more structured support for reasoning. However, existing KG-based RAG solutions remain restricted to text-only inputs and fail to leverage the complementary insights provided by other modalities such as vision. On the other hand, reasoning from visual documents requires integrating textual, visual, and spatial cues into structured, hierarchical concepts. To address this issue, we introduce a multimodal knowledge graph-based RAG that enables cross-modal reasoning for better content understanding. Our method incorporates visual cues into the construction of knowledge graphs, the retrieval phase, and the answer generation process. Experimental results across both global and fine-grained question answering tasks show that our approach consistently outperforms existing RAG-based approaches on both textual and multimodal corpora.

[37] Memory Bear AI A Breakthrough from Memory to Cognition Toward Artificial General Intelligence

Deliang Wen, Ke Sun

🧩 TL;DR

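This paper proposes the Memory Bear system, which builds a human-like memory architecture grounded in cognitive science principles. By integrating multimodal information perception, dynamic memory maintenance, and adaptive cognitive services, it achieves a full-chain reconstruction of LLM memory mechanisms, significantly improving knowledge fidelity and retrieval efficiency in long conversations and reducing hallucination rates.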

📘 Detailed Summary

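Motivation: Large language models face memory limitations, including restricted context windows, long-term knowledge forgetting, redundant information accumulation, and hallucination generation. These issues severely constrain sustained dialogue and personalized services and call for a fundamental fix to LLM memory mechanisms.

Method: The Memory Bear system builds a human-like memory architecture grounded in cognitive science principles. It integrates multimodal information perception, dynamic memory maintenance, and adaptive cognitive services to achieve a full-chain reconstruction of LLM memory mechanisms, and uses memory-cognition integration to enhance contextual adaptability and reasoning capability.

Result: Across domains such as healthcare, enterprise operations, and education, Memory Bear outperforms existing solutions (e.g., Mem0, MemGPT, Graphiti) on key metrics including accuracy, token efficiency, and response latency. It significantly improves knowledge fidelity and retrieval efficiency while reducing hallucination rates.

Conclusion: The Memory Bear system marks a key step in advancing AI from "memory" to "cognition". Its full-chain reconstruction of memory mechanisms delivers engineering innovation and performance gains for sustained dialogue and personalized services, and offers a useful reference for the memory architecture of future intelligent systems.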

📄 Abstract

Large language models (LLMs) face inherent limitations in memory, including restricted context windows, long-term knowledge forgetting, redundant information accumulation, and hallucination generation. These issues severely constrain sustained dialogue and personalized services. This paper proposes the Memory Bear system, which constructs a human-like memory architecture grounded in cognitive science principles. By integrating multimodal information perception, dynamic memory maintenance, and adaptive cognitive services, Memory Bear achieves a full-chain reconstruction of LLM memory mechanisms. Across domains such as healthcare, enterprise operations, and education, Memory Bear demonstrates substantial engineering innovation and performance breakthroughs. It significantly improves knowledge fidelity and retrieval efficiency in long-term conversations, reduces hallucination rates, and enhances contextual adaptability and reasoning capability through memory-cognition integration. Experimental results show that, compared with existing solutions (e.g., Mem0, MemGPT, Graphiti), Memory Bear outperforms them across key metrics, including accuracy, token efficiency, and response latency. This marks a crucial step forward in advancing AI from "memory" to "cognition".

[38] Agentic Explainable Artificial Intelligence (Agentic XAI) Approach To Explore Better Explanation

Tomoaki Yamaguchi, Yutong Zhou, Masahiro Ryo, Keisuke Katsura

🧩 TL;DR

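This study proposes an agentic explainable AI framework that combines SHAP-based explainability with iterative refinement driven by a multimodal large language model. An agricultural recommendation case study shows that the framework improves explanation quality, but that excessive refinement degrades it, revealing a bias-variance trade-off.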

📘 Detailed Summary

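Motivation: Explainable AI (XAI) offers a data-driven understanding of how factors relate to response variables, but its technical outputs are hard to communicate to laypersons, which hinders trust in AI-based predictions. Large language models are promising tools for turning technical explanations into accessible narratives, yet their integration with agentic AI, where LLMs operate autonomously through iterative refinement, remains unexplored in XAI.

Method: The study proposes an agentic XAI framework that combines SHAP-based explainability with multimodal LLM-driven iterative refinement to generate progressively enhanced explanations. As a use case, the framework is tested as an agricultural recommendation system on rice yield data from 26 fields in Japan, improving the explanation iteratively over 11 refinement rounds (Rounds 0-10).

Result: Human experts (crop scientists, n=12) and LLMs (n=14) evaluated explanation quality against seven metrics: Specificity, Clarity, Conciseness, Practicality, Contextual Relevance, Cost Consideration, and Crop Science Credibility. Both evaluator groups confirmed that the framework improved recommendation quality, with an average score increase of 30-33% from Round 0 and a peak at Rounds 3-4, but excessive refinement caused a substantial drop in recommendation quality.

Conclusion: The results indicate that strategic early stopping (regularization) is needed to optimize practical utility, challenging the assumption of monotonic improvement and providing evidence-based design principles for agentic XAI systems. Excessive iteration reveals a bias-variance trade-off: early rounds lack explanation depth (bias), while excessive iteration introduces verbosity and ungrounded abstraction (variance).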

📄 Abstract

Explainable artificial intelligence (XAI) enables data-driven understanding of factor associations with response variables, yet communicating XAI outputs to laypersons remains challenging, hindering trust in AI-based predictions. Large language models (LLMs) have emerged as promising tools for translating technical explanations into accessible narratives, yet the integration of agentic AI, where LLMs operate as autonomous agents through iterative refinement, with XAI remains unexplored. This study proposes an agentic XAI framework combining SHAP-based explainability with multimodal LLM-driven iterative refinement to generate progressively enhanced explanations. As a use case, we tested this framework as an agricultural recommendation system using rice yield data from 26 fields in Japan. The Agentic XAI initially provided a SHAP result and explored how to improve the explanation through additional analysis iteratively across 11 refinement rounds (Rounds 0-10). Explanations were evaluated by human experts (crop scientists) (n=12) and LLMs (n=14) against seven metrics: Specificity, Clarity, Conciseness, Practicality, Contextual Relevance, Cost Consideration, and Crop Science Credibility. Both evaluator groups confirmed that the framework successfully enhanced recommendation quality with an average score increase of 30-33% from Round 0, peaking at Rounds 3-4. However, excessive refinement showed a substantial drop in recommendation quality, indicating a bias-variance trade-off where early rounds lacked explanation depth (bias) while excessive iteration introduced verbosity and ungrounded abstraction (variance), as revealed by metric-specific analysis. These findings suggest that strategic early stopping (regularization) is needed for optimizing practical utility, challenging assumptions about monotonic improvement and providing evidence-based design principles for agentic XAI systems.
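
The early-stopping recommendation can be illustrated with a small patience-based selector over per-round quality scores. This is a sketch under assumptions: the abstract does not specify a stopping rule, the function name `select_round` and the `patience` parameter are hypothetical, and the scores below are invented placeholders rather than results from the study.

```python
def select_round(scores_by_round, patience=2):
    """Return (best_round, best_score), stopping once quality stops improving.

    scores_by_round[i] is the mean evaluator score after refinement round i.
    Refinement halts when `patience` consecutive rounds fail to beat the best.
    """
    best_round, best_score = 0, scores_by_round[0]
    for round_index, score in enumerate(scores_by_round[1:], start=1):
        if score > best_score:
            best_round, best_score = round_index, score
        elif round_index - best_round >= patience:
            break  # quality has declined or plateaued long enough
    return best_round, best_score


# Placeholder scores shaped like a rise-then-fall curve (not from the paper).
placeholder_scores = [3.0, 3.4, 3.8, 4.0, 3.9, 3.5, 3.1]
print(select_round(placeholder_scores))
```

In practice the scores would have to come from an evaluator available at run time, such as an LLM judge, since human ratings for every round are unlikely to be on hand during refinement.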

[39] RoboSafe: Safeguarding Embodied Agents via Executable Safety Logic

Le Wang, Zonghao Ying, Xiao Yang, Quanchen Zou, Zhenfei Yin, Tianlin Li, Jian Yang, Yaodong Yang, Aishan Liu, Xianglong Liu

🧩 TL;DR

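This paper proposes RoboSafe, a hybrid-reasoning runtime safeguard for embodied agents that uses executable predicate-based safety logic to defend against implicit risks in dynamic environments, substantially reducing hazardous actions compared with existing baselines.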

📘 Detailed Summary

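Motivation: Embodied agents executing complex real-world tasks are vulnerable to hazardous instructions that can trigger unsafe behaviors. Existing runtime safety guardrails rely mainly on static rule filters or prompt-level control, which struggle with implicit risks arising in dynamic, temporally dependent, and context-rich environments.

Method: RoboSafe uses a hybrid reasoning approach built on executable predicate-based safety logic, with two complementary reasoning modules. The Backward Reflective Reasoning module continuously revisits recent trajectories in short-term memory to infer temporal safety predicates and proactively triggers replanning when violations are detected. The Forward Predictive Reasoning module anticipates upcoming risks by generating context-aware safety predicates from the long-term safety memory and the agent's multimodal observations. Together, these components form an adaptive, verifiable safety logic.

Result: Extensive experiments across multiple agents show that RoboSafe substantially reduces hazardous actions compared with leading baselines (-36.8% risk occurrence) while maintaining near-original task performance. Real-world evaluations on physical robotic arms further confirm its practicality.

Conclusion: The study provides an adaptive, verifiable, and interpretable runtime safety framework that addresses the implicit safety risks embodied agents face in dynamic, complex environments. It offers a practical route to safe deployment of embodied AI and shows that safety logic executable as code is feasible on real robotic systems.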

📄 Abstract

Embodied agents powered by vision-language models (VLMs) are increasingly capable of executing complex real-world tasks, yet they remain vulnerable to hazardous instructions that may trigger unsafe behaviors. Runtime safety guardrails, which intercept hazardous actions during task execution, offer a promising solution due to their flexibility. However, existing defenses often rely on static rule filters or prompt-level control, which struggle to address implicit risks arising in dynamic, temporally dependent, and context-rich environments. To address this, we propose RoboSafe, a hybrid reasoning runtime safeguard for embodied agents through executable predicate-based safety logic. RoboSafe integrates two complementary reasoning processes on a Hybrid Long-Short Safety Memory. We first propose a Backward Reflective Reasoning module that continuously revisits recent trajectories in short-term memory to infer temporal safety predicates and proactively triggers replanning when violations are detected. We then propose a Forward Predictive Reasoning module that anticipates upcoming risks by generating context-aware safety predicates from the long-term safety memory and the agent's multimodal observations. Together, these components form an adaptive, verifiable safety logic that is both interpretable and executable as code. Extensive experiments across multiple agents demonstrate that RoboSafe substantially reduces hazardous actions (-36.8% risk occurrence) compared with leading baselines, while maintaining near-original task performance. Real-world evaluations on physical robotic arms further confirm its practicality. Code will be released upon acceptance.
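
As a rough picture of what safety logic that is "executable as code" might look like, the sketch below represents each safety predicate as a named Python callable and checks a proposed action against all of them. This is an assumption-laden illustration, not RoboSafe's implementation, which has not been released: the class `SafetyPredicate`, the function `check_action`, the action strings, and both example rules are hypothetical. One predicate is temporal (it inspects the action history, loosely echoing the backward module) and one is contextual (it inspects the current state, loosely echoing the forward module).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SafetyPredicate:
    """A named rule; `violated` returns True when the action is unsafe."""
    name: str
    violated: Callable[[List[str], Dict[str, bool], str], bool]


def stove_left_on(history, state, action):
    """Temporal rule: do not leave the kitchen while the stove is still on."""
    if action != "leave_kitchen":
        return False
    stove_on = False
    for past_action in history:
        if past_action == "turn_on_stove":
            stove_on = True
        elif past_action == "turn_off_stove":
            stove_on = False
    return stove_on


def water_near_electronics(history, state, action):
    """Contextual rule: do not pour water next to electronics."""
    return action == "pour_water" and state.get("near_electronics", False)


def check_action(action, history, state, predicates):
    """Return the names of all predicates the proposed action would violate."""
    return [p.name for p in predicates if p.violated(history, state, action)]


rules = [
    SafetyPredicate("stove_off_before_leaving", stove_left_on),
    SafetyPredicate("no_water_near_electronics", water_near_electronics),
]
print(check_action("leave_kitchen", ["turn_on_stove"], {}, rules))
```

A non-empty result would be the point at which a guardrail blocks the action and asks the planner to replan; an empty list lets the action proceed.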
