
cs.CV

[1] Complex Mathematical Expression Recognition: Benchmark, Large-Scale Dataset and Strong Baseline

Weikang Bai, Yongkun Du, Yuchen Su, Yazhen Xie, Zhineng Chen

🧩 TL;DR

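This work tackles complex mathematical expression recognition by introducing the CMER-Bench benchmark, the large-scale datasets MER-17M and CMER-3M, a Structured Mathematical Language representation, and the CMERNet model, which together substantially improve recognition of complex expressions.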


📘 Detailed Summary

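Motivation: Existing mathematical expression recognition (MER) methods handle simple expressions well, but their performance drops sharply on complex expressions with many tokens and multi-line layouts, mainly because public training datasets consist largely of simple samples and cover complex expressions poorly.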

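Method: The authors first build CMER-Bench, which sorts expressions into three difficulty levels (easy, moderate, and complex). They then create two large-scale datasets, MER-17M and CMER-3M, that emphasize complex expressions. They propose a novel expression tokenizer and a Structured Mathematical Language representation that explicitly models the hierarchical and spatial structure of expressions. Finally, they develop CMERNet, a specialized encoder-decoder model trained on CMER-3M.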

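Result: Existing MER models and multimodal large language models (MLLMs) perform well on easy and moderate expressions but degrade significantly on complex ones. CMERNet, with only 125 million parameters, significantly outperforms existing MER models and MLLMs on CMER-Bench, particularly on complex expressions.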

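Conclusion: The study exposes the limits of current MER methods on complex expressions and offers a systematic response through a dedicated benchmark, large-scale datasets, and a new representation. Structured Mathematical Language goes beyond plain LaTeX to better model spatial layout and hierarchy, laying groundwork for future work on complex MER.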


📄 Abstract

Mathematical Expression Recognition (MER) has made significant progress in recognizing simple expressions, but the robust recognition of complex mathematical expressions with many tokens and multiple lines remains a formidable challenge. In this paper, we first introduce CMER-Bench, a carefully constructed benchmark that categorizes expressions into three difficulty levels: easy, moderate, and complex. Leveraging CMER-Bench, we conduct a comprehensive evaluation of existing MER models and general-purpose multimodal large language models (MLLMs). The results reveal that while current methods perform well on easy and moderate expressions, their performance degrades significantly when handling complex mathematical expressions, mainly because existing public training datasets are primarily composed of simple samples. In response, we propose MER-17M and CMER-3M that are large-scale datasets emphasizing the recognition of complex mathematical expressions. The datasets provide rich and diverse samples to support the development of accurate and robust complex MER models. Furthermore, to address the challenges posed by the complicated spatial layout of complex expressions, we introduce a novel expression tokenizer, and a new representation called Structured Mathematical Language, which explicitly models the hierarchical and spatial structure of expressions beyond LaTeX format. Based on these, we propose a specialized model named CMERNet, built upon an encoder-decoder architecture and trained on CMER-3M. Experimental results show that CMERNet, with only 125 million parameters, significantly outperforms existing MER models and MLLMs on CMER-Bench.
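As a rough illustration of the three-level split described above, the sketch below buckets LaTeX strings by token count and line count. The regex tokenizer, the thresholds (`easy_max`, `moderate_max`), and the function names are assumptions made for illustration; the abstract does not state the actual CMER-Bench criteria.

```python
import re

# Hypothetical tokenizer: a LaTeX command, an escaped character, or any single
# non-space character. This is NOT the tokenizer proposed in the paper.
TOKEN_RE = re.compile(r"\\[A-Za-z]+|\\.|\S")


def count_tokens(latex: str) -> int:
    """Count tokens under the illustrative tokenizer above."""
    return len(TOKEN_RE.findall(latex))


def count_lines(latex: str) -> int:
    """Treat each LaTeX row break (two backslashes) as starting a new line."""
    return latex.count(r"\\") + 1


def difficulty(latex: str, easy_max: int = 20, moderate_max: int = 60) -> str:
    """Assign a difficulty bucket; both thresholds are assumed values."""
    n_tokens = count_tokens(latex)
    n_lines = count_lines(latex)
    if n_lines > 1 or n_tokens > moderate_max:
        return "complex"
    if n_tokens > easy_max:
        return "moderate"
    return "easy"


for expr in ["x^2 + 1", r"a = b \\ c = d"]:
    print(expr, "->", difficulty(expr))
```

Any real benchmark would likely combine more signals (nesting depth, matrix environments, symbol rarity); token and line counts are used here only because the abstract names them as the source of difficulty.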

[2] Human-AI Collaboration Mechanism Study on AIGC Assisted Image Production for Special Coverage

Yajie Yang, Yuqing Zhao, Xiaochao Xi, Yinan Zhu

🧩 TL;DR

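This study addresses the black-box problem of AIGC-assisted image generation in journalism by proposing a controllable image production path: a modular human-in-the-loop pipeline that tackles semantic fidelity, cultural accuracy, and editorial controllability.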


📘 Detailed Summary

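Motivation: AIGC-assisted image generation in journalism suffers from black-box opacity, which creates dilemmas around content accuracy, semantic alignment, and ethical trust, especially in special coverage, where editorial fidelity and cultural accuracy are required.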

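Method: The study runs two experiments. Experiment 1 tests cross-platform adaptability with standardized prompts. Experiment 2 builds a human-in-the-loop modular pipeline that integrates high-precision segmentation (SAM, GroundingDINO), semantic alignment (BrushNet), and style regulation (Style-LoRA, Prompt-to-Prompt), and ensures editorial fidelity through CLIP-based semantic scoring, NSFW/OCR/YOLO filtering, and verifiable content credentials.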

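Result: Experiment 1 reveals disparities in semantic alignment, cultural specificity, and visual realism caused by training-corpus bias and platform-level filtering. Experiment 2 achieves traceable deployment that preserves semantic representation, yielding a controllable image production pipeline that meets newsroom editorial requirements.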

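Conclusion: The study proposes a human-AI collaboration mechanism for AIGC-assisted image production in special coverage and recommends evaluating three key metrics: Character Identity Stability (CIS), Cultural Expression Accuracy (CEA), and User-Public Appropriateness (U-PA). Together these give journalism an actionable framework for AIGC use.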


📄 Abstract

Artificial Intelligence Generated Content (AIGC) assisting image production triggers controversy in journalism while attracting attention from media agencies. Key issues involve misinformation, authenticity, semantic fidelity, and interpretability. Most AIGC tools are opaque "black boxes," hindering the dual demands of content accuracy and semantic alignment and creating ethical, sociotechnical, and trust dilemmas. This paper explores pathways for controllable image production in journalism's special coverage and conducts two experiments with projects from China's media agency: (1) Experiment 1 tests cross-platform adaptability via standardized prompts across three scenes, revealing disparities in semantic alignment, cultural specificity, and visual realism driven by training-corpus bias and platform-level filtering. (2) Experiment 2 builds a human-in-the-loop modular pipeline combining high-precision segmentation (SAM, GroundingDINO), semantic alignment (BrushNet), and style regulating (Style-LoRA, Prompt-to-Prompt), ensuring editorial fidelity through CLIP-based semantic scoring, NSFW/OCR/YOLO filtering, and verifiable content credentials. Traceable deployment preserves semantic representation. Consequently, we propose a human-AI collaboration mechanism for AIGC assisted image production in special coverage and recommend evaluating Character Identity Stability (CIS), Cultural Expression Accuracy (CEA), and User-Public Appropriateness (U-PA).

[3] HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices

HyperAI Team, Yuchen Liu, Kaiyang Han, Zhiqiang Xia, Yuhang Dong, Chen Song, Kangyu Tang, Jiaming Xu, Xiushi Feng, WenXuan Yu, Li Peng, Mingyang Wang, Kai Wang, Changpeng Yang, Yang Li, Haoyu Lu, Hao Wang, Bingna Xu, Guangyao Liu, Long Huang, Kaibin Guo, Jinyang Wu, Dan Wu, Hongzhen Wang, Peng Zhou, Shuai Nie, Shande Wang, Runyu Shi, Ying Huang

🧩 TL;DR

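This paper presents HyperVL, an efficient multimodal large language model designed for on-device inference. Using a Visual Resolution Compressor and Dual Consistency Learning, it significantly reduces compute and memory requirements while retaining strong perception.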


📘 Detailed Summary

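Motivation: Current MLLMs have strong perceptual and reasoning capabilities, but their high compute and memory requirements make direct on-device deployment difficult. Standard Vision Transformer (ViT) encoders are a critical bottleneck, with excessive latency and memory consumption on high-resolution inputs.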

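Method: HyperVL adopts an image-tiling strategy to cap peak memory usage and introduces two techniques: a Visual Resolution Compressor (VRC), which adaptively predicts the optimal encoding resolution to eliminate redundant computation, and Dual Consistency Learning (DCL), which aligns multi-scale ViT encoders within a unified framework so that visual branches can be switched dynamically under a shared LLM.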

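Result: Extensive experiments show that HyperVL achieves state-of-the-art performance among models of comparable size across multiple benchmarks while significantly reducing latency and power consumption on real mobile devices, demonstrating its practicality for on-device multimodal inference.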

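Conclusion: The work offers an effective solution for on-device multimodal deployment. Resolution adaptation and encoder alignment substantially improve inference efficiency while preserving model performance, helping efficient multimodal models reach real applications.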


📄 Abstract

Current multimodal large language models possess strong perceptual and reasoning capabilities; however, high computational and memory requirements make them difficult to deploy directly in on-device environments. While small-parameter models are progressively endowed with strong general capabilities, standard Vision Transformer (ViT) encoders remain a critical bottleneck, suffering from excessive latency and memory consumption when processing high-resolution inputs. To address these challenges, we introduce HyperVL, an efficient multimodal large language model tailored for on-device inference. HyperVL adopts an image-tiling strategy to cap peak memory usage and incorporates two novel techniques: (1) a Visual Resolution Compressor (VRC) that adaptively predicts optimal encoding resolutions to eliminate redundant computation, and (2) Dual Consistency Learning (DCL), which aligns multi-scale ViT encoders within a unified framework, enabling dynamic switching between visual branches under a shared LLM. Extensive experiments demonstrate that HyperVL achieves state-of-the-art performance among models of comparable size across multiple benchmarks. Furthermore, it significantly reduces latency and power consumption on real mobile devices, demonstrating its practicality for on-device multimodal inference.
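The image-tiling idea can be sketched generically: split the input into fixed-size tiles so the encoder only ever sees one bounded tile at a time. The tile size of 448 pixels and the function name are assumptions for illustration; this is a sketch of plain grid tiling, not HyperVL's actual implementation.

```python
import math


def tile_boxes(width: int, height: int, tile: int = 448):
    """Return (left, top, right, bottom) boxes covering the image, row by row.

    The tile size is a hypothetical value; edge tiles are clipped to the image.
    """
    cols = math.ceil(width / tile)
    rows = math.ceil(height / tile)
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left, top = c * tile, r * tile
            boxes.append((left, top, min(left + tile, width), min(top + tile, height)))
    return boxes


boxes = tile_boxes(1000, 700)
largest = max((b[2] - b[0]) * (b[3] - b[1]) for b in boxes)
print(len(boxes), "tiles; largest tile has", largest, "pixels")
```

Because no tile exceeds `tile * tile` pixels, the encoder's peak activation memory is bounded by the tile size rather than by the full image resolution, at the cost of more encoder passes.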

[4] DL$^3$M: A Vision-to-Language Framework for Expert-Level Medical Reasoning through Deep Learning and Large Language Models

Md. Najib Hasan, Imran Ahmad, Sourav Basak Shuvo, Md. Mahadi Hasan Ankon, Sunanda Das, Nazmul Siddique, Hui Wang

🧩 TL;DR

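This study proposes a framework that combines deep learning image classification with LLM-based clinical reasoning. A new hybrid model, MobileCoAtNet, classifies endoscopic images with high accuracy; its outputs drive LLMs to generate structured clinical explanations, and expert-verified benchmarks are built to assess the reliability of LLM reasoning.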


📘 Detailed Summary

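Motivation: Medical image classifiers detect gastrointestinal diseases well but cannot explain their decisions, while LLMs can generate clinical text but are unstable at visual reasoning and often produce incorrect explanations. This leaves a gap between what a model sees and the kind of reasoning clinicians expect.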

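Method: The study proposes a framework linking image classification with structured clinical reasoning. A new hybrid model designed for endoscopic images, MobileCoAtNet, achieves high accuracy across eight stomach-related classes, and its outputs drive reasoning by multiple LLMs. Two expert-verified benchmarks covering causes, symptoms, treatment, lifestyle, and follow-up care are built to judge that reasoning.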

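Result: MobileCoAtNet reaches high accuracy on stomach disease classification. An evaluation of 32 LLMs shows that strong classification improves explanation quality, but no model reaches human-level stability: even the best LLMs change their reasoning when prompts vary, so current LLMs remain unreliable for high-stakes medical decisions.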

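Conclusion: Combining deep learning with LLMs can produce useful clinical narratives, but current LLMs remain unreliable for high-stakes medical decisions. The framework gives a clearer view of their limits and a path toward safer reasoning systems. All source code and datasets are publicly available.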


📄 Abstract

Medical image classifiers detect gastrointestinal diseases well, but they do not explain their decisions. Large language models can generate clinical text, yet they struggle with visual reasoning and often produce unstable or incorrect explanations. This leaves a gap between what a model sees and the type of reasoning a clinician expects. We introduce a framework that links image classification with structured clinical reasoning. A new hybrid model, MobileCoAtNet, is designed for endoscopic images and achieves high accuracy across eight stomach-related classes. Its outputs are then used to drive reasoning by several LLMs. To judge this reasoning, we build two expert-verified benchmarks covering causes, symptoms, treatment, lifestyle, and follow-up care. Thirty-two LLMs are evaluated against these gold standards. Strong classification improves the quality of their explanations, but none of the models reach human-level stability. Even the best LLMs change their reasoning when prompts vary. Our study shows that combining DL with LLMs can produce useful clinical narratives, but current LLMs remain unreliable for high-stakes medical decisions. The framework provides a clearer view of their limits and a path for building safer reasoning systems. The complete source code and datasets used in this study are available at https://github.com/souravbasakshuvo/DL3M.

[5] Why Text Prevails: Vision May Undermine Multimodal Medical Decision Making

Siyuan Dai, Lunxiao Li, Kun Zhao, Eardi Lila, Paul K. Crane, Heng Huang, Dongkuan Xu, Haoteng Tang, Liang Zhan

🧩 TL;DR

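This paper finds that current state-of-the-art MLLMs perform poorly on medical decision-making tasks, with text-only reasoning outperforming vision-only or vision-text reasoning, and explores three strategies for improving multimodal decision making in healthcare.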


📘 Detailed Summary

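Motivation: Although MLLMs show strong zero-shot capabilities on general vision-language tasks, in the biomedical domain even state-of-the-art models struggle with basic medical decision-making tasks, especially where visual differences are subtle and precise judgment is required.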

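Method: The study uses two challenging medical datasets, three-stage Alzheimer's disease classification and MIMIC-CXR chest radiograph multi-label classification, and explores three improvement strategies: in-context learning with reason-annotated exemplars, vision captioning followed by text-only inference, and few-shot fine-tuning of the vision encoder.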

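Result: Text-only reasoning outperforms vision-only and vision-text reasoning on both medical tasks, and multimodal inputs perform worse than text alone. Among the three improvement strategies, vision captioning works best, but overall performance remains far below that of specialized medical models.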

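Conclusion: Current MLLMs lack grounded visual understanding of medical images, and visual information may interfere with text reasoning. The study points to feasible directions for improving multimodal medical decision making, including better fine-tuning of the vision encoder and stronger vision-text alignment.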


📄 Abstract

With the rapid progress of large language models (LLMs), advanced multimodal large language models (MLLMs) have demonstrated impressive zero-shot capabilities on vision-language tasks. In the biomedical domain, however, even state-of-the-art MLLMs struggle with basic Medical Decision Making (MDM) tasks. We investigate this limitation using two challenging datasets: (1) three-stage Alzheimer's disease (AD) classification (normal, mild cognitive impairment, dementia), where category differences are visually subtle, and (2) MIMIC-CXR chest radiograph classification with 14 non-mutually exclusive conditions. Our empirical study shows that text-only reasoning consistently outperforms vision-only or vision-text settings, with multimodal inputs often performing worse than text alone. To mitigate this, we explore three strategies: (1) in-context learning with reason-annotated exemplars, (2) vision captioning followed by text-only inference, and (3) few-shot fine-tuning of the vision tower with classification supervision. These findings reveal that current MLLMs lack grounded visual understanding and point to promising directions for improving multimodal decision making in healthcare.

[6] STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning

Jie Qin, Jiancheng Huang, Limeng Qiao, Lin Ma

🧩 TL;DR

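This paper proposes STAR, a STacked AutoRegressive scheme for task-progressive unified multimodal learning. It decomposes multimodal learning into understanding, generation, and editing stages, improving generative performance while preserving existing comprehension capabilities.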


📘 Detailed Summary

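Motivation: MLLMs play a pivotal role in the pursuit of general artificial intelligence, but unifying multimodal understanding and generation remains difficult because of optimization conflicts and performance trade-offs. The goal is to enhance generation while preserving existing comprehension.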

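Method: STAR decomposes multimodal learning into understanding, generation, and editing stages. It freezes the parameters of the base autoregressive (AR) model and progressively stacks isomorphic AR modules to avoid cross-task interference. It also introduces a high-capacity VQ to increase the granularity of image representations and uses an implicit reasoning mechanism to improve generation quality under complex conditions.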

Result: STAR achieves state-of-the-art performance on GenEval (0.91), DPG-Bench (87.44), and ImgEdit (4.34), validating its effectiveness for unified multimodal learning.

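Conclusion: With its task-progressive stacked architecture, STAR addresses the challenge of unifying multimodal understanding and generation, providing an effective framework for building stronger unified multimodal models while avoiding performance trade-offs between tasks.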


📄 Abstract

Multimodal large language models (MLLMs) play a pivotal role in advancing the quest for general artificial intelligence. However, achieving a unified target for multimodal understanding and generation remains challenging due to optimization conflicts and performance trade-offs. To effectively enhance generative performance while preserving existing comprehension capabilities, we introduce STAR: a STacked AutoRegressive scheme for task-progressive unified multimodal learning. This approach decomposes multimodal learning into multiple stages: understanding, generation, and editing. By freezing the parameters of the fundamental autoregressive (AR) model and progressively stacking isomorphic AR modules, it avoids cross-task interference while expanding the model's capabilities. Concurrently, we introduce a high-capacity VQ to enhance the granularity of image representations and employ an implicit reasoning mechanism to improve generation quality under complex conditions. Experiments demonstrate that STAR achieves state-of-the-art performance on GenEval (0.91), DPG-Bench (87.44), and ImgEdit (4.34), validating its efficacy for unified multimodal learning.

[7] Improvise, Adapt, Overcome -- Telescopic Adapters for Efficient Fine-tuning of Vision Language Models in Medical Imaging

Ujjwal Mishra, Vinita Shukla, Praful Hambarde, Amit Shukla

🧩 TL;DR

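This paper proposes Telescopic Adapters, a new parameter-efficient fine-tuning framework that uses depth-aware scaling to adapt vision-language segmentation models to medical imaging efficiently, achieving strong performance on multiple medical datasets with only 613k trainable parameters.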


📘 Detailed Summary

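Motivation: Conventional fine-tuning incurs heavy computational overhead when adapting vision-language segmentation models (VLSMs) to medical imaging. Existing parameter-efficient fine-tuning (PEFT) methods use uniform adapter dimensions across all transformer layers, leading to suboptimal parameter allocation and reduced adaptation efficiency that falls short of the needs of resource-constrained clinical environments.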

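Method: Telescopic Adapters use a depth-aware scaling strategy that progressively increases adapter capacity from shallow to deep transformer layers. Lightweight bottleneck modules are integrated into CLIPSeg's vision and text encoders, with adapter dimensions scaled dynamically based on layer depth and semantic relevance.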

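Result: Using only 613k trainable parameters, 244x fewer than end-to-end fine-tuning, the method achieves superior performance on five diverse medical datasets spanning polyp segmentation, skin lesion detection, and breast ultrasound imaging. Ablation studies confirm that deeper layers need substantially more adaptation capacity than shallow layers.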

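Conclusion: The work establishes a new paradigm for efficient fine-tuning of medical VLSMs. By validating the hypothesis that deeper layers need more adaptation capacity, it offers a viable route to deployment in resource-constrained clinical environments while maintaining competitive segmentation accuracy.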


📄 Abstract

Adapting Vision Language Segmentation Models (VLSMs) to medical imaging domains requires significant computational overhead when using conventional fine-tuning approaches. Existing Parameter-Efficient Fine-Tuning (PEFT) methods apply uniform adapter dimensions across all transformer layers, leading to suboptimal parameter allocation and reduced adaptation efficiency. We introduce Telescopic Adapters, a novel PEFT framework that employs depth-aware scaling to progressively increase adapter capacity from shallow to deep transformer layers. Our method integrates lightweight bottleneck modules within CLIPSeg's vision and text encoders, with adapter dimensions dynamically scaled based on layer depth and semantic relevance. Using only 613k trainable parameters--244x fewer than end-to-end fine-tuning, Telescopic Adapters achieve superior performance across five diverse medical datasets spanning polyp segmentation, skin lesion detection, and breast ultrasound imaging. Comprehensive ablation studies demonstrate that deeper layers require substantially more adaptation capacity than shallow layers, validating our telescopic scaling hypothesis. Our approach establishes a new paradigm for efficient medical VLSM fine-tuning, enabling deployment in resource-constrained clinical environments while maintaining competitive segmentation accuracy.
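To make depth-aware scaling concrete, the sketch below grows a bottleneck adapter's width linearly with layer depth and counts the resulting parameters. The linear schedule, the 12-layer depth, the hidden size of 768, and the 4-to-48 width range are all assumptions chosen for illustration; the abstract does not give the actual schedule or dimensions. The last lines simply multiply the two figures the abstract does report.

```python
def adapter_dims(num_layers: int, min_dim: int, max_dim: int):
    """Grow the bottleneck width linearly from the shallowest to the deepest layer."""
    if num_layers == 1:
        return [max_dim]
    step = (max_dim - min_dim) / (num_layers - 1)
    return [round(min_dim + i * step) for i in range(num_layers)]


def adapter_params(hidden: int, bottleneck: int) -> int:
    """Parameters of one bottleneck adapter: down- and up-projection with biases."""
    down = hidden * bottleneck + bottleneck
    up = bottleneck * hidden + hidden
    return down + up


# Hypothetical configuration, not taken from the paper.
dims = adapter_dims(num_layers=12, min_dim=4, max_dim=48)
total = sum(adapter_params(768, d) for d in dims)
print("per-layer bottleneck widths:", dims)
print("total adapter parameters:", total)

# Reported figures: 613k trainable parameters, 244x fewer than full fine-tuning.
# Since 613k is itself rounded, the product is only an approximation.
implied_full = 613_000 * 244
print("implied full fine-tuning parameters (approx.):", implied_full)
```

A uniform schedule with the same total budget would give every layer the same width; the telescopic variant shifts that budget toward deeper layers, which is the allocation the ablations in the abstract support.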

[8] From Unlearning to UNBRANDING: A Benchmark for Trademark-Safe Text-to-Image Generation

Dawid Malarz, Artur Kasymov, Filip Manjak, Maciej Zięba, Przemysław Spurek

🧩 TL;DR

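This paper introduces "unbranding", a new task that aims at fine-grained removal of trademarks and subtle structural brand features from images generated by text-to-image diffusion models while preserving semantic coherence. It also builds a comprehensive benchmark dataset and a new evaluation metric based on vision-language models.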


📘 Detailed Summary

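Motivation: The rapid progress of text-to-image diffusion models raises serious concerns about unauthorized reproduction of trademarked content. Prior work targets general concepts and does not address specific brand identifiers. Brand recognition is multi-dimensional, covering subtle structural features as well as explicit logos, and existing brand detectors are limited to logos and cannot capture abstract trade dress.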

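Method: The study introduces "unbranding", a new task focused on fine-grained removal of trademarks and subtle structural brand features while preserving semantic coherence. It constructs a comprehensive benchmark dataset and proposes a new VLM-based evaluation metric that uses a question-answering framework to probe images for both explicit logos and implicit, holistic brand characteristics.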

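Result: As model fidelity increases, newer systems synthesize brand identifiers more readily than older models, underscoring the urgency of the unbranding challenge. Validation with the VLM-based metric confirms that unbranding is a distinct, practically relevant problem requiring specialized techniques.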

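Conclusion: Unbranding is a distinct and practically important research problem that calls for specialized techniques. The VLM-based metric captures multi-dimensional brand features effectively, giving future work a useful benchmark and evaluation framework and underscoring the need to address brand protection in advanced text-to-image models.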


📄 Abstract

The rapid progress of text-to-image diffusion models raises significant concerns regarding the unauthorized reproduction of trademarked content. While prior work targets general concepts (e.g., styles, celebrities), it fails to address specific brand identifiers. Crucially, we note that brand recognition is multi-dimensional, extending beyond explicit logos to encompass distinctive structural features (e.g., a car's front grille). To tackle this, we introduce unbranding, a novel task for the fine-grained removal of both trademarks and subtle structural brand features, while preserving semantic coherence. To facilitate research, we construct a comprehensive benchmark dataset. Recognizing that existing brand detectors are limited to logos and fail to capture abstract trade dress (e.g., the shape of a Coca-Cola bottle), we introduce a novel evaluation metric based on Vision Language Models (VLMs). This VLM-based metric uses a question-answering framework to probe images for both explicit logos and implicit, holistic brand characteristics. Furthermore, we observe that as model fidelity increases, with newer systems (SDXL, FLUX) synthesizing brand identifiers more readily than older models (Stable Diffusion), the urgency of the unbranding challenge is starkly highlighted. Our results, validated by our VLM metric, confirm unbranding is a distinct, practically relevant problem requiring specialized techniques. Project Page: https://gmum.github.io/UNBRANDING/.

[9] Repurposing 2D Diffusion Models for 3D Shape Completion

Yao He, Youngjoong Kwon, Tiange Xiang, Wenxiao Cai, Ehsan Adeli

🧩 TL;DR

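This study proposes a framework that adapts 2D diffusion models to 3D shape completion. By introducing the Shape Atlas, a compact 2D representation of 3D geometry, it overcomes 3D data scarcity and the modality gap, achieving high-quality shape completion with demonstrated practical value.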


📘 Detailed Summary

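Motivation: 2D diffusion models have succeeded thanks to abundant data, but 3D diffusion models lag behind because high-quality 3D datasets are scarce and a modality gap persists between 3D inputs and 2D latent spaces. Both factors limit how well the generative power of 2D diffusion models transfers to 3D shape completion.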

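Method: The paper proposes a framework that adapts 2D diffusion models to 3D shape completion through the Shape Atlas, a compact 2D representation of 3D geometry. The approach makes full use of the generative power of pretrained 2D diffusion models and aligns the modalities of the conditional input and output spaces, allowing more effective conditioning. This unified 2D formulation facilitates learning from limited 3D data and produces high-quality, detail-preserving completions.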

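Result: The method is validated on the PCN and ShapeNet-55 datasets, where it delivers high-quality shape completion. The study also shows a downstream application, creating artist-created meshes from the completed point clouds, which further demonstrates the method's practicality and indicates that it can produce complete geometry suitable for real 3D modeling workflows.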

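Conclusion: Through a new 2D representation, the study bridges the modality gap between 2D diffusion models and 3D shape completion, offering an effective way to use rich 2D generative priors against 3D data scarcity. The Shape Atlas framework achieves high-quality shape completion, shows practical value for real 3D content creation, and opens new directions for cross-modal generative models.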


📄 Abstract

We present a framework that adapts 2D diffusion models for 3D shape completion from incomplete point clouds. While text-to-image diffusion models have achieved remarkable success with abundant 2D data, 3D diffusion models lag due to the scarcity of high-quality 3D datasets and a persistent modality gap between 3D inputs and 2D latent spaces. To overcome these limitations, we introduce the Shape Atlas, a compact 2D representation of 3D geometry that (1) enables full utilization of the generative power of pretrained 2D diffusion models, and (2) aligns the modalities between the conditional input and output spaces, allowing more effective conditioning. This unified 2D formulation facilitates learning from limited 3D data and produces high-quality, detail-preserving shape completions. We validate the effectiveness of our results on the PCN and ShapeNet-55 datasets. Additionally, we show the downstream application of creating artist-created meshes from our completed point clouds, further demonstrating the practicality of our method.

[10] Sparse-LaViDa: Sparse Multimodal Discrete Diffusion Language Models

Shufan Li, Jiuxiang Gu, Kangning Liu, Zhe Lin, Zijun Wei, Aditya Grover, Jason Kuen

🧩 TL;DR

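This paper proposes Sparse-LaViDa, a new modeling framework that accelerates masked discrete diffusion models by dynamically truncating unnecessary masked tokens during inference, achieving up to a 2x speedup while maintaining generation quality.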


📘 Detailed Summary

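Motivation: Masked discrete diffusion models (MDMs) perform strongly on multimodal tasks, but their inference speed is limited by the need to reprocess redundant masked tokens at every sampling step, which is the main performance bottleneck of current methods.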

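Method: Sparse-LaViDa dynamically truncates unnecessary masked tokens at each inference step, introduces specialized register tokens as compact representations of the truncated tokens, and uses a specialized attention mask that matches the truncated sampling procedure to keep training consistent with inference.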

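Result: Built on the state-of-the-art unified MDM LaViDa-O, Sparse-LaViDa achieves up to a 2x speedup across tasks including text-to-image generation, image editing, and mathematical reasoning, while maintaining the original generation quality.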

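Conclusion: The study shows that dynamic sparsification can effectively accelerate inference in masked discrete diffusion models, offers a new approach to designing efficient multimodal generative models, and demonstrates that inference speed can be raised substantially without sacrificing performance.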


📄 Abstract

Masked Discrete Diffusion Models (MDMs) have achieved strong performance across a wide range of multimodal tasks, including image understanding, generation, and editing. However, their inference speed remains suboptimal due to the need to repeatedly process redundant masked tokens at every sampling step. In this work, we propose Sparse-LaViDa, a novel modeling framework that dynamically truncates unnecessary masked tokens at each inference step to accelerate MDM sampling. To preserve generation quality, we introduce specialized register tokens that serve as compact representations for the truncated tokens. Furthermore, to ensure consistency between training and inference, we design a specialized attention mask that faithfully matches the truncated sampling procedure during training. Built upon the state-of-the-art unified MDM LaViDa-O, Sparse-LaViDa achieves up to a 2x speedup across diverse tasks including text-to-image generation, image editing, and mathematical reasoning, while maintaining generation quality.
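Why truncating masked tokens saves work can be seen with a toy token count. The sketch below compares how many tokens a model would process over all sampling steps with and without truncation. The sequence length, prompt length, tokens decoded per step, and number of register tokens are hypothetical, and the count ignores attention cost and real latency, so it does not reproduce the speedup reported in the abstract.

```python
def tokens_processed(seq_len: int, prompt_len: int, per_step: int,
                     num_registers: int, truncate: bool) -> int:
    """Count tokens fed to the model across all sampling steps (toy model).

    Without truncation every step sees the full sequence. With truncation a
    step sees the already-known tokens, the tokens decoded at this step, and
    a few register tokens standing in for the dropped masked tokens.
    """
    masked = seq_len - prompt_len
    total = 0
    while masked > 0:
        decode_now = min(per_step, masked)
        if truncate:
            visible = (seq_len - masked) + decode_now + num_registers
        else:
            visible = seq_len
        total += visible
        masked -= decode_now
    return total


# Hypothetical setting: 32 prompt tokens, 1024 tokens to generate, 64 per step.
full = tokens_processed(1056, 32, 64, 8, truncate=False)
sparse = tokens_processed(1056, 32, 64, 8, truncate=True)
print("full:", full, "truncated:", sparse, "ratio:", round(full / sparse, 2))
```

The saving is largest in early steps, when most positions are still masked, and shrinks to almost nothing by the final step.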

[11] KFS-Bench: Comprehensive Evaluation of Key Frame Sampling in Long Video Understanding

Zongyao Li, Kengo Ishida, Satoshi Yamazaki, Xiaotong Ji, Jianquan Liu

🧩 TL;DR

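This study proposes KFS-Bench, the first benchmark for key frame sampling in long video question answering (QA). It features multi-scene annotations for direct evaluation of sampling strategies, and the authors develop an adaptively balanced sampling method based on question-video relevance that significantly improves sampling quality and QA performance.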


📘 Detailed Summary

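Motivation: Long video QA research lacks a benchmark that evaluates key frame sampling strategies directly. Prior work assesses frame selection quality only indirectly through QA accuracy, which makes it impossible to analyze how a sampling method captures essential content across an entire long video and holds back efficient long video understanding.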

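Method: The study proposes KFS-Bench, whose multi-scene annotations allow sampling strategies to be evaluated directly. It designs a sampling quality metric that jointly accounts for sampling precision, scene coverage, and sampling balance, and develops an adaptively balanced sampling method that uses question-video relevance to balance sampling diversity against question-frame similarity.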

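Result: A comprehensive study on KFS-Bench finds that sampling precision, scene coverage, and sampling balance are the key factors influencing QA performance. The proposed sampling quality metric is highly correlated with QA accuracy, and the adaptively balanced sampling method achieves superior results in both key frame sampling and QA performance.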

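Conclusion: The study provides the first benchmark for directly evaluating key frame sampling in long video QA and reveals the multiple factors that determine sampling quality. By balancing diversity against question-video relevance, the adaptively balanced sampling method significantly improves the efficiency and accuracy of MLLMs on long video understanding.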


📄 Abstract

We propose KFS-Bench, the first benchmark for key frame sampling in long video question answering (QA), featuring multi-scene annotations to enable direct and robust evaluation of sampling strategies. Key frame sampling is crucial for efficient long-form video understanding. In long video QA, selecting informative frames enables multimodal large language models (MLLMs) to improve both accuracy and efficiency. KFS-Bench addresses the limitation of prior works that only indirectly assess frame selection quality via QA accuracy. By providing ground-truth annotations of multiple disjoint scenes required per question, KFS-Bench allows us to directly analyze how different sampling approaches capture essential content across an entire long video. Using KFS-Bench, we conduct a comprehensive study of key frame sampling methods and identify that not only sampling precision but also scene coverage and sampling balance are the key factors influencing QA performance. Regarding all the factors, we design a novel sampling quality metric that correlates with QA accuracy. Furthermore, we develop a novel key frame sampling method that leverages question-video relevance to balance sampling diversity against question-frame similarity, thereby improving coverage of relevant scenes. Our adaptively balanced sampling approach achieves superior performance in both key frame sampling and QA performance. The benchmark is available at https://github.com/NEC-VID/KFS-Bench.
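The three factors named in the abstract can be illustrated with simple stand-in definitions computed from ground-truth scene intervals. The formulas below (precision as the share of sampled frames inside any annotated scene, coverage as the share of scenes hit, balance as normalized entropy of hits per scene) are assumptions for illustration and are not the benchmark's actual metric.

```python
import math


def sampling_scores(sampled, scenes):
    """Return (precision, coverage, balance) for sampled frame indices.

    `scenes` is a list of disjoint, inclusive (start, end) frame intervals.
    All three definitions are illustrative stand-ins.
    """
    hits = [sum(1 for f in sampled if start <= f <= end) for start, end in scenes]
    n_hit_frames = sum(hits)  # scenes are disjoint, so no frame is counted twice
    precision = n_hit_frames / len(sampled) if sampled else 0.0
    coverage = sum(1 for h in hits if h > 0) / len(scenes)
    if n_hit_frames == 0:
        balance = 0.0
    elif len(scenes) == 1:
        balance = 1.0
    else:
        probs = [h / n_hit_frames for h in hits if h > 0]
        entropy = -sum(p * math.log(p) for p in probs)
        balance = entropy / math.log(len(scenes))
    return precision, coverage, balance


scenes = [(0, 9), (50, 59), (100, 109)]
precision, coverage, balance = sampling_scores([2, 5, 52, 70, 105, 200], scenes)
print(round(precision, 3), round(coverage, 3), round(balance, 3))
```

A sampler that piles every frame into one relevant scene would score high on precision but low on coverage and balance, which is the failure mode that QA accuracy alone cannot expose.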

[12] Unleashing the Power of Image-Tabular Self-Supervised Learning via Breaking Cross-Tabular Barriers

Yibing Fu, Yunpeng Zhao, Zhitao Zeng, Cheng Chen, Yueming Jin

🧩 TL;DR

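This paper proposes CITab, a new self-supervised learning paradigm for cross-tabular multimodal learning. With a semantic-aware tabular modeling mechanism and a prototype-guided mixture-of-linear layer module, it addresses transfer learning across heterogeneous tabular data and outperforms existing methods on Alzheimer's disease diagnosis.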


📘 Detailed Summary

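Motivation: Existing multimodal self-supervised learning methods for images and tabular data are confined to specific data cohorts, mainly because their rigid tabular modeling mechanisms cannot handle heterogeneous tabular data. This cross-tabular barrier prevents models from learning transferable medical knowledge across diverse cohorts.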

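Method: The authors propose CITab, a self-supervised learning framework whose tabular modeling mechanism is designed from a semantic-awareness perspective: column headers are integrated as semantic cues to promote transferable knowledge learning. A prototype-guided mixture-of-linear layer (P-MoLin) module specializes tabular features, handling the heterogeneity of tabular data and exploring the underlying medical concepts.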

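Result: On Alzheimer's disease diagnosis, evaluated comprehensively across three publicly available data cohorts with 4,461 subjects in total, CITab significantly outperforms existing state-of-the-art methods in cross-tabular multimodal learning.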

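Conclusion: The study opens a path toward effective and scalable cross-tabular multimodal learning. Semantic-aware tabular modeling and prototype-guided feature specialization address knowledge transfer across heterogeneous tabular data, giving clinical decision support systems a stronger multimodal representation learning framework.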


📄 Abstract

Multi-modal learning integrating medical images and tabular data has significantly advanced clinical decision-making in recent years. Self-Supervised Learning (SSL) has emerged as a powerful paradigm for pretraining these models on large-scale unlabeled image-tabular data, aiming to learn discriminative representations. However, existing SSL methods for image-tabular representation learning are often confined to specific data cohorts, mainly due to their rigid tabular modeling mechanisms when modeling heterogeneous tabular data. This inter-tabular barrier hinders the multi-modal SSL methods from effectively learning transferrable medical knowledge shared across diverse cohorts. In this paper, we propose a novel SSL framework, namely CITab, designed to learn powerful multi-modal feature representations in a cross-tabular manner. We design the tabular modeling mechanism from a semantic-awareness perspective by integrating column headers as semantic cues, which facilitates transferrable knowledge learning and the scalability in utilizing multiple data sources for pretraining. Additionally, we propose a prototype-guided mixture-of-linear layer (P-MoLin) module for tabular feature specialization, empowering the model to effectively handle the heterogeneity of tabular data and explore the underlying medical concepts. We conduct comprehensive evaluations on Alzheimer's disease diagnosis task across three publicly available data cohorts containing 4,461 subjects. Experimental results demonstrate that CITab outperforms state-of-the-art approaches, paving the way for effective and scalable cross-tabular multi-modal learning.

[13] ChartAgent: A Chart Understanding Framework with Tool Integrated Reasoning

Boran Wang, Xinming Wang, Yi Chen, Xiang Li, Jian Xu, Jing Yuan, Chenglin Liu

🧩 TL;DR

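This paper presents ChartAgent, a chart understanding framework based on tool-integrated reasoning. It decomposes complex chart analysis into observable, replayable steps and uses a modular tool library for systematic visual parsing, substantially improving robustness under sparse annotation.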


📘 Detailed Summary

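Motivation: MLLMs have made progress in chart understanding but depend heavily on explicit textual annotations, and their performance degrades markedly when key numerals are missing. Existing methods lack transparency, verifiability, and robustness under sparse annotation, which limits trustworthiness and scalability in practice.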

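Method: ChartAgent uses a Tool-Integrated Reasoning (TIR) framework that decomposes complex chart analysis into a sequence of observable, replayable steps. An extensible, modular tool library with more than a dozen core tools, including key element detection, instance segmentation, and optical character recognition (OCR), is orchestrated dynamically by the agent to parse diverse chart types systematically. Intermediate outputs are standardized and consolidated into a structured Evidence Package that provides traceable, reproducible support for final conclusions.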

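Result: Experiments show that ChartAgent substantially improves robustness under sparse annotation settings. Through the transparency and verifiability of tool-integrated reasoning, the framework moves beyond the black-box paradigm, offers a traceable and reproducible approach to chart understanding, and demonstrates systematic parsing across diverse chart types.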

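Conclusion: ChartAgent's tool-integrated reasoning framework provides a transparent, verifiable approach to chart understanding and substantially improves performance under sparse annotation. Its modular tool library and structured Evidence Package deliver traceability and reproducibility beyond traditional black-box methods, offering a practical path toward trustworthy, extensible chart understanding systems.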


📄 Abstract

With their high information density and intuitive readability, charts have become the de facto medium for data analysis and communication across disciplines. Recent multimodal large language models (MLLMs) have made notable progress in automated chart understanding, yet they remain heavily dependent on explicit textual annotations, and their performance degrades markedly when key numerals are absent. To address this limitation, we introduce ChartAgent, a chart understanding framework grounded in Tool-Integrated Reasoning (TIR). Inspired by human cognition, ChartAgent decomposes complex chart analysis into a sequence of observable, replayable steps. Supporting this architecture is an extensible, modular tool library comprising more than a dozen core tools, such as key element detection, instance segmentation, and optical character recognition (OCR), which the agent dynamically orchestrates to achieve systematic visual parsing across diverse chart types. Leveraging TIR's transparency and verifiability, ChartAgent moves beyond the black-box paradigm by standardizing and consolidating intermediate outputs into a structured Evidence Package, providing traceable and reproducible support for final conclusions. Experiments show that ChartAgent substantially improves robustness under sparse annotation settings, offering a practical path toward trustworthy and extensible systems for chart understanding.

[14] OmniDrive-R1: Reinforcement-driven Interleaved Multi-modal Chain-of-Thought for Trustworthy Vision-Language Autonomous Driving

Zhenguo Zhang, Haohan Zhen, Yishen Wang, Le Xu, Tianchen Deng, Xuefeng Chen, Qu Chen, Bo Zhang, Wuxiong Huang

🧩 TL;DR

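This paper presents OmniDrive-R1, an end-to-end vision-language model framework for autonomous driving. It unifies perception and reasoning through an interleaved multimodal chain-of-thought mechanism and adds reinforcement-driven visual grounding, significantly improving reliability and accuracy in driving scenarios.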


📘 Detailed Summary

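Motivation: Deploying VLMs in safety-critical domains such as autonomous driving is hindered by reliability problems, notably object hallucination, which stems from reliance on ungrounded, text-based chain-of-thought reasoning. Existing multimodal chain-of-thought methods have two fundamental flaws: decoupled perception and reasoning stages that prevent end-to-end joint optimization, and reliance on expensive, dense localization labels.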

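Method: The authors propose OmniDrive-R1, an end-to-end VLM framework that unifies perception and reasoning through an interleaved Multi-modal Chain-of-Thought (iMCoT) mechanism. Its core innovation is a reinforcement-driven visual grounding capability that lets the model focus autonomously on critical regions for fine-grained analysis. Training uses a pure two-stage reinforcement learning pipeline and the Clip-GRPO algorithm, which introduces an annotation-free, process-based grounding reward that ensures stability by enforcing real-time cross-modal consistency between visual focus and textual reasoning.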

Result: Extensive experiments on DriveLMM-o1 show that, compared with the Qwen2.5VL-7B baseline, OmniDrive-R1 raises the overall reasoning score from 51.77% to 80.35% and final answer accuracy from 37.81% to 73.62%.

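Conclusion: The study shows that reinforcement-driven visual grounding combined with interleaved multimodal chain-of-thought can effectively address VLM reliability problems in autonomous driving. The approach removes the dependence on dense annotations and stabilizes reasoning through cross-modal consistency, offering a new technical path for deploying VLMs in safety-critical domains.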


📄 Abstract

The deployment of Vision-Language Models (VLMs) in safety-critical domains like autonomous driving (AD) is critically hindered by reliability failures, most notably object hallucination. This failure stems from their reliance on ungrounded, text-based Chain-of-Thought (CoT) reasoning. While existing multi-modal CoT approaches attempt mitigation, they suffer from two fundamental flaws: (1) decoupled perception and reasoning stages that prevent end-to-end joint optimization, and (2) reliance on expensive, dense localization labels. Thus we introduce OmniDrive-R1, an end-to-end VLM framework designed for autonomous driving, which unifies perception and reasoning through an interleaved Multi-modal Chain-of-Thought (iMCoT) mechanism. Our core innovation is a reinforcement-driven visual grounding capability, enabling the model to autonomously direct its attention and "zoom in" on critical regions for fine-grained analysis. This capability is enabled by our pure two-stage reinforcement learning training pipeline and Clip-GRPO algorithm. Crucially, Clip-GRPO introduces an annotation-free, process-based grounding reward. This reward not only eliminates the need for dense labels but also circumvents the instability of external tool calls by enforcing real-time cross-modal consistency between the visual focus and the textual reasoning. Extensive experiments on DriveLMM-o1 demonstrate our model's significant improvements. Compared to the baseline Qwen2.5VL-7B, OmniDrive-R1 improves the overall reasoning score from 51.77% to 80.35%, and the final answer accuracy from 37.81% to 73.62%.
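The reported scores can be restated as absolute and relative gains. The short sketch below works only from the four numbers quoted in the abstract; the dictionary keys and the helper name `gains` are illustrative.

```python
# Scores reported in the abstract, in percent.
baseline = {"overall_reasoning": 51.77, "final_answer_accuracy": 37.81}
omnidrive = {"overall_reasoning": 80.35, "final_answer_accuracy": 73.62}


def gains(before: float, after: float):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return round(absolute, 2), round(relative, 1)


for metric in baseline:
    print(metric, gains(baseline[metric], omnidrive[metric]))
```

In other words, the overall reasoning score rises by about 28.6 percentage points and final answer accuracy by about 35.8 points, the latter nearly doubling the baseline.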

[15] Real-time prediction of workplane illuminance distribution for daylight-linked controls using non-intrusive multimodal deep learning

Zulin Zhuang, Yu Bian

🧩 TL;DR

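This study proposes a multimodal deep learning framework that predicts indoor workplane illuminance distributions in real time using image features extracted only from side-lit window areas. It addresses daylight prediction in dynamically occupied indoor spaces and offers a non-intrusive solution for daylight-linked controls.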


📘 Detailed Summary

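Motivation: Daylight-linked controls have great potential for building energy savings, but most existing indoor daylight prediction studies were developed for static scenes and do not carry over to dynamically occupied spaces. A non-intrusive method is needed that predicts workplane illuminance distributions accurately in real time.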

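Method: The study proposes a multimodal deep learning framework that extracts image features from side-lit window areas rather than interior pixels and combines them with temporal-spatial features to predict indoor workplane illuminance distributions in real time, so the approach remains applicable in dynamically occupied spaces. A field experiment in a test room in Guangzhou collected 17,344 samples for model training and validation.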

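Result: The model achieves R² > 0.98 with RMSE < 0.14 on the same-distribution test set and R² > 0.82 with RMSE < 0.17 on the unseen-day test set, indicating high accuracy and acceptable temporal generalization and validating the framework for real-time indoor daylight prediction.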

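Conclusion: The study shows that a multimodal deep learning framework focused on window areas rather than the interior can effectively predict indoor illuminance distributions in dynamically occupied environments, giving daylight-linked control systems a practical non-intrusive solution with significant energy-saving potential.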


📄 Abstract

Daylight-linked controls (DLCs) have significant potential for energy savings in buildings, especially when abundant daylight is available and indoor workplane illuminance can be accurately predicted in real time. Most existing studies on indoor daylight predictions were developed and tested for static scenes. This study proposes a multimodal deep learning framework that predicts indoor workplane illuminance distributions in real time from non-intrusive images with temporal-spatial features. By extracting image features only from the side-lit window areas rather than interior pixels, the approach remains applicable in dynamically occupied indoor spaces. A field experiment was conducted in a test room in Guangzhou (China), where 17,344 samples were collected for model training and validation. The model achieved R² > 0.98 with RMSE < 0.14 on the same-distribution test set and R² > 0.82 with RMSE < 0.17 on an unseen-day test set, indicating high accuracy and acceptable temporal generalization.
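For readers less familiar with the two reported metrics, the sketch below computes the coefficient of determination and the root-mean-square error from paired values using their standard definitions. The sample numbers are made up for illustration, and the abstract does not state the units or normalization of its RMSE values.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between measured and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up illuminance values, for illustration only.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.9]
print("R2:", round(r2_score(measured, predicted), 3))
print("RMSE:", round(rmse(measured, predicted), 3))
```

R² measures how much of the variance in the measurements the predictions explain, while RMSE reports the typical error in the units of the target, so the two are usually read together, as the abstract does.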

[16] SELECT: Detecting Label Errors in Real-world Scene Text Data

Wenjun Liu, Qian Wu, Yifeng Hu, Yuke Li

🧩 TL;DR

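This paper proposes SELECT, which uses multi-modal training to detect label errors in real-world scene text datasets, and introduces the SSLC process to mimic real-world error scenarios. It is presented as the first method to detect such label errors while accounting for variable-length label sequences.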


📘 Detailed Summary

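Motivation: Real-world scene text datasets contain label errors, and existing methods struggle with variable-length sequence labels, label sequence misalignment, and character-level errors. A new method is needed that detects label errors in real-world scene text accurately.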

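Method: SELECT processes multi-modal input with an image-text encoder and a character-level tokenizer. It also proposes Similarity-based Sequence Label Corruption (SSLC), which intentionally introduces errors into the training labels to mimic real-world scenarios, taking the visual similarity between characters into account and allowing the sequence length to change.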

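Result: Experiments show that SELECT outperforms existing methods at detecting label errors and improves scene text recognition accuracy on real-world text datasets, demonstrating its practical utility.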

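Conclusion: The study is the first to detect label errors in real-world scene text datasets while handling variable-length labels. The SSLC process offers an effective way to simulate real-world errors and a practical tool for improving the quality of scene text recognition data.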


📄 Abstract

We introduce SELECT (Scene tExt Label Errors deteCTion), a novel approach that leverages multi-modal training to detect label errors in real-world scene text datasets. Utilizing an image-text encoder and a character-level tokenizer, SELECT addresses the issues of variable-length sequence labels, label sequence misalignment, and character-level errors, outperforming existing methods in accuracy and practical utility. In addition, we introduce Similarity-based Sequence Label Corruption (SSLC), a process that intentionally introduces errors into the training labels to mimic real-world error scenarios during training. SSLC can change the sequence length and also takes into account the visual similarity between characters during corruption. Our method is the first to successfully detect label errors in real-world scene text datasets while accounting for variable-length labels. Experimental results demonstrate the effectiveness of SELECT in detecting label errors and improving scene text recognition (STR) accuracy on real-world text datasets, showcasing its practical utility.
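
To make the idea of similarity-based label corruption concrete, here is a minimal sketch. It is not the authors' SSLC procedure: the look-alike table, the probabilities, and the function name `corrupt_label` are all assumptions chosen for illustration. It only shows the two properties the abstract names, namely substitutions biased toward visually similar characters and edits that change the sequence length.

```python
import random

# Hypothetical look-alike table; a real system would derive it from visual similarity.
SIMILAR = {"O": "0Q", "0": "OD", "l": "1I", "1": "lI", "I": "l1",
           "S": "5", "5": "S", "B": "8", "8": "B"}

def corrupt_label(label, rng, p_sub=0.3, p_del=0.1, p_ins=0.1):
    """Return a corrupted copy of `label` (assumed scheme, for illustration)."""
    out = []
    for ch in label:
        r = rng.random()
        if r < p_del:
            continue  # deletion: shortens the sequence
        if r < p_del + p_sub and ch in SIMILAR:
            out.append(rng.choice(SIMILAR[ch]))  # swap in a look-alike character
        else:
            out.append(ch)
        if rng.random() < p_ins:
            out.append(out[-1])  # insertion: duplicate the character just emitted
    return "".join(out)

rng = random.Random(0)
print(corrupt_label("BOOKS 150", rng))
```

A detector trained on pairs of clean and corrupted labels then sees errors that resemble plausible annotation mistakes rather than uniformly random noise.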

[17] SDAR-VL: Stable and Efficient Block-wise Diffusion for Vision-Language Understanding

Shuang Cheng, Yuhua Jiang, Zineng Zhou, Dawei Liu, Wang Tao, Linfeng Zhang, Biqing Qi, Bowen Zhou

🧩 TL;DR

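This paper presents SDAR-VL, the first systematic application of block-wise discrete diffusion to large-scale vision-language understanding. An integrated training framework addresses the high training cost, slow convergence, and instability of conventional block diffusion, and the model sets a new state of the art among diffusion-based vision-language models across 21 benchmarks.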


📘 Detailed Summary

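Motivation: Block-wise discrete diffusion offers an attractive balance between parallel generation and causal dependency modeling, but its practical adoption has been limited by high training cost, slow convergence, and instability, which have kept it behind strong autoregressive baselines.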

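Method: SDAR-VL proposes an integrated framework with three components: Asynchronous Block-wise Noise Scheduling to diversify supervision within each batch; Effective Mask Ratio Scaling for unbiased loss normalization under stochastic masking; and a Progressive Beta Noise Curriculum that increases effective mask coverage while preserving corruption diversity.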

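Result: Across 21 single-image, multi-image, and video benchmarks, SDAR-VL improves on conventional block diffusion in training efficiency, convergence stability, and task performance. Under matched settings, it matches or surpasses strong autoregressive baselines such as LLaVA-OneVision as well as the global diffusion baseline LLaDA-V.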

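Conclusion: The study establishes block-wise diffusion as a practical backbone for vision-language understanding. Systematic optimization addresses the limitations of conventional block diffusion and yields an efficient, stable framework for applying diffusion models to vision-language tasks.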


📄 Abstract

Block-wise discrete diffusion offers an attractive balance between parallel generation and causal dependency modeling, making it a promising backbone for vision-language modeling. However, its practical adoption has been limited by high training cost, slow convergence, and instability, which have so far kept it behind strong autoregressive (AR) baselines. We present SDAR-VL, the first systematic application of block-wise discrete diffusion to large-scale vision-language understanding (VLU), together with an integrated framework for efficient and stable training. This framework unifies three components: (1) Asynchronous Block-wise Noise Scheduling to diversify supervision within each batch; (2) Effective Mask Ratio Scaling for unbiased loss normalization under stochastic masking; and (3) a Progressive Beta Noise Curriculum that increases effective mask coverage while preserving corruption diversity. Experiments on 21 single-image, multi-image, and video benchmarks show that SDAR-VL consistently improves training efficiency, convergence stability, and task performance over conventional block diffusion. On this evaluation suite, SDAR-VL sets a new state of the art among diffusion-based vision-language models and, under matched settings, matches or surpasses strong AR baselines such as LLaVA-OneVision as well as the global diffusion baseline LLaDA-V, establishing block-wise diffusion as a practical backbone for VLU.

[18] Neurosymbolic Inference On Foundation Models For Remote Sensing Text-to-image Retrieval With Complex Queries

Emanuele Mezzi, Gertjan Burghouts, Maarten Kruithof

🧩 TL;DR

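This paper proposes RUNE (Reasoning Using Neurosymbolic Entities), which combines large language models with neurosymbolic AI to retrieve remote sensing images by reasoning over the compatibility between detected entities and First-Order Logic expressions. Compared with conventional remote sensing large vision-language models, it improves performance, robustness, and explainability.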


📘 Detailed Summary

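Motivation: Remote sensing large vision-language models tailored for aerial and satellite imagery have advanced text-to-image retrieval, but limited explainability and poor handling of complex spatial relations still restrict their real-world use.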

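Method: RUNE combines large language models with neurosymbolic AI. An LLM translates the text query into a First-Order Logic (FOL) expression, and a neurosymbolic inference module then reasons explicitly over the detected entities instead of relying on implicit joint embeddings. For scalability, a logic decomposition strategy operates on conditioned subsets of the detected entities.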

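Result: On the DOTA dataset, repurposed and augmented with complex queries, RUNE outperforms state-of-the-art remote sensing large vision-language models on complex retrieval tasks. Two new metrics, Retrieval Robustness to Query Complexity (RRQC) and Retrieval Robustness to Image Uncertainty (RRIU), show the method's advantages in performance, robustness, and explainability.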

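Conclusion: By reasoning explicitly rather than through implicit embeddings, RUNE delivers better performance, robustness, and explainability for remote sensing image retrieval and shows the potential of neurosymbolic AI in real-world remote sensing applications, illustrated by a use case on post-flood satellite image retrieval.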


📄 Abstract

Text-to-image retrieval in remote sensing (RS) has advanced rapidly with the rise of large vision-language models (LVLMs) tailored for aerial and satellite imagery, culminating in remote sensing large vision-language models (RS-LVLMs). However, limited explainability and poor handling of complex spatial relations remain key challenges for real-world use. To address these issues, we introduce RUNE (Reasoning Using Neurosymbolic Entities), an approach that combines Large Language Models (LLMs) with neurosymbolic AI to retrieve images by reasoning over the compatibility between detected entities and First-Order Logic (FOL) expressions derived from text queries. Unlike RS-LVLMs that rely on implicit joint embeddings, RUNE performs explicit reasoning, enhancing performance and interpretability. For scalability, we propose a logic decomposition strategy that operates on conditioned subsets of detected entities, guaranteeing shorter execution time compared to neural approaches. Rather than using foundation models for end-to-end retrieval, we leverage them only to generate FOL expressions, delegating reasoning to a neurosymbolic inference module. For evaluation, we repurpose the DOTA dataset, originally designed for object detection, by augmenting it with more complex queries than those in existing benchmarks. We show the LLM's effectiveness in text-to-logic translation and compare RUNE with state-of-the-art RS-LVLMs, demonstrating superior performance. We introduce two metrics, Retrieval Robustness to Query Complexity (RRQC) and Retrieval Robustness to Image Uncertainty (RRIU), which evaluate performance relative to query complexity and image uncertainty. RUNE outperforms joint-embedding models in complex RS retrieval tasks, offering gains in performance, robustness, and explainability. We show RUNE's potential for real-world RS applications through a use case on post-flood satellite image retrieval.
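
The idea of checking an FOL expression against detected entities can be sketched as follows. This is a toy illustration, not RUNE's inference module: the detection format, the `near` relation with its pixel threshold, and the helper names are all assumptions. It evaluates a query of the form "there exist x and y such that ship(x), harbor(y), and near(x, y)" by enumerating pairs of detections.

```python
from itertools import product
from math import dist

# Hypothetical detector output: class label plus box centre in pixels.
detections = [
    {"label": "ship", "center": (120.0, 80.0)},
    {"label": "harbor", "center": (150.0, 95.0)},
    {"label": "plane", "center": (600.0, 400.0)},
]

def is_a(label):
    """Unary predicate: the entity carries the given class label."""
    return lambda e: e["label"] == label

def near(a, b, max_dist=50.0):
    """Binary relation: box centres lie within max_dist pixels (assumed threshold)."""
    return dist(a["center"], b["center"]) <= max_dist

def exists_pair(entities, pred_x, pred_y, relation):
    """Evaluate: exists distinct x, y . pred_x(x) and pred_y(y) and relation(x, y)."""
    return any(
        x is not y and pred_x(x) and pred_y(y) and relation(x, y)
        for x, y in product(entities, repeat=2)
    )

# Query: "a ship near a harbor"
print(exists_pair(detections, is_a("ship"), is_a("harbor"), near))
```

Because every predicate is evaluated on named detections, a match can be traced back to the specific entities that satisfied the query, which is the kind of explainability that explicit reasoning offers over a joint embedding score.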

[19] From YOLO to VLMs: Advancing Zero-Shot and Few-Shot Detection of Wastewater Treatment Plants Using Satellite Imagery in MENA Region

Akila Premarathna, Kanishka Hewageegana, Garcia Andarcia Mariangel

🧩 TL;DR

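This study presents a structured methodology for comparing vision-language models in zero-shot and few-shot settings on identifying wastewater treatment plants in the Middle East and North Africa. The results indicate that VLMs can outperform conventional YOLOv8 segmentation and enable efficient classification without manual annotation.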


📘 Detailed Summary

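Motivation: The Middle East and North Africa have high demand for wastewater treatment plants (WWTPs). Conventional YOLOv8-based satellite image segmentation requires extensive manual labeling, whereas vision-language models offer an annotation-free alternative through their inherent reasoning. The study sets out to compare VLMs systematically on WWTP identification.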

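Method: The study uses a structured VLM comparison with separate zero-shot and few-shot streams and evaluates LLaMA 3.2 Vision, Qwen 2.5 VL, DeepSeek-VL2, Gemma 3, Gemini, and Pixtral 12B. Expert prompts ask the models to identify WWTP components such as circular/rectangular tanks and aeration basins and to return JSON output. The YOLOv8 baseline was trained on a governmental dataset of 83,566 high-resolution satellite images.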

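Result: In zero-shot evaluation on WWTP images, several VLMs exceeded YOLOv8's true positive rate, with Gemma-3 scoring highest. This supports the claim that VLMs, particularly in the zero-shot setting, can classify WWTPs efficiently without manual annotation.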

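Conclusion: The study indicates that vision-language models, particularly zero-shot approaches, can replace conventional YOLOv8 segmentation for annotation-free WWTP classification, opening a path to scalable remote sensing and reducing manual labeling costs in environmental monitoring.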


📄 Abstract

In the Middle East and North Africa (MENA) region, there is high demand for wastewater treatment plants (WWTPs), which are crucial for sustainable water management. Precise identification of WWTPs from satellite images enables environmental monitoring. Traditional methods such as YOLOv8 segmentation require extensive manual labeling, whereas studies indicate that vision-language models (VLMs) are an efficient alternative that achieves equivalent or superior results through inherent reasoning and annotation. This study presents a structured methodology for VLM comparison, divided into zero-shot and few-shot streams, specifically to identify WWTPs. YOLOv8 was trained on a governmental dataset of 83,566 high-resolution satellite images from Egypt, Saudi Arabia, and the UAE: ~85% WWTPs (positives) and 15% non-WWTPs (negatives). The evaluated VLMs include LLaMA 3.2 Vision, Qwen 2.5 VL, DeepSeek-VL2, Gemma 3, Gemini, and Pixtral 12B (Mistral), which are used to identify WWTP components such as circular/rectangular tanks and aeration basins and to distinguish confounders via expert prompts that produce JSON outputs with confidence scores and descriptions. The dataset comprises 1,207 validated WWTP locations (198 UAE, 354 KSA, 655 Egypt) and an equal number of non-WWTP sites from field/AI data, as 600 m × 600 m GeoTIFF images (zoom level 18, EPSG:4326). Zero-shot evaluations on WWTP images showed several VLMs outperforming YOLOv8's true positive rate, with Gemma-3 the highest. The results confirm that VLMs, particularly in the zero-shot setting, can replace YOLOv8 for efficient, annotation-free WWTP classification, enabling scalable remote sensing.

[20] Semantic Mismatch and Perceptual Degradation: A New Perspective on Image Editing Immunity

Shuai Dong, Jie Zhang, Guoying Zhao, Shiguang Shan, Xilin Chen

🧩 TL;DR

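This paper proposes SIFM, a new image immunization method that defends against malicious diffusion-based text-guided editing through synergistic perturbation of intermediate features, and introduces the Immunization Success Rate (ISR), a rigorous metric for quantifying immunization efficacy.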


📘 Detailed Summary

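Motivation: Text-guided image editing with diffusion models can be misused, and current metrics for immunization success are fundamentally flawed. They typically measure the visual dissimilarity between the output from a protected image and a reference output from the unprotected original, which overlooks the core requirement of immunization: disrupting semantic alignment with the attacker's intent, not merely deviating from one specific output.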

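Method: The paper proposes Synergistic Intermediate Feature Manipulation (SIFM), which strategically perturbs intermediate diffusion features through two synergistic objectives: maximizing feature divergence from the original edit trajectory to disrupt semantic alignment with the expected edit, and minimizing feature norms to induce perceptual degradation. It also introduces the Immunization Success Rate (ISR), which uses Multimodal Large Language Models (MLLMs) to assess the proportion of edits in which immunization causes semantic failure relative to the prompt or significant perceptual degradation.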

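Result: Extensive experiments show that SIFM achieves state-of-the-art performance in safeguarding visual content against malicious diffusion-based manipulation. The proposed ISR metric quantifies true immunization efficacy rigorously and gives a more accurate yardstick for evaluating immunization methods.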

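Conclusion: The study redefines immunization success as an edited output that either semantically mismatches the prompt or suffers substantial perceptual degradation, both of which thwart malicious intent. Together, SIFM and ISR provide a technical framework and an evaluation standard for image security defenses against misuse of AI-generated content.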


📄 Abstract

Text-guided image editing via diffusion models, while powerful, raises significant concerns about misuse, motivating efforts to immunize images against unauthorized edits using imperceptible perturbations. Prevailing metrics for evaluating immunization success typically rely on measuring the visual dissimilarity between the output generated from a protected image and a reference output generated from the unprotected original. This approach fundamentally overlooks the core requirement of image immunization, which is to disrupt semantic alignment with attacker intent, regardless of deviation from any specific output. We argue that immunization success should instead be defined by the edited output either semantically mismatching the prompt or suffering substantial perceptual degradation, both of which thwart malicious intent. To operationalize this principle, we propose Synergistic Intermediate Feature Manipulation (SIFM), a method that strategically perturbs intermediate diffusion features through dual synergistic objectives: (1) maximizing feature divergence from the original edit trajectory to disrupt semantic alignment with the expected edit, and (2) minimizing feature norms to induce perceptual degradation. Furthermore, we introduce the Immunization Success Rate (ISR), a novel metric designed to rigorously quantify true immunization efficacy for the first time. ISR quantifies the proportion of edits where immunization induces either semantic failure relative to the prompt or significant perceptual degradation, assessed via Multimodal Large Language Models (MLLMs). Extensive experiments show that SIFM achieves state-of-the-art performance in safeguarding visual content against malicious diffusion-based manipulation.
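
As described, ISR is a proportion over judged edits. The sketch below computes it from per-edit verdicts; the `Judgment` record and its field names are assumptions for illustration, and the MLLM that would produce the verdicts is outside the scope of the example.

```python
from dataclasses import dataclass

@dataclass
class Judgment:
    semantic_mismatch: bool       # edited output no longer matches the prompt
    perceptual_degradation: bool  # edited output is substantially degraded

def immunization_success_rate(judgments):
    """Fraction of edits counted as immunized: either failure mode suffices."""
    if not judgments:
        raise ValueError("need at least one judged edit")
    successes = sum(
        j.semantic_mismatch or j.perceptual_degradation for j in judgments
    )
    return successes / len(judgments)

# Hypothetical verdicts for four edits of protected images.
verdicts = [
    Judgment(True, False),
    Judgment(False, False),
    Judgment(False, True),
    Judgment(True, True),
]
print(immunization_success_rate(verdicts))
```

An edit that both mismatches the prompt and is degraded still counts once, so the metric stays in the range 0 to 1.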

[21] AnchorHOI: Zero-shot Generation of 4D Human-Object Interaction via Anchor-based Prior Distillation

Sisi Dai, Kai Xu

🧩 TL;DR

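This paper proposes AnchorHOI, a framework that uses an anchor-based prior distillation strategy and combines video and image diffusion models to generate 4D human-object interactions zero-shot, improving generation diversity and generalization.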


📘 Detailed Summary

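Motivation: Text-driven 4D human-object interaction (HOI) generation faces two challenges: supervised methods are limited by the scarcity of large-scale 4D HOI datasets, and existing zero-shot methods distill too few interaction cues during generation, which restricts their applicability across diverse scenarios.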

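Method: AnchorHOI adopts an anchor-based prior distillation strategy that constructs interaction-aware anchors to guide generation. Two tailored anchors are designed: anchor Neural Radiance Fields (NeRFs) for expressive interaction composition and anchor keypoints for realistic motion synthesis. Together they form a tractable two-step generation process that draws on hybrid priors from video and image diffusion models.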

Result: Extensive experiments show that AnchorHOI outperforms previous methods on 4D HOI generation, with superior diversity and generalization.

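Conclusion: The study demonstrates the effectiveness of anchor-based prior distillation for 4D HOI generation. Combining hybrid priors from video and image diffusion models offers a new way to handle the difficulty of optimizing high-dimensional 4D interactions and supports practical zero-shot interaction generation.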


📄 Abstract

Despite significant progress in text-driven 4D human-object interaction (HOI) generation with supervised methods, the scalability remains limited by the scarcity of large-scale 4D HOI datasets. To overcome this, recent approaches attempt zero-shot 4D HOI generation with pre-trained image diffusion models. However, interaction cues are minimally distilled during the generation process, restricting their applicability across diverse scenarios. In this paper, we propose AnchorHOI, a novel framework that thoroughly exploits hybrid priors by incorporating video diffusion models beyond image diffusion models, advancing 4D HOI generation. Nevertheless, directly optimizing high-dimensional 4D HOI with such priors remains challenging, particularly for human pose and compositional motion. To address this challenge, AnchorHOI introduces an anchor-based prior distillation strategy, which constructs interaction-aware anchors and then leverages them to guide generation in a tractable two-step process. Specifically, two tailored anchors are designed for 4D HOI generation: anchor Neural Radiance Fields (NeRFs) for expressive interaction composition, and anchor keypoints for realistic motion synthesis. Extensive experiments demonstrate that AnchorHOI outperforms previous methods with superior diversity and generalization.

[22] Dual Attention Guided Defense Against Malicious Edits

Jie Zhang, Shuai Dong, Shiguang Shan, Xilin Chen

🧩 TL;DR

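This paper proposes Dual Attention-Guided Noise Perturbation (DANP), an immunization method that adds imperceptible perturbations to disrupt the semantic understanding and generation process of text-to-image diffusion models and thereby defends against malicious edits. The method operates over multiple timesteps and manipulates both cross-attention maps and the noise prediction process.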


📘 Detailed Summary

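Motivation: Progress in text-to-image diffusion models for image editing raises significant ethical challenges, in particular misuse for creating deceptive or harmful content. Existing defenses reduce this risk by embedding imperceptible perturbations, but their effectiveness against malicious tampering remains limited, so a stronger immunization mechanism is needed.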

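Method: DANP operates over multiple timesteps and uses a dynamic threshold to generate masks that identify text-relevant and text-irrelevant regions. It reduces attention in relevant regions and increases it in irrelevant ones, misguiding the edit toward incorrect regions and preserving the intended targets. The method also maximizes the discrepancy between the injected noise and the model's predicted noise to interfere further with generation, so that both the attention and the noise prediction mechanisms are targeted.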

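Result: Extensive experiments confirm that DANP provides strong immunity against malicious edits and achieves state-of-the-art performance. By manipulating cross-attention maps and the noise prediction process together, it disrupts the model's semantic understanding and generation process.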

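Conclusion: The study provides an effective defense against malicious edits with text-to-image diffusion models. Targeting attention and noise prediction together yields more comprehensive protection, offers a technical route for addressing the ethical misuse of diffusion models, and lays groundwork for stronger content protection systems.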


📄 Abstract

Recent progress in text-to-image diffusion models has transformed image editing via text prompts, yet this also introduces significant ethical challenges from potential misuse in creating deceptive or harmful content. While current defenses seek to mitigate this risk by embedding imperceptible perturbations, their effectiveness is limited against malicious tampering. To address this issue, we propose a Dual Attention-Guided Noise Perturbation (DANP) immunization method that adds imperceptible perturbations to disrupt the model's semantic understanding and generation process. DANP functions over multiple timesteps to manipulate both cross-attention maps and the noise prediction process, using a dynamic threshold to generate masks that identify text-relevant and irrelevant regions. It then reduces attention in relevant areas while increasing it in irrelevant ones, thereby misguiding the edit towards incorrect regions and preserving the intended targets. Additionally, our method maximizes the discrepancy between the injected noise and the model's predicted noise to further interfere with the generation. By targeting both attention and noise prediction mechanisms, DANP exhibits impressive immunity against malicious edits, and extensive experiments confirm that our method achieves state-of-the-art performance.

[23] OUSAC: Optimized Guidance Scheduling with Adaptive Caching for DiT Acceleration

Ruitong Sun, Tianze Yang, Wei Niu, Jin Sun

🧩 TL;DR

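This paper proposes OUSAC, a framework that accelerates diffusion transformers through optimized guidance scheduling and adaptive caching. Its key insight is that variable guidance scales enable sparse computation, reducing both sampling steps and CFG computation while maintaining generation quality.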


📘 Detailed Summary

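Motivation: Diffusion models are the dominant paradigm for high-quality image generation, but iterative denoising is computationally expensive. Classifier-Free Guidance (CFG) significantly improves quality and controllability yet doubles the computation, because it needs both a conditional and an unconditional forward pass at every timestep. Existing caching methods assume a constant CFG scale and cannot cope with the denoising deviations that variable guidance patterns introduce, and different transformer blocks are affected to different degrees under dynamic conditions.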

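Method: OUSAC uses a two-stage approach. Stage 1 employs evolutionary algorithms to jointly optimize which timesteps to skip and which guidance scale to use, eliminating up to 82% of unconditional forward passes. Stage 2 introduces adaptive rank allocation, which tailors the calibration effort to each transformer block and keeps caching effective under variable guidance. The key insight is that variable guidance scales enable sparse computation: adjusting the scale at certain timesteps compensates for skipping CFG at others.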

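Result: OUSAC significantly outperforms state-of-the-art acceleration methods: 53% computational savings with a 15% quality improvement on DiT-XL/2 (ImageNet 512x512), 60% savings with a 16.1% improvement on PixArt-alpha (MSCOCO), and a 5x speedup on FLUX while improving CLIP Score over the 50-step baseline.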

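Conclusion: The study shows that systematically optimizing guidance scheduling and caching can accelerate diffusion transformers substantially without sacrificing quality. Variable guidance scales open a route to sparse computation, and adaptive rank allocation addresses cache degradation under dynamic conditions, giving a practical framework for efficient diffusion model inference.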


📄 Abstract

Diffusion models have emerged as the dominant paradigm for high-quality image generation, yet their computational expense remains substantial due to iterative denoising. Classifier-Free Guidance (CFG) significantly enhances generation quality and controllability but doubles the computation by requiring both conditional and unconditional forward passes at every timestep. We present OUSAC (Optimized gUidance Scheduling with Adaptive Caching), a framework that accelerates diffusion transformers (DiT) through systematic optimization. Our key insight is that variable guidance scales enable sparse computation: adjusting scales at certain timesteps can compensate for skipping CFG at others, enabling both fewer total sampling steps and fewer CFG steps while maintaining quality. However, variable guidance patterns introduce denoising deviations that undermine standard caching methods, which assume constant CFG scales across steps. Moreover, different transformer blocks are affected at different levels under dynamic conditions. This paper develops a two-stage approach leveraging these insights. Stage-1 employs evolutionary algorithms to jointly optimize which timesteps to skip and what guidance scale to use, eliminating up to 82% of unconditional passes. Stage-2 introduces adaptive rank allocation that tailors calibration efforts per transformer block, maintaining caching effectiveness under variable guidance. Experiments demonstrate that OUSAC significantly outperforms state-of-the-art acceleration methods, achieving 53% computational savings with 15% quality improvement on DiT-XL/2 (ImageNet 512x512), 60% savings with 16.1% improvement on PixArt-alpha (MSCOCO), and 5x speedup on FLUX while improving CLIP Score over the 50-step baseline.
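
The cost structure that OUSAC targets can be seen in a small sketch of classifier-free guidance with a per-step schedule. The guidance formula is the standard CFG combination; everything else is an assumption for illustration: the `None` convention for a skipped step, the choice to fall back to the conditional prediction on skipped steps, and the toy `cond` and `uncond` functions that stand in for model passes. The evolutionary search and the caching stage are not shown.

```python
def guided_noise(eps_cond, eps_uncond, scale):
    """Standard CFG: eps_uncond + scale * (eps_cond - eps_uncond)."""
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]

def run_schedule(schedule, cond_fn, uncond_fn):
    """schedule[t] is a guidance scale, or None to skip the unconditional pass."""
    outputs, uncond_calls = [], 0
    for t, scale in enumerate(schedule):
        eps_c = cond_fn(t)
        if scale is None:
            outputs.append(eps_c)  # assumed fallback: conditional prediction only
        else:
            uncond_calls += 1      # this step pays for a second forward pass
            outputs.append(guided_noise(eps_c, uncond_fn(t), scale))
    return outputs, uncond_calls

# Toy stand-ins for the two model passes (hypothetical).
cond = lambda t: [1.0 + t, 2.0]
uncond = lambda t: [0.5 * t, 1.0]

schedule = [7.5, None, None, 4.0, None]  # variable scales, three skipped steps
eps, calls = run_schedule(schedule, cond, uncond)
print(calls, "of", len(schedule), "steps used an unconditional pass")
```

With a constant scale every step would need both passes; a schedule with skipped entries removes the unconditional pass on those steps, and the scales on the remaining steps are what an optimizer would tune to compensate.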

[24] Towards Transferable Defense Against Malicious Image Edits

Jie Zhang, Shuai Dong, Shiguang Shan, Xilin Chen

🧩 TL;DR

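This paper proposes TDAE, a novel bimodal framework that strengthens image immunity against malicious edits through coordinated image-text optimization and addresses the limited transferability of existing methods in cross-model evaluations.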


📘 Detailed Summary

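Motivation: Imperceptible perturbations of input images have shown promise for countering malicious manipulation in diffusion-based image editing systems, but existing methods transfer poorly in cross-model evaluations and do not defend effectively against unseen editing models.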

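Method: TDAE is a bimodal framework that pairs the FlatGrad Defense Mechanism (FDM) at the visual level with Dynamic Prompt Defense (DPD) at the textual level. FDM incorporates gradient regularization into the adversarial objective and explicitly steers perturbations toward flat minima, which strengthens immunity against unseen editing models. DPD is an adversarial optimization paradigm that periodically refines text embeddings to align the editing outcomes of immunized images with those of the original images and then updates the images under the optimized embeddings. Iterative adversarial updates over diverse embeddings push the immunized images toward a broader set of immunity-enhancing features.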

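Result: Extensive experiments show that TDAE achieves state-of-the-art performance in mitigating malicious edits under both intra-model and cross-model evaluations, improving transferable defense against malicious image editing.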

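Conclusion: Coordinated image-text optimization yields an effective defense against malicious image edits. The bimodal approach strengthens single-model defense and, more importantly, addresses cross-model transferability, offering a new technical route for protecting diffusion-based editing systems.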


📄 Abstract

Recent approaches employing imperceptible perturbations in input images have demonstrated promising potential to counter malicious manipulations in diffusion-based image editing systems. However, existing methods suffer from limited transferability in cross-model evaluations. To address this, we propose Transferable Defense Against Malicious Image Edits (TDAE), a novel bimodal framework that enhances image immunity against malicious edits through coordinated image-text optimization. Specifically, at the visual defense level, we introduce FlatGrad Defense Mechanism (FDM), which incorporates gradient regularization into the adversarial objective. By explicitly steering the perturbations toward flat minima, FDM amplifies immune robustness against unseen editing models. For textual enhancement protection, we propose an adversarial optimization paradigm named Dynamic Prompt Defense (DPD), which periodically refines text embeddings to align the editing outcomes of immunized images with those of the original images, then updates the images under optimized embeddings. Through iterative adversarial updates to diverse embeddings, DPD enforces the generation of immunized images that seek a broader set of immunity-enhancing features, thereby achieving cross-model transferability. Extensive experimental results demonstrate that our TDAE achieves state-of-the-art performance in mitigating malicious edits under both intra- and cross-model evaluations.

[25] ViewMask-1-to-3: Multi-View Consistent Image Generation via Multimodal Diffusion Models

Ruishu Zhu, Zhihao Huang, Jiacheng Sun, Ping Luo, Hongyuan Zhang, Xuelong Li

🧩 TL;DR

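ViewMask-1-to-3 applies discrete diffusion models to multi-view image generation. It casts multi-view synthesis as a discrete sequence modeling problem and uses masked token prediction for text-guided, progressive generation of multiple views, maintaining cross-view consistency without complex 3D geometric constraints.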


📘 Detailed Summary

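Motivation: Existing multi-view image generation methods typically rely on 3D-aware architectures or specialized diffusion models that need extensive multi-view training data and complex geometric priors, and generating multiple views from a single image and a text description while maintaining geometric consistency remains difficult.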

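Method: The method formulates multi-view synthesis as a discrete sequence modeling problem in which each viewpoint is represented as visual tokens obtained through MAGVIT-v2 tokenization. Masked token prediction unifies language and vision, and random masking combined with self-attention enables progressive multi-view generation without complex 3D geometric constraints or specialized attention architectures.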

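Result: ViewMask-1-to-3 ranks first on average across the GSO and 3D-FUTURE datasets in terms of PSNR, SSIM, and LPIPS, showing that discrete diffusion is effective for multi-view generation while keeping the architecture simple.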

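Conclusion: Discrete diffusion models offer a viable and simple alternative for multi-view image generation. Discrete sequence modeling with masked prediction achieves cross-view consistency without complex 3D geometric constraints and provides a new technical route for multimodal generation.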


📄 Abstract

Multi-view image generation from a single image and text description remains challenging due to the difficulty of maintaining geometric consistency across different viewpoints. Existing approaches typically rely on 3D-aware architectures or specialized diffusion models that require extensive multi-view training data and complex geometric priors. In this work, we introduce ViewMask-1-to-3, a pioneering approach that applies discrete diffusion models to multi-view image generation. Unlike continuous diffusion methods that operate in latent spaces, ViewMask-1-to-3 formulates multi-view synthesis as a discrete sequence modeling problem, where each viewpoint is represented as visual tokens obtained through MAGVIT-v2 tokenization. By unifying language and vision through masked token prediction, our approach enables progressive generation of multiple viewpoints through iterative token unmasking with text input. ViewMask-1-to-3 achieves cross-view consistency through simple random masking combined with self-attention, eliminating the requirement for complex 3D geometric constraints or specialized attention architectures. Our approach demonstrates that discrete diffusion provides a viable and simple alternative to existing multi-view generation methods, ranking first on average across the GSO and 3D-FUTURE datasets in terms of PSNR, SSIM, and LPIPS, while maintaining architectural simplicity.

[26] DISCODE: Distribution-Aware Score Decoder for Robust Automatic Evaluation of Image Captioning

Nakamasa Inoue, Kanoko Goto, Masanari Oi, Martyna Gruszka, Mahiro Ukai, Takumi Hirose, Yusuke Sekikawa

🧩 TL;DR

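This paper proposes DISCODE, a finetuning-free, distribution-aware score decoder that makes image caption evaluation with large vision-language models more robust across domains, and introduces the MCEval benchmark for assessing the robustness of evaluation metrics.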


📘 Detailed Summary

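Motivation: Large vision-language models perform well on multimodal tasks, but robust image caption evaluation under domain shift remains challenging, and existing methods struggle to stay aligned with human judgments across domains.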

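Method: DISCODE is built on a test-time adaptive evaluation approach. It introduces the Adaptive Test-Time (ATT) loss, which uses a Gaussian prior distribution to make evaluation score estimation more robust, and minimizes this loss efficiently at test time with a derived analytical solution.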

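Result: Experiments show that DISCODE achieves state-of-the-art performance as a reference-free evaluation metric on MCEval and on four representative existing benchmarks, demonstrating its robustness in cross-domain scenarios.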

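Conclusion: DISCODE provides a finetuning-free, robust evaluation framework that aligns better with human judgments, and the MCEval benchmark offers a new standard for assessing the robustness of evaluation metrics.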


📄 Abstract

Large vision-language models (LVLMs) have shown impressive performance across a broad range of multimodal tasks. However, robust image caption evaluation using LVLMs remains challenging, particularly under domain-shift scenarios. To address this issue, we introduce the Distribution-Aware Score Decoder (DISCODE), a novel finetuning-free method that generates robust evaluation scores better aligned with human judgments across diverse domains. The core idea behind DISCODE lies in its test-time adaptive evaluation approach, which introduces the Adaptive Test-Time (ATT) loss, leveraging a Gaussian prior distribution to improve robustness in evaluation score estimation. This loss is efficiently minimized at test time using an analytical solution that we derive. Furthermore, we introduce the Multi-domain Caption Evaluation (MCEval) benchmark, a new image captioning evaluation benchmark covering six distinct domains, designed to assess the robustness of evaluation metrics. In our experiments, we demonstrate that DISCODE achieves state-of-the-art performance as a reference-free evaluation metric across MCEval and four representative existing benchmarks.

[27] CLNet: Cross-View Correspondence Makes a Stronger Geo-Localizationer

Xianwei Cao, Dou Quan, Shuang Wang, Ning Huyan, Wei Wang, Yunan Li, Licheng Jiao

🧩 TL;DR

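This paper proposes CLNet, a novel correspondence-aware feature refinement framework for image retrieval-based cross-view geo-localization. By explicitly modeling cross-view spatial correspondences, it achieves state-of-the-art performance on four public benchmarks.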


📘 Detailed Summary

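Motivation: Image retrieval-based cross-view geo-localization aims to match images captured from significantly different viewpoints, such as satellite and street-level images. Existing methods rely mainly on robust global representations or implicit feature alignment and often fail to model the explicit spatial correspondences that accurate localization depends on, which leaves the semantic and geometric gaps hard to bridge.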

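Method: CLNet decomposes view alignment into three learnable and complementary modules: a Neural Correspondence Map (NCM) that spatially aligns cross-view features via latent correspondence fields; a Nonlinear Embedding Converter (NEC) that remaps features across perspectives with an MLP-based transformation; and a Global Feature Recalibration (GFR) module that reweights informative feature channels guided by learned spatial cues. The framework jointly captures high-level semantics and fine-grained alignments.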

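Result: Extensive experiments on four public benchmarks, CVUSA, CVACT, VIGOR, and University-1652, show that CLNet achieves state-of-the-art performance while offering better interpretability and generalizability, supporting the value of explicitly modeling spatial correspondences.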

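Conclusion: The study shows that explicitly modeling cross-view spatial correspondences matters for geo-localization. By decomposing view alignment into complementary modules, CLNet bridges the semantic and geometric gaps between views and offers a new solution for cross-view matching with better interpretability and generalization potential.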


📄 Abstract

Image retrieval-based cross-view geo-localization (IRCVGL) aims to match images captured from significantly different viewpoints, such as satellite and street-level images. Existing methods predominantly rely on learning robust global representations or implicit feature alignment, which often fail to model explicit spatial correspondences crucial for accurate localization. In this work, we propose a novel correspondence-aware feature refinement framework, termed CLNet, that explicitly bridges the semantic and geometric gaps between different views. CLNet decomposes the view alignment process into three learnable and complementary modules: a Neural Correspondence Map (NCM) that spatially aligns cross-view features via latent correspondence fields; a Nonlinear Embedding Converter (NEC) that remaps features across perspectives using an MLP-based transformation; and a Global Feature Recalibration (GFR) module that reweights informative feature channels guided by learned spatial cues. The proposed CLNet can jointly capture both high-level semantics and fine-grained alignments. Extensive experiments on four public benchmarks, CVUSA, CVACT, VIGOR, and University-1652, demonstrate that our proposed CLNet achieves state-of-the-art performance while offering better interpretability and generalizability.

[28] Selective, Controlled and Domain-Agnostic Unlearning in Pretrained CLIP: A Training- and Data-Free Approach

Ashish Mishra, Gyanaranjan Nayak, Tarun Kumar, Arpit Shah, Suparna Bhattacharya, Martin Foltin

🧩 TL;DR

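This paper proposes a novel training- and data-free unlearning framework that uses a multimodal nullspace to selectively forget specific object classes in CLIP. It supports three paradigms: global unlearning, domain-specific knowledge removal, and complete unlearning in selected domains.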


📘 Detailed Summary

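Motivation: Pretrained models such as CLIP perform well at zero-shot classification across domains, but real-world applications often need specific object classes removed without additional data or retraining and without harming performance on unrelated tasks. Existing retraining-based methods have limitations, so a flexible and computationally efficient solution for controlled model forgetting is needed.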

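Method: The training- and data-free framework leverages a multimodal nullspace built by combining text prompts with synthesized visual prototypes derived from CLIP's joint embedding space. It supports three forgetting paradigms: global unlearning across all domains, domain-specific knowledge removal, and complete unlearning in selected domains.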

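Result: The method efficiently removes undesired class information while preserving the remaining knowledge, overcoming the limitations of existing retraining-based methods and providing a flexible, computationally efficient solution for controlled forgetting.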

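Conclusion: The study offers a new approach to controlled knowledge removal in pretrained models, achieving selective unlearning through a multimodal nullspace without additional data or retraining. The framework is of practical use for scenarios such as privacy protection, knowledge updating, and domain adaptation.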


📄 Abstract

Pretrained models like CLIP have demonstrated impressive zero-shot classification capabilities across diverse visual domains, spanning natural images, artistic renderings, and abstract representations. However, real-world applications often demand the removal (or "unlearning") of specific object classes without requiring additional data or retraining, or affecting the model's performance on unrelated tasks. In this paper, we propose a novel training- and data-free unlearning framework that enables three distinct forgetting paradigms: (1) global unlearning of selected objects across all domains, (2) domain-specific knowledge removal (e.g., eliminating sketch representations while preserving photo recognition), and (3) complete unlearning in selective domains. By leveraging a multimodal nullspace through synergistic integration of text prompts and synthesized visual prototypes derived from CLIP's joint embedding space, our method efficiently removes undesired class information while preserving the remaining knowledge. This approach overcomes the limitations of existing retraining-based methods and offers a flexible and computationally efficient solution for controlled model forgetting.

[29] Erasing CLIP Memories: Non-Destructive, Data-Free Zero-Shot class Unlearning in CLIP Models

Ashish Mishra, Tarun Kumar, Gyanaranjan Nayak, Arpit Shah, Suparna Bhattacharya, Martin Foltin

🧩 TL;DR

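This paper proposes a closed-form approach for selective unlearning in multimodal models. It uses nullspace projection to erase target class information from pretrained models such as CLIP without retraining or images from the forget set.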


📘 Detailed Summary

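Motivation: Traditional unlearning techniques rely on iterative fine-tuning and extensive data curation, which is computationally costly and imprecise. The study targets efficient, precise selective unlearning in multimodal models, particularly for model decontamination and privacy preservation.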

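Method: The method uses nullspace projection. It computes an orthonormal basis for the subspace spanned by the target text embeddings and projects these directions in the final projection layer, erasing the embedded target class information without any retraining or forget-set images.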

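Result: Experiments show a pronounced drop in zero-shot performance for the target classes while the model's overall multimodal knowledge is preserved. Even a partial projection can balance complete unlearning against retaining useful information.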

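Conclusion: The method is a computationally efficient and precise solution for selective unlearning in multimodal models. It addresses key challenges in model decontamination and privacy preservation and shows the potential of closed-form methods for model editing.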


📄 Abstract

We introduce a novel, closed-form approach for selective unlearning in multimodal models, specifically targeting pretrained models such as CLIP. Our method leverages nullspace projection to erase the target class information embedded in the final projection layer, without requiring any retraining or the use of images from the forget set. By computing an orthonormal basis for the subspace spanned by target text embeddings and projecting these directions, we dramatically reduce the alignment between image features and undesired classes. Unlike traditional unlearning techniques that rely on iterative fine-tuning and extensive data curation, our approach is both computationally efficient and surgically precise. This leads to a pronounced drop in zero-shot performance for the target classes while preserving the overall multimodal knowledge of the model. Our experiments demonstrate that even a partial projection can balance between complete unlearning and retaining useful information, addressing key challenges in model decontamination and privacy preservation.
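
The projection step can be illustrated with a small, dependency-free sketch. It assumes embeddings are plain lists of floats, builds the orthonormal basis with Gram-Schmidt, and removes the component of a feature vector that lies in the target subspace. The `alpha` parameter for a partial projection and the function names are assumptions for illustration; how the paper folds the projection into the final layer's weights is not reproduced here.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(vectors, eps=1e-10):
    """Gram-Schmidt: orthonormal basis of the span of the target text embeddings."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            coeff = dot(w, q)
            w = [a - coeff * b for a, b in zip(w, q)]
        norm = math.sqrt(dot(w, w))
        if norm > eps:  # drop directions already covered by the basis
            basis.append([a / norm for a in w])
    return basis

def project_out(x, basis, alpha=1.0):
    """Remove a fraction alpha of x's component in the target subspace."""
    out = list(x)
    for q in basis:
        coeff = dot(x, q)
        out = [o - alpha * coeff * b for o, b in zip(out, q)]
    return out

# Hypothetical 3-D embeddings for two target class prompts and one image feature.
targets = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
feature = [3.0, 4.0, 5.0]
basis = orthonormal_basis(targets)
print(project_out(feature, basis))             # full projection
print(project_out(feature, basis, alpha=0.5))  # partial projection
```

After a full projection the feature has zero inner product with every target embedding, so similarity-based zero-shot scoring can no longer favor the target classes, while components orthogonal to the target subspace are left unchanged.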

[30] FakeRadar: Probing Forgery Outliers to Detect Unknown Deepfake Videos

Zhaolun Li, Jichang Li, Yinqi Cai, Junye Chen, Xiaonan Luo, Guanbin Li, Rushi Lan

🧩 TL;DR

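This paper proposes FakeRadar, a novel deepfake video detection framework that addresses cross-domain generalization by proactively probing distributional gaps in the feature space and synthesizing outlier samples, improving detection of emerging forgery techniques.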


📘 Detailed Summary

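Motivation: Existing deepfake detection methods typically rely on manipulation-specific cues. They perform well on known forgery types but generalize poorly to emerging manipulation techniques because they cannot adapt to unseen forgery patterns, so performance drops in cross-domain scenarios.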

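Method: FakeRadar uses large-scale pretrained models to proactively probe the feature space. Forgery Outlier Probing applies dynamic subcluster modeling and cluster-conditional outlier generation to synthesize outlier samples that simulate novel forgery artifacts. Outlier-Guided Tri-Training then optimizes the detector to distinguish real, fake, and outlier samples using outlier-driven contrastive learning and an outlier-conditioned cross-entropy loss.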

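Result: Experiments show that FakeRadar outperforms existing methods on multiple deepfake video detection benchmarks, particularly in cross-domain evaluations, and handles a variety of emerging manipulation techniques.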

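Conclusion: Proactively probing feature distribution gaps and synthesizing outlier samples is an effective way to address cross-domain generalization in deepfake detection. The work informs the design of future detectors for emerging forgery techniques and underlines the importance of adapting to unknown forgery patterns.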


📄 Abstract

In this paper, we propose FakeRadar, a novel deepfake video detection framework designed to address the challenges of cross-domain generalization in real-world scenarios. Existing detection methods typically rely on manipulation-specific cues, performing well on known forgery types but exhibiting severe limitations against emerging manipulation techniques. This poor generalization stems from their inability to adapt effectively to unseen forgery patterns. To overcome this, we leverage large-scale pretrained models (e.g. CLIP) to proactively probe the feature space, explicitly highlighting distributional gaps between real videos, known forgeries, and unseen manipulations. Specifically, FakeRadar introduces Forgery Outlier Probing, which employs dynamic subcluster modeling and cluster-conditional outlier generation to synthesize outlier samples near boundaries of estimated subclusters, simulating novel forgery artifacts beyond known manipulation types. Additionally, we design Outlier-Guided Tri-Training, which optimizes the detector to distinguish real, fake, and outlier samples using proposed outlier-driven contrastive learning and outlier-conditioned cross-entropy losses. Experiments show that FakeRadar outperforms existing methods across various benchmark datasets for deepfake video detection, particularly in cross-domain evaluations, by handling the variety of emerging manipulation techniques.

[31] Improving Semantic Uncertainty Quantification in LVLMs with Semantic Gaussian Processes

Joseph Hoche, Andrei Bursuc, David Brellmann, Gilles Louppe, Pavel Izmailov, Angela Yao, Gianni Franchi

🧩 TL;DR

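This paper proposes Semantic Gaussian Process Uncertainty (SGPU), a Bayesian framework that quantifies semantic uncertainty in large vision-language models by analyzing the geometric structure of answer embeddings. It avoids the fragility of clustering-based methods and achieves state-of-the-art calibration on multiple benchmarks.


📘 Detailed Summary

Motivation: Large vision-language models (LVLMs) often produce plausible but unreliable outputs, so robust uncertainty estimation is needed. Existing semantic uncertainty methods rely on external models to cluster multiple sampled responses and measure their semantic consistency. These clustering methods are often fragile and highly sensitive to minor phrasing variations; they can wrongly group or separate semantically similar answers, which makes the uncertainty estimates unreliable.

Method: SGPU is a Bayesian framework that quantifies semantic uncertainty by analyzing the geometric structure of answer embeddings, avoiding brittle clustering. It maps generated answers into a dense semantic space, computes the Gram matrix of their embeddings, and summarizes their semantic configuration via the eigenspectrum. This spectral representation is fed into a Gaussian Process Classifier that learns to map patterns of semantic consistency to predictive uncertainty; it can be applied in both black-box and white-box settings.

Result: Across six LLMs and LVLMs on eight datasets (covering VQA, image classification, and textual QA), SGPU consistently achieves state-of-the-art calibration (ECE) and discriminative (AUROC, AUARC) performance. Experiments further show that SGPU transfers across models and modalities, indicating that its spectral representation captures general patterns of semantic uncertainty.

Conclusion: By analyzing the geometry of answer embeddings instead of relying on fragile clustering, SGPU provides more reliable semantic uncertainty estimates for large vision-language models. Its spectral representation captures uncertainty patterns that generalize across models and modalities, offering a Bayesian perspective on uncertainty estimation with potential for practical deployment.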


📄 Abstract

Large Vision-Language Models (LVLMs) often produce plausible but unreliable outputs, making robust uncertainty estimation essential. Recent work on semantic uncertainty estimates relies on external models to cluster multiple sampled responses and measure their semantic consistency. However, these clustering methods are often fragile, highly sensitive to minor phrasing variations, and can incorrectly group or separate semantically similar answers, leading to unreliable uncertainty estimates. We propose Semantic Gaussian Process Uncertainty (SGPU), a Bayesian framework that quantifies semantic uncertainty by analyzing the geometric structure of answer embeddings, avoiding brittle clustering. SGPU maps generated answers into a dense semantic space, computes the Gram matrix of their embeddings, and summarizes their semantic configuration via the eigenspectrum. This spectral representation is then fed into a Gaussian Process Classifier that learns to map patterns of semantic consistency to predictive uncertainty, and that can be applied in both black-box and white-box settings. Across six LLMs and LVLMs on eight datasets spanning VQA, image classification, and textual QA, SGPU consistently achieves state-of-the-art calibration (ECE) and discriminative (AUROC, AUARC) performance. We further show that SGPU transfers across models and modalities, indicating that its spectral representation captures general patterns of semantic uncertainty.
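The Gram-matrix and eigenspectrum step described in the abstract can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function names, the Jacobi eigenvalue solver, and the normalization of the spectrum are all assumptions, and the Gaussian Process Classifier that consumes these features is omitted.

```python
import math

def gram_matrix(embeddings):
    """Pairwise inner products between answer embeddings."""
    return [[sum(a * b for a, b in zip(u, v)) for v in embeddings]
            for u in embeddings]

def symmetric_eigenvalues(matrix, sweeps=50, tol=1e-12):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations,
    returned in descending order."""
    n = len(matrix)
    a = [row[:] for row in matrix]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                # Rotation angle that zeroes the (p, q) entry.
                theta = 0.5 * math.atan2(2 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)

def spectral_features(embeddings):
    """Normalized eigenspectrum of the Gram matrix (hypothetical feature)."""
    eig = [max(x, 0.0) for x in symmetric_eigenvalues(gram_matrix(embeddings))]
    total = sum(eig) or 1.0
    return [x / total for x in eig]
```

Under these assumptions, identical answer embeddings concentrate the spectrum on one eigenvalue, while mutually orthogonal embeddings spread it evenly, which is the kind of consistency pattern a downstream classifier could learn from.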

[32] DRAW2ACT: Turning Depth-Encoded Trajectories into Robotic Demonstration Videos

Yang Bai, Liudi Yang, George Eskandar, Fengyi Shen, Mohammad Altillawi, Ziyuan Liu, Gitta Kutyniok

🧩 TL;DR

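This paper proposes DRAW2ACT, a depth-aware trajectory-conditioned video generation framework. It extracts multiple orthogonal representations from the input trajectory, injects them into a diffusion model, and jointly generates spatially aligned RGB and depth videos, improving the controllability and consistency of robotic manipulation.


📘 Detailed Summary

Motivation: Video diffusion models provide powerful real-world simulators for embodied AI but remain limited in controllability for robotic manipulation. Existing trajectory-conditioned video generation methods usually rely on 2D trajectories or single-modality conditioning, which restricts their ability to produce controllable and consistent robotic demonstrations.

Method: DRAW2ACT extracts multiple orthogonal representations from the input trajectory, capturing depth, semantics, shape, and motion, and injects them into the diffusion model. The framework jointly generates spatially aligned RGB and depth videos, using cross-modality attention and depth supervision to enhance spatio-temporal consistency, and introduces a multimodal policy model conditioned on the generated RGB and depth sequences to regress the robot's joint angles.

Result: On Bridge V2, Berkeley Autolab, and simulation benchmarks, DRAW2ACT achieves better visual fidelity and consistency than existing baselines, along with higher manipulation success rates.

Conclusion: The study shows that depth-aware trajectory-conditioned video generation improves the controllability of robotic manipulation. Multimodal representations and joint generation address the limitations of existing methods and give embodied AI a more reliable simulation environment.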


📄 Abstract

Video diffusion models provide powerful real-world simulators for embodied AI but remain limited in controllability for robotic manipulation. Recent works on trajectory-conditioned video generation address this gap but often rely on 2D trajectories or single modality conditioning, which restricts their ability to produce controllable and consistent robotic demonstrations. We present DRAW2ACT, a depth-aware trajectory-conditioned video generation framework that extracts multiple orthogonal representations from the input trajectory, capturing depth, semantics, shape and motion, and injects them into the diffusion model. Moreover, we propose to jointly generate spatially aligned RGB and depth videos, leveraging cross-modality attention mechanisms and depth supervision to enhance the spatio-temporal consistency. Finally, we introduce a multimodal policy model conditioned on the generated RGB and depth sequences to regress the robot's joint angles. Experiments on Bridge V2, Berkeley Autolab, and simulation benchmarks show that DRAW2ACT achieves superior visual fidelity and consistency while yielding higher manipulation success rates compared to existing baselines.

[33] OmniGen: Unified Multimodal Sensor Generation for Autonomous Driving

Tao Tang, Enhui Ma, xia zhou, Letian Wang, Tianyi Yan, Xueyang Zhang, Kun Zhan, Peng Jia, XianPeng Lang, Jia-Wang Bian, Kaicheng Yu, Xiaodan Liang

🧩 TL;DR

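This paper proposes OmniGen, a unified framework for generating aligned multimodal sensor data. Using a shared BEV space and a new multimodal reconstruction method, UAE, it achieves joint decoding and controllable generation of LiDAR and multi-view camera data.


📘 Detailed Summary

Motivation: Despite remarkable progress in autonomous driving, acquiring diverse and corner-case data remains costly and inefficient. Existing generative methods focus mainly on single-modality generation, which leads to inefficiency and misalignment in multimodal sensor data; a unified solution that generates aligned multimodal sensor data is needed.

Method: OmniGen uses a shared Bird's Eye View space to unify multimodal features and designs a novel generalizable multimodal reconstruction method, UAE, which jointly decodes LiDAR and multi-view camera data through volume rendering. It also incorporates a Diffusion Transformer with a ControlNet branch to enable controllable multimodal sensor generation.

Result: Comprehensive experiments show that OmniGen achieves the desired performance in unified multimodal sensor data generation, with multimodal consistency and flexible sensor adjustments, and reconstructs multimodal sensor data accurately and flexibly.

Conclusion: The study offers an effective generative solution to the challenge of data collection for autonomous driving. Its unified framework generates aligned multimodal sensor data and opens a path toward controllable, consistent multimodal data synthesis with practical value.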


📄 Abstract

Autonomous driving has seen remarkable advancements, largely driven by extensive real-world data collection. However, acquiring diverse and corner-case data remains costly and inefficient. Generative models have emerged as a promising solution by synthesizing realistic sensor data. Yet existing approaches primarily focus on single-modality generation, leading to inefficiencies and misalignment in multimodal sensor data. To address these challenges, we propose OmniGen, which generates aligned multimodal sensor data in a unified framework. Our approach leverages a shared Bird's Eye View (BEV) space to unify multimodal features and designs a novel generalizable multimodal reconstruction method, UAE, to jointly decode LiDAR and multi-view camera data. UAE achieves multimodal sensor decoding through volume rendering, enabling accurate and flexible reconstruction. Furthermore, we incorporate a Diffusion Transformer (DiT) with a ControlNet branch to enable controllable multimodal sensor generation. Our comprehensive experiments demonstrate that OmniGen achieves the desired performance in unified multimodal sensor data generation, with multimodal consistency and flexible sensor adjustments.

[34] ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body

Juze Zhang, Changan Chen, Xin Chen, Heng Yu, Tiange Xiang, Ali Sartaz Khan, Shrinidhi K. Lakshmikanth, Ehsan Adeli

🧩 TL;DR

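This paper proposes ViBES, a speech-language-behavior model for conversational 3D agents. With a mixture-of-modality-experts architecture, it jointly plans language and body movement, going beyond conventional speech-conditioned motion generation to enable controllable social interaction.


📘 Detailed Summary

Motivation: Existing systems model human behavior as a translation task that maps a fixed utterance to motion clips, with no agentic decision-making about when to move, what to do, or how to adapt across multi-turn dialogue. This leads to brittle timing, weak social grounding, and fragmented stacks in which speech, text, and motion are trained or inferred in isolation.

Method: ViBES uses a mixture-of-modality-experts architecture with modality-partitioned Transformer experts for speech, facial expression, and body motion. It processes interleaved multimodal token streams with hard routing by modality while sharing information through cross-expert attention, and it leverages strong pretrained speech-language models to support mixed-initiative interaction.

Result: On multi-turn conversation benchmarks, ViBES shows consistent gains over strong baselines in dialogue-motion alignment and behavior quality. It supports mixed-initiative interaction in which users speak, type, or issue body-action directives, and it exposes controllable behavior hooks for streaming responses.

Conclusion: The work moves beyond the "speech-conditioned motion generation" paradigm toward jointly generating language, prosody, and movement. It provides virtual-body agents for controllable, socially competent 3D interaction and pushes multimodal dialogue systems toward more natural and autonomous interaction.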


📄 Abstract

Human communication is inherently multimodal and social: words, prosody, and body language jointly carry intent. Yet most prior systems model human behavior as a translation task (co-speech gesture or text-to-motion) that maps a fixed utterance to motion clips, without requiring agentic decision-making about when to move, what to do, or how to adapt across multi-turn dialogue. This leads to brittle timing, weak social grounding, and fragmented stacks where speech, text, and motion are trained or inferred in isolation. We introduce ViBES (Voice in Behavioral Expression and Synchrony), a conversational 3D agent that jointly plans language and movement and executes dialogue-conditioned body actions. Concretely, ViBES is a speech-language-behavior (SLB) model with a mixture-of-modality-experts (MoME) backbone: modality-partitioned transformer experts for speech, facial expression, and body motion. The model processes interleaved multimodal token streams with hard routing by modality (parameters are split per expert), while sharing information through cross-expert attention. By leveraging strong pretrained speech-language models, the agent supports mixed-initiative interaction: users can speak, type, or issue body-action directives mid-conversation, and the system exposes controllable behavior hooks for streaming responses. We further benchmark on multi-turn conversation with automatic metrics of dialogue-motion alignment and behavior quality, and observe consistent gains over strong co-speech and text-to-motion baselines. ViBES goes beyond "speech-conditioned motion generation" toward agentic virtual bodies where language, prosody, and movement are jointly generated, enabling controllable, socially competent 3D interaction. Code and data will be made available at: ai.stanford.edu/~juze/ViBES/
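Hard routing by modality, as mentioned in the abstract, can be illustrated with a toy dispatcher. This is a minimal sketch under stated assumptions: `route_by_modality` and the callable "experts" are hypothetical stand-ins for per-modality transformer parameters, and cross-expert attention is not modeled.

```python
from collections import defaultdict

def route_by_modality(tokens, experts):
    """Dispatch each token to the single expert that owns its modality.

    tokens  -- list of (modality, value) pairs in interleaved stream order
    experts -- dict mapping modality name to a callable (hypothetical
               stand-in for that modality's parameters)
    Returns the processed tokens in the original stream order.
    """
    groups = defaultdict(list)
    for index, (modality, value) in enumerate(tokens):
        groups[modality].append((index, value))

    output = [None] * len(tokens)
    for modality, items in groups.items():
        expert = experts[modality]  # hard routing: no soft gating weights
        for index, value in items:
            output[index] = (modality, expert(value))
    return output
```

The point of the sketch is that routing is decided by the token's modality tag alone, so each expert's parameters only ever see its own modality, while the interleaved order of the stream is preserved.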

[35] Enhancing Visual Programming for Visual Reasoning via Probabilistic Graphs

Wentao Wan, Kaiyu Wu, Qingyang Ma, Nan Kang, Yunjie Chen, Liang Lin, Keze Wang

🧩 TL;DR

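This paper proposes EVPG, which builds a probabilistic graph to recast the non-differentiable execution of visual programs as differentiable exact probabilistic inference. This enables end-to-end gradient-based optimization of the visual programming framework and significantly improves performance on complex visual reasoning tasks.


📘 Detailed Summary

Motivation: Existing visual programming methods focus mainly on improving the quality of LLM-generated visual programs and neglect optimizing the pre-trained models that visual programming invokes as modules for visual sub-tasks. The main difficulty is that only final-task labels are available, not sub-task labels, and the non-differentiable nature of visual programming prevents gradient-based end-to-end optimization from learning directly from the final labels.

Method: EVPG builds a directed probabilistic graph from the variable dependency relationships and recasts the non-differentiable visual program execution as a differentiable exact probabilistic inference process on that graph. This lets the visual programming framework use final-task labels for efficient gradient-based end-to-end supervised learning.

Result: Extensive experiments on three classical complex visual reasoning tasks (GQA, NLVRv2, and Open Images) validate the effectiveness and advantages of EVPG, with significant performance improvements on all three.

Conclusion: The probabilistic-graph approach resolves the non-differentiable optimization problem of visual programming frameworks and offers a new route to end-to-end learning for visual programming. It improves existing visual programming systems and opens a technical path for future work combining symbolic reasoning with neural network learning.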


📄 Abstract

Recently, Visual Programming (VP) based on large language models (LLMs) has rapidly developed and demonstrated significant potential in complex Visual Reasoning (VR) tasks. Previous works to enhance VP have primarily focused on improving the quality of LLM-generated visual programs. However, they have neglected to optimize the VP-invoked pre-trained models, which serve as modules for the visual sub-tasks decomposed from the targeted tasks by VP. The difficulty is that there are only final labels of targeted VR tasks rather than labels of sub-tasks. Besides, the non-differentiable nature of VP impedes the direct use of efficient gradient-based optimization methods to leverage final labels for end-to-end learning of the entire VP framework. To overcome these issues, we propose EVPG, a method to Enhance Visual Programming for visual reasoning via Probabilistic Graphs. Specifically, we creatively build a directed probabilistic graph according to the variable dependency relationships during the VP executing process, which reconstructs the non-differentiable VP executing process into a differentiable exact probability inference process on this directed probabilistic graph. As a result, this enables the VP framework to utilize the final labels for efficient, gradient-based optimization in end-to-end supervised learning on targeted VR tasks. Extensive and comprehensive experiments demonstrate the effectiveness and advantages of our EVPG, showing significant performance improvements for VP on three classical complex VR tasks: GQA, NLVRv2, and Open Images.
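To make the idea of replacing discrete program execution with probabilistic inference more concrete, here is a toy sketch. It is not the EVPG algorithm: the three-module program, the independence assumption between module outputs, and all function names are hypothetical, and gradients are approximated by finite differences instead of automatic differentiation.

```python
import math

def p_and(p, q):
    """Probability that two independent Boolean variables are both true."""
    return p * q

def p_or(p, q):
    """Probability that at least one of two independent variables is true."""
    return 1.0 - (1.0 - p) * (1.0 - q)

def program_prob(params):
    """Toy program: answer = cat AND (dog OR bird), with module confidences
    params = [p_cat, p_dog, p_bird] (hypothetical sub-task outputs)."""
    p_cat, p_dog, p_bird = params
    return p_and(p_cat, p_or(p_dog, p_bird))

def nll(params, label, eps=1e-12):
    """Negative log-likelihood of the final label only (no sub-task labels)."""
    prob = program_prob(params)
    return -math.log(prob + eps) if label else -math.log(1.0 - prob + eps)

def numeric_grad(params, label, h=1e-6):
    """Central finite-difference gradient of the loss w.r.t. each module."""
    grads = []
    for i in range(len(params)):
        up, down = params[:], params[:]
        up[i] += h
        down[i] -= h
        grads.append((nll(up, label) - nll(down, label)) / (2.0 * h))
    return grads
```

In this toy setting the final-label loss yields a nonzero gradient for every module confidence, which is the property that makes end-to-end training from final labels possible once execution is expressed as inference over probabilities.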

[36] Zoom-Zero: Reinforced Coarse-to-Fine Video Understanding via Temporal Zoom-in

Xiaoqian Shen, Min-Hung Chen, Yu-Chiang Frank Wang, Mohamed Elhoseiny, Ryo Hachiuma

🧩 TL;DR

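This paper proposes Zoom-Zero, a coarse-to-fine temporal grounding framework that improves temporal awareness in video question answering. It first localizes query-relevant segments and then zooms into the most salient frames for fine-grained visual verification, significantly improving temporal grounding accuracy and answer quality.


📘 Detailed Summary

Motivation: Existing methods based on Group Relative Policy Optimization (GRPO) still suffer from inaccurate temporal grounding and hallucinations in video question answering. Large video-language models (LVLMs) have limited temporal awareness and struggle to ground answers faithfully in the relevant video evidence, leading to temporal mislocalization and hallucinated content.

Method: Zoom-Zero is a coarse-to-fine framework with two key innovations. First, a zoom-in accuracy reward validates the fidelity of the temporal grounding prediction and encourages fine-grained visual verification on the grounded frames. Second, token-selective credit assignment attributes rewards to the tokens responsible for temporal localization or answer generation, mitigating GRPO's difficulty with multi-faceted reward signals.

Result: The method improves temporal grounding by 5.2% on NExT-GQA and 4.6% on ReXTime, and raises average answer accuracy by 2.4%. The coarse-to-fine zoom-in at inference yields an average improvement of 6.4% on long-video benchmarks, preserving critical visual details without compromising global context.

Conclusion: Through its reward design and credit assignment mechanism, Zoom-Zero addresses the temporal grounding challenge in video question answering. The coarse-to-fine zoom-in strategy improves temporal grounding and also benefits long-form video understanding, pointing to a new direction for improving the temporal awareness of video-language models.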


📄 Abstract

Grounded video question answering (GVQA) aims to localize relevant temporal segments in videos and generate accurate answers to a given question; however, large video-language models (LVLMs) exhibit limited temporal awareness. Although existing approaches based on Group Relative Policy Optimization (GRPO) attempt to improve temporal grounding, they still struggle to faithfully ground their answers in the relevant video evidence, leading to temporal mislocalization and hallucinations. In this work, we present Zoom-Zero, a coarse-to-fine framework that first localizes query-relevant segments and then temporally zooms into the most salient frames for finer-grained visual verification. Our method addresses the limits of GRPO for the GVQA task with two key innovations: (i) a zoom-in accuracy reward that validates the fidelity of temporal grounding prediction and facilitates fine-grained visual verification on grounded frames; (ii) token-selective credit assignment, which attributes rewards to the tokens responsible for temporal localization or answer generation, mitigating GRPO's issue in handling multi-faceted reward signals. Our proposed method advances grounded video question answering, improving temporal grounding by 5.2% on NExT-GQA and 4.6% on ReXTime, while also enhancing average answer accuracy by 2.4%. Additionally, the coarse-to-fine zoom-in during inference further benefits long-form video understanding by preserving critical visual details without compromising global context, yielding an average improvement of 6.4% on long-video benchmarks.
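Token-selective credit assignment can be sketched roughly as below. This is an assumption-laden illustration, not the paper's formulation: the group-relative normalization follows the usual GRPO recipe, while the role tags (`"ground"`, `"answer"`, `"other"`) and the choice to give untagged tokens zero advantage are hypothetical.

```python
from statistics import mean, pstdev

def group_relative(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each reward within its sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def token_advantages(roles, adv_ground, adv_answer):
    """Assign each token the advantage of the reward it is responsible for.

    roles -- per-token tags: "ground" for temporal-localization tokens,
             "answer" for answer tokens, anything else gets 0.0 (assumption)
    """
    table = {"ground": adv_ground, "answer": adv_answer}
    return [table.get(role, 0.0) for role in roles]
```

For example, a rollout with good grounding but a wrong answer would, under this scheme, reinforce its grounding tokens and penalize only its answer tokens, instead of spreading one blended scalar over the whole sequence.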

[37] TUN: Detecting Significant Points in Persistence Diagrams with Deep Learning

Yu Chen, Hongwei Lin

🧩 TL;DR

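This paper proposes TUN (Topology Understanding Net), a multi-modal network that automatically detects significant points in one-dimensional persistence diagrams. By combining enhanced persistence diagram descriptors, self-attention, and a point cloud encoder, it offers a reliable automated solution for practical applications of topological data analysis.


📘 Detailed Summary

Motivation: Persistence diagrams (PDs) capture the topological structure of point clouds effectively, but it is hard to determine automatically which points in a diagram represent genuine signal and which are noise. This hinders the practical adoption of topological data analysis, especially where automated and reliable interpretation of persistence diagrams is needed for downstream decision-making.

Method: Topology Understanding Net (TUN) is a multi-modal network that combines enhanced PD descriptors, self-attention, a PointNet-style point cloud encoder, learned fusion, and a per-point classifier, together with stable preprocessing and imbalance-aware training.

Result: Experiments show that TUN outperforms classic methods in detecting significant points in persistence diagrams, illustrating its effectiveness in real-world applications and providing reliable support for automated interpretation in topological data analysis.

Conclusion: The study provides an automated solution for practical topological data analysis. It uses deep learning to address a key challenge in interpreting persistence diagrams, improving the reliability and efficiency of topological feature identification and supplying more dependable topological information for downstream decision-making.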


📄 Abstract

Persistence diagrams (PDs) provide a powerful tool for understanding the topology of the underlying shape of a point cloud. However, identifying which points in PDs encode genuine signals remains challenging. This challenge directly hinders the practical adoption of topological data analysis in many applications, where automated and reliable interpretation of persistence diagrams is essential for downstream decision-making. In this paper, we study automatic significance detection for one-dimensional persistence diagrams. Specifically, we propose Topology Understanding Net (TUN), a multi-modal network that combines enhanced PD descriptors with self-attention, a PointNet-style point cloud encoder, learned fusion, and per-point classification, alongside stable preprocessing and imbalance-aware training. It provides an automated and effective solution for identifying significant points in PDs, which are critical for downstream applications. Experiments show that TUN outperforms classic methods in detecting significant points in PDs, illustrating its effectiveness in real-world applications.
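For context, significance in a persistence diagram is often judged from the persistence of each point, that is, death minus birth. The sketch below shows one simple largest-gap heuristic of the kind a learned detector might be compared against; the abstract does not name its "classic methods", so this particular rule and the function names are assumptions.

```python
def persistence(diagram):
    """Persistence (death - birth) of each (birth, death) point."""
    return [death - birth for birth, death in diagram]

def largest_gap_split(diagram):
    """Flag points whose persistence lies above the largest gap in the
    sorted persistence values (a simple heuristic, not TUN itself)."""
    pers = persistence(diagram)
    if len(pers) < 2:
        return [False] * len(pers)
    ordered = sorted(pers, reverse=True)
    gaps = [ordered[i] - ordered[i + 1] for i in range(len(ordered) - 1)]
    cut = gaps.index(max(gaps))
    threshold = (ordered[cut] + ordered[cut + 1]) / 2.0
    return [p > threshold for p in pers]
```

A heuristic like this depends only on persistence values and ignores the underlying point cloud, which is the kind of limitation that motivates combining PD descriptors with a point cloud encoder.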

[38] Vector Prism: Animating Vector Graphics by Stratifying Semantic Structure

Jooyeol Yun, Jaegul Choo

🧩 TL;DR

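This paper proposes a framework that strengthens SVG animation generation by recovering semantic structure. It statistically aggregates multiple weak part predictions to infer stable semantic groupings of an SVG, substantially improving the coherence of animations produced by vision-language models.


📘 Detailed Summary

Motivation: Current vision-language models struggle to automate SVG animation because visually coherent parts of a vector graphic are often fragmented into low-level shapes, which give little guidance on which elements should move together. As a result, models handle SVG animation tasks poorly.

Method: The framework recovers the semantic structure of an SVG by statistically aggregating multiple weak part predictions, which lets it infer semantics stably from noisy predictions. Reorganizing the SVG into semantic groups gives vision-language models a more reliable basis for animation generation.

Result: Experiments show substantial gains over existing approaches and more coherent SVG animations, suggesting that semantic recovery is the key step that unlocks robust SVG animation.

Conclusion: The study reveals a semantic layer that current VLM systems overlook. Recovering semantic structure enables more reliable SVG animation and supports more interpretable interaction between vision-language models and vector graphics, pointing to a new direction for automated animation in dynamic web design.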


📄 Abstract

Scalable Vector Graphics (SVG) are central to modern web design, and the demand to animate them continues to grow as web environments become increasingly dynamic. Yet automating the animation of vector graphics remains challenging for vision-language models (VLMs) despite recent progress in code generation and motion planning. VLMs routinely mis-handle SVGs, since visually coherent parts are often fragmented into low-level shapes that offer little guidance of which elements should move together. In this paper, we introduce a framework that recovers the semantic structure required for reliable SVG animation and reveals the missing layer that current VLM systems overlook. This is achieved through a statistical aggregation of multiple weak part predictions, allowing the system to stably infer semantics from noisy predictions. By reorganizing SVGs into semantic groups, our approach enables VLMs to produce animations with far greater coherence. Our experiments demonstrate substantial gains over existing approaches, suggesting that semantic recovery is the key step that unlocks robust SVG animation and supports more interpretable interactions between VLMs and vector graphics.
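The aggregation of weak part predictions can be illustrated with plain majority voting. This is a deliberately simple stand-in: the abstract only says "statistical aggregation", so the voting rule, the data layout, and the function names here are assumptions and the paper's actual estimator may differ.

```python
from collections import Counter

def aggregate_labels(runs):
    """Majority vote over several noisy labelings.

    runs -- list of dicts mapping an SVG shape id to a predicted part label,
            one dict per weak prediction pass (hypothetical format)
    """
    votes = {}
    for run in runs:
        for shape_id, label in run.items():
            votes.setdefault(shape_id, Counter())[label] += 1
    return {sid: counts.most_common(1)[0][0] for sid, counts in votes.items()}

def group_by_label(labels):
    """Collect shape ids that share a semantic label, so they can be
    animated together as one group."""
    groups = {}
    for shape_id, label in labels.items():
        groups.setdefault(label, []).append(shape_id)
    return {label: sorted(ids) for label, ids in groups.items()}
```

Even this crude rule shows the intended effect: a single noisy pass may mislabel a shape, but the aggregated labeling is more stable, and the resulting groups tell an animation model which low-level shapes should move together.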

[39] A4-Agent: An Agentic Framework for Zero-Shot Affordance Reasoning

Zixin Zhang, Kanghao Chen, Hanqing Wang, Hongfei Zhang, Harold Haodong Chen, Chenfei Liao, Litao Guo, Ying-Cong Chen

🧩 TL;DR

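This paper proposes A4-Agent, a training-free agentic framework that decouples high-level reasoning from low-level grounding and exploits the complementary strengths of pre-trained foundation models for zero-shot affordance prediction, significantly outperforming existing supervised methods.


📘 Detailed Summary

Motivation: Existing end-to-end models couple high-level reasoning and low-level grounding in a single pipeline and depend on training over annotated datasets, which leads to poor generalization on novel objects and unseen environments. This work aims to move beyond that paradigm.

Method: A4-Agent decouples affordance prediction into a three-stage pipeline: a Dreamer uses generative models to visualize how an interaction would look, a Thinker uses large vision-language models to decide which object part to interact with, and a Spotter orchestrates vision foundation models to precisely locate the interaction area. No task-specific fine-tuning is required.

Result: The zero-shot framework significantly outperforms state-of-the-art supervised methods on multiple benchmarks and generalizes robustly to real-world settings, supporting the effectiveness of composing pre-trained foundation models.

Conclusion: The study shows that decoupling the reasoning process and coordinating specialized foundation models can deliver strong affordance prediction without any training, offering a new zero-shot paradigm for embodied AI systems.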


📄 Abstract

Affordance prediction, which identifies interaction regions on objects based on language instructions, is critical for embodied AI. Prevailing end-to-end models couple high-level reasoning and low-level grounding into a single monolithic pipeline and rely on training over annotated datasets, which leads to poor generalization on novel objects and unseen environments. In this paper, we move beyond this paradigm by proposing A4-Agent, a training-free agentic framework that decouples affordance prediction into a three-stage pipeline. Our framework coordinates specialized foundation models at test time: (1) a Dreamer that employs generative models to visualize how an interaction would look; (2) a Thinker that utilizes large vision-language models to decide what object part to interact with; and (3) a Spotter that orchestrates vision foundation models to precisely locate where the interaction area is. By leveraging the complementary strengths of pre-trained models without any task-specific fine-tuning, our zero-shot framework significantly outperforms state-of-the-art supervised methods across multiple benchmarks and demonstrates robust generalization to real-world settings.

[40] SuperCLIP: CLIP with Simple Classification Supervision

Weiheng Zhao, Zilong Huang, Jiashi Feng, Xinggang Wang

🧩 TL;DR

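This paper proposes SuperCLIP, a framework that adds classification supervision to contrastive learning to strengthen CLIP's fine-grained semantic alignment. It only requires a lightweight linear layer on the vision encoder and significantly improves zero-shot classification, image-text retrieval, and related tasks.


📘 Detailed Summary

Motivation: CLIP generalizes well through image-text contrastive learning, but its training objective optimizes only global similarity and overlooks fine-grained semantic signals, a problem that becomes more pronounced with long, detailed captions. This limits fine-grained visual-text alignment and leaves token-level supervision in the text underused.

Method: SuperCLIP augments contrastive learning with classification-based supervision. It adds a single lightweight linear layer to the vision encoder and uses token-level semantic cues to enhance visual-text alignment. Total FLOPs increase by only 0.077%, no additional annotated data is needed, and the classification supervision avoids the reliance on large batch sizes typical of contrastive learning.

Result: Experiments show that SuperCLIP consistently improves zero-shot classification, image-text retrieval, and purely visual tasks. The gains hold whether the model is trained on original web data or rich re-captioned data, showing that SuperCLIP recovers textual supervision in both cases. The method also alleviates CLIP's performance drop under small-batch training, illustrating the stability benefit of classification supervision.

Conclusion: With a simple classification supervision mechanism, SuperCLIP addresses CLIP's weak fine-grained semantic alignment and demonstrates the value of token-level supervision for vision-language models. It offers a new way to improve contrastive learning frameworks and is practically useful for small-batch training in resource-constrained settings. Code and models will be open-sourced.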


📄 Abstract

Contrastive Language-Image Pretraining (CLIP) achieves strong generalization in vision-language tasks by aligning images and texts in a shared embedding space. However, recent findings show that CLIP-like models still underutilize fine-grained semantic signals in text, and this issue becomes even more pronounced when dealing with long and detailed captions. This stems from CLIP's training objective, which optimizes only global image-text similarity and overlooks token-level supervision - limiting its ability to achieve fine-grained visual-text alignment. To address this, we propose SuperCLIP, a simple yet effective framework that augments contrastive learning with classification-based supervision. By adding only a lightweight linear layer to the vision encoder, SuperCLIP leverages token-level cues to enhance visual-textual alignment - with just a 0.077% increase in total FLOPs, and no need for additional annotated data. Experiments show that SuperCLIP consistently improves zero-shot classification, image-text retrieval, and purely visual tasks. These gains hold regardless of whether the model is trained on original web data or rich re-captioned data, demonstrating SuperCLIP's ability to recover textual supervision in both cases. Furthermore, SuperCLIP alleviates CLIP's small-batch performance drop through classification-based supervision that avoids reliance on large batch sizes. Code and models will be made open source.
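One plausible way to read "classification-based supervision with a lightweight linear layer" is sketched below. Everything specific here is an assumption: the abstract does not state the target construction or loss, so the bag-of-tokens target, the soft cross-entropy, and all function names are hypothetical illustrations, not the SuperCLIP objective.

```python
import math

def linear(features, weights, bias):
    """Single linear layer mapping an image feature to vocabulary logits."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]

def bag_of_tokens_target(token_ids, vocab_size):
    """Normalized count vector over the caption's token ids (assumed target)."""
    target = [0.0] * vocab_size
    for token_id in token_ids:
        target[token_id] += 1.0
    total = sum(target)
    return [value / total for value in target]

def soft_cross_entropy(logits, target):
    """Cross-entropy between a soft target and softmax(logits),
    computed with the log-sum-exp trick for numerical stability."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return -sum(t * (l - log_z) for t, l in zip(target, logits))
```

A loss of this shape is computed per image from its own caption, so unlike a contrastive term it does not need other samples in the batch as negatives, which is consistent with the abstract's remark about avoiding reliance on large batch sizes.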

[41] SignIT: A Comprehensive Dataset and Multimodal Analysis for Italian Sign Language Recognition

Alessia Micieli, Giovanni Maria Farinella, Francesco Ragusa

🧩 TL;DR

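This paper presents SignIT, a new dataset for Italian Sign Language recognition with 644 videos and 94 sign classes, together with a benchmark that evaluates how different modalities affect the performance of sign recognition models.


📘 Detailed Summary

Motivation: High-quality datasets dedicated to Italian Sign Language (LIS) recognition are lacking, which limits progress in the field. This work fills the gap by building a richly annotated LIS dataset that provides a standardized evaluation benchmark for sign recognition.

Method: The authors collected 644 videos totaling 3.33 hours and manually annotated 94 sign classes across 5 macro-categories (Animals, Food, Colors, Emotions, and Family). They also extracted 2D keypoints for the users' hands, face, and body, and built a benchmark with several state-of-the-art models to analyze how temporal information, 2D keypoints, and RGB frames affect model performance.

Result: Experiments show that existing models have clear limitations on this challenging LIS dataset. The benchmark reveals how each modality (temporal information, keypoints, RGB frames) influences recognition performance, providing a reference for later model improvements.

Conclusion: The release of SignIT provides an important resource for Italian Sign Language recognition research and exposes the shortcomings of current models on complex sign data. The study highlights the importance of multimodal information in sign recognition and points toward more robust sign recognition systems.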


📄 Abstract

In this work, we present SignIT, a new dataset to study the task of Italian Sign Language (LIS) recognition. The dataset is composed of 644 videos covering 3.33 hours. We manually annotated the videos using a taxonomy of 94 distinct sign classes belonging to 5 macro-categories: Animals, Food, Colors, Emotions and Family. We also extracted 2D keypoints related to the hands, face and body of the users. With the dataset, we propose a benchmark for the sign recognition task, adopting several state-of-the-art models and showing how temporal information, 2D keypoints and RGB frames can influence the performance of these models. Results show the limitations of these models on this challenging LIS dataset. We release data and annotations at the following link: https://fpv-iplab.github.io/SignIT/.

[42] Native Intelligence Emerges from Large-Scale Clinical Practice: A Retinal Foundation Model with Deployment Efficiency

Jia Guo, Jiawei Du, Shengzhu Yang, Shuai Lu, Wenquan Cheng, Kaiwen Zhang, Yihua Sun, Chuhong Yang, Weihang Zhang, Fang Chen, Yilan Wu, Lie Ju, Guochen Ning, Longfei Ma, Huiping Yao, Jinyuan Wang, Peilun Shi, Yukun Zhou, Jie Xu, Pearse A. Keane, Hanruo Liu, Hongen Liao, Ningli Wang, Huiqi Li

🧩 TL;DR

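This study presents ReVision, a retinal foundation model that learns clinical image interpretation from a large-scale telemedicine program. By extracting clinical native intelligence directly from real-world clinical archives, it enables efficient deployment in low-resource settings.


📘 Detailed Summary

Motivation: Current retinal foundation models are constrained by curated research datasets that lack authentic clinical context, and they require extensive task-specific optimization for each application, which limits deployment efficiency in low-resource settings. The study aims to overcome these barriers by building clinical native intelligence directly from real-world medical practice.

Method: The key insight behind ReVision is that large-scale telemedicine programs are a natural reservoir for learning clinical image interpretation. The model learns from the natural alignment between 485,980 color fundus photographs and their corresponding diagnostic reports, accumulated through a decade-long telemedicine program spanning 162 medical institutions across China.

Result: Across 27 ophthalmic benchmarks, ReVision enables efficient deployment with minimal local resources. Zero-shot disease detection reaches an average AUROC of 0.946 on 12 public benchmarks and 0.952 on 3 independent clinical cohorts. With minimal adaptation, ReVision matches extensively fine-tuned alternatives while requiring orders of magnitude fewer trainable parameters and labeled examples. In a prospective reader study with 33 ophthalmologists, ReVision's zero-shot assistance improved diagnostic accuracy by 14.8%.

Conclusion: The study shows that clinical native intelligence can be extracted directly from clinical archives without further annotation, to build medical AI systems suited to various low-resource settings. By using the naturally aligned image-report pairs from telemedicine programs, the approach achieves efficient deployment and generalization in real clinical environments.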


📄 Abstract

Current retinal foundation models remain constrained by curated research datasets that lack authentic clinical context, and require extensive task-specific optimization for each application, limiting their deployment efficiency in low-resource settings. Here, we show that these barriers can be overcome by building clinical native intelligence directly from real-world medical practice. Our key insight is that large-scale telemedicine programs, where expert centers provide remote consultations across distributed facilities, represent a natural reservoir for learning clinical image interpretation. We present ReVision, a retinal foundation model that learns from the natural alignment between 485,980 color fundus photographs and their corresponding diagnostic reports, accumulated through a decade-long telemedicine program spanning 162 medical institutions across China. Through extensive evaluation across 27 ophthalmic benchmarks, we demonstrate that ReVision enables deployment efficiency with minimal local resources. Without any task-specific training, ReVision achieves zero-shot disease detection with an average AUROC of 0.946 across 12 public benchmarks and 0.952 on 3 independent clinical cohorts. When minimal adaptation is feasible, ReVision matches extensively fine-tuned alternatives while requiring orders of magnitude fewer trainable parameters and labeled examples. The learned representations also transfer effectively to new clinical sites, imaging domains, imaging modalities, and systemic health prediction tasks. In a prospective reader study with 33 ophthalmologists, ReVision's zero-shot assistance improved diagnostic accuracy by 14.8% across all experience levels. These results demonstrate that clinical native intelligence can be directly extracted from clinical archives without any further annotation to build medical AI systems suited to various low-resource settings.

[43] HiFi-Portrait: Zero-shot Identity-preserved Portrait Generation with High-fidelity Multi-face Fusion

Yifang Xu, Benxiang Zhai, Yunzhuo Sun, Ming Li, Yang Li, Sidan Du

🧩 TL;DR

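This paper presents HiFi-Portrait, a high-fidelity method for zero-shot portrait generation. By fusing multi-face features and aligning them with 3D-aware landmarks, it significantly improves identity fidelity and the precision of face control.


📘 Detailed Summary

Motivation: When given multiple reference images of the same identity, existing diffusion-based identity-preserved portrait generation methods typically produce lower-fidelity portraits and struggle to customize face attributes precisely, which limits high-quality personalized portrait generation.

Method: The method first introduces a face refiner and a landmark generator to obtain fine-grained multi-face features and 3D-aware face landmarks; the landmarks carry the reference identity and the target attributes. It then designs HiFi-Net to fuse the multi-face features and align them with the landmarks, improving identity fidelity and face control. An automated pipeline constructs an identity-based dataset for training HiFi-Portrait.

Result: Extensive experiments show that the method surpasses state-of-the-art approaches in face similarity and controllability, and that it is compatible with previous SDXL-based works.

Conclusion: With its multi-feature fusion and landmark alignment mechanisms, HiFi-Portrait addresses identity fidelity and attribute control under multiple reference images, providing a reliable approach to zero-shot high-fidelity portrait generation with practical value.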


📄 Abstract

Recent advancements in diffusion-based technologies have made significant strides, particularly in identity-preserved portrait generation (IPG). However, when using multiple reference images from the same ID, existing methods typically produce lower-fidelity portraits and struggle to customize face attributes precisely. To address these issues, this paper presents HiFi-Portrait, a high-fidelity method for zero-shot portrait generation. Specifically, we first introduce the face refiner and landmark generator to obtain fine-grained multi-face features and 3D-aware face landmarks. The landmarks include the reference ID and the target attributes. Then, we design HiFi-Net to fuse multi-face features and align them with landmarks, which improves ID fidelity and face control. In addition, we devise an automated pipeline to construct an ID-based dataset for training HiFi-Portrait. Extensive experimental results demonstrate that our method surpasses the SOTA approaches in face similarity and controllability. Furthermore, our method is also compatible with previous SDXL-based works.

[44] FoodLogAthl-218: Constructing a Real-World Food Image Dataset Using Dietary Management Applications

Mitsuki Watanabe, Sosuke Amano, Kiyoharu Aizawa, Yoko Yamakata

🧩 TL;DR

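This paper presents FoodLogAthl-218, a food image dataset collected from a real-world dietary management application. It contains 6,925 images across 218 food categories and introduces application-specific tasks, including incremental fine-tuning along the temporal stream and context-aware classification, giving food image recognition research a benchmark closer to real use.


📘 Detailed Summary

Motivation: Most existing food image classification models are trained on web-crawled images, which differ markedly from the meal photos users actually take, limiting model performance in real applications. This work addresses the distribution mismatch by building a dataset of user-submitted images from a real dietary management application, providing training and evaluation benchmarks closer to actual usage.

Method: The authors built FoodLogAthl-218, which contains 6,925 images, 218 food categories, and 14,349 bounding boxes. Each image comes with rich metadata such as meal time, anonymized user ID, and meal-level context. Unlike conventional datasets, labels are applied after users submit their photos. Three evaluation tasks are introduced: a standard classification benchmark, an incremental fine-tuning protocol that follows the temporal stream of users' logs, and a context-aware classification task that uses the overall meal context to classify multiple dishes.

Result: The dataset shows greater intra-class diversity, a natural frequency distribution of meal types, and casual, unfiltered images intended for personal use and not public sharing. The tasks are evaluated with large multimodal models. The dataset is publicly available on Hugging Face as a new benchmark resource for food image recognition.

Conclusion: The study underlines the importance of collecting data from real application scenarios. With its natural data distribution and rich context, FoodLogAthl-218 provides a basis for building more robust food recognition systems. The incremental fine-tuning and context-aware classification tasks reflect the dynamic needs of real dietary management applications and push food image recognition research in a more practical direction.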


📄 Abstract

Food image classification models are crucial for dietary management applications because they reduce the burden of manual meal logging. However, most publicly available datasets for training such models rely on web-crawled images, which often differ from users' real-world meal photos. In this work, we present FoodLogAthl-218, a food image dataset constructed from real-world meal records collected through the dietary management application FoodLog Athl. The dataset contains 6,925 images across 218 food categories, with a total of 14,349 bounding boxes. Rich metadata, including meal date and time, anonymized user IDs, and meal-level context, accompany each image. Unlike conventional datasets, where a predefined class set guides web-based image collection, our data begins with user-submitted photos, and labels are applied afterward. This yields greater intra-class diversity, a natural frequency distribution of meal types, and casual, unfiltered images intended for personal use rather than public sharing. In addition to (1) a standard classification benchmark, we introduce two FoodLog-specific tasks: (2) an incremental fine-tuning protocol that follows the temporal stream of users' logs, and (3) a context-aware classification task where each image contains multiple dishes, and the model must classify each dish by leveraging the overall meal context. We evaluate these tasks using large multimodal models (LMMs). The dataset is publicly available at https://huggingface.co/datasets/FoodLog/FoodLogAthl-218.
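An incremental protocol that follows the temporal stream of logs could be organized roughly as in the sketch below. The abstract gives no details of the actual protocol, so the chunking by equal-sized rounds, the record format, and the function name are assumptions made purely for illustration.

```python
def temporal_rounds(records, n_rounds):
    """Split timestamped records into successive (train, evaluate) rounds.

    records  -- list of (timestamp, item) pairs (hypothetical format)
    n_rounds -- number of equal-sized chronological chunks to cut
    Each round trains on everything seen so far and evaluates on the next
    chunk, so no future data leaks into training.
    """
    ordered = sorted(records, key=lambda record: record[0])
    if not ordered:
        return []
    size = -(-len(ordered) // n_rounds)  # ceiling division
    chunks = [ordered[i:i + size] for i in range(0, len(ordered), size)]
    rounds = []
    for k in range(1, len(chunks)):
        seen = [record for chunk in chunks[:k] for record in chunk]
        rounds.append((seen, chunks[k]))
    return rounds
```

The design choice worth noting is that evaluation data always lies strictly later in time than the training data of the same round, which mirrors how a deployed logging application only ever sees past meals when it is updated.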

[45] LLM-driven Knowledge Enhancement for Multimodal Cancer Survival Prediction

Chenyu Zhao, Yingxue Xu, Fengtao Zhou, Yihui Wang, Hao Chen

🧩 TL;DR

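This paper proposes KEMM, an LLM-driven knowledge-enhanced multimodal model for cancer survival prediction. It integrates expert reports and prognostic background knowledge to improve feature extraction and alignment for high-dimensional, redundant multimodal data, achieving state-of-the-art performance on five datasets.


📘 Detailed Summary

Motivation: Current multimodal survival prediction methods typically rely on pathology images and genomic data, which are high-dimensional and redundant, making it hard to extract discriminative features and align modalities. In addition, a simple survival follow-up label is insufficient to supervise such a complex task; richer knowledge guidance is needed.

Method: KEMM integrates two knowledge sources: expert reports provided by pathologists and refined by a large language model, and prognostic background knowledge generated by a large language model. It introduces a knowledge-enhanced cross-modal attention module that guides the network to focus on discriminative, survival-relevant features within highly redundant modalities.

Result: Extensive experiments on five datasets show that KEMM achieves state-of-the-art performance. The knowledge-enhancement mechanism addresses the high dimensionality and redundancy of multimodal data and supports the value of expert knowledge and prognostic background information for survival prediction.

Conclusion: The study shows that integrating expert reports and prognostic background knowledge strengthens multimodal survival prediction models. Knowledge-enhanced cross-modal attention is an effective way to handle high-dimensional, redundant medical data and points to a direction for medical AI that combines domain knowledge with deep learning.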


📄 Abstract

Current multimodal survival prediction methods typically rely on pathology images (WSIs) and genomic data, both of which are high-dimensional and redundant, making it difficult to extract discriminative features from them and align different modalities. Moreover, using a simple survival follow-up label is insufficient to supervise such a complex task. To address these challenges, we propose KEMM, an LLM-driven Knowledge-Enhanced Multimodal Model for cancer survival prediction, which integrates expert reports and prognostic background knowledge. 1) Expert reports, provided by pathologists on a case-by-case basis and refined by a large language model (LLM), offer succinct and clinically focused diagnostic statements. This information can suggest different survival outcomes. 2) Prognostic background knowledge (PBK), generated concisely by an LLM, provides valuable prognostic context on different cancer types, which also enhances survival prediction. To leverage this knowledge, we introduce the knowledge-enhanced cross-modal (KECM) attention module. KECM can effectively guide the network to focus on discriminative and survival-relevant features from highly redundant modalities. Extensive experiments on five datasets demonstrate that KEMM achieves state-of-the-art performance. The code will be released upon acceptance.

[46] ViRC: Enhancing Visual Interleaved Mathematical CoT with Reason Chunking

Lihong Wang, Liangqi Li, Weiwei Feng, Jiamin Wu, Changtao Miao, Tieru Wu, Rui Ma, Bo Zhang, Zhe Li

🧩 TL;DR

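This paper proposes the ViRC framework, which introduces a Reason Chunking mechanism that decomposes multimodal mathematical reasoning into consecutive Critical Reasoning Units, simulating how human experts solve problems and markedly improving the mathematical reasoning of multimodal large language models.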


📘 Detailed Summary

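Motivation: Existing MLLMs typically perform textual reasoning from a single static mathematical image and overlook dynamic visual acquisition during reasoning. Humans, by contrast, repeatedly examine the image and reason step by step to prove intermediate propositions. This strategy of decomposing problem solving into key logical nodes is consistent with Miller's Law in cognitive science.

Method: ViRC introduces a Reason Chunking mechanism that structures multimodal mathematical CoT into consecutive Critical Reasoning Units (CRUs). Each CRU maintains intra-unit textual coherence to verify an intermediate proposition, while visual information is integrated across units to generate subsequent propositions. The authors build the CRUX dataset, using three visual tools and four reasoning patterns to provide explicitly annotated CRUs across multiple reasoning paths for each problem, and adopt a progressive training strategy comprising Instructional SFT, Practice SFT, and Strategic RL to strengthen the model's Reason Chunking ability.

Result: ViRC-7B, trained on the CRUX dataset, achieves an 18.8% average improvement over baselines across multiple mathematical benchmarks. The code is available on GitHub.

Conclusion: By structuring reasoning to mimic human cognition, the work offers a new paradigm for multimodal mathematical reasoning. It shows that decomposing complex problems into critical reasoning units, combined with dynamic visual acquisition, is effective, with implications for the design of future multimodal reasoning systems.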


📄 Abstract

CoT has significantly enhanced the reasoning ability of LLMs, but it faces challenges when extended to multimodal domains, particularly in mathematical tasks. Existing MLLMs typically perform textual reasoning solely from a single static mathematical image, overlooking dynamic visual acquisition during reasoning. In contrast, humans repeatedly examine the visual image and employ step-by-step reasoning to prove intermediate propositions. This strategy of decomposing the problem-solving process into key logical nodes adheres to Miller's Law in cognitive science. Inspired by this insight, we propose a ViRC framework for multimodal mathematical tasks, introducing a Reason Chunking mechanism that structures multimodal mathematical CoT into consecutive Critical Reasoning Units (CRUs) to simulate human expert problem-solving patterns. CRUs ensure intra-unit textual coherence for intermediate proposition verification while integrating visual information across units to generate subsequent propositions and support structured reasoning. To this end, we present the CRUX dataset, using three visual tools and four reasoning patterns to provide explicitly annotated CRUs across multiple reasoning paths for each mathematical problem. Leveraging the CRUX dataset, we propose a progressive training strategy inspired by human cognitive learning, which includes Instructional SFT, Practice SFT, and Strategic RL, aimed at further strengthening the Reason Chunking ability of the model. The resulting ViRC-7B model achieves an 18.8% average improvement over baselines across multiple mathematical benchmarks. Code is available at https://github.com/Leon-LihongWang/ViRC.

cs.CL [Back]

[47] A Unified Sparse Attention via Multi-Granularity Compression

Siran Liu, Zane Cao, Yongchao He

🧩 TL;DR

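This paper proposes UniSparse, a unified sparse attention mechanism that introduces composite tokens and multi-granularity compression to substantially accelerate long-sequence processing while maintaining high accuracy, enabling efficient attention computation across modalities.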


📘 Detailed Summary

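Motivation: Long-context understanding in large language models is bottlenecked by the core self-attention mechanism, whose computational cost grows quadratically with sequence length. Existing sparse attention methods involve trade-offs: training-based methods are costly and cannot be applied as acceleration plugins for other models, while inference-time methods often compromise efficiency or cross-modal generality.

Method: UniSparse introduces composite tokens, compact representations that aggregate multi-granularity contextual information. Building on this abstraction, it dynamically constructs sparse attention through multi-granularity compression and block-level selection, enabling efficient, hardware-friendly execution on GPU.

Result: Across multiple modalities and tasks, from synthetic benchmarks to real-world applications, UniSparse surpasses state-of-the-art sparse attention methods (e.g., MInference, XAttention, FlexPrefill) in both accuracy and efficiency, reaching ≥ 99% of full-attention accuracy with attention computation up to 2.61× faster than FlashAttention.

Conclusion: UniSparse provides a unified, general-purpose sparse attention solution that addresses the computational bottleneck of long-sequence processing while preserving high accuracy and cross-modal applicability, making it a useful acceleration technique for deploying large language models in practice.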


📄 Abstract

Efficient long-context understanding and reasoning are increasingly vital for large language model (LLM) applications such as multi-turn dialogue and program analysis. However, the core self-attention mechanism scales quadratically with sequence length, creating a fundamental computational bottleneck. Existing sparse attention methods alleviate this issue but face trade-offs: training-based methods are costly and cannot be directly applied as acceleration plugins for other models, while inference-time methods often compromise efficiency or cross-modal generality. To address these limitations, we present UniSparse, a unified mechanism that introduces the notion of composite tokens--compact representations that aggregate multi-granularity contextual information. Building on this abstraction, UniSparse dynamically constructs sparse attention through multi-granularity compression and block-level selection, enabling efficient and hardware-friendly execution on GPU. Across multiple modalities and tasks ranging from synthetic benchmarks to real-world applications, UniSparse consistently surpasses state-of-the-art sparse attention methods (e.g., MInference, XAttention, FlexPrefill) in both accuracy and efficiency, achieving $\ge$ 99% of full-attention accuracy and up to 2.61$\times$ faster attention computation than FlashAttention.
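To make "compression followed by block-level selection" concrete, here is a minimal single-granularity sketch. It is not UniSparse's algorithm: the abstract does not specify how composite tokens are built, so this example assumes plain mean pooling over fixed-size key blocks and a dot-product score against one query. The function names and toy data are hypothetical.

```python
def mean_pool_blocks(keys, block_size):
    """Compress keys into one mean vector per contiguous block."""
    pooled = []
    for start in range(0, len(keys), block_size):
        chunk = keys[start:start + block_size]
        dim = len(chunk[0])
        pooled.append([sum(v[i] for v in chunk) / len(chunk) for i in range(dim)])
    return pooled


def select_blocks(query, keys, block_size, top_k):
    """Return the indices (ascending) of the top_k key blocks, ranked by the
    dot product between the query and each pooled block vector."""
    pooled = mean_pool_blocks(keys, block_size)
    scores = [sum(q * p for q, p in zip(query, vec)) for vec in pooled]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])


# Toy data (hypothetical): 8 keys grouped into 4 blocks of 2.
keys = [
    [1.0, 0.0], [1.0, 0.0],    # block 0: aligned with the query
    [0.0, 1.0], [0.0, 1.0],    # block 1: orthogonal
    [-1.0, 0.0], [-1.0, 0.0],  # block 2: opposed
    [0.9, 0.1], [0.8, 0.0],    # block 3: mostly aligned
]
query = [1.0, 0.0]
selected = select_blocks(query, keys, block_size=2, top_k=2)  # -> [0, 3]
```

Exact attention would then be computed only over the keys in the selected blocks, so the cost scales with `top_k * block_size` rather than with the full sequence length.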

[48] Multilingual and Continuous Backchannel Prediction: A Cross-lingual Study

Koji Inoue, Mikey Elmers, Yahui Fu, Zi Haur Pang, Taiga Mori, Divesh Lala, Keiko Ochi, Tatsuya Kawahara

🧩 TL;DR

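This paper presents a Transformer-based multilingual, continuous backchannel prediction model and uses it to study cross-lingual timing behavior in Japanese, English, and Chinese. The model matches or surpasses monolingual baselines in all three languages and reveals marked differences in backchannel timing across languages.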


📘 Detailed Summary

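Motivation: The study addresses cross-lingual differences in backchannel timing in multilingual dialogue systems. It builds a unified multilingual model to examine what different languages share and where they differ in backchannel timing, providing empirical grounding for more natural, culturally aware spoken dialogue systems.

Method: The study uses a Transformer-based, frame-level continuous backchannel prediction model, jointly trained with auxiliary tasks on approximately 300 hours of dyadic conversations in Japanese, English, and Chinese. The trained model is integrated into real-time processing software with CPU-only inference.

Result: The multilingual model matches or surpasses monolingual baselines in all three languages, indicating that it learns both language-universal cues and language-specific timing patterns. Zero-shot transfer remains limited, pointing to substantive cross-lingual differences. Perturbation analyses show that Japanese relies more on short-term linguistic information, whereas English and Chinese are more sensitive to silence duration and prosodic variation.

Conclusion: The study provides a unified model and empirical evidence that backchannel timing differs systematically across languages. Multilingual training encourages shared yet adaptable representations and reduces overreliance on pitch in Chinese, findings that inform the design of more natural, culturally aware spoken dialogue systems.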


📄 Abstract

We present a multilingual, continuous backchannel prediction model for Japanese, English, and Chinese, and use it to investigate cross-linguistic timing behavior. The model is Transformer-based and operates at the frame level, jointly trained with auxiliary tasks on approximately 300 hours of dyadic conversations. Across all three languages, the multilingual model matches or surpasses monolingual baselines, indicating that it learns both language-universal cues and language-specific timing patterns. Zero-shot transfer with two-language training remains limited, underscoring substantive cross-lingual differences. Perturbation analyses reveal distinct cue usage: Japanese relies more on short-term linguistic information, whereas English and Chinese are more sensitive to silence duration and prosodic variation; multilingual training encourages shared yet adaptable representations and reduces overreliance on pitch in Chinese. A context-length study further shows that Japanese is relatively robust to shorter contexts, while Chinese benefits markedly from longer contexts. Finally, we integrate the trained model into real-time processing software, demonstrating CPU-only inference. Together, these findings provide a unified model and empirical evidence for how backchannel timing differs across languages, informing the design of more natural, culturally aware spoken dialogue systems.

[49] Low-Resource, High-Impact: Building Corpora for Inclusive Language Technologies

Ekaterina Artemova, Laurie Burchell, Daryna Dementieva, Shu Okabe, Mariya Shmatova, Pedro Ortiz Suarez

🧩 TL;DR

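This tutorial gives NLP practitioners working with multilingual and low-resource languages an end-to-end technical toolkit, covering the full pipeline from data collection to downstream applications, with the aim of fostering fairer, reproducible, and community-informed technology development.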


📘 Detailed Summary

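Motivation: The tutorial addresses the data scarcity and cultural variance that multilingual and low-resource languages face in natural language processing, particularly for underrepresented language communities, in order to create more equitable and socially impactful language technologies.

Method: The tutorial presents how to build end-to-end NLP pipelines, from data collection and web crawling to parallel sentence mining, machine translation, and downstream applications such as text classification and multimodal reasoning. It offers concrete strategies, hands-on methods, and modeling frameworks for dealing with data scarcity and cultural variance.

Result: The tutorial showcases a diverse set of use cases covering over 10 languages from different language families and geopolitical contexts, including both digitally resource-rich and severely underrepresented languages, giving practitioners an actionable reference framework.

Conclusion: The tutorial emphasizes fair, reproducible, and community-informed development approaches and provides a practical technical roadmap for low-resource language NLP, supporting more inclusive language technologies with real social impact.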


📄 Abstract

This tutorial (https://tum-nlp.github.io/low-resource-tutorial) is designed for NLP practitioners, researchers, and developers working with multilingual and low-resource languages who seek to create more equitable and socially impactful language technologies. Participants will walk away with a practical toolkit for building end-to-end NLP pipelines for underrepresented languages -- from data collection and web crawling to parallel sentence mining, machine translation, and downstream applications such as text classification and multimodal reasoning. The tutorial presents strategies for tackling the challenges of data scarcity and cultural variance, offering hands-on methods and modeling frameworks. We will focus on fair, reproducible, and community-informed development approaches, grounded in real-world scenarios. We will showcase a diverse set of use cases covering over 10 languages from different language families and geopolitical contexts, including both digitally resource-rich and severely underrepresented languages.

[50] JMMMU-Pro: Image-based Japanese Multi-discipline Multimodal Understanding Benchmark via Vibe Benchmark Construction

Atsuyuki Miyai, Shota Onohara, Jeonghun Baek, Kiyoharu Aizawa

🧩 TL;DR

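This paper introduces JMMMU-Pro, an image-based Japanese multi-discipline multimodal understanding benchmark, and Vibe Benchmark Construction, a scalable construction method, to provide a more rigorous evaluation tool for the Japanese capabilities of large multimodal models (LMMs).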


📘 Detailed Summary

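Motivation: Existing Japanese multimodal benchmarks have limitations, and a more rigorous evaluation tool is needed to accurately assess how well LMMs handle integrated visual-textual understanding in Japanese, particularly for Japanese cultural contexts and complex visual scenes.

Method: The paper proposes Vibe Benchmark Construction, in which an image generative model (e.g., Nano Banana Pro) produces candidate visual questions and humans verify the outputs, regenerating with adjusted prompts when necessary to ensure quality. The question image and question text are composed into a single image, yielding a benchmark that requires integrated understanding through visual perception.

Result: Experiments show that all open-source LMMs struggle substantially with JMMMU-Pro, underscoring its value as a benchmark for guiding future efforts in the open-source community. The results also indicate that the construction method can produce a high-quality benchmark at low cost, covering a wide range of background and layout designs.

Conclusion: JMMMU-Pro provides a more rigorous tool for assessing the Japanese capabilities of LMMs, while Vibe Benchmark Construction offers an efficient guideline for developing future image-based VQA benchmarks, helping advance open-source research on multimodal understanding.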


📄 Abstract

This paper introduces JMMMU-Pro, an image-based Japanese Multi-discipline Multimodal Understanding Benchmark, and Vibe Benchmark Construction, a scalable construction method. Following the evolution from MMMU to MMMU-Pro, JMMMU-Pro extends JMMMU by composing the question image and question text into a single image, thereby creating a benchmark that requires integrated visual-textual understanding through visual perception. To build JMMMU-Pro, we propose Vibe Benchmark Construction, a methodology in which an image generative model (e.g., Nano Banana Pro) produces candidate visual questions, and humans verify the outputs and, when necessary, regenerate with adjusted prompts to ensure quality. By leveraging Nano Banana Pro's highly realistic image generation capabilities and its ability to embed clean Japanese text, we construct a high-quality benchmark at low cost, covering a wide range of background and layout designs. Experimental results show that all open-source LMMs struggle substantially with JMMMU-Pro, underscoring JMMMU-Pro as an important benchmark for guiding future efforts in the open-source community. We believe that JMMMU-Pro provides a more rigorous evaluation tool for assessing the Japanese capabilities of LMMs and that our Vibe Benchmark Construction also offers an efficient guideline for future development of image-based VQA benchmarks.

[51] MMGR: Multi-Modal Generative Reasoning

Zefan Cai, Haoyi Qiu, Tianyi Ma, Haozhe Zhao, Gengze Zhou, Kung-Hsiang Huang, Parisa Kordjamshidi, Minjia Zhang, Xiao Wen, Jiuxiang Gu, Nanyun Peng, Junjie Hu

🧩 TL;DR

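This paper introduces MMGR (Multi-Modal Generative Reasoning Evaluation and Benchmark), a systematic framework for evaluating the reasoning abilities of video and image generation models, and reveals significant shortcomings of current models in abstract reasoning, spatial planning, and causal consistency.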


📘 Detailed Summary

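Motivation: Video generation models produce visually realistic and temporally coherent content, but their reliability as world simulators depends on whether they capture physical, logical, and spatial constraints. Existing metrics such as Fréchet Video Distance (FVD) emphasize perceptual quality and overlook reasoning failures, including violations of causality, physics, and global consistency.

Method: The paper proposes the MMGR framework, built on five reasoning abilities: physical, logical, 3D spatial, 2D spatial, and temporal. It evaluates models in three domains: Abstract Reasoning (ARC-AGI, Sudoku), Embodied Navigation (real-world 3D navigation and localization), and Physical Commonsense (sports and compositional interactions), using fine-grained metrics that require holistic correctness across both video and image generation.

Result: Benchmarking leading video models (Veo-3, Sora-2, Wan-2.2) and image models (the Nano-banana series, GPT-4o-image, Qwen-image) reveals significant performance gaps. Models achieve moderate success on Physical Commonsense tasks but perform very poorly on Abstract Reasoning (below 10% accuracy on ARC-AGI) and struggle with long-horizon spatial planning in embodied settings.

Conclusion: The analysis highlights key limitations of current models, including overreliance on perceptual data, weak global state consistency, and objectives that favor visual plausibility over causal correctness. MMGR offers a unified diagnostic benchmark and a direction for developing reasoning-aware generative world models.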


📄 Abstract

Video foundation models generate visually realistic and temporally coherent content, but their reliability as world simulators depends on whether they capture physical, logical, and spatial constraints. Existing metrics such as Frechet Video Distance (FVD) emphasize perceptual quality and overlook reasoning failures, including violations of causality, physics, and global consistency. We introduce MMGR (Multi-Modal Generative Reasoning Evaluation and Benchmark), a principled evaluation framework based on five reasoning abilities: Physical, Logical, 3D Spatial, 2D Spatial, and Temporal. MMGR evaluates generative reasoning across three domains: Abstract Reasoning (ARC-AGI, Sudoku), Embodied Navigation (real-world 3D navigation and localization), and Physical Commonsense (sports and compositional interactions). MMGR applies fine-grained metrics that require holistic correctness across both video and image generation. We benchmark leading video models (Veo-3, Sora-2, Wan-2.2) and image models (Nano-banana, Nano-banana Pro, GPT-4o-image, Qwen-image), revealing strong performance gaps across domains. Models show moderate success on Physical Commonsense tasks but perform poorly on Abstract Reasoning (below 10 percent accuracy on ARC-AGI) and struggle with long-horizon spatial planning in embodied settings. Our analysis highlights key limitations in current models, including overreliance on perceptual data, weak global state consistency, and objectives that reward visual plausibility over causal correctness. MMGR offers a unified diagnostic benchmark and a path toward reasoning-aware generative world models.
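The abstract says MMGR's metrics "require holistic correctness" but does not define them. One plausible reading, sketched below purely as an assumption, is that a generated sample counts as correct only when every individual check passes, which is stricter than averaging per-check pass rates. The check names and data are hypothetical.

```python
def holistic_accuracy(samples):
    """Fraction of samples in which every sub-check passed."""
    if not samples:
        return 0.0
    passed = sum(1 for checks in samples if all(checks.values()))
    return passed / len(samples)


def per_check_accuracy(samples):
    """Lenient alternative: mean pass rate over all individual checks."""
    flags = [ok for checks in samples for ok in checks.values()]
    return sum(flags) / len(flags) if flags else 0.0


# Hypothetical per-sample check results for four generated outputs.
results = [
    {"rule_followed": True, "final_state_correct": True, "no_artifacts": True},
    {"rule_followed": True, "final_state_correct": False, "no_artifacts": True},
    {"rule_followed": True, "final_state_correct": True, "no_artifacts": False},
    {"rule_followed": True, "final_state_correct": True, "no_artifacts": True},
]
strict = holistic_accuracy(results)    # 2 of 4 samples pass everything
lenient = per_check_accuracy(results)  # 10 of 12 checks pass
```

The gap between the strict and lenient scores illustrates why an all-or-nothing criterion can expose reasoning failures that an averaged score would hide.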

cs.AI [Back]

[52] MobileWorldBench: Towards Semantic World Modeling For Mobile Agents

Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Yusuke Kato, Kazuki Kozuka, Aditya Grover

🧩 TL;DR

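This paper proposes a semantic world modeling approach for GUI agents that describes state transitions in natural language rather than in pixel space. It releases the MobileWorldBench benchmark and the MobileWorld dataset, which significantly improve the world modeling capabilities of vision-language models and, in turn, the task success rates of mobile GUI agents.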


📘 Detailed Summary

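Motivation: Pixel-space world models face practical limitations in GUI settings, where predicting complex visual elements in future states is difficult, so alternative formulations are needed to improve world modeling for GUI agents.

Method: The paper proposes a new formulation of world modeling for GUI agents in which state transitions are described in natural language. It introduces the MobileWorldBench benchmark to evaluate the world modeling ability of vision-language models (VLMs), releases the large-scale MobileWorld dataset of 1.4M samples, and develops a framework that integrates VLM world models into the planning framework of mobile agents.

Result: Experiments show that semantic world models significantly improve the world modeling capabilities of VLMs and directly raise the task success rates of mobile agents, with the large-scale MobileWorld dataset providing important support for model training.

Conclusion: The study indicates that semantic world models have practical advantages over pixel-space approaches in GUI settings and offer an effective alternative for world modeling in GUI agents. The released benchmark and dataset are valuable resources for future research.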


📄 Abstract

World models have shown great utility in improving the task performance of embodied agents. While prior work largely focuses on pixel-space world models, these approaches face practical limitations in GUI settings, where predicting complex visual elements in future states is often difficult. In this work, we explore an alternative formulation of world modeling for GUI agents, where state transitions are described in natural language rather than predicting raw pixels. First, we introduce MobileWorldBench, a benchmark that evaluates the ability of vision-language models (VLMs) to function as world models for mobile GUI agents. Second, we release MobileWorld, a large-scale dataset consisting of 1.4M samples that significantly improves the world modeling capabilities of VLMs. Finally, we propose a novel framework that integrates VLM world models into the planning framework of mobile agents, demonstrating that semantic world models can directly benefit mobile agents by improving task success rates. The code and dataset are available at https://github.com/jacklishufan/MobileWorld
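As a rough illustration of how a text-based world model could feed into planning, the sketch below scores each candidate action by the predicted natural-language description of the resulting screen. Everything here is a stand-in chosen for the example: the lookup table replaces a real VLM, the word-overlap score replaces whatever judgment the paper's framework uses, and all names and strings are hypothetical.

```python
def predict_next_state(state, action, transitions):
    """Stand-in for a VLM world model: returns a text description of the
    screen that would follow `action` in `state`."""
    return transitions.get((state, action), "screen unchanged")


def goal_overlap(description, goal):
    """Crude relevance score: how many distinct goal words appear in the
    predicted description."""
    desc_words = set(description.lower().split())
    return sum(1 for word in set(goal.lower().split()) if word in desc_words)


def choose_action(state, actions, goal, transitions):
    """Pick the action whose predicted next state best matches the goal."""
    return max(
        actions,
        key=lambda a: goal_overlap(predict_next_state(state, a, transitions), goal),
    )


# Hypothetical transition descriptions for a toy home screen.
transitions = {
    ("home screen", "tap settings"): "settings page with wifi and bluetooth options",
    ("home screen", "tap camera"): "camera viewfinder is open",
}
best = choose_action(
    "home screen",
    ["tap camera", "tap settings", "swipe left"],
    "open wifi settings",
    transitions,
)  # -> "tap settings"
```

Because the prediction is a sentence rather than a rendered image, the planner can compare it with the goal directly in language, which is the practical appeal of the semantic formulation described in the abstract.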

[53] HydroGEM: A Self Supervised Zero Shot Hybrid TCN Transformer Foundation Model for Continental Scale Streamflow Quality Control

Ijaz Ul Haq, Byung Suk Lee, Julia N. Perdrial, David Baude

🧩 TL;DR

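This paper introduces HydroGEM, a pretrained foundation model for continental-scale streamflow quality control. Using two-stage training and a hybrid TCN-Transformer architecture, it significantly outperforms existing methods on synthetic anomaly detection and reconstruction and demonstrates cross-national generalization.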


📘 Detailed Summary

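Motivation: Real-time streamflow monitoring networks generate millions of observations each year, yet maintaining data quality across thousands of remote sensors remains labor-intensive. Automated solutions are needed for quality control of hydrological data at scale.

Method: HydroGEM uses two-stage training: self-supervised pretraining on 6.03 million sequences from 3,724 USGS stations to learn hydrological representations, followed by fine-tuning with synthetic anomalies for detection and reconstruction. A hybrid TCN-Transformer architecture (14.2M parameters) captures local temporal patterns and long-range dependencies, while hierarchical normalization handles six orders of magnitude in discharge.

Result: On a held-out synthetic test set of 799 stations with 18 expert-validated anomaly types, HydroGEM achieves F1 = 0.792 for detection and a 68.7% reduction in reconstruction error, a 36.3% improvement over existing methods. Zero-shot transfer to 100 Environment and Climate Change Canada stations yields F1 = 0.586, exceeding all baselines and demonstrating cross-national generalization.

Conclusion: HydroGEM demonstrates the effectiveness of foundation models for hydrological data quality control. It is designed for human-in-the-loop workflows, producing quality control suggestions that require expert review rather than autonomous corrections. The model maintains consistent detection across correction magnitudes and aligns with operational seasonal patterns, offering a scalable solution for large-scale hydrological monitoring.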


📄 Abstract

Real-time streamflow monitoring networks generate millions of observations annually, yet maintaining data quality across thousands of remote sensors remains labor-intensive. We introduce HydroGEM (Hydrological Generalizable Encoder for Monitoring), a foundation model for continental-scale streamflow quality control. HydroGEM uses two-stage training: self-supervised pretraining on 6.03 million sequences from 3,724 USGS stations learns hydrological representations, followed by fine-tuning with synthetic anomalies for detection and reconstruction. A hybrid TCN-Transformer architecture (14.2M parameters) captures local temporal patterns and long-range dependencies, while hierarchical normalization handles six orders of magnitude in discharge. On held-out synthetic tests comprising 799 stations with 18 expert-validated anomaly types, HydroGEM achieves F1 = 0.792 for detection and 68.7% reconstruction-error reduction, a 36.3% improvement over existing methods. Zero-shot transfer to 100 Environment and Climate Change Canada stations yields F1 = 0.586, exceeding all baselines and demonstrating cross-national generalization. The model maintains consistent detection across correction magnitudes and aligns with operational seasonal patterns. HydroGEM is designed for human-in-the-loop workflows: its outputs are quality control suggestions requiring expert review, not autonomous corrections.
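The abstract does not say how HydroGEM's hierarchical normalization works. As a generic illustration of why some multi-level scheme is needed when discharge spans six orders of magnitude, the sketch below assumes a common two-step recipe: a `log1p` transform to compress the dynamic range, then a z-score using each station's own statistics. The station names and discharge values are invented for the example.

```python
import math


def normalize_station(discharge):
    """Assumed two-level scheme: log1p compression, then a per-station
    z-score. Returns normalized values plus the stats needed to invert."""
    logged = [math.log1p(q) for q in discharge]
    mean = sum(logged) / len(logged)
    var = sum((x - mean) ** 2 for x in logged) / len(logged)
    std = math.sqrt(var) or 1.0  # guard against constant series
    return [(x - mean) / std for x in logged], mean, std


def denormalize(values, mean, std):
    """Invert normalize_station back to discharge units."""
    return [math.expm1(v * std + mean) for v in values]


# Invented discharge series for a small creek and a large river.
creek = [0.02, 0.05, 0.03, 0.4]
river = [8000.0, 12000.0, 9500.0, 20000.0]
creek_z, c_mean, c_std = normalize_station(creek)
river_z, r_mean, r_std = normalize_station(river)
creek_back = denormalize(creek_z, c_mean, c_std)
river_back = denormalize(river_z, r_mean, r_std)
```

After this transform both stations occupy a comparable numeric range, so one network can process them together, and `denormalize` maps reconstructions back to physical units for expert review.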

[54] Incentivizing Tool-augmented Thinking with Images for Medical Image Analysis

Yankai Jiang, Yujie Zhang, Peng Zhang, Yichen Li, Jintai Chen, Xiaoming Shi, Shihui Zhen

🧩 TL;DR

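This paper proposes Ophiuchus, a tool-augmented multimodal large language model framework that tackles complex medical reasoning tasks by dynamically focusing on fine-grained visual regions, outperforming existing methods.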


📘 Detailed Summary

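Motivation: Existing reasoning-based medical MLLMs can generate step-by-step textual reasoning chains, but they still struggle with complex tasks that require dynamic, iterative focusing on fine-grained visual regions for precise grounding and diagnosis. This limits their performance in medical applications that demand in-depth visual analysis.

Method: The core of Ophiuchus is a three-stage training strategy: cold-start training with tool-integrated reasoning data to establish basic tool selection and inspection of key regions; self-reflection fine-tuning to strengthen reflective reasoning and encourage revisiting tool outputs; and Agentic Tool Reinforcement Learning to directly optimize task-specific rewards and emulate expert-like diagnostic behavior. The framework combines the model's inherent grounding and perception capabilities with external tools.

Result: Across diverse medical benchmarks, including VQA, detection, and reasoning-based segmentation, Ophiuchus consistently outperforms both closed-source and open-source state-of-the-art methods.

Conclusion: The work opens a path toward medical AI agents that can genuinely "think with images" through tool-integrated reasoning. By combining the model's intrinsic capabilities with external tools, it moves beyond the performance ceiling of specialized tools and advances medical multimodal reasoning.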


📄 Abstract

Recent reasoning-based medical MLLMs have made progress in generating step-by-step textual reasoning chains. However, they still struggle with complex tasks that necessitate dynamic and iterative focusing on fine-grained visual regions to achieve precise grounding and diagnosis. We introduce Ophiuchus, a versatile, tool-augmented framework that equips an MLLM to (i) decide when additional visual evidence is needed, (ii) determine where to probe and ground within the medical image, and (iii) seamlessly weave the relevant sub-image content back into an interleaved, multimodal chain of thought. In contrast to prior approaches limited by the performance ceiling of specialized tools, Ophiuchus integrates the model's inherent grounding and perception capabilities with external tools, thereby fostering higher-level reasoning. The core of our method is a three-stage training strategy: cold-start training with tool-integrated reasoning data to achieve basic tool selection and adaptation for inspecting key regions; self-reflection fine-tuning to strengthen reflective reasoning and encourage revisiting tool outputs; and Agentic Tool Reinforcement Learning to directly optimize task-specific rewards and emulate expert-like diagnostic behavior. Extensive experiments show that Ophiuchus consistently outperforms both closed-source and open-source SOTA methods across diverse medical benchmarks, including VQA, detection, and reasoning-based segmentation. Our approach illuminates a path toward medical AI agents that can genuinely "think with images" through tool-integrated reasoning. Datasets, code, and trained models will be released publicly.

[55] Sparse Multi-Modal Transformer with Masking for Alzheimer's Disease Classification

Cheng-Han Lu, Pei-Hsuan Tsai

🧩 TL;DR

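This paper proposes SMMT, a sparse multi-modal Transformer architecture that addresses the high computational and energy costs that dense self-attention imposes on Transformer-based multi-modal systems. By introducing cluster-based sparse attention and modality-wise masking, it substantially improves efficiency while maintaining competitive predictive performance.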


📘 Detailed Summary

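Motivation: Transformer-based multi-modal intelligent systems often incur high computational and energy costs because of dense self-attention, which limits their scalability under resource constraints, particularly in real-world applications that must process multi-modal data.

Method: SMMT builds on a cascaded multi-modal Transformer framework. It introduces cluster-based sparse attention to achieve near-linear computational complexity and modality-wise masking to improve robustness to incomplete inputs, yielding an efficient and robust multi-modal architecture.

Result: In Alzheimer's disease classification experiments on the ADNI dataset, SMMT maintains competitive predictive performance while significantly reducing training time, memory usage, and energy consumption compared with dense attention baselines, supporting its suitability as a resource-aware architectural component.

Conclusion: SMMT demonstrates the feasibility of sparse attention in multi-modal Transformers and offers an effective architectural option for building scalable, resource-aware intelligent systems, with particular value in applications such as medical diagnosis that must process multi-modal data.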


📄 Abstract

Transformer-based multi-modal intelligent systems often suffer from high computational and energy costs due to dense self-attention, limiting their scalability under resource constraints. This paper presents SMMT, a sparse multi-modal transformer architecture designed to improve efficiency and robustness. Building upon a cascaded multi-modal transformer framework, SMMT introduces cluster-based sparse attention to achieve near-linear computational complexity and modality-wise masking to enhance robustness against incomplete inputs. The architecture is evaluated using Alzheimer's Disease classification on the ADNI dataset as a representative multi-modal case study. Experimental results show that SMMT maintains competitive predictive performance while significantly reducing training time, memory usage, and energy consumption compared to dense attention baselines, demonstrating its suitability as a resource-aware architectural component for scalable intelligent systems.
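Modality-wise masking is commonly implemented as a training-time augmentation that hides whole modalities so the model learns to cope with missing inputs. The sketch below shows that general idea only. The abstract does not give SMMT's masking procedure, so the drop probability, the zero-fill strategy, the keep-at-least-one rule, and the modality names are all assumptions made for illustration.

```python
import random


def mask_modalities(sample, drop_prob, rng):
    """Zero out each modality independently with probability drop_prob,
    always keeping at least one. Returns the masked sample and a boolean
    presence mask the model could consume."""
    names = list(sample)
    keep = [name for name in names if rng.random() >= drop_prob]
    if not keep:
        keep = [rng.choice(names)]  # never drop every modality
    masked = {
        name: (list(feats) if name in keep else [0.0] * len(feats))
        for name, feats in sample.items()
    }
    mask = {name: name in keep for name in names}
    return masked, mask


# Hypothetical feature vectors for one subject.
rng = random.Random(0)
sample = {"mri": [0.3, 1.2], "pet": [0.7, 0.1], "clinical": [65.0, 1.0, 0.0]}
masked, mask = mask_modalities(sample, drop_prob=0.5, rng=rng)
```

At inference time the same presence mask can flag modalities that are genuinely missing, so the model encounters the same input pattern it was trained on.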