cs.CV

[1] Explainable Detection of AI-Generated Images with Artifact Localization Using Faster-Than-Lies and Vision-Language Models for Edge Devices

Aryan Mathur, Asaduddin Ahmed, Pushti Amit Vasoya, Simeon Kandan Sonar, Yasir Z, Madesh Kuppusamy

🧩 TL;DR

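This work presents an explainable image authenticity detection system that combines a lightweight convolutional classifier with a vision-language model. It detects AI-generated images at 32×32 resolution with 96.5% accuracy while also providing artifact localization and textual explanations.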


📘 Detailed Summary

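Motivation: As AI-generated images grow more realistic, verifying the authenticity of visual content becomes a serious challenge, calling for methods that can accurately detect and explain forgery traces at low resolution.

Method: The system pairs the lightweight convolutional classifier Faster-Than-Lies with the vision-language model Qwen2-VL-7B. Autoencoder reconstruction error maps produce artifact localization heatmaps, and 70 visual artifact types are grouped into eight semantic categories to support explainable anomaly detection.

Result: On the extended CiFAKE dataset augmented with adversarial perturbations, the model reaches 96.5% accuracy with an inference time of 175 ms on 8-core CPUs, enabling deployment on local or edge devices while generating artifact localization heatmaps and textual explanations.

Conclusion: The work demonstrates the feasibility of combining visual and linguistic reasoning for interpretable authenticity detection in low-resolution imagery, with potential cross-domain applications in forensics, industrial inspection, and social media moderation.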


📄 Abstract

The increasing realism of AI-generated imagery poses challenges for verifying visual authenticity. We present an explainable image authenticity detection system that combines a lightweight convolutional classifier ("Faster-Than-Lies") with a Vision-Language Model (Qwen2-VL-7B) to classify, localize, and explain artifacts in 32x32 images. Our model achieves 96.5% accuracy on the extended CiFAKE dataset augmented with adversarial perturbations and maintains an inference time of 175ms on 8-core CPUs, enabling deployment on local or edge devices. Using autoencoder-based reconstruction error maps, we generate artifact localization heatmaps, which enhance interpretability for both humans and the VLM. We further categorize 70 visual artifact types into eight semantic groups and demonstrate explainable text generation for each detected anomaly. This work highlights the feasibility of combining visual and linguistic reasoning for interpretable authenticity detection in low-resolution imagery and outlines potential cross-domain applications in forensics, industrial inspection, and social media moderation.
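The abstract describes heatmaps derived from autoencoder reconstruction error but does not give the exact recipe. The following is a minimal sketch, assuming a grayscale image stored as nested lists and a per-pixel squared error that is min-max scaled; the function name `error_heatmap` and the toy data are hypothetical and not taken from the paper.

```python
from typing import List

Image = List[List[float]]  # grayscale image as rows of intensities in [0, 1]


def error_heatmap(original: Image, reconstruction: Image) -> Image:
    """Per-pixel squared reconstruction error, min-max scaled to [0, 1]."""
    errors = [
        [(o - r) ** 2 for o, r in zip(orig_row, rec_row)]
        for orig_row, rec_row in zip(original, reconstruction)
    ]
    flat = [e for row in errors for e in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0.0:
        # Uniform error: no pixel stands out, so return an all-zero map.
        return [[0.0 for _ in row] for row in errors]
    return [[(e - lo) / span for e in row] for row in errors]


# Toy 2x2 example (assumed data): the reconstruction misses only
# the bottom-right pixel, so that pixel should dominate the heatmap.
original = [[0.0, 0.5], [0.5, 1.0]]
reconstruction = [[0.0, 0.5], [0.5, 0.0]]
heatmap = error_heatmap(original, reconstruction)
```

In this sketch, pixels the autoencoder reconstructs poorly receive values near 1, which is the intuition behind using reconstruction error as a localization cue.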

[2] CountFormer: A Transformer Framework for Learning Visual Repetition and Structure in Class-Agnostic Object Counting

Md Tanvir Hossain, Akif Islam, Mohd Ruhul Ameen

🧩 TL;DR

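This paper proposes CountFormer, a transformer-based framework for class-agnostic object counting that integrates the self-supervised foundation model DINOv2 to recognize visual repetition and structural coherence. On FSC-147 it performs comparably to current state-of-the-art methods and does better on structurally complex scenes.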


📘 Detailed Summary

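Motivation: Existing counting models struggle to replicate the human ability to count by perceiving visual repetition and structural relationships rather than class identity, and they often miscount objects with complex shapes, internal symmetry, or overlapping components. This motivates class-agnostic counting methods that recognize repetition and structural coherence.

Method: CountFormer builds on the CounTR architecture, replacing its visual encoder with the self-supervised foundation model DINOv2 to obtain richer, spatially consistent feature representations. Positional embedding fusion preserves geometric relationships, and a lightweight convolutional decoder turns the features into density maps.

Result: On FSC-147, the model performs comparably to current state-of-the-art methods while showing higher accuracy on structurally intricate or densely packed scenes.

Conclusion: Integrating foundation models such as DINOv2 brings counting systems closer to human-like structural perception and advances a truly general, exemplar-free counting paradigm.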


📄 Abstract

Humans can effortlessly count diverse objects by perceiving visual repetition and structural relationships rather than relying on class identity. However, most existing counting models fail to replicate this ability; they often miscount when objects exhibit complex shapes, internal symmetry, or overlapping components. In this work, we introduce CountFormer, a transformer-based framework that learns to recognize repetition and structural coherence for class-agnostic object counting. Built upon the CounTR architecture, our model replaces its visual encoder with the self-supervised foundation model DINOv2, which produces richer and spatially consistent feature representations. We further incorporate positional embedding fusion to preserve geometric relationships before decoding these features into density maps through a lightweight convolutional decoder. Evaluated on the FSC-147 dataset, our model achieves performance comparable to current state-of-the-art methods while demonstrating superior accuracy on structurally intricate or densely packed scenes. Our findings indicate that integrating foundation models such as DINOv2 enables counting systems to approach human-like structural perception, advancing toward a truly general and exemplar-free counting paradigm.
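Density-map counters conventionally read off the object count by summing the predicted map. The sketch below illustrates that step together with a mean absolute error over a set of images; the function names and the toy density values are assumptions for illustration, not part of CountFormer.

```python
from typing import List, Sequence

DensityMap = List[List[float]]  # predicted density per pixel


def count_from_density(density_map: DensityMap) -> float:
    """The predicted count is the sum of all density values."""
    return sum(sum(row) for row in density_map)


def mean_absolute_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Average absolute difference between predicted and true counts."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Toy 3x3 density map (assumed values) whose mass adds up to two objects.
density_map = [
    [0.0, 0.25, 0.25],
    [0.0, 0.25, 0.25],
    [0.5, 0.5, 0.0],
]
predicted_count = count_from_density(density_map)
```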

[3] Improving Visual Discriminability of CLIP for Training-Free Open-Vocabulary Semantic Segmentation

Jinxin Zhou, Jiachen Jiang, Zhihui Zhu

🧩 TL;DR

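This paper proposes LHT-CLIP, a training-free framework that systematically exploits CLIP's visual discriminability at the layer, head, and token levels. It addresses the misalignment that arises when extending CLIP to semantic segmentation and achieves state-of-the-art performance on multiple benchmarks.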


📘 Detailed Summary

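Motivation: Extending CLIP to semantic segmentation is difficult because its image-level pre-training objective is misaligned with the pixel-level visual understanding that dense prediction requires. Prior methods achieved encouraging results by reorganizing the final layer and features, but they often inherit the global alignment bias of preceding layers, which leads to suboptimal segmentation.

Method: Three complementary techniques are proposed: semantic-spatial reweighting, selective head enhancement, and abnormal token replacement. They rest on a comprehensive analysis of CLIP's visual discriminability, which finds that the final layers mainly strengthen image-text alignment at the expense of visual discriminability, that a subset of attention heads shows consistently strong visual discriminability across datasets, and that abnormal tokens exhibit sparse, consistent activation patterns compared with normal tokens.

Result: Extensive experiments on 8 common semantic segmentation benchmarks show that LHT-CLIP achieves state-of-the-art performance across diverse scenarios, without any additional training, auxiliary pre-trained networks, or extensive hyperparameter tuning.

Conclusion: The study reveals key properties of visual discriminability in CLIP and offers an effective training-free solution, showing that systematic analysis of a model's internal mechanisms can improve performance on semantic segmentation.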


📄 Abstract

Extending CLIP models to semantic segmentation remains challenging due to the misalignment between their image-level pre-training objectives and the pixel-level visual understanding required for dense prediction. While prior efforts have achieved encouraging results by reorganizing the final layer and features, they often inherit the global alignment bias of preceding layers, leading to suboptimal segmentation performance. In this work, we propose LHT-CLIP, a novel training-free framework that systematically exploits the visual discriminability of CLIP across layer, head, and token levels. Through comprehensive analysis, we reveal three key insights: (i) the final layers primarily strengthen image-text alignment at the expense of visual discriminability (e.g., last 3 layers in ViT-B/16 and 8 layers in ViT-L/14), partly due to the emergence of anomalous tokens; (ii) a subset of attention heads (e.g., 10 out of 144 in ViT-B/16) display consistently strong visual discriminability across datasets; (iii) abnormal tokens display sparse and consistent activation patterns compared to normal tokens. Based on these findings, we propose three complementary techniques: semantic-spatial reweighting, selective head enhancement, and abnormal token replacement to effectively restore visual discriminability and improve segmentation performance without any additional training, auxiliary pre-trained networks, or extensive hyperparameter tuning. Extensive experiments on 8 common semantic segmentation benchmarks demonstrate that LHT-CLIP achieves state-of-the-art performance across diverse scenarios, highlighting its effectiveness and practicality for real-world deployment.

[4] DynaStride: Dynamic Stride Windowing with MMCoT for Instructional Multi-Scene Captioning

Eddison Pham, Prisha Priyadarshini, Adrian Maliackel, Kanishk Bandi, Cristian Meo, Kevin Zhu

🧩 TL;DR

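This paper proposes DynaStride, a pipeline that generates coherent scene-level captions for instructional videos without manual scene segmentation. Through adaptive frame sampling, multimodal windowing, and a dynamic stride selection algorithm, it balances temporal context against redundancy and outperforms existing baselines on multiple metrics.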


📘 Detailed Summary

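Motivation: Scene-level captioning of instructional videos requires understanding both visual cues and temporal structure. Captions that fail to capture this structure may lack coherence and quality, undermining the video's educational intent. Effective methods for generating high-quality scene-level captions automatically, without manual scene segmentation, are lacking.

Method: DynaStride uses adaptive frame sampling and multimodal windowing to capture key transitions within each scene. A multimodal chain-of-thought process produces multiple action-object pairs, which a dynamic stride window selection algorithm refines and fuses. The method uses the scene annotations of the YouCookII dataset and integrates visual semantics and temporal reasoning into a single instructional caption.

Result: Against strong baselines including VLLaMA3 and GPT-4o, DynaStride shows consistent gains on N-gram-based metrics (BLEU, METEOR) and semantic similarity measures (BERTScore, CLIPScore). Qualitative analysis further indicates that its captions are more temporally coherent and informative.

Conclusion: DynaStride points to a promising direction for improving AI-powered instructional content generation through adaptive temporal modeling and multimodal reasoning, balancing temporal context and redundancy while keeping captions coherent.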


📄 Abstract

Scene-level captioning in instructional videos can enhance learning by requiring an understanding of both visual cues and temporal structure. By aligning visual cues with textual guidance, this understanding supports procedural learning and multimodal reasoning, providing a richer context for skill acquisition. However, captions that fail to capture this structure may lack coherence and quality, which can create confusion and undermine the video's educational intent. To address this gap, we introduce DynaStride, a pipeline to generate coherent, scene-level captions without requiring manual scene segmentation. Using the YouCookII dataset's scene annotations, DynaStride performs adaptive frame sampling and multimodal windowing to capture key transitions within each scene. It then employs a multimodal chain-of-thought process to produce multiple action-object pairs, which are refined and fused using a dynamic stride window selection algorithm that adaptively balances temporal context and redundancy. The final scene-level caption integrates visual semantics and temporal reasoning in a single instructional caption. Empirical evaluations against strong baselines, including VLLaMA3 and GPT-4o, demonstrate consistent gains on both N-gram-based metrics (BLEU, METEOR) and semantic similarity measures (BERTScore, CLIPScore). Qualitative analyses further show that DynaStride produces captions that are more temporally coherent and informative, suggesting a promising direction for improving AI-powered instructional content generation.

[5] PlanarGS: High-Fidelity Indoor 3D Gaussian Splatting Guided by Vision-Language Planar Priors

Xirui Jin, Renbiao Jin, Boying Li, Danping Zou, Wenxian Yu

🧩 TL;DR

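This paper proposes PlanarGS, a 3D Gaussian Splatting framework for indoor scene reconstruction. It introduces language-prompted planar priors and geometric supervision to resolve the geometric ambiguity of 3DGS in large low-texture regions, substantially improving 3D surface reconstruction quality.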


📘 Detailed Summary

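Motivation: Standard 3D Gaussian Splatting struggles in indoor scenes, particularly in large low-texture regions, where relying on the photometric loss alone yields ambiguous geometry and fails to recover high-fidelity 3D surfaces.

Method: PlanarGS includes a Language-Prompted Planar Priors pipeline that uses a pretrained vision-language segmentation model and refines its region proposals through cross-view fusion and inspection with geometric priors. The 3D Gaussian optimization adds a planar prior supervision term that enforces planar consistency and a geometric prior supervision term that steers the Gaussians toward depth and normal cues.

Result: Extensive experiments on standard indoor benchmarks show that PlanarGS reconstructs accurate and detailed 3D surfaces, outperforming state-of-the-art methods by a large margin.

Conclusion: Combining language-guided planar priors with geometric supervision effectively addresses the geometric ambiguity of 3DGS in indoor scenes and offers a new route for Gaussian-splatting-based indoor reconstruction.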


📄 Abstract

Three-dimensional Gaussian Splatting (3DGS) has recently emerged as an efficient representation for novel-view synthesis, achieving impressive visual quality. However, in scenes dominated by large and low-texture regions, common in indoor environments, the photometric loss used to optimize 3DGS yields ambiguous geometry and fails to recover high-fidelity 3D surfaces. To overcome this limitation, we introduce PlanarGS, a 3DGS-based framework tailored for indoor scene reconstruction. Specifically, we design a pipeline for Language-Prompted Planar Priors (LP3) that employs a pretrained vision-language segmentation model and refines its region proposals via cross-view fusion and inspection with geometric priors. 3D Gaussians in our framework are optimized with two additional terms: a planar prior supervision term that enforces planar consistency, and a geometric prior supervision term that steers the Gaussians toward the depth and normal cues. We have conducted extensive experiments on standard indoor benchmarks. The results show that PlanarGS reconstructs accurate and detailed 3D surfaces, consistently outperforming state-of-the-art methods by a large margin. Project page: https://planargs.github.io
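The abstract does not state the exact form of the planar prior supervision term. One generic way to quantify planar consistency is the mean point-to-plane distance of 3D points assigned to a planar region; the sketch below illustrates that idea only. The plane parameterization (a normal vector plus a signed offset along the unit normal) and the name `planar_residual` are assumptions, not the authors' loss.

```python
import math
from typing import List, Sequence

Point = Sequence[float]  # (x, y, z)


def planar_residual(points: List[Point], normal: Point, offset: float) -> float:
    """Mean absolute distance of points to the plane unit_normal . x = offset.

    `offset` is the signed distance of the plane from the origin,
    measured along the normalized normal vector.
    """
    if not points:
        raise ValueError("need at least one point")
    norm = math.sqrt(sum(c * c for c in normal))
    if norm == 0.0:
        raise ValueError("normal must be non-zero")
    unit = [c / norm for c in normal]
    distances = [
        abs(sum(u * p for u, p in zip(unit, point)) - offset)
        for point in points
    ]
    return sum(distances) / len(distances)


# Toy example (assumed data): points hovering around the floor plane z = 0.
floor_points = [(0.0, 0.0, 0.1), (1.0, 1.0, -0.1), (2.0, 0.0, 0.0)]
residual = planar_residual(floor_points, normal=(0.0, 0.0, 2.0), offset=0.0)
```

A term of this kind shrinks toward zero as the points flatten onto the plane, which is the behavior a planar consistency penalty is meant to encourage.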

[6] Reasoning Visual Language Model for Chest X-Ray Analysis

Andriy Myronenko, Dong Yang, Baris Turkbey, Mariam Aboian, Sena Azamat, Esra Akcicek, Hongxu Yin, Pavlo Molchanov, Marc Edgar, Yufan He, Pengfei Guo, Yucheng Tang, Daguang Xu

🧩 TL;DR

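This work proposes a framework that brings chain-of-thought reasoning to chest X-ray interpretation. By combining high-fidelity visual encoding with a two-stage training recipe, it improves interpretability and clinical auditability while maintaining competitive multi-label classification performance.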


📘 Detailed Summary

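Motivation: Vision-language models perform well in medical image analysis, but most lack transparency and do not provide the stepwise reasoning clinicians rely on, which limits their trustworthiness and safety in clinical practice.

Method: The framework couples a high-fidelity visual encoder with two-stage training: reasoning-style supervised fine-tuning, followed by reinforcement learning with verifiable rewards over a list of X-ray abnormalities. The model outputs reasoning traces that mirror radiologists' systematic thought process, uncertainty, and differential diagnosis.

Result: In out-of-distribution evaluation, the approach achieves competitive multi-label classification. In a reader study with expert radiologists, full reasoning traces increased confidence, supported error auditing, and reduced the time to finalize reports.

Conclusion: The work stresses that reasoning quality matters as much as prediction quality in medical imaging tasks and points toward trustworthy, explainable AI systems, particularly for clinical applications that need transparent decision processes.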


📄 Abstract

Vision-language models (VLMs) have shown strong promise for medical image analysis, but most remain opaque, offering predictions without the transparent, stepwise reasoning clinicians rely on. We present a framework that brings chain-of-thought (CoT) reasoning to chest X-ray interpretation. Inspired by reasoning-first training paradigms, our approach is designed to learn how experts reason, not just what they conclude, by aligning intermediate steps with observable image evidence and radiology workflow. Beyond accuracy, the explicit reasoning traces support clinical auditability: they reveal why a conclusion was reached, which alternatives were considered, and where uncertainty remains, enabling quality assurance, error analysis, and safer human-AI collaboration. Our model couples high-fidelity visual encoding with a two-stage training recipe: a reasoning-style supervised fine-tuning (SFT) followed by reinforcement learning (RL) that uses verifiable rewards over a list of X-ray abnormalities. The model outputs reasoning that mirrors radiologists' systematic thought process, uncertainty, and differential diagnosis. In out-of-distribution evaluation, the approach achieves competitive multi-label classification while improving interpretability. In a reader study with expert radiologists, full reasoning traces increased confidence, supported error auditing, and reduced time to finalize reports. We release code and the model NV-Reason-CXR-3B to support community progress toward trustworthy, explainable AI in chest radiography and other medical imaging tasks where reasoning quality is as critical as prediction quality.

[7] TeleEgo: Benchmarking Egocentric AI Assistants in the Wild

Jiaqi Yan, Ruilong Ren, Jingren Liu, Shuning Xu, Ling Wang, Yiheng Wang, Yun Wang, Long Zhang, Xiangyu Chen, Changzhi Sun, Jixiang Luo, Dell Zhang, Hao Sun, Chi Zhang, Xuelong Li

🧩 TL;DR

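This paper introduces TeleEgo, a long-duration, streaming, omni-modal benchmark for evaluating egocentric AI assistants. It contains over 14 hours per participant of synchronized egocentric video, audio, and text, defines 12 diagnostic subtasks, and targets the gaps of existing benchmarks in evaluating real-time behavior, multimodal processing, and long-term memory.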


📘 Detailed Summary

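Motivation: Existing benchmarks typically evaluate the abilities of egocentric AI assistants in isolation, lack realistic streaming scenarios, or support only short-term tasks, so they cannot fully assess the real-time processing, multimodal input handling, and long-term memory retention needed in real-world settings.

Method: The dataset provides synchronized multimodal data across four domains (work and study, lifestyle and routines, social activities, and outings and culture), aligned on a unified global timeline and accompanied by high-quality visual narrations and speech transcripts. It defines 12 diagnostic subtasks across three core capabilities (Memory, Understanding, and Cross-Memory Reasoning), contains 3,291 human-verified QA items, and proposes two key metrics: Real-Time Accuracy and Memory Persistence Time.

Result: TeleEgo contains over 14 hours of multimodal data per participant and supports strict streaming evaluation over multiple question formats. The two proposed metrics jointly assess correctness, temporal responsiveness, and long-term retention.

Conclusion: TeleEgo offers a realistic and comprehensive evaluation framework for developing practical AI assistants, with particular emphasis on long-term memory retention and real-time responsiveness in everyday settings.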


📄 Abstract

Egocentric AI assistants in real-world settings must process multi-modal inputs (video, audio, text), respond in real time, and retain evolving long-term memory. However, existing benchmarks typically evaluate these abilities in isolation, lack realistic streaming scenarios, or support only short-term tasks. We introduce TeleEgo, a long-duration, streaming, omni-modal benchmark for evaluating egocentric AI assistants in realistic daily contexts. The dataset features over 14 hours per participant of synchronized egocentric video, audio, and text across four domains: work & study, lifestyle & routines, social activities, and outings & culture. All data is aligned on a unified global timeline and includes high-quality visual narrations and speech transcripts, curated through human refinement. TeleEgo defines 12 diagnostic subtasks across three core capabilities: Memory (recalling past events), Understanding (interpreting the current moment), and Cross-Memory Reasoning (linking distant events). It contains 3,291 human-verified QA items spanning multiple question formats (single-choice, binary, multi-choice, and open-ended), evaluated strictly in a streaming setting. We propose two key metrics -- Real-Time Accuracy and Memory Persistence Time -- to jointly assess correctness, temporal responsiveness, and long-term retention. TeleEgo provides a realistic and comprehensive evaluation to advance the development of practical AI assistants.

[8] Mars-Bench: A Benchmark for Evaluating Foundation Models for Mars Science Tasks

Mirali Purohit, Bimal Gajera, Vatsal Malaviya, Irish Mehta, Kunal Kasodekar, Jacob Adler, Steven Lu, Umaa Rebbapragada, Hannah Kerner

🧩 TL;DR

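This paper introduces Mars-Bench, the first benchmark designed to systematically evaluate models on Mars science tasks. It comprises 20 datasets spanning classification, segmentation, and detection, and aims to establish a standardized foundation for developing machine learning models for Mars.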


📘 Detailed Summary

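Motivation: Foundation models have made notable progress in fields such as Earth observation, but their application to Mars science remains limited. The main obstacle is the lack of standardized benchmarks and evaluation frameworks, which has slowed the development of Mars-specific foundation models.

Method: Mars-Bench comprises 20 standardized datasets spanning classification, segmentation, and object detection, focused on key geologic features such as craters, cones, boulders, and frost. It provides baseline evaluations using models pre-trained on natural images, Earth satellite data, and state-of-the-art vision-language models.

Result: Results from all analyses suggest that Mars-specific foundation models may offer advantages over general-domain counterparts, motivating further exploration of domain-adapted pre-training. The benchmark establishes a standardized basis for developing and comparing machine learning models for Mars science.

Conclusion: Mars-specific foundation models show potential advantages over general-purpose models on Mars science tasks, and Mars-Bench provides a standardized evaluation framework for future Mars machine learning research and for work on domain-adapted pre-training.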


📄 Abstract

Foundation models have enabled rapid progress across many specialized domains by leveraging large-scale pre-training on unlabeled data, demonstrating strong generalization to a variety of downstream tasks. While such models have gained significant attention in fields like Earth Observation, their application to Mars science remains limited. A key enabler of progress in other domains has been the availability of standardized benchmarks that support systematic evaluation. In contrast, Mars science lacks such benchmarks and standardized evaluation frameworks, which has limited progress toward developing foundation models for Martian tasks. To address this gap, we introduce Mars-Bench, the first benchmark designed to systematically evaluate models across a broad range of Mars-related tasks using both orbital and surface imagery. Mars-Bench comprises 20 datasets spanning classification, segmentation, and object detection, focused on key geologic features such as craters, cones, boulders, and frost. We provide standardized, ready-to-use datasets and baseline evaluations using models pre-trained on natural images, Earth satellite data, and state-of-the-art vision-language models. Results from all analyses suggest that Mars-specific foundation models may offer advantages over general-domain counterparts, motivating further exploration of domain-adapted pre-training. Mars-Bench aims to establish a standardized foundation for developing and comparing machine learning models for Mars science. Our data, models, and code are available at: https://mars-bench.github.io/.

[9] AutoPrompt: Automated Red-Teaming of Text-to-Image Models via LLM-Driven Adversarial Prompts

Yufan Liu, Wanqian Zhang, Huashan Chen, Lin Wang, Xiaojun Jia, Zheng Lin, Weiping Wang

🧩 TL;DR

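This paper proposes APT (AutoPrompT), a black-box framework that uses large language models to automatically generate human-readable adversarial suffixes for evaluating the safety vulnerabilities of text-to-image models. With an alternating optimization-finetuning pipeline and a dual-evasion strategy, it bypasses both perplexity-based filters and blacklist word filters, showing strong red-teaming performance and zero-shot transferability.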


📘 Detailed Summary

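Motivation: The safety mechanisms of current text-to-image models are vulnerable to adversarial prompts. Existing red-teaming methods usually require white-box access, rely on inefficient per-prompt optimization, and tend to produce semantically meaningless prompts that filters easily block.

Method: The paper proposes an alternating optimization-finetuning pipeline that alternates between optimizing adversarial suffixes and fine-tuning the LLM on the optimized suffixes. A dual-evasion strategy constrains generation toward human-readable prompts through auxiliary LLM perplexity scoring and adds banned-token penalties to suppress explicitly blacklisted words.

Result: Extensive experiments show that the resulting human-readable, filter-resistant adversarial prompts achieve strong red-teaming performance and superior zero-shot transferability, adapting instantly to unseen prompts and exposing critical vulnerabilities in commercial APIs.

Conclusion: The study exposes the fragility of safety mechanisms in text-to-image models and provides a black-box framework for assessing and strengthening model safety, while showing that human-readable adversarial prompts can bypass existing defenses.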


📄 Abstract

Despite rapid advancements in text-to-image (T2I) models, their safety mechanisms are vulnerable to adversarial prompts, which maliciously generate unsafe images. Current red-teaming methods for proactively assessing such vulnerabilities usually require white-box access to T2I models, rely on inefficient per-prompt optimization, and inevitably generate semantically meaningless prompts that are easily blocked by filters. In this paper, we propose APT (AutoPrompT), a black-box framework that leverages large language models (LLMs) to automatically generate human-readable adversarial suffixes for benign prompts. We first introduce an alternating optimization-finetuning pipeline between adversarial suffix optimization and fine-tuning the LLM utilizing the optimized suffix. Furthermore, we integrate a dual-evasion strategy in the optimization phase, enabling the bypass of both perplexity-based filters and blacklist word filters: (1) we constrain the LLM to generate human-readable prompts through an auxiliary LLM perplexity scoring, which starkly contrasts with prior token-level gibberish, and (2) we also introduce banned-token penalties to suppress the explicit generation of banned tokens in the blacklist. Extensive experiments demonstrate the excellent red-teaming performance of our human-readable, filter-resistant adversarial prompts, as well as superior zero-shot transferability which enables instant adaptation to unseen prompts and exposes critical vulnerabilities even in commercial APIs (e.g., Leonardo.Ai).
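For readers unfamiliar with the two defenses named in the abstract, the sketch below shows how a perplexity-based filter and a blacklist word filter are commonly built on the defender's side. It assumes per-token natural-log probabilities are already available from some language model; the function names, threshold, and blacklist are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence, Set


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Exponential of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def contains_banned_word(prompt: str, blacklist: Set[str]) -> bool:
    """Case-insensitive word match; blacklist entries are assumed lowercase."""
    words = {word.strip(".,!?;:").lower() for word in prompt.split()}
    return not words.isdisjoint(blacklist)


def passes_filters(
    prompt: str,
    token_logprobs: Sequence[float],
    blacklist: Set[str],
    max_perplexity: float,
) -> bool:
    """Accept a prompt only if it reads fluently and has no banned word."""
    fluent = perplexity(token_logprobs) <= max_perplexity
    return fluent and not contains_banned_word(prompt, blacklist)


# Assumed log-probabilities: a fluent prompt versus token-level gibberish.
fluent_logprobs = [math.log(0.5)] * 4
gibberish_logprobs = [math.log(0.001)] * 4
```

Token-level gibberish produces very high perplexity and is rejected by the first check, which is why the abstract stresses human-readable prompts when evaluating such filters.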

[10] Enhancing CLIP Robustness via Cross-Modality Alignment

Xingyu Zhu, Beier Zhu, Shuo Wang, Kesen Zhao, Hanwang Zhang

🧩 TL;DR

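This paper proposes COLA, an optimal-transport-based cross-modality alignment framework that improves the adversarial robustness of vision-language models by restoring global image-text alignment and local structural consistency under adversarial perturbations. The method is training-free, compatible with existing fine-tuned models, and yields clear gains across 14 zero-shot classification benchmarks.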


📘 Detailed Summary

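Motivation: Vision-language models such as CLIP perform well in zero-shot classification but are highly vulnerable to adversarial perturbations. Existing methods focus on adversarial fine-tuning or prompt optimization and overlook the cross-modal alignment gap in CLIP's encoded features. Adversarial perturbations amplify this misalignment significantly, causing severe drops in classification performance.

Method: COLA first projects adversarial image embeddings onto the subspace spanned by class text features, filtering out non-semantic distortions while preserving discriminative information. It then models images and texts as discrete distributions over multiple augmented views and refines their alignment via optimal transport, with the subspace projection integrated into the cost computation.

Result: Evaluations on 14 zero-shot classification benchmarks show that COLA yields an average improvement of 6.7% on ImageNet and its variants under PGD adversarial attacks while maintaining high accuracy on clean samples. The method is training-free and compatible with existing fine-tuned models.

Conclusion: By explicitly addressing adversarial misalignment, COLA demonstrates the importance of cross-modal feature alignment for the robustness of vision-language models and offers a way to improve adversarial robustness without additional training.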


📄 Abstract

Vision-language models (VLMs) such as CLIP demonstrate strong generalization in zero-shot classification but remain highly vulnerable to adversarial perturbations. Existing methods primarily focus on adversarial fine-tuning or prompt optimization; they often overlook the gap in CLIP's encoded features, where text and image features lie far apart from each other. This misalignment is significantly amplified under adversarial perturbations, leading to severe degradation in classification performance. To address this problem, we propose Cross-modality Alignment, dubbed COLA, an optimal transport-based framework that explicitly addresses adversarial misalignment by restoring both global image-text alignment and local structural consistency in the feature space. (1) COLA first projects adversarial image embeddings onto a subspace spanned by class text features, effectively filtering out non-semantic distortions while preserving discriminative information. (2) It then models images and texts as discrete distributions over multiple augmented views and refines their alignment via OT, with the subspace projection seamlessly integrated into the cost computation. This design ensures stable cross-modal alignment even under adversarial conditions. COLA is training-free and compatible with existing fine-tuned models. Extensive evaluations across 14 zero-shot classification benchmarks demonstrate the effectiveness of COLA, especially with an average improvement of 6.7% on ImageNet and its variants under PGD adversarial attacks, while maintaining high accuracy on clean samples.
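Step (1) above, projecting an image embedding onto the subspace spanned by class text features, is standard linear algebra and can be sketched without any library. The version below uses Gram-Schmidt orthonormalization followed by an orthogonal projection; it is an illustration of the projection idea only, with hypothetical function names and toy 3-dimensional vectors, and it does not cover the optimal transport step.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(vectors: List[Vector], tol: float = 1e-12) -> List[List[float]]:
    """Modified Gram-Schmidt; linearly dependent vectors are skipped."""
    basis: List[List[float]] = []
    for vector in vectors:
        w = list(vector)
        for q in basis:
            coeff = dot(w, q)
            w = [wi - coeff * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis


def project_onto_span(x: Vector, vectors: List[Vector]) -> List[float]:
    """Orthogonal projection of x onto the span of `vectors`."""
    projection = [0.0] * len(x)
    for q in orthonormal_basis(vectors):
        coeff = dot(x, q)
        projection = [p + coeff * qi for p, qi in zip(projection, q)]
    return projection


# Toy example (assumed data): two "text features" spanning the x-y plane.
text_features = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
image_embedding = [0.3, 0.4, 0.9]
projected = project_onto_span(image_embedding, text_features)
```

In this toy case the component along the third axis, which no text feature can explain, is removed, mirroring the idea of discarding directions outside the text subspace.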

[11] VC4VG: Optimizing Video Captions for Text-to-Video Generation

Yang Du, Zhuoran Lin, Kaiqiang Song, Biao Wang, Zhicheng Zheng, Tiezheng Ge, Bo Zheng, Qin Jin

🧩 TL;DR

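This paper proposes VC4VG, a framework that optimizes video captions specifically for text-to-video generation. It systematically analyzes the key elements needed for video reconstruction and builds a multi-dimensional evaluation benchmark, and thereby improves the generation quality of T2V models.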


📘 Detailed Summary

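Motivation: The text-to-video generation community recognizes the importance of high-quality video-text pairs, but caption strategies optimized specifically for T2V training remain underexplored, and systematic caption design methods and evaluation standards are lacking.

Method: VC4VG analyzes caption content from a T2V perspective, decomposes the essential elements required for video reconstruction into multiple dimensions, and establishes a principled caption design methodology. It also constructs VC4VG-Bench, a benchmark with fine-grained, multi-dimensional, and necessity-graded metrics.

Result: Extensive T2V fine-tuning experiments show a strong correlation between improved caption quality and video generation performance, validating the approach.

Conclusion: The study establishes the impact of video caption optimization on T2V generation performance and provides systematic evaluation tools and methodology for follow-up research on data quality in text-to-video generation.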


📄 Abstract

Recent advances in text-to-video (T2V) generation highlight the critical role of high-quality video-text pairs in training models capable of producing coherent and instruction-aligned videos. However, strategies for optimizing video captions specifically for T2V training remain underexplored. In this paper, we introduce VC4VG (Video Captioning for Video Generation), a comprehensive caption optimization framework tailored to the needs of T2V models. We begin by analyzing caption content from a T2V perspective, decomposing the essential elements required for video reconstruction into multiple dimensions, and proposing a principled caption design methodology. To support evaluation, we construct VC4VG-Bench, a new benchmark featuring fine-grained, multi-dimensional, and necessity-graded metrics aligned with T2V-specific requirements. Extensive T2V fine-tuning experiments demonstrate a strong correlation between improved caption quality and video generation performance, validating the effectiveness of our approach. We release all benchmark tools and code at https://github.com/qyr0403/VC4VG to support further research.

[12] Beyond Objects: Contextual Synthetic Data Generation for Fine-Grained Classification

William Yang, Xindi Wu, Zhiwei Deng, Esin Tureci, Olga Russakovsky

🧩 TL;DR

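This paper proposes BOB, a fine-tuning strategy that extracts class-agnostic attributes and conditions on them explicitly to mitigate overfitting of T2I models in fine-grained classification. It achieves state-of-the-art performance on multiple benchmarks for low-shot fine-grained classification with synthetic data augmentation.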


📘 Detailed Summary

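Motivation: Text-to-image models are increasingly used to generate synthetic datasets, but producing effective synthetic training data for classification remains challenging. Fine-tuning a T2I model can improve synthetic data quality but may cause overfitting and reduce sample diversity, a problem that is especially pronounced in fine-grained classification.

Method: BOB first extracts class-agnostic attributes, such as scene background and object pose, from a small set of real examples. It then conditions on these attributes explicitly while fine-tuning the T2I model and marginalizes them out during generation. This design mitigates overfitting, preserves the T2I model's generative prior, reduces estimation errors, and minimizes unintended inter-class associations.

Result: Extensive experiments across multiple T2I models, backbones, and datasets show state-of-the-art performance in low-shot fine-grained classification with synthetic data augmentation. BOB outperforms DataDream by 7.4% on the Aircraft dataset. On three of four benchmarks, fine-tuning with 5 real images augmented with BOB beats fine-tuning with 10 real images. BOB outperforms prior art in 18 of 24 experimental settings, with accuracy gains above 2% in 14 of them.

Conclusion: By explicitly conditioning on and marginalizing class-agnostic attributes, BOB mitigates overfitting of T2I models in fine-grained classification while preserving generative diversity, offering an effective way to address low-shot learning with synthetic data augmentation.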


📄 Abstract

Text-to-image (T2I) models are increasingly used for synthetic dataset generation, but generating effective synthetic training data for classification remains challenging. Fine-tuning a T2I model with a few real examples can help improve the quality of synthetic training data; however, it may also cause overfitting and reduce diversity in the generated samples. We propose a fine-tuning strategy BOB (BeyondOBjects) to mitigate these concerns for fine-grained classification. Given a small set of real examples, we first extract class-agnostic attributes such as scene background and object pose. We then explicitly condition on these attributes during fine-tuning of the T2I model and marginalize them out during generation. This design mitigates overfitting, preserves the T2I model's generative prior, reduces estimation errors, and further minimizes unintended inter-class associations. Extensive experiments across multiple T2I models, backbones, and datasets show that our method achieves state-of-the-art performance in low-shot fine-grained classification when augmented with synthetic data. Concretely, BOB outperforms DataDream by 7.4% on the Aircraft dataset (from 50.0% to 57.4% when fine-tuning a CLIP classifier with five real images augmented with 100 synthetic images). In three of the four benchmarks, fine-tuning downstream models with 5 real images augmented with BOB achieves better performance than fine-tuning with 10 real images. Collectively, BOB outperforms prior art in 18 of 24 experimental settings, with 2+% accuracy improvements in 14 of these settings.

[13] Compositional Image Synthesis with Inference-Time Scaling

Minsuk Ji, Sanghyeok Lee, Namhyuk Ahn

🧩 TL;DR

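This paper proposes a training-free compositional framework that combines an object-centric approach with self-refinement to improve layout faithfulness in text-to-image generation. It uses large language models to synthesize explicit layouts and a vision-language model to rerank candidates iteratively, improving the alignment between scenes and prompts.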


📘 Detailed Summary

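Motivation: Modern text-to-image models still fall short on compositionality, often failing to render accurate object counts, attributes, and spatial relations. This lack of layout faithfulness limits their usefulness for complex scene generation.

Method: The training-free framework first uses a large language model to synthesize explicit layouts from the input prompt and injects them into the image generation process. During generation, an object-centric vision-language model iteratively reranks multiple candidate images and selects the one best aligned with the prompt, unifying explicit layout grounding with self-refinement-based inference-time scaling.

Result: Experiments show stronger scene-prompt alignment than recent text-to-image models. Explicit layout guidance and iterative refinement improve the accuracy of object counts, attributes, and spatial relations while preserving aesthetic quality.

Conclusion: The study shows that combining explicit layout guidance with self-refinement improves the faithfulness of text-to-image generation, and this training-free approach could extend to more complex scene understanding and generation tasks.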


📄 Abstract

Despite their impressive realism, modern text-to-image models still struggle with compositionality, often failing to render accurate object counts, attributes, and spatial relations. To address this challenge, we present a training-free framework that combines an object-centric approach with self-refinement to improve layout faithfulness while preserving aesthetic quality. Specifically, we leverage large language models (LLMs) to synthesize explicit layouts from input prompts, and we inject these layouts into the image generation process, where an object-centric vision-language model (VLM) judge reranks multiple candidates to select the most prompt-aligned outcome iteratively. By unifying explicit layout-grounding with self-refine-based inference-time scaling, our framework achieves stronger scene alignment with prompts compared to recent text-to-image models. The code is available at https://github.com/gcl-inha/ReFocus.
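The iterative judge-and-rerank loop can be pictured as repeated best-of-N selection. The sketch below is a generic version under stated assumptions: `propose` and `judge` are hypothetical callables standing in for the image generator and the VLM judge, and the toy example replaces images with integers so the loop can run on its own. It is not the authors' implementation.

```python
from typing import Callable, Optional, Tuple, TypeVar

Candidate = TypeVar("Candidate")


def select_best(
    propose: Callable[[Optional[Candidate], int], Candidate],
    judge: Callable[[Candidate], float],
    rounds: int,
    candidates_per_round: int,
) -> Tuple[Candidate, float]:
    """Repeated best-of-N: each round proposes candidates from the incumbent."""
    if rounds < 1 or candidates_per_round < 1:
        raise ValueError("rounds and candidates_per_round must be positive")
    best: Optional[Candidate] = None
    best_score = float("-inf")
    for _ in range(rounds):
        incumbent = best  # every proposal in this round refines the same incumbent
        for index in range(candidates_per_round):
            candidate = propose(incumbent, index)
            score = judge(candidate)
            if score > best_score:
                best, best_score = candidate, score
    return best, best_score


# Toy stand-ins (assumptions): a "candidate" is the number of objects drawn,
# and the judge rewards candidates close to the four objects requested.
def toy_propose(incumbent: Optional[int], index: int) -> int:
    return (incumbent or 0) + index


def toy_judge(candidate: int) -> float:
    return -abs(candidate - 4)


best_candidate, best_score = select_best(toy_propose, toy_judge, rounds=3, candidates_per_round=3)
```

Spending more rounds or more candidates per round is the inference-time scaling knob in this picture: quality can only stay the same or improve, at the cost of extra generation and judging calls.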

[14] ViPER: Empowering the Self-Evolution of Visual Perception Abilities in Vision-Language Model

Juntian Zhang, Song Jin, Chuanqi Cheng, Yuhan Liu, Yankai Lin, Xun Zhang, Yufei Zhang, Fei Jiang, Guojun Yin, Wei Lin, Rui Yan

🧩 TL;DR

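This paper proposes ViPER, a framework that addresses the fine-grained visual perception bottleneck of vision-language models through coarse-to-fine progressive perception learning and a two-stage reinforcement learning strategy, improving perception while preserving general capabilities.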


📘 Detailed Summary

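Motivation: Limited fine-grained visual perception is a critical bottleneck for vision-language models in real-world applications. Existing methods have clear limitations: supervised fine-tuning often compromises general capabilities, while reinforcement fine-tuning prioritizes textual reasoning over visual perception. This motivates an approach that improves perception without sacrificing generality.

Method: ViPER is a self-bootstrapping framework that structures visual perception learning as a coarse-to-fine progressive process. It combines image-level and instance-level reconstruction tasks with a two-stage reinforcement learning strategy to form a closed-loop training paradigm in which internally synthesized data directly drives the iterative evolution of perceptual ability.

Result: Applying ViPER to the Qwen2.5-VL family produces the Qwen-Viper series, which gains 1.7% on average across seven comprehensive benchmarks and up to 6.0% on fine-grained perception, while maintaining consistent performance and generalizability across vision-language scenarios.

Conclusion: Beyond enabling self-improvement in perception, ViPER provides concrete evidence for the reciprocal relationship between generation and understanding, opening a path toward more autonomous and capable vision-language models.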


📄 Abstract

The limited capacity for fine-grained visual perception presents a critical bottleneck for Vision-Language Models (VLMs) in real-world applications. Addressing this is challenging due to the scarcity of high-quality data and the limitations of existing methods: supervised fine-tuning (SFT) often compromises general capabilities, while reinforcement fine-tuning (RFT) prioritizes textual reasoning over visual perception. To bridge this gap, we propose a novel two-stage task that structures visual perception learning as a coarse-to-fine progressive process. Based on this task formulation, we develop ViPER, a self-bootstrapping framework specifically designed to enable iterative evolution through self-critiquing and self-prediction. By synergistically integrating image-level and instance-level reconstruction with a two-stage reinforcement learning strategy, ViPER establishes a closed-loop training paradigm, where internally synthesized data directly fuel the enhancement of perceptual ability. Applied to the Qwen2.5-VL family, ViPER produces the Qwen-Viper series. With an average gain of 1.7% on seven comprehensive benchmarks spanning various tasks and up to 6.0% on fine-grained perception, Qwen-Viper consistently demonstrates superior performance across different vision-language scenarios while maintaining generalizability. Beyond enabling self-improvement in perceptual capabilities, ViPER provides concrete evidence for the reciprocal relationship between generation and understanding, a breakthrough to developing more autonomous and capable VLMs.

[15] Enhancing Vision-Language Models for Autonomous Driving through Task-Specific Prompting and Spatial Reasoning

Aodi Wu, Xubo Luo

🧩 TL;DR

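This paper presents a systematic vision-language model framework for autonomous driving scene understanding. Using a Mixture-of-Prompts router, task-specific prompts, a visual assembly module, and tuned inference parameters, it achieves strong results in the RoboSense Challenge.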


📘 Detailed Summary

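Motivation: The work addresses performance bottlenecks of vision-language models in autonomous driving scene understanding, especially when perception, prediction, planning, and corruption detection tasks are mixed and different question types interfere with one another.

Method: The framework has four core components. A Mixture-of-Prompts router classifies questions and dispatches them to task-specific expert prompts. Task-specific prompts embed explicit coordinate systems, spatial reasoning rules, role-playing, and Chain-of-Thought/Tree-of-Thought reasoning. A visual assembly module composes multi-view images with object crops according to question requirements. Model inference parameters are configured per task.

Result: Implemented on Qwen2.5-VL-72B, the approach achieves 70.87% average accuracy on Phase-1 (clean data) and 72.85% on Phase-2 (corrupted data), showing that structured prompting and spatial grounding are effective for safety-critical autonomous driving tasks.

Conclusion: Systematic prompt engineering and spatial grounding can substantially improve vision-language model performance on complex autonomous driving tasks, and the released code and prompt templates support reproducibility.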


📄 Abstract

This technical report presents our solution for the RoboSense Challenge at IROS 2025, which evaluates Vision-Language Models (VLMs) on autonomous driving scene understanding across perception, prediction, planning, and corruption detection tasks. We propose a systematic framework built on four core components. First, a Mixture-of-Prompts router classifies questions and dispatches them to task-specific expert prompts, eliminating interference across diverse question types. Second, task-specific prompts embed explicit coordinate systems, spatial reasoning rules, role-playing, Chain-of-Thought/Tree-of-Thought reasoning, and few-shot examples tailored to each task. Third, a visual assembly module composes multi-view images with object crops, magenta markers, and adaptive historical frames based on question requirements. Fourth, we configure model inference parameters (temperature, top-p, message roles) per task to optimize output quality. Implemented on Qwen2.5-VL-72B, our approach achieves 70.87% average accuracy on Phase-1 (clean data) and 72.85% on Phase-2 (corrupted data), demonstrating that structured prompting and spatial grounding substantially enhance VLM performance on safety-critical autonomous driving tasks. Code and prompts are available at https://github.com/wuaodi/UCAS-CSU-phase2.
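The routing idea in the first component can be sketched as a classifier that maps each question to one of the four task types and then fills in a task-specific template. The abstract does not say how the router classifies questions, so the keyword rules, template texts, and function names below are all hypothetical placeholders rather than the released prompts.

```python
from typing import Dict, Tuple

# Hypothetical keyword rules; checked in insertion order, first match wins.
TASK_KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "corruption": ("corrupt", "noise", "blur"),
    "planning": ("should the ego", "safe action", "plan"),
    "prediction": ("will", "next", "future"),
}

# Hypothetical expert prompt templates, one per task type.
TASK_PROMPTS: Dict[str, str] = {
    "perception": "You are a driving perception expert. {question}",
    "prediction": "You are a motion forecasting expert. Reason step by step. {question}",
    "planning": "You are a driving planner. Consider safety first. {question}",
    "corruption": "You are an image quality inspector. {question}",
}


def route(question: str) -> str:
    """Return the task label for a question, defaulting to perception."""
    lowered = question.lower()
    for task, keywords in TASK_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return task
    return "perception"


def build_prompt(question: str) -> str:
    """Dispatch the question to its task-specific prompt template."""
    return TASK_PROMPTS[route(question)].format(question=question)


prompt = build_prompt("What will the pedestrian do next?")
```

Keeping one template per task means instructions written for planning questions never leak into perception questions, which is the interference the router is meant to remove.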

[16] Latent Sketchpad: Sketching Visual Thoughts to Elicit Multimodal Reasoning in MLLMs

Huanyu Zhang, Wenshan Wu, Chengzu Li, Ning Shang, Yan Xia, Yangyu Huang, Yifan Zhang, Li Dong, Zhang Zhang, Liang Wang, Tieniu Tan, Furu Wei

🧩 TL;DR

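This paper introduces Latent Sketchpad, a framework that equips multimodal large language models with an internal visual scratchpad and integrates visual generation directly into autoregressive reasoning, strengthening visual planning and imagination in complex scenarios.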


📘 Detailed Summary

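Motivation: Although MLLMs excel at visual understanding, they often struggle in complex scenarios that require visual planning and imagination. Their internal visual representations have traditionally been confined to perceptual understanding and do not support generative visual thought.

Method: The framework has two core components: a Context-Aware Vision Head that autoregressively produces visual representations, and a pretrained Sketch Decoder that renders them into human-interpretable images. Together they integrate visual generation directly into the model's native autoregressive reasoning process.

Result: Experiments on the MazePlanning dataset show that, across various MLLMs, Latent Sketchpad delivers reasoning performance comparable or even superior to the backbone models, and it generalizes across frontier MLLMs including Gemma3 and Qwen2.5-VL.

Conclusion: By extending the model's textual reasoning to visual thinking, the framework opens new opportunities for human-computer interaction and broader applications, letting models develop ideas much as humans do when they think through sketches.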


📄 Abstract

While Multimodal Large Language Models (MLLMs) excel at visual understanding, they often struggle in complex scenarios that require visual planning and imagination. Inspired by how humans use sketching as a form of visual thinking to develop and communicate ideas, we introduce Latent Sketchpad, a framework that equips MLLMs with an internal visual scratchpad. The internal visual representations of MLLMs have traditionally been confined to perceptual understanding. We repurpose them to support generative visual thought without compromising reasoning ability. Building on frontier MLLMs, our approach integrates visual generation directly into their native autoregressive reasoning process. It allows the model to interleave textual reasoning with the generation of visual latents. These latents guide the internal thought process and can be translated into sketch images for interpretability. To realize this, we introduce two components: a Context-Aware Vision Head that autoregressively produces visual representations, and a pretrained Sketch Decoder that renders these into human-interpretable images. We evaluate the framework on our new dataset MazePlanning. Experiments across various MLLMs show that Latent Sketchpad delivers comparable or even superior reasoning performance to their backbones. It further generalizes across distinct frontier MLLMs, including Gemma3 and Qwen2.5-VL. By extending the model's textual reasoning to visual thinking, our framework opens new opportunities for richer human-computer interaction and broader applications. More details and resources are available on our project page: https://latent-sketchpad.github.io/.

[17] SCOPE: Saliency-Coverage Oriented Token Pruning for Efficient Multimodel LLMs

Jinhong Deng, Wen Li, Joey Tianyi Zhou, Yang He

🧩 TL;DR

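This paper proposes SCOPE, a visual token pruning strategy that jointly models saliency and coverage to make multimodal large language models more efficient, reducing computational overhead while preserving semantic completeness.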


📘 Detailed Summary

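Motivation: MLLMs typically process a large number of visual tokens, which incurs considerable computational overhead. Existing visual token pruning methods mainly select the most salient tokens by attention score, which leaves the selected tokens semantically incomplete.

Method: SCOPE introduces a set-coverage measure based on token relationships, defines a token-coverage gain for each unselected token, integrates the saliency score into that gain to form the SCOPE score, and iteratively selects the token with the highest SCOPE score.

Result: Extensive experiments on multiple vision-language understanding benchmarks with the LLaVA-1.5 and LLaVA-Next models show that the method consistently outperforms prior approaches.

Conclusion: By jointly optimizing saliency and coverage, SCOPE addresses the semantic incompleteness of existing visual token pruning and offers a new direction for efficient MLLMs.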


📄 Abstract

Multimodal Large Language Models (MLLMs) typically process a large number of visual tokens, leading to considerable computational overhead, even though many of these tokens are redundant. Existing visual token pruning methods primarily focus on selecting the most salient tokens based on attention scores, resulting in the semantic incompleteness of the selected tokens. In this paper, we propose a novel visual token pruning strategy, called Saliency-Coverage Oriented token Pruning for Efficient MLLMs (SCOPE), to jointly model both the saliency and coverage of the selected visual tokens to better preserve semantic completeness. Specifically, we introduce a set-coverage for a given set of selected tokens, computed based on the token relationships. We then define a token-coverage gain for each unselected token, quantifying how much additional coverage would be obtained by including it. By integrating the saliency score into the token-coverage gain, we propose our SCOPE score and iteratively select the token with the highest SCOPE score. We conduct extensive experiments on multiple vision-language understanding benchmarks using the LLaVA-1.5 and LLaVA-Next models. Experimental results demonstrate that our method consistently outperforms prior approaches. Our code is available at https://github.com/kinredon/SCOPE.
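
The greedy selection loop described above can be sketched in a few lines. The abstract does not give the exact formulas, so this sketch assumes one plausible instantiation: the coverage of a token is its highest similarity to any selected token, the coverage gain of a candidate is the total increase in coverage it would bring, and the score is saliency multiplied by gain. The function name `scope_select` and these definitions are assumptions, not the authors' implementation.

```python
def scope_select(sim, saliency, k):
    """Greedily pick k token indices by (saliency x coverage gain).

    sim[i][j] is an assumed pairwise token similarity in [0, 1];
    saliency[i] is an assumed non-negative attention-based score.
    """
    n = len(saliency)
    covered = [0.0] * n   # best similarity of each token to the selected set so far
    selected = []
    for _ in range(min(k, n)):
        best, best_score = None, -1.0
        for i in range(n):
            if i in selected:
                continue
            # How much total coverage would rise if token i were added.
            gain = sum(max(0.0, sim[i][j] - covered[j]) for j in range(n))
            score = saliency[i] * gain
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        covered = [max(covered[j], sim[best][j]) for j in range(n)]
    return selected
```

With two near-duplicate salient tokens and one distinct, less salient token, picking purely by saliency would keep both duplicates, whereas the coverage term steers the second pick toward the distinct token.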

[18] Training-free Source Attribution of AI-generated Images via Resynthesis

Pietro Bongini, Valentina Molinari, Andrea Costanzo, Benedetta Tondi, Mauro Barni

🧩 TL;DR

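This paper proposes a training-free one-shot attribution method based on image resynthesis: a prompt describing the image is generated, the image is resynthesized with each candidate generator, and attribution follows from similarity to the original in a feature space. The method outperforms existing few-shot approaches when data is scarce, and the paper also introduces a challenging dataset for synthetic image attribution.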


📘 Detailed Summary

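Motivation: Attributing a synthetic image to its source is challenging under data scarcity, especially when few-shot or zero-shot classification is required. Existing methods are limited when training samples are scarce, so techniques that work with very little training data are needed.

Method: The paper proposes a training-free one-shot attribution method based on image resynthesis. A prompt describing the image under analysis is generated, the image is resynthesized with every candidate generator, the resyntheses are compared with the original in a suitable feature space, and the image is attributed to the model whose resynthesis is closest. The authors also build a face-image attribution dataset covering commercial and open-source text-to-image generators.

Result: On the new dataset, the resynthesis method outperforms state-of-the-art few-shot approaches and other baselines when only a few samples are available for training or fine-tuning. The experiments also show that the dataset is challenging and a valuable benchmark for developing and evaluating future few-shot and zero-shot methods.

Conclusion: Training-free attribution by resynthesis has a clear advantage under data scarcity and offers a new route to synthetic image source attribution. The new dataset provides a benchmark for developing attribution methods and for advancing few-shot and zero-shot attribution research.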


📄 Abstract

Synthetic image source attribution is a challenging task, especially in data scarcity conditions requiring few-shot or zero-shot classification capabilities. We present a new training-free one-shot attribution method based on image resynthesis. A prompt describing the image under analysis is generated, then it is used to resynthesize the image with all the candidate sources. The image is attributed to the model which produced the resynthesis closest to the original image in a proper feature space. We also introduce a new dataset for synthetic image attribution consisting of face images from commercial and open-source text-to-image generators. The dataset provides a challenging attribution framework, useful for developing new attribution models and testing their capabilities on different generative architectures. The dataset structure allows testing resynthesis-based approaches and comparing them to few-shot methods. Results from state-of-the-art few-shot approaches and other baselines show that the proposed resynthesis method outperforms existing techniques when only a few samples are available for training or fine-tuning. The experiments also demonstrate that the new dataset is a challenging one and represents a valuable benchmark for developing and evaluating future few-shot and zero-shot methods.
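
The attribution rule described above (describe, resynthesize with every candidate, pick the closest) can be sketched as below. The captioner, the generators, and the feature extractor are passed in as plain callables because the abstract does not name them; `describe`, `generators`, `embed`, and the use of cosine similarity are all assumptions made for illustration, and feature vectors are assumed to be non-zero.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def attribute(image, describe, generators, embed):
    """Attribute an image to the candidate whose resynthesis is closest.

    describe(image) -> prompt text            (hypothetical captioner)
    generators[name](prompt) -> image         (hypothetical candidate sources)
    embed(image) -> feature vector            (hypothetical feature extractor)
    """
    prompt = describe(image)
    target = embed(image)
    scores = {
        name: cosine(target, embed(generate(prompt)))
        for name, generate in generators.items()
    }
    best = max(scores, key=scores.get)
    return best, scores
```

No classifier is trained at any point: adding a new candidate source only requires adding one more entry to `generators`.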

[19] DeshadowMamba: Deshadowing as 1D Sequential Similarity

Zhaotong Yang, Yi Chen, Yanying Li, Shengfeng He, Yangyang Xu, Junyu Dong, Jian Yang, Yong Du

🧩 TL;DR

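This work proposes DeshadowMamba, an image shadow removal method built on a selective state space model. Its CrossGate directional modulation mechanism and ColorShift regularization address the distorted structures and inconsistent colors that arise when attention-based models mix illumination cues from irrelevant regions.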


📘 Detailed Summary

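Motivation: Current attention-based deep models for image shadow removal use fixed attention patterns that tend to mix illumination cues from irrelevant regions, producing distorted structures and inconsistent colors. A more effective way of modeling global context is needed to preserve structural integrity and color consistency.

Method: The approach uses Mamba, a selective state space model, for sequence modeling. It proposes CrossGate, a directional modulation mechanism that injects shadow-aware similarity into the input gate so that relevant context is integrated selectively, and introduces ColorShift regularization, a contrastive learning objective driven by global color statistics that suppresses color contamination.

Result: Extensive experiments on public benchmarks show that DeshadowMamba achieves state-of-the-art visual quality and strong quantitative performance, with robust color restoration and structure preservation.

Conclusion: The work shows that sequence modeling can be adapted to shadow removal: directional state transitions and contrastive learning address the key requirements of structural integrity and chromatic consistency, suggesting a new modeling approach for image restoration tasks.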


📄 Abstract

Recent deep models for image shadow removal often rely on attention-based architectures to capture long-range dependencies. However, their fixed attention patterns tend to mix illumination cues from irrelevant regions, leading to distorted structures and inconsistent colors. In this work, we revisit shadow removal from a sequence modeling perspective and explore the use of Mamba, a selective state space model that propagates global context through directional state transitions. These transitions yield an efficient global receptive field while preserving positional continuity. Despite its potential, directly applying Mamba to image data is suboptimal, since it lacks awareness of shadow-non-shadow semantics and remains susceptible to color interference from nearby regions. To address these limitations, we propose CrossGate, a directional modulation mechanism that injects shadow-aware similarity into Mamba's input gate, allowing selective integration of relevant context along transition axes. To further ensure appearance fidelity, we introduce ColorShift regularization, a contrastive learning objective driven by global color statistics. By synthesizing structured informative negatives, it guides the model to suppress color contamination and achieve robust color restoration. Together, these components adapt sequence modeling to the structural integrity and chromatic consistency required for shadow removal. Extensive experiments on public benchmarks demonstrate that DeshadowMamba achieves state-of-the-art visual quality and strong quantitative performance.

[20] Few-Shot Remote Sensing Image Scene Classification with CLIP and Prompt Learning

Ivica Dimitrovski, Vlatko Spasev, Ivan Kitanovski

🧩 TL;DR

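This paper systematically explores prompt learning as an efficient adaptation strategy for remote sensing image scene classification. In few-shot settings it outperforms standard baselines, and Prompting with Self-Regulating Constraints is the most robust in cross-domain generalization.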


📘 Detailed Summary

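Motivation: Deep learning performance in remote sensing is constrained by the scarcity of labeled data and the high cost of annotation across domains. Applying vision-language models such as CLIP directly to remote sensing remains suboptimal because of significant domain gaps and the need for task-specific semantic adaptation.

Method: The study evaluates several representative prompt learning methods: Context Optimization, Conditional Context Optimization, Multi-modal Prompt Learning, and Prompting with Self-Regulating Constraints. They span design philosophies from static context optimization, to conditional prompts for better generalization, to multi-modal prompts for joint adaptation, to semantically regularized prompts for stable learning.

Result: Extensive experiments on multiple remote sensing benchmark datasets show that prompt learning consistently outperforms zero-shot CLIP and a linear probe on frozen features in few-shot scenarios. In cross-dataset generalization tests, Prompting with Self-Regulating Constraints achieves the most robust cross-domain performance.

Conclusion: The findings highlight prompt learning as a scalable and efficient way to bridge the domain gap in satellite and aerial imagery, provide a strong foundation for future research, and show the potential of lightweight adaptation strategies for remote sensing scene classification.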


📄 Abstract

Remote sensing applications increasingly rely on deep learning for scene classification. However, their performance is often constrained by the scarcity of labeled data and the high cost of annotation across diverse geographic and sensor domains. While recent vision-language models like CLIP have shown promise by learning transferable representations at scale by aligning visual and textual modalities, their direct application to remote sensing remains suboptimal due to significant domain gaps and the need for task-specific semantic adaptation. To address this critical challenge, we systematically explore prompt learning as a lightweight and efficient adaptation strategy for few-shot remote sensing image scene classification. We evaluate several representative methods, including Context Optimization, Conditional Context Optimization, Multi-modal Prompt Learning, and Prompting with Self-Regulating Constraints. These approaches reflect complementary design philosophies: from static context optimization to conditional prompts for enhanced generalization, multi-modal prompts for joint vision-language adaptation, and semantically regularized prompts for stable learning without forgetting. We benchmark these prompt-learning methods against two standard baselines: zero-shot CLIP with hand-crafted prompts and a linear probe trained on frozen CLIP features. Through extensive experiments on multiple benchmark remote sensing datasets, including cross-dataset generalization tests, we demonstrate that prompt learning consistently outperforms both baselines in few-shot scenarios. Notably, Prompting with Self-Regulating Constraints achieves the most robust cross-domain performance. Our findings underscore prompt learning as a scalable and efficient solution for bridging the domain gap in satellite and aerial imagery, providing a strong foundation for future research in this field.

[21] When are radiology reports useful for training medical image classifiers?

Herman Bergström, Zhongqi Yue, Fredrik D. Johansson

🧩 TL;DR

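This study systematically examines when and how radiology reports should be used to train medical image classifiers. Pre-training with reports helps when labels are strongly associated with the text and can hurt when the association is weak, while using reports during fine-tuning can bring significant improvements.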


📘 Detailed Summary

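Motivation: Medical images used for training often come with radiology reports containing rich expert annotations, but using reports as inputs at prediction time requires timely manual work by a trained radiologist. This raises the question of when and how reports can be used during training to improve image-only classification. Prior work is limited to fine-tuning pre-trained image representations and ignores tasks whose labels are only weakly associated with the text.

Method: The study uses a systematic experimental design that examines how reports are used during both pre-training and fine-tuning, across diagnostic and prognostic tasks (e.g., 12-month readmission), and under varying training set sizes. It focuses on how explicit image-text alignment pre-training behaves in different task settings.

Result: When the label is well represented in the text, using reports during pre-training improves downstream classification. When the label is weakly associated with the text, pre-training through explicit image-text alignment can be detrimental. Fine-tuning with reports can bring significant improvements and, in certain settings, has a larger impact than the pre-training method.

Conclusion: The study gives concrete guidance on using privileged text data to train medical image classifiers, stresses choosing the training strategy according to the task, and points out gaps in current research on how label-text association affects these methods.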


📄 Abstract

Medical images used to train machine learning models are often accompanied by radiology reports containing rich expert annotations. However, relying on these reports as inputs for clinical prediction requires the timely manual work of a trained radiologist. This raises a natural question: when can radiology reports be leveraged during training to improve image-only classification? Prior works are limited to evaluating pre-trained image representations by fine-tuning them to predict diagnostic labels, often extracted from reports, ignoring tasks with labels that are weakly associated with the text. To address this gap, we conduct a systematic study of how radiology reports can be used during both pre-training and fine-tuning, across diagnostic and prognostic tasks (e.g., 12-month readmission), and under varying training set sizes. Our findings reveal that: (1) Leveraging reports during pre-training is beneficial for downstream classification tasks where the label is well-represented in the text; however, pre-training through explicit image-text alignment can be detrimental in settings where it's not; (2) Fine-tuning with reports can lead to significant improvements and even have a larger impact than the pre-training method in certain settings. These results provide actionable insights into when and how to leverage privileged text data to train medical image classifiers while highlighting gaps in current research.

[22] Does Object Binding Naturally Emerge in Large Pretrained Vision Transformers?

Yihao Li, Saeed Salehi, Lyle Ungar, Konrad P. Kording

🧩 TL;DR

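This study shows that object binding emerges naturally in self-supervised pre-trained Vision Transformers (ViTs): a similarity probe decodes the IsSameObject property with over 90% accuracy, challenging the view that ViTs lack object binding.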


📘 Detailed Summary

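Motivation: Prior work often imposes object-centric attention explicitly to study the benefits of object binding, but it remains unclear whether this ability emerges naturally in pre-trained ViTs. The study asks whether ViTs spontaneously learn which image patches belong to the same object and how this depends on the pretraining objective.

Method: A similarity probe decodes the IsSameObject property from patch embeddings across ViT layers. The study compares object binding in self-supervised models (DINO, MAE, CLIP) with ImageNet-supervised models and uses ablation experiments to test whether the IsSameObject signal guides attention.

Result: The similarity probe decodes IsSameObject with over 90% accuracy. The capability emerges reliably in self-supervised ViTs and is markedly weaker in ImageNet-supervised models. IsSameObject is encoded in a low-dimensional subspace on top of object features and actively guides attention; ablating the signal degrades downstream performance.

Conclusion: Object binding is not a trivial artifact of the ViT architecture but an ability acquired through specific pretraining objectives. This emergent symbolic knowledge challenges the view that connectionist systems lack structured representations and shows that ViTs naturally learn the abstract notion of "which parts belong together".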


📄 Abstract

Object binding, the brain's ability to bind the many features that collectively represent an object into a coherent whole, is central to human cognition. It groups low-level perceptual features into high-level object representations, stores those objects efficiently and compositionally in memory, and supports human reasoning about individual object instances. While prior work often imposes object-centric attention (e.g., Slot Attention) explicitly to probe these benefits, it remains unclear whether this ability naturally emerges in pre-trained Vision Transformers (ViTs). Intuitively, they could: recognizing which patches belong to the same object should be useful for downstream prediction and thus guide attention. Motivated by the quadratic nature of self-attention, we hypothesize that ViTs represent whether two patches belong to the same object, a property we term IsSameObject. We decode IsSameObject from patch embeddings across ViT layers using a similarity probe, which reaches over 90% accuracy. Crucially, this object-binding capability emerges reliably in self-supervised ViTs (DINO, MAE, CLIP), but is markedly weaker in ImageNet-supervised models, suggesting that binding is not a trivial architectural artifact, but an ability acquired through specific pretraining objectives. We further discover that IsSameObject is encoded in a low-dimensional subspace on top of object features, and that this signal actively guides attention. Ablating IsSameObject from model activations degrades downstream performance and works against the learning objective, implying that emergent object binding naturally serves the pretraining objective. Our findings challenge the view that ViTs lack object binding and highlight how symbolic knowledge of "which parts belong together" emerges naturally in a connectionist system.

[23] OSWorld-MCP: Benchmarking MCP Tool Invocation In Computer-Use Agents

Hongrui Jia, Jitong Liao, Xi Zhang, Haiyang Xu, Tianbao Xie, Chaoya Jiang, Ming Yan, Si Liu, Wei Ye, Fei Huang

🧩 TL;DR

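This paper presents OSWorld-MCP, the first comprehensive and fair benchmark for evaluating computer-use agents' tool invocation, GUI operation, and decision-making abilities in a real-world environment, filling a gap in the evaluation of tool invocation by multimodal agents.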


📘 Detailed Summary

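Motivation: Evaluations of multimodal agents have mainly assessed GUI interaction skills, while tool invocation abilities enabled by the Model Context Protocol (MCP) have been largely overlooked. As a result, agents with integrated tool invocation are compared unfairly with agents evaluated only on GUI interaction.

Method: A novel automated code-generation pipeline creates tools, which are combined with a curated selection of existing tools. Rigorous manual validation yields 158 high-quality tools covering 7 common applications, each verified for correct functionality, practical applicability, and versatility.

Result: Extensive evaluations on OSWorld-MCP show that MCP tools generally improve task success rates (e.g., from 8.3% to 20.4% for OpenAI o3 at 15 steps, and from 40.1% to 43.3% for Claude 4 Sonnet at 50 steps). However, even the strongest models have relatively low tool invocation rates, only 36.3%.

Conclusion: By explicitly measuring MCP tool usage skills, OSWorld-MCP deepens understanding of multimodal agents and sets a new standard for evaluating performance in complex, tool-assisted environments, while showing that current models still have room to improve at tool invocation.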


📄 Abstract

With advances in decision-making and reasoning capabilities, multimodal agents show strong potential in computer application scenarios. Past evaluations have mainly assessed GUI interaction skills, while tool invocation abilities, such as those enabled by the Model Context Protocol (MCP), have been largely overlooked. Comparing agents with integrated tool invocation to those evaluated only on GUI interaction is inherently unfair. We present OSWorld-MCP, the first comprehensive and fair benchmark for assessing computer-use agents' tool invocation, GUI operation, and decision-making abilities in a real-world environment. We design a novel automated code-generation pipeline to create tools and combine them with a curated selection from existing tools. Rigorous manual validation yields 158 high-quality tools (covering 7 common applications), each verified for correct functionality, practical applicability, and versatility. Extensive evaluations of state-of-the-art multimodal agents on OSWorld-MCP show that MCP tools generally improve task success rates (e.g., from 8.3% to 20.4% for OpenAI o3 at 15 steps, from 40.1% to 43.3% for Claude 4 Sonnet at 50 steps), underscoring the importance of assessing tool invocation capabilities. However, even the strongest models have relatively low tool invocation rates, only 36.3%, indicating room for improvement and highlighting the benchmark's challenge. By explicitly measuring MCP tool usage skills, OSWorld-MCP deepens understanding of multimodal agents and sets a new standard for evaluating performance in complex, tool-assisted environments. Our code, environment, and data are publicly available at https://osworld-mcp.github.io.

[24] A Dual-Branch CNN for Robust Detection of AI-Generated Facial Forgeries

Xin Zhang, Yuqi Song, Fei Zuo

🧩 TL;DR

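This paper proposes a novel dual-branch convolutional neural network for face forgery detection that combines complementary cues from the spatial and frequency domains. It performs strongly on the DiFF benchmark, outperforms average human accuracy, and offers an effective defense against visual forgery for the AI security ecosystem.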


📘 Detailed Summary

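Motivation: Rapid advances in generative AI make forged facial images highly realistic, posing serious threats to AI security, digital media integrity, and public trust. Face forgery techniques, including face swapping, attribute editing, and diffusion-based image synthesis, are being used maliciously for misinformation, identity fraud, and defamation. Robust and generalizable face forgery detection is therefore urgently needed as a key component of AI security infrastructure.

Method: The paper proposes a novel dual-branch convolutional neural network in which the RGB branch captures semantic information and the frequency branch focuses on high-frequency artifacts that generative models find hard to suppress. A channel attention module adaptively fuses these heterogeneous features and highlights the channels most informative for forgery discrimination. A unified loss function, FSC Loss, combines focal loss, supervised contrastive loss, and a frequency center margin loss to enhance class separability and robustness.

Result: On the DiFF benchmark, which contains forged images from four representative methods (text-to-image, image-to-image, face swap, and face edit), the method performs strongly across all categories and outperforms average human accuracy. The results support the model's effectiveness and its potential as a defense against visual forgery attacks.

Conclusion: The study shows that a dual-branch architecture combining spatial and frequency cues is effective at detecting content forged with generative AI and provides technical support for a safer AI ecosystem. Its strong performance across forgery types indicates practical potential for countering increasingly sophisticated visual forgery threats.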


📄 Abstract

The rapid advancement of generative AI has enabled the creation of highly realistic forged facial images, posing significant threats to AI security, digital media integrity, and public trust. Face forgery techniques, ranging from face swapping and attribute editing to powerful diffusion-based image synthesis, are increasingly being used for malicious purposes such as misinformation, identity fraud, and defamation. This growing challenge underscores the urgent need for robust and generalizable face forgery detection methods as a critical component of AI security infrastructure. In this work, we propose a novel dual-branch convolutional neural network for face forgery detection that leverages complementary cues from both spatial and frequency domains. The RGB branch captures semantic information, while the frequency branch focuses on high-frequency artifacts that are difficult for generative models to suppress. A channel attention module is introduced to adaptively fuse these heterogeneous features, highlighting the most informative channels for forgery discrimination. To guide the network's learning process, we design a unified loss function, FSC Loss, that combines focal loss, supervised contrastive loss, and a frequency center margin loss to enhance class separability and robustness. We evaluate our model on the DiFF benchmark, which includes forged images generated from four representative methods: text-to-image, image-to-image, face swap, and face edit. Our method achieves strong performance across all categories and outperforms average human accuracy. These results demonstrate the model's effectiveness and its potential contribution to safeguarding AI ecosystems against visual forgery attacks.
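
The abstract names the three terms of FSC Loss but gives neither their exact form nor their weighting. As a partial sketch, the standard focal loss term for a single real-versus-forged prediction can be written as below. The supervised contrastive and frequency center margin terms are not shown, and the function name and the `gamma` and `alpha` defaults are illustrative assumptions rather than the paper's settings.

```python
import math


def focal_loss(p_forged: float, label: int, gamma: float = 2.0, alpha: float = 1.0) -> float:
    """Binary focal loss for one sample.

    p_forged is the predicted probability of the 'forged' class,
    label is 1 for forged and 0 for real.
    """
    # Probability the model assigned to the true class.
    p_t = p_forged if label == 1 else 1.0 - p_forged
    p_t = min(max(p_t, 1e-12), 1.0)  # clamp to avoid log(0)
    # The (1 - p_t)^gamma factor down-weights easy, confidently correct samples.
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)
```

With `gamma = 0` the expression reduces to ordinary cross-entropy; larger `gamma` shifts the training signal toward hard examples.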

[25] SAGE: Structure-Aware Generative Video Transitions between Diverse Clips

Mia Kan, Yilin Liu, Niloy Mitra

🧩 TL;DR

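This paper proposes SAGE (Structure-Aware Generative vidEo transitions), a zero-shot video transition method that combines structural guidance with generative synthesis to produce smooth, semantically consistent transitions between diverse clips. It outperforms classical and generative baselines on quantitative metrics and in user studies.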


📘 Detailed Summary

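Motivation: Existing video transition methods struggle with diverse clips that involve large temporal gaps or significant semantic differences. Traditional techniques such as cross-fades, morphing, and frame interpolation, as well as recent generative inbetweening methods, have difficulty bridging these differences while staying content-aware and visually coherent.

Method: SAGE is a zero-shot approach that combines structural guidance, provided via line maps and motion flow, with generative synthesis, enabling smooth transitions without fine-tuning. It draws on artistic workflows, aligning silhouettes and interpolating salient features to preserve structure and perceptual continuity.

Result: Extensive comparisons with current alternatives (FILM, TVG, DiffMorpher, VACE, GI) show that SAGE outperforms both classical and generative baselines on quantitative metrics and in user studies when producing transitions between diverse clips.

Conclusion: SAGE's structure-aware generative approach addresses the challenge of transitions between diverse clips, shows that combining structural guidance with generative synthesis is effective, and offers a new zero-shot solution with potential for professional video production.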


📄 Abstract

Video transitions aim to synthesize intermediate frames between two clips, but naive approaches such as linear blending introduce artifacts that limit professional use or break temporal coherence. Traditional techniques (cross-fades, morphing, frame interpolation) and recent generative inbetweening methods can produce high-quality plausible intermediates, but they struggle with bridging diverse clips involving large temporal gaps or significant semantic differences, leaving a gap for content-aware and visually coherent transitions. We address this challenge by drawing on artistic workflows, distilling strategies such as aligning silhouettes and interpolating salient features to preserve structure and perceptual continuity. Building on this, we propose SAGE (Structure-Aware Generative vidEo transitions) as a zero-shot approach that combines structural guidance, provided via line maps and motion flow, with generative synthesis, enabling smooth, semantically consistent transitions without fine-tuning. Extensive experiments and comparison with current alternatives, namely FILM, TVG, DiffMorpher, VACE, and GI, demonstrate that SAGE outperforms both classical and generative baselines on quantitative metrics and user studies for producing transitions between diverse clips. Code to be released on acceptance.

cs.CL [Back]

[26] Success and Cost Elicit Convention Formation for Efficient Communication

Saujas Vaduguru, Yilun Hua, Yoav Artzi, Daniel Fried

🧩 TL;DR

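This work presents a method for training large multimodal models to form linguistic conventions through simulated reference games, enabling efficient communication. In repeated reference games with photographs and tangram images, it reduces message length by up to 41% while increasing success by 15%.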


📘 Detailed Summary

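Motivation: Humans use shared conversational context to communicate more efficiently over time, in particular by forming ad hoc linguistic conventions that let them coordinate on short, less costly utterances. Current large multimodal models lack this ability to form conventions in context, which limits their communication efficiency.

Method: Models are trained with simulated reference games between models, with no additional human-produced data. Training optimizes jointly for success and communication cost over repeated interactions, prompting models to form efficient conventions.

Result: In reference games with photographs and tangram images, message length falls by up to 41% while interaction success rises by 15%. Human listeners respond faster when interacting with the convention-forming model.

Conclusion: Training on success alone or cost alone is insufficient to elicit convention formation; both must be optimized. The method offers a new route to efficient human-model interaction with multimodal models and shows that communication conventions can be learned through simulated interaction.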


📄 Abstract

Humans leverage shared conversational context to become increasingly successful and efficient at communicating over time. One manifestation of this is the formation of ad hoc linguistic conventions, which allow people to coordinate on short, less costly utterances that are understood using shared conversational context. We present a method to train large multimodal models to form conventions, enabling efficient communication. Our approach uses simulated reference games between models, and requires no additional human-produced data. In repeated reference games involving photographs and tangram images, our method enables models to communicate efficiently with people: reducing the message length by up to 41% while increasing success by 15% over the course of the interaction. Human listeners respond faster when interacting with our model that forms conventions. We also show that training based on success or cost alone is insufficient; both are necessary to elicit convention formation.

[27] MuSaG: A Multimodal German Sarcasm Dataset with Full-Modal Annotations

Aaron Scott, Maike Züfle, Jan Niehues

🧩 TL;DR

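This work presents MuSaG, the first German multimodal sarcasm detection dataset, containing 33 minutes of human-annotated material from German television shows, and evaluates a range of models in text, audio, vision, and multimodal settings.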


📘 Detailed Summary

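Motivation: Sarcasm is a complex form of figurative language that is common in social media and popular culture and poses persistent challenges for natural language understanding, sentiment analysis, and content moderation. With the emergence of multimodal large language models, sarcasm detection must go beyond text and integrate audio and visual cues, yet no corresponding multimodal dataset exists for German.

Method: The MuSaG dataset contains 33 minutes of manually selected and annotated statements from German television shows. Each instance provides aligned text, audio, and video modalities, each annotated separately by humans. Nine open-source and commercial models, spanning text, audio, vision, and multimodal architectures, are benchmarked and compared with human annotations.

Result: Humans rely heavily on audio cues to detect sarcasm in conversational settings, whereas models perform best on text. This exposes a gap in how current multimodal models integrate information across modalities, especially in their use of audio cues.

Conclusion: The study highlights the need for multimodal models better suited to realistic scenarios. The public release of MuSaG supports future research on multimodal sarcasm detection and human-model alignment and provides a benchmark for improving multimodal understanding.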


📄 Abstract

Sarcasm is a complex form of figurative language in which the intended meaning contradicts the literal one. Its prevalence in social media and popular culture poses persistent challenges for natural language understanding, sentiment analysis, and content moderation. With the emergence of multimodal large language models, sarcasm detection extends beyond text and requires integrating cues from audio and vision. We present MuSaG, the first German multimodal sarcasm detection dataset, consisting of 33 minutes of manually selected and human-annotated statements from German television shows. Each instance provides aligned text, audio, and video modalities, annotated separately by humans, enabling evaluation in unimodal and multimodal settings. We benchmark nine open-source and commercial models, spanning text, audio, vision, and multimodal architectures, and compare their performance to human annotations. Our results show that while humans rely heavily on audio in conversational settings, models perform best on text. This highlights a gap in current multimodal models and motivates the use of MuSaG for developing models better suited to realistic scenarios. We release MuSaG publicly to support future research on multimodal sarcasm detection and human-model alignment.

[28] Abjad AI at NADI 2025: CATT-Whisper: Multimodal Diacritic Restoration Using Text and Speech Representations

Ahmad Ghannam, Naif Alharthi, Faris Alasmary, Kholood Al Tabash, Shouq Sadah, Lahouari Ghouti

🧩 TL;DR

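This work proposes a multimodal approach that combines text and speech information for diacritic restoration in Arabic dialectal sentences, using two integration strategies. It reaches a word error rate of 0.25 and a character error rate of 0.9 on the development set, and 0.55 and 0.13 respectively on the test set.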


📘 Detailed Summary

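Motivation: The work addresses diacritic restoration for Arabic dialectal sentences. Conventional approaches rely mainly on text and ignore the prosodic and pronunciation cues carried by speech, which limits performance on dialects.

Method: The model fuses two modalities: text is represented by the encoder of the authors' own pre-trained model, CATT, and speech by the encoder module of the OpenAI Whisper base model. Two integration strategies are used. In early fusion, the 1500 frames of an audio segment are averaged into 150 speech tokens, passed through a linear projection layer, and merged with the text tokens. In the cross-attention strategy, text and speech embeddings are fused through cross-attention and the output is fed to the CATT classification head for token-level prediction. Speech input is randomly deactivated during training to improve robustness.

Result: The approach achieves a word error rate of 0.25 and a character error rate of 0.9 on the development set, and a word error rate of 0.55 and a character error rate of 0.13 on the test set.

Conclusion: Combining text and speech information is a viable approach to diacritic restoration for Arabic dialects. Randomly deactivating speech input during training lets the model work with or without speech, which adds flexibility in practical use.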


📄 Abstract

In this work, we tackle the Diacritic Restoration (DR) task for Arabic dialectal sentences using a multimodal approach that combines both textual and speech information. We propose a model that represents the text modality using an encoder extracted from our own pre-trained model named CATT. The speech component is handled by the encoder module of the OpenAI Whisper base model. Our solution is designed following two integration strategies. The former consists of fusing the speech tokens with the input at an early stage, where the 1500 frames of the audio segment are averaged over 10 consecutive frames, resulting in 150 speech tokens. To ensure embedding compatibility, these averaged tokens are processed through a linear projection layer prior to merging them with the text tokens. Contextual encoding is guaranteed by the CATT encoder module. The latter strategy relies on cross-attention, where text and speech embeddings are fused. The cross-attention output is then fed to the CATT classification head for token-level diacritic prediction. To further improve model robustness, we randomly deactivate the speech input during training, allowing the model to perform well with or without speech. Our experiments show that the proposed approach achieves a word error rate (WER) of 0.25 and a character error rate (CER) of 0.9 on the development set. On the test set, our model achieved WER and CER scores of 0.55 and 0.13, respectively.
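
The early-fusion pooling step is simple arithmetic: 1500 encoder frames averaged in non-overlapping windows of 10 give 1500 / 10 = 150 speech tokens. A minimal sketch of that step follows, using plain lists in place of tensors; the function name `pool_frames` is hypothetical, and the linear projection applied afterwards is not shown.

```python
def pool_frames(frames, window=10):
    """Average consecutive, non-overlapping windows of frame vectors.

    frames is a list of equal-length feature vectors (one per encoder frame).
    """
    if len(frames) % window != 0:
        raise ValueError("number of frames must be a multiple of the window size")
    dim = len(frames[0])
    pooled = []
    for start in range(0, len(frames), window):
        chunk = frames[start:start + window]
        # Element-wise mean over the frames in this window.
        pooled.append([sum(f[d] for f in chunk) / window for d in range(dim)])
    return pooled
```

Pooling shortens the speech sequence tenfold before it is merged with the text tokens, which keeps the combined input to the text encoder manageable.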

[29] Beyond MCQ: An Open-Ended Arabic Cultural QA Benchmark with Dialect Variants

Hunzalah Hassan Bhatti, Firoj Alam

🧩 TL;DR

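This work proposes a comprehensive method for evaluating LLMs on Arabic dialectal and cultural content: Modern Standard Arabic multiple-choice questions are translated into English and several Arabic dialects and converted into open-ended questions, and models are fine-tuned with chain-of-thought rationales for step-by-step reasoning. The results reveal persistent gaps in LLMs' dialectal knowledge.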


📘 Detailed Summary

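Motivation: LLMs are increasingly used to answer everyday questions, yet their performance on culturally grounded and dialectal content is uneven across languages. Coverage of Arabic dialects in particular is weak and calls for systematic evaluation and improvement.

Method: The method translates Modern Standard Arabic multiple-choice questions into English and several Arabic dialects, converts them into open-ended questions, benchmarks a range of zero-shot and fine-tuned LLMs, and generates chain-of-thought rationales to fine-tune models for step-by-step reasoning.

Result: Models underperform on Arabic dialects, revealing persistent gaps in culturally grounded and dialect-specific knowledge. Arabic-centric models perform well on multiple-choice questions but struggle with open-ended ones. Chain-of-thought fine-tuning improves judged correctness but gives mixed results on n-gram-based metrics.

Conclusion: The study reveals systematic weaknesses in how LLMs handle dialectal and cultural content. The dataset will be released publicly to support research on culturally and linguistically inclusive evaluation. Chain-of-thought fine-tuning helps reasoning, although evaluation metrics need improvement.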


📄 Abstract

Large Language Models (LLMs) are increasingly used to answer everyday questions, yet their performance on culturally grounded and dialectal content remains uneven across languages. We propose a comprehensive method that (i) translates Modern Standard Arabic (MSA) multiple-choice questions (MCQs) into English and several Arabic dialects, (ii) converts them into open-ended questions (OEQs), (iii) benchmarks a range of zero-shot and fine-tuned LLMs under both MCQ and OEQ settings, and (iv) generates chain-of-thought (CoT) rationales to fine-tune models for step-by-step reasoning. Using this method, we extend an existing dataset in which QAs are parallelly aligned across multiple language varieties, making it, to our knowledge, the first of its kind. We conduct extensive experiments with both open and closed models. Our findings show that (i) models underperform on Arabic dialects, revealing persistent gaps in culturally grounded and dialect-specific knowledge; (ii) Arabic-centric models perform well on MCQs but struggle with OEQs; and (iii) CoT improves judged correctness while yielding mixed n-gram-based metrics. The developed dataset will be publicly released to support further research on culturally and linguistically inclusive evaluation.

[30] SPARTA: Evaluating Reasoning Segmentation Robustness through Black-Box Adversarial Paraphrasing in Text Autoencoder Latent Space

Viktoriia Zinkovich, Anton Antonov, Andrei Spiridonov, Denis Shepelev, Andrey Moskalenko, Daria Pugacheva, Elena Tutubalina, Andrey Kuznetsov, Vlad Shakhuro

🧩 TL;DR

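This paper introduces a novel adversarial paraphrasing task that evaluates the robustness of multimodal large language models by generating semantically equivalent text queries that degrade segmentation performance. It also develops SPARTA, a black-box optimization method operating in a semantic latent space, which clearly outperforms prior methods.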


📘 Detailed Summary

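Motivation: MLLMs perform impressively on vision-language tasks, but prior work has focused mainly on perturbing image inputs. Semantically equivalent textual paraphrases matter in real-world applications because users express the same intent in varied ways, yet they remain underexplored.

Method: SPARTA is a black-box, sentence-level optimization method that operates in the low-dimensional semantic latent space of a text autoencoder and is guided by reinforcement learning. A comprehensive automatic evaluation protocol is also developed to assess the quality of adversarial paraphrases.

Result: SPARTA achieves significantly higher success rates on the ReasonSeg and LLMSeg-40k datasets, outperforming prior methods by up to 2x. Advanced reasoning segmentation models remain vulnerable to adversarial paraphrasing even under strict semantic and grammatical constraints.

Conclusion: MLLMs used for reasoning segmentation are markedly vulnerable to semantically equivalent adversarial paraphrases. This offers a new perspective on robustness evaluation and underlines the importance of accounting for varied user phrasing in real deployments.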


📄 Abstract

Multimodal large language models (MLLMs) have shown impressive capabilities in vision-language tasks such as reasoning segmentation, where models generate segmentation masks based on textual queries. While prior work has primarily focused on perturbing image inputs, semantically equivalent textual paraphrases, which are crucial in real-world applications where users express the same intent in varied ways, remain underexplored. To address this gap, we introduce a novel adversarial paraphrasing task: generating grammatically correct paraphrases that preserve the original query meaning while degrading segmentation performance. To evaluate the quality of adversarial paraphrases, we develop a comprehensive automatic evaluation protocol validated with human studies. Furthermore, we introduce SPARTA, a black-box, sentence-level optimization method that operates in the low-dimensional semantic latent space of a text autoencoder, guided by reinforcement learning. SPARTA achieves significantly higher success rates, outperforming prior methods by up to 2x on both the ReasonSeg and LLMSeg-40k datasets. We use SPARTA and competitive baselines to assess the robustness of advanced reasoning segmentation models. We reveal that they remain vulnerable to adversarial paraphrasing, even under strict semantic and grammatical constraints. All code and data will be released publicly upon acceptance.

[31] Talk2Ref: A Dataset for Reference Prediction from Scientific Talks

Frederik Broy, Maike Züfle, Jan Niehues

🧩 TL;DR

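This paper proposes the task of reference prediction from talks, builds Talk2Ref, the first large-scale dataset for it, and develops a dual-encoder architecture that significantly improves automatic identification of relevant literature from scientific talks.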


📘 Detailed Summary

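Motivation: Scientific talks are becoming an important medium for disseminating research, yet automatically identifying literature that grounds or enriches a talk remains challenging. Researchers and students need effective methods that map long, unstructured scientific talks to relevant papers.

Method: The authors propose a dual-encoder model fine-tuned on Talk2Ref, explore strategies for handling long transcripts and training for domain adaptation, and evaluate state-of-the-art text embedding models in zero-shot retrieval scenarios.

Result: Experiments show that fine-tuning on Talk2Ref significantly improves citation prediction performance, demonstrating both the difficulty of the task and the value of learning semantic representations from spoken scientific content.

Conclusion: The study shows that spoken scientific communication can feasibly be integrated into citation recommendation systems. The openly released dataset and trained models provide a foundation for future research on the automatic analysis of scientific talks.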


📄 Abstract

Scientific talks are a growing medium for disseminating research, and automatically identifying relevant literature that grounds or enriches a talk would be highly valuable for researchers and students alike. We introduce Reference Prediction from Talks (RPT), a new task that maps long and unstructured scientific presentations to relevant papers. To support research on RPT, we present Talk2Ref, the first large-scale dataset of its kind, containing 6,279 talks and 43,429 cited papers (26 per talk on average), where relevance is approximated by the papers cited in the talk's corresponding source publication. We establish strong baselines by evaluating state-of-the-art text embedding models in zero-shot retrieval scenarios, and propose a dual-encoder architecture trained on Talk2Ref. We further explore strategies for handling long transcripts, as well as training for domain adaptation. Our results show that fine-tuning on Talk2Ref significantly improves citation prediction performance, demonstrating both the challenges of the task and the effectiveness of our dataset for learning semantic representations from spoken scientific content. The dataset and trained models are released under an open license to foster future research on integrating spoken scientific communication into citation recommendation systems.
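
The retrieval step of a dual-encoder setup can be sketched roughly as follows. This is only an illustration: it assumes the talk and paper embeddings have already been produced by the two encoders, the vectors are made-up toy values, and `cosine` and `rank_papers` are hypothetical helper names rather than anything from the released code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_papers(talk_vec, paper_vecs, k=2):
    """Return the ids of the k papers most similar to the talk embedding."""
    scored = sorted(
        paper_vecs.items(),
        key=lambda item: cosine(talk_vec, item[1]),
        reverse=True,
    )
    return [paper_id for paper_id, _ in scored[:k]]

# Toy embeddings (assumed values, not real model outputs).
talk = [0.9, 0.1, 0.0]
papers = {
    "paper_a": [1.0, 0.0, 0.0],
    "paper_b": [0.0, 1.0, 0.0],
    "paper_c": [0.7, 0.7, 0.0],
}
top = rank_papers(talk, papers, k=2)
print(top)
```

Because talks and papers are encoded independently, paper embeddings can be computed once and reused for every talk query, which is the usual reason for choosing a dual encoder over a cross-encoder in retrieval.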

[32] Zero-Shot Cross-Lingual Transfer using Prefix-Based Adaptation

Snegha A, Sayambhu Sen, Piyush Singh Pasi, Abhishek Singhania, Preethi Jyothi

🧩 TL;DR

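This study systematically evaluates three prefix-based methods for zero-shot cross-lingual transfer. With Llama 3.1 8B, prefix methods outperform LoRA baselines by up to 6% on the Belebele benchmark, with similar improvements on Mistral v0.3 7B, and prefix tuning yields consistent gains with only 1.23M learned parameters.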


📘 Detailed Summary

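Motivation: Decoder-only large language models such as Llama and Mistral offer multilingual pretraining and strong generalization, yet adapting them to new tasks across languages remains challenging. Parameter-efficient fine-tuning techniques such as LoRA are widely used, whereas prefix-based techniques such as soft prompt tuning, prefix tuning, and Llama Adapter are less studied for zero-shot transfer in decoder-only models.

Method: The study comprehensively examines three prefix-based methods (soft prompt tuning, prefix tuning, and Llama Adapter) for zero-shot cross-lingual transfer from English to 35+ high- and low-resource languages. The analysis also covers transfer across language families and scripts, and the effect of scaling model size from 1B to 24B.

Result: With Llama 3.1 8B, prefix methods outperform LoRA baselines by up to 6% on the Belebele benchmark, and similar improvements are observed with Mistral v0.3 7B. Although prefix tuning uses only 1.23M learned parameters, it achieves consistent improvements across diverse benchmarks.

Conclusion: The findings highlight prefix-based techniques as an effective and scalable alternative to LoRA, particularly in low-resource multilingual settings. Prefix methods retain parameter efficiency while achieving better cross-lingual transfer, offering a new technical route for multilingual NLP applications.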


📄 Abstract

With the release of new large language models (LLMs) like Llama and Mistral, zero-shot cross-lingual transfer has become increasingly feasible due to their multilingual pretraining and strong generalization capabilities. However, adapting these decoder-only LLMs to new tasks across languages remains challenging. While parameter-efficient fine-tuning (PeFT) techniques like Low-Rank Adaptation (LoRA) are widely used, prefix-based techniques such as soft prompt tuning, prefix tuning, and Llama Adapter are less explored, especially for zero-shot transfer in decoder-only models. We present a comprehensive study of three prefix-based methods for zero-shot cross-lingual transfer from English to 35+ high- and low-resource languages. Our analysis further explores transfer across linguistic families and scripts, as well as the impact of scaling model sizes from 1B to 24B. With Llama 3.1 8B, prefix methods outperform LoRA-baselines by up to 6% on the Belebele benchmark. Similar improvements were observed with Mistral v0.3 7B as well. Despite using only 1.23M learning parameters with prefix tuning, we achieve consistent improvements across diverse benchmarks. These findings highlight the potential of prefix-based techniques as an effective and scalable alternative to LoRA, particularly in low-resource multilingual settings.

[33] "Mm, Wat?" Detecting Other-initiated Repair Requests in Dialogue

Anh Ngo, Nicolas Rollet, Catherine Pelachaud, Chloe Clavel

🧩 TL;DR

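This study proposes a multimodal model that automatically detects repair initiation in Dutch dialogues by integrating prosodic and linguistic features grounded in Conversation Analysis. The results show that prosodic cues complement linguistic features and significantly improve on pretrained text and audio embeddings.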


📘 Detailed Summary

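Motivation: Maintaining mutual understanding is key to avoiding breakdowns in human conversation, and repair, especially other-initiated repair, plays an important role in it. Conversational agents, however, still fail to recognize when a user initiates repair, which leads to breakdowns or disengagement.

Method: The study proposes a multimodal model that automatically detects repair initiation in Dutch dialogues by integrating linguistic and prosodic features grounded in Conversation Analysis. The approach combines pretrained text and audio embeddings and uses prosodic cues to complement linguistic features.

Result: Experiments show that prosodic cues complement linguistic features and significantly improve on pretrained text and audio embeddings. The study also sheds light on how the different features interact.

Conclusion: The study provides an effective method for multimodal detection of repair initiation. Future directions include incorporating visual cues and exploring multilingual and cross-context corpora to assess the model's robustness and generalizability.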


📄 Abstract

Maintaining mutual understanding is a key component in human-human conversation to avoid conversation breakdowns, in which repair, particularly Other-Initiated Repair (OIR, when one speaker signals trouble and prompts the other to resolve it), plays a vital role. However, Conversational Agents (CAs) still fail to recognize user repair initiation, leading to breakdowns or disengagement. This work proposes a multimodal model to automatically detect repair initiation in Dutch dialogues by integrating linguistic and prosodic features grounded in Conversation Analysis. The results show that prosodic cues complement linguistic features and significantly improve the results of pretrained text and audio embeddings, offering insights into how different features interact. Future directions include incorporating visual cues and exploring multilingual and cross-context corpora to assess robustness and generalizability.

[34] Optimizing Retrieval for RAG via Reinforced Contrastive Learning

Jiawei Zhou, Lei Chen

🧩 TL;DR

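This paper proposes R3, a retrieval framework optimized for retrieval-augmented generation through trial-and-feedback reinforced contrastive learning. It lets the retriever dynamically optimize its relevance judgments within the RAG environment without human-annotated data.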


📘 Detailed Summary

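Motivation: As retrieval-augmented generation spreads, the role of information retrieval is shifting from retrieving information for human users to retrieving contextual knowledge for AI systems. In that setting relevance is hard to define or annotate in advance, which is the main challenge for existing methods.

Method: R3 uses reinforced contrastive learning driven by trial-and-feedback, allowing the retriever to explore and optimize relevance dynamically within the RAG environment. Retrieved results interact with the environment to produce contrastive signals that automatically guide the retriever's self-improvement, with no reliance on annotated or synthetic data for supervised fine-tuning.

Result: Extensive experiments across diverse tasks show that R3 improves RAG performance by 5.2% over the original retriever and surpasses state-of-the-art retrievers by 4.9%, while achieving results comparable to LLM-augmented retrieval and RAG systems built on post-trained or instruction-tuned LLMs. Training requires only 4 GPUs and completes within a single day.

Conclusion: R3 demonstrates that a retriever can improve itself within the RAG environment through reinforced contrastive learning. It offers an efficient and practical way to raise retrieval quality without large amounts of annotated data.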


📄 Abstract

As retrieval-augmented generation (RAG) becomes increasingly widespread, the role of information retrieval (IR) is shifting from retrieving information for human users to retrieving contextual knowledge for artificial intelligence (AI) systems, where relevance becomes difficult to define or annotate beforehand. To address this challenge, we propose R3, a Retrieval framework optimized for RAG through trial-and-feedback Reinforced contrastive learning. Unlike prior approaches that rely on annotated or synthetic data for supervised fine-tuning, R3 enables the retriever to dynamically explore and optimize relevance within the RAG environment. During training, the retrieved results interact with the environment to produce contrastive signals that automatically guide the retriever's self-improvement. Extensive experiments across diverse tasks demonstrate that R3 improves RAG performance by 5.2% over the original retriever and surpasses state-of-the-art retrievers by 4.9%, while achieving comparable results to LLM-augmented retrieval and RAG systems built on post-trained or instruction-tuned LLMs. It is both efficient and practical, requiring only 4 GPUs and completing training within a single day.
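
One way to picture how environment feedback could become a contrastive signal is the sketch below. It is not the R3 training procedure: it assumes a standard InfoNCE loss as a stand-in, and the rule that the first retrieved document leading to a successful generation is the positive is a hypothetical simplification, as are all names and values.

```python
import math

def info_nce(scores, positive_idx, temperature=0.05):
    """InfoNCE loss: negative log-softmax of the positive's retriever score."""
    logits = [s / temperature for s in scores]
    peak = max(logits)  # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_z - logits[positive_idx]

def pick_positive(retrieved, feedback):
    """Index of the first document whose use led to a successful generation."""
    for i, doc_id in enumerate(retrieved):
        if feedback[doc_id]:
            return i
    return None  # no positive in this trial: skip the update

# Toy trial (assumed values): retriever scores and per-document feedback
# obtained by running the generator with each document in context.
retrieved = ["d1", "d2", "d3"]
scores = [0.62, 0.55, 0.40]
feedback = {"d1": False, "d2": True, "d3": False}

pos = pick_positive(retrieved, feedback)
loss = info_nce(scores, pos)
print(pos, round(loss, 3))
```

The loss is larger when the document that actually helped is ranked below one that did not, so minimizing it pushes the retriever's scores toward what the downstream generator finds useful.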

cs.AI [Back]

[35] Game-TARS: Pretrained Foundation Models for Scalable Generalist Multimodal Game Agents

Zihao Wang, Xujing Li, Yining Ye, Junjie Fang, Haoming Wang, Longxiang Liu, Shihao Liang, Junting Lu, Zhiyong Wu, Jiazhan Feng, Wanjun Zhong, Zili Li, Yu Wang, Yu Miao, Bo Zhou, Yuanfan Li, Hao Wang, Zhongkai Zhao, Faming Wu, Zhengxuan Jiang, Weihao Tan, Heyuan Yao, Shi Yan, Xiangyang Li, Yitao Liang, Yujia Qin, Guang Shi

🧩 TL;DR

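Game-TARS is a generalist game agent trained with a unified, scalable keyboard-mouse action space. Large-scale cross-domain pre-training yields substantial gains: it outperforms prior state-of-the-art models on several game benchmarks and approaches novice human level on unseen web 3D games.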


📘 Detailed Summary

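Motivation: Traditional game agents depend on specific APIs or GUI interfaces, which makes large-scale continual pre-training across domains difficult. The authors aim for a unified action space based on human-aligned keyboard-mouse input that supports large-scale pre-training across heterogeneous domains such as operating systems, the web, and simulation games.

Method: Game-TARS uses a unified, scalable action space anchored to human-aligned keyboard-mouse inputs and is pre-trained on over 500B tokens of diverse trajectories and multimodal data. Key techniques include a decaying continual loss to reduce causal confusion and an efficient Sparse-Thinking strategy that balances reasoning depth and inference cost.

Result: Game-TARS achieves about twice the success rate of the previous state-of-the-art model on open-world Minecraft tasks, approaches novice human level on unseen web 3D games, and outperforms GPT-5, Gemini-2.5-Pro, and Claude-4-Sonnet on FPS benchmarks. Training-time and test-time scaling results confirm that the unified action space sustains improvements when scaled to cross-game and multimodal data.

Conclusion: Simple, scalable action representations combined with large-scale pre-training offer a promising path toward generalist agents with broad computer-use abilities. The unified action space supports large-scale continual pre-training across heterogeneous domains and lays groundwork for more general AI systems.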


📄 Abstract

We present Game-TARS, a generalist game agent trained with a unified, scalable action space anchored to human-aligned native keyboard-mouse inputs. Unlike API- or GUI-based approaches, this paradigm enables large-scale continual pre-training across heterogeneous domains, including OS, web, and simulation games. Game-TARS is pre-trained on over 500B tokens with diverse trajectories and multimodal data. Key techniques include a decaying continual loss to reduce causal confusion and an efficient Sparse-Thinking strategy that balances reasoning depth and inference cost. Experiments show that Game-TARS achieves about 2 times the success rate of the previous state-of-the-art model on open-world Minecraft tasks, is close to the generality of novice humans in unseen web 3D games, and outperforms GPT-5, Gemini-2.5-Pro, and Claude-4-Sonnet in FPS benchmarks. Scaling results on training-time and test-time confirm that the unified action space sustains improvements when scaled to cross-game and multimodal data. Our results demonstrate that simple, scalable action representations combined with large-scale pre-training provide a promising path toward generalist agents with broad computer-use abilities.

[36] Why Foundation Models in Pathology Are Failing

Hamid R. Tizhoosh

🧩 TL;DR

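This paper analyzes the fundamental shortcomings of foundation models in computational pathology, argues that they are conceptually mismatched with the nature of tissue morphology, identifies seven interrelated causes of failure, and calls for a fundamental rethinking of the paradigm.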


📘 Detailed Summary

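Motivation: Although foundation models have revolutionized non-medical domains, their rapid adoption in computational pathology has not delivered the expected breakthroughs in cancer diagnosis, prognostication, and multimodal retrieval. Instead, it has exposed fundamental weaknesses: low diagnostic accuracy, poor robustness, geometric instability, heavy computational demands, and safety vulnerabilities. The root causes of these failures need closer examination.

Method: Drawing on recent systematic evaluations, the paper critically analyzes why foundation models fail in computational pathology and identifies seven core causes: biological complexity, ineffective self-supervision, overgeneralization, excessive architectural complexity, lack of domain-specific innovation, insufficient data, and a fundamental design flaw related to tissue patch size.

Result: Current pathology foundation models show systematic deficiencies: low diagnostic accuracy, poor robustness, geometric instability, heavy computational demands, and safety vulnerabilities. These stem from a fundamental mismatch between the models' assumptions and the intrinsic complexity of human tissue, not from issues that simple technical tuning would resolve.

Conclusion: Current pathology foundation models are conceptually misaligned with the nature of tissue morphology. The paper calls for rethinking the foundation model paradigm itself instead of pursuing incremental improvements, pointing toward new modeling approaches suited to computational pathology.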


📄 Abstract

In non-medical domains, foundation models (FMs) have revolutionized computer vision and language processing through large-scale self-supervised and multimodal learning. Consequently, their rapid adoption in computational pathology was expected to deliver comparable breakthroughs in cancer diagnosis, prognostication, and multimodal retrieval. However, recent systematic evaluations reveal fundamental weaknesses: low diagnostic accuracy, poor robustness, geometric instability, heavy computational demands, and concerning safety vulnerabilities. This short paper examines these shortcomings and argues that they stem from deeper conceptual mismatches between the assumptions underlying generic foundation modeling in mainstream AI and the intrinsic complexity of human tissue. Seven interrelated causes are identified: biological complexity, ineffective self-supervision, overgeneralization, excessive architectural complexity, lack of domain-specific innovation, insufficient data, and a fundamental design flaw related to tissue patch size. These findings suggest that current pathology foundation models remain conceptually misaligned with the nature of tissue morphology and call for a fundamental rethinking of the paradigm itself.

[37] Latent Chain-of-Thought for Visual Reasoning

Guohao Sun, Hang Hua, Jian Wang, Jiebo Luo, Sohail Dianat, Majid Rabbani, Raghuveer Rao, Zhiqiang Tao

🧩 TL;DR

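This paper proposes a scalable training algorithm based on amortized variational inference that reformulates reasoning in large vision-language models as posterior inference. With diversity-seeking reinforcement learning and a Bayesian inference-scaling strategy, it improves performance, generalization, and interpretability on seven reasoning benchmarks.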


📘 Detailed Summary

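Motivation: Existing training algorithms such as SFT, PPO, and GRPO may generalize poorly to unseen reasoning tasks and rely heavily on a biased reward model, which limits the reliability and interpretability of reasoning in large vision-language models.

Method: The approach reformulates reasoning as posterior inference within an amortized variational inference framework. It introduces a sparse reward function, based on diversity-seeking reinforcement learning, that provides token-level learning signals to encourage diverse, high-likelihood latent chains of thought. A Bayesian inference-scaling strategy replaces costly Best-of-N and Beam Search with a marginal likelihood to rank rationales and answers efficiently.

Result: Empirical results on seven reasoning benchmarks show that the proposed method improves state-of-the-art large vision-language models in effectiveness, generalization, and interpretability.

Conclusion: The study supports modeling reasoning as posterior inference. The variational framework and diversity-oriented reward offer a new paradigm for reliable reasoning in large vision-language models, and the Bayesian scaling strategy gives a practical way to select reasoning paths efficiently.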


📄 Abstract

Chain-of-thought (CoT) reasoning is critical for improving the interpretability and reliability of Large Vision-Language Models (LVLMs). However, existing training algorithms such as SFT, PPO, and GRPO may not generalize well across unseen reasoning tasks and heavily rely on a biased reward model. To address this challenge, we reformulate reasoning in LVLMs as posterior inference and propose a scalable training algorithm based on amortized variational inference. By leveraging diversity-seeking reinforcement learning algorithms, we introduce a novel sparse reward function for token-level learning signals that encourage diverse, high-likelihood latent CoT, overcoming deterministic sampling limitations and avoiding reward hacking. Additionally, we implement a Bayesian inference-scaling strategy that replaces costly Best-of-N and Beam Search with a marginal likelihood to efficiently rank optimal rationales and answers. We empirically demonstrate that the proposed method enhances the state-of-the-art LVLMs on seven reasoning benchmarks, in terms of effectiveness, generalization, and interpretability.
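
The idea of ranking answers by a marginal likelihood, instead of by the single best sample, can be illustrated with the sketch below. It is one plausible reading of the abstract, not the paper's estimator: it assumes each sample carries a joint log-probability of its rationale and answer given the input, and approximates each answer's marginal by a log-sum-exp over the sampled rationales that reach it. The function names and numbers are hypothetical.

```python
import math
from collections import defaultdict

def logsumexp(values):
    """Numerically stable log of a sum of exponentials."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))

def rank_answers(samples):
    """Rank answers by log-sum-exp of joint log-probs over their rationales.

    samples: list of (answer, joint_logprob) pairs, where joint_logprob is
    assumed to be log p(rationale, answer | input) for one sampled rationale.
    """
    by_answer = defaultdict(list)
    for answer, logprob in samples:
        by_answer[answer].append(logprob)
    marginals = {a: logsumexp(lps) for a, lps in by_answer.items()}
    return sorted(marginals, key=marginals.get, reverse=True)

# Toy samples (assumed values): "B" has the single most likely rationale,
# but three rationales agree on "A".
samples = [("A", -2.0), ("B", -1.5), ("A", -2.1), ("A", -2.3)]
ranking = rank_answers(samples)
best_single = max(samples, key=lambda s: s[1])[0]
print(ranking, best_single)
```

In this toy case a best-of-N pick by individual score would choose "B", while summing probability mass over rationales prefers "A", which is the kind of difference a marginal-likelihood ranking is meant to capture.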

[38] HistoLens: An Interactive XAI Toolkit for Verifying and Mitigating Flaws in Vision-Language Models for Histopathology

Sandeep Vissapragada, Vikrant Sahu, Gagan Raj Gupta, Vandita Singh

🧩 TL;DR

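This work presents HistoLens, a transparent AI assistant for pathologists that builds trust and collaboration in diagnosis through natural-language interaction and visual explanations. The system translates user questions into precise model queries, generates structured reports, and provides heatmap visualizations that expose the AI's reasoning.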


📘 Detailed Summary

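Motivation: Medical AI systems commonly behave as black boxes, so doctors cannot follow their reasoning, which limits trust and clinical usefulness. The work aims to build a transparent AI assistant that pathologists can interact with naturally, as they would consult a colleague.

Method: The system translates a pathologist's plain-English question into a precise query for its AI engine and produces a structured report. Its key feature is an on-demand visual explanation: a heatmap that pinpoints the cells and tissue regions the analysis relied on. The model is also taught to focus on the patient's tissue and ignore distracting background noise.

Result: HistoLens provides a transparent, collaborative workflow in which the pathologist remains the expert in charge and uses the AI assistant to verify diagnostic insights. The workflow is intended to support faster, more confident diagnoses.

Conclusion: The work argues that transparency is central to the value of AI in medical diagnosis: explainable reasoning and visual evidence help doctors trust the system. This collaborative model offers a practical route to deploying medical AI while keeping the human expert in charge.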


📄 Abstract

For doctors to truly trust artificial intelligence, it can't be a black box. They need to understand its reasoning, almost as if they were consulting a colleague. We created HistoLens to be that transparent, collaborative partner. It allows a pathologist to simply ask a question in plain English about a tissue slide--just as they would ask a trainee. Our system intelligently translates this question into a precise query for its AI engine, which then provides a clear, structured report. But it doesn't stop there. If a doctor ever asks, "Why?", HistoLens can instantly provide a 'visual proof' for any finding--a heatmap that points to the exact cells and regions the AI used for its analysis. We've also ensured the AI focuses only on the patient's tissue, just like a trained pathologist would, by teaching it to ignore distracting background noise. The result is a workflow where the pathologist remains the expert in charge, using a trustworthy AI assistant to verify their insights and make faster, more confident diagnoses.

[39] BLM$_1$: A Boundless Large Model for Cross-Space, Cross-Task, and Cross-Embodiment Learning

Wentao Tan, Bowen Wang, Heng Zhi, Chenyu Liu, Zhe Li, Jian Liu, Zengrong Lin, Yukun Dai, Yipeng Chen, Wenjie Yang, Enci Xie, Hao Xue, Baixu Ji, Chen Xu, Zhibin Wang, Tianshi Wang, Lei Zhu, Heng Tao Shen

🧩 TL;DR

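This paper presents the Boundless Large Model BLM₁, a multimodal spatial foundation model that uses a two-stage training paradigm to achieve cross-space transfer, cross-task learning, and cross-embodiment generalization, outperforming existing model families on both digital and physical tasks.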


📘 Detailed Summary

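Motivation: Current multimodal large language models generalize poorly across digital-physical spaces and embodiments, vision-language-action models produce low-level actions but lack high-level reasoning, and most embodied large language models are confined to digital space and transfer poorly to the physical world. A unified model is needed that operates across digital and physical spaces while generalizing across embodiments and tasks.

Method: BLM₁ uses a two-stage training paradigm. Stage I injects embodied knowledge into the multimodal large language model through curated digital corpora while maintaining language competence. Stage II trains a policy module through an intent-bridging interface that extracts high-level semantics from the model to guide control, without fine-tuning the backbone. The approach relies on a self-collected cross-embodiment demonstration suite spanning four robot embodiments and six progressively challenging tasks.

Result: Across digital and physical benchmarks, a single BLM₁ instance achieves gains of about 6% on digital tasks and about 3% on physical tasks, outperforming four model families: MLLMs, ELLMs, VLAs, and GMLMs.

Conclusion: BLM₁ shows that a two-stage training paradigm can unify modeling across spaces, tasks, and embodiments. It offers a practical path toward embodied systems that operate across the digital and physical worlds while preserving instruction following and reasoning and supporting robust cross-embodiment control.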


📄 Abstract

Multimodal large language models (MLLMs) have advanced vision-language reasoning and are increasingly deployed in embodied agents. However, significant limitations remain: MLLMs generalize poorly across digital-physical spaces and embodiments; vision-language-action models (VLAs) produce low-level actions yet lack robust high-level embodied reasoning; and most embodied large language models (ELLMs) are constrained to digital-space with poor generalization to the physical world. Thus, unified models that operate seamlessly across digital and physical spaces while generalizing across embodiments and tasks remain absent. We introduce the Boundless Large Model (BLM$_1$), a multimodal spatial foundation model that preserves instruction following and reasoning, incorporates embodied knowledge, and supports robust cross-embodiment control. BLM$_1$ integrates three key capabilities -- cross-space transfer, cross-task learning, and cross-embodiment generalization -- via a two-stage training paradigm. Stage I injects embodied knowledge into the MLLM through curated digital corpora while maintaining language competence. Stage II trains a policy module through an intent-bridging interface that extracts high-level semantics from the MLLM to guide control, without fine-tuning the MLLM backbone. This process is supported by a self-collected cross-embodiment demonstration suite spanning four robot embodiments and six progressively challenging tasks. Evaluations across digital and physical benchmarks show that a single BLM$_1$ instance outperforms four model families -- MLLMs, ELLMs, VLAs, and GMLMs -- achieving $\sim$6% gains in digital tasks and $\sim$3% in physical tasks.

[40] MGA: Memory-Driven GUI Agent for Observation-Centric Interaction

Weihua Cheng, Ersheng Ni, Wenlong Wang, Yifei Sun, Junming Liu, Wangyu Shen, Yirong Chen, Botian Shi, Ding Wang

🧩 TL;DR

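This paper proposes the Memory-Driven GUI Agent (MGA), which uses an "observe first, then decide" paradigm to address existing GUI agents' dependence on historical trajectories and their local exploration bias, and reports substantial gains in robustness, generalization, and efficiency across several benchmarks.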


📘 Detailed Summary

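Motivation: Existing GUI agents typically model tasks as long-chain executions and concatenate historical trajectories into the context. This leaves two persistent problems: dependence on historical trajectories amplifies error propagation, and "decision-first, observation-later" mechanisms create a local exploration bias that overlooks critical interface cues.

Method: MGA reframes GUI interaction around the principle of "observe first, then decide". Each step is modeled as an independent, context-rich environment state represented by a triad: the current screenshot, task-agnostic spatial information, and a dynamically updated structured memory.

Result: Results on the OSworld benchmarks, real desktop applications (Chrome, VSCode, VLC), and cross-task transfer experiments show that MGA achieves substantial gains in robustness, generalization, and efficiency over state-of-the-art baselines.

Conclusion: By restructuring the GUI interaction paradigm, MGA shows that an "observe first, then decide" approach can address core limitations of existing GUI agents, pointing toward more robust and general interface interaction systems.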


📄 Abstract

The rapid progress of Large Language Models (LLMs) and their multimodal extensions (MLLMs) has enabled agentic systems capable of perceiving and acting across diverse environments. A challenging yet impactful frontier is the development of GUI agents, which must navigate complex desktop and web interfaces while maintaining robustness and generalization. Existing paradigms typically model tasks as long-chain executions, concatenating historical trajectories into the context. While approaches such as Mirage and GTA1 refine planning or introduce multi-branch action selection, they remain constrained by two persistent issues: dependence on historical trajectories, which amplifies error propagation, and local exploration bias, where "decision-first, observation-later" mechanisms overlook critical interface cues. We introduce the Memory-Driven GUI Agent (MGA), which reframes GUI interaction around the principle of observe first, then decide. MGA models each step as an independent, context-rich environment state represented by a triad: current screenshot, task-agnostic spatial information, and a dynamically updated structured memory. Experiments on OSworld benchmarks, real desktop applications (Chrome, VSCode, VLC), and cross-task transfer demonstrate that MGA achieves substantial gains in robustness, generalization, and efficiency compared to state-of-the-art baselines. The code is publicly available at: https://anonymous.4open.science/r/MGA-3571.
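
The triad-based state and the observe-then-decide loop might be organized along the lines of the sketch below. Everything here is an illustrative assumption: the class and function names, the shape of the spatial information (label to bounding box), the memory format, and the toy rule inside `decide` are hypothetical and stand in for the agent's actual model.

```python
from dataclasses import dataclass, field

@dataclass
class StepState:
    """One step's environment state: the triad described in the abstract."""
    screenshot: bytes        # current screenshot (raw bytes in this sketch)
    spatial_info: dict       # task-agnostic layout: label -> bounding box
    memory: list = field(default_factory=list)  # structured memory so far

def decide(state, task):
    """Toy stand-in for the policy (hypothetical rule, not the paper's model)."""
    for label, box in state.spatial_info.items():
        done_before = f"clicked {label}" in state.memory
        if label.lower() in task.lower() and not done_before:
            return ("click", label, box)
    return ("done", None, None)

def step(screenshot, spatial_info, memory, task):
    """Observe first, then decide; return the action and the updated memory."""
    state = StepState(screenshot, spatial_info, list(memory))  # observe
    action = decide(state, task)                               # decide
    if action[0] == "click":
        memory = memory + [f"clicked {action[1]}"]
    return action, memory

# Toy run with an assumed two-element layout and an empty screenshot.
layout = {"Settings": (10, 10, 50, 30), "Help": (60, 10, 100, 30)}
first_action, mem = step(b"", layout, [], "Open Settings")
second_action, mem = step(b"", layout, mem, "Open Settings")
print(first_action, second_action, mem)
```

Note that each call to `step` builds its state from the current observation plus a compact memory, so no raw history of earlier screenshots or actions is carried in the context.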

[41] A Unified Geometric Space Bridging AI Models and the Human Brain

Silin Chen, Yuzhong Chen, Zifan Wang, Junhao Wang, Zifeng Jia, Keith M Kendrick, Tuo Zhang, Lin Zhao, Dezhong Yao, Tianming Liu, Xi Jiang

🧩 TL;DR

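This paper introduces the concept of a Brain-like Space: a unified geometric space, built by mapping AI models' intrinsic spatial attention topology onto human functional brain networks, in which the brain-likeness of models from different modalities can be quantified and compared. It reveals correspondences between how models are organized and the brain's functional networks.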


📘 Detailed Summary

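Motivation: Existing brain-AI alignment studies show striking correspondence between the two systems, but the comparisons are bound to specific inputs and tasks. There is no common framework for comparing the intrinsic organization of different AI models across modalities and tasks, so it is not possible to assess systematically how closely vision, language, or multimodal models resemble the brain's organization.

Method: The authors propose the Brain-like Space framework, which builds a unified geometric space by mapping the intrinsic spatial attention topological organization of Transformer models onto canonical human functional brain networks. They analyze 151 Transformer-based models spanning state-of-the-art large vision, language, and multimodal models, and compare how pretraining paradigms and positional encoding schemes affect brain-likeness.

Result: The analysis uncovers a continuous arc-shaped geometry within the Brain-like Space, reflecting a gradual increase in brain-likeness. Models show distinct distribution patterns within this geometry that are associated with different degrees of brain-likeness. These patterns are shaped not merely by modality but by whether the pretraining paradigm emphasizes global semantic abstraction and whether the positional encoding scheme facilitates deep fusion across modalities. A model's brain-likeness and its downstream task performance do not fully coincide.

Conclusion: The Brain-like Space provides the first unified framework for situating, quantifying, and comparing intelligence across domains, and reveals organizational principles that bridge machines and the brain. A model's intrinsic organization reflects its correspondence to the brain's functional networks, not just its task performance, which offers a basis for building more brain-like AI systems.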


📄 Abstract

For decades, neuroscientists and computer scientists have pursued a shared ambition: to understand intelligence and build it. Modern artificial neural networks now rival humans in language, perception, and reasoning, yet it is still largely unknown whether these artificial systems organize information as the brain does. Existing brain-AI alignment studies have shown the striking correspondence between the two systems, but such comparisons remain bound to specific inputs and tasks, offering no common ground for comparing how AI models with different kinds of modalities (vision, language, or multimodal) are intrinsically organized. Here we introduce a groundbreaking concept of Brain-like Space: a unified geometric space in which every AI model can be precisely situated and compared by mapping its intrinsic spatial attention topological organization onto canonical human functional brain networks, regardless of input modality, task, or sensory domain. Our extensive analysis of 151 Transformer-based models spanning state-of-the-art large vision models, large language models, and large multimodal models uncovers a continuous arc-shaped geometry within this space, reflecting a gradual increase of brain-likeness; different models exhibit distinct distribution patterns within this geometry associated with different degrees of brain-likeness, shaped not merely by their modality but by whether the pretraining paradigm emphasizes global semantic abstraction and whether the positional encoding scheme facilitates deep fusion across different modalities. Moreover, the degree of brain-likeness for a model and its downstream task performance are not "identical twins". The Brain-like Space provides the first unified framework for situating, quantifying, and comparing intelligence across domains, revealing the deep organizational principles that bridge machines and the brain.

[42] OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows

Qiushi Sun, Mukai Li, Zhoumianze Liu, Zhihui Xie, Fangzhi Xu, Zhangyue Yin, Kanzhi Cheng, Zehao Li, Zichen Ding, Qi Liu, Zhiyong Wu, Zhuosheng Zhang, Ben Kao, Lingpeng Kong

🧩 TL;DR

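This paper presents MobileRisk-Live, a dynamic sandbox environment, and OS-Sentinel, a hybrid safety detection framework, to address unsafe operations by vision-language-model-based agents in mobile environments. The framework combines a formal verifier with contextual risk assessment and improves on existing approaches by 10%-30% across multiple metrics.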


📘 Detailed Summary

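Motivation: Computer-using agents powered by vision-language models show human-like capabilities in digital environments such as mobile platforms, but their potential for unsafe operations, such as system compromise and privacy leakage, raises serious concerns. Detecting these safety issues across the vast and complex operational space of mobile environments is a major challenge that remains critically underexplored.

Method: The authors propose OS-Sentinel, a hybrid safety detection framework that combines a Formal Verifier for detecting explicit system-level violations with a VLM-based Contextual Judge for assessing contextual risks and agent actions. They also build MobileRisk-Live, a dynamic sandbox environment with a safety detection benchmark of realistic trajectories with fine-grained annotations.

Result: Experiments show that OS-Sentinel achieves 10%-30% improvements over existing approaches across multiple metrics, indicating a clear advantage in detecting safety risks of mobile agents.

Conclusion: Further analysis provides insights that support the development of safer and more reliable autonomous mobile agents. The work establishes a foundation for mobile agent safety research, and the hybrid detection approach offers an effective way to handle safety issues in complex environments.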


📄 Abstract

Computer-using agents powered by Vision-Language Models (VLMs) have demonstrated human-like capabilities in operating digital environments like mobile platforms. While these agents hold great promise for advancing digital automation, their potential for unsafe operations, such as system compromise and privacy leakage, is raising significant concerns. Detecting these safety concerns across the vast and complex operational space of mobile environments presents a formidable challenge that remains critically underexplored. To establish a foundation for mobile agent safety research, we introduce MobileRisk-Live, a dynamic sandbox environment accompanied by a safety detection benchmark comprising realistic trajectories with fine-grained annotations. Built upon this, we propose OS-Sentinel, a novel hybrid safety detection framework that synergistically combines a Formal Verifier for detecting explicit system-level violations with a VLM-based Contextual Judge for assessing contextual risks and agent actions. Experiments show that OS-Sentinel achieves 10%-30% improvements over existing approaches across multiple metrics. Further analysis provides critical insights that foster the development of safer and more reliable autonomous mobile agents.
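
The hybrid structure, an explicit rule check combined with a contextual risk score, can be sketched as below. This is a minimal illustration under stated assumptions: the forbidden-command prefixes, the 0.5 threshold, the "flag if either component fires" combination rule, and the keyword-based `keyword_judge` (a stand-in for a VLM-based judge) are all hypothetical and not taken from OS-Sentinel.

```python
FORBIDDEN_PREFIXES = ("rm -rf", "pm uninstall")  # hypothetical rule set

def formal_verifier(system_events):
    """Rule-based check: return events matching an explicit forbidden pattern."""
    return [e for e in system_events if e.startswith(FORBIDDEN_PREFIXES)]

def keyword_judge(trajectory):
    """Stand-in for a VLM-based judge: returns a risk score in [0, 1]."""
    risky = any("password" in step.lower() for step in trajectory)
    return 0.9 if risky else 0.1

def hybrid_check(system_events, trajectory, judge=keyword_judge, threshold=0.5):
    """Flag a trajectory if either the verifier or the judge raises a concern."""
    violations = formal_verifier(system_events)
    risk = judge(trajectory)
    return {
        "unsafe": bool(violations) or risk >= threshold,
        "violations": violations,
        "risk": risk,
    }

# Toy cases (assumed traces): benign, explicit violation, contextual risk.
benign = hybrid_check(["ls /sdcard"], ["Open gallery", "View photo"])
explicit = hybrid_check(["rm -rf /sdcard/DCIM"], ["Open file manager"])
contextual = hybrid_check(["input text hunter2"], ["Type password into chat window"])
print(benign["unsafe"], explicit["unsafe"], contextual["unsafe"])
```

The third case shows why the two components complement each other: no system-level rule is violated, yet the action is risky in context, so only the judge catches it.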

[43] Generative AI for Healthcare: Fundamentals, Challenges, and Perspectives

Gang Chen, Changshuo Liu, Gene Anne Ooi, Marcus Tan, Zhongle Xie, Jianwei Yin, James Wei Luen Yip, Wenqiao Zhang, Jiaqi Zhu, Beng Chin Ooi

🧩 TL;DR

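This paper proposes a data-centric design paradigm that repositions the medical data ecosystem as the foundational substrate for generative AI in healthcare. Efficient data processing pipelines such as semantic vector search and contextual querying support both upstream model components and downstream clinical applications.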


📘 Detailed Summary

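Motivation: Deploying generative AI in healthcare requires a deep understanding of healthcare tasks and of what can and cannot be achieved. Existing approaches lack systematic integration of the medical data life cycle and cannot sustainably support the integration, representation, and retrieval of diverse medical data and knowledge.

Method: The authors propose a design paradigm for generative healthcare systems built on a medical data ecosystem. Efficient data processing pipelines, such as semantic vector search and contextual querying, support the integration and retrieval of multimodal medical data. The ecosystem supplies high-quality data for upstream model pretraining and domain-specific fine-tuning, and serves as a knowledge retrieval backend for task-specific inference.

Result: The ecosystem is designed to sustainably support the integration and retrieval of diverse medical data and knowledge, to supply generative AI models with high-quality multimodal data for large-scale pretraining and domain-specific fine-tuning, and to support task-specific inference through an agentic layer.

Conclusion: A data-centric design paradigm for generative AI in healthcare can enable high-quality, effective healthcare delivery. By repositioning the data life cycle and building a sustainable medical data ecosystem, it offers a systematic approach to deploying generative AI reliably in healthcare.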


📄 Abstract

Generative Artificial Intelligence (GenAI) is taking the world by storm. It promises transformative opportunities for advancing and disrupting existing practices, including healthcare. From large language models (LLMs) for clinical note synthesis and conversational assistance to multimodal systems that integrate medical imaging, electronic health records, and genomic data for decision support, GenAI is transforming the practice of medicine and the delivery of healthcare, such as diagnosis and personalized treatments, with great potential in reducing the cognitive burden on clinicians, thereby improving overall healthcare delivery. However, GenAI deployment in healthcare requires an in-depth understanding of healthcare tasks and what can and cannot be achieved. In this paper, we propose a data-centric paradigm in the design and deployment of GenAI systems for healthcare. Specifically, we reposition the data life cycle by making the medical data ecosystem the foundational substrate for generative healthcare systems. This ecosystem is designed to sustainably support the integration, representation, and retrieval of diverse medical data and knowledge. With effective and efficient data processing pipelines, such as semantic vector search and contextual querying, it enables GenAI-powered operations for upstream model components and downstream clinical applications. Ultimately, it not only supplies foundation models with high-quality, multimodal data for large-scale pretraining and domain-specific fine-tuning, but also serves as a knowledge retrieval backend to support task-specific inference via the agentic layer. The ecosystem enables the deployment of GenAI for high-quality and effective healthcare delivery.

[44] Advancing site-specific disease and pest management in precision agriculture: From reasoning-driven foundation models to adaptive, feedback-based learning

Nitin Rai, Daeun Choi, Nathan S. Boyd, Arnold W. Schumann

🧩 TL;DR

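This review analyzes progress in applying foundation models to site-specific disease management in crops, focusing on the role of vision-language models and large language models in adaptive learning, reinforcement learning, and digital twin frameworks. It concludes that multimodal foundation models with real-time feedback will drive the next generation of smart agriculture.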


📘 Detailed Summary

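Motivation: Traditional crop disease management is inefficient and wasteful of resources. The review explores how foundation models can transform site-specific disease management by integrating visual and textual data, reasoning about symptom-management relationships, and supporting interactive question answering, addressing limitations of traditional neural networks in agricultural applications.

Method: The review screened about 40 relevant articles, focusing on large language models and vision-language models in agriculture. It discusses how these models are integrated into adaptive learning, reinforcement learning, and digital twin frameworks, with particular attention to how multimodal foundation models can link symptom recognition with management decisions.

Result: Foundation model applications are growing quickly, with a surge of literature in 2023-24. Vision-language models are outpacing large language models, with a 5-10x increase in publications. Reinforcement learning and adaptive learning are still nascent for smart spraying. Digital twins combined with reinforcement learning can simulate targeted spraying virtually. Human-robot collaboration, especially human-in-the-loop approaches, remains limited.

Conclusion: Foundation models are becoming a key driver of intelligent agriculture, and multimodal models with real-time feedback will shape next-generation site-specific disease management. Closing the sim-to-real gap is critical for real-world deployment, and stronger human-robot collaboration, especially having human experts validate uncertain cases, is an important direction for future work.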


📄 Abstract

Site-specific disease management (SSDM) in crops has advanced rapidly through machine and deep learning (ML and DL) for real-time computer vision. Research evolved from handcrafted feature extraction to large-scale automated feature learning. With foundation models (FMs), crop disease datasets are now processed in fundamentally new ways. Unlike traditional neural networks, FMs integrate visual and textual data, interpret symptoms in text, reason about symptom-management relationships, and support interactive QA for growers and educators. Adaptive and imitation learning in robotics further enables field-based disease management. This review screened approximately 40 articles on FM applications for SSDM, focusing on large language models (LLMs) and vision-language models (VLMs), and discussing their role in adaptive learning (AL), reinforcement learning (RL), and digital twin frameworks for targeted spraying. Key findings: (a) FMs are gaining traction with surging literature in 2023-24; (b) VLMs outpace LLMs, with a 5-10x increase in publications; (c) RL and AL are still nascent for smart spraying; (d) digital twins with RL can simulate targeted spraying virtually; (e) addressing the sim-to-real gap is critical for real-world deployment; (f) human-robot collaboration remains limited, especially in human-in-the-loop approaches where robots detect early symptoms and humans validate uncertain cases; (g) multi-modal FMs with real-time feedback will drive next-gen SSDM. For updates and resources, or to submit papers, code, or datasets, visit https://github.com/nitin-dominic/AgriPathogenDatabase.