cs.CV [Total: 40]
cs.CL [Total: 7]
cs.AI [Total: 2]

cs.CV

Tomohito Kawabata, Xinyu Zhang, Ling Xiao

🧩 TL;DR

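This paper proposes SocialNav-MoE, an efficient Mixture-of-Experts vision language model that uses reinforcement fine-tuning with a semantic similarity reward to balance accuracy and efficiency in socially compliant robot navigation.

📘 Detailed Summary

Motivation: In human-populated environments, robot navigation must be both safe and socially compliant, yet prior work has focused mainly on safety. Vision language models show promise for this task, but large models incur heavy computational overhead, and the resulting inference latency and energy consumption make real-time deployment on resource-constrained robotic platforms impractical.

Method: This paper proposes SocialNav-MoE, an efficient Mixture-of-Experts vision language model designed for socially compliant navigation. It is trained with reinforcement fine-tuning and a semantic similarity reward function that strengthens decision-making. The study also systematically evaluates different small language models (Phi, Qwen, and StableLM), routing strategies, and vision encoders (CLIP vs. SigLIP, frozen vs. fine-tuned).

Result: Experiments on the SNEI dataset show that SocialNav-MoE achieves a strong balance between navigation accuracy and efficiency. The proposed semantic similarity reward is more effective than hard-level and character-level rewards. The model keeps performance high while substantially reducing computational overhead, which suits deployment on resource-constrained robotic platforms.

Conclusion: The study shows that a small vision language model, combined with a Mixture-of-Experts architecture and reinforcement fine-tuning, can handle socially compliant navigation effectively and offers a viable option for resource-constrained robotic platforms. The semantic similarity reward markedly improves decision quality and points to a technical path for future efficient, socially intelligent navigation systems.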

📄 Abstract

For robots navigating in human-populated environments, safety and social compliance are equally critical, yet prior work has mostly emphasized safety. Socially compliant navigation that accounts for human comfort, social norms, and contextual appropriateness remains underexplored. Vision language models (VLMs) show promise for this task; however, large-scale models incur substantial computational overhead, leading to higher inference latency and energy consumption, which makes them unsuitable for real-time deployment on resource-constrained robotic platforms. To address this issue, we investigate the effectiveness of small VLMs and propose SocialNav-MoE, an efficient Mixture-of-Experts vision language model for socially compliant navigation with reinforcement fine-tuning (RFT). We further introduce a semantic similarity reward (SSR) to effectively leverage RFT for enhancing decision-making capabilities. Additionally, we study the effectiveness of different small language model types (Phi, Qwen, and StableLM), routing strategies, and vision encoders (CLIP vs. SigLIP, frozen vs. fine-tuned). Experiments on the SNEI dataset demonstrate that SocialNav-MoE achieves an excellent balance between navigation accuracy and efficiency. The proposed SSR function is more effective than hard-level and character-level rewards. Source code will be released upon acceptance.
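
The three reward types compared in this abstract can be illustrated with a minimal sketch. The abstract does not define the rewards, so everything below is an assumption: the function names are hypothetical, the character-level reward is approximated with `difflib`, and the "embedding" is a toy bag of words standing in for whatever sentence encoder a real semantic similarity reward would presumably use.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def hard_reward(pred: str, ref: str) -> float:
    """1.0 only on an exact (case-insensitive) match, else 0.0."""
    return 1.0 if pred.strip().lower() == ref.strip().lower() else 0.0


def char_reward(pred: str, ref: str) -> float:
    """Character-overlap ratio in [0, 1], computed by difflib."""
    return SequenceMatcher(None, pred.lower(), ref.lower()).ratio()


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: bag of lowercase words (assumption, not the paper's encoder)."""
    return Counter(text.lower().split())


def semantic_reward(pred: str, ref: str) -> float:
    """Cosine similarity between the two toy embeddings, in [0, 1]."""
    a, b = toy_embed(pred), toy_embed(ref)
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

With these definitions, a reordered but equivalent instruction such as "slow down and yield" versus "yield and slow down" gets no credit from the hard reward but full credit from the toy semantic reward, which is the kind of gap a smoother reward signal is meant to close.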

[2] Improving VQA Reliability: A Dual-Assessment Approach with Self-Reflection and Cross-Model Verification

Xixian Wu, Yang Ou, Pengchao Tian, Zian Yang, Jielei Zhang, Peiyi Li, Longwen Gao

🧩 TL;DR

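This paper proposes DAVR, a framework that combines self-reflection and cross-model verification in a dual-assessment mechanism to mitigate hallucinations of vision-language models in visual question answering and substantially improve answer reliability.

📘 Detailed Summary

Motivation: Vision-language models show great potential in visual question answering, but their tendency to hallucinate produces overconfident yet incorrect answers, which severely undermines answer reliability.

Method: DAVR uses a dual-pathway architecture. One pathway uses dual selector modules that fuse VLM latent features with QA embeddings to assess response reliability; the other deploys external reference models for factual cross-checking to mitigate hallucinations.

Result: In the Reliable VQA Challenge at ICCV-CLVL 2025, DAVR achieved a leading Φ₁₀₀ score of 39.64 and a 100-AUC of 97.22, taking first place and demonstrating its effectiveness in making VLM responses more trustworthy.

Conclusion: Through its dual-assessment mechanism, DAVR substantially improves the reliability of vision-language models in visual question answering, offering an effective way to mitigate VLM hallucinations and increase the trustworthiness of model outputs.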

📄 Abstract

Vision-language models (VLMs) have demonstrated significant potential in Visual Question Answering (VQA). However, the susceptibility of VLMs to hallucinations can lead to overconfident yet incorrect answers, severely undermining answer reliability. To address this, we propose Dual-Assessment for VLM Reliability (DAVR), a novel framework that integrates Self-Reflection and Cross-Model Verification for comprehensive uncertainty estimation. The DAVR framework features a dual-pathway architecture: one pathway leverages dual selector modules to assess response reliability by fusing VLM latent features with QA embeddings, while the other deploys external reference models for factual cross-checking to mitigate hallucinations. Evaluated in the Reliable VQA Challenge at ICCV-CLVL 2025, DAVR achieves a leading $Φ_{100}$ score of 39.64 and a 100-AUC of 97.22, securing first place and demonstrating its effectiveness in enhancing the trustworthiness of VLM responses.

[3] HERBench: A Benchmark for Multi-Evidence Integration in Video Question Answering

Dan Ben-Ami, Gabriele Serussi, Kobi Cohen, Chaim Baskin

🧩 TL;DR

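This paper presents HERBench, a VideoQA benchmark built to evaluate how well video large language models integrate multiple pieces of evidence across time. It introduces the Minimum Required Frame-Set (MRFS) to quantify evidential demand and exposes key retrieval and fusion bottlenecks in current models.

📘 Detailed Summary

Motivation: Current video question answering benchmarks often allow questions to be answered from a single salient cue, so they under-test reasoning that must aggregate multiple, temporally separated pieces of visual evidence. Evaluation of multi-evidence integration across time remains a gap.

Method: The authors build HERBench, which contains 26K five-way multiple-choice questions organized into 12 compositional tasks; each question requires aggregating at least three non-overlapping evidential cues. They introduce the Minimum Required Frame-Set (MRFS), the smallest number of frames a model must fuse to answer correctly, as a quantitative measure.

Result: HERBench has a mean MRFS of 5.5, substantially higher than prior datasets (2.6-4.2). An evaluation of 13 state-of-the-art Video-LLMs reveals pervasive failures: accuracies of only 31-42%, slightly above the 20% random-guess baseline. The failure decomposes into two key bottlenecks, a retrieval deficit and a fusion deficit.

Conclusion: By making cross-time evidence both unavoidable and quantifiable, HERBench establishes a principled target for advancing robust, compositional video understanding. It exposes a systematic weakness of current Video-LLMs in multi-evidence integration and gives future model development a clear evaluation benchmark and direction for improvement.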

📄 Abstract

Video Large Language Models (Video-LLMs) are rapidly improving, yet current Video Question Answering (VideoQA) benchmarks often allow questions to be answered from a single salient cue, under-testing reasoning that must aggregate multiple, temporally separated visual evidence. We present HERBench, a VideoQA benchmark purpose-built to assess multi-evidence integration across time. Each question requires aggregating at least three non-overlapping evidential cues across distinct video segments, so neither language priors nor a single snapshot can suffice. HERBench comprises 26K five-way multiple-choice questions organized into twelve compositional tasks that probe identity binding, cross-entity relations, temporal ordering, co-occurrence verification, and counting. To make evidential demand measurable, we introduce the Minimum Required Frame-Set (MRFS), the smallest number of frames a model must fuse to answer correctly, and show that HERBench imposes substantially higher demand than prior datasets (mean MRFS 5.5 vs. 2.6-4.2). Evaluating 13 state-of-the-art Video-LLMs on HERBench reveals pervasive failures: accuracies of 31-42% are only slightly above the 20% random-guess baseline. We disentangle this failure into two critical bottlenecks: (1) a retrieval deficit, where frame selectors overlook key evidence, and (2) a fusion deficit, where models fail to integrate information even when all necessary evidence is provided. By making cross-time evidence both unavoidable and quantifiable, HERBench establishes a principled target for advancing robust, compositional video understanding.
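
One way to make the MRFS definition concrete is a brute-force search over frame subsets in order of increasing size. This is only a sketch of the idea under stated assumptions: the abstract does not say how MRFS is actually computed, and `answers_correctly` is a hypothetical callback standing in for a model queried on a subset of frames.

```python
from itertools import combinations
from typing import Callable, Optional, Sequence


def minimum_required_frame_set(
    frames: Sequence[int],
    answers_correctly: Callable[[tuple], bool],
    max_size: Optional[int] = None,
) -> Optional[int]:
    """Return the smallest k such that some k-frame subset yields a correct answer."""
    limit = len(frames) if max_size is None else min(max_size, len(frames))
    for k in range(1, limit + 1):
        for subset in combinations(frames, k):
            if answers_correctly(subset):
                return k
    return None  # no subset within the size limit suffices


# Toy oracle (assumption): the question needs evidence from frames 2, 5, and 9.
needed = {2, 5, 9}
oracle = lambda subset: needed.issubset(subset)
```

For the toy oracle, the search over a 12-frame video returns 3, matching the benchmark's requirement of at least three non-overlapping cues. The exhaustive search is exponential in the number of frames, so a practical procedure would need pruning or sampling.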

[4] Visual-textual Dermatoglyphic Animal Biometrics: A First Case Study on Panthera tigris

Wenshuo Li, Majid Mirmehdi, Tilo Burghardt

🧩 TL;DR

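This study proposes a cross-modal animal re-identification method that incorporates dermatoglyphic textual descriptors. By encoding animal coat topology as human-interpretable language tags, it enables text-to-visual identity retrieval and significantly improves the accuracy and explainability of AI in ecological monitoring.

📘 Detailed Summary

Motivation: Current image-based animal re-identification relies mainly on visual features, struggles with species that lack distinctive morphological features, and offers little human interpretability. The study addresses these limits of vision-only methods by introducing dermatoglyphic textual descriptors, as used in forensics, to enable cross-modal identity retrieval while easing data scarcity in ecological monitoring.

Method: The study proposes a visual-textual method built on dermatoglyphic textual descriptors that abstract animal coat topology into human-interpretable language tags. A text-image co-synthesis pipeline generates 'virtual individuals', each comprising dozens of life-like images paired with dermatoglyphic text, for data augmentation and model training. The work draws on 84,264 manually labelled minutiae across 3,355 images of 185 tigers.

Result: In benchmarks against real-world scenarios, the method significantly improves AI accuracy in cross-modal retrieval while alleviating data scarcity. The study reveals new capabilities for text-to-visual identity retrieval: identity can be recovered from text through human-verifiable matchings, going beyond the limits of vision-only methods.

Conclusion: Dermatoglyphic language-guided biometrics can overcome the limitations of vision-only methods and recover identity from textual descriptions, offering a language-driven way to unify descriptive modalities in ecological monitoring. This is a significant advance in the explainability of animal re-identification and opens a new direction for cross-modal ecological monitoring tools.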

📄 Abstract

Biologists have long combined visuals with textual field notes to re-identify (Re-ID) animals. Contemporary AI tools automate this for species with distinctive morphological features but remain largely image-based. Here, we extend Re-ID methodologies by incorporating precise dermatoglyphic textual descriptors, an approach used in forensics but new to ecology. We demonstrate that these specialist semantics abstract and encode animal coat topology using human-interpretable language tags. Drawing on 84,264 manually labelled minutiae across 3,355 images of 185 tigers (Panthera tigris), we evaluate this visual-textual methodology, revealing novel capabilities for cross-modal identity retrieval. To optimise performance, we developed a text-image co-synthesis pipeline to generate 'virtual individuals', each comprising dozens of life-like visuals paired with dermatoglyphic text. Benchmarking against real-world scenarios shows this augmentation significantly boosts AI accuracy in cross-modal retrieval while alleviating data scarcity. We conclude that dermatoglyphic language-guided biometrics can overcome vision-only limitations, enabling textual-to-visual identity recovery underpinned by human-verifiable matchings. This represents a significant advance towards explainability in Re-ID and a language-driven unification of descriptive modalities in ecological monitoring.

[5] Vibe Spaces for Creatively Connecting and Expressing Visual Concepts

Huzheng Yang, Katherine Xu, Andrew Lu, Michael D. Grossberg, Yutong Bai, Jianbo Shi

🧩 TL;DR

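This paper introduces the Vibe Blending task and the Vibe Space method, a hierarchical graph manifold that learns low-dimensional geodesics in feature space to produce smooth semantic transitions between concepts, addressing the difficulty existing methods have in finding nonlinear paths between distant concepts.

📘 Detailed Summary

Motivation: Existing methods struggle to generate blends of visual concepts because they cannot reliably identify and traverse the nonlinear paths in latent space that link distant concepts. As a result they fail to reveal the most relevant shared attributes between images (their "vibe"), which limits creative visual concept generation.

Method: The paper proposes Vibe Space, a hierarchical graph manifold that learns low-dimensional geodesics in feature spaces such as CLIP, enabling smooth and semantically consistent transitions between concepts. To evaluate creative quality, it also designs a cognitively inspired evaluation framework that combines human judgments, LLM reasoning, and a geometric path-based difficulty score.

Result: Experiments show that humans consistently rate Vibe Space blends as more creative and coherent than those of existing methods, confirming that the method reveals shared attributes between concepts effectively, particularly when connecting distant concepts.

Conclusion: Vibe Space addresses the problem of identifying nonlinear paths in concept blending and offers a new approach to creative visual generation. Its hierarchical graph manifold and geodesic learning framework provide an effective tool for modeling semantic transitions in feature space, with broad potential applications.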

📄 Abstract

Creating new visual concepts often requires connecting distinct ideas through their most relevant shared attributes -- their vibe. We introduce Vibe Blending, a novel task for generating coherent and meaningful hybrids that reveals these shared attributes between images. Achieving such blends is challenging for current methods, which struggle to identify and traverse nonlinear paths linking distant concepts in latent space. We propose Vibe Space, a hierarchical graph manifold that learns low-dimensional geodesics in feature spaces like CLIP, enabling smooth and semantically consistent transitions between concepts. To evaluate creative quality, we design a cognitively inspired framework combining human judgments, LLM reasoning, and a geometric path-based difficulty score. We find that Vibe Space produces blends that humans consistently rate as more creative and coherent than current methods.
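
The contrast between a straight-line blend and a path that follows the data can be sketched with a much simpler stand-in for the method described above: a flat k-nearest-neighbour graph with Dijkstra shortest paths, not the paper's hierarchical graph manifold. The 2D points on a half circle are a toy substitute for real feature vectors such as CLIP embeddings, and all names are hypothetical.

```python
import heapq
import math


def knn_graph(points, k):
    """Connect each point to its k nearest neighbours (symmetric, Euclidean weights)."""
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        nearest = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)[:k]
        for d, j in nearest:
            graph[i][j] = d
            graph[j][i] = d
    return graph


def geodesic_path(graph, start, goal):
    """Dijkstra shortest path; returns node indices from start to goal, or [] if unreachable."""
    dist, prev = {start: 0.0}, {}
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == goal:
            break
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    if goal not in dist:
        return []
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]


# Toy "feature space": 9 points on a half circle of radius 1.
points = [(math.cos(math.pi * t / 8), math.sin(math.pi * t / 8)) for t in range(9)]
path = geodesic_path(knn_graph(points, k=2), 0, 8)
midpoint = tuple((a + b) / 2 for a, b in zip(points[0], points[8]))
```

In this toy setup the graph path stays on the arc and passes through the point at its top, while the linear midpoint of the two endpoints lands near the origin, far from every data point. That is the failure mode of linear interpolation that a geodesic-based blend is meant to avoid.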

[6] TalkVerse: Democratizing Minute-Long Audio-Driven Video Generation

Zhenzhi Wang, Jian Wang, Ke Ma, Dahua Lin, Bing Zhou

🧩 TL;DR

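This paper introduces TalkVerse, a large-scale open corpus for single-person, audio-driven talking video generation, designed to enable fair and reproducible comparison across methods. On top of the dataset, the authors build a reproducible 5B DiT baseline that uses a video VAE with a high downsampling ratio and a sliding window mechanism to achieve minute-long generation, substantially reducing inference cost while preserving lip-sync and visual quality.

📘 Detailed Summary

Motivation: Current state-of-the-art audio-driven talking video systems rely on closed data or compute-heavy models, and the field lacks a fair, reproducible benchmark for comparison. Existing datasets fall short in scale, quality, and transparency, which hinders research progress and method evaluation.

Method: The TalkVerse dataset contains 2.3 million high-resolution, audio-video synchronized clips totaling 6.3k hours, curated from over 60k hours of video through a transparent pipeline. On this dataset the authors build a reproducible 5B DiT baseline that uses a video VAE with a high downsampling ratio and a sliding window mechanism, integrates an MLLM director to strengthen storytelling in long videos, and supports zero-shot video dubbing through controlled latent noise injection.

Result: TalkVerse provides 720p/1080p clips with 2D skeletons and structured visual/audio-style captions. The 5B DiT model achieves minute-long generation with low drift and matches the 14B Wan-S2V model in lip-sync and visual quality at 10x lower inference cost. The model supports zero-shot video dubbing, and the dataset, training recipes, and checkpoints are open-sourced.

Conclusion: TalkVerse gives research on audio-driven human video generation a fair, reproducible benchmark and substantially lowers the barrier to entry. The 5B DiT model greatly improves efficiency while maintaining generation quality; its sliding window mechanism and MLLM director offer an effective approach to long video generation, and the open-source resources should support further progress in the field.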

📄 Abstract

We introduce TalkVerse, a large-scale, open corpus for single-person, audio-driven talking video generation designed to enable fair, reproducible comparison across methods. While current state-of-the-art systems rely on closed data or compute-heavy models, TalkVerse offers 2.3 million high-resolution (720p/1080p) audio-video synchronized clips totaling 6.3k hours. These are curated from over 60k hours of video via a transparent pipeline that includes scene-cut detection, aesthetic assessment, strict audio-visual synchronization checks, and comprehensive annotations including 2D skeletons and structured visual/audio-style captions. Leveraging TalkVerse, we present a reproducible 5B DiT baseline built on Wan2.2-5B. By utilizing a video VAE with a high downsampling ratio and a sliding window mechanism with motion-frame context, our model achieves minute-long generation with low drift. It delivers comparable lip-sync and visual quality to the 14B Wan-S2V model but with 10$\times$ lower inference cost. To enhance storytelling in long videos, we integrate an MLLM director to rewrite prompts based on audio and visual cues. Furthermore, our model supports zero-shot video dubbing via controlled latent noise injection. We open-source the dataset, training recipes, and 5B checkpoints to lower barriers for research in audio-driven human video generation. Project Page: https://zhenzhiwang.github.io/talkverse/
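
The sliding window mechanism with motion-frame context mentioned in the abstract can be pictured as an index schedule: each window is conditioned on the last few frames already generated and then produces new ones. The sketch below only computes that schedule. The function name, window length, and context length are illustrative assumptions; the abstract does not give the actual values or how the context is injected into the model.

```python
def sliding_windows(total_frames: int, window: int, context: int):
    """Yield (context_range, new_range) frame-index pairs that cover total_frames.

    The first window generates `window` new frames with no context. Every later
    window reuses the last `context` generated frames as conditioning and
    generates `window - context` new frames.
    """
    if not 0 <= context < window:
        raise ValueError("context must satisfy 0 <= context < window")
    start = 0
    while start < total_frames:
        ctx_start = max(0, start - context)
        new_len = window if start == 0 else window - context
        end = min(total_frames, start + new_len)
        yield range(ctx_start, start), range(start, end)
        start = end


schedule = list(sliding_windows(total_frames=10, window=4, context=1))
```

Because each window only ever sees a fixed number of frames, memory stays constant as the video grows; the overlapping context frames are what keep motion continuous across window boundaries.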

[7] Puzzle Curriculum GRPO for Vision-Centric Reasoning

Ahmadreza Jeddi, Hakki Can Karaimer, Hue Nguyen, Zhongling Wang, Ke Zhao, Javad Rajabi, Ran Zhang, Raghav Goyal, Babak Taati, Radek Grzeszczuk

🧩 TL;DR

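This paper proposes PC-GRPO, a supervision-free reinforcement learning recipe with verifiable rewards that tackles three key problems in visual reasoning for vision language models: reliance on costly annotations, reward sparsity, and logical inconsistency between reasoning and answers.

📘 Detailed Summary

Motivation: Current outcome-supervised GRPO methods for chain-of-thought reasoning in vision language models have three key problems: reliance on costly and noisy hand-curated annotations or external verifiers; flat and sparse reward schemes in GRPO; and logical inconsistency between a reasoning chain and its final answer.

Method: PC-GRPO applies supervision-free reinforcement learning with verifiable rewards through three self-supervised puzzle environments: PatchFit and Rotation (binary rewards) and Jigsaw (graded partial credit that mitigates reward sparsity). To counter flat rewards and vanishing group-relative advantages, it introduces a difficulty-aware curriculum that dynamically weights samples and peaks at medium difficulty. Reasoning-Answer Consistency (RAC) is monitored during post-training, and consistency-enforcing reward schemes further improve it.

Result: On Qwen-7B and Qwen-3B backbones, PC-GRPO improves reasoning quality, training stability, and end-task accuracy. RAC typically rises early in training and then degrades; the curriculum delays this decline, and consistency-enforcing reward schemes further boost RAC. RAC correlates positively with downstream accuracy, and improvements hold across diverse benchmarks.

Conclusion: PC-GRPO offers vision language models a practical path to scalable, verifiable, and interpretable RL post-training. Its self-supervised puzzle environments and difficulty-aware curriculum address the limitations of existing methods and provide an effective framework for annotation-free reinforcement learning of visual reasoning.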

📄 Abstract

Recent reinforcement learning (RL) approaches like outcome-supervised GRPO have advanced chain-of-thought reasoning in Vision Language Models (VLMs), yet key issues linger: (i) reliance on costly and noisy hand-curated annotations or external verifiers; (ii) flat and sparse reward schemes in GRPO; and (iii) logical inconsistency between a chain's reasoning and its final answer. We present Puzzle Curriculum GRPO (PC-GRPO), a supervision-free recipe for RL with Verifiable Rewards (RLVR) that strengthens visual reasoning in VLMs without annotations or external verifiers. PC-GRPO replaces labels with three self-supervised puzzle environments: PatchFit, Rotation (with binary rewards) and Jigsaw (with graded partial credit mitigating reward sparsity). To counter flat rewards and vanishing group-relative advantages, we introduce a difficulty-aware curriculum that dynamically weights samples and peaks at medium difficulty. We further monitor Reasoning-Answer Consistency (RAC) during post-training: mirroring reports for vanilla GRPO in LLMs, RAC typically rises early then degrades; our curriculum delays this decline, and consistency-enforcing reward schemes further boost RAC. RAC correlates with downstream accuracy. Across diverse benchmarks and on Qwen-7B and Qwen-3B backbones, PC-GRPO improves reasoning quality, training stability, and end-task accuracy, offering a practical path to scalable, verifiable, and interpretable RL post-training for VLMs.
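
The reward-related ideas in this abstract can be sketched in a few lines: why flat rewards make group-relative advantages vanish, how graded partial credit differs from a binary reward, and what a curriculum weight that peaks at medium difficulty might look like. The advantage follows the usual GRPO-style within-group standardization; the partial-credit rule and the weight `4p(1-p)` are illustrative assumptions, not the paper's formulas.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (GRPO-style advantage)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def jigsaw_partial_credit(predicted, target):
    """Fraction of tiles placed in the correct position (graded reward in [0, 1])."""
    return sum(p == t for p, t in zip(predicted, target)) / len(target)


def difficulty_weight(success_rate):
    """Hypothetical curriculum weight: 0 at success rates 0 and 1, peak of 1 at 0.5."""
    return 4.0 * success_rate * (1.0 - success_rate)
```

When every rollout in a group gets the same reward, all advantages are zero and the sample contributes no gradient. That happens on puzzles that are too easy or too hard, which is why a weight concentrated on medium difficulty, together with graded rewards that break ties, keeps the learning signal alive.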

[8] Adaptive Multimodal Person Recognition: A Robust Framework for Handling Missing Modalities

Aref Farhadipour, Teodora Vukovic, Volker Dellwo, Petr Motlicek, Srikanth Madikeri

🧩 TL;DR

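This paper proposes a trimodal person identification framework that integrates voice, face, and gesture with a confidence-weighted fusion strategy, remaining robust when modalities are missing or degraded and reaching 99.18% Top-1 accuracy on the CANDOR dataset.

📘 Detailed Summary

Motivation: Real-world person recognition systems often face missing or degraded modalities, and existing methods are not robust enough when some modalities are unavailable. A framework is needed that adapts to unimodal, bimodal, and trimodal scenarios.

Method: The proposed trimodal framework uses multi-task learning to process voice, face, and gesture independently, applies cross-attention and gated fusion mechanisms to enable interaction across modalities, and introduces a confidence-weighted fusion strategy that adapts dynamically to missing or low-quality data.

Result: The system achieves 99.18% Top-1 person identification accuracy on CANDOR, outperforming conventional unimodal and late-fusion approaches, and 99.92% accuracy in bimodal mode on VoxCeleb1. It maintains high accuracy even when some modalities are missing.

Conclusion: The study demonstrates the effectiveness of multimodal fusion with confidence weighting for real-world person recognition and offers a robust way to handle missing modalities. The code and data are publicly available to support further research.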

📄 Abstract

Person recognition systems often rely on audio, visual, or behavioral cues, but real-world conditions frequently result in missing or degraded modalities. To address this challenge, we propose a Trimodal person identification framework that integrates voice, face, and gesture modalities, while remaining robust to modality loss. Our approach leverages multi-task learning to process each modality independently, followed by cross-attention and gated fusion mechanisms to facilitate interaction across modalities. Moreover, a confidence-weighted fusion strategy dynamically adapts to missing and low-quality data, ensuring optimal classification even in Unimodal or Bimodal scenarios. We evaluate our method on CANDOR, a newly introduced interview-based multimodal dataset, which we benchmark for the first time. Our results demonstrate that the proposed Trimodal system achieves 99.18% Top-1 accuracy on person identification tasks, outperforming conventional Unimodal and late-fusion approaches. In addition, we evaluate our model on the VoxCeleb1 dataset as a benchmark and reach 99.92% accuracy in Bimodal mode. Moreover, we show that our system maintains high accuracy even when one or two modalities are unavailable, making it a robust solution for real-world person recognition applications. The code and data for this work are publicly available.
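
A confidence-weighted fusion that tolerates missing modalities can be sketched as a renormalized weighted average of per-modality class scores. This is a generic illustration, not the paper's fusion layer: the function name, the score format, and the way confidences are obtained are all assumptions.

```python
from typing import Dict, List, Optional


def fuse_scores(
    scores: Dict[str, Optional[List[float]]],
    confidence: Dict[str, float],
) -> List[float]:
    """Confidence-weighted average of per-modality class scores.

    A modality whose scores are None (missing) is skipped, and the remaining
    confidences are renormalized so that the weights sum to 1.
    """
    available = {m: s for m, s in scores.items() if s is not None}
    total = sum(confidence[m] for m in available)
    if not available or total <= 0:
        raise ValueError("no usable modality")
    n_classes = len(next(iter(available.values())))
    fused = [0.0] * n_classes
    for m, s in available.items():
        w = confidence[m] / total
        for i, value in enumerate(s):
            fused[i] += w * value
    return fused


# Example: gesture is missing, voice and face are equally trusted.
scores = {"voice": [0.6, 0.4], "face": [0.2, 0.8], "gesture": None}
confidence = {"voice": 0.5, "face": 0.5, "gesture": 0.9}
fused = fuse_scores(scores, confidence)
```

Renormalizing over the available modalities means the same function covers the trimodal, bimodal, and unimodal cases without special handling; a low-quality input can be down-weighted simply by lowering its confidence.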

Sibi Parivendan, Kashfia Sailunaz, Suresh Neethirajan

🧩 TL;DR

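This study proposes a pose-based computational framework for classifying social interactions in livestock. It moves beyond conventional static proximity thresholds by modeling the spatiotemporal geometry of anatomical keypoints to distinguish affiliative from agonistic behaviors.

📘 Detailed Summary

Motivation: Precision livestock farming requires objective assessment of social behavior to support herd welfare monitoring, but most existing approaches infer interactions from static proximity thresholds that cannot distinguish affiliative from agonistic behaviors in complex barn environments. This limits the interpretability of automated social network analysis in commercial settings.

Method: The method is a pose-based computational framework for interaction classification that models the spatiotemporal geometry of anatomical keypoints to encode interaction-specific motion signatures. It is implemented as an end-to-end computer vision pipeline that integrates YOLOv11 for object detection, supervised individual identification, ByteTrack for multi-object tracking, ZebraPose for 27-point anatomical keypoint estimation, and a support vector machine classifier trained on pose-derived distance dynamics.

Result: On annotated interaction clips collected from a commercial dairy barn, the classifier reached 77.51% accuracy in distinguishing affiliative from agonistic behaviors using pose information alone. Compared with a proximity-only baseline, it shows substantial gains in behavioral discrimination, particularly for affiliative interactions. Component performance: object detection mAP@0.50 of 96.24%, individual identification accuracy of 98.24%, and multi-object tracking accuracy of 81.96%.

Conclusion: The study establishes a proof of concept for automated, vision-based inference of social interactions suitable for constructing interaction-aware social networks, with near-real-time performance on commodity hardware. By going beyond simple proximity heuristics, it gives precision livestock farming a finer-grained tool for behavior monitoring.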

📄 Abstract

Precision livestock farming requires objective assessment of social behavior to support herd welfare monitoring, yet most existing approaches infer interactions using static proximity thresholds that cannot distinguish affiliative from agonistic behaviors in complex barn environments. This limitation constrains the interpretability of automated social network analysis in commercial settings. We present a pose-based computational framework for interaction classification that moves beyond proximity heuristics by modeling the spatiotemporal geometry of anatomical keypoints. Rather than relying on pixel-level appearance or simple distance measures, the proposed method encodes interaction-specific motion signatures from keypoint trajectories, enabling differentiation of social interaction valence. The framework is implemented as an end-to-end computer vision pipeline integrating YOLOv11 for object detection (mAP@0.50: 96.24%), supervised individual identification (98.24% accuracy), ByteTrack for multi-object tracking (81.96% accuracy), ZebraPose for 27-point anatomical keypoint estimation, and a support vector machine classifier trained on pose-derived distance dynamics. On annotated interaction clips collected from a commercial dairy barn, the classifier achieved 77.51% accuracy in distinguishing affiliative and agonistic behaviors using pose information alone. Comparative evaluation against a proximity-only baseline shows substantial gains in behavioral discrimination, particularly for affiliative interactions. The results establish a proof-of-concept for automated, vision-based inference of social interactions suitable for constructing interaction-aware social networks, with near-real-time performance on commodity hardware.
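
The phrase "pose-derived distance dynamics" can be made concrete with a small feature-extraction sketch: track the distance between a keypoint on one animal and a keypoint on another over time, then summarize how it evolves. The keypoint names, the track format, and the chosen statistics are assumptions for illustration; the abstract does not specify the features fed to the SVM.

```python
import math
from statistics import mean


def distance_series(track_a, track_b, kp_a, kp_b):
    """Per-frame Euclidean distance between keypoint kp_a of A and kp_b of B.

    Each track is a list of frames; each frame maps keypoint name -> (x, y).
    """
    return [math.dist(fa[kp_a], fb[kp_b]) for fa, fb in zip(track_a, track_b)]


def distance_dynamics(series):
    """Summarize a distance series: mean, minimum, and mean frame-to-frame change."""
    deltas = [b - a for a, b in zip(series, series[1:])]
    return {
        "mean": mean(series),
        "min": min(series),
        "mean_delta": mean(deltas) if deltas else 0.0,
    }


# Toy tracks (assumption): animal A stands still, animal B moves around it.
track_a = [{"nose": (0, 0)}] * 3
track_b = [{"nose": (3, 4)}, {"nose": (6, 8)}, {"nose": (0, 5)}]
features = distance_dynamics(distance_series(track_a, track_b, "nose", "nose"))
```

A static proximity threshold would only look at whether the distance falls below a cutoff; features like the rate of change capture how the animals approach or retreat, which is the extra information a pose-based classifier can use to separate interaction types.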

Ziyu Shang, Haoran Liu, Rongchao Zhang, Zhiqian Wei, Tongtong Feng

🧩 TL;DR

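This paper proposes Pose-guided Multi-view Multimodal Diffusion (PMMD), a multimodal diffusion framework that synthesizes photorealistic person images with controllable pose and appearance, addressing the limitations of existing methods with occlusions, garment style drift, and pose alignment.

📘 Detailed Summary

Motivation: Methods for generating consistent human images with controllable pose and appearance, used in virtual try-on, image editing, and digital human creation, still struggle with occlusions, garment style drift, and inaccurate pose alignment. These limitations reduce the realism and consistency of generated images.

Method: PMMD synthesizes photorealistic person images conditioned on multi-view references, pose maps, and text prompts. A multimodal encoder jointly models visual views, pose features, and semantic descriptions to reduce cross-modal discrepancy. A ResCVA module enhances local detail while preserving global structure, and a cross-modal fusion module integrates image semantics with text during denoising.

Result: Experiments on the DeepFashion MultiModal dataset show that PMMD outperforms representative baselines in consistency, detail preservation, and controllability, confirming the framework's effectiveness for generating high-quality, controllable person images.

Conclusion: The study demonstrates the potential of multimodal diffusion models for controllable person image generation. By jointly modeling visual, pose, and semantic information with purpose-built modules, it offers a more robust and flexible approach for virtual try-on, image editing, and digital human creation.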

📄 Abstract

Generating consistent human images with controllable pose and appearance is essential for applications in virtual try-on, image editing, and digital human creation. Current methods often suffer from occlusions, garment style drift, and pose misalignment. We propose Pose-guided Multi-view Multimodal Diffusion (PMMD), a diffusion framework that synthesizes photorealistic person images conditioned on multi-view references, pose maps, and text prompts. A multimodal encoder jointly models visual views, pose features, and semantic descriptions, which reduces cross-modal discrepancy and improves identity fidelity. We further design a ResCVA module to enhance local detail while preserving global structure, and a cross-modal fusion module that integrates image semantics with text throughout the denoising pipeline. Experiments on the DeepFashion MultiModal dataset show that PMMD outperforms representative baselines in consistency, detail preservation, and controllability. Project page and code are available at https://github.com/ZANMANGLOOPYE/PMMD.

[11] Uni-Parser Technical Report

Xi Fang, Haoyi Tao, Shuwen Yang, Suyang Zhong, Haocheng Lu, Han Lyu, Chaozheng Huang, Xinyu Li, Linfeng Zhang, Guolin Ke

🧩 TL;DR

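This report introduces Uni-Parser, an industrial-grade document parsing engine for scientific literature and patents. Its modular multi-expert architecture delivers high throughput, robust accuracy, and cost efficiency, preserves cross-modal alignment, and processes up to 20 PDF pages per second.

📘 Detailed Summary

Motivation: Conventional pipeline-based document parsing is limited in cross-modal alignment and scalability, and cannot efficiently handle the complex multimodal content of scientific literature and patents (text, equations, tables, figures, chemical structures). An industrial-grade solution is needed that preserves fine-grained cross-modal alignment while supporting large-scale deployment.

Method: Uni-Parser uses a modular, loosely coupled multi-expert architecture that preserves fine-grained cross-modal alignments across text, equations, tables, figures, and chemical structures. The system includes adaptive GPU load balancing, distributed inference, dynamic module orchestration, and configurable modes that support either holistic or modality-specific parsing, and it is easily extensible to emerging modalities.

Result: On 8 NVIDIA RTX 4090D GPUs, Uni-Parser reaches a throughput of up to 20 PDF pages per second, enabling cost-efficient inference across billions of pages. The system is optimized for large-scale cloud deployment while maintaining robust accuracy.

Conclusion: Uni-Parser provides a scalable, industrial-grade solution for parsing scientific literature and patents. Its modular architecture and throughput support downstream applications ranging from literature retrieval and summarization to the extraction of chemical structures, reaction schemes, and bioactivity data, and it provides infrastructure for curating large-scale corpora to train next-generation large language models and AI4Science models.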

📄 Abstract

This technical report introduces Uni-Parser, an industrial-grade document parsing engine tailored for scientific literature and patents, delivering high throughput, robust accuracy, and cost efficiency. Unlike pipeline-based document parsing methods, Uni-Parser employs a modular, loosely coupled multi-expert architecture that preserves fine-grained cross-modal alignments across text, equations, tables, figures, and chemical structures, while remaining easily extensible to emerging modalities. The system incorporates adaptive GPU load balancing, distributed inference, dynamic module orchestration, and configurable modes that support either holistic or modality-specific parsing. Optimized for large-scale cloud deployment, Uni-Parser achieves a processing rate of up to 20 PDF pages per second on 8 x NVIDIA RTX 4090D GPUs, enabling cost-efficient inference across billions of pages. This level of scalability facilitates a broad spectrum of downstream applications, ranging from literature retrieval and summarization to the extraction of chemical structures, reaction schemes, and bioactivity data, as well as the curation of large-scale corpora for training next-generation large language models and AI4Science models.
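
The throughput figure in the abstract lends itself to a quick back-of-the-envelope check. The sketch below assumes the peak rate of 20 pages per second is sustained on a single 8-GPU node and uses a corpus size of one billion pages purely as an example; neither assumption comes from the report, and the helper names are hypothetical.

```python
def pages_per_gpu_second(pages_per_second: float, gpus: int) -> float:
    """Average per-GPU throughput, assuming load is spread evenly."""
    return pages_per_second / gpus


def days_to_parse(pages: float, pages_per_second: float) -> float:
    """Wall-clock days needed to parse `pages` at a constant rate."""
    return pages / pages_per_second / 86_400  # 86,400 seconds per day


per_gpu = pages_per_gpu_second(20, 8)     # 2.5 pages per second per GPU
node_days = days_to_parse(1e9, 20)        # about 578.7 days on one 8-GPU node
```

Under these assumptions a single node would need well over a year for a billion pages, so corpus-scale parsing depends on the distributed inference and load balancing the report describes; with near-linear scaling, 100 such nodes would bring the same job down to under six days.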

[12] Is Nano Banana Pro a Low-Level Vision All-Rounder? A Comprehensive Evaluation on 14 Tasks and 40 Datasets

Jialong Zuo, Haoyou Deng, Hanyu Zhou, Jiaxin Zhu, Yicheng Zhang, Yiwei Zhang, Yongxin Yan, Kaixing Huang, Weisen Chen, Yongtai Deng, Rui Jin, Nong Sang, Changxin Gao

🧩 TL;DR

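This study uses zero-shot evaluation to examine how general Nano Banana Pro is across low-level vision tasks. It finds strong subjective visual quality but weaker scores on traditional quantitative metrics, revealing a performance dichotomy for generative models in low-level vision.

📘 Detailed Summary

Motivation: Text-to-image generation models such as Nano Banana Pro have advanced visual content creation considerably, but their potential as generalist solvers for traditional low-level vision problems remains underexplored. The study systematically evaluates whether Nano Banana Pro can serve as a low-level vision all-rounder.

Method: The study uses a zero-shot evaluation framework covering 14 distinct low-level vision tasks and 40 diverse datasets. Using simple textual prompts without fine-tuning, it benchmarks Nano Banana Pro against state-of-the-art specialist models.

Result: The experiments show a clear performance dichotomy. Nano Banana Pro delivers superior subjective visual quality and generates plausible high-frequency details, but it lags behind specialist models on traditional reference-based quantitative metrics. The authors attribute the gap to the inherent stochasticity of generative models, which makes it hard to meet the strict pixel-level consistency those metrics require.

Conclusion: The study identifies Nano Banana Pro as a capable zero-shot contender for low-level vision tasks, while noting that reaching the high fidelity of domain specialists remains a significant challenge for generative models. This points to directions for future work combining low-level vision and generative models.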

📄 Abstract

The rapid evolution of text-to-image generation models has revolutionized visual content creation. While commercial products like Nano Banana Pro have garnered significant attention, their potential as generalist solvers for traditional low-level vision challenges remains largely underexplored. In this study, we investigate the critical question: Is Nano Banana Pro a Low-Level Vision All-Rounder? We conducted a comprehensive zero-shot evaluation across 14 distinct low-level tasks spanning 40 diverse datasets. By utilizing simple textual prompts without fine-tuning, we benchmarked Nano Banana Pro against state-of-the-art specialist models. Our extensive analysis reveals a distinct performance dichotomy: while Nano Banana Pro demonstrates superior subjective visual quality, often hallucinating plausible high-frequency details that surpass specialist models, it lags behind in traditional reference-based quantitative metrics. We attribute this discrepancy to the inherent stochasticity of generative models, which struggle to maintain the strict pixel-level consistency required by conventional metrics. This report identifies Nano Banana Pro as a capable zero-shot contender for low-level vision tasks, while highlighting that achieving the high fidelity of domain specialists remains a significant hurdle.
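
The gap between perceived quality and reference-based scores can be illustrated with PSNR, used here as a typical example of such a metric (the abstract does not name the metrics it reports). The pixel values below are a toy 1D "image" invented for illustration.

```python
import math


def psnr(reference, output, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(output) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((r - o) ** 2 for r, o in zip(reference, output)) / len(reference)
    return math.inf if mse == 0 else 10.0 * math.log10(max_value ** 2 / mse)


reference = [10, 200, 10, 200]    # sharp alternating texture
blurry = [105, 105, 105, 105]     # over-smoothed, no texture at all
shifted = [200, 10, 200, 10]      # equally sharp texture, misaligned by one pixel
```

Here the textureless output scores higher PSNR than the sharp but misaligned one, even though the latter looks more like the reference. A generative model that synthesizes plausible detail without reproducing it pixel for pixel is penalized in the same way.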

[13] Emotion Recognition in Signers

Kotaro Funakoshi, Yaoxiong Zhu

🧩 TL;DR

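This study proposes a cross-lingual approach to emotion recognition in signers. Using eJSL, a new Japanese Sign Language dataset, together with the British Sign Language dataset BOBSL, it addresses two challenges: the overlap between grammatical and affective facial expressions, and data scarcity.

📘 Detailed Summary

Motivation: Emotion recognition in signers faces two core challenges: grammatical and affective facial expressions overlap, which makes recognition difficult, and scarce emotion data for sign language limits model training. The study addresses both in a cross-lingual setting by building a new Japanese Sign Language emotion dataset and drawing on existing British Sign Language resources.

Method: The study introduces eJSL as a new benchmark for emotion recognition in Japanese Sign Language signers; it contains 1,092 video clips in which two signers express 78 distinct utterances in each of seven emotional states. The approach uses cross-lingual transfer, applies textual emotion recognition from spoken language to mitigate sign language data scarcity, and explores temporal segment selection and the use of hand motion to improve emotion recognition.

Result: The experiments show that 1) textual emotion recognition in spoken language effectively mitigates data scarcity in sign language; 2) temporal segment selection has a significant impact on recognition performance; and 3) incorporating hand motion improves emotion recognition in signers. The resulting baseline outperforms spoken language LLMs.

Conclusion: By combining cross-lingual data resources with multimodal information, the study offers an effective approach to emotion recognition in signers. It establishes a stronger baseline than spoken language LLMs, demonstrates the importance of temporal segment selection and hand motion, and contributes a benchmark dataset for future research.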

📄 Abstract

Recognition of signers' emotions suffers from one theoretical challenge and one practical challenge, namely, the overlap between grammatical and affective facial expressions and the scarcity of data for model training. This paper addresses these two challenges in a cross-lingual setting using our eJSL dataset, a new benchmark dataset for emotion recognition in Japanese Sign Language signers, and BOBSL, a large British Sign Language dataset with subtitles. In eJSL, two signers expressed 78 distinct utterances with each of seven different emotional states, resulting in 1,092 video clips. We empirically demonstrate that 1) textual emotion recognition in spoken language mitigates data scarcity in sign language, 2) temporal segment selection has a significant impact, and 3) incorporating hand motion enhances emotion recognition in signers. Finally, we establish a stronger baseline than spoken language LLMs.

[14] Explainable Action Form Assessment by Exploiting Multimodal Chain-of-Thoughts Reasoning

Mengshi Qi, Yeteng Wu, Xianlin Zhang, Huadong Ma

🧩 TL;DR

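This paper defines the Human Action Form Assessment (AFA) task and introduces CoT-AFA, a dataset of fitness and martial arts videos with multi-level annotations. It also proposes an explainable fitness assessment framework that uses a Chain-of-Thought explanation paradigm to provide complete action analysis and suggestions for improvement.

📘 Detailed Summary

Motivation: Current video understanding methods focus on recognizing and localizing actions, which does not meet the practical need to assess how standard an action is and give feedback for improvement. Existing datasets lack labels for the degree of action standardization, and action quality assessment datasets lack explainability and detailed feedback.

Method: The paper defines the new Human Action Form Assessment (AFA) task and builds CoT-AFA, a dataset of fitness and martial arts videos with multi-level annotations. A Chain-of-Thought explanation paradigm provides a complete reasoning process, from identifying an action step to analyzing its outcome and proposing a solution. The proposed Explainable Fitness Assessor framework uses two parallel processing streams and a dynamic gating mechanism to fuse visual and semantic information.

Result: The method improves explanation generation (for example, +16.0% in CIDEr), action classification accuracy (+2.7%), and quality assessment accuracy (+2.1%), indicating strong potential of CoT-AFA for future studies.

Conclusion: The study provides a new task definition, dataset, and assessment framework for evaluating action standardization. Its Chain-of-Thought explanation paradigm makes action analysis more explainable and lays groundwork for future action quality assessment and personalized coaching systems.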

📄 Abstract

Evaluating whether a human action is standard and providing reasonable feedback to improve action standardization is crucial but challenging in real-world scenarios. However, current video understanding methods are mainly concerned with what and where the action is, which does not meet these requirements. Meanwhile, most existing datasets lack labels indicating the degree of action standardization, and action quality assessment datasets lack explainability and detailed feedback. Therefore, we define a new Human Action Form Assessment (AFA) task, and introduce a new diverse dataset CoT-AFA, which contains a large scale of fitness and martial arts videos with multi-level annotations for comprehensive video analysis. We enrich the CoT-AFA dataset with a novel Chain-of-Thought explanation paradigm. Instead of offering isolated feedback, our explanations provide a complete reasoning process, from identifying an action step to analyzing its outcome and proposing a concrete solution. Furthermore, we propose a framework named Explainable Fitness Assessor, which can not only judge an action but also explain why and provide a solution. This framework employs two parallel processing streams and a dynamic gating mechanism to fuse visual and semantic information, thereby boosting its analytical capabilities. The experimental results demonstrate that our method has achieved improvements in explanation generation (e.g., +16.0% in CIDEr), action classification (+2.7% in accuracy) and quality assessment (+2.1% in accuracy), revealing great potential of CoT-AFA for future studies. Our dataset and source code are available at https://github.com/MICLAB-BUPT/EFA.

[15] VTCBench: Can Vision-Language Models Understand Long Context with Vision-Text Compression?

Hongbo Zhao, Meng Wang, Fei Zhu, Wenzhuo Liu, Bolin Ni, Fanhu Zeng, Gaofeng Meng, Zhaoxiang Zhang

🧩 TL;DR

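This paper introduces the first benchmark for vision-text compression (VTC) and systematically evaluates the long-context understanding of vision-language models on compressed long text. Although VTC achieves 3x-20x token compression, most models show markedly weak understanding of long-range associations in the compressed input.

📘 Detailed Summary

Motivation: Vision-text compression (VTC) methods such as DeepSeek-OCR and Glyph convert long texts into dense 2D visual representations and achieve 3x-20x token compression, but the effect of this high information density on the core long-context capabilities of vision-language models has not been studied systematically. This gap limits the potential of VTC for scalable LLMs.

Method: The study designs the first VTC benchmark, with three long-context understanding settings: VTC-Retrieval evaluates information retrieval and aggregation; VTC-Reasoning tests the ability to infer latent associations to locate facts with minimal lexical overlap; and VTC-Memory measures comprehensive question answering within long-term dialogue memory. It also establishes VTCBench-Wild to simulate diverse input scenarios, and evaluates leading open-source and proprietary models.

Result: Although most vision-language models decode textual information (e.g., OCR) well, they show surprisingly weak long-context understanding on VTC-compressed information and fail to capture long associations or dependencies in the context. This finding holds across the benchmark settings.

Conclusion: The study reveals a systematic weakness in the long-context understanding of vision-language models under VTC and provides a foundation for designing more efficient and scalable models. It stresses that compression efficiency must be pursued together with genuine understanding of the compressed information, and provides empirical evidence to guide future VTC work.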

📄 Abstract

The computational and memory overheads associated with expanding the context window of LLMs severely limit their scalability. A noteworthy solution is vision-text compression (VTC), exemplified by frameworks like DeepSeek-OCR and Glyph, which convert long texts into dense 2D visual representations, thereby achieving token compression ratios of 3x-20x. However, the impact of this high information density on the core long-context capabilities of vision-language models (VLMs) remains under-investigated. To address this gap, we introduce the first benchmark for VTC and systematically assess the performance of VLMs across three long-context understanding settings: VTC-Retrieval, which evaluates the model's ability to retrieve and aggregate information; VTC-Reasoning, which requires models to infer latent associations to locate facts with minimal lexical overlap; and VTC-Memory, which measures comprehensive question answering within long-term dialogue memory. Furthermore, we establish VTCBench-Wild to simulate diverse input scenarios. We comprehensively evaluate leading open-source and proprietary models on our benchmarks. The results indicate that, despite being able to decode textual information (e.g., OCR) well, most VLMs exhibit a surprisingly poor long-context understanding ability with VTC-compressed information, failing to capture long associations or dependencies in the context. This study provides a deep understanding of VTC and serves as a foundation for designing more efficient and scalable VLMs.

[16] EagleVision: A Dual-Stage Framework with BEV-grounding-based Chain-of-Thought for Spatial Intelligence

Jiaxu Wan, Xu Wang, Mengwei Xie, Hang Zhang, Mu Xu, Yang Han, Hong Zhang, Ding Yuan, Yifan Yang

🧩 TL;DR

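This paper presents EagleVision, a dual-stage framework for progressive spatial cognition. Through macro perception and micro verification, it addresses weak spatial consistency, limited viewpoint diversity, and untraceable evidence chains in existing spatial intelligence methods.

📘 Detailed Summary

Motivation: Existing spatial intelligence methods typically attach 3D cues to 2D reasoning pipelines or couple MLLMs with black-box reconstruction modules, which leads to weak spatial consistency, limited viewpoint diversity, and evidence chains that cannot be traced back to supporting views. "Thinking with images" frameworks achieve stepwise multimodal reasoning by interleaving hypothesis formation with active acquisition of visual evidence, but they leave three key challenges in spatial Chain-of-Thought unaddressed: building global space perception under strict token budgets, explicitly associating 3D hypotheses with video frames for verification, and designing spatially grounded rewards for reinforcement learning.

Method: EagleVision uses a dual-stage framework for progressive spatial cognition. In the macro perception stage, a semantics-perspective-fusion determinantal point process (SPF-DPP) selects a compact set of geometry- and semantics-aware keyframes from long videos under a fixed token budget. In the micro verification stage, spatial Chain-of-Thought is formalized as BEV-grounded pose querying: the agent iteratively predicts poses on a BEV plane, retrieves the nearest real frames, and is trained by reinforcement learning with a spatial grounding reward that scores the consistency between predicted poses and observed views.

Result: On VSI-Bench, EagleVision achieves state-of-the-art performance among open-source vision-language models, demonstrating strong and generalizable spatial understanding. The dual-stage design addresses spatial consistency, viewpoint diversity, and evidence traceability.

Conclusion: The study shows that a dual-stage framework of macro perception and micro verification can address key challenges in spatial cognition. Formalizing spatial Chain-of-Thought as BEV-grounded pose querying, combined with reinforcement learning and a spatial grounding reward, yields more accurate and traceable spatial reasoning and suggests a new design pattern for spatial intelligence systems.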

📄 Abstract

Recent spatial intelligence approaches typically attach 3D cues to 2D reasoning pipelines or couple MLLMs with black-box reconstruction modules, leading to weak spatial consistency, limited viewpoint diversity, and evidence chains that cannot be traced back to supporting views. Frameworks for "thinking with images" (e.g., ChatGPT-o3 and DeepEyes) show that stepwise multimodal reasoning can emerge by interleaving hypothesis formation with active acquisition of visual evidence, but they do not address three key challenges in spatial Chain-of-Thought (CoT): building global space perception under strict token budgets, explicitly associating 3D hypotheses with video frames for verification, and designing spatially grounded rewards for reinforcement learning. To address these issues, we present EagleVision, a dual-stage framework for progressive spatial cognition through macro perception and micro verification. In the macro perception stage, EagleVision employs a semantics-perspective-fusion determinantal point process (SPF-DPP) to select a compact set of geometry- and semantics-aware keyframes from long videos under a fixed token budget. In the micro verification stage, we formalize spatial CoT as BEV-grounded pose querying: the agent iteratively predicts poses on a BEV plane, retrieves the nearest real frames, and is trained purely by reinforcement learning with a spatial grounding reward that scores the consistency between predicted poses and observed views. On VSI-Bench, EagleVision achieves state-of-the-art performance among open-source vision-language models, demonstrating strong and generalizable spatial understanding.
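
The budgeted keyframe selection in the macro perception stage can be illustrated with a generic greedy determinantal point process. This is a minimal sketch under stated assumptions, not the paper's SPF-DPP: the kernel `quality[i] * cosine(f_i, f_j) * quality[j]`, the function names, and the toy feature vectors are all hypothetical, and feature vectors are assumed to be non-zero.

```python
import math

def det(m):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n, d = len(a), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) < 1e-12:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def greedy_dpp(features, quality, budget):
    """Greedily pick up to `budget` frames that maximize det of the kernel submatrix.

    Assumed kernel (illustrative only): L[i][j] = q_i * cos(f_i, f_j) * q_j,
    so high-quality frames are preferred and near-duplicates are penalized.
    """
    n = len(features)
    L = [[quality[i] * cosine(features[i], features[j]) * quality[j]
          for j in range(n)] for i in range(n)]
    selected = []
    while len(selected) < budget:
        best, best_det = None, 0.0
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            d = det([[L[a][b] for b in idx] for a in idx])
            if d > best_det + 1e-9:  # margin keeps float ties on the earliest frame
                best, best_det = i, d
        if best is None:  # no remaining frame adds volume
            break
        selected.append(best)
    return selected

# Frame 1 is a near-duplicate of frame 0; frame 2 shows a different view.
frames = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]
print(greedy_dpp(frames, [1.0, 1.0, 1.0], budget=2))
```

With equal quality scores the near-duplicate frame is skipped in favor of the distinct view, which is the diversity behavior a DPP is meant to provide under a fixed budget.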

[17] Intersectional Fairness in Vision-Language Models for Medical Image Disease Classification

Yupeng Zhang, Adam G. Dunn, Usman Naseem, Jinman Kim

🧩 TL;DR

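This study proposes Cross-Modal Alignment Consistency (CMAC-MMD), a training framework that targets intersectional bias in medical AI systems. It standardises diagnostic certainty across intersectional patient subgroups without requiring sensitive demographic data at clinical inference time, achieving equitable performance while maintaining high accuracy.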

📘 Detailed Summary

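Motivation: Medical AI systems, particularly multimodal vision-language models, often exhibit intersectional bias: the models are systematically less confident when diagnosing marginalised patient subgroups. Demographically skewed data and divergent distributions of diagnostic certainty can therefore lead to higher rates of inaccurate and missed diagnoses. Existing fairness interventions often fail to close these gaps or sacrifice overall diagnostic performance to reach statistical parity among subgroups.

Method: The study develops the Cross-Modal Alignment Consistency (CMAC-MMD) training framework, which standardises diagnostic certainty across intersectional patient subgroups. Unlike traditional debiasing methods, it equalises the model's decision confidence without requiring sensitive demographic data at clinical inference time, using measures such as maximum mean discrepancy (MMD) to align cross-modal feature distributions.

Result: In the dermatology cohort, the method reduced the overall intersectional missed-diagnosis gap (ΔTPR) from 0.50 to 0.26 while raising overall AUC from 0.94 to 0.97. For glaucoma screening, ΔTPR fell from 0.41 to 0.31 with an AUC of 0.72 (baseline 0.71). The evaluation used 10,015 skin lesion images (HAM10000) and 10,000 fundus images (Harvard-FairVLMed), with external validation.

Conclusion: The study establishes a scalable framework for developing high-stakes clinical decision support systems that remain accurate and perform equitably across patient subgroups, ensuring reliable performance without increasing privacy risks and offering a new technical path for fairness interventions in medical AI.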

📄 Abstract

Medical artificial intelligence (AI) systems, particularly multimodal vision-language models (VLM), often exhibit intersectional biases where models are systematically less confident in diagnosing marginalised patient subgroups. Such bias can lead to higher rates of inaccurate and missed diagnoses due to demographically skewed data and divergent distributions of diagnostic certainty. Current fairness interventions frequently fail to address these gaps or compromise overall diagnostic performance to achieve statistical parity among the subgroups. In this study, we developed Cross-Modal Alignment Consistency (CMAC-MMD), a training framework that standardises diagnostic certainty across intersectional patient subgroups. Unlike traditional debiasing methods, this approach equalises the model's decision confidence without requiring sensitive demographic data during clinical inference. We evaluated this approach using 10,015 skin lesion images (HAM10000) with external validation on 12,000 images (BCN20000), and 10,000 fundus images for glaucoma detection (Harvard-FairVLMed), stratifying performance by intersectional age, gender, and race attributes. In the dermatology cohort, the proposed method reduced the overall intersectional missed diagnosis gap (difference in True Positive Rate, ΔTPR) from 0.50 to 0.26 while improving the overall Area Under the Curve (AUC) from 0.94 to 0.97 compared to standard training. Similarly, for glaucoma screening, the method reduced ΔTPR from 0.41 to 0.31, achieving a better AUC of 0.72 (vs. 0.71 baseline). This establishes a scalable framework for developing high-stakes clinical decision support systems that are both accurate and can perform equitably across diverse patient subgroups, ensuring reliable performance without increasing privacy risks.
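
The ΔTPR figures quoted above are gaps in true positive rate between intersectional subgroups. A minimal sketch of such a metric follows, assuming the gap is the maximum minus the minimum per-subgroup TPR; the abstract does not spell out the exact aggregation, and the function name and toy labels are hypothetical.

```python
from collections import defaultdict

def tpr_gap(y_true, y_pred, groups):
    """Return (gap, per-group TPR) over subgroups with at least one positive case.

    Assumption: gap = max TPR - min TPR across subgroups. `groups` holds one
    hashable subgroup key per sample, e.g. an (age band, gender, race) tuple.
    """
    true_pos = defaultdict(int)
    positives = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        if t == 1:
            positives[g] += 1
            true_pos[g] += int(p == 1)
    tpr = {g: true_pos[g] / positives[g] for g in positives}
    return max(tpr.values()) - min(tpr.values()), tpr

# Toy data: subgroup A has no missed diagnoses, subgroup B misses one of two.
y_true = [1, 1, 1, 1, 0]
y_pred = [1, 1, 1, 0, 1]
groups = ["A", "A", "B", "B", "A"]
gap, per_group = tpr_gap(y_true, y_pred, groups)
print(gap, per_group)
```

Negative cases are ignored on purpose: the metric tracks missed diagnoses, so only samples with a positive ground-truth label contribute.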

Kaixing Long, Danyi Weng, Yun Mi, Zhentai Zhang, Yanmeng Lu, Jian Geng, Zhitao Zhou, Liming Zhong, Qianjin Feng, Wei Yang, Lei Cao

🧩 TL;DR

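This paper proposes a cross-modal ultra-scale learning network (CMUS-Net) that addresses the scale gap between nanoscale TEM images and microscale OM/IM images in renal biopsy pathology, achieving for the first time automatic classification of multiple glomerular diseases from images spanning three modalities and two scales.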

📘 Detailed Summary

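Motivation: In renal biopsy pathology, transmission electron microscopy (TEM) images at the nanoscale differ substantially in scale from optical microscopy (OM) or immunofluorescence microscopy (IM) images at the microscale. Existing multi-modal and multi-scale models struggle to fuse these features effectively and improve classification accuracy, which hinders building a multi-modal automatic classification model on three types of renal biopsy images to help pathologists identify multiple glomerular diseases.

Method: CMUS-Net uses multiple kinds of ultrastructural information to bridge the scale gap between nanometer and micrometer images. It introduces a sparse multi-instance learning module to aggregate TEM image features, designs a cross-modal scale attention module to promote feature interaction and enhance pathological semantic information, and combines multiple loss functions so that the model can weigh the importance of different modalities and classify glomerular diseases precisely.

Result: On an in-house dataset, CMUS-Net achieves an accuracy (ACC) of 95.37 ± 2.41%, an AUC of 99.05 ± 0.53%, and an F1-score of 95.32 ± 2.41%. Extensive experiments show that it outperforms other well-known multi-modal or multi-scale methods and generalizes well to the MN staging task.

Conclusion: The method follows the conventional workflow of renal biopsy pathology diagnosis and, for the first time, automatically classifies multiple glomerular diseases, including IgA nephropathy, membranous nephropathy, and lupus nephritis, from images spanning three modalities and two scales. It offers an effective multi-modal fusion solution for computer-aided pathology diagnosis, and the code has been released to support related research.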

📄 Abstract

Constructing a multi-modal automatic classification model based on three types of renal biopsy images can assist pathologists in glomerular multi-disease identification. However, the substantial scale difference between transmission electron microscopy (TEM) image features at the nanoscale and optical microscopy (OM) or immunofluorescence microscopy (IM) images at the microscale poses a challenge for existing multi-modal and multi-scale models in achieving effective feature fusion and improving classification accuracy. To address this issue, we propose a cross-modal ultra-scale learning network (CMUS-Net) for the auxiliary diagnosis of multiple glomerular diseases. CMUS-Net utilizes multiple kinds of ultrastructural information to bridge the scale difference between nanometer and micrometer images. Specifically, we introduce a sparse multi-instance learning module to aggregate features from TEM images. Furthermore, we design a cross-modal scale attention module to facilitate feature interaction, enhancing pathological semantic information. Finally, multiple loss functions are combined, allowing the model to weigh the importance among different modalities and achieve precise classification of glomerular diseases. Our method follows the conventional process of renal biopsy pathology diagnosis and, for the first time, performs automatic classification of multiple glomerular diseases including IgA nephropathy (IgAN), membranous nephropathy (MN), and lupus nephritis (LN) based on images from three modalities and two scales. On an in-house dataset, CMUS-Net achieves an ACC of 95.37 ± 2.41%, an AUC of 99.05 ± 0.53%, and an F1-score of 95.32 ± 2.41%. Extensive experiments demonstrate that CMUS-Net outperforms other well-known multi-modal or multi-scale methods and shows its generalization capability in staging MN. Code is available at https://github.com/SMU-GL-Group/MultiModal_lkx/tree/main.
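
One common way to make multi-instance aggregation sparse is to keep only the top-k scoring instances and pool them with softmax weights. The sketch below illustrates that generic idea for a bag of TEM patch features; it is an assumption about what "sparse" could mean here, not the CMUS-Net module, and the scores would normally come from a learned scoring network rather than being passed in by hand.

```python
import math

def sparse_mil_pool(instances, scores, k):
    """Pool a bag of instance features using only the k highest-scoring instances.

    Hypothetical top-k attention pooling: instances outside the top k receive
    zero weight, and the kept ones are weighted by a softmax over their scores.
    """
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    m = max(scores[i] for i in top)                      # subtract max for stability
    weights = {i: math.exp(scores[i] - m) for i in top}
    z = sum(weights.values())
    dim = len(instances[0])
    return [sum(weights[i] / z * instances[i][d] for i in top) for d in range(dim)]

# Three TEM patch features; the third has a very low score and is dropped.
bag = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
print(sparse_mil_pool(bag, scores=[2.0, 2.0, -10.0], k=2))
```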

[19] Null-LoRA: Low-Rank Adaptation on Null Space

Yi Zhang, Yulei Kang, Haoxuan Chen, Jinxuan Li, ian-Fang Hu

🧩 TL;DR

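This paper proposes Null-LoRA, a null-space-based low-rank adaptation method that constrains the incremental update to the null space of the pre-trained model and outperforms existing methods with fewer parameters.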

📘 Detailed Summary

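Motivation: Existing parameter-efficient fine-tuning methods, such as LoRA and its variants, perform low-rank adaptation over the full parameter space. However, fine-tuning within a subspace can achieve comparable effectiveness, and pre-trained models possess non-trivial null spaces, which suggests a new route to better parameter efficiency.

Method: Null-LoRA freezes portions of the low-rank matrices to reduce redundancy and raise the effective rank. It also constrains the entire incremental update to the null space of the pre-trained model, maximizing the use of incremental updates to adapt to new task paradigms and achieving higher parameter efficiency.

Result: In extensive experiments on image-text retrieval and visual question answering, Null-LoRA surpasses state-of-the-art methods with fewer parameters, demonstrating advantages in both parameter efficiency and performance.

Conclusion: The study shows that exploiting the null space of pre-trained models can markedly improve parameter-efficient fine-tuning, offering a more efficient paradigm for adapting large models to downstream tasks and demonstrating the practical feasibility of subspace fine-tuning.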

📄 Abstract

Parameter-efficient fine-tuning methods have gained considerable popularity for adapting large-scale models to downstream tasks, particularly LoRA and its variants. Existing methods perform low-rank adaptation over the full parameter space. However, fine-tuning within a subspace can achieve comparable effectiveness. Inspired by the observation that pre-trained models possess non-trivial null spaces, we propose Null-space based Low-Rank Adaptation (Null-LoRA). Null-LoRA effectively reduces redundancy and enhances effective rank by freezing portions of the low-rank matrices. To further improve parameter efficiency, Null-LoRA constrains the entire incremental update within the null space, maximizing the utilization of incremental updates to adapt to new task paradigms. Null-LoRA surpasses the state of the art with fewer parameters in extensive experiments across image-text retrieval and visual question answering tasks.
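
To make the idea of a null-space-constrained update concrete, the sketch below projects a low-rank update ΔW = BA so that each of its rows is orthogonal to the rows of a pre-trained weight W; the projected update then ignores any input lying in the row space of W. This is one possible reading of "null space", chosen for illustration. The abstract does not state which null space Null-LoRA uses or how it computes it, and the toy W is rank-deficient by construction.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matmul(A, B):
    return [[dot(row, col) for col in zip(*B)] for row in A]

def orthonormal_rows(W, tol=1e-10):
    """Gram-Schmidt basis for the row space of W (dependent rows are skipped)."""
    basis = []
    for row in W:
        v = list(row)
        for q in basis:
            c = dot(v, q)
            v = [a - c * b for a, b in zip(v, q)]
        norm = dot(v, v) ** 0.5
        if norm > tol:
            basis.append([a / norm for a in v])
    return basis

def project_to_null_space(delta, W):
    """Remove from every row of `delta` its component in the row space of W."""
    basis = orthonormal_rows(W)
    projected = []
    for row in delta:
        v = list(row)
        for q in basis:
            c = dot(v, q)
            v = [a - c * b for a, b in zip(v, q)]
        projected.append(v)
    return projected

# Toy pre-trained weight of rank 2 in a 3-dimensional input space.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
B = [[1.0], [2.0]]          # rank-1 LoRA factors (hypothetical values)
A = [[1.0, 1.0, 1.0]]
delta = project_to_null_space(matmul(B, A), W)
print(delta)
```

After projection, the product of the update with the transpose of W is zero, so adding the update leaves the layer's response to row-space inputs unchanged.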

[20] Automated Motion Artifact Check for MRI (AutoMAC-MRI): An Interpretable Framework for Motion Artifact Detection and Severity Assessment

Antony Jerald, Dattesh Shanbhag, Sudhanya Chatterjee

🧩 TL;DR

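This paper presents AutoMAC-MRI, an interpretable framework for grading MRI motion artifacts. It uses supervised contrastive learning to learn a discriminative representation of motion severity and combines it with grade-specific affinity scores to make grading transparent, aiming to improve the accuracy and interpretability of MRI quality control.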

📘 Detailed Summary

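Motivation: Motion artifacts degrade MRI image quality and increase patient recalls. Existing automated quality assessment methods are largely limited to binary decisions and lack interpretability, so they cannot provide detailed motion severity grades or a transparent basis for their decisions.

Method: The method uses supervised contrastive learning to learn a discriminative representation of motion severity. Within this feature space, it computes grade-specific affinity scores that quantify how close an image is to each motion grade, making grade assignments transparent and interpretable.

Result: The framework is evaluated on more than 5,000 expert-annotated brain MRI slices spanning multiple contrasts and views. The affinity scores align well with expert judgment, supporting their use as an interpretable measure of motion severity.

Conclusion: By coupling accurate grade detection with per-grade affinity scoring, AutoMAC-MRI enables inline MRI quality control, with the potential to reduce unnecessary rescans and improve workflow efficiency, giving clinical practice a transparent decision-support tool.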

📄 Abstract

Motion artifacts degrade MRI image quality and increase patient recalls. Existing automated quality assessment methods are largely limited to binary decisions and provide little interpretability. We introduce AutoMAC-MRI, an explainable framework for grading motion artifacts across heterogeneous MR contrasts and orientations. The approach uses supervised contrastive learning to learn a discriminative representation of motion severity. Within this feature space, we compute grade-specific affinity scores that quantify an image's proximity to each motion grade, thereby making grade assignments transparent and interpretable. We evaluate AutoMAC-MRI on more than 5000 expert-annotated brain MRI slices spanning multiple contrasts and views. Experiments assessing affinity scores against expert labels show that the scores align well with expert judgment, supporting their use as an interpretable measure of motion severity. By coupling accurate grade detection with per-grade affinity scoring, AutoMAC-MRI enables inline MRI quality control, with the potential to reduce unnecessary rescans and improve workflow efficiency.
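
A grade-specific affinity score can be pictured as the similarity between an image embedding and a prototype for each motion grade. The sketch below assumes cosine similarity to per-grade prototype vectors followed by a temperature softmax; the abstract does not give the actual formula, so the prototypes, the temperature, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def grade_affinities(embedding, prototypes, temperature=0.1):
    """Map an embedding to one affinity per grade; the affinities sum to 1.

    Assumption: each grade is summarized by a prototype vector (for example the
    mean embedding of its training slices) in the contrastive feature space.
    """
    sims = {g: cosine(embedding, p) for g, p in prototypes.items()}
    m = max(sims.values())
    exps = {g: math.exp((s - m) / temperature) for g, s in sims.items()}
    z = sum(exps.values())
    return {g: e / z for g, e in exps.items()}

# Toy 2-D prototypes for three motion grades (0 = none, 1 = mild, 2 = severe).
prototypes = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [-1.0, 0.0]}
scores = grade_affinities([0.1, 0.9], prototypes)
print(max(scores, key=scores.get), scores)
```

Reporting the full affinity vector rather than only the top grade is what makes the assignment inspectable: a slice with comparable affinity to two adjacent grades can be flagged as borderline.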

[21] Assessing the Visual Enumeration Abilities of Specialized Counting Architectures and Vision-Language Models

Kuinan Hou, Jing Mi, Marco Zorzi, Lamberto Ballan, Alberto Testolin

🧩 TL;DR

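This study systematically compares specialized counting architectures with vision-language models on object counting. It finds that vision-language models can match or even surpass specialized architectures, especially when prompted to generate intermediate representations, but remain unreliable in complex scenes.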

📘 Detailed Summary

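Motivation: Traditional object counting methods rely on domain-specific architectures trained on data annotated with a predefined set of object categories. Emerging large-scale multimodal vision-language models may offer a more flexible alternative for open-set object counting, which calls for a systematic comparison of the two families on counting tasks.

Method: The study evaluates state-of-the-art specialized counting architectures and vision-language models on two popular counting datasets, and also creates a novel benchmark with fine-grained control over visual properties. It specifically examines how prompting models to generate intermediate representations (object locations and verbal labels) affects counting performance.

Result: Most vision-language models can approximately enumerate the objects in a visual scene, matching or even surpassing specialized computer vision architectures. Counting accuracy improves significantly when models are prompted to generate an intermediate representation of each object, but no model counts reliably in complex visual scenes.

Conclusion: Vision-language models show potential comparable to specialized architectures on object counting, and generating intermediate representations improves performance further. However, all current models remain unreliable in complex scenes, indicating that further research is needed before AI systems can reliably deploy counting procedures in realistic environments.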

📄 Abstract

Counting the number of items in a visual scene remains a fundamental yet challenging task in computer vision. Traditional approaches to solving this problem rely on domain-specific counting architectures, which are trained using datasets annotated with a predefined set of object categories. However, recent progress in creating large-scale multimodal vision-language models (VLMs) suggests that these domain-general architectures may offer a flexible alternative for open-set object counting. In this study, we therefore systematically compare the performance of state-of-the-art specialized counting architectures against VLMs on two popular counting datasets, as well as on a novel benchmark specifically created to have a finer-grained control over the visual properties of test images. Our findings show that most VLMs can approximately enumerate the number of items in a visual scene, matching or even surpassing the performance of specialized computer vision architectures. Notably, enumeration accuracy significantly improves when VLMs are prompted to generate intermediate representations (i.e., locations and verbal labels) of each object to be counted. Nevertheless, none of the models can reliably count the number of objects in complex visual scenes, showing that further research is still needed to create AI systems that can reliably deploy counting procedures in realistic environments.

[22] SMART: Semantic Matching Contrastive Learning for Partially View-Aligned Clustering

Liang Peng, Yixuan Ye, Cheng Liu, Hangjun Che, Fei Wang, Zhiwen Yu, Si Wu, Hau-San Wong

🧩 TL;DR

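This paper proposes SMART, a semantic matching contrastive learning model for partially view-aligned clustering. It alleviates cross-view distributional shifts to facilitate semantic matching contrastive learning, thereby fully exploiting semantic relationships in both aligned and unaligned data.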

📘 Detailed Summary

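Motivation: Collecting strictly aligned multi-view data is challenging in real-world scenarios. Existing partially view-aligned clustering methods fail to leverage unaligned data to capture the shared semantics among samples from the same cluster, and the inherent heterogeneity of multi-view data induces distributional shifts in representations, which hampers establishing meaningful correspondences between cross-view latent features and impairs learning effectiveness.

Method: The paper proposes SMART, a semantic matching contrastive learning model. Its core idea is to alleviate the influence of cross-view distributional shifts and thereby facilitate semantic matching contrastive learning, so that semantic relationships in both aligned and unaligned data are fully exploited and the potential consistency and complementarity across views are better used.

Result: Extensive experiments on eight benchmark datasets show that the method consistently outperforms existing approaches on partially view-aligned clustering, supporting the model's effectiveness in alleviating distributional shifts and exploiting semantic relationships.

Conclusion: The study offers an effective approach to partially view-aligned clustering. By alleviating distributional shifts to facilitate semantic matching contrastive learning, it makes better use of the consistency and complementarity of multi-view data and suggests a new way to handle incompletely aligned multi-view data in the real world.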

📄 Abstract

Multi-view clustering has been empirically shown to improve learning performance by leveraging the inherent complementary information across multiple views of data. However, in real-world scenarios, collecting strictly aligned views is challenging, and learning from both aligned and unaligned data becomes a more practical solution. Partially View-aligned Clustering aims to learn correspondences between misaligned view samples to better exploit the potential consistency and complementarity across views, including both aligned and unaligned data. However, most existing PVC methods fail to leverage unaligned data to capture the shared semantics among samples from the same cluster. Moreover, the inherent heterogeneity of multi-view data induces distributional shifts in representations, leading to inaccuracies in establishing meaningful correspondences between cross-view latent features and, consequently, impairing learning effectiveness. To address these challenges, we propose a Semantic MAtching contRasTive learning model (SMART) for PVC. The main idea of our approach is to alleviate the influence of cross-view distributional shifts, thereby facilitating semantic matching contrastive learning to fully exploit semantic relationships in both aligned and unaligned data. Extensive experiments on eight benchmark datasets demonstrate that our method consistently outperforms existing approaches on the PVC problem.

Yingying Wang, Xuanhua He, Chen Wu, Jialing Huang, Suiyun Zhang, Rui Liu, Xinghao Ding, Haoxuan Che

🧩 TL;DR

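This paper proposes MMMamba, a Mamba-based cross-modal in-context fusion framework for pan-sharpening. A novel multimodal interleaved scanning mechanism enables efficient cross-modal information exchange, and the framework also supports zero-shot image super-resolution.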

📘 Detailed Summary

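Motivation: Traditional CNN-based pan-sharpening methods typically rely on channel-wise concatenation with fixed convolutional operators, which limits their adaptability to diverse spatial and spectral variations. Cross-attention mechanisms enable global interactions, but they are computationally inefficient and may dilute fine-grained correspondences, making it difficult to capture complex semantic relationships.

Method: The MMMamba framework is built on the Mamba architecture, which ensures linear computational complexity while maintaining strong cross-modal interaction capacity. It introduces a novel multimodal interleaved scanning mechanism that facilitates effective information exchange between the PAN and MS modalities, and it uses in-context conditioning rather than cross-attention for more direct and efficient cross-modal information exchange.

Result: Extensive experiments show that the method outperforms existing state-of-the-art techniques across multiple tasks and benchmarks. The framework also flexibly supports zero-shot image super-resolution, indicating broad applicability.

Conclusion: With its multimodal interleaved scanning mechanism and in-context conditioning design, MMMamba provides an efficient and strong solution for pan-sharpening. The study demonstrates the potential of the Mamba architecture for multimodal fusion and offers a new technical path for cross-modal information exchange in image processing.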

📄 Abstract

Pan-sharpening aims to generate high-resolution multispectral (HRMS) images by integrating a high-resolution panchromatic (PAN) image with its corresponding low-resolution multispectral (MS) image. To achieve effective fusion, it is crucial to fully exploit the complementary information between the two modalities. Traditional CNN-based methods typically rely on channel-wise concatenation with fixed convolutional operators, which limits their adaptability to diverse spatial and spectral variations. While cross-attention mechanisms enable global interactions, they are computationally inefficient and may dilute fine-grained correspondences, making it difficult to capture complex semantic relationships. Recent advances in the Multimodal Diffusion Transformer (MMDiT) architecture have demonstrated impressive success in image generation and editing tasks. Unlike cross-attention, MMDiT employs in-context conditioning to facilitate more direct and efficient cross-modal information exchange. In this paper, we propose MMMamba, a cross-modal in-context fusion framework for pan-sharpening, with the flexibility to support image super-resolution in a zero-shot manner. Built upon the Mamba architecture, our design ensures linear computational complexity while maintaining strong cross-modal interaction capacity. Furthermore, we introduce a novel multimodal interleaved (MI) scanning mechanism that facilitates effective information exchange between the PAN and MS modalities. Extensive experiments demonstrate the superior performance of our method compared to existing state-of-the-art (SOTA) techniques across multiple tasks and benchmarks.
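
An interleaved scan can be pictured as merging the PAN and MS token sequences into a single sequence so that a sequential model sees tokens from both modalities side by side. The sketch below assumes strict token-by-token alternation over spatially aligned tokens; the actual MI scanning order in MMMamba is not described in the abstract, and the function names are hypothetical.

```python
def interleave_tokens(pan_tokens, ms_tokens):
    """Merge two equally long token lists as PAN, MS, PAN, MS, ...

    Assumption: token i of each modality covers the same spatial location, so
    alternating them places corresponding PAN and MS content next to each other.
    """
    if len(pan_tokens) != len(ms_tokens):
        raise ValueError("PAN and MS token sequences must have the same length")
    sequence = []
    for pan, ms in zip(pan_tokens, ms_tokens):
        sequence.append(("PAN", pan))
        sequence.append(("MS", ms))
    return sequence

def deinterleave_tokens(sequence):
    """Split an interleaved sequence back into its PAN and MS streams."""
    pan = [tok for tag, tok in sequence if tag == "PAN"]
    ms = [tok for tag, tok in sequence if tag == "MS"]
    return pan, ms

seq = interleave_tokens(["p0", "p1"], ["m0", "m1"])
print(seq)
print(deinterleave_tokens(seq))
```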

[24] SynthSeg-Agents: Multi-Agent Synthetic Data Generation for Zero-Shot Weakly Supervised Semantic Segmentation

Wangyu Wu, Zhenhong Chen, Xiaowei Huang, Fei Ma, Jimin Xiao

🧩 TL;DR

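This paper introduces a new direction, Zero-Shot Weakly Supervised Semantic Segmentation (ZSWSSS), and proposes SynthSeg Agents, a framework in which an LLM-driven multi-agent system generates synthetic training data entirely without real images, achieving competitive performance on PASCAL VOC and COCO.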

📘 Detailed Summary

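Motivation: Existing weakly supervised semantic segmentation methods use generative models to augment data but still depend on real-world training samples. This paper addresses that limitation by proposing zero-shot weakly supervised semantic segmentation, a training paradigm that requires no real-image supervision at all, in order to reduce annotation cost and improve scalability.

Method: The paper proposes SynthSeg Agents, a multi-agent framework with two core modules: a Self Refine Prompt Agent and an Image Generation Agent. The Self Refine Prompt Agent autonomously produces diverse, semantically rich image prompts through iterative refinement, memory mechanisms, and prompt-space exploration, guided by CLIP-based similarity and nearest-neighbor diversity filtering. The Image Generation Agent uses vision-language models to synthesize candidate images, selects high-quality samples with a frozen CLIP scoring model, and trains a ViT-based classifier to relabel the synthetic dataset with improved semantic precision.

Result: Experiments on PASCAL VOC 2012 and COCO 2014 show that SynthSeg Agents achieves competitive performance without using any real training images. The framework produces high-quality synthetic training data, supporting the feasibility of weakly supervised semantic segmentation with no real-image supervision.

Conclusion: The study demonstrates the potential of LLM-driven agents for cost-efficient, scalable semantic segmentation and opens a new direction in zero-shot weakly supervised semantic segmentation. The method removes the dependence on real training images, offers a solution for semantic segmentation in data-scarce settings, and shows that synthetic data generation and multi-agent collaboration can be applied effectively to computer vision tasks.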

📄 Abstract

Weakly Supervised Semantic Segmentation (WSSS) with image level labels aims to produce pixel level predictions without requiring dense annotations. While recent approaches have leveraged generative models to augment existing data, they remain dependent on real world training samples. In this paper, we introduce a novel direction, Zero Shot Weakly Supervised Semantic Segmentation (ZSWSSS), and propose SynthSeg Agents, a multi agent framework driven by Large Language Models (LLMs) to generate synthetic training data entirely without real images. SynthSeg Agents comprises two key modules, a Self Refine Prompt Agent and an Image Generation Agent. The Self Refine Prompt Agent autonomously crafts diverse and semantically rich image prompts via iterative refinement, memory mechanisms, and prompt space exploration, guided by CLIP based similarity and nearest neighbor diversity filtering. These prompts are then passed to the Image Generation Agent, which leverages Vision Language Models (VLMs) to synthesize candidate images. A frozen CLIP scoring model is employed to select high quality samples, and a ViT based classifier is further trained to relabel the entire synthetic dataset with improved semantic precision. Our framework produces high quality training data without any real image supervision. Experiments on PASCAL VOC 2012 and COCO 2014 show that SynthSeg Agents achieves competitive performance without using real training images. This highlights the potential of LLM driven agents in enabling cost efficient and scalable semantic segmentation.
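
The two prompt filters named above, similarity to the target class and nearest-neighbor diversity, can be sketched as a single pass over candidate prompts. The example is a rough illustration under stated assumptions: the toy 2-D vectors stand in for CLIP text embeddings, and the thresholds `min_sim` and `max_neighbor_sim` are hypothetical values, not settings from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def filter_prompts(candidates, class_embedding, min_sim=0.3, max_neighbor_sim=0.9):
    """Keep prompts that are on-topic and not near-duplicates of kept prompts.

    `candidates` is a list of (prompt_text, embedding) pairs. A prompt is
    dropped if it is too dissimilar to the class embedding (relevance filter)
    or too similar to any already accepted prompt (diversity filter).
    """
    kept = []
    for text, emb in candidates:
        if cosine(emb, class_embedding) < min_sim:
            continue
        if any(cosine(emb, kept_emb) > max_neighbor_sim for _, kept_emb in kept):
            continue
        kept.append((text, emb))
    return [text for text, _ in kept]

candidates = [
    ("a dog on grass", [1.0, 0.1]),
    ("a dog on a lawn", [1.0, 0.11]),   # near-duplicate of the first prompt
    ("a dog at the beach", [1.0, 1.0]),
    ("a red car", [-1.0, 0.0]),         # off-topic for the class "dog"
]
print(filter_prompts(candidates, class_embedding=[1.0, 0.0]))
```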

[25] KD360-VoxelBEV: LiDAR and 360-degree Camera Cross Modality Knowledge Distillation for Bird's-Eye-View Segmentation

Wenke E, Yixin Sun, Jiaxu Liu, Hubert P. H. Shum, Amir Atapour-Abarghouei, Toby P. Breckon

🧩 TL;DR

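This paper presents the first cross-modality distillation framework designed for single-panoramic-camera bird's-eye-view (BEV) segmentation. By distilling knowledge from fused LiDAR and camera modalities, it lets a lightweight student network perform efficient BEV segmentation from a single panoramic camera, substantially reducing sensor complexity and deployment cost.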

📘 Detailed Summary

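Motivation: Existing camera-based BEV segmentation methods are limited in performance and efficiency, while multi-sensor fusion increases system complexity and deployment cost. The study aims to develop a cross-modality distillation framework in which a lightweight single-panoramic-camera network learns rich spatial and semantic features from a high-performing LiDAR and camera fusion teacher network, lowering sensor requirements while keeping performance competitive.

Method: The method proposes a novel cross-modality distillation framework with three key components. First, it designs a new LiDAR image representation fused from range, intensity, and ambient channels. Second, it develops a voxel-aligned view transformer that preserves spatial fidelity while enabling efficient BEV processing. Third, a high-capacity LiDAR and camera fusion teacher network extracts rich spatial and semantic features, which are distilled into a lightweight student network that relies solely on a single 360-degree panoramic camera image.

Result: Experiments on the Dur360BEV dataset show that the teacher model significantly outperforms existing camera-based BEV segmentation methods, achieving a 25.6% IoU improvement. The distilled student network attains competitive performance, with an 8.5% IoU gain and a state-of-the-art inference speed of 31.2 FPS. Evaluations on KITTI-360 (two fisheye cameras) further confirm that the distillation framework generalises to different camera setups, underscoring its feasibility and robustness.

Conclusion: The study demonstrates the effectiveness of cross-modality distillation for BEV segmentation and provides an efficient, low-cost solution for real-world autonomous driving. By reducing sensor complexity and deployment cost while keeping performance competitive, the method shows practical value in real-world scenarios. Its generalisability across camera configurations points to a promising direction for future lightweight perception systems.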

📄 Abstract

We present the first cross-modality distillation framework specifically tailored for single-panoramic-camera Bird's-Eye-View (BEV) segmentation. Our approach leverages a novel LiDAR image representation fused from range, intensity and ambient channels, together with a voxel-aligned view transformer that preserves spatial fidelity while enabling efficient BEV processing. During training, a high-capacity LiDAR and camera fusion Teacher network extracts both rich spatial and semantic features for cross-modality knowledge distillation into a lightweight Student network that relies solely on a single 360-degree panoramic camera image. Extensive experiments on the Dur360BEV dataset demonstrate that our teacher model significantly outperforms existing camera-based BEV segmentation methods, achieving a 25.6% IoU improvement. Meanwhile, the distilled Student network attains competitive performance with an 8.5% IoU gain and state-of-the-art inference speed of 31.2 FPS. Moreover, evaluations on KITTI-360 (two fisheye cameras) confirm that our distillation framework generalises to diverse camera setups, underscoring its feasibility and robustness. This approach reduces sensor complexity and deployment costs while providing a practical solution for efficient, low-cost BEV segmentation in real-world autonomous driving.
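
Feature-level distillation of this kind is often trained with a task loss on the student's predictions plus a term that pulls student features toward teacher features. The sketch below shows that generic combination on tiny BEV grids. The specific losses (binary cross-entropy and mean squared error) and the weight `alpha` are assumptions; the abstract does not state the paper's actual objective.

```python
import math

def mse(a, b):
    """Mean squared error between two equally shaped 2-D grids."""
    diffs = [(x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)

def bce_with_logits(logits, labels):
    """Numerically stable binary cross-entropy averaged over a 2-D grid."""
    terms = [max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
             for row_x, row_y in zip(logits, labels) for x, y in zip(row_x, row_y)]
    return sum(terms) / len(terms)

def distillation_loss(student_feat, teacher_feat, student_logits, labels, alpha=0.5):
    """Segmentation loss on the student plus a weighted feature-matching term.

    Assumption: teacher features are treated as fixed targets, and `alpha`
    balances imitation of the teacher against the ground-truth BEV labels.
    """
    task_term = bce_with_logits(student_logits, labels)
    feature_term = mse(student_feat, teacher_feat)
    return task_term + alpha * feature_term

teacher = [[1.0, 2.0]]
student = [[1.0, 4.0]]
print(distillation_loss(student, teacher, student_logits=[[0.0]], labels=[[1.0]]))
```

Only the student needs to run at inference time, which is why the teacher's LiDAR input can be dropped after training.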

[26] Towards Seamless Interaction: Causal Turn-Level Modeling of Interactive 3D Conversational Head Dynamics

Junjie Chen, Fei Wang, Zhihao Huang, Qing Zhou, Kun Li, Dan Guo, Linfeng Zhang, Xun Yang

🧩 TL;DR

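This paper proposes TIMAR (Turn-level Interleaved Masked AutoRegression), a causal framework for 3D conversational head generation that models dialogue as interleaved audio-visual contexts and reduces Fréchet Distance and MSE by 15-30% on the DualTalk benchmark.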

📘 Detailed Summary

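Motivation: Human conversation involves a continuous exchange of speech and nonverbal cues. Existing frameworks often treat talking and listening as independent processes or rely on non-causal full-sequence modeling, which hinders temporal coherence across turns, so a 3D conversational generation method that models these bidirectional dynamics is needed.

Method: TIMAR models dialogue as interleaved audio-visual contexts. It fuses multimodal information within each turn and applies turn-level causal attention to accumulate conversational history, while a lightweight diffusion head predicts continuous 3D head dynamics to capture both coordination and expressive variability.

Result: On the DualTalk benchmark, TIMAR reduces Fréchet Distance and MSE by 15-30% on the test set and achieves similar gains on out-of-distribution data, demonstrating its effectiveness and generalization ability.

Conclusion: The study demonstrates the effectiveness of modeling conversational interaction dynamics with a causal framework. TIMAR generates temporally coherent and expressive 3D conversational head motion, providing a technical basis for building expressive avatars and interactive robots.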

📄 Abstract

Human conversation involves continuous exchanges of speech and nonverbal cues such as head nods, gaze shifts, and facial expressions that convey attention and emotion. Modeling these bidirectional dynamics in 3D is essential for building expressive avatars and interactive robots. However, existing frameworks often treat talking and listening as independent processes or rely on non-causal full-sequence modeling, hindering temporal coherence across turns. We present TIMAR (Turn-level Interleaved Masked AutoRegression), a causal framework for 3D conversational head generation that models dialogue as interleaved audio-visual contexts. It fuses multimodal information within each turn and applies turn-level causal attention to accumulate conversational history, while a lightweight diffusion head predicts continuous 3D head dynamics that capture both coordination and expressive variability. Experiments on the DualTalk benchmark show that TIMAR reduces Fréchet Distance and MSE by 15-30% on the test set, and achieves similar gains on out-of-distribution data. The source code will be released in the GitHub repository https://github.com/CoderChen01/towards-seamleass-interaction.
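
Turn-level causal attention can be expressed as a block-structured mask: a token may attend to every token in its own turn and in earlier turns, but not to later turns. The sketch below builds such a mask from per-token turn indices. Treating attention within a turn as fully bidirectional is an assumption drawn from the phrase "fuses multimodal information within each turn", and the function name is hypothetical.

```python
def turn_causal_mask(turn_ids):
    """Build a boolean attention mask from per-token turn indices.

    mask[q][k] is True when query token q may attend to key token k, which
    holds whenever k belongs to the same turn as q or to an earlier turn.
    """
    n = len(turn_ids)
    return [[turn_ids[k] <= turn_ids[q] for k in range(n)] for q in range(n)]

# Two tokens in turn 0 followed by two tokens in turn 1.
for row in turn_causal_mask([0, 0, 1, 1]):
    print(["x" if allowed else "." for allowed in row])
```

Compared with a token-level causal mask, this keeps audio and visual tokens of one turn mutually visible while still preventing any leakage from future turns.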

[27] See It Before You Grab It: Deep Learning-based Action Anticipation in Basketball

Arnau Barrera Roy, Albert Clapés Sintes

🧩 TL;DR

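This study introduces the new task of rebound anticipation in basketball video and builds a large-scale dataset of 100,000 video clips with more than 2,000 manually annotated rebound events. It is the first application of deep learning to basketball rebound prediction and provides a benchmark for predictive modeling in dynamic multi-agent sports scenarios.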

📘 Detailed Summary

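Motivation: Although computer vision and video understanding have made significant progress in sports analytics, anticipating actions before they occur in sports videos, and rebound anticipation in particular, has received little attention. No dataset or baseline method exists specifically for basketball rebound prediction, which limits the development of real-time automated broadcasting and post-game analysis tools.

Method: The study introduces the task of action anticipation in basketball broadcast videos, focusing on predicting which team will gain possession of the ball after a shot attempt. It presents a self-curated dataset comprising 100,000 basketball video clips, over 300 hours of footage, and more than 2,000 manually annotated rebound events. Comprehensive baseline results are established with state-of-the-art action anticipation methods, and two complementary tasks, rebound classification and rebound spotting, are explored.

Result: The experimental results highlight both the feasibility and the inherent challenges of anticipating rebounds, offering insights into predictive modeling for dynamic multi-agent sports scenarios. The dataset supports a wide range of basketball video understanding applications, fills the gap left by the absence of comparable datasets, and enables the first application of deep learning to basketball rebound prediction.

Conclusion: By forecasting team possession before rebounds occur, the study lays the groundwork for decision-support applications in real-time automated broadcasting and post-game analysis tools. The work establishes a benchmark for basketball rebound prediction, provides a methodological reference for predictive modeling in dynamic multi-agent sports scenarios, and shows the potential of sports video understanding for predictive analytics.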

📄 Abstract

Computer vision and video understanding have transformed sports analytics by enabling large-scale, automated analysis of game dynamics from broadcast footage. Despite significant advances in player and ball tracking, pose estimation, action localization, and automatic foul recognition, anticipating actions before they occur in sports videos has received comparatively little attention. This work introduces the task of action anticipation in basketball broadcast videos, focusing on predicting which team will gain possession of the ball following a shot attempt. To benchmark this task, a new self-curated dataset comprising 100,000 basketball video clips, over 300 hours of footage, and more than 2,000 manually annotated rebound events is presented. Comprehensive baseline results are reported using state-of-the-art action anticipation methods, representing the first application of deep learning techniques to basketball rebound prediction. Additionally, two complementary tasks, rebound classification and rebound spotting, are explored, demonstrating that this dataset supports a wide range of video understanding applications in basketball, for which no comparable datasets currently exist. Experimental results highlight both the feasibility and inherent challenges of anticipating rebounds, providing valuable insights into predictive modeling for dynamic multi-agent sports scenarios. By forecasting team possession before rebounds occur, this work enables applications in real-time automated broadcasting and post-game analysis tools to support decision-making.

[28] Step-GUI Technical Report

Haolong Yan, Jia Wang, Xin Huang, Yeqing Shen, Ziyang Meng, Zhimin Fan, Kaijun Tan, Jin Gao, Lieyu Shi, Mi Yang, Shiliang Yang, Zhirui Wang, Brian Li, Kang An, Chenyang Li, Lei Lei, Mengmeng Duan, Danxun Liang, Guodong Liu, Hang Cheng, Hao Wu, Jie Dong, Junhao Huang, Mei Chen, Renjie Yu, Shunshan Li, Xu Zhou, Yiting Dai, Yineng Deng, Yingdan Liang, Zelin Chen, Wen Sun, Chengxu Yan, Chunqin Xu, Dong Li, Fengqiong Xiao, Guanghao Fan, Guopeng Li, Guozhen Peng, Hongbing Li, Hang Li, Hongming Chen, Jingjing Xie, Jianyong Li, Jingyang Zhang, Jiaju Ren, Jiayu Yuan, Jianpeng Yin, Kai Cao, Liang Zhao, Liguo Tan, Liying Shi, Mengqiang Ren, Min Xu, Manjiao Liu, Mao Luo, Mingxin Wan, Na Wang, Nan Wu, Ning Wang, Peiyao Ma, Qingzhou Zhang, Qiao Wang, Qinlin Zeng, Qiong Gao, Qiongyao Li, Shangwu Zhong, Shuli Gao, Shaofan Liu, Shisi Gao, Shuang Luo, Xingbin Liu, Xiaojia Liu, Xiaojie Hou, Xin Liu, Xuanti Feng, Xuedan Cai, Xuan Wen, Xianwei Zhu, Xin Liang, Xin Liu, Xin Zhou, Yingxiu Zhao, Yukang Shi, Yunfang Xu, Yuqing Zeng, Yixun Zhang, Zejia Weng, Zhonghao Yan, Zhiguo Huang, Zhuoyu Wang, Zheng Ge, Jing Li, Yibo Zhu, Binxing Jiao, Xiangyu Zhang, Daxin Jiang

🧩 TL;DR

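This paper proposes a self-evolving training pipeline for GUI automation that uses a Calibrated Step Reward System to generate high-quality training data. It also develops the Step-GUI model family and the GUI-MCP protocol, achieves state-of-the-art performance on multiple benchmarks, and assesses real-world applicability with the AndroidDaily benchmark.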

📘 Detailed Summary

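Motivation: Multimodal large language models create unprecedented opportunities for GUI automation, but the core challenge is how to acquire high-quality training data efficiently while keeping annotations reliable. Practical deployment additionally requires standardized interfaces across heterogeneous devices that protect user privacy, and agents must be evaluated on their ability to handle authentic everyday usage.

Method: The study proposes a self-evolving training pipeline powered by the Calibrated Step Reward System, which converts model-generated trajectories into reliable training signals through trajectory-level calibration. It develops the Step-GUI model family (4B/8B parameters) and designs GUI-MCP, the first Model Context Protocol for GUI automation, with a hierarchical architecture that combines low-level atomic operations and high-level task delegation. It also builds the AndroidDaily benchmark, grounded in real-world mobile usage patterns, with 3146 static actions and 235 end-to-end tasks.

Result: The self-evolving training pipeline achieves over 90% annotation accuracy at 10-100x lower cost. The Step-GUI 8B model reaches 80.2% on AndroidWorld, 48.5% on OSWorld, and 62.6% on ScreenShot-Pro. On AndroidDaily, the 8B model reaches 89.91% on static actions and 52.50% on end-to-end tasks. GUI-MCP enables high-privacy execution in which sensitive data stays on-device.

Conclusion: Through its data generation method, high-performing models, and standardized protocol, the work advances the development of practical GUI agents and shows strong potential for deployment in real-world everyday digital interactions, offering a systematic solution to data quality, privacy protection, and real-world evaluation in GUI automation.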

📄 Abstract

Recent advances in multimodal large language models unlock unprecedented opportunities for GUI automation. However, a fundamental challenge remains: how to efficiently acquire high-quality training data while maintaining annotation reliability? We introduce a self-evolving training pipeline powered by the Calibrated Step Reward System, which converts model-generated trajectories into reliable training signals through trajectory-level calibration, achieving >90% annotation accuracy with 10-100x lower cost. Leveraging this pipeline, we introduce Step-GUI, a family of models (4B/8B) that achieves state-of-the-art GUI performance (8B: 80.2% AndroidWorld, 48.5% OSWorld, 62.6% ScreenShot-Pro) while maintaining robust general capabilities. As GUI agent capabilities improve, practical deployment demands standardized interfaces across heterogeneous devices while protecting user privacy. To this end, we propose GUI-MCP, the first Model Context Protocol for GUI automation with hierarchical architecture that combines low-level atomic operations and high-level task delegation to local specialist models, enabling high-privacy execution where sensitive data stays on-device. Finally, to assess whether agents can handle authentic everyday usage, we introduce AndroidDaily, a benchmark grounded in real-world mobile usage patterns with 3146 static actions and 235 end-to-end tasks across high-frequency daily scenarios (8B: static 89.91%, end-to-end 52.50%). Our work advances the development of practical GUI agents and demonstrates strong potential for real-world deployment in everyday digital interactions.

[29] CLIP-FTI: Fine-Grained Face Template Inversion via CLIP-Driven Attribute Conditioning

Longchen Dai, Zixuan Shen, Zhiheng Zhou, Peipeng Yu, Zhihua Xia

🧩 TL;DR

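This paper presents CLIP-FTI, a CLIP-driven fine-grained attribute conditioning framework that addresses over-smoothed facial-part attributes and limited transferability in face template inversion. It fuses CLIP semantic embeddings with the leaked template and uses StyleGAN to generate reconstructed face images with fine-grained facial feature attributes.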

📘 Detailed Summary

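Motivation: Once the face templates stored by a face recognition system are leaked, they can be inverted into photorealistic surrogate images that threaten privacy and enable impersonation attacks. Faces reconstructed by existing methods show over-smoothed facial-part attributes (eyes, nose, mouth) and limited cross-model attack transferability, so fine-grained attribute reconstruction needs to improve.

Method: The CLIP-FTI framework uses the CLIP model to obtain semantic embeddings of facial features in order to reconstruct specific facial feature attributes. Facial feature attribute embeddings extracted from CLIP are fused with the leaked template through a cross-modal feature interaction network and projected into the intermediate latent space of a pretrained StyleGAN. The StyleGAN generator then synthesizes face images with the same identity as the template but with more fine-grained facial feature attributes.

Result: Experiments across multiple face recognition backbones and datasets show that the reconstructions achieve higher identification accuracy and attribute similarity, recover sharper component-level attribute semantics, and improve cross-model attack transferability compared to prior reconstruction attacks. The method obtains state-of-the-art results on face template inversion.

Conclusion: According to the authors, this is the first method to bring additional information beyond the face template into face template inversion attacks, and it shows that CLIP semantic embeddings improve the reconstruction quality of facial feature attributes. The method raises the visual quality and identity fidelity of reconstructed images and makes the attack more practical and threatening, posing new challenges and directions for protecting face templates.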

📄 Abstract

Face recognition systems store face templates for efficient matching. Once leaked, these templates pose a threat: inverting them can yield photorealistic surrogates that compromise privacy and enable impersonation. Although existing research has achieved relatively realistic face template inversion, the reconstructed facial images exhibit over-smoothed facial-part attributes (eyes, nose, mouth) and limited transferability. To address this problem, we present CLIP-FTI, a CLIP-driven fine-grained attribute conditioning framework for face template inversion. Our core idea is to use the CLIP model to obtain the semantic embeddings of facial features, in order to realize the reconstruction of specific facial feature attributes. Specifically, facial feature attribute embeddings extracted from CLIP are fused with the leaked template via a cross-modal feature interaction network and projected into the intermediate latent space of a pretrained StyleGAN. The StyleGAN generator then synthesizes face images with the same identity as the templates but with more fine-grained facial feature attributes. Experiments across multiple face recognition backbones and datasets show that our reconstructions (i) achieve higher identification accuracy and attribute similarity, (ii) recover sharper component-level attribute semantics, and (iii) improve cross-model attack transferability compared to prior reconstruction attacks. To the best of our knowledge, ours is the first method to use additional information besides the face template attack to realize face template inversion and obtains SOTA results.

[30] The LUMirage: An independent evaluation of zero-shot performance in the LUMIR challenge

Rohit Jena, Pratik Chaudhari, James C. Gee

🧩 TL;DR

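This study independently re-evaluates the zero-shot generalization of deep learning deformable image registration methods in the LUMIR challenge and finds substantial domain shift problems in practical clinical use: performance degrades severely on out-of-distribution contrasts and on high-resolution data.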

📘 Detailed Summary

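Motivation: The LUMIR challenge claims that deep learning methods show exceptional zero-shot generalization in neuroimaging registration, adapting to unseen contrasts and resolutions. This claim contradicts the established understanding of domain shift in deep learning. The study re-examines these zero-shot claims under a rigorous evaluation protocol while addressing potential sources of instrumentation bias.

Method: The study performs an independent re-evaluation, systematically analyzing deep learning deformable image registration methods under a rigorous evaluation protocol. It focuses on performance on in-distribution T1-weighted MRI, on out-of-distribution contrasts (T2, T2*, FLAIR), and on data at different resolutions, and it examines how sensitive the methods are to preprocessing choices.

Result: Deep learning methods perform comparably to iterative optimization on in-distribution T1w images and on a human-adjacent species (macaque), indicating improved task understanding. On out-of-distribution contrasts, however, performance degrades significantly, with Cohen's d ranging from 0.7 to 1.5, which has a substantial practical impact on downstream clinical workflows. Deep learning methods also face scalability limits on high-resolution data, failing to run on 0.6 mm isotropic images, whereas iterative methods benefit from the higher resolution. Finally, deep methods are highly sensitive to preprocessing choices.

Conclusion: The results indicate that the zero-shot generalization of deep learning deformable registration methods has been overstated and that the observed degradation is consistent with the established literature on domain shift. The study stresses the need for evaluation protocols that reflect practical clinical and research workflows rather than conditions that may inadvertently favor particular method classes. These findings have important implications for deploying deep learning methods in medical image analysis.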

📄 Abstract

The LUMIR challenge represents an important benchmark for evaluating deformable image registration methods on large-scale neuroimaging data. While the challenge demonstrates that modern deep learning methods achieve competitive accuracy on T1-weighted MRI, it also claims exceptional zero-shot generalization to unseen contrasts and resolutions, assertions that contradict established understanding of domain shift in deep learning. In this paper, we perform an independent re-evaluation of these zero-shot claims using rigorous evaluation protocols while addressing potential sources of instrumentation bias. Our findings reveal a more nuanced picture: (1) deep learning methods perform comparably to iterative optimization on in-distribution T1w images and even on human-adjacent species (macaque), demonstrating improved task understanding; (2) however, performance degrades significantly on out-of-distribution contrasts (T2, T2*, FLAIR), with Cohen's d scores ranging from 0.7-1.5, indicating substantial practical impact on downstream clinical workflows; (3) deep learning methods face scalability limitations on high-resolution data, failing to run on 0.6 mm isotropic images, while iterative methods benefit from increased resolution; and (4) deep methods exhibit high sensitivity to preprocessing choices. These results align with the well-established literature on domain shift and suggest that claims of universal zero-shot superiority require careful scrutiny. We advocate for evaluation protocols that reflect practical clinical and research workflows rather than conditions that may inadvertently favor particular method classes.
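The effect sizes quoted above (Cohen's d of 0.7 to 1.5) can be made concrete with a small sketch. The version below uses the common pooled-standard-deviation definition of Cohen's d; the abstract does not say which variant the authors used, and the Dice-style scores are made-up numbers for illustration only.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d using a pooled standard deviation (one common definition)."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance, n - 1 denominator
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Hypothetical overlap scores, NOT taken from the paper.
scores_in_distribution = [0.78, 0.80, 0.79, 0.81, 0.77]
scores_out_of_distribution = [0.76, 0.78, 0.77, 0.79, 0.75]

d = cohens_d(scores_in_distribution, scores_out_of_distribution)
print(f"Cohen's d = {d:.2f}")
```

In this toy data a mean difference of only 0.02 already gives d of roughly 1.26 because the spread within each group is small, which is why an effect size is more informative than a raw score gap.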

[31] EmoCaliber: Advancing Reliable Visual Emotion Comprehension via Confidence Verbalization and Calibration

Daiqing Wu, Dongbao Yang, Can Ma, Yu Zhou

🧩 TL;DR

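This paper proposes EmoCaliber, a confidence-aware multimodal large language model for visual emotion comprehension. By having the model verbalize its confidence, it addresses the failure of conventional approaches to account for the subjectivity of emotion perception, and it shows overall superior performance on the VECBench benchmark.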

📘 Detailed Summary

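Motivation: Current multimodal large language models typically treat emotion prediction in visual emotion comprehension as a deterministic task, requiring the model to output a single, definitive emotion label for each image. This paradigm does not adequately account for the inherent subjectivity of emotion perception and overlooks alternative interpretations that different viewers may find equally plausible, which limits the models' reliability in practice.

Method: The paper proposes a three-stage training framework: it first endows the model with structured reasoning, then teaches it to verbalize confidence, and finally calibrates its confidence expression. The result is EmoCaliber, a confidence-aware multimodal large language model that can quantify how confident it is in its emotion predictions.

Result: In fair and comprehensive evaluations on the unified benchmark VECBench, EmoCaliber shows overall superiority over existing methods in both emotion prediction and confidence estimation, validating the proposed approach.

Conclusion: By introducing confidence verbalization, the study gives visual emotion comprehension systems a more reliable practical footing and marks a feasible step toward more reliable VEC systems. It also offers a new design idea for multimodal models that handle subjective perception tasks.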

📄 Abstract

Visual Emotion Comprehension (VEC) aims to infer sentiment polarities or emotion categories from affective cues embedded in images. In recent years, Multimodal Large Language Models (MLLMs) have established a popular paradigm in VEC, leveraging their generalizability to unify VEC tasks defined under diverse emotion taxonomies. While this paradigm achieves notable success, it typically formulates VEC as a deterministic task, requiring the model to output a single, definitive emotion label for each image. Such a formulation insufficiently accounts for the inherent subjectivity of emotion perception, overlooking alternative interpretations that may be equally plausible to different viewers. To address this limitation, we propose equipping MLLMs with capabilities to verbalize their confidence in emotion predictions. This additional signal provides users with an estimate of both the plausibility of alternative interpretations and the MLLMs' self-assessed competence, thereby enhancing reliability in practice. Building on this insight, we introduce a three-stage training framework that progressively endows with structured reasoning, teaches to verbalize confidence, and calibrates confidence expression, culminating in EmoCaliber, a confidence-aware MLLM for VEC. Through fair and comprehensive evaluations on the unified benchmark VECBench, EmoCaliber demonstrates overall superiority against existing methods in both emotion prediction and confidence estimation. These results validate the effectiveness of our approach and mark a feasible step toward more reliable VEC systems. Project page: https://github.com/wdqqdw/EmoCaliber.
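To illustrate what "calibrated" verbalized confidence means in practice, here is a minimal sketch of expected calibration error (ECE), a widely used calibration measure. This is an assumption for illustration: the abstract does not state which calibration metric VECBench uses, and the function name and inputs below are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |avg confidence - accuracy| per bin."""
    total = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / total) * abs(avg_conf - accuracy)
    return ece


# Hypothetical verbalized confidences and whether each prediction was right.
confidences = [0.9, 0.9, 0.9, 0.9]
correct = [True, True, True, False]
ece = expected_calibration_error(confidences, correct)
print(f"ECE = {ece:.2f}")
```

A model that says "90% confident" but is right 75% of the time is over-confident by 0.15; a lower ECE means the stated confidence tracks actual accuracy more closely.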

[32] An Efficient and Effective Encoder Model for Vision and Language Tasks in the Remote Sensing Domain

João Daniel Silva, Joao Magalhaes, Devis Tuia, Bruno Martins

🧩 TL;DR

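This paper proposes GeoMELT, a Multi-task Efficient Learning Transformer built on an encoder-only architecture. It targets the excessive computational cost of large vision-language models in remote sensing while handling both text generation from images and cross-modal retrieval in a single model.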

📘 Detailed Summary

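Motivation: Large vision-language models in remote sensing can handle multiple tasks, but their large parameter counts make training and inference expensive. Even with existing parameter-efficient adaptation techniques, the computational cost remains prohibitive for most institutions, and existing models rarely explore text generation from images and cross-modal retrieval together in a unified model.

Method: The paper explores encoder-only architectures and proposes GeoMELT, a Multi-task Efficient Learning Transformer with a compact parameter budget. It handles both text generation from remote sensing images and cross-modal retrieval, a combination of tasks that is not typically explored in a unified model.

Result: Results on established benchmarks confirm the efficacy and efficiency of the proposed approach, indicating that the model can address multi-task learning effectively while remaining compact in parameter count.

Conclusion: The study shows that an encoder-only architecture can address multi-task learning in remote sensing while keeping the model compact. It offers a viable alternative for resource-constrained institutions and demonstrates that text generation from images and cross-modal retrieval can be handled in a unified model.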

📄 Abstract

The remote sensing community has recently seen the emergence of methods based on Large Vision and Language Models (LVLMs) that can address multiple tasks at the intersection of computer vision and natural language processing. To fully exploit the potential of such models, a significant focus has been given to the collection of large amounts of training data that cover multiple remote sensing-specific tasks, such as image captioning or visual question answering. However, the cost of using and training LVLMs is high, due to the large number of parameters. While multiple parameter-efficient adaptation techniques have been explored, the computational costs of training and inference with these models can remain prohibitive for most institutions. In this work, we explore the use of encoder-only architectures and propose a model that can effectively address multi-task learning while remaining compact in terms of the number of parameters. In particular, our model tackles combinations of tasks that are not typically explored in a unified model: the generation of text from remote sensing images and cross-modal retrieval. The results of our GeoMELT model - named from Multi-task Efficient Learning Transformer - in established benchmarks confirm the efficacy and efficiency of the proposed approach.

[33] GRAN-TED: Generating Robust, Aligned, and Nuanced Text Embedding for Diffusion Models

Bozhou Li, Sihan Yang, Yushuo Guan, Ruichuan An, Xinlong Chen, Yang Shi, Pengfei Wan, Wentao Zhang, Yuanxing Zhang

🧩 TL;DR

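This paper proposes the GRAN-TED paradigm. By introducing the TED-6K text benchmark and a two-stage training method, it addresses the lack of an efficient evaluation framework for text encoders in diffusion models and the difficulty of adapting them to visual synthesis, and it improves the semantic fidelity of text-to-image and text-to-video generation.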

📘 Detailed Summary

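Motivation: The text encoder is a key component of text-to-image and text-to-video diffusion models, but its development has been hindered by two challenges: the lack of an efficient evaluation framework that reliably predicts downstream generation performance, and the difficulty of effectively adapting pretrained language models to visual synthesis. These problems limit the semantic fidelity and performance gains that text encoders can bring to generative models.

Method: The study proposes the GRAN-TED paradigm, with two core contributions. First, it designs TED-6K, a text-only benchmark that uses a lightweight, unified adapter to assess the representational quality of text encoders efficiently. Second, it develops a two-stage training paradigm: the first stage fine-tunes a multimodal large language model to obtain better visual representations, and the second stage applies a layer-wise weighting method to extract more nuanced and potent text features.

Result: Experiments show that performance on TED-6K correlates strongly with an encoder's effectiveness in downstream generation tasks. The GRAN-TED encoder achieves state-of-the-art performance on TED-6K and yields demonstrable gains in text-to-image and text-to-video generation, validating both the evaluation framework and the training method.

Conclusion: The study offers a systematic approach to evaluating and optimizing text encoders and demonstrates a strong correlation between text representation quality and generation performance. It lays a theoretical and practical foundation for further development of text encoders in diffusion models and may help push the performance limits of text-guided generative models.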

📄 Abstract

The text encoder is a critical component of text-to-image and text-to-video diffusion models, fundamentally determining the semantic fidelity of the generated content. However, its development has been hindered by two major challenges: the lack of an efficient evaluation framework that reliably predicts downstream generation performance, and the difficulty of effectively adapting pretrained language models for visual synthesis. To address these issues, we introduce GRAN-TED, a paradigm to Generate Robust, Aligned, and Nuanced Text Embeddings for Diffusion models. Our contribution is twofold. First, we propose TED-6K, a novel text-only benchmark that enables efficient and robust assessment of an encoder's representational quality without requiring costly end-to-end model training. We demonstrate that performance on TED-6K, standardized via a lightweight, unified adapter, strongly correlates with an encoder's effectiveness in downstream generation tasks. Second, guided by this validated framework, we develop a superior text encoder using a novel two-stage training paradigm. This process involves an initial fine-tuning stage on a Multimodal Large Language Model for better visual representation, followed by a layer-wise weighting method to extract more nuanced and potent text features. Our experiments show that the resulting GRAN-TED encoder not only achieves state-of-the-art performance on TED-6K but also leads to demonstrable performance gains in text-to-image and text-to-video generation. Our code is available at the following link: https://anonymous.4open.science/r/GRAN-TED-4FCC/.
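The "layer-wise weighting" step mentioned above can be pictured as a learned mix over the hidden states of several encoder layers. The sketch below assumes a softmax-normalized scalar weight per layer; that specific form is an assumption, since the abstract does not describe the weighting scheme, and all names here are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def layerwise_weighted_embedding(layer_states, layer_logits):
    """Mix per-layer token vectors with softmax-normalized layer weights.

    layer_states: one vector per layer for a single token (all same length).
    layer_logits: one scalar per layer (learnable in a real model).
    """
    weights = softmax(layer_logits)
    dim = len(layer_states[0])
    return [
        sum(w * state[i] for w, state in zip(weights, layer_states))
        for i in range(dim)
    ]


# Two hypothetical layers; equal logits reduce to a plain average.
mixed = layerwise_weighted_embedding([[1.0, 0.0], [3.0, 2.0]], [0.0, 0.0])
print(mixed)
```

Compared with taking only the last layer, a mix like this lets the downstream generator draw on features from intermediate layers as well.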

[34] On the Effectiveness of Textual Prompting with Lightweight Fine-Tuning for SAM3 Remote Sensing Segmentation

Roni Blushtein-Livnon, Osher Rafaeli, David Ioffe, Amir Boger, Karen Sandberg Esquenazi, Tal Svoray

🧩 TL;DR

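This study evaluates how well the SAM3 concept-driven framework adapts to remote sensing image segmentation. Comparing textual, geometric, and hybrid prompting strategies, it finds that hybrid prompting, which combines semantic and geometric cues, achieves the best performance under limited supervision, while text-only prompting performs poorly on geometrically irregular targets.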

📘 Detailed Summary

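Motivation: Remote sensing image segmentation is constrained by limited annotated data and by the domain gap between overhead imagery and natural images. This limits the direct use of foundation models in remote sensing tasks and motivates effective adaptation under limited supervision.

Method: The study uses the SAM3 concept-driven framework, which generates segmentation masks from textual prompts without task-specific modifications. It evaluates three prompting strategies (textual, geometric, and hybrid) under zero-shot inference and under lightweight fine-tuning with varying amounts of supervision, across four target types.

Result: The hybrid prompting strategy, which combines semantic and geometric cues, achieves the highest performance across targets and metrics. Text-only prompting performs worst, with marked gaps on geometrically irregular targets, reflecting limited semantic alignment between SAM3's textual representations and the overhead appearance of these targets. Performance improves from zero-shot inference to fine-tuning but shows diminishing returns as the supervision scale increases.

Conclusion: The study shows that a modest geometric annotation effort is sufficient for effective adaptation, and that textual prompting with light fine-tuning offers a practical performance-effort trade-off for geometrically regular and visually salient targets. A persistent gap between Precision and IoU indicates that under-segmentation and boundary inaccuracies remain prevalent error patterns in remote sensing tasks, particularly for irregular and less prevalent targets.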

📄 Abstract

Remote sensing (RS) image segmentation is constrained by the limited availability of annotated data and a gap between overhead imagery and natural images used to train foundational models. This motivates effective adaptation under limited supervision. SAM3 concept-driven framework generates masks from textual prompts without requiring task-specific modifications, which may enable this adaptation. We evaluate SAM3 for RS imagery across four target types, comparing textual, geometric, and hybrid prompting strategies, under lightweight fine-tuning scales with increasing supervision, alongside zero-shot inference. Results show that combining semantic and geometric cues yields the highest performance across targets and metrics. Text-only prompting exhibits the lowest performance, with marked score gaps for irregularly shaped targets, reflecting limited semantic alignment between SAM3 textual representations and their overhead appearances. Nevertheless, textual prompting with light fine-tuning offers a practical performance-effort trade-off for geometrically regular and visually salient targets. Across targets, performance improves between zero-shot inference and fine-tuning, followed by diminishing returns as the supervision scale increases. Namely, a modest geometric annotation effort is sufficient for effective adaptation. A persistent gap between Precision and IoU further indicates that under-segmentation and boundary inaccuracies remain prevalent error patterns in RS tasks, particularly for irregular and less prevalent targets.
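The link between a Precision-IoU gap and under-segmentation can be seen in a tiny worked example. The sketch below uses flattened binary masks with made-up values; it only illustrates the standard definitions of the two metrics and is not taken from the paper's evaluation code.

```python
def precision_and_iou(pred_mask, target_mask):
    """Pixel-level precision and IoU for two flattened binary masks."""
    pairs = list(zip(pred_mask, target_mask))
    tp = sum(1 for p, t in pairs if p and t)
    fp = sum(1 for p, t in pairs if p and not t)
    fn = sum(1 for p, t in pairs if not p and t)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    return precision, iou


# Hypothetical 10-pixel example: the target covers 8 pixels, but the
# prediction only covers 4 of them (under-segmentation, no false positives).
target = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
pred = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

precision, iou = precision_and_iou(pred, target)
print(f"precision = {precision:.2f}, IoU = {iou:.2f}")
```

Every predicted pixel is correct, so precision is 1.0, yet IoU is only 0.5 because half of the target is missed. A high-precision, lower-IoU pattern therefore points to masks that are too small rather than masks that spill outside the object.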

[35] MoonSeg3R: Monocular Online Zero-Shot Segment Anything in 3D with Reconstructive Foundation Priors

Zhipeng Du, Duolikun Danier, Jan Eric Lenssen, Hakan Bilen

🧩 TL;DR

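This paper proposes MoonSeg3R, the first method to support online monocular 3D instance segmentation. By using the reconstructive foundation model CUT3R to supply geometric priors, it achieves performance competitive with state-of-the-art RGB-D systems without requiring RGB-D sequences.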

📘 Detailed Summary

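Motivation: Existing methods rely on posed RGB-D sequences and therefore fail in the practical setting of online monocular 3D instance segmentation, which limits the potential of monocular vision systems for real-time 3D scene understanding.

Method: MoonSeg3R introduces three key components: a self-supervised query refinement module with spatial-semantic distillation, which transforms segmentation masks from 2D visual foundation models into discriminative 3D queries; a 3D query index memory, which provides temporal consistency by retrieving contextual queries; and a state-distribution token from CUT3R, which acts as a mask identity descriptor to strengthen cross-frame fusion.

Result: Experiments on ScanNet200 and SceneNN show that MoonSeg3R is the first method to enable online monocular 3D segmentation and that its performance is competitive with state-of-the-art RGB-D-based systems, demonstrating the feasibility of monocular methods for 3D instance segmentation.

Conclusion: The study demonstrates that reconstructive foundation models can supply usable geometric priors, opening a new direction for monocular 3D scene understanding. Its query refinement and cross-frame fusion mechanisms also provide an effective technical framework for online 3D vision tasks.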

📄 Abstract

In this paper, we focus on online zero-shot monocular 3D instance segmentation, a novel practical setting where existing approaches fail to perform because they rely on posed RGB-D sequences. To overcome this limitation, we leverage CUT3R, a recent Reconstructive Foundation Model (RFM), to provide reliable geometric priors from a single RGB stream. We propose MoonSeg3R, which introduces three key components: (1) a self-supervised query refinement module with spatial-semantic distillation that transforms segmentation masks from 2D visual foundation models (VFMs) into discriminative 3D queries; (2) a 3D query index memory that provides temporal consistency by retrieving contextual queries; and (3) a state-distribution token from CUT3R that acts as a mask identity descriptor to strengthen cross-frame fusion. Experiments on ScanNet200 and SceneNN show that MoonSeg3R is the first method to enable online monocular 3D segmentation and achieves performance competitive with state-of-the-art RGB-D-based systems. Code and models will be released.

[36] Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning

Yifei Li, Wenzhao Zheng, Yanran Zhang, Runze Sun, Yu Zheng, Lei Chen, Jie Zhou, Jiwen Lu

🧩 TL;DR

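This paper proposes Skyra, a specialized multimodal large language model that identifies human-perceivable visual artifacts in AI-generated videos and uses them as interpretable evidence for detection. It also constructs ViF-CoT-4K, the first large-scale AI-generated video artifact dataset with fine-grained annotations.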

📘 Detailed Summary

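Motivation: The misuse of AI-driven video generation has raised serious social concerns. Most existing methods are limited to binary classification and lack human-interpretable explanations, so there is an urgent need for reliable AI-generated video detectors that also provide interpretable evidence.

Method: The paper proposes Skyra, a multimodal large language model that identifies human-perceivable visual artifacts and uses them as grounded evidence for detection and explanation. It constructs ViF-CoT-4K, the first large-scale AI-generated video artifact dataset, for supervised fine-tuning, and develops a two-stage training strategy that systematically improves the model's spatio-temporal artifact perception, explanation capability, and detection accuracy.

Result: To evaluate Skyra, the authors introduce ViF-Bench, a benchmark of 3K high-quality samples. Skyra surpasses existing methods across multiple benchmarks, and the evaluation yields valuable insights for advancing explainable AI-generated video detection.

Conclusion: By combining artifact identification with explainable detection, the study offers a new paradigm for AI-generated video detection. The dataset and benchmark it constructs lay groundwork for future research on explainable AI video detection, and the work underscores the importance of multimodal understanding and human-interpretable evidence in detection systems.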

📄 Abstract

The misuse of AI-driven video generation technologies has raised serious social concerns, highlighting the urgent need for reliable AI-generated video detectors. However, most existing methods are limited to binary classification and lack the necessary explanations for human interpretation. In this paper, we present Skyra, a specialized multimodal large language model (MLLM) that identifies human-perceivable visual artifacts in AI-generated videos and leverages them as grounded evidence for both detection and explanation. To support this objective, we construct ViF-CoT-4K for Supervised Fine-Tuning (SFT), which represents the first large-scale AI-generated video artifact dataset with fine-grained human annotations. We then develop a two-stage training strategy that systematically enhances our model's spatio-temporal artifact perception, explanation capability, and detection accuracy. To comprehensively evaluate Skyra, we introduce ViF-Bench, a benchmark comprising 3K high-quality samples generated by over ten state-of-the-art video generators. Extensive experiments demonstrate that Skyra surpasses existing methods across multiple benchmarks, while our evaluation yields valuable insights for advancing explainable AI-generated video detection.

[37] VLIC: Vision-Language Models As Perceptual Judges for Human-Aligned Image Compression

Kyle Sargent, Ruiqi Gao, Philipp Henzler, Charles Herrmann, Aleksander Holynski, Li Fei-Fei, Jiajun Wu, Jason Zhang

🧩 TL;DR

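This paper proposes VLIC, a diffusion-based image compression system that uses the zero-shot reasoning ability of vision-language models to align with human perceptual preferences, without relying on conventional perceptual loss networks or large-scale human annotations.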

📘 Detailed Summary

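Motivation: In evaluations of image compression, naive distortion functions such as MSE are poorly aligned with human perceptual preferences. Existing approaches depend on large-scale human psycho-visual annotations to train perceptual loss networks, which constrains the development of perceptually aligned compression models.

Method: The paper proposes VLIC, which adopts a diffusion-based image compression architecture and uses the zero-shot reasoning ability of vision-language models to replicate human two-alternative forced choice (2AFC) judgments. The compression model is optimized directly through preference-based post-training, rather than by distilling the VLM judgments into a separate perceptual loss network.

Result: Experiments show that calibrating the system on VLM judgments yields competitive or state-of-the-art performance on human-aligned visual compression, depending on the dataset, according to both perceptual metrics and large-scale user studies.

Conclusion: The study demonstrates the strong zero-shot perceptual alignment capability of vision-language models and offers a new training paradigm for image compression. It also shares important insights from an extensive analysis of the reward design and training procedure.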

📄 Abstract

Evaluations of image compression performance which include human preferences have generally found that naive distortion functions such as MSE are insufficiently aligned to human perception. In order to align compression models to human perception, prior work has employed differentiable perceptual losses consisting of neural networks calibrated on large-scale datasets of human psycho-visual judgments. We show that, surprisingly, state-of-the-art vision-language models (VLMs) can replicate binary human two-alternative forced choice (2AFC) judgments zero-shot when asked to reason about the differences between pairs of images. Motivated to exploit the powerful zero-shot visual reasoning capabilities of VLMs, we propose Vision-Language Models for Image Compression (VLIC), a diffusion-based image compression system designed to be post-trained with binary VLM judgments. VLIC leverages existing techniques for diffusion model post-training with preferences, rather than distilling the VLM judgments into a separate perceptual loss network. We show that calibrating this system on VLM judgments produces competitive or state-of-the-art performance on human-aligned visual compression depending on the dataset, according to perceptual metrics and large-scale user studies. We additionally conduct an extensive analysis of the VLM-based reward design and training procedure and share important insights. More visuals are available at https://kylesargent.github.io/vlic

Yu Wang, Juhyung Ha, Frangil M. Ramirez, Yuchen Wang, David J. Crandall

🧩 TL;DR

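This paper proposes GateFusion, an architecture that performs progressive, multi-depth cross-modal fusion through a Hierarchical Gated Fusion Decoder (HiGate) and adds two auxiliary objectives, setting new state-of-the-art results on several active speaker detection benchmarks.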

📘 Detailed Summary

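Motivation: Existing active speaker detection methods rely mainly on late fusion to combine visual and audio features, but late fusion struggles to capture fine-grained cross-modal interactions, which are critical for robust performance in unconstrained scenarios. A fusion mechanism that enables deeper cross-modal interaction is therefore needed.

Method: The paper proposes GateFusion, which combines strong pretrained unimodal encoders with a Hierarchical Gated Fusion Decoder (HiGate). Guided by learnable, bimodally-conditioned gates, HiGate adaptively injects contextual features from one modality into the other at multiple layers of the Transformer backbone, achieving progressive, multi-depth fusion. The paper also proposes two auxiliary objectives: a Masked Alignment Loss (MAL) that aligns unimodal outputs with multimodal predictions, and an Over-Positive Penalty (OPP) that suppresses spurious video-only activations.

Result: GateFusion sets new state-of-the-art results on several challenging active speaker detection benchmarks: 77.8% mAP on Ego4D-ASD (+9.4%), 86.1% mAP on UniTalk (+2.9%), and 96.1% mAP on WASD (+0.5%), with competitive performance on AVA-ActiveSpeaker. Out-of-domain experiments demonstrate the model's generalization, and comprehensive ablations show the complementary benefits of each component.

Conclusion: The study shows that progressive, multi-depth fusion captures cross-modal interactions more effectively than conventional late fusion, and that the auxiliary objectives further strengthen multimodal learning. The method offers a new architectural paradigm for active speaker detection, and its hierarchical gated fusion strategy may generalize to other multimodal tasks for more robust cross-modal representation learning.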

📄 Abstract

Active Speaker Detection (ASD) aims to identify who is currently speaking in each frame of a video. Most state-of-the-art approaches rely on late fusion to combine visual and audio features, but late fusion often fails to capture fine-grained cross-modal interactions, which can be critical for robust performance in unconstrained scenarios. In this paper, we introduce GateFusion, a novel architecture that combines strong pretrained unimodal encoders with a Hierarchical Gated Fusion Decoder (HiGate). HiGate enables progressive, multi-depth fusion by adaptively injecting contextual features from one modality into the other at multiple layers of the Transformer backbone, guided by learnable, bimodally-conditioned gates. To further strengthen multimodal learning, we propose two auxiliary objectives: Masked Alignment Loss (MAL) to align unimodal outputs with multimodal predictions, and Over-Positive Penalty (OPP) to suppress spurious video-only activations. GateFusion establishes new state-of-the-art results on several challenging ASD benchmarks, achieving 77.8% mAP (+9.4%), 86.1% mAP (+2.9%), and 96.1% mAP (+0.5%) on Ego4D-ASD, UniTalk, and WASD benchmarks, respectively, and delivering competitive performance on AVA-ActiveSpeaker. Out-of-domain experiments demonstrate the generalization of our model, while comprehensive ablations show the complementary benefits of each component.
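The idea of a "bimodally-conditioned gate" can be sketched in a few lines. The version below is a deliberately simplified, per-dimension gate with a residual injection; the actual HiGate layer is not specified in the abstract, so the function name, the diagonal gate weights, and the residual form are all assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_inject(target, context, w_target, w_context, bias):
    """Inject `context` features into `target` through a learned gate.

    The gate for each dimension depends on BOTH modalities, which is what
    "bimodally conditioned" is taken to mean here (an assumption).
    """
    fused = []
    for t, c, wt, wc, b in zip(target, context, w_target, w_context, bias):
        gate = sigmoid(wt * t + wc * c + b)  # value in (0, 1)
        fused.append(t + gate * c)           # residual injection
    return fused


# Hypothetical 2-dimensional audio (target) and visual (context) features.
audio = [1.0, 2.0]
visual = [4.0, -2.0]

# Zero weights and bias give a gate of exactly 0.5 in every dimension.
half_open = gated_inject(audio, visual, [0.0, 0.0], [0.0, 0.0], [0.0, 0.0])

# A large negative bias closes the gate, leaving the audio nearly unchanged.
closed = gated_inject(audio, visual, [0.0, 0.0], [0.0, 0.0], [-50.0, -50.0])
print(half_open, closed)
```

Applying such a step at several depths of the backbone, rather than once at the end, is the difference between the multi-depth fusion described above and late fusion.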

[39] Multi-View Foundation Models

Leo Segre, Or Hirschorn, Shai Avidan

🧩 TL;DR

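This paper proposes a way to convert a single-view foundation model into a multi-view foundation model by augmenting Transformer-based foundation models with 3D-aware attention layers, so that features of corresponding points are consistent across multi-view images.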

📘 Detailed Summary

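Motivation: When given multiple views of the same 3D scene, existing foundation models operate on each image independently and cannot guarantee consistent features for the same 3D point, which limits their effectiveness in multi-view computer vision applications.

Method: The method adds intermediate 3D-aware attention layers to Transformer-based foundation models (such as DINO, SAM, and CLIP) to build a multi-view foundation model. These layers help match features across views, avoiding the complexity of building a consistent 3D model of the features and operating directly in image space.

Result: Quantitative experiments show that the method improves feature matching considerably over current foundation models. The paper demonstrates it on surface normal estimation and multi-view segmentation.

Conclusion: The study provides an effective framework for improving the multi-view consistency of foundation models and opens a new route for multi-view computer vision tasks: it avoids complex 3D reconstruction and aligns features directly in 2D image space.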

📄 Abstract

Foundation models are vital tools in various Computer Vision applications. They take as input a single RGB image and output a deep feature representation that is useful for various applications. However, in case we have multiple views of the same 3D scene, they operate on each image independently and do not always produce consistent features for the same 3D point. We propose a way to convert a Foundation Model into a Multi-View Foundation Model. Such a model takes as input a set of images and outputs a feature map for each image such that the features of corresponding points are as consistent as possible. This approach bypasses the need to build a consistent 3D model of the features and allows direct manipulation in the image space. Specifically, we show how to augment Transformers-based foundation models (i.e., DINO, SAM, CLIP) with intermediate 3D-aware attention layers that help match features across different views. As leading examples, we show surface normal estimation and multi-view segmentation tasks. Quantitative experiments show that our method improves feature matching considerably compared to current foundation models.

[40] DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models

Lunbin Zeng, Jingfeng Yao, Bencheng Liao, Hongyuan Tao, Wenyu Liu, Xinggang Wang

🧩 TL;DR

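This paper proposes DiffusionVL, a method for translating existing powerful autoregressive models into diffusion vision language models. The paradigm shift is achieved through simple fine-tuning and delivers notable gains in both performance and inference speed.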

📘 Detailed Summary

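Motivation: Because of the capability limitations of base diffusion language models, current diffusion vision language models lag significantly behind mainstream autoregressive models. This raises a fundamental question: can diffusion vision language models be built on existing powerful autoregressive models?

Method: The paper proposes the DiffusionVL framework, which adapts pretrained autoregressive models to the diffusion paradigm through simple fine-tuning. It also introduces a block-decoding design that supports arbitrary-length generation and KV cache reuse, yielding a significant inference speedup.

Result: Trained with less than 5% of the data required by prior methods, DiffusionVL achieves a 34.4% gain on the MMMU-Pro (vision) benchmark and a 37.5% gain on the MME (Cog.) benchmark, along with a 2x inference speedup.

Conclusion: The study shows that moving from the autoregressive to the diffusion paradigm is both feasible and efficient, and that directly converting an autoregressive language model into a diffusion vision language model achieves performance competitive with LLaVA-style visual instruction tuning, offering a new path for multimodal model development.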

📄 Abstract

In recent multimodal research, the diffusion paradigm has emerged as a promising alternative to the autoregressive (AR) paradigm, owing to its unique decoding advantages. However, due to the capability limitations of the base diffusion language model, the performance of the diffusion vision language model (dVLM) still lags significantly behind that of mainstream models. This leads to a simple yet fundamental question: Is it possible to construct dVLMs based on existing powerful AR models? In response, we propose DiffusionVL, a dVLM family that can be translated from any powerful AR model. Through simple fine-tuning, we successfully adapt AR pre-trained models into the diffusion paradigm. This approach yields two key observations: (1) The paradigm shift from AR-based multimodal models to diffusion is remarkably effective. (2) Direct conversion of an AR language model to a dVLM is also feasible, achieving performance competitive with LLaVA-style visual-instruction-tuning. Further, we introduce a block-decoding design into dVLMs that supports arbitrary-length generation and KV cache reuse, achieving a significant inference speedup. We conduct a large number of experiments. Despite training with less than 5% of the data required by prior methods, DiffusionVL achieves a comprehensive performance improvement, with a 34.4% gain on the MMMU-Pro (vision) bench and a 37.5% gain on the MME (Cog.) bench, alongside a 2x inference speedup. The model and code are released at https://github.com/hustvl/DiffusionVL.

cs.CL [Back]

[41] T5Gemma 2: Seeing, Reading, and Understanding Longer

Biao Zhang, Paul Suganthan, Gaël Liu, Ilya Philippov, Sahil Dua, Ben Hora, Kat Black, Gus Martins, Omar Sanseviero, Shreya Pathak, Cassidy Hardin, Francesco Visin, Jiageng Zhang, Kathleen Kenealy, Qin Yin, Olivier Lacombe, Armand Joulin, Tris Warkentin, Adam Roberts

🧩 TL;DR

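This paper introduces T5Gemma 2, the next generation of the T5Gemma family of lightweight open encoder-decoder models, with strong multilingual, multimodal, and long-context capabilities. With an extended adaptation recipe and efficiency improvements, it matches or outperforms its Gemma 3 counterparts at multiple scales.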

📘 Detailed Summary

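Motivation: The study aims to extend the T5Gemma adaptation recipe from the text-only regime to the multimodal regime, to improve the efficiency of encoder-decoder models, and to verify the distinctive strength of the encoder-decoder architecture in long-context modeling.

Method: T5Gemma 2 builds on the Gemma 3 models and uses the UL2 adaptation recipe to convert a pretrained decoder-only model into an encoder-decoder model. It proposes two efficiency improvements: tied word embedding, which shares all embeddings across encoder and decoder, and merged attention, which unifies decoder self-attention and cross-attention into a single joint module.

Result: Experiments show that the adaptation strategy generalizes across architectures and modalities and that the encoder-decoder architecture has a distinctive strength in long-context modeling. T5Gemma 2 yields comparable or better pretraining performance and significantly better post-training performance than its Gemma 3 counterpart. Pretrained models are released at the 270M-270M, 1B-1B, and 4B-4B scales.

Conclusion: The study demonstrates that adapting decoder-only models into encoder-decoder architectures remains effective when extended to the multimodal setting, and its efficiency improvements offer a practical recipe for lightweight encoder-decoder models. The released models are a useful resource for future research, particularly on multilingual and multimodal tasks.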

📄 Abstract

We introduce T5Gemma 2, the next generation of the T5Gemma family of lightweight open encoder-decoder models, featuring strong multilingual, multimodal and long-context capabilities. T5Gemma 2 follows the adaptation recipe (via UL2) in T5Gemma -- adapting a pretrained decoder-only model into an encoder-decoder model, and extends it from text-only regime to multimodal based on the Gemma 3 models. We further propose two methods to improve the efficiency: tied word embedding that shares all embeddings across encoder and decoder, and merged attention that unifies decoder self- and cross-attention into a single joint module. Experiments demonstrate the generality of the adaptation strategy over architectures and modalities as well as the unique strength of the encoder-decoder architecture on long context modeling. Similar to T5Gemma, T5Gemma 2 yields comparable or better pretraining performance and significantly improved post-training performance than its Gemma 3 counterpart. We release the pretrained models (270M-270M, 1B-1B and 4B-4B) to the community for future research.

[42] Parameter Efficient Multimodal Instruction Tuning for Romanian Vision Language Models

George-Andrei Dima, Dumitru-Clementin Cercel

🧩 TL;DR

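The study translates the Flickr30k dataset into Romanian, extends it into a Romanian visual question answering dataset, and fine-tunes open-source vision language models on it. This improves Romanian multimodal capabilities and helps close the resource gap for low-resource languages in generative AI.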

📘 Detailed Summary

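Motivation: The study addresses the shortage of multimodal NLP resources for Romanian, a low-resource language. By filling the gap in Romanian vision-language datasets, it aims to advance the democratization of generative AI.

Method: The study translates the widely used Flickr30k dataset into Romanian and uses open-source large language models to extend it into a visual question answering dataset. It selects vision language models from three widely used model families (LLaMA 3.2, LLaVA 1.6, and Qwen2) and fine-tunes them with the parameter-efficient LoRA method.

Result: The fine-tuned models improve on Romanian visual question answering. The seven-billion-parameter Qwen2-VL-RoVQA obtains the top scores on both tasks, with BERTScore F1 improvements of +6.05% and +2.61% over its original version. The models also improve on Romanian image description generation, a task they were not trained on, and make substantially fewer grammatical errors.

Conclusion: Beyond filling a gap in Romanian multimodal datasets, the study shows that fine-tuning on a high-quality dataset can effectively improve vision language model performance for a low-resource language. It offers a workable recipe for similar work on other low-resource languages and supports more equitable development of generative AI in multilingual settings.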

📄 Abstract

Focusing on low-resource languages is an essential step toward democratizing generative AI. In this work, we contribute to reducing the multimodal NLP resource gap for Romanian. We translate the widely known Flickr30k dataset into Romanian and further extend it for visual question answering by leveraging open-source LLMs. We demonstrate the usefulness of our datasets by fine-tuning open-source VLMs on Romanian visual question answering. We select VLMs from three widely used model families: LLaMA 3.2, LLaVA 1.6, and Qwen2. For fine-tuning, we employ the parameter-efficient LoRA method. Our models show improved Romanian capabilities in visual QA, as well as on tasks they were not trained on, such as Romanian image description generation. The seven-billion-parameter Qwen2-VL-RoVQA obtains top scores on both tasks, with improvements of +6.05% and +2.61% in BERTScore F1 over its original version. Finally, the models show substantial reductions in grammatical errors compared to their original forms, indicating improvements not only in language understanding but also in Romanian fluency.
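For readers unfamiliar with LoRA, the core idea is that the pretrained weight matrix W stays frozen and only a low-rank update is trained, giving an effective weight W' = W + (alpha / r) · B · A. The sketch below shows that arithmetic in plain Python; the matrix values, `alpha`, and `rank` are made up for illustration and say nothing about the hyperparameters used in the paper.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_effective_weight(w, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * B @ A.

    w:      d_out x d_in frozen pretrained weight
    lora_b: d_out x rank trainable matrix (initialized to zeros in LoRA)
    lora_a: rank x d_in trainable matrix
    """
    delta = matmul(lora_b, lora_a)
    scale = alpha / rank
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Hypothetical 2x2 layer with a rank-1 adapter.
w = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[0.5, 0.0]]        # rank x d_in
lora_b = [[1.0], [2.0]]      # d_out x rank

adapted = lora_effective_weight(w, lora_a, lora_b, alpha=2.0, rank=1)
print(adapted)

# With B at its zero initialization the layer is unchanged.
untouched = lora_effective_weight(w, lora_a, [[0.0], [0.0]], alpha=2.0, rank=1)
```

Only A and B are trained, so the number of trainable parameters grows with the rank rather than with the full size of W, which is what makes the method parameter-efficient.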

[43] Evaluating Large Language Models on Multimodal Chemistry Olympiad Exams

Yiming Cui, Xin Yao, Yuxuan Qin, Xin Li, Shijin Wang, Guoping Hu

🧩 TL;DR

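This study systematically evaluates the scientific reasoning of 40 multimodal large language models on Chemistry Olympiad questions, reveals key limitations of current models in modality fusion, and identifies Chain-of-Thought prompting as an effective way to improve accuracy and visual grounding.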

📘 Detailed Summary

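Motivation: Multimodal scientific reasoning remains a significant challenge for large language models, particularly in chemistry, where problem-solving relies on symbolic diagrams, molecular structures, and structured visual data. Existing work lacks a systematic evaluation of multimodal models on complex chemical reasoning tasks, especially Olympiad-level problems that require cross-modal integration.

Method: The study systematically evaluates 40 proprietary and open-source multimodal large language models, including GPT-5, o3, Gemini-2.5-Pro, and Qwen2.5-VL, on a benchmark built from over two decades of U.S. National Chemistry Olympiad (USNCO) exams. It experiments with Chain-of-Thought prompting and uses ablation studies and occlusion-based interpretability analysis to assess the models' visual grounding.

Result: Many models struggle with modality fusion; in some cases removing the image even improves accuracy, indicating misalignment in vision-language integration. Chain-of-Thought prompting consistently improves both accuracy and visual grounding, as confirmed by the ablation studies and occlusion analysis. Overall, the models show critical limitations in scientific reasoning.

Conclusion: The study provides a timely benchmark for measuring progress in domain-specific multimodal AI and underscores the need for further advances at the intersection of artificial intelligence and scientific reasoning. Its results expose key weaknesses of current multimodal large language models in chemical reasoning and offer actionable strategies for building more robust and interpretable multimodal systems.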

📄 Abstract

Multimodal scientific reasoning remains a significant challenge for large language models (LLMs), particularly in chemistry, where problem-solving relies on symbolic diagrams, molecular structures, and structured visual data. Here, we systematically evaluate 40 proprietary and open-source multimodal LLMs, including GPT-5, o3, Gemini-2.5-Pro, and Qwen2.5-VL, on a curated benchmark of Olympiad-style chemistry questions drawn from over two decades of U.S. National Chemistry Olympiad (USNCO) exams. These questions require integrated visual and textual reasoning across diverse modalities. We find that many models struggle with modality fusion, where in some cases, removing the image even improves accuracy, indicating misalignment in vision-language integration. Chain-of-Thought prompting consistently enhances both accuracy and visual grounding, as demonstrated through ablation studies and occlusion-based interpretability. Our results reveal critical limitations in the scientific reasoning abilities of current MLLMs, providing actionable strategies for developing more robust and interpretable multimodal systems in chemistry. This work provides a timely benchmark for measuring progress in domain-specific multimodal AI and underscores the need for further advances at the intersection of artificial intelligence and scientific reasoning.

[44] SGM: Safety Glasses for Multimodal Large Language Models via Neuron-Level Detoxification

Hongbo Wang, MaungMaung AprilPyone, Isao Echizen

🧩 TL;DR

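This paper proposes SGM, a white-box neuron-level multimodal intervention that mitigates toxicity in multimodal large language models by selectively recalibrating toxic expert neurons. It reduces harmful generation rates under both standard and adversarial conditions without any parameter updates.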

📘 Detailed Summary

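Motivation: Multimodal large language models inherit toxic, biased, and NSFW signals from weakly curated pretraining corpora. This creates safety risks, especially under adversarial triggers that existing late, opaque, training-free detoxification methods struggle to handle.

Method: SGM is a white-box neuron-level multimodal intervention. It selectively recalibrates a small set of toxic expert neurons via expertise-weighted soft suppression, neutralizing harmful cross-modal activations without any parameter updates. The paper also establishes MM-TOXIC-QA, a multimodal toxicity evaluation framework.

Result: Experiments on open-source MLLMs show that SGM mitigates toxicity under both standard and adversarial conditions, cutting harmful rates from 48.2% to 2.5% while preserving fluency and multimodal reasoning. Its combined variant, SGM*, integrates with existing detoxification methods for stronger safety performance.

Conclusion: SGM offers an interpretable, low-cost solution for toxicity-controlled multimodal generation. Its white-box neuron-level intervention neutralizes harmful activations without updating parameters, opening a new route to the safe deployment of multimodal models, and the approach is extensible.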

📄 Abstract

Disclaimer: Samples in this paper may be harmful and cause discomfort. Multimodal large language models (MLLMs) enable multimodal generation but inherit toxic, biased, and NSFW signals from weakly curated pretraining corpora, causing safety risks, especially under adversarial triggers that late, opaque training-free detoxification methods struggle to handle. We propose SGM, a white-box neuron-level multimodal intervention that acts like safety glasses for toxic neurons: it selectively recalibrates a small set of toxic expert neurons via expertise-weighted soft suppression, neutralizing harmful cross-modal activations without any parameter updates. We establish MM-TOXIC-QA, a multimodal toxicity evaluation framework, and compare SGM with existing detoxification techniques. Experiments on open-source MLLMs show that SGM mitigates toxicity in standard and adversarial conditions, cutting harmful rates from 48.2% to 2.5% while preserving fluency and multimodal reasoning. SGM is extensible, and its combined defenses, denoted as SGM*, integrate with existing detoxification methods for stronger safety performance, providing an interpretable, low-cost solution for toxicity-controlled multimodal generation.
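One way to picture "expertise-weighted soft suppression" is as a per-neuron scaling applied at inference time, with model weights left untouched. The sketch below is a guess at the general shape of such an intervention: the scaling formula, the `expertise` scores in [0, 1], and the `strength` parameter are all hypothetical and are not taken from the paper.

```python
def soft_suppress(activations, expertise, strength=1.0):
    """Scale each activation by (1 - strength * expertise).

    expertise: hypothetical per-neuron score in [0, 1], where 1 marks a
    neuron judged to be strongly associated with toxic outputs.
    No model parameters are modified; only activations are rescaled.
    """
    damped = []
    for act, score in zip(activations, expertise):
        score = min(max(score, 0.0), 1.0)  # clamp to [0, 1]
        damped.append(act * (1.0 - strength * score))
    return damped


# Hypothetical activations for three neurons.
activations = [2.0, -1.0, 0.5]
expertise = [0.0, 1.0, 0.5]  # first neuron untouched, second fully muted

damped = soft_suppress(activations, expertise)
print(damped)
```

Unlike a hard mask that zeroes every flagged neuron, a weighted scaling of this kind dampens neurons in proportion to how strongly they are implicated, which is one plausible reading of "soft" in the method's name.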

[45] Yes-MT's Submission to the Low-Resource Indic Language Translation Shared Task in WMT 2024

Yash Bhaskar, Parameswari Krishnamurthy

🧩 TL;DR

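This paper describes the Yes-MT team's submission to the WMT 2024 Low-Resource Indic Language Translation Shared Task. It explores several approaches to translating between English and Assamese, Mizo, Khasi, and Manipuri, and finds that low-resource translation remains challenging but that large language models show promise when fine-tuned.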

📘 Detailed Summary

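Motivation: The study addresses machine translation between English and low-resource Indic languages (Assamese, Mizo, Khasi, and Manipuri). Translation quality for these languages suffers from the lack of large-scale parallel corpora, and the study explores ways to improve translation performance under low-resource conditions.

Method: The study explores a range of approaches: fine-tuning pretrained models such as mT5 and IndicBart in both multilingual and monolingual settings, LoRA fine-tuning of IndicTrans2, zero-shot and few-shot prompting of large language models such as Llama 3 and Mixtral 8x7b, LoRA supervised fine-tuning of Llama 3, and training Transformer models from scratch. Together these form a broad comparison of low-resource translation methods.

Result: Results were evaluated on the WMT23 Low-Resource Indic Language Translation Shared Task test data using SacreBLEU and CHRF. They highlight the specific challenges of low-resource translation, show how the different approaches perform on the four target languages, and provide empirical evidence on the use of large language models in low-resource settings.

Conclusion: The main takeaway is that low-resource translation still faces significant challenges, but large language models, particularly after fine-tuning, show potential to address them. This points to directions for future work on low-resource language processing and underscores the importance of model adaptability and data-efficient learning methods.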

📄 Abstract

This paper presents the systems submitted by the Yes-MT team for the Low-Resource Indic Language Translation Shared Task at WMT 2024 (Pakray et al., 2024), focusing on translating between English and the Assamese, Mizo, Khasi, and Manipuri languages. The experiments explored various approaches, including fine-tuning pre-trained models like mT5 (Xue et al., 2020) and IndicBart (Dabre et al., 2021) in both multilingual and monolingual settings, LoRA (Hu et al., 2021) fine-tuning IndicTrans2 (Gala et al., 2023), zero-shot and few-shot prompting (Brown, 2020) with large language models (LLMs) like Llama 3 (Dubey et al., 2024) and Mixtral 8x7b (Jiang et al., 2024), LoRA supervised fine-tuning of Llama 3 (Mecklenburg et al., 2024), and training Transformer models (Vaswani, 2017) from scratch. The results were evaluated on the WMT23 Low-Resource Indic Language Translation Shared Task test data using SacreBLEU (Post, 2018) and CHRF (Popovic, 2015), highlighting the challenges of low-resource translation and the potential of LLMs for these tasks, particularly with fine-tuning.

[46] Evaluating LLMs for Zeolite Synthesis Event Extraction (ZSEE): A Systematic Analysis of Prompting Strategies

Charan Prakash Rathore, Saumi Ray, Dhruv Kumar

🧩 TL;DR

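This study systematically evaluates large language models on information extraction from zeolite synthesis experiments, comparing four prompting strategies across six state-of-the-art LLMs. The models perform well on high-level event classification but show clear limitations on fine-grained parameter extraction, indicating a need for domain-adapted models.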

📘 Detailed Summary

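Motivation: Existing work has not systematically evaluated large language models on the domain-specific task of extracting structured information from zeolite synthesis procedures. The study asks how effective different prompting strategies are for scientific information extraction, focusing on four subtasks: event type classification, trigger text identification, argument role extraction, and argument text extraction.

Method: Four prompting strategies (zero-shot, few-shot, event-specific, and reflection-based) are evaluated across six state-of-the-art LLMs: Gemma-3-12b-it, GPT-5-mini, O4-mini, Claude-Haiku-3.5, and DeepSeek in reasoning and non-reasoning variants. The evaluation uses the ZSEE dataset of 1,530 annotated sentences and covers all four extraction subtasks.

Result: LLMs perform strongly on event type classification (80-90% F1) but only modestly on fine-grained extraction, particularly argument role and argument text extraction (50-65% F1). GPT-5-mini shows extreme prompt sensitivity, with 11-79% F1 variation. Advanced prompting strategies yield minimal improvements over zero-shot prompting, which the authors attribute to fundamental architectural limitations. Error analysis identifies systematic hallucination, over-generalization, and an inability to capture synthesis-specific nuances.

Conclusion: LLMs achieve high-level understanding, but precise extraction of experimental parameters requires domain-adapted models. The study provides quantitative benchmarks for scientific information extraction and argues for models optimized for scientific domains rather than reliance on improved general-purpose prompting alone.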

📄 Abstract

Extracting structured information from zeolite synthesis experimental procedures is critical for materials discovery, yet existing methods have not systematically evaluated Large Language Models (LLMs) for this domain-specific task. This work addresses a fundamental question: what is the efficacy of different prompting strategies when applying LLMs to scientific information extraction? We focus on four key subtasks: event type classification (identifying synthesis steps), trigger text identification (locating event mentions), argument role extraction (recognizing parameter types), and argument text extraction (extracting parameter values). We evaluate four prompting strategies - zero-shot, few-shot, event-specific, and reflection-based - across six state-of-the-art LLMs (Gemma-3-12b-it, GPT-5-mini, O4-mini, Claude-Haiku-3.5, DeepSeek reasoning and non-reasoning) using the ZSEE dataset of 1,530 annotated sentences. Results demonstrate strong performance on event type classification (80-90% F1) but modest performance on fine-grained extraction tasks, particularly argument role and argument text extraction (50-65% F1). GPT-5-mini exhibits extreme prompt sensitivity with 11-79% F1 variation. Notably, advanced prompting strategies provide minimal improvements over zero-shot approaches, revealing fundamental architectural limitations. Error analysis identifies systematic hallucination, over-generalization, and inability to capture synthesis-specific nuances. Our findings demonstrate that while LLMs achieve high-level understanding, precise extraction of experimental parameters requires domain-adapted models, providing quantitative benchmarks for scientific information extraction.
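To make the reported F1 figures concrete, here is a minimal sketch of how precision, recall, and F1 might be computed for one sentence in the argument extraction subtasks. The abstract does not state the matching criterion, so exact matching on (role, text) pairs is an assumption, and the example arguments are invented.

```python
def prf1(gold, pred):
    """Precision, recall, and F1 over exact-match (role, text) pairs."""
    gold, pred = set(gold), set(pred)
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Invented example: gold annotations vs. a model's output for one sentence.
gold = [("temperature", "180 C"), ("duration", "24 h"), ("material", "TEOS")]
pred = [("temperature", "180 C"), ("duration", "48 h")]
print(prf1(gold, pred))
```

Under exact matching, a prediction with the right role but a wrong value (here the duration) counts as both a false positive and a false negative, which is one reason fine-grained extraction scores can sit well below classification scores.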

[47] You Never Know a Person, You Only Know Their Defenses: Detecting Levels of Psychological Defense Mechanisms in Supportive Conversations

Hongbin Na, Zimu Wang, Zhaoming Chen, Peilin Zhou, Yining Hua, Grace Ziqi Zhou, Haiyang Zhang, Tao Shen, Wei Wang, John Torous, Shaoxiong Ji, Ling Chen

🧩 TL;DR

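This study introduces the PsyDefConv dialogue corpus and DMRS Co-Pilot, a four-stage annotation pipeline, for automatically identifying psychological defense mechanisms in clinical dialogues. The corpus contains 200 dialogues and 4709 utterances, with annotation agreement of Cohen's kappa 0.639.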

📘 Detailed Summary

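Motivation: Psychological defenses are strategies people use to cope with distress, and their rigid use or overuse is negatively linked to mental health. Defenses are complex and hard to measure reliably, and clinical dialogue in particular lacks high-quality annotated datasets and effective automated identification methods.

Method: The study introduces the PsyDefConv dialogue corpus of 200 dialogues and 4709 utterances, of which 2336 help seeker utterances are labeled for defense level. It also develops DMRS Co-Pilot, a four-stage pipeline that provides evidence-based pre-annotations, and benchmarks strong language models in zero-shot and fine-tuning settings.

Result: Annotation agreement reaches Cohen's kappa 0.639. In a counterbalanced study, the co-pilot reduced average annotation time by 22.4%. In expert review on a seven-point scale, it averaged 4.62 for evidence, 4.44 for clinical plausibility, and 4.40 for insight. The best macro F1-score is around 30%, and models tend to overpredict mature defenses.

Conclusion: The work provides a resource for research on defensive functioning in language. It finds that mature defenses are the most common in clinical dialogues and that there are emotion-specific deviations, and it shows that current language models leave clear room for improvement in defense mechanism identification.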

📄 Abstract

Psychological defenses are strategies, often automatic, that people use to manage distress. Rigid use or overuse of defenses is negatively linked to mental health and shapes what speakers disclose and how they accept or resist help. However, defenses are complex and difficult to reliably measure, particularly in clinical dialogues. We introduce PsyDefConv, a dialogue corpus with help seeker utterances labeled for defense level, and DMRS Co-Pilot, a four-stage pipeline that provides evidence-based pre-annotations. The corpus contains 200 dialogues and 4709 utterances, including 2336 help seeker turns, with a labeling agreement of Cohen's kappa 0.639. In a counterbalanced study, the co-pilot reduced average annotation time by 22.4%. In expert review, it averaged 4.62 for evidence, 4.44 for clinical plausibility, and 4.40 for insight on a seven-point scale. Benchmarks with strong language models in zero-shot and fine-tuning settings demonstrate clear headroom, with the best macro F1-score around 30% and a tendency to overpredict mature defenses. Corpus analyses confirm that mature defenses are most common and reveal emotion-specific deviations. We will release the corpus, annotations, code, and prompts to support research on defensive functioning in language.
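For readers unfamiliar with the agreement statistic quoted above, the following sketch computes Cohen's kappa for two annotators from its standard definition, (observed agreement minus chance agreement) divided by (1 minus chance agreement). The label names and the toy annotations are invented for illustration and are not drawn from PsyDefConv.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    labels = set(count_a) | set(count_b)
    expected = sum(count_a[l] * count_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)


# Invented defense-level labels for six utterances.
ann_1 = ["mature", "mature", "neurotic", "immature", "mature", "neurotic"]
ann_2 = ["mature", "neurotic", "neurotic", "immature", "mature", "mature"]
print(round(cohens_kappa(ann_1, ann_2), 3))
```

Because kappa discounts agreement expected by chance, it is lower than raw percent agreement whenever one label dominates, which is relevant here given that mature defenses are reported as the most common class.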

cs.AI [Back]

[48] LADY: Linear Attention for Autonomous Driving Efficiency without Transformers

Jihao Huang, Xi Xia, Zhiyuan Li, Tianle Liu, Jingke Wang, Junbo Chen, Tengju Ye

🧩 TL;DR

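This paper proposes LADY, the first fully linear attention-based generative model for end-to-end autonomous driving. It models long-range spatiotemporal context at constant computational and memory cost, improving deployment efficiency and real-time performance on resource-constrained edge platforms.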

📘 Detailed Summary

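Motivation: Existing Transformer-based end-to-end autonomous driving methods incur a quadratic attention cost, which limits their ability to model long spatial and temporal sequences, especially on resource-constrained edge platforms. Existing linear attention architectures are limited to self-attention and lack support for the cross-modal and cross-temporal interactions that autonomous driving requires.

Method: LADY uses a fully linear attention architecture that fuses long-range temporal context at constant computational and memory cost, regardless of the history length of camera and LiDAR features. It also introduces a lightweight linear cross-attention mechanism for effective cross-modal information exchange.

Result: On the NAVSIM and Bench2Drive benchmarks, LADY achieves state-of-the-art performance with constant time and memory complexity, improving planning performance while significantly reducing computational cost. The model has also been deployed and validated on edge devices, demonstrating its practicality in resource-limited scenarios.

Conclusion: LADY uses linear attention to address the computational cost of long-range spatiotemporal modeling in autonomous driving, offering a feasible route to real-time deployment on edge devices. The work shows the potential of linear attention for complex multimodal temporal tasks and suggests a new direction for end-to-end driving systems in resource-constrained settings.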

📄 Abstract

End-to-end paradigms have demonstrated great potential for autonomous driving, and most existing methods are built upon Transformer architectures. However, Transformers incur a quadratic attention cost, limiting their ability to model long spatial and temporal sequences, particularly on resource-constrained edge platforms. As autonomous driving inherently demands efficient temporal modeling, this challenge severely limits their deployment and real-time performance. Recently, linear attention mechanisms have gained increasing attention due to their superior spatiotemporal complexity. However, existing linear attention architectures are limited to self-attention, lacking support for cross-modal and cross-temporal interactions, both crucial for autonomous driving. In this work, we propose LADY, the first fully linear attention-based generative model for end-to-end autonomous driving. LADY enables fusion of long-range temporal context at inference with constant computational and memory costs, regardless of the history length of camera and LiDAR features. Additionally, we introduce a lightweight linear cross-attention mechanism that enables effective cross-modal information exchange. Experiments on the NAVSIM and Bench2Drive benchmarks demonstrate that LADY achieves state-of-the-art performance with constant-time and memory complexity, offering improved planning performance and significantly reduced computational cost. Additionally, the model has been deployed and validated on edge devices, demonstrating its practicality in resource-limited scenarios.
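The constant-cost claim follows from a general property of kernelized linear attention: the history can be folded into a fixed-size running state instead of a growing key-value cache. The sketch below shows that generic recurrence with the common elu(x)+1 feature map; it is not LADY's architecture, and the class name `LinearAttentionState` and its methods are hypothetical.

```python
import math


def phi(x):
    """Positive feature map elu(x) + 1, applied elementwise."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


class LinearAttentionState:
    """Fixed-size summary of all past (key, value) pairs."""

    def __init__(self, d_k, d_v):
        self.S = [[0.0] * d_v for _ in range(d_k)]  # sum of phi(k) v^T
        self.z = [0.0] * d_k                        # sum of phi(k)

    def update(self, key, value):
        # Folding in one frame costs O(d_k * d_v), independent of history length.
        fk = phi(key)
        for i, fki in enumerate(fk):
            self.z[i] += fki
            for j, vj in enumerate(value):
                self.S[i][j] += fki * vj

    def read(self, query):
        # Output = phi(q)^T S / phi(q)^T z, a normalized weighted sum of values.
        fq = phi(query)
        denom = sum(a * b for a, b in zip(fq, self.z))
        d_v = len(self.S[0])
        if denom == 0.0:
            return [0.0] * d_v  # nothing stored yet
        return [
            sum(fq[i] * self.S[i][j] for i in range(len(fq))) / denom
            for j in range(d_v)
        ]


state = LinearAttentionState(d_k=2, d_v=3)
for t in range(1000):  # long history, same state size throughout
    state.update([0.1 * t, -0.05 * t], [1.0, float(t), 0.5])
print(len(state.S), len(state.S[0]), len(state.z))
```

A cross-attention variant can reuse the same state by building it from one modality's keys and values and reading it with another modality's queries; whether LADY does exactly this is not stated in the abstract.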

[49] ChatGPT and Gemini participated in the Korean College Scholastic Ability Test -- Earth Science I

Seok-Hyun Ga, Chun-Yen Chang

🧩 TL;DR

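Using the Earth Science I section of the 2025 Korean College Scholastic Ability Test, this study evaluates the scientific reasoning capabilities and cognitive limitations of multimodal large language models, including GPT-4o, Gemini 2.5 Flash, and Gemini 2.5 Pro. It identifies systematic flaws such as a perception-cognition gap, a calculation-conceptualization discrepancy, and process hallucination, providing empirical grounds for designing AI-resistant assessment questions that target these weaknesses.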

📘 Detailed Summary

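Motivation: As generative AI spreads through education and assessment, students increasingly use AI to complete assignments, raising serious concerns about academic integrity and the validity of assessments. The study analyzes the cognitive limitations of state-of-the-art large language models on multimodal scientific reasoning tasks to provide an empirical basis for responding to unauthorized AI use.

Method: The study uses the Earth Science I section of the 2025 Korean College Scholastic Ability Test to evaluate GPT-4o, Gemini 2.5 Flash, and Gemini 2.5 Pro on multimodal scientific reasoning. Three experimental conditions (full-page input, individual item input, and optimized multimodal input) assess model performance across different data structures, and quantitative and qualitative analyses are combined to expose cognitive flaws.

Result: Quantitatively, unstructured inputs caused significant performance degradation, mainly due to segmentation and OCR failures. Even under optimized conditions, models showed fundamental reasoning flaws. Qualitatively, "Perception Errors" were dominant, revealing a "Perception-Cognition Gap" in which models recognized visual data but failed to interpret the symbolic meanings in schematic diagrams. Models also showed a "Calculation-Conceptualization Discrepancy," performing calculations successfully while failing to apply the underlying scientific concepts, and "Process Hallucination," skipping visual verification in favor of unfounded background knowledge.

Conclusion: By identifying specific cognitive weaknesses of AI models, the study provides actionable cues for designing "AI-resistant questions." By exploiting weaknesses such as the gap between perception and cognition, educators can distinguish genuine student competency from AI-generated responses and thereby protect assessment fairness.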

📄 Abstract

The rapid development of Generative AI is bringing innovative changes to education and assessment. As the prevalence of students utilizing AI for assignments increases, concerns regarding academic integrity and the validity of assessments are growing. This study utilizes the Earth Science I section of the 2025 Korean College Scholastic Ability Test (CSAT) to deeply analyze the multimodal scientific reasoning capabilities and cognitive limitations of state-of-the-art Large Language Models (LLMs), including GPT-4o, Gemini 2.5 Flash, and Gemini 2.5 Pro. Three experimental conditions (full-page input, individual item input, and optimized multimodal input) were designed to evaluate model performance across different data structures. Quantitative results indicated that unstructured inputs led to significant performance degradation due to segmentation and Optical Character Recognition (OCR) failures. Even under optimized conditions, models exhibited fundamental reasoning flaws. Qualitative analysis revealed that "Perception Errors" were dominant, highlighting a "Perception-Cognition Gap" where models failed to interpret symbolic meanings in schematic diagrams despite recognizing visual data. Furthermore, models demonstrated a "Calculation-Conceptualization Discrepancy," successfully performing calculations while failing to apply the underlying scientific concepts, and "Process Hallucination," where models skipped visual verification in favor of plausible but unfounded background knowledge. Addressing the challenge of unauthorized AI use in coursework, this study provides actionable cues for designing "AI-resistant questions" that target these specific cognitive vulnerabilities. By exploiting AI's weaknesses, such as the gap between perception and cognition, educators can distinguish genuine student competency from AI-generated responses, thereby ensuring assessment fairness.

Table of Contents

cs.CV [Back]

[1] SocialNav-MoE: A Mixture-of-Experts Vision Language Model for Socially Compliant Navigation with Reinforcement Fine-Tuning

[2] Improving VQA Reliability: A Dual-Assessment Approach with Self-Reflection and Cross-Model Verification

[3] HERBench: A Benchmark for Multi-Evidence Integration in Video Question Answering

[4] Visual-textual Dermatoglyphic Animal Biometrics: A First Case Study on Panthera tigris

[5] Vibe Spaces for Creatively Connecting and Expressing Visual Concepts

[6] TalkVerse: Democratizing Minute-Long Audio-Driven Video Generation

[7] Puzzle Curriculum GRPO for Vision-Centric Reasoning

[8] Adaptive Multimodal Person Recognition: A Robust Framework for Handling Missing Modalities

[9] Beyond Proximity: A Keypoint-Trajectory Framework for Classifying Affiliative and Agonistic Social Networks in Dairy Cattle

[10] PMMD: A pose-guided multi-view multi-modal diffusion for person generation

[11] Uni-Parser Technical Report

[12] Is Nano Banana Pro a Low-Level Vision All-Rounder? A Comprehensive Evaluation on 14 Tasks and 40 Datasets

[13] Emotion Recognition in Signers

[14] Explainable Action Form Assessment by Exploiting Multimodal Chain-of-Thoughts Reasoning

[15] VTCBench: Can Vision-Language Models Understand Long Context with Vision-Text Compression?

[16] EagleVision: A Dual-Stage Framework with BEV-grounding-based Chain-of-Thought for Spatial Intelligence

[17] Intersectional Fairness in Vision-Language Models for Medical Image Disease Classification

[18] Cross-modal ultra-scale learning with tri-modalities of renal biopsy images for glomerular multi-disease auxiliary diagnosis

[19] Null-LoRA: Low-Rank Adaptation on Null Space

[20] Automated Motion Artifact Check for MRI (AutoMAC-MRI): An Interpretable Framework for Motion Artifact Detection and Severity Assessment

🧩 TL;DR