cs.CV [Total: 32]
cs.CL [Total: 18]
cs.AI [Total: 1]

cs.CV

[1] Robust Driving QA through Metadata-Grounded Context and Task-Specific Prompts

Seungjun Yu, Junsung Park, Youngsun Lim, Hyunjung Shim

🧩 TL;DR

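This paper presents a two-phase vision-language QA system for autonomous driving that uses carefully designed prompting strategies and context augmentation to improve answer accuracy on high-level perception, prediction, and planning questions. The system reaches 67.37% overall accuracy on a nuScenes-based driving QA benchmark and maintains 96% accuracy under severe visual corruption.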

📘 Detailed Summary

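Motivation: The work addresses high-level vision-language QA in autonomous driving, in particular how to help pretrained multimodal LLMs better handle perception, prediction, and planning questions about driving scenes. Existing approaches are limited in reasoning over complex driving scenarios and in contextual understanding, which calls for more effective prompt engineering and context augmentation.

Method: The system has two phases. Phase-1 conditions Qwen2.5-VL-32B on six-camera inputs, a short temporal window of history, and a chain-of-thought prompt. Phase-2 augments the prompt with scene metadata (object annotations, ego-vehicle state, etc.) and task-specific instructions. The key techniques are a self-consistency ensemble (multiple sampled reasoning chains) and category-specific question instructions, with separate prompts for perception, prediction, and planning.

Result: On the driving QA benchmark the system clearly outperforms the baseline models: Phase-1 with 5 history frames and 10-shot prompting reaches 65.1% accuracy (62.61% zero-shot), and the self-consistency ensemble raises this to 66.85%; Phase-2 reaches 67.37% overall. Accuracy stays at 96% under severe visual corruption, indicating robustness.

Conclusion: Carefully engineered prompts and contextual grounding can substantially improve pretrained vision-language models on high-level driving QA. The approach offers an effective framework for situational understanding and decision reasoning in autonomous driving and illustrates the practical value of multimodal LLMs in complex real-world scenes.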

📄 Abstract

We present a two-phase vision-language QA system for autonomous driving that answers high-level perception, prediction, and planning questions. In Phase-1, a large multimodal LLM (Qwen2.5-VL-32B) is conditioned on six-camera inputs, a short temporal window of history, and a chain-of-thought prompt with few-shot exemplars. A self-consistency ensemble (multiple sampled reasoning chains) further improves answer reliability. In Phase-2, we augment the prompt with nuScenes scene metadata (object annotations, ego-vehicle state, etc.) and category-specific question instructions (separate prompts for perception, prediction, planning tasks). In experiments on a driving QA benchmark, our approach significantly outperforms the baseline Qwen2.5 models. For example, using 5 history frames and 10-shot prompting in Phase-1 yields 65.1% overall accuracy (vs. 62.61% with zero-shot); applying self-consistency raises this to 66.85%. Phase-2 achieves 67.37% overall. Notably, the system maintains 96% accuracy under severe visual corruption. These results demonstrate that carefully engineered prompts and contextual grounding can greatly enhance high-level driving QA with pretrained vision-language models.
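
The self-consistency ensemble mentioned in the abstract can be illustrated with a minimal majority-vote sketch. It assumes each sampled reasoning chain has already been reduced to a final answer label; the function name, the normalization step, and the tie-breaking rule are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter


def self_consistency_vote(sampled_answers):
    """Return the most frequent final answer among sampled reasoning chains.

    Assumption: answers are short labels (e.g. multiple-choice letters).
    Ties are broken in favor of the answer that was sampled first.
    """
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    # Normalize whitespace and case so that "b " and "B" count as the same vote.
    normalized = [answer.strip().upper() for answer in sampled_answers]
    counts = Counter(normalized)
    best = max(counts.values())
    for answer in normalized:
        if counts[answer] == best:
            return answer


# Hypothetical final answers extracted from five sampled chains.
samples = ["B", "b ", "A", "B", "C"]
print(self_consistency_vote(samples))  # prints "B"
```

Voting on the final label rather than on the full reasoning text keeps the aggregation independent of how each chain is worded.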

[2] PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions

Amith Ananthram, Elias Stengel-Eskin, Lorena A. Bradford, Julia Demarest, Adam Purvis, Keith Krut, Robert Stein, Rina Elster Pantalony, Mohit Bansal, Kathleen McKeown

🧩 TL;DR

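This paper proposes PoSh, a metric for detailed image description that uses scene graphs as structured rubrics to guide LLMs-as-a-Judge, and introduces the DOCENT benchmark. On artwork description, PoSh correlates better with human judgments than existing metrics.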

📘 Detailed Summary

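Motivation: Vision-language models have advanced in detailed image description, but evaluation remains difficult. Standard metrics such as CIDEr and SPICE were designed for short texts; they cannot reliably assess attribute and relation errors in long texts and cannot localize errors to particular text spans.

Method: PoSh uses scene graphs as structured rubrics to guide an LLM-as-a-Judge, producing aggregate scores grounded in fine-grained error analysis. The accompanying DOCENT dataset contains artwork, expert-written reference descriptions, and granular and coarse quality judgments from art history students.

Result: On DOCENT, PoSh correlates more strongly with human judgments than the best open-weight alternatives (+0.05 Spearman ρ), is robust to image type, and works as a reward function that outperforms standard supervised fine-tuning. Used as an evaluator, it shows that foundation models struggle to describe images with rich scene dynamics fully and without errors.

Conclusion: PoSh and DOCENT provide new tools for evaluating detailed image description, expose the limits of foundation models on complex scene dynamics, support progress in areas such as assistive text generation, and establish a demanding new task for gauging VLM progress.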

📄 Abstract

While vision-language models (VLMs) have advanced into detailed image description, evaluation remains a challenge. Standard metrics (e.g. CIDEr, SPICE) were designed for short texts and tuned to recognize errors that are now uncommon, such as object misidentification. In contrast, long texts require sensitivity to attribute and relation attachments and scores that localize errors to particular text spans. In this work, we introduce PoSh, a metric for detailed image description that uses scene graphs as structured rubrics to guide LLMs-as-a-Judge, producing aggregate scores grounded in fine-grained errors (e.g. mistakes in compositional understanding). PoSh is replicable, interpretable and a better proxy for human raters than existing metrics (including GPT4o-as-a-Judge). To validate PoSh, we introduce a challenging new dataset, DOCENT. This novel benchmark contains artwork, paired with expert-written references, and model-generated descriptions, augmented with granular and coarse judgments of their quality from art history students. Thus, DOCENT enables evaluating both detailed image description metrics and detailed image description itself in a challenging new domain. We show that PoSh achieves stronger correlations (+0.05 Spearman $\rho$) with the human judgments in DOCENT than the best open-weight alternatives, is robust to image type (using CapArena, an existing dataset of web imagery) and is a capable reward function, outperforming standard supervised fine-tuning. Then, using PoSh, we characterize the performance of open and closed models in describing the paintings, sketches and statues in DOCENT and find that foundation models struggle to achieve full, error-free coverage of images with rich scene dynamics, establishing a demanding new task to gauge VLM progress. Through both PoSh and DOCENT, we hope to enable advances in important areas such as assistive text generation.
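
The abstract reports agreement with human raters as Spearman's ρ. As a reminder of what that statistic measures, here is a small dependency-free sketch that ranks both score lists (averaging tied ranks) and computes the Pearson correlation of the ranks. The example scores are made up for illustration and are not from DOCENT.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long lists with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x = sum(rx) / len(rx)
    mean_y = sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical metric scores and human ratings for five descriptions.
metric_scores = [0.91, 0.40, 0.75, 0.62, 0.30]
human_ratings = [5, 2, 4, 4, 1]
print(round(spearman_rho(metric_scores, human_ratings), 3))
```

Because only ranks matter, a metric can score well on ρ without being calibrated to the human rating scale.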

[3] UniHPR: Unified Human Pose Representation via Singular Value Contrastive Learning

Zhongyu Jiang, Wenhao Chai, Lei Li, Zhuoran Zhou, Cheng-Yen Yang, Jenq-Neng Hwang

🧩 TL;DR

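This paper proposes UniHPR, a unified human pose representation learning framework that aligns image, 2D, and 3D human pose embeddings with a novel singular value-based contrastive loss, achieving strong results on human pose estimation.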

📘 Detailed Summary

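Motivation: Human pose representations are a key component of human-centric applications and can be extracted from images, 2D keypoints, 3D skeletons, and other modalities. Yet the correlation among these representations has rarely been studied systematically, in particular with a contrastive paradigm that aligns all of them in one space.

Method: UniHPR is a unified pose representation learning pipeline that uses a singular value-based contrastive loss to align image, 2D, and 3D human pose embeddings simultaneously. The loss aligns the modalities better and improves performance; evaluation uses a simple 3D human pose decoder.

Result: UniHPR reaches an MPJPE of 49.9 mm on Human3.6M and a PA-MPJPE of 51.6 mm on 3DPW under cross-domain evaluation. It also supports 2D and 3D pose retrieval on Human3.6M with a retrieval error of 9.24 mm MPJPE.

Conclusion: UniHPR aligns multi-modal human pose representations in a unified space and shows that a singular value-based contrastive loss is effective for multi-modal alignment, providing a unified representation for pose estimation and related tasks.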

📄 Abstract

In recent years, there has been a growing interest in developing effective alignment pipelines to generate unified representations from different modalities for multi-modal fusion and generation. As an important component of Human-Centric applications, Human Pose representations are critical in many downstream tasks, such as Human Pose Estimation, Action Recognition, Human-Computer Interaction, Object tracking, etc. Human Pose representations or embeddings can be extracted from images, 2D keypoints, 3D skeletons, mesh models, and many other modalities. Yet, there are limited instances where the correlation among all of those representations has been clearly researched using a contrastive paradigm. In this paper, we propose UniHPR, a unified Human Pose Representation learning pipeline, which aligns Human Pose embeddings from images, 2D and 3D human poses. To align more than two data representations at the same time, we propose a novel singular value-based contrastive learning loss, which better aligns different modalities and further boosts performance. To evaluate the effectiveness of the aligned representation, we choose 2D and 3D Human Pose Estimation (HPE) as our evaluation tasks. In our evaluation, with a simple 3D human pose decoder, UniHPR achieves remarkable performance metrics: MPJPE 49.9mm on the Human3.6M dataset and PA-MPJPE 51.6mm on the 3DPW dataset with cross-domain evaluation. Meanwhile, we are able to achieve 2D and 3D pose retrieval with our unified human pose representations on the Human3.6M dataset, where the retrieval error is 9.24mm in MPJPE.
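
The headline numbers above are reported in MPJPE (mean per-joint position error). A minimal sketch of the metric follows; the joint coordinates are invented for illustration, and PA-MPJPE, which first rigidly aligns the prediction to the target with a Procrustes step, is not shown.

```python
import math


def mpjpe(predicted, target):
    """Mean per-joint position error.

    Both arguments are sequences of (x, y, z) joint positions in the same
    unit (e.g. millimetres). The result is the average Euclidean distance
    between corresponding joints.
    """
    if len(predicted) != len(target) or not predicted:
        raise ValueError("need two non-empty, equally long joint lists")
    total = sum(math.dist(p, t) for p, t in zip(predicted, target))
    return total / len(predicted)


# Two-joint toy skeleton: the first joint is exact, the second is 50 mm off.
prediction = [(0.0, 0.0, 0.0), (30.0, 40.0, 0.0)]
ground_truth = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
print(mpjpe(prediction, ground_truth))  # prints 25.0
```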

[4] X-Ego: Acquiring Team-Level Tactical Situational Awareness via Cross-Egocentric Contrastive Video Representation Learning

Yunzhe Wang, Soham Hans, Volkan Ustun

🧩 TL;DR

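This paper introduces the X-Ego-CS benchmark dataset and Cross-Ego Contrastive Learning (CECL) to address the modeling of synchronized egocentric views in multi-agent decision-making; CECL aligns teammates' first-person visual streams to strengthen team-level tactical situational awareness.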

📘 Detailed Summary

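Motivation: Existing video understanding work relies mostly on third-person broadcast views and overlooks the synchronous, egocentric nature of multi-agent learning, so it cannot adequately model how individual perspectives interact with the anticipation of teammates' intentions in human team tactics.

Method: X-Ego-CS comprises 124 hours of gameplay footage from 45 professional-level Counter-Strike 2 matches, with synchronized first-person video streams and state-action trajectories. CECL aligns teammates' egocentric visual streams to foster team-level tactical situational awareness.

Result: CECL is evaluated on a teammate-opponent location prediction task, where it improves an agent's ability to infer teammate and opponent positions from a single first-person view using state-of-the-art video encoders.

Conclusion: X-Ego-CS and CECL lay a foundation for cross-egocentric multi-agent benchmarking in esports and position gameplay understanding as a testbed for multi-agent modeling and tactical learning, with implications for spatiotemporal reasoning and human-AI teaming in virtual and real-world domains.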

📄 Abstract

Human team tactics emerge from each player's individual perspective and their ability to anticipate, interpret, and adapt to teammates' intentions. While advances in video understanding have improved the modeling of team interactions in sports, most existing work relies on third-person broadcast views and overlooks the synchronous, egocentric nature of multi-agent learning. We introduce X-Ego-CS, a benchmark dataset consisting of 124 hours of gameplay footage from 45 professional-level matches of the popular e-sports game Counter-Strike 2, designed to facilitate research on multi-agent decision-making in complex 3D environments. X-Ego-CS provides cross-egocentric video streams that synchronously capture all players' first-person perspectives along with state-action trajectories. Building on this resource, we propose Cross-Ego Contrastive Learning (CECL), which aligns teammates' egocentric visual streams to foster team-level tactical situational awareness from an individual's perspective. We evaluate CECL on a teammate-opponent location prediction task, demonstrating its effectiveness in enhancing an agent's ability to infer both teammate and opponent positions from a single first-person view using state-of-the-art video encoders. Together, X-Ego-CS and CECL establish a foundation for cross-egocentric multi-agent benchmarking in esports. More broadly, our work positions gameplay understanding as a testbed for multi-agent modeling and tactical learning, with implications for spatiotemporal reasoning and human-AI teaming in both virtual and real-world domains. Code and dataset are available at https://github.com/HATS-ICT/x-ego.
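
The abstract does not spell out the CECL objective. As a rough illustration of what aligning teammates' egocentric streams could look like, the sketch below uses a generic InfoNCE-style contrastive loss, which is an assumption here: embeddings of two teammates' views of the same moment form a positive pair, and views from other moments in the batch act as negatives. All names and the temperature value are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE loss over a batch of paired embeddings.

    anchors[i] and positives[i] are assumed to come from two teammates at the
    same time step; positives[j] for j != i serve as in-batch negatives.
    """
    n = len(anchors)
    loss = 0.0
    for i in range(n):
        logits = [cosine(anchors[i], positives[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss += log_denominator - logits[i]
    return loss / n


# Toy 2-D embeddings: matched pairs give a low loss, swapped pairs a high one.
player_a = [[1.0, 0.0], [0.0, 1.0]]
player_b = [[1.0, 0.0], [0.0, 1.0]]
print(info_nce(player_a, player_b) < info_nce(player_a, player_b[::-1]))  # prints True
```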

Fengyuan Sun, Hui Chen, Xinhao Xu, Dandan Zheng, Jingdong Chen, Jun Zhou, Jungong Han, Guiguang Ding

🧩 TL;DR

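This paper proposes PruneHal, which uses adaptive KV cache pruning to sharpen a multi-modal LLM's focus on critical visual information, mitigating hallucinations without additional training and at almost no extra inference cost.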

📘 Detailed Summary

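Motivation: Hallucination remains a major challenge for multi-modal large language models. Existing remedies either train further on additional data or incorporate external or internal information at inference time, and both add computational cost. The authors observe that hallucinations are strongly associated with insufficient attention to visual tokens, with redundant visual tokens dispersing the model's attention.

Method: PruneHal applies adaptive KV cache pruning to strengthen the model's focus on critical visual information. It is training-free and model-agnostic, and it integrates with different decoding strategies, including those designed specifically for hallucination mitigation.

Result: Evaluated with four mainstream MLLMs on several widely used hallucination benchmarks, PruneHal delivers robust, strong results.

Conclusion: The work is, to the authors' knowledge, the first to apply token pruning to hallucination mitigation in MLLMs, and it offers a simple, effective, training-free approach.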

📄 Abstract

While multi-modal large language models (MLLMs) have made significant progress in recent years, the issue of hallucinations remains a major challenge. To mitigate this phenomenon, existing solutions either introduce additional data for further training or incorporate external or internal information during inference. However, these approaches inevitably introduce extra computational costs. In this paper, we observe that hallucinations in MLLMs are strongly associated with insufficient attention allocated to visual tokens. In particular, the presence of redundant visual tokens disperses the model's attention, preventing it from focusing on the most informative ones. As a result, critical visual cues are often under-attended, which in turn exacerbates the occurrence of hallucinations. Building on this observation, we propose PruneHal, a training-free, simple yet effective method that leverages adaptive KV cache pruning to enhance the model's focus on critical visual information, thereby mitigating hallucinations. To the best of our knowledge, we are the first to apply token pruning for hallucination mitigation in MLLMs. Notably, our method does not require additional training and incurs nearly no extra inference cost. Moreover, PruneHal is model-agnostic and can be seamlessly integrated with different decoding strategies, including those specifically designed for hallucination mitigation. We evaluate PruneHal on several widely used hallucination evaluation benchmarks using four mainstream MLLMs, achieving robust and outstanding results that highlight the effectiveness and superiority of our method. Our code will be publicly available.
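
To make the intuition concrete, here is a toy sketch of attention-guided pruning: visual tokens that receive little attention are dropped, so the remaining attention mass concentrates on the informative ones. The fixed `keep_ratio` and the helper names are assumptions for illustration; the abstract describes the pruning as adaptive but does not give the selection rule, and a real implementation would act on per-layer key/value tensors rather than on a flat score list.

```python
import math


def prune_visual_tokens(attention, keep_ratio=0.5):
    """Return the (sorted) indices of visual tokens to keep in the KV cache.

    attention holds one non-negative score per visual token. Tokens with the
    highest scores are kept; keep_ratio is a hypothetical fixed budget.
    """
    if not attention:
        raise ValueError("attention must not be empty")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, math.ceil(len(attention) * keep_ratio))
    ranked = sorted(range(len(attention)), key=lambda i: attention[i], reverse=True)
    return sorted(ranked[:n_keep])


def renormalize(attention, kept):
    """Attention over the kept tokens after the pruned ones are removed."""
    total = sum(attention[i] for i in kept)
    return {i: attention[i] / total for i in kept}


attention = [0.05, 0.40, 0.10, 0.45]
kept = prune_visual_tokens(attention, keep_ratio=0.5)
print(kept)  # prints [1, 3]
print(renormalize(attention, kept))
```

After pruning, each surviving token holds a larger share of the attention than before, which is the effect the abstract associates with fewer hallucinations.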

[6] FootFormer: Estimating Stability from Visual Input

Keaton Kraiger, Jingjing Li, Skanda Bharadwaj, Jesse Scott, Robert T. Collins, Yanxi Liu

🧩 TL;DR

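FootFormer is a cross-modality approach that jointly predicts human motion dynamics directly from visual input, achieving statistically significantly better or equivalent estimates of foot pressure distributions, foot contact maps, and center of mass across multiple datasets.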

📘 Detailed Summary

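Motivation: Existing methods typically produce only one or two of the relevant motion-dynamics measures, which limits their predictive scope and prevents them from capturing all the elements relevant to motion stability.

Method: FootFormer uses a cross-modality architecture to jointly predict several motion-dynamics measures directly from visual input: foot pressure distributions, foot contact maps, and center-of-mass position.

Result: On multiple datasets, FootFormer achieves statistically significantly better or equivalent estimates of foot pressure distributions, foot contact maps, and center of mass, and reaches state-of-the-art performance in estimating the stability-predictive components used in classic kinesiology metrics: center of pressure (CoP), center of mass (CoM), and base of support (BoS).

Conclusion: The study shows that jointly predicting multiple motion-dynamics measures directly from visual input is feasible, opening a route to vision-based stability analysis with potential applications in rehabilitation medicine and sports science.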

📄 Abstract

We propose FootFormer, a cross-modality approach for jointly predicting human motion dynamics directly from visual input. On multiple datasets, FootFormer achieves statistically significantly better or equivalent estimates of foot pressure distributions, foot contact maps, and center of mass (CoM), as compared with existing methods that generate one or two of those measures. Furthermore, FootFormer achieves SOTA performance in estimating stability-predictive components (CoP, CoM, BoS) used in classic kinesiology metrics. Code and data are available at https://github.com/keatonkraiger/Vision-to-Stability.git.
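
One of the stability components named above, the center of pressure (CoP), is conventionally the pressure-weighted centroid of a foot pressure map. The sketch below computes it on a small grid; the grid layout and the choice to return cell indices rather than physical coordinates are assumptions for illustration and are unrelated to FootFormer's actual output format.

```python
def center_of_pressure(pressure):
    """Pressure-weighted centroid of a 2D pressure map.

    pressure is a list of rows of non-negative readings. Returns a
    (row, column) position in cell units, or None when there is no contact.
    """
    total = 0.0
    row_moment = 0.0
    col_moment = 0.0
    for r, row in enumerate(pressure):
        for c, value in enumerate(row):
            total += value
            row_moment += r * value
            col_moment += c * value
    if total == 0:
        return None  # no load on the sensor grid
    return (row_moment / total, col_moment / total)


# Toy 3x3 map with the load shifted toward the lower-right corner.
pressure_map = [
    [0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0],
    [0.0, 1.0, 1.0],
]
print(center_of_pressure(pressure_map))  # prints (1.5, 1.5)
```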

[7] Rethinking Driving World Model as Synthetic Data Generator for Perception Tasks

Kai Zeng, Zhanqian Wu, Kaixin Xiong, Xiaobao Wei, Xiangyu Guo, Zhenxin Zhu, Kalok Ho, Lijun Zhou, Bohan Zeng, Ming Lu, Haiyang Sun, Bing Wang, Guang Chen, Hangjun Ye, Wentao Zhang

🧩 TL;DR

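Dream4Drive is a synthetic data generation framework that uses 3D-aware video editing to improve downstream perception tasks in autonomous driving. It can generate multi-view corner cases at scale and significantly boosts corner-case perception.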

📘 Detailed Summary

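Motivation: Existing driving world models focus on generation quality and controllability metrics and overlook evaluation on downstream perception tasks, which are crucial for autonomous driving performance. The usual strategy of pretraining on synthetic data and fine-tuning on real data takes twice the epochs of the real-data-only baseline; when the baseline is also given double the epochs, the benefit of synthetic data becomes negligible.

Method: Dream4Drive first decomposes the input video into several 3D-aware guidance maps, then renders 3D assets onto these maps, and finally fine-tunes the driving world model to produce edited, multi-view photorealistic videos. The authors also contribute DriveObj3D, a large-scale 3D asset dataset covering typical driving-scenario categories and supporting diverse 3D-aware video editing.

Result: Comprehensive experiments show that Dream4Drive improves downstream perception models under various training epochs and offers unprecedented flexibility in generating multi-view corner cases, significantly boosting corner-case perception.

Conclusion: The study demonstrates the practical value of synthetic data for driving perception, addresses the limitations of prior training protocols with a 3D-aware video editing framework, and provides a large-scale 3D asset dataset for future research.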

📄 Abstract

Recent advancements in driving world models enable controllable generation of high-quality RGB videos or multimodal videos. Existing methods primarily focus on metrics related to generation quality and controllability. However, they often overlook the evaluation of downstream perception tasks, which are really crucial for the performance of autonomous driving. Existing methods usually leverage a training strategy that first pretrains on synthetic data and finetunes on real data, resulting in twice the epochs compared to the baseline (real data only). When we double the epochs in the baseline, the benefit of synthetic data becomes negligible. To thoroughly demonstrate the benefit of synthetic data, we introduce Dream4Drive, a novel synthetic data generation framework designed for enhancing the downstream perception tasks. Dream4Drive first decomposes the input video into several 3D-aware guidance maps and subsequently renders the 3D assets onto these guidance maps. Finally, the driving world model is fine-tuned to produce the edited, multi-view photorealistic videos, which can be used to train the downstream perception models. Dream4Drive enables unprecedented flexibility in generating multi-view corner cases at scale, significantly boosting corner case perception in autonomous driving. To facilitate future research, we also contribute a large-scale 3D asset dataset named DriveObj3D, covering the typical categories in driving scenarios and enabling diverse 3D-aware video editing. We conduct comprehensive experiments to show that Dream4Drive can effectively boost the performance of downstream perception models under various training epochs. Project: https://wm-research.github.io/Dream4Drive/

[8] SFGFusion: Surface Fitting Guided 3D Object Detection with 4D Radar and Camera Fusion

Xiaozhi Li, Huijun Di, Jian Li, Feng Liu, Wei Liang

🧩 TL;DR

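This work proposes SFGFusion, a camera-4D imaging radar detection network guided by surface fitting. By estimating quadratic surface parameters of objects, it enhances spatial representation and cross-modal interaction, addressing the sparsity of 4D radar point clouds and the difficulty of multi-modal fusion.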

📘 Detailed Summary

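Motivation: 4D imaging radar offers low cost, long-range detection, and accurate velocity measurement, but its sparse point clouds and low resolution limit the geometric representation of objects and hinder fusion with other sensors such as cameras.

Method: SFGFusion estimates quadratic surface parameters of objects from image and radar data, building an explicit surface fitting model that enhances spatial representation and cross-modal interaction and yields fine-grained dense depth predictions. The predicted depth guides the transformation of image features from perspective view to bird's-eye view (BEV) and is used to generate a dense pseudo-point cloud that mitigates radar sparsity. Point cloud features are processed with a pillar-based method, and a standard 2D backbone and detection head perform detection in BEV space.

Result: Experiments show that SFGFusion effectively fuses camera and 4D radar features and achieves superior performance on the TJ4DRadSet and View-of-Delft (VoD) object detection benchmarks.

Conclusion: A surface-fitting-based cross-modal fusion strategy can markedly improve 4D radar and camera fusion, offering a new approach to geometric representation and multi-modal interaction for sparse sensor data in 3D object detection for autonomous driving.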

📄 Abstract

3D object detection is essential for autonomous driving. As an emerging sensor, 4D imaging radar offers advantages such as low cost, long-range detection, and accurate velocity measurement, making it highly suitable for object detection. However, its sparse point clouds and low resolution limit object geometric representation and hinder multi-modal fusion. In this study, we introduce SFGFusion, a novel camera-4D imaging radar detection network guided by surface fitting. By estimating quadratic surface parameters of objects from image and radar data, the explicit surface fitting model enhances spatial representation and cross-modal interaction, enabling more reliable prediction of fine-grained dense depth. The predicted depth serves two purposes: 1) in an image branch to guide the transformation of image features from perspective view (PV) to a unified bird's-eye view (BEV) for multi-modal fusion, improving spatial mapping accuracy; and 2) in a surface pseudo-point branch to generate a dense pseudo-point cloud, mitigating the radar point sparsity. The original radar point cloud is also encoded in a separate radar branch. These two point cloud branches adopt a pillar-based method and subsequently transform the features into the BEV space. Finally, a standard 2D backbone and detection head are used to predict object labels and bounding boxes from BEV features. Experimental results show that SFGFusion effectively fuses camera and 4D radar features, achieving superior performance on the TJ4DRadSet and View-of-Delft (VoD) object detection benchmarks.
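
To show what estimating quadratic surface parameters can mean in the simplest case, the sketch below fits z = a·x² + b·y² + c·x·y + d·x + e·y + f to a set of 3D points by ordinary least squares. This parameterization and the closed-form solve are assumptions chosen for illustration; in SFGFusion the parameters are predicted by a network from image and radar data, which is not reproduced here.

```python
def solve_linear(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: points do not determine the surface")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_quadratic_surface(points):
    """Least-squares fit of z = a*x^2 + b*y^2 + c*x*y + d*x + e*y + f.

    points is a list of (x, y, z) tuples; returns [a, b, c, d, e, f].
    """
    rows = [[x * x, y * y, x * y, x, y, 1.0] for x, y, _ in points]
    zs = [z for _, _, z in points]
    # Normal equations: (A^T A) w = A^T z.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(6)] for i in range(6)]
    atz = [sum(r[i] * z for r, z in zip(rows, zs)) for i in range(6)]
    return solve_linear(ata, atz)


# Sparse samples from a known surface z = 0.5*x^2 + 0.25*y^2 + 2.
grid = [-1.0, 0.0, 1.0]
samples = [(x, y, 0.5 * x * x + 0.25 * y * y + 2.0) for x in grid for y in grid]
print([round(w, 6) for w in fit_quadratic_surface(samples)])
```

Once the six coefficients are known, depth can be evaluated at any (x, y) on the object, which is how a fitted surface can densify sparse measurements.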

[9] MobiAct: Efficient MAV Action Recognition Using MobileNetV4 with Contrastive Learning and Knowledge Distillation

Zhang Nengbo, Ho Hann Woei

🧩 TL;DR

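This paper proposes MobiAct, a lightweight MAV action recognition framework that combines knowledge distillation with a parameter-free attention mechanism to achieve low energy consumption and fast inference while maintaining high accuracy on resource-limited platforms.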

📘 Detailed Summary

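Motivation: Most existing Micro Air Vehicle (MAV) action recognition methods rely on large, computationally intensive models that cannot be deployed effectively on resource-limited MAV platforms, forcing a trade-off between recognition accuracy and inference speed. A lightweight solution with high accuracy and low computational cost is needed.

Method: MobiAct adopts MobileNetV4 as the backbone and introduces a Stage-wise Orthogonal Knowledge Distillation (SOKD) strategy to transfer MAV motion features from a ResNet18 teacher to the student network. A parameter-free attention mechanism improves accuracy without adding model complexity, and a hybrid loss training strategy keeps optimization stable and robust.

Result: Across three self-collected datasets, MobiAct achieves an average recognition accuracy of 92.12% while consuming only 136.16 pJ of energy and processing 8.84 actions per second. It decodes actions up to 2 times faster than the leading method with highly comparable accuracy.

Conclusion: A lightweight architecture combined with knowledge distillation and attention can deliver efficient MAV action recognition on resource-limited platforms, offering a practical route to real-time perception and coordination in autonomous aerial swarms.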

📄 Abstract

Accurate and efficient recognition of Micro Air Vehicle (MAV) motion is essential for enabling real-time perception and coordination in autonomous aerial swarms. However, most existing approaches rely on large, computationally intensive models that are unsuitable for resource-limited MAV platforms, which results in a trade-off between recognition accuracy and inference speed. To address these challenges, this paper proposes a lightweight MAV action recognition framework, MobiAct, designed to achieve high accuracy with low computational cost. Specifically, MobiAct adopts MobileNetV4 as the backbone network and introduces a Stage-wise Orthogonal Knowledge Distillation (SOKD) strategy to effectively transfer MAV motion features from a teacher network (ResNet18) to a student network, thereby enhancing knowledge transfer efficiency. Furthermore, a parameter-free attention mechanism is integrated into the architecture to improve recognition accuracy without increasing model complexity. In addition, a hybrid loss training strategy is developed to combine multiple loss objectives, which ensures stable and robust optimization during training. Experimental results demonstrate that the proposed MobiAct achieves low-energy and low-computation MAV action recognition, while maintaining the fastest action decoding speed among compared methods. Across all three self-collected datasets, MobiAct achieves an average recognition accuracy of 92.12%, while consuming only 136.16 pJ of energy and processing recognition at a rate of 8.84 actions per second. Notably, MobiAct decodes actions up to 2 times faster than the leading method, with highly comparable recognition accuracy, highlighting its superior efficiency in MAV action recognition.

[10] CARES: Context-Aware Resolution Selector for VLMs

Moshe Kimhi, Nimrod Shabtay, Raja Giryes, Chaim Baskin, Eli Schwartz

🧩 TL;DR

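This paper proposes CARES (Context-Aware Resolution Selector), a lightweight preprocessing module that predicts the minimal sufficient input resolution for an image-query pair, reducing compute by up to 80% while preserving task performance. It uses a compact vision-language model to extract features and predict the resolution at which the target pretrained VLM's response converges to its peak ability.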

📘 Detailed Summary

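Motivation: Large vision-language models usually process images at native or high resolution to remain effective across tasks. Visual tokens then often account for 97-99% of all tokens, causing high compute and latency even when a low-resolution image would suffice. Existing methods lack a mechanism for choosing resolution intelligently to balance efficiency and task performance.

Method: CARES uses a compact VLM (350M parameters) to extract features from the image-query pair and predict the minimal sufficient resolution at which the target pretrained VLM's response converges to its peak ability to answer correctly. Although trained as a discrete classifier over a set of optional resolutions, it interpolates continuous resolutions at inference for fine-grained control.

Result: Across five multimodal benchmarks spanning documents and natural images, and across diverse target VLMs, CARES preserves task performance while reducing compute by up to 80%.

Conclusion: Context-aware resolution selection can substantially reduce the compute of vision-language models, giving a practical option for deployment: a lightweight preprocessing module that adjusts input resolution dynamically to balance performance and efficiency.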

📄 Abstract

Large vision-language models (VLMs) commonly process images at native or high resolution to remain effective across tasks. This inflates visual tokens often to 97-99% of total tokens, resulting in high compute and latency, even when low-resolution images would suffice. We introduce CARES, a Context-Aware Resolution Selector, a lightweight preprocessing module that, given an image-query pair, predicts the minimal sufficient input resolution. CARES uses a compact VLM (350M) to extract features and predict when a target pretrained VLM's response converges to its peak ability to answer correctly. Though trained as a discrete classifier over a set of optional resolutions, CARES interpolates continuous resolutions at inference for fine-grained control. Across five multimodal benchmarks spanning documents and natural images, as well as diverse target VLMs, CARES preserves task performance while reducing compute by up to 80%.
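
The abstract says that a classifier trained over discrete resolutions is used to produce continuous resolutions at inference, without giving the rule. One plausible reading, shown here purely as an assumption, is to take the probability-weighted average of the candidate resolutions. The candidate list and logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def expected_resolution(logits, resolutions):
    """Probability-weighted average of candidate resolutions.

    Assumption: the classifier emits one logit per candidate resolution, and
    the continuous output is the expectation under its softmax distribution.
    """
    if len(logits) != len(resolutions) or not logits:
        raise ValueError("need one logit per candidate resolution")
    probs = softmax(logits)
    return sum(p * r for p, r in zip(probs, resolutions))


candidates = [224, 448, 896]  # hypothetical candidate side lengths in pixels
print(expected_resolution([2.0, 1.0, -1.0], candidates))
```

A confident classifier yields a value close to one candidate, while an uncertain one lands between candidates, which gives the fine-grained control the abstract refers to.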

[11] D2D: Detector-to-Differentiable Critic for Improved Numeracy in Text-to-Image Generation

Nobline Yoo, Olga Russakovsky, Ye Zhu

🧩 TL;DR

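This paper proposes the D2D framework, which turns non-differentiable detection models into differentiable critics and uses their superior counting ability to guide text-to-image diffusion models toward generating the correct number of objects, substantially improving object counting accuracy.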

📘 Detailed Summary

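Motivation: Text-to-image diffusion models perform well on semantic alignment but still struggle to generate the number of objects specified in the prompt. Existing methods use differentiable regression models as external critics, which excludes detector-based models with better counting ability because counting by enumeration is inherently non-differentiable.

Method: The Detector-to-Differentiable framework designs custom activation functions that convert detector logits into soft binary indicators. These are used to optimize the noise prior at inference time with pretrained text-to-image models, turning a non-differentiable detector into a differentiable critic.

Result: Extensive experiments on SDXL-Turbo, SD-Turbo, and Pixart-DMD show consistent, substantial gains in object counting accuracy across four benchmarks of varying complexity, for example up to 13.7% on D2D-Small, with minimal degradation in image quality and minimal computational overhead.

Conclusion: The study shows that non-differentiable detectors can be turned into differentiable critics, offering a new way to improve the numeracy of text-to-image models while preserving generation quality and efficiency.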

📄 Abstract

Text-to-image (T2I) diffusion models have achieved strong performance in semantic alignment, yet they still struggle with generating the correct number of objects specified in prompts. Existing approaches typically incorporate auxiliary counting networks as external critics to enhance numeracy. However, since these critics must provide gradient guidance during generation, they are restricted to regression-based models that are inherently differentiable, thus excluding detector-based models with superior counting ability, whose count-via-enumeration nature is non-differentiable. To overcome this limitation, we propose Detector-to-Differentiable (D2D), a novel framework that transforms non-differentiable detection models into differentiable critics, thereby leveraging their superior counting ability to guide numeracy generation. Specifically, we design custom activation functions to convert detector logits into soft binary indicators, which are then used to optimize the noise prior at inference time with pre-trained T2I models. Our extensive experiments on SDXL-Turbo, SD-Turbo, and Pixart-DMD across four benchmarks of varying complexity (low-density, high-density, and multi-object scenarios) demonstrate consistent and substantial improvements in object counting accuracy (e.g., boosting up to 13.7% on D2D-Small, a 400-prompt, low-density benchmark), with minimal degradation in overall image quality and computational overhead.
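
The idea of converting detector logits into soft binary indicators can be sketched with a steep sigmoid, whose sum gives a smooth object count that a gradient-based optimizer could push toward a target. The sigmoid form, the `sharpness` and `threshold` parameters, and the squared-error loss are assumptions for illustration; the paper's custom activation functions are not specified in the abstract.

```python
import math


def soft_indicator(logit, threshold=0.0, sharpness=10.0):
    """Smooth stand-in for the hard test `logit > threshold`."""
    z = sharpness * (logit - threshold)
    # Branch on the sign of z so that exp never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_count(logits, threshold=0.0, sharpness=10.0):
    """Differentiable surrogate for the number of confident detections."""
    return sum(soft_indicator(l, threshold, sharpness) for l in logits)


def count_loss(logits, target_count):
    """Squared error between the soft count and the requested count."""
    return (soft_count(logits) - target_count) ** 2


# Hypothetical detector logits: two confident boxes and two rejected ones.
detector_logits = [3.0, 2.5, -4.0, -3.0]
print(round(soft_count(detector_logits), 4))     # prints 2.0
print(round(count_loss(detector_logits, 3), 4))  # prints 1.0
```

Unlike a hard threshold, the soft count changes smoothly as a logit crosses the threshold, so its gradient with respect to the logits is non-zero near the decision boundary.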

[12] A Matter of Time: Revealing the Structure of Time in Vision-Language Models

Nidham Tekaya, Manuela Waldner, Matthias Zeppelzauer

🧩 TL;DR

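This paper studies the temporal awareness of large-scale vision-language models, finds that temporal information is structured along a low-dimensional, non-linear manifold in the embedding space, and proposes methods to derive an explicit timeline representation from that space. These methods achieve accuracy competitive with or superior to a prompt-based baseline on temporal reasoning tasks.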

📘 Detailed Summary

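Motivation: The study examines the temporal awareness of large-scale vision-language models such as CLIP, assessing their ability to position visual content in time. Although VLMs acquire open-vocabulary capabilities from large-scale training data with diverse textual metadata, their understanding of temporal information has not been studied systematically.

Method: The authors introduce TIME10k, a benchmark dataset of over 10,000 images with temporal ground truth, and evaluate the time-awareness of 37 VLMs with a novel methodology. Building on the finding that temporal information lies along a low-dimensional, non-linear manifold in the embedding space, they propose methods to derive an explicit timeline representation from it.

Result: Temporal information in the VLM embedding space is structured along a low-dimensional, non-linear manifold. The proposed timeline approaches achieve accuracy competitive with or superior to a prompt-based baseline on temporal reasoning tasks while being computationally efficient.

Conclusion: The study reveals how temporal information is structured in VLM embedding spaces and offers a new methodological angle on temporal reasoning. The timeline representations model time and its chronological progression.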

📄 Abstract

Large-scale vision-language models (VLMs) such as CLIP have gained popularity for their generalizable and expressive multimodal representations. By leveraging large-scale training data with diverse textual metadata, VLMs acquire open-vocabulary capabilities, solving tasks beyond their training scope. This paper investigates the temporal awareness of VLMs, assessing their ability to position visual content in time. We introduce TIME10k, a benchmark dataset of over 10,000 images with temporal ground truth, and evaluate the time-awareness of 37 VLMs with a novel methodology. Our investigation reveals that temporal information is structured along a low-dimensional, non-linear manifold in the VLM embedding space. Based on this insight, we propose methods to derive an explicit "timeline" representation from the embedding space. These representations model time and its chronological progression and thereby facilitate temporal reasoning tasks. Our timeline approaches achieve accuracy competitive with or superior to a prompt-based baseline while being computationally efficient. All code and data are available at https://tekayanidham.github.io/timeline-page/.

[13] Unified Reinforcement and Imitation Learning for Vision-Language Models

Byung-Kwan Lee, Ryo Hachiuma, Yong Man Ro, Yu-Chiang Frank Wang, Yueh-Hua Wu

🧩 TL;DR

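This paper proposes Unified Reinforcement and Imitation Learning (RIL), an efficient training algorithm that combines reinforcement learning with adversarial imitation learning so that lightweight vision-language models can mimic the text generation of large teacher models and systematically improve their generative capabilities.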

📘 Detailed Summary

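Motivation: Vision-language models (VLMs) have made remarkable progress, but their large scale makes them impractical in resource-constrained environments, motivating efficient training methods that produce powerful yet lightweight VLMs.

Method: RIL combines reinforcement learning with adversarial imitation learning. An LLM-based discriminator distinguishes student outputs from teacher outputs, and multiple large teacher VLMs provide diverse guidance, yielding a unified reinforcement and imitation learning strategy.

Result: Extensive experiments on diverse vision-language benchmarks show that RIL significantly narrows the performance gap with state-of-the-art open- and closed-source VLMs and in several instances surpasses them.

Conclusion: The unified strategy lets lightweight student models become competitive with leading closed-source VLMs, making it a promising approach for deploying efficient vision-language models in resource-constrained environments.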

📄 Abstract

Vision-Language Models (VLMs) have achieved remarkable progress, yet their large scale often renders them impractical for resource-constrained environments. This paper introduces Unified Reinforcement and Imitation Learning (RIL), a novel and efficient training algorithm designed to create powerful, lightweight VLMs. RIL distinctively combines the strengths of reinforcement learning with adversarial imitation learning. This enables smaller student VLMs not only to mimic the sophisticated text generation of large teacher models but also to systematically improve their generative capabilities through reinforcement signals. Key to our imitation framework is an LLM-based discriminator that adeptly distinguishes between student and teacher outputs, complemented by guidance from multiple large teacher VLMs to ensure diverse learning. This unified learning strategy, leveraging both reinforcement and imitation, empowers student models to achieve significant performance gains, making them competitive with leading closed-source VLMs. Extensive experiments on diverse vision-language benchmarks demonstrate that RIL significantly narrows the performance gap with state-of-the-art open- and closed-source VLMs and, in several instances, surpasses them.

Francisco Mena, Dino Ienco, Cassio F. Dantas, Roberto Interdonato, Andreas Dengel

🧩 TL;DR

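This paper proposes a novel multi-modal co-learning framework that generalizes across tasks without targeting a specific inference modality. It combines contrastive and modality discriminative learning to guide single-modality models to structure their internal manifold into modality-shared and modality-specific information.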

📘 Detailed Summary

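Motivation: In Earth Observation (EO), real-world constraints make it difficult to access the same sensor modalities at both training and inference time. Most existing work designs customized solutions for particular downstream tasks or specific inference modalities and therefore lacks generality.

Method: The framework combines contrastive and modality discriminative learning to guide single-modality models to structure the internal model manifold into modality-shared and modality-specific information. It targets the setting where only one of the modalities available during training is accessible at inference time.

Result: On four EO benchmarks spanning classification and regression tasks, the method yields consistent predictive improvements over state-of-the-art approaches from the recent machine learning and computer vision literature as well as EO-specific methods, validating it in single-modality inference scenarios.

Conclusion: The framework is effective for single-modality inference across diverse EO applications and offers a general approach to multi-modal learning under real-world constraints.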

📄 Abstract

Multi-modal co-learning is emerging as an effective paradigm in machine learning, enabling models to collaboratively learn from different modalities to enhance single-modality predictions. Earth Observation (EO) represents a quintessential domain for multi-modal data analysis, wherein diverse remote sensors collect data to sense our planet. This unprecedented volume of data introduces novel challenges. Specifically, the access to the same sensor modalities at both training and inference stages becomes increasingly complex based on real-world constraints affecting remote sensing platforms. In this context, multi-modal co-learning presents a promising strategy to leverage the vast amount of sensor-derived data available at the training stage to improve single-modality models for inference-time deployment. Most current research efforts focus on designing customized solutions for either particular downstream tasks or specific modalities available at the inference stage. To address this, we propose a novel multi-modal co-learning framework capable of generalizing across various tasks without targeting a specific modality for inference. Our approach combines contrastive and modality discriminative learning together to guide single-modality models to structure the internal model manifold into modality-shared and modality-specific information. We evaluate our framework on four EO benchmarks spanning classification and regression tasks across different sensor modalities, where only one of the modalities available during training is accessible at inference time. Our results demonstrate consistent predictive improvements over state-of-the-art approaches from the recent machine learning and computer vision literature, as well as EO-specific methods. The obtained findings validate our framework in the single-modality inference scenarios across a diverse range of EO applications.

[15] BrainMCLIP: Brain Image Decoding with Multi-Layer feature Fusion of CLIP

Tian Xia, Zihan Ma, Xinlong Wang, Qing Liu, Xiaowei He, Tianming Liu, Yudan Ren

🧩 TL;DR

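BrainMCLIP is a parameter-efficient method for decoding images from fMRI. Its multi-layer fusion strategy aligns visual areas of the brain with intermediate CLIP layers, balancing semantic accuracy and detail fidelity without an additional VAE pathway.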

📘 Detailed Summary

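Motivation: Existing fMRI image decoding methods usually map brain activity to CLIP's final semantic layer and add a parameter-intensive VAE pathway to capture finer visual detail. These methods overlook the rich object information in CLIP's intermediate layers and are at odds with the brain's functional hierarchy.

Method: BrainMCLIP takes a parameter-efficient multi-layer fusion approach: fMRI signals from functionally distinct visual areas (low-level and high-level) are aligned to the corresponding intermediate and final CLIP layers, respecting the functional hierarchy. It also introduces a Cross-Reconstruction strategy and a novel multi-granularity loss.

Result: BrainMCLIP matches or surpasses state-of-the-art methods, including those with VAE pathways, on high-level semantic metrics while using 71.7% fewer parameters than top VAE-based methods. By leveraging intermediate CLIP features it captures visual details that CLIP-only approaches often miss.

Conclusion: Using CLIP's intermediate-layer features in line with the brain's functional hierarchy achieves a good balance between semantic accuracy and detail fidelity without a separate VAE pathway, giving a more efficient approach to fMRI image decoding.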

📄 Abstract

Decoding images from fMRI often involves mapping brain activity to CLIP's final semantic layer. To capture finer visual details, many approaches add a parameter-intensive VAE-based pipeline. However, these approaches overlook rich object information within CLIP's intermediate layers and contradict the brain's functional hierarchy. We introduce BrainMCLIP, which pioneers a parameter-efficient, multi-layer fusion approach guided by the human visual system's functional hierarchy, eliminating the need for such a separate VAE pathway. BrainMCLIP aligns fMRI signals from functionally distinct visual areas (low-/high-level) to corresponding intermediate and final CLIP layers, respecting functional hierarchy. We further introduce a Cross-Reconstruction strategy and a novel multi-granularity loss. Results show BrainMCLIP achieves highly competitive performance, particularly excelling on high-level semantic metrics where it matches or surpasses state-of-the-art (SOTA) methods, including those using VAE pipelines. Crucially, it achieves this with substantially fewer parameters, demonstrating a reduction of 71.7% (reported in the paper's CLIP vs. VAE comparison table) compared to top VAE-based SOTA methods, by avoiding the VAE pathway. By leveraging intermediate CLIP features, it effectively captures visual details often missed by CLIP-only approaches, striking a compelling balance between semantic accuracy and detail fidelity without requiring a separate VAE pipeline.

[16] A Training-Free Framework for Open-Vocabulary Image Segmentation and Recognition with EfficientNet and CLIP

Ying Dai, Wei Yu Chen

🧩 TL;DR

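This paper proposes a training-free framework for open-vocabulary image segmentation and recognition. It combines unsupervised segmentation based on EfficientNetB0 with CLIP's vision-language alignment to perform open-vocabulary semantic segmentation and object recognition.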

📘 Detailed Summary

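Motivation: Current image segmentation and recognition methods usually require large amounts of annotated data for supervised training, which limits their use in open-vocabulary settings. This work aims to build a framework that handles arbitrary category vocabularies without training, addressing the limitations of traditional methods in open-set recognition.

Method: The framework is a two-stage pipeline. First, EfficientNetB0 extracts pixel-wise features, singular value decomposition yields latent representations, and hierarchical clustering segments the image, with the number of clusters determined adaptively from the distribution of singular values. Second, CLIP's ViT backbone encodes the segmented regions, which are aligned with precomputed text embeddings in a shared latent space; recognition follows from similarity computation.

Result: On standard benchmarks including COCO, ADE20K, and PASCAL VOC, the method achieves state-of-the-art performance in Hungarian mIoU, precision, recall, and F1-score, demonstrating its recognition ability.

Conclusion: The study shows that efficient open-vocabulary image segmentation and recognition is feasible without training. The proposed framework is flexible and generalizes well, offering a new option for unsupervised visual understanding tasks.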

📄 Abstract

This paper presents a novel training-free framework for open-vocabulary image segmentation and object recognition (OVSR), which leverages EfficientNetB0, a convolutional neural network, for unsupervised segmentation and CLIP, a vision-language model, for open-vocabulary object recognition. The proposed framework adopts a two-stage pipeline: unsupervised image segmentation followed by segment-level recognition via vision-language alignment. In the first stage, pixel-wise features extracted from EfficientNetB0 are decomposed using singular value decomposition (SVD) to obtain latent representations, which are then clustered using hierarchical clustering to segment semantically meaningful regions. The number of clusters is adaptively determined by the distribution of singular values. In the second stage, the segmented regions are localized and encoded into image embeddings using the Vision Transformer backbone of CLIP. Text embeddings are precomputed using CLIP's text encoder from category-specific prompts, including a generic "something else" prompt to support open-set recognition. The image and text embeddings are concatenated and projected into a shared latent feature space via SVD to enhance cross-modal alignment. Recognition is performed by computing the softmax over the similarities between the projected image and text embeddings. The proposed method is evaluated on standard benchmarks, including COCO, ADE20K, and PASCAL VOC, achieving state-of-the-art performance in terms of Hungarian mIoU, precision, recall, and F1-score. These results demonstrate the effectiveness, flexibility, and generalizability of the proposed framework.
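
The recognition step described in the abstract (softmax over similarities between a region embedding and text embeddings, with a "something else" fallback class) can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: the function name `recognize`, the `temperature` value, and the toy 3-dimensional embeddings are hypothetical, and the SVD projection into a shared space is omitted.

```python
import numpy as np

def recognize(region_emb, text_embs, labels, temperature=0.01):
    """Return (label, probabilities) for one segmented region.

    temperature is an assumed hyperparameter, not a value from the paper.
    """
    # L2-normalise so the dot product is a cosine similarity.
    r = region_emb / np.linalg.norm(region_emb)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = t @ r
    # Numerically stable softmax over the similarities.
    logits = sims / temperature
    logits = logits - logits.max()
    probs = np.exp(logits) / np.exp(logits).sum()
    return labels[int(np.argmax(probs))], probs

# Toy embeddings standing in for CLIP outputs (hypothetical values).
labels = ["cat", "dog", "something else"]
text_embs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
region_emb = np.array([0.1, 0.9, 0.2])
label, probs = recognize(region_emb, text_embs, labels)
```

A region whose best match is the "something else" prompt would be treated as outside the known vocabulary, which is how the abstract describes support for open-set recognition.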

[17] XBench: A Comprehensive Benchmark for Visual-Language Explanations in Chest Radiography

Haozhe Luo, Shelley Zixin Shu, Ziyu Zhou, Sebastian Otalora, Mauricio Reyes

🧩 TL;DR

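This paper presents XBench, the first systematic benchmark for evaluating cross-modal interpretability in chest X-rays across seven CLIP-style vision-language models. It reveals that current models fall short of clinically reliable grounding and underscores the need for targeted interpretability benchmarking before deployment in medical practice.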

📘 Detailed Summary

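Motivation: Vision-language models show remarkable zero-shot performance in medical image understanding, but their grounding ability (the extent to which textual concepts align with visual evidence) remains underexplored in the medical domain, where reliable grounding is essential for interpretability and clinical adoption.

Method: The study generates visual explanations using cross-attention and similarity-based localization maps, quantitatively assesses their alignment with radiologist-annotated regions across multiple pathologies, and systematically compares seven CLIP-style VLM variants on chest X-rays.

Result: The analysis finds that all VLM variants localize large, well-defined pathologies reasonably well, but performance degrades substantially for small or diffuse lesions; models pretrained on chest X-ray-specific datasets align better than those trained on general-domain data; and a model's overall recognition ability is strongly correlated with its grounding ability.

Conclusion: Despite their strong recognition ability, current vision-language models still fall short of clinically reliable grounding. This highlights the need for targeted interpretability benchmarks before deployment in medical practice and offers guidance for the safe deployment of future medical AI systems.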

📄 Abstract

Vision-language models (VLMs) have recently shown remarkable zero-shot performance in medical image understanding, yet their grounding ability, the extent to which textual concepts align with visual evidence, remains underexplored. In the medical domain, however, reliable grounding is essential for interpretability and clinical adoption. In this work, we present the first systematic benchmark for evaluating cross-modal interpretability in chest X-rays across seven CLIP-style VLM variants. We generate visual explanations using cross-attention and similarity-based localization maps, and quantitatively assess their alignment with radiologist-annotated regions across multiple pathologies. Our analysis reveals that: (1) while all VLM variants demonstrate reasonable localization for large and well-defined pathologies, their performance substantially degrades for small or diffuse lesions; (2) models that are pretrained on chest X-ray-specific datasets exhibit improved alignment compared to those trained on general-domain data; and (3) the overall recognition ability and grounding ability of the model are strongly correlated. These findings underscore that current VLMs, despite their strong recognition ability, still fall short in clinically reliable grounding, highlighting the need for targeted interpretability benchmarks before deployment in medical practice. XBench code is available at https://github.com/Roypic/Benchmarkingattention
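
Alignment between a localization map and a radiologist-annotated region can be quantified in several ways. The abstract does not say which metrics XBench uses, so the sketch below shows two common choices as an assumption: a pointing-game hit (does the map's peak fall inside the annotation?) and IoU after thresholding the map. The function names and the `frac` threshold are hypothetical.

```python
import numpy as np

def pointing_hit(heatmap, mask):
    """True if the heatmap's maximum falls inside the annotated region."""
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return bool(mask[y, x])

def iou_at_threshold(heatmap, mask, frac=0.5):
    """IoU between the mask and the heatmap thresholded at frac * max (frac is assumed)."""
    pred = heatmap >= frac * heatmap.max()
    union = np.logical_or(pred, mask).sum()
    return float(np.logical_and(pred, mask).sum() / union) if union else 0.0

# Toy 4x4 example: the map highlights two pixels inside a 2x2 annotation.
heatmap = np.zeros((4, 4))
heatmap[1, 1] = 1.0
heatmap[1, 2] = 0.6
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True

hit = pointing_hit(heatmap, mask)
iou = iou_at_threshold(heatmap, mask)
```

Metrics of this kind penalize small or diffuse lesions more heavily, since a slightly misplaced peak or a broad map produces little overlap with a small annotation.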

[18] DaMo: Data Mixing Optimizer in Fine-tuning Multimodal LLMs for Mobile Phone Agents

Kai Shi, Jun Yang, Ni Yang, Binqiang Pan, Qingsong Xie, Chao Zhang, Zhenyu Yang, Tianhuang Su, Haonan Lu

🧩 TL;DR

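This paper proposes DaMo (Data Mixture Optimizer), which uses a trainable network to predict the optimal data mixture for multitask learning, and reports clear gains on mobile phone agent tasks. On PhoneAgentBench it improves performance by 3.38% over alternative methods and generalizes well.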

📘 Detailed Summary

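Motivation: Mobile Phone Agents (MPAs) are an important direction for multimodal applications. Although they are built on Multimodal Large Language Models (MLLMs), their effectiveness at handling multiple mobile phone tasks simultaneously is limited. Existing multitask supervised fine-tuning approaches struggle to determine the training data composition that yields peak performance, which is a key bottleneck for MPAs.

Method: DaMo uses a trainable network that predicts downstream task performance for any given dataset ratio and uses those predictions to identify the best data mixture. The paper also introduces PhoneAgentBench, a benchmark of 1235 QA pairs covering diverse real-world industrial mobile application scenarios, to support comprehensive evaluation.

Result: DaMo shows strong predictive capability in small-scale pilot experiments (R² = 0.81) and improves performance on PhoneAgentBench by 3.38% over alternative methods. Across BFCL-v3, MME-Reasoning, MME-Perception, and OCRBench, its average score is 2.57% higher than other approaches, and when used solely to optimize the MLLM on the BFCL-v3 task it improves the metrics by 12.47%.

Conclusion: DaMo generalizes well and scales robustly, remaining effective when applied to other model architectures. The work offers an effective solution for data mixture optimization in multitask learning, and the code and dataset are open-sourced.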

📄 Abstract

Mobile Phone Agents (MPAs) have emerged as a promising research direction due to their broad applicability across diverse scenarios. While Multimodal Large Language Models (MLLMs) serve as the foundation for MPAs, their effectiveness in handling multiple mobile phone tasks simultaneously remains limited. Although multitask supervised fine-tuning (SFT) is widely adopted for multitask learning, existing approaches struggle to determine optimal training data compositions for peak performance. To address this challenge, we propose DaMo (Data Mixture Optimizer), a novel solution employing a trainable network that predicts optimal data mixtures by forecasting downstream task performance for any given dataset ratio. To support comprehensive evaluation, we introduce PhoneAgentBench, the first specialized benchmark to evaluate MLLMs on multimodal mobile phone tasks, comprising 1235 QA pairs spanning diverse real-world industrial mobile application scenarios. Demonstrating strong predictive capability (R² = 0.81) in small-scale pilot experiments, DaMo efficiently extrapolates optimal data mixing configurations. Our results show DaMo achieves a 3.38% performance improvement on PhoneAgentBench compared to alternative methods. Furthermore, extensive experiments across established benchmarks including BFCL-v3, MME-Reasoning, MME-Perception, and OCRBench reveal DaMo's superior generalization, outperforming other approaches by 2.57% in terms of average score. When used solely for MLLM optimization on the BFCL-v3 task, DaMo improves the metrics by 12.47% over other methods. Notably, DaMo maintains robust scalability, preserving its effectiveness when applied to other model architectures. The code and dataset are available at https://github.com/OPPO-Mente-Lab/DaMo.git
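
The idea of choosing a data mixture by querying a performance predictor can be sketched as below. This is an illustration under assumptions rather than DaMo itself: `toy_predict` is a hypothetical stand-in for the trained network, the grid search over the simplex is one simple way to use such a predictor, and `r_squared` shows how a figure like R² = 0.81 would be computed from pilot runs.

```python
from itertools import product

def simplex_grid(n_datasets, steps):
    """Yield all mixtures of n_datasets ratios in multiples of 1/steps that sum to 1."""
    for counts in product(range(steps + 1), repeat=n_datasets):
        if sum(counts) == steps:
            yield tuple(c / steps for c in counts)

def best_mixture(predict, n_datasets, steps=10):
    """Return the grid mixture with the highest predicted downstream score."""
    return max(simplex_grid(n_datasets, steps), key=predict)

def r_squared(y_true, y_pred):
    """Coefficient of determination between measured and predicted scores."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def toy_predict(mix):
    """Hypothetical predictor whose score peaks at the mixture (0.2, 0.5, 0.3)."""
    target = (0.2, 0.5, 0.3)
    return -sum((m - t) ** 2 for m, t in zip(mix, target))

best = best_mixture(toy_predict, n_datasets=3)
```

Because the predictor is cheap to evaluate compared with a fine-tuning run, exhaustive search over a coarse grid is affordable for a handful of datasets; a finer grid or a continuous optimizer would be needed as the number of datasets grows.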

[19] Seeing Across Views: Benchmarking Spatial Reasoning of Vision-Language Models in Robotic Scenes

Zhiyuan Feng, Zhaolu Kang, Qijie Wang, Zhiying Du, Jiongrui Yan, Shubin Shi, Chengbo Yuan, Huizhi Liang, Yu Deng, Qixiu Li, Rushuai Yang, Arctanx An, Leqi Zheng, Weijie Wang, Shawn Chen, Sicheng Xu, Yaobo Liang, Jiaolong Yang, Baining Guo

🧩 TL;DR

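This paper introduces MV-RoboBench, a benchmark for evaluating the spatial reasoning of vision-language models in multi-view robotic manipulation. It finds a substantial gap between state-of-the-art models and human performance and reveals a positive correlation between multi-view spatial intelligence and robotic task execution.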

📘 Detailed Summary

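Motivation: Current evaluations of vision-language models focus mainly on single-view settings and neglect their ability to integrate multi-view information. Multi-camera setups are increasingly common on robotic platforms, yet whether VLMs can effectively use multi-view inputs for robotic reasoning remains an open question.

Method: The authors build MV-RoboBench, which contains 1.7k manually curated QA items across eight subtasks in two categories: spatial understanding and robotic execution. They evaluate a range of existing VLMs, both open-source and closed-source, along with enhanced versions that use chain-of-thought-inspired techniques.

Result: State-of-the-art VLMs perform far below human level. The analysis yields two key findings: spatial intelligence and robotic task execution are positively correlated in multi-view robotic scenarios, and strong performance on existing general-purpose single-view spatial understanding benchmarks does not reliably translate to success on robotic spatial tasks.

Conclusion: The study highlights the substantial challenges VLMs face in multi-view robotic perception. MV-RoboBench is released as an open resource to foster progress in spatially grounded vision-language and vision-language-action models, providing both data and a standardized evaluation protocol for multi-view embodied reasoning.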

📄 Abstract

Vision-language models (VLMs) are essential to Embodied AI, enabling robots to perceive, reason, and act in complex environments. They also serve as the foundation for the recent Vision-Language-Action (VLA) models. Yet most evaluations of VLMs focus on single-view settings, leaving their ability to integrate multi-view information underexplored. At the same time, multi-camera setups are increasingly standard in robotic platforms, as they provide complementary perspectives to mitigate occlusion and depth ambiguity. Whether VLMs can effectively leverage such multi-view inputs for robotic reasoning therefore remains an open question. To bridge this gap, we introduce MV-RoboBench, a benchmark specifically designed to evaluate the multi-view spatial reasoning capabilities of VLMs in robotic manipulation. MV-RoboBench consists of 1.7k manually curated QA items across eight subtasks, divided into two primary categories: spatial understanding and robotic execution. We evaluate a diverse set of existing VLMs, including both open-source and closed-source models, along with enhanced versions incorporating CoT-inspired techniques. The results show that state-of-the-art models remain far below human performance, underscoring the substantial challenges VLMs face in multi-view robotic perception. Additionally, our analysis uncovers two key findings: (i) spatial intelligence and robotic task execution are positively correlated in multi-view robotic scenarios; and (ii) strong performance on existing general-purpose single-view spatial understanding benchmarks does not reliably translate to success in the robotic spatial tasks assessed by our benchmark. We release MV-RoboBench as an open resource to foster progress in spatially grounded VLMs and VLAs, providing not only data but also a standardized evaluation protocol for multi-view embodied reasoning.

[20] From Forecasting to Planning: Policy World Model for Collaborative State-Action Prediction

Zhida Zhao, Talas Fu, Yifan Wang, Lijun Wang, Huchuan Lu

🧩 TL;DR

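This paper proposes a new driving paradigm, the Policy World Model (PWM), which unifies world modeling and trajectory planning in one architecture and uses an action-free future state forecasting scheme so that planning benefits from learned world knowledge, enabling human-like anticipatory perception.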

📘 Detailed Summary

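Motivation: Current driving world models focus mainly on world simulation and are decoupled from trajectory planning. Recent work attempts to unify the two, but how world modeling can synergistically facilitate planning still needs further exploration; existing methods do not fully exploit learned world knowledge to improve planning.

Method: PWM integrates world modeling and trajectory planning in a unified architecture and proposes an action-free future state forecasting scheme that lets planning benefit from learned world knowledge. Collaborative state-action prediction provides human-like anticipatory perception. To make video forecasting more efficient, PWM introduces a dynamically enhanced parallel token generation mechanism with a context-guided tokenizer and an adaptive dynamic focal loss.

Result: Despite using only front-camera input, the method matches or exceeds state-of-the-art approaches that rely on multi-view and multi-modal inputs.

Conclusion: The work demonstrates the potential of a unified architecture for world modeling and planning, with collaborative state-action prediction yielding more reliable planning, and offers a new paradigm for autonomous driving systems. Extending it to multi-modal inputs is a possible direction for future work.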

📄 Abstract

Despite remarkable progress in driving world models, their potential for autonomous systems remains largely untapped: the world models are mostly learned for world simulation and decoupled from trajectory planning. While recent efforts aim to unify world modeling and planning in a single framework, the synergistic facilitation mechanism of world modeling for planning still requires further exploration. In this work, we introduce a new driving paradigm named Policy World Model (PWM), which not only integrates world modeling and trajectory planning within a unified architecture, but also lets planning benefit from the learned world knowledge through the proposed action-free future state forecasting scheme. Through collaborative state-action prediction, PWM can mimic human-like anticipatory perception, yielding more reliable planning performance. To improve the efficiency of video forecasting, we further introduce a dynamically enhanced parallel token generation mechanism, equipped with a context-guided tokenizer and an adaptive dynamic focal loss. Despite utilizing only front camera input, our method matches or exceeds state-of-the-art approaches that rely on multi-view and multi-modal inputs. Code and model weights will be released at https://github.com/6550Zhao/Policy-World-Model.

[21] Reasoning Like Experts: Leveraging Multimodal Large Language Models for Drawing-based Psychoanalysis

Xueqi Ma, Yanbei Jiang, Sarah Erfani, James Bailey, Weifeng Liu, Krista A. Ehinger, Jey Han Lau

🧩 TL;DR

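This paper proposes PICK, a framework for psychological image comprehension with multimodal large language models. It targets the House-Tree-Person test with hierarchical analysis and knowledge injection, and significantly improves MLLMs' capability in psychological analysis.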

📘 Detailed Summary

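Motivation: Multimodal large language models perform well on objective perception tasks, but their use in subjective, emotionally nuanced domains, particularly psychological analysis, remains largely unexplored, leaving a gap in the integration of specialized domain knowledge.

Method: PICK is a multi-step framework. It decomposes complex drawings into semantically meaningful sub-drawings to build a hierarchical representation, analyzes them with a targeted focus at three levels (single-object, multi-object, and whole), and introduces an HTP knowledge base and a feature extraction module trained with reinforcement learning to generate a psychological profile.

Result: Experiments show that PICK significantly enhances the capability of MLLMs in psychological analysis, and extensions to emotion understanding tasks validate it as a general framework.

Conclusion: The work bridges the gap between MLLMs and specialized expert domains, offering a structured and interpretable framework for understanding human mental states through visual expression.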

📄 Abstract

Multimodal Large Language Models (MLLMs) have demonstrated exceptional performance across various objective multimodal perception tasks, yet their application to subjective, emotionally nuanced domains, such as psychological analysis, remains largely unexplored. In this paper, we introduce PICK, a multi-step framework designed for Psychoanalytical Image Comprehension through hierarchical analysis and Knowledge injection with MLLMs, specifically focusing on the House-Tree-Person (HTP) Test, a widely used psychological assessment in clinical practice. First, we decompose drawings containing multiple instances into semantically meaningful sub-drawings, constructing a hierarchical representation that captures spatial structure and content across three levels: single-object level, multi-object level, and whole level. Next, we analyze these sub-drawings at each level with a targeted focus, extracting psychological or emotional insights from their visual cues. We also introduce an HTP knowledge base and design a feature extraction module, trained with reinforcement learning, to generate a psychological profile for single-object level analysis. This profile captures both holistic stylistic features and dynamic object-specific features (such as those of the house, tree, or person), correlating them with psychological states. Finally, we integrate this multi-faceted information to produce a well-informed assessment that aligns with expert-level reasoning. Our approach bridges the gap between MLLMs and specialized expert domains, offering a structured and interpretable framework for understanding human mental states through visual expression. Experimental results demonstrate that the proposed PICK significantly enhances the capability of MLLMs in psychological analysis. It is further validated as a general framework through extensions to emotion understanding tasks.

[22] I Spy With My Model's Eye: Visual Search as a Behavioural Test for MLLMs

John Burden, Jonathan Prunty, Ben Slater, Matthieu Tehenan, Greg Davis, Lucy Cheke

🧩 TL;DR

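This study applies visual search paradigms from cognitive psychology to multimodal large language models. It finds that advanced MLLMs show human-like pop-out effects in colour- and size-based feature search and capacity limits in conjunctive search, providing a cognitively grounded diagnostic tool for evaluating MLLMs' perceptual capabilities.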

📘 Detailed Summary

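Motivation: MLLMs perform strongly on vision-language tasks, but their visual processing remains opaque. Most black-box evaluations measure only task accuracy and reveal little about underlying mechanisms; this work uses methods from cognitive psychology to fill that gap.

Method: The study adapts classic visual search paradigms, using controlled experiments on colour, size, and lighting features to test whether MLLMs exhibit the pop-out effect, and reinforces the findings with targeted fine-tuning and mechanistic interpretability analyses.

Result: Advanced MLLMs show human-like pop-out effects in single-feature search based on colour or size and capacity limits in conjunctive (multiple-feature) search. There is also evidence that MLLMs, like humans, incorporate natural scene priors such as lighting direction into object representations.

Conclusion: Visual search can serve as a cognitively grounded diagnostic tool for evaluating perceptual capabilities in MLLMs. The study reveals similarities between MLLM and human visual processing and underlines the value of cognitive psychology methods in AI evaluation.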

📄 Abstract

Multimodal large language models (MLLMs) achieve strong performance on vision-language tasks, yet their visual processing is opaque. Most black-box evaluations measure task accuracy, but reveal little about underlying mechanisms. Drawing on cognitive psychology, we adapt classic visual search paradigms (originally developed to study human perception) to test whether MLLMs exhibit the "pop-out" effect, where salient visual features are detected independently of distractor set size. Using controlled experiments targeting colour, size and lighting features, we find that advanced MLLMs exhibit human-like pop-out effects in colour or size-based disjunctive (single feature) search, as well as capacity limits for conjunctive (multiple feature) search. We also find evidence to suggest that MLLMs, like humans, incorporate natural scene priors such as lighting direction into object representations. We reinforce our findings using targeted fine-tuning and mechanistic interpretability analyses. Our work shows how visual search can serve as a cognitively grounded diagnostic tool for evaluating perceptual capabilities in MLLMs.
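
A pop-out effect means performance is roughly independent of the number of distractors, so one simple diagnostic is the slope of accuracy against set size. The sketch below fits that slope by least squares; the accuracy values are made-up illustrative numbers, not results from the paper, and using accuracy rather than response time as the dependent measure is an assumption.

```python
def search_slope(set_sizes, accuracies):
    """Least-squares slope of accuracy against distractor set size."""
    n = len(set_sizes)
    mean_x = sum(set_sizes) / n
    mean_y = sum(accuracies) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(set_sizes, accuracies))
    sxx = sum((x - mean_x) ** 2 for x in set_sizes)
    return sxy / sxx

set_sizes = [4, 8, 16, 32]
feature_acc = [0.98, 0.97, 0.98, 0.97]      # made-up: flat, pop-out-like
conjunction_acc = [0.95, 0.85, 0.70, 0.45]  # made-up: degrades with set size

feature_slope = search_slope(set_sizes, feature_acc)
conjunction_slope = search_slope(set_sizes, conjunction_acc)
```

A slope near zero is consistent with pop-out, while a clearly negative slope indicates that each added distractor costs accuracy, the pattern the abstract describes as a capacity limit in conjunctive search.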

[23] [De|Re]constructing VLMs' Reasoning in Counting

Simone Alghisi, Gabriel Roccabruna, Massimo Rizzoli, Seyed Mahed Mousavi, Giuseppe Riccardi

🧩 TL;DR

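This study systematically analyzes the counting ability of seven state-of-the-art vision-language models. It finds that the models are highly sensitive to the number of objects, their spatial arrangement, and distractors, and shows that targeted training of only the output layer improves accuracy by up to 21%.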

📘 Detailed Summary

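Motivation: Despite their competitive performance on downstream tasks, vision-language models still have clear limitations in visual reasoning, such as identifying relations, understanding temporal sequences, and counting objects. This study goes beyond score-level benchmark evaluation to investigate the underlying causes of failure and propose a targeted improvement.

Method: The study evaluates the counting skills of seven state-of-the-art VLMs under controlled experimental conditions, uses a layer-wise analysis to locate the source of errors, and designs a targeted training strategy that fine-tunes only the output layer.

Result: VLMs are highly sensitive to the number and type of objects, their spatial arrangement, and the co-occurrence of distractors. The layer-wise analysis shows that errors stem from an incorrect mapping of the last-layer representation into the output space. Fine-tuning just the output layer improves accuracy by up to 21%, with consistent improvements on real-world datasets.

Conclusion: The counting errors arise mainly from the output mapping rather than from the feature representations, so targeted fine-tuning of the output layer is an efficient way to improve performance.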

📄 Abstract

Vision-Language Models (VLMs) have recently gained attention due to their competitive performance on multiple downstream tasks, achieved by following user-input instructions. However, VLMs still exhibit several limitations in visual reasoning, such as difficulties in identifying relations (e.g., spatial, temporal, and among objects), understanding temporal sequences (e.g., frames), and counting objects. In this work, we go beyond score-level benchmark evaluations of VLMs by investigating the underlying causes of their failures and proposing a targeted approach to improve their reasoning capabilities. We study the reasoning skills of seven state-of-the-art VLMs in the counting task under controlled experimental conditions. Our experiments show that VLMs are highly sensitive to the number and type of objects, their spatial arrangement, and the co-occurrence of distractors. A layer-wise analysis reveals that errors are due to incorrect mapping of the last-layer representation into the output space. Our targeted training shows that fine-tuning just the output layer improves accuracy by up to 21%. We corroborate these findings by achieving consistent improvements on real-world datasets.

[24] The Intricate Dance of Prompt Complexity, Quality, Diversity, and Consistency in T2I Models

Xiaofeng Zhang, Aaron Courville, Michal Drozdzal, Adriana Romero-Soriano

🧩 TL;DR

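This paper systematically studies how prompt complexity affects the utility of synthetic data generated by text-to-image models. Increasing prompt complexity lowers conditional diversity and prompt consistency but reduces the distribution shift between synthetic and real data.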

📘 Detailed Summary

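Motivation: Text-to-image models can generate virtually unlimited synthetic data, and prompt engineering is the primary way of interacting with them, yet the systematic impact of prompt complexity on the key utility axes of quality, diversity, and consistency remains underexplored.

Method: The study first runs synthetic experiments to demonstrate the difficulty of generalizing across prompt complexity and explains it with theoretical derivations. It then introduces a new evaluation framework that compares the utility of real and synthetic data, analyzes the effect of prompt complexity on datasets including CC12M, ImageNet-1k, and DCI, and evaluates different inference-time intervention methods.

Result: The synthetic experiments show that generalizing to more general conditions is harder than the reverse. Large-scale empirical experiments show that increasing prompt complexity lowers conditional diversity and prompt consistency while reducing the synthetic-to-real distribution shift. Among the interventions, prompt expansion, which uses a pre-trained language model as a likelihood estimator, performs best in image diversity and aesthetic quality.

Conclusion: The study reveals trade-offs between prompt complexity and synthetic data utility. Current inference-time interventions increase diversity at the expense of moving outside the support of real data, and prompt expansion achieves the highest performance in diversity and aesthetics by using a language model as a likelihood estimator.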

📄 Abstract

Text-to-image (T2I) models offer great potential for creating virtually limitless synthetic data, a valuable resource compared to fixed and finite real datasets. Previous works evaluate the utility of synthetic data from T2I models on three key desiderata: quality, diversity, and consistency. While prompt engineering is the primary means of interacting with T2I models, the systematic impact of prompt complexity on these critical utility axes remains underexplored. In this paper, we first conduct synthetic experiments to motivate the difficulty of generalization w.r.t. prompt complexity and explain the observed difficulty with theoretical derivations. Then, we introduce a new evaluation framework that can compare the utility of real data and synthetic data, and present a comprehensive analysis of how prompt complexity influences the utility of synthetic data generated by commonly used T2I models. We conduct our study across diverse datasets, including CC12M, ImageNet-1k, and DCI, and evaluate different inference-time intervention methods. Our synthetic experiments show that generalizing to more general conditions is harder than the other way round, since the former needs an estimated likelihood that is not learned by diffusion models. Our large-scale empirical experiments reveal that increasing prompt complexity results in lower conditional diversity and prompt consistency, while reducing the synthetic-to-real distribution shift, which aligns with the synthetic experiments. Moreover, current inference-time interventions can augment the diversity of the generations at the expense of moving outside the support of real data. Among those interventions, prompt expansion, by deliberately using a pre-trained language model as a likelihood estimator, consistently achieves the highest performance in both image diversity and aesthetics, even higher than that of real data.

[25] HAD: Hierarchical Asymmetric Distillation to Bridge Spatio-Temporal Gaps in Event-Based Object Tracking

Yao Deng, Xian Zhong, Wenxuan Liu, Zhaofei Yu, Jingling Yuan, Tiejun Huang

🧩 TL;DR

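This paper proposes Hierarchical Asymmetric Distillation (HAD), a multi-modal knowledge distillation framework that explicitly models and mitigates the spatio-temporal asymmetry between RGB and event cameras, improving object tracking under challenging conditions such as high-speed motion and high dynamic range.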

📘 Detailed Summary

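Motivation: RGB and event cameras have fundamentally different imaging mechanisms, which creates a significant spatio-temporal asymmetry and hinders effective fusion of the two modalities. This limits tracking gains under challenging conditions such as high-speed motion, high-dynamic-range environments, and dynamic background interference.

Method: HAD uses a hierarchical alignment strategy that minimizes information loss while maintaining the student network's computational efficiency and parameter compactness. Through multi-modal knowledge distillation, it explicitly models and mitigates the spatio-temporal asymmetry between RGB and event cameras.

Result: Extensive experiments show that HAD consistently outperforms state-of-the-art methods, and comprehensive ablation studies validate the effectiveness and necessity of each designed component.

Conclusion: The work shows that explicitly modeling multi-modal spatio-temporal asymmetry improves object tracking, and that hierarchical alignment minimizes information loss while preserving efficiency.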

📄 Abstract

RGB cameras excel at capturing rich texture details with high spatial resolution, whereas event cameras offer exceptional temporal resolution and a high dynamic range (HDR). Leveraging their complementary strengths can substantially enhance object tracking under challenging conditions, such as high-speed motion, HDR environments, and dynamic background interference. However, a significant spatio-temporal asymmetry exists between these two modalities due to their fundamentally different imaging mechanisms, hindering effective multi-modal integration. To address this issue, we propose Hierarchical Asymmetric Distillation (HAD), a multi-modal knowledge distillation framework that explicitly models and mitigates spatio-temporal asymmetries. Specifically, HAD proposes a hierarchical alignment strategy that minimizes information loss while maintaining the student network's computational efficiency and parameter compactness. Extensive experiments demonstrate that HAD consistently outperforms state-of-the-art methods, and comprehensive ablation studies further validate the effectiveness and necessity of each designed component. The code will be released soon.

[26] Can You Trust What You See? Alpha Channel No-Box Attacks on Video Object Detection

Ariana Yi, Ce Zhou, Liyang Xiao, Qiben Yan

🧩 TL;DR

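This paper presents α-Cloak, the first no-box adversarial attack on object detectors that operates through the alpha channel of RGBA videos. It fuses a malicious video with a benign one, producing a fused video that appears innocuous to human viewers but consistently fools object detectors.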

📘 Detailed Summary

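Motivation: As object detection models are increasingly deployed in cyber-physical systems such as autonomous vehicles and surveillance platforms, securing them against adversarial threats is essential. Prior work has mainly explored adversarial attacks in the image domain; attacks in the video domain, especially in the no-box setting, remain largely unexamined.

Method: α-Cloak uses the alpha channel to fuse a malicious target video with a benign video, with a fusion algorithm designed for visual stealth and compatibility. The attack requires no access to model architecture, parameters, or outputs and introduces no perceptible artifacts.

Result: Evaluated on five state-of-the-art object detectors, a vision-language model, and a multi-modal large language model, α-Cloak achieves a 100% attack success rate across all scenarios.

Conclusion: The work reveals a previously unexplored vulnerability in video-based perception systems and highlights the urgent need for defenses that account for the alpha channel in adversarial settings.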

📄 Abstract

As object detection models are increasingly deployed in cyber-physical systems such as autonomous vehicles (AVs) and surveillance platforms, ensuring their security against adversarial threats is essential. While prior work has explored adversarial attacks in the image domain, those attacks in the video domain remain largely unexamined, especially in the no-box setting. In this paper, we present α-Cloak, the first no-box adversarial attack on object detectors that operates entirely through the alpha channel of RGBA videos. α-Cloak exploits the alpha channel to fuse a malicious target video with a benign video, resulting in a fused video that appears innocuous to human viewers but consistently fools object detectors. Our attack requires no access to model architecture, parameters, or outputs, and introduces no perceptible artifacts. We systematically study the support for alpha channels across common video formats and playback applications, and design a fusion algorithm that ensures visual stealth and compatibility. We evaluate α-Cloak on five state-of-the-art object detectors, a vision-language model, and a multi-modal large language model (Gemini-2.0-Flash), demonstrating a 100% attack success rate across all scenarios. Our findings reveal a previously unexplored vulnerability in video-based perception systems, highlighting the urgent need for defenses that account for the alpha channel in adversarial settings.
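
The vulnerability class rests on a general fact about RGBA content: a player that honours alpha and a pipeline that discards it see different pixels. The sketch below shows only the textbook "over" compositing rule and that discrepancy; it is not the paper's fusion algorithm, and the array values and names are hypothetical. It also suggests the kind of check a defense might run, namely comparing the composited frame with the raw RGB planes before inference.

```python
import numpy as np

def composite_over(rgb, alpha, background):
    """Standard 'over' compositing: the frame a player that honours alpha displays."""
    a = alpha[..., None]  # broadcast alpha over the colour channels
    return a * rgb + (1.0 - a) * background

# Toy 2x2 frame with fully transparent pixels (hypothetical values in [0, 1]).
rgb = np.full((2, 2, 3), 0.9)
alpha = np.zeros((2, 2))
background = np.full((2, 2, 3), 0.1)

shown = composite_over(rgb, alpha, background)  # what a viewer sees
read_without_alpha = rgb                        # what a decoder that drops alpha passes on

# A large gap between the two views is a signal worth flagging.
max_gap = float(np.abs(shown - read_without_alpha).max())
```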

[27] Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing

Yusu Qian, Eli Bocek-Rivele, Liangchen Song, Jialing Tong, Yinfei Yang, Jiasen Lu, Wenze Hu, Zhe Gan

🧩 TL;DR

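This paper introduces Pico-Banana-400K, a large-scale, high-quality dataset of 400K images for instruction-based image editing, built with systematic quality control and a diverse editing taxonomy to support training and evaluation of text-guided image editing models.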

📘 Detailed Summary

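Motivation: Multimodal models have made remarkable progress in text-guided image editing, but the research community is constrained by the absence of large-scale, high-quality, openly accessible datasets built from real images. Existing datasets are mostly synthetic and do not meet the needs of research on complex editing scenarios.

Method: The dataset uses Nano-Banana to generate diverse edit pairs from real photographs in the OpenImages collection. A fine-grained image editing taxonomy ensures broad coverage of edit types, and MLLM-based quality scoring with careful curation maintains content preservation and instruction faithfulness. Three specialized subsets support research on multi-turn editing, preference alignment, and instruction rewriting.

Result: The result is a 400K-image dataset that includes a 72K-example multi-turn subset for sequential editing, a 56K-example preference subset for alignment research, and paired long-short instructions for developing instruction rewriting capabilities.

Conclusion: With systematic quality control and coverage of diverse editing scenarios, Pico-Banana-400K provides a foundation for text-guided image editing research that supports single-turn editing as well as complex editing scenarios, multi-turn reasoning, and instruction optimization.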

📄 Abstract

Recent advances in multimodal models have demonstrated remarkable text-guided image editing capabilities, with systems like GPT-4o and Nano-Banana setting new benchmarks. However, the research community's progress remains constrained by the absence of large-scale, high-quality, and openly accessible datasets built from real images. We introduce Pico-Banana-400K, a comprehensive 400K-image dataset for instruction-based image editing. Our dataset is constructed by leveraging Nano-Banana to generate diverse edit pairs from real photographs in the OpenImages collection. What distinguishes Pico-Banana-400K from previous synthetic datasets is our systematic approach to quality and diversity. We employ a fine-grained image editing taxonomy to ensure comprehensive coverage of edit types while maintaining precise content preservation and instruction faithfulness through MLLM-based quality scoring and careful curation. Beyond single-turn editing, Pico-Banana-400K enables research into complex editing scenarios. The dataset includes three specialized subsets: (1) a 72K-example multi-turn collection for studying sequential editing, reasoning, and planning across consecutive modifications; (2) a 56K-example preference subset for alignment research and reward model training; and (3) paired long-short editing instructions for developing instruction rewriting and summarization capabilities. By providing this large-scale, high-quality, and task-rich resource, Pico-Banana-400K establishes a robust foundation for training and benchmarking the next generation of text-guided image editing models.

[28] Decomposed Attention Fusion in MLLMs for Training-Free Video Reasoning Segmentation

Su Ho Han, Jeongseok Hyun, Pilhyeon Lee, Minho Shim, Dongyoon Wee, Seon Joo Kim

🧩 TL;DR

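This paper proposes DecAF, a training-free Decomposed Attention Fusion method. Using contrastive object-background fusion and complementary video-frame fusion, it converts the attention maps of multimodal large language models directly into video segmentation masks and achieves performance comparable to training-based methods on reasoning segmentation.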

📘 Detailed Summary

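Motivation: Existing methods jointly train MLLMs with SAM for video segmentation. This work aims for a solution that needs no retraining and uses MLLM attention directly for video reasoning segmentation, which requires addressing the fact that raw attention maps are noisy and poorly aligned with object regions.

Method: DecAF comprises contrastive object-background fusion to suppress irrelevant activations and complementary video-frame fusion to enhance object-focused cues, together with attention-guided SAM2 prompting to obtain fine-grained segmentation masks.

Result: On referring and reasoning video object segmentation benchmarks, DecAF outperforms training-free methods and achieves performance comparable to training-based methods.

Conclusion: The method shows that high-quality video segmentation is feasible without retraining MLLMs, and that attention fusion addresses the noise and misalignment of raw attention maps.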

📄 Abstract

Multimodal large language models (MLLMs) demonstrate strong video understanding by attending to visual tokens relevant to textual queries. To directly adapt this for localization in a training-free manner, we cast video reasoning segmentation as a video QA task and extract attention maps via a rollout mechanism. However, raw attention maps are noisy and poorly aligned with object regions. We propose Decomposed Attention Fusion (DecAF), which refines these maps through two mechanisms: (1) contrastive object-background fusion and (2) complementary video-frame fusion. This method suppresses irrelevant activations and enhances object-focused cues, enabling direct conversion of attention maps into coarse segmentation masks. In addition, we introduce attention-guided SAM2 prompting for obtaining fine-grained masks. Unlike existing methods that jointly train MLLMs with SAM, our method operates entirely without retraining. DecAF outperforms training-free methods and achieves performance comparable to training-based methods on both referring and reasoning VOS benchmarks. The code will be available at https://github.com/HYUNJS/DecAF.
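
To make the idea of contrastive object-background fusion concrete, the sketch below subtracts a normalized background attention map from a normalized object attention map and thresholds the result into a coarse mask. The abstract does not give the fusion formula, so the subtraction, the min-max normalization, and the 0.5 threshold are all assumptions for illustration, and the function names are hypothetical.

```python
import numpy as np

def minmax(a):
    """Rescale an array to [0, 1]; return zeros if it is constant."""
    rng = a.max() - a.min()
    return (a - a.min()) / rng if rng > 0 else np.zeros_like(a)

def contrastive_fusion(obj_attn, bg_attn):
    """Suppress activations that also respond to a background query (assumed form)."""
    diff = np.clip(minmax(obj_attn) - minmax(bg_attn), 0.0, None)
    return minmax(diff)

def to_coarse_mask(attn, threshold=0.5):
    """Binarise a fused attention map; the threshold is an assumed value."""
    return attn >= threshold

# Toy 2x2 attention maps: the bottom row responds mostly to the background query.
obj_attn = np.array([[0.9, 0.8], [0.7, 0.1]])
bg_attn = np.array([[0.1, 0.1], [0.8, 0.9]])
mask = to_coarse_mask(contrastive_fusion(obj_attn, bg_attn))
```

In this toy case the bottom-left pixel has high object attention but is removed because the background query attends to it even more strongly, which is the kind of irrelevant activation the abstract says the fusion suppresses.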

[29] MedReason-R1: Learning to Reason for CT Diagnosis with Reinforcement Learning and Local Zoom

Yifan Li, Fenghe Tang, Yingtai Li, Shaohua Kevin Zhou

🧩 TL;DR

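This study proposes MedReason-R1, a medical vision-language model. By constructing the CT-RATE-VQA dataset and introducing an explicit reasoning process, it addresses the weak diagnostic performance of general-purpose VLMs in the medical domain and achieves state-of-the-art performance in CT disease diagnosis.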

📘 Detailed Summary

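Motivation: General-purpose large vision-language models are good at generating descriptions of natural images, but their performance in the medical domain remains suboptimal, mainly because large-scale, high-quality, specialized medical imaging datasets are lacking and the coarse-to-fine diagnostic process is neglected.

Method: The authors construct the CT-RATE-VQA dataset with 84K QA pairs and propose MedReason-R1, a medical VLM that embeds zoomed-in disease region-of-interest areas into the image and applies the GRPO reinforcement learning framework to enable effective reasoning without costly manual annotations.

Result: Compared with recent general-purpose and medical VLMs, MedReason-R1 achieves state-of-the-art performance in CT disease diagnosis while retaining generalization.

Conclusion: The study highlights the role of both global localization and disease-specific details in improving diagnostic performance, and contributes a reasoning framework and dataset for medical vision-language models.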

📄 Abstract

General-purpose large Vision-Language Models (VLMs) demonstrate strong capabilities in generating detailed descriptions for natural images. However, their performance in the medical domain remains suboptimal, even for relatively straightforward tasks, primarily due to the lack of large-scale, high-quality, specialized medical imaging datasets and the neglect of the diagnostic process that progresses from coarse to fine-grained. To address the first issue, we construct the CT-RATE-VQA dataset, which has 84K QA pairs. For the second issue, we propose MedReason-R1, a medical VLM with an explicit reasoning process for disease diagnosis. MedReason-R1 incorporates a novel strategy that embeds zoom-in disease region-of-interest areas into the image, highlighting the crucial role of both global localization and disease-specific details in enhancing the model's diagnostic performance. Furthermore, we introduce the GRPO reinforcement learning framework to MedReason-R1, which enables effective reasoning without relying on costly manual annotations. Compared to recent general-purpose and medical VLMs, MedReason-R1 achieves state-of-the-art performance in CT disease diagnosis while retaining generalization. The code, checkpoints, and dataset are available at: https://github.com/Leevan001/MedReason-R1
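
GRPO avoids a learned value function by scoring each sampled response relative to the other responses drawn for the same prompt. The sketch below shows that group-relative normalization in isolation. The reward values are hypothetical (for example, 1.0 for a correct diagnosis and 0.0 otherwise), the use of the population standard deviation is an assumption, and how MedReason-R1 defines its rewards is not stated in the abstract.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each response's reward by the mean and std of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumed choice
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four responses sampled for the same question.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that beat the group average receive a positive advantage and are reinforced, while below-average ones are discouraged. This is why a rule-based reward on the final answer can be enough to drive training without step-level manual annotations of the reasoning.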

[30] Curvilinear Structure-preserving Unpaired Cross-domain Medical Image Translation

Zihao Chen, Yi Zhou, Xudong Jiang, Li Chen, Leopold Schmetterer, Bingyao Tan, Jun Cheng

🧩 TL;DR

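This paper proposes Curvilinear Structure-preserving Translation (CST), a general framework that explicitly preserves fine curvilinear structures during unpaired image translation, addressing the tendency of existing methods to distort fine structures such as microvasculature in medical imaging. By integrating structure consistency supervision, it achieves state-of-the-art translation performance across several medical imaging modalities.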

📘 Detailed Summary

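Motivation: Existing unpaired image translation methods often distort fine curvilinear structures such as microvasculature in medical images, undermining diagnostic reliability and quantitative analysis. In ophthalmic and vascular imaging, subtle morphological changes carry significant clinical meaning, so a translation method that keeps these structures intact is needed.

Method: CST integrates a curvilinear structure extraction module that provides topological supervision, building structure consistency into training. It can be incorporated seamlessly into existing methods; the authors apply it to two representative backbones, CycleGAN and UNSB.

Result: Comprehensive evaluation on three imaging modalities (optical coherence tomography angiography, color fundus, and X-ray coronary angiography) shows that CST improves translation fidelity and achieves state-of-the-art performance, with particular gains in preserving fine curvilinear structures.

Conclusion: By reinforcing geometric integrity in the learned mappings, CST establishes a principled pathway toward curvilinear structure-aware cross-domain translation in medical imaging.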

📄 Abstract

Unpaired image-to-image translation has emerged as a crucial technique in medical imaging, enabling cross-modality synthesis, domain adaptation, and data augmentation without costly paired datasets. Yet, existing approaches often distort fine curvilinear structures, such as microvasculature, undermining both diagnostic reliability and quantitative analysis. This limitation is consequential in ophthalmic and vascular imaging, where subtle morphological changes carry significant clinical meaning. We propose Curvilinear Structure-preserving Translation (CST), a general framework that explicitly preserves fine curvilinear structures during unpaired translation by integrating structure consistency into the training. Specifically, CST augments baseline models with a curvilinear extraction module for topological supervision. It can be seamlessly incorporated into existing methods. We integrate it into CycleGAN and UNSB as two representative backbones. Comprehensive evaluation across three imaging modalities (optical coherence tomography angiography, color fundus, and X-ray coronary angiography) demonstrates that CST improves translation fidelity and achieves state-of-the-art performance. By reinforcing geometric integrity in learned mappings, CST establishes a principled pathway toward curvilinear structure-aware cross-domain translation in medical imaging.

[31] OmniMotion-X: Versatile Multimodal Whole-Body Motion Generation

Guowei Xu, Yuxuan Bian, Ailing Zeng, Mingyi Shi, Shaoli Huang, Wen Li, Lixin Duan, Qiang Xu

🧩 TL;DR

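This paper proposes OmniMotion-X, a unified sequence-to-sequence framework built on an autoregressive diffusion transformer for multimodal whole-body human motion generation. The framework supports tasks such as text-to-motion, music-to-dance, and speech-to-gesture, and resolves multimodal conflicts with a progressive weak-to-strong mixed-condition training strategy.

📘 Detailed Summary

Motivation: Current human motion generation methods are usually designed for single-modality tasks and lack a unified framework for diverse multimodal input and output scenarios. Existing methods are limited in content consistency, style preservation, and control over temporal dynamics, making it hard to support complex, interactive, long-sequence motion generation.

Method: The authors propose a unified sequence-to-sequence framework based on an autoregressive diffusion transformer and introduce reference motion as a novel conditioning signal to improve the consistency of generated content. A progressive weak-to-strong mixed-condition training strategy handles multimodal conflicts. They also construct the OmniMoCap-X dataset, which integrates 28 public MoCap sources and uses GPT-4o to automatically generate structured, hierarchical annotations.

Result: Extensive experiments show that OmniMotion-X significantly surpasses existing methods on multiple multimodal tasks and achieves state-of-the-art performance. The framework generates realistic, coherent, and controllable long-duration motion sequences, with superior generation quality on tasks such as text-to-motion and music-to-dance.

Conclusion: OmniMotion-X demonstrates the feasibility of a unified multimodal motion generation framework and offers an effective solution for complex human animation. The work lays a foundation for future multimodal interactive motion generation systems and shows the value of reference-motion conditioning for keeping generated content consistent.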

📄 Abstract

This paper introduces OmniMotion-X, a versatile multimodal framework for whole-body human motion generation, leveraging an autoregressive diffusion transformer in a unified sequence-to-sequence manner. OmniMotion-X efficiently supports diverse multimodal tasks, including text-to-motion, music-to-dance, speech-to-gesture, and global spatial-temporal control scenarios (e.g., motion prediction, in-betweening, completion, and joint/trajectory-guided synthesis), as well as flexible combinations of these tasks. Specifically, we propose the use of reference motion as a novel conditioning signal, substantially enhancing the consistency of generated content, style, and temporal dynamics crucial for realistic animations. To handle multimodal conflicts, we introduce a progressive weak-to-strong mixed-condition training strategy. To enable high-quality multimodal training, we construct OmniMoCap-X, the largest unified multimodal motion dataset to date, integrating 28 publicly available MoCap sources across 10 distinct tasks, standardized to the SMPL-X format at 30 fps. To ensure detailed and consistent annotations, we render sequences into videos and use GPT-4o to automatically generate structured and hierarchical captions, capturing both low-level actions and high-level semantics. Extensive experimental evaluations confirm that OmniMotion-X significantly surpasses existing methods, demonstrating state-of-the-art performance across multiple multimodal tasks and enabling the interactive generation of realistic, coherent, and controllable long-duration motions.

[32] Class-Aware Prototype Learning with Negative Contrast for Test-Time Adaptation of Vision-Language Models

Xiaozhen Qiao, Jingkai Zhao, Yuqiu Jiang, Xianda Guo, Zhe Sun, Hongyuan Zhang, Xuelong Li

🧩 TL;DR

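This paper proposes CPL-NC, a lightweight test-time adaptation framework designed for vision-language models. Through a class-aware prototype cache and a negative contrastive learning mechanism, it addresses prototype degradation and class confusion under distribution shift.

📘 Detailed Summary

Motivation: Vision-language models generalize well zero-shot, but their performance drops markedly when the deployment distribution shifts away from the training distribution. Existing test-time adaptation methods overlook prototype degradation in long-tailed distributions and confusion between semantically similar classes, so an adaptation framework tailored to VLMs is needed to improve generalization under distribution shift.

Method: CPL-NC has two core components. The Class-Aware Prototype Cache module dynamically adjusts per-class capacity based on test-time frequency and activation history, and retains rare-category knowledge through a rejuvenation mechanism. The Negative Contrastive Learning mechanism identifies and constrains hard visual-textual negatives to improve class separability. The framework uses asymmetric optimization, refining only the textual prototypes while keeping visual features stable.

Result: Experiments on 15 benchmarks show that CPL-NC consistently outperforms prior test-time adaptation methods on both ResNet-50 and ViT-B/16 backbones, demonstrating its effectiveness and robustness across a range of distribution-shift scenarios.

Conclusion: By directly targeting prototype degradation and class confusion, CPL-NC provides an effective, lightweight adaptation scheme for deploying vision-language models under real-world distribution shift. The work highlights the importance of accounting for class-distribution properties and semantic relationships in test-time adaptation and suggests new directions for future research.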

📄 Abstract

Vision-Language Models (VLMs) demonstrate impressive zero-shot generalization through large-scale image-text pretraining, yet their performance can drop once the deployment distribution diverges from the training distribution. To address this, Test-Time Adaptation (TTA) methods update models using unlabeled target data. However, existing approaches often ignore two key challenges: prototype degradation in long-tailed distributions and confusion between semantically similar classes. To tackle these issues, we propose Class-Aware Prototype Learning with Negative Contrast (CPL-NC), a lightweight TTA framework designed specifically for VLMs to enhance generalization under distribution shifts. CPL-NC introduces a Class-Aware Prototype Cache module that dynamically adjusts per-class capacity based on test-time frequency and activation history, with a rejuvenation mechanism for inactive classes to retain rare-category knowledge. Additionally, a Negative Contrastive Learning mechanism identifies and constrains hard visual-textual negatives to improve class separability. The framework employs asymmetric optimization, refining only textual prototypes while anchoring on stable visual features. Experiments on 15 benchmarks show that CPL-NC consistently outperforms prior TTA methods across both ResNet-50 and ViT-B/16 backbones.

cs.CL [Back]

[33] Small Language Models Offer Significant Potential for Science Community

Jian Zhang

🧩 TL;DR

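This study develops a framework based on small language models (MiniLMs) for precise, rapid, and cost-effective information retrieval from the geoscience literature; compared with large language models, it extracts expert-verified quantitative information more effectively.

📘 Detailed Summary

Motivation: Although large language models have driven major advances in natural language processing, they raise concerns about information bias and high computational cost. This study explores whether freely available small language models can deliver precise, rapid, and cost-controlled information retrieval from a large body of geoscience literature.

Method: The author builds a corpus of roughly 77 million high-quality sentences from 95 leading geoscience journals published between 2000 and 2024. Small language models perform computationally efficient information extraction via semantic search and sentence-level indexing, combined with sentiment analysis and unsupervised clustering to analyze emotional tone and topical evolution.

Result: Compared with the generalized responses produced by large language models such as ChatGPT-4, the MiniLM approach excels at identifying expert-verified information from multi-disciplinary sources, is especially good at extracting information with quantitative findings, and can track the evolution of conclusions, research priorities, and emerging questions in the geoscience community.

Conclusion: MiniLM has significant potential in the geoscience community for fact and image retrieval, trend analysis, contradiction analysis, and education, offering a computationally efficient and precise alternative for domain-specific information retrieval.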

📄 Abstract

Recent advancements in natural language processing, particularly with large language models (LLMs), are transforming how scientists engage with the literature. While the adoption of LLMs is increasing, concerns remain regarding potential information biases and computational costs. Rather than relying on LLMs, I developed a framework to evaluate the feasibility of precise, rapid, and cost-effective information retrieval from extensive geoscience literature using freely available small language models (MiniLMs). A curated corpus of approximately 77 million high-quality sentences was constructed, extracted from 95 leading peer-reviewed geoscience journals, such as Geophysical Research Letters and Earth and Planetary Science Letters, published from 2000 to 2024. MiniLMs enable a computationally efficient approach for extracting relevant domain-specific information from this corpus through semantic search techniques and sentence-level indexing. This approach, unlike LLMs such as ChatGPT-4 that often produce generalized responses, excels at identifying substantial amounts of expert-verified information with established, multi-disciplinary sources, especially for information with quantitative findings. Furthermore, by analyzing emotional tone via sentiment analysis and topical clusters through unsupervised clustering within sentences, MiniLM provides a powerful tool for tracking the evolution of conclusions, research priorities, advancements, and emerging questions within geoscience communities. Overall, MiniLM holds significant potential within the geoscience community for applications such as fact and image retrievals, trend analyses, contradiction analyses, and educational purposes.
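
The sentence-level semantic search described above can be illustrated with a minimal sketch. It assumes each sentence has already been embedded; the three-dimensional vectors, the example sentences, and the names `cosine` and `top_k` are hypothetical stand-ins, and in practice the vectors would come from a MiniLM-style sentence encoder rather than being written by hand.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, k=2):
    """Return the k (sentence, score) pairs most similar to the query."""
    scored = [(sentence, cosine(query_vec, vec)) for sentence, vec in index]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy sentence-level index: made-up sentences with made-up embeddings.
index = [
    ("sentence about mantle viscosity", [0.9, 0.1, 0.0]),
    ("sentence about sea-level rise", [0.1, 0.9, 0.1]),
    ("sentence about plate velocity", [0.8, 0.2, 0.1]),
]
hits = top_k([1.0, 0.0, 0.0], index, k=2)
for sentence, score in hits:
    print(f"{score:.3f}  {sentence}")
```

A brute-force scan like this is only practical for small indexes; a corpus of tens of millions of sentences would presumably need an approximate nearest-neighbor index, which the abstract does not describe.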

[34] DuoLens: A Framework for Robust Detection of Machine-Generated Multilingual Text and Code

Shriyansh Agrawal, Aidan Lau, Sanyam Shah, Ahan M R, Kevin Zhu, Sunishchal Dev, Vasu Sharma

🧩 TL;DR

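This study proposes detecting machine-generated content by fine-tuning encoder-only small language models (SLMs), showing that on binary classification SLMs substantially outperform large language models at a much lower computational cost. The method keeps detection accuracy high while cutting latency by 8-12x and peak VRAM usage by 3-5x.

📘 Detailed Summary

Motivation: Current detectors of LLM-generated content, mostly based on zero-shot methods, suffer from high computational cost or insufficient accuracy. Methods such as Fast DetectGPT and GPTZero show a clear trade-off between efficiency and accuracy, so more efficient and accurate detectors are needed.

Method: The study fine-tunes pre-trained encoder-only small language models, specifically RoBERTa and CodeBERTa, on specialized datasets of source code and natural language, focusing on the binary classification task.

Result: The encoder models reach an AUROC of 0.97-0.99 and a macro-F1 of 0.89-0.94, with 8-12x lower latency and 3-5x lower peak VRAM at 512-token inputs. Under cross-generator shifts and adversarial transformations, performance retains at least 92% of the clean AUROC.

Conclusion: The study demonstrates clear advantages of small language models for machine-generated content detection: large gains in computational efficiency together with strong generalization and adversarial robustness, making them a viable option for practical deployment.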

📄 Abstract

The prevalence of Large Language Models (LLMs) for generating multilingual text and source code has only increased the imperative for machine-generated content detectors to be accurate and efficient across domains. Current detectors, predominantly utilizing zero-shot methods such as Fast DetectGPT or GPTZero, either incur high computational cost or lack sufficient accuracy, often with a trade-off between the two, leaving room for further improvement. To address these gaps, we propose fine-tuning encoder-only Small Language Models (SLMs), in particular the pre-trained RoBERTa and CodeBERTa models, on specialized datasets of source code and natural language, to show that for binary classification SLMs outperform LLMs by a huge margin while using a fraction of the compute. Our encoders achieve AUROC of 0.97 to 0.99 and macro-F1 of 0.89 to 0.94 while reducing latency by 8-12x and peak VRAM by 3-5x at 512-token inputs. Under cross-generator shifts and adversarial transformations (paraphrase, back-translation; code formatting/renaming), performance retains at least 92% of clean AUROC. We release training and evaluation scripts with seeds and configs; a reproducibility checklist is also included.
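
To make the robustness figure concrete, here is a minimal sketch of how AUROC and the "fraction of clean AUROC retained" could be computed. It is not the authors' evaluation script: the function names and toy scores are hypothetical, and AUROC is computed by direct pairwise comparison (the probability that a random positive outscores a random negative), which is exact but quadratic in the number of examples.

```python
def auroc(labels, scores):
    """AUROC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

def retention(clean_auroc, shifted_auroc):
    """Fraction of clean AUROC that survives a shift or transformation."""
    return shifted_auroc / clean_auroc

# Toy data: 1 = machine-generated, 0 = human-written (made-up scores).
labels = [1, 1, 0, 0]
clean = auroc(labels, [0.9, 0.8, 0.2, 0.1])
shifted = auroc(labels, [0.9, 0.3, 0.4, 0.1])
print(clean, shifted, retention(clean, shifted))
```

Under this reading, the abstract's claim corresponds to `retention(clean, shifted) >= 0.92` for each shift considered.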

Chen Chen, ZeYang Hu, Fengjiao Chen, Liya Ma, Jiaxing Liu, Xiaoyu Li, Xuezhi Cao

🧩 TL;DR

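This paper proposes MMAO-Bench, a novel unified multimodal benchmark for evaluating both uni-modal and omni-modal understanding, and reveals a compositional law between cross-modal and uni-modal performance.

📘 Detailed Summary

Motivation: Multimodal large language models are moving from uni-modal understanding toward unifying the visual, audio, and language modalities, but the correlation between uni-modal and omni-modal capability remains unclear; comprehensive evaluation is needed to drive the intelligence evolution of omni models.

Method: The authors propose MMAO-Bench, a high-quality and diverse omni-modal benchmark with 1880 human-curated samples across 44 task types, and introduce an innovative multi-step open-ended question type to better assess complex reasoning.

Result: Experiments reveal a compositional law between cross-modal and uni-modal performance: omni-modal capability acts as a bottleneck on weak models but shows synergistic promotion on strong models.

Conclusion: The study shows that omni-modal capability behaves differently depending on model strength, providing guidance for evaluating and developing multimodal models and indicating that omni-modal understanding yields synergistic gains only when a model has sufficient foundational capability.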

📄 Abstract

Multimodal large language models have been progressing from uni-modal understanding toward unifying visual, audio, and language modalities; such unified models are collectively termed omni models. However, the correlation between uni-modal and omni-modal capability remains unclear, and comprehensive evaluation is needed to drive the intelligence evolution of omni models. In this work, we propose a novel, high-quality, and diverse omni-model benchmark, MultiModal All in One Benchmark (MMAO-Bench), which effectively assesses both uni-modal and omni-modal understanding capabilities. The benchmark consists of 1880 human-curated samples across 44 task types, plus an innovative multi-step open-ended question type that better assesses complex reasoning tasks. Experimental results show a compositional law between cross-modal and uni-modal performance: omni-modal capability manifests as a bottleneck effect on weak models, while exhibiting synergistic promotion on strong models.

[36] Transformer-Based Low-Resource Language Translation: A Study on Standard Bengali to Sylheti

Mangsura Kabir Oni, Tabia Tanzin Prama

🧩 TL;DR

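This study explores Bengali-to-Sylheti machine translation by fine-tuning multilingual Transformer models and comparing them with zero-shot large language models. Fine-tuned models significantly outperform the LLMs, with mBART-50 achieving the best translation adequacy and MarianMT the strongest character-level fidelity.

📘 Detailed Summary

Motivation: Although Transformer-based neural machine translation has achieved impressive results for high-resource languages, low-resource languages such as Sylheti remain underexplored. This study aims to fill the gap in Bengali-to-Sylheti translation and explore machine translation methods suited to low-resource languages.

Method: The study fine-tunes multilingual Transformer models and compares them with zero-shot large language models. Specifically, pre-trained models such as mBART-50 and MarianMT are adapted to the Bengali-to-Sylheti translation task.

Result: Fine-tuned models significantly outperform zero-shot large language models in translation quality. mBART-50 achieves the highest translation adequacy, while MarianMT shows the strongest character-level fidelity, indicating that different architectures have strengths along different quality dimensions.

Conclusion: The study underscores the importance of task-specific adaptation for low-resource machine translation and shows that fine-tuning pre-trained models effectively improves translation quality for such languages. The findings inform the development of inclusive language technologies and advance machine translation research for underrepresented languages.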

📄 Abstract

Machine Translation (MT) has advanced from rule-based and statistical methods to neural approaches based on the Transformer architecture. While these methods have achieved impressive results for high-resource languages, low-resource varieties such as Sylheti remain underexplored. In this work, we investigate Bengali-to-Sylheti translation by fine-tuning multilingual Transformer models and comparing them with zero-shot large language models (LLMs). Experimental results demonstrate that fine-tuned models significantly outperform LLMs, with mBART-50 achieving the highest translation adequacy and MarianMT showing the strongest character-level fidelity. These findings highlight the importance of task-specific adaptation for underrepresented languages and contribute to ongoing efforts toward inclusive language technologies.

[37] Tibetan Language and AI: A Comprehensive Survey of Resources, Methods and Challenges

Cheng Huang, Nyima Tashi, Fan Gao, Yutong Liu, Jiahao Li, Hao Tian, Siyang Jiang, Thupten Tsering, Ban Ma-bao, Renzeg Duojie, Gadeng Luosang, Rinchen Dongrub, Dorje Tashi, Jin Zhang, Xiao Feng, Hao Wang, Jie Tang, Guojie Tang, Xiangxiang Wang, Jia Zhang, Tsengdar Lee, Yongbin Yu

🧩 TL;DR

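This paper surveys the current state of Tibetan AI research, systematically reviewing textual and speech data resources, NLP tasks, machine translation, speech recognition, and large language models, with the aim of providing a foundational reference for building AI ecosystems for low-resource languages.

📘 Detailed Summary

Motivation: Tibetan, one of the major low-resource languages in Asia, has unique linguistic and sociocultural characteristics, yet it has received limited attention in AI research because of a lack of accessible data resources, standardized benchmarks, and dedicated tools. This paper aims to fill that gap.

Method: The survey systematically categorizes existing datasets and tools, evaluates the methods used across different tasks, and compares performance where possible.

Result: The survey identifies persistent bottlenecks, including data sparsity, orthographic variation, and the lack of unified evaluation metrics, and discusses the potential of cross-lingual transfer, multimodal learning, and community-driven resource creation.

Conclusion: The survey provides a foundational reference for future Tibetan AI research and encourages collaborative efforts to build an inclusive and sustainable AI ecosystem for low-resource languages.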

📄 Abstract

Tibetan, one of the major low-resource languages in Asia, presents unique linguistic and sociocultural characteristics that pose both challenges and opportunities for AI research. Despite increasing interest in developing AI systems for underrepresented languages, Tibetan has received limited attention due to a lack of accessible data resources, standardized benchmarks, and dedicated tools. This paper provides a comprehensive survey of the current state of Tibetan AI, covering textual and speech data resources, NLP tasks, machine translation, speech recognition, and recent developments in LLMs. We systematically categorize existing datasets and tools, evaluate methods used across different tasks, and compare performance where possible. We also identify persistent bottlenecks such as data sparsity, orthographic variation, and the lack of unified evaluation metrics. Additionally, we discuss the potential of cross-lingual transfer, multi-modal learning, and community-driven resource creation. This survey aims to serve as a foundational reference for future Tibetan AI research and encourages collaborative efforts to build an inclusive and sustainable AI ecosystem for low-resource languages.

[38] KORE: Enhancing Knowledge Injection for Large Multimodal Models via Knowledge-Oriented Augmentations and Constraints

Kailin Jiang, Hongbo Jiang, Ning Jiang, Zhi Gao, Jinhe Bi, Yuchen Ren, Bin Li, Yuntao Du, Lei Liu, Qing Li

🧩 TL;DR

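This paper proposes KORE, which combines knowledge-oriented augmentations and constraints to address two problems large multimodal models face during knowledge injection: difficulty learning new knowledge and catastrophic forgetting. It achieves accurate knowledge adaptation and strong knowledge retention.

📘 Detailed Summary

Motivation: Large multimodal models encode extensive factual knowledge in their pre-trained weights, but that knowledge is static and limited and cannot keep pace with real-world developments, which hinders continual knowledge acquisition. Existing methods often struggle to learn new knowledge and suffer from catastrophic forgetting, so the dual challenge of knowledge adaptation and knowledge retention must be addressed.

Method: KORE has two core components. Knowledge-oriented augmentation automatically converts individual knowledge items into structured, comprehensive knowledge representations so that the model learns new knowledge accurately. The knowledge constraint stores previous knowledge in the covariance matrix of the LMM's linear-layer activations and initializes the adapter by projecting the original weights into that matrix's null space, defining a fine-tuning direction that minimizes interference with previous knowledge.

Result: Extensive experiments on several LMMs, including LLaVA-v1.5-7B, LLaVA-v1.5-13B, and Qwen2.5-VL-7B, show that KORE achieves superior new-knowledge injection performance and effectively mitigates catastrophic forgetting.

Conclusion: Through its synergistic augmentation and constraint mechanisms, KORE offers an effective continual-learning solution for large multimodal models, injecting new knowledge while preserving existing knowledge and providing a technical path for updating model knowledge and extending model capabilities.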

📄 Abstract

Large Multimodal Models (LMMs) encode extensive factual knowledge in their pre-trained weights. However, their knowledge remains static and limited, unable to keep pace with real-world developments, which hinders continuous knowledge acquisition. Effective knowledge injection thus becomes critical, involving two goals: knowledge adaptation (injecting new knowledge) and knowledge retention (preserving old knowledge). Existing methods often struggle to learn new knowledge and suffer from catastrophic forgetting. To address this, we propose KORE, a synergistic method of KnOwledge-oRientEd augmentations and constraints for injecting new knowledge into large multimodal models while preserving old knowledge. Unlike general text or image data augmentation, KORE automatically converts individual knowledge items into structured and comprehensive knowledge to ensure that the model accurately learns new knowledge, enabling accurate adaptation. Meanwhile, KORE stores previous knowledge in the covariance matrix of the LMM's linear-layer activations and initializes the adapter by projecting the original weights into the matrix's null space, defining a fine-tuning direction that minimizes interference with previous knowledge, enabling powerful retention. Extensive experiments on various LMMs, including LLaVA-v1.5-7B, LLaVA-v1.5-13B, and Qwen2.5-VL-7B, show that KORE achieves superior new knowledge injection performance and effectively mitigates catastrophic forgetting.
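
The null-space idea can be illustrated with a small linear-algebra sketch. This is not the authors' implementation: it assumes a toy setting in which the old activations span a low-dimensional subspace, and it swaps in Gram-Schmidt (to stay dependency-free) where a real system would more likely decompose the covariance matrix numerically. The null space of the covariance `sum_x x x^T` is the orthogonal complement of the span of the activations `x`, so weights projected by `P = I - Q Q^T` give zero response on every old activation. All names and numbers below are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: orthonormal basis Q for the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis

def null_space_projector(activations):
    """P = I - Q Q^T, the projector onto the null space of sum_x x x^T."""
    dim = len(activations[0])
    basis = orthonormal_basis(activations)
    return [
        [(1.0 if i == j else 0.0) - sum(q[i] * q[j] for q in basis)
         for j in range(dim)]
        for i in range(dim)
    ]

def matmul(a, b):
    return [[dot(row, col) for col in zip(*b)] for row in a]

# Toy "previous knowledge": two activation vectors in a 3-d feature space.
old_activations = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
weights = [[2.0, -1.0, 3.0], [0.5, 4.0, -2.0]]  # original layer weights (2x3)

projector = null_space_projector(old_activations)
adapter_init = matmul(weights, projector)  # weights projected into the null space
response = [dot(row, old_activations[1]) for row in adapter_init]
print(adapter_init, response)
```

In this toy case the projected weights keep only the third input column, so the adapter's output on either old activation is zero, which is the sense in which such an initialization avoids interfering with previously stored knowledge.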

[39] M3-SLU: Evaluating Speaker-Attributed Reasoning in Multimodal Large Language Models

Yejin Kwon, Taewoo Kang, Hyunsoo Yoon, Changouk Kim

🧩 TL;DR

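This paper presents M3-SLU, a multimodal large language model benchmark for evaluating multi-speaker, multi-turn dialogue understanding, which exposes a key weakness of current models in speaker-attributed reasoning. The benchmark contains over 12,000 validated instances and covers two core tasks: Speaker-Attributed Question Answering and Speaker Attribution via Utterance Matching.

📘 Detailed Summary

Motivation: Although current models perform well at speech and text comprehension, they still fall short at speaker-attributed reasoning, that is, understanding who said what and when in natural conversation. Existing benchmarks lack a systematic evaluation of speaker-aware understanding in multi-speaker, multi-turn dialogue.

Method: M3-SLU is built from four open corpora and consists of validated instances with paired audio, transcripts, and metadata. It includes two core tasks: Speaker-Attributed Question Answering and Speaker Attribution via Utterance Matching. Baseline results are provided for cascaded pipelines and end-to-end MLLMs, evaluated with an LLM-as-Judge and accuracy metrics.

Result: Models capture conversational content reasonably well but perform poorly at identifying who is speaking. The benchmark reveals a key gap in speaker-aware dialogue understanding and offers a challenging evaluation standard for multimodal understanding research.

Conclusion: As a challenging benchmark, M3-SLU is intended to advance research on speaker-aware multimodal understanding. The study emphasizes the importance of models that understand both what is said and who says it, pointing the way for future multimodal dialogue systems.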

📄 Abstract

We present M3-SLU, a new multimodal large language model (MLLM) benchmark for evaluating multi-speaker, multi-turn spoken language understanding. While recent models show strong performance in speech and text comprehension, they still struggle with speaker-attributed reasoning, the ability to understand who said what and when in natural conversations. M3-SLU is built from four open corpora (CHiME-6, MELD, MultiDialog, and AMI) and comprises over 12,000 validated instances with paired audio, transcripts, and metadata. It includes two tasks: (1) Speaker-Attributed Question Answering and (2) Speaker Attribution via Utterance Matching. We provide baseline results for both cascaded pipelines and end-to-end MLLMs, evaluated using an LLM-as-Judge and accuracy metrics. Results show that while models can capture what was said, they often fail to identify who said it, revealing a key gap in speaker-aware dialogue understanding. M3-SLU serves as a challenging benchmark to advance research in speaker-aware multimodal understanding.

[40] Local Obfuscation by GLINER for Impartial Context Aware Lineage: Development and evaluation of PII Removal system

Prakrithi Shivaprakash, Lekhansh Shukla, Animesh Mukherjee, Prabhat Chand, Pratima Murthy

🧩 TL;DR

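This study presents LOGICAL, an efficient, locally deployable PII removal system built on a fine-tuned GLiNER model. It significantly outperforms large language models at de-identifying clinical notes, offering an accurate, secure, and computationally efficient solution for resource-constrained settings.

📘 Detailed Summary

Motivation: Removing personally identifiable information from electronic health records is essential for research and AI development, but the high computational cost of large language models and the data privacy risks of API-based services limit their use in resource-constrained settings, particularly given the privacy requirements of clinical environments.

Method: The authors develop LOGICAL, a locally deployed system based on a fine-tuned GLiNER model. They define nine PII categories, fine-tune the modern-gliner-bi-large-v1.0 model on 2849 text instances, and evaluate it on 376 test instances using character-level precision, recall, and F1-score.

Result: The fine-tuned GLiNER model achieves an overall micro-average F1-score of 0.980, significantly outperforming Gemini-Pro-2.5 at 0.845. LOGICAL completely sanitised 95% of documents, compared with 64% for the next-best solution, and the model runs efficiently on a standard laptop without a dedicated GPU.

Conclusion: Fine-tuned, specialised transformer models such as GLiNER offer an accurate, computationally efficient, and secure solution for removing PII from clinical notes. This "sanitisation at the source" approach is a practical alternative to resource-intensive LLMs, especially in resource-constrained environments, although the 2% entity-level false negative rate underscores the need for human-in-the-loop validation across all tested systems.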

📄 Abstract

Removing Personally Identifiable Information (PII) from clinical notes in Electronic Health Records (EHRs) is essential for research and AI development. While Large Language Models (LLMs) are powerful, their high computational costs and the data privacy risks of API-based services limit their use, especially in low-resource settings. To address this, we developed LOGICAL (Local Obfuscation by GLINER for Impartial Context-Aware Lineage), an efficient, locally deployable PII removal system built on a fine-tuned Generalist and Lightweight Named Entity Recognition (GLiNER) model. We used 1515 clinical documents from a psychiatric hospital's EHR system. We defined nine PII categories for removal. A modern-gliner-bi-large-v1.0 model was fine-tuned on 2849 text instances and evaluated on a test set of 376 instances using character-level precision, recall, and F1-score. We compared its performance against Microsoft Azure NER, Microsoft Presidio, and zero-shot prompting with Gemini-Pro-2.5 and Llama-3.3-70B-Instruct. The fine-tuned GLiNER model achieved superior performance, with an overall micro-average F1-score of 0.980, significantly outperforming Gemini-Pro-2.5 (F1-score: 0.845). LOGICAL correctly sanitised 95% of documents completely, compared to 64% for the next-best solution. The model operated efficiently on a standard laptop without a dedicated GPU. However, a 2% entity-level false negative rate underscores the need for human-in-the-loop validation across all tested systems. Fine-tuned, specialised transformer models like GLiNER offer an accurate, computationally efficient, and secure solution for PII removal from clinical notes. This "sanitisation at the source" approach is a practical alternative to resource-intensive LLMs, enabling the creation of de-identified datasets for research and AI development while preserving data privacy, particularly in resource-constrained environments.
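
Character-level precision, recall, and F1 can be sketched as follows. The abstract does not spell out the exact scoring protocol, so this assumes the simplest reading: every character offset covered by a gold PII span is a positive, and predictions are scored by offset overlap. The example note, the spans, and the function names are made up for illustration.

```python
def char_set(spans):
    """Expand half-open (start, end) spans into a set of character offsets."""
    return {i for start, end in spans for i in range(start, end)}

def char_prf(gold_spans, pred_spans):
    """Character-level precision, recall, and F1 over covered offsets."""
    gold, pred = char_set(gold_spans), char_set(pred_spans)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical note: gold marks the name and the date, the prediction
# catches the date but only part of the name.
text = "Seen by Dr Rao on 12 May."
gold = [(8, 14), (18, 24)]   # "Dr Rao", "12 May"
pred = [(11, 14), (18, 24)]  # "Rao", "12 May"
precision, recall, f1 = char_prf(gold, pred)
print(precision, recall, round(f1, 3))
```

Scoring by character rather than by entity gives partial credit for partially redacted spans, which is also why an entity-level false negative rate, as reported in the abstract, is a useful complementary measure.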

[41] The Massive Legal Embedding Benchmark (MLEB)

Umar Butler, Abdur-Rahman Butler, Adrian Lucas Malec

🧩 TL;DR

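This paper presents MLEB (the Massive Legal Embedding Benchmark), the largest, most diverse, and most comprehensive open-source benchmark for legal information retrieval to date, comprising ten expert-annotated datasets that span multiple jurisdictions, document types, and task types.

📘 Detailed Summary

Motivation: Open-source legal information retrieval currently suffers from gaps in domain and jurisdictional coverage and lacks large-scale, diverse benchmark datasets to support comprehensive evaluation and comparison of legal AI systems.

Method: The authors build a benchmark of ten expert-annotated datasets covering multiple jurisdictions (the US, UK, EU, Australia, Ireland, and Singapore), document types (cases, legislation, regulatory guidance, contracts, and literature), and task types (search, zero-shot classification, and question answering).

Result: MLEB is the largest and most diverse legal information retrieval benchmark to date. Seven of its datasets were newly constructed to fill domain gaps, and the authors release the code, results, and data needed for reproducible evaluation.

Conclusion: MLEB provides a standardized evaluation framework for legal AI research, supports reproducible comparison and development of legal information retrieval systems, and fills an important gap in open-source legal benchmarks.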

📄 Abstract

We present the Massive Legal Embedding Benchmark (MLEB), the largest, most diverse, and most comprehensive open-source benchmark for legal information retrieval to date. MLEB consists of ten expert-annotated datasets spanning multiple jurisdictions (the US, UK, EU, Australia, Ireland, and Singapore), document types (cases, legislation, regulatory guidance, contracts, and literature), and task types (search, zero-shot classification, and question answering). Seven of the datasets in MLEB were newly constructed in order to fill domain and jurisdictional gaps in the open-source legal information retrieval landscape. We document our methodology in building MLEB and creating the new constituent datasets, and release our code, results, and data openly to assist with reproducible evaluations.

[42] Modeling Turn-Taking with Semantically Informed Gestures

Varsha Suresh, M. Hamza Mughal, Christian Theobalt, Vera Demberg

🧩 TL;DR

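By introducing the DnD Gesture++ dataset and a Mixture-of-Experts framework, this study shows that semantic gestures play a complementary role in multimodal turn-taking prediction, consistently improving the prediction of turn transitions.

📘 Detailed Summary

Motivation: In conversation, humans use multimodal cues such as speech, gestures, and gaze to manage turn-taking. Linguistic and acoustic features are informative, but gestures provide complementary cues; this study examines the specific role of gestures in modeling turn-taking.

Method: The study extends the multi-party DnD Gesture corpus into DnD Gesture++ by adding 2,663 semantic gesture annotations spanning iconic, metaphoric, deictic, and discourse gesture types, and models turn-taking prediction with a Mixture-of-Experts framework that integrates text, audio, and gesture features.

Result: Incorporating semantically guided gesture features yields consistent performance gains over baseline models, confirming the complementary value of gestures in multimodal turn-taking prediction.

Conclusion: The study confirms that semantic gestures are an important complementary cue for multimodal dialogue analysis and provides a basis for developing more natural human-computer interaction systems. Future work could further explore how different gesture types contribute to conversation management.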

📄 Abstract

In conversation, humans use multimodal cues, such as speech, gestures, and gaze, to manage turn-taking. While linguistic and acoustic features are informative, gestures provide complementary cues for modeling these transitions. To study this, we introduce DnD Gesture++, an extension of the multi-party DnD Gesture corpus enriched with 2,663 semantic gesture annotations spanning iconic, metaphoric, deictic, and discourse types. Using this dataset, we model turn-taking prediction through a Mixture-of-Experts framework integrating text, audio, and gestures. Experiments show that incorporating semantically guided gestures yields consistent performance gains over baselines, demonstrating their complementary role in multimodal turn-taking.
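
The abstract does not specify how the Mixture-of-Experts combines its modalities, so the following is only a sketch of the generic form: one expert per modality outputs a turn-shift probability, and a softmax gate weights them. The gate logits, expert outputs, and function names are hypothetical; in a trained model the gate would itself be a learned function of the input.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_turn_shift_prob(expert_probs, gate_logits):
    """Gate-weighted average of per-modality turn-shift probabilities."""
    names = sorted(expert_probs)
    weights = softmax([gate_logits[name] for name in names])
    return sum(w * expert_probs[name] for w, name in zip(weights, names))

# Made-up expert outputs for one candidate turn boundary.
expert_probs = {"text": 0.6, "audio": 0.4, "gesture": 0.9}
uniform_gate = {"text": 0.0, "audio": 0.0, "gesture": 0.0}
gesture_gate = {"text": 0.0, "audio": 0.0, "gesture": 2.0}

print(moe_turn_shift_prob(expert_probs, uniform_gate))
print(moe_turn_shift_prob(expert_probs, gesture_gate))
```

With a uniform gate the result is the plain average of the experts; raising the gesture logit shifts the prediction toward the gesture expert, which is how a gate can lean on gestures when they are informative.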

[43] ToMMeR -- Efficient Entity Mention Detection from Large Language Models

Victor Morand, Nadi Tomeh, Josiane Mothe, Benjamin Piwowarski

🧩 TL;DR

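This paper proposes ToMMeR, a lightweight model that identifies entity mentions by probing the early layers of large language models. It achieves 93% zero-shot recall across 13 NER benchmarks, providing evidence that entity representations exist naturally in early Transformer layers.

📘 Detailed Summary

Motivation: Entity mention detection is a foundational task in information extraction but a known performance bottleneck. This study examines whether the early layers of large language models already support entity detection and whether these structured representations can be extracted efficiently with a lightweight method.

Method: The authors propose ToMMeR, a lightweight model with fewer than 300K parameters that probes mention detection capabilities from early LLM layers. It can be extended with span classification heads to support named entity recognition.

Result: Across 13 NER benchmarks, ToMMeR achieves 93% zero-shot recall, with over 90% precision when an LLM is used as a judge. Cross-model analysis shows that models with different architectures agree closely on mention boundaries, and the extended ToMMeR reaches 80-87% F1 on standard benchmarks.

Conclusion: The study shows that structured entity representations exist naturally in early Transformer layers and can be recovered efficiently with very few parameters. LLMs of different architectures converge in entity mention detection, which provides a basis for lightweight information extraction systems.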

📄 Abstract

Identifying which text spans refer to entities -- mention detection -- is both foundational for information extraction and a known performance bottleneck. We introduce ToMMeR, a lightweight model (<300K parameters) probing mention detection capabilities from early LLM layers. Across 13 NER benchmarks, ToMMeR achieves 93% recall zero-shot, with over 90% precision using an LLM as a judge, showing that ToMMeR rarely produces spurious predictions despite high recall. Cross-model analysis reveals that diverse architectures (14M-15B parameters) converge on similar mention boundaries (DICE >75%), confirming that mention detection emerges naturally from language modeling. When extended with span classification heads, ToMMeR achieves near SOTA NER performance (80-87% F1 on standard benchmarks). Our work provides evidence that structured entity representations exist in early transformer layers and can be efficiently recovered with minimal parameters.
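
The cross-model agreement figure uses a Dice score. The abstract does not say exactly what is compared, so this sketch assumes the simplest reading: each model's output is a set of exact `(start, end)` mention spans, and Dice is twice the overlap divided by the total number of spans. The spans and the function name are illustrative.

```python
def dice(spans_a, spans_b):
    """Dice coefficient 2|A & B| / (|A| + |B|) between two sets of spans."""
    a, b = set(spans_a), set(spans_b)
    if not a and not b:
        return 1.0  # two empty predictions agree trivially
    return 2 * len(a & b) / (len(a) + len(b))

# Hypothetical mention spans (token offsets) predicted by two different models.
model_a = [(0, 2), (5, 6), (9, 11)]
model_b = [(0, 2), (5, 6), (8, 11)]
print(round(dice(model_a, model_b), 3))
```

Exact-match spans make the score strict: the two models above disagree by one token on the last mention and still lose a full span of agreement.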

[44] SONAR-SLT: Multilingual Sign Language Translation via Language-Agnostic Sentence Embedding Supervision

Yasser Hamidullah, Shakib Yazdani, Cennet Oguz, Josef van Genabith, Cristina España-Bonet

🧩 TL;DR

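This paper proposes supervising sign language translation with language-agnostic multimodal embeddings, combining multilingual target augmentation with video-level perturbations to address data scarcity and enable direct multilingual sign language translation.

📘 Detailed Summary

Motivation: Sign language translation is typically trained with text in a single language, which limits scalability and cross-language generalization. Existing methods replace gloss supervision with text sentence embeddings, but these remain tied to a specific language and modality.

Method: The approach supervises sign language translation with language-agnostic multimodal embeddings trained on text and speech from multiple languages, and proposes a coupled augmentation method that combines multilingual target augmentation (translations into many languages) with video-level perturbations to improve model robustness.

Result: Experiments show consistent BLEURT gains over supervision with text-only sentence embeddings, with larger improvements in low-resource settings.

Conclusion: Language-agnostic embedding supervision combined with coupled augmentation provides a scalable and semantically robust alternative to traditional sign language translation training and helps cope with data scarcity.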

📄 Abstract

Sign language translation (SLT) is typically trained with text in a single spoken language, which limits scalability and cross-language generalization. Earlier approaches have replaced gloss supervision with text-based sentence embeddings, but up to now, these remain tied to a specific language and modality. In contrast, here we employ language-agnostic, multimodal embeddings trained on text and speech from multiple languages to supervise SLT, enabling direct multilingual translation. To address data scarcity, we propose a coupled augmentation method that combines multilingual target augmentations (i.e. translations into many languages) with video-level perturbations, improving model robustness. Experiments show consistent BLEURT gains over text-only sentence embedding supervision, with larger improvements in low-resource settings. Our results demonstrate that language-agnostic embedding supervision, combined with coupled augmentation, provides a scalable and semantically robust alternative to traditional SLT training.

[45] Detecting Latin in Historical Books with Large Language Models: A Multimodal Benchmark

Yu Wu, Ke Shu, Jonas Fischer, Lidia Pivovarova, David Rosson, Eetu Mäkelä, Mikko Tolonen

🧩 TL;DR

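This paper introduces the new task of extracting Latin fragments from mixed-language historical documents and, by evaluating large foundation models on a multimodal dataset of 724 pages, shows that contemporary models can detect Latin reliably.

📘 Detailed Summary

Motivation: The study addresses the underexplored task of extracting Latin fragments from mixed-language historical documents with varied layouts, filling a research gap and posing a new technical challenge for the digital processing of historical texts.

Method: The study uses large foundation models as its core approach, builds a multimodal dataset of 724 annotated pages for benchmarking, and systematically evaluates the models on Latin fragment detection.

Result: Contemporary models achieve reliable Latin detection and perform well on mixed-language historical documents, providing empirical support for related applications.

Conclusion: The study provides the first comprehensive analysis of the capabilities and limits of large foundation models for Latin fragment extraction, offers a benchmark reference for historical document processing, and demonstrates the feasibility of this approach.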

📄 Abstract

This paper presents a novel task of extracting Latin fragments from mixed-language historical documents with varied layouts. We benchmark and evaluate the performance of large foundation models against a multimodal dataset of 724 annotated pages. The results demonstrate that reliable Latin detection with contemporary models is achievable. Our study provides the first comprehensive analysis of these models' capabilities and limits for this task.

[46] MINED: Probing and Updating with Multimodal Time-Sensitive Knowledge for Large Multimodal Models

Kailin Jiang, Ning Jiang, Yuchen Ren, Yuchen Li, Yifan Gao, Jinhe Bi, Yunpu Ma, Qingqing Liu, Xianhao Wang, Yifan Jia, Hongbo Jiang, Yaocong Hu, Bin Li, Lei Liu, Yuntao Du

🧩 TL;DR

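This paper proposes the MINED benchmark for evaluating how well large multimodal models understand time-sensitive knowledge, finds that existing models fall well short, and uses knowledge editing methods to show that updating time-sensitive knowledge is feasible.

📘 Detailed Summary

Motivation: Large multimodal models encode rich knowledge through cross-modal pre-training, but their static representations struggle to maintain an accurate understanding of time-sensitive knowledge, and existing benchmarks are constrained by static designs that cannot adequately evaluate temporal awareness.

Method: The authors construct MINED, a comprehensive benchmark with 6 key dimensions (cognition, awareness, trustworthiness, understanding, reasoning, and robustness) and 11 challenging tasks. Professional annotators built 2,104 time-sensitive knowledge samples from Wikipedia, spanning six knowledge types.

Result: An evaluation of 15 widely used large multimodal models shows that Gemini-2.5-Pro achieves the highest average CEM score, 63.07, while most open-source models still lack time understanding ability. Models perform best on organization knowledge and worst on sport knowledge. Knowledge editing experiments show that models can update knowledge effectively in single-edit scenarios.

Conclusion: The study reveals a systematic weakness of large multimodal models in understanding time-sensitive knowledge, demonstrates that knowledge editing methods can update model knowledge effectively, and provides a benchmark and direction for improving temporal awareness.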

📄 Abstract

Large Multimodal Models (LMMs) encode rich factual knowledge via cross-modal pre-training, yet their static representations struggle to maintain an accurate understanding of time-sensitive factual knowledge. Existing benchmarks remain constrained by static designs, inadequately evaluating LMMs' ability to understand time-sensitive knowledge. To address this gap, we propose MINED, a comprehensive benchmark that evaluates temporal awareness along 6 key dimensions and 11 challenging tasks: cognition, awareness, trustworthiness, understanding, reasoning, and robustness. MINED is constructed from Wikipedia by two professional annotators, containing 2,104 time-sensitive knowledge samples spanning six knowledge types. Evaluating 15 widely used LMMs on MINED shows that Gemini-2.5-Pro achieves the highest average CEM score of 63.07, while most open-source LMMs still lack time understanding ability. Meanwhile, LMMs perform best on organization knowledge, whereas their performance is weakest on sport. To address these challenges, we investigate the feasibility of updating time-sensitive knowledge in LMMs through knowledge editing methods and observe that LMMs can effectively update knowledge via knowledge editing methods in single editing scenarios.

[47] Re-evaluating Minimum Bayes Risk Decoding for Automatic Speech Recognition

Yuu Jinnai

🧩 TL;DR

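This study evaluates sample-based Minimum Bayes Risk (MBR) decoding for speech-to-text tasks and finds that it outperforms conventional beam search in most settings, offering a more accurate decoding option for offline ASR and speech translation.

📘 Detailed Summary

Motivation: Although MBR decoding has been shown to outperform beam search in text generation tasks, beam search remains standard practice for speech-to-text tasks. This study fills that gap by testing whether MBR decoding offers an advantage for ASR and speech translation.

Method: Using Whisper and its derivative models, the study systematically evaluates MBR decoding for automatic speech recognition and speech translation on English and Japanese datasets, comparing it against conventional beam search.

Result: In most of the evaluated settings, MBR decoding is more accurate than beam search, demonstrating its effectiveness for speech-to-text tasks, particularly offline applications that require high accuracy.

Conclusion: MBR decoding is a promising high-accuracy decoding method for offline ASR and speech translation. Future work could explore its applicability to more languages and model architectures.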

📄 Abstract

Recent work has shown that sample-based Minimum Bayes Risk (MBR) decoding outperforms beam search in text-to-text generation tasks, such as machine translation, text summarization, and image captioning. On the other hand, beam search is the current practice for speech-to-text tasks such as automatic speech recognition (ASR) and Speech Translation (ST). Given that MBR decoding is effective in text-to-text generation tasks, it is reasonable to expect it to also be effective for speech-to-text tasks. In this paper, we evaluate MBR decoding for ASR and ST tasks on English and Japanese using Whisper and its derivative models. We observe that the accuracy of MBR decoding outperforms that of beam search in most of the experimental settings we have evaluated. The results show that MBR decoding is a promising method for offline ASR and ST tasks that require high accuracy. The code is available at https://github.com/CyberAgentAILab/mbr-for-asr
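
For readers unfamiliar with sample-based MBR decoding, here is a minimal sketch of the selection step. It assumes the candidate transcripts have already been sampled from the model and uses one minus normalized word-level edit distance as the utility; that utility, the example sentences, and the function names are illustrative choices, not taken from the paper or its repository.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def utility(hyp, ref):
    """1 - normalized word edit distance: higher means closer agreement."""
    h, r = hyp.split(), ref.split()
    return 1.0 - edit_distance(h, r) / max(len(h), len(r), 1)

def mbr_decode(samples):
    """Return the sample with the highest average utility against all samples."""
    def expected_utility(hyp):
        return sum(utility(hyp, ref) for ref in samples) / len(samples)
    return max(samples, key=expected_utility)

# Made-up transcripts standing in for samples drawn from an ASR model.
samples = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "the cat sat on the mat",
    "a cat sits on the mat",
]
print(mbr_decode(samples))
```

The samples serve both as candidates and as pseudo-references, so the chosen output is the one closest on average to everything the model sampled. The cost is quadratic in the number of samples, which is one reason the abstract frames MBR as suited to offline tasks.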

[48] Style Attack Disguise: When Fonts Become a Camouflage for Adversarial Intent

Yangshijie Zhang, Xinda Wang, Jialin Liu, Wenqiang Wang, Zhicong Ma, Xingxing Jia

🧩 TL;DR

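This study proposes SAD, an attack based on stylized text that exploits the perception gap between humans and NLP models in handling font styles, achieving effective adversarial attacks on tasks such as sentiment classification and machine translation.

📘 Detailed Summary

Motivation: As social media grows, users express individuality with stylistic fonts and font-like emoji. Such visually appealing text stays readable to humans but poses a potential threat to NLP models. The study identifies a perception gap between humans and models: models treat stylized characters as distinct tokens, which causes interference.

Method: The authors propose the Style Attack Disguise (SAD) attack in two sizes: a light version for query efficiency and a strong version for superior attack performance. The method uses stylized text to create adversarial examples and targets traditional models, large language models, and commercial services.

Result: Experiments on sentiment classification and machine translation show that SAD achieves strong attack performance across a range of models. The study also shows SAD's potential threat to multimodal tasks, including text-to-image and text-to-speech generation.

Conclusion: The study exposes a security vulnerability that stylized text creates in NLP systems and highlights the risks arising from the perception gap between humans and AI systems. SAD's extension to multimodal tasks points to important directions for future AI security research.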

📄 Abstract

With social media growth, users employ stylistic fonts and font-like emoji to express individuality, creating visually appealing text that remains human-readable. However, these fonts introduce hidden vulnerabilities in NLP models: while humans easily read stylistic text, models process these characters as distinct tokens, causing interference. We identify this human-model perception gap and propose a style-based attack, Style Attack Disguise (SAD). We design two sizes: light for query efficiency and strong for superior attack performance. Experiments on sentiment classification and machine translation across traditional models, LLMs, and commercial services demonstrate SAD's strong attack performance. We also show SAD's potential threats to multimodal tasks including text-to-image and text-to-speech generation.
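The human-model perception gap described above can be illustrated with Unicode's Mathematical Bold letters, which render like ordinary bold text but are separate code points. The sketch below is not the SAD method, whose character selection and search procedure are not given in this abstract. The helper names are hypothetical, and NFKC normalization is shown only as one possible input-sanitizing step, not as a defense evaluated in the paper.

```python
import unicodedata

BOLD_UPPER_A = 0x1D400  # MATHEMATICAL BOLD CAPITAL A
BOLD_LOWER_A = 0x1D41A  # MATHEMATICAL BOLD SMALL A


def stylize_bold(text: str) -> str:
    """Map ASCII letters to their Mathematical Bold look-alikes."""
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(chr(BOLD_UPPER_A + ord(ch) - ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr(BOLD_LOWER_A + ord(ch) - ord("a")))
        else:
            out.append(ch)  # digits, spaces, punctuation stay unchanged
    return "".join(out)


def fold_styles(text: str) -> str:
    """Fold compatibility characters back to their plain forms with NFKC."""
    return unicodedata.normalize("NFKC", text)


plain = "great movie"
styled = stylize_bold(plain)

# The styled string looks similar to a human but shares no letters with the
# original at the code-point level, so a tokenizer sees different input.
same_code_points = styled == plain
recovered = fold_styles(styled)
```

Here `same_code_points` is `False` while `recovered` equals the original string, which shows both the gap and why normalizing input before tokenization is a natural thing to test against this family of perturbations.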

[49] CoSense-LLM: Semantics at the Edge with Cost- and Uncertainty-Aware Cloud-Edge Cooperation

Hasan Akgul, Mari Eplik, Javier Rojas, Aina Binti Abdullah, Pieter van der Merwe

🧩 TL;DR

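CoSense-LLM is an edge-first framework that converts continuous multimodal sensor streams into compact, verifiable semantic tokens and coordinates with large language models under strict latency, energy, bandwidth, and privacy constraints. Through lightweight encoding, local retrieval, intelligent routing, and secure execution, it jointly optimizes semantic understanding, privacy protection, and predictable latency.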

📘 Detailed Summary

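Motivation: The study addresses the trade-off among semantic understanding, privacy protection, and predictable latency when deploying large models in interference-prone environments, in particular how to meet strict latency, energy, bandwidth, and privacy constraints while processing continuous multimodal sensor streams.

Method: CoSense-LLM has four core components: SenseFusion, a lightweight encoder that aligns sensor embeddings with language and compresses them into discrete code sequences; Edge-RAG, a local hybrid retrieval layer that grounds generation in site-specific policies; PromptRouter, a cost-aware policy that selects edge-only generation, edge plus retrieval, or compact cloud escalation; and Secure Execution, an auditable redaction path that ensures raw waveform data never leaves the device.

Result: Across home, office, and clinic deployments, CoSense-LLM achieves sub-second end-to-end latency, substantially reduces inter-tier token and bandwidth costs by preferring locally retrieved responses, and protects privacy by transmitting only discrete codes and redacted metadata. Ablations show that Edge-RAG improves factual consistency and reduces contradictions, and that calibrated uncertainty supports selective abstention and controlled escalation.

Conclusion: The results support an edge-first design that treats semantic understanding, privacy protection, and predictable latency as co-equal goals for large-model deployments in interference-prone environments, and offer a workable architecture for intelligent sensing in resource-constrained settings.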

📄 Abstract

We present CoSense-LLM, an edge-first framework that turns continuous multimodal sensor streams (for example Wi-Fi CSI, IMU, audio, RFID, and lightweight vision) into compact, verifiable semantic tokens and coordinates with large language models under explicit latency, energy, bandwidth, and privacy constraints. CoSense-LLM has four parts: (i) SenseFusion, a lightweight encoder that aligns sensor embeddings with language and compresses them into short discrete code sequences; (ii) Edge-RAG, a local hybrid retrieval layer that grounds generation in site-specific policies and notes; (iii) PromptRouter, a cost- and uncertainty-aware policy that selects edge-only generation, edge plus retrieval, or compact cloud escalation; and (iv) Secure Execution, an auditable redaction path that enforces data minimization so raw waveforms never leave the device. The system works with modern serving optimizations, including paged or streaming KV caches, FlashAttention-style kernels, speculative decoding, and quantized LoRA adapters, and supports on-device personalization and federated updates under non-IID drift. Across home, office, and clinic deployments, CoSense-LLM delivers grounded explanations while meeting tight service-level objectives: it sustains sub-second (p95) end-to-end latency on edge-dominant paths, reduces inter-tier token and bandwidth costs by preferring local retrieval-grounded responses, and preserves privacy by transmitting only discrete codes and redacted metadata. Ablations show that Edge-RAG improves factual consistency and reduces contradictions, calibrated uncertainty enables selective abstention and controlled escalations, and KV plus decoding accelerators lower energy per decision. The results support an edge-first design that treats semantics, privacy, and predictable latency as co-equal goals for large model deployments in interference-prone environments.
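To make the three-way routing decision concrete, a cost- and uncertainty-aware policy of the kind attributed to PromptRouter might look like the sketch below. Everything in it is an assumption made for illustration: the abstract does not describe the actual decision rule, so the thresholds, the cost unit, and the names `Route`, `RouterConfig`, and `choose_route` are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    EDGE_ONLY = "edge_only"
    EDGE_PLUS_RETRIEVAL = "edge_plus_retrieval"
    CLOUD_ESCALATION = "cloud_escalation"


@dataclass(frozen=True)
class RouterConfig:
    # All values are illustrative placeholders, not figures from the paper.
    low_uncertainty: float = 0.2   # at or below this, answer on the edge alone
    high_uncertainty: float = 0.6  # above this, escalation is considered
    cloud_cost: float = 1.0        # hypothetical cost of one cloud call


def choose_route(
    uncertainty: float,
    budget: float,
    cfg: RouterConfig = RouterConfig(),
) -> Route:
    """Pick the cheapest route whose expected quality is acceptable.

    uncertainty: calibrated uncertainty of the edge model in [0, 1].
    budget: remaining cost budget, in the same units as cfg.cloud_cost.
    """
    if uncertainty <= cfg.low_uncertainty:
        return Route.EDGE_ONLY
    # Moderate uncertainty, or no budget left for the cloud: ground locally.
    if uncertainty <= cfg.high_uncertainty or budget < cfg.cloud_cost:
        return Route.EDGE_PLUS_RETRIEVAL
    return Route.CLOUD_ESCALATION


confident = choose_route(uncertainty=0.1, budget=5.0)
unsure = choose_route(uncertainty=0.4, budget=5.0)
very_unsure = choose_route(uncertainty=0.9, budget=5.0)
no_budget = choose_route(uncertainty=0.9, budget=0.5)
```

Ordering the checks from cheapest to most expensive keeps the edge path as the default, which matches the edge-first framing. A real policy would likely also weigh latency and privacy constraints, which this sketch omits.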

[50] Do Prompts Reshape Representations? An Empirical Study of Prompting Effects on Embeddings

Cesar Gonzalez-Gutierrez, Dirk Hovy

🧩 TL;DR

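Through empirical analysis, this study reveals a complex relationship between prompting and the quality of a language model's internal representations. It challenges the common assumption that more relevant prompts necessarily yield better representations and offers a new perspective on the mechanisms of zero-shot learning.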

📘 Detailed Summary

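Motivation: The mechanisms that let language models perform diverse tasks in zero-shot settings without task-specific supervision remain poorly understood. Studying how prompting relates to the quality of internal representations can help reveal how pre-trained embeddings support in-context task solving.

Method: The study runs probing experiments to systematically analyze prompt embeddings, examining how different combinations of prompt templates affect representation quality in zero-shot classification, and investigates factors that may contribute to the unexpected behavior.

Result: Prompting does affect representation quality, but the changes do not consistently correlate with how relevant the prompt is to the target task. This challenges the common assumption that more relevant prompts necessarily lead to better representations.

Conclusion: The study reveals a non-intuitive relationship between prompting and representation quality, underscores the need for a deeper understanding of how prompts affect a model's internal mechanisms, and provides empirical evidence for the foundations of zero-shot learning.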

📄 Abstract

Prompting is a common approach for leveraging LMs in zero-shot settings. However, the underlying mechanisms that enable LMs to perform diverse tasks without task-specific supervision remain poorly understood. Studying the relationship between prompting and the quality of internal representations can shed light on how pre-trained embeddings may support in-context task solving. In this empirical study, we conduct a series of probing experiments on prompt embeddings, analyzing various combinations of prompt templates for zero-shot classification. Our findings show that while prompting affects the quality of representations, these changes do not consistently correlate with the relevance of the prompts to the target task. This result challenges the assumption that more relevant prompts necessarily lead to better representations. We further analyze potential factors that may contribute to this unexpected behavior.

cs.AI [Back]

[51] The MUSE Benchmark: Probing Music Perception and Auditory Relational Reasoning in Audio LLMs

Brandon James Carone, Iran R. Roman, Pablo Ripollés

🧩 TL;DR

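This study introduces the MUSE benchmark for evaluating relational reasoning in music understanding by multimodal large language models. It finds that current SOTA models show significant perceptual deficits and a persistent gap with human experts.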

📘 Detailed Summary

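Motivation: Multimodal large language models have demonstrated capabilities in audio understanding, but existing evaluations may obscure fundamental weaknesses in relational reasoning. A more systematic evaluation tool is needed to expose these deficits.

Method: The study introduces the Music Understanding and Structural Evaluation (MUSE) Benchmark, comprising 10 tasks that probe fundamental music perception skills. It evaluates four SOTA models against a large human baseline and also tests the effectiveness of Chain-of-Thought prompting.

Result: SOTA capabilities vary widely, and a persistent gap with human experts remains. Gemini Pro succeeds on basic perception, whereas Qwen and Audio Flamingo 3 perform at or near chance. Chain-of-Thought prompting yields inconsistent and often detrimental results.

Conclusion: The work provides a critical tool for evaluating invariant musical representations, exposes fundamental limitations of current AI systems in music understanding, and should help drive the development of more robust AI systems.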

📄 Abstract

Multimodal Large Language Models (MLLMs) have demonstrated capabilities in audio understanding, but current evaluations may obscure fundamental weaknesses in relational reasoning. We introduce the Music Understanding and Structural Evaluation (MUSE) Benchmark, an open-source resource with 10 tasks designed to probe fundamental music perception skills. We evaluate four SOTA models (Gemini Pro and Flash, Qwen2.5-Omni, and Audio Flamingo 3) against a large human baseline (N=200). Our results reveal a wide variance in SOTA capabilities and a persistent gap with human experts. While Gemini Pro succeeds on basic perception, Qwen and Audio Flamingo 3 perform at or near chance, exposing severe perceptual deficits. Furthermore, we find that Chain-of-Thought (CoT) prompting provides inconsistent, often detrimental results. Our work provides a critical tool for evaluating invariant musical representations and driving development of more robust AI systems.

Table of Contents

cs.CV [Back]

[1] Robust Driving QA through Metadata-Grounded Context and Task-Specific Prompts
[2] PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions
[3] UniHPR: Unified Human Pose Representation via Singular Value Contrastive Learning
[4] X-Ego: Acquiring Team-Level Tactical Situational Awareness via Cross-Egocentric Contrastive Video Representation Learning
[5] PruneHal: Reducing Hallucinations in Multi-modal Large Language Models through Adaptive KV Cache Pruning
[6] FootFormer: Estimating Stability from Visual Input
[7] Rethinking Driving World Model as Synthetic Data Generator for Perception Tasks
[8] SFGFusion: Surface Fitting Guided 3D Object Detection with 4D Radar and Camera Fusion
[9] MobiAct: Efficient MAV Action Recognition Using MobileNetV4 with Contrastive Learning and Knowledge Distillation
[10] CARES: Context-Aware Resolution Selector for VLMs
[11] D2D: Detector-to-Differentiable Critic for Improved Numeracy in Text-to-Image Generation
[12] A Matter of Time: Revealing the Structure of Time in Vision-Language Models
[13] Unified Reinforcement and Imitation Learning for Vision-Language Models
[14] Multi-modal Co-learning for Earth Observation: Enhancing single-modality models via modality collaboration
[15] BrainMCLIP: Brain Image Decoding with Multi-Layer feature Fusion of CLIP
[16] A Training-Free Framework for Open-Vocabulary Image Segmentation and Recognition with EfficientNet and CLIP
[17] XBench: A Comprehensive Benchmark for Visual-Language Explanations in Chest Radiography
[18] DaMo: Data Mixing Optimizer in Fine-tuning Multimodal LLMs for Mobile Phone Agents
[19] Seeing Across Views: Benchmarking Spatial Reasoning of Vision-Language Models in Robotic Scenes
[20] From Forecasting to Planning: Policy World Model for Collaborative State-Action Prediction