cs.CV

[1] PickStyle: Video-to-Video Style Transfer with Context-Style Adapters

Soroush Mehraban, Vida Adeli, Jacob Rommann, Babak Taati, Kyryl Truskovskyi

🧩 TL;DR

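This paper proposes PickStyle, a video style transfer framework that uses low-rank adapters and context-style classifier-free guidance to transfer style efficiently while preserving video content, outperforming existing baselines.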


📘 Detailed Summary

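Motivation: The main challenge in video style transfer is the lack of paired video data for supervised learning. Existing methods struggle to transfer style effectively while keeping the video content consistent, which limits their practical use.

Method: PickStyle inserts low-rank adapters into the self-attention layers of the conditioning modules of a pretrained video diffusion backbone, builds synthetic training clips from paired still images, and introduces Context-Style Classifier-Free Guidance (CS-CFG), which factorizes classifier-free guidance into independent text (style) and video (context) directions.

Result: Across benchmarks, the method produces temporally coherent, style-faithful, and content-preserving video translations and outperforms existing baselines both qualitatively and quantitatively.

Conclusion: Low-rank adapters combined with the new guidance mechanism make it possible to use still-image data to supervise video style transfer, offering a new technical path for video generation and editing.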


📄 Abstract

We address the task of video style transfer with diffusion models, where the goal is to preserve the context of an input video while rendering it in a target style specified by a text prompt. A major challenge is the lack of paired video data for supervision. We propose PickStyle, a video-to-video style transfer framework that augments pretrained video diffusion backbones with style adapters and benefits from paired still image data with source-style correspondences for training. PickStyle inserts low-rank adapters into the self-attention layers of conditioning modules, enabling efficient specialization for motion-style transfer while maintaining strong alignment between video content and style. To bridge the gap between static image supervision and dynamic video, we construct synthetic training clips from paired images by applying shared augmentations that simulate camera motion, ensuring temporal priors are preserved. In addition, we introduce Context-Style Classifier-Free Guidance (CS-CFG), a novel factorization of classifier-free guidance into independent text (style) and video (context) directions. CS-CFG ensures that context is preserved in generated video while the style is effectively transferred. Experiments across benchmarks show that our approach achieves temporally coherent, style-faithful, and content-preserving video translations, outperforming existing baselines both qualitatively and quantitatively.
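The factorization behind CS-CFG can be illustrated with a small sketch. The abstract does not give the formula, so the three-call combination below, the `denoise` signature, the use of `None` as the "null" input, and the weight names `w_context` and `w_style` are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Callable, List, Optional

Vector = List[float]
# Hypothetical denoiser signature: (context video or None, style text or None) -> noise estimate.
Denoiser = Callable[[Optional[str], Optional[str]], Vector]


def cs_cfg(denoise: Denoiser, context: str, style: str,
           w_context: float, w_style: float) -> Vector:
    """Combine three denoiser calls into one guided prediction (assumed form)."""
    base = denoise(None, None)         # neither context nor style
    ctx_only = denoise(context, None)  # context video only
    full = denoise(context, style)     # context video and style prompt
    # Context direction: (ctx_only - base). Style direction: (full - ctx_only).
    return [
        b + w_context * (c - b) + w_style * (f - c)
        for b, c, f in zip(base, ctx_only, full)
    ]
```

Keeping the two directions separate lets each be scaled on its own: with both weights equal to 1 the result reduces to the fully conditioned prediction, while raising only `w_style` pushes harder on the style without changing how strongly the context is enforced.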

[2] TRAVL: A Recipe for Making Video-Language Models Better Judges of Physics Implausibility

Saman Motamed, Minghao Chen, Luc Van Gool, Iro Laina

🧩 TL;DR

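This work proposes TRAVL, a fine-tuning recipe, and ImplausiBench, a benchmark, for evaluating and improving how well video-language models judge the physical plausibility of videos. TRAVL strengthens the models' spatiotemporal reasoning to address their weak grasp of physical laws.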


📘 Detailed Summary

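Motivation: Modern video generative models achieve impressive visual fidelity but often produce sequences that violate physical laws, such as objects floating, teleporting, or morphing in ways that defy causality. There is no robust method for quantitatively assessing physical realism in video, and existing video-language models struggle to identify physics violations.

Method: TRAVL is a fine-tuning recipe that combines a balanced training dataset with a trajectory-aware attention module to improve motion encoding and discrimination in video-language models. ImplausiBench is a benchmark of 300 videos (150 real, 150 generated) designed to remove linguistic biases and isolate visual-temporal understanding.

Result: Performance is reported with both gold-standard human judgments and stricter LLM-as-judge metrics. TRAVL improves the ability of video-language models to identify physics violations, and ImplausiBench provides a standardized test bed for physical plausibility.

Conclusion: Together, TRAVL and ImplausiBench form a unified framework for probing and improving physical plausibility in multimodal models, shedding light on a challenging and underexplored aspect of visual-temporal understanding.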


📄 Abstract

Despite impressive visual fidelity, modern video generative models frequently produce sequences that violate intuitive physical laws, such as objects floating, teleporting, or morphing in ways that defy causality. While humans can easily detect such implausibilities, there remains no robust method for quantitatively assessing physical realism in video. In this work, we explore whether Video-Language Models (VLMs) can be trained to serve as reliable judges of physical plausibility. We find that existing VLMs struggle to identify physics violations, exposing fundamental limitations in their temporal and causal reasoning. To address this, we introduce TRAVL, a fine-tuning recipe that combines a balanced training dataset with a trajectory-aware attention module to improve motion encoding and discrimination in VLMs. To evaluate physical reasoning more rigorously, we propose ImplausiBench, a benchmark of 300 videos (150 real, 150 generated) that removes linguistic biases and isolates visual-temporal understanding. Performance is reported both with gold-standard human judgments and stricter LLM-as-judge metrics. Together, TRAVL and ImplausiBench offer a unified framework for probing and improving physical plausibility in multimodal models, shedding light on a challenging and underexplored aspect of visual-temporal understanding.

[3] Label Semantics for Robust Hyperspectral Image Classification

Rafin Hassan, Zarin Tasnim Roshni, Rafiqul Bari, Alimul Islam, Nabeel Mohammed, Moshiur Farazi, Shafin Rahman

🧩 TL;DR

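This paper proposes the Semantic Spectral-Spatial Fusion Network (S3FN), a general-purpose network that uses LLM-generated, class-specific textual descriptions to complement the training of hyperspectral image classifiers and improves classification performance. The synergy between textual semantics and spectral-spatial data points toward semantically augmented hyperspectral classification models.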


📘 Detailed Summary

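Motivation: Hyperspectral image classification suffers from scarce high-quality training samples and high-dimensional spectral data, so models overfit easily and struggle to balance accuracy against computational complexity. Most existing models are monomodal and rely solely on spectral-spatial data to learn decision boundaries in a high-dimensional embedding space, which caps their performance.

Method: S3FN uses LLMs to generate a comprehensive textual description for each class label, capturing its distinctive characteristics and spectral behavior. A pretrained text encoder such as BERT or RoBERTa embeds these descriptions into a vector space to extract label semantics, giving better feature-label alignment and improved classification performance.

Result: The model is evaluated on three hyperspectral benchmark datasets (Hyperspectral Wood, HyperspectralBlueberries, and DeepHS-Fruit) and shows a significant performance boost. The results demonstrate the synergy between textual semantics and spectral-spatial data.

Conclusion: Fusing textual semantics with spectral-spatial data improves hyperspectral classification and points the way for semantically augmented classification models. The approach offers a way to cope with scarce training samples and high-dimensional data.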


📄 Abstract

Hyperspectral imaging (HSI) classification is a critical tool with widespread applications across diverse fields such as agriculture, environmental monitoring, medicine, and materials science. Due to the limited availability of high-quality training samples and the high dimensionality of spectral data, HSI classification models are prone to overfitting and often face challenges in balancing accuracy and computational complexity. Furthermore, most HSI classification models are monomodal, relying solely on spectral-spatial data to learn decision boundaries in the high-dimensional embedding space. To address this, we propose a general-purpose Semantic Spectral-Spatial Fusion Network (S3FN) that uses contextual, class-specific textual descriptions to complement the training of an HSI classification model. Specifically, S3FN leverages LLMs to generate comprehensive textual descriptions for each class label that capture its unique characteristics and spectral behaviors. These descriptions are then embedded into a vector space using a pre-trained text encoder such as BERT or RoBERTa to extract meaningful label semantics, which in turn leads to better feature-label alignment and improved classification performance. To demonstrate the effectiveness of our approach, we evaluate our model on three diverse HSI benchmark datasets (Hyperspectral Wood, HyperspectralBlueberries, and DeepHS-Fruit) and report a significant performance boost. Our results highlight the synergy between textual semantics and spectral-spatial data, paving the way for further advancements in semantically augmented HSI classification models. Code is available at: https://github.com/milab-nsu/S3FN
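One way to picture feature-label alignment is to compare an image feature against one embedding per class description. The sketch below assumes the label embeddings have already been produced by a text encoder and that the image feature has been projected to the same dimension; the cosine-similarity matching rule, the function names, and the toy vectors are illustrative assumptions, since the abstract does not specify how S3FN fuses the two modalities.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify_by_label_semantics(feature: List[float],
                                label_embeddings: Dict[str, List[float]]) -> str:
    """Return the class whose description embedding is closest to the feature."""
    return max(label_embeddings,
               key=lambda name: cosine(feature, label_embeddings[name]))
```

Because the class side is just a table of vectors, a class description can be changed without touching the image branch; only its embedding needs to be recomputed.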

[4] Cross-Modal Attention Guided Unlearning in Vision-Language Models

Karuna Bhaila, Aneesh Komanduri, Minh-Hao Van, Xintao Wu

🧩 TL;DR

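This paper proposes CAGUL, a lightweight unlearning framework for vision-language models that uses cross-modal attention to guide the transformation of visual tokens, preventing leakage of sensitive information while retaining reference model behavior, without modifying pretrained parameters or incurring retraining costs.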


📘 Detailed Summary

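Motivation: Vision-language models may memorize private or sensitive information during training and leak it at inference. Existing machine unlearning methods mainly target language models; vision-language models add complexity because the visual context may also contain sensitive information, so an unlearning approach designed for the VQA task is needed.

Method: The paper examines the role of cross-modal attention in how vision-language models generate outputs and builds on it to propose Cross-Modal Attention Guided Unlearning (CAGUL). The framework uses external modules to encode unlearning information in low-importance visual tokens for relevant queries, so pretrained model parameters are left unchanged.

Result: Experiments show that CAGUL prevents information leakage better than or on par with finetuning-based baselines while retaining reference model behavior, without altering pretrained parameters or incurring retraining costs.

Conclusion: CAGUL is a practical and effective unlearning approach for vision-language models: lightweight external modules make the model forget sensitive information while preserving its performance, which supports privacy-preserving use of multimodal models.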


📄 Abstract

Vision-Language Models (VLMs) have demonstrated immense capabilities in multi-modal understanding and inference tasks such as Visual Question Answering (VQA), which requires models to infer outputs based on visual and textual context simultaneously. Such inference abilities of large-scale pretrained models are often attributed to the massive scale of pre-training data collected across several domains. However, the models may memorize private and/or sensitive information during training and regurgitate it at inference. Recently, machine unlearning has been leveraged to address the leakage of private data in LLMs. VLMs add a layer of complexity to this process, as the visual context in the query may also contain sensitive information in addition to the text. To address this issue, we explore unlearning for vision-language models, specifically for the VQA task. We explore the role of visual tokens for output generation in VLMs using cross-modal attention and utilize it to formulate Cross-Modal Attention Guided Unlearning (CAGUL), a lightweight and efficient VLM unlearning framework. In contrast to computationally expensive model finetuning methods, CAGUL utilizes external modules to encode unlearning information in visual tokens of low importance for relevant queries. We find that the transformed visual tokens not only prevent leakage but also retain reference model behavior. Experimental results show that our method performs better than or on par with finetuning-based baselines without altering the pre-trained model parameters or incurring retraining costs, making it a practical and effective unlearning solution for VLMs.
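The idea of editing only low-importance visual tokens can be sketched as follows. This is a toy illustration under stated assumptions: the per-token cross-modal attention scores are taken as given, the external module is stood in for by a fixed `offset` vector (in the paper it is a trained module), and the function name and the parameter `k` are hypothetical.

```python
from typing import List


def transform_low_importance_tokens(tokens: List[List[float]],
                                    attention: List[float],
                                    offset: List[float],
                                    k: int) -> List[List[float]]:
    """Add `offset` to the k visual tokens that receive the least attention."""
    # Rank token indices from least to most attended.
    order = sorted(range(len(tokens)), key=lambda i: attention[i])
    chosen = set(order[:k])
    # Tokens outside the chosen set are copied unchanged.
    return [
        [x + d for x, d in zip(tok, offset)] if i in chosen else list(tok)
        for i, tok in enumerate(tokens)
    ]
```

Leaving the highly attended tokens untouched is what would let the model keep its usual behavior on the rest of the image, while the pretrained weights themselves are never modified.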

[5] Rectified-CFG++ for Flow Based Models

Shreshth Saini, Shashank Gupta, Alan C. Bovik

🧩 TL;DR

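This paper proposes Rectified-CFG++, an adaptive predictor-corrector guidance method for rectified-flow-based models. Its geometry-aware conditioning rule addresses the off-manifold drift that standard classifier-free guidance causes in such models and improves the quality and stability of text-to-image generation.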


📘 Detailed Summary

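Motivation: Applying standard classifier-free guidance directly to rectified-flow-based models causes severe off-manifold drift, producing visual artifacts, text misalignment, and brittle behavior, which limits these models on text-conditioned generation.

Method: The method is an adaptive predictor-corrector guidance scheme. Each inference step first performs a conditional rectified-flow update that anchors the sample near the learned transport path, then applies a weighted conditional correction that interpolates between the conditional and unconditional velocity fields. The resulting velocity field is proven to be marginally consistent, with trajectories that stay within a bounded tubular neighborhood of the data manifold.

Result: Extensive experiments on large-scale text-to-image models (Flux, Stable Diffusion 3/3.5, Lumina) show that Rectified-CFG++ consistently outperforms standard CFG on benchmarks such as MS-COCO, LAION-Aesthetic, and T2I-CompBench.

Conclusion: The method couples the deterministic efficiency of rectified flows with a geometry-aware conditioning rule and remains stable across a wide range of guidance strengths, giving a more reliable and efficient approach to conditional generation.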


📄 Abstract

Classifier-free guidance (CFG) is the workhorse for steering large diffusion models toward text-conditioned targets, yet its native application to rectified flow (RF) based models provokes severe off-manifold drift, yielding visual artifacts, text misalignment, and brittle behaviour. We present Rectified-CFG++, an adaptive predictor-corrector guidance that couples the deterministic efficiency of rectified flows with a geometry-aware conditioning rule. Each inference step first executes a conditional RF update that anchors the sample near the learned transport path, then applies a weighted conditional correction that interpolates between conditional and unconditional velocity fields. We prove that the resulting velocity field is marginally consistent and that its trajectories remain within a bounded tubular neighbourhood of the data manifold, ensuring stability across a wide range of guidance strengths. Extensive experiments on large-scale text-to-image models (Flux, Stable Diffusion 3/3.5, Lumina) show that Rectified-CFG++ consistently outperforms standard CFG on benchmark datasets such as MS-COCO, LAION-Aesthetic, and T2I-CompBench. Project page: https://rectified-cfgpp.github.io/
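The two-stage step described above can be sketched in a few lines. The abstract only says that a conditional update comes first and a weighted correction between conditional and unconditional velocities follows; the half-step predictor, the exact way the correction is added, and the names `v_cond`, `v_uncond`, and `alpha` are assumptions chosen for illustration.

```python
from typing import Callable, List

Vector = List[float]
# Hypothetical velocity-field signature: (sample, time) -> velocity.
Velocity = Callable[[Vector, float], Vector]


def predictor_corrector_step(x: Vector, t: float, dt: float,
                             v_cond: Velocity, v_uncond: Velocity,
                             alpha: float) -> Vector:
    """One guided step: conditional predictor, then weighted correction (assumed form)."""
    # Predictor: move half a step along the conditional velocity.
    vc = v_cond(x, t)
    x_mid = [xi + 0.5 * dt * vi for xi, vi in zip(x, vc)]
    t_mid = t + 0.5 * dt
    # Corrector: add the conditional/unconditional gap measured at the midpoint.
    vc_mid = v_cond(x_mid, t_mid)
    vu_mid = v_uncond(x_mid, t_mid)
    v = [c + alpha * (cm - um) for c, cm, um in zip(vc, vc_mid, vu_mid)]
    return [xi + dt * vi for xi, vi in zip(x, v)]
```

With `alpha = 0` the step is a plain conditional Euler update, so the guidance term acts as a correction on top of the conditional path rather than an extrapolation away from the unconditional one.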

[6] PIT-QMM: A Large Multimodal Model For No-Reference Point Cloud Quality Assessment

Shashank Gupta, Gregoire Phillips, Alan C. Bovik

🧩 TL;DR

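This paper proposes PIT-QMM, a novel large multimodal model for no-reference point cloud quality assessment that consumes text, images, and point clouds end-to-end and outperforms state-of-the-art methods by significant margins on popular benchmarks.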


📘 Detailed Summary

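Motivation: Large multimodal models have brought considerable advances in image and video quality assessment, but this progress has not been fully explored for 3D assets. No-reference point cloud quality assessment in particular needs methods that can exploit complementary information across modalities.

Method: PIT-QMM is a new large multimodal model that consumes text descriptions, 2D projections, and 3D point cloud views end-to-end and fuses their complementary information to predict a point cloud quality score.

Result: Extensive experiments on popular benchmarks show that the method outperforms the state of the art by significant margins with fewer training iterations. The framework also enables distortion localization and identification.

Conclusion: The work opens a path toward model explainability and interactivity, demonstrates the value of multimodal fusion for point cloud quality assessment, and offers a new direction for assessing the quality of 3D assets.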


📄 Abstract

Large Multimodal Models (LMMs) have recently enabled considerable advances in the realm of image and video quality assessment, but this progress has yet to be fully explored in the domain of 3D assets. We are interested in using these models to conduct No-Reference Point Cloud Quality Assessment (NR-PCQA), where the aim is to automatically evaluate the perceptual quality of a point cloud in the absence of a reference. We begin with the observation that different modalities of data (text descriptions, 2D projections, and 3D point cloud views) provide complementary information about point cloud quality. We then construct PIT-QMM, a novel LMM for NR-PCQA that is capable of consuming text, images and point clouds end-to-end to predict quality scores. Extensive experimentation shows that our proposed method outperforms the state-of-the-art by significant margins on popular benchmarks with fewer training iterations. We also demonstrate that our framework enables distortion localization and identification, which paves a new way forward for model explainability and interactivity. Code and datasets are available at https://www.github.com/shngt/pit-qmm.

[7] Mutual Learning for Hashing: Unlocking Strong Hash Functions from Weak Supervision

Xiaoxu Ma, Runhao Li, Zhenyu Weng

🧩 TL;DR

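This paper proposes MLH, a novel weak-to-strong mutual learning framework that transfers local similarity knowledge from a pairwise-based hashing branch to a center-based hashing branch, addressing the tendency of center-based methods to underuse local similarity information when modeling global structure.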


📘 Detailed Summary

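Motivation: Center-based hashing methods capture global data distributions effectively, but their strength in modeling global structure often comes at the expense of local similarity information. Important local similarity relationships are underused, which limits further gains in hash-based retrieval.

Method: Mutual Learning for Hashing (MLH) consists of a strong center-based branch and a weaker pairwise-based branch, with knowledge transferred through an iterative mutual learning process. A mixture-of-hash-experts module, inspired by the mixture-of-experts paradigm, enables effective cross-branch interaction and improves both branches.

Result: Extensive experiments on multiple benchmark datasets show that MLH consistently outperforms state-of-the-art hashing methods, supporting its ability to combine global structure with local similarity.

Conclusion: Weak-to-strong mutual learning can combine the global modeling of center-based hashing with the local similarity preservation of pairwise hashing, offering a new paradigm for deep hashing. The mixture-of-hash-experts module shows the value of cross-branch interaction for hashing performance.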


📄 Abstract

Deep hashing has been widely adopted for large-scale image retrieval, with numerous strategies proposed to optimize hash function learning. Pairwise-based methods are effective in learning hash functions that preserve local similarity relationships, whereas center-based methods typically achieve superior performance by more effectively capturing global data distributions. However, the strength of center-based methods in modeling global structures often comes at the expense of underutilizing important local similarity information. To address this limitation, we propose Mutual Learning for Hashing (MLH), a novel weak-to-strong framework that enhances a center-based hashing branch by transferring knowledge from a weaker pairwise-based branch. MLH consists of two branches: a strong center-based branch and a weaker pairwise-based branch. Through an iterative mutual learning process, the center-based branch leverages local similarity cues learned by the pairwise-based branch. Furthermore, inspired by the mixture-of-experts paradigm, we introduce a novel mixture-of-hash-experts module that enables effective cross-branch interaction, further enhancing the performance of both branches. Extensive experiments demonstrate that MLH consistently outperforms state-of-the-art hashing methods across multiple benchmark datasets.

[8] GTR-Bench: Evaluating Geo-Temporal Reasoning in Vision-Language Models

Qinghongbing Xie, Zhaoyuan Xia, Feng Zhu, Lijun Gong, Ziyue Li, Rui Zhao, Long Zeng

🧩 TL;DR

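This paper introduces GTR-Bench, a benchmark for evaluating geo-temporal reasoning in vision-language models. It finds that current models fall well short when reasoning jointly over maps and multi-view videos, leaving a large gap to human performance.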


📘 Detailed Summary

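Motivation: Existing spatial-temporal benchmarks mainly cover egocentric reasoning with image or video context, or geographic reasoning with graphics context such as a map. They do not assess whether vision-language models can use both image/video and graphics context for geographic spatial-temporal reasoning, which matters for areas such as traffic management and emergency response.

Method: The authors introduce the Geo-Temporal Reasoning benchmark (GTR-Bench), a new challenge on geographic temporal reasoning about moving targets in a large-scale camera network. It requires multiple perspective switches between maps and videos, joint reasoning across several videos with non-overlapping fields of view, and inference over spatial-temporal regions that no video observes.

Result: An evaluation of more than 10 popular vision-language models shows that even the best proprietary model, Gemini-2.5-Pro (34.9%), lags significantly behind human performance (78.61%). The analysis reveals three main deficiencies: imbalanced use of spatial-temporal context, weak temporal forecasting, and limited ability to comprehend or align map data with multi-view video inputs.

Conclusion: GTR-Bench offers insights and new opportunities for research on spatial-temporal intelligence. It exposes core limitations of current vision-language models in geo-temporal reasoning and points to multimodal context integration and temporal forecasting as the areas most in need of improvement.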


📄 Abstract

Recently, the spatial-temporal intelligence of Visual-Language Models (VLMs) has attracted much attention due to its importance for Autonomous Driving, Embodied AI and General Artificial Intelligence. Existing spatial-temporal benchmarks mainly focus on egocentric perspective reasoning with images/video context, or geographic perspective reasoning with graphics context (e.g., a map), and thus fail to assess VLMs' geographic spatial-temporal intelligence with both images/video and graphics context, which is important for areas like traffic management and emergency response. To address the gaps, we introduce the Geo-Temporal Reasoning benchmark (GTR-Bench), a novel challenge for geographic temporal reasoning of moving targets in a large-scale camera network. GTR-Bench is more challenging as it requires multiple perspective switches between maps and videos, joint reasoning across multiple videos with non-overlapping fields of view, and inference over spatial-temporal regions that are unobserved by any video context. Evaluations of more than 10 popular VLMs on GTR-Bench demonstrate that even the best proprietary model, Gemini-2.5-Pro (34.9%), significantly lags behind human performance (78.61%) on geo-temporal reasoning. Moreover, our comprehensive analysis on GTR-Bench reveals three primary deficiencies of current models for geo-temporal reasoning. (1) VLMs' reasoning is impaired by an imbalanced utilization of spatial-temporal context. (2) VLMs are weak in temporal forecasting, which leads to worse performance on temporal-emphasized tasks than on spatial-emphasized tasks. (3) VLMs lack the proficiency to comprehend or align the map data with multi-view video inputs. We believe GTR-Bench offers valuable insights and opens up new opportunities for research and applications in spatial-temporal intelligence. Benchmark and code will be released at https://github.com/X-Luffy/GTR-Bench.

[9] XYZCylinder: Feedforward Reconstruction for Driving Scenes Based on A Unified Cylinder Lifting Method

Haochen Yu, Qiankun Liu, Hongyuan Liu, Jianfei Jiang, Juntao Lyu, Jiansheng Chen, Huimin Ma

🧩 TL;DR

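This paper proposes XYZCylinder, a feedforward model based on a unified cylinder lifting method. Unified camera modeling and a hybrid representation address the limited generalization and reconstruction accuracy of existing methods on driving scenes.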


📘 Detailed Summary

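Motivation: Existing feedforward reconstruction paradigms have two main limitations on driving scenes. A fixed view transformation fails when the camera configuration changes, which limits generalization across driving scenes. The small overlap between sparse views of the 360° panorama, together with the complexity of driving scenes, makes learning harder and lowers reconstruction accuracy.

Method: XYZCylinder is built on a unified cylinder lifting method. It includes a Unified Cylinder Camera Modeling (UCCM) strategy that avoids learning viewpoint-dependent spatial correspondence, and a hybrid representation with dedicated modules based on a newly designed Cylinder Plane Feature Group (CPFG) that lifts 2D image features to 3D space.

Result: Experiments show that XYZCylinder achieves state-of-the-art performance under different evaluation settings and generalizes to other driving scenes in a zero-shot manner.

Conclusion: Unified camera modeling and a hybrid representation improve both the generalization and the accuracy of driving scene reconstruction, offering a new approach to 3D scene reconstruction across different camera configurations.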


📄 Abstract

Recently, more attention has been paid to feedforward reconstruction paradigms, which mainly learn a fixed view transformation implicitly and reconstruct the scene with a single representation. However, their generalization capability and reconstruction accuracy are still limited while reconstructing driving scenes, which results from two aspects: (1) The fixed view transformation fails when the camera configuration changes, limiting the generalization capability across different driving scenes equipped with different camera configurations. (2) The small overlapping regions between sparse views of the 360° panorama and the complexity of driving scenes increase the learning difficulty, reducing the reconstruction accuracy. To handle these difficulties, we propose XYZCylinder, a feedforward model based on a unified cylinder lifting method which involves camera modeling and feature lifting. Specifically, to improve the generalization capability, we design a Unified Cylinder Camera Modeling (UCCM) strategy, which avoids the learning of viewpoint-dependent spatial correspondence and unifies different camera configurations with adjustable parameters. To improve the reconstruction accuracy, we propose a hybrid representation with several dedicated modules based on newly designed Cylinder Plane Feature Group (CPFG) to lift 2D image features to 3D space. Experimental results show that XYZCylinder achieves state-of-the-art performance under different evaluation settings, and can be generalized to other driving scenes in a zero-shot manner. Project page: https://yuyuyu223.github.io/XYZCYlinder-projectpage/
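To make the notion of a shared cylindrical camera model concrete, the sketch below projects a 3D point in the vehicle frame onto a vertical cylinder around the origin. It is a generic cylindrical projection, not the UCCM strategy itself: the abstract does not describe UCCM's parameters, and the axis convention (z up), the `radius` argument, and the function name are assumptions.

```python
import math
from typing import Tuple


def point_to_cylinder(x: float, y: float, z: float,
                      radius: float = 1.0) -> Tuple[float, float]:
    """Project a 3D point (z up) onto a vertical cylinder centered on the z axis.

    Returns (azimuth in radians, height on the cylinder surface).
    """
    rho = math.hypot(x, y)  # horizontal distance from the cylinder axis
    if rho == 0.0:
        raise ValueError("point lies on the cylinder axis")
    azimuth = math.atan2(y, x)
    # Scale the ray from the origin until it hits the cylinder wall.
    height = radius * z / rho
    return azimuth, height
```

Every camera on the rig maps into the same (azimuth, height) coordinates, which is the kind of property that lets one representation serve different camera layouts.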

[10] MARC: Memory-Augmented RL Token Compression for Efficient Video Understanding

Peiran Wu, Zhuorui Yu, Yunze Liu, Chi-Hao Wu, Enmin Zhou, Junxiao Shen

🧩 TL;DR

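This paper proposes MARC, which compresses visual tokens through memory-augmented reinforcement learning. On video understanding tasks it reaches near-baseline accuracy using only one frame's tokens while substantially reducing computational cost.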


📘 Detailed Summary

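Motivation: Visual language models face heavy computational costs when extended from images to videos because of high frame rates and long durations. Most existing training-free token compression methods cause information loss and performance degradation, so a more effective compression scheme is needed.

Method: MARC adopts a retrieve-then-compress strategy. A Visual Memory Retriever (VMR) selects key clips, and a reinforcement learning framework, Compression Group Relative Policy Optimization (C-GRPO), distils reasoning ability from a teacher model to a student model. The approach thus combines structured retrieval with RL-based distillation.

Result: On six video benchmarks, MARC achieves near-baseline accuracy using only one frame's tokens, reducing visual tokens by 95%, GPU memory by 72%, and latency by 23.9%.

Conclusion: MARC shows that efficient, real-time video understanding is feasible in resource-constrained settings such as video QA, surveillance, and autonomous driving, and that token compression can cut computational cost substantially while largely preserving accuracy.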


📄 Abstract

The rapid progress of large language models (LLMs) has laid the foundation for multimodal models. However, visual language models (VLMs) still face heavy computational costs when extended from images to videos due to high frame rates and long durations. Token compression is a promising solution, yet most existing training-free methods cause information loss and performance degradation. To overcome this, we propose Memory-Augmented Reinforcement Learning-based Token Compression (MARC), which integrates structured retrieval and RL-based distillation. MARC adopts a retrieve-then-compress strategy using a Visual Memory Retriever (VMR) to select key clips and a Compression Group Relative Policy Optimization (C-GRPO) framework to distil reasoning ability from a teacher to a student model. Experiments on six video benchmarks show that MARC achieves near-baseline accuracy using only one frame's tokens, reducing visual tokens by 95%, GPU memory by 72%, and latency by 23.9%. This demonstrates its potential for efficient, real-time video understanding in resource-constrained settings such as video QA, surveillance, and autonomous driving.
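C-GRPO is named after Group Relative Policy Optimization, whose core idea is to score each sampled response relative to the other responses in its group. The sketch below shows only that generic normalization step; how MARC defines its compression-aware rewards is not stated in the abstract, and the use of the population standard deviation here is an assumption (implementations differ on this detail).

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """Normalize rewards within one sampled group to zero mean and unit spread."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population standard deviation (assumed choice)
    if std == 0.0:
        # All responses scored the same, so none is preferred over another.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]
```

Because the baseline is the group mean, no separate value network is needed: a response is reinforced only to the extent that it beats its siblings sampled for the same input.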

[11] TTOM: Test-Time Optimization and Memorization for Compositional Video Generation

Leigang Qu, Ziyang Wang, Na Zheng, Wenjie Wang, Liqiang Nie, Tat-Seng Chua

🧩 TL;DR

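This paper proposes TTOM, a training-free framework that optimizes at inference time. Using a layout-attention objective and a parametric memory mechanism, it improves text-image alignment for video foundation models in compositional scenarios and shows strong transferability and generalization.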


📘 Detailed Summary

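Motivation: Video foundation models generate impressive visuals but struggle in compositional scenarios such as motion, numeracy, and spatial relations. The resulting poor text-image alignment limits their usefulness for complex video generation tasks.

Method: Test-Time Optimization and Memorization (TTOM) integrates and optimizes new parameters guided by a layout-attention objective, rather than intervening directly on latents or attention. Video generation is formulated in a streaming setting, and a parametric memory mechanism maintains historical optimization contexts with flexible operations such as insert, read, update, and delete.

Result: On the T2V-CompBench and Vbench benchmarks, TTOM addresses alignment problems in compositional scenarios. The results indicate strong transferability and generalization, achieving cross-modal alignment for compositional video generation on the fly.

Conclusion: TTOM disentangles compositional world knowledge and shows that training-free parameter optimization is effective for video generation, giving a scalable, efficient, and practical approach to compositional video generation with video foundation models.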


📄 Abstract

Video Foundation Models (VFMs) exhibit remarkable visual generation performance, but struggle in compositional scenarios (e.g., motion, numeracy, and spatial relation). In this work, we introduce Test-Time Optimization and Memorization (TTOM), a training-free framework that aligns VFM outputs with spatiotemporal layouts during inference for better text-image alignment. Rather than intervening directly on latents or attention per sample, as in existing work, we integrate and optimize new parameters guided by a general layout-attention objective. Furthermore, we formulate video generation within a streaming setting, and maintain historical optimization contexts with a parametric memory mechanism that supports flexible operations, such as insert, read, update, and delete. Notably, we found that TTOM disentangles compositional world knowledge, showing powerful transferability and generalization. Experimental results on the T2V-CompBench and Vbench benchmarks establish TTOM as an effective, practical, scalable, and efficient framework to achieve cross-modal alignment for compositional video generation on the fly.
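The four memory operations named above (insert, read, update, delete) can be pictured with a small key-value store. Everything specific here is hypothetical: the string keys, the flat list of floats standing in for optimized parameters, the fixed capacity, and the oldest-first eviction rule are illustrative choices, since the abstract does not describe how the memory is organized.

```python
from typing import Dict, List, Optional


class ParametricMemory:
    """Toy store for test-time-optimized parameters, keyed by a prompt pattern."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._store: Dict[str, List[float]] = {}

    def insert(self, key: str, params: List[float]) -> None:
        # When full, evict the oldest entry (dicts keep insertion order).
        if key not in self._store and len(self._store) >= self.capacity:
            self.delete(next(iter(self._store)))
        self._store[key] = list(params)

    def read(self, key: str) -> Optional[List[float]]:
        params = self._store.get(key)
        return None if params is None else list(params)

    def update(self, key: str, params: List[float]) -> None:
        if key not in self._store:
            raise KeyError(key)
        self._store[key] = list(params)

    def delete(self, key: str) -> None:
        self._store.pop(key, None)
```

In a streaming setting, a later prompt with a similar layout could read previously optimized parameters as a starting point instead of optimizing from scratch, which is one plausible reading of how memorization pays off.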

[12] CIR-CoT: Towards Interpretable Composed Image Retrieval via End-to-End Chain-of-Thought Reasoning

Weihuang Lin, Yiwei Ma, Jiayi Ji, Xiaoshuai Sun, Rongrong Ji

🧩 TL;DR

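This paper proposes CIR-CoT, the first end-to-end retrieval-oriented multimodal large language model to integrate explicit Chain-of-Thought reasoning. By generating an interpretable reasoning chain, it captures cross-modal interactions better, improving retrieval accuracy while making the decision process transparent.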


📘 Detailed Summary

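Motivation: Current composed image retrieval methods based on vision-language models and multimodal large language models mostly operate as black boxes. This opacity prevents users from understanding the retrieval rationale and restricts the models' ability to follow complex, fine-grained instructions, so both interpretable reasoning and a better grasp of cross-modal interactions are needed.

Method: CIR-CoT compels the model to first generate an interpretable reasoning chain, which strengthens its ability to capture cross-modal interactions. Structured Chain-of-Thought annotations are created with a three-stage process covering a caption, reasoning, and a conclusion. The model is fine-tuned to produce this structured output and then encodes its final retrieval intent into a dedicated embedding.

Result: Comprehensive experiments show that CIR-CoT achieves highly competitive performance on in-domain datasets (FashionIQ, CIRR) and generalizes remarkably well to the out-of-domain CIRCO dataset.

Conclusion: Integrating explicit Chain-of-Thought reasoning improves both the performance and the interpretability of composed image retrieval. Structured reasoning strengthens the model's understanding of cross-modal interactions and points toward more transparent and trustworthy multimodal retrieval systems.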


📄 Abstract

Composed Image Retrieval (CIR), which aims to find a target image from a reference image and a modification text, presents the core challenge of performing unified reasoning across visual and semantic modalities. While current approaches based on Vision-Language Models (VLMs, e.g., CLIP) and more recent Multimodal Large Language Models (MLLMs, e.g., Qwen-VL) have shown progress, they predominantly function as "black boxes." This inherent opacity not only prevents users from understanding the retrieval rationale but also restricts the models' ability to follow complex, fine-grained instructions. To overcome these limitations, we introduce CIR-CoT, the first end-to-end retrieval-oriented MLLM designed to integrate explicit Chain-of-Thought (CoT) reasoning. By compelling the model to first generate an interpretable reasoning chain, CIR-CoT enhances its ability to capture crucial cross-modal interactions, leading to more accurate retrieval while making its decision process transparent. Since existing datasets like FashionIQ and CIRR lack the necessary reasoning data, a key contribution of our work is the creation of structured CoT annotations using a three-stage process involving a caption, reasoning, and conclusion. Our model is then fine-tuned to produce this structured output before encoding its final retrieval intent into a dedicated embedding. Comprehensive experiments show that CIR-CoT achieves highly competitive performance on in-domain datasets (FashionIQ, CIRR) and demonstrates remarkable generalization on the out-of-domain CIRCO dataset, establishing a new path toward more effective and trustworthy retrieval systems.

[13] DarkHash: A Data-Free Backdoor Attack Against Deep Hashing

Ziqi Zhou, Menghao Deng, Yufei Song, Hangtao Zhang, Wei Wan, Shengshan Hu, Minghui Li, Leo Yu Zhang, Dezhong Yao

🧩 TL;DR

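This paper proposes DarkHash, the first data-free backdoor attack against deep hashing. It designs a shadow backdoor attack framework with dual-semantic guidance that fine-tunes only specific layers of the victim model on a surrogate dataset, implanting a backdoor without access to the training data.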


📘 Detailed Summary

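Motivation: Existing backdoor attacks on deep hashing need access to the training dataset to implant the backdoor. In the real world, obtaining such data is often prohibited because of privacy protection and intellectual property concerns, so implanting a backdoor without access to the training data is a novel and challenging problem.

Method: The paper proposes a shadow backdoor attack framework with dual-semantic guidance. A topological alignment loss optimizes both individual poisoned samples and their neighboring poisoned samples toward the target sample. Only specific layers of the victim model are fine-tuned, using a surrogate dataset, to embed the backdoor while maintaining the original retrieval accuracy.

Result: Experiments on four image datasets, five model architectures, and two hashing methods show that DarkHash is highly effective and outperforms existing state-of-the-art backdoor attacks. Defense experiments show that it withstands existing mainstream backdoor defenses.

Conclusion: The study shows that data-free backdoor attacks on deep hashing models are feasible, offering a new perspective for assessing the security of such models. It also exposes the limitations of current defenses, which matters for deploying models safely in practice.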


📄 Abstract

Benefiting from its superior feature learning capabilities and efficiency, deep hashing has achieved remarkable success in large-scale image retrieval. Recent studies have demonstrated the vulnerability of deep hashing models to backdoor attacks. Although these studies have shown promising attack results, they rely on access to the training dataset to implant the backdoor. In the real world, obtaining such data (e.g., identity information) is often prohibited due to privacy protection and intellectual property concerns. Embedding backdoors into deep hashing models without access to the training data, while maintaining retrieval accuracy for the original task, presents a novel and challenging problem. In this paper, we propose DarkHash, the first data-free backdoor attack against deep hashing. Specifically, we design a novel shadow backdoor attack framework with dual-semantic guidance. It embeds backdoor functionality and maintains original retrieval accuracy by fine-tuning only specific layers of the victim model using a surrogate dataset. We consider leveraging the relationship between individual samples and their neighbors to enhance backdoor attacks during training. By designing a topological alignment loss, we optimize both individual and neighboring poisoned samples toward the target sample, further enhancing the attack capability. Experimental results on four image datasets, five model architectures, and two hashing methods demonstrate the high effectiveness of DarkHash, outperforming existing state-of-the-art backdoor attack methods. Defense experiments show that DarkHash can withstand existing mainstream backdoor defense methods.

[14] Improving Temporal Understanding Logic Consistency in Video-Language Models via Attention Enhancement

Chengzhi Li, Heyan Huang, Ping Jian, Zhen Yang, Yaning Tian

🧩 TL;DR

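This paper proposes Temporally Conditioned Attention Sharpening (TCAS), which improves the ability of cross-modal attention heads to distinguish timestamps in order to address temporal logic inconsistency in video-language models, improving the consistency of their temporal understanding.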


📘 Detailed Summary

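Motivation: Large language models often generate self-contradictory outputs, and video-language models show the same problem: they fail to give logically consistent responses to rephrased questions. This severely affects reliability and hinders practical use, yet the underlying causes remain underexplored.

Method: The paper takes an interpretability-driven approach to analyze the potential factors and proposes Temporally Conditioned Attention Sharpening (TCAS). TCAS constructs an enhancement objective based on attention distinctions to raise the model's temporal resolution capability and thereby improve the logical consistency of its temporal understanding.

Result: Experiments show that the method significantly improves the temporal logic consistency of video-language models. Interpretability analyses confirm that the temporal discriminability of attention heads improves, and the method also improves performance on general video temporal grounding tasks.

Conclusion: Temporal logic consistency is a key bottleneck in temporal understanding, and improving it drives progress in video temporal understanding. TCAS addresses the problem by improving the temporal discriminability of the attention mechanism.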


📄 Abstract

Large language models (LLMs) often generate self-contradictory outputs, which severely impacts their reliability and hinders their adoption in practical applications. In video-language models (Video-LLMs), this phenomenon has recently drawn the attention of researchers. Specifically, these models fail to provide logically consistent responses to rephrased questions based on their grounding outputs. However, the underlying causes of this phenomenon remain underexplored. In this work, we adopt an interpretability-driven approach to analyze, statistically summarize, and intervene on the potential factors of the phenomenon. We find that one of the primary reasons for the inconsistency in responses lies in the inability of cross-modal attention heads to effectively distinguish video tokens across different timestamps. To address this, we propose an attention enhancement method called Temporally Conditioned Attention Sharpening (TCAS), which constructs an enhancement objective based on attention distinctions to enhance the model's temporal resolution capability, thereby improving its temporal understanding logic consistency. Experimental results demonstrate that our method significantly enhances the temporal logic consistency of Video-LLMs. Further interpretability analyses reveal that our method indeed improves the temporal discriminability of attention heads, validating our conclusions. Additionally, our method achieves performance improvements in general video temporal grounding tasks, highlighting that temporal logic consistency is a bottleneck in temporal understanding. By enhancing consistency, our method drives significant progress in video temporal understanding.

[15] UniMMVSR: A Unified Multi-Modal Framework for Cascaded Video Super-Resolution

Shian Du, Menghan Xia, Chang Liu, Quande Liu, Xintao Wang, Pengfei Wan, Xiangyang Ji

🧩 TL;DR

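This paper proposes UniMMVSR, the first unified generative video super-resolution framework to incorporate hybrid-modal conditions, including text, images, and videos. It improves the detail of generated videos and their conformity to the conditions, and enables multi-modal guided generation of 4K video.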


📘 Detailed Summary

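Motivation: Existing cascaded video super-resolution methods are largely confined to text-to-video tasks and do not use generative conditions beyond text. This is a clear gap for ensuring fidelity in multi-modal video generation and limits both quality and conformity to the conditions.

Method: UniMMVSR is a unified framework that systematically explores condition injection strategies, training schemes, and data mixture techniques within a latent video diffusion model. Distinct data construction and condition utilization methods are designed for the different condition types so that the model can use each of them precisely.

Result: Experiments show that UniMMVSR significantly outperforms existing methods, producing videos with superior detail and a higher degree of conformity to multi-modal conditions. Combining it with a base model is shown to be feasible for multi-modal guided generation of 4K video, which existing techniques could not achieve.

Conclusion: The study demonstrates the importance of multi-modal conditions in video super-resolution and offers a new technical path for high-quality video generation, showing the potential of a unified multi-modal framework for improving generation quality and condition fidelity.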


📄 Abstract

Cascaded video super-resolution has emerged as a promising technique for decoupling the computational burden associated with generating high-resolution videos using large foundation models. Existing studies, however, are largely confined to text-to-video tasks and fail to leverage additional generative conditions beyond text, which are crucial for ensuring fidelity in multi-modal video generation. We address this limitation by presenting UniMMVSR, the first unified generative video super-resolution framework to incorporate hybrid-modal conditions, including text, images, and videos. We conduct a comprehensive exploration of condition injection strategies, training schemes, and data mixture techniques within a latent video diffusion model. A key challenge was designing distinct data construction and condition utilization methods to enable the model to precisely utilize all condition types, given their varied correlations with the target video. Our experiments demonstrate that UniMMVSR significantly outperforms existing methods, producing videos with superior detail and a higher degree of conformity to multi-modal conditions. We also validate the feasibility of combining UniMMVSR with a base model to achieve multi-modal guided generation of 4K video, a feat previously unattainable with existing techniques.

[16] Beyond Textual CoT: Interleaved Text-Image Chains with Deep Confidence Reasoning for Image Editing

Zhentao Zou, Zhengrong Yue, Kunpeng Du, Binlei Bao, Hanting Li, Haizhen Xie, Guozheng Xu, Yue Zhou, Yali Wang, Jie Hu, Xue Jiang, Xinghao Chen

🧩 TL;DR

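This paper proposes MURE, a framework that uses a multimodal interleaved chain of thought to turn image editing from purely textual reasoning into a step-by-step process that alternates text and visual cues. It also introduces a multimodal deep confidence reasoning mechanism that prunes low-quality reasoning paths, improving precision and fidelity on complex image editing tasks.


📘 Detailed Summary

Motivation: Existing natural-language image editing methods struggle with intricate object intersections and fine-grained spatial relationships, mainly because they lack an explicit reasoning process. Purely textual chain of thought, or chain of thought augmented only with coordinate information, is fundamentally limited in representing complex visual layouts and lacks the visual cues needed to guide pixel-level detail.

Method: MURE edits images with a natively multimodal, interleaved text-image chain of thought. It generates a step-by-step reasoning chain in which each textual description is followed by a corresponding visual cue, such as a positional mask or a representation of new content. The multimodal deep confidence reasoning paradigm explores a tree of visual reasoning paths at each step and prunes low-quality branches using a deep confidence score from a reward model, so that the model follows a high-quality trajectory to the final edited result.

Result: The method decomposes complex editing tasks into interdependent sub-tasks, achieving higher precision at each stage and yielding high-fidelity edits. The authors build CoT-Edit-14K, the first dataset of its kind, with 14K high-quality editing examples. Extensive experiments on three image editing benchmarks show significant improvements.

Conclusion: The study demonstrates that a multimodal interleaved chain of thought is effective for complex image editing: decomposing reasoning into alternating textual and visual steps handles object intersections and spatial relationships more precisely. The multimodal deep confidence mechanism mitigates hallucination in large language models and offers a new framework for multimodal reasoning tasks.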


📄 Abstract

Image editing with natural language has gained significant popularity, yet existing methods struggle with intricate object intersections and fine-grained spatial relationships due to the lack of an explicit reasoning process. While Chain-of-Thought (CoT) has been explored to enhance reasoning, purely textual CoT or CoT augmented with coordinate information is fundamentally limited in its ability to represent intricate visual layouts and lacks the necessary visual cues to guide the generation of fine-grained, pixel-level details. To address these challenges, we propose Multimodal Reasoning Edit (MURE), a novel framework that shifts the visual editing process from purely text-based reasoning to a series of interleaved textual and visual rationales. Our framework performs image editing using a natively multimodal, interleaved text-image CoT. This approach generates a step-by-step chain of reasoning where a textual description is followed by a corresponding visual cue, such as a positional mask that defines the intended edit regions or a representation of new content. Furthermore, to mitigate the hallucination phenomenon of large language models, we introduce the Multimodal Deep Confidence (MMDC) reasoning paradigm. This paradigm explores a tree of visual reasoning paths at each step. By pruning low-quality branches using a deep confidence score from a reward model, it ensures the model consistently follows a high-quality trajectory towards the final edited result. The proposed method decomposes complex editing tasks into interdependent sub-tasks, achieving greater precision at each stage and yielding high-fidelity edited results. We define the formulation for interleaved text-image chains and release the first CoT-Edit-14K dataset, comprising 14K high-quality editing examples. Extensive experiments show that our method yields significant improvements across three image editing benchmarks.
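
The pruning idea in this abstract (explore a tree of reasoning paths, score branches, keep the good ones) can be illustrated with a minimal sketch. It is a generic beam-style search, not the MMDC procedure from the paper: `expand` (proposes next reasoning steps), `score` (stands in for a reward model's confidence), and the `keep` width are all hypothetical names and parameters chosen for illustration.

```python
from typing import Callable, List, Tuple

Path = Tuple[str, ...]  # one reasoning path = a sequence of step labels


def prune_search(
    expand: Callable[[Path], List[str]],  # hypothetical: proposes next steps
    score: Callable[[Path], float],       # hypothetical: confidence of a path
    steps: int,
    keep: int = 2,
) -> Path:
    """Grow reasoning paths step by step, keeping only the `keep` best."""
    frontier: List[Path] = [()]
    for _ in range(steps):
        # Extend every surviving path with every proposed next step.
        candidates = [path + (nxt,) for path in frontier for nxt in expand(path)]
        if not candidates:
            break
        # Prune: sort by confidence and drop the low-scoring branches.
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:keep]
    return max(frontier, key=score)


# Toy stand-ins: every step is either "good" or "bad"; confidence counts "good".
def toy_expand(path: Path) -> List[str]:
    return ["good", "bad"]


def toy_score(path: Path) -> float:
    return float(sum(step == "good" for step in path))


best = prune_search(toy_expand, toy_score, steps=3, keep=2)
```

In a real system each step would be a text rationale plus a visual cue and the score would come from a learned reward model; the sketch only shows how pruning keeps the search from growing exponentially.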

[17] Evaluating Small Vision-Language Models on Distance-Dependent Traffic Perception

Nikos Theodoridis, Tim Brophy, Reenu Mohandas, Ganesh Sistu, Fiachra Collins, Anthony Scanlan, Ciaran Eising

🧩 TL;DR

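This paper introduces DTPQA, the first distance-annotated visual question answering benchmark focused on perception in traffic scenes. It evaluates small vision-language models in an automated driving setting and finds that they fall well short of human performance on distance-dependent perception tasks.


📘 Detailed Summary

Motivation: Automated driving systems need reliable perception to handle unexpected situations, but current vision-language models are not yet trustworthy for safety-critical use. Their perception of distant objects is a particular weakness, which limits their practical use in automated driving.

Method: The study introduces the Distance-Annotated Traffic Perception Question Answering (DTPQA) benchmark, which contains only perception questions and excludes reasoning tasks. Several state-of-the-art small vision-language models are evaluated on it to test their perception in close-range and long-range traffic scenes.

Result: The best-performing small vision-language model reaches about 60% average accuracy, against about 85% for humans, although the human sample was small, which imposes statistical limitations. The models find certain perception tasks, such as distinguishing left from right, especially difficult.

Conclusion: Current small vision-language models show a significant deficit in distance-dependent perception of traffic scenes. This is an obstacle to the safe deployment of such models in automated driving and indicates where future models need to improve.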


📄 Abstract

Vision-Language Models (VLMs) are becoming increasingly powerful, demonstrating strong performance on a variety of tasks that require both visual and textual understanding. Their strong generalisation abilities make them a promising component for automated driving systems, which must handle unexpected corner cases. However, to be trusted in such safety-critical applications, a model must first possess a reliable perception system. Moreover, since critical objects and agents in traffic scenes are often at a distance, we require systems that are not "shortsighted", i.e., systems with strong perception capabilities at both close (up to 20 meters) and long (30+ meters) range. With this in mind, we introduce Distance-Annotated Traffic Perception Question Answering (DTPQA), the first Visual Question Answering (VQA) benchmark focused solely on perception-based questions in traffic scenes, enriched with distance annotations. By excluding questions that require reasoning, we ensure that model performance reflects perception capabilities alone. Since automated driving hardware has limited processing power and cannot support large VLMs, our study centers on smaller VLMs. More specifically, we evaluate several state-of-the-art (SOTA) small VLMs on DTPQA and show that, despite the simplicity of the questions, these models significantly underperform compared to humans (~60% average accuracy for the best-performing small VLM versus ~85% human performance). However, it is important to note that the human sample size was relatively small, which imposes statistical limitations. We also identify specific perception tasks, such as distinguishing left from right, that remain particularly challenging for these models.
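
As a rough illustration of how distance-annotated results could be broken down, the sketch below computes accuracy per distance range. The close (up to 20 meters) and long (30+ meters) thresholds come from the abstract; the `mid` bucket for the gap between them, the record layout `(distance_m, predicted, expected)`, and the function names are assumptions made for this example, not the benchmark's actual format.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[float, str, str]  # assumed layout: (distance_m, predicted, expected)


def range_label(distance_m: float) -> str:
    """Map a distance in meters to a range bucket."""
    if distance_m <= 20:
        return "close"  # up to 20 m, per the abstract
    if distance_m >= 30:
        return "long"   # 30+ m, per the abstract
    return "mid"        # assumption: the abstract does not name this gap


def accuracy_by_range(records: Iterable[Record]) -> Dict[str, float]:
    """Return the fraction of correct answers within each distance range."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for distance_m, predicted, expected in records:
        label = range_label(distance_m)
        totals[label] += 1
        hits[label] += int(predicted == expected)
    return {label: hits[label] / totals[label] for label in totals}


# Made-up answers, for illustration only.
sample = [
    (5.0, "left", "left"),
    (15.0, "right", "left"),
    (35.0, "yes", "yes"),
    (50.0, "no", "yes"),
    (40.0, "no", "yes"),
]
per_range = accuracy_by_range(sample)
```

Reporting accuracy per range rather than as one average makes a "shortsighted" model visible: it would score well in `close` and poorly in `long`.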

[18] InstructUDrag: Joint Text Instructions and Object Dragging for Interactive Image Editing

Haoran Yu, Yi Shi

🧩 TL;DR

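This paper proposes InstructUDrag, a framework that combines text instructions with object dragging so that object dragging and text-based image editing can be performed simultaneously, addressing the limits of existing methods in precise positioning and semantic control.


📘 Detailed Summary

Motivation: Existing editing methods built on text-to-image diffusion models have inherent limitations. Text-based methods struggle with precise object positioning, while object-dragging methods are confined to static relocation. Neither offers precise position control and semantic attribute editing at the same time.

Method: InstructUDrag treats object dragging as an image reconstruction process with two synergistic branches. The moving-reconstruction branch uses energy-based gradient guidance to move objects accurately and refines cross-attention maps to improve relocation precision. The text-driven editing branch shares gradient signals with the reconstruction branch, which keeps transformations consistent and allows fine-grained control over object attributes. DDPM inversion and prior information injected into the noise maps preserve the structure of moved objects.

Result: Extensive experiments show that InstructUDrag supports flexible, high-fidelity image editing, with both precise object relocation and semantic control over image content.

Conclusion: The study shows that a joint editing framework combining text instructions with object dragging is effective. It gives diffusion models a more complete editing capability that unifies positional precision and semantic control while preserving object structure.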


📄 Abstract

Text-to-image diffusion models have shown great potential for image editing, with techniques such as text-based and object-dragging methods emerging as key approaches. However, each of these methods has inherent limitations: text-based methods struggle with precise object positioning, while object dragging methods are confined to static relocation. To address these issues, we propose InstructUDrag, a diffusion-based framework that combines text instructions with object dragging, enabling simultaneous object dragging and text-based image editing. Our framework treats object dragging as an image reconstruction process, divided into two synergistic branches. The moving-reconstruction branch utilizes energy-based gradient guidance to move objects accurately, refining cross-attention maps to enhance relocation precision. The text-driven editing branch shares gradient signals with the reconstruction branch, ensuring consistent transformations and allowing fine-grained control over object attributes. We also employ DDPM inversion and inject prior information into noise maps to preserve the structure of moved objects. Extensive experiments demonstrate that InstructUDrag facilitates flexible, high-fidelity image editing, offering both precision in object relocation and semantic control over image content.

[19] A Multimodal Depth-Aware Method For Embodied Reference Understanding

Fevziye Irem Eyiokur, Dogucan Yaman, Hazım Kemal Ekenel, Alexander Waibel

🧩 TL;DR

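This paper proposes a novel Embodied Reference Understanding framework that jointly uses LLM-based data augmentation, a depth-map modality, and a depth-aware decision module to identify target objects more accurately and reliably in complex scenes.


📘 Detailed Summary

Motivation: Existing open-vocabulary object detection methods often fail in ambiguous scenes that contain multiple candidate objects, and they do not integrate language instructions and pointing cues well enough to resolve the ambiguity in Embodied Reference Understanding.

Method: The proposed ERU framework uses an LLM-driven data augmentation strategy, adds depth maps as an extra input modality, and includes a depth-aware decision module that combines linguistic and embodied cues, giving a robust integration of multimodal information.

Result: Experiments on two benchmark datasets show that the method significantly outperforms existing baselines, with more accurate and reliable referent detection in complex or cluttered environments.

Conclusion: The study shows that depth information helps resolve ambiguity in Embodied Reference Understanding and offers a new technical route for multimodal fusion in complex visual scene understanding.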


📄 Abstract

Embodied Reference Understanding requires identifying a target object in a visual scene based on both language instructions and pointing cues. While prior works have shown progress in open-vocabulary object detection, they often fail in ambiguous scenarios where multiple candidate objects exist in the scene. To address these challenges, we propose a novel ERU framework that jointly leverages LLM-based data augmentation, depth-map modality, and a depth-aware decision module. This design enables robust integration of linguistic and embodied cues, improving disambiguation in complex or cluttered environments. Experimental results on two datasets demonstrate that our approach significantly outperforms existing baselines, achieving more accurate and reliable referent detection.

[20] The Visual Iconicity Challenge: Evaluating Vision-Language Models on Sign Language Form-Meaning Mapping

Onur Keleş, Aslı Özyürek, Gerardo Ortega, Kadir Gökgö, Esam Ghaleb

🧩 TL;DR

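This paper introduces the Visual Iconicity Challenge, a benchmark for evaluating how well vision-language models understand iconicity in sign language. Current models are far below human level on phonological form prediction and transparency, but a model's sensitivity to phonological form correlates with human iconicity judgments.


📘 Detailed Summary

Motivation: Iconicity, the resemblance between linguistic form and meaning, is pervasive in signed languages and offers a natural testbed for visual grounding. The challenge for vision-language models is to recover these form-meaning mappings from dynamic human motion rather than static context.

Method: The study introduces the Visual Iconicity Challenge, a video-based benchmark that adapts psycholinguistic measures to evaluate vision-language models on three tasks: phonological sign-form prediction (handshape, location, and so on), transparency (inferring meaning from visual form), and graded iconicity ratings.

Result: Thirteen state-of-the-art vision-language models were evaluated in zero-shot and few-shot settings. On phonological form prediction, the models recover some handshape and location detail but remain below human performance. On transparency, they are far from human baselines. Only the top models correlate moderately with human iconicity ratings. Models with stronger phonological form prediction correlate better with human iconicity judgments.

Conclusion: The findings validate these diagnostic tasks and motivate the use of human-centric signals and embodied learning methods to model iconicity and improve visual grounding in multimodal models. The link between phonological form sensitivity and human iconicity judgments indicates a shared sensitivity to visually grounded structure.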


📄 Abstract

Iconicity, the resemblance between linguistic form and meaning, is pervasive in signed languages, offering a natural testbed for visual grounding. For vision-language models (VLMs), the challenge is to recover such essential mappings from dynamic human motion rather than static context. We introduce the Visual Iconicity Challenge, a novel video-based benchmark that adapts psycholinguistic measures to evaluate VLMs on three tasks: (i) phonological sign-form prediction (e.g., handshape, location), (ii) transparency (inferring meaning from visual form), and (iii) graded iconicity ratings. We assess 13 state-of-the-art VLMs in zero- and few-shot settings on Sign Language of the Netherlands and compare them to human baselines. On phonological form prediction, VLMs recover some handshape and location detail but remain below human performance; on transparency, they are far from human baselines; and only top models correlate moderately with human iconicity ratings. Interestingly, models with stronger phonological form prediction correlate better with human iconicity judgment, indicating shared sensitivity to visually grounded structure. Our findings validate these diagnostic tasks and motivate human-centric signals and embodied learning methods for modelling iconicity and improving visual grounding in multimodal models.

[21] Gaze on the Prize: Shaping Visual Attention with Return-Guided Contrastive Learning

Andrew Lee, Ian Chuang, Dechen Gao, Kai Fukazawa, Iman Soltani

🧩 TL;DR

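This paper proposes Gaze on the Prize, a framework that improves the sample efficiency of visual reinforcement learning through a learnable foveal attention mechanism guided by a self-supervised signal based on return differences. It achieves up to a 2.4x improvement in sample efficiency on the ManiSkill3 benchmark.


📘 Detailed Summary

Motivation: Visual reinforcement learning agents must process high-dimensional image data in which only a small fraction of the pixels is task-relevant. Agents therefore waste exploration and computation on irrelevant features, which leads to sample-inefficient and unstable learning.

Method: The method adds a learnable foveal attention mechanism guided by a self-supervised signal derived from return differences. Return-guided contrastive learning trains the attention to distinguish features relevant to success from those relevant to failure: similar visual representations are grouped into positives and negatives according to their return differences, and the resulting labels are used to build contrastive triplets.

Result: Across a suite of manipulation tasks from the ManiSkill3 benchmark, the method achieves up to a 2.4x improvement in sample efficiency and solves tasks that the baseline fails to learn, without modifying the underlying algorithm or hyperparameters.

Conclusion: Return differences reveal which features are task-relevant, and an attention mechanism inspired by human visual foveation can improve the efficiency and stability of visual reinforcement learning.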


📄 Abstract

Visual Reinforcement Learning (RL) agents must learn to act based on high-dimensional image data where only a small fraction of the pixels is task-relevant. This forces agents to waste exploration and computational resources on irrelevant features, leading to sample-inefficient and unstable learning. To address this, inspired by human visual foveation, we introduce Gaze on the Prize. This framework augments visual RL with a learnable foveal attention mechanism (Gaze), guided by a self-supervised signal derived from the agent's experience pursuing higher returns (the Prize). Our key insight is that return differences reveal what matters most: If two similar representations produce different outcomes, their distinguishing features are likely task-relevant, and the gaze should focus on them accordingly. This is realized through return-guided contrastive learning that trains the attention to distinguish between the features relevant to success and failure. We group similar visual representations into positives and negatives based on their return differences and use the resulting labels to construct contrastive triplets. These triplets provide the training signal that teaches the attention mechanism to produce distinguishable representations for states associated with different outcomes. Our method achieves up to 2.4x improvement in sample efficiency and can solve tasks that the baseline fails to learn, demonstrated across a suite of manipulation tasks from the ManiSkill3 benchmark, all without modifying the underlying algorithm or hyperparameters.
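
The grouping step described above can be sketched as follows. This is an illustrative reading of the abstract, not the authors' implementation: the cosine-similarity test, the `sim_threshold` and `return_gap` values, and the Euclidean margin loss are all assumptions introduced for the example.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def build_triplets(
    features: Sequence[Sequence[float]],
    returns: Sequence[float],
    sim_threshold: float = 0.9,  # assumed: how alike two representations must be
    return_gap: float = 0.5,     # assumed: return difference that counts as "different"
) -> List[Tuple[int, int, int]]:
    """Return (anchor, positive, negative) index triplets.

    Among representations similar to the anchor, those with a similar
    return become positives and those with a different return negatives.
    """
    triplets: List[Tuple[int, int, int]] = []
    n = len(features)
    for a in range(n):
        similar = [j for j in range(n)
                   if j != a and cosine(features[a], features[j]) >= sim_threshold]
        positives = [j for j in similar if abs(returns[j] - returns[a]) < return_gap]
        negatives = [j for j in similar if abs(returns[j] - returns[a]) >= return_gap]
        triplets.extend((a, p, q) for p in positives for q in negatives)
    return triplets


def triplet_loss(anchor, positive, negative, margin: float = 1.0) -> float:
    """Standard margin triplet loss with Euclidean distance (assumed choice)."""
    return max(0.0, math.dist(anchor, positive) - math.dist(anchor, negative) + margin)


# Toy data: three near-identical observations and one unrelated observation.
toy_features = [[1.0, 0.0], [0.99, 0.1], [0.98, 0.15], [0.0, 1.0]]
toy_returns = [1.0, 0.9, 0.0, 1.0]
toy_triplets = build_triplets(toy_features, toy_returns)
```

On the toy data, observation 2 looks like observations 0 and 1 but ends with a much lower return, so it serves as the negative for both. In the paper's setting, the loss would be applied to attention-weighted representations, so the gradient pushes the gaze toward whatever distinguishes such look-alike states.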

[22] Unlocking 3D Affordance Segmentation with 2D Semantic Knowledge

Yu Huang, Zelin Peng, Changsong Wen, Xiaokang Yang, Wei Shen

🧩 TL;DR

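This paper proposes a semantic-grounded learning paradigm that transfers rich semantic knowledge from large-scale 2D vision foundation models into the 3D domain via cross-modal affinity transfer. It addresses the unclear semantic boundaries in 3D affordance segmentation and achieves state-of-the-art results on standard benchmarks.


📘 Detailed Summary

Motivation: Existing 3D affordance segmentation methods usually rely on point cloud encoders as generic feature extractors and overlook the intrinsic challenges of 3D data, such as sparsity, noise, and geometric ambiguity. As a result, the learned 3D features lack clear and semantically consistent functional boundaries, which is the main bottleneck of current methods.

Method: The authors propose Cross-Modal Affinity Transfer (CMAT), a pre-training strategy that aligns a 3D encoder with lifted 2D semantics and jointly optimizes reconstruction, affinity, and diversity objectives to yield semantically organized representations. On top of this backbone they design the Cross-modal Affordance Segmentation Transformer (CAST), which integrates multi-modal prompts with CMAT-pretrained features to generate precise, prompt-aware segmentation maps.

Result: Extensive experiments on standard benchmarks show that the framework establishes new state-of-the-art results for 3D affordance segmentation.

Conclusion: The study shows that transferring semantic knowledge from 2D vision foundation models into the 3D domain is effective and provides a semantic-grounded learning paradigm for 3D affordance segmentation, with relevance to robotic manipulation, embodied AI, and augmented reality.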


📄 Abstract

Affordance segmentation aims to parse 3D objects into functionally distinct parts, bridging recognition and interaction for applications in robotic manipulation, embodied AI, and AR. While recent studies leverage visual or textual prompts to guide this process, they often rely on point cloud encoders as generic feature extractors, overlooking the intrinsic challenges of 3D data such as sparsity, noise, and geometric ambiguity. As a result, 3D features learned in isolation frequently lack clear and semantically consistent functional boundaries. To address this bottleneck, we propose a semantic-grounded learning paradigm that transfers rich semantic knowledge from large-scale 2D Vision Foundation Models (VFMs) into the 3D domain. Specifically, we introduce Cross-Modal Affinity Transfer (CMAT), a pre-training strategy that aligns a 3D encoder with lifted 2D semantics and jointly optimizes reconstruction, affinity, and diversity to yield semantically organized representations. Building on this backbone, we further design the Cross-modal Affordance Segmentation Transformer (CAST), which integrates multi-modal prompts with CMAT-pretrained features to generate precise, prompt-aware segmentation maps. Extensive experiments on standard benchmarks demonstrate that our framework establishes new state-of-the-art results for 3D affordance segmentation.

[23] To Sink or Not to Sink: Visual Information Pathways in Large Vision-Language Models

Jiayun Luo, Wan-Cyuan Fan, Lyuyang Wang, Xiangteng He, Tanzila Rahman, Purang Abolmaesumi, Leonid Sigal

🧩 TL;DR

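This paper identifies and studies ViT attention sinks: high-norm visual tokens from the vision encoder that encapsulate high-level semantic concepts. Explicitly using these tokens substantially improves the visual reasoning of large vision-language models.


📘 Detailed Summary

Motivation: Existing work has focused mainly on attention sinks within the LLM and has neglected the high-norm visual tokens in the ViT vision encoder. These ViT attention sinks carry key semantic information but are often overlooked in current LVLM architectures, and it remains unclear how effectively visual signals propagate from the ViT to the LLM.

Method: The authors analyze the information embedded in ViT sink tokens both qualitatively and quantitatively, and propose training-free and training-based approaches that improve how the LLM interprets this information by explicitly using these semantically rich visual tokens.

Result: The approach yields substantial improvements across a range of LVLMs and visual reasoning tasks, showing that ViT attention sinks have untapped potential for enhancing visual reasoning.

Conclusion: ViT attention sinks contain key semantic concepts, and using them explicitly improves the visual understanding of LVLMs. For future LVLM architecture design, this suggests making more effective use of the high-level semantic information in the vision encoder.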


📄 Abstract

Large Vision Language Models (LVLMs) have recently emerged as powerful architectures capable of understanding and reasoning over both visual and textual information. These models typically rely on two key components: a Vision Transformer (ViT) and a Large Language Model (LLM). ViT encodes visual content into a sequence of image tokens and serves as the perceptual front-end -- the eyes of the model. In contrast, the LLM interprets these tokens to perform high-level reasoning, generates responses, and functions as the cognitive core -- the brain of the model. However, it remains unclear which visual tokens contribute most significantly to understanding and reasoning, and how effectively these signals are propagated from ViT to the LLM. While most existing works have focused on identifying attention sinks, low-semantic tokens receiving disproportionately high attention, within the LLM, we shift the focus to the vision encoder by identifying a class of high-norm visual tokens from ViT, referred to as ViT attention sinks -- a problem that has been rarely studied but is indeed very important for LVLMs. Our findings show that these ViT sinks encapsulate high-level semantic concepts from images, allowing the LLM to perform more effective understanding and reasoning. Despite their importance, these sink tokens are often overlooked in existing LVLM architectures. To explore their contribution, we present both qualitative and quantitative analyses of the information embedded in these sink tokens. We also propose both training-free and training-based approaches to better leverage how this information is interpreted by the LLM, and to what extent. By explicitly utilizing these tokens, we demonstrate substantial improvements across a range of LVLMs and visual reasoning tasks, highlighting the untapped potential of ViT attention sinks in enhancing visual reasoning.

[24] SpatialLadder: Progressive Training for Spatial Reasoning in Vision-Language Models

Hongxing Li, Dingming Li, Zixuan Wang, Yuchen Yan, Hang Wu, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang

🧩 TL;DR

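This study presents SpatialLadder, a 3B-parameter model that builds spatial intelligence through a progressive training framework. It achieves state-of-the-art performance on spatial reasoning benchmarks, with a 23.4% average improvement over the base model, and surpasses GPT-4o and Gemini-2.0-Flash.


📘 Detailed Summary

Motivation: Spatial reasoning remains a fundamental challenge for vision-language models. Existing methods try to learn spatial reasoning directly without first establishing the hierarchical foundations of perception and understanding; what is missing is a progressive way to build spatial intelligence.

Method: The authors propose a three-stage progressive training framework: it first establishes spatial perception through object localization, then develops spatial understanding through multi-dimensional spatial tasks, and finally strengthens complex reasoning via reinforcement learning with verifiable rewards. They also build SpatialLadder-26k, a multimodal dataset of 26,610 samples.

Result: SpatialLadder achieves state-of-the-art performance on spatial reasoning benchmarks, with a 23.4% average improvement over the base model, surpassing GPT-4o by 20.8% and Gemini-2.0-Flash by 10.1%. It also generalizes well, with a 7.2% improvement on out-of-domain benchmarks.

Conclusion: Progressive training from perception to reasoning is essential for robust spatial intelligence. This hierarchical approach addresses the fundamental difficulty that existing models have with spatial reasoning.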


📄 Abstract

Spatial reasoning remains a fundamental challenge for Vision-Language Models (VLMs), with current approaches struggling to achieve robust performance despite recent advances. We identify that this limitation stems from a critical gap: existing methods attempt to learn spatial reasoning directly without establishing the hierarchical foundations of perception and understanding. To address this challenge, we present a comprehensive methodology for building spatial intelligence progressively. We introduce SpatialLadder-26k, a multimodal dataset containing 26,610 samples spanning object localization, single image, multi-view, and video spatial reasoning tasks, constructed through a standardized pipeline that ensures systematic coverage across modalities. Building on this dataset, we design a three-stage progressive training framework that (1) establishes spatial perception through object localization, (2) develops spatial understanding through multi-dimensional spatial tasks, and (3) strengthens complex reasoning via reinforcement learning with verifiable rewards. This approach yields SpatialLadder, a 3B-parameter model that achieves state-of-the-art performance on spatial reasoning benchmarks, with 23.4% average improvement over the base model, surpassing GPT-4o by 20.8% and Gemini-2.0-Flash by 10.1%. Notably, SpatialLadder maintains strong generalization with 7.2% improvement on out-of-domain benchmarks, demonstrating that progressive training from perception to reasoning is essential for robust spatial intelligence.

[25] UniVideo: Unified Understanding, Generation, and Editing for Videos

Cong Wei, Quande Liu, Zixuan Ye, Qiulin Wang, Xintao Wang, Pengfei Wan, Kun Gai, Wenhu Chen

🧩 TL;DR

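UniVideo is a unified multimodal video generation framework. Its dual-stream architecture combines a multimodal large language model with a multimodal DiT to model diverse video generation and editing tasks in one system, and it matches or surpasses state-of-the-art task-specific models on multiple benchmarks.


📘 Detailed Summary

Motivation: Current unified multimodal models are largely limited to the image domain, and unified modeling for video is comparatively weak. This work extends unified modeling to video, addressing the challenges of multimodal instruction understanding and visual consistency in video generation and editing.

Method: UniVideo adopts a dual-stream design that combines a Multimodal Large Language Model (MLLM) for understanding complex instructions with a Multimodal DiT (MMDiT) for video generation. Diverse video generation and editing tasks are trained jointly under a single multimodal instruction paradigm.

Result: Experiments show that UniVideo matches or surpasses state-of-the-art task-specific baselines in text/image-to-video generation, in-context video generation, and in-context video editing, and that it exhibits task composition and cross-task generalization.

Conclusion: The unified design enables task composition and generalization. Even without explicit training on free-form video editing, UniVideo transfers editing capability from large-scale image editing data to video.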


📄 Abstract

Unified multimodal models have shown promising results in multimodal content generation and editing but remain largely limited to the image domain. In this work, we present UniVideo, a versatile framework that extends unified modeling to the video domain. UniVideo adopts a dual-stream design, combining a Multimodal Large Language Model (MLLM) for instruction understanding with a Multimodal DiT (MMDiT) for video generation. This design enables accurate interpretation of complex multimodal instructions while preserving visual consistency. Built on this architecture, UniVideo unifies diverse video generation and editing tasks under a single multimodal instruction paradigm and is jointly trained across them. Extensive experiments demonstrate that UniVideo matches or surpasses state-of-the-art task-specific baselines in text/image-to-video generation, in-context video generation and in-context video editing. Notably, the unified design of UniVideo enables two forms of generalization. First, UniVideo supports task composition, such as combining editing with style transfer, by integrating multiple capabilities within a single instruction. Second, even without explicit training on free-form video editing, UniVideo transfers its editing capability from large-scale image editing data to this setting, handling unseen instructions such as green-screening characters or changing materials within a video. Beyond these core capabilities, UniVideo also supports visual-prompt-based video generation, where the MLLM interprets visual prompts and guides the MMDiT during synthesis. To foster future research, we will release our model and code.

[26] VideoNorms: Benchmarking Cultural Awareness of Video Language Models

Nikhil Reddy Varimalla, Yunfei Xu, Arkadiy Saakyan, Meng Fan Wang, Smaranda Muresan

🧩 TL;DR

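This paper introduces VideoNorms, a benchmark of over 1000 (video clip, norm) pairs for evaluating the cultural awareness of video large language models. Current models show significant gaps in understanding cultural norms, and perform especially poorly on Chinese culture and on identifying non-verbal evidence.


📘 Detailed Summary

Motivation: As video large language models are deployed globally, they need to understand and be grounded in the relevant cultural background, but adequate benchmarks for assessing their cultural awareness are lacking.

Method: The dataset is built with a human-AI collaboration framework: a teacher model using theoretically grounded prompting provides candidate annotations, and trained human experts validate and correct them. The resulting benchmark contains video clips, socio-cultural norms, norm adherence and violation labels, and verbal and non-verbal evidence.

Result: Benchmarking reveals several common trends: models perform worse on norm violation than on adherence; they perform worse on Chinese culture than on US culture; they have more difficulty providing non-verbal evidence than verbal evidence; and, unlike humans, they perform worse in formal, non-humorous contexts.

Conclusion: The findings emphasize the need for culturally grounded training of video language models. The benchmark and framework are a first step toward closing this gap and expose the limits of current models in cross-cultural understanding.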


📄 Abstract

As Video Large Language Models (VideoLLMs) are deployed globally, they require understanding of and grounding in the relevant cultural background. To properly assess these models' cultural awareness, adequate benchmarks are needed. We introduce VideoNorms, a benchmark of over 1000 (video clip, norm) pairs from US and Chinese cultures annotated with socio-cultural norms grounded in speech act theory, norm adherence and violation labels, and verbal and non-verbal evidence. To build VideoNorms, we use a human-AI collaboration framework, where a teacher model using theoretically-grounded prompting provides candidate annotations and a set of trained human experts validate and correct the annotations. We benchmark a variety of open-weight VideoLLMs on the new dataset, which highlights several common trends: 1) models perform worse on norm violation than on adherence; 2) models perform worse on Chinese culture than on US culture; 3) models have more difficulty providing non-verbal evidence than verbal evidence for the norm adherence/violation label, and struggle to identify the exact norm corresponding to a speech act; and 4) unlike humans, models perform worse in formal, non-humorous contexts. Our findings emphasize the need for culturally-grounded video language model training - a gap our benchmark and framework begin to address.

[27] VideoVerse: How Far is Your T2V Generator from a World Model?

Zeqing Wang, Xinyu Wei, Bairui Li, Zhen Guo, Jinrui Zhang, Hongyang Wei, Keze Wang, Lei Zhang

🧩 TL;DR

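This paper introduces VideoVerse, a benchmark that addresses the shortcomings of existing text-to-video evaluation by testing complex temporal causality and world knowledge, in order to measure how close T2V models are to world models.


📘 Detailed Summary

Motivation: Existing text-to-video benchmarks have three main problems: current evaluation dimensions can no longer differentiate state-of-the-art T2V models, event-level temporal causality is severely underexplored, and there is no systematic assessment of world knowledge. These capabilities are essential for building world models.

Method: The authors collect representative videos across diverse domains and extract event-level descriptions with inherent temporal causality, which independent annotators rewrite into text-to-video prompts. For each prompt they design a suite of binary evaluation questions covering ten carefully defined evaluation dimensions, and they develop a human-preference-aligned, QA-based evaluation pipeline using modern vision-language models.

Result: VideoVerse comprises 300 carefully curated prompts involving 815 events and 793 binary evaluation questions. A systematic evaluation of state-of-the-art open-source and closed-source T2V models provides an in-depth analysis of how far current T2V generators are from world models.

Conclusion: The study provides a more comprehensive benchmark for T2V evaluation and exposes the limits of current models in understanding complex temporal causality and world knowledge.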


📄 Abstract

The recent rapid advancement of Text-to-Video (T2V) generation technologies, which are critical to building "world models", makes the existing benchmarks increasingly insufficient to evaluate state-of-the-art T2V models. First, current evaluation dimensions, such as per-frame aesthetic quality and temporal consistency, are no longer able to differentiate state-of-the-art T2V models. Second, event-level temporal causality, which not only distinguishes video from other modalities but also constitutes a crucial component of world models, is severely underexplored in existing benchmarks. Third, existing benchmarks lack a systematic assessment of world knowledge, which is an essential capability for building world models. To address these issues, we introduce VideoVerse, a comprehensive benchmark that focuses on evaluating whether a T2V model can understand complex temporal causality and world knowledge in the real world. We collect representative videos across diverse domains (e.g., natural landscapes, sports, indoor scenes, science fiction, chemical and physical experiments) and extract their event-level descriptions with inherent temporal causality, which are then rewritten into text-to-video prompts by independent annotators. For each prompt, we design a suite of binary evaluation questions from the perspective of dynamic and static properties, with a total of ten carefully defined evaluation dimensions. In total, our VideoVerse comprises 300 carefully curated prompts, involving 815 events and 793 binary evaluation questions. A human-preference-aligned, QA-based evaluation pipeline is then developed using modern vision-language models. Finally, we perform a systematic evaluation of state-of-the-art open-source and closed-source T2V models on VideoVerse, providing in-depth analysis of how far the current T2V generators are from world models.

[28] MATRIX: Multimodal Agent Tuning for Robust Tool-Use Reasoning

Tajamul Ashraf, Umair Nawaz, Abdelrahman M. Shaker, Rao Anwer, Philip Torr, Fahad Shahbaz Khan, Salman Khan

🧩 TL;DR

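This paper proposes a vision-centric multimodal agent tuning framework that automatically synthesizes multimodal trajectories and generates step-wise preference pairs to train a vision-language model controller for robust tool-use reasoning. It surpasses both open-source and closed-source vision-language models on multiple benchmarks.


📘 Detailed Summary

Motivation: When vision-language models are used as controllers, their effectiveness on complex reasoning and decision-making tasks is limited by the scarcity of high-quality multimodal trajectories and the cost of manual annotation. Automated trajectory synthesis and preference learning are needed to improve tool use.

Method: The vision-centric agent tuning framework has three core components. First, M-TRACE, a large-scale dataset of 28.5K multimodal tasks with 177K verified trajectories, supports imitation-based trajectory tuning. Second, MATRIX Agent is a controller finetuned for step-wise tool reasoning. Third, Pref-X, a set of 11K automatically generated preference pairs, is used to optimize the MATRIX controller through step-wise preference learning.

Result: On the three benchmarks Agent-X, GTA, and GAIA, the MATRIX controller consistently surpasses both open-source and closed-source vision-language models, demonstrating scalable and effective multimodal tool use.

Conclusion: The study shows that automatically synthesized multimodal trajectories combined with preference learning can substantially improve tool-use reasoning in vision-language models, giving a scalable framework for building multimodal agents. The data and code are released.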


📄 Abstract

Vision language models (VLMs) are increasingly deployed as controllers with access to external tools for complex reasoning and decision-making, yet their effectiveness remains limited by the scarcity of high-quality multimodal trajectories and the cost of manual annotation. We address this challenge with a vision-centric agent tuning framework that automatically synthesizes multimodal trajectories, generates step-wise preference pairs, and trains a VLM controller for robust tool-use reasoning. Our pipeline first constructs M-TRACE, a large-scale dataset of 28.5K multimodal tasks with 177K verified trajectories, enabling imitation-based trajectory tuning. Building on this, we develop MATRIX Agent, a controller finetuned on M-TRACE for step-wise tool reasoning. To achieve finer alignment, we further introduce Pref-X, a set of 11K automatically generated preference pairs, and optimize MATRIX on it via step-wise preference learning. Across three benchmarks, Agent-X, GTA, and GAIA, MATRIX consistently surpasses both open- and closed-source VLMs, demonstrating scalable and effective multimodal tool use. Our data and code are available at https://github.com/mbzuai-oryx/MATRIX.

[29] SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models

Andong Deng, Taojiannan Yang, Shoubin Yu, Lincoln Spencer, Mohit Bansal, Chen Chen, Serena Yeung-Levy, Xiaohan Wang

🧩 TL;DR

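This study introduces SciVideoBench, a rigorous benchmark for evaluating scientific video reasoning. It contains 1,000 multiple-choice questions based on cutting-edge scientific experimental videos and reveals significant performance deficits in current large multimodal models on complex scientific video reasoning.


📘 Detailed Summary

Motivation: Current video benchmarks mainly target general scenarios, rely on perception and recognition, and involve relatively simple reasoning. They have become saturated and cannot effectively evaluate advanced multimodal cognitive skills. Complex video reasoning in the scientific domain in particular remains a major challenge.

Method: SciVideoBench contains 1,000 carefully crafted multiple-choice questions derived from cutting-edge scientific experimental videos spanning over 25 specialized academic subjects and verified by a semi-automatic system. Each question requires sophisticated domain knowledge, precise spatiotemporal perception, and intricate logical reasoning.

Result: The evaluation shows significant performance deficits in state-of-the-art proprietary and open-source large multimodal models, including Gemini 2.5 Pro and Qwen2.5-VL, indicating substantial room for improvement in video reasoning. The authors also analyze critical factors such as reasoning complexity and visual grounding in detail.

Conclusion: The benchmark provides insights and a clear direction for the future development of large multimodal models, toward capable multimodal AI co-scientists, and the authors hope it will help push the boundary of AI for broader science.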


📄 Abstract

Large Multimodal Models (LMMs) have achieved remarkable progress across various capabilities; however, complex video reasoning in the scientific domain remains a significant and challenging frontier. Current video benchmarks predominantly target general scenarios that rely heavily on perception and recognition and involve relatively simple reasoning tasks, leading to saturation and thus failing to effectively evaluate advanced multimodal cognitive skills. To address this critical gap, we introduce SciVideoBench, a rigorous benchmark specifically designed to assess advanced video reasoning in scientific contexts. SciVideoBench consists of 1,000 carefully crafted multiple-choice questions derived from cutting-edge scientific experimental videos spanning over 25 specialized academic subjects and verified by a semi-automatic system. Each question demands sophisticated domain-specific knowledge, precise spatiotemporal perception, and intricate logical reasoning, effectively challenging models' higher-order cognitive abilities. Our evaluation highlights significant performance deficits in state-of-the-art proprietary and open-source LMMs, including Gemini 2.5 Pro and Qwen2.5-VL, indicating substantial room for advancement in video reasoning capabilities. Detailed analyses of critical factors such as reasoning complexity and visual grounding provide valuable insights and clear direction for future developments in LMMs, driving the evolution of truly capable multimodal AI co-scientists. We hope SciVideoBench will serve the interests of the community and help push the boundary of cutting-edge AI for broader science.

[30] Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency

Kaiwen Zheng, Yuji Wang, Qianli Ma, Huayu Chen, Jintao Zhang, Yogesh Balaji, Jianfei Chen, Ming-Yu Liu, Jun Zhu, Qinsheng Zhang

🧩 TL;DR

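This study proposes the score-regularized continuous-time consistency model (rCM), the first effort to scale continuous-time consistency distillation to large, application-level image and video diffusion models. It improves visual quality while maintaining high generation diversity and produces high-fidelity samples in only 1 to 4 steps.


📘 Detailed Summary

Motivation: The continuous-time consistency model (sCM) is theoretically principled and empirically effective for accelerating academic-scale diffusion, but its applicability to large-scale text-to-image and video tasks has been unclear. The obstacles are infrastructure challenges in Jacobian-vector product (JVP) computation and the limitations of standard evaluation benchmarks. The authors' investigation also reveals fundamental quality limitations of sCM in fine-detail generation.

Method: The authors first develop a parallelism-compatible FlashAttention-2 JVP kernel that enables sCM training on models with over 10 billion parameters and on high-dimensional video tasks. They then propose rCM, which incorporates score distillation as a long-skip regularizer, complementing the "mode-covering" forward-divergence objective of sCM with a "mode-seeking" reverse divergence.

Result: Validated on large-scale models (Cosmos-Predict2, Wan2.1) of up to 14B parameters and on 5-second videos, rCM matches or surpasses the state-of-the-art distillation method DMD2 on quality metrics while offering notable advantages in diversity, without GAN tuning or extensive hyperparameter searches. The distilled models generate high-fidelity samples in only 1 to 4 steps, accelerating diffusion sampling by 15x to 50x.

Conclusion: rCM is a practical and theoretically grounded framework for large-scale diffusion distillation. By combining the complementary properties of forward and reverse divergence, it improves visual quality while preserving diversity and gives an efficient way to accelerate inference for large models.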


📄 Abstract

This work represents the first effort to scale up continuous-time consistency distillation to general application-level image and video diffusion models. Although continuous-time consistency model (sCM) is theoretically principled and empirically powerful for accelerating academic-scale diffusion, its applicability to large-scale text-to-image and video tasks remains unclear due to infrastructure challenges in Jacobian-vector product (JVP) computation and the limitations of standard evaluation benchmarks. We first develop a parallelism-compatible FlashAttention-2 JVP kernel, enabling sCM training on models with over 10 billion parameters and high-dimensional video tasks. Our investigation reveals fundamental quality limitations of sCM in fine-detail generation, which we attribute to error accumulation and the "mode-covering" nature of its forward-divergence objective. To remedy this, we propose the score-regularized continuous-time consistency model (rCM), which incorporates score distillation as a long-skip regularizer. This integration complements sCM with the "mode-seeking" reverse divergence, effectively improving visual quality while maintaining high generation diversity. Validated on large-scale models (Cosmos-Predict2, Wan2.1) up to 14B parameters and 5-second videos, rCM matches or surpasses the state-of-the-art distillation method DMD2 on quality metrics while offering notable advantages in diversity, all without GAN tuning or extensive hyperparameter searches. The distilled models generate high-fidelity samples in only $1\sim4$ steps, accelerating diffusion sampling by $15\times\sim50\times$. These results position rCM as a practical and theoretically grounded framework for advancing large-scale diffusion distillation.
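The contrast between the "mode-covering" forward divergence and the "mode-seeking" reverse divergence can be illustrated with a toy calculation. The sketch below is not the paper's training objective; the three-bin target, the two candidate approximations, and the `kl` helper are all illustrative assumptions.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions with strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical bimodal target: two modes separated by a thin valley.
target = [0.495, 0.01, 0.495]
covering = [1 / 3, 1 / 3, 1 / 3]  # spreads mass over every bin
seeking = [0.98, 0.01, 0.01]      # commits to a single mode

# Forward KL(target || model) heavily penalizes a model that misses a mode.
forward = {"covering": kl(target, covering), "seeking": kl(target, seeking)}

# Reverse KL(model || target) heavily penalizes mass placed where the target is thin.
reverse = {"covering": kl(covering, target), "seeking": kl(seeking, target)}

print("forward KL:", forward)
print("reverse KL:", reverse)
```

Under these assumed distributions, forward KL ranks the spread-out approximation as better while reverse KL ranks the single-mode approximation as better, which is the intuition behind pairing the two objectives.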

[31] Video-STAR: Reinforcing Open-Vocabulary Action Recognition with Tools

Zhenlong Yuan, Xiangyan Qu, Chengxuan Qian, Rui Chen, Jing Tang, Lei Sun, Xiangxiang Chu, Dapeng Zhang, Yiwei Wang, Yujun Cai, Shuo Li

🧩 TL;DR

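Video-STAR is a framework that combines contextual sub-motion decomposition with tool-augmented reinforcement learning to address semantic confusion in multimodal large language models for open-vocabulary action recognition. Fine-grained sub-motion matching and cross-modal reasoning yield significant gains in action recognition performance.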


📘 Detailed Summary

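Motivation: Existing multimodal large language models rely too heavily on text priors in open-vocabulary action recognition, struggle to distinguish semantically similar actions, and suffer from cross-modal hallucination. This calls for new methods capable of fine-grained visual reasoning.

Method: The method decomposes actions into discriminative sub-motions for fine-grained matching while dynamically invoking domain-specific tools for cross-modal interleaved reasoning. A hierarchical reward balances tool-usage efficiency, sub-motion relevance, and structural coherence, so that the model shifts autonomously from text-centric reasoning to visually grounded inference.

Result: Extensive evaluations on HMDB-51, UCF-101, SSv2, Kinetics-400, and Kinetics-600 show state-of-the-art performance in distinguishing fine-grained actions and handling cross-modal hallucination, supporting the method's robustness and generalization.

Conclusion: The study shows that combining sub-motion decomposition with tool-augmented reinforcement learning effectively improves open-vocabulary action recognition, and it offers multimodal reasoning systems a path from text-dominated to visually grounded reasoning.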


📄 Abstract

Multimodal large language models (MLLMs) have demonstrated remarkable potential in bridging visual and textual reasoning, yet their reliance on text-centric priors often limits their ability to disentangle semantically similar actions in open-vocabulary scenarios. To address this, we propose Video-STAR, a framework that harmonizes contextual sub-motion decomposition with tool-augmented reinforcement learning for open-vocabulary action recognition (OVAR). Unlike prior methods that treat actions as monolithic entities, our approach decomposes actions into discriminative sub-motions for fine-grained matching while dynamically invoking domain-specific tools for cross-modal interleaving, thereby enabling category-specific reasoning capacity and reducing cross-modal hallucination. Moreover, by designing a hierarchical reward that balances tool-usage efficiency, sub-motion relevance, and structural coherence in reasoning, our method autonomously leverages external tools to prioritize sub-motion patterns without explicit supervision, transitioning from text-centric reasoning to visually grounded inference. Extensive evaluations on the HMDB-51, UCF-101, SSv2, Kinetics-400, and Kinetics-600 datasets demonstrate state-of-the-art performance, outperforming existing methods in distinguishing fine-grained actions and handling cross-modal hallucination, and validating the method's robustness and generalization.

[32] InstructX: Towards Unified Visual Editing with MLLM Guidance

Chong Mou, Qichao Sun, Yanze Wu, Pengze Zhang, Xinghui Li, Fulong Ye, Songtao Zhao, Qian He

🧩 TL;DR

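This paper presents InstructX, a unified framework that integrates multimodal large language models with diffusion models for image and video editing. It shows that training on image data can yield emergent video editing capabilities, and it unifies image and video editing within a single model.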


📘 Detailed Summary

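Motivation: Current research on integrating multimodal large language models (MLLMs) with diffusion models lacks in-depth analysis of design choices, and integration remains challenging for difficult tasks such as video editing. The paper sets out to study MLLM-diffusion integration strategies systematically and to explore the cooperation and distinction between images and videos in unified modeling.

Method: InstructX is built on a comprehensive study of design choices for integrating MLLMs and diffusion models for instruction-driven editing. The approach analyzes unified modeling of images and videos, relies on image-data training to elicit video editing capabilities, and incorporates modality-specific MLLM features so that a single model handles multiple tasks.

Result: Experiments show that the method handles a broad range of image and video editing tasks and achieves state-of-the-art performance. Notably, training on image data alone yields video editing capabilities without explicit supervision, which alleviates the constraint of scarce video training data.

Conclusion: The study reveals that image-data training gives rise to emergent video editing ability, offering a practical remedy for data-scarce settings. By incorporating modality-specific features, it unifies image and video editing within a single model and provides useful insights for multimodal editing systems.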


📄 Abstract

With recent advances in Multimodal Large Language Models (MLLMs) showing strong visual understanding and reasoning, interest is growing in using them to improve the editing performance of diffusion models. Despite rapid progress, most studies lack an in-depth analysis of MLLM design choices. Moreover, the integration of MLLMs and diffusion models remains an open challenge in some difficult tasks, such as video editing. In this paper, we present InstructX, a unified framework for image and video editing. Specifically, we conduct a comprehensive study on integrating MLLMs and diffusion models for instruction-driven editing across diverse tasks. Building on this study, we analyze the cooperation and distinction between images and videos in unified modeling. (1) We show that training on image data can lead to emergent video editing capabilities without explicit supervision, thereby alleviating the constraints imposed by scarce video training data. (2) By incorporating modality-specific MLLM features, our approach effectively unifies image and video editing tasks within a single model. Extensive experiments demonstrate that our method can handle a broad range of image and video editing tasks and achieves state-of-the-art performance.

[33] MoA-VR: A Mixture-of-Agents System Towards All-in-One Video Restoration

Lu Liu, Chunlei Cai, Shaocheng Shen, Jianfeng Liang, Weimin Ouyang, Tianxiao Ye, Jian Mao, Huiyu Duan, Jiangchao Yao, Xiaoyun Zhang, Qiang Hu, Guangtao Zhai

🧩 TL;DR

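This paper proposes MoA-VR, the first video restoration system based on a mixture of cooperating agents. Three coordinated agents mimic the reasoning and processing procedures of human professionals to handle complex and diverse video degradations, outperforming existing baselines in both objective metrics and perceptual quality.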


📘 Detailed Summary

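Motivation: Real-world videos often suffer from complex degradations such as noise, compression artifacts, and low-light distortions because of diverse acquisition and transmission conditions. Existing restoration methods either require professional manual selection of specialized models or rely on monolithic architectures that fail to generalize across degradations, so they lack generality.

Method: MoA-VR comprises three coordinated agents. The degradation identification agent is built on a newly constructed large-scale, high-resolution video degradation recognition benchmark and a vision-language model (VLM). The self-adaptive routing agent is powered by large language models and autonomously learns effective restoration strategies by observing tool usage patterns. The restoration quality assessment agent is built on the purpose-built Res-VQ dataset and a dedicated VLM-based video quality assessment model.

Result: Extensive experiments show that MoA-VR effectively handles diverse and compound degradations and consistently outperforms existing baselines in both objective metrics and perceptual quality.

Conclusion: The study demonstrates the potential of integrating multimodal intelligence and modular reasoning in general-purpose video restoration, offering a template for more intelligent and adaptive video processing systems whose decisions more closely resemble those of human experts.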


📄 Abstract

Real-world videos often suffer from complex degradations, such as noise, compression artifacts, and low-light distortions, due to diverse acquisition and transmission conditions. Existing restoration methods typically require professional manual selection of specialized models or rely on monolithic architectures that fail to generalize across varying degradations. Inspired by expert experience, we propose MoA-VR, the first Mixture-of-Agents Video Restoration system that mimics the reasoning and processing procedures of human professionals through three coordinated agents: Degradation Identification, Routing and Restoration, and Restoration Quality Assessment. Specifically, we construct a large-scale and high-resolution video degradation recognition benchmark and build a vision-language model (VLM) driven degradation identifier. We further introduce a self-adaptive router powered by large language models (LLMs), which autonomously learns effective restoration strategies by observing tool usage patterns. To assess intermediate and final processed video quality, we construct the Restored Video Quality (Res-VQ) dataset and design a dedicated VLM-based video quality assessment (VQA) model tailored for restoration tasks. Extensive experiments demonstrate that MoA-VR effectively handles diverse and compound degradations, consistently outperforming existing baselines in terms of both objective metrics and perceptual quality. These results highlight the potential of integrating multimodal intelligence and modular reasoning in general-purpose video restoration systems.

[34] MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization

Xiangyu Zhao, Junming Lin, Tianhao Liang, Yifan Zhou, Wenhao Chai, Yuzhe Gu, Weiyun Wang, Kai Chen, Gen Luo, Wenwei Zhang, Junchi Yan, Hua Yang, Haodong Duan, Xue Yang

🧩 TL;DR

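This work introduces the MM-HELIX multimodal benchmark and Adaptive Hybrid Policy Optimization, which substantially improve multimodal large language models on long-chain reflective reasoning, with an 18.6% accuracy gain on the benchmark.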


📘 Detailed Summary

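Motivation: Current multimodal large language models perform well on mathematical and logical reasoning tasks, but their capacity for long-chain reflective reasoning, a prerequisite for solving complex real-world problems, remains largely unexplored, and existing models show significant performance deficits on such tasks.

Method: The authors first build MM-HELIX, a multimodal benchmark of 1,260 samples. They then develop the Step-Elicited Response Generation pipeline to create the large-scale MM-HELIX-100K dataset, and they propose Adaptive Hybrid Policy Optimization, a training strategy that dynamically unifies offline supervision and online optimization in a single stage.

Result: On the Qwen2.5-VL-7B baseline, the method achieves an 18.6% accuracy improvement on the MM-HELIX benchmark and a 5.7% average performance gain on general mathematics and logic tasks, clearly outperforming standard reinforcement learning.

Conclusion: The study shows that reflective reasoning in multimodal large language models can be learned effectively and generalizes, which opens a path toward more capable models and underlines the value of hybrid training strategies for complex reasoning tasks.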


📄 Abstract

While current Multimodal Large Language Models (MLLMs) have demonstrated proficiency in reasoning tasks such as mathematics and logic, their capacity for long-chain reflective reasoning, a prerequisite for solving complex real-world problems, remains largely underexplored. In this work, we first conduct an extensive empirical investigation to evaluate this capability. Leveraging a carefully designed data synthesis engine, we construct MM-HELIX, a multimodal benchmark consisting of 1,260 samples of 42 challenging synthetic tasks that require iterative thinking and backtracking. Empirical results on this benchmark reveal that existing MLLMs exhibit significant performance deficits in long-chain reflective reasoning. To address this limitation, we generate post-training data and further explore learning paradigms for exploiting such data. We first develop the Step-Elicited Response Generation pipeline to create MM-HELIX-100K, a large-scale dataset of 100k high-quality, reflective reasoning traces for the instruction-tuning stage. Given that standard Reinforcement Learning fails on complex tasks due to sparse reward signals and catastrophic forgetting after Supervised Fine-Tuning, we propose Adaptive Hybrid Policy Optimization (AHPO), a novel training strategy that dynamically unifies offline supervision and online optimization into a single stage. This strategy enables the model to learn from expert data when rewards are sparse and conduct independent exploration once proficient. When applied to the Qwen2.5-VL-7B baseline, our method achieves a +18.6% accuracy improvement on the MM-HELIX benchmark and demonstrates strong generalization with a +5.7% average performance gain on general mathematics and logic tasks. Our work demonstrates that reflective reasoning in MLLMs can be effectively learned and generalized, paving the way for developing more capable MLLMs.
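The idea of learning from expert data when rewards are sparse and exploring independently once proficient can be sketched as a gated loss. This is a minimal illustration under stated assumptions, not the AHPO objective: the function name `hybrid_loss`, the success-rate gate, and the threshold of 0.5 are all hypothetical.

```python
def hybrid_loss(rl_loss, sft_loss, rollout_rewards, success_threshold=0.5):
    """Combine an online RL loss with an offline supervised loss.

    Assumption: the supervised (expert) term is switched on only when the
    fraction of rewarded rollouts falls below `success_threshold`.
    """
    if not rollout_rewards:
        raise ValueError("need at least one rollout reward")
    success_rate = sum(1 for r in rollout_rewards if r > 0) / len(rollout_rewards)
    use_expert = success_rate < success_threshold
    total = rl_loss + (sft_loss if use_expert else 0.0)
    return total, use_expert


# Sparse rewards: one success in four rollouts, so expert supervision is added.
sparse_case = hybrid_loss(rl_loss=1.0, sft_loss=2.0, rollout_rewards=[0, 0, 0, 1])

# Proficient policy: three successes in four rollouts, so only the RL term remains.
proficient_case = hybrid_loss(rl_loss=1.0, sft_loss=2.0, rollout_rewards=[1, 1, 1, 0])

print(sparse_case, proficient_case)
```

A hard gate is the simplest possible choice here; a smooth weighting of the two terms would be an equally plausible reading of "dynamically unifies".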

[35] MultiCOIN: Multi-Modal COntrollable Video INbetweening

Maham Tanveer, Yang Zhou, Simon Niklaus, Ali Mahdavi Amiri, Hao Zhang, Krishna Kumar Singh, Nanxuan Zhao

🧩 TL;DR

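This paper proposes a video inbetweening framework with multi-modal controls. It maps the various motion controls into a unified sparse point representation and uses a dual-branch architecture to handle content and motion controls separately, enabling flexible and precise generation of video transitions.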


📘 Detailed Summary

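Motivation: Existing video inbetweening methods cannot generate large, complex, or intricate motions. They offer little support for the variety of user intents and lack fine control over the details of intermediate frames, which leads to results that diverge from the creator's intent.

Method: The framework adopts a Diffusion Transformer as the video generative model and maps all motion controls into a unified sparse point representation used as the video/noise input. Content and motion controls are separated into two branches for feature encoding, and a stage-wise training strategy ensures that the multi-modal controls are learned smoothly.

Result: Extensive qualitative and quantitative experiments show that multi-modal controls enable a more dynamic, customizable, and contextually accurate visual narrative, with strong inbetweening quality and control precision.

Conclusion: The multi-modal control framework brings greater flexibility and precision to video editing and long-form video synthesis, balancing ease of use against fine-grained control and opening new directions for creative video production.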


📄 Abstract

Video inbetweening creates smooth and natural transitions between two image frames, making it an indispensable tool for video editing and long-form video synthesis. Existing works in this domain are unable to generate large, complex, or intricate motions. In particular, they cannot accommodate the versatility of user intents and generally lack fine control over the details of intermediate frames, leading to misalignment with the creative mind. To fill these gaps, we introduce MultiCOIN, a video inbetweening framework that allows multi-modal controls, including depth transition and layering, motion trajectories, text prompts, and target regions for movement localization, while achieving a balance between flexibility, ease of use, and precision for fine-grained video interpolation. To achieve this, we adopt the Diffusion Transformer (DiT) architecture as our video generative model, due to its proven capability to generate high-quality long videos. To ensure compatibility between DiT and our multi-modal controls, we map all motion controls into a common sparse and user-friendly point-based representation as the video/noise input. Further, to respect the variety of controls which operate at varying levels of granularity and influence, we separate content controls and motion controls into two branches to encode the required features before guiding the denoising process, resulting in two generators, one for motion and the other for content. Finally, we propose a stage-wise training strategy to ensure that our model learns the multi-modal controls smoothly. Extensive qualitative and quantitative experiments demonstrate that multi-modal controls enable a more dynamic, customizable, and contextually accurate visual narrative.

[36] NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints

Changyao Tian, Hao Li, Gen Luo, Xizhou Zhu, Weijie Su, Hanming Deng, Jinguo Zhu, Jie Shao, Ziran Zhu, Yunpeng Liu, Lewei Lu, Wenhai Wang, Hongsheng Li, Jifeng Dai

🧩 TL;DR

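This work proposes NaViL, a native multimodal large language model. Using end-to-end training, it systematically explores the design space and scaling properties of MLLMs and delivers competitive results on 14 multimodal benchmarks.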


📘 Detailed Summary

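Motivation: Existing MLLMs generally follow a compositional training paradigm in which a pre-trained vision encoder is connected to a pre-trained LLM through continuous multimodal pre-training. Because the components are trained separately, the multimodal scaling properties of this paradigm are difficult to explore.

Method: The study systematically explores the design space of end-to-end native MLLM training under a practical data-constrained setting. A careful study of the available choices yields the best meta-architecture, and the authors propose a simple, cost-effective training recipe.

Result: Experiments show that NaViL is competitive on 14 multimodal benchmarks and reveal a positively correlated scaling relationship between visual encoders and LLMs.

Conclusion: The study offers in-depth insights for future research on native MLLMs, shows that end-to-end training can balance performance against training cost, and provides empirical evidence on the scaling properties of multimodal models.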


📄 Abstract

Compositional training has been the de-facto paradigm in existing Multimodal Large Language Models (MLLMs), where pre-trained vision encoders are connected with pre-trained LLMs through continuous multimodal pre-training. However, the multimodal scaling property of this paradigm remains difficult to explore due to the separated training. In this paper, we focus on the native training of MLLMs in an end-to-end manner and systematically study its design space and scaling property under a practical setting, i.e., data constraint. Through careful study of various choices in MLLM, we obtain the optimal meta-architecture that best balances performance and training cost. After that, we further explore the scaling properties of the native MLLM and indicate the positively correlated scaling relationship between visual encoders and LLMs. Based on these findings, we propose a native MLLM called NaViL, combined with a simple and cost-effective recipe. Experimental results on 14 multimodal benchmarks confirm the competitive performance of NaViL against existing MLLMs. Besides that, our findings and results provide in-depth insights for the future study of native MLLMs.

cs.CL

[37] Deploying Tiny LVLM Judges for Real-World Evaluation of Chart Models: Lessons Learned and Best Practices

Md Tahmid Rahman Laskar, Mohammed Saidul Islam, Ridwan Mahbub, Mizanur Rahman, Amran Bhuiyan, Israt Jahan, Mir Tafseer Nayeem, Shafiq Joty, Enamul Hoque, Jimmy Huang

🧩 TL;DR

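This paper proposes two cost-efficient evaluation approaches, multi-criteria prompting and domain-adaptive transfer learning, that turn a tiny vision-language model (2B parameters) into an effective automatic judge for chart comprehension tasks, addressing the weak performance of small models in resource-constrained settings.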


📘 Detailed Summary

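Motivation: Large vision-language models with 7B parameters have shown promise as automated judges for chart comprehension, but tiny models (≤2B parameters) still perform poorly as judges, which limits their practical use in resource-constrained environments.

Method: Two approaches are proposed. Multi-criteria prompting combines several evaluation criteria into a single query. Domain-adaptive transfer learning fine-tunes a 2B-parameter LVLM on synthetic judgments from a chart dataset to create the ChartJudge model.

Result: Experiments show that multi-criteria prompting exposes robustness gaps, causing a large performance drop for 7B models, including specialized LVLM judges such as LLaVA-Critic. ChartJudge, in turn, transfers knowledge effectively from one dataset to another, making it a more specialized model.

Conclusion: Through fine-grained analysis across chart types and query complexities, the study offers actionable insights into the trade-offs between model size, prompt design, and transferability, enabling scalable, low-cost evaluation for chart reasoning tasks.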


📄 Abstract

Large Vision-Language Models (LVLMs) with only 7B parameters have shown promise as automated judges in chart comprehension tasks. However, tiny models (<=2B parameters) still perform poorly as judges, limiting their real-world use in resource-constrained settings. To address this, we propose two approaches to ensure cost-efficient evaluation: (i) multi-criteria prompting, which combines separate evaluation criteria into a single query, and (ii) domain-adaptive transfer learning, in which we fine-tune a 2B-parameter LVLM on synthetic judgments in a chart dataset to create ChartJudge. Experiments show that multi-criteria prompting exposes robustness gaps, leading to a large drop in performance for 7B models, including specialized LVLM judges like LLaVA-Critic. In addition, we find that our tiny LVLM (ChartJudge) can effectively transfer knowledge from one dataset to another, making it a more specialized model. Our fine-grained analysis across chart types and query complexities offers actionable insights into trade-offs between model size, prompt design, and transferability, enabling scalable, low-cost evaluation for chart reasoning tasks. Our code and the data will be made publicly available.
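Multi-criteria prompting, as described above, folds several evaluation criteria into one query instead of issuing one query per criterion. The sketch below only illustrates that difference in prompt construction; the prompt wording, the 1-5 scale, and the criterion names are hypothetical and are not taken from the paper.

```python
def single_criterion_prompts(question, answer, criteria):
    """One judge query per criterion (the baseline being contrasted)."""
    return [
        f"Question: {question}\nAnswer: {answer}\n"
        f"Rate the answer for {criterion} on a 1-5 scale."
        for criterion in criteria
    ]


def multi_criteria_prompt(question, answer, criteria):
    """A single judge query that covers every criterion at once."""
    lines = [
        f"Question: {question}",
        f"Answer: {answer}",
        "Rate the answer on each criterion below (1-5) and reply as JSON:",
    ]
    lines += [f"- {criterion}" for criterion in criteria]
    return "\n".join(lines)


# Hypothetical criteria for a chart question-answering judgment.
criteria = ["factual correctness", "relevance", "clarity"]
question = "Which year has the highest bar?"
answer = "2021"

separate = single_criterion_prompts(question, answer, criteria)
combined = multi_criteria_prompt(question, answer, criteria)
print(len(separate), "queries versus 1 combined query")
print(combined)
```

The combined form cuts the number of judge calls from one per criterion to one in total, which is where the cost saving comes from; the abstract also reports that it exposes robustness gaps in 7B judges.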

[38] ToolExpander: Extending the Frontiers of Tool-Using Reinforcement Learning to Weak LLMs

Fu Chen, Peng Wang, Xiyin Li, Wen Li, Shichi Lei, Dongdong Xiang

🧩 TL;DR

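This paper proposes the ToolExpander framework. Its two techniques, Dynamic Multi-Round Hard Sampling and Self-Exemplifying Thinking, address inaccurate responses and training collapse when small-scale LLMs are trained with GRPO, markedly improving tool-using ability and training stability.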


📘 Detailed Summary

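Motivation: Training large language models with Group Relative Policy Optimization (GRPO) faces a significant challenge: models often fail to produce accurate responses, particularly in small-scale architectures. This limitation diminishes performance improvements, undermines the potential of GRPO, and frequently leads to mid-training collapse, harming stability and final efficacy.

Method: ToolExpander comprises two key techniques. Dynamic Multi-Round Hard Sampling dynamically replaces challenging samples (those with no correct output over 10 rollouts) with high-quality few-shot demonstrations during training and is paired with an exponential learning rate decay strategy to mitigate oscillations. Self-Exemplifying Thinking is an enhanced GRPO framework that eliminates KL divergence and incorporates adjusted clipping coefficients, encouraging models to autonomously generate and analyze few-shot examples via a minimal additional reward (0.01).

Result: Experiments show that ToolExpander significantly enhances tool-using capabilities in LLMs, especially in weaker small-scale models, while improving training stability and overall performance.

Conclusion: The study shows that a new sampling strategy combined with an enhanced reinforcement learning framework can address key challenges of tool-oriented reinforcement learning for resource-constrained LLMs, giving small-scale models a workable path to efficient training.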


📄 Abstract

Training Large Language Models (LLMs) with Group Relative Policy Optimization (GRPO) encounters a significant challenge: models often fail to produce accurate responses, particularly in small-scale architectures. This limitation not only diminishes performance improvements and undermines the potential of GRPO but also frequently leads to mid-training collapse, adversely affecting stability and final efficacy. To address these issues, we propose ToolExpander, a novel framework that advances tool-oriented reinforcement learning for resource-constrained LLMs through two key innovations: (1) Dynamic Multi-Round Hard Sampling, which dynamically substitutes challenging samples (those without correct outputs over 10 rollouts) with high-quality few-shot demonstrations during training, coupled with an exponential learning rate decay strategy to mitigate oscillations; (2) Self-Exemplifying Thinking, an enhanced GRPO framework that eliminates KL divergence and incorporates adjusted clipping coefficients, encouraging models to autonomously generate and analyze few-shot examples via a minimal additional reward (0.01). Experimental results demonstrate that ToolExpander significantly enhances tool-using capabilities in LLMs, especially in weaker small-scale models, improving both training stability and overall performance.
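The hard-sample substitution rule and the exponential learning rate decay can be sketched in a few lines. This is a minimal illustration, assuming a dictionary-based data layout: only the 10-rollout criterion comes from the abstract, while the function names, the `demos` mapping, and the decay rate are hypothetical.

```python
NUM_ROLLOUTS = 10  # rollout count stated in the abstract


def substitute_hard_samples(prompts, rollout_correct, demos):
    """Swap prompts with no correct rollout for a few-shot demonstration version.

    rollout_correct maps each prompt to a list of booleans, one per rollout.
    demos maps each prompt to its demonstration-augmented form (hypothetical).
    """
    batch = []
    for prompt in prompts:
        results = rollout_correct[prompt]
        is_hard = len(results) == NUM_ROLLOUTS and not any(results)
        batch.append(demos[prompt] if is_hard else prompt)
    return batch


def exponential_lr(initial_lr, decay_rate, step):
    """Exponentially decayed learning rate: initial_lr * decay_rate ** step."""
    return initial_lr * decay_rate ** step


prompts = ["task_a", "task_b"]
rollout_correct = {
    "task_a": [False] * 10,          # never solved: treated as hard
    "task_b": [False] * 9 + [True],  # solved once: kept as-is
}
demos = {"task_a": "demo + task_a", "task_b": "demo + task_b"}

next_batch = substitute_hard_samples(prompts, rollout_correct, demos)
lr_after_two_steps = exponential_lr(initial_lr=1.0, decay_rate=0.5, step=2)
print(next_batch, lr_after_two_steps)
```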

[39] LLM4Cell: A Survey of Large Language and Agentic Models for Single-Cell Biology

Sajib Acharjee Dip, Adrika Zafor, Bikash Kumar Paul, Uddip Acharjee Shuvo, Muhit Islam Emon, Xuan Wang, Liqing Zhang

🧩 TL;DR

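LLM4Cell presents the first unified survey of 58 foundation and agentic models for single-cell research, spanning RNA, ATAC, multi-omic, and spatial modalities. It offers an integrated view of language-driven single-cell intelligence together with an analysis of open challenges.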


📘 Detailed Summary

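Motivation: Progress in applying large language models and agentic frameworks to single-cell biology is fragmented. There is no unified perspective across data modalities, architectures, and evaluation standards, which hinders systematic development of the field.

Method: The survey groups 58 models into five families (foundation, text-bridge, spatial, multimodal, epigenomic, and agentic) and maps them to eight key analytical tasks, including annotation, trajectory modeling, perturbation modeling, and drug-response prediction.

Result: Drawing on more than 40 public datasets, the survey evaluates models across 10 domain dimensions covering biological grounding, multi-omics alignment, fairness, privacy, and explainability, and it analyzes benchmark suitability, data diversity, and ethical and scalability constraints.

Conclusion: The survey provides the first integrated view of language-driven single-cell intelligence and identifies open challenges in interpretability, standardization, and trustworthy model development.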


📄 Abstract

Large language models (LLMs) and emerging agentic frameworks are beginning to transform single-cell biology by enabling natural-language reasoning, generative annotation, and multimodal data integration. However, progress remains fragmented across data modalities, architectures, and evaluation standards. LLM4Cell presents the first unified survey of 58 foundation and agentic models developed for single-cell research, spanning RNA, ATAC, multi-omic, and spatial modalities. We categorize these methods into five families (foundation, text-bridge, spatial, multimodal, epigenomic, and agentic) and map them to eight key analytical tasks including annotation, trajectory and perturbation modeling, and drug-response prediction. Drawing on over 40 public datasets, we analyze benchmark suitability, data diversity, and ethical or scalability constraints, and evaluate models across 10 domain dimensions covering biological grounding, multi-omics alignment, fairness, privacy, and explainability. By linking datasets, models, and evaluation domains, LLM4Cell provides the first integrated view of language-driven single-cell intelligence and outlines open challenges in interpretability, standardization, and trustworthy model development.

[40] CS3-Bench: Evaluating and Enhancing Speech-to-Speech LLMs for Mandarin-English Code-Switching

Heyang Liu, Yuhao Wang, Ziyang Cheng, Ronghua Wu, Qunshan Gu, Yanfeng Wang, Yu Wang

🧩 TL;DR

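This work introduces the Code-Switching Speech-to-Speech Benchmark (CS3-Bench), which exposes serious language-alignment deficiencies in mainstream multimodal large language models, and proposes Chain of Recognition and Keyword Highlighting to substantially improve cross-lingual understanding and generation.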


📘 Detailed Summary

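Motivation: Existing speech-to-speech interaction systems achieve natural monolingual interaction but show significant deficiencies in language alignment. In code-switching scenarios in particular, models suffer severe performance drops in knowledge-intensive question answering and misunderstand open-ended conversations.

Method: The study proposes the CS3-Bench benchmark for evaluating language alignment and develops data construction and training approaches, using Chain of Recognition (CoR) to enhance understanding and Keyword Highlighting (KH) to guide generation.

Result: Experiments on 7 mainstream models show a relative performance drop of up to 66%. The proposed approach raises knowledge accuracy from 25.14% to 46.13% and the open-ended understanding rate from 64.5% to 86.5%, and it significantly reduces pronunciation errors in the secondary language.

Conclusion: The study underlines the importance of language alignment in multimodal large language models. The proposed approach improves cross-lingual interaction and provides a practical solution and an evaluation benchmark for building more robust code-switching speech systems.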


📄 Abstract

The advancement of multimodal large language models has accelerated the development of speech-to-speech interaction systems. While natural monolingual interaction has been achieved, we find existing models exhibit deficiencies in language alignment. In our proposed Code-Switching Speech-to-Speech Benchmark (CS3-Bench), experiments on 7 mainstream models demonstrate a relative performance drop of up to 66% in knowledge-intensive question answering and varying degrees of misunderstanding in open-ended conversations. Starting from a model with severe performance deterioration, we propose both data constructions and training approaches to improve the language alignment capabilities, specifically employing Chain of Recognition (CoR) to enhance understanding and Keyword Highlighting (KH) to guide generation. Our approach improves the knowledge accuracy from 25.14% to 46.13%, with open-ended understanding rate from 64.5% to 86.5%, and significantly reduces pronunciation errors in the secondary language. CS3-Bench is available at https://huggingface.co/datasets/VocalNet/CS3-Bench.
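The reported figures mix relative and absolute changes, so a short calculation helps keep them apart. The before/after accuracies below are the ones quoted in the abstract; the `relative_change` helper and the monolingual/code-switched score pair used to illustrate a 66% relative drop are hypothetical.

```python
def relative_change(before, after):
    """Signed relative change, e.g. -0.66 for a 66% relative drop."""
    return (after - before) / before


# Figures quoted in the abstract.
knowledge_before, knowledge_after = 25.14, 46.13
understanding_before, understanding_after = 64.5, 86.5

knowledge_gain_points = knowledge_after - knowledge_before
knowledge_gain_relative = relative_change(knowledge_before, knowledge_after)
understanding_gain_points = understanding_after - understanding_before

# Hypothetical scores showing what a 66% relative drop looks like.
example_drop = relative_change(before=50.0, after=17.0)

print(f"knowledge accuracy: +{knowledge_gain_points:.2f} points "
      f"({knowledge_gain_relative:.1%} relative)")
print(f"open-ended understanding: +{understanding_gain_points:.1f} points")
print(f"example relative drop: {example_drop:.0%}")
```

The knowledge-accuracy gain is about 21 percentage points in absolute terms, which is a far larger number when expressed relative to the 25.14% starting point.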

[41] Vision-Enabled LLMs in Historical Lexicography: Digitising and Enriching Estonian-German Dictionaries from the 17th and 18th Centuries

Madis Jürviste, Joonatan Jakobson

🧩 TL;DR

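This study demonstrates the potential of large language models for research on 17th and 18th century Estonian dictionaries. Semi-automatic enrichment of dictionary information, recognition of Gothic-script text, and preparation of a cross-source dataset make the digitisation of historical language resources considerably more efficient.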


📘 Detailed Summary

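Motivation: The study addresses core challenges in digitising historical Estonian dictionaries: linking 17th and 18th century dictionary content to modern word forms efficiently, overcoming the difficulty of recognising text printed in Gothic script, and building a unified dataset across sources.

Method: Large language models are applied to several tasks. Claude 3.7 Sonnet is used to enrich dictionary entries with meanings and modern equivalents. Vision-enabled LLMs perform zero-shot recognition of text printed in Gothic script. Scanned images are processed with overlapping tiling, with one LLM handling text recognition and a second merging the structured output.

Result: Given sufficient context, Claude 3.7 Sonnet accurately provided meanings and modern equivalents for 81% of headword entries. In the zero-shot text recognition task, 41% of headword entries were identified and structured into error-free JSON output. Overlapping tiling with two cooperating LLMs is employed to digitise the dictionary in Hupel's 1780 grammar.

Conclusion: The study shows that even for minor languages, large language models have significant potential to save time and money in digitising historical language resources, offering an efficient route for historical linguistics and cultural heritage preservation and illustrating the value of multi-model cooperation for complex document processing.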


📄 Abstract

This article presents research conducted at the Institute of the Estonian Language between 2022 and 2025 on the application of large language models (LLMs) to the study of 17th and 18th century Estonian dictionaries. The authors address three main areas: enriching historical dictionaries with modern word forms and meanings; using vision-enabled LLMs to perform text recognition on sources printed in Gothic script (Fraktur); and preparing for the creation of a unified, cross-source dataset. Initial experiments with J. Gutslaff's 1648 dictionary indicate that LLMs have significant potential for semi-automatic enrichment of dictionary information. When provided with sufficient context, Claude 3.7 Sonnet accurately provided meanings and modern equivalents for 81% of headword entries. In a text recognition experiment with A. T. Helle's 1732 dictionary, a zero-shot method successfully identified and structured 41% of headword entries into error-free JSON-formatted output. For digitising the Estonian-German dictionary section of A. W. Hupel's 1780 grammar, overlapping tiling of scanned image files is employed, with one LLM being used for text recognition and a second for merging the structured output. These findings demonstrate that even for minor languages LLMs have a significant potential for saving time and financial resources.
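Overlapping tiling splits a scanned page into tiles that share a margin, so that an entry cut by one tile boundary appears whole in the neighbouring tile. The sketch below computes vertical tile ranges only; the function name, the one-dimensional layout, and the pixel sizes are illustrative assumptions, since the abstract does not give the tiling parameters.

```python
def tile_ranges(page_height, tile_height, overlap):
    """Return (top, bottom) pixel ranges covering a page with overlapping tiles."""
    if tile_height <= 0 or not 0 <= overlap < tile_height:
        raise ValueError("need tile_height > 0 and 0 <= overlap < tile_height")
    step = tile_height - overlap
    ranges = []
    top = 0
    while True:
        bottom = min(top + tile_height, page_height)
        ranges.append((top, bottom))
        if bottom == page_height:
            break
        top += step
    return ranges


# Hypothetical 1000-pixel-tall scan, 400-pixel tiles, 100-pixel overlap.
tiles = tile_ranges(page_height=1000, tile_height=400, overlap=100)
print(tiles)
```

Because adjacent tiles repeat the overlapping strip, the recognised text contains duplicates near tile boundaries, which is why a separate merging step (performed by a second LLM in the described workflow) is needed.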

[42] Leveraging Author-Specific Context for Scientific Figure Caption Generation: 3rd SciCap Challenge

Watcharapong Timklaypachara, Monrada Chiewhawan, Nopporn Lekuthai, Titipat Achakulvisut

🧩 TL;DR

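This work presents a domain-specific caption generation system for scientific figures. By integrating figure-related textual context with author-specific writing styles, it produces captions for the SciCap Challenge that are both scientifically accurate and stylistically faithful to the source paper.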


📘 Detailed Summary

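Motivation: Scientific figure captions need both accuracy and stylistic consistency to convey visual information. Existing methods fall short in preserving author-specific writing style, which motivates a caption generation system that combines contextual understanding with style adaptation.

Method: The system uses a two-stage pipeline. Stage 1 combines context filtering, category-specific prompt optimization via DSPy's MIPROv2 and SIMBA, and caption candidate selection. Stage 2 applies profile-based few-shot prompting for stylistic refinement.

Result: Category-specific prompting improves ROUGE-1 recall by +8.3% while limiting precision loss to -2.8% and BLEU-4 reduction to -10.9%. Profile-informed stylistic refinement yields 40-48% gains in BLEU scores and 25-27% in ROUGE.

Conclusion: The study shows that combining contextual understanding with author-specific stylistic adaptation can generate captions that are both scientifically accurate and stylistically faithful to the source paper, and it highlights the value of category-specific optimization and style adaptation for domain-specific captioning.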


📄 Abstract

Scientific figure captions require both accuracy and stylistic consistency to convey visual information. Here, we present a domain-specific caption generation system for the 3rd SciCap Challenge that integrates figure-related textual context with author-specific writing styles using the LaMP-Cap dataset. Our approach uses a two-stage pipeline: Stage 1 combines context filtering, category-specific prompt optimization via DSPy's MIPROv2 and SIMBA, and caption candidate selection; Stage 2 applies few-shot prompting with profile figures for stylistic refinement. Our experiments demonstrate that category-specific prompts outperform both zero-shot and general optimized approaches, improving ROUGE-1 recall by +8.3% while limiting precision loss to -2.8% and BLEU-4 reduction to -10.9%. Profile-informed stylistic refinement yields 40-48% gains in BLEU scores and 25-27% in ROUGE. Overall, our system demonstrates that combining contextual understanding with author-specific stylistic adaptation can generate captions that are both scientifically accurate and stylistically faithful to the source paper.
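The reported trade-off between ROUGE-1 recall and precision follows from how the two are defined: both count overlapping unigrams, but recall divides by the reference length and precision by the candidate length. The sketch below is a simplified version with lowercase whitespace tokenization, not the official ROUGE implementation, and the example captions are made up.

```python
from collections import Counter


def rouge1(candidate, reference):
    """Simplified ROUGE-1 (recall, precision) using whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    return recall, precision


# Hypothetical reference caption and two hypothetical candidates.
reference = "Accuracy of the model over training epochs"
short_caption = "Model accuracy"
long_caption = "Accuracy of the model over training epochs on the validation set"

short_scores = rouge1(short_caption, reference)
long_scores = rouge1(long_caption, reference)
print("short:", short_scores)
print("long:", long_scores)
```

In this toy case the longer caption recovers every reference word (higher recall) but adds words the reference lacks (lower precision), the same direction of trade-off as the +8.3% recall and -2.8% precision reported above.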

[43] Learning on the Job: An Experience-Driven Self-Evolving Agent for Long-Horizon Tasks

Cheng Yang, Xuemeng Yang, Licheng Wen, Daocheng Fu, Jianbiao Mei, Rong Wu, Pinlong Cai, Yufan Shen, Nianchen Deng, Botian Shi, Yu Qiao, Haifeng Li

🧩 TL;DR

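This paper proposes MUSE, an experience-driven, self-evolving AI agent framework built around a hierarchical Memory Module. The agent learns continuously through autonomous reflection and knowledge accumulation and sets a new state of the art on long-horizon productivity tasks.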


📘 Detailed Summary

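Motivation: Existing LLM agents have a critical limitation: they are static at test time and cannot learn from experience. Without the ability to accumulate knowledge and improve continuously, they are hard to deploy effectively on real-world long-horizon tasks.

Method: MUSE introduces an experience-driven, self-evolving system centered on a hierarchical Memory Module. It organizes experience at different levels to plan and execute long-horizon tasks across multiple applications. After each sub-task, the agent autonomously reflects on its trajectory, converts the raw trajectory into structured experience, and integrates it back into the Memory Module.

Result: On the long-horizon productivity benchmark TAC, MUSE achieves new state-of-the-art performance by a significant margin using only the lightweight Gemini-2.5 Flash model. Experiments show that task completion improves steadily as the agent accumulates experience, and that the accumulated experience generalizes well, enabling zero-shot improvement on new tasks.

Conclusion: MUSE establishes a new paradigm for AI agents that automate real-world productivity tasks, showing that an experience-driven self-evolution mechanism lets an agent go beyond its static pretrained parameters and keep learning.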


📄 Abstract

Large Language Models have demonstrated remarkable capabilities across diverse domains, yet significant challenges persist when deploying them as AI agents for real-world long-horizon tasks. Existing LLM agents suffer from a critical limitation: they are test-time static and cannot learn from experience, lacking the ability to accumulate knowledge and continuously improve on the job. To address this challenge, we propose MUSE, a novel agent framework that introduces an experience-driven, self-evolving system centered around a hierarchical Memory Module. MUSE organizes diverse levels of experience and leverages them to plan and execute long-horizon tasks across multiple applications. After each sub-task execution, the agent autonomously reflects on its trajectory, converting the raw trajectory into structured experience and integrating it back into the Memory Module. This mechanism enables the agent to evolve beyond its static pretrained parameters, fostering continuous learning and self-evolution. We evaluate MUSE on the long-horizon productivity benchmark TAC. It achieves new SOTA performance by a significant margin using only a lightweight Gemini-2.5 Flash model. Extensive experiments demonstrate that as the agent autonomously accumulates experience, it exhibits increasingly superior task completion capabilities, as well as robust continuous learning and self-evolution capabilities. Moreover, the accumulated experience from MUSE exhibits strong generalization properties, enabling zero-shot improvement on new tasks. MUSE establishes a new paradigm for AI agents capable of real-world productivity task automation.
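The reflect-and-integrate loop described above can be sketched with a small in-memory store. This is a deliberately simplified illustration, not MUSE's design: the memory here is flat and keyed by task type rather than hierarchical, reflection is a rule instead of an LLM call, and every class, function, and field name is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryModule:
    """Toy experience store keyed by task type (flat, not hierarchical)."""
    experiences: dict = field(default_factory=dict)

    def retrieve(self, task_type):
        return self.experiences.get(task_type, [])

    def integrate(self, task_type, lesson):
        self.experiences.setdefault(task_type, []).append(lesson)


def reflect(trajectory):
    """Condense a raw trajectory into a structured lesson.

    Assumption: each step is a dict with an "action" name and an "ok" flag.
    """
    return {
        "avoid": [step["action"] for step in trajectory if not step["ok"]],
        "reuse": [step["action"] for step in trajectory if step["ok"]],
    }


memory = MemoryModule()
trajectory = [
    {"action": "open_app", "ok": True},
    {"action": "click_missing_button", "ok": False},
]

# After a sub-task finishes, reflect on it and write the lesson back to memory.
memory.integrate("spreadsheet", reflect(trajectory))

# A later sub-task of the same type can consult the stored lesson before acting.
print(memory.retrieve("spreadsheet"))
```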

[44] A Survey of Process Reward Models: From Outcome Signals to Process Supervisions for Large Language Models

Congming Zheng, Jiachen Zhu, Zhuoying Ou, Yuxiang Chen, Kangning Zhang, Rong Shan, Zeyu Zheng, Mengyue Yang, Jianghao Lin, Yong Yu, Weinan Zhang

🧩 TL;DR

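This survey systematically reviews Process Reward Models (PRMs) for reasoning alignment in large language models. Covering the full loop, it contrasts process-level evaluation with conventional outcome reward models and offers a guiding framework for research on fine-grained reasoning alignment.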


📘 Detailed Summary

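Motivation: Although large language models exhibit advanced reasoning ability, conventional alignment is still dominated by outcome reward models that judge only final answers. Process Reward Models address this gap by evaluating and guiding reasoning at the step or trajectory level.

Method: The survey analyzes PRMs through the full loop: how to generate process data, how to build PRMs, and how to use PRMs for test-time scaling and reinforcement learning. It covers applications in math, code, text, multimodal reasoning, robotics, and agents.

Result: The survey summarizes applications of PRMs across domains and reviews emerging benchmarks, illustrating how process-level evaluation is used to improve reasoning quality.

Conclusion: The survey clarifies the design space of PRMs, identifies open challenges, and points future research toward fine-grained, robust reasoning alignment.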


📄 Abstract

Although Large Language Models (LLMs) exhibit advanced reasoning ability, conventional alignment remains largely dominated by outcome reward models (ORMs) that judge only final answers. Process Reward Models (PRMs) address this gap by evaluating and guiding reasoning at the step or trajectory level. This survey provides a systematic overview of PRMs through the full loop: how to generate process data, build PRMs, and use PRMs for test-time scaling and reinforcement learning. We summarize applications across math, code, text, multimodal reasoning, robotics, and agents, and review emerging benchmarks. Our goal is to clarify design spaces, reveal open challenges, and guide future research toward fine-grained, robust reasoning alignment.
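The difference between judging only final answers and judging each step shows up clearly in best-of-N selection, one common form of test-time scaling. The sketch below is illustrative only: the scores are made up, the dictionary layout is hypothetical, and taking the minimum step score is just one of several aggregation choices a PRM-based selector might use.

```python
def select_with_orm(candidates):
    """Pick the candidate whose final answer scores highest."""
    return max(candidates, key=lambda c: c["outcome_score"])


def select_with_prm(candidates):
    """Pick the candidate whose weakest reasoning step scores highest.

    Assumption: step scores are aggregated with min; product or mean
    would be equally plausible choices.
    """
    return max(candidates, key=lambda c: min(c["step_scores"]))


# Hypothetical scores for two sampled solutions to the same problem.
candidates = [
    # A reaches a plausible-looking answer through one badly flawed step.
    {"id": "A", "step_scores": [0.9, 0.2, 0.9], "outcome_score": 0.8},
    # B is uniformly sound, with a slightly lower outcome score.
    {"id": "B", "step_scores": [0.8, 0.7, 0.8], "outcome_score": 0.7},
]

orm_choice = select_with_orm(candidates)["id"]
prm_choice = select_with_prm(candidates)["id"]
print("ORM picks", orm_choice, "and PRM picks", prm_choice)
```

With these assumed scores the outcome-only selector prefers the solution containing a flawed step, while the step-level selector prefers the uniformly sound one.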

[45] DACIP-RC: Domain Adaptive Continual Instruction Pre-Training via Reading Comprehension on Business Conversations

Elena Khasanova, Harsh Saini, Md Tahmid Rahman Laskar, Xue-Yong Fu, Cheng Chen, Shashi Bhushan TN

🧩 TL;DR

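This paper proposes DACIP-RC, a continual pre-training approach based on reading comprehension that strengthens the domain adaptability of smaller LLMs for business conversation tasks and significantly improves zero-shot generalization.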


📘 Detailed Summary

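Motivation: The high inference cost of large-scale LLMs makes them impractical for real-world industrial deployment, while smaller LLMs lack robust zero-shot instruction-following capabilities across domains. Traditional fine-tuning additionally induces catastrophic forgetting, which limits a model's ability to adapt to dynamic user requirements.

Method: The paper proposes Domain Adaptive Continual Instruction Pre-Training via Reading Comprehension (DACIP-RC). Unlike conventional pre-training that relies on next-token prediction, it generates diverse task instructions and responses via reading comprehension on conversation transcripts, enabling better instruction generalization.

Result: Empirical evaluations show that DACIP-RC significantly improves zero-shot generalization across a range of business conversational tasks, including meeting summarization, action item generation, and call purpose identification.

Conclusion: This is the first work to apply instruction pre-training to business conversational data. It offers insights into how industries can use proprietary datasets for domain adaptation and shows that continual pre-training improves the adaptability of smaller models.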


📄 Abstract

The rapid advancements in Large Language Models (LLMs) have enabled their adoption in real-world industrial scenarios for various natural language processing tasks. However, the high inference cost of large-scale LLMs makes their deployment impractical, necessitating the use of smaller models. Despite their efficiency, smaller LLMs lack robust zero-shot instruction-following capabilities across diverse domains, limiting their adaptability to dynamic user requirements. Traditional fine-tuning approaches exacerbate this issue by inducing catastrophic forgetting, reducing the model's generalization ability for unseen tasks. In this paper, we propose Domain Adaptive Continual Instruction Pre-Training via Reading Comprehension (DACIP-RC), a continual pre-training technique that enhances smaller LLMs' domain adaptability for business conversational tasks. Unlike conventional pre-training approaches that rely on next-token prediction, DACIP-RC generates diverse task instructions and responses via reading comprehension on conversation transcripts, enabling better instruction generalization. Our empirical evaluations demonstrate that DACIP-RC significantly improves zero-shot generalization across a wide range of business conversational tasks, including meeting summarization, action item generation, and call purpose identification. To the best of our knowledge, this is the first work to apply instruction pre-training on business conversational data, providing insights into how industries can leverage proprietary datasets for domain adaptation.

[46] ARM2: Adaptive Reasoning Model with Vision Understanding and Executable Code

Jian Xie, Zhendong Chu, Aoxiao Zhong, Kai Zhang, Mingzhe Han, Xin Fang, Jialie Shen, Qingsong Wen

🧩 TL;DR

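This paper presents ARM2, a unified model that adaptively balances reasoning performance and efficiency through a reinforcement learning framework with length-aware optimization. On multimodal tasks it substantially reduces token usage while performing on par with traditional reasoning models.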


📘 Detailed Summary

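Motivation: Large reasoning models suffer from "over-thinking", generating unnecessarily long reasoning on simple tasks. Existing mitigations such as length penalties or routing mechanisms are typically heuristic and task-specific, and a general framework for adaptive reasoning is lacking.

Method: ARM2 uses a reinforcement learning framework augmented with length-aware optimization to balance reasoning performance and efficiency across multiple reasoning formats. It integrates vision understanding to extend to multimodal applications, and it integrates executable code into reasoning, which substantially lowers token cost compared with long chain-of-thought.

Result: Experiments show that ARM2 performs on par with traditional reasoning models trained with GRPO while reducing token usage by over 70% on average. Extensive analyses validate its effectiveness and the soundness of its design.

Conclusion: ARM2 offers a general framework for adaptive reasoning that markedly improves efficiency while maintaining performance. The work shows the potential of integrating executable code into reasoning and suggests a new direction for multimodal reasoning models.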


📄 Abstract

Large Reasoning Models (LRMs) often suffer from the "over-thinking" problem, generating unnecessarily long reasoning on simple tasks. Some strategies have been proposed to mitigate this issue, such as length penalties or routing mechanisms, but they are typically heuristic and task-specific, lacking a general framework for adaptive reasoning. In this paper, we present ARM2, a unified model that adaptively balances reasoning performance and efficiency across multiple formats through a reinforcement learning framework augmented with length-aware optimization. Beyond conventional natural language inference, ARM2 integrates vision understanding, extending its applicability to multimodal tasks. Moreover, ARM2 integrates executable code into reasoning, enabling substantial reductions in token cost while preserving task performance compared to long CoT. Experiments demonstrate that ARM2 achieves performance on par with traditional reasoning models trained with GRPO, while reducing token usage by over 70% on average. We further conduct extensive analyses to validate the effectiveness of ARM2 and the soundness of its design.
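
The abstract does not spell out ARM2's length-aware optimization. As a rough illustration of the general idea of trading correctness against token cost in a reward signal, here is a hypothetical shaping function. The name `length_aware_reward`, the `budget` and `penalty` parameters, and the linear form are all assumptions and should not be read as the authors' objective.

```python
def length_aware_reward(correct: bool, n_tokens: int, budget: int,
                        penalty: float = 0.5) -> float:
    """Hypothetical reward: full credit for short correct answers,
    reduced credit as the response approaches or exceeds the token budget."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    if not correct:
        return 0.0  # no reward for wrong answers, regardless of length
    overuse = min(n_tokens / budget, 1.0)  # fraction of the budget consumed
    return 1.0 - penalty * overuse


if __name__ == "__main__":
    print(length_aware_reward(True, 100, 1000))   # short and correct
    print(length_aware_reward(True, 2000, 1000))  # correct but over budget
    print(length_aware_reward(False, 50, 1000))   # wrong
```

Under this shaping, a correct answer is always preferred to a wrong one as long as `penalty` stays below 1, so the length term only ranks correct responses among themselves.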

[47] Single layer tiny Co$^4$ outpaces GPT-2 and GPT-BERT

Noor Ul Zain, Mohsin Raza, Ahsan Adeel

🧩 TL;DR

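This paper presents Co⁴, a tiny language model with a single layer, two attention heads, and 8M parameters. Operating at approximately O(N) cost, it outpaces GPT-2 and GPT-BERT, both of which have O(N²) cost, pointing to a new paradigm for efficient pre-training.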


📘 Detailed Summary

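Motivation: Current large language models commonly rely on deep architectures with O(N²) computational cost, which makes training inefficient and resource-hungry. This work explores whether a far lighter architecture can challenge the prevailing deep learning paradigm.

Method: The model is a Co⁴ machine with a single layer, two attention heads, and 8M parameters, operating at approximately O(N) cost in the number of input tokens N, compared with O(N²) for the baselines.

Result: On the BabyLM Challenge, Co⁴ trained for only two epochs outpaces GPT-2 and GPT-BERT trained for ten, achieving orders-of-magnitude greater training efficiency on 10M tokens. On SuperGLUE tasks it outperforms GPT-2 on 5 of 7 zero-shot metrics and 6 of 7 fine-tuning tasks, and GPT-BERT on 4 of 7 in both settings.

Conclusion: The results point to the potential of lightweight models for efficient pre-training, challenge prevailing assumptions about model scale and computational complexity, and provide empirical grounds for rethinking scaling laws.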


📄 Abstract

We show that a tiny Co$^4$ machine (Adeel, 2025) with a single layer, two heads, and 8M parameters, operating at an approximate cost of $O(N)$ (where $N$ is the number of input tokens), outpaces the BabyLM Challenge baselines GPT-2 (124M, 12 layers, $O(N^2)$) and GPT-BERT (30M, 12 layers, $O(N^2)$) in just two epochs, while both are trained for ten. Co$^4$ achieves orders-of-magnitude greater training efficiency on 10M tokens, demonstrating highly sample-efficient pretraining. Using the BabyLM challenge evaluation pipeline across complex benchmarks, Co$^4$ exhibits strong zero-shot and fine-tuning performance on SuperGLUE tasks. Specifically, Co$^4$ outperforms GPT-2 on 5 out of 7 zero-shot metrics and 6 out of 7 fine-tuning tasks, and GPT-BERT on 4 out of 7 metrics in both cases. These results suggest the need to rethink prevailing deep learning paradigms and associated scaling laws.

[48] ARES: Multimodal Adaptive Reasoning via Difficulty-Aware Token-Level Entropy Shaping

Shuang Chen, Yue Guo, Yimeng Ye, Shijue Huang, Wenbo Hu, Haoxi Li, Manyuan Zhang, Jiayu Chen, Song Guo, Nanyun Peng

🧩 TL;DR

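This paper proposes ARES, a framework whose adaptive reasoning mechanism dynamically allocates exploration effort, addressing the tendency of multimodal large reasoning models to overthink simple problems and under-explore hard ones. It identifies reasoning-critical moments using window entropy and achieves difficulty-aware reasoning control through two-stage training.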


📘 Detailed Summary

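Motivation: Current multimodal large reasoning models (MLRMs) allocate reasoning effort unevenly: they produce unnecessarily long reasoning traces on simple problems and under-explore on hard ones, missing solutions. This imbalance limits both performance and computational efficiency in practice.

Method: ARES uses a two-stage training pipeline. The Adaptive Cold-Start stage gives the model initial difficulty awareness using data paired with reasoning traces whose length is proportional to problem difficulty. The second stage introduces Adaptive Entropy Policy Optimization (AEPO), which uses high window-entropy tokens as exploration triggers to decide when to explore, and a hierarchical entropy reward with dynamic KL control to decide how much to explore.

Result: Extensive experiments show that ARES achieves superior performance and reasoning efficiency across diverse mathematical, logical, and multimodal benchmarks, while closing the gap to leading commercial systems at significantly lower inference cost. The method balances reasoning effort between simple and hard problems.

Conclusion: ARES shows that an adaptive reasoning mechanism based on window entropy can address the reasoning imbalance of MLRMs, offering a new route to more efficient multimodal reasoning systems. The framework highlights the potential of dynamic, difficulty-aware reasoning for improving both performance and computational efficiency.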


📄 Abstract

Recent advances in multimodal large reasoning models (MLRMs) have substantially improved their ability to solve complex textual and visual tasks. However, these models tend to overthink on simple problems, producing unnecessarily lengthy reasoning traces, while under-exploring on challenging ones, leading to missed solutions. To address this imbalance, we propose ARES, a unified open-source framework for adaptive reasoning that dynamically allocates exploration effort based on task difficulty. Our approach is motivated by two key empirical findings: (i) while single-token entropy is noisy, high window-entropy (HWE) tokens (token-level entropies averaged under a sliding window) can reliably capture reasoning-critical moments; and (ii) reducing HWE usage benefits easy problems, while increasing it is essential for solving hard ones. Building on these insights, ARES introduces a two-stage training pipeline. In the Adaptive Cold-Start stage, we curate multimodal and textual data paired with reasoning traces of length proportional to problem difficulty, equipping the model with initial difficulty awareness. In the second stage, we develop Adaptive Entropy Policy Optimization (AEPO), which uses HWE tokens as exploration triggers to decide when to explore, and a hierarchical entropy reward with dynamic KL control to decide how much to explore. Extensive experiments demonstrate that ARES achieves superior performance and reasoning efficiency across diverse mathematical, logical, and multimodal benchmarks, while closing the gap to leading commercial systems under significantly lower inference costs.
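
The abstract defines high window-entropy (HWE) tokens as token-level entropies averaged under a sliding window. A minimal sketch of that computation follows. The trailing-window alignment, the `threshold` parameter, and all function names are assumptions made for illustration; the paper may use a different window placement or selection rule.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def window_entropy(entropies: Sequence[float], window: int) -> List[float]:
    """Mean entropy over a trailing window ending at each position.
    Early positions average over however many tokens are available."""
    if window < 1:
        raise ValueError("window must be at least 1")
    out: List[float] = []
    running = 0.0
    for i, h in enumerate(entropies):
        running += h
        if i >= window:
            running -= entropies[i - window]  # drop the token leaving the window
        out.append(running / min(i + 1, window))
    return out


def high_window_entropy_positions(entropies: Sequence[float], window: int,
                                  threshold: float) -> List[int]:
    """Positions whose windowed entropy exceeds a (hypothetical) threshold."""
    smoothed = window_entropy(entropies, window)
    return [i for i, h in enumerate(smoothed) if h > threshold]


if __name__ == "__main__":
    per_token = [0.0, 0.0, 3.0, 0.0, 0.0]
    print(window_entropy(per_token, window=2))
    print(high_window_entropy_positions(per_token, window=2, threshold=1.0))
```

Averaging smooths out isolated single-token spikes, which matches the abstract's observation that single-token entropy is noisy while windowed entropy is more reliable.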

cs.AI [Back]

[49] TS-Agent: A Time Series Reasoning Agent with Iterative Statistical Insight Gathering

Penghang Liu, Elizabeth Fons, Svitlana Vyetrenko, Daniel Borrajo, Vamsi Potluru, Manuela Veloso

🧩 TL;DR

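This paper proposes TS-Agent, a time series reasoning agent that combines the reasoning ability of LLMs with time series analytical tools to address hallucination and knowledge leakage in LLM time series reasoning.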


📘 Detailed Summary

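Motivation: Large language models show strong reasoning and problem-solving abilities, but recent studies find that they still struggle with time series reasoning tasks, where outputs are often affected by hallucination or knowledge leakage.

Method: TS-Agent uses LLMs strictly for what they do well, gathering evidence and synthesizing it into conclusions through step-by-step reasoning, and delegates the extraction of statistical and structural information to time series analytical tools. The agent interacts with raw numeric sequences through atomic operators, records outputs in an explicit evidence log, and iteratively refines its reasoning under the guidance of a self-critic and a final quality gate.

Result: Experiments on established benchmarks show that TS-Agent performs comparably to state-of-the-art LLMs on understanding benchmarks and delivers significant improvements on reasoning tasks, where existing models often rely on memorization and fail in zero-shot settings.

Conclusion: The study demonstrates the value of combining LLM reasoning with domain-specific tools. The design avoids multimodal alignment training, preserves the native form of time series, and ensures interpretability and verifiability, offering a new solution framework for time series analysis.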


📄 Abstract

Large language models (LLMs) have shown strong abilities in reasoning and problem solving, but recent studies reveal that they still struggle with time series reasoning tasks, where outputs are often affected by hallucination or knowledge leakage. In this work we propose TS-Agent, a time series reasoning agent that leverages LLMs strictly for what they excel at, i.e., gathering evidence and synthesizing it into conclusions through step-by-step reasoning, while delegating the extraction of statistical and structural information to time series analytical tools. Instead of mapping time series into text tokens, images, or embeddings, our agent interacts with raw numeric sequences through atomic operators, records outputs in an explicit evidence log, and iteratively refines its reasoning under the guidance of a self-critic and a final quality gate. This design avoids multi-modal alignment training, preserves the native form of time series, ensures interpretability and verifiability, and mitigates knowledge leakage or hallucination. Empirically, we evaluate the agent on established benchmarks. Our experiments show that TS-Agent achieves performance comparable to state-of-the-art LLMs on understanding benchmarks, and delivers significant improvements on reasoning tasks, where existing models often rely on memorization and fail in zero-shot settings.
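
The interaction pattern described above, atomic operators applied to raw numeric sequences with results written to an explicit evidence log, can be sketched as follows. This is an illustrative reading of the abstract only. The operator set, the names `OPERATORS` and `gather_evidence`, and the fixed `plan` (which in the real agent would come from the LLM) are hypothetical.

```python
import statistics
from typing import Callable, Dict, List, Sequence, Tuple


def op_mean(x: Sequence[float]) -> float:
    return statistics.fmean(x)


def op_slope(x: Sequence[float]) -> float:
    """Least-squares slope of the series against its index 0..n-1."""
    n = len(x)
    if n < 2:
        raise ValueError("need at least two points for a slope")
    t_mean = (n - 1) / 2
    x_mean = statistics.fmean(x)
    num = sum((t - t_mean) * (v - x_mean) for t, v in enumerate(x))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den


def op_argmax(x: Sequence[float]) -> float:
    return float(max(range(len(x)), key=x.__getitem__))


# Hypothetical operator registry the agent is allowed to call.
OPERATORS: Dict[str, Callable[[Sequence[float]], float]] = {
    "mean": op_mean,
    "slope": op_slope,
    "argmax": op_argmax,
}


def gather_evidence(series: Sequence[float], plan: List[str]) -> List[Tuple[str, float]]:
    """Run each requested operator on the raw series and log (name, value)."""
    log: List[Tuple[str, float]] = []
    for name in plan:
        log.append((name, OPERATORS[name](series)))
    return log


if __name__ == "__main__":
    for entry in gather_evidence([1.0, 2.0, 3.0, 4.0], ["mean", "slope", "argmax"]):
        print(entry)
```

Because every number in the log is produced by a deterministic operator, each claim in the final answer can be traced back to a logged value, which is the verifiability property the abstract emphasizes.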

[50] Evaluation of LLMs for Process Model Analysis and Optimization

Akhil Kumar, Jianliang Leon Zhao, Om Dobariya

🧩 TL;DR

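This study evaluates several large language models on their ability to understand BPMN business process models, detect syntactic and logical errors, and reason in depth through a natural language interface. It finds that an untrained LLM in a zero-shot setting can analyze business process models effectively.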


📘 Detailed Summary

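Motivation: The study asks whether large language models can understand business process models and detect syntactic and logical errors in them through interactive conversation, addressing a gap in research on LLMs for business process modeling and analysis.

Method: Several large language models are tested in a zero-shot setting. Through a natural language interface, the models are asked to interpret images of BPMN process models and to reason about them at the syntactic, logical, and semantic levels.

Result: An untrained ChatGPT model (o3) in a zero-shot setting effectively understands BPMN process models from images and answers queries intelligently at the syntactic, logical, and semantic levels. Different LLMs vary in accuracy and effectiveness.

Conclusion: Large language models can serve as valuable assistants to business process designers and users. In process analysis and optimization they display human-like reasoning and appear to exhibit anthropomorphic properties, providing empirical evidence for LLM applications in business process management.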


📄 Abstract

In this paper, we report our experience with several LLMs for their ability to understand a process model in an interactive, conversational style, find syntactical and logical errors in it, and reason with it in depth through a natural language (NL) interface. Our findings show that a vanilla, untrained LLM like ChatGPT (model o3) in a zero-shot setting is effective in understanding BPMN process models from images and answering queries about them intelligently at syntactic, logic, and semantic levels of depth. Further, different LLMs vary in performance in terms of their accuracy and effectiveness. Nevertheless, our empirical analysis shows that LLMs can play a valuable role as assistants for business process designers and users. We also study the LLM's "thought process" and ability to perform deeper reasoning in the context of process analysis and optimization. We find that the LLMs seem to exhibit anthropomorphic properties.

[51] An Evaluation Study of Hybrid Methods for Multilingual PII Detection

Harshit Rajgarhia, Suryam Gupta, Asif Shaik, Gulipalli Praveen Kumar, Y Santhoshraj, Sanka Nithya Tanvy Nishitha, Abhishek Mukherji

🧩 TL;DR

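RECAP is a hybrid framework for PII detection in low-resource languages that combines deterministic regular expressions with context-aware large language models. Across 13 low-resource locales it substantially outperforms fine-tuned NER models and zero-shot LLMs.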


📘 Detailed Summary

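Motivation: Detecting personally identifiable information (PII) is critical for privacy compliance but remains difficult in low-resource languages because of linguistic diversity and limited annotated data.

Method: RECAP is a hybrid framework that combines deterministic regular expressions with context-aware LLMs. Its modular architecture supports over 300 entity types without retraining and uses a three-phase refinement pipeline for disambiguation and filtering.

Result: Benchmarked with nervaluate, the system outperforms fine-tuned NER models by 82% and zero-shot LLMs by 17% in weighted F1-score across 13 low-resource locales.

Conclusion: The work offers a scalable and adaptable PII detection solution for compliance-focused applications and addresses the privacy protection challenges of low-resource language settings.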


📄 Abstract

The detection of Personally Identifiable Information (PII) is critical for privacy compliance but remains challenging in low-resource languages due to linguistic diversity and limited annotated data. We present RECAP, a hybrid framework that combines deterministic regular expressions with context-aware large language models (LLMs) for scalable PII detection across 13 low-resource locales. RECAP's modular design supports over 300 entity types without retraining, using a three-phase refinement pipeline for disambiguation and filtering. Benchmarked with nervaluate, our system outperforms fine-tuned NER models by 82% and zero-shot LLMs by 17% in weighted F1-score. This work offers a scalable and adaptable solution for efficient PII detection in compliance-focused applications.
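
To make the hybrid idea concrete, the sketch below pairs deterministic regular expressions (which propose candidate spans) with a pluggable context check (which accepts or rejects them). It is a simplified illustration, not RECAP's pipeline: the two patterns, the `Span` type, and `keyword_confirm`, a keyword heuristic standing in for the context-aware LLM, are all hypothetical, and the paper's three-phase refinement is not reproduced here.

```python
import re
from typing import Callable, List, NamedTuple


class Span(NamedTuple):
    label: str
    start: int
    end: int
    text: str


# Illustrative patterns only; real locale-specific rules would be far richer.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def regex_candidates(text: str) -> List[Span]:
    """Deterministic stage: every regex hit becomes a candidate span."""
    spans = [
        Span(label, m.start(), m.end(), m.group())
        for label, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    ]
    return sorted(spans, key=lambda s: s.start)


def keyword_confirm(text: str, span: Span, width: int = 20) -> bool:
    """Stand-in for an LLM context check: keep a PHONE candidate only if
    a phone-related cue word appears shortly before it."""
    if span.label != "PHONE":
        return True
    left = text[max(0, span.start - width):span.start].lower()
    return any(cue in left for cue in ("call", "phone", "tel"))


def detect_pii(text: str, confirm: Callable[[str, Span], bool]) -> List[Span]:
    """Context stage: filter regex candidates with the supplied checker."""
    return [s for s in regex_candidates(text) if confirm(text, s)]


if __name__ == "__main__":
    sample = "Order 1234567890 shipped. Call +1 555-123-4567 or mail ana@example.com."
    for span in detect_pii(sample, keyword_confirm):
        print(span.label, span.text)
```

In the sample, the order number matches the phone pattern but is rejected by the context check, which is the kind of disambiguation a purely regex-based detector cannot do.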

[52] Test-Time Matching: Unlocking Compositional Reasoning in Multimodal Models

Yinglun Zhu, Jiancheng Zhang, Fuzhi Tang

🧩 TL;DR

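This paper shows that existing evaluation metrics systematically underestimate the compositional reasoning ability of AI models. It introduces a group matching score and the Test-Time Matching algorithm, which markedly improve performance on compositional reasoning benchmarks and enable models such as SigLIP-B16 and GPT-4.1 to set new state-of-the-art results on several of them.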


📘 Detailed Summary

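Motivation: Frontier AI models reportedly struggle with compositional reasoning, often performing at or below random chance on established benchmarks. The paper argues that this may stem from widely used evaluation metrics that systematically underestimate model capability, which calls for more accurate evaluation and for techniques that raise performance.

Method: The paper introduces a group matching score that better exploits group structure to reveal hidden model capability. It also proposes Test-Time Matching (TTM), an iterative, self-improving algorithm that needs no external supervision and builds on overfitting to the induced group matchings at test time.

Result: The approach enables SigLIP-B16 to surpass all previous results and GPT-4.1 to yield the first result surpassing estimated human performance on Winoground. TTM further enables SigLIP-B16 to surpass GPT-4.1 on MMVP-VLM, improves performance across 16 dataset variants, and achieves relative gains of up to 85.7% on challenging datasets such as WhatsUp.

Conclusion: The study exposes a systematic bias in evaluation metrics and shows that models have underestimated compositional reasoning ability that the proposed methods can unlock, offering a new evaluation paradigm and a route to better performance in compositional reasoning research.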


📄 Abstract

Frontier AI models have achieved remarkable progress, yet recent studies suggest they struggle with compositional reasoning, often performing at or below random chance on established benchmarks. We revisit this problem and show that widely used evaluation metrics systematically underestimate model capability. To address this, we introduce a group matching score that better exploits group structure and reveals substantial hidden capability in both contrastive vision-language models (VLMs) and multimodal large language models (MLLMs). Moreover, simply overfitting to the induced group matchings at test time transfers this hidden capability into higher scores under standard evaluation metrics, closing much of the reported gap. This adjustment enables SigLIP-B16 to surpass all previous results and GPT-4.1 to yield the first result surpassing estimated human performance on Winoground. Building on this insight, we propose Test-Time Matching (TTM), an iterative, self-improving algorithm that further bootstraps model performance without any external supervision. TTM delivers additional, non-trivial improvements: for example, TTM enables SigLIP-B16 to surpass GPT-4.1 on MMVP-VLM, establishing a new state of the art. Importantly, TTM remains broadly effective even on benchmarks without metric-induced effects or group structures, achieving relative gains up to 85.7% on challenging datasets such as WhatsUp. Across 16 dataset variants spanning diverse setups, our experiments demonstrate that TTM consistently improves model performance and advances the frontier of compositional reasoning.
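
The abstract does not give the exact definition of the group matching score. One plausible reading, sketched below purely as an assumption, is that a group of k images and k captions counts as correct when the true pairing has the highest total similarity among all one-to-one pairings, as opposed to requiring every individual pairwise comparison to be correct. The function names and the convention that image i belongs with caption i are hypothetical.

```python
from itertools import permutations
from typing import Sequence

Matrix = Sequence[Sequence[float]]  # sim[i][j]: similarity of image i, caption j


def pairwise_correct(sim: Matrix) -> bool:
    """Strict criterion: each image prefers its own caption and each
    caption prefers its own image."""
    k = len(sim)
    rows = all(sim[i][i] > sim[i][j] for i in range(k) for j in range(k) if j != i)
    cols = all(sim[i][i] > sim[j][i] for i in range(k) for j in range(k) if j != i)
    return rows and cols


def group_match_correct(sim: Matrix) -> bool:
    """Group criterion (assumed): the identity matching strictly beats every
    other one-to-one matching in total similarity."""
    k = len(sim)
    identity = tuple(range(k))

    def total(p: Sequence[int]) -> float:
        return sum(sim[i][p[i]] for i in range(k))

    return all(total(identity) > total(p)
               for p in permutations(range(k)) if p != identity)


if __name__ == "__main__":
    sim = [[0.9, 0.1],
           [0.6, 0.5]]
    print(pairwise_correct(sim), group_match_correct(sim))
```

In the example, image 1 slightly prefers the wrong caption, so the strict criterion fails, yet the true pairing still has the best total score. Cases like this are where a group-level score would credit capability that a pairwise metric hides. The brute-force search over permutations is only practical for the small groups typical of such benchmarks.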

[53] Multimodal Safety Evaluation in Generative Agent Social Simulations

Alhim Vera, Karen Sanchez, Carlos Hinojosa, Haidar Bin Hamid, Donghoon Kim, Bernard Ghanem

🧩 TL;DR

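This study introduces a reproducible framework for evaluating multimodal agents and finds serious safety alignment problems in generative agents in multimodal environments. In particular, agents tend to overtrust misleading visual information, and they succeed in correcting unsafe plans only 55% of the time.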


📘 Detailed Summary

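Motivation: Large language and vision-language models let agents act autonomously and pursue goals in rich settings, yet their ability to reason about safety, coherence, and trust across modalities remains limited. This calls for a systematic evaluation of safety improvement, unsafe activity detection, and social dynamics in multimodal environments.

Method: The paper presents a reproducible simulation framework in which agents have layered memory, dynamic planning, and multimodal perception. It introduces SocialMetrics, a suite of behavioral and structural metrics that quantifies plan revisions, unsafe-to-safe conversions, and information diffusion across networks.

Result: Agents can detect direct multimodal contradictions but often fail to align local revisions with global safety, reaching only a 55% success rate in correcting unsafe plans. Across eight simulation runs with three models (Claude, GPT-4o mini, and Qwen-VL), average unsafe-to-safe conversion rates were 75%, 55%, and 58%, respectively. When paired with misleading visuals, 45% of unsafe actions were accepted, showing a tendency to overtrust images.

Conclusion: The findings expose critical limitations of current architectures in multimodal safety, notably a strong tendency to overtrust misleading visual information. The work also provides a reproducible platform for studying multimodal safety, coherence, and social dynamics, laying groundwork for future research on agent safety alignment.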


📄 Abstract

Can generative agents be trusted in multimodal environments? Despite advances in large language and vision-language models that enable agents to act autonomously and pursue goals in rich settings, their ability to reason about safety, coherence, and trust across modalities remains limited. We introduce a reproducible simulation framework for evaluating agents along three dimensions: (1) safety improvement over time, including iterative plan revisions in text-visual scenarios; (2) detection of unsafe activities across multiple categories of social situations; and (3) social dynamics, measured as interaction counts and acceptance ratios of social exchanges. Agents are equipped with layered memory, dynamic planning, multimodal perception, and are instrumented with SocialMetrics, a suite of behavioral and structural metrics that quantifies plan revisions, unsafe-to-safe conversions, and information diffusion across networks. Experiments show that while agents can detect direct multimodal contradictions, they often fail to align local revisions with global safety, reaching only a 55 percent success rate in correcting unsafe plans. Across eight simulation runs with three models - Claude, GPT-4o mini, and Qwen-VL - five agents achieved average unsafe-to-safe conversion rates of 75, 55, and 58 percent, respectively. Overall performance ranged from 20 percent in multi-risk scenarios with GPT-4o mini to 98 percent in localized contexts such as fire/heat with Claude. Notably, 45 percent of unsafe actions were accepted when paired with misleading visuals, showing a strong tendency to overtrust images. These findings expose critical limitations in current architectures and provide a reproducible platform for studying multimodal safety, coherence, and social dynamics.
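
The unsafe-to-safe conversion rate reported above is a simple ratio. The sketch below shows one way it could be computed from per-plan records, assuming each record notes whether the plan started unsafe and whether it was safe after revision. The record layout and function name are hypothetical, since the abstract does not describe how SocialMetrics stores its data.

```python
from typing import List, Tuple

# (initially_unsafe, safe_after_revision) for each plan; hypothetical layout.
PlanRecord = Tuple[bool, bool]


def unsafe_to_safe_rate(plans: List[PlanRecord]) -> float:
    """Fraction of initially unsafe plans that ended up safe after revision.
    Plans that started safe are excluded from the denominator."""
    outcomes = [safe_after for initially_unsafe, safe_after in plans if initially_unsafe]
    if not outcomes:
        raise ValueError("no initially unsafe plans to evaluate")
    return sum(outcomes) / len(outcomes)


if __name__ == "__main__":
    records = [(True, True), (True, False), (False, True), (True, True), (True, False)]
    print(unsafe_to_safe_rate(records))
```

Excluding plans that were safe from the start keeps the metric focused on correction ability, so an agent cannot inflate its score simply by rarely proposing unsafe plans.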

[54] FinMR: A Knowledge-Intensive Multimodal Benchmark for Advanced Financial Reasoning

Shuangyan Deng, Haizhou Peng, Jiachen Xu, Rui Mao, Ciprian Doru Giurcăneanu, Jiamou Liu

🧩 TL;DR

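This paper introduces FinMR, a high-quality, knowledge-intensive multimodal dataset designed to evaluate financial reasoning at the level of a professional analyst, filling a gap in the specialized evaluation of MLLMs in finance.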


📘 Detailed Summary

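Motivation: Rigorous evaluation of multimodal large language models in specialized domains such as finance is held back by the lack of datasets with professional-level knowledge intensity, detailed annotations, and advanced reasoning complexity.

Method: The authors build FinMR, a dataset of over 3,200 carefully curated, expert-annotated question-answer pairs across 15 financial topics. It integrates sophisticated mathematical reasoning, advanced financial knowledge, and nuanced visual interpretation across multiple image types.

Result: Comprehensive benchmarking of leading closed-source and open-source MLLMs reveals significant performance gaps between these models and professional financial analysts, especially in precise image analysis, accurate application of complex financial formulas, and deeper contextual financial understanding.

Conclusion: With richly varied visual content and thorough explanatory annotations, FinMR serves as a benchmark for assessing and advancing multimodal financial reasoning toward professional analyst-level competence, and it highlights key directions for model improvement.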


📄 Abstract

Multimodal Large Language Models (MLLMs) have made substantial progress in recent years. However, their rigorous evaluation within specialized domains like finance is hindered by the absence of datasets characterized by professional-level knowledge intensity, detailed annotations, and advanced reasoning complexity. To address this critical gap, we introduce FinMR, a high-quality, knowledge-intensive multimodal dataset explicitly designed to evaluate expert-level financial reasoning capabilities at a professional analyst's standard. FinMR comprises over 3,200 meticulously curated and expertly annotated question-answer pairs across 15 diverse financial topics, ensuring broad domain diversity and integrating sophisticated mathematical reasoning, advanced financial knowledge, and nuanced visual interpretation tasks across multiple image types. Through comprehensive benchmarking with leading closed-source and open-source MLLMs, we highlight significant performance disparities between these models and professional financial analysts, uncovering key areas for model advancement, such as precise image analysis, accurate application of complex financial formulas, and deeper contextual financial understanding. By providing richly varied visual content and thorough explanatory annotations, FinMR establishes itself as an essential benchmark tool for assessing and advancing multimodal financial reasoning toward professional analyst-level competence.

[55] Augur: Modeling Covariate Causal Associations in Time Series via Large Language Models

Zhiqing Cui, Binwu Wang, Qingxiang Liu, Yeqiang Wang, Zhengyang Zhou, Yuxuan Liang, Yang Wang

🧩 TL;DR

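Augur is a fully LLM-driven time series forecasting framework that uses LLM causal reasoning to discover and exploit directed causal associations among covariates. It maintains competitive forecasting performance while providing transparent, traceable reasoning about variable interactions.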


📘 Detailed Summary

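Motivation: Existing LLM-based time series forecasting methods have notable limitations: the LLM plays a marginal role in the model architecture, the methods rely on coarse statistical text prompts, and they lack interpretability. These issues limit the effective use of LLMs in forecasting.

Method: Augur uses a two-stage teacher-student architecture. A powerful teacher LLM infers a directed causal graph from the time series using heuristic search together with pairwise causality testing. A lightweight student agent then refines the graph and fine-tunes on high-confidence causal associations, which are encoded as rich textual prompts for forecasting.

Result: Extensive experiments on real-world datasets against 25 baselines show that Augur achieves competitive performance and robust zero-shot generalization.

Conclusion: The study demonstrates the value of LLM causal reasoning for time series forecasting. The design improves predictive accuracy and provides transparent, traceable reasoning about variable interactions, opening a path toward interpretable forecasting.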


📄 Abstract

Large language models (LLMs) have emerged as a promising avenue for time series forecasting, offering the potential to integrate multimodal data. However, existing LLM-based approaches face notable limitations, such as a marginalized role in model architectures, reliance on coarse statistical text prompts, and lack of interpretability. In this work, we introduce Augur, a fully LLM-driven time series forecasting framework that exploits LLM causal reasoning to discover and use directed causal associations among covariates. Augur uses a two-stage teacher-student architecture where a powerful teacher LLM infers a directed causal graph from time series using heuristic search together with pairwise causality testing. A lightweight student agent then refines the graph and fine-tunes on high-confidence causal associations that are encoded as rich textual prompts to perform forecasting. This design improves predictive accuracy while yielding transparent, traceable reasoning about variable interactions. Extensive experiments on real-world datasets with 25 baselines demonstrate that Augur achieves competitive performance and robust zero-shot generalization.
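
As a small illustration of the last step described above, encoding high-confidence causal associations as text, the sketch below filters a list of directed edges by confidence and renders them as a prompt fragment. The edge format, the `min_conf` cutoff, the prompt wording, and the variable names in the example are all hypothetical. The abstract does not specify how Augur represents or verbalizes its graph.

```python
from typing import List, Tuple

# (cause, effect, confidence in [0, 1]); hypothetical edge representation.
Edge = Tuple[str, str, float]


def render_causal_prompt(edges: List[Edge], min_conf: float = 0.8) -> str:
    """Keep edges at or above the confidence cutoff, strongest first,
    and render them as a plain-text block for a forecasting prompt."""
    kept = sorted((e for e in edges if e[2] >= min_conf), key=lambda e: -e[2])
    if not kept:
        return "Known causal associations: none"
    lines = [f"{cause} -> {effect} (confidence {conf:.2f})"
             for cause, effect, conf in kept]
    return "Known causal associations:\n" + "\n".join(lines)


if __name__ == "__main__":
    graph = [
        ("temperature", "load", 0.92),
        ("holiday", "load", 0.55),
        ("price", "load", 0.81),
    ]
    print(render_causal_prompt(graph))
```

Because the prompt lists each retained edge explicitly, a reader can see which variable relationships the forecaster was told about, which is one simple way to get the traceability the abstract describes.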

[56] Chain-of-Trigger: An Agentic Backdoor that Paradoxically Enhances Agentic Robustness

Jiyang Qiu, Xinbei Ma, Yunqing Xu, Zhuosheng Zhang, Hai Zhao

🧩 TL;DR

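This paper proposes the Chain-of-Trigger Backdoor (CoTri), a multi-step backdoor attack on LLM-based agents that maintains a high attack success rate while also improving the agent's performance on benign tasks, revealing new security and robustness vulnerabilities in LLM agents.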


📘 Detailed Summary

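Motivation: The rapid deployment of LLM-based agents in real-world applications has raised serious concerns about their trustworthiness. Traditional backdoor attacks are limited to single-step control and do not capture the complex decision-making of agents on long-horizon tasks, which motivates studying attacks capable of multi-step manipulation.

Method: CoTri is a multi-step backdoor attack built on an ordered sequence of triggers. It starts with an initial trigger, and subsequent triggers are drawn from the environment, allowing multi-step manipulation that diverts the agent from its intended task. The training data models the stochastic nature of the environment.

Result: CoTri achieves a near-perfect attack success rate (ASR) and a near-zero false trigger rate (FTR). Because training models environmental stochasticity, implanting CoTri paradoxically improves the agent's performance on benign tasks and its robustness against environmental distractions. Validation on vision-language models confirms that the attack scales to multimodal agents.

Conclusion: CoTri achieves stable multi-step control within agents while improving their inherent robustness and task capability, which makes the attack stealthier and raises potential safety risks. The work reveals a new dimension of agent vulnerability and underlines the need to strengthen safety defenses alongside performance improvements.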


📄 Abstract

The rapid deployment of large language model (LLM)-based agents in real-world applications has raised serious concerns about their trustworthiness. In this work, we reveal the security and robustness vulnerabilities of these agents through backdoor attacks. Distinct from traditional backdoors limited to single-step control, we propose the Chain-of-Trigger Backdoor (CoTri), a multi-step backdoor attack designed for long-horizon agentic control. CoTri relies on an ordered sequence of triggers. It starts with an initial trigger, and subsequent ones are drawn from the environment, allowing multi-step manipulation that diverts the agent from its intended task. Experimental results show that CoTri achieves a near-perfect attack success rate (ASR) while maintaining a near-zero false trigger rate (FTR). Due to training data modeling the stochastic nature of the environment, the implantation of CoTri paradoxically enhances the agent's performance on benign tasks and even improves its robustness against environmental distractions. We further validate CoTri on vision-language models (VLMs), confirming its scalability to multimodal agents. Our work highlights that CoTri achieves stable, multi-step control within agents, improving their inherent robustness and task capabilities, which ultimately makes the attack more stealthy and raises potential safety risks.

[57] Looking to Learn: Token-wise Dynamic Gating for Low-Resource Vision-Language Modelling

Bianca-Mihaela Ganescu, Suchir Salhan, Andrew Caines, Paula Buttery

🧩 TL;DR

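This study proposes a lightweight decoder-based architecture that uses dynamic gating to adaptively fuse linguistic and visual information. Under the strict data constraints of the BabyLM Challenge it achieves competitive multimodal performance, and its gate discovers interpretable fusion patterns without explicit supervision.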


📘 Detailed Summary

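Motivation: Training vision-language models on cognitively plausible amounts of data is challenging. Under the strict constraints of the Vision track of the BabyLM Challenge 2025, it requires rethinking how models integrate multimodal information so as to make the most of limited visual input.

Method: The paper proposes a lightweight decoder-based architecture with three components: token-wise dynamic gating for adaptive fusion of linguistic and visual cues; feature modulation and channel attention to maximize the utility of limited visual information; and auxiliary contrastive objectives for visual grounding.

Result: Evaluation on five benchmarks shows performance competitive with or superior to multimodal baselines. More notably, the dynamic gate discovers interpretable patterns without explicit supervision, favoring visual cues for content words and linguistic cues for function words.

Conclusion: While the authors identify limitations in the Challenge constraints, such as the information bottleneck created by global image embeddings and training instability from the dataset split, the study establishes dynamic gating as a powerful tool for efficient multimodal learning that offers both interpretability and performance even under severe constraints.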


📄 Abstract

Training vision-language models on cognitively-plausible amounts of data requires rethinking how models integrate multimodal information. Within the constraints of the Vision track for the BabyLM Challenge 2025, we propose a lightweight decoder-based architecture with (1) token-wise dynamic gating for adaptive fusion of linguistic and visual cues, (2) feature modulation and channel attention to maximise the utility of limited visual information and (3) auxiliary contrastive objectives for visual grounding. Evaluation on five benchmarks (BLiMP, BLiMP Supplement, EWoK, Winoground and VQA) shows competitive or superior performance to multimodal baselines. More notably, our dynamic gate discovers interpretable patterns without explicit supervision, favouring visual cues for content words and linguistic cues for function words. While we identify limitations in the Challenge constraints, such as the information bottleneck created by global image embeddings and training instability from the dataset split, our findings establish dynamic gating as a powerful tool for efficient multimodal learning, offering both interpretability and performance even under severe constraints.
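
A token-wise gate of the kind named in item (1) can be sketched as follows: for each text token, a scalar gate in (0, 1) is computed from the token and the image embedding, and the fused representation is a convex mix of the two. This is a generic gating formulation written in plain Python for clarity, not the paper's architecture. The scalar (rather than per-channel) gate, the shared dimensionality of text and visual vectors, and all names are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def token_gate(text_vec: Sequence[float], vis_vec: Sequence[float],
               weights: Sequence[float], bias: float = 0.0) -> float:
    """Scalar gate from a linear layer over the concatenated [text; visual]."""
    concat = list(text_vec) + list(vis_vec)
    if len(weights) != len(concat):
        raise ValueError("weights must match the concatenated dimension")
    return sigmoid(sum(w * v for w, v in zip(weights, concat)) + bias)


def fuse_tokens(text_tokens: Sequence[Sequence[float]], vis_vec: Sequence[float],
                weights: Sequence[float], bias: float = 0.0
                ) -> Tuple[List[List[float]], List[float]]:
    """Fuse every text token with one global image embedding.
    A gate near 1 favors the visual cue, a gate near 0 the linguistic cue."""
    fused: List[List[float]] = []
    gates: List[float] = []
    for tok in text_tokens:
        g = token_gate(tok, vis_vec, weights, bias)
        gates.append(g)
        fused.append([g * v + (1.0 - g) * t for t, v in zip(tok, vis_vec)])
    return fused, gates


if __name__ == "__main__":
    tokens = [[1.0, 3.0]]
    image = [3.0, 1.0]
    print(fuse_tokens(tokens, image, weights=[0.0, 0.0, 0.0, 0.0]))
```

Returning the gate values alongside the fused vectors is what makes the interpretability analysis possible: one can inspect which tokens received high visual weight, as the authors do for content versus function words.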

[58] How to Teach Large Multimodal Models New Skills

Zhen Zhu, Yiming Gong, Yao Xiao, Yaoyao Liu, Derek Hoiem

🧩 TL;DR

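This study identifies two simple fine-tuning recipes, updating only the self-attention projection layers or only the MLP Gate&Up layers, that let large multimodal models learn new skills effectively while preserving prior abilities, addressing forgetting in sequential fine-tuning.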


📘 Detailed Summary

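Motivation: Large multimodal models risk forgetting during sequential fine-tuning: learning a new skill can erase previously acquired abilities. The study systematically examines how different fine-tuning strategies affect performance in order to find effective continual learning methods.

Method: The paper identifies two simple, robust tuning recipes: updating only the self-attention projection layers, and updating only the MLP Gate&Up layers while freezing the Down projection. Both restrict which parameters are updated in order to limit drift in the output distribution and thereby reduce forgetting.

Result: Across model families and tasks, both recipes deliver strong gains on target skills while largely preserving general ability on eight held-out benchmarks. The measured shift in the output token distribution co-varies with forgetting.

Conclusion: The study shows that forgetting in sequential fine-tuning is partly recoverable and provides practical tuning recipes for continual learning in large multimodal models, indicating that a carefully chosen parameter update strategy can balance learning new skills with retaining existing ones.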


📄 Abstract

How can we teach large multimodal models (LMMs) new skills without erasing prior abilities? We study sequential fine-tuning on five target skills while monitoring general ability on eight held-out benchmarks across three model families. We observe that apparent "forgetting" on held-out tasks after narrow fine-tuning can partly recover at later stages. We trace this behavior to a measurable shift in the output token distribution, manifested through a simple counting-bias probe that co-varies with forgetting. Guided by this picture, we identify two simple, robust tuning recipes that learn strongly while limiting drift: (i) updating only the self-attention projection layers, and (ii) updating only the MLP Gate&Up while freezing the Down projection. Across models and tasks, these choices deliver strong target gains while largely preserving held-out performance. Code is available at https://github.com/jessemelpolio/LMM_CL
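
The two recipes amount to choosing which parameters receive gradient updates. The framework-free sketch below builds a trainable/frozen mask from parameter names. The name fragments (`q_proj`, `gate_proj`, and so on) follow a common LLaMA-style naming convention and are an assumption, as is the choice of which projections count as "self-attention projection layers"; consult the authors' repository for the actual configuration.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical mapping from recipe name to parameter-name components.
RECIPES: Dict[str, Tuple[str, ...]] = {
    "self_attn_proj": ("q_proj", "k_proj", "v_proj", "o_proj"),
    "mlp_gate_up": ("gate_proj", "up_proj"),  # down_proj stays frozen
}


def trainable_mask(param_names: Iterable[str], recipe: str) -> Dict[str, bool]:
    """Map each parameter name to True (update) or False (freeze)."""
    keys = RECIPES[recipe]
    return {
        name: any(part in keys for part in name.split("."))
        for name in param_names
    }


if __name__ == "__main__":
    names = [
        "layers.0.self_attn.q_proj.weight",
        "layers.0.mlp.gate_proj.weight",
        "layers.0.mlp.up_proj.weight",
        "layers.0.mlp.down_proj.weight",
    ]
    for name, train in trainable_mask(names, "mlp_gate_up").items():
        print(f"{name}: {'train' in str(train).lower() or train}")
```

In a deep learning framework the resulting mask would typically be applied by toggling each parameter's gradient flag before building the optimizer, so that frozen weights are never updated.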