cs.CV
[1] Text-Only Training for Image Captioning with Retrieval Augmentation and Modality Gap Correction
Rui Fonseca, Bruno Martins, Gil Rocha
🧩 TL;DR
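This paper proposes TOMCap, an improved text-only training method that performs image captioning without aligned image-caption pairs. It prompts a pre-trained language model with CLIP representations whose modality gap has been reduced, combines this with retrieval augmentation, and outperforms other training-free and text-only methods.
📘 Detailed Summary
Motivation: Image captioning research still relies largely on human-annotated image-text pairs, and existing methods that train without aligned pairs lag behind fully supervised approaches. The paper aims to reduce the dependence on curated data and to improve captioning without human-annotated image-text pairs.
Method: TOMCap builds on a pre-trained language model decoder that is prompted with information derived from a CLIP representation after modality gap reduction. Retrieved caption examples and latent vector representations jointly guide generation, so training requires only text and no aligned image-caption pairs.
Result: Extensive experiments show that TOMCap outperforms other training-free and text-only methods. The study also analyzes how different configurations of the retrieval-augmentation and modality gap reduction components affect performance.
Conclusion: Combining CLIP representations, modality gap reduction, and retrieval augmentation enables effective image captioning without aligned image-caption pairs. TOMCap shows the potential of pairing pre-trained models with retrieval augmentation to reduce data dependence in captioning.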
📄 Abstract
Image captioning has drawn considerable attention from the natural language processing and computer vision fields. Aiming to reduce the reliance on curated data, several studies have explored image captioning without any human-annotated image-text pairs for training, although existing methods are still outperformed by fully supervised approaches. This paper proposes TOMCap, an improved text-only training method that performs captioning without the need for aligned image-caption pairs. The method is based on prompting a pre-trained language model decoder with information derived from a CLIP representation, after that representation undergoes a process that reduces the modality gap. We specifically tested the combined use of retrieved caption examples and latent vector representations to guide the generation process. Through extensive experiments, we show that TOMCap outperforms other training-free and text-only methods. We also analyze the impact of different choices regarding the configuration of the retrieval-augmentation and modality gap reduction components.
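The abstract does not say how the modality gap is reduced, so the following is only a minimal sketch of one common idea: shifting an image embedding by the offset between the mean image embedding and the mean text embedding, then retrieving the nearest captions. The function names, the toy 3-dimensional vectors, and the mean-shift correction itself are assumptions for illustration, not TOMCap's actual procedure.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mean(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def correct_gap(image_emb, image_mean, text_mean):
    # Assumed correction: move the image embedding by the offset between modality means.
    shifted = [x - mi + mt for x, mi, mt in zip(image_emb, image_mean, text_mean)]
    return normalize(shifted)

def retrieve(query, caption_embs, captions, k=1):
    """Return the k captions whose embeddings have the highest dot product with the query."""
    scores = [sum(q * c for q, c in zip(query, emb)) for emb in caption_embs]
    ranked = sorted(range(len(captions)), key=lambda i: scores[i], reverse=True)
    return [captions[i] for i in ranked[:k]]

# Toy stand-ins for CLIP embeddings (hypothetical values).
captions = ["a dog on grass", "a cat on a sofa", "a plane in the sky"]
caption_embs = [normalize(v) for v in ([1.0, 0.1, 0.0], [0.1, 1.0, 0.0], [0.0, 0.1, 1.0])]
# Image embeddings share the content directions but carry a constant offset on the last axis.
image_embs = [[1.0, 0.1, 2.0], [0.1, 1.0, 2.0], [0.0, 0.1, 3.0]]

raw_top = retrieve(normalize(image_embs[0]), caption_embs, captions)
corrected = correct_gap(image_embs[0], mean(image_embs), mean(caption_embs))
corrected_top = retrieve(corrected, caption_embs, captions)
```

In this toy setup the uncorrected query retrieves "a plane in the sky" because the offset dominates the similarity, while the corrected query retrieves "a dog on grass". The retrieved captions would then be placed in the decoder prompt.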
[2] Mitigating Object and Action Hallucinations in Multimodal LLMs via Self-Augmented Contrastive Alignment
Kai-Po Chang, Wei-Yuan Cheng, Chi-Pin Huang, Fu-En Yang, Yu-Chiang Frank Wang
🧩 TL;DR
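This paper proposes SANTA, a self-augmented contrastive alignment framework that addresses factual inaccuracies in video captions generated by multimodal LLMs, targeting hallucinations of visual objects and temporal actions.
📘 Detailed Summary
Motivation: Although multimodal LLMs have made notable progress in video captioning, their descriptions contain factual inaccuracies that lead to frequent hallucinations. Prior work mainly mitigates hallucinations for static images; jointly mitigating object and action hallucinations in dynamic videos remains a challenging, unsolved task.
Method: SANTA uses a hallucinative self-augmentation scheme to identify potential hallucinations in the MLLM and turn the original captions into contrastive negatives. A tracklet-phrase contrastive alignment then matches regional objects and relation-guided actions with their corresponding visual and temporal phrases, which removes spurious correlations and strengthens the focus on visual facts.
Result: Extensive experiments show that SANTA outperforms existing methods at mitigating object and action hallucinations and achieves superior results on hallucination examination benchmarks, improving the factual accuracy of video captions.
Conclusion: The work offers an effective framework for hallucination in video captioning that jointly addresses visual objects and temporal actions through self-augmented contrastive alignment, and it opens a direction for improving the factuality of multimodal LLMs.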
📄 Abstract
Recent advances in multimodal LLMs (MLLMs) have demonstrated their remarkable capability to generate descriptive captions for input videos. However, these models suffer from factual inaccuracies in the generated descriptions, causing severe hallucination issues. While prior works have explored alleviating hallucinations for static images, jointly mitigating visual object and temporal action hallucinations for dynamic videos remains a challenging and unsolved task. To tackle this challenge, we propose a Self-Augmented Contrastive Alignment (SANTA) framework that enables object and action faithfulness by removing spurious correlations and enforcing an emphasis on visual facts. SANTA employs a hallucinative self-augmentation scheme to identify the potential hallucinations that lie in the MLLM and transform the original captions into contrastive negatives. Furthermore, we develop a tracklet-phrase contrastive alignment to match the regional objects and relation-guided actions with their corresponding visual and temporal phrases. Extensive experiments demonstrate that SANTA outperforms existing methods in alleviating object and action hallucinations, yielding superior performance on the hallucination examination benchmarks.
[3] OnSight Pathology: A real-time platform-agnostic computational pathology companion for histopathology
Jinzhen Hu, Kevin Faust, Parsa Babaei Zadeh, Adrienn Bourkas, Shane Eaton, Andrew Young, Anzar Alvi, Dimitrios George Oreopoulos, Ameesha Paliwal, Assem Saleh Alrumeh, Evelyn Rose Kamski-Hennekam, Phedias Diamandis
🧩 TL;DR
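This study presents OnSight Pathology, platform-agnostic computer vision software that uses continuous screen capture to deliver real-time AI inference while users review digital slide images, addressing barriers to deploying AI tools in digital pathology.
📘 Detailed Summary
Motivation: Conventional histological examination depends on subjective interpretation and specialist expertise, which can compromise accuracy and clinical care, while the proprietary nature of existing digital pathology solutions limits real-world deployment of AI tools.
Method: OnSight Pathology uses continuous custom screen captures to provide real-time AI inference. It runs locally on consumer-grade personal computers as a single executable, requires no complex software integration, and includes a multimodal chat assistant for verifying image descriptions.
Result: The software's robustness was validated on over 2,500 publicly available whole slide images and on clinical digital pathology cases, covering routine histopathological tasks such as classification of common brain tumor types, mitosis detection, and quantification of immunohistochemical stains. It is also compatible with live microscope camera feeds, including those from smartphones.
Conclusion: OnSight Pathology delivers real-time AI inference across a broad range of pathology pipelines, removing key barriers to adopting AI tools in histopathology and offering cost-effective, secure deployment for research, clinical workflows, telepathology, and intraoperative settings.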
📄 Abstract
The microscopic examination of surgical tissue remains a cornerstone of disease classification but relies on subjective interpretations and access to highly specialized experts, which can compromise accuracy and clinical care. While emerging breakthroughs in artificial intelligence (AI) offer promise for automated histological analysis, the growing number of proprietary digital pathology solutions has created barriers to real-world deployment. To address these challenges, we introduce OnSight Pathology, a platform-agnostic computer vision software that uses continuous custom screen captures to provide real-time AI inferences to users as they review digital slide images. Accessible as a single, self-contained executable file (https://onsightpathology.github.io/), OnSight Pathology operates locally on consumer-grade personal computers without complex software integration, enabling cost-effective and secure deployment in research and clinical workflows. Here we demonstrate the utility of OnSight Pathology using over 2,500 publicly available whole slide images across different slide viewers, as well as cases from our clinical digital pathology setup. The software's robustness is highlighted across routine histopathological tasks, including the classification of common brain tumor types, mitosis detection, and the quantification of immunohistochemical stains. A built-in multi-modal chat assistant provides verifiable descriptions of images, free of rigid class labels, for added quality control. Lastly, we show compatibility with live microscope camera feeds, including from personal smartphones, offering potential for deployment in more analog, intraoperative, and telepathology settings. Together, we highlight how OnSight Pathology can deliver real-time AI inferences across a broad range of pathology pipelines, removing key barriers to the adoption of AI tools in histopathology.
[4] SEASON: Mitigating Temporal Hallucination in Video Large Language Models via Self-Diagnostic Contrastive Decoding
Chang-Hsun Wu, Kai-Po Chang, Yu-Yang Sheng, Hung-Kai Chung, Kuei-Chun Wang, Yu-Chiang Frank Wang
🧩 TL;DR
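This paper proposes SEASON (Self-Diagnostic Contrastive Decoding), a training-free method that dynamically diagnoses each output token's hallucination tendency and applies adaptive contrastive decoding against its temporal and spatial negatives, improving the temporal and spatial faithfulness of video LLMs.
📘 Detailed Summary
Motivation: Video LLMs have made notable progress in video understanding, yet they still struggle to perceive and exploit the rich temporal information in videos when answering user queries. The resulting event descriptions can be temporally inconsistent or causally implausible, which causes severe hallucinations. Prior studies mostly address spatial hallucinations such as object mismatches, leaving temporal reasoning relatively underexplored.
Method: SEASON is a training-free method that dynamically diagnoses the hallucination tendency of each output token and applies adaptive contrastive decoding against the corresponding temporal and spatial negatives, thereby enhancing per-token faithfulness along both dimensions.
Result: Extensive experiments show that SEASON outperforms all existing training-free hallucination mitigation approaches on three hallucination examination benchmarks and further improves video LLMs on four general video understanding benchmarks.
Conclusion: The study highlights the importance of temporal reasoning in video understanding and offers an effective training-free remedy. Its adaptive contrastive decoding mitigates hallucinations in video LLMs while also generalizing to general video understanding tasks.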
📄 Abstract
Video Large Language Models (VideoLLMs) have shown remarkable progress in video understanding. However, these models still struggle to effectively perceive and exploit rich temporal information in videos when responding to user queries. Therefore, they often generate descriptions of events that are temporally inconsistent or causally implausible, causing severe hallucination issues. While most prior studies have focused on spatial hallucinations (e.g., object mismatches), temporal reasoning in video understanding remains relatively underexplored. To address this issue, we propose Self-Diagnostic Contrastive Decoding (SEASON), a training-free method that adaptively enhances temporal and spatial faithfulness for each output token. It achieves this by dynamically diagnosing each token's hallucination tendency and applying adaptive contrastive decoding against its corresponding temporal and spatial negatives. Extensive experiments demonstrate that SEASON outperforms all existing training-free hallucination mitigation approaches on three hallucination examination benchmarks, while further improving VideoLLMs across four general video understanding benchmarks. The code will be released upon acceptance.
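As a rough illustration of contrastive decoding with per-token adaptive weights, the sketch below subtracts the logits obtained from degraded ("negative") inputs from the original logits. The weighting rule here, which scales each weight by the overlap between the original and negative distributions, is a hypothetical stand-in for the paper's diagnosis step, and all logit values are invented.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def overlap(p, q):
    """Shared probability mass (1 minus total variation distance); 1.0 means identical."""
    return sum(min(a, b) for a, b in zip(p, q))

def contrastive_logits(orig, temporal_neg, spatial_neg, max_alpha=1.0):
    # Hypothetical diagnosis: if a negative input barely changes the prediction,
    # the token is probably not grounded in that cue, so contrast more strongly.
    p = softmax(orig)
    a_t = max_alpha * overlap(p, softmax(temporal_neg))
    a_s = max_alpha * overlap(p, softmax(spatial_neg))
    adjusted = [
        (1 + a_t + a_s) * o - a_t * t - a_s * s
        for o, t, s in zip(orig, temporal_neg, spatial_neg)
    ]
    return adjusted, a_t, a_s

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

vocab = ["before", "after", "while"]
orig = [2.0, 1.8, 0.1]          # logits from the real video
temporal_neg = [2.2, 0.5, 0.1]  # logits from a temporally corrupted video (e.g. shuffled frames)
spatial_neg = [2.0, 1.8, 0.1]   # logits from a spatially corrupted video

adjusted, a_t, a_s = contrastive_logits(orig, temporal_neg, spatial_neg)
greedy_token = vocab[argmax(orig)]
contrastive_token = vocab[argmax(adjusted)]
```

With these toy numbers, greedy decoding picks "before", which the temporally corrupted input also favors, whereas the contrastive logits pick "after", the option supported only when the temporal order is intact.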
[5] ReasonX: MLLM-Guided Intrinsic Image Decomposition
Alara Dirik, Tuanfeng Wang, Duygu Ceylan, Stefanos Zafeiriou, Anna Frühstück
🧩 TL;DR
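This paper proposes ReasonX, a framework that uses a multimodal LLM as a perceptual judge providing relative intrinsic comparisons and uses those comparisons as GRPO rewards to fine-tune intrinsic image decomposition models on unlabeled real-world images, yielding significant gains across several base architectures and modalities.
📘 Detailed Summary
Motivation: Diffusion- and transformer-based models benefit from paired supervision on synthetic datasets, but their generalization to diverse real-world scenes remains limited, and existing methods struggle with intrinsic decomposition of unlabeled in-the-wild images.
Method: ReasonX uses a multimodal LLM as a perceptual judge that produces relative intrinsic comparisons, which serve as GRPO reward signals. Conditional intrinsic predictors are fine-tuned by rewarding agreement between the judge's relational assessments and the relations derived analytically from the model's outputs. The framework is model-agnostic and applies to different intrinsic predictors and modalities.
Result: Across multiple base architectures and modalities, ReasonX yields significant improvements, including a 9-25% WHDR reduction on IIW albedo and up to 46% depth accuracy gains on ETH3D.
Conclusion: The study shows the promise of MLLM-guided comparative supervision for bridging low- and high-level vision reasoning and provides a model-agnostic reinforcement learning framework for improving intrinsic image decomposition with unlabeled real-world data.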
📄 Abstract
Intrinsic image decomposition aims to separate images into physical components such as albedo, depth, normals, and illumination. While recent diffusion- and transformer-based models benefit from paired supervision from synthetic datasets, their generalization to diverse, real-world scenarios remains challenging. We propose ReasonX, a novel framework that leverages a multimodal large language model (MLLM) as a perceptual judge providing relative intrinsic comparisons, and uses these comparisons as GRPO rewards for fine-tuning intrinsic decomposition models on unlabeled, in-the-wild images. Unlike RL methods for generative models, our framework aligns conditional intrinsic predictors by rewarding agreement between the judge's relational assessments and analytically derived relations from the model's outputs. ReasonX is model-agnostic and can be applied to different intrinsic predictors. Across multiple base architectures and modalities, ReasonX yields significant improvements, including 9-25% WHDR reduction on IIW albedo and up to 46% depth accuracy gains on ETH3D, highlighting the promise of MLLM-guided comparative supervision to bridge low- and high-level vision reasoning.
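The reward described above can be sketched as an agreement score between a judge's pairwise comparisons and relations computed from a prediction, followed by the group-normalized advantage used in GRPO-style training. The relation labels, the tolerance, and the scalar "albedo per region" representation are simplifying assumptions; the paper's actual reward design may differ.

```python
import statistics

def relation(a, b, tol=0.05):
    """Analytic relation between two predicted values (labels are illustrative)."""
    if abs(a - b) <= tol:
        return "equal"
    return "darker" if a < b else "brighter"

def agreement_reward(pred_albedo, judge_comparisons):
    """Fraction of judge comparisons (i, j, relation) that the prediction agrees with."""
    hits = sum(relation(pred_albedo[i], pred_albedo[j]) == rel
               for i, j, rel in judge_comparisons)
    return hits / len(judge_comparisons)

def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within a sampled group: (r - mean) / std."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical judge output: region 0 is darker than regions 1 and 2, which look equal.
judge = [(0, 1, "darker"), (1, 2, "equal"), (0, 2, "darker")]
# Three sampled predictions of mean albedo for regions 0..2.
candidates = [[0.2, 0.6, 0.62], [0.5, 0.6, 0.3], [0.7, 0.4, 0.42]]

rewards = [agreement_reward(c, judge) for c in candidates]
advantages = group_advantages(rewards)
```

Here the first candidate agrees with all three comparisons and receives a positive advantage, while the other two each match only one comparison and receive negative advantages.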
[6] 6 Fingers, 1 Kidney: Natural Adversarial Medical Images Reveal Critical Weaknesses of Vision-Language Models
Leon Mayer, Piotr Kalinowski, Caroline Ebersbach, Marcel Knopp, Tim Rädsch, Evangelia Christodoulou, Annika Reinke, Fiona R. Kolbinger, Lena Maier-Hein
🧩 TL;DR
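This study introduces AdversarialAnatomyBench, the first vision-language model benchmark of rare anatomical variants, and shows that current VLMs generalize poorly to rare anatomical presentations, with mean accuracy falling by 45 percentage points.
📘 Detailed Summary
Motivation: Existing VLM benchmarks mainly assess common anatomical presentations and fail to capture the challenges posed by rare variants, so medical AI systems may perform poorly on atypical anatomy in clinical use without this limitation being quantified.
Method: AdversarialAnatomyBench comprises naturally occurring rare anatomical variants across diverse imaging modalities and anatomical regions, termed "natural adversarial anatomy". It is used to systematically evaluate 22 state-of-the-art VLMs on basic medical perception tasks.
Result: Mean accuracy dropped from 74% on typical anatomy to 29% on rare variants, a fall of 45 percentage points. Even the best-performing models, GPT-5, Gemini 2.5 Pro, and Llama 4 Maverick, showed performance drops of 41-51%. Model errors closely mirrored expected anatomical biases, and neither model scaling nor interventions resolved the problem.
Conclusion: Current VLMs generalize poorly to rare anatomical presentations. AdversarialAnatomyBench provides a foundation for systematically measuring and mitigating anatomical bias in multimodal medical AI systems and underlines the need to account for anatomical variation in clinical applications.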
📄 Abstract
Vision-language models (VLMs) are increasingly integrated into clinical workflows. However, existing benchmarks primarily assess performance on common anatomical presentations and fail to capture the challenges posed by rare variants. To address this gap, we introduce AdversarialAnatomyBench, the first benchmark comprising naturally occurring rare anatomical variants across diverse imaging modalities and anatomical regions. We call such variants that violate learned priors about "typical" human anatomy natural adversarial anatomy. Benchmarking 22 state-of-the-art VLMs with AdversarialAnatomyBench yielded three key insights. First, when queried with basic medical perception tasks, mean accuracy dropped from 74% on typical to 29% on atypical anatomy. Even the best-performing models, GPT-5, Gemini 2.5 Pro, and Llama 4 Maverick, showed performance drops of 41-51%. Second, model errors closely mirrored expected anatomical biases. Third, neither model scaling nor interventions, including bias-aware prompting and test-time reasoning, resolved these issues. These findings highlight a critical and previously unquantified limitation in current VLMs: their poor generalization to rare anatomical presentations. AdversarialAnatomyBench provides a foundation for systematically measuring and mitigating anatomical bias in multimodal medical AI systems.
[7] UniLight: A Unified Representation for Lighting
Zitian Zhang, Iliyan Georgiev, Michael Fischer, Yannick Hold-Geoffroy, Jean-François Lalonde, Valentin Deschaintre
🧩 TL;DR
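This paper proposes UniLight, a joint latent space for representing lighting that uses contrastive learning to unify text, images, irradiance, and environment maps, enabling cross-modal lighting understanding and flexible manipulation.
📘 Detailed Summary
Motivation: Existing lighting representations such as environment maps, irradiance, spherical harmonics, and text are mutually incompatible, which limits cross-modal transfer. A representation that unifies these modalities is therefore needed.
Method: UniLight trains modality-specific encoders for text, images, irradiance, and environment maps contrastively so that their representations align in a shared embedding space. An auxiliary spherical-harmonics prediction task reinforces directional understanding, and a multimodal data pipeline supports large-scale training.
Result: Experiments show that the UniLight representation captures consistent, transferable lighting features and supports three tasks: lighting-based retrieval, environment-map generation, and lighting control in diffusion-based image synthesis.
Conclusion: The study demonstrates that a unified lighting representation space is feasible and effective, offering a new approach to cross-modal lighting understanding and manipulation for computer vision and graphics applications.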
📄 Abstract
Lighting has a strong influence on visual appearance, yet understanding and representing lighting in images remains notoriously difficult. Various lighting representations exist, such as environment maps, irradiance, spherical harmonics, or text, but they are incompatible, which limits cross-modal transfer. We thus propose UniLight, a joint latent space as lighting representation, that unifies multiple modalities within a shared embedding. Modality-specific encoders for text, images, irradiance, and environment maps are trained contrastively to align their representations, with an auxiliary spherical-harmonics prediction task reinforcing directional understanding. Our multi-modal data pipeline enables large-scale training and evaluation across three tasks: lighting-based retrieval, environment-map generation, and lighting control in diffusion-based image synthesis. Experiments show that our representation captures consistent and transferable lighting features, enabling flexible manipulation across modalities.
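Contrastive alignment of modality-specific encoders is commonly implemented with a symmetric InfoNCE objective; the sketch below computes that loss for two batches of paired embeddings. Whether UniLight uses exactly this loss is an assumption, and the temperature value and the 2-dimensional toy embeddings are invented for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(anchors, positives, temperature=0.07):
    """Mean cross-entropy of matching anchors[i] to positives[i] among all positives."""
    loss = 0.0
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, p) / temperature for p in positives]
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denominator - logits[i]
    return loss / len(anchors)

def symmetric_contrastive_loss(x, y, temperature=0.07):
    """Average the loss in both directions (x to y and y to x)."""
    return 0.5 * (info_nce(x, y, temperature) + info_nce(y, x, temperature))

# Toy unit-norm embeddings for two lighting conditions, as produced by two encoders.
text_emb = [[1.0, 0.0], [0.0, 1.0]]
envmap_aligned = [[1.0, 0.0], [0.0, 1.0]]     # pairs point in the same direction
envmap_misaligned = [[0.0, 1.0], [1.0, 0.0]]  # pairs are swapped

loss_aligned = symmetric_contrastive_loss(text_emb, envmap_aligned)
loss_misaligned = symmetric_contrastive_loss(text_emb, envmap_misaligned)
```

The loss is close to zero when each text embedding matches its own environment-map embedding and large when the pairs are swapped, which is the signal that pulls the encoders into a shared space during training.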
[8] DraCo: Draft as CoT for Text-to-Image Preview and Rare Concept Generation
Dongzhi Jiang, Renrui Zhang, Haodong Li, Zhuofan Zong, Ziyu Guo, Jun He, Claire Guo, Junyan Ye, Rongyao Fang, Weijia Li, Rui Liu, Hongsheng Li
🧩 TL;DR
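This paper proposes DraCo (Draft-as-CoT), an interleaved reasoning paradigm that generates a low-resolution draft as a visual plan, then uses the model's own understanding capability to verify semantic alignment and apply selective corrections, improving text-to-image generation in multimodal LLMs.
📘 Detailed Summary
Motivation: Existing unified multimodal LLMs are limited in text-to-image generation: they either treat the model as a standalone generator or rely on abstract textual planning. Neither addresses the coarse granularity of textual planning or the difficulty of generating rare attribute combinations.
Method: DraCo first generates a low-resolution draft image as a preview, which provides a concrete, structured visual plan. It then uses the model's inherent understanding capability to check for semantic misalignments between the draft and the input prompt and refines the result through selective corrections with super-resolution. The authors also curate the DraCo-240K training dataset to strengthen three atomic capabilities and develop DraCo-CFG, a specialized classifier-free guidance strategy for interleaved reasoning.
Result: DraCo improves GenEval by +8%, Imagine-Bench by +0.91, and GenEval++ by +3%, significantly outperforming direct generation and other CoT-based generation methods.
Conclusion: The study demonstrates the effectiveness of interleaved reasoning for text-to-image generation. Combining textual and visual content in the chain of thought improves planning and verification and helps with rare attribute combinations and semantic alignment.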
📄 Abstract
Recent unified multimodal large language models (MLLMs) have shown impressive capabilities, incorporating chain-of-thought (CoT) reasoning for enhanced text-to-image generation. However, existing approaches remain limited, either treating the model merely as a standalone generator or relying on abstract textual planning. To this end, we propose Draft-as-CoT (DraCo), a novel interleaved reasoning paradigm that fully leverages both textual and visual contents in CoT for better planning and verification. Our method first generates a low-resolution draft image as a preview, providing more concrete and structural visual planning and guidance. Then, we employ the model's inherent understanding capability to verify potential semantic misalignments between the draft and the input prompt, and perform refinement through selective corrections with super-resolution. In this way, our approach addresses two fundamental challenges: the coarse-grained nature of textual planning and the difficulty in generating rare attribute combinations. To support training, we curate DraCo-240K, aiming to enhance three atomic capabilities spanning general correction, instance manipulation, and layout reorganization. Supported by DraCo-CFG, a specialized classifier-free guidance (CFG) strategy for interleaved reasoning, DraCo achieves a tremendous increase on GenEval (+8%), Imagine-Bench (+0.91), and GenEval++ (+3%), significantly outperforming direct generation and other generation methods empowered by CoT.
[9] Inference-time Stochastic Refinement of GRU-Normalizing Flow for Real-time Video Motion Transfer
Tasmiah Haque, Srinjoy Das
🧩 TL;DR
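This paper proposes GRU-SNF (Gated Recurrent Unit-Stochastic Normalizing Flows), a sequence forecasting approach that adds MCMC sampling at inference time. Injecting stochasticity into GRU-NF increases the diversity of future predictions in video motion transfer while preserving accuracy.
📘 Detailed Summary
Motivation: Real-time video motion transfer applications such as immersive gaming and vision-based anomaly detection need accurate yet diverse future predictions to support realistic synthesis and robust downstream decisions under uncertainty. GRU-NF captures multimodal distributions through normalizing flows, but its deterministic transformation structure can limit expressivity, so the diversity of its sequential forecasts needs improvement.
Method: The paper proposes an inference-time refinement technique that combines GRU-NF with stochastic sampling. Inspired by Stochastic Normalizing Flows, it introduces Markov Chain Monte Carlo steps during GRU-NF inference, letting the model explore a richer output space and better approximate the true data distribution without retraining. The approach is validated in a keypoint-based video motion transfer pipeline, where temporally coherent and perceptually diverse future trajectories matter for realistic samples and low-bandwidth communication.
Result: Experiments show that GRU-SNF generates more diverse outputs than GRU-NF without sacrificing accuracy, even over longer prediction horizons. Injecting stochasticity at inference captures multimodal behavior more effectively.
Conclusion: The results highlight the potential of integrating stochastic dynamics with flow-based sequence models for generative time series forecasting, extending the idea of stochastic normalizing flows to temporal modeling.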
📄 Abstract
Real-time video motion transfer applications such as immersive gaming and vision-based anomaly detection require accurate yet diverse future predictions to support realistic synthesis and robust downstream decision making under uncertainty. To improve the diversity of such sequential forecasts, we propose a novel inference-time refinement technique that combines Gated Recurrent Unit-Normalizing Flows (GRU-NF) with stochastic sampling methods. While GRU-NF can capture multimodal distributions through its integration of normalizing flows within a temporal forecasting framework, its deterministic transformation structure can limit expressivity. To address this, inspired by Stochastic Normalizing Flows (SNF), we introduce Markov Chain Monte Carlo (MCMC) steps during GRU-NF inference, enabling the model to explore a richer output space and better approximate the true data distribution without retraining. We validate our approach in a keypoint-based video motion transfer pipeline, where capturing temporally coherent and perceptually diverse future trajectories is essential for realistic samples and low-bandwidth communication. Experiments show that our inference framework, Gated Recurrent Unit-Stochastic Normalizing Flows (GRU-SNF), outperforms GRU-NF in generating diverse outputs without sacrificing accuracy, even under longer prediction horizons. By injecting stochasticity during inference, our approach captures multimodal behavior more effectively. These results highlight the potential of integrating stochastic dynamics with flow-based sequence models for generative time series forecasting.
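The idea of refining a deterministic flow output with MCMC steps can be illustrated with a random-walk Metropolis sampler. In this sketch the starting value stands in for a GRU-NF prediction and `log_target` is a hand-written bimodal density; both are assumptions, since the abstract does not specify the sampler, the target density, or the step schedule.

```python
import math
import random

def log_target(x):
    # Stand-in bimodal density (two Gaussians at -1 and +1); the tiny constant avoids log(0).
    left = 0.5 * math.exp(-0.5 * ((x + 1.0) / 0.3) ** 2)
    right = 0.5 * math.exp(-0.5 * ((x - 1.0) / 0.3) ** 2)
    return math.log(left + right + 1e-300)

def mcmc_refine(x0, steps, step_size, rng):
    """Random-walk Metropolis: propose a Gaussian move, accept with prob min(1, p(new)/p(old))."""
    x = x0
    for _ in range(steps):
        proposal = x + rng.gauss(0.0, step_size)
        log_accept = log_target(proposal) - log_target(x)
        if rng.random() < math.exp(min(0.0, log_accept)):
            x = proposal
    return x

rng = random.Random(0)
flow_output = 1.0  # hypothetical deterministic prediction from the flow
samples = [mcmc_refine(flow_output, steps=20, step_size=0.5, rng=rng) for _ in range(100)]
spread = max(samples) - min(samples)
```

Every refined sample starts from the same deterministic prediction, yet the accepted random-walk moves spread the samples over the target density, which is the sense in which stochastic steps add diversity without retraining.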
[10] How (Mis)calibrated is Your Federated CLIP and What To Do About It?
Mainak Singha, Masih Aminbeidokhti, Paolo Casari, Elisa Ricci, Subhankar Roy
🧩 TL;DR
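This paper studies the calibration of CLIP under federated learning and proposes FL²oRA, a LoRA-based method that naturally improves calibration in the federated setting and reduces the need for explicit calibration procedures.
📘 Detailed Summary
Motivation: Vision-language models such as CLIP have been studied extensively, but their calibration in federated learning has not. Prior work examines CLIP calibration mainly in offline settings, and the effect of federated fine-tuning on model reliability remains unexplored.
Method: The study first analyzes how Textual Prompt Tuning methods behave with respect to calibration under federated learning and evaluates existing in-training calibration techniques across different global aggregation methods. Building on this analysis, it proposes FL²oRA, a simple LoRA-based approach that improves calibration by choosing appropriate components to fine-tune.
Result: Textual Prompt Tuning degrades calibration metrics in the federated setting, and existing in-training calibration techniques give only limited improvement across aggregation methods. FL²oRA consistently produces well-calibrated models on multiple benchmarks, reducing the need for explicit calibration procedures.
Conclusion: The key challenge for CLIP calibration in federated learning lies not only in how models are aggregated or calibrated but in which components are fine-tuned. FL²oRA shows that a suitable parameter-efficient fine-tuning strategy can improve reliability in distributed settings.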
📄 Abstract
While vision-language models like CLIP have been extensively studied, their calibration, crucial for reliable predictions, has received limited attention. Although a few prior works have examined CLIP calibration in offline settings, the impact of fine-tuning CLIP in a federated learning (FL) setup remains unexplored. In this work, we investigate how FL affects CLIP calibration and propose strategies to improve reliability in this distributed setting. We first analyze Textual Prompt Tuning approaches and show that they degrade calibration metrics when operating under FL. We also evaluate existing in-training calibration techniques across four global aggregation methods, finding that they provide limited improvements. Our results suggest that the key challenge lies not only in how we aggregate or calibrate, but in which components we choose to fine-tune. Motivated by this insight, we propose $\text{FL}^2\text{oRA}$, a straightforward LoRA-based approach that naturally improves calibration in FL, and we analyze the factors behind its effectiveness. Experiments on multiple benchmarks demonstrate that $\text{FL}^2\text{oRA}$ consistently produces well-calibrated models, reducing the need for explicit calibration procedures. Code is available at https://github.com/mainaksingha01/FL2oRA.
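Calibration is usually quantified with the expected calibration error (ECE), which bins predictions by confidence and averages the gap between accuracy and confidence per bin. The abstract does not list which calibration metrics were used, so treating ECE as one of them is an assumption; the confidence values below are invented.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: sum over bins of (bin size / N) * |bin accuracy - bin mean confidence|."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += len(members) / total * abs(accuracy - mean_conf)
    return ece

# An overconfident model: 95% confidence but only half the predictions are right.
ece_overconfident = expected_calibration_error([0.95] * 4, [1, 1, 0, 0])
# A calibrated model: 75% confidence and three of four predictions are right.
ece_calibrated = expected_calibration_error([0.75] * 4, [1, 1, 1, 0])
```

The overconfident example yields an ECE of about 0.45 and the calibrated one yields 0, so a lower value indicates that confidence scores can be trusted as probabilities.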
[11] Fourier-Attentive Representation Learning: A Fourier-Guided Framework for Few-Shot Generalization in Vision-Language Models
Hieu Dinh Trung Pham, Huy Minh Nhat Nguyen, Cuong Tuan Nguyen
🧩 TL;DR
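This paper proposes Fourier-Attentive Representation Learning (FARL), a framework that uses Fourier analysis to explicitly disentangle structural from stylistic features in visual representations, improving the few-shot generalization of vision-language models.
📘 Detailed Summary
Motivation: Large-scale pre-trained vision-language models perform well in few-shot learning, but their holistic representations implicitly entangle an image's domain-invariant structure with its domain-specific style, which limits generalization. Explicitly disentangling these visual cues offers room for improvement.
Method: The core of FARL is a dual cross-attention mechanism in which learnable representation tokens separately query structural features from the image's phase spectrum and stylistic features from its amplitude spectrum. The resulting enriched, disentangled tokens are injected deep into the VLM encoders through an asymmetric injection strategy to guide adaptation.
Result: Extensive experiments on 15 datasets demonstrate the effectiveness of the approach for few-shot learning.
Conclusion: Explicitly disentangling structure and style through Fourier analysis can strengthen the generalization of vision-language models, and the asymmetric injection strategy pushes the model toward a more robust vision-language alignment.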
📄 Abstract
Large-scale pre-trained Vision-Language Models (VLMs) have demonstrated strong few-shot learning capabilities. However, these methods typically learn holistic representations where an image's domain-invariant structure is implicitly entangled with its domain-specific style. This presents an opportunity to further enhance generalization by disentangling these visual cues. In this paper, we propose Fourier-Attentive Representation Learning (FARL), a novel framework that addresses this by explicitly disentangling visual representations using Fourier analysis. The core of our method is a dual cross-attention mechanism, where learnable representation tokens separately query an image's structural features (from the phase spectrum) and stylistic features (from the amplitude spectrum). This process yields enriched, disentangled tokens that are then injected deep into the VLM encoders to guide adaptation. Our design, which includes an asymmetric injection strategy, forces the model to learn a more robust vision-language alignment. Extensive experiments on 15 datasets demonstrate the effectiveness of our approach.
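The split into phase (structure) and amplitude (style) can be illustrated with a discrete Fourier transform. The sketch below uses a naive 1-D DFT on a single toy pixel row for brevity, whereas an image pipeline would use a 2-D FFT; the helper names are hypothetical and this is not FARL's implementation.

```python
import cmath
import math

def dft(signal):
    """Naive discrete Fourier transform of a real-valued sequence."""
    n = len(signal)
    return [sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
            for k in range(n)]

def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [sum(c * cmath.exp(2j * math.pi * k * t / n) for k, c in enumerate(spectrum)).real / n
            for t in range(n)]

def split_spectrum(signal):
    """Return (amplitude, phase) per frequency."""
    spectrum = dft(signal)
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]

def combine(amplitude, phase):
    """Rebuild a signal from amplitude and phase."""
    return idft([cmath.rect(a, p) for a, p in zip(amplitude, phase)])

row = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]  # toy pixel row with two edges
amplitude, phase = split_spectrum(row)

reconstructed = combine(amplitude, phase)
# Changing only the amplitude alters intensity while the edge positions, carried by phase, stay put.
restyled = combine([2.0 * a for a in amplitude], phase)
```

Recombining the original amplitude and phase recovers the row, and doubling the amplitude doubles the intensity without moving the edges, which is the intuition behind querying structure from phase and style from amplitude.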
[12] Explainable Parkinsons Disease Gait Recognition Using Multimodal RGB-D Fusion and Large Language Models
Manar Alnaasan, Md Selim Sarowar, Sungho Kim
🧩 TL;DR
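This paper presents an explainable multimodal framework that fuses RGB-D data to recognize Parkinsonian gait patterns and uses a frozen large language model to generate clinically understandable explanations, bridging visual recognition and clinical understanding.
📘 Detailed Summary
Motivation: Existing Parkinson's disease gait analysis methods are limited by single-modality inputs, low robustness, and a lack of clinical transparency, which makes accurate and interpretable gait analysis under realistic conditions difficult and hinders early detection and clinical use.
Method: Dual YOLOv11-based encoders extract modality-specific features from RGB and depth inputs. A Multi-Scale Local-Global Extraction module and a Cross-Spatial Neck Fusion mechanism enhance the spatial-temporal representation, and a frozen large language model translates the fused visual embeddings and structured metadata into clinically meaningful textual explanations.
Result: Experiments on multimodal gait datasets show that the RGB-D fusion framework achieves higher recognition accuracy than single-input baselines, is more robust to environmental variation, and provides clear visual-linguistic reasoning, including in challenging scenarios such as low lighting or occlusion by clothing.
Conclusion: By combining multimodal feature learning with language-based interpretability, the study offers a vision-language paradigm for reliable and explainable Parkinson's disease gait analysis that narrows the gap between visual recognition and clinical understanding.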
📄 Abstract
Accurate and interpretable gait analysis plays a crucial role in the early detection of Parkinson's disease (PD), yet most existing approaches remain limited by single-modality inputs, low robustness, and a lack of clinical transparency. This paper presents an explainable multimodal framework that integrates RGB and Depth (RGB-D) data to recognize Parkinsonian gait patterns under realistic conditions. The proposed system employs dual YOLOv11-based encoders for modality-specific feature extraction, followed by a Multi-Scale Local-Global Extraction (MLGE) module and a Cross-Spatial Neck Fusion mechanism to enhance spatial-temporal representation. This design captures both fine-grained limb motion (e.g., reduced arm swing) and overall gait dynamics (e.g., short stride or turning difficulty), even in challenging scenarios such as low lighting or occlusion caused by clothing. To ensure interpretability, a frozen Large Language Model (LLM) is incorporated to translate fused visual embeddings and structured metadata into clinically meaningful textual explanations. Experimental evaluations on multimodal gait datasets demonstrate that the proposed RGB-D fusion framework achieves higher recognition accuracy, improved robustness to environmental variations, and clear visual-linguistic reasoning compared with single-input baselines. By combining multimodal feature learning with language-based interpretability, this study bridges the gap between visual recognition and clinical understanding, offering a novel vision-language paradigm for reliable and explainable Parkinson's disease gait analysis. Code: https://github.com/manaralnaasan/RGB-D_parkinson-LLM
[13] PhyVLLM: Physics-Guided Video Language Model with Motion-Appearance Disentanglement
Yu-Wei Zhan, Xin Wang, Hong Chen, Tongtong Feng, Wei Feng, Ren Wang, Guangyao Li, Qing Li, Wenwu Zhu
🧩 TL;DR
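This paper proposes PhyVLLM, a physics-guided video-language framework that explicitly incorporates physical motion modeling into video LLMs, addressing their limited understanding of physical dynamics and improving both physical reasoning and general video understanding.
📘 Detailed Summary
Motivation: Current video LLMs often fail in scenarios that require a deeper understanding of physical dynamics, mainly because they rely on appearance-based matching. Physical motion modeling faces three challenges: motion signals are entangled with appearance variations, which makes clean physical cues hard to extract; effective motion modeling needs continuous-time representations that also capture physical dynamics; and annotating physical attributes is costly and often impractical.
Method: PhyVLLM disentangles visual appearance and object motion with a dual-branch encoder and models physical dynamics over time with a Neural Ordinary Differential Equation module, which produces differentiable physical dynamic representations. These motion-aware representations are projected into the token space of a pretrained LLM, enabling physics reasoning without compromising the original multimodal capabilities. To avoid explicit physical labels, the continuous evolution of object motion is modeled in a self-supervised manner.
Result: PhyVLLM significantly outperforms state-of-the-art video LLMs on both physical reasoning and general video understanding tasks, supporting the value of explicit physical modeling.
Conclusion: Incorporating explicit physical motion modeling into video LLMs improves physical reasoning and video understanding. Disentangling appearance from motion, modeling dynamics in continuous time, and learning in a self-supervised way allow physical knowledge to be integrated without costly annotation.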
📄 Abstract
Video Large Language Models (Video LLMs) have shown impressive performance across a wide range of video-language tasks. However, they often fail in scenarios requiring a deeper understanding of physical dynamics. This limitation primarily arises from their reliance on appearance-based matching. Incorporating physical motion modeling is crucial for deeper video understanding, but presents three key challenges: (1) motion signals are often entangled with appearance variations, making it difficult to extract clean physical cues; (2) effective motion modeling requires not only continuous-time motion representations but also capturing physical dynamics; and (3) collecting accurate annotations for physical attributes is costly and often impractical. To address these issues, we propose PhyVLLM, a physics-guided video-language framework that explicitly incorporates physical motion into Video LLMs. Specifically, PhyVLLM disentangles visual appearance and object motion through a dual-branch encoder. To model physical dynamics over time, we incorporate a Neural Ordinary Differential Equation (Neural ODE) module, which generates differentiable physical dynamic representations. The resulting motion-aware representations are projected into the token space of a pretrained LLM, enabling physics reasoning without compromising the model's original multimodal capabilities. To circumvent the need for explicit physical labels, PhyVLLM models the continuous evolution of object motion in a self-supervised manner. Experimental results demonstrate that PhyVLLM significantly outperforms state-of-the-art Video LLMs on both physical reasoning and general video understanding tasks, highlighting the advantages of incorporating explicit physical modeling.
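A Neural ODE treats a hidden state as evolving under dh/dt = f(h, t) and obtains later states by numerical integration. The sketch below integrates a hand-written `dynamics` function with a fixed-step Runge-Kutta solver; in an actual Neural ODE, `dynamics` would be a neural network with learned parameters and gradients would flow through the solver. The projectile dynamics, step count, and function names are illustrative assumptions.

```python
def dynamics(state, t):
    """Stand-in for a learned network: a point mass under constant gravity."""
    position, velocity = state
    return [velocity, -9.81]

def rk4_step(f, state, t, dt):
    """One classical fourth-order Runge-Kutta step."""
    k1 = f(state, t)
    k2 = f([s + 0.5 * dt * k for s, k in zip(state, k1)], t + 0.5 * dt)
    k3 = f([s + 0.5 * dt * k for s, k in zip(state, k2)], t + 0.5 * dt)
    k4 = f([s + dt * k for s, k in zip(state, k3)], t + dt)
    return [s + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for s, a, b, c, d in zip(state, k1, k2, k3, k4)]

def integrate(f, state, t0, t1, steps):
    """Integrate from t0 to t1 and return the full trajectory of states."""
    dt = (t1 - t0) / steps
    trajectory = [state]
    t = t0
    for _ in range(steps):
        state = rk4_step(f, state, t, dt)
        t += dt
        trajectory.append(state)
    return trajectory

# Start at height 0 with upward velocity 5 m/s and roll the state forward one second.
trajectory = integrate(dynamics, [0.0, 5.0], t0=0.0, t1=1.0, steps=10)
final_position, final_velocity = trajectory[-1]
```

Because the state is defined at any time between frames, the trajectory can be queried at arbitrary timestamps, which is what makes a continuous-time formulation attractive for motion modeling. For this toy system the result matches the closed-form solution (position 0.095, velocity -4.81 after one second).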
[14] MindDrive: An All-in-One Framework Bridging World Models and Vision-Language Model for End-to-End Autonomous Driving
Bin Sun, Yaoguang Cao, Yan Wang, Rui Wang, Jiachen Shang, Xiejie Feng, Jiayi Lu, Jia Shi, Shichun Yang, Xiaoyu Yan, Ziying Song
🧩 TL;DR
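This paper proposes MindDrive, an end-to-end autonomous driving framework that harmonizes high-quality trajectory generation with comprehensive decision reasoning and achieves state-of-the-art performance on the NAVSIM benchmarks through a structured reasoning paradigm.
📘 Detailed Summary
Motivation: Existing end-to-end autonomous driving research follows two directions. Trajectory-generation-oriented methods produce high-quality trajectories but use simple decision mechanisms, while trajectory-selection-oriented methods perform multi-dimensional evaluation but lack sufficient generative capability. A framework that integrates both is needed.
Method: MindDrive establishes a structured reasoning paradigm of "context simulation - candidate generation - multi-objective trade-off". A Future-aware Trajectory Generator, based on a World Action Model, performs ego-conditioned "what-if" simulations to predict potential future scenes and generate candidate trajectories. A VLM-oriented Evaluator then uses the reasoning capability of a large vision-language model to evaluate candidates across safety, comfort, and efficiency.
Result: Extensive experiments on the NAVSIM-v1 and NAVSIM-v2 benchmarks show that MindDrive achieves state-of-the-art performance across multi-dimensional driving metrics, significantly improving safety, compliance, and generalization.
Conclusion: The work offers a promising path toward interpretable, cognitively guided autonomous driving, with integrated generation and evaluation leading to more comprehensive, human-aligned decision making.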
📄 Abstract
End-to-End autonomous driving (E2E-AD) has emerged as a new paradigm, where trajectory planning plays a crucial role. Existing studies mainly follow two directions: trajectory generation oriented, which focuses on producing high-quality trajectories with simple decision mechanisms, and trajectory selection oriented, which performs multi-dimensional evaluation to select the best trajectory yet lacks sufficient generative capability. In this work, we propose MindDrive, a harmonized framework that integrates high-quality trajectory generation with comprehensive decision reasoning. It establishes a structured reasoning paradigm of "context simulation - candidate generation - multi-objective trade-off". In particular, the proposed Future-aware Trajectory Generator (FaTG), based on a World Action Model (WaM), performs ego-conditioned "what-if" simulations to predict potential future scenes and generate foresighted trajectory candidates. Building upon this, the VLM-oriented Evaluator (VLoE) leverages the reasoning capability of a large vision-language model to conduct multi-objective evaluations across safety, comfort, and efficiency dimensions, leading to reasoned and human-aligned decision making. Extensive experiments on the NAVSIM-v1 and NAVSIM-v2 benchmarks demonstrate that MindDrive achieves state-of-the-art performance across multi-dimensional driving metrics, significantly enhancing safety, compliance, and generalization. This work provides a promising path toward interpretable and cognitively guided autonomous driving.
[15] StreamEQA: Towards Streaming Video Understanding for Embodied Scenarios
Yifei Wang, Zhenkai Li, Tianwen Qian, Huanran Zheng, Zheng Wang, Yuqian Fu, Xiaoling Wang
🧩 TL;DR
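This paper introduces StreamEQA, the first benchmark for streaming video question answering in embodied scenarios. It evaluates how well multimodal LLMs perceive and reason over continuous visual input and reveals notable weaknesses in current models.
📘 Detailed Summary
Motivation: As embodied intelligence moves toward real-world deployment, agents must perceive and reason over continuous visual streams to maintain situational awareness, understand interactions with surrounding entities, and plan actions based on past observations, current context, and anticipated future events. Existing benchmarks lack an evaluation framework dedicated to streaming video question answering in embodied scenarios.
Method: StreamEQA evaluates multimodal LLMs along two orthogonal dimensions, embodied and streaming. The embodied dimension groups questions into three levels (perception, interaction, and planning) that progressively assess the ability to recognize fine-grained visual details, reason about agent-object interactions, and perform high-level goal-directed reasoning. The streaming dimension divides questions into backward, real-time, and forward reasoning, each relying on a distinct temporal context. Built on 156 independent long videos, the benchmark defines 42 tasks and contains approximately 21K question-answer pairs with precise timestamps, produced by a hybrid pipeline that combines automated generation with human refinement.
Result: Evaluations of 13 state-of-the-art video-LLMs show that, despite strong performance on conventional benchmarks, these models still struggle with streaming video understanding in embodied scenarios.
Conclusion: StreamEQA fills a gap in evaluating streaming video understanding for embodied intelligence and gives the community a systematic evaluation framework. The authors hope it will catalyze research on streaming video understanding for embodied applications.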
📄 Abstract
As embodied intelligence advances toward real-world deployment, the ability to continuously perceive and reason over streaming visual inputs becomes essential. In such settings, an agent must maintain situational awareness of its environment, comprehend the interactions with surrounding entities, and dynamically plan actions informed by past observations, current contexts, and anticipated future events. To facilitate progress in this direction, we introduce StreamEQA, the first benchmark designed for streaming video question answering in embodied scenarios. StreamEQA evaluates existing MLLMs along two orthogonal dimensions: Embodied and Streaming. Along the embodied dimension, we categorize the questions into three levels: perception, interaction, and planning, which progressively assess a model's ability to recognize fine-grained visual details, reason about agent-object interactions, and perform high-level goal-directed reasoning. For the streaming dimension, questions are divided into backward, real-time, and forward reasoning, with each mode relying on a distinct temporal context. Built upon 156 independent long videos, StreamEQA defines 42 tasks and generates approximately 21K question-answer pairs with precise timestamps through a hybrid pipeline combining automated generation and human refinement. Evaluations of 13 state-of-the-art video-LLMs reveal that, despite strong performance on conventional benchmarks, these models still struggle with streaming video understanding in embodied scenarios. We hope StreamEQA will catalyze research on streaming video understanding for embodied applications.
[16] Measuring the Unspoken: A Disentanglement Model and Benchmark for Psychological Analysis in the Wild
Yigui Feng, Qinglin Wang, Haotian Mo, Yang Liu, Ke Liu, Gencheng Liu, Xinhai Chen, Siqi Shen, Songzhu Mei, Jie Liu
🧩 TL;DR
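This paper proposes a complete ecosystem that addresses two core challenges in generative psychological analysis: the MIND hierarchical visual encoder resolves Articulatory-Affective Ambiguity, and the ConvoInsight-DB dataset and PRISM evaluation framework supply data and verifiable metrics, yielding a +86.95% gain in micro-expression detection.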
📘 Detailed Summary
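Motivation: Generative psychological analysis faces two fundamental challenges: existing vision-language models cannot resolve Articulatory-Affective Ambiguity, in which the visual patterns of speech mimic emotional expressions; and the field lacks verifiable evaluation metrics capable of assessing visual grounding and reasoning depth, which hinders progress.
Method: The paper proposes a complete ecosystem with three core components. First, it introduces the Multilevel Insight Network for Disentanglement (MIND), a novel hierarchical visual encoder whose Status Judgment module algorithmically suppresses ambiguous lip features based on their temporal feature variance, achieving explicit visual disentanglement. Second, it constructs ConvoInsight-DB, a large-scale dataset with expert annotations for micro-expressions and deep psychological inference. Third, it designs the Mental Reasoning Insight Rating Metric (PRISM), an automated dimensional framework that uses an expert-guided large language model to evaluate the multidimensional performance of large mental vision models.
Result: On the PRISM benchmark, MIND significantly outperforms all baselines, achieving a +86.95% gain in micro-expression detection over the prior state of the art. Ablation studies confirm that the Status Judgment disentanglement module is the most critical component behind this gain.
Conclusion: Through algorithmic visual disentanglement, a large-scale annotated dataset, and an automated evaluation framework, the study establishes a complete ecosystem for generative psychological analysis. The success of the Status Judgment module indicates that explicitly handling Articulatory-Affective Ambiguity is crucial for psychological reasoning, and it provides a technical and methodological basis for analyzing psychological states in real conversations.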
📄 Abstract
Generative psychological analysis of in-the-wild conversations faces two fundamental challenges: (1) existing Vision-Language Models (VLMs) fail to resolve Articulatory-Affective Ambiguity, where visual patterns of speech mimic emotional expressions; and (2) progress is stifled by a lack of verifiable evaluation metrics capable of assessing visual grounding and reasoning depth. We propose a complete ecosystem to address these twin challenges. First, we introduce Multilevel Insight Network for Disentanglement (MIND), a novel hierarchical visual encoder that introduces a Status Judgment module to algorithmically suppress ambiguous lip features based on their temporal feature variance, achieving explicit visual disentanglement. Second, we construct ConvoInsight-DB, a new large-scale dataset with expert annotations for micro-expressions and deep psychological inference. Third, we design the Mental Reasoning Insight Rating Metric (PRISM), an automated dimensional framework that uses an expert-guided LLM to measure the multidimensional performance of large mental vision models. On our PRISM benchmark, MIND significantly outperforms all baselines, achieving a +86.95% gain in micro-expression detection over prior SOTA. Ablation studies confirm that our Status Judgment disentanglement module is the most critical component for this performance leap. Our code has been released.
[17] dVLM-AD: Enhance Diffusion Vision-Language-Model for Driving via Controllable Reasoning
Yingzi Ma, Yulong Cao, Wenhao Ding, Shuibai Zhang, Yan Wang, Boris Ivanovic, Ming Jiang, Marco Pavone, Chaowei Xiao
🧩 TL;DR
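This paper proposes dVLM-AD, a diffusion-based vision-language model for end-to-end autonomous driving that unifies perception, structured reasoning, and low-level planning to address the limited reasoning-action consistency of existing autoregressive VLMs, showing better controllability and reliability in long-tail scenarios.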
📘 Detailed Summary
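Motivation: Current autonomous driving research relies mainly on autoregressive vision-language models to improve the generalization of end-to-end driving systems. Constrained by causal attention and sequential token generation, these models struggle to maintain consistency and controllability between high-level reasoning and low-level planning, with clear weaknesses in out-of-distribution driving scenarios.
Method: The paper proposes dVLM-AD, a discrete-diffusion-based vision-language model that uses bidirectional attention and iterative denoising to unify perception, structured reasoning, and low-level planning. Compared with autoregressive models, it offers better controllability and reliability and produces more consistent reasoning-action pairs.
Result: On the nuScenes and WOD-E2E benchmarks, dVLM-AD improves behavior-trajectory consistency by 9% over autoregressive baselines and raises RFS by 6% on long-tail WOD-E2E scenarios. Despite a modest backbone, its planning performance is comparable to existing driving VLM/VLA systems, and it produces more consistent reasoning-action pairs.
Conclusion: The study indicates that diffusion-based vision-language models offer a controllable and reliable pathway for scalable end-to-end autonomous driving. Bidirectional attention and iterative denoising improve the consistency between reasoning and planning, opening a new technical direction for long-tail, out-of-distribution driving scenarios.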
📄 Abstract
The autonomous driving community is increasingly focused on addressing the challenges posed by out-of-distribution (OOD) driving scenarios. A dominant research trend seeks to enhance end-to-end (E2E) driving systems by integrating vision-language models (VLMs), leveraging their rich world knowledge and reasoning abilities to improve generalization across diverse environments. However, most existing VLMs or vision-language agents (VLAs) are built upon autoregressive (AR) models. In this paper, we observe that existing AR-based VLMs -- limited by causal attention and sequential token generation -- often fail to maintain consistency and controllability between high-level reasoning and low-level planning. In contrast, recent discrete diffusion VLMs equipped with bidirectional attention exhibit superior controllability and reliability through iterative denoising. Building on these observations, we introduce dVLM-AD, a diffusion-based vision-language model that unifies perception, structured reasoning, and low-level planning for end-to-end driving. Evaluated on nuScenes and WOD-E2E, dVLM-AD yields more consistent reasoning-action pairs and achieves planning performance comparable to existing driving VLM/VLA systems despite a modest backbone, outperforming AR-based baselines with a 9 percent improvement in behavior-trajectory consistency and a 6 percent increase in RFS on long-tail WOD-E2E scenarios. These results suggest a controllable and reliable pathway for scalable end-to-end driving.
[18] E3AD: An Emotion-Aware Vision-Language-Action Model for Human-Centric End-to-End Autonomous Driving
Yihong Tang, Haicheng Liao, Tong Nie, Junlin He, Ao Qu, Kehua Chen, Wei Ma, Zhenning Li, Lijun Sun, Chengzhong Xu
🧩 TL;DR
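This paper proposes E3AD, an emotion-aware vision-language-action framework for end-to-end autonomous driving. By integrating continuous emotion modeling with dual-pathway spatial reasoning, it understands and responds to the passenger's emotional state, improving human alignment and comfort.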
📘 Detailed Summary
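Motivation: End-to-end autonomous driving systems widely adopt vision-language-action models but typically ignore the passenger's emotional state, which is central to comfort and to the acceptance of autonomous driving. To close this gap, the paper introduces Open-Domain End-to-End (OD-E2E) autonomous driving, in which the vehicle must interpret free-form natural-language commands, infer emotion, and plan a physically feasible trajectory.
Method: The paper proposes E3AD, an emotion-aware VLA framework with two cognitively inspired components: a continuous Valence-Arousal-Dominance (VAD) emotion model that captures tone and urgency from language, and a dual-pathway spatial reasoning module that fuses egocentric and allocentric views for human-like spatial cognition. A consistency-oriented training scheme, combining modality pretraining with preference-based alignment, enforces coherence between emotional intent and driving actions.
Result: Experiments on real-world datasets show that E3AD improves visual grounding and waypoint planning and achieves state-of-the-art VAD correlation for emotion estimation. By injecting emotional understanding, the framework yields grounding, planning, and human-centric feedback that better match human expectations.
Conclusion: Integrating emotional understanding into VLA-style autonomous driving produces more human-aligned driving behavior, with implications for the human-machine interaction design of future driving systems. Emotion awareness not only improves technical performance but also strengthens human-centric alignment, which supports broader acceptance of autonomous driving.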
📄 Abstract
End-to-end autonomous driving (AD) systems increasingly adopt vision-language-action (VLA) models, yet they typically ignore the passenger's emotional state, which is central to comfort and AD acceptance. We introduce Open-Domain End-to-End (OD-E2E) autonomous driving, where an autonomous vehicle (AV) must interpret free-form natural-language commands, infer the emotion, and plan a physically feasible trajectory. We propose E3AD, an emotion-aware VLA framework that augments semantic understanding with two cognitively inspired components: a continuous Valence-Arousal-Dominance (VAD) emotion model that captures tone and urgency from language, and a dual-pathway spatial reasoning module that fuses egocentric and allocentric views for human-like spatial cognition. A consistency-oriented training scheme, combining modality pretraining with preference-based alignment, further enforces coherence between emotional intent and driving actions. Across real-world datasets, E3AD improves visual grounding and waypoint planning and achieves state-of-the-art (SOTA) VAD correlation for emotion estimation. These results show that injecting emotion into VLA-style driving yields more human-aligned grounding, planning, and human-centric feedback.
[19] UniTS: Unified Time Series Generative Model for Remote Sensing
Yuxiang Zhang, Shunlin Liang, Wenyuan Li, Han Ma, Jianglei Xu, Yichuan Ma, Jiangwei Xie, Wei Li, Mengmeng Zhang, Ran Tao, Xiang-Gen Xia
🧩 TL;DR
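This paper proposes UniTS (Unified Time Series Generative Model), a general framework based on the flow matching generative paradigm that handles time series reconstruction, cloud removal, semantic change detection, and forecasting in one model, achieving unified spatiotemporal modeling by constructing a deterministic evolution path from noise to targets.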
📘 Detailed Summary
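Motivation: Existing satellite remote sensing methods typically require specialized models for each task and lack unified modeling of spatiotemporal features across multiple time series tasks, which limits their generality and efficiency in capturing the complex dynamics of the Earth environment.
Method: Based on the flow matching generative paradigm, UniTS constructs a deterministic evolution path from noise to targets. It uses a diffusion transformer with spatio-temporal blocks, an Adaptive Condition Injector (ACor) that strengthens conditional perception of multimodal inputs, and a Spatiotemporal-aware Modulator (STM) that improves the capture of complex spatiotemporal dependencies.
Result: Experiments show that UniTS exhibits exceptional generative and cognitive capabilities in both low-level and high-level time series tasks and significantly outperforms existing methods under severe cloud contamination, modality absence, and forecasting of phenological variations. The newly constructed TS-S12 and TS-S12CR datasets fill the gap in benchmark datasets for time series cloud removal and forecasting.
Conclusion: The study demonstrates the effectiveness and generality of a unified flow-matching generative framework across multiple time series tasks, offering a new paradigm for multi-task unified modeling in satellite remote sensing. The high-quality datasets should also support benchmark evaluation and algorithm development in the field.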
📄 Abstract
One of the primary objectives of satellite remote sensing is to capture the complex dynamics of the Earth environment, which encompasses tasks such as reconstructing continuous cloud-free time series images, detecting land cover changes, and forecasting future surface evolution. However, existing methods typically require specialized models tailored to different tasks, lacking unified modeling of spatiotemporal features across multiple time series tasks. In this paper, we propose a Unified Time Series Generative Model (UniTS), a general framework applicable to various time series tasks, including time series reconstruction, time series cloud removal, time series semantic change detection, and time series forecasting. Based on the flow matching generative paradigm, UniTS constructs a deterministic evolution path from noise to targets under the guidance of task-specific conditions, achieving unified modeling of spatiotemporal representations for multiple tasks. The UniTS architecture consists of a diffusion transformer with spatio-temporal blocks, where we design an Adaptive Condition Injector (ACor) to enhance the model's conditional perception of multimodal inputs, enabling high-quality controllable generation. Additionally, we design a Spatiotemporal-aware Modulator (STM) to improve the ability of spatio-temporal blocks to capture complex spatiotemporal dependencies. Furthermore, we construct two high-quality multimodal time series datasets, TS-S12 and TS-S12CR, filling the gap of benchmark datasets for time series cloud removal and forecasting tasks. Extensive experiments demonstrate that UniTS exhibits exceptional generative and cognitive capabilities in both low-level and high-level time series tasks. It significantly outperforms existing methods, particularly when facing challenges such as severe cloud contamination, modality absence, and forecasting phenological variations.
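The "deterministic evolution path from noise to targets" can be illustrated with the simplest flow matching setup. The sketch below assumes a straight-line path between a noise sample and a target and uses an oracle velocity in place of a trained network; the function names and the linear path are illustrative assumptions, not the UniTS implementation, and task-specific conditioning is omitted.

```python
def interpolate(noise, target, t):
    """Point on the straight path x_t = (1 - t) * noise + t * target."""
    return [(1.0 - t) * n + t * x for n, x in zip(noise, target)]


def target_velocity(noise, target):
    """Regression target for the velocity field on a straight path: target - noise."""
    return [x - n for n, x in zip(noise, target)]


def euler_integrate(start, velocity_fn, steps=10):
    """Deterministically follow the velocity field from t = 0 to t = 1."""
    state = list(start)
    dt = 1.0 / steps
    for i in range(steps):
        velocity = velocity_fn(state, i * dt)
        state = [s + dt * v for s, v in zip(state, velocity)]
    return state


# Toy two-dimensional example. A trained model would predict the velocity
# from (state, t, condition); here an oracle returns the exact target velocity.
noise = [0.5, -1.0]
target = [2.0, 3.0]
oracle = lambda state, t: target_velocity(noise, target)
generated = euler_integrate(noise, oracle, steps=10)
```

With the oracle velocity, `generated` ends up at `target` up to floating-point error, which is the sense in which the path is deterministic: no noise is injected during sampling.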
[20] Reflection Removal through Efficient Adaptation of Diffusion Transformers
Daniyar Zakarin, Thiemo Wandel, Anton Obukhov, Dengxin Dai
🧩 TL;DR
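This paper proposes a diffusion transformer (DiT) framework for single-image reflection removal that repurposes a pre-trained foundation diffusion model and combines it with physically based rendered synthetic data to achieve state-of-the-art reflection removal performance.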
📘 Detailed Summary
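Motivation: Current reflection removal methods usually rely on task-specific architectures and generalize poorly, and existing reflection removal datasets fall short in diversity, scalability, and photorealism, which limits progress and practical use.
Method: The method repurposes a pre-trained DiT foundation model, conditioning it on reflection-contaminated inputs and guiding it toward clean transmission layers. To address the data shortage, it builds a physically based rendering pipeline in Blender around the Principled BSDF to synthesize realistic glass materials and reflection effects. The foundation model is fine-tuned with efficient LoRA-based adaptation.
Result: The method achieves state-of-the-art performance on both in-domain and zero-shot benchmarks, showing that pre-trained diffusion transformers combined with physically grounded data synthesis and efficient adaptation provide a scalable, high-fidelity solution for reflection removal.
Conclusion: Pre-trained diffusion transformers, when paired with physically grounded data synthesis and efficient adaptation, provide a scalable and high-quality solution for reflection removal, suggesting a new paradigm for applying foundation models to image restoration.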
📄 Abstract
We introduce a diffusion-transformer (DiT) framework for single-image reflection removal that leverages the generalization strengths of foundation diffusion models in the restoration setting. Rather than relying on task-specific architectures, we repurpose a pre-trained DiT-based foundation model by conditioning it on reflection-contaminated inputs and guiding it toward clean transmission layers. We systematically analyze existing reflection removal data sources for diversity, scalability, and photorealism. To address the shortage of suitable data, we construct a physically based rendering (PBR) pipeline in Blender, built around the Principled BSDF, to synthesize realistic glass materials and reflection effects. Efficient LoRA-based adaptation of the foundation model, combined with the proposed synthetic data, achieves state-of-the-art performance on in-domain and zero-shot benchmarks. These results demonstrate that pretrained diffusion transformers, when paired with physically grounded data synthesis and efficient adaptation, offer a scalable and high-fidelity solution for reflection removal. Project page: https://hf.co/spaces/huawei-bayerlab/windowseat-reflection-removal-web
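The LoRA-based adaptation mentioned in the abstract keeps the pre-trained weight frozen and learns a low-rank update. The sketch below shows only that general idea for a single weight matrix in plain Python; the helper names, the toy shapes, and the `alpha / rank` scaling convention are assumptions for illustration, and the abstract does not say which layers are adapted or with what rank.

```python
def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_effective_weight(weight, lora_a, lora_b, alpha):
    """Return W + (alpha / r) * B @ A for frozen W (out x in), A (r x in), B (out x r)."""
    rank = len(lora_a)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [weight[i][j] + scale * delta[i][j] for j in range(len(weight[0]))]
        for i in range(len(weight))
    ]


# Toy 2x2 layer with a rank-1 adapter (hypothetical values).
frozen = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]        # r x in
lora_b = [[0.5], [0.0]]      # out x r
adapted = lora_effective_weight(frozen, lora_a, lora_b, alpha=2.0)

# Initializing B to zero leaves the pre-trained behavior untouched at the start.
zero_b = [[0.0], [0.0]]
unchanged = lora_effective_weight(frozen, lora_a, zero_b, alpha=2.0)
```

Only `lora_a` and `lora_b` would be trained, so the number of trainable values is `r * (in + out)` rather than `in * out`, which is what makes this kind of adaptation efficient for large models.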
[21] Not All Birds Look The Same: Identity-Preserving Generation For Birds
Aaron Sun, Oindrila Saha, Subhransu Maji
🧩 TL;DR
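This study addresses the limitations of existing identity-preserving image generation models on non-rigid, fine-grained categories such as birds. It introduces the NABirds Look-Alikes (NABLA) dataset as an evaluation benchmark and shows that training on images grouped by species, age, and sex substantially improves identity-preserving generation of birds.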
📘 Detailed Summary
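Motivation: Existing controllable image generation models work well for humans and rigid everyday objects but remain limited on non-rigid or fine-grained categories such as birds. These domains lack accessible, high-quality data, especially videos or multi-view observations, which makes them hard to evaluate and improve. Birds are a well-suited domain for studying the problem because of their high diversity, the fine-grained cues needed for identification, and their wide variety of poses.
Method: The study introduces the NABirds Look-Alikes (NABLA) dataset of 4,759 expert-curated image pairs. Together with 1,073 pairs collected from multi-image observations on iNaturalist and a small set of videos, it forms a benchmark for evaluating identity-preserving generation of birds. The study also proposes training on images grouped by species, age, and sex, using these attributes as a proxy for identity.
Result: Experiments show that state-of-the-art baselines fail to maintain bird identity on this dataset, whereas training on images grouped by species, age, and sex substantially improves performance on both seen and unseen species.
Conclusion: The study highlights the challenges that fine-grained, non-rigid categories pose for identity-preserving generation. NABLA provides a benchmark for evaluating such tasks, and the grouped training strategy shows that accessible attributes can serve as an effective proxy for identity, supporting a move beyond content creation toward applications that demand accuracy and fine detail.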
📄 Abstract
Since the advent of controllable image generation, increasingly rich modes of control have enabled greater customization and accessibility for everyday users. Zero-shot, identity-preserving models such as Insert Anything and OminiControl now support applications like virtual try-on without requiring additional fine-tuning. While these models may be fitting for humans and rigid everyday objects, they still have limitations for non-rigid or fine-grained categories. These domains often lack accessible, high-quality data -- especially videos or multi-view observations of the same subject -- making them difficult both to evaluate and to improve upon. Yet, such domains are essential for moving beyond content creation toward applications that demand accuracy and fine detail. Birds are an excellent domain for this task: they exhibit high diversity, require fine-grained cues for identification, and come in a wide variety of poses. We introduce the NABirds Look-Alikes (NABLA) dataset, consisting of 4,759 expert-curated image pairs. Together with 1,073 pairs collected from multi-image observations on iNaturalist and a small set of videos, this forms a benchmark for evaluating identity-preserving generation of birds. We show that state-of-the-art baselines fail to maintain identity on this dataset, and we demonstrate that training on images grouped by species, age, and sex -- used as a proxy for identity -- substantially improves performance on both seen and unseen species.
[22] Boundary-Aware Test-Time Adaptation for Zero-Shot Medical Image Segmentation
Chenlin Xu, Lei Zhang, Lituan Wang, Xinyu Pu, Pengfei Ma, Guangwu Qian, Zizhou Wang, Yan Wang
🧩 TL;DR
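This paper proposes BA-TTA-SAM, a framework that uses test-time adaptation to significantly enhance the zero-shot performance of SAM in medical image segmentation, achieving an average 12.4% improvement in DICE score without any source-domain training data.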
📘 Detailed Summary
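Motivation: Medical image segmentation faces scarce annotated data and high computational costs, and existing adaptation methods still require task-specific training on downstream tasks. Foundation models such as SAM generalize well but remain limited by domain shift on medical datasets, so an efficient zero-shot enhancement method is needed.
Method: The paper proposes BA-TTA-SAM, a task-agnostic test-time adaptation framework with two key mechanisms. Encoder-level Gaussian prompt injection embeds Gaussian-based prompts directly into the image encoder, providing explicit guidance for initial representation learning. Cross-layer boundary-aware attention alignment exploits hierarchical feature interactions within the ViT backbone, aligning deep semantic responses with shallow boundary cues.
Result: Experiments on four datasets (ISIC, Kvasir, BUSI, and REFUGE) show an average 12.4% improvement in DICE score over SAM's zero-shot segmentation performance. The method consistently outperforms state-of-the-art models in medical image segmentation without requiring any source-domain training data.
Conclusion: The framework significantly enhances SAM's generalization ability and uses test-time adaptation to mitigate domain shift in medical image segmentation. The study demonstrates the effectiveness of task-agnostic adaptation for zero-shot medical image segmentation and suggests a route to model adaptation without training data.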
📄 Abstract
Due to the scarcity of annotated data and the substantial computational costs of models, conventional tuning methods in medical image segmentation face critical challenges. Current approaches to adapting pretrained models, including full-parameter and parameter-efficient fine-tuning, still rely heavily on task-specific training on downstream tasks. Therefore, zero-shot segmentation has gained increasing attention, especially with foundation models such as SAM demonstrating promising generalization capabilities. However, SAM still faces notable limitations on medical datasets due to domain shifts, making efficient zero-shot enhancement an urgent research goal. To address these challenges, we propose BA-TTA-SAM, a task-agnostic test-time adaptation framework that significantly enhances the zero-shot segmentation performance of SAM. This framework integrates two key mechanisms: (1) The encoder-level Gaussian prompt injection embeds Gaussian-based prompts directly into the image encoder, providing explicit guidance for initial representation learning. (2) The cross-layer boundary-aware attention alignment exploits the hierarchical feature interactions within the ViT backbone, aligning deep semantic responses with shallow boundary cues. Experiments on four datasets, including ISIC, Kvasir, BUSI, and REFUGE, show an average improvement of 12.4% in the DICE score compared with SAM's zero-shot segmentation performance. The results demonstrate that our method consistently outperforms state-of-the-art models in medical image segmentation. Our framework significantly enhances the generalization ability of SAM, without requiring any source-domain training data. Extensive experiments on publicly available medical datasets strongly demonstrate the superiority of our framework. Our code is available at https://github.com/Emilychenlin/BA-TTA-SAM.
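For readers unfamiliar with the DICE score reported above, the following minimal sketch computes the standard Dice coefficient, 2|P ∩ T| / (|P| + |T|), for two binary masks. The function name and the convention of returning 1.0 when both masks are empty are assumptions made here; the paper's evaluation code may handle edge cases differently.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as nested lists of 0/1."""
    flat_pred = [p for row in pred for p in row]
    flat_target = [t for row in target for t in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("masks must contain the same number of pixels")
    intersection = sum(1 for p, t in zip(flat_pred, flat_target) if p and t)
    total = sum(1 for p in flat_pred if p) + sum(1 for t in flat_target if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


# Toy 2x3 masks: 3 predicted pixels, 3 ground-truth pixels, 2 shared.
pred_mask = [[1, 1, 0],
             [0, 1, 0]]
true_mask = [[1, 0, 0],
             [0, 1, 1]]
score = dice_score(pred_mask, true_mask)  # 2 * 2 / (3 + 3)
```

An "average improvement in DICE" is then the mean difference of this score between the adapted model and the zero-shot baseline across the evaluated images or datasets.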
[23] Identity Clue Refinement and Enhancement for Visible-Infrared Person Re-Identification
Guoqing Zhang, Zhun Wang, Hairui Wang, Zhonglin Ye, Yuhui Zheng
🧩 TL;DR
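This paper proposes a novel Identity Clue Refinement and Enhancement network that addresses modality discrepancy in visible-infrared person re-identification by mining and exploiting the implicit discriminative knowledge in modality-specific attributes, improving cross-modal matching performance.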
📘 Detailed Summary
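Motivation: Current visible-infrared person re-identification methods focus mainly on learning modality-invariant features. They often attend only to discriminative semantics shared across modalities and overlook the critical role of modality-specific identity-aware knowledge in discriminative feature learning, which limits performance.
Method: The paper proposes the Identity Clue Refinement and Enhancement (ICRE) network, which has three core components. The Multi-Perception Feature Refinement (MPFR) module aggregates shallow features from shared branches to capture easily overlooked modality-specific attributes. The Semantic Distillation Cascade Enhancement (SDCE) module distills identity-aware knowledge from the aggregated shallow features and guides the learning of modality-invariant features. The Identity Clues Guided (ICG) Loss alleviates modality discrepancies within the enhanced features and promotes the learning of a diverse representation space.
Result: Extensive experiments on multiple public datasets show that the proposed ICRE network clearly outperforms existing state-of-the-art methods on cross-modal matching, supporting the value of mining modality-specific identity-aware knowledge.
Conclusion: The study emphasizes the importance of modality-specific identity-aware knowledge in visible-infrared person re-identification. By refining and enhancing identity clues, the proposed framework alleviates modality discrepancy and offers a new perspective on cross-modal feature learning.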
📄 Abstract
Visible-Infrared Person Re-Identification (VI-ReID) is a challenging cross-modal matching task due to significant modality discrepancies. While current methods mainly focus on learning modality-invariant features through unified embedding spaces, they often focus solely on the common discriminative semantics across modalities while disregarding the critical role of modality-specific identity-aware knowledge in discriminative feature learning. To bridge this gap, we propose a novel Identity Clue Refinement and Enhancement (ICRE) network to mine and utilize the implicit discriminative knowledge inherent in modality-specific attributes. Initially, we design a Multi-Perception Feature Refinement (MPFR) module that aggregates shallow features from shared branches, aiming to capture modality-specific attributes that are easily overlooked. Then, we propose a Semantic Distillation Cascade Enhancement (SDCE) module, which distills identity-aware knowledge from the aggregated shallow features and guides the learning of modality-invariant features. Finally, an Identity Clues Guided (ICG) Loss is proposed to alleviate the modality discrepancies within the enhanced features and promote the learning of a diverse representation space. Extensive experiments across multiple public datasets clearly show that our proposed ICRE outperforms existing SOTA methods.
[24] X-Humanoid: Robotize Human Videos to Generate Humanoid Videos at Scale
Pei Yang, Hai Ci, Yiren Song, Mike Zheng Shou
🧩 TL;DR
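This paper proposes X-Humanoid, a generative video editing approach that adapts the powerful Wan 2.2 model into a video-to-video structure and fine-tunes it for human-to-humanoid translation, addressing the challenge of generating large-scale humanoid robot training data from third-person videos.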
📘 Detailed Summary
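Motivation: Progress in embodied AI is severely constrained by the scarcity of large-scale, diverse training data. Existing approaches mainly "overlay" robot arms onto egocentric videos and cannot handle the complex full-body motions and scene occlusions found in third-person videos, so they are unsuitable for converting human videos into robot training data.
Method: The method adapts the Wan 2.2 model into a video-to-video structure and fine-tunes it for the human-to-humanoid translation task. It also designs a scalable data creation pipeline that uses Unreal Engine to turn community assets into more than 17 hours of paired synthetic videos, and then applies the trained model to 60 hours of Ego-Exo4D videos.
Result: The study generates a large-scale dataset of over 3.6 million "robotized" humanoid video frames. Quantitative analysis and user studies confirm that the method outperforms existing baselines: 69% of users rated it best for motion consistency, and 62.1% rated it best for embodiment correctness.
Conclusion: X-Humanoid addresses the key challenge of generating humanoid robot training data from human videos and provides large-scale, diverse training resources for embodied AI. Its scalable data creation pipeline and the resulting dataset are expected to accelerate the development of Vision-Language-Action models and world models for humanoid robots.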
📄 Abstract
The advancement of embodied AI has unlocked significant potential for intelligent humanoid robots. However, progress in both Vision-Language-Action (VLA) models and world models is severely hampered by the scarcity of large-scale, diverse training data. A promising solution is to "robotize" web-scale human videos, which has been proven effective for policy training. However, these solutions mainly "overlay" robot arms onto egocentric videos, which cannot handle complex full-body motions and scene occlusions in third-person videos, making them unsuitable for robotizing humans. To bridge this gap, we introduce X-Humanoid, a generative video editing approach that adapts the powerful Wan 2.2 model into a video-to-video structure and finetunes it for the human-to-humanoid translation task. This finetuning requires paired human-humanoid videos, so we designed a scalable data creation pipeline, turning community assets into 17+ hours of paired synthetic videos using Unreal Engine. We then apply our trained model to 60 hours of the Ego-Exo4D videos, generating and releasing a new large-scale dataset of over 3.6 million "robotized" humanoid video frames. Quantitative analysis and user studies confirm our method's superiority over existing baselines: 69% of users rated it best for motion consistency, and 62.1% for embodiment correctness.
[25] VideoMem: Enhancing Ultra-Long Video Understanding via Adaptive Memory Management
Hongbo Jin, Qingyuan Wang, Wenhao Zhang, Yang Liu, Sijie Cheng
🧩 TL;DR
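This paper proposes VideoMem, a framework that models ultra-long video understanding as a sequential generation task and uses adaptive memory management together with a progressive reinforcement learning algorithm to significantly improve the performance of existing vision-language models on ultra-long video understanding.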
📘 Detailed Summary
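Motivation: Existing vision-language models suffer from limited context length and inefficient long-term memory retention on ultra-long video understanding, while approaches that build external knowledge bases with retrieval-augmented generation incur enormous storage and computational overhead. A more efficient solution is needed.
Method: VideoMem models ultra-long video understanding as a sequential generation task. Adaptive memory management dynamically updates a global memory buffer, retaining critical information and discarding redundant content. The framework also integrates the Progressive Grouped Relative Policy Optimization (PRPO) algorithm, which includes a Progressive State Propagation (PSP) module that adaptively retains valid states and propagates them to the next step, and a Temporal Cascading Reward (TCR) module that alleviates reward sparsity.
Result: Extensive experiments on multiple ultra-long video understanding benchmarks show that VideoMem significantly outperforms existing open-source models, with clear gains in sample utilization and convergence speed.
Conclusion: The study shows that modeling ultra-long video understanding as a sequential generation task, combined with adaptive memory management and progressive reinforcement learning, is effective. It provides an efficient and scalable solution for long video understanding that avoids the high overhead of retrieval-augmented generation approaches.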
📄 Abstract
Ultra-long video understanding remains an open challenge, as existing vision language models (VLMs) falter on such content due to limited context length and inefficient long-term memory retention. To address this, recent works have attempted to construct external knowledge bases and corresponding retrieval-augmented generation (RAG) systems, yet these incur enormous storage and computational overhead. In this paper, we propose VideoMem, a novel framework that models ultra-long video understanding as a sequential generation task via adaptive memory management. Specifically, VideoMem dynamically updates a global memory buffer, which adaptively retains critical information while discarding redundant content across the video timeline. To efficiently train VLMs for such long-term tasks, VideoMem integrates the Progressive Grouped Relative Policy Optimization (PRPO) algorithm, equipped with two core modules: Progressive State Propagation (PSP) adaptively retains valid current states, propagates them to the next rollout step, and gradually narrows the model exploration space. Temporal Cascading Reward (TCR) further alleviates reward sparsity, improving sample utilization and accelerating convergence. Extensive experiments demonstrate that VideoMem significantly outperforms existing open-source models across diverse benchmarks for ultra-long video understanding tasks.
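The abstract does not spell out how PRPO scores rollouts. Its name suggests it builds on group-relative policy optimization, where each rollout's reward is normalized against the other rollouts sampled for the same input. The sketch below shows only that generic normalization step, under that assumption; the function name, the use of the population standard deviation, and the `eps` term are illustrative choices, and the PSP and TCR modules are not modeled.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed convention
    return [(r - mean) / (std + eps) for r in rewards]


# Four rollouts for one video question: two rewarded, two not (toy values).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# When every rollout earns the same reward the group carries no learning
# signal, which is one way sparse rewards slow training down.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Rollouts that beat their group average receive positive advantages and the rest negative ones, so no separate value network is needed to form a baseline.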
[26] Counterfeit Answers: Adversarial Forgery against OCR-Free Document Visual Question Answering
Marco Pintore, Maura Pintor, Dimosthenis Karatzas, Battista Biggio
🧩 TL;DR
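This paper introduces a novel adversarial attack scenario against Document Visual Question Answering (DocVQA) systems, develops attack algorithms that produce forged documents that are visually imperceptible yet semantically targeted, and validates them on two state-of-the-art end-to-end models.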
📘 Detailed Summary
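Motivation: Although current DocVQA models show impressive capabilities, they remain vulnerable to adversarial attacks. This study examines the vulnerability of existing DocVQA systems to document forgeries that are visually imperceptible yet semantically targeted, and explores how an attacker can tamper with document content to induce incorrect answers.
Method: The study develops specialized attack algorithms that produce adversarially forged documents tailored to different attacker goals, ranging from targeted misinformation to systematic model failure. The algorithms modify document content in a visually imperceptible way while achieving semantically targeted effects.
Result: The attacks are validated on two state-of-the-art end-to-end models: Pix2Struct, a vision-language transformer that jointly processes image and text through sequence-to-sequence modeling, and Donut, a transformer-based model that directly extracts text and answers questions from document images. The experiments show that the proposed attacks can induce specific or generally incorrect answers from these models.
Conclusion: The study exposes critical security vulnerabilities in current DocVQA systems, showing that even state-of-the-art models are susceptible to carefully crafted adversarial attacks. The findings call for more robust defenses and offer insight into the robustness of DocVQA systems.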
📄 Abstract
Document Visual Question Answering (DocVQA) enables end-to-end reasoning grounded on information present in a document input. While recent models have shown impressive capabilities, they remain vulnerable to adversarial attacks. In this work, we introduce a novel attack scenario that aims to forge document content in a visually imperceptible yet semantically targeted manner, allowing an adversary to induce specific or generally incorrect answers from a DocVQA model. We develop specialized attack algorithms that can produce adversarially forged documents tailored to different attackers' goals, ranging from targeted misinformation to systematic model failure scenarios. We demonstrate the effectiveness of our approach against two end-to-end state-of-the-art models: Pix2Struct, a vision-language transformer that jointly processes image and text through sequence-to-sequence modeling, and Donut, a transformer-based model that directly extracts text and answers questions from document images. Our findings highlight critical vulnerabilities in current DocVQA systems and call for the development of more robust defenses.
[27] COOPER: A Unified Model for Cooperative Perception and Reasoning in Spatial Intelligence
Zefeng Zhang, Xiangzhao Hao, Hengzhu Tang, Zhenyu Zhang, Jiawei Sheng, Xiaodong Li, Zhenyang Li, Li Gao, Daiting Shi, Dawei Yin, Tingwen Liu
🧩 TL;DR
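This paper proposes COOPER, a unified multimodal large language model that uses depth and segmentation as auxiliary modalities together with adaptive interleaved reasoning to improve visual spatial reasoning while maintaining general performance.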
📘 Detailed Summary
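Motivation: Current multimodal large language models struggle with 3D-aware reasoning. Existing approaches typically enhance either perception (through auxiliary modalities such as depth and segmentation) or reasoning (through training on spatial VQA datasets and reinforcement learning) in isolation, and lack a unified framework that improves spatial perception and reasoning together.
Method: COOPER is a unified multimodal large language model that uses depth and segmentation as auxiliary modalities and is trained in two stages: the first stage learns auxiliary modality generation, and the second acquires adaptive interleaved reasoning, so that perception and reasoning reinforce each other.
Result: COOPER achieves an average 6.91% improvement on spatial reasoning tasks while maintaining general performance. A variant trained only for auxiliary modality generation attains a 7.92% gain on distance and size estimation, suggesting that learning to generate auxiliary modalities helps internalize spatial knowledge.
Conclusion: With a unified architecture and two-stage training, a multimodal large language model can develop an intrinsic ability to enhance spatial perception and, through adaptive interleaved reasoning, achieve stronger spatial intelligence, pointing to a new direction for 3D-aware reasoning.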
📄 Abstract
Visual Spatial Reasoning is crucial for enabling Multimodal Large Language Models (MLLMs) to understand object properties and spatial relationships, yet current models still struggle with 3D-aware reasoning. Existing approaches typically enhance either perception, by augmenting RGB inputs with auxiliary modalities such as depth and segmentation, or reasoning, by training on spatial VQA datasets and applying reinforcement learning, and thus treat these two aspects in isolation. In this work, we investigate whether a unified MLLM can develop an intrinsic ability to enhance spatial perception and, through adaptive interleaved reasoning, achieve stronger spatial intelligence. We propose COOPER, a unified MLLM that leverages depth and segmentation as auxiliary modalities and is trained in two stages to acquire auxiliary modality generation and adaptive, interleaved reasoning capabilities. COOPER achieves an average 6.91% improvement in spatial reasoning while maintaining general performance. Moreover, even a variant trained only for auxiliary modality generation attains a 7.92% gain on distance and size estimation, suggesting that learning to generate auxiliary modalities helps internalize spatial knowledge and strengthen spatial understanding.
[28] TARDis: Time Attenuated Representation Disentanglement for Incomplete Multi-Modal Tumor Segmentation and Classification
Zishuo Wan, Qinqin Kang, Yi Huang, Yun Bian, Dawei Ding, Ke Yan
🧩 TL;DR
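This paper proposes Time Attenuated Representation Disentanglement (TARDis), a physics-aware framework that redefines missing modalities as missing sample points on a continuous Time-Attenuation Curve, addressing missing phases in multi-phase CT imaging and improving the robustness of tumor segmentation and diagnosis.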
📘 Detailed Summary
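Motivation: In contrast-enhanced CT, tumor segmentation and diagnosis rely heavily on the physiological dynamics of contrast agents, but a complete multi-phase series is often unavailable in clinical practice because of radiation concerns or scanning limitations, leading to the missing modality problem. Existing deep learning methods typically treat missing phases as absent independent channels and ignore the inherent temporal continuity of hemodynamics, which limits diagnostic performance when data are sparse.
Method: The paper proposes the TARDis framework, which redefines missing modalities as missing sample points on a continuous Time-Attenuation Curve. A dual-path architecture explicitly disentangles the latent feature space: a quantization-based path uses a learnable embedding dictionary to extract consistent anatomical structures (the time-invariant static component), and a probabilistic path uses a Conditional Variational Autoencoder to model dynamic enhancement conditioned on the estimated scan time (the time-dependent dynamic component). This design lets the network generate missing hemodynamic features by sampling from the learned latent distribution.
Result: Extensive experiments on a large-scale private abdominal CT dataset (2,282 cases) and two public datasets show that TARDis significantly outperforms state-of-the-art incomplete-modality frameworks. Notably, the method maintains robust diagnostic performance even in extreme data-sparsity scenarios, highlighting its potential for reducing radiation exposure while maintaining diagnostic precision.
Conclusion: By building physical prior knowledge into a deep learning architecture, TARDis addresses missing modalities in multi-phase CT imaging. It improves performance on incomplete data and offers a feasible route to fewer clinical scans and lower radiation dose. Future work could extend the framework to other time-series medical imaging modalities.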
📄 Abstract
Tumor segmentation and diagnosis in contrast-enhanced Computed Tomography (CT) rely heavily on the physiological dynamics of contrast agents. However, obtaining a complete multi-phase series is often clinically unfeasible due to radiation concerns or scanning limitations, leading to the "missing modality" problem. Existing deep learning approaches typically treat missing phases as absent independent channels, ignoring the inherent temporal continuity of hemodynamics. In this work, we propose Time Attenuated Representation Disentanglement (TARDis), a novel physics-aware framework that redefines missing modalities as missing sample points on a continuous Time-Attenuation Curve. TARDis explicitly disentangles the latent feature space into a time-invariant static component (anatomy) and a time-dependent dynamic component (perfusion). We achieve this via a dual-path architecture: a quantization-based path using a learnable embedding dictionary to extract consistent anatomical structures, and a probabilistic path using a Conditional Variational Autoencoder to model dynamic enhancement conditioned on the estimated scan time. This design allows the network to hallucinate missing hemodynamic features by sampling from the learned latent distribution. Extensive experiments on a large-scale private abdominal CT dataset (2,282 cases) and two public datasets demonstrate that TARDis significantly outperforms state-of-the-art incomplete modality frameworks. Notably, our method maintains robust diagnostic performance even in extreme data-sparsity scenarios, highlighting its potential for reducing radiation exposure while maintaining diagnostic precision.
[29] SAM3-I: Segment Anything with Instructions
Jingjing Li, Yue Feng, Yuchen Guo, Jincai Huang, Yongri Piao, Qi Bi, Miao Zhang, Xiaoqi Zhao, Qiang Chen, Shihao Zou, Wei Ji, Huchuan Lu, Li Cheng
🧩 TL;DR
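This paper proposes SAM3-I, an enhanced framework that unifies concept-level understanding and instruction-level reasoning within the SAM family. An instruction-aware cascaded adaptation mechanism lets SAM3 follow natural-language instructions directly for segmentation while preserving its original concept-driven capabilities.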
📘 Detailed Summary
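Motivation: Although SAM3 advances open-vocabulary segmentation through promptable concept segmentation, real-world use requires richer expressions covering attributes, spatial relations, functionalities, actions, states, and even implicit reasoning over instances. SAM3 currently relies on external multimodal agents to convert complex instructions into noun phrases and then perform iterative mask filtering, but these noun-phrase-level concepts are too coarse and often fail to represent a specific instance precisely.
Method: SAM3-I introduces an instruction-aware cascaded adaptation mechanism that progressively aligns expressive instruction semantics with SAM3's existing vision-language representations, enabling direct instruction-following segmentation without sacrificing the original concept-driven capabilities. It also defines a structured instruction taxonomy spanning concept, simple, and complex levels, and develops a scalable data engine to construct a dataset of diverse instruction-mask pairs.
Result: Experiments show that SAM3-I delivers appealing performance, demonstrating that SAM3 can be effectively extended to follow natural-language instructions while preserving its strong concept grounding. The framework is open-sourced with practical fine-tuning workflows so that researchers can adapt it to domain-specific applications.
Conclusion: SAM3-I unifies concept-level understanding and instruction-level reasoning within the SAM framework and achieves direct instruction-following segmentation through its adaptation mechanism. The work shows that vision-language segmentation models can be extended to more complex, expressive instructions, giving open-vocabulary segmentation a more flexible basis for practical use.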
📄 Abstract
Segment Anything Model 3 (SAM3) has advanced open-vocabulary segmentation through promptable concept segmentation, allowing users to segment all instances corresponding to a given concept, typically specified with short noun-phrase (NP) prompts. While this marks the first integration of language-level concepts within the SAM family, real-world usage typically requires far richer expressions that include attributes, spatial relations, functionalities, actions, states, and even implicit reasoning over instances. Currently, SAM3 relies on external multi-modal agents to convert complex instructions into NPs and then conduct iterative mask filtering. However, these NP-level concepts remain overly coarse, often failing to precisely represent a specific instance. In this work, we present SAM3-I, an enhanced framework that unifies concept-level understanding and instruction-level reasoning within the SAM family. SAM3-I introduces an instruction-aware cascaded adaptation mechanism that progressively aligns expressive instruction semantics with SAM3's existing vision-language representations, enabling direct instruction-following segmentation without sacrificing its original concept-driven capabilities. Furthermore, we design a structured instruction taxonomy spanning concept, simple, and complex levels, and develop a scalable data engine to construct a dataset with diverse instruction-mask pairs. Experiments show that SAM3-I delivers appealing performance, demonstrating that SAM3 can be effectively extended to follow natural-language instructions while preserving its strong concept grounding. We open-source SAM3-I and provide practical fine-tuning workflows, enabling researchers to adapt it to domain-specific applications. The source code is available here.
[30] Malicious Image Analysis via Vision-Language Segmentation Fusion: Detection, Element, and Location in One-shot
Sheng Hang, Chaoxiang He, Hongsheng Hu, Hanqing Hu, Bin Benjamin Zhu, Shi-Feng Sun, Dawu Gu, Shuo Wang
🧩 TL;DR
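This paper proposes a zero-shot pipeline for detecting and localizing malicious image content that simultaneously detects harmful content, identifies the critical elements involved, and localizes them at the pixel level, improving the accuracy and robustness of fine-grained content moderation.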
📘 Detailed Summary
Motivation: 现有图像级NSFW标记无法满足内容审核需求,审核人员需要知道图像中哪些具体对象构成非法内容及其精确位置,因此需要开发能够同时检测有害内容、识别关键元素并进行像素级定位的细粒度审核工具。
Method: 该方法采用零样本管道,首先使用基础分割模型生成候选对象掩码并细化为独立区域,然后通过视觉语言模型使用开放词汇提示对每个区域进行恶意相关性评分,最后通过加权融合步骤生成统一的恶意对象地图,并采用多分割器集成策略增强对抗攻击鲁棒性。
Result: 在包含790张图像的新标注数据集上,该方法在毒品、色情、暴力和极端主义内容上取得了85.8%的元素级召回率、78.1%的精确率和92.1%的分割成功率,比直接零样本VLM定位方法召回率提高了27.4%,在对抗PGD扰动攻击下精度和召回率下降不超过10%,表现出高鲁棒性。
Conclusion: 该研究提出了首个实用的细粒度、可解释恶意图像审核工具,能够在秒级处理图像并无缝集成到现有VLM工作流中,为内容审核提供了同时具备高准确性、强鲁棒性和像素级定位能力的解决方案。
📄 Abstract
Detecting illicit visual content demands more than image-level NSFW flags; moderators must also know what objects make an image illegal and where those objects occur. We introduce a zero-shot pipeline that simultaneously (i) detects if an image contains harmful content, (ii) identifies each critical element involved, and (iii) localizes those elements with pixel-accurate masks, all in one pass. The system first applies a foundation segmentation model (SAM) to generate candidate object masks and refines them into larger independent regions. Each region is scored for malicious relevance by a vision-language model using open-vocabulary prompts; these scores weight a fusion step that produces a consolidated malicious object map. An ensemble across multiple segmenters hardens the pipeline against adaptive attacks that target any single segmentation method. Evaluated on a newly annotated 790-image dataset spanning drug, sexual, violent, and extremist content, our method attains 85.8% element-level recall, 78.1% precision, and a 92.1% segment-success rate, exceeding direct zero-shot VLM localization by 27.4% recall at comparable precision. Against PGD adversarial perturbations crafted to break SAM and the VLM, our method's precision and recall decrease by no more than 10%, demonstrating high robustness against attacks. The full pipeline processes an image in seconds, plugs seamlessly into existing VLM workflows, and constitutes the first practical tool for fine-grained, explainable malicious-image moderation.
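The score-weighted fusion step described in this abstract can be pictured with a small sketch. Everything specific here is an assumption made for illustration, not the authors' implementation: masks are binary grids, each region carries one relevance score in [0, 1], fusion takes the per-pixel maximum of the score-weighted masks, and the 0.5 threshold and the name `fuse_region_scores` are hypothetical.

```python
def fuse_region_scores(masks, scores, threshold=0.5):
    """Fuse binary region masks into one map, weighting each mask by its score.

    masks: list of H x W nested lists holding 0/1 (assumed layout).
    scores: one relevance score per mask, as a VLM might assign.
    Returns (heat, binary): heat[y][x] is the highest score of any region
    covering that pixel; binary marks pixels whose heat reaches the threshold.
    """
    if not masks or len(masks) != len(scores):
        raise ValueError("need at least one mask and one score per mask")
    height, width = len(masks[0]), len(masks[0][0])
    heat = [[0.0] * width for _ in range(height)]
    for mask, score in zip(masks, scores):
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    heat[y][x] = max(heat[y][x], score)
    binary = [[1 if value >= threshold else 0 for value in row] for row in heat]
    return heat, binary
```

A per-pixel maximum is only one possible fusion rule; a weighted average over overlapping regions would be an equally plausible reading of "these scores weight a fusion step".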
[31] Denoise to Track: Harnessing Video Diffusion Priors for Robust Correspondence
Tianyu Yuan, Yuanbo Yang, Lin-Zhuo Chen, Yao Yao, Zhuzhong Qian
🧩 TL;DR
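This paper proposes HeFT (Head-Frequency Tracker), a zero-shot point tracking framework built on the visual priors of pretrained video diffusion models. An analysis of VDiT internal representations reveals functional specialization among attention heads, which motivates a head- and frequency-aware feature selection strategy that achieves state-of-the-art zero-shot tracking performance on the TAP-Vid benchmarks.
📘 Detailed Summary
Motivation: The study explores how pretrained video diffusion models encode spatiotemporal information and uses their visual priors for zero-shot point tracking, aiming to narrow the performance gap between unsupervised and supervised methods while avoiding any reliance on annotated training data.
Method: The method first analyzes the internal representations of the Video Diffusion Transformer (VDiT) and finds that attention heads act as minimal functional units with distinct specializations for matching, semantic understanding, and positional encoding. It then proposes a head- and frequency-aware feature selection strategy that jointly selects the most informative attention head and the low-frequency components. Discriminative features are extracted through single-step denoising, and correspondences are estimated with soft-argmax localization and forward-backward consistency checks.
Result: Extensive experiments on the TAP-Vid benchmarks show that HeFT achieves state-of-the-art zero-shot tracking performance, approaching the accuracy of supervised methods without any annotated training data. The results support the observation that low-frequency components are crucial for establishing correspondences while high-frequency components tend to introduce noise.
Conclusion: The study reveals the functional specialization of attention heads and the different roles of frequency components in video diffusion models. It shows the promise of video diffusion models as foundation models for a range of downstream tasks, points toward unified visual foundation models, and demonstrates that zero-shot methods can approach supervised performance in point tracking.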
📄 Abstract
In this work, we introduce HeFT (Head-Frequency Tracker), a zero-shot point tracking framework that leverages the visual priors of pretrained video diffusion models. To better understand how they encode spatiotemporal information, we analyze the internal representations of Video Diffusion Transformer (VDiT). Our analysis reveals that attention heads act as minimal functional units with distinct specializations for matching, semantic understanding, and positional encoding. Additionally, we find that the low-frequency components in VDiT features are crucial for establishing correspondences, whereas the high-frequency components tend to introduce noise. Building on these insights, we propose a head- and frequency-aware feature selection strategy that jointly selects the most informative attention head and low-frequency components to enhance tracking performance. Specifically, our method extracts discriminative features through single-step denoising, applies feature selection, and employs soft-argmax localization with forward-backward consistency checks for correspondence estimation. Extensive experiments on TAP-Vid benchmarks demonstrate that HeFT achieves state-of-the-art zero-shot tracking performance, approaching the accuracy of supervised methods while eliminating the need for annotated training data. Our work further underscores the promise of video diffusion models as powerful foundation models for a wide range of downstream tasks, paving the way toward unified visual foundation models.
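Soft-argmax localization and the forward-backward consistency check named in this abstract are standard building blocks, and a minimal sketch may help. The temperature, the one-pixel error tolerance, and the function names are assumptions for illustration; the abstract does not give HeFT's actual settings.

```python
import math


def soft_argmax(similarity, temperature=0.1):
    """Return the softmax-weighted mean (y, x) position of a 2D similarity map."""
    width = len(similarity[0])
    scaled = [value / temperature for row in similarity for value in row]
    peak = max(scaled)  # subtract the peak for numerical stability
    weights = [math.exp(value - peak) for value in scaled]
    total = sum(weights)
    y = sum(w * (i // width) for i, w in enumerate(weights)) / total
    x = sum(w * (i % width) for i, w in enumerate(weights)) / total
    return y, x


def is_cycle_consistent(start, tracked_back, max_error=1.0):
    """Forward-backward check: tracking A -> B -> A should land near the start."""
    return math.dist(start, tracked_back) <= max_error
```

Unlike a hard argmax, the soft version yields sub-pixel coordinates; a lower temperature pushes the estimate toward the single best-matching location. Points that fail the cycle check would typically be treated as unreliable or occluded.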
[32] I2I-Bench: A Comprehensive Benchmark Suite for Image-to-Image Editing Models
Juntong Wang, Jiarui Wang, Huiyu Duan, Jiaxiang Kang, Guangtao Zhai, Xiongkuo Min
🧩 TL;DR
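This study proposes I2I-Bench, a comprehensive benchmark for image-to-image editing models that covers 10 task categories and 30 fine-grained evaluation dimensions, uses automated hybrid evaluation, and validates its reliability by checking consistency with human preferences.
📘 Detailed Summary
Motivation: Existing image editing benchmarks suffer from limited task scope, insufficient evaluation dimensions, and heavy reliance on manual annotation, which constrains their scalability and practical value. A more comprehensive, automated evaluation framework is needed to support progress in image editing models.
Method: I2I-Bench rests on three core designs: a diverse task set spanning 10 task categories; a comprehensive evaluation scheme with 30 decoupled, fine-grained dimensions; and automated hybrid evaluation that combines specialized tools with large multimodal models. Rigorous alignment validation checks that the benchmark's results agree with human preferences.
Result: The authors benchmark numerous mainstream image editing models with I2I-Bench, exposing the gaps and trade-offs between models across evaluation dimensions and confirming that the automated evaluation is consistent with human preferences.
Conclusion: I2I-Bench offers the image editing field a scalable, automated, comprehensive evaluation framework. The authors plan to open-source all components to support future research, and the measured trade-offs across dimensions provide a reference for model selection and optimization.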
📄 Abstract
Image editing models are advancing rapidly, yet comprehensive evaluation remains a significant challenge. Existing image editing benchmarks generally suffer from limited task scopes, insufficient evaluation dimensions, and heavy reliance on manual annotations, which significantly constrain their scalability and practical applicability. To address this, we propose I2I-Bench, a comprehensive benchmark for image-to-image editing models, which features (i) diverse tasks, encompassing 10 task categories across both single-image and multi-image editing tasks, (ii) comprehensive evaluation dimensions, including 30 decoupled and fine-grained evaluation dimensions with automated hybrid evaluation methods that integrate specialized tools and large multimodal models (LMMs), and (iii) rigorous alignment validation, verifying the consistency between our benchmark evaluations and human preferences. Using I2I-Bench, we benchmark numerous mainstream image editing models, investigating the gaps and trade-offs between editing models across various dimensions. We will open-source all components of I2I-Bench to facilitate future research.
[33] Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation
Yunhong Lu, Yanhong Zeng, Haobo Li, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Jiapeng Zhu, Hengyuan Cao, Zhipeng Zhang, Xing Zhu, Yujun Shen, Min Zhang
🧩 TL;DR
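This paper proposes Reward Forcing, a framework that uses an EMA-Sink mechanism and Rewarded Distribution Matching Distillation to address over-reliance on initial frames and weak motion dynamics in streaming video generation, enabling high-quality real-time video generation.
📘 Detailed Summary
Motivation: Existing streaming video generation methods use sliding window attention with the initial frames as sink tokens to reduce error accumulation. This makes later frames overly dependent on the static initial frames, which leads to copied initial frames and diminished motion dynamics.
Method: Reward Forcing has two core designs. EMA-Sink fuses tokens evicted from the sliding window into the sink tokens via an exponential moving average, capturing both long-term context and recent dynamics at no additional computation cost. Rewarded Distribution Matching Distillation uses a vision-language model to rate the dynamics of samples and biases the output distribution through rewards so that high-dynamics content is prioritized.
Result: Experiments show that Reward Forcing achieves state-of-the-art performance on standard benchmarks and generates high-quality streaming video at 23.1 FPS on a single H100 GPU, improving motion quality while preserving data fidelity.
Conclusion: The study shows that a dynamic token update mechanism combined with reward-based distribution matching distillation can address initial-frame dependence in streaming video generation. It provides an efficient approach to real-time interactive simulation of dynamic worlds and demonstrates that motion dynamics can be strengthened while long-horizon consistency is maintained.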
📄 Abstract
Efficient streaming video generation is critical for simulating interactive and dynamic worlds. Existing methods distill few-step video diffusion models with sliding window attention, using initial frames as sink tokens to maintain attention performance and reduce error accumulation. However, video frames become overly dependent on these static tokens, resulting in copied initial frames and diminished motion dynamics. To address this, we introduce Reward Forcing, a novel framework with two key designs. First, we propose EMA-Sink, which maintains fixed-size tokens initialized from initial frames and continuously updated by fusing evicted tokens via exponential moving average as they exit the sliding window. Without additional computation cost, EMA-Sink tokens capture both long-term context and recent dynamics, preventing initial frame copying while maintaining long-horizon consistency. Second, to better distill motion dynamics from teacher models, we propose a novel Rewarded Distribution Matching Distillation (Re-DMD). Vanilla distribution matching treats every training sample equally, limiting the model's ability to prioritize dynamic content. Instead, Re-DMD biases the model's output distribution toward high-reward regions by prioritizing samples with greater dynamics rated by a vision-language model. Re-DMD significantly enhances motion quality while preserving data fidelity. We include both quantitative and qualitative experiments to show that Reward Forcing achieves state-of-the-art performance on standard benchmarks while enabling high-quality streaming video generation at 23.1 FPS on a single H100 GPU.
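The EMA-Sink idea described above (a fixed-size sink that starts from the initial frame and absorbs each token as it leaves the sliding window) can be sketched in a few lines. This is a toy illustration under stated assumptions: one token per frame, tokens as plain float lists, and a decay of 0.9; the names `ema_sink_update` and `stream` are hypothetical, and none of this reflects the authors' actual code.

```python
def ema_sink_update(sink, evicted, decay=0.9):
    """Blend an evicted token into the sink: decay * sink + (1 - decay) * evicted."""
    return [decay * s + (1.0 - decay) * e for s, e in zip(sink, evicted)]


def stream(frames, window_size, decay=0.9):
    """Slide a window over per-frame tokens, folding each evicted token into the sink.

    Returns the final sink token and the tokens still inside the window.
    """
    sink = list(frames[0])  # sink is initialised from the initial frame
    window = []
    for token in frames:
        window.append(token)
        if len(window) > window_size:
            sink = ema_sink_update(sink, window.pop(0), decay)
    return sink, window
```

Because the sink keeps a fixed size, attention cost does not grow with video length, while the moving average lets recent motion gradually displace the static initial frame.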
[34] Towards Cross-View Point Correspondence in Vision-Language Models
Yipu Wang, Yuheng Ji, Yuyang Liu, Enshen Zhou, Ziqiang Yang, Yuxuan Tian, Ziheng Qin, Yue Liu, Huajie Tan, Cheng Chi, Zhiyuan Ma, Daniel Dajun Zeng, Xiaolong Zheng
🧩 TL;DR
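This paper proposes the Cross-View Point Correspondence (CVPC) task and the CrossPoint-Bench benchmark, exposes a serious weakness of current vision-language models in fine-grained point-level correspondence, and builds the CrossPoint-378K dataset and the CroPond model, which substantially improve cross-view correspondence performance.
📘 Detailed Summary
Motivation: Current vision-language models fall well short on cross-view correspondence, especially precise point-level correspondence, which is crucial for precise affordance interaction. Existing models struggle to move from coarse-grained judgment to fine-grained coordinate prediction and lag far behind human performance.
Method: The study proposes the CVPC task and CrossPoint-Bench, a hierarchically designed benchmark modeled on the human cognitive process of "perceive", "reason", and "correspond". To address the problem, it constructs CrossPoint-378K, a dataset of 378K question-answering pairs focused on affordance regions that reflect real-world manipulation scenarios, and proposes CroPond, a model trained on this dataset.
Result: Evaluation shows that state-of-the-art models such as Gemini-2.5-Pro trail humans by more than 54.65% in overall accuracy. CroPond achieves state-of-the-art performance on CrossPoint-Bench, surpassing Gemini-2.5-Pro by 39.7% in accuracy and narrowing the gap to human performance.
Conclusion: The study exposes a fundamental limitation of current vision-language models in fine-grained spatial correspondence and provides a benchmark for spatial understanding and embodied AI. The dataset and model lay a foundation for future work on cross-view correspondence and underline the importance of moving from coarse-grained to fine-grained spatial reasoning.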
📄 Abstract
Cross-view correspondence is a fundamental capability for spatial understanding and embodied AI. However, it is still far from being realized in Vision-Language Models (VLMs), especially in achieving precise point-level correspondence, which is crucial for precise affordance interaction. We therefore propose the Cross-View Point Correspondence (CVPC) task and CrossPoint-Bench, a comprehensive benchmark with a hierarchical design inspired by the human cognitive process of "perceive", "reason", and "correspond". Our evaluation shows that state-of-the-art models (e.g., Gemini-2.5-Pro) still fall far behind humans, with a gap of over 54.65% in overall accuracy, exposing a challenge in transitioning from coarse-grained judgement to fine-grained coordinate prediction. To address this problem, we construct CrossPoint-378K, a dataset with 378K question-answering pairs across 900 scenes, focused on actionable affordance regions that better reflect real-world manipulation and interaction scenarios. Furthermore, we propose CroPond, a model trained on the CrossPoint-378K dataset. CroPond achieves state-of-the-art performance on CrossPoint-Bench, surpassing Gemini-2.5-Pro by 39.7% accuracy, which offers a foundation for advancing future work on cross-view correspondence. The benchmark, dataset, and model are publicly available at https://github.com/WangYipu2002/CrossPoint.
[35] EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture
Xin He, Longhui Wei, Jianbo Ouyang, Lingxi Xie, Qi Tian
🧩 TL;DR
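This paper proposes EMMA, an efficient unified architecture for multimodal understanding, generation, and editing. Its compression strategy and network design improve both performance and efficiency on multimodal tasks, and at the 4B-parameter scale it outperforms existing unified multimodal approaches.
📘 Detailed Summary
Motivation: Current unified multimodal architectures are inefficient and underperform when handling both understanding and generation, with particular limitations in visual token compression and task-specific modeling. A unified solution is needed that processes multimodal inputs efficiently while balancing the needs of different tasks.
Method: EMMA has four key components: an efficient autoencoder with a 32x compression ratio that reduces the number of tokens required for generation; channel-wise rather than token-wise concatenation, which further reduces visual tokens; a shared-and-decoupled network that enables mutual improvement across tasks alongside task-specific modeling; and a mixture-of-experts mechanism in the visual understanding encoder that improves perceptual capability with only a small increase in parameters.
Result: Experiments show that the 4B-parameter EMMA outperforms existing unified multimodal approaches such as BAGEL-7B in both efficiency and performance, and achieves competitive results against recent specialist models for multimodal understanding and generation such as Qwen3-VL and Qwen-Image.
Conclusion: EMMA lays a foundation for future unified multimodal architectures. Its compression strategy and flexible network design offer a way to balance efficiency and performance in multimodal tasks and show that a unified architecture can remain competitive while being parameter-efficient.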
📄 Abstract
We propose EMMA, an efficient and unified architecture for multimodal understanding, generation, and editing. Specifically, EMMA primarily consists of 1) an efficient autoencoder with a 32x compression ratio, which significantly reduces the number of tokens required for generation and also ensures training balance between understanding and generation tasks by applying the same compression ratio to images; 2) channel-wise concatenation instead of token-wise concatenation among visual understanding and generation tokens, which further reduces the visual tokens in unified architectures; 3) a shared-and-decoupled network that enables mutual improvements across tasks while meeting the task-specific modeling requirements; and 4) a mixture-of-experts mechanism adopted in the visual understanding encoder, which substantially improves perceptual capabilities with only a small increase in parameters. Extensive experiments show that EMMA-4B significantly outperforms state-of-the-art unified multimodal approaches (e.g., BAGEL-7B) in both efficiency and performance, while also achieving competitive results compared to recent multimodal understanding and generation experts (e.g., Qwen3-VL and Qwen-Image). We believe that EMMA lays a solid foundation for the future development of unified multimodal architectures.
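The difference between token-wise and channel-wise concatenation is easiest to see in terms of shapes. The sketch below is a toy illustration rather than EMMA's code: tokens are plain lists of numbers, the two streams are assumed to be aligned one-to-one, and the function names are hypothetical.

```python
def tokenwise_concat(a, b):
    """Place two token sequences end to end: N + M tokens, each of width C."""
    return a + b


def channelwise_concat(a, b):
    """Join aligned tokens feature-wise: N tokens, each of width C_a + C_b."""
    if len(a) != len(b):
        raise ValueError("channel-wise concatenation needs equal token counts")
    return [token_a + token_b for token_a, token_b in zip(a, b)]


understanding = [[1, 2], [3, 4]]  # 2 tokens, 2 channels each
generation = [[5, 6], [7, 8]]     # 2 tokens, 2 channels each
print(len(tokenwise_concat(understanding, generation)))    # sequence length 4
print(len(channelwise_concat(understanding, generation)))  # sequence length 2
```

Since self-attention cost grows with sequence length, halving the number of visual tokens in this way is where the efficiency gain would come from; the price is a wider per-token feature vector.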
[36] Autoregressive Image Generation Needs Only a Few Lines of Cached Tokens
Ziran Qin, Youru Lv, Mingbao Lin, Zeren Zhang, Chanfan Gan, Tieyuan Chen, Weiyao Lin
🧩 TL;DR
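This paper proposes LineAR, a training-free method for autoregressive visual generation based on progressive key-value (KV) cache compression. By exploiting intrinsic properties of visual attention, it reduces memory use and raises throughput while maintaining generation quality.
📘 Detailed Summary
Motivation: Existing autoregressive visual generation methods must cache all previously generated visual tokens during decoding, which creates a severe memory bottleneck in the form of high storage requirements and low throughput and limits the practical efficiency of autoregressive image generation.
Method: LineAR is a training-free, progressive KV cache compression pipeline. Exploiting intrinsic properties of visual attention, it manages the cache at the line level using a 2D view: it preserves the visual dependency regions and, guided by inter-line attention, progressively evicts less-informative tokens that are harmless for generating subsequent lines, so that generation needs only a few lines of cache.
Result: Extensive experiments on six autoregressive image generation models validate the approach. On LlamaGen-XL and Janus-Pro-1B, ImageNet FID improves from 2.77 to 2.68 and COCO FID from 23.85 to 22.86 while only 1/6 of the KV cache is retained. On Lumina-mGPT-768, DPG improves with just 1/8 of the cache. LineAR achieves up to 67.61% memory reduction and a 7.57x speedup on LlamaGen-XL, and 39.66% memory reduction and a 5.62x speedup on Janus-Pro-7B.
Conclusion: By managing the cache of autoregressive visual generation effectively, LineAR improves memory efficiency and throughput while maintaining or improving generation quality. It offers a practical solution for large-scale autoregressive image generation and shows that cache compression based on the properties of visual attention is general and effective.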
📄 Abstract
Autoregressive (AR) visual generation has emerged as a powerful paradigm for image and multimodal synthesis, owing to its scalability and generality. However, existing AR image generation suffers from severe memory bottlenecks due to the need to cache all previously generated visual tokens during decoding, leading to both high storage requirements and low throughput. In this paper, we introduce LineAR, a novel, training-free progressive key-value (KV) cache compression pipeline for autoregressive image generation. By fully exploiting the intrinsic characteristics of visual attention, LineAR manages the cache at the line level using a 2D view, preserving the visual dependency regions while progressively evicting less-informative tokens that are harmless for subsequent line generation, guided by inter-line attention. LineAR enables efficient AR image generation by utilizing only a few lines of cache, achieving both memory savings and throughput speedup, while maintaining or even improving generation quality. Extensive experiments across six autoregressive image generation models, including class-conditional and text-to-image generation, validate its effectiveness and generality. LineAR improves ImageNet FID from 2.77 to 2.68 and COCO FID from 23.85 to 22.86 on LlamaGen-XL and Janus-Pro-1B, while retaining only 1/6 of the KV cache. It also improves DPG on Lumina-mGPT-768 with just 1/8 of the KV cache. Additionally, LineAR achieves significant memory and throughput gains, including up to 67.61% memory reduction and 7.57x speedup on LlamaGen-XL, and 39.66% memory reduction and 5.62x speedup on Janus-Pro-7B.
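To make the idea of line-level cache management concrete, here is a rough sketch of one possible eviction rule. It is an assumption-laden illustration, not LineAR's algorithm: the cache is a list of image lines holding token ids, the most recent lines are kept whole as a stand-in for the "visual dependency region", and older lines keep only the tokens that receive the most attention from the newest line. The function name and both `keep_*` parameters are hypothetical.

```python
def compress_line_cache(cache, attention, keep_recent=2, keep_per_old_line=1):
    """Evict low-attention tokens from older lines of a line-structured cache.

    cache: list of lines (oldest first), each a list of token ids.
    attention: dict mapping token id -> attention mass from the newest line.
    Recent lines are kept whole; older lines keep their top-scoring tokens.
    """
    split = max(len(cache) - keep_recent, 0)
    compressed = []
    for line in cache[:split]:
        ranked = sorted(line, key=lambda t: attention.get(t, 0.0), reverse=True)
        kept = set(ranked[:keep_per_old_line])
        compressed.append([t for t in line if t in kept])  # keep raster order
    return compressed + [list(line) for line in cache[split:]]
```

Applied after each generated line, a rule like this bounds the cache to roughly `keep_recent` full lines plus a few retained tokens per older line, which is the kind of saving the reported 1/6 and 1/8 cache ratios describe.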
[37] Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion
Yueming Pan, Ruoyu Feng, Qi Dai, Yuqi Wang, Wenfeng Lin, Mingyu Guo, Chong Luo, Nanning Zheng
🧩 TL;DR
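This paper proposes Semantic-First Diffusion (SFD), a latent diffusion paradigm that explicitly prioritizes semantic formation. It denoises the semantic and texture latents asynchronously so that the leading semantics guide texture refinement, yielding natural coarse-to-fine image generation.
📘 Detailed Summary
Motivation: Latent diffusion models (LDMs) follow a coarse-to-fine generation process, yet they usually denoise the semantic and VAE-encoded texture latents synchronously. This ignores the fact that semantics form slightly earlier than texture and fails to exploit semantics as an anchor for texture refinement.
Method: SFD first extracts a compact semantic latent from a pretrained visual encoder via a dedicated Semantic VAE and combines it with the texture latent to form a composite latent. Its core idea is to denoise the two latents asynchronously with separate noise schedules: semantic denoising leads texture denoising by a temporal offset, giving clearer high-level guidance for texture refinement.
Result: On ImageNet 256×256 with guidance, SFD achieves FID 1.06 (LightningDiT-XL) and FID 1.04 (1.0B LightningDiT-XXL), with up to 100x faster convergence than the original DiT. SFD also improves existing methods such as ReDi and VA-VAE, supporting the effectiveness of asynchronous, semantics-led modeling.
Conclusion: The study shows that an asynchronous diffusion paradigm that explicitly models semantics first can improve both image generation quality and training efficiency, giving latent diffusion models a more natural coarse-to-fine generation mechanism. The framework is general and can be applied to improve a range of existing diffusion architectures.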
📄 Abstract
Latent Diffusion Models (LDMs) inherently follow a coarse-to-fine generation process, where high-level semantic structure is generated slightly earlier than fine-grained texture. This indicates the preceding semantics potentially benefit texture generation by providing a semantic anchor. Recent advances have integrated semantic priors from pretrained visual encoders to further enhance LDMs, yet they still denoise semantic and VAE-encoded texture synchronously, neglecting such ordering. Observing these, we propose Semantic-First Diffusion (SFD), a latent diffusion paradigm that explicitly prioritizes semantic formation. SFD first constructs composite latents by combining a compact semantic latent, which is extracted from a pretrained visual encoder via a dedicated Semantic VAE, with the texture latent. The core of SFD is to denoise the semantic and texture latents asynchronously using separate noise schedules: semantics precede textures by a temporal offset, providing clearer high-level guidance for texture refinement and enabling natural coarse-to-fine generation. On ImageNet 256x256 with guidance, SFD achieves FID 1.06 (LightningDiT-XL) and FID 1.04 (1.0B LightningDiT-XXL), while achieving up to 100x faster convergence than the original DiT. SFD also improves existing methods like ReDi and VA-VAE, demonstrating the effectiveness of asynchronous, semantics-led modeling. Project page and code: https://yuemingpan.github.io/SFD.github.io/.
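The phrase "semantics precede textures by a temporal offset" can be illustrated with a toy schedule. The sketch below assumes a single shared clock and linear progress for both latents, with the texture latent starting `offset` later; the linear form, the offset value of 0.25, and the name `async_progress` are illustrative assumptions, not SFD's actual noise schedules.

```python
def async_progress(step, num_steps, offset=0.25):
    """Return (semantic, texture) denoising progress in [0, 1] at a given step.

    A shared clock runs from 0 to 1 + offset. The texture latent starts
    `offset` later, so semantics are always at least as clean as texture.
    """
    clock = (1.0 + offset) * step / num_steps
    semantic = min(max(clock, 0.0), 1.0)
    texture = min(max(clock - offset, 0.0), 1.0)
    return semantic, texture


for step in (0, 2, 6, 10):
    print(step, async_progress(step, num_steps=10))
```

Early steps advance only the semantic latent, the middle steps advance both, and the final steps finish the texture once the semantics are already clean, which mirrors the coarse-to-fine ordering the abstract describes.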
[38] Towards Adaptive Fusion of Multimodal Deep Networks for Human Action Recognition
Novanto Yudistira
🧩 TL;DR
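This study proposes a human action recognition method based on deep neural networks and adaptive multimodal fusion. It fuses RGB, optical flow, audio, and depth information through gating mechanisms, improving the accuracy and robustness of action recognition.
📘 Detailed Summary
Motivation: Traditional unimodal action recognition methods have inherent limitations and cannot exploit the complementary strengths of multiple modalities. The study addresses this with a multimodal fusion strategy, exploring the integration of RGB, optical flow, audio, and depth information for action recognition.
Method: The method uses a gating-based adaptive multimodal fusion architecture that selectively integrates the relevant information from each modality. Using deep neural networks and gated fusion, the study systematically examines several gated fusion strategies to identify the most effective approach to multimodal action recognition.
Result: In evaluations on benchmark datasets covering human action recognition, violence action detection, and multiple self-supervised learning tasks, the method shows performance gains in accuracy and robustness over traditional unimodal methods.
Conclusion: The study demonstrates the potential of multimodal fusion for action recognition and offers a more advanced solution for applications such as surveillance and human-computer interaction, with particular practical value in settings such as active assisted living.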
📄 Abstract
This study introduces a pioneering methodology for human action recognition by harnessing deep neural network techniques and adaptive fusion strategies across multiple modalities, including RGB, optical flows, audio, and depth information. Employing gating mechanisms for multimodal fusion, we aim to surpass limitations inherent in traditional unimodal recognition methods while exploring novel possibilities for diverse applications. Through an exhaustive investigation of gating mechanisms and adaptive weighting-based fusion architectures, our methodology enables the selective integration of relevant information from various modalities, thereby bolstering both accuracy and robustness in action recognition tasks. We meticulously examine various gated fusion strategies to pinpoint the most effective approach for multimodal action recognition, showcasing its superiority over conventional unimodal methods. Gating mechanisms facilitate the extraction of pivotal features, resulting in a more holistic representation of actions and substantial enhancements in recognition performance. Our evaluations across human action recognition, violence action detection, and multiple self-supervised learning tasks on benchmark datasets demonstrate promising advancements in accuracy. The significance of this research lies in its potential to revolutionize action recognition systems across diverse fields. The fusion of multimodal information promises sophisticated applications in surveillance and human-computer interaction, especially in contexts related to active assisted living.
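A gating mechanism for multimodal fusion can take many forms; the abstract does not specify which variants were studied. As one common and simple instance, the sketch below fuses per-modality feature vectors as a softmax-weighted sum. The gate logits are passed in directly here, whereas in a trained model they would come from a small learned network; all names are hypothetical.

```python
import math


def softmax(logits):
    """Convert gate logits into weights that are positive and sum to one."""
    peak = max(logits)  # subtract the peak for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(features, gate_logits):
    """Fuse per-modality feature vectors as a softmax-weighted sum.

    features: one equal-length vector per modality (e.g. RGB, flow, audio, depth).
    gate_logits: one logit per modality; higher means more trusted.
    """
    if len(features) != len(gate_logits):
        raise ValueError("need one gate logit per modality")
    weights = softmax(gate_logits)
    dim = len(features[0])
    fused = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    return fused, weights
```

When the gate strongly favors one modality the output approaches that modality's features, and with equal logits the fusion reduces to a plain average, which is how a gate can adapt to, say, a dark scene where RGB is unreliable.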
[39] FASTer: Toward Efficient Autoregressive Vision Language Action Modeling via neural Action Tokenization
Yicheng Liu, Shiduo Zhang, Zibin Dong, Baijun Ye, Tianyuan Yuan, Xiaopeng Yu, Linqi Yin, Chenhao Lu, Junhao Shi, Luca Jiang-Tao Yu, Liangtao Zheng, Tao Jiang, Jingjing Gong, Xipeng Qiu, Hang Zhao
🧩 TL;DR
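This paper proposes FASTer, a framework that pairs a learnable action tokenizer with an autoregressive policy built on it, addressing the trade-off between reconstruction fidelity and inference efficiency in action tokenization for vision-language-action models and enabling more efficient, more generalizable robot learning.
📘 Detailed Summary
Motivation: Autoregressive vision-language-action models have shown strong capabilities in robotic manipulation, but their core action tokenization step usually trades reconstruction fidelity against inference efficiency, which limits overall performance and practical efficiency.
Method: FASTer integrates a learnable tokenizer with an autoregressive policy built on it. FASTerVQ encodes action chunks as single-channel images, capturing global spatio-temporal dependencies while maintaining a high compression ratio. FASTerVLA builds on this tokenizer with block-wise autoregressive decoding and a lightweight action expert.
Result: On simulated and real-world benchmarks, FASTerVQ delivers superior reconstruction quality, high token utilization, and strong cross-task and cross-embodiment generalization, while FASTerVLA surpasses previous state-of-the-art VLA models in both inference speed and task performance.
Conclusion: The study shows that integrating a learnable tokenizer with an autoregressive policy in a unified framework is effective, points to a new direction for efficient and generalizable robot learning, and demonstrates that fast inference is achievable while maintaining high-quality action reconstruction.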
📄 Abstract
Autoregressive vision-language-action (VLA) models have recently demonstrated strong capabilities in robotic manipulation. However, their core process of action tokenization often involves a trade-off between reconstruction fidelity and inference efficiency. We introduce FASTer, a unified framework for efficient and generalizable robot learning that integrates a learnable tokenizer with an autoregressive policy built upon it. FASTerVQ encodes action chunks as single-channel images, capturing global spatio-temporal dependencies while maintaining a high compression ratio. FASTerVLA builds on this tokenizer with block-wise autoregressive decoding and a lightweight action expert, achieving both faster inference and higher task performance. Extensive experiments across simulated and real-world benchmarks show that FASTerVQ delivers superior reconstruction quality, high token utilization, and strong cross-task and cross-embodiment generalization, while FASTerVLA further improves overall capability, surpassing previous state-of-the-art VLA models in both inference speed and task performance.
[40] Rethinking the Use of Vision Transformers for AI-Generated Image Detection
NaHyeon Park, Kunhee Kim, Junsuk Choe, Hyunjung Shim
🧩 TL;DR
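This paper proposes MoLD, an adaptive method that uses a gating mechanism to dynamically integrate features from multiple CLIP-ViT layers, improving both the performance and the generalization of AI-generated image detection.
📘 Detailed Summary
Motivation: Existing AI-generated image detection methods rely mainly on final-layer CLIP-ViT features. They lack a systematic analysis of how individual layers contribute and do not exploit the complementary information in features from different layers.
Method: MoLD uses a gating-based mechanism to dynamically integrate features from multiple ViT layers, adaptively fusing representations from different depths. The authors also verify that the approach extends to other pretrained ViT models.
Result: Experiments show that MoLD improves detection performance on both GAN- and diffusion-generated images, generalizes better across generative models, and is robust in real-world scenarios. Features from earlier layers often outperform final-layer features on the detection task.
Conclusion: The study reveals the complementary value of features from different ViT layers for AI-generated image detection. The adaptive integration method offers a new way to use these features and is scalable to other pretrained vision Transformers.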
📄 Abstract
Rich feature representations derived from CLIP-ViT have been widely utilized in AI-generated image detection. While most existing methods primarily leverage features from the final layer, we systematically analyze the contributions of layer-wise features to this task. Our study reveals that earlier layers provide more localized and generalizable features, often surpassing the performance of final-layer features in detection tasks. Moreover, we find that different layers capture distinct aspects of the data, each contributing uniquely to AI-generated image detection. Motivated by these findings, we introduce a novel adaptive method, termed MoLD, which dynamically integrates features from multiple ViT layers using a gating-based mechanism. Extensive experiments on both GAN- and diffusion-generated images demonstrate that MoLD significantly improves detection performance, enhances generalization across diverse generative models, and exhibits robustness in real-world scenarios. Finally, we illustrate the scalability and versatility of our approach by successfully applying it to other pre-trained ViTs, such as DINOv2.
[41] Stable Single-Pixel Contrastive Learning for Semantic and Geometric Tasks
Leonid Pogorelyuk, Niels Bracher, Aaron Verkleeren, Lars Kühmichel, Stefan T. Radev
🧩 TL;DR
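This paper proposes a family of stable contrastive losses for learning pixel-level representations that capture both semantic and geometric information. Using overcomplete descriptors, the method achieves precise point correspondence across images without momentum-based teacher-student training.
📘 Detailed Summary
Motivation: Existing methods are limited in capturing semantic and geometric information jointly in pixel-level representations, and they typically need a complex momentum-updated teacher-student architecture to obtain point correspondence across images. The study aims to develop a simpler, effective alternative.
Method: The authors propose a family of stable contrastive losses that map each pixel of an image to an overcomplete descriptor that is both view-invariant and semantically meaningful, enabling precise point correspondence without momentum-based teacher-student training.
Result: Experiments in synthetic 2D and 3D environments demonstrate the properties of the proposed losses and the resulting overcomplete representations, showing precise cross-image point correspondence while semantic consistency is maintained.
Conclusion: The method offers a simpler, effective alternative for pixel-level representation learning. By encoding semantic and geometric information together in overcomplete descriptors, it opens a new route for dense correspondence and scene understanding in computer vision.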
📄 Abstract
We pilot a family of stable contrastive losses for learning pixel-level representations that jointly capture semantic and geometric information. Our approach maps each pixel of an image to an overcomplete descriptor that is both view-invariant and semantically meaningful. It enables precise point-correspondence across images without requiring momentum-based teacher-student training. Two experiments in synthetic 2D and 3D environments demonstrate the properties of our loss and the resulting overcomplete representations.
[42] Aligned but Stereotypical? The Hidden Influence of System Prompts on Social Bias in LVLM-Based Text-to-Image Models
NaHyeon Park, Namin An, Kunhee Kim, Soyeon Yoon, Jiahao Huo, Hyunjung Shim
🧩 TL;DR
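This study finds that text-to-image systems based on large vision-language models (LVLMs) produce more severe social bias than non-LVLM models, and proposes FairPro, a training-free meta-prompting framework that reduces bias through self-auditing of system prompts while preserving text-image alignment.
📘 Detailed Summary
Motivation: LVLM-based text-to-image systems have become the dominant paradigm in image generation, but whether they amplify social bias is still poorly understood. A systematic evaluation of the extent of social bias in LVLM-based models and of the mechanism behind it is needed.
Method: The study builds a benchmark of 1,024 prompts spanning four levels of linguistic complexity. Through decoded intermediate representations, token-probability diagnostics, and embedding-association analyses, it reveals how system prompts encode demographic priors. It then proposes FairPro, a training-free meta-prompting framework that lets an LVLM self-audit and construct fairness-aware system prompts at test time.
Result: Experiments show that LVLM-based models produce markedly more socially biased images than non-LVLM models and that system prompts are a primary driver of the biased behavior. On two LVLM-based models, SANA and Qwen-Image, FairPro substantially reduces demographic bias while preserving text-image alignment.
Conclusion: The study reveals the central role of system prompts in bias propagation and offers a practical, deployable approach to building more responsible text-to-image systems. FairPro shows that meta-prompting is an effective route to bias mitigation.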
📄 Abstract
Large vision-language model (LVLM) based text-to-image (T2I) systems have become the dominant paradigm in image generation, yet whether they amplify social biases remains insufficiently understood. In this paper, we show that LVLM-based models produce markedly more socially biased images than non-LVLM-based models. We introduce a 1,024 prompt benchmark spanning four levels of linguistic complexity and evaluate demographic bias across multiple attributes in a systematic manner. Our analysis identifies system prompts, the predefined instructions guiding LVLMs, as a primary driver of biased behavior. Through decoded intermediate representations, token-probability diagnostics, and embedding-association analyses, we reveal how system prompts encode demographic priors that propagate into image synthesis. To this end, we propose FairPro, a training-free meta-prompting framework that enables LVLMs to self-audit and construct fairness-aware system prompts at test time. Experiments on two LVLM-based T2I models, SANA and Qwen-Image, show that FairPro substantially reduces demographic bias while preserving text-image alignment. We believe our findings provide deeper insight into the central role of system prompts in bias propagation and offer a practical, deployable approach for building more socially responsible T2I systems.
[43] HTR-ConvText: Leveraging Convolution and Textual Information for Handwritten Text Recognition
Pham Thach Thanh Truc, Dang Hoai Nam, Huynh Tong Dang Khoa, Vo Nguyen Le Duy
🧩 TL;DR
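This paper proposes HTR-ConvText, a model that combines local stroke-level features with global contextual dependencies to address the challenges of handwritten text recognition, achieving better generalization under limited data and complex writing styles.
📘 Detailed Summary
Motivation: Handwritten text recognition is challenged by limited data, large variation in writing style, and complex diacritics. Existing methods usually need large amounts of synthetic data to generalize and struggle when training samples are limited and handwriting is highly diverse.
Method: In the feature extraction stage, HTR-ConvText integrates a residual convolutional neural network backbone with a MobileViT block with positional encoding, capturing both structural patterns and subtle writing details. It introduces the ConvText encoder, a hybrid architecture that combines global context and local features within a hierarchical structure that shortens the sequence for efficiency. An auxiliary module injects textual context to mitigate the weakness of Connectionist Temporal Classification.
Result: Evaluations on the IAM, READ2016, LAM, and HANDS-VNOnDB datasets show improved performance and better generalization than existing methods, especially with limited training samples and high handwriting diversity.
Conclusion: The study shows the effectiveness of combining local stroke-level features with global contextual dependencies and offers a new approach to handwritten text recognition under limited data. The hybrid architecture improves adaptability to complex writing styles while remaining efficient.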
📄 Abstract
Handwritten Text Recognition remains challenging due to limited data, high variance in writing style, and scripts with complex diacritics. Existing approaches, though they partially address these issues, often struggle to generalize without massive synthetic data. To address these challenges, we propose HTR-ConvText, a model designed to capture fine-grained, stroke-level local features while preserving global contextual dependencies. In the feature extraction stage, we integrate a residual Convolutional Neural Network backbone with a MobileViT with Positional Encoding block. This enables the model to both capture structural patterns and learn subtle writing details. We then introduce the ConvText encoder, a hybrid architecture combining global context and local features within a hierarchical structure that reduces sequence length for improved efficiency. Additionally, an auxiliary module injects textual context to mitigate the weakness of Connectionist Temporal Classification. Evaluations on IAM, READ2016, LAM, and HANDS-VNOnDB demonstrate that our approach achieves improved performance and better generalization compared to existing methods, especially in scenarios with limited training samples and high handwriting diversity.
[44] RAMEN: Resolution-Adjustable Multimodal Encoder for Earth Observation
Nicolas Houdré, Diego Marcos, Hugo Riffaud de Turckheim, Dino Ienco, Laurent Wendling, Camille Kurtz, Sylvain Lobry
🧩 TL;DR
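This paper proposes RAMEN, a resolution-adjustable multimodal encoder that learns a shared visual representation of Earth observation data in a fully sensor-agnostic manner. By defining spatial resolution as a controllable output parameter, it enables unified analysis across heterogeneous modalities.
📘 Detailed Summary
Motivation: Current Earth observation (EO) foundation models usually expect fixed input resolutions or rely on sensor-specific encoders, which limits generalization across heterogeneous EO modalities. Existing methods struggle with the wide range of spatial, spectral, and temporal resolutions, so a sensor-agnostic approach is needed that handles data from different sensors and resolutions uniformly.
Method: RAMEN treats the modality and the spatial and temporal resolutions as key input features and defines spatial resolution as a controllable output parameter. A single unified Transformer encoder is trained to reconstruct masked multimodal EO data from diverse sources, and the resolution adjustment mechanism allows an explicit trade-off between spatial precision and computational cost.
Result: RAMEN outperforms larger state-of-the-art models on the community-standard PANGAEA benchmark, which contains a variety of multi-sensor and multi-resolution downstream tasks, and transfers effectively to both known and unseen sensor configurations, demonstrating generalization across sensors and resolutions.
Conclusion: The study provides a sensor-agnostic method for learning a unified representation of Earth observation data. Controllable output resolution balances analytical flexibility against computational efficiency and opens a new route for cross-modal EO analysis. The code and pretrained model are open-sourced to encourage community adoption.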
📄 Abstract
Earth observation (EO) data spans a wide range of spatial, spectral, and temporal resolutions, from high-resolution optical imagery to low resolution multispectral products or radar time series. While recent foundation models have improved multimodal integration for learning meaningful representations, they often expect fixed input resolutions or are based on sensor-specific encoders limiting generalization across heterogeneous EO modalities. To overcome these limitations we introduce RAMEN, a resolution-adjustable multimodal encoder that learns a shared visual representation across EO data in a fully sensor-agnostic manner. RAMEN treats the modality and spatial and temporal resolutions as key input data features, enabling coherent analysis across modalities within a unified latent space. Its main methodological contribution is to define spatial resolution as a controllable output parameter, giving users direct control over the desired level of detail at inference and allowing explicit trade-offs between spatial precision and computational cost. We train a single, unified transformer encoder reconstructing masked multimodal EO data drawn from diverse sources, ensuring generalization across sensors and resolutions. Once pretrained, RAMEN transfers effectively to both known and unseen sensor configurations and outperforms larger state-of-the-art models on the community-standard PANGAEA benchmark, containing various multi-sensor and multi-resolution downstream tasks. Our code and pretrained model are available at https://github.com/nicolashoudre/RAMEN.
[45] Semantic-Guided Two-Stage GAN for Face Inpainting with Hybrid Perceptual Encoding
Abhigyan Bhattacharya, Hiranmoy Roy
🧩 TL;DR
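This paper proposes a semantic-guided hierarchical synthesis architecture for face image inpainting. By combining the local feature extraction of CNNs with the global feature modeling of Vision Transformers, it handles large irregular masks and outperforms existing methods in preserving identity consistency and structural coherence.
📘 Detailed Summary
Motivation: Existing face inpainting methods struggle with large irregular masks and often produce blurry textures, semantic inconsistencies, or implausible facial structures. This stems mainly from direct pixel-level synthesis and limited use of facial priors, which prevents realistic restoration that also preserves identity and structural coherence.
Method: The proposed semantic-guided hierarchical synthesis architecture has two stages. The first stage combines local feature extraction with CNNs and global feature modeling with Vision Transformers to produce clear, detailed semantic layouts. The second stage uses a multi-modal texture generator that refines texture by fusing multi-scale information. Dynamic attention lets the architecture handle arbitrary mask configurations without mask-specific training.
Result: Experiments on the CelebA-HQ and FFHQ datasets show that the model outperforms existing state-of-the-art methods on metrics such as LPIPS, PSNR, and SSIM, particularly in challenging large-area inpainting, where it produces visually striking results with better semantic preservation and structural consistency.
Conclusion: The study shows that semantic-guided hierarchical synthesis is effective for face image inpainting. Separating semantic layout generation from texture refinement and combining local and global feature modeling offers a new approach to complex mask configurations, and the method could be extended to other inpainting tasks and broader generative vision applications.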
📄 Abstract
Facial image inpainting aims to restore missing or corrupted regions in face images while preserving identity, structural consistency, and photorealistic image quality, a task created specifically for photo restoration. Despite many recent advances in deep generative models, existing methods struggle with large irregular masks, often producing blurry textures at the edges of the masked region, semantic inconsistencies, or unconvincing facial structures, owing to direct pixel-level synthesis and limited exploitation of facial priors. In this paper we propose a novel architecture that addresses these challenges through semantic-guided hierarchical synthesis. Our approach first organizes and synthesizes information based on meaning and then refines the texture, which gives a clear picture of the facial structure before detailed images are created. In the first stage, we blend two techniques: local feature extraction with CNNs and global feature modeling with Vision Transformers. This produces clear and detailed semantic layouts. In the second stage, we use a Multi-Modal Texture Generator to refine these layouts by drawing on information from different scales, ensuring a cohesive and consistent result. The architecture naturally handles arbitrary mask configurations through dynamic attention, without mask-specific training. Experiments on two datasets, CelebA-HQ and FFHQ, show that our model outperforms other state-of-the-art methods, with improvements in metrics such as LPIPS, PSNR, and SSIM. It produces visually striking results with better semantic preservation in challenging large-area inpainting situations.
[46] Visual Reasoning Tracer: Object-Level Grounded Reasoning Benchmark
Haobo Yuan, Yueyi Sun, Yanwei Li, Tao Zhang, Xueqing Deng, Henghui Ding, Lu Qi, Anran Wang, Xiangtai Li, Ming-Hsuan Yang
🧩 TL;DR
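This paper introduces the Visual Reasoning Tracer (VRT) task to address the opacity of reasoning in multimodal large language models, and builds the VRT-Bench evaluation benchmark and the VRT-80k training dataset, which substantially improve models' ability to explicitly trace intermediate reasoning steps.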
📘 Detailed Summary
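Motivation: Current multimodal large language models perform well on tasks such as visual grounding and visual question answering, but their reasoning is usually opaque: they output only final predictions without revealing intermediate steps or fine-grained evidence (e.g., pixels, locations). This contrasts sharply with the way humans naturally think through a chain of visual reasoning.
Method: The paper introduces the Visual Reasoning Tracer (VRT) task, which requires models not only to localize the target object but also to explicitly predict the intermediate objects that form the reasoning path. To support it, the authors build VRT-Bench, a human-annotated evaluation benchmark, a new metric for the quality of reasoning traces, and VRT-80k, a large-scale training dataset.
Result: Experiments show that existing models often produce the correct final output but perform poorly at explicitly grounding their intermediate reasoning. In contrast, models trained on VRT-80k achieve substantial improvements in tracing the reasoning path, which supports the value of the proposed method and data resources.
Conclusion: The study exposes the limitations of current multimodal large language models in explicit visual reasoning. The proposed VRT task, evaluation benchmark, and training dataset offer a systematic way to improve reasoning transparency and advance interpretable multimodal AI.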
📄 Abstract
Recent advances in Multimodal Large Language Models (MLLMs) have significantly improved performance on tasks such as visual grounding and visual question answering. However, the reasoning processes of these models remain largely opaque; they typically output only final predictions without revealing the intermediate steps or fine-grained evidence (e.g., pixels, locations) that lead to the result. This contrasts with human intelligence, which naturally operates through a chain of visual reasoning. To address this limitation, we introduce the Visual Reasoning Tracer (VRT) task, which requires models to not only localize the target object but also explicitly predict the intermediate objects that form the reasoning path. To advance research in this area, we contribute: (1) VRT-Bench, a human-annotated benchmark for evaluating visual reasoning; (2) a new metric for assessing the quality of reasoning traces; and (3) VRT-80k, a large-scale dataset for reasoning model training. Our experiments reveal that while existing models often produce the correct final output, they struggle to ground their intermediate reasoning. In contrast, models trained on VRT-80k achieve substantial improvements in tracing the reasoning path.
[47] NeuralRemaster: Phase-Preserving Diffusion for Structure-Aligned Generation
Yu Zeng, Charles Ochoa, Mingyuan Zhou, Vishal M. Patel, Vitor Guizilini, Rowan McAllister
🧩 TL;DR
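This paper proposes Phase-Preserving Diffusion (φ-PD), a model-agnostic reformulation of the diffusion process that preserves the input phase while randomizing magnitude. It enables structure-aligned generation without architectural changes or additional parameters, and is especially suited to tasks that require geometric consistency.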
📘 Detailed Summary
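Motivation: Standard diffusion uses Gaussian noise that corrupts both the phase and magnitude components of the data. This works for unconditional or text-to-image generation, but corrupting phase destroys spatial structure, which makes it ill-suited to tasks requiring geometric consistency, such as re-rendering, simulation enhancement, and image-to-image translation.
Method: The paper proposes Phase-Preserving Diffusion (φ-PD), a model-agnostic reformulation of the diffusion process that achieves structure-aligned generation by preserving the input phase while randomizing magnitude. It also proposes Frequency-Selective Structured (FSS) noise, which provides continuous control over structural rigidity through a single frequency-cutoff parameter. The method adds no inference-time cost and is compatible with any image or video diffusion model.
Result: Across photorealistic and stylized re-rendering and sim-to-real enhancement for driving planners, φ-PD produces controllable, spatially aligned results. When applied to the CARLA simulator, φ-PD improves CARLA-to-Waymo planner performance by 50%. The method is complementary to existing conditioning approaches and broadly applicable to image-to-image and video-to-video generation.
Conclusion: Phase-preserving diffusion offers an effective way to address geometric consistency in diffusion models. Its phase-preservation mechanism achieves structure-aligned generation while remaining model-agnostic and adding no inference cost. The method opens new options for applications that must preserve spatial structure and shows practical value in simulation enhancement and cross-domain translation.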
📄 Abstract
Standard diffusion corrupts data using Gaussian noise whose Fourier coefficients have random magnitudes and random phases. While effective for unconditional or text-to-image generation, corrupting phase components destroys spatial structure, making it ill-suited for tasks requiring geometric consistency, such as re-rendering, simulation enhancement, and image-to-image translation. We introduce Phase-Preserving Diffusion (φ-PD), a model-agnostic reformulation of the diffusion process that preserves input phase while randomizing magnitude, enabling structure-aligned generation without architectural changes or additional parameters. We further propose Frequency-Selective Structured (FSS) noise, which provides continuous control over structural rigidity via a single frequency-cutoff parameter. φ-PD adds no inference-time cost and is compatible with any diffusion model for images or videos. Across photorealistic and stylized re-rendering, as well as sim-to-real enhancement for driving planners, φ-PD produces controllable, spatially aligned results. When applied to the CARLA simulator, φ-PD improves CARLA-to-Waymo planner performance by 50%. The method is complementary to existing conditioning approaches and broadly applicable to image-to-image and video-to-video generation. Videos, additional examples, and code are available on our project page (https://yuzeng-at-tri.github.io/ppd-page/).
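The core idea of keeping Fourier phase while randomizing magnitude can be illustrated on a toy 1D signal. The following is a minimal sketch, not the authors' implementation: the function names, the naive DFT, and the choice of drawing magnitudes from the modulus of a complex Gaussian are all assumptions made for illustration, and the paper's actual noise schedule and FSS noise are not reproduced here.

```python
import cmath
import math
import random


def dft(x):
    """Naive discrete Fourier transform (O(n^2)), enough for a toy example."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X):
    """Naive inverse discrete Fourier transform."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def phase_preserving_noise(signal, rng):
    """Hypothetical helper: keep each coefficient's phase, replace its magnitude.

    Magnitudes are drawn as the modulus of a complex Gaussian (an assumption)
    and mirrored across the spectrum so that the output stays real-valued.
    """
    n = len(signal)
    spectrum = dft(signal)
    mags = [0.0] * n
    for k in range(n // 2 + 1):
        m = abs(complex(rng.gauss(0, 1), rng.gauss(0, 1))) + 1e-9
        mags[k] = m
        mags[(n - k) % n] = m  # mirrored bin gets the same magnitude
    noisy = [m * cmath.exp(1j * cmath.phase(c)) for m, c in zip(mags, spectrum)]
    return [v.real for v in idft(noisy)]


rng = random.Random(0)
signal = [rng.uniform(-1, 1) for _ in range(16)]
corrupted = phase_preserving_noise(signal, rng)
```

In this sketch the corrupted signal has new, random magnitudes in every frequency bin, but its phase spectrum matches the input's, which is the property the abstract associates with preserved spatial structure.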
[48] ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning
Shengyuan Ding, Xinyu Fang, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiangyu Zhao, Haodong Duan, Xiaoyi Dong, Jianze Liang, Bin Wang, Conghui He, Dahua Lin, Jiaqi Wang
🧩 TL;DR
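This paper proposes ARM-Thinker, an agentic multimodal reward model that autonomously invokes external tools to reason over verifiable evidence. Its tool-calling mechanism substantially improves the accuracy and interpretability of reward models on complex multimodal tasks.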
📘 Detailed Summary
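Motivation: Current reward models for vision-language systems suffer from hallucination, weak visual grounding, and an inability to use tools for verification. These shortcomings limit their reliability on complex multimodal reasoning tasks, which motivates an agentic reward model capable of verifiable reasoning.
Method: The paper proposes ARM-Thinker, an agentic multimodal reward model that autonomously invokes external tools (e.g., image cropping, document page retrieval) to ground its judgments in verifiable evidence, replacing traditional static, non-interactive reward scoring. Multi-stage reinforcement learning jointly optimizes tool-calling decisions and judgment accuracy.
Result: ARM-Thinker achieves an average improvement of 16.2% on reward modeling benchmarks and 9.6% on tool-use tasks, and outperforms baselines on multimodal math and logical reasoning benchmarks. The paper also introduces the ARMBench-VL evaluation suite, comprising three benchmarks that assess fine-grained visual grounding, multi-page document understanding, and instruction following.
Conclusion: The study shows that agentic capabilities significantly improve the accuracy and interpretability of reward models. Tool calling lets the model verify fine-grained visual details, cross-reference multi-page evidence, and validate reasoning claims, pointing toward more reliable multimodal alignment systems.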
📄 Abstract
Reward models are critical for aligning vision-language systems with human preferences, yet current approaches suffer from hallucination, weak visual grounding, and an inability to use tools for verification, limiting their reliability on complex multimodal reasoning tasks. We present ARM-Thinker, an Agentic multimodal Reward Model that autonomously invokes external tools (e.g., image cropping, document page retrieval) to ground judgments in verifiable evidence, replacing static, non-interactive reward scoring. This enables the model to verify fine-grained visual details, cross-reference multi-page evidence, and validate reasoning claims, which are capabilities absent in existing reward models. We train ARM-Thinker with multi-stage reinforcement learning, jointly optimizing tool-calling decisions and judgment accuracy. To evaluate agentic reward modeling, we introduce ARMBench-VL, comprising three benchmarks that assess fine-grained visual grounding (image-level tools), multi-page document understanding (retrieval tools), and instruction following (text-level verification). ARM-Thinker achieves +16.2% average improvement on reward modeling benchmarks, +9.6% on tool-use tasks, and outperforms baselines on multimodal math and logical reasoning benchmarks. Our results demonstrate that agentic capabilities significantly enhance both accuracy and interpretability of reward models.
cs.CL [Back]
[49] MSME: A Multi-Stage Multi-Expert Framework for Zero-Shot Stance Detection
Yuanshuo Zhang, Aohua Li, Bo Chen, Jingbo Sun, Xiaobing Zhao
🧩 TL;DR
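This paper proposes MSME, a multi-stage, multi-expert framework for zero-shot stance detection. It handles the challenges of stance understanding in complex real-world scenarios through three stages (knowledge preparation, expert reasoning, and decision aggregation) and achieves state-of-the-art performance on three public datasets.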
📘 Detailed Summary
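Motivation: Although LLM-based methods have achieved strong results in zero-shot stance detection, they still struggle in complex real-world scenarios: stance understanding requires dynamic background knowledge, target definitions involving compound entities or events must be explicitly linked to stance labels, and rhetorical devices such as irony often obscure the author's actual intent.
Method: The MSME framework has three stages. The knowledge preparation stage retrieves relevant background knowledge and clarifies stance labels. The expert reasoning stage contains three specialized modules: the Knowledge Expert distills key facts and reasons from a knowledge perspective, the Label Expert refines stance labels and reasons accordingly, and the Pragmatic Expert detects rhetorical cues such as irony to infer intent from a pragmatic angle. The decision aggregation stage uses a Meta-Judge to integrate all expert analyses into the final stance prediction.
Result: Experiments on three public datasets show that MSME achieves state-of-the-art performance on all of them, outperforming existing methods and demonstrating the framework's effectiveness on complex zero-shot stance detection.
Conclusion: Through its multi-stage, multi-expert design, MSME addresses the challenges of stance detection in complex real-world scenarios. Its modular architecture allows dedicated handling of knowledge, label, and pragmatic dimensions, and offers a useful reference for NLP tasks that require dynamic background understanding and rhetorical analysis.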
📄 Abstract
LLM-based approaches have recently achieved impressive results in zero-shot stance detection. However, they still struggle in complex real-world scenarios, where stance understanding requires dynamic background knowledge, target definitions involve compound entities or events that must be explicitly linked to stance labels, and rhetorical devices such as irony often obscure the author's actual intent. To address these challenges, we propose MSME, a Multi-Stage, Multi-Expert framework for zero-shot stance detection. MSME consists of three stages: (1) Knowledge Preparation, where relevant background knowledge is retrieved and stance labels are clarified; (2) Expert Reasoning, involving three specialized modules: the Knowledge Expert distills salient facts and reasons from a knowledge perspective, the Label Expert refines stance labels and reasons accordingly, and the Pragmatic Expert detects rhetorical cues such as irony to infer intent from a pragmatic angle; (3) Decision Aggregation, where a Meta-Judge integrates all expert analyses to produce the final stance prediction. Experiments on three public datasets show that MSME achieves state-of-the-art performance across the board.
[50] Factuality and Transparency Are All RAG Needs! Self-Explaining Contrastive Evidence Re-ranking
Francielle Vargas, Daniel Pedronette
🧩 TL;DR
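This paper proposes Self-Explaining Contrastive Evidence Re-ranking (CER), a new method that restructures retrieval around factual evidence by fine-tuning embeddings with contrastive learning and generating token-level attribution rationales for each retrieved passage. It aims to improve retrieval accuracy and reduce the risk of hallucination in RAG systems.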
📘 Detailed Summary
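Motivation: The study targets the hallucination risk that conventional retrieval methods can introduce in safety-critical domains. In settings that need reliable, evidence-based retrieval, such as the analysis of clinical trial reports, conventional methods lack explicit alignment with factual evidence and a transparent explanation mechanism.
Method: The method adopts a self-explaining contrastive evidence re-ranking framework. It fine-tunes embedding representations with contrastive learning, automatically selects hard negatives using a subjectivity-based criterion, and generates token-level attribution rationales for each retrieved passage, forcing the model to pull factual rationales closer while pushing subjective or misleading explanations apart.
Result: Initial experiments on clinical trial reports indicate that CER improves retrieval accuracy, mitigates the potential for hallucinations in RAG systems, and provides transparent, evidence-based retrieval that enhances reliability, especially in safety-critical domains.
Conclusion: The method creates an embedding space explicitly aligned with evidential reasoning and offers a more reliable retrieval approach for safety-critical domains. Its transparency and grounding in evidence make it well suited to applications that demand high reliability, and it suggests a direction for future work on trustworthy retrieval systems.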
📄 Abstract
This extended abstract introduces Self-Explaining Contrastive Evidence Re-Ranking (CER), a novel method that restructures retrieval around factual evidence by fine-tuning embeddings with contrastive learning and generating token-level attribution rationales for each retrieved passage. Hard negatives are automatically selected using a subjectivity-based criterion, forcing the model to pull factual rationales closer while pushing subjective or misleading explanations apart. As a result, the method creates an embedding space explicitly aligned with evidential reasoning. We evaluated our method on clinical trial reports, and initial experimental results show that CER improves retrieval accuracy, mitigates the potential for hallucinations in RAG systems, and provides transparent, evidence-based retrieval that enhances reliability, especially in safety-critical domains.
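The pull-closer, push-apart objective described above can be sketched with a triplet-style margin loss over cosine similarities. This is a hedged illustration only: the abstract does not state which contrastive loss CER uses, so the triplet form, the margin value, the subjectivity threshold, and the helper names `triplet_loss` and `pick_hard_negative` are all assumptions, and the subjectivity scores are taken as given rather than computed.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def pick_hard_negative(query, candidates, threshold=0.5):
    """Hypothetical selection rule: among candidates whose (assumed, precomputed)
    subjectivity score is at least `threshold`, return the embedding that is
    most similar to the query, i.e. the hardest one to separate."""
    subjective = [emb for emb, score in candidates if score >= threshold]
    return max(subjective, key=lambda emb: cosine(query, emb))


def triplet_loss(query, factual, hard_negative, margin=0.2):
    """Zero once the factual passage beats the negative by at least `margin`."""
    return max(0.0, margin - cosine(query, factual) + cosine(query, hard_negative))


query = [1.0, 0.0]
factual = [1.0, 0.1]
candidates = [([0.9, 0.1], 0.9), ([0.0, 1.0], 0.8), ([1.0, 0.0], 0.1)]
negative = pick_hard_negative(query, candidates)
loss = triplet_loss(query, factual, negative)
```

Minimizing such a loss over many triplets moves factual passages toward the query and subjective ones away from it, which is one plausible way to obtain the evidence-aligned embedding space the abstract describes.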
cs.AI [Back]
[51] BiTAgent: A Task-Aware Modular Framework for Bidirectional Coupling between Multimodal Large Language Models and World Models
Yu-Wei Zhan, Xin Wang, Pengzhe Mao, Tongtong Feng, Ren Wang, Wenwu Zhu
🧩 TL;DR
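This paper proposes BiTAgent, a task-aware dynamic joint framework that bidirectionally couples multimodal large language models with world models to align semantic intent with dynamic state representations in open-world embodied intelligence, achieving stable generalization across tasks and environments.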
📘 Detailed Summary
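Motivation: Building generalist embodied agents requires a unified system that can interpret multimodal goals, model environment dynamics, and execute reliable actions. Combining multimodal large language models (MLLMs) with world models (WMs) raises two key challenges: establishing a tight coupling between the MLLM's semantic intent and the dynamic state representations in the WM's latent space, and achieving task-aware adaptability that supports multi-task learning and cross-environment generalization.
Method: The BiTAgent framework establishes a bidirectional coupling. A forward path injects MLLM representations into the WM's latent space for semantically guided imagination, and a backward path uses WM-generated feedback to refine the MLLM's semantic space via dense text-conditioned rewards. The framework has three synergistic components: Task-Aware Dynamic Joint Learning, Task-Aware Behavior Learning, and MLLM-WM Joint Optimization, which together harmonize semantic reasoning and dynamic prediction.
Result: Extensive experiments in multi-task and cross-environment settings show that BiTAgent offers better stability and generalization than state-of-the-art baselines, marking a step toward open-world embodied learning.
Conclusion: Through its bidirectional coupling mechanism, BiTAgent addresses semantic-dynamic alignment in MLLM-WM integration and provides a scalable framework for open-world embodied intelligence. Its task-aware design supports multi-task learning and cross-environment generalization and points toward more general embodied agents.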
📄 Abstract
Building generalist embodied agents requires a unified system that can interpret multimodal goals, model environment dynamics, and execute reliable actions across diverse real-world tasks. Multimodal large language models (MLLMs) offer strong semantic priors and cross-modal generalization, while world models (WMs) provide actionable latent dynamics for prediction and control. Their combination holds promise for open-ended embodied intelligence, yet introduces two key challenges: (1) establishing a tight coupling between the semantic intent from MLLMs and the dynamic state representations within the WM's latent space, and (2) achieving task-aware adaptability that supports multi-task learning and cross-environment generalization. To address these limitations, we propose BiTAgent, a task-aware dynamic joint framework that enables bidirectional coupling between MLLMs and WMs. BiTAgent establishes two complementary pathways: a forward path that injects MLLM representations into the WM's latent space for semantically guided imagination, and a backward path where WM-generated feedback refines the MLLM's semantic space via dense text-conditioned rewards. This bidirectional interaction is realized through three synergistic components: Task-Aware Dynamic Joint Learning, Task-Aware Behavior Learning, and MLLM-WM Joint Optimization, which together harmonize semantic reasoning and dynamic prediction. Extensive experiments across multi-task and cross-environment settings demonstrate superior stability and generalization over state-of-the-art baselines, marking a step toward open-ended embodied learning.
[52] SlideGen: Collaborative Multimodal Agents for Scientific Slide Generation
Xin Liang, Xiang Zhang, Yiwei Xu, Siqi Sun, Chenyu You
🧩 TL;DR
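This paper proposes SlideGen, an agentic framework for generating academic slides from scientific papers. Through collaborative reasoning among vision-language agents, it produces editable PPTX slides with both logical flow and visual appeal, and outperforms existing methods across multiple benchmarks.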
📘 Detailed Summary
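Motivation: Existing methods largely reduce academic slide generation to text-only summarization, overlooking the visual component and the design-intensive nature of slide creation. As a result they fail to meet the multimodal reasoning demands of turning a scientific paper into slides, a gap this work aims to close.
Method: SlideGen is an agentic, modular, visual-in-the-loop framework. It orchestrates several vision-language agents that reason collaboratively over document structure and semantics, and integrates outlining, content mapping, layout arrangement, note synthesis, and iterative refinement to generate editable slides in PPTX format.
Result: Across diverse benchmarks and strong baselines, SlideGen outperforms existing methods in visual quality, content faithfulness, and readability, consistently produces slides of expert-level quality, and sets a new state of the art in automated slide generation.
Conclusion: The work lays a foundation for design-aware multimodal slide generation and shows how agentic collaboration can bridge understanding and presentation in complex multimodal reasoning tasks, offering a new paradigm for automated creation of academic presentations.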
📄 Abstract
Generating academic slides from scientific papers is a challenging multimodal reasoning task that requires both long-context understanding and deliberate visual planning. Existing approaches largely reduce it to text-only summarization, overlooking the visual component and design-intensive nature of slide creation. In this paper we introduce SlideGen, an agentic, modular, and visual-in-the-loop framework for scientific paper-to-slide generation. SlideGen orchestrates a group of vision-language agents that reason collaboratively over the document structure and semantics, producing editable PPTX slides with logical flow and compelling visual presentation. By integrating coordinated outlining, mapping, arrangement, note synthesis, and iterative refinement, our system consistently delivers slides of expert-level quality. Across diverse benchmarks and strong baselines, SlideGen outperforms existing methods in visual quality, content faithfulness, and readability, positioning it as the new state of the art in automated slide generation. Our work establishes a foundation for design-aware multimodal slide generation, demonstrating how agentic collaboration can bridge understanding and presentation in complex multimodal reasoning tasks.
[53] Are LLMs Truly Multilingual? Exploring Zero-Shot Multilingual Capability of LLMs for Information Retrieval: An Italian Healthcare Use Case
Vignesh Kumar Kembu, Pierandrea Morandini, Marta Bianca Maria Ranzini, Antonino Nocera
🧩 TL;DR
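This study evaluates the ability of open-source multilingual large language models to extract information in real time from Italian electronic health records, focusing on comorbidity extraction, and reveals performance limitations and generalization challenges in zero-shot, on-premises settings.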
📘 Detailed Summary
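Motivation: Information extraction from clinical records is a key task in digital healthcare. Traditional NLP techniques fall short because of the complexity, variability, and dense semantics of clinical language, so the study explores how well large language models can understand Italian electronic health records and extract information from them in real time.
Method: The study evaluates open-source multilingual large language models on their understanding of Italian electronic health records in zero-shot, on-premises settings, focusing on comorbidity extraction, and compares them against native pattern matching and manual annotations.
Result: The experiments show that some large language models perform poorly in zero-shot, on-premises settings, that performance varies significantly across models, and that generalization across diseases is limited, leaving a clear gap relative to native pattern matching and manual annotations.
Conclusion: The study exposes the practical limitations of open-source multilingual large language models for clinical information extraction. It stresses that healthcare applications must account for the deployment environment, language specificity, and generalization across diseases, and offers a useful reference for future medical NLP systems.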
📄 Abstract
Large Language Models (LLMs) have become a key topic in AI and NLP, transforming sectors like healthcare, finance, education, and marketing by improving customer service, automating tasks, providing insights, improving diagnostics, and personalizing learning experiences. Information extraction from clinical records is a crucial task in digital healthcare. Although traditional NLP techniques have been used for this in the past, they often fall short due to the complexity and variability of clinical language and the dense semantics of free clinical text. Recently, LLMs have become a powerful tool for understanding and generating human-like text, making them highly effective in this area. In this paper, we explore the ability of open-source multilingual LLMs to understand EHRs (Electronic Health Records) in Italian and help extract information from them in real time. Our detailed experimental campaign on comorbidity extraction from EHRs reveals that some LLMs struggle in zero-shot, on-premises settings, and others show significant variation in performance, struggling to generalize across various diseases when compared to native pattern matching and manual annotations.
[54] Neural Decoding of Overt Speech from ECoG Using Vision Transformers and Contrastive Representation Learning
Mohamed Baha Ben Ticha, Xingchen Ran, Guillaume Saldanha, Gaël Le Godais, Philémon Roussel, Marc Aubert, Amina Fontanell, Thomas Costecalde, Lucas Struber, Serpil Karakas, Shaomin Zhang, Philippe Kahane, Guillaume Charvet, Stéphan Chabardès, Blaise Yvert
🧩 TL;DR
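This study proposes an offline speech decoding pipeline based on an encoder-decoder deep neural architecture and makes a first attempt to decode speech from a fully implantable, wireless epidural recording system, offering new perspectives for brain-computer interfaces intended for long-term use.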
📘 Detailed Summary
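Motivation: A key current challenge for speech brain-computer interfaces is streaming speech reconstruction, in particular regressing directly from surface ECoG recordings to acoustic speech. This has recently been achieved with intracortical data, but further work is needed to obtain comparable results with surface ECoG recordings, where optimizing the neural decoder becomes critical.
Method: The study proposes an offline speech decoding pipeline based on an encoder-decoder deep neural architecture that integrates Vision Transformers and contrastive learning to enhance the direct regression of speech from ECoG signals. The approach is evaluated on two datasets: one recorded with clinical subdural electrodes in an epileptic patient, and another recorded with the fully implantable WIMAGINE epidural system in a participant of a motor BCI trial.
Result: The approach was evaluated on two different datasets, clinical subdural electrode data and data from the fully implantable WIMAGINE epidural system. To the authors' knowledge, this is the first attempt to decode speech from a fully implantable, wireless epidural recording system, and it indicates feasibility for brain-computer interfaces intended for long-term use.
Conclusion: The study presents a first attempt to decode speech from a fully implantable, wireless epidural system, a promising perspective for long-term brain-computer interface applications. The encoder-decoder architecture integrating Vision Transformers and contrastive learning shows potential for optimizing direct regression from ECoG signals to speech, and lays groundwork for future streaming speech reconstruction systems.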
📄 Abstract
Speech Brain-Computer Interfaces (BCIs) offer promising solutions to people with severe paralysis who are unable to communicate. A number of recent studies have demonstrated convincing reconstruction of intelligible speech from surface electrocorticographic (ECoG) or intracortical recordings by predicting a series of phonemes or words and using downstream language models to obtain meaningful sentences. A current challenge is to reconstruct speech in a streaming mode by directly regressing cortical signals into acoustic speech. While this has been achieved recently using intracortical data, further work is needed to obtain comparable results with surface ECoG recordings. In particular, optimizing neural decoders becomes critical in this case. Here we present an offline speech decoding pipeline based on an encoder-decoder deep neural architecture, integrating Vision Transformers and contrastive learning to enhance the direct regression of speech from ECoG signals. The approach is evaluated on two datasets, one obtained with clinical subdural electrodes in an epileptic patient, and another obtained with the fully implantable WIMAGINE epidural system in a participant of a motor BCI trial. To our knowledge, this is a first attempt to decode speech from a fully implantable and wireless epidural recording system, offering perspectives for long-term use.
[55] STELLA: Guiding Large Language Models for Time Series Forecasting with Semantic Abstractions
Junjie Fan, Hongye Zhao, Linduo Wei, Jiayu Rao, Guijia Li, Jiaxin Yuan, Wenqi Xu, Yong Qi
🧩 TL;DR
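This paper proposes the STELLA framework, which uses a dynamic semantic abstraction mechanism to decompose time series into trend, seasonality, and residual components and generates hierarchical semantic anchors to strengthen LLM-based time series forecasting, outperforming existing methods on multiple benchmark datasets.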
📘 Detailed Summary
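Motivation: Existing LLM-based time series forecasting methods fail to effectively enrich the information in raw series, leaving LLM reasoning capabilities underutilized. Existing prompting strategies rely on static correlations rather than generative interpretations of dynamic behavior, and lack critical global and instance-specific context.
Method: STELLA employs a dynamic semantic abstraction mechanism that decouples the input series into trend, seasonality, and residual components, then translates the intrinsic behavioral features of these components into hierarchical semantic anchors: a corpus-level semantic prior for global context and a fine-grained behavioral prompt for instance-level patterns. These anchors serve as prefix prompts that guide the LLM to model intrinsic dynamics.
Result: Experiments on eight benchmark datasets show that STELLA outperforms state-of-the-art methods in both long- and short-term forecasting and generalizes well in zero-shot and few-shot settings. Ablation studies further validate the effectiveness of the dynamically generated semantic anchors.
Conclusion: By systematically mining and injecting structured supplementary information, STELLA addresses the insufficient information enrichment of LLMs in time series forecasting. Its dynamic semantic abstraction mechanism supplies the LLM with key global and instance-specific context, improves forecasting performance, and suggests a new way to combine time series analysis with LLMs.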
📄 Abstract
Recent adaptations of Large Language Models (LLMs) for time series forecasting often fail to effectively enhance information for raw series, leaving LLM reasoning capabilities underutilized. Existing prompting strategies rely on static correlations rather than generative interpretations of dynamic behavior, lacking critical global and instance-specific context. To address this, we propose STELLA (Semantic-Temporal Alignment with Language Abstractions), a framework that systematically mines and injects structured supplementary and complementary information. STELLA employs a dynamic semantic abstraction mechanism that decouples input series into trend, seasonality, and residual components. It then translates intrinsic behavioral features of these components into Hierarchical Semantic Anchors: a Corpus-level Semantic Prior (CSP) for global context and a Fine-grained Behavioral Prompt (FBP) for instance-level patterns. Using these anchors as prefix-prompts, STELLA guides the LLM to model intrinsic dynamics. Experiments on eight benchmark datasets demonstrate that STELLA outperforms state-of-the-art methods in long- and short-term forecasting, showing superior generalization in zero-shot and few-shot settings. Ablation studies further validate the effectiveness of our dynamically generated semantic anchors.
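The decomposition step and the idea of turning component behavior into a textual prompt can be sketched as follows. This is a simplified illustration under stated assumptions: the moving-average trend, the per-phase seasonal means, the `describe` helper, and the wording of its output are all hypothetical and are not taken from the STELLA paper, which does not specify its decomposition procedure or prompt templates in the abstract.

```python
def decompose(series, period):
    """Split a series into trend, seasonal, and residual parts (additive model).

    Assumes len(series) >= period. The trend is a moving average whose window
    is clipped at the boundaries; the seasonal part is the mean detrended
    value at each position within the cycle.
    """
    n = len(series)
    half = period // 2
    trend = []
    for i in range(n):
        window = series[max(0, i - half):min(n, i + half + 1)]
        trend.append(sum(window) / len(window))
    detrended = [x - t for x, t in zip(series, trend)]
    phase_means = []
    for p in range(period):
        vals = detrended[p::period]
        phase_means.append(sum(vals) / len(vals))
    seasonal = [phase_means[i % period] for i in range(n)]
    residual = [x - t - s for x, t, s in zip(series, trend, seasonal)]
    return trend, seasonal, residual


def describe(trend, seasonal, residual, period):
    """Hypothetical instance-level prompt built from simple component statistics."""
    direction = "rising" if trend[-1] > trend[0] else "falling or flat"
    amplitude = max(seasonal) - min(seasonal)
    noise = max(abs(r) for r in residual)
    return (f"Trend is {direction}; seasonal cycle of length {period} "
            f"with amplitude {amplitude:.2f}; largest residual {noise:.2f}.")


series = [0.5 * i + [1, 0, -1, 0][i % 4] for i in range(24)]
trend, seasonal, residual = decompose(series, period=4)
prompt = describe(trend, seasonal, residual, period=4)
```

A string like `prompt` would then be prepended to the model input as a prefix, playing the role the abstract assigns to the Fine-grained Behavioral Prompt; the corpus-level prior would need statistics gathered over the whole dataset and is not shown.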
[56] ASTRIDE: A Security Threat Modeling Platform for Agentic-AI Applications
Eranga Bandara, Amin Hass, Ross Gore, Sachin Shetty, Ravi Mukkamala, Safdar H. Bouk, Xueping Liang, Ng Wee Keong, Kasun De Zoysa, Aruna Withanage, Nilaan Loganathan
🧩 TL;DR
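This paper presents ASTRIDE, an automated threat modeling platform designed for AI agent-based systems. It extends the classical STRIDE framework with a new category of AI agent-specific attacks and combines fine-tuned vision-language models with a reasoning LLM to automate the path from architecture diagrams to threat analysis end to end.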
📘 Detailed Summary
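Motivation: AI agent-based systems are increasingly important in modern software architectures but introduce new security challenges that traditional threat modeling frameworks do not capture well, including prompt injection attacks, context poisoning, model manipulation, and opaque agent-to-agent communication. This calls for threat modeling built specifically for AI agent systems.
Method: ASTRIDE extends the classical STRIDE framework with a new threat category, "A" for AI Agent-Specific Attacks, covering emerging vulnerabilities such as prompt injection, unsafe tool invocation, and reasoning subversion. It combines a consortium of fine-tuned vision-language models (VLMs) with the OpenAI-gpt-oss reasoning LLM to perform end-to-end analysis directly from visual agent architecture diagrams, such as data flow diagrams. LLM agents orchestrate the automated threat modeling process by coordinating interactions between the VLM consortium and the reasoning LLM.
Result: The evaluation indicates that ASTRIDE provides accurate, scalable, and explainable threat modeling for next-generation intelligent systems. According to the authors, it is the first framework both to extend STRIDE with AI-specific threats and to integrate fine-tuned VLMs with a reasoning LLM to fully automate diagram-driven threat modeling.
Conclusion: ASTRIDE fills a key gap in the security assessment of AI agent systems, offering a systematic, automated approach to the secure design of intelligent systems. The work provides a practical tool for emerging AI security challenges, lays groundwork for future AI security research, and underlines the importance of adapting traditional security frameworks to AI-specific threats.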
📄 Abstract
AI agent-based systems are becoming increasingly integral to modern software architectures, enabling autonomous decision-making, dynamic task execution, and multimodal interactions through large language models (LLMs). However, these systems introduce novel and evolving security challenges, including prompt injection attacks, context poisoning, model manipulation, and opaque agent-to-agent communication, that are not effectively captured by traditional threat modeling frameworks. In this paper, we introduce ASTRIDE, an automated threat modeling platform purpose-built for AI agent-based systems. ASTRIDE extends the classical STRIDE framework by introducing a new threat category, A for AI Agent-Specific Attacks, which encompasses emerging vulnerabilities such as prompt injection, unsafe tool invocation, and reasoning subversion, unique to agent-based applications. To automate threat modeling, ASTRIDE combines a consortium of fine-tuned vision-language models (VLMs) with the OpenAI-gpt-oss reasoning LLM to perform end-to-end analysis directly from visual agent architecture diagrams, such as data flow diagrams (DFDs). LLM agents orchestrate the end-to-end threat modeling automation process by coordinating interactions between the VLM consortium and the reasoning LLM. Our evaluations demonstrate that ASTRIDE provides accurate, scalable, and explainable threat modeling for next-generation intelligent systems. To the best of our knowledge, ASTRIDE is the first framework to both extend STRIDE with AI-specific threats and integrate fine-tuned VLMs with a reasoning LLM to fully automate diagram-driven threat modeling in AI agent-based applications.
[57] Chameleon: Adaptive Adversarial Agents for Scaling-Based Visual Prompt Injection in Multimodal AI Systems
M Zeeshan, Saud Satti
🧩 TL;DR
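This paper proposes Chameleon, a novel adaptive adversarial framework that targets the image-scaling preprocessing vulnerability in production vision-language models. A dynamic optimization mechanism generates adversarial examples that survive standard downscaling operations and hijack downstream execution.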
📘 Detailed Summary
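Motivation: Current multimodal AI systems, vision-language models in particular, rely heavily on preprocessing pipelines to handle inputs efficiently, but standard preprocessing operations such as image downscaling carry an overlooked security vulnerability. Existing adversarial strategies are mostly static and cannot adapt to the dynamic nature of modern agentic workflows, which motivates a new adversarial framework that adapts dynamically and exploits scaling vulnerabilities.
Method: Chameleon uses an agent-based iterative optimization mechanism that dynamically refines image perturbations based on the target model's real-time feedback, producing highly robust adversarial examples that survive standard downscaling. The framework is designed specifically to expose and exploit scaling vulnerabilities in production vision-language models, in contrast to traditional static attacks.
Result: In experiments against the Gemini 2.5 Flash model, Chameleon achieves an attack success rate of 84.5% across varying scaling factors, significantly outperforming static baseline attacks, which average only 32.1%. The attacks also compromise agentic pipelines, reducing decision-making accuracy by over 45% in multi-step tasks, which demonstrates a serious threat to real production systems.
Conclusion: The study reveals a serious security vulnerability in the scaling step of vision-language model preprocessing and shows that traditional static adversarial attacks understate the risk in dynamic agentic systems. It recommends multi-scale consistency checks as a necessary defense and stresses the urgent need to harden the preprocessing stage of production multimodal AI systems.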
📄 Abstract
Multimodal Artificial Intelligence (AI) systems, particularly Vision-Language Models (VLMs), have become integral to critical applications ranging from autonomous decision-making to automated document processing. As these systems scale, they rely heavily on preprocessing pipelines to handle diverse inputs efficiently. However, this dependency on standard preprocessing operations, specifically image downscaling, creates a significant yet often overlooked security vulnerability. While intended for computational optimization, scaling algorithms can be exploited to conceal malicious visual prompts that are invisible to human observers but become active semantic instructions once processed by the model. Current adversarial strategies remain largely static, failing to account for the dynamic nature of modern agentic workflows. To address this gap, we propose Chameleon, a novel, adaptive adversarial framework designed to expose and exploit scaling vulnerabilities in production VLMs. Unlike traditional static attacks, Chameleon employs an iterative, agent-based optimization mechanism that dynamically refines image perturbations based on the target model's real-time feedback. This allows the framework to craft highly robust adversarial examples that survive standard downscaling operations to hijack downstream execution. We evaluate Chameleon against the Gemini 2.5 Flash model. Our experiments demonstrate that Chameleon achieves an Attack Success Rate (ASR) of 84.5% across varying scaling factors, significantly outperforming static baseline attacks, which average only 32.1%. Furthermore, we show that these attacks effectively compromise agentic pipelines, reducing decision-making accuracy by over 45% in multi-step tasks. Finally, we discuss the implications of these vulnerabilities and propose multi-scale consistency checks as a necessary defense mechanism.
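On the defensive side, one way to picture a consistency check is to downscale the same image with two different resampling rules and flag inputs where the results disagree strongly. The sketch below is an assumption-laden illustration: the abstract does not define the proposed multi-scale consistency checks, so the two resamplers, the `scaling_discrepancy` score, and the grayscale list-of-lists image format are all hypothetical, and image dimensions are assumed to be divisible by the scale factor.

```python
def downscale_nearest(img, k):
    """Keep the top-left pixel of every k x k block (nearest-neighbor style)."""
    return [row[::k] for row in img[::k]]


def downscale_box(img, k):
    """Average every k x k block (box filter)."""
    out = []
    for r in range(0, len(img), k):
        out_row = []
        for c in range(0, len(img[0]), k):
            block = [img[r + i][c + j] for i in range(k) for j in range(k)]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out


def scaling_discrepancy(img, k):
    """Mean absolute difference between the two downscaled versions.

    A large value means the image looks very different depending on the
    resampling rule, which is what content hidden for one scaler would cause.
    """
    a, b = downscale_nearest(img, k), downscale_box(img, k)
    diffs = [abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


flat = [[100] * 8 for _ in range(8)]
# Mostly black image whose sampled pixels (every 4th row and column) are white.
crafted = [[255 if r % 4 == 0 and c % 4 == 0 else 0 for c in range(8)]
           for r in range(8)]
benign_score = scaling_discrepancy(flat, 4)
suspicious_score = scaling_discrepancy(crafted, 4)
```

Here the flat image yields no disagreement, while the crafted one turns fully white under nearest-neighbor sampling yet stays nearly black under box averaging. A deployed check would need a calibrated threshold and the resamplers actually used by the target pipeline.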