cs.CV
[1] Unified Text-Image Generation with Weakness-Targeted Post-Training
Jiahui Chen, Philippe Hansen-Estruch, Xiaochuang Han, Yushi Hu, Emily Dinan, Amita Kamath, Michal Drozdzal, Reyhane Askari-Hemmat, Luke Zettlemoyer, Marjan Ghazvininejad
🧩 TL;DR
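This paper presents a post-training approach to fully unified text-image generation, in which the model autonomously transitions from textual reasoning to visual synthesis within a single inference process. Reward weighting and strategically designed post-training data yield improvements on multiple T2I benchmarks.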
📘 Detailed Summary
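Motivation: Existing unified multimodal generation architectures typically rely on explicit modality switching: they generate reasoning text first and then switch manually to image generation. This separate, sequential inference process limits cross-modal coupling and prevents automatic multimodal generation, which motivates the search for fully unified text-image generation.
Method: The study uses post-training to achieve fully unified text-image generation. It examines how joint text-image generation affects T2I performance and the relative importance of each modality during post-training. It also compares post-training data strategies, including a targeted dataset that addresses specific limitations, and applies offline, reward-weighted post-training on fully self-generated synthetic data.
Result: Experiments show that a targeted dataset addressing specific limitations outperforms broad image-caption corpora and benchmark-aligned data. With reward weighting on both modalities and strategically designed post-training data, the approach improves multimodal image generation on four different T2I benchmarks.
Conclusion: The study shows that fully unified text-image generation through post-training is feasible, and that reward-weighting both modalities and strategically designing post-training data are key to improving multimodal generation. This offers a practical path toward more autonomous, more tightly coupled multimodal generation systems.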
📄 Abstract
Unified multimodal generation architectures that jointly produce text and images have recently emerged as a promising direction for text-to-image (T2I) synthesis. However, many existing systems rely on explicit modality switching, generating reasoning text before switching manually to image generation. This separate, sequential inference process limits cross-modal coupling and prohibits automatic multimodal generation. This work explores post-training to achieve fully unified text-image generation, where models autonomously transition from textual reasoning to visual synthesis within a single inference process. We examine the impact of joint text-image generation on T2I performance and the relative importance of each modality during post-training. We additionally explore different post-training data strategies, showing that a targeted dataset addressing specific limitations achieves superior results compared to broad image-caption corpora or benchmark-aligned data. Using offline, reward-weighted post-training with fully self-generated synthetic data, our approach enables improvements in multimodal image generation across four diverse T2I benchmarks, demonstrating the effectiveness of reward-weighting both modalities and strategically designed post-training data.
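The abstract above mentions offline, reward-weighted post-training with rewards on both modalities but gives no formula. The following is a minimal sketch of one way such a loss could look, not the authors' implementation. The softmax weighting, the temperature parameter, and the dictionary keys (`text_nll`, `image_nll`, `text_reward`, `image_reward`) are all assumptions made for illustration.

```python
import math


def reward_weights(rewards, temperature=1.0):
    """Turn raw rewards into normalized weights with a softmax.

    The softmax is an assumed choice; the abstract does not say how
    rewards are converted into weights.
    """
    highest = max(rewards)
    exps = [math.exp((r - highest) / temperature) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]


def reward_weighted_nll(samples, temperature=1.0):
    """Weighted negative log-likelihood over self-generated samples.

    Each sample is a dict with hypothetical keys: the text and image
    NLL of the sample under the model, and one reward per modality.
    Text and image terms are weighted by their own rewards.
    """
    w_text = reward_weights([s["text_reward"] for s in samples], temperature)
    w_image = reward_weights([s["image_reward"] for s in samples], temperature)
    return sum(
        wt * s["text_nll"] + wi * s["image_nll"]
        for wt, wi, s in zip(w_text, w_image, samples)
    )
```

Because the weights are computed once from stored rewards, the procedure stays offline: no new samples are drawn from the model during the update.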
[2] PackCache: A Training-Free Acceleration Method for Unified Autoregressive Video Generation via Compact KV-Cache
Kunyang Li, Mubarak Shah, Yuzhang Shang
🧩 TL;DR
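This paper proposes PackCache, a training-free KV-cache management method that dynamically compacts the KV cache of unified autoregressive video generation models, addressing the inference-efficiency bottleneck caused by the KV cache growing linearly with the generated sequence length.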
📘 Detailed Summary
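Motivation: Unified autoregressive video generation models rely on the KV-cache mechanism to reduce attention computation from O(T²) to O(T), but the KV-cache size grows linearly with the number of generated tokens and becomes the main bottleneck limiting inference efficiency and generation length. The analysis finds that KV-cache tokens have distinct spatiotemporal properties: text and conditioning-image tokens act as persistent semantic anchors that consistently receive high attention, whereas attention to previous frames naturally decays with temporal distance.
Method: PackCache dynamically compacts the KV cache through three coordinated mechanisms: condition anchoring, which preserves semantic references; cross-frame decay modeling, which allocates cache budget according to temporal distance; and spatially preserving position embedding, which maintains coherent 3D structure when cache entries are removed. The method needs no additional training and applies directly to existing unified autoregressive video generation models.
Result: On 48-frame long sequences, PackCache accelerates end-to-end generation by 1.7-2.2x. For the final four frames, which are the most affected by KV-cache growth and are the most expensive part of the clip, it achieves 2.6x and 3.7x speedups on A40 and H200, respectively. The method substantially improves the efficiency of long-sequence video generation.
Conclusion: By exploiting the spatiotemporal properties of KV-cache tokens, PackCache effectively addresses the inference-efficiency bottleneck in unified autoregressive video generation. It shows the value of managing the cache dynamically without compromising generation quality, offers a practical optimization for long-sequence video generation, and has the potential to extend to other multimodal generation tasks.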
📄 Abstract
A unified autoregressive model is a Transformer-based framework that addresses diverse multimodal tasks (e.g., text, image, video) as a single sequence modeling problem under a shared token space. Such models rely on the KV-cache mechanism to reduce attention computation from O(T^2) to O(T); however, KV-cache size grows linearly with the number of generated tokens, and it rapidly becomes the dominant bottleneck limiting inference efficiency and generative length. Unified autoregressive video generation inherits this limitation. Our analysis reveals that KV-cache tokens exhibit distinct spatiotemporal properties: (i) text and conditioning-image tokens act as persistent semantic anchors that consistently receive high attention, and (ii) attention to previous frames naturally decays with temporal distance. Leveraging these observations, we introduce PackCache, a training-free KV-cache management method that dynamically compacts the KV cache through three coordinated mechanisms: condition anchoring that preserves semantic references, cross-frame decay modeling that allocates cache budget according to temporal distance, and spatially preserving position embedding that maintains coherent 3D structure under cache removal. In terms of efficiency, PackCache accelerates end-to-end generation by 1.7-2.2x on 48-frame long sequences, showcasing its strong potential for enabling longer-sequence video generation. Notably, for the final four frames, the portion most impacted by the progressively expanding KV cache and thus the most expensive segment of the clip, PackCache delivers 2.6x and 3.7x acceleration on A40 and H200, respectively, for 48-frame videos.
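To make condition anchoring and cross-frame decay concrete, here is a minimal sketch of a cache-budget allocator. It is an illustration under stated assumptions, not PackCache's actual algorithm: the exponential decay form, the `decay` rate, and the function names are hypothetical, and the abstract does not specify how the budget is computed.

```python
import math


def allocate_frame_budgets(num_prev_frames, total_budget, tokens_per_frame, decay=0.5):
    """Split a token budget across previously generated frames.

    Frame 0 is the oldest; the most recent frame has temporal distance 1.
    Weights fall off as exp(-decay * distance), an assumed decay shape.
    Flooring and the per-frame cap may leave part of the budget unused.
    """
    if num_prev_frames == 0:
        return []
    distances = [num_prev_frames - i for i in range(num_prev_frames)]
    weights = [math.exp(-decay * d) for d in distances]
    total_weight = sum(weights)
    return [
        min(tokens_per_frame, int(total_budget * w / total_weight))
        for w in weights
    ]


def compact_cache_size(num_condition_tokens, frame_budgets):
    """Condition (text and conditioning-image) tokens are always kept
    in full; only frame tokens are subject to the budget."""
    return num_condition_tokens + sum(frame_budgets)
```

With this shape, recent frames keep most of their tokens while distant frames shrink toward a small residue, so the cache no longer grows linearly with the number of generated frames.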
[3] Combining facial videos and biosignals for stress estimation during driving
Paraskevi Valergaki, Vassilis C. Nicodemou, Iason Oikonomidis, Antonis Argyros, Anastasios Roussos
🧩 TL;DR
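This study proposes a driving-stress recognition method based on disentangled 3D facial geometry features and cross-modal attention. It analyzes 3D expression and pose coefficients extracted with EMOCA and combines them with a Transformer-based temporal modeling framework to detect stress states with high accuracy.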
📘 Detailed Summary
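Motivation: Existing stress recognition methods based on Facial Action Units are challenged by the subjective nature of stress and by voluntary facial control, while the role of disentangled 3D facial geometry in stress recognition remains underexplored. This study addresses that gap, focusing on stress detection during distracted driving.
Method: The study uses the EMOCA model to extract 3D expression and pose coefficients and applies paired hypothesis tests to analyze differences between baseline and stressor phases. Building on this, it constructs a Transformer-based temporal modeling framework and evaluates three strategies: unimodal, early fusion, and cross-modal attention, with particular attention to cross-modal attention fusion of EMOCA features with physiological signals and gaze signals.
Result: 41 of the 56 EMOCA coefficients show significant, consistent responses during the stressor phases, comparable to physiological markers. Cross-modal attention fusion performs best: EMOCA fused with physiological signals reaches an AUROC of 92% and an accuracy of 86.7%, and EMOCA fused with gaze signals is also competitive with an AUROC of 91.8%.
Conclusion: The study confirms the effectiveness of disentangled 3D facial geometry features for stress recognition and shows the advantage of cross-modal attention in temporal modeling. The method offers a new approach to vision-based stress detection that is particularly suited to practical applications such as driver monitoring, and it indicates that multimodal fusion can substantially improve recognition performance.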
📄 Abstract
Reliable stress recognition from facial videos is challenging due to stress's subjective nature and voluntary facial control. While most methods rely on Facial Action Units, the role of disentangled 3D facial geometry remains underexplored. We address this by analyzing stress during distracted driving using EMOCA-derived 3D expression and pose coefficients. Paired hypothesis tests between baseline and stressor phases reveal that 41 of 56 coefficients show consistent, phase-specific stress responses comparable to physiological markers. Building on this, we propose a Transformer-based temporal modeling framework and assess unimodal, early-fusion, and cross-modal attention strategies. Cross-Modal Attention fusion of EMOCA and physiological signals achieves best performance (AUROC 92%, Accuracy 86.7%), with EMOCA-gaze fusion also competitive (AUROC 91.8%). This highlights the effectiveness of temporal modeling and cross-modal attention for stress recognition.
[4] 3D-Agent: Tri-Modal Multi-Agent Collaboration for Scalable 3D Object Annotation
Jusheng Zhang, Yijia Fan, Zimo Wen, Jian Wang, Keze Wang
🧩 TL;DR
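This paper proposes Tri-MARF, a framework that integrates tri-modal inputs (2D multi-view images, textual descriptions, and 3D point clouds) within a multi-agent collaborative architecture, substantially improving the accuracy and efficiency of large-scale 3D object annotation.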
📘 Detailed Summary
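Motivation: 3D object annotation, used in applications such as autonomous driving, robotics, and augmented reality, faces challenges including spatial complexity, occlusion, and viewpoint inconsistency. Existing single-model approaches often struggle to address these issues, so a stronger multimodal collaborative framework is needed to improve annotation performance.
Method: Tri-MARF integrates tri-modal inputs (2D multi-view images, textual descriptions, and 3D point clouds) within a multi-agent collaborative architecture made up of three specialized agents: a vision-language model agent that generates multi-view descriptions, an information aggregation agent that selects the optimal descriptions, and a gating agent that aligns textual semantics with 3D geometry for refined annotation.
Result: Extensive experiments on Objaverse-LVIS, Objaverse-XL, and ABO show that Tri-MARF substantially outperforms existing methods, reaching a CLIPScore of 88.7, ViLT R@5 retrieval accuracy of 45.2% and 43.8%, and a throughput of up to 12,000 objects per hour on a single NVIDIA A100 GPU.
Conclusion: The study demonstrates the effectiveness of combining multimodal inputs with a multi-agent collaborative architecture for 3D object annotation, provides a new framework for understanding complex 3D scenes, and shows the potential for efficient large-scale annotation in practice.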
📄 Abstract
Driven by applications in autonomous driving, robotics, and augmented reality, 3D object annotation presents challenges beyond 2D annotation, including spatial complexity, occlusion, and viewpoint inconsistency. Existing approaches based on single models often struggle to address these issues effectively. We propose Tri-MARF, a novel framework that integrates tri-modal inputs, including 2D multi-view images, textual descriptions, and 3D point clouds, within a multi-agent collaborative architecture to enhance large-scale 3D annotation. Tri-MARF consists of three specialized agents: a vision-language model agent for generating multi-view descriptions, an information aggregation agent for selecting optimal descriptions, and a gating agent that aligns textual semantics with 3D geometry for refined captioning. Extensive experiments on Objaverse-LVIS, Objaverse-XL, and ABO demonstrate that Tri-MARF substantially outperforms existing methods, achieving a CLIPScore of 88.7 compared to prior state-of-the-art methods, retrieval accuracy of 45.2 and 43.8 on ViLT R@5, and a throughput of up to 12,000 objects per hour on a single NVIDIA A100 GPU.
[5] Addressing Overthinking in Large Vision-Language Models via Gated Perception-Reasoning Optimization
Xingjian Diao, Zheyuan Liu, Chunhui Zhang, Weiyi Wu, Keyi Kong, Lin Shi, Kaize Ding, Soroush Vosoughi, Jiang Gui
🧩 TL;DR
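This paper proposes GPRO (Gated Perception-Reasoning Optimization), a meta-reasoning controller that addresses overthinking in large vision-language models by dynamically routing computation, improving task accuracy while substantially reducing computational cost.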
📘 Detailed Summary
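Motivation: Large vision-language models show strong reasoning through chain-of-thought mechanisms, but such slow-thinking approaches often lead to overthinking: the model produces excessively verbose responses even for simple queries, causing test-time inefficiency and even degraded accuracy. Prior work has tried to mitigate this with adaptive reasoning strategies, but these methods largely overlook a fundamental bottleneck, namely visual perception failures. The authors argue that stable reasoning critically depends on low-level visual grounding and that reasoning errors often originate from imperfect perception rather than insufficient deliberation.
Method: To address this limitation, the authors propose GPRO (Gated Perception-Reasoning Optimization), a meta-reasoning controller that at each generation step dynamically routes computation among three decision paths: a lightweight fast path, a slow perception path for re-examining visual inputs, and a slow reasoning path for internal self-reflection. To learn this distinction, they derive large-scale failure-attribution supervision from approximately 790k samples, using teacher models to distinguish perceptual hallucinations from reasoning errors. The controller is then trained with multi-objective reinforcement learning to optimize the trade-off between task accuracy and computational cost under uncertainty.
Result: Experiments on five benchmarks show that GPRO substantially improves both accuracy and efficiency, outperforming recent slow-thinking methods while generating significantly shorter responses. The method reduces computational overhead while maintaining high task accuracy, balancing accuracy and efficiency.
Conclusion: The study shows that visual perception failures, not just insufficient reasoning ability, are an important source of reasoning errors in large vision-language models. Through dynamic routing, GPRO jointly optimizes perception and reasoning, offering a new paradigm for building more efficient and more accurate vision-language systems. The work underlines the importance of integrating low-level visual grounding with high-level cognitive processes in complex reasoning tasks and informs the design of future adaptive-computation architectures.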
📄 Abstract
Large Vision-Language Models (LVLMs) have exhibited strong reasoning capabilities through chain-of-thought mechanisms that generate step-by-step rationales. However, such slow-thinking approaches often lead to overthinking, where models produce excessively verbose responses even for simple queries, resulting in test-time inefficiency and even degraded accuracy. Prior work has attempted to mitigate this issue via adaptive reasoning strategies, but these methods largely overlook a fundamental bottleneck: visual perception failures. We argue that stable reasoning critically depends on low-level visual grounding, and that reasoning errors often originate from imperfect perception rather than insufficient deliberation. To address this limitation, we propose Gated Perception-Reasoning Optimization (GPRO), a meta-reasoning controller that dynamically routes computation among three decision paths at each generation step: a lightweight fast path, a slow perception path for re-examining visual inputs, and a slow reasoning path for internal self-reflection. To learn this distinction, we derive large-scale failure attribution supervision from approximately 790k samples, using teacher models to distinguish perceptual hallucinations from reasoning errors. We then train the controller with multi-objective reinforcement learning to optimize the trade-off between task accuracy and computational cost under uncertainty. Experiments on five benchmarks demonstrate that GPRO substantially improves both accuracy and efficiency, outperforming recent slow-thinking methods while generating significantly shorter responses.
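The routing idea in the abstract above can be pictured with a small sketch. It is only a toy under stated assumptions: the per-path costs, the argmax gate, and the scalar reward that subtracts a cost penalty from correctness are hypothetical stand-ins, since the abstract does not describe the controller's architecture or the exact multi-objective reward.

```python
# The three decision paths named in the abstract.
PATHS = ("fast", "slow_perception", "slow_reasoning")

# Hypothetical relative compute cost of taking each path for one step.
PATH_COST = {"fast": 1.0, "slow_perception": 3.0, "slow_reasoning": 4.0}


def route(gate_logits):
    """Pick the path with the highest gate logit for this generation step.

    A greedy argmax is an assumed decision rule; a trained controller
    could also sample from the gate distribution.
    """
    best = max(range(len(PATHS)), key=lambda i: gate_logits[i])
    return PATHS[best]


def episode_reward(correct, path_trace, cost_weight=0.05):
    """Scalarized accuracy-versus-cost reward for one response.

    `cost_weight` is a hypothetical trade-off coefficient: a correct
    answer earns 1.0, and every step pays for the path it used.
    """
    accuracy_term = 1.0 if correct else 0.0
    cost_term = sum(PATH_COST[p] for p in path_trace)
    return accuracy_term - cost_weight * cost_term
```

Under a reward of this shape, the controller is pushed toward the fast path whenever the slow paths do not change the answer, which is one way to read the reported shorter responses.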
[6] UniDrive-WM: Unified Understanding, Planning and Generation World Model For Autonomous Driving
Zhexiao Xiong, Xin Ye, Burhan Yaman, Sheng Cheng, Yiren Lu, Jingru Luo, Nathan Jacobs, Liu Ren
🧩 TL;DR
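This paper proposes UniDrive-WM, a unified VLM-based world model that integrates driving-scene understanding, trajectory planning, and trajectory-conditioned future image generation into a single architecture, substantially improving autonomous driving performance.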
📘 Detailed Summary
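Motivation: Current autonomous driving systems typically treat perception, prediction, and planning as separate modules, and this separation limits system performance. Existing VLM-based planning methods do not integrate these functions, so a unified world model is needed that jointly performs driving-scene understanding, trajectory planning, and future image generation.
Method: UniDrive-WM uses a unified VLM architecture. Its trajectory planner predicts a future trajectory, which then conditions a VLM-based image generator to produce plausible future frames. These predictions provide additional supervisory signals that enhance scene understanding and iteratively refine trajectory generation. The study also compares discrete and continuous output representations for future image prediction and analyzes their influence on downstream driving performance.
Result: On the Bench2Drive benchmark, UniDrive-WM produces high-fidelity future images and improves planning performance by 5.9% in L2 trajectory error and 9.2% in collision rate over the previous best method. The results demonstrate the model's advantages in joint reasoning, planning, and generative modeling.
Conclusion: The study shows that tightly integrating VLM-driven reasoning, planning, and generative world modeling brings clear benefits for autonomous driving. The unified architecture gains additional supervision from trajectory-conditioned image generation and uses it to iteratively refine performance, offering a new paradigm for designing autonomous-driving world models.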
📄 Abstract
World models have become central to autonomous driving, where accurate scene understanding and future prediction are crucial for safe control. Recent work has explored using vision-language models (VLMs) for planning, yet existing approaches typically treat perception, prediction, and planning as separate modules. We propose UniDrive-WM, a unified VLM-based world model that jointly performs driving-scene understanding, trajectory planning, and trajectory-conditioned future image generation within a single architecture. UniDrive-WM's trajectory planner predicts a future trajectory, which conditions a VLM-based image generator to produce plausible future frames. These predictions provide additional supervisory signals that enhance scene understanding and iteratively refine trajectory generation. We further compare discrete and continuous output representations for future image prediction, analyzing their influence on downstream driving performance. Experiments on the challenging Bench2Drive benchmark show that UniDrive-WM produces high-fidelity future images and improves planning performance by 5.9% in L2 trajectory error and 9.2% in collision rate over the previous best method. These results demonstrate the advantages of tightly integrating VLM-driven reasoning, planning, and generative world modeling for autonomous driving. The project page is available at https://unidrive-wm.github.io/UniDrive-WM .
[7] Vision-Language Agents for Interactive Forest Change Analysis
James Brock, Ce Zhang, Nantheera Anantrasirichai
🧩 TL;DR
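This paper proposes an LLM-driven agent for integrated forest change analysis that supports natural-language querying across multiple tasks, and introduces the Forest-Change dataset with multi-granularity semantic change captions, improving the accessibility and interpretability of forest change monitoring.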
📘 Detailed Summary
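Motivation: Forest monitoring faces two persistent challenges: pixel-level change detection and semantic change captioning for complex forest dynamics. Although large language models are being adapted for interactive data exploration, their integration with vision-language models for remote sensing image change interpretation remains underexplored. This study aims to fill that gap.
Method: The study proposes an LLM-driven agent for integrated forest change analysis. The system builds on a multi-level change interpretation vision-language backbone and uses an LLM for task orchestration. It also introduces the Forest-Change dataset, which comprises bi-temporal satellite imagery, pixel-level change masks, and multi-granularity semantic change captions generated by combining human annotation with rule-based methods.
Result: The proposed system achieves 67.10% mIoU and 40.17% BLEU-4 on the Forest-Change dataset, and 88.13% and 34.41%, respectively, on LEVIR-MCI-Trees, a tree-focused subset of the LEVIR-MCI benchmark for joint change detection and captioning, validating the system's effectiveness.
Conclusion: The study demonstrates the potential of interactive, LLM-driven remote sensing image change interpretation systems to improve the accessibility, interpretability, and efficiency of forest change analysis, and provides a new technical framework at the intersection of remote sensing and natural language processing. All data and code are publicly available to support follow-up research.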
📄 Abstract
Modern forest monitoring workflows increasingly benefit from the growing availability of high-resolution satellite imagery and advances in deep learning. Two persistent challenges in this context are accurate pixel-level change detection and meaningful semantic change captioning for complex forest dynamics. While large language models (LLMs) are being adapted for interactive data exploration, their integration with vision-language models (VLMs) for remote sensing image change interpretation (RSICI) remains underexplored. To address this gap, we introduce an LLM-driven agent for integrated forest change analysis that supports natural language querying across multiple RSICI tasks. The proposed system builds upon a multi-level change interpretation (MCI) vision-language backbone with LLM-based orchestration. To facilitate adaptation and evaluation in forest environments, we further introduce the Forest-Change dataset, which comprises bi-temporal satellite imagery, pixel-level change masks, and multi-granularity semantic change captions generated using a combination of human annotation and rule-based methods. Experimental results show that the proposed system achieves mIoU and BLEU-4 scores of 67.10% and 40.17% on the Forest-Change dataset, and 88.13% and 34.41% on LEVIR-MCI-Trees, a tree-focused subset of LEVIR-MCI benchmark for joint change detection and captioning. These results highlight the potential of interactive, LLM-driven RSICI systems to improve accessibility, interpretability, and efficiency of forest change analysis. All data and code are publicly available at https://github.com/JamesBrockUoB/ForestChat.
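The change-detection numbers above are reported as mIoU. For readers unfamiliar with the metric, here is a minimal sketch of mean intersection-over-union on flattened label masks. It follows the common textbook definition and assumes two classes (no change and change); the paper's exact evaluation code may differ, for example in how it averages across images.

```python
def mean_iou(pred, target, num_classes=2):
    """Mean intersection-over-union over classes.

    `pred` and `target` are flat sequences of integer class labels of
    equal length (for example 0 = no change, 1 = change). Classes that
    appear in neither mask are skipped rather than counted as zero.
    """
    ious = []
    for cls in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union > 0:
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0
```

For example, with `pred = [0, 0, 1, 1]` and `target = [0, 1, 1, 1]`, the no-change class scores 1/2 and the change class scores 2/3, so the mean is 7/12.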
[8] All Changes May Have Invariant Principles: Improving Ever-Shifting Harmful Meme Detection via Design Concept Reproduction
Ziyou Jiang, Mingyang Li, Junjie Wang, Yuekai Huang, Jie Huang, Zhiyuan Chang, Zhaoyang Li, Qing Wang
🧩 TL;DR
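This paper proposes RepMD, which detects ever-shifting harmful memes by reproducing the design concepts of malicious users. The method uses a Design Concept Graph to guide a multimodal large language model during detection and generalizes well to type-shifting and temporal-evolving memes.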
📘 Detailed Summary
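Motivation: Harmful memes online are type-shifting and temporal-evolving, which makes them hard to analyze and detect. Although specific memes keep changing, different memes may share invariant design principles, that is, the underlying design concepts of malicious users, which help explain why a meme is harmful and enable effective detection.
Method: RepMD first draws on attack trees to define the Design Concept Graph (DCG), which describes the steps taken to design a harmful meme. It then derives the DCG from historical memes through design step reproduction and graph pruning. Finally, it uses the DCG to guide a multimodal large language model in detecting harmful memes.
Result: RepMD achieves the highest accuracy, 81.1%, with only slight accuracy decreases when generalized to type-shifting and temporal-evolving memes. Human evaluation shows that RepMD improves the efficiency with which people discover harmful memes, to 15-30 seconds per meme.
Conclusion: By capturing the invariant design concepts of malicious users rather than specific content, the study provides an effective way to detect ever-shifting harmful memes. The Design Concept Graph framework guides the MLLM toward better detection performance and has practical value in dynamic online environments.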
📄 Abstract
Harmful memes shift constantly across Internet communities and are difficult to analyze because of their type-shifting and temporal-evolving nature. Although these memes are shifting, we find that different memes may share invariant principles, i.e., the underlying design concept of malicious users, which can help us analyze why these memes are harmful. In this paper, we propose RepMD, an ever-shifting harmful meme detection method based on design concept reproduction. We first refer to the attack tree to define the Design Concept Graph (DCG), which describes steps that people may take to design a harmful meme. Then, we derive the DCG from historical memes with design step reproduction and graph pruning. Finally, we use the DCG to guide the Multimodal Large Language Model (MLLM) to detect harmful memes. The evaluation results show that RepMD achieves the highest accuracy, 81.1%, with only slight accuracy decreases when generalized to type-shifting and temporal-evolving memes. Human evaluation shows that RepMD can improve the efficiency of human discovery of harmful memes, to 15-30 seconds per meme.
[9] MiLDEdit: Reasoning-Based Multi-Layer Design Document Editing
Zihao Lin, Wanrong Zhu, Jiuxiang Gu, Jihyung Kil, Christopher Tensmeyer, Lin Zhang, Shilong Liu, Ruiyi Zhang, Lifu Huang, Vlad I. Morariu, Tong Sun
🧩 TL;DR
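This paper proposes MiLDEAgent, a reasoning-based framework for multi-layer design document editing that combines an RL-trained multimodal reasoner with an image editor to overcome the limitations of existing methods in multi-layer document editing. It also introduces the MiLDEBench benchmark and the MiLDEEval evaluation protocol, establishing the first strong baseline for multi-layer document editing.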
📘 Detailed Summary
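Motivation: Real-world design documents such as posters are inherently multi-layered, combining decoration, text, and images. Editing them from natural-language instructions requires fine-grained, layer-aware reasoning to identify the relevant layers and coordinate modifications. Prior work has largely overlooked multi-layer design document editing and has focused on single-layer image editing or multi-layer generation, which assume a flat canvas and lack the reasoning needed to determine what to modify and where.
Method: The paper introduces the Multi-Layer Document Editing Agent (MiLDEAgent), a reasoning-based framework that combines an RL-trained multimodal reasoner for layer-aware understanding with an image editor for targeted modifications. It also builds MiLDEBench, a human-in-the-loop corpus of more than 20K design documents paired with diverse editing instructions, together with the MiLDEEval evaluation protocol, which covers four dimensions: instruction following, layout consistency, aesthetics, and text rendering.
Result: Extensive experiments on 14 open-source and 2 closed-source models show that existing approaches fail to generalize: open-source models often cannot complete multi-layer document editing tasks, while closed-source models suffer from format violations. In contrast, MiLDEAgent achieves strong layer-aware reasoning and precise editing, significantly outperforming all open-source baselines and attaining performance comparable to closed-source models, thereby establishing the first strong baseline for multi-layer document editing.
Conclusion: The study provides a systematic benchmark and evaluation framework for multi-layer design document editing and demonstrates that reasoning-based methods handle complex multi-layer editing tasks effectively. MiLDEAgent's results indicate that combining RL-trained multimodal reasoning with targeted image editing can substantially improve layer-aware editing, pointing to a direction for future document editing systems.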
📄 Abstract
Real-world design documents (e.g., posters) are inherently multi-layered, combining decoration, text, and images. Editing them from natural-language instructions requires fine-grained, layer-aware reasoning to identify relevant layers and coordinate modifications. Prior work largely overlooks multi-layer design document editing, focusing instead on single-layer image editing or multi-layer generation, which assume a flat canvas and lack the reasoning needed to determine what and where to modify. To address this gap, we introduce the Multi-Layer Document Editing Agent (MiLDEAgent), a reasoning-based framework that combines an RL-trained multimodal reasoner for layer-wise understanding with an image editor for targeted modifications. To systematically benchmark this setting, we introduce the MiLDEBench, a human-in-the-loop corpus of over 20K design documents paired with diverse editing instructions. The benchmark is complemented by a task-specific evaluation protocol, MiLDEEval, which spans four dimensions including instruction following, layout consistency, aesthetics, and text rendering. Extensive experiments on 14 open-source and 2 closed-source models reveal that existing approaches fail to generalize: open-source models often cannot complete multi-layer document editing tasks, while closed-source models suffer from format violations. In contrast, MiLDEAgent achieves strong layer-aware reasoning and precise editing, significantly outperforming all open-source baselines and attaining performance comparable to closed-source models, thereby establishing the first strong baseline for multi-layer document editing.
[10] HyperAlign: Hyperbolic Entailment Cones for Adaptive Text-to-Image Alignment Assessment
Wenzhi Chen, Bo Hu, Leida Li, Lihuo He, Wen Lu, Xinbo Gao
🧩 TL;DR
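This paper proposes HyperAlign, an adaptive text-to-image alignment assessment framework based on hyperbolic entailment geometry. It maps CLIP features into hyperbolic space and uses a dynamic-supervision entailment modeling mechanism to improve the accuracy and generalization of alignment assessment between generated images and text prompts.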
📘 Detailed Summary
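Motivation: With the rapid development of text-to-image generation, accurately assessing the alignment between generated images and text prompts has become a critical challenge. Existing methods rely on Euclidean-space metrics, neglect the structured nature of semantic alignment, and lack the ability to adapt to different samples. These limitations motivate a more effective assessment framework.
Method: HyperAlign first extracts Euclidean features with CLIP and maps them to hyperbolic space. It then applies a dynamic-supervision entailment modeling mechanism that transforms discrete entailment logic into continuous geometric structure supervision. Finally, an adaptive modulation regressor uses hyperbolic geometric features to generate sample-level modulation parameters, which adaptively calibrate the Euclidean cosine similarity to predict the final score.
Result: HyperAlign achieves highly competitive performance on both single-database evaluation and cross-database generalization, validating the effectiveness of hyperbolic geometric modeling for image-text alignment assessment and outperforming conventional methods based on Euclidean-space metrics.
Conclusion: The study shows that hyperbolic geometric modeling can capture the structured nature of semantic alignment and offers a new geometric perspective on text-image alignment assessment. The adaptive modulation mechanism handles differences between samples and opens a new research direction for quality assessment in generative AI, with both theoretical and practical value.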
📄 Abstract
With the rapid development of text-to-image generation technology, accurately assessing the alignment between generated images and text prompts has become a critical challenge. Existing methods rely on Euclidean space metrics, neglecting the structured nature of semantic alignment, while lacking adaptive capabilities for different samples. To address these limitations, we propose HyperAlign, an adaptive text-to-image alignment assessment framework based on hyperbolic entailment geometry. First, we extract Euclidean features using CLIP and map them to hyperbolic space. Second, we design a dynamic-supervision entailment modeling mechanism that transforms discrete entailment logic into continuous geometric structure supervision. Finally, we propose an adaptive modulation regressor that utilizes hyperbolic geometric features to generate sample-level modulation parameters, adaptively calibrating Euclidean cosine similarity to predict the final score. HyperAlign achieves highly competitive performance on both single database evaluation and cross-database generalization tasks, fully validating the effectiveness of hyperbolic geometric modeling for image-text alignment assessment.
[11] Agri-R1: Empowering Generalizable Agricultural Reasoning in Vision-Language Models with Reinforcement Learning
Wentao Zhang, Lifei Wang, Lina Lu, MingKun Xu, Shangyang Li, Yanchao Yang, Tao Fang
🧩 TL;DR
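This paper proposes Agri-R1, a reasoning-enhanced large model for agriculture. Through automated generation of high-quality reasoning data and a reinforcement learning optimization method, it substantially improves the accuracy and generalization of agricultural disease diagnosis while using only 19% of the available samples.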
📘 Detailed Summary
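Motivation: Agricultural disease diagnosis is challenging for vision-language models: conventional fine-tuning requires extensive labels, lacks interpretability, and generalizes poorly. Existing reasoning methods depend on costly expert annotations and struggle with the open-ended, diverse nature of agricultural queries, so a more efficient agriculture-specific reasoning-enhanced model is needed.
Method: The Agri-R1 framework automatically generates high-quality reasoning data through vision-language synthesis and LLM-based filtering, using only 19% of the available samples. Training uses Group Relative Policy Optimization (GRPO) with a novel reward function that integrates domain-specific lexicons and fuzzy matching to assess both the correctness and the linguistic flexibility of open-ended responses.
Result: On CDDMBench, the 3B-parameter model performs on par with 7B- to 13B-parameter baselines, with a +23.2% relative gain in disease recognition accuracy, +33.3% in agricultural knowledge QA, and a 26.10-point improvement in cross-domain generalization over standard fine-tuning. Ablation studies show that the synergy between structured reasoning data and GRPO-driven exploration underpins these gains.
Conclusion: The study shows that combining automated reasoning-data generation with reinforcement learning optimization effectively improves the diagnostic ability of agricultural AI models, with larger benefits as question complexity increases. The approach offers an efficient solution for resource-constrained agricultural applications and highlights the importance of domain-specific reward functions for evaluating open-ended responses.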
📄 Abstract
Agricultural disease diagnosis challenges VLMs, as conventional fine-tuning requires extensive labels, lacks interpretability, and generalizes poorly. While reasoning improves model robustness, existing methods rely on costly expert annotations and rarely address the open-ended, diverse nature of agricultural queries. To address these limitations, we propose Agri-R1, a reasoning-enhanced large model for agriculture. Our framework automates high-quality reasoning data generation via vision-language synthesis and LLM-based filtering, using only 19% of available samples. Training employs Group Relative Policy Optimization (GRPO) with a novel reward function that integrates domain-specific lexicons and fuzzy matching to assess both correctness and linguistic flexibility in open-ended responses. Evaluated on CDDMBench, our resulting 3B-parameter model achieves performance competitive with 7B- to 13B-parameter baselines, showing a +23.2% relative gain in disease recognition accuracy, +33.3% in agricultural knowledge QA, and a +26.10-point improvement in cross-domain generalization over standard fine-tuning. Ablation studies confirm that the synergy between structured reasoning data and GRPO-driven exploration underpins these gains, with benefits scaling as question complexity increases.
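Two pieces of the training recipe above lend themselves to a short sketch: a reward that tolerates wording differences, and GRPO's group-relative advantage. The advantage computation (reward minus group mean, divided by group standard deviation) follows the commonly published GRPO formulation. The reward function is only a guess at the flavor of the paper's design: the lexicon entries, the 0.8 threshold, and the use of `difflib` for fuzzy matching are all hypothetical.

```python
import difflib
import statistics

# Hypothetical domain lexicon: canonical disease name -> accepted synonyms.
LEXICON = {"late blight": ["phytophthora infestans", "potato blight"]}


def fuzzy_reward(answer, reference, lexicon=LEXICON, threshold=0.8):
    """Score an open-ended answer against a reference label.

    Returns 1.0 if the reference or a lexicon synonym appears in the
    answer, the best fuzzy-match ratio if it clears `threshold`
    (tolerating small misspellings), and 0.0 otherwise.
    """
    answer = answer.lower().strip()
    reference = reference.lower().strip()
    candidates = [reference] + lexicon.get(reference, [])
    if any(c in answer for c in candidates):
        return 1.0
    best = max(difflib.SequenceMatcher(None, answer, c).ratio() for c in candidates)
    return best if best >= threshold else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages for one group of sampled responses.

    Each response is scored relative to the other responses to the
    same prompt, so no learned value function is needed.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

In use, several responses would be sampled per question, each scored with the reward function, and the resulting advantages would weight the policy-gradient update for that group.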
[12] Forge-and-Quench: Enhancing Image Generation for Higher Fidelity in Unified Multimodal Models
Yanbing Zeng, Jia Wang, Hanghang Ma, Junqiang Wu, Jie Zhu, Xiaoming Wei, Jie Hu
🧩 TL;DR
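This paper proposes Forge-and-Quench, a unified framework that uses an understanding model to enhance the fidelity and detail richness of a generation model. An MLLM produces an enhanced text instruction, which is mapped to a Bridge Feature and injected into the T2I backbone as a visual guidance signal.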
📘 Detailed Summary
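Motivation: Integrating image generation and understanding into a single framework has become a key goal in the multimodal domain, but how understanding can effectively assist generation has not been fully explored. Existing work mainly leverages the reasoning abilities and world knowledge of understanding models, whereas this paper takes a different perspective: using understanding to enhance the fidelity and detail richness of generated images.
Method: In the Forge-and-Quench framework, an MLLM first reasons over the entire conversational context to produce an enhanced text instruction. A novel Bridge Adapter then maps this instruction to a virtual visual representation, the Bridge Feature. This feature serves as the key link that carries insights from the understanding model into the generation process, and it is injected into the T2I backbone as a visual guidance signal alongside the enhanced text instruction.
Result: Experiments show that Forge-and-Quench significantly improves image fidelity and detail across multiple models while maintaining instruction-following accuracy and enhancing the application of world knowledge. The framework is highly extensible and flexible: it migrates efficiently across different MLLM and T2I models with significant savings in training overhead and without compromising the MLLM's inherent multimodal understanding capabilities.
Conclusion: The study offers a new paradigm for understanding-assisted generation, using the Bridge Feature to connect understanding and generation. The design passes insights from the understanding model directly to the generation model, providing a technical path for unified multimodal frameworks while preserving transferability and training efficiency.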
📄 Abstract
Integrating image generation and understanding into a single framework has become a pivotal goal in the multimodal domain. However, how understanding can effectively assist generation has not been fully explored. Unlike previous works that focus on leveraging reasoning abilities and world knowledge from understanding models, this paper introduces a novel perspective: leveraging understanding to enhance the fidelity and detail richness of generated images. To this end, we propose Forge-and-Quench, a new unified framework that puts this principle into practice. In the generation process of our framework, an MLLM first reasons over the entire conversational context, including text instructions, to produce an enhanced text instruction. This refined instruction is then mapped to a virtual visual representation, termed the Bridge Feature, via a novel Bridge Adapter. This feature acts as a crucial link, forging insights from the understanding model to quench and refine the generation process. It is subsequently injected into the T2I backbone as a visual guidance signal, alongside the enhanced text instruction that replaces the original input. To validate this paradigm, we conduct comprehensive studies on the design of the Bridge Feature and Bridge Adapter. Our framework demonstrates exceptional extensibility and flexibility, enabling efficient migration across different MLLM and T2I models with significant savings in training overhead, all without compromising the MLLM's inherent multimodal understanding capabilities. Experiments show that Forge-and-Quench significantly improves image fidelity and detail across multiple models, while also maintaining instruction-following accuracy and enhancing world knowledge application. Models and codes are available at https://github.com/YanbingZeng/Forge-and-Quench.
[13] On the Holistic Approach for Detecting Human Image Forgery
Xiao Guo, Jie Zhu, Anil Jain, Xiaoming Liu
🧩 TL;DR
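This paper proposes HuForDet, a holistic framework for human image forgery detection. Its dual-branch architecture combines face forgery detection with contextual semantic-consistency analysis, addressing the fragmentation of existing methods between facial-region forgeries and full-body synthetic images.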
📘 Detailed Summary
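Motivation: The rapid advancement of AI-generated content has escalated the threat of deepfakes, from facial manipulations to the synthesis of entire photorealistic human bodies. Existing detection methods remain fragmented, specializing either in facial-region forgeries or in full-body synthetic images, and fail to generalize across the full spectrum of human image manipulations.
Method: HuForDet uses a dual-branch architecture. The face forgery detection branch employs heterogeneous experts operating in both the RGB and frequency domains, including an adaptive Laplacian-of-Gaussian module that captures artifacts ranging from fine-grained blending boundaries to coarse-scale texture irregularities. The contextualized forgery detection branch uses a multimodal large language model to analyze full-body semantic consistency and is equipped with a confidence estimation mechanism that dynamically weights its contribution during feature fusion.
Result: The authors curate a unified human image forgery dataset that combines existing face forgery data with a new corpus of full-body synthetic humans. Experiments show that HuForDet achieves state-of-the-art detection performance and superior robustness across diverse human image forgeries.
Conclusion: The study demonstrates the effectiveness of a holistic framework that combines fine-grained facial artifact detection with contextual semantic analysis, provides a unified solution to increasingly sophisticated AIGC forgery threats, and highlights the importance of multimodal analysis and adaptive feature fusion.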
📄 Abstract
The rapid advancement of AI-generated content (AIGC) has escalated the threat of deepfakes, from facial manipulations to the synthesis of entire photorealistic human bodies. However, existing detection methods remain fragmented, specializing either in facial-region forgeries or full-body synthetic images, and consequently fail to generalize across the full spectrum of human image manipulations. We introduce HuForDet, a holistic framework for human image forgery detection, which features a dual-branch architecture comprising: (1) a face forgery detection branch that employs heterogeneous experts operating in both RGB and frequency domains, including an adaptive Laplacian-of-Gaussian (LoG) module designed to capture artifacts ranging from fine-grained blending boundaries to coarse-scale texture irregularities; and (2) a contextualized forgery detection branch that leverages a Multi-Modal Large Language Model (MLLM) to analyze full-body semantic consistency, enhanced with a confidence estimation mechanism that dynamically weights its contribution during feature fusion. We curate a human image forgery (HuFor) dataset that unifies existing face forgery data with a new corpus of full-body synthetic humans. Extensive experiments show that our HuForDet achieves state-of-the-art forgery detection performance and superior robustness across diverse human image forgeries.
[14] AIVD: Adaptive Edge-Cloud Collaboration for Accurate and Efficient Industrial Visual Detection
Yunqing Hu, Zheming Yang, Chang Zhao, Qi Guo, Meng Gao, Pengcheng Li, Wen Ji
🧩 TL;DR
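This paper proposes the AIVD framework, which unifies precise object localization and high-quality semantic generation through collaboration between lightweight edge detectors and a cloud-based MLLM, and designs a heterogeneous resource-aware dynamic scheduling algorithm to optimize edge-cloud deployment efficiency.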
📘 Detailed Summary
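Motivation: Multimodal large language models excel at semantic understanding and visual reasoning, but they still face challenges in precise object localization and in resource-constrained edge-cloud deployment. The work must handle edge cropped-box noise and scenario variations, and optimize performance across heterogeneous devices and dynamic network conditions.
Method: The AIVD framework achieves unified precise localization and high-quality semantic generation through collaboration between lightweight edge detectors and a cloud-based MLLM. An efficient fine-tuning strategy with visual-semantic collaborative augmentation improves the cloud MLLM's robustness to edge cropped-box noise and scenario variations, and a heterogeneous resource-aware dynamic scheduling algorithm maintains high throughput and low latency.
Result: Experimental results show that AIVD substantially reduces resource consumption while improving the MLLM's classification performance and semantic generation quality. The proposed scheduling strategy achieves higher throughput and lower latency across diverse scenarios, improving overall system efficiency.
Conclusion: The study shows that an edge-cloud collaborative architecture with targeted optimization strategies can address the key challenges MLLMs face in precise localization and edge deployment, offering a feasible approach to deploying and optimizing multimodal AI systems in resource-constrained environments.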
📄 Abstract
Multimodal large language models (MLLMs) demonstrate exceptional capabilities in semantic understanding and visual reasoning, yet they still face challenges in precise object localization and resource-constrained edge-cloud deployment. To address this, this paper proposes the AIVD framework, which achieves unified precise localization and high-quality semantic generation through the collaboration between lightweight edge detectors and cloud-based MLLMs. To enhance the cloud MLLM's robustness against edge cropped-box noise and scenario variations, we design an efficient fine-tuning strategy with visual-semantic collaborative augmentation, significantly improving classification accuracy and semantic consistency. Furthermore, to maintain high throughput and low latency across heterogeneous edge devices and dynamic network conditions, we propose a heterogeneous resource-aware dynamic scheduling algorithm. Experimental results demonstrate that AIVD substantially reduces resource consumption while improving MLLM classification performance and semantic generation quality. The proposed scheduling strategy also achieves higher throughput and lower latency across diverse scenarios.
[15] GeM-VG: Towards Generalized Multi-image Visual Grounding with Multimodal Large Language Models
Shurong Zheng, Yousong Zhu, Hongyin Zhao, Fan Yang, Yufei Zhan, Ming Tang, Jinqiao Wang
🧩 TL;DR
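This paper proposes GeM-VG, a multimodal large language model capable of generalized multi-image visual grounding. Through a systematic task taxonomy, the large-scale MG-Data-240K dataset, and a hybrid reinforcement fine-tuning strategy, it substantially improves generalization across diverse multi-image grounding tasks.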
📘 Detailed Summary
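Motivation: Existing multi-image visual grounding methods are constrained by single-target localization and a limited range of task types, and they lack unified modeling for generalized grounding tasks. This limits the practical potential of multimodal large language models in multi-image scenarios.
Method: The study first categorizes multi-image grounding tasks systematically according to their reliance on cross-image cues and reasoning, and builds the MG-Data-240K dataset of 240K samples to address the limitations of existing datasets regarding target quantity and image relations. To handle diverse multi-image grounding tasks robustly, it proposes a hybrid reinforcement fine-tuning strategy that combines chain-of-thought reasoning with direct answering and adopts an R1-like algorithm guided by a rule-based reward, enhancing the model's overall perception and reasoning capabilities.
Result: GeM-VG outperforms the previous leading MLLMs on multi-image grounding by 2.0% on MIG-Bench and 9.7% on MC-Bench. On single-image grounding, it improves over the base model by 9.1% on ODINW. The model also retains strong capabilities in general multi-image understanding.
Conclusion: With a unified modeling framework, a large-scale dataset, and a hybrid fine-tuning strategy, the study addresses the generalization challenge of multi-image visual grounding and provides an effective solution for applying MLLMs in complex multi-image scenarios, showing that grounding performance can be improved while general understanding is preserved.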
📄 Abstract
Multimodal Large Language Models (MLLMs) have demonstrated impressive progress in single-image grounding and general multi-image understanding. Recently, some methods have begun to address multi-image grounding. However, they are constrained by single-target localization and limited types of practical tasks, due to the lack of unified modeling for generalized grounding tasks. Therefore, we propose GeM-VG, an MLLM capable of Generalized Multi-image Visual Grounding. To support this, we systematically categorize and organize existing multi-image grounding tasks according to their reliance on cross-image cues and reasoning, and introduce the MG-Data-240K dataset, addressing the limitations of existing datasets regarding target quantity and image relation. To tackle the challenges of robustly handling diverse multi-image grounding tasks, we further propose a hybrid reinforcement finetuning strategy that integrates chain-of-thought (CoT) reasoning and direct answering, considering their complementary strengths. This strategy adopts an R1-like algorithm guided by a carefully designed rule-based reward, effectively enhancing the model's overall perception and reasoning capabilities. Extensive experiments demonstrate the superior generalized grounding capabilities of our model. For multi-image grounding, it outperforms the previous leading MLLMs by 2.0% and 9.7% on MIG-Bench and MC-Bench, respectively. In single-image grounding, it achieves a 9.1% improvement over the base model on ODINW. Furthermore, our model retains strong capabilities in general multi-image understanding.
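The abstract above refers to a rule-based reward without detailing it. As a rough illustration of what such a reward can look like for grounding, the sketch below parses one box from the model's text and thresholds its IoU with the ground truth. The output format `[x1, y1, x2, y2]`, the 0.5 threshold, and the binary reward are assumptions for this example, not the paper's design, which also has to cover multi-target and multi-image cases.

```python
import re

# Assumed output format: one box written as [x1, y1, x2, y2].
_NUM = r"(\d+(?:\.\d+)?)"
BOX_PATTERN = re.compile(r"\[" + r",\s*".join([_NUM] * 4) + r"\]")


def box_iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    intersection = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def grounding_reward(response, gt_box, iou_threshold=0.5):
    """Return 1.0 for a parseable box that overlaps the target enough.

    A response with no parseable box earns 0.0, which doubles as a
    format check on the model's output.
    """
    match = BOX_PATTERN.search(response)
    if match is None:
        return 0.0
    pred_box = tuple(float(g) for g in match.groups())
    return 1.0 if box_iou(pred_box, gt_box) >= iou_threshold else 0.0
```

Because the reward is computed by fixed rules rather than a learned model, it is cheap to evaluate and hard for the policy to exploit through wording alone.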
[16] CounterVid: Counterfactual Video Generation for Mitigating Action and Temporal Hallucinations in Video-Language Models
Tobia Poppi, Burak Uzkent, Amanmeet Garg, Lucas Porto, Garin Kessler, Yezhou Yang, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara, Florian Schiffers
🧩 TL;DR
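This paper proposes a scalable counterfactual video generation framework, together with the resulting CounterVid dataset, for mitigating hallucinations in video-language models, particularly in action and temporal reasoning, and introduces MixDPO to jointly optimize over textual and visual preferences.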
📘 Detailed Summary
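Motivation: Video-language models achieve strong multimodal understanding but remain prone to hallucinations in action and temporal reasoning. Existing mitigation strategies, such as textual filtering or random video perturbations, often fail to address the root cause: over-reliance on language priors rather than fine-grained visual dynamics.
Method: The paper proposes a scalable counterfactual video generation framework that combines multimodal LLMs for action proposal and editing guidance with diffusion-based image and video models to generate semantic hard negatives at scale. It builds CounterVid, a synthetic dataset of about 26k preference pairs, and introduces MixDPO, a unified Direct Preference Optimization approach that jointly leverages textual and visual preferences.
Result: Fine-tuning Qwen2.5-VL with MixDPO yields consistent improvements, most notably in temporal ordering, and the gains transfer effectively to standard video hallucination benchmarks.
Conclusion: Counterfactual video generation combined with mixed preference optimization offers an effective route to reducing hallucinations in video-language models, particularly in action and temporal reasoning, and suggests a new direction for improving the trustworthiness of multimodal models.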
📄 Abstract
Video-language models (VLMs) achieve strong multimodal understanding but remain prone to hallucinations, especially when reasoning about actions and temporal order. Existing mitigation strategies, such as textual filtering or random video perturbations, often fail to address the root cause: over-reliance on language priors rather than fine-grained visual dynamics. We propose a scalable framework for counterfactual video generation that synthesizes videos differing only in actions or temporal structure while preserving scene context. Our pipeline combines multimodal LLMs for action proposal and editing guidance with diffusion-based image and video models to generate semantic hard negatives at scale. Using this framework, we build CounterVid, a synthetic dataset of ~26k preference pairs targeting action recognition and temporal reasoning. We further introduce MixDPO, a unified Direct Preference Optimization approach that jointly leverages textual and visual preferences. Fine-tuning Qwen2.5-VL with MixDPO yields consistent improvements, notably in temporal ordering, and transfers effectively to standard video hallucination benchmarks. Code and models will be made publicly available.
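The abstract does not spell out the MixDPO objective, so the following is only a minimal sketch of the standard Direct Preference Optimization loss for a single preference pair, plus a hypothetical weighted mix of text-side and video-side pairs. The function `mixed_preference_loss` and its `text_weight` parameter are assumptions made for illustration, not the authors' formulation.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair, given sequence log-probabilities."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written in a numerically stable form
    if logits > 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))

def mixed_preference_loss(text_pairs, visual_pairs, text_weight=0.5, beta=0.1):
    """Hypothetical mix: weighted average of DPO losses over two kinds of pairs.

    Each pair is a 4-tuple of log-probabilities in the order expected by dpo_loss.
    Assumption: text pairs share a video and differ in the response, while visual
    pairs share a response and differ in the video (original vs counterfactual).
    """
    text_loss = sum(dpo_loss(*p, beta=beta) for p in text_pairs) / len(text_pairs)
    visual_loss = sum(dpo_loss(*p, beta=beta) for p in visual_pairs) / len(visual_pairs)
    return text_weight * text_loss + (1.0 - text_weight) * visual_loss
```

When the policy and the reference agree on both responses, the loss equals log 2; it falls below that as the policy favors the chosen response more strongly than the reference does.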
[17] Measurement-Consistent Langevin Corrector: A Remedy for Latent Diffusion Inverse Solvers
Lee Hyoseok, Sohwi Lim, Eunju Cha, Tae-Hyun Oh
🧩 TL;DR
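This paper proposes the Measurement-Consistent Langevin Corrector (MCLC), a theoretically grounded plug-and-play correction module that stabilizes zero-shot inverse solvers based on latent diffusion models by reducing the discrepancy between the solver's dynamics and the true reverse diffusion dynamics.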
📘 Detailed Summary
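Motivation: Existing zero-shot inverse solvers based on latent diffusion models are unstable, producing undesirable artifacts and degraded quality. The study traces this instability to a discrepancy between the solver's dynamics and the true reverse diffusion dynamics, which calls for a more stable correction method that does not rely on linear manifold assumptions.
Method: The paper proposes the Measurement-Consistent Langevin Corrector (MCLC), a theoretically grounded plug-and-play correction module that corrects LDM-based inverse solvers through measurement-consistent Langevin updates. Unlike prior approaches that rely on linear manifold assumptions, MCLC needs no such assumption, behaves more stably and reliably in latent space, and directly targets the gap between the solver's dynamics and the true reverse diffusion dynamics.
Result: Experiments show that MCLC is effective across diverse image restoration tasks and compatible with existing solvers. The study also analyzes blob artifacts and their underlying causes, supporting MCLC's ability to improve solver stability and reduce artifacts.
Conclusion: MCLC is a key step toward more robust zero-shot inverse problem solvers. By dropping the linear manifold assumption, it provides both a theoretical basis and a practical remedy for instability in latent diffusion inverse solvers, and it offers new insight into the artifact phenomenon.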
📄 Abstract
With recent advances in generative models, diffusion models have emerged as powerful priors for solving inverse problems in various domains. Since Latent Diffusion Models (LDMs) provide generic priors, several studies have explored their potential as domain-agnostic zero-shot inverse solvers. Despite these efforts, existing latent diffusion inverse solvers suffer from instability, exhibiting undesirable artifacts and degraded quality. In this work, we first identify the instability as a discrepancy between the solver's and true reverse diffusion dynamics, and show that reducing this gap stabilizes the solver. Building on this, we introduce Measurement-Consistent Langevin Corrector (MCLC), a theoretically grounded plug-and-play correction module that remedies LDM-based inverse solvers through measurement-consistent Langevin updates. Compared to prior approaches that rely on linear manifold assumptions, which often do not hold in latent space, MCLC operates without this assumption, leading to more stable and reliable behavior. We experimentally demonstrate the effectiveness of MCLC and its compatibility with existing solvers across diverse image restoration tasks. Additionally, we analyze blob artifacts and offer insights into their underlying causes. We highlight that MCLC is a key step toward more robust zero-shot inverse problem solvers.
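For readers unfamiliar with Langevin correctors, the sketch below shows a generic unadjusted Langevin update on a toy scalar posterior, where the drift adds a prior score to a measurement (likelihood) term. It is not MCLC itself: the scalar setting, the linear measurement `y = a * x + noise`, and every name here are assumptions chosen to keep the example self-contained, whereas the paper works in the latent space of an LDM with a learned score.

```python
import math
import random

def measurement_guided_langevin(x, y, prior_score, a=1.0, sigma=1.0,
                                step=0.1, n_steps=200, noise_scale=1.0, seed=0):
    """Run Langevin updates targeting p(x | y) for the toy model y = a * x + N(0, sigma^2).

    prior_score(x) returns d/dx log p(x). Setting noise_scale=0 turns the
    sampler into plain gradient ascent on the log-posterior.
    """
    rng = random.Random(seed)
    for _ in range(n_steps):
        # Gradient of the Gaussian log-likelihood with respect to x
        likelihood_score = a * (y - a * x) / sigma ** 2
        drift = prior_score(x) + likelihood_score
        noise = noise_scale * math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
        x = x + step * drift + noise
    return x

# Standard normal prior: score is -x. With y = 2 and sigma = 1 the posterior mean is 1.
posterior_mode = measurement_guided_langevin(
    5.0, 2.0, prior_score=lambda x: -x, noise_scale=0.0)
```

With the noise switched off, the iterate contracts toward the posterior mean, which makes the role of the measurement term easy to check by hand.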
[18] Detector-Augmented SAMURAI for Long-Duration Drone Tracking
Tamara R. Lenhard, Andreas Weinmann, Hichem Snoussi, Tobias Koch
🧩 TL;DR
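This study presents the first systematic evaluation of the SAMURAI foundation model for drone tracking and proposes a detector-augmented extension that significantly improves the robustness of long-duration drone tracking in complex urban environments.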
📘 Detailed Summary
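Motivation: Detector-based drone tracking achieves strong frame-level accuracy but suffers from temporal inconsistencies and frequent detection dropouts, while research on RGB-based drone tracking remains limited and reliant on conventional motion models. Foundation models such as SAMURAI show strong category-agnostic tracking performance in other domains, yet their applicability to drone-specific scenarios had not been studied, which motivates this systematic evaluation.
Method: The study first systematically evaluates SAMURAI's potential for drone tracking, then proposes a detector-augmented extension of SAMURAI that integrates detector cues to reduce sensitivity to bounding-box initialization and sequence length, improving tracking robustness.
Result: The detector-augmented extension significantly improves tracking robustness in complex urban environments, with the largest benefits in long sequences and under drone exit and re-entry events. It yields consistent gains over SAMURAI's zero-shot performance across datasets and metrics, with success rate improvements of up to +0.393 and false negative rate (FNR) reductions of up to -0.475.
Conclusion: The study confirms the potential of foundation models for drone tracking. The detector-augmentation strategy effectively addresses sensitivity to bounding-box initialization and sequence length, providing a more reliable solution for long-duration drone surveillance in complex urban environments and a new direction for future drone tracking research.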
📄 Abstract
Robust long-term tracking of drones is a critical requirement for modern surveillance systems, given their increasing threat potential. While detector-based approaches typically achieve strong frame-level accuracy, they often suffer from temporal inconsistencies caused by frequent detection dropouts. Despite its practical relevance, research on RGB-based drone tracking is still limited and largely reliant on conventional motion models. Meanwhile, foundation models like SAMURAI have established their effectiveness across other domains, exhibiting strong category-agnostic tracking performance. However, their applicability in drone-specific scenarios has not been investigated yet. Motivated by this gap, we present the first systematic evaluation of SAMURAI's potential for robust drone tracking in urban surveillance settings. Furthermore, we introduce a detector-augmented extension of SAMURAI to mitigate sensitivity to bounding-box initialization and sequence length. Our findings demonstrate that the proposed extension significantly improves robustness in complex urban environments, with pronounced benefits in long-duration sequences - especially under drone exit-re-entry events. The incorporation of detector cues yields consistent gains over SAMURAI's zero-shot performance across datasets and metrics, with success rate improvements of up to +0.393 and FNR reductions of up to -0.475.
[19] SOVABench: A Vehicle Surveillance Action Retrieval Benchmark for Multimodal Large Language Models
Oriol Rabasseda, Zenjie Li, Kamal Nasrollahi, Sergio Escalera
🧩 TL;DR
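This paper introduces SOVABench, a benchmark for evaluating recognition of vehicle actions in surveillance video, and develops a training-free framework based on multimodal large language models that produces interpretable description-based embeddings to improve action discrimination.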
📘 Detailed Summary
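Motivation: Existing content-based video retrieval benchmarks focus on scene-level similarity and do not evaluate the action discrimination needed in surveillance; vehicle-related actions in particular are under-studied. Dedicated evaluation protocols are needed to measure cross-action discrimination and temporal direction understanding.
Method: The study introduces SOVABench with two evaluation protocols (inter-pair and intra-pair) and develops a training-free framework that uses the visual reasoning and instruction-following capabilities of MLLMs to produce interpretable embeddings from generated descriptions of images and videos.
Result: Experiments show that state-of-the-art vision and multimodal models struggle with action discrimination, whereas the proposed training-free framework performs strongly on SOVABench and also does well on spatial and counting benchmarks where contrastive vision-language models often fail.
Conclusion: The study shows that action discrimination in surveillance video is challenging for existing models and demonstrates the effectiveness of embeddings built from MLLM-generated descriptions, providing a new benchmark and method for surveillance video analysis and advancing interpretable video understanding.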
📄 Abstract
Automatic identification of events and recurrent behavior analysis are critical for video surveillance. However, most existing content-based video retrieval benchmarks focus on scene-level similarity and do not evaluate the action discrimination required in surveillance. To address this gap, we introduce SOVABench (Surveillance Opposite Vehicle Actions Benchmark), a real-world retrieval benchmark built from surveillance footage and centered on vehicle-related actions. SOVABench defines two evaluation protocols (inter-pair and intra-pair) to assess cross-action discrimination and temporal direction understanding. Although action distinctions are generally intuitive for human observers, our experiments show that they remain challenging for state-of-the-art vision and multimodal models. Leveraging the visual reasoning and instruction-following capabilities of Multimodal Large Language Models (MLLMs), we present a training-free framework for producing interpretable embeddings from MLLM-generated descriptions for both images and videos. The framework achieves strong performance on SOVABench as well as on several spatial and counting benchmarks where contrastive Vision-Language Models often fail. The code, annotations, and instructions to construct the benchmark are publicly available.
[20] DivAS: Interactive 3D Segmentation of NeRFs via Depth-Weighted Voxel Aggregation
Ayush Pande
🧩 TL;DR
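This paper proposes DivAS, an optimization-free, fully interactive framework for segmenting neural radiance fields that achieves real-time 3D segmentation by aggregating depth-guided 2D masks, avoiding the per-scene training required by optimization-based methods.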
📘 Detailed Summary
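Motivation: Existing optimization-based NeRF segmentation methods require slow per-scene training and sacrifice the zero-shot capabilities of 2D foundation models, which limits the efficiency and practicality of interactive segmentation.
Method: DivAS uses a fast GUI-based workflow in which user point prompts generate 2D SAM masks that are refined with NeRF-derived depth priors to improve geometric accuracy and foreground-background separation. The core contribution is a custom CUDA kernel that aggregates the refined multi-view masks into a unified 3D voxel grid in under 200 ms.
Result: Experiments on Mip-NeRF 360° and LLFF show that DivAS achieves segmentation quality comparable to optimization-based methods while being 2-2.5x faster end-to-end, and up to an order of magnitude faster when user prompting time is excluded.
Conclusion: The study demonstrates that optimization-free, real-time NeRF segmentation is feasible. Depth-interactive voxel aggregation combines the zero-shot capabilities of 2D foundation models with 3D geometric priors, opening a new direction for interactive 3D scene understanding.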
📄 Abstract
Existing methods for segmenting Neural Radiance Fields (NeRFs) are often optimization-based, requiring slow per-scene training that sacrifices the zero-shot capabilities of 2D foundation models. We introduce DivAS (Depth-interactive Voxel Aggregation Segmentation), an optimization-free, fully interactive framework that addresses these limitations. Our method operates via a fast GUI-based workflow where 2D SAM masks, generated from user point prompts, are refined using NeRF-derived depth priors to improve geometric accuracy and foreground-background separation. The core of our contribution is a custom CUDA kernel that aggregates these refined multi-view masks into a unified 3D voxel grid in under 200ms, enabling real-time visual feedback. This optimization-free design eliminates the need for per-scene training. Experiments on Mip-NeRF 360° and LLFF show that DivAS achieves segmentation quality comparable to optimization-based methods, while being 2-2.5x faster end-to-end, and up to an order of magnitude faster when excluding user prompting time.
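To make the aggregation idea concrete, here is a CPU-side sketch of how one voxel might collect depth-weighted votes from several views. The abstract does not give the weighting rule, so the exponential depth weight, the `tau` and `threshold` parameters, and the input layout are all assumptions; the paper implements its aggregation as a custom CUDA kernel over the whole grid.

```python
import math

def depth_weight(voxel_depth, rendered_depth, tau=0.05):
    """Assumed weight: close to 1 when the voxel sits on the rendered surface,
    decaying exponentially with the gap between voxel depth and surface depth."""
    return math.exp(-abs(voxel_depth - rendered_depth) / tau)

def aggregate_voxel(observations, tau=0.05, threshold=0.5):
    """Decide whether one voxel is foreground.

    observations: one (in_mask, voxel_depth, rendered_depth) tuple per view in
    which the voxel projects inside the image. in_mask says whether the
    projected pixel falls inside that view's refined 2D mask.
    """
    if not observations:
        return False
    weighted_votes = 0.0
    for in_mask, voxel_depth, rendered_depth in observations:
        if in_mask:
            weighted_votes += depth_weight(voxel_depth, rendered_depth, tau)
    # Normalize by the view count so off-surface voxels score low even when masked
    return weighted_votes / len(observations) >= threshold

# A voxel on the surface and inside the mask in three of four views
on_surface = [(True, 2.0, 2.0), (True, 1.5, 1.5), (False, 3.0, 3.0), (True, 2.2, 2.2)]
# A voxel one unit behind the visible surface in every view
behind_surface = [(True, 3.0, 2.0)] * 4
```

Dividing by the number of views rather than by the total weight is a deliberate choice in this sketch: it keeps voxels that lie far behind the visible surface from being labeled foreground just because their projections land inside the masks.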
[21] Prototypicality Bias Reveals Blindspots in Multimodal Evaluation Metrics
Subhadeep Roy, Gagan Bhatia, Steffen Eger
🧩 TL;DR
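This study identifies prototypicality bias in multimodal evaluation, introduces the ProtoBias benchmark to quantify it, and develops the ProtoScore metric, which substantially reduces evaluation failure rates.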
📘 Detailed Summary
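Motivation: Automatic metrics for text-to-image models often substitute for human judgment in benchmarking and large-scale filtering, yet it is unclear whether they truly prioritize semantic correctness or instead favor visually and socially prototypical images learned from biased data distributions. The study aims to identify and examine prototypicality bias as a systematic failure mode in multimodal evaluation.
Method: The study introduces ProtoBias, a controlled contrastive benchmark spanning Animals, Objects, and Demography images, in which semantically correct but non-prototypical images are paired with semantically incorrect yet prototypical adversarial counterparts. This setup enables a directional evaluation of whether metrics follow textual semantics or default to prototypes. Building on these findings, the study proposes ProtoScore, a robust 7B-parameter metric.
Result: Widely used metrics, including CLIPScore, PickScore, and VQA-based scores, frequently misrank these pairs, and even LLM-as-Judge systems show uneven robustness in socially grounded cases. Human evaluations, by contrast, consistently favor semantic correctness with larger decision margins. ProtoScore substantially reduces failure rates and suppresses misranking while running orders of magnitude faster than GPT-5 at inference, approaching the robustness of much larger closed-source judges.
Conclusion: The study reveals a systematic prototypicality bias in multimodal evaluation, indicating that current automatic metrics may over-rely on prototypical patterns in biased data distributions. ProtoBias provides a standardized test for metric bias, and ProtoScore shows that purpose-built design can substantially improve evaluation robustness while remaining computationally efficient, pointing the way for future multimodal evaluation systems.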
📄 Abstract
Automatic metrics are now central to evaluating text-to-image models, often substituting for human judgment in benchmarking and large-scale filtering. However, it remains unclear whether these metrics truly prioritize semantic correctness or instead favor visually and socially prototypical images learned from biased data distributions. We identify and study prototypicality bias as a systematic failure mode in multimodal evaluation. We introduce a controlled contrastive benchmark, ProtoBias (Prototypical Bias), spanning Animals, Objects, and Demography images, where semantically correct but non-prototypical images are paired with subtly incorrect yet prototypical adversarial counterparts. This setup enables a directional evaluation of whether metrics follow textual semantics or default to prototypes. Our results show that widely used metrics, including CLIPScore, PickScore, and VQA-based scores, frequently misrank these pairs, while even LLM-as-Judge systems exhibit uneven robustness in socially grounded cases. Human evaluations consistently favour semantic correctness with larger decision margins. Motivated by these findings, we propose ProtoScore, a robust 7B-parameter metric that substantially reduces failure rates and suppresses misranking while running orders of magnitude faster than GPT-5 at inference, approaching the robustness of much larger closed-source judges.
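The paired setup lends itself to a simple directional check. The sketch below computes a misranking rate and a mean decision margin from per-pair metric scores. The function names, the choice to count ties as failures, and the example scores are illustrative assumptions, not values or definitions from the paper.

```python
def misranking_rate(pairs):
    """Fraction of pairs where the metric fails to prefer the correct image.

    pairs: list of (score_correct, score_prototypical), where score_correct is
    the metric's score for the semantically correct, non-prototypical image.
    Assumption: a tie counts as a failure.
    """
    if not pairs:
        raise ValueError("pairs must be non-empty")
    failures = sum(1 for correct, proto in pairs if correct <= proto)
    return failures / len(pairs)

def mean_margin(pairs):
    """Average of (score_correct - score_prototypical); larger is better."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    return sum(correct - proto for correct, proto in pairs) / len(pairs)

# Made-up scores for four pairs, for illustration only
example_pairs = [(0.31, 0.35), (0.40, 0.22), (0.28, 0.28), (0.50, 0.10)]
rate = misranking_rate(example_pairs)   # two of the four pairs are misranked
margin = mean_margin(example_pairs)
```

Reporting the margin alongside the rate separates a metric that is narrowly right from one that prefers the correct image decisively, which is the distinction the abstract draws for human raters.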
[22] Scaling Vision Language Models for Pharmaceutical Long Form Video Reasoning on Industrial GenAI Platform
Suyash Mishra, Qiang Li, Srikanth Patil, Satyanarayan Pati, Baddu Narendra
🧩 TL;DR
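This paper presents a large-scale multimodal reasoning framework for industrial settings, targeting long-form video understanding in the pharmaceutical domain. It evaluates more than 40 vision language models under strict compute constraints and characterizes their limitations, efficiency trade-offs, and failure modes in practical deployment.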
📘 Detailed Summary
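Motivation: Vision language models perform strongly on multimodal reasoning tasks, but most evaluations focus on short videos and assume unconstrained compute. Industrial settings such as pharmaceutical content understanding require processing long-form videos under strict GPU, latency, and cost constraints, where existing approaches struggle to scale, so the performance limits of models under realistic deployment conditions need to be studied.
Method: The study presents an industrial GenAI framework that processes over 200,000 PDFs, 25,326 videos across eight formats, and 888 multilingual audio files, builds a large-scale multimodal reasoning architecture for the pharmaceutical domain, and empirically analyzes more than 40 VLMs on two leading benchmarks (Video-MME and MMBench) and a proprietary dataset covering 14 disease areas.
Result: Results show 3-8x efficiency gains with SDPA attention on commodity GPUs, improvements from multimodality in up to 8 of 12 task domains (especially length-dependent tasks), and clear bottlenecks in temporal alignment and keyframe detection across open- and closed-source VLMs. The key findings concern attention mechanism trade-offs, temporal reasoning limits, and the challenges of video splitting in long-form video reasoning.
Conclusion: Rather than proposing a new "A+B" model, the study systematically characterizes the performance limits, trade-offs, and failure modes of current VLMs under realistic deployment constraints, and offers actionable guidance to researchers and practitioners designing scalable industrial long-form video understanding systems, stressing the importance of multimodality, attention optimization, and temporal alignment in industrial applications.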
📄 Abstract
Vision Language Models (VLMs) have shown strong performance on multimodal reasoning tasks, yet most evaluations focus on short videos and assume unconstrained computational resources. In industrial settings such as pharmaceutical content understanding, practitioners must process long-form videos under strict GPU, latency, and cost constraints, where many existing approaches fail to scale. In this work, we present an industrial GenAI framework that processes over 200,000 PDFs, 25,326 videos across eight formats (e.g., MP4, M4V), and 888 multilingual audio files in more than 20 languages. Our study makes three contributions: (i) an industrial large-scale architecture for multimodal reasoning in pharmaceutical domains; (ii) empirical analysis of over 40 VLMs on two leading benchmarks (Video-MME and MMBench) and a proprietary dataset of 25,326 videos across 14 disease areas; and (iii) four findings relevant to long-form video reasoning: the role of multimodality, attention mechanism trade-offs, temporal reasoning limits, and challenges of video splitting under GPU constraints. Results show 3-8 times efficiency gains with SDPA attention on commodity GPUs, multimodality improving up to 8/12 task domains (especially length-dependent tasks), and clear bottlenecks in temporal alignment and keyframe detection across open- and closed-source VLMs. Rather than proposing a new "A+B" model, this paper characterizes practical limits, trade-offs, and failure patterns of current VLMs under realistic deployment constraints, and provides actionable guidance for both researchers and practitioners designing scalable multimodal systems for long-form video understanding in industrial domains.
[23] VERSE: Visual Embedding Reduction and Space Exploration. Clustering-Guided Insights for Training Data Enhancement in Visually-Rich Document Understanding
Ignacio de Rodrigo, Alvaro J. Lopez-Lopez, Jaime Boal
🧩 TL;DR
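This paper proposes VERSE, a methodology for analyzing and improving vision-language models applied to visually-rich document understanding, which explores the visual embedding space to identify problematic regions and generate synthetic data that improves model performance.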
📘 Detailed Summary
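Motivation: Vision-language models face performance bottlenecks in visually-rich document understanding, and there is no systematic way to analyze a model's visual embedding space, which makes it hard to identify the visual features that cause errors and to improve the model in a targeted way.
Method: VERSE visualizes latent representations to assess model feasibility, identifies problematic regions, and guides synthetic data generation. It is validated by training on the MERIT Dataset and evaluating on MERIT Secret, optimizing on-premise models such as Donut and Idefics2.
Result: Results show that VERSE helps uncover the visual features associated with error-prone clusters, and that retraining with samples containing these features substantially boosts F1 performance without degrading generalization. The optimized on-premise models match or surpass SaaS solutions such as GPT-4 and Pixtral.
Conclusion: VERSE provides a systematic framework for analyzing and improving vision-language models. Targeted data augmentation substantially improves performance and shows that properly optimized on-premise models can compete with commercial SaaS solutions, giving document understanding an effective method for model diagnosis and enhancement.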
📄 Abstract
This work introduces VERSE, a methodology for analyzing and improving Vision-Language Models applied to Visually-rich Document Understanding by exploring their visual embedding space. VERSE enables the visualization of latent representations, supporting the assessment of model feasibility. It also facilitates the identification of problematic regions and guides the generation of synthetic data to enhance performance in those clusters. We validate the methodology by training on the synthetic MERIT Dataset and evaluating on its real-world counterpart, MERIT Secret. Results show that VERSE helps uncover the visual features associated with error-prone clusters, and that retraining with samples containing these features substantially boosts F1 performance without degrading generalization. Furthermore, we demonstrate that on-premise models such as Donut and Idefics2, when optimized with VERSE, match or even surpass the performance of SaaS solutions like GPT-4 and Pixtral.
[24] From Understanding to Engagement: Personalized pharmacy Video Clips via Vision Language Models (VLMs)
Suyash Mishra, Qiang Li, Srikanth Patil, Anubhav Girdhar
🧩 TL;DR
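This paper proposes a domain-adapted video clip generation framework for the pharmaceutical industry that integrates audio language models and vision language models for efficient, personalized multimodal content processing, substantially improving the efficiency and quality of video summarization.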
📘 Detailed Summary
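Motivation: The pharmaceutical industry traditionally relies on manual annotation to process heterogeneous multimodal data (text, images, video, audio, and web links), which leads to inconsistencies, quality degradation, and inefficient content utilization. Long video and audio data, such as clinical trial interviews and educational seminars, are especially challenging.
Method: The method is a domain-adapted video-to-video-clip generation framework with three core contributions: a reproducible Cut & Merge algorithm with fade in/out and timestamp normalization for smooth transitions and audio/visual alignment; a personalization mechanism based on role definition and prompt injection for different needs (marketing, training, regulatory); and a cost-efficient end-to-end pipeline strategy that balances ALM- and VLM-enhanced processing.
Result: Evaluations on the Video MME benchmark (900 videos) and a proprietary dataset of 16,159 pharmaceutical videos across 14 disease areas show a 3-4x speedup, a 4x cost reduction, and competitive clip quality. The method also reports improved clip coherence (0.348) and informativeness (0.721) scores over state-of-the-art VLM baselines such as Gemini 2.5 Pro.
Conclusion: The study shows the potential of transparent, customizable, and compliance-supporting video summarization for the life sciences. It delivers substantial efficiency gains and cost savings, improves output quality through domain adaptation and personalization, and offers a scalable automated solution for the pharmaceutical industry's digital transformation.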
📄 Abstract
Vision Language Models (VLMs) are poised to revolutionize the digital transformation of the pharmaceutical industry by enabling intelligent, scalable, and automated multi-modality content processing. Traditional manual annotation of heterogeneous data modalities (text, images, video, audio, and web links) is prone to inconsistencies, quality degradation, and inefficiencies in content utilization. The sheer volume of long video and audio data (e.g., long clinical trial interviews and educational seminars) further exacerbates these challenges. Here, we introduce a domain-adapted Video to Video Clip Generation framework that integrates Audio Language Models (ALMs) and Vision Language Models (VLMs) to produce highlight clips. Our contributions are threefold: (i) a reproducible Cut & Merge algorithm with fade in/out and timestamp normalization, ensuring smooth transitions and audio/visual alignment; (ii) a personalization mechanism based on role definition and prompt injection for tailored outputs (marketing, training, regulatory); (iii) a cost-efficient end-to-end pipeline strategy balancing ALM/VLM enhanced processing. Evaluations on the Video MME benchmark (900) and our proprietary dataset of 16,159 pharmacy videos across 14 disease areas demonstrate a 3 to 4 times speedup, a 4 times cost reduction, and competitive clip quality. Beyond efficiency gains, we also report that our methods improved clip coherence scores (0.348) and informativeness scores (0.721) over state-of-the-art VLM baselines (e.g., Gemini 2.5 Pro), highlighting the potential of transparent, custom extractive, and compliance-supporting video summarization for life sciences.
[25] Vision-Language Introspection: Mitigating Overconfident Hallucinations in MLLMs via Interpretable Bi-Causal Steering
Shuliang Liu, Songbo Yang, Dong Fang, Sihang Jia, Yuqi Tang, Lingfeng Su, Ruoshui Peng, Yibo Yan, Xin Zou, Xuming Hu
🧩 TL;DR
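This paper proposes Vision-Language Introspection (VLI), a training-free inference framework that reduces object hallucination in multimodal large language models by simulating a metacognitive self-correction process. It diagnoses hallucination risk through Attributive Introspection and dynamically adjusts inference with Interpretable Bi-Causal Steering, significantly improving model reliability.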
📘 Detailed Summary
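Motivation: Object hallucination critically undermines the reliability of multimodal large language models. It largely stems from a fundamental failure of cognitive introspection: models blindly trust linguistic priors over specific visual evidence. Existing mitigations are limited: contrastive decoding operates superficially without correcting internal semantic misalignments, while current latent steering methods rely on static vectors that lack instance-specific precision.
Method: The paper proposes VLI, a training-free inference framework that simulates a metacognitive self-correction process. VLI first performs Attributive Introspection, diagnosing hallucination risk via probabilistic conflict detection and localizing the causal visual anchors. It then applies Interpretable Bi-Causal Steering to actively modulate inference, dynamically isolating visual evidence from background noise while neutralizing blind confidence through adaptive calibration.
Result: VLI achieves state-of-the-art performance on advanced models, reducing object hallucination rates by 12.67% on MMHal-Bench and improving accuracy by 5.8% on POPE. These results support the effectiveness of a training-free inference framework for improving MLLM reliability and accuracy on object recognition tasks.
Conclusion: The study shows that simulating metacognitive processes is effective against multimodal hallucination and offers a new paradigm for training-free intervention. By combining introspective diagnosis with dynamic steering, VLI gives fine-grained control over the model's inference process, a promising direction for improving the reliability and interpretability of multimodal models.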
📄 Abstract
Object hallucination critically undermines the reliability of Multimodal Large Language Models, often stemming from a fundamental failure in cognitive introspection, where models blindly trust linguistic priors over specific visual evidence. Existing mitigations remain limited: contrastive decoding approaches operate superficially without rectifying internal semantic misalignments, while current latent steering methods rely on static vectors that lack instance-specific precision. We introduce Vision-Language Introspection (VLI), a training-free inference framework that simulates a metacognitive self-correction process. VLI first performs Attributive Introspection to diagnose hallucination risks via probabilistic conflict detection and localize the causal visual anchors. It then employs Interpretable Bi-Causal Steering to actively modulate the inference process, dynamically isolating visual evidence from background noise while neutralizing blind confidence through adaptive calibration. VLI achieves state-of-the-art performance on advanced models, reducing object hallucination rates by 12.67% on MMHal-Bench and improving accuracy by 5.8% on POPE.
[26] UniLiPs: Unified LiDAR Pseudo-Labeling with Geometry-Grounded Dynamic Scene Decomposition
Filippo Ghilotti, Samuel Brucker, Nahku Saidy, Matteo Matteucci, Mario Bijelic, Felix Heide
🧩 TL;DR
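This paper proposes an unsupervised multi-modal pseudo-labeling method that uses temporal-geometric consistency across LiDAR sweeps to lift cues from text and 2D vision foundation models directly into 3D, producing 3D semantic labels, 3D bounding boxes, and dense point clouds without manual annotation.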
📘 Detailed Summary
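Motivation: Unlabeled LiDAR logs in autonomous driving contain rich 3D geometry but are hard to use without human labels, which is a dominant cost bottleneck for perception research. Existing methods often require additional manual supervision, limiting how efficiently large-scale datasets can be used.
Method: The method relies on strong geometric priors learned from temporally accumulated LiDAR maps, lifting and fusing cues from text and 2D vision foundation models across LiDAR sweeps via temporal-geometric consistency. It introduces a novel iterative update rule that enforces joint geometric-semantic consistency and detects moving objects from inconsistencies.
Result: The method generalizes robustly across three datasets while simultaneously producing 3D semantic labels, 3D bounding boxes, and dense LiDAR scans. Experiments show that it compares favorably to existing pseudo-labeling methods for semantic segmentation and object detection. Even a small fraction of the geometrically consistent, densified LiDAR improves depth prediction MAE by 51.5% and 22.0% in the 80-150 m and 150-250 m ranges, respectively.
Conclusion: The study demonstrates that temporal-geometric consistency enables unsupervised pseudo-labeling for 3D perception, offering an effective way to lower annotation costs in autonomous-driving perception research. The geometrically consistent, densified LiDAR data markedly improves depth prediction, indicating the method's value for large-scale unsupervised 3D perception learning.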
📄 Abstract
Unlabeled LiDAR logs, in autonomous driving applications, are inherently a gold mine of dense 3D geometry hiding in plain sight - yet they are almost useless without human labels, highlighting a dominant cost barrier for autonomous-perception research. In this work we tackle this bottleneck by leveraging temporal-geometric consistency across LiDAR sweeps to lift and fuse cues from text and 2D vision foundation models directly into 3D, without any manual input. We introduce an unsupervised multi-modal pseudo-labeling method relying on strong geometric priors learned from temporally accumulated LiDAR maps, alongside a novel iterative update rule that enforces joint geometric-semantic consistency and, conversely, detects moving objects from inconsistencies. Our method simultaneously produces 3D semantic labels, 3D bounding boxes, and dense LiDAR scans, demonstrating robust generalization across three datasets. We experimentally validate that our method compares favorably to existing semantic segmentation and object detection pseudo-labeling methods, which often require additional manual supervision. We confirm that even a small fraction of our geometrically consistent, densified LiDAR improves depth prediction by 51.5% and 22.0% MAE in the 80-150 and 150-250 meters range, respectively.
[27] Mechanisms of Prompt-Induced Hallucination in Vision-Language Models
William Rudman, Michal Golovanevsky, Dana Arad, Yonatan Belinkov, Ritambhara Singh, Carsten Eickhoff, Kyle Mahowald
🧩 TL;DR
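Through mechanistic analysis, this study characterizes prompt-induced hallucination in large vision-language models and finds that ablating a small set of attention heads reduces the hallucination rate by at least 40% without additional training.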
📘 Detailed Summary
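Motivation: Large vision-language models are highly capable but often hallucinate by favoring textual prompts over visual evidence. The study investigates this prompt-induced failure mode in a controlled object-counting setting, examining how model behavior changes when the prompt overstates the number of objects in the image.
Method: The study uses a controlled object-counting setup that contrasts the textual prompt with the visual evidence to evaluate model behavior. Through mechanistic analysis of three vision-language models, it identifies the specific attention heads responsible for prompt-induced hallucination and verifies their role by ablating them.
Result: At low object counts, models often correct the overestimation, but as the count increases they increasingly conform to the prompt regardless of the discrepancy. Ablating the identified small set of attention heads reduces prompt-induced hallucinations by at least 40% without additional training. Across models, these hallucination heads mediate prompt copying in model-specific ways.
Conclusion: The study reveals the internal mechanisms behind prompt-induced hallucination in vision-language models and the model-specific differences in how these behaviors are implemented. Ablating the hallucination heads increases correction toward visual evidence, offering mechanistic insight for understanding and mitigating hallucination.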
📄 Abstract
Large vision-language models (VLMs) are highly capable, yet often hallucinate by favoring textual prompts over visual evidence. We study this failure mode in a controlled object-counting setting, where the prompt overstates the number of objects in the image (e.g., asking a model to describe four waterlilies when only three are present). At low object counts, models often correct the overestimation, but as the number of objects increases, they increasingly conform to the prompt regardless of the discrepancy. Through mechanistic analysis of three VLMs, we identify a small set of attention heads whose ablation substantially reduces prompt-induced hallucinations (PIH) by at least 40% without additional training. Across models, PIH-heads mediate prompt copying in model-specific ways. We characterize these differences and show that PIH ablation increases correction toward visual evidence. Our findings offer insights into the internal mechanisms driving prompt-induced hallucinations, revealing model-specific differences in how these behaviors are implemented.
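Head ablation itself is mechanically simple. The sketch below zeroes the outputs of selected attention heads for one token at one layer before the heads are concatenated. Zero ablation, the plain-list data layout, and the function name are assumptions for illustration; the abstract does not state which ablation variant (for example zero or mean ablation) the authors use.

```python
def ablate_heads(head_outputs, heads_to_ablate):
    """Concatenate per-head output vectors, zeroing the ablated heads.

    head_outputs: list of per-head vectors (lists of floats) for one token.
    heads_to_ablate: set of head indices whose contribution is removed.
    The result would normally be passed on to the layer's output projection.
    """
    concatenated = []
    for index, vector in enumerate(head_outputs):
        if index in heads_to_ablate:
            # Zero ablation: the head contributes nothing downstream
            concatenated.extend([0.0] * len(vector))
        else:
            concatenated.extend(vector)
    return concatenated

# Three heads of width two; head 1 is ablated
outputs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
ablated = ablate_heads(outputs, {1})
```

In a real model the same masking would be applied inside the attention module through a forward hook, and the hallucination rate would be measured before and after to quantify the effect.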
[28] Re-Align: Structured Reasoning-guided Alignment for In-Context Image Generation and Editing
Runze He, Yiji Cheng, Tiankai Hang, Zhimin Li, Yu Xu, Zijin Yin, Shiyi Zhang, Wenxun Dai, Penghui Du, Ao Ma, Chunyu Wang, Qinglin Lu, Jizhong Han, Jiao Dai
🧩 TL;DR
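This paper proposes Re-Align, a framework that bridges the gap between understanding and generation in multimodal models through structured reasoning-guided alignment, substantially improving in-context image generation and editing.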
📘 Detailed Summary
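Motivation: Current unified multimodal models show promising understanding capabilities, but these strengths often fail to transfer to image generation, leaving a significant gap between understanding user intent and executing it faithfully in in-context image generation and editing.
Method: At the core of Re-Align is In-Context Chain-of-Thought (IC-CoT), a structured reasoning paradigm that decouples semantic guidance from reference association, providing a clear textual target and reducing confusion among reference images. Re-Align also introduces a reinforcement learning training scheme with a surrogate reward that measures the alignment between the structured reasoning text and the generated image.
Result: Extensive experiments show that Re-Align outperforms competitive methods of comparable model scale and resources on both in-context image generation and editing tasks, supporting its effectiveness in bridging the understanding-generation gap.
Conclusion: The study shows the importance of structured reasoning-guided alignment in unified multimodal models, provides an effective framework for in-context image generation and editing, and highlights the potential of reinforcement learning for optimizing generation alignment.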
📄 Abstract
In-context image generation and editing (ICGE) enables users to specify visual concepts through interleaved image-text prompts, demanding precise understanding and faithful execution of user intent. Although recent unified multimodal models exhibit promising understanding capabilities, these strengths often fail to transfer effectively to image generation. We introduce Re-Align, a unified framework that bridges the gap between understanding and generation through structured reasoning-guided alignment. At its core lies the In-Context Chain-of-Thought (IC-CoT), a structured reasoning paradigm that decouples semantic guidance and reference association, providing a clear textual target and mitigating confusion among reference images. Furthermore, Re-Align introduces an effective RL training scheme that leverages a surrogate reward to measure the alignment between structured reasoning text and the generated image, thereby improving the model's overall performance on ICGE tasks. Extensive experiments verify that Re-Align outperforms competitive methods of comparable model scale and resources on both in-context image generation and editing tasks.
[29] RoboVIP: Multi-View Video Generation with Visual Identity Prompting Augments Robot Manipulation
Boyang Wang, Haoran Zhang, Shujie Zhang, Jinkun Hao, Mingda Jia, Qi Lv, Yucheng Mao, Zhaoyang Lyu, Jia Zeng, Xudong Xu, Jiangmiao Pang
🧩 TL;DR
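This paper proposes visual identity prompting, which supplies exemplar images as conditioning inputs to guide a diffusion model toward the desired scene setup. The approach augments robot manipulation data and addresses the shortcomings of text-prompt methods in multi-view and temporal consistency.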
📘 Detailed Summary
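Motivation: Hardware and physical setup constraints make large-scale real-world manipulation data difficult to collect across diverse environments. Existing methods based on text-prompted image diffusion models often overlook the multi-view and temporally coherent observations that state-of-the-art policy models require, and text prompts alone cannot reliably specify the scene setup.
Method: The paper introduces visual identity prompting, which provides exemplar images as conditioning inputs to the diffusion model to guide generation of the desired scene setup. It also builds a scalable pipeline that curates a visual identity pool from large robotics datasets for augmenting manipulation data.
Result: Training downstream vision-language-action and visuomotor policy models on the augmented manipulation data yields consistent performance gains in both simulation and real-robot settings.
Conclusion: Visual identity prompting offers a more reliable scene-control mechanism for augmenting robot manipulation data. Drawing on a visual identity pool curated from large robotics datasets, it generates multi-view-consistent and temporally coherent observations that improve policy training and generalization.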
📄 Abstract
The diversity, quantity, and quality of manipulation data are critical for training effective robot policies. However, due to hardware and physical setup constraints, collecting large-scale real-world manipulation data remains difficult to scale across diverse environments. Recent work uses text-prompt conditioned image diffusion models to augment manipulation data by altering the backgrounds and tabletop objects in the visual observations. However, these approaches often overlook the practical need for multi-view and temporally coherent observations required by state-of-the-art policy models. Further, text prompts alone cannot reliably specify the scene setup. To provide the diffusion model with explicit visual guidance, we introduce visual identity prompting, which supplies exemplar images as conditioning inputs to guide the generation of the desired scene setup. To this end, we also build a scalable pipeline to curate a visual identity pool from large robotics datasets. Using our augmented manipulation data to train downstream vision-language-action and visuomotor policy models yields consistent performance gains in both simulation and real-robot settings.
[30] A Lightweight and Explainable Vision-Language Framework for Crop Disease Visual Question Answering
Md. Zahid Hossain, Most. Sharmin Sultana Samu, Md. Rakibul Islam, Md. Siam Ansary
🧩 TL;DR
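This paper presents a lightweight vision-language framework for crop and disease identification from leaf images. It combines a Swin Transformer vision encoder with sequence-to-sequence language decoders, maintaining high accuracy with significantly fewer parameters.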
📘 Detailed Summary
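Motivation: Visual question answering for crop disease requires accurate visual understanding and reliable language generation. Existing large-scale vision-language models have too many parameters, and lightweight, efficient solutions for agriculture-specific tasks are lacking.
Method: The method uses a Swin Transformer vision encoder with sequence-to-sequence language decoders to build a lightweight vision-language framework, adopts a two-stage training strategy to improve visual representation learning and cross-modal alignment, and assesses explainability with Grad-CAM and token-level attribution.
Result: Experiments on a large-scale crop disease dataset show high accuracy for both crop and disease identification and strong performance on BLEU, ROUGE, and BERTScore, with significantly fewer parameters than large-scale vision-language baselines.
Conclusion: The findings highlight the effectiveness of task-specific visual pretraining for crop disease visual question answering. The lightweight framework is parameter-efficient while maintaining strong performance, making it a practical solution for vision-language applications in agriculture.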
📄 Abstract
Visual question answering for crop disease analysis requires accurate visual understanding and reliable language generation. This work presents a lightweight vision-language framework for crop and disease identification from leaf images. The proposed approach combines a Swin Transformer vision encoder with sequence-to-sequence language decoders. A two-stage training strategy is adopted to improve visual representation learning and cross-modal alignment. The model is evaluated on a large-scale crop disease dataset using classification and natural language generation metrics. Experimental results show high accuracy for both crop and disease identification. The framework also achieves strong performance on BLEU, ROUGE and BERTScore. Our proposed models outperform large-scale vision-language baselines while using significantly fewer parameters. Explainability is assessed using Grad-CAM and token-level attribution. Qualitative results demonstrate robust performance under diverse user-driven queries. These findings highlight the effectiveness of task-specific visual pretraining for crop disease visual question answering.
[31] VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice
Shuming Liu, Mingchen Zhuge, Changsheng Zhao, Jun Chen, Lemeng Wu, Zechun Liu, Chenchen Zhu, Zhipeng Cai, Chong Zhou, Haozhe Liu, Ernie Chang, Saksham Suri, Hongyu Xu, Qi Qian, Wei Wen, Balakrishnan Varadarajan, Zhuang Liu, Hu Xu, Florian Bordes, Raghuraman Krishnamoorthi, Bernard Ghanem, Vikas Chandra, Yunyang Xiong
🧩 TL;DR
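This paper proposes VideoAuto-R1, a video understanding framework that follows a reason-when-necessary strategy, achieving state-of-the-art accuracy with significantly improved efficiency and reducing the average response length by about 3.3x.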
📘 Detailed Summary
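Motivation: Chain-of-thought reasoning is a powerful tool for multimodal large language models on video understanding, but its necessity and advantages over direct answering remain underexplored. The study finds that for RL-trained video models, direct answering often matches or even surpasses chain-of-thought performance at a lower computational cost, which motivates a more efficient reasoning strategy.
Method: VideoAuto-R1 adopts a reason-when-necessary strategy. Training follows a Thinking Once, Answering Twice paradigm: the model first generates an initial answer, then performs reasoning, and finally outputs a reviewed answer, with both answers supervised via verifiable rewards. At inference, the confidence score of the initial answer determines whether the model proceeds with reasoning.
Result: Across video QA and grounding benchmarks, VideoAuto-R1 achieves state-of-the-art accuracy with significantly improved efficiency, reducing the average response length by about 3.3x (for example, from 149 to 44 tokens). Thinking-mode activation is low on perception-oriented tasks and higher on reasoning-intensive tasks.
Conclusion: Explicit language-based reasoning is generally beneficial but not always necessary. VideoAuto-R1's adaptive reasoning strategy balances accuracy and efficiency, making it a more practical solution for multimodal video understanding. The study also shows that reasoning needs differ by task type, which informs the design of future efficient multimodal models.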
📄 Abstract
Chain-of-thought (CoT) reasoning has emerged as a powerful tool for multimodal large language models on video understanding tasks. However, its necessity and advantages over direct answering remain underexplored. In this paper, we first demonstrate that for RL-trained video models, direct answering often matches or even surpasses CoT performance, despite CoT producing step-by-step analyses at a higher computational cost. Motivated by this, we propose VideoAuto-R1, a video understanding framework that adopts a reason-when-necessary strategy. During training, our approach follows a Thinking Once, Answering Twice paradigm: the model first generates an initial answer, then performs reasoning, and finally outputs a reviewed answer. Both answers are supervised via verifiable rewards. During inference, the model uses the confidence score of the initial answer to determine whether to proceed with reasoning. Across video QA and grounding benchmarks, VideoAuto-R1 achieves state-of-the-art accuracy with significantly improved efficiency, reducing the average response length by ~3.3x, e.g., from 149 to just 44 tokens. Moreover, we observe a low rate of thinking-mode activation on perception-oriented tasks, but a higher rate on reasoning-intensive tasks. This suggests that explicit language-based reasoning is generally beneficial but not always necessary.
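The inference-time gate described above can be sketched in a few lines. The abstract does not define the confidence score or the threshold, so the geometric-mean token probability, the `threshold` value, and the `reason_fn` callback are hypothetical choices made only to show the control flow.

```python
import math

def answer_confidence(token_logprobs):
    """Assumed confidence: geometric mean of the initial answer's token probabilities."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def answer_with_optional_reasoning(initial_answer, token_logprobs, reason_fn,
                                   threshold=0.9):
    """Return (final_answer, used_reasoning).

    reason_fn is a hypothetical callback that runs the reasoning pass and
    returns the reviewed answer; it is only called when confidence is low.
    """
    if answer_confidence(token_logprobs) >= threshold:
        return initial_answer, False
    return reason_fn(initial_answer), True

# Confident initial answer: reasoning is skipped
fast = answer_with_optional_reasoning("B", [-0.01, -0.02], lambda answer: "C")
# Uncertain initial answer: the reasoning pass produces the final answer
slow = answer_with_optional_reasoning("B", [-1.0, -0.5], lambda answer: "C")
```

Skipping the reasoning pass for confident answers is what shortens the average response, which matches the reported pattern of low thinking-mode activation on perception-oriented tasks.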
[32] ObjectForesight: Predicting Future 3D Object Trajectories from Human Videos
Rustin Soraki, Homanga Bharadhwaj, Ali Farhadi, Roozbeh Mottaghi
🧩 TL;DR
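This paper presents ObjectForesight, a 3D object-centric dynamics model that predicts future 6-DoF poses and trajectories of rigid objects from egocentric video sequences, using an explicit 3D object-level representation to produce geometrically grounded and temporally coherent predictions.
📘 Detailed Summary
Motivation: Humans can effortlessly anticipate how objects might move or change through interaction, but existing computational systems lack the ability to predict plausible future object motion directly from passive visual observation. Conventional world or dynamics models operate in pixel or latent space and cannot provide geometrically grounded, temporally coherent object-level predictions, which limits their understanding of object affordances and trajectories.
Method: ObjectForesight is a 3D object-centric dynamics model that predicts future 6-DoF poses and trajectories of rigid objects from short egocentric video sequences. It represents the 3D world explicitly at the object level rather than in pixel or latent space. To train at scale, the authors leverage recent advances in segmentation, mesh reconstruction, and 3D pose estimation to curate a dataset of more than 2 million short clips with pseudo-ground-truth 3D object trajectories.
Result: Experiments show that ObjectForesight achieves significant gains in accuracy, geometric consistency, and generalization to unseen objects and scenes. The model captures object affordances and trajectories, establishing a scalable framework for learning physically grounded, object-centric dynamics models.
Conclusion: ObjectForesight provides a scalable framework for learning physically grounded, object-centric dynamics models directly from observation; its explicit 3D object-level representation enables geometrically grounded and temporally coherent predictions. The work demonstrates the value of large-scale pseudo-ground-truth data for training complex dynamics models and lays groundwork for applications in robotics, augmented reality, and interactive systems.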
📄 Abstract
Humans can effortlessly anticipate how objects might move or change through interaction--imagining a cup being lifted, a knife slicing, or a lid being closed. We aim to endow computational systems with a similar ability to predict plausible future object motions directly from passive visual observation. We introduce ObjectForesight, a 3D object-centric dynamics model that predicts future 6-DoF poses and trajectories of rigid objects from short egocentric video sequences. Unlike conventional world or dynamics models that operate in pixel or latent space, ObjectForesight represents the world explicitly in 3D at the object level, enabling geometrically grounded and temporally coherent predictions that capture object affordances and trajectories. To train such a model at scale, we leverage recent advances in segmentation, mesh reconstruction, and 3D pose estimation to curate a dataset of more than 2 million short clips with pseudo-ground-truth 3D object trajectories. Through extensive experiments, we show that ObjectForesight achieves significant gains in accuracy, geometric consistency, and generalization to unseen objects and scenes, establishing a scalable framework for learning physically grounded, object-centric dynamics models directly from observation. Project page: objectforesight.github.io
[33] Plenoptic Video Generation
Xiao Fu, Shitao Tang, Min Shi, Xian Liu, Jinwei Gu, Ming-Yu Liu, Dahua Lin, Chen-Hsuan Lin
🧩 TL;DR
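This paper presents PlenopticDreamer, a framework that synchronizes generative hallucinations to maintain spatio-temporal memory, addressing consistency in multi-view video re-rendering and achieving state-of-the-art performance on multiple benchmarks.
📘 Detailed Summary
Motivation: Existing camera-controlled generative video re-rendering methods such as ReCamMaster perform well in single-view settings but struggle to stay consistent across multi-view scenarios. The stochasticity of generative models makes spatio-temporal coherence in hallucinated regions hard to guarantee, which is the main limitation of current work.
Method: The core of PlenopticDreamer is a multi-in-single-out video-conditioned model trained autoregressively, aided by a camera-guided video retrieval strategy that adaptively selects salient videos from previous generations as conditional inputs. Training also incorporates progressive context scaling to improve convergence, self-conditioning to improve robustness to long-range visual degradation caused by error accumulation, and a long-video conditioning mechanism to support extended video generation.
Result: Extensive experiments on the Basic and Agibot benchmarks show that PlenopticDreamer achieves state-of-the-art video re-rendering, delivering superior view synchronization, high-fidelity visuals, accurate camera control, and diverse view transformations (e.g., third-person to third-person, and head-view to gripper-view in robotic manipulation).
Conclusion: Synchronizing generative hallucinations to maintain spatio-temporal memory effectively addresses consistency in multi-view video re-rendering. The framework achieves better view synchronization while preserving visual quality and camera-control accuracy, offering a new technical direction for generative video re-rendering that is especially valuable for applications requiring multi-view consistency.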
📄 Abstract
Camera-controlled generative video re-rendering methods, such as ReCamMaster, have achieved remarkable progress. However, despite their success in single-view settings, these works often struggle to maintain consistency across multi-view scenarios. Ensuring spatio-temporal coherence in hallucinated regions remains challenging due to the inherent stochasticity of generative models. To address this, we introduce PlenopticDreamer, a framework that synchronizes generative hallucinations to maintain spatio-temporal memory. The core idea is to train a multi-in-single-out video-conditioned model in an autoregressive manner, aided by a camera-guided video retrieval strategy that adaptively selects salient videos from previous generations as conditional inputs. In addition, our training incorporates progressive context-scaling to improve convergence, self-conditioning to enhance robustness against long-range visual degradation caused by error accumulation, and a long-video conditioning mechanism to support extended video generation. Extensive experiments on the Basic and Agibot benchmarks demonstrate that PlenopticDreamer achieves state-of-the-art video re-rendering, delivering superior view synchronization, high-fidelity visuals, accurate camera control, and diverse view transformations (e.g., third-person to third-person, and head-view to gripper-view in robotic manipulation). Project page: https://research.nvidia.com/labs/dir/plenopticdreamer/
cs.CL
[34] Attribute-Aware Controlled Product Generation with LLMs for E-commerce
Virginia Negri, Víctor Martínez Gómez, Sergio A. Balanya, Subburam Rajaram
🧩 TL;DR
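This paper proposes a systematic framework that uses large language models to generate synthetic e-commerce product data. Its controlled modification strategies yield high-quality training data that performs on par with real data on the MAVE dataset, offering a practical solution for product information extraction in low-resource scenarios.
📘 Detailed Summary
Motivation: Product information extraction is crucial for e-commerce services, but obtaining high-quality labeled datasets remains challenging, especially in low-resource scenarios. A method for generating high-quality synthetic data is needed to augment training sets.
Method: The paper proposes a systematic approach to generating synthetic e-commerce product data with an LLM-based controlled modification framework comprising three strategies: attribute-preserving modification, controlled negative example generation, and systematic attribute removal. A state-of-the-art LLM with attribute-aware prompts enforces store constraints while maintaining product coherence.
Result: Human evaluation of 2,000 synthetic products shows high effectiveness: 99.6% were rated natural, 96.5% contained valid attribute values, and over 90% showed consistent attribute usage. On the public MAVE dataset, synthetic data reaches 60.5% accuracy, on par with real training data (60.8%) and well above the 13.4% zero-shot baseline. Hybrid configurations combining synthetic and real data further raise accuracy to 68.8%.
Conclusion: The framework offers a practical solution for augmenting e-commerce datasets, particularly in low-resource scenarios. It shows that synthetic data can match real data on information extraction, and the further gains from hybrid configurations suggest that synthetic and real data are complementary.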
📄 Abstract
Product information extraction is crucial for e-commerce services, but obtaining high-quality labeled datasets remains challenging. We present a systematic approach for generating synthetic e-commerce product data using Large Language Models (LLMs), introducing a controlled modification framework with three strategies: attribute-preserving modification, controlled negative example generation, and systematic attribute removal. Using a state-of-the-art LLM with attribute-aware prompts, we enforce store constraints while maintaining product coherence. Human evaluation of 2000 synthetic products demonstrates high effectiveness, with 99.6% rated as natural, 96.5% containing valid attribute values, and over 90% showing consistent attribute usage. On the public MAVE dataset, our synthetic data achieves 60.5% accuracy, performing on par with real training data (60.8%) and significantly improving upon the 13.4% zero-shot baseline. Hybrid configurations combining synthetic and real data further improve performance, reaching 68.8% accuracy. Our framework provides a practical solution for augmenting e-commerce datasets, particularly valuable for low-resource scenarios.
[35] TeleTables: A Benchmark for Large Language Models in Telecom Table Interpretation
Anas Ezzakri, Nicola Piovesan, Mohamed Sana, Antonio De Domenico, Fadhel Ayed, Haozhe Zhang
🧩 TL;DR
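This paper introduces TeleTables, a benchmark that evaluates both the implicit knowledge large language models have about tables in telecom technical specifications and their explicit ability to interpret them. It exposes the limitations of current models on the dense tabular information in 3GPP standards and underscores the need for domain-specialized fine-tuning.
📘 Detailed Summary
Motivation: Current large language models perform poorly in telecom applications, with notable deficiencies on 3GPP technical specifications. These standards rely heavily on tables to present essential information, yet models' knowledge of such tables and their ability to interpret them have not been systematically evaluated. This gap hinders reliable use of LLMs in telecom engineering tasks.
Method: The authors build TeleTables through a novel multi-stage data generation pipeline. It extracts tables from 3GPP standards and uses multimodal and reasoning-oriented LLMs to generate and validate questions, yielding a dataset of 500 human-verified question-answer pairs, each associated with the corresponding table in multiple formats.
Result: Smaller models (under 10B parameters) struggle both to recall 3GPP knowledge and to interpret tables, indicating limited exposure to telecom standards during pretraining and insufficient inductive biases for navigating complex technical material. Larger models show stronger reasoning on table interpretation, but overall performance still points to insufficient domain specialization.
Conclusion: TeleTables exposes the limitations of current LLMs on telecom technical specifications, particularly in understanding and reasoning over tabular information. The study highlights the importance of domain-specialized fine-tuning for reliably interpreting and reasoning over telecom standards, and provides an evaluation framework and direction for future telecom-specific models.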
📄 Abstract
Large Language Models (LLMs) are increasingly explored in the telecom industry to support engineering tasks, accelerate troubleshooting, and assist in interpreting complex technical documents. However, recent studies show that LLMs perform poorly on telecom standards, particularly 3GPP specifications. We argue that a key reason is that these standards densely include tables to present essential information, yet LLMs' knowledge of such tables and their ability to interpret them remain largely unexamined. To address this gap, we introduce TeleTables, a benchmark designed to evaluate both the implicit knowledge LLMs have about tables in technical specifications and their explicit ability to interpret them. TeleTables is built through a novel multi-stage data generation pipeline that extracts tables from 3GPP standards and uses multimodal and reasoning-oriented LLMs to generate and validate questions. The resulting dataset, which is publicly available, comprises 500 human-verified question-answer pairs, each associated with the corresponding table in multiple formats. Our evaluation shows that smaller models (under 10B parameters) struggle both to recall 3GPP knowledge and to interpret tables, indicating limited exposure to telecom standards in their pretraining and insufficient inductive biases for navigating complex technical material. Larger models, on the other hand, show stronger reasoning on table interpretation. Overall, TeleTables highlights the need for domain-specialized fine-tuning to reliably interpret and reason over telecom standards.
[36] FronTalk: Benchmarking Front-End Development as Conversational Code Generation with Multi-Modal Feedback
Xueqing Wu, Zihan Xue, Da Yin, Shuyan Zhou, Kai-Wei Chang, Nanyun Peng, Yeming Wen
🧩 TL;DR
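This paper presents FronTalk, a benchmark for front-end code generation that focuses on a distinctive interaction dynamic, conversational code generation with multi-modal feedback, and reveals systematic challenges for existing models in feature forgetting and visual feedback understanding.
📘 Detailed Summary
Motivation: In front-end development, visual artifacts such as sketches, mockups, and annotated screenshots are essential for conveying design intent, yet their role in multi-turn code generation remains largely unexplored. This study aims to fill that gap, focusing on conversational code generation with multi-modal feedback in front-end tasks.
Method: The study introduces FronTalk, a collection of 100 multi-turn dialogues derived from real-world websites across domains such as news, finance, and art. Each turn includes both a textual instruction and an equivalent visual instruction representing the same user intent. The study also proposes an agent-based evaluation framework in which a web agent simulates users exploring the website, measuring both functional correctness and user experience. To address feature forgetting, it proposes AceCoder, a method that critiques the implementation of every past instruction using an autonomous web agent.
Result: Evaluation of 20 models reveals two key challenges: a forgetting issue in which models overwrite previously implemented features and cause task failures, and a persistent difficulty in interpreting visual feedback, especially for open-source vision-language models. The proposed AceCoder baseline reduces forgetting to nearly zero and improves performance by up to 9.3% (from 56.0% to 65.3%).
Conclusion: The study provides a solid foundation for research on front-end development and on the general interaction dynamics of multi-turn, multi-modal code generation. The results show that feature forgetting and visual feedback understanding are important challenges for current models and call for systematic solutions. The benchmark and evaluation framework offer useful tools and directions for future work.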
📄 Abstract
We present FronTalk, a benchmark for front-end code generation that pioneers the study of a unique interaction dynamic: conversational code generation with multi-modal feedback. In front-end development, visual artifacts such as sketches, mockups, and annotated screenshots are essential for conveying design intent, yet their role in multi-turn code generation remains largely unexplored. To address this gap, we focus on the front-end development task and curate FronTalk, a collection of 100 multi-turn dialogues derived from real-world websites across diverse domains such as news, finance, and art. Each turn features both a textual instruction and an equivalent visual instruction, each representing the same user intent. To comprehensively evaluate model performance, we propose a novel agent-based evaluation framework leveraging a web agent to simulate users and explore the website, thereby measuring both functional correctness and user experience. Evaluation of 20 models reveals two key challenges that are under-explored systematically in the literature: (1) a significant forgetting issue where models overwrite previously implemented features, resulting in task failures, and (2) a persistent challenge in interpreting visual feedback, especially for open-source vision-language models (VLMs). We propose a strong baseline to tackle the forgetting issue with AceCoder, a method that critiques the implementation of every past instruction using an autonomous web agent. This approach significantly reduces forgetting to nearly zero and improves the performance by up to 9.3% (56.0% to 65.3%). Overall, we aim to provide a solid foundation for future research in front-end development and the general interaction dynamics of multi-turn, multi-modal code generation. Code and data are released at https://github.com/shirley-wu/frontalk
[37] RIGOURATE: Quantifying Scientific Exaggeration with Evidence-Aligned Claim Evaluation
Joseph James, Chenghao Xiao, Yucheng Li, Nafise Sadat Moosavi, Chenghua Lin
🧩 TL;DR
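This paper presents RIGOURATE, a two-stage multimodal framework that retrieves supporting evidence from a paper's body and assigns each claim an overstatement score, aiming to operationalise evidential proportionality and support clearer, more transparent scientific communication.
📘 Detailed Summary
Motivation: Scientific rigour is often sidelined, and authors tend to overstate claims beyond what their results support, which undermines transparency and evidential proportionality in scientific communication.
Method: The study proposes a two-stage multimodal framework built on a dataset of over 10K claim-evidence sets from ICLR and NeurIPS papers, annotated using eight LLMs, with overstatement scores calibrated using peer-review comments. The framework uses a fine-tuned reranker for evidence retrieval and a fine-tuned model to predict overstatement scores with justification.
Result: Compared with strong baselines, RIGOURATE improves both evidence retrieval and overstatement detection. Its effectiveness is validated through human evaluation, and it shows practical value for identifying overstated claims in scientific papers.
Conclusion: The work operationalises evidential proportionality and provides a tool that supports clearer, more transparent scientific communication. The framework can help reduce overstatement in scientific papers and encourage evidence-based argumentation and more rigorous scholarly communication.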
📄 Abstract
Scientific rigour tends to be sidelined in favour of bold statements, leading authors to overstate claims beyond what their results support. We present RIGOURATE, a two-stage multimodal framework that retrieves supporting evidence from a paper's body and assigns each claim an overstatement score. The framework consists of a dataset of over 10K claim-evidence sets from ICLR and NeurIPS papers, annotated using eight LLMs, with overstatement scores calibrated using peer-review comments and validated through human evaluation. It employs a fine-tuned reranker for evidence retrieval and a fine-tuned model to predict overstatement scores with justification. Compared to strong baselines, RIGOURATE enables improved evidence retrieval and overstatement detection. Overall, our work operationalises evidential proportionality and supports clearer, more transparent scientific communication.
[38] Identifying Good and Bad Neurons for Task-Level Controllable LLMs
Wenjie Li, Guansong Pang, Hezhe Qiao, Debin Gao, David Lo
🧩 TL;DR
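This paper proposes NeuronLLM, a task-level LLM understanding framework based on the principle of functional antagonism. It uses contrastive learning over "good neurons" that facilitate task completion and "bad neurons" that inhibit it to model neuron function holistically, and outperforms existing methods on multiple NLP tasks.
📘 Detailed Summary
Motivation: Existing work on neuron interpretability in LLMs has three main limitations: ability-specific methods are ill-suited to task scenarios that require coordinated use of multiple abilities; they focus only on supportive neurons that correlate positively with the task and neglect other roles such as inhibitive neurons; and fortuitous correct behavior in LLMs leads to mistaken neuron attribution. These issues hinder a deeper understanding of how LLMs function.
Method: NeuronLLM adopts the biological principle of functional antagonism, modeling task performance as jointly determined by good neurons that facilitate task completion and bad neurons that inhibit it. It models both opposing roles through contrastive learning and uses augmented question sets to mitigate fortuitous correct behavior, enabling holistic identification and analysis of neuron function.
Result: Comprehensive experiments on LLMs of different sizes and families show that NeuronLLM outperforms existing methods on four NLP tasks, validating the framework for neuron identification and providing new empirical support for understanding the functional organization of LLMs.
Conclusion: The study shows that task performance is jointly determined by opposing neuron roles. The functional antagonism framework offers a new perspective on LLM understanding, helps identify and steer internal functional units more accurately, and has implications for research on model interpretability and controllability.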
📄 Abstract
Large Language Models have demonstrated remarkable capabilities on multiple-choice question answering benchmarks, but the complex mechanisms underlying their large-scale neurons remain opaque, posing significant challenges for understanding and steering LLMs. While recent studies have made progress on identifying the neurons responsible for certain abilities, these ability-specific methods are infeasible for task-focused scenarios requiring coordinated use of multiple abilities. Moreover, these approaches focus only on supportive neurons that correlate positively with task completion, while neglecting neurons with other roles, such as inhibitive ones, and they suffer from misled neuron attribution due to fortuitous behaviors in LLMs (i.e., answering questions correctly by chance rather than through genuine understanding). To address these challenges, we propose NeuronLLM, a novel task-level LLM understanding framework that adopts the biological principle of functional antagonism for LLM neuron identification. The key insight is that task performance is jointly determined by neurons with two opposing roles: good neurons that facilitate task completion and bad neurons that inhibit it. NeuronLLM achieves a holistic modeling of neurons via contrastive learning of good and bad neurons, while leveraging augmented question sets to mitigate the fortuitous behaviors in LLMs. Comprehensive experiments on LLMs of different sizes and families show the superiority of NeuronLLM over existing methods in four NLP tasks, providing new insights into LLM functional organization.
[39] Aligning Text, Code, and Vision: A Multi-Objective Reinforcement Learning Framework for Text-to-Visualization
Mizanur Rahman, Mohammed Saidul Islam, Md Tahmid Rahman Laskar, Shafiq Joty, Enamul Hoque
🧩 TL;DR
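This paper proposes RL-Text2Vis, the first reinforcement learning framework for text-to-visualization generation. A multi-objective reward jointly optimizes textual accuracy, code validity, and visualization quality, significantly improving chart quality and code execution success.
📘 Detailed Summary
Motivation: In current text-to-visualization systems, charts generated by closed-source LLMs often lack semantic alignment and clarity, while open-source models frequently produce non-executable or visually poor outputs. Supervised fine-tuning can improve code executability but fails to improve overall visualization quality, because the traditional SFT loss cannot capture post-execution feedback.
Method: RL-Text2Vis is a reinforcement learning framework built on Group Relative Policy Optimization (GRPO). It uses a novel multi-objective reward that jointly optimizes textual accuracy, code validity, and visualization quality using post-execution feedback. The method is applied to Qwen2.5 models (7B and 14B), trained specifically for the text-to-visualization task.
Result: On the Text2Vis benchmark, RL-Text2Vis achieves a 22% relative improvement in chart quality over GPT-4o and raises code execution success from 78% for the zero-shot baseline to 97%. The models significantly outperform strong zero-shot and supervised baselines and generalize robustly to out-of-domain datasets such as VIS-Eval and NVBench.
Conclusion: The study establishes GRPO as an effective strategy for structured, multimodal reasoning in visualization generation, demonstrates the advantage of reinforcement learning for improving text-to-visualization quality, and offers a new training paradigm for complex structured-output tasks.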
📄 Abstract
Text-to-Visualization (Text2Vis) systems translate natural language queries over tabular data into concise answers and executable visualizations. While closed-source LLMs generate functional code, the resulting charts often lack semantic alignment and clarity, qualities that can only be assessed post-execution. Open-source models struggle even more, frequently producing non-executable or visually poor outputs. Although supervised fine-tuning can improve code executability, it fails to enhance overall visualization quality, as traditional SFT loss cannot capture post-execution feedback. To address this gap, we propose RL-Text2Vis, the first reinforcement learning framework for Text2Vis generation. Built on Group Relative Policy Optimization (GRPO), our method uses a novel multi-objective reward that jointly optimizes textual accuracy, code validity, and visualization quality using post-execution feedback. By training Qwen2.5 models (7B and 14B), RL-Text2Vis achieves a 22% relative improvement in chart quality over GPT-4o on the Text2Vis benchmark and boosts code execution success from 78% to 97% relative to its zero-shot baseline. Our models significantly outperform strong zero-shot and supervised baselines and also demonstrate robust generalization to out-of-domain datasets like VIS-Eval and NVBench. These results establish GRPO as an effective strategy for structured, multimodal reasoning in visualization generation. We release our code at https://github.com/vis-nlp/RL-Text2Vis.
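To make the combination of a multi-objective reward with GRPO more concrete, here is a minimal sketch. It assumes the three reward components are already scored in [0, 1] and are mixed with equal weights; the abstract does not give the actual weighting or scoring functions, so `combined_reward` and its weights are hypothetical. The group-relative normalization (reward minus group mean, divided by group standard deviation) is the standard GRPO advantage.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def combined_reward(
    text_acc: float,
    code_valid: float,
    vis_quality: float,
    weights: Sequence[float] = (1 / 3, 1 / 3, 1 / 3),  # hypothetical equal weights
) -> float:
    """Weighted sum of the three objectives (assumed form, not from the paper)."""
    w_text, w_code, w_vis = weights
    return w_text * text_acc + w_code * code_valid + w_vis * vis_quality


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize each sampled response's reward against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In this sketch a response whose code fails to execute receives a low combined reward and therefore a negative advantage relative to the other responses sampled for the same query, which is how post-execution feedback would reach the policy update.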
[40] When More Words Say Less: Decoupling Length and Specificity in Image Description Evaluation
Rhea Kapur, Robert Hawkins, Elisa Kreiss
🧩 TL;DR
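This study observes that description specificity is often conflated with length in current vision-language model systems. It proposes defining specificity relative to a contrast set and, using a length-controlled dataset, validates that people prefer more specific descriptions over longer ones.
📘 Detailed Summary
Motivation: In current vision-language systems, description specificity is often mistakenly conflated with description length, even though descriptions can be lengthy yet vacuous or concise yet dense with information. The study aims to disentangle the two concepts and define specificity precisely.
Method: The study defines specificity relative to a contrast set: a description is more specific to the extent that it distinguishes the target image from other possible images. It constructs a dataset that controls description length while varying information content, and validates the specificity measure through human preference experiments.
Result: People do prefer more specific descriptions over longer ones. Controlling for length alone cannot account for differences in specificity; how the length budget is allocated has a significant effect on description quality. These findings support evaluation approaches that directly prioritize specificity over verbosity.
Conclusion: The study underscores the importance of separating specificity from length when evaluating vision-language models, proposes a contrast-set-based definition of specificity, and provides a theoretical basis for more precise image description metrics. It suggests that future work optimize the information density of descriptions directly rather than merely controlling length.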
📄 Abstract
Vision-language models (VLMs) are increasingly used to make visual content accessible via text-based descriptions. In current systems, however, description specificity is often conflated with length. We argue that these two concepts must be disentangled: descriptions can be concise yet dense with information, or lengthy yet vacuous. We define specificity relative to a contrast set, where a description is more specific to the extent that it picks out the target image better than other possible images. We construct a dataset that controls for length while varying information content, and validate that people reliably prefer more specific descriptions regardless of length. We find that controlling for length alone cannot account for differences in specificity: how the length budget is allocated makes a difference. These results support evaluation approaches that directly prioritize specificity over verbosity.
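One way to read the contrast-set definition is as the probability that a description selects the target image out of the target plus its contrast images. The sketch below assumes some external scorer has already produced a description-image match score for each image; the softmax form, the `temperature` parameter, and the function name are illustrative assumptions rather than the paper's metric.

```python
import math
from typing import Sequence


def specificity(
    target_score: float,
    contrast_scores: Sequence[float],
    temperature: float = 1.0,  # hypothetical scaling parameter
) -> float:
    """Softmax probability that the description picks out the target image."""
    scores = [target_score, *contrast_scores]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    return exps[0] / sum(exps)
```

Under this reading, a long but generic description that matches every image equally well scores at chance level (1 divided by the number of images), regardless of its length, while a short description that matches only the target scores close to 1.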
[41] Safe in the Future, Dangerous in the Past: Dissecting Temporal and Linguistic Vulnerabilities in LLMs
Muhammad Abdullahi Said, Muhammad Sammani Sani
🧩 TL;DR
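Through a systematic audit, this study finds that multilingual safety alignment in large language models is governed by a complex interference mechanism rather than simple degradation, shows that safety depends on the interaction between language and temporal framing, and proposes Invariant Alignment as a new paradigm.
📘 Detailed Summary
Motivation: Current research has a dangerous blind spot: the assumption that safety alignment transfers zero-shot from English to other languages. As LLMs are integrated into critical global infrastructure, this assumption may expose Global South users to localized harms.
Method: The study conducts a systematic audit using HausaSafety, a novel adversarial dataset grounded in West African threat scenarios. It runs 1,440 evaluations of three state-of-the-art models (GPT-5.1, Gemini 3 Pro, and Claude 4.5 Opus) under a 2 x 4 factorial design that tests the non-linear interaction between language and temporal framing.
Result: The results challenge the simple narrative of a multilingual safety gap: a complex interference mechanism determines safety. Claude 4.5 Opus is significantly safer in Hausa than in English, while the models fail catastrophically in temporal reasoning: past-tense framing bypasses defenses and future-tense scenarios trigger hyper-conservative refusals. There is a 9.2x disparity between the safest and most vulnerable configurations.
Conclusion: Current models rely on superficial heuristics rather than robust semantic understanding, creating safety blind spots that leave Global South users exposed to localized harms. A shift to Invariant Alignment is needed to keep safety stable across linguistic and temporal variation; safety is not a fixed property but a context-dependent state.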
📄 Abstract
As Large Language Models (LLMs) integrate into critical global infrastructure, the assumption that safety alignment transfers zero-shot from English to other languages remains a dangerous blind spot. This study presents a systematic audit of three state-of-the-art models (GPT-5.1, Gemini 3 Pro, and Claude 4.5 Opus) using HausaSafety, a novel adversarial dataset grounded in West African threat scenarios (e.g., Yahoo-Yahoo fraud, Dane gun manufacturing). Employing a 2 x 4 factorial design across 1,440 evaluations, we tested the non-linear interaction between language (English vs. Hausa) and temporal framing. Our results challenge the narrative of the multilingual safety gap. Instead of a simple degradation in low-resource settings, we identified a complex interference mechanism in which safety is determined by the intersection of variables. Although the models exhibited a reverse linguistic vulnerability, with Claude 4.5 Opus proving significantly safer in Hausa (45.0%) than in English (36.7%) due to uncertainty-driven refusal, they suffered catastrophic failures in temporal reasoning. We report a profound Temporal Asymmetry, where past-tense framing bypassed defenses (15.6% safe) while future-tense scenarios triggered hyper-conservative refusals (57.2% safe). The magnitude of this volatility is illustrated by a 9.2x disparity between the safest and most vulnerable configurations, proving that safety is not a fixed property but a context-dependent state. We conclude that current models rely on superficial heuristics rather than robust semantic understanding, creating Safety Pockets that leave Global South users exposed to localized harms. We propose Invariant Alignment as a necessary paradigm shift to ensure safety stability across linguistic and temporal shifts.
[42] See, Explain, and Intervene: A Few-Shot Multimodal Agent Framework for Hateful Meme Moderation
Naquee Rizwan, Subhankar Swain, Paramananda Bhaskar, Gagan Aryan, Shehryaar Shah Khan, Animesh Mukherjee
🧩 TL;DR
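This paper proposes a generalizable, generative-AI-based framework for detecting, explaining, and intervening on hateful memes. It is presented as the first to handle all three tasks together under limited data, using task-specific generative multimodal agents and the few-shot adaptability of large multimodal models to handle different types of memes.
📘 Detailed Summary
Motivation: Research on hateful memes faces three key problems: detection, explanation, and intervention are usually studied separately, which does not reflect real-world conditions; building large annotated datasets is prohibitively expensive; and there is no generalizable moderation approach for limited-data conditions. The paper aims to fill these gaps by handling the three tasks together while remaining effective when data is scarce.
Method: The paper proposes a novel framework that leverages task-specific generative multimodal agents and the few-shot adaptability of large multimodal models to handle different types of memes. It builds a range of strategies on top of generative AI models to address the three complementary tasks of detection, explanation, and intervention, with a focus on practical limited-data settings.
Result: The study is presented as the first to achieve generalizable hateful meme moderation under limited data, and the framework shows strong potential for deployment in real-world production scenarios. By integrating detection, explanation, and intervention, the system addresses hateful memes more comprehensively and matches real-world needs better than approaches that treat the tasks separately.
Conclusion: The study offers what the authors describe as the first generalizable approach to hateful meme moderation under limited data, and unifying detection, explanation, and intervention has practical significance. The framework illustrates the potential of generative AI for multimodal content moderation and provides a feasible technical path for content moderation systems in production.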
📄 Abstract
In this work, we examine hateful memes from three complementary angles (how to detect them, how to explain their content, and how to intervene before they are posted) by applying a range of strategies built on top of generative AI models. To the best of our knowledge, explanation and intervention have typically been studied separately from detection, which does not reflect real-world conditions. Further, since curating large annotated datasets for meme moderation is prohibitively expensive, we propose a novel framework that leverages task-specific generative multimodal agents and the few-shot adaptability of large multimodal models to cater to different types of memes. We believe this is the first work focused on generalizable hateful meme moderation under limited data conditions, and that it has strong potential for deployment in real-world production scenarios. Warning: Contains potentially toxic content.
[43] Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking
Mingxin Li, Yanzhao Zhang, Dingkun Long, Keqin Chen, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, Jingren Zhou, Junyang Lin
🧩 TL;DR
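This paper introduces the Qwen3-VL-Embedding and Qwen3-VL-Reranker model series, built on the Qwen3-VL foundation model. Through a multi-stage training paradigm they learn a unified representation across text, images, document images, and video, achieving state-of-the-art results on multimodal embedding benchmarks.
📘 Detailed Summary
Motivation: The work addresses unified representation of different modalities (text, images, document images, video) in multimodal search, building an end-to-end pipeline for high-precision multimodal search to overcome the limitations of existing methods in cross-modal semantic alignment and fine-grained relevance estimation.
Method: Qwen3-VL-Embedding uses a multi-stage training paradigm, progressing from large-scale contrastive pre-training to reranking model distillation; it supports Matryoshka Representation Learning for flexible embedding dimensions and handles inputs of up to 32k tokens. Qwen3-VL-Reranker uses a cross-encoder architecture with cross-attention mechanisms for fine-grained relevance estimation over query-document pairs. Both series inherit the multilingual capabilities of Qwen3-VL, supporting more than 30 languages, and are released in 2B and 8B parameter sizes to suit different deployment requirements.
Result: The Qwen3-VL-Embedding series achieves state-of-the-art results on multimodal embedding evaluation benchmarks. Qwen3-VL-Embedding-8B attains an overall score of 77.8 on MMEB-V2, ranking first among all models (as of January 8, 2025). The models perform well on a range of multimodal retrieval tasks, including image-text retrieval, visual question answering, and video-text matching.
Conclusion: The work demonstrates the effectiveness of multi-stage training and a unified representation space for multimodal embedding learning and provides a scalable solution for practical multimodal search. Matryoshka Representation Learning and the choice of parameter sizes increase the models' practicality and deployment flexibility, offering a useful reference for the development of multimodal AI systems.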
📄 Abstract
In this report, we introduce the Qwen3-VL-Embedding and Qwen3-VL-Reranker model series, the latest extensions of the Qwen family built on the Qwen3-VL foundation model. Together, they provide an end-to-end pipeline for high-precision multimodal search by mapping diverse modalities, including text, images, document images, and video, into a unified representation space. The Qwen3-VL-Embedding model employs a multi-stage training paradigm, progressing from large-scale contrastive pre-training to reranking model distillation, to generate semantically rich high-dimensional vectors. It supports Matryoshka Representation Learning, enabling flexible embedding dimensions, and handles inputs up to 32k tokens. Complementing this, Qwen3-VL-Reranker performs fine-grained relevance estimation for query-document pairs using a cross-encoder architecture with cross-attention mechanisms. Both model series inherit the multilingual capabilities of Qwen3-VL, supporting more than 30 languages, and are released in 2B and 8B parameter sizes to accommodate diverse deployment requirements. Empirical evaluations demonstrate that the Qwen3-VL-Embedding series achieves state-of-the-art results across diverse multimodal embedding evaluation benchmarks. Specifically, Qwen3-VL-Embedding-8B attains an overall score of 77.8 on MMEB-V2, ranking first among all models (as of January 8, 2025). This report presents the architecture, training methodology, and practical capabilities of the series, demonstrating their effectiveness on various multimodal retrieval tasks, including image-text retrieval, visual question answering, and video-text matching.
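Matryoshka-style embeddings are usually consumed by keeping only a prefix of the vector and re-normalizing it. The sketch below illustrates that general pattern on plain Python lists; it is not tied to the released models' API, the function names are hypothetical, and which output dimensions the models actually support is not stated in the abstract.

```python
import math
from typing import List, Sequence


def truncate_embedding(vec: Sequence[float], dim: int) -> List[float]:
    """Keep the first `dim` components and L2-normalize the result."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # all-zero prefix: nothing to normalize
    return [x / norm for x in head]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product; equals cosine similarity when both inputs are L2-normalized."""
    return sum(x * y for x, y in zip(a, b))
```

Truncating both the query and document embeddings to the same smaller dimension trades some retrieval accuracy for lower storage and faster similarity search, which is the flexibility the abstract refers to.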
[44] AM$^3$Safety: Towards Data Efficient Alignment of Multi-modal Multi-turn Safety for MLLMs
Han Zhu, Jiale Chen, Chengkun Cai, Shengjie Sun, Haoran Li, Yujin Zhou, Chi-Min Chan, Pengcheng Wen, Lei Li, Sirui Han, Yike Guo
🧩 TL;DR
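This paper presents InterSafe-V, a multi-modal dialogue safety dataset, and AM³Safety, a framework that combines a cold-start refusal phase with dialogue-level fine-tuning based on Group Relative Policy Optimization. Together they significantly improve the safety of multi-modal large language models in multi-turn dialogue while preserving general abilities.
📘 Detailed Summary
Motivation: Multi-modal large language models have serious safety vulnerabilities in interactive applications, especially in multi-turn multi-modal scenarios where harmful intent can be gradually reconstructed across turns while safety protocols fade. Existing RLHF methods mainly target single-turn visual question answering and require costly manual preference annotations, limiting their effectiveness and scalability in dialogue.
Method: The study presents InterSafe-V, an open-source multi-modal dialogue dataset containing 11,270 dialogues and 500 specially designed refusal VQA samples, constructed through interaction between models to better reflect real-world scenarios. Building on it, the AM³Safety framework combines a cold-start refusal phase with Group Relative Policy Optimization, fine-tuning with turn-aware dual-objective rewards computed over entire dialogues.
Result: Experiments on Qwen2.5-VL-7B-Instruct and LLaVA-NeXT-7B show a decrease of more than 10% in Attack Success Rate, together with gains of at least 8% on the harmless dimension and over 13% on the helpful dimension on multi-modal multi-turn safety benchmarks, while preserving general abilities.
Conclusion: The study shows that a purpose-built multi-turn dialogue safety dataset combined with dialogue-level fine-tuning can effectively improve the safety of multi-modal LLMs in interactive settings. It offers a new approach to safety alignment for multi-modal dialogue systems and demonstrates that safety can be improved while preserving general abilities.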
📄 Abstract
Multi-modal Large Language Models (MLLMs) are increasingly deployed in interactive applications. However, their safety vulnerabilities become pronounced in multi-turn multi-modal scenarios, where harmful intent can be gradually reconstructed across turns, and security protocols fade into oblivion as the conversation progresses. Existing Reinforcement Learning from Human Feedback (RLHF) alignment methods are largely developed for single-turn visual question-answer (VQA) tasks and often require costly manual preference annotations, limiting their effectiveness and scalability in dialogues. To address this challenge, we present InterSafe-V, an open-source multi-modal dialogue dataset containing 11,270 dialogues and 500 specially designed refusal VQA samples. This dataset, constructed through interaction between several models, is designed to more accurately reflect real-world scenarios and includes specialized VQA pairs tailored for specific domains. Building on this dataset, we propose AM$^3$Safety, a framework that combines a cold-start refusal phase with Group Relative Policy Optimization (GRPO) fine-tuning using turn-aware dual-objective rewards across entire dialogues. Experiments on Qwen2.5-VL-7B-Instruct and LLaVA-NeXT-7B show a decrease of more than 10% in Attack Success Rate (ASR), together with an increase of at least 8% in the harmless dimension and over 13% in the helpful dimension of MLLMs on multi-modal multi-turn safety benchmarks, while preserving their general abilities.
[45] When AI Settles Down: Late-Stage Stability as a Signature of AI-Generated Text Detection
Ke Sun, Guangsheng Bao, Han Cui, Yue Zhang
🧩 TL;DR
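This study identifies Late-Stage Volatility Decay in AI-generated text and proposes a zero-shot detection method based on late-stage statistical features. Without perturbation sampling or additional model access, it achieves state-of-the-art detection performance on multiple benchmarks.
📘 Detailed Summary
Motivation: Existing zero-shot detectors of AI-generated text typically aggregate token-level statistics over entire sequences, overlooking the temporal dynamics of autoregressive generation, which limits detection performance.
Method: By analyzing over 120k text samples, the study identifies Late-Stage Volatility Decay and, based on it, proposes two simple features, Derivative Dispersion and Local Volatility. Both are computed only from late-stage statistics and require no perturbation sampling or additional model access.
Result: The method achieves state-of-the-art performance on the EvoBench and MAGE benchmarks. AI-generated text shows 24-32% lower volatility in the second half of sequences, and the method is strongly complementary to existing global methods.
Conclusion: The study reveals differences in the temporal dynamics of AI-generated and human-written text and demonstrates the detection value of late-stage statistics. It points to a new direction for zero-shot detection and shows that simple features can complement more complex methods.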
📄 Abstract
Zero-shot detection methods for AI-generated text typically aggregate token-level statistics across entire sequences, overlooking the temporal dynamics inherent to autoregressive generation. We analyze over 120k text samples and reveal Late-Stage Volatility Decay: AI-generated text exhibits rapidly stabilizing log probability fluctuations as generation progresses, while human writing maintains higher variability throughout. This divergence peaks in the second half of sequences, where AI-generated text shows 24-32% lower volatility. Based on this finding, we propose two simple features, Derivative Dispersion and Local Volatility, which are computed exclusively from late-stage statistics. Without perturbation sampling or additional model access, our method achieves state-of-the-art performance on EvoBench and MAGE benchmarks and demonstrates strong complementarity with existing global methods.
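The abstract names the two features but does not define them, so the following is only a guess at their general shape. It assumes per-token log probabilities from a scoring model are available, takes "late stage" to mean the second half of the sequence, treats Derivative Dispersion as the standard deviation of first differences, and treats Local Volatility as the mean rolling-window standard deviation. All four choices, and the window size, are assumptions.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def late_stage(logprobs: Sequence[float], start_frac: float = 0.5) -> List[float]:
    """Token log-probs from `start_frac` of the sequence onward (assumed split)."""
    start = int(len(logprobs) * start_frac)
    return list(logprobs[start:])


def derivative_dispersion(logprobs: Sequence[float]) -> float:
    """Std of first differences over the late stage (assumed definition)."""
    late = late_stage(logprobs)
    diffs = [b - a for a, b in zip(late, late[1:])]
    return pstdev(diffs) if diffs else 0.0


def local_volatility(logprobs: Sequence[float], window: int = 4) -> float:
    """Mean rolling-window std over the late stage (assumed definition)."""
    late = late_stage(logprobs)
    if len(late) < window:
        return pstdev(late) if late else 0.0
    stds = [pstdev(late[i:i + window]) for i in range(len(late) - window + 1)]
    return mean(stds)
```

Under these assumptions, text whose log probabilities settle down late in the sequence yields lower values for both features than text that keeps fluctuating, which is the direction of the effect the abstract reports for AI-generated versus human writing.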
[46] MisSpans: Fine-Grained False Span Identification in Cross-Domain Fake News
Zhiwei Liu, Paul Thompson, Jiaqi Rong, Baojie Qu, Runteng Guo, Min Peng, Qianqian Xie, Sophia Ananiadou
🧩 TL;DR
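This paper introduces MisSpans, the first multi-domain, human-annotated benchmark for fine-grained misinformation detection and analysis. It comprises three complementary tasks (false span identification, false type classification, and span-grounded explanation generation) and targets the limitations of existing methods in localizing and explaining misinformation at the sentence level.
📘 Detailed Summary
Motivation: Existing misinformation benchmarks and methods typically evaluate veracity at the level of whole claims or paragraphs using coarse binary labels, obscuring the fact that true and false details often co-exist within a single sentence. These simplifications limit interpretability: global explanations cannot identify which specific segments are misleading or distinguish how a detail is false (e.g., distorted vs. fabricated).
Method: The study introduces the MisSpans benchmark, consisting of paired real and fake news stories, and defines three complementary tasks: MisSpansIdentity for pinpointing false spans within sentences, MisSpansType for categorising false spans by misinformation type, and MisSpansExplanation for providing rationales grounded in the identified spans. Expert annotators followed standardised guidelines and consistency checks, achieving high inter-annotator agreement. The study evaluates 15 representative LLMs, including reasoning-enhanced and non-reasoning variants, under zero-shot and one-shot settings.
Result: Fine-grained misinformation identification and analysis proves challenging, and the results highlight the need to understand how multiple interacting factors, including model size, reasoning capabilities, and domain-specific textual features, influence performance. The high inter-annotator agreement supports the reliability of the dataset.
Conclusion: MisSpans provides the first systematic framework for fine-grained misinformation detection and analysis and exposes the limitations of current LLMs in precisely identifying and explaining false spans. The study emphasizes the importance of factors such as model size, reasoning capabilities, and domain characteristics, and lays a foundation for more precise and interpretable misinformation detection systems.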
📄 Abstract
Online misinformation is increasingly pervasive, yet most existing benchmarks and methods evaluate veracity at the level of whole claims or paragraphs using coarse binary labels, obscuring how true and false details often co-exist within single sentences. These simplifications also limit interpretability: global explanations cannot identify which specific segments are misleading or differentiate how a detail is false (e.g., distorted vs. fabricated). To address these gaps, we introduce MisSpans, the first multi-domain, human-annotated benchmark for span-level misinformation detection and analysis, consisting of paired real and fake news stories. MisSpans defines three complementary tasks: MisSpansIdentity for pinpointing false spans within sentences, MisSpansType for categorising false spans by misinformation type, and MisSpansExplanation for providing rationales grounded in identified spans. Together, these tasks enable fine-grained localisation, nuanced characterisation beyond true/false and actionable explanations. Expert annotators were guided by standardised guidelines and consistency checks, leading to high inter-annotator agreement. We evaluate 15 representative LLMs, including reasoning-enhanced and non-reasoning variants, under zero-shot and one-shot settings. Results reveal the challenging nature of fine-grained misinformation identification and analysis, and highlight the need for a deeper understanding of how performance may be influenced by multiple interacting factors, including model size and reasoning capabilities, along with domain-specific textual features. This project will be available at https://github.com/lzw108/MisSpans.
[47] V-FAT: Benchmarking Visual Fidelity Against Text-bias
Ziteng Wang, Yujie He, Guanliang Li, Siqi Yang, Jiaqi Xiong, Songxiang Liu
🧩 TL;DR
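This paper presents V-FAT, a benchmark for diagnosing text bias in multimodal large language models. It shows that when visual evidence conflicts with textual information, existing models rely excessively on linguistic shortcuts rather than genuine visual grounding.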
📘 Detailed Summary
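Motivation: Multimodal large language models (MLLMs) perform well on standard visual reasoning benchmarks but rely excessively on linguistic shortcuts rather than genuine visual grounding, a phenomenon termed text bias. The study investigates the fundamental tension between visual perception and linguistic priors and decouples the sources of bias into two dimensions: internal corpus bias and external instruction bias.
Method: The study introduces the V-FAT diagnostic benchmark, comprising 4,026 VQA instances across six semantic domains. A three-level evaluation framework systematically increases the conflict between visual evidence and textual information: L1 covers internal bias triggered by atypical images, L2 covers external bias triggered by misleading instructions, and L3 covers synergistic bias where both coincide. The study also proposes the Visual Robustness Score, a metric designed to penalize "lucky" linguistic guesses and reward true visual fidelity.
Result: An evaluation of 12 frontier MLLMs shows that, although the models excel on existing benchmarks, they experience significant visual collapse under high linguistic dominance. V-FAT reveals systematic weaknesses when visual evidence conflicts with textual information, with performance dropping noticeably at the L2 and L3 levels.
Conclusion: The study indicates that current MLLMs suffer from severe text bias, relying on linguistic priors rather than visual evidence. This underlines the importance of developing more robust visually grounded models and offers a new methodology for evaluating visual fidelity. The findings suggest that future multimodal model designs need to balance visual and linguistic processing more carefully.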
📄 Abstract
Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive performance on standard visual reasoning benchmarks. However, there is growing concern that these models rely excessively on linguistic shortcuts rather than genuine visual grounding, a phenomenon we term Text Bias. In this paper, we investigate the fundamental tension between visual perception and linguistic priors. We decouple the sources of this bias into two dimensions: Internal Corpus Bias, stemming from statistical correlations in pretraining, and External Instruction Bias, arising from the alignment-induced tendency toward sycophancy. To quantify this effect, we introduce V-FAT (Visual Fidelity Against Text-bias), a diagnostic benchmark comprising 4,026 VQA instances across six semantic domains. V-FAT employs a Three-Level Evaluation Framework that systematically increases the conflict between visual evidence and textual information: (L1) internal bias from atypical images, (L2) external bias from misleading instructions, and (L3) synergistic bias where both coincide. We introduce the Visual Robustness Score (VRS), a metric designed to penalize "lucky" linguistic guesses and reward true visual fidelity. Our evaluation of 12 frontier MLLMs reveals that while models excel in existing benchmarks, they experience significant visual collapse under high linguistic dominance.
[48] A Unified Spoken Language Model with Injected Emotional-Attribution Thinking for Human-like Interaction
Qing Wang, Zehan Li, Yaodong Song, Hongjie Chen, Jian Kang, Jie Lian, Jie Li, Yongxiang Li, Xuelong Li
🧩 TL;DR
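This paper proposes a unified, emotionally intelligent spoken language model. Through a novel data construction strategy, Injected Emotional-Attribution Thinking (IEAT), it incorporates users' emotional states and their underlying causes into the model's internal reasoning process, so that emotion-aware reasoning is internalized rather than explicitly supervised.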
📘 Detailed Summary
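Motivation: Existing spoken dialogue systems are limited in emotional intelligence. They typically treat emotion perception as external supervision rather than as an internal reasoning capability, and they lack deep understanding and integration of users' emotional states and their causes. This limits performance on emotional trajectory modeling, emotional reasoning, and empathetic response generation.
Method: The paper proposes the Injected Emotional-Attribution Thinking (IEAT) data construction strategy, which incorporates users' emotional states and their underlying causes into the model's internal reasoning process. Training follows a two-stage progressive strategy: the first stage performs speech-text alignment and emotional attribute modeling via self-distillation, and the second stage conducts end-to-end cross-modal joint optimization to ensure consistency between textual and spoken emotional expressions.
Result: On the Emotional Intelligence benchmark of the Human-like Spoken Dialogue Systems Challenge, the approach achieves top-ranked performance across emotional trajectory modeling, emotional reasoning, and empathetic response generation, under both LLM-based and human evaluations.
Conclusion: By injecting emotional-attribution thinking into the model's internal reasoning, IEAT internalizes emotion-aware capability and offers an effective framework for building more natural and emotionally intelligent spoken dialogue systems. It also shows the importance of cross-modal emotional consistency.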
📄 Abstract
This paper presents a unified spoken language model for emotional intelligence, enhanced by a novel data construction strategy termed Injected Emotional-Attribution Thinking (IEAT). IEAT incorporates user emotional states and their underlying causes into the model's internal reasoning process, enabling emotion-aware reasoning to be internalized rather than treated as explicit supervision. The model is trained with a two-stage progressive strategy. The first stage performs speech-text alignment and emotional attribute modeling via self-distillation, while the second stage conducts end-to-end cross-modal joint optimization to ensure consistency between textual and spoken emotional expressions. Experiments on the Human-like Spoken Dialogue Systems Challenge (HumDial) Emotional Intelligence benchmark demonstrate that the proposed approach achieves top-ranked performance across emotional trajectory modeling, emotional reasoning, and empathetic response generation under both LLM-based and human evaluations.
[49] Compositional Steering of Large Language Models with Steering Tokens
Gorjan Radevski, Kiril Gashteovski, Giwon Hong, Carolin Lawrence, Goran Glavaš
🧩 TL;DR
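This paper proposes compositional steering tokens, a new method for compositional control over multiple behaviors in large language models. It encodes natural language instructions into dedicated tokens via self-distillation and trains a composition token that generalizes to unseen behavior combinations, enabling effective zero-shot compositional steering in the input token space.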
📘 Detailed Summary
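Motivation: Deploying LLMs in real-world applications requires controllable output that satisfies several desiderata at once. Existing work focuses mainly on steering a single behavior, while compositional steering, that is, steering an LLM towards multiple behaviors simultaneously, remains underexplored. Most current methods operate in the activation space and lack effective zero-shot control over combinations of behaviors.
Method: The method first embeds individual behaviors, expressed as natural language instructions, into dedicated tokens via self-distillation, so that steering operates in the input token space rather than the activation space. It then trains a dedicated composition token on pairs of behaviors. This token captures the notion of composition and generalizes to unseen compositions, including those containing unseen behaviors and those with an unseen number of behaviors.
Result: Experiments across different LLM architectures show that steering tokens give better multi-behavior control than competing approaches (instructions, activation steering, and LoRA merging). Steering tokens also complement natural language instructions, with their combination yielding further gains, and the composition token generalizes effectively to unseen behavior combinations.
Conclusion: The study demonstrates that compositional steering in the input token space is effective and offers a new paradigm for multi-behavior control of LLMs. The generalization ability of the composition token suggests practical value, and the complementarity between steering tokens and natural language instructions points towards hybrid control methods that address the limits of existing approaches in zero-shot compositional control.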
📄 Abstract
Deploying LLMs in real-world applications requires controllable output that satisfies multiple desiderata at the same time. While existing work extensively addresses LLM steering for a single behavior, compositional steering (i.e., steering LLMs simultaneously towards multiple behaviors) remains an underexplored problem. In this work, we propose compositional steering tokens for multi-behavior steering. We first embed individual behaviors, expressed as natural language instructions, into dedicated tokens via self-distillation. Contrary to most prior work, which operates in the activation space, our behavior steers live in the space of input tokens, enabling more effective zero-shot composition. We then train a dedicated composition token on pairs of behaviors and show that it successfully captures the notion of composition: it generalizes well to unseen compositions, including those with unseen behaviors as well as those with an unseen number of behaviors. Our experiments across different LLM architectures show that steering tokens lead to superior multi-behavior control compared to competing approaches (instructions, activation steering, and LoRA merging). Moreover, we show that steering tokens complement natural language instructions, with their combination resulting in further gains.
[50] SemPA: Improving Sentence Embeddings of Large Language Models through Semantic Preference Alignment
Ziyang Chen, Zhenxuan Huang, Yile Wang, Weiqin Wang, Lu Yin, Hui Huang
🧩 TL;DR
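This paper proposes SemPA, which uses semantic preference alignment to strengthen the sentence representations of large language models while preserving their generative ability. It applies sentence-level Direct Preference Optimization (DPO) on a paraphrase generation task to optimize LLMs efficiently, so that they learn to discriminate semantically equivalent sentences while retaining their inherent generative capacity.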
📘 Detailed Summary
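Motivation: Traditional sentence embedding methods apply token-level contrastive learning to non-generative pre-trained models. Embedding methods based on generative LLMs have two problems: fixed prompt templates involve no further optimization of the model and therefore yield limited performance, while architectural modifications alter the model's internal computation and compromise its generative ability. A new method is needed that strengthens sentence representations without sacrificing the generative ability of LLMs.
Method: The paper proposes SemPA, which strengthens sentence representations through semantic preference alignment while preserving the generative ability of LLMs. It uses sentence-level Direct Preference Optimization (DPO) on a paraphrase generation task to optimize LLMs efficiently, so that the model learns to discriminate semantically equivalent sentences. On the theoretical side, it establishes a formal connection between DPO and contrastive learning under the Plackett-Luce model framework.
Result: Experiments on semantic textual similarity tasks and on various LLM benchmarks show that SemPA achieves better semantic representations without sacrificing the inherent generative capability of LLMs.
Conclusion: SemPA offers a framework that balances stronger sentence representations with preserved generative ability, optimizing both through semantic preference alignment. It suggests a new way to improve the embedding representations of LLMs while keeping their core capabilities, with both theoretical and practical relevance.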
📄 Abstract
Traditional sentence embedding methods employ token-level contrastive learning on non-generative pre-trained models. Recently, embedding methods based on generative large language models (LLMs) have emerged. These methods either rely on fixed prompt templates or involve modifications to the model architecture. The former lacks further optimization of the model and results in limited performance, while the latter alters the internal computational mechanisms of the model, thereby compromising its generative capabilities. We propose SemPA, a novel approach that boosts sentence representations while preserving the generative ability of LLMs via semantic preference alignment. We leverage sentence-level Direct Preference Optimization (DPO) to efficiently optimize LLMs on a paraphrase generation task, where the model learns to discriminate semantically equivalent sentences while preserving its inherent generative capacity. Theoretically, we establish a formal connection between DPO and contrastive learning under the Plackett-Luce model framework. Empirically, experimental results on both semantic textual similarity tasks and various benchmarks for LLMs show that SemPA achieves better semantic representations without sacrificing the inherent generation capability of LLMs.
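The abstract names sentence-level DPO but does not state the loss. As a rough illustration, the sketch below computes the standard DPO objective for a single preference pair from sequence log-probabilities. How SemPA builds its preferred and rejected paraphrases is not described here, so the function name, the argument names, and the `beta` value are all assumptions made for illustration.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_pair_loss(policy_chosen_logp: float,
                  policy_rejected_logp: float,
                  ref_chosen_logp: float,
                  ref_rejected_logp: float,
                  beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) sentence pair.

    Inputs are summed token log-probabilities of each sentence under the
    trainable policy and under the frozen reference model (hypothetical
    inputs; a real pipeline would obtain them from the two models).
    """
    # Implicit reward margin between the chosen and the rejected sentence.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)
```

When the policy and the reference agree, the margin is zero and the loss equals log 2. The loss falls as the policy assigns relatively more probability to the chosen sentence than the reference does.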
[51] LELA: an LLM-based Entity Linking Approach with Zero-Shot Domain Adaptation
Samy Haffoudhi, Fabian M. Suchanek, Nils Holzenberger
🧩 TL;DR
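This paper proposes LELA, a modular coarse-to-fine entity linking approach that requires no fine-tuning. It leverages the capabilities of large language models, works across different domains, knowledge bases, and LLMs, and is competitive with fine-tuned methods in a range of entity linking settings.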
📘 Detailed Summary
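Motivation: Entity linking, the task of mapping ambiguous mentions in text to entities in a knowledge base, is foundational for knowledge graph construction, question answering, and information extraction. Existing methods usually require fine-tuning for a specific domain or knowledge base, which limits generality and raises deployment cost. This motivates a flexible entity linking method that needs no fine-tuning and works across domains and knowledge bases.
Method: LELA is a modular coarse-to-fine approach that uses the capabilities of large language models for entity linking. It involves no fine-tuning phase and works with different target domains, knowledge bases, and LLMs, performing entity disambiguation through a staged processing strategy.
Result: Experiments show that LELA performs well across various entity linking settings. It is highly competitive with fine-tuned approaches and substantially outperforms the non-fine-tuned ones, with stable performance across domains and knowledge bases that supports its generality.
Conclusion: LELA demonstrates the feasibility of modular entity linking without fine-tuning and provides a flexible solution for cross-domain entity linking. It lowers the deployment and maintenance cost of entity linking systems and opens a practical route for applying LLMs to the task.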
📄 Abstract
Entity linking (mapping ambiguous mentions in text to entities in a knowledge base) is a foundational step in tasks such as knowledge graph construction, question-answering, and information extraction. Our method, LELA, is a modular coarse-to-fine approach that leverages the capabilities of large language models (LLMs), and works with different target domains, knowledge bases and LLMs, without any fine-tuning phase. Our experiments across various entity linking settings show that LELA is highly competitive with fine-tuned approaches, and substantially outperforms the non-fine-tuned ones.
cs.AI [Back]
[52] GUITester: Enabling GUI Agents for Exploratory Defect Discovery
Yifei Gao, Jiang Wu, Xiaoyi Chen, Yifan Yang, Zhe Cui, Tianyi Ma, Jiaming Zhang, Jitao Sang
🧩 TL;DR
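This paper proposes GUITester, a multi-agent framework for autonomous exploratory GUI testing that addresses the inability of multimodal LLM agents to discover defects on their own during navigation tasks. By decoupling navigation from verification, the framework reaches an F1 score of 48.90% on the GUITestBench benchmark, outperforming the best baseline (33.35%).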
📘 Detailed Summary
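Motivation: Exploratory GUI testing is essential for software quality but carries high manual costs. Multimodal LLM agents excel at navigation tasks, yet they fail to discover defects autonomously because of two core challenges: Goal-Oriented Masking, where agents prioritize task completion over reporting anomalies, and Execution-Bias Attribution, where system defects are misidentified as agent errors.
Method: The study first introduces GUITestBench, the first interactive benchmark for this task, with 143 tasks covering 26 defect types. It then proposes the GUITester multi-agent framework, which decouples navigation from verification via two modules: a Planning-Execution Module that proactively probes for defects through embedded testing intents, and a Hierarchical Reflection Module that resolves attribution ambiguity by analyzing the interaction history.
Result: GUITester achieves an F1 score of 48.90% (Pass@3) on GUITestBench, compared with 33.35% for the state-of-the-art baselines. The framework addresses both Goal-Oriented Masking and Execution-Bias Attribution and demonstrates the feasibility of autonomous exploratory testing.
Conclusion: The work demonstrates the feasibility of autonomous exploratory GUI testing and provides a foundation for future GUI quality assurance. Its multi-agent approach, which decouples navigation from verification, addresses the core challenges that MLLM agents face in defect discovery and opens a new direction for automated software testing.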
📄 Abstract
Exploratory GUI testing is essential for software quality but suffers from high manual costs. While Multi-modal Large Language Model (MLLM) agents excel in navigation, they fail to autonomously discover defects due to two core challenges: Goal-Oriented Masking, where agents prioritize task completion over reporting anomalies, and Execution-Bias Attribution, where system defects are misidentified as agent errors. To address these, we first introduce GUITestBench, the first interactive benchmark for this task, featuring 143 tasks across 26 defects. We then propose GUITester, a multi-agent framework that decouples navigation from verification via two modules: (i) a Planning-Execution Module (PEM) that proactively probes for defects via embedded testing intents, and (ii) a Hierarchical Reflection Module (HRM) that resolves attribution ambiguity through interaction history analysis. GUITester achieves an F1-score of 48.90% (Pass@3) on GUITestBench, outperforming state-of-the-art baselines (33.35%). Our work demonstrates the feasibility of autonomous exploratory testing and provides a robust foundation for future GUI quality assurance. Our code is now available at https://github.com/ADaM-BJTU/GUITestBench.
[53] Specific Emitter Identification via Active Learning
Jingyi Wang, Fanggang Wang
🧩 TL;DR
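This paper proposes an active-learning-based approach to specific emitter identification. It follows a three-stage semi-supervised training scheme that improves identification performance and lowers labeling cost under a limited annotation budget.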
📘 Detailed Summary
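Motivation: Specific emitter identification (SEI) is important for wireless communication security, but model training relies heavily on large-scale labeled data that are costly and time-consuming to obtain. Existing methods perform poorly when labels are limited, so a new method is needed that makes effective use of unlabeled data and reduces labeling cost.
Method: The method follows a three-stage semi-supervised training scheme. The first stage uses self-supervised contrastive learning with a dynamic dictionary update mechanism to extract robust representations from unlabeled data. The second stage performs supervised training on a small labeled dataset, jointly optimizing the contrastive and cross-entropy losses to improve feature separability and strengthen classification boundaries. The third stage introduces an active learning module that selects the most valuable unlabeled samples for annotation based on uncertainty and representativeness criteria, further improving generalization under a limited labeling budget.
Result: Experiments on the ADS-B and WiFi datasets show that the method significantly outperforms conventional supervised and semi-supervised methods under limited annotation, achieving higher recognition accuracy at lower labeling cost. The advantage holds across different labeling ratios, supporting the effectiveness of the three-stage training scheme and the active learning selection mechanism.
Conclusion: The study shows that a multi-stage framework combining self-supervised contrastive learning, semi-supervised training, and active learning can address the scarcity of labeled data in specific emitter identification. It offers a workable solution for communication security applications and shows the potential of informed sample selection and data-efficient learning for high-performance identification under a limited labeling budget.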
📄 Abstract
With the rapid growth of wireless communications, specific emitter identification (SEI) is significant for communication security. However, its model training relies heavily on large-scale labeled data, which are costly and time-consuming to obtain. To address this challenge, we propose an SEI approach enhanced by active learning (AL), which follows a three-stage semi-supervised training scheme. In the first stage, self-supervised contrastive learning is employed with a dynamic dictionary update mechanism to extract robust representations from large amounts of unlabeled data. In the second stage, supervised training on a small labeled dataset is performed, where the contrastive and cross-entropy losses are jointly optimized to improve feature separability and strengthen the classification boundaries. In the third stage, an AL module selects the most valuable samples from the unlabeled data for annotation based on uncertainty and representativeness criteria, further enhancing generalization under limited labeling budgets. Experimental results on the ADS-B and WiFi datasets demonstrate that the proposed SEI approach significantly outperforms conventional supervised and semi-supervised methods under limited annotation conditions, achieving higher recognition accuracy with lower labeling cost.
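The third stage selects samples by uncertainty and representativeness, but the abstract does not define either criterion. The sketch below shows one common way to combine them, under stated assumptions: uncertainty is the normalized entropy of the classifier's predicted probabilities, representativeness is the mean cosine similarity to the other unlabeled samples, and the two are mixed with a weight `alpha`. The function names and the weighting rule are hypothetical and not taken from the paper.

```python
import math


def normalized_entropy(probs):
    """Prediction entropy scaled to [0, 1]; assumes at least two classes."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0.0 and nv > 0.0 else 0.0


def select_for_annotation(probs, feats, budget, alpha=0.5):
    """Return indices of the `budget` unlabeled samples with the highest score.

    probs[i]: predicted class probabilities for unlabeled sample i.
    feats[i]: feature embedding of unlabeled sample i.
    alpha:    assumed weight between uncertainty and representativeness.
    """
    n = len(probs)
    scores = []
    for i in range(n):
        uncertainty = normalized_entropy(probs[i])
        sims = [cosine(feats[i], feats[j]) for j in range(n) if j != i]
        representativeness = sum(sims) / len(sims) if sims else 0.0
        scores.append(alpha * uncertainty + (1.0 - alpha) * representativeness)
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return ranked[:budget]
```

The pairwise similarity loop is quadratic in the pool size, which is acceptable for a sketch. A larger unlabeled pool would call for clustering or approximate nearest neighbours instead.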
[54] Integrating Distribution Matching into Semi-Supervised Contrastive Learning for Labeled and Unlabeled Data
Shogo Nakayama, Masahiro Okuda
🧩 TL;DR
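This study proposes a pseudo-label-based semi-supervised contrastive learning method that incorporates distribution matching. It aligns the feature distributions of labeled and unlabeled data to improve image classification, reaching higher accuracy than baseline methods on multiple datasets.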
📘 Detailed Summary
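Motivation: Supervised image classification with deep learning needs large amounts of labeled data, but labeling is costly in practice. Fully unlabeled scenarios are rare, so semi-supervised learning is more relevant when a small amount of labeled data coexists with a large volume of unlabeled data. Existing pseudo-label-based semi-supervised contrastive learning methods leave room for improvement, and this study aims to raise pseudo-label quality through distribution matching and thereby improve classification performance.
Method: The method adds a distribution matching mechanism to the pseudo-label semi-supervised learning framework, aligning the feature embedding distributions of labeled and unlabeled data through contrastive learning. Specifically, it enforces distributional consistency between labeled and unlabeled samples in feature space and optimizes the feature representation with a distribution matching loss, producing more reliable pseudo-labels for training.
Result: Experiments on multiple image classification datasets support the effectiveness of the method, which outperforms conventional pseudo-labeling and baseline semi-supervised contrastive learning methods in classification accuracy. With distribution matching, the method makes better use of unlabeled data and improves generalization when labeled data are limited.
Conclusion: The study shows that adding distribution matching to semi-supervised contrastive learning can raise pseudo-label quality and improve image classification. The approach is especially relevant to applications where labeled data are scarce, and more sophisticated distribution alignment strategies remain to be explored.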
📄 Abstract
The advancement of deep learning has greatly improved supervised image classification. However, labeling data is costly, prompting research into unsupervised learning methods such as contrastive learning. In real-world scenarios, fully unlabeled datasets are rare, making semi-supervised learning (SSL) highly relevant in scenarios where a small amount of labeled data coexists with a large volume of unlabeled data. A well-known semi-supervised contrastive learning approach involves assigning pseudo-labels to unlabeled data. This study aims to enhance pseudo-label-based SSL by incorporating distribution matching between labeled and unlabeled feature embeddings to improve image classification accuracy across multiple datasets.
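The abstract does not say which distribution matching criterion is used. As a minimal sketch of the general idea, the code below penalizes the squared distance between the mean embedding of the labeled batch and that of the unlabeled batch, which is the simplest first-moment form of distribution matching. The function names are hypothetical, and the paper's actual loss may differ.

```python
def mean_embedding(feats):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(feats)
    dim = len(feats[0])
    return [sum(f[d] for f in feats) / n for d in range(dim)]


def mean_matching_loss(labeled_feats, unlabeled_feats):
    """Squared Euclidean distance between the two batch mean embeddings.

    This is an assumed stand-in for a distribution matching loss: it is
    zero when both batches share the same mean and grows as they drift.
    """
    mu_l = mean_embedding(labeled_feats)
    mu_u = mean_embedding(unlabeled_feats)
    return sum((a - b) ** 2 for a, b in zip(mu_l, mu_u))
```

In training, a term like this would be added to the contrastive and pseudo-label losses with a weighting coefficient, so that the encoder is discouraged from placing labeled and unlabeled samples in separate regions of feature space.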
[55] BackdoorAgent: A Unified Framework for Backdoor Attacks on LLM-based Agents
Yunhao Feng, Yige Li, Yutao Wu, Yingshui Tan, Yanming Guo, Yifan Ding, Kun Zhai, Xingjun Ma, Yugang Jiang
🧩 TL;DR
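This paper proposes BackdoorAgent, a framework that offers a unified, agent-centric view of backdoor threats in LLM agents. Through a modular, stage-aware design, it systematically analyzes how backdoor triggers activate and propagate across the stages of an agent workflow.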
📘 Detailed Summary
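Motivation: Existing research on backdoor threats in LLM agents is fragmented and typically analyzes individual attack vectors in isolation. It lacks an agent-centric understanding of how backdoor triggers interact and propagate across stages, which limits a systematic view of security vulnerabilities in agent workflows.
Method: The paper proposes the BackdoorAgent framework, which structures the attack surface into three functional stages of the agent workflow: planning attacks, memory attacks, and tool-use attacks. It instruments agent execution to analyze trigger activation and propagation across stages, and it builds a standardized benchmark covering four representative applications: Agent QA, Agent Code, Agent Web, and Agent Drive.
Result: The empirical analysis shows that triggers implanted at a single stage can persist across multiple steps and propagate through intermediate states. With a GPT-based backbone, trigger persistence is observed in 43.58% of planning attacks, 77.97% of memory attacks, and 60.28% of tool-stage attacks, highlighting the vulnerability of the agentic workflow itself to backdoor threats.
Conclusion: The study reveals systematic backdoor risk in LLM agent workflows and stresses the importance of cross-stage trigger propagation. The framework and benchmark provide a reproducible basis for future agent security research and point to the need for more robust defenses for multi-stage agent systems.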
📄 Abstract
Large language model (LLM) agents execute tasks through multi-step workflows that combine planning, memory, and tool use. While this design enables autonomy, it also expands the attack surface for backdoor threats. Backdoor triggers injected into specific stages of an agent workflow can persist through multiple intermediate states and adversely influence downstream outputs. However, existing studies remain fragmented and typically analyze individual attack vectors in isolation, leaving the cross-stage interaction and propagation of backdoor triggers poorly understood from an agent-centric perspective. To fill this gap, we propose BackdoorAgent, a modular and stage-aware framework that provides a unified, agent-centric view of backdoor threats in LLM agents. BackdoorAgent structures the attack surface into three functional stages of agentic workflows, including planning attacks, memory attacks, and tool-use attacks, and instruments agent execution to enable systematic analysis of trigger activation and propagation across different stages. Building on this framework, we construct a standardized benchmark spanning four representative agent applications: Agent QA, Agent Code, Agent Web, and Agent Drive, covering both language-only and multimodal settings. Our empirical analysis shows that triggers implanted at a single stage can persist across multiple steps and propagate through intermediate states. For instance, when using a GPT-based backbone, we observe trigger persistence in 43.58% of planning attacks, 77.97% of memory attacks, and 60.28% of tool-stage attacks, highlighting the vulnerabilities of the agentic workflow itself to backdoor threats. To facilitate reproducibility and future research, our code and benchmark are publicly available at GitHub.
[56] Enhancing Multimodal Retrieval via Complementary Information Extraction and Alignment
Delong Zeng, Yuexiang Xie, Yaliang Li, Ying Shen
🧩 TL;DR
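This paper proposes CIEA, a multimodal retrieval method based on complementary information extraction and alignment. It maps text and images into a unified latent space and uses a complementary information extractor to identify and preserve the differences in image representations, improving multimodal retrieval performance.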
📘 Detailed Summary
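Motivation: Current multimodal retrieval research focuses on capturing the information in multimodal data that is similar to the paired text, and often overlooks the complementary information that the data contain. This omission limits the completeness and accuracy of retrieval systems.
Method: CIEA uses complementary information extraction and alignment. It transforms the text and images in documents into a unified latent space, uses a complementary information extractor to identify and preserve differences in the image representations, and is optimized with two complementary contrastive losses to ensure semantic integrity and capture the complementary information in images.
Result: Extensive experiments demonstrate the effectiveness of CIEA, which achieves significant improvements over both divide-and-conquer models and universal dense retrieval models. An ablation study, further discussion, and case studies highlight the gains, and the source code has been released on GitHub.
Conclusion: The study underlines the importance of capturing complementary information in multimodal retrieval. CIEA preserves semantic integrity while exploiting the differential information in images, offering useful insight for multimodal representation learning, and the released code supports further research in the community.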
📄 Abstract
Multimodal retrieval has emerged as a promising yet challenging research direction in recent years. Most existing studies in multimodal retrieval focus on capturing information in multimodal data that is similar to their paired texts, but often ignore the complementary information contained in multimodal data. In this study, we propose CIEA, a novel multimodal retrieval approach that employs Complementary Information Extraction and Alignment, which transforms both text and images in documents into a unified latent space and features a complementary information extractor designed to identify and preserve differences in the image representations. We optimize CIEA using two complementary contrastive losses to ensure semantic integrity and effectively capture the complementary information contained in images. Extensive experiments demonstrate the effectiveness of CIEA, which achieves significant improvements over both divide-and-conquer models and universal dense retrieval models. We provide an ablation study, further discussions, and case studies to highlight the advancements achieved by CIEA. To promote further research in the community, we have released the source code at https://github.com/zengdlong/CIEA.
[57] Bridging Temporal and Textual Modalities: A Multimodal Framework for Automated Cloud Failure Root Cause Analysis
Gijun Park
🧩 TL;DR
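This paper proposes a multimodal diagnostic framework that uses semantic compression and an alignment encoder to reconcile time-series representations with the embedding space of pretrained language models. It addresses the modality mismatch that language models face when handling continuous numerical sequences in root cause analysis for cloud infrastructure.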
📘 Detailed Summary
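Motivation: Root cause analysis in modern cloud infrastructure requires understanding heterogeneous data sources, particularly time-series performance metrics that carry core failure signatures. Large language models are strong at textual reasoning, but their discrete token-based architecture is fundamentally incompatible with continuous numerical sequences that exhibit temporal dependencies. Current methods do not adequately address this modality mismatch, which limits the automation potential of language models in incident management workflows.
Method: The paper proposes a multimodal diagnostic framework with three technical contributions: first, a semantic compression technique that distills temporal segments into single-token abstractions while preserving pattern semantics; second, an alignment encoder that uses gated cross-attention to project time-series features into the language model's latent space; third, a retrieval-augmented diagnostic pipeline that combines the aligned embeddings with historical incident knowledge for expert-level failure attribution.
Result: A comprehensive evaluation across six cloud system benchmarks shows that the framework achieves leading performance, reaching 48.75% diagnostic accuracy with notable improvements in scenarios involving compound failure modes. The results support embedding-space alignment as an effective strategy for letting language models reason over multimodal telemetry data in production incident response.
Conclusion: The study supports embedding-space alignment as an effective way to resolve the modality mismatch between language models and continuous time-series data. It offers a workable technical path for language-model-driven automated diagnosis of cloud infrastructure, with particular strength on compound failure modes, and advances the use of multimodal reasoning in incident management.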
📄 Abstract
Root cause analysis in modern cloud infrastructure demands sophisticated understanding of heterogeneous data sources, particularly time-series performance metrics that involve core failure signatures. While large language models demonstrate remarkable capabilities in textual reasoning, their discrete token-based architecture creates fundamental incompatibilities with continuous numerical sequences exhibiting temporal dependencies. Current methodologies inadequately address this modality mismatch, constraining the potential of language model-driven automation in incident management workflows. This paper presents a multimodal diagnostic framework that harmonizes time-series representations with pretrained language model embedding spaces. Our approach contributes three technical advances: (1) a semantic compression technique that distills temporal segments into single-token abstractions while preserving pattern semantics, (2) an alignment encoder utilizing gated cross-attention to project time-series features into language model latent space, and (3) a retrieval-augmented diagnostic pipeline that synthesizes aligned embeddings with historical incident knowledge for expert-level failure attribution. Comprehensive evaluation across six cloud system benchmarks demonstrates that our framework achieves leading performance, reaching 48.75% diagnostic accuracy with notable improvements on scenarios involving compound failure modes. The results validate embedding-space alignment as an effective strategy for enabling language models to reason over multimodal telemetry data in production incident response contexts.
[58] AECV-Bench: Benchmarking Multimodal Models on Architectural and Engineering Drawings Understanding
Aleksei Kondratenko, Mussie Birhane, Houssame E. Hsain, Guido Maciocci
🧩 TL;DR
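This study presents AECV-Bench, a benchmark for evaluating multimodal and vision-language models on understanding architecture, engineering, and construction (AEC) drawings. It finds that current models perform well on document-assistant tasks but fall clearly short on drawing symbol understanding and counting tasks.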
📘 Detailed Summary
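Motivation: It is unclear how well modern multimodal and vision-language models understand AEC drawings. These drawings encode geometry and semantics through symbols, layout conventions, and dense annotation, so it is necessary to evaluate whether models can reliably interpret this graphical language.
Method: The study presents the AECV-Bench benchmark with two complementary use cases: an object counting task on 120 high-quality floor plans, and a drawing-grounded document QA task based on 192 question-answer pairs. The QA task covers text extraction, instance counting, spatial reasoning, and comparative reasoning, and it is scored with an LLM-as-a-judge pipeline plus targeted human adjudication for edge cases.
Result: The evaluation shows a stable capability gradient: OCR and text-centric document QA are strongest, spatial reasoning is moderate, and symbol-centric drawing understanding, especially reliable counting of doors and windows, remains unsolved. Object counting is reported with per-field exact-match accuracy and MAPE, while document QA is reported with overall accuracy and per-category breakdowns.
Conclusion: Current systems work well as document assistants but lack robust drawing literacy. This motivates domain-specific representations and tool-augmented, human-in-the-loop workflows for efficient AEC automation, and it underlines how challenging symbol-centric drawing understanding remains.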
📄 Abstract
AEC drawings encode geometry and semantics through symbols, layout conventions, and dense annotation, yet it remains unclear whether modern multimodal and vision-language models can reliably interpret this graphical language. We present AECV-Bench, a benchmark for evaluating multimodal and vision-language models on realistic AEC artefacts via two complementary use cases: (i) object counting on 120 high-quality floor plans (doors, windows, bedrooms, toilets), and (ii) drawing-grounded document QA spanning 192 question-answer pairs that test text extraction (OCR), instance counting, spatial reasoning, and comparative reasoning over common drawing regions. Object-counting performance is reported using per-field exact-match accuracy and MAPE results, while document-QA performance is reported using overall accuracy and per-category breakdowns with an LLM-as-a-judge scoring pipeline and targeted human adjudication for edge cases. Evaluating a broad set of state-of-the-art models under a unified protocol, we observe a stable capability gradient: OCR and text-centric document QA are strongest (up to 0.95 accuracy), spatial reasoning is moderate, and symbol-centric drawing understanding, especially reliable counting of doors and windows, remains unsolved (often 0.40-0.55 accuracy) with substantial proportional errors. These results suggest that current systems function well as document assistants but lack robust drawing literacy, motivating domain-specific representations and tool-augmented, human-in-the-loop workflows for efficient AEC automation.
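The two object-counting metrics named above can be sketched as follows. This is an illustrative implementation under assumptions, not the benchmark's scoring code: the record layout (one dict of counts per floor plan) and the choice to skip plans whose ground-truth count is zero when computing MAPE are both assumed, since the abstract does not specify them.

```python
def per_field_metrics(truth, pred, fields):
    """Per-field exact-match accuracy and MAPE over a set of floor plans.

    truth, pred: lists of dicts mapping a field name (e.g. "doors") to a count,
                 aligned so that truth[i] and pred[i] describe the same plan.
    Returns {field: {"exact_match": float, "mape": float or None}}.
    """
    results = {}
    for field in fields:
        pairs = [(t[field], p[field]) for t, p in zip(truth, pred)]
        exact = sum(1 for t, p in pairs if t == p) / len(pairs)
        # Assumption: percentage error is undefined for a true count of
        # zero, so those plans are left out of the MAPE average.
        nonzero = [(t, p) for t, p in pairs if t != 0]
        if nonzero:
            mape = 100.0 * sum(abs(p - t) / abs(t) for t, p in nonzero) / len(nonzero)
        else:
            mape = None
        results[field] = {"exact_match": exact, "mape": mape}
    return results
```

Exact match rewards only fully correct counts, while MAPE shows how far off the wrong counts are in proportion to the truth. That is why reporting both separates "usually off by one" from "wildly wrong".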