Table of Contents
cs.CV
[1] FRIEDA: Benchmarking Multi-Step Cartographic Reasoning in Vision-Language Models
Jiyoon Pyo, Yuankun Jiao, Dongwon Jung, Zekun Li, Leeje Jang, Sofia Kirsanova, Jina Kim, Yijun Lin, Qin Liu, Junyi Xie, Hadi Askari, Nan Xu, Muhao Chen, Yao-Yi Chiang
🧩 TL;DR
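This paper introduces FRIEDA, a benchmark for evaluating large vision-language models on complex, open-ended map reasoning. It covers three classes of spatial relations (topological, metric, and directional) and reveals a large gap between current models and human performance on multi-step cartographic reasoning.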
📘 Detailed Summary
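Motivation: Current map visual question-answering studies often treat maps as a special case of charts. Map understanding, however, requires handling layered symbology as well as spatial relations tied to orientation and distance. These relations often span multiple maps and are not covered by chart-style evaluations, so a dedicated benchmark is needed to assess complex cartographic reasoning.
Method: The authors build FRIEDA from real map images collected from documents and reports across different domains and geographic regions. Following classifications in the Geographic Information System literature, it covers three classes of spatial relations: topological, metric, and directional. Every question requires multi-step inference, and many require cross-map grounding and reasoning. Models are evaluated in two settings: direct and contextual.
Result: Eleven state-of-the-art large vision-language models were evaluated. Even in the direct setting, the strongest models, Gemini-2.5-Pro and GPT-5-Think, reach only 38.20% and 37.20% accuracy, respectively, far below human performance of 84.87%. This shows a significant capability gap in multi-step cartographic reasoning.
Conclusion: FRIEDA exposes serious weaknesses of current large vision-language models on complex map reasoning and provides a rigorous evaluation standard for spatial intelligence research. It underlines the need to develop dedicated map-understanding capabilities rather than treating maps as a special case of chart understanding.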
📄 Abstract
Cartographic reasoning is the skill of interpreting geographic relationships by aligning legends, map scales, compass directions, map texts, and geometries across one or more map images. Although essential as a concrete cognitive capability and for critical tasks such as disaster response and urban planning, it remains largely unevaluated. Building on progress in chart and infographic understanding, recent large vision language model studies on map visual question-answering often treat maps as a special case of charts. In contrast, map VQA demands comprehension of layered symbology (e.g., symbols, geometries, and text labels) as well as spatial relations tied to orientation and distance that often span multiple maps and are not captured by chart-style evaluations. To address this gap, we introduce FRIEDA, a benchmark for testing complex open-ended cartographic reasoning in LVLMs. FRIEDA sources real map images from documents and reports in various domains and geographical areas. Following classifications in Geographic Information System (GIS) literature, FRIEDA targets all three categories of spatial relations: topological (border, equal, intersect, within), metric (distance), and directional (orientation). All questions require multi-step inference, and many require cross-map grounding and reasoning. We evaluate eleven state-of-the-art LVLMs under two settings: (1) the direct setting, where we provide the maps relevant to the question, and (2) the contextual setting, where the model may have to identify the maps relevant to the question before reasoning. Even the strongest models, Gemini-2.5-Pro and GPT-5-Think, achieve only 38.20% and 37.20% accuracy, respectively, far below human performance of 84.87%. These results reveal a persistent gap in multi-step cartographic reasoning, positioning FRIEDA as a rigorous benchmark to drive progress on spatial intelligence in LVLMs.
[2] Lost in Translation, Found in Embeddings: Sign Language Translation and Alignment
Youngjoon Jang, Liliane Momeni, Zifan Jiang, Joon Son Chung, Gül Varol, Andrew Zisserman
🧩 TL;DR
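This paper proposes a unified sign language understanding model that performs both sign language translation and sign-subtitle alignment. With multilingual pretraining and a new vision-to-text mapping architecture, it achieves state-of-the-art results on the BSL dataset BOBSL and generalises well to the ASL dataset How2Sign.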
📘 Detailed Summary
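Motivation: The work targets two key tasks in understanding continuous signing video: sign language translation and sign-subtitle alignment. Both are essential for practical communication, large-scale corpus construction, and educational applications, yet existing methods usually handle them separately and lack a unified framework.
Method: The approach rests on three components: a lightweight visual backbone that extracts manual and non-manual cues from human keypoints and lip-region images while preserving signer privacy; a Sliding Perceiver mapping network that aggregates consecutive visual features into word-level embeddings to bridge the vision-text gap; and a multi-task scalable training strategy that jointly optimises sign language translation and sign-subtitle alignment, reinforcing both linguistic and temporal alignment.
Result: After multilingual pretraining on the BOBSL and YouTube-SL-25 datasets, which cover British Sign Language and American Sign Language, the model achieves state-of-the-art results for both sign language translation and sign-subtitle alignment on the challenging BOBSL dataset. It also shows robust zero-shot generalisation and strong finetuned performance on How2Sign.
Conclusion: The study demonstrates the effectiveness of a unified sign language understanding framework. Multilingual pretraining and joint optimisation enable scalable translation across sign languages, offering a promising solution for practical communication applications and large-scale sign language corpus construction, and showing the potential for cross-lingual generalisation.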
📄 Abstract
Our aim is to develop a unified model for sign language understanding, that performs sign language translation (SLT) and sign-subtitle alignment (SSA). Together, these two tasks enable the conversion of continuous signing videos into spoken language text and also the temporal alignment of signing with subtitles -- both essential for practical communication, large-scale corpus construction, and educational applications. To achieve this, our approach is built upon three components: (i) a lightweight visual backbone that captures manual and non-manual cues from human keypoints and lip-region images while preserving signer privacy; (ii) a Sliding Perceiver mapping network that aggregates consecutive visual features into word-level embeddings to bridge the vision-text gap; and (iii) a multi-task scalable training strategy that jointly optimises SLT and SSA, reinforcing both linguistic and temporal alignment. To promote cross-linguistic generalisation, we pretrain our model on large-scale sign-text corpora covering British Sign Language (BSL) and American Sign Language (ASL) from the BOBSL and YouTube-SL-25 datasets. With this multilingual pretraining and strong model design, we achieve state-of-the-art results on the challenging BOBSL (BSL) dataset for both SLT and SSA. Our model also demonstrates robust zero-shot generalisation and finetuned SLT performance on How2Sign (ASL), highlighting the potential of scalable translation across different sign languages.
[3] CVP: Central-Peripheral Vision-Inspired Multimodal Model for Spatial Reasoning
Zeyuan Chen, Xiang Zhang, Haiyang Xu, Jianwen Xie, Zhuowen Tu
🧩 TL;DR
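This paper proposes CVP, a multimodal spatial reasoning framework inspired by human central and peripheral vision. By introducing two complementary components, a target-affinity token and an allocentric grid, it achieves structured, context-aware understanding of complex 3D scenes and state-of-the-art performance on multiple benchmarks.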
📘 Detailed Summary
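Motivation: Existing methods rely mainly on unstructured representations such as point clouds, voxels, or patch features, and inject scene context implicitly through coordinate embeddings. This limits spatial reasoning because the model lacks an explicit, high-level structural understanding of the scene. The paper aims to address this limitation and improve structured spatial reasoning of multimodal models in complex 3D environments.
Method: The paper proposes the central-peripheral vision-inspired framework CVP. Built on a Large Multimodal Model architecture, it introduces two complementary components: a target-affinity token (analogous to central vision) that guides the model's attention toward query-relevant objects, and an allocentric grid (akin to peripheral vision) that captures global scene context and spatial arrangements. The two components work together to enable structured, context-aware understanding of 3D scenes.
Result: Experiments show that CVP achieves state-of-the-art performance across a range of 3D scene understanding benchmarks, surpassing existing methods and demonstrating effective spatial reasoning in complex 3D environments.
Conclusion: The study shows that bringing the central-peripheral mechanism of the human visual system into a multimodal model architecture is effective and provides a new structured representation for 3D scene understanding. CVP's results indicate that explicitly modelling scene structure and contextual relations is important for spatial reasoning, suggesting a promising direction for future multimodal spatial reasoning research.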
📄 Abstract
We present a central-peripheral vision-inspired framework (CVP), a simple yet effective multimodal model for spatial reasoning that draws inspiration from the two types of human visual fields -- central vision and peripheral vision. Existing approaches primarily rely on unstructured representations, such as point clouds, voxels, or patch features, and inject scene context implicitly via coordinate embeddings. However, this often results in limited spatial reasoning capabilities due to the lack of explicit, high-level structural understanding. To address this limitation, we introduce two complementary components into a Large Multimodal Model-based architecture: target-affinity token, analogous to central vision, that guides the model's attention toward query-relevant objects; and allocentric grid, akin to peripheral vision, that captures global scene context and spatial arrangements. These components work in tandem to enable structured, context-aware understanding of complex 3D environments. Experiments show that CVP achieves state-of-the-art performance across a range of 3D scene understanding benchmarks.
[4] VisKnow: Constructing Visual Knowledge Base for Object Understanding
Ziwei Yao, Qiyang Wan, Ruiping Wang, Xilin Chen
🧩 TL;DR
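This work proposes the Visual Knowledge Base and its construction framework VisKnow, which builds structured multi-modal object knowledge graphs to support in-depth object understanding. Using AnimalKB as a case study, it shows gains on tasks such as zero-shot recognition and fine-grained visual question answering.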
📘 Detailed Summary
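Motivation: Object understanding in computer vision is usually limited to category recognition and lacks a comprehensive perception of an object's components, appearance characteristics, inter-category relationships, and contextual background knowledge. Existing multi-modal data are generally task-oriented and not systematically organised, so they cannot deliver the expected in-depth understanding of object categories.
Method: The authors propose VisKnow, a framework for constructing a visual knowledge base. Combining expert design with large-scale model application, it extracts object-level knowledge from multi-modal data and structures it as graphs. As a concrete instance they build AnimalKB, which covers 406 animal categories and contains 22K textual knowledge triplets extracted from encyclopedic documents, 420K images, and corresponding region annotations at the object and part levels.
Result: Experiments show that AnimalKB enhances object-level visual tasks such as zero-shot recognition and fine-grained visual question answering, and that it also serves as a challenging benchmark for knowledge graph completion and part segmentation. The knowledge base illustrates the practical value of structured visual knowledge for visual understanding.
Conclusion: The study demonstrates the potential of automatically constructing visual knowledge bases to advance visual understanding and its practical applications. It offers a systematic framework for organising multi-modal knowledge, along with an extensible construction methodology and benchmark datasets for follow-up research.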
📄 Abstract
Understanding objects is fundamental to computer vision. Beyond object recognition that provides only a category label as typical output, in-depth object understanding represents a comprehensive perception of an object category, involving its components, appearance characteristics, inter-category relationships, contextual background knowledge, etc. Developing such capability requires sufficient multi-modal data, including visual annotations such as parts, attributes, and co-occurrences for specific tasks, as well as textual knowledge to support high-level tasks like reasoning and question answering. However, these data are generally task-oriented and not systematically organized enough to achieve the expected understanding of object categories. In response, we propose the Visual Knowledge Base that structures multi-modal object knowledge as graphs, and present a construction framework named VisKnow that extracts multi-modal, object-level knowledge for object understanding. This framework integrates enriched aligned text and image-source knowledge with region annotations at both object and part levels through a combination of expert design and large-scale model application. As a specific case study, we construct AnimalKB, a structured animal knowledge base covering 406 animal categories, which contains 22K textual knowledge triplets extracted from encyclopedic documents, 420K images, and corresponding region annotations. A series of experiments showcase how AnimalKB enhances object-level visual tasks such as zero-shot recognition and fine-grained VQA, and serves as challenging benchmarks for knowledge graph completion and part segmentation. Our findings highlight the potential of automatically constructing visual knowledge bases to advance visual understanding and its practical applications. The project page is available at https://vipl-vsu.github.io/VisKnow.
[5] MM-CoT: A Benchmark for Probing Visual Chain-of-Thought Reasoning in Multimodal Models
Jusheng Zhang, Kaitong Cai, Xiaoyang Guo, Sidi Liu, Qinhan Lv, Ruiqi Chen, Jing Yang, Yijia Fan, Xiaofei Sun, Jian Wang, Ziliang Chen, Liang Lin, Keze Wang
🧩 TL;DR
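This paper introduces MM-CoT, a benchmark designed to diagnose the visual grounding and logical coherence of chain-of-thought reasoning in multimodal models. It reveals a sharp gap between generative fluency and true reasoning fidelity in current advanced models.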
📘 Detailed Summary
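Motivation: Existing benchmarks emphasise generation but neglect verification, that is, the capacity to assess whether a reasoning chain is both visually consistent and logically valid. As a result, they cannot tell whether a multimodal model's chain-of-thought reasoning is genuinely grounded in visual evidence and logically coherent.
Method: The authors introduce the MM-CoT diagnostic benchmark, in which models must select the sole event chain that satisfies two orthogonal constraints: visual consistency, which ensures that all steps are anchored in observable evidence, and logical coherence, which ensures causal and commonsense validity. Adversarial distractors are engineered to violate these constraints and expose distinct reasoning failure modes.
Result: An evaluation of leading vision-language models finds that even the most advanced systems struggle on MM-CoT, revealing a sharp discrepancy between generative fluency and true reasoning fidelity. MM-CoT correlates weakly with existing benchmarks, confirming that it measures a unique combination of visual grounding and logical reasoning.
Conclusion: The benchmark provides a foundation for developing future models that reason not just plausibly but faithfully and coherently within the visual world. It highlights the importance of verification ability for assessing the true capabilities of multimodal reasoning systems.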
📄 Abstract
The ability to perform Chain-of-Thought (CoT) reasoning marks a major milestone for multimodal models (MMs), enabling them to solve complex visual reasoning problems. Yet a critical question remains: is such reasoning genuinely grounded in visual evidence and logically coherent? Existing benchmarks emphasize generation but neglect verification, i.e., the capacity to assess whether a reasoning chain is both visually consistent and logically valid. To fill this gap, we introduce MM-CoT, a diagnostic benchmark specifically designed to probe the visual grounding and logical coherence of CoT reasoning in MMs. Instead of generating free-form explanations, models must select the sole event chain that satisfies two orthogonal constraints: (i) visual consistency, ensuring all steps are anchored in observable evidence, and (ii) logical coherence, ensuring causal and commonsense validity. Adversarial distractors are engineered to violate one of these constraints, exposing distinct reasoning failures. We evaluate leading vision-language models on MM-CoT and find that even the most advanced systems struggle, revealing a sharp discrepancy between generative fluency and true reasoning fidelity. MM-CoT shows low correlation with existing benchmarks, confirming that it measures a unique combination of visual grounding and logical reasoning. This benchmark provides a foundation for developing future models that reason not just plausibly, but faithfully and coherently within the visual world.
[6] Beyond Real Weights: Hypercomplex Representations for Stable Quantization
Jawad Ibn Ahad, Maisha Rahman, Amrijit Biswas, Muhammad Rafsan Kabir, Robin Krambroeckers, Sifat Momen, Nabeel Mohammed, Shafin Rahman
🧩 TL;DR
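This paper proposes a progressive reparameterization strategy that compresses multimodal language models by gradually replacing dense feed-forward network blocks with compact Parameterized Hypercomplex Multiplication layers, substantially reducing parameters and compute while preserving performance.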
📘 Detailed Summary
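Motivation: Multimodal language models need large parameter capacity to align high-dimensional visual features with linguistic representations, which makes them computationally heavy and inefficient to deploy. Compressing them effectively while preserving performance remains a challenge for existing methods.
Method: The approach uses a progressive reparameterization strategy that gradually replaces dense feed-forward network blocks with compact Parameterized Hypercomplex Multiplication (PHM) layers. A residual interpolation schedule, combined with lightweight reconstruction and knowledge distillation losses, ensures that the PHM modules inherit the functional behaviour of their dense counterparts during training.
Result: On multiple vision-language models, the method yields substantial parameter and FLOP reductions while maintaining performance comparable to the base models, significantly reducing model size and inference latency without degrading output quality.
Conclusion: Progressive PHM substitution offers an architecture-compatible path toward more efficient multimodal inference and complements existing low-bit quantization techniques, opening a new direction for deploying efficient multimodal models.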
📄 Abstract
Multimodal language models (MLLMs) require large parameter capacity to align high-dimensional visual features with linguistic representations, making them computationally heavy and difficult to deploy efficiently. We introduce a progressive reparameterization strategy that compresses these models by gradually replacing dense feed-forward network blocks with compact Parameterized Hypercomplex Multiplication (PHM) layers. A residual interpolation schedule, together with lightweight reconstruction and knowledge distillation losses, ensures that the PHM modules inherit the functional behavior of their dense counterparts during training. This transition yields substantial parameter and FLOP reductions while preserving strong multimodal alignment, enabling faster inference without degrading output quality. We evaluate the approach on multiple vision-language models (VLMs). Our method maintains performance comparable to the base models while delivering significant reductions in model size and inference latency. Progressive PHM substitution thus offers an architecture-compatible path toward more efficient multimodal reasoning and complements existing low-bit quantization techniques.
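The following is a minimal sketch of how a PHM weight can be assembled and blended with a dense block. It assumes the common PHM formulation in which the weight is a sum of Kronecker products, W = sum_i kron(A_i, S_i), and it assumes a linear blend as the residual interpolation; the abstract does not specify either detail, and the names `kron`, `phm_weight`, `param_counts`, and `blend` are illustrative only.

```python
# Sketch of a PHM weight under the assumption W = sum_i kron(A_i, S_i),
# with n matrices A_i of shape (n, n) and n matrices S_i of shape
# (d_out / n, d_in / n). Matrices are plain lists of rows.

def kron(a, b):
    """Kronecker product of two matrices."""
    return [
        [a_val * b_val for a_val in a_row for b_val in b_row]
        for a_row in a
        for b_row in b
    ]


def phm_weight(a_mats, s_mats):
    """Sum of Kronecker products kron(A_i, S_i)."""
    total = None
    for a, s in zip(a_mats, s_mats):
        term = kron(a, s)
        if total is None:
            total = term
        else:
            total = [[x + y for x, y in zip(r1, r2)]
                     for r1, r2 in zip(total, term)]
    return total


def param_counts(d_in, d_out, n):
    """Parameter counts of a dense layer versus its PHM replacement."""
    dense = d_in * d_out
    phm = n * n * n + n * (d_out // n) * (d_in // n)
    return dense, phm


def blend(dense_out, phm_out, alpha):
    """Hypothetical residual interpolation: alpha ramps from 0 to 1 during
    training, moving the block output from the dense path to the PHM path."""
    return [(1 - alpha) * d + alpha * p for d, p in zip(dense_out, phm_out)]
```

With `n = 2` and an 8-by-8 layer, `param_counts(8, 8, 2)` gives 64 dense parameters against 40 PHM parameters; for large layers the PHM count approaches roughly 1/n of the dense count.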
[7] HybridToken-VLM: Hybrid Token Compression for Vision-Language Models
Jusheng Zhang, Xiaoyang Guo, Kaitong Cai, Qinhan Lv, Yijia Fan, Wenhao Chai, Jian Wang, Keze Wang
🧩 TL;DR
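This paper proposes HTC-VLM, a hybrid vision-language model framework that disentangles semantics and appearance through dual channels. It achieves a 580-to-1 compression ratio, significantly reducing compute while retaining multimodal reasoning performance.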
📘 Detailed Summary
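Motivation: Feeding hundreds of visual patch tokens into a vision-language model incurs quadratic computational cost. Traditional approaches face a trade-off: continuous compression dilutes high-level semantics, while discrete quantization loses fine-grained details. The underlying tension between efficiency and fidelity needs to be resolved.
Method: HTC-VLM uses a hybrid dual-channel framework: a continuous pathway of ViT patches for fine-grained details, and a discrete pathway of symbolic anchors quantized with MGVQ and projected to four tokens. The resulting 580-token hybrid sequence is compressed into a single voco token via a disentanglement attention mask and bottleneck, separating and then fusing semantics and appearance.
Result: Across seven benchmarks, HTC-VLM reaches an average performance retention of 87.2%, ahead of the leading continuous baseline at 81.0%, with a 580-to-1 compression ratio. Attention analyses show that the compressed token prioritises the discrete anchor, validating its semantic guidance.
Conclusion: The study shows that a minimalist hybrid design can resolve the efficiency-fidelity dilemma. Disentangling semantics and appearance yields efficient and grounded visual representations and points to a new direction for scalable vision-language models.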
📄 Abstract
Vision-language models (VLMs) have transformed multimodal reasoning, but feeding hundreds of visual patch tokens into LLMs incurs quadratic computational costs, straining memory and context windows. Traditional approaches face a trade-off: continuous compression dilutes high-level semantics such as object identities, while discrete quantization loses fine-grained details such as textures. We introduce HTC-VLM, a hybrid framework that disentangles semantics and appearance through dual channels, i.e., a continuous pathway for fine-grained details via ViT patches and a discrete pathway for symbolic anchors using MGVQ quantization projected to four tokens. These are fused into a 580-token hybrid sequence and compressed into a single voco token via a disentanglement attention mask and bottleneck, ensuring efficient and grounded representations. HTC-VLM achieves an average performance retention of 87.2 percent across seven benchmarks (GQA, VQAv2, MMBench, MME, POPE, SEED-Bench, ScienceQA-Image), outperforming the leading continuous baseline at 81.0 percent with a 580-to-1 compression ratio. Attention analyses show that the compressed token prioritizes the discrete anchor, validating its semantic guidance. Our work demonstrates that a minimalist hybrid design can resolve the efficiency-fidelity dilemma and advance scalable VLMs.
[8] RLCNet: An end-to-end deep learning framework for simultaneous online calibration of LiDAR, RADAR, and Camera
Hafeez Husain Cholakkal, Stefano Arrigoni, Francesco Braghin
🧩 TL;DR
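This paper proposes RLCNet, a novel end-to-end trainable deep learning framework for the simultaneous online calibration of LiDAR, RADAR, and camera sensors in autonomous driving. A weighted moving average and outlier rejection enable real-time dynamic adjustment of calibration parameters.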
📘 Detailed Summary
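Motivation: Accurate extrinsic calibration of LiDAR, RADAR, and camera sensors is essential for reliable perception in autonomous driving. Online calibration of these multimodal sensors remains challenging because of factors such as mechanical vibrations and cumulative sensor drift in dynamic environments.
Method: The paper proposes RLCNet, an end-to-end trainable deep learning framework designed for the simultaneous online calibration of multimodal sensors. It introduces an online calibration framework with a weighted moving average and outlier rejection, which dynamically adjusts calibration parameters, reduces prediction noise, and improves resilience to drift.
Result: Validation on real-world datasets shows that RLCNet performs robustly under diverse conditions, and an ablation study highlights the significance of the architectural choices. Comparisons with existing methods show superior accuracy and robustness, and the method is designed for practical deployment.
Conclusion: The study offers a practical deep learning solution for multimodal sensor calibration in autonomous driving. Its online calibration framework adjusts parameters dynamically in real time, improving robustness and reliability in dynamic environments and providing a feasible path to deployment.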
📄 Abstract
Accurate extrinsic calibration of LiDAR, RADAR, and camera sensors is essential for reliable perception in autonomous vehicles. Still, it remains challenging due to factors such as mechanical vibrations and cumulative sensor drift in dynamic environments. This paper presents RLCNet, a novel end-to-end trainable deep learning framework for the simultaneous online calibration of these multimodal sensors. Validated on real-world datasets, RLCNet is designed for practical deployment and demonstrates robust performance under diverse conditions. To support real-time operation, an online calibration framework is introduced that incorporates a weighted moving average and outlier rejection, enabling dynamic adjustment of calibration parameters with reduced prediction noise and improved resilience to drift. An ablation study highlights the significance of architectural choices, while comparisons with existing methods demonstrate the superior accuracy and robustness of the proposed approach.
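The online stage described above can be sketched as a small filter over per-frame network predictions. This is an illustration under stated assumptions, not the authors' implementation: it tracks a single scalar parameter (for example one rotation angle) rather than a full 6-DoF extrinsic, uses linearly increasing weights, and rejects a prediction when it deviates from the current estimate by more than a fixed threshold. The class name and the `window` and `max_jump` parameters are hypothetical.

```python
from collections import deque


class OnlineCalibrationFilter:
    """Weighted moving average with simple outlier rejection (sketch)."""

    def __init__(self, window=5, max_jump=0.5):
        self.history = deque(maxlen=window)  # most recent accepted predictions
        self.max_jump = max_jump             # assumed rejection threshold

    def estimate(self):
        """Weighted average of the history; newer samples weigh more."""
        if not self.history:
            return None
        weights = range(1, len(self.history) + 1)
        total = sum(weights)
        return sum(w * x for w, x in zip(weights, self.history)) / total

    def update(self, prediction):
        """Accept or reject a new prediction, then return the estimate."""
        current = self.estimate()
        if current is not None and abs(prediction - current) > self.max_jump:
            return current  # outlier: keep the previous estimate
        self.history.append(prediction)
        return self.estimate()
```

A fixed threshold is the simplest possible rejection rule; a deployed system would likely need a rule that still lets the estimate follow genuine, sustained drift.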
[9] PAVAS: Physics-Aware Video-to-Audio Synthesis
Oh Hyun-Bin, Yuhta Takida, Toshimitsu Uesaka, Tae-Hyun Oh, Yuki Mitsufuji
🧩 TL;DR
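This paper proposes PAVAS, a physics-aware video-to-audio synthesis method. It adds physical reasoning to existing appearance-driven V2A generation so that the synthesized sound reflects physical parameters such as object mass and motion trajectory.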
📘 Detailed Summary
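Motivation: Current video-to-audio generation models are mostly appearance-driven: they capture visual-acoustic correlations but ignore the physical factors that shape real-world sounds, so the synthesized audio lacks physical realism. The paper aims to address this by incorporating physical reasoning into V2A generation.
Method: PAVAS builds on a latent diffusion model and integrates physical reasoning through the Physics-Driven Audio Adapter. The adapter receives object-level physical parameters from the Physical Parameter Estimator, which uses a vision-language model to infer the mass of the moving object and a segmentation-based dynamic 3D reconstruction module to recover its motion trajectory for velocity computation.
Result: Experiments show that PAVAS produces physically plausible and perceptually coherent audio and outperforms existing V2A models in both quantitative and qualitative evaluations. The authors also curate VGG-Impact, a benchmark focused on object-object interactions, and propose the Audio-Physics Correlation Coefficient as a metric of physical realism.
Conclusion: The study demonstrates the value of incorporating physical reasoning into video-to-audio generation and offers a new paradigm for synthesizing physically realistic sound. The proposed evaluation framework and benchmark lay groundwork for future physics-aware audio generation research and underline the importance of modelling physical consistency in cross-modal generation.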
📄 Abstract
Recent advances in Video-to-Audio (V2A) generation have achieved impressive perceptual quality and temporal synchronization, yet most models remain appearance-driven, capturing visual-acoustic correlations without considering the physical factors that shape real-world sounds. We present Physics-Aware Video-to-Audio Synthesis (PAVAS), a method that incorporates physical reasoning into a latent diffusion-based V2A generation through the Physics-Driven Audio Adapter (Phy-Adapter). The adapter receives object-level physical parameters estimated by the Physical Parameter Estimator (PPE), which uses a Vision-Language Model (VLM) to infer the moving-object mass and a segmentation-based dynamic 3D reconstruction module to recover its motion trajectory for velocity computation. These physical cues enable the model to synthesize sounds that reflect underlying physical factors. To assess physical realism, we curate VGG-Impact, a benchmark focusing on object-object interactions, and introduce Audio-Physics Correlation Coefficient (APCC), an evaluation metric that measures consistency between physical and auditory attributes. Comprehensive experiments show that PAVAS produces physically plausible and perceptually coherent audio, outperforming existing V2A models in both quantitative and qualitative evaluations. Visit https://physics-aware-video-to-audio-synthesis.github.io for demo videos.
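The velocity computation mentioned above can be illustrated with finite differences over a recovered 3D trajectory. This is only a sketch: it assumes the trajectory is a list of per-frame (x, y, z) positions in metres sampled at a known frame rate, which the abstract does not state, and the function name `speeds` is hypothetical.

```python
import math


def speeds(trajectory, fps):
    """Per-interval speed (m/s) from consecutive 3D positions.

    trajectory: list of (x, y, z) positions, one per frame (assumed metres).
    fps: frame rate of the video, so each interval lasts 1 / fps seconds.
    """
    dt = 1.0 / fps
    return [math.dist(p, q) / dt for p, q in zip(trajectory, trajectory[1:])]
```

For a trajectory of N positions this returns N - 1 speeds; a real pipeline would probably smooth the trajectory first, since frame-to-frame differences amplify reconstruction noise.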
[10] OpenSubject: Leveraging Video-Derived Identity and Diversity Priors for Subject-driven Image Generation and Manipulation
Yexin Liu, Manyuan Zhang, Yueze Wang, Hongyu Li, Dian Zheng, Weiming Zhang, Changsheng Lu, Xunliang Cai, Yan Feng, Peng Pei, Harry Yang
🧩 TL;DR
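This paper introduces OpenSubject, a large-scale video-derived dataset with 2.5M samples and 4.35M images for improving subject-driven image generation and manipulation, particularly in complex multi-subject scenes. The work also introduces a matching benchmark and builds high-quality training data through a four-stage pipeline, improving identity fidelity and manipulation consistency.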
📘 Detailed Summary
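Motivation: Current subject-driven image generation models have two main problems: they often deviate from the reference identities, and they struggle in complex scenes with multiple subjects. These limitations hinder precise control and editing of specific subjects in practice, especially in multi-subject scenes where identity consistency must be preserved.
Method: OpenSubject is built with a four-stage pipeline: (i) video curation with quality filtering; (ii) cross-frame subject mining and pairing based on a vision-language model, including category consensus, local grounding, and diversity-aware pairing; (iii) identity-preserving reference image synthesis, using segmentation map-guided outpainting and box-guided inpainting together with geometry-aware augmentations and irregular boundary erosion; and (iv) verification and captioning, in which a VLM validates the synthesized samples and short and long captions are constructed. The authors also establish a benchmark covering subject-driven generation and manipulation.
Result: Experiments show that training with OpenSubject improves generation and manipulation performance, particularly in complex scenes. The benchmark covers identity fidelity, prompt adherence, manipulation consistency, and background consistency, scored by a VLM judge. The dataset contains 2.5M samples and 4.35M images.
Conclusion: Through its systematic four-stage construction process, OpenSubject addresses identity preservation and complex-scene handling in subject-driven generation. The work provides both a high-quality training resource and a standardised evaluation framework. Its use of cross-frame identity priors and VLM-assisted verification offers a reproducible route to building large-scale, high-quality datasets.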
📄 Abstract
Despite the promising progress in subject-driven image generation, current models often deviate from the reference identities and struggle in complex scenes with multiple subjects. To address this challenge, we introduce OpenSubject, a video-derived large-scale corpus with 2.5M samples and 4.35M images for subject-driven generation and manipulation. The dataset is built with a four-stage pipeline that exploits cross-frame identity priors. (i) Video Curation. We apply resolution and aesthetic filtering to obtain high-quality clips. (ii) Cross-Frame Subject Mining and Pairing. We utilize vision-language model (VLM)-based category consensus, local grounding, and diversity-aware pairing to select image pairs. (iii) Identity-Preserving Reference Image Synthesis. We introduce segmentation map-guided outpainting to synthesize the input images for subject-driven generation and box-guided inpainting to generate input images for subject-driven manipulation, together with geometry-aware augmentations and irregular boundary erosion. (iv) Verification and Captioning. We utilize a VLM to validate synthesized samples, re-synthesize failed samples based on stage (iii), and then construct short and long captions. In addition, we introduce a benchmark covering subject-driven generation and manipulation, and then evaluate identity fidelity, prompt adherence, manipulation consistency, and background consistency with a VLM judge. Extensive experiments show that training with OpenSubject improves generation and manipulation performance, particularly in complex scenes.
[11] Interpreting Structured Perturbations in Image Protection Methods for Diffusion Models
Michael R. Martin, Garrick Chan, Kwan-Liu Ma
🧩 TL;DR
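This study uses an explainable AI framework to systematically analyse the internal structure of image protection mechanisms such as Glaze and Nightshade. It finds that these adversarial perturbations protect images through structured feature-level deformation rather than semantic dislocation, which explains why they are visually subtle yet detectable.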
📘 Detailed Summary
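Motivation: Although image protection mechanisms such as Glaze and Nightshade are empirically effective, their internal structure, detectability, and representational behaviour remain poorly understood. The study aims to fill this gap with a systematic explainable AI analysis.
Method: A unified framework integrates white-box feature-space inspection and black-box signal-level probing, including latent-space clustering, feature-channel activation analysis, occlusion-based spatial sensitivity mapping, and frequency-domain characterization.
Result: The protection mechanisms act as structured, low-entropy perturbations tightly coupled to image content. Protected images preserve content-driven feature organisation while showing protection-specific substructure. Detectability is governed by interacting effects of perturbation entropy, spatial deployment, and frequency alignment. Glaze and Nightshade redistribute energy along dominant image-aligned frequency axes rather than introducing diffuse noise.
Conclusion: Contemporary image protection operates through structured feature-level deformation rather than semantic dislocation, which explains why protection signals remain visually subtle yet consistently detectable. The findings inform the design of future defence and detection strategies for generative AI systems.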
📄 Abstract
Recent image protection mechanisms such as Glaze and Nightshade introduce imperceptible, adversarially designed perturbations intended to disrupt downstream text-to-image generative models. While their empirical effectiveness is known, the internal structure, detectability, and representational behavior of these perturbations remain poorly understood. This study provides a systematic, explainable AI analysis using a unified framework that integrates white-box feature-space inspection and black-box signal-level probing. Through latent-space clustering, feature-channel activation analysis, occlusion-based spatial sensitivity mapping, and frequency-domain characterization, we show that protection mechanisms operate as structured, low-entropy perturbations tightly coupled to underlying image content across representational, spatial, and spectral domains. Protected images preserve content-driven feature organization with protection-specific substructure rather than inducing global representational drift. Detectability is governed by interacting effects of perturbation entropy, spatial deployment, and frequency alignment, with sequential protection amplifying detectable structure rather than suppressing it. Frequency-domain analysis shows that Glaze and Nightshade redistribute energy along dominant image-aligned frequency axes rather than introducing diffuse noise. These findings indicate that contemporary image protection operates through structured feature-level deformation rather than semantic dislocation, explaining why protection signals remain visually subtle yet consistently detectable. This work advances the interpretability of adversarial image protection and informs the design of future defenses and detection strategies for generative AI systems.
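One way to make the notion of a "low-entropy perturbation" concrete is to measure the Shannon entropy of the pixel-wise difference between a protected image and its original. The sketch below does this for flattened integer pixel values; it is an assumed, simplified measure for illustration, since the abstract does not say how perturbation entropy is computed, and both function names are hypothetical.

```python
import math
from collections import Counter


def perturbation(original, protected):
    """Pixel-wise difference between two equally sized, flattened images."""
    return [p - o for o, p in zip(original, protected)]


def shannon_entropy(values):
    """Shannon entropy (bits) of the empirical distribution of values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

Under this measure, a perturbation that takes only a few distinct values scores low, while diffuse noise spread over many values scores high.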
[12] MELLA: Bridging Linguistic Capability and Cultural Groundedness for Low-Resource Language MLLMs
Yufei Gao, Jiaying Fei, Nuo Chen, Ruirui Chen, Guohang Yan, Yunshi Lan, Botian Shi
🧩 TL;DR
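This paper introduces MELLA, a multimodal, multilingual dataset. A dual-source data collection strategy strengthens both the linguistic capability and the cultural groundedness of MLLMs for low-resource languages, improving performance across several languages.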
📘 Detailed Summary
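Motivation: Multimodal large language models perform well in high-resource languages, but their effectiveness drops significantly in low-resource language settings. Existing multilingual enhancement methods are often limited to the text modality or rely solely on machine translation. They help models acquire basic linguistic capabilities and produce "thin descriptions", but they neglect multimodal informativeness and cultural groundedness, both of which are crucial for serving low-resource language users effectively.
Method: The authors propose a dual-source data collection strategy tailored to two objectives: native web alt-text is collected for cultural groundedness, and MLLM-generated captions are collected for linguistic capability. As a concrete implementation they introduce MELLA, a multimodal, multilingual dataset designed to strengthen both the linguistic capability and the cultural awareness of low-resource language MLLMs.
Result: After fine-tuning on MELLA, various MLLM backbones show a general performance improvement across eight languages, with models producing "thick descriptions". The authors verify that the gains come from both cultural knowledge enhancement and linguistic capability enhancement, supporting the effectiveness of the dual-source strategy.
Conclusion: The study emphasises the importance of cultural groundedness for low-resource language MLLMs and proposes an effective data collection framework that addresses linguistic capability and cultural awareness together. MELLA's results suggest that combining native cultural content with language-enhancement data can improve model performance in low-resource language settings, offering a useful reference for future multilingual AI systems.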
📄 Abstract
Multimodal Large Language Models (MLLMs) have shown remarkable performance in high-resource languages. However, their effectiveness diminishes significantly in the contexts of low-resource languages. Current multilingual enhancement methods are often limited to text modality or rely solely on machine translation. While such approaches help models acquire basic linguistic capabilities and produce "thin descriptions", they neglect the importance of multimodal informativeness and cultural groundedness, both of which are crucial for serving low-resource language users effectively. To bridge this gap, in this study, we identify two significant objectives for a truly effective MLLM in low-resource language settings, namely 1) linguistic capability and 2) cultural groundedness, placing special emphasis on cultural awareness. To achieve these dual objectives, we propose a dual-source strategy that guides the collection of data tailored to each goal, sourcing native web alt-text for culture and MLLM-generated captions for linguistics. As a concrete implementation, we introduce MELLA, a multimodal, multilingual dataset. Experiment results show that after fine-tuning on MELLA, there is a general performance improvement for the eight languages on various MLLM backbones, with models producing "thick descriptions". We verify that the performance gains are from both cultural knowledge enhancement and linguistic capability enhancement. Our dataset can be found at https://opendatalab.com/applyMultilingualCorpus.
[13] PointDico: Contrastive 3D Representation Learning Guided by Diffusion Models
Pengbo Li, Yiding Sun, Haozhe Cheng
🧩 TL;DR
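This paper proposes PointDico, a model that combines the strengths of diffusion models and contrastive learning to address the challenges of 3D point cloud representation learning, achieving state-of-the-art performance on multiple benchmarks.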
📘 Detailed Summary
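Motivation: Existing self-supervised representation learning methods have difficulty with 3D data because point clouds are unordered and unevenly dense. An in-depth analysis of mainstream contrastive and generative approaches finds that contrastive models tend to overfit, while 3D Mask Autoencoders struggle with unordered point clouds. This motivates a method that combines the strengths of diffusion models and contrastive learning.
Method: PointDico integrates denoising generative modelling and cross-modal contrastive learning through knowledge distillation, with the diffusion model serving as a guide for the contrastive model. It uses a hierarchical pyramid conditional generator for multi-scale geometric feature extraction and a dual-channel design to integrate local and global contextual information.
Result: PointDico sets a new state of the art in 3D representation learning, reaching 94.32% accuracy on ScanObjectNN and 86.5% instance mIoU on ShapeNetPart.
Conclusion: The study shows that combining diffusion models with contrastive learning can address the challenges of 3D point cloud representation learning and offers a new approach to self-supervised learning on unordered, unevenly dense data. The hierarchical feature extraction and dual-channel design provide an effective architecture for 3D geometric understanding.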
📄 Abstract
Self-supervised representation learning has shown significant improvement in Natural Language Processing and 2D Computer Vision. However, existing methods face difficulties in representing 3D data because of its unordered and uneven density. Through an in-depth analysis of mainstream contrastive and generative approaches, we find that contrastive models tend to suffer from overfitting, while 3D Mask Autoencoders struggle to handle unordered point clouds. This motivates us to learn 3D representations by sharing the merits of diffusion and contrast models, which is non-trivial due to the pattern difference between the two paradigms. In this paper, we propose PointDico, a novel model that seamlessly integrates these methods. PointDico learns from both denoising generative modeling and cross-modal contrastive learning through knowledge distillation, where the diffusion model serves as a guide for the contrastive model. We introduce a hierarchical pyramid conditional generator for multi-scale geometric feature extraction and employ a dual-channel design to effectively integrate local and global contextual information. PointDico achieves a new state-of-the-art in 3D representation learning, e.g., 94.32% accuracy on ScanObjectNN, 86.5% Inst. mIoU on ShapeNetPart.
[14] The Unseen Bias: How Norm Discrepancy in Pre-Norm MLLMs Leads to Visual Information Loss
Bozhou Li, Xinda Xue, Sihan Yang, Yang Shi, Xinlong Chen, Yushuo Guan, Yuanxing Zhang, Wentao Zhang
🧩 TL;DR
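This paper identifies a widespread norm imbalance between visual and text tokens in multimodal large language models, which causes asymmetric update dynamics and impairs cross-modal feature fusion. The authors propose a simple and effective fix: inserting a carefully initialized LayerNorm layer after the visual projector to align the norms, which significantly improves model performance.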
📘 Detailed Summary
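Motivation: Multimodal large language models commonly use the Pre-Norm architecture, which leads to a severe imbalance between high-norm visual tokens and low-norm text tokens. This norm disparity is not a static issue. It induces an "asymmetric update dynamic" in which high-norm visual tokens exhibit "representational inertia", transforming semantically much more slowly than text tokens and fundamentally impairing effective cross-modal feature fusion.
Method: The authors first give a formal theoretical analysis of the update dynamics caused by the norm imbalance and validate it empirically on mainstream MLLMs. Based on this insight, they propose a simple solution: inserting a single, carefully initialized LayerNorm layer after the visual projector to enforce norm alignment between visual and text tokens, addressing the imbalance at the architectural level.
Result: Experiments on the LLaVA-1.5 architecture show that the intervention yields significant gains on a wide suite of multimodal benchmarks and also clear improvements on text-only evaluations such as MMLU. This suggests that resolving the architectural imbalance leads to a more holistically capable model, and the empirical validation confirms that norm disparity and asymmetric update dynamics are prevalent in MLLMs.
Conclusion: The study exposes the norm imbalance between visual and text tokens as a fundamental architectural flaw of MLLMs and documents its negative effect on cross-modal learning. Although simple, the proposed LayerNorm insertion addresses this core problem. The finding that fixing a low-level architectural imbalance can improve both multimodal and unimodal capabilities points to a new direction for model design.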
📄 Abstract
Multimodal Large Language Models (MLLMs), which couple pre-trained vision encoders and language models, have shown remarkable capabilities. However, their reliance on the ubiquitous Pre-Norm architecture introduces a subtle yet critical flaw: a severe norm disparity between the high-norm visual tokens and the low-norm text tokens. In this work, we present a formal theoretical analysis demonstrating that this imbalance is not a static issue. Instead, it induces an "asymmetric update dynamic," where high-norm visual tokens exhibit a "representational inertia," causing them to transform semantically much slower than their textual counterparts. This fundamentally impairs effective cross-modal feature fusion. Our empirical validation across a range of mainstream MLLMs confirms that this theoretical dynamic -- the persistence of norm disparity and the resulting asymmetric update rates -- is a prevalent phenomenon. Based on this insight, we propose a remarkably simple yet effective solution: inserting a single, carefully initialized LayerNorm layer after the visual projector to enforce norm alignment. Experiments conducted on the LLaVA-1.5 architecture show that this intervention yields significant performance gains not only on a wide suite of multimodal benchmarks but also, notably, on text-only evaluations such as MMLU, suggesting that resolving the architectural imbalance leads to a more holistically capable model.
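The effect of a LayerNorm placed after the visual projector can be illustrated numerically. With zero bias and a uniform gain g, LayerNorm maps any d-dimensional token to a vector of L2 norm approximately g * sqrt(d), so the gain can be chosen to match a target text-token norm. The sketch below shows this; initializing the gain from a target norm is an assumption about what "carefully initialized" might mean, not the authors' stated recipe, and all function names are hypothetical.

```python
import math


def l2_norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


def layer_norm(v, gain, eps=1e-5):
    """LayerNorm with a uniform scalar gain and zero bias."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    scale = gain / math.sqrt(var + eps)
    return [(x - mean) * scale for x in v]


def gain_for_target_norm(target_norm, dim):
    """Gain that makes LayerNorm outputs have roughly the target L2 norm.

    Assumed initialization: output norm is about gain * sqrt(dim).
    """
    return target_norm / math.sqrt(dim)
```

For example, a high-norm visual token such as `[40.0, -35.0, 50.0, -45.0]` passed through `layer_norm` with `gain_for_target_norm(2.0, 4)` comes out with a norm of about 2.0, regardless of its input scale.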
[15] Towards Effective and Efficient Long Video Understanding of Multimodal Large Language Models via One-shot Clip Retrieval
Tao Chen, Shaobo Ju, Qiong Wu, Chenxin Fang, Kun Zhang, Jun Peng, Hui Li, Yiyi Zhou, Rongrong Ji
🧩 TL;DR
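This paper proposes OneClip-RAG, an efficient retrieval-augmentation paradigm based on one-shot video clip retrieval. It addresses the excessive memory overhead that multimodal large language models face on long videos and significantly improves long-video understanding.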
📘 Detailed Summary
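Motivation: Because of excessive memory overhead, most multimodal large language models can only process videos of a limited number of frames and cannot effectively understand long videos, which limits their use in practical long-video understanding tasks.
Method: The authors propose the OneClip-RAG paradigm, which includes a query-guided video chunking algorithm that unifies clip chunking and cross-modal retrieval in one processing step, avoiding redundant computation. They also build the SynLongVideo dataset and design a progressive training regime to improve instruction following.
Result: Experiments show that OneClip-RAG significantly improves MLLM performance on several long-video benchmarks, for example boosting InternLV2 8B and Qwen2-VL 7B to the level of GPT-4o on MLVU. In terms of efficiency, it enables LLaVA-Video to understand up to an hour of video in less than 2.2 minutes on a single 4090 GPU.
Conclusion: The study shows that a retrieval-augmentation paradigm can overcome the memory limits MLLMs face on long videos. The unified processing framework improves computational efficiency while preserving knowledge integrity and semantic coherence, providing a practical solution for long-video understanding.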
📄 Abstract
Due to excessive memory overhead, most Multimodal Large Language Models (MLLMs) can only process videos of limited frames. In this paper, we propose an effective and efficient paradigm to remedy this shortcoming, termed One-shot video-Clip based Retrieval AuGmentation (OneClip-RAG). Compared with existing video RAG methods, OneClip-RAG makes full use of the merits of video clips for augmented video understanding in terms of both knowledge integrity and semantic coherence. Besides, it is also equipped with a novel query-guided video chunking algorithm that can unify clip chunking and cross-modal retrieval in one processing step, avoiding redundant computations. To improve instruction following, we further propose a new dataset called SynLongVideo and design a progressive training regime for OneClip-RAG. OneClip-RAG is plugged into five recent MLLMs and validated on a set of long-video benchmarks. Experimental results not only show the obvious performance gains by OneClip-RAG over MLLMs, e.g., boosting InternVL2 8B and Qwen2-VL 7B to the level of GPT-4o on MLVU, but also show its superior efficiency in handling long videos, e.g., enabling LLaVA-Video to understand up to an hour of video in less than 2.2 minutes on a single 4090 GPU.
[16] Temporal Concept Dynamics in Diffusion Models via Prompt-Conditioned Interventions
Ada Gorgun, Fawaz Sammani, Nikos Deligiannis, Bernt Schiele, Jonas Fischer
🧩 TL;DR
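This paper proposes PCI (Prompt-Conditioned Intervention), a training-free, model-agnostic framework for analyzing concept dynamics through diffusion time. Using Concept Insertion Success (CIS), it quantifies when a concept forms and locks in along the diffusion trajectory, revealing diverse temporal behaviors of concept formation across diffusion models.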
📘 Detailed Summary
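Motivation: Diffusion models are usually evaluated only by their final outputs, yet generation unfolds along a trajectory, and analyzing this dynamic process is essential for understanding how controllable, reliable, and predictable the models are. The study asks when noise turns into a specific concept (e.g., age) and locks in the denoising trajectory, aiming to fill the gap in understanding the dynamics of concept formation in diffusion models.
Method: The paper proposes PCI (Prompt-Conditioned Intervention), a training-free, model-agnostic analysis framework. Its central idea is to analyze Concept Insertion Success (CIS), defined as the probability that a concept inserted at a given timestep is preserved and reflected in the final image, which quantifies the temporal dynamics of concept formation. The method requires neither access to model internals nor training, and applies to a range of state-of-the-art text-to-image diffusion models.
Result: Applied to several state-of-the-art text-to-image diffusion models and a broad taxonomy of concepts, PCI reveals diverse temporal behaviors of concept formation across diffusion models. Certain phases of the trajectory are more favorable to specific concepts, with differences even within the same concept type. In text-driven image editing, the method yields quantitatively stronger edits, achieving a better balance between semantic accuracy and content preservation than strong baselines.
Conclusion: The study offers a detailed view of concept-formation dynamics in diffusion models and shows that the temporal pattern of concept lock-in differs across models. The findings provide actionable insights for text-driven image editing by identifying the time windows in which interventions are most effective without access to model internals or training, advancing research on the interpretability and controllability of diffusion models.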
📄 Abstract
Diffusion models, which gradually denoise random noise into meaningful images, are usually evaluated by their final outputs. Yet, generation unfolds along a trajectory, and analyzing this dynamic process is crucial for understanding how controllable, reliable, and predictable these models are in terms of their success/failure modes. In this work, we ask the question: when does noise turn into a specific concept (e.g., age) and lock in the denoising trajectory? We propose PCI (Prompt-Conditioned Intervention) to study this question. PCI is a training-free and model-agnostic framework for analyzing concept dynamics through diffusion time. The central idea is the analysis of Concept Insertion Success (CIS), defined as the probability that a concept inserted at a given timestep is preserved and reflected in the final image, offering a way to characterize the temporal dynamics of concept formation. Applied to several state-of-the-art text-to-image diffusion models and a broad taxonomy of concepts, PCI reveals diverse temporal behaviors across diffusion models, in which certain phases of the trajectory are more favorable to specific concepts even within the same concept type. These findings also provide actionable insights for text-driven image editing, highlighting when interventions are most effective without requiring access to model internals or training, and yielding quantitatively stronger edits that achieve a better balance of semantic accuracy and content preservation than strong baselines. Code is available at: https://github.com/adagorgun/PCI-Prompt-Controlled-Interventions
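Since CIS is defined as a probability per insertion timestep, one plausible way to estimate it is as an empirical success rate over repeated runs. The sketch below does only that. The outcome data are invented, the function name is hypothetical, and how a run is judged to have preserved the concept is left outside the sketch, since the abstract does not specify it.

```python
def concept_insertion_success(outcomes_by_timestep):
    """Empirical CIS: per timestep, the fraction of runs where the inserted concept survived."""
    return {
        timestep: sum(outcomes) / len(outcomes)
        for timestep, outcomes in outcomes_by_timestep.items()
        if outcomes  # skip timesteps with no recorded runs
    }

# Made-up outcomes: 1 if the concept inserted at that timestep is reflected
# in the final image, 0 otherwise. Keys count denoising steps already taken.
toy_outcomes = {
    0: [1, 1, 1, 1],
    10: [1, 1, 0, 1],
    20: [0, 1, 0, 0],
    30: [0, 0, 0, 0],
}
cis_curve = concept_insertion_success(toy_outcomes)
```

In this toy data the success rate falls as insertion happens later, which is the kind of lock-in curve the abstract describes; real curves would depend on the model and the concept.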
[17] Disrupting Hierarchical Reasoning: Adversarial Protection for Geographic Privacy in Multimodal Reasoning Models
Jiaming Zhang, Che Wang, Yang Cao, Longtao Huang, Wei Yang Bryan Lim
🧩 TL;DR
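This paper proposes ReasonBreak, an adversarial privacy-protection framework targeting geolocation reasoning in multi-modal large reasoning models (MLRMs). It uses concept-aware perturbations to disrupt their hierarchical reasoning chains, substantially improving privacy protection.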
📘 Detailed Summary
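Motivation: MLRMs infer precise geographic locations from personal images through hierarchical chain-of-thought reasoning, which poses a serious privacy risk. Existing privacy-protection techniques were designed mainly for perception models and are ineffective against the complex multi-step reasoning of MLRMs, so new methods aimed specifically at reasoning-based threats are needed.
Method: The paper proposes ReasonBreak, an adversarial framework that uses concept-aware perturbations to disrupt the hierarchical reasoning process of MLRMs. Its key insight is that effectively disrupting geographic reasoning requires perturbations aligned with conceptual hierarchies rather than uniform noise. ReasonBreak strategically targets critical conceptual dependencies within reasoning chains, generating perturbations that invalidate specific inference steps and cascade through subsequent stages. The paper also contributes the GeoPrivacy-6K dataset, comprising 6,341 ultra-high-resolution images (≥2K) with hierarchical concept annotations.
Result: Extensive evaluation on seven state-of-the-art MLRMs (including GPT-o3, GPT-5, and Gemini 2.5 Pro) shows that ReasonBreak improves tract-level protection by 14.4 percentage points (33.8% vs 19.4%) and nearly doubles block-level protection (33.5% vs 16.8%), clearly outperforming existing methods.
Conclusion: The study establishes a new paradigm for privacy protection against reasoning-based threats and shows that concept-aware perturbations disrupt the hierarchical reasoning of MLRMs more effectively than conventional methods. It provides a systematic framework for countering reasoning-based privacy threats and contributes the first large-scale annotated dataset dedicated to geographic privacy protection.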
📄 Abstract
Multi-modal large reasoning models (MLRMs) pose significant privacy risks by inferring precise geographic locations from personal images through hierarchical chain-of-thought reasoning. Existing privacy protection techniques, primarily designed for perception-based models, prove ineffective against MLRMs' sophisticated multi-step reasoning processes that analyze environmental cues. We introduce ReasonBreak, a novel adversarial framework specifically designed to disrupt hierarchical reasoning in MLRMs through concept-aware perturbations. Our approach is founded on the key insight that effective disruption of geographic reasoning requires perturbations aligned with conceptual hierarchies rather than uniform noise. ReasonBreak strategically targets critical conceptual dependencies within reasoning chains, generating perturbations that invalidate specific inference steps and cascade through subsequent reasoning stages. To facilitate this approach, we contribute GeoPrivacy-6K, a comprehensive dataset comprising 6,341 ultra-high-resolution images (≥2K) with hierarchical concept annotations. Extensive evaluation across seven state-of-the-art MLRMs (including GPT-o3, GPT-5, Gemini 2.5 Pro) demonstrates ReasonBreak's superior effectiveness, achieving a 14.4-percentage-point improvement in tract-level protection (33.8% vs 19.4%) and nearly doubling block-level protection (33.5% vs 16.8%). This work establishes a new paradigm for privacy protection against reasoning-based threats.
[18] Beyond the Noise: Aligning Prompts with Latent Representations in Diffusion Models
Vasco Ramos, Regev Cohen, Idan Szpektor, Joao Magalhaes
🧩 TL;DR
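This paper proposes NoisyCLIP, a method that detects text-image semantic alignment early in the denoising process, enabling real-time alignment assessment and cutting computational cost by 50% while retaining 98% of CLIP alignment performance.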
📘 Detailed Summary
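Motivation: Conditional diffusion models rely on language-image alignment methods to steer generation, yet misalignment and hallucinations remain common. Measuring alignment after generation means waiting for the full generation to finish, which is computationally expensive. The study tackles real-time alignment detection by asking whether text/image misalignment can be detected early in the denoising process.
Method: The paper proposes NoisyCLIP, which detects misalignment between the prompt and the latent representation by measuring semantic alignment in the noisy latent space. It is the first work to explore and benchmark prompt-to-latent misalignment detection with dual encoders during the reverse diffusion process, enabling alignment assessment while the image is being generated.
Result: NoisyCLIP performs well in both qualitative and quantitative evaluation, reducing computational cost by 50% while achieving 98% of CLIP alignment performance in Best-of-N settings. It allows real-time alignment assessment during generation, cutting compute without sacrificing semantic fidelity.
Conclusion: The study shows that text/image misalignment can be detected early in the denoising process, providing an effective route to real-time alignment assessment. NoisyCLIP opens a new avenue for quality control in diffusion models, particularly in Best-of-N post-generation settings, where semantic alignment can be assessed without waiting for the complete generation.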
📄 Abstract
Conditional diffusion models rely on language-to-image alignment methods to steer the generation towards semantically accurate outputs. Despite the success of this architecture, misalignment and hallucinations remain common issues and require automatic misalignment detection tools to improve quality, for example by applying them in a Best-of-N (BoN) post-generation setting. Unfortunately, measuring the alignment after the generation is an expensive step since we need to wait for the overall generation to finish to determine prompt adherence. In contrast, this work hypothesizes that text/image misalignments can be detected early in the denoising process, enabling real-time alignment assessment without waiting for the complete generation. In particular, we propose NoisyCLIP, a method that measures semantic alignment in the noisy latent space. This work is the first to explore and benchmark prompt-to-latent misalignment detection during image generation using dual encoders in the reverse diffusion process. We evaluate NoisyCLIP qualitatively and quantitatively and find it reduces computational cost by 50% while achieving 98% of CLIP alignment performance in BoN settings. This approach enables real-time alignment assessment during generation, reducing costs without sacrificing semantic fidelity.
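To make the Best-of-N saving concrete, the sketch below prunes candidates using an alignment score measured at an early denoising step and counts the denoising steps spent with and without pruning. The scores, step counts, and function names are hypothetical, and the scoring model itself (the dual encoder on noisy latents) is not implemented here. The resulting numbers are illustrative and are not the paper's reported figures.

```python
def prune_by_early_score(early_scores, keep):
    """Return the `keep` candidate ids with the highest early alignment score."""
    ranked = sorted(early_scores, key=early_scores.get, reverse=True)
    return ranked[:keep]

def total_denoising_steps(n_candidates, total_steps, check_step, keep):
    """Steps spent when every candidate runs to `check_step` and only `keep` finish."""
    return n_candidates * check_step + keep * (total_steps - check_step)

# Made-up alignment scores measured on noisy latents at an early step.
early_scores = {"seed_a": 0.21, "seed_b": 0.18, "seed_c": 0.27, "seed_d": 0.15}
survivors = prune_by_early_score(early_scores, keep=2)

# Baseline: all 4 candidates run the full 50 steps before being scored.
full_cost = total_denoising_steps(4, 50, 50, 4)
# Early pruning: score after 10 steps, then finish only the 2 survivors.
pruned_cost = total_denoising_steps(4, 50, 10, 2)
```

The earlier the score becomes reliable, the smaller `check_step` can be, so the saving depends directly on how well alignment can be read from noisy latents.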
[19] Decoupling Template Bias in CLIP: Harnessing Empty Prompts for Enhanced Few-Shot Learning
Zhenyu Zhang, Guangyao Chen, Yixiong Zou, Zhimeng Huang, Yuhua Li
🧩 TL;DR
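This study identifies the bias introduced by template-sample similarity (TSS) in CLIP and proposes a framework that uses empty prompts to decouple template bias, substantially improving the accuracy and robustness of few-shot classification.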
📘 Detailed Summary
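Motivation: The study finds that template-sample similarity (TSS) in CLIP, i.e., the similarity between a text template and an image sample, introduces a systematic bias. The model comes to rely on template proximity rather than true sample-to-category alignment, which lowers the accuracy and robustness of few-shot classification.
Method: The paper proposes a two-stage framework. During pre-training, empty prompts (text inputs without category information) are used to reveal and reduce template-induced bias in the CLIP encoder. During few-shot fine-tuning, a bias calibration loss enforces alignment between images and their correct categories, so that the model attends to relevant visual cues rather than template similarity.
Result: Experiments on multiple benchmarks show that the template correction method significantly reduces the performance fluctuations caused by TSS and yields higher classification accuracy and stronger robustness, confirming that empty prompts are effective for decoupling template bias.
Conclusion: The study exposes the negative impact of template bias in CLIP and proposes an effective way to decouple it, offering a new direction for improving few-shot learning in vision-language models and underlining the importance of accounting for template effects in cross-modal alignment.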
📄 Abstract
The Contrastive Language-Image Pre-Training (CLIP) model excels in few-shot learning by aligning visual and textual representations. Our study shows that template-sample similarity (TSS), defined as the resemblance between a text template and an image sample, introduces bias. This bias leads the model to rely on template proximity rather than true sample-to-category alignment, reducing both accuracy and robustness in classification. We present a framework that uses empty prompts, textual inputs that convey the idea of "emptiness" without category information. These prompts capture unbiased template features and offset TSS bias. The framework employs two stages. During pre-training, empty prompts reveal and reduce template-induced bias within the CLIP encoder. During few-shot fine-tuning, a bias calibration loss enforces correct alignment between images and their categories, ensuring the model focuses on relevant visual cues. Experiments across multiple benchmarks demonstrate that our template correction method significantly reduces performance fluctuations caused by TSS, yielding higher classification accuracy and stronger robustness. The repository of this project is available at https://github.com/zhenyuZ-HUST/Decoupling-Template-Bias-in-CLIP.
[20] OCCDiff: Occupancy Diffusion Model for High-Fidelity 3D Building Reconstruction from Noisy Point Clouds
Jialu Sui, Rui Liu, Hongsheng Zhang
🧩 TL;DR
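This paper proposes OCCDiff, which applies a latent diffusion model in the occupancy function space to reconstruct building surfaces from LiDAR point clouds. The method generates continuously evaluable occupancy functions and remains robust under varying point densities and noise.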
📘 Detailed Summary
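Motivation: The main challenge in reconstructing buildings from LiDAR point clouds is capturing building surfaces accurately under varying point densities and noise interference. A method is needed that can flexibly obtain high-quality 3D profiles at different resolutions.
Method: OCCDiff combines a latent diffusion process with a function autoencoder architecture to generate continuously evaluable occupancy functions in the occupancy function space. A point encoder is proposed to provide conditioning features for diffusion learning, constrain the final occupancy prediction, and inject multi-modal features into the latent encoder. A multi-task training strategy ensures that the point encoder learns diverse and robust feature representations.
Result: Experiments show that the generated samples are physically consistent, have high fidelity to the target distribution, and remain robust on noisy data, handling varying point densities and noise interference effectively.
Conclusion: The study demonstrates the effectiveness of applying latent diffusion in the occupancy function space and provides a flexible, robust method for building reconstruction. Future work could explore multi-modal feature fusion and the reconstruction of more complex building structures.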
📄 Abstract
A major challenge in reconstructing buildings from LiDAR point clouds lies in accurately capturing building surfaces under varying point densities and noise interference. To flexibly gather high-quality 3D profiles of buildings at diverse resolutions, we propose OCCDiff, which applies latent diffusion in the occupancy function space. OCCDiff combines a latent diffusion process with a function autoencoder architecture to generate continuous occupancy functions evaluable at arbitrary locations. Moreover, a point encoder is proposed to provide condition features to diffusion learning, constrain the final occupancy prediction of the occupancy decoder, and supply the latent encoder with multi-modal features for latent generation. To further enhance model performance, a multi-task training strategy is employed, ensuring that the point encoder learns diverse and robust feature representations. Empirical results show that our method generates physically consistent samples with high fidelity to the target distribution and exhibits robustness to noisy data.
[21] Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning
Huilin Xu, Zhuoyang Liu, Yixiang Luomei, Feng Xu
🧩 TL;DR
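This paper proposes a unified aerial vision-language navigation framework that uses only egocentric monocular RGB observations and natural language instructions. It models navigation as a next-token prediction problem and uses prompt-guided multi-task learning to jointly optimize spatial perception, trajectory reasoning, and action prediction.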
📘 Detailed Summary
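Motivation: Aerial vision-language navigation aims to let UAVs understand natural language instructions and navigate complex urban environments. Existing methods usually rely on panoramic images, depth inputs, or odometry to support spatial reasoning and action planning. These requirements raise system cost and integration complexity and hinder practical deployment on lightweight UAVs.
Method: The model formulates navigation as a next-token prediction problem and jointly optimizes spatial perception, trajectory reasoning, and action prediction through prompt-guided multi-task learning. A keyframe selection strategy reduces visual redundancy by retaining semantically informative frames. An action merging and label reweighting mechanism mitigates long-tailed supervision imbalance and supports stable multi-task co-training.
Result: Extensive experiments on the Aerial VLN benchmark validate the method. Under the challenging monocular RGB-only setting, the model achieves strong results in both seen and unseen environments. It significantly outperforms existing RGB-only baselines and narrows the gap to state-of-the-art panoramic RGB-D counterparts. Comprehensive ablation studies further confirm the contribution of the task design and architectural choices.
Conclusion: The study shows that efficient aerial vision-language navigation is feasible with monocular RGB input alone. Its unified multi-task learning framework and training strategies offer a promising solution for practical deployment on lightweight UAVs and a new direction for navigation systems that depend less on external sensors.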
📄 Abstract
Aerial Vision-and-Language Navigation (VLN) aims to enable unmanned aerial vehicles (UAVs) to interpret natural language instructions and navigate complex urban environments using onboard visual observation. This task holds promise for real-world applications such as low-altitude inspection, search-and-rescue, and autonomous aerial delivery. Existing methods often rely on panoramic images, depth inputs, or odometry to support spatial reasoning and action planning. These requirements increase system cost and integration complexity, thus hindering practical deployment for lightweight UAVs. We present a unified aerial VLN framework that operates solely on egocentric monocular RGB observations and natural language instructions. The model formulates navigation as a next-token prediction problem, jointly optimizing spatial perception, trajectory reasoning, and action prediction through prompt-guided multi-task learning. Moreover, we propose a keyframe selection strategy to reduce visual redundancy by retaining semantically informative frames, along with an action merging and label reweighting mechanism that mitigates long-tailed supervision imbalance and facilitates stable multi-task co-training. Extensive experiments on the Aerial VLN benchmark validate the effectiveness of our method. Under the challenging monocular RGB-only setting, our model achieves strong results across both seen and unseen environments. It significantly outperforms existing RGB-only baselines and narrows the performance gap with state-of-the-art panoramic RGB-D counterparts. Comprehensive ablation studies further demonstrate the contribution of our task design and architectural choices.
[22] Thinking with Images via Self-Calling Agent
Wenxi Yang, Yuzhong Zhao, Fang Wan, Qixiang Ye
🧩 TL;DR
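This paper proposes Self-Calling Chain-of-Thought (sCoT), a new visual reasoning paradigm that reformulates interleaved multimodal chain-of-thought as a language-only chain-of-thought with self-calling. It addresses the scarcity of high-quality reasoning data in multimodal reinforcement learning and substantially improves training efficiency and effectiveness.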
📘 Detailed Summary
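Motivation: Thinking-with-images paradigms show strong visual reasoning ability, but optimizing interleaved multimodal chain-of-thought (iMCoT) through reinforcement learning remains difficult. It depends on scarce high-quality reasoning data, which limits training effectiveness and efficiency.
Method: sCoT reformulates iMCoT as a language-only chain-of-thought with self-calling. A main agent decomposes a complex visual reasoning task into atomic subtasks and invokes its virtual replicas (parameter-sharing subagents) to solve them in isolated contexts. The method uses group-relative policy optimization to reinforce effective reasoning behavior and needs no explicit interleaving between modalities.
Result: Experiments on HR-Bench 4K show that sCoT improves overall reasoning performance by up to 1.9% while using about 75% fewer GPU hours than strong baseline approaches.
Conclusion: By recasting multimodal reasoning as a language-only self-calling process, sCoT addresses the data-scarcity challenge in multimodal reinforcement learning and provides a more efficient and scalable training paradigm for visual reasoning tasks.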
📄 Abstract
Thinking-with-images paradigms have showcased remarkable visual reasoning capability by integrating visual information as dynamic elements into the Chain-of-Thought (CoT). However, optimizing interleaved multimodal CoT (iMCoT) through reinforcement learning remains challenging, as it relies on scarce high-quality reasoning data. In this study, we propose Self-Calling Chain-of-Thought (sCoT), a novel visual reasoning paradigm that reformulates iMCoT as a language-only CoT with self-calling. Specifically, a main agent decomposes the complex visual reasoning task into atomic subtasks and invokes its virtual replicas, i.e., parameter-sharing subagents, to solve them in isolated contexts. sCoT enjoys substantial training effectiveness and efficiency, as it requires no explicit interleaving between modalities. sCoT employs group-relative policy optimization to reinforce effective reasoning behavior. Experiments on HR-Bench 4K show that sCoT improves the overall reasoning performance by up to 1.9% with ~75% fewer GPU hours compared to strong baseline approaches. Code is available at https://github.com/YWenxi/think-with-images-through-self-calling.
[23] Refining Visual Artifacts in Diffusion Models via Explainable AI-based Flaw Activation Maps
Seoyeon Lee, Gwangyeol Yu, Chaewon Kim, Jonghyuk Park
🧩 TL;DR
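This paper proposes a self-refining diffusion framework that uses explainable AI techniques to detect artifacts and unrealistic regions in generated images. It uses flaw activation maps to refine image quality in a targeted way during both the forward and reverse diffusion processes, substantially improving the generation performance of several diffusion models.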
📘 Detailed Summary
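Motivation: Diffusion models have achieved remarkable success in image synthesis, but handling artifacts and unrealistic regions remains a key challenge. Existing methods fall short in detecting and repairing these flaws, so a systematic framework that actively identifies them and refines image quality is needed.
Method: The paper proposes a self-refining diffusion framework that uses an explainable-AI-based flaw highlighter to produce flaw activation maps (FAMs) identifying artifacts and unrealistic regions. These maps improve reconstruction quality by amplifying noise in flawed regions during the forward process and by focusing on those regions during the reverse process.
Result: The method achieves up to a 27.3% improvement in Fréchet inception distance across various diffusion-based models, with consistently strong performance on diverse datasets. The framework is robustly effective across tasks, including image generation, text-to-image generation, and inpainting.
Conclusion: The study shows that explainable AI techniques can go beyond interpretability and actively contribute to image refinement. The framework offers a versatile and effective approach applicable to a variety of diffusion models and tasks.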
📄 Abstract
Diffusion models have achieved remarkable success in image synthesis. However, addressing artifacts and unrealistic regions remains a critical challenge. We propose self-refining diffusion, a novel framework that enhances image generation quality by detecting these flaws. The framework employs an explainable artificial intelligence (XAI)-based flaw highlighter to produce flaw activation maps (FAMs) that identify artifacts and unrealistic regions. These FAMs improve reconstruction quality by amplifying noise in flawed regions during the forward process and by focusing on these regions during the reverse process. The proposed approach achieves up to a 27.3% improvement in Fréchet inception distance across various diffusion-based models, demonstrating consistently strong performance on diverse datasets. It also shows robust effectiveness across different tasks, including image generation, text-to-image generation, and inpainting. These results demonstrate that explainable AI techniques can extend beyond interpretability to actively contribute to image refinement. The proposed framework offers a versatile and effective approach applicable to various diffusion models and tasks, significantly advancing the field of image synthesis.
[24] Training-Free Dual Hyperbolic Adapters for Better Cross-Modal Reasoning
Yi Zhang, Chun-Wun Cheng, Junyi He, Ke Yu, Yushun Tang, Carola-Bibiane Schönlieb, Zhihai He, Angelica I. Aviles-Rivero
🧩 TL;DR
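This paper proposes T-DHA, a training-free hyperbolic adapter method that models the hierarchical structure of vision-language relationships in hyperbolic space, substantially improving few-shot image recognition and domain generalization.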
📘 Detailed Summary
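Motivation: Existing vision-language models degrade severely under domain shift, and fine-tuning them for new domains requires substantial computational resources, which limits their adaptability and practical reach.
Method: The method uses training-free hyperbolic adapters to represent the hierarchical tree structure of vision-language relationships in hyperbolic space. It exploits the exponential volume growth of hyperbolic space, embeds hierarchical data more effectively with the Poincaré ball model, and adds negative learning to improve classification accuracy.
Result: Experiments on multiple datasets show that T-DHA significantly outperforms existing state-of-the-art methods in few-shot image recognition and domain generalization, achieving more accurate and robust classification with fewer feature dimensions.
Conclusion: Hyperbolic space provides a more effective geometric representation for modeling vision-language hierarchies, and training-free adapters offer an efficient route to domain adaptation for large vision-language models.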
📄 Abstract
Recent research in Vision-Language Models (VLMs) has significantly advanced our capabilities in cross-modal reasoning. However, existing methods suffer from performance degradation with domain changes or require substantial computational resources for fine-tuning in new domains. To address this issue, we develop a new adaptation method for large vision-language models, called Training-free Dual Hyperbolic Adapters (T-DHA). We characterize the vision-language relationship between semantic concepts, which typically has a hierarchical tree structure, in the hyperbolic space instead of the traditional Euclidean space. Hyperbolic spaces exhibit exponential volume growth with radius, unlike the polynomial growth in Euclidean space. We find that this unique property is particularly effective for embedding hierarchical data structures using the Poincaré ball model, achieving significantly improved representation and discrimination power. Coupled with negative learning, it provides more accurate and robust classifications with fewer feature dimensions. Our extensive experimental results on various datasets demonstrate that the T-DHA method significantly outperforms existing state-of-the-art methods in few-shot image recognition and domain generalization tasks.
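For readers unfamiliar with the Poincaré ball model mentioned above, the sketch below computes its standard geodesic distance for the unit ball with curvature -1. It shows only the textbook formula; how T-DHA uses such distances inside its adapters is not described in the abstract and is not modelled here.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincaré ball."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))

# Two pairs of points with the same Euclidean gap (0.09).
near_center = poincare_distance([0.0, 0.0], [0.09, 0.0])
near_boundary = poincare_distance([0.9, 0.0], [0.99, 0.0])
```

The second pair is much farther apart in hyperbolic terms than the first, which reflects the exponential growth of volume toward the boundary that makes the model suited to tree-like hierarchies.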
[25] PaintFlow: A Unified Framework for Interactive Oil Paintings Editing and Generation
Zhangli Hu, Ye Chen, Jiajun Yao, Bingbing Ni
🧩 TL;DR
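This paper proposes a unified multimodal framework for oil painting generation and editing. Through a spatial alignment and semantic enhancement conditioning strategy, a self-supervised style transfer pipeline based on stroke-based rendering, and AdaIN feature fusion, it enables interactive oil painting creation with precise semantic control and stylistic consistency.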
📘 Detailed Summary
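Motivation: Oil painting is a high-level medium that blends human abstract thinking with artistic expression, and its intricate brushstroke dynamics and stylized characteristics pose major challenges for digital generation and editing. Existing techniques are constrained by the distribution of their training data and focus mainly on modifying real photographs, so a unified generation and editing framework for oil painting is lacking.
Method: The method rests on three technical advances. First, the training stage uses a spatial alignment and semantic enhancement conditioning strategy that maps masks and sketches into spatial constraints and encodes contextual embeddings from reference images and text into feature constraints, enabling object-level semantic alignment. Second, a self-supervised style transfer pipeline based on stroke-based rendering simulates the inpainting dynamics of oil painting restoration, converting real images into stylized oil paintings with preserved brushstroke textures to build a large-scale paired training dataset. Finally, features are integrated with the AdaIN operator at inference time to ensure stylistic consistency.
Result: Extensive experiments show that the interactive system enables fine-grained editing while preserving the artistic qualities of oil paintings. It lets users apply reference images for precise semantic control, hand-drawn sketches for spatial structure alignment, and natural language prompts for high-level semantic guidance, while keeping a unified painting style across all outputs.
Conclusion: The study provides a unified solution for the digital creation of oil paintings, achieving precise semantic alignment and stylistic consistency through multimodal conditioning. The proposed self-supervised style transfer method addresses the scarcity of oil painting training data and shows the potential of interactive AI-assisted art creation.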
📄 Abstract
Oil painting, as a high-level medium that blends human abstract thinking with artistic expression, poses substantial challenges for digital generation and editing due to its intricate brushstroke dynamics and stylized characteristics. Existing generation and editing techniques are often constrained by the distribution of training data and primarily focus on modifying real photographs. In this work, we introduce a unified multimodal framework for oil painting generation and editing. The proposed system allows users to incorporate reference images for precise semantic control, hand-drawn sketches for spatial structure alignment, and natural language prompts for high-level semantic guidance, while consistently maintaining a unified painting style across all outputs. Our method achieves interactive oil painting creation through three crucial technical advancements. First, we enhance the training stage with a spatial alignment and semantic enhancement conditioning strategy, which maps masks and sketches into spatial constraints and encodes contextual embeddings from reference images and text into feature constraints, enabling object-level semantic alignment. Second, to overcome data scarcity, we propose a self-supervised style transfer pipeline based on Stroke-Based Rendering (SBR), which simulates the inpainting dynamics of oil painting restoration, converting real images into stylized oil paintings with preserved brushstroke textures to construct a large-scale paired training dataset. Finally, during inference, we integrate features using the AdaIN operator to ensure stylistic consistency. Extensive experiments demonstrate that our interactive system enables fine-grained editing while preserving the artistic qualities of oil paintings, achieving an unprecedented level of imagination realization in stylized oil painting generation and editing.
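The AdaIN operator named in the inference step is the standard adaptive instance normalization: content features are normalized and then rescaled to the mean and standard deviation of style features. The sketch below applies that definition to one flattened feature channel with made-up values. Where PaintFlow applies the operator and which features it treats as content and style are not stated in the abstract, so none of that is modelled.

```python
import math

def mean_std(values, eps=1e-5):
    """Mean and (eps-stabilized) standard deviation of one feature channel."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var + eps)

def adain(content, style):
    """AdaIN(x, y) = std(y) * (x - mean(x)) / std(x) + mean(y), per channel."""
    c_mean, c_std = mean_std(content)
    s_mean, s_std = mean_std(style)
    return [s_std * (c - c_mean) / c_std + s_mean for c in content]

# Hypothetical single-channel activations, flattened over spatial positions.
content_channel = [1.0, 2.0, 3.0, 4.0]
style_channel = [10.0, 20.0, 30.0, 40.0]
stylized = adain(content_channel, style_channel)
```

The output keeps the relative ordering of the content activations but takes on the style channel's mean and spread, which is why the operator is commonly used to enforce a consistent style.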
[26] InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models
Hongyuan Tao, Bencheng Liao, Shaoyu Chen, Haoran Yin, Qian Zhang, Wenyu Liu, Xinggang Wang
🧩 TL;DR
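This paper proposes InfiniteVL, a linear-complexity vision-language model architecture that combines sliding window attention with Gated DeltaNet to overcome the limitations of existing window-attention and linear-attention methods, processing long sequences efficiently with constant latency and memory footprint.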
📘 Detailed Summary
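Motivation: Window attention and linear attention are the two main strategies for mitigating quadratic complexity and the ever-growing KV cache in vision-language models. However, window-based models degrade when the sequence length exceeds the window size, and linear attention underperforms on information-intensive tasks such as OCR and document understanding, so a new architecture is needed to overcome these limitations.
Method: The paper proposes the InfiniteVL architecture, which combines sliding window attention with Gated DeltaNet to achieve linear complexity. It also designs a three-stage training strategy of distillation pretraining, instruction tuning, and long-sequence supervised fine-tuning, training the model with less than 2% of the data required by leading VLMs.
Result: InfiniteVL substantially outperforms previous linear-complexity VLMs and matches the performance of leading Transformer-based VLMs while retaining long-term memory effectively. Compared with similar-sized Transformer-based VLMs accelerated by FlashAttention-2, it achieves over 3.6× inference speedup and sustains a stable 24 FPS real-time prefill speed in streaming video understanding.
Conclusion: The study shows that a linear-complexity architecture combining sliding window attention with Gated DeltaNet can reach competitive multimodal performance under constrained resources, offering a viable approach to efficient long-sequence vision-language modeling with constant latency and memory footprint.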
📄 Abstract
Window attention and linear attention represent two principal strategies for mitigating the quadratic complexity and ever-growing KV cache in Vision-Language Models (VLMs). However, we observe that window-based VLMs suffer performance degradation when sequence length exceeds the window size, while linear attention underperforms on information-intensive tasks such as OCR and document understanding. To overcome these limitations, we propose InfiniteVL, a linear-complexity VLM architecture that synergizes sliding window attention (SWA) with Gated DeltaNet. For achieving competitive multimodal performance under constrained resources, we design a three-stage training strategy comprising distillation pretraining, instruction tuning, and long-sequence SFT. Remarkably, using less than 2% of the training data required by leading VLMs, InfiniteVL not only substantially outperforms previous linear-complexity VLMs but also matches the performance of leading Transformer-based VLMs, while demonstrating effective long-term memory retention. Compared to similar-sized Transformer-based VLMs accelerated by FlashAttention-2, InfiniteVL achieves over 3.6× inference speedup while maintaining constant latency and memory footprint. In streaming video understanding scenarios, it sustains a stable 24 FPS real-time prefill speed while preserving long-term memory cache. Code and models are available at https://github.com/hustvl/InfiniteVL.
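The sliding-window half of this design can be pictured as an attention mask in which each query sees only the most recent few keys. The sketch below builds such a mask under the assumption of causal attention; the window size is arbitrary, and the Gated DeltaNet component and the way the two are interleaved in InfiniteVL are not modelled.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query i may attend to key j (causal, last `window` keys)."""
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]

# Toy example: 5 tokens, each attending to itself and one previous token.
mask = sliding_window_mask(seq_len=5, window=2)
keys_per_query = [sum(row) for row in mask]
```

Because every row has at most `window` True entries, the work per token and the number of cached keys stay bounded as the sequence grows. The same bound explains the degradation the abstract mentions: tokens older than the window are simply invisible to this component.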
[27] From Cells to Survival: Hierarchical Analysis of Cell Inter-Relations in Multiplex Microscopy for Lung Cancer Prognosis
Olle Edgren Schüllerqvist, Jens Baumann, Joakim Lindblad, Love Nordling, Artur Mezheyeuski, Patrick Micke, Nataša Sladoje
🧩 TL;DR
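This paper proposes HiGINE, a hierarchical graph-based method that uses tumor microenvironment features in multiplex immunofluorescence images to predict lung cancer patient survival, and integrates cancer stage through multimodal fusion to improve risk stratification.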
📘 Detailed Summary
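Motivation: The potential of the tumor microenvironment as a source of prognostic biomarkers has not been fully exploited. Existing analysis methods struggle to capture complex interactions between different cell types, which limits lung cancer risk stratification, so more effective modeling of cell inter-relations in multiplex immunofluorescence images is needed.
Method: HiGINE uses a hierarchical graph neural network architecture that encodes both local and global inter-relations in cell neighborhoods and incorporates cell type and morphology information. Multimodal fusion aggregates cancer stage with features derived from multiplex immunofluorescence images to strengthen prediction.
Result: Validation on two public datasets shows that HiGINE improves risk stratification and demonstrates robustness and generalizability, with multimodal fusion further boosting performance.
Conclusion: The study demonstrates the effectiveness of hierarchical graph methods for tumor microenvironment analysis. By capturing complex cell interactions and fusing multimodal information, it offers a new approach to risk stratification in precision medicine with potential for clinical translation.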
📄 Abstract
The tumor microenvironment (TME) has emerged as a promising source of prognostic biomarkers. To fully leverage its potential, analysis methods must capture complex interactions between different cell types. We propose HiGINE -- a hierarchical graph-based approach to predict patient survival (short vs. long) from TME characterization in multiplex immunofluorescence (mIF) images and enhance risk stratification in lung cancer. Our model encodes both local and global inter-relations in cell neighborhoods, incorporating information about cell types and morphology. Multimodal fusion, aggregating cancer stage with mIF-derived features, further boosts performance. We validate HiGINE on two public datasets, demonstrating improved risk stratification, robustness, and generalizability.
[28] Siamese-Driven Optimization for Low-Resolution Image Latent Embedding in Image Captioning
Jing Jie Tan, Anissa Mokraoui, Ban-Hoe Kwan, Danny Wee-Kiat Ng, Yan-Chai Hum
🧩 TL;DR
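This paper proposes SOLI, an approach designed for lightweight low-resolution image captioning that optimizes latent embeddings with a Siamese network architecture, reducing computational overhead while maintaining performance.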
📘 Detailed Summary
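Motivation: The main challenge in low-resolution image captioning is that conventional methods typically rely on heavyweight encoders such as large Transformers. These models demand significant computation and memory, which makes retraining difficult in resource-constrained settings and hinders deployment on lightweight devices.
Method: SOLI uses a Siamese network architecture to optimize the latent embeddings of low-resolution images. Its dual-pathway neural network structure is designed specifically for lightweight low-resolution image captioning and improves the efficiency and accuracy of the image-to-text translation process.
Result: The method reduces computational overhead and memory requirements without sacrificing performance, making it well suited to training and deployment in resource-constrained environments.
Conclusion: SOLI shows that efficient low-resolution image captioning is feasible in resource-constrained settings, offering a more practical option for applications such as assisting visually impaired people, improving content management systems, and enhancing human-computer interaction.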
📄 Abstract
Image captioning is essential in many fields, including assisting visually impaired individuals, improving content management systems, and enhancing human-computer interaction. However, a recent challenge in this domain is dealing with low-resolution images (LRIs). While performance can be improved by using larger models like transformers for encoding, these models are typically heavyweight, demanding significant computational resources and memory, leading to challenges in retraining. To address this, the proposed SOLI (Siamese-Driven Optimization for Low-Resolution Image Latent Embedding in Image Captioning) approach presents a solution specifically designed for lightweight, low-resolution image captioning. It employs a Siamese network architecture to optimize latent embeddings, enhancing the efficiency and accuracy of the image-to-text translation process. By focusing on a dual-pathway neural network structure, SOLI minimizes computational overhead without sacrificing performance, making it an ideal choice for training in resource-constrained scenarios.
[29] No Labels, No Problem: Training Visual Reasoners with Multimodal Verifiers
Damiano Marsili, Georgia Gkioxari
🧩 TL;DR
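This paper proposes VALOR, an annotation-free training framework that uses AI-powered verifiers combined with reinforcement learning and automated hard-negative mining to improve visual reasoning and object grounding, surpassing open-source and proprietary models on a range of spatial reasoning tasks.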
📘 Detailed Summary
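Motivation: Existing visual reasoning methods have clear limitations. Language-only chain-of-thought approaches require large-scale (image, query, answer) supervision, while program-synthesis approaches avoid training but suffer from flawed logic and erroneous grounding. An annotation-free training framework that improves both reasoning and grounding is therefore needed.
Method: The method uses a framework of AI-powered verifiers. An LLM verifier refines language reasoning through reinforcement learning, and a VLM verifier strengthens visual grounding through automated hard-negative mining, with no ground truth labels required. The design combines advanced language-only reasoning models, which decompose spatial queries, with vision specialist models improved via performant VLM critics.
Result: Evaluated across diverse spatial reasoning tasks, the method improves visual reasoning performance and surpasses open-source and proprietary models. With the improved visual grounding model, it further outperforms recent text-only visual reasoning methods.
Conclusion: The study shows the potential of AI-powered verifier frameworks to improve visual reasoning and grounding without annotated data, and offers a new way to combine language reasoning models with vision specialist models.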
📄 Abstract
Visual reasoning is challenging, requiring both precise object grounding and understanding complex spatial relationships. Existing methods fall into two camps: language-only chain-of-thought approaches, which demand large-scale (image, query, answer) supervision, and program-synthesis approaches which use pre-trained models and avoid training, but suffer from flawed logic and erroneous grounding. We propose an annotation-free training framework that improves both reasoning and grounding. Our framework uses AI-powered verifiers: an LLM verifier refines LLM reasoning via reinforcement learning, while a VLM verifier strengthens visual grounding through automated hard-negative mining, eliminating the need for ground truth labels. This design combines the strengths of modern AI systems: advanced language-only reasoning models for decomposing spatial queries into simpler subtasks, and strong vision specialist models improved via performant VLM critics. We evaluate our approach across diverse spatial reasoning tasks, and show that our method improves visual reasoning and surpasses open-source and proprietary models, while with our improved visual grounding model we further outperform recent text-only visual reasoning methods. Project webpage: https://glab-caltech.github.io/valor/
[30] OpenMonoGS-SLAM: Monocular Gaussian Splatting SLAM with Open-set Semantics
Jisang Yoo, Gyeongjin Kang, Hyun-kyu Ko, Hyeonwoo Yu, Eunbyung Park
🧩 TL;DR
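This paper proposes OpenMonoGS-SLAM, the first monocular SLAM framework that unifies 3D Gaussian Splatting with open-set semantic understanding. It needs no depth sensor or 3D semantic annotation and achieves accurate camera tracking and semantic mapping in open-world environments through self-supervised learning alone.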
📘 Detailed Summary
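Motivation: Existing approaches to integrating SLAM with semantic understanding usually rely on depth sensors or closed-set semantic models, which limits their scalability and adaptability in open-world environments. A monocular SLAM solution that achieves open-set semantic understanding without depth input or 3D semantic annotation is needed.
Method: The method combines 3D Gaussian Splatting with visual foundation models, using MASt3R for visual geometry estimation and SAM and CLIP for open-vocabulary semantics. It proposes a dedicated memory mechanism for managing high-dimensional semantic features and building Gaussian semantic feature maps, and operates entirely on self-supervised learning objectives.
Result: Experiments show that the method matches or surpasses existing baselines in both closed-set and open-set segmentation tasks without relying on supplementary inputs such as depth maps or semantic annotations, achieving accurate tracking and semantic mapping with a monocular camera alone.
Conclusion: The study demonstrates an effective integration of visual foundation models and 3D Gaussian Splatting in monocular SLAM, providing a scalable solution for intelligent perception and interaction in open-world environments and moving spatial AI toward more general and adaptive systems.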
📄 Abstract
Simultaneous Localization and Mapping (SLAM) is a foundational component in robotics, AR/VR, and autonomous systems. With the rising focus on spatial AI in recent years, combining SLAM with semantic understanding has become increasingly important for enabling intelligent perception and interaction. Recent efforts have explored this integration, but they often rely on depth sensors or closed-set semantic models, limiting their scalability and adaptability in open-world environments. In this work, we present OpenMonoGS-SLAM, the first monocular SLAM framework that unifies 3D Gaussian Splatting (3DGS) with open-set semantic understanding. To achieve our goal, we leverage recent advances in Visual Foundation Models (VFMs), including MASt3R for visual geometry and SAM and CLIP for open-vocabulary semantics. These models provide robust generalization across diverse tasks, enabling accurate monocular camera tracking and mapping, as well as a rich understanding of semantics in open-world environments. Our method operates without any depth input or 3D semantic ground truth, relying solely on self-supervised learning objectives. Furthermore, we propose a memory mechanism specifically designed to manage high-dimensional semantic features, which effectively constructs Gaussian semantic feature maps, leading to strong overall performance. Experimental results demonstrate that our approach achieves performance comparable to or surpassing existing baselines in both closed-set and open-set segmentation tasks, all without relying on supplementary sensors such as depth maps or semantic annotations.
[31] Astra: General Interactive World Model with Autoregressive Denoising
Yixuan Zhu, Jiaqi Feng, Wenzhao Zheng, Yuan Gao, Xin Tao, Pengfei Wan, Jie Zhou, Jiwen Lu
🧩 TL;DR
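This paper presents Astra, an interactive general world model that generates real-world futures for diverse scenarios with precise action interactions. It addresses the limited long-horizon prediction ability of existing world models in general-purpose scenarios and across varied forms of actions.
📘 Detailed Summary
Motivation: Despite progress by diffusion transformers in video generation, general world models that predict long-horizon futures from past observations and actions remain underexplored, with a clear gap for general-purpose scenarios and diverse forms of actions.
Method: Astra uses an autoregressive denoising architecture with temporal causal attention to aggregate past observations and support streaming output. It introduces a noise-augmented history memory to balance responsiveness with temporal coherence, an action-aware adapter that injects action signals directly into the denoising process, and a mixture of action experts that dynamically routes heterogeneous action modalities.
Result: Experiments across multiple datasets show that Astra outperforms existing state-of-the-art world models in fidelity, long-range prediction, and action alignment. It delivers interactive, consistent, and general long-horizon video prediction and supports multiple forms of interaction.
Conclusion: Astra demonstrates the potential of general world models across diverse real-world tasks. Its architecture balances precise action control with long-horizon prediction and provides an effective prediction framework for applications such as autonomous driving and robot grasping.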
📄 Abstract
Recent advances in diffusion transformers have empowered video generation models to generate high-quality video clips from texts or images. However, world models with the ability to predict long-horizon futures from past observations and actions remain underexplored, especially for general-purpose scenarios and various forms of actions. To bridge this gap, we introduce Astra, an interactive general world model that generates real-world futures for diverse scenarios (e.g., autonomous driving, robot grasping) with precise action interactions (e.g., camera motion, robot action). We propose an autoregressive denoising architecture and use temporal causal attention to aggregate past observations and support streaming outputs. We use a noise-augmented history memory to avoid over-reliance on past frames, balancing responsiveness with temporal coherence. For precise action control, we introduce an action-aware adapter that directly injects action signals into the denoising process. We further develop a mixture of action experts that dynamically routes heterogeneous action modalities, enhancing versatility across diverse real-world tasks such as exploration, manipulation, and camera control. Astra achieves interactive, consistent, and general long-term video prediction and supports various forms of interactions. Experiments across multiple datasets demonstrate the improvements of Astra in fidelity, long-range prediction, and action alignment over existing state-of-the-art world models.
[32] Trajectory Densification and Depth from Perspective-based Blur
Tianchen Qiu, Qirun Zhang, Jiajian He, Zhengyue Zhuge, Jiahui Xu, Yueting Chen
🧩 TL;DR
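This paper proposes a method that estimates metric depth by analyzing the blur pattern of a video stream. It exploits the depth dependence of perspective-based blur and uses a joint optical design algorithm for depth estimation and trajectory densification, achieving high-precision depth estimation in handheld shooting scenarios.
📘 Detailed Summary
Motivation: Without a mechanical stabilizer, a camera inevitably undergoes rotational motion during capture, which introduces perspective-based blur, especially under long exposure. From an optical standpoint, this blur is depth-position-dependent: objects at different spatial locations incur different blur levels even under the same imaging settings. This provides a theoretical basis for estimating depth from blur patterns.
Method: The method estimates metric depth by analyzing the blur pattern of a video stream and combines this with dense trajectory reconstruction. It uses an off-the-shelf vision encoder and point tracker to extract video information, estimates the depth map through windowed embedding and multi-window aggregation, and uses a vision-language model to densify the sparse trajectory produced by the optical algorithm. The whole approach relies on a joint optical design algorithm to co-optimize depth estimation and trajectory reconstruction.
Result: Evaluations on multiple depth datasets show strong performance over a large depth range together with good generalization. Relative to the real trajectory in handheld shooting settings, the proposed optical algorithm achieves superior precision and the dense reconstruction remains accurate. The experiments validate depth estimation without a mechanical stabilizer.
Conclusion: The study shows that the depth dependence of perspective-based blur can be used for depth estimation, offering a new approach to depth perception without a stabilizer. The proposed joint optical design algorithm achieves high-precision depth estimation and trajectory reconstruction in handheld shooting and is particularly relevant for dynamic capture conditions.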
📄 Abstract
In the absence of a mechanical stabilizer, the camera undergoes inevitable rotational dynamics during capturing, which induces perspective-based blur, especially under long-exposure scenarios. From an optical standpoint, perspective-based blur is depth-position-dependent: objects residing at distinct spatial locations incur different blur levels even under the same imaging settings. Inspired by this, we propose a novel method that estimates metric depth by examining the blur pattern of a video stream, and a dense trajectory via a joint optical design algorithm. Specifically, we employ an off-the-shelf vision encoder and point tracker to extract video information. Then, we estimate the depth map via windowed embedding and multi-window aggregation, and densify the sparse trajectory from the optical algorithm using a vision-language model. Evaluations on multiple depth datasets demonstrate that our method attains strong performance over a large depth range, while maintaining favorable generalization. Relative to the real trajectory in handheld shooting settings, our optical algorithm achieves superior precision and the dense reconstruction maintains strong accuracy.
[33] Chain-of-Image Generation: Toward Monitorable and Controllable Image Generation
Young Kyung Kim, Oded Schlesinger, Yuzhou Zhao, J. Matias Di Martino, Guillermo Sapiro
🧩 TL;DR
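This paper proposes the Chain-of-Image Generation (CoIG) framework, which reframes image generation as a sequential, semantic process similar to how humans create art. It substantially improves the monitorability of the generation process while keeping compositional robustness competitive with baseline models.
📘 Detailed Summary
Motivation: State-of-the-art image generation models achieve remarkable visual quality, but their internal generative processes remain a "black box." This opacity limits human observation and intervention and hinders guarantees of reliability, safety, and control. Their non-human-like workflows also make them hard for human observers to interpret.
Method: CoIG uses a large language model to decompose a complex prompt into a sequence of simple, step-by-step instructions. The image generation model executes this plan by progressively generating and editing the image, with each step focused on a single semantic entity so that it can be monitored directly. Two new metrics, CoIG Readability and Causal Relevance, formally assess this property.
Result: Experiments show that CoIG substantially improves quantitative monitorability while achieving competitive compositional robustness compared with established baselines. The framework is model-agnostic and can be integrated with any image generation model.
Conclusion: By decomposing a complex generation task into simple subproblems, CoIG mitigates entity collapse, analogous to the procedural reasoning that CoT brings to large language models. It gives image generation monitorability and performance benefits similar to those of CoT and advances research on transparent and controllable generation.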
📄 Abstract
While state-of-the-art image generation models achieve remarkable visual quality, their internal generative processes remain a "black box." This opacity limits human observation and intervention, and poses a barrier to ensuring model reliability, safety, and control. Furthermore, their non-human-like workflows make them difficult for human observers to interpret. To address this, we introduce the Chain-of-Image Generation (CoIG) framework, which reframes image generation as a sequential, semantic process analogous to how humans create art. Similar to the advantages in monitorability and performance that Chain-of-Thought (CoT) brought to large language models (LLMs), CoIG can produce equivalent benefits in text-to-image generation. CoIG utilizes an LLM to decompose a complex prompt into a sequence of simple, step-by-step instructions. The image generation model then executes this plan by progressively generating and editing the image. Each step focuses on a single semantic entity, enabling direct monitoring. We formally assess this property using two novel metrics: CoIG Readability, which evaluates the clarity of each intermediate step via its corresponding output; and Causal Relevance, which quantifies the impact of each procedural step on the final generated image. We further show that our framework mitigates entity collapse by decomposing the complex generation task into simple subproblems, analogous to the procedural reasoning employed by CoT. Our experimental results indicate that CoIG substantially enhances quantitative monitorability while achieving competitive compositional robustness compared to established baseline models. The framework is model-agnostic and can be integrated with any image generation model.
[34] Dual-Branch Center-Surrounding Contrast: Rethinking Contrastive Learning for 3D Point Clouds
Shaofeng Zhang, Xuanqi Chen, Xiangdong Zhang, Sitong Wu, Junchi Yan
🧩 TL;DR
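This paper proposes CSCon, a novel dual-branch center-surrounding contrastive learning framework for self-supervised learning on 3D point clouds. It masks the center and surrounding parts separately to build center-biased and surrounding-biased dual-branch inputs that better capture rich geometric information.
📘 Detailed Summary
Motivation: Existing self-supervised methods for 3D point clouds are mostly generative approaches based on masked autoencoders, which struggle to capture high-level discriminative features and therefore perform poorly on linear probing and other downstream tasks. Contrastive methods show strong discriminative representations and generalization on image data, but contrastive learning on 3D data is scarce, and directly applying 2D contrastive methods to 3D data fails to learn 3D local details effectively.
Method: The paper proposes the dual-branch center-surrounding contrast framework CSCon. It masks the center and surrounding parts of the point cloud separately, building center-biased and surrounding-biased dual-branch input representations that better capture rich geometric information. It also introduces a patch-level contrastive loss that further strengthens both high-level information and local sensitivity.
Result: Under the FULL and ALL protocols, CSCon performs comparably to generative methods. Under the MLP-LINEAR, MLP-3, and ONLY-NEW protocols, it achieves state-of-the-art results, even surpassing cross-modal methods. Under MLP-LINEAR in particular, it outperforms the Point-MAE baseline by 7.9%, 6.7%, and 10.3% on the three variants of ScanObjectNN.
Conclusion: The study shows that a carefully designed dual-branch center-surrounding contrastive framework can address the limitations of generative methods in learning discriminative features for 3D point cloud self-supervised learning. The method performs well across evaluation protocols, opens a direction for 3D contrastive learning, and underlines the importance of capturing local geometric information.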
📄 Abstract
Most existing self-supervised learning (SSL) approaches for 3D point clouds are dominated by generative methods based on Masked Autoencoders (MAE). However, these generative methods have been proven to struggle to capture high-level discriminative features effectively, leading to poor performance on linear probing and other downstream tasks. In contrast, contrastive methods excel in discriminative feature representation and generalization ability on image data. Despite this, contrastive learning (CL) in 3D data remains scarce. Besides, simply applying CL methods designed for 2D data to 3D fails to effectively learn 3D local details. To address these challenges, we propose a novel Dual-Branch Center-Surrounding Contrast (CSCon) framework. Specifically, we apply masking to the center and surrounding parts separately, constructing dual-branch inputs with center-biased and surrounding-biased representations to better capture rich geometric information. Meanwhile, we introduce a patch-level contrastive loss to further enhance both high-level information and local sensitivity. Under the FULL and ALL protocols, CSCon achieves performance comparable to generative methods; under the MLP-LINEAR, MLP-3, and ONLY-NEW protocols, our method attains state-of-the-art results, even surpassing cross-modal approaches. In particular, under the MLP-LINEAR protocol, our method outperforms the baseline (Point-MAE) by 7.9%, 6.7%, and 10.3% on the three variants of ScanObjectNN, respectively. The code will be made publicly available.
[35] SegEarth-OV3: Exploring SAM 3 for Open-Vocabulary Semantic Segmentation in Remote Sensing Images
Kaiyu Li, Shengqi Zhang, Yupeng Deng, Zhi Wang, Deyu Meng, Xiangyong Cao
🧩 TL;DR
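This paper proposes a training-free method for open-vocabulary semantic segmentation of remote sensing images. It adapts SAM 3 to remote sensing scenes with a mask fusion strategy and presence-score filtering, and achieves promising performance on multiple remote sensing datasets.
📘 Detailed Summary
Motivation: Existing CLIP-based training-free open-vocabulary semantic segmentation methods struggle with precise localization and often need complex pipelines that combine several modules, especially in remote sensing scenes with many dense, small targets. The recently proposed SAM 3 unifies segmentation and recognition, which opens new possibilities for open-vocabulary semantic segmentation in remote sensing.
Method: The method first implements a mask fusion strategy that combines the outputs of SAM 3's semantic segmentation head and its Transformer decoder (instance head), exploiting the strengths of both heads for better land-cover recognition. It then uses the presence score from the presence head to filter out categories that do not exist in the scene, reducing false positives caused by large vocabularies and patch-level processing in geospatial scenes.
Result: Experiments on multiple remote sensing datasets show that this simple adaptation achieves promising performance, demonstrating the potential of SAM 3 for open-vocabulary semantic segmentation in remote sensing. The method applies directly to remote sensing scenes without any training.
Conclusion: The study shows that SAM 3 is applicable to open-vocabulary semantic segmentation in remote sensing and that a simple adaptation strategy already yields good performance. It suggests a way to use unified segmentation and recognition frameworks in remote sensing and indicates that dense, small targets can be handled without a complex training pipeline.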
📄 Abstract
Most existing methods for training-free Open-Vocabulary Semantic Segmentation (OVSS) are based on CLIP. While these approaches have made progress, they often face challenges in precise localization or require complex pipelines to combine separate modules, especially in remote sensing scenarios where numerous dense and small targets are present. Recently, Segment Anything Model 3 (SAM 3) was proposed, unifying segmentation and recognition in a promptable framework. In this paper, we present a preliminary exploration of applying SAM 3 to the remote sensing OVSS task without any training. First, we implement a mask fusion strategy that combines the outputs from SAM 3's semantic segmentation head and the Transformer decoder (instance head). This allows us to leverage the strengths of both heads for better land coverage. Second, we utilize the presence score from the presence head to filter out categories that do not exist in the scene, reducing false positives caused by the vast vocabulary sizes and patch-level processing in geospatial scenes. We evaluate our method on extensive remote sensing datasets. Experiments show that this simple adaptation achieves promising performance, demonstrating the potential of SAM 3 for remote sensing OVSS. Our code is released at https://github.com/earth-insights/SegEarth-OV-3.
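The two adaptation steps described above (fusing the two heads, then filtering by presence score) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `fuse_and_filter`, the element-wise maximum as the fusion rule, and the 0.5 threshold are all assumptions made for illustration, and the inputs are assumed to be per-class probability maps already extracted from the model.

```python
import numpy as np

def fuse_and_filter(semantic_probs, instance_probs, presence_scores,
                    presence_threshold=0.5):
    """Hypothetical sketch of mask fusion plus presence filtering.

    semantic_probs, instance_probs: (C, H, W) per-class maps in [0, 1].
    presence_scores: (C,) per-class presence scores in [0, 1].
    Returns an (H, W) label map, or -1 everywhere if no class is kept.
    """
    # Assumed fusion rule: keep the stronger response of the two heads.
    fused = np.maximum(semantic_probs, instance_probs).astype(np.float64)

    # Drop classes whose presence score says they are absent from the scene.
    keep = np.asarray(presence_scores) >= presence_threshold
    if not keep.any():
        return np.full(fused.shape[1:], -1, dtype=int)
    fused = np.where(keep[:, None, None], fused, -np.inf)

    # Per-pixel label is the best remaining class.
    return fused.argmax(axis=0)
```

With a large vocabulary, the filtering step matters because a class that is absent from the scene can still win the per-pixel argmax on a few patches; removing it before the argmax is what suppresses those false positives.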
[36] A Scalable Pipeline Combining Procedural 3D Graphics and Guided Diffusion for Photorealistic Synthetic Training Data Generation in White Button Mushroom Segmentation
Artúr I. Károly, Péter Galambos
🧩 TL;DR
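This paper proposes a new workflow that combines 3D rendering with a constrained diffusion model to automatically generate high-quality, photorealistic synthetic mushroom images. Models trained only on this synthetic data achieve state-of-the-art performance on real-world mushroom segmentation.
📘 Detailed Summary
Motivation: Industrial mushroom cultivation increasingly relies on computer vision for monitoring and automated harvesting, but accurate detection and segmentation models require large, precisely annotated datasets that are costly to produce. Synthetic data is a scalable alternative, yet it often lacks the realism needed to generalize to real-world scenes.
Method: The paper proposes a workflow that combines 3D rendering in Blender with a constrained diffusion model to automatically generate high-quality annotated, photorealistic synthetic images of Agaricus bisporus mushrooms. The approach keeps full control over 3D scene configuration and annotations while achieving photorealism without specialized computer graphics expertise.
Result: The authors release two synthetic datasets (each with 6,000 images depicting more than 250k mushroom instances) and evaluate Mask R-CNN models trained on them in a zero-shot setting. Tested on two independent real-world datasets, including a newly collected benchmark, the method achieves state-of-the-art segmentation performance (F1 = 0.859 on M18K) despite using only synthetic training data.
Conclusion: Although demonstrated on Agaricus bisporus, the pipeline can be readily adapted to other mushroom species or other agricultural domains such as fruit and leaf detection. It offers an efficient, scalable way to generate synthetic data for agricultural computer vision and reduces annotation cost.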
📄 Abstract
Industrial mushroom cultivation increasingly relies on computer vision for monitoring and automated harvesting. However, developing accurate detection and segmentation models requires large, precisely annotated datasets that are costly to produce. Synthetic data provides a scalable alternative, yet often lacks sufficient realism to generalize to real-world scenarios. This paper presents a novel workflow that integrates 3D rendering in Blender with a constrained diffusion model to automatically generate high-quality annotated, photorealistic synthetic images of Agaricus Bisporus mushrooms. This approach preserves full control over 3D scene configuration and annotations while achieving photorealism without the need for specialized computer graphics expertise. We release two synthetic datasets (each containing 6,000 images depicting over 250k mushroom instances) and evaluate Mask R-CNN models trained on them in a zero-shot setting. When tested on two independent real-world datasets (including a newly collected benchmark), our method achieves state-of-the-art segmentation performance (F1 = 0.859 on M18K), despite using only synthetic training data. Although the approach is demonstrated on Agaricus Bisporus mushrooms, the proposed pipeline can be readily adapted to other mushroom species or to other agricultural domains, such as fruit and leaf detection.
[37] Skewness-Guided Pruning of Multimodal Swin Transformers for Federated Skin Lesion Classification on Edge Devices
Kuniko Paxton, Koorosh Aslansefat, Dhavalkumar Thakker, Yiannis Papadopoulos
🧩 TL;DR
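This study proposes a skewness-guided pruning method for compressing multimodal Swin Transformer models, enabling efficient edge deployment in a federated learning setting while preserving skin lesion classification accuracy.
📘 Detailed Summary
Motivation: High-performance computer vision models perform well in medical imaging, but their computational cost and size make them hard to deploy on edge devices. Strict privacy constraints also hinder centralized data management, which motivates federated learning combined with model compression.
Method: The study proposes a skewness-guided pruning method that selectively prunes the multi-head self-attention and multi-layer perceptron layers of a multimodal Swin Transformer based on the statistical skewness of their output distributions, and validates it in a horizontal federated learning environment.
Result: Experiments show an approximately 36% reduction in model size on the compact Swin Transformer with no loss in classification accuracy. Performance is maintained in the federated setting while model complexity drops substantially.
Conclusion: The study demonstrates that skewness-guided pruning makes efficient model compression and privacy-preserving distributed learning feasible, offering a practical route to deploying multimodal medical AI on edge devices while balancing computational efficiency and diagnostic accuracy.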
📄 Abstract
In recent years, high-performance computer vision models have achieved remarkable success in medical imaging, with some skin lesion classification systems even surpassing dermatology specialists in diagnostic accuracy. However, such models are computationally intensive and large in size, making them unsuitable for deployment on edge devices. In addition, strict privacy constraints hinder centralized data management, motivating the adoption of Federated Learning (FL). To address these challenges, this study proposes a skewness-guided pruning method that selectively prunes the Multi-Head Self-Attention and Multi-Layer Perceptron layers of a multimodal Swin Transformer based on the statistical skewness of their output distributions. The proposed method was validated in a horizontal FL environment and shown to maintain performance while substantially reducing model complexity. Experiments on the compact Swin Transformer demonstrate approximately 36% model size reduction with no loss in accuracy. These findings highlight the feasibility of achieving efficient model compression and privacy-preserving distributed learning for multimodal medical AI on edge devices.
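The statistic that drives the pruning decision is the skewness of each layer's output distribution. The sketch below computes the standard population skewness (third central moment divided by the cubed standard deviation) and ranks layers by its magnitude. The helper names and the layer names are hypothetical, and the abstract does not state which end of the ranking is pruned or what threshold is used, so the selection rule is deliberately left out.

```python
import numpy as np

def skewness(values):
    """Population skewness: E[(x - mean)^3] / std^3 over all elements."""
    x = np.asarray(values, dtype=np.float64).ravel()
    centered = x - x.mean()
    std = centered.std()
    if std == 0.0:
        # A constant output has no asymmetry; avoid dividing by zero.
        return 0.0
    return float(np.mean(centered ** 3) / std ** 3)

def rank_layers_by_skewness(layer_outputs):
    """Return layer names sorted by |skewness| of their outputs, ascending.

    layer_outputs: dict mapping a (hypothetical) layer name to an array of
    activations collected from that layer on some calibration data.
    """
    scores = {name: abs(skewness(out)) for name, out in layer_outputs.items()}
    return sorted(scores, key=scores.get)
```

In a real model, the activations would typically be collected with forward hooks on the attention and MLP blocks before ranking them.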
[38] Tri-Bench: Stress-Testing VLM Reliability on Spatial Reasoning under Camera Tilt and Object Interference
Amit Bendkhale
🧩 TL;DR
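This paper presents Tri-Bench, a benchmark for evaluating vision-language models on verifiable geometric reasoning. It finds systematic weaknesses in relative geometric reasoning, under camera pose changes, and in recognizing shape classes, and shows that the models fail to use the frame-of-reference hint given in the prompt.
📘 Detailed Summary
Motivation: Despite their strong capabilities, vision-language models still fall short on verifiable geometric reasoning under realistic scene changes. Their performance on relative geometric reasoning under camera pose changes and object interference has not been systematically evaluated, which limits progress toward trustworthy and controllable agents.
Method: Tri-Bench is a set of planar triangle problems that isolates relative geometric reasoning while varying two deployment-critical factors: camera pose (planar vs. tilted) and scene context through object interference (10 everyday objects). A single fixed prompt explicitly describes a surrounding square border as a frame of reference, so that correct answers can be obtained via homography. Four recent vision-language models are tested on six simple tasks.
Result: The four models reach about 69% average accuracy with respect to 3D ground truth (best about 75%, worst about 64%), while agreement with 2D projections in the image plane is about 72%. All models drop to about 0% accuracy when recognizing minority shape classes (equilateral, isosceles, and right-angled triangles). Camera tilt lowers overall accuracy by about 4.1%, indicating that the models do not use the frame-of-reference hint in the prompt, whereas object interference has no significant effect on accuracy.
Conclusion: Current vision-language models show systematic weaknesses in verifiable geometric reasoning: they rely too heavily on 2D image-plane cues, ignore 3D geometric constraints, and fail to use an explicit frame-of-reference hint. These findings expose limits in geometric understanding and controllability and provide a benchmark and direction for building trustworthy geometric reasoning systems.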
📄 Abstract
Verifiable geometric reasoning is a critical component for trustworthy and controllable agentic AI. Despite impressive capabilities, Vision-Language Models (VLMs) often fail under realistic scene changes. We present Tri-Bench, a compact benchmark of planar triangle problems that isolates relative geometric reasoning while stressing two deployment-critical factors: camera pose (planar vs. tilted) and scene context via object interference (10 everyday objects). To test verifiability and control, we evaluate four recent VLMs using a single, fixed prompt whose guardrail explicitly describes a surrounding square border, enabling correct answers via homography. We evaluate six simple tasks over binary and continuous targets, and observe that the overall accuracy with respect to 3D ground truth is modest, ~69% on average (best ~75%, worst ~64%). The same responses align even more closely with 2D projections in the image plane, where mean accuracy is ~72%. All four VLMs consistently fail, with accuracy falling to ~0%, on recognizing minority shape classes (equilateral, isosceles, right-angled triangles). Additionally, overall VLM accuracy degrades by ~4.1% under camera tilt. This demonstrates that models fail to correctly utilize the explicit frame-of-reference hint provided in the prompt and default to 2D image plane cues. Finally, we find that object interference has no significant effect on VLM accuracy.
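The square border is what makes the tasks verifiable: its four corners give enough correspondences to estimate a homography that undoes the camera tilt, after which triangle measurements can be taken in the plane. The sketch below shows that idea with the standard direct linear transform (DLT) in NumPy. It is an illustration under assumed inputs (corner coordinates already detected in the image); the function names are hypothetical and it is not the benchmark's own code.

```python
import numpy as np

def estimate_homography(src, dst):
    """Estimate H (3x3) such that dst ~ H @ src from 4 point pairs (DLT)."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        # Each correspondence contributes two linear constraints on H.
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    a = np.asarray(rows, dtype=np.float64)
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(a)
    h = vt[-1].reshape(3, 3)
    # Normalise scale; assumes h[2, 2] is non-zero (non-degenerate setup).
    return h / h[2, 2]

def apply_homography(h, pts):
    """Map (N, 2) points through H using homogeneous coordinates."""
    pts = np.asarray(pts, dtype=np.float64)
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = homog @ h.T
    return mapped[:, :2] / mapped[:, 2:3]

def side_lengths(triangle):
    """Lengths of the three sides of an (3, 2) triangle, in vertex order."""
    t = np.asarray(triangle, dtype=np.float64)
    return np.linalg.norm(t - np.roll(t, -1, axis=0), axis=1)
```

A typical use would estimate `H` from the border corners as seen in the tilted image to the corners of a unit square, map the triangle's image vertices through `H`, and then compare `side_lengths` in the rectified plane rather than in the image.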
[39] SATGround: A Spatially-Aware Approach for Visual Grounding in Remote Sensing
Aysim Toker, Andreea-Maria Oncescu, Roy Miles, Ismail Elezi, Jiankang Deng
🧩 TL;DR
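This study proposes a novel structured localization mechanism that finetunes a pretrained vision-language model and attaches a dedicated grounding module, substantially improving visual grounding accuracy in satellite imagery and achieving state-of-the-art results on several remote sensing benchmarks.
📘 Detailed Summary
Motivation: Vision-language models have become powerful generalist tools for remote sensing, but precise visual grounding in complex satellite scenes remains difficult. Existing methods fall short in combining language understanding with spatial reasoning, and a more effective structured localization mechanism is needed to improve reliability.
Method: The method finetunes a pretrained vision-language model on a diverse set of instruction-following tasks while connecting a dedicated grounding module through specialized control tokens. This enables joint reasoning over language and spatial information and forms the structured localization mechanism.
Result: The method consistently improves the state of the art on several remote sensing benchmarks, including a 24.8% relative improvement over previous methods on visual grounding, and markedly strengthens the model's ability to localize objects precisely in complex satellite scenes.
Conclusion: The study shows the clear benefit of integrating structured spatial reasoning into vision-language models, paving the way for more reliable real-world satellite data analysis and highlighting the value of joint language-spatial reasoning in remote sensing.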
📄 Abstract
Vision-language models (VLMs) are emerging as powerful generalist tools for remote sensing, capable of integrating information across diverse tasks and enabling flexible, instruction-based interactions via a chat interface. In this work, we enhance VLM-based visual grounding in satellite imagery by proposing a novel structured localization mechanism. Our approach involves finetuning a pretrained VLM on a diverse set of instruction-following tasks, while interfacing a dedicated grounding module through specialized control tokens for localization. This method facilitates joint reasoning over both language and spatial information, significantly enhancing the model's ability to precisely localize objects in complex satellite scenes. We evaluate our framework on several remote sensing benchmarks, consistently improving the state-of-the-art, including a 24.8% relative improvement over previous methods on visual grounding. Our results highlight the benefits of integrating structured spatial reasoning into VLMs, paving the way for more reliable real-world satellite data analysis.
[40] UniLayDiff: A Unified Diffusion Transformer for Content-Aware Layout Generation
Zeyang Liu, Le Wang, Sanping Zhou, Yuxuan Wu, Xiaolong Sun, Gang Hua, Haoxiang Li
🧩 TL;DR
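This paper proposes UniLayDiff, a unified diffusion Transformer that is the first to handle various content-aware layout generation tasks with a single end-to-end trainable model. It treats layout constraints as a distinct modality within a multi-modal diffusion Transformer framework.
📘 Detailed Summary
Motivation: Content-aware layout generation is central to graphic design automation, but the variety of real-world applications makes it very hard to build a single model that unifies sub-tasks with different input constraints, such as element types, sizes, or relationships. Existing methods either cover only some of these tasks or need separate model parameters for different conditions, so they do not offer a truly unified solution.
Method: The method treats layout constraints as a distinct modality and uses a multi-modal diffusion Transformer framework to capture the complex interplay between the background image, layout elements, and diverse constraints. Relation constraints are integrated by finetuning the model with LoRA after pretraining on the other tasks, which both achieves unified conditional generation and improves overall layout quality.
Result: Extensive experiments show that UniLayDiff achieves state-of-the-art performance on tasks ranging from unconditional to various conditional generation. To the authors' knowledge, it is the first model to unify the full range of content-aware layout generation tasks.
Conclusion: The study unifies the full range of content-aware layout generation tasks for the first time. Its multi-modal diffusion Transformer framework and LoRA finetuning strategy provide a flexible and efficient solution for graphic design automation and suggest a new way to handle generation tasks with diverse constraints.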
📄 Abstract
Content-aware layout generation is a critical task in graphic design automation, focused on creating visually appealing arrangements of elements that seamlessly blend with a given background image. The variety of real-world applications makes it highly challenging to develop a single model capable of unifying the diverse range of input-constrained generation sub-tasks, such as those conditioned by element types, sizes, or their relationships. Current methods either address only a subset of these tasks or necessitate separate model parameters for different conditions, failing to offer a truly unified solution. In this paper, we propose UniLayDiff, a Unified Diffusion Transformer that, for the first time, addresses various content-aware layout generation tasks with a single, end-to-end trainable model. Specifically, we treat layout constraints as a distinct modality and employ a Multi-Modal Diffusion Transformer framework to capture the complex interplay between the background image, layout elements, and diverse constraints. Moreover, we integrate relation constraints through fine-tuning the model with LoRA after pretraining the model on other tasks. Such a schema not only achieves unified conditional generation but also enhances overall layout quality. Extensive experiments demonstrate that UniLayDiff achieves state-of-the-art performance across tasks ranging from unconditional to various conditional generation and, to the best of our knowledge, is the first model to unify the full range of content-aware layout generation tasks.
[41] LiDAS: Lighting-driven Dynamic Active Sensing for Nighttime Perception
Simon de Moreau, Andrei Bursuc, Hafid El-Idrissi, Fabien Moutarde
🧩 TL;DR
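This paper presents LiDAS (Lighting-driven Dynamic Active Sensing), a closed-loop active illumination system that dynamically predicts an optimal illumination field to maximize downstream perception performance. It enables zero-shot nighttime generalization and substantially improves perception in nighttime driving scenes.
📘 Detailed Summary
Motivation: Nighttime environments pose major challenges for camera-based perception, and existing methods passively rely on scene lighting rather than actively optimizing for perception. The work addresses the drop in nighttime perception performance by using active illumination control to let daytime-trained models generalize zero-shot to the night.
Method: LiDAS combines off-the-shelf visual perception models with high-definition headlights to form a closed-loop active illumination system. It dynamically predicts an optimal illumination field, reallocating light from empty areas to object regions to maximize downstream perception performance. The system is trained on synthetic data and deployed zero-shot in real-world closed-loop driving scenarios.
Result: At equal power, LiDAS improves on standard low-beam headlights by +18.7% mAP50 and +5.0% mIoU. It maintains perception performance while reducing energy use by 40%. LiDAS is complementary to domain-generalization methods and further strengthens robustness without retraining.
Conclusion: By turning existing headlights into active vision actuators, LiDAS offers a cost-effective solution for robust nighttime perception. It shows the potential of active illumination control for improving perception and for reliable operation of autonomous driving and robot vision systems in low light.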
📄 Abstract
Nighttime environments pose significant challenges for camera-based perception, as existing methods passively rely on the scene lighting. We introduce Lighting-driven Dynamic Active Sensing (LiDAS), a closed-loop active illumination system that combines off-the-shelf visual perception models with high-definition headlights. Rather than uniformly brightening the scene, LiDAS dynamically predicts an optimal illumination field that maximizes downstream perception performance, i.e., decreasing light on empty areas to reallocate it to object regions. LiDAS enables zero-shot nighttime generalization of daytime-trained models through adaptive illumination control. Trained on synthetic data and deployed zero-shot in real-world closed-loop driving scenarios, LiDAS achieves +18.7% mAP50 and +5.0% mIoU over standard low-beam at equal power. It maintains performance while reducing energy use by 40%. LiDAS complements domain-generalization methods, further strengthening robustness without retraining. By turning readily available headlights into active vision actuators, LiDAS offers a cost-effective solution for robust nighttime perception.
[42] Unified Diffusion Transformer for High-fidelity Text-Aware Image Restoration
Jin Hyeon Kim, Paul Hyunbin Cho, Claire Kim, Jaewon Min, Jaeeun Lee, Jihye Park, Yeji Choi, Seungryong Kim
🧩 TL;DR
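This paper proposes UniT, a unified text-aware image restoration framework that iteratively integrates a diffusion Transformer, a vision-language model, and a text spotting module. It restores degraded text content, substantially reduces text hallucinations, and achieves state-of-the-art end-to-end F1-score performance on the TAIR task.
📘 Detailed Summary
Motivation: Text-aware image restoration aims to recover high-quality images from low-quality inputs that contain degraded text. Diffusion models provide strong generative priors for general image restoration, but in text-centric tasks they often hallucinate text because they lack explicit linguistic knowledge, which limits their practical use for text recovery.
Method: UniT integrates three components iteratively. A diffusion Transformer serves as the backbone and provides strong representational power. A vision-language model extracts text from the degraded image to provide explicit textual guidance. A text spotting module, trained on diffusion features, produces intermediate OCR predictions at each denoising step so that the VLM can refine its guidance iteratively during denoising. Together they recover fine-grained text content and suppress hallucinations.
Result: Experiments on the SA-Text and Real-Text benchmarks show that UniT faithfully reconstructs degraded text, substantially reduces text hallucinations, and achieves state-of-the-art end-to-end F1-score performance in text-aware image restoration.
Conclusion: The study shows that combining explicit linguistic knowledge with the generative capacity of diffusion models is effective. Its iterative multi-module collaboration addresses hallucination in text restoration and offers a new technical route for text-aware image restoration. Future work could explore more efficient cross-modal fusion strategies and a wider range of text-related vision tasks.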
📄 Abstract
Text-Aware Image Restoration (TAIR) aims to recover high-quality images from low-quality inputs containing degraded textual content. While diffusion models provide strong generative priors for general image restoration, they often produce text hallucinations in text-centric tasks due to the absence of explicit linguistic knowledge. To address this, we propose UniT, a unified text restoration framework that integrates a Diffusion Transformer (DiT), a Vision-Language Model (VLM), and a Text Spotting Module (TSM) in an iterative fashion for high-fidelity text restoration. In UniT, the VLM extracts textual content from degraded images to provide explicit textual guidance. Simultaneously, the TSM, trained on diffusion features, generates intermediate OCR predictions at each denoising step, enabling the VLM to iteratively refine its guidance during the denoising process. Finally, the DiT backbone, leveraging its strong representational power, exploits these cues to recover fine-grained textual content while effectively suppressing text hallucinations. Experiments on the SA-Text and Real-Text benchmarks demonstrate that UniT faithfully reconstructs degraded text, substantially reduces hallucinations, and achieves state-of-the-art end-to-end F1-score performance in the TAIR task.
cs.CL [Back]
[43] Segment, Embed, and Align: A Universal Recipe for Aligning Subtitles to Signing
Zifan Jiang, Youngjoon Jang, Liliane Momeni, Gül Varol, Sarah Ebling, Andrew Zisserman
🧩 TL;DR
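This paper proposes SEA, a universal method for aligning subtitles to sign language video. Pretrained models segment the video into individual signs and embed them into a shared latent space, and a lightweight dynamic programming step performs the alignment efficiently. The method achieves state-of-the-art performance on multiple datasets.
📘 Detailed Summary
Motivation: Existing methods for aligning subtitles to continuous sign language video usually rely on end-to-end training tied to a specific language or dataset, which limits their generality. This work aims to build a universal alignment framework that works across multiple languages and domains.
Method: SEA follows a three-stage Segment, Embed, Align framework. A first pretrained model segments the video frame sequence into individual signs. A second pretrained model embeds the video clip of each sign into a latent space shared with text. A lightweight dynamic programming procedure then performs the alignment; it runs efficiently on CPUs and finishes within a minute even for hour-long videos.
Result: Experiments on four sign language datasets show state-of-the-art alignment performance. The method is flexible and adapts to a wide range of scenarios, using resources from small lexicons to large continuous corpora, which supports its generality across languages and domains.
Conclusion: SEA gives the sign language processing field a universal way to generate high-quality parallel data. Its code and models are openly available, and it shows the effectiveness of combining pretrained models with a lightweight alignment algorithm.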
📄 Abstract
The goal of this work is to develop a universal approach for aligning subtitles (i.e., spoken language text with corresponding timestamps) to continuous sign language videos. Prior approaches typically rely on end-to-end training tied to a specific language or dataset, which limits their generality. In contrast, our method Segment, Embed, and Align (SEA) provides a single framework that works across multiple languages and domains. SEA leverages two pretrained models: the first to segment a video frame sequence into individual signs and the second to embed the video clip of each sign into a shared latent space with text. Alignment is subsequently performed with a lightweight dynamic programming procedure that runs efficiently on CPUs within a minute, even for hour-long episodes. SEA is flexible and can adapt to a wide range of scenarios, utilizing resources from small lexicons to large continuous corpora. Experiments on four sign language datasets demonstrate state-of-the-art alignment performance, highlighting the potential of SEA to generate high-quality parallel data for advancing sign language processing. SEA's code and models are openly available.
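The final alignment stage can be pictured as a monotonic dynamic program over a subtitle-by-sign similarity matrix. The abstract does not give SEA's exact formulation, so the sketch below is only one plausible instance: it assumes every sign segment is assigned to exactly one subtitle, subtitle indices never decrease over time, and every subtitle receives at least one sign. The function name and these constraints are assumptions for illustration.

```python
import numpy as np

def monotonic_align(similarity):
    """Assign each sign segment to a subtitle with a monotonic DP.

    similarity: (num_subtitles, num_signs) matrix, e.g. cosine similarity
    between text and sign embeddings in the shared space.
    Returns a list of length num_signs with a subtitle index per sign.
    """
    sim = np.asarray(similarity, dtype=np.float64)
    n_sub, n_sign = sim.shape
    if n_sign < n_sub:
        raise ValueError("this sketch needs at least one sign per subtitle")

    dp = np.full((n_sub, n_sign), -np.inf)   # best score ending at (i, j)
    back = np.zeros((n_sub, n_sign), dtype=int)
    dp[0, 0] = sim[0, 0]
    for j in range(1, n_sign):
        for i in range(min(j + 1, n_sub)):
            stay = dp[i, j - 1]                              # same subtitle
            advance = dp[i - 1, j - 1] if i > 0 else -np.inf  # next subtitle
            if advance > stay:
                dp[i, j] = advance + sim[i, j]
                back[i, j] = i - 1
            else:
                dp[i, j] = stay + sim[i, j]
                back[i, j] = i

    # Backtrack from the last subtitle at the last sign.
    assignment = [0] * n_sign
    i = n_sub - 1
    for j in range(n_sign - 1, -1, -1):
        assignment[j] = int(i)
        i = back[i, j]
    return assignment
```

The table has `num_subtitles * num_signs` cells and each is filled in constant time, which is consistent with the abstract's point that alignment stays cheap on a CPU even for long videos. Subtitle start and end times could then be read off the first and last sign assigned to each subtitle.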
[44] Universal Adversarial Suffixes Using Calibrated Gumbel-Softmax Relaxation
Sampriti Soor, Suklav Ghosh, Arijit Sur
🧩 TL;DR
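This study proposes a universal adversarial suffix attack that learns short token sequences through a differentiable Gumbel-Softmax relaxation. The suffixes broadly reduce classification accuracy across tasks and models and transfer well between models.
📘 Detailed Summary
Motivation: Language models used as zero-shot or few-shot classifiers remain fragile to adversarial prompts. Prior work usually optimizes task-specific or model-specific triggers, which makes results hard to compare and limits transferability. This motivates the study of universal adversarial suffixes that reduce accuracy broadly across tasks and models.
Method: The method learns a universal adversarial suffix of 4-10 tokens. It optimizes the suffix in a differentiable form using Gumbel-Softmax relaxation and then discretizes it for inference. Training maximizes calibrated cross-entropy on the label region while masking gold tokens to prevent leakage, with entropy regularization to avoid collapse.
Result: Experiments cover sentiment analysis, natural language inference, paraphrase detection, commonsense QA, and physical reasoning with Qwen2-1.5B, Phi-1.5, and TinyLlama-1.1B. A single suffix lowers both accuracy and calibrated confidence across models, showing good transfer across tasks and models.
Conclusion: The study demonstrates that universal adversarial suffixes exist and are effective, exposing a systematic vulnerability of language models to adversarial attacks. It provides a new way to assess model robustness and underlines the need for stronger defenses.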
📄 Abstract
Language models (LMs) are often used as zero-shot or few-shot classifiers by scoring label words, but they remain fragile to adversarial prompts. Prior work typically optimizes task- or model-specific triggers, making results difficult to compare and limiting transferability. We study universal adversarial suffixes: short token sequences (4-10 tokens) that, when appended to any input, broadly reduce accuracy across tasks and models. Our approach learns the suffix in a differentiable "soft" form using Gumbel-Softmax relaxation and then discretizes it for inference. Training maximizes calibrated cross-entropy on the label region while masking gold tokens to prevent trivial leakage, with entropy regularization to avoid collapse. A single suffix trained on one model transfers effectively to others, consistently lowering both accuracy and calibrated confidence. Experiments on sentiment analysis, natural language inference, paraphrase detection, commonsense QA, and physical reasoning with Qwen2-1.5B, Phi-1.5, and TinyLlama-1.1B demonstrate consistent attack effectiveness and transfer across tasks and model families.
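The relaxation at the core of this approach replaces each discrete suffix token with a soft distribution over the vocabulary, so that the suffix can be optimized by gradient descent and later snapped back to real tokens. The NumPy sketch below shows only the forward computation of Gumbel-Softmax sampling and the final discretization; the function names, shapes, and temperature are illustrative assumptions, and an actual attack would run this inside an autodiff framework with the calibrated loss and regularizer described in the abstract.

```python
import numpy as np

def gumbel_softmax(logits, temperature, rng):
    """Sample a relaxed one-hot vector per suffix position.

    logits: (suffix_len, vocab_size) learnable scores for each position.
    Returns an array of the same shape whose rows are probability vectors.
    """
    logits = np.asarray(logits, dtype=np.float64)
    # Gumbel(0, 1) noise via inverse transform; keep u away from 0.
    u = rng.uniform(low=1e-12, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))
    y = (logits + gumbel) / temperature
    # Numerically stable softmax over the vocabulary axis.
    y = y - y.max(axis=-1, keepdims=True)
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)

def soft_suffix_embeddings(soft_tokens, embedding_matrix):
    """Mix token embeddings with the relaxed distribution (soft lookup)."""
    return soft_tokens @ embedding_matrix

def discretize(logits):
    """Pick one concrete token id per position for use at inference time."""
    return np.argmax(np.asarray(logits), axis=-1)
```

Lower temperatures push each row toward a one-hot vector, which narrows the gap between the soft suffix used in training and the discrete suffix used at inference, at the cost of noisier gradients.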
cs.AI [Back]
[45] Beyond Traditional Diagnostics: Transforming Patient-Side Information into Predictive Insights with Knowledge Graphs and Prototypes
Yibowen Zhao, Yinan Zhang, Zhixiang Su, Lizhen Cui, Chunyan Miao
🧩 TL;DR
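This paper proposes the KPI framework, which integrates a medical knowledge graph, builds disease prototypes, and applies contrastive learning to address class imbalance and limited interpretability in disease prediction from patient-side information. It also uses large language models to generate patient-specific explanations.
📘 Detailed Summary
Motivation: Disease prediction from patient demographics and self-reported symptoms faces two key challenges: imbalanced disease distributions bias the predictions, and existing methods lack interpretability, making predictions unreliable. Both limit practical value in patient-centered healthcare.
Method: KPI systematically integrates structured, trusted medical knowledge into a unified disease knowledge graph, constructs clinically meaningful disease prototypes, and uses contrastive learning to improve predictive accuracy, particularly for long-tailed diseases. It also uses large language models to generate patient-specific, medically relevant explanations that improve interpretability.
Result: Extensive experiments on real-world datasets show that KPI outperforms state-of-the-art methods in predictive accuracy and provides clinically valid explanations that align closely with patient narratives, highlighting its practical value for patient-centered healthcare delivery.
Conclusion: The study shows that combining structured medical knowledge, disease prototypes, and contrastive learning effectively addresses imbalance in disease prediction, and that adding LLM-generated explanations improves reliability and clinical usefulness, offering a valuable solution for patient-centered healthcare systems.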
📄 Abstract
Predicting diseases solely from patient-side information, such as demographics and self-reported symptoms, has attracted significant research attention due to its potential to enhance patient awareness, facilitate early healthcare engagement, and improve healthcare system efficiency. However, existing approaches encounter critical challenges, including imbalanced disease distributions and a lack of interpretability, resulting in biased or unreliable predictions. To address these issues, we propose the Knowledge graph-enhanced, Prototype-aware, and Interpretable (KPI) framework. KPI systematically integrates structured and trusted medical knowledge into a unified disease knowledge graph, constructs clinically meaningful disease prototypes, and employs contrastive learning to enhance predictive accuracy, which is particularly important for long-tailed diseases. Additionally, KPI utilizes large language models (LLMs) to generate patient-specific, medically relevant explanations, thereby improving interpretability and reliability. Extensive experiments on real-world datasets demonstrate that KPI outperforms state-of-the-art methods in predictive accuracy and provides clinically valid explanations that closely align with patient narratives, highlighting its practical value for patient-centered healthcare delivery.
[46] See-Control: A Multimodal Agent Framework for Smartphone Interaction with a Robotic Arm
Haoyu Zhao, Weizhong Ding, Yuhao Yang, Zheng Tian, Linyi Yang, Kun Shao, Jun Wang
🧩 TL;DR
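This paper presents See-Control, a framework that operates smartphones physically through a low-DoF robotic arm. It removes the dependence of existing methods on the Android Debug Bridge and offers a new approach to cross-platform smartphone operation.
📘 Detailed Summary
Motivation: Existing smartphone operation methods based on multimodal large language models depend on the Android Debug Bridge for data transmission and action execution. This restricts them to Android devices and prevents general cross-platform operation, so a physical interaction solution that does not depend on a specific system back end is needed.
Method: See-Control comprises three core components: an ESO benchmark with 155 tasks and corresponding evaluation metrics; an MLLM-based embodied agent that generates robotic control commands without ADB or system back-end access; and a richly annotated dataset of operation episodes that serves as a resource for future research.
Result: The study builds the 155-task ESO benchmark, develops an MLLM-based embodied agent that directly generates robotic control commands, and creates a richly annotated dataset of operation episodes, enabling cross-platform physical smartphone operation with a low-DoF robotic arm.
Conclusion: See-Control achieves cross-platform smartphone operation through physical interaction, bridging the gap between digital agents and the physical world. It provides a concrete path for home robots to perform smartphone-dependent tasks and advances the use of embodied intelligence in realistic environments.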
📄 Abstract
Recent advances in Multimodal Large Language Models (MLLMs) have enabled their use as intelligent agents for smartphone operation. However, existing methods depend on the Android Debug Bridge (ADB) for data transmission and action execution, limiting their applicability to Android devices. In this work, we introduce the novel Embodied Smartphone Operation (ESO) task and present See-Control, a framework that enables smartphone operation via direct physical interaction with a low-DoF robotic arm, offering a platform-agnostic solution. See-Control comprises three key components: (1) an ESO benchmark with 155 tasks and corresponding evaluation metrics; (2) an MLLM-based embodied agent that generates robotic control commands without requiring ADB or system back-end access; and (3) a richly annotated dataset of operation episodes, offering valuable resources for future research. By bridging the gap between digital agents and the physical world, See-Control provides a concrete step toward enabling home robots to perform smartphone-dependent tasks in realistic environments.
[47] Multi-Agent Intelligence for Multidisciplinary Decision-Making in Gastrointestinal Oncology
Rongzhao Zhang, Junqiao Wang, Shuyun Yang, Mouxiao Bian, Chao Ding, Yuwei Bai, Chihao Zhang, Yuguang Shen, Lei Wang, Lei Zheng, Qiujuan Yan, Yun Zhong, Meiling Liu, Jiwei Yu, Zheng Wang, Jie Xu, Meng Luo
🧩 TL;DR
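This study proposes a hierarchical multi-agent framework that emulates the collaborative workflow of a human multidisciplinary team to address context dilution and hallucination in multimodal clinical reasoning, achieving a substantial performance improvement in gastrointestinal oncology.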
📘 Detailed Summary
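Motivation: Multimodal clinical reasoning in gastrointestinal oncology requires the integrated interpretation of endoscopic images, radiological data, and biochemical markers. Although multimodal large language models show promise, they often suffer from context dilution and hallucination when handling complex, heterogeneous medical histories.
Method: The study proposes a hierarchical multi-agent framework that emulates the collaborative workflow of a human multidisciplinary team. The agents cooperate to process multimodal medical data, with the aim of overcoming the limitations of a traditional single model in complex medical reasoning.
Result: The system attained a composite expert evaluation score of 4.60/5.00, a substantial improvement over the monolithic baseline. The agent-based architecture yielded its largest gains in reasoning logic and medical accuracy, supporting its effectiveness for clinical decision support.
Conclusion: The findings indicate that a multi-agent architecture emulating human collaboration provides a scalable, interpretable, and clinically robust paradigm for automated decision support in oncology, and that agent-based collaboration can handle complex multimodal medical reasoning tasks effectively.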
📄 Abstract
Multimodal clinical reasoning in the field of gastrointestinal (GI) oncology necessitates the integrated interpretation of endoscopic imagery, radiological data, and biochemical markers. Despite the evident potential exhibited by Multimodal Large Language Models (MLLMs), they frequently encounter challenges such as context dilution and hallucination when confronted with intricate, heterogeneous medical histories. In order to address these limitations, a hierarchical Multi-Agent Framework is proposed, which emulates the collaborative workflow of a human Multidisciplinary Team (MDT). The system attained a composite expert evaluation score of 4.60/5.00, thereby demonstrating a substantial improvement over the monolithic baseline. It is noteworthy that the agent-based architecture yielded the most substantial enhancements in reasoning logic and medical accuracy. The findings indicate that mimetic, agent-based collaboration provides a scalable, interpretable, and clinically robust paradigm for automated decision support in oncology.
[48] A Practical Guide for Designing, Developing, and Deploying Production-Grade Agentic AI Workflows
Eranga Bandara, Ross Gore, Peter Foytik, Sachin Shetty, Ravi Mukkamala, Abdul Rahman, Xueping Liang, Safdar H. Bouk, Amin Hass, Sachini Rajapakse, Ng Wee Keong, Kasun De Zoysa, Aruna Withanage, Nilaan Loganathan
🧩 TL;DR
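This paper provides a practical, end-to-end guide for designing, developing, and deploying production-grade agentic AI systems. By introducing a structured engineering lifecycle and nine core best practices, it offers a foundational reference for building agentic AI workflows that are reliable, observable, maintainable, and aligned with safety and governance requirements.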
📘 Detailed Summary
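Motivation: As adoption of agentic AI accelerates across industry and research, organizations face a central challenge: how to design, engineer, and operate production-grade agentic AI workflows that are reliable, observable, maintainable, and aligned with safety and governance requirements. Systematic engineering guidance for this problem is currently lacking.
Method: The paper proposes a structured engineering lifecycle covering workflow decomposition, multi-agent design patterns, Model Context Protocol (MCP) and tool integration, deterministic orchestration, Responsible-AI considerations, and environment-aware deployment strategies. It also presents nine core best practices, including tool-first design over MCP, pure-function invocation, single-tool and single-responsibility agents, externalized prompt management, Responsible-AI-aligned model-consortium design, clean separation between workflow logic and MCP servers, containerized deployment, and adherence to the Keep it Simple, Stupid (KISS) principle.
Result: A comprehensive case study, a multimodal news-analysis and media-generation workflow, demonstrates the proposed principles in practice. It supports the feasibility and effectiveness of the structured engineering lifecycle and best practices for building production-grade agentic AI systems and serves as a concrete implementation reference for real deployments.
Conclusion: The paper offers a foundational reference for building robust, extensible, and production-ready agentic AI workflows. By combining architectural guidance, operational patterns, and practical implementation insights, it gives engineering teams a systematic methodology that can help accelerate reliable deployment and operation of agentic AI at scale.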
📄 Abstract
Agentic AI marks a major shift in how autonomous systems reason, plan, and execute multi-step tasks. Unlike traditional single-model prompting, agentic workflows integrate multiple specialized agents with different Large Language Models (LLMs), tool-augmented capabilities, orchestration logic, and external system interactions to form dynamic pipelines capable of autonomous decision-making and action. As adoption accelerates across industry and research, organizations face a central challenge: how to design, engineer, and operate production-grade agentic AI workflows that are reliable, observable, maintainable, and aligned with safety and governance requirements. This paper provides a practical, end-to-end guide for designing, developing, and deploying production-quality agentic AI systems. We introduce a structured engineering lifecycle encompassing workflow decomposition, multi-agent design patterns, Model Context Protocol (MCP) and tool integration, deterministic orchestration, Responsible-AI considerations, and environment-aware deployment strategies. We then present nine core best practices for engineering production-grade agentic AI workflows, including tool-first design over MCP, pure-function invocation, single-tool and single-responsibility agents, externalized prompt management, Responsible-AI-aligned model-consortium design, clean separation between workflow logic and MCP servers, containerized deployment for scalable operations, and adherence to the Keep it Simple, Stupid (KISS) principle to maintain simplicity and robustness. To demonstrate these principles in practice, we present a comprehensive case study: a multimodal news-analysis and media-generation workflow. By combining architectural guidance, operational patterns, and practical implementation insights, this paper offers a foundational reference to build robust, extensible, and production-ready agentic AI workflows.
[49] CARLoS: Retrieval via Concise Assessment Representation of LoRAs at Scale
Shahar Sarfaty, Adi Haviv, Uri Hacohen, Niva Elkin-Koren, Roi Livni, Amit H. Bermano
🧩 TL;DR
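This paper proposes CARLoS, a framework for characterizing LoRA components at scale without additional metadata. By analyzing the behavior of more than 650 LoRAs in image generation, it builds a three-part representation of direction, strength, and consistency and develops an efficient retrieval system that outperforms textual baselines in automated and human evaluations.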
📘 Detailed Summary
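Motivation: The rapid proliferation of generative components such as LoRAs has created a vast but unstructured ecosystem. Existing discovery methods depend on unreliable user descriptions or biased popularity metrics, which hinders usability, so a method that characterizes LoRAs without additional metadata is needed.
Method: The method analyzes more than 650 LoRAs, assessing their behavior through image generation over a variety of prompts and seeds. Using CLIP embeddings and their difference from base-model generations, it defines a three-part representation: Directions (defining the semantic shift), Strength (quantifying the significance of the effect), and Consistency (quantifying how stable the effect is). An efficient retrieval framework is built on this representation.
Result: CARLoS semantically matches textual queries to relevant LoRAs while filtering out overly strong or unstable ones, and it outperforms textual baselines in both automated and human evaluations. The representation also supports analyses linking Strength and Consistency to the legal notions of substantiality and volition in copyright law.
Conclusion: CARLoS provides a practical LoRA retrieval system, and its representation has broader relevance for LoRA analysis: it connects technical characteristics to legal considerations and offers an effective tool for the systematic management and analysis of generative components.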
📄 Abstract
The rapid proliferation of generative components, such as LoRAs, has created a vast but unstructured ecosystem. Existing discovery methods depend on unreliable user descriptions or biased popularity metrics, hindering usability. We present CARLoS, a large-scale framework for characterizing LoRAs without requiring additional metadata. Analyzing over 650 LoRAs, we employ them in image generation over a variety of prompts and seeds, as a credible way to assess their behavior. Using CLIP embeddings and their difference to a base-model generation, we concisely define a three-part representation: Directions, defining semantic shift; Strength, quantifying the significance of the effect; and Consistency, quantifying how stable the effect is. Using these representations, we develop an efficient retrieval framework that semantically matches textual queries to relevant LoRAs while filtering overly strong or unstable ones, outperforming textual baselines in automated and human evaluations. While retrieval is our primary focus, the same representation also supports analyses linking Strength and Consistency to legal notions of substantiality and volition, key considerations in copyright, positioning CARLoS as a practical system with broader relevance for LoRA analysis.
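The three-part representation described above can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so the definitions below are assumptions: a single direction taken as the normalized mean embedding difference (the paper speaks of Directions in the plural), strength as the mean norm of the differences, and consistency as the mean cosine similarity between each difference and that direction. The function name `characterize_lora` and its inputs are hypothetical, and the embeddings are assumed to be precomputed CLIP image embeddings for matching prompts and seeds.

```python
import numpy as np


def _unit(v, eps=1e-12):
    """Normalize vectors along the last axis, guarding against zero norm."""
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)


def characterize_lora(base_emb, lora_emb):
    """Summarize a LoRA's effect from paired image embeddings.

    base_emb, lora_emb: arrays of shape (n, d). Row i holds the embedding
    of the image generated for the same prompt and seed, without and with
    the LoRA. All three definitions below are illustrative assumptions,
    not the paper's formulas.
    """
    diffs = lora_emb - base_emb                      # per-sample shift, (n, d)
    direction = _unit(diffs.mean(axis=0))            # average semantic shift
    strength = float(np.linalg.norm(diffs, axis=1).mean())  # size of shift
    # Mean cosine similarity between each sample's shift and the direction.
    consistency = float((_unit(diffs) @ direction).mean())
    return direction, strength, consistency
```

Under these assumptions, a retrieval step could rank LoRAs by the cosine similarity between a text-query embedding and `direction`, then drop candidates whose `strength` is too high or whose `consistency` is too low; the thresholds would have to be chosen empirically.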
[50] Same Content, Different Answers: Cross-Modal Inconsistency in MLLMs
Angela van Sprang, Laurens Samson, Ana Lucic, Erman Acar, Sennay Ghebreab, Yuki M. Asano
🧩 TL;DR
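This paper introduces two new benchmarks, REST and REST+, for the systematic evaluation of cross-modal inconsistency in multimodal large language models, and finds that existing models cannot reason consistently across modalities.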
📘 Detailed Summary
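Motivation: Multimodal large language models are trained to represent vision and language in the same embedding space, yet they cannot perform the same tasks in both modalities. Systematic benchmarks and methods for evaluating this cross-modal inconsistency have been lacking.
Method: The study introduces the REST and REST+ benchmarks, which contain samples carrying the same semantic information in three modalities (image, text, and mixed). They are used to evaluate how consistently models reason across modalities, and 15 state-of-the-art multimodal large language models are evaluated systematically.
Result: The degree of modality inconsistency varies substantially across models, even after accounting for text recognition problems. Visual characteristics such as text colour and resolution affect model performance, whereas font does not; the number of vision tokens also has an effect; and the consistency score correlates with the modality gap between text and images.
Conclusion: The study shows that multimodal large language models exhibit systematic cross-modal inconsistency that is solved neither by rendering text as an image nor by rendering an image as text. The correlation between the consistency score and the modality gap offers a new perspective on the underlying mechanism.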
📄 Abstract
We introduce two new benchmarks, REST and REST+ (Render-Equivalence Stress Tests), to enable systematic evaluation of cross-modal inconsistency in multimodal large language models (MLLMs). MLLMs are trained to represent vision and language in the same embedding space, yet they cannot perform the same tasks in both modalities. Our benchmarks contain samples with the same semantic information in three modalities (image, text, mixed), and we show that state-of-the-art MLLMs cannot consistently reason over these different modalities. We evaluate 15 MLLMs and find that the degree of modality inconsistency varies substantially, even when accounting for problems with text recognition (OCR). Neither rendering text as an image nor rendering an image as text solves the inconsistency. Even if OCR is correct, we find that visual characteristics (text colour and resolution, but not font) and the number of vision tokens have an impact on model performance. Finally, we find that our consistency score correlates with the modality gap between text and images, highlighting a mechanistic interpretation of cross-modally inconsistent MLLMs.
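To make the idea of cross-modal consistency concrete, here is a minimal sketch of one possible score. The abstract does not define the metric, so this is an assumption: the score is the fraction of samples for which a model gives the same answer in every modality, after simple whitespace and case normalization. The function name `consistency_score` and the input layout are hypothetical.

```python
def consistency_score(answers_by_modality):
    """Fraction of samples answered identically in all modalities.

    answers_by_modality: dict mapping a modality name (for example
    "image", "text", "mixed") to a list of answer strings, aligned by
    sample index. This definition is an illustrative assumption, not
    the benchmark's official metric.
    """
    modalities = list(answers_by_modality)
    if not modalities:
        return 0.0
    n = len(answers_by_modality[modalities[0]])
    if any(len(answers_by_modality[m]) != n for m in modalities):
        raise ValueError("all modalities must cover the same samples")
    if n == 0:
        return 0.0
    agree = 0
    for i in range(n):
        # Normalize so that "A" and " a " count as the same answer.
        answers = {answers_by_modality[m][i].strip().lower() for m in modalities}
        if len(answers) == 1:
            agree += 1
    return agree / n


example = {
    "image": ["A", "B", "C", "D"],
    "text": ["A", "B", "C", "A"],
    "mixed": ["a", "B", "D", "A"],
}
print(consistency_score(example))  # 0.5: samples 0 and 1 agree, 2 and 3 do not
```

Note that this score measures agreement only, not correctness: a model that gives the same wrong answer in every modality still counts as consistent on that sample.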