cs.CV [Total: 23]
cs.CL [Total: 7]
cs.AI [Total: 9]

cs.CV

[1] Widget2Code: From Visual Widgets to UI Code via Multimodal LLMs

Houston H. Zhang, Tao Zhang, Baoze Lin, Yuanqi Xue, Yincheng Zhu, Huan Liu, Li Gu, Linfeng Ye, Ziqiang Wang, Xinxin Zuo, Yang Wang, Yuanhao Yu, Zhixiang Chi

🧩 TL;DR

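This work introduces the Widget-to-Code (Widget2Code) task, which generates executable code for compact, context-free widget interfaces, and develops the WidgetFactory infrastructure, comprising a domain-specific language and a compiler, which substantially improves visual fidelity.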

📘 Detailed Summary

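Motivation: Existing UI2Code research focuses mainly on web pages and mobile screens. Widgets are compact, context-free micro-interfaces with dense layouts and heavy use of icons, and they lack accessible markup data, so this area remains underexplored.

Method: The authors propose a baseline that jointly advances perceptual understanding and structured code generation. At the perceptual level, it follows widget design principles to assemble atomic components into complete layouts, supported by icon retrieval and reusable visualization modules. At the system level, the end-to-end WidgetFactory infrastructure includes a framework-agnostic widget domain-specific language (WidgetDSL) and a compiler that targets multiple front-end implementations. An adaptive rendering module further refines spatial dimensions to satisfy compactness constraints.

Result: Benchmarking shows that general-purpose multimodal large language models outperform specialized UI2Code methods but still produce unreliable, visually inconsistent code. The proposed baseline, built on WidgetFactory, substantially improves visual fidelity and establishes a unified evaluation framework and infrastructure for Widget2Code research.

Conclusion: The work formalizes the Widget2Code task and builds the first image-only benchmark dataset. By combining perceptual understanding with structured code generation, the method addresses widget-specific challenges and gives future research a strong baseline and a unified infrastructure for code generation from compact interfaces.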

📄 Abstract

User interface to code (UI2Code) aims to generate executable code that can faithfully reconstruct a given input UI. Prior work focuses largely on web pages and mobile screens, leaving app widgets underexplored. Unlike web or mobile UIs with rich hierarchical context, widgets are compact, context-free micro-interfaces that summarize key information through dense layouts and iconography under strict spatial constraints. Moreover, while (image, code) pairs are widely available for web or mobile UIs, widget designs are proprietary and lack accessible markup. We formalize this setting as the Widget-to-Code (Widget2Code) task and introduce an image-only widget benchmark with fine-grained, multi-dimensional evaluation metrics. Benchmarking shows that although generalized multimodal large language models (MLLMs) outperform specialized UI2Code methods, they still produce unreliable and visually inconsistent code. To address these limitations, we develop a baseline that jointly advances perceptual understanding and structured code generation. At the perceptual level, we follow widget design principles to assemble atomic components into complete layouts, equipped with icon retrieval and reusable visualization modules. At the system level, we design an end-to-end infrastructure, WidgetFactory, which includes a framework-agnostic widget-tailored domain-specific language (WidgetDSL) and a compiler that translates it into multiple front-end implementations (e.g., React, HTML/CSS). An adaptive rendering module further refines spatial dimensions to satisfy compactness constraints. Together, these contributions substantially enhance visual fidelity, establishing a strong baseline and unified infrastructure for future Widget2Code research.
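To make the "DSL plus compiler" idea concrete, here is a minimal sketch in Python. It is not the authors' WidgetDSL: the node schema (`type`, `props`, `children`), the `TAGS` mapping, and the `compile_to_html` function are all hypothetical and exist only to illustrate how a framework-agnostic description might be lowered to one front-end target.

```python
from html import escape

# Hypothetical mapping from toy DSL node types to HTML tags (not the paper's schema).
TAGS = {"stack": "div", "text": "span", "icon": "i"}


def compile_to_html(node: dict) -> str:
    """Recursively compile a toy widget description into an HTML string."""
    tag = TAGS[node["type"]]
    props = node.get("props", {})
    # Render style properties as inline CSS, sorted so the output is deterministic.
    style = ";".join(f"{k}:{v}" for k, v in sorted(props.get("style", {}).items()))
    attrs = f' style="{escape(style)}"' if style else ""
    if node["type"] == "text":
        inner = escape(props.get("value", ""))
    elif node["type"] == "icon":
        # Icons carry only a name; a real system would resolve it via icon retrieval.
        attrs += f' data-icon="{escape(props.get("name", ""))}"'
        inner = ""
    else:
        inner = "".join(compile_to_html(child) for child in node.get("children", []))
    return f"<{tag}{attrs}>{inner}</{tag}>"


# A tiny weather-style widget expressed in the toy DSL.
widget = {
    "type": "stack",
    "props": {"style": {"width": "160px"}},
    "children": [
        {"type": "icon", "props": {"name": "sun"}},
        {"type": "text", "props": {"value": "21 < 25"}},
    ],
}
print(compile_to_html(widget))
```

Because the description is plain data, a second compiler function could target another front end (for example JSX) without changing the widget itself, which is the appeal of a framework-agnostic intermediate form.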

[2] Vehicle-centric Perception via Multimodal Structured Pre-training

Wentao Wu, Xiao Wang, Chenglong Li, Jin Tang, Bin Luo

🧩 TL;DR

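This paper proposes VehicleMAE-V2, a vehicle-centric pre-trained large model that uses vehicle-related structured priors to guide masked token reconstruction, significantly improving generalizable representation learning for vehicle-centric perception tasks.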

📘 Detailed Summary

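Motivation: Existing methods do not effectively learn vehicle-related knowledge during pre-training. This weakens their ability to model general vehicle perception representations and to capture vehicle-specific structural features and semantic information.

Method: The method has three key modules. The Symmetry-guided Mask Module uses vehicle symmetry constraints to select high-quality masked image patches and reduce information redundancy. The Contour-guided Representation Module preserves holistic vehicle structure by minimizing the probability distribution divergence between contour features and reconstructed features. The Semantics-guided Representation Module aligns image-text features through contrastive learning and cross-modal distillation to resolve semantic confusion in masked reconstruction.

Result: The authors build Autobot4M, a large-scale dataset of approximately 4 million vehicle images and 12,693 text descriptions, and run extensive tests on five downstream tasks that demonstrate the superior performance of VehicleMAE-V2.

Conclusion: The work shows that vehicle-specific structured priors (symmetry, contour, and semantics) can effectively guide the pre-training of a masked autoencoder and yield more generalizable representations for vehicle-centric perception. It offers a new pre-training paradigm for visual perception systems in intelligent transportation, autonomous driving, and related fields.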

📄 Abstract

Vehicle-centric perception plays a crucial role in many intelligent systems, including large-scale surveillance systems, intelligent transportation, and autonomous driving. Existing approaches lack effective learning of vehicle-related knowledge during pre-training, resulting in poor capability for modeling general vehicle perception representations. To handle this problem, we propose VehicleMAE-V2, a novel vehicle-centric pre-trained large model. By exploring and exploiting vehicle-related multimodal structured priors to guide the masked token reconstruction process, our approach can significantly enhance the model's capability to learn generalizable representations for vehicle-centric perception. Specifically, we design the Symmetry-guided Mask Module (SMM), Contour-guided Representation Module (CRM) and Semantics-guided Representation Module (SRM) to incorporate three kinds of structured priors into token reconstruction including symmetry, contour and semantics of vehicles respectively. SMM utilizes the vehicle symmetry constraints to avoid retaining symmetric patches and can thus select high-quality masked image patches and reduce information redundancy. CRM minimizes the probability distribution divergence between contour features and reconstructed features and can thus preserve holistic vehicle structure information during pixel-level reconstruction. SRM aligns image-text features through contrastive learning and cross-modal distillation to address the feature confusion caused by insufficient semantic understanding during masked reconstruction. To support the pre-training of VehicleMAE-V2, we construct Autobot4M, a large-scale dataset comprising approximately 4 million vehicle images and 12,693 text descriptions. Extensive experiments on five downstream tasks demonstrate the superior performance of VehicleMAE-V2.

[3] SE360: Semantic Edit in 360$^\circ$ Panoramas via Hierarchical Data Construction

Haoyi Zhong, Fang-Lue Zhang, Andrew Chalmers, Taehyun Rhee

🧩 TL;DR

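This paper presents SE360, a new framework for multi-condition guided object editing in 360° panoramas. An autonomous data generation pipeline and a two-stage data refinement strategy enable flexible object editing guided by text, mask, or reference image.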

📘 Detailed Summary

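Motivation: Extending existing instruction-based image editing methods to 360° panoramas raises additional challenges. These methods often produce implausible results in both equirectangular projections and perspective views, so panorama editing must address both semantic consistency and geometric consistency.

Method: The core of the method is a novel coarse-to-fine autonomous data generation pipeline that uses a vision-language model and adaptive projection adjustment for hierarchical analysis, ensuring holistic segmentation of objects and their physical context. A cost-effective two-stage data refinement strategy improves data realism and mitigates model overfitting to erase artifacts. A Transformer-based diffusion model is then trained on the constructed dataset.

Result: Experiments show that the method outperforms existing methods in both visual quality and semantic accuracy. The generated data pairs are semantically meaningful and geometrically consistent, even when sourced from unlabeled panoramas, which supports plausible editing results.

Conclusion: The work demonstrates the effectiveness of an autonomous data generation pipeline for the challenges of panorama editing. The multi-condition guided editing framework offers a new route to flexible editing of 360° panoramas, and the data refinement strategy provides a practical way to reduce model overfitting.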

📄 Abstract

While instruction-based image editing is emerging, extending it to 360$^\circ$ panoramas introduces additional challenges. Existing methods often produce implausible results in both equirectangular projections (ERP) and perspective views. To address these limitations, we propose SE360, a novel framework for multi-condition guided object editing in 360$^\circ$ panoramas. At its core is a novel coarse-to-fine autonomous data generation pipeline without manual intervention. This pipeline leverages a Vision-Language Model (VLM) and adaptive projection adjustment for hierarchical analysis, ensuring the holistic segmentation of objects and their physical context. The resulting data pairs are both semantically meaningful and geometrically consistent, even when sourced from unlabeled panoramas. Furthermore, we introduce a cost-effective, two-stage data refinement strategy to improve data realism and mitigate model overfitting to erase artifacts. Based on the constructed dataset, we train a Transformer-based diffusion model to allow flexible object editing guided by text, mask, or reference image in 360$^\circ$ panoramas. Our experiments demonstrate that our method outperforms existing methods in both visual quality and semantic accuracy.

[4] PaveSync: A Unified and Comprehensive Dataset for Pavement Distress Analysis and Classification

Blessing Agyei Kyem, Joshua Kofi Asamoah, Anthony Dontoh, Andrews Danyo, Eugene Denteh, Armstrong Aboah

🧩 TL;DR

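This work presents the first globally representative benchmark dataset for pavement defect detection. By consolidating multiple public data sources and standardizing annotation formats, it resolves inconsistencies in annotation style, defect definitions, and formats across existing datasets, and provides a unified resource for model training and fair comparison.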

📘 Detailed Summary

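Motivation: Automated pavement defect detection generalizes poorly across diverse real-world scenarios, mainly because standardized datasets are lacking. Existing datasets differ in annotation style, distress type definitions, and formats, which limits their integration for unified training and hinders fair model comparison and performance evaluation.

Method: The authors build a comprehensive benchmark dataset that consolidates multiple public sources: 52,747 images from seven countries with 135,277 bounding box annotations covering 13 distress types. Class definitions and annotation formats are standardized, and state-of-the-art object detection models (YOLOv8-YOLOv12, Faster R-CNN, and DETR) are benchmarked to assess performance across diverse scenarios.

Result: The benchmarked object detection models achieve competitive performance on the proposed dataset. The dataset captures broad real-world variation in image quality, resolution, viewing angle, and weather conditions, provides an effective platform for evaluating zero-shot transfer to new environments, and demonstrates model robustness and generalization across diverse scenarios.

Conclusion: The work provides the first globally representative benchmark dataset for pavement defect detection. Standardized annotations resolve dataset inconsistencies and enable fair model comparison. The dataset supports zero-shot transfer evaluation and gives future research a unified platform for training and testing automated pavement inspection methods.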

📄 Abstract

Automated pavement defect detection often struggles to generalize across diverse real-world conditions due to the lack of standardized datasets. Existing datasets differ in annotation styles, distress type definitions, and formats, limiting their integration for unified training. To address this gap, we introduce a comprehensive benchmark dataset that consolidates multiple publicly available sources into a standardized collection of 52747 images from seven countries, with 135277 bounding box annotations covering 13 distinct distress types. The dataset captures broad real-world variation in image quality, resolution, viewing angles, and weather conditions, offering a unique resource for consistent training and evaluation. Its effectiveness was demonstrated through benchmarking with state-of-the-art object detection models including YOLOv8-YOLOv12, Faster R-CNN, and DETR, which achieved competitive performance across diverse scenarios. By standardizing class definitions and annotation formats, this dataset provides the first globally representative benchmark for pavement defect detection and enables fair comparison of models, including zero-shot transfer to new environments.

[5] A Contextual Analysis of Driver-Facing and Dual-View Video Inputs for Distraction Detection in Naturalistic Driving Environments

Anthony Dontoh, Stephanie Ivey, Armstrong Aboah

🧩 TL;DR

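This study examines a dual-view approach to distracted driving detection that combines driver-facing and road-facing views. It finds that performance gains depend strongly on model architecture: the single-pathway SlowOnly model gains 9.8 percent in accuracy, while the dual-pathway SlowFast model loses 7.2 percent because of representational conflicts.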

📘 Detailed Summary

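Motivation: Existing computer vision models for distracted driving detection rely mainly on driver-facing views and ignore key environmental context that influences driving behavior. This study asks whether dual-view input that adds road-facing views to driver-facing footage improves distraction detection accuracy under naturalistic driving conditions.

Method: Using synchronized dual-camera recordings from real-world driving, the study benchmarks three leading spatiotemporal action recognition architectures: SlowFast-R50, X3D-M, and SlowOnly-R50. Each model is evaluated under two input configurations: driver-only and stacked dual-view.

Result: The effect of contextual input depends strongly on the underlying architecture. The single-pathway SlowOnly model gains 9.8 percent in accuracy with dual-view input, the dual-pathway SlowFast model loses 7.2 percent in accuracy because of representational conflicts, and X3D-M falls between the two.

Conclusion: Simply adding visual context is not sufficient to improve distracted driving detection and may cause interference unless the architecture is specifically designed to support multi-view integration. The findings underline the importance of fusion-aware design for future multimodal driver monitoring systems and provide some of the first empirical evidence from a systematic comparison of single-view and dual-view detection models.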

📄 Abstract

Despite increasing interest in computer vision-based distracted driving detection, most existing models rely exclusively on driver-facing views and overlook crucial environmental context that influences driving behavior. This study investigates whether incorporating road-facing views alongside driver-facing footage improves distraction detection accuracy in naturalistic driving conditions. Using synchronized dual-camera recordings from real-world driving, we benchmark three leading spatiotemporal action recognition architectures: SlowFast-R50, X3D-M, and SlowOnly-R50. Each model is evaluated under two input configurations: driver-only and stacked dual-view. Results show that while contextual inputs can improve detection in certain models, performance gains depend strongly on the underlying architecture. The single-pathway SlowOnly model achieved a 9.8 percent improvement with dual-view inputs, while the dual-pathway SlowFast model experienced a 7.2 percent drop in accuracy due to representational conflicts. These findings suggest that simply adding visual context is not sufficient and may lead to interference unless the architecture is specifically designed to support multi-view integration. This study presents one of the first systematic comparisons of single- and dual-view distraction detection models using naturalistic driving data and underscores the importance of fusion-aware design for future multimodal driver monitoring systems.

[6] Towards Natural Language-Based Document Image Retrieval: New Dataset and Benchmark

Hao Guo, Xugong Qin, Jun Jie Ou Yang, Peng Zhang, Gangyan Zeng, Yubo Li, Hailun Lin

🧩 TL;DR

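This paper introduces a Natural Language-based Document Image Retrieval (NL-DIR) benchmark. It uses semantically rich natural language queries to address the limits of existing document image retrieval methods in fine-grained semantic retrieval, and provides a dataset of 41K authentic document images with high-quality queries.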

📘 Detailed Summary

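Motivation: Existing document image retrieval (DIR) methods are mainly based on image queries and can only retrieve documents within the same coarse semantic category, such as newspapers or receipts. Real-world scenarios usually supply textual queries with fine-grained semantics, which these methods struggle to handle. Closing this gap requires a DIR benchmark built around fine-grained natural language queries.

Method: The NL-DIR benchmark contains 41K authentic document images, each paired with five high-quality, fine-grained semantic queries generated by large language models and verified manually. The authors evaluate the zero-shot and fine-tuned performance of mainstream contrastive vision-language models and OCR-free visual document understanding models, and further study a two-stage retrieval method that improves performance while remaining efficient in time and space.

Result: The NL-DIR dataset contains 41K authentic document images, each with five high-quality, fine-grained semantic queries. Existing mainstream models are evaluated in zero-shot and fine-tuned settings, and the two-stage retrieval method is shown to improve performance while balancing time and space efficiency.

Conclusion: The NL-DIR benchmark opens new research opportunities for the visual document understanding community. By introducing semantically rich natural language queries, it addresses the limits of existing document image retrieval in fine-grained semantic matching and should advance the field, particularly for complex semantic queries in real-world scenarios.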

📄 Abstract

Document image retrieval (DIR) aims to retrieve document images from a gallery according to a given query. Existing DIR methods are primarily based on image queries that retrieve documents within the same coarse semantic category, e.g., newspapers or receipts. However, these methods struggle to effectively retrieve document images in real-world scenarios where textual queries with fine-grained semantics are usually provided. To bridge this gap, we introduce a new Natural Language-based Document Image Retrieval (NL-DIR) benchmark with corresponding evaluation metrics. In this work, natural language descriptions serve as semantically rich queries for the DIR task. The NL-DIR dataset contains 41K authentic document images, each paired with five high-quality, fine-grained semantic queries generated and evaluated through large language models in conjunction with manual verification. We perform zero-shot and fine-tuning evaluations of existing mainstream contrastive vision-language models and OCR-free visual document understanding (VDU) models. A two-stage retrieval method is further investigated for performance improvement while achieving both time and space efficiency. We hope the proposed NL-DIR benchmark can bring new opportunities and facilitate research for the VDU community. Datasets and codes will be publicly available at huggingface.co/datasets/nianbing/NL-DIR.
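The abstract does not describe the two stages in detail, so the following is only a generic coarse-to-fine retrieval sketch in Python, not the authors' method. The function names, the use of cosine similarity for the first stage, and the `fine_score` callback (standing in for any more expensive scorer) are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def two_stage_retrieve(query_vec, gallery, fine_score, k=2):
    """Return the top-k documents, ranked first coarsely and then finely.

    gallery:    dict mapping doc_id -> compact global embedding (assumed precomputed)
    fine_score: callable doc_id -> float, a hypothetical expensive scorer
    """
    # Stage 1: cheap similarity over the whole gallery, keep only k candidates.
    candidates = sorted(
        gallery, key=lambda d: cosine(query_vec, gallery[d]), reverse=True
    )[:k]
    # Stage 2: run the expensive scorer on the k survivors only.
    return sorted(candidates, key=fine_score, reverse=True)
```

The efficiency argument is that the expensive scorer runs on k documents instead of the whole gallery, so time stays low, and only compact embeddings need to be stored for every document, so space stays low as well.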

[7] MAPI-GNN: Multi-Activation Plane Interaction Graph Neural Network for Multimodal Medical Diagnosis

Ziwei Qin, Xuhui Song, Deqing Huang, Na Qin, Jun Li

🧩 TL;DR

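This work proposes the Multi-Activation Plane Interaction Graph Neural Network (MAPI-GNN), which learns a multifaceted graph profile from semantically disentangled feature subspaces. This overcomes the limits of a conventional single static graph in medical diagnosis and significantly improves multimodal diagnostic performance.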

📘 Detailed Summary

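Motivation: Current graph neural network methods for multimodal medical diagnosis rely mainly on a single static graph built from indiscriminate features. This paradigm limits the ability to model patient-specific pathological relationships and therefore limits diagnostic quality.

Method: MAPI-GNN first uncovers latent graph-aware patterns with a multi-dimensional discriminator. These patterns guide the dynamic construction of a stack of activation graphs, and a relational fusion engine finally aggregates and contextualizes this multifaceted graph profile for a robust diagnosis.

Result: Extensive experiments on two diverse tasks, covering more than 1300 patient samples, show that MAPI-GNN significantly outperforms existing state-of-the-art methods and validate its multifaceted graph modeling approach.

Conclusion: Building a multifaceted graph profile from semantically disentangled feature subspaces captures patient-specific pathological relationships more effectively. The work offers a new dynamic graph construction paradigm for graph neural networks in medical diagnosis, with significance for clinical practice.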

📄 Abstract

Graph neural networks are increasingly applied to multimodal medical diagnosis for their inherent relational modeling capabilities. However, their efficacy is often compromised by the prevailing reliance on a single, static graph built from indiscriminate features, hindering the ability to model patient-specific pathological relationships. To this end, the proposed Multi-Activation Plane Interaction Graph Neural Network (MAPI-GNN) reconstructs this single-graph paradigm by learning a multifaceted graph profile from semantically disentangled feature subspaces. The framework first uncovers latent graph-aware patterns via a multi-dimensional discriminator; these patterns then guide the dynamic construction of a stack of activation graphs; and this multifaceted profile is finally aggregated and contextualized by a relational fusion engine for a robust diagnosis. Extensive experiments on two diverse tasks, comprising over 1300 patient samples, demonstrate that MAPI-GNN significantly outperforms state-of-the-art methods.

[8] $\text{H}^2$em: Learning Hierarchical Hyperbolic Embeddings for Compositional Zero-Shot Learning

Lin Li, Jiahui Li, Jiaming Lei, Jun Xiao, Feifei Shao, Long Chen

🧩 TL;DR

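This paper proposes H2em, a framework that uses hyperbolic embeddings to model hierarchical structure in compositional zero-shot learning. The exponential volume growth of hyperbolic space matches the exponential structure of compositional semantics, and the method achieves state-of-the-art performance in both closed-world and open-world scenarios.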

📘 Detailed Summary

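Motivation: Current compositional zero-shot learning (CZSL) methods often overlook rich hierarchical structures, such as the semantic hierarchy of primitives and the conceptual hierarchy between primitives and compositions. Existing approaches model these hierarchies through loss regularization in Euclidean space, but they do not scale to the large taxonomies required for real-world CZSL: the polynomial volume growth of Euclidean space cannot match the exponential structure of compositional semantics, which impairs generalization.

Method: H2em uses the natural suitability of hyperbolic geometry for embedding tree-like structures. A Dual-Hierarchical Entailment Loss uses hyperbolic entailment cones to enforce the predefined hierarchies. A Discriminative Alignment Loss with hard negative mining establishes a large geodesic distance between semantically similar compositions. A Hyperbolic Cross-Modal Attention mechanism performs instance-aware cross-modal fusion within hyperbolic geometry.

Result: Extensive ablations on three benchmarks show that H2em sets a new state of the art in both closed-world and open-world scenarios. The method addresses hierarchical collapse and poor fine-grained discrimination and significantly improves the generalization of compositional zero-shot learning.

Conclusion: Hyperbolic geometry provides a natural and effective mathematical framework for modeling hierarchical structure in compositional zero-shot learning, because its exponential volume growth matches the exponential structure of compositional semantics. The method opens a new direction for compositional learning over large taxonomies and demonstrates the advantages of hyperbolic space for modeling complex semantic relationships.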

📄 Abstract

Compositional zero-shot learning (CZSL) aims to recognize unseen state-object compositions by generalizing from a training set of their primitives (state and object). Current methods often overlook the rich hierarchical structures, such as the semantic hierarchy of primitives (e.g., apple → fruit) and the conceptual hierarchy between primitives and compositions (e.g., sliced apple → apple). A few recent efforts have shown effectiveness in modeling these hierarchies through loss regularization within Euclidean space. In this paper, we argue that they fail to scale to the large-scale taxonomies required for real-world CZSL: the space's polynomial volume growth in flat geometry cannot match the exponential structure, impairing generalization capacity. To this end, we propose H2em, a new framework that learns Hierarchical Hyperbolic EMbeddings for CZSL. H2em leverages the unique properties of hyperbolic geometry, a space naturally suited for embedding tree-like structures with low distortion. However, a naive hyperbolic mapping may suffer from hierarchical collapse and poor fine-grained discrimination. We further design two learning objectives to structure this space: a Dual-Hierarchical Entailment Loss that uses hyperbolic entailment cones to enforce the predefined hierarchies, and a Discriminative Alignment Loss with hard negative mining to establish a large geodesic distance between semantically similar compositions. Furthermore, we devise Hyperbolic Cross-Modal Attention to realize instance-aware cross-modal infusion within hyperbolic geometry. Extensive ablations on three benchmarks demonstrate that H2em establishes a new state-of-the-art in both closed-world and open-world scenarios. Our codes will be released.
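The "geodesic distance" mentioned above can be illustrated with a short Python sketch. The abstract does not say which model of hyperbolic space H2em uses, so this example assumes the unit Poincaré ball with curvature -1, where the distance is d(u, v) = arcosh(1 + 2‖u − v‖² / ((1 − ‖u‖²)(1 − ‖v‖²))). The function name is hypothetical.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincare ball."""

    def sq_norm(x):
        return sum(t * t for t in x)

    nu, nv = sq_norm(u), sq_norm(v)
    if nu >= 1.0 or nv >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    diff = sq_norm([a - b for a, b in zip(u, v)])
    # The denominator shrinks near the boundary, so distances blow up there.
    return math.acosh(1.0 + 2.0 * diff / ((1.0 - nu) * (1.0 - nv)))


# Two pairs with the same Euclidean gap (0.1) but different positions in the ball.
near_center = poincare_distance([0.0, 0.0], [0.1, 0.0])
near_boundary = poincare_distance([0.85, 0.0], [0.95, 0.0])
print(near_center, near_boundary)
```

The second pair is much farther apart in hyperbolic terms than the first, even though the Euclidean gap is identical. This is the property that leaves room for exponentially many well-separated leaves of a tree near the boundary.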

Nguyen Lam Phu Quy, Pham Phu Hoa, Tran Chi Nguyen, Dao Sy Duy Minh, Nguyen Hoang Minh Ngoc, Huynh Trung Kiet

🧩 TL;DR

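This paper proposes a multimodal pipeline that augments image captioning with external textual knowledge. It generates captions with rich context such as event background, temporal cues, and named entities, making them substantially deeper and more informative than those of traditional captioning methods.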

📘 Detailed Summary

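Motivation: Real-world image captions often lack contextual depth and omit key details such as event background, temporal cues, outcomes, and named entities that are not visually discernible. This limits image understanding in journalism, education, and digital archives, where richer and more informative descriptions are needed.

Method: The multimodal pipeline retrieves semantically similar images with BEIT-3 (Flickr30k-384 and COCO-384) and SigLIP So-384, reranks them by geometric alignment using ORB and SIFT, and extracts contextual information from related articles through semantic search. A Qwen3 model fine-tuned with QLoRA then integrates this context with base captions generated by Instruct BLIP (Vicuna-7B) to produce event-enriched, context-aware descriptions.

Result: Evaluation on the OpenEvents v1 dataset shows that the method generates significantly more informative captions than traditional methods, indicating strong potential for real-world applications that need deeper visual-textual understanding.

Conclusion: The work shows that integrating external textual knowledge effectively deepens the context of image captions. It offers a promising solution for journalism, education, and digital archives and illustrates the value of multimodal knowledge integration for visual understanding.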

📄 Abstract

Real-world image captions often lack contextual depth, omitting crucial details such as event background, temporal cues, outcomes, and named entities that are not visually discernible. This gap limits the effectiveness of image understanding in domains like journalism, education, and digital archives, where richer, more informative descriptions are essential. To address this, we propose a multimodal pipeline that augments visual input with external textual knowledge. Our system retrieves semantically similar images using BEIT-3 (Flickr30k-384 and COCO-384) and SigLIP So-384, reranks them using ORB and SIFT for geometric alignment, and extracts contextual information from related articles via semantic search. A fine-tuned Qwen3 model with QLoRA then integrates this context with base captions generated by Instruct BLIP (Vicuna-7B) to produce event-enriched, context-aware descriptions. Evaluated on the OpenEvents v1 dataset, our approach generates significantly more informative captions compared to traditional methods, showing strong potential for real-world applications requiring deeper visual-textual understanding.

[10] Item Region-based Style Classification Network (IRSN): A Fashion Style Classifier Based on Domain Knowledge of Fashion Experts

Jinyoung Choi, Youngchae Kwon, Injung Kim

🧩 TL;DR

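This paper proposes the Item Region-based Style Classification Network (IRSN), which improves fashion style classification by analyzing item-specific features and their combinations. The method combines item region pooling, gated feature fusion, and a dual-backbone architecture, and significantly improves classification accuracy on multiple datasets.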

📘 Detailed Summary

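Motivation: Fashion style classification faces two main challenges: large visual variation within the same style, and visual similarity between different styles. Style is expressed not only by overall appearance but also by the attributes of individual items and how they are combined, so both global and item-level features must be considered.

Method: IRSN uses item region pooling (IRP) to extract features for each item region, analyzes them separately, and combines them with gated feature fusion (GFF). It also adopts a dual-backbone architecture that pairs a domain-specific feature extractor with a general feature extractor pre-trained on a large-scale image-text dataset to strengthen feature representation.

Result: In experiments on the FashionStyle14 and ShowniqV3 datasets, applying IRSN to six backbones, including EfficientNet, ConvNeXt, and Swin Transformer, improves classification accuracy by an average of 6.9% and 7.6% respectively, with maximum gains of 14.5% and 15.1%. Visualization analysis further confirms that IRSN models capture differences between similar style classes better.

Conclusion: By jointly modeling global features, item-level features, and their combinations, IRSN addresses the visual variation and similarity challenges of fashion style classification. It provides a new technical framework for fine-grained fashion analysis, and its dual-backbone architecture and gated fusion mechanism may transfer to other vision tasks that need multi-level feature modeling.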

📄 Abstract

Fashion style classification is a challenging task because of the large visual variation within the same style and the existence of visually similar styles. Styles are expressed not only by the global appearance, but also by the attributes of individual items and their combinations. In this study, we propose an item region-based fashion style classification network (IRSN) to effectively classify fashion styles by analyzing item-specific features and their combinations in addition to global features. IRSN extracts features of each item region using item region pooling (IRP), analyzes them separately, and combines them using gated feature fusion (GFF). In addition, we improve the feature extractor by applying a dual-backbone architecture that combines a domain-specific feature extractor and a general feature extractor pre-trained with a large-scale image-text dataset. In experiments, applying IRSN to six widely-used backbones, including EfficientNet, ConvNeXt, and Swin Transformer, improved style classification accuracy by an average of 6.9% and a maximum of 14.5% on the FashionStyle14 dataset and by an average of 7.6% and a maximum of 15.1% on the ShowniqV3 dataset. Visualization analysis also supports that the IRSN models are better than the baseline models at capturing differences between similar style classes.
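The abstract names gated feature fusion (GFF) but does not give its formula. The Python sketch below shows one common form of gating, assumed here purely for illustration: each item feature gets a scalar sigmoid gate computed from the feature itself, and the fused vector is the gate-weighted sum. The function names and the shared gate weights `gate_w` and `gate_b` are hypothetical and may differ from IRSN's actual design.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(item_feats, gate_w, gate_b=0.0):
    """Fuse per-item feature vectors using one scalar sigmoid gate per item.

    item_feats: list of feature vectors, one per item region (e.g. top, shoes)
    gate_w:     weight vector shared across items, same length as each feature
    Returns (fused_vector, gates).
    """
    dim = len(gate_w)
    fused = [0.0] * dim
    gates = []
    for feat in item_feats:
        # The gate decides how much this item contributes to the style decision.
        g = sigmoid(sum(w * x for w, x in zip(gate_w, feat)) + gate_b)
        gates.append(g)
        for i in range(dim):
            fused[i] += g * feat[i]
    return fused, gates
```

With zero gate weights every item receives a gate of 0.5, so the result is a uniformly scaled sum. Learned weights let the network suppress items that carry little style information.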

[11] DDAVS: Disentangled Audio Semantics and Delayed Bidirectional Alignment for Audio-Visual Segmentation

Jingqi Tian, Yiheng Du, Haoji Zhang, Yuji Wang, Isaac Ning Lee, Xulong Bai, Tianrui Zhu, Jingxuan Niu, Yansong Tang

🧩 TL;DR

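This paper proposes DDAVS, a framework that uses disentangled audio semantics and delayed bidirectional alignment to address multi-source entanglement and audio-visual misalignment in audio-visual segmentation, achieving state-of-the-art performance on multiple benchmarks.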

📘 Detailed Summary

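Motivation: Audio-visual segmentation aims to localize sound-producing objects at the pixel level by combining auditory and visual information. Existing methods often suffer from multi-source entanglement and audio-visual misalignment, which bias models toward louder or larger objects and cause them to overlook weaker, smaller, or co-occurring sources.

Method: DDAVS combines disentangled audio semantics with delayed bidirectional alignment. Learnable queries extract audio semantics and anchor them in a structured semantic space derived from an audio prototype memory bank, and contrastive learning improves discriminability and robustness. Dual cross-attention with delayed modality interaction improves the robustness of multimodal alignment.

Result: Extensive experiments on the AVS-Objects and VPO benchmarks show that DDAVS consistently outperforms existing methods across single-source, multi-source, and multi-instance scenarios, validating its effectiveness and generalization under challenging real-world audio-visual segmentation conditions.

Conclusion: The work shows that disentangled audio semantics and delayed bidirectional alignment can address the core challenges of audio-visual segmentation. It offers a new technical route for complex multi-source scenes and shows the potential for robust multimodal segmentation under real-world conditions.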

📄 Abstract

Audio-Visual Segmentation (AVS) aims to localize sound-producing objects at the pixel level by jointly leveraging auditory and visual information. However, existing methods often suffer from multi-source entanglement and audio-visual misalignment, which lead to biases toward louder or larger objects while overlooking weaker, smaller, or co-occurring sources. To address these challenges, we propose DDAVS, a Disentangled Audio Semantics and Delayed Bidirectional Alignment framework. To mitigate multi-source entanglement, DDAVS employs learnable queries to extract audio semantics and anchor them within a structured semantic space derived from an audio prototype memory bank. This is further optimized through contrastive learning to enhance discriminability and robustness. To alleviate audio-visual misalignment, DDAVS introduces dual cross-attention with delayed modality interaction, improving the robustness of multimodal alignment. Extensive experiments on the AVS-Objects and VPO benchmarks demonstrate that DDAVS consistently outperforms existing approaches, exhibiting strong performance across single-source, multi-source, and multi-instance scenarios. These results validate the effectiveness and generalization ability of our framework under challenging real-world audio-visual segmentation conditions. Project page: https://trilarflagz.github.io/DDAVS-page/

Xiangxuan Ren, Zhongdao Wang, Pin Tang, Guoqing Wang, Jilai Zheng, Chao Ma

🧩 TL;DR

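This paper proposes LiteFusion, a novel multi-modal 3D object detector that uses LiDAR data as a complementary source of geometric information to enhance camera-based detection. It completely removes the dependence on a 3D backbone, which improves deployment friendliness and robustness.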

📘 Detailed Summary

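Motivation: Current multi-modal 3D object detectors depend heavily on the LiDAR sensor and lose much of their performance when LiDAR is absent. They also rely on 3D sparse convolution operators that are optimized mainly for NVIDIA GPUs, which makes deployment on diverse hardware such as NPUs and FPGAs difficult. Both issues affect the robustness and safety of autonomous driving systems in practice.

Method: LiteFusion reconsiders the role of LiDAR in the camera-LiDAR fusion paradigm. Instead of treating LiDAR as an independent modality with its own feature extraction backbone, it uses LiDAR data as a complementary source of geometric information to enhance camera-based detection. LiDAR points are integrated into image features within a quaternion space, where orthogonal constraints are well preserved during training. This helps model domain-specific relations across modalities and yields a compact cross-modal embedding.

Result: On the nuScenes dataset, LiteFusion improves the vision-based baseline detector by +20.4% mAP and +19.7% NDS with only a 1.1% increase in parameters and without a dedicated LiDAR encoder. Notably, LiteFusion maintains strong results even when LiDAR input is absent, which highlights its robustness and effectiveness across fusion paradigms and deployment scenarios.

Conclusion: By reconsidering the role of LiDAR in multi-modal fusion, the work delivers a deployment-friendly and robust 3D object detection method that removes the dependence on a 3D backbone. It offers a more flexible option for deploying real autonomous driving systems and adapts better to different hardware platforms.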

📄 Abstract

3D object detection is fundamental for safe and robust intelligent transportation systems. Current multi-modal 3D object detectors often rely on complex architectures and training strategies to achieve higher detection accuracy. However, these methods heavily rely on the LiDAR sensor so that they suffer from large performance drops when LiDAR is absent, which compromises the robustness and safety of autonomous systems in practical scenarios. Moreover, existing multi-modal detectors face difficulties in deployment on diverse hardware platforms, such as NPUs and FPGAs, due to their reliance on 3D sparse convolution operators, which are primarily optimized for NVIDIA GPUs. To address these challenges, we reconsider the role of LiDAR in the camera-LiDAR fusion paradigm and introduce a novel multi-modal 3D detector, LiteFusion. Instead of treating LiDAR point clouds as an independent modality with a separate feature extraction backbone, LiteFusion utilizes LiDAR data as a complementary source of geometric information to enhance camera-based detection. This straightforward approach completely eliminates the reliance on a 3D backbone, making the method highly deployment-friendly. Specifically, LiteFusion integrates complementary features from LiDAR points into image features within a quaternion space, where the orthogonal constraints are well-preserved during network training. This helps model domain-specific relations across modalities, yielding a compact cross-modal embedding. Experiments on the nuScenes dataset show that LiteFusion improves the baseline vision-based detector by +20.4% mAP and +19.7% NDS with a minimal increase in parameters (1.1%) without using dedicated LiDAR encoders. Notably, even in the absence of LiDAR input, LiteFusion maintains strong results, highlighting its favorable robustness and effectiveness across diverse fusion paradigms and deployment scenarios.
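The abstract says only that fusion happens "within a quaternion space" and does not specify the layer design. As background, the Python sketch below shows the Hamilton product, the basic operation that quaternion-valued layers are generally built on. How LiteFusion actually applies it is not stated in the abstract, and the function names here are hypothetical.

```python
import math


def hamilton(p, q):
    """Hamilton product of two quaternions given as (real, i, j, k) tuples."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,  # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,  # i component
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,  # j component
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,  # k component
    )


def norm(q):
    """Euclidean norm of a quaternion."""
    return math.sqrt(sum(t * t for t in q))


# The product is not commutative: i * j = k but j * i = -k.
i, j = (0, 1, 0, 0), (0, 0, 1, 0)
print(hamilton(i, j), hamilton(j, i))
```

Because the norm of a product equals the product of the norms, multiplying by a unit quaternion acts as a rotation that preserves lengths. This is one reason quaternion parameterizations are associated with well-behaved, orthogonality-preserving transforms.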

[13] LADLE-MM: Limited Annotation based Detector with Learned Ensembles for Multimodal Misinformation

Daniele Cardullo, Simone Teglia, Irene Amerini

🧩 TL;DR

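This paper proposes LADLE-MM, a multimodal misinformation detector designed for limited annotations and constrained training resources. Using model-soup initialization, it achieves performance competitive with state-of-the-art methods on the DGM4 and VERITE benchmarks while using 60.3% fewer trainable parameters.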

📘 Detailed Summary

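Motivation: As tools for generating and manipulating multimedia content have spread, realistic synthetic alterations across multiple modalities have become a widespread threat, often used to distort narratives of important events and spread misinformation. Existing multimodal misinformation detectors typically rely on computationally intensive architectures or need large amounts of annotated data, which limits their use in resource-constrained settings.

Method: LADLE-MM is initialized with a model soup and consists of two unimodal branches and one multimodal branch. The multimodal branch enhances image and text representations with fixed multimodal embeddings extracted from BLIP, which serve as a reference space. The method works in a limited-annotation setting and substantially reduces model complexity.

Result: On the DGM4 benchmark, LADLE-MM achieves competitive performance on both binary and multi-label classification despite having 60.3% fewer trainable parameters than previous state-of-the-art models, and it outperforms existing methods when trained without grounding annotations. On the VERITE dataset, it outperforms current state-of-the-art approaches built on more complex large vision-language model architectures, showing effective generalization in an open-set setting and strong robustness to unimodal bias.

Conclusion: Carefully designed model-soup initialization and enhanced multimodal representations can deliver strong misinformation detection with far fewer parameters. The method is a practical option for multimodal content verification in resource-constrained settings and generalizes well in open-set scenarios, which matters for real deployment.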

📄 Abstract

With the rise of easily accessible tools for generating and manipulating multimedia content, realistic synthetic alterations to digital media have become a widespread threat, often involving manipulations across multiple modalities simultaneously. Recently, such techniques have been increasingly employed to distort narratives of important events and to spread misinformation on social media, prompting the development of misinformation detectors. In the context of misinformation conveyed through image-text pairs, several detection methods have been proposed. However, these approaches typically rely on computationally intensive architectures or require large amounts of annotated data. In this work we introduce LADLE-MM: Limited Annotation based Detector with Learned Ensembles for Multimodal Misinformation, a model-soup initialized multimodal misinformation detector designed to operate under a limited annotation setup and constrained training resources. LADLE-MM is composed of two unimodal branches and a third multimodal one that enhances image and text representations with additional multimodal embeddings extracted from BLIP, serving as fixed reference space. Despite using 60.3% fewer trainable parameters than previous state-of-the-art models, LADLE-MM achieves competitive performance on both binary and multi-label classification tasks on the DGM4 benchmark, outperforming existing methods when trained without grounding annotations. Moreover, when evaluated on the VERITE dataset, LADLE-MM outperforms current state-of-the-art approaches that utilize more complex architectures involving Large Vision-Language-Models, demonstrating the effective generalization ability in an open-set setting and strong robustness to unimodal bias.
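"Model soup" generally refers to averaging the weights of several fine-tuned checkpoints that share one architecture. The abstract does not say which checkpoints LADLE-MM averages or how they are selected, so the Python sketch below shows only the simplest variant, a uniform soup over hypothetical checkpoints stored as plain dictionaries. The function name and the data layout are assumptions for illustration.

```python
def uniform_soup(checkpoints):
    """Average parameters key by key across checkpoints.

    checkpoints: non-empty list of dicts mapping parameter name -> list of floats.
    All checkpoints must have identical keys and shapes (same architecture).
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    keys = checkpoints[0].keys()
    for ckpt in checkpoints:
        if ckpt.keys() != keys:
            raise ValueError("checkpoints must share the same parameter names")
    n = len(checkpoints)
    return {
        # zip(*...) walks the same position of each checkpoint's parameter in parallel.
        k: [sum(vals) / n for vals in zip(*(c[k] for c in checkpoints))]
        for k in keys
    }


# Two toy "fine-tuned" checkpoints of the same tiny model.
run_a = {"w": [1.0, 3.0], "b": [0.0]}
run_b = {"w": [3.0, 5.0], "b": [2.0]}
init = uniform_soup([run_a, run_b])
print(init)
```

The averaged weights can then serve as the starting point for further training, which matches the "model-soup initialized" wording in the abstract, although the exact recipe is left to the paper.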

[14] TAVID: Text-Driven Audio-Visual Interactive Dialogue Generation

Ji-Hoon Kim, Junseok Ahn, Doyeop Kwak, Joon Son Chung, Shinji Watanabe

🧩 TL;DR

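This paper presents TAVID, a unified framework that generates interactive face video and conversational speech in sync from text and reference images. It addresses the separation of audio and visual modalities in prior work and enables more natural simulation of human conversation.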

📘 Detailed Summary

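Motivation: Prior work usually studies talking head generation or conversational speech generation in isolation, overlooking the tightly coupled audio-visual interaction of human conversation. This limits the ability to build human-like conversational systems and motivates a unified framework that generates interactive faces and conversational speech together.

Method: TAVID integrates the face and speech generation pipelines through two cross-modal mappers, a motion mapper and a speaker mapper. These enable bidirectional information exchange between the audio and visual modalities so that interactive faces and conversational speech are generated in sync.

Result: The system is evaluated along four dimensions: talking face realism, listening head responsiveness, dyadic interaction fluency, and speech quality. Extensive experiments show that the approach is effective on all four and produces high-quality synchronized audio-visual output.

Conclusion: The work shows the importance of a unified audio-visual generation framework for building human-like conversational systems. Cross-modal information exchange yields a more natural interactive experience and offers a new technical route and evaluation criteria for future multimodal dialogue systems.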

📄 Abstract

The objective of this paper is to jointly synthesize interactive videos and conversational speech from text and reference images. With the ultimate goal of building human-like conversational systems, recent studies have explored talking or listening head generation as well as conversational speech generation. However, these works are typically studied in isolation, overlooking the multimodal nature of human conversation, which involves tightly coupled audio-visual interactions. In this paper, we introduce TAVID, a unified framework that generates both interactive faces and conversational speech in a synchronized manner. TAVID integrates face and speech generation pipelines through two cross-modal mappers (i.e., a motion mapper and a speaker mapper), which enable bidirectional exchange of complementary information between the audio and visual modalities. We evaluate our system across four dimensions: talking face realism, listening head responsiveness, dyadic interaction fluency, and speech quality. Extensive experiments demonstrate the effectiveness of our approach across all these aspects.

[15] CRAFT: Continuous Reasoning and Agentic Feedback Tuning for Multimodal Text-to-Image Generation

V. Kovalev, A. Kuvshinov, A. Buzovkin, D. Pokidov, D. Timonin

🧩 TL;DR

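This paper presents CRAFT, a training-free, model-agnostic inference-time refinement framework that improves the reliability and controllability of multimodal image generation through structured reasoning and constraint-driven feedback. It decomposes a prompt into dependency-structured visual questions, verifies generated images with a vision-language model, and applies targeted prompt edits through an LLM agent.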

📘 Detailed Summary

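Motivation: Existing inference-time refinement methods usually rely on implicit, holistic critiques or unconstrained prompt rewrites, which makes their behavior hard to interpret, control, or stop reliably. Large language models, by contrast, have benefited from explicit, structured forms of thinking based on verification, targeted correction, and early stopping. Multimodal image generation lacks a comparable systematic reasoning framework.

Method: CRAFT decomposes a prompt into dependency-structured visual questions, verifies generated images with a vision-language model, and applies targeted prompt edits through an LLM agent only where constraints fail. The process iterates with an explicit stopping criterion that fires once all constraints are satisfied, forming an interpretable and controllable inference-time refinement loop that needs no extra training and is model-agnostic.

Result: Across multiple model families and challenging benchmarks, CRAFT consistently improves compositional accuracy, text rendering, and preference-based evaluations, with especially strong gains for lightweight generators. These improvements add only negligible inference-time overhead, allowing smaller or cheaper models to approach the quality of more expensive systems.

Conclusion: The results suggest that explicitly structured, constraint-driven inference-time reasoning is a key ingredient for improving the reliability of multimodal generative models. CRAFT gives image generation an interpretable, controllable refinement loop that lets resource-constrained models reach higher-quality output, pointing to a new direction for the reliability and efficiency of generative AI systems.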

📄 Abstract

Recent work has shown that inference-time reasoning and reflection can improve text-to-image generation without retraining. However, existing approaches often rely on implicit, holistic critiques or unconstrained prompt rewrites, making their behavior difficult to interpret, control, or stop reliably. In contrast, large language models have benefited from explicit, structured forms of thinking based on verification, targeted correction, and early stopping. We introduce CRAFT (Continuous Reasoning and Agentic Feedback Tuning), a training-free, model-agnostic framework that brings this structured reasoning paradigm to multimodal image generation. CRAFT decomposes a prompt into dependency-structured visual questions, verifies generated images using a vision-language model, and applies targeted prompt edits through an LLM agent only where constraints fail. The process iterates with an explicit stopping criterion once all constraints are satisfied, yielding an interpretable and controllable inference-time refinement loop. Across multiple model families and challenging benchmarks, CRAFT consistently improves compositional accuracy, text rendering, and preference-based evaluations, with particularly strong gains for lightweight generators. Importantly, these improvements incur only a negligible inference-time overhead, allowing smaller or cheaper models to approach the quality of substantially more expensive systems. Our results suggest that explicitly structured, constraint-driven inference-time reasoning is a key ingredient for improving the reliability of multimodal generative models.
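The verify-then-edit loop described in the abstract can be sketched roughly as follows. This is an illustrative outline only, not the authors' implementation: `craft_style_loop` and the three callables are hypothetical names, the visual questions are a flat list rather than a dependency structure, and the toy stand-ins use strings in place of a real generator, vision-language model, and LLM agent.

```python
from typing import Callable, List, Tuple


def craft_style_loop(
    prompt: str,
    questions: List[str],
    generate: Callable[[str], str],
    verify: Callable[[str, str], bool],
    edit_prompt: Callable[[str, List[str]], str],
    max_iters: int = 4,
) -> Tuple[str, str, int]:
    """Generate, verify each constraint, and edit the prompt only where checks fail."""
    image = generate(prompt)
    for step in range(max_iters):
        failed = [q for q in questions if not verify(image, q)]
        if not failed:
            # Explicit stopping criterion: every constraint is satisfied.
            return prompt, image, step
        # Targeted edit: only the failed constraints are passed to the editor.
        prompt = edit_prompt(prompt, failed)
        image = generate(prompt)
    return prompt, image, max_iters


# Toy stand-ins (assumptions): the "image" is just the lowercased prompt.
def toy_generate(prompt: str) -> str:
    return prompt.lower()


def toy_verify(image: str, question: str) -> bool:
    return question in image


def toy_edit(prompt: str, failed: List[str]) -> str:
    return prompt + ", " + ", ".join(failed)


final_prompt, final_image, steps = craft_style_loop(
    "a red cube", ["red", "cube", "blue sphere"], toy_generate, toy_verify, toy_edit
)
```

In this toy run the first pass fails only the "blue sphere" check, so a single targeted edit is applied before the loop stops.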

[16] Chain-of-Anomaly Thoughts with Large Vision-Language Models

Pedro Domingos, João Pereira, Vasco Lopes, João Neves, David Semedo

🧩 TL;DR

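This paper proposes Chain-of-Anomaly-Thoughts (CoAT), a multi-agent reasoning framework that introduces an inductive criminal bias, significantly improving the anomaly detection performance of large vision-language models in video surveillance.

📘 Detailed Summary

Motivation: Large vision-language models used for automated video surveillance have an inherent bias toward normality and often fail to detect crimes. Existing chain-of-thought reasoning strategies lack inductive anomaly biases and steer the models further toward normal interpretations, which limits anomaly detection in practice.

Method: The paper proposes Chain-of-Anomaly-Thoughts (CoAT), a multi-agent reasoning framework that introduces an inductive criminal bias into the reasoning process through a final, anomaly-focused classification layer, systematically directing the model's attention toward anomalous patterns.

Result: The method raises the anomaly detection F1-score by 11.8 percentage points on low-resolution footage and anomaly classification performance by 3.78 percentage points on high-resolution videos, markedly improving model behavior in challenging surveillance scenarios.

Conclusion: The study shows that introducing an inductive anomaly bias into a multi-agent reasoning framework can effectively counter the normality bias of large vision-language models, offering a new technical path for anomaly detection in video surveillance with clear practical value.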

📄 Abstract

Automated video surveillance with Large Vision-Language Models is limited by their inherent bias towards normality, often failing to detect crimes. While Chain-of-Thought reasoning strategies show significant potential for improving performance in language tasks, the lack of inductive anomaly biases in their reasoning further steers the models towards normal interpretations. To address this, we propose Chain-of-Anomaly-Thoughts (CoAT), a multi-agent reasoning framework that introduces inductive criminal bias in the reasoning process through a final, anomaly-focused classification layer. Our method significantly improves Anomaly Detection, boosting F1-score by 11.8 p.p. on challenging low-resolution footage and Anomaly Classification by 3.78 p.p. in high-resolution videos.

[17] Beyond Motion Pattern: An Empirical Study of Physical Forces for Human Motion Understanding

Anh Dao, Manh Tran, Yufei Zhang, Xiaoming Liu, Zijun Cui

🧩 TL;DR

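This study proposes integrating physically inferred joint actuation forces into human motion understanding pipelines. A systematic evaluation on three tasks (gait recognition, action recognition, and video captioning) shows that force signals substantially complement existing visual and kinematic features under dynamic, occluded, or appearance-varying conditions.

📘 Detailed Summary

Motivation: Existing vision-based human motion understanding methods focus on recognition, tracking, and captioning, but largely overlook physical cues such as joint actuation forces, which are fundamental in biomechanics. The study asks whether and when physically inferred force signals enhance motion understanding, filling a gap in the integration of physical cues.

Method: The study incorporates physically inferred force signals into established motion understanding pipelines and systematically evaluates their effect on three tasks: gait recognition, action recognition, and fine-grained video captioning. Across 8 benchmarks, forces are combined with baseline models to build force-augmented pipelines, with attention to how forces complement other features under dynamic, occluded, and appearance-varying conditions.

Result: On the CASIA-B gait recognition benchmark, Rank-1 accuracy rises from 89.52% to 90.39% (+0.87), with larger gains under more challenging conditions: +2.7% when wearing a coat and +3.0% at the side view. On Gait3D, performance rises from 46.0% to 47.3% (+1.3). In action recognition, CTR-GCN gains +2.00% on Penn Action, and high-exertion classes such as punching/slapping gain +6.96%. In video captioning, the ROUGE-L score of Qwen2.5-VL rises from 0.310 to 0.339 (+0.029).

Conclusion: Physically inferred force signals can substantially complement visual and kinematic features, especially under dynamic, occluded, or appearance-varying conditions. The finding offers a new paradigm for integrating physical cues into motion understanding, indicates that biomechanical information can improve the robustness and semantic richness of existing computer vision methods, and opens a direction for future multimodal motion understanding research.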

📄 Abstract

Human motion understanding has advanced rapidly through vision-based progress in recognition, tracking, and captioning. However, most existing methods overlook physical cues such as joint actuation forces that are fundamental in biomechanics. This gap motivates our study: if and when do physically inferred forces enhance motion understanding? By incorporating forces into established motion understanding pipelines, we systematically evaluate their impact across baseline models on 3 major tasks: gait recognition, action recognition, and fine-grained video captioning. Across 8 benchmarks, incorporating forces yields consistent performance gains; for example, on CASIA-B, Rank-1 gait recognition accuracy improved from 89.52% to 90.39% (+0.87), with larger gain observed under challenging conditions: +2.7% when wearing a coat and +3.0% at the side view. On Gait3D, performance also increases from 46.0% to 47.3% (+1.3). In action recognition, CTR-GCN achieved +2.00% on Penn Action, while high-exertion classes like punching/slapping improved by +6.96%. Even in video captioning, Qwen2.5-VL's ROUGE-L score rose from 0.310 to 0.339 (+0.029), indicating that physics-inferred forces enhance temporal grounding and semantic richness. These results demonstrate that force cues can substantially complement visual and kinematic features under dynamic, occluded, or appearance-varying conditions.

[18] UTDesign: A Unified Framework for Stylized Text Editing and Generation in Graphic Design Images

Yiming Zhao, Yuanpeng Gao, Yuxuan Luo, Jiwei Duan, Shisong Lin, Longfei Xiong, Zhouhui Lian

🧩 TL;DR

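This paper proposes UTDesign, a unified framework for high-precision stylized text editing and conditional text generation in design images. It supports both English and Chinese scripts and combines a diffusion model with a multi-modal condition encoder to produce style-consistent, accurate text.

📘 Detailed Summary

Motivation: Current diffusion-based text-to-image models generate visual content well, but their text rendering, particularly for small-scale typography and non-Latin scripts such as Chinese, remains limited, which constrains the practical usefulness of AI-assisted graphic design.

Method: UTDesign has three core components. First, a DiT-based text style transfer model, trained from scratch on a synthetic dataset, generates transparent RGBA text foregrounds that preserve the style of reference glyphs. Second, training a multi-modal condition encoder extends the model into a conditional text generation framework that synthesizes accurate text conditioned on background images, prompts, and layout specifications. Third, the approach is integrated into a fully automated text-to-design pipeline that combines pre-trained text-to-image models with an MLLM-based layout planner.

Result: Extensive experiments show that UTDesign achieves state-of-the-art performance among open-source methods in stylistic consistency and text accuracy, and also exhibits unique advantages over proprietary commercial approaches, particularly in high-precision text generation for both English and Chinese scripts.

Conclusion: The study gives AI-assisted graphic design a unified text processing framework that markedly improves rendering quality for non-Latin scripts. Combining conditional generation with an automated pipeline yields a practical solution for real design work and opens a direction for multilingual text generation research.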

📄 Abstract

AI-assisted graphic design has emerged as a powerful tool for automating the creation and editing of design elements such as posters, banners, and advertisements. While diffusion-based text-to-image models have demonstrated strong capabilities in visual content generation, their text rendering performance, particularly for small-scale typography and non-Latin scripts, remains limited. In this paper, we propose UTDesign, a unified framework for high-precision stylized text editing and conditional text generation in design images, supporting both English and Chinese scripts. Our framework introduces a novel DiT-based text style transfer model trained from scratch on a synthetic dataset, capable of generating transparent RGBA text foregrounds that preserve the style of reference glyphs. We further extend this model into a conditional text generation framework by training a multi-modal condition encoder on a curated dataset with detailed text annotations, enabling accurate, style-consistent text synthesis conditioned on background images, prompts, and layout specifications. Finally, we integrate our approach into a fully automated text-to-design (T2D) pipeline by incorporating pre-trained text-to-image (T2I) models and an MLLM-based layout planner. Extensive experiments demonstrate that UTDesign achieves state-of-the-art performance among open-source methods in terms of stylistic consistency and text accuracy, and also exhibits unique advantages compared to proprietary commercial approaches. Code and data for this paper are available at https://github.com/ZYM-PKU/UTDesign.

[19] Bridging Modalities and Transferring Knowledge: Enhanced Multimodal Understanding and Recognition

Gorjan Radevski

🧩 TL;DR

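This manuscript presents a set of methods for multimodal alignment, translation, fusion, and transference. Five chapters address spatial language understanding, medical text navigation, knowledge graph linking, fusion for action recognition, and multimodal knowledge distillation, improving the ability of computational systems to process complex multimodal inputs.

📘 Detailed Summary

Motivation: The work addresses several key challenges in multimodal machine learning: translating text-based spatial relations into visual arrangements, mapping medical texts precisely onto an anatomical atlas, linking structured text to knowledge graph facts, multimodal fusion for video action recognition, and transferring knowledge so that unimodal models acquire multimodal capabilities. Together these aim to strengthen how computational systems understand and process complex multimodal inputs.

Method: The work proposes five core techniques. Chapter 3 develops the Spatial-Reasoning Bert model, which translates text-based spatial relations into 2D arrangements. Chapter 4 introduces a loss function that leverages spatial co-occurrences of medical terms to map medical texts to 3D locations in an anatomical atlas. Chapter 5 establishes a benchmark for linking structured text to knowledge graph entities and predicates. Chapter 6 proposes a multimodal fusion method that fuses video frame and object detection representations. Chapter 7 explores multimodal knowledge distillation, which lets RGB-only models mimic the performance of multimodal fusion models.

Result: The work reports several outcomes: effective decoding of spatial language into visual arrangements, laying groundwork for automated scene generation; significantly improved medical text navigability with interpretable mappings; a benchmark for knowledge graph linking that addresses ambiguities in text extraction; more robust and accurate action recognition; and, through knowledge distillation, RGB-only models that reduce computational requirements while maintaining performance.

Conclusion: The work makes systematic contributions to multimodal machine learning, advancing methodologies for spatial language understanding, medical text interpretation, knowledge graph enrichment, and action recognition. These methods strengthen the ability of computational systems to process complex multimodal inputs across diverse applications, and provide an effective route for multimodal knowledge transfer that balances performance against computational cost.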

📄 Abstract

This manuscript explores multimodal alignment, translation, fusion, and transference to enhance machine understanding of complex inputs. We organize the work into five chapters, each addressing unique challenges in multimodal machine learning. Chapter 3 introduces Spatial-Reasoning Bert for translating text-based spatial relations into 2D arrangements between clip-arts. This enables effective decoding of spatial language into visual representations, paving the way for automated scene generation aligned with human spatial understanding. Chapter 4 presents a method for translating medical texts into specific 3D locations within an anatomical atlas. We introduce a loss function leveraging spatial co-occurrences of medical terms to create interpretable mappings, significantly enhancing medical text navigability. Chapter 5 tackles translating structured text into canonical facts within knowledge graphs. We develop a benchmark for linking natural language to entities and predicates, addressing ambiguities in text extraction to provide clearer, actionable insights. Chapter 6 explores multimodal fusion methods for compositional action recognition. We propose a method fusing video frames and object detection representations, improving recognition robustness and accuracy. Chapter 7 investigates multimodal knowledge transference for egocentric action recognition. We demonstrate how multimodal knowledge distillation enables RGB-only models to mimic multimodal fusion-based capabilities, reducing computational requirements while maintaining performance. These contributions advance methodologies for spatial language understanding, medical text interpretation, knowledge graph enrichment, and action recognition, enhancing computational systems' ability to process complex, multimodal inputs across diverse applications.
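The abstract does not state which distillation objective Chapter 7 uses. As a point of reference, the widely used temperature-softened distillation loss (a KL divergence between teacher and student class distributions) can be sketched as below. Treating the multimodal fusion model as the teacher and the RGB-only model as the student is an assumption made for illustration, and the function names are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    student_logits: List[float],
    teacher_logits: List[float],
    temperature: float = 2.0,
) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # teacher (e.g. multimodal fusion model)
    q = softmax(student_logits, temperature)  # student (e.g. RGB-only model)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


# Identical logits give zero loss; mismatched logits give a positive loss.
same = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
different = distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
```

In practice this term is usually added to the ordinary supervised loss, so the student learns from both the labels and the teacher's softened predictions.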

[20] Multi-Grained Text-Guided Image Fusion for Multi-Exposure and Multi-Focus Scenarios

Mingwei Tang, Jiahao Nie, Guang Yang, Ziqing Cui, Jie Li

🧩 TL;DR

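This paper proposes Multi-grained Text-guided Image Fusion (MTIF), which introduces multi-level textual descriptions covering fine details, structure, and semantics, combined with a hierarchical cross-modal modulation module, to significantly improve multi-exposure and multi-focus image fusion quality.

📘 Detailed Summary

Motivation: Existing vision-language-model-based image fusion methods typically use coarse-grained textual descriptions as auxiliary guidance. This limits the model's understanding of fine-grained image details and complicates cross-modal alignment, which caps fusion quality.

Method: MTIF has three key designs. First, multi-grained textual descriptions separately capture detail, structure, and semantic information and guide fusion through a hierarchical cross-modal modulation module. Second, supervision signals at each granularity level promote alignment between visual and textual features. Third, a saliency-driven enrichment module augments the training data with dense semantic content.

Result: Extensive experiments show that MTIF consistently outperforms previous methods on both multi-exposure and multi-focus image fusion tasks, supporting the effectiveness of multi-grained textual descriptions and hierarchical cross-modal modulation for fusion quality.

Conclusion: The study demonstrates the importance of multi-grained text guidance in image fusion: hierarchically integrating fine-grained, structural, and semantic descriptions enables more precise cross-modal alignment and offers a new fusion paradigm for text-guided vision tasks.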

📄 Abstract

Image fusion aims to synthesize a single high-quality image from a pair of inputs captured under challenging conditions, such as differing exposure levels or focal depths. A core challenge lies in effectively handling disparities in dynamic range and focus depth between the inputs. With the advent of vision-language models, recent methods incorporate textual descriptions as auxiliary guidance to enhance fusion quality. However, simply incorporating coarse-grained descriptions hampers the understanding of fine-grained details and poses challenges for precise cross-modal alignment. To address these limitations, we propose Multi-grained Text-guided Image Fusion (MTIF), a novel fusion paradigm with three key designs. First, it introduces multi-grained textual descriptions that separately capture fine details, structural cues, and semantic content, guiding image fusion through a hierarchical cross-modal modulation module. Second, it involves supervision signals at each granularity to facilitate alignment between visual and textual features and enhance the utility of auxiliary text. Third, it adopts a saliency-driven enrichment module to augment training data with dense semantic content, further strengthening the cross-modal modulation and alignment. Extensive experiments show that MTIF consistently outperforms previous methods on both multi-exposure and multi-focus image fusion tasks.

[21] Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models

Shengchao Zhou, Yuxin Chen, Yuying Ge, Wei Huang, Jiehong Lin, Ying Shan, Xiaojuan Qi

🧩 TL;DR

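This paper presents DSR Suite, a dataset, benchmark, and model enhancement suite for dynamic spatial reasoning (DSR). An automated pipeline generates geometry-aware question-answer pairs from in-the-wild videos, and a lightweight Geometry Selection Module integrates geometric priors into vision-language models, significantly improving performance on dynamic spatial reasoning tasks.

📘 Detailed Summary

Motivation: Vision-language models excel at general understanding but remain weak at dynamic spatial reasoning (DSR), that is, reasoning about how object geometry and relationships in 3D space evolve over time, largely because scalable 4D-aware training resources are scarce. Prior work does not adequately cover in-the-wild video sources, object- and scene-level 3D requirements, viewpoint transformations, multi-object interactions, and fine-grained procedural answers.

Method: DSR Suite has three core components. First, an automated pipeline generates multiple-choice question-answer pairs for DSR from in-the-wild videos, using modern vision foundation models to extract rich geometric and motion information, including camera poses, local point clouds, object masks, orientations, and 3D trajectories. Second, these cues are used to build the DSR-Train dataset for training and the human-refined DSR-Bench benchmark for evaluation. Third, a lightweight Geometry Selection Module (GSM) integrates geometric priors into vision-language models: it condenses question semantics and extracts question-relevant knowledge from pretrained 4D reconstruction priors into a compact set of geometry tokens.

Result: Experiments show that integrating DSR-Train and GSM into Qwen2.5-VL-7B significantly enhances its dynamic spatial reasoning capability while maintaining accuracy on general video understanding benchmarks. Compared with previous work, the data emphasize in-the-wild video sources, object- and scene-level 3D requirements, viewpoint transformations, multi-object interactions, and fine-grained procedural answers.

Conclusion: DSR Suite addresses dynamic spatial reasoning at the dataset, benchmark, and model levels and demonstrates that geometric priors can be integrated effectively into vision-language models. The Geometry Selection Module avoids overwhelming the model with irrelevant knowledge by extracting only what the question needs, providing a useful reference for future 4D-aware vision-language models.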

📄 Abstract

Vision-language models (VLM) excel at general understanding yet remain weak at dynamic spatial reasoning (DSR), i.e., reasoning about the evolvement of object geometry and relationship in 3D space over time, largely due to the scarcity of scalable 4D-aware training resources. To bridge this gap across aspects of dataset, benchmark and model, we introduce DSR Suite. First, we propose an automated pipeline that generates multiple-choice question-answer pairs from in-the-wild videos for DSR. By leveraging modern vision foundation models, the pipeline extracts rich geometric and motion information, including camera poses, local point clouds, object masks, orientations, and 3D trajectories. These geometric cues enable the construction of DSR-Train for learning and further human-refined DSR-Bench for evaluation. Compared with previous works, our data emphasize (i) in-the-wild video sources, (ii) object- and scene-level 3D requirements, (iii) viewpoint transformations, (iv) multi-object interactions, and (v) fine-grained, procedural answers. Beyond data, we propose a lightweight Geometry Selection Module (GSM) to seamlessly integrate geometric priors into VLMs, which condenses question semantics and extracts question-relevant knowledge from pretrained 4D reconstruction priors into a compact set of geometry tokens. This targeted extraction avoids overwhelming the model with irrelevant knowledge. Experiments show that integrating DSR-Train and GSM into Qwen2.5-VL-7B significantly enhances its dynamic spatial reasoning capability, while maintaining accuracy on general video understanding benchmarks.

[22] FlashVLM: Text-Guided Visual Token Selection for Large Multimodal Models

Kaitong Cai, Jusheng Zhang, Jing Yang, Yijia Fan, Pengtao Xie, Jian Wang, Keze Wang

🧩 TL;DR

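This paper proposes FlashVLM, a text-guided visual token selection framework that dynamically adapts visual inputs to the query. It achieves beyond-lossless compression while pruning a large share of visual tokens, providing efficient and robust inference for large vision-language models.

📘 Detailed Summary

Motivation: Large vision-language models typically process hundreds to thousands of visual tokens, incurring quadratic attention cost and substantial redundancy. Existing token reduction methods often ignore the textual query or rely on unstable deep attention maps, which degrades semantic alignment under aggressive pruning.

Method: FlashVLM computes an explicit cross-modal similarity between projected image tokens and normalized text embeddings in the language model space. This extrinsic relevance is fused with intrinsic visual saliency using log-domain weighting and temperature-controlled sharpening, and a diversity-preserving partition retains a minimal yet representative set of background tokens to maintain global context.

Result: Under identical token budgets and evaluation protocols, FlashVLM achieves beyond-lossless compression on LLaVA 1.5, slightly surpassing the unpruned baseline while pruning up to 77.8% of visual tokens, and maintains 92.8% accuracy even at 94.4% compression. Experiments on 14 image and video benchmarks show state-of-the-art efficiency-performance trade-offs.

Conclusion: The study shows that explicit cross-modal similarity and a diversity-preserving mechanism can maintain or even improve model performance while sharply reducing visual tokens, giving vision-language models a robust, generalizable route to efficient inference with practical deployment value.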

📄 Abstract

Large vision-language models (VLMs) typically process hundreds or thousands of visual tokens per image or video frame, incurring quadratic attention cost and substantial redundancy. Existing token reduction methods often ignore the textual query or rely on deep attention maps, whose instability under aggressive pruning leads to degraded semantic alignment. We propose FlashVLM, a text-guided visual token selection framework that dynamically adapts visual inputs to the query. Instead of relying on noisy attention weights, FlashVLM computes an explicit cross-modal similarity between projected image tokens and normalized text embeddings in the language model space. This extrinsic relevance is fused with intrinsic visual saliency using log-domain weighting and temperature-controlled sharpening. In addition, a diversity-preserving partition retains a minimal yet representative set of background tokens to maintain global context. Under identical token budgets and evaluation protocols, FlashVLM achieves beyond lossless compression, slightly surpassing the unpruned baseline while pruning up to 77.8 percent of visual tokens on LLaVA 1.5, and maintaining 92.8 percent accuracy even under 94.4 percent compression. Extensive experiments on 14 image and video benchmarks demonstrate that FlashVLM delivers state-of-the-art efficiency-performance trade-offs while maintaining strong robustness and generalization across mainstream VLMs.
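A rough sketch of query-aware token scoring in the spirit of the abstract is given below. The exact fusion formula is not stated in the abstract, so everything here is an assumption made for illustration: cosine similarity is rescaled to [0, 1] as the relevance term, `alpha` weights relevance against saliency in the log domain, a softmax with `temperature` performs the sharpening, and the diversity-preserving background partition is omitted. The name `select_tokens` is hypothetical.

```python
import math
from typing import List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_tokens(
    image_tokens: List[List[float]],
    text_embedding: List[float],
    saliency: List[float],
    keep: int,
    alpha: float = 0.5,
    temperature: float = 0.5,
    eps: float = 1e-6,
) -> Tuple[List[int], List[float]]:
    """Return indices of the `keep` highest-scoring tokens and all token weights."""
    fused = []
    for token, sal in zip(image_tokens, saliency):
        # Extrinsic relevance: cosine similarity mapped from [-1, 1] to [0, 1].
        relevance = (cosine(token, text_embedding) + 1.0) / 2.0
        # Log-domain weighted fusion of relevance and intrinsic saliency.
        log_score = alpha * math.log(relevance + eps) + (1 - alpha) * math.log(sal + eps)
        fused.append(log_score / temperature)  # lower temperature sharpens the ranking
    peak = max(fused)
    weights = [math.exp(s - peak) for s in fused]
    total = sum(weights)
    probs = [w / total for w in weights]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return sorted(ranked[:keep]), probs


tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.9, 0.1]]
kept, weights = select_tokens(tokens, [1.0, 0.0], [0.5, 0.5, 0.5, 0.5], keep=2)
```

With uniform saliency, the two tokens most aligned with the text embedding (indices 0 and 3) are the ones kept.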

[23] SpatialTree: How Spatial Abilities Branch Out in MLLMs

Yuxi Xiao, Longfei Li, Shen Yan, Xinhang Liu, Sida Peng, Yunchao Wei, Xiaowei Zhou, Bingyi Kang

🧩 TL;DR

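This study presents SpatialTree, a cognitive-science-inspired taxonomy that organizes spatial abilities into four levels, and builds the first hierarchical benchmark for evaluating multimodal large language models on it. It finds that spatial abilities in MLLMs show a clear hierarchical structure, reveals cross-level transfer dynamics, and proposes an auto-think strategy that lets reinforcement learning improve performance at all levels.

📘 Detailed Summary

Motivation: The study addresses the poor understanding of how spatial abilities are organized hierarchically in multimodal large language models (MLLMs). Most current work focuses on a narrow set of tasks and lacks a systematic, cognitive-science-grounded view of how spatial ability progresses from low-level perception to high-level reasoning and interaction, which hinders comprehensive evaluation and systematic improvement of spatial abilities in MLLMs.

Method: SpatialTree is a cognitive-science-inspired taxonomy with four levels of spatial ability: low-level perception (L1), mental mapping (L2), simulation (L3), and agentic competence (L4). Based on this taxonomy, the authors construct the first capability-centric hierarchical benchmark and evaluate mainstream MLLMs across 27 sub-abilities. They also study transfer dynamics under supervised fine-tuning and propose a simple auto-think strategy that suppresses unnecessary deliberation, so that reinforcement learning consistently improves performance across all levels.

Result: The evaluation reveals a clear structure: L1 skills are largely orthogonal, whereas higher-level skills are strongly correlated, indicating growing interdependency at higher levels. Targeted supervised fine-tuning uncovers a surprising transfer dynamic: negative transfer within L1, but strong cross-level transfer from low- to high-level abilities with notable synergy. Naive reinforcement learning that encourages extensive "thinking" proves unreliable, helping complex reasoning but hurting intuitive perception, whereas the proposed auto-think strategy lets reinforcement learning improve performance consistently across all levels.

Conclusion: By building SpatialTree, the study provides a proof-of-concept framework for understanding and systematically scaling spatial abilities in MLLMs. It exposes the hierarchical structure of these abilities, identifies cross-level transfer dynamics, and proposes an effective optimization strategy, offering insight for building more comprehensive hierarchical evaluations of multimodal abilities and for designing targeted training methods.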

📄 Abstract

Cognitive science suggests that spatial ability develops progressively, from perception to reasoning and interaction. Yet in multimodal LLMs (MLLMs), this hierarchy remains poorly understood, as most studies focus on a narrow set of tasks. We introduce SpatialTree, a cognitive-science-inspired hierarchy that organizes spatial abilities into four levels: low-level perception (L1), mental mapping (L2), simulation (L3), and agentic competence (L4). Based on this taxonomy, we construct the first capability-centric hierarchical benchmark, thoroughly evaluating mainstream MLLMs across 27 sub-abilities. The evaluation results reveal a clear structure: L1 skills are largely orthogonal, whereas higher-level skills are strongly correlated, indicating increasing interdependency. Through targeted supervised fine-tuning, we uncover a surprising transfer dynamic: negative transfer within L1, but strong cross-level transfer from low- to high-level abilities with notable synergy. Finally, we explore how to improve the entire hierarchy. We find that naive RL that encourages extensive "thinking" is unreliable: it helps complex reasoning but hurts intuitive perception. We propose a simple auto-think strategy that suppresses unnecessary deliberation, enabling RL to consistently improve performance across all levels. By building SpatialTree, we provide a proof-of-concept framework for understanding and systematically scaling spatial abilities in MLLMs.

cs.CL [Back]

[24] HARMON-E: Hierarchical Agentic Reasoning for Multimodal Oncology Notes to Extract Structured Data

Shashi Kant Gupta, Arijeet Pramanik, Jerrin John Thomas, Regina Schwind, Lauren Wiener, Avi Raju, Jeremy Kornbluth, Yanshan Wang, Zhaohui Su, Hrituraj Singh

🧩 TL;DR

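This study proposes a large-language-model-based agentic framework for extracting structured oncology data from unstructured electronic health records. On a large real-world dataset of over 400,000 clinical documents, it achieves an average F1-score of 0.93 and significantly reduces manual annotation costs.

📘 Detailed Summary

Motivation: Unstructured clinical notes in the electronic health record contain rich information about cancer treatment, but specialized terminology, inconsistent document formats, and contradictory information make extraction difficult. Existing automated approaches are usually confined to narrow scenarios or specific variables and cannot synthesize patient-level data across documents, while manual abstraction is costly and does not scale.

Method: The study proposes an agentic framework that decomposes complex oncology data extraction into modular, adaptive subtasks. Large language models act as reasoning agents, equipped with context-sensitive retrieval and iterative synthesis capabilities, to comprehensively extract structured clinical variables from real-world oncology notes.

Result: On a large-scale dataset of over 400,000 unstructured clinical notes and scanned PDF reports spanning 2,250 cancer patients, the method achieves an average F1-score of 0.93. Of 103 oncology-specific clinical variables, 100 exceed 0.85, and critical variables such as biomarkers and medications surpass 0.95. Integrated into a data curation workflow, the system obtained a direct manual approval rate of 0.94.

Conclusion: To the authors' knowledge, this is the first exhaustive, end-to-end application of LLM-based agents to structured oncology data extraction at scale. It demonstrates that an agentic framework is effective for complex medical information extraction, offers a workable route to automating large-scale clinical data processing, and reduces manual cost while improving the completeness and accuracy of extraction.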

📄 Abstract

Unstructured notes within the electronic health record (EHR) contain rich clinical information vital for cancer treatment decision making and research, yet reliably extracting structured oncology data remains challenging due to extensive variability, specialized terminology, and inconsistent document formats. Manual abstraction, although accurate, is prohibitively costly and unscalable. Existing automated approaches typically address narrow scenarios (using synthetic datasets, restricting focus to document-level extraction, or isolating specific clinical variables such as staging, biomarkers, and histology) and do not adequately handle patient-level synthesis across the large number of clinical documents containing contradictory information. In this study, we propose an agentic framework that systematically decomposes complex oncology data extraction into modular, adaptive tasks. Specifically, we use large language models (LLMs) as reasoning agents, equipped with context-sensitive retrieval and iterative synthesis capabilities, to exhaustively and comprehensively extract structured clinical variables from real-world oncology notes. Evaluated on a large-scale dataset of over 400,000 unstructured clinical notes and scanned PDF reports spanning 2,250 cancer patients, our method achieves an average F1-score of 0.93, with 100 out of 103 oncology-specific clinical variables exceeding 0.85, and critical variables (e.g., biomarkers and medications) surpassing 0.95. Moreover, integration of the agentic system into a data curation workflow resulted in a direct manual approval rate of 0.94, significantly reducing annotation costs. To our knowledge, this constitutes the first exhaustive, end-to-end application of LLM-based agents for structured oncology data extraction at scale.

[25] How well do Large Language Models Recognize Instructional Moves? Establishing Baselines for Foundation Models in Educational Discourse

Kirk Vanacore, Rene F. Kizilcec

🧩 TL;DR

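This study evaluates how well large language models interpret authentic educational scenarios without customization. Foundation models show meaningful yet limited capacity to classify instructional moves; prompt design improves performance but does not eliminate fundamental reliability constraints.

📘 Detailed Summary

Motivation: As large language models spread through educational technology, prior work has concentrated on adapting or optimizing models for specific tasks, and far less is known about how well they interpret authentic educational scenarios without significant customization. With LLM-based systems widely adopted by learners and educators, understanding their out-of-the-box capabilities is important for setting expectations and benchmarking.

Method: The study compares six large language models on baseline performance in classifying instructional moves in authentic classroom transcripts. It evaluates typical prompting methods (zero-shot, one-shot, and few-shot) and measures classification performance against expert-coded annotations.

Result: Zero-shot performance was moderate, while few-shot prompting with comprehensive examples significantly improved performance for state-of-the-art models, with the strongest configuration reaching Cohen's Kappa = 0.58. Improvements were neither uniform nor complete: performance varied by instructional move, and higher recall frequently came at the cost of more false positives.

Conclusion: Foundation models show meaningful yet limited capacity to interpret instructional discourse; prompt design helps surface capability but does not eliminate fundamental reliability constraints. The findings underline the need to evaluate LLM performance carefully in real educational applications and point to inconsistent performance across instructional moves as an open problem for future work.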

📄 Abstract

Large language models (LLMs) are increasingly adopted in educational technologies for a variety of tasks, from generating instructional materials and assisting with assessment design to tutoring. While prior work has investigated how models can be adapted or optimized for specific tasks, far less is known about how well LLMs perform at interpreting authentic educational scenarios without significant customization. As LLM-based systems become widely adopted by learners and educators in everyday academic contexts, understanding their out-of-the-box capabilities is increasingly important for setting expectations and benchmarking. We compared six LLMs to estimate their baseline performance on a simple but important task: classifying instructional moves in authentic classroom transcripts. We evaluated typical prompting methods: zero-shot, one-shot, and few-shot prompting. We found that while zero-shot performance was moderate, providing comprehensive examples (few-shot prompting) significantly improved performance for state-of-the-art models, with the strongest configuration reaching Cohen's Kappa = 0.58 against expert-coded annotations. At the same time, improvements were neither uniform nor complete: performance varied considerably by instructional move, and higher recall frequently came at the cost of increased false positives. Overall, these findings indicate that foundation models demonstrate meaningful yet limited capacity to interpret instructional discourse, with prompt design helping to surface capability but not eliminating fundamental reliability constraints.
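For readers unfamiliar with the agreement metric reported above, Cohen's kappa compares observed agreement with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it for two label sequences. The function name and the toy labels are illustrative and are not taken from the paper's data.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa between two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy example: expert codes vs. model predictions for four utterances.
expert = ["question", "question", "explain", "explain"]
model = ["question", "explain", "explain", "explain"]
kappa = cohens_kappa(expert, model)  # observed 0.75, chance 0.5, kappa 0.5
```

Here raw agreement is 75%, but half of that would be expected by chance, so kappa is 0.5. This is why a kappa of 0.58 indicates weaker agreement than the same figure would as plain accuracy.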

Zhixiang Lu, Xueyuan Deng, Yiran Liu, Yulong Li, Qiang Yan, Imran Razzak, Jionglong Su

🧩 TL;DR

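This paper proposes PRISM, a hybrid framework that couples stochastic differential equations with a personality-conditional partially observable Markov decision process to model psychological heterogeneity in online polarization. It markedly improves personality consistency and reproduces emergent phenomena such as rational suppression and affective resonance.

📘 Detailed Summary

Motivation: Traditional agent-based models of opinion dynamics rely on simplistic homogeneity assumptions and fail to capture the psychological heterogeneity that drives online polarization. This obscures the interplay between individual cognitive biases and information propagation and hinders a mechanistic understanding of how ideological divides are amplified.

Method: The paper introduces the Personality-Refracted Intelligent Simulation Model (PRISM), a hybrid framework that couples stochastic differential equations (SDE) for continuous emotional evolution with a personality-conditional partially observable Markov decision process (PC-POMDP) for discrete decision-making. The model assigns Myers-Briggs Type Indicator (MBTI) based cognitive policies to multimodal large language model agents, initialized via data-driven priors from large-scale social media datasets.

Result: PRISM achieves personality consistency aligned with human ground truth, significantly outperforming standard homogeneous and Big Five benchmarks. The framework reproduces emergent phenomena such as rational suppression and affective resonance, offering a robust tool for analyzing complex social media ecosystems.

Conclusion: The study provides a robust framework for analyzing complex social media ecosystems. By combining continuous emotional evolution with discrete decision-making, it gives a clearer account of how individual psychological heterogeneity drives online polarization and adds a new methodological tool for computational modeling in the social sciences.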

📄 Abstract

Traditional agent-based models (ABMs) of opinion dynamics often fail to capture the psychological heterogeneity driving online polarization due to simplistic homogeneity assumptions. This limitation obscures the critical interplay between individual cognitive biases and information propagation, thereby hindering a mechanistic understanding of how ideological divides are amplified. To address this challenge, we introduce the Personality-Refracted Intelligent Simulation Model (PRISM), a hybrid framework coupling stochastic differential equations (SDE) for continuous emotional evolution with a personality-conditional partially observable Markov decision process (PC-POMDP) for discrete decision-making. In contrast to continuous trait approaches, PRISM assigns distinct Myers-Briggs Type Indicator (MBTI) based cognitive policies to multimodal large language model (MLLM) agents, initialized via data-driven priors from large-scale social media datasets. PRISM achieves superior personality consistency aligned with human ground truth, significantly outperforming standard homogeneous and Big Five benchmarks. This framework effectively replicates emergent phenomena such as rational suppression and affective resonance, offering a robust tool for analyzing complex social media ecosystems.
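The abstract does not give the form of the SDE used for emotional evolution. Purely as an illustration of how a continuous emotional state can be simulated, the sketch below applies the Euler-Maruyama scheme to a mean-reverting (Ornstein-Uhlenbeck) process, dX = theta (mu - X) dt + sigma dW. The choice of process, the parameter names, and `simulate_emotion` are all assumptions rather than the PRISM formulation.

```python
import math
import random
from typing import List


def simulate_emotion(
    x0: float,
    theta: float,
    mu: float,
    sigma: float,
    dt: float,
    steps: int,
    seed: int = 0,
) -> List[float]:
    """Euler-Maruyama path of dX = theta * (mu - X) dt + sigma dW."""
    rng = random.Random(seed)  # seeded for reproducibility
    x = x0
    path = [x]
    for _ in range(steps):
        # Brownian increment over a step of length dt has std sqrt(dt).
        dw = rng.gauss(0.0, 1.0) * math.sqrt(dt)
        x = x + theta * (mu - x) * dt + sigma * dw
        path.append(x)
    return path


# With sigma = 0 the path decays deterministically toward the baseline mu.
calm = simulate_emotion(x0=1.0, theta=1.0, mu=0.0, sigma=0.0, dt=0.1, steps=3)
# With sigma > 0 the same agent fluctuates around the baseline.
noisy = simulate_emotion(x0=1.0, theta=1.0, mu=0.0, sigma=0.3, dt=0.1, steps=50)
```

In a personality-conditioned setup, parameters such as `theta` (how quickly emotion relaxes) or `sigma` (volatility) could differ per agent type, with a discrete decision process reading the current state at each step.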

[27] M$^3$KG-RAG: Multi-hop Multimodal Knowledge Graph-enhanced Retrieval-Augmented Generation

Hyeongcheol Park, Jiyoung Seo, Jaewon Mun, Hogun Park, Wonmin Byeon, Sung June Kim, Hyeonsoo Im, JeungSub Lee, Sangpil Kim

🧩 TL;DR

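This paper proposes M$^3$KG-RAG, a multi-hop multimodal knowledge graph-enhanced retrieval-augmented generation method. By constructing context-enriched triplets of multimodal entities and introducing the GRASP mechanism, it significantly improves the reasoning depth and answer faithfulness of multimodal large language models in the audio-visual domain.

📘 Detailed Summary

Motivation: Multimodal retrieval-augmented generation in the audio-visual domain faces two challenges. First, existing multimodal knowledge graphs have limited modality coverage and multi-hop connectivity. Second, retrieval based solely on similarity in a shared multimodal embedding space fails to filter out off-topic or redundant knowledge, which limits reasoning depth and answer faithfulness.

Method: M$^3$KG-RAG has two core components. First, a lightweight multi-agent pipeline constructs a multi-hop multimodal knowledge graph containing context-enriched triplets of multimodal entities, enabling modality-wise retrieval based on input queries. Second, GRASP (Grounded Retrieval And Selective Pruning) ensures precise entity grounding to the query, evaluates answer-supporting relevance, and prunes redundant context so that only knowledge essential for response generation is retained.

Result: Extensive experiments across diverse multimodal benchmarks show that M$^3$KG-RAG significantly enhances the multimodal reasoning and grounding of multimodal large language models over existing approaches, with clear improvements in knowledge retrieval and answer generation quality in the audio-visual domain.

Conclusion: The study demonstrates the effectiveness of combining a multi-hop multimodal knowledge graph with selective retrieval pruning, providing a new technical path for multimodal retrieval-augmented generation. Future work could extend the approach to broader multimodal applications and further optimize knowledge graph construction and retrieval efficiency.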

📄 Abstract

Retrieval-Augmented Generation (RAG) has recently been extended to multimodal settings, connecting multimodal large language models (MLLMs) with vast corpora of external knowledge such as multimodal knowledge graphs (MMKGs). Despite their recent success, multimodal RAG in the audio-visual domain remains challenging due to 1) limited modality coverage and multi-hop connectivity of existing MMKGs, and 2) retrieval based solely on similarity in a shared multimodal embedding space, which fails to filter out off-topic or redundant knowledge. To address these limitations, we propose M$^3$KG-RAG, a Multi-hop Multimodal Knowledge Graph-enhanced RAG that retrieves query-aligned audio-visual knowledge from MMKGs, improving reasoning depth and answer faithfulness in MLLMs. Specifically, we devise a lightweight multi-agent pipeline to construct multi-hop MMKG (M$^3$KG), which contains context-enriched triplets of multimodal entities, enabling modality-wise retrieval based on input queries. Furthermore, we introduce GRASP (Grounded Retrieval And Selective Pruning), which ensures precise entity grounding to the query, evaluates answer-supporting relevance, and prunes redundant context to retain only knowledge essential for response generation. Extensive experiments across diverse multimodal benchmarks demonstrate that M$^3$KG-RAG significantly enhances MLLMs' multimodal reasoning and grounding over existing approaches.

[28] Retrieval-augmented Prompt Learning for Pre-trained Foundation Models

Xiang Chen, Yixin Ou, Quan Feng, Lei Li, Piji Li, Haibo Ye, Sheng-Jun Huang, Shuofei Qiao, Shumin Deng, Huajun Chen, Ningyu Zhang

🧩 TL;DR

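This paper proposes RetroPrompt, which introduces a retrieval mechanism and decouples knowledge from memorization to address the over-reliance on rote memorization in conventional prompt learning, achieving better performance in zero-shot and few-shot scenarios.

📘 Detailed Summary

Motivation: Conventional prompt learning still follows a parametric learning paradigm, so its generalization can be unstable when it leans on memorization and rote learning. It struggles to make full use of atypical instances and to avoid overfitting to shallow patterns when data are limited.

Method: RetroPrompt decouples knowledge from memorization. It builds a publicly accessible knowledge base from the training data and applies a retrieval mechanism throughout the input, training, and inference stages, so the model can actively retrieve relevant contextual information from the corpus to enrich the available cues.

Result: In experiments on a variety of datasets across natural language processing and computer vision tasks, RetroPrompt shows superior performance in both zero-shot and few-shot scenarios. Analysis of memorization patterns confirms that it reduces reliance on rote memorization.

Conclusion: RetroPrompt balances memorization and generalization through retrieval augmentation, offering a new paradigm for prompt learning with pre-trained foundation models and indicating that decoupling knowledge from memorization can markedly improve generalization when data are scarce.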

📄 Abstract

Pre-trained foundation models (PFMs) have become essential for facilitating large-scale multimodal learning. Researchers have effectively employed the "pre-train, prompt, and predict" paradigm through prompt learning to induce improved few-shot performance. However, prompt learning approaches for PFMs still follow a parametric learning paradigm. As such, the stability of generalization in memorization and rote learning can be compromised. More specifically, conventional prompt learning might face difficulties in fully utilizing atypical instances and avoiding overfitting to shallow patterns with limited data during the process of fully-supervised training. To overcome these constraints, we present our approach, named RetroPrompt, which aims to achieve a balance between memorization and generalization by decoupling knowledge from mere memorization. Unlike traditional prompting methods, RetroPrompt leverages a publicly accessible knowledge base generated from the training data and incorporates a retrieval mechanism throughout the input, training, and inference stages. This enables the model to actively retrieve relevant contextual information from the corpus, thereby enhancing the available cues. We conduct comprehensive experiments on a variety of datasets across natural language processing and computer vision tasks to demonstrate the superior performance of our proposed approach, RetroPrompt, in both zero-shot and few-shot scenarios. Through detailed analysis of memorization patterns, we observe that RetroPrompt effectively reduces the reliance on rote memorization, leading to enhanced generalization.
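One common way to use a retrieval store at inference time is to interpolate the model's class probabilities with a distribution derived from the nearest stored training examples. The sketch below shows that generic pattern; it is not claimed to be RetroPrompt's actual mechanism. The distance-based weighting, the mixing weight `lam`, and the function names are assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def knn_label_distribution(
    query: Sequence[float],
    store: List[Tuple[Sequence[float], str]],
    labels: List[str],
    k: int = 2,
) -> Dict[str, float]:
    """Label distribution from the k nearest stored (embedding, label) pairs."""
    nearest = sorted((math.dist(query, emb), lab) for emb, lab in store)[:k]
    votes = {lab: 0.0 for lab in labels}
    for distance, lab in nearest:
        votes[lab] += math.exp(-distance)  # closer neighbours count more
    total = sum(votes.values())
    return {lab: v / total for lab, v in votes.items()}


def interpolate(
    model_probs: Dict[str, float], knn_probs: Dict[str, float], lam: float = 0.5
) -> Dict[str, float]:
    """Mix parametric predictions with retrieved evidence."""
    return {lab: (1 - lam) * model_probs[lab] + lam * knn_probs[lab] for lab in model_probs}


# Hypothetical store built from training examples (embedding, label).
store = [([0.0, 0.0], "neg"), ([0.0, 1.0], "neg"), ([5.0, 5.0], "pos")]
knn_probs = knn_label_distribution([0.0, 0.2], store, labels=["neg", "pos"], k=2)
mixed = interpolate({"neg": 0.4, "pos": 0.6}, knn_probs, lam=0.5)
```

In this toy case the model alone leans toward "pos", but both nearest training examples are "neg", so the mixed prediction flips to "neg". That is the kind of correction retrieval can supply for atypical instances.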

[29] Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits

Amirhosein Ghasemabadi, Di Niu

🧩 TL;DR

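This paper proposes Gnosis, a lightweight self-awareness mechanism that lets frozen large language models perform intrinsic self-verification by decoding signals from hidden states and attention patterns, in order to predict their own generation errors. It adds only about 5M parameters and operates independently of sequence length.

📘 Detailed Summary

Motivation: Large language models generate fluent and complex outputs but often fail to recognize their own mistakes and hallucinations. Existing approaches rely on external judges, multi-sample consistency, or text-based self-critique, which either incur additional compute or correlate weakly with true correctness. The study asks whether LLMs can predict their own failures by inspecting internal states during inference.

Method: Gnosis is a lightweight self-awareness mechanism that enables frozen LLMs to perform intrinsic self-verification. It passively observes internal traces, compresses them into fixed-budget descriptors, and predicts correctness at negligible inference cost. The method adds only about 5M parameters, operates independently of sequence length, and decodes signals from hidden states and attention patterns.

Result: Across math reasoning, open-domain question answering, and academic knowledge benchmarks, and over frozen backbones ranging from 1.7B to 20B parameters, Gnosis consistently outperforms strong internal baselines and large external judges in both accuracy and calibration. It generalizes zero-shot to partial generations, enabling early detection of failing trajectories and compute-aware control.

Conclusion: The results show that reliable correctness cues are intrinsic to the generation process and can be extracted efficiently without external supervision. Gnosis shows that the internal states of LLMs carry rich self-assessment information, opening a route to more reliable, self-monitoring language model systems that remain computationally efficient.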

📄 Abstract

Large language models (LLMs) generate fluent and complex outputs but often fail to recognize their own mistakes and hallucinations. Existing approaches typically rely on external judges, multi-sample consistency, or text-based self-critique, which incur additional compute or correlate weakly with true correctness. We ask: can LLMs predict their own failures by inspecting internal states during inference? We introduce Gnosis, a lightweight self-awareness mechanism that enables frozen LLMs to perform intrinsic self-verification by decoding signals from hidden states and attention patterns. Gnosis passively observes internal traces, compresses them into fixed-budget descriptors, and predicts correctness with negligible inference cost, adding only ~5M parameters and operating independently of sequence length. Across math reasoning, open-domain question answering, and academic knowledge benchmarks, and over frozen backbones ranging from 1.7B to 20B parameters, Gnosis consistently outperforms strong internal baselines and large external judges in both accuracy and calibration. Moreover, it generalizes zero-shot to partial generations, enabling early detection of failing trajectories and compute-aware control. These results show that reliable correctness cues are intrinsic to the generation process and can be extracted efficiently without external supervision.
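
The idea of a fixed-budget descriptor that does not grow with sequence length can be illustrated with a small sketch. The abstract does not say how Gnosis builds its descriptors, so the mean and max pooling below, along with the function name `fixed_budget_descriptor`, are assumptions chosen only to show length independence.

```python
from typing import List


def fixed_budget_descriptor(hidden_states: List[List[float]]) -> List[float]:
    """Pool a tokens-by-dims matrix into one vector of length 2 * dims.

    Mean and max pooling are illustrative assumptions, not the paper's method.
    """
    if not hidden_states:
        raise ValueError("need at least one token")
    n_tokens = len(hidden_states)
    dims = len(hidden_states[0])
    # Per-dimension mean over all tokens.
    means = [sum(row[d] for row in hidden_states) / n_tokens for d in range(dims)]
    # Per-dimension max over all tokens.
    maxes = [max(row[d] for row in hidden_states) for d in range(dims)]
    return means + maxes


# The descriptor length depends on the hidden size only, not on token count.
short_trace = [[0.0, 1.0], [2.0, 3.0]]
long_trace = short_trace * 50
same_length = len(fixed_budget_descriptor(short_trace)) == len(fixed_budget_descriptor(long_trace))
```

A small classifier trained on such vectors would then predict correctness; that part is omitted here because the abstract gives no detail on it.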

[30] Cube Bench: A Benchmark for Spatial Visual Reasoning in MLLMs

Dhruv Anand, Ehsan Shareghi

🧩 TL;DR

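This work introduces Cube Bench, a benchmark that evaluates the spatial and sequential reasoning of multimodal large language models on Rubik's-cube solving; a systematic evaluation over five decomposed skills reveals the models' limitations in complex sequential decision-making.

📘 Detailed Summary

Motivation: There is currently no standardized benchmark that systematically evaluates the spatial and sequential reasoning of multimodal large language models, especially on complex tasks that require multi-step decisions and error recovery; existing evaluation methods struggle to measure a model's actual reasoning ability comprehensively.

Method: The study designs Cube Bench, which decomposes Rubik's-cube solving into five core skills: reconstructing cube faces from images and text, choosing the optimal next move, predicting the outcome of a candidate move without applying it, executing multi-step plans while recovering from mistakes, and detecting and revising one's own errors. It uses a shared set of scrambled states, identical prompts and parsers, and a single distance-to-solved metric to compare multiple MLLMs systematically across scramble depths.

Result: Evaluation of seven MLLMs shows that accuracy drops sharply as scramble depth increases; once a trajectory stalls or diverges, models rarely recover; and high face-reconstruction accuracy does not guarantee competent action selection or multi-step execution. A pronounced gap separates closed-source and open-source models: the strongest closed model leads on both single-step perception and multi-step control tasks, while open-source models are near chance on the hardest settings; even the best MLLM degrades at higher cube complexity. Simple self-correction through reflective thinking yields modest gains but can also introduce overthinking.

Conclusion: Cube Bench provides a compact, reproducible framework for evaluating sequential spatial reasoning and exposes fundamental limitations of MLLMs on complex multi-step decision tasks, particularly in error recovery and long-horizon planning. The results indicate that current MLLMs have substantial room for improvement in sequential spatial reasoning and that closed-source models clearly outperform open-source ones on such tasks, giving future model development a clear evaluation direction.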

📄 Abstract

We introduce Cube Bench, a Rubik's-cube benchmark for evaluating spatial and sequential reasoning in multimodal large language models (MLLMs). The benchmark decomposes performance into five skills: (i) reconstructing cube faces from images and text, (ii) choosing the optimal next move, (iii) predicting the outcome of a candidate move without applying it, (iv) executing multi-step plans while recovering from mistakes, and (v) detecting and revising one's own errors. Using a shared set of scrambled cube states, identical prompts and parsers, and a single distance-to-solved metric, we compare recent MLLMs side by side as a function of scramble depth. Across seven MLLMs, accuracy drops sharply with depth; once a trajectory stalls or diverges, models rarely recover, and high face-reconstruction accuracy does not guarantee competent action selection or multi-step execution. A pronounced closed- vs open-source gap emerges: the strongest closed model leads on both single-step perception tasks and multi-step control tasks, while open-weight models cluster near chance on the hardest settings; yet even the best MLLM degrades at higher cube complexity. A simple self-correction via reflective thinking yields modest gains but can also introduce overthinking. Cube Bench offers a compact, reproducible probe of sequential spatial reasoning in MLLMs.

cs.AI [Back]

[31] Zero-Shot Segmentation through Prototype-Guidance for Multi-Label Plant Species Identification

Luciano Araujo Dourado Filho, Almir Moreira da Silva Neto, Rodrigo Pereira David, Rodrigo Tripodi Calumby

🧩 TL;DR

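This paper presents an approach to the PlantClef 2025 fine-grained multi-label species identification challenge: class prototypes from the training dataset serve as proxy guidance for training a segmentation Vision Transformer on test-set images for domain adaptation, and the method placed fifth in the competition.

📘 Detailed Summary

Motivation: The study addresses fine-grained multi-label species identification in the PlantClef 2025 challenge, which requires accurate identification in high-resolution vegetation images containing multiple species. The core difficulty is adapting from multi-class identification of individual species to multi-label classification of high-resolution vegetation plots.

Method: The method uses class prototypes extracted from the training dataset as proxy guidance, building class representations from training-image features with K-Means clustering (K equal to the number of classes in the dataset). It constructs a customized narrow Vision Transformer whose patch embedding layer is replaced with a frozen DinoV2, pre-trained on the training dataset for individual species classification. The model is then trained to reconstruct the training-set class prototypes from test-set images, and the resulting attention scores are used to identify and localize regions of interest that guide classification.

Result: The method placed fifth on the private leaderboard of the PlantCLEF 2025 challenge with an F1 score of 0.33331, only 0.03 below the top submission, indicating competitive performance on the benchmark task. The code is publicly available on GitHub.

Conclusion: The study demonstrates the effectiveness of class prototypes as proxy guidance for domain adaptation, successfully adapting multi-class individual-species identification to multi-label classification of high-resolution vegetation plots. The method is competitive on fine-grained plant identification and suggests a new approach for similar multi-label visual recognition problems.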

📄 Abstract

This paper presents an approach developed for the PlantClef 2025 challenge, which consists of fine-grained multi-label species identification over high-resolution images. Our solution employs class prototypes obtained from the training dataset as proxy guidance for training a segmentation Vision Transformer (ViT) on the test set images. To obtain these representations, the proposed method extracts features from training dataset images and creates clusters by applying K-Means, with $K$ equal to the number of classes in the dataset. The segmentation model is a customized narrow ViT, built by replacing the patch embedding layer with a frozen DinoV2, pre-trained on the training dataset for individual species classification. This model is trained to reconstruct the class prototypes of the training dataset from the test dataset images. We then use this model to obtain attention scores that make it possible to identify and localize areas of interest and consequently guide the classification process. The proposed approach enabled domain adaptation from multi-class identification of individual species to multi-label classification of high-resolution vegetation plots. Our method achieved fifth place in the PlantCLEF 2025 challenge on the private leaderboard, with an F1 score of 0.33331. In absolute terms, our method scored 0.03 lower than the top-performing submission, suggesting that it may have achieved competitive performance in the benchmark task. Our code is available at https://github.com/ADAM-UEFS/PlantCLEF2025.

[32] Reason2Decide: Rationale-Driven Multi-Task Learning

H M Quamran Hasan, Housam Khalifa Bashier, Jiayi Dai, Mi-Young Kim, Randy Goebel

🧩 TL;DR

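This work proposes Reason2Decide, a two-stage training framework that addresses the alignment between predictions and explanations in clinical decision support systems, delivering high-performing explainable decision support with substantially smaller models.

📘 Detailed Summary

Motivation: Clinical decision support systems face a critical challenge: achieving high predictive accuracy while generating explanations aligned with the predictions. Existing methods suffer from exposure bias, which leads to explanations that are inconsistent with predictions; key challenges in self-rationalization, including exposure bias and task separation, need to be addressed.

Method: Reason2Decide is a two-stage training framework. Stage 1 trains the model on rationale generation. Stage 2 jointly trains label prediction and rationale generation, applying scheduled sampling to transition gradually from conditioning on gold labels to conditioning on model predictions, which mitigates exposure bias.

Result: Evaluation on three medical datasets shows that Reason2Decide outperforms other fine-tuning baselines and some zero-shot LLMs in prediction (F1) and rationale fidelity (BERTScore, BLEU, LLM-as-a-Judge). In triage, the framework is robust across LLM-generated, nurse-authored, and nurse-post-processed rationales, and using only LLM-generated rationales in Stage 1 is enough to outperform other variants.

Conclusion: The study indicates that LLM-generated rationales are suitable for pretraining models, reducing reliance on human annotation. Notably, Reason2Decide achieves these gains with models 40x smaller than contemporary foundation models, making clinical reasoning more accessible in resource-constrained deployments while still providing explainable decision support.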

📄 Abstract

Despite the wide adoption of Large Language Models (LLMs), clinical decision support systems face a critical challenge: achieving high predictive accuracy while generating explanations aligned with the predictions. Current approaches suffer from exposure bias leading to misaligned explanations. We propose Reason2Decide, a two-stage training framework that addresses key challenges in self-rationalization, including exposure bias and task separation. In Stage-1, our model is trained on rationale generation, while in Stage-2, we jointly train on label prediction and rationale generation, applying scheduled sampling to gradually transition from conditioning on gold labels to model predictions. We evaluate Reason2Decide on three medical datasets, including a proprietary triage dataset and public biomedical QA datasets. Across model sizes, Reason2Decide outperforms other fine-tuning baselines and some zero-shot LLMs in prediction (F1) and rationale fidelity (BERTScore, BLEU, LLM-as-a-Judge). In triage, Reason2Decide is rationale source-robust across LLM-generated, nurse-authored, and nurse-post-processed rationales. In our experiments, while using only LLM-generated rationales in Stage-1, Reason2Decide outperforms other fine-tuning variants. This indicates that LLM-generated rationales are suitable for pretraining models, reducing reliance on human annotations. Remarkably, Reason2Decide achieves these gains with models 40x smaller than contemporary foundation models, making clinical reasoning more accessible for resource-constrained deployments while still providing explainable decision support.
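
Scheduled sampling, as used in Stage-2, can be sketched in a few lines. The abstract does not give the schedule, so the linear decay below is an assumption, and the names `gold_probability` and `choose_label` are hypothetical helpers rather than anything from the paper.

```python
import random


def gold_probability(step: int, total_steps: int) -> float:
    """Probability of conditioning on the gold label at a training step.

    Assumed schedule: linear decay from 1.0 at step 0 to 0.0 at the last step.
    """
    if total_steps <= 1:
        return 0.0
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return 1.0 - frac


def choose_label(gold: str, predicted: str, step: int,
                 total_steps: int, rng: random.Random) -> str:
    """Pick the label the rationale generator is conditioned on at this step."""
    use_gold = rng.random() < gold_probability(step, total_steps)
    return gold if use_gold else predicted


rng = random.Random(0)
early = choose_label("urgent", "routine", step=0, total_steps=11, rng=rng)
late = choose_label("urgent", "routine", step=10, total_steps=11, rng=rng)
```

Early in training the rationale is conditioned on the gold label; by the end it is conditioned on the model's own prediction, which is the situation the model faces at inference time.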

[33] Enhancing Zero-Shot Time Series Forecasting in Off-the-Shelf LLMs via Noise Injection

Xingyou Yin, Ceyao Zhang, Min Hu, Kai Chen

🧩 TL;DR

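This paper proposes a simple yet effective strategy: injecting noise into time series data before tokenization to improve the zero-shot forecasting performance of frozen large language models. The method acts as inference-time augmentation, compelling the model to extrapolate from robust temporal patterns rather than superficial numerical artifacts.

📘 Detailed Summary

Motivation: Existing work attempts zero-shot time series forecasting with fully frozen large language models that receive no fine-tuning, but their performance is extremely sensitive to the textual representation of the input data because the model parameters cannot adapt to distribution shifts. This brittleness of fully frozen models is a key challenge that calls for a non-invasive intervention.

Method: The paper proposes a simple yet effective strategy: injecting noise into the raw time series before it is tokenized into a textual representation. This non-invasive intervention acts as inference-time augmentation, compelling the frozen LLM to extrapolate from robust underlying temporal patterns rather than superficial numerical artifacts. The method involves no model fine-tuning and improves performance through data preprocessing alone.

Result: The method is validated empirically across diverse benchmarks, with consistent performance improvements. To fully eliminate potential bias from data contamination during LLM pre-training, the study introduces two novel time series datasets that fall outside the pre-training scope of all LLMs used, and improved performance is observed consistently on these datasets as well.

Conclusion: The study takes a further step toward directly using off-the-shelf large language models for time series forecasting. As inference-time augmentation, noise injection improves the generalization of frozen models on time series data and offers a new perspective on applying pre-trained language models to numerical sequence tasks.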

📄 Abstract

Large Language Models (LLMs) have demonstrated effectiveness as zero-shot time series (TS) forecasters. The key challenge lies in tokenizing TS data into textual representations that align with LLMs' pre-trained knowledge. While existing work often relies on fine-tuning specialized modules to bridge this gap, a distinct, yet challenging, paradigm aims to leverage truly off-the-shelf LLMs without any fine-tuning whatsoever, relying solely on strategic tokenization of numerical sequences. The performance of these fully frozen models is acutely sensitive to the textual representation of the input data, as their parameters cannot adapt to distribution shifts. In this paper, we introduce a simple yet highly effective strategy to overcome this brittleness: injecting noise into the raw time series before tokenization. This non-invasive intervention acts as a form of inference-time augmentation, compelling the frozen LLM to extrapolate based on robust underlying temporal patterns rather than superficial numerical artifacts. We theoretically analyze this phenomenon and empirically validate its effectiveness across diverse benchmarks. Notably, to fully eliminate potential biases from data contamination during LLM pre-training, we introduce two novel TS datasets that fall outside all utilized LLMs' pre-training scopes, and consistently observe improved performance. This study provides a further step in directly leveraging off-the-shelf LLMs for time series forecasting.
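
The preprocessing step described above, noise first and tokenization second, can be sketched as follows. The abstract does not specify the noise distribution, its scale, or the text format, so the Gaussian noise, the range-relative `rel_sigma` parameter, and the comma-separated serialization are all assumptions for illustration.

```python
import random
from typing import List


def inject_noise(series: List[float], rel_sigma: float = 0.01, seed: int = 0) -> List[float]:
    """Add zero-mean Gaussian noise scaled by the value range of the series.

    The noise model and scaling rule are illustrative assumptions.
    """
    rng = random.Random(seed)
    scale = (max(series) - min(series)) * rel_sigma
    return [x + rng.gauss(0.0, scale) for x in series]


def to_text(series: List[float], decimals: int = 2) -> str:
    """Serialize values as a comma-separated string for an LLM prompt."""
    return ", ".join(f"{x:.{decimals}f}" for x in series)


raw = [10.0, 10.5, 11.0, 11.5, 12.0]
prompt_fragment = to_text(inject_noise(raw, rel_sigma=0.01, seed=42))
```

The frozen LLM itself is untouched; only the string it receives changes, which is why the abstract calls the intervention non-invasive.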

[34] TongSIM: A General Platform for Simulating Intelligent Machines

Zhe Sun, Kunlun Wu, Chuanjian Fu, Zeming Song, Langyong Shi, Zihe Xue, Bohan Jing, Ying Yang, Xiaomeng Gao, Aijia Li, Tianyu Guo, Huiying Li, Xueyuan Yang, Rongkai Liu, Xinyi He, Yuxi Wang, Yue Li, Mingyuan Liu, Yujie Lu, Hongzhao Xie, Shiyun Zhao, Bo Dai, Wei Wang, Tao Yuan, Song-Chun Zhu, Yujia Peng, Zhenliang Zhang

🧩 TL;DR

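This paper introduces TongSIM, a high-fidelity, general-purpose platform for training and evaluating embodied agents. It targets the problem that existing simulation environments are limited to specific tasks and no general training platform exists, and it aims to accelerate general embodied intelligence by providing diverse indoor and outdoor scenes and a comprehensive evaluation framework.

📘 Detailed Summary

Motivation: With the rapid progress of AI, especially multimodal large language models, research focus is shifting from single-modality text processing to the more complex domains of multimodal and embodied intelligence. Most existing simulation platforms, however, are narrowly designed and tailored to specific tasks; a general-purpose training environment that supports everything from low-level embodied navigation to high-level composite activities (such as multi-agent social simulation and human-AI collaboration) is missing.

Method: TongSIM provides a high-fidelity, general-purpose environment for training and evaluating embodied agents, with more than 100 diverse multi-room indoor scenarios and an open-ended, interaction-rich outdoor town simulation. The platform features customized scenes, task-adaptive fidelity, diverse agent types, and dynamic environmental simulation, and it supplies a comprehensive evaluation framework and benchmarks.

Result: TongSIM achieves broad applicability, supporting assessment of capabilities such as perception, cognition, decision-making, human-robot cooperation, and spatial and social reasoning. Its flexibility and scalability give researchers a unified platform that accelerates training, evaluation, and progress toward general embodied intelligence.

Conclusion: As a unified general-purpose platform, TongSIM fills the gap left by simulation environments limited to specific tasks; its diverse indoor and outdoor scenes and comprehensive evaluation framework provide important infrastructure for research on general embodied intelligence. The design emphasizes flexibility, scalability, and broad applicability and may help move embodied intelligence from task-specific toward general capabilities.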

📄 Abstract

As artificial intelligence (AI) rapidly advances, especially in multimodal large language models (MLLMs), research focus is shifting from single-modality text processing to the more complex domains of multimodal and embodied AI. Embodied intelligence focuses on training agents within realistic simulated environments, leveraging physical interaction and action feedback rather than conventionally labeled datasets. Yet, most existing simulation platforms remain narrowly designed, each tailored to specific tasks. A versatile, general-purpose training environment that can support everything from low-level embodied navigation to high-level composite activities, such as multi-agent social simulation and human-AI collaboration, remains largely unavailable. To bridge this gap, we introduce TongSIM, a high-fidelity, general-purpose platform for training and evaluating embodied agents. TongSIM offers practical advantages by providing over 100 diverse, multi-room indoor scenarios as well as an open-ended, interaction-rich outdoor town simulation, ensuring broad applicability across research needs. Its comprehensive evaluation framework and benchmarks enable precise assessment of agent capabilities, such as perception, cognition, decision-making, human-robot cooperation, and spatial and social reasoning. With features like customized scenes, task-adaptive fidelity, diverse agent types, and dynamic environmental simulation, TongSIM delivers flexibility and scalability for researchers, serving as a unified platform that accelerates training, evaluation, and advancement toward general embodied intelligence.

[35] ActionFlow: A Pipelined Action Acceleration for Vision Language Models on Edge

Yuntao Dai, Hang Gu, Teng Wang, Qianyu Cheng, Yifei Zheng, Zhiyong Qiu, Lei Gong, Wenqi Lou, Xuehai Zhou

🧩 TL;DR

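This paper proposes ActionFlow, a system-level inference framework for resource-constrained edge platforms that uses a cross-request pipelined scheduling strategy and memory optimizations to substantially reduce the inference latency of VLA models, enabling real-time dynamic manipulation.

📘 Detailed Summary

Motivation: Deployment of Vision-Language-Action models in dynamic real-world environments is severely hindered by high inference latency. Smooth robotic interaction requires control frequencies of 20-30 Hz, but current VLA models typically run at only 3-5 Hz on edge devices, mainly because autoregressive decoding is memory-bound. Existing optimizations often require extensive retraining or compromise model accuracy.

Method: The core of ActionFlow is a Cross-Request Pipelining strategy, a novel scheduler that reframes VLA inference as a macro-pipeline of micro-requests. The strategy batches memory-bound decode phases with compute-bound prefill phases across consecutive time steps to maximize hardware utilization. The paper also proposes a Cross-Request State Packed Forward operator and a Unified KV Ring Buffer, which fuse fragmented memory operations into efficient dense computations.

Result: Experiments show that ActionFlow achieves a 2.55x FPS improvement on the OpenVLA-7B model without retraining, enabling real-time dynamic manipulation on edge hardware and markedly improving the inference efficiency of VLA models on resource-constrained platforms.

Conclusion: Through system-level optimization, ActionFlow addresses the latency bottleneck of VLA models on edge devices; its cross-request scheduling and memory optimizations offer a practical route to real-time robot control. The work opens a path toward deploying large multimodal models in resource-constrained environments and illustrates the value of system-level optimization for practical deployment efficiency.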

📄 Abstract

Vision-Language-Action (VLA) models have emerged as a unified paradigm for robotic perception and control, enabling emergent generalization and long-horizon task execution. However, their deployment in dynamic, real-world environments is severely hindered by high inference latency. While smooth robotic interaction requires control frequencies of 20 to 30 Hz, current VLA models typically operate at only 3-5 Hz on edge devices due to the memory-bound nature of autoregressive decoding. Existing optimizations often require extensive retraining or compromise model accuracy. To bridge this gap, we introduce ActionFlow, a system-level inference framework tailored for resource-constrained edge platforms. At the core of ActionFlow is a Cross-Request Pipelining strategy, a novel scheduler that redefines VLA inference as a macro-pipeline of micro-requests. The strategy intelligently batches memory-bound Decode phases with compute-bound Prefill phases across continuous time steps to maximize hardware utilization. Furthermore, to support this scheduling, we propose a Cross-Request State Packed Forward operator and a Unified KV Ring Buffer, which fuse fragmented memory operations into efficient dense computations. Experimental results demonstrate that ActionFlow achieves a 2.55x improvement in FPS on the OpenVLA-7B model without retraining, enabling real-time dynamic manipulation on edge hardware. Our work is available at https://anonymous.4open.science/r/ActionFlow-1D47.
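
For readers unfamiliar with the data structure behind the name, a ring buffer keeps a fixed number of slots and overwrites the oldest entry when full, so memory stays contiguous and bounded. The sketch below shows only that general structure; the abstract does not describe the internals of the Unified KV Ring Buffer, and the `RingBuffer` class here is a hypothetical illustration, not the authors' operator.

```python
from typing import Any, List


class RingBuffer:
    """Fixed-capacity buffer that overwrites the oldest entry once full."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._slots: List[Any] = [None] * capacity
        self._start = 0  # index of the oldest stored entry
        self._size = 0   # number of valid entries

    def append(self, item: Any) -> None:
        cap = len(self._slots)
        end = (self._start + self._size) % cap  # next slot to write
        self._slots[end] = item
        if self._size < cap:
            self._size += 1
        else:
            # Buffer was full: the write replaced the oldest entry.
            self._start = (self._start + 1) % cap

    def items(self) -> List[Any]:
        """Return stored entries from oldest to newest."""
        cap = len(self._slots)
        return [self._slots[(self._start + i) % cap] for i in range(self._size)]


# Keep cache entries for the three most recent time steps only.
buf = RingBuffer(3)
for step in range(5):
    buf.append(f"kv@{step}")
recent = buf.items()
```

Because no slot is ever reallocated, per-step cache entries can live in one preallocated region, which is the property that makes this layout attractive on memory-constrained hardware.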

[36] A DeepSeek-Powered AI System for Automated Chest Radiograph Interpretation in Clinical Practice

Yaowei Bai, Ruiheng Zhang, Yu Lei, Xuhua Duan, Jingfeng Yao, Shuguang Ju, Chaoyang Wang, Wei Yao, Yiwan Guo, Guilin Zhang, Chao Wan, Qian Yuan, Lei Chen, Wenjuan Tang, Biqiang Zhu, Xinggang Wang, Tao Sun, Wei Zhou, Dacheng Tao, Yongchao Xu, Chuansheng Zheng, Huangxuan Zhao, Bo Du

🧩 TL;DR

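This study develops Janus-Pro-CXR (1B), a chest X-ray interpretation system, and validates its superior performance in report generation and detection of critical radiographic findings through a multicenter prospective clinical trial, significantly improving diagnostic reliability and workflow efficiency.

📘 Detailed Summary

Motivation: The global shortage of radiologists is exacerbated by the heavy chest X-ray workload. Existing evaluations of multimodal large language models rely mainly on automated metrics or retrospective analyses and lack rigorous prospective clinical validation, so clinically validated AI-assisted radiology solutions are needed.

Method: Janus-Pro-CXR (1B), a chest X-ray interpretation system, was developed on the DeepSeek Janus-Pro model with a lightweight architecture and domain-specific optimization. It was rigorously validated in a multicenter prospective clinical trial (NCT07117266) and compared against state-of-the-art models, including ChatGPT 4o (200B parameters).

Result: Janus-Pro-CXR surpasses existing state-of-the-art models in automated report generation, including the larger-scale ChatGPT 4o, and reliably detects six clinically critical radiographic findings. In prospective clinical deployment, AI assistance significantly improved report quality scores, reduced interpretation time by 18.3% (P < 0.001), and was preferred by experts in 54.3% of cases.

Conclusion: Through its lightweight architecture and domain-specific optimization, Janus-Pro-CXR improves diagnostic reliability and workflow efficiency, particularly in resource-constrained settings. The model architecture and implementation framework will be open-sourced to facilitate clinical translation of AI-assisted radiology solutions, providing a validated example of multimodal large language models applied in medical practice.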

📄 Abstract

A global shortage of radiologists has been exacerbated by the significant volume of chest X-ray workloads, particularly in primary care. Although multimodal large language models show promise, existing evaluations predominantly rely on automated metrics or retrospective analyses, lacking rigorous prospective clinical validation. Janus-Pro-CXR (1B), a chest X-ray interpretation system based on DeepSeek Janus-Pro model, was developed and rigorously validated through a multicenter prospective trial (NCT07117266). Our system outperforms state-of-the-art X-ray report generation models in automated report generation, surpassing even larger-scale models including ChatGPT 4o (200B parameters), while demonstrating reliable detection of six clinically critical radiographic findings. Retrospective evaluation confirms significantly higher report accuracy than Janus-Pro and ChatGPT 4o. In prospective clinical deployment, AI assistance significantly improved report quality scores, reduced interpretation time by 18.3% (P < 0.001), and was preferred by a majority of experts in 54.3% of cases. Through lightweight architecture and domain-specific optimization, Janus-Pro-CXR improves diagnostic reliability and workflow efficiency, particularly in resource-constrained settings. The model architecture and implementation framework will be open-sourced to facilitate the clinical translation of AI-assisted radiology solutions.

[37] Generative Digital Twins: Vision-Language Simulation Models for Executable Industrial Systems

YuChe Hsu, AnJui Wang, TsaiChing Ni, YuanFu Yang

🧩 TL;DR

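This paper proposes a Vision-Language Simulation Model (VLSM) that unifies visual and textual understanding to generate executable FlexScript from layout sketches and natural-language prompts, establishing a new paradigm of generative digital twins for industrial simulation systems.

📘 Detailed Summary

Motivation: The study addresses cross-modal reasoning in industrial simulation systems, namely how to combine visual layout sketches with natural-language descriptions to generate directly executable simulation logic, filling the gap left by the absence of a unified vision-language-code multimodal learning framework for generative digital twins.

Method: The study proposes the Vision-Language Simulation Model (VLSM), a unified architecture that integrates a vision encoder, a connector, and a code-pretrained language backbone to generate executable FlexScript code from layout sketches and natural-language prompts. It also builds a large-scale dataset of more than 120,000 prompt-sketch-code triplets and designs three task-specific evaluation metrics: Structural Validity Rate (SVR), Parameter Match Rate (PMR), and Execution Success Rate (ESR).

Result: In systematic ablations across different configurations of vision encoders, connectors, and code-pretrained language backbones, the proposed models achieve near-perfect structural accuracy and high execution robustness; the three purpose-built metrics assess the structural integrity, parameter fidelity, and simulator executability of the generated code.

Conclusion: The work lays a foundation for generative digital twins, showing that visual reasoning and language understanding can be integrated into executable industrial simulation systems, and points to new technical directions for industrial automation, smart manufacturing, and digital twin systems.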

📄 Abstract

We propose a Vision-Language Simulation Model (VLSM) that unifies visual and textual understanding to synthesize executable FlexScript from layout sketches and natural-language prompts, enabling cross-modal reasoning for industrial simulation systems. To support this new paradigm, the study constructs the first large-scale dataset for generative digital twins, comprising over 120,000 prompt-sketch-code triplets that enable multimodal learning between textual descriptions, spatial structures, and simulation logic. In parallel, three novel evaluation metrics, Structural Validity Rate (SVR), Parameter Match Rate (PMR), and Execution Success Rate (ESR), are proposed specifically for this task to comprehensively evaluate structural integrity, parameter fidelity, and simulator executability. Through systematic ablation across vision encoders, connectors, and code-pretrained language backbones, the proposed models achieve near-perfect structural accuracy and high execution robustness. This work establishes a foundation for generative digital twins that integrate visual reasoning and language understanding into executable industrial simulation systems.

[38] Advancing Multimodal Teacher Sentiment Analysis: The Large-Scale T-MED Dataset & The Effective AAM-TSA Model

Zhiyi Duan, Xiangren Wang, Hongyu Yuan, Qianli Xing

🧩 TL;DR

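This paper builds T-MED, the first large-scale multimodal teacher sentiment analysis dataset, and proposes AAM-TSA, an asymmetric attention-based multimodal teacher sentiment analysis model that significantly improves the accuracy and interpretability of teacher emotion recognition.

📘 Detailed Summary

Motivation: Existing studies often fail to capture teachers' emotions accurately, both because teachers' emotions are performative and because the critical influence of instructional information on emotional expression is overlooked, which leads to poor teacher sentiment analysis in real educational settings.

Method: The study first builds T-MED, a multimodal teacher sentiment analysis dataset with 14,938 instances from 250 real classrooms across 11 subjects, ranging from K-12 to higher education. It integrates text, audio, video, and instructional information and uses a human-machine collaborative labeling process to ensure annotation quality. On this basis it proposes AAM-TSA, an asymmetric attention-based multimodal teacher sentiment analysis model that introduces an asymmetric attention mechanism and a hierarchical gating unit for differentiated cross-modal feature fusion and precise emotion classification.

Result: Experiments show that AAM-TSA significantly outperforms existing state-of-the-art methods on the T-MED dataset in both accuracy and interpretability, supporting the effectiveness of the proposed approach for teacher sentiment analysis.

Conclusion: By building a large-scale dataset and proposing a new model, the study offers a systematic solution for teacher sentiment analysis, underscores the importance of instructional information in sentiment analysis, and provides a reference for affective computing in education. Future work could explore finer-grained sentiment analysis and applications in personalized teaching support.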

📄 Abstract

Teachers' emotional states are critical in educational scenarios, profoundly impacting teaching efficacy, student engagement, and learning achievements. However, existing studies often fail to accurately capture teachers' emotions due to the performative nature of those emotions and overlook the critical impact of instructional information on emotional expression. In this paper, we systematically investigate teacher sentiment analysis by building both the dataset and the model accordingly. We construct the first large-scale teacher multimodal sentiment analysis dataset, T-MED. To ensure labeling accuracy and efficiency, we employ a human-machine collaborative labeling process. The T-MED dataset includes 14,938 instances of teacher emotional data from 250 real classrooms across 11 subjects ranging from K-12 to higher education, integrating multimodal text, audio, video, and instructional information. Furthermore, we propose a novel asymmetric attention-based multimodal teacher sentiment analysis model, AAM-TSA. AAM-TSA introduces an asymmetric attention mechanism and hierarchical gating unit to enable differentiated cross-modal feature fusion and precise emotional classification. Experimental results demonstrate that AAM-TSA significantly outperforms existing state-of-the-art methods in terms of accuracy and interpretability on the T-MED dataset.

[39] LongVideoAgent: Multi-Agent Reasoning with Long Videos

Runtao Liu, Ziyi Liu, Jiaqi Tang, Yue Ma, Renjie Pi, Jipeng Zhang, Qifeng Chen

🧩 TL;DR

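This paper proposes a multi-agent framework for long-video question answering in which a master LLM coordinates a grounding agent and a vision agent; training with reinforcement learning yields efficient multi-agent cooperation and significantly improves long-video understanding.

📘 Detailed Summary

Motivation: Current multimodal LLMs and long-video QA systems typically handle hour-long videos with lossy compressed summaries or limited toolsets, which weakens temporal grounding and misses fine-grained visual cues; more precise temporal localization and fine-grained visual information extraction are needed.

Method: The paper proposes a multi-agent framework in which a master LLM coordinates two specialized agents: a grounding agent that localizes question-relevant video segments and a vision agent that extracts targeted textual observations. The master agent plans under a step limit and is trained with reinforcement learning to encourage concise, correct, and efficient multi-agent cooperation.

Result: On the proposed LongTVQA and LongTVQA+ datasets (episode-level datasets built from TVQA/TVQA+), the multi-agent system significantly outperforms strong non-agent baselines. Experiments show that reinforcement learning further strengthens the reasoning and planning of the trained agent.

Conclusion: The study demonstrates the effectiveness of multi-agent cooperation for long-video understanding: the division of labor among specialized agents improves temporal grounding and supplements visual detail, reinforcement learning further improves the efficiency of multi-agent interaction, and the system yields interpretable reasoning trajectories for long-video QA.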

📄 Abstract

Recent advances in multimodal LLMs and systems that use tools for long-video QA point to the promise of reasoning over hour-long episodes. However, many methods still compress content into lossy summaries or rely on limited toolsets, weakening temporal grounding and missing fine-grained cues. We propose a multi-agent framework in which a master LLM coordinates a grounding agent to localize question-relevant segments and a vision agent to extract targeted textual observations. The master agent plans with a step limit, and is trained with reinforcement learning to encourage concise, correct, and efficient multi-agent cooperation. This design helps the master agent focus on relevant clips via grounding, complements subtitles with visual detail, and yields interpretable trajectories. On our proposed LongTVQA and LongTVQA+, which are episode-level datasets aggregated from TVQA/TVQA+, our multi-agent system significantly outperforms strong non-agent baselines. Experiments also show reinforcement learning further strengthens reasoning and planning for the trained agent. Code and data will be shared at https://longvideoagent.github.io/.
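
The control flow of a step-limited master agent can be sketched with plain callables standing in for the LLM-backed agents. Everything here is hypothetical: `run_master`, its parameters, and the fixed ground-then-look order are simplifying assumptions, since in the paper the master LLM decides how to use the agents and is trained with reinforcement learning.

```python
from typing import Callable, List, Optional


def run_master(question: str,
               ground: Callable[[str], int],
               look: Callable[[int], str],
               answer: Callable[[str, List[str]], Optional[str]],
               max_steps: int = 3) -> Optional[str]:
    """Alternate grounding and vision calls until an answer or the step limit.

    ground: maps the question to a segment index (grounding agent stand-in).
    look:   maps a segment index to a textual observation (vision agent stand-in).
    answer: returns an answer once the observations suffice, else None.
    """
    observations: List[str] = []
    for _ in range(max_steps):
        segment = ground(question)
        observations.append(look(segment))
        result = answer(question, observations)
        if result is not None:
            return result
    return None  # step budget exhausted without an answer


# Toy stand-ins for the three roles.
found = run_master(
    "What color is the mug?",
    ground=lambda q: 2,
    look=lambda seg: f"clip {seg}: a red mug on the desk",
    answer=lambda q, obs: "red" if any("red" in o for o in obs) else None,
)
```

The list of observations collected along the way is what makes the trajectory inspectable after the fact.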

Table of Contents

cs.CV [Back]

[1] Widget2Code: From Visual Widgets to UI Code via Multimodal LLMs

[2] Vehicle-centric Perception via Multimodal Structured Pre-training

[3] SE360: Semantic Edit in 360$^\circ$ Panoramas via Hierarchical Data Construction

[4] PaveSync: A Unified and Comprehensive Dataset for Pavement Distress Analysis and Classification

[5] A Contextual Analysis of Driver-Facing and Dual-View Video Inputs for Distraction Detection in Naturalistic Driving Environments

[6] Towards Natural Language-Based Document Image Retrieval: New Dataset and Benchmark

[7] MAPI-GNN: Multi-Activation Plane Interaction Graph Neural Network for Multimodal Medical Diagnosis

[8] $\text{H}^2$em: Learning Hierarchical Hyperbolic Embeddings for Compositional Zero-Shot Learning

[9] Beyond Vision: Contextually Enriched Image Captioning with Multi-Modal Retrieva

[10] Item Region-based Style Classification Network (IRSN): A Fashion Style Classifier Based on Domain Knowledge of Fashion Experts

[11] DDAVS: Disentangled Audio Semantics and Delayed Bidirectional Alignment for Audio-Visual Segmentation

[12] LiteFusion: Taming 3D Object Detectors from Vision-Based to Multi-Modal with Minimal Adaptation

[13] LADLE-MM: Limited Annotation based Detector with Learned Ensembles for Multimodal Misinformation

[14] TAVID: Text-Driven Audio-Visual Interactive Dialogue Generation

[15] CRAFT: Continuous Reasoning and Agentic Feedback Tuning for Multimodal Text-to-Image Generation

[16] Chain-of-Anomaly Thoughts with Large Vision-Language Models

[17] Beyond Motion Pattern: An Empirical Study of Physical Forces for Human Motion Understanding

[18] UTDesign: A Unified Framework for Stylized Text Editing and Generation in Graphic Design Images

[19] Bridging Modalities and Transferring Knowledge: Enhanced Multimodal Understanding and Recognition

[20] Multi-Grained Text-Guided Image Fusion for Multi-Exposure and Multi-Focus Scenarios

🧩 TL;DR