cs.CV
[1] Bias Detection and Rotation-Robustness Mitigation in Vision-Language Models and Generative Image Models
Tarannum Mithila
🧩 TL;DR
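This study analyzes the robustness and fairness of vision-language models and generative image models under input transformations, with a particular focus on bias propagation caused by image rotation, and proposes rotation-robust mitigation strategies that combine data augmentation, representation alignment, and model regularization.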
📘 Detailed Summary
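Motivation: Although vision-language models and generative image models achieve remarkable performance on multimodal tasks, their robustness and fairness under input transformations remain insufficiently explored. This study investigates bias propagation and robustness degradation in state-of-the-art vision-language and generative models, focusing on how image rotation and distribution shift affect model predictions, confidence calibration, and demographic bias patterns.
Method: To address rotation-induced robustness and fairness problems, the study proposes rotation-robust mitigation strategies that combine data augmentation, representation alignment, and model-level regularization. These interventions act at multiple levels to make models more tolerant of rotation while reducing bias amplification.
Result: Experiments on multiple datasets show that the proposed methods significantly improve robustness and reduce bias amplification without sacrificing overall performance. The study quantifies the effect of rotation-induced perturbations on model predictions and confidence calibration, and shows that the mitigation strategies improve model fairness and reliability.
Conclusion: The study exposes key limitations of current multimodal systems, in particular the robustness and fairness problems caused by sensitivity to input transformations. The practical mitigation techniques it provides offer methodological support for building more reliable and fair AI models, and underline the importance of considering transformation robustness and bias mitigation during model development.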
📄 Abstract
Vision-Language Models (VLMs) and generative image models have achieved remarkable performance across multimodal tasks, yet their robustness and fairness under input transformations remain insufficiently explored. This work investigates bias propagation and robustness degradation in state-of-the-art vision-language and generative models, with a particular focus on image rotation and distributional shifts. We analyze how rotation-induced perturbations affect model predictions, confidence calibration, and demographic bias patterns. To address these issues, we propose rotation-robust mitigation strategies that combine data augmentation, representation alignment, and model-level regularization. Experimental results across multiple datasets demonstrate that the proposed methods significantly improve robustness while reducing bias amplification without sacrificing overall performance. This study highlights critical limitations of current multimodal systems and provides practical mitigation techniques for building more reliable and fair AI models.
[2] Residual Cross-Modal Fusion Networks for Audio-Visual Navigation
Yi Wang, Yinfeng Yu, Bin Ren
🧩 TL;DR
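This paper proposes a Cross-Modal Residual Fusion Network (CRFN) to address the multimodal fusion challenge in audio-visual embodied navigation. Bidirectional residual interactions enable complementary modeling and fine-grained alignment, which significantly improves cross-domain generalization.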
📘 Detailed Summary
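Motivation: The key challenge in audio-visual embodied navigation is to model the interaction between heterogeneous features effectively during multimodal fusion, so as to avoid single-modality dominance or information degradation, particularly in cross-domain scenarios. Existing methods usually rely on simple concatenation or attention gating, which makes complementary modeling and fine-grained alignment between modalities difficult.
Method: The paper proposes the Cross-Modal Residual Fusion Network (CRFN), which introduces bidirectional residual interactions between the audio and visual streams to achieve complementary modeling and fine-grained alignment while keeping the two representations independent. Unlike conventional concatenation or attention gating, it models cross-modal interactions explicitly through residual connections and incorporates stabilization techniques to improve convergence and robustness.
Result: Experiments on the Replica and Matterport3D datasets show that CRFN significantly outperforms state-of-the-art multimodal fusion baselines and generalizes better across domains. The experiments also reveal that agents exhibit differentiated modality dependence across datasets, which offers a new perspective on the cross-modal collaboration mechanism of embodied agents.
Conclusion: Beyond proposing an effective cross-modal fusion framework, the study reveals that embodied agents depend on modalities differently in different environments, providing insight into multimodal collaboration. The success of CRFN suggests that explicitly modeling cross-modal residual interactions is an effective route to robust multimodal fusion and offers ideas for the design of future embodied navigation systems.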
📄 Abstract
Audio-visual embodied navigation aims to enable an agent to autonomously localize and reach a sound source in unseen 3D environments by leveraging auditory cues. The key challenge of this task lies in effectively modeling the interaction between heterogeneous features during multimodal fusion, so as to avoid single-modality dominance or information degradation, particularly in cross-domain scenarios. To address this, we propose a Cross-Modal Residual Fusion Network, which introduces bidirectional residual interactions between audio and visual streams to achieve complementary modeling and fine-grained alignment, while maintaining the independence of their representations. Unlike conventional methods that rely on simple concatenation or attention gating, CRFN explicitly models cross-modal interactions via residual connections and incorporates stabilization techniques to improve convergence and robustness. Experiments on the Replica and Matterport3D datasets demonstrate that CRFN significantly outperforms state-of-the-art fusion baselines and achieves stronger cross-domain generalization. Notably, our experiments also reveal that agents exhibit differentiated modality dependence across different datasets. The discovery of this phenomenon provides a new perspective for understanding the cross-modal collaboration mechanism of embodied agents.
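The bidirectional residual interaction described in this abstract can be illustrated with a minimal sketch. The abstract does not give CRFN's equations, so everything below is an assumption made for illustration: the function name `residual_cross_fusion`, the projection matrices `w_v2a` and `w_a2v`, and the fixed `scale` factor are hypothetical. The sketch only shows the general idea that each stream keeps its own features and adds a scaled message projected from the other stream.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def residual_cross_fusion(audio: Vector, visual: Vector,
                          w_v2a: Matrix, w_a2v: Matrix,
                          scale: float = 0.5) -> Tuple[Vector, Vector]:
    """Hypothetical bidirectional residual interaction.

    Each stream keeps its own features (the residual path) and adds a
    scaled message projected from the other stream.
    """
    msg_to_audio = matvec(w_v2a, visual)   # project visual into audio space
    msg_to_visual = matvec(w_a2v, audio)   # project audio into visual space
    audio_out = [a + scale * m for a, m in zip(audio, msg_to_audio)]
    visual_out = [v + scale * m for v, m in zip(visual, msg_to_visual)]
    return audio_out, visual_out


# Toy inputs, invented for illustration only.
audio = [1.0, 2.0]
visual = [0.5, -1.0, 2.0]
w_v2a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]    # 2x3: visual (3) -> audio (2)
w_a2v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3x2: audio (2) -> visual (3)
fused_audio, fused_visual = residual_cross_fusion(audio, visual, w_v2a, w_a2v)
```

With `scale` set to 0 both streams pass through unchanged, which is one way to read the abstract's claim that the representations stay independent: the cross-modal term is an additive correction, not a replacement.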
[3] Thermo-LIO: A Novel Multi-Sensor Integrated System for Structural Health Monitoring
Chao Yang, Haoyuan Zheng, Yue Ma
🧩 TL;DR
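This paper presents Thermo-LIO, a multi-sensor system that combines thermal imaging with high-resolution LiDAR. Through multimodal data fusion and integration with LiDAR-Inertial Odometry, it significantly improves defect detection accuracy and coverage in structural health monitoring of large-scale civil infrastructure.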
📘 Detailed Summary
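Motivation: Traditional two-dimensional thermography is non-invasive and useful for defect detection in construction, but it struggles to assess complex geometries, inaccessible areas, and subsurface defects, which limits its use for comprehensive structural health monitoring.
Method: The study develops a multimodal fusion method for thermal imaging and LiDAR that calibrates and synchronizes the data streams precisely to build accurate representations of temperature distributions in buildings. It then integrates this fusion method with LiDAR-Inertial Odometry to achieve full coverage of large-scale structures and to track temperature variations.
Result: Experimental validation on a bridge and a hall building shows that Thermo-LIO detects detailed thermal anomalies and structural defects more accurately than traditional methods. The system improves diagnostic precision, supports real-time processing, and expands inspection coverage.
Conclusion: The study highlights the key role of multimodal sensor integration in advancing structural health monitoring of large-scale civil infrastructure. By fusing thermal imaging with LiDAR, the system achieves more comprehensive and precise defect detection and points to a new direction for automated monitoring systems.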
📄 Abstract
Traditional two-dimensional thermography, despite being non-invasive and useful for defect detection in the construction field, is limited in effectively assessing complex geometries, inaccessible areas, and subsurface defects. This paper introduces Thermo-LIO, a novel multi-sensor system that can enhance Structural Health Monitoring (SHM) by fusing thermal imaging with high-resolution LiDAR. To achieve this, the study first develops a multimodal fusion method combining thermal imaging and LiDAR, enabling precise calibration and synchronization of multimodal data streams to create accurate representations of temperature distributions in buildings. Second, it integrates this fusion approach with LiDAR-Inertial Odometry (LIO), enabling full coverage of large-scale structures and allowing for detailed monitoring of temperature variations and defect detection across inspection cycles. Experimental validations, including case studies on a bridge and a hall building, demonstrate that Thermo-LIO can detect detailed thermal anomalies and structural defects more accurately than traditional methods. The system enhances diagnostic precision, enables real-time processing, and expands inspection coverage, highlighting the crucial role of multimodal sensor integration in advancing SHM methodologies for large-scale civil infrastructure.
[4] Towards Open Environments and Instructions: General Vision-Language Navigation via Fast-Slow Interactive Reasoning
Yang Li, Aming Wu, Zihao Zhang, Yahong Han
🧩 TL;DR
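This paper proposes the slow4fast-VLN framework, which establishes a dynamically interacting fast-slow reasoning mechanism to tackle general scene adaptation in vision-language navigation, enabling the agent to produce generalized strategies dynamically when facing unseen environments and instructions.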
📘 Detailed Summary
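Motivation: Traditional vision-language navigation methods usually follow a closed-set assumption, in which training and test data share the same style of input images and instructions. The real world, however, is open and full of unseen environments, which poses a major challenge for closed-set methods. The paper focuses on the general scene adaptation task, which aims to learn generalized navigation ability by introducing diverse environments and inconsistent instructions. The main challenge is how to let the agent produce generalized strategies dynamically when facing unseen environments and instructions.
Method: The paper proposes the slow4fast-VLN framework, which establishes a dynamically interacting fast-slow reasoning mechanism. The fast-reasoning module is an end-to-end policy network that outputs actions from real-time input and accumulates execution records in a history repository to build memory. The slow-reasoning module analyzes the memories generated by the fast-reasoning module and, through deep reflection, extracts experiences that improve the generalization of decision-making. These experiences are stored in structured form and used to optimize the fast-reasoning module continuously. Unlike traditional methods that treat fast and slow reasoning as independent mechanisms, the framework makes them interact, using experience from slow reasoning so that the system adapts continuously and executes navigation tasks efficiently in unseen scenarios.
Result: The abstract gives no specific performance metrics or benchmark results; it presents the method as a conceptual answer to general scene adaptation in vision-language navigation. Through the dynamic interaction of fast and slow reasoning, the system is designed to extract generalizable experience from execution records and keep optimizing its navigation policy, so that it adapts better to unseen environments and inconsistent instructions.
Conclusion: The main contribution is a dynamically interacting fast-slow reasoning framework inspired by the human cognitive system, offering a new approach to open-world vision-language navigation. By combining fast execution with slow reflection, the system can keep learning and adapting to unseen scenarios, which provides a useful direction for building autonomous navigation agents that generalize better and a basis for future navigation research in more complex open environments.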
📄 Abstract
Vision-Language Navigation aims to enable agents to navigate to a target location based on language instructions. Traditional VLN often follows a closed-set assumption, i.e., training and test data share the same style of input images and instructions. However, the real world is open and filled with various unseen environments, posing enormous difficulties for closed-set methods. To this end, we focus on the General Scene Adaptation (GSA-VLN) task, aiming to learn generalized navigation ability by introducing diverse environments and inconsistent instructions. In this task, when facing unseen environments and instructions, the challenge mainly lies in how to enable the agent to dynamically produce generalized strategies during the navigation process. Recent research indicates that, by means of fast and slow cognition systems, human beings can generate stable policies that strengthen their adaptation to the open world. Inspired by this idea, we propose slow4fast-VLN, establishing a dynamic interactive fast-slow reasoning framework. The fast-reasoning module, an end-to-end strategy network, outputs actions from real-time input. It accumulates execution records in a history repository to build memory. The slow-reasoning module analyzes the memories generated by the fast-reasoning module. Through deep reflection, it extracts experiences that enhance the generalization ability of decision-making. These experiences are structurally stored and used to continuously optimize the fast-reasoning module. Unlike traditional methods that treat fast-slow reasoning as independent mechanisms, our framework enables fast-slow interaction. By leveraging the experiences from slow reasoning, this interaction allows the system to continuously adapt and efficiently execute navigation tasks when facing unseen scenarios.
[5] LP-LLM: End-to-End Real-World Degraded License Plate Text Recognition via Large Multimodal Models
Haoyan Gong, Hongbin Liu
🧩 TL;DR
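This paper proposes an end-to-end structure-aware multimodal reasoning framework based on Qwen3-VL. By introducing a character-aware multimodal reasoning module and character slot queries, it resolves the mismatch between image restoration and character recognition objectives in license plate recognition and significantly improves recognition under severe degradation.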
📘 Detailed Summary
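Motivation: Real-world license plate recognition faces severe degradations such as motion blur, low resolution, and complex illumination. The conventional two-stage "restoration-then-recognition" paradigm has a fundamental flaw: the pixel-level optimization objective of image restoration models is misaligned with the semantic objective of character recognition, leading to artifact interference and error accumulation. Meanwhile, vision-language models have strong general capabilities but lack explicit structural modeling of license plate character sequences, such as fixed length and specific order.
Method: The paper proposes an end-to-end structure-aware multimodal reasoning framework based on Qwen3-VL. Its core innovation is the character-aware multimodal reasoning module, which introduces a set of learnable character slot queries that actively retrieve fine-grained evidence for each character position from the visual features through cross-attention. These character-aware representations are then injected back into the visual tokens via residual modulation, so that the language model can generate autoregressively on the basis of explicit structural priors. Combined with LoRA parameter-efficient fine-tuning, the model adapts to the domain while retaining the generalization ability of the large model.
Result: Extensive experiments on synthetic and real-world severely degraded datasets show that the method significantly outperforms existing restoration-recognition combinations and general-purpose vision-language models, supporting the benefit of incorporating structured reasoning into large models for low-quality text recognition.
Conclusion: The study shows that explicit structural modeling and character-aware reasoning can effectively resolve the objective mismatch of traditional two-stage methods. The approach offers a new paradigm for low-quality text recognition, demonstrates the value of combining the reasoning ability of large language models with domain-specific structural priors, and serves as a reference for similar structured visual recognition tasks.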
📄 Abstract
Real-world License Plate Recognition (LPR) faces significant challenges from severe degradations such as motion blur, low resolution, and complex illumination. The prevailing "restoration-then-recognition" two-stage paradigm suffers from a fundamental flaw: the pixel-level optimization objectives of image restoration models are misaligned with the semantic goals of character recognition, leading to artifact interference and error accumulation. While Vision-Language Models (VLMs) have demonstrated powerful general capabilities, they lack explicit structural modeling for license plate character sequences (e.g., fixed length, specific order). To address this, we propose an end-to-end structure-aware multimodal reasoning framework based on Qwen3-VL. The core innovation lies in the Character-Aware Multimodal Reasoning Module (CMRM), which introduces a set of learnable Character Slot Queries. Through a cross-attention mechanism, these queries actively retrieve fine-grained evidence corresponding to character positions from visual features. Subsequently, we inject these character-aware representations back into the visual tokens via residual modulation, enabling the language model to perform autoregressive generation based on explicit structural priors. Furthermore, combined with the LoRA parameter-efficient fine-tuning strategy, the model achieves domain adaptation while retaining the generalization capabilities of the large model. Extensive experiments on both synthetic and real-world severely degraded datasets demonstrate that our method significantly outperforms existing restoration-recognition combinations and general VLMs, validating the superiority of incorporating structured reasoning into large models for low-quality text recognition tasks.
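The two steps this abstract names, cross-attention from character slot queries to visual tokens followed by residual modulation of those tokens, can be sketched in plain Python. This is a minimal illustration under assumptions, not the paper's CMRM: the abstract does not specify the attention form, the injection rule, or any gating, so the function names, the scaled dot-product attention, and the `gate` parameter are all hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def slot_cross_attention(slot_queries: List[Vector],
                         visual_tokens: List[Vector]
                         ) -> Tuple[List[Vector], List[Vector]]:
    """Each slot query attends over all visual tokens (scaled dot product).

    Returns one pooled vector per slot plus the attention weights.
    """
    scale = 1.0 / math.sqrt(len(visual_tokens[0]))
    dim = len(visual_tokens[0])
    outputs, weights = [], []
    for q in slot_queries:
        attn = softmax([dot(q, t) * scale for t in visual_tokens])
        pooled = [sum(w * t[d] for w, t in zip(attn, visual_tokens))
                  for d in range(dim)]
        outputs.append(pooled)
        weights.append(attn)
    return outputs, weights


def residual_modulation(visual_tokens: List[Vector],
                        slot_outputs: List[Vector],
                        weights: List[Vector],
                        gate: float = 0.1) -> List[Vector]:
    """Hypothetical injection rule: add to each token a gated mix of the
    slot vectors, weighted by how strongly each slot attended to it."""
    modulated = []
    for i, tok in enumerate(visual_tokens):
        update = [sum(weights[s][i] * slot_outputs[s][d]
                      for s in range(len(slot_outputs)))
                  for d in range(len(tok))]
        modulated.append([t + gate * u for t, u in zip(tok, update)])
    return modulated


# Toy data, invented for illustration: 3 visual tokens, 2 character slots.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[5.0, 0.0], [0.0, 5.0]]
slot_out, attn_w = slot_cross_attention(queries, tokens)
new_tokens = residual_modulation(tokens, slot_out, attn_w)
```

In a trained model the slot queries would be learned parameters, one per character position; here they are fixed by hand so that the first slot favors tokens with a large first component.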
[6] LPCAN: Lightweight Pyramid Cross-Attention Network for Rail Surface Defect Detection Using RGB-D Data
Jackie Alex, Guoqiang Huan
🧩 TL;DR
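This paper proposes a Lightweight Pyramid Cross-Attention Network (LPCANet) that uses RGB-D data for efficient and accurate rail defect detection, reaching state-of-the-art performance while keeping computational complexity low.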
📘 Detailed Summary
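Motivation: Current vision-based rail defect detection methods suffer from high computational complexity, large parameter counts, and insufficient accuracy. A more efficient and accurate solution is needed to meet the practical demands of industrial defect inspection.
Method: The proposed LPCANet uses MobileNetv2 as the backbone for RGB feature extraction, a lightweight pyramid module (LPM) for depth processing, a cross-attention mechanism (CAM) for multimodal fusion, and a spatial feature extractor (SFE) to strengthen structural analysis.
Result: On three unsupervised RGB-D rail datasets (NEU-RSDDS-AUG, RSDD-TYPE1, RSDD-TYPE2), LPCANet needs only 9.90 million parameters and 2.50 G FLOPs and runs at 162.60 fps. It improves on the best existing method by 1.48%, 0.86%, and 1.77% in S_α, IOU, and MAE, respectively. Ablation studies confirm the key roles of CAM and SFE, and experiments on non-rail datasets validate its generalization ability.
Conclusion: The work bridges traditional and deep learning approaches and offers a solution of substantial practical value for industrial defect inspection. Future work will focus on further model compression for real-time deployment.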
📄 Abstract
This paper addresses the limitations of current vision-based rail defect detection methods, including high computational complexity, excessive parameter counts, and suboptimal accuracy. We propose a Lightweight Pyramid Cross-Attention Network (LPCANet) that leverages RGB-D data for efficient and accurate defect identification. The architecture integrates MobileNetv2 as a backbone for RGB feature extraction with a lightweight pyramid module (LPM) for depth processing, coupled with a cross-attention mechanism (CAM) for multimodal fusion and a spatial feature extractor (SFE) for enhanced structural analysis. Comprehensive evaluations on three unsupervised RGB-D rail datasets (NEU-RSDDS-AUG, RSDD-TYPE1, RSDD-TYPE2) demonstrate that LPCANet achieves state-of-the-art performance with only 9.90 million parameters, 2.50 G FLOPs, and 162.60 fps inference speed. Compared to 18 existing methods, LPCANet shows significant improvements, including +1.48% in S_α, +0.86% in IOU, and +1.77% in MAE over the best-performing baseline. Ablation studies confirm the critical roles of CAM and SFE, while experiments on non-rail datasets (DAGM2007, MT, Kolektor-SDD2) validate its generalization capability. The proposed framework effectively bridges traditional and deep learning approaches, offering substantial practical value for industrial defect inspection. Future work will focus on further model compression for real-time deployment.
[7] SkinFlow: Efficient Information Transmission for Open Dermatological Diagnosis via Dynamic Visual Encoding and Staged RL
Lijun Liu, Linwei Chen, Zhishou Zhang, Meng Tian, Hengfu Cui, Ruiyang Li, Zhaocheng Liu, Qiang Ju, Qianxi Li, Hong-Yu Zhou
🧩 TL;DR
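This paper proposes the SkinFlow framework, which addresses the "diffuse attention" problem of general-purpose large vision-language models in dermatology by optimizing the efficiency of visual information transmission instead of scaling parameters. At 7B parameters it achieves diagnostic performance beyond that of much larger general-purpose models.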
📘 Detailed Summary
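Motivation: General-purpose large vision-language models suffer from "diffuse attention" in dermatological diagnosis: they struggle to separate subtle pathological lesions from background noise. Existing work relies heavily on parameter scaling and neglects the optimization of information transmission efficiency.
Method: The SkinFlow framework uses a Virtual-Width Dynamic Vision Encoder to unfold complex pathological manifolds without adding physical parameters, combined with a two-stage reinforcement learning strategy. Stage I aligns explicit medical descriptions and Stage II reconstructs implicit diagnostic textures, with optimization carried out within a constrained semantic space.
Result: On the Fitzpatrick17k benchmark, the 7B model sets a new state of the art, with a +12.06% gain in Top-1 accuracy and a +28.57% gain in Top-6 accuracy, substantially surpassing large general-purpose models such as Qwen3VL-235B and GPT-5.2.
Conclusion: The study shows that optimizing geometric capacity and information flow yields better diagnostic reasoning than raw parameter scaling, offering a new paradigm for medical AI model design. The clinically grounded evaluation protocol, which emphasizes diagnostic safety and hierarchical relevance, is of practical value.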
📄 Abstract
General-purpose Large Vision-Language Models (LVLMs), despite their massive scale, often falter in dermatology due to "diffuse attention" - the inability to disentangle subtle pathological lesions from background noise. In this paper, we challenge the assumption that parameter scaling is the only path to medical precision. We introduce SkinFlow, a framework that treats diagnosis as an optimization of visual information transmission efficiency. Our approach utilizes a Virtual-Width Dynamic Vision Encoder (DVE) to "unfold" complex pathological manifolds without physical parameter expansion, coupled with a two-stage Reinforcement Learning strategy. This strategy sequentially aligns explicit medical descriptions (Stage I) and reconstructs implicit diagnostic textures (Stage II) within a constrained semantic space. Furthermore, we propose a clinically grounded evaluation protocol that prioritizes diagnostic safety and hierarchical relevance over rigid label matching. Empirical results are compelling: our 7B model establishes a new state-of-the-art on the Fitzpatrick17k benchmark, achieving a +12.06% gain in Top-1 accuracy and a +28.57% boost in Top-6 accuracy over the massive general-purpose models (e.g., Qwen3VL-235B and GPT-5.2). These findings demonstrate that optimizing geometric capacity and information flow yields superior diagnostic reasoning compared to raw parameter scaling.
[8] SSVP: Synergistic Semantic-Visual Prompting for Industrial Zero-Shot Anomaly Detection
Chenhao Fu, Han Fang, Xiuzheng Zheng, Wenbo Wei, Yonghua Li, Hao Sun, Xuelong Li
🧩 TL;DR
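This paper proposes Synergistic Semantic-Visual Prompting (SSVP), a new method for zero-shot anomaly detection that fuses diverse visual encodings to improve the model's fine-grained perception and achieves state-of-the-art performance on multiple industrial benchmarks.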
📘 Detailed Summary
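Motivation: Existing zero-shot anomaly detection methods are constrained by a single visual backbone and struggle to balance global semantic generalization with fine-grained structural discriminability, which limits their effectiveness in industrial inspection.
Method: The SSVP framework has three core components. The Hierarchical Semantic-Visual Synergy (HSVS) mechanism deeply integrates DINOv3's multi-scale structural priors into the CLIP semantic space. The Vision-Conditioned Prompt Generator (VCPG) uses cross-modal attention to guide dynamic prompt generation so that linguistic queries can anchor precisely to specific anomaly patterns. The Visual-Text Anomaly Mapper (VTAM) establishes a dual-gated calibration paradigm to resolve the discrepancy between global scoring and local evidence.
Result: Extensive evaluation on seven industrial benchmarks validates the robustness of the method. On MVTec-AD, SSVP achieves 93.0% Image-AUROC and 92.2% Pixel-AUROC, significantly outperforming existing zero-shot approaches and reaching state-of-the-art performance.
Conclusion: The study shows that synergistically fusing diverse visual encodings can effectively improve fine-grained perception in zero-shot anomaly detection. It provides an efficient supervision-free solution for industrial visual inspection and a new technical path for cross-modal anomaly detection research.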
📄 Abstract
Zero-Shot Anomaly Detection (ZSAD) leverages Vision-Language Models (VLMs) to enable supervision-free industrial inspection. However, existing ZSAD paradigms are constrained by single visual backbones, which struggle to balance global semantic generalization with fine-grained structural discriminability. To bridge this gap, we propose Synergistic Semantic-Visual Prompting (SSVP), which efficiently fuses diverse visual encodings to elevate the model's fine-grained perception. Specifically, SSVP introduces the Hierarchical Semantic-Visual Synergy (HSVS) mechanism, which deeply integrates DINOv3's multi-scale structural priors into the CLIP semantic space. Subsequently, the Vision-Conditioned Prompt Generator (VCPG) employs cross-modal attention to guide dynamic prompt generation, enabling linguistic queries to precisely anchor to specific anomaly patterns. Furthermore, to address the discrepancy between global scoring and local evidence, the Visual-Text Anomaly Mapper (VTAM) establishes a dual-gated calibration paradigm. Extensive evaluations on seven industrial benchmarks validate the robustness of our method; SSVP achieves state-of-the-art performance with 93.0% Image-AUROC and 92.2% Pixel-AUROC on MVTec-AD, significantly outperforming existing zero-shot approaches.
[9] Architecture inside the mirage: evaluating generative image models on architectural style, elements, and typologies
Jamie Magrill, Leah Gornstein, Sandra Seekins, Barry Magrill
🧩 TL;DR
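This study systematically evaluates the accuracy of five mainstream generative AI text-to-image systems on architectural imagery. It finds that these systems have limited accuracy when generating images in this historically rule-bound field, and that performance varies by platform and prompt type.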
📘 Detailed Summary
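Motivation: Generative AI text-to-image systems are increasingly used to produce architectural imagery, yet their capacity to generate accurate images in a historically rule-bound field has not been well characterized. The study aims to fill this gap.
Method: The study evaluates five widely used GenAI image platforms (Adobe Firefly, DALL-E 3, Google Imagen 3, Microsoft Image Generator, and Midjourney) with 30 architectural prompts spanning styles, typologies, and codified elements. Each prompt-generator pair produced four images, for 600 images in total. Two architectural historians independently scored each image for accuracy against predefined criteria and resolved disagreements by consensus.
Result: Images from Common prompts were 2.7 times more accurate than images from Rare prompts. Overall accuracy was limited across platforms (highest 52%, lowest 32%, mean 42%). All-correct outcomes were similar across platforms, whereas all-incorrect outcomes varied substantially: Imagen 3 had the fewest failures and Microsoft Image Generator the most. Qualitative analysis identified recurring patterns, including over-embellishment, confusion between medieval styles and their later revivals, and misrepresentation of descriptive prompts.
Conclusion: The findings support visible labeling of GenAI synthetic content, provenance standards for future training datasets, and cautious use of GenAI architectural imagery in education. They bear on the reliability and ethics of applying AI in professional domains.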
📄 Abstract
Generative artificial intelligence (GenAI) text-to-image systems are increasingly used to generate architectural imagery, yet their capacity to reproduce accurate images in a historically rule-bound field remains poorly characterized. We evaluated five widely used GenAI image platforms (Adobe Firefly, DALL-E 3, Google Imagen 3, Microsoft Image Generator, and Midjourney) using 30 architectural prompts spanning styles, typologies, and codified elements. Each prompt-generator pair produced four images (n = 600 images total). Two architectural historians independently scored each image for accuracy against predefined criteria, resolving disagreements by consensus. Set-level performance was summarized as zero to four accurate images per four-image set. Image output from Common prompts was 2.7-fold more accurate than from Rare prompts (p < 0.05). Across platforms, overall accuracy was limited (highest accuracy score 52 percent; lowest 32 percent; mean 42 percent). All-correct (4 out of 4) outcomes were similar across platforms. By contrast, all-incorrect (0 out of 4) outcomes varied substantially, with Imagen 3 exhibiting the fewest failures and Microsoft Image Generator exhibiting the highest number of failures. Qualitative review of the image dataset identified recurring patterns including over-embellishment, confusion between medieval styles and their later revivals, and misrepresentation of descriptive prompts (for example, egg-and-dart, banded column, pendentive). These findings support the need for visible labeling of GenAI synthetic content, provenance standards for future training datasets, and cautious educational use of GenAI architectural imagery.
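The study design in this abstract implies some simple bookkeeping: 30 prompts across 5 platforms give 150 four-image sets and 600 images, and each set is summarized as zero to four accurate images. The short sketch below reproduces that arithmetic and shows one way to compute set-level summaries. The helper `summarize_sets` and the example scores are invented for illustration and are not data from the study.

```python
from typing import Dict, List

PROMPTS = 30          # architectural prompts (from the abstract)
PLATFORMS = 5         # GenAI image platforms (from the abstract)
IMAGES_PER_SET = 4    # images per prompt-generator pair (from the abstract)

TOTAL_SETS = PROMPTS * PLATFORMS              # 150 prompt-generator pairs
TOTAL_IMAGES = TOTAL_SETS * IMAGES_PER_SET    # 600 images, matching n = 600


def summarize_sets(accurate_per_set: List[int]) -> Dict[str, float]:
    """Summarize sets that were each scored as 0..4 accurate images."""
    n_sets = len(accurate_per_set)
    return {
        "image_accuracy": sum(accurate_per_set) / (n_sets * IMAGES_PER_SET),
        "all_correct_sets": sum(1 for s in accurate_per_set
                                if s == IMAGES_PER_SET),
        "all_incorrect_sets": sum(1 for s in accurate_per_set if s == 0),
    }


# Invented scores for one hypothetical platform, NOT data from the study.
example_scores = [4, 0, 2, 3, 1]
summary = summarize_sets(example_scores)
```

Separating image-level accuracy from the all-correct and all-incorrect counts mirrors the abstract's observation that platforms can look similar on one summary (4 out of 4 outcomes) while differing substantially on another (0 out of 4 outcomes).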
[10] From Performance to Practice: Knowledge-Distilled Segmentator for On-Premises Clinical Workflows
Qizhen Lan, Aaron Choi, Jun Ma, Bo Wang, Zhaogming Zhao, Xiaoqian Jiang, Yu-Chun Hsu
🧩 TL;DR
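This paper presents a deployment-oriented medical image segmentation framework that uses knowledge distillation to turn a high-performing segmentation model into a scalable family of compact student models. It achieves systematic capacity reduction while preserving architectural compatibility, offering a practical deployment option for clinical workflows.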
📘 Detailed Summary
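Motivation: Deploying medical image segmentation models in routine clinical workflows is often constrained by on-premises infrastructure, where computational resources are fixed and cloud-based inference may be restricted by governance and security policies. High-capacity models achieve strong segmentation accuracy, but their computational demands hinder practical deployment and long-term maintainability in hospitals, so a solution is needed that lowers computational cost while preserving performance.
Method: The study proposes a deployment-oriented framework that uses knowledge distillation to convert a high-performing segmentation model into a scalable family of compact student models without modifying the inference pipeline. The approach preserves architectural compatibility with existing clinical systems while enabling systematic capacity reduction. The framework is evaluated on a multi-site brain MRI dataset of 1,104 3D volumes and further examined on abdominal CT to assess cross-modality generalizability.
Result: Under an aggressive 94% parameter reduction, the distilled student model retains 98.7% of the teacher's segmentation accuracy while achieving substantial efficiency gains, including up to a 67% reduction in CPU inference latency with no additional deployment overhead. The framework performs well in independent testing on 101 curated cases and generalizes to abdominal CT, supporting its practicality in real-world health systems.
Conclusion: The results indicate that knowledge distillation is a practical and reliable way to convert research-grade segmentation models into maintainable, deployment-ready components, particularly for on-premises clinical workflows. The approach keeps performance high while substantially lowering computational demands, making model deployment feasible in resource-constrained health systems.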
📄 Abstract
Deploying medical image segmentation models in routine clinical workflows is often constrained by on-premises infrastructure, where computational resources are fixed and cloud-based inference may be restricted by governance and security policies. While high-capacity models achieve strong segmentation accuracy, their computational demands hinder practical deployment and long-term maintainability in hospital environments. We present a deployment-oriented framework that leverages knowledge distillation to translate a high-performing segmentation model into a scalable family of compact student models, without modifying the inference pipeline. The proposed approach preserves architectural compatibility with existing clinical systems while enabling systematic capacity reduction. The framework is evaluated on a multi-site brain MRI dataset comprising 1,104 3D volumes, with independent testing on 101 curated cases, and is further examined on abdominal CT to assess cross-modality generalizability. Under aggressive parameter reduction (94%), the distilled student model preserves nearly all of the teacher's segmentation accuracy (98.7%), while achieving substantial efficiency gains, including up to a 67% reduction in CPU inference latency without additional deployment overhead. These results demonstrate that knowledge distillation provides a practical and reliable pathway for converting research-grade segmentation models into maintainable, deployment-ready components for on-premises clinical workflows in real-world health systems.
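The abstract does not say which distillation objective the framework uses. As a reference point only, the sketch below shows the widely used soft-target formulation of knowledge distillation: a temperature-softened KL divergence between teacher and student class probabilities, scaled by the squared temperature. The function names and the toy logits are assumptions; in a segmentation setting a loss of this kind would typically be averaged over voxels, which this sketch omits.

```python
import math
from typing import List


def softmax_t(logits: List[float], temperature: float) -> List[float]:
    """Softmax over logits divided by a temperature (higher = softer)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits: List[float],
                      student_logits: List[float],
                      temperature: float = 2.0) -> float:
    """Generic soft-target distillation loss, T^2 * KL(teacher || student).

    This is the common textbook formulation, used here as an assumption;
    it is not confirmed to be the loss used in the paper.
    """
    p = softmax_t(teacher_logits, temperature)
    q = softmax_t(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


# Toy logits for a single voxel with three classes, invented for illustration.
teacher = [4.0, 1.0, 0.5]
student_close = [3.5, 1.2, 0.4]
student_far = [0.5, 1.0, 4.0]
loss_close = distillation_loss(teacher, student_close)
loss_far = distillation_loss(teacher, student_far)
```

A student whose logits track the teacher's yields a smaller loss than one that ranks the classes differently, which is the signal that lets a compact student inherit the teacher's behavior.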
[11] Pairing-free Group-level Knowledge Distillation for Robust Gastrointestinal Lesion Classification in White-Light Endoscopy
Qiang Hu, Qimei Wang, Yingjie Guo, Qiang Li, Zhiwei Wang
🧩 TL;DR
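This paper proposes PaGKD, a pairing-free group-level knowledge distillation framework for transferring knowledge across modalities from Narrow-Band Imaging (NBI) to White-Light Imaging (WLI) in endoscopy. The framework needs no paired NBI-WLI images and significantly improves the diagnostic performance of WLI-only models.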
📘 Detailed Summary
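Motivation: In endoscopic cancer screening, Narrow-Band Imaging (NBI) offers better diagnostic detail than standard White-Light Imaging (WLI). Existing methods, however, depend heavily on paired NBI-WLI images of the same lesion, which are costly and often impractical to acquire. As a result, large amounts of clinical data go unused and cross-modal knowledge transfer is limited in practice.
Method: The paper proposes the PaGKD (Pairing-free Group-level Knowledge Distillation) framework, which has two complementary modules. Group-level Prototype Distillation (GKD-Pro) distills compact group representations by extracting modality-invariant semantic prototypes through shared lesion-aware queries. Group-level Dense Distillation (GKD-Den) performs dense cross-modal alignment by guiding group-aware attention with activation-derived relation maps. Together the modules enforce global semantic consistency and local structural coherence without requiring image-level correspondence.
Result: Extensive experiments on four clinical datasets show that PaGKD consistently and significantly outperforms state-of-the-art methods, with relative AUC improvements of 3.3%, 1.1%, 2.8%, and 3.2%, respectively, setting a new performance reference for cross-modal learning from unpaired data.
Conclusion: Through group-level knowledge distillation, PaGKD removes the dependence of conventional cross-modal learning on paired data, provides an effective way to exploit large amounts of unpaired clinical data, and opens a new direction for pairing-free cross-modal learning.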
📄 Abstract
White-Light Imaging (WLI) is the standard for endoscopic cancer screening, but Narrow-Band Imaging (NBI) offers superior diagnostic details. A key challenge is transferring knowledge from NBI to enhance WLI-only models, yet existing methods are critically hampered by their reliance on paired NBI-WLI images of the same lesion, a costly and often impractical requirement that leaves vast amounts of clinical data untapped. In this paper, we break this paradigm by introducing PaGKD, a novel Pairing-free Group-level Knowledge Distillation framework that enables effective cross-modal learning using unpaired WLI and NBI data. Instead of forcing alignment between individual, often semantically mismatched image instances, PaGKD operates at the group level to distill more complete and compatible knowledge across modalities. Central to PaGKD are two complementary modules: (1) Group-level Prototype Distillation (GKD-Pro) distills compact group representations by extracting modality-invariant semantic prototypes via shared lesion-aware queries; (2) Group-level Dense Distillation (GKD-Den) performs dense cross-modal alignment by guiding group-aware attention with activation-derived relation maps. Together, these modules enforce global semantic consistency and local structural coherence without requiring image-level correspondence. Extensive experiments on four clinical datasets demonstrate that PaGKD consistently and significantly outperforms state-of-the-art methods, achieving relative AUC improvements of 3.3%, 1.1%, 2.8%, and 3.2%, respectively, establishing a new direction for cross-modal learning from unpaired data.
[12] SpikeVAEDiff: Neural Spike-based Natural Visual Scene Reconstruction via VD-VAE and Versatile Diffusion
Jialu Li, Taiyan Zhou
🧩 TL;DR
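This paper proposes SpikeVAEDiff, a two-stage framework that combines a very deep variational autoencoder with a diffusion model to reconstruct high-resolution visual scenes from neural spike signals. Evaluation on the Allen Visual Coding dataset shows that specific brain regions play a key role in reconstruction quality.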
📘 Detailed Summary
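Motivation: The study addresses the core challenge of reconstructing natural visual scenes from neural activity, specifically using neural spike signals instead of fMRI data to gain better temporal and spatial resolution. Current methods are limited in producing high-resolution, semantically meaningful image reconstructions from high-dimensional neural data, so a more effective decoding framework is needed.
Method: The paper proposes the two-stage SpikeVAEDiff framework. In the first stage, a very deep variational autoencoder maps neural spike signals to latent representations and produces low-resolution preliminary reconstructions. In the second stage, regression models map the spike signals to CLIP-Vision and CLIP-Text features, and the Versatile Diffusion model refines the reconstructions through image-to-image generation.
Result: Evaluation on the Allen Visual Coding-Neuropixels dataset shows that the VISI region exhibits the most prominent activation and plays a key role in reconstruction quality. Ablation studies show that data from specific brain regions significantly improves reconstruction performance, and spike data provides better temporal and spatial resolution than fMRI-based approaches.
Conclusion: The study demonstrates the effectiveness of combining a deep generative model with a diffusion model and offers a new paradigm for neural decoding. It supports the resolution advantage of spike data over fMRI, and the key role of specific brain regions in visual reconstruction provides insight for research on neural coding mechanisms.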
📄 Abstract
Reconstructing natural visual scenes from neural activity is a key challenge in neuroscience and computer vision. We present SpikeVAEDiff, a novel two-stage framework that combines a Very Deep Variational Autoencoder (VDVAE) and the Versatile Diffusion model to generate high-resolution and semantically meaningful image reconstructions from neural spike data. In the first stage, VDVAE produces low-resolution preliminary reconstructions by mapping neural spike signals to latent representations. In the second stage, regression models map neural spike signals to CLIP-Vision and CLIP-Text features, enabling Versatile Diffusion to refine the images via image-to-image generation. We evaluate our approach on the Allen Visual Coding-Neuropixels dataset and analyze different brain regions. Our results show that the VISI region exhibits the most prominent activation and plays a key role in reconstruction quality. We present both successful and unsuccessful reconstruction examples, reflecting the challenges of decoding neural activity. Compared with fMRI-based approaches, spike data provides superior temporal and spatial resolution. We further validate the effectiveness of the VDVAE model and conduct ablation studies demonstrating that data from specific brain regions significantly enhances reconstruction performance.
[13] Disentangle Object and Non-object Infrared Features via Language Guidance
Fan Liu, Ting Wu, Chuanyi Zhang, Liang Yao, Xing Ma, Yuhui Zheng
🧩 TL;DR
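This paper proposes a novel vision-language representation learning paradigm for infrared object detection, introducing textual supervision to guide the disentanglement of object and non-object features and thereby significantly improving detection performance.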
📘 Detailed Summary
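Motivation: Infrared object detection is valuable in complex environments such as darkness, snow, and rain. However, infrared images have low contrast and weak edge information, which makes it hard to extract discriminative object features and leaves detection insufficiently robust.
Method: The paper proposes a vision-language representation learning paradigm with two modules. A Semantic Feature Alignment module aligns object features with the corresponding text features, and an Object Feature Disentanglement module separates text-aligned object features from non-object features by minimizing their correlation. The disentangled object features are then fed into the detection head.
Result: Extensive experiments on two benchmarks, M³FD and FLIR, show that the method achieves superior performance, reaching 83.7% and 86.1% mAP, respectively, and significantly improving the accuracy and robustness of infrared object detection.
Conclusion: The study demonstrates the effectiveness of textual supervision for feature disentanglement and provides a new vision-language learning paradigm for infrared object detection. By making features more discriminative and less noisy, it significantly improves detection in complex environments.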
📄 Abstract
Infrared object detection focuses on identifying and locating objects in complex environments (e.g., darkness, snow, and rain) where visible imaging cameras are disabled by poor illumination. However, due to low contrast and weak edge information in infrared images, it is challenging to extract discriminative object features for robust detection. To deal with this issue, we propose a novel vision-language representation learning paradigm for infrared object detection. An additional textual supervision with rich semantic information is explored to guide the disentanglement of object and non-object features. Specifically, we propose a Semantic Feature Alignment (SFA) module to align the object features with the corresponding text features. Furthermore, we develop an Object Feature Disentanglement (OFD) module that disentangles text-aligned object features and non-object features by minimizing their correlation. Finally, the disentangled object features are fed into the detection head. In this manner, the detection performance can be remarkably enhanced via more discriminative and less noisy features. Extensive experimental results demonstrate that our approach achieves superior performance on two benchmarks: M³FD (83.7% mAP) and FLIR (86.1% mAP). Our code will be publicly available once the paper is accepted.
[14] PhyRPR: Training-Free Physics-Constrained Video Generation
Yibo Zhao, Hengjia Li, Xiaofei He, Boxi Wu
🧩 TL;DR
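This paper proposes PhyRPR, a training-free three-stage pipeline that decouples physical understanding from visual synthesis to address the difficulty diffusion-based video generation models have in satisfying physical constraints. By separating physical reasoning, motion planning, and appearance refinement, it enables explicit physical control during generation.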
📘 Detailed Summary
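Motivation: Existing diffusion-based video generation models can synthesize visually plausible videos but often fail to satisfy physical constraints. A key reason is that most approaches are single-stage: they entangle high-level physical understanding with low-level visual synthesis, which makes it hard to generate content that requires explicit physical reasoning.
Method: The paper proposes PhyRPR, a training-free three-stage pipeline consisting of physical reasoning, physical planning, and physical refinement. PhyReason uses a large multimodal model for physical state reasoning and an image generator for keyframe synthesis. PhyPlan deterministically synthesizes a controllable coarse motion scaffold. PhyRefine injects this scaffold into diffusion sampling through a latent fusion strategy, refining appearance while preserving the planned dynamics.
Result: Extensive experiments under physics constraints show that the method consistently improves physical plausibility and motion controllability. The three-stage design enables explicit physical control during generation and satisfies physical constraints better than existing methods.
Conclusion: By decoupling physical understanding from visual synthesis in a three-stage framework, the study offers an effective solution to physical constraints in video generation. The staged design improves both physical plausibility and motion controllability and opens a path for generation tasks that require explicit physical reasoning.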
📄 Abstract
Recent diffusion-based video generation models can synthesize visually plausible videos, yet they often struggle to satisfy physical constraints. A key reason is that most existing approaches remain single-stage: they entangle high-level physical understanding with low-level visual synthesis, making it hard to generate content that requires explicit physical reasoning. To address this limitation, we propose a training-free three-stage pipeline, PhyRPR (PhyReason, PhyPlan, PhyRefine), which decouples physical understanding from visual synthesis. Specifically, PhyReason uses a large multimodal model for physical state reasoning and an image generator for keyframe synthesis; PhyPlan deterministically synthesizes a controllable coarse motion scaffold; and PhyRefine injects this scaffold into diffusion sampling via a latent fusion strategy to refine appearance while preserving the planned dynamics. This staged design enables explicit physical control during generation. Extensive experiments under physics constraints show that our method consistently improves physical plausibility and motion controllability.
[15] Multi-Modal LLM based Image Captioning in ICT: Bridging the Gap Between General and Industry Domain
Lianying Chao, Haoran Cai, Xubin Li, Kai Zhang, Sijie Wu, Rui Xu
🧩 TL;DR
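This paper proposes a multi-stage progressive training strategy for training a Domain-specific Image Captioning Model (DICModel) for the ICT domain. With only 7B parameters, the model outperforms state-of-the-art models with 32B parameters and significantly improves domain image understanding.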
📘 Detailed Summary
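Motivation: In the ICT industry, training a domain-specific large language model or building a retrieval-augmented generation system requires a large amount of high-value domain knowledge, and that knowledge is contained in images as well as text. Traditional methods can parse text from domain documents but cannot caption images, while multimodal large language models can understand images but lack sufficient domain expertise.
Method: The paper proposes a multi-stage progressive training strategy for an ICT-domain image captioning model. First, about 7K image-text pairs are synthesized with the Mermaid tool and large language models for first-stage supervised fine-tuning. Next, ICT-domain experts manually annotate about 2K image-text pairs for second-stage supervised fine-tuning. Finally, experts and large language models jointly synthesize about 1.5K visual question answering samples for instruction-based supervised fine-tuning.
Result: Experiments show that DICModel, with only 7B parameters, outperforms other state-of-the-art models with 32B parameters. Compared with SOTA models at 7B and 32B parameters, DICModel improves the BLEU metric by about 56.8% and 20.8%, respectively. On objective questions constructed by ICT-domain experts, DICModel is 1% more accurate than Qwen2.5-VL 32B.
Conclusion: The work extracts logical text from images efficiently and accurately and is expected to promote the development of multimodal models in the ICT domain. The multi-stage progressive training strategy and the standard evaluation system offer an effective solution for domain-specific image understanding, showing that targeted training at a limited parameter scale can exceed the domain performance of larger general-purpose models.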
📄 Abstract
In the information and communications technology (ICT) industry, training a domain-specific large language model (LLM) or constructing a retrieval-augmented generation system requires a substantial amount of high-value domain knowledge. However, this knowledge is hidden not only in the textual modality but also in the image modality. Traditional methods can parse text from domain documents but do not have image captioning ability. Multi-modal LLMs (MLLMs) can understand images, but they do not have sufficient domain knowledge. To address the above issues, this paper proposes a multi-stage progressive training strategy to train a Domain-specific Image Captioning Model (DICModel) in ICT, and constructs a standard evaluation system to validate the performance of DICModel. Specifically, this work first synthesizes about 7K image-text pairs by combining the Mermaid tool and LLMs, which are used for the first-stage supervised fine-tuning (SFT) of DICModel. Then, ICT-domain experts manually annotate about 2K image-text pairs for the second-stage SFT of DICModel. Finally, experts and LLMs jointly synthesize about 1.5K visual question answering samples for the instruction-based SFT. Experimental results indicate that our DICModel with only 7B parameters performs better than other state-of-the-art models with 32B parameters. Compared to the SOTA models with 7B and 32B parameters, our DICModel increases the BLEU metric by approximately 56.8% and 20.8%, respectively. On the objective questions constructed by ICT domain experts, our DICModel outperforms Qwen2.5-VL 32B by 1% in terms of accuracy rate. In summary, this work can efficiently and accurately extract the logical text from images, which is expected to promote the development of multimodal models in the ICT domain.
[16] See More, Store Less: Memory-Efficient Resolution for Video Moment Retrieval
Mingyu Jeon, Sungjin Han, Jinkwon Hwang, Minchol Kwon, Jonghee Kim, Junyeong Kim
🧩 TL;DR
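This paper proposes SMORE, a framework that uses query-guided semantic encoding, importance modulation, and adaptive frame compression to substantially improve memory efficiency in video moment retrieval while maintaining high information resolution, achieving state-of-the-art performance on several benchmarks.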
📘 Detailed Summary
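Motivation: Multimodal large language models have advanced image recognition and reasoning, but video tasks remain limited by memory constraints. Existing video moment retrieval methods rely on sparse frame sampling, which risks information loss, especially in long videos.
Method: SMORE relies on three techniques: query-guided captions encode semantics aligned with user intent; query-aware importance modulation highlights relevant segments; and adaptive frame compression preserves key content while reducing redundancy. Together these enable efficient video understanding without exceeding the memory budget.
Result: Experiments show that SMORE achieves state-of-the-art performance on the QVHighlights, Charades-STA, and ActivityNet-Captions benchmarks, supporting the claim that the framework improves memory efficiency while maintaining high information resolution.
Conclusion: The study shows that semantic alignment and adaptive compression can address the memory bottleneck in video processing and offers a new direction for efficient video understanding. Future work could extend the approach to more complex multimodal video analysis tasks.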
📄 Abstract
Recent advances in Multimodal Large Language Models (MLLMs) have improved image recognition and reasoning, but video-related tasks remain challenging due to memory constraints from dense frame processing. Existing Video Moment Retrieval (VMR) methodologies rely on sparse frame sampling, risking potential information loss, especially in lengthy videos. We propose SMORE (See MORE, store less), a framework that enhances memory efficiency while maintaining high information resolution. SMORE (1) uses query-guided captions to encode semantics aligned with user intent, (2) applies query-aware importance modulation to highlight relevant segments, and (3) adaptively compresses frames to preserve key content while reducing redundancy. This enables efficient video understanding without exceeding memory budgets. Experimental validation reveals that SMORE achieves state-of-the-art performance on QVHighlights, Charades-STA, and ActivityNet-Captions benchmarks.
[17] Radiomics-Integrated Deep Learning with Hierarchical Loss for Osteosarcoma Histology Classification
Yaxi Chen, Zi Ye, Shaheer U. Saeed, Oliver Yu, Simin Ni, Jie Huang, Yipeng Hu
🧩 TL;DR
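This study proposes a deep learning framework that combines radiomic features with a hierarchical loss function to automatically classify tumor regions (viable versus non-viable) in osteosarcoma histopathology images, substantially improving performance on test data sampled independently at the patient level.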
📘 Detailed Summary
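Motivation: Accurate histopathological assessment of viable versus non-viable tumor regions after neoadjuvant chemotherapy for osteosarcoma is critical for prognosis and treatment planning, but manual assessment is labor-intensive, subjective, and prone to inter-observer variability. Earlier studies generalized well at the tile level, but performance dropped significantly on test data sampled independently at the patient level, exposing a limitation of existing methods.
Method: The study proposes two improvements. First, radiomic features derived from the images are introduced as a multimodal input, which improves model performance and interpretability. Second, a hierarchical classification strategy decomposes the single three-class task into two binary tasks (tumor versus non-tumor and viable versus non-viable), with a hierarchical loss function whose task weightings are trainable, to improve per-class performance.
Result: Experiments on the TCIA Osteosarcoma Tumor Assessment dataset show that both the radiomic features and the hierarchical loss significantly improve classification performance. Their combination achieves the best performance for this application on this open dataset and mitigates the performance drop on patient-level test data.
Conclusion: The study demonstrates the effectiveness of multimodal input and hierarchical learning in medical image analysis and provides a new technical framework for automated histopathological assessment. Radiomic features improve both performance and interpretability, while the hierarchical loss improves classification accuracy through task decomposition, offering a reusable approach for similar medical image analysis problems.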
📄 Abstract
Osteosarcoma (OS) is an aggressive primary bone malignancy. Accurate histopathological assessment of viable versus non-viable tumor regions after neoadjuvant chemotherapy is critical for prognosis and treatment planning, yet manual evaluation remains labor-intensive, subjective, and prone to inter-observer variability. Recent advances in digital pathology have enabled automated necrosis quantification. However, evaluation on test data sampled independently at the patient level revealed that deep learning model performance dropped significantly relative to the tile-level generalization ability reported in previous studies. First, this work proposes the use of radiomic features as additional input in model training. We show that, although these features are derived from the images themselves, such a multimodal input effectively improves classification performance, in addition to its added benefits in interpretability. Second, this work proposes to optimize two binary classification tasks with hierarchical classes (i.e. tumor-vs-non-tumor and viable-vs-non-viable), as opposed to the alternative "flat" three-class classification task (i.e. non-tumor, non-viable tumor, viable tumor), thereby enabling a hierarchical loss. We show that with such a hierarchical loss, with trainable weightings between the two tasks, per-class performance can be improved significantly. Using the TCIA OS Tumor Assessment dataset, we experimentally demonstrate the benefits from each of the proposed new approaches and their combination, setting what we consider a new state-of-the-art performance on this open dataset for this application. Code and trained models: https://github.com/YaxiiC/RadiomicsOS.git.
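The hierarchical decomposition described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the fixed weights `w_tumor` and `w_viable` (the abstract describes trainable weightings), and the choice to skip the viability term for non-tumor tiles are all assumptions made for illustration.

```python
import math

# Flat three-class labels: 0 = non-tumor, 1 = non-viable tumor, 2 = viable tumor.
def to_hierarchical(label):
    """Map a flat label to (is_tumor, is_viable); is_viable is None for non-tumor tiles."""
    if label == 0:
        return 0, None
    return 1, 1 if label == 2 else 0

def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one predicted probability p and binary target y."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def hierarchical_loss(p_tumor, p_viable, label, w_tumor=0.5, w_viable=0.5):
    """Weighted sum of the two binary losses.

    Assumption: the viability term is applied only to tumor tiles, and the
    weights are plain floats here rather than learned parameters.
    """
    is_tumor, is_viable = to_hierarchical(label)
    loss = w_tumor * bce(p_tumor, is_tumor)
    if is_viable is not None:
        loss += w_viable * bce(p_viable, is_viable)
    return loss
```

In a training framework the two weights would typically be learnable parameters updated together with the network, which this sketch leaves out.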
[18] Video-MSR: Benchmarking Multi-hop Spatial Reasoning Capabilities of MLLMs
Rui Zhu, Xin Shen, Shuchen Wu, Chenxi Miao, Xin Yu, Yang Li, Weikang Li, Deguo Xia, Jizhou Huang
🧩 TL;DR
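This paper introduces Video-MSR, the first benchmark designed specifically to evaluate multi-hop spatial reasoning in dynamic video scenarios, and builds the MSR-9K instruction-tuning dataset. Fine-tuning Qwen-VL on it yields a +7.82% improvement.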
📘 Detailed Summary
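Motivation: Existing spatial reasoning benchmarks for multimodal large language models focus mainly on single-step perception-to-judgment tasks. Scenarios that require complex visual-spatial logical chains are badly underexplored, and the evaluation of multi-hop spatial reasoning in dynamic video is a notable gap.
Method: The Video-MSR benchmark covers four tasks: Constrained Localization, Chain-based Reference Retrieval, Route Planning, and Counterfactual Physical Deduction. It contains 3,052 high-quality video instances and 4,993 question-answer pairs, built with a scalable, visually grounded pipeline that combines advanced model generation with rigorous human verification. The authors also curate MSR-9K, a dedicated instruction-tuning dataset for fine-tuning.
Result: An evaluation of 20 state-of-the-art MLLMs reveals significant limitations: models are proficient at surface-level perception but show clear performance drops on MSR tasks, frequently suffering from spatial disorientation and hallucination during multi-step deduction. Fine-tuning Qwen-VL on MSR-9K yields a +7.82% absolute improvement on Video-MSR.
Conclusion: The results confirm the value of multi-hop spatial instruction data and establish Video-MSR as a foundation for future research. The benchmark exposes the weaknesses of current MLLMs in complex spatial reasoning and shows that dedicated instruction tuning can markedly improve multi-hop spatial reasoning.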
📄 Abstract
Spatial reasoning has emerged as a critical capability for Multimodal Large Language Models (MLLMs), drawing increasing attention and rapid advancement. However, existing benchmarks primarily focus on single-step perception-to-judgment tasks, leaving scenarios requiring complex visual-spatial logical chains significantly underexplored. To bridge this gap, we introduce Video-MSR, the first benchmark specifically designed to evaluate Multi-hop Spatial Reasoning (MSR) in dynamic video scenarios. Video-MSR systematically probes MSR capabilities through four distinct tasks: Constrained Localization, Chain-based Reference Retrieval, Route Planning, and Counterfactual Physical Deduction. Our benchmark comprises 3,052 high-quality video instances with 4,993 question-answer pairs, constructed via a scalable, visually-grounded pipeline combining advanced model generation with rigorous human verification. Through a comprehensive evaluation of 20 state-of-the-art MLLMs, we uncover significant limitations, revealing that while models demonstrate proficiency in surface-level perception, they exhibit distinct performance drops in MSR tasks, frequently suffering from spatial disorientation and hallucination during multi-step deductions. To mitigate these shortcomings and empower models with stronger MSR capabilities, we further curate MSR-9K, a specialized instruction-tuning dataset, and fine-tune Qwen-VL, achieving a +7.82% absolute improvement on Video-MSR. Our results underscore the efficacy of multi-hop spatial instruction data and establish Video-MSR as a vital foundation for future research. The code and data will be available at https://github.com/ruiz-nju/Video-MSR.
[19] Do Transformers Understand Ancient Roman Coin Motifs Better than CNNs?
David Reid, Ognjen Arandjelovic
🧩 TL;DR
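This paper is the first to apply the Vision Transformer architecture to the identification of semantic elements on ancient coins, using fully automatic learning from multi-modal data, and finds that the ViT models outperform newly trained CNN models in accuracy.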
📘 Detailed Summary
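Motivation: Automated analysis of ancient coins can help researchers extract more historical insight from large coin collections and help collectors understand what they are buying or selling. Existing work mainly uses convolutional neural networks to identify semantic elements on coins and has not yet explored the recently proposed Vision Transformer architecture in this area.
Method: The study applies the Vision Transformer architecture to the identification of semantic elements on ancient coins for the first time, using fully automatic learning from multi-modal data that covers both images and unstructured text. CNN models are trained as a comparison baseline, and the training and implementation of both the ViT and CNN models are discussed in detail.
Result: The experimental evaluation shows that the Vision Transformer models achieve higher accuracy than the newly trained CNN models on this task. The study provides a performance comparison of the two architectures.
Conclusion: The study indicates that the Vision Transformer architecture has a clear advantage for ancient coin analysis and offers a new technical route for the digitization and automated analysis of cultural heritage. Multi-modal learning that combines image and text information opens a promising direction for recognizing complex historical artifacts, and future work could explore broader use of Transformer architectures in artifact analysis.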
📄 Abstract
Automated analysis of ancient coins has the potential to help researchers extract more historical insights from large collections of coins and to help collectors understand what they are buying or selling. Recent research in this area has shown promise in focusing on identification of semantic elements as they are commonly depicted on ancient coins, by using convolutional neural networks (CNNs). This paper is the first to apply the recently proposed Vision Transformer (ViT) deep learning architecture to the task of identification of semantic elements on coins, using fully automatic learning from multi-modal data (images and unstructured text). This article summarises previous research in the area, discusses the training and implementation of ViT and CNN models for ancient coins analysis and provides an evaluation of their performance. The ViT models were found to outperform the newly trained CNN models in accuracy.
[20] PrivLEX: Detecting legal concepts in images through Vision-Language Models
Darya Baranouskaya, Andrea Cavallaro
🧩 TL;DR
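This paper presents PrivLEX, a novel image privacy classifier that grounds its decisions in legally defined personal data concepts. It is the first interpretable privacy classifier that is aligned with legal concepts and uses the recognition capabilities of vision-language models.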
📘 Detailed Summary
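Motivation: Current image privacy classifiers are not aligned with legally defined personal data concepts, which makes their decisions opaque and hard to explain and leaves them unable to meet the practical legal requirements of privacy protection.
Method: PrivLEX uses zero-shot vision-language model concept detection and provides interpretable classification through a label-free Concept Bottleneck Model. It needs no explicit concept labels during training and combines VLM recognition capabilities with a framework of legal concepts.
Result: Experiments show that PrivLEX can identify the personal data concepts present in images. The paper also analyses how sensitive such concepts are perceived to be by human annotators of image privacy datasets.
Conclusion: The study provides the first legally aligned interpretable classification framework for privacy protection. By using the zero-shot capability of VLMs it detects concepts without concept annotations, opening a route to transparent decision-making in privacy-sensitive applications.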
📄 Abstract
We present PrivLEX, a novel image privacy classifier that grounds its decisions in legally defined personal data concepts. PrivLEX is the first interpretable privacy classifier aligned with legal concepts that leverages the recognition capabilities of Vision-Language Models (VLMs). PrivLEX relies on zero-shot VLM concept detection to provide interpretable classification through a label-free Concept Bottleneck Model, without requiring explicit concept labels during training. We demonstrate PrivLEX's ability to identify personal data concepts that are present in images. We further analyse the sensitivity of such concepts as perceived by human annotators of image privacy datasets.
[21] Towards Robust Cross-Dataset Object Detection Generalization under Domain Specificity
Ritabrata Chakraborty, Hrishit Mitra, Shivakumara Palaiahnakote, Umapada Pal
🧩 TL;DR
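This study systematically analyses cross-dataset object detection through the lens of setting specificity. It finds that transfer within the same setting type is relatively stable while transfer across setting types drops substantially, and it offers practical guidance for evaluating detectors under distribution shift.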
📘 Detailed Summary
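Motivation: Object detectors perform well in-distribution but degrade sharply on a different benchmark. The study analyses cross-dataset object detection through the lens of setting specificity, examining how detector performance changes when transferring between different kinds of datasets and, in particular, separating the effects of domain shift and label mismatch.
Method: Benchmarks are grouped into setting-agnostic datasets, which contain diverse everyday scenes, and setting-specific datasets, which are tied to a narrow environment. A standard detector family is evaluated across all train-test pairs. To disentangle domain shift from label mismatch, closed-label transfer is compared with an open-label protocol that maps predicted classes to the nearest target label using CLIP similarity.
Result: The experiments reveal a clear structure in cross-dataset object detection: transfer within the same setting type is relatively stable, while transfer across setting types drops substantially and is often asymmetric. The most severe breakdowns occur when transferring from specific sources to agnostic targets and persist after open-label alignment, indicating that domain shift dominates in the hardest regimes. Open-label evaluation yields consistent but bounded gains, and many corrected cases correspond to semantic near-misses supported by the image evidence.
Conclusion: The study provides a principled characterization of cross-dataset object detection under setting specificity and practical guidance for evaluating detectors under distribution shift. Domain shift dominates in the hardest transfer scenarios, while the open-label protocol partly mitigates label mismatch with limited gains. These findings help in understanding detector generalization and in designing more effective cross-dataset evaluation strategies.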
📄 Abstract
Object detectors often perform well in-distribution, yet degrade sharply on a different benchmark. We study cross-dataset object detection (CD-OD) through a lens of setting specificity. We group benchmarks into setting-agnostic datasets with diverse everyday scenes and setting-specific datasets tied to a narrow environment, and evaluate a standard detector family across all train--test pairs. This reveals a clear structure in CD-OD: transfer within the same setting type is relatively stable, while transfer across setting types drops substantially and is often asymmetric. The most severe breakdowns occur when transferring from specific sources to agnostic targets, and persist after open-label alignment, indicating that domain shift dominates in the hardest regimes. To disentangle domain shift from label mismatch, we compare closed-label transfer with an open-label protocol that maps predicted classes to the nearest target label using CLIP similarity. Open-label evaluation yields consistent but bounded gains, and many corrected cases correspond to semantic near-misses supported by the image evidence. Overall, we provide a principled characterization of CD-OD under setting specificity and practical guidance for evaluating detectors under distribution shift. Code will be released at https://github.com/Ritabrata04/cdod-icpr.
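The open-label step described in this abstract, mapping each predicted class to the nearest target label by embedding similarity, can be sketched as follows. This is an illustration only: the 3-d vectors are made-up stand-ins for CLIP text embeddings, the label names are hypothetical, and the use of cosine similarity is an assumption, since the abstract says only "CLIP similarity".

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def map_to_target_label(pred_label, source_emb, target_emb):
    """Return the target label whose embedding is most similar to the predicted label's."""
    query = source_emb[pred_label]
    return max(target_emb, key=lambda name: cosine(query, target_emb[name]))

# Toy vectors standing in for text embeddings (values invented for illustration).
source_emb = {"automobile": [0.9, 0.1, 0.0], "pedestrian": [0.0, 0.2, 0.9]}
target_emb = {
    "car": [1.0, 0.0, 0.1],
    "person": [0.1, 0.1, 1.0],
    "bicycle": [0.3, 0.9, 0.1],
}

# Remap every source-vocabulary prediction into the target vocabulary.
mapped = {name: map_to_target_label(name, source_emb, target_emb) for name in source_emb}
```

With real embeddings the same lookup would run once per source class, so the remapping table can be built ahead of evaluation.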
[22] CogRail: Benchmarking VLMs in Cognitive Intrusion Perception for Intelligent Railway Transportation Systems
Yonglin Tian, Qiyao Zhang, Wei Xu, Yutong Wang, Yihao Wu, Xinyi Li, Xingyuan Dai, Hui Zhang, Zhiyong Cui, Baoqing Guo, Zujun Yu, Yisheng Lv
🧩 TL;DR
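This paper introduces the CogRail benchmark for evaluating the spatio-temporal reasoning of vision-language models in railway intrusion perception, and develops a joint fine-tuning framework that integrates three core tasks (position perception, movement prediction, and threat analysis) to significantly improve model performance in this safety-critical domain.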
📘 Detailed Summary
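Motivation: Existing railway intrusion perception systems focus mainly on object classification within fixed visual scopes and apply rule-based heuristics to decide intrusion status, often overlooking targets that pose latent intrusion risks. Anticipating these risks requires understanding the spatial context and temporal dynamics of the object of interest, which is challenging for conventional visual models.
Method: The study introduces the CogRail benchmark, which integrates curated open-source datasets with cognitively driven question-answer annotations to support spatio-temporal reasoning and prediction. On this basis it systematically evaluates state-of-the-art vision-language models and proposes a joint fine-tuning framework that integrates position perception, movement prediction, and threat analysis, helping to adapt general-purpose foundation models into models specialized for cognitive intrusion perception.
Result: Extensive experiments show that current large-scale multimodal models struggle with the complex spatio-temporal reasoning that cognitive intrusion perception requires, underscoring the limitations of existing foundation models in this safety-critical domain. In contrast, the proposed joint fine-tuning framework significantly improves model performance through targeted adaptation to domain-specific reasoning demands, showing the advantages of structured multi-task learning for both accuracy and interpretability.
Conclusion: The study highlights the importance of cognitively driven approaches in safety-critical perception systems and shows that a structured multi-task learning framework can adapt general-purpose foundation models to domain-specific tasks. The results point to a new direction for more reliable and interpretable railway intrusion perception systems and expose the limitations of current multimodal models on complex spatio-temporal reasoning.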
📄 Abstract
Accurate and early perception of potential intrusion targets is essential for ensuring the safety of railway transportation systems. However, most existing systems focus narrowly on object classification within fixed visual scopes and apply rule-based heuristics to determine intrusion status, often overlooking targets that pose latent intrusion risks. Anticipating such risks requires the cognition of spatial context and temporal dynamics for the object of interest (OOI), which presents challenges for conventional visual models. To facilitate deep intrusion perception, we introduce a novel benchmark, CogRail, which integrates curated open-source datasets with cognitively driven question-answer annotations to support spatio-temporal reasoning and prediction. Building upon this benchmark, we conduct a systematic evaluation of state-of-the-art visual-language models (VLMs) using multimodal prompts to identify their strengths and limitations in this domain. Furthermore, we fine-tune VLMs for better performance and propose a joint fine-tuning framework that integrates three core tasks, position perception, movement prediction, and threat analysis, facilitating effective adaptation of general-purpose foundation models into specialized models tailored for cognitive intrusion perception. Extensive experiments reveal that current large-scale multimodal models struggle with the complex spatial-temporal reasoning required by the cognitive intrusion perception task, underscoring the limitations of existing foundation models in this safety-critical domain. In contrast, our proposed joint fine-tuning framework significantly enhances model performance by enabling targeted adaptation to domain-specific reasoning demands, highlighting the advantages of structured multi-task learning in improving both accuracy and interpretability. Code will be available at https://github.com/Hub-Tian/CogRail.
[23] Video Joint-Embedding Predictive Architectures for Facial Expression Recognition
Lennart Eing, Cristina Luna-Jiménez, Silvan Mertes, Elisabeth André
🧩 TL;DR
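This paper proposes a new approach to facial expression recognition based on Video Joint-Embedding Predictive Architectures (V-JEPA). Pre-trained by predicting embeddings rather than reconstructing pixels, it achieves state-of-the-art performance on RAVDESS and outperforms all other vision-based methods on CREMA-D.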
📘 Detailed Summary
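Motivation: Conventional pre-training methods for video understanding that rely on pixel reconstruction may capture background information irrelevant to the task. The paper explores whether purely embedding-predictive pre-training is effective for facial expression recognition, with the aim of improving the extraction of relevant features and generalization.
Method: The method uses V-JEPA, which learns video representations by predicting the embeddings of masked regions rather than reconstructing pixels. Features are extracted with a pre-trained V-JEPA video encoder, and shallow classifiers are trained on the RAVDESS and CREMA-D datasets for facial expression recognition.
Result: The method achieves state-of-the-art performance on RAVDESS and outperforms all other vision-based methods on CREMA-D (+1.48 WAR). Cross-dataset evaluations show strong generalization, supporting the effectiveness of embedding-predictive pre-training.
Conclusion: The study shows that purely embedding-predictive pre-training can avoid capturing irrelevant background information and delivers strong performance and generalization in facial expression recognition. It offers an alternative pre-training paradigm for video understanding and has the potential to advance FER.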
📄 Abstract
This paper introduces a novel application of Video Joint-Embedding Predictive Architectures (V-JEPAs) for Facial Expression Recognition (FER). Departing from conventional pre-training methods for video understanding that rely on pixel-level reconstructions, V-JEPAs learn by predicting embeddings of masked regions from the embeddings of unmasked regions. This allows the trained encoder to avoid capturing irrelevant information about a given video, such as the color of a region of background pixels. Using a pre-trained V-JEPA video encoder, we train shallow classifiers using the RAVDESS and CREMA-D datasets, achieving state-of-the-art performance on RAVDESS and outperforming all other vision-based methods on CREMA-D (+1.48 WAR). Furthermore, cross-dataset evaluations reveal strong generalization capabilities, demonstrating the potential of purely embedding-based pre-training approaches to advance FER. We release our code at https://github.com/lennarteingunia/vjepa-for-fer.
[24] Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning
Chi-Pin Huang, Yunze Man, Zhiding Yu, Min-Hung Chen, Jan Kautz, Yu-Chiang Frank Wang, Fu-En Yang
🧩 TL;DR
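This paper proposes Fast-ThinkAct, a framework that achieves compact yet effective planning through verbalizable latent reasoning, substantially reducing inference latency while retaining strong long-horizon planning ability.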
📘 Detailed Summary
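Motivation: Vision-Language-Action tasks require reasoning over complex visual scenes and executing adaptive actions in dynamic environments. Existing explicit chain-of-thought methods improve generalization but suffer from high latency caused by lengthy reasoning traces, so a more efficient reasoning framework is needed.
Method: Fast-ThinkAct uses a verbalizable latent reasoning framework. It learns latent chains of thought by distilling from a teacher model, uses a preference-guided objective to align manipulation trajectories, and transfers both linguistic and visual planning capabilities. This enables reasoning-enhanced policy learning that connects compact reasoning to action execution.
Result: Across diverse embodied manipulation and reasoning benchmarks, Fast-ThinkAct reduces inference latency by up to 89.3% compared with state-of-the-art reasoning VLA methods, while maintaining effective long-horizon planning, few-shot adaptation, and failure recovery.
Conclusion: The study shows that latent reasoning distillation and preference alignment can greatly reduce inference latency while preserving strong planning performance, offering a new approach to efficient embodied systems that balances reasoning quality against computational cost.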
📄 Abstract
Vision-Language-Action (VLA) tasks require reasoning over complex visual scenes and executing adaptive actions in dynamic environments. While recent studies on reasoning VLAs show that explicit chain-of-thought (CoT) can improve generalization, they suffer from high inference latency due to lengthy reasoning traces. We propose Fast-ThinkAct, an efficient reasoning framework that achieves compact yet performant planning through verbalizable latent reasoning. Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher, driven by a preference-guided objective to align manipulation trajectories that transfers both linguistic and visual planning capabilities for embodied control. This enables reasoning-enhanced policy learning that effectively connects compact reasoning to action execution. Extensive experiments across diverse embodied manipulation and reasoning benchmarks demonstrate that Fast-ThinkAct achieves strong performance with up to 89.3% reduced inference latency over state-of-the-art reasoning VLAs, while maintaining effective long-horizon planning, few-shot adaptation, and failure recovery.
[25] OpenVoxel: Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding
Sheng-Yu Huang, Jaesung Choe, Yu-Chiang Frank Wang, Cheng Sun
🧩 TL;DR
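This paper proposes OpenVoxel, a training-free method for grouping and captioning sparse voxels for open-vocabulary 3D scene understanding. By directly using vision-language models and multimodal large language models, it builds an informative scene map and shows superior performance on complex referring expression segmentation.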
📘 Detailed Summary
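Motivation: Existing 3D scene understanding methods usually require training or depend on embeddings from a CLIP/BERT text encoder, which limits their flexibility and generalization. The study aims to develop a training-free algorithm that groups and captions sparse voxels directly, enabling more flexible and efficient open-vocabulary 3D scene understanding.
Method: OpenVoxel is a training-free algorithm. Starting from a sparse voxel rasterization model obtained from multi-view images, it groups sparse voxels into meaningful groups that describe the different objects in the scene. It uses vision-language models and multimodal large language models directly to generate a descriptive caption for each group, and answers queries through text-to-text search rather than through CLIP/BERT text encoder embeddings.
Result: Extensive experiments show that OpenVoxel outperforms recent methods, particularly on complex referring expression segmentation. It builds an informative scene map that supports further 3D scene understanding tasks such as open-vocabulary segmentation and referring expression segmentation, while remaining training-free.
Conclusion: OpenVoxel demonstrates the effectiveness of training-free methods for 3D scene understanding, in particular the strategy of using multimodal large language models directly for text-to-text search. It offers a more flexible and efficient approach to open-vocabulary 3D scene understanding and may move the field toward less reliance on pre-trained embeddings.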
📄 Abstract
We propose OpenVoxel, a training-free algorithm for grouping and captioning sparse voxels for open-vocabulary 3D scene understanding tasks. Given the sparse voxel rasterization (SVR) model obtained from multi-view images of a 3D scene, our OpenVoxel is able to produce meaningful groups that describe different objects in the scene. Also, by leveraging powerful Vision Language Models (VLMs) and Multi-modal Large Language Models (MLLMs), our OpenVoxel builds an informative scene map by captioning each group, enabling further 3D scene understanding tasks such as open-vocabulary segmentation (OVS) or referring expression segmentation (RES). Unlike previous methods, our method is training-free and does not introduce embeddings from a CLIP/BERT text encoder. Instead, we directly proceed with text-to-text search using MLLMs. Through extensive experiments, our method demonstrates superior performance compared to recent studies, particularly in complex referring expression segmentation (RES) tasks. The code will be open.
[26] GRCF: Two-Stage Groupwise Ranking and Calibration Framework for Multimodal Sentiment Analysis
Manning Gao, Leheng Zhang, Shiqin Han, Haifeng Hu, Yuncheng Jiang, Sijie Mai
🧩 TL;DR
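This paper proposes a Two-Stage Group-wise Ranking and Calibration Framework (GRCF). Using an advantage-weighted dynamic margin ranking loss and an MAE-driven objective, it addresses two weaknesses of conventional pairwise ranking in multimodal sentiment analysis, namely insufficient focus on hard samples and static margins, and achieves state-of-the-art performance on both regression and classification tasks.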
📘 Detailed Summary
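Motivation: Most multimodal sentiment analysis research focuses on point-wise regression, which is sensitive to label noise and ignores the relative order between samples, leading to unstable predictions and poor correlation alignment. Pairwise ordinal learning frameworks address this by learning relative order from comparisons, but they introduce two new problems: they assign uniform importance to all comparisons and so fail to focus adaptively on hard-to-rank samples, and they use static ranking margins that cannot reflect the varying semantic distances between sentiment groups.
Method: The paper proposes GRCF, which adapts the philosophy of Group Relative Policy Optimization (GRPO). Stage 1 introduces a GRPO-inspired Advantage-Weighted Dynamic Margin Ranking Loss to build a fine-grained ordinal structure. Stage 2 uses an MAE-driven objective to align prediction magnitudes. To validate generalizability, the authors extend GRCF to classification tasks, including multimodal humor detection and sarcasm detection.
Result: GRCF achieves state-of-the-art performance on core regression benchmarks and also generalizes well to classification tasks. It preserves relative ordinal structure, ensures absolute score calibration, and focuses adaptively on difficult samples.
Conclusion: The study shows the importance of adaptive focus on hard samples and dynamic margins for ordinal learning in multimodal sentiment analysis. GRCF addresses the limitations of conventional pairwise ranking and generalizes from regression to classification, offering a new methodological perspective and a practical framework for multimodal ordinal learning.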
📄 Abstract
Most Multimodal Sentiment Analysis research has focused on point-wise regression. While straightforward, this approach is sensitive to label noise and neglects whether one sample is more positive than another, resulting in unstable predictions and poor correlation alignment. Pairwise ordinal learning frameworks emerged to address this gap, capturing relative order by learning from comparisons. Yet, they introduce two new trade-offs: First, they assign uniform importance to all comparisons, failing to adaptively focus on hard-to-rank samples. Second, they employ static ranking margins, which fail to reflect the varying semantic distances between sentiment groups. To address this, we propose a Two-Stage Group-wise Ranking and Calibration Framework (GRCF) that adapts the philosophy of Group Relative Policy Optimization (GRPO). Our framework resolves these trade-offs by simultaneously preserving relative ordinal structure, ensuring absolute score calibration, and adaptively focusing on difficult samples. Specifically, Stage 1 introduces a GRPO-inspired Advantage-Weighted Dynamic Margin Ranking Loss to build a fine-grained ordinal structure. Stage 2 then employs an MAE-driven objective to align prediction magnitudes. To validate its generalizability, we extend GRCF to classification tasks, including multimodal humor detection and sarcasm detection. GRCF achieves state-of-the-art performance on core regression benchmarks, while also showing strong generalizability in classification tasks.
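To make the contrast between a static and a dynamic ranking margin concrete, here is a generic pairwise hinge loss whose margin grows with the gap between two sentiment labels. It is not GRCF's Advantage-Weighted Dynamic Margin Ranking Loss, whose exact form the abstract does not give; the function name, the linear margin rule, and the `scale` parameter are assumptions chosen for illustration, and the advantage weighting is omitted.

```python
def dynamic_margin_ranking_loss(scores, labels, scale=0.5):
    """Average hinge loss over ordered pairs with a label-dependent margin.

    For every pair (i, j) with labels[i] > labels[j], the predicted score of i
    should exceed that of j by at least scale * (labels[i] - labels[j]).
    A static-margin loss would use the same constant for every pair instead.
    """
    total, count = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:
                margin = scale * (labels[i] - labels[j])
                total += max(0.0, margin - (scores[i] - scores[j]))
                count += 1
    return total / count if count else 0.0
```

Pairs that are far apart in sentiment must be separated by a larger score gap than near-neighbours, which is the behaviour a static margin cannot express.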
[27] Identifying Models Behind Text-to-Image Leaderboards
Ali Naseh, Yuefeng Peng, Anshuman Suri, Harsh Chaudhari, Alina Oprea, Amir Houmansadr
🧩 TL;DR
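This study shows that voting-based text-to-image leaderboards have a serious security flaw: images generated by different models form distinctive clusters in embedding space, so anonymization can be easily broken.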
📘 Detailed Summary
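Motivation: Quality assessment of text-to-image models currently relies on voting-based leaderboards, which assume that model outputs are anonymized to ensure fairness. The security of this anonymization has not been adequately tested, and the study sets out to expose the resulting vulnerabilities.
Method: The study proposes a centroid-based deanonymization method. Analysing about 150K images generated by 22 text-to-image models from 280 prompts, it finds that each model's generations form a distinctive cluster in the image embedding space. The method needs neither control over prompts nor access to training data and identifies the model from image embeddings alone.
Result: The centroid-based method identifies the generating model with high accuracy and reveals systematic model-specific signatures. The study also introduces a prompt-level distinguishability metric and conducts large-scale analyses showing that certain prompts lead to near-perfect distinguishability.
Conclusion: The study exposes fundamental security flaws in text-to-image leaderboards and shows that current anonymization is insufficient to protect model identity. The findings motivate stronger anonymization defenses to keep model evaluation fair and secure, with implications for AI-generated content attribution and the protection of model intellectual property.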
📄 Abstract
Text-to-image (T2I) models are increasingly popular, producing a large share of AI-generated images online. To compare model quality, voting-based leaderboards have become the standard, relying on anonymized model outputs for fairness. In this work, we show that such anonymity can be easily broken. We find that generations from each T2I model form distinctive clusters in the image embedding space, enabling accurate deanonymization without prompt control or training data. Using 22 models and 280 prompts (150K images), our centroid-based method achieves high accuracy and reveals systematic model-specific signatures. We further introduce a prompt-level distinguishability metric and conduct large-scale analyses showing how certain prompts can lead to near-perfect distinguishability. Our findings expose fundamental security flaws in T2I leaderboards and motivate stronger anonymization defenses.
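A centroid-based identification step of the kind this abstract describes can be sketched as a nearest-centroid classifier. The sketch makes several assumptions that the abstract does not confirm: embeddings are plain lists of floats, distance is Euclidean, and the model names and 2-d vectors are invented toy data rather than real image embeddings.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def build_centroids(embeddings_by_model):
    """One centroid per model, computed from reference generations."""
    return {model: centroid(vs) for model, vs in embeddings_by_model.items()}

def identify(embedding, centroids):
    """Guess the source model as the one with the nearest centroid."""
    return min(centroids, key=lambda model: math.dist(embedding, centroids[model]))

# Toy reference embeddings for two hypothetical models.
reference = {
    "model_a": [[0.0, 0.1], [0.2, 0.0]],
    "model_b": [[1.0, 0.9], [0.8, 1.1]],
}
centroids = build_centroids(reference)
guess_1 = identify([0.1, 0.2], centroids)
guess_2 = identify([0.95, 0.8], centroids)
```

An attacker in this setting would build the reference set by querying each candidate model directly, then compare an anonymized leaderboard image against the stored centroids.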
[28] Image2Garment: Simulation-ready Garment Generation from a Single Image
Selim Emir Can, Jan Ackermann, Kiyohiro Nakayama, Ruofan Liu, Tong Wu, Yang Zheng, Hugo Bertiche, Menglei Chai, Thabo Beeler, Gordon Wetzstein
🧩 TL;DR
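This paper proposes a feed-forward framework that estimates physically accurate, simulation-ready garments directly from a single image. It combines a vision-language model for inferring material properties with a lightweight physical parameter predictor, and needs neither multi-view capture nor iterative optimization.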
📘 Detailed Summary
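Motivation: Estimating physically accurate, simulation-ready garments from a single image faces two main challenges: the absence of image-to-physics datasets and the ill-posed nature of the problem. Prior methods either require multi-view capture and expensive differentiable simulation, or predict only garment geometry without the material properties that simulation needs, which limits practical use.
Method: The feed-forward framework first fine-tunes a vision-language model to infer material composition and fabric attributes from real images, then trains a lightweight predictor that maps these attributes to the corresponding physical fabric parameters. The framework introduces two new datasets (FTAG and T2P) and produces simulation-ready garments directly, without iterative optimization.
Result: Experiments show higher accuracy in material composition estimation and fabric attribute prediction. Passing these predictions through the physical parameter estimator yields higher-fidelity simulations than state-of-the-art image-to-garment methods.
Conclusion: The study shows that combining a vision-language model with a physical parameter mapping can estimate simulation-ready garments from a single image without multi-view capture or expensive optimization. The method offers a practical approach for computer graphics and virtual try-on and illustrates the potential of cross-modal learning for estimating physical properties.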
📄 Abstract
Estimating physically accurate, simulation-ready garments from a single image is challenging due to the absence of image-to-physics datasets and the ill-posed nature of this problem. Prior methods either require multi-view capture and expensive differentiable simulation or predict only garment geometry without the material properties required for realistic simulation. We propose a feed-forward framework that sidesteps these limitations by first fine-tuning a vision-language model to infer material composition and fabric attributes from real images, and then training a lightweight predictor that maps these attributes to the corresponding physical fabric parameters using a small dataset of material-physics measurements. Our approach introduces two new datasets (FTAG and T2P) and delivers simulation-ready garments from a single image without iterative optimization. Experiments show that our estimator achieves superior accuracy in material composition estimation and fabric attribute prediction, and by passing them through our physics parameter estimator, we further achieve higher-fidelity simulations compared to state-of-the-art image-to-garment methods.
[29] LiteEmbed: Adapting CLIP to Rare Classes
Aishwarya Agarwal, Srikrishna Karanam, Vineet Gandhi
🧩 TL;DR
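LiteEmbed is a lightweight framework for few-shot personalization of CLIP. Through subspace-guided optimization of text embeddings, it allows new classes to be added without retraining the encoders and substantially improves recognition of rare and unseen classes.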
📘 Detailed Summary
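Motivation: Large-scale vision-language models such as CLIP perform well at zero-shot recognition but struggle with classes that are rarely seen during pretraining, including newly emerging entities and culturally specific categories, which limits their adaptability and coverage in real-world use.
Method: LiteEmbed performs subspace-guided optimization of text embeddings. A PCA-based decomposition disentangles coarse semantic directions from fine-grained variations, and two complementary objectives, coarse alignment and fine separation, preserve global semantic consistency while improving discriminability among visually similar classes.
Result: Extensive experiments show substantial gains over prior methods. The optimized embeddings are plug-and-play replacements for CLIP's original text features across classification, retrieval, segmentation, and detection tasks, making LiteEmbed an effective approach for adapting CLIP to underrepresented, rare, or unseen classes.
Conclusion: The study provides a lightweight and efficient framework for few-shot personalization of large-scale vision-language models. By optimizing within a disentangled semantic space, it extends recognition to rare and new classes while preserving the model's existing capabilities, with broad practical value.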
📄 Abstract
Large-scale vision-language models such as CLIP achieve strong zero-shot recognition but struggle with classes that are rarely seen during pretraining, including newly emerging entities and culturally specific categories. We introduce LiteEmbed, a lightweight framework for few-shot personalization of CLIP that enables new classes to be added without retraining its encoders. LiteEmbed performs subspace-guided optimization of text embeddings within CLIP's vocabulary, leveraging a PCA-based decomposition that disentangles coarse semantic directions from fine-grained variations. Two complementary objectives, coarse alignment and fine separation, jointly preserve global semantic consistency while enhancing discriminability among visually similar classes. Once optimized, the embeddings are plug-and-play, seamlessly substituting CLIP's original text features across classification, retrieval, segmentation, and detection tasks. Extensive experiments demonstrate substantial gains over prior methods, establishing LiteEmbed as an effective approach for adapting CLIP to underrepresented, rare, or unseen classes.
[30] Self-Supervised Animal Identification for Long Videos
Xuyang Fang, Sion Hannuna, Edwin Simpson, Neill Campbell
🧩 TL;DR
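This paper proposes an efficient self-supervised method for identifying individual animals that reframes identification as a global clustering problem rather than sequential tracking. It needs only bounding box detections and the total number of individuals, achieves over 97% accuracy on consumer-grade hardware, and greatly reduces compute and annotation requirements.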
📘 Detailed Summary
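Motivation: Traditional methods for identifying individual animals require extensive manual annotation, while existing self-supervised methods are computationally expensive and poorly suited to long video sequences because of memory constraints and temporal error propagation. This limits practical use in resource-constrained research settings.
Method: The method reframes animal identification as a global clustering task. It assumes a known, fixed number of individuals in a video and requires only bounding box detections and the total count. It samples pairs of frames, uses a frozen pre-trained backbone, applies a self-bootstrapping mechanism with the Hungarian algorithm for in-batch pseudo-label assignment, and adapts a binary cross-entropy loss from vision-language models to learn discriminative features.
Result: On real-world datasets (3D-POP pigeons and 8-calves feeding videos), the method achieves over 97% accuracy while using less than 1 GB of GPU memory per batch, an order of magnitude less than standard contrastive methods. It matches or surpasses supervised baselines trained on more than 1,000 labeled frames.
Conclusion: The study removes the manual annotation bottleneck in animal identification and makes high-accuracy identification practical on consumer-grade hardware. It is broadly applicable in resource-constrained research settings and offers a practical approach for behavioral ecology, wildlife monitoring, and livestock management.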
📄 Abstract
Identifying individual animals in long-duration videos is essential for behavioral ecology, wildlife monitoring, and livestock management. Traditional methods require extensive manual annotation, while existing self-supervised approaches are computationally demanding and ill-suited for long sequences due to memory constraints and temporal error propagation. We introduce a highly efficient, self-supervised method that reframes animal identification as a global clustering task rather than a sequential tracking problem. Our approach assumes a known, fixed number of individuals within a single video -- a common scenario in practice -- and requires only bounding box detections and the total count. By sampling pairs of frames, using a frozen pre-trained backbone, and employing a self-bootstrapping mechanism with the Hungarian algorithm for in-batch pseudo-label assignment, our method learns discriminative features without identity labels. We adapt a Binary Cross Entropy loss from vision-language models, enabling state-of-the-art accuracy (>97%) while consuming less than 1 GB of GPU memory per batch -- an order of magnitude less than standard contrastive methods. Evaluated on challenging real-world datasets (3D-POP pigeons and 8-calves feeding videos), our framework matches or surpasses supervised baselines trained on over 1,000 labeled frames, effectively removing the manual annotation bottleneck. This work enables practical, high-accuracy animal identification on consumer-grade hardware, with broad applicability in resource-constrained research settings. All code written for this paper is available at https://huggingface.co/datasets/tonyFang04/8-calves.
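The in-batch pseudo-label assignment mentioned in this abstract is a one-to-one matching problem: each detection in a frame gets a distinct identity so that total similarity is maximal. The sketch below solves it by brute force over permutations, a stand-in that is practical only for a handful of individuals; in practice a Hungarian solver such as `scipy.optimize.linear_sum_assignment` would be used instead. The similarity matrix, its values, and the function name are hypothetical, and how the paper builds its cost matrix is not stated in the abstract.

```python
from itertools import permutations

def assign_identities(similarity):
    """Return a tuple where entry i is the identity assigned to detection i.

    similarity[i][k] is the similarity of detection i to identity prototype k
    (square matrix). Each identity is used exactly once and the total
    similarity is maximised. Brute force: O(n!) and only for small n.
    """
    n = len(similarity)
    return max(
        permutations(range(n)),
        key=lambda perm: sum(similarity[i][perm[i]] for i in range(n)),
    )

# Toy matrix: a per-row argmax would give detections 0 and 1 the same identity.
similarity = [
    [0.90, 0.80, 0.10],
    [0.85, 0.20, 0.10],
    [0.10, 0.30, 0.70],
]
assignment = assign_identities(similarity)
```

The one-to-one constraint is what makes the known, fixed count of individuals useful: two detections in the same frame can never share a pseudo-label.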
[31] STEP3-VL-10B Technical Report
Ailin Huang, Chengyuan Yao, Chunrui Han, Fanqi Wan, Hangyu Guo, Haoran Lv, Hongyu Zhou, Jia Wang, Jian Zhou, Jianjian Sun, Jingcheng Hu, Kangheng Lin, Liang Zhao, Mitt Huang, Song Yuan, Wenwen Qu, Xiangfeng Wang, Yanlin Lai, Yingxiu Zhao, Yinmin Zhang, Yukang Shi, Yuyang Chen, Zejia Weng, Ziyang Meng, Ang Li, Aobo Kong, Bo Dong, Changyi Wan, David Wang, Di Qi, Dingming Li, En Yu, Guopeng Li, Haiquan Yin, Han Zhou, Hanshan Zhang, Haolong Yan, Hebin Zhou, Hongbo Peng, Jiaran Zhang, Jiashu Lv, Jiayi Fu, Jie Cheng, Jie Zhou, Jisheng Yin, Jingjing Xie, Jingwei Wu, Jun Zhang, Junfeng Liu, Kaijun Tan, Kaiwen Yan, Liangyu Chen, Lina Chen, Mingliang Li, Qian Zhao, Quan Sun, Shaoliang Pang, Shengjie Fan, Shijie Shang, Siyuan Zhang, Tianhao You, Wei Ji, Wuxun Xie, Xiaobo Yang, Xiaojie Hou, Xiaoran Jiao, Xiaoxiao Ren, Xiangwen Kong, Xin Huang, Xin Wu, Xing Chen, Xinran Wang, Xuelin Zhang, Yana Wei, Yang Li, Yanming Xu, Yeqing Shen, Yuang Peng, Yue Peng, Yu Zhou, Yusheng Li, Yuxiang Yang, Yuyang Zhang, Zhe Xie, Zhewei Huang, Zhenyi Lu, Zhimin Fan, Zihui Cheng, Daxin Jiang, Qi Han, Xiangyu Zhang, Yibo Zhu, Zheng Ge
🧩 TL;DR
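STEP3-VL-10B is a lightweight open-source multimodal foundation model. Through its training strategy and Parallel Coordinated Reasoning at inference time, it rivals models 10 to 20 times larger with only 10B parameters, redefining the trade-off between compact efficiency and frontier-level multimodal intelligence.
📘 Detailed Summary
Motivation: The work targets the trade-off between compact efficiency and frontier performance in multimodal foundation models. Existing models usually need very large parameter counts to reach top-tier results, while lightweight models fall short on complex vision-language tasks.
Method: The approach rests on two strategic shifts. First, a unified, fully unfrozen pre-training strategy on 1.2T multimodal tokens integrates a language-aligned Perception Encoder with a Qwen3-8B decoder to establish intrinsic vision-language synergy. Second, a scaled post-training pipeline runs over 1,000 iterations of reinforcement learning, and Parallel Coordinated Reasoning (PaCoRe) is introduced to scale test-time compute.
Result: STEP3-VL-10B records 92.2% on MMBench and 80.11% on MMMU, and on complex reasoning it reaches 94.43% on AIME2025 and 75.95% on MathVision. At 10B parameters it rivals or surpasses models 10 to 20 times larger as well as top-tier proprietary flagships.
Conclusion: The study shows that, with suitable training and inference strategies, a compact model can reach frontier-level multimodal intelligence. It gives the community a powerful, efficient, and reproducible baseline and challenges the assumption that top-tier performance requires very large parameter counts.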
📄 Abstract
We present STEP3-VL-10B, a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. STEP3-VL-10B is realized through two strategic shifts: first, a unified, fully unfrozen pre-training strategy on 1.2T multimodal tokens that integrates a language-aligned Perception Encoder with a Qwen3-8B decoder to establish intrinsic vision-language synergy; and second, a scaled post-training pipeline featuring over 1k iterations of reinforcement learning. Crucially, we implement Parallel Coordinated Reasoning (PaCoRe) to scale test-time compute, allocating resources to scalable perceptual reasoning that explores and synthesizes diverse visual hypotheses. Consequently, despite its compact 10B footprint, STEP3-VL-10B rivals or surpasses models 10$\times$-20$\times$ larger (e.g., GLM-4.6V-106B, Qwen3-VL-235B) and top-tier proprietary flagships like Gemini 2.5 Pro and Seed-1.5-VL. Delivering best-in-class performance, it records 92.2% on MMBench and 80.11% on MMMU, while excelling in complex reasoning with 94.43% on AIME2025 and 75.95% on MathVision. We release the full model suite to provide the community with a powerful, efficient, and reproducible baseline.
[32] Efficient Camera-Controlled Video Generation of Static Scenes via Sparse Diffusion and 3D Rendering
Jieying Chen, Jeffrey Hu, Joan Lasenby, Ayush Tewari
🧩 TL;DR
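This paper proposes SRENDER, which generates sparse keyframes with a diffusion model and then synthesizes the full video through 3D reconstruction and rendering. It is more than 40 times faster than the diffusion-based baseline while maintaining high visual fidelity and temporal stability.
📘 Detailed Summary
Motivation: Diffusion-based video generation is computationally inefficient, often needing minutes of GPU time for a few seconds of video. This blocks deployment in applications that require real-time interaction, such as embodied AI and VR/AR. The bottleneck is especially pronounced for camera-conditioned video generation of static scenes.
Method: The method is hierarchical. A diffusion model first generates a sparse set of keyframes, which are then lifted into a 3D representation; intermediate views are synthesized by 3D reconstruction and rendering. A prediction model estimates the optimal number of keyframes for a given camera trajectory so that computation is allocated adaptively. The final method, SRENDER, uses very sparse keyframes for simple trajectories and denser ones for complex camera motion.
Result: SRENDER is more than 40 times faster than the diffusion-based baseline when generating 20 seconds of video, while maintaining high visual fidelity and temporal stability. The gain comes from amortizing the generation cost across hundreds of frames while enforcing geometric consistency, with keyframe density adapted to trajectory complexity.
Conclusion: Combining generative models with 3D reconstruction enables efficient and controllable video synthesis and offers a practical path toward real-time interactive applications. The core insight is to confine the expensive generation step to sparse keyframes and rely on geometric consistency to fill in the remaining frames.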
📄 Abstract
Modern video generative models based on diffusion models can produce very realistic clips, but they are computationally inefficient, often requiring minutes of GPU time for just a few seconds of video. This inefficiency poses a critical barrier to deploying generative video in applications that require real-time interactions, such as embodied AI and VR/AR. This paper explores a new strategy for camera-conditioned video generation of static scenes: using diffusion-based generative models to generate a sparse set of keyframes, and then synthesizing the full video through 3D reconstruction and rendering. By lifting keyframes into a 3D representation and rendering intermediate views, our approach amortizes the generation cost across hundreds of frames while enforcing geometric consistency. We further introduce a model that predicts the optimal number of keyframes for a given camera trajectory, allowing the system to adaptively allocate computation. Our final method, SRENDER, uses very sparse keyframes for simple trajectories and denser ones for complex camera motion. This results in video generation that is more than 40 times faster than the diffusion-based baseline in generating 20 seconds of video, while maintaining high visual fidelity and temporal stability, offering a practical path toward efficient and controllable video synthesis.
cs.CL [Back]
[33] TranslateGemma Technical Report
Mara Finkelstein, Isaac Caswell, Tobias Domhan, Jan-Thorsten Peter, Juraj Juraska, Parker Riley, Daniel Deutsch, Cole Dilanni, Colin Cherry, Eleftheria Briakou, Elizabeth Nielsen, Jiaming Luo, Kat Black, Ryan Mullins, Sweta Agrawal, Wenda Xu, Erin Kats, Stephane Jaskiewicz, Markus Freitag, David Vilar
🧩 TL;DR
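This paper presents TranslateGemma, a suite of open machine translation models built on the Gemma 3 foundation models. A two-stage fine-tuning process substantially improves Gemma 3's multilingual translation ability, with gains over the baseline models across multiple benchmarks.
📘 Detailed Summary
Motivation: The work aims to enhance the inherent multilingual capabilities of the Gemma 3 foundation models for the translation task and to give the research community powerful, adaptable open translation models.
Method: Fine-tuning proceeds in two stages. Supervised fine-tuning first uses a mixture of large-scale synthetic parallel data and human-translated parallel data. A reinforcement learning phase then optimizes translation quality with an ensemble of reward models, including MetricX-QE and AutoMQM.
Result: Human evaluation covers 10 language pairs on the WMT25 test set, and automatic evaluation covers 55 language pairs on the WMT24++ benchmark. TranslateGemma shows substantial gains over the baseline Gemma 3 models at all sizes, and smaller TranslateGemma models often match larger baseline models. Performance on the Vistra image translation benchmark also improves.
Conclusion: Specialized two-stage fine-tuning effectively raises the translation performance of the foundation models. Smaller models can reach the translation quality of larger baselines at lower cost, and the released models retain strong multimodal capabilities.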
📄 Abstract
We present TranslateGemma, a suite of open machine translation models based on the Gemma 3 foundation models. To enhance the inherent multilingual capabilities of Gemma 3 for the translation task, we employ a two-stage fine-tuning process. First, supervised fine-tuning is performed using a rich mixture of high-quality large-scale synthetic parallel data generated via state-of-the-art models and human-translated parallel data. This is followed by a reinforcement learning phase, where we optimize translation quality using an ensemble of reward models, including MetricX-QE and AutoMQM, targeting translation quality. We demonstrate the effectiveness of TranslateGemma with human evaluation on the WMT25 test set across 10 language pairs and with automatic evaluation on the WMT24++ benchmark across 55 language pairs. Automatic metrics show consistent and substantial gains over the baseline Gemma 3 models across all sizes. Notably, smaller TranslateGemma models often achieve performance comparable to larger baseline models, offering improved efficiency. We also show that TranslateGemma models retain strong multimodal capabilities, with enhanced performance on the Vistra image translation benchmark. The release of the open TranslateGemma models aims to provide the research community with powerful and adaptable tools for machine translation.
[34] Mi:dm 2.0 Korea-centric Bilingual Language Models
Donghoon Shin, Sejung Lee, Soonmin Bae, Hwijung Ryu, Changwon Ok, Hoyoun Jung, Hyesung Ji, Jeehyun Lim, Jehoon Lee, Ji-Eun Han, Jisoo Baik, Mihyeon Kim, Riwoo Chung, Seongmin Lee, Wonjae Park, Yoonseok Heo, Youngkyung Seo, Seyoun Won, Boeun Kim, Cheolhun Heo, Eunkyeong Lee, Honghee Lee, Hyeongju Ju, Hyeontae Seo, Jeongyong Shim, Jisoo Lee, Junseok Koh, Junwoo Kim, Minho Lee, Minji Kang, Minju Kim, Sangha Nam, Seongheum Park, Taehyeong Kim, Euijai Ahn, Hong Seok Jeung, Jisu Shin, Jiyeon Kim, Seonyeong Song, Seung Hyun Kong, Sukjin Hong, Taeyang Yun, Yu-Seon Kim, A-Hyun Lee, Chae-Jeong Lee, Hye-Won Yu, Ji-Hyun Ahn, Song-Yeon Kim, Sun-Woo Jung, Eunju Kim, Eunji Ha, Jinwoo Baek, Yun-ji Lee, Wanjin Park, Jeong Yeop Kim, Eun Mi Kim, Hyoung Jun Park, Jung Won Yoon, Min Sung Noh, Myung Gyo Oh, Wongyoung Lee, Yun Jin Park, Young S. Kwon, Hyun Keun Kim, Jieun Lee, YeoJoo Park
🧩 TL;DR
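Mi:dm 2.0 is a bilingual large language model designed to advance Korea-centric AI. By integrating the values, reasoning patterns, and commonsense knowledge of Korean society, it achieves state-of-the-art performance on Korean-specific benchmarks and is offered in Base and Mini configurations.
📘 Detailed Summary
Motivation: Existing LLMs handle Korean content poorly, mainly because of insufficient or low-quality Korean data and a lack of cultural alignment. As a result they struggle with Korean cultural contexts, emotional subtleties, and real-world scenarios, and cannot reliably generate culturally appropriate responses.
Method: The model relies on a comprehensive data pipeline that includes proprietary data cleansing, high-quality synthetic data generation, strategic data mixing with curriculum learning, and a custom Korean-optimized tokenizer that improves efficiency and coverage. Two configurations are offered: Mi:dm 2.0 Base (11.5B parameters), built with a depth-up scaling strategy for general-purpose use, and Mi:dm 2.0 Mini (2.3B parameters), optimized for resource-constrained environments and specialized tasks.
Result: Mi:dm 2.0 achieves state-of-the-art performance on Korean-specific benchmarks, with top-tier zero-shot results on KMMLU and strong internal evaluation results across language, humanities, and social science tasks.
Conclusion: By offering accessible, high-performance Korea-centric LLMs, the work aims to accelerate AI adoption across Korean industries, public services, and education, strengthen the Korean AI developer community, and lay the groundwork for the broader vision of K-intelligence. The models are released under the MIT license for research and commercial use.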
📄 Abstract
We introduce Mi:dm 2.0, a bilingual large language model (LLM) specifically engineered to advance Korea-centric AI. This model goes beyond Korean text processing by integrating the values, reasoning patterns, and commonsense knowledge inherent to Korean society, enabling nuanced understanding of cultural contexts, emotional subtleties, and real-world scenarios to generate reliable and culturally appropriate responses. To address limitations of existing LLMs, often caused by insufficient or low-quality Korean data and lack of cultural alignment, Mi:dm 2.0 emphasizes robust data quality through a comprehensive pipeline that includes proprietary data cleansing, high-quality synthetic data generation, strategic data mixing with curriculum learning, and a custom Korean-optimized tokenizer to improve efficiency and coverage. To realize this vision, we offer two complementary configurations: Mi:dm 2.0 Base (11.5B parameters), built with a depth-up scaling strategy for general-purpose use, and Mi:dm 2.0 Mini (2.3B parameters), optimized for resource-constrained environments and specialized tasks. Mi:dm 2.0 achieves state-of-the-art performance on Korean-specific benchmarks, with top-tier zero-shot results on KMMLU and strong internal evaluation results across language, humanities, and social science tasks. The Mi:dm 2.0 lineup is released under the MIT license to support extensive research and commercial use. By offering accessible and high-performance Korea-centric LLMs, KT aims to accelerate AI adoption across Korean industries, public services, and education, strengthen the Korean AI developer community, and lay the groundwork for the broader vision of K-intelligence. Our models are available at https://huggingface.co/K-intelligence. For technical inquiries, please contact midm-llm@kt.com.
[35] Contrastive Bi-Encoder Models for Multi-Label Skill Extraction: Enhancing ESCO Ontology Matching with BERT and Attention Mechanisms
Yongming Sun
🧩 TL;DR
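This paper proposes a zero-shot skill extraction framework. An LLM synthesizes training data from ESCO definitions with hierarchically constrained multi-skill generation, and a contrastive bi-encoder trained on that data classifies skills without manually labeled job ads, transferring effectively to Chinese job advertisements.
📘 Detailed Summary
Motivation: Fine-grained labor market analysis requires mapping unstructured job advertisements to standardized skill taxonomies such as ESCO, which is an extreme multi-label classification problem. Supervised solutions are limited by the scarcity and cost of large-scale, taxonomy-aligned annotations, especially in non-English settings where job-ad language diverges substantially from formal skill definitions.
Method: A large language model synthesizes training instances from ESCO definitions, with hierarchically constrained multi-skill generation based on ESCO Level-2 categories to improve semantic coherence in multi-label contexts. A contrastive bi-encoder trained on the synthetic corpus aligns job-ad sentences with ESCO skill descriptions in a shared embedding space; the encoder augments a BERT backbone with BiLSTM and attention pooling to better model long, information-dense requirement statements. An upstream RoBERTa-based binary filter removes non-skill sentences to improve end-to-end precision.
Result: Hierarchy-conditioned generation improves both fluency and discriminability relative to unconstrained pairing. The resulting multi-label model transfers effectively to real-world Chinese job advertisements, achieving strong zero-shot retrieval performance (F1@5 = 0.72) and outperforming TF-IDF and standard BERT baselines.
Conclusion: The pipeline offers a scalable, data-efficient pathway for automated skill coding in labor economics and workforce analytics. Its zero-shot design sidesteps the scarcity of labeled data and is particularly suited to skill extraction in non-English settings.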
📄 Abstract
Fine-grained labor market analysis increasingly relies on mapping unstructured job advertisements to standardized skill taxonomies such as ESCO. This mapping is naturally formulated as an Extreme Multi-Label Classification (XMLC) problem, but supervised solutions are constrained by the scarcity and cost of large-scale, taxonomy-aligned annotations--especially in non-English settings where job-ad language diverges substantially from formal skill definitions. We propose a zero-shot skill extraction framework that eliminates the need for manually labeled job-ad training data. The framework uses a Large Language Model (LLM) to synthesize training instances from ESCO definitions, and introduces hierarchically constrained multi-skill generation based on ESCO Level-2 categories to improve semantic coherence in multi-label contexts. On top of the synthetic corpus, we train a contrastive bi-encoder that aligns job-ad sentences with ESCO skill descriptions in a shared embedding space; the encoder augments a BERT backbone with BiLSTM and attention pooling to better model long, information-dense requirement statements. An upstream RoBERTa-based binary filter removes non-skill sentences to improve end-to-end precision. Experiments show that (i) hierarchy-conditioned generation improves both fluency and discriminability relative to unconstrained pairing, and (ii) the resulting multi-label model transfers effectively to real-world Chinese job advertisements, achieving strong zero-shot retrieval performance (F1@5 = 0.72) and outperforming TF--IDF and standard BERT baselines. Overall, the proposed pipeline provides a scalable, data-efficient pathway for automated skill coding in labor economics and workforce analytics.
[36] OrthoGeoLoRA: Geometric Parameter-Efficient Fine-Tuning for Structured Social Science Concept Retrieval on the Web
Zeqiang Wang, Xinyue Wu, Chenxi Li, Zixi Chen, Nishanth Sastry, Jon Johnson, Suparna De
🧩 TL;DR
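This paper proposes OrthoGeoLoRA, a geometry-aware parameter-efficient fine-tuning method that constrains the low-rank factors to the Stiefel manifold. Enforcing orthogonality overcomes geometric drawbacks of standard LoRA and offers a more efficient way to adapt models for social science information systems in resource-constrained settings.
📘 Detailed Summary
Motivation: Large language models and text encoders are increasingly used in social science information systems, but the compute and energy cost of full fine-tuning is a barrier for smaller institutions and non-profit organizations in the Web4Good ecosystem. Standard LoRA has geometric drawbacks, namely gauge freedom, scale ambiguity, and a tendency toward rank collapse, that limit its effectiveness in such settings.
Method: OrthoGeoLoRA constrains the update to the SVD-like form $\Delta W = B\Sigma A^\top$ by requiring the low-rank factors to be orthogonal, that is, to lie on the Stiefel manifold. A geometric reparameterization implements the constraint while remaining compatible with standard optimizers such as Adam and with existing fine-tuning pipelines. The paper also builds a hierarchical concept retrieval benchmark over the European Language Social Science Thesaurus (ELSST).
Result: In experiments with a multilingual sentence encoder, OrthoGeoLoRA outperforms standard LoRA and several strong PEFT variants on ranking metrics under the same low-rank budget, indicating better compute and parameter efficiency.
Conclusion: Geometric constraints address inherent drawbacks of standard LoRA and yield a more efficient and more stable parameter-efficient fine-tuning method for social science information systems. The method is aimed in particular at resource-constrained research institutions and non-profit organizations.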
📄 Abstract
Large language models and text encoders increasingly power web-based information systems in the social sciences, including digital libraries, data catalogues, and search interfaces used by researchers, policymakers, and civil society. Full fine-tuning is often computationally and energy intensive, which can be prohibitive for smaller institutions and non-profit organizations in the Web4Good ecosystem. Parameter-Efficient Fine-Tuning (PEFT), especially Low-Rank Adaptation (LoRA), reduces this cost by updating only a small number of parameters. We show that the standard LoRA update $\Delta W = BA^\top$ has geometric drawbacks: gauge freedom, scale ambiguity, and a tendency toward rank collapse. We introduce OrthoGeoLoRA, which enforces an SVD-like form $\Delta W = B\Sigma A^\top$ by constraining the low-rank factors to be orthogonal (Stiefel manifold). A geometric reparameterization implements this constraint while remaining compatible with standard optimizers such as Adam and existing fine-tuning pipelines. We also propose a benchmark for hierarchical concept retrieval over the European Language Social Science Thesaurus (ELSST), widely used to organize social science resources in digital repositories. Experiments with a multilingual sentence encoder show that OrthoGeoLoRA outperforms standard LoRA and several strong PEFT variants on ranking metrics under the same low-rank budget, offering a more compute- and parameter-efficient path to adapt foundation models in resource-constrained settings.
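The gauge freedom and scale ambiguity named in the abstract can be made concrete with a short derivation. This is a sketch under assumed notation: the shapes $d$, $k$, $r$ and the matrix $G$ are introduced here for illustration and do not come from the abstract.

```latex
\begin{aligned}
\Delta W &= B A^\top, \qquad B \in \mathbb{R}^{d \times r},\ A \in \mathbb{R}^{k \times r} \\
(BG)\,(A G^{-\top})^\top &= B\,G\,G^{-1} A^\top = B A^\top
  \qquad \text{for any invertible } G \in \mathbb{R}^{r \times r} \\
\Delta W &= B \Sigma A^\top, \qquad B^\top B = A^\top A = I_r,\quad
  \Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_r)
\end{aligned}
```

The second line shows that infinitely many factor pairs produce the same update; choosing $G = cI$ with $c \neq 0$ gives the scale ambiguity as a special case. With orthonormal columns in $B$ and $A$, all magnitude information is carried by $\Sigma$, and, as with an SVD whose singular values are distinct, the remaining freedom is essentially limited to column sign flips and reordering.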
[37] TeachPro: Multi-Label Qualitative Teaching Evaluation via Cross-View Graph Synergy and Semantic Anchored Evidence Encoding
Xiangqian Wang, Yifan Jia, Yang Xiang, Yumin Zhang, Yanbin Wang, Ke Liu
🧩 TL;DR
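This paper proposes TeachPro, a multi-label learning framework that systematically assesses five key teaching dimensions from open-ended student evaluations, addressing the tendency of existing methods to reduce feedback to binary sentiment and overlook concrete teaching concerns.
📘 Detailed Summary
Motivation: Standardized student evaluations of teaching often suffer from low reliability, restricted response options, and response distortion. Existing machine learning methods that mine open-ended comments usually reduce feedback to binary sentiment, which overlooks concrete concerns such as content clarity, feedback timeliness, and instructor demeanor and gives limited guidance for instructional improvement.
Method: TeachPro combines a Dimension-Anchored Evidence Encoder with a Cross-View Graph Synergy Network. The encoder integrates a pre-trained text encoder, a prompt module that represents the five teaching dimensions as learnable semantic anchors, and a cross-attention mechanism that aligns evidence with the dimensions in a structured semantic space. The graph network has a Syntactic Branch that extracts explicit grammatical dependencies from parse trees and a Semantic Branch that models latent conceptual relations from BERT-based similarity graphs. A BiAffine fusion module aligns syntactic and semantic units, and a differential regularizer disentangles the embeddings to encourage complementary representations.
Result: Extensive experiments show that TeachPro offers superior diagnostic granularity and robustness across diverse evaluation settings. The authors also contribute a new benchmark dataset with expert qualitative annotations and multi-label scores.
Conclusion: The work provides a finer-grained tool for teaching evaluation that extracts multi-dimensional insight from open-ended student feedback, going beyond binary sentiment analysis. It illustrates the value of combining a structured semantic space with multi-view text representations.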
📄 Abstract
Standardized Student Evaluations of Teaching often suffer from low reliability, restricted response options, and response distortion. Existing machine learning methods that mine open-ended comments usually reduce feedback to binary sentiment, which overlooks concrete concerns such as content clarity, feedback timeliness, and instructor demeanor, and provides limited guidance for instructional improvement. We propose TeachPro, a multi-label learning framework that systematically assesses five key teaching dimensions: professional expertise, instructional behavior, pedagogical efficacy, classroom experience, and other performance metrics. We first propose a Dimension-Anchored Evidence Encoder, which integrates three core components: (i) a pre-trained text encoder that transforms qualitative feedback annotations into contextualized embeddings; (ii) a prompt module that represents five teaching dimensions as learnable semantic anchors; and (iii) a cross-attention mechanism that aligns evidence with pedagogical dimensions within a structured semantic space. We then propose a Cross-View Graph Synergy Network to represent student comments. This network comprises two components: (i) a Syntactic Branch that extracts explicit grammatical dependencies from parse trees, and (ii) a Semantic Branch that models latent conceptual relations derived from BERT-based similarity graphs. A BiAffine fusion module aligns syntactic and semantic units, while a differential regularizer disentangles embeddings to encourage complementary representations. Finally, a cross-attention mechanism bridges the dimension-anchored evidence with the multi-view comment representations. We also contribute a novel benchmark dataset featuring expert qualitative annotations and multi-label scores. Extensive experiments demonstrate that TeachPro offers superior diagnostic granularity and robustness across diverse evaluation settings.
[38] MCGA: A Multi-task Classical Chinese Literary Genre Audio Corpus
Yexing Du, Kaiyuan Liu, Bihe Zhang, Youcheng Pan, Bo Yang, Liangyu Huo, Xiyuan Zhang, Jian Xie, Daojing He, Yang Xiang, Ming Liu, Bin Qin
🧩 TL;DR
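This paper proposes MCGA, a multi-task audio corpus of classical Chinese literary genres that fills the gap in the audio modality for multimodal LLMs in Chinese Classical Studies. An evaluation of ten MLLMs shows that current models still face substantial challenges in this domain.
📘 Detailed Summary
Motivation: With the rapid advancement of multimodal large language models, their potential in Chinese Classical Studies has drawn attention. Existing research focuses mainly on text and visual modalities, while audio corpora in this domain remain largely underexplored.
Method: The corpus covers six tasks: Automatic Speech Recognition, Speech-to-Text Translation, Speech Emotion Captioning, Spoken Question Answering, Speech Understanding, and Speech Reasoning. The paper also introduces an evaluation metric for Speech Emotion Captioning and a metric that measures the consistency between the speech and text capabilities of MLLMs.
Result: The evaluation of ten MLLMs shows that current models still face substantial challenges on the MCGA test set.
Conclusion: The study exposes the limitations of MLLMs in audio processing for Chinese Classical Studies. The corpus and code are released publicly to support the development of models with more robust multidimensional audio capabilities.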
📄 Abstract
With the rapid advancement of Multimodal Large Language Models (MLLMs), their potential has garnered significant attention in Chinese Classical Studies (CCS). While existing research has primarily focused on text and visual modalities, the audio corpus within this domain remains largely underexplored. To bridge this gap, we propose the Multi-task Classical Chinese Literary Genre Audio Corpus (MCGA). It encompasses a diverse range of literary genres across six tasks: Automatic Speech Recognition (ASR), Speech-to-Text Translation (S2TT), Speech Emotion Captioning (SEC), Spoken Question Answering (SQA), Speech Understanding (SU), and Speech Reasoning (SR). Through the evaluation of ten MLLMs, our experimental results demonstrate that current models still face substantial challenges when evaluated on the MCGA test set. Furthermore, we introduce an evaluation metric for SEC and a metric to measure the consistency between the speech and text capabilities of MLLMs. We release MCGA and our code to the public to facilitate the development of MLLMs with more robust multidimensional audio capabilities in CCS. MCGA Corpus: https://github.com/yxduir/MCGA
[39] Improving Implicit Hate Speech Detection via a Community-Driven Multi-Agent Framework
Ewelina Gajewska, Katarzyna Budzynska, Jarosław A Chudziak
🧩 TL;DR
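This paper proposes a contextualised detection framework for implicit hate speech, built as a multi-agent system with a central Moderator Agent and dynamically constructed Community Agents. By integrating socio-cultural context, it surpasses state-of-the-art prompting methods in both classification accuracy and fairness.
📘 Detailed Summary
Motivation: Current approaches to implicit hate speech detection take too little account of socio-cultural context, which makes identity-aware moderation difficult, particularly for subtle expressions of bias aimed at specific demographic groups. A fairer and more accurate classification framework is needed.
Method: The system comprises a central Moderator Agent and dynamically constructed Community Agents that represent specific demographic groups. It integrates socio-cultural context from publicly available knowledge sources to enable identity-aware moderation, and it adopts balanced accuracy as the central metric of classification fairness.
Result: On the challenging ToxiGen dataset, the method surpasses state-of-the-art prompting methods (zero-shot, few-shot, and chain-of-thought prompting) as well as alternative approaches, significantly improving classification accuracy and fairness across all target groups.
Conclusion: The study demonstrates the effectiveness of a community-driven consultative framework for implicit hate speech detection. Integrating socio-cultural context and evaluating with a fairness-oriented metric points toward more accurate and more equitable content moderation.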
📄 Abstract
This work proposes a contextualised detection framework for implicitly hateful speech, implemented as a multi-agent system comprising a central Moderator Agent and dynamically constructed Community Agents representing specific demographic groups. Our approach explicitly integrates socio-cultural context from publicly available knowledge sources, enabling identity-aware moderation that surpasses state-of-the-art prompting methods (zero-shot prompting, few-shot prompting, chain-of-thought prompting) and alternative approaches on the challenging ToxiGen dataset. We enhance the technical rigour of performance evaluation by incorporating balanced accuracy as a central metric of classification fairness that accounts for the trade-off between true positive and true negative rates. We demonstrate that our community-driven consultative framework significantly improves both classification accuracy and fairness across all target groups.
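Balanced accuracy, the metric the abstract adopts, is the mean of the true positive rate and the true negative rate. The minimal sketch below uses made-up labels to show why it differs from plain accuracy on imbalanced data; `balanced_accuracy` is an illustrative helper, not code from the paper.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of true positive rate and true negative rate for binary labels (0/1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = sum(1 for t in y_true if t == 0)
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present in y_true")
    return 0.5 * (tp / pos + tn / neg)


# A classifier that flags everything as hateful (1) on a 4:1 imbalanced toy set.
y_true = [1, 1, 1, 1, 0]
y_pred = [1, 1, 1, 1, 1]
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy, balanced_accuracy(y_true, y_pred))
```

In this toy case plain accuracy is 0.8 while balanced accuracy is 0.5, since the true negative rate is zero. Computing the metric separately per target group is one way to compare groups with very different class balances.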
[40] Relation Extraction Capabilities of LLMs on Clinical Text: A Bilingual Evaluation for English and Turkish
Aidana Aidynkyzy, Oğuz Dikenelli, Oylum Alatlı, Şebnem Bora
🧩 TL;DR
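This study introduces the first English-Turkish parallel clinical relation extraction dataset and systematically evaluates a range of prompting strategies. Relation-Aware Retrieval, an in-context example selection method based on contrastive learning, achieves the best results among in-context learning methods, and prompting-based LLM approaches outperform traditional fine-tuned models.
📘 Detailed Summary
Motivation: The scarcity of annotated datasets for clinical information extraction in non-English languages hinders the evaluation of LLM-based methods developed primarily in English. The study builds a parallel English-Turkish dataset to evaluate LLMs on clinical relation extraction in both languages and to explore prompting strategies that can narrow the resource gap.
Method: The dataset is derived and carefully curated from the 2010 i2b2/VA relation classification corpus. The study assesses multiple in-context learning and Chain-of-Thought prompting approaches and compares them with fine-tuned baselines such as PURE. It also proposes Relation-Aware Retrieval (RAR), an in-context example selection method based on contrastive learning that is designed to capture both sentence-level and relation-level semantics.
Result: Prompting-based LLM approaches consistently outperform traditional fine-tuned models. English results are better than their Turkish counterparts across all evaluated LLMs and prompting techniques. Among in-context learning methods RAR performs best, with Gemini 1.5 Flash reaching a micro-F1 of 0.906 in English and 0.888 in Turkish. Combining RAR with a structured reasoning prompt on DeepSeek-V3 raises English performance to 0.918 F1.
Conclusion: High-quality demonstration retrieval matters for clinical natural language processing, and advanced retrieval and prompting techniques have the potential to bridge resource gaps for lower-resource languages.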
📄 Abstract
The scarcity of annotated datasets for clinical information extraction in non-English languages hinders the evaluation of large language model (LLM)-based methods developed primarily in English. In this study, we present the first comprehensive bilingual evaluation of LLMs for the clinical Relation Extraction (RE) task in both English and Turkish. To facilitate this evaluation, we introduce the first English-Turkish parallel clinical RE dataset, derived and carefully curated from the 2010 i2b2/VA relation classification corpus. We systematically assess a diverse set of prompting strategies, including multiple in-context learning (ICL) and Chain-of-Thought (CoT) approaches, and compare their performance to fine-tuned baselines such as PURE. Furthermore, we propose Relation-Aware Retrieval (RAR), a novel in-context example selection method based on contrastive learning, that is specifically designed to capture both sentence-level and relation-level semantics. Our results show that prompting-based LLM approaches consistently outperform traditional fine-tuned models. Moreover, evaluations for English performed better than their Turkish counterparts across all evaluated LLMs and prompting techniques. Among ICL methods, RAR achieves the highest performance, with Gemini 1.5 Flash reaching a micro-F1 score of 0.906 in English and 0.888 in Turkish. Performance further improves to 0.918 F1 in English when RAR is combined with a structured reasoning prompt using the DeepSeek-V3 model. These findings highlight the importance of high-quality demonstration retrieval and underscore the potential of advanced retrieval and prompting techniques to bridge resource gaps in clinical natural language processing.
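The retrieval step behind in-context example selection can be illustrated with a nearest-neighbour lookup. This sketch is not the RAR method itself: RAR's embeddings come from a contrastively trained encoder, whereas the two-dimensional vectors, example texts, and the function name `select_demonstrations` below are all hypothetical placeholders. Only the final ranking by cosine similarity is shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_demonstrations(query_vec, pool, k=2):
    """Return the texts of the k pool examples most similar to the query embedding."""
    ranked = sorted(pool, key=lambda ex: cosine(query_vec, ex["vec"]), reverse=True)
    return [ex["text"] for ex in ranked[:k]]


# Toy pool of labelled examples with made-up embeddings.
pool = [
    {"text": "example A (treatment-problem relation)", "vec": [1.0, 0.0]},
    {"text": "example B (test-problem relation)", "vec": [0.0, 1.0]},
    {"text": "example C (treatment-problem relation)", "vec": [0.9, 0.1]},
]
chosen = select_demonstrations([1.0, 0.05], pool, k=2)
print(chosen)
```

The selected examples would then be placed in the prompt ahead of the query sentence. How well this works depends on whether the embedding space separates relation types, which is the part the contrastive training is meant to address.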
[41] Benchmarking Post-Training Quantization of Large Language Models under Microscaling Floating Point Formats
Manyi Zhang, Ji-Fu Li, Zhongao Sun, Haoli Bai, Hui-Ling Zhen, Zhenhua Dong, Xianzhi Yu
🧩 TL;DR
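This study systematically investigates post-training quantization under Microscaling Floating-Point (MXFP) formats. MXFP8 achieves near-lossless performance while MXFP4 remains challenging, and the results give practical guidance for adapting existing PTQ methods to MXFP quantization.
📘 Detailed Summary
Motivation: MXFP has emerged as a promising low-precision format for large language models, but existing post-training quantization algorithms mostly target integer quantization. Their applicability and behavior under MXFP formats remain largely unexplored.
Method: The study runs a systematic set of experiments covering over 7 PTQ algorithms, 15 evaluation benchmarks, and 3 LLM families, with particular attention to format compatibility. For MXFP4 it proposes a simple pre-scale optimization strategy to mitigate scaling-factor error.
Result: MXFP8 consistently achieves near-lossless performance, while MXFP4 introduces substantial accuracy degradation. PTQ effectiveness under MXFP depends strongly on format compatibility, with some algorithmic paradigms consistently more effective than others. In multimodal LLMs, quantization sensitivity is dominated by the language model rather than the vision encoder. The scaling factor is a critical error source in MXFP4, and pre-scale optimization significantly mitigates its impact.
Conclusion: The results provide practical guidance for adapting existing PTQ methods to MXFP quantization and highlight the importance of format compatibility for low-precision LLM deployment.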
📄 Abstract
Microscaling Floating-Point (MXFP) has emerged as a promising low-precision format for large language models (LLMs). Despite various post-training quantization (PTQ) algorithms being proposed, they mostly focus on integer quantization, while their applicability and behavior under MXFP formats remain largely unexplored. To address this gap, this work conducts a systematic investigation of PTQ under MXFP formats, encompassing over 7 PTQ algorithms, 15 evaluation benchmarks, and 3 LLM families. The key findings include: 1) MXFP8 consistently achieves near-lossless performance, while MXFP4 introduces substantial accuracy degradation and remains challenging; 2) PTQ effectiveness under MXFP depends strongly on format compatibility, with some algorithmic paradigms being consistently more effective than others; 3) PTQ performance exhibits highly consistent trends across model families and modalities, in particular, quantization sensitivity is dominated by the language model rather than the vision encoder in multimodal LLMs; 4) The scaling factor of quantization is a critical error source in MXFP4, and a simple pre-scale optimization strategy can significantly mitigate its impact. Together, these results provide practical guidance on adapting existing PTQ methods to MXFP quantization.
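To make the role of the shared scaling factor concrete, here is a minimal sketch of block-wise MXFP4-style quantization. It assumes the commonly described microscaling layout, in which a block of values shares one power-of-two scale and each element is stored as FP4 (E2M1). Real blocks typically hold 32 elements; the toy blocks here are shorter. The function names are hypothetical, and the paper's pre-scale optimization is not reproduced.

```python
import math

# Magnitudes representable by FP4 E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit).
FP4_E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_EMAX = 2  # exponent of the largest value: 6 = 1.5 * 2**2


def shared_scale_exponent(block):
    """Power-of-two exponent shared by the block, derived from its largest magnitude."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0
    return math.floor(math.log2(amax)) - FP4_EMAX


def quantize_block(block):
    """Return (shared exponent, dequantized values) for one block."""
    exponent = shared_scale_exponent(block)
    scale = 2.0 ** exponent
    dequantized = []
    for v in block:
        # Scale into the FP4 range, saturate at the largest grid value, round to nearest.
        magnitude = min(abs(v) / scale, FP4_E2M1_GRID[-1])
        nearest = min(FP4_E2M1_GRID, key=lambda g: abs(g - magnitude))
        dequantized.append(math.copysign(nearest, v) * scale)
    return exponent, dequantized


print(quantize_block([0.1, -0.7, 3.0, 6.0]))
print(quantize_block([7.5, 0.2]))  # 7.5 saturates to 6.0 under the floor-based scale
```

The second call shows one way the scale can introduce error: with a floor-based exponent, the largest element in a block may exceed the top grid value after scaling and be clipped, while small elements in the same block round to zero. This is a toy illustration of the kind of scaling-factor error the abstract refers to, not a reproduction of the paper's analysis.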
[42] LLMs Got Rhythm? Hybrid Phonological Filtering for Greek Poetry Rhyme Detection and Generation
Stergios Chatzikyriakidis
🧩 TL;DR
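This paper presents a hybrid system that combines LLMs with deterministic phonological algorithms to address LLM weaknesses in rhyme detection and generation for lower-resource languages such as Modern Greek. A phonological verification loop raises the share of valid generated poems from under 4% to 73.1%.
📘 Detailed Summary
Motivation: Despite strong performance across NLP tasks, LLMs struggle with phonologically grounded phenomena such as rhyme detection and generation. The problem is even more evident in lower-resource languages such as Modern Greek.
Method: The hybrid system combines LLMs with deterministic phonological algorithms and implements a comprehensive taxonomy of Greek rhyme types: Pure, Rich, Imperfect, Mosaic, and Identical Pre-rhyme Vowel (IDV) patterns. It uses an agentic generation pipeline with phonological verification, and evaluates several prompting strategies (zero-shot, few-shot, Chain-of-Thought, and RAG-augmented) across multiple LLMs.
Result: The results reveal a significant "Reasoning Gap": native-like models (Claude 3.7) reach 40% accuracy in rhyme identification, while reasoning-heavy models (Claude 4.5) achieve state-of-the-art performance (54%) only when prompted with Chain-of-Thought. Pure LLM generation fails catastrophically (under 4% valid poems), whereas the hybrid verification loop restores performance to 73.1%.
Conclusion: Pure LLMs show marked limitations on phonological tasks, and combining them with algorithmic verification substantially improves performance. The released system and a rigorously cleaned corpus of more than 40,000 rhymes support future research on rhyme in lower-resource languages.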
📄 Abstract
Large Language Models (LLMs), despite their remarkable capabilities across NLP tasks, struggle with phonologically-grounded phenomena like rhyme detection and generation. This is even more evident in lower-resource languages such as Modern Greek. In this paper, we present a hybrid system that combines LLMs with deterministic phonological algorithms to achieve accurate rhyme identification/analysis and generation. Our approach implements a comprehensive taxonomy of Greek rhyme types, including Pure, Rich, Imperfect, Mosaic, and Identical Pre-rhyme Vowel (IDV) patterns, and employs an agentic generation pipeline with phonological verification. We evaluate multiple prompting strategies (zero-shot, few-shot, Chain-of-Thought, and RAG-augmented) across several LLMs including Claude 3.7 and 4.5, GPT-4o, Gemini 2.0 and open-weight models like Llama 3.1 8B and 70B and Mistral Large. Results reveal a significant "Reasoning Gap": while native-like models (Claude 3.7) perform intuitively (40\% accuracy in identification), reasoning-heavy models (Claude 4.5) achieve state-of-the-art performance (54\%) only when prompted with Chain-of-Thought. Most critically, pure LLM generation fails catastrophically (under 4\% valid poems), while our hybrid verification loop restores performance to 73.1\%. We release our system and a crucial, rigorously cleaned corpus of 40,000+ rhymes, derived from the Anemoskala and Interwar Poetry corpora, to support future research.
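The generate-then-verify idea behind the hybrid loop can be sketched in a few lines. Everything here is a simplified assumption rather than the paper's system: the rhyme check is purely orthographic (it compares word endings from the last accented vowel onward, ignoring that several Greek spellings share one sound), only the simplest "identical ending" case is covered, and the candidate couplets stand in for outputs of a hypothetical LLM call.

```python
import unicodedata

# Accented (stressed) Greek vowels, normalized to precomposed form.
STRESSED = unicodedata.normalize("NFC", "άέήίόύώ")


def rhyme_tail(word):
    """Return the part of the word from its last accented vowel onward."""
    w = unicodedata.normalize("NFC", word.lower().strip(".,;·!"))
    idx = max(w.rfind(ch) for ch in STRESSED)
    return w[idx:] if idx >= 0 else w


def rhymes(word_a, word_b):
    """Toy orthographic check: distinct words with identical tails rhyme."""
    return word_a != word_b and rhyme_tail(word_a) == rhyme_tail(word_b)


def first_valid_couplet(candidates):
    """Return the first candidate couplet whose line-final words pass the check."""
    for line_a, line_b in candidates:
        if rhymes(line_a.split()[-1], line_b.split()[-1]):
            return line_a, line_b
    return None


# Stand-ins for successive generations from a hypothetical model.
candidates = [
    ("ήρθε η νύχτα", "πέρασε η μέρα"),
    ("πέρασε η μέρα", "πήγαμε πιο πέρα"),
]
accepted = first_valid_couplet(candidates)
print(accepted)
```

The design point is that acceptance is decided by a deterministic checker rather than by the model's own judgment, so invalid generations are filtered out instead of being trusted. A faithful checker would work on a phonemic transcription and distinguish the rhyme types listed in the abstract.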
[43] Empathy Applicability Modeling for General Health Queries
Shan Randhawa, Agha Ali Raza, Kentaro Toyama, Julie Hui, Mustafa Naseem
🧩 TL;DR
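This paper proposes the Empathy Applicability Framework (EAF), a theory-driven approach for identifying empathy needs in patient queries before a doctor's response is generated, and establishes a benchmark dataset for anticipatory empathy modeling.
📘 Detailed Summary
Motivation: LLMs are increasingly integrated into clinical workflows, yet they often lack clinical empathy, an essential aspect of doctor-patient communication. Existing NLP frameworks focus on reactively labeling empathy in doctors' responses and offer limited support for anticipatory modeling of empathy needs, especially in general health queries.
Method: EAF is a theory-driven approach that classifies patient queries by the applicability of emotional reactions and interpretations, based on clinical, contextual, and linguistic cues. The authors release a benchmark of real patient queries, dual-annotated by humans and GPT-4o, and train classifiers on human-labeled and GPT-only annotations to predict empathy applicability.
Result: In the subset with human consensus, substantial human-GPT alignment is observed. Classifiers trained on human-labeled and GPT-only annotations achieve strong performance and outperform the heuristic and zero-shot LLM baselines. Error analysis highlights persistent challenges: implicit distress, clinical-severity ambiguity, and contextual hardship.
Conclusion: EAF provides a framework for identifying empathy needs before response generation, establishes a benchmark for anticipatory empathy modeling, and supports empathetic communication in asynchronous healthcare. The findings underscore the need for multi-annotator modeling, clinician-in-the-loop calibration, and culturally diverse annotation.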
📄 Abstract
LLMs are increasingly being integrated into clinical workflows, yet they often lack clinical empathy, an essential aspect of effective doctor-patient communication. Existing NLP frameworks focus on reactively labeling empathy in doctors' responses but offer limited support for anticipatory modeling of empathy needs, especially in general health queries. We introduce the Empathy Applicability Framework (EAF), a theory-driven approach that classifies patient queries in terms of the applicability of emotional reactions and interpretations, based on clinical, contextual, and linguistic cues. We release a benchmark of real patient queries, dual-annotated by Humans and GPT-4o. In the subset with human consensus, we also observe substantial human-GPT alignment. To validate EAF, we train classifiers on human-labeled and GPT-only annotations to predict empathy applicability, achieving strong performance and outperforming the heuristic and zero-shot LLM baselines. Error analysis highlights persistent challenges: implicit distress, clinical-severity ambiguity, and contextual hardship, underscoring the need for multi-annotator modeling, clinician-in-the-loop calibration, and culturally diverse annotation. EAF provides a framework for identifying empathy needs before response generation, establishes a benchmark for anticipatory empathy modeling, and enables supporting empathetic communication in asynchronous healthcare.
cs.AI [Back]
[44] AviationLMM: A Large Multimodal Foundation Model for Civil Aviation
Wenbin Li, Jingling Wu, Xiaoyong Lin, Jing Chen, Cong Chen
🧩 TL;DR
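This paper sets out the vision of AviationLMM, a large multimodal foundation model for civil aviation. It is designed to unify the heterogeneous data streams of civil aviation and enable understanding, reasoning, generation, and agentic applications, addressing the siloed, single-modality nature of existing AI solutions.
📘 Detailed Summary
Motivation: Existing AI solutions in civil aviation are siloed and narrow. They struggle to integrate heterogeneous data such as voice communications, radar tracks, sensor streams, and textual reports, which limits situational awareness, adaptability, and real-time decision support.
Method: The paper describes a model architecture that ingests multimodal inputs such as air-ground voice, surveillance data, on-board telemetry, video, and structured text, performs cross-modal alignment and fusion, and produces flexible outputs ranging from situation summaries and risk alerts to predictive diagnostics and multimodal incident reconstructions.
Result: The paper reports no performance metrics or experimental results. It presents a research vision and framework and identifies key research opportunities, including data acquisition, alignment and fusion, pretraining, reasoning, trustworthiness, privacy, robustness to missing modalities, and synthetic scenario generation.
Conclusion: By articulating the design and challenges of AviationLMM, the paper aims to advance foundation models for civil aviation and to catalyze coordinated research toward an integrated, trustworthy, and privacy-preserving aviation AI ecosystem.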
📄 Abstract
Civil aviation is a cornerstone of global transportation and commerce, and ensuring its safety, efficiency and customer satisfaction is paramount. Yet conventional Artificial Intelligence (AI) solutions in aviation remain siloed and narrow, focusing on isolated tasks or single modalities. They struggle to integrate heterogeneous data such as voice communications, radar tracks, sensor streams and textual reports, which limits situational awareness, adaptability, and real-time decision support. This paper introduces the vision of AviationLMM, a Large Multimodal foundation Model for civil aviation, designed to unify the heterogeneous data streams of civil aviation and enable understanding, reasoning, generation and agentic applications. First, we identify the gaps between existing AI solutions and requirements. Second, we describe a model architecture that ingests multimodal inputs such as air-ground voice, surveillance, on-board telemetry, video and structured texts, performs cross-modal alignment and fusion, and produces flexible outputs ranging from situation summaries and risk alerts to predictive diagnostics and multimodal incident reconstructions. To fully realize this vision, we identify key research opportunities, including data acquisition, alignment and fusion, pretraining, reasoning, trustworthiness, privacy, robustness to missing modalities, and synthetic scenario generation. By articulating the design and challenges of AviationLMM, we aim to advance foundation models for civil aviation and catalyze coordinated research efforts toward an integrated, trustworthy and privacy-preserving aviation AI ecosystem.
[45] The AI Hippocampus: How Far are We From Human Memory?
Zixia Jia, Jiaqi Li, Yipeng Kang, Yuxuan Wang, Tong Wu, Quansen Wang, Xiaobo Wang, Shuyi Zhang, Junzhe Shen, Qing Li, Siyuan Qi, Yitao Liang, Di He, Zilong Zheng, Song-Chun Zhu
🧩 TL;DR
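This paper presents a comprehensive survey of memory mechanisms in large language models (LLMs) and multimodal LLMs (MLLMs), proposing a structured taxonomy that covers implicit, explicit, and agentic memory paradigms and systematically reviewing the field's key architectural advances, benchmark tasks, and open challenges.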
📘 Detailed Summary
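Motivation: As LLMs evolve from static predictors into interactive systems capable of continual learning and personalized inference, memory mechanisms have become a central theme in their architectural and functional development. A systematic survey of the role of memory in LLMs and MLLMs is still lacking, however, and a unified framework is needed to organize the literature and guide future research.
Method: The paper proposes a structured memory taxonomy with three main paradigms. Implicit memory is the knowledge embedded in the internal parameters of pre-trained transformers, covering memorization, associative retrieval, and contextual reasoning. Explicit memory involves external storage and retrieval components, including dynamic, queryable knowledge representations such as textual corpora, dense vectors, and graph-based structures. Agentic memory concerns persistent, temporally extended memory structures within autonomous agents, which support long-term planning, self-consistency, and collaborative behavior in multi-agent systems.
Result: The survey systematically reviews key architectural advances, benchmark tasks, and evaluation methods for memory in LLMs and MLLMs, pays particular attention to the need for coherence across vision, language, audio, and action modalities in multimodal settings, and identifies core challenges in memory capacity, alignment, factual consistency, and cross-system interoperability.
Conclusion: Memory mechanisms play a foundational role in strengthening the reasoning, adaptability, and contextual fidelity of LLMs. The taxonomy established by the survey offers a systematic view of the different memory paradigms, and the open challenges it identifies point to directions for future research, particularly memory integration in multimodal interaction and autonomous agent systems.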
📄 Abstract
Memory plays a foundational role in augmenting the reasoning, adaptability, and contextual fidelity of modern Large Language Models and Multi-Modal LLMs. As these models transition from static predictors to interactive systems capable of continual learning and personalized inference, the incorporation of memory mechanisms has emerged as a central theme in their architectural and functional evolution. This survey presents a comprehensive and structured synthesis of memory in LLMs and MLLMs, organizing the literature into a cohesive taxonomy comprising implicit, explicit, and agentic memory paradigms. Specifically, the survey delineates three primary memory frameworks. Implicit memory refers to the knowledge embedded within the internal parameters of pre-trained transformers, encompassing their capacity for memorization, associative retrieval, and contextual reasoning. Recent work has explored methods to interpret, manipulate, and reconfigure this latent memory. Explicit memory involves external storage and retrieval components designed to augment model outputs with dynamic, queryable knowledge representations, such as textual corpora, dense vectors, and graph-based structures, thereby enabling scalable and updatable interaction with information sources. Agentic memory introduces persistent, temporally extended memory structures within autonomous agents, facilitating long-term planning, self-consistency, and collaborative behavior in multi-agent systems, with relevance to embodied and interactive AI. Extending beyond text, the survey examines the integration of memory within multi-modal settings, where coherence across vision, language, audio, and action modalities is essential. Key architectural advances, benchmark tasks, and open challenges are discussed, including issues related to memory capacity, alignment, factual consistency, and cross-system interoperability.
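To make the explicit-memory paradigm concrete, the following is a minimal sketch of an external, queryable store that retrieves entries by cosine similarity over dense vectors. It is an illustration only, not a method taken from the survey: the `DenseMemory` class, its `write` and `read` methods, and the hand-written two-dimensional vectors are all hypothetical, and a real system would obtain its vectors from an embedding model.

```python
import math


def _cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class DenseMemory:
    """Hypothetical explicit memory: stores (vector, text) pairs outside the model."""

    def __init__(self):
        self._items = []

    def write(self, vector, text):
        # Appending keeps the memory updatable without touching model parameters.
        self._items.append((list(vector), text))

    def read(self, query, k=1):
        # Rank every stored entry by similarity to the query and return the top k texts.
        scored = [(_cosine(query, vec), text) for vec, text in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]


memory = DenseMemory()
memory.write([1.0, 0.0], "fact about topic A")
memory.write([0.0, 1.0], "fact about topic B")
nearest = memory.read([0.9, 0.1], k=1)
```

The store is separate from the model, so entries can be added or replaced at any time, which is the scalability and updatability property the abstract attributes to explicit memory.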
[46] RISER: Orchestrating Latent Reasoning Skills for Adaptive Activation Steering
Wencheng Ye, Liang Peng, Xiaoyang Yuan, Yi Bin, Pengpeng Zeng, Hengyu Jin, Heng Tao Shen
🧩 TL;DR
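This paper proposes RISER (Router-based Intervention for Steerable Enhancement of Reasoning), a plug-and-play activation-space intervention framework that adaptively steers LLM reasoning by dynamically composing reusable reasoning vectors, delivering efficient reasoning gains without updating the base model's parameters.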
📘 Detailed Summary
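Motivation: Current approaches to domain-specific reasoning typically rely on training-intensive methods that require parameter updates, while existing activation steering methods apply static, manual interventions that cannot adapt to the dynamic nature of complex reasoning. This limits the adaptability of parameter-efficient reasoning enhancement.
Method: RISER builds a library of reusable reasoning vectors and uses a lightweight Router to compose them dynamically for each input. The Router is optimized via reinforcement learning under task-level rewards and activates latent cognitive primitives in an emergent, compositional manner, yielding an adaptive activation-space intervention.
Result: Across seven diverse benchmarks, RISER improves average zero-shot accuracy by 3.4-6.5% over the base model and surpasses CoT-style reasoning with 2-3x higher token efficiency and robust accuracy gains. Analysis shows that RISER autonomously combines multiple vectors into interpretable, precise control strategies.
Conclusion: RISER demonstrates that adaptive activation steering through dynamic composition of reasoning vectors is feasible and points toward more controllable and efficient LLM reasoning. It shows that emergent, compositional interventions can yield interpretable control strategies, advancing parameter-efficient reasoning enhancement.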
📄 Abstract
Recent work on domain-specific reasoning with large language models (LLMs) often relies on training-intensive approaches that require parameter updates. While activation steering has emerged as a parameter-efficient alternative, existing methods apply static, manual interventions that fail to adapt to the dynamic nature of complex reasoning. To address this limitation, we propose RISER (Router-based Intervention for Steerable Enhancement of Reasoning), a plug-and-play intervention framework that adaptively steers LLM reasoning in activation space. RISER constructs a library of reusable reasoning vectors and employs a lightweight Router to dynamically compose them for each input. The Router is optimized via reinforcement learning under task-level rewards, activating latent cognitive primitives in an emergent and compositional manner. Across seven diverse benchmarks, RISER yields 3.4-6.5% average zero-shot accuracy improvements over the base model while surpassing CoT-style reasoning with 2-3x higher token efficiency and robust accuracy gains. Further analysis shows that RISER autonomously combines multiple vectors into interpretable, precise control strategies, pointing toward more controllable and efficient LLM reasoning.
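The general idea of composing steering vectors per input can be sketched as follows. This is not RISER's implementation: the abstract does not specify the Router's form, so the linear scoring, the softmax gating, the additive update, and the names `route`, `steer`, and `scale` are all assumptions made for illustration. The reinforcement-learning step that would train the router weights is omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(hidden, router_weights):
    """Assumed router: one linear score per library vector, normalized by softmax."""
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in router_weights]
    return softmax(scores)


def steer(hidden, vectors, router_weights, scale=1.0):
    """Add an input-dependent mixture of library vectors to a hidden activation.

    `scale` is a hypothetical global steering strength, not a parameter from the paper.
    """
    weights = route(hidden, router_weights)
    steered = list(hidden)
    for weight, vector in zip(weights, vectors):
        for i, component in enumerate(vector):
            steered[i] += scale * weight * component
    return steered, weights


# Toy library of two opposing steering vectors in a 2-dimensional activation space.
library = [[0.0, 1.0], [0.0, -1.0]]
# Router rows chosen by hand so the first vector dominates when hidden[0] is positive.
router = [[10.0, 0.0], [-10.0, 0.0]]
steered, mix = steer([1.0, 0.0], library, router)
```

Unlike a static intervention, the mixture `mix` depends on the current activation, so different inputs can receive different combinations of the same library vectors.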
[47] M$^3$Searcher: Modular Multimodal Information Seeking Agency with Retrieval-Oriented Reasoning
Xiaohan Yu, Chao Feng, Lang Mei, Chong Chen
🧩 TL;DR
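This paper proposes M³Searcher, a modular multimodal information-seeking agent that decouples information acquisition from answer derivation and is optimized with a retrieval-oriented multi-objective reward, improving autonomous information seeking in multimodal environments.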
📘 Detailed Summary
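Motivation: DeepResearch-style agents have shown strong autonomous information acquisition but remain limited to the text modality. Extending autonomous information seeking to multimodal settings raises two key challenges: the specialization-generalization trade-off that emerges when training models for multimodal tool use at scale, and the severe scarcity of training data capturing complex, multi-step multimodal search trajectories.
Method: M³Searcher uses a modular architecture that explicitly decouples information acquisition from answer derivation. It is optimized with a retrieval-oriented multi-objective reward that jointly encourages factual accuracy, reasoning soundness, and retrieval fidelity. The authors also develop MMSearchVQA, a multimodal multi-hop dataset that supports retrieval-centric reinforcement learning.
Result: Experiments show that M³Searcher outperforms existing approaches and exhibits strong transfer adaptability and effective reasoning on complex multimodal tasks. The gains on multimodal information-seeking tasks support the effectiveness of the modular architecture and the multi-objective reward.
Conclusion: By decoupling information acquisition from answer derivation in a modular design, the work addresses the specialization-generalization trade-off in multimodal autonomous information seeking. The multi-objective reward framework and the dedicated dataset offer a new approach to training multimodal reinforcement-learning agents and lay groundwork for more complex multimodal interactive systems.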
📄 Abstract
Recent advances in DeepResearch-style agents have demonstrated strong capabilities in autonomous information acquisition and synthesis from real-world web environments. However, existing approaches remain fundamentally limited to text modality. Extending autonomous information-seeking agents to multimodal settings introduces critical challenges: the specialization-generalization trade-off that emerges when training models for multimodal tool-use at scale, and the severe scarcity of training data capturing complex, multi-step multimodal search trajectories. To address these challenges, we propose M$^3$Searcher, a modular multimodal information-seeking agent that explicitly decouples information acquisition from answer derivation. M$^3$Searcher is optimized with a retrieval-oriented multi-objective reward that jointly encourages factual accuracy, reasoning soundness, and retrieval fidelity. In addition, we develop MMSearchVQA, a multimodal multi-hop dataset to support retrieval-centric RL training. Experimental results demonstrate that M$^3$Searcher outperforms existing approaches and exhibits strong transfer adaptability and effective reasoning in complex multimodal tasks.
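One common way to combine several objectives into a single scalar reward is a weighted sum, sketched below for the three objectives the abstract names. The abstract does not say how M$^3$Searcher actually combines them, so the weighted-sum form, the `RewardWeights` defaults, and the assumption that each component score lies in [0, 1] are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    """Hypothetical mixing weights; these values are illustrative, not from the paper."""
    accuracy: float = 0.5
    reasoning: float = 0.25
    retrieval: float = 0.25


def combined_reward(accuracy, reasoning, retrieval, weights=RewardWeights()):
    """Weighted sum of three component scores, each assumed to lie in [0, 1]."""
    components = (
        ("accuracy", accuracy),
        ("reasoning", reasoning),
        ("retrieval", retrieval),
    )
    for name, value in components:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} score must be in [0, 1], got {value}")
    return (
        weights.accuracy * accuracy
        + weights.reasoning * reasoning
        + weights.retrieval * retrieval
    )


# A trajectory with a correct answer but poor retrieval still loses part of the reward.
reward = combined_reward(accuracy=1.0, reasoning=1.0, retrieval=0.0)
```

Scoring retrieval separately from the final answer means a correct answer reached through poor retrieval earns less than one reached through faithful retrieval, which is one plausible reading of a retrieval-oriented reward.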
[48] Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning
Dongjie Cheng, Yongqi Li, Zhixin Ma, Hongru Cai, Yupeng Hu, Wenjie Wang, Liqiang Nie, Wenjie Li
🧩 TL;DR
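This paper proposes a unified generative paradigm for multimodal reasoning that unifies diverse multimodal reasoning skills by generating intermediate images, and instantiates it as Omni-R1, a framework trained with a two-stage SFT+RL procedure that delivers unified reasoning across a variety of multimodal tasks.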
📘 Detailed Summary
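Motivation: Although multimodal large language models (MLLMs) have made progress, existing approaches often follow a single task-specific reasoning pattern, which limits generalization across multimodal tasks. Many multimodal tasks require diverse reasoning skills, such as zooming in on a specific region or marking an object within an image, and existing methods cannot handle these diverse needs in a unified way.
Method: The paper proposes unified generative multimodal reasoning, which unifies diverse multimodal reasoning skills by generating intermediate images during the reasoning process. The paradigm is instantiated as Omni-R1, a two-stage framework of supervised fine-tuning (SFT) followed by reinforcement learning (RL), featuring a perception alignment loss and a perception reward that enable functional image generation. The paper also introduces Omni-R1-Zero, which removes the need for multimodal annotations by bootstrapping step-wise visualizations from text-only reasoning data.
Result: Experiments show that Omni-R1 achieves unified generative reasoning across a wide range of multimodal tasks. Omni-R1-Zero matches or even surpasses Omni-R1 on average, which suggests the promise of generative multimodal reasoning, particularly for reducing reliance on annotated data.
Conclusion: The study demonstrates the effectiveness of the generative multimodal reasoning paradigm, which unifies diverse reasoning skills through intermediate image generation. The success of Omni-R1-Zero indicates that bootstrapping multimodal reasoning from text-only data is a viable direction, offering a promising way to reduce dependence on costly multimodal annotation and moving multimodal reasoning toward greater generality and efficiency.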
📄 Abstract
Multimodal Large Language Models (MLLMs) are making significant progress in multimodal reasoning. Early approaches focus on pure text-based reasoning. More recent studies have incorporated multimodal information into the reasoning steps; however, they often follow a single task-specific reasoning pattern, which limits their generalizability across various multimodal tasks. In fact, there are numerous multimodal tasks requiring diverse reasoning skills, such as zooming in on a specific region or marking an object within an image. To address this, we propose unified generative multimodal reasoning, which unifies diverse multimodal reasoning skills by generating intermediate images during the reasoning process. We instantiate this paradigm with Omni-R1, a two-stage SFT+RL framework featuring perception alignment loss and perception reward, thereby enabling functional image generation. Additionally, we introduce Omni-R1-Zero, which eliminates the need for multimodal annotations by bootstrapping step-wise visualizations from text-only reasoning data. Empirical results show that Omni-R1 achieves unified generative reasoning across a wide range of multimodal tasks, and Omni-R1-Zero can match or even surpass Omni-R1 on average, suggesting a promising direction for generative multimodal reasoning.