cs.CV
[1] Thinking with Spatial Code for Physical-World Video Reasoning
Jieneng Chen, Wenxin Ma, Ruisheng Yuan, Yunzhi Zhang, Jiajun Wu, Alan Yuille
🧩 TL;DR
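This paper proposes Thinking with Spatial Code, a framework that transforms RGB video into explicit, temporally coherent 3D representations for physical-world visual question answering. A spatial encoder parses the video into structured spatial code, enabling large language models to reason directly over explicit spatial variables.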
📘 Detailed Summary
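Motivation: Current visual question answering systems lack an explicit understanding of 3D spatial structure when handling physical-world scenes, which makes geometry-aware reasoning difficult. This work addresses the conversion of RGB video into structured 3D representations so that large language models can reason directly over spatial variables, improving physical-world visual question answering.
Method: The Thinking with Spatial Code framework includes a spatial encoder that converts RGB video into explicit, temporally coherent 3D representations. The encoder encodes image and geometric features by unifying 6D object parsing and tracking backbones with geometric prediction. The LLM is then fine-tuned with reinforcement learning, using a spatial rubric reward that encourages perspective-aware, geometrically grounded reasoning.
Result: The method outperforms proprietary vision-language models on VSI-Bench, setting a new state of the art. Experiments show that the spatial encoder can parse videos into structured spatial code with explicit 3D oriented bounding boxes and semantic labels, enabling LLMs to reason effectively about space.
Conclusion: The study demonstrates the effectiveness of converting video into explicit 3D spatial representations and offers a new approach to physical-world visual question answering. Combining spatial encoding with reinforcement-learning fine-tuning substantially improves LLM spatial reasoning and opens a new research direction for multimodal understanding.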
📄 Abstract
We introduce Thinking with Spatial Code, a framework that transforms RGB video into explicit, temporally coherent 3D representations for physical-world visual question answering. We highlight the empirical finding that our proposed spatial encoder can parse videos into structured spatial code with explicit 3D oriented bounding boxes and semantic labels, enabling large language models (LLMs) to reason directly over explicit spatial variables. Specifically, we propose a spatial encoder that encodes image and geometric features by unifying 6D object parsing and tracking backbones with geometric prediction, and we further finetune LLMs with reinforcement learning using a spatial rubric reward that encourages perspective-aware, geometrically grounded inference. As a result, our model outperforms proprietary vision-language models on VSI-Bench, setting a new state-of-the-art. Code is available at https://github.com/Beckschen/spatialcode.
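To make "structured spatial code" concrete, the sketch below shows how per-object records with a 3D oriented bounding box and a semantic label might be serialized into text that an LLM can read. The schema (field names, metre units, yaw-only orientation) and the helper names are assumptions for illustration; they are not taken from the paper or its repository.

```python
import math
from dataclasses import dataclass

@dataclass
class SpatialObject:
    # Hypothetical record; field names are illustrative, not the paper's schema.
    track_id: int
    label: str
    center: tuple   # (x, y, z) in metres, world frame
    size: tuple     # (length, width, height) in metres
    yaw: float      # heading about the vertical axis, in radians

def to_spatial_code(objects):
    """Serialize objects as one text line each."""
    lines = []
    for o in objects:
        cx, cy, cz = o.center
        l, w, h = o.size
        lines.append(
            f"obj {o.track_id} {o.label} center=({cx:.2f},{cy:.2f},{cz:.2f}) "
            f"size=({l:.2f},{w:.2f},{h:.2f}) yaw={math.degrees(o.yaw):.0f}deg"
        )
    return "\n".join(lines)

def center_distance(a, b):
    """Euclidean distance between two object centres."""
    return math.dist(a.center, b.center)

sofa = SpatialObject(0, "sofa", (0.0, 0.0, 0.4), (2.0, 0.9, 0.8), 0.0)
lamp = SpatialObject(1, "lamp", (3.0, 4.0, 0.4), (0.3, 0.3, 1.5), math.pi / 2)
print(to_spatial_code([sofa, lamp]))
print(center_distance(sofa, lamp))
```

With explicit variables like these, a question such as "how far is the lamp from the sofa" becomes arithmetic over the serialized fields rather than a judgment made from pixels.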
[2] From Decoupled to Coupled: Robustness Verification for Learning-based Keypoint Detection with Joint Specifications
Xusheng Luo, Changliu Liu
🧩 TL;DR
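This paper proposes the first coupled robustness verification framework for heatmap-based keypoint detectors. It uses mixed-integer linear programming to verify the joint deviation across all keypoints, addressing the inability of existing decoupled methods to capture interdependencies among keypoints.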
📘 Detailed Summary
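Motivation: Modern neural keypoint detectors remain vulnerable to input perturbations, yet formal robustness verification for them is underexplored because of high-dimensional inputs and continuous coordinate outputs. Existing decoupled, classification-style methods verify each keypoint independently, ignoring interdependencies among keypoints and downstream task requirements, which leads to conservative guarantees.
Method: The paper proposes the first coupled robustness verification framework, formulating verification as a falsification problem solved with a mixed-integer linear program (MILP) that combines reachable heatmap sets with a polytope encoding joint deviation constraints. Infeasibility certifies robustness, while feasibility yields a counterexample. The method is proven sound: if it certifies the model as robust, the keypoint detector is guaranteed to be robust.
Result: Experiments show that the coupled approach achieves high verified rates and remains effective under strict error thresholds where decoupled methods fail. By capturing interdependencies among keypoints, it provides stronger robustness guarantees than independent verification.
Conclusion: The work fills a gap in formal robustness verification for keypoint detectors. By accounting for interdependencies among keypoints, the coupled framework offers more practical robustness guarantees for downstream vision tasks, lays groundwork for the safe deployment of keypoint detection systems, and may extend to other vision tasks with continuous outputs.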
📄 Abstract
Keypoint detection underpins many vision tasks, including pose estimation, viewpoint recovery, and 3D reconstruction, yet modern neural models remain vulnerable to small input perturbations. Despite its importance, formal robustness verification for keypoint detectors is largely unexplored due to high-dimensional inputs and continuous coordinate outputs. We propose the first coupled robustness verification framework for heatmap-based keypoint detectors that bounds the joint deviation across all keypoints, capturing their interdependencies and downstream task requirements. Unlike prior decoupled, classification-style approaches that verify each keypoint independently and yield conservative guarantees, our method verifies collective behavior. We formulate verification as a falsification problem using a mixed-integer linear program (MILP) that combines reachable heatmap sets with a polytope encoding joint deviation constraints. Infeasibility certifies robustness, while feasibility provides counterexamples, and we prove the method is sound: if it certifies the model as robust, then the keypoint detection model is guaranteed to be robust. Experiments show that our coupled approach achieves high verified rates and remains effective under strict error thresholds where decoupled methods fail.
[3] DreamCAD: Scaling Multi-modal CAD Generation using Differentiable Parametric Surfaces
Mohammad Sadil Khan, Muhammad Usama, Rolandos Alexandros Potamias, Didier Stricker, Muhammad Zeshan Afzal, Jiankang Deng, Ismail Elezi
🧩 TL;DR
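DreamCAD is a multi-modal generative framework that produces editable BReps directly from point-level supervision, without CAD-specific annotations. It also introduces CADCap-1M, a dataset with 1M+ descriptions, and achieves state-of-the-art performance on multiple benchmarks.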
📘 Detailed Summary
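Motivation: Existing CAD generation methods are constrained by small annotated datasets that require explicit design histories or BRep labels, while millions of unannotated 3D meshes remain untapped. This limits progress in scalable CAD generation and motivates generative methods that can exploit large-scale unannotated data.
Method: DreamCAD represents each BRep as a set of parametric surface patches (e.g., Bézier surfaces) and uses a differentiable tessellation method to generate meshes. This enables large-scale training on 3D datasets while reconstructing connected and editable surfaces.
Result: DreamCAD achieves state-of-the-art performance on the ABC and Objaverse benchmarks across text, image, and point modalities, improving geometric fidelity and exceeding 75% user preference.
Conclusion: By combining large-scale unannotated data with a differentiable representation, the work opens a new path to scalable CAD generation. The CADCap-1M dataset is intended to advance text-to-CAD research, and the framework shows that generative models without CAD-specific annotations are feasible.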
📄 Abstract
Computer-Aided Design (CAD) relies on structured and editable geometric representations, yet existing generative methods are constrained by small annotated datasets with explicit design histories or boundary representation (BRep) labels. Meanwhile, millions of unannotated 3D meshes remain untapped, limiting progress in scalable CAD generation. To address this, we propose DreamCAD, a multi-modal generative framework that directly produces editable BReps from point-level supervision, without CAD-specific annotations. DreamCAD represents each BRep as a set of parametric patches (e.g., Bézier surfaces) and uses a differentiable tessellation method to generate meshes. This enables large-scale training on 3D datasets while reconstructing connected and editable surfaces. Furthermore, we introduce CADCap-1M, the largest CAD captioning dataset to date, with 1M+ descriptions generated using GPT-5 for advancing text-to-CAD research. DreamCAD achieves state-of-the-art performance on ABC and Objaverse benchmarks across text, image, and point modalities, improving geometric fidelity and surpassing 75% user preference. Code and dataset will be publicly available.
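As a rough illustration of the "parametric patch plus tessellation" idea, the sketch below evaluates a tensor-product Bézier patch with Bernstein polynomials and samples it on a regular grid to obtain triangles. It is a generic textbook construction, not DreamCAD's implementation; the function names and the uniform sampling scheme are assumptions. Each vertex is a polynomial in the control points, which is why a tessellation of this kind can be differentiated with respect to them.

```python
from math import comb

def bernstein(n, i, t):
    """Bernstein basis polynomial B_{i,n}(t)."""
    return comb(n, i) * (t ** i) * ((1 - t) ** (n - i))

def bezier_patch_point(ctrl, u, v):
    """Evaluate a patch; ctrl is an (n+1) x (m+1) grid of (x, y, z) points."""
    n, m = len(ctrl) - 1, len(ctrl[0]) - 1
    point = [0.0, 0.0, 0.0]
    for i in range(n + 1):
        bu = bernstein(n, i, u)
        for j in range(m + 1):
            w = bu * bernstein(m, j, v)
            for k in range(3):
                point[k] += w * ctrl[i][j][k]
    return tuple(point)

def tessellate(ctrl, res):
    """Sample a res x res vertex grid and connect it into triangles."""
    verts = [bezier_patch_point(ctrl, a / (res - 1), b / (res - 1))
             for a in range(res) for b in range(res)]
    faces = []
    for a in range(res - 1):
        for b in range(res - 1):
            i = a * res + b
            faces.append((i, i + 1, i + res))            # first triangle of the cell
            faces.append((i + 1, i + res + 1, i + res))  # second triangle of the cell
    return verts, faces

# Flat bicubic patch over the unit square (toy control grid).
ctrl = [[(i / 3, j / 3, 0.0) for j in range(4)] for i in range(4)]
verts, faces = tessellate(ctrl, 5)
print(len(verts), len(faces))
```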
[4] Adversarial Batch Representation Augmentation for Batch Correction in High-Content Cellular Screening
Lei Tong, Xujing Yao, Adam Corrigan, Long Chen, Navin Rathna Kumar, Kerry Hallbrook, Jonathan Orme, Yinhai Wang, Huiyu Zhou
🧩 TL;DR
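This paper proposes Adversarial Batch Representation Augmentation (ABRA), which frames the mitigation of biological batch effects in cell imaging as a domain generalization task. By parameterizing feature statistics as structured uncertainties and synthesizing worst-case batch perturbations, it achieves state-of-the-art siRNA perturbation classification on the RxRx1 and RxRx1-WILDS benchmarks.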
📘 Detailed Summary
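Motivation: The large-scale cell imaging data produced by high-content screening contains technical variation across experimental executions, which induces biological batch effects, causes covariate shift, and degrades the generalization of deep learning models on unseen data. Existing batch correction methods typically rely on additional prior knowledge or struggle to generalize to unseen biological batches.
Method: The work frames bio-batch mitigation as a domain generalization problem and proposes ABRA. The method explicitly models batch-wise statistical fluctuations by parameterizing feature statistics as structured uncertainties. A min-max optimization framework actively synthesizes worst-case bio-batch perturbations in the representation space, guided by a strict angular geometric margin that preserves fine-grained class discriminability. A synergistic distribution alignment objective prevents representation collapse during adversarial exploration.
Result: Extensive evaluations on the large-scale RxRx1 and RxRx1-WILDS benchmarks show that ABRA establishes a new state of the art for siRNA perturbation classification. It mitigates bio-batch effects and improves generalization to unseen batches, outperforming existing batch correction methods.
Conclusion: The study recasts bio-batch mitigation as a domain generalization problem and provides a solution that does not depend on additional prior knowledge. By combining adversarial perturbation synthesis with a geometric margin constraint, ABRA strengthens robustness to batch variation while preserving class discriminability, offering a new approach to batch correction in cell imaging analysis.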
📄 Abstract
High-Content Screening routinely generates massive volumes of cell painting images for phenotypic profiling. However, technical variations across experimental executions inevitably induce biological batch (bio-batch) effects. These cause covariate shifts and degrade the generalization of deep learning models on unseen data. Existing batch correction methods typically rely on additional prior knowledge (e.g., treatment or cell culture information) or struggle to generalize to unseen bio-batches. In this work, we frame bio-batch mitigation as a Domain Generalization (DG) problem and propose Adversarial Batch Representation Augmentation (ABRA). ABRA explicitly models batch-wise statistical fluctuations by parameterizing feature statistics as structured uncertainties. Through a min-max optimization framework, it actively synthesizes worst-case bio-batch perturbations in the representation space, guided by a strict angular geometric margin to preserve fine-grained class discriminability. To prevent representation collapse during this adversarial exploration, we introduce a synergistic distribution alignment objective. Extensive evaluations on the large-scale RxRx1 and RxRx1-WILDS benchmarks demonstrate that ABRA establishes a new state-of-the-art for siRNA perturbation classification.
[5] Post Fusion Bird's Eye View Feature Stabilization for Robust Multimodal 3D Detection
Trung Tien Dong, Dev Thakkar, Arman Sargolzaei, Xiaomin Lin
🧩 TL;DR
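This paper proposes the Post Fusion Stabilizer (PFS), a lightweight module that stabilizes bird's-eye view feature representations to improve the robustness of camera-LiDAR fusion detectors for autonomous driving under domain shift and sensor failures, without modifying the existing detector architecture.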
📘 Detailed Summary
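Motivation: Bird's-eye view fusion detectors can degrade significantly under domain shift and sensor failures, limiting the reliability of autonomous driving systems in real-world deployment. Existing robustness approaches usually require modifying the fusion architecture or retraining specialized models, which makes them hard to integrate into already deployed systems.
Method: The Post Fusion Stabilizer (PFS) is a lightweight module that operates on the intermediate bird's-eye view representation of an existing detector and produces a refined feature map for the original detection head. The design stabilizes feature statistics under domain shift, suppresses spatial regions affected by sensor degradation, and adaptively restores weakened cues through residual correction. Because it acts as a near-identity transformation, it preserves performance while improving robustness.
Result: On the nuScenes benchmark, PFS achieves state-of-the-art results in several failure modes, improving camera dropout robustness by +1.2% and low-light performance by +4.4% mAP, with a lightweight footprint of only 3.3M parameters.
Conclusion: PFS is a practical post-hoc solution that improves the robustness of fusion systems without modifying the existing detector architecture, offering a lightweight way to handle sensor failures and domain shift in real-world autonomous driving deployments.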
📄 Abstract
Camera-LiDAR fusion is widely used in autonomous driving to enable accurate 3D object detection. However, bird's-eye view (BEV) fusion detectors can degrade significantly under domain shift and sensor failures, limiting reliability in real-world deployment. Existing robustness approaches often require modifying the fusion architecture or retraining specialized models, making them difficult to integrate into already deployed systems. We propose a Post Fusion Stabilizer (PFS), a lightweight module that operates on intermediate BEV representations of existing detectors and produces a refined feature map for the original detection head. The design stabilizes feature statistics under domain shift, suppresses spatial regions affected by sensor degradation, and adaptively restores weakened cues through residual correction. Designed as a near-identity transformation, PFS preserves performance while improving robustness under diverse camera and LiDAR corruptions. Evaluations on the nuScenes benchmark demonstrate that PFS achieves state-of-the-art results in several failure modes, notably improving camera dropout robustness by +1.2% and low-light performance by +4.4% mAP while maintaining a lightweight footprint of only 3.3 M parameters.
[6] Rethinking Concept Bottleneck Models: From Pitfalls to Solutions
Merve Tapli, Quentin Bouniot, Wolfgang Stammer, Zeynep Akata, Emre Akbas
🧩 TL;DR
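This paper proposes CBM-Suite, a framework that systematically addresses four key limitations of concept bottleneck models: the lack of a concept relevance metric, the linearity problem, the accuracy gap, and the unstudied impact of model components. Using an entropy-based metric, a non-linear layer, and a distillation loss, it yields models that are more accurate and genuinely interpretable.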
📘 Detailed Summary
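Motivation: Concept Bottleneck Models (CBMs) ground predictions in human-understandable concepts but face four fundamental limitations: no metric to pre-evaluate concept relevance, a "linearity problem" that causes recent CBMs to bypass the concept bottleneck entirely, an accuracy gap relative to opaque models, and no systematic study of how different visual backbones and vision-language models affect them.
Method: CBM-Suite first introduces an entropy-based metric that quantifies the intrinsic suitability of a concept set for a given dataset. Second, it resolves the linearity problem by inserting a non-linear layer between the concept activations and the classifier, so that model accuracy faithfully reflects concept relevance. Third, it narrows the accuracy gap with a distillation loss guided by a linear teacher probe. Finally, it provides a comprehensive analysis of how different vision encoders, vision-language models, and concept sets affect the accuracy and interpretability of CBMs.
Result: Extensive evaluations show that CBM-Suite yields more accurate models and provides insights for improving concept-based interpretability. The framework addresses the key limitations of CBMs, covering concept relevance evaluation, the linearity problem, and the accuracy gap.
Conclusion: CBM-Suite offers a systematic methodological framework for concept bottleneck models. It addresses key limitations of existing techniques and provides a comprehensive analysis of how different visual components affect concept interpretability, giving practical guidance for building AI systems that are both more accurate and genuinely interpretable.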
📄 Abstract
Concept Bottleneck Models (CBMs) ground predictions in human-understandable concepts but face fundamental limitations: the absence of a metric to pre-evaluate concept relevance, the "linearity problem" causing recent CBMs to bypass the concept bottleneck entirely, an accuracy gap compared to opaque models, and finally the lack of systematic study on the impact of different visual backbones and VLMs. We introduce CBM-Suite, a methodological framework that systematically addresses these challenges. First, we propose an entropy-based metric to quantify the intrinsic suitability of a concept set for a given dataset. Second, we resolve the linearity problem by inserting a non-linear layer between concept activations and the classifier, which ensures that model accuracy faithfully reflects concept relevance. Third, we narrow the accuracy gap by leveraging a distillation loss guided by a linear teacher probe. Finally, we provide comprehensive analyses on how different vision encoders, vision-language models, and concept sets interact to influence accuracy and interpretability in CBMs. Extensive evaluations show that CBM-Suite yields more accurate models and provides insights for improving concept-based interpretability.
[7] Making Reconstruction FID Predictive of Diffusion Generation FID
Tongda Xu, Mingwei He, Shady Abu-Hussein, Jose Miguel Hernandez-Lobato, Haotian Zhang, Kai Zhao, Chao Zhou, Ya-Qin Zhang, Yan Wang
🧩 TL;DR
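This paper proposes interpolated FID (iFID), a variant of reconstruction FID that correlates strongly with the generation FID (gFID) of latent diffusion models, addressing the poor correlation between conventional reconstruction FID and generation FID.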
📘 Detailed Summary
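Motivation: The reconstruction FID (rFID) of a variational autoencoder (VAE) correlates only weakly with the generation FID (gFID) of a latent diffusion model, which limits the usefulness of rFID as an indicator of diffusion generation quality. The paper addresses this mismatch and proposes an alternative metric that predicts diffusion generation performance.
Method: For each element in the dataset, iFID retrieves its nearest neighbor in the latent space and interpolates the two latent representations. The interpolated latent is then decoded, and the FID is computed between the decoded samples and the original dataset. The paper also analyzes how rFID correlates with sample quality in the diffusion refinement phase and how iFID correlates with sample quality in the diffusion navigation phase, and offers an explanation based on results on diffusion generalization and hallucination.
Result: iFID is the first metric to show a strong correlation with diffusion gFID, reaching Pearson linear and Spearman rank correlations of approximately 0.85. The study also shows that rFID correlates with sample quality in the refinement phase whereas iFID correlates with sample quality in the navigation phase, clarifying when each metric applies.
Conclusion: iFID is a simple and effective metric that resolves the mismatch between reconstruction FID and diffusion generation performance. Beyond providing a practical evaluation tool, the paper explains why reconstruction metrics are negatively correlated with gFID, offering a new perspective on evaluating and optimizing diffusion models.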
📄 Abstract
It is well known that the reconstruction FID (rFID) of a VAE is poorly correlated with the generation FID (gFID) of a latent diffusion model. We propose interpolated FID (iFID), a simple variant of rFID that exhibits a strong correlation with gFID. Specifically, for each element in the dataset, we retrieve its nearest neighbor (NN) in the latent space and interpolate their latent representations. We then decode the interpolated latent and compute the FID between the decoded samples and the original dataset. Additionally, we refine the claim that rFID correlates poorly with gFID by showing that rFID correlates with sample quality in the diffusion refinement phase, whereas iFID correlates with sample quality in the diffusion navigation phase. Furthermore, we provide an explanation for why iFID correlates well with gFID, and why reconstruction metrics are negatively correlated with gFID, by connecting to results on diffusion generalization and hallucination. Empirically, iFID is the first metric to demonstrate a strong correlation with diffusion gFID, achieving Pearson linear and Spearman rank correlations of approximately 0.85. The source code is provided at https://github.com/tongdaxu/Making-rFID-Predictive-of-Diffusion-gFID.
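The nearest-neighbor interpolation step that iFID builds on can be sketched as follows. This covers only the latent-space part: the VAE decoder and the FID computation are omitted, and the brute-force search, Euclidean distance, and interpolation weight `alpha=0.5` are assumptions rather than details given in the abstract.

```python
import math

def nearest_neighbor(i, latents):
    """Index of the latent closest to latents[i], excluding itself."""
    best_j, best_d = None, math.inf
    for j, z in enumerate(latents):
        if j == i:
            continue
        d = math.dist(latents[i], z)
        if d < best_d:
            best_j, best_d = j, d
    return best_j

def interpolate_with_nn(latents, alpha=0.5):
    """Blend every latent with its nearest neighbor (alpha is an assumed weight)."""
    out = []
    for i, z in enumerate(latents):
        nn = latents[nearest_neighbor(i, latents)]
        out.append([(1 - alpha) * a + alpha * b for a, b in zip(z, nn)])
    return out

# Toy 2-D latents standing in for VAE encodings.
latents = [[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]]
mixed = interpolate_with_nn(latents)
print(mixed)
```

In a full pipeline the `mixed` latents would be passed through the VAE decoder, and the FID would be computed between the decoded images and the original dataset.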
[8] Keeping the Evidence Chain: Semantic Evidence Allocation for Training-Free Token Pruning in Video Temporal Grounding
Jiaqi Li, Shuntian Zheng, Yixian Shen, Jia-Hong Huang, Xiaoman Lu, Minzhe Ni, Yu Guan
🧩 TL;DR
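This paper proposes SemVID, a training-free visual token pruning framework for video temporal grounding. By retaining query-critical evidence and maintaining cross-frame connectivity, it substantially reduces the number of visual tokens while preserving high grounding accuracy, balancing accuracy and efficiency.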
📘 Detailed Summary
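Motivation: Video temporal grounding (VTG) must localize query-relevant moments in long videos, which makes existing video-language models prohibitively expensive. Training-free visual token pruning works well for video question answering, but applying it directly to VTG causes severe degradation because VTG depends on boundary-sensitive evidence and cross-frame reasoning chains.
Method: SemVID builds on two VTG-specific pruning principles, Evidence Retention and Connectivity Strength. It first allocates a per-frame token budget by balancing query relevance and inter-frame variation, and then selects three types of tokens: object tokens for diverse query-critical evidence, motion tokens that capture meaningful transitions and act as cross-frame relays, and a small set of context tokens that maintain scene continuity.
Result: On VTG benchmarks, SemVID retains up to 95.4% mIoU with only 12.5% of the visual tokens and delivers up to a 5.8x prefill speedup. It consistently outperforms prior methods under the same budgets, achieving a strong accuracy-efficiency trade-off.
Conclusion: The study shows the importance of task-specific pruning strategies, particularly for boundary-sensitive video temporal grounding. By constructing a compact token subset with complementary semantic roles, SemVID provides an effective solution for efficient video-language processing while preserving cross-frame reasoning.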
📄 Abstract
Video Temporal Grounding (VTG) localizes the temporal boundaries of a query-relevant moment in long, untrimmed videos, making video-language-model (VLM) pipelines prohibitively expensive. While recent training-free visual token pruning has shown success in video question answering, naively applying these objectives to VTG often causes drastic degradation, as VTG crucially depends on boundary-sensitive evidence and cross-frame reasoning chains. We therefore identify two VTG-specific pruning principles: Evidence Retention (ER), which keeps query-critical patches especially around event boundaries, and Connectivity Strength (CS), which preserves token-level cross-frame connectivity for long-range evidence aggregation. Building on these insights, we propose SemVID, a training-free pruning framework that constructs a compact yet coherent token subset with complementary semantic roles. SemVID first allocates per-frame token budgets by balancing query relevance and inter-frame variation to avoid over-pruned segments, and then selects three types of tokens: object tokens for diverse query-critical evidence, motion tokens to capture meaningful transitions and serve as cross-frame relays, and a small set of context tokens for scene continuity. Extensive experiments on VTG benchmarks show that SemVID achieves a strong accuracy-efficiency trade-off, retaining up to 95.4% mIoU with only 12.5% visual tokens and delivering up to a 5.8x prefill speedup, consistently outperforming prior methods under the same budgets.
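One way to picture the per-frame budget allocation is a weighted score per frame followed by proportional rounding with a minimum per frame. The sketch below is only an illustration under assumptions: the linear mix `lam`, the `floor` guarantee, and the largest-remainder rounding are hypothetical choices, not SemVID's actual rule.

```python
def allocate_budgets(relevance, variation, total, lam=0.5, floor=1):
    """Split `total` tokens across frames.

    relevance / variation: per-frame scores in [0, 1] (assumed precomputed).
    lam and floor are hypothetical knobs, not parameters from the paper.
    """
    n = len(relevance)
    spare = total - floor * n
    if spare < 0:
        raise ValueError("total budget is smaller than the per-frame floor")
    scores = [lam * r + (1 - lam) * v for r, v in zip(relevance, variation)]
    if sum(scores) == 0:
        scores = [1.0] * n  # fall back to a uniform split
    s = sum(scores)
    shares = [spare * x / s for x in scores]
    budgets = [floor + int(x) for x in shares]
    # Hand out tokens lost to rounding, largest fractional part first.
    leftover = total - sum(budgets)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets

budgets = allocate_budgets(
    relevance=[0.9, 0.1, 0.5, 0.5],
    variation=[0.1, 0.1, 0.9, 0.5],
    total=16,
)
print(budgets)
```

The floor keeps every frame represented, which mirrors the stated goal of avoiding over-pruned segments.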
[9] OWL: A Novel Approach to Machine Perception During Motion
Daniel Raviv, Juan D. Yepes
🧩 TL;DR
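This paper introduces OWL, a perception-related function designed to address the challenges of 3D perception during motion. The function derives its values directly from two fundamental visual motion cues, with one set of cue values per point per time instant, and enables scaled 3D scene reconstruction without prior knowledge of the environment.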
📘 Detailed Summary
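Motivation: The work targets the challenges of 3D perception during motion, in particular how to achieve accurate 3D perception from visual motion cues alone, without prior knowledge of the environment, of stationary structure, or of camera motion parameters. Existing approaches typically need to measure or know physical quantities such as the relative instantaneous directional range and directional translation between the camera and a 3D point. OWL aims to remove these requirements and provide a simpler perception framework.
Method: OWL builds on two fundamental visual motion cues: the perceived local visual looming of points near the fixation point, and the perceived rotation of the rigid object relative to the fixation point. It takes measurements directly from raw visual motion image sequences and uses minimalistic, pixel-based, parallel computations. It does not require explicit measurement of the relative instantaneous directional range and directional translation between the camera and visible 3D points, nor prior knowledge of their individual values.
Result: Simulations show that OWL achieves geometric constancy of 3D objects over time and enables scaled 3D scene reconstruction from visual motion cues alone. OWL values are obtained without prior knowledge of stationary environments, moving objects, or camera motion, which supports the feasibility of the approach.
Conclusion: OWL offers a unified, analytical, time-based approach that enhances key perception capabilities, including scaled 3D mapping and camera heading estimation. It bridges theoretical concepts and practical applications in robotics and autonomous navigation, may serve as a building block for real-time decision-making and interaction in next-generation autonomous systems, and may also contribute to the understanding of natural perception, behavioral psychology, and neural functionality.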
📄 Abstract
We introduce a perception-related function, OWL, designed to address the complex challenges of 3D perception during motion. It derives its values directly from two fundamental visual motion cues, with one set of cue values per point per time instant. During motion, two visual motion cues relative to a fixation point emerge: 1) perceived local visual looming of points near the fixation point, and 2) perceived rotation of the rigid object relative to the fixation point. It also expresses the relation between two well-known physical quantities, the relative instantaneous directional range and directional translation in 3D between the camera and any visible 3D point, without explicitly requiring their measurement or prior knowledge of their individual values. OWL offers a unified, analytical time-based approach that enhances and simplifies key perception capabilities, including scaled 3D mapping and camera heading. Simulations demonstrate that OWL achieves geometric constancy of 3D objects over time and enables scaled 3D scene reconstruction from visual motion cues alone. By leveraging direct measurements from raw visual motion image sequences, OWL values can be obtained without prior knowledge of stationary environments, moving objects, or camera motion. This approach employs minimalistic, pixel-based, parallel computations, providing an alternative real-time representation for 3D points in relative motion. OWL bridges the gap between theoretical concepts and practical applications in robotics and autonomous navigation and may unlock new possibilities for real-time decision-making and interaction, potentially serving as a building block for next-generation autonomous systems. This paper offers an alternative perspective on machine perception, with implications that may extend to natural perception and contribute to a better understanding of behavioral psychology and neural functionality.
[10] MultiHaystack: Benchmarking Multimodal Retrieval and Reasoning over 40K Images, Videos, and Documents
Dannong Xu, Zhongyu Yang, Jun Chen, Yingfang Yuan, Ming Hu, Lei Sun, Luc Van Gool, Danda Pani Paudel, Chun-Mei Feng
🧩 TL;DR
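This paper introduces MultiHaystack, the first benchmark for evaluating retrieval and reasoning under large-scale, cross-modal conditions. It shows that heterogeneous multimodal retrieval is a primary bottleneck for current multimodal large language models.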
📘 Detailed Summary
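Motivation: Existing benchmarks for multimodal large language models (MLLMs) mainly evaluate text, image, or video understanding separately and overlook a critical real-world requirement: retrieving relevant evidence from large, heterogeneous multimodal corpora. Most benchmarks restrict retrieval to small, single-modality candidate sets, which oversimplifies the search space and overstates end-to-end reliability.
Method: MultiHaystack comprises over 46,000 multimodal retrieval candidates across documents, images, and videos, along with 747 open yet verifiable questions. Each question is grounded in a unique validated evidence item within the retrieval pool, requiring evidence localization across modalities and fine-grained reasoning.
Result: Models perform well when given the corresponding evidence, but performance drops sharply when the evidence must be retrieved from the full corpus. The strongest retriever, E5-V, reaches only 40.8% Recall@1, and the reasoning accuracy of state-of-the-art MLLMs such as GPT-5 falls from 80.86% with the evidence provided to 51.4% under top-5 retrieval.
Conclusion: Heterogeneous multimodal retrieval remains a primary bottleneck for MLLMs. MultiHaystack is a valuable testbed that exposes limitations hidden by small-scale evaluations and promotes retrieval-centric advances in multimodal systems.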
📄 Abstract
Multimodal large language models (MLLMs) achieve strong performance on benchmarks that evaluate text, image, or video understanding separately. However, these settings do not assess a critical real-world requirement, which involves retrieving relevant evidence from large, heterogeneous multimodal corpora prior to reasoning. Most existing benchmarks restrict retrieval to small, single-modality candidate sets, substantially simplifying the search space and overstating end-to-end reliability. To address this gap, we introduce MultiHaystack, the first benchmark designed to evaluate both retrieval and reasoning under large-scale, cross-modal conditions. MultiHaystack comprises over 46,000 multimodal retrieval candidates across documents, images, and videos, along with 747 open yet verifiable questions. Each question is grounded in a unique validated evidence item within the retrieval pool, requiring evidence localization across modalities and fine-grained reasoning. In our study, we find that models perform competitively when provided with the corresponding evidence, but their performance drops sharply when required to retrieve that evidence from the full corpus. Additionally, even the strongest retriever, E5-V, achieves only 40.8% Recall@1, while state-of-the-art MLLMs such as GPT-5 experience a significant drop in reasoning accuracy from 80.86% when provided with the corresponding evidence to 51.4% under top-5 retrieval. These results indicate that multimodal retrieval over heterogeneous pools remains a primary bottleneck for MLLMs, positioning MultiHaystack as a valuable testbed that highlights underlying limitations obscured by small-scale evaluations and promotes retrieval-centric advances in multimodal systems.
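For readers unfamiliar with the metric, Recall@k with one gold evidence item per question (the setting described above) reduces to the fraction of questions whose gold item appears among the top k retrieved candidates. The sketch below uses made-up candidate IDs; it is not the benchmark's evaluation script.

```python
def recall_at_k(ranked_ids, gold_ids, k):
    """Fraction of queries whose single gold item is in the top-k results.

    ranked_ids: one ranked candidate list per query.
    gold_ids:   one gold evidence ID per query.
    """
    hits = sum(1 for ranked, gold in zip(ranked_ids, gold_ids) if gold in ranked[:k])
    return hits / len(gold_ids)

# Toy rankings over a mixed pool of documents, images, and videos (hypothetical IDs).
ranked = [
    ["img_3", "doc_9", "vid_1"],
    ["doc_2", "img_7", "doc_4"],
    ["vid_5", "vid_8", "img_1"],
    ["doc_6", "img_2", "vid_9"],
]
gold = ["img_3", "doc_4", "img_1", "vid_0"]
print(recall_at_k(ranked, gold, 1))
print(recall_at_k(ranked, gold, 3))
```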
[11] Interpretable Perception and Reasoning for Audiovisual Geolocation
Yiyang Su, Xiaoming Liu
🧩 TL;DR
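This paper introduces Audiovisual Geolocation, a framework that resolves visual ambiguity in geolocation through interpretable perception and reasoning, using auditory cues to achieve high-precision global localization. The framework has a three-stage pipeline and significantly outperforms unimodal baselines on the newly built AVG benchmark.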
📘 Detailed Summary
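Motivation: Although multimodal large language models have improved image-based localization, precise global geolocation remains very challenging because visual landscapes are inherently ambiguous and the potential of auditory cues is largely untapped. Existing methods rely mainly on visual information, while soundscapes provide an orthogonal signal that can compensate for visual ambiguity. A systematic audiovisual localization framework has been missing.
Method: The paper proposes a three-stage audiovisual geolocation framework. The Perception stage uses a mixture-autoregressive sparse autoencoder to decompose noisy audio into semantically grounded "acoustic atoms". The Multimodal Reasoning stage uses an MLLM fine-tuned with Group Relative Policy Optimization (GRPO) to fuse these atoms with visual features. The Precision Prediction stage applies Riemannian Flow Matching on the S^2 manifold for the final localization. The authors also build AVG, a global-scale video benchmark with 20,000 clips covering 1,000 distinct locations.
Result: The framework significantly outperforms unimodal baselines on the AVG benchmark. Interpretable soundscape perception provides a critical orthogonal signal that, combined with multimodal reasoning, enables high-precision global localization and confirms the complementary value of auditory cues.
Conclusion: Interpretable perception of the soundscape provides a critical, orthogonal signal for geolocation and, when coupled with multimodal reasoning, substantially improves localization accuracy. The work contributes an audiovisual fusion framework and a global-scale audiovisual geolocation benchmark, and it highlights the importance of cross-modal fusion for resolving visual ambiguity.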
📄 Abstract
While recent advances in Multimodal Large Language Models (MLLMs) have improved image-based localization, precise global geolocation remains a formidable challenge due to the inherent ambiguity of visual landscapes and the largely untapped potential of auditory cues. In this paper, we introduce Audiovisual Geolocation, a framework designed to resolve geographic ambiguity through interpretable perception and reasoning. We present AVG, a high-quality global-scale video benchmark for geolocation, comprising 20,000 curated clips across 1,000 distinct locations. To address the complexity of audiovisual geolocation, we propose a three-stage framework: (1) a Perception stage that utilizes a mixture-autoregressive sparse autoencoder to decompose noisy audio into semantically grounded "acoustic atoms"; (2) a Multimodal Reasoning stage that employs an MLLM finetuned via Group Relative Policy Optimization (GRPO) to synthesize these atoms with visual features; and (3) a Precision Prediction stage using Riemannian Flow Matching on the S^2 manifold. Our experiments demonstrate that our framework significantly outperforms unimodal baselines. These results indicate that interpretable perception of the soundscape provides a critical, orthogonal signal that, when coupled with multimodal reasoning, enables high-precision global localization.
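Working on the S^2 manifold means treating a location as a unit vector on the sphere and moving along great-circle arcs instead of straight lines in latitude/longitude space. The sketch below shows that geometric building block: conversion to unit vectors and the geodesic between two points. It is a generic construction; how the paper defines its flow-matching paths and training targets is not stated in the abstract, so nothing here should be read as its implementation.

```python
import math

def to_unit(lat_deg, lon_deg):
    """Latitude/longitude in degrees -> unit vector on the sphere S^2."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))

def geodesic(x0, x1, t):
    """Point at fraction t along the great-circle arc from x0 to x1.

    Antipodal inputs have no unique geodesic and are not handled here.
    """
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(x0, x1))))
    theta = math.acos(dot)
    if theta < 1e-9:
        return x0  # the two points coincide
    w0 = math.sin((1 - t) * theta) / math.sin(theta)
    w1 = math.sin(t * theta) / math.sin(theta)
    return tuple(w0 * a + w1 * b for a, b in zip(x0, x1))

start = to_unit(0.0, 0.0)
end = to_unit(0.0, 90.0)
mid = geodesic(start, end, 0.5)
print(mid)
```

Every interpolated point stays on the sphere, which is the property a prediction head on S^2 needs and plain linear interpolation of coordinates does not provide.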
[12] Any to Full: Prompting Depth Anything for Depth Completion in One Stage
Zhiyuan Zhou, Ruofeng Liu, Taichi Liu, Weijian Zuo, Shanshan Wang, Zhiqing Hong, Desheng Zhang
🧩 TL;DR
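This paper proposes Any2Full, a one-stage, domain-general, and pattern-agnostic depth completion framework. By reformulating completion as a scale-prompting adaptation of a pretrained monocular depth estimation model, it completes sparse depth inputs robustly and efficiently.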
📘 Detailed Summary
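Motivation: Existing RGBD-fused depth completion methods learn priors jointly conditioned on the training RGB distribution and specific depth patterns, which limits domain generalization and robustness to varied depth patterns. Two-stage strategies that integrate monocular depth estimation (MDE) models rely on explicit relative-to-metric alignment, which adds computation and introduces structured distortions.
Method: Any2Full uses a one-stage framework that reformulates depth completion as a scale-prompting adaptation of a pretrained MDE model. A Scale-Aware Prompt Encoder extracts scale cues from sparse inputs and encodes them as unified scale prompts, guiding the model toward globally scale-consistent predictions while preserving its geometric priors.
Result: Any2Full outperforms OMNI-DC by 32.2% in average AbsREL and delivers a 1.4x speedup over PriorDA with the same MDE backbone. It shows superior robustness and efficiency across multiple benchmarks, establishing a new paradigm for universal depth completion.
Conclusion: The work presents a one-stage depth completion method that needs no explicit alignment. Its scale-prompting mechanism combines the scale information in sparse depth inputs with the geometric priors of a pretrained model, providing an efficient and robust solution for depth completion across domains and depth patterns.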
📄 Abstract
Accurate, dense depth estimation is crucial for robotic perception, but commodity sensors often yield sparse or incomplete measurements due to hardware limitations. Existing RGBD-fused depth completion methods learn priors jointly conditioned on training RGB distribution and specific depth patterns, limiting domain generalization and robustness to various depth patterns. Recent efforts leverage monocular depth estimation (MDE) models to introduce domain-general geometric priors, but current two-stage integration strategies relying on explicit relative-to-metric alignment incur additional computation and introduce structured distortions. To this end, we present Any2Full, a one-stage, domain-general, and pattern-agnostic framework that reformulates completion as a scale-prompting adaptation of a pretrained MDE model. To address varying depth sparsity levels and irregular spatial distributions, we design a Scale-Aware Prompt Encoder. It distills scale cues from sparse inputs into unified scale prompts, guiding the MDE model toward globally scale-consistent predictions while preserving its geometric priors. Extensive experiments demonstrate that Any2Full achieves superior robustness and efficiency. It outperforms OMNI-DC by 32.2% in average AbsREL and delivers a 1.4x speedup over PriorDA with the same MDE backbone, establishing a new paradigm for universal depth completion. Code and checkpoints are available at https://github.com/zhiyuandaily/Any2Full.
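For context on what "explicit relative-to-metric alignment" can look like, the sketch below fits a single global scale and shift that maps relative depth to metric depth at the sparse points, using ordinary least squares, and also computes AbsREL. This illustrates the kind of alignment step the abstract contrasts against, not Any2Full's own method; whether the cited two-stage baselines use exactly this form is an assumption.

```python
def fit_scale_shift(relative, metric):
    """Least-squares scale s and shift t such that s * relative + t ~ metric.

    relative: MDE predictions at pixels that have a sparse measurement.
    metric:   the sparse metric depths at those pixels.
    """
    n = len(relative)
    mean_r = sum(relative) / n
    mean_m = sum(metric) / n
    var_r = sum((r - mean_r) ** 2 for r in relative)
    if var_r == 0:
        raise ValueError("relative depths are constant; scale is undetermined")
    cov = sum((r - mean_r) * (m - mean_m) for r, m in zip(relative, metric))
    scale = cov / var_r
    shift = mean_m - scale * mean_r
    return scale, shift

def abs_rel(pred, gt):
    """Mean absolute relative error: mean(|pred - gt| / gt)."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)

# Toy values: the metric depths are exactly 5 * relative + 1.
relative = [0.1, 0.2, 0.4]
metric = [1.5, 2.0, 3.0]
scale, shift = fit_scale_shift(relative, metric)
aligned = [scale * r + shift for r in relative]
print(scale, shift, abs_rel(aligned, metric))
```

A single global scale and shift cannot correct errors that vary across the image, which is one intuition for why a separate alignment stage can leave structured distortions.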
[13] From Phase Grounding to Intelligent Surgical Narratives
Ethan Peterson, Huixin Zhan
🧩 TL;DR
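This paper proposes a CLIP-based multi-modal framework that automatically generates structured surgical timelines and narratives from surgical video. It aligns video frames with textual gesture descriptions to reduce the need for manual annotation by surgeons.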
📘 Detailed Summary
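Motivation: Current ways of creating surgical timelines have clear limitations: post-operation reports filled out by surgeons are often vague, while manually annotating surgical videos is highly time-consuming. The work sits between these two extremes and aims to generate a structured timeline and narrative automatically from the surgical video.
Method: The approach uses a CLIP-based multi-modal framework that aligns surgical video frames with textual gesture descriptions. The CLIP visual encoder extracts representations from video frames, and the text encoder embeds the corresponding gesture sentences into a shared embedding space. The model is then fine-tuned to improve the alignment between video gestures and textual tokens. Once trained, it predicts gestures and phases for video frames, from which a structured surgical timeline is built.
Result: The method aligns surgical video frames with textual gesture descriptions and automatically predicts the gesture and surgical phase of each frame. By building on pretrained multi-modal representations, it offers a workable route to generating structured surgical narratives automatically.
Conclusion: The study shows that pretrained multi-modal models can be applied to surgical video analysis. By bridging visual gestures and textual narratives, the approach reduces the need for manual video review and annotation by surgeons and provides a practical framework for automated surgical timeline creation, with the potential to improve the efficiency and accuracy of surgical documentation.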
📄 Abstract
Video surgery timelines are an important part of tool-assisted surgeries, as they allow surgeons to quickly focus on key parts of the procedure. Current methods involve the surgeon filling out a post-operation (OP) report, which is often vague, or manually annotating the surgical videos, which is highly time-consuming. Our proposed method sits between these two extremes: we aim to automatically create a surgical timeline and narrative directly from the surgical video. To achieve this, we employ a CLIP-based multi-modal framework that aligns surgical video frames with textual gesture descriptions. Specifically, we use the CLIP visual encoder to extract representations from surgical video frames and the text encoder to embed the corresponding gesture sentences into a shared embedding space. We then fine-tune the model to improve the alignment between video gestures and textual tokens. Once trained, the model predicts gestures and phases for video frames, enabling the construction of a structured surgical timeline. This approach leverages pretrained multi-modal representations to bridge visual gestures and textual narratives, reducing the need for manual video review and annotation by surgeons.
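The inference path described above (embed frames and gesture sentences in a shared space, pick the best-matching sentence per frame, then merge runs of identical predictions into timeline segments) can be sketched with toy vectors. The two-dimensional embeddings and the gesture sentences are hypothetical stand-ins for CLIP encoder outputs; no CLIP model is loaded here.

```python
import math
from itertools import groupby

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def predict_gestures(frame_embs, text_embs):
    """For each frame, return the gesture sentence with the highest similarity.

    text_embs maps gesture sentence -> embedding in the shared space.
    """
    return [max(text_embs, key=lambda g: cosine(f, text_embs[g])) for f in frame_embs]

def to_timeline(labels):
    """Merge consecutive identical labels into (first_frame, last_frame, label)."""
    timeline, start = [], 0
    for label, run in groupby(labels):
        length = len(list(run))
        timeline.append((start, start + length - 1, label))
        start += length
    return timeline

# Toy embeddings (assumed values, not real encoder outputs).
text_embs = {"reaching for needle": [1.0, 0.0], "pulling suture": [0.0, 1.0]}
frame_embs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
labels = predict_gestures(frame_embs, text_embs)
print(to_timeline(labels))
```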
[14] Layer-wise Instance Binding for Regional and Occlusion Control in Text-to-Image Diffusion Transformers
Ruidong Chen, Yancheng Bai, Xuanpu Zhang, Jianhao Zeng, Lanjun Wang, Dan Song, Lei Sun, Xiangxiang Chu, Anan Liu
🧩 TL;DR
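This paper proposes LayerBind, a training-free, plug-and-play method for regional layout and occlusion control. By modeling regional generation as distinct layers and binding them during generation, it gives precise control over regional layout and occlusion order in text-to-image generation.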
📘 Detailed Summary
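Motivation: Existing region-instructed layout control methods have two main limitations: training-based approaches inherit data bias and degrade image quality, and current techniques struggle with occlusion order, which limits practical use. The work aims to overcome these limitations and achieve precise regional layout and occlusion control.
Method: LayerBind works in two phases. First, leveraging the contextual sharing mechanism in multimodal joint attention, Layer-wise Instance Initialization creates per-instance branches that attend to their own regions; at a designated early step these branches are fused according to the layer order into a unified latent with a pre-established layout. Second, Layer-wise Semantic Nursing runs a sequential layered attention path alongside the standard global path and composites the updates under a layer-transparency scheduler, reinforcing regional details and maintaining the occlusion order.
Result: Qualitative and quantitative results demonstrate the effectiveness of LayerBind in regional layout control and occlusion order management, and show strong potential for creative applications. The method is training-free and plug-and-play, and serves as a regional and occlusion controller across Diffusion Transformers.
Conclusion: LayerBind controls the final output by rearranging the early latent structure, based on the insight that spatial layout and occlusion are established at a very early denoising stage. Beyond generation, it natively supports editable workflows such as changing instances or rearranging visible orders, offering a practical regional control solution for text-to-image generation.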
📄 Abstract
Region-instructed layout control in text-to-image generation is highly practical, yet existing methods suffer from limitations: (i) training-based approaches inherit data bias and often degrade image quality, and (ii) current techniques struggle with occlusion order, limiting real-world usability. To address these issues, we propose LayerBind. By modeling regional generation as distinct layers and binding them during the generation, our method enables precise regional and occlusion controllability. Our motivation stems from the observation that spatial layout and occlusion are established at a very early denoising stage, suggesting that rearranging the early latent structure is sufficient to modify the final output. Building on this, we structure the scheme into two phases: instance initialization and subsequent semantic nursing. (1) First, leveraging the contextual sharing mechanism in multimodal joint attention, Layer-wise Instance Initialization creates per-instance branches that attend to their own regions while anchoring to the shared background. At a designated early step, these branches are fused according to the layer order to form a unified latent with a pre-established layout. (2) Then, Layer-wise Semantic Nursing reinforces regional details and maintains the occlusion order via a layer-wise attention enhancement. Specifically, a sequential layered attention path operates alongside the standard global path, with updates composited under a layer-transparency scheduler. LayerBind is training-free and plug-and-play, serving as a regional and occlusion controller across Diffusion Transformers. Beyond generation, it natively supports editable workflows, allowing for flexible modifications like changing instances or rearranging visible orders. Both qualitative and quantitative results demonstrate LayerBind's effectiveness, highlighting its strong potential for creative applications.
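The phrase "fused according to the layer order" can be pictured as back-to-front compositing: each instance layer writes its latent into its own region, and later layers overwrite earlier ones where regions overlap. The sketch below is a simplified illustration under assumptions (hard binary masks, scalar values standing in for latent vectors, a hypothetical `fuse_layers` helper); the abstract does not spell out the actual fusion rule.

```python
def fuse_layers(background, layers):
    """Composite per-instance latents over a background, back to front.

    background: 2-D grid of values (scalars stand in for latent vectors).
    layers:     list of (mask, latent) pairs ordered from back to front;
                a later layer occludes earlier ones where masks overlap.
    """
    fused = [row[:] for row in background]
    for mask, latent in layers:
        for y, row in enumerate(mask):
            for x, inside in enumerate(row):
                if inside:
                    fused[y][x] = latent[y][x]
    return fused

background = [[0, 0, 0]]
layer_a = ([[1, 1, 0]], [[1, 1, 1]])  # instance A covers the two left cells
layer_b = ([[0, 1, 1]], [[2, 2, 2]])  # instance B covers the two right cells

print(fuse_layers(background, [layer_a, layer_b]))  # B in front of A
print(fuse_layers(background, [layer_b, layer_a]))  # A in front of B
```

Swapping the order of the two layers changes only the overlapping cell, which is the behavior an occlusion-order control needs.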
[15] Visual Words Meet BM25: Sparse Auto-Encoder Visual Word Scoring for Image Retrieval
Donghoon Han, Eunhwan Park, Seunghyeon Seo
🧩 TL;DR
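This paper proposes BM25-V, a sparse image retrieval method built on Sparse Auto-Encoder activations over Vision Transformer patch features. It applies Okapi BM25 scoring to visual-word activations, giving efficient and interpretable retrieval, and serves as a high-recall first-stage retriever for dense reranking.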
📘 Detailed Summary
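Motivation: Dense image retrieval is accurate but offers limited interpretability and attribution, and it can be compute-intensive at scale. The work aims for a retrieval method that is both efficient and interpretable, provides high-recall candidate sets at scale, and supports attributable retrieval decisions.
Method: BM25-V applies Okapi BM25 scoring to the sparse visual-word activations produced by a Sparse Auto-Encoder (SAE) on Vision Transformer patch features. Visual-word document frequencies are highly imbalanced and follow a Zipfian-like distribution, so inverse document frequency (IDF) weighting suppresses ubiquitous, low-information words and emphasizes rare, discriminative ones. Candidates are retrieved efficiently via sparse inverted-index operations, and BM25-V acts as the first-stage retriever for dense reranking.
Result: Across seven benchmarks, BM25-V achieves Recall@200 ≥ 0.993, so a two-stage pipeline needs to rerank only K=200 candidates per query and recovers near-dense accuracy within 0.2% on average. An SAE trained once on ImageNet-1K transfers zero-shot to seven fine-grained benchmarks without fine-tuning, and retrieval decisions are attributable to specific visual words with quantified IDF contributions.
Conclusion: BM25-V shows that sparse visual-word representations combined with BM25 scoring are effective for image retrieval, yielding a method that is efficient, interpretable, and attributable. Its two-stage architecture lowers compute cost while keeping accuracy close to dense retrieval, offering a practical option for zero-shot cross-domain retrieval and interpretable image retrieval systems.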
📄 Abstract
Dense image retrieval is accurate but offers limited interpretability and attribution, and it can be compute-intensive at scale. We present BM25-V, which applies Okapi BM25 scoring to sparse visual-word activations from a Sparse Auto-Encoder (SAE) on Vision Transformer patch features. Across a large gallery, visual-word document frequencies are highly imbalanced and follow a Zipfian-like distribution, making BM25's inverse document frequency (IDF) weighting well suited for suppressing ubiquitous, low-information words and emphasizing rare, discriminative ones. BM25-V retrieves high-recall candidates via sparse inverted-index operations and serves as an efficient first-stage retriever for dense reranking. Across seven benchmarks, BM25-V achieves Recall@200 ≥ 0.993, enabling a two-stage pipeline that reranks only K=200 candidates per query and recovers near-dense accuracy within 0.2% on average. An SAE trained once on ImageNet-1K transfers zero-shot to seven fine-grained benchmarks without fine-tuning, and BM25-V retrieval decisions are attributable to specific visual words with quantified IDF contributions.
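A minimal sketch of BM25 scoring over an inverted index of visual words follows. It uses the standard Okapi BM25 formula with a non-negative IDF variant and common default parameters (`k1=1.2`, `b=0.75`). Treating each image as a bag of active visual-word IDs, with term frequency equal to the number of occurrences, is an assumption made for illustration; the paper's exact definition of term frequency from SAE activations is not given in the abstract.

```python
import math
from collections import Counter, defaultdict

def build_index(docs):
    """docs: one list of visual-word IDs per gallery image."""
    postings = defaultdict(dict)  # word -> {image_id: term frequency}
    for doc_id, words in enumerate(docs):
        for w, tf in Counter(words).items():
            postings[w][doc_id] = tf
    lengths = [len(d) for d in docs]
    return postings, lengths

def bm25_search(query_words, postings, lengths, k1=1.2, b=0.75):
    """Return (image_id, score) pairs sorted by descending BM25 score."""
    n = len(lengths)
    avgdl = sum(lengths) / n
    scores = defaultdict(float)
    for w in set(query_words):
        plist = postings.get(w, {})
        if not plist:
            continue
        # Rare words get a large IDF, ubiquitous words a small one.
        idf = math.log(1 + (n - len(plist) + 0.5) / (len(plist) + 0.5))
        for doc_id, tf in plist.items():
            norm = tf + k1 * (1 - b + b * lengths[doc_id] / avgdl)
            scores[doc_id] += idf * tf * (k1 + 1) / norm
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy gallery: word 1 is ubiquitous, word 7 is rare.
docs = [[1, 1, 7], [1, 2, 2], [1, 3, 3]]
postings, lengths = build_index(docs)
results = bm25_search([1, 7], postings, lengths)
print(results)
```

Only images that share at least one visual word with the query are touched, which is what makes the inverted-index lookup cheap, and the per-word `idf * tf` terms are the quantities an attribution would report.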
[16] Spectral Probing of Feature Upsamplers in 2D-to-3D Scene Reconstruction
Ling Xiao, Yuliang Xiu, Yue Chen, Guoming Wang, Toshihiko Yamasaki
🧩 TL;DR
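This paper proposes a spectral diagnostic framework that uses six complementary metrics to assess the 3D awareness of feature upsamplers in 2D-to-3D reconstruction pipelines. It finds that reconstruction quality depends more on preserving spectral structure than on enhancing spatial detail.
📘 Detailed Summary
Motivation: Feature upsamplers in current 2D-to-3D reconstruction pipelines mainly aim to enhance spatial detail, such as sharper geometry or richer textures, while their effect on 3D awareness remains underexplored.
Method: The paper introduces a spectral diagnostic framework with six complementary metrics that characterize amplitude redistribution, structural spectral alignment, and directional stability, and systematically evaluates classical interpolation and learnable upsampling methods on CLIP and DINO backbones.
Result: Three key findings emerge. Structural spectral consistency is the strongest predictor of novel view synthesis quality, whereas high-frequency spectral slope drift often correlates negatively with reconstruction performance. Geometry and texture respond to different spectral properties. Learnable upsamplers produce sharper spatial features but rarely outperform classical interpolation in reconstruction quality.
Conclusion: Reconstruction quality is tied more closely to preserving spectral structure than to enhancing spatial detail, which makes spectral consistency an important principle for designing upsampling strategies in 2D-to-3D pipelines.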
📄 Abstract
A typical 2D-to-3D pipeline takes multi-view images as input, where a Vision Foundation Model (VFM) extracts features that are spatially upsampled to dense representations for 3D reconstruction. If dense features across views preserve geometric consistency, differentiable rendering can recover an accurate 3D representation, making the feature upsampler a critical component. Recent learnable upsampling methods mainly aim to enhance spatial details, such as sharper geometry or richer textures, yet their impact on 3D awareness remains underexplored. To address this gap, we introduce a spectral diagnostic framework with six complementary metrics that characterize amplitude redistribution, structural spectral alignment, and directional stability. Across classical interpolation and learnable upsampling methods on CLIP and DINO backbones, we observe three key findings. First, structural spectral consistency (SSC/CSC) is the strongest predictor of NVS quality, whereas High-Frequency Spectral Slope Drift (HFSS) often correlates negatively with reconstruction performance, indicating that emphasizing high-frequency details alone does not necessarily improve 3D reconstruction. Second, geometry and texture respond to different spectral properties: Angular Energy Consistency (ADC) correlates more strongly with geometry-related metrics, while SSC/CSC influence texture fidelity slightly more than geometric accuracy. Third, although learnable upsamplers often produce sharper spatial features, they rarely outperform classical interpolation in reconstruction quality, and their effectiveness depends on the reconstruction model. Overall, our results indicate that reconstruction quality is more closely related to preserving spectral structure than to enhancing spatial detail, highlighting spectral consistency as an important principle for designing upsampling strategies in 2D-to-3D pipelines.
[17] Training-free Latent Inter-Frame Pruning with Attention Recovery
Dennis Menn, Yuedong Yang, Bokun Wang, Xiwen Wei, Mustafa Munir, Feng Liang, Radu Marculescu, Chenfeng Xu, Diana Marculescu
🧩 TL;DR
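This paper proposes the LIPAR framework, which reduces the computational latency of video generation models by detecting and skipping duplicated latent patches, and adds an Attention Recovery mechanism to avoid visual artifacts. It reports a 1.45x throughput gain without compromising generation quality.
📘 Detailed Summary
Motivation: Current video generation models suffer from high computational latency, which makes real-time applications prohibitively costly. A main cause is that the temporal redundancy in video latent patches goes unexploited.
Method: The paper proposes Latent Inter-frame Pruning with Attention Recovery. It detects duplicated latent patches and skips recomputing them, and introduces an Attention Recovery mechanism that approximates the attention values of pruned tokens, removing the visual artifacts that naive pruning produces.
Result: The method increases video editing throughput by 1.45x on average, reaching 12.2 FPS on an NVIDIA A6000 versus 8.4 FPS for the baseline. It does not compromise generation quality and integrates with the model without additional training.
Conclusion: By exploiting temporal redundancy, the method bridges the gap between traditional compression algorithms and modern generative pipelines and offers a practical route to real-time video applications.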
📄 Abstract
Current video generation models suffer from high computational latency, making real-time applications prohibitively costly. In this paper, we address this limitation by exploiting the temporal redundancy inherent in video latent patches. To this end, we propose the Latent Inter-frame Pruning with Attention Recovery (LIPAR) framework, which detects and skips recomputing duplicated latent patches. Additionally, we introduce a novel Attention Recovery mechanism that approximates the attention values of pruned tokens, thereby removing visual artifacts arising from naively applying the pruning method. Empirically, our method increases video editing throughput by $1.45\times$, on average achieving 12.2 FPS on an NVIDIA A6000 compared to the baseline 8.4 FPS. The proposed method does not compromise generation quality and can be seamlessly integrated with the model without additional training. Our approach effectively bridges the gap between traditional compression algorithms and modern generative pipelines.
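As a rough illustration of the "detect and skip duplicated latent patches" idea, the sketch below flags patches whose latent values are nearly unchanged from the previous frame. The abstract does not state the detection criterion, so the element-wise tolerance test, the function name `find_reusable_patches`, and the threshold `tol` are all assumptions; the Attention Recovery step is not modelled. The second helper simply checks the reported throughput arithmetic.

```python
def find_reusable_patches(prev_frame, curr_frame, tol=1e-3):
    """Return indices of patches that barely changed since the previous frame.

    Each frame is a list of patches; each patch is a list of latent values.
    Assumption: a patch counts as duplicated when every element differs by
    at most `tol` from the same position in the previous frame.
    """
    reusable = []
    for idx, (prev_patch, curr_patch) in enumerate(zip(prev_frame, curr_frame)):
        max_diff = max(abs(p - c) for p, c in zip(prev_patch, curr_patch))
        if max_diff <= tol:
            reusable.append(idx)
    return reusable


def speedup(method_fps, baseline_fps):
    """Throughput ratio between the pruned pipeline and the baseline."""
    return method_fps / baseline_fps


prev = [[0.0, 1.0], [2.0, 3.0]]
curr = [[0.0, 1.0], [2.5, 3.0]]
skipped = find_reusable_patches(prev, curr)  # patch 0 is static, patch 1 moved
ratio = speedup(12.2, 8.4)                   # FPS figures from the abstract
```

The ratio of the two reported frame rates is about 1.45, consistent with the stated throughput gain.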
[18] Margin and Consistency Supervision for Calibrated and Robust Vision Models
Salim Khazem
🧩 TL;DR
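This paper proposes Margin and Consistency Supervision (MaCS), a simple regularization framework that jointly enforces logit-space separation and local prediction stability. It improves the calibration and distribution-shift robustness of deep vision classifiers while preserving or improving accuracy.
📘 Detailed Summary
Motivation: Deep vision classifiers often reach high accuracy yet remain poorly calibrated and fragile under small distribution shifts, which limits their reliability in real-world use. Existing methods fall short of improving calibration, robustness, and accuracy together, motivating a general solution that needs no extra data or architectural changes.
Method: MaCS is an architecture-agnostic regularizer that augments cross-entropy with two components: a hinge-squared margin penalty that enforces a target logit gap between the correct class and the strongest competitor, and a KL-divergence regularizer that minimizes the prediction difference between clean inputs and mildly perturbed views. Together they optimize classification margin and local stability.
Result: Across several image classification benchmarks and backbones spanning CNNs and Vision Transformers, MaCS consistently improves calibration (lower ECE and NLL) and robustness to common corruptions while preserving or improving top-1 accuracy. It requires no additional data or architectural changes and adds negligible inference overhead.
Conclusion: MaCS is a practical method with theoretical backing: increasing classification margin while reducing local sensitivity yields improved generalization guarantees and a robustness radius bound. It can serve as a drop-in replacement for standard training objectives.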
📄 Abstract
Deep vision classifiers often achieve high accuracy while remaining poorly calibrated and fragile under small distribution shifts. We present Margin and Consistency Supervision (MaCS), a simple, architecture-agnostic regularization framework that jointly enforces logit-space separation and local prediction stability. MaCS augments cross-entropy with (i) a hinge-squared margin penalty that enforces a target logit gap between the correct class and the strongest competitor, and (ii) a consistency regularizer that minimizes the KL divergence between predictions on clean inputs and mildly perturbed views. We provide a unifying theoretical analysis showing that increasing classification margin while reducing local sensitivity (formalized via a Lipschitz-type stability proxy) yields improved generalization guarantees and a provable robustness radius bound that scales with the margin-to-sensitivity ratio. Across several image classification benchmarks and several backbones spanning CNNs and Vision Transformers, MaCS consistently improves calibration (lower ECE and NLL) and robustness to common corruptions while preserving or improving top-1 accuracy. Our approach requires no additional data, no architectural changes, and negligible inference overhead, making it an effective drop-in replacement for standard training objectives.
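The two regularizers named in the abstract can be written down for a single example as follows. This is a minimal sketch under stated assumptions, not the paper's code: the function name `macs_loss`, the weights `lam_margin` and `lam_cons`, the default `margin`, and the direction of the KL term (clean distribution relative to perturbed) are all illustrative choices.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def macs_loss(clean_logits, perturbed_logits, label,
              margin=1.0, lam_margin=0.5, lam_cons=0.5):
    """Cross-entropy + hinge-squared margin penalty + KL consistency term."""
    p_clean = softmax(clean_logits)
    p_pert = softmax(perturbed_logits)

    # Standard cross-entropy on the clean input.
    ce = -math.log(p_clean[label])

    # Hinge-squared penalty on the gap to the strongest competing class.
    strongest = max(z for i, z in enumerate(clean_logits) if i != label)
    gap = clean_logits[label] - strongest
    margin_penalty = max(0.0, margin - gap) ** 2

    # KL(clean || perturbed); the direction is an assumption of this sketch.
    kl = sum(p * math.log(p / q) for p, q in zip(p_clean, p_pert) if p > 0.0)

    return ce + lam_margin * margin_penalty + lam_cons * kl


# Large gap and identical views: both regularizers vanish, only CE remains.
confident = macs_loss([5.0, 0.0, 0.0], [5.0, 0.0, 0.0], label=0)
# Small gap (0.5 < margin): the margin penalty adds (1.0 - 0.5) ** 2 * 0.5.
narrow = macs_loss([1.0, 0.5], [1.0, 0.5], label=0)
narrow_ce_only = macs_loss([1.0, 0.5], [1.0, 0.5], label=0, lam_margin=0.0)
```

When the logit gap already exceeds the target margin and the perturbed view agrees with the clean one, the loss reduces to plain cross-entropy, which is what makes the objective a drop-in replacement at training time.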
[19] Cog2Gen3D: Sculpturing 3D Semantic-Geometric Cognition for 3D Generation
Haonan Wang, Hanyu Zhou, Haoyue Liu, Tao Gu, Luxin Yan
🧩 TL;DR
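This paper proposes Cog2Gen3D, a 3D cognition-guided diffusion framework that fuses semantic information with absolute geometry for physically plausible 3D generation. It builds dual-stream semantic-geometric graphs and fuses them with cross-attention, improving semantic fidelity and geometric plausibility.
📘 Detailed Summary
Motivation: Existing 3D generation methods mainly condition on geometric features, which model only relative relationships and are prone to scale inconsistency in absolute geometry. Without semantic information and absolute geometry, controllable 3D generation lacks physical and structural plausibility.
Method: Cog2Gen3D has three core designs. Cognitive Feature Embeddings encode multiple modalities into semantic, geometric, and logical representations. The 3D Latent Cognition Graph structures these into dual-stream semantic-geometric graphs and fuses them via common-based cross-attention. Cognition-Guided Latent Diffusion uses the fused 3D cognition graph as the condition that guides latent diffusion for 3D Gaussian generation.
Result: In extensive experiments on a validation subset built from Marble World Labs, Cog2Gen3D significantly outperforms existing methods in semantic fidelity and geometric plausibility. The unified 3D cognition graph supports the physical plausibility and structural rationality of the generated results.
Conclusion: Fusing semantic information with absolute geometry is central to 3D cognition and enables controllable 3D generation for the physical world. The cognition-guided diffusion framework addresses the scale inconsistency of existing methods through structured representation fusion.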
📄 Abstract
Generative models have achieved success in producing semantically plausible 2D images, but 3D generation remains challenging due to the absence of spatial geometry constraints. Typically, existing methods utilize geometric features as conditions to enhance spatial awareness. However, these methods can only model relative relationships and are prone to scale inconsistency of absolute geometry. Thus, we argue that semantic information and absolute geometry empower 3D cognition, thereby enabling controllable 3D generation for the physical world. In this work, we propose Cog2Gen3D, a 3D cognition-guided diffusion framework for 3D generation. Our model is guided by three key designs: 1) Cognitive Feature Embeddings. We encode different modalities into semantic and geometric representations and further extract logical representations. 2) 3D Latent Cognition Graph. We structure different representations into dual-stream semantic-geometric graphs and fuse them via common-based cross-attention to obtain a 3D cognition graph. 3) Cognition-Guided Latent Diffusion. We leverage the fused 3D cognition graph as the condition to guide the latent diffusion process for 3D Gaussian generation. Under this unified framework, the 3D cognition graph ensures the physical plausibility and structural rationality of 3D generation. Moreover, we construct a validation subset based on Marble World Labs. Extensive experiments demonstrate that Cog2Gen3D significantly outperforms existing methods in both semantic fidelity and geometric plausibility.
[20] CORE-Seg: Reasoning-Driven Segmentation for Complex Lesions via Reinforcement Learning
Yuxin Xie, Yuming Chen, Yishan Yang, Yi Zhou, Tao Zhou, Zhen Zhao, Jiacheng Liu, Huazhu Fu
🧩 TL;DR
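This paper proposes CORE-Seg, an end-to-end framework that integrates reasoning with segmentation for complex lesions. Together with ComLesion-14K, the first diverse Chain-of-Thought benchmark for this task, and a Semantic-Guided Prompt Adapter, it moves medical image segmentation from visual pattern matching toward cognitive reasoning analysis.
📘 Detailed Summary
Motivation: Medical image segmentation is shifting from conventional visual pattern matching to cognitive reasoning analysis, but significant gaps remain. General multimodal large language models have broad common sense yet lack the specialized visual reasoning that complex lesions require, while traditional segmentation models excel at pixel-level segmentation but lack logical interpretability. This motivates a dedicated framework that integrates reasoning with segmentation.
Method: CORE-Seg is an end-to-end framework that couples reasoning and segmentation through a Semantic-Guided Prompt Adapter. It uses a progressive training strategy from supervised fine-tuning (SFT) to GRPO, with an adaptive dual-granularity reward mechanism to mitigate reward sparsity. The authors also build ComLesion-14K, the first diverse Chain-of-Thought benchmark for reasoning-driven complex lesion segmentation.
Result: The method achieves state-of-the-art results on complex lesion segmentation, with a mean Dice of 37.06%, 14.89% higher than the second-best baseline, while reducing the failure rate to 18.42%.
Conclusion: The work demonstrates the value of integrating reasoning with segmentation. Its Chain-of-Thought benchmark and adaptive training strategy address the specialized visual reasoning that complex lesions demand and point toward more interpretable medical AI systems.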
📄 Abstract
Medical image segmentation is undergoing a paradigm shift from conventional visual pattern matching to cognitive reasoning analysis. Although Multimodal Large Language Models (MLLMs) have shown promise in integrating linguistic and visual knowledge, significant gaps remain: existing general MLLMs possess broad common sense but lack the specialized visual reasoning required for complex lesions, whereas traditional segmentation models excel at pixel-level segmentation but lack logical interpretability. In this paper, we introduce ComLesion-14K, the first diverse Chain-of-Thought (CoT) benchmark for reasoning-driven complex lesion segmentation. To accomplish this task, we propose CORE-Seg, an end-to-end framework integrating reasoning with segmentation through a Semantic-Guided Prompt Adapter. We design a progressive training strategy from SFT to GRPO, equipped with an adaptive dual-granularity reward mechanism to mitigate reward sparsity. Our method achieves state-of-the-art results with a mean Dice of 37.06% (14.89% higher than the second-best baseline), while reducing the failure rate to 18.42%. Project Page: https://xyxl024.github.io/CORE-Seg.github.io/
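For readers unfamiliar with the headline metric, the Dice coefficient between a predicted and a reference binary mask can be computed as below. This is the standard definition, shown as a minimal sketch on flattened masks; the function name `dice` and the smoothing constant `eps` are illustrative, and the abstract does not specify how the reported mean Dice is averaged.

```python
def dice(pred, target, eps=1e-8):
    """Dice coefficient for two flattened binary masks (lists of 0/1).

    Dice = 2 * |pred AND target| / (|pred| + |target|); `eps` avoids a
    division by zero when both masks are empty.
    """
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return (2.0 * intersection + eps) / (total + eps)


# Prediction covers two pixels, reference covers one, and they overlap on one.
score = dice([1, 1, 0, 0], [1, 0, 0, 0])
# Two empty masks agree perfectly under this smoothed definition.
empty_score = dice([0, 0], [0, 0])
```

A mean Dice of 37.06% is therefore an average overlap score, which indicates how hard the complex-lesion setting is even for the best-performing method.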
[21] VS3R: Robust Full-frame Video Stabilization via Deep 3D Reconstruction
Muhua Zhu, Xinhao Jin, Yu Zhang, Yifei Xue, Tie Ji, Yizhen Lao
🧩 TL;DR
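This paper proposes VS3R, a framework that combines feed-forward 3D reconstruction with generative video diffusion to resolve the fundamental trade-off between geometric robustness and full-frame consistency in video stabilization, delivering high-fidelity full-frame stabilization across diverse camera models.
📘 Detailed Summary
Motivation: Video stabilization faces a fundamental trade-off between geometric robustness and full-frame consistency. 2D methods suffer from aggressive cropping, and 3D techniques rely on fragile optimization pipelines that fail under extreme motion. Existing methods cannot guarantee both reliability and visual quality in all scenarios.
Method: VS3R combines feed-forward 3D reconstruction with generative video diffusion. It jointly estimates camera parameters, depth, and masks for all-scenario reliability, introduces a Hybrid Stabilized Rendering module that fuses semantic and geometric cues for dynamic consistency, and uses a Dual-Stream Video Diffusion Model that restores disoccluded regions and rectifies artifacts by combining structural guidance with semantic anchors.
Result: VS3R achieves high-fidelity, full-frame stabilization across diverse camera models and significantly outperforms state-of-the-art methods in robustness and visual quality. It handles extreme motion while preserving the completeness and naturalness of the video content.
Conclusion: The work shows that integrating 3D reconstruction with generative diffusion is effective for video stabilization, balancing geometric accuracy with visual consistency. Combining semantic-geometric fusion with generative restoration overcomes limitations of traditional methods.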
📄 Abstract
Video stabilization aims to mitigate camera shake but faces a fundamental trade-off between geometric robustness and full-frame consistency. While 2D methods suffer from aggressive cropping, 3D techniques are often undermined by fragile optimization pipelines that fail under extreme motions. To bridge this gap, we propose VS3R, a framework that synergizes feed-forward 3D reconstruction with generative video diffusion. Our pipeline jointly estimates camera parameters, depth, and masks to ensure all-scenario reliability, and introduces a Hybrid Stabilized Rendering module that fuses semantic and geometric cues for dynamic consistency. Finally, a Dual-Stream Video Diffusion Model restores disoccluded regions and rectifies artifacts by synergizing structural guidance with semantic anchors. Collectively, VS3R achieves high-fidelity, full-frame stabilization across diverse camera models and significantly outperforms state-of-the-art methods in robustness and visual quality.
[22] BlackMirror: Black-Box Backdoor Detection for Text-to-Image Models via Instruction-Response Deviation
Feiran Li, Qianqian Xu, Shilong Bao, Zhiyong Yang, Xilin Zhao, Xiaochun Cao, Qingming Huang
🧩 TL;DR
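This paper proposes BlackMirror, a framework for detecting backdoored text-to-image models under black-box settings. It rests on a key observation: in backdoor attacks only part of the semantic pattern is stably manipulated while the rest of the content stays diverse or benign, which enables accurate detection of visually diverse backdoored generations.
📘 Detailed Summary
Motivation: Existing backdoor detection methods typically rely on image-level similarity analysis and assume that backdoor-triggered generations are strongly consistent. They struggle to generalize to emerging attacks whose backdoored generations can be visually diverse. This paper addresses detection of such models under black-box settings.
Method: BlackMirror has two components. MirrorMatch aligns visual patterns with the corresponding instructions to detect semantic deviations. MirrorVerify evaluates the stability of these deviations across varied prompts to distinguish true backdoor behavior from benign responses. The framework is general, training-free, and plug-and-play, and can be deployed in Model-as-a-Service applications.
Result: Comprehensive experiments show that BlackMirror achieves accurate detection across a wide range of attacks, supporting the observation about stably manipulated partial semantic patterns.
Conclusion: The work identifies stable manipulation of partial semantic patterns as a key property of backdoor attacks and offers a new angle on black-box backdoor detection. BlackMirror provides a practical, training-free option for security checks in Model-as-a-Service applications, and its code is released.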
📄 Abstract
This paper investigates the challenging task of detecting backdoored text-to-image models under black-box settings and introduces a novel detection framework BlackMirror. Existing approaches typically rely on analyzing image-level similarity, under the assumption that backdoor-triggered generations exhibit strong consistency across samples. However, they struggle to generalize to recently emerging backdoor attacks, where backdoored generations can appear visually diverse. BlackMirror is motivated by an observation: across backdoor attacks, only partial semantic patterns within the generated image are steadily manipulated, while the rest of the content remains diverse or benign. Accordingly, BlackMirror consists of two components: MirrorMatch, which aligns visual patterns with the corresponding instructions to detect semantic deviations; and MirrorVerify, which evaluates the stability of these deviations across varied prompts to distinguish true backdoor behavior from benign responses. BlackMirror is a general, training-free framework that can be deployed as a plug-and-play module in Model-as-a-Service (MaaS) applications. Comprehensive experiments demonstrate that BlackMirror achieves accurate detection across a wide range of attacks. Code is available at https://github.com/Ferry-Li/BlackMirror.
[23] TumorChain: Interleaved Multimodal Chain-of-Thought Reasoning for Traceable Clinical Tumor Analysis
Sijing Li, Zhongwei Qiu, Jiang Liu, Wenqiao Zhang, Tianwei Lin, Yihan Xie, Jianxiang An, Boxiang Yun, Chenglin Yang, Jun Xiao, Guangyu Guo, Jiawen Yao, Wei Liu, Yuan Gao, Ke Yan, Weiwei Cao, Zhilin Zheng, Tony C. W. Mok, Kai Cao, Yu Shi, Jiuyu Zhang, Jian Zhou, Beng Chin Ooi, Yingda Xia, Ling Zhang
🧩 TL;DR
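This work proposes TumorChain, a multimodal interleaved reasoning framework for clinical tumor analysis. Together with the large-scale TumorCoT dataset, it supports traceable reasoning from imaging findings to pathology predictions and improves accuracy over strong baselines.
📘 Detailed Summary
Motivation: Clinical tumor analysis requires step-by-step reasoning from imaging findings to clinical impressions to pathology conclusions, but existing methods lack traceable reasoning chains and are prone to diagnostic errors and hallucination. The work targets the interpretability and consistency of multimodal reasoning in tumor analysis.
Method: TumorChain couples 3D CT imaging encoders, clinical text understanding, and organ-level vision-language alignment. Through cross-modal alignment and iterative interleaved causal reasoning, it grounds visual evidence in textual reasoning and issues pathology predictions after multiple rounds of self-refinement. The authors also curate TumorCoT, a dataset of 1.5M CoT-labeled VQA instructions that supports step-aligned evaluation along the trajectory from findings to impression to pathology.
Result: TumorChain consistently outperforms strong baselines in lesion detection, impression generation, and pathology classification, and generalizes well on the DeepTumorVQA benchmark. Multi-round self-refinement reduces hallucination risk while keeping the reasoning traceable.
Conclusion: The work highlights the potential of multimodal reasoning for reliable and interpretable tumor analysis in clinical practice. The large-scale benchmark and the interleaved reasoning mechanism provide a basis for future medical AI research.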
📄 Abstract
Accurate tumor analysis is central to clinical radiology and precision oncology, where early detection, reliable lesion characterization, and pathology-level risk assessment guide diagnosis and treatment planning. Chain-of-Thought (CoT) reasoning is particularly important in this setting because it enables step-by-step interpretation from imaging findings to clinical impressions and pathology conclusions, improving traceability and reducing diagnostic errors. Here, we target the clinical tumor analysis task and build a large-scale benchmark that operationalizes a multimodal reasoning pipeline, spanning findings, impressions, and pathology predictions. We curate TumorCoT, a large-scale dataset of 1.5M CoT-labeled VQA instructions paired with 3D CT scans, with step-aligned rationales and cross-modal alignments along the trajectory from findings to impression to pathology, enabling evaluation of both answer accuracy and reasoning consistency. We further propose TumorChain, a multimodal interleaved reasoning framework that tightly couples 3D imaging encoders, clinical text understanding, and organ-level vision-language alignment. Through cross-modal alignment and iterative interleaved causal reasoning, TumorChain grounds visual evidence, aggregates conclusions, and issues pathology predictions after multiple rounds of self-refinement, improving traceability and reducing hallucination risk. Experiments show consistent improvements over strong baselines in lesion detection, impression generation, and pathology classification, and demonstrate strong generalization on the DeepTumorVQA benchmark. These results highlight the potential of multimodal reasoning for reliable and interpretable tumor analysis in clinical practice. Detailed information about our project can be found on our project homepage at https://github.com/ZJU4HealthCare/TumorChain.
[24] Energy-Driven Adaptive Visual Token Pruning for Efficient Vision-Language Models
Jialuo He, Huangxun Chen
🧩 TL;DR
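This paper proposes E-AdaPrune, an energy-driven adaptive visual token pruning framework that sets the token budget from the singular value spectrum of the visual features, accelerating vision-language model inference while preserving performance.
📘 Detailed Summary
Motivation: Existing visual token pruning methods usually apply a fixed budget to all inputs, ignoring the substantial variation in image information density. This over-processes sparse images and under-processes dense ones, limiting achievable acceleration.
Method: E-AdaPrune analyzes the singular value spectrum of the visual feature space and sets the token budget by preserving a given proportion of spectral energy. It allocates more tokens to information-dense scenes and compresses redundant ones, introduces no additional learnable parameters, and uses randomized singular value decomposition to limit overhead.
Result: On nine benchmarks and three VLM backbones, under matched average token budgets, E-AdaPrune yields an average improvement of up to 0.6%, including a +5.1% relative boost on the MMVet reasoning task, with only 8 ms of additional latency per image.
Conclusion: Spectral-energy-based adaptive token pruning is effective and parameter-free, preserving or improving model performance while making vision-language model inference more efficient in deployment.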
📄 Abstract
Visual token reduction is critical for accelerating Vision-Language Models (VLMs), yet most existing approaches rely on a fixed budget shared across all inputs, overlooking the substantial variation in image information density. We propose E-AdaPrune, an energy-driven adaptive pruning framework that determines the token budget from the singular value spectrum of the visual feature space. By preserving a certain proportion of spectral energy, our method allocates more tokens to information-dense scenes while aggressively compressing redundant ones, without introducing additional learnable parameters. We evaluate E-AdaPrune on nine benchmarks and three VLM backbones, LLaVA-1.5-7B, LLaVA-1.5-13B, and LLaVA-NeXT-8B. Under matched average token budgets, E-AdaPrune consistently yields an average improvement of up to 0.6%, including a significant +5.1% relative boost on the MMVet reasoning task. Using randomized singular value decomposition, the additional latency is limited to 8 ms per image.
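The idea of deriving a per-image budget from the singular value spectrum can be sketched as follows. The abstract does not give the exact rule, so this example makes explicit assumptions: spectral energy is taken as the sum of squared singular values, the budget is the smallest number of leading components whose cumulative energy reaches a target ratio, and the names `token_budget`, `energy_ratio`, and `min_tokens` are hypothetical. The singular values are passed in directly; in practice they would come from a (randomized) SVD of the visual token matrix.

```python
def token_budget(singular_values, energy_ratio=0.95, min_tokens=1):
    """Smallest k whose leading singular values hold `energy_ratio` of the energy.

    Assumption: energy of a component is its squared singular value.
    """
    energies = sorted((s * s for s in singular_values), reverse=True)
    total = sum(energies)
    if total == 0.0:
        return min_tokens
    cumulative = 0.0
    for k, energy in enumerate(energies, start=1):
        cumulative += energy
        if cumulative >= energy_ratio * total:
            return max(k, min_tokens)
    return len(energies)


# A redundant image: one dominant direction carries almost all the energy.
sparse_budget = token_budget([10.0, 1.0, 1.0, 1.0])
# An information-dense image: energy is spread evenly, so more tokens are kept.
dense_budget = token_budget([1.0, 1.0, 1.0, 1.0])
```

The two toy spectra show the adaptive behaviour the abstract describes: a concentrated spectrum yields a small budget, while a flat spectrum keeps every component.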
[25] PatchCue: Enhancing Vision-Language Model Reasoning with Patch-Based Visual Cues
Yukun Qi, Pei Fu, Hang Li, Yuhan Liu, Chao Jiang, Bin Qin, Zhenbo Luo, Jian Luan
🧩 TL;DR
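This paper proposes PatchCue, a patch-based visual cue paradigm for strengthening the visual reasoning of vision-language models. By representing cues at the patch level, it aligns better with human perceptual habits and with the patch-tokenized input of modern VLMs than pixel-level or point-level cues.
📘 Detailed Summary
Motivation: Existing reasoning paradigms for vision-language models, such as Chain-of-Thought, rely mainly on textual information and underuse important visual cues. Prior work has incorporated pixel-level visual cues, but these require precise spatial localization, add learning complexity, and fit poorly with human perception and the patch-tokenized input of modern VLMs.
Method: PatchCue partitions images into patches and represents visual cues at the patch level. Training has two stages: cold-start supervised fine-tuning to output patch-level cues, followed by reinforcement learning with a process-supervised cue reward that guides intermediate visual reasoning steps.
Result: Extensive experiments on multiple VLMs and diverse benchmarks, covering general visual question answering, complex reasoning, and document understanding, show that PatchCue consistently improves overall performance. Patch-level cues outperform both pixel-level bounding boxes and point-based cues.
Conclusion: PatchCue offers a more effective and cognitively aligned visual reasoning paradigm, in which patch-level cues strengthen the visual reasoning of VLMs.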
📄 Abstract
Vision-Language Models (VLMs) have achieved remarkable progress on a wide range of challenging multimodal understanding and reasoning tasks. However, existing reasoning paradigms, such as the classical Chain-of-Thought (CoT), rely solely on textual information and often underutilize important visual cues. While prior work has incorporated pixel-level visual cues, these representations require precise spatial localization, introducing additional learning complexity. To address this, we propose PatchCue, a novel patch-based visual cue paradigm designed to significantly enhance the visual reasoning capabilities of VLMs. By partitioning images into patches and representing cues at the patch level, PatchCue aligns better with human perceptual habits and leverages the patch-tokenized input of modern VLMs. We train VLMs using a two-stage approach: cold-start supervised fine-tuning to output patch-level cues, followed by reinforcement learning with a process-supervised cue reward that guides intermediate visual reasoning steps. Extensive experiments on multiple VLMs and diverse benchmarks, including general visual question answering, complex reasoning, and document understanding, demonstrate that PatchCue consistently improves overall model performance. Our results show that patch-level cues outperform both pixel-level bounding boxes and point-based cues, providing a more effective and cognitively aligned visual reasoning paradigm.
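To make the contrast between pixel-level and patch-level cues concrete, the sketch below converts a pixel bounding box into the indices of the grid patches it touches. The abstract does not specify PatchCue's cue format, so everything here is an assumption for illustration: the function name `box_to_patches`, the row-major patch numbering, the square patch size, and the convention that the box is `(x0, y0, x1, y1)` with exclusive upper bounds.

```python
def box_to_patches(box, image_size, patch_size):
    """Map a pixel box to row-major indices of the patches it overlaps.

    box: (x0, y0, x1, y1) in pixels, upper bounds exclusive (assumption).
    image_size: (width, height) in pixels.
    """
    x0, y0, x1, y1 = box
    width, height = image_size
    cols = width // patch_size
    rows = height // patch_size
    # Clamp the covered patch range to the grid.
    c0 = max(0, int(x0 // patch_size))
    r0 = max(0, int(y0 // patch_size))
    c1 = min(cols - 1, int((x1 - 1) // patch_size))
    r1 = min(rows - 1, int((y1 - 1) // patch_size))
    return [r * cols + c for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)]


# A 56x56 image with 14-pixel patches forms a 4x4 grid (indices 0..15).
patches = box_to_patches((10, 10, 30, 20), (56, 56), 14)
```

A cue expressed this way is a short list of discrete indices rather than four real-valued coordinates, which is one plausible reading of why patch-level cues need less precise localization.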
[26] MM-ISTS: Cooperating Irregularly Sampled Time Series Forecasting with Multimodal Vision-Text LLMs
Zhi Lei, Chenxi Liu, Hao Miao, Wanghui Qiu, Bin Yang, Chenjuan Guo
🧩 TL;DR
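This paper proposes MM-ISTS, a multimodal framework augmented by vision-text large language models for irregularly sampled time series forecasting. A cross-modal encoding mechanism fuses time series, visual, and textual information to capture complex temporal patterns and contextual semantics.
📘 Detailed Summary
Motivation: Irregularly sampled time series are widespread in the real world. Existing methods usually forecast from historical observations alone and struggle to learn contextual semantics and fine-grained temporal patterns, which limits forecasting for series with asynchronous observations and uneven time intervals.
Method: MM-ISTS uses a two-stage encoding mechanism. A cross-modal vision-text encoding module automatically generates informative images and text and works with multimodal LLMs (MLLMs) to capture complex temporal patterns and context. In parallel, ISTS encoding extracts complementary temporal features from historical observations through multi-view embedding fusion and a temporal-variable encoder. An adaptive query-based feature extractor compresses the tokens learned by MLLMs, and a multimodal alignment module with modality-aware gating reduces the gap between modalities.
Result: Extensive experiments on real data support the effectiveness of the proposed solutions. The framework improves forecasting for irregularly sampled time series, and feature compression reduces computational cost.
Conclusion: The work shows the potential of multimodal methods in time series analysis: fusing visual, textual, and time series information gives a fuller view of complex temporal patterns. The cross-modal alignment and feature compression mechanisms offer a practical approach to multimodal time series forecasting.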
📄 Abstract
Irregularly sampled time series (ISTS) are widespread in real-world scenarios, exhibiting asynchronous observations on uneven time intervals across variables. Existing ISTS forecasting methods often solely utilize historical observations to predict future ones while falling short in learning contextual semantics and fine-grained temporal patterns. To address these problems, we propose MM-ISTS, a multimodal framework augmented by vision-text large language models that bridges temporal, visual, and textual modalities to facilitate ISTS forecasting. MM-ISTS encompasses a novel two-stage encoding mechanism. In particular, a cross-modal vision-text encoding module is proposed to automatically generate informative visual images and textual data, enabling the capture of intricate temporal patterns and comprehensive contextual understanding, in collaboration with multimodal LLMs (MLLMs). In parallel, ISTS encoding extracts complementary yet enriched temporal features from historical ISTS observations, including multi-view embedding fusion and a temporal-variable encoder. Further, we propose an adaptive query-based feature extractor to compress the learned tokens of MLLMs, distilling a compact set of useful knowledge, which in turn reduces computational costs. In addition, a multimodal alignment module with modality-aware gating is designed to alleviate the modality gap across ISTS, images, and text. Extensive experiments on real data offer insight into the effectiveness of the proposed solutions.
[27] Shifting Adaptation from Weight Space to Memory Space: A Memory-Augmented Agent for Medical Image Segmentation
Bowen Chen, Qiaohui Gao, Shaowen Wan, Shanhui Sun, Wei Liu, Xiang Li, Tianming Liu, Lin Zhao
🧩 TL;DR
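This paper proposes a memory-augmented segmentation agent (MemSeg-Agent) that shifts adaptation from weight space to memory space. It provides a unified architecture for medical image segmentation that supports few-shot learning, federated supervised learning, and test-time adaptation, and substantially reduces communication overhead in federated learning.
📘 Detailed Summary
Motivation: Medical image segmentation models trained on a single dataset often fail to generalize across institutions, scanners, or patient populations. Vision foundation models typically need task-specific fine-tuning, which introduces substantial communication overhead in federated learning and prevents continuous knowledge evolution during deployment.
Method: MemSeg-Agent uses a memory-augmented architecture that moves adaptation from weight space to memory space. An agentic controller dynamically composes lightweight static memory, few-shot memory, and test-time working memory to condition a fixed backbone. In federated settings, compact memory units are updated instead of model parameters.
Result: On four public datasets, static memory alone matches or surpasses strong supervised baselines with high parameter efficiency, and test-time working memory further improves in-domain and cross-domain performance without fine-tuning, while communication overhead in federated learning drops substantially.
Conclusion: MemSeg-Agent introduces a new paradigm for scalable and adaptive medical image segmentation in the era of agentic AI, shifting adaptation from weight space to memory space and offering a workable route to efficient knowledge sharing and continual adaptation in federated learning.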
📄 Abstract
Medical image segmentation is fundamental to clinical workflows, yet models trained on a single dataset often fail to generalize across institutions, scanners, or patient populations. While vision foundation models have shown great promise in addressing this challenge, their deployment typically requires task-specific fine-tuning, which introduces substantial communication overhead in federated learning and prevents continuous knowledge evolution during deployment. In this work, we propose a memory-augmented segmentation agent (MemSeg-Agent) that shifts adaptation from weight space to memory space, enabling few-shot learning, federated supervised learning, and test-time adaptation within a unified architecture. MemSeg-Agent conditions a fixed backbone with lightweight static, few-shot, and test-time working memories, which are dynamically composed by an agentic controller. In federated settings, we update compact memory units instead of model parameters, substantially reducing communication overhead. Experiments on four public datasets demonstrate strong performance and robustness to domain shift: Static memory alone matches or surpasses strong supervised baselines with high parameter efficiency, and test-time working memory further improves in-domain and cross-domain performance without fine-tuning. Overall, MemSeg-Agent introduces a new paradigm for scalable and adaptive medical image segmentation in the era of agentic AI.
[28] CylinderSplat: 3D Gaussian Splatting with Cylindrical Triplanes for Panoramic Novel View Synthesis
Qiwei Wang, Xianghui Ze, Jingyi Yu, Yujiao Shi
🧩 TL;DR
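This paper proposes CylinderSplat, a feed-forward framework for panoramic 3D Gaussian Splatting. A cylindrical Triplane representation and a dual-branch architecture address occlusion in sparse-view scenarios and the mismatch between standard volumetric representations and panoramic data.
📘 Detailed Summary
Motivation: Existing feed-forward 3D Gaussian Splatting methods face two main problems on panoramic imagery. Geometric refinement based on multi-view cost volumes struggles to resolve occlusions in sparse-view scenarios, and standard volumetric representations such as Cartesian Triplanes capture the inherent geometry of 360-degree scenes poorly, causing distortion and aliasing.
Method: The core of CylinderSplat is a cylindrical Triplane representation that is better aligned with panoramic data and with real-world structures that follow the Manhattan-world assumption. A dual-branch architecture is used: a pixel-based branch reconstructs well-observed regions, while a volume-based branch uses the cylindrical Triplane to complete occluded or sparsely viewed areas. The framework handles a variable number of input panoramas, from a single view to multiple views.
Result: Extensive experiments show that CylinderSplat achieves state-of-the-art results in both single-view and multi-view panoramic novel view synthesis, outperforming previous methods in reconstruction quality and geometric accuracy.
Conclusion: The work shows that the cylindrical Triplane representation is effective for panoramic 3D reconstruction. The dual-branch design balances reconstruction quality in well-observed regions against completion of occluded regions, giving a flexible framework for panoramic 3DGS with a variable number of input views.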
📄 Abstract
Feed-forward 3D Gaussian Splatting (3DGS) has shown great promise for real-time novel view synthesis, but its application to panoramic imagery remains challenging. Existing methods often rely on multi-view cost volumes for geometric refinement, which struggle to resolve occlusions in sparse-view scenarios. Furthermore, standard volumetric representations like Cartesian Triplanes are poor at capturing the inherent geometry of $360^\circ$ scenes, leading to distortion and aliasing. In this work, we introduce CylinderSplat, a feed-forward framework for panoramic 3DGS that addresses these limitations. The core of our method is a new cylindrical Triplane representation, which is better aligned with panoramic data and real-world structures adhering to the Manhattan-world assumption. We use a dual-branch architecture: a pixel-based branch reconstructs well-observed regions, while a volume-based branch leverages the cylindrical Triplane to complete occluded or sparsely-viewed areas. Our framework is designed to flexibly handle a variable number of input views, from single to multiple panoramas. Extensive experiments demonstrate that CylinderSplat achieves state-of-the-art results in both single-view and multi-view panoramic novel view synthesis, outperforming previous methods in both reconstruction quality and geometric accuracy.
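The difference between a Cartesian and a cylindrical Triplane comes down to which coordinates index the three feature planes. The sketch below converts a 3D point to cylindrical coordinates and lists the coordinate pair that would address each plane. It is only an illustration: the abstract does not define the plane layout, so the pairing into `theta_z`, `rho_theta`, and `rho_z` planes, the z-up convention, and the function names are assumptions.

```python
import math


def to_cylindrical(x, y, z):
    """Cartesian (x, y, z), z pointing up, to (radius, azimuth, height)."""
    rho = math.hypot(x, y)
    theta = math.atan2(y, x)  # azimuth in (-pi, pi]
    return rho, theta, z


def triplane_coords(x, y, z):
    """Coordinate pairs used to look up each of three cylindrical planes.

    The plane naming and pairing are assumptions of this sketch.
    """
    rho, theta, height = to_cylindrical(x, y, z)
    return {
        "theta_z": (theta, height),   # unrolled cylinder surface, panorama-like
        "rho_theta": (rho, theta),    # top-down polar plane
        "rho_z": (rho, height),       # radial cross-section
    }


rho, theta, height = to_cylindrical(0.0, 2.0, 1.0)
planes = triplane_coords(0.0, 2.0, 1.0)
```

The `theta_z` plane has the same layout as an equirectangular panorama column and row, which gives some intuition for why a cylindrical parameterization suits 360-degree input better than axis-aligned Cartesian planes.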
[29] Probing Visual Concepts in Lightweight Vision-Language Models for Automated Driving
Nikos Theodoridis, Reenu Mohandas, Ganesh Sistu, Anthony Scanlan, Ciarán Eising, Tim Brophy
🧩 TL;DR
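This work analyzes the intermediate activations of vision-language models (VLMs) to understand why they fail on simple visual tasks relevant to automated driving. It identifies two failure modes, perceptual failure and cognitive failure, and characterizes how visual concepts are linearly encoded in the models.
📘 Detailed Summary
Motivation: Vision-language models are increasingly used in automated driving to handle long-tail scenarios through their reasoning and generalisation capabilities. Yet they often fail on simple visual questions that are highly relevant to driving, and the reasons are poorly understood. This work investigates the mechanisms behind these failures.
Method: The authors create counterfactual image sets that differ only in a targeted visual concept and train linear probes to distinguish them using the intermediate activations of four state-of-the-art VLMs. The probes measure how far specific visual concepts are linearly encoded and so locate bottlenecks in the flow of visual information.
Result: Concepts such as the presence of an object or agent are explicitly and linearly encoded, whereas spatial concepts such as object orientation are only implicitly encoded through the spatial structure retained by the vision encoder. Two failure modes are identified: perceptual failure, where the required visual information is not linearly encoded, and cognitive failure, where the information is present but is not correctly aligned with language semantics. Increasing object distance quickly degrades the linear separability of the corresponding concept.
Conclusion: The work improves understanding of VLM failure cases on simple visual tasks relevant to automated driving, describing both how visual concepts are encoded and two distinct failure mechanisms, and provides a diagnostic basis for improving reliability in safety-critical applications.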
📄 Abstract
The use of Vision-Language Models (VLMs) in automated driving applications is becoming increasingly common, with the aim of leveraging their reasoning and generalisation capabilities to handle long tail scenarios. However, these models often fail on simple visual questions that are highly relevant to automated driving, and the reasons behind these failures remain poorly understood. In this work, we examine the intermediate activations of VLMs and assess the extent to which specific visual concepts are linearly encoded, with the goal of identifying bottlenecks in the flow of visual information. Specifically, we create counterfactual image sets that differ only in a targeted visual concept and then train linear probes to distinguish between them using the activations of four state-of-the-art (SOTA) VLMs. Our results show that concepts such as the presence of an object or agent in a scene are explicitly and linearly encoded, whereas other spatial visual concepts, such as the orientation of an object or agent, are only implicitly encoded by the spatial structure retained by the vision encoder. In parallel, we observe that in certain cases, even when a concept is linearly encoded in the model's activations, the model still fails to answer correctly. This leads us to identify two failure modes. The first is perceptual failure, where the visual information required to answer a question is not linearly encoded in the model's activations. The second is cognitive failure, where the visual information is present but the model fails to align it correctly with language semantics. Finally, we show that increasing the distance of the object in question quickly degrades the linear separability of the corresponding visual concept. Overall, our findings improve our understanding of failure cases in VLMs on simple visual tasks that are highly relevant to automated driving.
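The linear-probing methodology can be illustrated with a small, self-contained logistic-regression probe. This is a generic sketch of the technique, not the authors' setup: the toy 2-D "activations" stand in for real VLM activations, the labels mark whether a hypothetical concept is present, and the names `train_linear_probe` and `probe_accuracy` and the hyperparameters are illustrative.

```python
import math


def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    dim = len(features[0])
    w = [0.0] * dim
    b = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y  # gradient of the log-loss w.r.t. the logit
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, features, labels):
    """Fraction of examples the linear probe classifies correctly."""
    correct = 0
    for x, y in zip(features, labels):
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        correct += int(pred == y)
    return correct / len(features)


# Toy activations: the concept is encoded along the first dimension.
acts = [[1.0, 0.2], [0.9, -0.1], [-1.0, 0.1], [-0.8, -0.3]]
concept_present = [1, 1, 0, 0]
w, b = train_linear_probe(acts, concept_present)
accuracy = probe_accuracy(w, b, acts, concept_present)
```

High probe accuracy on held-out counterfactual pairs would indicate that a concept is linearly encoded. If the probe succeeds but the model still answers incorrectly, that matches the abstract's "cognitive failure"; if the probe itself fails, that matches "perceptual failure". A real study would evaluate on a held-out split rather than the training data used in this toy example.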
[30] CollabOD: Collaborative Multi-Backbone with Cross-scale Vision for UAV Small Object Detection
Xuecheng Bai, Yuxiang Wang, Chuanzhi Xu, Boyu Hu, Kang Han, Ruijie Pan, Xiaowei Niu, Xiaotian Guan, Liqiang Fu, Pengfei Ye
🧩 TL;DR
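This paper proposes CollabOD, a lightweight collaborative detection framework that uses structural detail preservation, cross-path feature alignment, and localization-aware lightweight design to address scale variation and detail degradation in small object detection on UAV imagery, improving detection stability while keeping inference efficient.
📘 Detailed Summary
Motivation: Small object detection in UAV imagery is challenged by scale variation, structural detail degradation, and limited computational resources. In high-altitude scenarios, fine-grained features are further weakened during hierarchical downsampling and cross-scale fusion, causing unstable localization and reduced robustness.
Method: CollabOD is a lightweight collaborative detection framework that integrates Structural Detail Preservation, Cross-Path Feature Alignment, and Localization-Aware Lightweight Design. It optimizes the architecture of conventional UAV perception models from the perspectives of image processing, channel structure, and lightweight design, and adds a unified detail-aware detection head to improve regression robustness.
Result: The framework enhances representation stability while maintaining efficient inference, and the unified detail-aware detection head further improves regression robustness without additional deployment overhead. The code is released on GitHub.
Conclusion: By jointly optimizing feature representation and computational efficiency, the work offers an effective approach to UAV small object detection. Its design ideas may extend to other resource-constrained visual perception tasks that must balance accuracy against compute.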
📄 Abstract
Small object detection in unmanned aerial vehicle (UAV) imagery is challenging, mainly due to scale variation, structural detail degradation, and limited computational resources. In high-altitude scenarios, fine-grained features are further weakened during hierarchical downsampling and cross-scale fusion, resulting in unstable localization and reduced robustness. To address this issue, we propose CollabOD, a lightweight collaborative detection framework that explicitly preserves structural details and aligns heterogeneous feature streams before multi-scale fusion. The framework integrates Structural Detail Preservation, Cross-Path Feature Alignment, and Localization-Aware Lightweight Design strategies. From the perspectives of image processing, channel structure, and lightweight design, it optimizes the architecture of conventional UAV perception models. The proposed design enhances representation stability while maintaining efficient inference. A unified detail-aware detection head further improves regression robustness without introducing additional deployment overhead. The code is available at: https://github.com/Bai-Xuecheng/CollabOD.
[31] Place-it-R1: Unlocking Environment-aware Reasoning Potential of MLLM for Video Object Insertion
Bohai Gu, Taiyi Wu, Dazhao Du, Jian Liu, Shuai Yang, Xiaotong Zhao, Alan Zhao, Song Guo
🧩 TL;DR
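This paper proposes Place-it-R1, a framework that uses the chain-of-thought reasoning of multimodal large language models to achieve physically consistent video object insertion, addressing the limitation that existing methods focus only on visual fidelity and ignore physical causality.
📘 Detailed Summary
Motivation: Modern video editing techniques reach high visual fidelity when inserting video objects, but they optimize for appearance rather than physical causality. The resulting edits are often physically inconsistent with their environment, producing implausible insertions.
Method: The paper proposes Place-it-R1, an end-to-end framework that follows a Think-then-Place paradigm and uses the chain-of-thought reasoning of an MLLM to orchestrate a video diffusion model. It has three key innovations. First, the MLLM performs physical scene understanding and interaction reasoning, generating environment-aware chain-of-thought tokens and inferring valid insertion regions. Second, MLLM-guided Spatial Direct Preference Optimization uses a feedback scoring mechanism to ensure visual naturalness. Third, at inference time the MLLM iteratively triggers refinement cycles and guides the diffusion model to make adaptive adjustments, forming a closed loop of progressive optimization.
Result: Extensive experiments show that Place-it-R1 achieves physically consistent video object insertion and, compared with state-of-the-art solutions and commercial models, markedly improves the physical plausibility of the edits.
Conclusion: The work demonstrates the potential of MLLMs to connect cognitive reasoning with generative execution. Through its closed-loop refinement and two user-selectable modes (a plausibility-oriented flexible mode and a fidelity-oriented standard mode), it gives video editing a controllable trade-off between physical consistency and visual fidelity.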
📄 Abstract
Modern video editing techniques have achieved high visual fidelity when inserting video objects. However, they focus on optimizing visual fidelity rather than physical causality, leading to edits that are physically inconsistent with their environment. In this work, we present Place-it-R1, an end-to-end framework for video object insertion that unlocks the environment-aware reasoning potential of Multimodal Large Language Models (MLLMs). Our framework leverages the Chain-of-Thought (CoT) reasoning of MLLMs to orchestrate video diffusion, following a Think-then-Place paradigm. To bridge cognitive reasoning and generative execution, we introduce three key innovations. First, the MLLM performs physical scene understanding and interaction reasoning, generating environment-aware chain-of-thought tokens and inferring valid insertion regions to explicitly guide the diffusion model toward physically plausible insertion. Second, we introduce MLLM-guided Spatial Direct Preference Optimization (DPO), where diffusion outputs are fed back to the MLLM for scoring, improving visual naturalness. Third, during inference, the MLLM iteratively triggers refinement cycles and elicits adaptive adjustments from the diffusion model, forming a closed loop that progressively enhances editing quality. Furthermore, we provide two user-selectable modes: a plausibility-oriented flexible mode that permits environment modifications (e.g., generating support structures) to enhance physical plausibility, and a fidelity-oriented standard mode that preserves scene integrity for maximum fidelity, offering users explicit control over the plausibility-fidelity trade-off. Extensive experiments demonstrate that Place-it-R1 achieves more physically coherent video object insertion than state-of-the-art solutions and commercial models.
[32] VLM-RobustBench: A Comprehensive Benchmark for Robustness of Vision-Language Models
Rohit Saxena, Alessandro Suglia, Pasquale Minervini
🧩 TL;DR
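This study presents VLM-RobustBench, a benchmark that systematically evaluates the robustness of vision-language models under real-world image distortions. It finds that current models are semantically strong but fragile to spatial perturbations, with resampling and geometric distortions causing particularly large performance drops.
📘 Detailed Summary
Motivation: Vision-language models perform well on standard, high-quality datasets, but their behavior under real-world image distortions is still poorly understood, and there is no comprehensive benchmark that systematically evaluates robustness across a wide range of image perturbations.
Method: The study builds the VLM-RobustBench benchmark, covering 49 augmentation types across five categories (noise, blur, weather, digital, and geometric). With graded severities (low/mid/high) and binary transforms, this yields 133 corrupted settings. VLMs from four model families (Qwen, InternVL, Molmo, Gemma) are evaluated on two complementary benchmarks: MMBench (visually grounded) and MMMU-Pro (reasoning-oriented).
Result: Visual severity turns out to be a weak predictor of performance degradation: low-severity spatial perturbations often hurt performance more than visually severe photometric corruptions. In particular, low-severity glass blur lowers MMBench accuracy by about 8 percentage points on average, while resampling and geometric distortions (such as upsampling and elastic transforms) cause the largest drops, up to 34 percentage points.
Conclusion: Current vision-language models are semantically strong but spatially fragile. This motivates new robustness evaluation protocols and training regimes, with particular emphasis on resampling and geometric invariance, to improve practical performance under real-world distortions.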
📄 Abstract
Vision-language models (VLMs) achieve strong performance on standard, high-quality datasets, but we still do not fully understand how they perform under real-world image distortions. We present VLM-RobustBench, a benchmark spanning 49 augmentation types across noise, blur, weather, digital, and geometric perturbations, evaluated under graded severities (low/mid/high) and binary transforms, yielding 133 corrupted settings. We evaluate VLMs from four families (Qwen, InternVL, Molmo, Gemma) on two complementary benchmarks: MMBench (visually grounded) and MMMU-Pro (reasoning-oriented). Our results reveal that visual severity is a weak predictor of difficulty: low-severity spatial perturbations often degrade performance more than visually severe photometric corruptions. In particular, low-severity glass_blur reduces MMBench accuracy by about 8 pp on average across models, while the largest drops arise from resampling and geometric distortions (e.g., upsample, elastic_transform), reaching up to 34 pp. Overall, our findings suggest current VLMs are semantically strong but spatially fragile, motivating the definition of novel robustness evaluation protocols and training regimes that emphasize resampling and geometric invariances.
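The abstract gives 49 augmentation types and 133 corrupted settings but not the split between graded and binary transforms. The split can be recovered with a little arithmetic, under the assumption (not stated in the abstract) that every type is either graded with exactly three severity levels or binary with a single setting. The function name below is hypothetical.

```python
def solve_setting_counts(total_types, total_settings, levels_per_graded=3):
    """Solve  g + b = total_types  and  levels * g + b = total_settings
    for non-negative integers g (graded types) and b (binary types)."""
    extra = total_settings - total_types  # equals (levels - 1) * g
    per_graded_extra = levels_per_graded - 1
    if extra % per_graded_extra != 0:
        raise ValueError("no integer solution under the stated assumption")
    graded = extra // per_graded_extra
    binary = total_types - graded
    if graded < 0 or binary < 0:
        raise ValueError("counts are inconsistent with the stated assumption")
    return graded, binary


graded, binary = solve_setting_counts(49, 133)
print(graded, binary)  # 42 7, since 42 * 3 + 7 = 133
```

Under that assumption the benchmark would contain 42 graded and 7 binary augmentation types.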
[33] Reflective Flow Sampling Enhancement
Zikai Zhou, Muyao Wang, Shitong Shao, Lichen Bai, Haoyi Xiong, Bo Han, Zeke Xie
🧩 TL;DR
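This paper proposes Reflective Flow Sampling (RF-Sampling), a training-free inference enhancement framework designed for flow matching models. A theoretical derivation shows that it implicitly performs gradient ascent on the text-image alignment score, and it markedly improves the generation quality and prompt alignment of flow models such as FLUX.
📘 Detailed Summary
Motivation: Current inference-time enhancement strategies mainly apply to conventional diffusion models and perform poorly on flow matching models such as FLUX. This limits the potential of flow models in text-to-image generation and calls for an inference enhancement framework designed specifically for them.
Method: The paper proposes Reflective Flow Sampling (RF-Sampling), a theoretically grounded, training-free inference enhancement framework designed for flow models, especially CFG-distilled variants. It combines a linear combination of textual representations with flow inversion so that the model can explore noise spaces that are more consistent with the input prompt. The derivation shows that the method implicitly performs gradient ascent on the text-image alignment score.
Result: Extensive experiments on multiple benchmarks show that RF-Sampling consistently improves generation quality and prompt alignment. It is also the first inference enhancement method to exhibit some degree of test-time scaling on FLUX, and it clearly outperforms existing heuristic methods on flow models.
Conclusion: The work provides a theoretically grounded inference enhancement framework designed for flow matching models, filling a technical gap and showing the potential of inference-time optimization for flow models. It opens a direction for high-quality generation with flow models and suggests that principled derivations can outperform heuristic approaches on complex generative tasks.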
📄 Abstract
The growing demand for text-to-image generation has led to rapid advances in generative modeling. Recently, text-to-image diffusion models trained with flow matching algorithms, such as FLUX, have achieved remarkable progress and emerged as strong alternatives to conventional diffusion models. At the same time, inference-time enhancement strategies have been shown to improve the generation quality and text-prompt alignment of text-to-image diffusion models. However, these techniques are mainly applicable to conventional diffusion models and usually fail to perform well on flow models. To bridge this gap, we propose Reflective Flow Sampling (RF-Sampling), a theoretically-grounded and training-free inference enhancement framework explicitly designed for flow models, especially for the CFG-distilled variants (i.e., models distilled from CFG guidance techniques), like FLUX. Departing from heuristic interpretations, we provide a formal derivation proving that RF-Sampling implicitly performs gradient ascent on the text-image alignment score. By leveraging a linear combination of textual representations and integrating them with flow inversion, RF-Sampling allows the model to explore noise spaces that are more consistent with the input prompt. Extensive experiments across multiple benchmarks demonstrate that RF-Sampling consistently improves both generation quality and prompt alignment. Moreover, RF-Sampling is also the first inference enhancement method that can exhibit test-time scaling ability to some extent on FLUX.
[34] Towards Driver Behavior Understanding: Weakly-Supervised Risk Perception in Driving Scenes
Nakul Agarwal, Yi-Ting Chen, Behzad Dariush
🧩 TL;DR
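This paper introduces RAID, a large-scale dataset for research on driver risk perception and contextual risk assessment, and builds on it a weakly supervised risk object identification framework that substantially improves risk identification performance.
📘 Detailed Summary
Motivation: Achieving zero-collision mobility is a key goal for intelligent vehicle systems. It requires understanding driver risk perception, a complex cognitive process shaped by the driver's voluntary response to external stimuli and by how attentive surrounding road users are to the ego-vehicle. The lack of a large-scale dataset dedicated to driver risk perception and contextual risk assessment has held back research in this area.
Method: The paper introduces the RAID dataset, which contains 4,691 annotated video clips covering diverse traffic scenarios. Annotations include the driver's intended maneuver, road topology, risk situations, driver responses, and pedestrian attentiveness. On top of this dataset, the paper proposes a weakly supervised risk object identification framework that identifies potential risk sources by modeling the relationship between the driver's intended maneuver and responses. It also analyzes the role of pedestrian attention in risk assessment.
Result: Experimental evaluations show that the proposed method achieves performance gains of 20.6% on the RAID dataset and 23.1% on the HDDS dataset over prior state-of-the-art methods. The usefulness of the RAID dataset is also validated, making it a valuable resource for research on driver risk perception.
Conclusion: RAID fills a data gap in driver risk perception research, and the proposed weakly supervised framework substantially improves risk identification by modeling the relationship between driver behavior and risk sources. The work underlines the importance of contextual information, such as pedestrian attention, in risk assessment and offers a new perspective on safety decision-making for intelligent vehicle systems.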
📄 Abstract
Achieving zero-collision mobility remains a key objective for intelligent vehicle systems, which requires understanding driver risk perception, a complex cognitive process shaped by the driver's voluntary response to external stimuli and the attentiveness of surrounding road users towards the ego-vehicle. To support progress in this area, we introduce RAID (Risk Assessment In Driving scenes), a large-scale dataset specifically curated for research on driver risk perception and contextual risk assessment. RAID comprises 4,691 annotated video clips, covering diverse traffic scenarios with labels for the driver's intended maneuver, road topology, risk situations (e.g., crossing pedestrians), driver responses, and pedestrian attentiveness. Leveraging RAID, we propose a weakly supervised risk object identification framework that models the relationship between the driver's intended maneuver and responses to identify potential risk sources. Additionally, we analyze the role of pedestrian attention in estimating risk and demonstrate the value of the proposed dataset. Experimental evaluations demonstrate that our method achieves 20.6% and 23.1% performance gains over prior state-of-the-art approaches on the RAID and HDDS datasets, respectively.
[35] OD-RASE: Ontology-Driven Risk Assessment and Safety Enhancement for Autonomous Driving
Kota Shimomura, Masaki Nambata, Atsuya Ishikawa, Ryota Mimura, Takayuki Kawabuchi, Takayoshi Yamashita, Koki Inoue
🧩 TL;DR
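This paper proposes OD-RASE, a framework that improves the safety of autonomous driving systems by detecting road structures that cause traffic accidents and linking them to infrastructure improvements. The framework combines a domain ontology with a large-scale vision-language model to proactively identify accident-prone roads and generate improvement proposals.
📘 Detailed Summary
Motivation: Current autonomous driving systems remain limited when handling rare situations or complex road structures. Road infrastructure is designed mainly for human drivers, and safety improvements are usually reactive. This reactive approach is a major challenge for autonomous driving systems, which need proactive risk mitigation, so a systematic framework connecting accident detection with infrastructure development is needed.
Method: The study proposes the OD-RASE framework. It first formalizes an ontology based on expert knowledge of road traffic systems. In parallel, it uses a large-scale vision-language model (LVLM) to generate infrastructure improvement proposals and applies ontology-driven data filtering to increase their reliability, which leads to a new dataset. The framework also includes the OD-RASE model, which combines an LVLM with a diffusion model to produce improvement proposals together with generated images of the improved road environment.
Result: Experiments show that ontology-driven data filtering enables highly accurate prediction of accident-causing road structures and the corresponding improvement plans. The method automatically annotates pre-accident road images and produces a new dataset containing infrastructure improvement proposals, supporting the framework's effectiveness in connecting accident detection with infrastructure development.
Conclusion: The work contributes to the overall safety of traffic environments and marks an important step toward broader adoption of autonomous driving systems. Its proactive risk mitigation goes beyond traditional reactive safety improvements by systematically linking accident analysis with infrastructure planning.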
📄 Abstract
Although autonomous driving systems demonstrate high perception performance, they still face limitations when handling rare situations or complex road structures. Road infrastructure is designed for human drivers, and safety improvements are typically introduced only after accidents occur. This reactive approach poses a significant challenge for autonomous systems, which require proactive risk mitigation. To address this issue, we propose OD-RASE, a framework for enhancing the safety of autonomous driving systems by detecting road structures that cause traffic accidents and connecting these findings to infrastructure development. First, we formalize an ontology based on specialized domain knowledge of road traffic systems. In parallel, we generate infrastructure improvement proposals using a large-scale visual language model (LVLM) and use ontology-driven data filtering to enhance their reliability. This process automatically annotates improvement proposals on pre-accident road images, leading to the construction of a new dataset. Furthermore, we introduce a baseline approach (the OD-RASE model), which leverages an LVLM and a diffusion model to produce both infrastructure improvement proposals and generated images of the improved road environment. Our experiments demonstrate that ontology-driven data filtering enables highly accurate prediction of accident-causing road structures and the corresponding improvement plans. We believe that this work contributes to the overall safety of traffic environments and marks an important step toward the broader adoption of autonomous driving systems.
[36] SLER-IR: Spherical Layer-wise Expert Routing for All-in-One Image Restoration
Peng Shurui, Xin Lin, Shi Luo, Jincen Ou, Dizhe Zhang, Lu Qi, Truong Nguyen, Chao Ren
🧩 TL;DR
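This paper proposes SLER-IR, a spherical layer-wise expert routing framework that dynamically activates specialized experts across network layers to address feature interference and insufficient expert specialization in all-in-one image restoration, achieving state-of-the-art results on multi-task benchmarks.
📘 Detailed Summary
Motivation: Unified all-in-one image restoration frameworks suffer from feature interference and insufficient expert specialization when handling diverse degradations, and existing methods struggle to handle different degradation types effectively within a single architecture.
Method: The paper proposes SLER-IR, a spherical layer-wise expert routing framework. It includes a Spherical Uniform Degradation Embedding with contrastive learning, which maps degradation representations onto a hypersphere to remove the geometry bias of linear embedding spaces. It also designs a Global-Local Granularity Fusion module that integrates global semantics and local degradation cues to handle spatially non-uniform degradations and the train-test granularity gap.
Result: On three-task and five-task benchmarks, SLER-IR outperforms existing state-of-the-art methods in both PSNR and SSIM, with consistent improvements that demonstrate its effectiveness on diverse image degradation tasks.
Conclusion: The work shows that spherical degradation representations combined with layer-wise expert routing can effectively reduce feature interference in unified image restoration. It offers a new architectural approach to handling diverse degradations and may extend to a broader range of image processing tasks.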
📄 Abstract
Image restoration under diverse degradations remains challenging for unified all-in-one frameworks due to feature interference and insufficient expert specialization. We propose SLER-IR, a spherical layer-wise expert routing framework that dynamically activates specialized experts across network layers. To ensure reliable routing, we introduce a Spherical Uniform Degradation Embedding with contrastive learning, which maps degradation representations onto a hypersphere to eliminate geometry bias in linear embedding spaces. In addition, a Global-Local Granularity Fusion (GLGF) module integrates global semantics and local degradation cues to address spatially non-uniform degradations and the train-test granularity gap. Experiments on three-task and five-task benchmarks demonstrate that SLER-IR achieves consistent improvements over state-of-the-art methods in both PSNR and SSIM. Code and models will be publicly released.
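To make the hypersphere idea concrete, the sketch below projects a degradation embedding onto the unit sphere and picks experts by cosine similarity. It is only an illustration: the routing rule, the expert "keys", and the function names are assumptions, since the abstract does not specify how routing scores are computed.

```python
import math


def l2_normalize(vector, eps=1e-12):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, eps) for x in vector]


def route_to_experts(embedding, expert_keys, top_k=2):
    """Return indices of the top_k experts ranked by cosine similarity.

    Both the embedding and the (hypothetical) expert keys are normalized first,
    so the ranking depends only on direction, not on vector magnitude."""
    e = l2_normalize(embedding)
    scored = []
    for index, key in enumerate(expert_keys):
        k = l2_normalize(key)
        scored.append((sum(a * b for a, b in zip(e, k)), index))
    scored.sort(reverse=True)
    return [index for _, index in scored[:top_k]]


expert_keys = [[2.0, 0.0], [0.0, 5.0], [-1.0, 0.0]]
print(route_to_experts([10.0, 0.1], expert_keys))  # [0, 1]
```

Because everything lives on the unit sphere, rescaling the embedding (for example, `[1000.0, 10.0]`) leaves the selected experts unchanged, which is one way to read the abstract's point about removing geometry bias from linear embedding spaces.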
[37] DeepSight: Bridging Depth Maps and Language with a Depth-Driven Multimodal Model
Hao Yang, Hongbo Zhang, Yanyan Zhao, Bing Qin
🧩 TL;DR
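This paper proposes DeepSight, the first multimodal large language model dedicated to depth perception. By exploiting the distinctive properties of depth images, it strengthens 3D scene understanding and markedly improves depth perception and downstream task performance.
📘 Detailed Summary
Motivation: Current multimodal large language models perform well on tasks such as image captioning and visual question answering, but they often struggle to interpret the depth information in visual data. This limits 3D scene understanding and motivates a model designed specifically for depth perception.
Method: DeepSight constructs a novel depth image-text pair dataset and a depth instruction dataset. Depth maps are generated from visual images with the GLPN model, and GPT-4 is used to curate the corresponding depth instructions. The ViT encoder in CLIP is also modified to incorporate local object information, so that subtle, continuous variations in depth are captured more effectively.
Result: Experiments show that DeepSight performs well on a comprehensive depth question answering benchmark built from existing depth image datasets, with markedly better depth perception and downstream task performance, supporting its effectiveness for 3D understanding.
Conclusion: The work marks notable progress in multimodal 3D understanding. It shows that an MLLM designed specifically for depth perception can address the weaknesses of conventional models in interpreting depth information, and it offers a direction and methodological basis for future research on 3D scene understanding.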
📄 Abstract
Multimodal large language models (MLLMs) have achieved impressive performance across various tasks such as image captioning and visual question answering (VQA); however, they often struggle to accurately interpret depth information inherent in visual data. In this work, we introduce DeepSight, the first dedicated depth MLLM designed to enhance three-dimensional scene understanding. Unlike conventional methods that align RGB image encodings with text, our approach improves spatial reasoning by exploiting the unique characteristics of depth images, which are single-channel grayscale images whose pixel values directly reflect depth cues. To address challenges associated with limited depth data and the inadequacy of simple channel replication, we construct a novel depth image-text pair dataset and a depth instruction dataset. Depth maps are generated from visual images using the GLPN model, and GPT-4 is employed to curate corresponding depth instructions, an approach validated by LLaVA. Additionally, we modify the ViT encoder in CLIP to incorporate local object information, thereby capturing the subtle continuous variations of depth more effectively. To evaluate the performance of our model, we develop a comprehensive depth question answering benchmark based on existing depth image datasets, which rigorously assesses understanding in typical depth map scenarios. Experimental results demonstrate that DeepSight significantly enhances depth perception and downstream task performance, marking a substantial step forward in multimodal three-dimensional understanding.
[38] Cut to the Chase: Training-free Multimodal Summarization via Chain-of-Events
Xiaoxing You, Qiang Huang, Lingyu Li, Xiaojun Chang, Jun Yu
🧩 TL;DR
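This paper proposes CoE, a training-free multimodal summarization framework that performs structured reasoning through a chain of events, addressing existing methods' reliance on domain-specific supervision, implicit fusion, and flat temporal modeling.
📘 Detailed Summary
Motivation: Existing multimodal summarization methods face three main challenges: reliance on domain-specific supervision, implicit fusion that leads to weak cross-modal grounding, and flat temporal modeling that lacks event transitions.
Method: CoE uses a Hierarchical Event Graph to encode textual semantics into an explicit event hierarchy that guides cross-modal grounding and temporal reasoning. The framework localizes key visual cues, models event evolution and causal transitions, and performs domain alignment through lightweight style adaptation.
Result: Experiments on eight diverse datasets show that CoE consistently outperforms state-of-the-art video CoT baselines, with average gains of +3.04 ROUGE, +9.51 CIDEr, and +1.88 BERTScore, demonstrating robustness, interpretability, and cross-domain generalization.
Conclusion: The work demonstrates the effectiveness of structured reasoning over an explicit event hierarchy and provides an interpretable multimodal summarization approach that generalizes well, while avoiding the need for training and domain-specific supervision.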
📄 Abstract
Multimodal Summarization (MMS) aims to generate concise textual summaries by understanding and integrating information across videos, transcripts, and images. However, existing approaches still suffer from three main challenges: (1) reliance on domain-specific supervision, (2) implicit fusion with weak cross-modal grounding, and (3) flat temporal modeling without event transitions. To address these issues, we introduce CoE, a training-free MMS framework that performs structured reasoning through a Chain-of-Events guided by a Hierarchical Event Graph (HEG). The HEG encodes textual semantics into an explicit event hierarchy that scaffolds cross-modal grounding and temporal reasoning. Guided by this structure, CoE localizes key visual cues, models event evolution and causal transitions, and refines outputs via lightweight style adaptation for domain alignment. Extensive experiments on eight diverse datasets demonstrate that CoE consistently outperforms state-of-the-art video CoT baselines, achieving average gains of +3.04 ROUGE, +9.51 CIDEr, and +1.88 BERTScore, highlighting its robustness, interpretability, and cross-domain generalization. Our code is available at https://github.com/youxiaoxing/CoE.
[39] LucidNFT: LR-Anchored Multi-Reward Preference Optimization for Generative Real-World Super-Resolution
Song Fei, Tian Ye, Sixiang Chen, Zhaohu Xing, Jianyu Lai, Lei Zhu
🧩 TL;DR
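This paper proposes LucidNFT, a multi-reward reinforcement learning framework for flow-matching real-world image super-resolution. By introducing an LR-anchored faithfulness evaluator and a decoupled advantage normalization strategy, it addresses semantic hallucination and optimization bottlenecks in generative Real-ISR.
📘 Detailed Summary
Motivation: Generative real-world image super-resolution has a critical failure mode: outputs may look sharp yet deviate from the LR evidence (semantic and structural hallucination), and LR-anchored faithfulness is hard to assess without HR ground truth. Existing preference-based reinforcement learning faces three main obstacles: the lack of a degradation-robust, LR-referenced faithfulness signal; a rollout-group optimization bottleneck that causes advantage collapse; and limited coverage of real degradations, which restricts rollout diversity and preference signal quality.
Method: LucidNFT has three core components. LucidConsistency is a degradation-robust semantic evaluator that makes LR-anchored faithfulness measurable and optimizable. A decoupled advantage normalization strategy preserves objective-wise contrasts within each LR-conditioned rollout group before fusion, preventing advantage collapse. LucidLR is a large-scale collection of real-world degraded images that supports robust RL fine-tuning. The framework builds on flow-matching Real-ISR and is optimized with multi-reward RL.
Result: Experiments show that LucidNFT consistently improves strong flow-based Real-ISR baselines, achieving better perceptual-faithfulness trade-offs across diverse real-world scenarios with stable optimization dynamics. The method substantially reduces semantic hallucination while maintaining generation quality.
Conclusion: The work shows that a degradation-robust, LR-anchored faithfulness evaluator combined with decoupled advantage normalization can effectively address semantic hallucination in generative Real-ISR. LucidNFT offers a systematic approach to reliable alignment in real-world image super-resolution, and the large-scale LucidLR dataset is a useful resource for future research.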
📄 Abstract
Generative real-world image super-resolution (Real-ISR) can synthesize visually convincing details from severely degraded low-resolution (LR) inputs, yet its stochastic sampling makes a critical failure mode hard to avoid: outputs may look sharp but be unfaithful to the LR evidence (semantic and structural hallucination), while such LR-anchored faithfulness is difficult to assess without HR ground truth. Preference-based reinforcement learning (RL) is a natural fit because each LR input yields a rollout group of candidates to compare. However, effective alignment in Real-ISR is hindered by (i) the lack of a degradation-robust LR-referenced faithfulness signal, and (ii) a rollout-group optimization bottleneck where naive multi-reward scalarization followed by normalization compresses objective-wise contrasts, causing advantage collapse and weakening the reward-weighted updates in DiffusionNFT-style forward fine-tuning. Moreover, (iii) limited coverage of real degradations restricts rollout diversity and preference signal quality. We propose LucidNFT, a multi-reward RL framework for flow-matching Real-ISR. LucidNFT introduces LucidConsistency, a degradation-robust semantic evaluator that makes LR-anchored faithfulness measurable and optimizable; a decoupled advantage normalization strategy that preserves objective-wise contrasts within each LR-conditioned rollout group before fusion, preventing advantage collapse; and LucidLR, a large-scale collection of real-world degraded images to support robust RL fine-tuning. Experiments show that LucidNFT consistently improves strong flow-based Real-ISR baselines, achieving better perceptual-faithfulness trade-offs with stable optimization dynamics across diverse real-world scenarios.
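The contrast between naive scalarization and decoupled normalization can be illustrated on a toy rollout group. This is a hedged sketch rather than the LucidNFT implementation: the reward values, the equal weights, and the function names are made up, and plain z-scoring within the group stands in for whatever normalization the authors use.

```python
import math


def zscore(values, eps=1e-8):
    """Normalize a list to zero mean and (approximately) unit variance."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / (std + eps) for v in values]


def scalarize_then_normalize(rewards, weights):
    """Naive baseline: weighted sum per rollout, then normalize within the group."""
    fused = [sum(w * r for w, r in zip(weights, row)) for row in rewards]
    return zscore(fused)


def decoupled_normalize(rewards, weights):
    """Normalize each objective within the group first, then fuse the results."""
    per_objective = [zscore([row[j] for row in rewards]) for j in range(len(weights))]
    return [
        sum(weights[j] * per_objective[j][i] for j in range(len(weights)))
        for i in range(len(rewards))
    ]


# One rollout group of four candidates; columns are hypothetical rewards:
# (perceptual quality on a 0-100 scale, LR faithfulness on a 0-1 scale).
rewards = [[60.0, 0.9], [62.0, 0.1], [58.0, 0.5], [61.0, 0.2]]
weights = [0.5, 0.5]
naive = scalarize_then_normalize(rewards, weights)
decoupled = decoupled_normalize(rewards, weights)
print(naive.index(max(naive)), decoupled.index(max(decoupled)))  # 1 0
```

In this example the naive advantage is dominated by the large-scale objective and favors rollout 1, even though that rollout has the worst faithfulness score. Normalizing each objective separately keeps the faithfulness contrast alive, so rollout 0 receives the highest advantage.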
[40] TaPD: Temporal-adaptive Progressive Distillation for Observation-Adaptive Trajectory Forecasting in Autonomous Driving
Mingyu Fan, Yi Liu, Hao Zhou, Deheng Qian, Mohammad Haziq Khan, Matthias Raetsch
🧩 TL;DR
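This paper proposes TaPD (Temporal-adaptive Progressive Distillation), a unified plug-and-play framework for trajectory forecasting in autonomous driving when history observations vary in length because of occlusion or limited sensing range. It achieves observation-adaptive prediction through progressive knowledge distillation and a temporal backfilling module.
📘 Detailed Summary
Motivation: Existing trajectory prediction methods usually assume fixed-length history observations. In real-world settings, occlusion or limited sensing range can make observations variable in length or extremely short, and these methods then degrade substantially, which limits the reliability of autonomous driving systems in complex environments.
Method: TaPD consists of two cooperating modules: an Observation-Adaptive Forecaster (OAF) and a Temporal Backfilling Module (TBM). OAF is built on progressive knowledge distillation (PKD), which transfers motion pattern knowledge from long-horizon "teachers" to short-horizon "students" through hierarchical feature regression, and it uses a cosine-annealed distillation weighting scheme to balance forecasting supervision and feature alignment. For extremely short histories, TBM backfills missing historical segments conditioned on scene evolution, and a decoupled pretrain-reconstruct-finetune protocol preserves real-motion priors.
Result: Extensive experiments on Argoverse 1 and Argoverse 2 show that TaPD consistently outperforms strong baselines at all observation lengths, with especially large gains for very short inputs. It also improves other predictors (such as HiVT) in a plug-and-play manner, demonstrating the framework's effectiveness and generality.
Conclusion: The work provides a unified framework for trajectory prediction under variable-length history observations. By combining progressive knowledge distillation with explicit history reconstruction, it substantially improves prediction with short observations and offers an effective way to handle incomplete perception in real autonomous driving systems.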
📄 Abstract
Trajectory prediction is essential for autonomous driving, enabling vehicles to anticipate the motion of surrounding agents to support safe planning. However, most existing predictors assume fixed-length histories and suffer substantial performance degradation when observations are variable or extremely short in real-world settings (e.g., due to occlusion or a limited sensing range). We propose TaPD (Temporal-adaptive Progressive Distillation), a unified plug-and-play framework for observation-adaptive trajectory forecasting under variable history lengths. TaPD comprises two cooperative modules: an Observation-Adaptive Forecaster (OAF) for future prediction and a Temporal Backfilling Module (TBM) for explicit reconstruction of the past. OAF is built on progressive knowledge distillation (PKD), which transfers motion pattern knowledge from long-horizon "teachers" to short-horizon "students" via hierarchical feature regression, enabling short observations to recover richer motion context. We further introduce a cosine-annealed distillation weighting scheme to balance forecasting supervision and feature alignment, improving optimization stability and cross-length consistency. For extremely short histories where implicit alignment is insufficient, TBM backfills missing historical segments conditioned on scene evolution, producing context-rich trajectories that strengthen PKD and thereby improve OAF. We employ a decoupled pretrain-reconstruct-finetune protocol to preserve real-motion priors while adapting to backfilled inputs. Extensive experiments on Argoverse 1 and Argoverse 2 show that TaPD consistently outperforms strong baselines across all observation lengths, delivers especially large gains under very short inputs, and improves other predictors (e.g., HiVT) in a plug-and-play manner. Code will be available at https://github.com/zhouhao94/TaPD.
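A cosine-annealed weighting between forecasting supervision and feature alignment could look like the sketch below. The abstract does not give the schedule's endpoints or direction, so the choice here (distillation weight decaying from `w_start` to `w_end` over training) and the function names are assumptions made for illustration.

```python
import math


def cosine_annealed_weight(step, total_steps, w_start=1.0, w_end=0.0):
    """Move smoothly from w_start (step 0) to w_end (step total_steps)
    along half a cosine period."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return w_end + 0.5 * (w_start - w_end) * (1.0 + math.cos(math.pi * progress))


def total_loss(forecast_loss, distill_loss, step, total_steps):
    """Forecasting supervision plus a distillation term scaled by the schedule."""
    return forecast_loss + cosine_annealed_weight(step, total_steps) * distill_loss


for step in (0, 25, 50, 75, 100):
    print(step, round(cosine_annealed_weight(step, 100), 3))
```

With these defaults the distillation term dominates early, when the short-history student has the most to gain from aligning with teacher features, and fades so that the forecasting loss drives the final updates. Swapping `w_start` and `w_end` would give the opposite emphasis.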
[41] GazeMoE: Perception of Gaze Target with Mixture-of-Experts
Zhuangzhuang Dai, Zhongxi Lu, Vincent G. Zakka, Luis J. Manso, Jose M Alcaraz Calero, Chen Li
🧩 TL;DR
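This paper proposes GazeMoE, an end-to-end framework based on Mixture-of-Experts (MoE) that estimates human gaze targets by selectively using multimodal cues from a frozen foundation model, achieving state-of-the-art performance on benchmark datasets.
📘 Detailed Summary
Motivation: Estimating human gaze targets from visible images is essential for robots to understand human attention, but developing generalizable neural architectures and training paradigms remains challenging. In particular, effectively integrating multimodal cues such as eyes, head pose, gestures, and contextual features requires adaptive and efficient decoding mechanisms.
Method: Inspired by the use of Mixture-of-Experts (MoE) for adaptive domain expertise in large vision-language models, the paper proposes GazeMoE, which uses MoE modules to selectively leverage gaze-target-related cues from a frozen foundation model. It adds a class-balancing auxiliary loss and strategic data augmentations, including region-specific cropping and photometric transformations, to address class imbalance in gaze target classification and to improve robustness.
Result: Extensive experiments on benchmark datasets show that GazeMoE achieves state-of-the-art performance on challenging gaze estimation tasks, outperforming existing methods. The code and pre-trained models are publicly released on Hugging Face.
Conclusion: The work demonstrates the effectiveness of MoE architectures for integrating multimodal cues in gaze target estimation, provides practical solutions for class imbalance and model robustness, and opens a new technical route for robots to understand human attention.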
📄 Abstract
Estimating human gaze target from visible images is a critical task for robots to understand human attention, yet the development of generalizable neural architectures and training paradigms remains challenging. While recent advances in pre-trained vision foundation models offer promising avenues for locating gaze targets, the integration of multi-modal cues (including eyes, head poses, gestures, and contextual features) demands adaptive and efficient decoding mechanisms. Inspired by Mixture-of-Experts (MoE) for adaptive domain expertise in large vision-language models, we propose GazeMoE, a novel end-to-end framework that selectively leverages gaze-target-related cues from a frozen foundation model through MoE modules. To address class imbalance in gaze target classification (in-frame vs. out-of-frame) and enhance robustness, GazeMoE incorporates a class-balancing auxiliary loss alongside strategic data augmentations, including region-specific cropping and photometric transformations. Extensive experiments on benchmark datasets demonstrate that GazeMoE achieves state-of-the-art performance, outperforming existing methods on challenging gaze estimation tasks. The code and pre-trained models are released at https://huggingface.co/zdai257/GazeMoE.
[42] Unify the Views: View-Consistent Prototype Learning for Few-Shot Segmentation
Hongli Liu, Yu Wang, Shengjie Zhao
🧩 TL;DR
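This paper proposes VINE (View-Informed NEtwork), a framework that refines class-specific prototypes in few-shot segmentation by jointly modeling structural consistency and foreground discrimination, effectively addressing structural misalignment and cross-view inconsistency under viewpoint changes.
📘 Detailed Summary
Motivation: Few-shot segmentation generalizes to novel classes with limited supervision, but it struggles with structural misalignment and cross-view inconsistency when appearance or viewpoint varies widely. Existing methods are limited on complex structures and viewpoint shifts, so a more robust framework is needed that models structural consistency and foreground discrimination together.
Method: VINE introduces a spatial-view graph: the spatial graph captures local geometric topology, and the view graph connects features from different perspectives to propagate view-invariant structural semantics. A discriminative prior derived from the support-query feature discrepancy captures category-specific contrast, reweights SAM features, and recalibrates backbone activations. The foreground-enhanced SAM features and structurally enriched ResNet features are progressively integrated through masked cross-attention, yielding class-consistent prototypes that serve as adaptive prompts for the SAM decoder.
Result: Extensive experiments on multiple few-shot segmentation benchmarks validate the effectiveness and robustness of VINE, particularly in challenging scenarios with viewpoint shifts and complex structures. The method achieves clear gains on standard evaluation metrics.
Conclusion: By jointly modeling structural consistency and foreground discrimination, VINE provides a unified, view-aware solution for few-shot segmentation. It shows the importance of maintaining structural consistency under viewpoint changes and offers a new technical direction for few-shot segmentation in complex visual scenes.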
📄 Abstract
Few-shot segmentation (FSS) has gained significant attention for its ability to generalize to novel classes with limited supervision, yet remains challenged by structural misalignment and cross-view inconsistency under large appearance or viewpoint variations. This paper tackles these challenges by introducing VINE (View-Informed NEtwork), a unified framework that jointly models structural consistency and foreground discrimination to refine class-specific prototypes. Specifically, VINE introduces a spatial-view graph on backbone features, where the spatial graph captures local geometric topology and the view graph connects features from different perspectives to propagate view-invariant structural semantics. To further alleviate foreground ambiguity, we derive a discriminative prior from the support-query feature discrepancy to capture category-specific contrast, which reweights SAM features by emphasizing salient regions and recalibrates backbone activations for improved structural focus. The foreground-enhanced SAM features and structurally enriched ResNet features are progressively integrated through masked cross-attention, yielding class-consistent prototypes used as adaptive prompts for the SAM decoder to generate accurate masks. Extensive experiments on multiple FSS benchmarks validate the effectiveness and robustness of VINE, particularly under challenging scenarios with viewpoint shifts and complex structures. The code is available at https://github.com/HongliLiu1/VINE-main.
[43] HiPP-Prune: Hierarchical Preference-Conditioned Structured Pruning for Vision-Language Models
Lincen Bai, Hedi Tabia, Raul Santos-Rodriguez
🧩 TL;DR
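This paper proposes HiPP-Prune, a hierarchical preference-conditioned structured pruning framework that treats pruning as resource allocation under multiple objectives, providing controllable robustness-utility trade-offs for vision-language models while reducing object hallucination.
📘 Detailed Summary
Motivation: Efficient deployment of vision-language models is difficult because compression affects not only task utility but also visual grounding, often amplifying object hallucination even at the same sparsity level. A pruning method is needed that balances multiple objectives and keeps hallucination under control.
Method: HiPP-Prune treats pruning as conditional resource allocation under multiple objectives. A single policy invocation outputs a global pruning blueprint by factorizing the decision into an overall sparsity budget and a layer-wise allocation. The policy state integrates a visual sensitivity signal derived from the attention flow between vision tokens and language hidden states. Pruning plans are optimized with plan-level Group Relative Policy Optimization under a multi-objective return.
Result: Experiments on LLaVA with POPE and ScienceQA show that HiPP-Prune discovers diverse non-dominated pruning plans and provides controllable robustness-utility trade-offs under matched sparsity budgets. The multi-objective return combines task utility, hallucination robustness, compression, and a synaptic-flow-inspired stability proxy.
Conclusion: The work provides a systematic pruning framework for vision-language models that supports queryable trade-offs through a user-specified preference vector. It pays particular attention to avoiding over-pruning of the vision-critical layers that facilitate cross-modal fusion, offering a new way to control hallucination in efficient deployment.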
📄 Abstract
Pruning vision-language models (VLMs) for efficient deployment is challenging because compression can affect not only task utility but also visual grounding, often amplifying object hallucinations even at the same sparsity level. We present HiPP-Prune, a hierarchical preference-conditioned structured pruning framework that treats pruning as conditional resource allocation under multiple objectives. HiPP-Prune makes plan-level decisions: a single policy invocation outputs a global pruning blueprint by factorizing decisions into an overall sparsity budget and a layer-wise allocation, enabling queryable trade-offs via a user-specified preference vector. To account for VLM-specific failure modes, our policy state integrates a visual sensitivity signal derived from attention flow between vision tokens and language hidden states, discouraging over-pruning of vision-critical layers that facilitate cross-modal fusion. We optimize pruning plans with plan-level Group Relative Policy Optimization (GRPO) under a multi-objective return that combines task utility, hallucination robustness (POPE), compression, and a synaptic-flow-inspired stability proxy to reduce unproductive exploration in high-sparsity regimes. Experiments on LLaVA with POPE and ScienceQA demonstrate that HiPP-Prune discovers diverse non-dominated pruning plans and provides controllable robustness-utility trade-offs under matched sparsity budgets.
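The factorization into an overall sparsity budget and a layer-wise allocation can be sketched as follows. This is one possible parameterization chosen for illustration, not the paper's: the softmax over allocation logits, the rescaling rule, and the name `pruning_plan` are all assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pruning_plan(budget, allocation_logits, layer_sizes):
    """Turn (global budget, layer-wise logits) into per-layer sparsities.

    Shares from the softmax are rescaled so that the parameter-weighted mean
    of the returned sparsities equals the global budget exactly."""
    shares = softmax(allocation_logits)
    total_params = sum(layer_sizes)
    scale = budget * total_params / sum(a * n for a, n in zip(shares, layer_sizes))
    plan = [scale * a for a in shares]
    if max(plan) > 1.0:
        raise ValueError("allocation is too peaked for this budget")
    return plan


# Three equally sized layers; the lower logit marks a (hypothetical)
# vision-critical layer that should be pruned less.
plan = pruning_plan(budget=0.3, allocation_logits=[0.0, 0.0, -1.0], layer_sizes=[100, 100, 100])
print([round(s, 3) for s in plan])
```

Separating the two decisions means a policy can change how aggressively it prunes overall without reshuffling which layers are protected. A visual sensitivity signal like the one in the abstract could, for instance, be used to lower the logits of vision-critical layers.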
[44] OVGGT: O(1) Constant-Cost Streaming Visual Geometry Transformer
Si-Yu Lu, Po-Ting Chen, Hui-Che Hsu, Sin-Ye Jhong, Wen-Huang Cheng, Yung-Yao Chen
🧩 TL;DR
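This paper proposes OVGGT, a training-free 3D geometry reconstruction framework that processes videos of arbitrary length within a fixed compute and memory budget. Using self-selective caching and dynamic anchor protection, it keeps VRAM consumption constant while maintaining state-of-the-art geometric accuracy.
📘 Detailed Summary
Motivation: Existing geometric foundation models achieve high-quality 3D reconstruction with full attention, but their quadratic cost restricts them to short, offline sequences. Causal-attention variants such as StreamVGGT support single-pass streaming, but their KV cache keeps growing and exhausts GPU memory within hundreds of frames, which rules out long-sequence streaming inference in practice.
Method: OVGGT combines two core techniques. Self-Selective Caching uses FFN residual magnitudes to compress the KV cache while remaining fully compatible with FlashAttention. Dynamic Anchor Protection shields coordinate-critical tokens from eviction, suppressing geometric drift over long trajectories. Together they allow sequences of arbitrary length to be processed within a fixed compute and memory budget.
Result: On indoor, outdoor, and ultra-long sequence benchmarks, OVGGT processes videos of arbitrary length within a constant VRAM budget while reaching state-of-the-art 3D geometric accuracy, clearly outperforming existing streaming methods and making long-sequence streaming inference practical.
Conclusion: The work shows that cache compression combined with protection of key tokens enables streaming 3D reconstruction over unbounded sequences without sacrificing geometric accuracy, providing an efficient solution for long-video geometry understanding in real deployments.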
📄 Abstract
Reconstructing 3D geometry from streaming video requires continuous inference under bounded resources. Recent geometric foundation models achieve impressive reconstruction quality through all-to-all attention, yet their quadratic cost confines them to short, offline sequences. Causal-attention variants such as StreamVGGT enable single-pass streaming but accumulate an ever-growing KV cache, exhausting GPU memory within hundreds of frames and precluding the long-horizon deployment that motivates streaming inference in the first place. We present OVGGT, a training-free framework that bounds both memory and compute to a fixed budget regardless of sequence length. Our approach combines Self-Selective Caching, which leverages FFN residual magnitudes to compress the KV cache while remaining fully compatible with FlashAttention, with Dynamic Anchor Protection, which shields coordinate-critical tokens from eviction to suppress geometric drift over extended trajectories. Extensive experiments on indoor, outdoor, and ultra-long sequence benchmarks demonstrate that OVGGT processes arbitrarily long videos within a constant VRAM envelope while achieving state-of-the-art 3D geometric accuracy.
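The bounded-cache idea can be sketched with a simple eviction rule: keep protected anchor entries unconditionally, then fill the remaining slots with the highest-scoring entries. This is an illustrative toy, not OVGGT's algorithm: the per-token scores are plain numbers standing in for FFN residual magnitudes, anchors are simply the first tokens of the stream, and both function names are hypothetical.

```python
def evict_to_budget(scores, anchor_positions, budget):
    """Return the sorted cache positions to keep.

    Anchors are always kept; the remaining slots go to the highest-scoring
    non-anchor entries."""
    anchors = set(anchor_positions)
    if len(anchors) > budget:
        raise ValueError("budget is smaller than the number of anchors")
    candidates = [i for i in range(len(scores)) if i not in anchors]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    keep = anchors | set(candidates[: budget - len(anchors)])
    return sorted(keep)


def stream_with_fixed_budget(token_scores, budget, num_anchor_tokens=1):
    """Feed tokens one at a time and evict after each insertion.

    Returns the surviving token ids and the peak cache size measured after
    eviction (the cache briefly holds budget + 1 entries before each eviction)."""
    cache = []  # list of (token_id, score)
    peak = 0
    for token_id, score in enumerate(token_scores):
        cache.append((token_id, score))
        anchors = [pos for pos, (tid, _) in enumerate(cache) if tid < num_anchor_tokens]
        keep = evict_to_budget([s for _, s in cache], anchors, budget)
        cache = [cache[pos] for pos in keep]
        peak = max(peak, len(cache))
    return [tid for tid, _ in cache], peak


print(stream_with_fixed_budget([0.1, 0.9, 0.2, 0.8, 0.05, 0.7], budget=3))
# ([0, 1, 3], 3): token 0 survives as an anchor despite its low score
```

However long the stream, the cache never grows past the budget after eviction, which is the property that keeps memory constant. Protecting anchors trades some score-based selectivity for stable reference tokens.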
[45] DEX-AR: A Dynamic Explainability Method for Autoregressive Vision-Language Models
Walid Bousselham, Angie Boggust, Hendrik Strobelt, Hilde Kuehne
🧩 TL;DR
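This paper proposes DEX-AR, an explainability method designed for autoregressive vision-language models. It generates per-token and sequence-level 2D heatmaps that reveal the image regions the model relies on, addressing the limitations of traditional methods with complex modality interactions and token-by-token generation.
📘 Detailed Summary
Motivation: As vision-language models become more complex and widely used, understanding their decision-making is increasingly important. Traditional explainability methods designed for classification tasks struggle with modern autoregressive vision-language models because of their token-by-token generation process and the intricate interactions between visual and textual modalities.
Method: DEX-AR computes layer-wise gradients with respect to attention maps during token-by-token generation and produces per-token and sequence-level 2D heatmaps that highlight the image regions most important to the model's textual response. It introduces two key innovations: a dynamic head filtering mechanism that identifies attention heads focused on visual information, and a sequence-level filtering approach that aggregates per-token explanations while distinguishing visually grounded tokens from purely linguistic ones.
Result: Evaluations on ImageNet, VQAv2, and PascalVOC show consistent improvements on both perturbation-based metrics, which use a novel normalized perplexity measure, and segmentation-based metrics.
Conclusion: DEX-AR provides an effective explainability framework for autoregressive vision-language models that handles complex modality interactions and token-by-token generation, offering a new way to understand model decisions and showing strong results across several benchmarks.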
📄 Abstract
As Vision-Language Models (VLMs) become increasingly sophisticated and widely used, understanding their decision-making process becomes increasingly important. Traditional explainability methods, designed for classification tasks, struggle with modern autoregressive VLMs due to their complex token-by-token generation process and intricate interactions between visual and textual modalities. We present DEX-AR (Dynamic Explainability for AutoRegressive models), a novel explainability method designed to address these challenges by generating both per-token and sequence-level 2D heatmaps highlighting image regions crucial for the model's textual responses. The proposed method interprets autoregressive VLMs, including the varying importance of layers and generated tokens, by computing layer-wise gradients with respect to attention maps during the token-by-token generation process. DEX-AR introduces two key innovations: a dynamic head filtering mechanism that identifies attention heads focused on visual information, and a sequence-level filtering approach that aggregates per-token explanations while distinguishing between visually-grounded and purely linguistic tokens. Our evaluation on ImageNet, VQAv2, and PascalVOC shows a consistent improvement in both perturbation-based metrics, using a novel normalized perplexity measure, and segmentation-based metrics.
[46] Exploring Open-Vocabulary Object Recognition in Images using CLIP
Wei Yu Chen, Ying Dai
🧩 TL;DR
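This paper proposes a new open-vocabulary object recognition framework based on a two-stage strategy: object segmentation followed by recognition. It addresses the system complexity, high training cost, and limited generalization of existing methods, and recognizes arbitrary vocabularies without complex retraining or manual annotation.
📘 Detailed Summary
Motivation: Existing open-vocabulary object recognition methods suffer from high system complexity, substantial training costs, and limited generalization. They require complex retraining and large amounts of manual annotation, which limits flexibility and efficiency in practical applications.
Method: The framework follows a streamlined two-stage strategy: segment objects first, then recognize them. After cropping object regions, CLIP generates object-level image embeddings and category-level text embeddings to support arbitrary vocabularies. To reduce reliance on CLIP and increase encoding flexibility, a CNN/MLP-based method is also introduced: it extracts convolutional neural network feature maps, uses a multilayer perceptron to align visual features with text embeddings, and builds a shared representation space via singular value decomposition. Recognition is performed by embedding similarity matching.
Result: Experiments on COCO, Pascal VOC, and ADE20K show that the training-free, CLIP-based encoding without SVD achieves the highest average AP, outperforming current state-of-the-art methods. The results also highlight the potential of CNN/MLP-based image encoding for open-vocabulary object recognition.
Conclusion: The study demonstrates that a streamlined two-stage framework is effective for open-vocabulary object recognition and achieves high performance without complex retraining. The CNN/MLP-based encoding offers a feasible way to reduce dependence on large pretrained models and points toward flexible deployment and better computational efficiency in practice.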
📄 Abstract
To address the limitations of existing open-vocabulary object recognition methods, specifically high system complexity, substantial training costs, and limited generalization, this paper proposes a novel Open-Vocabulary Object Recognition (OVOR) framework based on a streamlined two-stage strategy: object segmentation followed by recognition. The framework eliminates the need for complex retraining and labor-intensive annotation. After cropping object regions, we generate object-level image embeddings alongside category-level text embeddings using CLIP, which facilitates arbitrary vocabularies. To reduce reliance on CLIP and enhance encoding flexibility, we further introduce a CNN/MLP-based method that extracts convolutional neural network (CNN) feature maps and utilizes a multilayer perceptron (MLP) to align visual features with text embeddings. These embeddings are concatenated and processed via Singular Value Decomposition (SVD) to construct a shared representation space. Finally, recognition is performed through embedding similarity matching. Experiments on COCO, Pascal VOC, and ADE20K demonstrate that training-free, CLIP-based encoding without SVD achieves the highest average AP, outperforming current state-of-the-art methods. Simultaneously, the results highlight the potential of CNN/MLP-based image encoding for OVOR.
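The final recognition step, embedding similarity matching, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the three-dimensional vectors stand in for real CLIP embeddings, and the function names `cosine` and `classify` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify(object_embedding, text_embeddings):
    """Return the label whose text embedding is most similar to the crop."""
    return max(
        text_embeddings,
        key=lambda label: cosine(object_embedding, text_embeddings[label]),
    )


# Toy stand-ins for category-level text embeddings (hypothetical values).
text_embeddings = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "car": [0.0, 0.0, 1.0],
}
# Toy stand-in for the embedding of one cropped object region.
crop_embedding = [0.1, 0.9, 0.2]
predicted = classify(crop_embedding, text_embeddings)
```

Because the vocabulary is just the set of keys in `text_embeddings`, adding a new category only requires adding one more text embedding, which is what makes the matching step open-vocabulary.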
[47] K-MaT: Knowledge-Anchored Manifold Transport for Cross-Modal Prompt Learning in Medical Imaging
Jiajun Zeng, Shadi Albarqouni
🧩 TL;DR
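This paper proposes K-MaT (Knowledge-Anchored Manifold Transport), a framework that uses prompt learning and optimal transport to achieve zero-shot cross-modal transfer of medical vision-language models between high-end and low-end imaging modalities. It preserves a consistent decision structure without any low-end training images.
📘 Detailed Summary
Motivation: Large-scale biomedical vision-language models adapted on high-end imaging modalities (e.g., CT) often fail when transferred to frontline low-end modalities (e.g., X-ray). They collapse into modality-specific shortcuts, so cross-modal deployment performance breaks down, and the problem is especially acute when no low-end training images are available.
Method: K-MaT adopts a prompt-learning strategy that factorizes prompts and anchors them to clinical text descriptions. It then uses Fused Gromov-Wasserstein optimal transport to align the low-end prompt manifold with the visually grounded high-end space, transferring decision structures across modalities through manifold transport. The whole process requires no training images from the low-end modality.
Result: On four cross-modal benchmarks (including dermoscopy, mammography to ultrasound, and CT to chest X-ray), K-MaT achieves state-of-the-art performance, raising the average harmonic mean of accuracy to 44.1% (versus 42.0% for BiomedCoOp) and macro-F1 to 36.2%. On the challenging breast imaging task, it mitigates the catastrophic forgetting observed in standard methods such as CoOp, whose low-end accuracy drops to 27.0%.
Conclusion: Aligning prompt manifolds via optimal transport provides an efficient route to zero-shot cross-modal deployment of medical vision-language models. The work shows the importance of preserving a consistent decision structure when low-end training data is unavailable and opens a new direction for applying medical AI models across diverse imaging devices.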
📄 Abstract
Large-scale biomedical vision-language models (VLMs) adapted on high-end imaging (e.g., CT) often fail to transfer to frontline low-end modalities (e.g., radiography), collapsing into modality-specific shortcuts. We propose K-MaT (Knowledge-Anchored Manifold Transport), a prompt-learning framework that transfers decision structures to low-end modalities without requiring low-end training images. K-MaT factorizes prompts, anchors them to clinical text descriptions, and aligns the low-end prompt manifold to the visually-grounded high-end space using Fused Gromov-Wasserstein optimal transport. We evaluate K-MaT on four cross-modal benchmarks, including dermoscopy, mammography to ultrasound, and CT to chest X-ray. K-MaT achieves state-of-the-art results, improving the average harmonic mean of accuracy to 44.1% (from BiomedCoOp's 42.0%) and macro-F1 to 36.2%. Notably, on the challenging breast imaging task, it mitigates the catastrophic forgetting seen in standard methods like CoOp (which drops to 27.0% accuracy on the low-end), preserving robust performance across modalities. Aligning prompt manifolds via optimal transport provides a highly effective route for the zero-shot cross-modal deployment of medical VLMs.
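The headline metric is a harmonic mean of accuracy. The sketch below shows why a harmonic mean is a natural choice for cross-modal evaluation; it assumes the two quantities being combined are the high-end and low-end accuracies, which the abstract does not spell out, and the accuracy values used are hypothetical.

```python
def harmonic_mean(acc_high, acc_low):
    """Harmonic mean of two accuracies in [0, 1]; 0.0 if both are zero."""
    if acc_high + acc_low == 0:
        return 0.0
    return 2.0 * acc_high * acc_low / (acc_high + acc_low)


# Hypothetical accuracies: a model that collapses on the low-end modality
# is penalized much more heavily than by an arithmetic mean.
balanced = harmonic_mean(0.5, 0.5)    # same as the arithmetic mean
collapsed = harmonic_mean(0.9, 0.1)   # arithmetic mean would also be 0.5
```

A model cannot score well on this metric by excelling on one modality alone, which matches the paper's focus on preserving performance across modalities.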
[48] Breaking Smooth-Motion Assumptions: A UAV Benchmark for Multi-Object Tracking in Complex and Adverse Conditions
Jingtao Ye, Kexin Zhang, Xunchi Ma, Yuehan Li, Guangming Zhu, Peiyi Shen, Linhua Jiang, Xiangdong Zhang, Liang Zhang
🧩 TL;DR
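This paper introduces DynUAV, a new benchmark for dynamic UAV-perspective multi-object tracking characterized by intense ego-motion and complex apparent trajectories. It targets the lack of realistic UAV dynamics in existing benchmarks.
📘 Detailed Summary
Motivation: Existing UAV-perspective multi-object tracking benchmarks usually lack the dynamic complexity of real scenes. They mostly feature predictable camera dynamics and linear motion patterns, and do not adequately reflect the observational challenges caused by rapid UAV movement and agile maneuvers.
Method: The authors build the DynUAV benchmark, which comprises 42 video sequences with over 1.7 million bounding box annotations. It covers vehicles, pedestrians, and specialized industrial categories such as excavators, bulldozers, and cranes, and emphasizes challenges induced by ego-motion, including drastic scale changes, viewpoint changes, and motion blur.
Result: A comprehensive evaluation of state-of-the-art trackers on DynUAV shows significant limitations in handling the intertwined challenges of detection and association under dynamic conditions, particularly for the complex apparent trajectories caused by intense ego-motion.
Conclusion: As a rigorous benchmark, DynUAV exposes the shortcomings of current multi-object tracking methods in realistic dynamic UAV scenarios. It is expected to drive progress in real-world UAV-perspective multi-object tracking, and all resources will be made publicly available to support research in the field.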
📄 Abstract
The rapid movements and agile maneuvers of unmanned aerial vehicles (UAVs) induce significant observational challenges for multi-object tracking (MOT). However, existing UAV-perspective MOT benchmarks often lack these complexities, featuring predominantly predictable camera dynamics and linear motion patterns. To address this gap, we introduce DynUAV, a new benchmark for dynamic UAV-perspective MOT, characterized by intense ego-motion and the resulting complex apparent trajectories. The benchmark comprises 42 video sequences with over 1.7 million bounding box annotations, covering vehicles, pedestrians, and specialized industrial categories such as excavators, bulldozers and cranes. Compared to existing benchmarks, DynUAV introduces substantial challenges arising from ego-motion, including drastic scale changes and viewpoint changes, as well as motion blur. Comprehensive evaluations of state-of-the-art trackers on DynUAV reveal their limitations, particularly in managing the intertwined challenges of detection and association under such dynamic conditions, thereby establishing DynUAV as a rigorous benchmark. We anticipate that DynUAV will serve as a demanding testbed to spur progress in real-world UAV-perspective MOT, and we will make all resources available at link.
[49] Prompt Group-Aware Training for Robust Text-Guided Nuclei Segmentation
Yonghuang Wu, Zhenyang Liang, Wenwen Zeng, Xuan Xie, Jinhua Yu
🧩 TL;DR
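This paper proposes a remedy for prompt sensitivity in text-guided medical image segmentation. It organizes semantically equivalent prompts into prompt groups and introduces a group-aware training framework, substantially improving the robustness and generalization of segmentation models in computational pathology.
📘 Detailed Summary
Motivation: Foundation models such as SAM3 support flexible text-guided medical image segmentation, but their predictions are highly sensitive to prompt wording. Even semantically equivalent descriptions can produce inconsistent segmentation masks, which limits reliability in clinical and pathology workflows. The inconsistency caused by prompt sensitivity therefore needs to be addressed.
Method: The method reformulates prompt sensitivity as a group-wise consistency optimization problem. Semantically related prompts are organized into prompt groups that share the same ground-truth mask, and a prompt group-aware training framework is introduced. It includes a quality-guided group regularization that uses segmentation loss as an implicit ranking signal and a logit-level consistency constraint with a stop-gradient strategy. The model architecture is not modified and inference is unchanged.
Result: On multi-dataset nuclei segmentation benchmarks, the method achieves consistent gains under textual prompting and markedly reduces performance variance across prompt quality levels. On six zero-shot cross-dataset tasks, it improves Dice by an average of 2.16 points, demonstrating improved robustness and generalization for vision-language segmentation in computational pathology.
Conclusion: By redefining prompt sensitivity and introducing a group-consistency optimization framework, the study provides an effective way to make text-guided medical image segmentation more robust. It shows that prediction consistency and cross-dataset generalization can be improved substantially without changing the model architecture, which matters for reliable clinical use.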
📄 Abstract
Foundation models such as Segment Anything Model 3 (SAM3) enable flexible text-guided medical image segmentation, yet their predictions remain highly sensitive to prompt formulation. Even semantically equivalent descriptions can yield inconsistent masks, limiting reliability in clinical and pathology workflows. We reformulate prompt sensitivity as a group-wise consistency problem. Semantically related prompts are organized into prompt groups sharing the same ground-truth mask, and a prompt group-aware training framework is introduced for robust text-guided nuclei segmentation. The approach combines (i) a quality-guided group regularization that leverages segmentation loss as an implicit ranking signal, and (ii) a logit-level consistency constraint with a stop-gradient strategy to align predictions within each group. The method requires no architectural modification and leaves inference unchanged. Extensive experiments on multi-dataset nuclei benchmarks show consistent gains under textual prompting and markedly reduced performance variance across prompt quality levels. On six zero-shot cross-dataset tasks, our method improves Dice by an average of 2.16 points. These results demonstrate improved robustness and generalization for vision-language segmentation in computational pathology.
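One way to picture the logit-level consistency constraint is sketched below. The abstract does not give the exact loss, so everything here is an assumption made for illustration: the prompt with the lowest segmentation loss is taken as the anchor (a simple reading of "segmentation loss as an implicit ranking signal"), the anchor's logits are treated as a fixed target (the role a stop-gradient would play in an autodiff framework), and the penalty is a mean squared difference. The function name `group_consistency_loss` is hypothetical.

```python
def group_consistency_loss(group_logits, group_seg_losses):
    """Mean squared gap between each prompt's logits and the anchor's logits.

    group_logits: one flat list of per-pixel logits per prompt in the group.
    group_seg_losses: segmentation loss of each prompt (lower is better).
    """
    # Assumed anchor choice: the prompt with the lowest segmentation loss.
    anchor_idx = min(range(len(group_seg_losses)),
                     key=group_seg_losses.__getitem__)
    # The anchor is used as a constant target; in a real training loop this
    # is where a stop-gradient would be applied.
    anchor = group_logits[anchor_idx]

    total, count = 0.0, 0
    for i, logits in enumerate(group_logits):
        if i == anchor_idx:
            continue
        total += sum((p - a) ** 2 for p, a in zip(logits, anchor))
        count += len(logits)
    return total / count if count else 0.0


# Three prompts describing the same nuclei, two pixels each (toy values).
logits = [[2.0, -1.0], [1.0, -1.0], [2.0, 0.0]]
seg_losses = [0.1, 0.4, 0.3]
loss = group_consistency_loss(logits, seg_losses)
```

Since the term only affects training, a model trained this way is queried at inference exactly as before, consistent with the claim that inference is unchanged.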
[50] Towards High-resolution and Disentangled Reference-based Sketch Colorization
Dingkun Yan, Xinrui Wang, Ru Wang, Zhuoru Li, Jinze Yu, Yusuke Iwasawa, Yutaka Matsuo, Jiaxian Guo
🧩 TL;DR
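This paper proposes a novel dual-branch framework that tackles distribution shift in sketch colorization by explicitly modeling the data distributions of training and inference, yielding colorization with higher quality, higher resolution, and better controllability.
📘 Detailed Summary
Motivation: Sketch colorization is a key task for automating the creation of animations and digital illustrations. Prior work mainly mitigated the artifacts caused by distribution shift rather than resolving the underlying problem. This paper aims to directly minimize the distribution shift between training data and diverse test data.
Method: A dual-branch framework explicitly models the data distributions of training and inference with a semantic-aligned branch and a semantic-misaligned branch, respectively. A Gram regularization loss enforces cross-domain distribution coherence and stability. An anime-specific Tagger network extracts fine-grained attributes to modulate SDXL's conditional encoders, and a plugin module enhances texture transfer.
Result: Quantitative and qualitative comparisons, together with user studies, show that the method effectively overcomes the distribution shift challenge and reaches state-of-the-art performance on both quality and controllability metrics. An ablation study reveals the influence of each component.
Conclusion: By addressing distribution shift directly instead of mitigating its artifacts, the study offers a more fundamental solution for sketch colorization. The framework achieves high-quality, high-resolution, and highly controllable colorization and suggests new ideas for distribution alignment in related fields.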
📄 Abstract
Sketch colorization is a critical task for automating and assisting in the creation of animations and digital illustrations. Previous research identified the primary difficulty as the distribution shift between semantically aligned training data and highly diverse test data, and focused on mitigating the artifacts caused by the distribution shift instead of fundamentally resolving the problem. In this paper, we present a framework that directly minimizes the distribution shift, thereby achieving superior quality, resolution, and controllability of colorization. We propose a dual-branch framework to explicitly model the data distributions of the training process and inference process with a semantic-aligned branch and a semantic-misaligned branch, respectively. A Gram Regularization Loss is applied across the feature maps of both branches, effectively enforcing cross-domain distribution coherence and stability. Furthermore, we adopt an anime-specific Tagger Network to extract fine-grained attributes from reference images and modulate SDXL's conditional encoders to ensure precise control, and a plugin module to enhance texture transfer. Quantitative and qualitative comparisons, alongside user studies, confirm that our method effectively overcomes the distribution shift challenge, establishing state-of-the-art performance across both quality and controllability metrics. An ablation study reveals the influence of each component.
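A Gram-matrix loss compares channel-by-channel feature correlations rather than the features themselves. The sketch below shows the general idea on tiny feature maps; the normalization by the number of spatial positions and by the number of matrix entries is an assumption, and the authors' Gram Regularization Loss may differ in detail.

```python
def gram_matrix(features):
    """Channel correlation matrix.

    features: list of C channels, each a flat list of N spatial activations.
    Returns a C x C matrix, normalized by N (an assumed normalization).
    """
    n = len(features[0])
    return [
        [sum(a * b for a, b in zip(fi, fj)) / n for fj in features]
        for fi in features
    ]


def gram_loss(features_a, features_b):
    """Mean squared difference between the Gram matrices of two branches."""
    ga, gb = gram_matrix(features_a), gram_matrix(features_b)
    c = len(ga)
    return sum(
        (ga[i][j] - gb[i][j]) ** 2 for i in range(c) for j in range(c)
    ) / (c * c)


# Toy feature maps with 2 channels and 2 spatial positions per branch.
aligned_branch = [[1.0, 0.0], [0.0, 1.0]]
misaligned_branch = [[1.0, 1.0], [0.0, 0.0]]
loss = gram_loss(aligned_branch, misaligned_branch)
```

Because the Gram matrix discards spatial layout, the loss can pull the two branches toward the same feature statistics even when their inputs are not spatially aligned.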
[51] Physical Simulator In-the-Loop Video Generation
Lin Geng Foo, Mark He Huang, Alexandros Lattas, Stylianos Moschoglou, Thabo Beeler, Christian Theobalt
🧩 TL;DR
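This paper proposes PSIVG, a framework that integrates a physical simulator into video diffusion generation so that simulated physics guides the generation of physically consistent motion. It also proposes a test-time texture consistency optimization technique to keep textures consistent.
📘 Detailed Summary
Motivation: Diffusion-based video generation has made notable progress in visual realism, but it often violates basic physical laws such as gravity, inertia, and collision. Generated objects move inconsistently across frames, show implausible dynamics, or break physical constraints, which limits the realism and reliability of AI-generated video.
Method: PSIVG integrates a physical simulator into the video diffusion process. It first reconstructs the 4D scene and foreground object meshes from a template video generated by a pretrained diffusion model, then initializes these elements in a physical simulator and generates physically consistent trajectories. The simulated trajectories guide the video generator toward spatio-temporally physically coherent motion. A test-time texture consistency optimization technique adapts text and feature embeddings based on pixel correspondences from the simulator to improve texture consistency.
Result: Comprehensive experiments show that videos generated by PSIVG adhere better to real-world physics while preserving visual quality and diversity. The method significantly outperforms existing methods in physical consistency and improves physical plausibility without sacrificing generation quality.
Conclusion: The study demonstrates the effectiveness of integrating a physical simulator into a generative framework and offers a new direction for physics-aware video generation. It shows that incorporating domain knowledge such as physical laws can markedly improve the realism and reliability of generated content and lays groundwork for generating more complex physical interactions.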
📄 Abstract
Recent advances in diffusion-based video generation have achieved remarkable visual realism but still struggle to obey basic physical laws such as gravity, inertia, and collision. Generated objects often move inconsistently across frames, exhibit implausible dynamics, or violate physical constraints, limiting the realism and reliability of AI-generated videos. We address this gap by introducing Physical Simulator In-the-loop Video Generation (PSIVG), a novel framework that integrates a physical simulator into the video diffusion process. Starting from a template video generated by a pre-trained diffusion model, PSIVG reconstructs the 4D scene and foreground object meshes, initializes them within a physical simulator, and generates physically consistent trajectories. These simulated trajectories are then used to guide the video generator toward spatio-temporally physically coherent motion. To further improve texture consistency during object movement, we propose a Test-Time Texture Consistency Optimization (TTCO) technique that adapts text and feature embeddings based on pixel correspondences from the simulator. Comprehensive experiments demonstrate that PSIVG produces videos that better adhere to real-world physics while preserving visual quality and diversity. Project Page: https://vcai.mpi-inf.mpg.de/projects/PSIVG/
[52] Do Foundation Models Know Geometry? Probing Frozen Features for Continuous Physical Measurement
Yakov Pyotr Shkolnikov
🧩 TL;DR
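This study finds a pronounced cross-modal bottleneck in how vision-language models express geometric information. It proposes extracting geometry with lightweight linear probes instead of text generation, allowing a single frozen backbone to act as a multi-task geometric sensor.
📘 Detailed Summary
Motivation: The study addresses the problem that continuous geometric information encoded by the visual pathway of vision-language models is not adequately expressed by the text pathway. It exposes the resulting information bottleneck between modalities and examines the relative importance of training objective and architecture for encoding geometry.
Method: Linear probes are used to analyze geometric information in frozen visual features, LoRA fine-tuning is used to narrow the gap between modalities, encoders from five different training paradigms are compared, and a layer-wise analysis identifies the network regions that carry geometric signal.
Result: A linear probe extracts hand joint angles from frozen features at 6.1 degrees MAE, while the best text output reaches only 20.0 degrees, a 3.3x bottleneck. LoRA fine-tuning narrows the gap to 6.5 degrees. The five encoders converge functionally to statistically equivalent accuracy despite large differences in representational similarity. Autoregressive generation damages geometric fidelity, but the damage originates in the generation process rather than in language alignment. Layer-wise analysis shows that all architectures peak in accuracy at mid-network layers, with attention heads in layers 18-22 carrying disproportionate geometric signal.
Conclusion: The study shows that training objective affects the accuracy of geometric encoding more than architecture does, and reveals that functional convergence can occur without representational convergence. Lightweight probes extract geometric information more effectively than text generation, so a single frozen backbone can serve as a multi-task geometric sensor without fine-tuning or text generation.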
📄 Abstract
Vision-language models encode continuous geometry that their text pathway fails to express: a 6,000-parameter linear probe extracts hand joint angles at 6.1 degrees MAE from frozen features, while the best text output achieves only 20.0 degrees -- a 3.3x bottleneck. LoRA fine-tuning (r=16, 2,000 images) narrows this gap to 6.5 degrees, providing evidence for a pathway-training deficit rather than a representational one. Training objective determines accuracy more than architecture: five encoders spanning self-supervised, contrastive, and hybrid paradigms converge to statistically equivalent accuracy (R^2 approximately 0.55, TOST-equivalent at delta=0.03) despite sharing as little as CKA=0.41 representational similarity -- functional convergence without representational convergence. Autoregressive generation damages geometric fidelity, but the damage originates in the generation process, not in language alignment: Qwen2.5-VL's LLM layers actually improve probe accuracy over its raw vision encoder. Layer-wise analysis reveals a universal mid-network accuracy peak across all architectures, with attention heads in layers 18-22 carrying disproportionate geometric signal. These findings enable a single frozen backbone to function as a multi-task geometric sensor through lightweight probes, without fine-tuning or text generation.
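The two headline numbers can be related with a few lines of arithmetic. The sketch below defines mean absolute error and a one-output linear probe, then recomputes the reported bottleneck ratio from the two MAE figures in the abstract. The helper names and the toy angle values are hypothetical; only 6.1 and 20.0 come from the source.

```python
def mean_absolute_error(predicted, target):
    """Average absolute difference, e.g. in degrees for joint angles."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)


def linear_probe(features, weights, bias):
    """A linear probe is a single affine map applied to frozen features."""
    return sum(w * x for w, x in zip(weights, features)) + bias


# Toy check of the metric with made-up angles (degrees).
toy_mae = mean_absolute_error([10.0, 20.0], [12.0, 17.0])

# Figures reported in the abstract.
probe_mae_deg = 6.1
text_mae_deg = 20.0
bottleneck_ratio = text_mae_deg / probe_mae_deg  # about 3.28, reported as 3.3x
```

The probe adds no nonlinearity of its own, so any accuracy it reaches is evidence that the information is already linearly decodable from the frozen features.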
[53] EffectMaker: Unifying Reasoning and Generation for Customized Visual Effect Creation
Shiyuan Yang, Ruihuang Li, Jiale Tao, Shuai Shao, Qinglin Lu, Jing Liao
🧩 TL;DR
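EffectMaker is a unified reasoning-generation framework that uses a multimodal large language model to interpret high-level effect semantics and a diffusion transformer with in-context learning, enabling reference-video-based visual effect customization without per-effect fine-tuning.
📘 Detailed Summary
Motivation: Visual effect generation faces scarce effect-specific data and the difficulty of modeling supernatural or stylized effects. Existing AIGC systems usually require per-effect fine-tuning, which severely limits their scalability and their generalization to new effects.
Method: EffectMaker uses a multimodal large language model to interpret high-level effect semantics and reason about how they should adapt to a target subject, while a diffusion transformer captures fine-grained visual cues from reference videos through in-context learning. Together they form a semantic-visual dual-path guidance mechanism. The authors also construct EffectData, the largest high-quality synthetic dataset of its kind, with 130k videos across 3k effect categories.
Result: Experiments show that EffectMaker outperforms state-of-the-art baselines in visual quality and effect consistency, producing accurate, controllable, and effect-consistent synthesis and generating high-quality customized effects without per-effect fine-tuning.
Conclusion: The study offers a scalable and flexible paradigm for customized visual effect generation. Its unified reasoning-generation framework addresses the scarcity of effect data and limited generalization, opening a new route to high-quality effect generation.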
📄 Abstract
Visual effects (VFX) are essential for enhancing the expressiveness and creativity of video content, yet producing high-quality effects typically requires expert knowledge and costly production pipelines. Existing AIGC systems face significant challenges in VFX generation due to the scarcity of effect-specific data and the inherent difficulty of modeling supernatural or stylized effects. Moreover, these approaches often require per-effect fine-tuning, which severely limits their scalability and generalization to novel VFX. In this work, we present EffectMaker, a unified reasoning-generation framework that enables reference-based VFX customization. EffectMaker employs a multimodal large language model to interpret high-level effect semantics and reason about how they should adapt to a target subject, while a diffusion transformer leverages in-context learning to capture fine-grained visual cues from reference videos. These two components form a semantic-visual dual-path guidance mechanism that enables accurate, controllable, and effect-consistent synthesis without per-effect fine-tuning. Furthermore, we construct EffectData, the largest high-quality synthetic dataset containing 130k videos across 3k VFX categories, to improve generalization and scalability. Experiments show that EffectMaker achieves superior visual quality and effect consistency over state-of-the-art baselines, offering a scalable and flexible paradigm for customized VFX generation. Project page: https://effectmaker.github.io
[54] StruVis: Enhancing Reasoning-based Text-to-Image Generation via Thinking with Structured Vision
Yuanhuiyi Lyu, Kaiyu Lei, Ziqiao Weng, Xu Zheng, Lutao Jiang, Teng Li, Yangfu Li, Ziyuan Huang, Linfeng Zhang, Xuming Hu
🧩 TL;DR
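This paper proposes StruVis, a framework that strengthens reasoning in text-to-image generation through structured visual representations. It uses text-based structured visual representations as intermediate reasoning states, avoiding the computational overhead and visual limitations of earlier methods and providing efficient, generator-agnostic reasoning enhancement.
📘 Detailed Summary
Motivation: Existing reasoning-based text-to-image methods have clear limitations. Text-only reasoning is computationally efficient but lacks visual context and often omits key spatial and visual elements. Text-image interleaved reasoning improves visual grounding but is computationally expensive and bounded by the generator's representational capacity, which constrains the reasoning potential of multimodal large language models.
Method: StruVis adopts a "Thinking with Structured Vision" approach, using text-based structured visual representations as intermediate reasoning states so that a multimodal large language model can effectively "perceive" visual structure within a purely text-based reasoning process. Structured-vision-guided reasoning unlocks the model's reasoning potential for text-to-image generation. The framework is generator-agnostic and integrates with a range of text-to-image generators.
Result: Experiments show that StruVis achieves significant improvements on reasoning-based text-to-image benchmarks, with a 4.61% gain on T2I-ReasonBench and a 4% gain on WISE, demonstrating its effectiveness in enhancing generator performance.
Conclusion: Through structured visual representations, StruVis addresses the limitations of existing reasoning methods and provides a computationally efficient, generator-agnostic enhancement. It opens a new route to stronger reasoning in text-to-image generation and shows the value of structured-vision guidance for understanding complex prompts.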
📄 Abstract
Reasoning-based text-to-image (T2I) generation requires models to interpret complex prompts accurately. Existing reasoning frameworks can be broadly categorized into two types: (1) Text-Only Reasoning, which is computationally efficient but lacks access to visual context, often resulting in the omission of critical spatial and visual elements; and (2) Text-Image Interleaved Reasoning, which leverages a T2I generator to provide visual references during the reasoning process. While this approach enhances visual grounding, it incurs substantial computational costs and constrains the reasoning capacity of MLLMs to the representational limitations of the generator. To this end, we propose StruVis, a novel framework that enhances T2I generation through Thinking with Structured Vision. Instead of relying on intermediate image generation, StruVis employs text-based structured visual representations as intermediate reasoning states, thereby enabling the MLLM to effectively "perceive" visual structure within a purely text-based reasoning process. Powered by this, the reasoning potential for T2I generation of the MLLM is unlocked through structured-vision-guided reasoning. Additionally, as a generator-agnostic reasoning framework, our proposed StruVis can be seamlessly integrated with diverse T2I generators and efficiently enhance their performance in reasoning-based T2I generation. Extensive experiments demonstrate that StruVis achieves significant performance improvements on reasoning-based T2I benchmarks, e.g., a 4.61% gain on T2I-ReasonBench and a 4% gain on WISE.
[55] SUREON: A Benchmark and Vision-Language-Model for Surgical Reasoning
Alejandra Perez, Anita Rau, Lee White, Busisiwe Mlambo, Chinedu Nwoye, Muhammad Abdullah Jamal, Omid Mohareri
🧩 TL;DR
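This paper introduces SUREON, a large-scale video question-answering dataset that extracts expert reasoning signals from surgical lecture videos, and develops the SureonVLM and SureonVLM-R1 models, which substantially improve the reasoning ability of surgical AI and exceed 84% accuracy on the benchmark.
📘 Detailed Summary
Motivation: Current surgical AI lacks higher-level reasoning abilities such as understanding surgical intent, decision rationale, and risk anticipation, mainly because training data that encodes expert reasoning is extremely hard to annotate at scale. Expert narration in surgical lecture videos contains exactly these reasoning signals, but it is noisy and unstructured.
Method: The SUREON dataset is built with a multi-agent pipeline that systematically extracts structured supervision from surgical academic videos. It defines 12 question categories covering safety assessment, decision rationale, and forecasting. Two models are developed: SureonVLM, a vision-language model adapted through supervised fine-tuning, and SureonVLM-R1, a reasoning model trained with Group Relative Policy Optimization.
Result: SUREON contains 134.7K video clips, 170 procedure types, and 206.8k question-answer pairs, plus an expert-validated benchmark of 354 examples. SureonVLM and SureonVLM-R1 exceed 84% accuracy on the SUREON benchmark, substantially outperform larger general-domain models, and also perform better on standard surgical perception tasks. SureonVLM-R1 exhibits explicit reasoning behavior, such as inferring operative intent from visual context.
Conclusion: The study shows that supervision extracted from surgical lecture videos can effectively train AI models in higher-level surgical reasoning, providing a feasible path for surgical AI to move from pure perception toward understanding intent and decisions. The multi-agent pipeline offers a scalable way to obtain expert knowledge from unstructured teaching resources.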
📄 Abstract
Surgeons don't just see -- they interpret. When an expert observes a surgical scene, they understand not only what instrument is being used, but why it was chosen, what risk it poses, and what comes next. Current surgical AI cannot answer such questions, largely because training data that explicitly encodes surgical reasoning is immensely difficult to annotate at scale. Yet surgical video lectures already contain exactly this -- explanations of intent, rationale, and anticipation, narrated by experts for the purpose of teaching. Though inherently noisy and unstructured, these narrations encode the reasoning that surgical AI currently lacks. We introduce SUREON, a large-scale video QA dataset that systematically harvests this training signal from surgical academic videos. SUREON defines 12 question categories covering safety assessment, decision rationale, and forecasting, and uses a multi-agent pipeline to extract and structure supervision at scale. Across 134.7K clips and 170 procedure types, SUREON yields 206.8k QA pairs and an expert-validated benchmark of 354 examples. To evaluate the extent to which this supervision translates to surgical reasoning ability, we introduce two models: SureonVLM, a vision-language model adapted through supervised fine-tuning, and SureonVLM-R1, a reasoning model trained with Group Relative Policy Optimization. Both models can answer complex questions about surgery and substantially outperform larger general-domain models, exceeding 84% accuracy on the SUREON benchmark while outperforming general-domain models on standard surgical perception tasks. Qualitative analysis of SureonVLM-R1 reveals explicit reasoning behavior, such as inferring operative intent from visual context.
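SureonVLM-R1 is trained with Group Relative Policy Optimization, which scores each sampled answer relative to the other answers drawn for the same question. The sketch below shows the commonly used group-normalized advantage; it is not the authors' training code, the reward values are hypothetical, and implementations differ on details such as population versus sample standard deviation.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of answers to the same question.

    Each advantage is (reward - group mean) / (group std + eps).
    Population standard deviation is used here as an assumption.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards: 1.0 for a correct answer, 0.0 for an incorrect one.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

When every answer in a group receives the same reward, all advantages are zero and that question contributes no learning signal, which is one reason reward design matters for this style of training.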
[56] Occlusion-Aware SORT: Observing Occlusion for Robust Multi-Object Tracking
Chunjiang Li, Jianbo Ma, Li Shen, Yanru Chen, Liangyin Chen
🧩 TL;DR
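This paper proposes OA-SORT, a plug-and-play, training-free occlusion-aware multi-object tracking framework. By introducing an occlusion-aware module, an occlusion-aware offset, and a bias-aware momentum mechanism, it addresses the association confusion caused by partial occlusion in 2D MOT.
📘 Detailed Summary
Motivation: 2D multi-object tracking suffers from association cost confusion caused by partial occlusion, and existing methods perform poorly in occluded scenes. A solution is needed that can identify and handle occlusion status to make tracking more robust.
Method: The Occlusion-Aware SORT framework has three core components. The Occlusion-Aware Module analyzes the occlusion status of objects, using a Gaussian map to reduce background influence. The Occlusion-Aware Offset uses the occlusion status to mitigate cost confusion. The Bias-Aware Momentum mechanism suppresses estimation instability. The framework is designed to be plug-and-play and training-free.
Result: Comprehensive evaluation on DanceTrack, SportsMOT, and MOT17 shows that OA-SORT reaches 63.1% HOTA and 64.2% IDF1 on the DanceTrack test set. Integrating the framework into four additional trackers improves HOTA by an average of 2.08% and IDF1 by 3.05%, demonstrating the reusability and effectiveness of occlusion awareness.
Conclusion: The study shows that occlusion handling is essential for multi-object tracking. The proposed occlusion-aware framework is general and portable, improves the performance of existing trackers, and provides an effective solution for robust tracking in complex scenes.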
📄 Abstract
Multi-object tracking (MOT) involves analyzing object trajectories and counting the number of objects in video sequences. However, 2D MOT faces challenges due to positional cost confusion arising from partial occlusion. To address this issue, we present the novel Occlusion-Aware SORT (OA-SORT) framework, a plug-and-play and training-free framework that includes the Occlusion-Aware Module (OAM), the Occlusion-Aware Offset (OAO), and the Bias-Aware Momentum (BAM). Specifically, OAM analyzes the occlusion status of objects, where a Gaussian Map (GM) is introduced to reduce background influence. In contrast, OAO and BAM leverage the OAM-described occlusion status to mitigate cost confusion and suppress estimation instability. Comprehensive evaluations on the DanceTrack, SportsMOT, and MOT17 datasets demonstrate the importance of occlusion handling in MOT. On the DanceTrack test set, OA-SORT achieves 63.1% and 64.2% in HOTA and IDF1, respectively. Furthermore, integrating the Occlusion-Aware framework into four additional trackers improves HOTA and IDF1 by an average of 2.08% and 3.05%, demonstrating the reusability of the occlusion awareness.
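The Gaussian Map is described as a way to reduce background influence inside a bounding box. A minimal sketch of that idea follows: pixels near the box center get weights close to 1, and pixels near the corners, which are more likely to be background, get smaller weights. The function name, the `sigma_scale` parameter, and tying the standard deviation to the box size are all assumptions; the abstract does not specify the authors' construction.

```python
import math


def gaussian_map(width, height, sigma_scale=0.25):
    """Per-pixel weights for a width x height box, peaking at the center.

    sigma_scale (hypothetical) sets the standard deviation as a fraction
    of the box size along each axis.
    """
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    sx, sy = sigma_scale * width, sigma_scale * height
    return [
        [
            math.exp(-(((x - cx) ** 2) / (2 * sx ** 2)
                       + ((y - cy) ** 2) / (2 * sy ** 2)))
            for x in range(width)
        ]
        for y in range(height)
    ]


# Weights for a 5 x 5 box: 1.0 at the center, decaying toward the corners.
weights = gaussian_map(5, 5)
```

Weighting an overlap or occlusion estimate by such a map makes it less sensitive to the box margins, where the tracked object rarely fills the pixels.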
[57] BEVLM: Distilling Semantic Knowledge from LLMs into Bird's-Eye View Representations
Thomas Monninger, Shaoyuan Xie, Qi Alfred Chen, Sihao Ding
🧩 TL;DR
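This paper proposes BEVLM, a framework that connects a spatially consistent, semantically distilled bird's-eye view representation with large language models. It addresses the redundant computation and limited spatial consistency caused by separated visual processing in autonomous driving, and significantly improves cross-view reasoning and end-to-end driving performance.
📘 Detailed Summary
Motivation: Existing autonomous driving methods feed multi-view, multi-frame images into LLMs independently, leading to redundant computation and limited spatial consistency, which hinders accurate 3D spatial reasoning and geometric coherence across views. Meanwhile, BEV representations learned from geometrically annotated tasks lack the semantic richness of foundation vision encoders, so the gap between spatial structure and semantic understanding needs to be bridged.
Method: BEVLM connects a spatially consistent and semantically distilled BEV representation with LLMs, using BEV features as unified inputs. By distilling semantic knowledge from LLMs into the BEV representation, it strengthens spatial structure and semantic understanding together.
Result: Experiments show that BEVLM improves LLM reasoning accuracy in cross-view driving scenes by 46%. By distilling semantic knowledge from LLMs into BEV representations, it improves closed-loop end-to-end driving performance in safety-critical scenarios by 29%.
Conclusion: The study demonstrates the effectiveness of connecting spatially consistent BEV representations with LLMs. It gives autonomous driving systems a unified representation with both geometric structure and semantic understanding and offers a new direction for complex decision-making and long-tail scenarios.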
📄 Abstract
The integration of Large Language Models (LLMs) into autonomous driving has attracted growing interest for their strong reasoning and semantic understanding abilities, which are essential for handling complex decision-making and long-tail scenarios. However, existing methods typically feed LLMs with tokens from multi-view and multi-frame images independently, leading to redundant computation and limited spatial consistency. This separation in visual processing hinders accurate 3D spatial reasoning and fails to maintain geometric coherence across views. On the other hand, Bird's-Eye View (BEV) representations learned from geometrically annotated tasks (e.g., object detection) provide spatial structure but lack the semantic richness of foundation vision encoders. To bridge this gap, we propose BEVLM, a framework that connects a spatially consistent and semantically distilled BEV representation with LLMs. Through extensive experiments, we show that BEVLM enables LLMs to reason more effectively in cross-view driving scenes, improving accuracy by 46%, by leveraging BEV features as unified inputs. Furthermore, by distilling semantic knowledge from LLMs into BEV representations, BEVLM significantly improves closed-loop end-to-end driving performance by 29% in safety-critical scenarios.
[58] Ensemble Learning with Sparse Hypercolumns
Julia Dietlmeier, Vayangi Ganepola, Oluwabukola G. Adegboro, Mayug Maniparambil, Claudia Mazo, Noel E. O'Connor
🧩 TL;DR
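This study proposes handling high-dimensional hypercolumns with stratified subsampling to reduce the computational cost of medical image segmentation, and reports a significant performance gain over a UNet baseline in the low-shot setting, where it also evaluates ensemble learning.
📘 Detailed Summary
Motivation: High-dimensional hypercolumns are promising for image segmentation, but their practical use is limited because the computational cost grows linearly with the size of the training set. The study addresses this challenge, particularly for low-shot medical image segmentation, and explores how to use hypercolumn features efficiently.
Method: Stratified subsampling is applied to VGG16-based hypercolumn features to reduce computational cost. The study also systematically evaluates ensemble learning on sparse hypercolumns, including stacking and voting ensembles, and compares them with a simple logistic regression classifier, with particular attention to the extreme low-shot case.
Result: Experiments on a brain tumor dataset show a best average Dice score of 0.66 at a 10% subsampling rate with N=20. In the extreme low-shot case (N≤20), a simple logistic regression classifier performs best. Compared with a standard multi-scale UNet baseline, the method achieves a statistically significant improvement of 24.53% (p-value = 3.07e-11, Wilcoxon signed-rank test).
Conclusion: The study shows that stratified subsampling is an effective strategy for handling high-dimensional hypercolumns and substantially reduces the computational burden. In extreme low-shot settings, a simple classifier outperforms more complex ensembles, which offers useful guidance for low-shot learning in medical image segmentation. The method addresses the practical cost of hypercolumns while maintaining performance.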
📄 Abstract
Directly inspired by findings in biological vision, high-dimensional hypercolumns are feature vectors built by concatenating multi-scale activations of convolutional neural networks for a single image pixel location. Together with powerful classifiers, they can be used for image segmentation, i.e., pixel classification. However, in practice, there are only very few works dedicated to the use of hypercolumns. One reason is the computational complexity of processing concatenated dense hypercolumns that grows linearly with the size $N$ of the training set. In this work, we address this challenge by applying stratified subsampling to the VGG16-based hypercolumns. Furthermore, we investigate the performance of ensemble learning on sparse hypercolumns. Our experiments on a brain tumor dataset show that stacking and voting ensembles deliver competitive performance, but in the extreme low-shot case of $N \leq 20$, a simple Logistic Regression classifier is the most effective method. For 10% stratified subsampling rate, our best average Dice score is 0.66 for $N=20$. This is a statistically significant improvement of 24.53% over the standard multi-scale UNet baseline ($p$-value = $3.07 \times 10^{-11}$, Wilcoxon signed-rank test), which is less effective due to overfitting.
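Stratified subsampling and the Dice score can both be illustrated in a few lines. The sketch below samples pixel indices per class so that a rare class such as tumor is not lost at a 10% rate, and computes Dice on binary masks. It is a generic illustration with hypothetical function names and toy labels, not the authors' pipeline, and it operates on label lists rather than real hypercolumn features.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, rate, seed=0):
    """Keep roughly `rate` of the pixel indices from each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    kept = []
    for indices in by_class.values():
        # Keep at least one pixel per class so rare classes survive.
        k = max(1, round(rate * len(indices)))
        kept.extend(rng.sample(indices, k))
    return sorted(kept)


def dice_score(pred, truth):
    """Dice coefficient for two binary masks given as flat 0/1 lists."""
    intersection = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    total = sum(pred) + sum(truth)
    return 2.0 * intersection / total if total else 1.0


# Toy image with 90 background pixels and 10 tumor pixels.
labels = [0] * 90 + [1] * 10
kept = stratified_subsample(labels, rate=0.1)
tumor_kept = sum(labels[i] for i in kept)
```

Only the hypercolumns at the kept indices would then be passed to the classifier, which is where the reduction in computation comes from.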
[59] FontUse: A Data-Centric Approach to Style- and Use-Case-Conditioned In-Image Typography
Xia Xin, Yuki Endo, Yoshihiro Kanamori
🧩 TL;DR
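This paper proposes a data-centric approach to improving typography control in text-to-image models. It builds FontUse, a large-scale typography dataset, and fine-tunes models with a structured annotation pipeline so that they render the requested font styles and use cases more accurately from user-friendly prompts.
📘 Detailed Summary
Motivation: Current text-to-image models generate high-quality images but control typography poorly. They often ignore or only weakly follow the typographic appearance a user requests, which limits their usefulness in applications that need precise font control.
Method: The study takes a data-centric approach and develops a structured annotation pipeline specialized for typography, producing FontUse, a large-scale typography dataset of about 70K images. Annotations are generated automatically with segmentation models and multimodal large language models and include user-friendly prompts, text-region locations, and OCR-recognized strings. The prompts explicitly combine font styles and use cases, so that models can learn to interpret these conditions consistently through fine-tuning, without architectural changes.
Result: A Long-CLIP-based metric is introduced to measure the alignment between generated typography and requested attributes. Experiments across diverse prompts and layouts show that models trained with the pipeline produce text renderings more consistent with the prompts than competitive baselines.
Conclusion: The study demonstrates the effectiveness of a data-centric approach to typography control. Through a structured annotation pipeline and a large-scale dataset, text-to-image models can better understand and generate typography that matches specific font styles and use cases, opening a new route to typography-aware image generation.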
📄 Abstract
Recent text-to-image models can generate high-quality images from natural-language prompts, yet controlling typography remains challenging: requested typographic appearance is often ignored or only weakly followed. We address this limitation with a data-centric approach that trains image generation models using targeted supervision derived from a structured annotation pipeline specialized for typography. Our pipeline constructs a large-scale typography-focused dataset, FontUse, consisting of about 70K images annotated with user-friendly prompts, text-region locations, and OCR-recognized strings. The annotations are automatically produced using segmentation models and multimodal large language models (MLLMs). The prompts explicitly combine font styles (e.g., serif, script, elegant) and use cases (e.g., wedding invitations, coffee-shop menus), enabling intuitive specification even for novice users. Fine-tuning existing generators with these annotations allows them to consistently interpret style and use-case conditions as textual prompts without architectural modification. For evaluation, we introduce a Long-CLIP-based metric that measures alignment between generated typography and requested attributes. Experiments across diverse prompts and layouts show that models trained with our pipeline produce text renderings more consistent with prompts than competitive baselines. The source code for our annotation pipeline is available at https://github.com/xiaxinz/FontUSE.
[60] GenHOI: Towards Object-Consistent Hand-Object Interaction with Temporally Balanced and Spatially Selective Object Injection
Xuan Huang, Mochu Xiang, Zhelun Shen, Jinbo Wu, Chenming Wu, Chen Zhao, Kaisiyuan Wang, Hang Zhou, Shanshan Liu, Haocheng Feng, Wei He, Jingdong Wang
🧩 TL;DR
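This paper proposes GenHOI, a lightweight augmentation that injects reference-object information in a temporally balanced and spatially selective manner, substantially improving the ability of pretrained video generation models to synthesize hand-object interaction in complex in-the-wild scenes.
📘 Detailed Summary
Motivation: Current hand-object interaction reenactment methods are usually trained and evaluated in-domain and generalize poorly to complex in-the-wild scenes. All-in-one video editing models are more broadly robust but still have HOI-specific problems such as inconsistent object appearance. This is a core challenge in digital human video synthesis.
Method: GenHOI relies on two key techniques. For temporal balancing, it proposes Head-Sliding RoPE, which assigns head-specific temporal offsets to reference tokens, spreading their influence evenly across frames and mitigating the temporal decay of 3D RoPE. For spatial selectivity, it designs a two-level spatial attention gate that concentrates object-conditioned attention on HOI regions and adaptively scales its strength, preserving background realism while improving interaction fidelity.
Result: Extensive qualitative and quantitative evaluations on unseen in-the-wild scenes show that GenHOI significantly outperforms state-of-the-art HOI reenactment methods and all-in-one video editing methods in preserving object identity and producing physically plausible contact.
Conclusion: The study shows that a lightweight augmentation strategy can effectively improve pretrained video generation models on complex hand-object interaction tasks, offering a new technical path for digital human video synthesis. Its temporal balancing and spatial selectivity mechanisms provide useful insight into long-range object consistency and interaction fidelity.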
📄 Abstract
Hand-Object Interaction (HOI) remains a core challenge in digital human video synthesis, where models must generate physically plausible contact and preserve object identity across frames. Although recent HOI reenactment approaches have achieved progress, they are typically trained and evaluated in-domain and fail to generalize to complex, in-the-wild scenarios. In contrast, all-in-one video editing models exhibit broader robustness but still struggle with HOI-specific issues such as inconsistent object appearance. In this paper, we present GenHOI, a lightweight augmentation to pretrained video generation models that injects reference-object information in a temporally balanced and spatially selective manner. For temporal balancing, we propose Head-Sliding RoPE, which assigns head-specific temporal offsets to reference tokens, distributing their influence evenly across frames and mitigating the temporal decay of 3D RoPE to improve long-range object consistency. For spatial selectivity, we design a two-level spatial attention gate that concentrates object-conditioned attention on HOI regions and adaptively scales its strength, preserving background realism while enhancing interaction fidelity. Extensive qualitative and quantitative evaluations on unseen, in-the-wild scenes demonstrate that GenHOI significantly outperforms state-of-the-art HOI reenactment and all-in-one video editing methods. Project page: https://xuanhuang0.github.io/GenHOI/
[61] Lyapunov Probes for Hallucination Detection in Large Foundation Models
Bozhi Luan, Gen Li, Yalan Qin, Jifeng Guo, Yun Zhou, Faguo Wu, Hongwei Zheng, Wenjun Wu, Zhaoxin Fan
🧩 TL;DR
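This work proposes a hallucination detection method for large language models grounded in dynamical systems stability theory: it treats (M)LLMs as dynamical systems and uses Lyapunov Probes to separate stable factual regions from unstable, hallucination-prone regions.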
📘 Detailed Summary
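Motivation: The work targets hallucination detection in large language models (LLMs) and multimodal LLMs (MLLMs). Existing methods usually treat hallucination as a simple classification task. This work instead revisits the problem through dynamical systems stability theory, arguing that factual knowledge corresponds to stable equilibrium points in representation space, whereas hallucinations tend to arise at the boundaries of knowledge-transition regions.
Method: The authors propose Lyapunov Probes: lightweight networks trained with derivative-based stability constraints that enforce a monotonic decay in confidence under input perturbations. With systematic perturbation analysis and a two-stage training process, the probes reliably distinguish stable factual regions from unstable, hallucination-prone regions.
Result: Experiments on diverse datasets and models show consistent improvements over existing baselines, supporting the effectiveness and robustness of the stability-based detection framework.
Conclusion: The work offers a new theoretical view of LLM hallucination by applying dynamical systems stability theory to model reliability assessment. Lyapunov Probes provide a lightweight, effective option for hallucination detection in practice and open a new direction for future research.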
📄 Abstract
We address hallucination detection in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) by framing the problem through the lens of dynamical systems stability theory. Rather than treating hallucination as a straightforward classification task, we conceptualize (M)LLMs as dynamical systems, where factual knowledge is represented by stable equilibrium points within the representation space. Our main insight is that hallucinations tend to arise at the boundaries of knowledge-transition regions separating stable and unstable zones. To capture this phenomenon, we propose Lyapunov Probes: lightweight networks trained with derivative-based stability constraints that enforce a monotonic decay in confidence under input perturbations. By performing systematic perturbation analysis and applying a two-stage training process, these probes reliably distinguish between stable factual regions and unstable, hallucination-prone regions. Experiments on diverse datasets and models demonstrate consistent improvements over existing baselines.
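The monotonic-decay constraint can be illustrated with a finite-difference penalty. This is a minimal sketch, not the authors' training loss: the function name `monotonic_decay_penalty` and the hinge form are assumptions, and a finite difference stands in for the derivative-based constraint mentioned in the abstract.

```python
from typing import Sequence


def monotonic_decay_penalty(confidences: Sequence[float]) -> float:
    """Sum of confidence increases along a path of growing perturbation.

    confidences[k] is a probe's confidence at the k-th (increasing)
    perturbation magnitude. The penalty is zero exactly when confidence
    never rises as the perturbation grows.
    """
    penalty = 0.0
    for current, following in zip(confidences, confidences[1:]):
        # Hinge on the finite difference: only increases are penalised.
        penalty += max(0.0, following - current)
    return penalty


stable = [0.95, 0.80, 0.60, 0.30]    # decays monotonically, so no penalty
unstable = [0.90, 0.70, 0.80, 0.40]  # rebounds at the third step
print(monotonic_decay_penalty(stable), monotonic_decay_penalty(unstable))
```

A probe trained with such a term would be pushed towards confidence curves that only fall as the input is perturbed further, which is the qualitative behaviour the abstract describes.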
[62] FedARKS: Federated Aggregation via Robust and Discriminative Knowledge Selection and Integration for Person Re-identification
Xin Xu, Binchang Ma, Zhixi Yu, Wei Liu
🧩 TL;DR
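This paper proposes FedARKS, a framework that uses robust knowledge extraction and knowledge selection to address two problems in federated domain generalization for person re-identification (insufficient capture of local details and uniform treatment of client contributions), significantly improving generalization to unseen domains.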
📘 Detailed Summary
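Motivation: Existing federated domain generalization methods for person re-identification rely mainly on global feature representations and simple model averaging, which has two key limitations. Global features alone struggle to capture domain-invariant local details (such as accessories or textures), and uniform parameter averaging treats all clients as equivalent, ignoring differences in their robust feature extraction ability and diluting the contributions of high-quality clients.
Method: FedARKS comprises two core mechanisms. The RK (Robust Knowledge) mechanism extracts domain-invariant local detail features to complement global representations. The KS (Knowledge Selection) mechanism aggregates clients differentially according to their robust feature extraction ability, so that the contributions of high-quality clients are not diluted.
Result: FedARKS is validated on multiple benchmark datasets and is reported to generalize significantly better to unseen domains than existing mainstream methods, with particular advantages in capturing local details and exploiting knowledge from high-quality clients, reflected in higher re-identification accuracy and stronger cross-domain generalization.
Conclusion: The study highlights the importance of local detail features and differentiated client aggregation in federated domain generalization. FedARKS offers a new approach to privacy-preserving cross-domain generalization and could be extended to other federated learning settings and to finer-grained knowledge selection and fusion strategies.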
📄 Abstract
The application of federated domain generalization in person re-identification (FedDG-ReID) aims to enhance the model's generalization ability in unseen domains while protecting client data privacy. However, existing mainstream methods typically rely on global feature representations and simple averaging operations for model aggregation, leading to two limitations in domain generalization: (1) Using only global features makes it difficult to capture subtle, domain-invariant local details (such as accessories or textures); (2) Uniform parameter averaging treats all clients as equivalent, ignoring their differences in robust feature extraction capabilities, thereby diluting the contributions of high-quality clients. To address these issues, we propose a novel federated learning framework, Federated Aggregation via Robust and Discriminative Knowledge Selection and Integration (FedARKS), comprising two mechanisms: RK (Robust Knowledge) and KS (Knowledge Selection).
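The contrast between uniform averaging and score-weighted aggregation can be sketched as follows. This is an illustration only: the abstract does not say how KS scores clients, so `scores` is a hypothetical per-client quality value and `aggregate` is not the paper's procedure.

```python
from typing import Dict, List


def aggregate(client_params: List[Dict[str, List[float]]],
              scores: List[float]) -> Dict[str, List[float]]:
    """Weighted average of client parameters.

    Equal scores reproduce plain parameter averaging; unequal scores let
    higher-scoring clients contribute more to the global model.
    """
    if not client_params or len(client_params) != len(scores):
        raise ValueError("need one score per client and at least one client")
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must sum to a positive value")
    weights = [s / total for s in scores]
    merged: Dict[str, List[float]] = {}
    for name, reference in client_params[0].items():
        merged[name] = [
            sum(w * params[name][i] for w, params in zip(weights, client_params))
            for i in range(len(reference))
        ]
    return merged


clients = [{"w": [0.0, 2.0]}, {"w": [4.0, 6.0]}]
uniform = aggregate(clients, [1.0, 1.0])   # both clients count equally
selected = aggregate(clients, [1.0, 3.0])  # second client weighted 3x (hypothetical score)
print(uniform, selected)
```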
[63] Spatial Colour Mixing Illusions as a Perception Stress Test for Vision-Language Models
Nicoleta-Nina Basoc, Adrian Cosma, Emilian Radoi
🧩 TL;DR
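This study exposes systematic perceptual weaknesses of vision-language models under structured colour distortions, introduces a programmatic spatial colour mixing framework for evaluating robustness, and shows that human-inspired preprocessing can partially restore model performance.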
📘 Detailed Summary
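Motivation: Although vision-language models (VLMs) perform well on benchmarks, they have systematic perceptual weaknesses: large, structured changes to pixel values can lead to confident but nonsensical predictions even when the underlying scene remains easy for humans to recognize. The study probes this gap between human and model perception and evaluates model robustness to structured colour distortions.
Method: The study introduces Spatial Colour Mixing, a programmatic family of colour distortions that overlays structured patterns (in both RGB and Ostwald colour systems) onto natural images. The framework comprises eight spatial colour mixing variants and is used to evaluate nine VLMs from three model families on four datasets. A human study with 61 participants compares human and model recognition under the same distortions.
Result: Accuracy degrades sharply with increasing distortion across all models and datasets, and scaling the language model does not reliably mitigate the failure. On an animal recognition task, human participants substantially outperform VLMs under the same distortions. A simple human-inspired preprocessing step recovers a meaningful portion of performance for several distortion types.
Conclusion: The study reveals a systematic fragility of VLMs under structured colour changes, indicating that current models lack human-level perceptual robustness. The results support perception-aware preprocessing and tool use as practical strategies for improving VLM robustness and point to directions for building more robust vision-language systems.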
📄 Abstract
Vision-language models (VLMs) achieve strong benchmark results, yet can exhibit systematic perceptual weaknesses: structured, large changes to pixel values can cause confident yet nonsensical predictions, even when the underlying scene remains easily recognizable to humans. We study this gap using Spatial Colour Mixing, a programmatic family of colour distortions that overlays structured patterns (in both RGB and Ostwald colour systems) onto natural images. We introduce a framework of eight spatial colour mixing variants and evaluate nine VLMs across three model families on four datasets. Across models and datasets, accuracy degrades sharply with increasing distortion, and scaling the language model does not reliably mitigate the failure. In a human study with 61 participants on an animal recognition dataset, humans substantially outperform VLMs under the same distortions. Finally, we show that a simple human-inspired preprocessing step recovers a meaningful portion of performance for several distortion types, motivating perception-aware preprocessing and tool-use as practical strategies for improving VLM robustness.
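To make the idea of a programmatic, structured colour overlay concrete, here is a minimal sketch of one such distortion: a checkerboard blend. It is a hypothetical example, not one of the paper's eight variants; the function name, the `cell` size, and the linear blend with strength `alpha` are all assumptions.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]


def checkerboard_mix(image: List[List[Pixel]], colour: Pixel,
                     cell: int, alpha: float) -> List[List[Pixel]]:
    """Blend `colour` into alternating cells of a checkerboard pattern."""
    if cell <= 0 or not 0.0 <= alpha <= 1.0:
        raise ValueError("cell must be positive and alpha must lie in [0, 1]")
    distorted = []
    for y, row in enumerate(image):
        new_row = []
        for x, pixel in enumerate(row):
            if (y // cell + x // cell) % 2 == 0:
                # Inside a pattern cell: linear blend towards the overlay colour.
                pixel = tuple(round((1 - alpha) * p + alpha * c)
                              for p, c in zip(pixel, colour))
            new_row.append(pixel)
        distorted.append(new_row)
    return distorted


grey = [[(100, 100, 100)] * 4 for _ in range(4)]
for row in checkerboard_mix(grey, (255, 0, 0), cell=2, alpha=0.5):
    print(row)
```

Raising `alpha` or shrinking `cell` increases the distortion strength, which is the kind of knob a robustness sweep would vary.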
[64] FreeOcc: Training-free Panoptic Occupancy Prediction via Foundation Models
Andrew Caunes, Thierry Chateau, Vincent Fremont
🧩 TL;DR
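This paper proposes FreeOcc, a training-free 3D scene understanding framework that uses pretrained foundation models to recover semantics and geometry from multi-view images, enabling semantic and panoptic occupancy prediction without 3D supervision.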
📘 Detailed Summary
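Motivation: Current camera-based road scene analysis methods typically depend on costly dense 3D supervision or require training on target-domain data, which limits deployment in unseen environments. A solution that generalizes to new scenes without training is therefore needed.
Method: FreeOcc is a training-free pipeline that uses pretrained foundation models to recover semantics and geometry from multi-view images. It extracts per-view panoptic priors with a promptable foundation segmentation model, reconstructs metric 3D points with a reconstruction foundation model, lifts reliable labels into 3D through depth- and confidence-aware filtering, and voxelizes the result with a deterministic refinement stack.
Result: On Occ3D-nuScenes, FreeOcc reaches 16.9 mIoU and 16.5 RayIoU without training, on par with state-of-the-art weakly supervised methods. Used as a pseudo-label generation pipeline for training downstream models, it reaches 21.1 RayIoU, surpassing the previous weakly supervised baseline. FreeOcc also sets new baselines for panoptic occupancy prediction: 3.1 RayPQ in the training-free setting and 3.9 RayPQ in the weakly supervised setting.
Conclusion: The study shows that foundation-model-driven perception is a practical route to training-free 3D scene understanding. By building on existing pretrained models, it achieves effective scene understanding without costly 3D supervision and offers a feasible path to deployment in unseen environments.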
📄 Abstract
Semantic and panoptic occupancy prediction for road scene analysis provides a dense 3D representation of the ego vehicle's surroundings. Current camera-only approaches typically rely on costly dense 3D supervision or require training models on data from the target domain, limiting deployment in unseen environments. We propose FreeOcc, a training-free pipeline that leverages pretrained foundation models to recover both semantics and geometry from multi-view images. FreeOcc extracts per-view panoptic priors with a promptable foundation segmentation model and prompt-to-taxonomy rules, and reconstructs metric 3D points with a reconstruction foundation model. Depth- and confidence-aware filtering lifts reliable labels into 3D, which are fused over time and voxelized with a deterministic refinement stack. For panoptic occupancy, instances are recovered by fitting and merging robust current-view 3D box candidates, enabling instance-aware occupancy without any learned 3D model. On Occ3D-nuScenes, FreeOcc achieves 16.9 mIoU and 16.5 RayIoU train-free, on par with state-of-the-art weakly supervised methods. When employed as a pseudo-label generation pipeline for training downstream models, it achieves 21.1 RayIoU, surpassing the previous state-of-the-art weakly supervised baseline. Furthermore, FreeOcc sets new baselines for both train-free and weakly supervised panoptic occupancy prediction, achieving 3.1 RayPQ and 3.9 RayPQ, respectively. These results highlight foundation-model-driven perception as a practical route to training-free 3D scene understanding.
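The step of lifting labelled points into a voxel grid can be sketched as a confidence filter followed by a per-voxel majority vote. This is a simplified illustration under stated assumptions: the `min_confidence` threshold, the majority vote, and the function name are hypothetical, and the paper's depth-aware filtering, temporal fusion, and refinement stack are not modelled.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Sequence, Tuple

Voxel = Tuple[int, int, int]


def voxelize_labels(points: Sequence[Tuple[float, float, float]],
                    labels: Sequence[str],
                    confidences: Sequence[float],
                    voxel_size: float,
                    min_confidence: float = 0.5) -> Dict[Voxel, str]:
    """Keep confident points, then give each occupied voxel its majority label."""
    if not len(points) == len(labels) == len(confidences):
        raise ValueError("points, labels and confidences must have equal length")
    if voxel_size <= 0:
        raise ValueError("voxel_size must be positive")
    votes: Dict[Voxel, Counter] = defaultdict(Counter)
    for (x, y, z), label, confidence in zip(points, labels, confidences):
        if confidence < min_confidence:
            continue  # confidence-aware filtering: drop unreliable points
        index = (math.floor(x / voxel_size),
                 math.floor(y / voxel_size),
                 math.floor(z / voxel_size))
        votes[index][label] += 1
    return {index: counter.most_common(1)[0][0] for index, counter in votes.items()}


points = [(0.1, 0.1, 0.1), (0.2, 0.3, 0.1), (0.3, 0.2, 0.4), (1.2, 0.1, 0.1)]
labels = ["car", "car", "road", "road"]
print(voxelize_labels(points, labels, [0.9, 0.8, 0.9, 0.7], voxel_size=0.5))
```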
[65] A Semi-Supervised Framework for Breast Ultrasound Segmentation with Training-Free Pseudo-Label Generation and Label Refinement
Ruili Li, Jiayi Ding, Ruiyu Li, Yilun Jin, Shiwen Ge, Yuwen Zeng, Xiaoyong Zhang, Eichi Takaya, Jan Vrba, Noriyasu Homma
🧩 TL;DR
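This paper proposes a semi-supervised framework for breast ultrasound image segmentation that uses training-free pseudo-label generation and label refinement, exploiting vision-language models for cross-domain structural transfer and reaching performance comparable to fully supervised models with only 2.5% labeled data.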
📘 Detailed Summary
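Motivation: Existing semi-supervised methods for breast ultrasound image segmentation suffer from unstable pseudo labels, especially when annotations are extremely limited, which leads to inaccurate supervision and degraded performance. Vision-language models (VLMs) open a new opportunity for pseudo-label generation, but their effectiveness on breast ultrasound images is limited because domain-specific prompts are difficult to transfer.
Method: The method introduces training-free pseudo-label generation based on simple appearance descriptions (for example, "dark oval"), enabling cross-domain structural transfer between natural and medical images. The resulting structurally consistent pseudo labels warm up a static teacher that captures global structural priors of breast lesions. Combined with an exponential moving average teacher, the framework adds uncertainty entropy weighted fusion and adaptive uncertainty-guided reverse contrastive learning to improve boundary discrimination.
Result: Experiments on four breast ultrasound datasets show that, with only 2.5% labeled data, the method reaches performance comparable to fully supervised models and significantly outperforms existing semi-supervised approaches. The framework is readily extensible: for other imaging modalities or diseases, only a global appearance description is needed to obtain reliable pseudo supervision.
Conclusion: The study demonstrates the effectiveness of VLM-based cross-domain structural transfer for medical image segmentation and offers a new paradigm for scalable semi-supervised segmentation under limited annotations. Its main strengths are its training-free nature and low reliance on domain-specific knowledge, since a simple appearance description suffices for reliable pseudo-label generation, which suggests broad potential for clinical use.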
📄 Abstract
Semi-supervised learning (SSL) has emerged as a promising paradigm for breast ultrasound (BUS) image segmentation, but it often suffers from unstable pseudo labels under extremely limited annotations, leading to inaccurate supervision and degraded performance. Recent vision-language models (VLMs) provide a new opportunity for pseudo-label generation, yet their effectiveness on BUS images remains limited because domain-specific prompts are difficult to transfer. To address this issue, we propose a semi-supervised framework with training-free pseudo-label generation and label refinement. By leveraging simple appearance-based descriptions (e.g., dark oval), our method enables cross-domain structural transfer between natural and medical images, allowing VLMs to generate structurally consistent pseudo labels. These pseudo labels are used to warm up a static teacher that captures global structural priors of breast lesions. Combined with an exponential moving average teacher, we further introduce uncertainty entropy weighted fusion and adaptive uncertainty-guided reverse contrastive learning to improve boundary discrimination. Experiments on four BUS datasets demonstrate that our method achieves performance comparable to fully supervised models even with only 2.5% labeled data, significantly outperforming existing SSL approaches. Moreover, the proposed paradigm is readily extensible: for other imaging modalities or diseases, only a global appearance description is required to obtain reliable pseudo supervision, enabling scalable semi-supervised medical image segmentation under limited annotations.
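One plausible reading of entropy-weighted fusion of the two teachers is sketched below for a single pixel. The abstract does not give the formula, so the weighting (one minus normalised binary entropy) and the names `binary_entropy` and `entropy_weighted_fusion` are assumptions for illustration.

```python
import math


def binary_entropy(p: float) -> float:
    """Shannon entropy (in nats) of a Bernoulli probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def entropy_weighted_fusion(p_static: float, p_ema: float,
                            eps: float = 1e-6) -> float:
    """Fuse two foreground probabilities, trusting the lower-entropy one more."""
    # Certainty is about 1 for a confident prediction and about 0 at p = 0.5;
    # eps keeps the denominator positive when both teachers are maximally unsure.
    w_static = 1.0 - binary_entropy(p_static) / math.log(2.0) + eps
    w_ema = 1.0 - binary_entropy(p_ema) / math.log(2.0) + eps
    return (w_static * p_static + w_ema * p_ema) / (w_static + w_ema)


# The static teacher is unsure (0.5), the EMA teacher is confident (0.9):
# the fused value lands much closer to 0.9 than the plain mean of 0.7.
print(entropy_weighted_fusion(0.5, 0.9))
```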
[66] JOPP-3D: Joint Open Vocabulary Semantic Segmentation on Point Clouds and Panoramas
Sandeep Inuganti, Hideaki Kanayama, Kanta Shimizu, Mahdi Chamseddine, Soichiro Yokota, Didier Stricker, Jason Rambach
🧩 TL;DR
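This paper proposes JOPP-3D, an open-vocabulary semantic segmentation framework that jointly uses panoramic images and point clouds for language-driven scene understanding, significantly improving cross-modal segmentation performance.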
📘 Detailed Summary
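Motivation: Semantic segmentation across visual modalities such as 3D point clouds and panoramic images faces two main challenges: scarce annotated data and the limited adaptability of fixed-label models. Both hold back language-driven scene understanding.
Method: JOPP-3D converts RGB-D panoramic images into corresponding tangential perspective images and 3D point clouds, uses both modalities to extract and align foundational vision-language features, and generates semantic masks on both input modalities from natural language queries.
Result: Experiments on the Stanford-2D-3D-s and ToF-360 datasets show that JOPP-3D produces coherent and semantically meaningful segmentations across the panoramic and 3D domains, with significant improvements over the state of the art in open- and closed-vocabulary 2D and 3D semantic segmentation.
Conclusion: The study demonstrates the effectiveness of joint multimodal representation learning for open-vocabulary scene understanding and provides a new framework for language-driven semantic segmentation across visual modalities, supporting more flexible scene understanding systems.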
📄 Abstract
Semantic segmentation across visual modalities such as 3D point clouds and panoramic images remains a challenging task, primarily due to the scarcity of annotated data and the limited adaptability of fixed-label models. In this paper, we present JOPP-3D, an open-vocabulary semantic segmentation framework that jointly leverages panoramic and point cloud data to enable language-driven scene understanding. We convert RGB-D panoramic images into their corresponding tangential perspective images and 3D point clouds, then use these modalities to extract and align foundational vision-language features. This allows natural language querying to generate semantic masks on both input modalities. Experimental evaluation on the Stanford-2D-3D-s and ToF-360 datasets demonstrates the capability of JOPP-3D to produce coherent and semantically meaningful segmentations across panoramic and 3D domains. Our proposed method achieves a significant improvement compared to the SOTA in open and closed vocabulary 2D and 3D semantic segmentation.
[67] Making Training-Free Diffusion Segmentors Scale with the Generative Power
Benyuan Meng, Qianqian Xu, Zitai Wang, Xiaochun Cao, Longtao Huang, Qingming Huang
🧩 TL;DR
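This paper proposes two techniques, auto aggregation and per-pixel rescaling, to address inconsistent attention maps and score imbalance in training-free diffusion segmentors, so that segmentation better exploits the capability of the generative model.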
📘 Detailed Summary
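Motivation: Existing training-free semantic segmentation methods built on pretrained diffusion models usually rely on cross-attention maps to capture semantic relationships between pixels and text tokens, but two problems remain. Attention maps from different heads and layers are inconsistent with one another, making it hard to form a unified global representation, and even with a global map, score imbalance among text tokens prevents accurate semantic correlation.
Method: The paper proposes two core techniques. Auto aggregation combines cross-attention maps from multiple heads and layers into a unified global semantic representation. Per-pixel rescaling adjusts the score distribution across text tokens to resolve score imbalance, yielding more accurate semantic correlation for segmentation.
Result: On standard semantic segmentation benchmarks, the method significantly improves training-free diffusion segmentors, supporting its effectiveness and scalability. Integrating it into a generative technique further demonstrates broad applicability.
Conclusion: The study identifies why existing training-free diffusion segmentation methods fail to fully exploit generative capability and provides an effective remedy. Auto aggregation and per-pixel rescaling improve segmentation performance and offer a new route for applying generative models to discriminative tasks.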
📄 Abstract
As powerful generative models, text-to-image diffusion models have recently been explored for discriminative tasks. A line of research focuses on adapting a pre-trained diffusion model to semantic segmentation without any further training, leading to what are called training-free diffusion segmentors. These methods typically rely on cross-attention maps from the model's attention layers, which are assumed to capture semantic relationships between image pixels and text tokens. Ideally, such approaches should benefit from more powerful diffusion models, i.e., stronger generative capability should lead to better segmentation. However, we observe that existing methods often fail to scale accordingly. To understand this issue, we identify two underlying gaps: (i) cross-attention is computed across multiple heads and layers, but there exists a discrepancy between these individual attention maps and a unified global representation. (ii) Even when a global map is available, it does not directly translate to accurate semantic correlation for segmentation, due to score imbalances among different text tokens. To bridge these gaps, we propose two techniques: auto aggregation and per-pixel rescaling, which together enable training-free segmentation to better leverage generative capability. We evaluate our approach on standard semantic segmentation benchmarks and further integrate it into a generative technique, demonstrating both improved performance and broad applicability. Codes are at https://github.com/Darkbblue/goca.
[68] Towards Motion Turing Test: Evaluating Human-Likeness in Humanoid Robots
Mingzhe Li, Mengyin Liu, Zekai Wu, Xincheng Lin, Junsheng Zhang, Ming Yan, Zengye Xie, Changwang Zhang, Chenglu Wen, Lan Xu, Siqi Shen, Cheng Wang
🧩 TL;DR
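This paper proposes the Motion Turing Test framework and the HHMotion dataset for evaluating the human-likeness of humanoid robot motion, develops a baseline model that predicts human-likeness scores, and finds that current multimodal large language models perform inadequately on this task.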
📘 Detailed Summary
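Motivation: Humanoid robot motion generation and control have advanced markedly, but there is no objective framework for systematically evaluating how human-like the resulting motions are. Inspired by the Turing Test, the study sets out to establish a Motion Turing Test that quantifies the similarity between robot and human motion and fills the gap left by the lack of standardized evaluation datasets and methods.
Method: The study proposes the Motion Turing Test framework and builds the HHMotion dataset of 1,000 motion sequences spanning 15 action categories, performed by 11 humanoid models and 10 human subjects. All sequences are converted to SMPL-X representations to remove the influence of visual appearance, and 30 annotators rate the human-likeness of each pose on a 0-5 scale. On top of this dataset, the study formulates a human-likeness evaluation task and develops a simple baseline model that automatically predicts human-likeness scores from motion data.
Result: Analysis of the collected data shows that humanoid motions still deviate noticeably from human movements, particularly in dynamic actions such as jumping, boxing, and running. Despite progress elsewhere, multimodal large language models remain inadequate for assessing motion human-likeness. The simple baseline outperforms several contemporary LLM-based methods at predicting human-likeness scores, supporting the value of a specialized model for this task.
Conclusion: The study establishes a systematic framework and benchmark dataset for evaluating motion human-likeness and reveals a clear gap between current humanoid and human motion. The results indicate that purpose-built models, rather than general multimodal LLMs, are needed to assess motion human-likeness accurately. The dataset, code, and benchmark will be publicly released to support future research on humanoid motion generation.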
📄 Abstract
Humanoid robots have achieved significant progress in motion generation and control, exhibiting movements that appear increasingly natural and human-like. Inspired by the Turing Test, we propose the Motion Turing Test, a framework that evaluates whether human observers can discriminate between humanoid robot and human poses using only kinematic information. To facilitate this evaluation, we present the Human-Humanoid Motion (HHMotion) dataset, which consists of 1,000 motion sequences spanning 15 action categories, performed by 11 humanoid models and 10 human subjects. All motion sequences are converted into SMPL-X representations to eliminate the influence of visual appearance. We recruited 30 annotators to rate the human-likeness of each pose on a 0-5 scale, resulting in over 500 hours of annotation. Analysis of the collected data reveals that humanoid motions still exhibit noticeable deviations from human movements, particularly in dynamic actions such as jumping, boxing, and running. Building on HHMotion, we formulate a human-likeness evaluation task that aims to automatically predict human-likeness scores from motion data. Despite recent progress in multimodal large language models, we find that they remain inadequate for assessing motion human-likeness. To address this, we propose a simple baseline model and demonstrate that it outperforms several contemporary LLM-based methods. The dataset, code, and benchmark will be publicly released to support future research in the community.
[69] Point-Supervised Skeleton-Based Human Action Segmentation
Hongsong Wang, Yiqin Shen, Pengbo Yan, Jie Gui
🧩 TL;DR
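This paper proposes a point-supervised framework for skeleton-based temporal action segmentation that needs only one labeled frame per action segment and, through multimodal pseudo-label integration, achieves performance competitive with fully supervised methods on several benchmark datasets.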
📘 Detailed Summary
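Motivation: In skeleton-based temporal action segmentation, fully supervised methods perform well but require costly frame-level annotation and are sensitive to ambiguous action boundaries. The study addresses the annotation cost by exploring a point-supervised paradigm in which only a single frame per action segment is labeled.
Method: The framework encodes multimodal skeleton data (joint, bone, and motion) with a pretrained unified model to extract rich feature representations. Reliable pseudo-labels are generated by combining a prototype similarity method with an energy function and constrained K-Medoids clustering, and multimodal pseudo-label integration further improves pseudo-label reliability to guide model training.
Result: New benchmarks are established on PKU-MMD (X-Sub and X-View), MCFS-22, and MCFS-130. Experiments show that the method achieves competitive performance, even surpassing some fully supervised methods, while significantly reducing annotation effort.
Conclusion: The study demonstrates the effectiveness of point supervision for skeleton-based action segmentation. With multimodal features and pseudo-label integration, it approaches fully supervised performance at very low annotation cost, offering a practical way to reduce the labeling burden of action understanding systems.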
📄 Abstract
Skeleton-based temporal action segmentation is a fundamental yet challenging task, playing a crucial role in enabling intelligent systems to perceive and respond to human activities. While fully-supervised methods achieve satisfactory performance, they require costly frame-level annotations and are sensitive to ambiguous action boundaries. To address these issues, we introduce a point-supervised framework for skeleton-based action segmentation, where only a single frame per action segment is labeled. We leverage multimodal skeleton data, including joint, bone, and motion information, encoded via a pretrained unified model to extract rich feature representations. To generate reliable pseudo-labels, we propose a novel prototype similarity method and integrate it with two existing methods: energy function and constrained K-Medoids clustering. Multimodal pseudo-label integration is proposed to enhance the reliability of the pseudo-label and guide the model training. We establish new benchmarks on PKU-MMD (X-Sub and X-View), MCFS-22, and MCFS-130, and implement baselines for point-supervised skeleton-based human action segmentation. Extensive experiments show that our method achieves competitive performance, even surpassing some fully-supervised methods while significantly reducing annotation effort.
[70] Hierarchical Collaborative Fusion for 3D Instance-aware Referring Expression Segmentation
Keshen Zhou, Runnan Chen, Mingming Gong, Tongliang Liu
🧩 TL;DR
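This paper proposes HCF-RES, a multi-modal framework for generalised 3D referring expression segmentation that uses hierarchical visual semantic decomposition and progressive multi-level fusion to achieve finer language-guided 3D object localization in sparse point cloud scenes.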
📘 Detailed Summary
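Motivation: Existing generalised 3D referring expression segmentation methods rely solely on sparse point clouds and lack rich visual semantics, so they struggle with complex scenes that require fine-grained descriptions, especially when a description matches multiple targets or none.
Method: HCF-RES uses Hierarchical Visual Semantic Decomposition, in which SAM instance masks guide the CLIP encoder to extract features at two granularities (pixel level and instance level), preserving object boundaries during 2D-to-3D projection. It also introduces Progressive Multi-level Fusion, which integrates multimodal representations through intra-modal collaboration, cross-modal adaptive weighting between 2D semantic and 3D geometric features, and language-guided refinement.
Result: HCF-RES achieves state-of-the-art results on both the ScanRefer and Multi3DRefer benchmarks, supporting its effectiveness for generalised 3D referring expression segmentation.
Conclusion: The study shows that combining 2D visual semantics with 3D geometric information can significantly improve 3D referring expression segmentation, and that hierarchical feature extraction with progressive fusion is an effective way to handle fine-grained language-guided localization in sparse point cloud scenes.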
📄 Abstract
Generalised 3D Referring Expression Segmentation (3D-GRES) localizes objects in 3D scenes based on natural language, even when descriptions match multiple or zero targets. Existing methods rely solely on sparse point clouds, lacking rich visual semantics for fine-grained descriptions. We propose HCF-RES, a multi-modal framework with two key innovations. First, Hierarchical Visual Semantic Decomposition leverages SAM instance masks to guide CLIP encoding at dual granularities (pixel-level and instance-level features), preserving object boundaries during 2D-to-3D projection. Second, Progressive Multi-level Fusion integrates representations through intra-modal collaboration, cross-modal adaptive weighting between 2D semantic and 3D geometric features, and language-guided refinement. HCF-RES achieves state-of-the-art results on both ScanRefer and Multi3DRefer.
[71] NOVA: Next-step Open-Vocabulary Autoregression for 3D Multi-Object Tracking in Autonomous Driving
Kai Luo, Xu Wang, Rui Fan, Kailun Yang
🧩 TL;DR
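This paper proposes NOVA (Next-step Open-Vocabulary Autoregression), a 3D multi-object tracking paradigm that reformulates trajectories as structured spatio-temporal semantic sequences and uses the autoregressive capability of large language models to achieve open-vocabulary tracking that generalizes across categories.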
📘 Detailed Summary
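Motivation: Existing 3D multi-object tracking methods are limited by closed-set assumptions and "semantic-blind" heuristics, which makes it hard to generalize to unknown targets in open-world perception and restricts adaptability in real applications.
Method: NOVA reformulates 3D trajectories as structured spatio-temporal semantic sequences that encode both physical motion continuity and deep linguistic priors. Using the autoregressive capability of large language models, it turns tracking into a principled next-step sequence completion process and resolves fine-grained semantic ambiguities through the hierarchical structure of language space.
Result: Extensive experiments on nuScenes, V2X-Seq-SPD, and KITTI show superior performance. On nuScenes, NOVA achieves an AMOTA of 22.41% on Novel categories, a 20.21% absolute improvement over the baseline, using only a compact 0.5B autoregressive model.
Conclusion: The study shows the potential of bringing generative spatio-temporal semantic modeling to 3D tracking. Language priors and commonsense reasoning strengthen identity consistency, offering a new design paradigm for open-world perception systems and indicating that small autoregressive models can also perform strongly on complex tracking tasks.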
📄 Abstract
Generalizing across unknown targets is critical for open-world perception, yet existing 3D Multi-Object Tracking (3D MOT) pipelines remain limited by closed-set assumptions and "semantic-blind" heuristics. To address this, we propose Next-step Open-Vocabulary Autoregression (NOVA), an innovative paradigm that shifts 3D tracking from traditional fragmented distance-based matching toward generative spatio-temporal semantic modeling. NOVA reformulates 3D trajectories as structured spatio-temporal semantic sequences, enabling the simultaneous encoding of physical motion continuity and deep linguistic priors. By leveraging the autoregressive capabilities of Large Language Models (LLMs), we transform the tracking task into a principled process of next-step sequence completion. This mechanism allows the model to explicitly utilize the hierarchical structure of language space to resolve fine-grained semantic ambiguities and maintain identity consistency across complex long-range sequences through high-level commonsense reasoning. Extensive experiments on nuScenes, V2X-Seq-SPD, and KITTI demonstrate the superior performance of NOVA. Notably, on the nuScenes dataset, NOVA achieves an AMOTA of 22.41% for Novel categories, yielding a significant 20.21% absolute improvement over the baseline. These gains are realized through a compact 0.5B autoregressive model. Code will be available at https://github.com/xifen523/NOVA.
[72] ODD-SEC: Onboard Drone Detection with a Spinning Event Camera
Kuan Dai, Hongxin Zhang, Sheng Zhong, Yi Zhou
🧩 TL;DR
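This paper presents a real-time drone detection system for moving carriers. It uses a spinning event camera to obtain a 360° horizontal field of view and bearing estimates, and combines a novel event representation with a lightweight neural network to achieve reliable detection under challenging conditions.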
📘 Detailed Summary
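Motivation: Most existing event-based drone detection solutions assume a static camera, which greatly limits their use on moving carriers (such as quadrupedal robots or unmanned ground vehicles) during field operations, while conventional frame cameras are fundamentally limited with fast-moving targets or adverse illumination.
Method: The system uses a spinning event camera to provide a 360° horizontal field of view, proposes a novel image-like event representation that needs no motion compensation, pairs it with a lightweight neural network for efficient spatiotemporal learning, and runs in real time on an onboard Jetson Orin NX.
Result: Outdoor experiments validate reliable detection under challenging conditions with a mean angular error below 2°. The system runs in real time on a moving carrier, showing its suitability for real-world surveillance.
Conclusion: The system removes the static-camera assumption of existing event-camera solutions and offers a practical approach to drone detection on mobile platforms. The complete pipeline will be open-sourced to support future research, and the work highlights the strengths of event cameras in dynamic surveillance scenarios.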
📄 Abstract
The rapid proliferation of drones requires balancing innovation with regulation. To address security and privacy concerns, techniques for drone detection have attracted significant attention. Passive solutions, such as frame camera-based systems, offer versatility and energy efficiency under typical conditions but are fundamentally constrained by their operational principles in scenarios involving fast-moving targets or adverse illumination. Inspired by biological vision, event cameras asynchronously detect per-pixel brightness changes, offering high dynamic range and microsecond-level responsiveness that make them uniquely suited for drone detection in conditions beyond the reach of conventional frame-based cameras. However, the design of most existing event-based solutions assumes a static camera, greatly limiting their applicability to moving carriers (such as quadrupedal robots or unmanned ground vehicles) during field operations. In this paper, we introduce a real-time drone detection system designed for deployment on moving carriers. The system utilizes a spinning event-based camera, providing a 360° horizontal field of view and enabling bearing estimation of detected drones. A key contribution is a novel image-like event representation that operates without motion compensation, coupled with a lightweight neural network architecture for efficient spatiotemporal learning. Implemented on an onboard Jetson Orin NX, the system can operate in real time. Outdoor experimental results validate reliable detection with a mean angular error below 2° under challenging conditions, underscoring its suitability for real-world surveillance applications. We will open-source our complete pipeline to support future research.
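Bearing estimation from a spinning camera can be illustrated with a small geometric sketch. It assumes, hypothetically, that the spin angle at detection time is known (for example from a motor encoder) and that the camera follows a pinhole model. The function names and parameters are illustrative and not taken from the paper.

```python
import math


def bearing_deg(spin_angle_deg: float, pixel_x: float,
                width: int, hfov_deg: float) -> float:
    """Bearing of a detection in the carrier frame, wrapped to [0, 360)."""
    # Pinhole model: focal length in pixels from the horizontal field of view.
    focal_px = (width / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)
    centre_x = (width - 1) / 2.0
    offset_deg = math.degrees(math.atan((pixel_x - centre_x) / focal_px))
    return (spin_angle_deg + offset_deg) % 360.0


def angular_error_deg(estimate_deg: float, truth_deg: float) -> float:
    """Smallest absolute difference between two bearings, in degrees."""
    return abs((estimate_deg - truth_deg + 180.0) % 360.0 - 180.0)


# A detection at the image centre inherits the spin angle directly.
print(bearing_deg(350.0, 320, 641, 60.0))
# Wrap-around is handled when comparing bearings near 0°/360°.
print(angular_error_deg(359.0, 1.0))
```

The wrap-aware error is the quantity a mean angular error, such as the sub-2° figure in the abstract, would average over detections.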
[73] Can we Trust Unreliable Voxels? Exploring 3D Semantic Occupancy Prediction under Label Noise
Wenxin Li, Kunyu Peng, Di Wen, Junwei Zheng, Jiale Wei, Mengfei Duan, Yuheng Zhang, Rui Fan, Kailun Yang
🧩 TL;DR
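This paper introduces the OccNL benchmark and the DPR-Occ framework, the first systematic study of label noise in 3D semantic occupancy prediction. DPR-Occ builds reliable supervision through dual-source partial label reasoning and markedly improves robustness under extreme noise.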
📘 Detailed Summary
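Motivation: 3D semantic occupancy prediction is a core task in robotic perception, but real-world voxel annotations are inherently corrupted by structural artifacts and dynamic trailing effects, so the supervision signal is unreliable. Prior work has not systematically examined whether autonomous systems can safely rely on such noisy supervision, and 2D label noise learning methods face a fundamental domain gap in sparse 3D voxel spaces.
Method: The authors propose DPR-Occ, which constructs reliable supervision through dual-source partial label reasoning. It combines temporal model memory with representation-level structural affinity to dynamically expand and prune candidate label sets, preserving true semantics while suppressing noise propagation. The design specifically targets occupancy-asymmetric and dynamic trailing noise in 3D occupancy prediction.
Result: Experiments on SemanticKITTI show that DPR-Occ prevents geometric and semantic collapse under extreme corruption. Even at 90% label noise, it achieves significant gains over existing label noise learning baselines (up to 2.57% mIoU and 13.91% IoU). OccNL provides the first systematic evaluation platform for label noise in 3D occupancy prediction.
Conclusion: By bridging label noise learning and 3D perception, the work provides a reliable foundation for safety-critical robotic perception in dynamic environments. OccNL and DPR-Occ expose the fragility of existing paradigms in sparse 3D spaces and offer an effective remedy, advancing robust 3D perception.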
📄 Abstract
3D semantic occupancy prediction is a cornerstone of robotic perception, yet real-world voxel annotations are inherently corrupted by structural artifacts and dynamic trailing effects. This raises a critical but underexplored question: can autonomous systems safely rely on such unreliable occupancy supervision? To systematically investigate this issue, we establish OccNL, the first benchmark dedicated to 3D occupancy under occupancy-asymmetric and dynamic trailing noise. Our analysis reveals a fundamental domain gap: state-of-the-art 2D label noise learning strategies collapse catastrophically in sparse 3D voxel spaces, exposing a critical vulnerability in existing paradigms. To address this challenge, we propose DPR-Occ, a principled label noise-robust framework that constructs reliable supervision through dual-source partial label reasoning. By synergizing temporal model memory with representation-level structural affinity, DPR-Occ dynamically expands and prunes candidate label sets to preserve true semantics while suppressing noise propagation. Extensive experiments on SemanticKITTI demonstrate that DPR-Occ prevents geometric and semantic collapse under extreme corruption. Notably, even at 90% label noise, our method achieves significant performance gains (up to 2.57% mIoU and 13.91% IoU) over existing label noise learning baselines adapted to the 3D occupancy prediction task. By bridging label noise learning and 3D perception, OccNL and DPR-Occ provide a reliable foundation for safety-critical robotic perception in dynamic environments. The benchmark and source code will be made publicly available at https://github.com/mylwx/OccNL.
[74] Attribute Distribution Modeling and Semantic-Visual Alignment for Generative Zero-shot Learning
Haojie Pu, Zhuoming Li, Yongbiao Gao, Yuheng Jia
🧩 TL;DR
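This paper proposes ADiVA, an attribute distribution modeling and semantic-visual alignment approach that jointly models attribute distributions and performs explicit semantic-visual alignment to address the class-instance gap and the semantic-visual domain gap in generative zero-shot learning.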
📘 Detailed Summary
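Motivation: Generative zero-shot learning faces two intrinsic challenges. First, class-level attributes cannot capture instance-specific visual appearance because of substantial intra-class variability, producing a class-instance gap. Second, a substantial mismatch between semantic and visual feature distributions, visible in inter-class correlations, produces a semantic-visual domain gap.
Method: ADiVA has two core modules. The Attribute Distribution Modeling module learns a transferable attribute distribution for each class and samples instance-level attributes for unseen classes. The Visual-Guided Alignment module refines semantic representations to better reflect visual structure, achieving explicit semantic-visual alignment.
Result: Experiments on three widely used benchmark datasets show that ADiVA significantly outperforms state-of-the-art methods, with gains of 4.7% on AWA2 and 6.1% on SUN, and that it can serve as a plugin to enhance existing generative zero-shot learning methods.
Conclusion: By jointly modeling attribute distributions and performing semantic-visual alignment, the study addresses two key challenges of generative zero-shot learning. The approach performs strongly and also extends existing methods as a general plugin, offering a new technical route for zero-shot learning.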
📄 Abstract
Generative zero-shot learning (ZSL) synthesizes features for unseen classes, leveraging semantic conditions to transfer knowledge from seen classes. However, it also introduces two intrinsic challenges: (1) class-level attributes fail to capture instance-specific visual appearances due to substantial intra-class variability, thus causing the class-instance gap; (2) the substantial mismatch between semantic and visual feature distributions, manifested in inter-class correlations, gives rise to the semantic-visual domain gap. To address these challenges, we propose an Attribute Distribution Modeling and Semantic-Visual Alignment (ADiVA) approach, jointly modeling attribute distributions and performing explicit semantic-visual alignment. Specifically, our ADiVA consists of two modules: an Attribute Distribution Modeling (ADM) module that learns a transferable attribute distribution for each class and samples instance-level attributes for unseen classes, and a Visual-Guided Alignment (VGA) module that refines semantic representations to better reflect visual structures. Experiments on three widely used benchmark datasets demonstrate that ADiVA significantly outperforms state-of-the-art methods (e.g., achieving gains of 4.7% and 6.1% on AWA2 and SUN, respectively). Moreover, our approach can serve as a plugin to enhance existing generative ZSL methods.
[75] FlowMotion: Training-Free Flow Guidance for Video Motion Transfer
Zhen Wang, Youcan Xu, Jun Xiao, Long Chen
🧩 TL;DR
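This paper proposes FlowMotion, a training-free video motion transfer framework that directly uses the predicted outputs of flow-based text-to-video models, enabling efficient and flexible motion transfer with much lower computational overhead and competitive performance.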
📘 Detailed Summary
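Motivation: Existing training-free video motion transfer methods mainly build motion guidance from the intermediate outputs of pretrained text-to-video (T2V) models, which incurs heavy computational overhead and limits flexibility. A more efficient and flexible solution is needed.
Method: FlowMotion has two main components. Flow guidance extracts motion representations from early latent predictions, using the rich temporal information they encode to align motion patterns between the source and generated videos. A velocity regularization strategy stabilizes optimization and ensures smooth motion evolution.
Result: FlowMotion achieves superior time and resource efficiency while remaining competitive with state-of-the-art methods, supporting the effectiveness of operating directly on model predictions.
Conclusion: The study shows that directly using the predicted outputs of flow-based T2V models can substantially improve the efficiency and flexibility of motion transfer, offers a new technical route for video generation, and demonstrates the value of the temporal information in early latent predictions.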
📄 Abstract
Video motion transfer aims to generate a target video that inherits motion patterns from a source video while rendering new scenes. Existing training-free approaches focus on constructing motion guidance based on the intermediate outputs of pre-trained T2V models, which results in heavy computational overhead and limited flexibility. In this paper, we present FlowMotion, a novel training-free framework that enables efficient and flexible motion transfer by directly leveraging the predicted outputs of flow-based T2V models. Our key insight is that early latent predictions inherently encode rich temporal information. Motivated by this, we propose flow guidance, which extracts motion representations based on latent predictions to align motion patterns between source and generated videos. We further introduce a velocity regularization strategy to stabilize optimization and ensure smooth motion evolution. By operating purely on model predictions, FlowMotion achieves superior time and resource efficiency as well as competitive performance compared with state-of-the-art methods.
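For readers unfamiliar with latent predictions in flow-based generators, the sketch below shows how a clean-latent estimate follows from a velocity prediction under one common convention. It assumes the rectified-flow parameterisation x_t = (1 - t) x_0 + t ε with velocity target v = ε - x_0. The abstract does not state which parameterisation FlowMotion uses, so treat this as background rather than the paper's procedure.

```python
from typing import List


def predict_clean_latent(x_t: List[float], velocity: List[float],
                         t: float) -> List[float]:
    """One-step clean-latent estimate x0_hat = x_t - t * v (assumed convention)."""
    if len(x_t) != len(velocity):
        raise ValueError("x_t and velocity must have the same length")
    return [x - t * v for x, v in zip(x_t, velocity)]


# Toy check with a known clean latent and noise sample.
x0, noise, t = [1.0, 2.0], [3.0, 0.0], 0.25
x_t = [(1 - t) * a + t * e for a, e in zip(x0, noise)]  # noisy latent at time t
v = [e - a for a, e in zip(x0, noise)]                  # exact velocity target
print(predict_clean_latent(x_t, v, t))                  # recovers x0 when v is exact
```

With a learned, imperfect velocity the estimate is only approximate, which is why such predictions are usually described as estimates of the clean latent.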
[76] Latent Transfer Attack: Adversarial Examples via Generative Latent Spaces
Eitan Shaar, Ariel Shaulov, Yalcin Tur, Gal Chechik, Ravid Shwartz-Ziv
🧩 TL;DR
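This paper proposes LTA (Latent Transfer Attack), a transfer-based attack that optimizes adversarial perturbations in the latent space of a pretrained Stable Diffusion VAE. Latent-space optimization combined with Expectation Over Transformations yields spatially coherent, low-frequency perturbations and markedly improves robustness and transfer success across architectures.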
📘 Detailed Summary
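Motivation: Conventional pixel-space adversarial attacks tend to produce high-frequency, texture-like noise that is brittle to common preprocessing (such as resizing and cropping) and transfers poorly across architectures, which limits their value as robustness evaluation tools in realistic settings.
Method: LTA optimizes adversarial perturbations in the latent space of a pretrained Stable Diffusion VAE. It encodes a clean image into a latent representation and maximizes a surrogate classifier loss while softly enforcing a pixel-space ℓ∞ constraint after decoding. It incorporates Expectation Over Transformations (EOT) through randomized resizing, interpolation, and cropping, and applies periodic latent Gaussian smoothing to suppress artifacts and stabilize optimization.
Result: Across a suite of CNN and vision-transformer targets, LTA achieves strong transfer attack success while producing spatially coherent, predominantly low-frequency perturbations that differ qualitatively from pixel-space baselines, occupy a distinct point in the transfer-quality trade-off, and are robust to resolution mismatch and standard input pipelines.
Conclusion: The study highlights pretrained generative latent spaces as an effective, structured domain for adversarial optimization, linking robustness evaluation with modern generative priors. It points to a new direction for more robust and transferable attacks and shows the practical value of generative latent spaces in adversarial machine learning.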
📄 Abstract
Adversarial attacks are a central tool for probing the robustness of modern vision models, yet most methods optimize perturbations directly in pixel space under $\ell_\infty$ or $\ell_2$ constraints. While effective in white-box settings, pixel-space optimization often produces high-frequency, texture-like noise that is brittle to common preprocessing (e.g., resizing and cropping) and transfers poorly across architectures. We propose LTA (Latent Transfer Attack), a transfer-based attack that instead optimizes perturbations in the latent space of a pretrained Stable Diffusion VAE. Given a clean image, we encode it into a latent code and optimize the latent representation to maximize a surrogate classifier loss, while softly enforcing a pixel-space $\ell_\infty$ budget after decoding. To improve robustness to resolution mismatch and standard input pipelines, we incorporate Expectation Over Transformations (EOT) via randomized resizing, interpolation, and cropping, and apply periodic latent Gaussian smoothing to suppress emerging artifacts and stabilize optimization. Across a suite of CNN and vision-transformer targets, LTA achieves strong transfer attack success while producing spatially coherent, predominantly low-frequency perturbations that differ qualitatively from pixel-space baselines and occupy a distinct point in the transfer-quality trade-off. Our results highlight pretrained generative latent spaces as an effective and structured domain for adversarial optimization, bridging robustness evaluation with modern generative priors.
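A soft pixel-space budget can be expressed as a penalty that is zero inside the ℓ∞ ball and grows outside it. The sketch below is one possible form, assumed for illustration: the squared-hinge shape and the name `linf_budget_penalty` are not taken from the paper, which does not specify how the budget is softly enforced.

```python
from typing import Sequence


def linf_budget_penalty(clean: Sequence[float], adversarial: Sequence[float],
                        epsilon: float) -> float:
    """Mean squared amount by which each pixel exceeds the l-infinity budget."""
    if not clean or len(clean) != len(adversarial):
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for c, a in zip(clean, adversarial):
        # Only the part of the perturbation beyond epsilon is penalised.
        excess = max(0.0, abs(a - c) - epsilon)
        total += excess * excess
    return total / len(clean)


clean = [0.50, 0.50]
print(linf_budget_penalty(clean, [0.52, 0.47], epsilon=0.05))  # inside the budget
print(linf_budget_penalty(clean, [0.52, 0.60], epsilon=0.05))  # second pixel exceeds it
```

Unlike a hard clip, a penalty of this kind stays differentiable with respect to the decoded image, which is useful when the optimized variable is a latent code rather than the pixels themselves.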
[77] WMoE-CLIP: Wavelet-Enhanced Mixture-of-Experts Prompt Learning for Zero-Shot Anomaly Detection
Peng Chen, Chao Huang
🧩 TL;DR
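This paper proposes a wavelet-enhanced mixture-of-experts prompt learning method for zero-shot anomaly detection. It models global semantic representations with a variational autoencoder and extracts multi-frequency features via wavelet decomposition, markedly improving detection of unseen anomaly patterns.
📘 Detailed Summary
Motivation: Existing zero-shot anomaly detection methods typically rely on fixed textual prompts, which struggle to capture complex semantics, and focus only on spatial-domain features, which limits their ability to detect subtle anomalies. More flexible, multimodal prompt learning is therefore needed.
Method: The proposed wavelet-enhanced mixture-of-experts prompt learning method uses a variational autoencoder to model global semantic representations and integrate them into the prompts. Wavelet decomposition extracts multi-frequency image features that dynamically refine the text embeddings through cross-modal interaction, and a semantic-aware mixture-of-experts module aggregates contextual information.
Result: Extensive experiments on 14 industrial and medical datasets validate the method, with notable gains over existing methods in detection accuracy and generalization, particularly on complex anomaly patterns and subtle anomalies.
Conclusion: By combining semantic modeling, multi-frequency feature extraction, and a mixture-of-experts mechanism, the method offers a new paradigm for applying vision-language models to zero-shot anomaly detection and shows the potential of multimodal prompt learning for complex vision tasks.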
📄 Abstract
Vision-language models have recently shown strong generalization in zero-shot anomaly detection (ZSAD), enabling the detection of unseen anomalies without task-specific supervision. However, existing approaches typically rely on fixed textual prompts, which struggle to capture complex semantics, and focus solely on spatial-domain features, limiting their ability to detect subtle anomalies. To address these challenges, we propose a wavelet-enhanced mixture-of-experts prompt learning method for ZSAD. Specifically, a variational autoencoder is employed to model global semantic representations and integrate them into prompts to enhance adaptability to diverse anomaly patterns. Wavelet decomposition extracts multi-frequency image features that dynamically refine textual embeddings through cross-modal interactions. Furthermore, a semantic-aware mixture-of-experts module is introduced to aggregate contextual information. Extensive experiments on 14 industrial and medical datasets demonstrate the effectiveness of the proposed method.
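To make "multi-frequency features" concrete, the sketch below runs one level of a Haar wavelet decomposition on a 1D signal, splitting it into a low-frequency approximation and a high-frequency detail band. The abstract does not name the wavelet family or the dimensionality, so the Haar choice and the 1D setting are assumptions for illustration only.

```python
import math


def haar_1d(signal):
    """One level of an orthonormal Haar transform on an even-length list."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    scale = math.sqrt(2.0)
    # Pairwise sums capture the smooth (low-frequency) content.
    approx = [(signal[i] + signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    # Pairwise differences capture local changes (high-frequency content).
    detail = [(signal[i] - signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    return approx, detail


# A flat region produces zero detail; a sharp local change does not.
flat_approx, flat_detail = haar_1d([1.0, 1.0, 1.0, 1.0])
edge_approx, edge_detail = haar_1d([1.0, 1.0, 1.0, 5.0])
```

The detail band is where small, localized deviations show up, which is one intuition for why frequency-domain features might help with subtle anomalies.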
[78] Computer vision-based estimation of invertebrate biomass
Mikko Impiö, Philipp M. Rehsen, Jarrett Blair, Cecilie Mielec, Arne J. Beermann, Florian Leese, Toke T. Høye, Jenni Raitoharju
🧩 TL;DR
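This study proposes two computer vision-based approaches for estimating invertebrate dry mass: a linear model and end-to-end deep neural networks. Both estimate specimen dry mass automatically from image sequences, avoiding the destructive process of traditional manual weighing.
📘 Detailed Summary
Motivation: Traditional invertebrate biomass estimation relies on dry-mass measurement, which is manual, time-consuming, and destructive, and this limits the scaling of biodiversity monitoring. The study aims to develop computer vision methods that estimate specimen dry mass accurately from images alone, overcoming these limitations and improving monitoring efficiency.
Method: Two main approaches are proposed. The first is a linear model with novel predictors, area and sinking speed, computed automatically by the BIODISCOVER dual-camera system. The second is a family of end-to-end deep neural networks with single-view, multi-view, and metadata-aware architectures. The study also explores different optimization losses, data augmentation methods, and model architectures.
Result: The methods estimate dry mass effectively even for specimens with complex and diverse morphologies. Combined with automatic classification, the approach achieves a median percentage error of 10-20% at the individual level. The study stresses the importance of reporting both percentage errors and absolute errors, because they measure different properties.
Conclusion: The study provides an accurate, non-destructive method for biomass estimation that can substantially improve the efficiency of large-scale biodiversity monitoring. It emphasizes the importance of metric choice, offers new ideas for computer vision in ecological monitoring, and shows the potential of deep learning for analyzing complex morphologies.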
📄 Abstract
The ability to estimate invertebrate biomass using only images could help scale up quantitative biodiversity monitoring efforts. Computer vision-based methods have the potential to omit the manual, time-consuming, and destructive process of dry weighing specimens. We present two approaches for dry mass estimation that do not require additional manual effort apart from imaging the specimens: fitting a linear model with novel predictors, automatically calculated by an imaging device, and training a family of end-to-end deep neural networks for the task, using single-view, multi-view, and metadata-aware architectures. We propose using area and sinking speed as predictors. These can be calculated with BIODISCOVER, which is a dual-camera system that captures image sequences of specimens sinking in an ethanol column. For this study, we collected a large dataset of dry mass measurement and image sequence pairs to train and evaluate models. We show that our methods can estimate specimen dry mass even with complex and visually diverse specimen morphologies. Combined with automatic taxonomic classification, our approach is an accurate method for group-level dry mass estimation, with a median percentage error of 10-20% for individuals. We highlight the importance of choosing appropriate evaluation metrics, and encourage using both percentage errors and absolute errors as metrics, because they measure different properties. We also explore different optimization losses, data augmentation methods, and model architectures for training deep-learning models.
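The point that percentage and absolute errors measure different properties can be seen with a small worked example. The masses below are made-up illustrative values in milligrams, not data from the paper: the lightest specimen has the largest percentage error but the smallest absolute error.

```python
from statistics import median

# Hypothetical dry masses (mg) and model predictions, for illustration only.
true_mass = [0.1, 1.0, 10.0]
pred_mass = [0.12, 1.1, 9.0]

# Percentage error is relative to each specimen's own mass.
percentage_errors = [abs(p - t) / t * 100.0 for t, p in zip(true_mass, pred_mass)]
# Absolute error is in the same units as the mass itself.
absolute_errors = [abs(p - t) for t, p in zip(true_mass, pred_mass)]

median_percentage_error = median(percentage_errors)
median_absolute_error = median(absolute_errors)
```

Here the 0.1 mg specimen is off by about 20% yet only 0.02 mg, while the 10 mg specimen is off by 10% yet a full 1 mg. Which one matters more depends on whether the downstream quantity is per-individual accuracy or total biomass.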
[79] OralGPT-Plus: Learning to Use Visual Tools via Reinforcement Learning for Panoramic X-ray Analysis
Yuxuan Fan, Jing Hao, Hong Chen, Jiahao Bao, Yihua Shao, Yuci Liang, Kuo Feng Hung, Hao Tang
🧩 TL;DR
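This paper presents OralGPT-Plus, an agentic vision-language model for panoramic dental radiograph analysis. Its iterative, symmetry-aware diagnostic reasoning addresses the limited clinical reliability of the existing static single-pass inference paradigm.
📘 Detailed Summary
Motivation: Existing vision-language models use a static single-pass inference paradigm that cannot meet the demands of panoramic dental radiograph analysis: fine-grained spatial reasoning, understanding of bilateral symmetry, and multi-step diagnostic verification. This limits their reliability in clinical use.
Method: The authors construct DentalProbe, a dataset of five thousand images with expert-annotated diagnostic trajectories that supervise localized inspection and contralateral comparison. They develop a Reinspection-driven reinforcement learning framework that combines a rubric-based reward with a conditioned diagnostic-driven reward to encourage clinically meaningful re-examination and stabilize long-horizon reasoning. They also create MMOral-X, the first benchmark for holistic panoramic diagnosis, with 300 open-ended questions and region-level annotations.
Result: OralGPT-Plus shows consistent and reliable improvements over strong baselines on MMOral-X and on established panoramic benchmarks, supporting the effectiveness of interactive, symmetry-aware reasoning.
Conclusion: The study highlights the value of agentic modeling for dental image analysis, lays a foundation for future research on clinically aligned panoramic radiograph analysis, and shows that iterative reasoning and symmetry understanding matter for diagnostic reliability.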
📄 Abstract
Panoramic dental radiographs require fine-grained spatial reasoning, bilateral symmetry understanding, and multi-step diagnostic verification, yet existing vision-language models operate under a static single-pass paradigm that limits their clinical reliability. In this paper, we introduce OralGPT-Plus, an agentic vision-language model designed to perform iterative and symmetry-aware diagnostic reasoning for panoramic dental radiograph analysis. To support this paradigm, we construct DentalProbe, a five-thousand-image dataset with expert-curated diagnostic trajectories that provide structured supervision for localized inspection and contralateral comparison. We further develop a Reinspection-driven reinforcement learning framework that encourages clinically meaningful re-examination and stabilizes long-horizon reasoning with rubric-based reward and conditioned diagnostic-driven reward. In parallel, we present MMOral-X, the first benchmark for holistic panoramic diagnosis, containing 300 open-ended questions and region-level annotations across multiple difficulty levels. OralGPT-Plus demonstrates consistent and reliable improvements over strong baselines on MMOral-X and established panoramic benchmarks, indicating the effectiveness of interactive and symmetry-informed reasoning. Our work highlights the value of agentic modeling for dental imaging and provides a foundation for future research in clinically aligned panoramic radiograph analysis.
[80] Rewis3d: Reconstruction Improves Weakly-Supervised Semantic Segmentation
Jonas Ernst, Wolfgang Boettcher, Lukas Hoyer, Jan Eric Lenssen, Bernt Schiele
🧩 TL;DR
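Rewis3d is a framework that uses feed-forward 3D reconstruction to improve weakly supervised semantic segmentation of 2D images. It propagates sparse annotations through 3D geometric structure and substantially improves segmentation without additional labels or inference overhead.
📘 Detailed Summary
Motivation: Dense pixel-level annotation is expensive, and sparse annotation, the weakly supervised alternative, still leaves a performance gap. The study aims to close this gap by using 3D scene reconstruction as an auxiliary supervisory signal that propagates sparse annotations through 3D geometric structure.
Method: The Rewis3d framework uses feed-forward 3D reconstruction to recover 3D geometry from 2D videos. A dual student-teacher architecture enforces semantic consistency between 2D images and the reconstructed 3D point clouds, so that 3D geometry serves as a reliable supervisory signal for propagating sparse annotations.
Result: Under sparse supervision, Rewis3d achieves state-of-the-art performance, outperforming existing approaches by 2-7% without additional labels or added inference overhead.
Conclusion: The study shows that 3D geometric structure provides a strong supervisory signal for 2D semantic segmentation and that feed-forward 3D reconstruction can effectively propagate sparse annotations. The approach opens a new direction for weakly supervised learning and shows the potential of cross-modal supervision.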
📄 Abstract
We present Rewis3d, a framework that leverages recent advances in feed-forward 3D reconstruction to significantly improve weakly supervised semantic segmentation on 2D images. Obtaining dense, pixel-level annotations remains a costly bottleneck for training segmentation models. Alleviating this issue, sparse annotations offer an efficient weakly-supervised alternative. However, they still incur a performance gap. To address this, we introduce a novel approach that leverages 3D scene reconstruction as an auxiliary supervisory signal. Our key insight is that 3D geometric structure recovered from 2D videos provides strong cues that can propagate sparse annotations across entire scenes. Specifically, a dual student-teacher architecture enforces semantic consistency between 2D images and reconstructed 3D point clouds, using state-of-the-art feed-forward reconstruction to generate reliable geometric supervision. Extensive experiments demonstrate that Rewis3d achieves state-of-the-art performance in sparse supervision, outperforming existing approaches by 2-7% without requiring additional labels or inference overhead.
[81] MoEMambaMIL: Structure-Aware Selective State Space Modeling for Whole-Slide Image Analysis
Dongqing Xie, Yonghuang Wu
🧩 TL;DR
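This study proposes MoEMambaMIL, a structure-aware state space model framework for whole-slide image analysis. It combines region-nested selective scanning with mixture-of-experts modeling and achieves the best performance across 9 downstream tasks.
📘 Detailed Summary
Motivation: Whole-slide image analysis is challenging because of the gigapixel scale of slides and their inherent hierarchical multi-resolution structure. Existing multiple instance learning methods usually model a WSI as an unordered collection of patches, which limits their ability to capture structured dependencies between global tissue organization and local patterns. How to structure WSI tokens to fully exploit their spatial hierarchy remains an open problem.
Method: MoEMambaMIL integrates region-nested selective scanning with mixture-of-experts modeling. Using multi-resolution preprocessing, it organizes patch tokens into region-aware sequences that preserve spatial containment across resolutions. A combination of static, resolution-specific experts and dynamic sparse experts with learned routing decouples resolution-aware encoding from region-adaptive contextual modeling.
Result: MoEMambaMIL achieves the best performance across 9 downstream tasks. The framework models long sequences efficiently while promoting expert specialization across heterogeneous diagnostic patterns.
Conclusion: The study provides a structure-aware state space model framework that addresses hierarchical structure modeling in whole-slide image analysis by organizing WSI token sequences effectively and exploiting mixture-of-experts specialization. It offers new ideas for long-sequence modeling and expert specialization in medical image analysis.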
📄 Abstract
Whole-slide image (WSI) analysis is challenging due to the gigapixel scale of slides and their inherent hierarchical multi-resolution structure. Existing multiple instance learning (MIL) approaches often model WSIs as unordered collections of patches, which limits their ability to capture structured dependencies between global tissue organization and local cellular patterns. Although recent State Space Models (SSMs) enable efficient modeling of long sequences, how to structure WSI tokens to fully exploit their spatial hierarchy remains an open problem. We propose MoEMambaMIL, a structure-aware SSM framework for WSI analysis that integrates region-nested selective scanning with mixture-of-experts (MoE) modeling. Leveraging multi-resolution preprocessing, MoEMambaMIL organizes patch tokens into region-aware sequences that preserve spatial containment across resolutions. On top of this structured sequence, we decouple resolution-aware encoding and region-adaptive contextual modeling via a combination of static, resolution-specific experts and dynamic sparse experts with learned routing. This design enables efficient long-sequence modeling while promoting expert specialization across heterogeneous diagnostic patterns. Experiments demonstrate that MoEMambaMIL achieves the best performance across 9 downstream tasks.
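One way to picture a "region-aware sequence that preserves spatial containment across resolutions" is to emit each low-resolution region token followed immediately by the high-resolution patches it contains. The sketch below does this for a two-level grid. The function name, the two-level hierarchy, the integer `scale` factor, and the row-major order are all assumptions, since the abstract does not specify the scan.

```python
def region_nested_order(high_res_coords, scale):
    """Order tokens so each region is followed by the patches it contains.

    high_res_coords: list of (x, y) patch coordinates at high resolution.
    scale: how many high-res patches span one low-res region per axis.
    """
    groups = {}
    for x, y in high_res_coords:
        # Integer division maps a patch to its containing region.
        groups.setdefault((x // scale, y // scale), []).append((x, y))

    sequence = []
    for region in sorted(groups, key=lambda r: (r[1], r[0])):  # row-major
        sequence.append(("region", region))
        for patch in sorted(groups[region], key=lambda p: (p[1], p[0])):
            sequence.append(("patch", patch))
    return sequence


tokens = region_nested_order([(0, 0), (1, 0), (2, 0), (0, 1)], scale=2)
```

With an ordering like this, a sequence model sees coarse context first and then the fine detail nested inside it, instead of an unordered bag of patches.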
[82] REACT++: Efficient Cross-Attention for Real-Time Scene Graph Generation
Maëlic Neau, Zoe Falomir
🧩 TL;DR
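This paper proposes REACT++, a model that balances performance and inference speed in scene graph generation. Compared with its predecessor REACT, it improves relation prediction accuracy by 10% while running 20% faster, making it a new state of the art for real-time SGG.
📘 Detailed Summary
Motivation: Current scene graph generation methods usually pursue a single goal, whether improving relation prediction accuracy, improving object detection accuracy, or reducing latency, and lack a solution that balances all three. This limits the usefulness of SGG in real-time applications.
Method: REACT++ builds on the REACT architecture. Through efficient feature extraction and subject-to-object cross-attention in the prototype space, it optimizes computational efficiency while retaining representational power, balancing latency and performance.
Result: REACT++ achieves the highest inference speed among existing SGG models and improves relation prediction accuracy without sacrificing object detection performance. Compared with the previous REACT version, it is 20% faster with an average gain of 10% in relation prediction accuracy.
Conclusion: The study shows that careful architectural optimization can improve both the accuracy and the real-time performance of scene graph generation, providing a more practical building block for downstream tasks such as reasoning for embodied agents and advancing the practical use of real-time visual relationship understanding.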
📄 Abstract
Scene Graph Generation (SGG) is a task that encodes visual relationships between objects in images as graph structures. SGG shows significant promise as a foundational component for downstream tasks, such as reasoning for embodied agents. To enable real-time applications, SGG must address the trade-off between performance and inference speed. However, current methods tend to focus on one of the following: (1) improving relation prediction accuracy, (2) enhancing object detection accuracy, or (3) reducing latency, without aiming to balance all three objectives simultaneously. To address this limitation, we build on the powerful Real-time Efficiency and Accuracy Compromise for Tradeoffs in Scene Graph Generation (REACT) architecture and propose REACT++, a new state-of-the-art model for real-time SGG. By leveraging efficient feature extraction and subject-to-object cross-attention within the prototype space, REACT++ balances latency and representational power. REACT++ achieves the highest inference speed among existing SGG models, improving relation prediction accuracy without sacrificing object detection performance. Compared to the previous REACT version, REACT++ is 20% faster with a gain of 10% in relation prediction accuracy on average. The code is available at https://github.com/Maelic/SGG-Benchmark.
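As a reminder of what cross-attention computes, the sketch below implements plain scaled dot-product cross-attention in which subject vectors act as queries and object vectors supply the keys and values. It is a generic textbook formulation and not the REACT++ module: the prototype space, any learned projections, and the efficiency optimizations from the paper are omitted.

```python
import math


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention over small lists of vectors."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        # Numerically stable softmax over the object keys.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the object values.
        outputs.append(
            [sum(w * v[c] for w, v in zip(weights, values)) for c in range(len(values[0]))]
        )
    return outputs


# One subject query; both object keys score equally, so values are averaged.
attended = cross_attention([[1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]], [[2.0, 0.0], [4.0, 0.0]])
```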
[83] Solving Jigsaw Puzzles in the Wild: Human-Guided Reconstruction of Cultural Heritage Fragments
Omidreza Safaei, Sinem Aslan, Sebastiano Vascon, Luca Palmieri, Marina Khoroshiltseva, Marcello Pelillo
🧩 TL;DR
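This paper proposes a human-in-the-loop puzzle-solving framework for the large scale and complexity of reassembling real-world archaeological fragments. It combines an automatic solver with interactive human guidance and markedly improves reassembly accuracy and efficiency.
📘 Detailed Summary
Motivation: Reassembling real-world archaeological fragments faces major challenges: erosion, missing regions, irregular shapes, and large-scale ambiguity. Traditional jigsaw solvers are usually designed for clean synthetic scenarios and struggle when fragments number in the thousands, as in the RePAIR benchmark. New methods are needed that can handle the complexity and scale of cultural heritage reconstruction.
Method: The proposed human-in-the-loop framework combines an automatic relaxation-labeling solver with interactive human guidance. Users can iteratively lock verified placements, correct errors, and steer the system toward semantically and geometrically coherent assemblies. Two complementary interaction strategies, Iterative Anchoring and Continuous Interactive Refinement, support scalable reconstruction across different levels of ambiguity and puzzle sizes.
Result: Experiments on several RePAIR groups show that the hybrid approach substantially outperforms both fully automatic and manual baselines in accuracy and efficiency, offering a practical solution for large-scale expert-in-the-loop artifact reassembly.
Conclusion: The study demonstrates the effectiveness of a human-in-the-loop framework for large-scale archaeological reassembly and provides a practical expert-in-the-loop reconstruction tool for cultural heritage preservation. It indicates that hybrid methods combining automatic algorithms with human expertise can substantially improve performance on complex real-world tasks, and it offers a reference for designing human-machine collaborative systems in similar domains.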
📄 Abstract
Reassembling real-world archaeological artifacts from fragmented pieces poses significant challenges due to erosion, missing regions, irregular shapes, and large-scale ambiguity. Traditional jigsaw puzzle solvers, often designed for clean synthetic scenarios, struggle under these conditions, especially when the number of fragments grows into the thousands, as in the RePAIR benchmark. In this paper, we propose a human-in-the-loop (HIL) puzzle solving framework designed to address the complexity and scale of real-world cultural heritage reconstruction. Our approach integrates an automatic relaxation-labeling solver with interactive human guidance, allowing users to iteratively lock verified placements, correct errors, and guide the system toward semantically and geometrically coherent assemblies. We introduce two complementary interaction strategies, Iterative Anchoring and Continuous Interactive Refinement, which support scalable reconstruction across varying levels of ambiguity and puzzle size. Experiments on several RePAIR groups demonstrate that our hybrid approach substantially outperforms both fully automatic and manual baselines in accuracy and efficiency, offering a practical solution for large-scale expert-in-the-loop artifact reassembly.
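A rough sketch of how locked placements could interact with a relaxation-labeling solver is given below. It uses a classical multiplicative update, in which each fragment's probability for a position is reweighted by a support score and renormalized, and it clamps user-locked fragments to a one-hot distribution. The update rule, the precomputed `support` input, and the `locked` dictionary are assumptions; the abstract does not describe the solver's internals.

```python
def relaxation_step(probs, support, locked):
    """One relaxation-labeling update with human-locked placements.

    probs:   per-fragment probability lists over candidate positions.
    support: per-fragment non-negative support scores for each position
             (in a full solver these come from pairwise compatibilities).
    locked:  dict mapping fragment index -> verified position index.
    """
    updated = []
    for index, (p, q) in enumerate(zip(probs, support)):
        if index in locked:
            # A verified placement is fixed and never revised by the solver.
            updated.append([1.0 if j == locked[index] else 0.0 for j in range(len(p))])
            continue
        weighted = [pj * qj for pj, qj in zip(p, q)]
        total = sum(weighted)
        updated.append([w / total for w in weighted] if total > 0 else list(p))
    return updated


new_probs = relaxation_step(
    probs=[[0.5, 0.5], [0.5, 0.5]],
    support=[[1.0, 3.0], [1.0, 1.0]],
    locked={1: 0},
)
```

Because the locked fragment's distribution is certain, any support computed from it in later iterations pulls neighboring fragments toward placements consistent with the expert's decision.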
[84] DiffInf: Influence-Guided Diffusion for Supervision Alignment in Facial Attribute Learning
Basudha Pal, Rama Chellappa
🧩 TL;DR
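This paper proposes DiffInf, a self-influence-guided diffusion framework for mitigating annotation inconsistencies in facial attribute classification. Instead of discarding influential, inconsistently annotated samples, it repairs them through generative correction, improving downstream classification.
📘 Detailed Summary
Motivation: Facial attribute classification relies on large annotated datasets in which attributes such as age and expression are inherently ambiguous and continuous yet are discretized into categorical labels. Annotation inconsistencies arise from subjectivity and visual confounders (pose, illumination, expression, and demographic variation), creating mismatches between images and their assigned labels. These inconsistencies introduce supervision errors that impair representation learning and degrade downstream prediction.
Method: DiffInf first trains a baseline classifier and computes sample-wise self-influence scores with a first-order approximation to identify training instances that disproportionately affect optimization. It then applies targeted generative correction to these influential samples with a latent diffusion autoencoder, better aligning visual content with the assigned labels while preserving identity and realism. To enable differentiable guidance during correction, a lightweight predictor of high-influence membership is trained and used as a surrogate influence regularizer.
Result: Across multi-class facial attribute classification tasks, DiffInf consistently generalizes better than standard noisy-label training, robust optimization baselines, and influence-based filtering. The results show that repairing influential annotation inconsistencies at the image level improves downstream facial attribute classification without sacrificing distributional coverage.
Conclusion: The study shows that repairing influential, inconsistently annotated samples through generative correction, rather than discarding them, effectively improves facial attribute classification. It offers a new paradigm for handling annotation noise: improve dataset quality through content alignment instead of sample removal, keeping the dataset size unchanged.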
📄 Abstract
Facial attribute classification relies on large-scale annotated datasets in which many traits, such as age and expression, are inherently ambiguous and continuous but are discretized into categorical labels. Annotation inconsistencies arise from subjectivity and visual confounders such as pose, illumination, expression, and demographic variation, creating a mismatch between images and assigned labels. These inconsistencies introduce supervision errors that impair representation learning and degrade downstream prediction. We introduce DiffInf, a self-influence-guided diffusion framework for mitigating annotation inconsistencies in facial attribute learning. We first train a baseline classifier and compute sample-wise self-influence scores using a practical first-order approximation to identify training instances that disproportionately destabilize optimization. Instead of discarding these influential samples, we apply targeted generative correction via a latent diffusion autoencoder to better align visual content with assigned labels while preserving identity and realism. To enable differentiable guidance during correction, we train a lightweight predictor of high-influence membership and use it as a surrogate influence regularizer. The edited samples replace the originals, yielding an influence-refined dataset of unchanged size. Across multi-class facial attribute classification, DiffInf consistently improves generalization compared with standard noisy-label training, robust optimization baselines, and influence-based filtering. Our results demonstrate that repairing influential annotation inconsistencies at the image level enhances downstream facial attribute classification without sacrificing distributional coverage.
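The abstract does not specify which first-order approximation is used for self-influence. A common choice in the literature, assumed here purely for illustration, is the squared norm of a sample's own loss gradient at a trained checkpoint. The sketch computes that quantity for a binary logistic-regression model, where the gradient has the closed form $(p - y)\,x$.

```python
import math


def self_influence(weights, features, label):
    """Squared gradient norm of the logistic loss for one sample.

    Assumed stand-in for a first-order self-influence score; the
    paper's exact estimator may differ.
    """
    logit = sum(w * x for w, x in zip(weights, features))
    prob = 1.0 / (1.0 + math.exp(-logit))
    gradient = [(prob - label) * x for x in features]
    return sum(g * g for g in gradient)


# An uncertain prediction yields a large score...
uncertain_score = self_influence([0.0, 0.0], [1.0, 2.0], label=1)
# ...while a confident, correctly labeled sample yields a small one.
confident_score = self_influence([5.0, 0.0], [1.0, 0.0], label=1)
```

Samples whose labels conflict with what the model has learned tend to have large gradients, which is why a score of this kind can flag candidates for correction.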
[85] What if? Emulative Simulation with World Models for Situated Reasoning
Ruiping Liu, Yufan Chen, Yuheng Zhang, Junwei Zheng, Kunyu Peng, Chengzhi Wu, Chenguang Huang, Di Wen, Jiaming Zhang, Kailun Yang, Rainer Stiefelhagen
🧩 TL;DR
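This paper introduces WanderDream, the first large-scale dataset for simulating mental exploration, which lets models perform spatial reasoning without active exploration. It contains 15.8K panoramic videos and 158K question-answer pairs, and experiments with world models and multimodal large language models show the importance of mental simulation for situated reasoning.
📘 Detailed Summary
Motivation: In many real-world scenarios, active exploration is infeasible because of the physical constraints of robots or safety concerns for visually impaired users. The study asks whether an agent, given only a limited observation, can mentally simulate a future trajectory to answer spatial what-if questions, filling a gap in research on situated reasoning without active exploration.
Method: The WanderDream dataset has two main parts. WanderDream-Gen contains 15.8K panoramic videos across 1,088 real scenes from HM3D, ScanNet++, and real-world captures, depicting imagined trajectories from the current viewpoint to a target situation. WanderDream-QA contains 158K question-answer pairs covering the starting state, path, and end state of each trajectory, to comprehensively evaluate exploration-based reasoning.
Result: Extensive experiments with world models and multimodal large language models show that mental exploration is essential for situated reasoning, that world models achieve compelling performance on WanderDream-Gen, that imagination substantially facilitates reasoning on WanderDream-QA, and that WanderDream data transfer well to real-world scenarios.
Conclusion: The study demonstrates the effectiveness of mental simulation as an alternative to active exploration and offers a new paradigm for spatial reasoning in constrained environments. The release of WanderDream will provide an important benchmark for future research on situated reasoning without physical exploration, and the work shows the potential of world models for simulating mental exploration.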
📄 Abstract
Situated reasoning often relies on active exploration, yet in many real-world scenarios such exploration is infeasible due to physical constraints of robots or safety concerns of visually impaired users. Given only a limited observation, can an agent mentally simulate a future trajectory toward a target situation and answer spatial what-if questions? We introduce WanderDream, the first large-scale dataset designed for the emulative simulation of mental exploration, enabling models to reason without active exploration. WanderDream-Gen comprises 15.8K panoramic videos across 1,088 real scenes from HM3D, ScanNet++, and real-world captures, depicting imagined trajectories from current viewpoints to target situations. WanderDream-QA contains 158K question-answer pairs, covering starting states, paths, and end states along each trajectory to comprehensively evaluate exploration-based reasoning. Extensive experiments with world models and MLLMs demonstrate (1) that mental exploration is essential for situated reasoning, (2) that world models achieve compelling performance on WanderDream-Gen, (3) that imagination substantially facilitates reasoning on WanderDream-QA, and (4) that WanderDream data exhibit remarkable transferability to real-world scenarios. The source code and all data will be released.
[86] Match4Annotate: Propagating Sparse Video Annotations via Implicit Neural Feature Matching
Zhuorui Zhang, Roger Pallarès-López, Praneeth Namburi, Brian W. Anthony
🧩 TL;DR
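This paper presents Match4Annotate, a lightweight framework for intra-video and inter-video propagation of point and mask annotations. It fits a SIREN-based implicit neural representation to DINOv3 features and learns a smooth implicit deformation field to guide correspondence matching, addressing the high cost of video annotation in specialized domains.
📘 Detailed Summary
Motivation: In specialized domains such as medical imaging, acquiring per-frame video annotations remains a primary bottleneck for deploying computer vision, because expert labeling is slow and costly. Existing label propagation methods have fundamental limitations: video trackers and segmentation models need per-video initialization and cannot generalize across videos; classic correspondence pipelines perform poorly in low-texture scenes; and dense feature matching and one-shot segmentation methods lack spatiotemporal smoothness and unified support for both point and mask annotations.
Method: Match4Annotate fits a SIREN-based implicit neural representation to DINOv3 features at test time, producing a continuous, high-resolution spatiotemporal feature field. It learns a smooth implicit deformation field between frame pairs to guide correspondence matching and supports both intra-video and inter-video propagation of point and mask annotations, yielding a lightweight, test-time-optimized feature matching pipeline.
Result: On three challenging clinical ultrasound datasets, Match4Annotate achieves state-of-the-art inter-video propagation, outperforming feature matching and one-shot segmentation baselines. For intra-video propagation it remains competitive with specialized trackers.
Conclusion: The study shows that lightweight, test-time-optimized feature matching pipelines have the potential to offer an efficient and accessible solution for scalable annotation workflows. The method supports point and mask annotations in a unified way and addresses the limitations of existing methods in spatiotemporal smoothness and cross-video generalization, offering a new technical route to automating video annotation in specialized domains.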
📄 Abstract
Acquiring per-frame video annotations remains a primary bottleneck for deploying computer vision in specialized domains such as medical imaging, where expert labeling is slow and costly. Label propagation offers a natural solution, yet existing approaches face fundamental limitations. Video trackers and segmentation models can propagate labels within a single sequence but require per-video initialization and cannot generalize across videos. Classic correspondence pipelines operate on detector-chosen keypoints and struggle in low-texture scenes, while dense feature matching and one-shot segmentation methods enable cross-video propagation but lack spatiotemporal smoothness and unified support for both point and mask annotations. We present Match4Annotate, a lightweight framework for both intra-video and inter-video propagation of point and mask annotations. Our method fits a SIREN-based implicit neural representation to DINOv3 features at test time, producing a continuous, high-resolution spatiotemporal feature field, and learns a smooth implicit deformation field between frame pairs to guide correspondence matching. We evaluate on three challenging clinical ultrasound datasets. Match4Annotate achieves state-of-the-art inter-video propagation, outperforming feature matching and one-shot segmentation baselines, while remaining competitive with specialized trackers for intra-video propagation. Our results show that lightweight, test-time-optimized feature matching pipelines have the potential to offer an efficient and accessible solution for scalable annotation workflows.
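For readers unfamiliar with SIREN, its building block is a linear layer followed by a sine activation with a frequency factor $\omega_0$, that is $\sin(\omega_0 (Wx + b))$. The sketch below evaluates one such layer on a coordinate vector. The weights are placeholders, and how Match4Annotate stacks, initializes, and trains these layers against DINOv3 features is not shown in the abstract, so none of that is modeled here.

```python
import math


def siren_layer(x, weights, biases, w0=30.0):
    """One sine-activated layer: sin(w0 * (W x + b)).

    w0 = 30 is the frequency factor commonly used in SIREN networks.
    """
    return [
        math.sin(w0 * (sum(w * xi for w, xi in zip(row, x)) + b))
        for row, b in zip(weights, biases)
    ]


# Map a normalized (x, y, t) coordinate to a 2-dimensional hidden feature.
hidden = siren_layer(
    [0.1, -0.2, 0.5],
    weights=[[0.3, 0.1, -0.2], [0.0, 0.4, 0.1]],
    biases=[0.0, 0.05],
)
```

Because the network is a continuous function of the input coordinate, it can be queried at any spatial or temporal position, which is what makes the fitted feature field continuous and high-resolution.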
[87] SCAN: Visual Explanations with Self-Confidence and Analysis Networks
Gwanghee Lee, Sungyoon Jeong, Kyoungson Jhang
🧩 TL;DR
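This paper proposes SCAN (Self-Confidence and Analysis Networks), a novel universal explainable-AI framework. Using autoencoder-based feature reconstruction and the Information Bottleneck principle, it generates high-resolution, object-focused explanations for both CNN and Transformer architectures, addressing the trade-off between fidelity and generality in existing methods.
📘 Detailed Summary
Motivation: Current visual explanation methods face a critical trade-off between the high fidelity of architecture-specific methods and the broad applicability of universal ones. This often leads to abstract or fragmented explanations and makes it hard to compare explanatory power across model families such as CNNs and Transformers. A unified framework that is both universal and faithful is needed to improve model transparency.
Method: SCAN uses an autoencoder-based approach to reconstruct features from a model's intermediate layers. Guided by the Information Bottleneck principle, it generates a high-resolution Self-Confidence Map that identifies information-rich regions. The framework applies to both convolutional neural networks and Transformer architectures, providing interpretability across model families with a single method.
Result: Extensive experiments on diverse architectures and datasets show that SCAN consistently performs well on quantitative metrics such as AUC-D, Negative AUC, Drop%, and Win%. Qualitative analysis shows that its explanations are clearer and more object-focused than those of existing methods.
Conclusion: By providing a unified framework that is both architecturally universal and highly faithful, SCAN improves model transparency and offers a more reliable tool for understanding the decisions of complex neural networks. It addresses the long-standing fidelity-generality trade-off in explainable AI and supports standardized comparison of explanation methods across model families.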
📄 Abstract
Explainable AI (XAI) has become essential in computer vision to make the decision-making processes of deep learning models transparent. However, current visual explanation methods face a critical trade-off between the high fidelity of architecture-specific methods and the broad applicability of universal ones. This often results in abstract or fragmented explanations and makes it difficult to compare explanatory power across diverse model families, such as CNNs and Transformers. This paper introduces the Self-Confidence and Analysis Networks (SCAN), a novel universal framework that overcomes these limitations for both convolutional neural network and transformer architectures. SCAN utilizes an AutoEncoder-based approach to reconstruct features from a model's intermediate layers. Guided by the Information Bottleneck principle, it generates a high-resolution Self-Confidence Map that identifies information-rich regions. Extensive experiments on diverse architectures and datasets demonstrate that SCAN consistently achieves outstanding performance on various quantitative metrics such as AUC-D, Negative AUC, Drop%, and Win%. Qualitatively, it produces significantly clearer, object-focused explanations than existing methods. By providing a unified framework that is both architecturally universal and highly faithful, SCAN enhances model transparency and offers a more reliable tool for understanding the decision-making processes of complex neural networks.
[88] NEGATE: Constrained Semantic Guidance for Linguistic Negation in Text-to-Video Diffusion
Taewon Kang, Ming C. Lin
🧩 TL;DR
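This paper presents a unified framework for handling linguistic negation in diffusion-based generative models. It models negation as a structured feasibility constraint on semantic guidance within the diffusion dynamics and achieves robust negation compliance without retraining model parameters.
📘 Detailed Summary
Motivation: Negation is a fundamental linguistic operator, yet it remains inadequately modeled in diffusion-based generative systems. Existing methods lack a unified treatment of negation and cannot effectively handle its diverse forms: object absence, graded non-inversion semantics, multi-negation composition, and scope-sensitive disambiguation.
Method: The method formalizes linguistic negation as a structured feasibility constraint within the diffusion dynamics. It reinterprets classifier-free guidance as defining a semantic update direction and enforces negation by projecting that update onto a convex constraint set derived from linguistic structure. The framework is training-free, compatible with pretrained diffusion backbones, and extends naturally from image generation to temporally evolving video trajectories.
Result: Experiments show that the method achieves robust negation compliance while preserving visual fidelity and structural coherence, establishing the first unified formulation of linguistic negation in diffusion-based generative models beyond representation-level evaluation. The authors also introduce a structured negation-centric benchmark suite that isolates distinct linguistic failure modes in generative systems.
Conclusion: The study provides the first unified formal treatment of linguistic negation in diffusion-based generative models. By modeling negation as a structured constraint rather than a heuristic, it handles diverse negation phenomena systematically, opens a direction for modeling more complex linguistic operations in generative systems, and supplies a benchmark dedicated to evaluating negation compliance.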
📄 Abstract
Negation is a fundamental linguistic operator, yet it remains inadequately modeled in diffusion-based generative systems. In this work, we present a formal treatment of linguistic negation in diffusion-based generative models by modeling it as a structured feasibility constraint on semantic guidance within diffusion dynamics. Rather than introducing heuristics or retraining model parameters, we reinterpret classifier-free guidance as defining a semantic update direction and enforce negation by projecting the update onto a convex constraint set derived from linguistic structure. This novel formulation provides a unified framework for handling diverse negation phenomena, including object absence, graded non-inversion semantics, multi-negation composition, and scope-sensitive disambiguation. Our approach is training-free, compatible with pretrained diffusion backbones, and naturally extends from image generation to temporally evolving video trajectories. In addition, we introduce a structured negation-centric benchmark suite that isolates distinct linguistic failure modes in generative systems, to further research in this area. Experiments demonstrate that our method achieves robust negation compliance while preserving visual fidelity and structural coherence, establishing the first unified formulation of linguistic negation in diffusion-based generative models beyond representation-level evaluation.
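The core operation, projecting a guidance update onto a convex constraint set, can be illustrated with the simplest convex set: a half-space. In the sketch below, `normal` stands for a hypothetical direction associated with the negated concept, and the constraint is that the update must not move along it, i.e. $\langle u, n \rangle \le 0$. The half-space form and all names are assumptions; the paper derives its constraint sets from linguistic structure, which is not reproduced here.

```python
def project_update(update, normal):
    """Euclidean projection of `update` onto the half-space {u : <u, n> <= 0}."""
    dot = sum(u * n for u, n in zip(update, normal))
    if dot <= 0.0:
        # Already feasible: the update does not move toward the negated concept.
        return list(update)
    norm_sq = sum(n * n for n in normal)
    # Remove only the component that violates the constraint.
    return [u - (dot / norm_sq) * n for u, n in zip(update, normal)]


violating = project_update([1.0, 1.0], [1.0, 0.0])
feasible = project_update([-1.0, 1.0], [1.0, 0.0])
```

The projection leaves the rest of the guidance direction untouched, which matches the stated goal of enforcing negation while preserving the remaining semantics of the prompt.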
[89] Modeling and Measuring Redundancy in Multisource Multimodal Data for Autonomous Driving
Yuhan Zhou, Mehri Sattari, Haihua Chen, Kewei Sha
🧩 TL;DR
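This paper focuses on redundancy, an overlooked data quality issue in autonomous driving datasets. It models and measures redundancy in multisource camera data and multimodal image-LiDAR data and evaluates how removing redundant labels affects object detection, showing that redundancy is a measurable and actionable data quality factor that affects autonomous driving performance.
📘 Detailed Summary
Motivation: Autonomous driving research has long prioritized algorithm design over data quality analysis, even though data quality in multisource, multimodal data varies with environmental conditions and sensor limitations. The paper studies redundancy, a fundamental but underexplored data quality issue in autonomous driving datasets, to fill this gap.
Method: Using the nuScenes and Argoverse 2 datasets, the study models and measures redundancy in multisource camera data and multimodal image-LiDAR data. It selectively removes redundant multisource image object labels from cameras with shared fields of view, evaluates the effect on the YOLOv8 object detection task, and analyzes the degree of redundancy in different overlap regions.
Result: Selectively removing redundant multisource image object labels improves detection. On three representative overlap regions in nuScenes, mAP50 rises from 0.66 to 0.70, from 0.64 to 0.67, and from 0.53 to 0.55, while detection on other overlapping camera pairs stays at the baseline even under stronger pruning. In AV2, 4.1%-8.6% of labels are removed and mAP50 stays near the 0.64 baseline. Multimodal analysis also reveals substantial redundancy between image and LiDAR data.
Conclusion: The study shows that redundancy is a measurable and actionable data quality factor in autonomous driving datasets with direct implications for performance. It highlights the role of redundancy in autonomous driving perception and motivates a data-centric perspective for evaluating and improving autonomous driving datasets.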
📄 Abstract
Next-generation autonomous vehicles (AVs) rely on large volumes of multisource and multimodal ($M^2$) data to support real-time decision-making. In practice, data quality (DQ) varies across sources and modalities due to environmental conditions and sensor limitations, yet AV research has largely prioritized algorithm design over DQ analysis. This work focuses on redundancy as a fundamental but underexplored DQ issue in AV datasets. Using the nuScenes and Argoverse 2 (AV2) datasets, we model and measure redundancy in multisource camera data and multimodal image-LiDAR data, and evaluate how removing redundant labels affects the YOLOv8 object detection task. Experimental results show that selectively removing redundant multisource image object labels from cameras with shared fields of view improves detection. In nuScenes, mAP$_{50}$ rises from $0.66$ to $0.70$, from $0.64$ to $0.67$, and from $0.53$ to $0.55$ on three representative overlap regions, while detection on other overlapping camera pairs remains at the baseline even under stronger pruning. In AV2, $4.1$-$8.6\%$ of labels are removed, and mAP$_{50}$ stays near the $0.64$ baseline. Multimodal analysis also reveals substantial redundancy between image and LiDAR data. These findings demonstrate that redundancy is a measurable and actionable DQ factor with direct implications for AV performance. This work highlights the role of redundancy as a data quality factor in AV perception and motivates a data-centric perspective for evaluating and improving AV datasets. Code, data, and implementation details are publicly available at: https://github.com/yhZHOU515/RedundancyAD
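To put the reported nuScenes numbers in perspective, the short calculation below converts the three before/after mAP50 pairs quoted in the abstract into absolute and relative gains. Only the arithmetic is added here; the inputs are the abstract's own figures.

```python
# (before, after) mAP50 pairs for the three nuScenes overlap regions.
map50_pairs = [(0.66, 0.70), (0.64, 0.67), (0.53, 0.55)]


def gains(before, after):
    """Return (absolute gain, relative gain in percent)."""
    absolute = after - before
    return absolute, absolute / before * 100.0


results = [gains(before, after) for before, after in map50_pairs]
relative_percent = [round(rel, 1) for _, rel in results]
```

The absolute gains are 0.04, 0.03, and 0.02, which correspond to relative improvements of roughly 6.1%, 4.7%, and 3.8%. These are modest but consistent gains obtained by removing labels rather than adding data.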
[90] EgoReasoner: Learning Egocentric 4D Reasoning via Task-Adaptive Structured Thinking
Fangrui Zhu, Yunfeng Xi, Jianmo Ni, Mu Cai, Boqing Gong, Long Zhao, Chen Qu, Ian Miao, Yi Li, Cheng Zhong, Huaizu Jiang, Shwetak Patel
🧩 TL;DR
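This paper proposes EgoReasoner, a two-stage framework that uses task-adaptive thinking templates and task-aware reward functions to address the mismatch between generic methods and the cognitive structure of egocentric 4D reasoning tasks. It substantially outperforms existing methods on the HD-EPIC benchmark.
📘 Detailed Summary
Motivation: Egocentric video understanding is complicated by the dynamic 4D nature of the environment, where camera motion and object displacement require continual re-evaluation of spatial relations. Task-agnostic approaches fall short: generic Chain-of-Thought methods lack task-appropriate reasoning primitives, and uniform reinforcement learning destabilizes performance on spatial tasks. Dedicated methods are needed for the different cognitive operations involved (spatial anchoring, temporal tracking, and duration reasoning).
Method: EgoReasoner has two stages. In the first, Task-Adaptive Thinking Templates guide the synthesis of structured CoT traces, and supervised fine-tuning teaches the model to reason adaptively across task types. In the second, task-aware reward functions verify entity grounding, temporal alignment, and task-adaptive logical consistency, and reinforcement fine-tuning with GRPO selectively strengthens each reasoning pathway.
Result: A 3B-parameter model trained on only 16K samples reaches 37.5% average accuracy on the HD-EPIC benchmark, surpassing Qwen2.5-VL-7B (25.7%) by more than 10 percentage points and showing that the method is efficient with limited data.
Conclusion: The study shows that egocentric 4D reasoning tasks call for dedicated methods aligned with their cognitive structure. EgoReasoner achieves this with task-adaptive reasoning scaffolds and reward signals, providing an effective two-stage training paradigm for complex video understanding and showing that high performance is attainable with limited data.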
📄 Abstract
Egocentric video understanding is inherently complex due to the dynamic 4D nature of the environment, where camera motion and object displacements necessitate a continuous re-evaluation of spatial relations. In this work, we target a suite of under-explored egocentric 4D reasoning tasks, including fixture interaction counting, viewpoint-relative fixture location, object movement itinerary tracking, and stationary object localization, that require fundamentally different cognitive operations: spatial anchoring, temporal tracking, and duration reasoning. We observe that these structural differences make task-agnostic approaches insufficient: generic Chain-of-Thought methods lack task-appropriate reasoning primitives, and uniform reinforcement learning actively destabilizes performance on spatial tasks. To address this, we propose EgoReasoner, a two-stage framework that aligns both the reasoning scaffold and the reward signal to each task's cognitive structure. In the first stage, Task-Adaptive Thinking Templates guide the synthesis of structured CoT traces that teach the model to reason adaptively across task types via supervised fine-tuning. In the second stage, task-aware reward functions verify entity grounding, temporal alignment, and task-adaptive logical consistency, selectively strengthening each reasoning pathway via reinforcement fine-tuning with GRPO. Our 3B-parameter model, trained on only 16K samples, achieves 37.5% average accuracy on the challenging HD-EPIC benchmark, surpassing Qwen2.5-VL-7B (25.7%) by over 10 points.
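GRPO scores each sampled response relative to the other responses drawn for the same prompt, so no separate value network is needed. The sketch below shows that group-relative advantage: each reward is centered by the group mean and scaled by the group standard deviation. The use of the population standard deviation and the small `eps` stabilizer are implementation assumptions, and the task-aware reward functions themselves are not modeled.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt."""
    center = mean(rewards)
    spread = pstdev(rewards)
    # eps avoids division by zero when all responses receive the same reward.
    return [(r - center) / (spread + eps) for r in rewards]


# Four sampled answers to one question: two pass the reward checks, two fail.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
# A group with identical rewards carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Responses that beat their group average get a positive advantage and are reinforced, and those below it are discouraged. A per-task reward can therefore strengthen one reasoning pathway without a shared critic.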
[91] Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders
Boqiang Zhang, Lei Ke, Ruihan Yang, Qi Gao, Tianyuan Qu, Rossell Chen, Dong Yu, Leoweiliang
🧩 TL;DR
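This paper presents Penguin-VL, a compact vision-language model whose vision encoder is initialized from a text-only LLM instead of through conventional contrastive pretraining. It achieves performance comparable to larger VLMs while keeping a lightweight architecture.
📘 Detailed Summary
Motivation: VLM development has relied mainly on scaling model size, which limits deployment on compute-constrained mobile and edge devices. Existing methods also generally depend on vision encoders initialized via large-scale contrastive pretraining, but there is an objective mismatch between the discriminative goal that contrastive learning optimizes and the fine-grained visual cues needed for dense captioning and complex reasoning.
Method: The core idea of Penguin-VL is to initialize the vision encoder from a text-only LLM rather than from contrastive pretraining. This is intended to avoid the suppression of fine-grained visual cues by the category-level invariances of contrastive learning, preserving richer spatial and temporal detail and providing better visual representations for dense perception and complex reasoning.
Result: Across multiple image and video benchmarks, Penguin-VL matches leading VLMs on mathematical reasoning and surpasses them on tasks such as document understanding, visual knowledge, and multi-perspective video understanding. Ablations show that Penguin-Encoder consistently outperforms contrastive-pretrained encoders in visual fidelity and data efficiency, especially in preserving the spatial and temporal cues needed for dense perception and complex reasoning.
Conclusion: The study indicates that improved visual representation, rather than model scaling alone, is the primary driver of performance. Penguin-Encoder is an effective alternative to conventional contrastive pretraining and a strong drop-in choice for compute-efficient VLMs, particularly in resource-constrained settings. The finding challenges the prevailing practice that VLMs must rely on vision encoders initialized via large-scale contrastive pretraining.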
📄 Abstract
Vision Language Model (VLM) development has largely relied on scaling model size, which hinders deployment on compute-constrained mobile and edge devices such as smartphones and robots. In this work, we explore the performance limits of compact (e.g., 2B and 8B) VLMs. We challenge the prevailing practice that state-of-the-art VLMs must rely on vision encoders initialized via massive contrastive pretraining (e.g., CLIP/SigLIP). We identify an objective mismatch: contrastive learning, optimized for discrimination, enforces coarse and category-level invariances that suppress fine-grained visual cues needed for dense captioning and complex VLM reasoning. To address this issue, we present Penguin-VL, whose vision encoder is initialized from a text-only LLM. Our experiments reveal that Penguin-Encoder serves as a superior alternative to traditional contrastive pretraining, unlocking a higher degree of visual fidelity and data efficiency for multimodal understanding. Across various image and video benchmarks, Penguin-VL achieves performance comparable to leading VLMs (e.g., Qwen3-VL) in mathematical reasoning and surpasses them in tasks such as document understanding, visual knowledge, and multi-perspective video understanding. Notably, these gains are achieved with a lightweight architecture, demonstrating that improved visual representation rather than model scaling is the primary driver of performance. Our ablations show that Penguin-Encoder consistently outperforms contrastive-pretrained encoders, preserving fine-grained spatial and temporal cues that are critical for dense perception and complex reasoning. This makes it a strong drop-in alternative for compute-efficient VLMs and enables high performance in resource-constrained settings. Code: https://github.com/tencent-ailab/Penguin-VL
[92] Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion
Lijiang Li, Zuwei Long, Yunhang Shen, Heting Gao, Haoyu Cao, Xing Sun, Caifeng Shan, Ran He, Chaoyou Fu
🧩 TL;DR
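This paper presents Omni-Diffusion, the first any-to-any multimodal language model built entirely on mask-based discrete diffusion models. It unifies understanding and generation across text, speech, and images, and demonstrates the potential of diffusion models as a backbone for multimodal foundation models.
📘 Detailed Summary
Motivation: Current multimodal large language models mostly use a conventional autoregressive architecture as their backbone, leaving room to explore effective and efficient architectural alternatives. Meanwhile, discrete diffusion models have shown promise as a backbone for multimodal systems in areas such as visual understanding and image generation, but they have not been fully explored for building unified multimodal models.
Method: Omni-Diffusion uses a unified mask-based discrete diffusion model to directly capture the joint distribution over discrete multimodal tokens. Built entirely on mask-based discrete diffusion, the model supports conversion from any modality to any modality, handling not only bimodal tasks but also more complex scenarios involving multiple modalities.
Result: On a diverse set of benchmarks, the method outperforms or performs on par with existing multimodal systems that process two or more modalities. The experiments indicate strong performance in unifying understanding and generation across text, speech, and images, supporting the effectiveness of discrete diffusion models for multimodal tasks.
Conclusion: The work highlights the promise of diffusion models for powering the next generation of multimodal foundation models. Omni-Diffusion suggests that mask-based discrete diffusion can serve as an effective alternative to conventional autoregressive architectures, offering a new architectural direction for more unified and efficient multimodal systems.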
📄 Abstract
While recent multimodal large language models (MLLMs) have made impressive strides, they predominantly employ a conventional autoregressive architecture as their backbone, leaving significant room to explore effective and efficient alternatives in architectural design. Concurrently, recent studies have successfully applied discrete diffusion models to various domains, such as visual understanding and image generation, revealing their considerable potential as a promising backbone for multimodal systems. Drawing inspiration from this pioneering research, we introduce Omni-Diffusion, the first any-to-any multimodal language model built entirely on mask-based discrete diffusion models, which unifies understanding and generation across text, speech, and images. Omni-Diffusion employs a unified mask-based discrete diffusion model to directly capture the joint distribution over discrete multimodal tokens. This approach supports not only bimodal tasks but also more complex scenarios involving multiple modalities. On a diverse set of benchmarks, our method outperforms or performs on par with existing multimodal systems that process two or more modalities, highlighting the significant promise of diffusion models in powering the next generation of multimodal foundation models. Project webpage: https://omni-diffusion.github.io.
[93] Multimodal Large Language Models as Image Classifiers
Nikita Kisel, Illia Volkov, Klara Janouskova, Jiri Matas
🧩 TL;DR
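This study shows that the apparent underperformance of multimodal large language models on classification stems mainly from flawed evaluation protocols and noisy ground-truth labels rather than from deficiencies in the models themselves. By fixing key protocol issues and introducing ReGT, a high-quality reannotated dataset, it substantially narrows the gap between MLLMs and supervised models.
📘 Detailed Summary
Motivation: Studies of MLLM classification performance report conflicting conclusions, which stem from flawed evaluation protocols and ground-truth quality issues. The study aims to systematically identify and fix the key evaluation problems that inflate or underestimate MLLM performance, including discarded model outputs that fall outside the class list, results inflated by weak distractors, and poor output mapping.
Method: The study systematically analyzes common MLLM evaluation protocols and identifies and fixes three key issues: the handling of model outputs that fall outside the class list, performance inflation from weak distractors, and output mapping in the open-world setting. It also quantifies how commonly overlooked design choices (batch size, image ordering, and text encoder selection) affect accuracy. Finally, it builds ReGT, a multilabel reannotation of 625 ImageNet-1k classes.
Result: Evaluation on ReGT shows that MLLMs benefit most from corrected labels (accuracy gains of up to +10.8%), substantially narrowing the perceived gap with supervised models. Models less reliant on supervised training signals prove most sensitive to annotation quality. In a controlled case study, annotators confirmed or integrated MLLM predictions in approximately 50% of difficult cases, demonstrating practical value for dataset annotation.
Conclusion: MLLM underperformance on classification is largely an artifact of noisy ground truth and flawed evaluation protocols rather than an inherent model deficiency. Model performance is highly sensitive to annotation quality, especially for models less reliant on supervised signals. MLLMs show practical potential for assisting human annotators in large-scale dataset construction and quality improvement, which offers guidance for more accurate model evaluation and annotation practice.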
📄 Abstract
The classification performance of Multimodal Large Language Models (MLLMs) depends critically on evaluation protocol and ground truth quality. Studies comparing MLLMs with supervised and vision-language models report conflicting conclusions, and we show these conflicts stem from protocols that either inflate or underestimate performance. Across the most common evaluation protocols, we identify and fix key issues: model outputs that fall outside the provided class list and are discarded, inflated results from weak multiple-choice distractors, and an open-world setting that underperforms only due to poor output mapping. We additionally quantify the impact of commonly overlooked design choices (batch size, image ordering, and text encoder selection), showing they substantially affect accuracy. Evaluating on ReGT, our multilabel reannotation of 625 ImageNet-1k classes, reveals that MLLMs benefit most from corrected labels (up to +10.8%), substantially narrowing the perceived gap with supervised models. Much of the reported MLLM underperformance on classification is thus an artifact of noisy ground truth and flawed evaluation protocols rather than genuine model deficiency. Models less reliant on supervised training signals prove most sensitive to annotation quality. Finally, we show that MLLMs can assist human annotators: in a controlled case study, annotators confirmed or integrated MLLM predictions in approximately 50% of difficult cases, demonstrating their potential for large-scale dataset curation.
cs.CL
[94] Verify as You Go: An LLM-Powered Browser Extension for Fake News Detection
Dorsaf Sallami, Esma Aïmeur
🧩 TL;DR
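This paper presents Aletheia, a browser extension built on retrieval-augmented generation and large language models that detects fake news and provides evidence-based explanations, with interactive features that strengthen user engagement and transparency.
📘 Detailed Summary
Motivation: The spread of fake news in the digital age poses a serious threat to public trust and democratic institutions, while existing browser extensions suffer from opaque model behavior, limited explanatory support, and insufficient user engagement. More effective, transparent, and user-centered detection tools are needed.
Method: Aletheia detects fake news using retrieval-augmented generation and large language models, and includes two interactive components: a Discussion Hub that lets users discuss flagged content and a Stay Informed feature that surfaces recent fact-checks.
Result: Experiments show that Aletheia outperforms state-of-the-art baselines in detection performance, and a user study with 250 participants confirms the system's usability and perceived effectiveness, supporting its potential as a transparent tool.
Conclusion: The study demonstrates the effectiveness of combining RAG, LLMs, and interactive features to build a transparent fake news detection tool, pointing toward online verification systems with strong user engagement and explainability.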
📄 Abstract
The rampant spread of fake news in the digital age poses serious risks to public trust and democratic institutions, underscoring the need for effective, transparent, and user-centered detection tools. Existing browser extensions often fall short due to opaque model behavior, limited explanatory support, and a lack of meaningful user engagement. This paper introduces Aletheia, a novel browser extension that leverages Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs) to detect fake news and provide evidence-based explanations. Aletheia further includes two interactive components: a Discussion Hub that enables user dialogue around flagged content and a Stay Informed feature that surfaces recent fact-checks. Through extensive experiments, we show that Aletheia outperforms state-of-the-art baselines in detection performance. Complementing this empirical evaluation, a user study with 250 participants confirms the system's usability and perceived effectiveness, highlighting its potential as a transparent tool for combating online fake news.
[95] NOTAI.AI: Explainable Detection of Machine-Generated Text via Curvature and Feature Attribution
Oleksandr Marchenko Breneur, Adelaide Danilov, Aria Nourbakhsh, Salima Lamsiyah
🧩 TL;DR
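This paper presents NOTAI.AI, an explainable framework for machine-generated text detection that extends Fast-DetectGPT by integrating curvature-based signals with neural and stylometric features in a supervised setting. It is deployed as an interactive web application that supports real-time analysis and structured evidence presentation.
📘 Detailed Summary
Motivation: The study addresses the lack of explainability in machine-generated text detection by developing a framework that provides feature-level attribution and structured natural-language rationales, improving the transparency of detection systems and user trust in them.
Method: NOTAI.AI combines 17 interpretable features, including Conditional Probability Curvature, a ModernBERT detector score, readability metrics, and stylometric cues, in a gradient-boosted tree (XGBoost) meta-classifier. It applies SHAP for local and global feature attribution and uses an LLM-based explanation layer to generate structured natural-language rationales.
Result: The system is deployed as an interactive web application that supports real-time text analysis, visual feature inspection, and structured evidence presentation. The source code and demo video are publicly available to support reproducibility.
Conclusion: The study demonstrates the value of combining statistical signals with neural features and uses an explainability layer to make the detection system more transparent, yielding a practical framework for machine-generated text detection that is both accurate and explainable.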
📄 Abstract
We present NOTAI.AI, an explainable framework for machine-generated text detection that extends Fast-DetectGPT by integrating curvature-based signals with neural and stylometric features in a supervised setting. The system combines 17 interpretable features, including Conditional Probability Curvature, ModernBERT detector score, readability metrics, and stylometric cues, within a gradient-boosted tree (XGBoost) meta-classifier to determine whether a text is human- or AI-generated. Furthermore, NOTAI.AI applies Shapley Additive Explanations (SHAP) to provide both local and global feature-level attribution. These attributions are further translated into structured natural-language rationales through an LLM-based explanation layer, which enables user-facing interpretability. The system is deployed as an interactive web application that supports real-time analysis, visual feature inspection, and structured evidence presentation. A web interface allows users to input text and inspect how neural and statistical signals influence the final decision. The source code and demo video are publicly available to support reproducibility.
[96] Safer Reasoning Traces: Measuring and Mitigating Chain-of-Thought Leakage in LLMs
Patrick Ahrend, Tobias Eder, Xiyang Yang, Zhiyi Pan, Georg Groh
🧩 TL;DR
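This study proposes a model-agnostic framework that systematically evaluates the risk of personally identifiable information (PII) leakage induced by Chain-of-Thought prompting during LLM inference. It finds that CoT markedly increases privacy leakage, especially for high-risk categories, and that the extent of leakage depends strongly on the model family and the reasoning budget.
📘 Detailed Summary
Motivation: Chain-of-Thought prompting improves LLM reasoning but increases privacy risk, because it can carry PII from the prompt into reasoning traces and outputs even when the model is instructed not to restate PII. This direct, inference-time PII leakage has not been studied systematically.
Method: The study uses a model-agnostic evaluation framework that defines leakage as risk-weighted, token-level events across 11 PII types, traces leakage curves as a function of the CoT budget, and compares open- and closed-source model families on a structured PII dataset with a hierarchical risk taxonomy. It also evaluates four lightweight inference-time gatekeepers.
Result: CoT consistently increases leakage, especially for high-risk categories, and leakage depends strongly on the model family and reasoning budget. Increasing the reasoning budget can either amplify or attenuate leakage depending on the base model. The four gatekeepers each have strengths and weaknesses on risk-weighted F1, Macro-F1, and recall, and no single method dominates across all models or budgets.
Conclusion: The results motivate hybrid, style-adaptive gatekeeping policies that balance utility and risk, and underline the importance of evaluating privacy leakage systematically under a common, reproducible protocol. The work provides a quantitative evaluation framework and practical guidance for LLM reasoning safety.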
📄 Abstract
Chain-of-Thought (CoT) prompting improves LLM reasoning but can increase privacy risk by resurfacing personally identifiable information (PII) from the prompt into reasoning traces and outputs, even under policies that instruct the model not to restate PII. We study such direct, inference-time PII leakage using a model-agnostic framework that (i) defines leakage as risk-weighted, token-level events across 11 PII types, (ii) traces leakage curves as a function of the allowed CoT budget, and (iii) compares open- and closed-source model families on a structured PII dataset with a hierarchical risk taxonomy. We find that CoT consistently elevates leakage, especially for high-risk categories, and that leakage is strongly family- and budget-dependent. Increasing the reasoning budget can either amplify or attenuate leakage depending on the base model. We then benchmark lightweight inference-time gatekeepers: a rule-based detector, a TF-IDF + logistic regression classifier, a GLiNER-based NER model, and an LLM-as-judge, using risk-weighted F1, Macro-F1, and recall. No single method dominates across models or budgets, motivating hybrid, style-adaptive gatekeeping policies that balance utility and risk under a common, reproducible protocol.
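To make the idea of a risk-weighted leakage curve concrete, the following is a minimal sketch rather than the authors' implementation. It assumes three PII types with regex detectors and made-up risk weights (the paper uses 11 types and a hierarchical taxonomy), matches whole PII values instead of scoring individual tokens, and measures the CoT budget in whitespace-separated tokens. All names (`RISK_WEIGHTS`, `PATTERNS`, `leakage_curve`) are hypothetical.

```python
import re

# Hypothetical risk weights per PII type; the values are placeholders,
# not the taxonomy used in the paper.
RISK_WEIGHTS = {"email": 1.0, "phone": 2.0, "ssn": 5.0}

# Simplified detectors, for illustration only.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def prompt_pii(prompt):
    """Collect (type, value) pairs for PII found in the prompt."""
    return {(t, m) for t, p in PATTERNS.items() for m in p.findall(prompt)}

def leakage_score(prompt, trace):
    """Sum the risk weights of prompt PII values that reappear in the trace."""
    return sum(RISK_WEIGHTS[t] for t, v in prompt_pii(prompt) if v in trace)

def leakage_curve(prompt, trace, budgets):
    """Leakage at each budget, where a budget is a number of trace tokens."""
    tokens = trace.split()
    return [leakage_score(prompt, " ".join(tokens[:b])) for b in budgets]

prompt = "Contact alice@example.com or 555-123-4567 ; SSN 123-45-6789"
trace = "The user alice@example.com asked about taxes ; their SSN is 123-45-6789 so"
curve = leakage_curve(prompt, trace, [2, 3, 50])
```

Truncating the trace at increasing budgets shows how leakage can grow as more reasoning is allowed; a real evaluation would replace the regexes with the dataset's gold PII spans.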
[97] The Fragility Of Moral Judgment In Large Language Models
Tom van Nuenen, Pratik S. Sachdeva
🧩 TL;DR
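This study introduces a perturbation framework for testing the stability and manipulability of LLM moral judgments. It finds that model judgments depend heavily on narrative form and task framing rather than on moral substance, raising concerns about the reproducibility and equity of LLM moral guidance.
📘 Detailed Summary
Motivation: People increasingly use large language models for everyday moral and interpersonal guidance, yet these systems cannot interrogate missing context and judge dilemmas as presented. The stability and manipulability of LLM moral judgments therefore need to be studied.
Method: The study introduces a perturbation framework for testing the stability of LLM moral judgments. Using 2,939 dilemmas from r/AmItheAsshole, it generates three families of content perturbations: surface edits (lexical/structural noise), point-of-view shifts (voice and stance neutralization), and persuasion cues (self-positioning, social proof, pattern admissions, victim framing). It also varies the evaluation protocol (output ordering, instruction placement, and unstructured prompting), and evaluates 129,156 judgments across four models (GPT-4.1, Claude 3.7 Sonnet, DeepSeek V3, Qwen2.5-72B).
Result: Surface perturbations produce low flip rates (7.5%), largely within the self-consistency noise floor (4-13%), whereas point-of-view shifts induce substantially higher instability (24.3%). A subset of 37.9% of dilemmas is robust to surface noise yet flips under perspective changes, indicating that models treat narrative voice as a pragmatic cue. Instability concentrates in morally ambiguous cases, and persuasion perturbations yield systematic directional shifts. Protocol choices dominate all other factors: agreement between structured protocols is only 67.6% (kappa=0.55), and only 35.7% of model-scenario units match across all three protocols.
Conclusion: LLM moral judgments are co-produced by narrative form and task scaffolding, which raises reproducibility and equity concerns when outcomes depend on presentation skill rather than moral substance and indicates systematic bias and unreliability in the moral guidance current LLMs provide.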
📄 Abstract
People increasingly use large language models (LLMs) for everyday moral and interpersonal guidance, yet these systems cannot interrogate missing context and judge dilemmas as presented. We introduce a perturbation framework for testing the stability and manipulability of LLM moral judgments while holding the underlying moral conflict constant. Using 2,939 dilemmas from r/AmItheAsshole (January-March 2025), we generate three families of content perturbations: surface edits (lexical/structural noise), point-of-view shifts (voice and stance neutralization), and persuasion cues (self-positioning, social proof, pattern admissions, victim framing). We also vary the evaluation protocol (output ordering, instruction placement, and unstructured prompting). We evaluated all variants with four models (GPT-4.1, Claude 3.7 Sonnet, DeepSeek V3, Qwen2.5-72B) (N=129,156 judgments). Surface perturbations produce low flip rates (7.5%), largely within the self-consistency noise floor (4-13%), whereas point-of-view shifts induce substantially higher instability (24.3%). A large subset of dilemmas (37.9%) is robust to surface noise yet flips under perspective changes, indicating that models condition on narrative voice as a pragmatic cue. Instability concentrates in morally ambiguous cases; scenarios where no party is assigned blame are most susceptible. Persuasion perturbations yield systematic directional shifts. Protocol choices dominate all other factors: agreement between structured protocols is only 67.6% (kappa=0.55), and only 35.7% of model-scenario units match across all three protocols. These results show that LLM moral judgments are co-produced by narrative form and task scaffolding, raising reproducibility and equity concerns when outcomes depend on presentation skill rather than moral substance.
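The two headline statistics in this entry, flip rate and kappa agreement, are simple to compute. The sketch below is illustrative only: it assumes the reported kappa is Cohen's kappa for two protocols, and the verdict labels and data are invented for the example.

```python
from collections import Counter

def flip_rate(baseline, perturbed):
    """Fraction of items whose verdict changes under a perturbation."""
    flips = sum(b != p for b, p in zip(baseline, perturbed))
    return flips / len(baseline)

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)       # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Invented verdicts for four dilemmas under two conditions.
baseline = ["YTA", "NTA", "NTA", "YTA"]
perturbed = ["YTA", "YTA", "NTA", "YTA"]
rate = flip_rate(baseline, perturbed)
kappa = cohens_kappa(baseline, perturbed)
```

Raw agreement and kappa can diverge noticeably when one verdict dominates, which is why the abstract reports both 67.6% agreement and kappa=0.55.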
[98] Towards Robust Retrieval-Augmented Generation Based on Knowledge Graph: A Comparative Analysis
Hazem Amamou, Stéphane Gagnon, Alan Davoust, Anderson R. Avila
🧩 TL;DR
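This study uses the RGB benchmark to evaluate the robustness of retrieval-augmented generation systems under noisy conditions. It compares a conventional RAG baseline with GraphRAG, a knowledge-graph-based system, and tests three GraphRAG customizations; GraphRAG outperforms the baseline in several scenarios.
📘 Detailed Summary
Motivation: RAG systems improve the factual accuracy of large language models and extend their knowledge, but inconsistent retrieved information can degrade model responses. The robustness of existing systems under noisy conditions is insufficiently evaluated, and a systematic evaluation framework is needed to guide the design of more reliable RAG systems.
Method: The study uses the Retrieval-Augmented Generation Benchmark (RGB) as its evaluation framework and tests system performance in four scenarios: noise robustness, information integration, negative rejection, and counterfactual robustness. It compares the RGB RAG baseline with GraphRAG, a knowledge-graph-based retrieval system, and tests three GraphRAG customizations intended to improve robustness.
Result: GraphRAG outperforms the RGB RAG baseline in several evaluation scenarios, and the three customizations improve performance further, providing a quantitative picture of RAG behavior under different noise conditions.
Conclusion: Knowledge-graph-based retrieval is more robust to inconsistent information. The results offer insights for designing reliable RAG systems for real-world scenarios and underline the value of systematic evaluation frameworks for improving retrieval-augmented generation.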
📄 Abstract
Retrieval-Augmented Generation (RAG) was introduced to enhance the capabilities of Large Language Models (LLMs) beyond their encoded prior knowledge. This is achieved by providing LLMs with an external source of knowledge, which helps reduce factual hallucinations and enables access to new information not available during pretraining. However, inconsistent retrieved information can negatively affect LLM responses. The Retrieval-Augmented Generation Benchmark (RGB) was introduced to evaluate the robustness of RAG systems under such conditions. In this work, we use the RGB corpus to evaluate LLMs in four scenarios: noise robustness, information integration, negative rejection, and counterfactual robustness. We perform a comparative analysis between the RGB RAG baseline and GraphRAG, a knowledge-graph-based retrieval system. We test three GraphRAG customizations to improve robustness. Results show improvements over the RGB baseline and provide insights for designing more reliable RAG systems for real-world scenarios.
[99] Let's Talk, Not Type: An Oral-First Multi-Agent Architecture for Guaraní
Samantha Adorno, Akshata Kishore Moharir, Ratna Kandala
🧩 TL;DR
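Using Guaraní as a case study, this paper proposes an oral-first multi-agent architecture that moves beyond conventional text-centric AI systems. By decoupling natural language understanding from conversation state management and respecting indigenous data sovereignty and diglossia, it aims at AI that is truly culturally grounded.
📘 Detailed Summary
Motivation: AI and human-computer interaction systems remain predominantly text-first in design, overlooking the needs of primarily oral languages and indigenous communities, so language support is insufficient and misaligned with lived oral practices. Using Guaraní, an official and widely spoken language of Paraguay, as a case study, the paper argues that AI systems must go beyond mere language recognition and engage with the practical features of spoken communication.
Method: The paper proposes an oral-first multi-agent architecture as an alternative to the standard "text-to-speech" pipeline. The architecture decouples Guaraní natural language understanding from dedicated agents for conversation state management and introduces community-led governance, forming a technical framework that respects indigenous data sovereignty and diglossia and focuses on turn-taking, repair, and shared context as the core elements of spoken interaction.
Result: The paper presents a technical framework that handles oral-first interaction, supports community-led governance, and respects indigenous data sovereignty. Separating natural language understanding from conversation state management lets the system adapt better to Guaraní oral practices, including key interaction dimensions such as turn-taking and shared context.
Conclusion: For AI to be truly culturally grounded, it must shift from adapting oral languages to text-centric systems to treating spoken conversation as a first-class design requirement. Digital ecosystems should empower rather than overlook diverse linguistic practices, and technical designs that respect indigenous data sovereignty and diglossia help ensure that AI systems serve primarily oral language communities.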
📄 Abstract
Although artificial intelligence (AI) and Human-Computer Interaction (HCI) systems are often presented as universal solutions, their design remains predominantly text-first, underserving primarily oral languages and indigenous communities. This position paper uses Guaraní, an official and widely spoken language of Paraguay, as a case study to argue that language support in AI remains insufficient unless it aligns with lived oral practices. We propose an oral-first multi-agent architecture as an alternative to the standard "text-to-speech" pipeline. By decoupling Guaraní natural language understanding from dedicated agents for conversation state and community-led governance, we demonstrate a technical framework that respects indigenous data sovereignty and diglossia. Our work moves beyond mere recognition to focus on turn-taking, repair, and shared context as the primary locus of interaction. We conclude that for AI to be truly culturally grounded, it must shift from adapting oral languages to text-centric systems to treating spoken conversation as a first-class design requirement, ensuring digital ecosystems empower rather than overlook diverse linguistic practices.
[100] PVminerLLM: Structured Extraction of Patient Voice from Patient-Generated Text using Large Language Models
Samah Fodeh, Linhai Ma, Ganesh Puthiaraju, Srivani Talakokkul, Afshan Khan, Ashley Hagaman, Sarah Lowe, Aimee Roundtree
🧩 TL;DR
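This paper presents the PVminer benchmark and the PVminerLLM model, a supervised fine-tuned large language model that extracts structured patient voice information from patient-generated text. It substantially outperforms prompt-based baselines and offers a feasible route to scalable analysis of social and experiential signals.
📘 Detailed Summary
Motivation: Patient-generated text contains critical information about patients' lived experiences, social circumstances, and engagement in care, factors that strongly influence adherence, care coordination, and health equity. These patient voice signals are rarely available in structured form, which limits their use in patient-centered outcomes research and clinical quality improvement. Reliable extraction of such information is therefore essential for understanding and addressing non-clinical drivers of health outcomes at scale.
Method: The study introduces PVminer, a benchmark for structured extraction of patient voice, and proposes PVminerLLM, a large language model supervised fine-tuned for this task so that it extracts information from patient-generated text more reliably.
Result: Across multiple datasets and model sizes, PVminerLLM substantially outperforms prompt-based baselines, achieving up to 83.82% F1 for Code prediction, 80.74% F1 for Sub-code prediction, and 87.03% F1 for evidence Span extraction. Strong performance is achieved even with smaller models, showing that reliable patient voice extraction is feasible without extreme model scale.
Conclusion: Supervised fine-tuned large language models can effectively extract structured information from patient-generated text, providing a feasible route to scalable analysis of the social and experiential signals embedded in it. The results support broader use of patient voice data in patient-centered research and clinical quality improvement, and show that reliable extraction does not require extreme model scale.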
📄 Abstract
Motivation: Patient-generated text contains critical information about patients' lived experiences, social circumstances, and engagement in care, including factors that strongly influence adherence, care coordination, and health equity. However, these patient voice signals are rarely available in structured form, limiting their use in patient-centered outcomes research and clinical quality improvement. Reliable extraction of such information is therefore essential for understanding and addressing non-clinical drivers of health outcomes at scale. Results: We introduce PVminer, a benchmark for structured extraction of patient voice, and propose PVminerLLM, a supervised fine-tuned large language model tailored to this task. Across multiple datasets and model sizes, PVminerLLM substantially outperforms prompt-based baselines, achieving up to 83.82% F1 for Code prediction, 80.74% F1 for Sub-code prediction, and 87.03% F1 for evidence Span extraction. Notably, strong performance is achieved even with smaller models, demonstrating that reliable patient voice extraction is feasible without extreme model scale. These results enable scalable analysis of social and experiential signals embedded in patient-generated text. Availability and Implementation: Code, evaluation scripts, and trained LLMs will be released publicly. Annotated datasets will be made available upon request for research use. Keywords: Large Language Models, Supervised Fine-Tuning, Medical Annotation, Patient-Generated Text, Clinical NLP
[101] RouteGoT: Node-Adaptive Routing for Cost-Efficient Graph of Thoughts Reasoning
Yuhang Liu, Ruijie Wang, Yunlong Chu, Bing Hao, Yumeng Lin, Shengzhong Liu, Minglai Shao
🧩 TL;DR
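This paper proposes RouteGoT, a budget-controllable, node-adaptive routing framework for graph-structured reasoning. By dynamically assigning models of different capability to different nodes of the reasoning graph, it substantially reduces compute cost while matching or improving accuracy.
📘 Detailed Summary
Motivation: Graph-structured reasoning methods such as Tree of Thoughts and Graph of Thoughts can improve accuracy on some benchmarks, but they introduce substantial overhead in token consumption and latency, and their gains are unstable across task distributions. The inefficiency stems from stage-wise and node-wise heterogeneity in the reasoning pipeline: high-quality planning and final synthesis need strong models, whereas many intermediate subtasks can be solved accurately by lighter models with fewer tokens.
Method: RouteGoT is a budget-controllable, node-adaptive routing framework for graph-structured reasoning. It routes within the graph, prioritizing strong models for planning and synthesis nodes while dynamically allocating lightweight models and cost-effective strategies to leaf subtasks based on predicted difficulty. It further integrates explicit budget constraints into a global inference scheduler that controls graph expansion under a user-specified token budget, enabling predictable performance-cost trade-offs.
Result: On reasoning, retrieval, and multi-hop QA benchmarks, RouteGoT matches or improves accuracy while substantially reducing token usage; specifically, it achieves an average accuracy improvement of 8.1 percentage points and a 79.1% reduction in output tokens compared to AGoT. RouteGoT also outperforms existing routing baselines by maintaining a superior cost-accuracy trade-off, with improved robustness across budget targets and tasks.
Conclusion: Accounting for the heterogeneous needs of different nodes in the reasoning graph and routing adaptively can substantially improve compute efficiency without sacrificing accuracy. RouteGoT provides a practical budget-control mechanism for graph-structured reasoning with predictable performance-cost trade-offs, which makes it suitable for resource-constrained deployments.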
📄 Abstract
Large Language Models (LLMs) excel at multi-step reasoning, yet increasing the structural complexity of inference does not consistently improve system-level returns. Methods such as Tree of Thoughts (ToT), Graph of Thoughts (GoT), and Adaptive Graph of Thoughts (AGoT) can boost accuracy on some benchmarks, but often introduce substantial overhead in token consumption and latency, and their gains can be unstable across task distributions, sometimes underperforming simpler Chain-of-Thought (CoT) or direct input-output prompting (IO). We attribute this inefficiency to stage-wise and node-wise heterogeneity inside GoT-style reasoning pipelines: high-quality planning and final synthesis are globally coupled and typically benefit from strong models, whereas many intermediate subtasks are localized and can be solved accurately by lighter models with far fewer tokens. Motivated by these observations, we propose RouteGoT, a budget-controllable, node-adaptive routing framework for graph-structured reasoning. RouteGoT performs in-graph routing by prioritizing strong models for planning and synthesis, while dynamically allocating lightweight models and cost-effective strategies to leaf subtasks based on predicted difficulty. It further integrates explicit budget constraints into a global inference scheduler to control graph expansion under a user-specified token budget, enabling predictable performance-cost trade-offs. Experiments across reasoning, retrieval, and multi-hop QA benchmarks show that RouteGoT matches or improves accuracy while substantially reducing token usage; specifically, it achieves an average accuracy improvement of 8.1 percentage points and a 79.1% reduction in output tokens compared to AGoT. Furthermore, RouteGoT outperforms existing routing baselines by maintaining a superior cost-accuracy trade-off, demonstrating improved robustness under varying budget targets and tasks.
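The routing idea described above can be illustrated with a small sketch. This is not the RouteGoT scheduler: the two model tiers, their per-call token costs, the difficulty threshold, and the fallback rule are all assumptions made for the example. It only shows the general shape of the policy: planning and synthesis nodes always get the strong tier, leaves are routed by predicted difficulty, and a token budget caps what can be spent.

```python
from dataclasses import dataclass

# Hypothetical per-call token costs for two model tiers.
COST = {"strong": 800, "light": 150}

@dataclass
class Node:
    name: str
    role: str          # "plan", "leaf", or "synth"
    difficulty: float  # predicted difficulty in [0, 1]; only used for leaves

def route(nodes, budget, threshold=0.7):
    """Assign a model tier to each node under a total token budget."""
    # Reserve budget for planning and synthesis, which always use the strong tier.
    reserved = sum(COST["strong"] for n in nodes if n.role != "leaf")
    remaining = budget - reserved
    if remaining < 0:
        raise ValueError("budget too small for planning and synthesis")
    plan = {}
    for n in nodes:
        if n.role != "leaf":
            plan[n.name] = "strong"
            continue
        # Hard leaves prefer the strong tier, easy leaves the light tier.
        tier = "strong" if n.difficulty >= threshold else "light"
        if COST[tier] > remaining:
            # Fall back to the light tier, or skip the node (None) if even that is too costly.
            tier = "light" if COST["light"] <= remaining else None
        if tier is not None:
            remaining -= COST[tier]
        plan[n.name] = tier
    return plan

nodes = [
    Node("p", "plan", 0.0),
    Node("a", "leaf", 0.9),
    Node("b", "leaf", 0.2),
    Node("c", "leaf", 0.8),
    Node("s", "synth", 0.0),
]
assignment = route(nodes, budget=2800)
```

In this run the second hard leaf (`c`) is downgraded to the light tier because the budget left after the first hard leaf no longer covers a strong call, which is the kind of cost-accuracy trade-off a budget-aware scheduler has to make.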
[102] HART: Data-Driven Hallucination Attribution and Evidence-Based Tracing for Large Language Models
Shize Liang, Hongzhi Wang
🧩 TL;DR
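This paper proposes HART, a fine-grained hallucination attribution and evidence retrieval framework for large language models. It structures hallucination tracing into four stages (span localization, mechanism attribution, evidence retrieval, and causal tracing), markedly improving the causal interpretability and evidence alignment of hallucination analysis.
📘 Detailed Summary
Motivation: Existing hallucination attribution approaches, based on either external knowledge retrieval or internal model mechanisms, focus on semantic similarity matching or representation-level discrimination. They have difficulty establishing structured span-level correspondences between hallucination types, underlying error generation mechanisms, and external factual evidence, which limits the interpretability of hallucinated fragments and the traceability of supporting or opposing evidence.
Method: HART formalizes hallucination tracing as a structured modeling task comprising four stages: span localization, mechanism attribution, evidence retrieval, and causal tracing. Based on this formulation, the authors develop the first structured dataset tailored for hallucination tracing, in which hallucination types, error mechanisms, and sets of counterfactual evidence are jointly annotated to support causal-level interpretability evaluation.
Result: Experiments on the proposed dataset show that HART substantially outperforms strong retrieval baselines, including BM25 and DPR, validating the effectiveness and generalization capability of the tracing paradigm for hallucination analysis and evidence alignment.
Conclusion: Through structured modeling, the study provides a causal-level interpretability framework for LLM hallucination analysis. It improves the precision of hallucination attribution and the traceability of evidence alignment, offering a methodological basis for hallucination detection and mitigation in high-reliability applications.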
📄 Abstract
Large language models (LLMs) have demonstrated remarkable performance in text generation and knowledge-intensive question answering. Nevertheless, they are prone to producing hallucinated content, which severely undermines their reliability in high-stakes application domains. Existing hallucination attribution approaches, based on either external knowledge retrieval or internal model mechanisms, primarily focus on semantic similarity matching or representation-level discrimination. As a result, they have difficulty establishing structured correspondences at the span level between hallucination types, underlying error generation mechanisms, and external factual evidence, thereby limiting the interpretability of hallucinated fragments and the traceability of supporting or opposing evidence. To address these limitations, we propose HART, a fine-grained hallucination attribution and evidence retrieval framework for large language models. HART formalizes hallucination tracing as a structured modeling task comprising four stages: span localization, mechanism attribution, evidence retrieval, and causal tracing. Based upon this formulation, we develop the first structured dataset tailored for hallucination tracing, in which hallucination types, error mechanisms, and sets of counterfactual evidence are jointly annotated to enable causal-level interpretability evaluation. Experimental results on the proposed dataset demonstrate that HART substantially outperforms strong retrieval baselines, including BM25 and DPR, validating the effectiveness and generalization capability of the proposed tracing paradigm for hallucination analysis and evidence alignment.
[103] ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning
Juyong Jiang, Jiasi Shen, Sunghun Kim, Kang Min Yoo, Jeonghoon Kim, Sungju Kim
🧩 TL;DR
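This paper proposes ReflexiCoder, a novel reinforcement learning framework that internalizes structured reasoning trajectories into model weights, giving large language models intrinsic, autonomous self-reflection and self-correction capabilities and achieving state-of-the-art performance on code generation.
📘 Detailed Summary
Motivation: Although large language models have revolutionized code generation, standard "System 1" approaches that generate solutions in a single forward pass hit a performance ceiling on complex algorithmic tasks. Existing iterative refinement strategies rely mainly on external oracles, execution feedback, or computationally expensive prompt-response cycles, and lack intrinsic, autonomous reflection and correction.
Method: ReflexiCoder is a reinforcement learning framework that uses an RL-zero training paradigm with granular reward functions to optimize the entire reflection-correction trajectory. It internalizes the structured reasoning process (initial generation, bug- and optimization-aware reflection, and self-correction) directly into the model's weights, enabling fully autonomous reasoning without external feedback or execution engines.
Result: Extensive experiments across seven benchmarks show that ReflexiCoder-8B achieves state-of-the-art results among open-source models in the 1.5B-14B range: 94.51% (87.20%) on HumanEval (Plus), 81.80% (78.57%) on MBPP (Plus), 35.00% on BigCodeBench, 52.21% on LiveCodeBench, and 37.34% on CodeForces. In a single-attempt setting it rivals or surpasses proprietary models such as GPT-5.1, while reducing inference-time compute overhead by approximately 40%.
Conclusion: The study shifts the paradigm from externally dependent refinement to intrinsic, autonomous self-reflection, and demonstrates the effectiveness of internalizing structured reasoning trajectories through reinforcement learning. ReflexiCoder improves code generation performance and lowers compute cost through efficient reasoning patterns, pointing toward more efficient and autonomous code generation systems.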
📄 Abstract
While Large Language Models (LLMs) have revolutionized code generation, standard "System 1" approaches, generating solutions in a single forward pass, often hit a performance ceiling when faced with complex algorithmic tasks. Existing iterative refinement strategies attempt to bridge this gap at inference time, yet they predominantly rely on external oracles, execution feedback, or computationally expensive prompt-response cycles. In this work, we propose ReflexiCoder, a novel reinforcement learning (RL) framework that internalizes the structured reasoning trajectory, encompassing initial generation, bug- and optimization-aware reflection, and self-correction, directly into the model's weights. Unlike prior methods, ReflexiCoder shifts the paradigm from externally dependent refinement to intrinsic, fully autonomous self-reflection and self-correction capabilities at inference time. We utilize an RL-zero training paradigm with granular reward functions to optimize the entire reflection-correction trajectory, teaching the model how to debug without reliance on ground-truth feedback or execution engines at inference time. Extensive experiments across seven benchmarks demonstrate that our ReflexiCoder-8B establishes a new state-of-the-art (SOTA) among leading open-source models in the 1.5B-14B range, achieving 94.51% (87.20%) on HumanEval (Plus), 81.80% (78.57%) on MBPP (Plus), 35.00% on BigCodeBench, 52.21% on LiveCodeBench, and 37.34% on CodeForces in a single-attempt setting, rivaling or surpassing proprietary models like GPT-5.1. Notably, our framework is significantly more token-efficient than base models, reducing inference-time compute overhead by approximately 40% through disciplined, high-speed reasoning and reflection patterns. Source code is available at https://github.com/juyongjiang/ReflexiCoder.
[104] ROSE: Reordered SparseGPT for More Accurate One-Shot Large Language Models Pruning
Mingluo Su, Huan Wang
🧩 TL;DR
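This paper proposes ROSE, an improved SparseGPT framework for pruning large language models that reorders the pruning sequence to improve pruning performance. Using second-order gradient information, the method prunes weights with larger potential errors first, improving pruning quality while preserving model efficiency.
📘 Detailed Summary
Motivation: Hessian-based second-order pruning methods such as SparseGPT use a fixed left-to-right pruning order, which leads to suboptimal performance when the weights exhibit columnar patterns. The study investigates how pruning order affects performance under the SparseGPT framework and addresses this bottleneck.
Method: ROSE first performs pre-pruning to identify candidate weights for removal and estimates column-level and block-level pruning loss. It then performs two-level reordering: columns within each block are reordered in descending order of column loss, and blocks are reordered based on block loss. The relative range of block loss is introduced as a metric to identify columnar layers, enabling adaptive reordering across the entire model.
Result: Extensive experiments on prevalent LLMs, including LLaMA2-7B/13B/70B, LLaMA3-8B, and Mistral-7B, show that ROSE surpasses the original SparseGPT and other counterpart pruning methods, validating the reordering strategy.
Conclusion: Pruning order matters for Hessian-based pruning methods, particularly when weights exhibit columnar patterns. ROSE's adaptive reordering strategy offers a new approach to efficient model compression, and its open-source implementation supports further research and application.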
📄 Abstract
Pruning is widely recognized as an effective method for reducing the parameters of large language models (LLMs), potentially leading to more efficient deployment and inference. One classic and prominent path of LLM one-shot pruning is to leverage second-order gradients (i.e., Hessian), represented by the pioneering work SparseGPT. However, the predefined left-to-right pruning order in SparseGPT leads to suboptimal performance when the weights exhibit columnar patterns. This paper studies the effect of pruning order under the SparseGPT framework. The analyses lead us to propose ROSE, a reordered SparseGPT method that prioritizes weights with larger potential pruning errors to be pruned earlier. ROSE first performs pre-pruning to identify candidate weights for removal, and estimates both column and block pruning loss. Subsequently, two-level reordering is performed: columns within each block are reordered in descending order of column loss, while blocks are reordered based on block loss. We introduce the relative range of block loss as a metric to identify columnar layers, enabling adaptive reordering across the entire model. Substantial empirical results on prevalent LLMs (LLaMA2-7B/13B/70B, LLaMA3-8B, Mistral-7B) demonstrate that ROSE surpasses the original SparseGPT and other counterpart pruning methods. Our code is available at https://github.com/mingluo-su/ROSE.
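The two-level reordering can be sketched on a plain list of per-column losses. This is an illustration under stated assumptions, not the ROSE code: block loss is taken to be the sum of a block's column losses, blocks are sorted in descending order of that sum (the abstract only says "based on block loss"), and "relative range" is read as (max - min) / mean. The function names are hypothetical.

```python
def reorder(column_loss, block_size):
    """Return a column permutation from two-level reordering.

    column_loss[j] is the estimated pruning loss of column j.
    """
    blocks = [list(range(s, min(s + block_size, len(column_loss))))
              for s in range(0, len(column_loss), block_size)]
    # Level 1: inside each block, columns in descending order of column loss.
    blocks = [sorted(b, key=lambda j: -column_loss[j]) for b in blocks]
    # Level 2: blocks in descending order of block loss
    # (assumed here to be the sum of the block's column losses).
    blocks.sort(key=lambda b: -sum(column_loss[j] for j in b))
    return [j for b in blocks for j in b]

def relative_range(block_losses):
    """(max - min) / mean of block losses; one plausible reading of the metric."""
    mean = sum(block_losses) / len(block_losses)
    return (max(block_losses) - min(block_losses)) / mean

column_loss = [1, 3, 2, 9, 4, 5]
perm = reorder(column_loss, block_size=2)
spread = relative_range([4, 11, 9])
```

A layer whose block losses have a large relative range would be treated as columnar and reordered, while a layer with a small range could keep the default left-to-right order; the threshold for that decision is not given in the abstract.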
[105] VerChol -- Grammar-First Tokenization for Agglutinative Languages
Prabhu Raja
🧩 TL;DR
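This study shows that BPE tokenizers have serious shortcomings on agglutinative languages and proposes an improved, morpheme-boundary-aware tokenization method that markedly improves how efficiently LLMs process these languages.
📘 Detailed Summary
Motivation: Dominant tokenization methods such as BPE and its variants are inherently script-agnostic and optimized for English-like morphology, and they perform poorly on agglutinative languages. In agglutinative languages such as the Dravidian, Turkic, and Uralic families, a single word may encode root, tense, aspect, person, number, gender agreement, case, and postpositions. Statistical tokenizers fragment these words into byte-pair chunks that sever morpheme boundaries, which inflates token counts and reduces processing efficiency.
Method: The study proposes a morpheme-boundary-aware tokenization method optimized for the linguistic properties of agglutinative languages. By identifying and preserving morpheme boundaries, it avoids the arbitrary splitting of agglutinative words by conventional BPE and produces token units that better match linguistic structure.
Result: Experiments show that, compared with conventional BPE tokenizers, the proposed morpheme-boundary-aware method substantially reduces token counts for agglutinative languages and improves tokenization efficiency. It preserves semantic integrity while improving LLM performance on agglutinative languages, offering a more effective preprocessing option for multilingual NLP tasks.
Conclusion: The study underlines the importance of designing specialized tokenizers for different language types, especially for agglutinative languages with complex morphology. The results suggest that typology-aware tokenization can markedly improve LLM performance in multilingual settings, with implications for building more general multilingual NLP systems.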
📄 Abstract
Tokenization is the foundational step in all large language model (LLM) pipelines, yet the dominant approach, Byte Pair Encoding (BPE) and its variants, is inherently script-agnostic and optimized for English-like morphology. For agglutinative languages, a typological class encompassing the Dravidian family (Tamil, Kannada, Telugu, Malayalam), Turkic languages (Turkish, Azerbaijani, Uzbek), Uralic languages (Finnish, Hungarian, Estonian), Korean, Japanese, Swahili, Basque, and others, a single word may encode root, tense, aspect, person, number, gender agreement, case, and postpositions into one orthographic unit. Statistical tokenizers fragment these words into byte-pair chunks that sever morpheme boundaries and inflate token counts.
[106] Building an Ensemble LLM Semantic Tagger for UN Security Council Resolutions
Hussein Ghaly
🧩 TL;DR
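This paper proposes a new LLM-ensemble methodology for semantic tagging of UN Security Council resolutions. It introduces two evaluation metrics, Content Preservation Ratio and Tag Well-Formedness, to guard against hallucinations, and exploits variability in model performance to build a cost-effective ensemble.
📘 Detailed Summary
Motivation: The study addresses hallucinations and unnecessary additions or omissions in LLM semantic tagging, and asks how LLM performance variability can be used to build reliable data cleaning and semantic tagging systems, particularly for challenging texts such as UN Security Council resolutions.
Method: The method leverages LLM performance variability to build ensemble systems and introduces two evaluation metrics: Content Preservation Ratio (CPR), which measures how much of the input text is retained, and Tag Well-Formedness (TWF), which assesses whether the tag format is correct. These metrics are used to select the best output from multiple runs of several GPT models, and the ensemble is designed to balance performance and cost.
Result: GPT-4.1 achieved the highest metrics on both tasks: a CPR of 84.9% for data cleaning, and a CPR of 99.99% and TWF of 99.92% for semantic tagging. The smaller GPT-4.1-mini achieved performance comparable to the best model at only 20% of the cost, and the ensemble selected the optimal output across all the LLMs involved.
Conclusion: Carefully designed evaluation metrics and an ensemble design make it possible to build a reliable LLM semantic tagging system that guards against hallucinations, while smaller models prove cost-effective. The work offers a practical methodology and cost-optimization strategy for semantic tagging of complex texts.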
📄 Abstract
This paper introduces a new methodology for using LLM-based systems for accurate and efficient semantic tagging of UN Security Council resolutions. The main goal is to leverage LLM performance variability to build ensemble systems for data cleaning and semantic tagging tasks. We introduce two evaluation metrics: Content Preservation Ratio (CPR) and Tag Well-Formedness (TWF), in order to avoid hallucinations and unnecessary additions or omissions to the input text beyond the task requirement. These metrics allow the selection of the best output from multiple runs of several GPT models. GPT-4.1 achieved the highest metrics for both tasks (Cleaning: CPR 84.9% - Semantic Tagging: CPR 99.99% and TWF 99.92%). In terms of cost, smaller models, such as GPT-4.1-mini, achieved comparable performance to the best model in each task at only 20% of the cost. These metrics ultimately allowed the ensemble to select the optimal output (both cleaned and tagged content) for all the LLM models involved, across multiple runs. With this ensemble design and the use of metrics, we create a reliable LLM system for performing semantic tagging on challenging texts.
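The abstract does not define CPR or TWF precisely, so the following is only a sketch of how metric-based selection over candidate outputs could work. It assumes XML-like tags, uses `difflib.SequenceMatcher` similarity between the source and the tag-stripped output as a stand-in for CPR, and scores TWF as 1.0 or 0.0 per output depending on whether tags are balanced and properly nested. The tag name and example sentences are invented.

```python
import difflib
import re

TAG = re.compile(r"<(/?)(\w+)>")

def strip_tags(text):
    """Remove all tags so the remaining text can be compared with the source."""
    return TAG.sub("", text)

def cpr(source, output):
    """Content preservation proxy: similarity of the untagged output to the source."""
    return difflib.SequenceMatcher(None, source, strip_tags(output)).ratio()

def twf(output):
    """1.0 if every tag is opened, closed, and nested correctly; otherwise 0.0."""
    stack = []
    for closing, name in TAG.findall(output):
        if not closing:
            stack.append(name)
        elif not stack or stack.pop() != name:
            return 0.0
    return 0.0 if stack else 1.0

def select_best(source, candidates):
    """Pick the candidate with the highest (TWF, CPR) pair."""
    return max(candidates, key=lambda c: (twf(c), cpr(source, c)))

source = "The Council condemns the attack."
good = "The Council <action>condemns</action> the attack."
unclosed = "The Council <action>condemns the attack."
altered = "The Council <action>strongly condemns</action> the attack today."
best = select_best(source, [unclosed, altered, good])
```

Because both metrics are computed from the source and the output alone, they need no gold annotations, which is what lets an ensemble rank many runs from several models cheaply.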
[107] Learning Next Action Predictors from Human-Computer Interaction
Omar Shaikh, Valentin Teutschbein, Kanishk Gandhi, Yikun Chi, Nick Haber, Thomas Robinson, Nilam Ram, Byron Reeves, Sherry Yang, Michael S. Bernstein, Diyi Yang
🧩 TL;DR
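This work introduces the next action prediction (NAP) task, which aims to predict a user's future behavior from their multimodal interaction history with a computer. It also develops LongNAP, a model that combines parametric and in-context learning and, trained on a large annotated dataset, significantly outperforms existing baselines.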
📘 Detailed Summary
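Motivation: Truly proactive AI systems must anticipate a user's next action, which requires reasoning over the full context of user interaction rather than today's sparse prompt signals. The study formalizes this challenge as next action prediction (NAP): given a sequence of a user's multimodal interactions with a computer (screenshots, clicks, sensor data), predict the user's next action. Solving the task requires new data and new modeling approaches.
Method: The authors first annotate longitudinal, naturalistic computer-use data with vision-language models and release an open-source pipeline for performing this labeling on private infrastructure. They label over 360K actions from one month of continuous phone usage by 20 users, totaling 1,800 hours of screen time. They then propose LongNAP, a user model that combines parametric and in-context learning to reason over long interaction histories. It is trained with policy gradient methods to generate user-specific reasoning traces, retrieve relevant traces from a library of past traces, and apply the retrieved traces in-context to predict future actions.
Result: Under an LLM-as-judge metric (0-1 similarity to ground truth), LongNAP significantly outperforms supervised finetuning and prompted baselines on held-out data (by 79% and 39% respectively). When trained across individuals, LongNAP generalizes to unseen users. Although the space of possible next actions is unbounded, spanning thousands of possible outcomes, 17.1% of LongNAP's predicted trajectories are well aligned with what the user actually does (LLM-judge score ≥ 0.5), rising to 26% when filtering to high-confidence predictions.
Conclusion: Learning from the full context of user behavior to anticipate user needs is now a viable task with substantial room for progress. By combining parametric learning with in-context retrieval, LongNAP makes clear progress in predicting future user behavior and provides a technical basis for truly proactive AI systems. Its success in generalizing across users points to practical value in personalized prediction.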
📄 Abstract
Truly proactive AI systems must anticipate what we will do next. This foresight demands far richer information than the sparse signals we type into our prompts; it demands reasoning over the entire context of what we see and do. We formalize this as next action prediction (NAP): given a sequence of a user's multimodal interactions with a computer (screenshots, clicks, sensor data), predict that user's next action. Progress on this task requires both new data and modeling approaches. To scale data, we annotate longitudinal, naturalistic computer use with vision-language models. We release an open-source pipeline for performing this labeling on private infrastructure, and label over 360K actions across one month of continuous phone usage from 20 users, amounting to 1,800 hours of screen time. We then introduce LongNAP, a user model that combines parametric and in-context learning to reason over long interaction histories. LongNAP is trained via policy gradient methods to generate user-specific reasoning traces given some context; retrieve relevant traces from a library of past traces; and then apply retrieved traces in-context to predict future actions. Using an LLM-as-judge evaluation metric (0-1 similarity to ground truth), LongNAP significantly outperforms supervised finetuning and prompted baselines on held-out data (by 79% and 39% respectively). Additionally, LongNAP generalizes to held-out users when trained across individuals. The space of next actions a user might take at any moment is unbounded, spanning thousands of possible outcomes. Despite this, 17.1% of LongNAP's predicted trajectories are well-aligned with what a user does next (LLM-judge score ≥ 0.5). This rises to 26% when we filter to highly confident predictions. In sum, we argue that learning from the full context of user behavior to anticipate user needs is now a viable task with substantial opportunity.
[108] Implicit Style Conditioning: A Structured Style-Rewrite Framework for Low-Resource Character Modeling
Chanhui Zhu
🧩 TL;DR
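This paper proposes a structured style-rewrite framework that disentangles style into three interpretable dimensions (lexical signatures, syntactic patterns, pragmatic style) and introduces an implicit style conditioning strategy. It addresses poor style consistency and data scarcity in role-playing with small language models, enabling a 1.7B model to surpass a 4B baseline in style fidelity.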
📘 Detailed Summary
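Motivation: Small language models struggle to maintain highly stylized personas in role-playing, facing two main challenges: data scarcity and the complexity of style disentanglement. Standard supervised fine-tuning tends to capture surface-level semantics while failing to reproduce a character's intricate syntactic and pragmatic nuances, leading to "Out-Of-Character" generation.
Method: The paper proposes a structured style-rewrite framework that explicitly disentangles style into three interpretable dimensions: PMI-based lexical signatures, syntactic patterns grounded in PCFG rules, and pragmatic style. It also introduces an implicit style conditioning strategy via Chain-of-Thought distillation, using explicit reasoning traces during training as a strong inductive bias so that the model's latent representations align with structured style features, with no explicit reasoning tokens needed at inference.
Result: Extensive experiments in a highly stylized domain (anime characters) show that the method enables a Qwen-1.7B model to significantly outperform larger baselines (e.g., 4B vanilla supervised fine-tuning) in style consistency and semantic fidelity. The approach is data-efficient and supports inference and deployment on consumer hardware.
Conclusion: The study offers a data-efficient paradigm for stylized generation. Structured style disentanglement combined with implicit conditioning effectively addresses style fidelity for small language models in role-playing, improving style consistency and opening a path to deploying high-quality stylized language models in resource-constrained settings.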
📄 Abstract
Large Language Models (LLMs) have demonstrated impressive capabilities in role-playing (RP); however, equipping Small Language Models (SLMs) with highly stylized personas remains a challenge due to data scarcity and the complexity of style disentanglement. Standard Supervised Fine-Tuning (SFT) often captures surface-level semantics while failing to reproduce the intricate syntactic and pragmatic nuances of a character, leading to "Out-Of-Character" (OOC) generation. To address this, we propose a Structured Style-Rewrite Framework that explicitly disentangles style into three interpretable dimensions: lexical signatures (via PMI), syntactic patterns (grounded in PCFG rules), and pragmatic style. Furthermore, we introduce an implicit style conditioning strategy via Chain-of-Thought (CoT) distillation. By leveraging explicit reasoning traces during training as a strong inductive bias, our approach aligns the model's latent representations with structured style features, enabling high-fidelity stylized generation without requiring explicit reasoning tokens during inference. Extensive experiments on a specific high-stylization domain (anime characters) demonstrate that our method enables a Qwen-1.7B model to outperform significantly larger baselines (e.g., 4B Vanilla SFT) in style consistency and semantic fidelity. Our approach offers a data-efficient paradigm for democratizing inference and deployment on consumer hardware.
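As an illustration of the "lexical signatures (via PMI)" dimension, the sketch below scores each word by its pointwise mutual information with a character, PMI(w, c) = log(P(w | c) / P(w)), and keeps the top-scoring words. The whitespace tokenization, the `min_count` filter, and the function name are assumptions for this example, not details taken from the paper.

```python
import math
from collections import Counter

def pmi_signature(character_lines, background_lines, top_k=3, min_count=2):
    """Return the words most strongly associated with a character by PMI."""
    def tokens(lines):
        return [w for line in lines for w in line.lower().split()]

    char_counts = Counter(tokens(character_lines))
    all_counts = Counter(tokens(character_lines + background_lines))
    n_char = sum(char_counts.values())
    n_all = sum(all_counts.values())

    scores = {}
    for word, count in char_counts.items():
        if count < min_count:
            continue  # rare words give unstable PMI estimates
        p_word_given_char = count / n_char
        p_word = all_counts[word] / n_all
        scores[word] = math.log(p_word_given_char / p_word)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Words that a character uses far more often than the corpus average (catchphrases, sentence-final particles) receive a high PMI and surface at the top of the list.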
[109] MASFactory: A Graph-centric Framework for Orchestrating LLM-Based Multi-Agent Systems with Vibe Graphing
Yang Liu, Jinxuan Cai, Yishen Li, Qi Meng, Zedi Liu, Xin Li, Chen Qian, Chuan Shi, Cheng Yang
🧩 TL;DR
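This paper presents MASFactory, a graph-centric orchestration framework for LLM-based multi-agent systems. Its Vibe Graphing technique compiles natural-language intent into an editable workflow specification, addressing the difficulty of implementing complex graph workflows, poor reuse, and weak integration of heterogeneous context in existing frameworks.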
📘 Detailed Summary
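Motivation: Current frameworks for LLM-based multi-agent systems have clear limitations when implementing complex graph workflows: they require substantial manual effort, offer limited reuse, and make it difficult to integrate heterogeneous external context sources, which hinders flexible deployment and scaling.
Method: MASFactory adopts a graph-centric architecture. Its core contribution is Vibe Graphing, a human-in-the-loop approach that compiles natural-language intent into an editable workflow specification and then into an executable graph. The framework also provides reusable components, pluggable context integration, and a visualizer for topology preview, runtime tracing, and human-in-the-loop interaction.
Result: Evaluation on seven public benchmarks validates both reproduction consistency for representative multi-agent methods and the effectiveness of Vibe Graphing, demonstrating the framework's practical value.
Conclusion: Through graph-centric workflow orchestration and human-in-the-loop intent compilation, MASFactory lowers the barrier to developing multi-agent systems and improves workflow reuse and context integration, offering a systematic solution for building and deploying complex multi-agent systems quickly.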
📄 Abstract
Large language model-based (LLM-based) multi-agent systems (MAS) are increasingly used to extend agentic problem solving via role specialization and collaboration. MAS workflows can be naturally modeled as directed computation graphs, where nodes execute agents/sub-workflows and edges encode dependencies and message passing. However, implementing complex graph workflows in current frameworks still requires substantial manual effort, offers limited reuse, and makes it difficult to integrate heterogeneous external context sources. To overcome these limitations, we present MASFactory, a graph-centric framework for orchestrating LLM-based MAS. It introduces Vibe Graphing, a human-in-the-loop approach that compiles natural-language intent into an editable workflow specification and then into an executable graph. In addition, the framework provides reusable components and pluggable context integration, as well as a visualizer for topology preview, runtime tracing, and human-in-the-loop interaction. We evaluate MASFactory on seven public benchmarks, validating both reproduction consistency for representative MAS methods and the effectiveness of Vibe Graphing. Our code (https://github.com/BUPT-GAMMA/MASFactory) and video (https://youtu.be/ANynzVfY32k) are publicly available.
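The directed-computation-graph view of a workflow, where nodes execute agents and edges carry messages, can be sketched with the standard library alone. This is not MASFactory's API: `run_workflow`, the node callables, and the message format are hypothetical stand-ins chosen to show dependency-ordered execution.

```python
from graphlib import TopologicalSorter

def run_workflow(nodes, edges, initial_message):
    """Execute a DAG workflow.

    nodes: dict mapping a node name to a callable taking a list of input messages.
    edges: iterable of (source, destination) pairs encoding dependencies.
    Nodes without predecessors receive the initial message.
    """
    deps = {name: set() for name in nodes}
    for src, dst in edges:
        deps[dst].add(src)

    outputs = {}
    # static_order() yields each node only after all of its predecessors.
    for name in TopologicalSorter(deps).static_order():
        inputs = [outputs[p] for p in sorted(deps[name])] or [initial_message]
        outputs[name] = nodes[name](inputs)
    return outputs
```

In a real system each callable would wrap an LLM agent or a sub-workflow; plain functions are enough to check that messages flow along the edges in dependency order.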
[110] ViewFusion: Structured Spatial Thinking Chains for Multi-View Reasoning
Xingjian Tao, Yiwei Wang, Yujun Cai, Yifan Song, Jing Tang
🧩 TL;DR
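This paper presents ViewFusion, a two-stage framework that tackles multi-view spatial reasoning by explicitly separating cross-view spatial pre-alignment from question answering. On MMSI-Bench it improves accuracy by 5.3% over Qwen3-VL-4B-Instruct, with especially strong gains on tasks that require genuine cross-view alignment.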
📘 Detailed Summary
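Motivation: Current vision-language models still struggle with multi-view spatial reasoning. Even when multiple viewpoints are available, models often underuse cross-view relations and rely on single-image shortcuts, which makes them fragile on viewpoint-transformation and occlusion-sensitive cases. Existing methods lack explicit modeling of and reasoning over cross-view spatial relations.
Method: ViewFusion uses a two-stage framework. In the first stage, the model performs deliberate spatial pre-thinking to infer viewpoint relations and spatial transformations across views, forming an intermediate workspace that goes beyond simple re-description. In the second stage, it conducts question-driven reasoning conditioned on this workspace to produce the final prediction. Training combines synthetic reasoning supervision with GRPO-based reinforcement learning, improving answer correctness while stabilizing the two-stage generation behavior.
Result: On MMSI-Bench, ViewFusion improves accuracy by 5.3% over Qwen3-VL-4B-Instruct, with the largest gains on examples that require genuine cross-view alignment. Reinforcement learning improves answer correctness and stabilizes the two-stage generation behavior.
Conclusion: Explicitly separating cross-view spatial pre-alignment from question answering substantially improves multi-view spatial reasoning. The intermediate workspace goes beyond simple re-description and yields a more robust solution for complex spatial reasoning, suggesting a new architectural direction for vision-language models on tasks that need cross-view alignment.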
📄 Abstract
Multi-view spatial reasoning remains difficult for current vision-language models. Even when multiple viewpoints are available, models often underutilize cross-view relations and instead rely on single-image shortcuts, leading to fragile performance on viewpoint transformation and occlusion-sensitive cases. We present ViewFusion, a two-stage framework that explicitly separates cross-view spatial pre-alignment from question answering. In the first stage, the model performs deliberate spatial pre-thinking to infer viewpoint relations and spatial transformations across views, forming an intermediate workspace that goes beyond a simple re-description. In the second stage, the model conducts question-driven reasoning conditioned on this workspace to produce the final prediction. We train ViewFusion with synthetic reasoning supervision followed by reinforcement learning using GRPO, which improves answer correctness while stabilizing the intended two-stage generation behavior. On MMSI-Bench, ViewFusion improves accuracy by 5.3% over Qwen3-VL-4B-Instruct, with the largest gains on examples that require genuine cross-view alignment.
[111] MAPO: Mixed Advantage Policy Optimization for Long-Horizon Multi-Turn Dialogue
Naifan Zhang, Ruihan Sun, Jinwei Su, Hengjie Yang, Zhengyuan Pan, Zhaohan Chen, Xiaofan Zhang
🧩 TL;DR
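This paper proposes MAPO, a critic-free and efficient reinforcement learning algorithm that uses dense process feedback from a judge model and propagates long-horizon effects through Monte Carlo returns, addressing the lack of reliable process supervision in subjective multi-turn dialogue tasks.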
📘 Detailed Summary
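Motivation: Subjective multi-turn dialogue tasks such as emotional support require conversational policies that adapt to evolving user states and optimize long-horizon interaction quality. Reinforcement learning is difficult in these settings, mainly because reliable process supervision is missing. Outcome-only training collapses credit assignment across turns into a single trajectory-level reward, while naive turn-level group sampling incurs prohibitive rollout costs in interactive environments.
Method: The paper proposes MAPO, a critic-free and efficient reinforcement learning algorithm that uses dense process feedback from a judge model and propagates long-horizon effects through Monte Carlo returns. To stabilize optimization, it introduces a mixed advantage estimator that combines turn-level normalization with batch-level normalization, giving fine-grained yet scalable credit assignment.
Result: Across multiple subjective dialogue benchmarks (EMPA, EmoBench, and EQ-Bench) and model scales from 7B to 32B, the method outperforms outcome-only GRPO and single-level normalization baselines in both training stability and final performance. On EMPA, success rates improve by up to 9 percentage points and dialogue scores increase by as much as +43.2 over the 7B base model. Despite training only in EMPA-style environments, the method generalizes well, with consistent improvements on unseen emotional-intelligence benchmarks, including up to +4 points on EmoBench and +3.5 on EQ-Bench.
Conclusion: Dense process supervision combined with mixed-level normalization enables effective and scalable reinforcement learning for subjective, open-ended multi-turn dialogue. The method improves training stability and generalizes across tasks, offering a new technical route for policy optimization in complex dialogue systems.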
📄 Abstract
Subjective multi-turn dialogue tasks, such as emotional support, require conversational policies that adapt to evolving user states and optimize long-horizon interaction quality. However, reinforcement learning (RL) for such settings remains challenging due to the absence of reliable process supervision. Outcome-only training collapses credit assignment across turns into a single trajectory-level reward, while naïve turn-level group sampling incurs prohibitive rollout costs in interactive environments. We propose a critic-free and efficient RL algorithm named MAPO that leverages dense process feedback from a judge model and propagates long-horizon effects through Monte Carlo returns. To stabilize optimization, we introduce a mixed advantage estimator that combines turn-level normalization with batch-level normalization, enabling fine-grained yet scalable credit assignment. Across multiple subjective dialogue benchmarks, including EMPA, EmoBench, and EQ-Bench, and model scales ranging from 7B to 32B, our method consistently improves both training stability and final performance over outcome-only GRPO and single-level normalization baselines. On EMPA, we improve success rates by up to 9 points and increase dialogue scores by as much as +43.2 over the 7B base model. Despite training only on EMPA-style environments, our approach generalizes well, yielding consistent improvements on unseen emotional-intelligence benchmarks, including up to +4 points on EmoBench and +3.5 on EQ-Bench. Together, these results demonstrate that dense process supervision combined with mixed-level normalization enables effective and scalable RL for subjective, open-ended multi-turn dialogue.
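A rough sketch of how Monte Carlo returns and a mixed advantage estimator might fit together is given below. The abstract does not state how the two normalizations are combined, so the convex mix with weight `alpha`, the discount `gamma`, and all function names are assumptions for illustration rather than the authors' formulation.

```python
from statistics import mean, pstdev

def mc_returns(rewards, gamma=1.0):
    """Monte Carlo return at each turn: G_t = r_t + gamma * G_{t+1}."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]

def _normalize(values, eps=1e-8):
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]

def mixed_advantages(batch_rewards, alpha=0.5, gamma=1.0):
    """Blend turn-level and batch-level normalized returns (assumed convex mix)."""
    returns = [mc_returns(r, gamma) for r in batch_rewards]

    # Batch-level: normalize over every turn of every dialogue in the batch.
    flat = iter(_normalize([g for traj in returns for g in traj]))
    batch_level = [[next(flat) for _ in traj] for traj in returns]

    # Turn-level: normalize across dialogues at the same turn index.
    turn_level = [[0.0] * len(traj) for traj in returns]
    for t in range(max(len(traj) for traj in returns)):
        idx = [i for i, traj in enumerate(returns) if len(traj) > t]
        for i, adv in zip(idx, _normalize([returns[i][t] for i in idx])):
            turn_level[i][t] = adv

    return [
        [alpha * tl + (1 - alpha) * bl for tl, bl in zip(t_row, b_row)]
        for t_row, b_row in zip(turn_level, batch_level)
    ]
```

Here `batch_rewards` would hold per-turn judge scores for each sampled dialogue; no learned critic is involved, since the baseline comes entirely from the normalization statistics.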
[112] Wisdom of the AI Crowd (AI-CROWD) for Ground Truth Approximation in Content Analysis: A Research Protocol & Validation Using Eleven Large Language Models
Luis de-Marcos, Manuel Goyanes, Adrián Domínguez-Díaz
🧩 TL;DR
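This paper proposes the AI-CROWD protocol, which approximates the ground-truth labels needed for large-scale content analysis by aggregating the outputs of multiple large language models, overcoming the time, cost, and consistency limits of human annotation.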
📘 Detailed Summary
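Motivation: The main obstacle in large-scale content analysis is the absence of observable ground truth or gold-standard labels. Creating such benchmarks for massive datasets through extensive human coding is impractical because of high time and cost and the difficulty of ensuring consistency.
Method: The AI-CROWD protocol approximates ground-truth labels by aggregating the outputs of multiple large language models. It uses majority voting to aggregate outputs and diagnostic metrics to analyze agreement and disagreement patterns across models, identifying high-confidence classifications while flagging potential ambiguity or model-specific biases.
Result: The method produces a consensus-based approximation of ground truth. Using convergent and divergent inferences, it identifies high-confidence classifications and flags ambiguous regions of the data and model-specific biases, offering a workable alternative for large-scale content analysis.
Conclusion: The AI-CROWD protocol is a practical solution for large-scale content analysis without ground-truth labels. Model ensembling and consensus balance annotation efficiency against reliability, which is valuable for natural language processing and data annotation, especially when very large text collections must be processed.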
📄 Abstract
Large-scale content analysis is increasingly limited by the absence of observable ground truth or gold-standard labels, as creating such benchmarks through extensive human coding becomes impractical for massive datasets due to high time, cost, and consistency challenges. To overcome this barrier, we introduce the AI-CROWD protocol, which approximates ground truth by leveraging the collective outputs of an ensemble of large language models (LLMs). Rather than asserting that the resulting labels are true ground truth, the protocol generates a consensus-based approximation derived from convergent and divergent inferences across multiple models. By aggregating outputs via majority voting and interrogating agreement/disagreement patterns with diagnostic metrics, AI-CROWD identifies high-confidence classifications while flagging potential ambiguity or model-specific biases.
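The majority-vote aggregation with an agreement diagnostic can be sketched in a few lines. The agreement score used here (share of models voting for the winning label) and the `threshold` for flagging are assumed choices for illustration; the abstract does not specify which diagnostic metrics the protocol uses.

```python
from collections import Counter

def crowd_label(votes):
    """Return the majority label and the share of models that agreed with it."""
    counts = Counter(votes)
    label, top = counts.most_common(1)[0]  # ties resolve to the first label seen
    return label, top / len(votes)

def aggregate(items, threshold=0.7):
    """Label every item and flag those whose agreement falls below the threshold."""
    results = {}
    for item_id, votes in items.items():
        label, agreement = crowd_label(votes)
        results[item_id] = {
            "label": label,
            "agreement": agreement,
            "flagged": agreement < threshold,
        }
    return results
```

Flagged items are the natural candidates for human review, since low agreement may indicate either genuine ambiguity in the text or a bias specific to some of the models.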
[113] LIT-RAGBench: Benchmarking Generator Capabilities of Large Language Models in Retrieval-Augmented Generation
Koki Itai, Shunichi Hasegawa, Yuta Yamamoto, Gouki Minegishi, Masaki Otsuki
🧩 TL;DR
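This paper introduces LIT-RAGBench, a benchmark for evaluating the Generator in RAG systems across multiple capabilities, including logic, information integration, table understanding, multi-step reasoning, and handling of missing evidence, filling gaps in existing evaluation frameworks.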
📘 Detailed Summary
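Motivation: Existing Retrieval-Augmented Generation (RAG) benchmarks cover Generator capabilities only partially and cannot evaluate several key capabilities simultaneously under unified conditions, including long-context evidence integration, multi-step reasoning, table interpretation, and abstention when evidence is missing. This limits effective model evaluation and selection for real deployments.
Method: The authors build LIT-RAGBench, which defines five evaluation categories: Integration, Reasoning, Logic, Table, and Abstention, each further divided into practical evaluation aspects. The benchmark uses fictional entities and scenarios so that answers must be grounded in the external documents. It contains 114 human-constructed Japanese questions plus an English version produced by machine translation with human curation, and it is scored with LLM-as-a-Judge.
Result: Across API-based and open-weight models, no model exceeds 90% overall accuracy. The benchmark systematically measures each model's strengths and weaknesses per category, providing detailed metrics for model selection and revealing how current RAG Generators differ across capability dimensions.
Conclusion: LIT-RAGBench fills an important gap in RAG Generator evaluation, providing a useful metric for model selection in practical RAG deployments and for building RAG-specialized models. The release includes the dataset and evaluation code and should support further research on RAG systems.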
📄 Abstract
Retrieval-Augmented Generation (RAG) is a framework in which a Generator, such as a Large Language Model (LLM), produces answers by retrieving documents from an external collection using a Retriever. In practice, Generators must integrate evidence from long contexts, perform multi-step reasoning, interpret tables, and abstain when evidence is missing. However, existing benchmarks for Generators provide limited coverage, with none enabling simultaneous evaluation of multiple capabilities under unified conditions. To bridge the gap between existing evaluations and practical use, we introduce LIT-RAGBench (the Logic, Integration, Table, Reasoning, and Abstention RAG Generator Benchmark), which defines five categories: Integration, Reasoning, Logic, Table, and Abstention, each further divided into practical evaluation aspects. LIT-RAGBench systematically covers patterns combining multiple aspects across categories. By using fictional entities and scenarios, LIT-RAGBench evaluates answers grounded in the provided external documents. The dataset consists of 114 human-constructed Japanese questions and an English version generated by machine translation with human curation. We use LLM-as-a-Judge for scoring and report category-wise and overall accuracy. Across API-based and open-weight models, no model exceeds 90% overall accuracy. By making strengths and weaknesses measurable within each category, LIT-RAGBench serves as a valuable metric for model selection in practical RAG deployments and for building RAG-specialized models. We release LIT-RAGBench, including the dataset and evaluation code, at https://github.com/Koki-Itai/LIT-RAGBench.
[114] FlashPrefill: Instantaneous Pattern Discovery and Thresholding for Ultra-Fast Long-Context Prefilling
Qihang Fan, Huaibo Huang, Zhiying Wu, Juqiu Wang, Bingning Wang, Ran He
🧩 TL;DR
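This paper proposes FlashPrefill, a framework that achieves ultra-fast prefilling through instantaneous pattern discovery and dynamic thresholding. It targets the quadratic complexity of attention, the key bottleneck in long-context modeling, and delivers a 27.78x speedup on 256K sequences.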
📘 Detailed Summary
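Motivation: Long-context modeling is a pivotal capability of large language models, but the quadratic complexity of attention is a critical bottleneck in the compute-intensive prefilling phase. Existing sparse attention methods typically suffer from either significant search latency or insufficient sparsity.
Method: FlashPrefill uses a fast block-searching technique to locate dynamic vertical, slash, and block-sparse attention patterns simultaneously. It introduces a dynamic thresholding mechanism that avoids the prohibitive overhead of sorting or accumulating attention scores while effectively eliminating the long-tail distribution to increase sparsity.
Result: In extensive evaluations, FlashPrefill achieves a 27.78x speedup on 256K sequences. Unlike existing methods, whose efficiency degrades on shorter contexts, FlashPrefill still maintains a 1.71x speedup at a 4K context length, showing robustness and practical utility across sequence scales.
Conclusion: Instantaneous pattern discovery combined with dynamic thresholding can substantially improve the efficiency of long-context modeling, giving a practical solution for efficient prefilling in large language models that holds up across sequence lengths.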
📄 Abstract
Long-context modeling is a pivotal capability for Large Language Models, yet the quadratic complexity of attention remains a critical bottleneck, particularly during the compute-intensive prefilling phase. While various sparse attention mechanisms have been explored, they typically suffer from either significant search latency or insufficient sparsity. In this paper, we propose FlashPrefill, a framework enabling ultra-fast prefilling via instantaneous pattern discovery and thresholding. FlashPrefill leverages a fast block-searching technique to simultaneously locate dynamic vertical, slash, and block-sparse attention patterns. Crucially, it introduces a dynamic thresholding mechanism that bypasses the prohibitive overhead of sorting or accumulating attention scores while effectively eliminating the long-tail distribution to enhance sparsity. Extensive evaluations demonstrate that FlashPrefill achieves a substantial leap in efficiency, delivering an unprecedented 27.78x speedup on 256K sequences. Notably, unlike existing methods that incur efficiency degradation on shorter contexts, FlashPrefill maintains a 1.71x speedup even at a 4K context length, demonstrating its robustness and practical utility across varying sequence scales.
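To make "thresholding instead of sorting" concrete, the toy sketch below keeps, for each query block, every key block whose score reaches a fixed fraction of that row's maximum. The relative-to-max rule, the `alpha` parameter, and the assumption of non-negative block scores are illustrative choices; the abstract does not describe FlashPrefill's actual threshold, and a real implementation would run as a GPU kernel.

```python
def select_blocks(block_scores, alpha=0.1):
    """Pick key blocks per query block with a single threshold pass.

    block_scores: list of rows, one per query block, holding non-negative
    importance scores for each key block. No sorting or cumulative sum is
    needed; each row costs one max() and one comparison pass.
    """
    kept = []
    for row in block_scores:
        cutoff = alpha * max(row)
        kept.append([j for j, score in enumerate(row) if score >= cutoff])
    return kept
```

Because the cutoff scales with the row maximum, a row dominated by a few strong blocks drops its long tail, while a row with flat scores keeps more blocks.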
[115] SPOT: Span-level Pause-of-Thought for Efficient and Interpretable Latent Reasoning in Large Language Models
Yunlong Chu, Minglai Shao, Yuhang Liu, Bing Hao, Yumeng Lin, Jialu Wang, Ruijie Wang
🧩 TL;DR
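This paper proposes SPOT, a framework that improves LLM reasoning efficiency by compressing explicit Chain-of-Thought into compact latent pause tokens while preserving semantic alignment and interpretability. It reduces generated tokens by 37.5% while improving reasoning accuracy by 2.3 points on average.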
📘 Detailed Summary
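Motivation: Explicit Chain-of-Thought improves LLM reasoning but often incurs high inference cost from verbose token-level traces. Existing approaches mostly reduce overhead through concise prompting or step pruning, yet they largely truncate the model's output rather than internalize its thinking. Latent reasoning is an alternative, but it faces two key challenges. First, existing methods rely on rigid point-to-point alignment, forcing a latent token to approximate the final representation of a reasoning step, which fails to capture the dense, variable-length semantics of an entire reasoning segment. Second, they lack interpretability: latent states are usually produced by unconstrained optimization or embedding mixing, making them hard to decode or audit under the pretrained language head.
Method: SPOT compresses explicit Chain-of-Thought into compact latent pause tokens without enforcing a fixed response template. Its core is Span-level Semantic Alignment, a Sinkhorn optimal-transport objective that softly matches each pause token to the semantics of an entire reasoning segment, overcoming the rigidity of step-end alignment. For interpretability, SPOT adds a Frozen-Head Decoding Constraint that keeps latent states directly decodable as token distributions under the frozen pretrained LM head, yielding readable keyword interpretations of latent thoughts.
Result: On reasoning benchmarks, SPOT improves accuracy by 2.3 points on average while reducing generated tokens by 37.5%. Beyond efficiency, it provides faithful semantic interpretations of the latent reasoning process, showing that the framework can compress reasoning while remaining interpretable.
Conclusion: With span-level semantic alignment and the frozen-head decoding constraint, SPOT addresses both the rigidity of semantic alignment and the lack of interpretability in latent reasoning. Softly matching whole-segment semantics while keeping latent states decodable can cut reasoning overhead substantially while improving performance, pointing to a new direction for efficient and interpretable reasoning systems.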
📄 Abstract
Explicit Chain-of-Thought improves the reasoning performance of large language models but often incurs high inference cost due to verbose token-level traces. While recent approaches reduce this overhead via concise prompting or step pruning, they largely truncate what the model says rather than internalize what the model thinks. Latent reasoning offers a promising alternative by performing computation in the hidden space, yet prior methods face two critical challenges. Many existing approaches rely on rigid point-to-point alignment, forcing a latent token to approximate the final representation of a reasoning step, which can be insufficient to capture the dense, variable-length semantics of an entire reasoning segment. Furthermore, these methods often suffer from a lack of interpretability: latent states are commonly produced by unconstrained optimization or embedding mixing, yielding vectors that are difficult to decode or audit under the pretrained language head. We propose SPOT, a flexible framework that compresses explicit CoT into compact latent pause tokens without enforcing a fixed response template. At the core of SPOT is Span-level Semantic Alignment, a Sinkhorn optimal-transport objective that softly matches each pause token to the semantics of an entire reasoning segment, overcoming the rigidity of step-end alignment. To further improve interpretability, SPOT introduces a Frozen-Head Decoding Constraint that keeps latent states directly decodable as token distributions under the frozen pretrained LM head, enabling readable keyword interpretations of latent thoughts. Experiments on reasoning benchmarks demonstrate that SPOT improves accuracy by 2.3 points on average while reducing generated tokens by 37.5% and provides faithful semantic interpretations of the latent reasoning process.
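The Sinkhorn step behind a soft, span-level matching can be illustrated with a small pure-Python routine. It computes an entropy-regularized transport plan between pause tokens (rows) and the tokens of a reasoning segment (columns). The uniform marginals, the regularization strength `reg`, and the idea of a precomputed cost matrix (for instance one minus cosine similarity) are assumptions for this sketch; SPOT's actual objective may differ.

```python
import math

def sinkhorn(cost, reg=0.1, n_iters=200):
    """Entropy-regularized optimal transport plan with uniform marginals.

    cost[i][j] is the cost of matching pause token i to segment token j.
    """
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    a = [1.0 / n] * n  # mass on each pause token
    b = [1.0 / m] * m  # mass on each segment token
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
```

Each row of the returned plan spreads one pause token's mass over several segment tokens, which is the "soft" alternative to pinning a latent token to a single step-end representation.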
[116] Mind the Gap: Pitfalls of LLM Alignment with Asian Public Opinion
Hari Shankar, Vedanta S P, Sriharini Margapuri, Debjani Mazumder, Ponnurangam Kumaraguru, Abhijnan Chakraborty
🧩 TL;DR
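This study conducts a comprehensive multilingual audit of the cultural alignment of contemporary LLMs, including GPT-4o-Mini, Gemini-2.5-Flash, Llama 3.2, Mistral, and Gemma 3. It finds systematic misalignment with minority-group viewpoints in the sensitive domain of religion and shows the limits of lightweight interventions.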
📘 Detailed Summary
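Motivation: LLMs are increasingly deployed in multilingual, multicultural settings, but their reliance on predominantly English-centric training data risks misalignment with the diverse cultural values of different societies, especially in sensitive domains such as religion. The degree of cultural alignment therefore needs systematic evaluation.
Method: The study applies a multilingual audit framework to several mainstream LLMs. It analyzes each model's internal representations via log-probs/logits, compares the model's opinion distributions against real public attitudes, evaluates lightweight interventions such as demographic priming and native-language prompting, and validates the findings on downstream bias benchmarks.
Result: Mainstream models generally align with public opinion on broad social issues but consistently fail on religious viewpoints, especially those of minority groups, which are under-represented and often subject to amplified negative stereotypes. Lightweight interventions only partially mitigate the cultural gaps and do not eliminate them, and downstream bias benchmarks reveal persistent harms and under-representation.
Conclusion: The findings underscore the urgent need for systematic, regionally grounded audits to ensure equitable global deployment of LLMs. Current models show clear alignment deficits in culturally sensitive domains and need deeper cultural grounding and evaluation mechanisms for deployment in multicultural settings.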
📄 Abstract
Large Language Models (LLMs) are increasingly being deployed in multilingual, multicultural settings, yet their reliance on predominantly English-centric training data risks misalignment with the diverse cultural values of different societies. In this paper, we present a comprehensive, multilingual audit of the cultural alignment of contemporary LLMs including GPT-4o-Mini, Gemini-2.5-Flash, Llama 3.2, Mistral and Gemma 3 across India, East Asia and Southeast Asia. Our study specifically focuses on the sensitive domain of religion as the prism for broader alignment. To facilitate this, we conduct a multi-faceted analysis of every LLM's internal representations, using log-probs/logits, to compare the model's opinion distributions against ground-truth public attitudes. We find that while the popular models generally align with public opinion on broad social issues, they consistently fail to accurately represent religious viewpoints, especially those of minority groups, often amplifying negative stereotypes. Lightweight interventions, such as demographic priming and native language prompting, partially mitigate but do not eliminate these cultural gaps. We further show that downstream evaluations on bias benchmarks (such as CrowS-Pairs, IndiBias, ThaiCLI, KoBBQ) reveal persistent harms and under-representation in sensitive contexts. Our findings underscore the urgent need for systematic, regionally grounded audits to ensure equitable global deployment of LLMs.
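The comparison between a model's opinion distribution and survey data can be sketched as follows: log-probabilities over the answer options are renormalized into a distribution, which is then compared with the survey distribution. Jensen-Shannon divergence (base 2, bounded in [0, 1]) is used here as one common choice; the abstract does not name the distance the authors use, so that choice and the function names are assumptions.

```python
import math

def softmax_from_logprobs(logprobs):
    """Renormalize option log-probabilities into a probability distribution."""
    peak = max(logprobs)
    exps = [math.exp(lp - peak) for lp in logprobs]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits; 0 means identical, 1 means disjoint."""
    def kl(x, y):
        return sum(xi * math.log2(xi / yi) for xi, yi in zip(x, y) if xi > 0)

    mid = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)
```

For a survey question with options (agree, neutral, disagree), the model's log-probs for the three option tokens would go through `softmax_from_logprobs`, and the result would be scored against the observed response shares with `js_divergence`.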
[117] From Prompting to Preference Optimization: A Comparative Study of LLM-based Automated Essay Scoring
Minh Hoang Nguyen, Vu Hoang Pham, Xuan Thanh Huynh, Phuc Hong Mai, Vinh The Nguyen, Quang Nhut Huynh, Huy Tien Nguyen, Tung Le
🧩 TL;DR
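This study presents the first unified empirical comparison of the main LLM paradigms for automated scoring of English as a Second Language writing, revealing trade-offs among accuracy, cost, and robustness; the best configuration reaches an F1 score of 93%.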
📘 Detailed Summary
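Motivation: Prior studies usually examine individual LLM techniques for automated essay scoring in isolation and offer no systematic picture of their relative strengths and weaknesses for English as a Second Language writing, which leaves little basis for choosing among methods in practice.
Method: On a unified IELTS Writing Task 2 benchmark, the study evaluates four main approaches: encoder-based classification fine-tuning, zero- and few-shot prompting, instruction tuning with Retrieval-Augmented Generation, and supervised fine-tuning combined with Direct Preference Optimization and Retrieval-Augmented Generation.
Result: The experiments show clear trade-offs among accuracy, cost, and robustness across methods. The best configuration, integrating k-SFT and RAG, achieves the strongest overall performance with an F1 score of 93%.
Conclusion: The study provides the first unified empirical comparison of modern LLM-based automated scoring strategies for English as a Second Language writing. It exposes the performance trade-offs between paradigms, gives an empirical basis for designing and selecting automated scoring systems in educational technology, and shows practical potential for auto-grading writing tasks.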
📄 Abstract
Large language models (LLMs) have recently reshaped Automated Essay Scoring (AES), yet prior studies typically examine individual techniques in isolation, limiting understanding of their relative merits for English as a Second Language (L2) writing. To bridge this gap, we present a comprehensive comparison of major LLM-based AES paradigms on IELTS Writing Task 2. On this unified benchmark, we evaluate four approaches: (i) encoder-based classification fine-tuning, (ii) zero- and few-shot prompting, (iii) instruction tuning and Retrieval-Augmented Generation (RAG), and (iv) Supervised Fine-Tuning combined with Direct Preference Optimization (DPO) and RAG. Our results reveal clear accuracy-cost-robustness trade-offs across methods; the best configuration, integrating k-SFT and RAG, achieves the strongest overall results with an F1 score of 93%. This study offers the first unified empirical comparison of modern LLM-based AES strategies for English L2 and shows promising potential for auto-grading writing tasks. Code is public at https://github.com/MinhNguyenDS/LLM_AES-EnL2
[118] PONTE: Personalized Orchestration for Natural Language Trustworthy Explanations
Vittoria Vineis, Matteo Silvestri, Lorenzo Antonelli, Filippo Betello, Gabriele Tolomei
🧩 TL;DR
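This paper presents PONTE, a human-in-the-loop framework for generating personalized, trustworthy explanations. Through closed-loop validation and adaptation, it addresses two problems: existing explainable AI methods ignore user differences, and LLM-generated explanations suffer from faithfulness issues and hallucinations.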
📘 Detailed Summary
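Motivation: Most explainable AI methods follow a one-size-fits-all paradigm that neglects user differences in expertise, goals, and cognitive needs. LLMs can translate technical explanations into natural language, but they introduce challenges of faithfulness and hallucination. A framework for explainable AI narratives is needed that adapts to different users and remains reliable.
Method: PONTE models personalization as a closed-loop validation and adaptation process rather than prompt engineering. It has three core components: a low-dimensional preference model capturing stylistic requirements; a preference-conditioned generator grounded in structured explainable AI artifacts; and verification modules enforcing numerical faithfulness, informational completeness, and stylistic alignment, optionally supported by retrieval-grounded argumentation. User feedback iteratively updates the preference state, enabling quick personalization.
Result: Automatic and human evaluations in healthcare and finance show that the verification-refinement loop substantially improves completeness and stylistic alignment over validation-free generation. Human studies further confirm strong agreement between intended preference vectors and perceived style, robustness to generation stochasticity, and consistently positive quality assessments.
Conclusion: By recasting personalization as a closed-loop validation process rather than static prompt engineering, PONTE offers a reliable and adaptive approach to explainable AI. Its verification mechanism addresses the faithfulness problems of LLM-generated explanations and provides a new paradigm for human-in-the-loop explainable AI systems.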
📄 Abstract
Explainable Artificial Intelligence (XAI) seeks to enhance the transparency and accountability of machine learning systems, yet most methods follow a one-size-fits-all paradigm that neglects user differences in expertise, goals, and cognitive needs. Although Large Language Models can translate technical explanations into natural language, they introduce challenges related to faithfulness and hallucinations. To address these challenges, we present PONTE (Personalized Orchestration for Natural language Trustworthy Explanations), a human-in-the-loop framework for adaptive and reliable XAI narratives. PONTE models personalization as a closed-loop validation and adaptation process rather than prompt engineering. It combines: (i) a low-dimensional preference model capturing stylistic requirements; (ii) a preference-conditioned generator grounded in structured XAI artifacts; and (iii) verification modules enforcing numerical faithfulness, informational completeness, and stylistic alignment, optionally supported by retrieval-grounded argumentation. User feedback iteratively updates the preference state, enabling quick personalization. Automatic and human evaluations across healthcare and finance domains show that the verification-refinement loop substantially improves completeness and stylistic alignment over validation-free generation. Human studies further confirm strong agreement between intended preference vectors and perceived style, robustness to generation stochasticity, and consistently positive quality assessments.
[119] Beyond Rows to Reasoning: Agentic Retrieval for Multimodal Spreadsheet Understanding and Editing
Anmol Gulati, Sahil Sen, Waqar Sarguroh, Kevin Paul
🧩 TL;DR
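This paper presents Beyond Rows to Reasoning (BRTR), a multimodal agentic framework for spreadsheet understanding that replaces single-pass retrieval with an iterative tool-calling loop and achieves state-of-the-art performance on three frontier benchmarks.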
📘 Detailed Summary
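Motivation: State-of-the-art multimodal Retrieval-Augmented Generation approaches have three key problems when analyzing enterprise spreadsheet workbooks that contain millions of cells, cross-sheet dependencies, and embedded visual artifacts: single-pass retrieval excludes critical context, compression loses data resolution, and naive full-context injection exceeds LLM context windows. These limits prevent reliable multi-step reasoning over complex enterprise workbooks.
Method: BRTR adopts a multimodal agentic architecture that replaces conventional single-pass retrieval with an iterative tool-calling loop, supporting end-to-end Excel workflows from complex analysis to structured editing. The framework comprises a planner, a retriever, and an iterative reasoning component, and the study evaluates five multimodal embedding models and nine large language models to optimize handling of mixed tabular and visual data.
Result: BRTR achieves state-of-the-art performance on three frontier spreadsheet understanding benchmarks, surpassing prior methods by 25 percentage points on FRTR-Bench, 7 points on SpreadsheetLLM, and 32 points on FINCH. Ablations confirm that the planner, retrieval, and iterative reasoning each contribute substantially. Cost analysis shows GPT-5.2 achieves the best efficiency-accuracy trade-off, and NVIDIA NeMo Retriever 1B performs best for retrieval over mixed tabular and visual data.
Conclusion: With its iterative tool-calling mechanism, BRTR addresses context loss and reasoning limits in enterprise spreadsheet analysis and provides a reliable solution for multi-step reasoning over complex workbooks. The study highlights the importance of the choice of multimodal embedding model for mixed-data retrieval and maintains full auditability through explicit tool-call traces.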
📄 Abstract
Recent advances in multimodal Retrieval-Augmented Generation (RAG) enable Large Language Models (LLMs) to analyze enterprise spreadsheet workbooks containing millions of cells, cross-sheet dependencies, and embedded visual artifacts. However, state-of-the-art approaches exclude critical context through single-pass retrieval, lose data resolution through compression, and exceed LLM context windows through naive full-context injection, preventing reliable multi-step reasoning over complex enterprise workbooks. We introduce Beyond Rows to Reasoning (BRTR), a multimodal agentic framework for spreadsheet understanding that replaces single-pass retrieval with an iterative tool-calling loop, supporting end-to-end Excel workflows from complex analysis to structured editing. Supported by over 200 hours of expert human evaluation, BRTR achieves state-of-the-art performance across three frontier spreadsheet understanding benchmarks, surpassing prior methods by 25 percentage points on FRTR-Bench, 7 points on SpreadsheetLLM, and 32 points on FINCH. We evaluate five multimodal embedding models, identifying NVIDIA NeMo Retriever 1B as the top performer for mixed tabular and visual data, and vary nine LLMs. Ablation experiments confirm that the planner, retrieval, and iterative reasoning each contribute substantially, and cost analysis shows GPT-5.2 achieves the best efficiency-accuracy trade-off. Throughout all evaluations, BRTR maintains full auditability through explicit tool-call traces.
cs.AI [Back]
[120] RoboLayout: Differentiable 3D Scene Generation for Embodied Agents
Ali Shamsaddinlou
🧩 TL;DR
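This paper presents RoboLayout, a framework that extends LayoutVLM with explicit reachability constraints and a local refinement stage to generate indoor scene layouts suited to interaction by embodied agents while preserving semantic alignment and physical plausibility.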
📘 Detailed Summary
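Motivation: Vision language models have advanced spatial reasoning and 3D scene layout generation, but the generated layouts often ignore whether embodied agents can actually interact with them. In physically constrained indoor environments in particular, it is hard to ensure that a layout is both semantically coherent and practically navigable and actionable by an agent.
Method: RoboLayout extends the LayoutVLM framework by integrating explicit reachability constraints into a differentiable layout optimization process. It supports diverse embodied agent abstractions (service robots, warehouse robots, humans of different age groups, or animals) and adds a local refinement stage that selectively reoptimizes problematic object placements without increasing global optimization iterations.
Result: Experiments show that RoboLayout preserves the strong semantic alignment and physical plausibility of LayoutVLM while enhancing applicability to agent-centric indoor scene generation across diverse scene configurations, with the local refinement stage improving convergence efficiency.
Conclusion: The study shows the value of integrating embodied-agent constraints into scene layout generation. RoboLayout offers a workable way to tailor environment design to the capabilities of a specific agent, shifting scene generation from semantic plausibility toward interaction feasibility.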
📄 Abstract
Recent advances in vision language models (VLMs) have shown strong potential for spatial reasoning and 3D scene layout generation from open-ended language instructions. However, generating layouts that are not only semantically coherent but also feasible for interaction by embodied agents remains challenging, particularly in physically constrained indoor environments. In this paper, RoboLayout is introduced as an extension of LayoutVLM that augments the original framework with agent-aware reasoning and improved optimization stability. RoboLayout integrates explicit reachability constraints into a differentiable layout optimization process, enabling the generation of layouts that are navigable and actionable by embodied agents. Importantly, the agent abstraction is not limited to a specific robot platform and can represent diverse entities with distinct physical capabilities, such as service robots, warehouse robots, humans of different age groups, or animals, allowing environment design to be tailored to the intended agent. In addition, a local refinement stage is proposed that selectively reoptimizes problematic object placements while keeping the remainder of the scene fixed, improving convergence efficiency without increasing global optimization iterations. Overall, RoboLayout preserves the strong semantic alignment and physical plausibility of LayoutVLM while enhancing applicability to agent-centric indoor scene generation, as demonstrated by experimental results across diverse scene configurations.
[121] Real-Time AI Service Economy: A Framework for Agentic Computing Across the Continuum
Lauri Lovén, Alaa Saleh, Reza Farahani, Ilir Murturi, Miguel Bordallo López, Praveen Kumar Donta, Schahram Dustdar
🧩 TL;DR
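This paper shows that the topology of service-dependency graphs is decisive for the scalability of decentralised, price-based resource allocation. It proposes a hybrid management architecture that encapsulates complex sub-graphs into well-structured resource slices, achieving price stability and efficient allocation while retaining the benefits of decentralisation.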
📘 Detailed Summary
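Motivation: Real-time AI services run across the device-edge-cloud continuum, where autonomous AI agents generate latency-sensitive workloads, orchestrate multi-stage processing pipelines, and compete for shared resources under policy and governance constraints. The work addresses the management problem that complex service-dependency graphs make decentralised, price-based resource allocation unstable and hard to scale, and examines how topology affects the effectiveness of market mechanisms.
Method: Service dependencies are modelled as DAGs whose nodes represent compute stages and whose edges encode execution ordering. A hybrid management architecture is proposed in which cross-domain integrators encapsulate complex sub-graphs into resource slices that present a well-structured interface to the rest of the market. Under a mechanism design with quasilinear utilities and discrete slice items, agents have no incentive to misreport their valuations within each decision epoch.
Result: Six experiments totalling 1,620 runs show that dependency-graph topology is a first-order determinant of price stability and scalability; the hybrid architecture reduces price volatility by up to 70-75% without sacrificing throughput; governance constraints create quantifiable efficiency-compliance trade-offs; and under truthful bidding the decentralised market matches a centralised value-optimal baseline. With hierarchical structures (tree or series-parallel), prices converge to stable equilibria and optimal allocations can be computed efficiently.
Conclusion: The topology of the service-dependency graph is a key factor in whether decentralised resource allocation scales reliably: hierarchical structures support stable price equilibria and efficient optimisation, whereas complex cross-cutting dependencies cause price oscillation and degraded allocation quality. By encapsulating complex sub-graphs into well-structured slices, the hybrid architecture bridges this gap, allowing decentralised coordination to replicate centralised allocation quality and offering a scalable approach to managing AI services across the continuum.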
📄 Abstract
Real-time AI services increasingly operate across the device-edge-cloud continuum, where autonomous AI agents generate latency-sensitive workloads, orchestrate multi-stage processing pipelines, and compete for shared resources under policy and governance constraints. This article shows that the structure of service-dependency graphs, modelled as DAGs whose nodes represent compute stages and whose edges encode execution ordering, is a primary determinant of whether decentralised, price-based resource allocation can work reliably at scale. When dependency graphs are hierarchical (tree or series-parallel), prices converge to stable equilibria, optimal allocations can be computed efficiently, and under appropriate mechanism design (with quasilinear utilities and discrete slice items), agents have no incentive to misreport their valuations within each decision epoch. When dependencies are more complex, with cross-cutting ties between pipeline stages, prices oscillate, allocation quality degrades, and the system becomes difficult to manage. To bridge this gap, we propose a hybrid management architecture in which cross-domain integrators encapsulate complex sub-graphs into resource slices that present a simpler, well-structured interface to the rest of the market. A systematic ablation study across six experiments (1,620 runs, 10 seeds each) confirms that (i) dependency-graph topology is a first-order determinant of price stability and scalability, (ii) the hybrid architecture reduces price volatility by up to 70-75% without sacrificing throughput, (iii) governance constraints create quantifiable efficiency-compliance trade-offs that depend jointly on topology and load, and (iv) under truthful bidding the decentralised market matches a centralised value-optimal baseline, confirming that decentralised coordination can replicate centralised allocation quality.
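To make the graph model concrete, the following sketch represents a service-dependency graph as a list of `(upstream, downstream)` stage pairs and runs two basic structural checks. The stage names and both helper functions are hypothetical. The second check covers only the tree case: a stage with several upstream dependencies rules out a tree, but the graph may still be series-parallel, which would need a separate test not shown here.

```python
from collections import defaultdict

def is_dag(edges):
    """Kahn's algorithm: True if the edge list contains no directed cycle."""
    nodes = {n for edge in edges for n in edge}
    indegree = {n: 0 for n in nodes}
    successors = defaultdict(list)
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    visited = 0
    while ready:
        u = ready.pop()
        visited += 1
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    return visited == len(nodes)

def multi_parent_stages(edges):
    """Stages with more than one upstream dependency (so the graph is not a tree)."""
    parents = defaultdict(set)
    for u, v in edges:
        parents[v].add(u)
    return sorted(v for v, ps in parents.items() if len(ps) > 1)

# Hypothetical pipeline: one ingest stage fanning out to two branches.
tree = [("ingest", "detect"), ("ingest", "transcribe"), ("detect", "alert")]
merged = tree + [("transcribe", "alert")]
```

On these inputs `multi_parent_stages(tree)` is empty, while `multi_parent_stages(merged)` reports `alert` as a merge point that an integrator might need to inspect.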
[122] Reasoning Models Struggle to Control their Chains of Thought
Chen Yueh-Han, Robert McCarthy, Bruce W. Lee, He He, Ian Kivlichan, Bowen Baker, Micah Carroll, Tomek Korbak
🧩 TL;DR
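This study introduces the CoT-Control evaluation suite to quantify how well reasoning models can control the content of their chains of thought. It finds that current models control their CoT far less reliably than their final outputs, providing empirical support for the reliability of CoT monitoring.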
📘 Detailed Summary
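Motivation: Chain-of-thought (CoT) monitoring is an important tool for detecting misbehaviors and understanding the motivations of modern reasoning models, but if models can control what they verbalize in their CoT, CoT monitorability could be undermined. The study measures this undesirable capability, CoT controllability, to assess the risk that CoT monitoring fails.
Method: The study introduces the CoT-Control evaluation suite, whose tasks require models to solve problems while following specific CoT instructions, for example reasoning about a genetics question without using the word 'chromosome'. A variety of such constraints test the models' ability to control CoT content, and the study examines how different factors affect controllability.
Result: Reasoning models show significantly lower CoT controllability than output controllability: Claude Sonnet 4.5, for example, controls its CoT only 2.7% of the time, versus 61.9% for its final output. Controllability is higher for larger models and decreases with more RL training, more test-time compute, and increased problem difficulty. Models also struggle to control their CoT when given incentives rather than direct requests.
Conclusion: The results leave the authors cautiously optimistic that CoT controllability is currently unlikely to be a failure mode of CoT monitorability, although the mechanism behind low controllability is not well understood. Given its importance for maintaining CoT monitorability, they recommend that frontier labs keep tracking CoT controllability in future models so that monitoring stays reliable over time.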
📄 Abstract
Chain-of-thought (CoT) monitoring is a promising tool for detecting misbehaviors and understanding the motivations of modern reasoning models. However, if models can control what they verbalize in their CoT, it could undermine CoT monitorability. To measure this undesirable capability -- CoT controllability -- we introduce the CoT-Control evaluation suite, which includes tasks that require models to solve problems while adhering to CoT instructions, e.g., reasoning about a genetics question without using the word 'chromosome'. We show that reasoning models possess significantly lower CoT controllability than output controllability; for instance, Claude Sonnet 4.5 can control its CoT only 2.7% of the time but 61.9% when controlling its final output. We also find that CoT controllability is higher for larger models and decreases with more RL training, test-time compute, and increased problem difficulty. CoT controllability failures extend even to situations in which models are given incentives (as opposed to direct requests) to evade CoT monitors, although models exhibit slightly higher controllability when they are told they are being monitored. Similarly, eliciting controllability by adversarially optimizing prompts does not meaningfully increase controllability. Our results leave us cautiously optimistic that CoT controllability is currently unlikely to be a failure mode of CoT monitorability. However, the mechanism behind low controllability is not well understood. Given its importance for maintaining CoT monitorability, we recommend that frontier labs track CoT controllability in future models.
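A rough sketch of how such a controllability rate could be computed for the banned-word style of instruction follows. The sample format (dicts with `cot` and `output` keys) and the case-insensitive substring check are assumptions; the abstract does not describe how CoT-Control actually grades compliance.

```python
def obeys_ban(text, banned_word):
    """True if `banned_word` does not occur in `text` (case-insensitive substring)."""
    return banned_word.lower() not in text.lower()

def controllability(samples, banned_word):
    """Return (CoT compliance rate, output compliance rate) over `samples`."""
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one sample")
    cot_ok = sum(obeys_ban(s["cot"], banned_word) for s in samples)
    out_ok = sum(obeys_ban(s["output"], banned_word) for s in samples)
    return cot_ok / n, out_ok / n

# Hypothetical transcripts for a genetics question.
samples = [
    {"cot": "Humans have 23 chromosome pairs, so...", "output": "46"},
    {"cot": "Count the paired DNA packages, 23 pairs...", "output": "46"},
]
```

Reporting the two rates side by side mirrors the comparison in the abstract, where the CoT rate and the final-output rate are measured against the same kind of instruction.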
[123] Evolving Medical Imaging Agents via Experience-driven Self-skill Discovery
Lin Fan, Pengyu Dai, Zhipeng Deng, Haolin Wang, Xun Gong, Yefeng Zheng, Yafei Ou
🧩 TL;DR
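This paper proposes MACRO, a self-evolving, experience-augmented medical agent that autonomously discovers and synthesizes reusable composite tools. It shifts from static tool composition to experience-driven tool discovery, improving both the accuracy of multi-step medical image interpretation and cross-domain generalization.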
📘 Detailed Summary
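Motivation: Clinical image interpretation is inherently multi-step and tool-centric, yet existing LLM-based medical agents treat tool sets and invocation strategies as static after deployment. This design is brittle under real-world domain shifts, across tasks, and under evolving diagnostic requirements, where predefined tool chains frequently degrade and demand costly manual re-design.
Method: From verified execution trajectories, MACRO autonomously identifies recurring effective multi-step tool sequences, synthesizes them into reusable composite tools, and registers them as new high-level primitives that continuously expand its behavioral repertoire. A lightweight image-feature memory grounds tool selection in visual-clinical context, and a GRPO-like training loop reinforces reliable invocation of the discovered composites, enabling closed-loop self-improvement.
Result: Extensive experiments across diverse medical imaging datasets and tasks show that autonomous composite tool discovery consistently improves multi-step orchestration accuracy and cross-domain generalization over strong baselines and recent state-of-the-art agentic methods, bridging the gap between brittle static tool use and adaptive, context-aware clinical AI assistance.
Conclusion: The study demonstrates the value of shifting from static tool composition to experience-driven tool discovery. By autonomously synthesizing and registering composite tools, the agent keeps expanding its behavior, offering a viable path toward adaptable medical AI agents that can cope with real-world clinical domain shifts and evolving requirements.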
📄 Abstract
Clinical image interpretation is inherently multi-step and tool-centric: clinicians iteratively combine visual evidence with patient context, quantify findings, and refine their decisions through a sequence of specialized procedures. While LLM-based agents promise to orchestrate such heterogeneous medical tools, existing systems treat tool sets and invocation strategies as static after deployment. This design is brittle under real-world domain shifts, across tasks, and evolving diagnostic requirements, where predefined tool chains frequently degrade and demand costly manual re-design. We propose MACRO, a self-evolving, experience-augmented medical agent that shifts from static tool composition to experience-driven tool discovery. From verified execution trajectories, the agent autonomously identifies recurring effective multi-step tool sequences, synthesizes them into reusable composite tools, and registers these as new high-level primitives that continuously expand its behavioral repertoire. A lightweight image-feature memory grounds tool selection in a visual-clinical context, while a GRPO-like training loop reinforces reliable invocation of discovered composites, enabling closed-loop self-improvement with minimal supervision. Extensive experiments across diverse medical imaging datasets and tasks demonstrate that autonomous composite tool discovery consistently improves multi-step orchestration accuracy and cross-domain generalization over strong baselines and recent state-of-the-art agentic methods, bridging the gap between brittle static tool use and adaptive, context-aware clinical AI assistance. Code will be available upon acceptance.
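The discovery step can be pictured as frequent-subsequence mining over verified trajectories. The sketch below is only an illustration under that assumption: it counts contiguous tool sequences of a fixed length and registers those that recur. The tool names, the `mine_composites` and `register` helpers, and the support threshold are hypothetical, and MACRO's actual selection criterion is not given in the abstract.

```python
from collections import Counter

def mine_composites(trajectories, length=2, min_support=2):
    """Contiguous tool sequences that occur in at least `min_support` trajectories."""
    support = Counter()
    for tools in trajectories:
        # A set, so each sequence counts once per trajectory.
        grams = {tuple(tools[i:i + length])
                 for i in range(len(tools) - length + 1)}
        support.update(grams)
    return {seq: n for seq, n in support.items() if n >= min_support}

def register(registry, seq):
    """Store a composite under a joined name; `registry` maps name -> tool steps."""
    name = "+".join(seq)
    registry[name] = list(seq)
    return name

# Hypothetical verified trajectories (each a list of tool calls in order).
verified = [
    ["segment", "measure", "report"],
    ["classify", "segment", "measure"],
    ["segment", "report"],
]
```

Here only `("segment", "measure")` appears in two trajectories, so it is the single candidate that would be registered as a new high-level primitive.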
[124] The World Won't Stay Still: Programmable Evolution for Agent Benchmarks
Guangrui Li, Yaochen Xie, Yi Liu, Ziwei Dong, Xingyuan Pan, Tianqi Zheng, Jason Choi, Michael J. Morais, Binit Jha, Shaunak Mishra, Bingrou Zhou, Chen Luo, Monica Xiao Cheng, Dawn Song
🧩 TL;DR
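This paper proposes ProEvolve, a graph-based framework that makes the evolution of LLM agent environments programmable, in order to better evaluate agents' adaptability to real-world dynamics. Adding, removing, and modifying environment capabilities are expressed as graph transformations, and evolved environments and task sandboxes are generated automatically.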
📘 Detailed Summary
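Motivation: Most existing benchmarks assume static environments with fixed schemas and toolsets, neglecting the evolutionary nature of the real world and agents' robustness to environmental changes. The paper studies how to evolve the agent environment in a scalable and controllable way, thereby better evaluating agents' adaptability to real-world dynamics.
Method: ProEvolve is a graph-based framework that makes environment evolution programmable. At its core, a typed relational graph provides a unified, explicit representation of the environment's data, tools, and schema. Under this formalism, adding, removing, or modifying a capability is expressed as a graph transformation that coherently propagates updates across tools, schemas, and data access. Building on this, ProEvolve programs the evolutionary dynamics as graph transformations to generate environments automatically, and instantiates task sandboxes via subgraph sampling and programming.
Result: ProEvolve is validated by evolving a single environment into 200 environments and 3,000 task sandboxes and benchmarking representative agents on them. The framework generates diverse evolved environments under controlled conditions, providing a scalable testbed for evaluating agent adaptability.
Conclusion: ProEvolve offers a systematic framework for evaluating how well LLM agents adapt to dynamic environments, filling a gap left by static benchmarks. The study highlights the importance of environment evolution in agent evaluation and provides a methodological basis for developing more robust agents. Its graph-based representation and transformations give environment evolution a scalable, controllable programming paradigm.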
📄 Abstract
LLM-powered agents fulfill user requests by interacting with environments, querying data, and invoking tools in a multi-turn process. Yet, most existing benchmarks assume static environments with fixed schemas and toolsets, neglecting the evolutionary nature of the real world and agents' robustness to environmental changes. In this paper, we study a crucial problem: how to evolve the agent environment in a scalable and controllable way, thereby better evaluating agents' adaptability to real-world dynamics. We propose ProEvolve, a graph-based framework that makes environment evolution programmable. At its core, a typed relational graph provides a unified, explicit representation of the environment: data, tools, and schema. Under this formalism, adding, removing, or modifying capabilities are expressed as graph transformations that coherently propagate updates across tools, schemas, and data access. Building on this, ProEvolve can (1) program the evolutionary dynamics as graph transformations to generate environments automatically, and (2) instantiate task sandboxes via subgraph sampling and programming. We validate ProEvolve by evolving a single environment into 200 environments and 3,000 task sandboxes, and benchmark representative agents accordingly.
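A toy version of the typed-graph idea is sketched below. The dictionary layout, the node types `table` and `tool`, the relations `reads` and `writes`, and both functions are assumptions chosen for illustration, not ProEvolve's data model. Removing a table also removes the tools that depended on it, which is one simple way an update could propagate coherently; sandbox sampling is reduced to a one-hop induced subgraph.

```python
def remove_capability(graph, target):
    """Return a new graph without `target` and without tools that read or write it."""
    dependents = {src for src, rel, dst in graph["edges"]
                  if dst == target and rel in ("reads", "writes")
                  and graph["nodes"][src] == "tool"}
    dropped = dependents | {target}
    nodes = {n: t for n, t in graph["nodes"].items() if n not in dropped}
    edges = {(s, r, d) for s, r, d in graph["edges"]
             if s not in dropped and d not in dropped}
    return {"nodes": nodes, "edges": edges}

def sample_sandbox(graph, seeds):
    """Subgraph induced by the seed nodes and their direct (one-hop) targets."""
    keep = set(seeds) | {d for s, _, d in graph["edges"] if s in seeds}
    nodes = {n: t for n, t in graph["nodes"].items() if n in keep}
    edges = {(s, r, d) for s, r, d in graph["edges"] if s in keep and d in keep}
    return {"nodes": nodes, "edges": edges}

# Hypothetical retail environment: two tables and two tools.
env = {
    "nodes": {"orders": "table", "refunds": "table",
              "get_order": "tool", "issue_refund": "tool"},
    "edges": {("get_order", "reads", "orders"),
              ("issue_refund", "reads", "orders"),
              ("issue_refund", "writes", "refunds")},
}
```

Calling `remove_capability(env, "refunds")` leaves only `orders` and `get_order`, since `issue_refund` can no longer write its target. Both functions return new graphs, so a sequence of transformations can be replayed from the same starting environment.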
[125] An Interactive Multi-Agent System for Evaluation of New Product Concepts
Bin Xuan, Ruo Ai, Hakyeon Lee
🧩 TL;DR
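This study proposes an LLM-based multi-agent system that automates product concept evaluation. A team of virtual experts deliberates in a structured way on technical and market feasibility, and the resulting evaluations are consistent with those of senior industry experts.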
📘 Detailed Summary
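Motivation: Traditional expert-led product concept evaluation suffers from subjective bias and high time and cost requirements. The study aims to develop an automated evaluation approach that supports strategic resource allocation and project decisions in enterprises.
Method: The study builds a multi-agent system of eight virtual agents representing specialized domains such as R&D and marketing. The agents use retrieval-augmented generation (RAG) and real-time search tools to gather objective evidence and conduct structured deliberations along two dimensions, technical feasibility and market feasibility. They are further fine-tuned on professional product review data to improve judgment accuracy.
Result: In a case study on professional display monitor concepts, the system's evaluation rankings were consistent with those of senior industry experts, supporting the usability and effectiveness of the proposed multi-agent evaluation approach.
Conclusion: The study confirms that a multi-agent automated evaluation approach can effectively support product development decisions, offering an objective and efficient alternative to traditional expert evaluation and showing the practical potential of LLM multi-agent systems for complex decision support.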
📄 Abstract
Product concept evaluation is a critical stage that determines strategic resource allocation and project success in enterprises. However, traditional expert-led approaches face limitations such as subjective bias and high time and cost requirements. To support this process, this study proposes an automated approach utilizing a large language model (LLM)-based multi-agent system (MAS). Through a systematic analysis of previous research on product development and team collaboration, this study established two primary evaluation dimensions, namely technical feasibility and market feasibility. The proposed system consists of a team of eight virtual agents representing specialized domains such as R&D and marketing. These agents use retrieval-augmented generation (RAG) and real-time search tools to gather objective evidence and validate concepts through structured deliberations based on the established criteria. The agents were further fine-tuned using professional product review data to enhance their judgment accuracy. A case study involving professional display monitor concepts demonstrated that the system's evaluation rankings were consistent with those of senior industry experts. These results confirm the usability of the proposed multi-agent-based evaluation approach for supporting product development decisions.
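The abstract does not say how ranking consistency was measured. One common choice for comparing two rankings of the same items is Spearman's rank correlation, sketched below for the no-ties case; the concept names and rank values are invented purely for illustration.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation for two rankings of the same items (no ties)."""
    items = list(rank_a)
    n = len(items)
    if n < 2 or set(items) != set(rank_b):
        raise ValueError("rankings must cover the same two or more items")
    d2 = sum((rank_a[i] - rank_b[i]) ** 2 for i in items)
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical ranks (1 = best) for four monitor concepts.
system = {"concept_a": 1, "concept_b": 2, "concept_c": 3, "concept_d": 4}
experts = {"concept_a": 2, "concept_b": 1, "concept_c": 3, "concept_d": 4}
```

A value of 1 means identical rankings and -1 a fully reversed order; the example pair differs by a single swap at the top.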
[126] The EpisTwin: A Knowledge Graph-Grounded Neuro-Symbolic Architecture for Personal AI
Giovanni Servedio, Potito Aghilar, Alessio Mattiace, Gianni Carmosino, Francesco Musicco, Gabriele Conte, Vito Walter Anelli, Tommaso Di Noia, Francesco Maria Donini
🧩 TL;DR
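This paper proposes EpisTwin, a neuro-symbolic framework that strengthens the reasoning of personal AI by building a verifiable Personal Knowledge Graph, addressing the fragmentation of user data and the limits of conventional Retrieval-Augmented Generation in modelling semantic topology and temporal dependencies.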
📘 Detailed Summary
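Motivation: Personal AI is currently hindered by user data being scattered across isolated silos. Conventional Retrieval-Augmented Generation relies on unstructured vector similarity and fails to capture the latent semantic topology and temporal dependencies that are essential for holistic sensemaking.
Method: EpisTwin is a neuro-symbolic framework that uses Multimodal Language Models to lift heterogeneous, cross-application data into semantic triples, building a user-centric Personal Knowledge Graph. At inference, an agentic coordinator combines Graph Retrieval-Augmented Generation with Online Deep Visual Refinement, dynamically re-grounding symbolic entities in their raw visual context.
Result: The study introduces PersonalQA-71-100, a synthetic benchmark that simulates a realistic user's digital footprint and evaluates EpisTwin's performance. The framework shows robust results across a suite of state-of-the-art judge models.
Conclusion: EpisTwin offers a promising direction for trustworthy Personal AI. By grounding generative reasoning in a verifiable Personal Knowledge Graph, it addresses data fragmentation and insufficient semantic understanding and advances the use of neuro-symbolic methods in personal AI.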
📄 Abstract
Personal Artificial Intelligence is currently hindered by the fragmentation of user data across isolated silos. While Retrieval-Augmented Generation offers a partial remedy, its reliance on unstructured vector similarity fails to capture the latent semantic topology and temporal dependencies essential for holistic sensemaking. We introduce EpisTwin, a neuro-symbolic framework that grounds generative reasoning in a verifiable, user-centric Personal Knowledge Graph. EpisTwin leverages Multimodal Language Models to lift heterogeneous, cross-application data into semantic triples. At inference, EpisTwin enables complex reasoning over the personal semantic graph via an agentic coordinator that combines Graph Retrieval-Augmented Generation with Online Deep Visual Refinement, dynamically re-grounding symbolic entities in their raw visual context. We also introduce PersonalQA-71-100, a synthetic benchmark designed to simulate a realistic user's digital footprint and evaluate EpisTwin performance. Our framework demonstrates robust results across a suite of state-of-the-art judge models, offering a promising direction for trustworthy Personal AI.
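The contrast with vector-similarity retrieval can be illustrated by a graph lookup that follows explicit links. The sketch below assumes triples carry an ISO date as a fourth field and retrieves everything within a few hops of an entity, in time order. The triple layout, the example facts, and the `neighbourhood` function are hypothetical and are not EpisTwin's implementation.

```python
from collections import deque

def neighbourhood(triples, start, hops=2):
    """Triples within `hops` undirected steps of `start`, ordered by date."""
    frontier = deque([(start, 0)])
    seen = {start}
    hits = []
    while frontier:
        entity, depth = frontier.popleft()
        if depth == hops:
            continue  # do not expand beyond the hop limit
        for triple in triples:
            subj, _, obj, _ = triple
            if entity not in (subj, obj):
                continue
            if triple not in hits:
                hits.append(triple)
            other = obj if entity == subj else subj
            if other not in seen:
                seen.add(other)
                frontier.append((other, depth + 1))
    return sorted(hits, key=lambda t: (t[3], t[0]))

# Hypothetical (subject, predicate, object, ISO date) facts from different apps.
facts = [
    ("alice", "attended", "dinner", "2024-05-01"),
    ("dinner", "located_at", "trattoria", "2024-05-01"),
    ("photo_17", "depicts", "dinner", "2024-05-02"),
    ("bob", "owns", "bike", "2024-04-01"),
]
```

With `hops=2`, a query about `alice` reaches the venue and the photo through the shared `dinner` node even though neither triple mentions her, while the unrelated `bob` fact is left out.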
[127] SAHOO: Safeguarded Alignment for High-Order Optimization Objectives in Recursive Self-Improvement
Subramanyam Sahoo, Aman Chadha, Vinija Jain, Divya Chaudhary
🧩 TL;DR
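This paper proposes SAHOO, a framework for monitoring and controlling alignment drift during recursive self-improvement. Using three safeguards (a multi-signal detector, constraint preservation checks, and regression-risk quantification), it achieves substantial quality gains on code generation, mathematical reasoning, and truthfulness tasks while preserving alignment constraints.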
📘 Detailed Summary
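Motivation: As recursive self-improvement moves from theory to practice, modern systems can critique, revise, and evaluate their own outputs, yet iterative self-modification risks subtle alignment drift. A practical framework for monitoring and control is needed to preserve alignment throughout the improvement process.
Method: SAHOO comprises three safeguards: the Goal Drift Index (GDI), a learned multi-signal detector combining semantic, lexical, structural, and distributional measures; constraint preservation checks that enforce safety-critical invariants such as syntactic correctness and non-hallucination; and regression-risk quantification that flags improvement cycles that undo prior gains.
Result: Across 189 tasks in code generation, mathematical reasoning, and truthfulness, SAHOO produces substantial quality gains, including an 18.3% improvement on code tasks and 16.8% on reasoning, while preserving constraints in two domains and maintaining low violations in truthfulness. Thresholds are calibrated on a validation set of 18 tasks.
Conclusion: The study maps the capability-alignment frontier, showing efficient early improvement cycles but rising alignment costs later, and exposes domain-specific tensions such as fluency versus factuality. It makes alignment preservation during recursive self-improvement measurable, deployable, and systematically validated.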
📄 Abstract
Recursive self-improvement is moving from theory to practice: modern systems can critique, revise, and evaluate their own outputs, yet iterative self-modification risks subtle alignment drift. We introduce SAHOO, a practical framework to monitor and control drift through three safeguards: (i) the Goal Drift Index (GDI), a learned multi-signal detector combining semantic, lexical, structural, and distributional measures; (ii) constraint preservation checks that enforce safety-critical invariants such as syntactic correctness and non-hallucination; and (iii) regression-risk quantification to flag improvement cycles that undo prior gains. Across 189 tasks in code generation, mathematical reasoning, and truthfulness, SAHOO produces substantial quality gains, including 18.3 percent improvement in code tasks and 16.8 percent in reasoning, while preserving constraints in two domains and maintaining low violations in truthfulness. Thresholds are calibrated on a small validation set of 18 tasks across three cycles. We further map the capability-alignment frontier, showing efficient early improvement cycles but rising alignment costs later and exposing domain-specific tensions such as fluency versus factuality. SAHOO therefore makes alignment preservation during recursive self-improvement measurable, deployable, and systematically validated at scale.
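The three safeguards can be pictured as a gate applied to each self-improvement cycle. The sketch below is a simplification under stated assumptions: the GDI is replaced by a fixed weighted mean of four drift signals in [0, 1] (the paper describes a learned detector), the constraint check covers only Python syntax via the standard `ast` module, and the threshold `gdi_max` and tolerance `tol` are placeholder values rather than the calibrated ones.

```python
import ast

def goal_drift_index(signals, weights):
    """Weighted mean of drift signals in [0, 1]; weights are placeholders, not learned."""
    total = sum(weights.values())
    return sum(weights[k] * signals[k] for k in weights) / total

def syntactically_valid(code):
    """Constraint check for code tasks: does the candidate parse as Python?"""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def accept_cycle(signals, weights, code, score, best_score,
                 gdi_max=0.3, tol=0.0):
    """Accept a cycle only if the drift, constraint, and regression checks all pass."""
    if goal_drift_index(signals, weights) > gdi_max:
        return False, "drift"
    if not syntactically_valid(code):
        return False, "constraint"
    if score < best_score - tol:
        return False, "regression"
    return True, "ok"

# Hypothetical equal weighting of the four signal families named in the abstract.
weights = {"semantic": 1.0, "lexical": 1.0, "structural": 1.0, "distributional": 1.0}
```

Returning the name of the failed check alongside the decision makes it easy to log which safeguard rejected a cycle, which is the kind of bookkeeping a capability-alignment analysis would need.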