Table of Contents
cs.CV
[1] RS-Net: Context-Aware Relation Scoring for Dynamic Scene Graph Generation
Hae-Won Jo, Yeong-Jun Cho
🧩 TL;DR
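This paper proposes the Relation Scoring Network (RS-Net), a modular framework for dynamic scene graph generation that improves relation prediction by scoring the contextual importance of object pairs from spatial and temporal context. The framework integrates into existing DSGG models without architectural changes and consistently improves Recall and Precision on the Action Genome dataset.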
📘 Detailed Summary
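Motivation: Existing dynamic scene graph generation methods are trained only on annotated object pairs and receive no guidance for non-related pairs, which makes it difficult to identify meaningful relations at inference time.
Method: RS-Net consists of a spatial context encoder and a temporal encoder. The spatial encoder uses learnable context tokens to capture spatial interactions, and the temporal encoder aggregates video-level information. The resulting relation scores are folded into relation prediction through a unified triplet scoring mechanism.
Result: Experiments on the Action Genome dataset show that RS-Net consistently improves Recall and Precision across different baselines, with notable gains in mean Recall, which indicates that it helps with the long-tailed distribution of relations. Despite the added parameters, it maintains competitive efficiency.
Conclusion: RS-Net shows that scoring the contextual importance of object pairs improves dynamic scene graph generation. Its modular design lets it plug into existing methods and offers an effective way to address the long-tailed distribution in relation prediction.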
📄 Abstract
Dynamic Scene Graph Generation (DSGG) models how object relations evolve over time in videos. However, existing methods are trained only on annotated object pairs and lack guidance for non-related pairs, making it difficult to identify meaningful relations during inference. In this paper, we propose Relation Scoring Network (RS-Net), a modular framework that scores the contextual importance of object pairs using both spatial interactions and long-range temporal context. RS-Net consists of a spatial context encoder with learnable context tokens and a temporal encoder that aggregates video-level information. The resulting relation scores are integrated into a unified triplet scoring mechanism to enhance relation prediction. RS-Net can be easily integrated into existing DSGG models without architectural changes. Experiments on the Action Genome dataset show that RS-Net consistently improves both Recall and Precision across diverse baselines, with notable gains in mean Recall, highlighting its ability to address the long-tailed distribution of relations. Despite the increased number of parameters, RS-Net maintains competitive efficiency, achieving superior performance over state-of-the-art methods.
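As a rough illustration of how a pair-level relation score might enter triplet scoring, the sketch below multiplies it into the product of subject, object, and predicate confidences before ranking. The multiplicative form and the names `triplet_score` and `rank_triplets` are assumptions made for this example; the abstract does not state the exact formula RS-Net uses.

```python
# Hypothetical unified triplet scoring: detector confidences scaled by a
# pair-level relation score. The multiplicative form is an assumption.

def triplet_score(subj_conf, obj_conf, pred_conf, relation_score):
    """Combine per-component confidences with the pair's relation score."""
    return subj_conf * obj_conf * pred_conf * relation_score

def rank_triplets(candidates):
    """Sort (label, subj, obj, pred, relation_score) tuples by unified score."""
    scored = [(label, triplet_score(s, o, p, r)) for label, s, o, p, r in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)

candidates = [
    # A weakly related pair with confident detections ...
    ("person-touching-sofa", 0.9, 0.9, 0.8, 0.1),
    # ... and a strongly related pair with slightly weaker detections.
    ("person-holding-cup", 0.9, 0.8, 0.7, 0.9),
]
ranked = rank_triplets(candidates)
for label, score in ranked:
    print(f"{label}: {score:.4f}")
```

In this toy case the low relation score pushes the confidently detected but unrelated pair below the related one, which is the behavior the abstract attributes to relation scoring.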
[2] Privacy Beyond Pixels: Latent Anonymization for Privacy-Preserving Video Understanding
Joseph Fioresi, Ishan Rajendrakumar Dave, Mubarak Shah
🧩 TL;DR
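This paper proposes a visual privacy preservation method that operates in the latent space: a lightweight Anonymizing Adapter Module (AAM) removes private information from video features while retaining general task utility. It achieves a 35% reduction in privacy leakage while keeping near-baseline performance across multiple downstream tasks.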
📘 Detailed Summary
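Motivation: Spatio-temporal features extracted by video foundation models can inadvertently reveal sensitive personal information, such as skin color, gender, or clothing, when they are shared or stored. Existing privacy preservation methods focus on input-pixel-level anonymization, which requires retraining the entire utility video model and yields task-specific anonymization, making them unsuitable for modern video foundation models.
Method: A lightweight Anonymizing Adapter Module (AAM) is applied in a plug-and-play fashion to frozen video encoders to remove private information. The framework uses three newly designed training objectives: a clip-level self-supervised privacy objective that reduces mutual information between static clips, a co-training objective that retains utility on seen tasks, and a latent consistency loss for generalization to unseen tasks.
Result: Extensive evaluation shows a 35% reduction in privacy leakage while maintaining near-baseline performance across downstream tasks, including action recognition (Kinetics400, UCF101, HMDB51), temporal action detection (THUMOS14), and anomaly detection (UCF-Crime). The method also mitigates gender bias in action recognition models.
Conclusion: The study demonstrates that privacy preservation in the latent space is feasible and offers an efficient, general solution for video foundation models. The method reduces privacy leakage, promotes more equitable video understanding, and points to a new direction for video privacy research.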
📄 Abstract
We introduce a novel formulation of visual privacy preservation for video foundation models that operates entirely in the latent space. While spatio-temporal features learned by foundation models have deepened general understanding of video content, sharing or storing these extracted visual features for downstream tasks inadvertently reveals sensitive personal information like skin color, gender, or clothing. Current privacy preservation methods focus on input-pixel-level anonymization, which requires retraining the entire utility video model and results in task-specific anonymization, making them unsuitable for recent video foundational models. To address these challenges, we introduce a lightweight Anonymizing Adapter Module (AAM) that removes private information from video features while retaining general task utility. AAM can be applied in a plug-and-play fashion to frozen video encoders, minimizing the computational burden of finetuning and re-extracting features. Our framework employs three newly designed training objectives: (1) a clip-level self-supervised privacy objective to reduce mutual information between static clips, (2) a co-training objective to retain utility across seen tasks, and (3) a latent consistency loss for generalization on unseen tasks. Our extensive evaluations demonstrate a significant 35% reduction in privacy leakage while maintaining near-baseline utility performance across various downstream tasks: Action Recognition (Kinetics400, UCF101, HMDB51), Temporal Action Detection (THUMOS14), and Anomaly Detection (UCF-Crime). We also provide an analysis on anonymization for sensitive temporal attribute recognition. Additionally, we propose new protocols for assessing gender bias in action recognition models, showing that our method effectively mitigates such biases and promotes more equitable video understanding.
[3] SIFT-Graph: Benchmarking Multimodal Defense Against Image Adversarial Attacks With Robust Feature Graph
Jingjie He, Weijie Liang, Zihan Shan, Matthew Caesar
🧩 TL;DR
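This paper proposes SIFT-Graph, a multimodal defense framework that combines Scale-Invariant Feature Transform keypoints with a Graph Attention Network to build structure-aware vision models that are robust to adversarial attacks. The method improves the robustness of traditional vision models against gradient-based white-box attacks with only a marginal drop in clean accuracy.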
📘 Detailed Summary
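Motivation: Adversarial attacks expose how fragile modern deep vision models are because of their dependence on dense pixel-level representations, which are highly sensitive to small perturbations. Traditional defense strategies typically operate within this fragile pixel domain and lack a mechanism for incorporating inherently robust visual features, which motivates a defense framework that exploits structural information.
Method: SIFT-Graph integrates Scale-Invariant Feature Transform keypoints with a Graph Attention Network to extract local structural features that are invariant to scale and rotation. These robust feature embeddings are fused with a traditional vision model to form a unified, structure-aware defense model, supporting architectures such as Vision Transformers and convolutional neural networks.
Result: Preliminary experiments show that the method improves robustness against gradient-based white-box adversarial attacks while incurring only a small drop in clean accuracy.
Conclusion: Combining handcrafted and learned features in a multimodal approach offers a new route to adversarial robustness for vision models. The structure-aware fusion strategy shows that security can be strengthened while largely preserving model performance, which provides a reference point for the design of future robust vision systems.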
📄 Abstract
Adversarial attacks expose a fundamental vulnerability in modern deep vision models by exploiting their dependence on dense, pixel-level representations that are highly sensitive to imperceptible perturbations. Traditional defense strategies typically operate within this fragile pixel domain, lacking mechanisms to incorporate inherently robust visual features. In this work, we introduce SIFT-Graph, a multimodal defense framework that enhances the robustness of traditional vision models by aggregating structurally meaningful features extracted from raw images using both handcrafted and learned modalities. Specifically, we integrate Scale-Invariant Feature Transform keypoints with a Graph Attention Network to capture scale- and rotation-invariant local structures that are resilient to perturbations. These robust feature embeddings are then fused with traditional vision models, such as Vision Transformers and Convolutional Neural Networks, to form a unified, structure-aware, and perturbation-defensive model. Preliminary results demonstrate that our method effectively improves the robustness of vision models against gradient-based white-box adversarial attacks, while incurring only a marginal drop in clean accuracy.
[4] Improve Contrastive Clustering Performance by Multiple Fusing-Augmenting ViT Blocks
Cheng Wang, Shuisheng Zhou, Fengjiao Peng, Jin Sheng, Feng Ye, Yinli Dong
🧩 TL;DR
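This paper proposes multiple fusing-augmenting ViT blocks (MFAVBs), built on Vision Transformers, which improve contrastive learning networks for image clustering by explicitly fusing the features of positive pairs. It reports state-of-the-art clustering results on seven public datasets.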
📘 Detailed Summary
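Motivation: Existing contrastive learning networks let their two encoders interact only implicitly, through parameter sharing or momentum updating, and so may not fully exploit the complementarity and similarity of positive pairs when extracting clustering features, which limits further gains in clustering performance.
Method: The multiple fusing-augmenting ViT blocks (MFAVBs) fuse the output features of two shared-weight ViTs and feed them into a larger ViT. The learned features are then split into a new pair of augmented positive samples and passed to the next FAVB, enabling repeated fusion and augmentation. Finally, the features are projected into instance-level and cluster-level spaces to compute the cross-entropy loss.
Result: Experiments on seven public datasets show that MFAVBs, used as the backbone for contrastive clustering, outperform current state-of-the-art techniques in clustering performance.
Conclusion: By explicitly fusing positive-pair features and drawing on features extracted by the pretrained CLIP model, MFAVBs learn more discriminative clustering features and offer a new technical route for contrastive learning in image clustering.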
📄 Abstract
In the field of image clustering, the widely used contrastive learning networks improve clustering performance by maximizing the similarity between positive pairs and the dissimilarity of negative pairs of the inputs. Extant contrastive learning networks, whose two encoders often implicitly interact with each other by parameter sharing or momentum updating, may not fully exploit the complementarity and similarity of the positive pairs to extract clustering features from input data. To explicitly fuse the learned features of positive pairs, we design novel multiple fusing-augmenting ViT blocks (MFAVBs) based on the excellent feature learning ability of Vision Transformers (ViT). Firstly, two preprocessed augmentations as positive pairs are separately fed into two shared-weight ViTs, then their output features are fused and input into a larger ViT. Secondly, the learned features are split into a pair of new augmented positive samples and passed to the next FAVBs, enabling multiple fusion and augmentation through MFAVBs operations. Finally, the learned features are projected into both instance-level and clustering-level spaces to calculate the cross-entropy loss, followed by parameter updates by backpropagation to finalize the training process. To further enhance the ability of the model to distinguish between similar images, the input to our network consists of preprocessed augmentations with features extracted from the CLIP pretrained model. Our experiments on seven public datasets demonstrate that MFAVBs serving as the backbone for contrastive clustering outperform the state-of-the-art techniques in terms of clustering performance.
[5] Asymmetric Cross-Modal Knowledge Distillation: Bridging Modalities with Weak Semantic Consistency
Riling Wei, Kelu Yao, Chuanguang Yang, Jin Wang, Zhuoyan Gao, Chao Li
🧩 TL;DR
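This paper proposes SemBridge, an asymmetric cross-modal knowledge distillation framework that uses a Student-Friendly Matching module and a Semantic-aware Knowledge Alignment module to address cross-modal knowledge transfer under weak semantic consistency, achieving state-of-the-art performance on remote sensing scene classification.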
📘 Detailed Summary
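Motivation: Conventional symmetric cross-modal knowledge distillation relies on paired modalities with strong semantic connections, but paired data is scarce in practice. This motivates asymmetric cross-modal knowledge distillation under weak semantic consistency, which aims to bridge modalities with limited semantic overlap.
Method: The SemBridge framework contains a Student-Friendly Matching module and a Semantic-aware Knowledge Alignment module. The former uses self-supervised learning to acquire semantic knowledge and dynamically selects relevant teacher samples to give each student sample personalized guidance. The latter seeks the optimal transport path through Lagrangian optimization.
Result: On a remote sensing scene classification benchmark built from Multi-Spectral (MS) and asymmetric RGB images, the framework achieves state-of-the-art performance compared with 7 existing approaches on 6 different model architectures across various datasets.
Conclusion: The study shows that effective cross-modal knowledge distillation is feasible under weak semantic consistency, offering a solution for practical settings where modality pairing is limited. It also uses optimal transport theory to verify that weaker semantic consistency raises knowledge transmission costs.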
📄 Abstract
Cross-modal Knowledge Distillation has demonstrated promising performance on paired modalities with strong semantic connections, referred to as Symmetric Cross-modal Knowledge Distillation (SCKD). However, implementing SCKD becomes exceedingly constrained in real-world scenarios due to the limited availability of paired modalities. To this end, we investigate a general and effective knowledge learning concept under weak semantic consistency, dubbed Asymmetric Cross-modal Knowledge Distillation (ACKD), aiming to bridge modalities with limited semantic overlap. Nevertheless, the shift from strong to weak semantic consistency improves flexibility but exacerbates challenges in knowledge transmission costs, which we rigorously verified based on optimal transport theory. To mitigate the issue, we further propose a framework, namely SemBridge, integrating a Student-Friendly Matching module and a Semantic-aware Knowledge Alignment module. The former leverages self-supervised learning to acquire semantic-based knowledge and provide personalized instruction for each student sample by dynamically selecting the relevant teacher samples. The latter seeks the optimal transport path by employing Lagrangian optimization. To facilitate the research, we curate a benchmark dataset derived from two modalities, namely Multi-Spectral (MS) and asymmetric RGB images, tailored for remote sensing scene classification. Comprehensive experiments exhibit that our framework achieves state-of-the-art performance compared with 7 existing approaches on 6 different model architectures across various datasets.
[6] LLM-Guided Probabilistic Fusion for Label-Efficient Document Layout Analysis
Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma
🧩 TL;DR
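This paper proposes a semi-supervised document layout understanding framework that fuses visual detection with structural priors from LLMs. It combines hierarchical regions inferred by an OCR-LLM pipeline with teacher detector outputs through probabilistic weighting, improving performance on both lightweight and pretrained architectures.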
📘 Detailed Summary
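Motivation: Document layout understanding remains data-intensive despite progress in semi-supervised learning, and existing methods do not make full use of the structural priors that text-pretrained large language models can provide to improve detection.
Method: The framework fuses predictions through probabilistic weighting. An OCR-LLM pipeline infers the hierarchical region structure of unlabeled documents, inverse-variance fusion combines these LLM structural priors with teacher detector outputs to produce refined pseudo-labels, and a learned instance-adaptive gating mechanism sets the fusion weights.
Result: On PubLayNet, a lightweight SwiftFormer backbone reaches 88.2 ± 0.3 AP using only 5% of the labels, and document-pretrained LayoutLMv3 combined with the framework reaches 89.7 ± 0.4 AP, surpassing standard semi-supervised learning and matching UDOP, which requires multimodal pretraining on more than 100 million pages. The LLM provides targeted semantic disambiguation in 18.7% of cases, yielding a +3.8 AP gain.
Conclusion: LLM structural priors are complementary to visual detection and improve both lightweight and pretrained architectures. Instance-adaptive gating outperforms fixed weights, open-source LLMs enable privacy-preserving deployment with minimal performance loss, and the system cost is manageable, making the framework an efficient semi-supervised solution for document layout understanding.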
📄 Abstract
Document layout understanding remains data-intensive despite advances in semi-supervised learning. We present a framework that enhances semi-supervised detection by fusing visual predictions with structural priors from text-pretrained LLMs via principled probabilistic weighting. Given unlabeled documents, an OCR-LLM pipeline infers hierarchical regions which are combined with teacher detector outputs through inverse-variance fusion to generate refined pseudo-labels. Our method demonstrates consistent gains across model scales. With a lightweight SwiftFormer backbone (26M params), we achieve 88.2 ± 0.3 AP using only 5% labels on PubLayNet. When applied to document-pretrained LayoutLMv3 (133M params), our fusion framework reaches 89.7 ± 0.4 AP, surpassing LayoutLMv3 with standard semi-supervised learning (89.1 ± 0.4 AP, p=0.02) and matching UDOP (89.8 AP), which requires 100M+ pages of multimodal pretraining. This demonstrates that LLM structural priors are complementary to both lightweight and pretrained architectures. Key findings include: (1) learned instance-adaptive gating improves over fixed weights by +0.9 AP with data-dependent PAC bounds correctly predicting convergence; (2) open-source LLMs enable privacy-preserving deployment with minimal loss (Llama-3-70B: 87.1 AP lightweight, 89.4 AP with LayoutLMv3); (3) LLMs provide targeted semantic disambiguation (18.7% of cases, +3.8 AP gain) beyond simple text heuristics. Total system cost includes $12 for GPT-4o-mini API or 17 GPU-hours for local Llama-3-70B per 50K pages, amortized across training runs.
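Inverse-variance fusion itself is a standard estimator: each source is weighted by the reciprocal of its variance, so the more certain source dominates. The sketch below applies it coordinate-wise to two bounding boxes. The box values, the per-source variances, and the helper names are assumptions for illustration; how the paper estimates the variances or applies its gating is not described in the abstract.

```python
# Standard inverse-variance weighting, applied per box coordinate.
# Variances and boxes below are made-up illustration values.

def inverse_variance_fuse(estimates, variances):
    """Return the fused estimate and its variance for independent sources."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    fused = sum(w * x for w, x in zip(weights, estimates)) / total
    return fused, 1.0 / total

def fuse_boxes(det_box, det_var, llm_box, llm_var):
    """Fuse two [x1, y1, x2, y2] boxes coordinate by coordinate."""
    return [
        inverse_variance_fuse([d, l], [det_var, llm_var])[0]
        for d, l in zip(det_box, llm_box)
    ]

detector_box = [10.0, 20.0, 110.0, 220.0]  # teacher detector prediction
llm_box = [14.0, 24.0, 114.0, 224.0]       # region inferred by the OCR-LLM pipeline
fused_box = fuse_boxes(detector_box, 1.0, llm_box, 3.0)
print(fused_box)
```

With the detector three times more certain than the LLM prior, the fused box sits a quarter of the way from the detector box toward the LLM box, and the fused variance is smaller than either input variance.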
[7] Negative Entity Suppression for Zero-Shot Captioning with Synthetic Images
Zimao Lu, Hui Xu, Bing Liu, Ke Wang
🧩 TL;DR
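This paper proposes Negative Entity Suppression (NES) to address hallucination in zero-shot image captioning. It uses synthetic images to keep retrieval consistent, filters negative entities, and applies attention-level suppression to reduce erroneous content in cross-domain captions, achieving state-of-the-art results on multiple benchmarks.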
📘 Detailed Summary
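Motivation: Zero-shot image captioning methods trained on text alone avoid the cost of collecting paired image-text data, but they generalize poorly across domains and tend to hallucinate content in novel visual environments. Retrieval-based methods try to mitigate this with external knowledge, yet they can make hallucination worse when the retrieved captions contain entities irrelevant to the input.
Method: NES integrates three stages. First, it uses synthetic images to keep image-to-text retrieval consistent between training and inference. Second, it filters negative entities from the retrieved content to improve accuracy. Third, it applies attention-level suppression using the identified negative entities to further reduce the influence of hallucination-prone features.
Result: Evaluation on multiple benchmarks shows that NES maintains competitive in-domain performance while improving cross-domain transfer and lowering hallucination rates, setting new state-of-the-art results in zero-shot image captioning.
Conclusion: The study shows that systematically suppressing negative entities is an effective way to reduce hallucination in zero-shot image captioning. It offers a new option for vision-language tasks in data-scarce settings and illustrates the potential of combining synthetic data with retrieval for cross-domain generalization.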
📄 Abstract
Text-only training provides an attractive approach to address data scarcity challenges in zero-shot image captioning (ZIC), avoiding the expense of collecting paired image-text annotations. However, although these approaches perform well within training domains, they suffer from poor cross-domain generalization, often producing hallucinated content when encountering novel visual environments. Retrieval-based methods attempt to mitigate this limitation by leveraging external knowledge, but they can paradoxically exacerbate hallucination when retrieved captions contain entities irrelevant to the inputs. We introduce the concept of negative entities--objects that appear in the generated caption but are absent from the input--and propose Negative Entity Suppression (NES) to tackle this challenge. NES seamlessly integrates three stages: (1) it employs synthetic images to ensure consistent image-to-text retrieval across both training and inference; (2) it filters negative entities from retrieved content to enhance accuracy; and (3) it applies attention-level suppression using identified negative entities to further minimize the impact of hallucination-prone features. Evaluation across multiple benchmarks demonstrates that NES maintains competitive in-domain performance while improving cross-domain transfer and reducing hallucination rates, achieving new state-of-the-art results in ZIC. Our code is available at https://github.com/nidongpinyinme/NESCap.
[8] SPEED-Q: Staged Processing with Enhanced Distillation towards Efficient Low-bit On-device VLM Quantization
Tianyu Guo, Shanwei Zhao, Shiai Zhu, Chenguang Ma
🧩 TL;DR
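SPEED-Q is a Staged Processing with Enhanced Distillation framework for low-bit weight-only quantization of vision-language models. It addresses the differing quantization sensitivity of multimodal components and the training instability of low-bit quantization, enabling accurate, stable, and data-efficient VLM deployment on edge devices.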
📘 Detailed Summary
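Motivation: Existing research has rarely explored aggressive quantization of vision-language models, particularly for models with 1B to 2B parameters, which are better suited to resource-constrained edge devices. Two key obstacles are the large gap in quantization sensitivity between the vision and language components and the training instability caused by low-bit quantization.
Method: SPEED-Q introduces a staged sensitivity-adaptive mechanism to harmonize performance across modalities, and a distillation-enhanced quantization strategy to stabilize training and reduce data dependence. Together these address the sensitivity gap and the stability problem.
Result: Extensive experiments on multiple benchmarks show that SPEED-Q achieves up to 6x higher accuracy than existing quantization methods under 2-bit settings and consistently outperforms prior on-device VLMs under both 2-bit and 4-bit settings.
Conclusion: The study presents the first framework tailored to quantizing entire small-scale, billion-parameter VLMs to low bits. It shows that deploying complex VLMs on edge devices is feasible and opens a path for multimodal AI applications in resource-constrained environments.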
📄 Abstract
Deploying Vision-Language Models (VLMs) on edge devices (e.g., smartphones and robots) is crucial for enabling low-latency and privacy-preserving intelligent applications. Given the resource constraints of these devices, quantization offers a promising solution by improving memory efficiency and reducing bandwidth requirements, thereby facilitating the deployment of VLMs. However, existing research has rarely explored aggressive quantization on VLMs, particularly for the models ranging from 1B to 2B parameters, which are more suitable for resource-constrained edge devices. In this paper, we propose SPEED-Q, a novel Staged Processing with Enhanced Distillation framework for VLM low-bit weight-only quantization that systematically addresses the following two critical obstacles: (1) significant discrepancies in quantization sensitivity between vision (ViT) and language (LLM) components in VLMs; (2) training instability arising from the reduced numerical precision inherent in low-bit quantization. In SPEED-Q, a staged sensitivity adaptive mechanism is introduced to effectively harmonize performance across different modalities. We further propose a distillation-enhanced quantization strategy to stabilize the training process and reduce data dependence. Together, SPEED-Q enables accurate, stable, and data-efficient quantization of complex VLMs. SPEED-Q is the first framework tailored for quantizing entire small-scale billion-parameter VLMs to low bits. Extensive experiments across multiple benchmarks demonstrate that SPEED-Q achieves up to 6x higher accuracy than existing quantization methods under 2-bit settings and consistently outperforms prior on-device VLMs under both 2-bit and 4-bit settings. Our code and models are available at https://github.com/antgroup/SPEED-Q.
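For readers unfamiliar with weight-only quantization, the sketch below shows plain round-to-nearest uniform quantization of one weight group to 2 bits. This is the generic baseline technique, not SPEED-Q's staged or distillation-enhanced procedure; the group contents and function names are assumptions for illustration.

```python
# Generic asymmetric round-to-nearest quantization of one weight group.
# Not SPEED-Q's method; shown only to make "2-bit weight-only" concrete.

def quantize_group(weights, bits=2):
    """Map floats to integer codes in [0, 2**bits - 1] plus scale and offset."""
    levels = 2 ** bits - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels if w_max > w_min else 1.0
    codes = [round((w - w_min) / scale) for w in weights]
    return codes, scale, w_min

def dequantize_group(codes, scale, offset):
    """Reconstruct approximate float weights from integer codes."""
    return [c * scale + offset for c in codes]

weights = [-0.6, -0.1, 0.2, 0.9]
codes, scale, offset = quantize_group(weights, bits=2)
restored = dequantize_group(codes, scale, offset)
errors = [abs(r - w) for r, w in zip(restored, weights)]
print(codes, restored, max(errors))
```

With only four representable levels per group, the reconstruction error can reach half a quantization step, which gives some intuition for why components with different sensitivity may need different treatment at 2 bits.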
[9] From Structure to Detail: Hierarchical Distillation for Efficient Diffusion Model
Hanbo Cheng, Peng Wang, Kaixiang Lei, Qi Li, Zhen Zou, Pengfei Hu, Jun Du
🧩 TL;DR
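This paper proposes the Hierarchical Distillation (HD) framework, which recasts trajectory distillation and distribution distillation from independent paradigms into synergistic components: trajectory distillation builds a structural sketch, and distribution distillation refines it. The result is state-of-the-art single-step diffusion performance.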
📘 Detailed Summary
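Motivation: The inference latency of diffusion models is a key barrier to real-time use. Existing trajectory-based and distribution-based distillation methods present a fundamental trade-off: trajectory distillation preserves global structure but sacrifices high-frequency detail, while distribution distillation can reach higher fidelity but often suffers from mode collapse and unstable training.
Method: The Hierarchical Distillation framework uses trajectory distillation as a structural sketch generator rather than a final generator, providing a near-optimal initialization for the subsequent distribution distillation stage. It also introduces the Adaptive Weighted Discriminator (AWD), which dynamically allocates token weights to focus on local imperfections and refine details efficiently.
Result: On ImageNet 256×256, the single-step model achieves an FID of 2.26, rivaling its 250-step teacher. It also achieves promising results on the high-resolution text-to-image MJHQ benchmark, which supports its generalizability.
Conclusion: The method establishes a robust new paradigm for high-fidelity single-step diffusion models. By combining the strengths of the two distillation approaches, it overcomes their individual limitations and balances quality with efficiency.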
📄 Abstract
The inference latency of diffusion models remains a critical barrier to their real-time application. While trajectory-based and distribution-based step distillation methods offer solutions, they present a fundamental trade-off. Trajectory-based methods preserve global structure but act as a "lossy compressor", sacrificing high-frequency details. Conversely, distribution-based methods can achieve higher fidelity but often suffer from mode collapse and unstable training. This paper recasts them from independent paradigms into synergistic components within our novel Hierarchical Distillation (HD) framework. We leverage trajectory distillation not as a final generator, but to establish a structural "sketch", providing a near-optimal initialization for the subsequent distribution-based refinement stage. This strategy yields an ideal initial distribution that enhances the ceiling of overall performance. To further improve quality, we introduce and refine the adversarial training process. We find standard discriminator structures are ineffective at refining an already high-quality generator. To overcome this, we introduce the Adaptive Weighted Discriminator (AWD), tailored for the HD pipeline. By dynamically allocating token weights, AWD focuses on local imperfections, enabling efficient detail refinement. Our approach demonstrates state-of-the-art performance across diverse tasks. On ImageNet 256×256, our single-step model achieves an FID of 2.26, rivaling its 250-step teacher. It also achieves promising results on the high-resolution text-to-image MJHQ benchmark, proving its generalizability. Our method establishes a robust new paradigm for high-fidelity, single-step diffusion models.
[10] T-Rex-Omni: Integrating Negative Visual Prompt in Generic Object Detection
Jiazhou Zhou, Qing Jiang, Kanghao Chen, Lutao Jiang, Yuanhuiyi Lyu, Ying-Cong Chen, Lei Zhang
🧩 TL;DR
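This paper proposes T-Rex-Omni, a framework that introduces negative visual prompts to suppress hard negative distractors and improve open-set object detection. It performs strongly in zero-shot detection, with particular strength in long-tailed scenarios.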
📘 Detailed Summary
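Motivation: Current open-set object detectors rely exclusively on positive prompts based on text descriptions or visual exemplars. This positive-only paradigm remains vulnerable to distractors that are visually similar but semantically different. The work addresses this limitation by incorporating negative visual prompts to negate hard negative distractors.
Method: A unified visual prompt encoder jointly processes positive and negative visual prompts. A training-free Negating Negative Computing (NNC) module dynamically suppresses negative responses, and a Negating Negative Hinge (NNH) loss enforces discriminative margins between positive and negative embeddings during fine-tuning. The framework supports both positive-only and joint positive-negative inference modes.
Result: Extensive experiments show strong zero-shot detection performance, significantly narrowing the gap between visual-prompted and text-prompted methods, with particular strength in long-tailed scenarios (51.2 AP_r on LVIS-minival).
Conclusion: The work establishes negative prompts as an important new dimension for advancing open-set visual recognition systems, provides an effective way to handle visually similar distractors, and opens a new research direction for open-set visual recognition.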
📄 Abstract
Object detection methods have evolved from closed-set to open-set paradigms over the years. Current open-set object detectors, however, remain constrained by their exclusive reliance on positive indicators based on given prompts like text descriptions or visual exemplars. This positive-only paradigm experiences consistent vulnerability to visually similar but semantically different distractors. We propose T-Rex-Omni, a novel framework that addresses this limitation by incorporating negative visual prompts to negate hard negative distractors. Specifically, we first introduce a unified visual prompt encoder that jointly processes positive and negative visual prompts. Next, a training-free Negating Negative Computing (NNC) module is proposed to dynamically suppress negative responses during the probability computing stage. To further boost performance through fine-tuning, our Negating Negative Hinge (NNH) loss enforces discriminative margins between positive and negative embeddings. T-Rex-Omni supports flexible deployment in both positive-only and joint positive-negative inference modes, accommodating either user-specified or automatically generated negative examples. Extensive experiments demonstrate remarkable zero-shot detection performance, significantly narrowing the performance gap between visual-prompted and text-prompted methods while showing particular strength in long-tailed scenarios (51.2 AP_r on LVIS-minival). This work establishes negative prompts as a crucial new dimension for advancing open-set visual recognition systems.
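One way to picture negative-prompt suppression and a hinge margin between positive and negative embeddings is sketched below. The subtraction rule in `negated_score`, the `beta` weight, and the margin value are assumptions chosen for illustration; the abstract does not give the actual NNC or NNH formulas.

```python
import math

# Hypothetical negative-prompt suppression and hinge margin.
# The exact NNC / NNH definitions are not given in the abstract.

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def negated_score(query, positive, negatives, beta=1.0):
    """Positive similarity minus the strongest negative similarity, floored at 0."""
    pos = cosine(query, positive)
    neg = max((cosine(query, n) for n in negatives), default=0.0)
    return max(0.0, pos - beta * max(0.0, neg))

def hinge_margin_loss(pos_sim, neg_sim, margin=0.2):
    """Zero once the positive similarity beats the negative one by `margin`."""
    return max(0.0, margin - (pos_sim - neg_sim))

positive_prompt = [1.0, 0.0]
negative_prompts = [[0.0, 1.0]]
target_score = negated_score([1.0, 0.1], positive_prompt, negative_prompts)
distractor_score = negated_score([0.1, 1.0], positive_prompt, negative_prompts)
print(target_score, distractor_score)
```

In this toy setup a region that resembles the negative exemplar is driven to a score of zero, while a region close to the positive exemplar keeps most of its score.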
[11] Causally-Grounded Dual-Path Attention Intervention for Object Hallucination Mitigation in LVLMs
Liu Yu, Zhonghao Chen, Ping Kuang, Zhikun Feng, Fan Zhou, Lan Wang, Gillian Dobbie
🧩 TL;DR
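This paper proposes Owl, a causally grounded bi-modal attention reweighting framework. It models the hallucination process with a structural causal graph, introduces the VTACR metric to quantify the imbalance between modality contributions, and designs a fine-grained attention intervention mechanism that significantly reduces object hallucination in large vision-language models.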
📘 Detailed Summary
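Motivation: Existing mitigation approaches based on the language decoder usually regulate visual or textual attention independently and overlook their interaction as two key causal factors, so object hallucination persists in large vision-language models.
Method: Owl models the hallucination process with a structural causal graph and treats the decomposed visual and textual attentions as mediators. It introduces VTACR to quantify the modality contribution imbalance during decoding, designs a fine-grained attention intervention mechanism that dynamically adjusts token-level and layer-level attention according to VTACR signals, and adopts a dual-path contrastive decoding strategy in which one path emphasizes visually grounded predictions and the other amplifies hallucinated ones.
Result: On the POPE and CHAIR benchmarks, Owl achieves significant hallucination reduction and sets a new state of the art in faithfulness while preserving vision-language understanding capability.
Conclusion: The study shows that causal modeling combined with fine-grained attention intervention can mitigate object hallucination. VTACR offers a new lens on hallucination mechanisms, and dual-path contrastive decoding is a promising direction for future hallucination detection and mitigation methods.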
📄 Abstract
Object hallucination remains a critical challenge in Large Vision-Language Models (LVLMs), where models generate content inconsistent with visual inputs. Existing language-decoder-based mitigation approaches often regulate visual or textual attention independently, overlooking their interaction as two key causal factors. To address this, we propose Owl (Bi-mOdal attention reWeighting for Layer-wise hallucination mitigation), a causally-grounded framework that models the hallucination process via a structural causal graph, treating decomposed visual and textual attentions as mediators. We introduce VTACR (Visual-to-Textual Attention Contribution Ratio), a novel metric that quantifies the modality contribution imbalance during decoding. Our analysis reveals that hallucinations frequently occur in low-VTACR scenarios, where textual priors dominate and visual grounding is weakened. To mitigate this, we design a fine-grained attention intervention mechanism that dynamically adjusts token- and layer-wise attention guided by VTACR signals. Finally, we propose a dual-path contrastive decoding strategy: one path emphasizes visually grounded predictions, while the other amplifies hallucinated ones -- letting visual truth shine and hallucination collapse. Experimental results on the POPE and CHAIR benchmarks show that Owl achieves significant hallucination reduction, setting a new SOTA in faithfulness while preserving vision-language understanding capability. Our code is available at https://github.com/CikZ2023/OWL
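To make the two ideas concrete, the sketch below computes a visual-to-textual attention ratio for one decoding step and combines two logit vectors in the usual contrastive-decoding form. Both definitions are assumptions for illustration: the abstract names VTACR and dual-path contrastive decoding but does not give their formulas, and `alpha` is a hypothetical strength parameter.

```python
# Hypothetical VTACR-style ratio and a generic contrastive-decoding rule.
# Neither formula is taken from the paper.

def vtacr(attention, is_visual):
    """Attention mass on visual tokens divided by mass on textual tokens."""
    visual = sum(a for a, v in zip(attention, is_visual) if v)
    textual = sum(a for a, v in zip(attention, is_visual) if not v)
    return visual / textual if textual > 0 else float("inf")

def contrastive_logits(grounded, hallucinated, alpha=1.0):
    """Boost the grounded path and subtract the hallucination-prone path."""
    return [(1 + alpha) * g - alpha * h for g, h in zip(grounded, hallucinated)]

# One decoding step attending to two visual and two textual tokens.
attention = [0.1, 0.1, 0.5, 0.3]
is_visual = [True, True, False, False]
ratio = vtacr(attention, is_visual)

# Vocabulary of two tokens: index 0 is grounded, index 1 is hallucination-prone.
adjusted = contrastive_logits([2.0, 1.0], [1.0, 3.0], alpha=1.0)
print(ratio, adjusted)
```

A small ratio here corresponds to the low-VTACR regime the abstract associates with hallucination, and the contrastive combination lowers the logit of the token favored by the hallucination-prone path.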
[12] VietMEAgent: Culturally-Aware Few-Shot Multimodal Explanation for Vietnamese Visual Question Answering
Hai-Dang Nguyen, Minh-Anh Dang, Minh-Tan Le, Minh-Tuan Le
🧩 TL;DR
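This paper proposes VietMEAgent, a multimodal explainable framework for Vietnamese cultural understanding that combines cultural object detection with structured program generation to give transparent explanations for culturally specific visual question answering. The framework integrates a cultural knowledge base and a dual-modality explanation module, supporting cultural education and preservation.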
📘 Detailed Summary
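Motivation: Current visual question answering systems are limited on culturally specific content, mainly because cultural knowledge is under-represented in training corpora and the reasoning process is not interpretable to end users. The paper addresses this gap for Vietnamese cultural understanding, with an emphasis on interpretability and cultural sensitivity.
Method: The method integrates a cultural object detection backbone with a structured program generation layer, yielding a pipeline in which answer prediction and explanation are tightly coupled. A curated knowledge base of Vietnamese cultural entities serves as an explicit source of background information, and a dual-modality explanation module combines attention-based visual evidence with structured, human-readable textual rationales.
Result: The authors construct a Vietnamese Cultural VQA dataset and use it to demonstrate the practicality of programming-based methods for cultural AI. The system provides transparent explanations that disclose both the computational rationale and the cultural context, in support of education and cultural preservation.
Conclusion: The study shows that program-based methods are practical for cultural AI and provides an interpretability framework for culturally sensitive AI systems. The transparent explanation mechanism supports cultural education and preservation and suggests a direction for multimodal cultural understanding systems.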
📄 Abstract
Contemporary Visual Question Answering (VQA) systems remain constrained when confronted with culturally specific content, largely because cultural knowledge is under-represented in training corpora and the reasoning process is not rendered interpretable to end users. This paper introduces VietMEAgent, a multimodal explainable framework engineered for Vietnamese cultural understanding. The method integrates a cultural object detection backbone with a structured program generation layer, yielding a pipeline in which answer prediction and explanation are tightly coupled. A curated knowledge base of Vietnamese cultural entities serves as an explicit source of background information, while a dual-modality explanation module combines attention-based visual evidence with structured, human-readable textual rationales. We further construct a Vietnamese Cultural VQA dataset sourced from public repositories and use it to demonstrate the practicality of programming-based methodologies for cultural AI. The resulting system provides transparent explanations that disclose both the computational rationale and the underlying cultural context, supporting education and cultural preservation with an emphasis on interpretability and cultural sensitivity.
[13] Taming Object Hallucinations with Verified Atomic Confidence Estimation
Jiarui Liu, Weihao Xuan, Zhijing Jin, Mona Diab
🧩 TL;DR
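This paper proposes TACO, a framework that mitigates hallucination in multimodal large language models through self-verification and confidence calibration without relying on external vision experts, improving faithfulness on multiple benchmarks.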
📘 Detailed Summary
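Motivation: Multimodal large language models often hallucinate, particularly with errors in object existence, attributes, and relations. These errors undermine reliability and call for effective mitigation.
Method: TACO decomposes model responses into atomic queries, paraphrases them to reduce sensitivity to wording, estimates confidence using self-consistency or self-confidence aggregation, and finally refines the answer with a language model. The whole process does not depend on external vision experts.
Result: Experiments on five benchmarks show that TACO outperforms direct prompting and Visual Contrastive Decoding on both LLaVA-1.5-7B and CogVLM2, reduces systematic biases, and improves confidence calibration.
Conclusion: TACO shows that self-verification and confidence calibration can mitigate hallucination in multimodal large language models, offering a simple and effective way to improve faithfulness.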
📄 Abstract
Multimodal Large Language Models (MLLMs) often suffer from hallucinations, particularly errors in object existence, attributes, or relations, which undermine their reliability. We introduce TACO (Verified Atomic Confidence Estimation), a simple framework that mitigates hallucinations through self-verification and confidence calibration without relying on external vision experts. TACO decomposes responses into atomic queries, paraphrases them to reduce sensitivity to wording, and estimates confidence using self-consistency (black-box) or self-confidence (gray-box) aggregation, before refining answers with a language model. Experiments on five benchmarks (POPE, MME, HallusionBench, AMBER, and MM-Hal Bench) with two MLLMs (LLaVA-1.5-7B and CogVLM2) show that TACO consistently outperforms direct prompting and Visual Contrastive Decoding, reduces systematic biases, and improves confidence calibration, demonstrating its effectiveness in enhancing the faithfulness of MLLMs.
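The black-box self-consistency step can be pictured as majority voting over the answers a model gives to paraphrases of the same atomic query. The sketch below is a minimal version under that assumption; the answer normalization, the `keep_claim` helper, and the 0.7 threshold are hypothetical and not taken from the paper.

```python
from collections import Counter

# Minimal self-consistency aggregation over paraphrased atomic queries.
# Normalization and threshold are illustrative assumptions.

def self_consistency(answers):
    """Return the majority answer and the fraction of votes it received."""
    counts = Counter(a.strip().lower() for a in answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)

def keep_claim(answers, threshold=0.7):
    """Keep an atomic claim only if the model confirms it consistently."""
    answer, confidence = self_consistency(answers)
    return answer == "yes" and confidence >= threshold

# Answers to four paraphrases of "Is there a dog in the image?"
answers = ["Yes", "yes", "no", "Yes "]
majority, confidence = self_consistency(answers)
print(majority, confidence, keep_claim(answers))
```

Claims that fall below the threshold would be the ones passed to the refinement step for removal or rewording.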
[14] Diversifying Counterattacks: Orthogonal Exploration for Robust CLIP Inference
Chengze Jiang, Minjing Dong, Xinli Shi, Jie Gui
🧩 TL;DR
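This paper proposes Directional Orthogonal Counterattack (DOC), which adds orthogonal gradient directions and momentum-based updates to increase the diversity and coverage of counterattacks, improving the adversarial robustness of vision-language pre-training models in test-time defense.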
📘 Detailed Summary
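Motivation: The existing test-time defense, Test-Time Counterattack (TTC), generates counterattacks solely from gradients with respect to the adversarial input. Because adversarial attacks and counterattacks have fundamentally different optimization objectives, this confines the search to a narrow space: counterattacks can overfit a limited set of adversarial patterns and lack the diversity needed to neutralize a broad range of perturbations.
Method: DOC augments counterattack optimization with orthogonal gradient directions and momentum-based updates, which expands exploration of the counterattack space and increases the diversity of perturbations. A directional sensitivity score based on averaged cosine similarity further improves DOC by sharpening example discrimination and adaptively modulating counterattack strength.
Result: Extensive experiments on 16 datasets show that DOC improves adversarial robustness under various attacks while maintaining competitive clean accuracy.
Conclusion: Increasing the diversity and coverage of counterattacks is crucial for adversarial robustness in test-time defense. Through orthogonal gradient exploration and momentum, DOC produces more generalizable counterattacks and offers an effective option for deploying vision-language models reliably.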
📄 Abstract
Vision-language pre-training models (VLPs) demonstrate strong multimodal understanding and zero-shot generalization, yet remain vulnerable to adversarial examples, raising concerns about their reliability. Recent work, Test-Time Counterattack (TTC), improves robustness by generating perturbations that maximize the embedding deviation of adversarial inputs using PGD, pushing them away from their adversarial representations. However, due to the fundamental difference in optimization objectives between adversarial attacks and counterattacks, generating counterattacks solely based on gradients with respect to the adversarial input confines the search to a narrow space. As a result, the counterattacks could overfit limited adversarial patterns and lack the diversity to fully neutralize a broad range of perturbations. In this work, we argue that enhancing the diversity and coverage of counterattacks is crucial to improving adversarial robustness in test-time defense. Accordingly, we propose Directional Orthogonal Counterattack (DOC), which augments counterattack optimization by incorporating orthogonal gradient directions and momentum-based updates. This design expands the exploration of the counterattack space and increases the diversity of perturbations, which facilitates the discovery of more generalizable counterattacks and ultimately improves the ability to neutralize adversarial perturbations. Meanwhile, we present a directional sensitivity score based on averaged cosine similarity to boost DOC by improving example discrimination and adaptively modulating the counterattack strength. Extensive experiments on 16 datasets demonstrate that DOC improves adversarial robustness under various attacks while maintaining competitive clean accuracy. Code is available at https://github.com/bookman233/DOC.
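The idea of exploring directions orthogonal to the gradient can be sketched with a Gram-Schmidt projection followed by a momentum-accumulated signed step, as in PGD-style updates. Everything specific below (the mixing weight `lam`, the momentum factor `mu`, the step size, and the helper names) is an assumption for illustration rather than DOC's actual update rule.

```python
# Illustrative orthogonal-direction exploration with momentum.
# Parameter values and the way terms are mixed are assumptions.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthogonal_component(direction, grad):
    """Remove from `direction` its projection onto `grad` (Gram-Schmidt step)."""
    g2 = dot(grad, grad)
    if g2 == 0.0:
        return list(direction)
    coef = dot(direction, grad) / g2
    return [d - coef * g for d, g in zip(direction, grad)]

def momentum_update(velocity, grad, ortho, mu=0.9, lam=0.5):
    """Accumulate the gradient plus a weighted orthogonal direction."""
    return [mu * v + g + lam * o for v, g, o in zip(velocity, grad, ortho)]

def sign_step(x, velocity, step=0.01):
    """Take a signed step of fixed size along the accumulated direction."""
    return [xi + step * ((v > 0) - (v < 0)) for xi, v in zip(x, velocity)]

grad = [1.0, 0.0]    # gradient w.r.t. the (adversarial) input
probe = [1.0, 1.0]   # candidate exploration direction, e.g. a random draw
ortho = orthogonal_component(probe, grad)
velocity = momentum_update([0.0, 0.0], grad, ortho)
x_new = sign_step([0.5, 0.5], velocity)
print(ortho, velocity, x_new)
```

The orthogonal term lets the update move along coordinates the raw gradient ignores, which is one plausible reading of how the search space is widened.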
[15] DensiCrafter: Physically-Constrained Generation and Fabrication of Self-Supporting Hollow Structures
Shengqi Dang, Fu Chai, Jiaxin Li, Chao Yuan, Wei Ye, Nan Cao
🧩 TL;DR
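This paper proposes DensiCrafter, a framework that generates lightweight, self-supporting 3D hollow structures by optimizing a density field. On the text-to-3D task it reduces material mass by up to 43% while maintaining high geometric fidelity and structural stability.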
📘 Detailed Summary
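Motivation: Existing 3D generative models often ignore physical constraints and manufacturability, so they cannot produce 3D designs that are both lightweight and self-supporting, which limits practical fabrication.
Method: Starting from coarse voxel grids produced by Trellis, the method interprets them as continuous density fields to be optimized and introduces three differentiable, physically constrained, simulation-free loss terms. In addition, a mass regularization penalizes unnecessary material, and a restricted optimization domain preserves the outer surface.
Result: The method achieves up to a 43% reduction in material mass on the text-to-3D task. Compared with state-of-the-art baselines, it improves stability while maintaining high geometric fidelity, and real-world 3D-printing experiments confirm that the hollow designs can be reliably fabricated and are self-supporting.
Conclusion: The work shows that physical constraints can be integrated into pretrained generative models without architectural changes, opening a path toward fabrication-ready 3D design generation that improves structural efficiency and practicality while preserving generation quality.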
📄 Abstract
The rise of 3D generative models has enabled automatic 3D geometry and texture synthesis from multimodal inputs (e.g., text or images). However, these methods often ignore physical constraints and manufacturability considerations. In this work, we address the challenge of producing 3D designs that are both lightweight and self-supporting. We present DensiCrafter, a framework for generating lightweight, self-supporting 3D hollow structures by optimizing the density field. Starting from coarse voxel grids produced by Trellis, we interpret them as continuous density fields to be optimized and introduce three differentiable, physically constrained, and simulation-free loss terms. Additionally, a mass regularization penalizes unnecessary material, while a restricted optimization domain preserves the outer surface. Our method seamlessly integrates with pretrained Trellis-based models (e.g., Trellis, DSO) without any architectural changes. In extensive evaluations, we achieve up to 43% reduction in material mass on the text-to-3D task. Compared to state-of-the-art baselines, our method improves stability while maintaining high geometric fidelity. Real-world 3D-printing experiments confirm that our hollow designs can be reliably fabricated and can be self-supporting.
[16] Composition-Incremental Learning for Compositional Generalization
Zhen Li, Yuwei Wu, Chenchen Jing, Che Sun, Chuanhao Li, Yunde Jia
🧩 TL;DR
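This paper proposes a composition-incremental learning framework for compositional zero-shot learning. It uses a visual synthesizer and a linguistic primitive distillation mechanism to cope with the nearly infinite, long-tailed compositions of the real world, improving compositional generalization as the model continually learns new compositions.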
📘 Detailed Summary
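Motivation: Real-world data emerges continually, and the possible compositions are nearly infinite, long-tailed, and not entirely visible. Existing methods have made progress on pre-collected training data but cannot gradually improve compositional generalization in an incremental setting.
Method: The authors propose a pseudo-replay framework that uses a visual synthesizer to synthesize visual representations of learned compositions and a linguistic primitive distillation mechanism to keep primitive representations aligned throughout learning. For quantitative evaluation they also build two benchmarks, MIT-States-CompIL and C-GQA-CompIL.
Result: Extensive experiments demonstrate the effectiveness of the proposed framework in improving compositional generalization on composition-incremental learning tasks, particularly in long-tailed and open-ended composition scenarios.
Conclusion: The study offers an effective solution for composition-incremental learning: visual synthesis and linguistic distillation counter catastrophic forgetting during continual learning, pointing to a new research direction for learning the unbounded compositions found in the real world.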
📄 Abstract
Compositional generalization has achieved substantial progress in computer vision on pre-collected training data. Nonetheless, real-world data continually emerges, with possible compositions being nearly infinite, long-tailed, and not entirely visible. Thus, an ideal model is supposed to gradually improve the capability of compositional generalization in an incremental manner. In this paper, we explore Composition-Incremental Learning for Compositional Generalization (CompIL) in the context of the compositional zero-shot learning (CZSL) task, where models need to continually learn new compositions, intending to improve their compositional generalization capability progressively. To quantitatively evaluate CompIL, we develop a benchmark construction pipeline leveraging existing datasets, yielding MIT-States-CompIL and C-GQA-CompIL. Furthermore, we propose a pseudo-replay framework utilizing a visual synthesizer to synthesize visual representations of learned compositions and a linguistic primitive distillation mechanism to maintain aligned primitive representations across the learning process. Extensive experiments demonstrate the effectiveness of the proposed framework.
[17] Ultra-Light Test-Time Adaptation for Vision-Language Models
Byunghyun Kim
🧩 TL;DR
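This paper proposes UL-TTA, an ultra-light test-time adaptation framework that addresses feature drift and miscalibration of vision-language models under domain shift by adapting only logit-level parameters (class prototypes, priors, and temperature). It reaches a state-of-the-art accuracy-calibration trade-off without updating any backbone parameters.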
📘 Detailed Summary
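Motivation: Vision-language models such as CLIP suffer from feature drift, class-prior mismatch, and severe miscalibration under domain shift. Existing test-time adaptation methods require backpropagation through the backbone, covariance estimation, or heavy memory/state, which is a significant limitation in streaming and edge scenarios.
Method: UL-TTA is a fully training-free and backprop-free framework that freezes the backbone and adapts only logit-level parameters online. It combines selective sample filtering, closed-form Bayesian updates of prototypes and priors anchored by text and Dirichlet priors, decoupled temperatures for prediction and calibration, and lightweight guards that prevent drift in long streams.
Result: Across cross-domain and out-of-distribution benchmarks, UL-TTA improves top-1 accuracy by 4.7 points on average over zero-shot CLIP while reducing ECE by 20-30%, with less than 8% latency overhead. Long-stream experiments of up to 200K samples show no collapse.
Conclusion: Logit-level Bayesian adaptation is sufficient to obtain state-of-the-art accuracy-calibration trade-offs for vision-language models under domain shift without updating any backbone parameters, offering an efficient solution for streaming and edge deployment.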
📄 Abstract
Vision-Language Models (VLMs) such as CLIP achieve strong zero-shot recognition by comparing image embeddings to text-derived class prototypes. However, under domain shift, they suffer from feature drift, class-prior mismatch, and severe miscalibration. Existing test-time adaptation (TTA) methods often require backpropagation through large backbones, covariance estimation, or heavy memory/state, which is problematic for streaming and edge scenarios. We propose Ultra-Light Test-Time Adaptation (UL-TTA), a fully training-free and backprop-free framework that freezes the backbone and adapts only logit-level parameters: class prototypes, class priors, and temperature. UL-TTA performs an online EM-style procedure with (i) selective sample filtering to use only confident predictions, (ii) closed-form Bayesian updates for prototypes and priors anchored by text and Dirichlet priors, (iii) decoupled temperatures for prediction vs. calibration, and (iv) lightweight guards (norm clipping, prior KL constraints, smoothed temperature) to prevent drift in long streams. Across large-scale cross-domain and OOD benchmarks (PACS, Office-Home, DomainNet, Terra Incognita, ImageNet-R/A/V2/Sketch; ~726K test samples) and strong TTA baselines including Tent, T3A, CoTTA, SAR, Tip-Adapter, and FreeTTA, UL-TTA consistently improves top-1 accuracy (e.g., +4.7 points over zero-shot CLIP on average) while reducing ECE by 20-30%, with less than 8% latency overhead. Long-stream experiments up to 200K samples show no collapse. Our results demonstrate that logit-level Bayesian adaptation is sufficient to obtain state-of-the-art accuracy-calibration trade-offs for VLMs under domain shift, without updating any backbone parameters.
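The online, backprop-free adaptation described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the class name `LogitLevelAdapter`, the parameters `kappa`, `alpha`, and `conf_thresh`, the use of a single temperature, and the particular text-anchored update rule are all assumptions made for illustration. The guards (norm clipping, prior KL constraints, smoothed temperature) and the decoupled calibration temperature are left out.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalize(v):
    """L2-normalize a vector (returned unchanged if its norm is zero)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


class LogitLevelAdapter:
    """Hypothetical sketch: adapts class prototypes and priors only.

    The backbone that produces `feat` is assumed frozen and is not modeled.
    """

    def __init__(self, text_protos, kappa=10.0, alpha=1.0, temp=0.01,
                 conf_thresh=0.6):
        self.text = [normalize(p) for p in text_protos]  # text anchors
        self.kappa = kappa              # assumed strength of the text anchor
        self.alpha = alpha              # assumed symmetric Dirichlet prior
        self.temp = temp                # single prediction temperature
        self.conf_thresh = conf_thresh  # selective sample filtering
        k, d = len(text_protos), len(text_protos[0])
        self.feat_sum = [[0.0] * d for _ in range(k)]  # soft feature sums
        self.counts = [0.0] * k                        # soft class counts

    def prototypes(self):
        # Closed-form update: text anchor plus accumulated soft feature sums.
        return [
            normalize([self.kappa * t + s
                       for t, s in zip(self.text[c], self.feat_sum[c])])
            for c in range(len(self.text))
        ]

    def priors(self):
        # Posterior mean of a Dirichlet prior given the soft counts.
        total = sum(self.counts) + self.alpha * len(self.counts)
        return [(n + self.alpha) / total for n in self.counts]

    def predict(self, feat):
        f = normalize(feat)
        logits = [
            sum(a * b for a, b in zip(f, proto)) / self.temp + math.log(pi)
            for proto, pi in zip(self.prototypes(), self.priors())
        ]
        return softmax(logits)

    def step(self, feat):
        """Predict, then update statistics only for confident samples."""
        probs = self.predict(feat)
        if max(probs) >= self.conf_thresh:
            f = normalize(feat)
            for c, r in enumerate(probs):
                self.counts[c] += r
                for j, x in enumerate(f):
                    self.feat_sum[c][j] += r * x
        return probs
```

Calling `step` on each incoming feature vector returns the adapted class probabilities and refreshes the statistics in one pass, so the only state kept between samples is one feature sum and one count per class.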
[18] Towards Trustworthy Dermatology MLLMs: A Benchmark and Multimodal Evaluator for Diagnostic Narratives
Yuhao Shen, Jiahe Qian, Shuping Zhang, Zhangtianyi Chen, Tao Lu, Juexiao Zhou
🧩 TL;DR
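This study proposes an evaluation framework that combines the DermBench benchmark with the DermEval automatic evaluator to reliably assess multimodal large language models on dermatology diagnosis, addressing the evaluation bottleneck in clinical deployment.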
📘 Detailed Summary
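Motivation: Multimodal large language models are increasingly used to generate dermatology diagnostic narratives directly from images, but reliable evaluation remains the main bottleneck for responsible clinical deployment. Methods are needed that provide clinically meaningful, reproducible, and scalable assessment.
Method: The authors build DermBench, which pairs 4,000 real-world dermatology images with expert-certified diagnostic narratives and uses an LLM-based judge to score candidate narratives along clinically grounded dimensions. They also train DermEval, a reference-free multimodal evaluator that, given an image and a generated narrative, produces a structured critique together with an overall score and per-dimension ratings.
Result: Experiments on 4,500 diverse cases show that DermBench and DermEval align closely with expert ratings, with mean deviations of 0.251 and 0.117 (out of 5), respectively, and reliably measure the diagnostic ability and trustworthiness of different multimodal LLMs.
Conclusion: The framework gives multimodal dermatology AI systems fine-grained, per-case analysis, which is critical for identifying model limitations and biases, and establishes a standardized evaluation basis for reliable clinical deployment.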
📄 Abstract
Multimodal large language models (LLMs) are increasingly used to generate dermatology diagnostic narratives directly from images. However, reliable evaluation remains the primary bottleneck for responsible clinical deployment. We introduce a novel evaluation framework that combines DermBench, a meticulously curated benchmark, with DermEval, a robust automatic evaluator, to enable clinically meaningful, reproducible, and scalable assessment. We build DermBench, which pairs 4,000 real-world dermatology images with expert-certified diagnostic narratives and uses an LLM-based judge to score candidate narratives across clinically grounded dimensions, enabling consistent and comprehensive evaluation of multimodal models. For individual case assessment, we train DermEval, a reference-free multimodal evaluator. Given an image and a generated narrative, DermEval produces a structured critique along with an overall score and per-dimension ratings. This capability enables fine-grained, per-case analysis, which is critical for identifying model limitations and biases. Experiments on a diverse dataset of 4,500 cases demonstrate that DermBench and DermEval achieve close alignment with expert ratings, with mean deviations of 0.251 and 0.117 (out of 5), respectively, providing reliable measurement of diagnostic ability and trustworthiness across different multimodal LLMs.
[19] Enriching Knowledge Distillation with Cross-Modal Teacher Fusion
Amir M. Mansourian, Amir Mohammad Babaei, Shohreh Kasaei
🧩 TL;DR
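This paper proposes RichKD, a framework that enriches knowledge distillation by fusing a conventional teacher with CLIP's cross-modal knowledge. Multi-prompt textual guidance provides semantically rich supervision, and the method outperforms existing methods on multiple benchmarks while showing stronger robustness.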
📘 Detailed Summary
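Motivation: Existing multi-teacher knowledge distillation methods rely mainly on unimodal visual information, lack knowledge diversity, and overlook the potential of cross-modal representations. In particular, CLIP's vision-language knowledge remains underexplored as a complementary source of supervision.
Method: RichKD is a simple yet effective framework that fuses the logits and features of a conventional teacher with CLIP's outputs. By incorporating CLIP's multi-prompt textual guidance, the fused supervision captures both dataset-specific information and semantically enriched visual cues.
Result: The method consistently outperforms existing baselines across multiple benchmarks. The fused supervision yields more confident and reliable predictions, significantly increasing confident-correct cases and reducing confidently wrong ones. It also refines the entire logit distribution, improving inter-class consistency and distillation quality, and is more robust under distribution shifts and input corruptions.
Conclusion: Fusing cross-modal knowledge can markedly improve knowledge distillation, with CLIP's vision-language representations supplying semantically rich supervision. The method opens a direction for exploiting pretrained cross-modal models in distillation and shows the practical value of a simple fusion strategy.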
📄 Abstract
Multi-teacher knowledge distillation (KD), a more effective technique than traditional single-teacher methods, transfers knowledge from expert teachers to a compact student model using logit or feature matching. However, most existing approaches lack knowledge diversity, as they rely solely on unimodal visual information, overlooking the potential of cross-modal representations. In this work, we explore the use of CLIP's vision-language knowledge as a complementary source of supervision for KD, an area that remains largely underexplored. We propose a simple yet effective framework that fuses the logits and features of a conventional teacher with those from CLIP. By incorporating CLIP's multi-prompt textual guidance, the fused supervision captures both dataset-specific and semantically enriched visual cues. Beyond accuracy, analysis shows that the fused teacher yields more confident and reliable predictions, significantly increasing confident-correct cases while reducing confidently wrong ones. Moreover, fusion with CLIP refines the entire logit distribution, producing semantically meaningful probabilities for non-target classes, thereby improving inter-class consistency and distillation quality. Despite its simplicity, the proposed method, Enriching Knowledge Distillation (RichKD), consistently outperforms most existing baselines across multiple benchmarks and exhibits stronger robustness under distribution shifts and input corruptions.
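As a rough illustration of logit-level teacher fusion, the sketch below averages CLIP logits over several prompts, mixes them with a conventional teacher's logits, and distills the result into a student with a temperature-scaled KL loss. The abstract does not give the fusion rule, so the convex combination, the `clip_weight` parameter, and all function names here are assumptions; feature-level fusion is not shown.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled, numerically stable softmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def average_prompts(per_prompt_logits):
    """Average CLIP logits over prompts (one list of class logits per prompt)."""
    n = len(per_prompt_logits)
    return [sum(col) / n for col in zip(*per_prompt_logits)]


def fuse_logits(teacher_logits, clip_logits, clip_weight=0.5):
    """Assumed fusion rule: convex combination of the two logit vectors."""
    return [(1.0 - clip_weight) * t + clip_weight * c
            for t, c in zip(teacher_logits, clip_logits)]


def kd_loss(student_logits, target_logits, temperature=4.0):
    """Standard KD loss: T^2 * KL(softmax(target/T) || softmax(student/T))."""
    p = softmax(target_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl


# Usage with made-up numbers for a 3-class problem.
clip_logits = average_prompts([[2.0, 1.0, 0.0], [2.4, 0.6, 0.2]])
fused = fuse_logits([3.0, 0.5, 0.1], clip_logits, clip_weight=0.3)
loss = kd_loss([1.0, 1.0, 1.0], fused)
```

Because the fusion happens before the softmax, CLIP's scores for non-target classes shift the whole target distribution, which is the effect the abstract attributes to the fused teacher.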
[20] FQ-PETR: Fully Quantized Position Embedding Transformation for Multi-View 3D Object Detection
Jiangyong Yu, Changyong Shu, Sifan Zhou, Zichen Yu, Xing Hu, Yan Chen, Dawei Yang
🧩 TL;DR
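This paper proposes FQ-PETR, a fully quantized framework for the PETR family of 3D detectors. With a quantization-friendly position embedding, a dual-lookup-table technique, and quantization after numerical stabilization, it achieves near-floating-point accuracy under W8A8 quantization while substantially reducing latency.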
📘 Detailed Summary
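Motivation: PETR-family models perform strongly in multi-view 3D detection for autonomous driving but are hard to deploy because of high computational cost and memory footprint. Directly applying existing quantization methods causes severe accuracy degradation, mainly because of the large magnitude disparity between multi-modal features and the inefficiency and approximation error of quantizing non-linear operators.
Method: FQ-PETR rests on three techniques. Quantization-Friendly LiDAR-ray Position Embedding (QFPE) uses single-point sampling and anchor-based embedding to eliminate problematic non-linear operations. Dual-Lookup Table (DULUT) approximates complex non-linear functions with two cascaded linear lookup tables at high fidelity. Quantization After Numerical Stabilization (QANS) quantizes after softmax numerical stabilization to reduce attention distortion.
Result: On PETR, StreamPETR, PETRv2, MV2d, and related models, FQ-PETR under W8A8 achieves near-floating-point accuracy (about 1% degradation) while reducing latency by up to 75%, significantly outperforming existing PTQ and QAT baselines.
Conclusion: Carefully designed quantization strategies can resolve both the magnitude disparity between multi-modal features and the difficulty of quantizing non-linear operators, giving a practical route to efficient deployment of 3D detectors with substantial compute savings and preserved accuracy.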
📄 Abstract
Camera-based multi-view 3D detection is crucial for autonomous driving. PETR and its variants (PETRs) excel in benchmarks but face deployment challenges due to high computational cost and memory footprint. Quantization is an effective technique for compressing deep neural networks by reducing the bit width of weights and activations. However, directly applying existing quantization methods to PETRs leads to severe accuracy degradation. This issue primarily arises from two key challenges: (1) significant magnitude disparity between multi-modal features, specifically image features and camera-ray positional embeddings (PE), and (2) the inefficiency and approximation error of quantizing non-linear operators, which commonly rely on hardware-unfriendly computations. In this paper, we propose FQ-PETR, a fully quantized framework for PETRs, featuring three key innovations: (1) Quantization-Friendly LiDAR-ray Position Embedding (QFPE): Replacing multi-point sampling with LiDAR-prior-guided single-point sampling and anchor-based embedding eliminates problematic non-linearities (e.g., inverse-sigmoid) and aligns PE scale with image features, preserving accuracy. (2) Dual-Lookup Table (DULUT): This algorithm approximates complex non-linear functions using two cascaded linear LUTs, achieving high fidelity with minimal entries and no specialized hardware. (3) Quantization After Numerical Stabilization (QANS): Performing quantization after softmax numerical stabilization mitigates attention distortion from large inputs. On PETRs (e.g., PETR, StreamPETR, PETRv2, MV2d), FQ-PETR under W8A8 achieves near-floating-point accuracy (1% degradation) while reducing latency by up to 75%, significantly outperforming existing PTQ and QAT baselines.
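The idea behind quantizing after softmax numerical stabilization can be shown on a toy example. The sketch below is an illustration under stated assumptions, not the paper's kernel: it uses symmetric per-tensor 8-bit "fake" quantization, and the function names are hypothetical. Large attention logits force a coarse quantization step when quantized directly; subtracting the row maximum first shrinks the range, so the same 8 bits resolve the differences that softmax actually depends on.

```python
import math


def fake_quantize_int8(values):
    """Symmetric per-tensor 8-bit quantize-dequantize (an assumed scheme)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0.0 else 1.0
    return [round(v / scale) * scale for v in values]


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_quantize_raw(logits):
    """Baseline: quantize the raw logits, then apply softmax."""
    return softmax(fake_quantize_int8(logits))


def softmax_quantize_after_stabilization(logits):
    """Subtract the maximum first, then quantize the shifted logits."""
    m = max(logits)
    shifted = [z - m for z in logits]  # all values are <= 0
    return softmax(fake_quantize_int8(shifted))


def max_abs_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


# Large logits with small differences, as can occur in attention scores.
logits = [100.0, 100.4, 101.0]
exact = softmax(logits)
err_raw = max_abs_error(softmax_quantize_raw(logits), exact)
err_stabilized = max_abs_error(
    softmax_quantize_after_stabilization(logits), exact)
```

On this input the raw path maps the first two logits to the same quantized value, while the stabilized path keeps them distinct, so `err_stabilized` comes out well below `err_raw`.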
[21] Learning by Neighbor-Aware Semantics, Deciding by Open-form Flows: Towards Robust Zero-Shot Skeleton Action Recognition
Yang Chen, Miaoge Li, Zhijie Rao, Deze Zeng, Song Guo, Jingcai Guo
🧩 TL;DR
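This paper proposes Flora, a zero-shot skeleton action recognition method that uses flexible neighbor-aware semantic attunement and an open-form, distribution-aware flow classifier to address the fragile point-to-point alignment and rigid classifier decision boundaries of existing methods.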
📘 Detailed Summary
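Motivation: Existing zero-shot skeleton action recognition methods follow an "align-then-classify" paradigm but face two fundamental issues: fragile point-to-point alignment caused by imperfect semantics, and rigid classifiers restricted by static decision boundaries and coarse-grained anchors.
Method: Flora has two core components. Flexible neighbor-aware semantic attunement incorporates neighboring inter-class contextual cues to form direction-aware regional semantics and, together with a cross-modal geometric consistency objective, ensures stable point-to-region alignment. Noise-free flow matching bridges the modality distribution gap between semantic and skeleton latent embeddings, while a condition-free contrastive regularization enhances discriminability, yielding a distribution-aware classifier whose fine-grained decision boundaries come from token-level velocity predictions.
Result: Extensive experiments on three benchmark datasets validate the method, which performs especially well even when trained with only 10% of the seen data.
Conclusion: The study shows that flexible semantic alignment and a distribution-aware classifier can overcome fragile alignment and classifier rigidity in zero-shot skeleton action recognition, offering a new technical route for cross-modal action recognition.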
📄 Abstract
Recognizing unseen skeleton action categories remains highly challenging due to the absence of corresponding skeletal priors. Existing approaches generally follow an "align-then-classify" paradigm but face two fundamental issues, i.e., (i) fragile point-to-point alignment arising from imperfect semantics, and (ii) rigid classifiers restricted by static decision boundaries and coarse-grained anchors. To address these issues, we propose a novel method for zero-shot skeleton action recognition, termed Flora, which builds upon FLexible neighbOr-aware semantic attunement and open-form distRibution-aware flow clAssifier. Specifically, we flexibly attune textual semantics by incorporating neighboring inter-class contextual cues to form direction-aware regional semantics, coupled with a cross-modal geometric consistency objective that ensures stable and robust point-to-region alignment. Furthermore, we employ noise-free flow matching to bridge the modality distribution gap between semantic and skeleton latent embeddings, while a condition-free contrastive regularization enhances discriminability, leading to a distribution-aware classifier with fine-grained decision boundaries achieved through token-level velocity predictions. Extensive experiments on three benchmark datasets validate the effectiveness of our method, showing particularly impressive performance even when trained with only 10% of the seen data. Code is available at https://github.com/cseeyangchen/Flora.
cs.CL [Back]
[22] Self-HarmLLM: Can Large Language Model Harm Itself?
Heehwan Kim, Sungjune Park, Daeseon Choi
🧩 TL;DR
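This study proposes the Self-HarmLLM attack scenario, in which ambiguous harmful queries generated by the model itself are fed back as new input to bypass its guardrails. Experiments show jailbreak success rates of up to 33% in the zero-shot condition and 41% in the few-shot condition, exposing potential weaknesses in existing defenses.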
📘 Detailed Summary
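Motivation: Existing LLM guardrails mainly assume that attacks come from external malicious queries and overlook the possibility that a model's own output becomes a new attack vector. This self-attack scenario has not been sufficiently explored.
Method: The Self-HarmLLM scenario uses a Mitigated Harmful Query (MHQ) generated by the same model as a new input; an MHQ preserves the original intent while concealing its harmful nature. Experiments were run on GPT-3.5-turbo, LLaMA3-8B-instruct, and DeepSeek-R1-Distill-Qwen-7B under Base, Zero-shot, and Few-shot conditions.
Result: The Zero-shot condition reached up to 52% transformation success and 33% jailbreak success, and the Few-shot condition up to 65% transformation success and 41% jailbreak success. Automated evaluation consistently overestimated jailbreak success, with an average difference of 52%.
Conclusion: Attacks built on a model's own output are a valid attack scenario, which calls for a fundamental reconsideration of guardrail design and for more robust evaluation methodology. Automated evaluation alone is not sufficient to judge harmfulness accurately.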
📄 Abstract
Large Language Models (LLMs) are generally equipped with guardrails to block the generation of harmful responses. However, existing defenses always assume that an external attacker crafts the harmful query, and the possibility of a model's own output becoming a new attack vector has not been sufficiently explored. In this study, we propose the Self-HarmLLM scenario, which uses a Mitigated Harmful Query (MHQ) generated by the same model as a new input. An MHQ is an ambiguous query whose original intent is preserved while its harmful nature is not directly exposed. We verified whether a jailbreak occurs when this MHQ is re-entered into a separate session of the same model. We conducted experiments on GPT-3.5-turbo, LLaMA3-8B-instruct, and DeepSeek-R1-Distill-Qwen-7B under Base, Zero-shot, and Few-shot conditions. The results showed up to 52% transformation success rate and up to 33% jailbreak success rate in the Zero-shot condition, and up to 65% transformation success rate and up to 41% jailbreak success rate in the Few-shot condition. By performing both prefix-based automated evaluation and human evaluation, we found that the automated evaluation consistently overestimated jailbreak success, with an average difference of 52%. This indicates that automated evaluation alone is not accurate for determining harmfulness. While this study is a toy-level study based on a limited query set and evaluators, it proves that our method can still be a valid attack scenario. These results suggest the need for a fundamental reconsideration of guardrail design and the establishment of a more robust evaluation methodology.
[23] HalluClean: A Unified Framework to Combat Hallucinations in LLMs
Yaxin Zhao, Yu Zhang
🧩 TL;DR
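HalluClean is a lightweight, task-agnostic framework that detects and corrects hallucinations in LLM-generated text through a reasoning-enhanced paradigm, significantly improving factual consistency without external knowledge sources or supervised detectors.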
📘 Detailed Summary
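Motivation: Large language models perform well across many natural language processing tasks but often produce hallucinated content that seriously undermines factual reliability, so effective methods for detecting and correcting hallucinations are needed.
Method: HalluClean adopts a reasoning-enhanced paradigm that explicitly decomposes the process into planning, execution, and revision stages to identify and refine unsupported claims, and uses minimal task-routing prompts to achieve zero-shot generalization across domains.
Result: Extensive evaluation on five representative tasks (question answering, dialogue, summarization, math word problems, and contradiction detection) shows that HalluClean significantly improves factual consistency and outperforms competitive baselines.
Conclusion: The study demonstrates HalluClean's potential to make LLM outputs more trustworthy, providing an effective lightweight remedy for hallucination in real-world applications that requires neither external resources nor supervised training.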
📄 Abstract
Large language models (LLMs) have achieved impressive performance across a wide range of natural language processing tasks, yet they often produce hallucinated content that undermines factual reliability. To address this challenge, we introduce HalluClean, a lightweight and task-agnostic framework for detecting and correcting hallucinations in LLM-generated text. HalluClean adopts a reasoning-enhanced paradigm, explicitly decomposing the process into planning, execution, and revision stages to identify and refine unsupported claims. It employs minimal task-routing prompts to enable zero-shot generalization across diverse domains, without relying on external knowledge sources or supervised detectors. We conduct extensive evaluations on five representative tasks: question answering, dialogue, summarization, math word problems, and contradiction detection. Experimental results show that HalluClean significantly improves factual consistency and outperforms competitive baselines, demonstrating its potential to enhance the trustworthiness of LLM outputs in real-world applications.
[24] MM-CRITIC: A Holistic Evaluation of Large Multimodal Models as Multimodal Critique
Gailun Zeng, Ziyang Luo, Hongzhan Lin, Yuchen Tian, Kaixin Li, Ziyang Gong, Jianxiong Guo, Jing Ma
🧩 TL;DR
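This paper presents MM-CRITIC, the first benchmark to comprehensively evaluate the critique ability of large multimodal models. It covers 8 main task types and over 500 tasks with 4,471 samples, providing a systematic evaluation framework for research on multimodal critique.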
📘 Detailed Summary
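Motivation: Although large multimodal models are increasingly capable on tasks such as captioning and visual reasoning, their multimodal critique ability remains underexplored, even though critique is vital for models to self-improve and serve as reliable AI assistants. Existing research focuses mainly on critique in language-only settings.
Method: MM-CRITIC evaluates critique ability along three dimensions: basic, correction, and comparison. Expert-informed reference answers are integrated into scoring rubrics that guide GPT-4o in annotating responses and generating reference critiques, which serve as anchors for trustworthy judgments. The benchmark collects responses from LMMs of different model sizes.
Result: Extensive experiments validate MM-CRITIC and provide a comprehensive assessment of leading LMMs' critique capabilities. Further analysis reveals a correlation between response quality and critique, and shows how critique difficulty varies across evaluation dimensions.
Conclusion: The study provides a standardized benchmark for evaluating multimodal critique, reveals the current state and challenges of LMM critique ability, and points to directions for model improvement and future research. The code is open-sourced.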
📄 Abstract
The ability to critique is vital for models to self-improve and serve as reliable AI assistants. While extensively studied in language-only settings, multimodal critique of Large Multimodal Models (LMMs) remains underexplored despite their growing capabilities in tasks like captioning and visual reasoning. In this work, we introduce MM-CRITIC, a holistic benchmark for evaluating the critique ability of LMMs across multiple dimensions: basic, correction, and comparison. Covering 8 main task types and over 500 tasks, MM-CRITIC collects responses from various LMMs with different model sizes and is composed of 4,471 samples. To enhance the evaluation reliability, we integrate expert-informed ground answers into scoring rubrics that guide GPT-4o in annotating responses and generating reference critiques, which serve as anchors for trustworthy judgments. Extensive experiments validate the effectiveness of MM-CRITIC and provide a comprehensive assessment of leading LMMs' critique capabilities under multiple dimensions. Further analysis reveals some key insights, including the correlation between response quality and critique, and varying critique difficulty across evaluation dimensions. Our code is available at https://github.com/MichealZeng0420/MM-Critic.
[25] A Hybrid Search for Complex Table Question Answering in Securities Report
Daiki Shirafuji, Koji Tanaka, Tatsuhiko Saito
🧩 TL;DR
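This paper proposes a cell extraction method for table question answering that needs no manual identification. A hybrid retrieval mechanism computes similarities between the question and individual cells to identify even complex table headers automatically. On the TQA dataset from the NTCIR-18 U4 shared task it achieves 74.6% accuracy, outperforming existing LLMs such as GPT-4o mini.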
📘 Detailed Summary
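Motivation: Feeding an entire table into a large language model as long text often leads to incorrect answers, because most LLMs cannot capture complex table structures, and complex headers in particular are not identified automatically.
Method: The proposed cell extraction method is based on hybrid retrieval: it combines a language model with TF-IDF to compute similarities between the question and individual cells and thereby estimate the table headers. The language model is trained with contrastive learning on a small dataset of question-header pairs, and the cells at the intersection of the most relevant row and column are selected as the answer.
Result: On the TQA dataset from the NTCIR-18 U4 shared task, the method achieves 74.6% accuracy, well above the 63.9% of GPT-4o mini, demonstrating the effectiveness of hybrid retrieval for understanding table structure.
Conclusion: The method handles complex table structures without manual intervention. Retrieval currently relies on traditional encoder models; the authors plan to incorporate more efficient text-search models to improve performance and narrow the gap with human evaluation results.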
📄 Abstract
Large Language Models (LLMs) have recently gained increased attention in Table Question Answering (TQA), particularly for extracting information from tables in documents. However, feeding entire tables into LLMs as long text often leads to incorrect answers because most LLMs cannot inherently capture complex table structures. In this paper, we propose a cell extraction method for TQA that requires no manual identification, even for complex table headers. Our approach estimates table headers by computing similarities between a given question and individual cells via a hybrid retrieval mechanism that integrates a language model and TF-IDF. We then select as the answer the cells at the intersection of the most relevant row and column. Furthermore, the language model is trained using contrastive learning on a small dataset of question-header pairs to enhance performance. We evaluated our approach on the TQA dataset from the U4 shared task at NTCIR-18. The experimental results show that our pipeline achieves an accuracy of 74.6%, outperforming existing LLMs such as GPT-4o mini (63.9%). Although we used traditional encoder models for retrieval in this study, we plan to incorporate more efficient text-search models in future work to improve performance and narrow the gap with human evaluation results.
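The row-and-column intersection step can be sketched in a few lines. This is a toy illustration, not the authors' pipeline: the table is assumed to already be split into row headers, column headers, and a cell grid, and `char_bigram_vector` is a hypothetical stand-in for the contrastively trained language model, which a real system would replace with sentence embeddings. The equal weighting in `lm_weight` is likewise an assumption.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def tfidf_vectors(docs):
    """Sparse TF-IDF vectors (dicts) with smoothed IDF over `docs`."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()}
            for toks in tokenized]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def char_bigram_vector(text):
    """Stand-in for a dense LM embedding (assumption, not the paper's model)."""
    s = text.lower()
    return dict(Counter(s[i:i + 2] for i in range(len(s) - 1)))


def hybrid_scores(question, headers, lm_weight=0.5):
    """Weighted sum of the LM-style similarity and the TF-IDF similarity."""
    vecs = tfidf_vectors([question] + headers)
    q_sparse, header_sparse = vecs[0], vecs[1:]
    q_dense = char_bigram_vector(question)
    return [
        lm_weight * cosine(q_dense, char_bigram_vector(h))
        + (1.0 - lm_weight) * cosine(q_sparse, hv)
        for h, hv in zip(headers, header_sparse)
    ]


def answer_cell(question, row_headers, col_headers, cells):
    """Return the cell where the best-scoring row and column intersect."""
    row_scores = hybrid_scores(question, row_headers)
    col_scores = hybrid_scores(question, col_headers)
    r = max(range(len(row_scores)), key=row_scores.__getitem__)
    c = max(range(len(col_scores)), key=col_scores.__getitem__)
    return cells[r][c]


# Usage on a made-up two-by-two table.
rows = ["net sales", "operating income"]
cols = ["fiscal 2022", "fiscal 2023"]
grid = [["100", "120"], ["10", "15"]]
result = answer_cell("what was operating income in fiscal 2023",
                     rows, cols, grid)
```

Scoring rows and columns independently keeps the search linear in the number of headers rather than in the number of cells, which is one plausible reason to prefer the intersection formulation over scoring every cell.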
[26] Context is Enough: Empirical Validation of Sequentiality on Essays
Amal Sunny, Advay Gupta, Vishnu Sreekumar
🧩 TL;DR
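This study empirically validates a context-based sequentiality measure for automated essay scoring. The measure aligns better with human assessments of discourse coherence than the original topic-plus-context version and, combined with standard linguistic features, outperforms zero-shot LLM predictions.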
📘 Detailed Summary
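Motivation: Prior work proposed using LLMs to quantify narrative flow through a sequentiality measure, but its validity has been questioned: the way topics are selected may confound results, and the metric had not been validated against ground-truth measures of flow. This study empirically validates sequentiality computed from the contextual term alone as a more conceptually valid and interpretable alternative.
Method: Using two essay datasets with human-annotated trait scores, ASAP++ and ELLIPSE, the authors compare the contextual version of sequentiality against human assessments of discourse-level traits such as Organization and Cohesion. They combine the contextual measure with standard linguistic features and compare the result with predictions from zero-shot prompted LLMs.
Result: The contextual version of sequentiality aligns more closely with human assessments of Organization and Cohesion. Zero-shot prompted LLMs predict trait scores more accurately than the contextual measure alone, but when combined with standard linguistic features the contextual measure adds more predictive value than both the topic-only and original sequentiality formulations, and this combination also outperforms the zero-shot LLM predictions.
Conclusion: The findings support context-based sequentiality as a validated, interpretable, and complementary feature for automated essay scoring and related NLP tasks. Explicitly modeling sentence-to-sentence flow is valuable, and contextual sequentiality is a more reliable and interpretable measure of discourse coherence than the hybrid formulation.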
📄 Abstract
Recent work has proposed using Large Language Models (LLMs) to quantify narrative flow through a measure called sequentiality, which combines topic and contextual terms. A recent critique argued that the original results were confounded by how topics were selected for the topic-based component, and noted that the metric had not been validated against ground-truth measures of flow. That work proposed using only the contextual term as a more conceptually valid and interpretable alternative. In this paper, we empirically validate that proposal. Using two essay datasets with human-annotated trait scores, ASAP++ and ELLIPSE, we show that the contextual version of sequentiality aligns more closely with human assessments of discourse-level traits such as Organization and Cohesion. While zero-shot prompted LLMs predict trait scores more accurately than the contextual measure alone, the contextual measure adds more predictive value than both the topic-only and original sequentiality formulations when combined with standard linguistic features. Notably, this combination also outperforms the zero-shot LLM predictions, highlighting the value of explicitly modeling sentence-to-sentence flow. Our findings support the use of context-based sequentiality as a validated, interpretable, and complementary feature for automated essay scoring and related NLP tasks.
[27] POTSA: A Cross-Lingual Speech Alignment Framework for Low Resource Speech-to-Text Translation
Xuanchen Li, Chenrui Cui, Tianrui Wang, Meng Ge, Zikang Huang, Jin Li, Yizhou Peng, Longbiao Wang, Jianwu Dang, Nyima Tashi
🧩 TL;DR
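This paper proposes POTSA, a framework built on cross-lingual parallel speech pairs and optimal transport. A Bias Compensation module and token-level OT constraints bridge the translation gap between high- and low-resource languages, achieving SOTA performance on the FLEURS dataset.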
📘 Detailed Summary
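Motivation: Existing speech large language models for multilingual speech-to-text translation often overlook semantic commonalities across source languages, which leads to biased translation performance; in particular, the gap between high- and low-resource languages has not been effectively addressed.
Method: POTSA first introduces a Bias Compensation module that coarsely aligns initial speech representations across languages. It then imposes token-level optimal transport constraints, based on parallel speech pairs, on a Q-Former to establish fine-grained consistency of representations, and applies a layer scheduling strategy that focuses the OT constraints on the most semantically beneficial layers.
Result: Experiments on the FLEURS dataset show average gains of +0.93 BLEU over five common languages and +5.05 BLEU on zero-shot languages, using only 10 hours of parallel speech per source language.
Conclusion: By combining bias compensation with optimal transport constraints, POTSA addresses semantic alignment in cross-lingual speech translation and offers a new way to balance translation performance between high- and low-resource languages, with clear practical value.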
📄 Abstract
Speech Large Language Models (SpeechLLMs) have achieved breakthroughs in multilingual speech-to-text translation (S2TT). However, existing approaches often overlook semantic commonalities across source languages, leading to biased translation performance. In this work, we propose POTSA (Parallel Optimal Transport for Speech Alignment), a new framework based on cross-lingual parallel speech pairs and Optimal Transport (OT), designed to bridge high- and low-resource translation gaps. First, we introduce a Bias Compensation module to coarsely align initial speech representations across languages. Second, we impose token-level OT constraints on a Q-Former using parallel speech pairs to establish fine-grained consistency of representations. Third, we apply a layer scheduling strategy to focus OT constraints on the most semantically beneficial layers. Experiments on the FLEURS dataset show that our method achieves SOTA performance, with +0.93 BLEU on average over five common languages and +5.05 BLEU on zero-shot languages, using only 10 hours of parallel speech per source language.
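A token-level OT constraint of this kind is commonly computed with entropic regularization and Sinkhorn iterations. The sketch below shows that generic recipe on two small sets of token vectors. It is an assumption-laden illustration rather than the paper's loss: the cosine cost, uniform token weights, `epsilon`, the iteration count, and the function names are all choices made here, and real inputs would be Q-Former outputs for a parallel speech pair.

```python
import math


def cosine_cost(x, y):
    """Cost = 1 - cosine similarity (1.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    if nx == 0.0 or ny == 0.0:
        return 1.0
    return 1.0 - dot / (nx * ny)


def sinkhorn(cost, epsilon=0.1, n_iters=200):
    """Entropic OT plan between uniform marginals via Sinkhorn scaling."""
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # uniform weight on each source token
    b = [1.0 / m] * m  # uniform weight on each target token
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m))
             for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)]
            for i in range(n)]


def ot_alignment_loss(tokens_a, tokens_b, epsilon=0.1):
    """Transport cost <plan, cost> between two token sequences."""
    cost = [[cosine_cost(x, y) for y in tokens_b] for x in tokens_a]
    plan = sinkhorn(cost, epsilon=epsilon)
    return sum(plan[i][j] * cost[i][j]
               for i in range(len(cost)) for j in range(len(cost[0])))


# The same two tokens in a different order: the plan matches them up.
aligned_loss = ot_alignment_loss([[1.0, 0.0], [0.0, 1.0]],
                                 [[0.0, 1.0], [1.0, 0.0]])
# Orthogonal token sets: every pairing costs 1.
mismatched_loss = ot_alignment_loss([[1.0, 0.0], [1.0, 0.0]],
                                    [[0.0, 1.0], [0.0, 1.0]])
```

Because the plan is free to match tokens in any order, the loss is low whenever the two utterances contain similar tokens, even if their positions differ, which suits parallel speech in languages with different word order.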
[28] mmJEE-Eval: A Bilingual Multimodal Benchmark for Evaluating Scientific Reasoning in Vision-Language Models
Arka Mukherjee, Shreya Ghosh
🧩 TL;DR
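This paper introduces mmJEE-Eval, a multimodal bilingual benchmark for evaluating the scientific reasoning of vision-language models, and reveals a significant performance gap between frontier closed-source models and open-source models on complex reasoning tasks.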
📘 Detailed Summary
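Motivation: Existing vision-language models perform well on multimodal reasoning benchmarks, but those results do not sufficiently distinguish true scientific reasoning from pattern matching, leaving an evaluation gap.
Method: The authors build a multimodal bilingual benchmark of 1,460 questions from India's JEE Advanced examination, covering Physics, Chemistry, and Mathematics, and systematically evaluate 17 state-of-the-art models.
Result: Frontier closed-source models reach 77-84% accuracy on the 2025 questions, whereas open-source models reach only 37-45%. The frontier models also collapse when the reasoning load is increased meta-cognitively.
Conclusion: The benchmark separates training and reasoning methodologies of different quality and exposes the limitations of current models on complex scientific reasoning, providing an important evaluation tool for future research.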
📄 Abstract
Contemporary vision-language models (VLMs) perform well on existing multimodal reasoning benchmarks (78-85% accuracy on MMMU, MathVista). Yet, these results fail to sufficiently distinguish true scientific reasoning articulation capabilities from pattern-matching. To address this gap, we introduce mmJEE-Eval, a multimodal bilingual (English and Hindi) benchmark comprising 1,460 questions from India's JEE Advanced examination (2019-2025) spanning pre-college Physics, Chemistry, and Mathematics domains. Our evaluation of 17 state-of-the-art models reveals that while frontier VLMs (GPT-5, Gemini 2.5 Pro/Flash) achieve 77-84% accuracy on held-out 2025 questions, open-source models plateau at 37-45% despite scaling to 400B parameters, a significant difference not observed on existing benchmarks. While closed frontier models from Google and OpenAI show high problem-solving accuracies (up to 100% pass@3 scores), they fully collapse when the reasoning load is increased meta-cognitively (GPT-5 fixes just 5.2% of errors). Systematic ablations show mmJEE-Eval's difficulty stems from complexity and reasoning depth rather than memorization. Effectively, our benchmark segregates superior training and reasoning methodologies where alternatives fail. We publicly release our code and data: https://mmjee-eval.github.io
[29] Multimodal Large Language Models for Low-Resource Languages: A Case Study for Basque
Lukas Arana, Julen Etxaniz, Ander Salaberria, Gorka Azkune
🧩 TL;DR
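This study develops a multimodal large language model for Basque, a low-resource language. By building dedicated training and evaluation datasets, it shows that about 20% Basque multimodal data is enough for solid performance and that a Basque-instructed backbone is not required.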
📘 Detailed Summary
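Motivation: Current multimodal large language models perform very well on demanding tasks, but the open-source community still cannot match commercial models in low-resource languages; languages such as Basque in particular lack strong multimodal models.
Method: Using Llama-3.1-Instruct and the Basque-adapted variant Latxa as backbones, the authors build dedicated image-text training and evaluation datasets and explore training with several data mixtures.
Result: Experiments show that about 20% Basque multimodal data is already enough to obtain solid results on Basque benchmarks and, contrary to expectations, that a Basque-instructed backbone LLM is not required to build a strong Basque MLLM.
Conclusion: The study offers a workable path for developing multimodal large language models for other low-resource languages, and its openly released resources support open-source work on low-resource multimodal modeling.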
📄 Abstract
Current Multimodal Large Language Models (MLLMs) exhibit very strong performance on several demanding tasks. While commercial MLLMs deliver acceptable performance in low-resource languages, comparable results remain unattained within the open science community. In this paper, we aim to develop a strong MLLM for a low-resource language, namely Basque. For that purpose, we develop our own training and evaluation image-text datasets. Using two different Large Language Models as backbones, the Llama-3.1-Instruct model and a Basque-adapted variant called Latxa, we explore several data mixtures for training. We show that: i) low ratios of Basque multimodal data (around 20%) are already enough to obtain solid results on Basque benchmarks, and ii) contrary to expectations, a Basque instructed backbone LLM is not required to obtain a strong MLLM in Basque. Our results pave the way to develop MLLMs for other low-resource languages by openly releasing our resources.
cs.AI [Back]
[30] Lumine: An Open Recipe for Building Generalist Agents in 3D Open Worlds
Weihao Tan, Xiangyang Li, Yunhao Fang, Heyuan Yao, Shi Yan, Hao Luo, Tenglong Ao, Huihui Li, Hongbin Ren, Bairen Yi, Yujia Qin, Bo An, Libin Liu, Guang Shi
🧩 TL;DR
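Lumine is the first open recipe for generalist agents that can complete hours-long complex missions, interacting in real time within 3D open-world environments. An end-to-end vision-language model unifies perception, reasoning, and action, and the agent shows strong zero-shot cross-game generalization.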
📘 Detailed Summary
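Motivation: The study targets the challenge of having generalist agents carry out long-duration missions in complex 3D open-world environments: existing methods struggle to unify perception, reasoning, and action during real-time interaction and lack cross-domain generalization.
Method: Lumine adopts a human-like interaction paradigm with end-to-end processing by a vision-language model. It processes raw pixels at 5 Hz, produces 30 Hz keyboard-mouse actions, adaptively invokes reasoning only when necessary, and is trained in Genshin Impact.
Result: Lumine completes the five-hour Mondstadt main storyline of Genshin Impact with human-level efficiency and generalizes zero-shot to Wuthering Waves and Honkai: Star Rail, completing 100-minute missions and the five-hour first chapter, respectively.
Conclusion: Lumine demonstrates that generalist agents in open-ended environments are feasible. Its cross-game generalization suggests the approach is broadly applicable, marking a concrete step toward truly general open-world agents.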
📄 Abstract
We introduce Lumine, the first open recipe for developing generalist agents capable of completing hours-long complex missions in real time within challenging 3D open-world environments. Lumine adopts a human-like interaction paradigm that unifies perception, reasoning, and action in an end-to-end manner, powered by a vision-language model. It processes raw pixels at 5 Hz to produce precise 30 Hz keyboard-mouse actions and adaptively invokes reasoning only when necessary. Trained in Genshin Impact, Lumine successfully completes the entire five-hour Mondstadt main storyline on par with human-level efficiency and follows natural language instructions to perform a broad spectrum of tasks in both 3D open-world exploration and 2D GUI manipulation across collection, combat, puzzle-solving, and NPC interaction. In addition to its in-domain performance, Lumine demonstrates strong zero-shot cross-game generalization. Without any fine-tuning, it accomplishes 100-minute missions in Wuthering Waves and the full five-hour first chapter of Honkai: Star Rail. These promising results highlight Lumine's effectiveness across distinct worlds and interaction dynamics, marking a concrete step toward generalist agents in open-ended environments.
[31] History-Aware Reasoning for GUI Agents
Ziwei Wang, Leyang Yang, Xiaoxuan Tang, Sheng Zhou, Dajun Chen, Wei Jiang, Yong Li
🧩 TL;DR
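This paper proposes a History-Aware Reasoning framework that strengthens the short-term memory of GUI agents through reflective learning, shifting their reasoning mode from history-agnostic to history-aware and improving performance on long-horizon GUI tasks.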
📘 Detailed Summary
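Motivation: Current native GUI agents show weak short-term memory in their explicit reasoning: they interpret chained interactions as discrete screen understanding and are unaware of the historical interactions within an episode. This history-agnostic reasoning hurts GUI automation performance.
Method: The History-Aware Reasoning framework mainly comprises constructing a reflective learning scenario, synthesizing tailored correction guidelines, and designing a hybrid RL reward function. The agent reflects on its own errors and acquires episodic reasoning knowledge from them, which enhances short-term memory in long-horizon interaction.
Result: HAR-GUI-3B, the model developed with this framework, performs well on a range of GUI-related benchmarks, demonstrating the method's effectiveness and generalization and giving the GUI agent stable short-term memory and reliable perception of screen details.
Conclusion: The History-Aware Reasoning framework addresses the short-term memory bottleneck of GUI agents, offers a new solution for automating long-horizon GUI tasks, and shows the potential of reflective learning for strengthening reasoning.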
📄 Abstract
Advances in Multimodal Large Language Models have significantly enhanced Graphical User Interface (GUI) automation. Equipping GUI agents with reliable episodic reasoning capabilities is essential for bridging the gap between users' concise task descriptions and the complexities of real-world execution. Current methods integrate Reinforcement Learning (RL) with System-2 Chain-of-Thought, yielding notable gains in reasoning enhancement. For long-horizon GUI tasks, historical interactions connect each screen to the goal-oriented episode chain, and effectively leveraging these clues is crucial for the current decision. However, existing native GUI agents exhibit weak short-term memory in their explicit reasoning, interpreting the chained interactions as discrete screen understanding, i.e., unawareness of the historical interactions within the episode. This history-agnostic reasoning challenges their performance in GUI automation. To alleviate this weakness, we propose a History-Aware Reasoning (HAR) framework, which encourages an agent to reflect on its own errors and acquire episodic reasoning knowledge from them via tailored strategies that enhance short-term memory in long-horizon interaction. The framework mainly comprises constructing a reflective learning scenario, synthesizing tailored correction guidelines, and designing a hybrid RL reward function. Using the HAR framework, we develop a native end-to-end model, HAR-GUI-3B, which alters the inherent reasoning mode from history-agnostic to history-aware, equipping the GUI agent with stable short-term memory and reliable perception of screen details. Comprehensive evaluations across a range of GUI-related benchmarks demonstrate the effectiveness and generalization of our method.
[32] CrochetBench: Can Vision-Language Models Move from Describing to Doing in Crochet Domain?
Peiyu Li, Xiaobao Huang, Nitesh V. Chawla
🧩 TL;DR
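This paper introduces CrochetBench, a benchmark for evaluating how well multimodal large language models perform fine-grained, low-level procedural reasoning in the crochet domain. It shifts the emphasis from describing to doing: models must recognize stitches, select structurally appropriate instructions, and generate compilable crochet procedures.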
📘 Detailed Summary
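Motivation: Existing benchmarks focus on high-level description or visual question answering and do not evaluate the procedural reasoning of multimodal models in real-world creative domains. CrochetBench targets this gap, focusing on the move from surface-level understanding to executable precision.
Method: The benchmark adopts the CrochetPARADE domain-specific language as its intermediate representation, which enables structural validation and functional evaluation via execution. Tasks cover stitch classification, instruction grounding, and both natural-language-to-DSL and image-to-DSL translation.
Result: Across all tasks, performance declines sharply as evaluation shifts from surface-level similarity to executable correctness, exposing limitations in long-range symbolic reasoning and 3D-aware procedural synthesis.
Conclusion: CrochetBench offers a new lens on the procedural competence of multimodal models and highlights the gap between surface-level understanding and executable precision in real-world creative domains, pointing to where procedural synthesis needs to improve.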
📄 Abstract
We present CrochetBench, a benchmark for evaluating the ability of multimodal large language models to perform fine-grained, low-level procedural reasoning in the domain of crochet. Unlike prior benchmarks that focus on high-level description or visual question answering, CrochetBench shifts the emphasis from describing to doing: models are required to recognize stitches, select structurally appropriate instructions, and generate compilable crochet procedures. We adopt the CrochetPARADE DSL as our intermediate representation, enabling structural validation and functional evaluation via execution. The benchmark covers tasks including stitch classification, instruction grounding, and both natural language and image-to-DSL translation. Across all tasks, performance sharply declines as the evaluation shifts from surface-level similarity to executable correctness, exposing limitations in long-range symbolic reasoning and 3D-aware procedural synthesis. CrochetBench offers a new lens for assessing procedural competence in multimodal models and highlights the gap between surface-level understanding and executable precision in real-world creative domains. Code is available at https://github.com/Peiyu-Georgia-Li/crochetBench.