cs.CV
[1] Understanding Virality: A Rubric based Vision-Language Model Framework for Short-Form Edutainment Evaluation
Arnav Gupta, Gurekas Singh Sahney, Hardik Rathi, Abhishek Chandwani, Ishaan Gupta, Pratik Narang, Dhruv Kumar
🧩 TL;DR
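This study proposes a data-driven evaluation framework built on vision-language models: it extracts unsupervised audiovisual features and trains a regression-based evaluator to predict short-form video engagement, giving interpretable and scalable assessment of video content.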
📘 Detailed Summary
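Motivation: Existing video evaluation frameworks such as VideoScore-2 focus on visual and semantic fidelity but do not capture how specific audiovisual attributes drive real audience engagement. Evaluation needs multimodal reasoning that is better aligned with human perception.
Method: The method uses vision-language models to extract unsupervised audiovisual features, clusters them into interpretable factors, and trains a regression-based evaluator to predict engagement on short-form videos. A curated YouTube Shorts dataset supports systematic analysis.
Result: Experiments show strong correlations between predicted and actual engagement. Compared with traditional metrics (e.g., SSIM, FID), the lightweight feature-based evaluator offers more interpretable and scalable assessment.
Conclusion: By grounding evaluation in multimodal feature importance and human-centered engagement signals, the approach advances robust and explainable video understanding and gives content creators and platforms a more effective evaluation tool.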
📄 Abstract
Evaluating short-form video content requires moving beyond surface-level quality metrics toward human-aligned, multimodal reasoning. While existing frameworks like VideoScore-2 assess visual and semantic fidelity, they do not capture how specific audiovisual attributes drive real audience engagement. In this work, we propose a data-driven evaluation framework that uses Vision-Language Models (VLMs) to extract unsupervised audiovisual features, clusters them into interpretable factors, and trains a regression-based evaluator to predict engagement on short-form edutainment videos. Our curated YouTube Shorts dataset enables systematic analysis of how VLM-derived features relate to human engagement behavior. Experiments show strong correlations between predicted and actual engagement, demonstrating that our lightweight, feature-based evaluator provides interpretable and scalable assessments compared to traditional metrics (e.g., SSIM, FID). By grounding evaluation in both multimodal feature importance and human-centered engagement signals, our approach advances toward robust and explainable video understanding.
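A minimal sketch of the last two steps named in the abstract (fit a regression-based evaluator, then correlate predicted and actual engagement). It assumes a single hypothetical factor score per video; the paper's evaluator uses multiple VLM-derived factors, and its exact regressor is not stated here. The data values and the helper names `fit_line` and `pearson` are illustrative assumptions, not taken from the paper.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical data: one interpretable factor score and an engagement rate per video.
factor_scores = [0.1, 0.4, 0.5, 0.8, 0.9]
engagement = [0.02, 0.05, 0.04, 0.09, 0.10]

slope, intercept = fit_line(factor_scores, engagement)
predicted = [slope * x + intercept for x in factor_scores]
r = pearson(predicted, engagement)
print(f"slope={slope:.3f}, correlation={r:.3f}")
```

With several factors, the same idea carries over to multiple linear regression, where the fitted coefficients give a rough per-factor importance reading.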
[2] A Tool Bottleneck Framework for Clinically-Informed and Interpretable Medical Image Understanding
Christina Liu, Alan Q. Wang, Joy Hsu, Jiajun Wu, Ehsan Adeli
🧩 TL;DR
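This paper proposes the Tool Bottleneck Framework (TBF), a tool-use framework for medical image understanding that fuses the outputs of VLM-selected tools with a learned Tool Bottleneck Model (TBM), addressing the limitations of text-based tool composition on medical images.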
📘 Detailed Summary
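Motivation: Current tool-use frameworks based on vision-language models (VLMs) mostly compose tool calls through text (code or natural language). They perform poorly on medical image understanding, where key information is encoded as spatially localized features that are hard to compose or fuse through text alone.
Method: TBF is built around the Tool Bottleneck Model (TBM). An off-the-shelf medical VLM selects tools from a toolbox, each of which extracts clinically relevant features. The TBM, a neural network, then computes and fuses the tool outputs instead of composing them through text. A simple and effective strategy lets the TBM produce a final prediction for any arbitrary VLM tool selection.
Result: On histopathology and dermatology tasks, TBF performs on par with or better than deep learning-based classifiers, VLMs, and state-of-the-art tool-use frameworks, with particular gains in data-limited regimes. It improves tool use in medical imaging and also yields more interpretable, clinically grounded predictors.
Conclusion: By fusing tool outputs with a learned neural network, TBF overcomes the limitations of text-based composition in medical image understanding, offers a stronger and more interpretable option for data-limited medical image analysis, and extends tool-use frameworks to specialized domains.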
📄 Abstract
Recent tool-use frameworks powered by vision-language models (VLMs) improve image understanding by grounding model predictions with specialized tools. Broadly, these frameworks leverage VLMs and a pre-specified toolbox to decompose the prediction task into multiple tool calls (often deep learning models) which are composed to make a prediction. The dominant approach to composing tools is using text, via function calls embedded in VLM-generated code or natural language. However, these methods often perform poorly on medical image understanding, where salient information is encoded as spatially-localized features that are difficult to compose or fuse via text alone. To address this, we propose a tool-use framework for medical image understanding called the Tool Bottleneck Framework (TBF), which composes VLM-selected tools using a learned Tool Bottleneck Model (TBM). For a given image and task, TBF leverages an off-the-shelf medical VLM to select tools from a toolbox that each extract clinically-relevant features. Instead of text-based composition, these tools are composed by the TBM, which computes and fuses the tool outputs using a neural network before outputting the final prediction. We propose a simple and effective strategy for TBMs to make predictions with any arbitrary VLM tool selection. Overall, our framework not only improves tool-use in medical imaging contexts, but also yields more interpretable, clinically-grounded predictors. We evaluate TBF on tasks in histopathology and dermatology and find that these advantages enable our framework to perform on par with or better than deep learning-based classifiers, VLMs, and state-of-the-art tool-use frameworks, with particular gains in data-limited regimes. Our code is available at https://github.com/christinaliu2020/tool-bottleneck-framework.
[3] Intelligent recognition of GPR road hidden defect images based on feature fusion and attention mechanism
Haotian Lv, Yuhui Zhang, Jiangbo Dai, Hanli Wu, Jiaji Wang, Dawei Wang
🧩 TL;DR
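This study proposes a comprehensive framework for defect detection in ground penetrating radar (GPR) images. It combines DCGAN-based data augmentation, the Multi-modal Chain and Global Attention Network (MCGA-Net), and transfer learning to detect defects automatically, efficiently, and accurately in complex subsurface environments.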
📘 Detailed Summary
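Motivation: Conventional GPR image interpretation relies heavily on subjective expert judgment, which makes it inefficient and inaccurate. Data scarcity and defect detection against complex backgrounds add further difficulty, motivating automated and robust detection methods.
Method: The framework has three components: a DCGAN-based data augmentation strategy that synthesizes high-fidelity GPR images to mitigate data scarcity; the Multi-modal Chain and Global Attention Network (MCGA-Net), which integrates Multi-modal Chain Feature Fusion (MCFF) for hierarchical multi-scale defect representation and a Global Attention Mechanism (GAM) for context-aware feature enhancement; and MS COCO transfer learning, which fine-tunes the backbone network to accelerate convergence and improve generalization.
Result: MCGA-Net achieves 92.8% precision, 92.5% recall, and 95.9% mAP@50. It remains robust under Gaussian noise, weak signals, and small targets, and outperforms the other models compared, which validates the framework.
Conclusion: The work establishes a new paradigm for automated GPR-based defect detection that balances computational efficiency with high accuracy in complex subsurface environments, and addresses the subjectivity and data scarcity of conventional GPR image interpretation.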
📄 Abstract
Ground Penetrating Radar (GPR) has emerged as a pivotal tool for non-destructive evaluation of subsurface road defects. However, conventional GPR image interpretation remains heavily reliant on subjective expertise, introducing inefficiencies and inaccuracies. This study introduces a comprehensive framework to address these limitations: (1) A DCGAN-based data augmentation strategy synthesizes high-fidelity GPR images to mitigate data scarcity while preserving defect morphology under complex backgrounds; (2) A novel Multi-modal Chain and Global Attention Network (MCGA-Net) is proposed, integrating Multi-modal Chain Feature Fusion (MCFF) for hierarchical multi-scale defect representation and Global Attention Mechanism (GAM) for context-aware feature enhancement; (3) MS COCO transfer learning fine-tunes the backbone network, accelerating convergence and improving generalization. Ablation and comparison experiments validate the framework's efficacy. MCGA-Net achieves Precision (92.8%), Recall (92.5%), and mAP@50 (95.9%). In the detection of Gaussian noise, weak signals and small targets, MCGA-Net maintains robustness and outperforms other models. This work establishes a new paradigm for automated GPR-based defect detection, balancing computational efficiency with high accuracy in complex subsurface environments.
[4] GPF-Net: Gated Progressive Fusion Learning for Polyp Re-Identification
Suncheng Xiang, Xiaoyang Wang, Junjie Jiang, Hejia Wang, Dahong Qian
🧩 TL;DR
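This paper proposes the Gated Progressive Fusion network, a new architecture that uses gates to selectively fuse multi-level features. It targets the drop in small-object performance that coarse high-level features cause in colonoscopic polyp re-identification.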
📘 Detailed Summary
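Motivation: Colonoscopic polyp re-identification aims to match the same polyp across a gallery of images taken from different views with different cameras, which matters for the prevention and treatment of colorectal cancer in computer-aided diagnosis. However, the high-level features of a given polyp are usually coarse in resolution, which degrades performance on small objects, where fine detail is essential for accurate identification.
Method: The paper proposes the Gated Progressive Fusion network, which uses gates to selectively fuse features from multiple levels in a fully connected way. On top of this, a gated progressive fusion strategy refines semantic information layer by layer through multi-level feature interactions, and is tailored to multimodal fusion scenarios.
Result: Experiments on standard benchmarks show that the multimodal setting has clear advantages over state-of-the-art unimodal ReID models, especially when combined with the specialized multimodal fusion strategy. The method performs strongly on polyp re-identification and mitigates the coarse feature resolution that hurts small-object recognition.
Conclusion: The study shows the value of multimodal feature fusion for medical image re-identification, particularly for small objects. By selectively fusing multi-level features, the gated progressive fusion strategy preserves more fine detail and offers a more reliable basis for polyp tracking and monitoring in computer-aided diagnosis.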
📄 Abstract
Colonoscopic Polyp Re-Identification aims to match the same polyp from a large gallery with images from different views taken using different cameras, which plays an important role in the prevention and treatment of colorectal cancer in computer-aided diagnosis. However, the coarse resolution of high-level features of a specific polyp often leads to inferior results for small objects where detailed information is important. To address this challenge, we propose a novel architecture, named Gated Progressive Fusion network, to selectively fuse features from multiple levels using gates in a fully connected way for polyp ReID. On the basis of it, a gated progressive fusion strategy is introduced to achieve layer-wise refinement of semantic information through multi-level feature interactions. Experiments on standard benchmarks show the benefits of the multimodal setting over state-of-the-art unimodal ReID models, especially when combined with the specialized multimodal fusion strategy.
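A minimal sketch of gated, progressive feature fusion as a general technique, assuming each level is a plain feature vector and the gate is an elementwise sigmoid weight. In a trained network the gate logits would come from learned layers applied to the features; here they are passed in directly. The function names and the convex-combination form are assumptions for illustration and are not the GPF-Net architecture itself.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_fuse(new_level, running, gate_logits):
    """Elementwise gate g in (0, 1): output = g * new_level + (1 - g) * running."""
    fused = []
    for x, r, z in zip(new_level, running, gate_logits):
        g = sigmoid(z)
        fused.append(g * x + (1.0 - g) * r)
    return fused


def progressive_fuse(levels, gate_logits_per_step):
    """Fold the feature levels one at a time into a running fused representation."""
    fused = levels[0]
    for level, logits in zip(levels[1:], gate_logits_per_step):
        fused = gated_fuse(level, fused, logits)
    return fused


# Hypothetical three-level feature pyramid with one channel per level.
levels = [[0.0], [2.0], [4.0]]
neutral_gates = [[0.0], [0.0]]  # a logit of 0 means an equal mix at each step
print(progressive_fuse(levels, neutral_gates))
```

A large positive logit lets the incoming level dominate, and a large negative logit keeps the running representation, which is how a gate can protect fine detail from being averaged away.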
[5] Fixed-Budget Parameter-Efficient Training with Frozen Encoders Improves Multimodal Chest X-Ray Classification
Md Ashik Khan, Md Nahid Siddique
🧩 TL;DR
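This study systematically evaluates parameter-efficient training strategies for multimodal chest X-ray analysis and finds that frozen-encoder methods achieve better diagnostic performance than full fine-tuning at much lower computational cost. Calibration degrades, but post-hoc calibration can correct it.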
📘 Detailed Summary
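Motivation: Multimodal chest X-ray analysis usually fine-tunes large vision-language models, which is computationally expensive. The study explores parameter-efficient training strategies that reach high performance in medical image classification under a limited parameter budget while avoiding data leakage.
Method: The study systematically evaluates several parameter-efficient training strategies, including frozen encoders, BitFit, LoRA, and adapters, on multi-label classification with the Indiana University Chest X-Ray dataset. To prevent data leakage, pathology terms are selectively redacted from the reports used as text inputs while clinical context is retained. The methods are compared under a fixed parameter budget, and external validation assesses scalability.
Result: Under a fixed parameter budget, all parameter-efficient variants achieve AUROC between 0.892 and 0.908, clearly outperforming full fine-tuning, which uses about 40 times as many trainable parameters. In external validation on CheXpert, all methods exceed 0.69 AUROC with fewer than 9% trainable parameters, and the adapter method performs best. Budget-matched comparisons show that vision-only models outperform multimodal models, indicating that the gains come mainly from parameter allocation rather than cross-modal synergy. All parameter-efficient methods show degraded calibration.
Conclusion: Frozen-encoder strategies deliver superior discrimination at substantially lower computational cost, but post-hoc calibration correction is needed before clinical deployment. The finding that gains stem mainly from parameter allocation rather than cross-modal synergy offers practical guidance for efficient model training in medical imaging.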
📄 Abstract
Multimodal chest X-Ray analysis often fine-tunes large vision-language models, which is computationally costly. We study parameter-efficient training (PET) strategies, including frozen encoders, BitFit, LoRA, and adapters for multi-label classification on the Indiana University Chest X-Ray dataset (3,851 image-report pairs; 579 test samples). To mitigate data leakage, we redact pathology terms from reports used as text inputs while retaining clinical context. Under a fixed parameter budget (2.37M parameters, 2.51% of total), all PET variants achieve AUROC between 0.892 and 0.908, outperforming full fine-tuning (0.770 AUROC), which uses 94.3M trainable parameters, a 40x reduction. External validation on CheXpert (224,316 images, 58x larger) confirms scalability: all PET methods achieve >0.69 AUROC with <9% trainable parameters, with Adapter achieving best performance (0.7214 AUROC). Budget-matched comparisons reveal that vision-only models (0.653 AUROC, 1.06M parameters) outperform budget-matched multimodal models (0.641 AUROC, 1.06M parameters), indicating improvements arise primarily from parameter allocation rather than cross-modal synergy. While PET methods show degraded calibration (ECE: 0.29-0.34) compared to simpler models (ECE: 0.049), this represents a tractable limitation addressable through post-hoc calibration methods. These findings demonstrate that frozen encoder strategies provide superior discrimination at substantially reduced computational cost, though calibration correction is essential for clinical deployment.
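The abstract reports calibration as ECE (expected calibration error). The sketch below shows the common equal-width binned definition for a single binary label: bin predictions by confidence, then take the sample-weighted average gap between accuracy and mean confidence per bin. The abstract does not say how ECE was aggregated over labels or how many bins were used, so the bin count and the function name are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| over bins.

    confidences: predicted confidence in [0, 1] for each sample.
    correct: 1 if the prediction was right, 0 otherwise.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Hypothetical overconfident model: 90% confidence but only 75% accuracy.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
```

Post-hoc methods such as temperature scaling aim to shrink this gap without changing which class is predicted, so discrimination metrics like AUROC are left essentially unchanged.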
[6] Hierarchy-Aware Fine-Tuning of Vision-Language Models
Jiayu Li, Rajesh Gangireddy, Samet Akcay, Wei Cheng, Juhua Hu
🧩 TL;DR
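This paper proposes an efficient hierarchy-aware fine-tuning framework for adapting vision-language models to hierarchical classification. It combines Tree-Path KL Divergence with Hierarchy-Sibling Smoothed Cross-Entropy and updates only a small number of parameters while maintaining structural consistency.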
📘 Detailed Summary
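Motivation: Vision-language models learn powerful multimodal representations through large-scale image-text pretraining, but adapting them to hierarchical classification is underexplored. Standard approaches treat labels as flat categories and require full fine-tuning, which is expensive and produces inconsistent predictions across taxonomy levels.
Method: The paper proposes an efficient hierarchy-aware fine-tuning framework that combines two objectives: Tree-Path KL Divergence aligns predictions along the ground-truth label path for vertical coherence, while Hierarchy-Sibling Smoothed Cross-Entropy encourages consistent predictions among sibling classes. Both losses operate in the VLM's shared embedding space and integrate with lightweight LoRA adaptation.
Result: Experiments across multiple benchmarks show consistent improvements in Full-Path Accuracy and Tree-based Inconsistency Error with minimal parameter overhead. The framework updates parameters efficiently while preserving structural consistency.
Conclusion: The method offers an efficient strategy for adapting vision-language models to structured taxonomies. By combining vertical and horizontal consistency constraints, it keeps predictions structurally consistent while substantially reducing fine-tuning cost, and opens a new direction for hierarchical classification.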
📄 Abstract
Vision-Language Models (VLMs) learn powerful multimodal representations through large-scale image-text pretraining, but adapting them to hierarchical classification is underexplored. Standard approaches treat labels as flat categories and require full fine-tuning, which is expensive and produces inconsistent predictions across taxonomy levels. We propose an efficient hierarchy-aware fine-tuning framework that updates a few parameters while enforcing structural consistency. We combine two objectives: Tree-Path KL Divergence (TP-KL) aligns predictions along the ground-truth label path for vertical coherence, while Hierarchy-Sibling Smoothed Cross-Entropy (HiSCE) encourages consistent predictions among sibling classes. Both losses work in the VLM's shared embedding space and integrate with lightweight LoRA adaptation. Experiments across multiple benchmarks show consistent improvements in Full-Path Accuracy and Tree-based Inconsistency Error with minimal parameter overhead. Our approach provides an efficient strategy for adapting VLMs to structured taxonomies.
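One plausible reading of "sibling smoothed" cross-entropy is label smoothing in which the smoothing mass goes only to classes that share the true label's parent. The sketch below illustrates that reading; it is an assumption made for illustration, not the paper's HiSCE definition, and the toy taxonomy, `eps` value, and function names are hypothetical.

```python
import math


def sibling_smoothed_target(label, parent_of, eps=0.1):
    """Soft target: 1 - eps on the true class, eps shared among its siblings."""
    siblings = [c for c, p in parent_of.items()
                if p == parent_of[label] and c != label]
    target = {c: 0.0 for c in parent_of}
    if not siblings:
        # No siblings to smooth over, so fall back to a one-hot target.
        target[label] = 1.0
        return target
    target[label] = 1.0 - eps
    for s in siblings:
        target[s] = eps / len(siblings)
    return target


def cross_entropy(target, probs):
    """Cross-entropy between a soft target and predicted class probabilities."""
    return -sum(t * math.log(probs[c]) for c, t in target.items() if t > 0)


# Hypothetical two-level taxonomy: leaf class -> parent class.
parent_of = {"husky": "dog", "beagle": "dog", "tabby": "cat"}
target = sibling_smoothed_target("husky", parent_of, eps=0.1)
probs = {"husky": 0.7, "beagle": 0.2, "tabby": 0.1}
print(target, cross_entropy(target, probs))
```

Under this target, confusing a husky with a beagle is penalized less than confusing it with a tabby, which nudges errors to stay inside the correct parent class.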
[7] EraseLoRA: MLLM-Driven Foreground Exclusion and Background Subtype Aggregation for Dataset-Free Object Removal
Sanghyun Jo, Donghwan Lee, Eunji Jung, Seong Je Oh, Kyungsu Kim
🧩 TL;DR
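This paper proposes EraseLoRA, a dataset-free object removal framework that replaces attention manipulation with background-aware reasoning and test-time adaptation. It addresses two problems in existing methods: non-target foregrounds being mistaken for background, which makes targets reappear, and attention manipulation damaging fine details.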
📘 Detailed Summary
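Motivation: Unlike ordinary inpainting, object removal must prevent the masked target from reappearing and must reconstruct the occluded background with structural and contextual fidelity. Existing dataset-free methods that redirect self-attention inside the mask have two problems: non-target foregrounds are often misinterpreted as background, which regenerates unwanted objects, and direct attention manipulation disrupts details and hinders coherent integration of background cues.
Method: EraseLoRA has two core components. Background-aware Foreground Exclusion (BFE) uses a multimodal large language model to separate target foreground, non-target foregrounds, and clean background from a single image-mask pair without paired supervision. Background-aware Reconstruction with Subtype Aggregation (BRSA) performs test-time optimization that treats the inferred background subtypes as complementary pieces and enforces their consistent integration through reconstruction and alignment objectives, without explicit attention intervention.
Result: Validated as a plug-in to pretrained diffusion models on object removal benchmarks, EraseLoRA improves consistently over dataset-free baselines and is competitive with dataset-driven methods, performing well both at preventing target reappearance and at background reconstruction quality.
Conclusion: The study shows that background-aware reasoning and test-time adaptation can address the key challenges of object removal while avoiding the destructive effects of attention manipulation. It offers a new paradigm for dataset-free object removal with an extensible and practical framework design.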
📄 Abstract
Object removal differs from common inpainting, since it must prevent the masked target from reappearing and reconstruct the occluded background with structural and contextual fidelity, rather than merely filling a hole plausibly. Recent dataset-free approaches that redirect self-attention inside the mask fail in two ways: non-target foregrounds are often misinterpreted as background, which regenerates unwanted objects, and direct attention manipulation disrupts fine details and hinders coherent integration of background cues. We propose EraseLoRA, a novel dataset-free framework that replaces attention surgery with background-aware reasoning and test-time adaptation. First, Background-aware Foreground Exclusion (BFE) uses a multimodal large language model to separate target foreground, non-target foregrounds, and clean background from a single image-mask pair without paired supervision, producing reliable background cues while excluding distractors. Second, Background-aware Reconstruction with Subtype Aggregation (BRSA) performs test-time optimization that treats inferred background subtypes as complementary pieces and enforces their consistent integration through reconstruction and alignment objectives, preserving local detail and global structure without explicit attention intervention. We validate EraseLoRA as a plug-in to pretrained diffusion models and across benchmarks for object removal, demonstrating consistent improvements over dataset-free baselines and competitive results against dataset-driven methods. The code will be made available upon publication.
[8] Towards Long-window Anchoring in Vision-Language Model Distillation
Haoyi Zhou, Shuo Li, Tianyu Chen, Qi Song, Chonghan Gao, Jianxin Li
🧩 TL;DR
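This paper proposes LAid, which uses knowledge distillation to transfer the long-range attention mechanisms of large vision-language models to small ones. It addresses the weak language-image alignment of small branch models under limited window sizes and substantially extends the effective context window.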
📘 Detailed Summary
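Motivation: Large vision-language models show strong long-context understanding, but their widely used small branch models align language and images poorly under limited window sizes, which limits how well they handle long input sequences.
Method: LAid has two complementary components: progressive distance-weighted attention matching, which dynamically emphasizes longer position differences during training, and learnable RoPE response gain modulation, which selectively amplifies position sensitivity where needed. Together they directly transfer long-range attention mechanisms.
Result: Experiments show that LAid-distilled models achieve up to 3.2 times longer effective context windows than baseline small models while maintaining or improving performance on standard vision-language benchmarks. Spectral analysis confirms that the method preserves crucial low-frequency attention components that conventional methods fail to transfer.
Conclusion: The work provides practical techniques for building more efficient long-context vision-language models and, through spectral analysis, offers theoretical insight into how positional understanding emerges and transfers during distillation, giving a new methodological framework for distilling attention mechanisms.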
📄 Abstract
While large vision-language models (VLMs) demonstrate strong long-context understanding, their prevalent small branches fail on linguistics-photography alignment for a limited window size. We discover that knowledge distillation improves students' capability as a complement to Rotary Position Embeddings (RoPE) on window sizes (anchored from large models). Building on this insight, we propose LAid, which directly aims at the transfer of long-range attention mechanisms through two complementary components: (1) a progressive distance-weighted attention matching that dynamically emphasizes longer position differences during training, and (2) a learnable RoPE response gain modulation that selectively amplifies position sensitivity where needed. Extensive experiments across multiple model families demonstrate that LAid-distilled models achieve up to 3.2 times longer effective context windows compared to baseline small models, while maintaining or improving performance on standard VL benchmarks. Spectral analysis also suggests that LAid successfully preserves crucial low-frequency attention components that conventional methods fail to transfer. Our work not only provides practical techniques for building more efficient long-context VLMs but also offers theoretical insights into how positional understanding emerges and transfers during distillation.
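A minimal sketch of distance-weighted attention matching as a general idea: a squared-error loss between student and teacher attention maps in which entries with a larger query-key distance receive a larger weight. The linear weight `1 + alpha * |i - j| / (n - 1)` and the function name are assumptions for illustration; the abstract does not give LAid's actual weighting or schedule.

```python
def distance_weighted_attention_loss(student, teacher, alpha=1.0):
    """Weighted MSE between two n x n attention maps.

    Entry (i, j) gets weight 1 + alpha * |i - j| / (n - 1), so mismatches
    between distant positions count more when alpha > 0.
    """
    n = len(student)
    total, weight_sum = 0.0, 0.0
    for i in range(n):
        for j in range(n):
            w = 1.0 + alpha * abs(i - j) / max(n - 1, 1)
            total += w * (student[i][j] - teacher[i][j]) ** 2
            weight_sum += w
    return total / weight_sum


# Hypothetical 3 x 3 maps: the same unit error placed near or far from the diagonal.
teacher = [[0.0] * 3 for _ in range(3)]
near = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
far = [[0.0, 0.0, 1.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
print(distance_weighted_attention_loss(near, teacher),
      distance_weighted_attention_loss(far, teacher))
```

A "progressive" variant could raise `alpha` over the course of training so that the student first matches local attention and then longer-range attention; that schedule is likewise only an assumption here.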
[9] Toward Intelligent Scene Augmentation for Context-Aware Object Placement and Sponsor-Logo Integration
Unnati Saraswat, Tarun Rao, Namah Gupta, Shweta Swami, Shikhar Sharma, Prateek Narang, Dhruv Kumar
🧩 TL;DR
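This paper addresses the lack of contextual consistency when inserting objects in image editing. It proposes two new tasks for advertising and digital media, context-aware object insertion and sponsor-product logo augmentation, and builds datasets to support them.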
📘 Detailed Summary
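Motivation: Current image editing methods based on vision-language models and diffusion models rarely ensure that inserted objects are contextually appropriate. Existing work gives little attention to consistency with the scene context, even though advertising and digital media applications need smarter object insertion and brand logo augmentation.
Method: The study introduces two new task formulations. Context-aware object insertion requires predicting suitable object categories, generating the objects, and placing them plausibly within the scene. Sponsor-product logo augmentation involves detecting products and inserting the correct brand logos, even when items are unbranded or incorrectly branded. To support these tasks, two new datasets are built with category annotations, placement regions, and sponsor-product labels.
Result: The study builds two dedicated datasets with detailed category annotations, object placement regions, and sponsor-product labels. They provide a benchmark for context-aware object insertion and logo augmentation and support later method development and comparison.
Conclusion: The study highlights the importance of contextual consistency in intelligent image editing, particularly for advertising and digital media. The new tasks and datasets lay a foundation for object insertion and brand augmentation methods that better fit the scene context, bringing computer vision and multimodal reasoning closer together in practical image editing.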
📄 Abstract
Intelligent image editing increasingly relies on advances in computer vision, multimodal reasoning, and generative modeling. While vision-language models (VLMs) and diffusion models enable guided visual manipulation, existing work rarely ensures that inserted objects are contextually appropriate. We introduce two new tasks for advertising and digital media: (1) context-aware object insertion, which requires predicting suitable object categories, generating them, and placing them plausibly within the scene; and (2) sponsor-product logo augmentation, which involves detecting products and inserting correct brand logos, even when items are unbranded or incorrectly branded. To support these tasks, we build two new datasets with category annotations, placement regions, and sponsor-product labels.
[10] TrackTeller: Temporal Multimodal 3D Grounding for Behavior-Dependent Object References
Jiahong Yu, Ziqi Wang, Hailiang Zhao, Wei Zhai, Xueqiang Yan, Shuiguang Deng
🧩 TL;DR
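This paper proposes TrackTeller, a unified multimodal framework for temporal language grounding in dynamic 3D driving scenes. By combining LiDAR-image fusion, language-conditioned decoding, and temporal reasoning, it substantially improves language-based object tracking.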
📘 Detailed Summary
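Motivation: In dynamic 3D driving scenes, many natural-language referring expressions describe targets through recent motion or short-term interactions, which cannot be resolved from static appearance or geometry alone. Existing methods struggle with this temporally dependent grounding problem, so solutions that exploit multi-frame observations are needed.
Method: TrackTeller is a unified temporal multimodal grounding framework that integrates LiDAR-image fusion, language-conditioned decoding, and temporal reasoning. It constructs a shared UniScene representation aligned with textual semantics, generates language-aware 3D proposals, and refines grounding decisions using motion history and short-term dynamics.
Result: Experiments on the NuPrompt benchmark show that TrackTeller consistently improves language-grounded tracking, with a 70% relative improvement in Average Multi-Object Tracking Accuracy over strong baselines and a 3.15-3.4 times reduction in False Alarm Frequency.
Conclusion: The study demonstrates the importance of temporal reasoning in 3D language grounding, especially for references that describe motion. TrackTeller's unified architecture offers an effective solution for interactive autonomous driving systems and shows the key role of multimodal fusion and temporal modeling in understanding language references in dynamic scenes.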
📄 Abstract
Understanding natural-language references to objects in dynamic 3D driving scenes is essential for interactive autonomous systems. In practice, many referring expressions describe targets through recent motion or short-term interactions, which cannot be resolved from static appearance or geometry alone. We study temporal language-based 3D grounding, where the objective is to identify the referred object in the current frame by leveraging multi-frame observations. We propose TrackTeller, a temporal multimodal grounding framework that integrates LiDAR-image fusion, language-conditioned decoding, and temporal reasoning in a unified architecture. TrackTeller constructs a shared UniScene representation aligned with textual semantics, generates language-aware 3D proposals, and refines grounding decisions using motion history and short-term dynamics. Experiments on the NuPrompt benchmark demonstrate that TrackTeller consistently improves language-grounded tracking performance, outperforming strong baselines with a 70% relative improvement in Average Multi-Object Tracking Accuracy and a 3.15-3.4 times reduction in False Alarm Frequency.
[11] LLM-Free Image Captioning Evaluation in Reference-Flexible Settings
Shinnosuke Hirano, Yuiga Wada, Kazuki Matsuda, Seitaro Otsuki, Komei Sugiura
🧩 TL;DR
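This paper proposes Pearl, a supervised image captioning evaluation metric that needs no large language model and works in both reference-based and reference-free settings. With a novel mechanism for learning similarity representations, it outperforms existing LLM-free metrics on multiple benchmark datasets.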
📘 Detailed Summary
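Motivation: Existing image captioning metrics based on large language models favor their own generations, which calls their neutrality into question. Most LLM-free metrics avoid this problem but do not always perform well, so a metric that is both neutral and high-performing is needed.
Method: The paper proposes Pearl, an LLM-free supervised metric for image captioning, and introduces a novel mechanism that learns representations of image-caption and caption-caption similarities. It also constructs a large human-annotated dataset of approximately 333k human judgments collected from 2,360 annotators across over 75k images.
Result: Pearl outperforms other existing LLM-free metrics on the Composite, Flickr8K-Expert, Flickr8K-CF, Nebula, and FOIL datasets in both reference-based and reference-free settings.
Conclusion: The study shows that a high-performing image captioning metric can be built without relying on large language models. Through its similarity learning mechanism, Pearl resolves the trade-off between neutrality and performance in existing methods and offers a more reliable option for caption evaluation.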
📄 Abstract
We focus on the automatic evaluation of image captions in both reference-based and reference-free settings. Existing metrics based on large language models (LLMs) favor their own generations; therefore, the neutrality is in question. Most LLM-free metrics do not suffer from such an issue, whereas they do not always demonstrate high performance. To address these issues, we propose Pearl, an LLM-free supervised metric for image captioning, which is applicable to both reference-based and reference-free settings. We introduce a novel mechanism that learns the representations of image-caption and caption-caption similarities. Furthermore, we construct a human-annotated dataset for image captioning metrics that comprises approximately 333k human judgments collected from 2,360 annotators across over 75k images. Pearl outperformed other existing LLM-free metrics on the Composite, Flickr8K-Expert, Flickr8K-CF, Nebula, and FOIL datasets in both reference-based and reference-free settings. Our project page is available at https://pearl.kinsta.page/.
[12] From Shallow Humor to Metaphor: Towards Label-Free Harmful Meme Detection via LMM Agent Self-Improvement
Jian Lang, Rongpei Hong, Ting Zhong, Leiting Chen, Qiang Gao, Fan Zhou
🧩 TL;DR
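This paper proposes ALARM, the first label-free harmful meme detection framework based on self-improvement of a large multimodal model (LMM) agent. It uses information from "shallow" memes to iteratively improve detection of more complex harmful content, and outperforms label-driven methods on multiple datasets.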
📘 Detailed Summary
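Motivation: The proliferation of harmful memes on online media poses significant risks to public health and social stability. Existing detection methods rely heavily on large-scale labeled training data, which demands substantial manual annotation and adapts poorly to the continually evolving nature of harmful content, limiting adaptability and scalability.
Method: ALARM includes a Confidence-based Explicit Meme Identification mechanism that isolates explicit memes from the original dataset and assigns them pseudo-labels. It also introduces a Pairwise Learning Guided Agent Self-Improvement paradigm, in which the explicit memes are reorganized into contrastive pairs (positive vs. negative) to refine a learner LMM agent. The agent autonomously derives high-level detection cues from these pairs and uses them to handle complex and challenging memes.
Result: Experiments on three diverse datasets show that ALARM performs strongly and adapts well to newly evolved memes. Notably, it even outperforms label-driven methods, which indicates its potential for adapting to new forms and topics of harmful memes in dynamic online environments.
Conclusion: The study shows the potential of label-free frameworks as a scalable and promising solution for adapting to novel forms and topics of harmful memes in dynamic online environments. By using information from shallow memes to iteratively improve detection of complex content, it opens a new direction for harmful content detection.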
📄 Abstract
The proliferation of harmful memes on online media poses significant risks to public health and stability. Existing detection methods heavily rely on large-scale labeled data for training, which necessitates substantial manual annotation efforts and limits their adaptability to the continually evolving nature of harmful content. To address these challenges, we present ALARM, the first lAbeL-free hARmful Meme detection framework powered by Large Multimodal Model (LMM) agent self-improvement. The core innovation of ALARM lies in exploiting the expressive information from "shallow" memes to iteratively enhance its ability to tackle more complex and subtle ones. ALARM consists of a novel Confidence-based Explicit Meme Identification mechanism that isolates the explicit memes from the original dataset and assigns them pseudo-labels. Besides, a new Pairwise Learning Guided Agent Self-Improvement paradigm is introduced, where the explicit memes are reorganized into contrastive pairs (positive vs. negative) to refine a learner LMM agent. This agent autonomously derives high-level detection cues from these pairs, which in turn empower the agent itself to handle complex and challenging memes effectively. Experiments on three diverse datasets demonstrate the superior performance and strong adaptability of ALARM to newly evolved memes. Notably, our method even outperforms label-driven methods. These results highlight the potential of label-free frameworks as a scalable and promising solution for adapting to novel forms and topics of harmful memes in dynamic online environments.
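A minimal sketch of the two data-preparation ideas described above: keep only memes on which a model is confident, assign them pseudo-labels, and pair pseudo-positive with pseudo-negative examples. The threshold value, the dictionary format, and the function names are hypothetical; the abstract does not specify how ALARM computes confidence or forms its pairs.

```python
def select_explicit_memes(harm_scores, threshold=0.9):
    """Pseudo-label memes whose harmfulness score is confidently high or low.

    harm_scores: {meme_id: estimated probability that the meme is harmful}.
    Memes with scores between 1 - threshold and threshold are left unlabeled.
    """
    pseudo_labels = {}
    for meme_id, p in harm_scores.items():
        if p >= threshold:
            pseudo_labels[meme_id] = "harmful"
        elif p <= 1.0 - threshold:
            pseudo_labels[meme_id] = "benign"
    return pseudo_labels


def contrastive_pairs(pseudo_labels):
    """Pair each pseudo-harmful meme with a pseudo-benign one (sorted for determinism)."""
    positives = sorted(m for m, lab in pseudo_labels.items() if lab == "harmful")
    negatives = sorted(m for m, lab in pseudo_labels.items() if lab == "benign")
    return list(zip(positives, negatives))


# Hypothetical scores from an LMM prompted to rate each meme.
harm_scores = {"a": 0.97, "b": 0.03, "c": 0.55, "d": 0.92, "e": 0.01}
pseudo_labels = select_explicit_memes(harm_scores)
pairs = contrastive_pairs(pseudo_labels)
print(pseudo_labels, pairs)
```

The ambiguous meme `c` is deliberately left out: in a scheme like this, the hard cases are what the refined agent is later expected to handle.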
[13] A-QCF-Net: An Adaptive Quaternion Cross-Fusion Network for Multimodal Liver Tumor Segmentation from Unpaired Datasets
Arunkumar V, Firos V M, Senthilkumar S, Gangadharan G R
🧩 TL;DR
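This paper proposes the Adaptive Quaternion Cross-Fusion Network (A-QCF-Net), which learns a single unified medical image segmentation model from completely separate, unpaired CT and MRI cohorts, addressing the fundamental scarcity of paired data in multimodal medical imaging.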
📘 Detailed Summary
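Motivation: Multimodal medical imaging provides complementary information for accurate delineation of pathology, but deep learning models are held back by the scarcity of large datasets that are paired and spatially aligned. The study tackles this fundamental challenge: how to train a single unified segmentation model on completely separate, unpaired CT and MRI cohorts.
Method: The proposed A-QCF-Net exploits the parameter efficiency and expressive power of Quaternion Neural Networks to build a shared feature space. At its core is the Adaptive Quaternion Cross-Fusion (A-QCF) block, a data-driven attention module that supports bidirectional knowledge transfer between the two streams and dynamically modulates the flow of information to exchange modality-specific expertise.
Result: After joint training on the unpaired LiTS (CT) and ATLAS (MRI) datasets, the model achieves Tumor Dice scores of 76.7% on CT and 78.3% on MRI, exceeding the strong unimodal nnU-Net baseline by 5.4% and 4.7% respectively. Explainability analysis with Grad-CAM and Grad-CAM++ confirms that the model focuses on the relevant pathological structures.
Conclusion: The study provides a robust and clinically viable paradigm for using the large unpaired imaging archives that are common in healthcare. Through adaptive knowledge transfer, the model learns complementary features from different modalities, which is especially valuable in real clinical settings where paired data is limited.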
📄 Abstract
Multimodal medical imaging provides complementary information that is crucial for accurate delineation of pathology, but the development of deep learning models is limited by the scarcity of large datasets in which different modalities are paired and spatially aligned. This paper addresses this fundamental limitation by proposing an Adaptive Quaternion Cross-Fusion Network (A-QCF-Net) that learns a single unified segmentation model from completely separate and unpaired CT and MRI cohorts. The architecture exploits the parameter efficiency and expressive power of Quaternion Neural Networks to construct a shared feature space. At its core is the Adaptive Quaternion Cross-Fusion (A-QCF) block, a data driven attention module that enables bidirectional knowledge transfer between the two streams. By learning to modulate the flow of information dynamically, the A-QCF block allows the network to exchange abstract modality specific expertise, such as the sharp anatomical boundary information available in CT and the subtle soft tissue contrast provided by MRI. This mutual exchange regularizes and enriches the feature representations of both streams. We validate the framework by jointly training a single model on the unpaired LiTS (CT) and ATLAS (MRI) datasets. The jointly trained model achieves Tumor Dice scores of 76.7% on CT and 78.3% on MRI, significantly exceeding the strong unimodal nnU-Net baseline by margins of 5.4% and 4.7% respectively. Furthermore, comprehensive explainability analysis using Grad-CAM and Grad-CAM++ confirms that the model correctly focuses on relevant pathological structures, ensuring the learned representations are clinically meaningful. This provides a robust and clinically viable paradigm for unlocking the large unpaired imaging archives that are common in healthcare.
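The results above are reported as Dice scores. As a reminder of what that metric measures, here is a minimal sketch for flat binary masks using the standard definition, Dice = 2|A ∩ B| / (|A| + |B|). The function name and the convention of returning 1.0 when both masks are empty are assumptions; evaluation toolkits differ on that edge case.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat lists of 0/1."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        # Both masks are empty: treated as perfect agreement in this sketch.
        return 1.0
    return 2.0 * intersection / total


# Hypothetical 4-pixel masks: one overlapping pixel, two foreground pixels each.
pred = [1, 1, 0, 0]
target = [1, 0, 1, 0]
print(dice_score(pred, target))  # 0.5
```

Because the denominator counts only foreground pixels, Dice is sensitive to errors on small tumors in a way that plain pixel accuracy is not.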
[14] TAMEing Long Contexts in Personalization: Towards Training-Free and State-Aware MLLM Personalized Assistant
Rongpei Hong, Jian Lang, Ting Zhong, Yong Wang, Fan Zhou
🧩 TL;DR
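This paper proposes LCMP, the first evaluation benchmark for long-context personalization of multimodal large language models (MLLMs), and introduces the TAME framework as a strong baseline. With double memory management and a Retrieve-then-Align Augmented Generation paradigm, TAME substantially improves personalized interaction with MLLMs in long conversations.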
📘 Detailed Summary
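Motivation: Existing MLLM personalization methods focus on simple, context-agnostic visual identification and textual replacement, and overlook support for long-context conversations. An ideal personalized MLLM assistant should engage in long-context dialogues and keep improving the experience by learning from past dialogue histories.
Method: The study proposes the LCMP benchmark to measure how well MLLMs perceive variations of personalized concepts and generate contextually appropriate responses. It also introduces TAME, a training-free and state-aware framework that uses double memories to manage the temporal and persistent variations of each personalized concept in a differentiated manner. TAME further incorporates the Retrieve-then-Align Augmented Generation (RA2G) paradigm, whose alignment step extracts, from the knowledge retrieved across memories, the information that fits the context of the current question.
Result: Experiments on LCMP show that TAME achieves the best performance and delivers strong, continually evolving interaction in long-context scenarios, supporting its effectiveness on complex real-world user queries.
Conclusion: The study fills the gap in evaluating long-context MLLM personalization. Through its memory management and knowledge alignment mechanisms, TAME provides an effective solution for personalized dialogue systems and points the way for future personalized AI assistants, especially in handling temporal change and complex interactions.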
📄 Abstract
Multimodal Large Language Model (MLLM) Personalization is a critical research problem that facilitates personalized dialogues with MLLMs targeting specific entities (known as personalized concepts). However, existing methods and benchmarks focus on the simple, context-agnostic visual identification and textual replacement of the personalized concept (e.g., "A yellow puppy" -> "Your puppy Mochi"), overlooking the ability to support long-context conversations. An ideal personalized MLLM assistant is capable of engaging in long-context dialogues with humans and continually improving its experience quality by learning from past dialogue histories. To bridge this gap, we propose LCMP, the first Long-Context MLLM Personalization evaluation benchmark. LCMP assesses the capability of MLLMs in perceiving variations of personalized concepts and generating contextually appropriate personalized responses that reflect these variations. As a strong baseline for LCMP, we introduce a novel training-free and state-aware framework TAME. TAME endows MLLMs with double memories to manage the temporal and persistent variations of each personalized concept in a differentiated manner. In addition, TAME incorporates a new training-free Retrieve-then-Align Augmented Generation (RA2G) paradigm. RA2G introduces an alignment step to extract the contextually fitted information from the multi-memory retrieved knowledge to the current questions, enabling better interactions for complex real-world user queries. Experiments on LCMP demonstrate that TAME achieves the best performance, showcasing remarkable and evolving interaction experiences in long-context scenarios.
[15] Training-Free Disentangled Text-Guided Image Editing via Sparse Latent Constraints
Mutiara Shabrina, Nova Kurnia Putri, Jefri Satria Ferdiansyah, Sabita Khansa Dewi, Novanto Yudistira
🧩 TL;DR
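This study analyzes attribute entanglement in the PPE framework for text-driven image editing and proposes a sparsity constraint based on L1 regularization. Making latent-space manipulation sparser reduces semantic leakage and yields more focused and controllable attribute edits.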
📘 Detailed Summary
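Motivation: Text-driven image editing often suffers from attribute entanglement: modifying a target attribute (e.g., adding bangs) unintentionally alters other semantic properties such as identity or appearance. The PPE framework attempts to address this, but under its regularization strategy latent updates remain dense and prone to semantic leakage.
Method: The study analyzes the architecture of the PPE framework, which uses BERT-based attribute prediction and StyleGAN2-based image generation, and runs experiments on the CelebA-HQ dataset. To address the shortcomings of the original regularization strategy, it introduces a sparsity constraint that applies L1 regularization to latent-space manipulation.
Result: Experiments show that the proposed sparsity constraint produces more focused and controlled edits, reduces unintended changes to non-target attributes, and better preserves facial identity, improving attribute disentanglement over the original method.
Conclusion: The study shows that imposing sparsity on latent-space manipulation is an effective way to address attribute entanglement and gives finer control over text-driven image editing. More advanced sparse regularization methods could be explored to improve editing quality further.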
📄 Abstract
Text-driven image manipulation often suffers from attribute entanglement, where modifying a target attribute (e.g., adding bangs) unintentionally alters other semantic properties such as identity or appearance. The Predict, Prevent, and Evaluate (PPE) framework addresses this issue by leveraging pre-trained vision-language models for disentangled editing. In this work, we analyze the PPE framework, focusing on its architectural components, including BERT-based attribute prediction and StyleGAN2-based image generation on the CelebA-HQ dataset. Through empirical analysis, we identify a limitation in the original regularization strategy, where latent updates remain dense and prone to semantic leakage. To mitigate this issue, we introduce a sparsity-based constraint using L1 regularization on latent space manipulation. Experimental results demonstrate that the proposed approach enforces more focused and controlled edits, effectively reducing unintended changes in non-target attributes while preserving facial identity.
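To illustrate why an L1 penalty yields sparse latent updates, the sketch below applies soft-thresholding, the proximal operator of the L1 norm: every coordinate of a dense update shrinks by `lam`, and coordinates smaller than `lam` become exactly zero. This is a standard way to see the effect and is used here only as an illustration; the paper may instead add the L1 term to its loss and optimize by gradient descent. The latent update values and `lam` are hypothetical.

```python
def l1_norm(delta):
    """L1 norm of a latent update, the quantity an L1 regularizer penalizes."""
    return sum(abs(d) for d in delta)


def soft_threshold(delta, lam):
    """Proximal operator of lam * ||delta||_1: shrink toward zero, clip small entries."""
    shrunk = []
    for d in delta:
        magnitude = max(abs(d) - lam, 0.0)
        shrunk.append(magnitude if d >= 0 else -magnitude)
    return shrunk


# Hypothetical dense latent update: two strong directions and two small leaks.
dense_update = [0.5, -0.05, 0.02, -0.3]
sparse_update = soft_threshold(dense_update, lam=0.1)
nonzero = sum(1 for v in sparse_update if v != 0.0)
print(sparse_update, nonzero)
```

In editing terms, the small coordinates that are zeroed out correspond to the incidental changes (semantic leakage) that the sparsity constraint is meant to suppress, while the strong coordinates that carry the target edit survive.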
[16] Unsupervised Anomaly Detection in Brain MRI via Disentangled Anatomy Learning
Tao Yang, Xiuying Wang, Hao Liu, Guanzhong Gong, Lian-Ming Wu, Yu-Ping Wang, Lisheng Wang
🧩 TL;DR
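This study proposes a new pseudo-healthy image reconstruction framework. With a disentangled representation module and an edge-to-image restoration module, it markedly improves the generalizability and performance of anomaly detection in multi-modality, multi-center brain MRI, outperforming 17 state-of-the-art methods on nine public datasets.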
📘 Detailed Summary
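Motivation: Current unsupervised anomaly detection methods based on pseudo-healthy image reconstruction face two key limitations. Their generalizability to multi-modality and multi-center MRI is restricted because they rely too heavily on the specific imaging information in normal training data, and their performance is constrained because abnormal residuals propagate from the input image into the reconstructed pseudo-healthy image.
Method: The new framework has two core modules. The disentangled representation module introduces brain anatomical priors and a differentiable one-hot encoding operator to decouple brain MRI into imaging information and essential imaging-invariant anatomical images. The edge-to-image restoration module restores the anatomical representation from only the high-frequency edge information of the anatomical images, then recouples the disentangled imaging information to reconstruct high-quality pseudo-healthy images.
Result: Evaluated on nine public datasets (MRIs of 4,443 patients from multiple centers), the method achieves absolute improvements of +18.32% in AP and +13.64% in DSC, clearly outperforming 17 state-of-the-art methods.
Conclusion: By disentangling imaging information from anatomy and using edge information to suppress the propagation of anomalies, the study provides a more robust and generalizable solution for anomaly detection in multi-modality, multi-center brain MRI and shows practical potential on diverse clinical data.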
📄 Abstract
Detection of various lesions in brain MRI is clinically critical, but challenging due to the diversity of lesions and variability in imaging conditions. Current unsupervised learning methods detect anomalies mainly by learning from normal samples to reconstruct abnormal images into pseudo-healthy images (PHIs) and then analyzing the differences between the input and the reconstruction. However, these unsupervised models face two significant limitations: restricted generalizability to multi-modality and multi-center MRIs due to their reliance on the specific imaging information in normal training data, and constrained performance due to abnormal residuals propagated from input images to reconstructed PHIs. To address these limitations, two novel modules are proposed, forming a new PHI reconstruction framework. First, the disentangled representation module is proposed to improve generalizability by decoupling brain MRI into imaging information and essential imaging-invariant anatomical images, ensuring that the reconstruction focuses on the anatomy. Specifically, brain anatomical priors and a differentiable one-hot encoding operator are introduced to constrain the disentanglement results and enhance the disentanglement stability. Second, the edge-to-image restoration module is designed to reconstruct high-quality PHIs by restoring the anatomical representation from the high-frequency edge information of anatomical images, and then recoupling the disentangled imaging information. This module not only suppresses abnormal residuals in the PHI, because the edge-only input carries fewer abnormal pixels, but also effectively reconstructs normal regions using the structural details preserved in the edges. Evaluated on nine public datasets (4,443 patients' MRIs from multiple centers), our method outperforms 17 SOTA methods, achieving absolute improvements of +18.32% in AP and +13.64% in DSC.
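The reconstruction-based detection step and the DSC metric mentioned above can be sketched as follows. This is a minimal illustration on tiny hand-written arrays; the threshold value and the helper names `residual_mask` and `dice` are assumptions, not the authors' pipeline.

```python
def residual_mask(image, pseudo_healthy, threshold):
    """Flag pixels whose absolute reconstruction residual exceeds the threshold."""
    return [
        [1 if abs(a - b) > threshold else 0 for a, b in zip(row_img, row_phi)]
        for row_img, row_phi in zip(image, pseudo_healthy)
    ]


def dice(pred, target):
    """Dice similarity coefficient: 2 * |P and T| / (|P| + |T|)."""
    inter = sum(p * t for rp, rt in zip(pred, target) for p, t in zip(rp, rt))
    total = sum(map(sum, pred)) + sum(map(sum, target))
    return 1.0 if total == 0 else 2.0 * inter / total


# Hypothetical 2x2 intensities: the input image and its pseudo-healthy reconstruction.
image = [[0.1, 0.9], [0.2, 0.8]]
pseudo_healthy = [[0.1, 0.2], [0.2, 0.7]]
lesion_truth = [[0, 1], [0, 1]]

pred = residual_mask(image, pseudo_healthy, threshold=0.3)
score = dice(pred, lesion_truth)
```

The second lesion pixel is missed here because the reconstruction copied most of its abnormal intensity, which is the residual-propagation problem the edge-to-image module is meant to reduce.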
[17] LongFly: Long-Horizon UAV Vision-and-Language Navigation with Spatiotemporal Context Integration
Wen Jiang, Li Wang, Kangyao Huang, Wei Fan, Jinyuan Liu, Shaoyu Liu, Hongwei Duan, Bin Xu, Xiangyang Ji
🧩 TL;DR
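This paper proposes LongFly, a spatiotemporal context modeling framework for UAV vision-and-language navigation that uses a history-aware spatiotemporal modeling strategy to address inaccurate semantic alignment and unstable path planning in long-horizon navigation.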
📘 Detailed Summary
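Motivation: Current UAV vision-and-language navigation methods struggle to model long-horizon spatiotemporal context in complex environments, which leads to inaccurate semantic alignment and unstable path planning, especially in scenarios such as post-disaster search and rescue that involve high information density, rapid viewpoint changes, and dynamic structures.
Method: LongFly has three core modules: a slot-based historical image compression module that dynamically distills multi-view historical observations into fixed-length contextual representations; a spatiotemporal trajectory encoding module that captures the temporal dynamics and spatial structure of UAV trajectories; and a prompt-guided multimodal integration module that combines the existing spatiotemporal context with current observations to support time-based reasoning and robust waypoint prediction.
Result: Experiments show that LongFly outperforms state-of-the-art UAV vision-and-language navigation baselines by 7.89% in success rate and 6.33% in success weighted by path length, consistently across both seen and unseen environments.
Conclusion: The history-aware spatiotemporal modeling strategy turns fragmented and redundant historical data from long-horizon navigation into structured, compact, and expressive representations, offering a new approach to autonomous UAV navigation in complex, dynamic environments.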
📄 Abstract
Unmanned aerial vehicles (UAVs) are crucial tools for post-disaster search and rescue, facing challenges such as high information density, rapid changes in viewpoint, and dynamic structures, especially in long-horizon navigation. However, current UAV vision-and-language navigation (VLN) methods struggle to model long-horizon spatiotemporal context in complex environments, resulting in inaccurate semantic alignment and unstable path planning. To this end, we propose LongFly, a spatiotemporal context modeling framework for long-horizon UAV VLN. LongFly adopts a history-aware spatiotemporal modeling strategy that transforms fragmented and redundant historical data into structured, compact, and expressive representations. First, we propose the slot-based historical image compression module, which dynamically distills multi-view historical observations into fixed-length contextual representations. Then, the spatiotemporal trajectory encoding module is introduced to capture the temporal dynamics and spatial structure of UAV trajectories. Finally, to integrate existing spatiotemporal context with current observations, we design the prompt-guided multimodal integration module to support time-based reasoning and robust waypoint prediction. Experimental results demonstrate that LongFly outperforms state-of-the-art UAV VLN baselines by 7.89% in success rate and 6.33% in success weighted by path length, consistently across both seen and unseen environments.
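The abstract reports success rate and success weighted by path length (SPL) without defining them. The sketch below uses the SPL definition that is common in the embodied navigation literature, where each successful episode is weighted by the ratio of the shortest path length to the length actually travelled. Whether the LongFly benchmark computes the metric in exactly this way is an assumption, and the episode values are made up.

```python
def success_rate(episodes):
    """Fraction of episodes that reached the goal."""
    return sum(1 for success, _, _ in episodes if success) / len(episodes)


def spl(episodes):
    """Success weighted by Path Length: mean of S_i * l_i / max(p_i, l_i).

    Each episode is (success, shortest_path_length, travelled_path_length).
    """
    total = 0.0
    for success, shortest, travelled in episodes:
        if success:
            total += shortest / max(travelled, shortest)
    return total / len(episodes)


# Hypothetical episodes: an optimal success, a success with a detour, and a failure.
episodes = [(True, 10.0, 10.0), (True, 10.0, 20.0), (False, 10.0, 5.0)]
sr = success_rate(episodes)
weighted = spl(episodes)
```

SPL is never higher than the success rate, so a gain in both metrics suggests that the additional successes did not come from much longer paths.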
[18] Omni-Weather: Unified Multimodal Foundation Model for Weather Generation and Understanding
Zhiwang Zhou, Yuandong Pu, Xuming He, Yidi Liu, Yixin Chen, Junchao Gong, Xiang Zhuang, Wanghan Xu, Qinglong Cao, Shixiang Tang, Yihao Liu, Wenlong Zhang, Lei Bai
🧩 TL;DR
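This paper presents Omni-Weather, the first multimodal foundation model that unifies weather generation and understanding, delivering both accurate prediction and mechanistic interpretation within a single architecture and reaching state-of-the-art performance on both tasks.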
📘 Detailed Summary
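Motivation: Existing weather modeling methods treat generation and understanding as separate goals, so accurate prediction and mechanistic interpretation are never unified. This work aims to close that gap with a single model that handles both weather generation and weather understanding.
Method: Omni-Weather adopts a multimodal foundation model architecture that integrates a radar encoder for weather generation tasks, followed by unified processing through a shared self-attention mechanism. The authors also construct a Chain-of-Thought dataset for causal reasoning in weather generation, which enables interpretable outputs and improves perceptual quality.
Result: Extensive experiments show that Omni-Weather achieves state-of-the-art performance in both weather generation and understanding, and that the two kinds of task can mutually enhance each other in the weather domain.
Conclusion: The study demonstrates the feasibility and value of unifying weather generation and understanding, shows that the two tasks reinforce each other in the weather domain, and offers a new paradigm for more comprehensive and interpretable weather modeling systems.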
📄 Abstract
Weather modeling requires both accurate prediction and mechanistic interpretation, yet existing methods treat these goals in isolation, separating generation from understanding. To address this gap, we present Omni-Weather, the first multimodal foundation model that unifies weather generation and understanding within a single architecture. Omni-Weather integrates a radar encoder for weather generation tasks, followed by unified processing using a shared self-attention mechanism. Moreover, we construct a Chain-of-Thought dataset for causal reasoning in weather generation, enabling interpretable outputs and improved perceptual quality. Extensive experiments show Omni-Weather achieves state-of-the-art performance in both weather generation and understanding. Our findings further indicate that generative and understanding tasks in the weather domain can mutually enhance each other. Omni-Weather also demonstrates the feasibility and value of unifying weather generation and understanding.
[19] The Deepfake Detective: Interpreting Neural Forensics Through Sparse Features and Manifolds
Subramanyam Sahoo, Jared Junkin
🧩 TL;DR
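This paper presents a mechanistic interpretability framework for deepfake detection that combines sparse autoencoder analysis with a novel forensic manifold analysis, revealing systematic relationships between the geometry of a vision-language model's internal representations and different deepfake artifacts.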
📘 Detailed Summary
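Motivation: Deepfake detection models achieve high accuracy in identifying synthetic media, but their decision processes remain largely opaque. This limits progress on interpretability and robustness and motivates mechanistic interpretability methods that expose how the models work internally.
Method: The paper proposes a mechanistic interpretability framework applied to a vision-language model. It combines sparse autoencoder (SAE) analysis of internal network representations with a novel forensic manifold analysis, which probes how the model's features respond to controlled forensic artifact manipulations and analyzes geometric properties of the feature manifold, including intrinsic dimensionality, curvature, and feature selectivity.
Result: The experiments show that only a small fraction of latent features are actively used in each layer, and that the geometric properties of the feature manifold (intrinsic dimensionality, curvature, and feature selectivity) vary systematically with the type of deepfake artifact. These findings provide an empirical basis for understanding the model's internal mechanisms.
Conclusion: The work is a first step toward opening the "black box" of deepfake detectors. It makes it possible to identify which learned features correspond to specific forensic artifacts and offers guidance for developing more interpretable and robust models.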
📄 Abstract
Deepfake detection models have achieved high accuracy in identifying synthetic media, but their decision processes remain largely opaque. In this paper we present a mechanistic interpretability framework for deepfake detection applied to a vision-language model. Our approach combines a sparse autoencoder (SAE) analysis of internal network representations with a novel forensic manifold analysis that probes how the model's features respond to controlled forensic artifact manipulations. We demonstrate that only a small fraction of latent features are actively used in each layer, and that the geometric properties of the model's feature manifold, including intrinsic dimensionality, curvature, and feature selectivity, vary systematically with different types of deepfake artifacts. These insights provide a first step toward opening the "black box" of deepfake detectors, allowing us to identify which learned features correspond to specific forensic artifacts and to guide the development of more interpretable and robust models.
[20] UniPercept: Towards Unified Perceptual-Level Image Understanding across Aesthetics, Quality, Structure, and Texture
Shuo Cao, Jiayang Li, Xiaohui Li, Yuandong Pu, Kaiwen Zhu, Yuanting Gao, Siqi Luo, Yi Xin, Qi Qin, Yu Zhou, Xiangyu Chen, Wenlong Zhang, Bin Fu, Yu Qiao, Yihao Liu
🧩 TL;DR
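This paper presents UniPercept-Bench, a unified benchmark framework for perceptual-level image understanding across aesthetics, quality, structure, and texture, and develops UniPercept, a strong baseline trained via domain-adaptive pre-training and task-aligned reinforcement learning that outperforms existing MLLMs on perceptual-level image understanding tasks.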
📘 Detailed Summary
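Motivation: Although multimodal large language models have made remarkable progress on visual understanding tasks, their ability to understand perceptual-level image features remains limited, and there is no systematic evaluation framework or benchmark for key perceptual domains such as aesthetics, quality, structure, and texture.
Method: The study establishes a hierarchical definition system and constructs large-scale datasets, then develops the UniPercept model, trained via domain-adaptive pre-training and task-aligned reinforcement learning. The framework supports both visual rating and visual question answering tasks and generalizes robustly across perceptual domains.
Result: UniPercept outperforms existing multimodal large language models on perceptual-level image understanding tasks, can serve as a plug-and-play reward model for text-to-image generation, and performs well across perceptual domains including aesthetics, quality, structure, and texture.
Conclusion: The work defines perceptual-level image understanding in the era of MLLMs and, by introducing a comprehensive benchmark together with a strong baseline, lays a solid foundation for advancing perceptual-level multimodal image understanding and shows the potential of a unified framework for cross-domain perceptual understanding.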
📄 Abstract
Multimodal large language models (MLLMs) have achieved remarkable progress in visual understanding tasks such as visual grounding, segmentation, and captioning. However, their ability to perceive perceptual-level image features remains limited. In this work, we present UniPercept-Bench, a unified framework for perceptual-level image understanding across three key domains: Aesthetics, Quality, Structure and Texture. We establish a hierarchical definition system and construct large-scale datasets to evaluate perceptual-level image understanding. Based on this foundation, we develop a strong baseline UniPercept trained via Domain-Adaptive Pre-Training and Task-Aligned RL, enabling robust generalization across both Visual Rating (VR) and Visual Question Answering (VQA) tasks. UniPercept outperforms existing MLLMs on perceptual-level image understanding and can serve as a plug-and-play reward model for text-to-image generation. This work defines Perceptual-Level Image Understanding in the era of MLLMs and, through the introduction of a comprehensive benchmark together with a strong baseline, provides a solid foundation for advancing perceptual-level multimodal image understanding.
[21] Contrastive Graph Modeling for Cross-Domain Few-Shot Medical Image Segmentation
Yuntian Bo, Tao Zhou, Zechao Li, Haofeng Zhang, Ling Shao
🧩 TL;DR
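This paper proposes Contrastive Graph Modeling (C-Graph), a new framework for cross-domain few-shot medical image segmentation. By using the structural consistency of medical images as a domain-transferable prior, the framework substantially improves cross-domain performance while preserving source-domain segmentation accuracy.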
📘 Detailed Summary
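Motivation: Existing cross-domain few-shot medical image segmentation methods usually filter out domain-specific information to improve generalization, which inadvertently limits cross-domain performance and degrades source-domain accuracy. This work aims to resolve that tension by improving cross-domain segmentation while maintaining source-domain accuracy.
Method: The Contrastive Graph Modeling framework represents image features as graphs, with pixels as nodes and semantic affinities as edges. A Structural Prior Graph layer captures and transfers target-category node dependencies and enables global structure modeling through explicit node interactions. Building on this layer, a Subgraph Matching Decoding mechanism uses semantic relations among nodes to guide prediction, and a Confusion-minimizing Node Contrast loss mitigates node ambiguity and subgraph heterogeneity.
Result: The method significantly outperforms prior cross-domain few-shot medical image segmentation approaches across multiple cross-domain benchmarks, achieving state-of-the-art performance. It also maintains strong segmentation accuracy on the source domain, avoiding the trade-off in which existing methods sacrifice source-domain accuracy for cross-domain gains.
Conclusion: The study shows that using the structural consistency of medical images as a domain-transferable prior is an effective strategy for cross-domain few-shot segmentation. The proposed graph modeling framework improves cross-domain performance while preserving source-domain accuracy, offering a promising solution for data-scarce medical image analysis.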
📄 Abstract
Cross-domain few-shot medical image segmentation (CD-FSMIS) offers a promising and data-efficient solution for medical applications where annotations are severely scarce and multimodal analysis is required. However, existing methods typically filter out domain-specific information to improve generalization, which inadvertently limits cross-domain performance and degrades source-domain accuracy. To address this, we present Contrastive Graph Modeling (C-Graph), a framework that leverages the structural consistency of medical images as a reliable domain-transferable prior. We represent image features as graphs, with pixels as nodes and semantic affinities as edges. A Structural Prior Graph (SPG) layer is proposed to capture and transfer target-category node dependencies and enable global structure modeling through explicit node interactions. Building upon SPG layers, we introduce a Subgraph Matching Decoding (SMD) mechanism that exploits semantic relations among nodes to guide prediction. Furthermore, we design a Confusion-minimizing Node Contrast (CNC) loss to mitigate node ambiguity and subgraph heterogeneity by contrastively enhancing node discriminability in the graph space. Our method significantly outperforms prior CD-FSMIS approaches across multiple cross-domain benchmarks, achieving state-of-the-art performance while simultaneously preserving strong segmentation accuracy on the source domain.
[22] SlideChain: Semantic Provenance for Lecture Understanding via Blockchain Registration
Md Motaleb Hossen Manik, Md Zabirul Islam, Ge Wang
🧩 TL;DR
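This paper presents SlideChain, a blockchain-backed provenance framework that provides verifiable integrity for multimodal semantic extraction. By analyzing the semantic outputs of four state-of-the-art vision-language models on medical education slides, it reveals pronounced cross-model discrepancies and demonstrates the framework's effectiveness for tamper detection and reproducibility.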
📘 Detailed Summary
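Motivation: Modern vision-language models are increasingly used to interpret and generate educational content, but their semantic outputs are hard to verify, reproduce, and audit over time. Inconsistencies across model families, inference settings, and computing environments undermine the reliability of AI-generated instructional material, particularly in high-stakes, quantitative STEM domains, which calls for a framework that provides verifiable integrity.
Method: The study introduces SlideChain, a blockchain-backed provenance framework. Using the SlideChain Slides Dataset (a curated corpus of 1,117 medical imaging lecture slides), it extracts concepts and relational triples from four state-of-the-art vision-language models, constructs a structured provenance record for every slide, and anchors cryptographic hashes of these records on a local EVM-compatible blockchain, providing tamper-evident auditability and persistent semantic baselines.
Result: The first systematic analysis of semantic disagreement, cross-model similarity, and lecture-level variability in multimodal educational content reveals pronounced cross-model discrepancies, including low concept overlap and near-zero agreement in relational triples on many slides. The study also evaluates gas usage, throughput, and scalability under simulated deployment conditions, and demonstrates perfect tamper detection along with deterministic reproducibility across independent extraction runs.
Conclusion: SlideChain offers a practical and scalable step toward trustworthy, verifiable multimodal educational pipelines, supporting long-term auditability, reproducibility, and integrity for AI-assisted instructional systems. The empirical results indicate that blockchain anchoring can help address consistency and reliability problems in multimodal semantic extraction, providing a technical basis for deploying AI systems in educational technology.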
📄 Abstract
Modern vision-language models (VLMs) are increasingly used to interpret and generate educational content, yet their semantic outputs remain challenging to verify, reproduce, and audit over time. Inconsistencies across model families, inference settings, and computing environments undermine the reliability of AI-generated instructional material, particularly in high-stakes and quantitative STEM domains. This work introduces SlideChain, a blockchain-backed provenance framework designed to provide verifiable integrity for multimodal semantic extraction at scale. Using the SlideChain Slides Dataset, a curated corpus of 1,117 medical imaging lecture slides from a university course, we extract concepts and relational triples from four state-of-the-art VLMs and construct structured provenance records for every slide. SlideChain anchors cryptographic hashes of these records on a local EVM (Ethereum Virtual Machine)-compatible blockchain, providing tamper-evident auditability and persistent semantic baselines. Through the first systematic analysis of semantic disagreement, cross-model similarity, and lecture-level variability in multimodal educational content, we reveal pronounced cross-model discrepancies, including low concept overlap and near-zero agreement in relational triples on many slides. We further evaluate gas usage, throughput, and scalability under simulated deployment conditions, and demonstrate perfect tamper detection along with deterministic reproducibility across independent extraction runs. Together, these results show that SlideChain provides a practical and scalable step toward trustworthy, verifiable multimodal educational pipelines, supporting long-term auditability, reproducibility, and integrity for AI-assisted instructional systems.
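The tamper-evidence idea can be sketched off-chain with the standard library. The example assumes a provenance record is a JSON-serializable dictionary and hashes a canonical serialization of it. SHA-256 is used here only as a stand-in because it ships with Python; the abstract does not name the hash function, the record schema, or the contract interface, so the field names and helpers below are hypothetical.

```python
import hashlib
import json


def record_hash(record):
    """Hash a provenance record via a canonical (key-sorted, compact) JSON form."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_untampered(record, anchored_hash):
    """Recompute the hash and compare it with the value anchored earlier."""
    return record_hash(record) == anchored_hash


# Hypothetical record for one slide and one model.
record = {
    "slide_id": "lecture01_slide003",
    "model": "vlm-a",
    "concepts": ["attenuation", "projection"],
    "triples": [["projection", "measures", "attenuation"]],
}
anchored = record_hash(record)  # in SlideChain this value would be stored on-chain

# Simulated tampering: one extracted concept is silently dropped.
tampered = dict(record, concepts=["attenuation"])
```

Canonical serialization matters for the reproducibility claim: two extraction runs that produce the same content must produce the same bytes, otherwise equal records would hash differently.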
[23] Analyzing the Mechanism of Attention Collapse in VGGT from a Dynamics Perspective
Huan Li, Longjun Luo, Yuling Shi, Xiaodong Gu
🧩 TL;DR
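This paper gives a rigorous mathematical explanation of the collapse of the global self-attention layer in the Visual Geometry Grounded Transformer (VGGT), modeling it as a degenerate diffusion process and deriving a closed-form convergence rate whose predictions match experimental observations.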
📘 Detailed Summary
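Motivation: The Visual Geometry Grounded Transformer (VGGT) performs strongly on 3D reconstruction, but its global self-attention layer collapses drastically when the input sequence exceeds a few hundred frames: attention matrices rapidly become near rank-one, token geometry degenerates to an almost one-dimensional subspace, and reconstruction error accumulates super-linearly. This work aims to provide a rigorous mathematical explanation of that collapse.
Method: The study views the global-attention iteration as a degenerate diffusion process. It proves that, in VGGT, the token-feature flow converges toward a Dirac-type measure at an O(1/L) rate, where L is the layer index, and derives a closed-form mean-field partial differential equation that precisely predicts the empirically observed rank profile.
Result: The theory quantitatively matches the evolution of attention heat maps and a series of experimental outcomes reported in related work, and it explains how the token-merging remedy delays collapse by lowering the effective diffusion coefficient.
Conclusion: The analysis offers a principled lens for interpreting future scalable 3D-vision transformers and highlights its potential for multi-modal generalization. It exposes basic mathematical properties of the attention mechanism and provides theoretical guidance for improving architectures that process large-scale sequences.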
📄 Abstract
Visual Geometry Grounded Transformer (VGGT) delivers state-of-the-art feed-forward 3D reconstruction, yet its global self-attention layer suffers from a drastic collapse phenomenon when the input sequence exceeds a few hundred frames: attention matrices rapidly become near rank-one, token geometry degenerates to an almost one-dimensional subspace, and reconstruction error accumulates super-linearly. In this report, we establish a rigorous mathematical explanation of the collapse by viewing the global-attention iteration as a degenerate diffusion process. We prove that, in VGGT, the token-feature flow converges toward a Dirac-type measure at an $O(1/L)$ rate, where $L$ is the layer index, yielding a closed-form mean-field partial differential equation that precisely predicts the empirically observed rank profile. The theory quantitatively matches the attention-heat-map evolution and a series of experimental outcomes reported in related work, and explains why the token-merging remedy, which periodically removes redundant tokens, reduces the effective diffusion coefficient and thereby delays collapse without additional training. We believe the analysis provides a principled lens for interpreting future scalable 3D-vision transformers, and we highlight its potential for multi-modal generalization.
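The contraction behind the collapse can be seen in a toy setting. The sketch below is not the paper's mean-field analysis and does not model VGGT: it assumes a parameter-free attention update, X <- softmax(X X^T / sqrt(d)) X, with no learned projections, residual connections, or normalization, and tracks the diameter of the token set. Because every updated token is a convex combination of the previous tokens with strictly positive weights, the diameter can only shrink.

```python
import math
import random


def attention_step(tokens):
    """One parameter-free self-attention update (toy model, no learned weights)."""
    d = len(tokens[0])
    updated = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)
        updated.append(
            [sum(w * k[j] for w, k in zip(weights, tokens)) / norm for j in range(d)]
        )
    return updated


def diameter(tokens):
    """Largest pairwise Euclidean distance between tokens."""
    return max(math.dist(a, b) for a in tokens for b in tokens)


random.seed(0)
tokens = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(16)]

history = [diameter(tokens)]
for _ in range(20):
    tokens = attention_step(tokens)
    history.append(diameter(tokens))
```

Plotting `history` against the iteration index gives a quick empirical view of how fast the tokens cluster in this toy model. The rate in a real network depends on the learned projections and residual paths, which is what the paper's analysis addresses.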
[24] FUSE: Unifying Spectral and Semantic Cues for Robust AI-Generated Image Detection
Md. Zahid Hossain, Most. Sharmin Sultana Samu, Md. Kamrozzaman Bhuiyan, Farhad Uz Zaman, Md. Rakibul Islam
🧩 TL;DR
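This paper presents FUSE, a hybrid system that combines spectral and semantic features to detect AI-generated images, achieving state-of-the-art results on the Chameleon benchmark and strong generalization across generators.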
📘 Detailed Summary
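Motivation: As generative models advance rapidly, demand for reliable detection of AI-generated images keeps growing. Existing methods often perform poorly on high-fidelity images, particularly on benchmarks such as Chameleon, which motivates more robust detectors that generalize better.
Method: FUSE uses a hybrid architecture that extracts spectral features with the Fast Fourier Transform and semantic features with the CLIP vision encoder, fuses them into a joint representation, and trains the model with a two-stage progressive strategy.
Result: Evaluations on the GenImage, WildFake, DiTFake, GPT-ImgEval, and Chameleon datasets show that the FUSE (Stage 1) model reaches state-of-the-art results on the Chameleon benchmark, with 91.36% mean accuracy on GenImage, 88.71% accuracy across all tested generators, and a mean Average Precision of 94.96%. Stage 2 training further improves detection performance for most generators.
Conclusion: Integrating spectral and semantic features improves the generalization and robustness of AI-generated image detection, and keeps performance stable on high-fidelity images in particular, offering a useful technical path toward more reliable detection of generated content.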
📄 Abstract
The fast evolution of generative models has heightened the demand for reliable detection of AI-generated images. To tackle this challenge, we introduce FUSE, a hybrid system that combines spectral features extracted through the Fast Fourier Transform with semantic features obtained from the CLIP vision encoder. The features are fused into a joint representation and trained progressively in two stages. Evaluations on the GenImage, WildFake, DiTFake, GPT-ImgEval, and Chameleon datasets demonstrate strong generalization across multiple generators. Our FUSE (Stage 1) model demonstrates state-of-the-art results on the Chameleon benchmark. It also attains 91.36% mean accuracy on the GenImage dataset, 88.71% accuracy across all tested generators, and a mean Average Precision of 94.96%. Stage 2 training further improves performance for most generators. Unlike existing methods, which often perform poorly on high-fidelity images in Chameleon, our approach maintains robustness across diverse generators. These findings highlight the benefits of integrating spectral and semantic features for generalized detection of AI-generated images.
[25] AstraNav-World: World Model for Foresight Control and Consistency
Junjun Hu, Jintao Chen, Haochen Bai, Minghua Luo, Shichao Xie, Ziyi Chen, Fei Liu, Zedong Chu, Xinda Xue, Botao Ren, Xiaolong Wu, Mu Xu, Shanghang Zhang
🧩 TL;DR
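This paper proposes AstraNav-World, an end-to-end world model that jointly reasons about future visual states and action sequences within a unified probabilistic framework. By integrating a diffusion-based video generator with a vision-language policy, it rolls out visual predictions and action plans in sync, improving the accuracy and robustness of embodied navigation in open, dynamic environments.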
📘 Detailed Summary
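Motivation: Embodied navigation in open, dynamic environments requires accurate foresight of how the world will evolve and how actions will unfold over time. Existing decoupled "envision-then-plan" pipelines are prone to cumulative errors, and the loose coupling between visual prediction and action decisions yields predictions that are not executable and decisions that are detached from physically consistent, task-relevant futures.
Method: AstraNav-World uses a unified probabilistic framework that integrates a diffusion-based video generator with a vision-language policy, so that predicted scenes and planned actions are rolled out in sync. Training optimizes two complementary objectives: generating action-conditioned multi-step visual predictions, and deriving trajectories conditioned on those predicted visuals. This bidirectional constraint makes visual predictions executable and keeps decisions grounded in physically consistent, task-relevant futures.
Result: Across diverse embodied navigation benchmarks, AstraNav-World shows improved trajectory accuracy and higher success rates. Ablations confirm the necessity of tight vision-action coupling and unified training: removing either branch degrades both prediction quality and policy reliability. In real-world testing, the model shows strong zero-shot capabilities, adapting to previously unseen scenarios without any real-world fine-tuning.
Conclusion: By unifying foresight vision and control within a single generative model, the work moves toward reliable, interpretable, and general-purpose embodied agents. The results suggest that the model captures transferable spatial understanding and planning-relevant navigation dynamics rather than merely overfitting to the simulation data distribution, offering a new paradigm for robust operation in open-ended real-world settings.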
📄 Abstract
Embodied navigation in open, dynamic environments demands accurate foresight of how the world will evolve and how actions will unfold over time. We propose AstraNav-World, an end-to-end world model that jointly reasons about future visual states and action sequences within a unified probabilistic framework. Our framework integrates a diffusion-based video generator with a vision-language policy, enabling synchronized rollouts where predicted scenes and planned actions are updated simultaneously. Training optimizes two complementary objectives: generating action-conditioned multi-step visual predictions and deriving trajectories conditioned on those predicted visuals. This bidirectional constraint makes visual predictions executable and keeps decisions grounded in physically consistent, task-relevant futures, mitigating cumulative errors common in decoupled "envision-then-plan" pipelines. Experiments across diverse embodied navigation benchmarks show improved trajectory accuracy and higher success rates. Ablations confirm the necessity of tight vision-action coupling and unified training, with either branch removal degrading both prediction quality and policy reliability. In real-world testing, AstraNav-World demonstrated exceptional zero-shot capabilities, adapting to previously unseen scenarios without any real-world fine-tuning. These results suggest that AstraNav-World captures transferable spatial understanding and planning-relevant navigation dynamics, rather than merely overfitting to simulation-specific data distribution. Overall, by unifying foresight vision and control within a single generative model, we move closer to reliable, interpretable, and general-purpose embodied agents that operate robustly in open-ended real-world settings.
[26] Scene-VLM: Multimodal Video Scene Segmentation via Vision-Language Models
Nimrod Berman, Adam Botach, Emanuel Ben-Baruch, Shunit Haviv Hakimi, Asaf Gendler, Ilan Naiman, Erez Yosef, Igor Kviatkovsky
🧩 TL;DR
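This paper presents Scene-VLM, the first fine-tuned vision-language model framework for video scene segmentation. It jointly processes visual and textual cues to reason multimodally across shots and achieves state-of-the-art performance on standard benchmarks.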
📘 Detailed Summary
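Motivation: Existing encoder-based methods are limited by visual-centric biases, classify each shot in isolation without using sequential dependencies, and lack both narrative understanding and explainability. These limitations motivate a video scene segmentation approach that reasons multimodally and accounts for temporal dependencies.
Method: Scene-VLM jointly processes visual and textual cues, including frames, transcriptions, and optional metadata, and generates predictions sequentially with causal dependencies among shots. It introduces a context-focus window mechanism to ensure sufficient temporal context for each shot-level decision, and proposes a scheme that extracts confidence scores from the VLM's token-level logits to enable controllable precision-recall trade-offs.
Result: The method achieves state-of-the-art performance on standard scene segmentation benchmarks, with improvements of +6 AP and +13.7 F1 over the previous leading method on MovieNet. With minimal targeted supervision, it can also generate coherent natural-language rationales for its boundary decisions.
Conclusion: Scene-VLM demonstrates that vision-language models are effective for video scene segmentation. Multimodal reasoning and sequential dependency modeling overcome limitations of earlier methods, while the model also provides explainable decisions and a controllable performance trade-off, opening a new direction for large-scale video understanding.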
📄 Abstract
Segmenting long-form videos into semantically coherent scenes is a fundamental task in large-scale video understanding. Existing encoder-based methods are limited by visual-centric biases, classify each shot in isolation without leveraging sequential dependencies, and lack both narrative understanding and explainability. In this paper, we present Scene-VLM, the first fine-tuned vision-language model (VLM) framework for video scene segmentation. Scene-VLM jointly processes visual and textual cues including frames, transcriptions, and optional metadata to enable multimodal reasoning across consecutive shots. The model generates predictions sequentially with causal dependencies among shots and introduces a context-focus window mechanism to ensure sufficient temporal context for each shot-level decision. In addition, we propose a scheme to extract confidence scores from the token-level logits of the VLM, enabling controllable precision-recall trade-offs that were previously limited to encoder-based methods. Furthermore, we demonstrate that our model can be aligned to generate coherent natural-language rationales for its boundary decisions through minimal targeted supervision. Our approach achieves state-of-the-art performance on standard scene segmentation benchmarks. On MovieNet, for example, Scene-VLM yields significant improvements of +6 AP and +13.7 F1 over the previous leading method.
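One plausible reading of the confidence-extraction scheme is a two-way softmax over the logits of the two answer tokens at the decision position. The sketch below assumes that reading; the abstract does not give the exact formula, and the logit values, function names, and thresholds are hypothetical.

```python
import math


def boundary_confidence(logit_boundary, logit_no_boundary):
    """Two-way softmax over the logits of the 'boundary' and 'no boundary' tokens."""
    top = max(logit_boundary, logit_no_boundary)
    e_yes = math.exp(logit_boundary - top)
    e_no = math.exp(logit_no_boundary - top)
    return e_yes / (e_yes + e_no)


def predict_boundaries(shot_logits, threshold=0.5):
    """Mark a shot as a scene boundary when its confidence reaches the threshold."""
    return [boundary_confidence(y, n) >= threshold for y, n in shot_logits]


# Hypothetical (boundary, no-boundary) logits for three consecutive shots.
shot_logits = [(2.0, -1.0), (0.2, 0.1), (-3.0, 1.5)]

strict = predict_boundaries(shot_logits, threshold=0.9)   # favors precision
lenient = predict_boundaries(shot_logits, threshold=0.5)  # favors recall
```

Raising the threshold drops the ambiguous second shot, which is the precision-recall control that a plain generated "yes" or "no" answer would not offer.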
[27] Few Tokens Matter: Entropy Guided Attacks on Vision-Language Models
Mengqi He, Xinyu Tian, Xin Shen, Jinhong Ni, Shu Zou, Zhaoyuan Yang, Jing Zhang
🧩 TL;DR
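This study proposes Entropy-bank Guided Adversarial attacks (EGA), which identify critical decision points in autoregressive generation and concentrate the attack on these high-entropy positions. The approach substantially raises the safety threat to vision-language models and reveals new weaknesses in current VLM safety mechanisms.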
📘 Detailed Summary
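Motivation: Prior entropy-based adversarial attacks assume that every decoding step contributes equally to generation instability. In practice, only a small fraction of high-entropy tokens (about 20%) act as critical decision points that govern output trajectories. This creates an opportunity for more efficient targeted attacks and a need to expose the more critical safety risks in current VLM safety mechanisms.
Method: The proposed EGA method first identifies high-entropy critical decision points in autoregressive generation, then concentrates adversarial perturbations on these positions rather than on all decoding steps. This selective strategy achieves semantic degradation comparable to global methods with a much smaller attack budget, and it exploits the recurrence of high-entropy forks across architecturally diverse VLMs to achieve feasible transferability.
Result: Across multiple representative VLMs, selective attacks convert 35-49% of benign outputs into harmful ones, with attack success rates of 93-95%. Cross-model transfer yields harmful rates of 17-26% on unseen targets, and the attacks match the semantic degradation of global methods with substantially smaller budgets.
Conclusion: The study shows that a small number of high-entropy tokens act as dominant decision points in vision-language models. EGA exposes new weaknesses in current VLM safety mechanisms, offers a new perspective on the vulnerability of autoregressive generative models, and can guide the development of more robust defenses.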
📄 Abstract
Vision-language models (VLMs) achieve remarkable performance but remain vulnerable to adversarial attacks. Entropy, a measure of model uncertainty, is strongly correlated with the reliability of VLMs. Prior entropy-based attacks maximize uncertainty at all decoding steps, implicitly assuming that every token contributes equally to generation instability. We show instead that a small fraction (about 20%) of high-entropy tokens, i.e., critical decision points in autoregressive generation, disproportionately governs output trajectories. By concentrating adversarial perturbations on these positions, we achieve semantic degradation comparable to global methods while using substantially smaller budgets. More importantly, across multiple representative VLMs, such selective attacks convert 35-49% of benign outputs into harmful ones, exposing a more critical safety risk. Remarkably, these vulnerable high-entropy forks recur across architecturally diverse VLMs, enabling feasible transferability (17-26% harmful rates on unseen targets). Motivated by these findings, we propose Entropy-bank Guided Adversarial attacks (EGA), which achieve competitive attack success rates (93-95%) alongside high harmful conversion, thereby revealing new weaknesses in current VLM safety mechanisms.
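The analysis step behind this finding, locating the small fraction of high-entropy decoding positions, can be sketched as follows. The example covers only the measurement, not any attack. It assumes access to the per-step next-token distributions, and the 20% fraction is taken from the abstract while the distributions and function names are made up.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def high_entropy_positions(step_probs, fraction=0.2):
    """Return the indices of the top `fraction` of decoding steps by entropy."""
    k = max(1, round(fraction * len(step_probs)))
    ranked = sorted(
        range(len(step_probs)), key=lambda i: entropy(step_probs[i]), reverse=True
    )
    return sorted(ranked[:k])


# Hypothetical distributions over a 4-token vocabulary for five decoding steps.
step_probs = [
    [0.97, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],  # a genuine fork: the model is maximally uncertain
    [0.90, 0.05, 0.03, 0.02],
    [0.70, 0.10, 0.10, 0.10],
    [0.99, 0.005, 0.003, 0.002],
]

forks = high_entropy_positions(step_probs, fraction=0.2)
```

The same ranking is useful defensively, for example to decide which decoding steps deserve extra safety checks.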
[28] End-to-End 3D Spatiotemporal Perception with Multimodal Fusion and V2X Collaboration
Zhenwei Yang, Yibo Ai, Weidong Zhang
🧩 TL;DR
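This paper proposes XET-V2X, a multimodal fused end-to-end tracking framework for V2X collaboration that unifies multi-view, multimodal sensing within a shared spatiotemporal representation, improving 3D spatiotemporal understanding for autonomous driving under occlusion and communication delay.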
📘 Detailed Summary
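Motivation: Multi-view cooperative perception and multimodal fusion are essential for reliable 3D spatiotemporal understanding in autonomous driving, especially in V2X scenarios with occlusions, limited viewpoints, and communication delays. Robust perception in complex traffic scenes requires effectively aligning heterogeneous viewpoints and modalities.
Method: XET-V2X uses a dual-layer spatial cross-attention module based on multi-scale deformable attention to align heterogeneous viewpoints and modalities efficiently. Multi-view image features are aggregated first to enhance semantic consistency, and point cloud fusion is then guided by the updated spatial queries, enabling effective cross-modal interaction while reducing computational overhead.
Result: On the real-world V2X-Seq-SPD dataset and the simulated V2X-Sim-V2V and V2X-Sim-V2I benchmarks, XET-V2X consistently improves detection and tracking performance under varying communication delays. Quantitative results and qualitative visualizations indicate robust, temporally stable perception in complex traffic scenarios.
Conclusion: With a unified spatiotemporal representation and an efficient dual-layer spatial cross-attention mechanism, XET-V2X addresses the multi-view, multimodal fusion challenge in V2X cooperative perception. The framework offers a practical solution for reliable perception under occlusion and communication delay and a technical basis for future vehicle-infrastructure cooperative systems.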
📄 Abstract
Multi-view cooperative perception and multimodal fusion are essential for reliable 3D spatiotemporal understanding in autonomous driving, especially under occlusions, limited viewpoints, and communication delays in V2X scenarios. This paper proposes XET-V2X, a multimodal fused end-to-end tracking framework for V2X collaboration that unifies multi-view multimodal sensing within a shared spatiotemporal representation. To efficiently align heterogeneous viewpoints and modalities, XET-V2X introduces a dual-layer spatial cross-attention module based on multi-scale deformable attention. Multi-view image features are first aggregated to enhance semantic consistency, followed by point cloud fusion guided by the updated spatial queries, enabling effective cross-modal interaction while reducing computational overhead. Experiments on the real-world V2X-Seq-SPD dataset and the simulated V2X-Sim-V2V and V2X-Sim-V2I benchmarks demonstrate consistent improvements in detection and tracking performance under varying communication delays. Both quantitative results and qualitative visualizations indicate that XET-V2X achieves robust and temporally stable perception in complex traffic scenarios.
[29] Breaking Alignment Barriers: TPS-Driven Semantic Correlation Learning for Alignment-Free RGB-T Salient Object Detection
Lupiao Hu, Fasheng Wang, Fangmei Chen, Fuming Sun, Haojie Li
🧩 TL;DR
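This paper proposes TPS-SCL, an efficient salient object detection method for real-world unaligned RGB-T image pairs. It uses a Thin-Plate Spline-driven Semantic Correlation Learning network that improves detection on unaligned cross-modal data while keeping parameter counts and computational overhead low.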
📘 Detailed Summary
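Motivation: Existing RGB-T salient object detection methods rely mainly on manually aligned and annotated datasets and struggle with raw, unaligned RGB-T image pairs from the real world. Because of significant cross-modal disparities such as spatial misalignment, scale variations, and viewpoint shifts, the performance of current methods deteriorates sharply on unaligned datasets, which is a major bottleneck in practice.
Method: The proposed Thin-Plate Spline-driven Semantic Correlation Learning network uses a dual-stream MobileViT encoder combined with efficient Mamba scanning mechanisms to model correlations between modalities. A Semantic Correlation Constraint Module hierarchically constrains salient features to suppress interference from redundant background, a Thin-Plate Spline Alignment Module mitigates spatial discrepancies between modalities, and a Cross-Modal Correlation Module explores and integrates inter-modal dependencies.
Result: Extensive experiments on multiple datasets show that TPS-SCL attains state-of-the-art performance among existing lightweight salient object detection methods and outperforms mainstream RGB-T salient object detection approaches, while keeping parameter counts and computational overhead low.
Conclusion: The work offers an effective solution for salient object detection on real-world unaligned RGB-T image pairs. By combining thin-plate spline alignment, semantic constraints, and cross-modal correlation learning, it improves robustness to cross-modal disparities and provides a useful reference framework for multimodal vision tasks in practice.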
📄 Abstract
Existing RGB-T salient object detection methods predominantly rely on manually aligned and annotated datasets, struggling to handle real-world scenarios with raw, unaligned RGB-T image pairs. In practical applications, due to significant cross-modal disparities such as spatial misalignment, scale variations, and viewpoint shifts, the performance of current methods drastically deteriorates on unaligned datasets. To address this issue, we propose an efficient RGB-T SOD method for real-world unaligned image pairs, termed Thin-Plate Spline-driven Semantic Correlation Learning Network (TPS-SCL). We employ a dual-stream MobileViT as the encoder, combined with efficient Mamba scanning mechanisms, to effectively model correlations between the two modalities while maintaining low parameter counts and computational overhead. To suppress interference from redundant background information during alignment, we design a Semantic Correlation Constraint Module (SCCM) to hierarchically constrain salient features. Furthermore, we introduce a Thin-Plate Spline Alignment Module (TPSAM) to mitigate spatial discrepancies between modalities. Additionally, a Cross-Modal Correlation Module (CMCM) is incorporated to fully explore and integrate inter-modal dependencies, enhancing detection performance. Extensive experiments on various datasets demonstrate that TPS-SCL attains state-of-the-art (SOTA) performance among existing lightweight SOD methods and outperforms mainstream RGB-T SOD approaches.
[30] Training-free Conditional Image Embedding Framework Leveraging Large Vision Language Models
Masayuki Kawarada, Kosuke Yamada, Antonio Tejero-de-Pablos, Naoto Inoue
🧩 TL;DR
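This paper proposes DIOR, a training-free method that uses a large vision-language model to generate conditional image embeddings, that is, feature representations of the specific aspects of an image indicated by a textual condition (e.g., color, genre).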
📘 Detailed Summary
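Motivation: Conditional image embedding aims to extract feature representations that focus on specific aspects of an image indicated by a textual condition (e.g., color, genre), which is a challenging problem. Vision foundation models such as CLIP offer rich image representations but are not designed to focus on a specified condition, so a general method is needed that can produce conditional embeddings for arbitrary images and conditions.
Method: DIOR is a training-free method that prompts a large vision-language model (LVLM) to describe an image with a single word related to the given condition, then extracts the hidden state vector of the LVLM's last token as the conditional image embedding. It requires no additional training or task-specific priors and can be applied to any image and condition.
Result: Comprehensive experiments on conditional image similarity tasks show that DIOR outperforms existing training-free baselines, including CLIP. It also outperforms methods that require additional training across multiple settings.
Conclusion: DIOR is a general and effective way to generate conditional image embeddings for arbitrary images and conditions without additional training. It shows the potential of large vision-language models for extracting condition-specific features and suggests a new direction for conditional image representation learning.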
📄 Abstract
Conditional image embeddings are feature representations that focus on specific aspects of an image indicated by a given textual condition (e.g., color, genre), and generating them remains a challenging problem. Although recent vision foundation models, such as CLIP, offer rich representations of images, they are not designed to focus on a specified condition. In this paper, we propose DIOR, a method that leverages a large vision-language model (LVLM) to generate conditional image embeddings. DIOR is a training-free approach that prompts the LVLM to describe an image with a single word related to a given condition. The hidden state vector of the LVLM's last token is then extracted as the conditional image embedding. DIOR provides a versatile solution that can be applied to any image and condition without additional training or task-specific priors. Comprehensive experimental results on conditional image similarity tasks demonstrate that DIOR outperforms existing training-free baselines, including CLIP. Furthermore, DIOR achieves superior performance compared to methods that require additional training across multiple settings.
[31] DPAR: Dynamic Patchification for Efficient Autoregressive Visual Generation
Divyansh Srivastava, Akshay Mehra, Pranav Maneriker, Debopam Sanyal, Vishnu Raj, Vijay Kamarshi, Fan Du, Joshua Kimball
🧩 TL;DR
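This paper presents DPAR, a novel decoder-only autoregressive image generation model that dynamically aggregates image tokens into a variable number of patches for efficient generation. It is the first to show that next-token prediction entropy from a lightweight, unsupervised autoregressive model is a reliable criterion for merging tokens based on information content.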
📘 Detailed Summary
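Motivation: Decoder-only autoregressive image generation typically relies on fixed-length tokenization schemes whose token counts grow quadratically with resolution, which substantially increases the computational and memory demands of attention. Handling high-resolution images efficiently therefore calls for a more economical token representation.
Method: DPAR uses a dynamic token aggregation mechanism in which the next-token prediction entropy of a lightweight, unsupervised autoregressive model serves as the information measure for merging image tokens into variable-size patches. It makes minimal modifications to the standard decoder architecture, stays compatible with multimodal generation frameworks, and allocates more compute to high-information image regions.
Result: DPAR reduces token count by 1.81x and 2.06x at ImageNet 256 and 384 generation resolution respectively, cutting training FLOPs by up to 40%. It converges faster and improves FID by up to 27.1% relative to baseline models. Training with dynamically sized patches also makes representations robust to patch boundaries, allowing DPAR to scale to larger patch sizes at inference.
Conclusion: The study demonstrates that autoregressive prediction entropy is an effective criterion for token merging and offers a new paradigm for efficient image generation. Dynamic patch aggregation reduces compute requirements and improves generation quality by concentrating compute on information-rich regions, laying groundwork for higher-resolution generation and multimodal applications.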
📄 Abstract
Decoder-only autoregressive image generation typically relies on fixed-length tokenization schemes whose token counts grow quadratically with resolution, substantially increasing the computational and memory demands of attention. We present DPAR, a novel decoder-only autoregressive model that dynamically aggregates image tokens into a variable number of patches for efficient image generation. Our work is the first to demonstrate that next-token prediction entropy from a lightweight and unsupervised autoregressive model provides a reliable criterion for merging tokens into larger patches based on information content. DPAR makes minimal modifications to the standard decoder architecture, ensuring compatibility with multimodal generation frameworks and allocating more compute to generation of high-information image regions. Further, we demonstrate that training with dynamically sized patches yields representations that are robust to patch boundaries, allowing DPAR to scale to larger patch sizes at inference. DPAR reduces token count by 1.81x and 2.06x at ImageNet 256 and 384 generation resolutions, respectively, leading to a reduction of up to 40% in training FLOPs. Further, our method exhibits faster convergence and improves FID by up to 27.1% relative to baseline models.
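As a rough illustration of the entropy-based merging idea in the DPAR abstract, the sketch below groups consecutive tokens into patches using the entropy of each next-token distribution. The thresholding rule, the `max_patch` cap, and the function names are assumptions made for this example; the abstract does not specify DPAR's actual merging procedure.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def merge_by_entropy(next_token_probs, threshold, max_patch=4):
    """Group consecutive token indices into variable-sized patches.

    Assumed rule (not taken from the paper): a token whose predictive
    entropy is at or above `threshold` opens a new patch, while
    low-entropy tokens are appended to the current patch until it
    holds `max_patch` tokens.
    """
    patches = []
    for i, probs in enumerate(next_token_probs):
        h = entropy(probs)
        if not patches or h >= threshold or len(patches[-1]) >= max_patch:
            patches.append([i])
        else:
            patches[-1].append(i)
    return patches


# Toy distributions over a 4-symbol vocabulary (hypothetical values).
uncertain = [0.25, 0.25, 0.25, 0.25]
predictable = [0.97, 0.01, 0.01, 0.01]
dists = [uncertain, predictable, predictable, uncertain, predictable]

patches = merge_by_entropy(dists, threshold=1.0)
print(patches)
```

With these toy distributions the five tokens collapse into two patches, so a downstream decoder would attend over fewer positions. Regions with many uncertain tokens would produce more, smaller patches under this rule.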
[32] Patch as Node: Human-Centric Graph Representation Learning for Multimodal Action Recognition
Zeyu Liang, Hailun Xia, Naichuan Zheng
🧩 TL;DR
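This paper proposes PAN, the first human-centric graph representation learning framework for multimodal action recognition. By representing RGB patches that contain human joints as spatiotemporal graphs, it addresses the heterogeneity between RGB and skeleton modalities and achieves state-of-the-art performance on multiple datasets.
📘 Detailed Summary
Motivation: Current multimodal action recognition methods that fuse RGB and skeleton modalities suffer from inherent heterogeneity and fail to fully exploit the complementary potential of the two modalities. A fusion framework is needed that aligns RGB and skeleton features and suppresses redundant RGB information.
Method: The paper proposes PAN, a human-centric graph representation learning framework that represents token embeddings of RGB patches containing human joints as spatiotemporal graphs. It uses attention-based post calibration to reduce dependence on high-quality skeletal data and develops two variants: PAN-Ensemble (dual-path graph convolution networks followed by late fusion) and PAN-Unified (unified graph representation learning within a single network).
Result: On three widely used multimodal action recognition datasets, PAN-Ensemble and PAN-Unified achieve state-of-the-art performance in the separate-modeling and unified-modeling fusion settings, respectively, demonstrating the framework's effectiveness and robustness.
Conclusion: The study shows that the human-centric graph modeling paradigm suppresses redundant RGB information and aligns well with skeleton-based methods, yielding a more effective and semantically coherent fusion for multimodal action recognition, while the post-calibration mechanism reduces dependence on high-quality skeletal data.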
📄 Abstract
While human action recognition has witnessed notable achievements, multimodal methods fusing RGB and skeleton modalities still suffer from their inherent heterogeneity and fail to fully exploit the complementary potential between them. In this paper, we propose PAN, the first human-centric graph representation learning framework for multimodal action recognition, in which token embeddings of RGB patches containing human joints are represented as spatiotemporal graphs. The human-centric graph modeling paradigm suppresses the redundancy in RGB frames and aligns well with skeleton-based methods, thus enabling a more effective and semantically coherent fusion of multimodal features. Since the sampling of token embeddings heavily relies on 2D skeletal data, we further propose attention-based post calibration to reduce the dependency on high-quality skeletal data at a minimal cost in terms of model performance. To explore the potential of PAN in integrating with skeleton-based methods, we present two variants: PAN-Ensemble, which employs dual-path graph convolution networks followed by late fusion, and PAN-Unified, which performs unified graph representation learning within a single network. On three widely used multimodal action recognition datasets, both PAN-Ensemble and PAN-Unified achieve state-of-the-art (SOTA) performance in their respective settings of multimodal fusion: separate and unified modeling, respectively.
[33] Perceive and Calibrate: Analyzing and Enhancing Robustness of Medical Multi-Modal Large Language Models
Dunyuan XU, Xikai Yang, Yaoqian Li, Juzheng Miao, Jinpeng Li, Pheng-Ann Heng
🧩 TL;DR
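This paper proposes IMC, a training-free multimodal calibration framework that uses the inherent denoising capabilities of medical multimodal large language models to improve their robustness to real-world input perturbations, achieving state-of-the-art performance on a benchmark covering 11 noise types.
📘 Detailed Summary
Motivation: The sensitivity of medical multimodal large language models (MLLMs) to real-world input perturbations, such as imaging artifacts and textual errors, critically undermines their clinical applicability. Existing work focuses mainly on the text modality and relies on costly fine-tuning, which cannot address the complex noise patterns and strict safety standards of medicine.
Method: The paper proposes the training-free Inherent-enhanced Multi-modal Calibration (IMC) framework, which follows the perceive-and-calibrate principle. For the visual modality, Perturbation-aware Denoising Calibration (PDC) uses the MLLM's own vision encoder to identify noise patterns and perform prototype-guided feature calibration. For text denoising, a Self-instantiated Multi-agent System (SMS) exploits the MLLM's self-assessment capabilities to refine noisy text through a cooperative hierarchy of agents.
Result: Experiments show that the method achieves state-of-the-art performance on a benchmark containing 11 noise types across image and text modalities on 2 datasets, demonstrating strong cross-modal robustness gains and markedly improving the applicability of medical MLLMs in real clinical scenarios.
Conclusion: The study demonstrates the feasibility of training-free robustness enhancement using MLLMs' inherent denoising capabilities and offers an effective solution for safely deploying medical MLLMs in clinical settings. The proposed IMC framework also has the potential to extend to other medical AI applications.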
📄 Abstract
Medical Multi-modal Large Language Models (MLLMs) have shown promising clinical performance. However, their sensitivity to real-world input perturbations, such as imaging artifacts and textual errors, critically undermines their clinical applicability. Systematic analysis of such noise impact on medical MLLMs remains largely unexplored. Furthermore, while several works have investigated the MLLMs' robustness in general domains, they primarily focus on text modality and rely on costly fine-tuning. They are inadequate to address the complex noise patterns and fulfill the strict safety standards in medicine. To bridge this gap, this work systematically analyzes the impact of various perturbations on medical MLLMs across both visual and textual modalities. Building on our findings, we introduce a training-free Inherent-enhanced Multi-modal Calibration (IMC) framework that leverages MLLMs' inherent denoising capabilities following the perceive-and-calibrate principle for cross-modal robustness enhancement. For the visual modality, we propose a Perturbation-aware Denoising Calibration (PDC) which leverages MLLMs' own vision encoder to identify noise patterns and perform prototype-guided feature calibration. For text denoising, we design a Self-instantiated Multi-agent System (SMS) that exploits the MLLMs' self-assessment capabilities to refine noisy text through a cooperative hierarchy of agents. We construct a benchmark containing 11 types of noise across both image and text modalities on 2 datasets. Experimental results demonstrate our method achieves the state-of-the-art performance across multiple modalities, showing potential to enhance MLLMs' robustness in real clinical scenarios.
[34] Look Closer! An Adversarial Parametric Editing Framework for Hallucination Mitigation in VLMs
Jiayu Hu, Beibei Li, Jiangwei Xia, Yanjun Qin, Bing Ji, Zhongshi He
🧩 TL;DR
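This paper proposes ALEAHallu, an adversarial parametric editing framework for mitigating hallucination in vision-language models. The framework follows an Activate-Locate-Edit Adversarially paradigm: it identifies hallucination-prone parameter clusters and fine-tunes them adversarially, forcing the model to prioritize visual evidence over inherent parametric biases.
📘 Detailed Summary
Motivation: Vision-language models exhibit persistent hallucination, generating outputs that are inconsistent with visual inputs. The heuristic decoding calibration strategies proposed in prior work are non-trainable, which limits their optimization potential, so more effective trainable methods are needed.
Method: The paper proposes the ALEAHallu adversarial parametric editing framework, which follows the Activate-Locate-Edit Adversarially paradigm. It first constructs an activation dataset of grounded responses anchored in visual features and hallucinatory responses reflecting prior bias. It then identifies hallucination-prone parameter clusters by analyzing the differential hidden states of response pairs. Finally, it fine-tunes these clusters using prompts injected with adversarially tuned prefixes that are optimized to maximize visual neglect, forcing the model to prioritize visual evidence.
Result: Evaluations on both generative and discriminative VLM tasks show that ALEAHallu is significantly effective at mitigating hallucination. Adversarial parametric editing markedly improves the model's reliance on visual inputs and reduces hallucinations driven by language priors.
Conclusion: The study shows that directly targeting internal parametric biases through adversarial parametric editing is effective and offers a new trainable paradigm for mitigating VLM hallucination. The method goes beyond the limits of non-trainable heuristic strategies by locating and editing critical parameter clusters.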
📄 Abstract
While Vision-Language Models (VLMs) have garnered increasing attention in the AI community due to their promising practical applications, they exhibit persistent hallucination issues, generating outputs misaligned with visual inputs. Recent studies attribute these hallucinations to VLMs' over-reliance on linguistic priors and insufficient visual feature integration, proposing heuristic decoding calibration strategies to mitigate them. However, the non-trainable nature of these strategies inherently limits their optimization potential. To this end, we propose an adversarial parametric editing framework for hallucination mitigation in VLMs (ALEAHallu), which follows an Activate-Locate-Edit Adversarially (ALEA) paradigm. Specifically, we first construct an activation dataset that comprises grounded responses (positive samples attentively anchored in visual features) and hallucinatory responses (negative samples reflecting LLM prior bias and internal knowledge artifacts). Next, we identify critical hallucination-prone parameter clusters by analyzing differential hidden states of response pairs. Then, these clusters are fine-tuned using prompts injected with adversarially tuned prefixes that are optimized to maximize visual neglect, thereby forcing the model to prioritize visual evidence over inherent parametric biases. Evaluations on both generative and discriminative VLM tasks demonstrate the significant effectiveness of ALEAHallu in alleviating hallucinations. Our code is available at https://github.com/hujiayu1223/ALEAHallu.
[35] iSHIFT: Lightweight Slow-Fast GUI Agent with Adaptive Perception
Sarthak Mehrotra, Sairam V C Rebbapragada, Mani Hemanth Reddy Bonthu, Vineeth N Balasubramanian
🧩 TL;DR
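This paper proposes iSHIFT, a lightweight multimodal large language model agent that combines implicit slow-fast hybrid reasoning with a flexible perception token mechanism, balancing efficient and precise GUI interaction while keeping a compact model size (2.5B parameters).
📘 Detailed Summary
Motivation: Current multimodal large language models struggle to balance efficiency and precision in graphical user interface interaction. Agents must handle routine tasks efficiently while also providing exact visual grounding for fine-grained interactions, yet existing approaches are not accurate enough when identifying specific interface elements, and the models are large and cannot adapt their reasoning depth to the task.
Method: The iSHIFT framework integrates implicit chain-of-thought (latent reasoning) with a perception control module and switches between two modes: a slow mode that uses detailed visual grounding for high-precision interaction, and a fast mode that relies on global cues for efficiency. Special perception tokens guide attention to relevant screen regions, letting the model decide both how to reason and where to focus.
Result: Despite having only 2.5B parameters, iSHIFT matches state-of-the-art performance on multiple benchmark datasets, showing that it balances efficiency and precision in GUI interaction tasks while keeping a compact architecture.
Conclusion: The study shows that with adaptive reasoning and attention guidance, a lightweight multimodal model can perform comparably to large models on GUI interaction tasks. It offers a new architectural approach for building efficient and precise interface agents and moves multimodal reasoning in a more flexible and scalable direction.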
📄 Abstract
Multimodal Large Language Models (MLLMs) show strong potential for interpreting and interacting with complex, pixel-rich Graphical User Interface (GUI) environments. However, building agents that are both efficient for high-level tasks and precise for fine-grained interactions remains challenging. GUI agents must perform routine actions efficiently while also handling tasks that demand exact visual grounding, yet existing approaches struggle when accuracy depends on identifying specific interface elements. These MLLMs also remain large and cannot adapt their reasoning depth to the task at hand. In this work, we introduce iSHIFT: Implicit Slow-fast Hybrid Inference with Flexible Tokens, a lightweight agent that integrates latent thinking (implicit chain-of-thought) with a perception control module. iSHIFT enables an MLLM to switch between a slow mode, which leverages detailed visual grounding for high precision and a fast mode that uses global cues for efficiency. Special perception tokens guide attention to relevant screen regions, allowing the model to decide both how to reason and where to focus. Despite its compact 2.5B size, iSHIFT matches state-of-the-art performance on multiple benchmark datasets.
[36] See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning
Shuoshuo Zhang, Yizhen Zhang, Jingjing Fu, Lei Song, Jiang Bian, Yujiu Yang, Rui Wang
🧩 TL;DR
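This paper proposes Bi-directional Perceptual Shaping (BiPS), which shapes training with bidirectional visual signals. It targets existing vision-language models that rely on external tools or latent visual tokens and therefore overlook fine-grained visual evidence, generalize poorly, and incur high inference cost.
📘 Detailed Summary
Motivation: Current large vision-language models rely on intermediate visual cues injected via external tools or generated as latent visual tokens, but these mechanisms still overlook fine-grained visual evidence (e.g., polylines in charts), generalize poorly across domains, and are costly at inference time. A more effective way to train visual perception is needed.
Method: BiPS transforms question-conditioned masked views into bidirectional where-to-look signals that shape training. It first applies a KL-consistency constraint between the original image and an evidence-preserving view to ensure complete coverage of supporting pixels. It then applies a KL-separation constraint between the original image and an evidence-ablated view, which prevents the model from answering through text-only shortcuts and forces it to rely on fine-grained visual information.
Result: Across eight benchmarks, BiPS improves the average performance of Qwen2.5-VL-7B by 8.2% and shows strong out-of-domain generalization to unseen datasets and image types, supporting its effectiveness at strengthening visual reliance and generalization.
Conclusion: By shaping training with bidirectional visual signals, BiPS strengthens vision-language models' reliance on fine-grained visual evidence, mitigates text-only shortcuts, and improves cross-domain generalization, offering a new paradigm for perception training in vision-language models.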
📄 Abstract
Large vision-language models (VLMs) often benefit from intermediate visual cues, either injected via external tools or generated as latent visual tokens during reasoning, but these mechanisms still overlook fine-grained visual evidence (e.g., polylines in charts), generalize poorly across domains, and incur high inference-time cost. In this paper, we propose Bi-directional Perceptual Shaping (BiPS), which transforms question-conditioned masked views into bidirectional where-to-look signals that shape perception during training. BiPS first applies a KL-consistency constraint between the original image and an evidence-preserving view that keeps only question-relevant regions, encouraging coarse but complete coverage of supporting pixels. It then applies a KL-separation constraint between the original and an evidence-ablated view where critical pixels are masked so the image no longer supports the original answer, discouraging text-only shortcuts (i.e., answering from text alone) and enforcing fine-grained visual reliance. Across eight benchmarks, BiPS boosts Qwen2.5-VL-7B by 8.2% on average and shows strong out-of-domain generalization to unseen datasets and image types.
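The two KL constraints described in the BiPS abstract can be illustrated with a small sketch over answer distributions. The hinge-with-margin form of the separation term, the `margin` and `weight` parameters, and the function names are assumptions for illustration only; the abstract does not give the exact loss.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def bips_style_loss(p_orig, p_keep, p_ablate, margin=0.5, weight=1.0):
    """Toy combination of a consistency term and a separation term.

    consistency: answers on the original image should match answers on
                 the evidence-preserving view (small KL is good).
    separation:  answers on the original image should differ from answers
                 on the evidence-ablated view (KL below `margin` is
                 penalized; the hinge is an assumed design choice).
    """
    consistency = kl_divergence(p_orig, p_keep)
    separation = max(0.0, margin - kl_divergence(p_orig, p_ablate))
    return consistency + weight * separation


# Hypothetical answer distributions over three candidate answers.
p_orig = [0.8, 0.1, 0.1]

# A model that ignores the image answers the same way after ablation.
shortcut_loss = bips_style_loss(p_orig, p_orig, [0.8, 0.1, 0.1])

# A visually grounded model changes its answer once evidence is masked.
grounded_loss = bips_style_loss(p_orig, p_orig, [0.1, 0.1, 0.8])

print(shortcut_loss, grounded_loss)
```

In this toy setting the shortcut-taking model is penalized by the full margin, while the grounded model incurs no loss, which matches the qualitative behavior the abstract describes.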
cs.CL [Back]
[37] Enabling Conversational Behavior Reasoning Capabilities in Full-Duplex Speech
Shuchang Pan, Siddharth Banerjee, Dhruv Hebbar, Siddhant Patel, Akshaj Gupta, Kan Jen Cheng, Hanjo Kim, Zeyi Austin Li, Martin Q. Ma, Tingle Li, Gopala Anumanchipalli, Jiachen Lian
🧩 TL;DR
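This paper proposes a causal inference framework based on Graph-of-Thoughts (GoT) that models the intent-to-action pathway in human conversation, delivering robust behavior detection and interpretable reasoning chains for full-duplex interactive systems.
📘 Detailed Summary
Motivation: Human conversation is organized by an implicit chain of thoughts that manifests as timed speech acts. Capturing this causal pathway is key to building natural full-duplex interactive systems, but existing methods cannot model this causal reasoning process.
Method: The method models conversation as causal inference within a Graph-of-Thoughts (GoT). It formalizes the intent-to-action pathway with a hierarchical labeling scheme, predicting high-level communicative intents and low-level speech acts to learn their causal and temporal dependencies, and builds a hybrid corpus that combines controllable simulations with human-annotated rationales.
Result: Experiments on synthetic and real full-duplex dialogues show that the framework delivers robust behavior detection, produces interpretable reasoning chains, and establishes a foundation for benchmarking conversational reasoning in full-duplex spoken dialogue systems. It can accurately forecast the next speech act and generate concise justifications for its decisions.
Conclusion: The study provides a causal inference framework for full-duplex interactive systems, uses the Graph-of-Thoughts structure to make interpretable predictions of conversational behavior, and establishes a benchmarking approach for conversational reasoning, laying groundwork for more natural dialogue systems.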
📄 Abstract
Human conversation is organized by an implicit chain of thoughts that manifests as timed speech acts. Capturing this causal pathway is key to building natural full-duplex interactive systems. We introduce a framework that enables reasoning over conversational behaviors by modeling this process as causal inference within a Graph-of-Thoughts (GoT). Our approach formalizes the intent-to-action pathway with a hierarchical labeling scheme, predicting high-level communicative intents and low-level speech acts to learn their causal and temporal dependencies. To train this system, we develop a hybrid corpus that pairs controllable, event-rich simulations with human-annotated rationales and real conversational speech. The GoT framework structures streaming predictions as an evolving graph, enabling a multimodal transformer to forecast the next speech act, generate concise justifications for its decisions, and dynamically refine its reasoning. Experiments on both synthetic and real duplex dialogues show that the framework delivers robust behavior detection, produces interpretable reasoning chains, and establishes a foundation for benchmarking conversational reasoning in full duplex spoken dialogue systems.
[38] Detecting AI-Generated Paraphrases in Bengali: A Comparative Study of Zero-Shot and Fine-Tuned Transformers
Md. Rakibul Islam, Most. Sharmin Sultana Samu, Md. Zahid Hossain, Farhad Uz Zaman, Md. Kamrozzaman Bhuiyan
🧩 TL;DR
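This study addresses the research gap in detecting AI-generated text in Bengali by systematically evaluating five transformer-based models. It finds that fine-tuned XLM-RoBERTa, mDeBERTa, and MultilingualBERT reach about 91% accuracy and F1-score on Bengali AI-generated text detection.
📘 Detailed Summary
Motivation: The ability of large language models to generate human-like text raises concerns about disinformation and content manipulation. Although existing research covers detection in multiple languages, Bengali, with its rich vocabulary and complex structure, remains largely unexplored for AI-generated text detection, which hinders efforts to maintain content authenticity and prevent malicious applications in that language.
Method: The study systematically evaluates five pretrained transformer-based models: XLMRoBERTa-Large, mDeBERTaV3-Base, BanglaBERT-Base, IndicBERT-Base, and MultilingualBERT-Base. It uses two strategies, zero-shot evaluation and task-specific fine-tuning, to examine how well these models perform on and adapt to Bengali AI-generated text detection.
Result: Zero-shot evaluation shows that all models perform near chance level (around 50% accuracy), highlighting the need for task-specific fine-tuning. After fine-tuning, XLM-RoBERTa, mDeBERTa, and MultilingualBERT reach about 91% on both accuracy and F1-score, while IndicBERT performs comparatively worse, indicating that fine-tuning is of limited effectiveness for it on this task.
Conclusion: The study fills a gap in Bengali AI-generated text detection, shows that pretrained transformer-based models are effective on the task after appropriate fine-tuning, and lays a foundation for robust Bengali AI-content detection systems. It also reveals performance differences among pretrained models on a language-specific task, providing a reference for future multilingual detection research.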
📄 Abstract
Large language models (LLMs) can produce text that closely resembles human writing. This capability raises concerns about misuse, including disinformation and content manipulation. Detecting AI-generated text is essential to maintain authenticity and prevent malicious applications. Existing research has addressed detection in multiple languages, but the Bengali language remains largely unexplored. Bengali's rich vocabulary and complex structure make distinguishing human-written and AI-generated text particularly challenging. This study investigates five transformer-based models: XLMRoBERTa-Large, mDeBERTaV3-Base, BanglaBERT-Base, IndicBERT-Base and MultilingualBERT-Base. Zero-shot evaluation shows that all models perform near chance levels (around 50% accuracy), highlighting the need for task-specific fine-tuning. Fine-tuning significantly improves performance, with XLM-RoBERTa, mDeBERTa and MultilingualBERT achieving around 91% on both accuracy and F1-score. IndicBERT demonstrates comparatively weaker performance, indicating limited effectiveness in fine-tuning for this task. This work advances AI-generated text detection in Bengali and establishes a foundation for building robust systems to counter AI-generated content.
[39] Bridging the Copyright Gap: Do Large Vision-Language Models Recognize and Respect Copyrighted Content?
Naen Xu, Jinghuai Zhang, Changjiang Li, Hengyu An, Chunyi Zhou, Jun Wang, Boyu Xu, Yuyuan Li, Tianyu Du, Shouling Ji
🧩 TL;DR
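This study systematically evaluates how large vision-language models handle copyrighted content, reveals significant deficiencies in copyright compliance in existing models, and proposes a tool-augmented defense framework to reduce infringement risk.
📘 Detailed Summary
Motivation: As large vision-language models become widely used, they pose a risk of copyright infringement when processing multimodal inputs that contain copyrighted content (such as book excerpts, news articles, music lyrics, and code documentation). Current models do not accurately recognize or comply with copyright regulations, which may lead to serious legal and ethical consequences.
Method: The study builds a large-scale benchmark dataset of 50,000 multimodal query-content pairs covering two scenarios, with and without a copyright notice, and four types of copyright notices. It also proposes a novel tool-augmented defense framework to improve models' compliant handling of copyrighted content.
Result: The evaluation shows that even state-of-the-art closed-source large vision-language models have significant deficiencies in recognizing and respecting copyrighted content and exhibit high infringement risk even when a copyright notice is present. The proposed defense framework reduces infringement risk in all scenarios.
Conclusion: The study underscores the importance of developing copyright-aware large vision-language models to ensure responsible and lawful use of copyrighted content. The proposed benchmark dataset and defense framework provide tools for future research on copyright compliance in multimodal AI systems.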
📄 Abstract
Large vision-language models (LVLMs) have achieved remarkable advancements in multimodal reasoning tasks. However, their widespread accessibility raises critical concerns about potential copyright infringement. Will LVLMs accurately recognize and comply with copyright regulations when encountering copyrighted content (e.g., user input, retrieved documents) in the context? Failure to comply with copyright regulations may lead to serious legal and ethical consequences, particularly when LVLMs generate responses based on copyrighted materials (e.g., retrieved book excerpts, news reports). In this paper, we present a comprehensive evaluation of various LVLMs, examining how they handle copyrighted content, such as book excerpts, news articles, music lyrics, and code documentation, when it is presented as visual input. To systematically measure copyright compliance, we introduce a large-scale benchmark dataset comprising 50,000 multimodal query-content pairs designed to evaluate how effectively LVLMs handle queries that could lead to copyright infringement. Given that real-world copyrighted content may or may not include a copyright notice, the dataset includes query-content pairs in two distinct scenarios: with and without a copyright notice. For the former, we extensively cover four types of copyright notices to account for different cases. Our evaluation reveals that even state-of-the-art closed-source LVLMs exhibit significant deficiencies in recognizing and respecting copyrighted content, even when presented with the copyright notice. To address this limitation, we introduce a novel tool-augmented defense framework for copyright compliance, which reduces infringement risks in all scenarios. Our findings underscore the importance of developing copyright-aware LVLMs to ensure the responsible and lawful use of copyrighted content.
[40] Explainable Statute Prediction via Attention-based Model and LLM Prompting
Sachin Pawar, Girish Keshav Palshikar, Anindita Sinha Banerjee, Nitin Ramrakhiyani, Basit Ali
🧩 TL;DR
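This paper proposes two methods for automatic statute prediction with explanations: the attention-based AoS model and the LLM-based LLMPrompt method. Both predict relevant statutes for legal cases and provide human-understandable explanations.
📘 Detailed Summary
Motivation: The paper addresses automatic statute prediction, that is, predicting the subset of relevant statutes for a given case description, which is useful for applications such as AI assistants for lawyers and legal question answering systems. To improve user acceptance of legal AI systems, it argues that predictions should be accompanied by human-understandable explanations that increase transparency and trust.
Method: The paper proposes two methods. The AoS (Attention-over-Sentences) model applies sentence-level attention and predicts relevant statutes with sentence transformers in a supervised setting. The LLMPrompt method uses a large language model for zero-shot prediction and explores both standard and Chain-of-Thought prompting. Both produce human-understandable explanations.
Result: The study compares the statute prediction performance of the two methods with each other and with several baselines on two popular datasets. It also evaluates the quality of the generated explanations through automated counterfactual evaluation and human evaluation, supporting the effectiveness of both methods in prediction accuracy and explanation understandability.
Conclusion: The study indicates that both small language models with attention and prompted large language models can address statute prediction with explanations, providing practical options for explainable legal AI systems. Future work could explore model fusion, domain adaptation, and richer explanation generation mechanisms to increase the practical value of legal AI systems.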
📄 Abstract
In this paper, we explore the problem of automatic statute prediction, where for a given case description, a subset of relevant statutes is to be predicted. Here, the term "statute" refers to a section, a sub-section, or an article of any specific Act. Addressing this problem would be useful in several applications such as AI assistants for lawyers and legal question answering systems. For better user acceptance of such Legal AI systems, we believe the predictions should also be accompanied by human-understandable explanations. We propose two techniques for addressing this problem of statute prediction with explanations: (i) AoS (Attention-over-Sentences), which uses attention over sentences in a case description to predict statutes relevant for it, and (ii) LLMPrompt, which prompts an LLM to predict as well as explain the relevance of a certain statute. AoS uses smaller language models, specifically sentence transformers, and is trained in a supervised manner, whereas LLMPrompt uses larger language models in a zero-shot manner and explores both standard and Chain-of-Thought (CoT) prompting techniques. Both these models produce explanations for their predictions in human-understandable forms. We compare the statute prediction performance of both proposed techniques with each other as well as with a set of competent baselines, across two popular datasets. We also evaluate the quality of the generated explanations through automated counterfactual evaluation as well as through human evaluation.
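To make the attention-over-sentences idea concrete, here is a minimal sketch in which a statute query vector attends over sentence embeddings of a case description, and the attention weights double as a sentence-level explanation. The dot-product scoring, the sigmoid output, and the toy vectors are assumptions for illustration; in the paper the embeddings would come from sentence transformers, and the abstract does not describe the exact architecture.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend_over_sentences(sentence_vecs, statute_vec):
    """Return (relevance score, per-sentence attention weights).

    The statute vector acts as the attention query (an assumption);
    the weights indicate which sentences drove the prediction.
    """
    weights = softmax([dot(s, statute_vec) for s in sentence_vecs])
    dim = len(statute_vec)
    pooled = [
        sum(w * s[d] for w, s in zip(weights, sentence_vecs))
        for d in range(dim)
    ]
    score = 1.0 / (1.0 + math.exp(-dot(pooled, statute_vec)))
    return score, weights


# Hypothetical 2-d embeddings for three sentences of a case description.
sentence_vecs = [[0.1, 0.0], [2.0, 0.5], [0.0, 0.2]]
statute_vec = [1.0, 0.0]  # hypothetical learned query for one statute

score, weights = attend_over_sentences(sentence_vecs, statute_vec)
most_relevant = weights.index(max(weights))
print(round(score, 3), most_relevant)
```

Here the second sentence receives the largest attention weight, so it would be surfaced as the explanation for why the statute was predicted.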
cs.AI [Back]
[41] From Visual Perception to Deep Empathy: An Automated Assessment Framework for House-Tree-Person Drawings Using Multimodal LLMs and Multi-Agent Collaboration
Shuide Wen, Yu Sun, Beier Ku, Zhi Gao, Lijun Ma, Yang Yang, Can Jiao
🧩 TL;DR
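This study proposes a framework based on multimodal large language models and multi-agent collaboration for standardizing the assessment of the House-Tree-Person drawing test, addressing the test's inconsistent scoring standards and reliance on subjective experience in traditional clinical psychology.
📘 Detailed Summary
Motivation: The House-Tree-Person drawing test, a widely used projective technique in clinical psychology, has long faced challenges such as heterogeneous scoring standards, reliance on examiners' subjective experience, and the lack of a unified quantitative coding system. This study aims to address these limitations of traditional assessment.
Method: The study proposes a multi-agent collaboration framework that uses multimodal large language models to integrate social-psychological perspectives and destigmatizing narratives. By dividing roles, it decouples feature recognition from psychological inference and enables standardized analysis of the drawing test.
Result: Quantitative experiments show that the mean semantic similarity between multimodal large language model interpretations and human expert interpretations is about 0.75, rising to 0.85 on structurally oriented expert datasets, which indicates expert-level baseline comprehension. Qualitative analysis shows that the system effectively corrects visual hallucinations and produces psychological reports with high ecological validity and internal coherence.
Conclusion: The study confirms the potential of multimodal large models as standardized tools for projective assessment. The proposed multi-agent framework decouples feature recognition from psychological inference through role division, offers a new paradigm for digital mental-health services, and advances computational psychology.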
📄 Abstract
Background: The House-Tree-Person (HTP) drawing test, introduced by John Buck in 1948, remains a widely used projective technique in clinical psychology. However, it has long faced challenges such as heterogeneous scoring standards, reliance on examiners' subjective experience, and a lack of a unified quantitative coding system. Results: Quantitative experiments showed that the mean semantic similarity between Multimodal Large Language Model (MLLM) interpretations and human expert interpretations was approximately 0.75 (standard deviation about 0.05). In structurally oriented expert data sets, this similarity rose to 0.85, indicating expert-level baseline comprehension. Qualitative analyses demonstrated that the multi-agent system, by integrating social-psychological perspectives and destigmatizing narratives, effectively corrected visual hallucinations and produced psychological reports with high ecological validity and internal coherence. Conclusions: The findings confirm the potential of multimodal large models as standardized tools for projective assessment. The proposed multi-agent framework, by dividing roles, decouples feature recognition from psychological inference and offers a new paradigm for digital mental-health services. Keywords: House-Tree-Person test; multimodal large language model; multi-agent collaboration; cosine similarity; computational psychology; artificial intelligence
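The abstract reports a mean semantic similarity and its standard deviation and lists cosine similarity among its keywords. A minimal sketch of how such statistics could be computed from paired text embeddings follows. The embedding vectors are toy values, and the choice of embedding model and of population standard deviation are assumptions not stated in the abstract.

```python
import math
from statistics import mean, pstdev


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for degenerate vectors
    return dot / (norm_a * norm_b)


# Hypothetical (model interpretation, expert interpretation) embedding pairs.
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),
    ([1.0, 0.0], [0.0, 1.0]),
    ([1.0, 1.0], [1.0, 0.0]),
]

sims = [cosine_similarity(m, e) for m, e in pairs]
summary = (mean(sims), pstdev(sims))
print(sims, summary)
```

Reporting the spread alongside the mean, as the abstract does, shows whether agreement with experts is uniform across drawings or driven by a few cases.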
[42] LogicLens: Visual-Logical Co-Reasoning for Text-Centric Forgery Analysis
Fanwei Zeng, Changtao Miao, Jing Huang, Zhiya Tan, Shutao Gong, Xiaoming Yu, Yang Wang, Huazhe Tan, Weibin Yao, Jianshu Li
🧩 TL;DR
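This paper proposes LogicLens, a unified visual-textual co-reasoning framework for text-centric forgery analysis. With a novel Cross-Cues-aware Chain of Thought mechanism and weighted multi-task reward optimization, it achieves significant gains on detection, grounding, and explanation tasks.
📘 Detailed Summary
Motivation: Current text-centric forgery analysis methods have three main limitations: they are usually limited to coarse-grained visual analysis and lack deep reasoning; they treat detection, grounding, and explanation as discrete sub-tasks and overlook their intrinsic relationships; and they lack high-quality, fine-grained annotated datasets for model training.
Method: The paper proposes the unified LogicLens framework. Its Cross-Cues-aware Chain of Thought (CCT) mechanism performs deep reasoning by iteratively cross-validating visual cues against textual logic. A weighted multi-task reward function for GRPO-based optimization ensures alignment across tasks. The PR² multi-agent system generates high-quality, cognitively aligned annotations, and the RealText dataset provides 5,397 images with fine-grained annotations.
Result: In a zero-shot evaluation on T-IC13, LogicLens surpasses the specialized framework by 41.4% and GPT-4o by 23.4% in macro-average F1 score. On the dense-text T-SROIE dataset, it leads other MLLM-based methods by a significant margin in mF1, CSS, and macro-average F1. Experiments demonstrate the framework's superiority across multiple benchmarks.
Conclusion: The study shows that a unified visual-textual co-reasoning framework is effective for text forgery analysis and highlights the importance of joint cross-task optimization and deep reasoning. The PR² annotation system and the RealText dataset are valuable resources for the field, and the work offers a new technical direction for information authenticity verification and AIGC security detection.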
📄 Abstract
Sophisticated text-centric forgeries, fueled by rapid AIGC advancements, pose a significant threat to societal security and information authenticity. Current methods for text-centric forgery analysis are often limited to coarse-grained visual analysis and lack the capacity for sophisticated reasoning. Moreover, they typically treat detection, grounding, and explanation as discrete sub-tasks, overlooking their intrinsic relationships for holistic performance enhancement. To address these challenges, we introduce LogicLens, a unified framework for Visual-Textual Co-reasoning that reformulates these objectives into a joint task. The deep reasoning of LogicLens is powered by our novel Cross-Cues-aware Chain of Thought (CCT) mechanism, which iteratively cross-validates visual cues against textual logic. To ensure robust alignment across all tasks, we further propose a weighted multi-task reward function for GRPO-based optimization. Complementing this framework, we first designed the PR² (Perceiver, Reasoner, Reviewer) pipeline, a hierarchical and iterative multi-agent system that generates high-quality, cognitively-aligned annotations. Then, we constructed RealText, a diverse dataset comprising 5,397 images with fine-grained annotations, including textual explanations, pixel-level segmentation, and authenticity labels for model training. Extensive experiments demonstrate the superiority of LogicLens across multiple benchmarks. In a zero-shot evaluation on T-IC13, it surpasses the specialized framework by 41.4% and GPT-4o by 23.4% in macro-average F1 score. Moreover, on the challenging dense-text T-SROIE dataset, it establishes a significant lead over other MLLM-based methods in mF1, CSS, and the macro-average F1. Our dataset, model, and code will be made publicly available.
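A weighted multi-task reward for GRPO-style optimization can be sketched as a weighted sum of per-task scores, followed by normalization within a group of sampled responses. The weights, the three score components, and the function names below are hypothetical. The abstract does not give the reward formula, and the group normalization shown is the commonly used GRPO form rather than anything confirmed for LogicLens.

```python
from statistics import mean, pstdev


def weighted_reward(detection, grounding, explanation, weights=(0.4, 0.3, 0.3)):
    """Combine per-task scores in [0, 1] into one scalar reward.

    The weights are illustrative placeholders, not values from the paper.
    """
    w_det, w_grd, w_exp = weights
    return w_det * detection + w_grd * grounding + w_exp * explanation


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps), so
    responses are scored relative to their siblings for the same prompt.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical (detection, grounding, explanation) scores for three
# responses sampled for the same forged-image prompt.
scores = [(1.0, 0.8, 0.5), (1.0, 0.2, 0.5), (0.0, 0.0, 0.2)]
rewards = [weighted_reward(*s) for s in scores]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Because all three tasks feed one scalar, a response that detects the forgery but grounds it poorly still ranks below one that does both well, which is one way to couple the sub-tasks during optimization.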
[43] A Medical Multimodal Diagnostic Framework Integrating Vision-Language Models and Logic Tree Reasoning
Zelin Zang, Wenyi Gu, Siqi Ma, Dan Yang, Yue Shen, Zhu Zhang, Guohui Fan, Wing-Kuen Ling, Fuji Yang
🧩 TL;DR
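This study proposes a diagnostic framework built on LLaVA that combines vision-language alignment with logic-regularized reasoning to improve the diagnostic reliability and interpretability of multimodal medical AI, and it shows strong performance on multiple benchmarks.
📘 Detailed Summary
Motivation: With the rapid growth of large language models and vision-language models in medicine, existing models often produce hallucinations or inconsistent reasoning chains when integrating clinical text and medical imaging, which limits clinical trust. More reliable multimodal reasoning methods are needed.
Method: The framework is built on LLaVA and includes an input encoder for text and images, a projection module for cross-modal alignment, a reasoning controller that decomposes diagnostic tasks into steps, and a logic tree generator that assembles stepwise premises into verifiable conclusions, thereby combining vision-language alignment with logic-regularized reasoning.
Result: Evaluations on MedXpertQA and other benchmarks show that the method significantly improves diagnostic accuracy on multimodal tasks and yields more interpretable reasoning traces, while remaining competitive in text-only settings.
Conclusion: The study is a step toward trustworthy multimodal medical AI. It suggests that a reasoning framework with logical constraints can reduce hallucinations and improve diagnostic reliability, offering a new technical path for clinical decision support systems.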
📄 Abstract
With the rapid growth of large language models (LLMs) and vision-language models (VLMs) in medicine, simply integrating clinical text and medical imaging does not guarantee reliable reasoning. Existing multimodal models often produce hallucinations or inconsistent chains of thought, limiting clinical trust. We propose a diagnostic framework built upon LLaVA that combines vision-language alignment with logic-regularized reasoning. The system includes an input encoder for text and images, a projection module for cross-modal alignment, a reasoning controller that decomposes diagnostic tasks into steps, and a logic tree generator that assembles stepwise premises into verifiable conclusions. Evaluations on MedXpertQA and other benchmarks show that our method improves diagnostic accuracy and yields more interpretable reasoning traces on multimodal tasks, while remaining competitive on text-only settings. These results suggest a promising step toward trustworthy multimodal medical AI.