cs.CV

[1] ProM3E: Probabilistic Masked MultiModal Embedding Model for Ecology

Srikumar Sastry, Subash Khanal, Aayush Dhakal, Jiayu Lin, Dan Cher, Phoenix Jarosz, Nathan Jacobs

🧩 TL;DR

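ProM3E is a probabilistic masked multimodal embedding model for representation learning in ecology. It supports any-to-any generation and retrieval across modalities through modality reconstruction in the embedding space, and it introduces a new cross-modal retrieval method that achieves superior performance.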


📘 Detailed Summary

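Motivation: This work addresses the challenge of fusing multimodal data in ecology, in particular how to infer missing modalities from a partial set of context modalities and how to assess whether fusing different modalities is feasible for a downstream task, in essence learning what to fuse.

Method: The model performs masked modality reconstruction in the embedding space, learning to infer missing modalities from a few context modalities. By design it supports modality inversion in the embedding space, and its probabilistic nature allows the feasibility of fusing modalities to be analysed. The authors also propose a new cross-modal retrieval method that mixes inter-modal and intra-modal similarities.

Result: Experiments show superior performance on all retrieval tasks, and linear probing on the model's hidden representations demonstrates strong representation learning capability.

Conclusion: The work offers a representation learning framework for multimodal data analysis in ecology. Its probabilistic modeling and modality inversion give a new perspective on the feasibility of modality fusion, and the proposed cross-modal retrieval method has broad potential for application.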


📄 Abstract

We introduce ProM3E, a probabilistic masked multimodal embedding model for any-to-any generation of multimodal representations for ecology. ProM3E is based on masked modality reconstruction in the embedding space, learning to infer missing modalities given a few context modalities. By design, our model supports modality inversion in the embedding space. The probabilistic nature of our model allows us to analyse the feasibility of fusing various modalities for given downstream tasks, essentially learning what to fuse. Using these features of our model, we propose a novel cross-modal retrieval approach that mixes inter-modal and intra-modal similarities to achieve superior performance across all retrieval tasks. We further leverage the hidden representation from our model to perform linear probing tasks and demonstrate the superior representation learning capability of our model. All our code, datasets and model will be released at https://vishu26.github.io/prom3e.
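The retrieval idea in the abstract, mixing inter-modal and intra-modal similarities, can be pictured with a small sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function name `mixed_retrieval_scores`, the use of cosine similarity, the mixing weight `alpha`, and the idea that each gallery item has both a cross-modal embedding and a same-modality embedding (for example, one inferred by the model).

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def mixed_retrieval_scores(query, gallery_cross, gallery_same, alpha=0.5):
    """Hypothetical mix of inter-modal and intra-modal cosine similarity.

    query:         (d,) embedding of the query in modality A
    gallery_cross: (n, d) gallery embeddings in modality B (inter-modal)
    gallery_same:  (n, d) gallery embeddings in modality A (intra-modal)
    alpha:         assumed mixing weight; 1.0 uses only inter-modal scores
    """
    q = l2_normalize(query)
    inter = l2_normalize(gallery_cross) @ q
    intra = l2_normalize(gallery_same) @ q
    return alpha * inter + (1.0 - alpha) * intra
```

Ranking the gallery is then `np.argsort(-scores)`. How the mixing weight is actually chosen in ProM3E is not stated in the abstract.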

[2] SCALE-VLP: Soft-Weighted Contrastive Volumetric Vision-Language Pre-training with Spatial-Knowledge Semantics

Ailar Mahdizadeh, Puria Azadi Moghadam, Xiangteng He, Shahriar Mirabbasi, Panos Nasiopoulos, Leonid Sigal

🧩 TL;DR

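SCALE-VLP is a soft-weighted contrastive vision-language pre-training framework that integrates volumetric spatial semantics with domain-aware, knowledge-infused semantics. It addresses the tendency of existing methods to ignore continuous, structured dependencies in volumetric data, and it delivers substantial gains on CT image-report retrieval, abnormality classification, and report generation.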


📘 Detailed Summary

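Motivation: Existing vision-language models are mostly limited to 2D data and binary supervision, overlooking the continuous and structured dependencies in volumetric data such as CT. Existing approaches also treat volumetric scans as independent 2D slices, which compromises spatial coherence and underuses rich clinical semantics.

Method: SCALE-VLP integrates volumetric spatial semantics, to preserve anatomical structure, with domain-aware, knowledge-infused semantics (e.g., radiological ontologies). Soft-weighted contrastive learning then yields structurally consistent and semantically grounded representations under limited supervision.

Result: Compared with the previous state of the art, SCALE-VLP achieves up to 4.3x higher top-1 CT-report retrieval, improves abnormality classification by 10 points, and reaches ROUGE-L 0.44 and BERT-F1 0.89 on report generation. Consistent gains are also observed in zero-shot cross-domain evaluation.

Conclusion: SCALE-VLP shows strong cross-task transferability and cross-domain generalization, with consistent gains without further fine-tuning, and provides a framework for learning structurally consistent, semantically grounded representations of volumetric medical data.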


📄 Abstract

Vision-language models (VLMs) have demonstrated strong cross-modal capabilities, yet most work remains limited to 2D data and assumes binary supervision (i.e., positive vs. negative pairs), overlooking the continuous and structured dependencies present in volumetric data such as CT. Existing approaches often treat volumetric scans as independent 2D slices, compromising spatial coherence and underutilizing rich clinical semantics. We propose SCALE-VLP, a soft-weighted contrastive vision-language pre-training framework that integrates (i) volumetric spatial semantics to preserve anatomical structure and (ii) domain-aware, knowledge-infused semantics (e.g., radiological ontologies) to guide alignment. This yields structurally consistent and semantically grounded representations under limited supervision, demonstrating strong cross-task transferability (retrieval, report generation, and classification), and cross-domain generalizability with consistent gains without further fine-tuning. In particular, compared to the previous state of the art, SCALE-VLP achieves up to 4.3x higher top-1 CT-report retrieval, improves abnormality classification by 10 points, and reaches ROUGE-L 0.44 and BERT-F1 0.89 for report generation. Further, in zero-shot evaluation on an out-of-domain external dataset, we observe consistent gains, indicating the cross-task and cross-domain generalization ability of SCALE-VLP.
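To make the contrast with binary supervision concrete, here is a minimal sketch of a contrastive loss with soft targets. It is a generic formulation, not the SCALE-VLP loss: the function name, the temperature value, and the way `soft_weights` would be derived (the abstract points to spatial and knowledge-based semantics) are all assumptions. With an identity weight matrix it reduces to the usual symmetric InfoNCE objective.

```python
import numpy as np

def log_softmax(x, axis=-1):
    """Numerically stable log-softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def soft_contrastive_loss(img, txt, soft_weights, temperature=0.07):
    """Symmetric cross-entropy between similarity logits and soft targets.

    img, txt:     (n, d) paired embeddings
    soft_weights: (n, n) non-negative pair weights with positive row and
                  column sums (hypothetical; identity = binary pairs)
    """
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature

    # Normalize weights into target distributions for each direction.
    t_i2t = soft_weights / soft_weights.sum(axis=1, keepdims=True)
    t_t2i = soft_weights.T / soft_weights.T.sum(axis=1, keepdims=True)

    loss_i2t = -(t_i2t * log_softmax(logits, axis=1)).sum(axis=1).mean()
    loss_t2i = -(t_t2i * log_softmax(logits.T, axis=1)).sum(axis=1).mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

The design point is that off-diagonal pairs with partial similarity receive some target mass instead of being treated as pure negatives.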

[3] SLIP: Structural-aware Language-Image Pretraining for Vision-Language Alignment

Wenbo Lu

🧩 TL;DR

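This paper proposes Structure-aware Language-Image Pretraining (SLIP), which adds a structural contrastive loss to model relationships among image-text pairs. It consistently outperforms CLIP on cross-modal retrieval and classification, showing the value of relational supervision for cross-modal alignment.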


📘 Detailed Summary

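Motivation: Existing vision-language pretraining methods treat image-text pairs as isolated training examples and ignore the rich relational structure found in real domains, such as e-commerce product co-purchase graphs and social recommendation networks. Neuroscientific evidence suggests that humans encode knowledge as relational cognitive maps, which motivates cross-modal learning methods that model structured relationships.

Method: SLIP integrates a structural contrastive loss that aligns modalities while also modeling relationships between neighboring entities in a structured graph. The author also constructs a large-scale Amazon product co-purchase multimodal graph dataset to support structured cross-modal supervision.

Result: SLIP consistently outperforms CLIP on cross-modal retrieval and classification in both zero-shot and few-shot settings, supporting the effectiveness of relational supervision for cross-modal alignment.

Conclusion: The work shows the importance of structured relational supervision in cross-modal learning, offers a way to use domain-specific relational structure to improve vision-language models, and illustrates the potential of relational cognitive maps in AI systems.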


📄 Abstract

Vision-Language Pretraining (VLP) has achieved remarkable success across various downstream tasks, but such gains are largely driven by scaling up training data. Yet existing methods treat image-text pairs as isolated training examples; this neglects the rich relational structure naturally present in many domains, such as e-commerce product co-purchase graphs and social recommendation networks. Inspired by neuroscientific evidence that humans encode knowledge as relational cognitive maps, we introduce Structure-aware Language-Image Pretraining (SLIP). SLIP integrates a structural contrastive loss to align modalities while also modeling relationships between neighboring entities in a structured graph. To support this paradigm, we construct a large-scale Amazon Product Co-purchase Multimodal Graph Dataset, enabling structured cross-modality supervision at scale. Experimental results show that SLIP consistently outperforms CLIP on cross-modal retrieval and classification tasks in both zero-shot and few-shot settings, showing the value of relational supervision for cross-modal alignment.

[4] Finetuning-Free Personalization of Text to Image Generation via Hypernetworks

Sagar Shrestha, Gopal Sharma, Luowei Zhou, Suren Kumar

🧩 TL;DR

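This paper proposes a hypernetwork-based, fine-tuning-free method for personalized text-to-image generation. A hypernetwork trained with an end-to-end objective predicts LoRA adaptation weights, removing the per-subject optimization cost of traditional methods while preserving subject fidelity and prompt alignment.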


📘 Detailed Summary

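Motivation: Traditional personalization methods for text-to-image diffusion models, such as DreamBooth, rely on subject-specific fine-tuning, which is computationally expensive and slow at inference. Adapter- and encoder-based methods try to reduce this overhead but still need additional fine-tuning or large backbone models. Hypernetwork approaches avoid per-subject optimization but suffer from costly data generation or unstable attempts to mimic optimization trajectories.

Method: The authors propose an end-to-end training objective, stabilized by a simple output regularization, to build a reliable hypernetwork that predicts LoRA adaptation weights directly from subject images. At inference they introduce Hybrid-Model Classifier-Free Guidance (HM-CFG), which combines the compositional strengths of the base diffusion model with the subject fidelity of the personalized model during sampling.

Result: Extensive experiments on CelebA-HQ, AFHQ-v2, and DreamBench show strong personalization performance and highlight hypernetworks as a scalable and effective direction for open-category personalization.

Conclusion: The method removes the need for per-subject optimization at test time while preserving subject fidelity and prompt alignment. The results suggest that hypernetworks are a promising direction for scalable personalized text-to-image generation and offer a new approach to open-category personalization.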


📄 Abstract

Personalizing text-to-image diffusion models has traditionally relied on subject-specific fine-tuning approaches such as DreamBooth, which are computationally expensive and slow at inference. Recent adapter- and encoder-based methods attempt to reduce this overhead but still depend on additional fine-tuning or large backbone models for satisfactory results. In this work, we revisit an orthogonal direction: fine-tuning-free personalization via Hypernetworks that predict LoRA-adapted weights directly from subject images. Prior hypernetwork-based approaches, however, suffer from costly data generation or unstable attempts to mimic base model optimization trajectories. We address these limitations with an end-to-end training objective, stabilized by a simple output regularization, yielding reliable and effective hypernetworks. Our method removes the need for per-subject optimization at test time while preserving both subject fidelity and prompt alignment. To further enhance compositional generalization at inference time, we introduce Hybrid-Model Classifier-Free Guidance (HM-CFG), which combines the compositional strengths of the base diffusion model with the subject fidelity of personalized models during sampling. Extensive experiments on CelebA-HQ, AFHQ-v2, and DreamBench demonstrate that our approach achieves strong personalization performance and highlight the promise of hypernetworks as a scalable and effective direction for open-category personalization.

[5] SurgAnt-ViVQA: Learning to Anticipate Surgical Events through GRU-Driven Temporal Cross-Attention

Shreyas C. Dhake, Jiayuan Huang, Runlong He, Danyal Z. Khan, Evangelos B. Mazomenos, Sophia Bano, Hani J. Marcus, Danail Stoyanov, Matthew J. Clarkson, Mobarak I. Hoque

🧩 TL;DR

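This paper introduces the PitVQA-Anticipation dataset and the SurgAnt-ViVQA model. PitVQA-Anticipation is presented as the first VQA dataset designed for forward-looking surgical reasoning, and the model uses temporal modeling and gated fusion to move from retrospective description to proactive anticipation.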


📘 Detailed Summary

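Motivation: Existing surgical VQA systems mostly rely on static vision-language alignment over isolated frames and offer little support for predicting next steps or instrument needs. Existing surgical VQA datasets likewise focus on the current scene rather than the near future, which falls short of the real-time assistance needed in settings such as endonasal transsphenoidal pituitary surgery, where visibility is limited and the workflow changes rapidly.

Method: SurgAnt-ViVQA is a video-language model built around a GRU Gated Temporal Cross-Attention module. A bidirectional GRU encodes frame-to-frame dynamics, an adaptive gate injects visual context into the language stream at the token level, and parameter-efficient fine-tuning adapts the language backbone to the surgical domain.

Result: On the PitVQA-Anticipation and EndoVis datasets, SurgAnt-ViVQA surpasses strong image- and video-based baselines. Ablations show that temporal recurrence and gated fusion drive most of the gains. A frame-budget study finds that 8 frames maximize fluency, whereas 32 frames slightly reduce BLEU but improve numeric time estimation.

Conclusion: By pairing a temporally aware encoder with fine-grained gated cross-attention, SurgAnt-ViVQA advances surgical VQA from retrospective description to proactive anticipation. PitVQA-Anticipation provides a comprehensive benchmark for this setting and highlights the importance of targeted temporal modeling for reliable, future-aware surgical assistance.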


📄 Abstract

Anticipating forthcoming surgical events is vital for real-time assistance in endonasal transsphenoidal pituitary surgery, where visibility is limited and workflow changes rapidly. Most visual question answering (VQA) systems reason on isolated frames with static vision-language alignment, providing little support for forecasting next steps or instrument needs. Existing surgical VQA datasets likewise center on the current scene rather than the near future. We introduce PitVQA-Anticipation, the first VQA dataset designed for forward-looking surgical reasoning. It comprises 33.5 hours of operative video and 734,769 question-answer pairs built from temporally grouped clips and expert annotations across four tasks: predicting the future phase, next step, upcoming instrument, and remaining duration. We further propose SurgAnt-ViVQA, a video-language model that adapts a large language model using a GRU Gated Temporal Cross-Attention module. A bidirectional GRU encodes frame-to-frame dynamics, while an adaptive gate injects visual context into the language stream at the token level. Parameter-efficient fine-tuning customizes the language backbone to the surgical domain. SurgAnt-ViVQA is evaluated on the PitVQA-Anticipation and EndoVis datasets, where it surpasses strong image- and video-based baselines. Ablations show that temporal recurrence and gated fusion drive most of the gains. A frame-budget study indicates a trade-off: 8 frames maximize fluency, whereas 32 frames slightly reduce BLEU but improve numeric time estimation. By pairing a temporally aware encoder with fine-grained gated cross-attention, SurgAnt-ViVQA advances surgical VQA from retrospective description to proactive anticipation. PitVQA-Anticipation offers a comprehensive benchmark for this setting and highlights the importance of targeted temporal modeling for reliable, future-aware surgical assistance.
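The token-level gating described above can be illustrated with a small sketch. It is not the SurgAnt-ViVQA module: the frame features are taken as given (in the paper they come from a bidirectional GRU), and the single-head attention, the residual form, and the names `gated_cross_attention`, `w_gate`, and `b_gate` are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(tokens, frames, w_gate, b_gate):
    """Inject visual context into token states through a learned gate.

    tokens: (T, d) language hidden states
    frames: (F, d) temporally encoded frame features (assumed given)
    w_gate: (2d, d) and b_gate: (d,) hypothetical gate parameters
    """
    d = tokens.shape[-1]
    # Each token attends over all frames (single head, no projections).
    attn = softmax(tokens @ frames.T / np.sqrt(d), axis=-1)   # (T, F)
    context = attn @ frames                                   # (T, d)
    # Per-token, per-channel gate in (0, 1) decides how much context enters.
    gate_in = np.concatenate([tokens, context], axis=-1)      # (T, 2d)
    gate = 1.0 / (1.0 + np.exp(-(gate_in @ w_gate + b_gate)))
    return tokens + gate * context
```

A strongly negative gate bias leaves the language states almost unchanged, which is one common reason to use gated residual injection when adapting a pretrained language model.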

[6] PETWB-REP: A Multi-Cancer Whole-Body FDG PET/CT and Radiology Report Dataset for Medical Imaging Research

Le Xue, Gang Feng, Wenbo Zhang, Yichi Zhang, Lanlan Li, Shuqi Wang, Liling Peng, Sisi Peng, Xin Gao

🧩 TL;DR

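This study presents PETWB-REP, a curated dataset of whole-body FDG PET/CT scans and corresponding radiology reports from 490 patients with various cancers, intended to support research in medical imaging, radiomics, and multimodal learning.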


📘 Detailed Summary

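Motivation: Public medical imaging datasets that combine functional and anatomical imaging with detailed clinical reports across multiple cancer types remain scarce, which limits the development and validation of AI models and the conduct of retrospective clinical research.

Method: The authors build a curated dataset of whole-body 18F-FDG PET/CT scans comprising paired PET and CT images, de-identified textual reports, and structured clinical metadata, covering common cancers such as lung, liver, breast, prostate, and ovarian cancer.

Result: The dataset brings together imaging data and clinical information for 490 patients with various malignancies, providing paired PET and CT images, radiology reports, and structured metadata as a resource for multimodal medical AI research.

Conclusion: PETWB-REP helps fill the gap in multi-cancer, multimodal medical imaging data and provides a resource for medical image analysis, radiomics, AI algorithm development, and multimodal learning, with the aim of supporting precision medicine and clinical research.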


📄 Abstract

Publicly available, large-scale medical imaging datasets are crucial for developing and validating artificial intelligence models and conducting retrospective clinical research. However, datasets that combine functional and anatomical imaging with detailed clinical reports across multiple cancer types remain scarce. Here, we present PETWB-REP, a curated dataset comprising whole-body 18F-Fluorodeoxyglucose (FDG) Positron Emission Tomography/Computed Tomography (PET/CT) scans and corresponding radiology reports from 490 patients diagnosed with various malignancies. The dataset primarily includes common cancers such as lung cancer, liver cancer, breast cancer, prostate cancer, and ovarian cancer. This dataset includes paired PET and CT images, de-identified textual reports, and structured clinical metadata. It is designed to support research in medical imaging, radiomics, artificial intelligence, and multi-modal learning.

[7] QG-CoC: Question-Guided Chain-of-Captions for Large Multimodal Models

Kuei-Chun Kao, Hsu Tzu-Yin, Yunqi Hong, Ruochen Wang, Cho-Jui Hsieh

🧩 TL;DR

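This paper proposes QG-CoC, a new zero-shot prompting method that uses a question-guided chain of captions to address the weak fine-grained perception and cross-image reasoning of multimodal large language models in multi-image settings. It shows competitive performance across a range of benchmarks.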


📘 Detailed Summary

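Motivation: Multimodal large language models (MLLMs) face two key problems in multi-image contexts: a lack of fine-grained perception across disparate images, and a diminished ability to reason over and synthesize information from multiple visual inputs. Existing studies focus mainly on single-image settings or specific constrained scenarios, leaving general and complex multi-image reasoning poorly understood and addressed.

Method: The authors first systematically investigate multi-image scenarios and find that existing prompting methods fall short in attending to the needed clues and in integrating perception with reasoning. Based on these findings they design QG-CoC, a generalized zero-shot prompting method built on a question-guided chain of captions that handles problems with an arbitrary number of images.

Result: The method is evaluated on a variety of open-source and closed-source MLLMs on both multi-image and single-image benchmarks. QG-CoC shows competitive performance across tasks and robust improvements in challenging scenarios where existing prompting methods fail.

Conclusion: QG-CoC offers an effective approach to core challenges in multi-image reasoning for MLLMs and shows potential for improving performance in complex multi-image scenarios. It also underlines the importance of tightly integrating fine-grained perception with reasoning.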


📄 Abstract

Recently, Multimodal Large Language Models (MLLMs) encounter two key issues in multi-image contexts: (1) a lack of fine-grained perception across disparate images, and (2) a diminished capability to effectively reason over and synthesize information from multiple visual inputs. However, while various prompting methods aim to describe visual content, many existing studies focus primarily on single-image settings or specific, constrained scenarios. This leaves a critical gap in understanding and addressing how MLLMs tackle more general and complex multi-image reasoning tasks. Thus, we first extensively investigate how current prompting methods perceive fine-grained visual details and process visual information when dealing with multiple images. Our findings reveal that existing prompting methods fall short in attending to needed clues and seamlessly integrating perception and reasoning. Inspired by the findings, we propose a new zero-shot prompting method, Question-Guided Chain-of-Captions (QG-CoC), a generalized prompting approach that effectively handles problems with an arbitrary number of images. We evaluate our method on various open-source and closed-source MLLMs for multi-image and single-image benchmarks. Experimental results indicate that QG-CoC demonstrates competitive performance across tasks and exhibits robust improvements in the challenging scenarios where existing prompting methods fail.

[8] Enhancing Medical Image Segmentation via Heat Conduction Equation

Rong Wu, Yim-Sang Yu

🧩 TL;DR

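This study proposes a hybrid architecture that combines U-Mamba with the heat conduction equation. State-space modules provide efficient long-range reasoning, and heat conduction operators in the bottleneck layers simulate frequency-domain thermal diffusion, improving global context modeling for medical image segmentation.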


📘 Detailed Summary

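Motivation: Existing medical image segmentation models struggle to achieve both efficient global context modeling and long-range dependency reasoning under practical computational budgets; U-Net variants in particular are limited in capturing global information.

Method: The hybrid architecture combines Mamba-based state-space modules for efficient long-range reasoning with Heat Conduction Operators in the bottleneck layers, which simulate frequency-domain thermal diffusion to enhance semantic abstraction.

Result: On multimodal abdominal CT and MRI datasets, the model consistently outperforms strong baselines, supporting its effectiveness and generalizability.

Conclusion: Combining state-space dynamics with heat-based global diffusion offers a scalable and interpretable solution for medical segmentation and illustrates the potential of hybrid modeling strategies.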


📄 Abstract

Medical image segmentation has been significantly advanced by deep learning architectures, notably U-Net variants. However, existing models struggle to achieve efficient global context modeling and long-range dependency reasoning simultaneously under practical computational budgets. In this work, we propose a novel hybrid architecture utilizing U-Mamba with the Heat Conduction Equation. Our model combines Mamba-based state-space modules for efficient long-range reasoning with Heat Conduction Operators (HCOs) in the bottleneck layers, simulating frequency-domain thermal diffusion for enhanced semantic abstraction. Experimental results on multimodal abdominal CT and MRI datasets demonstrate that the proposed model consistently outperforms strong baselines, validating its effectiveness and generalizability. These results suggest that blending state-space dynamics with heat-based global diffusion offers a scalable and interpretable solution for medical segmentation tasks.
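Frequency-domain thermal diffusion has a standard closed form that a short sketch can show. For the heat equation du/dt = k * laplacian(u) with periodic boundaries, each Fourier coefficient simply decays as exp(-k * |omega|^2 * t). The code below applies that to a single 2D feature map. The choice of FFT with periodic boundaries, the fixed `k` and `t`, and the function name `heat_diffuse` are assumptions; the abstract does not say which transform or parameterization the HCOs use.

```python
import numpy as np

def heat_diffuse(feature_map, k=1.0, t=1.0):
    """Solve the 2D heat equation for time t in the Fourier domain.

    feature_map: (H, W) real array, treated as periodic (an assumption)
    k:           thermal diffusivity (a fixed scalar here; in a network
                 layer it could be learned, which is also an assumption)
    """
    h, w = feature_map.shape
    # Angular frequencies along each axis.
    wy = 2.0 * np.pi * np.fft.fftfreq(h)
    wx = 2.0 * np.pi * np.fft.fftfreq(w)
    omega_sq = wy[:, None] ** 2 + wx[None, :] ** 2
    # Each frequency decays independently; the zero frequency is untouched.
    spectrum = np.fft.fft2(feature_map)
    decayed = spectrum * np.exp(-k * omega_sq * t)
    return np.real(np.fft.ifft2(decayed))
```

Because every spatial location contributes to every frequency, one such step mixes information globally, which is the property the abstract associates with global context modeling.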

[9] Unified Long Video Inpainting and Outpainting via Overlapping High-Order Co-Denoising

Shuangquan Lyu, Steven Mao, Yue Ma

🧩 TL;DR

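This paper proposes a unified approach to long video inpainting and outpainting that extends text-to-video diffusion models to generate arbitrarily long, spatially edited videos with high fidelity. It uses LoRA to efficiently fine-tune a pre-trained video diffusion model and an overlap-and-blend temporal co-denoising strategy to keep long sequences consistent.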


📘 Detailed Summary

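Motivation: Generating long videos remains a fundamental challenge, and achieving high controllability in video inpainting and outpainting is especially demanding. Prior methods are limited to fixed-length clips or exhibit stitching artifacts and drift, making seamless editing of arbitrarily long videos difficult.

Method: The method uses LoRA to efficiently fine-tune a large pre-trained video diffusion model such as Alibaba's Wan 2.1 for masked-region video synthesis. An overlap-and-blend temporal co-denoising strategy with high-order solvers maintains consistency across long sequences and avoids visible seams or drift.

Result: The approach is validated on challenging inpainting and outpainting tasks, including editing or adding objects over hundreds of frames. It outperforms baselines such as Wan 2.1 and VACE in quality (PSNR/SSIM) and perceptual realism (LPIPS), balancing parameter efficiency and performance.

Conclusion: The method offers a practical solution for long-range video editing with minimal overhead. It shows how text-to-video diffusion models can be extended to spatial editing of arbitrarily long videos while balancing parameter efficiency and generation quality.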


📄 Abstract

Generating long videos remains a fundamental challenge, and achieving high controllability in video inpainting and outpainting is particularly demanding. To address both of these challenges simultaneously and achieve controllable video inpainting and outpainting for long video clips, we introduce a novel and unified approach for long video inpainting and outpainting that extends text-to-video diffusion models to generate arbitrarily long, spatially edited videos with high fidelity. Our method leverages LoRA to efficiently fine-tune a large pre-trained video diffusion model like Alibaba's Wan 2.1 for masked region video synthesis, and employs an overlap-and-blend temporal co-denoising strategy with high-order solvers to maintain consistency across long sequences. In contrast to prior work that struggles with fixed-length clips or exhibits stitching artifacts, our system enables arbitrarily long video generation and editing without noticeable seams or drift. We validate our approach on challenging inpainting/outpainting tasks, including editing or adding objects over hundreds of frames, and demonstrate superior performance to baseline methods like the Wan 2.1 model and VACE in terms of quality (PSNR/SSIM) and perceptual realism (LPIPS). Our method enables practical long-range video editing with minimal overhead, achieving a balance between parameter efficiency and performance.
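The overlap-and-blend idea can be sketched independently of any particular diffusion model. The assumption here is that a long sequence is split into fixed-length windows that overlap in time, each window is processed separately (for example, one denoising step on its latents), and the overlapping frames are averaged with weights that fade toward window edges. The helper names, the triangular weighting, and the window layout are illustrative choices, not details from the paper.

```python
import numpy as np

def window_starts(num_frames, window, overlap):
    """Start indices of overlapping windows covering all frames.

    Assumes num_frames >= window and 0 <= overlap < window.
    """
    stride = window - overlap
    starts = list(range(0, num_frames - window + 1, stride))
    if starts[-1] + window < num_frames:
        starts.append(num_frames - window)  # final window flush with the end
    return starts

def blend_windows(chunks, starts, num_frames):
    """Weighted average of per-window outputs back into one sequence.

    chunks: list of arrays shaped (window, ...) with identical shapes
    starts: start frame of each chunk, as returned by window_starts
    """
    window = chunks[0].shape[0]
    # Triangular ramp: frames near a window edge get less weight.
    ramp = np.minimum(np.arange(1, window + 1),
                      np.arange(window, 0, -1)).astype(float)
    trailing = (1,) * (chunks[0].ndim - 1)
    w = ramp.reshape((window,) + trailing)
    out = np.zeros((num_frames,) + chunks[0].shape[1:])
    weight = np.zeros((num_frames,) + trailing)
    for chunk, start in zip(chunks, starts):
        out[start:start + window] += w * chunk
        weight[start:start + window] += w
    return out / weight
```

In a co-denoising loop, a blend like this would typically be applied after every solver step so that neighboring windows stay in agreement throughout sampling, rather than only once at the end.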

[10] Diffusion-SDPO: Safeguarded Direct Preference Optimization for Diffusion Models

Minghao Fu, Guo-Hua Wang, Tianyu Cui, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang

🧩 TL;DR

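This paper proposes Diffusion-SDPO, which addresses a key pathology in Direct Preference Optimization for diffusion models. It preserves winner quality by adaptively scaling the loser gradient and delivers consistent gains over baselines on text-to-image benchmarks.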


📘 Detailed Summary

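Motivation: Existing diffusion-based DPO methods exhibit a critical pathology: enlarging the preference margin does not necessarily improve generation quality. The standard Diffusion-DPO objective can increase the reconstruction error of both the winner and loser branches, so the preferred branch can be harmed even as the margin grows.

Method: Diffusion-SDPO uses a safeguarded update rule that preserves the winner by adaptively scaling the loser gradient. A first-order analysis yields a closed-form scaling coefficient that guarantees the error of the preferred output is non-increasing at each optimization step. The method is simple, model-agnostic, and broadly compatible with existing DPO-style alignment frameworks.

Result: On standard text-to-image benchmarks, Diffusion-SDPO delivers consistent gains over preference-learning baselines on automated preference, aesthetic, and prompt-alignment metrics, while adding only marginal computational overhead.

Conclusion: The study identifies a key pathology in preference optimization for diffusion models, and the proposed safeguarded update offers a more reliable route to aligning diffusion models, with broad applicability.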


📄 Abstract

Text-to-image diffusion models deliver high-quality images, yet aligning them with human preferences remains challenging. We revisit diffusion-based Direct Preference Optimization (DPO) for these models and identify a critical pathology: enlarging the preference margin does not necessarily improve generation quality. In particular, the standard Diffusion-DPO objective can increase the reconstruction error of both winner and loser branches. Consequently, degradation of the less-preferred outputs can become sufficiently severe that the preferred branch is also adversely affected even as the margin grows. To address this, we introduce Diffusion-SDPO, a safeguarded update rule that preserves the winner by adaptively scaling the loser gradient according to its alignment with the winner gradient. A first-order analysis yields a closed-form scaling coefficient that guarantees the error of the preferred output is non-increasing at each optimization step. Our method is simple, model-agnostic, broadly compatible with existing DPO-style alignment frameworks and adds only marginal computational overhead. Across standard text-to-image benchmarks, Diffusion-SDPO delivers consistent gains over preference-learning baselines on automated preference, aesthetic, and prompt alignment metrics. Code is publicly available at https://github.com/AIDC-AI/Diffusion-SDPO.
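The abstract does not give the closed-form coefficient, so the following is only a simplified first-order illustration of the safeguarding idea, not the paper's rule. Assume the update is theta <- theta - lr * d with d = g_win - s * g_lose, where g_win and g_lose are the gradients of the winner and loser reconstruction errors (so the step lowers winner error and raises loser error). To first order the winner error changes by -lr * (|g_win|^2 - s * g_win.g_lose), which is non-positive whenever s * g_win.g_lose <= |g_win|^2. The cap below enforces that; the function name, `lam`, and `eps` are hypothetical.

```python
import numpy as np

def safeguarded_direction(g_win, g_lose, lam=1.0, eps=1e-12):
    """Return (d, s) with d = g_win - s * g_lose and s <= lam.

    Simplified illustration: s is capped so that, to first order, a step
    along -d does not increase the winner error. This is NOT the
    coefficient derived in the paper.
    """
    dot = float(g_win @ g_lose)
    scale = lam
    if dot > 0.0:
        # Loser ascent conflicts with winner descent only when the two
        # gradients are positively aligned; cap the loser term then.
        scale = min(lam, float(g_win @ g_win) / (dot + eps))
    return g_win - scale * g_lose, scale
```

When the two gradients are not positively aligned, pushing the loser error up cannot hurt the winner to first order, so the loser term is left at its nominal weight `lam`.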

[11] Decoupling Augmentation Bias in Prompt Learning for Vision-Language Models

Gahyeon Kim, Sohee Kim, Seokju Lee

🧩 TL;DR

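This paper proposes AAPL, which introduces adversarial token embeddings to decouple superficial visual variations introduced by image augmentation from class-relevant semantic representations, improving the generalization of prompt learning in zero-shot learning. AAPL consistently outperforms existing methods on eleven benchmark datasets across few-shot, zero-shot, cross-dataset, and domain generalization settings.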


📘 Detailed Summary

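Motivation: Prompt learning methods such as CoOp and CoCoOp improve zero-shot performance by replacing handcrafted prompts with learnable vectors, but they often struggle to generalize to entirely unseen categories. Traditional zero-shot learning benefits from various data augmentation strategies, whereas prompt learning has focused mainly on text-based modifications, leaving image-level augmentation largely unexplored. Existing methods also lack explicit guidance for learning prompts that focus on semantically meaningful visual features.

Method: AAPL introduces adversarial token embeddings to decouple superficial visual variations introduced by augmentation from class-relevant semantic representations. Using attribute-specific image-level augmentations, it lets the learned prompts concentrate on visually discriminative features aligned with the target categories.

Result: Comprehensive experiments on eleven benchmark datasets show that AAPL consistently outperforms existing methods across few-shot, zero-shot, cross-dataset, and domain generalization settings, supporting the benefit of combining image-level augmentation with soft prompt frameworks.

Conclusion: Image-level augmentations, especially attribute-specific variations, can effectively support prompt learning, and adversarial token embeddings decouple superficial variation from semantic representation. The work points toward augmentation strategies for prompt learning that combine visual and textual signals.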


📄 Abstract

Recent advances in large-scale vision and language models have led to significant progress in zero-shot learning tasks. Methods such as CoOp and CoCoOp have shown that replacing handcrafted prompts with learnable vectors, known as prompt learning, can result in improved performance. However, these models often struggle to generalize to entirely unseen categories. While traditional zero-shot learning techniques benefit from various data augmentation strategies, prompt learning has primarily focused on text-based modifications, leaving the potential of image-based augmentation largely unexplored. In this work, we explore how image-level augmentations, particularly those that introduce attribute-specific variations, can support and enhance prompt learning. Our analysis examines the interaction between these augmentations and soft prompt frameworks, revealing their potential to improve generalization. We also identify a limitation in existing methods, such as CoCoOp, which do not provide explicit guidance for learning prompts that focus on semantically meaningful visual features. To address this, we propose Adding Attributes to Prompt Learning, AAPL, a novel method that introduces adversarial token embeddings to decouple superficial visual variations introduced by augmentation from class-relevant semantic representations. This decoupling enables the learned prompts to concentrate on visually discriminative features that align with the target categories. We conduct comprehensive experiments on eleven benchmark datasets, and AAPL consistently outperforms existing methods across few-shot, zero-shot, cross-dataset, and domain generalization settings. Our source code is publicly available at: https://github.com/Gahyeonkim09/AAPL

[12] SurgViVQA: Temporally-Grounded Video Question Answering for Surgical Scene Understanding

Mauro Orazio Drago, Luca Carlini, Pelinsu Celebi Balyemez, Dennis Pierantozzi, Chiara Lena, Cesare Hassan, Danail Stoyanov, Elena De Momi, Sophia Bano, Mobarak I. Hoque

🧩 TL;DR

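This paper proposes SurgViVQA, a surgical video question answering model that fuses video and question features with a Masked Video-Text Encoder to capture temporal dynamics in surgical scenes, improving the performance and robustness of surgical VideoQA.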


📘 Detailed Summary

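Motivation: Current surgical VideoQA approaches are limited to static image features, and available datasets often lack temporal annotations. As a result they miss the motion dynamics and tool-tissue interactions that are critical for interpreting procedures, limiting how accurately AI models understand dynamic surgical scenes.

Method: SurgViVQA uses a Masked Video-Text Encoder to fuse video and question features, capturing temporal cues such as motion and tool-tissue interactions, which a fine-tuned large language model then decodes into coherent answers. The authors also curate REAL-Colon-VQA, a dataset with motion-related questions and diagnostic attributes.

Result: On REAL-Colon-VQA and EndoVis18-VQA, SurgViVQA outperforms existing image-based VQA benchmark models, particularly in keyword accuracy, improving over PitVQA by +11% on REAL-Colon-VQA and +9% on EndoVis18-VQA. A perturbation study further confirms robustness and generalizability to variations in question phrasing.

Conclusion: SurgViVQA and REAL-Colon-VQA provide a framework for temporally aware understanding in surgical VideoQA, enabling AI models to interpret dynamic procedural contexts more effectively and moving surgical AI from static image analysis toward dynamic scene understanding.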


📄 Abstract

Video Question Answering (VideoQA) in the surgical domain aims to enhance intraoperative understanding by enabling AI models to reason over temporally coherent events rather than isolated frames. Current approaches are limited to static image features, and available datasets often lack temporal annotations, ignoring the dynamics critical for accurate procedural interpretation. We propose SurgViVQA, a surgical VideoQA model that extends visual reasoning from static images to dynamic surgical scenes. It uses a Masked Video-Text Encoder to fuse video and question features, capturing temporal cues such as motion and tool-tissue interactions, which a fine-tuned large language model (LLM) then decodes into coherent answers. To evaluate its performance, we curated REAL-Colon-VQA, a colonoscopic video dataset that includes motion-related questions and diagnostic attributes, as well as out-of-template questions with rephrased or semantically altered formulations to assess model robustness. Experimental validation on REAL-Colon-VQA and the public EndoVis18-VQA dataset shows that SurgViVQA outperforms existing image-based VQA benchmark models, particularly in keyword accuracy, improving over PitVQA by +11% on REAL-Colon-VQA and +9% on EndoVis18-VQA. A perturbation study on the questions further confirms improved generalizability and robustness to variations in question phrasing. SurgViVQA and the REAL-Colon-VQA dataset provide a framework for temporally-aware understanding in surgical VideoQA, enabling AI models to interpret dynamic procedural contexts more effectively. Code and dataset available at https://github.com/madratak/SurgViVQA.

[13] Multi-Object Tracking Retrieval with LLaVA-Video: A Training-Free Solution to MOT25-StAG Challenge

Yi Yang, Yiming Xu, Timo Kaiser, Hao Cheng, Bodo Rosenhahn, Michael Ying Yang

🧩 TL;DR

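This report presents a two-stage, zero-shot solution to the MOT25-Spatiotemporal Action Grounding (MOT25-StAG) Challenge. It combines the SOTA tracking model FastTracker with the multimodal large language model LLaVA-Video to localize and track multiple objects that match language queries in complex real-world scenes, and it placed second in the challenge.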


📘 Detailed Summary

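Motivation: The MOT25-StAG Challenge requires accurately localizing and tracking multiple objects that match specific, free-form language queries in video of complex real-world scenes, which involves both visual tracking and language understanding.

Method: The task is modeled as a video retrieval problem and solved with a two-stage, zero-shot approach that combines the strengths of the tracking model FastTracker and the multimodal large language model LLaVA-Video.

Result: On the MOT25-StAG test set, the method achieves an m-HIoU of 20.68 and a HOTA of 10.73, placing second in the challenge.

Conclusion: Combining a strong tracking model with a multimodal large language model in a two-stage, zero-shot pipeline is a workable way to ground language queries to multiple tracked objects, and a promising direction for video understanding in complex real-world scenes.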


📄 Abstract

In this report, we present our solution to the MOT25-Spatiotemporal Action Grounding (MOT25-StAG) Challenge. The aim of this challenge is to accurately localize and track multiple objects that match specific and free-form language queries, using video data of complex real-world scenes as input. We model the underlying task as a video retrieval problem and present a two-stage, zero-shot approach, combining the advantages of the SOTA tracking model FastTracker and the Multi-modal Large Language Model LLaVA-Video. On the MOT25-StAG test set, our method achieves m-HIoU and HOTA scores of 20.68 and 10.73, respectively, placing second in the challenge.

[14] UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions

Guozhen Zhang, Zixiang Zhou, Teng Hu, Ziqiao Peng, Youliang Zhang, Yi Chen, Yuan Zhou, Qinglin Lu, Limin Wang

🧩 TL;DR

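This paper proposes UniAVGen, a unified framework for joint audio and video generation. A dual-branch Diffusion Transformer architecture with an Asymmetric Cross-Modal Interaction mechanism improves lip synchronization and semantic consistency, and the model unifies several audio-video tasks while using far fewer training samples.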


📘 Detailed Summary

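Motivation: Lacking effective cross-modal modeling, existing open-source audio-video generation methods often show compromised lip synchronization and insufficient semantic consistency, which limits the quality and usefulness of the generated content.

Method: UniAVGen uses a dual-branch joint synthesis architecture with two parallel Diffusion Transformers that build a cohesive cross-modal latent space. At its core, an Asymmetric Cross-Modal Interaction mechanism enables bidirectional, temporally aligned cross-attention. A Face-Aware Modulation module dynamically prioritizes salient regions during interaction, and Modality-Aware Classifier-Free Guidance enhances generative fidelity at inference.

Result: With far fewer training samples (1.3M vs. 30.1M), UniAVGen delivers overall advantages in audio-video synchronization, timbre consistency, and emotion consistency.

Conclusion: UniAVGen's joint synthesis design unifies key audio-video tasks in a single model, including joint audio-video generation and continuation, video-to-audio dubbing, and audio-driven video synthesis.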


📄 Abstract

Due to the lack of effective cross-modal modeling, existing open-source audio-video generation methods often exhibit compromised lip synchronization and insufficient semantic consistency. To mitigate these drawbacks, we propose UniAVGen, a unified framework for joint audio and video generation. UniAVGen is anchored in a dual-branch joint synthesis architecture, incorporating two parallel Diffusion Transformers (DiTs) to build a cohesive cross-modal latent space. At its heart lies an Asymmetric Cross-Modal Interaction mechanism, which enables bidirectional, temporally aligned cross-attention, thus ensuring precise spatiotemporal synchronization and semantic consistency. Furthermore, this cross-modal interaction is augmented by a Face-Aware Modulation module, which dynamically prioritizes salient regions in the interaction process. To enhance generative fidelity during inference, we additionally introduce Modality-Aware Classifier-Free Guidance, a novel strategy that explicitly amplifies cross-modal correlation signals. Notably, UniAVGen's robust joint synthesis design enables seamless unification of pivotal audio-video tasks within a single model, such as joint audio-video generation and continuation, video-to-audio dubbing, and audio-driven video synthesis. Comprehensive experiments validate that, with far fewer training samples (1.3M vs. 30.1M), UniAVGen delivers overall advantages in audio-video synchronization, timbre consistency, and emotion consistency.

cs.CL

[15] LEGO-Eval: Towards Fine-Grained Evaluation on Synthesizing 3D Embodied Environments with Tool Augmentation

Gyeom Hwangbo, Hyungjoo Chae, Minseok Kang, Hyeonjong Ju, Soohyun Oh, Jinyoung Yeo

🧩 TL;DR

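This paper presents the LEGO-Eval evaluation framework and the LEGO-Bench benchmark, targeting the problem that 3D scenes generated by large language models often lack realistic spatial layouts and object attributes. LEGO-Eval explicitly grounds scene components to assess more accurately how well a scene aligns with fine-grained instructions.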


📘 Detailed Summary

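Motivation: LLM-based 3D scene generation often produces unrealistic spatial layouts and object attributes, a problem that stems from coarse-grained instructions. Scenes that diverge from real-world environments degrade the performance of embodied agents trained in simulation. Current evaluation methods such as CLIPScore and vision-language models have a shallow understanding of 3D scenes and cannot reliably assess alignment between scenes and fine-grained instructions.

Method: LEGO-Eval is an evaluation framework equipped with diverse tools that explicitly ground scene components, enabling more accurate alignment assessment. LEGO-Bench is a benchmark of detailed instructions that specify complex layouts and attributes of real-world environments.

Result: LEGO-Eval outperforms VLM-as-a-judge by 0.41 F1 score in assessing scene-instruction alignment. Benchmarking with LEGO-Bench reveals significant limitations in current generation methods: across all evaluated approaches, success rates reach at most 10% in generating scenes that fully align with fine-grained instructions.

Conclusion: The study exposes serious shortcomings of current 3D scene generation methods in following fine-grained instructions. The proposed framework and benchmark provide tools for improving 3D scene generation and point to the need for more capable generation methods.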


📄 Abstract

Despite recent progress in using Large Language Models (LLMs) for automatically generating 3D scenes, generated scenes often lack realistic spatial layouts and object attributes found in real-world environments. As this problem stems from insufficiently detailed, coarse-grained instructions, advancing 3D scene synthesis guided by more detailed, fine-grained instructions that reflect real-world environments becomes crucial. Without such realistic scenes, training embodied agents in unrealistic environments can lead them to learn priors that diverge significantly from real-world physics and semantics, degrading their performance when deployed. Thus, verifying the alignment between the fine-grained instruction and the generated scene is essential for effective learning. However, current evaluation methods, such as CLIPScore and vision-language models (VLMs), often fail to reliably assess such alignment. This shortcoming arises primarily from their shallow understanding of 3D scenes, which often leads to improperly grounded scene components. To address this, we introduce LEGO-Eval, an evaluation framework equipped with diverse tools designed to explicitly ground scene components, enabling more accurate alignment assessments. We also present LEGO-Bench, a benchmark of detailed instructions that specify complex layouts and attributes of real-world environments. Experiments demonstrate that LEGO-Eval outperforms VLM-as-a-judge by 0.41 F1 score in assessing scene-instruction alignment. Benchmarking with LEGO-Bench reveals significant limitations in current generation methods. Across all evaluated approaches, success rates reached at most 10% in generating scenes that fully align with fine-grained instructions.

[16] MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity

Kaiyuan Zhang, Chenghao Yang, Zhoufutu Wen, Sihang Yuan, Qiuyue Wang, Chaoyi Huang, Guosheng Zhu, He Wang, Huawenyu Lu, Jianing Wen, Jianpeng Jiao, Lishu Luo, Longxiang Liu, Sijin Wu, Xiaolei Zhu, Xuanliang Zhang, Ge Zhang, Yi Lin, Guang Shi, Chaoyou Fu, Wenhao Huang

🧩 TL;DR

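This study introduces MME-CC, a multimodal benchmark of cognitive capacity that systematically evaluates multimodal large language models on vision-centric cognitive behaviors, addressing gaps in how existing benchmarks assess visual reasoning.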


📘 Detailed Summary

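Motivation: Existing multimodal benchmarks either overemphasize textual reasoning or fail to systematically capture vision-centric cognitive behaviors, leaving the cognitive capacity of MLLMs insufficiently assessed.

Method: MME-CC organizes 11 representative reasoning tasks into three fundamental categories (spatial, geometric, and knowledge-based reasoning), and the authors evaluate 16 representative MLLMs on it.

Result: Closed-source models lead overall (e.g., 42.66 for Gemini-2.5-Pro vs. 30.45 for GLM-4.5V), while spatial and geometric reasoning remain broadly weak (at most 30%). Common error patterns include orientation mistakes, fragile cross-view identity persistence, and poor adherence to counterfactual instructions.

Conclusion: Chain-of-Thought typically follows a three-stage process (extract, reason, verify) with heavy reliance on visual extraction. The authors call for treating the cognitive capacity of MLLMs as central to both evaluation and model design.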


📄 Abstract

As reasoning models scale rapidly, the essential role of multimodality in human cognition has come into sharp relief, driving a growing need to probe vision-centric cognitive behaviors. Yet, existing multimodal benchmarks either overemphasize textual reasoning or fall short of systematically capturing vision-centric cognitive behaviors, leaving the cognitive capacity of MLLMs insufficiently assessed. To address this limitation, we introduce MME-CC (Multi-Modal Evaluation benchmark of Cognitive Capacity), a vision-grounded benchmark that organizes 11 representative reasoning tasks into three fundamental categories of visual information: spatial, geometric, and knowledge-based reasoning, and provides fine-grained analyses of MLLMs' cognitive capacity across these dimensions. Based on MME-CC, we conduct extensive experiments over 16 representative MLLMs. Our study reveals that closed-source models currently lead overall (e.g., 42.66 for Gemini-2.5-Pro vs. 30.45 for GLM-4.5V), while spatial and geometric reasoning remain broadly weak (less than or equal to 30%). We further identify common error patterns, including orientation mistakes, fragile cross-view identity persistence, and poor adherence to counterfactual instructions, and observe that Chain-of-Thought typically follows a three-stage process (extract -> reason -> verify) with heavy reliance on visual extraction. We hope this work catalyzes a shift toward treating the cognitive capacity of MLLMs as central to both evaluation and model design.

[17] BengaliMoralBench: A Benchmark for Auditing Moral Reasoning in Large Language Models within Bengali Language and Culture

Shahriyar Zaman Ridoy, Azmine Toushik Wasi, Koushik Ahamed Tonmoy

🧩 TL;DR

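This paper introduces BengaliMoralBench, the first large-scale ethics benchmark for Bengali. It addresses a gap in research on the cultural and ethical alignment of multilingual large language models and provides an evaluation basis for responsible AI deployment in South Asia.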


📘 Detailed Summary

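Motivation: As multilingual LLMs spread across South Asia, how well they fit local ethical norms remains understudied, particularly for Bengali, which ranks sixth globally by number of speakers. Existing ethics benchmarks are built mostly on English and Western frameworks and overlook cultural nuances that are critical for real-world deployment.

Method: The authors build a large-scale ethics benchmark covering five moral domains (Daily Activities, Habits, Parenting, Family Relationships, and Religious Activities), subdivided into 50 culturally relevant subtopics. Each scenario is annotated via native-speaker consensus through three ethical lenses (Virtue, Commonsense, and Justice ethics), and prominent multilingual LLMs are evaluated systematically in a zero-shot setting.

Result: Performance varies widely across models, with accuracy ranging from 50% to 91%. Qualitative analysis reveals consistent weaknesses in cultural grounding, commonsense reasoning, and moral fairness, indicating that current models still have a limited grasp of Bengali cultural context.

Conclusion: BengaliMoralBench provides a foundation for responsible localization and supports culturally aligned evaluation in diverse, low-resource multilingual settings. The work underscores the importance of culturally sensitive ethics benchmarks for deploying robust AI systems in non-Western settings and points to directions for future research on multilingual ethical alignment.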


📄 Abstract

As multilingual Large Language Models (LLMs) gain traction across South Asia, their alignment with local ethical norms, particularly for Bengali, which is spoken by over 285 million people and ranked 6th globally, remains underexplored. Existing ethics benchmarks are largely English-centric and shaped by Western frameworks, overlooking cultural nuances critical for real-world deployment. To address this, we introduce BengaliMoralBench, the first large-scale ethics benchmark for the Bengali language and socio-cultural contexts. It covers five moral domains, Daily Activities, Habits, Parenting, Family Relationships, and Religious Activities, subdivided into 50 culturally relevant subtopics. Each scenario is annotated via native-speaker consensus using three ethical lenses: Virtue, Commonsense, and Justice ethics. We conduct systematic zero-shot evaluation of prominent multilingual LLMs, including Llama, Gemma, Qwen, and DeepSeek, using a unified prompting protocol and standard metrics. Performance varies widely (50-91% accuracy), with qualitative analysis revealing consistent weaknesses in cultural grounding, commonsense reasoning, and moral fairness. BengaliMoralBench provides a foundation for responsible localization, enabling culturally aligned evaluation and supporting the deployment of ethically robust AI in diverse, low-resource multilingual settings such as Bangladesh.

[18] Benchmarking the Thinking Mode of Multimodal Large Language Models in Clinical Tasks

Jindong Hong, Tianjie Chen, Lingjie Luo, Chuanyang Zheng, Ting Xu, Haibao Yu, Jianing Qiu, Qianzhong Chen, Suning Huang, Yan Xu, Yong Gui, Yijun He, Jiankai Sun

🧩 TL;DR

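This study systematically evaluates how activating the thinking mode affects the performance of multimodal large language models on clinical tasks. It finds that thinking mode brings only marginal gains over the standard mode and that performance on complex medical tasks remains unsatisfactory.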


📘 Detailed Summary

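Motivation: With the rapid development of reasoning MLLMs that offer "dual-state" capability, this study aims to rigorously evaluate how their enhanced reasoning processes affect performance and reliability in clinical tasks, with a focus on whether activating the thinking mode yields substantive gains for medical applications.

Method: The study evaluates the active thinking mode of two leading MLLMs, Seed1.5-VL and Gemini-2.5-Flash, on four visual medical tasks using the VQA-RAD and ROCOv2 datasets.

Result: For most tasks, activating the thinking mode yields only marginal improvement over the standard non-thinking mode. Performance on complex medical tasks such as open-ended VQA and medical image interpretation remains suboptimal.

Conclusion: The study highlights the need for domain-specific medical data and more advanced methods of medical knowledge integration. The current thinking mode is of limited effectiveness on complex clinical reasoning tasks and needs further improvement.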


📄 Abstract

A recent advancement in Multimodal Large Language Models (MLLMs) research is the emergence of "reasoning MLLMs" that offer explicit control over their internal thinking processes (normally referred to as the "thinking mode") alongside the standard "non-thinking mode". This capability allows these models to engage in a step-by-step process of internal deliberation before generating a final response. With the rapid transition to and adoption of these "dual-state" MLLMs, this work rigorously evaluates how the enhanced reasoning processes of these MLLMs impact model performance and reliability in clinical tasks. This paper evaluates the active "thinking mode" capabilities of two leading MLLMs, Seed1.5-VL and Gemini-2.5-Flash, for medical applications. We assess their performance on four visual medical tasks using the VQA-RAD and ROCOv2 datasets. Our findings reveal that the improvement from activating the thinking mode remains marginal compared to the standard non-thinking mode for the majority of the tasks. Their performance on complex medical tasks such as open-ended VQA and medical image interpretation remains suboptimal, highlighting the need for domain-specific medical data and more advanced methods for medical knowledge integration.

[19] Step-Audio-EditX Technical Report

Chao Yan, Boyong Wu, Peng Yang, Pengfei Tan, Guoqiang Hu, Yuxin Zhang, Xiangyu, Zhang, Fei Tian, Xuerui Yang, Xiangyu Zhang, Daxin Jiang, Gang Yu

🧩 TL;DR

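Step-Audio-EditX is the first open-source LLM-based audio model that excels at expressive audio editing and zero-shot text-to-speech. It uses a large-margin synthetic data approach to achieve iterative control over emotion, speaking style, and paralinguistic features.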


📘 Detailed Summary

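Motivation: The work targets the limitations of conventional audio editing models in expressive control and iterative editing, in particular the lack of fine-grained control over emotion, speaking style, and paralinguistic features, and the dependence of conventional methods on embedding-based priors or auxiliary modules.

Method: The core innovation is the use of large-margin synthetic data only, which removes the need for embedding-based priors or auxiliary modules. Large-margin learning enables iterative control and high expressivity across voices, a fundamental shift away from conventional representation-level disentanglement.

Result: Evaluation shows that Step-Audio-EditX surpasses both MiniMax-2.6-hd and Doubao-Seed-TTS-2.0 in emotion editing and other fine-grained control tasks.

Conclusion: The work demonstrates that high-quality audio editing is feasible with synthetic data alone. It offers a new technical path for audio generation, suggests that large-margin learning can replace conventional representation-level disentanglement, and points to a direction for future audio editing systems.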


📄 Abstract

We present Step-Audio-EditX, the first open-source LLM-based audio model excelling at expressive and iterative audio editing, encompassing emotion, speaking style, and paralinguistics, alongside robust zero-shot text-to-speech (TTS) capabilities. Our core innovation lies in leveraging only large-margin synthetic data, which circumvents the need for embedding-based priors or auxiliary modules. This large-margin learning approach enables both iterative control and high expressivity across voices, and represents a fundamental pivot from the conventional focus on representation-level disentanglement. Evaluation results demonstrate that Step-Audio-EditX surpasses both MiniMax-2.6-hd and Doubao-Seed-TTS-2.0 in emotion editing and other fine-grained control tasks.

[20] Towards Transparent Stance Detection: A Zero-Shot Approach Using Implicit and Explicit Interpretability

Apoorva Upadhyaya, Wolfgang Nejdl, Marco Fisichella

🧩 TL;DR

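This paper proposes IRIS, a novel interpretable framework for zero-shot stance detection. By combining implicit rationales with explicit linguistic features, it detects stance without ground-truth rationale annotations while providing inherent interpretability.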


📘 Detailed Summary

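Motivation: Existing zero-shot stance detection methods suffer from limited generalizability and a lack of coherence between text and target. Most LLM-based methods over-rely on explicit reasoning, provide explanations that lack nuance, and do not explicitly model the reasoning process, which makes model predictions hard to interpret.

Method: IRIS treats stance detection as an information retrieval ranking task. It derives an implicit understanding of stance from sequences within the text (implicit rationales) and an explicit understanding from linguistic measures (explicit rationales). The relevance of implicit rationales to different stances guides the model's predictions, providing inherent interpretability without ground-truth rationale annotations.

Result: Extensive experiments on the VAST, EZ-STANCE, P-Stance, and RFD benchmark datasets show that the model generalizes well even with only 50%, 30%, or 10% of the training data, which the authors attribute to the proposed architecture and interpretable design.

Conclusion: By combining implicit and explicit rationales, the work improves the performance and generalization of zero-shot stance detection and offers an interpretable understanding of the emotional and cognitive dimensions of the author's attitude, opening a new direction for explainable AI in stance detection.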


📄 Abstract

Zero-Shot Stance Detection (ZSSD) identifies the attitude of a post toward unseen targets. Existing research using contrastive learning, meta-learning, or data augmentation suffers from generalizability issues or a lack of coherence between text and target. Recent works leveraging large language models (LLMs) for ZSSD focus either on improving unseen target-specific knowledge or on generating explanations for stance analysis. However, most of these works are limited by their over-reliance on explicit reasoning, provide coarse explanations that lack nuance, and do not explicitly model the reasoning process, making it difficult to interpret the model's predictions. To address these issues, we develop a novel interpretable ZSSD framework, IRIS. We provide an interpretable understanding of the attitude of the input towards the target implicitly based on sequences within the text (implicit rationales) and explicitly based on linguistic measures (explicit rationales). IRIS considers stance detection as an information retrieval ranking task, understanding the relevance of implicit rationales for different stances to guide the model towards correct predictions without requiring the ground-truth of rationales, thus providing inherent interpretability. In addition, explicit rationales based on communicative features help decode the emotional and cognitive dimensions of stance, offering an interpretable understanding of the author's attitude towards the given target. Extensive experiments on the benchmark datasets of VAST, EZ-STANCE, P-Stance, and RFD using 50%, 30%, and even 10% training data prove the generalizability of our model, benefiting from the proposed architecture and interpretable design.

cs.AI [Back]

[21] Using Multi-modal Large Language Model to Boost Fireworks Algorithm's Ability in Settling Challenging Optimization Tasks

Shipeng Cen, Ying Tan

🧩 TL;DR

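This study proposes an optimization framework in which a multimodal large language model assists the fireworks algorithm. It introduces the concept of the Critical Part to extend the fireworks algorithm to complex high-dimensional optimization tasks, and it matches or exceeds state-of-the-art results on the traveling salesman problem and the electronic design automation problem.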


📘 Detailed Summary

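Motivation: Traditional zero-order and first-order optimization methods handle non-convex, high-dimensional, black-box problems poorly: they are inefficient, rely on inaccurate gradient information, and make insufficient use of optimization information. Meanwhile, marked improvements in the language understanding and code generation capabilities of large language models open new possibilities for optimization algorithm design.

Method: The study takes the fireworks algorithm as the base optimizer and combines it with a multimodal large language model. It introduces the concept of the Critical Part and uses the model's multimodal capabilities to exploit more of the information produced during optimization, extending the fireworks algorithm to complex high-dimensional tasks. Algorithms are designed specifically for the traveling salesman problem and the electronic design automation problem.

Result: Fireworks algorithms generated under the new framework match or exceed state-of-the-art results on many problem instances, supporting the effectiveness of the proposed method on complex optimization tasks.

Conclusion: The work demonstrates the potential of multimodal large language models in optimization algorithm design and offers a new approach to complex high-dimensional optimization problems. It also shows that combining traditional optimization algorithms with modern AI techniques can be synergistic, opening a new direction for research on optimization algorithms.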


📄 Abstract

As optimization problems grow increasingly complex and diverse, advancements in optimization techniques and paradigm innovations hold significant importance. The challenges posed by optimization problems are primarily manifested in their non-convexity, high-dimensionality, black-box nature, and other unfavorable characteristics. Traditional zero-order or first-order methods, which are often characterized by low efficiency, inaccurate gradient information, and insufficient utilization of optimization information, are ill-equipped to address these challenges effectively. In recent years, the rapid development of large language models (LLMs) has led to substantial improvements in their language understanding and code generation capabilities. Consequently, the design of optimization algorithms leveraging large language models has garnered increasing attention from researchers. In this study, we choose the fireworks algorithm (FWA) as the basic optimizer and propose a novel approach to assist the design of the FWA by incorporating a multi-modal large language model (MLLM). To put it simply, we propose the concept of Critical Part (CP), which extends FWA to complex high-dimensional tasks, and further utilizes the information in the optimization process with the help of the multi-modal characteristics of large language models. We focus on two specific tasks: the traveling salesman problem (TSP) and the electronic design automation problem (EDA). The experimental results show that FWAs generated under our new framework have achieved or surpassed SOTA results on many problem instances.
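
For readers unfamiliar with the base optimizer, the following is a minimal sketch of a generic, simplified fireworks algorithm on a toy continuous objective. It is not the paper's MLLM-assisted framework and does not implement the Critical Part concept. The function names, parameter defaults, the `min_amp` amplitude floor, and the sphere objective are all illustrative assumptions.

```python
import random


def sphere(x):
    """Toy objective (an assumption for illustration): sum of squares."""
    return sum(v * v for v in x)


def fireworks_minimize(f, dim, bounds, n_fireworks=5, total_sparks=30,
                       max_amp=2.0, min_amp=1e-3, iters=50, seed=0):
    """Simplified generic fireworks algorithm (hypothetical helper).

    Better fireworks emit more sparks within a smaller amplitude;
    worse fireworks emit fewer sparks within a larger amplitude.
    """
    rng = random.Random(seed)
    lo, hi = bounds
    eps = 1e-12  # avoids division by zero when all fitness values are equal
    fireworks = [[rng.uniform(lo, hi) for _ in range(dim)]
                 for _ in range(n_fireworks)]
    best = min(fireworks, key=f)
    for _ in range(iters):
        fits = [f(x) for x in fireworks]
        worst, top = max(fits), min(fits)
        sum_quality = sum(worst - v for v in fits) + eps
        sum_badness = sum(v - top for v in fits) + eps
        candidates = list(fireworks)
        for x, fit in zip(fireworks, fits):
            # Spark count grows with quality; amplitude grows with badness.
            n_sparks = max(1, round(total_sparks * (worst - fit + eps) / sum_quality))
            amp = max(min_amp, max_amp * (fit - top + eps) / sum_badness)
            for _ in range(n_sparks):
                # Perturb every coordinate and clip to the search bounds.
                spark = [min(hi, max(lo, v + rng.uniform(-amp, amp))) for v in x]
                candidates.append(spark)
        # Elitist selection: keep the best candidate, fill the rest at random.
        candidates.sort(key=f)
        best = candidates[0]
        fireworks = [best] + rng.sample(candidates[1:], n_fireworks - 1)
    return best, f(best)


if __name__ == "__main__":
    point, value = fireworks_minimize(sphere, dim=2, bounds=(-5.0, 5.0))
    print(point, value)
```

Because the best candidate is always carried into the next generation, the best objective value never gets worse across iterations. Discrete problems such as TSP would need a problem-specific spark operator, which this sketch does not attempt.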

[22] From Five Dimensions to Many: Large Language Models as Precise and Interpretable Psychological Profilers

Yi-Fei Liu, Yi-Long Lu, Di He, Hang Zhang

🧩 TL;DR

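This study shows that large language models can accurately model the correlational structure of human psychological traits from a small amount of personality-scale input. Through a process of abstraction and reasoning, they perform zero-shot psychological simulation with accuracy approaching that of machine learning algorithms trained directly on the dataset.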


📘 Detailed Summary

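Motivation: The study asks whether large language models can model the correlational structure of human psychological traits from minimal quantitative input, addressing the need of conventional methods for large training sets and domain expertise, and probing the potential of LLMs for psychological simulation.

Method: Using zero-shot prompting, various LLMs were given Big Five Personality Scale responses from 816 human individuals and asked to role-play those individuals' responses on nine other psychological scales. Analysis of reasoning traces reveals a two-stage process: information selection and compression, followed by reasoning from the resulting summaries.

Result: LLMs capture human psychological structure well: the inter-scale correlation patterns of generated responses align closely with those of human data (R² > 0.89). Zero-shot performance substantially exceeds predictions based on semantic similarity and approaches the accuracy of machine learning algorithms trained directly on the dataset.

Conclusion: LLMs can precisely predict individual psychological traits through abstraction and reasoning. The compressed summaries they generate capture synergistic information and encode emergent, second-order patterns of trait interplay, offering a powerful tool for psychological simulation and insight into the emergent reasoning capabilities of LLMs.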


📄 Abstract

Psychological constructs within individuals are widely believed to be interconnected. We investigated whether and how Large Language Models (LLMs) can model the correlational structure of human psychological traits from minimal quantitative inputs. We prompted various LLMs with Big Five Personality Scale responses from 816 human individuals to role-play their responses on nine other psychological scales. LLMs demonstrated remarkable accuracy in capturing human psychological structure, with the inter-scale correlation patterns from LLM-generated responses strongly aligning with those from human data $(R^2 > 0.89)$. This zero-shot performance substantially exceeded predictions based on semantic similarity and approached the accuracy of machine learning algorithms trained directly on the dataset. Analysis of reasoning traces revealed that LLMs use a systematic two-stage process: First, they transform raw Big Five responses into natural language personality summaries through information selection and compression, analogous to generating sufficient statistics. Second, they generate target scale responses based on reasoning from these summaries. For information selection, LLMs identify the same key personality factors as trained algorithms, though they fail to differentiate item importance within factors. The resulting compressed summaries are not merely redundant representations but capture synergistic information: adding them to original scores enhances prediction alignment, suggesting they encode emergent, second-order patterns of trait interplay. Our findings demonstrate that LLMs can precisely predict individual participants' psychological traits from minimal data through a process of abstraction and reasoning, offering both a powerful tool for psychological simulation and valuable insights into their emergent reasoning capabilities.
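
One plausible way to read the "inter-scale correlation pattern" comparison is sketched below: compute the pairwise Pearson correlations between scales separately for human and LLM-generated scores, then take the squared correlation between the two sets of pairwise values. This is an assumption about the metric, not the authors' confirmed procedure. The function names and the toy data are hypothetical.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (both must vary)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def inter_scale_correlations(scores):
    """scores maps a scale name to per-participant scores.

    Returns the correlations for every unordered pair of scales,
    in a fixed (sorted-name) order so two datasets are comparable.
    """
    names = sorted(scores)
    return [pearson(scores[a], scores[b]) for a, b in combinations(names, 2)]


def pattern_r2(human_scores, llm_scores):
    """Squared correlation between the two inter-scale correlation patterns."""
    return pearson(inter_scale_correlations(human_scores),
                   inter_scale_correlations(llm_scores)) ** 2


if __name__ == "__main__":
    # Toy data (made up for illustration): three scales, five participants.
    human = {"a": [1, 2, 3, 4, 5], "b": [2, 1, 4, 3, 5], "c": [5, 3, 4, 1, 2]}
    llm = {"a": [1, 2, 3, 4, 5], "b": [1, 2, 4, 3, 5], "c": [5, 4, 3, 1, 2]}
    print(pattern_r2(human, llm))
```

Both datasets must use the same scale names so that pairs line up, and at least three scales are needed so that the pattern has more than one pairwise value to correlate.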

[23] Towards Scalable Web Accessibility Audit with MLLMs as Copilots

Ming Gu, Ziwei Wang, Sicen Lai, Zirui Gao, Sheng Zhou, Jiajun Bu

🧩 TL;DR

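This paper presents the AAA framework, which makes web accessibility auditing scalable through a human-AI partnership model. Its core innovations are GRASP, a graph-based multimodal sampling method, and MaC, a copilot based on a multimodal large language model. Together they support end-to-end accessibility evaluation of large websites.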


📘 Detailed Summary

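Motivation: Most website user interfaces remain non-compliant with accessibility requirements, in part because current auditing methods are resource-intensive and hard to scale. WCAG-EM offers a structured evaluation methodology, but it requires substantial human effort and lacks practical support for execution at scale.

Method: The AAA framework rests on two key innovations. GRASP is a graph-based multimodal sampling method that ensures representative page coverage via learned embeddings of visual, textual, and relational cues. MaC is a multimodal large language model-based copilot that supports auditors through cross-modal reasoning.

Result: Experiments demonstrate the effectiveness of the methods. The authors contribute four novel datasets for benchmarking core stages of the audit pipeline and find that small-scale language models can serve as capable experts when fine-tuned.

Conclusion: The work shows that a human-AI partnership model is feasible for web accessibility auditing, provides an end-to-end solution for large-scale accessibility evaluation, and reveals the potential of small language models on specialized tasks.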


📄 Abstract

Ensuring web accessibility is crucial for advancing social welfare, justice, and equality in digital spaces, yet the vast majority of website user interfaces remain non-compliant, due in part to the resource-intensive and unscalable nature of current auditing practices. While WCAG-EM offers a structured methodology for site-wise conformance evaluation, it demands substantial human effort and lacks practical support for execution at scale. In this work, we present an auditing framework, AAA, which operationalizes WCAG-EM through a human-AI partnership model. AAA is anchored by two key innovations: GRASP, a graph-based multimodal sampling method that ensures representative page coverage via learned embeddings of visual, textual, and relational cues; and MaC, a multimodal large language model-based copilot that supports auditors through cross-modal reasoning and intelligent assistance in high-effort tasks. Together, these components enable scalable, end-to-end web accessibility auditing, empowering human auditors with AI-enhanced assistance for real-world impact. We further contribute four novel datasets designed for benchmarking core stages of the audit pipeline. Extensive experiments demonstrate the effectiveness of our methods and show that small-scale language models can serve as capable experts when fine-tuned.