cs.CV [Total: 40]
cs.CL [Total: 6]
cs.AI [Total: 6]

cs.CV

[1] CoIDO: Efficient Data Selection for Visual Instruction Tuning via Coupled Importance-Diversity Optimization

Yichen Yan, Ming Zhong, Qi Zhu, Xiaoling Gu, Jinpeng Chen, Huan Li

🧩 TL;DR

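This paper proposes CoIDO, a dual-objective data selection framework that jointly optimizes data importance and diversity to address the computational bottleneck in instruction tuning of multimodal large language models. A lightweight scorer trained on only a 20% random sample selects a 20% subset of the full dataset whose performance approaches that of full-data training.

📘 Detailed Summary

Motivation: Multimodal large language models rely on large-scale instruction tuning to align vision and language capabilities, but the computational cost of training on the full dataset is a major bottleneck. Existing data selection methods have two key drawbacks: high computational overhead from processing the entire dataset, and suboptimal selection caused by treating importance and diversity separately.

Method: CoIDO uses a dual-objective formulation that jointly optimizes data importance and diversity. A lightweight plug-in scorer is trained on a small random sample to learn the distribution of the candidate set, which sharply reduces computational demands. A homoscedastic uncertainty-based formulation balances the importance and diversity objectives during training.

Result: The CoIDO scorer was trained on only 20% of randomly sampled data and then applied to the full dataset to select a 20% subset for instruction tuning. On LLaVA-1.5-7B, across ten downstream tasks, the selected subset reached on average 98.2% of the performance of full-data fine-tuning.

Conclusion: CoIDO shows that jointly optimizing importance and diversity can preserve model performance while substantially reducing computational overhead. It offers an efficient and scalable data selection approach for training large multimodal models.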

📄 Abstract

Multimodal large language models (MLLMs) rely heavily on instruction tuning to align vision and language capabilities, yet the computational cost of training on large-scale datasets remains a major bottleneck. Existing data selection methods aim to mitigate this by selecting important and diverse subsets, but they often suffer from two critical drawbacks: high computational overhead from processing the entire dataset and suboptimal data selection due to separate treatment of importance and diversity. We introduce CoIDO, a novel dual-objective framework that jointly optimizes data importance and diversity to overcome these challenges. Unlike existing approaches that require costly evaluations across the whole dataset, CoIDO employs a lightweight plug-in scorer. This scorer is trained on just a small random sample of data to learn the distribution of the candidate set, drastically reducing computational demands. By leveraging a homoscedastic uncertainty-based formulation, CoIDO effectively balances importance and diversity during training, enabling efficient and scalable data selection. In our experiments, we trained the CoIDO scorer using only 20 percent of randomly sampled data. Once trained, CoIDO was applied to the entire dataset to select a 20 percent subset for instruction tuning. On the widely used LLaVA-1.5-7B model across ten downstream tasks, this selected subset achieved an impressive 98.2 percent of the performance of full-data fine-tuning, on average.
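
The abstract does not spell out the homoscedastic uncertainty-based formulation. A common form of this idea in multi-task learning gives each loss a learnable log-variance that scales the loss and adds a regularizing term. The sketch below shows that common form for two losses; the function and parameter names are hypothetical, and CoIDO's actual objective may differ.

```python
import math


def uncertainty_weighted_loss(importance_loss, diversity_loss,
                              log_var_imp, log_var_div):
    """Combine two losses with homoscedastic-uncertainty weights.

    Each log_var_* is s = log(sigma^2), a free parameter that would be
    learned jointly with the scorer (assumption, not taken from the paper).
    A larger s down-weights its loss; the additive 0.5 * s term keeps s
    from growing without bound.
    """
    weighted_imp = 0.5 * math.exp(-log_var_imp) * importance_loss + 0.5 * log_var_imp
    weighted_div = 0.5 * math.exp(-log_var_div) * diversity_loss + 0.5 * log_var_div
    return weighted_imp + weighted_div
```

With both log-variances at zero the result is simply half the sum of the two losses; raising one log-variance shrinks that loss's share of the total, which is how the balance between the objectives can shift during training.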

[2] MUSE: Model-based Uncertainty-aware Similarity Estimation for zero-shot 2D Object Detection and Segmentation

Sungmin Cho, Sungbum Park, Insoo Oh

🧩 TL;DR

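MUSE is a training-free, model-based framework for zero-shot 2D object detection and segmentation. Using multi-view template rendering and uncertainty-aware similarity estimation, it achieves state-of-the-art performance on the BOP Challenge 2025.

📘 Detailed Summary

Motivation: The work addresses limited generalization in zero-shot 2D object detection and segmentation, in particular accurate 2D detection and segmentation of unseen 3D objects without additional training or fine-tuning.

Method: MUSE uses 2D multi-view templates rendered from unseen 3D objects together with 2D object proposals extracted from the input query image. In the embedding stage it integrates class and patch embeddings, normalizing the patch embeddings with generalized mean (GeM) pooling to capture global and local representations efficiently. In the matching stage it applies a joint similarity metric that combines absolute and relative similarity, and it then refines the similarity score with an uncertainty-aware object prior.

Result: On the BOP Challenge 2025, without any additional training or fine-tuning, MUSE ranked first on the Classic Core, H3, and Industrial tracks, achieving state-of-the-art performance.

Conclusion: MUSE provides a powerful and generalizable framework for zero-shot 2D object detection and segmentation, demonstrates the effectiveness of model-based methods in the zero-shot setting, and offers a new option for object recognition in practical applications.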

📄 Abstract

In this work, we introduce MUSE (Model-based Uncertainty-aware Similarity Estimation), a training-free framework designed for model-based zero-shot 2D object detection and segmentation. MUSE leverages 2D multi-view templates rendered from 3D unseen objects and 2D object proposals extracted from input query images. In the embedding stage, it integrates class and patch embeddings, where the patch embeddings are normalized using generalized mean pooling (GeM) to capture both global and local representations efficiently. During the matching stage, MUSE employs a joint similarity metric that combines absolute and relative similarity scores, enhancing the robustness of matching under challenging scenarios. Finally, the similarity score is refined through an uncertainty-aware object prior that adjusts for proposal reliability. Without any additional training or fine-tuning, MUSE achieves state-of-the-art performance on the BOP Challenge 2025, ranking first across the Classic Core, H3, and Industrial tracks. These results demonstrate that MUSE offers a powerful and generalizable framework for zero-shot 2D object detection and segmentation.
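
Generalized mean (GeM) pooling has a standard definition: raise each feature to a power p, average over positions, and take the p-th root. The sketch below applies it over the patch axis. The default p and the clamping constant are illustrative assumptions, and how MUSE combines the pooled vector with the class embedding is not described here, so only the pooling step is shown.

```python
import numpy as np


def gem_pool(patch_embeddings, p=3.0, eps=1e-6):
    """Generalized mean pooling over the patch axis.

    patch_embeddings: array of shape (num_patches, dim). Values are clamped
    to at least eps so that fractional powers are well defined.
    p = 1 gives average pooling; large p approaches max pooling.
    """
    clamped = np.clip(patch_embeddings, eps, None)
    return np.mean(clamped ** p, axis=0) ** (1.0 / p)
```

The exponent p controls how strongly the pooled vector is dominated by the largest patch responses, which is one way a single vector can reflect both global and local evidence.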

Xiaoxu Xu, Xuexun Liu, Jinlong Li, Yitian Yuan, Qiudan Zhang, Lin Ma, Nicu Sebe, Xu Wang

🧩 TL;DR

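This paper proposes a weakly supervised semantic segmentation method that integrates 3D geometric priors. Class-aware and geometry-aware label refinement produce high-quality pseudo labels, and the method achieves state-of-the-art performance on the ScanNet and S3DIS benchmarks.

📘 Detailed Summary

Motivation: Existing 3D weakly supervised semantic segmentation methods are held back by low-quality pseudo labels and insufficient use of 3D geometric priors, which limits the development of high-performance models.

Method: A Class-Aware Label Refinement module generates balanced and accurate pseudo labels. A Geometry-Aware Label Refinement component integrates implicit 3D geometric constraints to filter out low-confidence pseudo labels. A Label Update strategy combined with self-training extends label coverage.

Result: The method achieves state-of-the-art performance on the ScanNet and S3DIS benchmarks and generalizes well in unsupervised settings, maintaining competitive accuracy.

Conclusion: Integrating 3D geometric priors with iterative label refinement effectively improves weakly supervised semantic segmentation, offering a viable path toward high-performance 3D weakly supervised models with good generalization potential.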

📄 Abstract

3D weakly supervised semantic segmentation (3D WSSS) aims to achieve semantic segmentation by leveraging sparse or low-cost annotated data, significantly reducing reliance on dense point-wise annotations. Previous works mainly employ class activation maps or pre-trained vision-language models to address this challenge. However, the low quality of pseudo-labels and the insufficient exploitation of 3D geometric priors jointly create significant technical bottlenecks in developing high-performance 3D WSSS models. In this paper, we propose a simple yet effective 3D weakly supervised semantic segmentation method that integrates 3D geometric priors into a class-aware guidance mechanism to generate high-fidelity pseudo labels. Concretely, our designed methodology first employs a Class-Aware Label Refinement module to generate more balanced and accurate pseudo labels for semantic categories. This initial refinement stage focuses on enhancing label quality through category-specific optimization. Subsequently, the Geometry-Aware Label Refinement component is developed, which strategically integrates implicit 3D geometric constraints to effectively filter out low-confidence pseudo labels that fail to comply with geometric plausibility. Moreover, to address the challenge of extensive unlabeled regions, we propose a Label Update strategy that integrates Self-Training to propagate labels into these areas. This iterative process continuously enhances pseudo-label quality while expanding label coverage, ultimately fostering the development of high-performance 3D WSSS models. Comprehensive experimental validation reveals that our proposed methodology achieves state-of-the-art performance on both ScanNet and S3DIS benchmarks while demonstrating remarkable generalization capability in unsupervised settings, maintaining competitive accuracy through its robust design.

[4] ManzaiSet: A Multimodal Dataset of Viewer Responses to Japanese Manzai Comedy

Kazuki Kawamura, Kengo Nakai, Jun Rekimoto

🧩 TL;DR

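This study presents ManzaiSet, the first large-scale multimodal dataset of viewer responses to Japanese manzai comedy. Analysis of facial video and audio from 241 participants reveals three distinct viewer types and a positive viewing-order effect, providing a resource for culturally aware emotion AI.

📘 Detailed Summary

Motivation: Affective computing has a strong Western-centric bias and lacks studies of viewer responses to entertainment content from non-Western cultures. In particular, Japanese manzai comedy, a performance form with distinctive cultural characteristics, has not been analyzed systematically with multimodal data.

Method: The study collected facial video and audio from 241 participants watching up to 10 professional manzai performances. k-means clustering identifies viewer types, individual-level analysis assesses the viewing-order effect, and automated humor classification with viewer-level response modeling compares the types.

Result: Clustering identified three viewer types: High and Stable Appreciators (72.8%), Low and Variable Decliners (13.2%), and Variable Improvers (14.0%). Individual-level analysis showed a significant positive viewing-order effect (mean slope = 0.488, p < 0.001), but no differences between types were found after FDR correction.

Conclusion: The dataset provides a foundation for culturally aware emotion AI and personalized entertainment systems. It demonstrates heterogeneity in viewer responses in a non-Western cultural context, contradicts fatigue hypotheses, and opens new directions for cross-cultural entertainment research.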

📄 Abstract

We present ManzaiSet, the first large scale multimodal dataset of viewer responses to Japanese manzai comedy, capturing facial videos and audio from 241 participants watching up to 10 professional performances in randomized order (94.6 percent watched >= 8; analyses focus on n=228). This addresses the Western centric bias in affective computing. Three key findings emerge: (1) k means clustering identified three distinct viewer types: High and Stable Appreciators (72.8 percent, n=166), Low and Variable Decliners (13.2 percent, n=30), and Variable Improvers (14.0 percent, n=32), with heterogeneity of variance (Brown Forsythe p < 0.001); (2) individual level analysis revealed a positive viewing order effect (mean slope = 0.488, t(227) = 5.42, p < 0.001, permutation p < 0.001), contradicting fatigue hypotheses; (3) automated humor classification (77 instances, 131 labels) plus viewer level response modeling found no type wise differences after FDR correction. The dataset enables culturally aware emotion AI development and personalized entertainment systems tailored to non Western contexts.
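
The viewing-order result can be read as a two-step computation: fit a slope of response against viewing position for each viewer, then test whether the mean slope differs from zero. The reported t(227) is consistent with a one-sample test over n = 228 viewers. The sketch below is a minimal version of that reading; it assumes a complete response matrix with no missing performances, and the function names are hypothetical rather than the authors' code.

```python
import numpy as np


def viewing_order_slopes(responses):
    """Least-squares slope of response against viewing position, per viewer.

    responses: array of shape (num_viewers, num_performances), where column j
    holds each viewer's response to the j-th performance they watched.
    """
    positions = np.arange(responses.shape[1])
    centered_pos = positions - positions.mean()
    centered_resp = responses - responses.mean(axis=1, keepdims=True)
    return centered_resp @ centered_pos / np.sum(centered_pos ** 2)


def one_sample_t(values):
    """t statistic for the null hypothesis that the mean of values is zero."""
    n = len(values)
    return values.mean() / (values.std(ddof=1) / np.sqrt(n))
```

A positive mean slope means responses tend to rise over the session, which is the opposite of what a fatigue account would predict.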

[5] SAVANT: Semantic Analysis with Vision-Augmented Anomaly deTection

Roberto Brusnicki, David Pop, Yuan Gao, Mattia Piccinini, Johannes Betz

🧩 TL;DR

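SAVANT is a structured reasoning framework that detects anomalous autonomous-driving scenarios through layered scene analysis and a two-phase pipeline. It markedly improves the reliability and accuracy of vision language models in semantic anomaly detection and enables a small open-source model to outperform proprietary models.

📘 Detailed Summary

Motivation: Autonomous driving systems remain critically vulnerable to rare, out-of-distribution scenarios with semantic anomalies. Existing prompting approaches for vision language models perform unreliably and depend on expensive proprietary models, which limits practical deployment.

Method: SAVANT is a structured reasoning framework built on layered scene analysis and a two-phase pipeline: structured scene description extraction followed by multi-modal evaluation. The analysis covers four semantic layers (Street, Infrastructure, Movable Objects, and Environment), turning VLM reasoning from ad-hoc prompting into systematic analysis.

Result: SAVANT achieves 89.6% recall and 88.0% accuracy on real-world driving scenarios, significantly outperforming unstructured baselines. More importantly, a fine-tuned 7B-parameter open-source model (Qwen2.5VL) achieves 90.8% recall and 93.8% accuracy, surpassing all models evaluated while allowing local deployment at near-zero cost.

Conclusion: By automatically labeling over 9,640 real-world images, SAVANT addresses data scarcity in anomaly detection and provides a practical path toward reliable, accessible semantic monitoring for autonomous driving systems. It shows that a structured framework can let a small open-source model outperform proprietary models.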

📄 Abstract

Autonomous driving systems remain critically vulnerable to the long-tail of rare, out-of-distribution scenarios with semantic anomalies. While Vision Language Models (VLMs) offer promising reasoning capabilities, naive prompting approaches yield unreliable performance and depend on expensive proprietary models, limiting practical deployment. We introduce SAVANT (Semantic Analysis with Vision-Augmented Anomaly deTection), a structured reasoning framework that achieves high accuracy and recall in detecting anomalous driving scenarios from input images through layered scene analysis and a two-phase pipeline: structured scene description extraction followed by multi-modal evaluation. Our approach transforms VLM reasoning from ad-hoc prompting to systematic analysis across four semantic layers: Street, Infrastructure, Movable Objects, and Environment. SAVANT achieves 89.6% recall and 88.0% accuracy on real-world driving scenarios, significantly outperforming unstructured baselines. More importantly, we demonstrate that our structured framework enables a fine-tuned 7B parameter open-source model (Qwen2.5VL) to achieve 90.8% recall and 93.8% accuracy - surpassing all models evaluated while enabling local deployment at near-zero cost. By automatically labeling over 9,640 real-world images with high accuracy, SAVANT addresses the critical data scarcity problem in anomaly detection and provides a practical path toward reliable, accessible semantic monitoring for autonomous systems.
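
The two-phase pipeline can be pictured as two calls to a vision language model: one that produces a description per semantic layer, and one that judges the scene given the image and that description. The sketch below shows only this control flow. The `vlm` callable, the prompt wording, and the yes/no parsing are all invented for illustration and are not SAVANT's actual prompts or interface.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# The four semantic layers named in the abstract.
LAYERS = ("Street", "Infrastructure", "Movable Objects", "Environment")


@dataclass
class SceneDescription:
    layers: Dict[str, str]  # layer name -> free-text description


def describe_scene(image, vlm: Callable[[object, str], str]) -> SceneDescription:
    """Phase 1: ask the VLM for one description per semantic layer."""
    return SceneDescription({
        layer: vlm(image, f"Describe the {layer} layer of this driving scene.")
        for layer in LAYERS
    })


def evaluate_scene(image, description: SceneDescription,
                   vlm: Callable[[object, str], str]) -> bool:
    """Phase 2: pass the image and structured description back for a verdict."""
    summary = "\n".join(f"{name}: {text}" for name, text in description.layers.items())
    answer = vlm(image, f"Scene description:\n{summary}\nIs this scene anomalous? Answer yes or no.")
    return answer.strip().lower().startswith("yes")
```

Separating description from judgment means the second call reasons over an explicit, layer-by-layer account of the scene instead of a single free-form prompt.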

[6] HouseTour: A Virtual Real Estate A(I)gent

Ata Çelen, Marc Pollefeys, Daniel Barath, Iro Armeni

🧩 TL;DR

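This paper proposes HouseTour, which generates smooth 3D camera trajectories via a diffusion process and combines 3D Gaussian splatting rendering with a vision-language model to produce spatially-aware virtual tour videos and natural language descriptions from a collection of images. Experiments on a real-estate dataset show that incorporating 3D camera trajectories improves text generation.

📘 Detailed Summary

Motivation: Existing vision-language models struggle with geometric reasoning and cannot effectively handle camera trajectory generation and description in 3D spaces. This work aims to generate professional-quality virtual tour videos automatically from an image collection of an existing 3D space, removing the need for specialized equipment and expertise.

Method: Smooth camera trajectories are generated via a diffusion process constrained by known camera poses, and the 3D geometric information is integrated into the vision-language model for spatially-aware descriptions. 3D Gaussian splatting renders novel views along the trajectory. The authors also build the HouseTour dataset, which contains over 1,200 house-tour videos.

Result: Experiments show that incorporating 3D camera trajectories into text generation improves performance over methods that handle each task independently. The study introduces a new joint metric and evaluates both individual and end-to-end performance, achieving professional-quality video generation in real-estate scenarios.

Conclusion: The work enables automated, professional-quality video creation without specialized equipment or expertise, offering a practical solution for real-estate and tourism applications. Integrating 3D geometric information improves the spatial reasoning of vision-language models and opens a new direction for multimodal 3D scene understanding.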

📄 Abstract

We introduce HouseTour, a method for spatially-aware 3D camera trajectory and natural language summary generation from a collection of images depicting an existing 3D space. Unlike existing vision-language models (VLMs), which struggle with geometric reasoning, our approach generates smooth video trajectories via a diffusion process constrained by known camera poses and integrates this information into the VLM for 3D-grounded descriptions. We synthesize the final video using 3D Gaussian splatting to render novel views along the trajectory. To support this task, we present the HouseTour dataset, which includes over 1,200 house-tour videos with camera poses, 3D reconstructions, and real estate descriptions. Experiments demonstrate that incorporating 3D camera trajectories into the text generation process improves performance over methods handling each task independently. We evaluate both individual and end-to-end performance, introducing a new joint metric. Our work enables automated, professional-quality video creation for real estate and touristic applications without requiring specialized expertise or equipment.

[7] Chimera: Compositional Image Generation using Part-based Concepting

Shivam Singh, Yiming Chen, Agneet Chatterjee, Amit Raj, James Hays, Yezhou Yang, Chitra Baral

🧩 TL;DR

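This paper proposes Chimera, a personalized image generation model that composes new objects from specified parts of multiple source images according to textual instructions, without user-specified masks or annotations. By building a dataset of semantic atoms and training a diffusion prior model with part-conditional guidance, it outperforms baselines in part alignment and compositional accuracy.

📘 Detailed Summary

Motivation: Personalized image generation models are good at synthesizing images from text or a single image, but they lack explicit control for composing objects from specific parts of multiple source images and usually require user-provided masks or annotations. This work aims to remove that limitation with a generative model that precisely combines parts of different source images according to textual instructions.

Method: The authors first construct a dataset from a taxonomy built on 464 unique (part, subject) pairs, termed semantic atoms, generate 37k prompts, and synthesize the corresponding images with a high-fidelity text-to-image model. They then train a custom diffusion prior model with part-conditional guidance, which steers the image-conditioning features to enforce both semantic identity and spatial layout.

Result: Human evaluations and the proposed objective metric PartEval show that Chimera outperforms baselines by 14% in part alignment and compositional accuracy and by 21% in visual quality. PartEval assesses the fidelity and compositional accuracy of generation pipelines.

Conclusion: Chimera shows that precise part composition is feasible without user-specified masks, giving personalized image generation a new form of compositional control. The gains in part alignment and visual quality indicate that part-conditional guidance is effective, and the work provides a benchmark and evaluation framework for future research on compositional generation.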

📄 Abstract

Personalized image generative models are highly proficient at synthesizing images from text or a single image, yet they lack explicit control for composing objects from specific parts of multiple source images without user specified masks or annotations. To address this, we introduce Chimera, a personalized image generation model that generates novel objects by combining specified parts from different source images according to textual instructions. To train our model, we first construct a dataset from a taxonomy built on 464 unique (part, subject) pairs, which we term semantic atoms. From this, we generate 37k prompts and synthesize the corresponding images with a high-fidelity text-to-image model. We train a custom diffusion prior model with part-conditional guidance, which steers the image-conditioning features to enforce both semantic identity and spatial layout. We also introduce an objective metric PartEval to assess the fidelity and compositional accuracy of generation pipelines. Human evaluations and our proposed metric show that Chimera outperforms other baselines by 14% in part alignment and compositional accuracy and 21% in visual quality.

[8] Online In-Context Distillation for Low-Resource Vision Language Models

Zhiqi Kang, Rahaf Aljundi, Vaggelis Dorovatas, Karteek Alahari

🧩 TL;DR

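This paper proposes an online In-Context Distillation method in which a small vision-language model collaborates with a stronger teacher model at inference time through sparse demonstrations, substantially improving performance in low-resource settings. Under limited compute budgets it outperforms fine-tuning and improves small-model performance by up to 33%.

📘 Detailed Summary

Motivation: Large vision-language models are impractical to deploy in low-resource, budget-constrained settings, while small models are efficient but need costly fine-tuning to close the performance gap with larger models. The work asks how to improve small vision-language models effectively under resource constraints.

Method: The online In-Context Distillation framework combines a cross-modal demonstration selection strategy, teacher test-time scaling to reduce noise, and student uncertainty conditioning to dynamically populate a demonstration pool and minimize teacher queries. It builds on an in-depth analysis of when vision-language in-context learning is feasible, identifying suitable model scales and model choices.

Result: ICD substantially improves small-model performance, by up to 33%, using scarce teacher annotations (as low as 4%). Under constrained compute budgets, in-context learning outperforms fine-tuning, and the method competes with the teacher's zero-shot performance.

Conclusion: In-context distillation is an effective way to improve small vision-language models in low-resource settings and offers a practical solution for resource-constrained deployment. The work shows the advantage of in-context learning over conventional fine-tuning under limited compute budgets and points toward future research on efficient model collaboration.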

📄 Abstract

As the field continues its push for ever more resources, this work turns the spotlight on a critical question: how can vision-language models (VLMs) be adapted to thrive in low-resource, budget-constrained settings? While large VLMs offer strong performance, they are impractical to deploy in such settings. Small VLMs, on the other hand, are efficient but typically require costly fine-tuning to close the performance gap with larger models in the deployment domain. Inspired by the in-context learning framework, we propose an online In-Context Distillation (ICD) method, in which a small VLM collaborates with a stronger teacher model at inference time, distilling its knowledge via sparse demonstrations to efficiently bridge the gap between them. Our method is built on an in-depth analysis that identifies the scale and the choice of models for which vision-language ICL is currently feasible, and demonstrates the advantage of ICL over fine-tuning under constrained compute budgets. We enhance our method with a novel cross-modal demonstration selection strategy, teacher test-time scaling to reduce noise, and student uncertainty conditioning to dynamically populate a demonstration pool and minimize teacher queries. Our ICD method significantly boosts the performance of small models (up to 33%) using scarce teacher annotations (as low as 4%), and competes with the teacher's zero-shot performance.
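
Student uncertainty conditioning can be sketched as a gate: the student answers with the demonstrations it already has, and the teacher is queried only when the student's answer distribution is too uncertain, with the teacher's answer added to the demonstration pool. The sketch below uses entropy as the uncertainty measure and returns the teacher's answer on uncertain queries. Both choices, along with the `student` and `teacher` callables and the threshold, are assumptions for illustration and not the paper's exact procedure.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def online_icd_step(query, student, teacher, pool, threshold):
    """Answer one query, calling the teacher only when the student is unsure.

    student(query, pool) -> (answer, probs); teacher(query) -> answer.
    Returns (answer, teacher_was_queried). The pool is updated in place.
    """
    answer, probs = student(query, pool)
    if entropy(probs) <= threshold:
        return answer, False
    teacher_answer = teacher(query)
    # Store the teacher's answer so later queries can use it as a demonstration.
    pool.append((query, teacher_answer))
    return teacher_answer, True
```

As the pool fills, the student should become confident on more queries, so the share of queries that reach the teacher can fall over time.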

[9] Adapting Stereo Vision From Objects To 3D Lunar Surface Reconstruction with the StereoLunar Dataset

Clementine Grethen, Simone Gasparini, Geraldine Morin, Jeremy Lebreton, Lucas Marti, Manuel Sanchez-Gestido

🧩 TL;DR

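This study presents LunarStereo, the first open dataset of stereo image pairs of the Moon, and fine-tunes the MASt3R model for robust 3D reconstruction in the lunar environment, substantially improving reconstruction under harsh lunar conditions.

📘 Detailed Summary

Motivation: Existing stereo vision reconstruction methods struggle with lunar surfaces because of the Moon's lack of texture, difficult lighting variations, and atypical orbital trajectories. State-of-the-art deep learning models are trained mainly on human-scale datasets, have rarely been tested on planetary imagery, and cannot be transferred directly to lunar conditions.

Method: The study builds LunarStereo, the first dataset of lunar stereo image pairs simulated with ray tracing from high-resolution topography and reflectance models. The MASt3R model is then adapted to the lunar domain by fine-tuning on this dataset, which provides physically grounded supervision for 3D reconstruction.

Result: Extensive experiments on synthetic and real lunar data validate the approach, evaluating 3D surface reconstruction and relative pose estimation. The method shows significant improvements over zero-shot baselines and lays groundwork for cross-scale generalization in extraterrestrial environments.

Conclusion: The work provides the first high-quality dataset and an effective transfer learning approach for lunar 3D reconstruction, demonstrates that deep learning can adapt to harsh planetary environments, and opens new possibilities for autonomous navigation and terrain analysis in future space exploration missions.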

📄 Abstract

Accurate 3D reconstruction of lunar surfaces is essential for space exploration. However, existing stereo vision reconstruction methods struggle in this context due to the Moon's lack of texture, difficult lighting variations, and atypical orbital trajectories. State-of-the-art deep learning models, trained on human-scale datasets, have rarely been tested on planetary imagery and cannot be transferred directly to lunar conditions. To address this issue, we introduce LunarStereo, the first open dataset of photorealistic stereo image pairs of the Moon, simulated using ray tracing based on high-resolution topography and reflectance models. It covers diverse altitudes, lighting conditions, and viewing angles around the lunar South Pole, offering physically grounded supervision for 3D reconstruction tasks. Based on this dataset, we adapt the MASt3R model to the lunar domain through fine-tuning on LunarStereo. We validate our approach through extensive qualitative and quantitative experiments on both synthetic and real lunar data, evaluating 3D surface reconstruction and relative pose estimation. The results demonstrate significant improvements over zero-shot baselines and pave the way for robust cross-scale generalization in extraterrestrial environments.

[10] RadDiagSeg-M: A Vision Language Model for Joint Diagnosis and Multi-Target Segmentation in Radiology

Chengrun Li, Corentin Royer, Haozhe Luo, Bastian Wittmann, Xia Li, Ibrahim Hamamci, Sezgin Er, Anjany Sekuboyina, Bjoern Menze

🧩 TL;DR

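This study presents the RadDiagSeg-D dataset and the RadDiagSeg-M model, addressing the difficulty medical vision language models have in jointly generating diagnostic text and pixel-level segmentation masks, and enabling joint abnormality detection, diagnosis, and flexible segmentation.

📘 Detailed Summary

Motivation: Most current medical vision language models cannot jointly generate diagnostic text and pixel-level segmentation masks. This is a major limitation for clinical application, because assistive systems that fail to provide both modalities simultaneously offer limited value to medical practitioners.

Method: The authors first build the RadDiagSeg-D dataset, which combines abnormality detection, diagnosis, and multi-target segmentation into a unified and hierarchical task. Using this dataset, they develop the RadDiagSeg-M vision-language model, which performs joint abnormality detection, diagnosis, and flexible segmentation.

Result: RadDiagSeg-M performs strongly across all components of the multi-target text-and-mask generation task, establishing a robust and competitive baseline and producing highly informative, clinically useful outputs.

Conclusion: The work addresses the need for richer contextual information in assistive diagnosis. By jointly generating text and segmentation masks, it increases the practical value of medical vision language models in clinical applications and provides a foundation for multimodal medical AI systems.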

📄 Abstract

Most current medical vision language models struggle to jointly generate diagnostic text and pixel-level segmentation masks in response to complex visual questions. This represents a major limitation towards clinical application, as assistive systems that fail to provide both modalities simultaneously offer limited value to medical practitioners. To alleviate this limitation, we first introduce RadDiagSeg-D, a dataset combining abnormality detection, diagnosis, and multi-target segmentation into a unified and hierarchical task. RadDiagSeg-D covers multiple imaging modalities and is precisely designed to support the development of models that produce descriptive text and corresponding segmentation masks in tandem. Subsequently, we leverage the dataset to propose a novel vision-language model, RadDiagSeg-M, capable of joint abnormality detection, diagnosis, and flexible segmentation. RadDiagSeg-M provides highly informative and clinically useful outputs, effectively addressing the need to enrich contextual information for assistive diagnosis. Finally, we benchmark RadDiagSeg-M and showcase its strong performance across all components involved in the task of multi-target text-and-mask generation, establishing a robust and competitive baseline.

[11] Visual Space Optimization for Zero-shot Learning

Xinsheng Wang, Shanmin Pang, Jihua Zhu, Zhongyu Li, Zhiqiang Tian, Yaochen Li

🧩 TL;DR

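This paper proposes two ways to optimize the visual space for zero-shot learning: a visual prototype method and an intermediate embedding space optimization. Experiments on four benchmark datasets show that optimizing the visual space benefits zero-shot learning, and the prototype method achieves new state-of-the-art performance.

📘 Detailed Summary

Motivation: Existing zero-shot learning methods usually take the visual space formed by deep visual features as the embedding space. However, the discrete distribution of instances in the visual space makes the data structure indistinct, which hinders effective embedding of semantic vectors into the visual space. The visual space therefore needs to be optimized to improve zero-shot learning.

Method: Two visual space optimization strategies are proposed. The first is a visual prototype method, which learns one visual prototype per visual class so that a class is represented by a prototype feature instead of a series of discrete visual features. The second optimizes the visual feature structure in an intermediate embedding space, using a multilayer perceptron based algorithm that learns the common intermediate embedding space while making the visual data structure more distinctive.

Result: Extensive experiments on four benchmark datasets show that optimizing the visual space benefits zero-shot learning, and the proposed prototype-based method achieves new state-of-the-art performance.

Conclusion: Optimizing the visual space is a key factor in improving zero-shot learning. The visual prototype method simplifies embedding through class-level representations, while the intermediate embedding space method improves data separability through structural optimization, suggesting new directions for zero-shot learning research.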

📄 Abstract

Zero-shot learning, which aims to recognize new categories that are not included in the training set, has gained popularity owing to its potential ability in real-world applications. Zero-shot learning models rely on learning an embedding space, where both semantic descriptions of classes and visual features of instances can be embedded for nearest neighbor search. Recently, most of the existing works consider the visual space formulated by deep visual features as an ideal choice of the embedding space. However, the discrete distribution of instances in the visual space makes the data structure unremarkable. We argue that optimizing the visual space is crucial as it allows semantic vectors to be embedded into the visual space more effectively. In this work, we propose two strategies to accomplish this purpose. One is the visual prototype based method, which learns a visual prototype for each visual class, so that, in the visual space, a class can be represented by a prototype feature instead of a series of discrete visual features. The other is to optimize the visual feature structure in an intermediate embedding space, and in this method we successfully devise a multilayer perceptron framework based algorithm that is able to learn the common intermediate embedding space and meanwhile to make the visual data structure more distinctive. Through extensive experimental evaluation on four benchmark datasets, we demonstrate that optimizing visual space is beneficial for zero-shot learning. Besides, the proposed prototype based method achieves the new state-of-the-art performance.
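
The prototype idea replaces the scattered instance features of a class with a single vector, and classification then reduces to nearest neighbor search against one vector per class. The sketch below uses the class mean as a stand-in prototype, which is an assumption: the paper learns its prototypes, and at test time the class vectors would be semantic descriptions embedded into the visual space. Function names are hypothetical.

```python
import numpy as np


def class_prototypes(features, labels):
    """Mean feature vector per class; returns (class_ids, prototypes).

    The mean is a simple stand-in for a learned prototype.
    """
    class_ids = np.unique(labels)
    prototypes = np.stack([features[labels == c].mean(axis=0) for c in class_ids])
    return class_ids, prototypes


def nearest_class(query_feature, class_ids, class_vectors):
    """Class whose vector is closest (Euclidean) to the query feature."""
    distances = np.linalg.norm(class_vectors - query_feature, axis=1)
    return class_ids[np.argmin(distances)]
```

With one target vector per class, a semantic-to-visual mapping has a single regression target for each class instead of many scattered instance features.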

[12] VLSU: Mapping the Limits of Joint Multimodal Understanding for AI Safety

Shruti Palaskar, Leon Gatys, Mona Abdelrahman, Mar Jacobo, Larry Lindsey, Rutika Moharir, Gunnar Lund, Yang Xu, Navid Shiee, Jeffrey Bigham, Charles Maalouf, Joseph Yitan Cheng

🧩 TL;DR

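This paper presents the Vision Language Safety Understanding (VLSU) framework, which systematically evaluates multimodal model safety through fine-grained severity classification and combinatorial analysis. It finds systematic failures in joint image-text reasoning: 34% of joint safety classification errors occur even when each individual modality is classified correctly.

📘 Detailed Summary

Motivation: Safety evaluation of multimodal foundation models usually treats vision and language inputs separately, missing the risk that benign content becomes harmful when interpreted jointly. Existing approaches also fail to distinguish clearly unsafe content from borderline cases, leading to over-blocking or to under-refusal of genuinely harmful content.

Method: The Vision Language Safety Understanding (VLSU) framework evaluates multimodal safety systematically through fine-grained severity classification and combinatorial analysis across 17 distinct safety patterns. A multi-stage pipeline with real-world images and human annotation yields a large-scale benchmark of 8,187 samples spanning 15 harm categories.

Result: Evaluation of eleven state-of-the-art models reveals systematic joint understanding failures. Models achieve over 90% accuracy on clear unimodal safety signals, but performance drops to 20-55% when joint image-text reasoning is required to determine the safety label. Most critically, 34% of errors in joint image-text safety classification occur despite correct classification of the individual modalities. Models also struggle to balance refusing unsafe content with responding to borderline cases that deserve engagement: instruction framing reduces the over-blocking rate of Gemini-1.5 on borderline content from 62.4% to 10.4%, but at the cost of the refusal rate on unsafe content dropping from 90.8% to 53.9%.

Conclusion: The framework exposes weaknesses in joint image-text understanding and alignment gaps in current models, and provides a critical test bed for the next milestones in research on robust vision-language safety. The findings underline the need for multimodal safety evaluation that tests compositional reasoning and addresses the limits of existing methods in distinguishing borderline cases and avoiding over-blocking.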

📄 Abstract

Safety evaluation of multimodal foundation models often treats vision and language inputs separately, missing risks from joint interpretation where benign content becomes harmful in combination. Existing approaches also fail to distinguish clearly unsafe content from borderline cases, leading to problematic over-blocking or under-refusal of genuinely harmful content. We present Vision Language Safety Understanding (VLSU), a comprehensive framework to systematically evaluate multimodal safety through fine-grained severity classification and combinatorial analysis across 17 distinct safety patterns. Using a multi-stage pipeline with real-world images and human annotation, we construct a large-scale benchmark of 8,187 samples spanning 15 harm categories. Our evaluation of eleven state-of-the-art models reveals systematic joint understanding failures: while models achieve 90%-plus accuracy on clear unimodal safety signals, performance degrades substantially to 20-55% when joint image-text reasoning is required to determine the safety label. Most critically, 34% of errors in joint image-text safety classification occur despite correct classification of the individual modalities, further demonstrating absent compositional reasoning capabilities. Additionally, we find that models struggle to balance refusing unsafe content while still responding to borderline cases that deserve engagement. For example, we find that instruction framing can reduce the over-blocking rate on borderline content from 62.4% to 10.4% in Gemini-1.5, but only at the cost of under-refusing on unsafe content with refusal rate dropping from 90.8% to 53.9%. Overall, our framework exposes weaknesses in joint image-text understanding and alignment gaps in current models, and provides a critical test bed to enable the next milestones in research on robust vision-language safety.
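
The 34% figure is a share of errors, not an error rate: among samples where the joint image-text label is wrong, it counts those where the image-only and text-only classifications were both right. The sketch below computes such a share from per-sample correctness flags. The record layout and key names are hypothetical and not the benchmark's actual schema.

```python
def compositional_error_share(samples):
    """Fraction of joint errors that occur despite correct unimodal labels.

    samples: iterable of dicts with boolean keys 'image_ok', 'text_ok',
    and 'joint_ok' (hypothetical schema).
    """
    joint_errors = [s for s in samples if not s["joint_ok"]]
    if not joint_errors:
        return 0.0
    despite_correct = sum(1 for s in joint_errors if s["image_ok"] and s["text_ok"])
    return despite_correct / len(joint_errors)
```

A high share indicates failures of composition itself, since the model read each modality correctly and still mislabeled the pair.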

[13] BlendCLIP: Bridging Synthetic and Real Domains for Zero-Shot 3D Object Classification with Multimodal Pretraining

Ajinkya Khoche, Gergő László Nagy, Maciej Wozniak, Thomas Gustafsson, Patric Jensfelt

🧩 TL;DR

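BlendCLIP is a multimodal pretraining framework that uses a curriculum-based data mixing strategy to bridge the domain gap between synthetic data and real LiDAR scans, achieving state-of-the-art zero-shot 3D object classification. Only a small share of real-world samples is needed to markedly improve generalization to outdoor scenes.

📘 Detailed Summary

Motivation: Zero-shot 3D object classification is hindered by a significant domain gap between synthetic training data and the sparse, noisy LiDAR scans encountered in the real world. Methods trained solely on synthetic data fail to generalize to outdoor scenes, while methods trained only on real data lack the semantic diversity to recognize rare or unseen objects.

Method: BlendCLIP first builds a large-scale dataset of object-level triplets (point cloud, image, and text description) mined from real-world driving data. Its core contribution is a curriculum-based data mixing strategy that first trains the model on semantically rich synthetic CAD data and then progressively adapts it to the specific characteristics of real-world scans.

Result: The approach is highly label-efficient: introducing as few as 1.5% real-world samples per batch boosts zero-shot accuracy on the nuScenes benchmark by 27%. The final model achieves state-of-the-art performance on outdoor datasets such as nuScenes and TruckScenes, improving over the best prior method by 19.3% on nuScenes, while maintaining strong generalization on diverse synthetic benchmarks.

Conclusion: Effective domain adaptation, rather than full-scale real-world annotation, is the key to robust open-vocabulary 3D perception. The method offers a practical solution for zero-shot 3D classification in real applications and demonstrates the potential of strategically combining synthetic and real data.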

📄 Abstract

Zero-shot 3D object classification is crucial for real-world applications like autonomous driving; however, it is often hindered by a significant domain gap between the synthetic data used for training and the sparse, noisy LiDAR scans encountered in the real world. Current methods trained solely on synthetic data fail to generalize to outdoor scenes, while those trained only on real data lack the semantic diversity to recognize rare or unseen objects. We introduce BlendCLIP, a multimodal pretraining framework that bridges this synthetic-to-real gap by strategically combining the strengths of both domains. We first propose a pipeline to generate a large-scale dataset of object-level triplets -- consisting of a point cloud, image, and text description -- mined directly from real-world driving data and human annotated 3D boxes. Our core contribution is a curriculum-based data mixing strategy that first grounds the model in the semantically rich synthetic CAD data before progressively adapting it to the specific characteristics of real-world scans. Our experiments show that our approach is highly label-efficient: introducing as few as 1.5% real-world samples per batch into training boosts zero-shot accuracy on the nuScenes benchmark by 27%. Consequently, our final model achieves state-of-the-art performance on challenging outdoor datasets like nuScenes and TruckScenes, improving over the best prior method by 19.3% on nuScenes, while maintaining strong generalization on diverse synthetic benchmarks. Our findings demonstrate that effective domain adaptation, not full-scale real-world annotation, is the key to unlocking robust open-vocabulary 3D perception. Our code and dataset will be released upon acceptance on https://github.com/kesu1/BlendCLIP.
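
Curriculum-based data mixing can be pictured as a schedule for the share of real samples in each batch, starting from synthetic-only batches and rising toward a small target share. The sketch below uses a linear ramp up to 1.5% per batch. The ramp shape, the function names, and the sampling scheme are assumptions for illustration; the abstract states only that real data is introduced progressively and that 1.5% per batch is already effective.

```python
import random


def real_fraction(step, total_steps, start=0.0, end=0.015):
    """Linearly ramp the per-batch share of real samples from start to end."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * progress


def mixed_batch(synthetic, real, batch_size, fraction_real, rng):
    """Draw a shuffled batch with round(batch_size * fraction_real) real samples."""
    num_real = round(batch_size * fraction_real)
    batch = rng.sample(real, num_real) + rng.sample(synthetic, batch_size - num_real)
    rng.shuffle(batch)
    return batch
```

At a batch size of 200, a 1.5% share corresponds to three real samples per batch, which gives a sense of how little real data the reported setting uses.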

[14] The Impact of Image Resolution on Biomedical Multimodal Large Language Models

Liangyu Chen, James Burgess, Jeffrey J Nirschl, Orr Zohar, Serena Yeung-Levy

🧩 TL;DR

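This study examines how image resolution affects biomedical multimodal large language models. It finds that native-resolution training and inference significantly improve performance and proposes mixed-resolution training to balance computational constraints with performance requirements.

📘 Detailed Summary

Motivation: Most multimodal large language models are designed for low-resolution images from general-purpose datasets. Applied to biomedical images that require high-resolution analysis, they risk losing critical information, which limits their effectiveness in biomedical research and clinical applications.

Method: The study runs systematic experiments on how different resolution settings affect model performance, covering native-resolution training and inference, performance under mismatched training and inference resolutions, and the development and validation of a mixed-resolution training strategy.

Result: Native-resolution training and inference significantly improve performance across multiple tasks; misalignment between training and inference resolutions severely degrades performance; and mixed-resolution training effectively mitigates this misalignment, balancing computational constraints with performance requirements.

Conclusion: The authors recommend prioritizing native-resolution inference and mixed-resolution datasets when developing biomedical multimodal large language models, to optimize them for scientific research and clinical applications.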

📄 Abstract

Imaging technologies are fundamental to biomedical research and modern medicine, requiring analysis of high-resolution images across various modalities. While multimodal large language models (MLLMs) show promise for biomedical image analysis, most are designed for low-resolution images from general-purpose datasets, risking critical information loss. We investigate how image resolution affects MLLM performance in biomedical applications and demonstrate that: (1) native-resolution training and inference significantly improve performance across multiple tasks, (2) misalignment between training and inference resolutions severely degrades performance, and (3) mixed-resolution training effectively mitigates misalignment and balances computational constraints with performance requirements. Based on these findings, we recommend prioritizing native-resolution inference and mixed-resolution datasets to optimize biomedical MLLMs for transformative impact in scientific research and clinical applications.

[15] UWBench: A Comprehensive Vision-Language Benchmark for Underwater Understanding

Da Zhang, Chenggang Rong, Bingyu Li, Feiyu Wang, Zhiyuan Zhao, Junyu Gao, Xuelong Li

🧩 TL;DR

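This paper presents UWBench, a comprehensive benchmark designed for underwater vision-language understanding. It contains 15,003 high-resolution underwater images with rich annotations for evaluating large vision-language models in complex underwater environments.

📘 Detailed Summary

Motivation: Large vision-language models have achieved remarkable success in natural scene understanding, but their application to underwater environments remains largely unexplored. Underwater imagery poses unique challenges, including severe light attenuation, color distortion, and suspended particle scattering, and it requires specialized knowledge of marine ecosystems and organism taxonomy.

Method: The authors build the UWBench benchmark dataset of 15,003 high-resolution underwater images captured across diverse aquatic environments. The images carry human-verified annotations, in total 15,281 object referring expressions and 124,983 question-answer pairs covering reasoning capabilities from object recognition to ecological relationship understanding.

Result: Three benchmarks are established on UWBench: detailed image captioning, visual grounding for precise localization of marine organisms, and visual question answering for multimodal reasoning about underwater environments. Extensive experiments on state-of-the-art vision-language models show that underwater understanding remains challenging, with substantial room for improvement.

Conclusion: The benchmark is a resource for advancing vision-language research in underwater contexts and supports applications in marine science, ecological monitoring, and autonomous underwater exploration. The study exposes the limits of current models in underwater understanding and indicates directions for improvement.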

📄 Abstract

Large vision-language models (VLMs) have achieved remarkable success in natural scene understanding, yet their application to underwater environments remains largely unexplored. Underwater imagery presents unique challenges including severe light attenuation, color distortion, and suspended particle scattering, while requiring specialized knowledge of marine ecosystems and organism taxonomy. To bridge this gap, we introduce UWBench, a comprehensive benchmark specifically designed for underwater vision-language understanding. UWBench comprises 15,003 high-resolution underwater images captured across diverse aquatic environments, encompassing oceans, coral reefs, and deep-sea habitats. Each image is enriched with human-verified annotations including 15,281 object referring expressions that precisely describe marine organisms and underwater structures, and 124,983 question-answer pairs covering diverse reasoning capabilities from object recognition to ecological relationship understanding. The dataset captures rich variations in visibility, lighting conditions, and water turbidity, providing a realistic testbed for model evaluation. Based on UWBench, we establish three comprehensive benchmarks: detailed image captioning for generating ecologically informed scene descriptions, visual grounding for precise localization of marine organisms, and visual question answering for multimodal reasoning about underwater environments. Extensive experiments on state-of-the-art VLMs demonstrate that underwater understanding remains challenging, with substantial room for improvement. Our benchmark provides essential resources for advancing vision-language research in underwater contexts and supporting applications in marine science, ecological monitoring, and autonomous underwater exploration. Our code and benchmark will be available.

[16] Zero-Shot Vehicle Model Recognition via Text-Based Retrieval-Augmented Generation

Wei-Chia Chang, Yan-Ann Chen

🧩 TL;DR

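This paper proposes a zero-shot vehicle make and model recognition method that combines vision-language models with retrieval-augmented generation. It recognizes vehicles through text-based reasoning without large-scale retraining and improves recognition by nearly 20% over the CLIP baseline.

📘 Detailed Summary

Motivation: Existing vehicle make and model recognition (VMMR) methods struggle to adapt to newly released models, and the fixed pretrained weights of vision-language models such as CLIP limit performance without image-specific fine-tuning. A solution is needed that supports zero-shot recognition without large-scale retraining.

Method: The proposed pipeline integrates a vision-language model (VLM) with retrieval-augmented generation (RAG). A VLM first converts the vehicle image into descriptive attributes, which are matched against a database of textual features. The retrieved entries are then combined with the description to build a prompt, from which a language model infers the make and model.

Result: Experiments show that the proposed method improves recognition by nearly 20% over the CLIP baseline, demonstrating the effectiveness of RAG-enhanced language model reasoning for vehicle recognition.

Conclusion: The method avoids large-scale retraining and can be updated quickly by adding textual descriptions of new vehicles, showing the potential of RAG-enhanced LM reasoning for scalable VMMR in smart-city applications.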

📄 Abstract

Vehicle make and model recognition (VMMR) is an important task in intelligent transportation systems, but existing approaches struggle to adapt to newly released models. Contrastive Language-Image Pretraining (CLIP) provides strong visual-text alignment, yet its fixed pretrained weights limit performance without costly image-specific finetuning. We propose a pipeline that integrates vision language models (VLMs) with Retrieval-Augmented Generation (RAG) to support zero-shot recognition through text-based reasoning. A VLM converts vehicle images into descriptive attributes, which are compared against a database of textual features. Relevant entries are retrieved and combined with the description to form a prompt, and a language model (LM) infers the make and model. This design avoids large-scale retraining and enables rapid updates by adding textual descriptions of new vehicles. Experiments show that the proposed method improves recognition by nearly 20% over the CLIP baseline, demonstrating the potential of RAG-enhanced LM reasoning for scalable VMMR in smart-city applications.
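
The retrieve-then-prompt step described in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not say how descriptions are compared with the database, so Jaccard word overlap stands in for the matching step, and the database entries, function names, and prompt wording are hypothetical.

```python
def tokenize(text):
    """Lowercase, split on whitespace, and strip simple punctuation."""
    words = (w.strip(".,;:") for w in text.lower().split())
    return {w for w in words if w}


def jaccard(a, b):
    """Word-overlap similarity between two token sets (a stand-in metric)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(description, database, k=2):
    """Return the k database entries most similar to the description."""
    query = tokenize(description)
    ranked = sorted(
        database.items(),
        key=lambda item: jaccard(query, tokenize(item[1])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(description, hits):
    """Combine retrieved entries and the description into one LM prompt."""
    lines = ["Candidate vehicles:"]
    for name, text in hits:
        lines.append(f"- {name}: {text}")
    lines.append(f"Observed attributes: {description}")
    lines.append("Question: which make and model best matches the attributes?")
    return "\n".join(lines)


# Hypothetical textual feature database; a new vehicle is added as a new entry.
DATABASE = {
    "Make A Model X": "compact hatchback round headlights short rear overhang",
    "Make B Model Y": "large pickup truck square grille open cargo bed",
    "Make C Model Z": "mid-size sedan narrow led headlights sloping roofline",
}

# In the pipeline this description would come from a VLM; here it is hard-coded.
description = "pickup truck with a square grille and an open cargo bed"
hits = retrieve(description, DATABASE, k=2)
prompt = build_prompt(description, hits)
```

The sketch shows why the design needs no retraining: supporting a new vehicle only means adding one more text entry to the database.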

[17] StreamingTOM: Streaming Token Compression for Efficient Video Understanding

Xueyi Chen, Keda Tao, Kele Shao, Huan Wang

🧩 TL;DR

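StreamingTOM is a training-free, plug-and-play two-stage framework that uses Causal Temporal Reduction and Online Quantized Memory to address both the pre-LLM and post-LLM efficiency bottlenecks of streaming video language models, enabling efficient streaming video understanding with predictable latency.

📘 Detailed Summary

Motivation: Streaming video vision-language models face two fundamental constraints: causality prevents access to future frames, and accumulation causes tokens to grow without bound, creating efficiency bottlenecks. Existing approaches only regulate the post-LLM kv-cache and leave the costly pre-LLM prefill stage unchanged.

Method: The framework has two components. Causal Temporal Reduction imposes a fixed per-frame budget and selects tokens based on adjacent-frame changes and token saliency, so only a compact subset of visual tokens is processed per frame. Online Quantized Memory stores tokens in 4-bit format, retrieves relevant groups on demand, and dequantizes them, keeping the active kv-cache bounded regardless of stream length.

Result: Experiments show 15.7× kv-cache compression, 1.2× lower peak memory, and 2× faster time to first token (TTFT) compared with the prior state of the art. StreamingTOM maintains state-of-the-art accuracy among training-free methods, averaging 63.8% on offline benchmarks and 55.8%/3.7 on RVS.

Conclusion: The results highlight the practical benefits of the two-stage approach for efficient streaming video understanding with bounded growth. By addressing both pre-LLM and post-LLM bottlenecks, it offers predictable latency for real-time video analysis.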

📄 Abstract

Unlike offline processing, streaming video vision-language models face two fundamental constraints: causality and accumulation. Causality prevents access to future frames that offline methods exploit, while accumulation causes tokens to grow unbounded, creating efficiency bottlenecks. However, existing approaches only regulate post-LLM kv-cache, leaving costly pre-LLM prefill unchanged. We introduce StreamingTOM, a training-free, plug-and-play two-stage framework that addresses both pre-LLM and post-LLM bottlenecks with predictable latency. Causal Temporal Reduction imposes a fixed per-frame budget and selects tokens based on adjacent-frame changes and token saliency, drastically reducing per-frame prefill cost by processing only a compact subset of visual tokens per frame instead of all visual tokens. Online Quantized Memory stores tokens in 4-bit format, retrieves relevant groups on demand, and dequantizes them, keeping the active kv-cache bounded regardless of stream length. Experiments demonstrate our method achieves $15.7\times$ kv-cache compression, $1.2\times$ lower peak memory and $2\times$ faster TTFT compared to prior SOTA. StreamingTOM maintains state-of-the-art accuracy among training-free methods with an average of $63.8\%$ on offline benchmarks and $55.8\%/3.7$ on RVS. These results highlight the practical benefits of our two-stage approach for efficient streaming video understanding with bounded growth.
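
To make the "stores tokens in 4-bit format ... and dequantizes them" step concrete, here is a minimal sketch of 4-bit storage for one group of values. The abstract does not specify the quantization scheme, so uniform min-max quantization with a per-group scale and offset is an assumption, and all function names are hypothetical.

```python
def quantize_4bit(values):
    """Map floats to integer codes in [0, 15] using a per-group scale and offset."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_4bit(codes, scale, lo):
    """Recover approximate floats from 4-bit codes."""
    return [c * scale + lo for c in codes]


def pack(codes):
    """Pack two 4-bit codes into each byte (pads with a zero code if odd)."""
    if len(codes) % 2:
        codes = codes + [0]
    return bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))


def unpack(packed, n):
    """Unpack n 4-bit codes from a bytes object."""
    codes = []
    for byte in packed:
        codes.extend((byte >> 4, byte & 0x0F))
    return codes[:n]


values = [-1.0, -0.5, 0.0, 0.25, 1.0, 2.0]
codes, scale, lo = quantize_4bit(values)
packed = pack(codes)  # 6 values stored in 3 bytes
restored = dequantize_4bit(unpack(packed, len(codes)), scale, lo)
max_error = max(abs(a - b) for a, b in zip(values, restored))
```

With this scheme the reconstruction error of each value is at most half a quantization step, which is the price paid for the smaller memory footprint.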

[18] See the Text: From Tokenization to Visual Reading

Ling Xing, Alex Jinpeng Wang, Rui Yan, Hongyu Qu, Zechao Li, Jinhui Tang

🧩 TL;DR

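This paper proposes SeeTok, which renders text as images and uses pretrained multimodal LLMs to read them, challenging the conventional subword tokenization paradigm. It maintains performance while substantially reducing computation, and it improves cross-lingual generalization and robustness to noise.

📘 Detailed Summary

Motivation: Modern LLMs rely on subword tokenization, which fragments text into pieces from a fixed vocabulary. This works well for high-resource languages but over-segments low-resource languages, producing long, linguistically meaningless sequences and inflating computation. Humans instead read by recognizing words as visual objects, which lets them handle typos, distorted fonts, and different scripts, and this motivates a vision-centric alternative.

Method: SeeTok renders text as images (visual-text) and uses pretrained multimodal LLMs to interpret them, reusing the strong OCR and text-vision alignment abilities learned from large-scale multimodal training. It processes the visual representation of text directly instead of relying on symbolic tokenization.

Result: Across three different language tasks, SeeTok matches or surpasses subword tokenizers while requiring 4.43 times fewer tokens and reducing FLOPs by 70.5%. It also shows additional gains in cross-lingual generalization, robustness to typographic noise, and linguistic hierarchy.

Conclusion: SeeTok signals a shift from symbolic tokenization to human-like visual reading and takes a step toward more natural, cognitively inspired language models. It shows the advantages of a vision-centric approach in computational efficiency, generalization, and robustness.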

📄 Abstract

People see text. Humans read by recognizing words as visual objects, including their shapes, layouts, and patterns, before connecting them to meaning, which enables us to handle typos, distorted fonts, and various scripts effectively. Modern large language models (LLMs), however, rely on subword tokenization, fragmenting text into pieces from a fixed vocabulary. While effective for high-resource languages, this approach over-segments low-resource languages, yielding long, linguistically meaningless sequences and inflating computation. In this work, we challenge this entrenched paradigm and move toward a vision-centric alternative. Our method, SeeTok, renders text as images (visual-text) and leverages pretrained multimodal LLMs to interpret them, reusing strong OCR and text-vision alignment abilities learned from large-scale multimodal training. Across three different language tasks, SeeTok matches or surpasses subword tokenizers while requiring 4.43 times fewer tokens and reducing FLOPs by 70.5%, with additional gains in cross-lingual generalization, robustness to typographic noise, and linguistic hierarchy. SeeTok signals a shift from symbolic tokenization to human-like visual reading, and takes a step toward more natural and cognitively inspired language models.

[19] Proactive Reasoning-with-Retrieval Framework for Medical Multimodal Large Language Models

Lehan Wang, Yi Qin, Honglong Yang, Xiaomeng Li

🧩 TL;DR

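This paper proposes Med-RwR, the first multimodal medical reasoning-with-retrieval framework. It uses a reinforcement learning strategy to incentivize multimodal LLMs to proactively retrieve external medical knowledge during reasoning, improving diagnostic accuracy and generalization.

📘 Detailed Summary

Motivation: Existing medical multimodal LLMs rely solely on internal knowledge during reasoning, which leads to hallucinated reasoning and factual errors on cases beyond their training scope. Existing agentic retrieval-augmented generation methods are confined to unimodal LLMs and neglect crucial visual information during reasoning and retrieval.

Method: Med-RwR uses a two-stage reinforcement learning strategy with tailored rewards that stimulate the model to leverage both visual diagnostic findings and textual clinical information for effective retrieval. A Confidence-Driven Image Re-retrieval (CDIR) method is further proposed for test-time scaling when low prediction confidence is detected.

Result: On several public medical benchmarks, Med-RwR achieves significant improvements over baseline models, and it obtains an 8.8% performance gain on the proposed echocardiography benchmark ECBench, demonstrating the effectiveness of integrating external knowledge to enhance reasoning.

Conclusion: Med-RwR shows that integrating external knowledge into multimodal medical reasoning is effective and generalizes well to domains where training data is scarce, offering more reliable and transparent diagnostic support for medical AI systems.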

📄 Abstract

Incentivizing the reasoning ability of Multimodal Large Language Models (MLLMs) is essential for medical applications to transparently analyze medical scans and provide reliable diagnosis. However, existing medical MLLMs rely solely on internal knowledge during reasoning, leading to hallucinated reasoning and factual inaccuracies when encountering cases beyond their training scope. Although recent Agentic Retrieval-Augmented Generation (RAG) methods elicit the medical model's proactive retrieval ability during reasoning, they are confined to unimodal LLMs, neglecting the crucial visual information during reasoning and retrieval. Consequently, we propose the first Multimodal Medical Reasoning-with-Retrieval framework, Med-RwR, which actively retrieves external knowledge by querying observed symptoms or domain-specific medical concepts during reasoning. Specifically, we design a two-stage reinforcement learning strategy with tailored rewards that stimulate the model to leverage both visual diagnostic findings and textual clinical information for effective retrieval. Building on this foundation, we further propose a Confidence-Driven Image Re-retrieval (CDIR) method for test-time scaling when low prediction confidence is detected. Evaluation on various public medical benchmarks demonstrates Med-RwR's significant improvements over baseline models, proving the effectiveness of enhancing reasoning capabilities with external knowledge integration. Furthermore, Med-RwR demonstrates remarkable generalizability to unfamiliar domains, evidenced by 8.8% performance gain on our proposed EchoCardiography Benchmark (ECBench), despite the scarcity of echocardiography data in the training corpus. Our data, model, and codes will be made publicly available at https://github.com/xmed-lab/Med-RwR.

[20] Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs

Haochen Wang, Yuhao Wang, Tao Zhang, Yikang Zhou, Yanwei Li, Jiacong Wang, Ye Tian, Jiahao Meng, Zilong Huang, Guangcan Mai, Anran Wang, Yunhai Tong, Zhuochen Wang, Xiangtai Li, Zhaoxiang Zhang

🧩 TL;DR

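This paper proposes GAR, which uses an RoI-aligned feature replay technique to address the limitations of multimodal LLMs in fine-grained understanding of complex scenes, enabling region-level visual understanding and modeling of interactions between multiple prompts.

📘 Detailed Summary

Motivation: Multimodal LLMs excel at holistic understanding but struggle with fine-grained analysis and object inter-relationships in complex, dense scenes. Existing region-level methods generally understand given regions in isolation and neglect crucial global context.

Method: GAR adopts an effective RoI-aligned feature replay technique that supports precise perception by leveraging the necessary global context and models interactions between multiple prompts, which enables compositional reasoning about any region.

Result: GAR-1B outperforms DAM-3B by 4.5 points on DLC-Bench and even surpasses InternVL3-78B on GAR-Bench-VQA. The zero-shot GAR-8B outperforms the in-domain VideoRefer-7B on VideoRefer-BenchQ, indicating that its capabilities transfer readily to video.

Conclusion: GAR shifts the paradigm from passive description to active dialogue. The accompanying GAR-Bench evaluates single-region comprehension more accurately and, more importantly, measures interactions and complex reasoning across multiple regions.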

📄 Abstract

While Multimodal Large Language Models (MLLMs) excel at holistic understanding, they struggle in capturing the dense world with complex scenes, requiring fine-grained analysis of intricate details and object inter-relationships. Region-level MLLMs have been a promising step. However, previous attempts are generally optimized to understand given regions in isolation, neglecting crucial global contexts. To address this, we introduce Grasp Any Region (GAR) for comprehensive region-level visual understanding. Empowered by an effective RoI-aligned feature replay technique, GAR supports (1) precise perception by leveraging necessary global contexts, and (2) modeling interactions between multiple prompts. Together, it then naturally achieves (3) advanced compositional reasoning to answer specific free-form questions about any region, shifting the paradigm from passive description to active dialogue. Moreover, we construct GAR-Bench, which not only provides a more accurate evaluation of single-region comprehension, but also, more importantly, measures interactions and complex reasoning across multiple regions. Extensive experiments have demonstrated that GAR-1B not only maintains the state-of-the-art captioning capabilities, e.g., outperforming DAM-3B +4.5 on DLC-Bench, but also excels at modeling relationships between multiple prompts with advanced comprehension capabilities, even surpassing InternVL3-78B on GAR-Bench-VQA. More importantly, our zero-shot GAR-8B even outperforms in-domain VideoRefer-7B on VideoRefer-BenchQ, indicating its strong capabilities can be easily transferred to videos.

[21] Beyond Single Models: Mitigating Multimodal Hallucinations via Adaptive Token Ensemble Decoding

Jinlin Li, Yuran Wang, Yifei Yuan, Xiao Zhou, Yingying Zhang, Xixian Yong, Yefeng Zheng, Xian Wu

🧩 TL;DR

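This paper proposes Adaptive Token Ensemble Decoding (ATED), a training-free, token-level ensemble method that mitigates object hallucination by aggregating predictions from multiple large vision-language models. It significantly outperforms existing techniques on standard hallucination detection benchmarks, reducing hallucination without compromising fluency or relevance.

📘 Detailed Summary

Motivation: Large vision-language models (LVLMs) perform well on tasks such as image captioning and visual question answering but remain prone to object hallucination, that is, describing nonexistent or misidentified objects. Prior work partially mitigates this with auxiliary training objectives or external modules, but challenges remain in scalability, adaptability, and model independence.

Method: ATED is a training-free, token-level ensemble framework that mitigates hallucination by aggregating predictions from multiple LVLMs during inference. It dynamically computes an uncertainty-based weight for each model, reflecting its reliability at each decoding step, and integrates diverse decoding paths to improve contextual grounding and semantic consistency.

Result: Experiments on standard hallucination detection benchmarks show that ATED significantly outperforms state-of-the-art methods, reducing hallucination without compromising fluency or relevance.

Conclusion: The findings highlight the benefits of adaptive ensembling and point to a promising direction for improving LVLM robustness in high-stakes applications. The method is model-agnostic and requires no additional training.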

📄 Abstract

Large Vision-Language Models (LVLMs) have recently achieved impressive results in multimodal tasks such as image captioning and visual question answering. However, they remain prone to object hallucination -- generating descriptions of nonexistent or misidentified objects. Prior work has partially mitigated this via auxiliary training objectives or external modules, but challenges remain in terms of scalability, adaptability, and model independence. To address these limitations, we propose Adaptive Token Ensemble Decoding (ATED), a training-free, token-level ensemble framework that mitigates hallucination by aggregating predictions from multiple LVLMs during inference. ATED dynamically computes uncertainty-based weights for each model, reflecting their reliability at each decoding step. It also integrates diverse decoding paths to improve contextual grounding and semantic consistency. Experiments on standard hallucination detection benchmarks demonstrate that ATED significantly outperforms state-of-the-art methods, reducing hallucination without compromising fluency or relevance. Our findings highlight the benefits of adaptive ensembling and point to a promising direction for improving LVLM robustness in high-stakes applications. The code is available at https://github.com/jinlin2021/ATED.
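
The idea of uncertainty-based weights at a single decoding step can be sketched as follows. This is not the paper's weighting rule, which the abstract does not give: here each model's weight is assumed to be proportional to exp(-entropy) of its next-token distribution, so more confident models count more. The function names and the `temperature` parameter are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)


def ensemble_step(distributions, temperature=1.0):
    """Mix per-model next-token distributions with entropy-based weights.

    Assumption: weight_i is proportional to exp(-H(p_i) / temperature).
    """
    raw = [math.exp(-entropy(p) / temperature) for p in distributions]
    total = sum(raw)
    weights = [r / total for r in raw]
    vocab_size = len(distributions[0])
    mixed = [
        sum(w * p[i] for w, p in zip(weights, distributions))
        for i in range(vocab_size)
    ]
    return weights, mixed


# Two hypothetical models over a 3-token vocabulary at one decoding step.
confident_model = [0.90, 0.05, 0.05]
uncertain_model = [0.30, 0.40, 0.30]

weights, mixed = ensemble_step([confident_model, uncertain_model])
next_token = max(range(len(mixed)), key=lambda i: mixed[i])
```

Because the weights are recomputed from the current distributions, a model that is reliable at one step and uncertain at the next contributes differently across the sequence.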

[22] Enhancing Few-Shot Classification of Benchmark and Disaster Imagery with ATTBHFA-Net

Gao Yu Lee, Tanmoy Dam, Md Meftahul Ferdaus, Daniel Puiu Poenar, Vu Duong

🧩 TL;DR

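This paper proposes ATTBHFA-Net, which combines the Bhattacharyya coefficient and the Hellinger distance to compare and aggregate feature probability distributions, addressing data scarcity and high intra-class variation in disaster image classification and showing superior performance on few-shot learning tasks.

📘 Detailed Summary

Motivation: Visual recognition in disaster contexts suffers from limited and diverse data. Current few-shot learning (FSL) research mainly relies on generic benchmark datasets that lack remote-sensing disaster imagery, and disaster images exhibit high intra-class variation and inter-class similarity, which limits conventional metric-based FSL methods.

Method: The Attention-based Bhattacharyya-Hellinger Feature Aggregation Network (ATTBHFA-Net) linearly combines the Bhattacharyya coefficient and the Hellinger distance to compare and aggregate feature probability distributions into robust prototypes. The Bhattacharyya coefficient serves as a contrastive margin that enhances inter-class separability, while the Hellinger distance regularizes same-class alignment. A Bhattacharyya-Hellinger distance-based contrastive loss is proposed as a distributional counterpart to cosine similarity loss.

Result: Experiments on four FSL benchmarks and two disaster image datasets show that ATTBHFA-Net is more effective and generalizes better than existing approaches, significantly improving FSL performance.

Conclusion: The study shows that comparing and aggregating features at the level of probability distributions is effective, offering a new approach to visual recognition under data scarcity and high variability. Its distributional contrastive learning framework could extend to other domains that need robust feature representations.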

📄 Abstract

The increasing frequency of natural and human-induced disasters necessitates advanced visual recognition techniques capable of analyzing critical photographic data. With progress in artificial intelligence and resilient computational systems, rapid and accurate disaster classification has become crucial for efficient rescue operations. However, visual recognition in disaster contexts faces significant challenges due to limited and diverse data from the difficulties in collecting and curating comprehensive, high-quality disaster imagery. Few-Shot Learning (FSL) provides a promising approach to data scarcity, yet current FSL research mainly relies on generic benchmark datasets lacking remote-sensing disaster imagery, limiting its practical effectiveness. Moreover, disaster images exhibit high intra-class variation and inter-class similarity, hindering the performance of conventional metric-based FSL methods. To address these issues, this paper introduces the Attention-based Bhattacharyya-Hellinger Feature Aggregation Network (ATTBHFA-Net), which linearly combines the Bhattacharyya coefficient and Hellinger distances to compare and aggregate feature probability distributions for robust prototype formation. The Bhattacharyya coefficient serves as a contrastive margin that enhances inter-class separability, while the Hellinger distance regularizes same-class alignment. This framework parallels contrastive learning but operates over probability distributions rather than embedded feature points. Furthermore, a Bhattacharyya-Hellinger distance-based contrastive loss is proposed as a distributional counterpart to cosine similarity loss, used jointly with categorical cross-entropy to significantly improve FSL performance. Experiments on four FSL benchmarks and two disaster image datasets demonstrate the superior effectiveness and generalization of ATTBHFA-Net compared to existing approaches.
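
For readers unfamiliar with the two quantities named in this abstract, the sketch below computes them for discrete probability distributions using their standard definitions. It does not reproduce the network or the way the paper linearly combines the two terms, which the abstract does not specify; the example distributions are made up.

```python
import math


def bhattacharyya_coefficient(p, q):
    """BC(p, q) = sum_i sqrt(p_i * q_i); 1 for identical, 0 for disjoint support."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


def hellinger_distance(p, q):
    """H(p, q) = sqrt(0.5 * sum_i (sqrt(p_i) - sqrt(q_i))^2), bounded in [0, 1]."""
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total / 2)


# Two made-up feature distributions over three bins.
p = [0.5, 0.5, 0.0]
q = [0.1, 0.4, 0.5]

bc = bhattacharyya_coefficient(p, q)
h = hellinger_distance(p, q)
```

For normalized distributions the two are linked by H² = 1 − BC, so a high coefficient (strong overlap) always corresponds to a small Hellinger distance.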

[23] GPTFace: Generative Pre-training of Facial-Linguistic Transformer by Span Masking and Weakly Correlated Text-image Data

Yudong Li, Hao Li, Xianxu Hou, Linlin Shen

🧩 TL;DR

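This paper proposes a generative pre-training model for facial knowledge learning built on large-scale web data. Trained on self-supervised tasks, it supports controllable image/text generation and achieves performance comparable to state-of-the-art models on a variety of facial downstream tasks.

📘 Detailed Summary

Motivation: Research on large-scale pre-training models for facial knowledge learning is still limited. Current approaches mainly rely on manually annotated face datasets, which are costly to label, and the trained models scale poorly beyond the training data.

Method: The model is pre-trained on texts and images containing human faces crawled from the internet, using self-supervised tasks including masked image/language modeling (MILM) and image-text matching (ITM). During generation, the image-text matching loss is used to pull the generation distribution toward the control signal, enabling controllable image/text generation.

Result: Experiments show that the model achieves performance comparable to state-of-the-art pre-training models on various facial downstream tasks, including attribute classification and expression recognition. It is also applicable to a wide range of face editing tasks, such as face attribute editing, expression manipulation, mask removal, and photo inpainting.

Conclusion: The study demonstrates the effectiveness of self-supervised pre-training on large-scale web data, provides a scalable solution for facial knowledge learning, and shows broad potential for face editing tasks.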

📄 Abstract

Compared to the prosperity of pre-training models in natural image understanding, the research on large-scale pre-training models for facial knowledge learning is still limited. Current approaches mainly rely on manually assembled and annotated face datasets for training, but labeling such datasets is labor-intensive and the trained models have limited scalability beyond the training data. To address these limitations, we present a generative pre-training model for facial knowledge learning that leverages large-scale web-built data for training. We use texts and images containing human faces crawled from the internet and conduct pre-training on self-supervised tasks, including masked image/language modeling (MILM) and image-text matching (ITM). During the generation stage, we further utilize the image-text matching loss to pull the generation distribution towards the control signal for controllable image/text generation. Experimental results demonstrate that our model achieves comparable performance to state-of-the-art pre-training models for various facial downstream tasks, such as attribution classification and expression recognition. Furthermore, our approach is also applicable to a wide range of face editing tasks, including face attribute editing, expression manipulation, mask removal, and photo inpainting.

[24] AV-Master: Dual-Path Comprehensive Perception Makes Better Audio-Visual Question Answering

Jiayu Zhang, Qilang Ye, Shuo Ye, Xun Lin, Zihan Song, Zitong Yu

🧩 TL;DR

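This paper proposes AV-Master, a framework that dynamically models the temporal and modality dimensions to better extract key information from complex audio-visual scenes, significantly outperforming existing methods on four large-scale benchmarks.

📘 Detailed Summary

Motivation: Existing audio-visual question answering (AVQA) methods lack sufficient flexibility and dynamic adaptability in temporal sampling and modality preference awareness, making it hard to focus on key information based on the question and limiting reasoning in complex scenarios.

Method: A dynamic adaptive focus sampling mechanism progressively focuses on the audio-visual segments most relevant to the question, and a preference-aware strategy models each modality's contribution independently. A dual-path contrastive loss reinforces consistency and complementarity across the temporal and modality dimensions.

Result: Experiments on four large-scale benchmarks show that AV-Master significantly outperforms existing methods, especially on complex reasoning tasks.

Conclusion: The study shows that dynamically modeling the temporal and modality dimensions is effective for complex audio-visual scenes, and it underlines the importance of selectively activating critical features and learning cross-modal collaborative representations.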

📄 Abstract

Audio-Visual Question Answering (AVQA) requires models to effectively utilize both visual and auditory modalities to answer complex and diverse questions about audio-visual scenes. However, existing methods lack sufficient flexibility and dynamic adaptability in temporal sampling and modality preference awareness, making it difficult to focus on key information based on the question. This limits their reasoning capability in complex scenarios. To address these challenges, we propose a novel framework named AV-Master. It enhances the model's ability to extract key information from complex audio-visual scenes with substantial redundant content by dynamically modeling both temporal and modality dimensions. In the temporal dimension, we introduce a dynamic adaptive focus sampling mechanism that progressively focuses on audio-visual segments most relevant to the question, effectively mitigating redundancy and segment fragmentation in traditional sampling methods. In the modality dimension, we propose a preference-aware strategy that models each modality's contribution independently, enabling selective activation of critical features. Furthermore, we introduce a dual-path contrastive loss to reinforce consistency and complementarity across temporal and modality dimensions, guiding the model to learn question-specific cross-modal collaborative representations. Experiments on four large-scale benchmarks show that AV-Master significantly outperforms existing methods, especially in complex reasoning tasks.

[25] ImageGem: In-the-wild Generative Image Interaction Dataset for Generative Model Personalization

Yuanhe Guo, Linxi Xie, Zhuoran Chen, Kangrui Yu, Ryan Po, Guandao Yang, Gordon Wetztein, Hongyi Wen

🧩 TL;DR

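This paper introduces ImageGem, a dataset for studying generative models that understand fine-grained individual preferences, and demonstrates its value for training preference alignment models, personalized image retrieval, and generative model recommendation.

📘 Detailed Summary

Motivation: A key challenge hindering generative models that understand individual preferences is the lack of in-the-wild, fine-grained user preference annotations. Existing datasets cannot support in-depth study of personalized generative models.

Method: ImageGem contains real-world interaction data from 57K users, including 242K customized LoRAs, 3M text prompts, and 5M generated images. The paper also proposes an end-to-end framework for editing customized diffusion models in a latent weight space to align with individual user preferences.

Result: User preference annotations from the dataset enable training better preference alignment models. Retrieval models and a vision-language model are evaluated on personalized image retrieval and generative model recommendation, and the proposed editing framework effectively personalizes generative models.

Conclusion: ImageGem enables, for the first time, a new paradigm for generative model personalization and provides a key data foundation for studying generative models that understand individual preferences.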

📄 Abstract

We introduce ImageGem, a dataset for studying generative models that understand fine-grained individual preferences. We posit that a key challenge hindering the development of such a generative model is the lack of in-the-wild and fine-grained user preference annotations. Our dataset features real-world interaction data from 57K users, who collectively have built 242K customized LoRAs, written 3M text prompts, and created 5M generated images. With user preference annotations from our dataset, we were able to train better preference alignment models. In addition, leveraging individual user preference, we investigated the performance of retrieval models and a vision-language model on personalized image retrieval and generative model recommendation. Finally, we propose an end-to-end framework for editing customized diffusion models in a latent weight space to align with individual user preferences. Our results demonstrate that the ImageGem dataset enables, for the first time, a new paradigm for generative model personalization.

[26] Ranking-based Preference Optimization for Diffusion Models from Implicit User Feedback

Yi-Lun Wu, Bo-Kai Ruan, Chiang Tseng, Hong-Han Shuai

🧩 TL;DR

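This paper proposes Diffusion-DRO, a preference learning framework grounded in inverse reinforcement learning. By casting preference learning as a ranking problem and integrating offline and online data, it addresses the limitations of existing DPO methods in image probability estimation and dataset diversity.

📘 Detailed Summary

Motivation: Existing direct preference optimization (DPO) methods improve training stability by avoiding the REINFORCE algorithm, but they still face two main challenges: inaccurate estimation of image probabilities due to the non-linear nature of the sigmoid function, and the limited diversity of offline datasets.

Method: Diffusion-DRO is grounded in inverse reinforcement learning and casts preference learning as a ranking problem, which removes the dependency on a reward model and simplifies the training objective into a denoising formulation. It integrates offline expert demonstrations with online policy-generated negative samples, capturing human preferences while overcoming the limitations of offline data.

Result: Comprehensive experiments show that Diffusion-DRO improves generation quality across a range of challenging and unseen prompts, outperforming state-of-the-art baselines in both quantitative metrics and user studies.

Conclusion: The study shows that recasting preference learning as a ranking problem and integrating offline and online data is effective, providing a more stable and accurate training framework for aligning diffusion models with human preferences.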

📄 Abstract

Direct preference optimization (DPO) methods have shown strong potential in aligning text-to-image diffusion models with human preferences by training on paired comparisons. These methods improve training stability by avoiding the REINFORCE algorithm but still struggle with challenges such as accurately estimating image probabilities due to the non-linear nature of the sigmoid function and the limited diversity of offline datasets. In this paper, we introduce Diffusion Denoising Ranking Optimization (Diffusion-DRO), a new preference learning framework grounded in inverse reinforcement learning. Diffusion-DRO removes the dependency on a reward model by casting preference learning as a ranking problem, thereby simplifying the training objective into a denoising formulation and overcoming the non-linear estimation issues found in prior methods. Moreover, Diffusion-DRO uniquely integrates offline expert demonstrations with online policy-generated negative samples, enabling it to effectively capture human preferences while addressing the limitations of offline data. Comprehensive experiments show that Diffusion-DRO delivers improved generation quality across a range of challenging and unseen prompts, outperforming state-of-the-art baselines in both quantitative metrics and user studies. Our source code and pre-trained models are available at https://github.com/basiclab/DiffusionDRO.

Yuqing Luo, Yixiao Li, Jiang Liu, Jun Fu, Hadi Amirpour, Guanghui Yue, Baoquan Zhao, Padraig Corcoran, Hantao Liu, Wei Zhou

🧩 TL;DR

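This paper proposes CM-SSA, a novel image complexity assessment method that uses cross-modal scene semantic alignment to improve complexity prediction, making predictions more consistent with subjective human perception. It significantly outperforms state-of-the-art methods on several datasets.

📘 Detailed Summary

Motivation: Existing image complexity assessment (ICA) methods predominantly rely on hand-crafted or shallow convolutional neural network features from a single visual modality, which cannot fully capture the perceptual representations closely related to image complexity. Cross-modal scene semantic information has proved important in computer vision tasks involving perceptual understanding, but it remains unexplored for ICA.

Method: CM-SSA consists of a complexity regression branch and a scene semantic alignment branch. The complexity regression branch estimates image complexity levels under the guidance of the scene semantic alignment branch, which aligns images with corresponding text prompts that convey rich scene semantic information through pair-wise learning.

Result: Extensive experiments on several ICA datasets show that CM-SSA significantly outperforms state-of-the-art approaches, confirming the value of cross-modal scene semantic information for ICA.

Conclusion: The study shows that cross-modal scene semantic alignment improves image complexity assessment and makes predictions more consistent with subjective human perception, offering a new research direction and technical framework for ICA.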

📄 Abstract

Image complexity assessment (ICA) is a challenging task in perceptual evaluation due to the subjective nature of human perception and the inherent semantic diversity in real-world images. Existing ICA methods predominantly rely on hand-crafted or shallow convolutional neural network-based features of a single visual modality, which are insufficient to fully capture the perceived representations closely related to image complexity. Recently, cross-modal scene semantic information has been shown to play a crucial role in various computer vision tasks, particularly those involving perceptual understanding. However, the exploration of cross-modal scene semantic information in the context of ICA remains unaddressed. Therefore, in this paper, we propose a novel ICA method called Cross-Modal Scene Semantic Alignment (CM-SSA), which leverages scene semantic alignment from a cross-modal perspective to enhance ICA performance, enabling complexity predictions to be more consistent with subjective human perception. Specifically, the proposed CM-SSA consists of a complexity regression branch and a scene semantic alignment branch. The complexity regression branch estimates image complexity levels under the guidance of the scene semantic alignment branch, while the scene semantic alignment branch is used to align images with corresponding text prompts that convey rich scene semantic information by pair-wise learning. Extensive experiments on several ICA datasets demonstrate that the proposed CM-SSA significantly outperforms state-of-the-art approaches. Codes are available at https://github.com/XQ2K/First-Cross-Model-ICA.

[28] Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views

Zhangquan Chen, Manyuan Zhang, Xinlei Yu, Xufang Luo, Mingze Sun, Zihao Pan, Yan Feng, Peng Pei, Xunliang Cai, Ruqi Huang

🧩 TL;DR

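This paper proposes 3DThinker, the first framework to enable 3D mental modeling during reasoning without any 3D prior input. A two-stage training procedure lets it exploit the geometric information in images to reason about 3D spatial relationships.

📘 Detailed Summary

Motivation: Vision-language models have made remarkable progress on multimodal tasks, but understanding 3D spatial relationships from limited views remains a significant challenge. Previous methods rely mainly on pure text or 2D visual cues, whose limited representational capacity hinders tasks that require 3D spatial imagination.

Method: Training has two stages. First, supervised training aligns the 3D latent generated by the VLM during reasoning with that of a 3D foundation model. Then the entire reasoning trajectory is optimized based solely on outcome signals, refining the underlying 3D mental modeling.

Result: Extensive experiments across multiple benchmarks show that 3DThinker consistently outperforms strong baselines and offers a new perspective on unifying 3D representations into multimodal reasoning.

Conclusion: The study shows that 3D mental modeling is feasible without explicitly labeled 3D data, opening a new direction for integrating 3D spatial understanding into multimodal reasoning.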

📄 Abstract

Though recent advances in vision-language models (VLMs) have achieved remarkable progress across a wide range of multimodal tasks, understanding 3D spatial relationships from limited views remains a significant challenge. Previous reasoning methods typically rely on pure text (e.g., topological cognitive maps) or on 2D visual cues. However, their limited representational capacity hinders performance in specific tasks that require 3D spatial imagination. To address this limitation, we propose 3DThinker, a framework that can effectively exploit the rich geometric information embedded within images while reasoning, like humans do. Our framework is the first to enable 3D mentaling during reasoning without any 3D prior input, and it does not rely on explicitly labeled 3D data for training. Specifically, our training consists of two stages. First, we perform supervised training to align the 3D latent generated by VLM while reasoning with that of a 3D foundation model (e.g., VGGT). Then, we optimize the entire reasoning trajectory solely based on outcome signals, thereby refining the underlying 3D mentaling. Extensive experiments across multiple benchmarks show that 3DThinker consistently outperforms strong baselines and offers a new perspective toward unifying 3D representations into multimodal reasoning. Our code will be available at https://github.com/zhangquanchen/3DThinker.

[29] ε-Seg: Sparsely Supervised Semantic Segmentation of Microscopy Data

Sheida Rahnamai Kordasiabi, Damian Dalle Nogare, Florian Jug

🧩 TL;DR

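This paper proposes ε-Seg, a sparsely supervised semantic segmentation method based on hierarchical variational autoencoders that achieves competitive segmentation on complex biological electron microscopy images when labels cover only 0.05% of the image data or less.

📘 Detailed Summary

Motivation: Semantic segmentation of electron microscopy images of biological samples remains a challenge in the life sciences. The data captures biological structures in such detail that even human observers can find it overwhelming, and the task is harder still when training labels are extremely sparse.

Method: ε-Seg uses a hierarchical variational autoencoder (HVAE) with center-region masking, sparse-label contrastive learning, a Gaussian mixture model (GMM) prior, and clustering-free label prediction. Center-region masking and the inpainting loss encourage the model to learn robust embeddings that distinguish the desired classes.

Result: Experiments on two dense EM datasets of biological tissues, with additional results on fluorescence microscopy data, show that ε-Seg achieves competitive sparsely supervised segmentation on complex biological images even when only limited training labels are available.

Conclusion: The method shows that semantic segmentation of complex biological images is feasible under extremely sparse annotation, offering an effective solution for large-scale image analysis in the life sciences, particularly where annotation is costly.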

📄 Abstract

Semantic segmentation of electron microscopy (EM) images of biological samples remains a challenge in the life sciences. EM data captures details of biological structures, sometimes with such complexity that even human observers can find it overwhelming. We introduce ε-Seg, a method based on hierarchical variational autoencoders (HVAEs), employing center-region masking, sparse label contrastive learning (CL), a Gaussian mixture model (GMM) prior, and clustering-free label prediction. Center-region masking and the inpainting loss encourage the model to learn robust and representative embeddings to distinguish the desired classes, even if training labels are sparse (0.05% of the total image data or less). For optimal performance, we employ CL and a GMM prior to shape the latent space of the HVAE such that encoded input patches tend to cluster with respect to the semantic classes we wish to distinguish. Finally, instead of clustering latent embeddings for semantic segmentation, we propose an MLP semantic segmentation head to directly predict class labels from latent embeddings. We show empirical results of ε-Seg and baseline methods on 2 dense EM datasets of biological tissues and demonstrate the applicability of our method also on fluorescence microscopy data. Our results show that ε-Seg is capable of achieving competitive sparsely-supervised segmentation results on complex biological image data, even if only limited amounts of training labels are available.

[30] DP$^2$O-SR: Direct Perceptual Preference Optimization for Real-World Image Super-Resolution

Rongyuan Wu, Lingchen Sun, Zhengqiang Zhang, Shihao Wang, Tianhe Wu, Qiaosi Yi, Shuai Li, Lei Zhang

🧩 TL;DR

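This paper proposes DP²O-SR, a direct perceptual preference optimization framework that exploits the stochasticity of T2I diffusion models to generate diverse outputs and builds a hybrid reward from full-reference and no-reference image quality assessment, improving the perceptual quality of real-world image super-resolution without human annotation.

📘 Detailed Summary

Motivation: Real-world image super-resolution (Real-ISR) methods built on pre-trained text-to-image (T2I) diffusion models can synthesize rich details, but the stochasticity of T2I models means different noise inputs yield outputs of varying perceptual quality. This randomness is a limitation, yet it also widens the range of perceptual quality, which can be exploited to improve performance.

Method: DP²O-SR builds a hybrid reward signal by combining full-reference and no-reference image quality assessment (IQA) models trained on large-scale human preference datasets, encouraging both structural fidelity and natural appearance. It goes beyond standard best-vs-worst selection by constructing multiple preference pairs from outputs of the same model, and it introduces hierarchical preference optimization, which adaptively weights training pairs based on intra-group reward gaps and inter-group diversity.

Result: Extensive experiments on both diffusion- and flow-based T2I backbones show that DP²O-SR significantly improves perceptual quality and generalizes well to real-world benchmarks. The analysis finds that the optimal selection ratio depends on model capacity: smaller models benefit from broader coverage, while larger models respond better to stronger contrast in supervision.

Conclusion: The study reveals the relationship between model capacity and preference selection strategy. The proposed hierarchical optimization enables more efficient and stable learning and offers a way to align generative models without human annotation.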

📄 Abstract

Benefiting from pre-trained text-to-image (T2I) diffusion models, real-world image super-resolution (Real-ISR) methods can synthesize rich and realistic details. However, due to the inherent stochasticity of T2I models, different noise inputs often lead to outputs with varying perceptual quality. Although this randomness is sometimes seen as a limitation, it also introduces a wider perceptual quality range, which can be exploited to improve Real-ISR performance. To this end, we introduce Direct Perceptual Preference Optimization for Real-ISR (DP$^2$O-SR), a framework that aligns generative models with perceptual preferences without requiring costly human annotations. We construct a hybrid reward signal by combining full-reference and no-reference image quality assessment (IQA) models trained on large-scale human preference datasets. This reward encourages both structural fidelity and natural appearance. To better utilize perceptual diversity, we move beyond the standard best-vs-worst selection and construct multiple preference pairs from outputs of the same model. Our analysis reveals that the optimal selection ratio depends on model capacity: smaller models benefit from broader coverage, while larger models respond better to stronger contrast in supervision. Furthermore, we propose hierarchical preference optimization, which adaptively weights training pairs based on intra-group reward gaps and inter-group diversity, enabling more efficient and stable learning. Extensive experiments across both diffusion- and flow-based T2I backbones demonstrate that DP$^2$O-SR significantly improves perceptual quality and generalizes well to real-world benchmarks.
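
The move from best-vs-worst selection to multiple preference pairs can be illustrated with a small sketch. The pairing rule here is an assumption, since the abstract does not define it: the top fraction of outputs by reward is treated as preferred, the bottom fraction as rejected, and every cross pair is kept. The reward values and the `ratio` parameter name are hypothetical.

```python
def build_preference_pairs(rewards, ratio):
    """Return (preferred_index, rejected_index) pairs from per-output rewards.

    Assumption: the top `ratio` fraction of outputs is paired against the
    bottom `ratio` fraction. A small ratio approaches best-vs-worst selection
    (stronger contrast); a larger ratio gives broader coverage.
    """
    n = len(rewards)
    if n < 2:
        return []
    k = min(max(1, int(n * ratio)), n // 2)
    order = sorted(range(n), key=lambda i: rewards[i], reverse=True)
    top, bottom = order[:k], order[-k:]
    return [(winner, loser) for winner in top for loser in bottom]


# Hypothetical hybrid rewards for six outputs sampled from different noise inputs.
rewards = [0.2, 0.9, 0.5, 0.1, 0.7, 0.4]

narrow_pairs = build_preference_pairs(rewards, ratio=0.2)  # best vs worst only
broad_pairs = build_preference_pairs(rewards, ratio=0.5)   # 3 x 3 cross pairs
```

Under this assumption, the abstract's observation would read as: smaller models do better with something like `broad_pairs`, larger models with something closer to `narrow_pairs`.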

[31] CovMatch: Cross-Covariance Guided Multimodal Dataset Distillation with Trainable Text Encoder

Yongmin Lee, Hye Won Chung

🧩 TL;DR

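This paper proposes CovMatch, a scalable multimodal dataset distillation framework that jointly optimizes the image and text encoders through cross-covariance alignment and feature-distribution regularization, substantially improving cross-modal alignment in multimodal contrastive learning.

📘 Detailed Summary

Motivation: Extending dataset distillation to multimodal contrastive learning raises two key challenges: learning cross-modal alignment and managing the high computational cost of large encoders. Prior methods address scalability by freezing the text encoder and updating only the image encoder and the text projection layer, but this severely limits semantic alignment and becomes a bottleneck for performance scaling.

Method: CovMatch aligns the cross-covariance of real and synthetic features to optimize cross-modal alignment, while regularizing the feature distribution within each modality. Unlike prior methods, it supports joint optimization of both encoders, which yields stronger cross-modal alignment.

Result: On Flickr30K and COCO, CovMatch outperforms state-of-the-art multimodal distillation methods, achieving up to 6.8% absolute gains in retrieval accuracy with only 500 synthetic pairs.

Conclusion: Joint encoder optimization through cross-covariance alignment and feature-distribution regularization is presented as a key advance for multimodal dataset distillation. It offers a more effective route to efficient training of large-scale vision-language models and shows the potential for strong multimodal learning under limited compute.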

📄 Abstract

Multimodal dataset distillation aims to synthesize a small set of image-text pairs that enables efficient training of large-scale vision-language models. While dataset distillation has shown promise in unimodal tasks, extending it to multimodal contrastive learning presents key challenges: learning cross-modal alignment and managing the high computational cost of large encoders. Prior approaches address scalability by freezing the text encoder and updating only the image encoder and text projection layer. However, we find this severely limits semantic alignment and becomes a bottleneck for performance scaling. We propose CovMatch, a scalable dataset distillation framework that aligns the cross-covariance of real and synthetic features while regularizing feature distributions within each modality. Unlike prior approaches, CovMatch enables joint optimization of both encoders, leading to stronger cross-modal alignment and improved performance. Evaluated on Flickr30K and COCO, CovMatch outperforms state-of-the-art multimodal distillation methods and achieves up to 6.8% absolute gains in retrieval accuracy using only 500 synthetic pairs.
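
To make the cross-covariance idea concrete, here is a minimal pure-Python sketch. It assumes image and text features are given as lists of equal-length rows, and it uses a squared Frobenius distance between the real and synthetic cross-covariance matrices. That distance, the function names, and the omission of the per-modality regularizer are assumptions for illustration, not the paper's exact objective.

```python
def cross_covariance(img_feats, txt_feats):
    """Cross-covariance matrix between paired image and text features."""
    n = len(img_feats)
    d_img, d_txt = len(img_feats[0]), len(txt_feats[0])
    mu_i = [sum(row[a] for row in img_feats) / n for a in range(d_img)]
    mu_t = [sum(row[b] for row in txt_feats) / n for b in range(d_txt)]
    return [
        [
            sum((img_feats[k][a] - mu_i[a]) * (txt_feats[k][b] - mu_t[b])
                for k in range(n)) / n
            for b in range(d_txt)
        ]
        for a in range(d_img)
    ]

def cov_match_loss(real_img, real_txt, syn_img, syn_txt):
    """Squared Frobenius distance between real and synthetic cross-covariances."""
    c_real = cross_covariance(real_img, real_txt)
    c_syn = cross_covariance(syn_img, syn_txt)
    return sum(
        (r - s) ** 2
        for row_r, row_s in zip(c_real, c_syn)
        for r, s in zip(row_r, row_s)
    )
```

Because the statistic depends only on feature moments, the synthetic set can be much smaller than the real one; in a real pipeline the loss would be differentiated with respect to the synthetic pairs.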

[32] Beyond the Pipeline: Analyzing Key Factors in End-to-End Deep Learning for Historical Writer Identification

Hanif Rasyidi, Moshiur Farazi

🧩 TL;DR

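This study systematically evaluates the factors that affect end-to-end deep learning performance in historical writer identification. Most configurations generalize poorly in zero-shot scenarios, but one simplified design achieves results comparable to the top-performing system.

📘 Detailed Summary

Motivation: Historical writer identification is challenged by diverse handwriting styles, document degradation, and the limited number of labelled samples per writer. Traditional methods rely on handcrafted feature extraction and perform well on small, carefully curated datasets, whereas end-to-end methods generalize poorly in realistic document-level settings, especially under zero-shot conditions. The study aims to identify the key factors that affect end-to-end performance.

Method: The study explores combinations of pre-processing methods, backbone architectures, and post-processing strategies, including text segmentation, patch sampling, and feature aggregation, and systematically evaluates how these components affect model performance.

Result: Most configurations perform poorly because of weak capture of low-level visual features, inconsistent patch representations, and high sensitivity to content noise. One end-to-end setup nevertheless achieves results comparable to the top-performing system despite a simpler design.

Conclusion: The study identifies key challenges in building robust end-to-end systems and offers insight into design choices that improve performance in historical document writer identification, highlighting the importance of low-level feature representation and noise robustness in zero-shot scenarios.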

📄 Abstract

This paper investigates various factors that influence the performance of end-to-end deep learning approaches for historical writer identification (HWI), a task that remains challenging due to the diversity of handwriting styles, document degradation, and the limited number of labelled samples per writer. These conditions often make accurate recognition difficult, even for human experts. Traditional HWI methods typically rely on handcrafted image processing and clustering techniques, which tend to perform well on small and carefully curated datasets. In contrast, end-to-end pipelines aim to automate the process by learning features directly from document images. However, our experiments show that many of these models struggle to generalise in more realistic, document-level settings, especially under zero-shot scenarios where writers in the test set are not present in the training data. We explore different combinations of pre-processing methods, backbone architectures, and post-processing strategies, including text segmentation, patch sampling, and feature aggregation. The results suggest that most configurations perform poorly due to weak capture of low-level visual features, inconsistent patch representations, and high sensitivity to content noise. Still, we identify one end-to-end setup that achieves results comparable to the top-performing system, despite using a simpler design. These findings point to key challenges in building robust end-to-end systems and offer insight into design choices that improve performance in historical document writer identification.

[33] UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation

Yibin Wang, Zhimin Li, Yuhang Zang, Jiazi Bu, Yujie Zhou, Yi Xin, Junjun He, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang

🧩 TL;DR

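This paper proposes UniGenBench++, a unified semantic evaluation benchmark for text-to-image generation. Its hierarchical design covers diverse real-world scenarios with multilingual support, and it uses a multimodal large language model to build a reliable evaluation pipeline.

📘 Detailed Summary

Motivation: Existing text-to-image benchmarks have three main limitations: they lack diverse prompt scenarios and multilingual support, they offer only coarse evaluation without fine-grained sub-dimension analysis, and they cover a narrow range of evaluation dimensions, which falls short of real-world needs.

Method: The benchmark comprises 600 hierarchically organized prompts spanning 5 main themes and 20 subthemes, and it evaluates 10 primary dimensions and 27 sub-criteria. A reliable evaluation pipeline is built on the world knowledge and fine-grained image understanding of the multimodal large language model Gemini-2.5-Pro. Each prompt is provided in English and Chinese, in short and long forms, to test model robustness.

Result: Comprehensive benchmarking of open- and closed-source text-to-image models systematically reveals their strengths and weaknesses across evaluation dimensions and provides a fine-grained quantitative analysis of model performance.

Conclusion: UniGenBench++ fills gaps in existing benchmarks with a more comprehensive, fine-grained framework for evaluating semantic consistency in text-to-image generation. The authors also train an offline evaluation model to facilitate community use, providing guidance for model development and optimization.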

📄 Abstract

Recent progress in text-to-image (T2I) generation underscores the importance of reliable benchmarks in evaluating how accurately generated images reflect the semantics of their textual prompt. However, (1) existing benchmarks lack the diversity of prompt scenarios and multilingual support, both essential for real-world applicability; (2) they offer only coarse evaluations across primary dimensions, covering a narrow range of sub-dimensions, and fall short in fine-grained sub-dimension assessment. To address these limitations, we introduce UniGenBench++, a unified semantic assessment benchmark for T2I generation. Specifically, it comprises 600 prompts organized hierarchically to ensure both coverage and efficiency: (1) it spans diverse real-world scenarios, i.e., 5 main prompt themes and 20 subthemes; (2) it comprehensively probes T2I models' semantic consistency over 10 primary and 27 sub evaluation criteria, with each prompt assessing multiple testpoints. To rigorously assess model robustness to variations in language and prompt length, we provide both English and Chinese versions of each prompt in short and long forms. Leveraging the general world knowledge and fine-grained image understanding capabilities of a closed-source Multi-modal Large Language Model (MLLM), i.e., Gemini-2.5-Pro, an effective pipeline is developed for reliable benchmark construction and streamlined model assessment. Moreover, to further facilitate community use, we train a robust evaluation model that enables offline assessment of T2I model outputs. Through comprehensive benchmarking of both open- and closed-source T2I models, we systematically reveal their strengths and weaknesses across various aspects.

Yiqi Lin, Alex Jinpeng Wang, Linjie Li, Zhengyuan Yang, Mike Zheng Shou

🧩 TL;DR

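This paper proposes VC2L (Vision-Centric Contrastive Learning), a unified multimodal learning framework that renders text, images, and their combinations as images and performs contrastive learning in pixel space, with no need for OCR or modality fusion strategies. It achieves competitive or superior performance compared to CLIP-style models on complex web document understanding tasks.

📘 Detailed Summary

Motivation: Contrastive vision-language models such as CLIP are limited when handling complex real-world web documents, particularly when text and images are interleaved, loosely aligned, or embedded in visual form.

Method: VC2L uses a single vision transformer to model text, images, and their combinations in pixel space. Rendering every input as an image removes the dependence on OCR, text tokenization, and modality fusion strategies. A snippet-level contrastive objective aligns consecutive multimodal segments, exploiting the inherent coherence of documents without explicitly paired image-text data.

Result: VC2L achieves competitive or superior performance compared to CLIP-style models on the three proposed retrieval benchmarks (AnyCIR, SeqCIR, and CSR, which target cross-modal retrieval, fine-grained sequential understanding, and generalization to unseen data, respectively) and on established datasets such as M-BEIR and MTEB.

Conclusion: The study highlights the potential of multimodal web data as a training resource for contrastive learning and illustrates the scalability of a unified, vision-centric approach to multimodal representation learning, offering a new route for handling complex real-world multimodal scenarios.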

📄 Abstract

Contrastive vision-language models such as CLIP have demonstrated strong performance across a wide range of multimodal tasks by learning from aligned image-text pairs. However, their ability to handle complex, real-world web documents remains limited, particularly in scenarios where text and images are interleaved, loosely aligned, or embedded in visual form. To address these challenges, we propose Vision-Centric Contrastive Learning (VC2L), a unified framework that models text, images, and their combinations using a single vision transformer. VC2L operates entirely in pixel space by rendering all inputs, whether textual, visual, or combined, as images, thus eliminating the need for OCR, text tokenization, or modality fusion strategy. To capture complex cross-modal relationships in multimodal web documents, VC2L employs a snippet-level contrastive learning objective that aligns consecutive multimodal segments, leveraging the inherent coherence of documents without requiring explicitly paired image-text data. To assess the effectiveness of this approach, we introduce three retrieval benchmarks, AnyCIR, SeqCIR, and CSR, designed to evaluate cross-modal retrieval, fine-grained sequential understanding, and generalization to unseen data, respectively. Empirical results show that VC2L achieves competitive or superior performance compared to CLIP-style models on both the proposed benchmarks and established datasets such as M-BEIR and MTEB. These findings underscore the potential of multimodal web data as a valuable training resource for contrastive learning and illustrate the scalability of a unified, vision-centric approach for multimodal representation learning. Code and models are available at: https://github.com/showlab/VC2L.
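
The snippet-level objective can be pictured as a standard InfoNCE loss in which the positive for each snippet is the next snippet of the same document. The sketch below assumes the embeddings have already been produced by some encoder and are passed in as plain lists. The use of in-batch negatives, the temperature value, and the function names are assumptions, not details confirmed by the abstract.

```python
import math

def _cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def snippet_contrastive_loss(anchors, positives, temperature=0.1):
    """InfoNCE over a batch: anchors[i] should match positives[i].

    anchors[i] and positives[i] stand for embeddings of two consecutive
    snippets from document i; the positives of other documents in the
    batch act as negatives.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [_cosine(anchor, p) / temperature for p in positives]
        log_denominator = math.log(sum(math.exp(x) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)
```

The loss is small when each snippet is closest to its own continuation and grows when continuations are mismatched, which is how document coherence can supply supervision without explicit image-text pairs.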

[35] PLANA3R: Zero-shot Metric Planar 3D Reconstruction via Feed-Forward Planar Splatting

Changkun Liu, Bin Tan, Zeran Ke, Shangzhan Zhang, Jiachen Liu, Ming Qian, Nan Xue, Yujun Shen, Tristan Braud

🧩 TL;DR

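This paper proposes PLANA3R, a pose-free framework for metric 3D reconstruction of indoor scenes that uses planar primitives and planar splatting to learn planar 3D structure without explicit plane supervision.

📘 Detailed Summary

Motivation: Existing feed-forward methods require 3D plane annotations for training, which limits scalability to large datasets, and the inherent geometric regularity of indoor scenes has not been fully exploited for compact representation.

Method: Vision Transformers extract a set of sparse planar primitives and estimate relative camera poses. Geometry learning is supervised through planar splatting, with gradients propagated through high-resolution rendered depth and normal maps of the primitives.

Result: The method is validated on multiple indoor-scene datasets and generalizes well to out-of-domain indoor environments across tasks including 3D surface reconstruction, depth estimation, and relative pose estimation.

Conclusion: The planar 3D representation yields accurate metric reconstruction and also gives rise to accurate plane segmentation, pointing to a new direction for 3D reconstruction that needs no plane annotations (training uses only depth and normal annotations).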

📄 Abstract

This paper addresses metric 3D reconstruction of indoor scenes by exploiting their inherent geometric regularities with compact representations. Using planar 3D primitives, a representation well suited to man-made environments, we introduce PLANA3R, a pose-free framework for metric Planar 3D Reconstruction from unposed two-view images. Our approach employs Vision Transformers to extract a set of sparse planar primitives, estimate relative camera poses, and supervise geometry learning via planar splatting, where gradients are propagated through high-resolution rendered depth and normal maps of primitives. Unlike prior feed-forward methods that require 3D plane annotations during training, PLANA3R learns planar 3D structures without explicit plane supervision, enabling scalable training on large-scale stereo datasets using only depth and normal annotations. We validate PLANA3R on multiple indoor-scene datasets with metric supervision and demonstrate strong generalization to out-of-domain indoor environments across diverse tasks under metric evaluation protocols, including 3D surface reconstruction, depth estimation, and relative pose estimation. Furthermore, because it is formulated with a planar 3D representation, our method also acquires the ability to segment planes accurately. The project page is available at https://lck666666.github.io/plana3r
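
The geometric core of rendering depth from a planar primitive is a ray-plane intersection. The sketch below is a deliberately simplified illustration: it treats a primitive as an unbounded plane seen by a pinhole camera at the origin, and it is not the differentiable splatting used in the paper. The function names and the intrinsics `fx`, `fy`, `cx`, `cy` are assumptions.

```python
def plane_depth(u, v, normal, offset, fx, fy, cx, cy):
    """Depth at pixel (u, v) of the plane {X : normal . X = offset}.

    The back-projected ray is r = ((u - cx) / fx, (v - cy) / fy, 1), so a
    point on the ray is z * r, and solving normal . (z * r) = offset gives
    z = offset / (normal . r). Returns None when the ray is parallel to the
    plane or the intersection lies behind the camera.
    """
    ray = ((u - cx) / fx, (v - cy) / fy, 1.0)
    denom = sum(n * r for n, r in zip(normal, ray))
    if abs(denom) < 1e-12:
        return None
    z = offset / denom
    return z if z > 0 else None

def render_depth_map(width, height, normal, offset, fx, fy, cx, cy):
    """Depth map of a single unbounded plane, row by row."""
    return [
        [plane_depth(u, v, normal, offset, fx, fy, cx, cy) for u in range(width)]
        for v in range(height)
    ]
```

A fronto-parallel plane (normal along the optical axis) gives constant depth, while a tilted plane gives depth that varies across the image; comparing such rendered maps against depth annotations is one way supervision could reach the plane parameters.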

[36] IF-VidCap: Can Video Caption Models Follow Instructions?

Shihao Li, Yuanxing Zhang, Jiangtao Wu, Zhide Lei, Yiwen He, Runzhe Wen, Chenxi Liao, Chengkang Jiang, An Ping, Shuo Gao, Suhan Wang, Zhaozhou Bian, Zijun Zhou, Jingyi Xie, Jiayi Zhou, Jing Wang, Yifan Yao, Weihao Xie, Yingshui Tan, Yanghai Wang, Qianqian Xie, Zhaoxiang Zhang, Jiaheng Liu

🧩 TL;DR

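This paper introduces IF-VidCap, a benchmark for evaluating instruction following in controllable video captioning, filling a gap left by existing benchmarks that focus on descriptive comprehensiveness. An evaluation of more than 20 prominent models finds that proprietary models still lead, but top open-source solutions are now near parity.

📘 Detailed Summary

Motivation: Multimodal large language models are proficient at video captioning, but practical applications need captions that follow specific user instructions rather than exhaustive, unconstrained descriptions. Existing benchmarks mainly assess descriptive comprehensiveness and largely overlook instruction-following ability.

Method: The authors introduce IF-VidCap, a benchmark of 1,400 high-quality samples with a systematic framework that assesses captions on two dimensions: format correctness and content correctness. Unlike existing video captioning or general instruction-following benchmarks, it targets controllable video captioning specifically.

Result: A comprehensive evaluation of more than 20 prominent models shows that proprietary models still dominate, but the performance gap is closing and top-tier open-source solutions are near parity. Models specialized for dense captioning underperform general-purpose MLLMs on complex instructions.

Conclusion: Future work should advance descriptive richness and instruction-following fidelity together. The weaker results of dense-captioning specialists indicate that general-purpose MLLMs handle complex instructions better, which has implications for model design.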

📄 Abstract

Although Multimodal Large Language Models (MLLMs) have demonstrated proficiency in video captioning, practical applications require captions that follow specific user instructions rather than generating exhaustive, unconstrained descriptions. Current benchmarks, however, primarily assess descriptive comprehensiveness while largely overlooking instruction-following capabilities. To address this gap, we introduce IF-VidCap, a new benchmark for evaluating controllable video captioning, which contains 1,400 high-quality samples. Distinct from existing video captioning or general instruction-following benchmarks, IF-VidCap incorporates a systematic framework that assesses captions on two dimensions: format correctness and content correctness. Our comprehensive evaluation of over 20 prominent models reveals a nuanced landscape: despite the continued dominance of proprietary models, the performance gap is closing, with top-tier open-source solutions now achieving near-parity. Furthermore, we find that models specialized for dense captioning underperform general-purpose MLLMs on complex instructions, indicating that future work should simultaneously advance both descriptive richness and instruction-following fidelity.

[37] SEAL: Semantic-Aware Hierarchical Learning for Generalized Category Discovery

Zhenqi He, Yuanpei Liu, Kai Han

🧩 TL;DR

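This paper proposes SEAL, a semantic-aware hierarchical learning framework guided by naturally occurring hierarchies, to address generalized category discovery. It achieves state-of-the-art performance on several fine-grained benchmarks.

📘 Detailed Summary

Motivation: Existing generalized category discovery (GCD) methods typically rely on single-level semantics or manually designed abstract hierarchies, which limits their generalizability and scalability when categorizing unlabelled images from both known and unknown classes.

Method: SEAL includes a Hierarchical Semantic-Guided Soft Contrastive Learning approach that uses hierarchical similarity to generate informative soft negatives, and a Cross-Granularity Consistency (CGC) module that aligns predictions across levels of granularity.

Result: SEAL consistently achieves state-of-the-art performance on fine-grained benchmarks, including the SSB benchmark, Oxford-Pet, and Herbarium19, and also generalizes to coarse-grained datasets.

Conclusion: Exploiting naturally occurring hierarchies effectively improves generalized category discovery. Hierarchical semantic guidance and cross-granularity consistency are the key mechanisms, offering a new perspective for open-world visual recognition.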

📄 Abstract

This paper investigates the problem of Generalized Category Discovery (GCD). Given a partially labelled dataset, GCD aims to categorize all unlabelled images, regardless of whether they belong to known or unknown classes. Existing approaches typically depend on either single-level semantics or manually designed abstract hierarchies, which limit their generalizability and scalability. To address these limitations, we introduce a SEmantic-aware hierArchical Learning framework (SEAL), guided by naturally occurring and easily accessible hierarchical structures. Within SEAL, we propose a Hierarchical Semantic-Guided Soft Contrastive Learning approach that exploits hierarchical similarity to generate informative soft negatives, addressing the limitations of conventional contrastive losses that treat all negatives equally. Furthermore, a Cross-Granularity Consistency (CGC) module is designed to align the predictions from different levels of granularity. SEAL consistently achieves state-of-the-art performance on fine-grained benchmarks, including the SSB benchmark, Oxford-Pet, and the Herbarium19 dataset, and further demonstrates generalization on coarse-grained datasets. Project page: https://visual-ai.github.io/seal/

[38] ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder

Xiaoxing Hu, Kaicheng Yang, Ziyong Feng, Qi Ming, Zonghao Guo, Xiang An, Ziyong Feng, Junchi Yan, Xue Yang

🧩 TL;DR

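ProCLIP is a curriculum learning-based progressive vision-language alignment framework that aligns the CLIP image encoder with an LLM-based embedder. It addresses the CLIP text encoder's limitations on long-text and multilingual inputs while avoiding disruption of CLIP's pretrained vision-language alignment.

📘 Detailed Summary

Motivation: The original CLIP text encoder has a maximum input length of 77 tokens, which hampers long-text processing and fine-grained semantic understanding, and it lacks multilingual support; these limitations restrict its broader applicability. Recent work replaces the CLIP text encoder with an LLM-based embedder, but because the LLM representation space and CLIP's vision-language space are pretrained independently without alignment priors, direct contrastive alignment can disrupt the intrinsic vision-language alignment in the CLIP image encoder and underuse pretrained knowledge.

Method: ProCLIP first distills knowledge from CLIP's text encoder into the LLM-based embedder, which leverages CLIP's pretrained knowledge and establishes an initial alignment between the LLM embedder and the CLIP image encoder. It then further aligns the two through image-text contrastive tuning, with self-distillation regularization to avoid overfitting. An instance semantic alignment loss and an embedding structure alignment loss are used during both representation inheritance and contrastive tuning.

Result: The approach is reported to address the CLIP text encoder's length limit and lack of multilingual support while preserving the CLIP image encoder's pretrained knowledge. Progressive alignment integrates the LLM-based embedder with the CLIP vision encoder and is reported to improve long-text and multilingual understanding.

Conclusion: ProCLIP shows that a progressive alignment strategy is effective for combining independently pretrained models and offers a new way to strengthen vision-language models. Beyond addressing CLIP's inherent limitations, it provides a practical path for extending such models, particularly for complex semantics and multilingual scenarios.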

📄 Abstract

The original CLIP text encoder is limited by a maximum input length of 77 tokens, which hampers its ability to effectively process long texts and perform fine-grained semantic understanding. In addition, the CLIP text encoder lacks support for multilingual inputs. All these limitations significantly restrict its applicability across a broader range of tasks. Recent studies have attempted to replace the CLIP text encoder with an LLM-based embedder to improve long-text processing, multilingual understanding, and fine-grained semantic comprehension. However, because the representation spaces of LLMs and the vision-language space of CLIP are pretrained independently without alignment priors, direct alignment using contrastive learning can disrupt the intrinsic vision-language alignment in the CLIP image encoder, leading to an underutilization of the knowledge acquired during pre-training. To address this challenge, we propose ProCLIP, a curriculum learning-based progressive vision-language alignment framework to effectively align the CLIP image encoder with an LLM-based embedder. Specifically, ProCLIP first distills knowledge from CLIP's text encoder into the LLM-based embedder to leverage CLIP's rich pretrained knowledge while establishing initial alignment between the LLM embedder and CLIP image encoder. Subsequently, ProCLIP further aligns the CLIP image encoder with the LLM-based embedder through image-text contrastive tuning, employing self-distillation regularization to avoid overfitting. To achieve a more effective alignment, instance semantic alignment loss and embedding structure alignment loss are employed during representation inheritance and contrastive tuning. Code is available at https://github.com/VisionXLab/ProCLIP
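
The two alignment losses named above are not defined in the abstract, so the sketch below shows only one plausible reading. An instance-level term pulls each student embedding (the LLM-based embedder) toward its teacher embedding (the CLIP text encoder), and a structure-level term matches the pairwise similarity matrices of the two sets. Both formulas and all function names are assumptions.

```python
import math

def _cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def instance_alignment_loss(student, teacher):
    """Mean (1 - cosine) between each student embedding and its teacher embedding."""
    return sum(1.0 - _cosine(s, t) for s, t in zip(student, teacher)) / len(student)

def structure_alignment_loss(student, teacher):
    """Mean squared difference between the two pairwise-similarity matrices."""
    n = len(student)
    total = 0.0
    for i in range(n):
        for j in range(n):
            diff = _cosine(student[i], student[j]) - _cosine(teacher[i], teacher[j])
            total += diff ** 2
    return total / (n * n)
```

The two terms are complementary in this reading: a student space that is a rotated copy of the teacher space has zero structure loss but non-zero instance loss, so the instance term anchors absolute position while the structure term preserves relative geometry.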

[39] FedDEAP: Adaptive Dual-Prompt Tuning for Multi-Domain Federated Learning

Yubin Zheng, Pak-Hei Yeung, Jing Xia, Tianjie Ju, Peng Tang, Weidong Qiu, Jagath C. Rajapakse

🧩 TL;DR

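This paper proposes FedDEAP, a framework that improves CLIP's generalization in multi-domain federated learning by disentangling semantic and domain-specific features, introducing a dual-prompt design, and aligning textual and visual representations.

📘 Detailed Summary

Motivation: In federated learning, domain shift and label heterogeneity across clients limit the generalization of the aggregated global model, and existing approaches to federated fine-tuning of large vision-language models such as CLIP across domains lose domain-specific information.

Method: FedDEAP is an adaptive federated prompt tuning framework with three key components: (1) semantic and domain transformation networks that disentangle semantic and domain-specific features in images; (2) a dual-prompt design, with a global semantic prompt and a local domain prompt, that balances shared and personalized information; and (3) alignment of textual and visual representations under the two learned transformations to preserve semantic and domain consistency.

Result: Theoretical analysis and extensive experiments on four datasets show that the method is effective at improving CLIP's generalization in multi-domain federated image recognition.

Conclusion: The study offers an effective solution for cross-domain fine-tuning of vision-language models in federated learning. Feature disentanglement and the dual-prompt design retain both semantic and domain knowledge in balance, which is informative for future multi-domain federated learning.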

📄 Abstract

Federated learning (FL) enables multiple clients to collaboratively train machine learning models without exposing local data, balancing performance and privacy. However, domain shift and label heterogeneity across clients often hinder the generalization of the aggregated global model. Recently, large-scale vision-language models like CLIP have shown strong zero-shot classification capabilities, raising the question of how to effectively fine-tune CLIP across domains in a federated setting. In this work, we propose an adaptive federated prompt tuning framework, FedDEAP, to enhance CLIP's generalization in multi-domain scenarios. Our method includes the following three key components: (1) To mitigate the loss of domain-specific information caused by label-supervised tuning, we disentangle semantic and domain-specific features in images by using semantic and domain transformation networks with unbiased mappings; (2) To preserve domain-specific knowledge during global prompt aggregation, we introduce a dual-prompt design with a global semantic prompt and a local domain prompt to balance shared and personalized information; (3) To maximize the inclusion of semantic and domain information from images in the generated text features, we align textual and visual representations under the two learned transformations to preserve semantic and domain consistency. Theoretical analysis and extensive experiments on four datasets demonstrate the effectiveness of our method in enhancing the generalization of CLIP for federated image recognition across multiple domains.
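
The dual-prompt idea in component (2) can be sketched as a federated round in which only the global semantic prompt is averaged across clients, while each client's local domain prompt never leaves the client. The sample-count weighting (FedAvg style), the dictionary keys, and the function names below are assumptions; the abstract does not specify the aggregation rule.

```python
def aggregate_global_prompts(client_prompts, client_sizes):
    """Sample-weighted average of the clients' global semantic prompts."""
    total = sum(client_sizes)
    dim = len(client_prompts[0])
    return [
        sum(p[j] * n for p, n in zip(client_prompts, client_sizes)) / total
        for j in range(dim)
    ]

def federated_round(clients):
    """Average the global prompts; leave every local domain prompt untouched."""
    new_global = aggregate_global_prompts(
        [c["global_prompt"] for c in clients],
        [c["num_samples"] for c in clients],
    )
    # Each client receives the shared prompt; its local_prompt is carried over as-is.
    return [{**c, "global_prompt": list(new_global)} for c in clients]
```

Keeping the local prompt out of aggregation is what lets domain-specific knowledge survive the averaging step in this reading.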

[40] DSI-Bench: A Benchmark for Dynamic Spatial Intelligence

Ziang Zhang, Zehan Wang, Guanghao Zhang, Weilong Dai, Yan Xia, Ziang Yan, Minjie Hong, Zhou Zhao

🧩 TL;DR

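This study introduces the notion of dynamic spatial intelligence and presents DSI-Bench, a benchmark with nearly 1,000 dynamic videos and over 1,700 manually annotated questions. It uses the benchmark to evaluate 14 vision-language and expert models, exposing their limitations in understanding dynamic 3D scenes.

📘 Detailed Summary

Motivation: Vision-language models and visual expertise models excel at 2D tasks and static scenes, but their ability to fully understand dynamic 3D scenes remains limited, especially when reasoning about spatial relationships while observers and objects move simultaneously.

Method: The authors define dynamic spatial intelligence and build DSI-Bench, which contains nearly 1,000 dynamic videos and over 1,700 manually annotated questions covering nine decoupled motion patterns of observers and objects. Spatially and temporally symmetric designs reduce bias.

Result: An evaluation of 14 vision-language and expert models reveals key limitations: models often conflate observer and object motion, exhibit semantic biases, and fail to accurately infer relative spatial relationships in dynamic scenarios.

Conclusion: DSI-Bench provides findings and insights for developing dynamic spatial intelligence, exposes basic shortcomings of current models in dynamic 3D scene understanding, and points to directions for future general-purpose and expertise models.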

📄 Abstract

Reasoning about dynamic spatial relationships is essential, as both observers and objects often move simultaneously. Although vision-language models (VLMs) and visual expertise models excel in 2D tasks and static scenarios, their ability to fully understand dynamic 3D scenarios remains limited. We introduce Dynamic Spatial Intelligence and propose DSI-Bench, a benchmark with nearly 1,000 dynamic videos and over 1,700 manually annotated questions covering nine decoupled motion patterns of observers and objects. Spatially and temporally symmetric designs reduce biases and enable systematic evaluation of models' reasoning about self-motion and object motion. Our evaluation of 14 VLMs and expert models reveals key limitations: models often conflate observer and object motion, exhibit semantic biases, and fail to accurately infer relative relationships in dynamic scenarios. Our DSI-Bench provides valuable findings and insights about the future development of general and expertise models with dynamic spatial intelligence.

cs.CL [Back]

[41] Efficient Toxicity Detection in Gaming Chats: A Comparative Study of Embeddings, Fine-Tuned Transformers and LLMs

Yehor Tereshchenko, Mika Hämäläinen

🧩 TL;DR

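This paper presents a comprehensive comparison of automated toxicity detection methods for online gaming chats and proposes a hybrid moderation system architecture. Fine-tuned DistilBERT achieves the best trade-off between accuracy and cost.

📘 Detailed Summary

Motivation: The study addresses content moderation in online gaming environments by systematically evaluating how different NLP methods perform at toxicity detection, with the aim of providing effective automated moderation for dynamic online environments.

Method: The study evaluates traditional machine learning models with embeddings, large language models with zero-shot and few-shot prompting, fine-tuned transformer models, and retrieval-augmented generation. It also proposes a hybrid moderation architecture that combines automated detection with continuous learning mechanisms.

Result: The methods differ significantly in classification accuracy, processing speed, and computational cost. Fine-tuned DistilBERT offers the best accuracy-cost trade-off.

Conclusion: The findings provide empirical evidence for deploying cost-effective, efficient content moderation systems in dynamic online gaming environments and show the potential of a hybrid architecture to reduce human moderator workload.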

📄 Abstract

This paper presents a comprehensive comparative analysis of Natural Language Processing (NLP) methods for automated toxicity detection in online gaming chats. Traditional machine learning models with embeddings, large language models (LLMs) with zero-shot and few-shot prompting, fine-tuned transformer models, and retrieval-augmented generation (RAG) approaches are evaluated. The evaluation framework assesses three critical dimensions: classification accuracy, processing speed, and computational costs. A hybrid moderation system architecture is proposed that optimizes human moderator workload through automated detection and incorporates continuous learning mechanisms. The experimental results demonstrate significant performance variations across methods, with fine-tuned DistilBERT achieving optimal accuracy-cost trade-offs. The findings provide empirical evidence for deploying cost-effective, efficient content moderation systems in dynamic online gaming environments.
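
A hybrid moderation flow of the kind described above is often implemented as confidence-based routing: the classifier acts on its own when it is confident and defers to a human otherwise. The sketch below is an illustration under that assumption. The thresholds, the function names, and the `classifier` callable (any function returning a toxicity probability) are hypothetical and do not come from the paper.

```python
def route_message(toxicity_prob, block_at=0.9, allow_at=0.1):
    """Decide what to do with one chat message given a classifier probability."""
    if toxicity_prob >= block_at:
        return "block"
    if toxicity_prob <= allow_at:
        return "allow"
    return "human_review"

def moderate(messages, classifier, review_queue):
    """Route each message; uncertain ones are appended to the human review queue."""
    decisions = []
    for text in messages:
        decision = route_message(classifier(text))
        if decision == "human_review":
            # Reviewer labels collected here could later feed model retraining.
            review_queue.append(text)
        decisions.append(decision)
    return decisions
```

Widening the gap between the two thresholds sends more traffic to moderators and lowers automated error, so the thresholds are the main lever for trading workload against risk.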

[42] From Local to Global: Revisiting Structured Pruning Paradigms for Large Language Models

Ziyan Wang, Enmao Diao, Qi Le, Pu Wang, Minwoo Lee, Shu-ping Yeh, Evgeny Stupachenko, Hao Feng, Li Yang

🧩 TL;DR

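This paper presents GISP (Global Iterative Structured Pruning), a post-training global structured pruning method that uses loss-based importance and an iterative pruning schedule. It preserves language-modeling performance while improving downstream accuracy, with especially strong gains at 40-50% sparsity.

📘 Detailed Summary

Motivation: The dominant local structured pruning paradigm is task-agnostic. By optimizing layer-wise reconstruction rather than task objectives, it tends to preserve perplexity or generic zero-shot behavior but fails to exploit task-specific calibration signals, which limits downstream gains.

Method: GISP prunes globally, removing attention heads and MLP channels according to first-order, loss-based importance aggregated at the structure level with block-wise normalization. An iterative schedule, rather than one-shot pruning, stabilizes accuracy at high sparsity and mitigates perplexity collapse without intermediate fine-tuning.

Result: Extensive experiments on Llama2-7B/13B, Llama3-8B, and Mistral-0.3-7B show that GISP consistently lowers WikiText-2 perplexity and improves downstream accuracy, especially at 40-50% sparsity. On DeepSeek-R1-Distill-Llama-3-8B, task-aligned calibration substantially improves exact-match accuracy on GSM8K.

Conclusion: Global iterative structured pruning balances model compression against task performance, supports a "prune-once, deploy-many" workflow, and naturally accommodates task-specific objectives, making it a practical option for efficient LLM deployment.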

📄 Abstract

Structured pruning is a practical approach to deploying large language models (LLMs) efficiently, as it yields compact, hardware-friendly architectures. However, the dominant local paradigm is task-agnostic: by optimizing layer-wise reconstruction rather than task objectives, it tends to preserve perplexity or generic zero-shot behavior but fails to capitalize on modest task-specific calibration signals, often yielding limited downstream gains. We revisit global structured pruning and present GISP (Global Iterative Structured Pruning), a post-training method that removes attention heads and MLP channels using first-order, loss-based importance weights aggregated at the structure level with block-wise normalization. An iterative schedule, rather than one-shot pruning, stabilizes accuracy at higher sparsity and mitigates perplexity collapse without requiring intermediate fine-tuning; the pruning trajectory also forms nested subnetworks that support a "prune-once, deploy-many" workflow. Furthermore, because importance is defined by a model-level loss, GISP naturally supports task-specific objectives; we instantiate perplexity for language modeling and a margin-based objective for decision-style tasks. Extensive experiments show that across Llama2-7B/13B, Llama3-8B, and Mistral-0.3-7B, GISP consistently lowers WikiText-2 perplexity and improves downstream accuracy, with especially strong gains at 40-50% sparsity; on DeepSeek-R1-Distill-Llama-3-8B with GSM8K, task-aligned calibration substantially boosts exact-match accuracy.
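
The interplay of global ranking, block-wise normalization, and an iterative schedule can be sketched as follows. The code assumes a hypothetical `score_fn` that returns a raw first-order importance for every structure still alive (for example, the sum of |weight × gradient| over the structure's parameters on a calibration batch). That callable, the equal-sized pruning steps, and the normalization by block total are assumptions, not the paper's exact procedure.

```python
def normalize_per_block(scores):
    """scores: {(block, index): raw importance}. Divide by each block's total."""
    totals = {}
    for (block, _), s in scores.items():
        totals[block] = totals.get(block, 0.0) + s
    return {key: s / (totals[key[0]] or 1.0) for key, s in scores.items()}

def iterative_global_prune(score_fn, structures, n_remove_total, n_steps):
    """Remove structures over n_steps rounds, re-scoring the survivors each round."""
    alive = list(structures)
    per_step = n_remove_total // n_steps
    for _ in range(n_steps):
        norm = normalize_per_block(score_fn(alive))
        ranked = sorted(alive, key=lambda s: norm[s])  # least important first
        drop = set(ranked[:per_step])
        alive = [s for s in alive if s not in drop]
    return alive
```

Without the normalization, blocks whose raw scores are on a smaller scale would be pruned first regardless of their relative importance. Because each round only removes from the previous survivors, the intermediate results are nested, which matches the "prune-once, deploy-many" idea.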

[43] Text or Pixels? It Takes Half: On the Token Efficiency of Visual Text Inputs in Multimodal LLMs

Yanhong Li, Zixuan Lan, Jiawei Zhou

🧩 TL;DR

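This paper studies a form of input compression in which long text is rendered as an image and fed to a multimodal large language model, substantially reducing the number of decoder tokens. Experiments show token savings of nearly 50% while task performance is preserved.

📘 Detailed Summary

Motivation: Large language models and their multimodal variants can now process visual inputs, including images of text. The study asks whether feeding text as images can reduce token usage while preserving performance, which would offer a new way to improve token efficiency for long inputs.

Method: Long text inputs are rendered as a single image and provided directly to the multimodal model, so that the visual text representation acts as input compression for decoder LLMs.

Result: On RULER (long-context retrieval) and CNN/DailyMail (document summarization), the text-as-image method yields substantial token savings, often close to 50%, without degrading task performance.

Conclusion: Visual text representations are a practical and effective form of input compression for decoder LLMs. They offer a new way to handle long text while retaining the model's core capabilities, with implications for improving LLM input efficiency.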

📄 Abstract

Large language models (LLMs) and their multimodal variants can now process visual inputs, including images of text. This raises an intriguing question: can we compress textual inputs by feeding them as images to reduce token usage while preserving performance? In this paper, we show that visual text representations are a practical and surprisingly effective form of input compression for decoder LLMs. We exploit the idea of rendering long text inputs as a single image and providing it directly to the model. This dramatically reduces the number of decoder tokens required, offering a new form of input compression. Through experiments on two distinct benchmarks, RULER (long-context retrieval) and CNN/DailyMail (document summarization), we demonstrate that this text-as-image method yields substantial token savings (often nearly half) without degrading task performance.

[44] ECG-LLM-- training and evaluation of domain-specific large language models for electrocardiography

Lara Ahrens, Wilhelm Haverkamp, Nils Strodthoff

🧩 TL;DR

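By fine-tuning open-weight large language models and applying retrieval-augmented generation, this study achieves performance in electrocardiography that is competitive with proprietary models, supporting the viability of privacy-preserving, locally deployed medical AI.

📘 Detailed Summary

Motivation: For domain-adapted open-weight LLMs in healthcare, the best adaptation strategies, evaluation methodologies, and performance relative to general-purpose models remain poorly characterized. The study examines these questions in electrocardiography, an important area of cardiovascular medicine.

Method: Open-weight models are fine-tuned on domain-specific literature, and a multi-layered evaluation framework compares fine-tuned models, retrieval-augmented generation (RAG), and Claude Sonnet 3.7 as a representative general-purpose model.

Result: Fine-tuned Llama 3.1 70B performed best on multiple-choice evaluations and automatic text metrics and ranked second to Claude 3.7 in LLM-as-a-judge assessments. Human expert evaluation favored Claude 3.7 and RAG approaches for complex queries. Fine-tuned models significantly outperformed their base counterparts across nearly all evaluation modes.

Conclusion: Performance varies substantially across evaluation methodologies, which underscores how complex assessment is. Even so, domain-specific adaptation through fine-tuning and RAG achieves performance competitive with proprietary models, supporting the viability of privacy-preserving, locally deployable clinical solutions.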

📄 Abstract

Domain-adapted open-weight large language models (LLMs) offer promising healthcare applications, from queryable knowledge bases to multimodal assistants, with the crucial advantage of local deployment for privacy preservation. However, optimal adaptation strategies, evaluation methodologies, and performance relative to general-purpose LLMs remain poorly characterized. We investigated these questions in electrocardiography, an important area of cardiovascular medicine, by finetuning open-weight models on domain-specific literature and implementing a multi-layered evaluation framework comparing finetuned models, retrieval-augmented generation (RAG), and Claude Sonnet 3.7 as a representative general-purpose model. Finetuned Llama 3.1 70B achieved superior performance on multiple-choice evaluations and automatic text metrics, ranking second to Claude 3.7 in LLM-as-a-judge assessments. Human expert evaluation favored Claude 3.7 and RAG approaches for complex queries. Finetuned models significantly outperformed their base counterparts across nearly all evaluation modes. Our findings reveal substantial performance heterogeneity across evaluation methodologies, underscoring assessment complexity. Nevertheless, domain-specific adaptation through finetuning and RAG achieves competitive performance with proprietary models, supporting the viability of privacy-preserving, locally deployable clinical solutions.

[45] Grounding or Guessing? Visual Signals for Detecting Hallucinations in Sign Language Translation

Yasser Hamidullah, Koel Dutta Chowdury, Yusser Al-Ghussin, Shakib Yazdani, Cennet Oguz, Josef van Genabith, Cristina España-Bonet

🧩 TL;DR

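This study proposes a token-level reliability measure for detecting hallucinations in vision-language models. By quantifying how much the decoder relies on visual information, it predicts hallucinations in sign language translation, and it generalizes well and remains interpretable across benchmarks.

📘 Detailed Summary

Motivation: Hallucination is a serious problem for vision-language models in sign language translation, especially for gloss-free models, which lack intermediate gloss supervision. Models tend to generate text from language priors rather than from the visual input, which undermines translation accuracy and reliability.

Method: The token-level reliability measure combines feature-based sensitivity, which measures changes in internal features when the video is masked, with counterfactual signals, which capture probability differences between clean and altered video inputs. The signals are aggregated into a sentence-level reliability score.

Result: On the PHOENIX-2014T and CSL-Daily benchmarks, the reliability score predicts hallucination rates, generalizes across datasets and architectures, and decreases under visual degradation. Combining it with text-based signals further improves hallucination risk estimation.

Conclusion: The study establishes reliability as a practical tool for diagnosing hallucinations in sign language translation, explains why gloss-free models are more susceptible, and lays the groundwork for more robust hallucination detection in multimodal generation.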

📄 Abstract

Hallucination, where models generate fluent text unsupported by visual evidence, remains a major flaw in vision-language models and is particularly critical in sign language translation (SLT). In SLT, meaning depends on precise grounding in video, and gloss-free models are especially vulnerable because they map continuous signer movements directly into natural language without intermediate gloss supervision that serves as alignment. We argue that hallucinations arise when models rely on language priors rather than visual input. To capture this, we propose a token-level reliability measure that quantifies how much the decoder uses visual information. Our method combines feature-based sensitivity, which measures internal changes when video is masked, with counterfactual signals, which capture probability differences between clean and altered video inputs. These signals are aggregated into a sentence-level reliability score, providing a compact and interpretable measure of visual grounding. We evaluate the proposed measure on two SLT benchmarks (PHOENIX-2014T and CSL-Daily) with both gloss-based and gloss-free models. Our results show that reliability predicts hallucination rates, generalizes across datasets and architectures, and decreases under visual degradations. Beyond these quantitative trends, we also find that reliability distinguishes grounded tokens from guessed ones, allowing risk estimation without references; when combined with text-based signals (confidence, perplexity, or entropy), it further improves hallucination risk estimation. Qualitative analysis highlights why gloss-free models are more susceptible to hallucinations. Taken together, our findings establish reliability as a practical and reusable tool for diagnosing hallucinations in SLT, and lay the groundwork for more robust hallucination detection in multimodal generation.
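
The two signals and their aggregation can be illustrated with a small sketch. It assumes each token comes with an internal feature vector and an output probability, both computed once on the clean video and once on a masked or altered video. The relative-change formula, the clipping, the equal weighting via `alpha`, and the mean aggregation are all assumptions chosen for illustration.

```python
import math

def feature_sensitivity(feat_clean, feat_masked):
    """Relative change of a token's internal feature when the video is masked."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(feat_clean, feat_masked)))
    norm = math.sqrt(sum(a * a for a in feat_clean)) or 1.0
    return min(1.0, diff / norm)

def counterfactual_signal(p_clean, p_altered):
    """Drop in the token's probability when the video is altered (never negative)."""
    return max(0.0, p_clean - p_altered)

def token_reliability(token, alpha=0.5):
    """Weighted mix of the two signals for one generated token."""
    s = feature_sensitivity(token["feat_clean"], token["feat_masked"])
    c = counterfactual_signal(token["p_clean"], token["p_altered"])
    return alpha * s + (1.0 - alpha) * c

def sentence_reliability(tokens, alpha=0.5):
    """Mean token reliability as a sentence-level score."""
    return sum(token_reliability(t, alpha) for t in tokens) / len(tokens)
```

A token whose features and probability barely react to removing the video scores near zero, which is the signature of a guess driven by language priors rather than by visual evidence.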

[46] Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model

Ling Team, Anqi Shen, Baihui Li, Bin Hu, Bin Jing, Cai Chen, Chao Huang, Chao Zhang, Chaokun Yang, Cheng Lin, Chengyao Wen, Congqi Li, Deng Zhao, Dingbo Yuan, Donghai You, Fagui Mao, Fanzhuang Meng, Feng Xu, Guojie Li, Guowei Wang, Hao Dai, Haonan Zheng, Hong Liu, Jia Guo, Jiaming Liu, Jian Liu, Jianhao Fu, Jiannan Shi, Jianwen Wang, Jianxin Lai, Jin Yang, Jun Mei, Jun Zhou, Junbo Zhao, Junping Zhao, Kuan Xu, Le Su, Lei Chen, Li Tang, Liang Jiang, Liangcheng Fu, Lianhao Xu, Linfeng Shi, Lisha Liao, Longfei Zheng, Meng Li, Mingchun Chen, Qi Zuo, Qiang Cheng, Qianggang Cao, Qitao Shi, Quanrui Guo, Senlin Zhu, Shaofei Wang, Shaomian Zheng, Shuaicheng Li, Shuwei Gu, Siba Chen, Tao Wu, Tao Zhang, Tianyu Zhang, Tianyu Zhou, Tiwei Bie, Tongkai Yang, Wang Hong, Wang Ren, Weihua Chen, Wenbo Yu, Wengang Zheng, Xiangchun Wang, Xiaodong Yan, Xiaopei Wan, Xin Zhao, Xinyu Kong, Xinyu Tang, Xudong Han, Xudong Wang, Xuemin Yang, Xueyu Hu, Yalin Zhang, Yan Sun, Yicheng Shan, Yilong Wang, Yingying Xu, Yongkang Liu, Yongzhen Guo, Yuanyuan Wang, Yuchen Yan, Yuefan Wang, Yuhong Guo, Zehuan Li, Zhankai Xu, Zhe Li, Zhenduo Zhang, Zhengke Gui, Zhenxuan Pan, Zhenyu Huang, Zhenzhong Lan, Zhiqiang Ding, Zhiqiang Zhang, Zhixun Li, Zhizhen Liu, Zihao Wang, Zujie Wen

🧩 TL;DR

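Ring-1T is the first open-source thinking model with one trillion parameters. Three key technical innovations address the stability, efficiency, and system bottlenecks of training at the trillion-parameter scale, and the model achieves breakthrough results on multiple reasoning benchmarks.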

📘 Detailed Summary

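Motivation: The work tackles the unprecedented challenges of training trillion-parameter models, including train-inference misalignment, inefficient rollout processing, and bottlenecks in the reinforcement learning system, all of which limit the development and application of large-scale reasoning models.

Method: Three key innovations are proposed: IcePop stabilizes reinforcement learning training through token-level discrepancy masking and clipping; C3PO++ dynamically partitions long rollouts under a token budget to improve time efficiency; and ASystem is a high-performance reinforcement learning framework designed to overcome the system bottlenecks of trillion-parameter model training.

Result: Ring-1T achieves breakthrough results on key benchmarks: 93.4 on AIME-2025, 86.72 on HMMT-2025, 2088 on CodeForces, and 55.94 on ARC-AGI-v1. It also reaches silver-medal level on IMO-2025, demonstrating exceptional reasoning ability.

Conclusion: By releasing the complete 1T-parameter MoE model to the research community, the work marks an important milestone in democratizing large-scale reasoning intelligence, sets a new baseline for open-source model performance, and broadens access to frontier reasoning capabilities.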

📄 Abstract

We present Ring-1T, the first open-source, state-of-the-art thinking model with trillion-scale parameters. It features 1 trillion total parameters and activates approximately 50 billion per token. Training such models at a trillion-parameter scale introduces unprecedented challenges, including train-inference misalignment, inefficiencies in rollout processing, and bottlenecks in the RL system. To address these, we pioneer three interconnected innovations: (1) IcePop stabilizes RL training via token-level discrepancy masking and clipping, resolving instability from training-inference mismatches; (2) C3PO++ improves resource utilization for long rollouts under a token budget by dynamically partitioning them, thereby obtaining high time efficiency; and (3) ASystem is a high-performance RL framework designed to overcome the systemic bottlenecks that impede trillion-parameter model training. Ring-1T delivers breakthrough results across critical benchmarks: 93.4 on AIME-2025, 86.72 on HMMT-2025, 2088 on CodeForces, and 55.94 on ARC-AGI-v1. Notably, it attains a silver medal-level result on the IMO-2025, underscoring its exceptional reasoning capabilities. By releasing the complete 1T-parameter MoE model, we provide the research community with direct access to cutting-edge reasoning capabilities. This contribution marks a significant milestone in democratizing large-scale reasoning intelligence and establishes a new baseline for open-source model performance.
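
To make the idea of token-level discrepancy masking more concrete, here is a minimal sketch under stated assumptions. It compares the log-probability each sampled token receives from the training engine with the one from the inference engine, and zeroes the weight of tokens whose probability ratio leaves an acceptance band. The band limits `low` and `high`, the use of the ratio as the surviving weight, and the function name are all hypothetical; the abstract does not specify IcePop's exact rule.

```python
import math


def discrepancy_weights(logp_train, logp_infer, low=0.5, high=2.0):
    """Per-token loss weights from the train/inference probability ratio.

    Tokens whose ratio falls outside [low, high] are masked (weight 0.0);
    the remaining tokens keep their ratio as a weight. Thresholds are
    illustrative assumptions, not values from the paper.
    """
    weights = []
    for lt, li in zip(logp_train, logp_infer):
        ratio = math.exp(lt - li)  # p_train(token) / p_infer(token)
        weights.append(ratio if low <= ratio <= high else 0.0)
    return weights
```

In a policy-gradient loss these weights would multiply the per-token terms, so tokens on which the two engines disagree strongly contribute no gradient.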

cs.AI [Back]

[47] Activation Manifold Projection: Liberating Task-Specific Behaviors from LLM Architectures

Al Kari

🧩 TL;DR

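This paper proposes Cartridge Activation Space Transfer (CAST), a framework that learns a nonlinear mapping between the activation manifolds of different LLM architectures. This enables zero-shot cross-architecture transfer of LoRA adapters and addresses the problem that fine-tuned behaviors are confined to the source model's architecture.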

📘 Detailed Summary

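Motivation: The proliferation of LLM architectures poses a fundamental challenge: valuable task-specific behaviors learned through fine-tuning methods such as LoRA are locked into the source model's architecture, a problem referred to as architectural lock-in. Existing transfer methods try to bridge the gap by aligning the models' static weight spaces, but this approach is brittle and indirect, relying on tenuous correlations between parameter geometries.

Method: CAST introduces a new paradigm that liberates LoRA-encoded behaviors by learning a direct nonlinear mapping between the activation manifolds of different LLM architectures. It treats a pre-trained LoRA as a frozen "behavioral kernel" and learns a set of lightweight bidirectional projection heads that translate the target model's activation stream into the source model's latent space, apply the frozen kernel, and project the result back.

Result: Experiments show that CAST achieves true "zero-shot" transfer of standard LoRA adapters. In transfers between heterogeneous model families such as Llama-2 and Mistral, CAST-translated adapters reach 85-95% of the performance of a LoRA fully retrained on the target model. The method quantitatively outperforms current weight-space transfer techniques and establishes a new state of the art in model interoperability.

Conclusion: By mapping activation spaces rather than aligning weight spaces, CAST decouples learned skills from the source architecture. It offers a more direct and robust route to transferring LLM behaviors, advances model interoperability, and opens a path toward knowledge sharing between heterogeneous models.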

📄 Abstract

The proliferation of Large Language Model (LLM) architectures presents a fundamental challenge: valuable, task-specific behaviors learned through fine-tuning methods like Low-Rank Adaptation (LoRA) are effectively trapped within their source model's architecture, herein referred to as architectural lock-in. Existing transfer methods attempt to bridge this gap by aligning the static weight spaces of models, a brittle and indirect approach that relies on tenuous correlations between parameter geometries. This paper introduces a fundamentally different and more direct paradigm: the Cartridge Activation Space Transfer (CAST), a novel framework that liberates LoRA-encoded behaviors by learning a direct, nonlinear mapping between the activation manifolds, the geometric structures formed by the model's internal neuron activations, of two distinct LLM architectures. CAST treats a pre-trained LoRA as a frozen "behavioral kernel." It learns a set of lightweight, bidirectional projection heads that translate the target model's activation stream into the source model's latent space, apply the frozen kernel, and project the result back. This process, trained on a general text corpus without any task-specific data, effectively decouples the learned skill from the source architecture. We demonstrate that CAST enables true "zero-shot" translation of any standard LoRA adapter. Our experiments, including transfers between heterogeneous model families like Llama-2 and Mistral, show that CAST-translated adapters achieve 85-95% of the performance of a LoRA fully retrained on the target model, quantitatively outperforming current weight-space transfer techniques and establishing a new state-of-the-art in model interoperability.
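
The project-in, apply-kernel, project-back data flow can be sketched in a few lines of plain Python. This is only an illustration of the shape of the computation under simplifying assumptions: a single `tanh` layer stands in for the nonlinear projection head, the LoRA is reduced to two small matrices applied to one activation vector, and all weights are random. The class and attribute names are hypothetical, and no training loop is shown.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Random matrix standing in for learned or pre-trained weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class CastAdapterSketch:
    """Toy version of: target activations -> source space -> frozen LoRA -> back."""

    def __init__(self, d_target, d_source, rank, seed=0):
        rng = random.Random(seed)
        # Projection heads: these would be the trainable parts.
        self.proj_in = rand_matrix(d_source, d_target, rng)
        self.proj_out = rand_matrix(d_target, d_source, rng)
        # Frozen LoRA factors from the source model (the "behavioral kernel").
        self.lora_a = rand_matrix(rank, d_source, rng)
        self.lora_b = rand_matrix(d_source, rank, rng)

    def delta(self, h_target):
        # 1. Translate the target activation into the source latent space.
        z = [math.tanh(x) for x in matvec(self.proj_in, h_target)]
        # 2. Apply the frozen low-rank update B(Az).
        kernel_out = matvec(self.lora_b, matvec(self.lora_a, z))
        # 3. Project the result back into the target space.
        return matvec(self.proj_out, kernel_out)

    def forward(self, h_target):
        # Add the translated LoRA update to the original activation.
        return [h + d for h, d in zip(h_target, self.delta(h_target))]
```

In this sketch only `proj_in` and `proj_out` would receive gradients, which mirrors the description of the LoRA as frozen while the projection heads are learned.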

Aaron Bell, Amit Aides, Amr Helmy, Arbaaz Muslim, Aviad Barzilai, Aviv Slobodkin, Bolous Jaber, David Schottlander, George Leifman, Joydeep Paul, Mimi Sun, Nadav Sherman, Natalie Williams, Per Bjornsson, Roy Lee, Ruth Alcantara, Thomas Turnbull, Tomer Shekel, Vered Silverman, Yotam Gigi, Adam Boulanger, Alex Ottenwess, Ali Ahmadalipour, Anna Carter, Charles Elliott, David Andre, Elad Aharoni, Gia Jung, Hassler Thurston, Jacob Bien, Jamie McPike, Juliet Rothenberg, Kartik Hegde, Kel Markert, Kim Philipp Jablonski, Luc Houriez, Monica Bharel, Phing VanLee, Reuven Sayag, Sebastian Pilarski, Shelley Cazares, Shlomi Pasternak, Siduo Jiang, Stone Jiang, Thomas Colthurst, Yang Chen, Yehonathan Refael, Yochai Blau, Yuval Carny, Yael Maguire, Avinatan Hassidim, James Manyika, Tim Thelin, Genady Beryozkin, Gautam Prasad, Luke Barrington, Yossi Matias, Niv Efron, Shravya Shetty

🧩 TL;DR

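This paper presents Earth AI, an AI system that combines geospatial foundation models with an agentic reasoning engine. Through multimodal foundation models and a Gemini-powered agent, it substantially improves the ability to extract actionable insights from complex geospatial data.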

📘 Detailed Summary

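Motivation: Geospatial data holds enormous potential, but its sheer volume and diversity, together with varied resolutions, timescales, and sparsity, pose major challenges for thorough analysis and interpretation. New methods are needed to process this complex data effectively and draw deep insights from it.

Method: The approach builds foundation models across three key domains (planet-scale imagery, population, and environment) and develops a Gemini-powered reasoning agent that jointly reasons over the multiple foundation models, large geospatial data sources, and tools to handle complex multi-step queries.

Result: Rigorous benchmarks demonstrate the power and novel capabilities of the foundation models and confirm that, when used together, they provide complementary value for geospatial inference, with synergies that unlock superior predictive capabilities. On a new benchmark of real-world crisis scenarios, the agent delivers critical and timely insights.

Conclusion: By integrating multimodal foundation models with agentic reasoning, the work bridges the gap between raw geospatial data and actionable understanding, offers a new paradigm for geospatial AI, and shows potential for delivering timely insights in complex real-world scenarios.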

📄 Abstract

Geospatial data offers immense potential for understanding our planet. However, the sheer volume and diversity of this data along with its varied resolutions, timescales, and sparsity pose significant challenges for thorough analysis and interpretation. This paper introduces Earth AI, a family of geospatial AI models and agentic reasoning that enables significant advances in our ability to unlock novel and profound insights into our planet. This approach is built upon foundation models across three key domains--Planet-scale Imagery, Population, and Environment--and an intelligent Gemini-powered reasoning engine. We present rigorous benchmarks showcasing the power and novel capabilities of our foundation models and validate that when used together, they provide complementary value for geospatial inference and their synergies unlock superior predictive capabilities. To handle complex, multi-step queries, we developed a Gemini-powered agent that jointly reasons over our multiple foundation models along with large geospatial data sources and tools. On a new benchmark of real-world crisis scenarios, our agent demonstrates the ability to deliver critical and timely insights, effectively bridging the gap between raw geospatial data and actionable understanding.

[49] Med-VRAgent: A Framework for Medical Visual Reasoning-Enhanced Agents

Guangfu Guo, Xiaoqian Lu, Yue Feng

🧩 TL;DR

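This study proposes Med-VRAgent, a medical visual reasoning agent framework that combines visual guidance, a self-reward mechanism, and Monte Carlo Tree Search. It significantly improves the performance of vision-language models on medical reasoning tasks and outperforms existing methods on multiple medical VQA benchmarks.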

📘 Detailed Summary

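Motivation: Current vision-language models suffer from hallucinations, vague descriptions, inconsistent logic, and poor localization in medical reasoning tasks, which limits their practical value in the medical domain.

Method: The approach builds a medical visual reasoning agent framework on the Visual Guidance and Self-Reward paradigms combined with Monte Carlo Tree Search, and fine-tunes the vision-language model on the collected trajectories with the proximal policy optimization (PPO) objective.

Result: Experiments on multiple medical visual question answering benchmarks show that the method significantly outperforms existing approaches, confirming the effectiveness of the framework in improving medical visual reasoning.

Conclusion: The study shows that an agent framework combining visual guidance, tree search, and reinforcement learning can address key challenges in medical visual reasoning. It offers a new technical path for medical AI applications and demonstrates the potential of optimizing models through trajectory feedback.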

📄 Abstract

Visual Language Models (VLMs) achieve promising results in medical reasoning but struggle with hallucinations, vague descriptions, inconsistent logic and poor localization. To address this, we propose an agent framework named Medical Visual Reasoning Agent (Med-VRAgent). The approach is based on Visual Guidance and Self-Reward paradigms and Monte Carlo Tree Search (MCTS). By combining the Visual Guidance with tree search, Med-VRAgent improves the medical visual reasoning capabilities of VLMs. We use the trajectories collected by Med-VRAgent as feedback to further improve the performance by fine-tuning the VLMs with the proximal policy optimization (PPO) objective. Experiments on multiple medical VQA benchmarks demonstrate that our method outperforms existing approaches.
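
For readers unfamiliar with the PPO objective mentioned above, the sketch below computes the standard clipped surrogate from per-step probability ratios and advantage estimates. It shows the generic PPO formulation only; how Med-VRAgent derives advantages from its trajectories, and which PPO variant it uses, is not stated in the abstract, so the inputs here are placeholder assumptions.

```python
def ppo_clip_objective(ratios, advantages, eps=0.2):
    """Mean clipped surrogate objective (to be maximized).

    ratios:     new_policy_prob / old_policy_prob for each sampled action
    advantages: advantage estimate for each sampled action
    eps:        clipping range; 0.2 is a common default, assumed here
    """
    terms = []
    for r, a in zip(ratios, advantages):
        clipped_ratio = min(max(r, 1.0 - eps), 1.0 + eps)
        # Take the more pessimistic of the clipped and unclipped terms.
        terms.append(min(r * a, clipped_ratio * a))
    return sum(terms) / len(terms)
```

The clipping removes any incentive to move the policy far from the one that collected the trajectories in a single update.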

[50] StarBench: A Turn-Based RPG Benchmark for Agentic Multimodal Decision-Making and Information Seeking

Haoran Zhang, Chenhao Zhu, Sicong Guo, Hanzhe Guo, Haiming Li, Donglin Yu

🧩 TL;DR

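This paper introduces StarBench, a benchmark for evaluating the multimodal decision-making and agentic information-seeking abilities of vision-language models in a real game client. It reveals sizable gaps in the perception-to-control fidelity of current models.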

📘 Detailed Summary

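Motivation: Current vision-language models still struggle to play at a human level in a real game client, which requires mapping raw screenshots to temporally coherent low-level actions and deciding when to seek guidance after getting stuck. Existing work reports encouraging results under simplified control or tool scaffolds, but human-like play remains an open problem.

Method: The study introduces StarBench, a turn-based RPG benchmark derived from Honkai: Star Rail, with eight combat tasks and two control regimes. In direct control, agents receive only screenshots and must emit low-level primitives. In tool-assisted control, higher-level intents can be mapped to primitives by detectors, and OCR outputs provide optional textualized observations. The benchmark also includes an ask-or-act diagnostic that measures when agents choose to request guidance and how that choice affects performance.

Result: Experiments show sizable gaps in perception-to-control fidelity for current vision-language models in the direct control regime. They also show that judicious information seeking correlates with improved success, giving a reproducible yardstick for agentic information seeking and multimodal decision-making.

Conclusion: StarBench provides a standardized framework for evaluating agentic information seeking and multimodal decision-making of vision-language models in real-client play. The results indicate that current models have room to improve in low-level action control and that appropriate information seeking is associated with higher task success, pointing to an important direction for future agent research.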

📄 Abstract

Human players do more than press buttons: they ground what they see on screen into precise keyboard-mouse actions and, when stuck, they seek information before trying again. We ask whether current vision-language models (VLMs) can do the same. Despite encouraging results under simplified control or tool scaffolds, human-like play in a real client - mapping raw screenshots to temporally coherent low-level actions while deciding when to ask for guidance - remains an open challenge. We introduce StarBench, a turn-based RPG benchmark derived from Honkai: Star Rail that targets these two human-like competencies: multimodal decision-making from pixels to actions and agentic information seeking. StarBench standardizes evaluation across eight combat tasks and two regimes with shared tasks and metrics: (i) direct control, where agents receive only screenshots and must emit low-level primitives (click and keypress) with no semantic hints; and (ii) tool-assisted control, where higher-level intents can be mapped to primitives by detectors and OCR outputs provide optional textualized observations to ease UI grounding. To mirror human practice, StarBench also includes an ask-or-act diagnostic that measures whether and when agents choose to request brief guidance before proceeding, and how that choice affects subsequent performance. We report reference baselines for contemporary VLMs and a human reference. Results expose sizable gaps in perception-to-control fidelity in the direct regime, while showing that judicious information seeking correlates with improved success, establishing StarBench as a reproducible yardstick for agentic information seeking and multimodal decision-making in real-client play.

[51] VAR: Visual Attention Reasoning via Structured Search and Backtracking

Wei Cai, Jian Zhao, Yuchen Yuan, Tianle Zhang, Ming Zhu, Haichuan Tang, Chi Zhang, Xuelong Li

🧩 TL;DR

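This paper proposes Visual Attention Reasoning (VAR), a framework that recasts grounded reasoning as a structured search process. It addresses the hallucination problem and the brittle linear reasoning of multimodal large language models, and sets a new state of the art on hallucination and safety benchmarks.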

📘 Detailed Summary

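Motivation: Multimodal large language models have a high tendency to hallucinate and rely on brittle, linear reasoning processes, which leads to failures on complex tasks. New methods capable of traceable reasoning and self-correction are needed.

Method: VAR decomposes grounded reasoning into two key stages, traceable evidence grounding and search-based chain-of-thought generation, with a backtracking mechanism for self-correction. The search is guided by a multi-faceted reward function with semantic and geometric self-verification components.

Result: The 7B-parameter VAR model sets a new state of the art on a comprehensive suite of hallucination and safety benchmarks, significantly outperforming existing open-source models and showing competitive performance against leading proprietary systems.

Conclusion: The study demonstrates that structured search is effective against hallucination in multimodal models, and it provides both a theoretical basis and a practical framework for building more reliable multimodal reasoning systems.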

📄 Abstract

Multimodal Large Language Models (MLLMs), despite their advances, are hindered by their high hallucination tendency and heavy reliance on brittle, linear reasoning processes, leading to failures in complex tasks. To address these limitations, we introduce Visual Attention Reasoning (VAR), a novel framework that recasts grounded reasoning as a structured search over a reasoning trajectory space. VAR decomposes the reasoning process into two key stages: traceable evidence grounding and search-based chain-of-thought (CoT) generation, which incorporates a backtracking mechanism for self-correction. The search is guided by a multi-faceted reward function with semantic and geometric self-verification components, which penalize outputs that are not faithfully grounded in the visual input. We provide a theoretical analysis for our search strategy, validating its capability to find the correct solution with high probability. Experimental results show that our 7B model, VAR-7B, sets a new state-of-the-art on a comprehensive suite of hallucination and safety benchmarks, significantly outperforming existing open-source models and demonstrating competitive performance against leading proprietary systems.
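
The general pattern of reward-guided search with backtracking can be sketched as a depth-first search that prunes low-reward partial trajectories and returns to earlier branch points when a branch fails. This is a generic illustration, not VAR's actual algorithm: the callbacks `expand`, `reward`, and `is_complete`, the pruning `threshold`, and the depth limit are all hypothetical stand-ins for the model's step proposals and its semantic and geometric self-verification rewards.

```python
def search_with_backtracking(expand, reward, is_complete, threshold=0.5, max_depth=6):
    """Depth-first search over reasoning trajectories with pruning and backtracking.

    expand(path)      -> list of candidate next steps
    reward(path)      -> score of a (partial) trajectory, higher is better
    is_complete(path) -> True when the trajectory is a finished answer
    """

    def dfs(path):
        if is_complete(path):
            return path
        if len(path) >= max_depth:
            return None
        # Try the most promising candidate steps first.
        candidates = sorted(expand(path), key=lambda s: reward(path + [s]), reverse=True)
        for step in candidates:
            new_path = path + [step]
            if reward(new_path) < threshold:
                continue  # prune poorly grounded continuations
            result = dfs(new_path)
            if result is not None:
                return result
            # This branch failed further down: backtrack and try the next step.
        return None

    return dfs([])
```

A purely linear chain of thought corresponds to following only the first candidate at each step; the loop over alternatives is what allows recovery from an early mistake.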

[52] Seg the HAB: Language-Guided Geospatial Algae Bloom Reasoning and Segmentation

Patterson Hsieh, Jerry Yeh, Mao-Chi He, Wen-Han Hsieh, Elvis Hsieh

🧩 TL;DR

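This study presents ALGOS, an algal bloom monitoring system that combines remote sensing image segmentation with severity assessment. By integrating GeoSAM-assisted human evaluation with vision-language model fine-tuning, it automates the monitoring of harmful algal blooms and quantifies their severity.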

📘 Detailed Summary

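Motivation: Climate change is intensifying outbreaks of harmful algal blooms, particularly cyanobacteria. Traditional manual water-quality monitoring is labor-intensive and limited in spatial and temporal coverage, and existing vision-language models still face challenges in reasoning over remote sensing imagery and quantifying bloom severity.

Method: ALGOS integrates GeoSAM-assisted human evaluation to curate high-quality segmentation masks, and fine-tunes a vision-language model for severity prediction on NASA's Cyanobacteria Aggregated Manual Labels (CAML) dataset, so that segmentation and reasoning are handled together.

Result: Experiments show that ALGOS performs robustly on both segmentation and severity-level estimation, providing a reliable technical basis for automated cyanobacteria monitoring systems.

Conclusion: The work paves the way toward practical, automated cyanobacterial monitoring systems and shows the potential of vision-language models in environmental remote sensing for large-scale, efficient bloom monitoring and risk assessment.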

📄 Abstract

Climate change is intensifying the occurrence of harmful algal blooms (HABs), particularly cyanobacteria, which threaten aquatic ecosystems and human health through oxygen depletion, toxin release, and disruption of marine biodiversity. Traditional monitoring approaches, such as manual water sampling, remain labor-intensive and limited in spatial and temporal coverage. Recent advances in vision-language models (VLMs) for remote sensing have shown potential for scalable AI-driven solutions, yet challenges remain in reasoning over imagery and quantifying bloom severity. In this work, we introduce ALGae Observation and Segmentation (ALGOS), a segmentation-and-reasoning system for HAB monitoring that combines remote sensing image understanding with severity estimation. Our approach integrates GeoSAM-assisted human evaluation for high-quality segmentation mask curation and fine-tunes a vision-language model for severity prediction using the Cyanobacteria Aggregated Manual Labels (CAML) dataset from NASA. Experiments demonstrate that ALGOS achieves robust performance on both segmentation and severity-level estimation, paving the way toward practical and automated cyanobacterial monitoring systems.

Table of Contents

cs.CV [Back]

[1] CoIDO: Efficient Data Selection for Visual Instruction Tuning via Coupled Importance-Diversity Optimization

[2] MUSE: Model-based Uncertainty-aware Similarity Estimation for zero-shot 2D Object Detection and Segmentation

[3] 3D Weakly Supervised Semantic Segmentation via Class-Aware and Geometry-Guided Pseudo-Label Refinement

[4] ManzaiSet: A Multimodal Dataset of Viewer Responses to Japanese Manzai Comedy

[5] SAVANT: Semantic Analysis with Vision-Augmented Anomaly deTection

[6] HouseTour: A Virtual Real Estate A(I)gent

[7] Chimera: Compositional Image Generation using Part-based Concepting

[8] Online In-Context Distillation for Low-Resource Vision Language Models

[9] Adapting Stereo Vision From Objects To 3D Lunar Surface Reconstruction with the StereoLunar Dataset

[10] RadDiagSeg-M: A Vision Language Model for Joint Diagnosis and Multi-Target Segmentation in Radiology

[11] Visual Space Optimization for Zero-shot Learning

[12] VLSU: Mapping the Limits of Joint Multimodal Understanding for AI Safety

[13] BlendCLIP: Bridging Synthetic and Real Domains for Zero-Shot 3D Object Classification with Multimodal Pretraining

[14] The Impact of Image Resolution on Biomedical Multimodal Large Language Models

[15] UWBench: A Comprehensive Vision-Language Benchmark for Underwater Understanding

[16] Zero-Shot Vehicle Model Recognition via Text-Based Retrieval-Augmented Generation

[17] StreamingTOM: Streaming Token Compression for Efficient Video Understanding

[18] See the Text: From Tokenization to Visual Reading

[19] Proactive Reasoning-with-Retrieval Framework for Medical Multimodal Large Language Models

[20] Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs

🧩 TL;DR