cs.CV

[1] ReDiPrune: Relevance-Diversity Pre-Projection Token Pruning for Efficient Multimodal LLMs

An Yu, Ting Yu Tsai, Zhenfei Zhang, Weiheng Lu, Felix X.-F. Ye, Ming-Ching Chang

🧩 TL;DR

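This paper proposes ReDiPrune, a training-free token pruning method applied before the vision-language projector. It selects informative visual tokens with a lightweight rule that jointly considers text-conditioned relevance and max-min diversity, substantially reducing computational cost while improving the performance of multimodal large language models.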


📘 Detailed Summary

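Motivation: Current multimodal large language models are computationally expensive, mainly because the Transformer must process a large number of visual tokens. Existing post-projection pruning methods operate on compressed representations and may lose fine-grained spatial and semantic information. This calls for a method that selects informative tokens directly at the vision encoder output, so that rich visual features are preserved before projection.

Method: ReDiPrune is a training-free token pruning method applied before the vision-language projector; it selects informative tokens directly from the vision encoder outputs. It uses a lightweight scoring rule that jointly considers text-conditioned relevance and max-min diversity, so that the selected tokens are both query-relevant and non-redundant. The method is fully plug-and-play: it requires no retraining or architectural modification and can be inserted between the encoder and the projector.

Result: Across four video and five image benchmarks, ReDiPrune consistently improves the accuracy-efficiency trade-off. For example, on EgoSchema with LLaVA-NeXT-Video-7B, retaining only 15% of the visual tokens yields a +2.0% absolute accuracy gain while reducing computation by more than 6× in TFLOPs. The method shows a favorable performance-efficiency balance across the evaluated multimodal tasks.

Conclusion: By selecting informative tokens directly at the vision encoder output, ReDiPrune addresses the computational cost of multimodal large language models. The results indicate that pruning tokens before projection better preserves fine-grained visual information while substantially reducing computation. This training-free, plug-and-play approach offers a new direction for efficient multimodal model design and has practical deployment value.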


📄 Abstract

Recent multimodal large language models are computationally expensive because Transformers must process a large number of visual tokens. We present ReDiPrune, a training-free token pruning method applied before the vision-language projector, where visual features remain rich and discriminative. Unlike post-projection pruning methods that operate on compressed representations, ReDiPrune selects informative tokens directly from vision encoder outputs, preserving fine-grained spatial and semantic cues. Each token is scored by a lightweight rule that jointly considers text-conditioned relevance and max-min diversity, ensuring the selected tokens are both query-relevant and non-redundant. ReDiPrune is fully plug-and-play, requiring no retraining or architectural modifications, and can be seamlessly inserted between the encoder and projector. Across four video and five image benchmarks, it consistently improves the accuracy-efficiency trade-off. For example, on EgoSchema with LLaVA-NeXT-Video-7B, retaining only 15% of visual tokens yields a +2.0% absolute accuracy gain while reducing computation by more than 6× in TFLOPs. Code is available at https://github.com/UA-CVML/ReDiPrune.
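
To make the scoring idea concrete, the sketch below shows one common way to combine query relevance with max-min diversity: a greedy pass that repeatedly picks the token with the best mix of cosine similarity to the query and distance to the nearest token already kept. This is an illustration under stated assumptions, not the authors' rule: the function name `select_tokens`, the weight `alpha`, the use of cosine similarity, and the assumption that the text embedding lives in the same space as the vision features are all hypothetical.

```python
import numpy as np

def select_tokens(visual, text, keep, alpha=0.5):
    """Greedily pick `keep` token indices, trading relevance against diversity.

    visual: (N, D) vision-encoder token features (before the projector).
    text:   (D,) pooled query embedding, ASSUMED to share the feature space.
    alpha:  hypothetical weight; 1.0 = relevance only, 0.0 = diversity only.
    """
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    t = text / np.linalg.norm(text)
    relevance = v @ t                        # cosine similarity to the query
    selected = [int(np.argmax(relevance))]   # seed with the most relevant token
    # min_dist[i] = cosine distance from token i to its nearest selected token
    min_dist = 1.0 - v @ v[selected[0]]
    while len(selected) < keep:
        score = alpha * relevance + (1.0 - alpha) * min_dist
        score[selected] = -np.inf            # never pick the same token twice
        nxt = int(np.argmax(score))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, 1.0 - v @ v[nxt])
    return sorted(selected)                  # preserve the original token order
```

With this sketch, keeping 15% of the tokens would read `visual[select_tokens(visual, text, keep=int(0.15 * len(visual)))]`, and the pruned array would then be passed to the projector unchanged.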

[2] LLaVA-LE: Large Language-and-Vision Assistant for Lunar Exploration

Gokce Inal, Pouyan Navard, Alper Yilmaz

🧩 TL;DR

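This paper presents LLaVA-LE, a vision-language model specialized for lunar surface and subsurface characterization. By curating LUCID, a large-scale multimodal lunar dataset, and applying two-stage fine-tuning, it substantially improves visual reasoning in planetary science.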


📘 Detailed Summary

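Motivation: Although multimodal vision-language models have advanced joint reasoning over visual and textual information, their application to planetary science remains largely unexplored. The main obstacle is the lack of large-scale datasets that pair real planetary imagery with detailed scientific descriptions, which limits how well models can handle specialized tasks such as lunar terrain analysis.

Method: The authors curate LUCID, a large-scale multimodal lunar dataset with 96k high-resolution panchromatic images paired with detailed terrain captions, plus 81k question-answer pairs derived from approximately 20k of those images. On this basis they fine-tune LLaVA for lunar exploration with a two-stage training curriculum: stage 1 performs concept alignment for domain-specific terrain description, and stage 2 performs instruction-tuned visual question answering.

Result: On evaluation benchmarks spanning multiple levels of reasoning complexity for lunar terrain analysis, LLaVA-LE achieves a 3.3x overall performance gain over base LLaVA and a 2.1x gain over the Stage 1 model. Its reasoning score of 1.070 exceeds the judge's own reference score, supporting the effectiveness of domain-specific multimodal data and instruction tuning.

Conclusion: The study shows that a large-scale domain-specific multimodal dataset combined with two-stage fine-tuning can substantially improve the reasoning ability of vision-language models in specialized fields such as planetary science. It offers a practical route to automated terrain analysis in planetary exploration and underlines the importance of domain adaptation in multimodal AI applications.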


📄 Abstract

Recent advances in multimodal vision-language models (VLMs) have enabled joint reasoning over visual and textual information, yet their application to planetary science remains largely unexplored. A key hindrance is the absence of large-scale datasets that pair real planetary imagery with detailed scientific descriptions. In this work, we introduce LLaVA-LE (Large Language-and-Vision Assistant for Lunar Exploration), a vision-language model specialized for lunar surface and subsurface characterization. To enable this capability, we curate a new large-scale multimodal lunar dataset, LUCID (LUnar Caption Image Dataset), consisting of 96k high-resolution panchromatic images paired with detailed captions describing lunar terrain characteristics, and 81k question-answer (QA) pairs derived from approximately 20k images in the LUCID dataset. Leveraging this dataset, we fine-tune LLaVA using a two-stage training curriculum: (1) concept alignment for domain-specific terrain description, and (2) instruction-tuned visual question answering. We further design evaluation benchmarks spanning multiple levels of reasoning complexity relevant to lunar terrain analysis. Evaluated against GPT and Gemini judges, LLaVA-LE achieves a 3.3x overall performance gain over Base LLaVA and 2.1x over our Stage 1 model, with a reasoning score of 1.070 that exceeds the judge's own reference score. These results highlight the effectiveness of domain-specific multimodal data and instruction tuning for advancing VLMs in planetary exploration. Code is available at https://github.com/OSUPCVLab/LLaVA-LE.

[3] VideoTIR: Accurate Understanding for Long Videos with Efficient Tool-Integrated Reasoning

Zhe Gao, Shiyu Shen, Taifeng Chai, Weinong Wang, Haotian Xu, Xing W, Wenbin Li, Qi Fan, Yang Gao, Dacheng Tao

🧩 TL;DR

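This paper proposes VideoTIR, a reinforcement-learning-based approach for multimodal large language models that addresses hallucination in long video understanding through multi-level toolkits and a trajectory optimization strategy, improving both accuracy and efficiency.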


📘 Detailed Summary

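Motivation: Existing multimodal large language models often hallucinate in long video understanding, mainly because of the imbalance between textual and visual tokens. Tool-calling methods based on supervised fine-tuning can mitigate this, but they typically require large amounts of fine-grained, high-quality data and are limited by constrained tool-calling trajectories.

Method: The VideoTIR framework uses reinforcement learning to encourage proper use of multi-level toolkits, exploring both Zero-RL and SFT cold-start strategies. It introduces Toolkit Action Grouped Policy Optimization (TAGPO), which reduces redundant tool calls through stepwise reward assignment and reuse of failed rollouts. A sandbox-based trajectory synthesis framework generates high-quality trajectory data.

Result: Extensive experiments on three long-video QA benchmarks show that the method performs well in both accuracy and efficiency. TAGPO reduces redundant tool calls, and the trajectory synthesis framework produces high-quality training data.

Conclusion: The study demonstrates the potential of reinforcement learning for long video understanding, achieving accurate and efficient long-video analysis through multi-level toolkits and policy optimization. The trajectory synthesis framework offers a new way to generate high-quality training data, and TAGPO provides an effective means of reducing redundant tool calls.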


📄 Abstract

Existing Multimodal Large Language Models (MLLMs) often suffer from hallucinations in long video understanding (LVU), primarily due to the imbalance between textual and visual tokens. Observing that MLLMs handle short visual inputs well, recent LVU works alleviate hallucinations by automatically parsing the vast visual data into manageable segments that can be effectively processed by MLLMs. SFT-based tool-calling methods can serve this purpose, but they typically require vast amounts of fine-grained, high-quality data and suffer from constrained tool-calling trajectories. We propose VideoTIR, a novel framework that leverages Reinforcement Learning (RL) to encourage proper usage of comprehensive multi-level toolkits for efficient long video understanding. VideoTIR explores both Zero-RL and SFT cold-starting to enable MLLMs to retrieve and focus on meaningful video segments, images, and regions, improving both the accuracy and the efficiency of long video understanding. To reduce redundant tool-calling, we propose Toolkit Action Grouped Policy Optimization (TAGPO), which enhances the efficiency of the calling process through stepwise reward assignment and reuse of failed rollouts. Additionally, we develop a sandbox-based trajectory synthesis framework to generate high-quality trajectory data. Extensive experiments on three long-video QA benchmarks demonstrate the effectiveness and efficiency of our method.

[4] GIFT: Global Irreplaceability Frame Targeting for Efficient Video Understanding

Junpeng Ma, Sashuai Zhou, Guanghao Li, Xin Gao, Yue Cao, Hengyu Zeng, Yuxiang Yan, Zhibin Wang, Jun Song, Bo Zheng, Shanghang Zhang, Jian Pu

🧩 TL;DR

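This paper proposes GIFT (Global Irreplaceability Frame Targeting), a training-free keyframe selection framework for video. By assessing each frame's intrinsic irreplaceability, it avoids the local optima that existing methods fall into because of greedy decisions and decoupled evaluation, improving both the efficiency and the performance of video large language models.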


📘 Detailed Summary

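Motivation: Video large language models incur a large computational cost when processing dense frames. Existing keyframe selection methods make greedy decisions and evaluate relevance and diversity separately, so they tend to fall into local optima and mistakenly select irrelevant noise frames, which limits practicality and performance.

Method: GIFT uses Directed Diversity to quantify a frame's uniqueness conditioned on relevance, yielding a unified irreplaceability score. Its Budget-Aware Refinement strategy first secures a core set of frames with the highest irreplaceability and then, as the budget expands, prioritizes building crucial temporal context around those selections.

Result: Extensive experiments on LLaVA-Video-7B show that GIFT achieves a maximum average improvement of 12.5% over uniform sampling across long-form video benchmarks, supporting the effectiveness of selecting frames by global irreplaceability.

Conclusion: The study shows that a unified irreplaceability score combined with an adaptive budget-aware strategy can address the global optimization problem in video frame selection. It provides a training-free, high-performance option for efficient deployment of video large language models and demonstrates the advantage of global evaluation over local greedy decisions.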


📄 Abstract

Video Large Language Models (VLMs) have achieved remarkable success in video understanding, but the significant computational cost from processing dense frames severely limits their practical application. Existing methods alleviate this by selecting keyframes, but their greedy decision-making, combined with a decoupled evaluation of relevance and diversity, often falls into local optima and results in erroneously selecting irrelevant noise frames. To address these challenges, we propose GIFT: Global Irreplaceability Frame Targeting, a novel training-free framework that selects frames by assessing their intrinsic irreplaceability. Specifically, we first introduce Directed Diversity to quantify a frame's uniqueness conditioned on relevance, which allows us to formulate a unified irreplaceability score. Subsequently, our Budget-Aware Refinement strategy employs an adaptive iterative process that first secures a core set of frames with the highest irreplaceability, and then shifts its priority to building crucial temporal context around these selections as the budget expands. Extensive experiments demonstrate that GIFT achieves a maximum average improvement of 12.5% across long-form video benchmarks on LLaVA-Video-7B compared to uniform sampling.

[5] Bridging Perception and Reasoning: Token Reweighting for RLVR in Multimodal LLMs

Jinda Lu, Junkang Wu, Jinghan Li, Kexin Huang, Shuo Yang, Guoyin Wang, Jiancan Wu, Xiang Wang, Xiangnan He

🧩 TL;DR

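This paper proposes Token-Reweighting (ToR), a plug-and-play strategy for the optimization problem that arises when perception and reasoning tokens are interdependent in reinforcement learning with verifiable rewards for multimodal large language models. By dynamically reweighting critical tokens on top of existing methods, it improves performance on multimodal reasoning benchmarks.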


📘 Detailed Summary

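Motivation: Extending reinforcement learning with verifiable rewards (RLVR) to multimodal large language models faces a fundamental challenge: model responses interleave perception-related tokens, which ground visual content, with reasoning-related tokens, which construct reasoning chains. The two token types instantiate distinct yet interdependent capacities, visual grounding and symbolic reasoning, so optimizing either in isolation is insufficient.

Method: The paper proposes a plug-and-play token reweighting strategy that identifies critical tokens of both types, perception and reasoning, and dynamically reweights them during RLVR training, explicitly modeling the interdependence between them. The strategy can be applied on top of existing methods such as GRPO and DAPO.

Result: Experiments show that optimizing only perception tokens or only reasoning tokens consistently underperforms, whereas ToR delivers consistent gains across multiple multimodal reasoning benchmarks and reaches state-of-the-art performance with both accurate visual grounding and coherent reasoning.

Conclusion: The study reveals the inherent coupling between perception and reasoning tokens in multimodal large language models. The proposed token reweighting strategy offers a way to optimize this interdependence effectively and opens a direction for token-level optimization in multimodal reinforcement learning.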


📄 Abstract

Extending Reinforcement Learning with Verifiable Rewards (RLVR) to multimodal large language models (MLLMs) faces a fundamental challenge: their responses inherently interleave perception-related tokens, which ground visual content, with reasoning-related tokens, which construct reasoning chains. These token types instantiate distinct yet interdependent capacities -- visual grounding and symbolic reasoning -- making isolated optimization insufficient. Through token-level empirical analysis, we demonstrate that optimizing either perception- or reasoning-only tokens consistently underperforms full optimization, underscoring their inherent coupling. To address this, we propose a plug-and-play Token-Reweighting (ToR) strategy that explicitly models this interdependence by identifying critical tokens of both types and dynamically reweighting them during RLVR training. Applied on top of existing methods (e.g., GRPO and DAPO), ToR delivers consistent performance gains across multiple multi-modal reasoning benchmarks, achieving state-of-the-art performance with both accurate visual grounding and coherent reasoning.
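
As a rough illustration of what token-level reweighting can look like inside a policy-gradient objective, the sketch below scales each token's contribution to the loss by a weight that is larger for tokens flagged as critical. It is a simplified assumption-laden example: the function name `reweighted_pg_loss`, the fixed `boost` factor, and the mean-normalization of the weights are hypothetical, and the abstract does not specify how ToR identifies critical tokens or how its weights change during training.

```python
import numpy as np

def reweighted_pg_loss(logprobs, advantage, critical_mask, boost=2.0):
    """Token-weighted policy-gradient loss for one sampled response.

    logprobs:      (T,) log-probabilities of the sampled tokens.
    advantage:     scalar sequence-level advantage (as in GRPO-style methods).
    critical_mask: (T,) booleans marking tokens flagged as critical
                   (perception or reasoning); how they are found is not shown.
    boost:         HYPOTHETICAL extra weight given to critical tokens.
    """
    logprobs = np.asarray(logprobs, dtype=float)
    weights = np.where(np.asarray(critical_mask, dtype=bool), boost, 1.0)
    weights = weights / weights.mean()   # keep the overall loss scale unchanged
    return float(-(weights * advantage * logprobs).mean())
```

Normalizing the weights by their mean means that an all-false mask reproduces the unweighted loss, so the sketch changes only the relative emphasis between tokens.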

[6] Probabilistic Concept Graph Reasoning for Multimodal Misinformation Detection

Ruichao Yang, Wei Gao, Xiaobin Zhu, Jing Ma, Hongzhan Lin, Ziyang Luo, Bo-Wen Zhang, Xu-Cheng Yin

🧩 TL;DR

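This paper proposes the Probabilistic Concept Graph Reasoning (PCGR) framework, which reframes multimodal misinformation detection as structured, concept-based reasoning. By building a graph of human-understandable concepts and applying hierarchical attention over it, PCGR provides interpretable and evolvable detection.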


📘 Detailed Summary

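Motivation: Multimodal misinformation is a growing problem. Traditional detectors are opaque black boxes, are fragile against new manipulation tactics, and cannot expose their reasoning. A solution is needed that keeps detection accuracy high while providing a transparent reasoning chain.

Method: PCGR follows a build-then-infer paradigm. It first constructs a graph of human-understandable concept nodes, including novel high-level concepts automatically discovered and validated by multimodal large language models. It then applies hierarchical attention over this concept graph to infer claim veracity, producing interpretable reasoning chains that link evidence to conclusions.

Result: Experiments show that PCGR achieves state-of-the-art multimodal misinformation detection accuracy and robustness to emerging manipulation types, outperforming prior methods in both coarse detection and fine-grained manipulation recognition.

Conclusion: The study shows that concept-graph reasoning can address both the interpretability and the evolvability of multimodal misinformation detection. It offers a new paradigm for building transparent detection systems that adapt to new threats and has practical application value.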


📄 Abstract

Multimodal misinformation poses an escalating challenge that often evades traditional detectors, which are opaque black boxes and fragile against new manipulation tactics. We present Probabilistic Concept Graph Reasoning (PCGR), an interpretable and evolvable framework that reframes multimodal misinformation detection (MMD) as structured and concept-based reasoning. PCGR follows a build-then-infer paradigm, which first constructs a graph of human-understandable concept nodes, including novel high-level concepts automatically discovered and validated by multimodal large language models (MLLMs), and then applies hierarchical attention over this concept graph to infer claim veracity. This design produces interpretable reasoning chains linking evidence to conclusions. Experiments demonstrate that PCGR achieves state-of-the-art MMD accuracy and robustness to emerging manipulation types, outperforming prior methods in both coarse detection and fine-grained manipulation recognition.

[7] AnyDoc: Enhancing Document Generation via Large-Scale HTML/CSS Data Synthesis and Height-Aware Reinforcement Optimization

Jiawei Lin, Wanrong Zhu, Vlad I Morariu, Christopher Tensmeyer

🧩 TL;DR

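This paper proposes AnyDoc, a unified framework that handles multiple document generation tasks. By building the large-scale synthetic dataset DocHTML and combining fine-tuning of multimodal large language models with height-aware reinforcement learning post-training, it supports diverse generation across 111 document categories.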


📘 Detailed Summary

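Motivation: Document generation research faces two main challenges. First, there is no large-scale dataset covering a wide range of document categories; existing datasets are mostly human-crafted and limited in scale. Second, existing methods struggle to handle multiple document generation tasks in a unified way, such as intention-to-document, document derendering, and element-to-document.

Method: AnyDoc first establishes a scalable data synthesis pipeline that automatically generates documents in HTML/CSS form, producing the DocHTML dataset with 265,206 samples spanning 111 categories and 32 styles. On this dataset it fine-tunes multimodal large language models for the three document generation tasks. It then adds a height-aware reinforcement learning post-training procedure, whose reward function is based on the difference between predicted and target document heights and penalizes content overflow.

Result: Experiments show that AnyDoc outperforms both general-purpose multimodal large language models and task-specific baselines on all three tasks: intention-to-document, document derendering, and element-to-document. Qualitative evaluation shows that the generated documents are high-quality and diverse, and quantitative metrics confirm its advantage in content accuracy and visual fidelity.

Conclusion: The study shows that training on large-scale synthetic data together with specialized post-training can substantially improve the generality and performance of document generation systems. AnyDoc's unified framework offers a feasible approach to AI-driven content creation across many document categories, and its height-aware reinforcement learning provides an effective way to mitigate content overflow.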


📄 Abstract

Document generation has gained growing attention in the field of AI-driven content creation. In this work, we push its boundaries by introducing AnyDoc, a framework capable of handling multiple generation tasks across a wide spectrum of document categories, all represented in a unified HTML/CSS format. To overcome the limited coverage and scale of existing human-crafted document datasets, AnyDoc first establishes a scalable data synthesis pipeline to automatically generate documents in HTML/CSS form. This pipeline yields DocHTML, a large-scale dataset containing 265,206 document samples, while spanning 111 categories and 32 distinct styles. Additionally, all documents are equipped with comprehensive metadata, including design intentions, HTML/CSS source code, visual assets, and rendered screenshots. Building on the curated dataset, AnyDoc fine-tunes multi-modal large language models (MLLMs) to achieve three practical document generation tasks: intention-to-document, document derendering, and element-to-document. To address the content overflow issue observed during fine-tuning, AnyDoc further incorporates a height-aware reinforcement learning (HARL) post-training procedure. By defining a reward function based on the difference between predicted and target document heights, overflow is penalized and gradually mitigated during HARL, thereby enhancing overall performance. Qualitative and quantitative experiments demonstrate that AnyDoc outperforms both general-purpose MLLMs and task-specific baselines across all three tasks.
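
The abstract says the HARL reward is based on the difference between predicted and target document heights but does not give its form. The sketch below shows one simple possibility, a reward that decreases linearly with the relative height error and is clipped at zero. The function name `height_reward`, the linear shape, and the normalization by the target height are all assumptions made for illustration.

```python
def height_reward(pred_height, target_height):
    """Return a reward in [0, 1] that falls as the rendered height drifts from the target.

    Both heights are in the same unit (e.g. pixels of the rendered page).
    The linear, clipped form is an ASSUMPTION; the source only states that the
    reward depends on the height difference.
    """
    if target_height <= 0:
        raise ValueError("target_height must be positive")
    rel_err = abs(pred_height - target_height) / target_height
    return max(0.0, 1.0 - rel_err)
```

Under this form, a document rendered at 1500 px against a 1000 px target scores 0.5, so overflow is penalized in proportion to how far the content spills past the intended page height.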

[8] Demographic Fairness in Multimodal LLMs: A Benchmark of Gender and Ethnicity Bias in Face Verification

Ünsal Öztürk, Hatef Otroshi Shahreza, Sébastien Marcel

🧩 TL;DR

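This study benchmarks the fairness of nine open-source multimodal large language models on face verification. It finds that FaceLLM-8B, the only face-specialized model, substantially outperforms general-purpose MLLMs, and that a model's accuracy and its fairness are not necessarily linked.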


📘 Detailed Summary

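Motivation: Multimodal large language models have recently been explored as face verification systems, but their demographic fairness has not been adequately studied. This work aims to fill that gap by evaluating the fairness of MLLMs across ethnicity and gender groups.

Method: The study benchmarks nine open-source MLLMs from six model families, ranging from 2B to 8B parameters, on the IJB-C and RFW face verification protocols across four ethnicity groups and two gender groups. Verification accuracy is measured with the Equal Error Rate and the True Match Rate at multiple operating points, and demographic disparity is quantified with four fairness metrics based on the False Match Rate.

Result: FaceLLM-8B, the only face-specialized model in the study, substantially outperforms general-purpose MLLMs on both benchmarks. The observed bias patterns differ from those of traditional face recognition systems, and the most affected group varies with the benchmark and the model. The most accurate models are not necessarily the fairest, and models with poor overall accuracy can appear fair because they produce uniformly high error rates across all demographic groups.

Conclusion: The study reveals complex demographic fairness patterns for MLLMs in face verification and highlights the disconnect between accuracy and fairness, providing a benchmark and design insights for building fairer multimodal face verification systems.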


📄 Abstract

Multimodal Large Language Models (MLLMs) have recently been explored as face verification systems that determine whether two face images are of the same person. Unlike dedicated face recognition systems, MLLMs approach this task through visual prompting and rely on general visual and reasoning abilities. However, the demographic fairness of these models remains largely unexplored. In this paper, we present a benchmarking study that evaluates nine open-source MLLMs from six model families, ranging from 2B to 8B parameters, on the IJB-C and RFW face verification protocols across four ethnicity groups and two gender groups. We measure verification accuracy with the Equal Error Rate and True Match Rate at multiple operating points per demographic group, and we quantify demographic disparity with four FMR-based fairness metrics. Our results show that FaceLLM-8B, the only face-specialised model in our study, substantially outperforms general-purpose MLLMs on both benchmarks. The bias patterns we observe differ from those commonly reported for traditional face recognition, with different groups being most affected depending on the benchmark and the model. We also note that the most accurate models are not necessarily the fairest and that models with poor overall accuracy can appear fair simply because they produce uniformly high error rates across all demographic groups.
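
For readers unfamiliar with the metrics named above, the sketch below computes the false match rate (FMR), the false non-match rate (FNMR, equal to one minus the True Match Rate), and an approximate Equal Error Rate from lists of genuine and impostor similarity scores. The last helper, `fmr_ratio`, is one simple disparity measure chosen for illustration; the abstract does not name its four FMR-based fairness metrics, so this is not claimed to be one of them. All function names are hypothetical.

```python
import numpy as np

def fmr_fnmr(genuine, impostor, threshold):
    """FMR and FNMR when a pair is accepted if its score >= threshold."""
    fmr = float(np.mean(np.asarray(impostor) >= threshold))   # impostors accepted
    fnmr = float(np.mean(np.asarray(genuine) < threshold))    # genuine pairs rejected
    return fmr, fnmr

def equal_error_rate(genuine, impostor):
    """Approximate EER: scan observed scores for the threshold where FMR and FNMR are closest."""
    thresholds = np.unique(np.concatenate([genuine, impostor]))
    rates = [fmr_fnmr(genuine, impostor, t) for t in thresholds]
    fmr, fnmr = min(rates, key=lambda r: abs(r[0] - r[1]))
    return (fmr + fnmr) / 2.0

def fmr_ratio(impostor_by_group, threshold):
    """Max/min FMR across groups at one shared threshold (illustrative disparity measure)."""
    fmrs = [float(np.mean(np.asarray(s) >= threshold)) for s in impostor_by_group.values()]
    return max(fmrs) / min(fmrs) if min(fmrs) > 0 else float("inf")
```

A ratio near 1 means the groups are falsely matched at similar rates at that threshold. As the abstract notes, this can also happen when every group has a uniformly high error rate, so a disparity measure should be read alongside the accuracy metrics.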

[9] SlotVTG: Object-Centric Adapter for Generalizable Video Temporal Grounding

Jiwook Han, Geo Ahn, Youngrae Kim, Jinwoo Choi

🧩 TL;DR

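This paper proposes SlotVTG, a framework that uses a lightweight slot adapter to steer multimodal large language models toward object-centric visual reasoning. It substantially improves out-of-domain generalization in video temporal grounding while keeping in-domain performance competitive.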


📘 Detailed Summary

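Motivation: Multimodal large language models perform well on video temporal grounding, but their coarse recognition capabilities are insufficient for fine-grained temporal understanding, so task-specific fine-tuning is required. That fine-tuning causes models to memorize dataset-specific shortcuts rather than ground their predictions in the actual visual content, which leads to poor out-of-domain generalization. Object-centric learning, which decomposes scenes into entity-level representations, is a promising remedy, but existing approaches require re-running the entire multi-stage training pipeline from scratch.

Method: SlotVTG introduces a lightweight slot adapter that decomposes visual tokens into abstract slots via slot attention and reconstructs the original sequence. Objectness priors from a self-supervised vision model encourage semantically coherent slot formation, steering the multimodal large language model toward object-centric, input-grounded visual reasoning at minimal cost.

Result: Cross-domain evaluation on standard video temporal grounding benchmarks shows that SlotVTG significantly improves out-of-domain robustness while maintaining competitive in-domain performance with minimal overhead. The experiments support the framework's effectiveness in improving generalization, particularly on unseen data distributions.

Conclusion: SlotVTG offers an efficient way to improve the generalization of multimodal large language models in video temporal grounding, using object-centric representation learning to reduce the influence of dataset bias. The study shows the potential of lightweight adapters for steering pretrained models toward fine-grained visual reasoning and provides a practical approach to improving out-of-domain performance.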


📄 Abstract

Multimodal Large Language Models (MLLMs) have shown strong performance on Video Temporal Grounding (VTG). However, their coarse recognition capabilities are insufficient for fine-grained temporal understanding, making task-specific fine-tuning indispensable. This fine-tuning causes models to memorize dataset-specific shortcuts rather than faithfully grounding in the actual visual content, leading to poor Out-of-Domain (OOD) generalization. Object-centric learning offers a promising remedy by decomposing scenes into entity-level representations, but existing approaches require re-running the entire multi-stage training pipeline from scratch. We propose SlotVTG, a framework that steers MLLMs toward object-centric, input-grounded visual reasoning at minimal cost. SlotVTG introduces a lightweight slot adapter that decomposes visual tokens into abstract slots via slot attention and reconstructs the original sequence, where objectness priors from a self-supervised vision model encourage semantically coherent slot formation. Cross-domain evaluation on standard VTG benchmarks demonstrates that our approach significantly improves OOD robustness while maintaining competitive In-Domain (ID) performance with minimal overhead.
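
The decompose-then-reconstruct idea can be sketched with a stripped-down slot-attention step. In standard slot attention the softmax is taken over slots, so slots compete for each token, and each slot is then updated from a weighted mean of the tokens assigned to it. The version below omits the learned query/key/value projections and the recurrent update of the original formulation, and its `reconstruct` helper is only one plausible way to rebuild the sequence; the abstract does not describe SlotVTG's adapter at this level of detail, so every name and simplification here is an assumption.

```python
import numpy as np

def slot_attention_step(tokens, slots):
    """One simplified slot-attention update (no learned projections, no GRU).

    tokens: (N, D) visual tokens; slots: (K, D) current slot vectors.
    Returns updated slots (K, D) and the token-to-slot assignment (N, K).
    """
    d = tokens.shape[1]
    logits = tokens @ slots.T / np.sqrt(d)             # (N, K) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=1, keepdims=True)      # softmax over slots
    # Normalize over tokens so each slot becomes a weighted MEAN of its tokens.
    weights = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
    new_slots = weights.T @ tokens                     # (K, D)
    return new_slots, attn

def reconstruct(attn, slots):
    """Rebuild an (N, D) sequence by mixing slots with each token's assignment."""
    return attn @ slots
```

On tokens that form well-separated groups, each slot settles on one group and the reconstruction stays close to the input, which is the property an object-centric bottleneck relies on.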

[10] PSDesigner: Automated Graphic Design with a Human-Like Creative Workflow

Xincheng Shuai, Song Tang, Yutong Huang, Henghui Ding, Dacheng Tao

🧩 TL;DR

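This paper proposes PSDesigner, an automated graphic design system that emulates the creative workflow of human designers. Using CreativePSD, a dataset of high-quality PSD files annotated with operation traces, the system learns to infer and execute tool calls that manipulate design files, turning user intentions into editable design files.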


📘 Detailed Summary

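Motivation: Current automated design systems use text-to-image models and multimodal large language models to assist graphic design, but they typically simplify professional workflows. This limits flexibility and intuitiveness and prevents user intentions from being faithfully translated into editable design files, which remains the main challenge in automating graphic design.

Method: PSDesigner is built on multiple specialized components. It collects theme-related assets based on user instructions, then autonomously infers and executes tool calls to manipulate design files, such as integrating new assets or refining inferior elements. To give the system strong tool-use capabilities, the authors construct the CreativePSD dataset, which contains a large number of high-quality PSD design files annotated with operation traces across a wide range of design scenarios and artistic styles, so that models can learn expert design procedures.

Result: Extensive experiments show that PSDesigner outperforms existing methods across diverse graphic design tasks and lets non-specialists conveniently create production-quality designs. By emulating the creative workflow of human designers, the system improves the flexibility and intuitiveness of the design process.

Conclusion: By emulating the creative workflow of human designers and building a high-quality annotated dataset, the study delivers an automated graphic design system that can manipulate design files autonomously. It offers a new approach to graphic design automation and lets non-specialists produce professional-quality designs.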


📄 Abstract

Graphic design is a creative and innovative process that plays a crucial role in applications such as e-commerce and advertising. However, developing an automated design system that can faithfully translate user intentions into editable design files remains an open challenge. Although recent studies have leveraged powerful text-to-image models and MLLMs to assist graphic design, they typically simplify professional workflows, resulting in limited flexibility and intuitiveness. To address these limitations, we propose PSDesigner, an automated graphic design system that emulates the creative workflow of human designers. Building upon multiple specialized components, PSDesigner collects theme-related assets based on user instructions, and autonomously infers and executes tool calls to manipulate design files, such as integrating new assets or refining inferior elements. To endow the system with strong tool-use capabilities, we construct a design dataset, CreativePSD, which contains a large number of high-quality PSD design files annotated with operation traces across a wide range of design scenarios and artistic styles, enabling models to learn expert design procedures. Extensive experiments demonstrate that PSDesigner outperforms existing methods across diverse graphic design tasks, empowering non-specialists to conveniently create production-quality designs.

cs.AI

[11] Can MLLMs Read Students' Minds? Unpacking Multimodal Error Analysis in Handwritten Math

Dingjie Song, Tianlong Xu, Yi-Fan Zhang, Hang Li, Zhiling Yan, Xing Fan, Haoyang Li, Lichao Sun, Qingsong Wen

🧩 TL;DR

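This study introduces ScratchMath, a benchmark for evaluating how well multimodal large language models explain and classify errors in students' handwritten math scratchwork. It reveals significant gaps between existing models and human experts in visual recognition and logical reasoning.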


📘 Detailed Summary

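Motivation: Existing educational NLP focuses mainly on textual responses and neglects the complexity and multimodality of authentic handwritten scratchwork. Current multimodal large language models typically adopt an "examinee perspective", prioritizing correct answers over diagnosing student errors. As a result, there is no benchmark dedicated to explaining and classifying errors in handwritten math scratchwork.

Method: The study introduces the ScratchMath benchmark, which comprises 1,720 mathematics samples from Chinese primary and middle school students and supports two key tasks, Error Cause Explanation and Error Cause Classification, with seven defined error types. The data is annotated through a multi-stage human-machine collaborative process involving expert labeling, review, and verification. Sixteen leading multimodal large language models are systematically evaluated.

Result: The evaluation shows significant performance gaps between multimodal large language models and human experts, especially in visual recognition and logical reasoning. Proprietary models notably outperform open-source models, and large reasoning models show strong potential for error explanation. All evaluation data and frameworks are publicly available to facilitate further research.

Conclusion: The study underlines the importance of developing multimodal models for educational diagnosis rather than answer generation and exposes the limitations of current models on complex handwritten scratchwork. It provides a benchmark and direction for personalized educational feedback systems, and the public dataset and framework should support further research in educational AI.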


📄 Abstract

Assessing student handwritten scratchwork is crucial for personalized educational feedback but presents unique challenges due to diverse handwriting, complex layouts, and varied problem-solving approaches. Existing educational NLP primarily focuses on textual responses and neglects the complexity and multimodality inherent in authentic handwritten scratchwork. Current multimodal large language models (MLLMs) excel at visual reasoning but typically adopt an "examinee perspective", prioritizing generating correct answers rather than diagnosing student errors. To bridge these gaps, we introduce ScratchMath, a novel benchmark specifically designed for explaining and classifying errors in authentic handwritten mathematics scratchwork. Our dataset comprises 1,720 mathematics samples from Chinese primary and middle school students, supporting two key tasks: Error Cause Explanation (ECE) and Error Cause Classification (ECC), with seven defined error types. The dataset is meticulously annotated through rigorous human-machine collaborative approaches involving multiple stages of expert labeling, review, and verification. We systematically evaluate 16 leading MLLMs on ScratchMath, revealing significant performance gaps relative to human experts, especially in visual recognition and logical reasoning. Proprietary models notably outperform open-source models, with large reasoning models showing strong potential for error explanation. All evaluation data and frameworks are publicly available to facilitate further research.

[12] ReLope: KL-Regularized LoRA Probes for Multimodal LLM Routing

Yaopei Zeng, Congchao Wang, Blake JianHang Chen, Lu Lin

🧩 TL;DR

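This paper addresses the degradation of probe routing in multimodal large language models and proposes two improvements, the Attention Probe and the KL-Regularized LoRA Probe, which improve routing in multimodal settings by raising the quality of hidden states.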


📘 Detailed Summary

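Motivation: Existing probe routing methods work well for text-only large language models but degrade substantially on multimodal large language models. The main reason is that visual inputs weaken the separability of correctness signals in hidden states, making those signals hard to extract with standard probe designs.

Method: The paper proposes two complementary approaches. The Attention Probe aggregates hidden states from the preceding layer based on attention scores to recover distributed correctness signals. The KL-Regularized LoRA Probe inserts a lightweight LoRA adapter and applies a KL regularizer to learn routing-aware representations.

Result: Comprehensive experiments show that the proposed methods consistently outperform baselines across multiple benchmarks, supporting the view that improving hidden-state quality is key to effective routing in multimodal large language models.

Conclusion: The study indicates that improving the quality of hidden states is key to effective routing in multimodal large language models. The proposed attention-based aggregation and regularization techniques offer practical solutions to routing in multimodal settings and support building multimodal systems that balance performance and cost.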


📄 Abstract

Routing has emerged as a promising strategy for balancing performance and cost in large language model (LLM) systems that combine lightweight models with powerful but expensive large models. Recent studies show that probe routing, which predicts the correctness of a small model using its hidden states, provides an effective solution in text-only LLMs. However, we observe that these probes degrade substantially when applied to multimodal LLMs (MLLMs). Through empirical analysis, we find that the presence of visual inputs weakens the separability of correctness signals in hidden states, making them harder to extract using standard probe designs. To address this challenge, we introduce two complementary approaches for improving probe routing in MLLMs. First, we propose the Attention Probe, which aggregates hidden states from the preceding layer based on attention scores to recover distributed correctness signals. Second, we present the KL-Regularized LoRA Probe (ReLope), which inserts a lightweight LoRA adapter and applies a KL regularizer to learn routing-aware representations. Comprehensive experiments show that our methods consistently outperform baselines, suggesting that improving the quality of hidden states is key to effective routing in MLLMs. Our code is available at https://github.com/Spinozaaa/ReLope.
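
The basic probe-routing loop that this work builds on can be sketched in a few lines: a linear probe reads a hidden state from the small model, estimates the probability that its answer is correct, and the query is escalated to the large model when that estimate is low. The sketch covers only this baseline; it does not implement the Attention Probe or the LoRA adapter with KL regularization. The names `probe_confidence` and `route`, the logistic form of the probe, and the 0.5 threshold are assumptions for illustration.

```python
import numpy as np

def probe_confidence(hidden, w, b):
    """Linear probe: estimated probability that the small model's answer is correct.

    hidden: (D,) hidden state taken from the small model.
    w, b:   probe parameters, assumed to be trained beforehand on labeled examples.
    """
    return float(1.0 / (1.0 + np.exp(-(hidden @ w + b))))

def route(hidden, w, b, threshold=0.5):
    """Keep the small model when the probe is confident; otherwise escalate."""
    return "small" if probe_confidence(hidden, w, b) >= threshold else "large"
```

Raising `threshold` sends more queries to the large model, trading cost for accuracy. The abstract's observation is that with visual inputs the hidden states separate correct from incorrect answers less cleanly, so a probe of this kind becomes less reliable without the proposed improvements.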