cs.CV

[1] Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation

Yiwen Tang, Zoey Guo, Kaixin Zhu, Ray Zhang, Qizhi Chen, Dongzhi Jiang, Junli Liu, Bohan Zeng, Haoming Song, Delin Qu, Tianyi Bai, Dan Xu, Wentao Zhang, Bin Zhao

🧩 TL;DR

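This study is the first systematic application of reinforcement learning to autoregressive text-to-3D generation. It proposes the hierarchical optimization method Hi-GRPO and the AR3D-R1 model, using reward design and algorithmic improvements to address the spatial complexity and global consistency challenges of 3D generation.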


📘 Detailed Summary

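Motivation: Although reinforcement learning has proven effective for 2D image generation, applying it to 3D generation remains difficult. 3D objects have higher spatial complexity and require globally consistent geometry and fine-grained local textures, which makes 3D generation highly sensitive to reward design and RL algorithms. Existing work has explored this setting only sparsely.

Method: The study explores several dimensions systematically. It evaluates reward dimensions and shows the importance of alignment with human preference; studies GRPO variants, highlighting the effectiveness of token-level optimization; introduces the MME-3DR benchmark; and proposes Hi-GRPO, a hierarchical RL paradigm that optimizes global-to-local hierarchical 3D generation through dedicated reward ensembles. These findings lead to the AR3D-R1 model.

Result: Experiments show that general multi-modal models provide robust signals for 3D attributes, that token-level optimization is effective, and that Hi-GRPO handles the natural hierarchy of 3D generation. AR3D-R1 is the first RL-enhanced text-to-3D model, with expertise spanning coarse shape to texture refinement.

Conclusion: The study offers insights into RL-driven reasoning for 3D generation and demonstrates that reinforcement learning is feasible for complex 3D generation tasks. The proposed hierarchical optimization method and systematic analysis framework provide a theoretical and practical basis for future 3D generation research and advance autoregressive 3D generation.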


📄 Abstract

Reinforcement learning (RL), earlier proven to be effective in large language and multi-modal models, has been successfully extended to enhance 2D image generation recently. However, applying RL to 3D generation remains largely unexplored due to the higher spatial complexity of 3D objects, which require globally consistent geometry and fine-grained local textures. This makes 3D generation significantly sensitive to reward designs and RL algorithms. To address these challenges, we conduct the first systematic study of RL for text-to-3D autoregressive generation across several dimensions. (1) Reward designs: We evaluate reward dimensions and model choices, showing that alignment with human preference is crucial, and that general multi-modal models provide robust signal for 3D attributes. (2) RL algorithms: We study GRPO variants, highlighting the effectiveness of token-level optimization, and further investigate the scaling of training data and iterations. (3) Text-to-3D Benchmarks: Since existing benchmarks fail to measure implicit reasoning abilities in 3D generation models, we introduce MME-3DR. (4) Advanced RL paradigms: Motivated by the natural hierarchy of 3D generation, we propose Hi-GRPO, which optimizes the global-to-local hierarchical 3D generation through dedicated reward ensembles. Based on these insights, we develop AR3D-R1, the first RL-enhanced text-to-3D model, expert from coarse shape to texture refinement. We hope this study provides insights into RL-driven reasoning for 3D generation. Code is released at https://github.com/Ivan-Tang-3D/3DGen-R1.
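The abstract refers to GRPO variants and token-level optimization without giving formulas. As a rough illustration only, the sketch below shows the group-relative advantage normalization that GRPO is generally known for, and one common reading of the difference between sequence-level and token-level loss averaging. The function names, the toy rewards, and the toy per-token losses are all hypothetical, and this is not the authors' implementation; the variants studied in the paper may differ.

```python
import math


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples drawn for the same prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def sequence_level_loss(token_losses):
    """Average within each sequence first, then across sequences."""
    per_seq = [sum(seq) / len(seq) for seq in token_losses]
    return sum(per_seq) / len(per_seq)


def token_level_loss(token_losses):
    """Average over every token in the group, so longer sequences weigh more."""
    total = sum(sum(seq) for seq in token_losses)
    count = sum(len(seq) for seq in token_losses)
    return total / count


# Hypothetical rewards for four samples generated from one text prompt.
rewards = [0.2, 0.8, 0.5, 0.5]
advantages = group_relative_advantages(rewards)

# Hypothetical per-token losses: a short sequence and a longer one.
losses = [[1.0, 1.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]
seq_loss = sequence_level_loss(losses)
tok_loss = token_level_loss(losses)
```

With these toy values the two averaging schemes disagree, which is the point of the comparison: sequence-level averaging gives each sample equal weight, while token-level averaging gives each token equal weight.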

[2] MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence

Jingli Lin, Runsen Xu, Shaohao Zhu, Sihan Yang, Peizhou Cao, Yunlong Ran, Miao Hu, Chenming Zhu, Yiman Xie, Yilin Long, Wenbo Hu, Dahua Lin, Tai Wang, Jiangmiao Pang

🧩 TL;DR

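This paper presents MMSI-Video-Bench, the first comprehensive benchmark for evaluating video-based spatial intelligence in multimodal large language models. It contains 1,106 questions grounded in 1,278 video clips, covers four levels (Perception, Planning, Prediction, and Cross-Video Reasoning), and reveals a significant gap between current models and human ability.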


📘 Detailed Summary

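Motivation: There is currently no comprehensive benchmark for assessing how well multimodal large language models understand space over continuous visual input, a capability that is crucial for models to become general-purpose assistants in physical environments. Existing evaluations do not measure progress in video-based spatial intelligence holistically and lack a systematic, multi-level framework for assessing capabilities.

Method: The study introduces MMSI-Video-Bench, which uses a four-level framework (Perception, Planning, Prediction, and Cross-Video Reasoning) and contains 1,106 human-annotated questions grounded in 1,278 clips from 25 datasets and in-house videos. Each item is carefully designed and reviewed by 3DV experts to ensure precise, unambiguous grounding. The benchmark also supports three domain-oriented sub-benchmarks (Indoor Scene Perception Bench, Robot Bench, and Grounding Bench) for targeted capability assessment.

Result: An evaluation of 25 open-source and proprietary MLLMs reveals a striking human-AI gap: many models perform near chance, and the best reasoning model lags humans by nearly 60%. Spatially fine-tuned models still generalize poorly on the benchmark. Fine-grained error analysis exposes systematic failures in geometric reasoning, motion grounding, long-horizon prediction, and cross-video correspondence. Typical frame-sampling strategies transfer poorly to this reasoning-intensive benchmark, and neither 3D spatial cues nor chain-of-thought prompting yields meaningful gains.

Conclusion: The benchmark provides a solid testbed for advancing research on video-based spatial intelligence and exposes core limitations of current MLLMs in spatial understanding. The results suggest that more effective spatial representation learning and reasoning architectures are needed, rather than reliance on fine-tuning or prompt engineering alone. The multi-level framework and fine-grained error analysis give concrete guidance for future research.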


📄 Abstract

Spatial understanding over continuous visual input is crucial for MLLMs to evolve into general-purpose assistants in physical environments. Yet there is still no comprehensive benchmark that holistically assesses the progress toward this goal. In this work, we introduce MMSI-Video-Bench, a fully human-annotated benchmark for video-based spatial intelligence in MLLMs. It operationalizes a four-level framework, Perception, Planning, Prediction, and Cross-Video Reasoning, through 1,106 questions grounded in 1,278 clips from 25 datasets and in-house videos. Each item is carefully designed and reviewed by 3DV experts with explanatory rationales to ensure precise, unambiguous grounding. Leveraging its diverse data sources and holistic task coverage, MMSI-Video-Bench also supports three domain-oriented sub-benchmarks (Indoor Scene Perception Bench, Robot Bench and Grounding Bench) for targeted capability assessment. We evaluate 25 strong open-source and proprietary MLLMs, revealing a striking human--AI gap: many models perform near chance, and the best reasoning model lags humans by nearly 60%. We further find that spatially fine-tuned models still fail to generalize effectively on our benchmark. Fine-grained error analysis exposes systematic failures in geometric reasoning, motion grounding, long-horizon prediction, and cross-video correspondence. We also show that typical frame-sampling strategies transfer poorly to our reasoning-intensive benchmark, and that neither 3D spatial cues nor chain-of-thought prompting yields meaningful gains. We expect our benchmark to establish a solid testbed for advancing video-based spatial intelligence.
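To make "near chance" concrete, the following minimal sketch scores a handful of answers per framework level and compares them with a random-guess baseline. It assumes multiple-choice questions with a fixed number of options, which the abstract does not state; the record layout, the helper names, and the toy answers are hypothetical and are not taken from the benchmark's data or code.

```python
from collections import defaultdict


def accuracy_by_level(records):
    """Compute accuracy per framework level from (level, predicted, answer) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for level, predicted, answer in records:
        total[level] += 1
        correct[level] += int(predicted == answer)
    return {level: correct[level] / total[level] for level in total}


def chance_accuracy(num_options):
    """Expected accuracy of uniform random guessing (assumes multiple choice)."""
    return 1.0 / num_options


# Hypothetical model outputs; the level names follow the abstract.
records = [
    ("Perception", "A", "A"),
    ("Perception", "B", "C"),
    ("Planning", "D", "D"),
    ("Prediction", "A", "B"),
    ("Cross-Video Reasoning", "C", "C"),
    ("Cross-Video Reasoning", "B", "A"),
]
scores = accuracy_by_level(records)
baseline = chance_accuracy(4)  # assumes four answer options per question
```

A model whose per-level score sits close to `baseline` is, under these assumptions, doing little better than guessing on that level.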

[3] BabyVLM-V2: Toward Developmentally Grounded Pretraining and Benchmarking of Vision Foundation Models

Shengao Wang, Wenqi Wang, Zecheng Wang, Max Whitton, Michael Wakeham, Arjun Chandra, Joey Huang, Pengyue Zhu, Helen Chen, David Li, Jeffrey Li, Shawn Li, Andrew Zagula, Amy Zhao, Andrew Zhu, Sayaka Nakamura, Yuki Yamamoto, Jerry Jun Yokono, Aaron Mueller, Bryan A. Plummer, Kate Saenko, Venkatesh Saligrama, Boqing Gong

🧩 TL;DR

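This study presents BabyVLM-V2, a vision-language pretraining framework grounded in children's developmental trajectories. By building the DevCV Toolbox for cognitive evaluation and a longitudinal multimodal pretraining set, it shows that a compact model trained from scratch can perform competitively on child cognition tasks.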


📘 Detailed Summary

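Motivation: Early childhood developmental trajectories offer a natural target for sample-efficient pretraining of vision foundation models, but existing approaches lack a systematic evaluation framework aligned with children's cognitive abilities. Pretraining and evaluation methods that better reflect developmental psychology are needed.

Method: The BabyVLM-V2 framework has three core components: a longitudinal, multifaceted pretraining set that maximizes coverage of an infant-centric audiovisual corpus; a versatile model; and the DevCV Toolbox, which adapts the vision-related measures of the NIH Baby Toolbox into a benchmark suite of ten multimodal tasks covering early-childhood capabilities such as spatial reasoning, memory, and vocabulary understanding.

Result: Experiments show that a compact model pretrained from scratch can achieve competitive performance on the DevCV Toolbox and even outperforms GPT-4o on some tasks, supporting the effectiveness of the developmental pretraining framework.

Conclusion: BabyVLM-V2 provides a principled, unified approach to research on developmentally plausible pretraining of vision foundation models. It is expected to accelerate progress in this area and to support computational modeling of human cognitive development.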


📄 Abstract

Early children's developmental trajectories set up a natural goal for sample-efficient pretraining of vision foundation models. We introduce BabyVLM-V2, a developmentally grounded framework for infant-inspired vision-language modeling that extensively improves upon BabyVLM-V1 through a longitudinal, multifaceted pretraining set, a versatile model, and, most importantly, DevCV Toolbox for cognitive evaluation. The pretraining set maximizes coverage while minimizing curation of a longitudinal, infant-centric audiovisual corpus, yielding video-utterance, image-utterance, and multi-turn conversational data that mirror infant experiences. DevCV Toolbox adapts all vision-related measures of the recently released NIH Baby Toolbox into a benchmark suite of ten multimodal tasks, covering spatial reasoning, memory, and vocabulary understanding aligned with early children's capabilities. Experimental results show that a compact model pretrained from scratch can achieve competitive performance on DevCV Toolbox, outperforming GPT-4o on some tasks. We hope the principled, unified BabyVLM-V2 framework will accelerate research in developmentally plausible pretraining of vision foundation models.

[4] Mull-Tokens: Modality-Agnostic Latent Thinking

Arijit Ray, Ahmed Abdelkader, Chengzhi Mao, Bryan A. Plummer, Kate Saenko, Ranjay Krishna, Leonidas Guibas, Wen-Sheng Chu

🧩 TL;DR

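This paper proposes Mull-Tokens, modality-agnostic latent tokens pre-trained to hold intermediate information in either image or text modalities. They let the model think free-form toward the correct answer and yield clear gains on multimodal reasoning tasks.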


📘 Detailed Summary

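Motivation: Existing multimodal models are limited on real-world reasoning about space, time, affordances, and more. To switch between text and image thoughts they rely on calling specialist tools, costly image generation, or handcrafted reasoning data. These approaches are brittle, hard to scale, and do not support free-form reasoning across modalities.

Method: The paper proposes Mull-Tokens, modality-agnostic latent tokens pre-trained to hold intermediate reasoning information in either image or text modalities. The tokens are first trained with supervision from interleaved text-image traces and then fine-tuned without supervision, using only the final answers. The training recipe draws on best practices from latent reasoning frameworks and lets the model think abstractly in multiple modalities.

Result: Across four challenging spatial reasoning benchmarks, involving tasks such as solving puzzles and taking different perspectives, Mull-Tokens improves on several baselines that use text-only reasoning or interleaved image-text reasoning. It achieves a +3% average improvement, and up to +16% on a reasoning-heavy puzzle-solving split, compared with the strongest baseline.

Conclusion: Mull-Tokens offers a simple answer to the challenges of grounding textual and visual reasoning. Its modality-agnostic latent representation enables abstract thinking across modalities, opens a new direction for multimodal reasoning systems, and shows potential to surpass conventional methods on complex spatial reasoning tasks.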


📄 Abstract

Reasoning goes beyond language; the real world requires reasoning about space, time, affordances, and much more that words alone cannot convey. Existing multimodal models exploring the potential of reasoning with images are brittle and do not scale. They rely on calling specialist tools, costly generation of images, or handcrafted reasoning data to switch between text and image thoughts. Instead, we offer a simpler alternative -- Mull-Tokens -- modality-agnostic latent tokens pre-trained to hold intermediate information in either image or text modalities to let the model think free-form towards the correct answer. We investigate best practices to train Mull-Tokens inspired by latent reasoning frameworks. We first train Mull-Tokens using supervision from interleaved text-image traces, and then fine-tune without any supervision by only using the final answers. Across four challenging spatial reasoning benchmarks involving tasks such as solving puzzles and taking different perspectives, we demonstrate that Mull-Tokens improve upon several baselines utilizing text-only reasoning or interleaved image-text reasoning, achieving a +3% average improvement and up to +16% on a puzzle solving reasoning-heavy split compared to our strongest baseline. Adding to conversations around challenges in grounding textual and visual reasoning, Mull-Tokens offers a simple solution to abstractly think in multiple modalities.

cs.CL

[5] Computational emotion analysis with multimodal LLMs: Current evidence on an emerging methodological opportunity

Hauke Licht

🧩 TL;DR

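This study evaluates how well multimodal large language models perform in video-based emotion analysis. Under ideal conditions the models are reliable and show little demographic bias, but they perform poorly on real-world parliamentary debates, which underscores the need for continued evaluation of generative AI in political analysis.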


📘 Detailed Summary

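Motivation: Multimodal generative AI shows great promise for emotion analysis, but empirical evidence on its effectiveness for analyzing emotions in political communication is lacking. This study aims to fill that gap by evaluating how multimodal large language models actually perform in video-based analysis of emotional arousal.

Method: The study applies multimodal large language models to video-based analysis of emotional arousal on two complementary data sets of human-labeled video recordings. A systematic evaluation framework compares model outputs with the human labels, with particular attention to how performance differs across settings.

Result: Under ideal conditions, the models' emotional arousal ratings are highly reliable and show no clear demographic bias. In real-world parliamentary debates, however, the quality of the arousal ratings drops markedly, with potential negative consequences for downstream statistical inference.

Conclusion: The study underscores the need for continued, thorough evaluation of emerging generative AI methods in political analysis and contributes a replicable evaluation framework. It indicates that current multimodal models remain limited in complex real-world political settings and need further improvement.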


📄 Abstract

Emotions are central to politics, and analyzing their role in political communication has a long tradition. As research increasingly leverages audio-visual materials to analyze the display of emotions, the emergence of multimodal generative AI promises great advances. However, we lack evidence about the effectiveness of multimodal AI in emotion analysis. This paper addresses this gap by evaluating current multimodal large language models (mLLMs) in video-based analysis of emotional arousal in two complementary data sets of human-labeled video recordings. I find that under ideal circumstances, mLLMs' emotional arousal ratings are highly reliable and show little to no indication of demographic bias. However, in recordings of speakers in real-world parliamentary debates, mLLMs' arousal ratings fail to deliver on this promise, with potential negative consequences for downstream statistical inferences. This study therefore underscores the need for continued, thorough evaluation of emerging generative AI methods in political analysis and contributes a suitable replicable framework.
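The abstract does not say which reliability statistic is used. Purely as an illustration of comparing model arousal ratings with human labels, the sketch below computes a Pearson correlation on toy data; the choice of statistic, the function name, and every number are assumptions, not the paper's method or results.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of ratings."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical arousal ratings on a 1-5 scale for five video clips.
human = [1.0, 2.0, 3.0, 4.0, 5.0]
model_controlled = [1.5, 2.5, 3.5, 4.5, 5.5]  # tracks the human labels closely
model_debate = [3.0, 1.0, 4.0, 2.0, 5.0]      # tracks them only loosely

r_controlled = pearson(human, model_controlled)
r_debate = pearson(human, model_debate)
```

Running the same comparison separately per speaker group would be one simple way to look for demographic differences, again as an assumption about how such a check could be set up.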