Table of Contents
cs.CV
[1] AREA3D: Active Reconstruction Agent with Unified Feed-Forward 3D Perception and Vision-Language Guidance
Tianling Xu, Shengzhe Gan, Leslie Gu, Yuelei Li, Fangneng Zhan, Hanspeter Pfister
🧩 TL;DR
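This paper proposes AREA3D, an active 3D reconstruction agent that combines a feed-forward 3D reconstruction model with vision-language guidance to avoid the redundant observations produced by traditional geometric heuristics, achieving state-of-the-art reconstruction accuracy in sparse-view settings.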
📘 Detailed Summary
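Motivation: Existing active 3D reconstruction methods typically rely on hand-crafted geometric heuristics, which can produce redundant observations without substantially improving reconstruction quality. A smarter view-selection strategy is therefore needed to obtain accurate and complete scene geometry efficiently.
Method: The AREA3D framework decouples view-uncertainty modeling from the feed-forward reconstructor, enabling precise uncertainty estimation without expensive online optimization. It also integrates a vision-language model that provides high-level semantic guidance, encouraging informative and diverse viewpoint selection beyond purely geometric cues.
Result: Extensive experiments on scene-level and object-level benchmarks show that AREA3D achieves state-of-the-art reconstruction accuracy, with especially strong results in the sparse-view regime, significantly outperforming traditional methods based on geometric heuristics.
Conclusion: The study demonstrates the effectiveness of combining feed-forward reconstruction models with vision-language guidance, offers a smarter view-selection strategy for active 3D reconstruction, shows the value of semantic information for reconstruction efficiency and quality, and points to a new direction for future autonomous perception systems.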
📄 Abstract
Active 3D reconstruction enables an agent to autonomously select viewpoints to efficiently obtain accurate and complete scene geometry, rather than passively reconstructing scenes from pre-collected images. However, existing active reconstruction methods often rely on hand-crafted geometric heuristics, which can lead to redundant observations without substantially improving reconstruction quality. To address this limitation, we propose AREA3D, an active reconstruction agent that leverages feed-forward 3D reconstruction models and vision-language guidance. Our framework decouples view-uncertainty modeling from the underlying feed-forward reconstructor, enabling precise uncertainty estimation without expensive online optimization. In addition, an integrated vision-language model provides high-level semantic guidance, encouraging informative and diverse viewpoints beyond purely geometric cues. Extensive experiments on both scene-level and object-level benchmarks demonstrate that AREA3D achieves state-of-the-art reconstruction accuracy, particularly in the sparse-view regime. Code will be made available at: https://github.com/TianlingXu/AREA3D .
[2] Breaking Scale Anchoring: Frequency Representation Learning for Accurate High-Resolution Inference from Low-Resolution Training
Wenshuo Wang, Fan Zhang
🧩 TL;DR
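This paper identifies a new problem, Scale Anchoring, and proposes Frequency Representation Learning for zero-shot super-resolution spatiotemporal forecasting. Through resolution-aligned frequency representations and spectral consistency training, the method alleviates Scale Anchoring so that error decreases as resolution increases.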
📘 Detailed Summary
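Motivation: Existing work treats maintaining similar error across resolutions as successful multi-resolution generalization, but a deep learning model that serves as an alternative to a numerical solver should reduce its error as resolution increases. The fundamental limitation is that the highest physical-law frequency representable in low-resolution data is bounded by its Nyquist frequency, so models struggle with the unseen frequency components encountered during high-resolution inference. Errors therefore stay anchored at the low-resolution level, which is misread as successful generalization. The paper defines this as a new problem, Scale Anchoring.
Method: The paper proposes architecture-agnostic Frequency Representation Learning (FRL), which alleviates Scale Anchoring through resolution-aligned frequency representations and spectral consistency training. On grids with higher Nyquist frequencies, FRL-enhanced variants have a more stable frequency response in high-frequency bands, allowing error to decrease as resolution increases at only modest computational overhead.
Result: Within the tasks and resolution range studied, FRL-enhanced variants significantly outperform the baselines, and their error decreases with resolution instead of staying anchored at the low-resolution level. On grids with higher Nyquist frequencies, the FRL-enhanced variants show a more stable high-frequency response, supporting the claim that the method alleviates Scale Anchoring.
Conclusion: The study exposes Scale Anchoring as a new problem, challenges the current understanding of multi-resolution generalization, and proposes an effective Frequency Representation Learning solution. It also gives a more sensible evaluation criterion for deep learning models used as alternatives to numerical solvers (error should decrease with resolution, not stay similar) and offers a new perspective on super-resolution generalization in spatiotemporal forecasting.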
📄 Abstract
Zero-Shot Super-Resolution Spatiotemporal Forecasting requires a deep learning model to be trained on low-resolution data and deployed for inference on high-resolution data. Existing studies consider maintaining similar error across different resolutions as indicative of successful multi-resolution generalization. However, deep learning models serving as alternatives to numerical solvers should reduce error as resolution increases. The fundamental limitation is that the upper bound of the physical-law frequencies that low-resolution data can represent is constrained by its Nyquist frequency, making it difficult for models to process signals containing unseen frequency components during high-resolution inference. This results in errors being anchored at the low-resolution level, which is incorrectly interpreted as successful generalization. We define this fundamental phenomenon as a new problem distinct from existing issues: Scale Anchoring. We therefore propose architecture-agnostic Frequency Representation Learning (FRL). It alleviates Scale Anchoring through resolution-aligned frequency representations and spectral consistency training: on grids with higher Nyquist frequencies, the frequency response in high-frequency bands of FRL-enhanced variants is more stable. This allows errors to decrease with resolution and significantly outperform baselines within our task and resolution range, while incurring only modest computational overhead.
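The Nyquist constraint mentioned in this abstract can be illustrated with a small sketch. This is not the paper's method; the grid sizes (16 and 64 points), the unit-length domain, and the function names are assumptions chosen for illustration. It shows that a 20-cycle wave, which a 64-point grid can represent, is indistinguishable from a 4-cycle wave on a 16-point grid, so a model trained only on the coarse grid never sees that frequency.

```python
import math

def nyquist_frequency(n_points, domain_length=1.0):
    """Highest frequency (cycles per unit length) a uniform grid can represent."""
    spacing = domain_length / n_points
    return 1.0 / (2.0 * spacing)

def sample(freq, n_points):
    """Sample sin(2*pi*freq*x) at n_points uniform points on [0, 1)."""
    return [math.sin(2 * math.pi * freq * i / n_points) for i in range(n_points)]

low_res, high_res = 16, 64  # illustrative grid sizes, not taken from the paper
print(nyquist_frequency(low_res))   # 8.0
print(nyquist_frequency(high_res))  # 32.0

# 20 cycles exceeds the coarse grid's Nyquist limit of 8, so it aliases to
# 20 - 16 = 4 cycles when sampled on the 16-point grid.
aliased = sample(20, low_res)
low_freq = sample(4, low_res)
max_gap = max(abs(a - b) for a, b in zip(aliased, low_freq))
print(max_gap < 1e-9)  # True
```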
[3] ChromouVQA: Benchmarking Vision-Language Models under Chromatic Camouflaged Images
Yunfei Zhang, Yizhuo He, Yuanxun Shao, Zhengtao Yao, Haoyan Xu, Junhao Dong, Zhen Yao, Zhikang Dong
🧩 TL;DR
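This paper introduces ChromouVQA, a large-scale, multi-task benchmark built on Ishihara-style color-vision test plates, for evaluating how well vision-language models recognize targets and separate figure from ground in cluttered backgrounds. It also proposes a model-agnostic contrastive learning method that improves recovery of global shapes.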
📘 Detailed Summary
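Motivation: Although vision-language models have advanced multimodal understanding, they still perform poorly when targets are embedded in cluttered backgrounds that require figure-ground segregation. Existing benchmarks lack a systematic evaluation of model ability under chromatic camouflage and complex geometric fills.
Method: The authors build ChromouVQA, a large-scale multi-task benchmark in the style of Ishihara color-vision test plates. It extends the classic dot-plate design with multiple fill geometries and varies chromatic separation, density, size, occlusion, and rotation. They also propose a model-agnostic contrastive learning framework that aligns silhouettes with their camouflaged renderings to improve recovery of global shapes.
Result: The evaluation reveals a large performance gap between humans and vision-language models on chromatic camouflage tasks, and the gap widens under subtle chromatic contrast or disruptive geometric fills. The proposed contrastive method improves global shape recovery, and the benchmark provides a compact, controlled setting for reproducible evaluation.
Conclusion: ChromouVQA fills a gap in evaluating vision-language models under chromatic camouflage and exposes the limits of their figure-ground segregation in cluttered backgrounds. The proposed contrastive method offers an effective way to improve multimodal understanding, and the benchmark supports reproducible research and further extension.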
📄 Abstract
Vision-Language Models (VLMs) have advanced multimodal understanding, yet still struggle when targets are embedded in cluttered backgrounds requiring figure-ground segregation. To address this, we introduce ChromouVQA, a large-scale, multi-task benchmark based on Ishihara-style chromatic camouflaged images. We extend classic dot plates with multiple fill geometries and vary chromatic separation, density, size, occlusion, and rotation, recording full metadata for reproducibility. The benchmark covers nine vision-question-answering tasks, including recognition, counting, comparison, and spatial reasoning. Evaluations of humans and VLMs reveal large gaps, especially under subtle chromatic contrast or disruptive geometric fills. We also propose a model-agnostic contrastive recipe aligning silhouettes with their camouflaged renderings, improving recovery of global shapes. ChromouVQA provides a compact, controlled benchmark for reproducible evaluation and extension. Code and dataset are available at https://github.com/Chromou-VQA-Benchmark/Chromou-VQA.
[4] Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding
Ziyang Wang, Honglu Zhou, Shijie Wang, Junnan Li, Caiming Xiong, Silvio Savarese, Mohit Bansal, Michael S. Ryoo, Juan Carlos Niebles
🧩 TL;DR
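This paper proposes the Active Video Perception (AVP) framework. Through an iterative plan-observe-reflect process, multimodal large language model agents actively decide when, what, and where to observe in a video, extracting query-relevant evidence efficiently and accurately for long video understanding.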
📘 Detailed Summary
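Motivation: The main challenge of long video understanding is that real-world queries usually depend on sparse, temporally dispersed cues buried in hours of redundant and irrelevant content. Existing agentic frameworks rely on a query-agnostic captioner to perceive the video, which wastes computation on irrelevant content and blurs fine-grained temporal and spatial information. An approach is needed that actively decides what, when, and where to observe.
Method: AVP treats the video as an interactive environment and extracts query-relevant evidence directly through an iterative plan-observe-reflect process run by multimodal large language model agents. In each round, a planner proposes targeted video interactions, an observer executes them to extract time-stamped evidence, and a reflector evaluates whether the evidence is sufficient for the query, either halting with an answer or triggering further observation.
Result: Across five long video understanding benchmarks, AVP achieves the highest performance with significant improvements. Notably, it outperforms the best agentic method by 5.7% in average accuracy while requiring only 18.4% of the inference time and 12.4% of the input tokens.
Conclusion: The study shows that active perception theory is effective for long video understanding: letting agents decide their own observation strategy, instead of passively processing the entire video, substantially improves both computational efficiency and task performance. The framework offers a new design approach for video understanding systems, underlines the importance of query-relevant evidence extraction, and lays groundwork for more capable multimodal interactive systems.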
📄 Abstract
Long video understanding (LVU) is challenging because answering real-world queries often depends on sparse, temporally dispersed cues buried in hours of mostly redundant and irrelevant content. While agentic pipelines improve video reasoning capabilities, prevailing frameworks rely on a query-agnostic captioner to perceive video information, which wastes computation on irrelevant content and blurs fine-grained temporal and spatial information. Motivated by active perception theory, we argue that LVU agents should actively decide what, when, and where to observe, and continuously assess whether the current observation is sufficient to answer the query. We present Active Video Perception (AVP), an evidence-seeking framework that treats the video as an interactive environment and acquires compact, query-relevant evidence directly from pixels. Concretely, AVP runs an iterative plan-observe-reflect process with MLLM agents. In each round, a planner proposes targeted video interactions, an observer executes them to extract time-stamped evidence, and a reflector evaluates the sufficiency of the evidence for the query, either halting with an answer or triggering further observation. Across five LVU benchmarks, AVP achieves the highest performance with significant improvements. Notably, AVP outperforms the best agentic method by 5.7% in average accuracy while requiring only 18.4% of the inference time and 12.4% of the input tokens.
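The plan-observe-reflect loop described in this abstract can be sketched as below. This is a toy illustration under stated assumptions, not the authors' implementation: in AVP the planner, observer, and reflector are MLLM agents, whereas here `plan`, `observe`, and `reflect` are hypothetical rule-based stubs, the video is a list of `(timestamp, caption)` pairs, and the fixed 60-second window is an invented parameter.

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    start_s: float
    end_s: float
    note: str

def plan(query, evidence, duration_s, window_s=60.0):
    """Hypothetical planner: propose the next unobserved window, or None when done."""
    start = len(evidence) * window_s
    if start >= duration_s:
        return None
    return (start, min(start + window_s, duration_s))

def observe(video, window):
    """Hypothetical observer: collect time-stamped notes that fall inside the window."""
    start, end = window
    notes = [text for t, text in video if start <= t < end]
    return Evidence(start, end, "; ".join(notes))

def reflect(query, evidence):
    """Hypothetical reflector: evidence is sufficient once a note mentions the query."""
    for item in evidence:
        if query in item.note:
            return item
    return None

def active_perception(query, video, duration_s, max_rounds=10):
    """Iterate plan -> observe -> reflect until the evidence suffices or rounds run out."""
    evidence = []
    for _ in range(max_rounds):
        window = plan(query, evidence, duration_s)
        if window is None:
            break
        evidence.append(observe(video, window))
        hit = reflect(query, evidence)
        if hit is not None:
            return hit, len(evidence)
    return None, len(evidence)

# Toy "video": (timestamp in seconds, caption) pairs.
video = [(12.0, "person enters"), (130.0, "red car parks"), (400.0, "dog barks")]
hit, rounds = active_perception("red car", video, duration_s=600.0)
print(hit.start_s, hit.end_s, rounds)  # 120.0 180.0 3
```

The loop stops after three rounds instead of scanning all ten windows, which is the early-halting behavior the reflector is meant to provide.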
[5] Self-Improving VLM Judges Without Human Annotations
Inna Wanyin Lin, Yushi Hu, Shuyue Stella Li, Scott Geng, Pang Wei Koh, Luke Zettlemoyer, Tim Althoff, Marjan Ghazvininejad
🧩 TL;DR
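This paper proposes a framework for self-training a vision-language model judge without any human preference annotations. The judge is trained iteratively on self-synthesized data and outperforms larger models, including GPT-4o and Claude 3.5 Sonnet, on several benchmarks.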
📘 Detailed Summary
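Motivation: Training vision-language model judges currently depends mainly on large-scale human preference annotations. This is costly, and the annotations quickly become obsolete as models improve, which motivates a self-training framework that needs no human annotation.
Method: The method is an iterative three-stage framework. It first generates diverse multimodal instruction-response pairs at varying quality levels, then generates reasoning traces and judgments for each pair and removes those that do not match the expected quality level, and finally trains on the correct judge answers and their reasoning traces.
Result: The judge is evaluated on Multimodal RewardBench and VL-RewardBench. The method raises the overall accuracy of a Llama-3.2-11B multimodal judge from 0.38 to 0.51 on VL-RewardBench, with particularly strong gains in the general, hallucination, and reasoning dimensions, and it often outperforms much larger models including Llama-3.2-90B, GPT-4o, and Claude 3.5 Sonnet.
Conclusion: The study shows the potential of an annotation-free self-training framework for judges that can keep pace with rapidly improving VLM capabilities. It offers a practical path toward adaptive evaluation systems and suggests that future self-judging models could co-evolve with VLM capabilities.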
📄 Abstract
Effective judges of Vision-Language Models (VLMs) are crucial for model development. Current methods for training VLM judges mainly rely on large-scale human preference annotations. However, such an approach is costly, and the annotations easily become obsolete as models rapidly improve. In this work, we present a framework to self-train a VLM judge model without any human preference annotations, using only self-synthesized data. Our method is iterative and has three stages: (1) generate diverse multimodal instruction-response pairs at varying quality levels, (2) generate reasoning traces and judgments for each pair, removing the ones that do not match our expected quality levels, and (3) train on correct judge answers and their reasoning traces. We evaluate the resulting judge on Multimodal RewardBench and VL-RewardBench across domains: correctness, preference, reasoning, safety, and visual question-answering. Our method improves a Llama-3.2-11B multimodal judge from 0.38 to 0.51 in overall accuracy on VL-RewardBench, often outperforming much larger models including Llama-3.2-90B, GPT-4o, and Claude 3.5 Sonnet, with particularly strong gains in general, hallucination, and reasoning dimensions. The overall strength of these human-annotation-free results suggests the potential for a future self-judge that evolves alongside rapidly improving VLM capabilities.
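Stage (2) of the method above, filtering judgments against the quality level known from data construction, can be sketched as follows. This is a minimal illustration, not the paper's pipeline: the record fields (`expected_better`, `verdict`, `trace`) and the sample contents are hypothetical, and the abstract does not specify the data format.

```python
def filter_judgments(samples):
    """Keep only judge outputs whose verdict matches the label implied by construction.

    Each sample is a dict with hypothetical keys:
      expected_better: which response ("A" or "B") was synthesized at higher quality
      verdict:         which response the judge preferred
      trace:           the judge's reasoning trace
    """
    return [s for s in samples if s["verdict"] == s["expected_better"]]

# Toy judged pairs; in the described framework these would be self-synthesized.
samples = [
    {"id": 1, "expected_better": "A", "verdict": "A", "trace": "A describes the image correctly."},
    {"id": 2, "expected_better": "B", "verdict": "A", "trace": "A is longer, so it is better."},
    {"id": 3, "expected_better": "B", "verdict": "B", "trace": "A hallucinates a second dog."},
]

training_set = filter_judgments(samples)
print([s["id"] for s in training_set])  # [1, 3]
```

The surviving verdicts and their traces would then form the training data for stage (3); no human preference label is involved because the expected label comes from how the pair was generated.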
[6] Zoom in, Click out: Unlocking and Evaluating the Potential of Zooming for GUI Grounding
Zhiyuan Jiang, Shenghao Xie, Wenyi Li, Wenqiang Zu, Peihang Li, Jiahao Qiu, Siqi Pei, Lei Ma, Tiejun Huang, Mengdi Wang, Shilong Liu
🧩 TL;DR
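This paper proposes ZoomClick, a training-free method that exploits zoom as a strong prior for graphical user interface (GUI) grounding. It significantly improves both general vision-language models and specialized GUI grounding models and achieves state-of-the-art results on several mainstream benchmarks.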
📘 Detailed Summary
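Motivation: Existing grounding methods for GUI agents rely on large-scale bounding-box supervision yet still struggle with cross-platform generalization, complex layout analysis, and fine-grained element localization. The paper explores zoom, a strong but underexplored prior, to address dynamic spatial focusing and adaptive context switching in GUI grounding.
Method: ZoomClick is a training-free GUI grounding method. By characterizing four key properties of zoom (pre-zoom, depth, shrink size, and minimal crop size), it unlocks zoom's full capability for dynamic spatial focusing and adaptive context switching. Zoom acts as a prior that improves the grounding performance of existing models without additional training.
Result: Experiments show that ZoomClick significantly improves both general vision-language models and specialized GUI grounding models, achieving state-of-the-art results on several mainstream benchmarks. For example, UI-Venus-72B reaches a 73.1% success rate on ScreenSpot-Pro. The authors also introduce GUIZoom-Bench, a benchmark for evaluating how well models adapt to zoom.
Conclusion: The study demonstrates that zoom is an effective prior for GUI grounding and opens a direction for improving training and test-time scaling through zoom. GUIZoom-Bench is intended to encourage further research on zoom mechanisms in GUI grounding.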
📄 Abstract
Grounding is a fundamental capability for building graphical user interface (GUI) agents. Although existing approaches rely on large-scale bounding box supervision, they still face various challenges, such as cross-platform generalization, complex layout analysis, and fine-grained element localization. In this paper, we investigate zoom as a strong yet underexplored prior for GUI grounding, and propose a training-free method, ZoomClick. By characterizing four key properties of zoom (i.e., pre-zoom, depth, shrink size, minimal crop size), we unlock its full capabilities for dynamic spatial focusing and adaptive context switching. Experiments demonstrate that our method significantly boosts the performance of both general vision-language and specialized GUI grounding models, achieving state-of-the-art results on several mainstream benchmarks; for example, UI-Venus-72B attains a 73.1% success rate on ScreenSpot-Pro. Furthermore, we present GUIZoom-Bench, a benchmark for evaluating model adaptability to zoom, aiming to inspire future research on improving zoom for further training and test-time scaling in GUI grounding tasks.
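A rough sketch of iterative zooming for grounding is shown below. It is an assumption-laden illustration, not ZoomClick itself: the abstract names the properties depth, shrink size, and minimal crop size but does not define them, so the interpretation here (number of zoom rounds, per-round crop scale factor, and a lower bound on crop side length) is a guess, pre-zoom is not modeled, and `toy_predict` is a hypothetical stand-in for a grounding model whose error shrinks as the crop gets tighter.

```python
def zoom_and_click(predict, width, height, depth=3, shrink=0.5, min_crop=64):
    """Refine a click point by repeatedly cropping around the current guess.

    predict(box) takes (left, top, w, h) and returns (x, y) in full-screen pixels.
    """
    left, top, w, h = 0.0, 0.0, float(width), float(height)
    x, y = predict((left, top, w, h))
    for _ in range(depth):
        new_w, new_h = w * shrink, h * shrink
        if new_w < min_crop or new_h < min_crop:
            break  # stop before the crop loses too much context
        # Center the smaller crop on the current guess, clamped to the screen.
        left = min(max(x - new_w / 2, 0.0), width - new_w)
        top = min(max(y - new_h / 2, 0.0), height - new_h)
        w, h = new_w, new_h
        x, y = predict((left, top, w, h))
    return x, y

TARGET = (700.0, 300.0)  # hypothetical true element location

def toy_predict(box):
    """Toy localizer: its error is 5% of the crop size, so tighter crops are more accurate."""
    _, _, w, h = box
    return TARGET[0] + 0.05 * w, TARGET[1] + 0.05 * h

coarse = toy_predict((0.0, 0.0, 1024.0, 768.0))
refined = zoom_and_click(toy_predict, 1024, 768)
print(abs(coarse[0] - TARGET[0]) > abs(refined[0] - TARGET[0]))  # True
```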
[7] TwinFlow: Realizing One-step Generation on Large Models with Self-adversarial Flows
Zhenglin Cheng, Peng Sun, Jianguo Li, Tao Lin
🧩 TL;DR
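This paper proposes TwinFlow, a framework that trains one-step generative models without relying on a pretrained teacher model or standard adversarial networks. On text-to-image tasks it enables efficient inference, matching the performance of the original 100-step model with a single inference step.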
📘 Detailed Summary
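Motivation: Current multimodal generative models are usually built on multi-step frameworks such as diffusion and flow matching, which makes inference inefficient (40 to 100 function evaluations). Existing acceleration methods have clear limitations: distillation methods either require an iterative procedure or degrade significantly at very few steps, while distillation combined with adversarial training introduces training instability, added complexity, and high GPU memory overhead.
Method: TwinFlow is a simple yet effective framework for training one-step generative models. It removes the dependence on a fixed pretrained teacher model and avoids standard adversarial networks during training, which makes it well suited to building large-scale efficient models. Its scalability is demonstrated by full-parameter training on Qwen-Image-20B, turning it into an efficient few-step generator.
Result: On text-to-image tasks, TwinFlow obtains a GenEval score of 0.83 with one inference step, outperforming strong baselines such as SANA-Sprint and RCGM. With a single step it matches the original 100-step model on GenEval and DPG-Bench, cutting computational cost by 100 times with minor quality degradation.
Conclusion: TwinFlow offers a stable and scalable way to train efficient one-step generative models while avoiding the complexity and instability of existing approaches. It shows that efficient inference is feasible on large-scale models and opens a path to practical deployment of multimodal generative models.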
📄 Abstract
Recent advances in large multi-modal generative models have demonstrated impressive capabilities in multi-modal generation, including image and video generation. These models are typically built upon multi-step frameworks like diffusion and flow matching, which inherently limits their inference efficiency (requiring 40-100 function evaluations; NFE denotes the Number of Function Evaluations). While various few-step methods aim to accelerate inference, existing solutions have clear limitations. Prominent distillation-based methods, such as progressive and consistency distillation, either require an iterative distillation procedure or show significant degradation at very few steps (< 4-NFE). Meanwhile, integrating adversarial training into distillation (e.g., DMD/DMD2 and SANA-Sprint) to enhance performance introduces training instability, added complexity, and high GPU memory overhead due to the auxiliary trained models. To this end, we propose TwinFlow, a simple yet effective framework for training 1-step generative models that bypasses the need for fixed pretrained teacher models and avoids standard adversarial networks during training, making it ideal for building large-scale, efficient models. On text-to-image tasks, our method achieves a GenEval score of 0.83 in 1-NFE, outperforming strong baselines like SANA-Sprint (a GAN loss-based framework) and RCGM (a consistency-based framework). Notably, we demonstrate the scalability of TwinFlow by full-parameter training on Qwen-Image-20B, transforming it into an efficient few-step generator. With just 1-NFE, our approach matches the performance of the original 100-NFE model on both the GenEval and DPG-Bench benchmarks, reducing computational cost by $100\times$ with minor quality degradation. Project page is available at https://zhenglin-cheng.com/twinflow.
[8] Semore: VLM-guided Enhanced Semantic Motion Representations for Visual Reinforcement Learning
Wentao Wang, Chunyang Liu, Kehua Sheng, Bo Zhang, Yan Wang
🧩 TL;DR
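This paper proposes Semore, a framework with a VLM-based dual-path backbone that strengthens representations for visual reinforcement learning by extracting semantic and motion information simultaneously to improve decision-making.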
📘 Detailed Summary
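Motivation: Existing LLM-based reinforcement learning methods focus mainly on guiding the control policy and are held back by the limited representational capacity of the backbone network, especially in visual reinforcement learning, where both semantic content and motion must be understood.
Method: Semore uses a VLM-based dual-path backbone to extract semantic and motion representations simultaneously from RGB flows. It uses the VLM's common-sense knowledge to retrieve key information from observations and a pretrained CLIP model for text-image alignment, embedding ground-truth representations into the backbone. A separately supervised approach guides the extraction of semantics and motion at the same time while letting them interact spontaneously for efficient fusion.
Result: Extensive experiments show that, with VLM guidance at the feature level, the method is efficient and adaptive compared with state-of-the-art methods. All code is released to the research community.
Conclusion: The study shows that integrating VLM guidance at the feature level is effective, provides a new framework for handling semantic and motion information together in visual reinforcement learning, and demonstrates the potential of fused multimodal representations for decision-making tasks.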
📄 Abstract
The growing exploration of Large Language Models (LLMs) and Vision-Language Models (VLMs) has opened avenues for enhancing the effectiveness of reinforcement learning (RL). However, existing LLM-based RL methods often focus on guiding the control policy and face the challenge of limited representations in the backbone networks. To tackle this problem, we introduce Enhanced Semantic Motion Representations (Semore), a new VLM-based framework for visual RL, which can simultaneously extract semantic and motion representations through a dual-path backbone from the RGB flows. Semore utilizes a VLM with common-sense knowledge to retrieve key information from observations, while using the pre-trained CLIP model to achieve text-image alignment, thereby embedding the ground-truth representations into the backbone. To efficiently fuse semantic and motion representations for decision-making, our method adopts a separately supervised approach to simultaneously guide the extraction of semantics and motion, while allowing them to interact spontaneously. Extensive experiments demonstrate that, under the guidance of the VLM at the feature level, our method exhibits efficient and adaptive ability compared to state-of-the-art methods. All code is released.
[9] From Segments to Scenes: Temporal Understanding in Autonomous Driving via Vision-Language Model
Kevin Cannons, Saeed Ranjbar Alvar, Mohammad Asiful Hossain, Ahmad Rezaei, Mohsen Gholami, Alireza Heidarikhazaei, Zhou Weimin, Yong Zhang, Mohammad Akbari
🧩 TL;DR
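This paper introduces the TAD benchmark, which specifically evaluates temporal understanding in autonomous driving scenes, and proposes two training-free methods (Scene-CoT and TCogMap) that improve the performance of existing vision-language models on the task.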
📘 Detailed Summary
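Motivation: Temporal understanding in autonomous driving is an important challenge. Existing benchmarks focus on other video content such as sports, cooking, and movies, and none targets temporal understanding in ego-centric autonomous driving footage, which limits progress of vision-language models in dynamic driving scenes.
Method: The study introduces the TAD benchmark, comprising nearly 6,000 question-answer pairs across 7 human-designed tasks, and proposes two training-free solutions: Scene-CoT, which uses chain-of-thought reasoning, and TCogMap, which incorporates an ego-centric temporal cognitive map. Both can be integrated with existing vision-language models.
Result: On TAD, current state-of-the-art generalist models and autonomous-driving specialist models all show low accuracy, largely because of imperfect fine-grained motion understanding. Scene-CoT and TCogMap improve average accuracy by up to 17.72%.
Conclusion: TAD reveals a marked weakness of current vision-language models in temporal understanding for autonomous driving, and the proposed training-free methods offer an effective way to improve them. The work provides a benchmark and tools for temporal understanding research in autonomous driving and is expected to advance this direction.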
📄 Abstract
Temporal understanding in autonomous driving (AD) remains a significant challenge, even for recent state-of-the-art (SoTA) Vision-Language Models (VLMs). Prior work has introduced datasets and benchmarks aimed at improving temporal reasoning, but these have emphasized other video content, including sports, cooking, and movies. No existing benchmark focuses exclusively on the unique challenges of temporal understanding in ego-centric AD footage. To fill this gap, the Temporal Understanding in Autonomous Driving (TAD) benchmark is presented, which evaluates VLMs' ability to capture the dynamic relationships between actions in AD. TAD comprises nearly 6,000 question-answer (QA) pairs, spanning 7 human-designed tasks. In addition, an evaluation is performed that covers 9 closed- and open-source generalist models as well as SoTA AD specialist models. When applied to TAD, current SoTA models demonstrated substandard accuracies, largely due to imperfect fine-grained motion understanding. To improve motion understanding and overall accuracy on TAD, two novel training-free solutions are proposed: Scene-CoT, which leverages Chain-of-Thought (CoT) reasoning, and TCogMap, which incorporates an ego-centric temporal cognitive map. The proposed approaches are integrated with existing VLMs and improve average accuracy on TAD by up to 17.72%. By introducing TAD, benchmarking multiple SoTA models, and proposing effective enhancements, this work aims to catalyze future research on temporal understanding in AD. The benchmark is available on Hugging Face at https://huggingface.co/datasets/vbdai/TAD, and the evaluation code is available on GitHub at https://github.com/vbdi/tad_bench.
[10] ShaRP: SHAllow-LayeR Pruning for Video Large Language Models Acceleration
Yingjie Xia, Tao Liu, Jinglei Shi, Qingsong Xie, Heng Guo, Jian Yang, Xi Wang
🧩 TL;DR
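This paper proposes ShaRP, an improved attention-based pruning framework for accelerating inference in video large language models. By integrating segment-aware causal masking, positional debiasing, and token deduplication, it prunes effectively at shallow decoder layers while maintaining high performance.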
📘 Detailed Summary
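Motivation: Video large language models face a heavy computational load during the pre-filling stage. Conventional attention-based pruning at shallow decoder layers causes significant performance degradation, especially at high compression rates, mainly because of positional encoding bias and insufficient information interaction.
Method: ShaRP integrates three techniques: segment-aware causal masking to improve token selection, positional debiasing to reduce the effect of positional encoding bias, and token deduplication to remove redundant visual tokens. Together they enable effective pruning at shallow decoder layers without retraining.
Result: Extensive experiments show that ShaRP achieves competitive performance on multiple video understanding benchmarks and remains stable at high compression rates, establishing a new paradigm for accelerating VLLM inference.
Conclusion: The study shows that addressing positional encoding bias and strengthening information interaction makes attention-based pruning effective at shallow decoder layers. This gives a practical route to efficient inference for video large language models and a new direction for research on visual token compression.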
📄 Abstract
Video Large Language Models (VLLMs) face the challenge of high computational load during the pre-filling stage due to the processing of an enormous number of visual tokens. Although attention-based pruning methods are widely used to accelerate inference, trials at early decoder layers often result in significant performance degradation, especially under high compression rates. We argue that while attention-based pruning inherently holds the potential to identify the most relevant visual tokens, its effectiveness in shallow decoder layers is limited by factors such as positional encoding bias and insufficient information interaction. In this paper, we propose an improved attention-based pruning framework, termed ShaRP, that integrates segment-aware causal masking, positional debiasing, and token deduplication for enhanced token selection. It enables effective pruning at shallow layers while maintaining stable performance under high compression rates without retraining. Extensive experiments demonstrate that ShaRP achieves competitive performance across multiple video understanding benchmarks, establishing a new paradigm for accelerating VLLM inference.
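Two of the ingredients named in this abstract, attention-based token selection and token deduplication, can be illustrated with a generic sketch. This is not ShaRP's algorithm: segment-aware causal masking and positional debiasing are omitted, and the greedy selection rule, the cosine-similarity threshold of 0.99, and the toy feature vectors are all assumptions made for illustration. Feature vectors are assumed to be non-zero.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def prune_tokens(features, attention, keep, dup_threshold=0.99):
    """Greedily keep up to `keep` tokens by attention score, skipping near-duplicates.

    Returns the kept token indices in their original order.
    """
    order = sorted(range(len(features)), key=lambda i: attention[i], reverse=True)
    kept = []
    for i in order:
        if len(kept) == keep:
            break
        if all(cosine(features[i], features[j]) < dup_threshold for j in kept):
            kept.append(i)
    return sorted(kept)

# Toy visual tokens: token 1 is nearly identical to token 0.
features = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.7, 0.7]]
attention = [0.40, 0.35, 0.15, 0.10]

print(prune_tokens(features, attention, keep=2))  # [0, 2]
```

Without the deduplication check, the two highest-attention tokens (0 and 1) would both be kept even though they carry almost the same information.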
[11] LoC-Path: Learning to Compress for Pathology Multimodal Large Language Models
Qingqiao Hu, Weimin Lyu, Meilong Xu, Kehan Qi, Xiaoling Hu, Saumya Gupta, Jiawei Zhou, Chao Chen
🧩 TL;DR
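This paper proposes LoC-Path, an efficient multimodal large language model framework for pathology that replaces the expensive slide-level encoder with redundancy-reducing modules, substantially lowering computation and memory requirements while preserving performance.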
📘 Detailed Summary
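Motivation: Whole slide image understanding is fundamentally challenging because of the gigapixel scale and the extreme sparsity of diagnostically relevant regions. Existing pathology multimodal large language models rely on heavy slide-level encoders that process thousands of patch features, leading to excessive computational cost, whereas human experts rely mainly on key regions to reach a diagnosis.
Method: LoC-Path includes a Sparse Token Merger and an MAE-pretrained resampler, which remove local redundancy and compress globally redundant tile tokens into a compact slide-level representation set. A Cross-Attention Routing Adapter and a Token Importance Scorer then integrate the compressed visual representation with the language model in a computation-efficient manner.
Result: Extensive experiments show that the method performs comparably to existing state-of-the-art whole-slide multimodal large language models while requiring significantly less computation and memory, supporting the redundancy-reduction strategy.
Conclusion: The study shows that tile-level features have strong global and local redundancy and that only a small subset of tiles is truly task-relevant. This suggests a new paradigm for efficient whole slide image understanding: attending selectively to key regions, instead of processing every feature by brute force, balances computational efficiency with diagnostic accuracy.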
📄 Abstract
Whole Slide Image (WSI) understanding is fundamentally challenging due to its gigapixel scale and the extreme sparsity of diagnostically relevant regions. Unlike human experts who primarily rely on key areas to arrive at a diagnosis, existing slide-level multimodal large language models (MLLMs) for pathology rely on heavy slide-level encoders that process thousands of patch features in a brute-force manner, resulting in excessive computational cost. In this work, we revisit the WSI-language modeling paradigm and show that tile-level features exhibit strong global and local redundancy, whereas only a small subset of tiles are truly task-relevant. Motivated by this observation, we introduce an efficient MLLM framework, called LoC-Path, that replaces the expensive slide-level encoder with redundancy-reducing modules. We first design a Sparse Token Merger (STM) and an MAE-pretrained resampler to remove local redundancy and compress globally redundant tile tokens into a compact slide-level representation set. We then propose a Cross-Attention Routing Adapter (CARA) and a Token Importance Scorer (TIS) to integrate the compressed visual representation with the language model in a computation-efficient manner. Extensive experiments demonstrate that our approach achieves performance comparable to existing state-of-the-art whole-slide MLLMs, while requiring significantly lower computation and memory.
[12] The Dynamic Prior: Understanding 3D Structures for Casual Dynamic Videos
Zhuoyuan Wu, Xurui Yang, Jiahui Huang, Yue Wang, Jun Gao
🧩 TL;DR
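This paper proposes the Dynamic Prior, which uses vision-language models and the spatial segmentation capability of SAM2 to robustly identify dynamic objects in dynamic videos without task-specific training, significantly improving the accuracy of camera pose estimation and 3D scene reconstruction.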
📘 Detailed Summary
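Motivation: Classical structure-from-motion pipelines struggle with in-the-wild videos that contain dynamic objects, and existing learning-based methods depend on large-scale motion segmentation datasets, which leads to inaccurate segmentation and limited 3D understanding. A way to identify dynamic objects without task-specific training is needed.
Method: The Dynamic Prior framework combines the strong reasoning ability of vision-language models with the fine-grained spatial segmentation of SAM2 to identify dynamic objects without task-specific training. It integrates seamlessly into state-of-the-art pipelines for camera pose optimization, depth reconstruction, and 4D trajectory estimation.
Result: Extensive experiments on synthetic and real-world videos show that the method achieves state-of-the-art performance on motion segmentation and also significantly improves the accuracy and robustness of structural 3D understanding, outperforming existing methods.
Conclusion: The study shows that combining vision-language models with segmentation models can handle dynamic scene understanding effectively without specialized training. It offers a more robust solution for 3D reconstruction and motion analysis and opens a new application of multimodal models to geometric understanding tasks.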
📄 Abstract
Estimating accurate camera poses, 3D scene geometry, and object motion from in-the-wild videos is a long-standing challenge for classical structure from motion pipelines due to the presence of dynamic objects. Recent learning-based methods attempt to overcome this challenge by training motion estimators to filter dynamic objects and focus on the static background. However, their performance is largely limited by the availability of large-scale motion segmentation datasets, resulting in inaccurate segmentation and, therefore, inferior structural 3D understanding. In this work, we introduce the Dynamic Prior to robustly identify dynamic objects without task-specific training, leveraging the powerful reasoning capabilities of Vision-Language Models (VLMs) and the fine-grained spatial segmentation capacity of SAM2. The Dynamic Prior can be seamlessly integrated into state-of-the-art pipelines for camera pose optimization, depth reconstruction, and 4D trajectory estimation. Extensive experiments on both synthetic and real-world videos demonstrate that the Dynamic Prior not only achieves state-of-the-art performance on motion segmentation, but also significantly improves accuracy and robustness for structural 3D understanding.
[13] ParaUni: Enhance Generation in Unified Multimodal Model with Reinforcement-driven Hierarchical Parallel Information Interaction
Jiangtong Tan, Lin Liu, Jie Huanng, Xiaopeng Zhang, Qi Tian, Feng Zhao
🧩 TL;DR
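This paper proposes ParaUni, a unified multimodal model with parallel feature extraction. By integrating features from all layers of the vision-language model in parallel and adding a layer-wise dynamic adjustment mechanism, it significantly improves visual generation quality and multi-reward reinforcement learning performance.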
📘 Detailed Summary
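Motivation: Existing unified multimodal models that combine vision-language models with diffusion models struggle to balance sufficient interaction against flexible implementation. The main reasons are the large representation gap between the two and the fact that the rich hierarchical information in the VLM's layers, from low-level detail to high-level semantics, is underused.
Method: ParaUni extracts features from the VLM's layers in parallel and uses a Layer Integration Module to merge fine-grained details with semantic abstractions, providing the fused representation as a condition to the diffusion model. It further introduces a Layer-wise Dynamic Adjustment Mechanism that uses reinforcement learning to adjust dynamically according to how each layer responds to different rewards, enabling joint optimization of multiple rewards.
Result: Extensive experiments show that ParaUni significantly improves generation quality by exploiting complementary multi-layer features and shows strong potential for multi-reward gains during the reinforcement learning stage, supporting both parallel feature extraction and the layer-wise dynamic adjustment mechanism.
Conclusion: The study reveals the complementary value of features from different VLM layers for visual generation. The parallel integration architecture and layer-wise dynamic adjustment mechanism offer new ideas for unified multimodal model design, particularly for balancing sufficient interaction with implementation flexibility.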
📄 Abstract
Unified multimodal models significantly improve visual generation by combining vision-language models (VLMs) with diffusion models. However, existing methods struggle to balance sufficient interaction and flexible implementation because of the large representation gap between the two. Considering the abundant, hierarchical information in a VLM's layers, from low-level details to high-level semantics, we propose ParaUni. It extracts features from various VLM layers in a parallel ("Para") way for comprehensive information interaction and retains a flexible separation architecture to enhance generation in a unified ("Uni") multimodal model. Concretely, visual features from all VLM layers are fed in parallel into a Layer Integration Module (LIM), which efficiently integrates fine-grained details and semantic abstractions and provides the fused representation as a condition to the diffusion model. To further enhance performance, we reveal that these hierarchical layers respond unequally to different rewards in Reinforcement Learning (RL). Based on this, we design a Layer-wise Dynamic Adjustment Mechanism (LDAM) that uses RL to align with the hierarchical properties of these layers, facilitating improvement across multiple rewards. Extensive experiments show ParaUni leverages complementary multi-layer features to substantially improve generation quality and shows strong potential for multiple reward advances during RL stages. Code is available at https://github.com/JosephTiTan/ParaUni.
[14] UniFS: Unified Multi-Contrast MRI Reconstruction via Frequency-Spatial Fusion
Jialin Li, Yiwei Ren, Kai Pan, Dong Wei, Pujin Cheng, Xian Wu, Xiaoying Tang
🧩 TL;DR
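This paper proposes UniFS, a unified frequency-spatial fusion model that handles multiple k-space undersampling patterns in multi-contrast MR reconstruction without separate training for each pattern, significantly improving generalization.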
📘 Detailed Summary
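Motivation: Existing multi-contrast MR reconstruction methods usually generalize poorly across k-space undersampling patterns and need a separate model trained for each pattern, which limits practical use. They also tend to focus on spatial information while ignoring frequency characteristics, or extract only shallow frequency features, and so fail to exploit complementary cross-modal frequency information.
Method: UniFS contains three key modules: a Cross-Modal Frequency Fusion module for extracting domain-invariant features, an Adaptive Mask-Based Prompt Learning module that adapts dynamically to variations across undersampling patterns, and a Dual-Branch Complementary Refinement module. An adaptive prompt-guided frequency fusion module is introduced for k-space learning to strengthen generalization.
Result: The model is evaluated on the BraTS and HCP datasets under various k-space undersampling patterns and acceleration factors, including previously unseen patterns. Results across multiple scenarios show that UniFS achieves state-of-the-art performance, supporting its generalization ability.
Conclusion: With a unified frequency-spatial fusion framework, UniFS addresses pattern generalization in multi-contrast MR reconstruction. Its adaptive prompt learning and frequency feature extraction offer a new approach to handling diverse k-space undersampling patterns and have clear practical value.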
📄 Abstract
Recently, Multi-Contrast MR Reconstruction (MCMR) has emerged as a hot research topic that leverages high-quality auxiliary modalities to reconstruct undersampled target modalities of interest. However, existing methods often struggle to generalize across different k-space undersampling patterns, requiring the training of a separate model for each specific pattern, which limits their practical applicability. To address this challenge, we propose UniFS, a Unified Frequency-Spatial Fusion model designed to handle multiple k-space undersampling patterns for MCMR tasks without any need for retraining. UniFS integrates three key modules: a Cross-Modal Frequency Fusion module, an Adaptive Mask-Based Prompt Learning module, and a Dual-Branch Complementary Refinement module. These modules work together to extract domain-invariant features from diverse k-space undersampling patterns while dynamically adapting to their variations. Another limitation of existing MCMR methods is their tendency to focus solely on spatial information while neglecting frequency characteristics, or to extract only shallow frequency features, thus failing to fully leverage complementary cross-modal frequency information. To relieve this issue, UniFS introduces an adaptive prompt-guided frequency fusion module for k-space learning, significantly enhancing the model's generalization performance. We evaluate our model on the BraTS and HCP datasets with various k-space undersampling patterns and acceleration factors, including previously unseen patterns, to comprehensively assess UniFS's generalizability. Experimental results across multiple scenarios demonstrate that UniFS achieves state-of-the-art performance. Our code is available at https://github.com/LIKP0/UniFS.
[15] Concept-based Explainable Data Mining with VLM for 3D Detection
Mai Tsujimoto
🧩 TL;DR
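This paper proposes a novel cross-modal framework that uses 2D vision-language models to identify and mine rare objects in driving scenes, thereby improving 3D object detection. Its concept-guided data mining strategy substantially reduces the annotation burden and focuses on the most valuable training samples.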
📘 Detailed Summary
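Motivation: Rare-object detection that relies solely on point cloud data remains challenging in autonomous driving systems. Although vision-language models are strong at image understanding, their potential to improve 3D object detection through intelligent data mining has not been fully explored, particularly for identifying rare but critical objects in driving scenes.
Method: The paper proposes a cross-modal framework that combines complementary techniques (object detection, semantic feature extraction, dimensionality reduction, and multi-faceted outlier detection) into one explainable pipeline. It pairs Isolation Forest and t-SNE-based outlier detection with concept-based filtering to identify semantically meaningful rare objects, and it can extract and annotate targeted rare-object concepts such as construction vehicles, motorcycles, and barriers.
Result: Experiments on the nuScenes dataset show that the concept-guided data mining strategy improves 3D object detection models while using only a fraction of the training data. Compared with the same amount of randomly selected data, the gains are especially notable for challenging categories such as trailers and bicycles.
Conclusion: The study shows that cross-modal data mining with 2D vision-language models is effective, substantially reducing the annotation burden and focusing on the most valuable training samples. The finding matters for efficient dataset curation in safety-critical autonomous driving systems and offers a scalable, efficient approach to rare-object detection.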
📄 Abstract
Rare-object detection remains a challenging task in autonomous driving systems, particularly when relying solely on point cloud data. Although Vision-Language Models (VLMs) exhibit strong capabilities in image understanding, their potential to enhance 3D object detection through intelligent data mining has not been fully explored. This paper proposes a novel cross-modal framework that leverages 2D VLMs to identify and mine rare objects from driving scenes, thereby improving 3D object detection performance. Our approach synthesizes complementary techniques such as object detection, semantic feature extraction, dimensionality reduction, and multi-faceted outlier detection into a cohesive, explainable pipeline that systematically identifies rare but critical objects in driving scenes. By combining Isolation Forest and t-SNE-based outlier detection methods with concept-based filtering, the framework effectively identifies semantically meaningful rare objects. A key strength of this approach lies in its ability to extract and annotate targeted rare object concepts such as construction vehicles, motorcycles, and barriers. This substantially reduces the annotation burden and focuses only on the most valuable training samples. Experiments on the nuScenes dataset demonstrate that this concept-guided data mining strategy enhances the performance of 3D object detection models while utilizing only a fraction of the training data, with particularly notable improvements for challenging object categories such as trailers and bicycles compared with the same amount of random data. This finding has substantial implications for the efficient curation of datasets in safety-critical autonomous systems.
[16] Conscious Gaze: Adaptive Attention Mechanisms for Hallucination Mitigation in Vision-Language Models
Weijue Bu, Guan Yuan, Guixian Zhang
🧩 TL;DR
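This paper proposes Conscious Gaze (CG-VLM), a framework that addresses text inertia in large vision-language models through inference-time control grounded in game-theoretic interpretability, reducing object hallucinations while preserving general capabilities.
📘 Detailed Summary
Motivation: Large vision-language models commonly suffer from text inertia: attention drifts from visual evidence toward linguistic priors, which produces object hallucinations. Existing decoding strategies intervene only at the output logits and cannot correct drift in internal reasoning, while internal-control methods based on heuristic head suppression or global steering vectors lack principled grounding.
Method: CG-VLM is a training-free, inference-time framework that turns game-theoretic interpretability into actionable decoding control. A Cognitive Demand Sensor built on Harsanyi interactions estimates the instantaneous vision-text synergy and identifies moments that require visual grounding. Conditioned on this signal, a Focused Consensus Induction module selectively redirects mid-layer attention toward visual tokens before it collapses into text priors.
Result: CG-VLM achieves state-of-the-art results on the POPE and CHAIR benchmarks across InstructBLIP, LLaVA, Qwen-VL, and mPLUG while preserving general capabilities. This indicates that token-level sensing enables precise, context-aware intervention without compromising foundational knowledge.
Conclusion: The study shows that inference-time control grounded in game-theoretic interpretability can counter text inertia in vision-language models. It provides a precise mechanism for intervening in a model's internal reasoning while retaining its original capabilities, and offers a theoretically principled way to reduce object hallucinations.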
📄 Abstract
Large Vision-Language Models (VLMs) often exhibit text inertia, where attention drifts from visual evidence toward linguistic priors, resulting in object hallucinations. Existing decoding strategies intervene only at the output logits and thus cannot correct internal reasoning drift, while recent internal-control methods based on heuristic head suppression or global steering vectors lack principled grounding. We introduce Conscious Gaze (CG-VLM), a training-free, inference-time framework that converts game-theoretic interpretability into actionable decoding control. A Cognitive Demand Sensor built on Harsanyi interactions estimates instantaneous vision-text synergy and identifies moments when visual grounding is necessary. Conditioned on this signal, a Focused Consensus Induction module selectively reorients mid-layer attention toward visual tokens before collapse into text priors. CG-VLM achieves state-of-the-art results on POPE and CHAIR across InstructBLIP, LLaVA, Qwen-VL, and mPLUG, while preserving general capabilities, demonstrating that token-level sensing enables precise, context-aware intervention without compromising foundational knowledge.
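For readers unfamiliar with Harsanyi interactions, the following sketch computes the standard Harsanyi dividend I(S) = Σ_{T ⊆ S} (−1)^{|S|−|T|} v(T) for a toy value function. The value function `toy_value` and the two "players" (an image token and a text token) are hypothetical; how CG-VLM defines v or aggregates the interactions is not stated in the abstract.

```python
from itertools import combinations


def subsets(items):
    """Yield every subset of items as a frozenset, including the empty set."""
    items = list(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def harsanyi_interaction(value, coalition):
    """Harsanyi dividend of a coalition S under the value function v."""
    coalition = frozenset(coalition)
    return sum(
        (-1) ** (len(coalition) - len(t)) * value(t)
        for t in subsets(coalition)
    )


# Hypothetical value function: the output is supported only when BOTH
# the image token and the text token are present (pure synergy).
def toy_value(present):
    return 1.0 if {"image", "text"} <= present else 0.0


synergy = harsanyi_interaction(toy_value, {"image", "text"})
image_alone = harsanyi_interaction(toy_value, {"image"})
text_alone = harsanyi_interaction(toy_value, {"text"})
```

In this toy game all of the value is attributed to the joint image-text coalition and none to either token alone, which is the kind of vision-text synergy signal a sensor built on these interactions could plausibly monitor.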
[17] Know-Show: Benchmarking Video-Language Models on Spatio-Temporal Grounded Reasoning
Chinthani Sugandhika, Chen Li, Deepu Rajan, Basura Fernando
🧩 TL;DR
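This paper introduces Know-Show, a benchmark for evaluating spatio-temporal grounded reasoning in video-language models, and GRAM, a plug-in that strengthens the fine-grained grounding of existing models. The results reveal a significant gap between current models and human ability in spatio-temporal grounded reasoning.
📘 Detailed Summary
Motivation: Large video-language models have made notable progress in multimodal understanding, but their reasoning remains weakly grounded in space and time. Existing models struggle to tie their inferences to visual and temporal evidence, which limits interpretability and reliability. A unified evaluation framework is therefore needed to quantify this gap.
Method: Know-Show unifies reasoning and localization tasks in five complementary scenarios that span spatial dimensions (person, object, person-object, and hand-object interaction) and the temporal dimension. It contains 2.5K human-authored questions built from Charades, Action Genome, and Ego4D. The authors also develop GRAM, a training-free plug-in module that enhances fine-grained grounding in video-language models through attention-based video token selection and explicit timestamp encoding.
Result: Experiments cover open and closed video-language models, including Qwen, VideoLLaVA, GPT-4o, and Gemini. Existing models struggle to "show what they know", especially in fine-grained hand-object interaction scenarios. Know-Show exposes a clear gap between current video-language models and human reasoning and provides a unified quantitative standard for assessing model performance.
Conclusion: The work establishes a unified standard for evaluating grounded reasoning in video-language understanding and offers insights for building interpretable and reliable multimodal reasoning systems. GRAM shows that attention-based token selection and timestamp encoding can strengthen the grounded reasoning of existing models, pointing to a technical path for future improvements.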
📄 Abstract
Large Video-Language Models (Video-LMs) have achieved impressive progress in multimodal understanding, yet their reasoning remains weakly grounded in space and time. We present Know-Show, a new benchmark designed to evaluate spatio-temporal grounded reasoning: the ability of a model to reason about actions and their semantics while simultaneously grounding its inferences in visual and temporal evidence. Know-Show unifies reasoning and localization within a single evaluation framework consisting of five complementary scenarios across spatial (person, object, person-object, and hand-object) and temporal dimensions. Built from Charades, Action Genome, and Ego4D with 2.5K human-authored questions, the benchmark exposes significant gaps between current Video-LMs and human reasoning. To bridge this gap, we propose GRAM, a training-free plug-in that augments Video-LMs with fine-grained grounding through attention-based video token selection and explicit timestamp encoding. Extensive experiments across open and closed Video-LMs (including Qwen, VideoLLaVA, GPT-4o, and Gemini) reveal that existing models struggle to "show what they know" and vice versa, especially in fine-grained hand-object interactions. Know-Show establishes a unified standard for assessing grounded reasoning in video-language understanding and provides insights toward developing interpretable and reliable multimodal reasoning systems. We will release the code at https://github.com/LUNAProject22/Know-Show.
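The two ingredients attributed to GRAM, attention-based video token selection and explicit timestamp encoding, can be illustrated with a very small sketch. Everything concrete here is an assumption: the attention scores are made up, the top-k rule is one simple selection policy, and the `<t=...s>` tag format is hypothetical rather than GRAM's actual encoding.

```python
def select_tokens(attention, timestamps, k):
    """Keep the k most-attended video tokens, returned in temporal order."""
    ranked = sorted(range(len(attention)), key=lambda i: attention[i], reverse=True)
    kept = sorted(ranked[:k])  # restore temporal order after ranking
    return [(i, timestamps[i]) for i in kept]


def timestamp_tag(seconds):
    """Render a textual timestamp to interleave with the kept tokens."""
    return f"<t={seconds:.1f}s>"


# Hypothetical per-token attention mass and frame times in seconds.
attention = [0.05, 0.40, 0.10, 0.30, 0.15]
timestamps = [0.0, 0.5, 1.0, 1.5, 2.0]

kept = select_tokens(attention, timestamps, k=2)
tags = [timestamp_tag(t) for _, t in kept]
```

Keeping the surviving tokens in temporal order matters for a grounding task, since the model is asked when something happens as well as what happens.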
[18] 2K-Characters-10K-Stories: A Quality-Gated Stylized Narrative Dataset with Disentangled Control and Sequence Consistency
Xingxi Yin, Yicheng Li, Gong Yan, Chenglin Li, Jian Zhao, Cong Huang, Yue Deng, Yin Zhang
🧩 TL;DR
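This work presents 2K-Characters-10K-Stories, the first dataset to pair large-scale unique character identities with explicit, decoupled control signals. It targets the long-standing challenge of sequential identity consistency in controllable visual storytelling and relies on a human-in-the-loop pipeline to ensure high-quality data.
📘 Detailed Summary
Motivation: Existing datasets fall short on sequential identity consistency under precise transient attribute control. They lack sufficient fidelity and fail to disentangle stable identities from transient attributes, which limits structured control over pose, expression, and scene composition and constrains reliable sequential synthesis.
Method: The authors introduce 2K-Characters-10K-Stories, a multi-modal stylized narrative dataset of 2,000 uniquely stylized characters and 10,000 illustrated stories. A human-in-the-loop pipeline combines expert-verified character templates with LLM-guided narrative planning to generate highly aligned structured data. A decoupled control scheme separates persistent identity from transient attributes, and a quality-gated loop integrating MMLM evaluation, automatic prompt tuning, and local image editing enforces pixel-level consistency.
Result: Extensive experiments show that models fine-tuned on the dataset reach performance comparable to closed-source models in generating visual narratives, supporting the dataset's effectiveness for sequential identity consistency and precise attribute control.
Conclusion: The work provides the first large-scale structured dataset and a systematic approach for sequential identity consistency in controllable visual storytelling. Decoupled control and quality gating enable precise attribute manipulation, giving visual narrative generation a new benchmark and a reliable data foundation.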
📄 Abstract
Sequential identity consistency under precise transient attribute control remains a long-standing challenge in controllable visual storytelling. Existing datasets lack sufficient fidelity and fail to disentangle stable identities from transient attributes, limiting structured control over pose, expression, and scene composition and thus constraining reliable sequential synthesis. To address this gap, we introduce 2K-Characters-10K-Stories, a multi-modal stylized narrative dataset of 2,000 uniquely stylized characters appearing across 10,000 illustration stories. It is the first dataset that pairs large-scale unique identities with explicit, decoupled control signals for sequential identity consistency. We introduce a Human-in-the-Loop pipeline (HiL) that leverages expert-verified character templates and LLM-guided narrative planning to generate highly aligned structured data. A decoupled control scheme separates persistent identity from transient attributes (pose and expression), while a Quality-Gated loop integrating MMLM evaluation, Auto-Prompt Tuning, and Local Image Editing enforces pixel-level consistency. Extensive experiments demonstrate that models fine-tuned on our dataset achieve performance comparable to closed-source models in generating visual narratives.
[19] DashFusion: Dual-stream Alignment with Hierarchical Bottleneck Fusion for Multimodal Sentiment Analysis
Yuhua Wen, Qifei Li, Yingying Zhou, Yingming Gao, Zhengqi Wen, Jianhua Tao, Ya Li
🧩 TL;DR
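This paper proposes DashFusion, a multimodal sentiment analysis framework that uses dual-stream alignment and hierarchical bottleneck fusion to address the challenges of aligning and fusing multimodal features, and it achieves state-of-the-art performance on several benchmark datasets.
📘 Detailed Summary
Motivation: The main challenges in multimodal sentiment analysis are alignment and fusion across modalities. Existing methods often handle alignment or fusion in isolation, which limits both performance and efficiency. The paper aims for a unified framework that addresses both.
Method: DashFusion has three core components. A dual-stream alignment module achieves temporal alignment through cross-modal attention and semantic alignment through contrastive learning. Supervised contrastive learning uses label information to refine modality features. Hierarchical bottleneck fusion progressively integrates multimodal information through compressed bottleneck tokens, balancing performance and computational efficiency.
Result: Experiments on CMU-MOSI, CMU-MOSEI, and CH-SIMS show that DashFusion achieves state-of-the-art performance across various metrics, and ablation studies confirm the effectiveness of the proposed alignment and fusion techniques.
Conclusion: The study shows the importance of handling multimodal alignment and fusion together. The hierarchical bottleneck fusion mechanism offers an effective way to balance model performance against computational cost and provides a new technical path for multimodal sentiment analysis.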
📄 Abstract
Multimodal sentiment analysis (MSA) integrates various modalities, such as text, image, and audio, to provide a more comprehensive understanding of sentiment. However, effective MSA is challenged by alignment and fusion issues. Alignment requires synchronizing both temporal and semantic information across modalities, while fusion involves integrating these aligned features into a unified representation. Existing methods often address alignment or fusion in isolation, leading to limitations in performance and efficiency. To tackle these issues, we propose a novel framework called Dual-stream Alignment with Hierarchical Bottleneck Fusion (DashFusion). First, a dual-stream alignment module synchronizes multimodal features through temporal and semantic alignment. Temporal alignment employs cross-modal attention to establish frame-level correspondences among multimodal sequences. Semantic alignment ensures consistency across the feature space through contrastive learning. Second, supervised contrastive learning leverages label information to refine the modality features. Finally, hierarchical bottleneck fusion progressively integrates multimodal information through compressed bottleneck tokens, which achieves a balance between performance and computational efficiency. We evaluate DashFusion on three datasets: CMU-MOSI, CMU-MOSEI, and CH-SIMS. Experimental results demonstrate that DashFusion achieves state-of-the-art performance across various metrics, and ablation studies confirm the effectiveness of our alignment and fusion techniques. The code for our experiments is available at https://github.com/ultramarineX/DashFusion.
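Semantic alignment through contrastive learning is commonly implemented with an InfoNCE-style loss, in which each sample's embedding in one modality should be closer to its own counterpart in the other modality than to anyone else's. The sketch below assumes that formulation; the abstract does not give DashFusion's exact loss, and the feature vectors and temperature are illustrative.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(text_feats, audio_feats, temperature=0.1):
    """Mean cross-entropy of matching text row i to audio row i."""
    text = [normalize(v) for v in text_feats]
    audio = [normalize(v) for v in audio_feats]
    loss = 0.0
    for i, t in enumerate(text):
        logits = [dot(t, a) / temperature for a in audio]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_denom = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        loss += log_denom - logits[i]
    return loss / len(text)


# Hypothetical 2-D embeddings for two samples in two modalities.
text = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(text, [[0.9, 0.1], [0.1, 0.9]])
misaligned = info_nce(text, [[0.1, 0.9], [0.9, 0.1]])
```

The loss is close to zero when paired embeddings point the same way and large when the pairing is swapped, which is the pressure that pulls the modalities into a shared feature space.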
[20] VOST-SGG: VLM-Aided One-Stage Spatio-Temporal Scene Graph Generation
Chinthani Sugandhika, Chen Li, Deepu Rajan, Basura Fernando
🧩 TL;DR
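This paper proposes VOST-SGG, a VLM-aided one-stage framework for spatio-temporal scene graph generation that introduces semantic priors and multi-modal feature fusion to improve the modeling of object relationships in video.
📘 Detailed Summary
Motivation: Current DETR-style one-stage spatio-temporal scene graph generation models have two key limitations. First, their learnable queries are semantically uninformed and initialized in an instance-agnostic way. Second, they rely solely on unimodal visual features for predicate classification, which limits their understanding of complex spatio-temporal relationships.
Method: VOST-SGG first introduces a dual-source query initialization strategy that disentangles what to attend to from where to attend, enabling semantically grounded what-where reasoning. It then builds a multi-modal feature bank that fuses visual, textual, and spatial cues derived from vision-language models to improve predicate classification.
Result: Extensive experiments on the Action Genome dataset show state-of-the-art performance, validating the integration of VLM-aided semantic priors and multi-modal features for spatio-temporal scene graph generation.
Conclusion: The study demonstrates the value of integrating the common-sense reasoning of vision-language models into the spatio-temporal scene graph generation pipeline. Semantically grounded query initialization and multi-modal feature fusion improve the model's ability to capture complex spatio-temporal relationships in video.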
📄 Abstract
Spatio-temporal scene graph generation (ST-SGG) aims to model objects and their evolving relationships across video frames, enabling interpretable representations for downstream reasoning tasks such as video captioning and visual question answering. Despite recent advancements in DETR-style single-stage ST-SGG models, they still suffer from several key limitations. First, while these models rely on attention-based learnable queries as a core component, these learnable queries are semantically uninformed and instance-agnostically initialized. Second, these models rely exclusively on unimodal visual features for predicate classification. To address these challenges, we propose VOST-SGG, a VLM-aided one-stage ST-SGG framework that integrates the common sense reasoning capabilities of vision-language models (VLMs) into the ST-SGG pipeline. First, we introduce the dual-source query initialization strategy that disentangles what to attend to from where to attend, enabling semantically grounded what-where reasoning. Furthermore, we propose a multi-modal feature bank that fuses visual, textual, and spatial cues derived from VLMs for improved predicate classification. Extensive experiments on the Action Genome dataset demonstrate that our approach achieves state-of-the-art performance, validating the effectiveness of integrating VLM-aided semantic priors and multi-modal features for ST-SGG. We will release the code at https://github.com/LUNAProject22/VOST.
[21] Probing the effectiveness of World Models for Spatial Reasoning through Test-time Scaling
Saurav Jha, M. Jehanzeb Mirza, Wei Lin, Shiqi Yang, Sarath Chandar
🧩 TL;DR
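This paper systematically analyzes world-model-based test-time verifiers on spatial reasoning tasks, exposes their limitations, and proposes ViSA, a framework that improves verification through verifiable spatial assertions. ViSA yields gains on the SAT-Real benchmark but still struggles on the more complex MMSI-Bench.
📘 Detailed Summary
Motivation: Vision-language models remain limited on spatial reasoning tasks that require multi-view understanding and embodied perspective shifts. Methods such as MindJourney apply test-time scaling: a world model imagines action-conditioned trajectories and a heuristic verifier selects helpful views. The behavior of these verifiers has not been analyzed systematically, so their calibration and reliability need to be assessed.
Method: The study first runs an uncertainty analysis of MindJourney's verifier, revealing poor calibration and systematic action biases. It then proposes ViSA (Verification through Spatial Assertions), which grounds the test-time reward in verifiable, frame-anchored micro-claims and corrects trajectory-selection bias through more balanced exploration.
Result: The analysis shows that MindJourney's verifier offers little meaningful calibration and that random scoring often reduces answer entropy equally well, exposing systematic action biases and unreliable reward signals. ViSA consistently improves spatial reasoning on SAT-Real. On the more challenging MMSI-Bench, however, no verifier, ViSA included, achieves consistent scaling, suggesting that current world models form an information bottleneck.
Conclusion: The study charts both the strengths and the weaknesses of world-model-based test-time verification for spatial reasoning. Verifiers suffer from poor calibration and bias, which principled methods such as ViSA can partly remedy. On complex tasks the world model itself becomes the information bottleneck, because imagined views fail to enrich fine-grained reasoning, and this points to directions for future work.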
📄 Abstract
Vision-Language Models (VLMs) remain limited in spatial reasoning tasks that require multi-view understanding and embodied perspective shifts. Recent approaches such as MindJourney attempt to mitigate this gap through test-time scaling where a world model imagines action-conditioned trajectories and a heuristic verifier selects helpful views from such trajectories. In this work, we systematically examine how such test-time verifiers behave across benchmarks, uncovering both their promise and their pitfalls. Our uncertainty-based analyses show that MindJourney's verifier provides little meaningful calibration, and that random scoring often reduces answer entropy equally well, thus exposing systematic action biases and unreliable reward signals. To mitigate these, we introduce a Verification through Spatial Assertions (ViSA) framework that grounds the test-time reward in verifiable, frame-anchored micro-claims. This principled verifier consistently improves spatial reasoning on the SAT-Real benchmark and corrects trajectory-selection biases through more balanced exploratory behavior. However, on the challenging MMSI-Bench, none of the verifiers, including ours, achieve consistent scaling, suggesting that the current world models form an information bottleneck where imagined views fail to enrich fine-grained reasoning. Together, these findings chart the bad, good, and ugly aspects of test-time verification for world-model-based reasoning. Our code is available at https://github.com/chandar-lab/visa-for-mindjourney.
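The entropy comparison at the heart of the analysis can be made concrete with a small sketch. It computes the Shannon entropy of a model's distribution over answer options before and after adding imagined views, once for verifier-selected views and once for randomly scored views. The three distributions are invented for illustration and are not results from the paper.

```python
import math


def answer_entropy(probs):
    """Shannon entropy in bits of a distribution over answer options."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical answer distributions for one four-option question.
before = answer_entropy([0.25, 0.25, 0.25, 0.25])
after_verifier = answer_entropy([0.70, 0.10, 0.10, 0.10])
after_random = answer_entropy([0.65, 0.15, 0.10, 0.10])

verifier_reduction = before - after_verifier
random_reduction = before - after_random
# How much of the entropy drop is attributable to the verifier itself.
verifier_advantage = verifier_reduction - random_reduction
```

When `verifier_advantage` is close to zero across many questions, the verifier is doing little beyond what adding any extra views would do, which is the pattern the abstract reports for MindJourney's verifier.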
[22] ProPhy: Progressive Physical Alignment for Dynamic World Simulation
Zijun Wang, Panwen Hu, Jing Wang, Terry Jingchen Zhang, Yuhao Cheng, Long Chen, Yiqiang Yan, Zutao Jiang, Hanhui Li, Xiaodan Liang
🧩 TL;DR
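This paper proposes ProPhy, a progressive physical alignment framework that uses a two-stage mixture-of-physics-experts mechanism for explicit physics-aware conditioning, addressing the weak physical consistency of existing video generation models on large-scale or complex dynamics.
📘 Detailed Summary
Motivation: Video generation models show promise for building world simulators but struggle to produce physically consistent results for large-scale or complex dynamics. The main reasons are that existing methods respond isotropically to physical prompts and neglect the fine-grained alignment between generated content and localized physical cues.
Method: ProPhy uses a two-stage Mixture-of-Physics-Experts mechanism: Semantic Experts infer semantic-level physical principles from textual descriptions, and Refinement Experts capture token-level physical dynamics. A physical alignment strategy transfers the physical reasoning capabilities of vision-language models into the Refinement Experts, giving a more accurate representation of dynamic physical phenomena.
Result: Extensive experiments on physics-aware video generation benchmarks show that ProPhy produces more realistic, dynamic, and physically coherent results than existing state-of-the-art methods.
Conclusion: Through explicit physics-aware conditioning and fine-grained physical alignment, the work improves the physical consistency of generated video and offers a new approach to building more reliable world simulators. Deeper integration of physical priors with generative models is a possible direction for future work.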
📄 Abstract
Recent advances in video generation have shown remarkable potential for constructing world simulators. However, current models still struggle to produce physically consistent results, particularly when handling large-scale or complex dynamics. This limitation arises primarily because existing approaches respond isotropically to physical prompts and neglect the fine-grained alignment between generated content and localized physical cues. To address these challenges, we propose ProPhy, a Progressive Physical Alignment Framework that enables explicit physics-aware conditioning and anisotropic generation. ProPhy employs a two-stage Mixture-of-Physics-Experts (MoPE) mechanism for discriminative physical prior extraction, where Semantic Experts infer semantic-level physical principles from textual descriptions, and Refinement Experts capture token-level physical dynamics. This mechanism allows the model to learn fine-grained, physics-aware video representations that better reflect underlying physical laws. Furthermore, we introduce a physical alignment strategy that transfers the physical reasoning capabilities of vision-language models (VLMs) into the Refinement Experts, facilitating a more accurate representation of dynamic physical phenomena. Extensive experiments on physics-aware video generation benchmarks demonstrate that ProPhy produces more realistic, dynamic, and physically coherent results than existing state-of-the-art methods.
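Token-level expert selection of the kind the Refinement Experts suggest is typically built on softmax gating over experts. The following is a generic top-k mixture-of-experts routing sketch, not ProPhy's MoPE: the gate logits, the expert outputs, and the top-k renormalization are all assumptions.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(gate_logits, expert_outputs, top_k=2):
    """Mix the outputs of the top_k experts using renormalized gate weights."""
    probs = softmax(gate_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    weight_sum = sum(probs[i] for i in chosen)
    dim = len(expert_outputs[0])
    mixed = [0.0] * dim
    for i in chosen:
        w = probs[i] / weight_sum
        for d in range(dim):
            mixed[d] += w * expert_outputs[i][d]
    return chosen, mixed


# Hypothetical gate logits for one video token and three experts.
gate_logits = [3.0, 0.0, 1.0]
expert_outputs = [[1.0, 0.0], [5.0, 5.0], [0.0, 1.0]]
chosen, mixed = route_token(gate_logits, expert_outputs)
```

Because each token gets its own gate logits, different regions of a frame can be handled by different experts, which is one way to obtain the anisotropic, locally conditioned behavior the abstract describes.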
[23] Learning High-Fidelity Cloth Animation via Skinning-Free Image Transfer
Rong Wang, Wei Mao, Changsheng Lu, Hongdong Li
🧩 TL;DR
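This paper proposes a skinning-free method for generating 3D garment deformations via image transfer. It estimates vertex positions and vertex normals independently to decouple low-frequency shape from high-frequency wrinkle detail, and it uses pretrained image models to recover fine visual detail, improving cloth animation quality.
📘 Detailed Summary
Motivation: Existing methods mostly rely on linear blend skinning to obtain the low-frequency garment shape and regress only the high-frequency wrinkles. Because explicit skinning supervision is lacking, they often produce misaligned shapes when posing the garment, which corrupts the high-frequency signal and prevents recovery of high-fidelity wrinkles.
Method: The method is skinning-free: it independently estimates posed vertex positions for the low-frequency garment shape and vertex normals for high-frequency local wrinkle detail, so the two frequency modalities are decoupled and supervised directly by the deformed garment geometry. Both vertex attributes are encoded as rendered texture images, so 3D garment deformation is achieved through 2D image transfer and pretrained image models can recover fine wrinkle detail. Finally, a multimodal fusion step combines constraints from both frequency modalities to robustly recover the deformed 3D garment from the transferred images.
Result: Extensive experiments show that the method significantly improves animation quality across garment types and recovers finer wrinkles than state-of-the-art methods, while scaling to garments of diverse topologies without manual UV partitioning.
Conclusion: By decoupling the frequency modalities and adopting an image-transfer paradigm, the work addresses the shape misalignment of skinning-based methods and provides a high-quality garment deformation framework for virtual try-on and extended reality. It also illustrates the potential of pretrained image models for recovering 3D geometric detail.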
📄 Abstract
We present a novel method for generating 3D garment deformations from given body poses, which is key to a wide range of applications, including virtual try-on and extended reality. To simplify the cloth dynamics, existing methods mostly rely on linear blend skinning to obtain the low-frequency posed garment shape and regress only the high-frequency wrinkles. However, due to the lack of explicit skinning supervision, such skinning-based approaches often produce misaligned shapes when posing the garment, which corrupts the high-frequency signals and prevents the recovery of high-fidelity wrinkles. To tackle this issue, we propose a skinning-free approach that independently estimates the posed (i) vertex position for the low-frequency posed garment shape, and (ii) vertex normal for high-frequency local wrinkle details. In this way, each frequency modality can be effectively decoupled and directly supervised by the geometry of the deformed garment. To further improve the visual quality of animation, we propose to encode both vertex attributes as rendered texture images, so that 3D garment deformation can be equivalently achieved via 2D image transfer. This enables us to leverage powerful pretrained image models to recover fine-grained visual details in wrinkles, while maintaining superior scalability for garments of diverse topologies without relying on manual UV partitioning. Finally, we propose a multimodal fusion step that incorporates constraints from both frequency modalities and robustly recovers deformed 3D garments from transferred images. Extensive experiments show that our method significantly improves animation quality on various garment types and recovers finer wrinkles than state-of-the-art methods.
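Linear blend skinning, the baseline the abstract moves away from, poses each vertex as a weighted sum of its copies under each bone's rigid transform. The 2-D sketch below shows that formula on one vertex; the two bones, their angles, and the equal weights are hypothetical values chosen to make the arithmetic easy to follow.

```python
import math


def rigid_transform(angle_deg, tx, ty):
    """Return a function applying a 2-D rotation followed by a translation."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)

    def apply(p):
        x, y = p
        return (c * x - s * y + tx, s * x + c * y + ty)

    return apply


def linear_blend_skinning(vertex, weights, bone_transforms):
    """Blend each bone's transformed copy of the vertex by its skinning weight."""
    x = y = 0.0
    for w, transform in zip(weights, bone_transforms):
        px, py = transform(vertex)
        x += w * px
        y += w * py
    return (x, y)


# One bone stays put, the other rotates by 90 degrees about the origin.
bones = [rigid_transform(0.0, 0.0, 0.0), rigid_transform(90.0, 0.0, 0.0)]
posed = linear_blend_skinning((1.0, 0.0), [0.5, 0.5], bones)
posed_length = math.hypot(*posed)
```

The blended vertex lands at roughly (0.5, 0.5), so its distance from the joint shrinks from 1 to about 0.71. Averaging transformed positions in this way is a well-known source of shape error in skinning, and it gives some intuition for why errors in the skinned base shape can disturb the wrinkle detail layered on top of it.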
[24] VRSA: Jailbreaking Multimodal Large Language Models through Visual Reasoning Sequential Attack
Shiji Zhao, Shukun Xiong, Yao Huang, Yan Jin, Zhenyu Wu, Jiyang Guan, Ranjie Duan, Jialing Tao, Hui Xue, Xingxing Wei
🧩 TL;DR
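This paper proposes the Visual Reasoning Sequential Attack (VRSA), which decomposes a harmful text into several sequentially related sub-images and induces multimodal large language models to gradually externalize and aggregate the complete harmful intent, in order to evaluate safety risks in the visual modality.
📘 Detailed Summary
Motivation: Safety risks in the visual modality of multimodal large language models have been largely overlooked. Prior work mainly examines reasoning safety risks in the text modality, yet additional modalities introduce additional vulnerabilities that jailbreak attacks can exploit. A full evaluation of potential safety threats in visual reasoning tasks is needed.
Method: VRSA decomposes the original harmful text into several sequentially related sub-images. Adaptive Scene Refinement optimizes the scene most relevant to the original harmful query, Semantic Coherent Completion iteratively rewrites each sub-text using contextual information, and Text-Image Consistency Alignment maintains semantic consistency.
Result: Experiments show that VRSA achieves a higher attack success rate than state-of-the-art jailbreak attacks on both open-source and closed-source MLLMs, including GPT-4o and Claude-4.5-Sonnet.
Conclusion: The study exposes serious safety vulnerabilities of multimodal large language models in visual reasoning tasks, shows that sequential attacks in the visual modality can bypass safety safeguards, and provides a warning and an evaluation benchmark for designing future MLLM defenses.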
📄 Abstract
Multimodal Large Language Models (MLLMs) are widely used in various fields due to their powerful cross-modal comprehension and generation capabilities. However, additional modalities introduce additional vulnerabilities that jailbreak attacks can exploit to induce MLLMs to output harmful content. Because MLLMs have strong reasoning abilities, previous jailbreak attacks have explored reasoning safety risks in the text modality, while similar threats have been largely overlooked in the visual modality. To fully evaluate potential safety risks in the visual reasoning task, we propose Visual Reasoning Sequential Attack (VRSA), which induces MLLMs to gradually externalize and aggregate complete harmful intent by decomposing the original harmful text into several sequentially related sub-images. In particular, to enhance the rationality of the scene in the image sequence, we propose Adaptive Scene Refinement to optimize the scene most relevant to the original harmful query. To ensure the semantic continuity of the generated image, we propose Semantic Coherent Completion to iteratively rewrite each sub-text combined with contextual information in this scene. In addition, we propose Text-Image Consistency Alignment to maintain semantic consistency. A series of experiments demonstrates that VRSA achieves a higher attack success rate than state-of-the-art jailbreak attack methods on both open-source and closed-source MLLMs such as GPT-4o and Claude-4.5-Sonnet.
cs.CL [Back]
[25] LYNX: Learning Dynamic Exits for Confidence-Controlled Reasoning
Ömer Faruk Akgül, Yusuf Hakan Kalaycı, Rajgopal Kannan, Willie Neiswanger, Viktor Prasanna
🧩 TL;DR
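This paper proposes LYNX, an online early-exit mechanism that turns a model's own hidden-state awareness into confidence-controlled stopping decisions. It addresses "overthinking" in large reasoning models, cutting compute substantially while matching or improving accuracy.
📘 Detailed Summary
Motivation: Large reasoning models achieve strong performance on complex tasks by generating extended chains of thought, but they often "overthink": they keep reasoning long after they have enough information to answer correctly. This wastes inference compute and can hurt accuracy. Existing early-stopping methods either manipulate decoding with extra sampling and heuristics, rely on auxiliary verifier models, or run only as post-hoc analysis pipelines without formal guarantees.
Method: LYNX is an online early-exit mechanism that converts the model's hidden-state awareness into confidence-controlled stopping decisions. It attaches exit decisions to naturally occurring reasoning cue tokens, trains a lightweight probe on the hidden states at those cue tokens using supervision from forced exits, and wraps the resulting scores in split conformal prediction for distribution-free control over premature exits. Crucially, the probe is trained and calibrated once on a generic mathematical corpus and then reused unchanged across benchmarks, decoding temperatures, and even non-mathematical tasks.
Result: Across three model families, a single mathematically trained probe per base model yields strong accuracy-efficiency tradeoffs. On GSM8K, LYNX matches or improves baseline accuracy while using 40-65% fewer tokens; on MATH-500, it improves accuracy by up to 12 points with roughly 35-60% fewer tokens; on AIME 2024, it recovers baseline accuracy with more than 50% token savings; on the non-math benchmark CommonsenseQA, it transfers zero-shot with modest accuracy gains and up to 70% fewer tokens. Compared with state-of-the-art early-exit methods, LYNX offers competitive or superior Pareto frontiers.
Conclusion: LYNX is fully online, needs no proxy model at inference, and provides explicit, user-tunable confidence guarantees, with Pareto frontiers competitive with or better than those of existing methods. The results show that early exit based on hidden-state awareness generalizes across tasks and model scales, offering a way to balance compute cost against reasoning accuracy in efficient inference systems.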
📄 Abstract
Large reasoning models achieve strong performance on complex tasks by generating extended chains of thought, but they often "overthink": continuing to reason long after they have enough information to answer correctly. This wastes inference-time compute and can hurt accuracy. Existing attempts to stop early either manipulate decoding with extra sampling and heuristics, rely on auxiliary verifier models, or operate only as post-hoc analysis pipelines without formal guarantees. We introduce LYNX, an online early-exit mechanism that turns a model's own hidden-state awareness into confidence-controlled stopping decisions. LYNX attaches exit decisions to naturally occurring reasoning cues (e.g., "hmm", "wait") during generation, trains a lightweight probe on hidden states at those cue tokens using supervision from forced exits, and wraps the resulting scores in split conformal prediction to obtain distribution-free control over premature exits. Crucially, we train and calibrate this probe once on a generic mathematical corpus and reuse it unchanged across benchmarks, decoding temperatures, and even non-mathematical tasks. Across three model families spanning 1.5B to 32B parameters, a single mathematically trained probe per base model yields strong accuracy-efficiency tradeoffs. On GSM8K, LYNX matches or improves baseline accuracy while reducing tokens by 40-65%; on MATH-500 it improves accuracy by up to 12 points with roughly 35-60% fewer tokens; on AIME 2024 it recovers baseline accuracy with more than 50% token savings; and on CommonsenseQA, a non-math benchmark, it transfers zero-shot with modest accuracy gains and up to 70% fewer tokens. Compared to state-of-the-art early-exit methods, LYNX offers competitive or superior Pareto frontiers while remaining fully online, requiring no proxy models at inference, and providing explicit, user-tunable confidence guarantees.
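Split conformal prediction turns a held-out calibration set into a threshold with a finite-sample guarantee. The sketch below shows one way such a threshold could gate exits. It assumes, hypothetically, that the calibration scores are probe outputs at cues where a forced exit would have produced a wrong answer; the abstract does not spell out LYNX's exact calibration procedure.

```python
import math


def conformal_threshold(premature_scores, alpha):
    """Split conformal quantile over calibration scores.

    If a new premature-exit cue is exchangeable with the calibration
    cues, its score exceeds the returned threshold with probability at
    most alpha.
    """
    n = len(premature_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points: never exit early
    return sorted(premature_scores)[k - 1]


def should_exit(probe_score, tau):
    """Exit only when the probe is more confident than the threshold."""
    return probe_score > tau


# Hypothetical probe scores at cues where exiting would have been premature.
calibration = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
tau = conformal_threshold(calibration, alpha=0.2)
```

The parameter `alpha` plays the role of the user-tunable confidence level: lowering it raises the threshold, so the model exits less often but with fewer premature stops.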
[26] ArtistMus: A Globally Diverse, Artist-Centric Benchmark for Retrieval-Augmented Music Question Answering
Daeyong Kwon, SeungHeon Doh, Juhan Nam
🧩 TL;DR
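To address the limited music knowledge of large language models, this work builds MusWikiDB, a vector database of music knowledge, and ArtistMus, an evaluation benchmark. Retrieval-augmented generation markedly improves music question-answering accuracy, offering an effective approach for domain-specific QA systems.
📘 Detailed Summary
Motivation: Large language models perform poorly on music-related reasoning, mainly because music knowledge is sparse in pretraining data. Existing work in music information retrieval and computational musicology offers few resources that support factual and contextual music question answering grounded in artist metadata and historical context.
Method: The authors introduce MusWikiDB, a vector database of 3.2 million passages from 144K music-related Wikipedia pages, and ArtistMus, a benchmark of 1,000 questions on 500 diverse artists with metadata such as genre and debut year. Together they enable systematic evaluation of retrieval-augmented generation for music question answering.
Result: Retrieval-augmented generation markedly improves factual accuracy. Open-source models gain up to 56.8 percentage points (Qwen3 8B rises from 35.0% to 91.8%), approaching proprietary model performance. RAG-style fine-tuning further strengthens factual recall and contextual reasoning. Compared with a general-purpose Wikipedia corpus, MusWikiDB yields roughly 6 percentage points higher accuracy and 40% faster retrieval.
Conclusion: The work lays a foundation for music information retrieval and domain-specific question answering and shows that retrieval-augmented generation is effective in culturally rich domains such as music. The released resources are intended to advance research on music-related and retrieval-augmented reasoning and offer a template for QA systems in other specialized domains.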
📄 Abstract
Recent advances in large language models (LLMs) have transformed open-domain question answering, yet their effectiveness in music-related reasoning remains limited due to sparse music knowledge in pretraining data. While music information retrieval and computational musicology have explored structured and multimodal understanding, few resources support factual and contextual music question answering (MQA) grounded in artist metadata or historical context. We introduce MusWikiDB, a vector database of 3.2M passages from 144K music-related Wikipedia pages, and ArtistMus, a benchmark of 1,000 questions on 500 diverse artists with metadata such as genre, debut year, and topic. These resources enable systematic evaluation of retrieval-augmented generation (RAG) for MQA. Experiments show that RAG markedly improves factual accuracy; open-source models gain up to +56.8 percentage points (for example, Qwen3 8B improves from 35.0 to 91.8), approaching proprietary model performance. RAG-style fine-tuning further boosts both factual recall and contextual reasoning, improving results on both in-domain and out-of-domain benchmarks. MusWikiDB also yields approximately 6 percentage points higher accuracy and 40% faster retrieval than a general-purpose Wikipedia corpus. We release MusWikiDB and ArtistMus to advance research in music information retrieval and domain-specific question answering, establishing a foundation for retrieval-augmented reasoning in culturally rich domains such as music.
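The retrieval step of a RAG pipeline over a vector database reduces to nearest-neighbor search over passage embeddings followed by prompt assembly. This sketch uses cosine similarity on tiny hand-written vectors. The passages, the embeddings, and the prompt template are placeholders and do not reflect MusWikiDB's actual contents or schema.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def retrieve(query_vec, passages, k=2):
    """Return the texts of the k passages most similar to the query."""
    ranked = sorted(passages, key=lambda p: cosine(query_vec, p["vec"]), reverse=True)
    return [p["text"] for p in ranked[:k]]


def build_prompt(question, contexts):
    """Prepend retrieved passages so the model can ground its answer."""
    joined = "\n".join(f"- {c}" for c in contexts)
    return f"Context:\n{joined}\n\nQuestion: {question}\nAnswer:"


# Placeholder passages with hypothetical 3-D embeddings.
passages = [
    {"text": "Passage about artist A's debut", "vec": [0.9, 0.1, 0.0]},
    {"text": "Passage about artist B's genre", "vec": [0.1, 0.9, 0.0]},
    {"text": "Unrelated passage", "vec": [0.0, 0.1, 0.9]},
]
query_vec = [1.0, 0.2, 0.0]  # stand-in embedding of the user question
top = retrieve(query_vec, passages, k=2)
prompt = build_prompt("When did artist A debut?", top)
```

A production system would replace the linear scan with an approximate nearest-neighbor index, but the ranking criterion stays the same.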
[27] Interleaved Latent Visual Reasoning with Selective Perceptual Modeling
Shuai Dong, Siyuan Wang, Xingyu Liu, Zhongyu Wei
🧩 TL;DR
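This paper proposes Interleaved Latent Visual Reasoning (ILVR), a framework that interleaves text generation with evolving latent visual representations to resolve the trade-off in multimodal large language models between the high computational cost of visual feedback and insufficiently precise perceptual modeling.
📘 Detailed Summary
Motivation: Interleaved reasoning gives multimodal large language models visual feedback, but repeatedly re-encoding pixel-dense images is prohibitively expensive. Existing latent visual reasoning methods face a critical trade-off: they either sacrifice precise perceptual modeling by over-compressing features or cannot model dynamic problems because of static, non-interleaved structures.
Method: ILVR unifies dynamic state evolution with precise perceptual modeling by interleaving text generation with latent visual representations that serve as specific, evolving cues for subsequent reasoning. It uses a self-supervision strategy in which a momentum teacher model selectively distills relevant features from helper images into sparse supervision targets. This adaptive selection guides the model to generate context-aware visual signals autonomously.
Result: Extensive experiments on multimodal reasoning benchmarks show that ILVR significantly outperforms existing approaches, bridging the gap between fine-grained perception and sequential multimodal reasoning.
Conclusion: The study shows that interleaved latent visual representations can unify dynamic state evolution with precise perceptual modeling, giving multimodal reasoning systems a framework that retains perceptual precision while handling dynamic problems and supporting the use of vision-language models on complex reasoning tasks.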
📄 Abstract
Interleaved reasoning paradigms enhance Multimodal Large Language Models (MLLMs) with visual feedback but are hindered by the prohibitive computational cost of repeatedly re-encoding pixel-dense images. A promising alternative, latent visual reasoning, circumvents this bottleneck yet currently forces a critical trade-off: methods either sacrifice precise perceptual modeling by over-compressing features or fail to model dynamic problems due to static, non-interleaved structures. We introduce Interleaved Latent Visual Reasoning (ILVR), a framework that unifies dynamic state evolution with precise perceptual modeling. ILVR interleaves textual generation with latent visual representations that act as specific, evolving cues for subsequent reasoning. To enable this, we employ a self-supervision strategy where a Momentum Teacher Model selectively distills relevant features from helper images into sparse supervision targets. This adaptive selection mechanism guides the model to autonomously generate context-aware visual signals. Extensive experiments on multimodal reasoning benchmarks demonstrate that ILVR significantly outperforms existing approaches, effectively bridging the gap between fine-grained perception and sequential multimodal reasoning.
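A momentum teacher is usually maintained as an exponential moving average (EMA) of the student's weights, which gives slowly changing and therefore stable distillation targets. The sketch assumes that standard construction; the abstract names a Momentum Teacher Model but does not give its update rule, and the weights and momentum value here are toy numbers.

```python
def ema_update(teacher, student, momentum=0.99):
    """Move each teacher weight a small step toward the student weight."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


# Toy weights; a low momentum is used so the effect is visible in three steps.
teacher = [0.0, 1.0]
student = [1.0, 1.0]
for _ in range(3):
    teacher = ema_update(teacher, student, momentum=0.5)
```

With a realistic momentum close to 1 the teacher lags the student by many steps, so the targets it produces change smoothly even when the student's updates are noisy.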
[28] MedTutor-R1: Socratic Personalized Medical Teaching with Multi-Agent Simulation
Zhitao He, Haolin Yang, Zeyu Qin, Yi R Fung
🧩 TL;DR
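This work develops ClinEdu, a multi-agent pedagogical simulator, and ClinTeach, a teaching dialogue dataset, and uses them to train MedTutor-R1, a multimodal Socratic tutoring model. The goal is to close the gap in clinical medical education between scarce expert instruction and the need for team-based collaborative teaching.
📘 Detailed Summary
Motivation: Clinical medical education faces a significant gap between rising training demand and scarce expert instruction. Current research focuses mainly on one-on-one knowledge instruction and overlooks collaborative reasoning, a key skill that students develop through team activities such as ward rounds.
Method: The authors build ClinEdu, a multi-agent pedagogical simulator with personality-driven patients and diverse student cohorts, to generate teaching data. From it they construct ClinTeach, a large Socratic teaching dialogue dataset. MedTutor-R1, a multimodal Socratic tutor, is first instruction-tuned on ClinTeach and then optimized with reinforcement learning, using rewards derived from a three-axis rubric covering structural fidelity, analytical quality, and clinical safety.
Result: MedTutor-R1 outperforms the base model by over 20% in average pedagogical score and is comparable to o3, while adapting well to varying numbers of students. The tutor is assessed in situ through simulation-based interactive evaluation that redeploys it into ClinEdu.
Conclusion: The study demonstrates the effectiveness of a multi-agent pedagogical simulator combined with a reinforcement-learning-optimized Socratic tutor for clinical medical education. It offers a method for generating teaching data at scale and building adaptive tutoring systems, and it underscores the importance of team-based collaborative teaching in medical education.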
📄 Abstract
The significant gap between rising demands for clinical training and the scarcity of expert instruction poses a major challenge to medical education. With powerful capabilities in personalized guidance, Large Language Models (LLMs) offer a promising solution to bridge this gap. However, current research focuses mainly on one-on-one knowledge instruction, overlooking collaborative reasoning, a key skill for students developed in teamwork like ward rounds. To this end, we develop ClinEdu, a multi-agent pedagogical simulator with personality-driven patients and diverse student cohorts, enabling controlled testing of complex pedagogical processes and scalable generation of teaching data. Based on ClinEdu, we construct ClinTeach, a large Socratic teaching dialogue dataset that captures the complexities of group instruction. We then train MedTutor-R1, the first multimodal Socratic tutor designed for one-to-many instruction in clinical medical education. MedTutor-R1 is first instruction-tuned on our ClinTeach dataset and then optimized with reinforcement learning, using rewards derived from a three-axis rubric, covering structural fidelity, analytical quality, and clinical safety, to refine its adaptive Socratic strategies. For authentic in-situ assessment, we use simulation-based interactive evaluation that redeploys the tutor back into ClinEdu. Experimental results demonstrate that our MedTutor-R1 outperforms the base model by over 20% in average pedagogical score and is comparable to o3, while also exhibiting high adaptability in handling a varying number of students. This promising performance underscores the effectiveness of our pedagogical simulator, ClinEdu.
[29] Optimizing Medical Question-Answering Systems: A Comparative Study of Fine-Tuned and Zero-Shot Large Language Models with RAG Framework
Tasnimul Hassan, Md Faisal Karim, Haziq Jeelani, Elham Behnam, Robert Green, Fayeq Jeelani Syed
🧩 TL;DR
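This paper presents a retrieval-augmented generation (RAG) medical question-answering system that combines domain-specific knowledge retrieval with open-source large language models, improving factual accuracy and reliability while reducing hallucinations.
📘 Detailed Summary
Motivation: Applying large language models directly to the clinical domain raises challenges such as limited factual accuracy and hallucination. A QA system that preserves the accuracy of medical information is needed to support reliable clinical informatics applications.
Method: The system uses a RAG framework that pairs medical literature retrieval with two open-source LLMs (LLaMA 2 and Falcon). The models are fine-tuned with Low-Rank Adaptation (LoRA) for efficient domain specialization, and retrieved medical literature grounds the model's answers to improve factual correctness.
Result: On the PubMedQA and MedMCQA benchmarks, retrieval augmentation improves answer accuracy over using LLMs alone. The fine-tuned LLaMA 2 model reaches 71.8% accuracy on PubMedQA, up from the 55.4% zero-shot baseline, while maintaining transparency by providing source references. Grounding answers in retrieved evidence reduces unsupported content by approximately 60%.
Conclusion: The study shows that a RAG framework can improve the reliability of open-source large language models in biomedical question answering. Grounding answers in retrieved evidence reduces hallucination and offers a practical path toward clinical informatics applications, illustrating the value of combining domain specialization with knowledge retrieval.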
📄 Abstract
Medical question-answering (QA) systems can benefit from advances in large language models (LLMs), but directly applying LLMs to the clinical domain poses challenges such as maintaining factual accuracy and avoiding hallucinations. In this paper, we present a retrieval-augmented generation (RAG) based medical QA system that combines domain-specific knowledge retrieval with open-source LLMs to answer medical questions. We fine-tune two state-of-the-art open LLMs (LLaMA 2 and Falcon) using Low-Rank Adaptation (LoRA) for efficient domain specialization. The system retrieves relevant medical literature to ground the LLM's answers, thereby improving factual correctness and reducing hallucinations. We evaluate the approach on benchmark datasets (PubMedQA and MedMCQA) and show that retrieval augmentation yields measurable improvements in answer accuracy compared to using LLMs alone. Our fine-tuned LLaMA 2 model achieves 71.8% accuracy on PubMedQA, substantially improving over the 55.4% zero-shot baseline, while maintaining transparency by providing source references. We also detail the system design and fine-tuning methodology, demonstrating that grounding answers in retrieved evidence reduces unsupported content by approximately 60%. These results highlight the potential of RAG-augmented open-source LLMs for reliable biomedical QA, pointing toward practical clinical informatics applications.
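LoRA keeps the pretrained weight matrix W frozen and learns a low-rank update, so the effective weight is W + (alpha / r) · B A with rank r much smaller than the matrix dimensions. The sketch below shows that arithmetic on tiny matrices and the resulting parameter savings. The matrix values, the 4096-wide layer, and the rank of 8 are illustrative assumptions; the abstract does not report the configuration used.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def lora_weight(w, a, b, alpha):
    """Effective weight W + (alpha / r) * B @ A, where r is the LoRA rank."""
    r = len(a)            # A has shape (r, in_features)
    delta = matmul(b, a)  # B has shape (out_features, r)
    scale = alpha / r
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


def trainable_params(out_features, in_features, r):
    """Number of parameters in the two low-rank factors A and B."""
    return r * (out_features + in_features)


# Toy example: a 2x2 frozen weight with a rank-1 update.
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 2.0]]       # shape (1, 2)
b = [[0.5], [0.0]]     # shape (2, 1)
w_eff = lora_weight(w, a, b, alpha=2.0)

# Hypothetical 4096 x 4096 projection adapted at rank 8.
lora_count = trainable_params(4096, 4096, 8)
full_count = 4096 * 4096
```

At these hypothetical sizes the adapter trains 65,536 parameters in place of about 16.8 million for the full matrix, which is why LoRA is a common choice for domain specialization on limited hardware.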
[30] M4-RAG: A Massive-Scale Multilingual Multi-Cultural Multimodal RAG
David Anugraha, Patrick Amadeus Irawan, Anshul Singh, En-Shiun Annie Lee, Genta Indra Winata
🧩 TL;DR
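This paper introduces M4-RAG, a massive-scale multilingual multimodal retrieval-augmented generation benchmark covering 42 languages and 56 regional dialects, for evaluating retrieval-augmented visual question answering across languages and modalities. It reveals a mismatch between current retrieval systems and large models.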
📘 Detailed Summary
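Motivation: Vision-language models perform well on visual question answering but are constrained by static training data. Retrieval-augmented generation can supply up-to-date, culturally relevant, and multilingual information, yet multilingual multimodal RAG remains underexplored and lacks a systematic evaluation benchmark.
Method: The authors build the M4-RAG benchmark, comprising over 80,000 culturally diverse image-question pairs across 42 languages and 56 regional dialects. They also create a controlled retrieval environment containing millions of carefully curated multilingual documents, balancing realism with reproducibility and approximating real-world retrieval conditions.
Result: Systematic evaluation shows that RAG consistently benefits smaller vision-language models but fails to scale to larger models and often even degrades their performance, exposing a critical mismatch between model size and current retrieval effectiveness.
Conclusion: The study reveals an incompatibility between current retrieval systems and large vision-language models. M4-RAG provides a foundation for next-generation RAG systems that reason seamlessly across languages, modalities, and cultural contexts, and it points to the importance of improving retrieval mechanisms to suit large models.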
📄 Abstract
Vision-language models (VLMs) have achieved strong performance in visual question answering (VQA), yet they remain constrained by static training data. Retrieval-Augmented Generation (RAG) mitigates this limitation by enabling access to up-to-date, culturally grounded, and multilingual information; however, multilingual multimodal RAG remains largely underexplored. We introduce M4-RAG, a massive-scale benchmark covering 42 languages and 56 regional dialects and registers, comprising over 80,000 culturally diverse image-question pairs for evaluating retrieval-augmented VQA across languages and modalities. To balance realism with reproducibility, we build a controlled retrieval environment containing millions of carefully curated multilingual documents relevant to the query domains, approximating real-world retrieval conditions while ensuring consistent experimentation. Our systematic evaluation reveals that although RAG consistently benefits smaller VLMs, it fails to scale to larger models and often even degrades their performance, exposing a critical mismatch between model size and current retrieval effectiveness. M4-RAG provides a foundation for advancing next-generation RAG systems capable of reasoning seamlessly across languages, modalities, and cultural contexts.
cs.AI [Back]
[31] Documenting SME Processes with Conversational AI: From Tacit Knowledge to BPMN
Unnikrishnan Radhakrishnan
🧩 TL;DR
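This paper presents an LLM-based conversational assistant that captures the tacit knowledge of small and medium-sized enterprises (SMEs) and incrementally converts it into standards-compliant BPMN 2.0 process diagrams, aiming to lower the skill and cost barriers to process documentation.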
📘 Detailed Summary
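Motivation: SMEs depend heavily on tacit, experience-based knowledge that is hard to turn into formal documentation, which leads to loss of institutional knowledge, limited operational transparency, and difficulty with continuous improvement. Existing process-modeling methods demand too much skill and cost for SMEs, so a solution that lowers both barriers is needed.
Method: The approach uses a conversational assistant powered by the Gemini 2.5 Pro LLM, delivered through a lightweight Gradio front-end with client-side bpmn-js visualisation. The system conducts an interview-style dialogue that incrementally elicits process details, supports clarifying dialogue and on-demand analysis, and renders editable BPMN 2.0 diagrams in real time.
Result: In a proof-of-concept evaluation in an equipment-maintenance scenario, the chatbot produced an accurate "AS-IS" model within about 12 minutes, flagged issues via on-diagram annotations, and generated an improved "TO-BE" variant, while keeping API costs within an SME-friendly budget. The study also analyses latency sources, model-selection trade-offs, and the challenges of enforcing strict XML schemas.
Conclusion: The study indicates that conversational LLMs can lower the skill and cost barriers to rigorous process documentation, helping SMEs preserve institutional knowledge, enhance operational transparency, and accelerate continuous improvement. It also outlines a roadmap toward agentic and multimodal deployments, offering a practical framework for SME process automation.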
📄 Abstract
Small and medium-sized enterprises (SMEs) still depend heavily on tacit, experience-based know-how that rarely makes its way into formal documentation. This paper introduces a large-language-model (LLM)-driven conversational assistant that captures such knowledge on the shop floor and converts it incrementally and interactively into standards-compliant Business Process Model and Notation (BPMN) 2.0 diagrams. Powered by Gemini 2.5 Pro and delivered through a lightweight Gradio front-end with client-side bpmn-js visualisation, the assistant conducts an interview-style dialogue: it elicits process details, supports clarifying dialogue and on-demand analysis, and renders live diagrams that users can refine in real time. A proof-of-concept evaluation in an equipment-maintenance scenario shows that the chatbot produced an accurate "AS-IS" model, flagged issues via on-diagram annotations, and generated an improved "TO-BE" variant, all within about 12 minutes, while keeping API costs within an SME-friendly budget. The study analyses latency sources, model-selection trade-offs, and the challenges of enforcing strict XML schemas, then outlines a roadmap toward agentic and multimodal deployments. The results demonstrate that conversational LLMs can potentially be used to lower the skill and cost barriers to rigorous process documentation, helping SMEs preserve institutional knowledge, enhance operational transparency, and accelerate continuous-improvement efforts.
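To make the "strict XML schema" challenge concrete, here is a minimal sketch of the kind of BPMN 2.0 process XML such an assistant would need to emit for a linear list of elicited steps. It is not the paper's pipeline: the function name `build_bpmn`, the element IDs, and the `http://example.com/bpmn` target namespace are assumptions for illustration. Only the BPMN model namespace and element names come from the BPMN 2.0 standard. The sketch also omits the diagram-interchange (layout) section, which a viewer such as bpmn-js would likely need before it can draw anything.

```python
import xml.etree.ElementTree as ET

# Namespace of the BPMN 2.0 semantic model.
BPMN_NS = "http://www.omg.org/spec/BPMN/20100524/MODEL"


def build_bpmn(steps, process_id="Process_1"):
    """Serialize a linear process (start -> tasks -> end) as BPMN 2.0 XML."""
    ET.register_namespace("bpmn", BPMN_NS)

    def q(tag):
        # Qualify a tag name with the BPMN namespace.
        return f"{{{BPMN_NS}}}{tag}"

    definitions = ET.Element(
        q("definitions"),
        {"id": "Definitions_1", "targetNamespace": "http://example.com/bpmn"},
    )
    process = ET.SubElement(
        definitions, q("process"), {"id": process_id, "isExecutable": "false"}
    )

    # Flow nodes in execution order: (element name, id, label).
    nodes = [("startEvent", "Start_1", "Start")]
    nodes += [("task", f"Task_{i}", name) for i, name in enumerate(steps, 1)]
    nodes.append(("endEvent", "End_1", "End"))
    for tag, node_id, name in nodes:
        ET.SubElement(process, q(tag), {"id": node_id, "name": name})

    # One sequence flow between each pair of consecutive nodes.
    for i, (src, dst) in enumerate(zip(nodes, nodes[1:]), 1):
        ET.SubElement(
            process,
            q("sequenceFlow"),
            {"id": f"Flow_{i}", "sourceRef": src[1], "targetRef": dst[1]},
        )
    return ET.tostring(definitions, encoding="unicode")


# Hypothetical maintenance steps, as might be elicited in an interview.
print(build_bpmn(["Inspect pump", "Replace seal"]))
```

Building the tree programmatically rather than asking the model for raw XML is one way to guarantee well-formed output; whether the paper takes this route is not stated in the abstract.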
[32] MIND: Multi-rationale INtegrated Discriminative Reasoning Framework for Multi-modal Large Models
Chuang Yu, Jinmiao Zhao, Mingxuan Zhao, Yunpeng Liu, Xiujun Shu, Yuanhao Feng, Bo Wang, Xiangyu Yue
🧩 TL;DR
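This paper proposes the Multi-rationale INtegrated Discriminative (MIND) reasoning framework to address three weaknesses of multimodal large language models on complex reasoning tasks: limited multi-rationale semantic modeling, poor logical robustness, and susceptibility to misleading interpretations. By introducing the human-like cognitive ability of "Understand -> Rethink -> Correct", it moves from passive imitation-based reasoning to active discriminative reasoning.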
📘 Detailed Summary
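Motivation: Current multimodal large language models have three main problems in reasoning tasks: limited multi-rationale semantic modeling, insufficient logical robustness, and susceptibility to misleading interpretations in complex scenarios. These limitations hold back performance on scientific, commonsense, and mathematical reasoning, and call for an active discriminative reasoning framework that mimics human cognition.
Method: The MIND framework has three core components. First, a Rationale Augmentation and Discrimination (RAD) paradigm automatically and efficiently expands existing datasets by generating diverse rationales, giving the model a unified and extensible data foundation. Second, a Progressive Two-stage Correction Learning (P2CL) strategy enhances multi-rationale positive learning in its first stage and enables active logic discrimination and correction in its second. Third, a Multi-rationale Contrastive Alignment (MCA) optimization strategy mitigates representation entanglement in the multi-rationale semantic space by semantically aggregating correct reasoning and separating incorrect reasoning at the boundary.
Result: Extensive experiments on multiple public datasets covering scientific, commonsense, and mathematical scenarios show that the MIND framework achieves state-of-the-art performance. The summary reports that the framework significantly improves accuracy and robustness on complex reasoning tasks, supporting the advantage of active discriminative reasoning over conventional passive imitation-based reasoning.
Conclusion: MIND offers a new perspective for advancing multimodal large language models toward higher levels of cognitive intelligence, shifting the paradigm from passive imitation-based reasoning to active discriminative reasoning. Beyond improving performance on complex reasoning tasks, it introduces the human-like "Understand -> Rethink -> Correct" ability and lays a methodological foundation for more robust and interpretable multimodal reasoning systems.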
📄 Abstract
Recently, multimodal large language models (MLLMs) have been widely applied to reasoning tasks. However, they suffer from limited multi-rationale semantic modeling and insufficient logical robustness, and are susceptible to misleading interpretations in complex scenarios. Therefore, we propose a Multi-rationale INtegrated Discriminative (MIND) reasoning framework, which is designed to endow MLLMs with human-like cognitive abilities of "Understand -> Rethink -> Correct", and achieves a paradigm evolution from passive imitation-based reasoning to active discriminative reasoning. Specifically, we introduce a Rationale Augmentation and Discrimination (RAD) paradigm, which automatically and efficiently expands existing datasets by generating diverse rationales, providing a unified and extensible data foundation. Meanwhile, we design a Progressive Two-stage Correction Learning (P2CL) strategy. The first phase enhances multi-rationale positive learning, while the second phase enables active logic discrimination and correction. In addition, to mitigate representation entanglement in the multi-rationale semantic space, we propose a Multi-rationale Contrastive Alignment (MCA) optimization strategy, which achieves semantic aggregation of correct reasoning and boundary separation of incorrect reasoning. Extensive experiments demonstrate that the proposed MIND reasoning framework achieves state-of-the-art (SOTA) performance on multiple public datasets covering scientific, commonsense, and mathematical scenarios. It provides a new perspective for advancing MLLMs towards higher levels of cognitive intelligence. Our code is available at https://github.com/YuChuang1205/MIND
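The abstract does not give the MCA objective, so the following is only a generic margin-based contrastive sketch of the stated idea: pull embeddings of correct rationales toward an anchor (aggregation) and push incorrect ones below a similarity margin (boundary separation). The function name, the cosine-similarity choice, and the `margin` value are assumptions, not the paper's formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def contrastive_alignment_loss(anchor, correct, incorrect, margin=0.5):
    """Generic sketch: aggregate correct rationales, separate incorrect ones.

    anchor    -- embedding of the question (or of a reference rationale)
    correct   -- list of embeddings of correct rationales
    incorrect -- list of embeddings of incorrect rationales
    margin    -- hypothetical similarity ceiling for incorrect rationales
    """
    # Aggregation term: zero when every correct rationale aligns with the anchor.
    pull = sum(1.0 - cosine(anchor, c) for c in correct) / len(correct)
    # Separation term: penalize incorrect rationales only above the margin.
    push = sum(max(0.0, cosine(anchor, w) - margin) for w in incorrect) / len(
        incorrect
    )
    return pull + push


# Toy 2-D embeddings: the correct rationale is aligned, the incorrect is orthogonal.
print(contrastive_alignment_loss([1.0, 0.0], [[1.0, 0.0]], [[0.0, 1.0]]))
```

In a training setup this scalar would be computed on learned embeddings with an autodiff framework; plain Python is used here only to keep the arithmetic visible.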
[33] Multimodal Oncology Agent for IDH1 Mutation Prediction in Low-Grade Glioma
Hafsa Akebli, Adam Shephard, Vincenzo Della Mea, Nasir Rajpoot
🧩 TL;DR
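This study proposes a Multimodal Oncology Agent (MOA) that integrates a histology tool based on the TITAN foundation model with reasoning over structured clinical and genomic data to predict IDH1 mutation in low-grade glioma, outperforming conventional clinical and histology baselines.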
📘 Detailed Summary
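Motivation: IDH1 mutations in low-grade glioma define clinical subgroups with specific prognostic and therapeutic implications, but existing prediction methods have limitations. A more accurate system is needed that integrates multimodal information and draws on external biomedical knowledge.
Method: The proposed Multimodal Oncology Agent (MOA) integrates a histology tool based on the TITAN foundation model for IDH1 mutation prediction and reasons over structured clinical and genomic inputs through PubMed, Google Search, and OncoKB, fusing information from multiple sources.
Result: On 488 patients from the TCGA-LGG cohort, MOA without the histology tool achieved an F1-score of 0.826, above the clinical baseline at 0.798. Fused with histology features, it reached its best performance with an F1-score of 0.912, exceeding the histology baseline at 0.894 and the fused histology-clinical baseline at 0.897.
Conclusion: The agent captures complementary mutation-relevant information through external biomedical resources and predicts IDH1 mutation accurately, showing the value of multimodal fusion and external knowledge integration for molecular tumor subtyping and offering a new technical route for precision medicine.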
📄 Abstract
Low-grade gliomas frequently present IDH1 mutations that define clinically distinct subgroups with specific prognostic and therapeutic implications. This work introduces a Multimodal Oncology Agent (MOA) integrating a histology tool based on the TITAN foundation model for IDH1 mutation prediction in low-grade glioma, combined with reasoning over structured clinical and genomic inputs through PubMed, Google Search, and OncoKB. MOA reports were quantitatively evaluated on 488 patients from the TCGA-LGG cohort against clinical and histology baselines. MOA without the histology tool outperformed the clinical baseline, achieving an F1-score of 0.826 compared to 0.798. When fused with histology features, MOA reached the highest performance with an F1-score of 0.912, exceeding both the histology baseline at 0.894 and the fused histology-clinical baseline at 0.897. These results demonstrate that the proposed agent captures complementary mutation-relevant information enriched through external biomedical sources, enabling accurate IDH1 mutation prediction.
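For readers less familiar with the metric reported above, a short sketch of how a binary F1-score is computed from confusion counts follows. The counts in the usage line are hypothetical and do not come from the paper, and the abstract does not say whether its F1 values are binary, macro-averaged, or weighted.

```python
def f1_score(tp, fp, fn):
    """Binary F1: harmonic mean of precision and recall.

    tp -- true positives (mutant cases predicted as mutant)
    fp -- false positives (wild-type cases predicted as mutant)
    fn -- false negatives (mutant cases predicted as wild-type)
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only.
print(round(f1_score(tp=8, fp=2, fn=2), 3))
```

Because F1 ignores true negatives, it is a common choice when the classes are imbalanced, which may be why it is preferred over plain accuracy in settings like this one.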
[34] PRiSM: An Agentic Multimodal Benchmark for Scientific Reasoning via Python-Grounded Evaluation
Shima Imani, Seungwhan Moon, Adel Ahmadyan, Lu Zhang, Kirmani Ahmed, Babak Damavandi
🧩 TL;DR
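This paper introduces PRiSM, a synthetic, dynamic, multimodal benchmark for evaluating scientific reasoning. It contains over 24,750 university-level physics and math problems, uses Python code for generation and verification to enable fine-grained evaluation, and reveals the limitations of existing vision-language models in scientific reasoning.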
📘 Detailed Summary
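Motivation: Current benchmarks for evaluating vision-language models in scientific domains such as mathematics and physics fall short. These domains demand conceptual understanding, symbolic reasoning, and adherence to formal laws, yet existing datasets are typically static and lack intermediate reasoning steps, robustness to variations, or mechanisms for verifying scientific correctness, so they cannot fully assess scientific reasoning.
Method: The authors develop PRiSM, a fully dynamic multimodal evaluation benchmark with over 24,750 university-level physics and math problems. A scalable agent-based pipeline, PrismAgent, generates structured problem instances. Each problem contains dynamic textual and visual input, a generated figure, and rich structured outputs: executable Python code for ground-truth generation and verification, and detailed step-by-step reasoning.
Result: A comprehensive evaluation of existing vision-language models on PRiSM reveals their limitations in scientific reasoning. The benchmark supports five targeted evaluation tasks (generalization, symbolic program synthesis, perturbation robustness, reasoning correction, and ambiguity resolution), enabling fine-grained experimental auditing of multimodal models that exposes failure modes, uncertainty behaviors, and gaps in scientific reasoning.
Conclusion: PRiSM provides a deeper analysis tool for assessing the scientific reasoning of vision-language models. Its Python-powered automated ground-truth generation and dynamic problem construction expose specific failure modes and limitations on complex scientific tasks, providing a basis for future model improvement and evaluation methodology.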
📄 Abstract
Evaluating vision-language models (VLMs) in scientific domains like mathematics and physics poses unique challenges that go far beyond predicting final answers. These domains demand conceptual understanding, symbolic reasoning, and adherence to formal laws, requirements that most existing benchmarks fail to address. In particular, current datasets tend to be static, lacking intermediate reasoning steps, robustness to variations, or mechanisms for verifying scientific correctness. To address these limitations, we introduce PRiSM, a synthetic, fully dynamic, and multimodal benchmark for evaluating scientific reasoning via grounded Python code. PRiSM includes over 24,750 university-level physics and math problems, and it leverages our scalable agent-based pipeline, PrismAgent, to generate well-structured problem instances. Each problem contains dynamic textual and visual input, a generated figure, alongside rich structured outputs: executable Python code for ground truth generation and verification, and detailed step-by-step reasoning. The dynamic nature and Python-powered automated ground truth generation of our benchmark allow for fine-grained experimental auditing of multimodal VLMs, revealing failure modes, uncertainty behaviors, and limitations in scientific reasoning. To this end, we propose five targeted evaluation tasks covering generalization, symbolic program synthesis, perturbation robustness, reasoning correction, and ambiguity resolution. Through comprehensive evaluation of existing VLMs, we highlight their limitations and showcase how PRiSM enables deeper insights into their scientific reasoning capabilities.
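The idea of a dynamic problem whose ground truth is produced by executable Python can be sketched as follows. This is an illustrative toy, not PrismAgent output: the projectile problem, the function names, and the 1% tolerance are all assumptions.

```python
import math
import random


def make_projectile_problem(seed):
    """Generate one problem instance; a new seed gives a new variant."""
    rng = random.Random(seed)
    v0 = rng.randint(10, 30)          # launch speed in m/s
    angle = rng.choice([30, 45, 60])  # launch angle in degrees
    text = (
        f"A ball is launched at {v0} m/s at {angle} degrees above the "
        "horizontal on level ground. Take g = 9.8 m/s^2 and ignore air "
        "resistance. What is its horizontal range in metres?"
    )
    return {"text": text, "params": {"v0": v0, "angle_deg": angle, "g": 9.8}}


def ground_truth(params):
    """Executable reference solution: range = v0^2 * sin(2*theta) / g."""
    theta = math.radians(params["angle_deg"])
    return params["v0"] ** 2 * math.sin(2 * theta) / params["g"]


def verify(params, model_answer, rel_tol=1e-2):
    """Accept a numeric answer within a relative tolerance of the reference."""
    return math.isclose(model_answer, ground_truth(params), rel_tol=rel_tol)


problem = make_projectile_problem(seed=0)
print(problem["text"])
print(verify(problem["params"], ground_truth(problem["params"])))
```

Because the parameters are sampled and the answer is recomputed by code, the same template supports perturbation tests: change the seed, regenerate the reference value, and check whether the model's answer tracks it.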
[35] TRACE: A Framework for Analyzing and Enhancing Stepwise Reasoning in Vision-Language Models
Shima Imani, Seungwhan Moon, Lambert Mathias, Lu Zhang, Babak Damavandi
🧩 TL;DR
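This paper proposes TRACE, a framework for transparent reasoning and consistency evaluation that diagnoses the reasoning trajectories of large vision-language models in mathematical and scientific reasoning rather than only the final answer, exposing reasoning failures that standard evaluation overlooks.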
📘 Detailed Summary
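Motivation: Reliable mathematical and scientific reasoning remains a challenge for large vision-language models. Standard final-answer evaluation often masks reasoning errors and lets silent failures persist, so an evaluation framework is needed that diagnoses reasoning trajectories rather than only end results.
Method: At the core of TRACE are Auxiliary Reasoning Sets (ARS), compact sub-question-answer pairs that decompose complex problems. Intermediate steps are evaluated through consistency-based metrics, which expose reasoning failures overlooked by standard evaluation, and the framework defines confidence regions that distinguish reliable from unreliable reasoning paths.
Result: Experiments show that consistency across ARS correlates with final-answer correctness and helps pinpoint the specific steps where reasoning fails, offering actionable signals for model improvement. The confidence regions distinguish reliable from unreliable reasoning paths.
Conclusion: By diagnosing reasoning trajectories rather than only final results, TRACE offers a more transparent way to evaluate the mathematical and scientific reasoning of large vision-language models, supports effective filtering, debugging, and model refinement, and provides a new route to addressing silent reasoning failures.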
📄 Abstract
Reliable mathematical and scientific reasoning remains an open challenge for large vision-language models. Standard final-answer evaluation often masks reasoning errors, allowing silent failures to persist. To address this gap, we introduce TRACE, a framework for Transparent Reasoning And Consistency Evaluation that diagnoses reasoning trajectories rather than only end results. At its core, TRACE leverages Auxiliary Reasoning Sets (ARS), compact sub-question-answer pairs that decompose complex problems, evaluate intermediate steps through consistency-based metrics, and expose failures overlooked by standard evaluation. Our experiments show that consistency across ARS correlates with final-answer correctness and helps pinpoint the reasoning steps where failures arise, offering actionable signals for model improvement. Furthermore, TRACE defines confidence regions that distinguish reliable from unreliable reasoning paths, supporting effective filtering, debugging, and model refinement.
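The abstract does not define its consistency metric or its confidence regions, so the sketch below shows only one plausible reading: sample the model several times on each sub-question, score each step by agreement with its majority answer, and bucket the path by its mean score. The function names and the 0.8 and 0.5 thresholds are hypothetical.

```python
from collections import Counter


def step_consistency(answers):
    """Fraction of sampled answers that agree with the majority answer."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)


def analyze_path(sampled_answers, reliable_at=0.8, unreliable_below=0.5):
    """Score each sub-question and assign the path to a confidence region.

    sampled_answers -- one list of repeated model answers per sub-question
    reliable_at, unreliable_below -- hypothetical region thresholds
    """
    scores = [step_consistency(answers) for answers in sampled_answers]
    mean = sum(scores) / len(scores)
    # The least consistent step is a candidate location for the failure.
    weakest = min(range(len(scores)), key=scores.__getitem__)
    if mean >= reliable_at:
        region = "reliable"
    elif mean < unreliable_below:
        region = "unreliable"
    else:
        region = "uncertain"
    return {"scores": scores, "mean": mean, "weakest_step": weakest, "region": region}


# Hypothetical samples: four answers for each of three sub-questions.
samples = [["4", "4", "4", "4"], ["9", "9", "7", "9"], ["1", "2", "3", "4"]]
print(analyze_path(samples))
```

Reporting the weakest step alongside the region is what would turn a path-level score into a debugging signal, in the spirit of the step-level localization the abstract describes.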