UniMLVG: Unified Framework for Multi-view Long Video Generation with Comprehensive Control Capabilities for Autonomous Driving

Rui Chen1,2, Zehuan Wu2, Yichen Liu2, Yuxin Guo2, Jingcheng Ni2, Haifeng Xia1, Siyu Xia1,
1Southeast University   2SenseTime Research

Abstract

The creation of diverse and realistic driving scenarios has become essential for enhancing the perception and planning capabilities of autonomous driving systems. However, generating long-duration, surround-view-consistent driving videos remains a significant challenge. To address this, we present UniMLVG, a unified framework designed to generate long multi-view street videos under precise control. By integrating single- and multi-view driving videos into the training data, our approach updates the cross-frame and cross-view modules across three stages with distinct training objectives, substantially boosting the diversity and quality of the generated visual content. Additionally, we employ explicit viewpoint modeling in multi-view video generation to effectively improve motion-transition consistency. Capable of handling various input reference formats (e.g., text, images, or video), UniMLVG generates high-quality multi-view videos that honor condition constraints such as 3D bounding boxes or frame-level text descriptions. Compared to the best models with similar capabilities, our framework achieves improvements of 21.4% in FID and 36.5% in FVD.
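
The "explicit viewpoint modeling" mentioned above can be pictured as injecting each camera's pose directly into that view's tokens. The sketch below shows one minimal way to do this, assuming the viewpoint is encoded from a flattened 4x4 extrinsic matrix; the `ViewpointEmbedding` class, its dimensions, and the encoding choice are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ViewpointEmbedding(nn.Module):
    """Adds a learned camera-pose offset to each view's patch tokens (hypothetical)."""
    def __init__(self, dim: int = 1152):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(16, dim),   # flattened 4x4 extrinsic matrix
            nn.SiLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, tokens: torch.Tensor, extrinsics: torch.Tensor) -> torch.Tensor:
        # tokens: (B, V, N, D) patch tokens per camera view
        # extrinsics: (B, V, 4, 4) camera-to-world matrices
        pose = self.mlp(extrinsics.flatten(-2).float())  # (B, V, D)
        return tokens + pose.unsqueeze(2)                # broadcast over the N tokens
```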

Framework: Extending MM-DiT to Multi-view Video Generation

Figure: Overall architecture of UniMLVG.

UniMLVG extends text-to-image generation models with cross-view and temporal modules and a unified conditional control mechanism. By combining multi-task training objectives (image/video prediction and image/video generation), mixed single- and multi-view datasets, and a multi-stage training strategy, it generates long-duration, surround-view-consistent driving-scene videos.
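
As a rough illustration of the description above, the sketch below shows how a transformer block could interleave spatial, cross-view, and temporal self-attention over multi-view video tokens. It is a minimal stand-in, not the actual UniMLVG block: the real MM-DiT blocks also carry the text stream and conditioning, and the names here (`MultiViewVideoBlock`, the token layout) are assumptions.

```python
import torch
import torch.nn as nn

class MultiViewVideoBlock(nn.Module):
    """Spatial attention, then cross-view and temporal attention (hypothetical)."""
    def __init__(self, dim: int = 1152, heads: int = 16):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_view = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(3))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, V, N, D) = batch, frames, camera views, patch tokens, channels
        B, T, V, N, D = x.shape
        # 1) Spatial attention within each single frame and view.
        h = x.reshape(B * T * V, N, D)
        q = self.norms[0](h)
        h = h + self.spatial(q, q, q, need_weights=False)[0]
        # 2) Cross-view attention: co-located tokens attend across the V views.
        h = h.reshape(B, T, V, N, D).transpose(2, 3).reshape(B * T * N, V, D)
        q = self.norms[1](h)
        h = h + self.cross_view(q, q, q, need_weights=False)[0]
        # 3) Temporal attention: tokens attend across the T frames.
        h = h.reshape(B, T, N, V, D).permute(0, 2, 3, 1, 4).reshape(B * N * V, T, D)
        q = self.norms[2](h)
        h = h + self.temporal(q, q, q, need_weights=False)[0]
        # Restore the original (B, T, V, N, D) layout.
        return h.reshape(B, N, V, T, D).permute(0, 3, 2, 1, 4)
```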

All generated videos on this page are 20 s at 12 FPS, produced autoregressively on validation datasets!
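
A minimal sketch of the autoregressive rollout implied here: each new clip is conditioned on the tail frames of the previous one until the target duration is reached. `generate_clip`, `clip_len`, and `num_ref` are hypothetical names and values, not the project's actual API.

```python
def generate_long_video(model, conditions, clip_len=12, num_ref=4,
                        fps=12, seconds=20):
    """Roll out a multi-view video clip by clip to the target duration."""
    total = fps * seconds          # 240 frames for 20 s at 12 FPS
    frames, ref = [], None         # the first clip may have no reference frames
    while len(frames) < total:
        clip = model.generate_clip(ref_frames=ref, conditions=conditions,
                                   num_frames=clip_len)
        frames.extend(clip)
        ref = clip[-num_ref:]      # seed the next clip with this clip's tail
    return frames[:total]
```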

Each video is arranged in four rows: the first row shows the ground truth, the second the 3D-box condition, the third the HD-map condition, and the last our generated result.

Generation With Reference Frames

Generation Without Reference Frames

Snowy Scene Generation Based on Text Editing

Realistic Scene Generation Based on CARLA Conditions