UniMLVG: Unified Framework for Multi-view Long Video Generation with Comprehensive Control Capabilities for Autonomous Driving

Rui Chen1,2, Zehuan Wu2, Yichen Liu2, Yuxin Guo2, Jingcheng Ni2, Haifeng Xia1, Siyu Xia1,
1Southeast University   2SenseTime Research

Abstract

The creation of diverse and realistic driving scenarios has become essential for enhancing the perception and planning capabilities of autonomous driving systems. However, generating long-duration, surround-view-consistent driving videos remains a significant challenge. To address this, we present UniMLVG, a unified framework designed to generate long multi-view street videos under precise control. By integrating single- and multi-view driving videos into the training data, our approach updates the cross-frame and cross-view modules across three stages with distinct training objectives, substantially boosting the diversity and quality of the generated visual content. Additionally, we employ explicit viewpoint modeling in multi-view video generation to effectively improve motion-transition consistency. Capable of handling various input reference formats (e.g., text, images, or video), UniMLVG generates high-quality multi-view videos that honor condition constraints such as 3D bounding boxes or frame-level text descriptions. Compared to the best models with similar capabilities, our framework achieves improvements of 21.4% in FID and 36.5% in FVD.
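
The "explicit viewpoint modeling" mentioned above can be pictured as injecting each camera's pose directly into that view's tokens. The sketch below shows one minimal way to do this, assuming the viewpoint is encoded from a flattened 4x4 extrinsic matrix; the `ViewpointEmbedding` class, its dimensions, and the encoding choice are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ViewpointEmbedding(nn.Module):
    """Adds a learned camera-pose offset to each view's patch tokens (hypothetical)."""
    def __init__(self, dim: int = 1152):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(16, dim),   # flattened 4x4 extrinsic matrix
            nn.SiLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, tokens: torch.Tensor, extrinsics: torch.Tensor) -> torch.Tensor:
        # tokens: (B, V, N, D) patch tokens per camera view
        # extrinsics: (B, V, 4, 4) camera-to-world matrices
        pose = self.mlp(extrinsics.flatten(-2).float())  # (B, V, D)
        return tokens + pose.unsqueeze(2)                # broadcast over the N tokens
```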

Framework: Extending MM-DiT to Multi-view Video Generation

Figure: Overall architecture of UniMLVG.

UniMLVG extends text-to-image generation models with cross-view and temporal modules and a unified conditional control mechanism. By combining multi-task training objectives (image/video prediction and image/video generation), mixed single- and multi-view datasets, and a multi-stage training strategy, it generates long-duration, surround-view-consistent driving-scene videos.
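
As a rough illustration of the description above, the sketch below shows how a transformer block could interleave spatial, cross-view, and temporal self-attention over multi-view video tokens. It is a minimal stand-in, not the actual UniMLVG block: the real MM-DiT blocks also carry the text stream and conditioning, and the names here (`MultiViewVideoBlock`, the token layout) are assumptions.

```python
import torch
import torch.nn as nn

class MultiViewVideoBlock(nn.Module):
    """Spatial attention, then cross-view and temporal attention (hypothetical)."""
    def __init__(self, dim: int = 1152, heads: int = 16):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_view = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(3))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, V, N, D) = batch, frames, camera views, patch tokens, channels
        B, T, V, N, D = x.shape
        # 1) Spatial attention within each single frame and view.
        h = x.reshape(B * T * V, N, D)
        q = self.norms[0](h)
        h = h + self.spatial(q, q, q, need_weights=False)[0]
        # 2) Cross-view attention: co-located tokens attend across the V views.
        h = h.reshape(B, T, V, N, D).transpose(2, 3).reshape(B * T * N, V, D)
        q = self.norms[1](h)
        h = h + self.cross_view(q, q, q, need_weights=False)[0]
        # 3) Temporal attention: tokens attend across the T frames.
        h = h.reshape(B, T, N, V, D).permute(0, 2, 3, 1, 4).reshape(B * N * V, T, D)
        q = self.norms[2](h)
        h = h + self.temporal(q, q, q, need_weights=False)[0]
        # Restore the original (B, T, V, N, D) layout.
        return h.reshape(B, N, V, T, D).permute(0, 3, 2, 1, 4)
```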

All generated videos on this page are 20 s at 12 FPS, produced autoregressively on validation datasets!
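
A minimal sketch of the autoregressive rollout implied here: each new clip is conditioned on the tail frames of the previous one until the target duration is reached. `generate_clip`, `clip_len`, and `num_ref` are hypothetical names and values, not the project's actual API.

```python
def generate_long_video(model, conditions, clip_len=12, num_ref=4,
                        fps=12, seconds=20):
    """Roll out a multi-view video clip by clip to the target duration."""
    total = fps * seconds          # 240 frames for 20 s at 12 FPS
    frames, ref = [], None         # the first clip may have no reference frames
    while len(frames) < total:
        clip = model.generate_clip(ref_frames=ref, conditions=conditions,
                                   num_frames=clip_len)
        frames.extend(clip)
        ref = clip[-num_ref:]      # seed the next clip with this clip's tail
    return frames[:total]
```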

Each video is arranged in four rows: the first row shows the ground truth, the second the 3D-box condition, the third the HD-map condition, and the last our generated result.

Generation With Reference Frames

Generation Without Reference Frames

Snowy Scene Generation Based on Text Editing

Realistic Scene Generation Based on CARLA Conditions