Abstract

Generative models have been widely applied to world modeling for environment simulation and future state prediction. With advancements in autonomous driving, there is a growing demand not only for high-fidelity video generation under various controls, but also for producing diverse and meaningful outputs such as depth estimation. To address this, we propose CVD-STORM, a cross-view video diffusion model utilizing a spatial-temporal reconstruction Variational Autoencoder (VAE) that generates long-term, multi-view videos with 4D reconstruction capabilities under various control inputs. Our approach first fine-tunes the VAE with an auxiliary 4D reconstruction task, enhancing its ability to encode 3D structures and temporal dynamics. We then integrate this VAE into the video diffusion process to substantially improve generation quality. Experimental results demonstrate that our model achieves substantial improvements in both FID and FVD metrics. Additionally, the jointly trained Gaussian Splatting decoder effectively reconstructs dynamic scenes, providing valuable geometric information for comprehensive scene understanding.
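To make the two-stage idea concrete, the sketch below (a minimal, hypothetical simplification, not the released implementation; all module and variable names are assumptions) shows stage 1: a spatial-temporal video VAE is fine-tuned with both an RGB reconstruction decoder and an auxiliary Gaussian Splatting head, so the latents later consumed by the video diffusion model carry geometric and temporal information.

```python
# Hypothetical sketch of the stage-1 training setup (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTemporalVAE(nn.Module):
    """Toy stand-in for the spatial-temporal reconstruction VAE."""
    def __init__(self, in_ch=3, latent_ch=8):
        super().__init__()
        # 3D convolutions capture spatial structure and temporal dynamics jointly.
        self.encoder = nn.Sequential(
            nn.Conv3d(in_ch, 32, 3, stride=(1, 2, 2), padding=1), nn.SiLU(),
            nn.Conv3d(32, 2 * latent_ch, 3, stride=(1, 2, 2), padding=1),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(latent_ch, 32, (3, 4, 4), stride=(1, 2, 2), padding=1), nn.SiLU(),
            nn.ConvTranspose3d(32, in_ch, (3, 4, 4), stride=(1, 2, 2), padding=1),
        )
        # Auxiliary head predicting per-pixel Gaussian parameters
        # (e.g. position offset, scale, rotation, opacity, color: ~14 channels assumed).
        self.gs_decoder = nn.Sequential(
            nn.ConvTranspose3d(latent_ch, 32, (3, 4, 4), stride=(1, 2, 2), padding=1), nn.SiLU(),
            nn.ConvTranspose3d(32, 14, (3, 4, 4), stride=(1, 2, 2), padding=1),
        )

    def forward(self, video):                                   # video: (B, C, T, H, W)
        mean, logvar = self.encoder(video).chunk(2, dim=1)
        z = mean + torch.randn_like(mean) * (0.5 * logvar).exp()  # reparameterization
        return self.decoder(z), self.gs_decoder(z), mean, logvar

vae = SpatialTemporalVAE()
video = torch.randn(1, 3, 4, 64, 64)                            # one 4-frame clip
rgb_recon, gaussian_params, mean, logvar = vae(video)

# Stage-1 objective: RGB reconstruction + KL + an auxiliary 4D reconstruction term.
# A differentiable Gaussian Splatting renderer would turn gaussian_params into
# rendered images/depth; a placeholder comment keeps this sketch self-contained.
kl = -0.5 * torch.mean(1 + logvar - mean.pow(2) - logvar.exp())
recon = F.mse_loss(rgb_recon, video)
loss = recon + 1e-4 * kl  # + lambda_gs * gs_render_loss(gaussian_params, video)
loss.backward()
```

In stage 2, the frozen (or lightly tuned) encoder would supply these latents to the cross-view video diffusion model, which is trained with the usual denoising objective under the various control inputs.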

Overview

Video

Depth Estimation

In the first stage, we train a Gaussian Splatting decoder for 4D scene reconstruction. During inference, it decodes a Gaussian Splatting representation from the latent code and renders depth maps for the generated images. Visualizations of these depth maps are provided below.
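For intuition, the sketch below (an assumption-laden simplification, not the released renderer) shows one way a depth map can be produced from decoded Gaussian centers: project them into a pinhole camera and keep the nearest depth per pixel. A full Gaussian Splatting renderer would instead alpha-blend anisotropic splats.

```python
# Hypothetical point-based depth rendering from Gaussian centers (z-buffer style).
import torch

def render_depth(means_world, K, w2c, height, width):
    """means_world: (N, 3) Gaussian centers; K: (3, 3) intrinsics; w2c: (4, 4) extrinsics."""
    n = means_world.shape[0]
    homog = torch.cat([means_world, torch.ones(n, 1)], dim=1)   # (N, 4) homogeneous coords
    cam = (w2c @ homog.T).T[:, :3]                               # points in the camera frame
    depth = cam[:, 2]
    valid = depth > 1e-3                                         # discard points behind the camera
    uv = (K @ cam[valid].T).T
    uv = uv[:, :2] / uv[:, 2:3]                                  # perspective divide
    u = uv[:, 0].round().long()
    v = uv[:, 1].round().long()
    inb = (u >= 0) & (u < width) & (v >= 0) & (v < height)       # keep in-image projections only
    depth_map = torch.full((height, width), float("inf"))
    flat = v[inb] * width + u[inb]
    # z-buffer: keep the nearest depth that lands on each pixel
    depth_map.view(-1).scatter_reduce_(0, flat, depth[valid][inb], reduce="amin")
    return depth_map

# Hypothetical usage with fake Gaussian centers decoded for one frame.
means = torch.rand(10_000, 3) * 20 - 10
K = torch.tensor([[500., 0., 128.], [0., 500., 128.], [0., 0., 1.]])
w2c = torch.eye(4)
depth = render_depth(means, K, w2c, 256, 256)   # (256, 256); inf where no Gaussian projects
```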

Video Generation

Our method generates diverse, high-fidelity multi-view videos under various control inputs.


Video Prediction

Our method can also predict future frames conditioned on an input video.

Citation

Acknowledgements

The website template was borrowed from Michaël Gharbi and Ref-NeRF.