Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation

Zhiwu Qing, Shiwei Zhang*, Jiayu Wang, Xiang Wang, Yujie Wei, Yingya Zhang,
Changxin Gao*, Nong Sang

Huazhong University of Science and Technology
Alibaba Group, Fudan University

Although diffusion models have shown a powerful ability to generate photorealistic images, generating realistic and diverse videos remains in its infancy. One key reason is that current methods intertwine spatial content and temporal dynamics, which notably increases the complexity of text-to-video (T2V) generation. In this work, we propose HiGen, a diffusion-model-based method that improves performance by decoupling the spatial and temporal factors of videos from two perspectives: the structure level and the content level. At the structure level, we decompose the T2V task into two steps, spatial reasoning and temporal reasoning, using a unified denoiser. Specifically, we generate spatially coherent priors from text during spatial reasoning and then generate temporally coherent motions from these priors during temporal reasoning. At the content level, we extract two subtle cues from the content of the input video that express motion and appearance changes, respectively. These two cues then guide the model's training for video generation, enabling flexible content variations and enhancing temporal stability. Through this decoupled paradigm, HiGen effectively reduces the complexity of the task and generates realistic videos with semantic accuracy and motion stability. Extensive experiments demonstrate the superior performance of HiGen over state-of-the-art T2V methods. Our source code and models will be made available.
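The structure-level decomposition can be pictured with a toy control-flow sketch. It is only an illustration under strong assumptions, not the HiGen implementation. The function names (`denoise`, `spatial_reasoning`, `temporal_reasoning`), the list-of-floats "embedding", and the scalar `motion` argument (a stand-in for a content-level motion cue) are all hypothetical. The "denoiser" here simply pulls noise toward its conditioning target, where a real model would use a learned diffusion network.

```python
import random


def denoise(x, target, steps=20, rate=0.5):
    """Toy stand-in for one shared denoiser: each step moves x part of the
    way toward the conditioning target (assumption: not a real diffusion step)."""
    for _ in range(steps):
        x = [xi + rate * (ti - xi) for xi, ti in zip(x, target)]
    return x


def spatial_reasoning(text_embedding, rng):
    """Step 1: start from noise and produce a single spatial prior,
    conditioned on the text only."""
    noise = [rng.gauss(0.0, 1.0) for _ in text_embedding]
    return denoise(noise, text_embedding)


def temporal_reasoning(prior, num_frames, motion, rng):
    """Step 2: start from fresh noise per frame and produce frames
    conditioned on the spatial prior. Frame k is shifted by k * motion,
    a hypothetical scalar playing the role of a motion cue."""
    frames = []
    for k in range(num_frames):
        noise = [rng.gauss(0.0, 1.0) for _ in prior]
        target = [p + k * motion for p in prior]
        frames.append(denoise(noise, target))  # same denoiser as step 1
    return frames


rng = random.Random(0)
text_embedding = [0.2, -0.4, 0.9]  # hypothetical encoded prompt
prior = spatial_reasoning(text_embedding, rng)
frames = temporal_reasoning(prior, num_frames=4, motion=0.1, rng=rng)
print(len(frames), "frames of dimension", len(frames[0]))
```

The point of the sketch is the ordering: text conditions only the first step, and the second step sees the text solely through the spatial prior, while both steps call the same `denoise` function, mirroring the "unified denoiser" idea. An appearance cue is omitted for brevity. It could be modeled as a second hypothetical conditioning input in the same way.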

[Figure: sample generated videos, one per text prompt listed below; images not reproduced.]

Batman turns his head from right to left.

Tiny plant sprout coming out of the ground.

A video of a duckling wearing a medieval soldier helmet and riding a skateboard.

Astronaut riding a horse.

A wise tortoise in a tweed hat and spectacles reads a newspaper, Howard Hodgkin style.

A gentleman with a handlebar mustache, a bowler hat, and a monocle. with the style of oil painting.

A sketch painting of girl walking with a cat, created by Hayao Miyazaki.

With the style of low-poly game art, A majestic, white horse gallops gracefully across a moonlit beach.

Iron Man is walking towards the camera in the rain at night, with a lot of fog behind him. Science fiction movie, close-up.

A samurai walking on a bridge.

In a small forest, a colorful bird was flying around gracefully. Its shiny feathers reflected the sun rays, creating a beautiful sight.

A stylized octopus swimming in an underwater cave system.

A female mandalorian in forest green mandalorian armour with helmet.

Mickey Mouse is dancing on white background.

A yellow tiger with lightning around it.

Valkyrie riding flying horses through the clouds.

Pikachu turned his back towards me.

A girl with long curly blonde hair and sunglasses, camera pan from left to right.