pschaldenbrand / text2video

A method for generating bizarre-looking videos from a series of language descriptions. From the Bot Intelligence Group at CMU: Peter Schaldenbrand, Zhixuan Liu, and Jean Oh




Run time and cost

This model runs on NVIDIA T4 GPU hardware. Predictions typically complete within 3 minutes, but prediction time varies significantly with the inputs.


This is a method for generating videos from language descriptions. The video is generated by looping through the given text prompts in order, with each prompt guiding a stretch of frames. Frames are generated at roughly one frame per second.
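To make the prompt loop concrete, here is a minimal sketch of how frames might be assigned to prompts. It assumes each prompt is held for a fixed number of frames; the names `prompt_for_frame`, `build_schedule`, and `frames_per_prompt` are hypothetical and are not taken from the model's actual inputs.

```python
from typing import List


def prompt_for_frame(prompts: List[str], frame_index: int, frames_per_prompt: int) -> str:
    """Return the prompt guiding a given frame (hypothetical helper)."""
    if not prompts:
        raise ValueError("at least one prompt is required")
    if frames_per_prompt < 1:
        raise ValueError("frames_per_prompt must be >= 1")
    # Assumption: each prompt is held for a fixed number of frames.
    # Frames past the end of the schedule stay on the last prompt.
    prompt_index = min(frame_index // frames_per_prompt, len(prompts) - 1)
    return prompts[prompt_index]


def build_schedule(prompts: List[str], frames_per_prompt: int) -> List[str]:
    """List the guiding prompt for every frame of the video, in order."""
    total_frames = len(prompts) * frames_per_prompt
    return [prompt_for_frame(prompts, i, frames_per_prompt) for i in range(total_frames)]


schedule = build_schedule(["a calm lake", "a stormy sea"], frames_per_prompt=3)
for frame_index, prompt in enumerate(schedule):
    print(frame_index, prompt)
```

At roughly one generated frame per second, the length of this schedule also gives a rough estimate of generation time in seconds, excluding any model start-up cost.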

More info here:

Fast Text2Video

By optimizing the pixels of the video’s frames directly, rather than using a pre-trained generator model, this method achieves near real-time video generation. An image-to-image translation model is then used to denoise the directly optimized frames.
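The idea of direct pixel optimization can be illustrated with a toy sketch: the pixel values themselves are the parameters being updated by gradient descent, with no generator network in between. The loss below is only a stand-in (squared distance to a fixed target frame), assumed for illustration; the actual method scores frames against the text prompts, and the denoising step is omitted. All names here are hypothetical.

```python
from typing import List, Tuple

Frame = List[float]  # flattened pixel intensities in [0, 1]


def toy_loss_and_grad(frame: Frame, target: Frame) -> Tuple[float, Frame]:
    """Stand-in objective: squared distance to a target frame.

    Placeholder only; the real method uses a text-driven loss.
    """
    diffs = [p - t for p, t in zip(frame, target)]
    loss = sum(d * d for d in diffs)
    grad = [2.0 * d for d in diffs]  # derivative of the loss w.r.t. each pixel
    return loss, grad


def optimize_pixels(frame: Frame, target: Frame, steps: int = 100, lr: float = 0.1) -> Frame:
    """Run plain gradient descent on the pixel values themselves."""
    pixels = list(frame)
    for _ in range(steps):
        _, grad = toy_loss_and_grad(pixels, target)
        # Update every pixel directly and keep it in the valid intensity range.
        pixels = [min(1.0, max(0.0, p - lr * g)) for p, g in zip(pixels, grad)]
    return pixels


start = [0.0, 1.0, 0.5]
target = [1.0, 0.0, 0.5]
result = optimize_pixels(start, target)
initial_loss, _ = toy_loss_and_grad(start, target)
final_loss, _ = toy_loss_and_grad(result, target)
print(f"loss before: {initial_loss:.4f}, loss after: {final_loss:.2e}")
```

Because every update touches only the frame's own pixels, each step is cheap, which is consistent with the speed claim above; the trade-off is noisy output, which is why a separate denoising pass is applied afterwards.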

From the Bot Intelligence Group at Carnegie Mellon University

This method is to be featured at the 2022 NeurIPS Workshop on Machine Learning for Creativity and Design.