cjwbw/damo-text-to-video | Run with an API on Replicate

cjwbw / damo-text-to-video

Multi-stage text-to-video generation

Public
148.9K runs
A100 (80GB)

Input

Run this model in Node.js with one line of code:

npx create-replicate --model=cjwbw/damo-text-to-video

or set up a project from scratch

Install Replicate’s Node.js client library:

npm install replicate

Set the REPLICATE_API_TOKEN environment variable:

export REPLICATE_API_TOKEN=<paste-your-token-here>

Find your API token in your account settings.

Import and set up the client:

import Replicate from "replicate";
import fs from "node:fs";

const replicate = new Replicate({
  auth: process.env.REPLICATE_API_TOKEN,
});

Run cjwbw/damo-text-to-video using Replicate’s API. Check out the model's schema for an overview of inputs and outputs.

const output = await replicate.run(
  "cjwbw/damo-text-to-video:1e205ea73084bd17a0a3b43396e49ba0d6bc2e754e9283b2df49fad2dcf95755",
  {
    input: {
      fps: 8,
      prompt: "A panda eating bamboo on a rock.",
      num_frames: 50,
      num_inference_steps: 50
    }
  }
);

// To access the file URL:
console.log(output.url()); //=> "http://example.com"

// To write the file to disk (the model returns an MP4 video):
await fs.promises.writeFile("output.mp4", output);

To learn more, take a look at the guide on getting started with Node.js.

Install Replicate’s Python client library:

pip install replicate

Set the REPLICATE_API_TOKEN environment variable:

export REPLICATE_API_TOKEN=<paste-your-token-here>

Find your API token in your account settings.

Import the client:

import replicate

Run cjwbw/damo-text-to-video using Replicate’s API. Check out the model's schema for an overview of inputs and outputs.

output = replicate.run(
    "cjwbw/damo-text-to-video:1e205ea73084bd17a0a3b43396e49ba0d6bc2e754e9283b2df49fad2dcf95755",
    input={
        "fps": 8,
        "prompt": "A panda eating bamboo on a rock.",
        "num_frames": 50,
        "num_inference_steps": 50
    }
)

# To access the file URL (`url` is an attribute of the returned file object):
print(output.url)
#=> "http://example.com"

# To write the file to disk (the model returns an MP4 video):
with open("output.mp4", "wb") as file:
    file.write(output.read())

To learn more, take a look at the guide on getting started with Python.
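
The `fps` and `num_frames` inputs used above together suggest the length of the generated clip. The following is a minimal sketch, assuming the output video contains exactly `num_frames` frames played back at `fps` frames per second; the helper name `clip_duration_seconds` is hypothetical and not part of the Replicate client.

```python
def clip_duration_seconds(num_frames: int, fps: int) -> float:
    """Return the clip length in seconds, assuming num_frames frames shown at fps."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return num_frames / fps


# Example inputs from this page: 50 frames at 8 frames per second.
duration = clip_duration_seconds(num_frames=50, fps=8)
print(duration)  # 6.25
```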

Set the REPLICATE_API_TOKEN environment variable:

export REPLICATE_API_TOKEN=<paste-your-token-here>

Find your API token in your account settings.

Run cjwbw/damo-text-to-video using Replicate’s API. Check out the model's schema for an overview of inputs and outputs.

curl -s -X POST \
  -H "Authorization: Bearer $REPLICATE_API_TOKEN" \
  -H "Content-Type: application/json" \
  -H "Prefer: wait" \
  -d $'{
    "version": "cjwbw/damo-text-to-video:1e205ea73084bd17a0a3b43396e49ba0d6bc2e754e9283b2df49fad2dcf95755",
    "input": {
      "fps": 8,
      "prompt": "A panda eating bamboo on a rock.",
      "num_frames": 50,
      "num_inference_steps": 50
    }
  }' \
  https://api.replicate.com/v1/predictions

To learn more, take a look at Replicate’s HTTP API reference docs.
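
Runs of this model can take a few minutes, so the HTTP response may come back while the prediction is still `starting` or `processing`. The sketch below polls the prediction's `urls.get` endpoint until it reaches a final state. It assumes the final states are `succeeded`, `failed`, and `canceled`; the names `fetch_prediction` and `wait_for_prediction`, the polling interval, and the poll limit are illustrative choices, not part of any official client.

```python
import json
import time
import urllib.request

# Assumed set of final prediction states.
TERMINAL_STATUSES = {"succeeded", "failed", "canceled"}


def fetch_prediction(url, token):
    """GET a prediction object from its `urls.get` endpoint."""
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_prediction(prediction, fetch, interval=2.0, max_polls=300,
                        sleep=time.sleep):
    """Poll until the prediction reaches a terminal status.

    `fetch` is any callable that takes a URL and returns the prediction dict,
    which keeps the loop testable without network access.
    """
    polls = 0
    while prediction["status"] not in TERMINAL_STATUSES:
        if polls >= max_polls:
            raise TimeoutError("prediction did not finish in time")
        sleep(interval)
        prediction = fetch(prediction["urls"]["get"])
        polls += 1
    return prediction


# Usage sketch (requires network access and a valid token):
#   import os
#   token = os.environ["REPLICATE_API_TOKEN"]
#   final = wait_for_prediction(first_response,
#                               lambda url: fetch_prediction(url, token))
#   print(final["status"], final["output"])
```

Passing `fetch` and `sleep` as arguments is a design choice that lets the loop be exercised with canned responses.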

Output

{
  "completed_at": "2023-03-23T22:37:27.583925Z",
  "created_at": "2023-03-23T22:32:07.478436Z",
  "data_removed": false,
  "error": null,
  "id": "kfg5ftkjmzbt3jfpajv4r4bz6m",
  "input": {
    "fps": 8,
    "prompt": "A panda eating bamboo on a rock.",
    "num_frames": 50,
    "num_inference_steps": 50
  },
  "logs": "Using seed: 8502\n  0%|          | 0/50 [00:00<?, ?it/s]\n  2%|▏         | 1/50 [00:04<03:29,  4.27s/it]\n  4%|▍         | 2/50 [00:07<02:46,  3.47s/it]\n  6%|▌         | 3/50 [00:10<02:31,  3.23s/it]\n  8%|▊         | 4/50 [00:13<02:23,  3.12s/it]\n 10%|█         | 5/50 [00:16<02:17,  3.06s/it]\n 12%|█▏        | 6/50 [00:19<02:13,  3.03s/it]\n 14%|█▍        | 7/50 [00:22<02:10,  3.03s/it]\n 16%|█▌        | 8/50 [00:25<02:07,  3.02s/it]\n 18%|█▊        | 9/50 [00:28<02:03,  3.02s/it]\n 20%|██        | 10/50 [00:31<02:00,  3.02s/it]\n 22%|██▏       | 11/50 [00:34<01:57,  3.01s/it]\n 24%|██▍       | 12/50 [00:37<01:54,  3.00s/it]\n 26%|██▌       | 13/50 [00:40<01:50,  2.99s/it]\n 28%|██▊       | 14/50 [00:43<01:47,  2.99s/it]\n 30%|███       | 15/50 [00:45<01:44,  2.99s/it]\n 32%|███▏      | 16/50 [00:48<01:41,  2.98s/it]\n 34%|███▍      | 17/50 [00:51<01:38,  2.97s/it]\n 36%|███▌      | 18/50 [00:54<01:34,  2.96s/it]\n 38%|███▊      | 19/50 [00:57<01:31,  2.95s/it]\n 40%|████      | 20/50 [01:00<01:28,  2.94s/it]\n 42%|████▏     | 21/50 [01:03<01:25,  2.94s/it]\n 44%|████▍     | 22/50 [01:06<01:22,  2.93s/it]\n 46%|████▌     | 23/50 [01:09<01:19,  2.93s/it]\n 48%|████▊     | 24/50 [01:12<01:16,  2.92s/it]\n 50%|█████     | 25/50 [01:15<01:13,  2.92s/it]\n 52%|█████▏    | 26/50 [01:18<01:10,  2.92s/it]\n 54%|█████▍    | 27/50 [01:21<01:07,  2.92s/it]\n 56%|█████▌    | 28/50 [01:24<01:04,  2.92s/it]\n 58%|█████▊    | 29/50 [01:26<01:01,  2.92s/it]\n 60%|██████    | 30/50 [01:29<00:58,  2.92s/it]\n 62%|██████▏   | 31/50 [01:32<00:55,  2.93s/it]\n 64%|██████▍   | 32/50 [01:35<00:52,  2.93s/it]\n 66%|██████▌   | 33/50 [01:38<00:49,  2.93s/it]\n 68%|██████▊   | 34/50 [01:41<00:47,  2.94s/it]\n 70%|███████   | 35/50 [01:44<00:44,  2.95s/it]\n 72%|███████▏  | 36/50 [01:47<00:41,  2.95s/it]\n 74%|███████▍  | 37/50 [01:50<00:38,  2.95s/it]\n 76%|███████▌  | 38/50 [01:53<00:35,  2.96s/it]\n 78%|███████▊  | 39/50 [01:56<00:32,  2.96s/it]\n 80%|████████  | 40/50 [01:59<00:29,  2.96s/it]\n 82%|████████▏ | 41/50 [02:02<00:26,  2.97s/it]\n 84%|████████▍ | 42/50 [02:05<00:23,  2.98s/it]\n 86%|████████▌ | 43/50 [02:08<00:20,  2.98s/it]\n 88%|████████▊ | 44/50 [02:11<00:17,  2.98s/it]\n 90%|█████████ | 45/50 [02:14<00:14,  2.98s/it]\n 92%|█████████▏| 46/50 [02:17<00:11,  2.97s/it]\n 94%|█████████▍| 47/50 [02:20<00:08,  2.97s/it]\n 96%|█████████▌| 48/50 [02:23<00:05,  2.96s/it]\n 98%|█████████▊| 49/50 [02:26<00:02,  2.95s/it]\n100%|██████████| 50/50 [02:29<00:00,  2.95s/it]\n100%|██████████| 50/50 [02:29<00:00,  2.98s/it]",
  "metrics": {
    "predict_time": 155.840977,
    "total_time": 320.105489
  },
  "output": "https://replicate.delivery/pbxt/0KtGrmSHZM5eGy2kXndo7DmKfg5MS9pPf7TaM9iIKVxN1QVhA/out.mp4",
  "started_at": "2023-03-23T22:34:51.742948Z",
  "status": "succeeded",
  "urls": {
    "get": "https://api.replicate.com/v1/predictions/kfg5ftkjmzbt3jfpajv4r4bz6m",
    "cancel": "https://api.replicate.com/v1/predictions/kfg5ftkjmzbt3jfpajv4r4bz6m/cancel"
  },
  "version": "1e205ea73084bd17a0a3b43396e49ba0d6bc2e754e9283b2df49fad2dcf95755"
}
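
The timestamps in the prediction object above account for the two values under `metrics`. The sketch below recomputes them from `created_at`, `started_at`, and `completed_at`; the function names are hypothetical, and labelling the gap before `started_at` as waiting time (queueing and, presumably, cold start) is an interpretation rather than something the response states.

```python
from datetime import datetime

# Timestamps and metrics copied from the example prediction.
prediction = {
    "created_at": "2023-03-23T22:32:07.478436Z",
    "started_at": "2023-03-23T22:34:51.742948Z",
    "completed_at": "2023-03-23T22:37:27.583925Z",
    "metrics": {"predict_time": 155.840977, "total_time": 320.105489},
}


def parse_timestamp(value):
    """Parse an ISO 8601 timestamp with a trailing 'Z' (UTC)."""
    # Older Python versions do not accept 'Z' in fromisoformat.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def timing_breakdown(pred):
    """Split a prediction's lifetime into waiting, running, and total seconds."""
    created = parse_timestamp(pred["created_at"])
    started = parse_timestamp(pred["started_at"])
    completed = parse_timestamp(pred["completed_at"])
    return {
        "waiting": (started - created).total_seconds(),
        "running": (completed - started).total_seconds(),
        "total": (completed - created).total_seconds(),
    }


breakdown = timing_breakdown(prediction)
print(breakdown)
```

For this prediction, the running time equals `predict_time` (about 155.8 seconds) and the total equals `total_time` (about 320.1 seconds), so roughly 164 seconds passed before the model started running.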

Generated in 2 minutes 36 seconds

Run time and cost

This model costs approximately $0.13 to run on Replicate, or 7 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia A100 (80GB) GPU hardware. Predictions typically complete within 96 seconds. The predict time for this model varies significantly based on the inputs.
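
The figures above can be related with a little arithmetic. This is a rough sketch that assumes cost scales linearly with predict time and that the $0.13 figure corresponds to the typical 96-second run; neither assumption is confirmed by this page, and the function names are illustrative.

```python
import math

COST_PER_RUN_USD = 0.13        # approximate cost per run quoted above
TYPICAL_PREDICT_SECONDS = 96   # typical completion time quoted above


def runs_per_dollar(cost_per_run):
    """Whole runs that fit in one dollar."""
    return math.floor(1 / cost_per_run)


def implied_rate_per_second(cost_per_run, typical_seconds):
    """Per-second rate implied by the two quoted figures (an assumption)."""
    return cost_per_run / typical_seconds


def estimate_cost(predict_seconds, rate_per_second):
    """Estimated cost of a run, assuming cost is proportional to predict time."""
    return predict_seconds * rate_per_second


rate = implied_rate_per_second(COST_PER_RUN_USD, TYPICAL_PREDICT_SECONDS)
print(runs_per_dollar(COST_PER_RUN_USD))          # 7
print(round(estimate_cost(155.840977, rate), 2))  # 0.21
```

Under these assumptions, a run with a predict time of about 156 seconds would cost roughly $0.21 rather than $0.13.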

Readme

Weights from: https://huggingface.co/damo-vilab/modelscope-damo-text-to-video-synthesis

This model is based on a multi-stage text-to-video diffusion model that takes a text description as input and returns a video matching that description. Only English input is supported.

Model Description

The text-to-video generation diffusion model consists of three sub-networks: text feature extraction, a diffusion model that maps text features to the video latent space, and a network that maps the video latent space to the video visual space. The model has about 1.7 billion parameters in total. The diffusion model adopts the Unet3D structure and generates video by iteratively denoising a video of pure Gaussian noise.
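
The control flow of the three stages can be illustrated with a toy sketch. Everything below is a stand-in: the functions `encode_text`, `denoise_step`, and `decode_latents` are hypothetical placeholders that operate on small lists of numbers, not the actual networks, and the update rule is chosen only to show the shape of an iterative denoising loop starting from Gaussian noise.

```python
import random


def encode_text(prompt):
    # Stage 1 placeholder: turn the prompt into a few numeric "features".
    return [len(word) / 10.0 for word in prompt.split()]


def denoise_step(latents, text_features, step, total_steps):
    # Stage 2 placeholder: move each latent value toward a text-dependent
    # target. A real model would predict noise with a Unet3D instead.
    target = sum(text_features) / len(text_features)
    blend = 1.0 / (total_steps - step)
    return [[v + blend * (target - v) for v in frame] for frame in latents]


def decode_latents(latents):
    # Stage 3 placeholder: clamp latent values into a [0, 1] "pixel" range.
    return [[max(0.0, min(1.0, v)) for v in frame] for frame in latents]


def generate(prompt, num_frames=50, num_inference_steps=50,
             latent_size=4, seed=8502):
    rng = random.Random(seed)
    text_features = encode_text(prompt)
    # Start from pure Gaussian noise, one small latent vector per frame.
    latents = [[rng.gauss(0.0, 1.0) for _ in range(latent_size)]
               for _ in range(num_frames)]
    # Iteratively denoise the latents, conditioned on the text features.
    for step in range(num_inference_steps):
        latents = denoise_step(latents, text_features, step,
                               num_inference_steps)
    return decode_latents(latents)


frames = generate("A panda eating bamboo on a rock.")
print(len(frames), len(frames[0]))  # 50 4
```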

This model is meant for research purposes. Please read the sections on model limitations and biases and on misuse, malicious use and excessive use.

This model has a wide range of applications and can reason and generate videos based on arbitrary English text descriptions.

Model limitations and biases

The model is trained on public datasets such as Webvid, and the generated results may reflect biases in the distribution of the training data.
The model cannot generate video of film or television quality.
The model cannot generate clear text.
The model is mainly trained with English corpus and does not support other languages at the moment.
The performance of this model needs to be improved on complex compositional generation tasks.

Misuse, Malicious Use and Excessive Use

The model was not trained to realistically represent people or events, so using it to generate such content is beyond the model’s capabilities.
It is prohibited to generate content that is demeaning or harmful to people or their environment, culture, religion, etc.
It is prohibited to generate pornographic, violent, or gory content.
It is prohibited to generate erroneous or false information.

Training data

The training data includes LAION5B, ImageNet, Webvid and other public datasets. After pre-training, images and videos are filtered using criteria such as aesthetic score, watermark score, and deduplication.