zsxkib / step-video-t2v

Generate high-quality videos from text prompts using StepVideo

  • Public
  • 91 runs
  • H100
  • GitHub
  • Weights
  • Paper
  • License

Input

string

Prompt text

Default: "An astronaut on the moon"

string

Negative prompt

Default: "low resolution, text"

integer
(minimum: 1, maximum: 100)

Number of inference steps

Default: 30

number
(minimum: 1, maximum: 20)

Classifier-free guidance scale

Default: 9

integer
(minimum: 17, maximum: 204)

Number of frames

Default: 51

Including fps and 2 more...
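As a sketch, the inputs above can be bundled into a payload and checked against the documented ranges before sending a request. The parameter names below (`prompt`, `negative_prompt`, `num_inference_steps`, `cfg_scale`, `num_frames`) are assumptions inferred from the field descriptions, not confirmed identifiers; check the model's API schema for the exact names.

```python
# Hypothetical sketch: validate an input payload for this model before
# sending it. Parameter names are assumed from the field descriptions above.

RANGES = {
    "num_inference_steps": (1, 100),  # integer, default 30
    "cfg_scale": (1, 20),             # classifier-free guidance, default 9
    "num_frames": (17, 204),          # integer, default 51
}

def validate(payload: dict) -> dict:
    """Raise ValueError if any numeric field is outside its documented range."""
    for key, (lo, hi) in RANGES.items():
        if key in payload and not (lo <= payload[key] <= hi):
            raise ValueError(f"{key}={payload[key]} outside [{lo}, {hi}]")
    return payload

payload = validate({
    "prompt": "An astronaut on the moon",       # default prompt
    "negative_prompt": "low resolution, text",  # default negative prompt
    "num_inference_steps": 30,
    "cfg_scale": 9,
    "num_frames": 51,
})
# The validated payload could then be passed to Replicate's Python client, e.g.:
#   import replicate
#   replicate.run("zsxkib/step-video-t2v", input=payload)
```

The `replicate.run` call at the end is left as a comment since it needs an API token and network access.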


Run time and cost

This model runs on Nvidia H100 GPU hardware. We don't yet have enough runs of this model to provide performance information.

Readme

Step Video T2V - Text to Video Magic ✨

Transform your text descriptions into captivating videos with StepFun’s StepVideo model - now optimized to run on a single GPU! 🚀

About

This model turns your words into fluid, high-quality videos in seconds. Using StepFun’s groundbreaking approach to video generation, it creates remarkably coherent motion and impressive visuals from simple text prompts. 🎬

What makes this implementation special? 🌟

  • Single GPU power: Unlike the original implementation that required 4 GPUs, this version runs efficiently on just one H100! 💪
  • FP8 quantization: The diffusion model uses optimized FP8 precision for:
      • Faster generation on modern hardware 🏎️
      • Reduced memory footprint 🧠
      • Quicker creative iterations ⚡

While quantization introduces a slight quality trade-off compared to full-precision models, the speed and accessibility gains make this perfect for most creative projects!
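To give a sense of the memory trade-off, FP8 stores one byte per weight versus four for FP32, so quantizing the diffusion model roughly quarters its weight footprint. The back-of-envelope sketch below uses an illustrative parameter count, not StepVideo's actual size:

```python
# Back-of-envelope weight storage at different precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "fp8": 1}

def weights_gib(num_params: int, dtype: str) -> float:
    """Approximate weight storage in GiB for num_params parameters."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30

# Illustrative 30B-parameter model (assumed, not an official figure):
params = 30_000_000_000
for dtype in ("fp32", "fp16", "fp8"):
    print(f"{dtype}: {weights_gib(params, dtype):.1f} GiB")
```

At FP8 the weights of such a model would fit comfortably in a single H100's 80 GB of memory, which is the intuition behind the single-GPU claim above.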

Tips for stunning results 🎯

  • Be descriptive: “A golden retriever puppy playing with a red ball in a sunny park” works better than “dog playing”
  • Specify motion: Mention the action you want to see
  • Adjust frames: More frames = longer video, but might affect per-frame quality
  • Play with FPS: Higher FPS creates smoother motion
  • Use negative prompts: Add things you don’t want to see in the “negative prompt” field
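As a rule of thumb for the frames and FPS tips above, clip duration is simply frame count divided by frames per second. The helper below is a hypothetical sketch; the 25 fps figure is an assumed example, not the model's documented default:

```python
# Hypothetical helper: estimate clip length from frame count and fps.
def clip_seconds(num_frames: int, fps: int) -> float:
    """Duration in seconds of num_frames frames played at fps."""
    return num_frames / fps

# The default 51 frames at an assumed 25 fps gives a clip of about 2 seconds.
print(clip_seconds(51, 25))
```

So raising `num_frames` toward the 204 maximum lengthens the clip, while raising fps at a fixed frame count shortens it but smooths the motion.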

Example prompts 💡

  • “A spaceship landing on a distant planet with two moons in the sky”
  • “Timelapse of a flower blooming in a lush garden”
  • “A robot chef preparing a gourmet meal in a futuristic kitchen”
  • “Waves crashing against a rocky shore during sunset”
  • “A panda doing kung fu moves in a bamboo forest”

Limitations 🚧

  • Text rendering isn’t perfect - avoid prompts that require specific text
  • Very complex scenes might lose some details
  • Faces can sometimes look a bit uncanny
  • The quantized version prioritizes speed over absolute quality

Coming soon… 📆

  • Multi-GPU support for even faster generation
  • Fine-tuned quality improvements while maintaining speed
  • Additional creative controls

Credits 🙏

Model adaptation and quantization by @zsakib_ - making high-end video generation accessible on single GPUs.

Based on StepFun’s StepVideo-T2V-Turbo with optimizations for Replicate’s infrastructure.

Happy video creating! 🎥✨