zsxkib / step-video-t2v

Generate high-quality videos from text prompts using StepVideo

  • Public
  • 91 runs
  • H100
  • GitHub
  • Weights
  • Paper
  • License

Input

string

Prompt text

Default: "An astronaut on the moon"

string

Negative prompt

Default: "low resolution, text"

integer
(minimum: 1, maximum: 100)

Number of inference steps

Default: 30

number
(minimum: 1, maximum: 20)

Classifier-free guidance scale

Default: 9

integer
(minimum: 17, maximum: 204)

Number of frames

Default: 51

Including fps and 2 more...
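As a sketch, the inputs above can be bundled into a payload and checked against the documented ranges before sending a request. The parameter names below (`prompt`, `negative_prompt`, `num_inference_steps`, `cfg_scale`, `num_frames`) are assumptions inferred from the field descriptions, not confirmed identifiers; check the model's API schema for the exact names.

```python
# Hypothetical sketch: validate an input payload for this model before
# sending it. Parameter names are assumed from the field descriptions above.

RANGES = {
    "num_inference_steps": (1, 100),  # integer, default 30
    "cfg_scale": (1, 20),             # classifier-free guidance, default 9
    "num_frames": (17, 204),          # integer, default 51
}

def validate(payload: dict) -> dict:
    """Raise ValueError if any numeric field is outside its documented range."""
    for key, (lo, hi) in RANGES.items():
        if key in payload and not (lo <= payload[key] <= hi):
            raise ValueError(f"{key}={payload[key]} outside [{lo}, {hi}]")
    return payload

payload = validate({
    "prompt": "An astronaut on the moon",       # default prompt
    "negative_prompt": "low resolution, text",  # default negative prompt
    "num_inference_steps": 30,
    "cfg_scale": 9,
    "num_frames": 51,
})
# The validated payload could then be passed to Replicate's Python client, e.g.:
#   import replicate
#   replicate.run("zsxkib/step-video-t2v", input=payload)
```

The `replicate.run` call at the end is left as a comment since it needs an API token and network access.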


Run time and cost

This model runs on Nvidia H100 GPU hardware. We don't yet have enough runs of this model to provide performance information.

Readme

Step Video T2V - Text to Video Magic ✨

Transform your text descriptions into captivating videos with StepFun’s StepVideo model - now optimized to run on a single GPU! 🚀

About

This model turns your words into fluid, high-quality videos in seconds. Using StepFun’s groundbreaking approach to video generation, it creates remarkably coherent motion and impressive visuals from simple text prompts. 🎬

What makes this implementation special? 🌟

  • Single GPU power: Unlike the original implementation that required 4 GPUs, this version runs efficiently on just one H100! 💪
  • FP8 quantization: The diffusion model uses optimized FP8 precision for:
      • Faster generation on modern hardware 🏎️
      • Reduced memory footprint 🧠
      • Quicker creative iterations ⚡

While quantization introduces a slight quality trade-off compared to full-precision models, the speed and accessibility gains make this perfect for most creative projects!
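To give a sense of the memory trade-off, FP8 stores one byte per weight versus four for FP32, so quantizing the diffusion model roughly quarters its weight footprint. The back-of-envelope sketch below uses an illustrative parameter count, not StepVideo's actual size:

```python
# Back-of-envelope weight storage at different precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "fp8": 1}

def weights_gib(num_params: int, dtype: str) -> float:
    """Approximate weight storage in GiB for num_params parameters."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30

# Illustrative 30B-parameter model (assumed, not an official figure):
params = 30_000_000_000
for dtype in ("fp32", "fp16", "fp8"):
    print(f"{dtype}: {weights_gib(params, dtype):.1f} GiB")
```

At FP8 the weights of such a model would fit comfortably in a single H100's 80 GB of memory, which is the intuition behind the single-GPU claim above.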

Tips for stunning results 🎯

  • Be descriptive: “A golden retriever puppy playing with a red ball in a sunny park” works better than “dog playing”
  • Specify motion: Mention the action you want to see
  • Adjust frames: More frames = longer video, but might affect per-frame quality
  • Play with FPS: Higher FPS creates smoother motion
  • Use negative prompts: Add things you don’t want to see in the “negative prompt” field
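As a rule of thumb for the frames and FPS tips above, clip duration is simply frame count divided by frames per second. The helper below is a hypothetical sketch; the 25 fps figure is an assumed example, not the model's documented default:

```python
# Hypothetical helper: estimate clip length from frame count and fps.
def clip_seconds(num_frames: int, fps: int) -> float:
    """Duration in seconds of num_frames frames played at fps."""
    return num_frames / fps

# The default 51 frames at an assumed 25 fps gives a clip of about 2 seconds.
print(clip_seconds(51, 25))
```

So raising `num_frames` toward the 204 maximum lengthens the clip, while raising fps at a fixed frame count shortens it but smooths the motion.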

Example prompts 💡

  • “A spaceship landing on a distant planet with two moons in the sky”
  • “Timelapse of a flower blooming in a lush garden”
  • “A robot chef preparing a gourmet meal in a futuristic kitchen”
  • “Waves crashing against a rocky shore during sunset”
  • “A panda doing kung fu moves in a bamboo forest”

Limitations 🚧

  • Text rendering isn’t perfect - avoid prompts that require specific text
  • Very complex scenes might lose some details
  • Faces can sometimes look a bit uncanny
  • The quantized version prioritizes speed over absolute quality

Coming soon… 📆

  • Multi-GPU support for even faster generation
  • Fine-tuned quality improvements while maintaining speed
  • Additional creative controls

Credits 🙏

Model adaptation and quantization by @zsakib_ - making high-end video generation accessible on single GPUs.

Based on StepFun’s StepVideo-T2V-Turbo with optimizations for Replicate’s infrastructure.

Happy video creating! 🎥✨