Readme
Hugging Face: https://huggingface.co/rain1011/pyramid-flow-sd3 | Paper: https://arxiv.org/abs/2410.05954 | Project page: https://pyramid-flow.github.io
Text-to-video and image-to-video: Pyramid Flow, an autoregressive video generation method based on flow matching.
This model costs approximately $0.44 to run on Replicate, or 2 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.
This model runs on Nvidia A100 (80GB) GPU hardware. Predictions typically complete within 6 minutes. The predict time for this model varies significantly based on the inputs.
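
As a rough budgeting aid, the per-run figure quoted above can be turned into a small calculation. This is a minimal sketch, assuming the approximate $0.44 per run holds for your inputs (the page notes that it varies); the function names are hypothetical and not part of any Replicate API.

```python
import math

# Approximate per-run cost quoted on the page; real cost varies with inputs.
APPROX_COST_PER_RUN_USD = 0.44


def runs_within_budget(budget_usd: float,
                       cost_per_run: float = APPROX_COST_PER_RUN_USD) -> int:
    """Whole number of runs that fit in a budget at the approximate cost."""
    if budget_usd < 0 or cost_per_run <= 0:
        raise ValueError("budget must be >= 0 and cost_per_run must be > 0")
    # Small epsilon guards against float rounding at exact multiples.
    return math.floor(budget_usd / cost_per_run + 1e-9)


def estimated_cost(runs: int,
                   cost_per_run: float = APPROX_COST_PER_RUN_USD) -> float:
    """Approximate total cost in USD for a number of runs."""
    if runs < 0:
        raise ValueError("runs must be >= 0")
    return round(runs * cost_per_run, 2)


if __name__ == "__main__":
    print(runs_within_budget(1.0))   # matches the "2 runs per $1" figure
    print(estimated_cost(5))
```

Treat the result as an estimate only, since predict time and therefore cost depend on the inputs.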

Generate short videos from a start image and a text prompt. Produce 5 or 10 second clips at 24 fps in 720p (standard) or 1080p (pro). Optionally supply an end image in pro mode to guide the final frame or interpolate between start and end.

Generate 5–10 second videos from text prompts or a single starting image. Accept a required prompt and optional first-frame image, and output short clips with fluid motion, stable frames, and coherent pacing. Preserve color, lighting, and mood across frames with refined conditioning, and follow multi-step, causal instructions for complex camera moves. Suited for marketing assets, creator shorts, film/animation previz, and educational explainers.

Generate short videos from text prompts. Optionally condition on a start image to create image-to-video clips and include up to 4 reference images as scene elements. Choose 5s or 10s duration with 720p output at 30fps, and set aspect ratio to 16:9, 9:16, or 1:1. Supports negative prompts to steer content and returns a video.

Generate short videos from text prompts or a starting image. Produce 2–12 second clips at 24 fps in up to 1080p resolution across aspect ratios including 16:9, 4:3, 1:1, 3:4, 9:16, 21:9, and 9:21. Guide subjects, style, and multi-character interactions with 1–4 reference images for character, clothing, and environment consistency. Optionally lock the camera, set a random seed for reproducibility, and anchor start/end frames with first- and last-frame images. Outputs a video.

Generate videos from text prompts or a single input image. Produce 2–12 second clips at 24 fps in 480p, 720p, or 1080p and common aspect ratios (16:9, 9:16, 1:1, 4:3, 3:4, 21:9, 9:21). Control motion with a camera lock option and constrain start/end points by supplying both a start image and a last-frame image; set a seed for reproducible results. Support multi-shot generation with narrative coherence, consistent subjects and visual style across shot transitions, and temporal/spatial shifts. Handle subtle to large-scale motion, complex action sequences, and multi-agent interactions with stable physical realism. Interpret stylistic prompts including photorealism, cyberpunk, illustration, and felt texture while maintaining prompt adherence and source-image consistency in image-to-video.
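
The options listed in the description above (2 to 12 second clips, 24 fps, three resolutions, seven aspect ratios, camera lock, seed) can be checked before a request is submitted. The following is a minimal sketch under stated assumptions: the class, field names, and defaults are hypothetical and do not reflect any model's actual input schema, and duration is assumed to be a whole number of seconds.

```python
from dataclasses import dataclass
from typing import List, Optional

# Values taken from the description above; everything else is illustrative.
ALLOWED_RESOLUTIONS = {"480p", "720p", "1080p"}
ALLOWED_ASPECT_RATIOS = {"16:9", "9:16", "1:1", "4:3", "3:4", "21:9", "9:21"}
FPS = 24
MIN_DURATION_S, MAX_DURATION_S = 2, 12


@dataclass
class VideoRequest:
    """Hypothetical request shape; field names and defaults are assumptions."""
    prompt: str
    duration_s: int = 5
    resolution: str = "720p"
    aspect_ratio: str = "16:9"
    camera_fixed: bool = False
    seed: Optional[int] = None  # set for reproducible results


def validate(req: VideoRequest) -> List[str]:
    """Return a list of human-readable problems; empty means valid."""
    errors = []
    if not MIN_DURATION_S <= req.duration_s <= MAX_DURATION_S:
        errors.append(
            f"duration_s must be between {MIN_DURATION_S} and {MAX_DURATION_S}")
    if req.resolution not in ALLOWED_RESOLUTIONS:
        errors.append(f"unsupported resolution: {req.resolution}")
    if req.aspect_ratio not in ALLOWED_ASPECT_RATIOS:
        errors.append(f"unsupported aspect ratio: {req.aspect_ratio}")
    return errors


def frame_count(req: VideoRequest) -> int:
    """Number of frames in the clip at the stated 24 fps."""
    return req.duration_s * FPS
```

For example, a 5 second request at 24 fps corresponds to 120 frames, and a 13 second request would be rejected by `validate` before any compute is spent.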

Generate up to 6-second 720p, 25fps videos from a text prompt or by animating a first-frame image. Maintain a consistent character by providing a subject reference image (S2V-01). Use a first-frame image to set aspect ratio and initial composition while the prompt drives motion and cinematic camera movement. Includes an optional prompt optimizer.

Generate 6–10 second videos from text prompts or a single reference image. Select 512p, 768p (up to 10s), or 1080p (6s), with a pro 1080p mode offering improved motion coherence. Optionally pin the first or last frame with input images (aspect ratio follows the first-frame image) to guide motion and endpoints. Emphasizes realistic physics and reliable instruction following for complex actions and scene transitions. Includes an optional prompt optimizer.

Generate short videos from a text prompt or an initial image. Produce silent clips with realistic motion and physics, following simple or complex instructions and camera directions (shot styles, angles, movements) across diverse visual styles, up to 4K. Control duration (5–8 seconds) and aspect ratio (16:9 or 9:16), and optionally fix a random seed for repeatable results.