zsxkib / framepack

🕹️FramePack: video diffusion that feels like image diffusion🎥

  • Public
  • 1.7K runs
  • GitHub
  • Weights
  • Paper
  • License

Run time and cost

This model costs approximately $1.71 to run on Replicate, or fewer than one run per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia A100 (80GB) GPU hardware. Predictions typically complete within 21 minutes. The predict time for this model varies significantly based on the inputs.
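
If you prefer calling the model programmatically, a minimal sketch with the Replicate Python client is shown below. The input field names (input_image, prompt, total_second_length) are illustrative assumptions, so check the model's API schema on Replicate for the real parameters.

# Requires the REPLICATE_API_TOKEN environment variable to be set.
import replicate

output = replicate.run(
    "zsxkib/framepack",  # you may need to pin a specific version, e.g. "zsxkib/framepack:<version>"
    input={
        "input_image": open("portrait.png", "rb"),   # assumed field names -- verify against the schema
        "prompt": "The person smiles and slowly waves at the camera",
        "total_second_length": 5,
    },
)
print(output)  # typically a URL (or list of URLs) pointing at the generated MP4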

Readme

FramePack (Non-F1 Variant): Progressive Video Generation 🎬


This Replicate model lets you run the non-F1 (original) variant of FramePack, a powerful way to create videos that build up bit by bit, starting from an image and a text prompt. It’s based on the ideas from the research paper “Packing Input Frame Context in Next-Frame Prediction Models for Video Generation.”

This implementation specifically uses the lllyasviel/FramePackI2V_HY transformer model, which is the non-F1 version. An F1 variant with a different transformer (lllyasviel/FramePack_F1_I2V_HY_20250503) also exists but is not used in this particular Replicate model.

This version is set up for Cog and designed to work efficiently, even if you don’t have the most powerful GPU. It smartly manages memory to run on a range of NVIDIA GPUs, from consumer cards (like an RTX 4090) to datacenter ones (like A100s).

Original Project Page: lllyasviel.github.io/frame_pack_gitpage/
Core Transformer Model (Non-F1 Variant): lllyasviel/FramePackI2V_HY

About the FramePack Model (Non-F1 Variant)

What makes this non-F1 variant of FramePack interesting is how it generates videos progressively. Instead of trying to create the whole video at once, it builds it in sections, like an artist painting a scene layer by layer. It cleverly “packs” the context of what it has already generated into a fixed-size summary. This means it doesn’t get overwhelmed by longer videos and can maintain good quality over time.
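
To make the packing idea concrete, here is a toy sketch in PyTorch: older latent frames are pooled with progressively coarser kernels, so the total context length stays bounded as the history grows. The kernel schedule and shapes are purely illustrative; the real model packs context inside its transformer with its own, tuned compression rates.

import torch
import torch.nn.functional as F

def pack_history(history: torch.Tensor) -> torch.Tensor:
    """Toy context packing. history: (num_frames, channels, height, width), newest frame last."""
    num_frames = history.shape[0]
    packed = []
    for age in range(num_frames):                    # age 0 = newest frame
        frame = history[num_frames - 1 - age]
        k = min(2 ** age, frame.shape[-1])           # older frames -> coarser pooling
        pooled = F.avg_pool2d(frame.unsqueeze(0), kernel_size=k)
        packed.append(pooled.flatten(2).transpose(1, 2))   # (1, tokens, channels)
    return torch.cat(packed, dim=1)                  # bounded-length context

history = torch.randn(8, 16, 32, 32)                 # 8 latent frames
print(pack_history(history).shape)                   # far fewer tokens than 8 * 32 * 32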

This Cog implementation focuses on an API-first approach, making it easy to integrate this non-F1 FramePack into your own projects.

Key Features ✨

  • Progressive Generation: Watch your video develop section by section from your initial image using the non-F1 model logic.
  • Image-to-Video with Text Control: Start with a still image and use a text prompt to describe the motion and story you want to see.
  • Efficient Memory Use: Runs effectively on different GPUs by adjusting how it loads and uses model parts. If you have less video memory (typically less than 65GB, like on an RTX 4090 or L40S), it will offload models to CPU when not in use. With more memory (65GB or more, like on an A100 80GB), models stay on the GPU for faster runs. See the sketch after this list for the gist of that check.
  • Smooth Video Output: Uses techniques to blend newly generated video sections smoothly with the previous ones.
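
The VRAM check mentioned in the feature list boils down to something like the sketch below. This is a hypothetical reconstruction, assuming PyTorch; the actual names, units, and threshold live in predict.py.

import torch

# Hypothetical sketch of the memory heuristic described above; the real
# predict.py may use different names and a different cutoff.
def has_high_vram(threshold_gb: float = 65.0) -> bool:
    free_bytes, _total_bytes = torch.cuda.mem_get_info()   # bytes free on the current GPU
    return free_bytes / (1024 ** 3) >= threshold_gb

# True  -> keep the transformer, VAE, and encoders resident on the GPU
# False -> offload idle components to CPU RAM and load them only when needed
print(has_high_vram())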

How It Works (The Gist) 💡

FramePack (non-F1 variant) is a “next-frame-section” prediction model. It looks at the current image (or the last generated section of video) and your text prompt, then predicts the next chunk of frames.
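
Schematically, the generation loop looks roughly like the sketch below. Every name and shape here is a placeholder (the real work happens in predict.py and the packed transformer); it only illustrates the section-by-section structure.

import torch

def predict_next_section(packed_context: torch.Tensor, prompt_embedding: torch.Tensor) -> torch.Tensor:
    # Stand-in for the packed transformer's denoising call: it would return the
    # latents for the next section of frames, conditioned on context and prompt.
    return torch.randn(9, 16, 32, 32)

history = torch.randn(1, 16, 32, 32)                 # VAE latent of the input image
prompt_embedding = torch.randn(1, 256, 4096)         # text-encoder output (shape illustrative)

for _ in range(4):                                   # each pass extends the video by one section
    packed_context = history.mean(dim=0, keepdim=True)   # placeholder for the fixed-size packing
    section = predict_next_section(packed_context, prompt_embedding)
    history = torch.cat([history, section], dim=0)   # later decoded back to pixels by the VAE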

This Cog setup uses:

  • The HunyuanVideoTransformer3DModelPacked (lllyasviel/FramePackI2V_HY) as the core engine for generating video frames. This is the non-F1 transformer model.
  • Text encoders from the HunyuanVideo repository (hunyuanvideo-community/HunyuanVideo) to understand your prompt.
  • A SigLIP vision model (lllyasviel/flux_redux_bfl) to process and understand the input image.
  • An Autoencoder (VAE), also from HunyuanVideo, to translate between pixel images and the “latent” space where the model does its work.

The predict.py script manages loading these components, runs the generation loop while adapting to your GPU’s VRAM, and implements the context-handling logic specific to the non-F1 variant.
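
For orientation, loading those components might look roughly like the sketch below, modelled on how the upstream FramePack code wires them together. The exact class paths, subfolder names, and dtypes in this Cog repo may differ, so treat this as an assumption-laden outline rather than the repository's actual code.

import torch
from diffusers import AutoencoderKLHunyuanVideo
from transformers import CLIPTextModel, LlamaModel, SiglipVisionModel
# HunyuanVideoTransformer3DModelPacked ships with the FramePack source tree
# (diffusers_helper.models.hunyuan_video_packed in the upstream repo).
from diffusers_helper.models.hunyuan_video_packed import HunyuanVideoTransformer3DModelPacked

hv = "hunyuanvideo-community/HunyuanVideo"
text_encoder = LlamaModel.from_pretrained(hv, subfolder="text_encoder", torch_dtype=torch.float16)
text_encoder_2 = CLIPTextModel.from_pretrained(hv, subfolder="text_encoder_2", torch_dtype=torch.float16)
vae = AutoencoderKLHunyuanVideo.from_pretrained(hv, subfolder="vae", torch_dtype=torch.float16)
image_encoder = SiglipVisionModel.from_pretrained("lllyasviel/flux_redux_bfl", subfolder="image_encoder", torch_dtype=torch.float16)
transformer = HunyuanVideoTransformer3DModelPacked.from_pretrained("lllyasviel/FramePackI2V_HY", torch_dtype=torch.bfloat16)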

Potential Use Cases 🚀

  • Bring still images to life: Animate a photograph or illustration based on a narrative prompt.
  • Create short, dynamic clips: Generate videos of characters, objects, or scenes in motion.
  • Visualize evolving scenes: Show a scene changing over a few seconds, like a sunset or a character’s expression shifting.
  • Storytelling and concept art: Quickly prototype visual ideas that unfold over time.

Things to Keep in Mind ⚠️

  • Variant Specifics: This is the non-F1 variant. Performance and behavior may differ from the F1 variant.
  • Quality and Coherence: Video generation is complex! The final quality and how well the video follows your prompt can vary. Clear, descriptive prompts often work best.
  • Resource Needs: While efficient, generating longer or higher-resolution videos will naturally take more time.
  • Motion Range: The model is trained on a wide range of motions, but extremely complex or unusual actions might be challenging to generate perfectly.

License & Disclaimer 📜

The original FramePack model components (like lllyasviel/FramePackI2V_HY, hunyuanvideo-community/HunyuanVideo, lllyasviel/flux_redux_bfl) are generally available under open-source licenses like Apache 2.0. Please refer to their respective Hugging Face model cards for specific license details.

The code in the zsxkib/cog-Framepack GitHub repository, which packages this non-F1 model with Cog, is released under the MIT License.

Please use this model responsibly and in accordance with any terms of use from the original model creators.

Citation 📚

If you use FramePack in your research, please consider citing the original paper:

@article{zhang2025framepack,
    title={Packing Input Frame Contexts in Next-Frame Prediction Models for Video Generation},
    author={Lvmin Zhang and Maneesh Agrawala},
    journal={Arxiv},
    year={2025},
    eprint={2504.12626},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

Cog implementation (non-F1 variant) managed by zsxkib.

⭐ Star the Cog repo on GitHub: zsxkib/cog-Framepack (This repo contains the non-F1 variant)

👋 Follow me on Twitter/X