Grok Imagine R2V

Generate videos guided by reference images using xAI’s Grok Imagine Video model.

Reference-to-Video (R2V) takes one or more images and uses them as style and content references to guide video generation. Unlike image-to-video (where the image becomes the first frame), R2V treats your images as creative direction — the model draws on their visual style, subjects, and composition to produce something new.

What it does

Provide up to 7 reference images along with a text prompt, and the model generates a video that reflects the visual characteristics of your references. This is useful for:

Character consistency: Use photos of a character from different angles, then generate a video of that character in a new scene
Style transfer: Feed in images with a specific aesthetic (watercolor, noir, retro film) and the model carries that style into the video
Multi-subject scenes: Combine references of different subjects — a butterfly and a landscape, two characters, a product and a setting — and bring them together in motion
Creative remixing: Give the model a painting, a photo, and a sketch, and let it synthesize something that blends all three

How to use it

The key difference from image-to-video is the reference_images input. Pass a list of image URLs or uploaded files:

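import replicate

# The client reads your API token from the REPLICATE_API_TOKEN environment variable.
output = replicate.run(
    "xai/grok-imagine-r2v",
    input={
        # Describe motion and camera work; the references supply the look.
        "prompt": "A monarch butterfly gliding over ancient pyramids at golden hour, cinematic aerial shot",
        # Between 1 and 7 reference images.
        "reference_images": [
            "https://example.com/butterfly.jpg",
            "https://example.com/pyramids.jpg"
        ],
        "duration": 8,  # seconds, 1 to 10
        "aspect_ratio": "16:9",
        "resolution": "720p"
    }
)
print(output)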

Prompt tips for R2V

Since the model already has visual references, your prompt should focus on what happens rather than what things look like:

Describe the action and motion: “The cat stretches lazily and pounces toward the camera” rather than “a fluffy orange cat”
Specify camera movement: “slow push-in,” “sweeping drone shot,” “handheld tracking”
Set the mood: “warm afternoon light,” “dramatic storm clouds,” “ethereal glow”
Be specific about how references combine: “The butterfly flies through the foreground while the pyramids fill the background”
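One way to keep prompts focused on these four ingredients is to assemble them from separate parts. The sketch below is only an illustration: build_r2v_prompt is a hypothetical helper, not part of the model or the Replicate client, and the comma-joined ordering is an assumption rather than a documented requirement.

```python
def build_r2v_prompt(action, composition=None, camera=None, mood=None):
    """Join the non-empty prompt parts into one comma-separated string.

    Hypothetical helper: the part names and their order are illustrative.
    """
    if not action or not action.strip():
        raise ValueError("describe the action first; references cover the look")
    parts = [action, composition, camera, mood]
    # Drop missing parts, then trim whitespace and trailing periods so the
    # pieces join cleanly.
    cleaned = [p.strip().rstrip(".") for p in parts if p and p.strip()]
    return ", ".join(cleaned)


prompt = build_r2v_prompt(
    action="The butterfly flies through the foreground while the pyramids fill the background",
    camera="sweeping drone shot",
    mood="warm afternoon light",
)
print(prompt)
```

The resulting string can be passed as the prompt input in the example above.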

Technical details

Reference images: 1–7 images (jpg, jpeg, png, webp)
Video duration: 1–10 seconds
Resolution: 480p or 720p
Aspect ratios: 16:9, 9:16, 1:1, 4:3, 3:4, 3:2, 2:3
Prompt length: Up to 4,096 characters
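Checking inputs against these limits locally can save a failed request. The following is a minimal sketch, assuming the input names from the example above (prompt, reference_images, duration, aspect_ratio, resolution). validate_r2v_input is a hypothetical helper, and the file-extension check is only a heuristic for string URLs, since a valid image URL may not end in an extension.

```python
from urllib.parse import urlparse

ALLOWED_EXTENSIONS = {"jpg", "jpeg", "png", "webp"}
ALLOWED_RESOLUTIONS = {"480p", "720p"}
ALLOWED_ASPECT_RATIOS = {"16:9", "9:16", "1:1", "4:3", "3:4", "3:2", "2:3"}
MAX_PROMPT_CHARS = 4096


def validate_r2v_input(params):
    """Return a list of problems found in an R2V input dict (empty if none)."""
    problems = []

    images = params.get("reference_images", [])
    if not 1 <= len(images) <= 7:
        problems.append(f"expected 1 to 7 reference images, got {len(images)}")
    for item in images:
        # Heuristic: only string URLs are checked, by their path extension.
        if isinstance(item, str):
            path = urlparse(item).path
            ext = path.rsplit(".", 1)[-1].lower() if "." in path else ""
            if ext not in ALLOWED_EXTENSIONS:
                problems.append(f"unrecognized image extension: {item}")

    duration = params.get("duration")
    if duration is not None and not 1 <= duration <= 10:
        problems.append(f"duration must be 1 to 10 seconds, got {duration}")

    resolution = params.get("resolution")
    if resolution is not None and resolution not in ALLOWED_RESOLUTIONS:
        problems.append(f"unsupported resolution: {resolution}")

    aspect_ratio = params.get("aspect_ratio")
    if aspect_ratio is not None and aspect_ratio not in ALLOWED_ASPECT_RATIOS:
        problems.append(f"unsupported aspect ratio: {aspect_ratio}")

    if len(params.get("prompt", "")) > MAX_PROMPT_CHARS:
        problems.append(f"prompt exceeds {MAX_PROMPT_CHARS} characters")

    return problems


example_input = {
    "prompt": "A monarch butterfly gliding over ancient pyramids at golden hour",
    "reference_images": [
        "https://example.com/butterfly.jpg",
        "https://example.com/pyramids.jpg",
    ],
    "duration": 8,
    "aspect_ratio": "16:9",
    "resolution": "720p",
}
print(validate_r2v_input(example_input))
```

Optional inputs that are left out are skipped rather than flagged, because the model's defaults are not listed here.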

Limitations

R2V cannot be combined with image-to-video (image input) or video editing (video input) — it’s a separate generation mode
Very large reference images may hit payload limits. Resize images to reasonable dimensions (under ~4000px on the longest side) before uploading
Maximum duration for R2V is 10 seconds, shorter than the 15-second limit for text-to-video
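For the payload limit mentioned above, reference images can be downscaled before upload. This is a rough sketch: fit_within and resize_reference are hypothetical helpers, the 4000px default simply mirrors the approximate figure given above, and resize_reference assumes the third-party Pillow library is installed.

```python
def fit_within(width, height, max_side=4000):
    """Scale (width, height) down so the longest side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough; never upscale
    # Integer arithmetic keeps the aspect ratio and never exceeds max_side.
    new_width = max(1, width * max_side // longest)
    new_height = max(1, height * max_side // longest)
    return new_width, new_height


def resize_reference(src_path, dst_path, max_side=4000):
    """Save a copy of the image at src_path that fits within max_side."""
    from PIL import Image  # assumption: Pillow is installed

    with Image.open(src_path) as img:
        new_size = fit_within(img.width, img.height, max_side)
        if new_size != img.size:
            img = img.resize(new_size, Image.LANCZOS)
        img.save(dst_path)


print(fit_within(8000, 4000))
```

Calling resize_reference("large.jpg", "large_resized.jpg") would write a downscaled copy that can then be uploaded as a reference image; the file names here are placeholders.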
