xai/grok-imagine-r2v

Generate videos guided by reference images using xAI's Grok Imagine Video model

Grok Imagine R2V

Reference-to-Video (R2V) takes one or more images and uses them as style and content references to guide video generation. Unlike image-to-video (where the image becomes the first frame), R2V treats your images as creative direction — the model draws on their visual style, subjects, and composition to produce something new.

What it does

Provide up to 7 reference images along with a text prompt, and the model generates a video that reflects the visual characteristics of your references. This is useful for:

  • Character consistency: Use photos of a character from different angles, then generate video of them in a new scene
  • Style transfer: Feed in images with a specific aesthetic (watercolor, noir, retro film) and the model carries that style into the video
  • Multi-subject scenes: Combine references of different subjects — a butterfly and a landscape, two characters, a product and a setting — and bring them together in motion
  • Creative remixing: Give the model a painting, a photo, and a sketch, and let it synthesize something that blends all three

How to use it

The key difference from image-to-video is the reference_images input. Pass a list of image URLs or uploaded files:

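import replicate

# The client reads your API token from the REPLICATE_API_TOKEN environment variable.
output = replicate.run(
    "xai/grok-imagine-r2v",
    input={
        # Describe the action and camera work; the references supply the look.
        "prompt": "A monarch butterfly gliding over ancient pyramids at golden hour, cinematic aerial shot",
        # Placeholder URLs: replace with your own reference images (1 to 7).
        "reference_images": [
            "https://example.com/butterfly.jpg",
            "https://example.com/pyramids.jpg"
        ],
        "duration": 8,           # seconds
        "aspect_ratio": "16:9",
        "resolution": "720p"
    }
)
print(output)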
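
The example above only prints the result. To keep the video, you would typically write it to disk. The following is a minimal sketch, assuming the client hands back a file-like object with a read() method (or a list containing one); save_video is a hypothetical helper name, not part of the replicate package, and the exact output type may differ between client versions.

```python
from pathlib import Path


def save_video(output, path="output.mp4"):
    """Write a model output to disk and return the destination path.

    Assumption: `output` is a file-like object exposing read(), or a
    list/tuple whose first item is such an object.
    """
    if isinstance(output, (list, tuple)):
        output = output[0]
    target = Path(path)
    target.write_bytes(output.read())
    return target
```

Usage would look like save_video(output, "butterfly.mp4") right after the replicate.run call.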

Prompt tips for R2V

Since the model already has visual references, your prompt should focus on what happens rather than what things look like:

  • Describe the action and motion: “The cat stretches lazily and pounces toward the camera” rather than “a fluffy orange cat”
  • Specify camera movement: “slow push-in,” “sweeping drone shot,” “handheld tracking”
  • Set the mood: “warm afternoon light,” “dramatic storm clouds,” “ethereal glow”
  • Be specific about how references combine: “The butterfly flies through the foreground while the pyramids fill the background”
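
One way to apply the tips above consistently is to assemble the prompt from separate parts. This is only an illustrative sketch: build_r2v_prompt and its parameter names are hypothetical, and the model does not require any particular ordering or separator.

```python
def build_r2v_prompt(action, camera=None, mood=None, composition=None):
    """Join the non-empty prompt parts into one comma-separated string.

    `action` describes what happens; the optional parts cover how the
    references combine, the camera movement, and the mood.
    """
    parts = [action, composition, camera, mood]
    cleaned = [p.strip().rstrip(".") for p in parts if p and p.strip()]
    return ", ".join(cleaned)


prompt = build_r2v_prompt(
    action="The cat stretches lazily and pounces toward the camera",
    camera="slow push-in",
    mood="warm afternoon light",
)
print(prompt)
```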

Technical details

  • Reference images: 1–7 images (jpg, jpeg, png, webp)
  • Video duration: 1–10 seconds
  • Resolution: 480p or 720p
  • Aspect ratios: 16:9, 9:16, 1:1, 4:3, 3:4, 3:2, 2:3
  • Prompt length: Up to 4,096 characters
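
The limits above can be checked locally before sending a request. The sketch below is a hypothetical helper (validate_r2v_input is not part of any library) that encodes only the values listed in this section. The file-type check looks at the extension of string references, which is a heuristic: a valid image URL may not end in an extension at all.

```python
from urllib.parse import urlparse

ALLOWED_EXTENSIONS = (".jpg", ".jpeg", ".png", ".webp")
ALLOWED_RESOLUTIONS = {"480p", "720p"}
ALLOWED_ASPECT_RATIOS = {"16:9", "9:16", "1:1", "4:3", "3:4", "3:2", "2:3"}
MAX_PROMPT_CHARS = 4096


def validate_r2v_input(prompt, reference_images, duration, aspect_ratio, resolution):
    """Return a list of problems; an empty list means the input fits the limits."""
    problems = []
    if len(prompt) > MAX_PROMPT_CHARS:
        problems.append("prompt is longer than 4,096 characters")
    if not 1 <= len(reference_images) <= 7:
        problems.append("reference_images must contain 1 to 7 images")
    for ref in reference_images:
        # Only string references (URLs or paths) are checked; file handles are skipped.
        if isinstance(ref, str) and not urlparse(ref).path.lower().endswith(ALLOWED_EXTENSIONS):
            problems.append(f"unexpected file type: {ref}")
    if not 1 <= duration <= 10:
        problems.append("duration must be between 1 and 10 seconds")
    if aspect_ratio not in ALLOWED_ASPECT_RATIOS:
        problems.append(f"unsupported aspect ratio: {aspect_ratio}")
    if resolution not in ALLOWED_RESOLUTIONS:
        problems.append(f"unsupported resolution: {resolution}")
    return problems
```

Returning a list of problems rather than raising on the first one lets you report every issue at once.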

Limitations

  • R2V cannot be combined with image-to-video (image input) or video editing (video input) — it’s a separate generation mode
  • Very large reference images may hit payload limits. Resize images to reasonable dimensions (under ~4000px on the longest side) before uploading
  • Maximum duration for R2V is 10 seconds, shorter than the 15-second limit for text-to-video
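
To stay under the suggested size for reference images, you can compute scaled-down dimensions that keep the aspect ratio. This is a sketch under assumptions: fit_within is a hypothetical helper, and the 4000px default simply mirrors the approximate guideline above rather than a documented hard limit.

```python
def fit_within(width, height, max_side=4000):
    """Return (width, height) scaled so the longest side is at most max_side.

    Images already within the limit are returned unchanged. Integer
    arithmetic avoids floating-point rounding surprises.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    new_width = max(1, width * max_side // longest)
    new_height = max(1, height * max_side // longest)
    return new_width, new_height


print(fit_within(8000, 6000))  # (4000, 3000)
```

The resulting size can then be passed to an image library's resize call, for example Pillow's Image.resize, before uploading.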