Grok Imagine R2V
Generate videos guided by reference images using xAI’s Grok Imagine Video model.
Reference-to-Video (R2V) takes one or more images and uses them as style and content references to guide video generation. Unlike image-to-video (where the image becomes the first frame), R2V treats your images as creative direction — the model draws on their visual style, subjects, and composition to produce something new.
What it does
Provide up to 7 reference images along with a text prompt, and the model generates a video that reflects the visual characteristics of your references. This is useful for:
- Character consistency: Use photos of a character from different angles, then generate video of them in a new scene
- Style transfer: Feed in images with a specific aesthetic (watercolor, noir, retro film) and the model carries that style into the video
- Multi-subject scenes: Combine references of different subjects — a butterfly and a landscape, two characters, a product and a setting — and bring them together in motion
- Creative remixing: Give the model a painting, a photo, and a sketch, and let it synthesize something that blends all three
How to use it
The key difference from image-to-video is the reference_images input. Pass a list of image URLs or uploaded files:
import replicate

output = replicate.run(
    "xai/grok-imagine-r2v",
    input={
        "prompt": "A monarch butterfly gliding over ancient pyramids at golden hour, cinematic aerial shot",
        "reference_images": [
            "https://example.com/butterfly.jpg",
            "https://example.com/pyramids.jpg"
        ],
        "duration": 8,
        "aspect_ratio": "16:9",
        "resolution": "720p"
    }
)
print(output)
Prompt tips for R2V
Since the model already has visual references, your prompt should focus on what happens rather than what things look like:
- Describe the action and motion: “The cat stretches lazily and pounces toward the camera” rather than “a fluffy orange cat”
- Specify camera movement: “slow push-in,” “sweeping drone shot,” “handheld tracking”
- Set the mood: “warm afternoon light,” “dramatic storm clouds,” “ethereal glow”
- Be specific about how references combine: “The butterfly flies through the foreground while the pyramids fill the background”
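One way to keep prompts focused on action, camera, mood, and composition is to assemble them from separate parts. The helper below is a minimal sketch; build_r2v_prompt and its parameter names are hypothetical and not part of the model's API.

```python
def build_r2v_prompt(action, camera=None, mood=None, composition=None):
    """Join the non-empty prompt parts into one comma-separated string.

    Hypothetical helper: the model only receives the final string.
    """
    parts = [action, camera, mood, composition]
    # Drop missing parts, trim whitespace and trailing periods before joining.
    cleaned = [p.strip().rstrip(".") for p in parts if p and p.strip()]
    return ", ".join(cleaned)


prompt = build_r2v_prompt(
    action="The butterfly flies through the foreground while the pyramids fill the background",
    camera="sweeping drone shot",
    mood="warm afternoon light",
)
print(prompt)
```

The resulting string can be passed as the prompt input in the call shown under "How to use it".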
Technical details
- Reference images: 1–7 images (jpg, jpeg, png, webp)
- Video duration: 1–10 seconds
- Resolution: 480p or 720p
- Aspect ratios: 16:9, 9:16, 1:1, 4:3, 3:4, 3:2, 2:3
- Prompt length: Up to 4,096 characters
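The limits above can be checked locally before submitting a request. This is a rough sketch, not an official validator: validate_r2v_input is a hypothetical name, it assumes reference images are given as URL strings, and it guesses the file type from the URL path, which will misjudge URLs that carry no extension.

```python
from urllib.parse import urlparse

MAX_REFERENCE_IMAGES = 7
MAX_DURATION_SECONDS = 10
MAX_PROMPT_CHARS = 4096
ALLOWED_EXTENSIONS = {"jpg", "jpeg", "png", "webp"}
ALLOWED_RESOLUTIONS = {"480p", "720p"}
ALLOWED_ASPECT_RATIOS = {"16:9", "9:16", "1:1", "4:3", "3:4", "3:2", "2:3"}


def validate_r2v_input(params):
    """Return a list of problems found in params; an empty list means none."""
    problems = []

    images = params.get("reference_images", [])
    if not 1 <= len(images) <= MAX_REFERENCE_IMAGES:
        problems.append(f"expected 1-{MAX_REFERENCE_IMAGES} reference images, got {len(images)}")
    for url in images:
        # Assumption: the extension in the URL path reflects the image type.
        path = urlparse(url).path
        ext = path.rsplit(".", 1)[-1].lower() if "." in path else ""
        if ext not in ALLOWED_EXTENSIONS:
            problems.append(f"unsupported image type: {url}")

    # Optional inputs are only checked when present; defaults are not assumed.
    duration = params.get("duration")
    if duration is not None and not 1 <= duration <= MAX_DURATION_SECONDS:
        problems.append(f"duration must be 1-{MAX_DURATION_SECONDS} seconds, got {duration}")

    resolution = params.get("resolution")
    if resolution is not None and resolution not in ALLOWED_RESOLUTIONS:
        problems.append(f"unsupported resolution: {resolution}")

    aspect_ratio = params.get("aspect_ratio")
    if aspect_ratio is not None and aspect_ratio not in ALLOWED_ASPECT_RATIOS:
        problems.append(f"unsupported aspect ratio: {aspect_ratio}")

    if len(params.get("prompt", "")) > MAX_PROMPT_CHARS:
        problems.append(f"prompt exceeds {MAX_PROMPT_CHARS} characters")

    return problems


example = {
    "prompt": "A monarch butterfly gliding over ancient pyramids at golden hour",
    "reference_images": [
        "https://example.com/butterfly.jpg",
        "https://example.com/pyramids.jpg",
    ],
    "duration": 8,
    "aspect_ratio": "16:9",
    "resolution": "720p",
}
print(validate_r2v_input(example))
```

Returning a list of messages rather than raising on the first failure lets a caller report every problem at once.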
Limitations
- R2V cannot be combined with image-to-video (image input) or video editing (video input); it is a separate generation mode
- Very large reference images may hit payload limits. Resize images to reasonable dimensions (under ~4000px on the longest side) before uploading
- Maximum duration for R2V is 10 seconds, shorter than the 15-second limit for text-to-video
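Two of these limitations can be guarded against before a request is sent. The sketch below computes target dimensions that keep the longest side within the suggested ~4000px, and flags inputs that belong to other generation modes. Both function names are hypothetical, and the input names image and video are assumptions about how the other modes are exposed.

```python
def fit_within(width, height, max_side=4000):
    """Scale (width, height) down proportionally so the longest side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough, leave unchanged
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def find_mode_conflicts(params):
    """Return inputs that should not accompany reference_images.

    Assumption: image-to-video and video editing use inputs named
    "image" and "video".
    """
    if "reference_images" not in params:
        return []
    return [key for key in ("image", "video") if params.get(key)]


print(fit_within(8000, 6000))
print(find_mode_conflicts({"reference_images": ["a.png"], "image": "b.png"}))
```

The dimensions returned by fit_within can be handed to an image library's resize call, for example Pillow's Image.resize, before the file is uploaded.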