text-to-motion-diffusion-v2
Generate 3D character animation from a text prompt. Describe a motion in natural English — a walk, a kick, a gesture, a mood — and get production-ready animation you can drop into Unity, Unreal, Blender, or any pipeline that reads FBX or GLB.
text-to-motion-diffusion-v2 uses a diffusion-based architecture that generates motion through iterative denoising, producing physically realistic animation from natural-language descriptions. The result is motion that moves like a real person, not procedural interpolation between poses.
No motion capture. No keyframing. No cleanup. Text in, animation out.
How it works
Unlike autoregressive approaches that predict motion frame-by-frame, text-to-motion-diffusion-v2 starts from noise and progressively refines it into a complete motion sequence guided by your text prompt. This diffusion process produces smoother, more physically grounded results — particularly on complex multi-step actions where frame-by-frame methods can drift or lose coherence.
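Conceptually, the sampling loop looks like the sketch below. This is an illustrative sketch of the general guided-diffusion pattern, not the model's actual sampler: the denoiser, noise schedule, and embeddings are placeholder stand-ins.

```python
import numpy as np

def guided_denoise(x, t, prompt_embedding, cfg_scale):
    """Hypothetical stand-in for one denoising pass with classifier-free guidance."""
    eps_cond = np.zeros_like(x)    # noise estimate conditioned on the prompt (placeholder)
    eps_uncond = np.zeros_like(x)  # noise estimate with the prompt dropped (placeholder)
    eps = eps_uncond + cfg_scale * (eps_cond - eps_uncond)
    return x - 0.02 * eps          # nudge the whole sequence toward the guided estimate

steps, frames, joints = 50, 120, 22        # e.g. a 6 s clip at 20 FPS on a bipedal skeleton
x = np.random.randn(frames, joints, 3)     # the entire motion starts as Gaussian noise
prompt_embedding = np.zeros(512)           # placeholder text embedding

for t in reversed(range(steps)):           # more steps = more refinement, at the cost of speed
    x = guided_denoise(x, t, prompt_embedding, cfg_scale=2.5)
# x is now the full clip, produced all at once rather than frame by frame
```

The `steps` and `cfg_scale` parameters described below map directly onto this loop: the step count sets how many refinement passes run, and the guidance scale sets how strongly the text condition pulls the result.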
The model is trained on Uthana’s library of 100,000+ studio-grade motion captures, covering locomotion, combat, idles, gestures, interactions, sports, and performance animation.
Example prompts that work well
"A person dodges left, then throws a spinning roundhouse kick"— compound actions with clear physical sequencing"A person cautiously walks forward, looking around nervously"— emotional tone layered onto locomotion"A person performs a celebratory jump with both arms raised"— full-body expression"A person crouches to pick up an object from the ground"— clear movement and goal"A tired character slumps into a chair"— mood associated with main action
Specificity helps. The model responds to tone, pace, body-part detail, and physical context.
What you get
- A rigged animation clip on Uthana’s default bipedal character
- Standard formats: FBX and GLB
- Full skeletal motion data, retargetable to any humanoid rig
- Controllable output length via `motion_length` (frames at 20 FPS)
- Reproducible outputs via `seed`
- Tunable prompt fidelity vs. diversity via `cfg_scale`
- Adjustable quality/speed tradeoff via `steps` (diffusion steps)
- Optional IK retargeting via `retargeting_ik` for tighter skeleton alignment
Parameters
| Parameter | Description | Default |
|---|---|---|
| `prompt` | Natural-language description of the motion | (required) |
| `motion_length` | Output length in frames (at 20 FPS internal rate) | Model default |
| `steps` | Number of diffusion steps — higher = more refined, slower | Model default |
| `cfg_scale` | Classifier-free guidance scale — higher = closer to prompt, lower = more diverse | Model default |
| `seed` | Random seed for reproducibility | Random |
| `retargeting_ik` | Enable IK retargeting for better skeleton alignment | `false` |
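For orientation, here is a hedged Python sketch of a call that sets each of these parameters. The model slug is an assumption based on this page's name; confirm the exact identifier and input schema on the API tab.

```python
import replicate  # reads REPLICATE_API_TOKEN from the environment

output = replicate.run(
    "uthana/text-to-motion-diffusion-v2",  # assumed slug; check the API tab
    input={
        "prompt": "A person dodges left, then throws a spinning roundhouse kick",
        "motion_length": 80,       # frames at 20 FPS, i.e. a ~4 s clip
        "steps": 50,               # more diffusion steps = more refined, slower
        "cfg_scale": 2.5,          # higher = closer to the prompt, lower = more diverse
        "seed": 42,                # fix for reproducible output
        "retargeting_ik": True,    # optional IK pass for tighter skeleton alignment
    },
)
print(output)  # URL(s) for the generated FBX/GLB clip
```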
Who this is for
- Game developers prototyping character movesets, NPC behavior, or combat animation without a mocap budget
- Indie studios that need studio-quality motion but can’t staff a full animation team
- Tool builders integrating text-to-animation into editors, asset pipelines, or creative apps
- Researchers working on motion synthesis, embodied AI, or human motion modeling
Intended use
This model is designed for generating character animation in games, film previs, VR/AR, interactive experiences, and research. It fits into existing pipelines — use it to prototype quickly, fill gaps in a motion library, or generate first-pass animation that you refine downstream.
Limitations
- Bipedal only. Generates motion for humanoid characters on a standard bipedal skeleton. Quadrupeds and non-human rigs are out of scope.
- Short clips. Typical output is 2–6 seconds per generation. For longer sequences, generate clips and stitch them (see the sketch after this list) — or use Uthana’s full platform for blending and stitching tools.
- Prompt sensitivity. Vague prompts produce generic motion. Specificity on body parts, tempo, emotion, and physical context leads to better results.
- No facial animation. Body motion only — face and lip sync are outside scope.
- No scene physics. The model animates the character in isolation. It doesn’t simulate collisions, props, or ground-contact with external objects.
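As noted in the clip-length limitation above, longer sequences are usually assembled from several short generations. Below is a hedged sketch of that pattern; the model slug is an assumption, and the stitching itself happens downstream in your DCC or with Uthana's platform tools.

```python
import replicate  # reads REPLICATE_API_TOKEN from the environment

# Break the sequence into short beats, generate each one, and keep the files
# for blending/stitching downstream.
beats = [
    "A person walks forward cautiously, looking around nervously",
    "A person crouches to pick up an object from the ground",
    "A person stands up and performs a celebratory jump with both arms raised",
]

for i, prompt in enumerate(beats):
    output = replicate.run(
        "uthana/text-to-motion-diffusion-v2",  # assumed slug; check the API tab
        input={"prompt": prompt, "motion_length": 80, "seed": 42},
    )
    print(f"clip_{i:02d}:", output)  # download each clip, then stitch in your pipeline
```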
How to use it
Via the Replicate playground
Enter a prompt and run. Adjust `steps` and `cfg_scale` to explore the quality/diversity tradeoff. Set a `seed` to lock in a result you like.
Via the API
See the API tab for code examples in Python, Node.js, and HTTP.
In your engine
The output targets a standard bipedal skeleton. To apply it to your own character:
- Unity: Import the FBX, use Humanoid rig retargeting
- Unreal: Import the FBX, retarget via IK Retargeter
- Blender: Import GLB or FBX, use built-in retargeting or Auto-Rig Pro
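For the Blender route, here is a minimal sketch of the import step, run from Blender's Python console or as a script inside Blender. File paths are placeholders; the retargeting itself is done afterwards with a retargeting add-on or Auto-Rig Pro.

```python
import bpy

# Import the generated clip onto the default bipedal skeleton.
bpy.ops.import_scene.fbx(filepath="/path/to/generated_motion.fbx")
# bpy.ops.import_scene.gltf(filepath="/path/to/generated_motion.glb")  # GLB variant

# List the imported actions so you can pick the clip to retarget onto your rig.
for action in bpy.data.actions:
    print(action.name, action.frame_range)
```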
For native retargeting to your own rigs, batch generation, or pipeline integration, see Uthana’s API docs.
More from Uthana
Part of Uthana’s AI animation suite on Replicate:
- text-to-motion-vqvae-v1 — Autoregressive text-to-motion model — fast, lightweight, production-tested
- create-character-v1 — Automatically rig any bipedal 3D character in under 30 seconds
For the full platform — 100,000+ motion library, DCC plugins, runtime SDKs, stitching, blending, looping, and enterprise features — visit uthana.com.
About Uthana
Uthana builds AI foundation models for human motion. The platform makes studio-quality character animation accessible to any team. All training data is ethically sourced from professional motion capture, performed by consenting actors.