text-to-motion-diffusion-v2
Generate 3D character animation from a text prompt. Describe a motion in natural English — a walk, a kick, a gesture, a mood — and get production-ready animation you can drop into Unity, Unreal, Blender, or any pipeline that reads FBX or GLB.
text-to-motion-diffusion-v2 uses a diffusion-based architecture that generates motion through iterative denoising, producing physically realistic animation from natural-language descriptions. The result is motion that moves like a real person, not procedural interpolation between poses.
No motion capture. No keyframing. No cleanup. Text in, animation out.
How it works
Unlike autoregressive approaches that predict motion frame-by-frame, text-to-motion-diffusion-v2 starts from noise and progressively refines it into a complete motion sequence guided by your text prompt. This diffusion process produces smoother, more physically grounded results — particularly on complex multi-step actions where frame-by-frame methods can drift or lose coherence.
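Conceptually, the sampling loop looks like the sketch below. This is an illustrative sketch of the general guided-diffusion pattern, not the model's actual sampler: the denoiser, noise schedule, and embeddings are placeholder stand-ins.

```python
import numpy as np

def guided_denoise(x, t, prompt_embedding, cfg_scale):
    """Hypothetical stand-in for one denoising pass with classifier-free guidance."""
    eps_cond = np.zeros_like(x)    # noise estimate conditioned on the prompt (placeholder)
    eps_uncond = np.zeros_like(x)  # noise estimate with the prompt dropped (placeholder)
    eps = eps_uncond + cfg_scale * (eps_cond - eps_uncond)
    return x - 0.02 * eps          # nudge the whole sequence toward the guided estimate

steps, frames, joints = 50, 120, 22        # e.g. a 6 s clip at 20 FPS on a bipedal skeleton
x = np.random.randn(frames, joints, 3)     # the entire motion starts as Gaussian noise
prompt_embedding = np.zeros(512)           # placeholder text embedding

for t in reversed(range(steps)):           # more steps = more refinement, at the cost of speed
    x = guided_denoise(x, t, prompt_embedding, cfg_scale=2.5)
# x is now the full clip, produced all at once rather than frame by frame
```

The `steps` and `cfg_scale` parameters described below map directly onto this loop: the step count sets how many refinement passes run, and the guidance scale sets how strongly the text condition pulls the result.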
The model is trained on Uthana’s library of 100,000+ studio-grade motion captures, covering locomotion, combat, idles, gestures, interactions, sports, and performance animation.
Example prompts that work well
"A person dodges left, then throws a spinning roundhouse kick"— compound actions with clear physical sequencing"A person cautiously walks forward, looking around nervously"— emotional tone layered onto locomotion"A person performs a celebratory jump with both arms raised"— full-body expression"A person crouches to pick up an object from the ground"— clear movement and goal"A tired character slumps into a chair"— mood associated with main action
Specificity helps. The model responds to tone, pace, body-part detail, and physical context.
What you get
- A rigged animation clip on Uthana’s default bipedal character
- Standard formats: FBX and GLB
- Full skeletal motion data, retargetable to any humanoid rig
- Controllable output length via `motion_length` (frames at 20 FPS)
- Reproducible outputs via `seed`
- Tunable prompt fidelity vs. diversity via `cfg_scale`
- Adjustable quality/speed tradeoff via `steps` (diffusion steps)
- Optional IK retargeting via `retargeting_ik` for tighter skeleton alignment
Parameters
| Parameter | Description | Default |
|---|---|---|
| `prompt` | Natural-language description of the motion | (required) |
| `motion_length` | Output length in frames (at 20 FPS internal rate) | Model default |
| `steps` | Number of diffusion steps — higher = more refined, slower | Model default |
| `cfg_scale` | Classifier-free guidance scale — higher = closer to prompt, lower = more diverse | Model default |
| `seed` | Random seed for reproducibility | Random |
| `retargeting_ik` | Enable IK retargeting for better skeleton alignment | `false` |
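For orientation, here is a hedged Python sketch of a call that sets each of these parameters. The model slug is an assumption based on this page's name; confirm the exact identifier and input schema on the API tab.

```python
import replicate  # reads REPLICATE_API_TOKEN from the environment

output = replicate.run(
    "uthana/text-to-motion-diffusion-v2",  # assumed slug; check the API tab
    input={
        "prompt": "A person dodges left, then throws a spinning roundhouse kick",
        "motion_length": 80,       # frames at 20 FPS, i.e. a ~4 s clip
        "steps": 50,               # more diffusion steps = more refined, slower
        "cfg_scale": 2.5,          # higher = closer to the prompt, lower = more diverse
        "seed": 42,                # fix for reproducible output
        "retargeting_ik": True,    # optional IK pass for tighter skeleton alignment
    },
)
print(output)  # URL(s) for the generated FBX/GLB clip
```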
Who this is for
- Game developers prototyping character movesets, NPC behavior, or combat animation without a mocap budget
- Indie studios that need studio-quality motion but can’t staff a full animation team
- Tool builders integrating text-to-animation into editors, asset pipelines, or creative apps
- Researchers working on motion synthesis, embodied AI, or human motion modeling
Intended use
This model is designed for generating character animation in games, film previs, VR/AR, interactive experiences, and research. It fits into existing pipelines — use it to prototype quickly, fill gaps in a motion library, or generate first-pass animation that you refine downstream.
Limitations
- Bipedal only. Generates motion for humanoid characters on a standard bipedal skeleton. Quadrupeds and non-human rigs are out of scope.
- Short clips. Typical output is 2–6 seconds per generation. For longer sequences, generate clips and stitch them (see the sketch after this list) — or use Uthana’s full platform for blending and stitching tools.
- Prompt sensitivity. Vague prompts produce generic motion. Specificity on body parts, tempo, emotion, and physical context leads to better results.
- No facial animation. Body motion only — face and lip sync are outside scope.
- No scene physics. The model animates the character in isolation. It doesn’t simulate collisions, props, or ground-contact with external objects.
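As noted in the clip-length limitation above, longer sequences are usually assembled from several short generations. Below is a hedged sketch of that pattern; the model slug is an assumption, and the stitching itself happens downstream in your DCC or with Uthana's platform tools.

```python
import replicate  # reads REPLICATE_API_TOKEN from the environment

# Break the sequence into short beats, generate each one, and keep the files
# for blending/stitching downstream.
beats = [
    "A person walks forward cautiously, looking around nervously",
    "A person crouches to pick up an object from the ground",
    "A person stands up and performs a celebratory jump with both arms raised",
]

for i, prompt in enumerate(beats):
    output = replicate.run(
        "uthana/text-to-motion-diffusion-v2",  # assumed slug; check the API tab
        input={"prompt": prompt, "motion_length": 80, "seed": 42},
    )
    print(f"clip_{i:02d}:", output)  # download each clip, then stitch in your pipeline
```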
How to use it
Via the Replicate playground
Enter a prompt and run. Adjust `steps` and `cfg_scale` to explore the quality/diversity tradeoff. Set a `seed` to lock in a result you like.
Via the API
See the API tab for code examples in Python, Node.js, and HTTP.
In your engine
The output targets a standard bipedal skeleton. To apply it to your own character:
- Unity: Import the FBX, use Humanoid rig retargeting
- Unreal: Import the FBX, retarget via IK Retargeter
- Blender: Import GLB or FBX, use built-in retargeting or Auto-Rig Pro
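For the Blender route, here is a minimal sketch of the import step, run from Blender's Python console or as a script inside Blender. File paths are placeholders; the retargeting itself is done afterwards with a retargeting add-on or Auto-Rig Pro.

```python
import bpy

# Import the generated clip onto the default bipedal skeleton.
bpy.ops.import_scene.fbx(filepath="/path/to/generated_motion.fbx")
# bpy.ops.import_scene.gltf(filepath="/path/to/generated_motion.glb")  # GLB variant

# List the imported actions so you can pick the clip to retarget onto your rig.
for action in bpy.data.actions:
    print(action.name, action.frame_range)
```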
For native retargeting to your own rigs, batch generation, or pipeline integration, see Uthana’s API docs.
More from Uthana
Part of Uthana’s AI animation suite on Replicate:
- text-to-motion-vqvae-v1 — Autoregressive text-to-motion model — fast, lightweight, production-tested
- create-character-v1 — Automatically rig any bipedal 3D character in under 30 seconds
For the full platform — 100,000+ motion library, DCC plugins, runtime SDKs, stitching, blending, looping, and enterprise features — visit uthana.com.
About Uthana
Uthana builds AI foundation models for human motion. The platform makes studio-quality character animation accessible to any team. All training data is ethically sourced from professional motion capture, performed by consenting actors.