uthana/text-to-motion-diffusion-v2

Generate 3D character animation data from a text prompt


Generate 3D character animation from a text prompt. Describe a motion in natural English — a walk, a kick, a gesture, a mood — and get production-ready animation you can drop into Unity, Unreal, Blender, or any pipeline that reads FBX or GLB.

text-to-motion-diffusion-v2 uses a diffusion-based architecture that generates motion through iterative denoising, producing physically realistic animation from natural-language descriptions. The result is motion that moves like a real person, not procedural interpolation between poses.

No motion capture. No keyframing. No cleanup. Text in, animation out.

How it works

Unlike autoregressive approaches that predict motion frame-by-frame, text-to-motion-diffusion-v2 starts from noise and progressively refines it into a complete motion sequence guided by your text prompt. This diffusion process produces smoother, more physically grounded results — particularly on complex multi-step actions where frame-by-frame methods can drift or lose coherence.
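
To make the loop concrete, here is a toy DDPM-style sampling loop in Python. The noise schedule, the tensor shapes (frames x joints x xyz), and the predict_noise stand-in are all illustrative assumptions, not Uthana's architecture; in a real model the denoiser is a neural network conditioned on the encoded prompt.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy linear noise schedule over T diffusion steps (illustrative values).
    T = 50
    betas = np.linspace(1e-4, 0.02, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    def predict_noise(x, t, cond):
        # Stand-in for the learned denoiser. A real model is a neural net
        # conditioned on the text embedding `cond`; zeros keep this runnable.
        return np.zeros_like(x)

    cond = np.zeros(512)                   # stand-in for an encoded prompt
    x = rng.standard_normal((120, 24, 3))  # pure noise: frames x joints x xyz

    for t in reversed(range(T)):
        eps = predict_noise(x, t, cond)
        # DDPM update: subtract the predicted noise, then re-inject a smaller
        # amount so later steps refine progressively finer detail.
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            x += np.sqrt(betas[t]) * rng.standard_normal(x.shape)

    # x is now a complete motion sequence rather than a frame-by-frame rollout.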

The model is trained on Uthana’s library of 100,000+ studio-grade motion captures, covering locomotion, combat, idles, gestures, interactions, sports, and performance animation.

Example prompts that work well

  • "A person dodges left, then throws a spinning roundhouse kick" — compound actions with clear physical sequencing
  • "A person cautiously walks forward, looking around nervously" — emotional tone layered onto locomotion
  • "A person performs a celebratory jump with both arms raised" — full-body expression
  • "A person crouches to pick up an object from the ground" — clear movement and goal
  • "A tired character slumps into a chair" — mood associated with main action

Specificity helps. The model responds to tone, pace, body-part detail, and physical context.

What you get

  • A rigged animation clip on Uthana’s default bipedal character
  • Standard formats: FBX and GLB
  • Full skeletal motion data, retargetable to any humanoid rig
  • Controllable output length via motion_length (frames at 20 FPS)
  • Reproducible outputs via seed
  • Tunable prompt fidelity vs. diversity via cfg_scale
  • Adjustable quality/speed tradeoff via steps (diffusion steps)
  • Optional IK retargeting via retargeting_ik for tighter skeleton alignment

Parameters

  • prompt: Natural-language description of the motion. Required.
  • motion_length: Output length in frames, at the 20 FPS internal rate. Default: model default.
  • steps: Number of diffusion steps; higher is more refined but slower. Default: model default.
  • cfg_scale: Classifier-free guidance scale; higher stays closer to the prompt, lower gives more diverse motion. Default: model default.
  • seed: Random seed for reproducibility. Default: random.
  • retargeting_ik: Enable IK retargeting for better skeleton alignment. Default: false.
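
As a concrete example, an input payload might look like the following Python dict. The parameter names come from the list above; the values are illustrative, not the model's defaults.

    input_payload = {
        "prompt": "A person cautiously walks forward, looking around nervously",
        "motion_length": 80,     # 80 frames at 20 FPS = a 4-second clip
        "steps": 50,             # illustrative; higher = slower, more refined
        "cfg_scale": 7.0,        # illustrative; higher = closer to the prompt
        "seed": 42,              # fix this to reproduce a result exactly
        "retargeting_ik": True,  # tighter skeleton alignment via IK
    }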

Who this is for

  • Game developers prototyping character movesets, NPC behavior, or combat animation without a mocap budget
  • Indie studios that need studio-quality motion but can’t staff a full animation team
  • Tool builders integrating text-to-animation into editors, asset pipelines, or creative apps
  • Researchers working on motion synthesis, embodied AI, or human motion modeling

Intended use

This model is designed for generating character animation in games, film previs, VR/AR, interactive experiences, and research. It fits into existing pipelines — use it to prototype quickly, fill gaps in a motion library, or generate first-pass animation that you refine downstream.

Limitations

  • Bipedal only. Generates motion for humanoid characters on a standard bipedal skeleton. Quadrupeds and non-human rigs are out of scope.
  • Short clips. Typical output is 2–6 seconds per generation. For longer sequences, generate clips and stitch them (a batch-generation sketch follows this list), or use Uthana's full platform for blending and stitching tools.
  • Prompt sensitivity. Vague prompts produce generic motion. Specificity on body parts, tempo, emotion, and physical context leads to better results.
  • No facial animation. Body motion only — face and lip sync are outside scope.
  • No scene physics. The model animates the character in isolation. It doesn’t simulate collisions, props, or ground-contact with external objects.
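
For the stitching workflow mentioned above, a minimal batch-generation loop might look like this. It assumes the Replicate Python client and the parameter names from this page; how the output files come back depends on the model's output schema, so treat the handling here as a sketch.

    import replicate

    segments = [
        "A person walks forward cautiously",
        "A person breaks into a run",
        "A person leaps over a low obstacle",
    ]

    clips = []
    for prompt in segments:
        # A fixed seed keeps each segment reproducible across runs.
        output = replicate.run(
            "uthana/text-to-motion-diffusion-v2",
            input={"prompt": prompt, "seed": 42, "motion_length": 80},
        )
        clips.append(output)

    # Stitch the resulting FBX/GLB clips in your DCC or with Uthana's blending
    # tools; naive end-to-end concatenation will pop at the seams.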

How to use it

Via the Replicate playground

Enter a prompt and run. Adjust steps and cfg_scale to explore the quality/diversity tradeoff. Set a seed to lock in a result you like.

Via the API

See the API tab for code examples in Python, Node.js, and HTTP.
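
As a minimal Python sketch (the model identifier comes from this page; the shape of the returned output is an assumption, so check the schema in the API tab):

    import replicate

    output = replicate.run(
        "uthana/text-to-motion-diffusion-v2",
        input={
            "prompt": "A person dodges left, then throws a spinning roundhouse kick",
            "seed": 7,
        },
    )
    # Typically one or more URLs/files pointing at the generated FBX/GLB.
    print(output)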

In your engine

The output targets a standard bipedal skeleton. To apply it to your own character:

  • Unity: Import the FBX, use Humanoid rig retargeting
  • Unreal: Import the FBX, retarget via IK Retargeter
  • Blender: Import GLB or FBX, use built-in retargeting or Auto-Rig Pro

For native retargeting to your own rigs, batch generation, or pipeline integration, see Uthana’s API docs.
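
For Blender specifically, the import step from the list above can also be scripted. This is a minimal sketch using Blender's stock importers; the file path is a placeholder.

    import bpy

    # Both importers ship with stock Blender.
    bpy.ops.import_scene.gltf(filepath="/path/to/generated_motion.glb")
    # or: bpy.ops.import_scene.fbx(filepath="/path/to/generated_motion.fbx")

    # The clip arrives as an action on the imported armature; retarget it to
    # your own rig with Blender's tools or an add-on such as Auto-Rig Pro.
    for obj in bpy.context.selected_objects:
        if obj.type == "ARMATURE" and obj.animation_data:
            print(obj.name, "->", obj.animation_data.action.name)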

More from Uthana

This model is part of Uthana’s AI animation suite on Replicate.

For the full platform — 100,000+ motion library, DCC plugins, runtime SDKs, stitching, blending, looping, and enterprise features — visit uthana.com.

About Uthana

Uthana builds AI foundation models for human motion. The platform makes studio-quality character animation accessible to any team. All training data is ethically sourced from professional motion capture, performed by consenting actors.
