uthana/text-to-motion-vqvae-v1

Generate 3D character animation data from a text prompt

Generate 3D character animation from a text prompt. Describe a motion in plain English and get a production-ready animation clip you can drop into Unity, Unreal, Blender, or any pipeline that reads FBX or GLB.

text-to-motion-vqvae-v1 is Uthana’s autoregressive motion generation model. It predicts motion token-by-token from your text prompt, producing clean, reliable animation with fast inference. If you need to generate a lot of animation quickly — prototyping movesets, populating NPC libraries, building tools that call text-to-motion at scale — this is the model to start with.

No motion capture. No keyframing. No cleanup. Text in, animation out.

How it works

text-to-motion-vqvae-v1 converts your text prompt into a sequence of motion tokens, then decodes those tokens into skeletal animation data. The autoregressive approach generates motion sequentially — each frame is informed by the frames before it — producing natural, coherent clips with fast, predictable inference times.
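
The exact architecture isn't published on this page, but the general pattern is easy to sketch: a learned codebook of motion tokens, an autoregressive prior that picks the next token from the prompt and the tokens generated so far, and a decoder that turns tokens back into per-frame joint data. Every name, size, and stand-in "model" below is invented for illustration; this is not Uthana's implementation.

```python
# Toy sketch of the VQ-VAE + autoregressive-prior pattern described above.
# Nothing here is Uthana's actual code: sizes, names, and the "models"
# (random matrices) are placeholders for illustration only.
import numpy as np

rng = np.random.default_rng(0)

CODEBOOK_SIZE = 512   # number of discrete motion tokens (illustrative)
LATENT_DIM = 64       # size of each codebook entry (illustrative)
NUM_JOINTS = 24       # joints on a bipedal skeleton (illustrative)

codebook = rng.normal(size=(CODEBOOK_SIZE, LATENT_DIM))   # learned by the VQ-VAE
to_pose = rng.normal(size=(LATENT_DIM, NUM_JOINTS * 3))   # stand-in decoder head

def next_token_logits(prompt_emb, history):
    """Stand-in for the autoregressive prior: score every candidate token
    given the prompt embedding and the tokens generated so far."""
    context = prompt_emb + sum((codebook[t] for t in history), np.zeros(LATENT_DIM))
    return codebook @ context                               # (CODEBOOK_SIZE,)

def decode_tokens(tokens, frames_per_token=4):
    """Stand-in for the VQ-VAE decoder: discrete tokens -> per-frame joint channels."""
    latents = codebook[np.asarray(tokens)]                  # (T, LATENT_DIM)
    frames = np.repeat(latents, frames_per_token, axis=0)   # upsample in time
    return (frames @ to_pose).reshape(-1, NUM_JOINTS, 3)    # (frames, joints, xyz)

prompt_emb = rng.normal(size=LATENT_DIM)   # pretend output of a text encoder
tokens = []
for _ in range(12):                        # token-by-token: each step sees the past
    tokens.append(int(np.argmax(next_token_logits(prompt_emb, tokens))))

animation = decode_tokens(tokens)
print(animation.shape)                     # (48, 24, 3)
```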

The model is trained on Uthana’s library of 100,000+ studio-grade motion captures, covering locomotion, combat, idles, gestures, interactions, sports, and performance animation.

When to use v1 vs. v2

Both models generate 3D animation from text. The difference is in the architecture and what that means for your workflow:

  • Use text-to-motion-vqvae-v1 when speed and simplicity matter most. Faster inference, minimal parameters, consistent results. Good for batch generation, pipeline automation, and tools where you need reliable output without per-generation tuning.
  • Use text-to-motion-diffusion-v2 when you want finer control over the output. Diffusion-based, with parameters for guidance scale, diffusion steps, seed, and output length. Better suited for complex multi-step actions and cases where you want to iterate on a specific result.

They’re complementary, not competing — many workflows use v1 for volume and v2 for hero animations.
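
In API terms, the difference shows up as the size of the input dict. A hedged sketch using Replicate's Python client: v1's inputs match the Parameters section below, while the v2 model slug and its parameter names are guesses based on the description above, not a verified schema.

```python
# Hedged sketch: the same prompt through both models via Replicate's Python client.
# v1's inputs match the Parameters section below; the v2 slug and input names are
# guesses based on the description above; check the v2 model page before relying on them.
import replicate

# v1: prompt in, clip out. Nothing to tune per generation.
v1_clip = replicate.run(
    "uthana/text-to-motion-vqvae-v1",
    input={"prompt": "A person throws a quick jab followed by a cross punch",
           "foot_ik": True},
)

# v2: extra knobs for iterating on a hero animation (names illustrative).
v2_clip = replicate.run(
    "uthana/text-to-motion-diffusion-v2",   # assumed slug
    input={"prompt": "A person throws a quick jab followed by a cross punch",
           "guidance_scale": 7.5,           # hypothetical parameter name
           "num_steps": 50,                 # hypothetical parameter name
           "seed": 42},                     # hypothetical parameter name
)
```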

Example prompts that work well

  • "A person confidently walk forward" — clean locomotion, good baseline test
  • "A person performs a hero pose, hands on hips" — static pose with clear body positioning
  • "A person throws a quick jab followed by a cross punch" — short combat sequence with tempo
  • "A person waves hello with the right hand" — isolated upper-body gesture
  • "A person stands in an idle stance, shifting weight from foot to foot" — subtle motion, good for game idles

Clear, direct prompts work best with v1. Describe the action, specify which body parts are involved, and include any directional or timing cues.
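
Because v1 is pitched at volume work, the prompts above can also be run as a simple batch. A minimal sketch with the Python client; saving the returned files is shown in the API example further down.

```python
# Minimal batch sketch: run each example prompt through v1 in one pass.
# Output handling (saving FBX/GLB) is shown in the API example further down.
import replicate

prompts = [
    "A person confidently walks forward",
    "A person performs a hero pose, hands on hips",
    "A person throws a quick jab followed by a cross punch",
    "A person waves hello with the right hand",
    "A person stands in an idle stance, shifting weight from foot to foot",
]

clips = {p: replicate.run("uthana/text-to-motion-vqvae-v1",
                          input={"prompt": p, "foot_ik": True})
         for p in prompts}
```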

What you get

  • A rigged animation clip on Uthana’s default bipedal character
  • Standard formats: FBX and GLB
  • Full skeletal motion data, retargetable to any humanoid rig
  • Optional foot IK via foot_ik for cleaner ground contact

Parameters

  • prompt (required): Natural-language description of the motion
  • foot_ik (default: false): Enable foot inverse kinematics for better ground contact

That’s it — two parameters. The simplicity is intentional. text-to-motion-vqvae-v1 is designed to produce good results from a prompt alone, without requiring you to tune generation settings.

Who this is for

  • Game developers building out NPC animation libraries, moveset prototypes, or idle/locomotion sets at volume
  • Indie studios that need reliable animation generation without per-clip tuning
  • Tool builders integrating text-to-animation into products where speed and API simplicity matter
  • Anyone new to AI motion generation who wants a fast, low-friction starting point

Intended use

This model is designed for generating character animation in games, film previs, VR/AR, interactive experiences, and research. It fits into existing pipelines as a fast, dependable source of animation clips — use it to draft motions quickly, populate animation libraries, or prototype ideas before refining in your DCC.

Limitations

  • Bipedal only. Generates motion for humanoid characters on a standard bipedal skeleton. Quadrupeds and non-human rigs are out of scope.
  • Short clips. Typical output is 2–6 seconds per generation. For longer sequences, generate clips and stitch them — or use Uthana’s full platform for blending and stitching tools (a toy stitching sketch follows this list).
  • Simpler prompts perform best. v1 handles single actions and short sequences well. For complex multi-step motions with specific timing, text-to-motion-diffusion-v2 may produce more coherent results.
  • No facial animation. Body motion only — face and lip sync are outside scope.
  • No scene physics. The model animates the character in isolation. It doesn’t simulate collisions, props, or ground contact with external objects.
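
On the generate-and-stitch point: the simplest form of stitching is a short crossfade between the end of one clip and the start of the next. A toy sketch over raw joint-rotation arrays, assuming matching skeletons and frame rates; production pipelines do this in the engine, a DCC, or Uthana's platform tools.

```python
# Toy crossfade stitch between two clips of shape (frames, joints, 3).
# Assumes matching skeletons and frame rates. Linearly blending rotation
# channels is a simplification; production tools blend quaternions and
# align root motion, which this sketch does not do.
import numpy as np

def stitch(clip_a, clip_b, overlap=10):
    """Crossfade the last `overlap` frames of clip_a into the first
    `overlap` frames of clip_b and concatenate the rest."""
    w = np.linspace(0.0, 1.0, overlap)[:, None, None]          # blend weights
    blended = (1.0 - w) * clip_a[-overlap:] + w * clip_b[:overlap]
    return np.concatenate([clip_a[:-overlap], blended, clip_b[overlap:]])

walk = np.zeros((120, 24, 3))     # placeholder clips
wave = np.ones((90, 24, 3))
combined = stitch(walk, wave, overlap=15)
print(combined.shape)             # (195, 24, 3)
```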

How to use it

Via the Replicate playground

Enter a prompt and run. Toggle foot_ik on if you want cleaner ground contact on locomotion clips.

Via the API

See the API tab for code examples in Python, Node.js, and HTTP.
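
For reference, a minimal Python call looks like the sketch below. The output schema (single file vs. list, FBX vs. GLB, URL vs. file object) isn't documented on this page, so the download handling is an assumption; the API tab has the authoritative examples.

```python
# Minimal Python sketch: generate a clip and save it locally. Treating the
# output as one or more downloadable URLs is an assumption; check the API
# tab for the authoritative schema.
import urllib.request
import replicate

output = replicate.run(
    "uthana/text-to-motion-vqvae-v1",
    input={"prompt": "A person performs a hero pose, hands on hips",
           "foot_ik": True},
)

urls = output if isinstance(output, (list, tuple)) else [output]
for i, url in enumerate(urls):
    urllib.request.urlretrieve(str(url), f"hero_pose_{i}.glb")
```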

In your engine

The output targets a standard bipedal skeleton. To apply it to your own character:

  • Unity: Import the FBX, use Humanoid rig retargeting
  • Unreal: Import the FBX, retarget via IK Retargeter
  • Blender: Import GLB or FBX, use built-in retargeting or Auto-Rig Pro (a scripted import sketch follows this list)
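
For Blender, the import step can be scripted. A minimal bpy sketch assuming the clip was saved next to the .blend file as clip.glb or clip.fbx; the retargeting itself (built-in or Auto-Rig Pro) stays in the UI.

```python
# Minimal Blender (bpy) sketch: import a generated clip so its animation
# can be retargeted onto your own rig. Assumes "clip.glb" / "clip.fbx"
# exist on disk; retargeting is done afterwards in the UI or Auto-Rig Pro.
import bpy

bpy.ops.import_scene.gltf(filepath="clip.glb")     # for the GLB export
# bpy.ops.import_scene.fbx(filepath="clip.fbx")    # or the FBX export

# The imported armature carries the clip as an action ready for retargeting.
print([obj.name for obj in bpy.context.selected_objects])
```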

For native retargeting to your own rigs, batch generation, or pipeline integration, see Uthana’s API docs.

More from Uthana

Part of Uthana’s AI animation suite on Replicate, alongside text-to-motion-diffusion-v2.

For the full platform — 100,000+ motion library, DCC plugins, runtime SDKs, stitching, blending, looping, and enterprise features — visit uthana.com.

About Uthana

Uthana builds AI foundation models for human motion. The platform makes studio-quality character animation accessible to any team.
