bytedance/omni-human-1.5

A film-grade digital human model that generates realistic video from a single image, audio clip, and optional text prompt.

OmniHuman 1.5

OmniHuman 1.5 produces character-driven video by combining an input image, audio, and optional prompt text. Compared to earlier versions, it adds:

  • Support for text prompts.
  • Unrestricted camera and character motion.
  • Audio-aware action generation for logical and expressive video behavior.

Capabilities

  • Audio comprehension – character behavior and expressions follow audio semantics.
  • Camera and character control – supports multiple, sequential actions and free camera movement.
  • Emotion performance – recognizes and performs nuanced emotions and micro-expressions.
  • Multi-character scenes – specify who speaks and manage background reactions.
  • Diverse subjects – supports humans, animals, and stylized or animated characters.
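If you are calling the model programmatically, a minimal sketch using the Replicate Python client might look like the following. The input field names (`image`, `audio`, `prompt`) are assumptions inferred from the description above, not confirmed names — check the model's API schema on Replicate for the exact input keys.

```python
# Sketch of calling bytedance/omni-human-1.5 via the Replicate Python client.
# NOTE: the input keys below ("image", "audio", "prompt") are assumptions;
# consult the model's schema on Replicate for the real field names.

def build_input(image_url: str, audio_url: str, prompt: str = "") -> dict:
    """Assemble the input payload: one image, one audio clip, optional prompt."""
    payload = {
        "image": image_url,  # single reference image of the character
        "audio": audio_url,  # driving audio clip (speech or music)
    }
    if prompt:
        payload["prompt"] = prompt  # optional text prompt
    return payload


def generate(image_url: str, audio_url: str, prompt: str = ""):
    """Run the model (requires REPLICATE_API_TOKEN in the environment)."""
    import replicate  # third-party client: pip install replicate

    return replicate.run(
        "bytedance/omni-human-1.5",
        input=build_input(image_url, audio_url, prompt),
    )
```

For example, `generate("https://example.com/face.png", "https://example.com/line.wav", "She smiles and talks to the camera.")` would return the generated video output.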

Typical Use Cases

| Scenario | Description |
| --- | --- |
| Film & TV / Short Video | Character dialogue, dramatic and emotional scenes, narrative shots. |
| Fantasy Vlog | Realistic or surreal selfie-style recordings with controllable events and dynamics. |
| AI Music Video | Rhythm-driven actions, expressive camera motion, music-emotion alignment. |
| UGC / Creative | Stylized or non-human avatars, pixel-style content, creative virtual scenes. |

Prompt Writing Guide

Core principles

  • Write prompts as short, natural storylines.
  • Focus on dynamic actions, not static attributes already in the image.
  • Use clear, step-by-step, non-contradictory language.

Recommended structure

[Camera movement] + [Emotion] + [Speaking state] + [Specific actions] + [Optional background actions]

Example

> “The camera slowly moves from the side to a medium front shot.
> A young woman sits by the window, calm, smiling as she talks to the camera.
> A boy beside her looks at her, then turns to the camera and smiles.”
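The recommended structure can be treated as a simple template. The helper below is an illustrative sketch for assembling a prompt from the five components, skipping any you leave empty:

```python
def build_prompt(
    camera: str = "",
    emotion: str = "",
    speaking: str = "",
    actions: str = "",
    background: str = "",
) -> str:
    """Join the recommended prompt components in order, one sentence each.

    Empty components are skipped; each non-empty part is normalized to
    end with a single period.
    """
    parts = [camera, emotion, speaking, actions, background]
    return " ".join(p.strip().rstrip(".") + "." for p in parts if p.strip())
```

For example, `build_prompt(camera="The camera slowly zooms in", actions="She waves")` yields `"The camera slowly zooms in. She waves."`.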

Tips

  • Include verbs like talks or sings to improve lip-sync.
  • Use sequence words (first, then) for multi-step actions.
  • Avoid keeping the subject out of frame for long stretches (this may break continuity).
  • High-resolution, clear input images yield better results.