lucataco / diffusion-motion-transfer

Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer

Run time and cost

This model costs approximately $0.86 to run on Replicate, or about 1 run per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia A100 (80GB) GPU hardware. Predictions typically complete within 11 minutes.

Readme

Implementation of diffusion-motion-transfer

About

Introducing a zero-shot method for transferring motion across objects and scenes, without any training or fine-tuning.

We present a new method for text-driven motion transfer – synthesizing a video that complies with an input text prompt describing the target objects and scene while maintaining an input video’s motion and scene layout. Prior methods are confined to transferring motion across two subjects within the same or closely related object categories and are applicable for limited domains (e.g., humans). In this work, we consider a significantly more challenging setting in which the target and source objects differ drastically in shape and fine-grained motion characteristics (e.g., translating a jumping dog into a dolphin). To this end, we leverage a pre-trained and fixed text-to-video diffusion model, which provides us with generative and motion priors. The pillar of our method is a new space-time feature loss derived directly from the model. This loss guides the generation process to preserve the overall motion of the input video while complying with the target object in terms of shape and fine-grained motion traits.
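
To make the guidance idea concrete, here is a rough PyTorch sketch of a single feature-guided denoising step. It assumes a diffusers-style UNet and scheduler, source features precomputed from the input video, and a hypothetical helper get_features that reads space-time features cached by forward hooks on the UNet; it is a simplified illustration, not the authors' implementation.

import torch
import torch.nn.functional as F

def guided_denoise_step(unet, scheduler, latents, t, text_emb,
                        src_features, get_features, lr=0.005):
    # Build a graph w.r.t. the current latents only.
    latents = latents.detach().requires_grad_(True)
    with torch.enable_grad():
        _ = unet(latents, t, encoder_hidden_states=text_emb).sample
        # Hypothetical: read the space-time features cached by forward hooks.
        loss = F.mse_loss(get_features(unet), src_features)
        grad = torch.autograd.grad(loss, latents)[0]
    # Nudge the latents toward the source video's motion features.
    latents = (latents - lr * grad).detach()

    # Standard denoising step on the updated latents.
    with torch.no_grad():
        noise_pred = unet(latents, t, encoder_hidden_states=text_emb).sample
    return scheduler.step(noise_pred, t, latents).prev_sample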

Motion Transfer

  • Our method is designed for transferring motion across objects and scenes
  • Our method is based on the ZeroScope text-to-video model, so it edits videos of 24 frames.
  • In some cases the combination of target object and input-video motion is out of distribution for the T2V model, which can lead to visual artifacts in the generated video. It may be necessary to sample several seeds (see the sketch after this list).
  • The method was tested on a single NVIDIA A40 (48GB), uses ~32GB of video memory, and takes approximately 7 minutes per video.
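
When a target prompt lands out of distribution, a small seed sweep through the hosted model is often the quickest fix. The sketch below uses the Replicate Python client; the input field names (video, prompt, seed) and the unpinned model reference are assumptions and may not match this model's actual schema.

import replicate

for seed in (0, 17, 42):
    output = replicate.run(
        "lucataco/diffusion-motion-transfer",  # pin a specific version in practice
        input={
            "video": open("input.mp4", "rb"),  # assumed input field names
            "prompt": "Amazing quality, masterpiece, a dolphin jumping over waves",
            "seed": seed,
        },
    )
    print(f"seed {seed}: {output}")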

Tips

  • To get better samples from the T2V model, we used the prefix text “Amazing quality, masterpiece, ” for inversion and edits.
  • If the video contains more complex motion or small objects, try increasing the number of optimization steps, e.g., optimization_step: 30.
  • For a large deviation in structure between the source and target objects, try using a lower lr, e.g., scale_range: [0.005, 0.002], or add the source object to the negative prompt text (see the config sketch after this list).
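
The tips above can be collected into a small set of config overrides. The sketch below is a plain Python dict; optimization_step and scale_range are the names quoted in this readme, while prompt_prefix and negative_prompt are assumed keys that may not match the repository's actual config schema.

# Hypothetical override dict for the tips above; key names are assumptions
# except where quoted in this readme.
config_overrides = {
    "optimization_step": 30,                    # more steps for complex motion / small objects
    "scale_range": [0.005, 0.002],              # lower lr for large structural deviations
    "prompt_prefix": "Amazing quality, masterpiece, ",  # assumed key for the prefix text
    "negative_prompt": "a dog",                 # assumed key; name the source object here
}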

Citation

@article{yatim2023spacetime,
  title   = {Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer},
  author  = {Yatim, Danah and Fridman, Rafail and Bar-Tal, Omer and Kasten, Yoni and Dekel, Tali},
  journal = {arXiv preprint arXiv:2311.17009},
  year    = {2023}
}