tiger-ai-lab / anyv2v

Tuning-free framework to achieve high appearance and temporal consistency in video editing

  • Public
  • 975 runs
  • L40S
  • GitHub
  • Paper
  • License

Input

file (required)

Input video

file

Provide the edited first frame of the input video. This is optional; leave it blank and provide the prompt below to use the default pipeline, which edits the first frame with instruct-pix2pix.

string

The first step involves using timbrooks/instruct-pix2pix to edit the first frame. Specify the prompt for editing the first frame. This will be ignored if edited_first_frame above is provided.

Default: "turn man into robot"

string

Describe the input video

Default: "a man doing exercises for the body and mind"

string

Things you do not want to see in the edited video

Default: "Distorted, discontinuous, Ugly, blurry, low resolution, motionless, static, disfigured, disconnected limbs, Ugly faces, incomplete arms"

integer
(minimum: 1, maximum: 500)

Number of denoising steps

Default: 50

number
(minimum: 1, maximum: 20)

Scale for classifier-free guidance

Default: 9
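
For reference, classifier-free guidance combines the unconditional and prompt-conditioned noise predictions at each denoising step roughly as sketched below; this is a schematic, not the model's exact code:

```python
# Schematic classifier-free guidance step. With guidance_scale = 1 the prompt
# carries no extra weight; larger values push the sample harder toward the prompt.
def apply_cfg(noise_uncond, noise_text, guidance_scale=9.0):
    return noise_uncond + guidance_scale * (noise_text - noise_uncond)
```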

number
(minimum: 0, maximum: 1)

Specifies the proportion of time steps in the DDIM sampling process where the convolutional injection is applied. A higher value improves motion consistency. 1.0 indicates injection at every time step

Default: 1

number
(minimum: 0, maximum: 1)

Specifies the proportion of time steps in the DDIM sampling process where the spatial attention injection is applied. A higher value improves motion consistency. 1.0 indicates injection at every time step

Default: 1

number
(minimum: 0, maximum: 1)

Specifies the proportion of time steps in the DDIM sampling process where the temporal attention injection is applied. A higher value improves motion consistency. 1.0 indicates injection at every time step

Default: 1
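
The three proportions above are typically interpreted as "inject during the first fraction of the sampling steps". A schematic of how such a threshold could map onto a 50-step schedule (the exact schedule inside AnyV2V may differ):

```python
# Schematic: a proportion p in [0, 1] enables injection for the first
# round(p * num_steps) of num_steps DDIM sampling steps.
def injection_mask(proportion, num_steps=50):
    n_inject = round(proportion * num_steps)
    return [step < n_inject for step in range(num_steps)]

print(sum(injection_mask(1.0)))  # 50 -> injection at every step (the default)
print(sum(injection_mask(0.5)))  # 25 -> injection only during the first half
```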

integer
(minimum: 0)

This parameter determines the time step index at which to begin sampling from the initial DDIM-inverted latents, with a range of [0, num_inference_steps-1]. In a DDIM sampling process with 50 sampling steps, the scheduler progresses through the time steps in the sequence [981, 961, 941, ..., 1]. Setting ddim_init_latents_t_idx to 0 therefore initiates sampling from t=981, whereas setting it to 1 starts the process at t=961. A higher index enhances motion consistency with the source video but may lead to flickering and cause the edited video to diverge from the edited first frame.

Default: 0
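
As a worked example of the index-to-timestep mapping described above (assuming the usual 1000-step training schedule and 50 DDIM sampling steps):

```python
# Reconstruct the timestep sequence [981, 961, ..., 1] mentioned above.
num_train_timesteps = 1000   # assumed training schedule length
num_inference_steps = 50
stride = num_train_timesteps // num_inference_steps  # 20
timesteps = [num_train_timesteps - stride * (i + 1) + 1
             for i in range(num_inference_steps)]

print(timesteps[:3])   # [981, 961, 941]
print(timesteps[0])    # ddim_init_latents_t_idx = 0 -> sampling starts at t=981
print(timesteps[1])    # ddim_init_latents_t_idx = 1 -> sampling starts at t=961
```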

integer

Number of DDIM inversion steps

Default: 100

integer

Random seed. Leave blank to randomize the seed
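
Putting the inputs together, a call through the Replicate Python client might look like the sketch below. Only edited_first_frame, num_inference_steps, and ddim_init_latents_t_idx are named explicitly in the descriptions above; the other input keys here are assumptions, so check the model's API schema for the exact field names:

```python
import replicate

output = replicate.run(
    "tiger-ai-lab/anyv2v",
    input={
        "video": open("input.mp4", "rb"),  # assumed key for the input video
        "prompt": "turn man into robot",   # assumed key for the first-frame edit prompt
        "video_description": "a man doing exercises for the body and mind",  # assumed key
        "num_inference_steps": 50,
        "ddim_init_latents_t_idx": 0,
    },
)
print(output)
```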


Run time and cost

This model runs on Nvidia L40S GPU hardware. We don't yet have enough runs of this model to provide performance information.

Readme

AnyV2V

Introduction

AnyV2V is a tuning-free framework to achieve high appearance and temporal consistency in video editing.

  • Can seamlessly build on top of advanced image editing methods to perform diverse types of editing
  • Robust performance on four tasks:
      • prompt-based editing
      • reference-based style transfer
      • subject-driven editing
      • identity manipulation

🖊️ Citation

Please kindly cite our paper if you use our code, data, models or results:

```bibtex
@article{ku2024anyv2v,
  title={AnyV2V: A Plug-and-Play Framework For Any Video-to-Video Editing Tasks},
  author={Ku, Max and Wei, Cong and Ren, Weiming and Yang, Huan and Chen, Wenhu},
  journal={arXiv preprint arXiv:2403.14468},
  year={2024}
}
```