chenxwh / depth-any-video

Depth Any Video with Scalable Synthetic Data

  • Public
  • 162 runs
  • A100 (80GB)
  • GitHub
  • Paper
  • License

Input

file (required)
Input image or video.

boolean
Specify whether the input is a video.
Default: true

integer
Number of denoising steps; 1-3 steps work fine.
Default: 3

integer
Number of frames to infer per forward pass; this should be an even number.
Default: 32

integer
Number of frames to decode per forward pass.
Default: 16

integer
Number of frames for inpaint inference.
Default: 16

integer
Number of frames to overlap between consecutive windows.
Default: 6

integer
Maximum resolution for inference.
Default: 1024

integer
Random seed. Leave blank to randomize the seed.
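
This page lists only the types of the inputs, so as a sketch only, here is how a run might look through Replicate's Python client. Every input key below (video, is_video, denoise_steps, num_frames, decode_chunk_size, num_interp_frames, num_overlap_frames, max_resolution, seed) is a hypothetical name matched to the descriptions above, not taken from this page; check the playground or the model's API schema for the actual names.

import replicate

# Minimal sketch. NOTE: the input key names below are hypothetical,
# matched to the parameter descriptions above; verify them against
# the model's API schema before use.
output = replicate.run(
    "chenxwh/depth-any-video",
    input={
        "video": open("input.mp4", "rb"),  # input image or video (required)
        "is_video": True,                  # whether the input is a video
        "denoise_steps": 3,                # 1-3 steps work fine
        "num_frames": 32,                  # frames per forward pass; even number
        "decode_chunk_size": 16,           # frames to decode per forward pass
        "num_interp_frames": 16,           # frames for inpaint inference
        "num_overlap_frames": 6,           # frames shared between windows
        "max_resolution": 1024,            # maximum inference resolution
        "seed": 42,                        # omit to randomize
    },
)
print(output)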

Run time and cost

This model runs on NVIDIA A100 (80GB) GPU hardware. We don't yet have enough runs of this model to provide performance information.

Readme

Depth Any Video with Scalable Synthetic Data

Depth Any Video introduces a scalable synthetic data pipeline that captures 40,000 video clips from diverse games, and it leverages the strong priors of generative video diffusion models to advance video depth estimation. By incorporating rotary position encoding, flow matching, and a mixed-duration training strategy, it robustly handles videos of varying lengths and frame rates. A novel depth interpolation method additionally enables high-resolution depth inference, achieving better spatial accuracy and temporal consistency than previous models.
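
To make the windowed inference that the num_frames and num_overlap_frames inputs control more concrete, the sketch below is an illustrative toy, not the authors' implementation: it splits a long clip into fixed-size windows that share a few frames and averages the predictions on the shared frames when stitching the windows back together. The model callable and array layout are assumptions for illustration.

import numpy as np

def window_starts(total, num_frames, num_overlap):
    """Start indices so consecutive windows share `num_overlap` frames."""
    stride = num_frames - num_overlap
    starts = list(range(0, max(total - num_frames, 0) + 1, stride))
    if starts[-1] + num_frames < total:  # add one last window to cover the tail
        starts.append(total - num_frames)
    return starts

def infer_depth(frames, model, num_frames=32, num_overlap=6):
    """Overlapped sliding-window depth inference (illustrative only).

    frames: ndarray of shape (T, H, W, 3); model: placeholder callable
    mapping (t, H, W, 3) frames to (t, H, W) depth maps.
    """
    total = len(frames)
    depth_sum = np.zeros((total, *frames.shape[1:-1]), dtype=np.float32)
    counts = np.zeros(total, dtype=np.float32)
    for s in window_starts(total, num_frames, num_overlap):
        window = frames[s : s + num_frames]
        depth_sum[s : s + num_frames] += model(window)  # per-window depth maps
        counts[s : s + num_frames] += 1.0
    return depth_sum / counts[:, None, None]  # average predictions on overlaps

Averaging the overlapping frames is one simple way to smooth seams between windows; the actual model may blend or align windows differently.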

Citation

If you find our work useful, please cite:

@article{yang2024depthanyvideo,
  author    = {Honghui Yang and Di Huang and Wei Yin and Chunhua Shen and Haifeng Liu and Xiaofei He and Binbin Lin and Wanli Ouyang and Tong He},
  title     = {Depth Any Video with Scalable Synthetic Data},
  journal   = {arXiv preprint arXiv:2410.10815},
  year      = {2024}
}