chenxwh / depth-any-video

Depth Any Video with Scalable Synthetic Data

Depth Any Video introduces a scalable synthetic data pipeline that captures 40,000 video clips from diverse games, and leverages the powerful priors of generative video diffusion models to advance video depth estimation. By incorporating rotary position encoding, flow matching, and a mixed-duration training strategy, it robustly handles varying video lengths and frame rates. Additionally, a novel depth interpolation method enables high-resolution depth inference, achieving superior spatial accuracy and temporal consistency over previous models.
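
As a quick-start reference, below is a minimal sketch of calling this model through the Replicate Python client. The input field name (video) and the output handling are assumptions, since the exact input schema is not listed on this page; check the model's API tab for the authoritative parameters.

import replicate

# Runs the latest published version of the model; a pinned version hash can be
# appended as "chenxwh/depth-any-video:<version>" if reproducibility matters.
# NOTE: the "video" input name is an assumption, not confirmed by this page.
output = replicate.run(
    "chenxwh/depth-any-video",
    input={"video": open("input.mp4", "rb")},
)

# The output is typically a URL (or list of URLs) pointing to the predicted
# depth result; adjust handling to the actual output schema.
print(output)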

Demos

Citation

If you find our work useful, please cite:

@article{yang2024depthanyvideo,
  author    = {Honghui Yang and Di Huang and Wei Yin and Chunhua Shen and Haifeng Liu and Xiaofei He and Binbin Lin and Wanli Ouyang and Tong He},
  title     = {Depth Any Video with Scalable Synthetic Data},
  journal   = {arXiv preprint arXiv:2410.10815},
  year      = {2024}
}