visionaix/metric3dv2

Metric3D v2 (TPAMI 2024): Monocular metric depth and surface normals from a single image. Predicts real-world depth in meters. Works indoor and outdoor.

Public
12 runs

Run time and cost

This model runs on Nvidia T4 GPU hardware. We don't yet have enough runs of this model to provide performance information.

Readme

Metric3D v2 on Replicate

Monocular Metric Depth Estimation and Surface Normals

Paper: TPAMI 2024 | Authors: Yin et al. | Code: github.com/YvanYin/Metric3D

Metric3D v2 predicts metric depth (in real-world meters) and surface normals from a single image. Unlike relative depth models, it outputs actual distances — directly usable for 3D reconstruction, AR, and robotics.

Works on both indoor scenes (rooms, hallways, offices) and outdoor environments (streets, buildings, landscapes).


Default Example

Default input image:

https://cdn.sanity.io/images/k55su7ch/production2/d9e35a73891d43ccb0bc665bf2e0d5d9d6f1ea2b-4200x2363.jpg?w=1920&q=75&auto=format
import replicate

output = replicate.run("visionaix/metric3dv2", input={
    "image": "https://cdn.sanity.io/images/k55su7ch/production2/d9e35a73891d43ccb0bc665bf2e0d5d9d6f1ea2b-4200x2363.jpg?w=1920&q=75&auto=format",
    "max_depth": 20,
    "return_normals": True,
    "return_visualization": True,
})

Default output (indoor scene, max_depth=20m):

{
  "image": { "width": 1920, "height": 1080 },
  "depth": {
    "min_m": 2.56,
    "max_m": 20.0,
    "mean_m": 8.16,
    "median_m": 7.16
  },
  "settings": {
    "focal_length_px": 2304.0,
    "focal_estimated": true,
    "max_depth_m": 20.0
  },
  "performance": {
    "inference_time_s": 1.0,
    "device": "cuda"
  }
}
| Region | Depth | Notes |
|---|---|---|
| Couch (foreground) | ~3 m | Nearest furniture |
| Room center | ~7 m | Median depth |
| Back wall | ~12-13 m | Farthest structure |
| Surface normals | RGB-encoded | Green = right-facing walls, magenta = left-facing |
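The calibration JSON can be consumed directly in Python. A minimal sketch, using the field names from the example output above (verify them against your actual response; `depth_range_m` is a hypothetical helper, not part of the model's API):

```python
# Summarize depth statistics from the calibration JSON returned by the model.
# This dict mirrors the example output shown above.
calibration = {
    "image": {"width": 1920, "height": 1080},
    "depth": {"min_m": 2.56, "max_m": 20.0, "mean_m": 8.16, "median_m": 7.16},
    "settings": {"focal_length_px": 2304.0, "focal_estimated": True, "max_depth_m": 20.0},
}

def depth_range_m(cal):
    """Return (min, max) metric depth and whether the max hit the max_depth clamp."""
    d = cal["depth"]
    clipped = d["max_m"] >= cal["settings"]["max_depth_m"]
    return d["min_m"], d["max_m"], clipped

lo, hi, clipped = depth_range_m(calibration)
print(f"depth spans {lo:.2f}-{hi:.2f} m (clamped: {clipped})")
```

A clamped maximum (as in this example, where max_m equals max_depth_m) usually means the scene contains structure beyond the clamp, so consider rerunning with a larger max_depth.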

Inputs

| Parameter | Type | Default | Description |
|---|---|---|---|
| image | File | required | Input image (JPEG, PNG, WebP) |
| focal_length_px | Float | 0 | Camera focal length in pixels (0 = auto-estimate) |
| max_depth | Float | 200 | Maximum depth clamp in meters (20 indoor, 80-300 outdoor) |
| return_normals | Boolean | true | Return surface normals as RGB |
| return_visualization | Boolean | true | Generate 6-panel diagnostic image |

Outputs

  • depth_colorized — Turbo-colorized depth map (blue=near, red=far)
  • normals — Surface normals as RGB (R=X, G=Y, B=Z)
  • visualization — 6-panel: input, depth, normals, confidence, histogram, summary
  • calibration_json — Full metadata (depth stats, settings, timing)

How It Works

The model operates in a canonical camera space (focal=1000px). After inference, depth is scaled by real_focal / 1000 to produce metric values. Architecture: DINOv2-reg ViT-Small backbone + RAFT depth-normal decoder.
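The rescaling step described above is a one-line transform. A sketch, assuming the canonical-space depth map is available as a NumPy array (the array values here are illustrative):

```python
import numpy as np

CANONICAL_FOCAL_PX = 1000.0  # canonical camera space used by Metric3D

def to_metric_depth(canonical_depth, real_focal_px):
    """Scale canonical-space depth to metric meters: depth * (real_focal / 1000)."""
    return canonical_depth * (real_focal_px / CANONICAL_FOCAL_PX)

# Example: focal length 2304 px, as in the default run above.
canonical = np.array([[1.0, 2.0], [3.0, 4.0]])
metric = to_metric_depth(canonical, 2304.0)  # each value scaled by 2.304
```

This is why a correct focal length matters: depth scales linearly with it, so a 10% focal error yields a 10% metric-depth error.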


Indoor vs. Outdoor

Indoor (max_depth=20): Rooms, hallways, offices — wall/floor/ceiling cues give precise metric depth

Outdoor (max_depth=80-300): Streets, buildings, landscapes — set higher max_depth for distant structures
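One way to encode this guidance is a small helper mapping a coarse scene type to a max_depth. The cutoffs below are the ones suggested in this section, not values baked into the model, and the scene labels are our own:

```python
def suggested_max_depth(scene: str) -> float:
    """Map a coarse scene type to the max_depth (meters) suggested above."""
    cutoffs = {
        "indoor": 20.0,     # rooms, hallways, offices
        "street": 80.0,     # urban outdoor scenes
        "landscape": 300.0, # distant buildings and terrain
    }
    return cutoffs[scene]

print(suggested_max_depth("indoor"))  # 20.0
```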


Advanced Usage

Known Focal Length

output = replicate.run("visionaix/metric3dv2", input={
    "image": "photo.jpg",
    "focal_length_px": 1500.0,
})
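If you know the lens focal length in millimeters rather than pixels, the standard pinhole conversion is focal_px = focal_mm × image_width_px / sensor_width_mm. A sketch (the camera numbers are illustrative, not tied to this model):

```python
def focal_mm_to_px(focal_mm, image_width_px, sensor_width_mm):
    """Convert a physical focal length to pixels via the pinhole camera model."""
    return focal_mm * image_width_px / sensor_width_mm

# Illustrative full-frame camera: 35 mm lens, 36 mm sensor width, 4200 px wide image
print(focal_mm_to_px(35.0, 4200, 36.0))  # ~4083.3 px
```

EXIF metadata often carries the focal length in mm (and sometimes the 35mm-equivalent), so this conversion lets you pass focal_length_px instead of relying on auto-estimation.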

Batch Processing

from concurrent.futures import ThreadPoolExecutor
import replicate

def predict(url):
    return replicate.run("visionaix/metric3dv2", input={
        "image": url, "return_visualization": False, "return_normals": False,
    })

urls = ["img1.jpg", "img2.jpg", "img3.jpg"]
with ThreadPoolExecutor(4) as ex:
    results = list(ex.map(predict, urls))

Citation

@article{yin2023metric3d,
  title={Metric3D v2: A Versatile Monocular Geometric Foundation Model},
  author={Yin, Wei and others},
  journal={IEEE TPAMI},
  year={2024},
}

License

Wraps Metric3D. See original license.
