visionaix/metric3dv2

Metric3D v2 (TPAMI 2024): Monocular metric depth and surface normals from a single image. Predicts real-world depth in meters and works both indoors and outdoors.


Metric3D v2 on Replicate

Monocular Metric Depth Estimation and Surface Normals

Paper: TPAMI 2024 | Authors: Yin et al. | Code: github.com/YvanYin/Metric3D

Metric3D v2 predicts metric depth (in real-world meters) and surface normals from a single image. Unlike relative depth models, it outputs actual distances — directly usable for 3D reconstruction, AR, and robotics.

Works on both indoor scenes (rooms, hallways, offices) and outdoor environments (streets, buildings, landscapes).
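Because the output is metric, a pixel plus its depth value can be back-projected to a 3D point with the standard pinhole camera model. A minimal sketch in pure Python (`unproject` is an illustrative helper, not part of the model's output):

```python
def unproject(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth into camera-space XYZ.

    Standard pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)

# Example: the image-center pixel at 7.16 m (the median depth in the sample
# output below), using the auto-estimated focal length of 2304 px.
point = unproject(960, 540, 7.16, fx=2304.0, fy=2304.0, cx=960.0, cy=540.0)
print(point)  # (0.0, 0.0, 7.16)
```

Applying this per pixel over the whole depth map yields a point cloud, which is the basis for the 3D-reconstruction use case mentioned above.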


Default Example

Default input image:

https://cdn.sanity.io/images/k55su7ch/production2/d9e35a73891d43ccb0bc665bf2e0d5d9d6f1ea2b-4200x2363.jpg?w=1920&q=75&auto=format

import replicate

output = replicate.run("visionaix/metric3dv2", input={
    "image": "https://cdn.sanity.io/images/k55su7ch/production2/d9e35a73891d43ccb0bc665bf2e0d5d9d6f1ea2b-4200x2363.jpg?w=1920&q=75&auto=format",
    "max_depth": 20,
    "return_normals": True,
    "return_visualization": True,
})

Default output (indoor scene, max_depth=20m):

{
  "image": { "width": 1920, "height": 1080 },
  "depth": {
    "min_m": 2.56,
    "max_m": 20.0,
    "mean_m": 8.16,
    "median_m": 7.16
  },
  "settings": {
    "focal_length_px": 2304.0,
    "focal_estimated": true,
    "max_depth_m": 20.0
  },
  "performance": {
    "inference_time_s": 1.0,
    "device": "cuda"
  }
}
| Region | Depth | Notes |
|---|---|---|
| Couch (foreground) | ~3 m | Nearest furniture |
| Room center | ~7 m | Median depth |
| Back wall | ~12-13 m | Farthest structure |
| Surface normals | RGB-encoded | Green = right-facing walls, magenta = left-facing |
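The `depth` block in the JSON above is just summary statistics over the (clamped) depth map. A sketch of how such fields could be derived from a flattened list of per-pixel depths, using only the standard library (field names mirror the sample output; this is an illustration, not the model's actual code):

```python
import statistics

def depth_stats(depths_m, max_depth_m=20.0):
    """Clamp per-pixel depths at max_depth_m and summarize them like the
    `depth` block of the calibration JSON."""
    clamped = [min(d, max_depth_m) for d in depths_m]
    return {
        "min_m": round(min(clamped), 2),
        "max_m": round(max(clamped), 2),
        "mean_m": round(statistics.fmean(clamped), 2),
        "median_m": round(statistics.median(clamped), 2),
    }

print(depth_stats([2.56, 7.16, 8.2, 25.0]))
# {'min_m': 2.56, 'max_m': 20.0, 'mean_m': 9.48, 'median_m': 7.68}
```

Note how the 25.0 m reading is clamped to the 20 m ceiling before the statistics are computed, which is why `max_m` equals `max_depth_m` in the sample output.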

Inputs

| Parameter | Type | Default | Description |
|---|---|---|---|
| image | File | required | Input image (JPEG, PNG, WebP) |
| focal_length_px | Float | 0 | Camera focal length in pixels (0 = auto-estimate) |
| max_depth | Float | 200 | Max depth clamp in meters (20 indoor, 80-300 outdoor) |
| return_normals | Boolean | true | Return surface normals as RGB |
| return_visualization | Boolean | true | Generate 6-panel diagnostic image |

Outputs

  • depth_colorized — Turbo-colorized depth map (blue=near, red=far)
  • normals — Surface normals as RGB (R=X, G=Y, B=Z)
  • visualization — 6-panel: input, depth, normals, confidence, histogram, summary
  • calibration_json — Full metadata (depth stats, settings, timing)
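The normals output packs each unit normal into an RGB pixel. Assuming the common encoding `channel = (component + 1) * 127.5` (an assumption about this model's mapping; verify against a surface of known orientation before relying on it), decoding is a one-liner:

```python
def decode_normal(r, g, b):
    """Map an 8-bit RGB triplet back to a normal vector in [-1, 1]^3.

    Assumes the common encoding channel = (component + 1) * 127.5.
    """
    return tuple(c / 127.5 - 1.0 for c in (r, g, b))

# A near-pure-blue pixel decodes to a normal pointing along +Z (toward or away
# from the camera, depending on the axis convention):
print(decode_normal(127, 127, 255))
```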

How It Works

The model operates in a canonical camera space (focal=1000px). After inference, depth is scaled by real_focal / 1000 to produce metric values. Architecture: DINOv2-reg ViT-Small backbone + RAFT depth-normal decoder.
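In code, the canonical-to-metric recovery described above is a single scale factor. A sketch (the scaling rule comes from the description above; the function name is illustrative):

```python
CANONICAL_FOCAL_PX = 1000.0  # canonical camera space used during inference

def to_metric_depth(canonical_depth_m, real_focal_px):
    """Rescale canonical-space depth to metric depth for the actual camera."""
    return canonical_depth_m * real_focal_px / CANONICAL_FOCAL_PX

# With the auto-estimated focal from the sample output (2304 px), a canonical
# depth of 3.0 becomes:
print(to_metric_depth(3.0, 2304.0))  # 6.912
```

This is why a good focal length matters: an error in `real_focal_px` scales every depth value proportionally.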


Indoor vs. Outdoor

Indoor (max_depth=20): Rooms, hallways, offices — wall/floor/ceiling cues give precise metric depth

Outdoor (max_depth=80-300): Streets, buildings, landscapes — set higher max_depth for distant structures
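If you process mixed scenes, a small helper can keep these `max_depth` presets in one place. The preset values follow the guidance above (outdoor uses a mid-range value; raise it toward 300 for landscapes with very distant structures); the helper itself is hypothetical, not part of the API:

```python
# Suggested max_depth presets in meters, per the indoor/outdoor guidance above.
MAX_DEPTH_PRESETS = {"indoor": 20.0, "outdoor": 120.0}

def build_input(image_url, scene="indoor"):
    """Assemble an input dict for replicate.run with a scene-appropriate clamp."""
    return {"image": image_url, "max_depth": MAX_DEPTH_PRESETS[scene]}

print(build_input("street.jpg", scene="outdoor"))
# {'image': 'street.jpg', 'max_depth': 120.0}
```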


Advanced Usage

Known Focal Length

output = replicate.run("visionaix/metric3dv2", input={
    "image": "photo.jpg",
    "focal_length_px": 1500.0,
})

Batch Processing

from concurrent.futures import ThreadPoolExecutor
import replicate

def predict(url):
    # Skip the normals and visualization outputs to reduce per-image latency.
    return replicate.run("visionaix/metric3dv2", input={
        "image": url, "return_visualization": False, "return_normals": False,
    })

urls = ["img1.jpg", "img2.jpg", "img3.jpg"]
with ThreadPoolExecutor(max_workers=4) as ex:
    results = list(ex.map(predict, urls))

Citation

@article{yin2023metric3d,
  title={Metric3D v2: A Versatile Monocular Geometric Foundation Model},
  author={Yin, Wei and others},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2024},
}

License

This model wraps Metric3D; see the upstream repository (github.com/YvanYin/Metric3D) for the original license terms.
