# Metric3D v2 on Replicate

Monocular Metric Depth Estimation and Surface Normals

Paper: TPAMI 2024 | Authors: Yin et al. | Code: github.com/YvanYin/Metric3D
Metric3D v2 predicts metric depth (in real-world meters) and surface normals from a single image. Unlike relative depth models, it outputs actual distances — directly usable for 3D reconstruction, AR, and robotics.
Works on both indoor scenes (rooms, hallways, offices) and outdoor environments (streets, buildings, landscapes).
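Because the depth is metric, each pixel can be lifted to a 3D point with the standard pinhole model. A minimal sketch, assuming the depth map has been loaded as a NumPy array; `backproject` is an illustrative helper, not part of the model's API:

```python
import numpy as np

def backproject(depth, focal_px, cx=None, cy=None):
    """Back-project a metric depth map (meters) to a 3D point cloud
    using the pinhole model: X = (u - cx) * Z / f, Y = (v - cy) * Z / f."""
    h, w = depth.shape
    cx = w / 2 if cx is None else cx
    cy = h / 2 if cy is None else cy
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / focal_px
    y = (v - cy) * depth / focal_px
    return np.stack([x, y, depth], axis=-1)  # (H, W, 3), in meters

# A flat plane 5 m away, focal length 1000 px (toy example)
pts = backproject(np.full((4, 4), 5.0), focal_px=1000.0)
```

The principal point defaults to the image center here; pass `cx`/`cy` if your camera calibration says otherwise.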
## Default Example
Default input image:
https://cdn.sanity.io/images/k55su7ch/production2/d9e35a73891d43ccb0bc665bf2e0d5d9d6f1ea2b-4200x2363.jpg?w=1920&q=75&auto=format
```python
import replicate

output = replicate.run("visionaix/metric3dv2", input={
    "image": "https://cdn.sanity.io/images/k55su7ch/production2/d9e35a73891d43ccb0bc665bf2e0d5d9d6f1ea2b-4200x2363.jpg?w=1920&q=75&auto=format",
    "max_depth": 20,
    "return_normals": True,
    "return_visualization": True,
})
```
Default output (indoor scene, max_depth=20m):
```json
{
  "image": { "width": 1920, "height": 1080 },
  "depth": {
    "min_m": 2.56,
    "max_m": 20.0,
    "mean_m": 8.16,
    "median_m": 7.16
  },
  "settings": {
    "focal_length_px": 2304.0,
    "focal_estimated": true,
    "max_depth_m": 20.0
  },
  "performance": {
    "inference_time_s": 1.0,
    "device": "cuda"
  }
}
```
| Region | Depth | Notes |
|---|---|---|
| Couch (foreground) | ~3 m | Nearest furniture |
| Room center | ~7 m | Median depth |
| Back wall | ~12-13 m | Farthest structure |

Surface normals are RGB-encoded: green = right-facing walls, magenta = left-facing.
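The RGB normal encoding can be inverted to recover unit vectors. A minimal sketch, assuming the common mapping n = 2·rgb/255 − 1 (verify against the model's actual convention before relying on the signs):

```python
import numpy as np

def decode_normals(rgb):
    """Decode RGB-encoded normals (uint8) to unit vectors,
    assuming the common mapping n = 2 * rgb / 255 - 1."""
    n = rgb.astype(np.float32) / 255.0 * 2.0 - 1.0
    # Re-normalize to undo quantization error
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

# A pixel encoded as (128, 128, 255) decodes to roughly (0, 0, 1)
px = np.array([[[128, 128, 255]]], dtype=np.uint8)
n = decode_normals(px)
```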
## Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| `image` | File | required | Input image (JPEG, PNG, WebP) |
| `focal_length_px` | Float | 0 | Camera focal length in pixels (0 = auto-estimate) |
| `max_depth` | Float | 200 | Max depth clamp in meters (20 indoor, 80-300 outdoor) |
| `return_normals` | Boolean | true | Return surface normals as RGB |
| `return_visualization` | Boolean | true | Generate 6-panel diagnostic image |
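If your camera's EXIF data reports a physical focal length in millimeters, it can be converted to the pixel value `focal_length_px` expects. A small sketch; the sensor width is something you must know or look up for your camera, and the function name here is illustrative:

```python
def focal_mm_to_px(focal_mm, sensor_width_mm, image_width_px):
    """Convert a physical focal length (mm) to pixels:
    f_px = f_mm * image_width_px / sensor_width_mm."""
    return focal_mm * image_width_px / sensor_width_mm

# e.g. a 24 mm lens on a full-frame sensor (36 mm wide), 1920 px image
f_px = focal_mm_to_px(24.0, 36.0, 1920)  # 1280.0
```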
## Outputs
- `depth_colorized`: Turbo-colorized depth map (blue = near, red = far)
- `normals`: Surface normals as RGB (R=X, G=Y, B=Z)
- `visualization`: 6-panel diagnostic: input, depth, normals, confidence, histogram, summary
- `calibration_json`: Full metadata (depth stats, settings, timing)
## How It Works
The model operates in a canonical camera space with a fixed focal length of 1000 px. After inference, the predicted depth is scaled by `real_focal / 1000` to recover metric values. Architecture: DINOv2-reg ViT-Small backbone with a RAFT-style depth-normal decoder.
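The canonical-to-metric rescaling can be sketched as follows. This is a schematic of the scaling rule described above (canonical focal of 1000 px), not the wrapper's actual code:

```python
CANONICAL_FOCAL_PX = 1000.0

def to_metric_depth(canonical_depth, real_focal_px):
    """Scale depth predicted in the canonical camera space to metric:
    d_metric = d_canonical * real_focal / 1000."""
    return canonical_depth * real_focal_px / CANONICAL_FOCAL_PX

# With the default example's estimated focal of 2304 px,
# a canonical depth of 3.1 maps to ~7.14 m
d = to_metric_depth(3.1, 2304.0)
```

This is also why a wrong `focal_length_px` skews all depths by the same multiplicative factor: the scale is applied uniformly to the whole map.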
## Indoor vs. Outdoor
- Indoor (`max_depth=20`): rooms, hallways, offices. Wall, floor, and ceiling cues give precise metric depth.
- Outdoor (`max_depth=80-300`): streets, buildings, landscapes. Set a higher `max_depth` so distant structures are not clamped.
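For outdoor scenes the request payload only needs the higher clamp. A hypothetical example; the field names match the Inputs table, `street.jpg` is a placeholder, and 150 m is an illustrative pick inside the suggested 80-300 range:

```python
# Hypothetical outdoor request: raise max_depth so distant
# buildings are not clamped at the default indoor-style range.
outdoor_input = {
    "image": "street.jpg",    # placeholder path/URL
    "max_depth": 150,
    "return_normals": False,  # skip normals if only depth is needed
}
```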
## Advanced Usage

### Known Focal Length
```python
output = replicate.run("visionaix/metric3dv2", input={
    "image": "photo.jpg",
    "focal_length_px": 1500.0,
})
```
### Batch Processing
```python
from concurrent.futures import ThreadPoolExecutor

import replicate

def predict(url):
    # Skip normals and visualization to speed up batch runs
    return replicate.run("visionaix/metric3dv2", input={
        "image": url,
        "return_visualization": False,
        "return_normals": False,
    })

urls = ["img1.jpg", "img2.jpg", "img3.jpg"]
with ThreadPoolExecutor(max_workers=4) as ex:
    results = list(ex.map(predict, urls))
```
## Citation
```bibtex
@article{yin2023metric3d,
  title={Metric3D v2: A Versatile Monocular Geometric Foundation Model},
  author={Yin, Wei and others},
  journal={IEEE TPAMI},
  year={2024},
}
```
## License

This model wraps Metric3D; see the original repository for license terms.