visionaix/metric3dv2

Metric3D v2 (TPAMI 2024): Monocular metric depth and surface normals from a single image. Predicts real-world depth in meters and works both indoors and outdoors.


Metric3D v2 on Replicate

Monocular Metric Depth Estimation and Surface Normals

Paper: TPAMI 2024 | Authors: Yin et al. | Code: github.com/YvanYin/Metric3D

Metric3D v2 predicts metric depth (in real-world meters) and surface normals from a single image. Unlike relative depth models, it outputs actual distances — directly usable for 3D reconstruction, AR, and robotics.

Works on both indoor scenes (rooms, hallways, offices) and outdoor environments (streets, buildings, landscapes).
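Because the output is metric, a pixel plus its depth value can be back-projected to a 3D point with the standard pinhole camera model. A minimal sketch in pure Python (`unproject` is an illustrative helper, not part of the model's output):

```python
def unproject(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth into camera-space XYZ.

    Standard pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)

# Example: the image-center pixel at 7.16 m (the median depth in the sample
# output below), using the auto-estimated focal length of 2304 px.
point = unproject(960, 540, 7.16, fx=2304.0, fy=2304.0, cx=960.0, cy=540.0)
print(point)  # (0.0, 0.0, 7.16)
```

Applying this per pixel over the whole depth map yields a point cloud, which is the basis for the 3D-reconstruction use case mentioned above.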


Default Example

Default input image:

https://cdn.sanity.io/images/k55su7ch/production2/d9e35a73891d43ccb0bc665bf2e0d5d9d6f1ea2b-4200x2363.jpg?w=1920&q=75&auto=format

import replicate

output = replicate.run("visionaix/metric3dv2", input={
    "image": "https://cdn.sanity.io/images/k55su7ch/production2/d9e35a73891d43ccb0bc665bf2e0d5d9d6f1ea2b-4200x2363.jpg?w=1920&q=75&auto=format",
    "max_depth": 20,
    "return_normals": True,
    "return_visualization": True,
})

Default output (indoor scene, max_depth=20m):

{
  "image": { "width": 1920, "height": 1080 },
  "depth": {
    "min_m": 2.56,
    "max_m": 20.0,
    "mean_m": 8.16,
    "median_m": 7.16
  },
  "settings": {
    "focal_length_px": 2304.0,
    "focal_estimated": true,
    "max_depth_m": 20.0
  },
  "performance": {
    "inference_time_s": 1.0,
    "device": "cuda"
  }
}
| Region | Depth | Notes |
|---|---|---|
| Couch (foreground) | ~3 m | Nearest furniture |
| Room center | ~7 m | Median depth |
| Back wall | ~12-13 m | Farthest structure |
| Surface normals | RGB-encoded | Green = right-facing walls, magenta = left-facing |
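The `depth` block in the JSON above is just summary statistics over the (clamped) depth map. A sketch of how such fields could be derived from a flattened list of per-pixel depths, using only the standard library (field names mirror the sample output; this is an illustration, not the model's actual code):

```python
import statistics

def depth_stats(depths_m, max_depth_m=20.0):
    """Clamp per-pixel depths at max_depth_m and summarize them like the
    `depth` block of the calibration JSON."""
    clamped = [min(d, max_depth_m) for d in depths_m]
    return {
        "min_m": round(min(clamped), 2),
        "max_m": round(max(clamped), 2),
        "mean_m": round(statistics.fmean(clamped), 2),
        "median_m": round(statistics.median(clamped), 2),
    }

print(depth_stats([2.56, 7.16, 8.2, 25.0]))
# {'min_m': 2.56, 'max_m': 20.0, 'mean_m': 9.48, 'median_m': 7.68}
```

Note how the 25.0 m reading is clamped to the 20 m ceiling before the statistics are computed, which is why `max_m` equals `max_depth_m` in the sample output.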

Inputs

| Parameter | Type | Default | Description |
|---|---|---|---|
| image | File | required | Input image (JPEG, PNG, WebP) |
| focal_length_px | Float | 0 | Camera focal length in pixels (0 = auto-estimate) |
| max_depth | Float | 200 | Max depth clamp in meters (20 indoor, 80-300 outdoor) |
| return_normals | Boolean | true | Return surface normals as RGB |
| return_visualization | Boolean | true | Generate 6-panel diagnostic image |

Outputs

  • depth_colorized — Turbo-colorized depth map (blue=near, red=far)
  • normals — Surface normals as RGB (R=X, G=Y, B=Z)
  • visualization — 6-panel: input, depth, normals, confidence, histogram, summary
  • calibration_json — Full metadata (depth stats, settings, timing)
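The normals output packs each unit normal into an RGB pixel. Assuming the common encoding `channel = (component + 1) * 127.5` (an assumption about this model's mapping; verify against a surface of known orientation before relying on it), decoding is a one-liner:

```python
def decode_normal(r, g, b):
    """Map an 8-bit RGB triplet back to a normal vector in [-1, 1]^3.

    Assumes the common encoding channel = (component + 1) * 127.5.
    """
    return tuple(c / 127.5 - 1.0 for c in (r, g, b))

# A near-pure-blue pixel decodes to a normal pointing along +Z (toward or away
# from the camera, depending on the axis convention):
print(decode_normal(127, 127, 255))
```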

How It Works

The model operates in a canonical camera space (focal=1000px). After inference, depth is scaled by real_focal / 1000 to produce metric values. Architecture: DINOv2-reg ViT-Small backbone + RAFT depth-normal decoder.
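In code, the canonical-to-metric recovery described above is a single scale factor. A sketch (the scaling rule comes from the description above; the function name is illustrative):

```python
CANONICAL_FOCAL_PX = 1000.0  # canonical camera space used during inference

def to_metric_depth(canonical_depth_m, real_focal_px):
    """Rescale canonical-space depth to metric depth for the actual camera."""
    return canonical_depth_m * real_focal_px / CANONICAL_FOCAL_PX

# With the auto-estimated focal from the sample output (2304 px), a canonical
# depth of 3.0 becomes:
print(to_metric_depth(3.0, 2304.0))  # 6.912
```

This is why a good focal length matters: an error in `real_focal_px` scales every depth value proportionally.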


Indoor vs. Outdoor

Indoor (max_depth=20): Rooms, hallways, offices — wall/floor/ceiling cues give precise metric depth

Outdoor (max_depth=80-300): Streets, buildings, landscapes — set higher max_depth for distant structures
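If you process mixed scenes, a small helper can keep these `max_depth` presets in one place. The preset values follow the guidance above (outdoor uses a mid-range value; raise it toward 300 for landscapes with very distant structures); the helper itself is hypothetical, not part of the API:

```python
# Suggested max_depth presets in meters, per the indoor/outdoor guidance above.
MAX_DEPTH_PRESETS = {"indoor": 20.0, "outdoor": 120.0}

def build_input(image_url, scene="indoor"):
    """Assemble an input dict for replicate.run with a scene-appropriate clamp."""
    return {"image": image_url, "max_depth": MAX_DEPTH_PRESETS[scene]}

print(build_input("street.jpg", scene="outdoor"))
# {'image': 'street.jpg', 'max_depth': 120.0}
```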


Advanced Usage

Known Focal Length

output = replicate.run("visionaix/metric3dv2", input={
    "image": "photo.jpg",
    "focal_length_px": 1500.0,
})

Batch Processing

from concurrent.futures import ThreadPoolExecutor
import replicate

def predict(url):
    # Skip the normals and visualization outputs to reduce per-image latency.
    return replicate.run("visionaix/metric3dv2", input={
        "image": url, "return_visualization": False, "return_normals": False,
    })

urls = ["img1.jpg", "img2.jpg", "img3.jpg"]
with ThreadPoolExecutor(max_workers=4) as ex:
    results = list(ex.map(predict, urls))

Citation

@article{yin2023metric3d,
  title={Metric3D v2: A Versatile Monocular Geometric Foundation Model},
  author={Yin, Wei and others},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2024},
}

License

This model wraps Metric3D; see the upstream repository (github.com/YvanYin/Metric3D) for the original license terms.
