vufinder/vggt-1b

Feed-forward neural network that directly infers all key 3D attributes of a scene.

Public
12.1K runs

Run time and cost

This model costs approximately $0.043 to run on Replicate, or 23 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia L40S GPU hardware. Predictions typically complete within 45 seconds. The predict time for this model varies significantly based on the inputs.

Readme

VGGT: Visual Geometry Grounded Transformer

Visual Geometry Group, University of Oxford; Meta AI

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, David Novotny

Project page

@inproceedings{wang2025vggt,
  title={VGGT: Visual Geometry Grounded Transformer},
  author={Wang, Jianyuan and Chen, Minghao and Karaev, Nikita and Vedaldi, Andrea and Rupprecht, Christian and Novotny, David},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}

Note: This model uses VGGT-1B-Commercial weights.

Model output

The model returns an object with the following attributes:

  • point_cloud (optional): a URL to a GLB file containing the point cloud and meshes representing the camera positions.

  • data: a list of URLs to JSON files containing the raw model output per image, with attributes: image, pose_enc, depth, depth_conf, world_points, world_points_conf, and original_image (with width and height attributes and an optional mask).
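The output schema above can be consumed with a small helper. This is a minimal sketch, not an official client snippet: the field names follow the Readme, but the sample URLs and the `summarize_output` helper are hypothetical, and it assumes the prediction output has already been fetched into a plain dict.

```python
def summarize_output(output: dict) -> dict:
    """Extract the optional point-cloud URL and the per-image JSON URLs
    from a prediction output shaped like the schema documented above."""
    return {
        "point_cloud": output.get("point_cloud"),   # optional GLB URL
        "num_images": len(output.get("data", [])),  # one JSON file per image
        "data_urls": output.get("data", []),
    }

# Hypothetical output matching the documented structure (URLs are made up)
sample = {
    "point_cloud": "https://example.com/scene.glb",
    "data": [
        "https://example.com/frame_000.json",
        "https://example.com/frame_001.json",
    ],
}

summary = summarize_output(sample)
print(summary["num_images"])  # 2
```

Each JSON file in `data` would then be downloaded separately to access the per-image attributes (pose_enc, depth, world_points, and so on).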
