Configure a model deployment

Replicate makes it easy to run machine learning models. You can run the best open-source models with just one line of code, or deploy your own custom models. But sometimes you need more control. That’s where deployments come in.

What are deployments?

Deployments give you more control over how your models run. With deployments you can:

  • Roll out new versions of your model without having to edit your code.
  • Auto-scale your models to handle extra load and scale to zero when they’re not being used.
  • Keep instances always on to avoid cold boots.
  • Customize what hardware your models run on.
  • Monitor whether instances are setting up, idle, or processing predictions.
  • Monitor the predictions that are flowing through your model.

Deployments work with both open-source models and your own custom models.


Deployments auto-scale according to demand. When traffic increases, they scale up to handle it, and when traffic drops they scale back down, so you only pay for what you need. You can cap the maximum number of instances to bound your spend, or set a minimum to keep some instances warm and ready for predictions.
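The effect of the minimum and maximum settings can be pictured as a clamp on the number of instances that demand alone would call for. The sketch below is only an illustration of that idea, not Replicate's actual scaling algorithm. The function name, the notion of a fixed per-instance capacity, and the use of queued predictions as the demand signal are all assumptions made for the example.

```python
def target_instances(queued_predictions: int,
                     per_instance_capacity: int,
                     min_instances: int,
                     max_instances: int) -> int:
    """Illustrative model only: clamp demand-driven instances to [min, max]."""
    if per_instance_capacity <= 0:
        raise ValueError("per_instance_capacity must be positive")
    if min_instances < 0 or max_instances < min_instances:
        raise ValueError("require 0 <= min_instances <= max_instances")

    # Ceiling division: instances needed to cover the current demand.
    needed = -(-queued_predictions // per_instance_capacity)

    # Never drop below the warm minimum, never exceed the spend cap.
    return max(min_instances, min(needed, max_instances))


quiet = target_instances(0, 4, min_instances=0, max_instances=5)     # scales to zero
busy = target_instances(10, 4, min_instances=0, max_instances=5)     # scales up
capped = target_instances(100, 4, min_instances=0, max_instances=5)  # hits the cap
warm = target_instances(0, 4, min_instances=2, max_instances=5)      # stays warm
```

With a minimum of zero the deployment can scale to zero when idle. A non-zero minimum trades some idle cost for avoiding cold boots.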

Creating a deployment

To create a new deployment, navigate to any model or model version on the website and click the deploy button at the top right of the page.

Here you can name the deployment and confirm the model and version that you want to deploy. You can change the version at any time. Once the deployment is created, you can start running predictions against it.
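As a rough sketch of what running a prediction against a deployment might look like over HTTP, the example below builds (but does not send) a request using only the Python standard library. The endpoint path, the `Bearer` authorization scheme, and the `input` field are assumptions based on Replicate's public HTTP API and should be checked against the API reference. The owner `acme`, the deployment name `my-deployment`, and the `prompt` input are hypothetical placeholders.

```python
import json
import urllib.request

# Assumed base URL of the Replicate HTTP API.
API_BASE = "https://api.replicate.com/v1"


def build_prediction_request(owner: str, name: str,
                             model_input: dict, token: str) -> urllib.request.Request:
    """Build a POST request that would create a prediction on a deployment."""
    # Assumed endpoint shape: /deployments/{owner}/{name}/predictions
    url = f"{API_BASE}/deployments/{owner}/{name}/predictions"
    body = json.dumps({"input": model_input}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical owner, deployment name, and input.
request = build_prediction_request(
    "acme", "my-deployment", {"prompt": "hello"}, "YOUR_API_TOKEN"
)

# To actually send it: urllib.request.urlopen(request)
```

Because the request targets the deployment rather than a specific model version, changing the deployed version later does not require changing this code.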

To configure the deployment further, go to the settings tab on the deployment page. There you can customize the hardware and the autoscaling behavior of the deployment. We will also give you an indication of how much the deployment will cost to operate.
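To get a feel for how the autoscaling bounds relate to cost, the arithmetic can be sketched as follows. This is a simplified model, not the calculation the settings page performs. It assumes each running instance is billed per second at a flat rate, and the price used in the example is a hypothetical figure, not a real hardware price.

```python
SECONDS_PER_HOUR = 3600


def hourly_cost_range(price_per_second: float,
                      min_instances: int,
                      max_instances: int) -> tuple[float, float]:
    """Return (lowest, highest) hourly cost for the given instance bounds."""
    if price_per_second < 0:
        raise ValueError("price_per_second must be non-negative")
    if min_instances < 0 or max_instances < min_instances:
        raise ValueError("require 0 <= min_instances <= max_instances")

    per_instance_hour = price_per_second * SECONDS_PER_HOUR
    # The minimum is what you pay when idle; the maximum is the spend ceiling.
    return (min_instances * per_instance_hour,
            max_instances * per_instance_hour)


# Hypothetical price of $0.001 per second per instance.
low, high = hourly_cost_range(0.001, min_instances=1, max_instances=4)
print(f"Between ${low:.2f} and ${high:.2f} per hour")
```

Under this model, a minimum of zero gives a floor of zero, which corresponds to scaling to zero when the deployment is not in use.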

View existing deployments

All existing deployments can be found under your account dashboard. From here you can navigate to a specific deployment to see its current status, usage metrics and predictions.

To temporarily disable a deployment, set both the minimum and maximum instances to zero under the autoscaling settings. This cancels any in-flight predictions, prevents further predictions from running, and halts billing for the deployment's instances.
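The same change could in principle be scripted. The sketch below builds (but does not send) a request that sets both bounds to zero. The `PATCH` endpoint and the `min_instances` and `max_instances` field names are assumptions based on Replicate's public HTTP API, so verify them against the API reference before relying on them. The owner `acme` and the deployment name `my-deployment` are hypothetical.

```python
import json
import urllib.request

# Assumed base URL of the Replicate HTTP API.
API_BASE = "https://api.replicate.com/v1"


def build_disable_request(owner: str, name: str, token: str) -> urllib.request.Request:
    """Build a PATCH request that would set min and max instances to zero."""
    # Assumed endpoint shape: /deployments/{owner}/{name}
    url = f"{API_BASE}/deployments/{owner}/{name}"
    body = json.dumps({"min_instances": 0, "max_instances": 0}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical owner and deployment name.
request = build_disable_request("acme", "my-deployment", "YOUR_API_TOKEN")

# To actually send it: urllib.request.urlopen(request)
```

To re-enable the deployment, raise the maximum (and optionally the minimum) above zero again.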