How does Replicate work?

Replicate lets you run machine learning models with a cloud API, without having to understand the intricacies of machine learning or manage your own infrastructure. You can run open-source models that other people have published, or package and publish your own models. Those models can be public or private.

Terminology

Let's start by defining some important terms that you'll need to know:

Models

In the world of machine learning, the word "model" can mean many different things depending on context. It can be the source code, the trained weights, the architecture, or some combination thereof. At Replicate, when we say "model" we're generally referring to a trained, packaged, and published software program that accepts inputs and returns outputs.

Versions

Just like normal software, machine learning models change and improve over time, and those changes are released as new versions. Whenever a model author retrains a model with new data, fixes a bug in the source code, or updates a dependency, those changes can influence the behavior of the model. The changes are published as new versions, so model authors can make improvements without disrupting the experience for people using older versions of the model. Versioning is essential to making machine learning reproducible: it helps guarantee that a model will behave consistently regardless of when or where it's being run.

Predictions

Every time you run a model, you're creating a prediction. A prediction is an object that represents a single result from running the model, including the inputs that you provided, the outputs that the model returned, as well as other metadata like the model version, the user who created it, the status of the prediction, and timestamps.
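
To make the shape of a prediction concrete, here is a minimal sketch of the pieces listed above as a Python dataclass. The class name, field names, and placeholder values are assumptions chosen for illustration; they are not the schema of the Replicate API or of its client libraries.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, Optional


@dataclass
class PredictionRecord:
    """Illustrative stand-in for a prediction; all field names are assumptions."""

    version: str                       # model version that was run
    input: Dict[str, Any]              # inputs that you provided
    created_by: str                    # user who created the prediction
    status: str = "starting"           # current status of the prediction
    output: Optional[Any] = None       # outputs the model returned, once available
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )                                  # timestamp metadata
    completed_at: Optional[datetime] = None


# Placeholder values, not real identifiers.
record = PredictionRecord(
    version="example-version-id",
    input={"prompt": "an astronaut riding a horse"},
    created_by="example-user",
)
print(record.status, record.output)
```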

How to run models in the browser

You can run models on Replicate using the cloud API or the web interface.

The web interface is a good place to start when trying out a model for the first time. It gives you a visual view of all the inputs to the model, and generates a form for running the model right from your browser:

[Screenshot: Replicate's web interface for running a model in the browser]

How to run models with the API

The web interface is great for getting acquainted with a model, but when you're ready to integrate a model into something like a chat bot, website, or mobile app, that's when the API comes into play.

Our HTTP API can be used with any programming language, but there are also client libraries for Python, JavaScript, and other languages that make it easier to use the API.

Using the Python client, you can create predictions with just a few lines of code:

import replicate

# Run one specific version of the model (owner/name:version) and get its output.
output = replicate.run(
    "stability-ai/stable-diffusion:db21e45d3f7023abc2a46ee38a23973f6dce16bb082a930b0c49861f96d1e5bf",
    input={"prompt": "an astronaut riding a horse"}
)

How predictions work

Whenever you run a model, you're creating a prediction.

Some models run very quickly and can return a result within a few milliseconds. Other models take longer to run, especially generative models such as those that produce images from text prompts. For these long-running models, you need to poll the API to check the status of a prediction. Predictions can have any of the following statuses:

  • starting: the prediction is starting up. If this status lasts longer than a few seconds, then it's typically because a new worker is being started to run the prediction. Refer to Keeping models warm.
  • processing: the predict() method of the model is currently running.
  • succeeded: the prediction completed successfully.
  • failed: the prediction encountered an error during processing.
  • canceled: the prediction was canceled by the user.
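
The polling described above can be sketched as a small loop that stops once a prediction reaches one of the final statuses (succeeded, failed, or canceled). This is a sketch under assumptions: get_status is a hypothetical callable that stands in for whatever request you use to fetch the prediction's current status, and the interval and timeout defaults are arbitrary.

```python
import time
from typing import Callable

# Statuses after which a prediction will not change again (from the list above).
TERMINAL_STATUSES = {"succeeded", "failed", "canceled"}


def wait_for_prediction(
    get_status: Callable[[], str],
    interval: float = 1.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until it reports a terminal status, then return it."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"prediction still {status!r} after {timeout} seconds")
        sleep(interval)


# Usage with a stand-in status source instead of a real API call.
fake_statuses = iter(["starting", "processing", "processing", "succeeded"])
final_status = wait_for_prediction(lambda: next(fake_statuses), sleep=lambda s: None)
print(final_status)  # prints "succeeded"
```

Passing the status lookup in as a callable keeps the loop independent of any particular client library, and makes it easy to exercise without network access.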

When you're logged in, you can view a list of your predictions on the dashboard, with a summary of each prediction's status, run time, and other details.

Share predictions

Every prediction that you create is associated with your user account, and only you can see the predictions that you create. If you're using the web interface, then you can click the "Share" button to make the prediction public, so that others can view it.

Delete predictions

Input and output (including any files) are automatically deleted after an hour for any predictions created through the API, so you must save a copy of any files in the output if you'd like to continue using them. For more details on how to store prediction data, see the guide to webhooks.
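
One way to keep a copy of an output file before it is deleted is to download it as soon as the prediction finishes. The sketch below assumes that the output gives you a URL pointing to the file; the function name and the destination path in the usage comment are hypothetical.

```python
import shutil
import urllib.request
from pathlib import Path


def save_output_file(url: str, destination: Path) -> Path:
    """Download one output file from `url` and write it to `destination`."""
    # Create the target folder if it does not exist yet.
    destination.parent.mkdir(parents=True, exist_ok=True)
    # Stream the response to disk instead of holding the whole file in memory.
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out)
    return destination


# Example call (hypothetical URL variable and path):
# save_output_file(output_url, Path("outputs/astronaut.png"))
```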

Predictions created through the web interface are kept indefinitely, unless you delete them manually.

To manually delete a prediction on the website, go to your dashboard, find the prediction, and look for a "Delete" button on the prediction page. Clicking this button completely removes the prediction from the site, including any output data and output files associated with it.

Which models can you run?

You can run any public model on Replicate, either from your own code using the API or from the web interface. It can be an open-source model created by someone else, like nightmareai/disco-diffusion or kuprel/min-dalle, or you can publish and run your own models.

To find models to run, you can explore popular and featured models or search for something specific.

You can also push your own model to Replicate. Refer to Pushing your own models.

Pricing

As a free user, you get a little bit of free compute time to try out the web interface and the API. Every model has different performance characteristics, so the number of predictions that you can run with your free compute time varies.

When you hit your free limit, you'll need to add your credit card. There's no base charge when adding your credit card.

You're billed by the second based on your usage. The price per second varies depending on the hardware used by the model that you're running: some models run on Nvidia A100 GPUs, others on Nvidia T4 GPUs, and a few even run on CPUs. Each hardware type has a different price.

The hardware, pricing, and performance characteristics of each model are listed on the model page in the "Run time and cost" section. Here's an example from the kuprel/min-dalle model page:

“This model costs approximately $0.067 to run on Replicate, but this varies depending on your inputs. ... Predictions run on Nvidia A100 GPU hardware, which costs $0.0023 per second. Predictions typically complete within 30 seconds.”
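
Because billing is per second, a rough estimate for a single prediction is the run time multiplied by the hardware's price per second. The sketch below uses the figures from the quote; the function name is made up for illustration, and real costs vary with your inputs.

```python
def estimate_cost(run_seconds: float, price_per_second: float) -> float:
    """Return the estimated cost in dollars of one prediction."""
    return run_seconds * price_per_second


# Figure from the quoted example: A100 hardware at $0.0023 per second.
A100_PRICE_PER_SECOND = 0.0023

# A prediction that takes the full 30 seconds mentioned in the quote.
upper_bound = estimate_cost(30, A100_PRICE_PER_SECOND)
print(f"${upper_bound:.3f}")  # prints "$0.069"
```

The result is slightly above the quoted approximate cost of $0.067, which is consistent with typical predictions finishing a little under 30 seconds.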

When you log in to Replicate in your browser, you'll see a notice on your dashboard to help you keep track of your current spend:

Your usage for the current billing period is $3.14

For more details, head to our pricing page.

Commercial use

The models on Replicate have been built and contributed by different people and organizations, and the licenses vary for each model. Here are a few examples:

For Stable Diffusion, neither Replicate nor the authors of the model claim any ownership over the output. For details, see the Stable Diffusion license, and Replicate's terms of service.

Other models like Pixray have some restrictions on commercial use.

You can view the license for a model by clicking the button at the top right of the model page.

Cold boots

We have a huge catalogue of models. To make good use of resources, we only run the models that are actually being used. When a model hasn't been used for a little while, we turn it off.

When you make a request to run a prediction on a model, you'll get a fast response if the model is "warm" (already running), and a slower response if the model is "cold" (starting up). Machine learning models are often very large and resource intensive, and we have to fetch and load several gigabytes of code for some models. In some cases this process can take several minutes.

Cold boots can also happen when there's a big spike in demand. We autoscale by running multiple copies of a model on different machines, but the model can take a while to become ready.

For popular public models, cold boots are uncommon because the model is kept "warm" from all the activity. For less-frequently used models, cold boots are more frequent.

If you're using the API to create predictions in the background, then cold boots probably aren't a big deal: we only charge for the time that your prediction is actually running, so it doesn't affect your costs.

If you're doing something more real-time and experience spikier demand leading to frequent cold boots, then we can keep a model warm for you. Email us at team@replicate.com.

Rate limits

We limit the number of API requests that can be made to Replicate:

  • You can call the create prediction endpoint at an average of 10 requests per second, with a burst capacity of up to 600 requests.
  • You can call all other endpoints at an average of 50 requests per second, with a burst capacity of up to 3000 requests.

See the HTTP API reference docs for more details.
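
If you send many requests, you may want to throttle them on your side so that you stay within these limits. An average rate with a burst capacity is commonly modeled as a token bucket, so here is a minimal client-side sketch of one. It is an assumption that this model matches how Replicate enforces its limits; the class and its method names are made up for illustration.

```python
import time
from typing import Callable


class TokenBucket:
    """Client-side throttle: `rate` tokens per second, at most `capacity` stored."""

    def __init__(
        self,
        rate: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity      # start with the full burst allowance
        self.updated = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; return False if the caller should wait."""
        now = self.clock()
        # Refill in proportion to the time elapsed, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Limits quoted above for the create prediction endpoint.
create_prediction_limit = TokenBucket(rate=10, capacity=600)

if create_prediction_limit.try_acquire():
    pass  # safe to send one create prediction request here
```

Call try_acquire() before each request and wait briefly whenever it returns False.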

Push your own models

In addition to running other people's models, you can push your own models to Replicate. You can make your model public so that other people can run it, or you can make it private so that only you can run it.

To learn more, check out Push a model to Replicate.

Get support

Stuck on something? We're here to help.

Check out our troubleshooting page.