When you use a public model on Replicate, you only pay for the time it's actively processing your requests. Setup and idle time for the model are free.
By default, you share a hardware pool with other customers, meaning your requests enter a shared queue alongside other customer requests. This means you will sometimes encounter cold boots or scaling limits depending on how other customers are using the model.
If you would like more control over how the model is run, you can use a deployment and have your own instances and request queue.
Unlike public models, most private models (with the exception of fast booting models) run on dedicated hardware and you don't have to share a queue with anyone else. This means you pay for all the time instances of the model are online: the time they spend setting up; the time they spend idle, waiting for requests; and the time they spend active, processing your requests.
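The difference between the two billing modes can be illustrated with a minimal sketch. The per-second price and the time breakdown below are made-up values for illustration, and the names `InstanceTime`, `shared_pool_cost`, and `dedicated_cost` are hypothetical; they are not part of any Replicate API.

```python
from dataclasses import dataclass


@dataclass
class InstanceTime:
    """Seconds an instance spent in each state while online."""
    setup: float
    idle: float
    active: float


def shared_pool_cost(t: InstanceTime, price_per_second: float) -> float:
    """Public model on shared hardware: only active time is billed."""
    return t.active * price_per_second


def dedicated_cost(t: InstanceTime, price_per_second: float) -> float:
    """Private model on dedicated hardware: setup, idle, and active time are all billed."""
    return (t.setup + t.idle + t.active) * price_per_second


# Hypothetical numbers: 2 minutes of setup, 5 minutes idle, 10 minutes active.
usage = InstanceTime(setup=120, idle=300, active=600)
price = 0.001  # assumed dollars per second, for illustration only

print(f"shared pool: ${shared_pool_cost(usage, price):.2f}")
print(f"dedicated:   ${dedicated_cost(usage, price):.2f}")
```

With these assumed numbers, the same workload costs more on dedicated hardware purely because the setup and idle seconds are counted.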
As with public models, if you would like more control over how a private model is run, you can use a deployment.
Here's an example using Meta's Llama 3.1 405B Instruct:
| | Tokens | Count | Price |
|---|---|---|---|
| Input | Write a limerick about llamas | 8 | $0.0000760 |
| Output | There once was a llama named Sue,\n Whose favorite color was blue,\n She lived in the Andes,\n With her friends eating candies\n And together they all played kazoo. | 43 | $0.0004085 |
| Total | | 51 | $0.0004845 |
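The arithmetic behind the table can be reproduced with a short sketch. The per-token rate is inferred from the table itself ($0.0000760 / 8 tokens = $0.0000095 per token, which also matches the output row); treat it as an assumption for illustration rather than a published price. The function name `token_cost` is hypothetical.

```python
from decimal import Decimal

# Assumed flat rate, inferred from the example table: $0.0000760 / 8 tokens.
PRICE_PER_TOKEN = Decimal("0.0000095")


def token_cost(token_count: int, price_per_token: Decimal = PRICE_PER_TOKEN) -> Decimal:
    """Return the cost in dollars for a number of tokens at a flat per-token rate."""
    return token_count * price_per_token


input_cost = token_cost(8)    # tokens in the prompt
output_cost = token_cost(43)  # tokens in the generated limerick
total_cost = input_cost + output_cost

print(f"Input:  ${input_cost:.7f}")
print(f"Output: ${output_cost:.7f}")
print(f"Total:  ${total_cost:.7f}")
```

`Decimal` is used instead of `float` so that very small per-token amounts add up exactly.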
Sometimes, we're able to optimize how a trained model is run so it boots fast. This works by using a common, shared pool of hardware running the base model. In these cases, we only ever charge you for the time the model is active and processing your requests, regardless of whether it's public or private.
Fast booting versions of models are labeled as such in the model's version list. You can also see which versions support the creation of fast booting models when training.
Deployments are a feature that allows you to, among other things, control the hardware and scaling parameters of any model. As with private models, we charge for all the time deployment instances are online: the time they spend setting up; the time they spend idle, waiting for requests; and the time they spend active, processing your requests.
In addition to the benefits of having a stable endpoint and graceful rollouts of versions, you might want to use a deployment if, for example:
Note that a well-tuned deployment is usually only marginally more expensive than a public model. You do pay for setup and idle time on deployment instances, but when the deployment is configured correctly, those instances should spend only a fraction as much time setting up or idle as they spend active.
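To make "marginally more expensive" concrete, the extra cost of a deployment relative to paying for active time alone can be expressed as the ratio of setup and idle time to active time. This is a rough sketch that assumes the same per-second hardware price in both cases; the function name `deployment_overhead` and the sample durations are hypothetical.

```python
def deployment_overhead(setup_seconds: float, idle_seconds: float, active_seconds: float) -> float:
    """Return extra billed time as a fraction of active time.

    A value of 0.05 means the deployment costs about 5% more than
    paying for active time alone, assuming the same per-second price.
    """
    if active_seconds <= 0:
        raise ValueError("active_seconds must be positive")
    return (setup_seconds + idle_seconds) / active_seconds


# Hypothetical well-tuned deployment: little setup or idle time relative to work done.
well_tuned = deployment_overhead(setup_seconds=60, idle_seconds=240, active_seconds=6000)

# Hypothetical over-provisioned deployment: instances sit idle as long as they work.
over_provisioned = deployment_overhead(setup_seconds=60, idle_seconds=5940, active_seconds=6000)

print(f"well tuned:       {well_tuned:.0%} overhead")
print(f"over-provisioned: {over_provisioned:.0%} overhead")
```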
For public models, if a run fails, we don't charge you for its time. However, if you cancel a run, we charge you for the time it ran up until that point.
For private models and deployments, failed and canceled runs are billed for the time the instances they ran on were active, as normal.
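The rules for failed and canceled runs can be summarized in a small sketch. The status strings and the function name `billable_run_seconds` are illustrative assumptions, not a Replicate API; the sketch only encodes the rules described above and covers the run's own time, not any setup or idle time that dedicated instances are billed for separately.

```python
def billable_run_seconds(status: str, run_seconds: float, dedicated: bool) -> float:
    """Return the seconds of a single run that are billed.

    status: "succeeded", "failed", or "canceled" (illustrative labels).
    dedicated: True for private models and deployments, False for public models.
    """
    if status not in {"succeeded", "failed", "canceled"}:
        raise ValueError(f"unknown status: {status}")
    if not dedicated and status == "failed":
        # Public models: failed runs are not charged.
        return 0.0
    # Public canceled runs are charged up to the point of cancellation;
    # private models and deployments are billed for active time as normal.
    return run_seconds


print(billable_run_seconds("failed", 12.5, dedicated=False))
print(billable_run_seconds("canceled", 12.5, dedicated=False))
print(billable_run_seconds("failed", 12.5, dedicated=True))
```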
Different models run on different hardware. You'll find the hardware specifications under the "Run time and cost" heading on each model's page. Check out stability-ai/sdxl for an example.
If a model is one you created on Replicate, you can adjust which hardware to use in the model's settings. You can also specify hardware for a deployment.
At the beginning of each month, we charge you for what you used in the previous month.
The minimum billable unit for an individual run of a public model is 1 second or 1 token.
Sometimes, when your usage exceeds certain thresholds for the first time, or after you change your payment method, we charge you early for some of the month's usage. We do this to help prevent fraudulent use of Replicate.
You can find your current usage and manage your billing settings on your account page.
You can try featured models out on Replicate for free, but after a bit you'll be asked to set up billing.
Some features are only available to customers with billing set up.