How does billing work?
| Hardware | Price |
| -------- | ----- |
| Nvidia T4 GPU | $0.000225/sec |
| Nvidia A40 GPU | $0.000575/sec |
| Nvidia A40 (Large) GPU | $0.000725/sec |
| Nvidia A100 (40GB) GPU | $0.001150/sec |
| Nvidia A100 (80GB) GPU | $0.001400/sec |
| 8x Nvidia A40 (Large) GPU | $0.005800/sec |
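Because billing is per second, a run's cost is just its duration multiplied by the hardware's per-second rate. Here's a minimal sketch using the rates from the table above (the dictionary keys and helper name are illustrative, not Replicate identifiers):

```python
# Per-second prices copied from the table above (USD).
PRICE_PER_SEC = {
    "t4": 0.000225,
    "a40": 0.000575,
    "a40-large": 0.000725,
    "a100-40gb": 0.001150,
    "a100-80gb": 0.001400,
    "8x-a40-large": 0.005800,
}

def run_cost(hardware: str, seconds: float) -> float:
    """Estimate the cost of a single run: duration x per-second rate."""
    return PRICE_PER_SEC[hardware] * seconds

# A 20-second run on an A100 (40GB):
print(f"${run_cost('a100-40gb', 20):.4f}")  # $0.0230
```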
Lifecycle of an instance
When you use a public model on Replicate, you pay only for the time it spends actively processing your requests. Boot and idle time for the model are free.
By default, you share a hardware pool with other customers: your requests enter a shared queue alongside theirs, so you may sometimes encounter cold boots or scaling limits depending on how other customers are using the model.
If you would like more control over how the model is run, you can use a deployment and have your own instances and request queue.
Unlike public models, most private models (with the exception of fast booting models) run on dedicated hardware. This means we charge for their boot and idle time in addition to the active time they spend processing your requests, but you don’t have to share a queue with anyone else.
As with public models, if you would like more control over how a private model is run, you can use a deployment.
Fast booting models
Sometimes, we’re able to optimize how a trained model is run so it boots fast. This works by using a common, shared pool of hardware running the base model. In these cases, we only ever charge you for the time the model is active and processing your requests, regardless of whether or not it’s public or private.
Fast booting versions of models are labeled as such in the model’s version list. You can also see which versions support the creation of fast booting models when training.
Deployments
Deployments let you, among other things, control the hardware and scaling parameters of any model. As with private models, we charge for a deployment's boot and idle time in addition to the active time its instances spend processing your requests.
In addition to the benefits of having a stable endpoint and graceful rollouts of versions, you might want to use a deployment if, for example:
- you want to configure a public model owned by someone else to run on different hardware
- you have steady use of a model and want to avoid being impacted by other customers using it
- you know your expected request rate and want to avoid cold boots
- you have a private model with a consistent, predictable request rate
Note that a well-tuned deployment is usually only marginally more expensive than a public model: although you pay for boot and idle time on deployment instances, a correctly configured deployment should spend only a fraction of its time booting or idle compared with the time it's active.
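To see why, compare the total billed time of a deployment (active + boot/idle) against a public model (active only) at a given level of utilization. A rough sketch with hypothetical numbers:

```python
def public_model_cost(active_sec: float, rate: float) -> float:
    """Public models bill active processing time only."""
    return active_sec * rate

def deployment_cost(active_sec: float, overhead_sec: float, rate: float) -> float:
    """Deployments also bill boot and idle (overhead) time."""
    return (active_sec + overhead_sec) * rate

rate = 0.000575   # A40 rate from the pricing table, $/sec
active = 3600     # one hour of actual processing
overhead = 180    # 3 minutes of boot/idle: a well-tuned deployment

print(f"${public_model_cost(active, rate):.2f}")            # $2.07
print(f"${deployment_cost(active, overhead, rate):.2f}")    # $2.17, about 5% more
```

The gap grows only if the deployment sits idle for long stretches, which is why a steady, predictable request rate is the ideal fit.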
Failed and canceled runs
For public models, if a run fails, we don’t charge you for its time. However, if you cancel a run, we charge you for the time it ran up until that point.
For private models and deployments, failed and canceled runs are billed for the time the instances they ran on were active, as normal.
Hardware
Different models run on different hardware. You'll find the hardware specifications under the "Run time and cost" heading on each model's page. Check out stability-ai/sdxl for an example.
If a model is one you created on Replicate, you can adjust which hardware to use in the model’s settings. You can also specify hardware for a deployment.
At the beginning of each month, we charge you for the total time that you used in the previous month.
The minimum billable time for an individual run of a public model is 1 second.
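In other words, the billed duration of a public-model run is the measured duration clamped to at least one second. A sketch (the function name is illustrative):

```python
def billable_seconds(measured_seconds: float, minimum: float = 1.0) -> float:
    """Runs of public models are billed for at least 1 second."""
    return max(measured_seconds, minimum)

print(billable_seconds(0.4))  # 1.0 — a 0.4s run bills as a full second
print(billable_seconds(2.5))  # 2.5 — longer runs bill their actual duration
```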
You can try Replicate for free, but after some initial usage you'll be asked to set up billing.
Some features are only available to customers with billing set up.