How does billing work?
Hardware | Price | GPU | CPU | GPU RAM | RAM |
---|---|---|---|---|---|
CPU (cpu) | $0.000100/sec ($0.36/hr) | - | 4x | - | 8GB |
Nvidia A100 (80GB) GPU (gpu-a100-large) | $0.001400/sec ($5.04/hr) | 1x | 10x | 80GB | 144GB |
2x Nvidia A100 (80GB) GPU (gpu-a100-large-2x) | $0.002800/sec ($10.08/hr) | 2x | 20x | 160GB | 288GB |
4x Nvidia A100 (80GB) GPU (gpu-a100-large-4x) | $0.005600/sec ($20.16/hr) | 4x | 40x | 320GB | 576GB |
8x Nvidia A100 (80GB) GPU (gpu-a100-large-8x) | $0.011200/sec ($40.32/hr) | 8x | 80x | 640GB | 960GB |
Nvidia A40 (Large) GPU (gpu-a40-large) | $0.000725/sec ($2.61/hr) | 1x | 10x | 48GB | 72GB |
2x Nvidia A40 (Large) GPU (gpu-a40-large-2x) | $0.001450/sec ($5.22/hr) | 2x | 20x | 96GB | 144GB |
4x Nvidia A40 (Large) GPU (gpu-a40-large-4x) | $0.002900/sec ($10.44/hr) | 4x | 40x | 192GB | 288GB |
8x Nvidia A40 (Large) GPU (gpu-a40-large-8x) | $0.005800/sec ($20.88/hr) | 8x | 48x | 384GB | 680GB |
Nvidia A40 GPU (gpu-a40-small) | $0.000575/sec ($2.07/hr) | 1x | 4x | 48GB | 16GB |
Nvidia T4 GPU (gpu-t4) | $0.000225/sec ($0.81/hr) | 1x | 4x | 16GB | 16GB |
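As a rough illustration of how the per-second prices translate into the cost of a run, here is a minimal sketch. The prices are copied from the table above; the `run_cost` helper and the example durations are hypothetical and only show the arithmetic, not how Replicate computes invoices.

```python
from decimal import Decimal

# Per-second prices in dollars, copied from the hardware table,
# keyed by hardware identifier (only a subset is listed here).
PRICE_PER_SECOND = {
    "cpu": Decimal("0.000100"),
    "gpu-t4": Decimal("0.000225"),
    "gpu-a40-small": Decimal("0.000575"),
    "gpu-a40-large": Decimal("0.000725"),
    "gpu-a100-large": Decimal("0.001400"),
}


def run_cost(hardware: str, seconds: float) -> Decimal:
    """Return the dollar cost of `seconds` of billed time on `hardware`."""
    # Decimal avoids binary floating-point rounding on small prices.
    return PRICE_PER_SECOND[hardware] * Decimal(str(seconds))


print(run_cost("gpu-a40-large", 12))     # a 12-second run on one A40 (Large)
print(run_cost("gpu-a100-large", 3600))  # one hour on one A100 (80GB)
```

Multiplying any per-second price by 3600 reproduces the hourly figure shown next to it in the table.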
Lifecycle of an instance
Public models
When you use a public model on Replicate, you only pay for the time it’s actively processing your requests. Setup and idle time for the model are free.
By default, you share a hardware pool with other customers, so your requests enter a queue shared with their requests. As a result, you will sometimes encounter cold boots or scaling limits, depending on how other customers are using the model.
If you would like more control over how the model is run, you can use a deployment and have your own instances and request queue.
Image models
Some image models are maintained by Replicate and priced per image generated.
Language models
Some language models are maintained by Replicate and priced per token.
A language model processes text by breaking it into tokens, or pieces of words. Once a run has finished, Replicate uses the Llama tokenizer to count the tokens in its text inputs and outputs.
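To make per-token pricing concrete, here is a minimal sketch of the arithmetic. The two prices are hypothetical placeholders, not Replicate's actual rates, and charging input and output tokens at different rates is an assumption made for illustration. The token counts are taken as given rather than computed with a tokenizer.

```python
from decimal import Decimal

# HYPOTHETICAL prices in dollars per million tokens (not real rates).
INPUT_PRICE_PER_MILLION = Decimal("0.05")
OUTPUT_PRICE_PER_MILLION = Decimal("0.25")


def token_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Dollar cost of one run billed per token, from already-counted tokens."""
    million = Decimal(1_000_000)
    total = (input_tokens * INPUT_PRICE_PER_MILLION
             + output_tokens * OUTPUT_PRICE_PER_MILLION)
    return total / million


# A run with a 1,000-token prompt and a 500-token completion.
print(token_cost(1000, 500))
```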
Here’s an example using meta/llama-2-7b-chat:
Private models
Unlike public models, most private models (with the exception of fast booting models) run on dedicated hardware and you don’t have to share a queue with anyone else. This means you pay for all the time instances of the model are online: the time they spend setting up; the time they spend idle, waiting for requests; and the time they spend active, processing your requests.
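The difference between the two billing models can be sketched as follows. The `InstanceTime` class, the `billed_seconds` helper, and the example durations are hypothetical; the per-second price is the gpu-a40-large rate from the hardware table.

```python
from dataclasses import dataclass


@dataclass
class InstanceTime:
    """Seconds an instance spent in each phase (illustrative breakdown)."""
    setup: float
    idle: float
    active: float


def billed_seconds(t: InstanceTime, dedicated: bool) -> float:
    """Dedicated hardware bills all online time; shared hardware bills active time only."""
    if dedicated:
        return t.setup + t.idle + t.active
    return t.active


t = InstanceTime(setup=120, idle=300, active=180)
price = 0.000725  # gpu-a40-large, dollars per second

print(f"public:  ${billed_seconds(t, dedicated=False) * price:.4f}")
print(f"private: ${billed_seconds(t, dedicated=True) * price:.4f}")
```

In this made-up scenario the private model is billed for 600 seconds against 180 seconds for the public one, because setup and idle time are included.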
As with public models, if you would like more control over how a private model is run, you can use a deployment.
Fast booting models
Sometimes, we’re able to optimize how a trained model is run so that it boots fast. This works by using a common, shared pool of hardware running the base model. In these cases, we only ever charge you for the time the model is active and processing your requests, regardless of whether it’s public or private.
Fast booting versions of models are labeled as such in the model’s version list. You can also see which versions support the creation of fast booting models when training.
Deployments
Deployments let you, among other things, control the hardware and scaling parameters of any model. As with private models, we charge for all the time deployment instances are online: the time they spend setting up; the time they spend idle, waiting for requests; and the time they spend active, processing your requests.
In addition to the benefits of having a stable endpoint and graceful rollouts of versions, you might want to use a deployment if, for example:
- you want to configure a public model owned by someone else to run on different hardware
- you have steady use of a model and want to avoid being impacted by other customers using it
- you know your expected request rate and want to avoid cold boots
- you have a private model with a consistent, predictable request rate
Note that a well-tuned deployment is usually only marginally more expensive than running the same model as a public model. You do pay for setup and idle time on deployment instances, but when the deployment is configured correctly, that time should be only a fraction of the time the instances spend active.
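One way to check whether a deployment is well tuned is to compare its setup and idle time with its active time. The sketch below uses a hypothetical `deployment_overhead` helper and made-up durations, and assumes the deployment runs on the same hardware (and so at the same per-second price) as the public model.

```python
def deployment_overhead(setup_s: float, idle_s: float, active_s: float) -> float:
    """Fraction by which billed time exceeds active time for a deployment."""
    if active_s <= 0:
        raise ValueError("active_s must be positive")
    return (setup_s + idle_s) / active_s


# Hypothetical hour of traffic: 3000 s active, 200 s idle, 100 s setting up.
overhead = deployment_overhead(setup_s=100, idle_s=200, active_s=3000)
print(f"{overhead:.0%} more billed time than active-only billing")
```

A small ratio means the deployment costs close to what active-only billing would; a large one suggests the scaling settings keep instances idle for too long.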
Failed and canceled runs
For public models, if a run fails, we don’t charge you for its time. However, if you cancel a run, we charge you for the time it ran up until that point.
For private models and deployments, failed and canceled runs are billed for the time the instances they ran on were active, as normal.
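The rules for public models can be expressed as a small function. This is only a sketch: the `Status` enum and `public_run_billed_seconds` are hypothetical names chosen for illustration, not part of Replicate's API.

```python
from enum import Enum


class Status(Enum):
    """Hypothetical labels for how a run ended."""
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    CANCELED = "canceled"


def public_run_billed_seconds(status: Status, seconds_run: float) -> float:
    """Billed seconds for one run of a public model under the rules above."""
    if status is Status.FAILED:
        return 0.0  # failed runs on public models are not charged
    # Succeeded runs are billed in full; canceled runs are billed
    # for the time they ran before being canceled.
    return seconds_run


print(public_run_billed_seconds(Status.FAILED, 8.0))
print(public_run_billed_seconds(Status.CANCELED, 8.0))
```

For private models and deployments no such per-run rule applies, since billing follows the time instances are online rather than the outcome of individual runs.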
Hardware
Different models run on different hardware. You’ll find the hardware specifications under the "Run time and cost" heading on each model’s page. Check out stability-ai/sdxl for an example.
If a model is one you created on Replicate, you can adjust which hardware to use in the model’s settings. You can also specify hardware for a deployment.
Billing
At the beginning of each month, we charge you for what you used in the previous month.
The minimum billable unit for an individual run of a public model is 1 second or 1 token.
Sometimes, when your usage exceeds certain thresholds for the first time, or after you change your payment method, we charge you early for some of the month’s usage. We do this to help prevent fraudulent use of Replicate.
You can find your current usage and manage your billing settings on your account page.
Free limits
You can try featured models out on Replicate for free, but after a bit you’ll be asked to set up billing.
Some features are only available to customers with billing set up.