Replicate's API has three different endpoints for creating predictions, depending on the type of model you want to run:

- `predictions.create`
- `models.predictions.create`
- `deployments.predictions.create`
There are two modes for creating predictions with the API: synchronous (sync) and asynchronous (async).
Here's a brief summary of their differences and use cases:
- **Sync mode:** holds the request open and returns the model output directly in the response. Best for models that finish in a few seconds.
- **Async mode (default):** returns immediately with a prediction ID; you fetch the results later via polling or a webhook.
Choose sync for speed and simplicity, or async for flexibility and managing more time-consuming predictions.
Sync mode is optimized to return model output as quickly as possible, and is suited for real-time applications or when immediate results are needed. Sync mode is best for models that take just a few seconds to run.
Synchronous predictions hold the request open for a specified duration, which defaults to 60 seconds. If the model finishes running within this time, the response contains the prediction object with the `output` field populated.
Enable sync mode by setting the `Prefer: wait` HTTP header in your API request.
The examples on this page are written in cURL, but you can also create predictions using Replicate's JavaScript and Python clients.
Example cURL request:
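As a sketch, a sync request might look like the following. The model name and input are illustrative placeholders; substitute the model and input schema you actually want to run:

```shell
curl -s -X POST \
  -H "Authorization: Bearer $REPLICATE_API_TOKEN" \
  -H "Content-Type: application/json" \
  -H "Prefer: wait" \
  -d '{"input": {"prompt": "a photo of a bear"}}' \
  https://api.replicate.com/v1/models/black-forest-labs/flux-schnell/predictions
```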
The response will be the prediction object, with the `output` field populated with model results and the status usually in a terminal state:
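An illustrative response, abbreviated for clarity. The field values here are placeholders, and the exact shape of `output` depends on the model:

```json
{
  "id": "gm3qorzdhgbfurvjtvhg6dckhu",
  "status": "succeeded",
  "input": {
    "prompt": "a photo of a bear"
  },
  "output": "https://replicate.delivery/example/output.webp",
  "urls": {
    "get": "https://api.replicate.com/v1/predictions/gm3qorzdhgbfurvjtvhg6dckhu",
    "cancel": "https://api.replicate.com/v1/predictions/gm3qorzdhgbfurvjtvhg6dckhu/cancel"
  }
}
```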
The default duration for sync mode is 60 seconds, but you can specify a different timeout in the header if needed. For example, `Prefer: wait=5` will wait for 5 seconds.
If the model doesn't finish within the specified duration, the request returns the incomplete prediction object with status set to `starting` or `processing`. You can then fetch the prediction again via the URL provided in the `Location` header, or the `urls.get` field, as with async mode.
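For example, fetching the prediction again might look like this (the prediction ID is a placeholder):

```shell
curl -s \
  -H "Authorization: Bearer $REPLICATE_API_TOKEN" \
  https://api.replicate.com/v1/predictions/gm3qorzdhgbfurvjtvhg6dckhu
```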
For models that produce files as output, Replicate responds as soon as all the files are available. In this case, the `output` field will contain all file outputs, but `status` may still be `processing`, and `completed_at` and `metrics` may not yet be populated.
If you prefer not to use the blocking API, you can opt for polling mode. This lets you handle predictions asynchronously and is useful if you want to avoid holding a connection open. To use polling mode, pass the appropriate argument to the `run()` method in your language of choice. For more details, see the Output files documentation.
Async mode is ideal for cases where you don't need the output immediately, or when the output is large and you want to avoid blocking the request.
To use async mode, you don't need to set any special headers or parameters. The default behavior of the API is to use async mode.
Async mode returns immediately with a prediction ID and an incomplete prediction object.
Here's an example async request using webhooks to get the prediction results later:
Example request body:
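A request body for this might look like the following sketch. The webhook URL is a placeholder for your own handler, and `webhook_events_filter` narrows which events trigger it:

```json
{
  "input": {
    "prompt": "a photo of a bear"
  },
  "webhook": "https://example.com/replicate-webhook",
  "webhook_events_filter": ["completed"]
}
```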
Example cURL request:
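A sketch of the corresponding request, with a placeholder model name and webhook URL. Note there is no `Prefer: wait` header, so the API responds immediately:

```shell
curl -s -X POST \
  -H "Authorization: Bearer $REPLICATE_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
        "input": {"prompt": "a photo of a bear"},
        "webhook": "https://example.com/replicate-webhook",
        "webhook_events_filter": ["completed"]
      }' \
  https://api.replicate.com/v1/models/black-forest-labs/flux-schnell/predictions
```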
The response will contain a prediction in the starting state:
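An abbreviated, illustrative example of the initial response (field values are placeholders):

```json
{
  "id": "gm3qorzdhgbfurvjtvhg6dckhu",
  "status": "starting",
  "output": null,
  "urls": {
    "get": "https://api.replicate.com/v1/predictions/gm3qorzdhgbfurvjtvhg6dckhu",
    "cancel": "https://api.replicate.com/v1/predictions/gm3qorzdhgbfurvjtvhg6dckhu/cancel"
  }
}
```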
When the prediction has completed, the webhook URL you provided will be called with the final prediction data:
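The webhook body is the full prediction object. An abbreviated, illustrative example with placeholder values:

```json
{
  "id": "gm3qorzdhgbfurvjtvhg6dckhu",
  "status": "succeeded",
  "output": "https://replicate.delivery/example/output.webp",
  "metrics": {
    "predict_time": 3.1
  }
}
```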
An alternative to using webhooks is polling: making repeated API requests to fetch the prediction until it is in a terminal state (`succeeded` or `failed`). This method is useful if you're not able to provide a webhook handler.
To poll for updates, you can periodically send GET requests to the prediction URL. The prediction URL is provided in the `urls.get` field of the initial prediction response, as well as in the `Location` header.
Here's a basic example of how polling might work:
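A minimal shell sketch, assuming `jq` is installed and `PREDICTION_URL` holds the value of `urls.get` from the initial response:

```shell
# Fetch the prediction every 2 seconds until it reaches a terminal state.
while true; do
  STATUS=$(curl -s \
    -H "Authorization: Bearer $REPLICATE_API_TOKEN" \
    "$PREDICTION_URL" | jq -r '.status')
  echo "status: $STATUS"
  if [ "$STATUS" = "succeeded" ] || [ "$STATUS" = "failed" ]; then
    break
  fi
  sleep 2
done
```

In production you'd typically also add an overall timeout and some backoff between requests.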
This approach allows you to check the status of your prediction at regular intervals until it's finished processing.
Check out the documentation for `predictions.get` for more information.