Replicate's API supports server-sent event streams (SSEs) for language models. This guide will show you how to consume streaming output.
Streaming output allows you to receive real-time progressive updates while a language model processes your input. Instead of waiting for the entire prediction to complete, you can access results as they are generated, making it ideal for applications like chat bots that require immediate responses.
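At a high level, streaming output works like this: you create a prediction with the stream option, the returned prediction includes a stream URL, and you connect to that URL to receive events as they are emitted.
Streaming output is supported by lots of language models, including several variations of Llama 3.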
For a full list of models that support streaming output, see the streaming language models collection.
When you create a prediction, specify the stream option to request a URL to receive streaming output using server-sent events (SSE). If the requested model version supports streaming, then the returned prediction will have a stream entry in its urls property with a URL that you can use to construct an EventSource.
EventSource is a standard web browser API for receiving server-sent events. It allows the server to push real-time updates to the browser without needing a full two-way connection like WebSockets.
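As a rough illustration of the request described above, the sketch below builds a prediction request with the stream option set and reads the stream URL from the response. The helper names buildStreamingRequest and getStreamUrl are hypothetical, and the endpoint path and request body shape are assumptions that should be checked against the API reference.

```javascript
// Hypothetical helper: builds a request that creates a prediction with
// streaming enabled. The endpoint path and body shape are assumptions.
function buildStreamingRequest(model, input, token) {
  return {
    url: `https://api.replicate.com/v1/models/${model}/predictions`,
    options: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${token}`,
        "Content-Type": "application/json",
      },
      // The stream option asks for a stream URL in the returned prediction.
      body: JSON.stringify({ input, stream: true }),
    },
  };
}

// Hypothetical helper: returns the stream URL from a prediction, or null
// when the prediction has no stream entry in its urls property.
function getStreamUrl(prediction) {
  return prediction?.urls?.stream ?? null;
}
```

The returned url and options can be passed to fetch, and the parsed JSON response to getStreamUrl. A null result suggests the model version does not support streaming.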
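import Replicate from "replicate";
// By default the client reads the API token from REPLICATE_API_TOKEN.
const replicate = new Replicate();
const stream = replicate.stream("meta/meta-llama-3-70b-instruct", {
  input: { prompt: "Tell me a story" },
});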
You can then process events from this stream.
curl -X GET -H "Authorization: Bearer $REPLICATE_API_TOKEN" \
-H "Accept: text/event-stream" \
"https://streaming.api.replicate.com/v1/predictions/fuwwvjtbdmroc4xifxdcwqtdfq"
event: output
id: 1690212292:0
data: Once upon a time...
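If you read the raw response yourself rather than through a client library, the text has to be split into events. The following is a minimal sketch of a parser for the event-stream format shown above. The function name parseEventStream is hypothetical, and the sketch assumes the full text is already in memory; a production consumer would parse incrementally as chunks arrive.

```javascript
// Parses raw text/event-stream content into { event, id, data } objects.
function parseEventStream(text) {
  const events = [];
  // Events are separated from each other by a blank line.
  for (const block of text.split(/\r?\n\r?\n/)) {
    let event = "message"; // default type when no event field is present
    let id = null;
    const dataLines = [];
    for (const line of block.split(/\r?\n/)) {
      // Lines starting with a colon are comments and carry no fields.
      if (line === "" || line.startsWith(":")) continue;
      const colon = line.indexOf(":");
      const field = colon === -1 ? line : line.slice(0, colon);
      let value = colon === -1 ? "" : line.slice(colon + 1);
      if (value.startsWith(" ")) value = value.slice(1); // drop one leading space
      if (field === "event") event = value;
      else if (field === "id") id = value;
      else if (field === "data") dataLines.push(value);
    }
    // Blocks without any data field do not produce an event.
    if (dataLines.length > 0) {
      events.push({ event, id, data: dataLines.join("\n") });
    }
  }
  return events;
}
```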
To receive streaming output, construct an EventSource using the stream URL from the prediction:
const output = [];
for await (const { event, data } of stream) {
  if (event === "output") {
    output.push(data);
  }
}
console.log(output.join(""));
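In a browser, similar logic can be written against EventSource directly. The sketch below is one possible arrangement, not the client library's implementation. The helper collectOutput is hypothetical and accepts any object with addEventListener and close, so it is expected to work with new EventSource(prediction.urls.stream). The "connection error" detail is a placeholder value.

```javascript
// Attaches listeners to an EventSource-like object and collects its output.
function collectOutput(source) {
  const state = { chunks: [], error: null, done: false };
  source.addEventListener("output", (e) => {
    state.chunks.push(e.data);
  });
  source.addEventListener("error", (e) => {
    // Server-sent error events carry a JSON payload. Connection-level
    // errors raised by EventSource itself have no data.
    state.error = e.data ? JSON.parse(e.data) : { detail: "connection error" };
  });
  source.addEventListener("done", () => {
    state.done = true;
    source.close(); // stop EventSource from reconnecting
  });
  return state;
}
```

Closing the connection in the done handler matters because EventSource reconnects automatically when the server ends the response.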
A prediction's event stream consists of the following event types:
| event | format | description |
|---|---|---|
| output | plain text | Emitted when the prediction returns new output |
| error | JSON | Emitted when the prediction returns an error |
| done | JSON | Emitted when the prediction finishes |
A done event is emitted when a prediction finishes successfully, is cancelled, or produces an error.
If a prediction completes successfully, it receives a done event with an empty JSON payload.
event: output
id: 1690212292:0
data: Once upon a time...
event: output
id: 1690212293:0
data: The End.
event: done
data: {}
If a prediction is cancelled, it receives a done event with a JSON payload {"reason": "canceled"}.
event: output
id: 1690212292:0
data: Once upon a time...
event: done
data: {"reason": "canceled"}
If a prediction produces an error, it receives an error event with a JSON payload for the error followed by a done event with a JSON payload {"reason": "error"}.
event: output
id: 1690212292:0
data: Once upon a time...
event: error
data: {"detail": "Something went wrong"}
event: done
data: {"reason": "error"}
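The three outcomes above can be told apart by the payload of the done event. The sketch below reduces a list of already-parsed events to a final result. The function name summarizeEvents and the status labels "pending" and "succeeded" are illustrative assumptions; only the "canceled" and "error" reasons come from the payloads shown above.

```javascript
// Reduces a list of { event, data } objects to a final result.
function summarizeEvents(events) {
  const result = { text: "", status: "pending", error: null };
  for (const { event, data } of events) {
    if (event === "output") {
      result.text += data; // output payloads are plain text
    } else if (event === "error") {
      result.error = JSON.parse(data); // error payloads are JSON
    } else if (event === "done") {
      const reason = JSON.parse(data).reason;
      // An empty payload ({}) means the prediction finished successfully.
      result.status = reason ?? "succeeded";
      break; // nothing meaningful follows a done event
    }
  }
  return result;
}
```

A status of "pending" after the loop would mean the list ended without a done event, for example because the connection dropped.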
There is a 30-second timeout on the event stream endpoint. When it is reached, an empty event is sent down the stream with the text "408: 408 Request Timeout".
:408: 408 Request Timeout
This will usually occur if you try to connect to the stream after the prediction has been deleted (API predictions expire after 1 hour) or if the client has failed to process the done event and close the connection.
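A client that reads the raw stream line by line may want to recognize this timeout marker so that it can stop reading instead of waiting for more events. The sketch below shows one way to classify lines. The function name classifyLine and its return labels are hypothetical, and the check assumes the timeout text keeps the form shown above.

```javascript
// Classifies a single line of a raw event stream.
function classifyLine(line) {
  if (line === "") return "blank"; // blank lines separate events
  if (line.startsWith(":")) {
    // Lines starting with a colon carry no event fields. The timeout
    // marker is recognized by the 408 status text that follows the colon.
    return line.slice(1).trim().startsWith("408") ? "timeout" : "comment";
  }
  return "field"; // for example "event: output" or "data: ..."
}
```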