# Streaming output
Replicate’s API supports server-sent event (SSE) streams for language models. This guide will show you how to consume streaming output.
Streaming is in public beta, and is only available for some models. If you have questions or feedback, email us at team@replicate.com.
## What is streaming output?
Streaming output allows you to receive real-time progressive updates while a language model processes your input. Instead of waiting for the entire prediction to complete, you can access results as they are generated, making it ideal for applications like chat bots that require immediate responses.
At a high level, streaming output works like this:
- You create a prediction with the `stream` option.
- Replicate returns a prediction with a URL to receive streaming output.
- You connect to the URL and receive a stream of updates.
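Here's a minimal sketch of that flow using the HTTP API directly from JavaScript (assuming Node.js 18+, which provides `fetch` globally, and a `REPLICATE_API_TOKEN` environment variable); the model version is the Llama 2 version used throughout this guide:

```js
// Step 1: create a prediction with the "stream" option.
const response = await fetch("https://api.replicate.com/v1/predictions", {
  method: "POST",
  headers: {
    Authorization: `Token ${process.env.REPLICATE_API_TOKEN}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    version: "2c1608e18606fad2812020dc541930f2d0495ce32eee50074220b87300bc16e1",
    input: { prompt: "Tell me a story" },
    stream: true,
  }),
});

// Step 2: the returned prediction includes a streaming URL.
const prediction = await response.json();
console.log(prediction.urls.stream);

// Step 3: connect to that URL and receive a stream of updates
// (shown later in this guide).
```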
## Which models support streaming output?
Streaming output is supported by lots of language models, including several variations of Llama 2:
- meta/llama-2-70b-chat: 70 billion parameter model fine-tuned on chat completions. If you want to build a chat bot with the best accuracy, this is the one to use.
- meta/llama-2-70b: 70 billion parameter base model. Use this if you want to do other kinds of language completions, like completing a user’s writing.
- meta/llama-2-13b-chat: 13 billion parameter model fine-tuned on chat completions. Use this if you’re building a chat bot and would prefer it to be faster and cheaper at the expense of accuracy.
- meta/llama-2-7b-chat: 7 billion parameter model fine-tuned on chat completions. This is an even smaller, faster model.
For a full list of models that support streaming output, see the streaming language models collection.
## Requesting streaming output
When you create a prediction, specify the `stream` option to request a URL to receive streaming output using server-sent events (SSE). If the requested model version supports streaming, the returned prediction will have a `stream` entry in its `urls` property, with a URL that you can use to construct an EventSource.
EventSource is a standard web browser API for receiving server-sent events. It allows the server to push real-time updates to the browser without needing a full two-way connection like WebSockets.
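As a bare-bones illustration (the URL here is hypothetical), note that EventSource delivers named events like Replicate's `output` via `addEventListener`, not via the default `message` handler:

```js
// A minimal, generic EventSource sketch; the URL is hypothetical.
const source = new EventSource("https://example.com/stream");

// Named events (like "output") need addEventListener;
// the default "message" handler won't fire for them.
source.addEventListener("output", (event) => {
  console.log(event.data); // each event's payload arrives as a string
});
```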
For example, to request streaming output for meta/llama-2-70b-chat:

```sh
# https://replicate.com/meta/llama-2-70b-chat
$ curl -X POST \
  -H "Authorization: Token $REPLICATE_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"input": {"prompt": "Tell me a story"}, "stream": true, "version": "2c1608e18606fad2812020dc541930f2d0495ce32eee50074220b87300bc16e1"}' \
  "https://api.replicate.com/v1/predictions"
```
```js
import Replicate from "replicate";

const replicate = new Replicate({ auth: process.env.REPLICATE_API_TOKEN });

const prediction = await replicate.predictions.create({
  version: "2c1608e18606fad2812020dc541930f2d0495ce32eee50074220b87300bc16e1",
  input: { prompt: "Tell me a story" },
  stream: true,
});
```
The prediction in the response will look something like this:

```json
{
  "id": "fuwwvjtbdmroc4xifxdcwqtdfq",
  "version": "2c1608e18606fad2812020dc541930f2d0495ce32eee50074220b87300bc16e1",
  "input": {
    "prompt": "Tell me a story"
  },
  "logs": "",
  "error": null,
  "status": "starting",
  "created_at": "2023-07-25T13:43:38.897083857Z",
  "stream": true,
  "urls": {
    "cancel": "https://api.replicate.com/v1/predictions/fuwwvjtbdmroc4xifxdcwqtdfq/cancel",
    "get": "https://api.replicate.com/v1/predictions/fuwwvjtbdmroc4xifxdcwqtdfq",
    "stream": "https://streaming.api.replicate.com/v1/predictions/fuwwvjtbdmroc4xifxdcwqtdfq"
  }
}
```
## Receiving streaming output
To receive streaming output, connect to the `stream` URL from the prediction and request the `text/event-stream` content type:

```sh
$ curl -X GET \
  -H "Authorization: Token $REPLICATE_API_TOKEN" \
  -H "Accept: text/event-stream" \
  "https://streaming.api.replicate.com/v1/predictions/fuwwvjtbdmroc4xifxdcwqtdfq"
```

```
event: output
id: 1690212292:0
data: Once upon a time...
```

In JavaScript, construct an EventSource using the `stream` URL from the prediction:
```js
if (prediction && prediction.urls && prediction.urls.stream) {
  const source = new EventSource(prediction.urls.stream, { withCredentials: true });

  source.addEventListener("output", (e) => {
    console.log("output", e.data);
  });

  source.addEventListener("error", (e) => {
    console.error("error", JSON.parse(e.data));
  });

  source.addEventListener("done", (e) => {
    source.close();
    console.log("done", JSON.parse(e.data));
  });
}
```
If you want to run this code from Node.js, you'll need to install the `eventsource` package:

```sh
npm install eventsource
```

Then, import `EventSource` at the top of your file:

```js
import EventSource from "eventsource";
```
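Putting it all together, here's a sketch of a complete Node.js script that creates a prediction and streams its output; it assumes the `replicate` and `eventsource` packages are installed and `REPLICATE_API_TOKEN` is set:

```js
import Replicate from "replicate";
import EventSource from "eventsource";

const replicate = new Replicate({ auth: process.env.REPLICATE_API_TOKEN });

// Create a prediction with streaming enabled.
const prediction = await replicate.predictions.create({
  version: "2c1608e18606fad2812020dc541930f2d0495ce32eee50074220b87300bc16e1",
  input: { prompt: "Tell me a story" },
  stream: true,
});

// Connect to the stream and print output as it arrives.
if (prediction.urls && prediction.urls.stream) {
  const source = new EventSource(prediction.urls.stream, { withCredentials: true });

  source.addEventListener("output", (e) => process.stdout.write(e.data));
  source.addEventListener("error", (e) => console.error(JSON.parse(e.data)));
  source.addEventListener("done", () => {
    source.close();
    console.log("\ndone");
  });
}
```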
A prediction’s event stream consists of the following event types:
| event | format | description |
|---|---|---|
| `output` | plain text | Emitted when the prediction returns new output |
| `error` | JSON | Emitted when the prediction returns an error |
| `done` | JSON | Emitted when the prediction finishes |
A `done` event is emitted when a prediction finishes successfully, is cancelled, or produces an error. If a prediction completes successfully, it receives a `done` event with an empty JSON payload:
```
event: output
id: 1690212292:0
data: Once upon a time...

event: output
id: 1690212293:0
data: The End.

event: done
data: {}
```
If a prediction is cancelled, it receives a `done` event with a JSON payload `{"reason": "canceled"}`:
```
event: output
id: 1690212292:0
data: Once upon a time...

event: done
data: {"reason": "canceled"}
```
If a prediction produces an error, it receives an `error` event with a JSON payload describing the error, followed by a `done` event with a JSON payload `{"reason": "error"}`:
```
event: output
id: 1690212292:0
data: Once upon a time...

event: error
data: {"detail": "Something went wrong"}

event: done
data: {"reason": "error"}
```
## Further reading
- Check out llama.replicate.dev to see an example of streaming output in a Next.js app.
- Read the Replicate Node.js client API docs for usage details for Node.js and browsers.
- Compare streaming models using Vercel’s AI playground.
- Learn how to use Vercel’s AI SDK to stream Replicate models in JavaScript apps.