Streaming output

Replicate’s API supports streaming output via server-sent events (SSE) for language models. This guide shows you how to consume streaming output.

Streaming is in public beta, and is only available for some models. If you have questions or feedback, email us at team@replicate.com.

What is streaming output?

Streaming output allows you to receive real-time progressive updates while a language model processes your input. Instead of waiting for the entire prediction to complete, you can access results as they are generated, making it ideal for applications like chat bots that require immediate responses.

At a high level, streaming output works like this:

  1. You create a prediction with the stream option.
  2. Replicate returns a prediction with a URL to receive streaming output.
  3. You connect to the URL and receive a stream of updates.
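The steps above can be sketched in code (an illustrative sketch only, not a client library: `buildStreamingRequest` and `getStreamUrl` are hypothetical helper names, and a real client would POST the body to the predictions endpoint with an `Authorization` header, as shown later in this guide):

```javascript
// Step 1: request streaming output by setting the "stream" option
// on the prediction you create.
function buildStreamingRequest(version, input) {
  return { version, input, stream: true };
}

// Step 2: Replicate returns a prediction whose urls.stream points at the
// server-sent event stream. Step 3 is connecting to that URL.
function getStreamUrl(prediction) {
  return prediction?.urls?.stream ?? null;
}

const body = buildStreamingRequest(
  "2c1608e18606fad2812020dc541930f2d0495ce32eee50074220b87300bc16e1",
  { prompt: "Tell me a story" }
);
console.log(JSON.stringify(body));
```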

Which models support streaming output?

Streaming output is supported by many language models, including several variants of Llama 2:

  • meta/llama-2-70b-chat: 70 billion parameter model fine-tuned on chat completions. If you want to build a chat bot with the best accuracy, this is the one to use.
  • meta/llama-2-70b: 70 billion parameter base model. Use this if you want to do other kinds of language completions, like completing a user’s writing.
  • meta/llama-2-13b-chat: 13 billion parameter model fine-tuned on chat completions. Use this if you’re building a chat bot and would prefer it to be faster and cheaper at the expense of accuracy.
  • meta/llama-2-7b-chat: 7 billion parameter model fine-tuned on chat completions. This is an even smaller, faster model.

For a full list of models that support streaming output, see the streaming language models collection.

Requesting streaming output

When you create a prediction, specify the stream option to request a URL to receive streaming output using server-sent events (SSE).

If the requested model version supports streaming, then the returned prediction will have a stream entry in its urls property with a URL that you can use to construct an EventSource.

EventSource is a standard web browser API for receiving server-sent events. It allows the server to push real-time updates to the browser without needing a full two-way connection like WebSockets.

For example, with cURL:

# https://replicate.com/meta/llama-2-70b-chat

$ curl -X POST -H "Authorization: Token $REPLICATE_API_TOKEN" \
      -d '{"input": {"prompt": "Tell me a story"}, "stream": true, "version": "2c1608e18606fad2812020dc541930f2d0495ce32eee50074220b87300bc16e1"}' \
      "https://api.replicate.com/v1/predictions"

Or with the JavaScript client library:

const prediction = await replicate.predictions.create({
  version: "2c1608e18606fad2812020dc541930f2d0495ce32eee50074220b87300bc16e1",
  input: { prompt: "Tell me a story" },
  stream: true,
});

The response includes a stream URL in its urls property:
{
  "id": "fuwwvjtbdmroc4xifxdcwqtdfq",
  "version": "2c1608e18606fad2812020dc541930f2d0495ce32eee50074220b87300bc16e1",
  "input": {
    "prompt": "Tell me a story"
  },
  "logs": "",
  "error": null,
  "status": "starting",
  "created_at": "2023-07-25T13:43:38.897083857Z",
  "stream": true,
  "urls": {
    "cancel": "https://api.replicate.com/v1/predictions/fuwwvjtbdmroc4xifxdcwqtdfq/cancel",
    "get": "https://api.replicate.com/v1/predictions/fuwwvjtbdmroc4xifxdcwqtdfq",
    "stream": "https://streaming.api.replicate.com/v1/predictions/fuwwvjtbdmroc4xifxdcwqtdfq"
  }
}

Receiving streaming output

To receive streaming output, connect to the stream URL from the prediction and listen for server-sent events. You can try this with cURL by sending a GET request with an Accept: text/event-stream header:

$ curl -X GET -H "Authorization: Token $REPLICATE_API_TOKEN" \
      -H "Accept: text/event-stream" \
      "https://streaming.api.replicate.com/v1/predictions/fuwwvjtbdmroc4xifxdcwqtdfq"
event: output
id: 1690212292:0
data: Once upon a time...


In a browser, construct an EventSource using the stream URL from the prediction:

if (prediction && prediction.urls && prediction.urls.stream) {
  const source = new EventSource(prediction.urls.stream, { withCredentials: true });

  source.addEventListener("output", (e) => {
    console.log("output", e.data);
  });

  source.addEventListener("error", (e) => {
    console.error("error", JSON.parse(e.data));
  });

  source.addEventListener("done", (e) => {
    source.close();
    console.log("done", JSON.parse(e.data));
  });
}

If you want to run this code from Node.js, you'll need to install the eventsource package:

npm install eventsource

Then, import EventSource at the top of your file:

import EventSource from "eventsource";

A prediction’s event stream consists of the following event types:

  event   format      description
  ------  ----------  ----------------------------------------------
  output  plain text  Emitted when the prediction returns new output
  error   JSON        Emitted when the prediction returns an error
  done    JSON        Emitted when the prediction finishes
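Per the table, output data is plain text while error and done payloads are JSON. A client might normalize an event's data field like this (a minimal sketch; decodeEvent is a hypothetical helper, not part of any SDK):

```javascript
// Decode an SSE event's data field according to its event type:
// "output" is plain text; "error" and "done" carry JSON payloads.
function decodeEvent(type, data) {
  if (type === "output") {
    return data; // plain text chunk of model output
  }
  if (type === "error" || type === "done") {
    return JSON.parse(data);
  }
  throw new Error(`unknown event type: ${type}`);
}

console.log(decodeEvent("output", "Once upon a time..."));
console.log(decodeEvent("done", '{"reason": "canceled"}').reason);
```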

A done event is emitted when a prediction finishes successfully, is cancelled, or produces an error.

If a prediction completes successfully, the stream ends with a done event carrying an empty JSON payload:

event: output
id: 1690212292:0
data: Once upon a time...


event: output
id: 1690212293:0
data: The End.


event: done
data: {}

If a prediction is cancelled, the stream ends with a done event carrying the JSON payload {"reason": "canceled"}:

event: output
id: 1690212292:0
data: Once upon a time...


event: done
data: {"reason": "canceled"}

If a prediction produces an error, the stream receives an error event with a JSON payload describing the error, followed by a done event carrying the JSON payload {"reason": "error"}:

event: output
id: 1690212292:0
data: Once upon a time...


event: error
data: {"detail": "Something went wrong"}


event: done
data: {"reason": "error"}
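If you're consuming the raw stream yourself (for example, from cURL output or a manual HTTP client) rather than using EventSource, the wire format shown above can be split into events with a small function. This is a minimal sketch of SSE framing (events separated by blank lines, fields as `name: value` lines); parseSSE is a hypothetical helper, and a complete SSE parser would also handle multi-line data fields, retry fields, and comment lines:

```javascript
// Parse raw server-sent event text into objects like
// { event: "output", id: "1690212292:0", data: "Once upon a time..." }.
function parseSSE(text) {
  return text
    .split("\n\n")                       // events are separated by blank lines
    .map((chunk) => chunk.trim())
    .filter((chunk) => chunk.length > 0)
    .map((chunk) => {
      const fields = {};
      for (const line of chunk.split("\n")) {
        const i = line.indexOf(": ");    // split each "name: value" field
        if (i === -1) continue;
        fields[line.slice(0, i)] = line.slice(i + 2);
      }
      return fields;
    });
}

const raw =
  "event: output\nid: 1690212292:0\ndata: Once upon a time...\n\n" +
  "event: done\ndata: {}\n";
console.log(parseSSE(raw));
```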

Further reading