You know when you're using ChatGPT or Vercel's AI playground and it returns an animated response, rendered word by word? That's not just a dramatic visual effect to make it look like there's a robot typing on the other side of the conversation. That's actually the language model generating tokens one at a time, and streaming them back to you while it's running.
Replicate already provides ways for you to receive incremental updates as your predictions are running, through polling and webhooks. But those aren't always the most efficient methods to get updates from a running model. When you're building something like a chat app, what you really need is a live-updating event stream.
Replicate's API now supports server-sent event streams for language models. This lets you update your app live, as the model is running. In this post we'll show you how to consume streaming responses from language models on Replicate.
At a high level, consuming an event stream on Replicate works like this:

1. Create a prediction with the `stream` option set to `true`.
2. Get the stream URL from the prediction response.
3. Connect to the stream URL and receive server-sent events as the model runs.

Let's walk through an example using Replicate's Node.js client.
First, create a prediction using llama-2-70b-chat, setting the `stream` option to `true`:
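Here's a minimal sketch using the `replicate` npm package — the version ID and prompt below are placeholders:

```javascript
import Replicate from "replicate";

const replicate = new Replicate({
  auth: process.env.REPLICATE_API_TOKEN,
});

const prediction = await replicate.predictions.create({
  // Placeholder — substitute the current llama-2-70b-chat version ID.
  version: "<llama-2-70b-chat-version-id>",
  input: {
    prompt: "Tell me a story about llamas",
  },
  // Request a streaming URL in the prediction response.
  stream: true,
});
```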
Note the `stream` URL in the prediction response:
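An abridged sketch of that response — the ID and URLs here are placeholders, but the streaming URL appears under `urls.stream`:

```json
{
  "id": "<prediction-id>",
  "status": "starting",
  "urls": {
    "get": "https://api.replicate.com/v1/predictions/<prediction-id>",
    "cancel": "https://api.replicate.com/v1/predictions/<prediction-id>/cancel",
    "stream": "<streaming-url-for-this-prediction>"
  }
}
```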
To receive streaming output, construct an `EventSource` in your browser-side JavaScript code using the stream URL from the prediction:
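A minimal sketch, assuming the `prediction` object from the previous step and the event names (`output`, `error`, `done`) described in Replicate's streaming docs:

```javascript
const source = new EventSource(prediction.urls.stream);

// Each "output" event carries a chunk of the model's output.
source.addEventListener("output", (e) => {
  console.log("output", e.data);
});

// Connection-level errors have no data; Replicate's error events carry JSON.
source.addEventListener("error", (e) => {
  console.error("error", e.data ? JSON.parse(e.data) : e);
});

// A "done" event means the prediction is finished, so close the connection.
source.addEventListener("done", (e) => {
  source.close();
  console.log("done", JSON.parse(e.data));
});
```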
The browser's built-in EventSource API is useful for building web apps, but the responses are standard HTTP event stream responses, so you don't have to use a browser to consume them. You can also receive streaming output using the programming language of your choice, or use command-line tools like cURL and jq to display the output right in your terminal.
Copy and paste the commands below in your shell to do the following:

- use `curl` to create a prediction with llama-2-70b-chat
- use `jq` to pluck out the stream URL and print it out
- use `curl` again to connect to the stream URL and receive a stream of updates
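A sketch of those commands, assuming `REPLICATE_API_TOKEN` is set in your environment; the model version ID is a placeholder:

```shell
# Create a prediction with llama-2-70b-chat, requesting streaming output.
curl -s -X POST \
  -H "Authorization: Token $REPLICATE_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"version": "<llama-2-70b-chat-version-id>", "input": {"prompt": "Tell me a story about llamas"}, "stream": true}' \
  "https://api.replicate.com/v1/predictions" > prediction.json

# Pluck out the stream URL and print it out.
STREAM_URL=$(jq -r .urls.stream prediction.json)
echo $STREAM_URL

# Connect to the stream URL and receive a stream of updates.
curl -s -N -H "Accept: text/event-stream" "$STREAM_URL"
```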
cURL will print out a stream of updates from the model until the connection is closed:
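The output is a series of server-sent events, roughly like this (the tokens and payloads shown here are illustrative):

```
event: output
data: Once

event: output
data:  upon

event: output
data:  a

event: output
data:  time

event: done
data: {}
```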
Streaming output is already supported by lots of language models on Replicate, including Falcon, Vicuna, StableLM, and of course... Llama 2 🦙. For a full list of models that support streaming output, see the streaming language models collection on Replicate.
When publishing your own public or private language models to Replicate, you should make sure they support streaming so users of your model will have the best possible experience.
If you're fine-tuning an existing language model, then you're already set: Your fine-tuned model will automatically inherit the streaming support from the base model.
If you're writing your own model using Cog, the key is to `yield` tokens as they're generated, instead of `return`ing the final result from a function. Use `ConcatenateIterator` to hint that the output should be concatenated together into a single string. Here's an example:
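A minimal sketch of a streaming Cog predictor — `generate_tokens` here is a placeholder for your model's own generation loop:

```python
from cog import BasePredictor, ConcatenateIterator, Input


def generate_tokens(prompt: str):
    # Placeholder generation loop — replace with your model's real decoding code.
    for word in ("Once", " upon", " a", " time"):
        yield word


class Predictor(BasePredictor):
    def predict(
        self,
        prompt: str = Input(description="Prompt to send to the model"),
    ) -> ConcatenateIterator[str]:
        # Yield tokens as they're generated instead of returning the final string.
        for token in generate_tokens(prompt):
            yield token
```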
For more details, check out the Cog documentation on streaming output.
Follow @replicate on X (formerly Twitter) to keep up as we add streaming support to more models.