Streaming output for language models
Posted by @zeke
You know when you’re using ChatGPT or Vercel’s AI playground and it returns an animated response, rendered word by word? That’s not just a dramatic visual effect to make it look like there’s a robot typing on the other side of the conversation. That’s actually the language model generating tokens one at a time, and streaming them back to you while it’s running.
Replicate already provides ways for you to receive incremental updates as your predictions are running, through polling and webhooks. But those aren’t always the most efficient methods to get updates from a running model. When you’re building something like a chat app, what you really need is a live-updating event stream.
Replicate’s API now supports server-sent event streams for language models. This lets you update your app live, as the model is running. In this post we’ll show you how to consume streaming responses from language models on Replicate.
How streaming works
At a high level, consuming an event stream on Replicate works like this:
- You create a prediction with the stream option.
- Replicate returns a prediction with a URL to receive streaming output.
- You connect to the URL in your web browser and receive a stream of updates.
A Node.js example
Let’s walk through an example using Replicate’s Node.js client.
First, create a prediction using llama-2-70b-chat, setting the stream option to true:
import Replicate from "replicate";
const replicate = new Replicate({ auth: process.env.REPLICATE_API_TOKEN });
const prediction = await replicate.predictions.create({
version: "2c1608e18606fad2812020dc541930f2d0495ce32eee50074220b87300bc16e1",
input: { prompt: "Tell me a story" },
stream: true,
});
Note the stream URL in the prediction response:
console.log(prediction.urls.stream);
// https://streaming.api.replicate.com/v1/predictions/fuwwvjtbdmroc4xifxdcwqtdfq
To receive streaming output, construct an EventSource in your browser-side JavaScript code using the stream URL from the prediction:
const source = new EventSource(prediction.urls.stream, {
withCredentials: true,
});
source.addEventListener("output", (e) => {
console.log("output", e.data);
});
source.addEventListener("error", (e) => {
console.error("error", JSON.parse(e.data));
});
source.addEventListener("done", (e) => {
source.close();
console.log("done", JSON.parse(e.data));
});
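Each output event carries a chunk of generated text, so a common pattern is to concatenate the chunks as they arrive and re-render on every update. Here's a minimal sketch, assuming your page has an element with the id "response":
let output = "";
source.addEventListener("output", (e) => {
  // Append the new chunk and re-render the partial response
  output += e.data;
  document.querySelector("#response").textContent = output;
});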
A command-line example using cURL
The browser’s built-in EventSource API is useful for building web apps, but the responses are standard HTTP event stream responses, so you don’t have to use a browser to consume them. You can also receive streaming output using the programming language of your choice, or use command-line tools like cURL and jq to display the output right in your terminal.
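For example, here's a rough sketch of reading the same event stream directly from Node.js with fetch, with no browser or EventSource involved. It assumes Node 18 or later and the stream URL from the prediction created earlier; the event parsing is deliberately simplified rather than a complete SSE implementation:
const response = await fetch(prediction.urls.stream, {
  headers: { Accept: "text/event-stream" },
});

const decoder = new TextDecoder();
let buffer = "";

for await (const chunk of response.body) {
  buffer += decoder.decode(chunk, { stream: true });

  // Events in the stream are separated by a blank line
  let boundary;
  while ((boundary = buffer.indexOf("\n\n")) !== -1) {
    const rawEvent = buffer.slice(0, boundary);
    buffer = buffer.slice(boundary + 2);

    let eventName = "message";
    let data = "";
    for (const line of rawEvent.split("\n")) {
      if (line.startsWith("event:")) eventName = line.slice(6).trim();
      if (line.startsWith("data:")) data += line.slice(5).replace(/^ /, "");
    }

    if (eventName === "output") process.stdout.write(data);
    if (eventName === "done") process.exit(0);
  }
}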
Copy and paste the commands below in your shell to do the following:
- Use curl to create a prediction with llama-2-70b-chat
- Pipe the prediction response into jq to pluck out the stream URL and print it out
- Use curl again to connect to the stream URL and receive a stream of updates
STREAM_URL=$(curl -s -X POST \
-d '{"version": "58d078176e02c219e11eb4da5a02a7830a283b14cf8f94537af893ccff5ee781", "input": {"prompt": "Tell me a story"}, "stream": true}' \
-H "Authorization: Bearer $REPLICATE_API_TOKEN" \
"https://api.replicate.com/v1/predictions" | jq -r .urls.stream)
echo $STREAM_URL
curl -H 'Accept: text/event-stream' $STREAM_URL
cURL will print out a stream of updates from the model until the connection is closed:
event: output
id: 1692041342:0
data: Sure
event: output
id: 1692041342:1
data: !
event: output
id: 1692041342:2
data: Here
event: output
id: 1692041342:3
data: '
event: output
id: 1692041343:0
data: s
event: output
id: 1692041343:1
data: a
event: output
id: 1692041343:2
data: story
event: output
id: 1692041343:3
data: for
event: output
id: 1692041343:4
data: you
Which models support streaming output?
Streaming output is already supported by lots of language models on Replicate, including Falcon, Vicuna, StableLM, and of course… Llama 2 🦙. For a full list of models that support streaming output, see the streaming language models collection.
Adding streaming support to your own models
When publishing your own public or private language models to Replicate, you should make sure they support streaming so users of your model will have the best possible experience.
If you’re fine-tuning an existing language model, then you’re already set: Your fine-tuned model will automatically inherit the streaming support from the base model.
If you’re writing your own model using Cog, the key is to yield tokens as they’re generated, instead of returning the final result from a function. Use ConcatenateIterator to hint that the output should be concatenated together into a single string. Here’s an example:
from cog import BasePredictor, Path, ConcatenateIterator

class Predictor(BasePredictor):
    def predict(self) -> ConcatenateIterator[str]:
        tokens = ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
        # Yield each token as it becomes available instead of returning the full string at the end
        for token in tokens:
            yield token + " "
For more details, check out the Cog documentation on streaming output.
Further reading
- Check out llama.replicate.dev to see an example of streaming output in a Next.js app.
- Read our streaming guide for more details about how to use streaming output on Replicate.
- Read the Replicate Node.js client API docs for usage details for Node.js and browsers.
- Compare streaming models using Vercel’s AI playground.
- Learn how to use Vercel’s AI SDK to stream models on Replicate in JavaScript apps.
Follow @replicate on X (formerly Twitter) to keep up as we add streaming support to more models.