You know when you're using ChatGPT or Vercel's AI playground and it returns an animated response, rendered word by word? That's not just a dramatic visual effect to make it look like there's a robot typing on the other side of the conversation. That's actually the language model generating tokens one at a time, and streaming them back to you while it's running.
Replicate already provides ways for you to receive incremental updates as your predictions are running, through polling and webhooks. But those aren't always the most efficient methods to get updates from a running model. When you're building something like a chat app, what you really need is a live-updating event stream.
Replicate's API now supports server-sent event streams for language models. This lets you update your app live, as the model is running. In this post we'll show you how to consume streaming responses from language models on Replicate.
At a high level, consuming an event stream on Replicate works like this:

1. Create a prediction with the `stream` option set to `true`.
2. Get the stream URL from the prediction response.
3. Connect to the stream URL and receive server-sent events as the model runs.

Let's walk through an example using Replicate's Node.js client.
First, create a prediction using llama-2-70b-chat, setting the `stream` option to `true`:
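Here's a minimal sketch using the `replicate` npm package — the version ID and prompt below are placeholders:

```javascript
import Replicate from "replicate";

const replicate = new Replicate({
  auth: process.env.REPLICATE_API_TOKEN,
});

const prediction = await replicate.predictions.create({
  // Placeholder — substitute the current llama-2-70b-chat version ID.
  version: "<llama-2-70b-chat-version-id>",
  input: {
    prompt: "Tell me a story about llamas",
  },
  // Request a streaming URL in the prediction response.
  stream: true,
});
```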
Note the `stream` URL in the prediction response:
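An abridged sketch of that response — the ID and URLs here are placeholders, but the streaming URL appears under `urls.stream`:

```json
{
  "id": "<prediction-id>",
  "status": "starting",
  "urls": {
    "get": "https://api.replicate.com/v1/predictions/<prediction-id>",
    "cancel": "https://api.replicate.com/v1/predictions/<prediction-id>/cancel",
    "stream": "<streaming-url-for-this-prediction>"
  }
}
```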
To receive streaming output, construct an `EventSource` in your browser-side JavaScript code using the stream URL from the prediction:
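A minimal sketch, assuming the `prediction` object from the previous step and the event names (`output`, `error`, `done`) described in Replicate's streaming docs:

```javascript
const source = new EventSource(prediction.urls.stream);

// Each "output" event carries a chunk of the model's output.
source.addEventListener("output", (e) => {
  console.log("output", e.data);
});

// Connection-level errors have no data; Replicate's error events carry JSON.
source.addEventListener("error", (e) => {
  console.error("error", e.data ? JSON.parse(e.data) : e);
});

// A "done" event means the prediction is finished, so close the connection.
source.addEventListener("done", (e) => {
  source.close();
  console.log("done", JSON.parse(e.data));
});
```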
The browser's built-in EventSource API is useful for building web apps, but the responses are standard HTTP event stream responses, so you don't have to use a browser to consume them. You can also receive streaming output using the programming language of your choice, or use command-line tools like cURL and jq to display the output right in your terminal.
Copy and paste the commands below in your shell to do the following:

- use `curl` to create a prediction with llama-2-70b-chat
- use `jq` to pluck out the stream URL and print it out
- use `curl` again to connect to the stream URL and receive a stream of updates
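A sketch of those commands, assuming `REPLICATE_API_TOKEN` is set in your environment; the model version ID is a placeholder:

```shell
# Create a prediction with llama-2-70b-chat, requesting streaming output.
curl -s -X POST \
  -H "Authorization: Token $REPLICATE_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"version": "<llama-2-70b-chat-version-id>", "input": {"prompt": "Tell me a story about llamas"}, "stream": true}' \
  "https://api.replicate.com/v1/predictions" > prediction.json

# Pluck out the stream URL and print it out.
STREAM_URL=$(jq -r .urls.stream prediction.json)
echo $STREAM_URL

# Connect to the stream URL and receive a stream of updates.
curl -s -N -H "Accept: text/event-stream" "$STREAM_URL"
```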
cURL will print out a stream of updates from the model until the connection is closed:
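The output is a series of server-sent events, roughly like this (the tokens and payloads shown here are illustrative):

```
event: output
data: Once

event: output
data:  upon

event: output
data:  a

event: output
data:  time

event: done
data: {}
```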
Streaming output is already supported by lots of language models on Replicate, including Falcon, Vicuna, StableLM, and of course... Llama 2 🦙. For a full list of models that support streaming output, see the streaming language models collection on Replicate.
When publishing your own public or private language models to Replicate, you should make sure they support streaming so users of your model will have the best possible experience.
If you're fine-tuning an existing language model, then you're already set: Your fine-tuned model will automatically inherit the streaming support from the base model.
If you're writing your own model using Cog, the key is to `yield` tokens as they're generated, instead of `return`ing the final result from a function. Use `ConcatenateIterator` to hint that the output should be concatenated together into a single string. Here's an example:
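A minimal sketch of a streaming Cog predictor — `generate_tokens` here is a placeholder for your model's own generation loop:

```python
from cog import BasePredictor, ConcatenateIterator, Input


def generate_tokens(prompt: str):
    # Placeholder generation loop — replace with your model's real decoding code.
    for word in ("Once", " upon", " a", " time"):
        yield word


class Predictor(BasePredictor):
    def predict(
        self,
        prompt: str = Input(description="Prompt to send to the model"),
    ) -> ConcatenateIterator[str]:
        # Yield tokens as they're generated instead of returning the final string.
        for token in generate_tokens(prompt):
            yield token
```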
For more details, check out the Cog documentation on streaming output.
Follow @replicate on X (formerly Twitter) to keep up as we add streaming support to more models.