Run Llama 2 with an API

Posted July 27, 2023

Llama 2 is a language model from Meta AI. It’s the first open source language model of the same caliber as OpenAI’s models.

With Replicate, you can run Llama 2 in the cloud with one line of code.


Running Llama 2 with JavaScript

You can run Llama 2 with our official JavaScript client:

import Replicate from "replicate";
 
const replicate = new Replicate({
  auth: process.env.REPLICATE_API_TOKEN,
});
 
const input = {
  prompt:
    "Write a poem about open source machine learning in the style of Mary Oliver.",
};
 
for await (const event of replicate.stream("meta/llama-2-70b-chat", {
  input,
})) {
  process.stdout.write(event.toString());
}

Running Llama 2 with Python

You can run Llama 2 with our official Python client:

import replicate
# The meta/llama-2-70b-chat model can stream output as it's running.
for event in replicate.stream(
    "meta/llama-2-70b-chat",
    input={
        "prompt": "Write a poem about open source machine learning in the style of Mary Oliver."
    },
):
    print(str(event), end="")

Running Llama 2 with cURL

You can call the HTTP API directly with tools like cURL:

curl -s -X POST \
  -H "Authorization: Bearer $REPLICATE_API_TOKEN" \
  -H "Content-Type: application/json" \
  -H "Prefer: wait" \
  -d $'{
    "input": {
      "prompt": "Write a poem..."
    }
  }' \
  https://api.replicate.com/v1/models/meta/llama-2-70b-chat/predictions
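With the `Prefer: wait` header, the API responds with the completed prediction as JSON. For the Llama 2 models, the output arrives as a list of text chunks you can join. Here's a minimal sketch of extracting the text; the sample response below is illustrative, not real API output, and real predictions include more fields (`id`, `status`, timings, and so on):

```python
import json

# Illustrative prediction response -- a real one has more fields.
response_body = json.dumps({
    "status": "succeeded",
    "output": ["The", " river", " remembers", " every", " stone."],
})

prediction = json.loads(response_body)
if prediction["status"] == "succeeded":
    # Llama 2 models return output as a list of text chunks; join them.
    text = "".join(prediction["output"])
    print(text)
```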

You can also run Llama 2 using Replicate's other client libraries, including Go and Swift.

Choosing which model to use

There are four Llama 2 model variants on Replicate, each with its own strengths:

  • meta/llama-2-70b-chat: 70 billion parameter model fine-tuned on chat completions. If you want to build a chat bot with the best accuracy, this is the one to use.
  • meta/llama-2-70b: 70 billion parameter base model. Use this if you want to do other kinds of language completions, like completing a user’s writing.
  • meta/llama-2-13b-chat: 13 billion parameter model fine-tuned on chat completions. Use this if you’re building a chat bot and would prefer it to be faster and cheaper at the expense of accuracy.
  • meta/llama-2-7b-chat: 7 billion parameter model fine-tuned on chat completions. This is an even smaller, faster model.

What's the difference between these? Learn more in our blog post comparing 7B, 13B, and 70B.
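One detail worth knowing about the chat variants: they were fine-tuned on a specific prompt template with `[INST]` and `<<SYS>>` markers. Hosted versions typically assemble this for you from separate prompt and system-prompt inputs, but the shape is useful to understand. A rough sketch (the helper name is ours, and this is a simplified single-turn version of the template):

```python
def format_llama2_chat(user_message: str, system_prompt: str) -> str:
    # Single-turn Llama 2 chat template: system prompt wrapped in
    # <<SYS>> tags, the whole turn wrapped in [INST] ... [/INST].
    return (
        f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )

prompt = format_llama2_chat(
    "Write a poem about open source machine learning.",
    "You are a helpful assistant.",
)
print(prompt)
```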

Example chat app

If you want a place to start, we’ve built a demo chat app in Next.js that can be deployed on Vercel:

Take a look at the GitHub README to learn how to customize and deploy it.

Fine-tune Llama 2

Because Llama 2 is open source, you can train it on your own data to teach it new things or a particular style.

Replicate makes this easy. Take a look at our guide to fine-tune Llama 2.
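Fine-tuning generally starts with a training file of examples. As an illustrative sketch only (the exact fields and file format your fine-tune expects are described in the fine-tuning guide), here's how you might assemble prompt/completion pairs into a JSON Lines file:

```python
import json

# Hypothetical training examples -- check the fine-tuning guide for
# the schema your training job actually expects.
examples = [
    {"prompt": "Write a haiku about rivers.",
     "completion": "Cold water whispers..."},
    {"prompt": "Write a haiku about mountains.",
     "completion": "Stone shoulders the sky..."},
]

# One JSON object per line (JSONL).
with open("train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")

# Read it back to sanity-check the file.
with open("train.jsonl") as f:
    rows = [json.loads(line) for line in f]
print(len(rows))
```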

Run Llama 2 locally

You can also run Llama 2 without an internet connection. We wrote a comprehensive guide to running Llama on your M1/M2 Mac, on Windows, on Linux, or even your phone.


Happy hacking! 🦙