Official

meta / meta-llama-3.1-405b-instruct

Meta's flagship 405 billion parameter language model, fine-tuned for chat completions

  • Public
  • 5.9M runs
  • GitHub
  • License

Input

prompt
string

Prompt

Default: ""

system_prompt
string

System prompt to send to the model. This is prepended to the prompt and helps guide system behavior. Ignored for non-chat models.

Default: "You are a helpful assistant."

min_tokens
integer

The minimum number of tokens the model should generate as output.

Default: 0

max_tokens
integer

The maximum number of tokens the model should generate as output.

Default: 512

temperature
number

The value used to modulate the next token probabilities.

Default: 0.6
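
For intuition, temperature is conventionally applied by dividing the logits by T before the softmax: lower values sharpen the next-token distribution, higher values flatten it. A minimal sketch of the standard formulation (not the model server's actual code):

```python
import numpy as np

def apply_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    # Standard temperature scaling: divide logits by T before the softmax.
    # T < 1 sharpens the distribution; T > 1 flattens it.
    scaled = logits / max(temperature, 1e-8)  # guard against T == 0
    exps = np.exp(scaled - scaled.max())      # subtract max for numerical stability
    return exps / exps.sum()
```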

top_p
number

A probability threshold for generating the output. If < 1.0, only keep the top tokens with cumulative probability >= top_p (nucleus filtering). Nucleus filtering is described in Holtzman et al. (http://arxiv.org/abs/1904.09751).

Default: 0.9
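
Nucleus filtering, per the Holtzman et al. paper linked above, keeps the smallest set of tokens whose cumulative probability reaches top_p and renormalizes. A reference sketch of the standard technique (not necessarily this server's implementation):

```python
import numpy as np

def top_p_filter(probs: np.ndarray, top_p: float) -> np.ndarray:
    # Keep the smallest set of tokens whose cumulative probability >= top_p.
    order = np.argsort(probs)[::-1]                  # tokens, most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1  # first index reaching top_p
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()                 # renormalize over kept tokens
```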

top_k
integer

The number of highest probability tokens to consider for generating the output. If > 0, only keep the top k tokens with highest probability (top-k filtering).

Default: 50
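
Top-k filtering is the simpler counterpart: keep only the k most probable tokens and renormalize. A sketch under the same caveat:

```python
import numpy as np

def top_k_filter(probs: np.ndarray, top_k: int) -> np.ndarray:
    # Zero out everything except the top_k most probable tokens.
    if top_k <= 0:
        return probs                          # 0 disables top-k filtering
    keep = np.argsort(probs)[::-1][:top_k]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()
```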

presence_penalty
number

Presence penalty. Penalizes tokens that have already appeared in the output, encouraging the model to introduce new content.

Default: 0

frequency_penalty
number

Frequency penalty. Penalizes tokens in proportion to how often they have already appeared, reducing verbatim repetition.

Default: 0
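
The page does not document the exact penalty formula. A common scheme (the one OpenAI documents for its API) subtracts presence_penalty once for any token that has appeared at all and frequency_penalty once per occurrence; the sketch below assumes that scheme applies here:

```python
import numpy as np
from collections import Counter

def apply_penalties(logits: np.ndarray, generated: list[int],
                    presence_penalty: float, frequency_penalty: float) -> np.ndarray:
    # Assumed OpenAI-style scheme, not confirmed for this model:
    # presence_penalty applies once per distinct token already generated;
    # frequency_penalty scales with how many times the token has occurred.
    counts = Counter(generated)
    penalized = logits.copy()
    for token_id, count in counts.items():
        penalized[token_id] -= presence_penalty           # appeared at least once
        penalized[token_id] -= frequency_penalty * count  # per occurrence
    return penalized
```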

stop_sequences
string

A comma-separated list of sequences to stop generation at. For example, '<end>,<stop>' will stop generation at the first instance of '<end>' or '<stop>'.

prompt_template
string

A template to format the prompt with. If not provided, the default prompt template will be used.
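
Putting the inputs together, here is a minimal sketch of calling the model with the official `replicate` Python client (input names as listed above; the prompt approximates the one behind the sample output below). The template constant is reconstructed from Llama 3.1's documented chat format and is an assumption about this deployment's default:

```python
# Requires `pip install replicate` and REPLICATE_API_TOKEN in the environment.
import replicate

# Reconstructed from Llama 3.1's documented chat format; passing it via
# `prompt_template` is optional, and this default is an assumption here.
LLAMA31_TEMPLATE = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    "{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
    "{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)

output = replicate.run(
    "meta/meta-llama-3.1-405b-instruct",
    input={
        "prompt": "Tina has one brother and one sister. How many sisters does Tina's brother have?",
        "system_prompt": "You are a helpful assistant.",
        "min_tokens": 0,
        "max_tokens": 512,
        "temperature": 0.6,
        "top_p": 0.9,
        "top_k": 50,
        "presence_penalty": 0,
        "frequency_penalty": 0,
        "prompt_template": LLAMA31_TEMPLATE,
    },
)
# Language models on Replicate stream output in chunks; join them for the full text.
print("".join(output))
```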

Output

Tina has one brother and one sister. From the brother's perspective, he has one sister, Tina, and also the other sister. So, Tina's brother has 2 sisters. From the sister's perspective, she also has one sister, Tina. So, Tina's siblings have a total of 2 sisters (from the brother's perspective) and 1 sister (from the sister's perspective).
Input tokens: 26
Output tokens: 84
Tokens per second: 29.74

Pricing

Model pricing for meta/meta-llama-3.1-405b-instruct. Looking for volume pricing? Get in touch.

$9.50 per million output tokens (around 105,263 tokens for $1)

$9.50 per million input tokens (around 105,263 tokens for $1)
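
At these rates a prediction's cost is a straightforward function of its token counts. A small sketch using the sample prediction above (the helper is hypothetical, not part of Replicate's client):

```python
# $9.50 per million tokens, input and output alike.
PRICE_PER_TOKEN = 9.50 / 1_000_000

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    # Input and output are billed at the same per-token rate for this model.
    return (input_tokens + output_tokens) * PRICE_PER_TOKEN

# The sample prediction above used 26 input and 84 output tokens:
print(f"${estimate_cost(26, 84):.6f}")  # -> $0.001045
```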

Official models are always on, maintained, and have predictable pricing. Learn more.

Check out our docs for more information about how pricing works on Replicate.

Readme

Meta Llama 3.1 405B Instruct is an instruction-tuned generative language model developed by Meta. It is optimized for multilingual dialogue use cases and outperforms many available open-source and closed chat models on common industry benchmarks. Supported languages are English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.

The model was trained on over 15 trillion tokens of publicly available online data, a mix of multilingual text and code, with a data cutoff of December 2023. Training used 30.84 million GPU hours.

For additional details, please refer to the official model card: https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md