arboreal-ai / llama-2-7b-chat

Llama-2-7b-Chat (GPTQ) with additional generation parameters

  • Public
  • 4.7K runs
  • L40S
  • GitHub
  • License

Input

string

Prompt to send to Llama v2

Default: "[INST]Tell me about AI[/INST]"

string

System prompt that helps guide system behavior

Default: "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."
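The default prompt above uses Llama 2's `[INST]` chat tags, and the system prompt is conventionally wrapped in `<<SYS>>` tags inside the first instruction block. A minimal sketch of that template, following Meta's reference format (exact whitespace handling varies between implementations, and this model's internal template is not published on this page):

```python
def build_prompt(user_message, system_prompt=""):
    # System prompts go inside <<SYS>> tags within the first [INST] block.
    if system_prompt:
        return f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_message} [/INST]"
    return f"[INST] {user_message} [/INST]"

prompt = build_prompt("Tell me about AI", "You are a helpful assistant.")
```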

integer
(minimum: 1, maximum: 4096)

Number of new tokens

Default: 512

number
(minimum: 0, maximum: 5)

Randomness of outputs; 0 is deterministic, values greater than 1 are increasingly random

Default: 1
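As a toy illustration of what temperature does: logits are divided by the temperature before the softmax, so low temperatures sharpen the distribution toward the most likely token and high temperatures flatten it. This is a sketch of the standard technique, not this model's server-side code:

```python
import math

def softmax_with_temperature(logits, temperature):
    if temperature == 0:
        # Deterministic: all probability mass on the argmax.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

Dividing by a temperature of 0.5 doubles every logit gap, while a temperature of 2 halves it, which is why low values behave almost greedily.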

number
(minimum: 0.01, maximum: 1)

When decoding text, samples from the smallest set of most likely tokens whose cumulative probability reaches top p; lower values ignore less likely tokens

Default: 0.95
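A minimal sketch of top-p (nucleus) sampling over a toy distribution, illustrating the standard technique; the model's actual sampler runs server-side:

```python
import random

def top_p_sample(probs, top_p, rng=random.random):
    # Rank tokens by probability, most likely first.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        total += p
        if total >= top_p:  # stop once the nucleus covers top_p mass
            break
    # Sample from the nucleus, renormalized to its own mass.
    r = rng() * total
    for token, p in kept:
        r -= p
        if r <= 0:
            return token
    return kept[-1][0]

token = top_p_sample({"cat": 0.6, "dog": 0.3, "axolotl": 0.1}, top_p=0.5)
```

With `top_p=0.5` only "cat" survives the cutoff, so sampling is effectively greedy; with `top_p=1.0` every token remains eligible.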

number
(minimum: 0, maximum: 5)

Penalty for repeated words in generated text; 1 is no penalty, values greater than 1 discourage repetition, less than 1 encourage it

Default: 1
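The behavior described above matches the CTRL-style repetition penalty (the scheme also used by Hugging Face transformers, though this model's exact implementation is an assumption): logits of tokens that already appeared are divided by the penalty when positive and multiplied by it when negative, so values above 1 discourage repeats and values below 1 encourage them. A sketch:

```python
def apply_repetition_penalty(logits, generated, penalty):
    out = dict(logits)
    for token in generated:
        if token in out:
            score = out[token]
            # Positive logits shrink, negative logits grow more negative,
            # so a previously seen token always becomes less likely.
            out[token] = score / penalty if score > 0 else score * penalty
    return out

penalized = apply_repetition_penalty({"the": 2.0, "a": -1.0, "b": 0.5}, {"the", "a"}, 1.5)
```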

integer
(minimum: 0, maximum: 4096)

Number of tokens to wait before starting exponential decay.

Default: 512

number
(minimum: 1, maximum: 10)

Decay factor for the exponential-decay logits processor.

Default: 1

boolean

Whether to omit the prompt passed to .generate() from the returned output. Useful e.g. for chatbots.

Default: true

integer

Random seed for reproducibility. Set to 0 for no random seed.

Default: 0


Run time and cost

This model costs approximately $0.089 to run on Replicate, or 11 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia L40S GPU hardware. Predictions typically complete within 91 seconds. The predict time for this model varies significantly based on the inputs.

Readme

This model exposes two additional parameters that help reduce the verbosity of the Llama 2 7B Chat model: (1) exponential_decay_start: the number of generated tokens after which it becomes increasingly likely that the model will stop generating. (2) exponential_decay: the decay factor for the exponential decay of the probability of continued generation, in the range of 1 (no decay) and above.
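A minimal sketch of what such an exponential-decay mechanism could look like, modeled on the ExponentialDecayLengthPenalty logits processor in Hugging Face transformers (the implementation actually used here is not published on this page). Once generation passes the start threshold, the end-of-sequence logit is boosted by `decay_factor ** (n_generated - start)`, making stopping ever more likely:

```python
def boost_eos_logit(eos_logit, n_generated, start, decay_factor):
    if n_generated <= start or decay_factor == 1:
        return eos_logit  # decay not started yet, or disabled entirely
    boost = decay_factor ** (n_generated - start)
    # Use abs() so the boost raises the logit even when it starts negative.
    return eos_logit + abs(eos_logit) * (boost - 1)

# e.g. two tokens past the threshold with decay_factor=1.5, a -2.0 EOS
# logit climbs to 0.5, so the model is far more likely to stop.
boosted = boost_eos_logit(-2.0, 514, 512, 1.5)
```

With the default decay factor of 1 the logit is untouched, which matches the "no decay" behavior described above.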