arboreal-ai / llama-2-7b-chat

Llama-2-7b-Chat (GPTQ) with additional generation parameters

  • Public
  • 4.7K runs
  • L40S
  • GitHub
  • License

Input

string

Prompt to send to Llama v2

Default: "[INST]Tell me about AI[/INST]"

string

System prompt that helps guide system behavior

Default: "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."
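The default prompt above uses Llama 2's `[INST]` chat tags, and the system prompt is conventionally wrapped in `<<SYS>>` tags inside the first instruction block. A minimal sketch of that template, following Meta's reference format (exact whitespace handling varies between implementations, and this model's internal template is not published on this page):

```python
def build_prompt(user_message, system_prompt=""):
    # System prompts go inside <<SYS>> tags within the first [INST] block.
    if system_prompt:
        return f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_message} [/INST]"
    return f"[INST] {user_message} [/INST]"

prompt = build_prompt("Tell me about AI", "You are a helpful assistant.")
```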

integer
(minimum: 1, maximum: 4096)

Number of new tokens

Default: 512

number
(minimum: 0, maximum: 5)

Randomness of outputs; 0 is deterministic, values greater than 1 are increasingly random

Default: 1
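As a toy illustration of what temperature does: logits are divided by the temperature before the softmax, so low temperatures sharpen the distribution toward the most likely token and high temperatures flatten it. This is a sketch of the standard technique, not this model's server-side code:

```python
import math

def softmax_with_temperature(logits, temperature):
    if temperature == 0:
        # Deterministic: all probability mass on the argmax.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

Dividing by a temperature of 0.5 doubles every logit gap, while a temperature of 2 halves it, which is why low values behave almost greedily.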

number
(minimum: 0.01, maximum: 1)

When decoding text, samples from the smallest set of most likely tokens whose cumulative probability reaches top p; lower values ignore less likely tokens

Default: 0.95
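A minimal sketch of top-p (nucleus) sampling over a toy distribution, illustrating the standard technique; the model's actual sampler runs server-side:

```python
import random

def top_p_sample(probs, top_p, rng=random.random):
    # Rank tokens by probability, most likely first.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        total += p
        if total >= top_p:  # stop once the nucleus covers top_p mass
            break
    # Sample from the nucleus, renormalized to its own mass.
    r = rng() * total
    for token, p in kept:
        r -= p
        if r <= 0:
            return token
    return kept[-1][0]

token = top_p_sample({"cat": 0.6, "dog": 0.3, "axolotl": 0.1}, top_p=0.5)
```

With `top_p=0.5` only "cat" survives the cutoff, so sampling is effectively greedy; with `top_p=1.0` every token remains eligible.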

number
(minimum: 0, maximum: 5)

Penalty for repeated words in generated text; 1 is no penalty, values greater than 1 discourage repetition, less than 1 encourage it

Default: 1
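The behavior described above matches the CTRL-style repetition penalty (the scheme also used by Hugging Face transformers, though this model's exact implementation is an assumption): logits of tokens that already appeared are divided by the penalty when positive and multiplied by it when negative, so values above 1 discourage repeats and values below 1 encourage them. A sketch:

```python
def apply_repetition_penalty(logits, generated, penalty):
    out = dict(logits)
    for token in generated:
        if token in out:
            score = out[token]
            # Positive logits shrink, negative logits grow more negative,
            # so a previously seen token always becomes less likely.
            out[token] = score / penalty if score > 0 else score * penalty
    return out

penalized = apply_repetition_penalty({"the": 2.0, "a": -1.0, "b": 0.5}, {"the", "a"}, 1.5)
```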

integer
(minimum: 0, maximum: 4096)

Number of tokens to wait before starting exponential decay.

Default: 512

number
(minimum: 1, maximum: 10)

Decay factor for the exponential-decay logits processor.

Default: 1

boolean

Whether to omit the prompt passed to .generate() from the returned output. Useful e.g. for chatbots.

Default: true

integer

Random seed for reproducibility. Set to 0 for no random seed.

Default: 0


Run time and cost

This model costs approximately $0.089 to run on Replicate, or 11 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia L40S GPU hardware. Predictions typically complete within 91 seconds. The predict time for this model varies significantly based on the inputs.

Readme

This model exposes two additional parameters that help reduce the verbosity of the Llama 2 7B Chat model: (1) exponential_decay_start: the number of generated tokens after which it becomes increasingly likely that the model will stop generating. (2) exponential_decay: the decay factor for the exponential decay of the probability of continued generation, in the range of 1 (no decay) and above.
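A minimal sketch of what such an exponential-decay mechanism could look like, modeled on the ExponentialDecayLengthPenalty logits processor in Hugging Face transformers (the implementation actually used here is not published on this page). Once generation passes the start threshold, the end-of-sequence logit is boosted by `decay_factor ** (n_generated - start)`, making stopping ever more likely:

```python
def boost_eos_logit(eos_logit, n_generated, start, decay_factor):
    if n_generated <= start or decay_factor == 1:
        return eos_logit  # decay not started yet, or disabled entirely
    boost = decay_factor ** (n_generated - start)
    # Use abs() so the boost raises the logit even when it starts negative.
    return eos_logit + abs(eos_logit) * (boost - 1)

# e.g. two tokens past the threshold with decay_factor=1.5, a -2.0 EOS
# logit climbs to 0.5, so the model is far more likely to stop.
boosted = boost_eos_logit(-2.0, 514, 512, 1.5)
```

With the default decay factor of 1 the logit is untouched, which matches the "no decay" behavior described above.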