joehoover / zephyr-7b-alpha

A high-performing language model trained to act as a helpful assistant

  • Public
  • 8K runs
  • L40S
  • GitHub
  • Paper
  • License

Input

*string

Prompt to send to the model.

string

System prompt to send to the model. This is prepended to the prompt and helps guide system behavior.

Default: "You are a helpful assistant."

integer
(minimum: 1)

Maximum number of tokens to generate. A word is generally 2-3 tokens.

Default: 128

integer
(minimum: -1)

Minimum number of tokens to generate. To disable, set to -1. A word is generally 2-3 tokens.

Default: -1

number
(minimum: 0.01, maximum: 5)

Adjusts the randomness of outputs; greater than 1 is more random, 0 is deterministic, and 0.75 is a good starting value.

Default: 0.75

number
(minimum: 0, maximum: 1)

When decoding text, samples from the top p percentage of most likely tokens. Lower this value to ignore less likely tokens.

Default: 0.9

integer
(minimum: 0)

When decoding text, samples from the top k most likely tokens. Lower this value to ignore less likely tokens.

Default: 50

string

A comma-separated list of sequences to stop generation at. For example, '<end>,<stop>' will stop generation at the first instance of '<end>' or '<stop>'.

integer

Random seed. Leave blank to randomize the seed.

boolean

Provide debugging output in logs.

Default: false

boolean

If set, only return logits for the first token. Only useful for testing.

Default: false
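
Taken together, an input payload might look like the following sketch (in Python). Only prompt and system_prompt are named elsewhere on this page; every other key below is a hypothetical name chosen to match the descriptions and defaults above, so confirm the exact field names against the model's API schema.

# Hypothetical input payload assembled from the parameter descriptions above.
# Only "prompt" and "system_prompt" are documented by name on this page; the
# remaining keys are assumed names, not a confirmed schema.
example_input = {
    "prompt": "Can you explain what self-attention is?",
    "system_prompt": "You are a helpful assistant.",   # default
    "max_new_tokens": 128,             # assumed name; maximum tokens to generate
    "min_new_tokens": -1,              # assumed name; -1 disables the minimum
    "temperature": 0.75,               # assumed name; randomness of outputs
    "top_p": 0.9,                      # assumed name; nucleus sampling threshold
    "top_k": 50,                       # assumed name; top-k sampling cutoff
    "stop_sequences": "<end>,<stop>",  # assumed name; comma-separated stop strings
    "seed": 42,                        # assumed name; omit to randomize
    "debug": False,                    # assumed name; debugging output in logs
}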

Output

Sure! Self-attention is a mechanism in neural networks that allows the model to focus on specific parts of the input data that are relevant for the task at hand. Here's a brief overview:

1. The input data is split into smaller pieces, called "tokens" or "words".
2. Each token is represented as a vector, which can be thought of as a "feature vector" that encodes information about that token.
3. The model computes a "query" vector and a "key" vector for each token.
4. The model then calculates a "score" for each pair of query and key vectors.
5. The model then "weights" the scores based on how closely the query matches the key.
6. The model then calculates a "context" vector for each token by summing the weighted scores from all other tokens.
7. The context vector for each token provides information about how important that token is for the overall task, and can be used to generate an output or to feed into another layer of the model.

Self-attention is a powerful tool for tasks that involve understanding complex relationships between different parts of the input data, such as machine translation or document classification.

Run time and cost

This model costs approximately $0.0045 to run on Replicate, or 222 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia L40S GPU hardware. Predictions typically complete within 5 seconds.

Readme

Zephyr 7B Alpha is the first in a series of language models developed by the Hugging Face H4 team. It is a fine-tuned version of mistralai/Mistral-7B-v0.1 that has been optimized with both supervised fine-tuning and RLHF (reinforcement learning from human feedback).

Please see the Hugging Face model card for more information about the model, licensing, and acceptable use.

How to prompt Zephyr 7B Alpha

To use this model, you can simply pass a prompt or instruction to the prompt argument. We handle prompt formatting on the backend so that you don’t need to worry about it. But, for reference, the prompt format for this model is:

 """<|system|>
{system_prompt}</s>
<|user|>
{instruction}</s>
<|assistant|>
"""

Here, {system_prompt} is an optional, user-specified system prompt and {instruction} is the user input.
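
As a concrete example, a call through the Replicate Python client might look like the sketch below. The prompt and system_prompt inputs come from this page; calling the model without pinning a version and joining a streamed sequence of strings are assumptions about the client, so adjust them to match the API tab if they differ.

import replicate

# Minimal sketch: the backend applies the <|system|>/<|user|>/<|assistant|>
# template shown above, so we only pass the raw instruction and, optionally,
# a system prompt.
output = replicate.run(
    "joehoover/zephyr-7b-alpha",
    input={
        "prompt": "Can you explain what self-attention is?",
        "system_prompt": "You are a helpful assistant.",
    },
)

# Output is typically streamed as chunks of text; join them into one string.
print("".join(output))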

Formatting prompts for chat interfaces

However, if you’re managing dialogue state across multiple exchanges between a user and the model, you need to mark the dialogue turns with tags that indicate the beginning and end of each user input. For example, dialogue formatting might proceed as follows (a code sketch of this bookkeeping appears after the list):

  • system_prompt is set to "You are a helpful assistant."
  • The user inputs "Can you help me answer a question?" and the input is passed to the Replicate API.
  • Internally, the user input will be injected into the prompt template, like:
 """<|system|>
You are a helpful assistant.</s>
<|user|>
Can you help me answer a question?</s>
<|assistant|>
"""
  • The model might respond with:
"I'd be happy to help you answer any question you have. Please provide me with the question you'd like assistance with, and I'll do my best to provide you with an answer."
  • Then, the user might respond with "Please help me understand this riddle: \"I’m tall when I’m young, and I’m short when I’m old. What am I?\"".

  • In this case, the next input to the model should be formatted like this (the system prompt, the opening <|user|> tag, and the trailing <|assistant|> tag are added by the prompt template on the backend, so they are omitted here):

 """
Can you help me answer a question?</s>
<|assistant|>
I'd be happy to help you answer any question you have. Please provide me with the question you'd like assistance with, and I'll do my best to provide you with an answer.</s>
<|user|>
Please help me understand this riddle: "I’m tall when I’m young, and I’m short when I’m old. What am I?"
"""
  • Then, the model might respond with something like:
"Certainly! The answer to this riddle is a \"candle\". When a candle is young, it's tall, but as it burns and gets shorter, it becomes shorter and shorter until it eventually extinguishes, at which point it's no longer \"tall\" or \"short\"."

Modifying the system prompt

In addition to supporting dialogue exchanges, this deployment also lets you modify the system prompt that guides model responses. By altering the input to the system_prompt argument, you can inject custom context or instructions that shape the model's output.
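
For instance, reusing the client sketch from above, you might override the default system prompt to constrain the assistant's style; the system prompt text here is purely illustrative.

import replicate

# Sketch: override the default system prompt to steer responses
# (same caveats about the model identifier and output handling as above).
output = replicate.run(
    "joehoover/zephyr-7b-alpha",
    input={
        "system_prompt": "You are a concise assistant. Answer in no more than two sentences.",
        "prompt": "What is self-attention?",
    },
)
print("".join(output))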