{
"frequency_penalty": 0,
"max_tokens": 512,
"min_tokens": 0,
"presence_penalty": 0,
"prompt": "How is perplexity measured for LLMs and why is it useful?",
"temperature": 0.6,
"top_k": 50,
"top_p": 0.9
}

Install Replicate’s Node.js client library:

npm install replicate
Set the REPLICATE_API_TOKEN environment variable:

export REPLICATE_API_TOKEN=r8_Xvz**********************************
This is your API token. Keep it to yourself.
import Replicate from "replicate";
const replicate = new Replicate({
auth: process.env.REPLICATE_API_TOKEN,
});
Run ibm-granite/granite-3.3-8b-instruct using Replicate’s API. Check out the model's schema for an overview of inputs and outputs.
const output = await replicate.run(
  "ibm-granite/granite-3.3-8b-instruct:a325a0cacfb0aa9226e6bad1abe5385f1073f4c7f8c36e52ed040e5409e6c034",
  {
    input: {
      frequency_penalty: 0,
      max_tokens: 512,
      min_tokens: 0,
      presence_penalty: 0,
      prompt: "How is perplexity measured for LLMs and why is it useful?",
      temperature: 0.6,
      top_k: 50,
      top_p: 0.9
    }
  }
);
console.log(output);
To learn more, take a look at the guide on getting started with Node.js.
Install Replicate’s Python client library:

pip install replicate

Set the REPLICATE_API_TOKEN environment variable:

export REPLICATE_API_TOKEN=r8_Xvz**********************************
This is your API token. Keep it to yourself.
import replicate
Run ibm-granite/granite-3.3-8b-instruct using Replicate’s API. Check out the model's schema for an overview of inputs and outputs.
output = replicate.run(
    "ibm-granite/granite-3.3-8b-instruct:a325a0cacfb0aa9226e6bad1abe5385f1073f4c7f8c36e52ed040e5409e6c034",
    input={
        "frequency_penalty": 0,
        "max_tokens": 512,
        "min_tokens": 0,
        "presence_penalty": 0,
        "prompt": "How is perplexity measured for LLMs and why is it useful?",
        "temperature": 0.6,
        "top_k": 50,
        "top_p": 0.9
    }
)

# The ibm-granite/granite-3.3-8b-instruct model can stream output as it's running.
# The predict method returns an iterator, and you can iterate over that output.
for item in output:
    # https://replicate.com/ibm-granite/granite-3.3-8b-instruct/api#output-schema
    print(item, end="")
To learn more, take a look at the guide on getting started with Python.
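Because the model streams, its output arrives as a sequence of text chunks (sub-word tokens, as the prediction's output array later on this page shows). If you want the whole response as a single string rather than printing as it streams, join the chunks. A minimal sketch, using made-up chunks for illustration:

```python
# Streamed output is a sequence of text chunks (sub-word tokens);
# concatenating them reconstructs the full response text.
chunks = ["Per", "plex", "ity", " is", " a", " common", " metric"]
text = "".join(chunks)
print(text)  # Perplexity is a common metric
```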
Set the REPLICATE_API_TOKEN environment variable:

export REPLICATE_API_TOKEN=r8_Xvz**********************************
This is your API token. Keep it to yourself.
Run ibm-granite/granite-3.3-8b-instruct using Replicate’s API. Check out the model's schema for an overview of inputs and outputs.
curl -s -X POST \
  -H "Authorization: Bearer $REPLICATE_API_TOKEN" \
  -H "Content-Type: application/json" \
  -H "Prefer: wait" \
  -d $'{
    "version": "ibm-granite/granite-3.3-8b-instruct:a325a0cacfb0aa9226e6bad1abe5385f1073f4c7f8c36e52ed040e5409e6c034",
    "input": {
      "frequency_penalty": 0,
      "max_tokens": 512,
      "min_tokens": 0,
      "presence_penalty": 0,
      "prompt": "How is perplexity measured for LLMs and why is it useful?",
      "temperature": 0.6,
      "top_k": 50,
      "top_p": 0.9
    }
  }' \
  https://api.replicate.com/v1/predictions
To learn more, take a look at Replicate’s HTTP API reference docs.
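The Prefer: wait header asks the API to hold the connection open until the prediction finishes. Without it, the POST returns immediately and you poll the prediction's get URL until it reaches a terminal status. A minimal polling sketch (the poll helper and interval are illustrative, not part of Replicate's clients):

```python
import json
import time
import urllib.request

# Terminal prediction statuses; "starting" and "processing" mean keep waiting.
TERMINAL = {"succeeded", "failed", "canceled"}

def is_terminal(status):
    # True once the prediction can no longer change.
    return status in TERMINAL

def poll(get_url, token, interval=1.0):
    # Repeatedly GET the prediction's "get" URL until it is terminal.
    req = urllib.request.Request(
        get_url, headers={"Authorization": f"Bearer {token}"}
    )
    while True:
        with urllib.request.urlopen(req) as resp:
            prediction = json.load(resp)
        if is_terminal(prediction["status"]):
            return prediction
        time.sleep(interval)
```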
Perplexity is a common metric used to evaluate the performance of language models, including large language models (LLMs). It measures how well a language model predicts a sample. A lower perplexity score indicates better performance, as it suggests the model is more confident in its predictions.

Perplexity is calculated by raising 2 to the average negative log-likelihood of the test set. In simpler terms, it quantifies the uncertainty of the model in predicting the next word in a sequence. A perplexity of 2^n means that, on average, the model is uncertain about the next word, assigning a probability of 1/n to the correct word.

Here's a step-by-step breakdown of the calculation:

1. For each word in the test set, the language model computes a probability distribution over the vocabulary, predicting the likelihood of the next word.
2. The negative logarithm (base 2) of these probabilities is calculated for each word.
3. The negative log-likelihood is averaged across all words in the test set.
4. 2 raised to the power of this average negative log-likelihood gives the perplexity score.

Perplexity is useful for several reasons:

1. **Model comparison**: It allows for comparing different language models or model configurations. A model with lower perplexity on a given test set is generally considered to have better performance.
2. **Training optimization**: Perplexity can be used as a loss function during training, guiding the model to minimize uncertainty in its predictions. Lower perplexity during training indicates that the model is learning to make more accurate predictions.
3. **Transfer learning and fine-tuning**: Perplexity can help assess the impact of transfer learning or fine-tuning on pre-trained language models. A decrease in perplexity after applying these techniques suggests improved performance.
4. **Understanding model behavior**: Perplexity can offer insights into how well the model captures the underlying patterns and structures of the language.

However, it's essential to consider the limitations of perplexity. It doesn't always correlate with human-evaluated quality, especially for specific tasks or domains. Moreover, it tends to favor models that assign very low probabilities to a small subset of words, which might not necessarily indicate better performance in practical applications.

In summary, perplexity is a valuable
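The four calculation steps in the model's answer can be checked with a tiny worked example (the probabilities below are made up for illustration):

```python
import math

# Hypothetical probabilities a model assigned to the correct next
# word at each position of a four-word test set (step 1's output).
probs = [0.25, 0.5, 0.125, 0.25]

# Steps 2-3: negative log (base 2) of each probability, averaged.
avg_nll = sum(-math.log2(p) for p in probs) / len(probs)  # (2+1+3+2)/4 = 2.0

# Step 4: perplexity is 2 raised to that average.
perplexity = 2 ** avg_nll
print(perplexity)  # 4.0
```

A perplexity of 4.0 here means the model was, on average, about as uncertain as if it were choosing uniformly among four words at each position.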
{
"id": "t96wgpwzn9rm80cp79q8cxprm8",
"model": "ibm-granite/granite-3.3-8b-instruct",
"version": "a325a0cacfb0aa9226e6bad1abe5385f1073f4c7f8c36e52ed040e5409e6c034",
"input": {
"frequency_penalty": 0,
"max_tokens": 512,
"min_tokens": 0,
"presence_penalty": 0,
"prompt": "How is perplexity measured for LLMs and why is it useful?",
"temperature": 0.6,
"top_k": 50,
"top_p": 0.9
},
"logs": "2025-04-15 17:15:15 [info ] predict() commencing request_id=c23a193892654772b688b7fd97f894b8 user_prompt=How is perplexity measured for LLMs and why is it useful?\n2025-04-15 17:15:15 [debug ] Formatted prompt using chat template formatted_prompt=<|start_of_role|>system<|end_of_role|> Knowledge Cutoff Date: April 2024.\n Today's Date: April 15, 2025. You are Granite, developed by IBM. You are a helpful AI assistant.<|end_of_text|>\n<|start_of_role|>user<|end_of_role|>How is perplexity measured for LLMs and why is it useful?<|end_of_text|>\n<|start_of_role|>assistant<|end_of_role|> request_id=c23a193892654772b688b7fd97f894b8 user_prompt=How is perplexity measured for LLMs and why is it useful?\n2025-04-15 17:15:15 [debug ] SamplingParams request_id=c23a193892654772b688b7fd97f894b8 sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.6, top_p=0.9, top_k=50, min_p=0.0, seed=22349, stop=[], stop_token_ids=[0], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None) user_prompt=How is perplexity measured for LLMs and why is it useful?\nINFO 04-15 17:15:15 [async_llm.py:228] Added request c23a193892654772b688b7fd97f894b8.\n2025-04-15 17:15:19 [debug ] result finish_reason=length request_id=c23a193892654772b688b7fd97f894b8 text=Perplexity is a common metric used to evaluate the performance of language models, including large language models (LLMs). It measures how well a language model predicts a sample. A lower perplexity score indicates better performance, as it suggests the model is more confident in its predictions.\n\nPerplexity is calculated by raising 2 to the average negative log-likelihood of the test set. 
In simpler terms, it quantifies the uncertainty of the model in predicting the next word in a sequence. A perplexity of 2^n means that, on average, the model is uncertain about the next word, assigning a probability of 1/n to the correct word.\n\nHere's a step-by-step breakdown of the calculation:\n\n1. For each word in the test set, the language model computes a probability distribution over the vocabulary, predicting the likelihood of the next word.\n2. The negative logarithm (base 2) of these probabilities is calculated for each word.\n3. The negative log-likelihood is averaged across all words in the test set.\n4. 2 raised to the power of this average negative log-likelihood gives the perplexity score.\n\nPerplexity is useful for several reasons:\n\n1. **Model comparison**: It allows for comparing different language models or model configurations. A model with lower perplexity on a given test set is generally considered to have better performance.\n\n2. **Training optimization**: Perplexity can be used as a loss function during training, guiding the model to minimize uncertainty in its predictions. Lower perplexity during training indicates that the model is learning to make more accurate predictions.\n\n3. **Transfer learning and fine-tuning**: Perplexity can help assess the impact of transfer learning or fine-tuning on pre-trained language models. A decrease in perplexity after applying these techniques suggests improved performance.\n\n4. **Understanding model behavior**: Perplexity can offer insights into how well the model captures the underlying patterns and structures of the language.\n\nHowever, it's essential to consider the limitations of perplexity. It doesn't always correlate with human-evaluated quality, especially for specific tasks or domains. 
Moreover, it tends to favor models that assign very low probabilities to a small subset of words, which might not necessarily indicate better performance in practical applications.\n\nIn summary, perplexity is a valuable user_prompt=How is perplexity measured for LLMs and why is it useful?\n2025-04-15 17:15:19 [info ] Generation took 3.98s request_id=c23a193892654772b688b7fd97f894b8 user_prompt=How is perplexity measured for LLMs and why is it useful?\n/root/.pyenv/versions/3.11.12/lib/python3.11/site-packages/cog/server/scope.py:21: ExperimentalFeatureWarning: current_scope is an experimental internal function. It may change or be removed without warning.\n warnings.warn(\n2025-04-15 17:15:19 [info ] predict() complete request_id=c23a193892654772b688b7fd97f894b8 user_prompt=How is perplexity measured for LLMs and why is it useful?",
"output": [
"Per",
"plex",
"ity",
" is",
" a",
" common",
" metric",
" used",
" to",
" evaluate",
" the",
" performance",
" of",
" language",
" models",
",",
" including",
" large",
" language",
" models",
" (",
"LL",
"Ms",
").",
" It",
" measures",
" how",
" well",
" a",
" language",
" model",
" predict",
"s",
" a",
" sample",
".",
" A",
" lower",
" per",
"plex",
"ity",
" score",
" indicates",
" better",
" performance",
",",
" as",
" it",
" suggests",
" the",
" model",
" is",
" more",
" conf",
"ident",
" in",
" its",
" predictions",
".",
"\n",
"\n",
"Per",
"plex",
"ity",
" is",
" calculated",
" by",
" raising",
" ",
"2",
" to",
" the",
" average",
" negative",
" log",
"-",
"likelihood",
" of",
" the",
" test",
" set",
".",
" In",
" simpler",
" terms",
",",
" it",
" quant",
"ifies",
" the",
" uncertainty",
" of",
" the",
" model",
" in",
" predict",
"ing",
" the",
" next",
" word",
" in",
" a",
" sequence",
".",
" A",
" per",
"plex",
"ity",
" of",
" ",
"2",
"^",
"n",
" means",
" that",
",",
" on",
" average",
",",
" the",
" model",
" is",
" uncert",
"ain",
" about",
" the",
" next",
" word",
",",
" assigning",
" a",
" probability",
" of",
" ",
"1",
"/",
"n",
" to",
" the",
" correct",
" word",
".",
"\n",
"\n",
"Here",
"'s",
" a",
" step",
"-",
"by",
"-",
"step",
" break",
"down",
" of",
" the",
" calculation",
":",
"\n",
"\n",
"1",
".",
" For",
" each",
" word",
" in",
" the",
" test",
" set",
",",
" the",
" language",
" model",
" computes",
" a",
" probability",
" distribution",
" over",
" the",
" vocabulary",
",",
" predict",
"ing",
" the",
" likelihood",
" of",
" the",
" next",
" word",
".",
"\n",
"2",
".",
" The",
" negative",
" log",
"arith",
"m",
" (",
"base",
" ",
"2",
")",
" of",
" these",
" probabilities",
" is",
" calculated",
" for",
" each",
" word",
".",
"\n",
"3",
".",
" The",
" negative",
" log",
"-",
"likelihood",
" is",
" aver",
"aged",
" across",
" all",
" words",
" in",
" the",
" test",
" set",
".",
"\n",
"4",
".",
" ",
"2",
" raised",
" to",
" the",
" power",
" of",
" this",
" average",
" negative",
" log",
"-",
"likelihood",
" gives",
" the",
" per",
"plex",
"ity",
" score",
".",
"\n",
"\n",
"Per",
"plex",
"ity",
" is",
" useful",
" for",
" several",
" reasons",
":",
"\n",
"\n",
"1",
".",
" **",
"Model",
" comparison",
"**:",
" It",
" allows",
" for",
" comparing",
" different",
" language",
" models",
" or",
" model",
" configurations",
".",
" A",
" model",
" with",
" lower",
" per",
"plex",
"ity",
" on",
" a",
" given",
" test",
" set",
" is",
" generally",
" considered",
" to",
" have",
" better",
" performance",
".",
"\n",
"\n",
"2",
".",
" **",
"Training",
" optimization",
"**:",
" Per",
"plex",
"ity",
" can",
" be",
" used",
" as",
" a",
" loss",
" function",
" during",
" training",
",",
" guid",
"ing",
" the",
" model",
" to",
" minimize",
" uncertainty",
" in",
" its",
" predictions",
".",
" Lower",
" per",
"plex",
"ity",
" during",
" training",
" indicates",
" that",
" the",
" model",
" is",
" learning",
" to",
" make",
" more",
" accurate",
" predictions",
".",
"\n",
"\n",
"3",
".",
" **",
"Transfer",
" learning",
" and",
" fine",
"-",
"tuning",
"**:",
" Per",
"plex",
"ity",
" can",
" help",
" assess",
" the",
" impact",
" of",
" transfer",
" learning",
" or",
" fine",
"-",
"tuning",
" on",
" pre",
"-",
"trained",
" language",
" models",
".",
" A",
" decrease",
" in",
" per",
"plex",
"ity",
" after",
" applying",
" these",
" techniques",
" suggests",
" improved",
" performance",
".",
"\n",
"\n",
"4",
".",
" **",
"Under",
"standing",
" model",
" behavior",
"**:",
" Per",
"plex",
"ity",
" can",
" offer",
" insights",
" into",
" how",
" well",
" the",
" model",
" captures",
" the",
" underlying",
" patterns",
" and",
" structures",
" of",
" the",
" language",
".",
"\n",
"\n",
"However",
",",
" it",
"'s",
" essential",
" to",
" consider",
" the",
" limitations",
" of",
" per",
"plex",
"ity",
".",
" It",
" doesn",
"'t",
" always",
" corre",
"late",
" with",
" human",
"-",
"evalu",
"ated",
" quality",
",",
" especially",
" for",
" specific",
" tasks",
" or",
" domains",
".",
" More",
"over",
",",
" it",
" t",
"ends",
" to",
" favor",
" models",
" that",
" assign",
" very",
" low",
" probabilities",
" to",
" a",
" small",
" subset",
" of",
" words",
",",
" which",
" might",
" not",
" necessarily",
" indicate",
" better",
" performance",
" in",
" practical",
" applications",
".",
"\n",
"\n",
"In",
" summary",
",",
" per",
"plex",
"ity",
" is",
" a",
" valuable"
],
"data_removed": false,
"error": null,
"source": "web",
"status": "succeeded",
"created_at": "2025-04-15T17:10:50.538Z",
"started_at": "2025-04-15T17:15:15.475257Z",
"completed_at": "2025-04-15T17:15:19.455833Z",
"urls": {
"cancel": "https://api.replicate.com/v1/predictions/t96wgpwzn9rm80cp79q8cxprm8/cancel",
"get": "https://api.replicate.com/v1/predictions/t96wgpwzn9rm80cp79q8cxprm8",
"stream": "https://stream-b.svc.ric1.c.replicate.net/v1/streams/vwpm2qj3tdrjns6qqnztkvnavsyzzgawvsritri6jim2er7ce3ja",
"web": "https://replicate.com/p/t96wgpwzn9rm80cp79q8cxprm8"
},
"metrics": {
"predict_time": 3.980575445,
"total_time": 268.917833
}
}
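In the metrics block, predict_time covers only model execution, while total_time also includes the time the prediction spent queued before it started (visible in the gap between created_at and started_at). A quick sketch recovering both from this prediction's own timestamps:

```python
from datetime import datetime

def parse(ts):
    # fromisoformat() in Pythons before 3.11 rejects a trailing "Z",
    # so normalize it to an explicit UTC offset first.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

created = parse("2025-04-15T17:10:50.538Z")
started = parse("2025-04-15T17:15:15.475257Z")
completed = parse("2025-04-15T17:15:19.455833Z")

queue_seconds = (started - created).total_seconds()      # wait before execution
predict_seconds = (completed - started).total_seconds()  # model execution
print(round(predict_seconds, 2))  # 3.98, matching metrics.predict_time
```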