Official

ibm-granite / granite-3.2-8b-instruct

Granite-3.2-8B-Instruct is an 8-billion-parameter, 128K-context-length language model fine-tuned for reasoning and instruction-following capabilities.

  • Public
  • 344.1K runs
  • Priced per token
  • License

New model available: Granite-3.3-8B-Instruct.

Input

Prompt (string)
The prompt to send to the model.
Default: ""

System prompt (string)
System prompt to send to the model. This is prepended to the prompt and helps guide system behavior. Ignored for non-chat models.
Default: "You are a helpful assistant."

Min tokens (integer)
The minimum number of tokens the model should generate as output.
Default: 0

Max tokens (integer)
The maximum number of tokens the model should generate as output.
Default: 512

Temperature (number)
The value used to modulate the next token probabilities.
Default: 0.6

Top p (number)
A probability threshold for generating the output. If < 1.0, only keep the top tokens with cumulative probability >= top_p (nucleus filtering). Nucleus filtering is described in Holtzman et al. (http://arxiv.org/abs/1904.09751).
Default: 0.9

Top k (integer)
The number of highest-probability tokens to consider for generating the output. If > 0, only keep the top k tokens with the highest probability (top-k filtering).
Default: 50

Presence penalty (number)
Penalizes tokens that have already appeared in the output, encouraging the model to introduce new content.
Default: 0

Frequency penalty (number)
Penalizes tokens in proportion to how often they have already appeared in the output, discouraging repetition.
Default: 0

Stop sequences (string)
A comma-separated list of sequences to stop generation at. For example, '<end>,<stop>' will stop generation at the first instance of '<end>' or '<stop>'.
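For reference, here is a minimal sketch of sending these inputs through the Replicate Python client. The snake_case parameter keys (system_prompt, max_tokens, and so on) and the example prompt are assumptions for illustration rather than values copied from this page.

```python
import replicate  # requires the replicate package and REPLICATE_API_TOKEN in the environment

# Minimal sketch: the input keys below mirror the parameters listed above,
# but the exact snake_case names are assumed, not taken from this page.
output = replicate.run(
    "ibm-granite/granite-3.2-8b-instruct",
    input={
        "prompt": "Explain what perplexity measures for a language model.",
        "system_prompt": "You are a helpful assistant.",
        "max_tokens": 512,
        "temperature": 0.6,
        "top_p": 0.9,
        "top_k": 50,
    },
)

# Language models on Replicate stream their output as chunks of text;
# join the chunks to get the full completion.
print("".join(output))
```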

Output

Perplexity is a common metric used to evaluate the performance of language models, including large language models (LLMs). It's a measure of how well a model predicts a sample. Perplexity is calculated based on the concept of cross-entropy. In simpler terms, it's a way to measure how surprised the model is by the test data. The lower the perplexity, the less surprised the model is, and the better it predicts the data.

Here's a simple breakdown of how it's calculated:
1. The model is given a test set of sentences.
2. For each word in the sentence, the model calculates the probability of that word given all the previous words.
3. The perplexity is then the inverse probability of the entire sentence, averaged over the whole test set.

The formula for perplexity (PP) is:

PP(W) = exp(-1/N * Σ log P(wi|w1...wi-1))

Where:
- W is the test set of sentences,
- N is the number of words in the test set,
- wi is each word in the test set,
- P(wi|w1...wi-1) is the probability assigned by the model to word wi given the previous words.

Perplexity is useful for several reasons:
1. **Model Comparison**: It provides a standard way to compare different models. A lower perplexity score generally indicates a better model.
2. **Model Improvement**: It helps in identifying areas where the model is struggling. If a certain type of sentence or vocabulary consistently results in high perplexity, it indicates a weakness in the model that can be addressed through further training or adjustments.
3. **Understanding Model Behavior**: It gives insights into how the model understands language. A lower perplexity suggests the model has a better grasp of the language's structure and usage.
4. **Evaluation of Unseen Data**: While it's trained on a specific corpus, perplexity can be calculated on unseen data to evaluate the model's generalization capability.

However, it's important to note that while perplexity is a widely used metric, it's not without its limitations. For instance, it doesn't directly correlate with human judgment of fluency or coherence, and
Generated in
Input tokens: 22
Output tokens: 510
Tokens per second: 140.55
Time to first token
Pricing

Official model
Pricing for official models works differently from other models. Instead of being billed by time, you’re billed by input and output, making pricing more predictable.

This model is priced by how many input tokens are sent and how many output tokens are generated.

Type     Per unit             Per $1
Input    $0.03 / 1M tokens    33M tokens / $1
Output   $0.25 / 1M tokens    4M tokens / $1

For example, for $10 you can run around 57,143 predictions where the input is a sentence or two (15 tokens) and the output is a few paragraphs (700 tokens).
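As a quick back-of-the-envelope check of that example, using the per-token rates from the table above (the 15-token input and 700-token output are the counts assumed in the example):

```python
# Rough check of the pricing example above.
input_cost = 15 * 0.03 / 1_000_000     # cost of ~15 input tokens per prediction
output_cost = 700 * 0.25 / 1_000_000   # cost of ~700 output tokens per prediction
print(10 / (input_cost + output_cost))  # ~56,996 predictions for $10
print(10 / output_cost)                 # ~57,143 if the negligible input cost is ignored
```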

Check out our docs for more information about how per-token pricing works on Replicate.

Readme

Model Summary

Granite-3.2-8B-Instruct is an 8-billion-parameter, long-context AI model fine-tuned for thinking capabilities. Built on top of Granite-3.1-8B-Instruct, it has been trained using a mix of permissively licensed open-source datasets and internally generated synthetic data designed for reasoning tasks. The model allows controllability of its thinking capability, ensuring it is applied only when required.
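The thinking toggle is not among the inputs listed above; on the upstream Hugging Face model card it is switched through the chat template. Below is a minimal local sketch, assuming a transformers version whose apply_chat_template forwards a thinking flag to that template; the flag name and exact usage follow the upstream card, not this page, and may differ from the hosted endpoint.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal sketch, assuming the Granite chat template accepts a `thinking` flag
# as shown on the upstream Hugging Face model card (an assumption, not part of
# the Replicate API documented on this page).
model_id = "ibm-granite/granite-3.2-8b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "How many r's are in 'strawberry'?"}]
inputs = tokenizer.apply_chat_template(
    messages,
    thinking=True,              # enable the reasoning trace; omit or set False to skip it
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0, inputs.shape[-1]:], skip_special_tokens=True))
```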

Evaluation Results

Models ArenaHard Alpaca-Eval-2 MMLU PopQA TruthfulQA BigBenchHard DROP GSM8K HumanEval HumanEval+ IFEval AttaQ
Llama-3.1-8B-Instruct 36.43 27.22 69.15 28.79 52.79 72.66 61.48 83.24 85.32 80.15 79.10 83.43
DeepSeek-R1-Distill-Llama-8B 17.17 21.85 45.80 13.25 47.43 65.71 44.46 72.18 67.54 62.91 66.50 42.87
Qwen-2.5-7B-Instruct 25.44 30.34 74.30 18.12 63.06 70.40 54.71 84.46 93.35 89.91 74.90 81.90
DeepSeek-R1-Distill-Qwen-7B 10.36 15.35 50.72 9.94 47.14 65.04 42.76 78.47 79.89 78.43 59.10 42.45
Granite-3.1-8B-Instruct 37.58 30.34 66.77 28.7 65.84 68.55 50.78 79.15 89.63 85.79 73.20 85.73
Granite-3.1-2B-Instruct 23.3 27.17 57.11 20.55 59.79 54.46 18.68 67.55 79.45 75.26 63.59 84.7
Granite-3.2-2B-Instruct 24.86 34.51 57.18 20.56 59.8 52.27 21.12 67.02 80.13 73.39 61.55 83.23
Granite-3.2-8B-Instruct 55.25 61.19 66.79 28.04 66.92 64.77 50.95 81.65 89.35 85.72 74.31 85.42

Supported Languages:

English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. However, users may finetune this Granite model for languages beyond these 12 languages.

Intended Use:

This model is designed to handle general instruction-following tasks and can be integrated into AI assistants across various domains, including business applications.

Capabilities

  • Thinking
  • Summarization
  • Text classification
  • Text extraction
  • Question-answering
  • Retrieval Augmented Generation (RAG)
  • Code related tasks
  • Function-calling tasks
  • Multilingual dialog use cases
  • Long-context tasks including long document/meeting summarization, long document QA, etc.

Training Data

Overall, our training data is largely composed of two key sources: (1) publicly available datasets with permissive licenses, and (2) internally generated synthetic data targeted at enhancing reasoning capabilities.

Infrastructure

We train Granite-3.2-8B-Instruct using IBM's supercomputing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations

Granite-3.2-8B-Instruct builds upon Granite-3.1-8B-Instruct, leveraging both permissively licensed open-source and select proprietary data for enhanced performance. Since it inherits its foundation from the previous model, all ethical considerations and limitations applicable to Granite-3.1-8B-Instruct remain relevant.