Official

ibm-granite / granite-3.2-8b-instruct

Granite-3.2-8B-Instruct is an 8-billion-parameter, 128K-context-length language model fine-tuned for reasoning and instruction-following capabilities.

  • Public
  • 344.1K runs
  • Priced per token
  • License

New model available: Granite-3.3-8B-Instruct.

Input

Prompt (string)
The prompt to send to the model.
Default: ""

System prompt (string)
System prompt to send to the model. This is prepended to the prompt and helps guide system behavior. Ignored for non-chat models.
Default: "You are a helpful assistant."

Min tokens (integer)
The minimum number of tokens the model should generate as output.
Default: 0

Max tokens (integer)
The maximum number of tokens the model should generate as output.
Default: 512

Temperature (number)
The value used to modulate the next token probabilities.
Default: 0.6

Top p (number)
A probability threshold for generating the output. If < 1.0, only keep the top tokens with cumulative probability >= top_p (nucleus filtering). Nucleus filtering is described in Holtzman et al. (http://arxiv.org/abs/1904.09751).
Default: 0.9

Top k (integer)
The number of highest-probability tokens to consider for generating the output. If > 0, only keep the top k tokens with the highest probability (top-k filtering).
Default: 50

Presence penalty (number)
Penalizes tokens that have already appeared in the output, encouraging the model to introduce new content.
Default: 0

Frequency penalty (number)
Penalizes tokens in proportion to how often they have already appeared in the output, discouraging repetition.
Default: 0

Stop sequences (string)
A comma-separated list of sequences to stop generation at. For example, '<end>,<stop>' will stop generation at the first instance of '<end>' or '<stop>'.
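For reference, here is a minimal sketch of sending these inputs through the Replicate Python client. The snake_case parameter keys (system_prompt, max_tokens, and so on) and the example prompt are assumptions for illustration rather than values copied from this page.

```python
import replicate  # requires the replicate package and REPLICATE_API_TOKEN in the environment

# Minimal sketch: the input keys below mirror the parameters listed above,
# but the exact snake_case names are assumed, not taken from this page.
output = replicate.run(
    "ibm-granite/granite-3.2-8b-instruct",
    input={
        "prompt": "Explain what perplexity measures for a language model.",
        "system_prompt": "You are a helpful assistant.",
        "max_tokens": 512,
        "temperature": 0.6,
        "top_p": 0.9,
        "top_k": 50,
    },
)

# Language models on Replicate stream their output as chunks of text;
# join the chunks to get the full completion.
print("".join(output))
```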

Output

Perplexity is a common metric used to evaluate the performance of language models, including large language models (LLMs). It's a measure of how well a model predicts a sample. Perplexity is calculated based on the concept of cross-entropy. In simpler terms, it's a way to measure how surprised the model is by the test data. The lower the perplexity, the less surprised the model is, and the better it predicts the data.

Here's a simple breakdown of how it's calculated:
1. The model is given a test set of sentences.
2. For each word in the sentence, the model calculates the probability of that word given all the previous words.
3. The perplexity is then the inverse probability of the entire sentence, averaged over the whole test set.

The formula for perplexity (PP) is:

PP(W) = exp(-1/N * Σ log P(wi|w1...wi-1))

Where:
- W is the test set of sentences,
- N is the number of words in the test set,
- wi is each word in the test set,
- P(wi|w1...wi-1) is the probability assigned by the model to word wi given the previous words.

Perplexity is useful for several reasons:
1. **Model Comparison**: It provides a standard way to compare different models. A lower perplexity score generally indicates a better model.
2. **Model Improvement**: It helps in identifying areas where the model is struggling. If a certain type of sentence or vocabulary consistently results in high perplexity, it indicates a weakness in the model that can be addressed through further training or adjustments.
3. **Understanding Model Behavior**: It gives insights into how the model understands language. A lower perplexity suggests the model has a better grasp of the language's structure and usage.
4. **Evaluation of Unseen Data**: While it's trained on a specific corpus, perplexity can be calculated on unseen data to evaluate the model's generalization capability.

However, it's important to note that while perplexity is a widely used metric, it's not without its limitations. For instance, it doesn't directly correlate with human judgment of fluency or coherence, and
Generated in
Input tokens: 22
Output tokens: 510
Tokens per second: 140.55
Time to first token
Pricing

Official model
Pricing for official models works differently from other models. Instead of being billed by time, you’re billed by input and output, making pricing more predictable.

This model is priced by how many input tokens are sent and how many output tokens are generated.

Type     Per unit             Per $1
Input    $0.03 / 1M tokens    33M tokens / $1
Output   $0.25 / 1M tokens    4M tokens / $1

For example, for $10 you can run around 57,143 predictions where the input is a sentence or two (15 tokens) and the output is a few paragraphs (700 tokens).
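As a quick back-of-the-envelope check of that example, using the per-token rates from the table above (the 15-token input and 700-token output are the counts assumed in the example):

```python
# Rough check of the pricing example above.
input_cost = 15 * 0.03 / 1_000_000     # cost of ~15 input tokens per prediction
output_cost = 700 * 0.25 / 1_000_000   # cost of ~700 output tokens per prediction
print(10 / (input_cost + output_cost))  # ~56,996 predictions for $10
print(10 / output_cost)                 # ~57,143 if the negligible input cost is ignored
```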

Check out our docs for more information about how per-token pricing works on Replicate.

Readme

Model Summary

Granite-3.2-8B-Instruct is an 8-billion-parameter, long-context AI model fine-tuned for thinking capabilities. Built on top of Granite-3.1-8B-Instruct, it has been trained using a mix of permissively licensed open-source datasets and internally generated synthetic data designed for reasoning tasks. The model allows controllability of its thinking capability, ensuring it is applied only when required.
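The thinking toggle is not among the inputs listed above; on the upstream Hugging Face model card it is switched through the chat template. Below is a minimal local sketch, assuming a transformers version whose apply_chat_template forwards a thinking flag to that template; the flag name and exact usage follow the upstream card, not this page, and may differ from the hosted endpoint.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal sketch, assuming the Granite chat template accepts a `thinking` flag
# as shown on the upstream Hugging Face model card (an assumption, not part of
# the Replicate API documented on this page).
model_id = "ibm-granite/granite-3.2-8b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "How many r's are in 'strawberry'?"}]
inputs = tokenizer.apply_chat_template(
    messages,
    thinking=True,              # enable the reasoning trace; omit or set False to skip it
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0, inputs.shape[-1]:], skip_special_tokens=True))
```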

Evaluation Results

Models ArenaHard Alpaca-Eval-2 MMLU PopQA TruthfulQA BigBenchHard DROP GSM8K HumanEval HumanEval+ IFEval AttaQ
Llama-3.1-8B-Instruct 36.43 27.22 69.15 28.79 52.79 72.66 61.48 83.24 85.32 80.15 79.10 83.43
DeepSeek-R1-Distill-Llama-8B 17.17 21.85 45.80 13.25 47.43 65.71 44.46 72.18 67.54 62.91 66.50 42.87
Qwen-2.5-7B-Instruct 25.44 30.34 74.30 18.12 63.06 70.40 54.71 84.46 93.35 89.91 74.90 81.90
DeepSeek-R1-Distill-Qwen-7B 10.36 15.35 50.72 9.94 47.14 65.04 42.76 78.47 79.89 78.43 59.10 42.45
Granite-3.1-8B-Instruct 37.58 30.34 66.77 28.7 65.84 68.55 50.78 79.15 89.63 85.79 73.20 85.73
Granite-3.1-2B-Instruct 23.3 27.17 57.11 20.55 59.79 54.46 18.68 67.55 79.45 75.26 63.59 84.7
Granite-3.2-2B-Instruct 24.86 34.51 57.18 20.56 59.8 52.27 21.12 67.02 80.13 73.39 61.55 83.23
Granite-3.2-8B-Instruct 55.25 61.19 66.79 28.04 66.92 64.77 50.95 81.65 89.35 85.72 74.31 85.42

Supported Languages:

English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. However, users may finetune this Granite model for languages beyond these 12 languages.

Intended Use:

This model is designed to handle general instruction-following tasks and can be integrated into AI assistants across various domains, including business applications.

Capabilities

  • Thinking
  • Summarization
  • Text classification
  • Text extraction
  • Question-answering
  • Retrieval Augmented Generation (RAG)
  • Code related tasks
  • Function-calling tasks
  • Multilingual dialog use cases
  • Long-context tasks including long document/meeting summarization, long document QA, etc.

Training Data

Overall, our training data is largely composed of two key sources: (1) publicly available datasets with permissive licenses, and (2) internally generated synthetic data targeted at enhancing reasoning capabilities.

Infrastructure

We train Granite-3.2-8B-Instruct using IBM's supercomputing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations

Granite-3.2-8B-Instruct builds upon Granite-3.1-8B-Instruct, leveraging both permissively licensed open-source and select proprietary data for enhanced performance. Since it inherits its foundation from the previous model, all ethical considerations and limitations applicable to Granite-3.1-8B-Instruct remain relevant.