Official

ibm-granite / granite-3.0-8b-instruct

Granite-3.0-8B-Instruct is a lightweight and open-source 8B parameter model is designed to excel in instruction following tasks such as summarization, problem-solving, text translation, reasoning, code tasks, function-calling, and more.

  • Public
  • 181.4K runs
  • Priced per token
  • GitHub
  • Weights
  • Paper
  • License
Run with an API

New model available: Try out Granite-3.3-8b-instruct View the new model

Input

string
Shift + Return to add a new line

Prompt

Default: ""

string
Shift + Return to add a new line

System prompt to send to the model. This is prepended to the prompt and helps guide system behavior. Ignored for non-chat models.

Default: "You are a helpful assistant."

integer

The minimum number of tokens the model should generate as output.

Default: 0

integer

The maximum number of tokens the model should generate as output.

Default: 512

number

The value used to modulate the next token probabilities.

Default: 0.6

number

A probability threshold for generating the output. If < 1.0, only keep the top tokens with cumulative probability >= top_p (nucleus filtering). Nucleus filtering is described in Holtzman et al. (http://arxiv.org/abs/1904.09751).

Default: 0.9

integer

The number of highest probability tokens to consider for generating the output. If > 0, only keep the top k tokens with highest probability (top-k filtering).

Default: 50

number

Presence penalty

Default: 0

number

Frequency penalty

Default: 0

string
Shift + Return to add a new line

A comma-separated list of sequences to stop generation at. For example, '<end>,<stop>' will stop generation at the first instance of 'end' or '<stop>'.

Output

APR stands for Annual Percentage Rate. It is the annual interest rate charged for borrowing, expressed as a single percentage number that represents the actual yearly cost of funds over the term of a loan. It includes any fees or additional costs associated with the transaction.
Generated in

Pricing

Official model
Pricing for official models works differently from other models. Instead of being billed by time, you’re billed by input and output, making pricing more predictable.

This model is priced by how many input tokens are sent and how many output tokens are generated.

TypePer unitPer $1
Input
$0.05 / 1M tokens
or
20M tokens / $1
Output
$0.25 / 1M tokens
or
4M tokens / $1

For example, for $10 you can run around 57,143 predictions where the input is a sentence or two (15 tokens) and the output is a few paragraphs (700 tokens).

Check out our docs for more information about how per-token pricing works on Replicate.

Readme

A newer version of this model is available here. This version will be deprecated in January 2025.

Model Summary

Granite-3.0-8B-Instruct is an 8B parameter model finetuned from Granite-3.0-8B-Base using a combination of open source instruction datasets with permissive licenses and internally collected synthetic datasets. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging.

Supported Languages

English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, Chinese (Simplified)

Usage

Intended use

The model is designed to respond to general instructions and can be used to build AI assistants for multiple domains, including bussiness applications.

Capabilities

  • Summarization
  • Text classification
  • Text extraction
  • Question-answering
  • Retrieval Augmented Generation (RAG)
  • Code related
  • Function-calling
  • Multilingual dialog use cases

Generation

This is a simple example of how to use Granite-3.0-8B-Instruct model.

Install the following libraries:

pip install torch torchvision torchaudio
pip install accelerate
pip install transformers

Then, copy the snippet from the section that is relevant for your usecase.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "auto"
model_path = "ibm-granite/granite-3.0-8b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path)
# drop device_map if running on CPU
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)
model.eval()
# change input text as desired
chat = [
    { "role": "user", "content": "Please list one IBM Research laboratory located in the United States. You should only output its name and location." },
]
chat = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
# tokenize the text
input_tokens = tokenizer(chat, return_tensors="pt").to(device)
# generate output tokens
output = model.generate(**input_tokens, 
                        max_new_tokens=100)
# decode output tokens into text
output = tokenizer.batch_decode(output)
# print output
print(output)

Model Architecture

Granite-3.0-8B-Instruct is based on a decoder-only dense transformer architecture. Core components of this architecture are: GQA and RoPE, MLP with SwiGLU, RMSNorm, and shared input/output embeddings.

Model 2B Dense 8B Dense 1B MoE 3B MoE
Embedding size 2048 4096 1024 1536
Number of layers 40 40 24 32
Attention head size 64 128 64 64
Number of attention heads 32 32 16 24
Number of KV heads 8 8 8 8
MLP hidden size 8192 12800 512 512
MLP activation SwiGLU SwiGLU SwiGLU SwiGLU
Number of Experts 32 40
MoE TopK 8 8
Initialization std 0.1 0.1 0.1 0.1
Sequence Length 4096 4096 4096 4096
Position Embedding RoPE RoPE RoPE RoPE
# Paremeters 2.5B 8.1B 1.3B 3.3B
# Active Parameters 2.5B 8.1B 400M 800M
# Training tokens 12T 12T 10T 10T

Training Data

Granite Language Instruct models are trained on a selection of open-srouce instruction datasets with a non-restrictive license, as well as a collection of synthetic datasets created by IBM. Together, these instruction datasets are a solid representation of the following domains: English, multilingual, code, math, tools, and safety.

Infrastructure

We train the Granite Language models using IBM’s super computing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations

Granite instruct models are primarily finetuned using instruction-response pairs mostly in English, but also in German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese (Simplified). As this model has been exposed to multilingual data, it can handle multilingual dialog use cases with a limited performance in non-English tasks. In such case, introducing a small number of examples (few-shot) can help the model in generating more accurate outputs. The model also inherits ethical considerations and limitations from its base model. For more information, please refer to Granite-3.0-8B-Base model card.