IBM's Granite 4.0 is now on Replicate
IBM has released Granite 4.0, their latest family of open-source small language models built for speed and low cost.
The Granite 4.0 models use a hybrid architecture that needs less memory than traditional transformer models, so you can run them on regular consumer GPUs instead of expensive server hardware. They work well for document summarization, RAG systems, and AI agents.
ibm-granite/granite-4.0-h-small is a 32 billion parameter (9 billion active) long-context instruct model, and it’s now available on Replicate.
Running Granite 4.0 with an API
You can start using Granite models right away on Replicate. Here’s how to run them with an API:
cURL
curl -s -X POST \
  -H "Authorization: Bearer $REPLICATE_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "input": {
      "messages": [
        {
          "role": "user",
          "content": "Explain the key benefits of using open-source models in business environments"
        }
      ]
    }
  }' \
  https://api.replicate.com/v1/models/ibm-granite/granite-4.0-h-small/predictions
JavaScript
Here’s an example using Replicate’s JavaScript client:
import Replicate from "replicate";

const replicate = new Replicate({
  auth: process.env.REPLICATE_API_TOKEN,
});

const output = await replicate.run(
  "ibm-granite/granite-4.0-h-small",
  {
    input: {
      messages: [
        {
          role: "user",
          content: "Explain the key benefits of using open-source models in business environments"
        }
      ]
    }
  }
);
Python
Here’s an example using Replicate’s Python client:
import replicate

output = replicate.run(
    "ibm-granite/granite-4.0-h-small",
    input={
        "messages": [
            {
                "role": "user",
                "content": "Explain the key benefits of using open-source models in business environments"
            }
        ]
    }
)
Granite is performant
Granite models are built around a hybrid design that combines two key ideas: the linear-scaling efficiency of Mamba-2 and the precision of Transformer attention.
Mamba-2 is a state space model that processes sequences linearly, unlike traditional transformers that scale quadratically with sequence length. This makes it more efficient for very long inputs, like documents with hundreds of thousands of tokens. Transformer blocks complement this by better supporting tasks that require long-context reasoning.
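To make the scaling difference concrete, here’s a toy back-of-the-envelope comparison (illustrative only, not Granite’s implementation; the state size is an arbitrary assumption): self-attention compares every token with every other token, while a state space model updates a fixed-size state once per token.

```python
# Toy cost model: attention scales quadratically with sequence length,
# while a state space model's sequential scan scales linearly.

def attention_ops(seq_len: int) -> int:
    # Self-attention compares every token with every other token: O(n^2).
    return seq_len * seq_len

def ssm_ops(seq_len: int, state_size: int = 16) -> int:
    # A state space model updates a fixed-size state per token: O(n).
    return seq_len * state_size

for n in (1_000, 10_000, 100_000):
    ratio = attention_ops(n) / ssm_ops(n)
    print(f"{n:>7} tokens: attention does ~{ratio:,.0f}x more work")
```

The gap widens with input length, which is why the linear path matters most for documents with hundreds of thousands of tokens.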
Select Granite 4.0 models also use an MoE (mixture of experts) routing strategy. The MoE setup splits the model into several “experts”. Instead of running every parameter at once, the model routes each input through only the experts it actually needs. For example, Granite 4.0 Small has 32 billion total parameters, only 9 billion of which are activated for an inference request.
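A minimal sketch of the routing idea (the expert count, top-k value, and gating function here are made up for illustration and are not Granite’s actual router): a gate scores each expert for an input, and only the top-k experts run, so most parameters stay idle per token.

```python
# Toy mixture-of-experts routing: score all experts, run only the top k.
import random

NUM_EXPERTS = 8
TOP_K = 2

def route(scores: list[float], k: int = TOP_K) -> list[int]:
    """Return the indices of the k highest-scoring experts."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

random.seed(0)
scores = [random.random() for _ in range(NUM_EXPERTS)]
print(f"active experts: {route(scores)} ({TOP_K}/{NUM_EXPERTS} run per token)")

# With Granite 4.0 Small's numbers: 9B of 32B parameters active per request.
print(f"active parameter fraction: {9 / 32:.0%}")
```

Because only the routed experts execute, the compute and memory cost per request tracks the active parameters (9B), not the total (32B).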
Together, these two approaches let Granite models handle long contexts quickly and run on more modest hardware, like consumer-grade GPUs, without sacrificing performance.
Granite is practical
Granite models are designed for real work, not just demos. They’re lightweight and efficient, which makes them a good fit for:
- Summarizing long documents, like contracts or technical manuals.
- Building systems that pull answers from large datasets, like CRMs or knowledge bases, without chopping inputs into tiny chunks.
- Running multiple AI agents at the same time for complex workflows.
- Deploying models on local devices or edge hardware, where bandwidth or cloud access is limited.
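For the long-document use cases above, a quick back-of-the-envelope check helps decide whether an input fits in a single request. The 4-characters-per-token heuristic and the 128K-token context window below are rough assumptions for illustration, not official Granite figures:

```python
# Rough check: will a long document fit into one request?
CONTEXT_WINDOW = 128_000   # assumed context length in tokens
CHARS_PER_TOKEN = 4        # crude heuristic for English text

def estimated_tokens(text: str) -> int:
    return len(text) // CHARS_PER_TOKEN

def fits_in_one_request(text: str, reserved_for_output: int = 4_000) -> bool:
    # Leave headroom for the model's generated answer.
    return estimated_tokens(text) + reserved_for_output <= CONTEXT_WINDOW

contract = "lorem ipsum " * 30_000   # ~360k characters, ~90k tokens
print(fits_in_one_request(contract))
```

If the document fits, you can skip the chunking pipeline entirely and send it in one call; if not, fall back to splitting.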
Granite is open source
Granite models are released under the Apache 2.0 license. That means you can use them for both commercial and non-commercial projects without restrictions or hidden fees. You can also modify the models however you want — fine-tune them, add adapters, or train them on private datasets — and release those changes under your own terms. This openness makes Granite a practical choice for companies that need compliance, security, or customization.
For more details, check out IBM’s documentation on deployment, fine-tuning, and integration patterns. If you’re using LangChain, IBM has also built a LangChain integration for Replicate to make it even easier to work with Granite models.