Pipeline models
Explanation of pipeline models on Replicate
Pipeline models are a new kind of ephemeral CPU model that runs on Replicate using a dedicated runtime that’s optimized for speed. These models work like serverless functions: they run once and are then discarded, without any setup steps. The key feature is that they can call other Replicate models directly using the replicate Python client library.
Getting started
Pipeline models make a lot of new things possible, because Replicate has a huge library of models that you can pipe together. Pipe a FLUX LoRA output into Kling for stylized text-to-video. Add a prompt upscaler with Claude Sonnet, add sound with mmaudio, etc. It’s just code! No complex orchestration, just plain Python. You can add whatever preprocessing or glue code you want.
Just like other Replicate models, you can run pipelines in the web UI or from the API, and you pay per second of compute and for the cost of the downstream models your pipeline calls.
🐇 Eager to create your first pipeline model? Check out the quickstart guide.
Running pipeline models
Just like other Replicate models, you can run pipelines in the web UI or from the API:
import replicate
flux_dev = replicate.use("black-forest-labs/flux-dev")
claude = replicate.use("anthropic/claude-4-sonnet")
def main() -> None:
    images = flux_dev(prompt="a cat wearing an amusing hat")
    result = claude(prompt="describe this image for me", image=images[0])
    print(str(result))  # "This shows an image of a cat wearing a hat ..."
Hardware
Pipeline models run on CPU hardware, specifically CPU 1x 2GB. See hardware pricing for more details.
The downstream models called by your pipeline will run on a variety of hardware types, depending on the model.
Billing
You pay per second of compute and for the cost of the downstream models your pipeline calls.
For both public and private pipelines, you only pay for the time the pipeline is actively processing your requests. Setup and idle time for the model is free.
If a pipeline model fails, it will be billed for the duration of the run, plus the cost of the downstream models that were called.
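As a worked example with made-up numbers: a run that keeps the pipeline’s CPU busy for 20 seconds at a hypothetical rate of $0.0001 per second and calls one downstream model whose prediction costs $0.03 would be billed 20 × $0.0001 + $0.03 = $0.032 in total.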
See pricing for more details.
Cancellation
If you cancel a prediction request made to a pipeline that calls other models, we will cancel all downstream predictions that are queued or haven’t started. If a downstream prediction is already running, that prediction will continue to run to completion.
Stack depth
Pipeline models can call other pipeline models, up to a limit of 250 layers deep.
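Calling another pipeline looks exactly like calling any other model. A minimal sketch (the pipeline name here is hypothetical):
import replicate
# "my-name/caption-pipeline" is a hypothetical pipeline model that itself calls other models.
captioner = replicate.use("my-name/caption-pipeline")
caption = captioner(image="https://example.com/photo.jpg")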
Creating pipeline models
You can create pipeline models in the web UI or from the API.
To get started creating your first pipeline model, check out the quickstart guide.
To develop a pipeline model on your own machine with your preferred editor and tools, check out the guide to building pipeline models locally.
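If you just want a feel for the shape of the code, here’s a sketch of a simple pipeline that chains two models, using the main() entrypoint and annotated inputs shown throughout this doc (the exact project layout is covered in the quickstart guide):
import replicate

flux_dev = replicate.use("black-forest-labs/flux-dev")
claude = replicate.use("anthropic/claude-4-sonnet")

# Annotated parameters on main() become the pipeline's inputs;
# the return value becomes its output.
def main(prompt: str) -> str:
    images = flux_dev(prompt=prompt)
    description = claude(prompt="Describe this image in one sentence.", image=images[0])
    return str(description)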
Pipeline model features
Input & output types
Below are the Python annotations that Pipelines understands today and how each one shows up in the web UI / API.
| Annotation | UI control / JSON type | Typical use-case | Notes |
|---|---|---|---|
| str | Text field | Prompts, IDs, file URLs | Empty string "" allowed |
| int, float | Number field | Steps, CFG, FPS, seed | Add min= / max= via cog.Input |
| bool | Checkbox | Feature flags, safety toggles | Defaults to false |
| cog.Path | File upload or presigned URL | Images, videos, audio, weights | Returned paths are real files locally, saved in the tmp/ directory |
| list[str] | Repeating text field | Batch prompts, stop words | Max 64 KB per item |
| dict, typing.Any | Raw JSON | Arbitrary config blobs | Passed through untouched |
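For instance, a pipeline entrypoint that combines several of these annotations might be declared like this (a sketch with made-up parameter names, following the min= / max= note in the table above):
from cog import Input, Path

def main(
    prompt: str,                                    # Text field
    steps: int = Input(default=25, min=1, max=50),  # Number field with bounds
    upscale: bool = False,                          # Checkbox
    image: Path | None = None,                      # File upload or presigned URL
) -> Path: ...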
Here’s an example of passing a hint function to replicate.use to give a downstream model typed inputs and outputs:
import replicate
from pathlib import Path
# Flux takes a required prompt string and optional image and seed.
def hint(*, prompt: str, image: Path | None = None, seed: int | None = None) -> str: ...
flux_dev = replicate.use("black-forest-labs/flux-dev", hint=hint)
def main() -> None:
    output1 = flux_dev()  # will warn that `prompt` is missing
    output2 = flux_dev(prompt="str")  # output2 will be typed as `str`
Supported Python packages
There are limitations around what packages are available when running a pipeline on Replicate. Supported packages include:
anyio==4.9.0
certifi==2025.6.15
charset-normalizer==3.4.2
coglet @ https://github.com/replicate/cog-runtime/releases/download/v0.1.0-alpha31/coglet-0.1.0a31-py3-none-any.whl
decorator==5.2.1
h11==0.16.0
httpcore==1.0.9
httpx==0.28.1
idna==3.10
imageio==2.37.0
imageio-ffmpeg==0.6.0
joblib==1.5.1
moviepy==2.2.1
numpy==2.3.1
packaging==25.0
pillow==11.2.1
pip==25.1.1
proglog==0.1.12
pydantic==1.10.22
python-dotenv==1.1.1
replicate==1.1.0b2
requests==2.32.4
scikit-learn==1.7.0
scipy==1.16.0
sniffio==1.3.1
threadpoolctl==3.6.0
tqdm==4.67.1
typing-extensions==4.14.0
urllib3==2.5.0
For an up-to-date list of supported packages, see pipelines-runtime.replicate.delivery/requirements.txt.
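Because packages like pillow are in the runtime, a pipeline can do lightweight glue work between model calls without any extra setup. Here’s a sketch that downscales a generated image before returning it (assuming the image output behaves like a local file, as described under “Passing files between models” below):
import replicate
from cog import Path
from PIL import Image

flux_dev = replicate.use("black-forest-labs/flux-dev")

def main(prompt: str) -> Path:
    images = flux_dev(prompt=prompt)
    img = Image.open(images[0])  # cog.Path outputs open like ordinary local files
    img.thumbnail((256, 256))    # simple preprocessing between model calls
    out = Path("thumbnail.png")
    img.save(out)
    return out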
Passing files between models
Outputs annotated as cog.Path behave like local files and feed directly into the next model:
import replicate
upscale = replicate.use("stability-ai/sd-xl-upscale")
caption = replicate.use("anthropic/claude-4-sonnet")
hi_res = upscale(image="dog.jpg")
summary = caption(prompt="Describe the image", image=hi_res)
Running models in parallel
You can run multiple models in parallel by starting each prediction with create() and collecting the results later with output():
import replicate
function1 = replicate.use("my-name/function1")
function2 = replicate.use("my-name/function2")
run1 = function1.create(input1=value1, input2=value2)
run2 = function2.create(input1=value1, input2=value2)
output1 = run1.output()
output2 = run2.output()
Streaming outputs in real-time
Display partial tokens, images, or status updates as soon as they’re produced:
import replicate
claude = replicate.use("anthropic/claude-4-sonnet", streaming=True)
output = claude(prompt="Summarize War and Peace in emojis")
for chunk in output:
    print(chunk)
Async predictions for high concurrency
Await predictions inside asyncio apps (FastAPI, Quart, etc.) for better throughput:
import asyncio, replicate
flux = replicate.use("black-forest-labs/flux-dev", use_async=True)
claude = replicate.use("anthropic/claude-4-sonnet", use_async=True)
async def handler():
    image, song = await asyncio.gather(
        flux(prompt="astronaut playing guitar on Mars"),
        claude(prompt="Write a song about Mars"),
    )
    return song
Getting logs
To see the logs of downstream models, use run.logs():
import replicate
claude = replicate.use("anthropic/claude-4-sonnet")
def main() -> None:
    prediction = claude.create(prompt="Give me a recipe for tasty smashed avocado on sourdough toast that could feed all of California.")
    prediction.logs()  # get current logs (WIP)
    prediction.output()  # get the output
LLM-friendly docs
We maintain an LLM-friendly version of the pipeline models documentation that you can use in AI-powered code-editing tools like Cursor, Copilot, or Claude to give them extensive knowledge of how pipeline models work and how to author them.
Feed this URL into your preferred AI editor to give it context about pipeline models:
https://replicate.com/docs/reference/pipelines/llms.txt