Deploy a custom model
Learn to build, deploy, and scale your own custom model on Replicate.
Replicate lets you build and scale AI products with your own custom models. You write the code, we handle the infrastructure orchestration and scaling.
With custom models and deployments, you can:
- Run arbitrary code and weights on high-end GPUs (A100s, H100s, B200s, and more) via API
- Run dedicated deployments with no noisy neighbors
- Scale from zero to thousands of GPUs automatically based on demand
- Deploy updates without downtime using rolling deployments
- Monitor performance, costs, and usage in real-time
- Pay only for what you use
This guide shows you how to build a custom model from scratch using Cog and deploy it on production-grade GPU infrastructure that scales dynamically and can handle millions of requests. If you're looking to create a fine-tuned image generation model using your own training data, check out the guide to fine-tuning image models instead.
What is a custom model?
In the world of machine learning, the word "model" can mean many different things depending on context. It can be the source code, trained weights, architecture, or some combination thereof. At Replicate, when we say "model" we're referring to a trained, packaged, and published software program that accepts inputs and returns outputs.
Models on Replicate are built with Cog, an open-source tool that packages arbitrary code into a standard, production-ready container. Cog handles generating API servers and cloud deployment automatically. You define your model environment and prediction logic, and Replicate handles the server, scaling, and compute management.
Step 1: Create a model
Click "Create model" in the account menu or go to replicate.com/create to create your new model.
Choose a name
Pick a short and memorable name, like hotdog-detector. You can use lowercase characters and dashes.
Choose an owner
If you're working with a team, you should create your model under an organization so you and your team can share access and billing. To create an organization, click "Join or create organization" from the account menu or go to replicate.com/organizations/create. Learn more about organizations.
If you're creating a model for your own individual use, you don't need an organization. Create it under your user account.
Choose model visibility
Public models can be discovered and used by anyone. Private models can only be seen by the user or organization that owns them.
Set your model to be private to start, so it's accessible only to you and your organization. You can make it public later if you want to share it with others.
Choose hardware
Choose the type of hardware your model runs on. This affects performance and cost. The billing docs show specifications and pricing for available hardware.
For GPU-accelerated models, start with an Nvidia T4 GPU for development. You can upgrade to more powerful GPUs like A100s or H100s later using deployments, without changing your code.
Once you've created your new model, you should see a page that looks something like this:
🥷 If you prefer to work from the command line, you can use the Replicate CLI to create models, or create models programmatically using the API.
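For example, creating a model over the API boils down to a single authenticated POST. The sketch below builds that request using only the standard library; the endpoint, field names, and the `gpu-t4` hardware SKU are based on Replicate's models API, so verify them against the current API reference before relying on this:

```python
import json
import os
import urllib.request

API_URL = "https://api.replicate.com/v1/models"

def build_create_model_request(owner: str, name: str, hardware: str,
                               visibility: str = "private") -> urllib.request.Request:
    """Build (but do not send) a request to create a model via Replicate's HTTP API."""
    payload = {
        "owner": owner,            # your username or organization name
        "name": name,              # e.g. "hotdog-detector"
        "visibility": visibility,  # "private" or "public"
        "hardware": hardware,      # e.g. "gpu-t4"; see the billing docs for SKUs
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ.get('REPLICATE_API_TOKEN', '')}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# To actually create the model, send the request with a valid API token set:
# urllib.request.urlopen(build_create_model_request("your-username", "hotdog-detector", "gpu-t4"))
```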
Step 2: Build your model
Now that you've created your model on Replicate, it's time to actually write the code for it, build it, and push it to Replicate.
Youâll use Cog to build and push your model. Cog is an open-source tool that packages arbitrary code in production-ready containers with automatic API generation.
Follow this guide to learn how to install Cog, write the code for your model, and push it to Replicate:
✏️ Guide: Push your own model with Cog
Once you've pushed your custom model, return to this guide to learn how to create a deployment and scale it.
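As a preview of what that guide covers, a Cog model is a Python class with a `setup` method that loads your weights once at container start and a `predict` method that handles one request. A minimal, hypothetical sketch (the hotdog-detector logic is a placeholder; follow the guide above for a working version):

```python
# predict.py — minimal Cog predictor sketch (hypothetical hotdog detector)
from cog import BasePredictor, Input, Path

class Predictor(BasePredictor):
    def setup(self):
        """Load model weights once, when the container starts."""
        # e.g. self.model = load_weights("./weights.pth")  # placeholder
        ...

    def predict(self, image: Path = Input(description="Image to classify")) -> str:
        """Run a single prediction."""
        # e.g. return "hotdog" if self.model(image) > 0.5 else "not hotdog"
        ...
```

Alongside `predict.py`, a `cog.yaml` file declares your Python version and dependencies; `cog push` then builds the container and uploads it to Replicate.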
Step 3: Create a deployment
When you push a model to Replicate, we automatically generate an API server and deploy it on our GPU cluster. For production use, you should create a deployment to get full control over scaling, hardware, and performance.
Deployments give you production-grade control over your modelâs infrastructure and provide a private, dedicated API endpoint.
To create a deployment, go to your model page and click the Deploy button.
You'll see a form where you can configure:
- Deployment name: Choose a descriptive name for your deployment
- Hardware type: Select from available GPU architectures
- Instance scaling: Set minimum and maximum instance counts
- Cost estimation: Live preview of your deployment costs
The form shows real-time cost estimates as you adjust your configuration, helping you balance performance and budget.
Once you're satisfied with your choices, click Create a deployment.
🔥 Keep your model warm. If you're giving a demo or putting your model in the hands of users, you'll want it to respond quickly, without a cold boot. Set the minimum number of instances to 1 to make sure that at least one instance is always running. You can reduce this to 0 later if you don't need the model to be instantaneously responsive to new requests.
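If you'd rather script this than use the form, the same configuration can be sent to Replicate's deployments API. A sketch with the standard library; the endpoint and field names follow the deployments API and the version ID is a placeholder, so verify the details against the API reference:

```python
import json
import os
import urllib.request

def build_create_deployment_request(name: str, model: str, version: str,
                                    hardware: str = "gpu-t4",
                                    min_instances: int = 1,
                                    max_instances: int = 5) -> urllib.request.Request:
    """Build (but do not send) a request to create a deployment.

    min_instances=1 keeps one instance always warm, avoiding cold boots.
    """
    payload = {
        "name": name,        # e.g. "hotdog-detector-prod" (hypothetical)
        "model": model,      # e.g. "your-username/hotdog-detector"
        "version": version,  # version ID from your model page (placeholder here)
        "hardware": hardware,
        "min_instances": min_instances,
        "max_instances": max_instances,
    }
    return urllib.request.Request(
        "https://api.replicate.com/v1/deployments",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ.get('REPLICATE_API_TOKEN', '')}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```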
After creating your deployment, you'll see updated example code for running your model through the deployment endpoint. This code differs from the earlier API calls: it references your deployment (you/your-deployment) rather than the model directly (you/your-model).
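The shape of that call looks roughly like this: predictions go to a deployments-scoped endpoint instead of the shared model endpoint. A hedged sketch using only the standard library; the endpoint path and the `Prefer: wait` header follow Replicate's API, but check the exact details against the API reference:

```python
import json
import os
import urllib.request

def build_deployment_prediction_request(owner: str, deployment: str,
                                        input_data: dict) -> urllib.request.Request:
    """Build (but do not send) a prediction request against a deployment's
    private endpoint."""
    url = f"https://api.replicate.com/v1/deployments/{owner}/{deployment}/predictions"
    return urllib.request.Request(
        url,
        data=json.dumps({"input": input_data}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ.get('REPLICATE_API_TOKEN', '')}",
            "Content-Type": "application/json",
            "Prefer": "wait",  # hold the connection until the prediction finishes
        },
        method="POST",
    )

# Hypothetical usage (deployment name and input are placeholders):
# req = build_deployment_prediction_request("your-username", "hotdog-detector-prod",
#                                           {"image": "https://example.com/photo.jpg"})
# urllib.request.urlopen(req)
```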
Step 4: Monitor your deployment
Once your deployment receives traffic, go to your deployment page to monitor its performance. The dashboard shows:
- Request volume and latency: Track how many requests you're processing and response times
- Instance utilization: See how many instances are active, idle, or starting up
- Error rates: Monitor failed requests and troubleshoot issues
- Cost analysis: View detailed spending broken down by compute time and instance hours
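When reading the cost numbers, keep in mind that a nonzero minimum instance count buys a fixed always-on baseline on top of per-request usage. A back-of-the-envelope sketch; the per-second price below is hypothetical, so substitute real rates from the billing docs:

```python
def monthly_baseline_cost(min_instances: int, price_per_second: float) -> float:
    """Estimate the always-on baseline cost of a deployment over a 30-day month.

    price_per_second is a hypothetical rate; check the billing docs for real
    pricing. Usage above min_instances is billed on top of this baseline.
    """
    seconds_per_month = 30 * 24 * 60 * 60  # 2,592,000
    return min_instances * price_per_second * seconds_per_month

# One warm instance at a hypothetical $0.000225/sec is about $583/month;
# with min_instances=0 the baseline is $0 (you pay only for usage).
```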
Step 5: Iterate on your deployment
Machine learning models improve over time as you retrain with new data, fix bugs, or update dependencies. With deployments, you can update your model without disrupting service.
When you make changes to your model, run cog push to publish updates. Your deployment can then use rolling updates to deploy the new model without downtime. You can integrate this into your existing development workflow using GitHub Actions.
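A rolling update is triggered by pointing the deployment at the new version that cog push produced. A sketch of that request with the standard library; the PATCH endpoint and field name follow Replicate's deployments API, so verify them against the API reference:

```python
import json
import os
import urllib.request

def build_update_deployment_request(owner: str, deployment: str,
                                    new_version: str) -> urllib.request.Request:
    """Build (but do not send) a request that points a deployment at a new
    model version, triggering a rolling update with no downtime."""
    url = f"https://api.replicate.com/v1/deployments/{owner}/{deployment}"
    return urllib.request.Request(
        url,
        data=json.dumps({"version": new_version}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ.get('REPLICATE_API_TOKEN', '')}",
            "Content-Type": "application/json",
        },
        method="PATCH",
    )
```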
Deployments handle the complexity of managing multiple model iterations, allowing you to test changes, roll back if needed, and ensure consistent performance for your users.
Next steps
You've successfully created a custom model on Replicate, built it with Cog, deployed it with production-grade infrastructure control, and set up monitoring to track its performance. Your model now has a private API endpoint that can scale automatically based on demand.
Now it's time to integrate your deployed model into your app or product:
- Learn how to continuously deploy your model using GitHub Actions.
- Check out the client libraries you can use to run your model.
- Check out the deployments guide to learn more about model performance and scaling.
🎉