Deploy a custom model
Replicate makes it easy to run thousands of open-source models in the cloud with just a few lines of code. Using existing public models is a good way to start, but you can also build and deploy your own custom models.
Using custom models and deployments, you can:
- build private models with your team or on your own
- only pay for what you use
- scale automatically depending on traffic
- monitor model activity and performance
In this guide you'll learn to build, deploy, and scale your own custom model on Replicate.
This guide will show you how to build a custom model from scratch using Cog. If you're looking to create a fine-tuned image generation model using your own training data, check out the guide to fine-tuning image models.
What is a custom model?
In the world of machine learning, the word "model" can mean many different things depending on context. It can be the source code, trained weights, architecture, or some combination thereof. At Replicate, when we say "model" we're referring to a trained, packaged, and published software program that accepts inputs and returns outputs.
Models on Replicate are built with Cog, an open-source tool that lets you package machine learning models in a standard, production-ready container. Using Cog, you can deploy your packaged model to Replicate, or your own infrastructure.
If you just want to run an existing public model with customized hardware and scaling settings, you may not even need a custom model. Check out deployments.
Step 1: Create a model
Click "Create model" in the account menu or go to replicate.com/create to create your new model.
Choose a name
Pick a short and memorable name, like `hotdog-detector`. You can use lowercase characters and dashes.
Choose an owner
If you're working with a team, you should create your model under an organization so you and your team can share access and billing. To create an organization, click "Join or create organization" from the account menu or go to replicate.com/organizations/create.
If you're creating a model for your own individual use, you don't need an organization. Create it under your user account.
Choose model visibility
Next, choose whether to make your model public or private. There are two important factors to consider here:
- Visibility: Public models can be discovered and used by anyone. Private models can only be seen by the user or organization that owns them.
- Cost: When running public models, you only pay for the time it takes to process your request. When running private models, you also pay for setup and idle time. Take a look at how billing works on Replicate for a full explanation.
Choose hardware
Choose the type of hardware you want your model to run on. This will affect how the model performs and how much it costs to run. The billing docs show the specifications of the different hardware available and how much each costs.
If your model requires a GPU to run, choose a lower-price GPU model to start, like the Nvidia T4 GPU. Later in this guide, you'll learn how to use deployments so you can customize the hardware on the fly.
Once you've created your new model, you should see a page that looks something like this:
🥷 If you prefer to work from the command line, you can use the Replicate CLI to create models, or create models programmatically using the API.
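For instance, here's a minimal sketch of creating a model with the Python client, assuming the `replicate` package is installed and `REPLICATE_API_TOKEN` is set in your environment; the owner and name below are placeholders:

```python
import replicate

# Sketch: create a model programmatically. Owner and name are
# placeholders; swap in your own values.
model = replicate.models.create(
    owner="your-username",    # your user or organization name
    name="hotdog-detector",   # lowercase characters and dashes
    visibility="private",     # or "public"
    hardware="gpu-t4",        # a lower-price GPU to start with
)
print(f"Created {model.owner}/{model.name}")
```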
Step 2: Build your model
Now that you've created your model on Replicate, it's time to actually write the code for it, build it, and push it to Replicate.
You'll use Cog to build and push your model. Cog is an open-source tool that makes it easy to put a machine learning model in a Docker container.
Follow this guide to learn how to install Cog, write the code for your model, and push it to Replicate:
✏️ Guide: Push your own model with Cog
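To give you a feel for what that guide covers: a Cog model pairs a `cog.yaml` file (listing your dependencies) with a predictor like the sketch below. The `load_my_model` helper and the hotdog-detection logic are hypothetical placeholders for your own code:

```python
# predict.py — a minimal Cog predictor sketch.
from cog import BasePredictor, Input, Path

class Predictor(BasePredictor):
    def setup(self):
        """Runs once when the container starts: load weights into memory."""
        self.model = load_my_model()  # hypothetical: load your trained weights

    def predict(self, image: Path = Input(description="Image to classify")) -> str:
        """Runs for each request: return a label for the input image."""
        return "hotdog" if self.model.is_hotdog(image) else "not hotdog"
```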
Once you've pushed your custom model, return to this guide to learn how to run it, deploy it, and scale it.
Step 3: Run the model
When you push a model to Replicate, we automatically generate an API server for it and deploy it on a big cluster of GPUs. We also generate a web form that you can use to run the model right from your browser.
Click the "Run" tab, fill out the inputs form, and hit "Run":
Once it finishes, you'll see the outputs on the page. You'll also see tabs that show code snippets for running the model with those same inputs using different programming languages and tools like Node.js, Python, cURL, etc:
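The Python snippet, for example, looks roughly like this sketch; the model name, version hash, and `image` input are placeholders, so copy the exact values from your model page:

```python
import replicate

# Sketch: run the model via the Python client. Replace the model name,
# version hash, and input with the values shown on your model's Run tab.
output = replicate.run(
    "your-username/hotdog-detector:VERSION_ID",
    input={"image": open("lunch.jpg", "rb")},
)
print(output)
```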
Step 4: Deploy and scale
Your newly published model is now up and running in the cloud. You can run it as-is using the web form and the API as described in the previous step, but if you're planning to use it in production for Something Real™, you should set up a deployment for it.
Deployments let you control the configuration of a model and provide a private, fixed API endpoint.
With deployments you can:
- Roll out new versions of your model without having to edit your code.
- Auto-scale your models to handle extra load and scale to zero when they're not being used.
- Keep instances always on to avoid cold boots.
- Customize what hardware your models run on.
- Monitor whether instances are setting up, idle, or processing predictions.
- Monitor the predictions that are flowing through your model.
To create a deployment, go to your model page and click the "Deploy" button.
You'll see a form that lets you choose a name for your deployment, as well as the hardware it runs on and the minimum and maximum number of instances to run. You can change the hardware type and number of instances to see a live-updating estimate of the cost on the right-hand side of the page.
Once you're satisfied with your choices, click "Create a deployment".
🔥 Keep your model warm. If you're giving a demo or putting your model in the hands of users, you'll want it to respond quickly, without a cold boot. Set the minimum number of instances to 1 to make sure that at least one instance is always running. You can reduce this to 0 later if you don't need the model to be instantaneously responsive to new requests.
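If you'd rather script this step, here's a hedged sketch of creating the same deployment with the Python client, assuming a recent version of the `replicate` package; all names and the version hash are placeholders:

```python
import replicate

# Sketch: create a deployment via the API. Names and version hash
# are placeholders; min_instances=1 keeps one instance warm.
deployment = replicate.deployments.create(
    name="my-hotdog-detector",
    model="your-username/hotdog-detector",
    version="VERSION_ID",
    hardware="gpu-t4",
    min_instances=1,
    max_instances=5,
)
```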
After creating the deployment, you'll see new example code for how to run your model using your deployment. Note that this client library code is slightly different from the API call you made earlier. It's a different method and references the deployment (`you/your-deployment`) rather than the model itself (`you/your-model`):
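With the Python client, the deployment-based call looks roughly like this sketch; the deployment name and input are placeholders carried over from the examples above:

```python
import replicate

# Sketch: run a prediction through the deployment rather than the model.
deployment = replicate.deployments.get("your-username/my-hotdog-detector")
prediction = deployment.predictions.create(
    input={"image": open("lunch.jpg", "rb")}
)
prediction.wait()  # block until the prediction completes
print(prediction.output)
```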
Once your deployment starts receiving traffic, you can view its recent activity and performance metrics:
Step 5: Iterate on your model
At this point you've created a working model with a single version. Maybe it's already perfect, but in all likelihood you'll want to make some improvements to it.
Just like normal software, machine learning models change and improve over time, and those changes are released as new versions. Whenever you retrain a model with new data, fix a bug in the source code, or update a dependency, those changes can influence the behavior of the model. As you make these changes, you'll publish them as new versions, so you can ship those improvements without disrupting the experience for existing users of the model. Versioning is essential to making machine learning reproducible; it helps guarantee that a model will behave consistently regardless of when or where it's being run.
If you built your model using Cog, you can release new versions of your model by running `cog push`. You can integrate this into your existing software development release process on GitHub using a GitHub Actions workflow.
If you trained an existing model on your data using Replicate's training API, you can release new versions by running the training API again with new training data, or against a newer version of the base model.
Once you've updated your model and confirmed it behaves how you expected, don't forget to update your deployment to use the new version you've just published.
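If you want to automate that last step, one approach is to update the deployment over the HTTP API; here's a hedged sketch with placeholder names and version hash:

```python
import os
import requests

# Sketch: point an existing deployment at a newly pushed version.
# Owner, deployment name, and version hash are placeholders.
resp = requests.patch(
    "https://api.replicate.com/v1/deployments/your-username/my-hotdog-detector",
    headers={"Authorization": f"Bearer {os.environ['REPLICATE_API_TOKEN']}"},
    json={"version": "NEW_VERSION_ID"},
)
resp.raise_for_status()
```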
Next steps
Now that you've built and deployed your own custom model, it's time to start using it in your app or product.
- Learn how to continuously deploy your model using GitHub Actions.
- Check out the client libraries you can use to run your model.
- Check out the deployments guide to learn more about model performance and scaling.