Deployments
Use deployments for more control over how your models run.
Replicate makes it easy to run machine learning models. You can run the best open-source models with just one line of code, or deploy your own custom models. But sometimes you need more control. That’s where deployments come in.
What are deployments?
Deployments give you production-grade control over your model’s infrastructure and provide private, dedicated API endpoints.
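To make the "dedicated API endpoint" idea concrete, here is a minimal sketch (not an official client) that builds a request against a deployment's own predictions route, `POST /v1/deployments/{owner}/{name}/predictions`. The owner, name, and input below are placeholders; the route and Bearer auth follow Replicate's HTTP API, but check the API reference for the current shape.

```python
# Minimal sketch: build (but don't send) a request to a deployment's
# private predictions endpoint. "acme"/"my-app-image-gen" and the prompt
# are placeholder values, not a real deployment.
import json
import os
import urllib.request

API_BASE = "https://api.replicate.com/v1"

def deployment_prediction_request(owner: str, name: str, model_input: dict) -> urllib.request.Request:
    """Build a POST to the deployment's dedicated endpoint."""
    body = json.dumps({"input": model_input}).encode()
    return urllib.request.Request(
        f"{API_BASE}/deployments/{owner}/{name}/predictions",
        data=body,
        method="POST",
        headers={
            # Reads your token from the environment; empty if unset.
            "Authorization": f"Bearer {os.environ.get('REPLICATE_API_TOKEN', '')}",
            "Content-Type": "application/json",
        },
    )

req = deployment_prediction_request("acme", "my-app-image-gen", {"prompt": "a watercolor fox"})
# To actually run the prediction, send it with urllib.request.urlopen(req).
```

Because the endpoint is tied to your deployment rather than to a specific model version, client code like this never changes when you reconfigure the deployment behind it.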
Hardware flexibility
Choose from multiple GPU architectures including NVIDIA A100s, H100s, T4s, and more. Switch hardware types without changing your code to optimize for performance or cost.
Intelligent scaling
- Auto-scaling: Scale from zero to hundreds of instances based on traffic
- Always-on instances: Keep models warm to eliminate cold start delays
- Traffic-based scaling: Automatically add capacity during peak usage
- Scale-to-zero: Reduce costs by shutting down unused instances
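Both the hardware switch described above and these scaling bounds are plain field updates on the deployment itself. The sketch below builds a `PATCH /v1/deployments/{owner}/{name}` request that changes the GPU type and the instance bounds in one call; the field names (`hardware`, `min_instances`, `max_instances`) and the `gpu-h100` identifier follow Replicate's deployments API, but verify them against the current API reference before relying on them.

```python
# Hedged sketch: reconfigure a deployment's hardware and autoscaling
# bounds via the HTTP API. Owner/name and the token are placeholders.
import json
import urllib.request

def deployment_update_request(owner: str, name: str, token: str,
                              hardware: str, min_instances: int,
                              max_instances: int) -> urllib.request.Request:
    """Build a PATCH that swaps hardware and sets scaling bounds."""
    body = json.dumps({
        "hardware": hardware,            # e.g. a different GPU type; no code changes needed
        "min_instances": min_instances,  # > 0 keeps instances warm (no cold starts)
        "max_instances": max_instances,  # caps concurrent instances, bounding spend
    }).encode()
    return urllib.request.Request(
        f"https://api.replicate.com/v1/deployments/{owner}/{name}",
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )

req = deployment_update_request("acme", "my-app-image-gen",
                                "token-placeholder", "gpu-h100", 1, 20)
# Send with urllib.request.urlopen(req) using a real API token.
```

Setting `min_instances` to 0 enables scale-to-zero; setting it to 1 or more trades a steady baseline cost for the elimination of cold starts.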
Zero-downtime deployments
- Rolling updates: Deploy new model versions without interrupting service
- Canary deployments: Test new versions on a subset of traffic
- Instant rollbacks: Revert to previous versions if issues arise
Production monitoring
Track your deployment’s performance and health with:
- Real-time metrics: Track latency, throughput, and error rates
- Instance health: Monitor whether instances are starting, idle, or processing
- GPU memory usage: Track resource utilization across all instances
- Cost tracking: View detailed usage and spending analytics
- Request logs: Analyze predictions flowing through your model
For detailed information, see Monitor a deployment.
Enterprise security
- Private endpoints: Dedicated URLs that only you can access
- Audit logging: Track all model access and configuration changes
Deployments work with both open-source models and your own custom models.
Autoscaling
Deployments auto-scale according to demand: when you send a lot of traffic they scale up to handle it, and when things are quiet they scale back down, so you only pay for what you use. You can cap the maximum number of instances to bound your spend, or set a minimum to keep some instances warm and ready to serve predictions.
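As a back-of-the-envelope illustration of that trade-off, the sketch below compares one day of scale-to-zero against one day with a warm instance. The per-second price is a hypothetical figure chosen for the example, not a real quote; check the pricing page for actual hardware rates.

```python
# Back-of-the-envelope sketch of scale-to-zero vs. always-on cost.
# PRICE_PER_SECOND is a hypothetical GPU rate, not a real Replicate price.
PRICE_PER_SECOND = 0.000725
SECONDS_PER_DAY = 24 * 60 * 60

def daily_cost(busy_seconds: int, min_instances: int) -> float:
    """Cost for one day, in dollars.

    Warm instances bill for the whole day; traffic beyond what the warm
    instances cover bills only while it actually runs.
    """
    warm_seconds = min_instances * SECONDS_PER_DAY
    billed = warm_seconds + max(0, busy_seconds - warm_seconds)
    return billed * PRICE_PER_SECOND

# One hour of real traffic per day:
scale_to_zero = daily_cost(busy_seconds=3600, min_instances=0)  # pay for 1 busy hour
always_on = daily_cost(busy_seconds=3600, min_instances=1)      # pay for 24 hours
```

With light, bursty traffic, scale-to-zero is far cheaper but incurs cold starts; a nonzero minimum makes sense once latency matters more than the baseline cost.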