Deployments

Use deployments for more control over how your models run.


Replicate makes it easy to run machine learning models. You can run the best open-source models with just one line of code, or deploy your own custom models. But sometimes you need more control. That’s where deployments come in.

What are deployments?

Deployments give you production-grade control over your model’s infrastructure and provide private, dedicated API endpoints.
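Each deployment gets its own predictions endpoint on the Replicate HTTP API, keyed by your account and the deployment's name. As a minimal sketch, assuming a hypothetical deployment `acme/my-sdxl`, the request below is constructed with the standard library but not sent:

```python
import json
import urllib.request

# Hypothetical deployment owner and name, for illustration only.
owner, name = "acme", "my-sdxl"
url = f"https://api.replicate.com/v1/deployments/{owner}/{name}/predictions"

# Build the prediction request; the input schema depends on the model behind
# the deployment (a text prompt is just an example).
body = json.dumps({"input": {"prompt": "an astronaut riding a horse"}}).encode()
req = urllib.request.Request(
    url,
    data=body,
    headers={
        "Authorization": "Bearer <REPLICATE_API_TOKEN>",  # placeholder token
        "Content-Type": "application/json",
    },
    method="POST",
)

# urllib.request.urlopen(req) would send it; shown here without sending.
print(req.get_method(), req.full_url)
```

Because the endpoint names the deployment rather than a model version, you can change the hardware or model version behind it without touching calling code.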

Hardware flexibility

Choose from multiple GPU architectures including NVIDIA A100s, H100s, T4s, and more. Switch hardware types without changing your code to optimize for performance or cost.

Intelligent scaling

  • Auto-scaling: Scale from zero to hundreds of instances based on traffic
  • Always-on instances: Keep models warm to eliminate cold start delays
  • Traffic-based scaling: Automatically add capacity during peak usage
  • Scale-to-zero: Reduce costs by shutting down unused instances
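The instance bounds above are deployment settings you can change over the HTTP API. As a hedged sketch (again assuming the hypothetical deployment `acme/my-sdxl`, and that your account has access to the deployments update endpoint), this builds a `PATCH` request without sending it:

```python
import json
import urllib.request

owner, name = "acme", "my-sdxl"  # hypothetical deployment
url = f"https://api.replicate.com/v1/deployments/{owner}/{name}"

# min_instances > 0 keeps instances warm (no cold starts);
# min_instances = 0 enables scale-to-zero.
# max_instances caps how far traffic-based scaling can go.
settings = {"min_instances": 1, "max_instances": 10}

req = urllib.request.Request(
    url,
    data=json.dumps(settings).encode(),
    headers={
        "Authorization": "Bearer <REPLICATE_API_TOKEN>",  # placeholder token
        "Content-Type": "application/json",
    },
    method="PATCH",
)

print(req.get_method(), req.full_url)
```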

Zero-downtime deployments

  • Rolling updates: Deploy new model versions without interrupting service
  • Canary deployments: Test new versions on a subset of traffic
  • Instant rollbacks: Revert to previous versions if issues arise

Production monitoring

Track your deployment’s performance and health with:

  • Real-time metrics: Track latency, throughput, and error rates
  • Instance health: Monitor whether instances are starting, idle, or processing
  • GPU memory usage: Track resource utilization across all instances
  • Cost tracking: View detailed usage and spending analytics
  • Request logs: Analyze predictions flowing through your model

For detailed information, see Monitor a deployment.

Enterprise security

  • Private endpoints: Dedicated URLs that only you can access
  • Audit logging: Track all model access and configuration changes

Deployments work with both open-source models and your own custom models.

Autoscaling

Deployments auto-scale according to demand: when you send a lot of traffic they scale up to handle it, and when things are quiet they scale back down, so you only pay for what you use. You can cap the maximum number of instances to limit your spend, or set a minimum to keep some instances warm and ready for predictions.
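The behavior described above can be pictured with a toy model: scale to whatever the queue demands, clamped between the configured minimum and maximum. This is an illustration only, not Replicate's actual scaling algorithm:

```python
def target_instances(queued, per_instance_capacity, min_instances, max_instances):
    """Toy scaling policy: enough instances for the queue, within bounds."""
    needed = -(-queued // per_instance_capacity)  # ceiling division
    return max(min_instances, min(needed, max_instances))

print(target_instances(0, 4, 0, 10))   # quiet traffic: scales to zero
print(target_instances(37, 4, 0, 10))  # heavy traffic: capped at the maximum
print(target_instances(3, 4, 2, 10))   # minimum keeps instances warm
```

The minimum trades money for latency (warm instances cost more but skip cold starts); the maximum trades latency for money (requests may queue, but spend is bounded).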