Posted by @nickstenning and @zeke
Our job is to run infrastructure you can rely on. We want you to be able to get on with building your business, launching your project, or just making cool things with machine learning models.
Unfortunately, we're not perfect (yet!) and sometimes things go wrong. When they do, you deserve to be kept in the loop about what's happening, and so we've shipped a status page at replicatestatus.com for real-time updates on the health of Replicate.
We're also going to be publishing detailed reports when we have major incidents so you can understand what happened during and after the outage, and can see what we're doing to improve our systems for the future. We had a significant outage on 11 May and you can find our report about it below.
Incident report: 11 May outage
On 11 May, Replicate experienced a significant outage affecting both the replicate.com website and our API. For about two hours from 05:45 UTC, many customers received slow responses or HTTP 500 errors from our API or website, and had trouble running predictions on Replicate's platform as a result.
Since the outage we’ve been working to understand what happened in detail so that we can improve our systems. We are sharing some of what we’ve learned in this report.
Our goal is to provide disruption-free service to our customers, but sometimes things don’t go according to plan. When they don’t, we work to ensure that the experience is not wasted and that we maximise our return on the "unplanned investment" of the outage.
At the center of Replicate's platform lives a PostgreSQL database which serves as our primary datastore. PostgreSQL is a reliable and capable technology and we expect it will continue to grow with us for a long time. That said, this database is critical infrastructure for Replicate, and it is currently coupled to core scenarios such as “run a prediction” more tightly than we would like. If the database is unavailable for five minutes, then we are unable to start predictions for five minutes.
Over recent weeks we have been working to change this. Specifically, we’ve been adding caches and queues to our architecture to ensure that running predictions will still work even when the database is temporarily unavailable due to necessary maintenance or short outages. As it turns out, some of this work was responsible for triggering this outage, but also helped insulate customers from some of the effects of the outage — as we originally intended.
In the lead-up to the incident, we shipped changes to our API service that allowed us to flush updates to prediction state to the database asynchronously, outside of the request handlers for the API service. We enabled these features broadly (using feature flags to control the rollout) over the course of 9 and 10 May in preparation for planned database maintenance on 10 May.
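As a rough illustration of what "flushing updates asynchronously, outside of the request handlers" can look like, here is a minimal sketch. It is not our production code: the names `handle_request`, `flush_worker`, and `pending_updates` are hypothetical, and a plain dictionary stands in for the database.

```python
import queue
import threading

# Hypothetical in-memory stand-in for the predictions table.
database = {}
# Updates wait here until the background worker writes them out.
pending_updates = queue.Queue()


def handle_request(prediction_id, status):
    """Request handler: enqueue the state change and return immediately."""
    pending_updates.put((prediction_id, status))
    return {"id": prediction_id, "status": status}


def flush_worker():
    """Background worker: write queued updates to the datastore in order."""
    while True:
        item = pending_updates.get()
        if item is None:  # sentinel value: stop the worker
            pending_updates.task_done()
            break
        prediction_id, status = item
        database[prediction_id] = status
        pending_updates.task_done()


worker = threading.Thread(target=flush_worker, daemon=True)
worker.start()

handle_request("p1", "starting")
handle_request("p1", "succeeded")

pending_updates.join()     # wait until every queued update has been flushed
pending_updates.put(None)  # ask the worker to stop
worker.join()
```

The point of the design is that the request handler never waits on the database, so a short database outage delays the flush rather than failing the request.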
A little after 14:00 UTC on 10 May, we resized our database primary successfully in just over a minute. The features we had rolled out to allow asynchronous prediction updates seemed to work largely as expected, and most of our users saw no disruption during database maintenance.
What we didn’t know until the outage was that our asynchronous update features had created a dangerous query pattern: simultaneous INSERT queries for the same prediction ID. When multiple parallel queries attempt to insert a row with the same value for a field that is supposed to be unique (such as prediction ID), only one can succeed. The parallel queries contend for the lock on the unique index, and all of them can take longer to process while the database decides which one wins.
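The "only one can succeed" behaviour can be sketched with a toy model. This is an assumption-laden illustration, not how PostgreSQL implements unique indexes: a dictionary plays the role of the index, a single `threading.Lock` plays the role of the index lock, and `try_insert` is a hypothetical helper.

```python
import threading

unique_index = {}              # toy stand-in for a unique index on prediction ID
index_lock = threading.Lock()  # toy stand-in for the lock that writers contend for
results = []


def try_insert(prediction_id, writer):
    """Attempt to claim prediction_id; only the first writer can succeed."""
    with index_lock:  # every writer queues up here, one at a time
        if prediction_id in unique_index:
            results.append((writer, "conflict"))
        else:
            unique_index[prediction_id] = writer
            results.append((writer, "inserted"))


# Four writers race to insert the same prediction ID.
threads = [threading.Thread(target=try_insert, args=("p1", n)) for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

winners = [writer for writer, outcome in results if outcome == "inserted"]
```

Whichever thread gets the lock first wins; the others wait their turn only to find out they lost. The more writers pile up, the longer each one waits.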
This didn’t immediately cause user-facing problems. Our code is designed to first attempt an INSERT and, if the INSERT fails, fall back to updating the already-existing row instead.
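The insert-then-update fallback described above looks roughly like the following sketch. It uses Python's built-in `sqlite3` module so that it runs anywhere; our real datastore is PostgreSQL, and the table layout and the `save_prediction` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE predictions (id TEXT PRIMARY KEY, status TEXT NOT NULL)"
)


def save_prediction(conn, prediction_id, status):
    """Try an INSERT first; if the ID already exists, UPDATE the row instead."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "INSERT INTO predictions (id, status) VALUES (?, ?)",
                (prediction_id, status),
            )
        return "inserted"
    except sqlite3.IntegrityError:
        # The primary key already exists, so fall back to an UPDATE.
        with conn:
            conn.execute(
                "UPDATE predictions SET status = ? WHERE id = ?",
                (status, prediction_id),
            )
        return "updated"


first = save_prediction(conn, "p1", "starting")
second = save_prediction(conn, "p1", "processing")
row = conn.execute(
    "SELECT status FROM predictions WHERE id = ?", ("p1",)
).fetchone()
```

SQLite serialises writers, so this sketch only shows the control flow; it cannot reproduce the lock contention that parallel INSERTs caused in our PostgreSQL database.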
But INSERT queries taking longer meant that more queries were running in parallel than was normal, which could result in even more contention for the unique index lock. We had created an inherently unstable situation which could tip over at any moment.
Connection pool exhaustion
A little after 05:00 UTC on 11 May, normal variations in traffic pushed us past this tipping point. Within a few minutes, stacking queries caused INSERT latencies to jump from about 1 millisecond to tens of seconds. This in turn meant that even more queries were running in parallel. Each of these parallel queries needed its own database connection, and we rapidly exhausted the maximum permitted number of connections to the database.
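Little's law gives a back-of-the-envelope feel for why the connection pool ran out so quickly: the average number of queries in flight is the arrival rate multiplied by the time each query takes. The sketch below applies it to the latencies mentioned above. The traffic rate and connection limit are made-up numbers for illustration, not figures from the incident.

```python
def connections_needed(queries_per_second, latency_seconds):
    """Little's law: average in-flight queries = arrival rate x latency."""
    return queries_per_second * latency_seconds


MAX_CONNECTIONS = 100  # hypothetical connection limit
RATE = 500             # hypothetical INSERT queries per second

# From about 1 millisecond per INSERT up to tens of seconds.
for latency in (0.001, 0.1, 10.0):
    needed = connections_needed(RATE, latency)
    state = "exhausted" if needed > MAX_CONNECTIONS else "ok"
    print(f"latency={latency}s -> about {needed:.1f} connections ({state})")
```

With the traffic rate held constant, a jump in latency of four orders of magnitude demands four orders of magnitude more connections, which no realistic limit can absorb.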
With no more connections available, new queries were not even able to run. This immediately resulted in the user-visible impact that we mentioned in the opening paragraphs of this report. The replicate.com website was down as a result (strictly speaking, the failures were intermittent, but it was definitely more down than up during this period).
Ironically, although the simultaneous INSERT queries were the result of the asynchronous prediction update code we had shipped in the days leading up to the incident, that same code meant that for at least some users things kept working. Customers who rely on webhooks to receive updates on their predictions rather than polling the API saw relatively little disruption during the outage.
Our database connection pool was fully saturated starting at about 05:45 UTC. Our monitoring systems detected the issue almost immediately and at 05:46 UTC paged an engineer.
While we now understand in detail the events leading up to the outage, at the time they were far from clear. It initially seemed more likely that something had been missed when we resized the database the day before. Perhaps we had accidentally provisioned the new database primary with slower disks? Perhaps there was some auxiliary job running on the new database server which was consuming disk I/O? Conversely, it seemed unlikely that the asynchronous update code was to blame, because it had been enabled, seemingly without causing any problems, for more than 18 hours at the time of the outage.
We involved our database provider, Crunchy Data, who responded within minutes and helped us rule out possible problems with the database.
Running out of options, we decided to disable the asynchronous prediction update features, and at 07:31 UTC we switched them all off using a feature flag. Within a few minutes, the database recovered and normal operations resumed.
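A feature flag used as a kill switch can be as simple as the sketch below. The flag name `async_prediction_updates`, the `save_update` helper, and the in-memory flag store are all hypothetical; a real setup would read the flag from a feature-flag service.

```python
# Hypothetical flag store; in practice this would be a feature-flag service.
flags = {"async_prediction_updates": True}


def save_update(prediction_id, status, flags, queue, database):
    """Route a prediction update according to the feature flag."""
    if flags.get("async_prediction_updates", False):
        queue.append((prediction_id, status))  # flushed later by a worker
        return "queued"
    database[prediction_id] = status           # written inside the request
    return "written"


queue, database = [], {}

before = save_update("p1", "starting", flags, queue, database)

flags["async_prediction_updates"] = False  # flip the kill switch

after = save_update("p1", "succeeded", flags, queue, database)
```

Because the old synchronous path is still in the code, switching the flag off restores the previous behaviour without a deploy.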
What we’ve learned
This outage has many of the classic hallmarks of a complex systems failure:
- A system runs “successfully” in a degraded state until that degradation tips over into a catastrophic failure
- A defence mechanism is implicated not only in triggering the failure, but also (partly successfully) in defending against that same failure
- Change creates new hazards, but change is also how hazards are made safe
One other interesting irony of this incident is that the critical trigger, the stacked INSERTs, was a direct result of a performance improvement delivered by the asynchronous prediction updates. We removed about 100-150 ms of latency from the process of creating a prediction, which caused two prediction updates that were previously separated by that duration to be processed at almost the same time. As is so often the case in distributed systems, we made one thing better and unintentionally made another thing a whole lot worse.
This was a painful experience for us — and we understand that it was a painful experience for many of you reading this, too — but we are grateful that we had this opportunity to understand our system better than we did before.
We have a clearer idea of the constraints of our database, and are perhaps even more determined to get the asynchronous prediction updates working so that database disruptions don’t affect the running of predictions. For the moment, the feature remains disabled while we redesign it to avoid the stacked INSERTs that caused this outage. In addition, we’re conducting a review of lock contention and other database hotspots that might cause similar problems if not addressed.
Thank you for bearing with us and, if you got this far, thank you for taking the time to read this report!