Some thoughts on the recent run of internet-breaking cloud provider outages. Are they symptoms of a change in the industry at large, or just a run of bad luck?

In less than a month, we’ve seen big outages at three major cloud providers: AWS, Azure and most recently Cloudflare. I can’t remember a time when we’ve seen a cluster of internet-breaking incidents like this. So what gives? Does this signal more bad times to come?

Whenever a celebrity died, my grandmother would always say “it comes in threes”. So for the superstitious among us, there may be a sense of comfort that we’ve had our three big outages for now. Of course, if you’re superstitious *and* cynical, you might be waiting for the next trio. For the more data-driven among us, let’s look at the reported root causes.

The prize for postmortem turnaround goes to Cloudflare, whose CEO put out a blog post within 24 hours of the incident. This post described the root cause as a change to permissions for a database query, which indirectly triggered a code path that caused a panic. An explicit change, but one that only caused a failure when combined with a few other aspects of a complex system.

The Azure outage was reported as being caused by a configuration change to Azure Front Door, a change that bypassed regular checks due to a flaw in the deployment system.

I wrote about the AWS outage in a previous post. Condensing the report even further, this incident was caused by a race condition in a DynamoDB subsystem that was likely only hit in us-east-1 because of how heavily that region is used. It is unclear how long the problematic code had been in place.

So out of three outages, two were caused by an explicit change. Out of those two, one was a permissions change rather than a code or “config” change. What they do have in common, though, is that they were all a confluence of multiple issues with very complex systems.

Wild speculation time!

There are plenty of pet theories out there about what this all means. Two common scapegoats are layoffs and AI. In both cases, teams are being expected to produce more output with fewer people. There could be a grain of truth in either of these, but there are also counterarguments.

Both AWS and Microsoft have reported large layoffs in recent times, but Cloudflare’s have been more modest.

As for AI, it’s hard to say how much code is really AI-generated in a closed source project (and it’s not trivial for open source). Based on general chatter in tech circles, more and more code is being generated by AI, and reviewing it effectively is a challenge for sure. But in the case of the AWS outage, it’s unclear how long the offending code had been in place.

There’s a really unsatisfying answer here: that this is a coincidence. From where I’m sitting there’s no clear smoking gun implicating a specific shift in the industry, even if it feels like there should be one.

Outages are pretty common. GitHub typically reports 15+ incidents in a given month on their status page. I don’t want to single them out; their status page is simply one of the more readable ones to use as an example. Most outages have pretty limited scope, impacting a small subset of users, a very specific workflow or a single feature.

Takeaway

The biggest lesson I see here is to avoid complacency. Mocking someone who experienced an outage is just asking fate to strike down your critical infrastructure the next month. Similarly, you could assume that it’s ok to be impacted by these outages because everyone else was, but that would be leaving some potential customer satisfaction on the table.

Hyperscale cloud providers have incredibly complicated platforms, and the series of events in each of these outages just goes to show how hard it is to predict how a small issue can be amplified by something seemingly unrelated.

After the AWS outage, I said that I didn’t think this would lead to an exodus from hyperscalers to simpler hosting providers. Even after two more outages, I don’t think that is likely to happen. But this may provide a boost to disaster recovery and multi-cloud plans.

What's with all these cloud outages?

Are the causes related?
