AWS Outage, What Now?

Penned by John Minnihan (LinkedIn, Twitter)

John is a technologist with a background that includes repairing some of the earliest PCs at the component level and inventing hosted source control. He’s done infrastructure projects at Amazon, Wily, Walmart.com, Oracle and many others. Prior to his work in tech, John wanted to be a stuntman and once jumped a motorcycle 70 feet.

I’m sitting here wondering whether to continue using AWS for my stuff.

I’ve used AWS services since ’06. I know it pretty well. I can model virtually anything from idea to functional prototype in a matter of hours. I’m extremely confident in my skills because I’ve been doing this for a long time.

It’s great being able to say ‘I need [foo]’ and then go integrate ‘AWS Foo’ into whatever I’m building without writing [foo] myself. ‘AWS Managed Foo’ is a big advantage in speed-to-market and ongoing maintenance (patches, etc.). It’s smart to do this because you’re leveraging a proven, packaged outcome without the need to build it. That’s powerful.

At some point though, reintegrating all these discrete managed services into a giant architecture of reconnected services is really just reimplementing old-school web, application and database servers, plus some commoditized autoscaling and load balancing, that AWS has broken apart into siloed services.

“There are only two ways to make money in business: one is to bundle; the other is unbundle.” Jim Barksdale, former CEO and President of Netscape.

As your application’s architecture grows and use of these managed services increases, two architectural dimensions are growing at accelerating rates in opposite directions:

Perceived Complexity and Cost vs Risk

Let’s examine Perceived Complexity and Cost first: ‘I don’t have to design and implement [foo] (or fooX or fooY or…). I don’t need to care how it works and I’ll only pay for the time or units I actually use. This is great!’ Your perception is that you’re reducing both complexity and cost, so that plot line is going down over time.

This is the siren song of pay-as-you-go models that employ managed black box services.

Now let’s take a look at Risk as use of all those managed services increases. You reach a point where virtually all (see what I did there) of the infrastructure your application utilizes is managed services. Each one is a discrete function that was separated out from a web or application server long ago, and all of them provide black-boxed service abstractions connected using some sort of glue or event triggers.

All of that increases complexity and masks the underlying risk that exists in the black boxes. Any one of these services can fail and take down your entire application, and you can’t see into the black box to understand what’s happening. This is where the risk plot line starts to go vertical. You’ve rebundled a bunch of previously bundled services using proprietary glue you can’t control, inside a giant data center that’s heavily used and targeted by bad actors globally. I’ll explore the bad actor threat in a moment.
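To put rough numbers on that risk curve, here is a minimal Python sketch. It assumes every managed service fails independently and that the application needs all of them up at once. The 99.9% figure and the service counts are hypothetical illustrations, not AWS's published numbers.

```python
from math import prod

HOURS_PER_YEAR = 8760


def serial_availability(availabilities):
    """Availability of an app that needs every listed service up at once.

    Assumes failures are independent, which real outages often are not.
    """
    return prod(availabilities)


def expected_downtime_hours(availability, hours=HOURS_PER_YEAR):
    """Expected hours of downtime per year for a given availability."""
    return (1 - availability) * hours


if __name__ == "__main__":
    # Hypothetical: each managed service is up 99.9% of the time.
    for count in (1, 5, 10, 20):
        combined = serial_availability([0.999] * count)
        print(count, round(combined, 5), round(expected_downtime_hours(combined), 1))
```

Under these assumptions, one service costs you under nine hours a year, while ten chained services cost roughly 87. Correlated failures, like a shared network layer going down, make the real picture worse than this independent model suggests.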

That heavy use means that systems designed to have a resiliency of [x] might hit a [y] or [y(2)] or whatever metric is used and simply fall over. If this sounds like buffer overflow or packet loss or NIC saturation… you’re right. And if a third-party dependency fails due to an otherwise isolated service failure in a single region (the happiest path), your stuff is still going to fail even if you aren’t deployed in that region. Because they are.
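That inherited regional exposure can be sketched as a walk over a dependency graph. This is an illustration only: the service names, the vendors and the region assignments below are all made up, and a real inventory would come from your own architecture docs and your vendors' disclosures.

```python
def exposed_regions(service, depends_on, deployed_in):
    """Return every region the service relies on, directly or transitively."""
    seen = set()
    regions = set()
    stack = [service]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against cycles in the dependency graph
        seen.add(current)
        regions |= deployed_in.get(current, set())
        stack.extend(depends_on.get(current, []))
    return regions


# Hypothetical inventory: the app itself runs only in us-west-2.
depends_on = {
    "my-app": ["payments-vendor", "auth-vendor"],
    "payments-vendor": ["queue-service"],
}
deployed_in = {
    "my-app": {"us-west-2"},
    "payments-vendor": {"us-east-1"},
    "auth-vendor": {"us-west-2"},
    "queue-service": {"us-east-1"},
}

if __name__ == "__main__":
    print(sorted(exposed_regions("my-app", depends_on, deployed_in)))
```

In this made-up inventory the app never deploys to us-east-1, yet that region shows up in its exposure set because a vendor lives there.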

Unbundling has secondary and tertiary impact that isn’t well understood until it slaps you in the face like a Monty Python seabass on the pier.

Here’s the quote that you should share with your C-levels:

Abstraction doesn’t remove complexity. You just get lulled into a false sense of security and ignore it until a single incident causes a cascading systemic failure.

I have a lot of AWS friends and tried to keep my commentary impersonal. I saw reports stating that ‘…an unusual amount of incoming traffic overwhelmed some networking devices [..] causing packet loss’ which led to the outage. This suggests an incoming event that sure sounds like a DDoS attack.

The fact that the outage lasted an entire business day – at least 12 hours by my count – strongly suggests that it wasn’t just some bad device that went sideways. When that happens, you can simply remove the device and replace it. You’ve had an outage, sure… but one that lasted 15 minutes or less. I half-joked on Twitter that when a device fails, you use a hammer (to smash the bad gear), but if it’s a DDoS, you play whack-a-mole for several hours trying to stop the incoming flood of traffic.

It’s not difficult to see which of these patterns most closely resembles what we saw happen.

So what now?

Being caught in an attack like that is a tertiary risk that you take on when you use AWS. It’s a global brand that powers a huge amount of the economy. Even Amazon’s own package delivery drivers were impacted by the outage – credible news reports stated that drivers could not get routes or packages sorted for their trucks.

This outage has cost at least a billion dollars in economic loss to companies that depend on AWS. Maybe more… ten billion dollars? I don’t have enough data to assert that, but the impact of this outage two and a half weeks before Christmas – when hundreds of thousands of people are trying to order gifts online and the transactions fail – can’t be overstated. Some businesses won’t survive this outage.

After the second major AWS outage in six months (CloudFront and Certificate Manager in June), I’m done with whistling past the graveyard as a risk management approach. I’m looking at alternatives.
