In a recent series of discussions about availability at work, I realized that availability comes in different flavors.
But first, a history lesson.
Before virtualization became a thing, the data center physically existed. The destruction of an entire physical data center, although possible, was extremely rare; we filed it under “meteor strike or nuclear war.”
For practical purposes, availability was about dealing with physical component failures. For example, if a server’s disk died, could the application on that server keep running?
As hardware components became increasingly resilient and the cost of hardware plummeted, the weak link became the software.
Because rearchitecting every application to be intrinsically highly available was impractical, new software frameworks came into existence.
One of the most impressive was vSphere HA, because its promise amounted to the following:
- If you have highly available storage (and enterprise storage was highly available),
- If you have shared storage (and enterprise storage was shared),
- And if you have spare compute capacity connected to that storage, then all of your applications will restart after 5 minutes.
vSphere HA and more boutique solutions like it more or less solved the availability problem.
And the power of vSphere HA to systematically solve the HA problem across an entire application fleet greatly expanded VMware’s value proposition.
Except, a new availability problem emerged.
In the past, a single piece of software could not destroy one data center, let alone all data centers. The blast radius of an operator error was quite limited.
But with virtualized and cloud infrastructure, a single button click can destroy an environment in minutes.
For example, at Zynga, an operator error invited by a lousy user-experience design deleted the production CityVille deployment. CityVille had over 10 million Daily Active Users (DAU) and 30 million Monthly Active Users (MAU).
Why was this possible?
Because in the past, you couldn’t delete a server, the applications on it, its network connections, and its data with a single button click in minutes. Doing so would have required logging into hundreds of systems, because those systems relied on multiple infrastructure administrators coordinating to provision or delete anything.
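To make that concrete, here is a minimal sketch in Python against the real AWS boto3 SDK (the `env` tag is a hypothetical naming convention, not anything from the incident above) of how little code it takes to terminate every server in an environment:

```python
# A sketch of the modern blast radius: one script, one API, and every
# server tagged as part of an environment is gone in minutes.
# Assumes boto3 is configured with credentials; the "env" tag is a
# hypothetical convention for illustration only.
import boto3

ec2 = boto3.client("ec2")

# Find every instance tagged env=production.
reservations = ec2.describe_instances(
    Filters=[{"Name": "tag:env", "Values": ["production"]}]
)["Reservations"]

instance_ids = [
    inst["InstanceId"]
    for r in reservations
    for inst in r["Instances"]
]

# In the pre-virtualization world, this step would have meant logging
# into hundreds of systems. Here it is a single API call.
if instance_ids:
    ec2.terminate_instances(InstanceIds=instance_ids)
```

The point is not that any one cloud is unsafe; it is that the same centralization that makes provisioning fast makes destruction exactly as fast.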
Let’s use an analogy from social media. Before Facebook, I couldn’t share news with millions of people in a few button clicks; I had to talk to a lot of people, so the blast radius of my news was quite limited. Facebook made it easy to share good news, like the birth of a child, very quickly, and sad news, like the death of a friend, just as quickly. It also made it possible to spread fake news and lies just as fast. The same technology does both, and the societal damage of that rapid dissemination of phony information is well documented.
Virtualization, and later the cloud, moved the entire data center an application depends on into a single database. That single database made it possible to grow, shrink, and adjust infrastructure to meet changing application demand. It also made that database the single most significant vulnerability in your environment.
And so software engineers rightfully focused on making that database available. But the problem is that no amount of software availability can prevent operator error, and worse, mistakes that destroy the database in its entirety.
As more and more effort was put into making the central system more robust and available, the overall design became more and more fragile.
But what kinds of errors? Remember, the physical data center rarely goes away in its entirety, but individual servers? Those fail all the time. Worse, as servers are made less reliable in order to be cheaper and more disposable (a trade-off enabled by technologies like vSphere HA and cloud-native 12-factor application styles), their failure rate increases.
And this cheaper, more disposable hardware is possible only because of the very same centralized databases that, when they fail, destroy everything. Antifragility makes the point clear: the more you make everything depend on a single system, the more fragile the entire system becomes.
So we have this peculiar phenomenon: the entire virtual infrastructure of the data center resides in a single piece of software, and if that software fails, everything is gone. And the things that can cause it to fail are not under the software’s control: human error, data corruption in unknowable parts of the stack, ransomware, and so on.
Without any inside knowledge, the Facebook DNS debacle looks like a great example. The software allowed the network engineering team to make widespread infrastructure changes across Facebook in minutes. And human error, combined with technology errors (just another word for human errors), resulted in the entire Facebook infrastructure becoming unavailable.
IT operators understand this point, as do DevOps and SRE teams, and it is why they refrain from letting a single system control their entire infrastructure. The scope of a single error is why those groups prefer many smaller k8s clusters over one large one, and why backup admins buy backup software from a different vendor than their primary storage vendor.
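That instinct can be turned into a simple operational pattern. Here is a minimal sketch using the official kubernetes Python client (the cluster context names and the `web` deployment are hypothetical) of rolling a change across independent clusters one at a time, so a bad change stops at one cluster instead of taking out all of them:

```python
# A sketch of limiting blast radius across many small clusters:
# apply a change cluster by cluster, never through one system that
# controls everything. Contexts and the deployment are hypothetical.
from kubernetes import client, config

CLUSTERS = ["prod-us-east", "prod-us-west", "prod-eu"]  # hypothetical contexts

def bump_image(context: str, image: str) -> None:
    # Each context is an independent cluster with its own state database,
    # so a mistake here cannot touch the other clusters.
    config.load_kube_config(context=context)
    apps = client.AppsV1Api()
    patch = {"spec": {"template": {"spec": {"containers": [
        {"name": "web", "image": image}]}}}}
    apps.patch_namespaced_deployment(name="web", namespace="default", body=patch)

for ctx in CLUSTERS:
    bump_image(ctx, "registry.example.com/web:v2")
    # In practice you would verify health here before moving on, so an
    # error stops after one cluster rather than spreading to the fleet.
    input(f"{ctx} updated; press Enter to continue to the next cluster...")
```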
Centralizing things allows for unprecedented agility, and unprecedented failure.
What to do?
My take is threefold.
The first is that the era of centralized databases that control the entire infrastructure is closing. Instead, we will have many databases. In the k8s space, the idea of a few large k8s clusters is an anti-pattern. And this is why I am such a huge fan of products like Tanzu Mission Control. They line up with the future.
The second is that those centralized databases will have to be built using techniques from blockchain systems that protect against Byzantine faults and, in particular, human errors.
The third is that the ability to recover from those database errors in a timely fashion will become a critical differentiator. A once-in-a-decade event is terrible if it happens on an ordinary Friday, and a company-ending event if it happens on Black Friday. Software stacks that make that recovery quick and easy will win.
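One way to make quick recovery more than a slogan is to drill it and measure it. A minimal sketch, assuming a hypothetical restore_from_backup() and an illustrative 30-minute recovery-time budget:

```python
# A sketch of treating recovery time as a first-class, measured number.
# restore_from_backup() is a hypothetical stand-in for whatever your
# backup stack provides; the 30-minute budget is an illustrative RTO.
import time

RTO_BUDGET_SECONDS = 30 * 60  # hypothetical recovery-time objective

def restore_from_backup() -> None:
    ...  # hypothetical: rebuild the infrastructure database from backup

start = time.monotonic()
restore_from_backup()
elapsed = time.monotonic() - start

# Run this drill regularly; a once-in-a-decade failure should not be
# the first time anyone learns how long recovery actually takes.
assert elapsed < RTO_BUDGET_SECONDS, f"recovery took {elapsed:.0f}s"
```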