I’ve been pondering the problem of making infrastructure transactional for 20 years.
The one paper I wrote (https://www.usenix.org/legacy/event/lisa07/tech/full_papers/holl/holl.pdf) was an early attempt at getting desired-state systems to work.
You can read the paper, but the critical idea (and it’s an ancient one) was that you take all of the control plane code and put it in the central system.
The problem with that approach (and the reason the product failed) was availability.
The thing we built had the nice property of simplicity of management. It had the unfortunate property of being less available than what it tried to replace. What do I mean? Our solution required a single centralized control plane. If that control plane failed, then snapshots, mirrors, and backups failed everywhere at once. Without our control plane, each NetApp Filer managed its own schedule and failed independently, so one failure never took out the whole fleet.
Storage administrators barfed all over it. They rejected the product and the architecture.
Then I went to Zynga, where I took another stab at the problem of managing systems at scale. We built some pretty slick management software that allowed Zynga to scale CityVille to 100 million MAU on what was basically the flakiest infrastructure I have ever used. The critical insight I had at Zynga was that since transactional systems at scale didn’t work with a centralized database, you needed to build something that relied on eventual consistency.
Then I came to VMware and decided to tackle the problem of deterministic infrastructure at scale again. That’s when I realized there wasn’t really a solution to my problem.
What was my problem?
I had several hundred distributed databases (one per cluster), and I wanted to manage particular semantics that didn’t quite fit into a cluster’s semantics. For example, networking spans clusters.
And I failed to come up with an answer.
What do I mean? The existing system required manual intervention to keep running. The new eventually consistent system also required manual intervention to keep running, because it wasn’t deterministic.
So what was the win? Unclear. But there was a win around per-cluster state, and so we decided to solve that. Working with Brian Oki, who did most of the heavy lifting, we devised a plan to make forward progress. We decided to push the cluster state into the cluster.
We began working on an internal project called Bauhaus, despite not having a definitive answer on how to approach networking. Bauhaus was about moving some of the cluster state into the cluster using a distributed KV store to simplify recovery and improve resiliency.
The critical insight I didn’t have was “AZ.”
An AZ is one of those concepts that practitioners of distributed systems have spectacularly failed to define, and it is the most fluid of them all.
Ask 50 practitioners and you get 50 answers.
And because of that, it’s too amorphous to build systems with.
But there is a crucial insight about an AZ:
An AZ is a control plane that, when it fails, renders the hardware it manages unusable, even if that hardware is powered on.
An AZ from the outside observer’s perspective is one thing.
But the critical activity in cloud engineering is “how do I build an AZ so that it appears to be one thing but is actually built from many things?”
The thing that’s not obvious to folks who haven’t spent much time puzzling over this problem is how the network is built in the cloud.
If you examine the cloud providers, the critical aspect of their systems is a highly redundant, high-bandwidth inter- and intra-datacenter network.
Every cloud has its own proprietary networking stack, which, when you interact with it (from the underlay, not the overlay), requires a significant amount of bridging magic. Those underlay networks do not have all of the semantics or properties of traditional IP networks.
It’s the existence of those networks that allows the cloud to provide transactional system behavior.
So let me be precise:
In the cloud, I can assert that if I can’t reach a node, the node is down.
If I can’t reach the AZ, it’s down.
And if a VM was created in AZ 1, it’s either running in AZ 1 or not running in AZ 1. It cannot exist outside of AZ 1.
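These two invariants are what make the cloud’s behavior transactional: the network acts as a reliable failure detector, so the control plane can act on unreachability without fear of a zombie copy. Here is a minimal sketch of that reasoning; all names (`Node`, `placement_decision`) are illustrative, not any cloud’s real API.

```python
# Hypothetical sketch: in a cloud engineered so that "unreachable" implies
# "down", the control plane can treat reachability as a reliable failure
# detector and act on it safely. None of these names are a real API.

from dataclasses import dataclass

@dataclass
class Node:
    name: str
    az: str
    reachable: bool  # stand-in for a real health probe

def placement_decision(node: Node, vm_az: str) -> str:
    # Invariant 1: if we can't reach the node, the node is down.
    if not node.reachable:
        # Safe only because "unreachable" implies the old copy is NOT running.
        return "restart VM on another node"
    # Invariant 2: a VM created in AZ 1 is either running in AZ 1 or not
    # running at all; it cannot exist outside AZ 1.
    assert node.az == vm_az, "a VM never exists outside its AZ"
    return "VM is running"

assert placement_decision(Node("n1", "az1", reachable=False), "az1") == "restart VM on another node"
assert placement_decision(Node("n2", "az1", reachable=True), "az1") == "VM is running"
```

On premises, without that engineered network, invariant 1 simply does not hold: an unreachable node may still be running its VMs, which is exactly the problem the rest of this post is about.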
Without those cloud networks, and without every part of the system being engineered around this principle, building an AZ-like construct on premises requires extensive investment in network and hardware design.
What the Nutanix folks figured out was how to work around this using a custom data path and soft transactions.
Rather than relying on network connectivity to determine whether a VM is running, they used the IO data path and a stateless OS.
The IO data path guarantees that any hypervisor that boots cannot access any state that the clustered control plane doesn’t want it to access.
The stateless OS allows the cluster control plane to program the OS to its new state trivially.
The existence of a clustered IO path and a stateless hypervisor allows the cluster to control what state is being modified and which workloads are running. In effect, the clustered I/O path and stateless hypervisor enable the cluster as a whole to operate as a single entity.
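One way to picture the IO-path guarantee is epoch-based fencing: the control plane bumps an epoch, and the storage layer rejects any writer holding a stale one. This is a hedged sketch of the general technique, not Nutanix’s actual implementation; `ClusteredStorage`, `fence`, and `write` are made-up names.

```python
# Illustrative sketch (NOT Nutanix's actual implementation): a clustered IO
# path can enforce "a booted hypervisor cannot touch state the control plane
# doesn't want it to touch" with an epoch, i.e. a fencing token.

class ClusteredStorage:
    def __init__(self):
        self.current_epoch = 0
        self.data = {}

    def fence(self) -> int:
        """Control plane bumps the epoch; every older token becomes invalid."""
        self.current_epoch += 1
        return self.current_epoch

    def write(self, epoch: int, key: str, value: str) -> bool:
        # The IO path rejects any writer holding a stale epoch.
        if epoch < self.current_epoch:
            return False  # fenced out: this hypervisor no longer owns the state
        self.data[key] = value
        return True

storage = ClusteredStorage()
old = storage.fence()   # hypervisor A acquires epoch 1
new = storage.fence()   # control plane fences A and hands epoch 2 to B
assert storage.write(old, "vm1", "running") is False  # A is locked out
assert storage.write(new, "vm1", "running") is True   # B proceeds
```

The point of the sketch is that liveness is decided at the data path, not by pinging over the network: a fenced hypervisor can be powered on and reachable, and it still cannot modify cluster state.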
As I mentioned earlier, soft transactions and a distributed database are what enable this scalability.
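A “soft transaction” in this spirit can be sketched as a compare-and-swap against a distributed KV store: no central lock, just an atomic conditional update that either lands or tells you the world moved on. This is my own illustrative reduction, not the product’s code; the plain `dict` stands in for the distributed store.

```python
# Hypothetical sketch of a soft transaction: each state change is a
# compare-and-swap against a distributed KV store (a dict stands in here).
# In a real system this would be one atomic operation, e.g. a conditional put.

def compare_and_swap(store: dict, key: str, expected, new) -> bool:
    if store.get(key) != expected:
        return False  # precondition failed: someone else won the race
    store[key] = new
    return True

store = {"vm1": "stopped"}

# Two controllers race to start vm1; only one CAS can succeed.
won = compare_and_swap(store, "vm1", "stopped", "starting")
lost = compare_and_swap(store, "vm1", "stopped", "starting")
assert won and not lost
assert store["vm1"] == "starting"
```

Because every participant retries from the state it actually observes, the system converges without a centralized transactional database, which is what makes this approach scale.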
In this incredibly long and complex journey, I was fortunate to work with some brilliant people, but a critical person was Dahlia Malkhi, who, when I hit a brick wall, made it possible for me to see the path around it. I call her out because she was a researcher, and we may have interacted on a technical topic 2 or 3 times, and each time was seminal.