
the architecturalist 63: nutanix was the correct answer

September 13, 2025 by kostadis roussos


In 2012, while at Zynga, I had a moment of clarity: the way we had thought about infrastructure up to that point was wrong, and our focus on making a single node more and more available was a dead end.

I wrote about this on Quora, and it was picked up by Forbes, which gave me 1 minute of fame.

And I wrote this:


NetApp’s engineering spent a lot of time worrying about hardware availability and making hardware appear to be much more resilient than it actually was.

And yet, these guys like Facebook, Twitter, and Google didn’t think that was important.

Which was mind-boggling. How else can you write software if the infrastructure isn’t perfect? “What were you people doing?” I thought.

So what drove me to find another job was that, somehow, people were building meaningful applications that didn’t need component-level availability. Something was changing…

Which brings me to what was changing.

What was changing, and this only became obvious after I joined Zynga, was that the old model was dead.

In a world where you have thousands of servers and depend on services that change all of the time, the notion that the application can be given the illusion of perfect availability is, well, foolish.

In fact, applications have to be architected to understand failures. Failures are now as important to software as thinking about CPU and Memory and Storage. Your application has to be aware of how things fail and respond to those failures intelligently.

I believe that the next generation of software systems will be built around how you reason about failure, just as the last generation was built around how you reason about CPU and Memory and Storage.

For the last 13 years, I have been wondering what the correct answer is. One school of thought believed that the correct answer was to treat everything as a database transaction. What if we made infrastructure transactional?

As a result, numerous attempts were made to develop management applications that updated the model of the world in a database and attempted to force the real world to conform to that model. I even invented one and published a paper that described such a system.

And they kind of worked.

The general idea was that you had an API that updated a database, and then a set of controllers that would go and modify the world to conform to the database. And if they ever detected an inconsistency between the world and the database, they would go and correct the system to conform to the database.
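That control loop can be sketched in a few lines. The names here (`desired`, `actual`, `reconcile`) are hypothetical illustrations of the pattern, not any product’s actual API:

```python
# Sketch of the database-driven controller pattern: the API only writes
# to the database; controllers converge the world to match it.
# All names are hypothetical.

desired = {}   # the database: what the API says the world should be
actual = {}    # the real world, observed and mutated by controllers

def api_power_on(vm):
    """The API updates only the database."""
    desired[vm] = "on"

def reconcile():
    """A controller detects inconsistencies and corrects the world."""
    for vm, state in desired.items():
        if actual.get(vm) != state:
            actual[vm] = state  # in reality: an RPC to the hypervisor

api_power_on("vm-1")
reconcile()
assert actual["vm-1"] == "on"

# But the world can diverge behind the controller's back...
actual["vm-1"] = "off"   # e.g. a user logs into the host directly
# ...and until the next reconcile pass, the database wouldn't know.
```

The loop "kind of works" exactly as described: it converges, but only as of its last pass over the world.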

And those systems failed to deliver on transactional infrastructure.

When you invoke an API, the database gets updated, and the world converges, but here’s the rub: the world can diverge. And you wouldn’t know.

Let me provide an example from vCenter, a product with which I am very familiar.

Let me be specific – you tell vCenter to power on a VM. vCenter updates its database, then communicates with ESXi, and the VM is powered on.

But is the VM powered on?

You don’t know, because a user can log into ESXi and power off the VM.

In effect, ESXi has its own database and API. And that API and database can be used to change the state of the system.

To make matters worse, if a network partition occurs, the VM may well be powered on, but vCenter cannot determine whether it is powered on or not.

Therefore, any piece of code written must account for three states: “Yes, No, and I don’t know.”

Now, if it’s only one client calling vCenter and doing one thing at a time, that’s manageable. However, if you are working with workflows that depend on the VM being powered on, for example, powering on the VM, moving it, and so on, then for every step, you must account for the possibilities of ‘yes’, ‘no’, and ‘maybe’. And handling all the different kinds of ‘maybe’ makes writing the control plane tricky.
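That three-way branching can be sketched as follows. The function names and states are hypothetical illustrations, not vCenter’s real API:

```python
# Sketch of the three-valued answers a vCenter-style control plane must
# handle. Names and structure are hypothetical.
from enum import Enum

class Answer(Enum):
    YES = "yes"
    NO = "no"
    UNKNOWN = "unknown"   # e.g. a network partition: we can't tell

def is_powered_on(vm_state, reachable):
    """Query the host; an unreachable host yields the third answer."""
    if not reachable:
        return Answer.UNKNOWN
    return Answer.YES if vm_state == "on" else Answer.NO

def next_step(answer):
    """Every workflow step must branch three ways, not two."""
    if answer is Answer.YES:
        return "proceed"
    if answer is Answer.NO:
        return "repair"
    return "probe"   # the hard case: figure out what 'maybe' means here

assert next_step(is_powered_on("on", reachable=True)) == "proceed"
assert next_step(is_powered_on("off", reachable=True)) == "repair"
assert next_step(is_powered_on("on", reachable=False)) == "probe"
```

Every task in the workflow carries that third branch, and the third branch is different at every step.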

And when I was at Zynga, I would like to believe I had identified this problem, but I had no idea how to solve it.

For years, I thought the only path forward was desired state. In short, you express an intent, and the system converges to that intent. But the problem with that model is that expressing things as a sequence of operations is more convenient than simply describing intent. And if you need to express two contingent intents, how do you do that? You could, but pretty soon you have one massive intent that describes the entire universe.
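A minimal sketch of that desired-state model, with hypothetical intent and world structures, shows both the appeal and the awkwardness: the converger must rediscover the operation sequence on every pass, and without a bound it can retry forever.

```python
# Sketch of a desired-state ("intent") loop. Names are hypothetical.
# Two contingent intents: the VM must be on before a network can attach.

intent = {"power": "on", "network": "attached"}  # one blob for the world
world = {"power": "off", "network": "detached"}

def converge(max_rounds=10):
    """Retry until the world matches the intent, or give up."""
    for _ in range(max_rounds):
        if world == intent:
            return True
        # The ordering is implicit: the converger must rediscover on every
        # round that power comes before network.
        if world["power"] != intent["power"]:
            world["power"] = intent["power"]
        elif world["network"] != intent["network"]:
            world["network"] = intent["network"]
    return False  # without a bound, a diverging world means an infinite loop

assert converge() is True
```

Note that everything the system should do lives in one `intent` blob; add more contingent steps and it grows toward describing the entire universe.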

And so the approach, although promising, never materialized.

And then I ended up at Nutanix. Nutanix has a distributed database at its core, which is part of the puzzle. But, as I mentioned earlier, only a part.

There were four more pieces.

The second was the ability to have a parent database with multiple child databases, and that the parent database would always receive updates in the correct write order.

The third was soft transactions. This is critical because the system must perform reliably and be able to tolerate failures.

But the piece of the puzzle that eluded me was the need for two magical pieces of technology: the first was Stargate, a clustered IO path, and the second was AHV, a stateless operating system.

What Stargate guarantees is that the cluster knows which disk is being connected to, and it provides a point of control for the disk. It is not possible to change the state without Stargate knowing. And so, for a cluster, Stargate can prevent anyone from accessing disks and assert who is accessing them.

The second is AHV, which, when it reboots, doesn’t remember what it was doing before it rebooted. Therefore, AHV cannot run any workload without the cluster knowing what the workload is.

When you combine all five pieces of technology, you have the answer to the question I posed.

The infrastructure, by design of the datapath and system components, has only two answers to any operation: “Yes, I completed” and “No, I didn’t.” Either is definitive. There exists no other possible answer to the question.

Once you have such a system, it becomes possible to implement two services that control the OS and the datapath that can assume the behavior of the infrastructure is binary.

And once you do that, you can build a system of APIs that always return yes or no to any question.

This then allows you to combine APIs into workflows that can be trivially designed. What do I mean?

Suppose I have a workflow that must call 5 APIs. We model this as a single workflow comprising five tasks.

In transactional infrastructure, after each API returns a response, I know what the environment must be. And therefore, if it says “Yes”, I can advance to the next step knowing that it is “Yes.” In other words, if Task 1 is completed successfully, I can easily advance to Task 2.
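With that binary guarantee, chaining the tasks really is trivial. A sketch, with hypothetical task names standing in for the five APIs:

```python
# Sketch of a workflow over transactional infrastructure: every task
# returns a definitive yes (True) or no (False). Names are hypothetical.

def power_on_vm():
    return True   # "Yes, I completed" -- definitive

def attach_network():
    return True

def run_workflow(tasks):
    """Advance only on a definitive 'yes'; stop on a definitive 'no'."""
    for i, task in enumerate(tasks, start=1):
        if not task():
            return f"failed at task {i}"  # and the world's state is known
    return "success"

assert run_workflow([power_on_vm, attach_network]) == "success"
```

There is no third branch to write, and no monitor watching for the world to change behind the workflow’s back.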

So let’s consider the alternative. Task 1 is to power on a VM. Task 2 is to attach a network to the VM. If Task 1 declares success, Task 2 might fail because someone behind the scenes shut down the VM. Now, Task 2 must handle an error. But what does this mean for the workflow? Did the workflow fail? Well, it didn’t. What happened was that the environment changed in a way that the workflow was unaware of.

So let’s look at the workflow state –
Task 1 – power on VM – success
Task 2 – Attach Network – Failure because the VM is not powered on.

This is a contradiction: how could Task 1 succeed and Task 2 fail? It arises because the workflow didn’t account for another system changing the state of the VM behind the scenes. And because the change occurred outside of the system, the program interacting with the APIs cannot determine why it has a contradiction.

To understand what happened, you need to build yet another system that monitors both the workflow and the system that can be changed outside the workflow’s control.

Intent-based systems attempted to work around this by retrying, but, as I mentioned, they had their own issues, the most significant being an infinite retry loop.

Ultimately, the only solution was to make it impossible for the system to be changed in any way that was not under the control of the control plane.

And that’s what the folks at Nutanix did.






Filed Under: Architecturalist Papers, nutanixist

